swh-loader-svn
==============

The goal is to load the full history (logs) of an svn repository into
swh-storage.

The loader must be able to deal with:
- an unknown svn repository (resulting in a new origin)
- a known svn repository (resuming from the last known svn revision
  and updating from that point on)

For a detailed speed comparison between the versions, please refer
to https://forge.softwareheritage.org/diffusion/DLDSVN/browse/master/docs/comparison-git-svn-swh-svn.org.


# v1

## Description

This is a first basic implementation, a proof of concept of sorts.
It checks out the svn repository on disk at each revision and
walks the tree at that revision to compute the swh hashes and store
them in swh-storage.

Conclusion: It is possible but it is slow.

We used git-svn to check whether the hash computations matched, and
they did not. The swh hash computations are correct, though.

git-svn simply makes different assumptions, so the hashes do not match.

git-svn:
- does not check out empty folders
- adds metadata at the end of the svn commit message (this is the
  default; it can be disabled, but then no update, in the swh sense,
  is possible afterwards)
- includes the svn repository's uuid in the commit author of the git
  revision (author@<repo-uuid>)

swh-loader-svn:
- checks out empty folders (which are then included in the swh hashes)
- adds metadata the git way (leveraging git's extra-header slot), so
  that we can deal with svn repository updates
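
To illustrate why metadata stored in the extra-header slot changes the revision hash, here is a minimal sketch of how a git commit id is derived. It is not the loader's code: the function name `git_commit_id`, the header names `svn_repo_uuid` and `svn_revision`, and the all-zero uuid are assumptions chosen for the example.

```python
import hashlib


def git_commit_id(tree, author, committer, message, extra_headers=()):
    """Return the git object id (hex SHA-1) of a commit with these fields."""
    lines = [f"tree {tree}", f"author {author}", f"committer {committer}"]
    # Extra headers sit between the committer line and the blank line
    # that precedes the message, so they are part of the hashed bytes.
    lines += [f"{key} {value}" for key, value in extra_headers]
    body = ("\n".join(lines) + "\n\n" + message).encode("utf-8")
    header = b"commit %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest()


# Hypothetical values, for illustration only.
uuid = "00000000-0000-0000-0000-000000000000"
ident = f"jdoe <jdoe@{uuid}> 1136073600 +0000"
tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # git's empty tree

base = git_commit_id(tree, ident, ident, "initial import\n")
with_headers = git_commit_id(
    tree, ident, ident, "initial import\n",
    extra_headers=[("svn_repo_uuid", uuid), ("svn_revision", "1")],
)
```

The two ids differ even though tree, author and message are identical, which is why hashes only match git-svn's when both sides record metadata the same way.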

## Pseudo

```
Checkout/Update/Export on disk the first known revision or 1 if unknown repository
When revision is not 1
  Check whether the history has been altered (revision hashes won't match)
  If it has been altered, log an error message and stop
  Otherwise continue

Iterate over logs from revision 1 to revision head_revision
  The revision is now rev
  checkout/update/export the revision at rev
  walk the tree directory for that revision and compute hashes
  compute the revision hash
  send the blobs for storage in swh
  send the directories for storage in swh
  send the revision for storage in swh
done

Send the occurrence pointing to the last revision seen
```

## Notes

SVN checkout/update instructions are faster than export since they
leverage svn diffs.  But:
- they do keyword expansion (bad for diffs with external tools, so
  bad for swh)
- the .svn folder is present and must be ignored (this required
  adapting the code to ignore folders based on a pattern, which is
  slow as well)

The SVN export instruction is slower than the two previous ones since
it does not use diffs.  But:
- there is an option to disable keyword expansion (good)
- no folder needs to be omitted during hash computation from disk (good)

All in all, there is a trade-off to make here.

Still, everything was tested (with much code adapted in the
lower-level api) and both are slow.
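
As an illustration of the .svn constraint that checkout/update imposes, here is a minimal sketch of a tree walk that skips ignored folders while hashing files. It is not the loader's implementation: `hash_files` is a hypothetical helper, it matches folder names exactly rather than by pattern, and it uses a plain SHA-1 of the file content instead of the swh hashes.

```python
import hashlib
import os


def hash_files(root, ignored_dirs=(".svn",)):
    """Map each file path (relative to root) to the SHA-1 of its content."""
    hashes = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into ignored folders.
        dirnames[:] = [d for d in dirnames if d not in ignored_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            hashes[os.path.relpath(path, root)] = digest
    return hashes
```

With an export, the same walk can run with `ignored_dirs=()` since no .svn folder exists on disk.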

# v2

## Description

The v2 is mostly about:
- adding options to match git-svn's hash computations
- trying to improve performance

To that end, the following options were added:
- remove empty folders when encountered (so that they are ignored
  during hash computations)
- add an extra commit line to the svn commit message
- (de)activate the svn loader's update routine
- (de)activate the sending of
  contents/directories/revisions/occurrences/releases to swh-storage
- (de)activate the extra-header metadata in revision hash (thus
  deactivating the svn update options altogether)

As this is meant to be a genuine implementation, we also adapted the
revision to use the repository's uuid in the author's email.

Optimizations were made as well:
- instead of walking the disk from the top level at each revision (slow
  for huge repositories like svn.apache.org), compute the lowest common
  path from the svn log's changed paths between the previous revision
  and the current one. Then walk only that path to compute the updated
  hashes, and update the in-memory hashes from that path up to the top
  level (less i/o and less RAM used).

- in the loader-core, the existing swh-storage api is used to filter
  out known entities on the client side (filters already exist on the
  server side, but filtering client-side uses less RAM). This matters
  especially for blobs: we read their data from disk and keep it in
  RAM, which is now done only for unknown blobs, and still before the
  disk is updated with the next revision's content.

- in the loader-core, caches were added as well
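
The lowest-common-path computation from the first optimization can be sketched as follows. This is an illustration only: `lowest_common_path` is a hypothetical name, and the sketch assumes the changed paths are absolute within the repository (for example `/trunk/src/a.c`), as svn logs report them.

```python
import posixpath


def lowest_common_path(changed_paths):
    """Return the deepest directory that contains every changed path."""
    if not changed_paths:
        return "/"
    # Use each path's parent so that a changed file maps to its folder.
    # For a changed folder this yields its parent, which is still a
    # safe (if slightly broader) place to start the walk from.
    parents = [posixpath.dirname(path) for path in changed_paths]
    return posixpath.commonpath(parents)
```

For a revision touching `/trunk/src/a.c` and `/trunk/src/lib/b.c`, only `/trunk/src` needs to be walked again; hashes above it are then recomputed in memory.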

Now the computations, with the right options, match git-svn's.
Still, performance compared to git-svn is bad.

Taking a closer look at git-svn, it uses a remote-access approach,
that is, it talks directly to the svn server and computes the hashes
at the same time.
That is the basis for the v3 implementation.

## Pseudo

Relative to v1, the logic does not change, only the inner
implementation.

# v3

## Description

This one is about performance only.

It leverages another low-level library (subvertpy) to allow the same
remote-access approach as git-svn.

The idea is to replay the logs and diffs on disk and compute the hashes
shortly after each diff is applied (not as close as it could be,
though; see the Note section below).

## Pseudo

```
Do we know the repository (with swh-svn-update option on)?
  Yes, extract the last swh known revision from swh-storage
    set start-rev to last-swh-known-revision
    Export on disk the svn at start-rev
    Compute revision hashes (from top level tree's hashes + commit log for that revision)
    Does the revision hash match the one in swh-storage? (<=> Is the history altered?)
      No
        log an error message and stop
      Yes
        keep the current in-memory hashes (for the following updates steps if any)
  No
    set start-rev to 1

Set head-revision to latest svn repository's head revision
When start-rev is the same as head revision, we are done.
Otherwise continue

Iterate over the stream of svn-logs from start-rev to head-rev
  The current revision is rev
  replay the diffs from previous rev (rev - 1) to rev and compute hashes along
  compute the revision hash
  send the blobs for storage in swh
  send the directories for storage in swh
  send the revision for storage in swh
done

Send the occurrence pointing to the last revision seen
```
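
The first part of the pseudo-code above (choosing where to start and detecting an altered history) can be sketched as below. This is only an illustration under stated assumptions: `pick_start_revision`, the `(rev, stored_hash)` tuple and the `recompute_hash` callable are hypothetical, and raising an exception stands in for "log an error message and stop".

```python
def pick_start_revision(last_known, recompute_hash):
    """Return the revision to start from, or raise if history was altered.

    last_known: None for an unknown repository, else (rev, stored_hash)
        as previously recorded in storage.
    recompute_hash: callable taking a revision number and returning the
        revision hash rebuilt from a fresh export of that revision.
    """
    if last_known is None:
        return 1  # unknown repository: start from the first revision
    rev, stored_hash = last_known
    if recompute_hash(rev) != stored_hash:
        # The hashes no longer match, so the svn history was rewritten.
        raise ValueError(f"history altered at revision {rev}")
    return rev
```

Passing the hash computation in as a callable keeps the decision logic testable without an svn server.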

## Note

There is room for improvement in the current implementation.

We apply the diff to a file first and then reopen the file to compute
its hashes.

If we applied the diff and computed the hashes in one pass, we would
save one round-trip. Depending on the files-to-directories ratio, this
could be significant.
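
A minimal sketch of that one-pass idea, assuming the patched content is available as a stream of byte chunks (how subvertpy would expose them is not covered here). `write_and_hash` is a hypothetical helper and a plain SHA-1 stands in for the swh hashes.

```python
import hashlib


def write_and_hash(chunks, path):
    """Write chunks to path and return the SHA-1 of the bytes written."""
    digest = hashlib.sha1()
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            # Hash the bytes while they are still in memory, so the
            # file never has to be reopened and read back.
            digest.update(chunk)
    return digest.hexdigest()
```

The file is written once and never read again, which is where the saved round-trip would come from.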

The remote-access approach also has the following benefits:
- no keyword expansion
- no need to ignore .svn folder (since it does not exist)
