-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Add support for delta reuse #7040
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Extract the code preparing a suffixed pack file path from pack_index_check_locked() so that it can be reused for arbitrary suffixes. Simplify pack_index_open_locked() accordingly.
Pack file index enables quickly mapping OIDs to entry offsets, but not the other way round. Furthermore, pack file entries do not store the size of the compressed object representation. Performing both of these operations quickly is necessary for robust extraction of compressed pack file data at arbitrary offsets. This is where the pack file reverse index comes in handy. It is an array of index entry numbers ordered by pack file offset, which enables quickly mapping entry offsets to OIDs (using binary search) and determining compressed object representation size (by checking the offset of the next entry in the pack file). Implement loading reverse index files ("pack-*.rev") in a similar way as for index files ("pack-*.idx"). Perform some basic sanity checks on reverse index contents.
Storing pack file reverse indexes in "pack-*.rev" files is a relatively new feature (introduced in Git 2.31). Since the information stored in the reverse index is necessary for robust extraction of compressed pack file data at arbitrary offsets, compute reverse index contents in memory if the reverse index file does not exist.
Add helper functions enabling convenient use of reverse index data: - pack_offset_to_ridx_pos() for mapping pack file entry offset to the number of its corresponding reverse index entry, - pack_ridx_pos_to_id() for mapping a reverse index entry number to the OID of its corresponding pack file entry, - pack_ridx_pos_to_offset() for mapping a reverse index entry number to the offset of its corresponding pack file entry, - pack_ridx_pos_to_pack_pos() for extracting reverse index data from either the mapped reverse index file or the data structure prepared in memory.
Extend the ODB backend interface with a function returning compressed delta data for the provided OID.
Enable the pack backend to return compressed delta data via the new get_delta() ODB interface.
When considering objects for deltification, first check whether the ODB can readily supply compressed delta data. This drastically speeds up the deltification stage during pack file preparation for large pushes.
Add a new "pack.noReuseDelta" configuration option to support disabling delta reuse during pack file preparation.
Thanks for doing this work; I'm really excited about it, but it's a big-ish change and I haven't had time to look at it yet. Thanks for your patience and I'll hope to get eyes on it very soon. 🙏 |
Absolutely, take your time. I am swamped by other work anyway, so the turnaround time on my side won't be short. However, the core concept here is pretty simple to grasp, so "switching context" back to this code won't be hard for me. I spent some time looking into why the original Git implementation is so much faster than libgit2 (about 2-3x) in the object enumeration stage. There are several contributing factors, but the most important ones are:
I'll see what I can do about speeding up libgit2's enumeration logic with the extra obstacle in the form of the ODB backend abstraction layer in place. I think this is worth tackling and I am interested in working on it, but implementing delta reuse is low-hanging fruit compared to revising the object enumeration code. One thing that crossed my mind is that the approach proposed in this PR, i.e. ODB backends providing compressed object data, might be useful to generalize for all packed objects, not just deltas. Original Git certainly does not make that distinction. Perhaps it is prudent not to merge this PR immediately after review and instead take the enumeration logic into consideration when designing the API extensions here, so that the ODB interface doesn't have to be modified twice. Just a thought. |
This PR enables reusing existing deltas contained in pack files while pushing.
The branch is laid out so that it builds the necessary mechanisms from the ground up, but it may be easier to review this work the other way round (i.e. starting from the tip of the branch), to start from a higher level and then strip layers off.
The basic idea is simple: during a push, after enumerating objects that should end up in the pack file, for each object to be pushed check whether the ODB is able to supply it in the form of a delta; if so, and if the base of that delta is also going to be included in the pack file, get the compressed delta data from the ODB, store it in memory, and remove the current object from the list of objects considered for deltification. This relieves the pack builder from having to deltify a lot of the objects it needs to push.
As mentioned in the penultimate commit, this drastically speeds up the deltification stage during pack file preparation for large pushes (from pack-file-backed repositories). I am willing to prepare some specific benchmarks for this, but I thought it was wiser to get feedback on the design and the code itself before starting that particular side quest. As far as ballpark numbers go, though, for the repositories I tested this code on, the speed-up is in the range of about 5-20x (which is not really shocking because the whole idea is to avoid "re-deltifying" objects which are already available as deltas), depending on repository contents.
This PR does not affect the object enumeration stage, which can also take a very long time for large repositories.
Some comments on the decisions made and the trade-offs encountered:
Retrieving the delta from the ODB backend could be split into finding the base OID first and only retrieving the delta data when the pack file is written. Doing it all in one go was just simpler and shorter (given that this logic needs to cross the ODB abstraction layer, which means quite a bit of glue code is needed for every new ODB backend callback), at the expense of slightly increased memory use (all compressed delta data for reused deltas is stored in memory, i.e. attached to
git_pobject
structures), which is already pretty massive for large pushes. If necessary, this problem can be eliminated by only retrieving compressed delta data when it is necessary to send it to the remote side - at the expense of adding one more callback to the ODB backend interface and reducing the locality of this logic: one more perk of the simple approach proposed right now is that it enables reused delta data to be treated identically as the deltas generated byfind_deltas()
when sending them to the remote side, i.e. no modifications towrite_object()
are necessary.The last commit in this PR adds a safety mechanism to disable the new logic in case of trouble. It does not have a counterpart in the original Git implementation (well, apart from the
--no-reuse-delta
option forgit-pack-objects
). It can be dropped if you're feeling adventurous ;) I made it a configuration option because the pack builder does not seem to have any API-level options (or an option structure) that would configure its behavior and AFAICT, libgit2 does not have an API for repacking existing packs, which is the obvious case I can think of that would make such a knob practically useful in the long run.The code in this PR that computes reverse index contents in memory is nowhere near as sophisticated as the corresponding code in the original Git implementation (which performs a radix sort), so it might use quite a bit of memory for repositories with a huge number of objects. This can be nullified by providing a reverse index file on disk, which the original Git implementation supports since version 2.31 (the code in this PR is capable of using that file).
While
docs/conventions.md
says to increment the default value of theversion
member whenever a transparent structure changes (likestruct git_odb_backend
does in this PR), some previous commits from the past seem to ignore that plea (e.g. ea28590, 459ac85, 97f9a5f), so I joyfully followed their example :-)