8000 SHA256 proof of concept by ethomson · Pull Request #6191 · libgit2/libgit2 · GitHub
[go: up one dir, main page]

Skip to content

SHA256 proof of concept #6191

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Jul 14, 2022
Merged

SHA256 proof of concept #6191

merged 21 commits into from
Jul 14, 2022

Conversation

ethomson
Copy link
Member

This is a proof-of-concept of adding SHA256 support to libgit2. This follows the SHA256 support in git itself, which is documented at https://github.com/git/git/blob/master/Documentation/technical/hash-function-transition.txt. Note that this is an aspirational document that does not describe the actual status quo. In git, only extensions.objectFormat is supported. extensions.compatObjectFormat is not implemented, although described in that document.

We will support extensions.objectFormat to be git-compatible. We will likely not implement extensions.compatObjectFormat unless/until git chooses to do this. It's certainly not impossible that this is never implemented, or not implemented as described in the document, and we should not assume that it ever will be.

This PR is is not complete, it only adds SHA256 to the loose object database backend. It's enough to illustrate the proposed approach to adding SHA256, and illustrate it working. This should be enough to show a proof of concept for review and to get opinions; it is believed that we can extend this PR to a complete support for extensions.objectFormat.

The implementation

Obviously we can't add a global variable that sets the object type like git does - we'll need to plumb this through our architectural depths and can't build on a load-bearing global variable like that. Instead, this adds a new type member to the object id structure and extends the object id's id member to be 256 bits to support SHA256.

This is a significantly breaking change. If we go this route, we'll do it as libgit2 v2.0, since we follow semantic versioning.

Details:

  1. We stop using git_oids for things that aren't object IDs. For example, things like the index store a checksum that is (not coincidentally) the same hash type as the oid type in the repository. Despite that, it's not an object ID. And things like git_indexer_hash assume that packfile names are the checksums which isn't necessarily a guarantee forever. Instead, provide a function to get the (string) filename.

  2. We're adding a type member to git_oid. At the moment this allows for both GIT_OID_SHA1 and GIT_OID_SHA256. There may be additional object ID types in the future.

  3. We use the HTTPS backend for SHA256 support where appropriate.

Questions:

  1. Is putting the type and hash into the same object sane? I think so, but it's open to discussion.

  2. Should we use an enum for git_oid_t? (Or, more accurately, should the type member of git_oid be a git_oid_t? That's an int, which is a gigantic waste of memory. We only need a single byte to deal with the oid type and should probably only use a single byte, but actually using an enum does provide some safety for consumers. I've left that as an enum for simplicity in this PR but it seems like the wrong thing to ship.

  3. We need to do something to deal with oids as a bag of bytes (for examples, in an index entry). This PR simply deals with them as an unsigned char array, which is disappointing because you can't index into it as oid[i] (where i is the number of object ids), you need to index into it as oid[i * GIT_OID_SHA1_SIZE].

    This isn't really a terrible burden, but it does feel like a bit of a detriment. We could introduce a git_oid_sha1_raw type to aid with that (which is basically our existing git_oid) ... but then we'll also need a git_oid_sha256_raw, and we'll need an index_entry_sha1 and index_entry_sha256... ultimately this feels like an explosion of complexity, not an improvement.

  4. If we're making git_oid a more complicated structure - should we add still more? We could put another member in there that reflects its size, so that we could support abbreviated object ids in the same structure. We could then remove the _prefix functions and allow the lookup function to simply take an abbreviated oid.

    I actually think that this is rather confusing, as most of our functions won't take abbreviated oids, only full oids. But I think it's worth asking.

Alternatives

(In order of my preference, from my least to most hated).

  1. Introduce a new type (eg, git_object_id) that is capable of coping with both SHA1 and SHA256, leaving git_oid as-is and marking it as deprecated. Since git_oids are not an opaque type, this means that we would need a new API surface that took git_object_ids instead of git_oids and the old git_oid functions would be deprecated and call those. I'm not unwilling to do this but it's a dramatic amount of new APIs with names like git_tag_lookup_2 or - perhaps more inspired - git_tag_lookup_by_id. This feels to me like the least bad alternative, but I want to hear that this is a thing we need to do before we go do it.

  2. Don't add a type field to git_oid, but extend it to 256 bits. Require a type on all API calls that take an object ID -- eg:

    int git_tag_lookup(git_tag **out, git_repository *repo, const git_oid *id, git_oid_t id_type)
    

    This feels like a bad option: it breaks all the APIs and we have to be very thoughtful about always taking the type on all the functions. The type of ID should be a part of the object ID. The only nice thing about this is that we can keep using git_oid as a bag of bytes for things like pointing into commit graphs and packfiles.

  3. Introduce a second type (eg, git_oid_sha256). This presumably necessitates new APIs, whichs is awful, plus I really don't want git_oid_sha512, et al. Let's have a single OID type.

Thoughts?

@ethomson ethomson marked this pull request as draft January 28, 2022 03:28
@ethomson
Copy link
Member Author

/cc @carlosmn @boretrk @lhchavez

@boretrk
Copy link
Contributor
boretrk commented Jan 28, 2022

TBH I haven't read through the PR but I figured I give some preliminary thoughts.

  1. Is putting the type and hash into the same object sane? I think so, but it's open to discussion.

Agreed, at least if we want to abstract the hash type away from the library user.

  1. Should we use an enum for git_oid_t? (Or, more accurately, should the type member of git_oid be a git_oid_t? That's an int, which is a gigantic waste of memory.

Is there any particular place where you feel that there could be a large enough number of git_oid for those bytes to become significant?

  1. We need to do something to deal with oids as a bag of bytes (for examples, in an index entry). This PR simply deals with them as an unsigned char array, which is disappointing because you can't index into it as oid[i]

I guess this depends a lot on usage.
We could make it a void * and add a couple of defines or getter functions to access the array as desired type: ((git_oid_sha1_raw *)oid)[i]

Or add a type field together with the pointer

typedef struct {
	git_oid_t type;
	char **oid;
} git_raw_oid_array;

In that case we could have a git_oid_from_raw_array(git_oid *, git_raw_oid_array *, int i) that fills in an oid depending on the array type without the caller needing to know what type it is.
Or just a char *git_raw_oid_addr(git_raw_oid_array *, int i) to get the address of the raw oid in the array. (But for that case storing the element size or passing it in as an argument might be better.)

We could have a companion function to GIT_EXTERN(int) git_oid_fromraw(git_oid *out, const unsigned char *raw, git_oid_t type); that also takes the index to the raw array.

  1. If we're making git_oid a more complicated structure - should we add still more? We could put another member in there that reflects its size, so that we could support abbreviated object ids in the same structure.

I don't like that idea, but that is mostly based on me feeling that an abbreviated oid is a text representation.
That it needs a binary representation at all before being converted to a real oid is just a necessary evil.
Also, I'm thinking that someone who has an abbreviated oid won't know if it is a SHA-1 or SHA-256 oid until a unique match have been found in the database, until then it is just a collection of nibbles that could have been a reference to the nonexisting tag feed

@ethomson
Copy link
Member Author

TBH I haven't read through the PR but I figured I give some preliminary thoughts.

That's fine, most of the PR is mechanical. I appreciate the early thoughts a lot.

Agreed, at least if we want to abstract the hash type away from the library user.

I think that it's nice for the DX. I don't really want to think about whether my repository has sha1 or sha256 (or the magical compatibility translation layer), I just want to use an object id, probably as an opaque blob. I want to get its string representation to show to the user, but I don't really care how long it is or if it's hex. We could encode them with [0-9a-z] and users would probably be happier because they take up less room on screen.

Is there any particular place where you feel that there could be a large enough number of git_oid for those bytes to become significant?

I guess pack files. https://github.com/torvalds/linux is currently 8581458 objects reachable from HEAD. If you want an array of all the oids, that's 163 MB if they're SHA1. That's 262 MB if they're SHA256. Adding a single byte on top adds 8.5 MB while adding an enum adds 34.3 MB.

I guess in the world of Electron apps, that's not very much. 😁 It just feels like quadrupling the storage is a hefty price to pay for a slightly improved developer experience.

Or add a type field together with the pointer

That will definitely satisfy some of the uses (packfiles) but not the index one. I think that I like your unsigned char * (or void *)-but-cast idea. 🤔

Also, I'm thinking that someone who has an abbreviated oid won't know if it is a SHA-1 or SHA-256 oid until a unique match have been found in the database, until then it is just a collection of nibbles that could have been a reference to the nonexisting tag feed

🤔 Interesting point. I'm actually surprised that there's not a way to lookup an object as sha1:deadbeef in git. This is especially puzzling to me since there's a way for me to refer directly to a commit's parent's tree (in fact, multiple ways).


Here's my real question: how impactful is this change on your codebase? It's a lot of API breakage.

I'm even more concerned about the ABI breakage. If I were rust or LibGit2Sharp and use an FFI that's basically "make sure that the structs are declared identically" then we're just going to silently go boom. 💣

I think that it's probably worth me taking a pass on what API and ABI compatibility could look like. I think that it's possible, and it's definitely worth exploring some more. 🤔

@ethomson
Copy link
Member Author

Also cc @extrawurst @palmin @csware as a random sampling of libgit2 users who might be able to or interested in providing some more thoughts on this...

@boretrk
Copy link
Contributor
boretrk commented Jan 29, 2022

If you want an array of all the oids, that's 163 MB if they're SHA1. That's 262 MB if they're SHA256. Adding a single byte on top adds 8.5 MB while adding an enum adds 34.3 MB.

The current oid have a fixed size so it's always 262 MB regardless of if it is SHA1 or SHA256. To get away from that we would need something less practical.
But I see your point, a byte increases the size with 3.1% while the enum increases it with 12.5% and given my use case I do appreciate wanting to keep the memory usage down.

It could be possible to declare enum with typedef enum __attribute__((packed)) conditionally on gcc or fall back to char on other compilers, but maybe that could be considered to be a bit sketchy, especially since it is used in the external interface.
Using a byte is probably fine, the intent is to not look at it anyway :-)

Here's my real question: how impactful is this change on your codebase? It's a lot of API breakage.

I'm afraid I can't contribute much here. My interest in libgit2 comes from wanting to get https://github.com/billyfish/simplegit to run on AmigaOS3.x, but I haven't gotten to a point where I feel comfortable about saying how problematic it would be to update the code to this change. (I've been putting it off waiting for #5983 and #6178 since those also solve some Amiga-specific issues but it's not really a good excuse since I probably want to take some intermediate steps to bring it up to libgit2 1.0)

Of course I would be happy with API changes being as minimal as possible, but it seems to me that it's just a matter of setting a variable GIT_OID_SHA1 or GIT_OID_SHA256 if extensions.objectFormat is sha256 and pass it on to every function-call that fails to compile.
Clearly the next feature request will be adding support for GIT_OID_DEFAULT to those functions.

@extrawurst
Copy link

[..] how impactful is this change on your codebase? It's a lot of API breakage.
I'm even more concerned about the ABI breakage.
I think that it's probably worth me taking a pass on what API and ABI compatibility could look like. I think that it's possible, and it's definitely worth exploring some more.

I personally would not want us to build technical dept just for the sake of ABI-Stability. Me coming at this from the rust side of things has little concern for us to be jeopardised hard by this. we have a semver-versioned crate that we can make a major version bump which allows breaking changes on the API level and deploying binary builds is less of a thing in the rust ecosystem anyway.

So IMO lets break it and make it a major release 2.0.

cc @alexcrichton @joshtriplett @ehuss (owners of git2-rs and cargo which relies on it)

@joshtriplett
Copy link
Contributor
joshtriplett commented Jan 29, 2022

Going to 2.0 seems fine, but please take care to make sure a process doesn't explode if it ends up with both v1 and v2 libraries loaded, which could easily happen. Symbol versioning would help with this.

I do care about having a huge number of oid objects without extra space. Pairing them with a type byte seems fine, but it may make sense to have a "raw" underlying type, so that people can easily keep a large array of oid objects of the same type. Please don't add anything else to the oid object though, such as abbreviation handling; that can go in a separate type.

It might make sense to make OID variable-length (with a couple of equivalent fixed length structs for convenience), which would take some care to pass around without knowing its underlying type, but which would avoid both the excess space in the sha1 case as well as the need for a version 3.0 if you ever need to add a new hash that's larger. I'm not sure if that's worth the annoyance, but it seems worth considering.

(Also, if you're going to go 2.0, there are some additional changes worth making, such as going 100% iterator rather than callback, and not hardcoding backend indexes.)

@boretrk
Copy link
Contributor
boretrk commented Jan 30, 2022

Some thoughts:

  1. A 40 character hex string could be either a complete SHA-1 oid or an abbreviated SHA-256 oid.
    The only case when a hex-string can be considered non-abbreviated is when it is the length of the longest possible hash.
  2. While one can't specify that a hash is of a specific type like sha1:deadbeef, branches and tags can always be distinguished by a longer path tags/deadbeef. This indicates to me that matching hashes should have priority over tags and branches with a matching name.
  3. A single byte enum is still sufficient to indicate that an oid is abbreviated. One could allocate the range 128+nibblecount for that, or specify that a value over 15 is a bitcount since at least four nibbles are required to be set.
    Neither version feels particularly good to me but it is a possibility.

@joshtriplett
Copy link
Contributor

@boretrk I don't want to invoke the machinery to resolve an abbreviated OID without specifically intending to; that's why I'd prefer to keep "abbreviated OID" and "unabbreviated OID" as two distinct types.

@lhchavez
Copy link
Contributor

🎉 it's happening 🎉 thank you so much for doing this!

once there's a clear direction, i can probably pitch in with packfiles and friends (midx).

  1. We stop using git_oids for things that aren't object IDs. For example, things like the index store a checksum that is (not coincidentally) the same hash type as the oid type in the repository. Despite that, it's not an object ID. And things like git_indexer_hash assume that packfile names are the checksums which isn't necessarily a guarantee forever. Instead, provide a function to get the (string) filename.

oof, thanks for fixing this.

  1. Is putting the type and hash into the same object sane? I think so, but it's open to discussion.

I think it's fine. Although is there a possibility to have an union (maybe three entries: id with the full size, sha1, and sha256 to their correct sizes)? that way the sizeof each hash member is defined without wasting memory. and also may help identifying some bugs where one type of hash is used when the other one should have been (in a very opt-in fashion).

Another option to save space is to (shudder) make git_oid an opaque type and use the unsigned char id[1]; struct hack (but that would lose the unionification property). That may make things a bit harder for callers because now they would need to call a function to get the bytes back.

  1. We need to do something to deal with oids as a bag of bytes (for examples, in an index entry).

for symmetry with the above answer, and taking ideas from upthread, could this also be an union? so both versions would be

typedef struct git_oid {
	/** type of object id */
	git_oid_t type;

	/** raw binary id */
	union {
		unsigned char id[GIT_OID_MAX_SIZE];  /* For ergonomics, and when you don't care about the actual type */
		unsigned char sha1[GIT_OID_SHA1_SIZE];
		unsigned char sha256[GIT_OID_SHA256_SIZE];
	};
} git_oid;

typedef struct git_oidarray {
	/** type of object id */
	git_oid_t type;

	/** number of entries in the array */
	size_t count;


8000
	/** array of raw binary formatted ids */
	union {
		/* intentionally not `typedef`ing these since that's the "most hated" version */
		unsigned char *sha1[GIT_OID_SHA1_SIZE];
		unsigned char *sha256[GIT_OID_SHA256_SIZE];
	};
} git_oidarray;

i am now very happy that "Intermixing objects using multiple hash functions in a single repository." is an explicit non-goal in git :D that would have been a gnarly problem.

@ethomson
Copy link
Member Author
ethomson commented Jan 31, 2022

I think it's fine. Although is there a possibility to have an union (maybe three entries: id with the full size, sha1, and sha256 to their correct sizes)? that way the sizeof each hash member is defined without wasting memory. and also may help identifying some bugs where one type of hash is used when the other one should have been (in a very opt-in fashion).

Oh, yeah, this is definitely the right thing to do.

(I think?)

I think we can't do it without the generic id ... but I worry that there's too much complexity now? 🤔 I'd have to think about it a little bit more, but I'm inclined to do this.

@kim
Copy link
Contributor
kim commented Feb 1, 2022

We will support extensions.objectFormat to be git-compatible. We will likely not implement extensions.compatObjectFormat unless/until git chooses to do this. It's certainly not impossible that this is never implemented, or not implemented as described in the document, and we should not assume that it ever will be.

Is this assumption based on something specifically? The notes from the Git Contributors' Summit 2021 made it sound like interop wasn't completely written off.

Assuming it was, then two repositories with different objectFormats are incompatible, and git_oids with different types can't be used interchangeably. I'm wondering: can we expect libgit2 to perform runtime checks to ensure the right hash algorithm is used, or would it be up to the user to maintain that?

@ethomson
Copy link
Member Author
ethomson commented Feb 1, 2022

Is this assumption based on something specifically? The notes from the Git Contributors' Summit 2021 made it sound like interop wasn't completely written off.

It's not that I'm assuming that it won't happen, it's that I'm not yet assuming that it will happen. It's a hard problem, and no code has been written to think about it. It's the sort of thing that can definitely change between a quick ideation memo and the finished product. I don't want to go build a thing and then for git to decide that they want to do something different.

Assuming it was, then two repositories with different objectFormats are incompatible, and git_oids with different types can't be used interchangeably. I'm wondering: can we expect libgit2 to perform runtime checks to ensure the right hash algorithm is used, or would it be up to the user to maintain that?

I think that this is an independent concern; this is true even if extensions.compatObjectFormat exists in theory but isn't set in a repository.

We will check a given git_oid at runtime to ensure that it is a supported type by the repository (either by looking at extensions.objectFormat or SHA-1 in the absence of that).

@boretrk
Copy link
Contributor
boretrk commented Feb 2, 2022

The current git_oidarray just contains git_oid *ids; which allows for using the elements directly.
Stepping away from that is probably a good idea considering the memory save, but it means that the user would have to extract the oid with git_oid_from_oidarray(git_oid *out, git_oidarray *array, size_t index) or something before being able to use it.

@ethomson ethomson force-pushed the ethomson/sha256_poc branch from fa55e4b to af87971 Compare February 13, 2022 00:41
@joshtriplett
Copy link
Contributor

Is there any possible design that would allow users of sha1 repositories to avoid using extra memory for individual OIDs? (Not just oidarray, but individual OIDs?)

I have a fair bit of code that stores in-memory sets and maps using OIDs as either keys or values. Some of that code operates on tens of millions of entries. If at all possible, I'd like to avoid adding tens of megabytes of type bytes, and I'd definitely like to avoid having hundreds of megabytes of unused hash bytes.

It would help, for instance, to have dedicated storage types for sha1 OIDs that didn't use the extra bytes, and to have it be not too inefficient to pass those types around to things expecting OIDs.

I'd love to have an interface underneath the hood that I could reasonably easily and efficiently map to some kind of generic in the Rust bindings. For instance, perhaps the Rust bindings could provide a git2::Repository<O: OidType>, within which all the functions accept that instead of Oid, and then a HashMap<Something, ConcreteOidType> will end up being efficient. (And we can easily encompass generic runtime handling of either type within that, as well, with just a little less efficiency.)

I'm not expecting the C interface to be able to do that, exactly, but I'd love to avoid a massive impedance mismatch in trying to do so.

@ethomson
Copy link
Member Author
ethomson commented Feb 13, 2022

It would help, for instance, to have dedicated storage types for sha1 OIDs that didn't use the extra bytes, and to have it be not too inefficient to pass those types around to things expecting OIDs.

I think that we should never assume that a git_oid is a certain number of bytes beyond the size that we derive from the type. In other words, if you pass 41 bytes, where the first byte matches the type for GIT_OID_SHA1, then that's totally reasonable.

(Note that I don't think that this PR abides by that rule, I bet git_oid_cpy is stupid. But we should do this.)

We could provide something like:

struct git_oid_sha1 {
    unsigned char type;
    unsigned char id[40];
} git_oid_sha1;

#define GIT_OID_SHA1_INIT { GIT_OID_SHA1, { 0 } }

and this is totally interoperable with a git_oid that uses extra space.

This cuts down on the unnecessary 24 bytes per object ID, but there's still an additional byte for the type.

For instance, perhaps the Rust bindings could provide a git2::Repository<O: OidType>, within which all the functions accept that instead of Oid

Right, that's an interesting option, and it's something that we could do, too (though not as safely, but I think that's a given 😁 ). git_commit_lookup() could decide how big the object ID given to it is by asking the repository. In a world where there's no interoperability between SHA1 and SHA256, that might be okay. But if we do actually add this translation layer (and I think that we'll need one) then that gets a lot fuzzier. Looking at the Git goals for SHA256:

b. A SHA-256 repository can communicate with SHA-1 Git servers
(push/fetch).
c. Users can use SHA-1 and SHA-256 identifiers for objects
interchangeably (see "Object names on the command line", below).

Concretely, this URL: https://github.com/libgit2/libgit2/commit/5f66aa264ae645e385714104b15c1ade8ae17e7d should work even after libgit2 becomes a SHA256 repository on GitHub's servers. 😁

I think that for us that means that we want to git_commit_lookup() either a SHA1 or a SHA256 and we take care of dealing with the mapping. The alternative would be that users have to do the translation lookup, which is an annoying tax to make them pay. Especially if git does this transparently, I think that we should match that behavior.

(Of course git doesn't do anything here, so we're left sort of guessing as to what they will do based on that document.)

@joshtriplett
Copy link
Contributor

@ethomson Sure, some API users will need a mechanism that allows the same repository to process either kind of ID; the generic solution could allow that, by having a third option for a generic type-annotated OID. But other API users may be fine with using only one kind of ID. Depends on the kinds of operations you're doing and the API you're providing. A web UI accepting an OID from the client isn't going to care about an extra byte or an extra handful of bytes; something doing substantial computation on large number of objects in the repository would benefit from the space optimization.

@ethomson
Copy link
Member Author

@joshtriplett am I correct in thinking that you didn't complain much when seeing my proposal that the extra byte in question is disappointing but acceptable? 😬

@joshtriplett
Copy link
Contributor

@ethomson It seems OK if passing things to a function, but ideally I'd like to not have it when storing a large number of IDs separately. And it's somewhat annoying to have to copy (rather than reference) an object ID in order to prepend the type byte.

@ethomson ethomson force-pushed the ethomson/sha256_poc branch 3 times, most recently from 87d2d08 to ca2fafd Compare February 27, 2022 01:02
@ethomson ethomson force-pushed the ethomson/sha256_poc branch from ca2fafd to 5fc025e Compare April 10, 2022 14:00
@ethomson ethomson force-pushed the ethomson/sha256_poc branch from 5fc025e to 2788ac9 Compare April 27, 2022 14:18
ethomson added 2 commits June 14, 2022 22:29
In preparation for SHA256 support, `GIT_OID_RAWSZ` and `GIT_OID_HEXSZ`
need to indicate that they're the size of _SHA1_ OIDs.
Callers should not assume the layout of the oid structure; provide them
a macro that defines the null / zero sha1 object id.
@ethomson ethomson force-pushed the ethomson/sha256_poc branch 4 times, most recently from 724db43 to f80ce19 Compare June 20, 2022 15:37
@ethomson ethomson marked this pull request as ready for review June 20, 2022 20:59
ethomson added 14 commits June 20, 2022 17:05
`git_oid`s now have a type, and we require the oid type when creating
the object id from creation functions.
libgit2's current default oid type is SHA1, set a public macro for that.
We intentionally separate oid types from hash types; a hash is a generic
hunk of bytes, an object id has meaning and backs an object on disk.  As
a result of this separation, we need a 1:1 mapping.
The git_odb_hash helper functions should not assume SHA1, and instead
should be given the oid type that they're producing.
Users will need to be able to specify the object id type for the given
object database; add a new `git_odb_options` with that option.
Allow the object database to take an oid type that it supports.  This
oid type will be used to validate the objects that the backends provide.
Move the arguments to `git_odb_loose` into an options structure.
Tidy up `nfmt` / `pathfmt`.
Teach the loose object database how to cope with SHA256 objects.
Linux has a /usr/include/features.h, which gets confusing; update this
to `git2_features.h` and move it into the `util` directory.
libgit2 can be built with optional, experimental sha256 support. This
allows consumers to begin testing and providing feedback for our sha256
support while we continue to develop it, and allows us to make API
breaking changes while we iterate on a final sha256 implementation.

The results will be `git2-experimental.dll` and installed as
`git2-experimental.h` to avoid confusion with a production libgit2.
@ethomson ethomson force-pushed the ethomson/sha256_poc branch from 09e301f to 6013b6a Compare June 20, 2022 21:13
@ethomson
Copy link
Member Author

I want to start pushing this forward - we've talked about this a fair bit in the abstract but until people can actually write code against it, I think that we will continue to only talk in the abstract. In addition, this is currently a long-running feature branch and that's hell to maintain. So let's get it merged behind the moral equivalent of a feature flag.

I've added a new build configuration that can be enabled by cmake -DEXPERIMENTAL_SHA256=ON. By enabling this, you will get the proposed new SHA256-compatible API and ABIs. (If you do not enable this, you will maintain the current OID API which is untyped and SHA1 only.)

Since these two libraries are not interoperable and the software built for them is not compatible, the resulting binaries are libgit2-experimental.so and installed as git2-experimental.h when you run make install.

Application and binding authors can compile libgit2 in experimental SHA256 mode in order to start understanding how SHA256 will affect them. This allows - for example @joshtriplett to give us more concrete guidance on how we could save him bytes, or @carlosmn to identify something that rugged will not be happy about. This also allows us to continue to work on adapting and testing the library for SHA256 - for example @lhchavez's mention about midx and commit graph support (above).

At this point, I think that a review of the experimental indirection would be helpful. Do we successfully contain the SHA256 bits so that they don't leak into "normal" builds? Is this a useful strategy? I'm also welcome for more suggestions about how we implement the SHA256 bits but my hope is that by doing this, we can turn some of those into PRs against main instead of comments on this PR.

Note

This is: a) not a final API design. I expect that we'll continue to iterate plenty on this. and b) a work in progress. As such, please do not ship things depending on the experimental SHA256 support. Especially, please do not release bindings with SHA256 support and doubly especially don't publish these to distributions or package managers. It's far too early for that.

@samboy
Copy link
samboy commented Jun 24, 2022

As someone who doesn’t use this library, but someone who uses Git, I think adding support for something more recent than SHA-1 is long overdue. 10 years ago, I had this to say:

The bottom line is this: It is high time for Git to move on beyond SHA-1.

This is just as true today as it was nearly a decade ago when I wrote that.

@ethomson
Copy link
Member Author

OK, friends, you can now feature flag yourself in to SHA256 support. Again, this is for development.

Please do not ship things depending on the experimental SHA256 support. Especially, please do not release bindings with SHA256 support and doubly especially don't publish these to distributions or package managers. It's far too early for that.

@ethomson ethomson merged commit 433a133 into main Jul 14, 2022
@ethomson ethomson deleted the ethomson/sha256_poc branch July 14, 2022 01:08
@csware
Copy link
Contributor
csware commented Aug 12, 2022

I'd like to open a small discussion point: I don't like the experimental.h.in file that I would need to create automatically ( as I don't use cmake for building libigt2 in TortoiseGit). This refers to PR #4346 (LIBGIT2_NO_FEATURES_H). It would be great if we could find a similar solution.

@ethomson ethomson changed the title RFC: SHA256 proof of concept SHA256 proof of concept Feb 23, 2023
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Feb 25, 2023
v1.6
----

This is release v1.6.0, "Hubbeliges Krokodil". This release adds experimental SHA256 support and includes many new features and bugfixes.

## What's Changed

### New features

* **Support for bare repositories with SHA256 support (experimental)** by @ethomson in libgit2/libgit2#6191
   You can configure experimental SHA256 support in libgit2 with `cmake -DEXPERIMENTAL_SHA256=ON` during project setup. This is useful for considering future integrations, work on clients, and work on language bindings. At present, working with bare repositories should largely work, including remote operations. But many pieces of functionality - including working with the index - are not yet supported. As a result, **libgit2 with SHA256 support should not be used in production or released with package distribution.**

* **Support the notion of a home directory separately from global configuration directory** by @ethomson in libgit2/libgit2#6455 and libgit2/libgit2#6456
  Callers and language bindings can now configure the home directory that libgit2 uses for file lookups (eg, the `.ssh` directory). This configuration is separate from the git global configuration path.

* **stash: partial stash specific files** by @gitkraken-jacobw in libgit2/libgit2#6330
  A stash can be created with only specific files, using a pathspec. This is similar to the `git stash push` command.

* **push: revparse refspec source, so you can push things that are not refs** by @sven-of-cord in libgit2/libgit2#6362
  Pushes can be performed using refspecs instead of only references.

* **Support OpenSSL3** by @ethomson in libgit2/libgit2#6464 and libgit2/libgit2#6471
  OpenSSL 3 is now supported, both when compiled directly and dynamically loaded.

### Bug fixes
* winhttp: support long custom headers by @kcsaul in libgit2/libgit2#6363
* Fix memory leak by @csware in libgit2/libgit2#6382
* Don't fail the whole clone if you can't find a default branch by @torvalds in libgit2/libgit2#6369
* #6366: When a worktree is missing, return `GIT_ENOTFOUND`. by @arroz in libgit2/libgit2#6395
* commit-graph: only verify csum on `git_commit_graph_open()`. by @derrickstolee in libgit2/libgit2#6420
* Ignore missing 'safe.directory' config during ownership checks by @kcsaul in libgit2/libgit2#6408
* Fix leak in `git_tag_create_from_buffer` by @julianmesa-gitkraken in libgit2/libgit2#6421
* http: Update httpclient options when reusing an existing connection. by @slackner in libgit2/libgit2#6416
* Add support for `safe.directory *` by @csware in libgit2/libgit2#6429
* URL parsing for google-compatible URLs by @ethomson in libgit2/libgit2#6326
* Fixes #6433: `git_submodule_update` fails to update configured but missing submodule by @tagesuhu in libgit2/libgit2#6434
* transport: fix capabilities calculation by @russell in libgit2/libgit2#6435
* push: use resolved oid as the source by @ethomson in libgit2/libgit2#6452
* Use `git_clone__submodule` to avoid file checks in workdir by @abizjak in libgit2/libgit2#6444
* #6422: handle dangling symbolic refs gracefully by @arroz in libgit2/libgit2#6423
* `diff_file`: Fix crash when freeing a patch representing an empty untracked file by @jorio in libgit2/libgit2#6475
* clone: clean up options on failure by @ethomson in libgit2/libgit2#6479
* stash: update strarray usage by @ethomson in libgit2/libgit2#6487
* #6491: Sets `oid_type` on repos open with `git_repository_open_bare` by @arroz in libgit2/libgit2#6492
* Handle Win32 shares by @ethomson in libgit2/libgit2#6493
* Make failure to connect to ssh-agent non-fatal by @fxcoudert in libgit2/libgit2#6497
* odb: don't unconditionally add `oid_type` to stream by @ethomson in libgit2/libgit2#6499
* Pass hostkey & port to host verify callback by @fxcoudert in libgit2/libgit2#6503

### Security fixes

### Code cleanups
* meta: update version number to v1.6.0-alpha by @ethomson in libgit2/libgit2#6352
* sha256: indirection for experimental functions by @ethomson in libgit2/libgit2#6354
* Delete `create.c.bak` by @lrm29 in libgit2/libgit2#6398
* Support non-cmake builds with an in-tree `experimental.h` by @ethomson in libgit2/libgit2#6405

### Build and CI improvements
* tests: skip flaky-ass googlesource tests by @ethomson in libgit2/libgit2#6353
* clar: remove ftrunacte from libgit2 tests by @boretrk in libgit2/libgit2#6357
* CI Improvements by @ethomson in libgit2/libgit2#6403
* fix compile on Windows with `-DWIN32_LEAN_AND_MEAN` by @christoph-cullmann in libgit2/libgit2#6373
* Fixes #6365 : Uppercase windows.h include fails build in case-sensitive OS by @Vinz2008 in libgit2/libgit2#6377
* ci: update version numbers of actions by @ethomson in libgit2/libgit2#6448
* thread: avoid warnings when building without threads by @ethomson in libgit2/libgit2#6432
* src: hide unused hmac() prototype by @0-wiz-0 in libgit2/libgit2#6458
* tests: update clar test runner by @ethomson in libgit2/libgit2#6459
* ci: always create test summaries, even on failure by @ethomson in libgit2/libgit2#6460
* Fix build failure with `-DEMBED_SSH_PATH` by @vicr123 in libgit2/libgit2#6374
* Define correct `off64_t` for AIX by @bzEq in libgit2/libgit2#6376
* Fix some warnings in main by @ethomson in libgit2/libgit2#6480
* strarray: remove deprecated declaration by @ethomson in libgit2/libgit2#6486
* tests: always unset `HTTP_PROXY` before starting tests by @ethomson in libgit2/libgit2#6498

### Documentation improvements
* add 2-clause BSD license to COPYING by @martinvonz in libgit2/libgit2#6413
* Add new PHP bindings project to language bindings section of README.md by @RogerGee in libgit2/libgit2#6473
* README: clarify the linking exception by @ethomson in libgit2/libgit2#6494
* Correct the definition of "empty" in the docs for `git_repository_is_empty` by @timrogers in libgit2/libgit2#6500
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants
0