git-evtag
can be used as a replacement for git-tag -s
. It
will generate a strong checksum (called Git-EVTag-v0-SHA512
) over the
commit, tree, and blobs it references (and recursively over submodules).
A primary rationale for this is that the underlying SHA1 algorithm of
git is under increasing threat.
Further, a goal here is to create a checksum that covers the entire source of
a single revision as a replacement for tarballs + checksums.
git-evtag was originally discussed (long before the SHA1 collision of February 2017) on the git mailing list:
See also the the Node.js implementation.
- Fedora package
- Building from source: Requires glib2 and libgit2.
Create a new v2015.10
tag, covering the HEAD
revision with GPG
signature and Git-EVTag-v0-SHA512
:
$ git-evtag sign v2015.10
( type your tag message, note a Git-EVTag-v0-SHA512 line in the message )
$ git show v2015.10
( Note signature covered by GPG signature )
Verify a tag:
$ git-evtag verify v2015.10
gpg: Signature made Sun 28 Jun 2015 10:49:11 AM EDT
gpg: using RSA key 0xDC45FD5921C13F0B
gpg: Good signature from "Colin Walters <walters@redhat.com>" [ultimate]
gpg: aka "Colin Walters <walters@verbum.org>" [ultimate]
Primary key fingerprint: 1CEC 7A9D F7DA 85AB EF84 3DC0 A866 D7CC AE08 7291
Subkey fingerprint: AB92 8A9C F8DD 0629 09C3 7BBD DC45 FD59 21C1 3F0B
Successfully verified: Git-EVTag-v0-SHA512: b05f10f9adb0eff352d90938588834508d33fdfcedbcfc332999ee397efa321d1f49a539f1b82f024111a281c1f441002e7f536b06eb04d41857b01636f6f268
This is similar to what project distributors often accomplish by using
git archive
, or make dist
, or similar tools to generate a tarball,
and then checksumming that, and (ideally) providing a GPG signature
covering it.
The problem with git archive
and make dist
is that tarballs (and
other tools like zip files) are not easily reproducible exactly from
a git repository commit. The authors of git reserve the right to
change the file format output by git archive
in the future. Also,
there are a variety of reasons why compressors like gzip
and xz
aren't necessarily reproducible, such as compression levels, included
timestamps, optimizations in the algorithm, etc. See
Pristine tar
for some examples of the difficulties involved (e.g. trying to
retroactively guess the compression level arguments from the xz
dictionary size).
If the checksum is not reproducible, it becomes much more difficult to easily and reliably verify that a generated tarball contains the same source code as a particular git commit.
What git-evtag
implements is an algorithm for providing a strong
checksum over the complete source objects for the target commit (+
trees + blobs + submodules). Then it's integrated with GPG for
end-to-end verification. (Although, one could also wrap the checksum
in X.509 or some other public/private signature solution).
Then no out of band distribution mechanism is necessary, and better, the checksums strengthen the ability to verify integrity of the git repository.
(And if you want to avoid downloading the entire history, that's what
git clone --depth=1
is for.)
NEW! The first SHA1 collision was announced February 23, 2017:
Git uses a modified Merkle tree with SHA1, which means that if an attacker managed to create a SHA1 collision for a source file object (git blob), it would affect all revisions and checkouts - invalidating the security of all GPG signed tags whose commits point to that object.
Now, the author of this tool believes that today, GPG signed git tags are fairly secure, especially if one is careful to ensure transport integrity (e.g. pinned TLS certificates from the origin).
There is currently only one version of the Git-EVTag
algorithm,
called v0
- and it only supports
SHA-512. It is declared
stable. All further text refers to this version of the algorithm. In
the unlikely event that it is necessary to introduce a new version,
this tool will support all known versions.
Git-EVTag-v0-SHA512
covers the complete contents of all objects for
a commit; again similar to checksumming git archive
, except
reproducible. Each object is added to the checksum in its raw
canonicalized form, including the header.
For a given commit (in Rust-style pseudocode):
fn git_evtag(repo: GitRepo, commitid: String) -> SHA512 {
let checksum = new SHA512();
walk_commit(repo, checksum, commitid)
return checksum
}
fn walk_commit(repo: GitRepo, checksum : SHA512, commitid : String) {
checksum_object(repo, checksum, commitid)
let treeid = repo.load_commit(commitid).treeid();
walk(repo, checksum, treeid)
}
fn checksum_object(repo: GitRepo, checksum: SHA512, objid: String) -> () {
// This is the canonical header of the object; <typename> <length (ascii base 10)>
// https://git-scm.com/book/en/v2/Git-Internals-Git-Objects#Object-Storage
let header : &str = repo.load_object_header(objid);
// The NUL byte after the header, explicitly included in the checksum
let nul = [0u8];
// The remaining raw content of the object as a byte array
let body : &[u8] = repo.load_object_body(objid);
checksum.update(header.as_bytes())
checksum.update(&nul);
checksum.update(body)
}
fn walk(repo: GitRepo, checksum: SHA512, treeid: String) -> () {
// First, add the tree object itself
checksum_object(repo, checksum, treeid);
let tree = repo.load_tree(treeid);
for child in tree.children() {
match childtype {
Blob(blobid) => checksum_object(repo, checksum, blobid),
Tree(child_treeid) => walk(repo, checksum, child_treeid),
Commit(commitid, path) => {
let child_repo = repo.get_submodule(path)
walk_commit(child_repo, checksum, commitid)
}
}
}
}
This strong checksum, can be verified reproducibly offline after cloning a git repository for a particular tag. When covered by a GPG signature, it provides a strong end-to-end integrity guarantee.
It's quite inexpensive and practical to compute Git-EVTag-v0-SHA512
once per tag/release creation. At the time of this writing, on the
Linux kernel (a large project by most standards), it takes about 5
seconds to compute on this author's laptop. On most smaller projects,
it's completely negligible.
This project is just addressing one small part of the larger git/tarball question. Anything else is out of scope, but a brief discussion of other aspects is included below.
Historically, many projects include additional content in tarballs.
For example, the GNU Autotools pregenerate a configure
script from
configure.ac
and the like. Other projects don't include
translations in git, but merge them out of band when generating
tarballs.
There are many other things like this, and they all harm reproducibility and continuous integration/delivery.
For example, while many of my projects use Autotools, I simply have
downstream authors run autogen.sh
. It works just fine - the
autotools are no longer changing often, and many downstreams want to
do it anyways.
For the translation issue, note that bad translations can actually crash one's application. If they're part of the git repository, they can be more easily tested as a unit continuously.