8000 Fix CI cache restore by ibbem · Pull Request #156 · VariantSync/DiffDetective · GitHub
[go: up one dir, main page]

Skip to content

Fix CI cache restore #156

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Mar 31, 2025
Merged

Fix CI cache restore #156

merged 4 commits into from
Mar 31, 2025

Conversation

ibbem
Copy link
Collaborator
@ibbem ibbem commented Feb 26, 2025

I identified the feature interaction that caused sporadic CI failures:

  1. The GitHub Pages are built using Nix (the share/github-pages/DiffDetective/ in the nix-build output). However, GitHub Pages uses the github-metadata Jekyll plugin which connects to the GitHub API to retrieve some metadata (which is uninteresting for us). This step is skipped in case there is no internet connection.
  2. To decrease CI build times, we currently cache the Nix installation including the Nix store after building all dependencies. Due to a bug (the Nix configuration /etc/nix/nix.conf is missing if the Nix installation is restored from the cache), the Nix sandbox settings differ between CI runs. In particular, if there is a cache hit, the Nix sandbox is disabled and thus internet connections become possible during all Nix builds.
    The combination of these two problems would result in unreproducible CI builds (which we probably would have never found).
    Fortunately (in the sense that this problem became noticeable), there is another interaction that masks these unreproducible CI builds: By default, Nix doesn't have TLS certificates installed in the build environment (they shouldn't be needed anyways). When github-metadata tries to access the GitHub API, if there was a cache hit, it fails to validate the GitHub server because the TLS certificate is missing and thus aborts the build.

Problem 1 is effectively solved by jekyll/github-metadata#285. We just need to upgrade the github-metadata version or wait until nixpkgs adopts that version and update nixpkgs. Then we can just set the environment variable PAGES_DISABLE_NETWORK to reliable disable network connections even if the Nix sandbox is disabled.
During investigation of problem 2, I noticed that the cache-install action, which implements the Nix installation caching for us, is kind of broken by design. I would really like to fix this, unfortunately this design is necessary due to limitations (regarding file permissions) of GitHub's cache action. After some experimentation I found a better approach and a more reliable caching mechanism:

  1. I replaced cache-install with install-nix-action which just installs Nix in the GitHub Actions environment but is much faster even without using a similar cache. Though, this new installation doesn't cache our dependencies.
  2. After building the dependencies, the Nix store is exported into a single file which can then be cached, restored and imported very reliable, keeping all used components well in their specifications (in contrast to cache-install needs to integrate deeply into the Nix installation internals).

During testing I found that change 1 decreases our build times from about 4:35 min to about 1:39 min. Hence, the cache of change 2 doesn't offer the same benefit as before. In particular, the build time when creating the cache increases to about 1:52 min and decreases with on a cache hit to about 1:19 min.
@pmbittner Do you think we should still implement the cache or just drop change 2 and stick with only using change 1? Note that with our current development pace, the cache is only rarely useful because it's automatically purged by GitHub after one week (that's the reason the CI failures were only sporadic).

@ibbem ibbem force-pushed the fix-CI-cache-restore branch 9 times, most recently from 0bdd769 to 04735fa Compare March 25, 2025 06:31
ibbem added 2 commits March 28, 2025 23:07
The old method also didn't reproduce the same environment if there was a
cache hit (the sandbox was disabled in this case).
This should be more reliable than the old cache-install action because
all the software is operated according to their intended interfaces.
@ibbem ibbem force-pushed the fix-CI-cache-restore branch from 2e66d9f to da11a13 Compare March 28, 2025 22:16
@ibbem ibbem marked this pull request as ready for review March 28, 2025 22:25
@ibbem ibbem requested a review from pmbittner March 28, 2025 22:25
@pmbittner pmbittner added bug Something isn't working enhancement New feature or request labels Mar 29, 2025
@pmbittner
Copy link
Member

Thank you so much for figuring this hairy bug out and also fixing the respective bug in the jekyll project!

I appreciate leaving cache-install behind if it is so broken. If I understood correctly, the new cache you implemented creates a single big text file with all the nix store package / directory names that were created and that are needed for the CI?

Given your convincing arguments for not having a cache (at least not for DiffDetective), I am in favor of reducing complexity and just going with change 1. Maybe leaving a short comment at the place where the caching would needs to be, which points to this PR would be nice.

So this means we would merge your first commit but drop the second one?

ibbem added 2 commits March 31, 2025 09:43
After testing (everything works as excepted) and measuring the cache, we
noticed that the time improvement is not worth the additional
complexity. Hence, we decided to remove this cache but leave it in the
commit history in case it becomes relevant in the future.

This effectively reverts commit da11a13
"CI: Cache dependencies between runs".
In case the Nix sandbox is enabled, the github-metadata plugin
automatically disables network accesses. However, if the sandbox is
disabled (e.g., on MacOS or by explicitly disabling it), the
github-metadata plugin tries to access the GitHub API and (fortunately)
fails which aborts the build.

As the required patch is currently unreleased, the patch is directly
fetched from GitHub. This overlay can be removed as soon as this patch
lands in nixpkgs and we update to such a nixpkgs version.
@ibbem
Copy link
Collaborator Author
ibbem commented Mar 31, 2025

If I understood correctly, the new cache you implemented creates a single big text file with all the nix store package / directory names that were created and that are needed for the CI?

Yes, that's exactly what it does.

So this means we would merge your first commit but drop the second one?

I deciced to revert the commit instead of dropping it. This keeps this caching mechanism in the commit history for future reference.

I also pushed a fix for problem 1. It applies the patch for github-metadata onto the version that we are currently using. As soon as github-metadata tags a new release and it lands in nixpkgs, we might consider updating our dependenies to get rid of this patch.

@pmbittner pmbittner merged commit 73144c4 into develop Mar 31, 2025
2 checks passed
@pmbittner
Copy link
Member

sounds good to me :)

@pmbittner pmbittner mentioned this pull request Mar 31, 2025
9 tasks
@ibbem ibbem deleted the fix-CI-cache-restore branch June 12, 2025 14:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0