Proposal for the baseline images problem · Issue #16447 · matplotlib/matplotlib · GitHub

Proposal for the baseline images problem #16447


Open
anntzer opened this issue Feb 8, 2020 · 95 comments
Labels
keep (Items to be ignored by the “Stale” Github Action) · topic: testing

Comments

@anntzer
Contributor
anntzer commented Feb 8, 2020

This is a proposal to help with the difficulty in adding/modifying tests which require a baseline image. As a brief reminder, baseline images are problematic because

  • they cause the repo size to grow rather quickly (matplotlib's .git is now ~400Mb, compared e.g. to numpy's which is 72Mb), even though we've been exceedingly careful in recent years to avoid adding more baseline images (the problems with the merge of Fix mplot3d projection #8896 got me to think about this again).
  • they force us to pin to a somewhat old version of freetype, because nearly every release of freetype causes tiny rasterization changes that would entail regenerating all baseline images (and thus cause even more repo size growth).

The idea is to actually not store baseline images in the repo at all.

  1. When generating a sdist or wheels for a new release, we do generate baseline images (from the tests) and store them in the released sdist or wheel. The validity of the baseline images is guaranteed by what comes below.
  2. During normal development, the baseline images should not change. If someone wants to run the test suite from a git checkout, matplotlib would first generate a set of "known" baseline images either by grabbing the last release, or just installing the last tagged release into a temporary venv and running the test suite there (of course, they would be cached into ~/.cache/matplotlib so this is only done once -- this caching also needs to be done so that CI time doesn't double...).
  3. If a baseline image needs to change (we should still try to keep this as rare as possible), the commit message needs to include a special string, e.g. [baseline-change]: path/to/image. The process described in 2) would also consult the git log to know if there are any images that need to be regenerated in such a way (a rough sketch of this detection step follows the list). Likewise, if a baseline image needs to be added for a new test, the commit message needs to be tagged accordingly. Possibly, we could also require that the filename also be changed in that case so that even if you're not in a git repo, you at least don't end up using the wrong baseline image, and instead just get a skip.
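
A rough sketch of what that commit-message scan could look like (the helper below and its behaviour are assumptions made for this proposal, not existing Matplotlib tooling):

```python
# Hypothetical detection of the proposed "[baseline-change]" commit-message tag.
import re
import subprocess

BASELINE_TAG = re.compile(r"\[baseline-change\]:\s*(\S+)")

def changed_baselines_since(last_release_tag):
    """Return baseline image paths flagged as changed since *last_release_tag*."""
    log = subprocess.run(
        ["git", "log", f"{last_release_tag}..HEAD", "--format=%B"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sorted({m.group(1) for m in BASELINE_TAG.finditer(log)})

# Images listed here would be regenerated from the flagged commits instead of
# being taken from the cached release baselines.
print(changed_baselines_since("v3.2.0"))
```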

Note that this means that if you're using a non-git-repo, non-sdist/wheel, you cannot run the full test suite anymore, because you at least need to know what was the last release, and if any image changed since then, you'd like to know when (however, if you want to write a patch and check that it's good, this still works because you can compare the baselines with what was present before your patch). OTOH distro/conda packagers, which should be working from sdists, would still be able to run the test suite fine, because the sdists/wheels will include baseline images per 1).

There's a bit of machinery involved in making this work, but I don't think it's overly hard to implement. Thoughts?

@QuLogic
Member
QuLogic commented Feb 9, 2020

At least as far as getting new baseline images with other FreeType, you can grab them from https://github.com/QuLogic/mpl-images.

I have not yet completely thought about the rest of the proposal, but I suspect if caching is not sufficient, it might be possible to do the remaining work in a GitHub Action.

@tacaswell
Member

This is the best proposal for handling this problem yet!

In summary:

  • on release, generate the 'blessed' version of the test images for that tag (these could even be thrown onto github as artifacts with the release)
  • on a dev machine, generate and/or download and cache the test images running against the last released version and/or current master
  • some logic to flag what images have been changed from last version to the current branch tip and do something about them
  • for CI we can rely on the CI cache or stand up a store on GitHub/DO/AWS/...; managing that cache can be done via GitHub Actions

As part of this we should publish mpl-test-data as a wheel separately.

If we have the tools for CI to automatically grab the latest baseline images based on the current commit, might as well have that be the recommended way for devs to get the test images as well (with the generate-in-place being an option if you have local CPU/time, but poor internet)?

One thing we might lose with this is that we currently use exactly the same version of the images on all OSes, all Python versions, all machines, etc. I think we can maintain this, but it is a bit trickier to be 100% sure that we have it right (as opposed to "git says it is these bits")

A thing that this would give us is a nice way to generate forward and backwards compatibility reports (run the next tag with the baseline images from the previous tag, list what fails).

This would also get us around the problem that some of our latex baselines are different with newer versions of latex.

Are we re-inventing git LFS?

@anntzer
Contributor Author
anntzer commented Feb 9, 2020

Are we re-inventing git LFS?

If you add the machinery to grab the latest test images, perhaps (but then you also need to decide where to host them and may run into bandwidth issues -- similar to why we haven't adopted git-lfs on github); but the intent was really that CI uses CI cache to roll over one run's generated test images as baselines for the next run, and that devs do the same locally. Which is really the fundamental idea here: instead of comparing against a fixed baseline, compare a commit to the next (with some additional shortcuts based on reference points at releases, etc.).
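
As a hypothetical illustration of that rollover, a CI cache key could be derived so that it only changes when the last release tag or a flagged baseline change does (the helper name and the "[baseline-change]" marker are assumptions from this proposal, not existing CI configuration):

```python
import hashlib
import subprocess

def baseline_cache_key():
    # Last tagged release reachable from the current commit.
    last_tag = subprocess.run(
        ["git", "describe", "--tags", "--abbrev=0"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # Commits since that tag whose messages carry the proposed marker.
    flagged = subprocess.run(
        ["git", "log", f"{last_tag}..HEAD", "--fixed-strings",
         "--grep=[baseline-change]", "--format=%H"],
        capture_output=True, text=True, check=True,
    ).stdout
    return hashlib.sha1((last_tag + flagged).encode()).hexdigest()
```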

@jklymak
Member
jklymak commented Feb 10, 2020

I'm not fully following this, but it sounds painful for contributors. If we are going to be painful to contributors, it seems to me that just having a separate images repository on github would be simpler. Someone would have to tag it to releases, so maybe that makes the release work harder? But then folks could just do shallow fetches if they do not need the full history of the images. But perhaps I'm missing something, or maybe that's more complicated.

@anntzer
Contributor Author
anntzer commented Feb 10, 2020

This should be transparent for contributors (except that their first test run is going to be a bit slower if some baseline images changed since the last release, given that mpl will automatically rebuild itself at the correct commit to generate the corresponding baseline).
Likewise, the release work should be automated as much as possible.

@laksh9950

I am interested in this issue. This is the kind of problem I would like to work on in my summer time. As I know, testing is an important part of a development process. From what I understand, we can actually store the binary of the baseline images and their format in the release. I need to work a lot more on this. @anntzer please guide me.
Thanks

@anntzer
Contributor Author
anntzer commented Mar 31, 2020

@laksh9950

See some suggestions at https://discourse.matplotlib.org/t/gsoc-due-march-31-18-00utc/20975/10?u=anntzer.lee as well.

Yeah, I have looked at it. I have worked on Python testing in one of my internships, but I have to get comfortable with the codebase and then I can do a pull request.
I need some help with my proposal. Can you please help, @anntzer?

@SidharthBansal
Contributor

See some suggestions at https://discourse.matplotlib.org/t/gsoc-due-march-31-18-00utc/20975/10?u=anntzer.lee as well.

Okay, I am looking towards them. Thanks

@SidharthBansal
Contributor

The individual changes listed above (generating baseline images for a new release in the wheels, modifying the baseline image logic, CI changes) are quite vast. So I think we should break the issue up and tackle them in different PRs. This will also ensure modularity. But independent pull requests merged into master can leave master in an inconsistent state.
So, we should create a beta branch and get them merged into the beta branch. Once the beta branch is fully completed, we can move the changes to master. What do you suggest?

@anntzer
Contributor Author
anntzer commented Apr 2, 2020

We typically prefer merging things in a piecemeal fashion, but if that's too tricky (I realize it's going to be a lot of work, that's why it's a GSOC :-)) then I guess putting everything on a branch may be an option; thoughts, @tacaswell @QuLogic or anyone else?

@laksh9950
laksh9950 commented Apr 3, 2020 via email

@anntzer
Contributor Author
anntzer commented Apr 3, 2020

Can we include baseline images with a pip package.

Yes, certainly, that's part of the plan too. That's in fact necessary so that people who test just based on the pip package (e.g., Linux distro packagers) can still run the tests.

@laksh9950

Can we include baseline images with a pip package.

Yes, certainly, that's part of the plan too. That's in fact necessary so that people who test just based on the pip package (e.g., Linux distro packagers) can still run the tests.

Yes, I am looking at the codebase and looking for ways to include baseline images in the pip package.

@tacaswell
Member

As much as possible this should be done via many small PRs. That both keeps them reviewable and makes sure that if at the end of GSoC we have not finished everything, there is still work in the code base that we can build on in the future.

If the PRs are leaving the master branch in an inconsistent state then the work has not been split up correctly.

@SidharthBansal
Contributor
SidharthBansal commented Apr 10, 2020

Few questions:

Use Case 1: Suppose A is a user of matplotlib (not a contributor). I think A will never get involved in testing, so in the entire lifespan of A (assuming they never run pytest), they will never have any baseline images on their computer, nor in their fork. Please correct me if I am wrong.
There is no use for A to grab baseline images, as they are not involved in testing.

Use Case 2: A developer (maybe also a user of mpl) runs only one test (out of the entire test suite) for a particular feature. The feature involves only two baseline images, B1 and B2; the entire set of baseline images is B. So, will this developer download B or (B1 + B2)?

Use Case solution 2A: Download B -> The first time, this will take a lot of time, downloading the entire set of baseline images from the last release (which has all baseline images). Later on, the time needed for testing will be reduced.

Use Case solution 2B: Download B1 + B2: a faster download of B1 + B2, but a little overhead each time to connect to the last release to grab baseline images. There will also need to be a check for which baseline images are already present on the developer's machine; only images which are not present need to be downloaded from the last release. So, logic comparing the last release and the current developer's master branch needs to be created. Another subcase: the developer deleted some of the previously downloaded baseline images, so while checking the last release, the developer should be able to re-download the images which no longer exist locally.

Considering coupling, and the fact that one test may break a non-corresponding test (by indirect calling), it is better to fetch all baseline images at once. Also, most of the time developers run the full test suite, and the extra wait each time for B1 + B2 would be irritating for the developer. I prefer 2A.

Use Case 3: Deletion of baseline images. I observed that deletion of baseline images is not mentioned above. Suppose a plotting feature is deprecated; in later releases, the baseline images related to the deprecated feature will need to be deleted. Also, deletion of baseline images (if developed as a reusable submodule/method) can be used for baseline image modification in two steps: deletion of old image(s) + addition of new image(s).

What do you guys think/suggest?

@anntzer
Contributor Author
anntzer commented Apr 10, 2020

Use Case 1: Suppose A is a user of matplotlib (not a contributor). I think A will never get involved in testing, so in the entire lifespan of A (assuming they never run pytest), they will never have any baseline images on their computer, nor in their fork. Please correct me if I am wrong.
There is no use for A to grab baseline images, as they are not involved in testing.

Yes.

Use Case 2: A developer (maybe also a user of mpl) runs only one test (out of the entire test suite) for a particular feature. The feature involves only two baseline images, B1 and B2; the entire set of baseline images is B. So, will this developer download B or (B1 + B2)?

The plan is to download everything.

Use Case solution 2A: Download B -> The first time, this will take a lot of time, downloading the entire set of baseline images from the last release (which has all baseline images). Later on, the time needed for testing will be reduced.

Yes, but the total download is not worse than currently, because currently you always download all baseline images anyways.

Use Case solution 2B: Download B1 + B2: a faster download of B1 + B2, but a little overhead each time to connect to the last release to grab baseline images. There will also need to be a check for which baseline images are already present on the developer's machine; only images which are not present need to be downloaded from the last release. So, logic comparing the last release and the current developer's master branch needs to be created. Another subcase: the developer deleted some of the previously downloaded baseline images, so while checking the last release, the developer should be able to re-download the images which no longer exist locally.

Considering coupling, and the fact that one test may break a non-corresponding test (by indirect calling), it is better to fetch all baseline images at once. Also, most of the time developers run the full test suite, and the extra wait each time for B1 + B2 would be irritating for the developer. I prefer 2A.

It's also more infrastructure to set up to be able to do piecemeal downloads.

Use Case 3: Deletion of baseline images. I observed that deletion of baseline images is not mentioned above. Suppose a plotting feature is deprecated; in later releases, the baseline images related to the deprecated feature will need to be deleted. Also, deletion of baseline images (if developed as a reusable submodule/method) can be used for baseline image modification in two steps: deletion of old image(s) + addition of new image(s).

That seems fine?


One important case that needs to be mentioned is if a baseline image changes but there's no release yet with that change. In that case the developer cannot download that baseline image. Instead (this is one important point of the proposal), this change should be marked in some way in the git history, the testing machinery should detect that, set up a temporary environment with the last "known good" version of Matplotlib, run the test suite locally there to generate the corresponding baseline image, cache it locally (so that we don't have to redo this every time), and then use that as baseline image to run the test on master.
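
A minimal sketch of that machinery, assuming a cache under ~/.cache/matplotlib and a POSIX venv layout (all paths, names, and commands here are illustrative assumptions, not existing Matplotlib tooling):

```python
import subprocess
import venv
from pathlib import Path

CACHE = Path.home() / ".cache" / "matplotlib" / "baselines"  # assumed cache location

def generate_baselines(good_commit, repo="."):
    outdir = CACHE / good_commit
    if outdir.exists():
        return outdir  # already generated once; reuse the local cache
    envdir = outdir / "venv"
    venv.create(envdir, with_pip=True)
    pip = str(envdir / "bin" / "pip")
    python = str(envdir / "bin" / "python")
    # Install Matplotlib exactly as it was at the known-good commit...
    subprocess.run(
        [pip, "install", f"git+file://{Path(repo).resolve()}@{good_commit}"],
        check=True,
    )
    # ...and run the image tests there; the images that run produces would then
    # be collected into the cache and used as baselines on the current branch.
    subprocess.run([python, "-m", "pytest", "--pyargs", "matplotlib.tests"],
                   check=True, cwd=outdir)
    return outdir
```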

@SidharthBansal
Contributor

Yes, but the total download is not worse than currently, because currently you always download all baseline images anyways.

That is the reason I asked whether we need 2A or 2B. So, aiming at the entire baseline image set will also reduce the code complexity. :)

That seems fine?

Suppose we modified a subset of baseline images B = {B1, B2, ..., Bn}. We deleted things in release Rx and the current release is Ry, such that the two may not be consecutive (y - x >= 1). So, we will be adding the modified B to Ry so that these modified features can be continuously integrated and tested by other developers on the current Ry. The other plan would be to include it in Ry+1. I am in favor of Ry. Both scenarios will change the release version number from which baseline images are downloaded. What do you think?

@SidharthBansal
Contributor

What does "because the sdists/wheels will include baseline images per 1)" mean in the issue body?

@SidharthBansal
Contributor

When are we aiming at merging #11732 ? Any plans?

@SidharthBansal
Contributor
SidharthBansal commented Apr 10, 2020

Note that this means that if you're using a non-git-repo, non-sdist/wheel, you cannot run the full test suite anymore, because you at least need to know what was the last release, and if any image changed since then, you'd like to know when (however, if you want to write a patch and check that it's good, this still works because you can compare the baselines with what was present before your patch).

My opinion is that #11732 can simplify #16447 to a large extent. If we are not maintaining separate test and test-data wheels, then each time we need to detect which release the latest set of baseline images belongs to. If separation is maintained, then the comparison logic is simplified to the last test-wheel release instead of the entire git history. What do you think?

What do you mean by non-git repo above? Are you considering the future scenario of shifting matplotlib to gitlab etc. I think I am missing something here.

@anntzer
Contributor Author
anntzer commented Apr 12, 2020

Suppose we modified a subset of baseline images B = {B1, B2, ..., Bn}. We deleted things in release Rx and the current release is Ry, such that the two may not be consecutive (y - x >= 1). So, we will be adding the modified B to Ry so that these modified features can be continuously integrated and tested by other developers on the current Ry. The other plan would be to include it in Ry+1. I am in favor of Ry. Both scenarios will change the release version number from which baseline images are downloaded. What do you think?

I think we can consider that Matplotlib only ever has one "active" branch from the PoV of baseline images. (The last long-lived branch was 2.2.x for Py2 support, but that has been closed now.)

So for example right now 3.2.x is the last release and people are working on the next release which is 3.3; for simplification let's say that we will never make changes to baseline images on the 3.2.x branch as that would count as API changes; the only branch on which baseline images changes will occur is 3.3. (Again, we may want to relax this restriction later but I don't think it's unreasonable.)

So for example right now on PyPI there's the set of 3.2.0 baseline images available ("matplotlib-tests", as discussed below); the main remaining problem is for developers who are currently working on 3.3; they may want to run tests locally before 3.3 is released (once 3.3 is released we simultaneously release "matplotlib-tests 3.3" and again everyone has access to it and can download it). (Note that "developers" are not just the "experienced" group; there are also the "drive-by" contributors who may just want to put in a PR; it should remain as simple as possible for them to just clone the repo, write their feature, and run the test suite locally.) If between 3.2.0 and the current master (somewhere on the 3.3 branch before the 3.3.0 release) no images have changed, then it's simple, the test machinery can just download "matplotlib-tests-3.2.0" and use that. The tricky part is if some images have changed in between (now the test machinery needs to generate baseline images using a previous commit, e.g. the one just after the image has changed, where the image is known to be good).

(As a side point, I think another simplification for now may be to say that baseline images never "change", they just get deleted and new ones added -- it's not too onerous to add a counter to the filename of images we want to modify.)

What does "because the sdists/wheels will include baseline images per 1 mean in issue body"?

The idea was that if someone downloads matplotlib from PyPI (either as a .tar.gz or as a .whl), they should get something self-contained from which it should be possible to run the test suite without any further internet access or git repo or whatnot -- as they do now. Although this idea should be nuanced by the possibility that the test data actually be in a separate PyPI package ("matplotlib-tests"), such as the one proposed by #11732.

When are we aiming at merging #11732 ? Any plans?

IIRC the main problem there was that I wanted to ensure that the "matplotlib-notestdata" PyPI package (which should be just called "matplotlib" but which I'll call "matplotlib-notestdata" here just to avoid confusion) should never include the test data (otherwise, someone who installs both "matplotlib-notestdata" and "matplotlib-tests" will have the second package overwrite files in the first, likely causing all sorts of confusion -- pypa/pip#3868 and linked issues.)

Also I don't exactly remember how #11732 interacts with editable installs (please read about them if you don't know them; our developer guide explicitly suggests using editable installs for developing Matplotlib). Probably fine, but worth checking too.

My opinion is that #11732 can simplify #16447 to a large extent. If we are not maintaining separate test and test-data wheels, then each time we need to detect which release the latest set of baseline images belongs to. If separation is maintained, then the comparison logic is simplified to the last test-wheel release instead of the entire git history. What do you think?

Yes, I think getting #11732 to work will help a lot. It's just a matter of ironing out the edges.

What do you mean by non-git repo above? Are you considering the future scenario of shifting matplotlib to gitlab etc. I think I am missing something here.

Copying my reply on gitter, but feel free to ask for clarifications:
We only have a git repo; the non-git repo case is that people may just have a tarball (or rather, a sdist, in python terminology, which is essentially a zip with some metadata), and still want to run tests from that; this is quite common e.g. for linux packagers.

@jklymak
Member
jklymak commented Feb 7, 2021

Coming back to this, my vote would be a separate repository for the images and folks who don't want to pull down the whole tree can set the depth to something shallow. The only drawback in my mind is a little more test developer overhead, but "be sure to check in the images and open a related PR" seems easier to me than any of the instructions required for the solutions described above.

@dopplershift
Contributor

So an additional problem of using the "ancient" Freetype 2.6.1 is that it does not build on the mac-arm64 architecture, which makes it impossible to successfully run on that architecture. Before long, that's going to be the dominant architecture for the macOS platform. So are we committed to going in this direction? Or are we going to bump to a modern Freetype? Or...?

@QuLogic
Member
QuLogic commented Sep 7, 2022

But we build wheels for arm64 and universal2 with bundled FreeType 2.6.1 already?

@dopplershift
Contributor

@QuLogic Apparently I missed that. Any idea if there was some magic needed? My recollection from the last time I tried to get 2.6.1 to build on macos-arm64 (for conda-forge) was that it basically said "I have no clue what you're trying to build on".

@QuLogic
Member
QuLogic commented Sep 8, 2022

Apparently nothing special: #20970

@github-actions
github-actions bot commented Sep 8, 2023

This issue has been marked "inactive" because it has been 365 days since the last comment. If this issue is still present in recent Matplotlib releases, or the feature request is still wanted, please leave a comment and this label will be removed. If there are no updates in another 30 days, this issue will be automatically closed, but you are free to re-open or create a new issue if needed. We value issue reports, and this procedure is meant to help us resurface and prioritize issues that have not been addressed yet, not make them disappear. Thanks for your help!

@github-actions github-actions bot added the status: inactive label on Sep 8, 2023
@story645 story645 added the keep label and removed the status: inactive label on Sep 8, 2023
@dstansby
Member

Since this seems to be topical at the moment (#29816), I thought I would share the process used in https://github.com/astropy/astropy and https://github.com/sunpy/sunpy/ for generating/updating test images. I was partially responsible for building the pipeline (I think... it's been a while), and it works quite well.

The rough idea (sketched in code after the list) is to:

  • Store hashes of generated figures in the main repository
  • Store test images in a secondary repository
  • When running a figure test, generate a hash of the figure
  • Compare the hash against the expected hash in the main repository
  • If the hashes differ, then:
  1. Download the test image from the secondary repository
  2. Use this to generate a visual diff for local comparison
  3. Also generate a visual diff on CI for reviewers to inspect (see here for an example of a HTML page generated for a sunpy commit)
  4. If the difference is expected, commit the changed hash and merge to main
  5. On merges to main, if any of the hashes have changed, GitHub actions re-generates the test image and pushes the updated image to the secondary repository
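
A rough sketch of that flow (not the actual astropy/sunpy implementation; the hash-store path and image-repository URL below are placeholders):

```python
import hashlib
import json
import urllib.request
from pathlib import Path

EXPECTED = json.loads(Path("figure_hashes.json").read_text())  # assumed hash store in the main repo
IMAGE_REPO = "https://raw.githubusercontent.com/example/mpl-baseline-images/main"  # placeholder

def check_figure(test_name, png_bytes):
    if hashlib.sha256(png_bytes).hexdigest() == EXPECTED[test_name]:
        return True
    # Hash mismatch: fetch the stored baseline from the secondary repo so a
    # visual diff can be generated locally or on CI for reviewers to inspect.
    baseline = Path("result_images") / f"{test_name}-expected.png"
    baseline.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(f"{IMAGE_REPO}/{test_name}.png") as resp:
        baseline.write_bytes(resp.read())
    return False
```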

It would need a decent chunk of time to implement (like any solution to this problem I guess), but I think it works well from both a developer and contributor point of view, and it's been tried and tested for a couple of years in other projects.

@timhoffm
Member

Sounds interesting. I assume the hashes imply exact identity. Do you have the case that images change minimally within tolerances so that the reference image stays the same but hashes have to be regenerated?

Since #29816 requires a solution soonish, would a step-wise approach be feasible; i.e. swap out the reference images for hashes in the main library so that #29816 can update to new hashes and be merged. For a transition period we would then only have the hashes as reference and would have to resolve any deviations manually. The image diff infrastructure could be built up in parallel in the background. Worst case would be that we see that the approach does not work and we have to revert the #29816 variant with hashes and instead go with #29816 based on new images.

@dstansby
Member

Do you have the case that images change minimally within tolerances so that the reference image stays the same but hashes have to be regenerated?

As far as I remember sunpy doesn't use tolerances, so if the bytes change the image is just regenerated. I guess with the images in a separate repo there is less 'cost' to uploading a new image, even if the pixels haven't changed. And it's on a slightly smaller scale - sunpy has 82 images, versus the hundreds(?) here.

As well as being used to avoid updating a test image, are tolerances also used to 'hide' differences in the images generated by different environments (OS, Python version, dependency versions etc.)? In that case I'm not sure off the top of my head how a 'hash based' solution could work. Perhaps one could store multiple hashes for each figure test, and as long as the generated hash matches one of the expected hashes the test passes? I'm not sure in that case you would know which of the expected hashes to update when the figure changes though.

@jklymak
Member
jklymak commented Mar 28, 2025

Tolerances are definitely used to avoid issues between different architectures. So if you go the hash route you would need hashes for the oddball architectures as well. Which may be fine? You could still compare to a single baseline image if the hashes don't match.

@greglucas
Contributor

@bjlittle has proposed using perceptual image hashes in pytest-mpl: matplotlib/pytest-mpl#146. That seems like it would be nice here as well, and a tolerance could even be added to the hash distance.
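
For illustration, a toy sketch of a perceptual-hash comparison with a distance tolerance, using the third-party imagehash package (this is not how pytest-mpl works today, just the general idea):

```python
import imagehash
from PIL import Image

def images_match(result_png, baseline_hash_hex, max_distance=2):
    result_hash = imagehash.phash(Image.open(result_png))
    baseline_hash = imagehash.hex_to_hash(baseline_hash_hex)
    # Hamming distance between the perceptual hashes: small font-rendering
    # drift gives a small distance, a missing artist a much larger one.
    return (result_hash - baseline_hash) <= max_distance
```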

One thing that isn't clear to me is whether I need internet access to do these things. Is there a local caching mechanism where I can bring the images down locally and work/compare without needing to retrieve the remote file all the time to see the difference (for example, if I'm on a plane but want to be able to compare actual image differences while testing, not just hashes)?

@jklymak
Member
jklymak commented Mar 28, 2025

@greglucas Do you know what that looks like?

I think our somewhat draconian tolerances arose because a missing glyph could pass looser tolerances. Would we have the same problem here? If we continually update FreeType and thousands of images are changing, no one is likely to go through and check them all?

@greglucas
Contributor

@greglucas Do you know what that looks like?

No. I have not played around with these things at all which is why I looped Bill in here because I believe he has :) Maybe @rcomer knows if they are using it on Iris too?

Maybe there is a way for us to figure out what the tolerance was for an entirely missing glyph and set the threshold just below that. That would theoretically give us multiple minor version updates of freetype with minor kerning adjustments and only once enough of the kerning changes all add up does it go over that threshold.

I am probably not the one to explicitly ask about the tolerances as I am more OK with looser tolerances. One idea is that we could try to make a distinction between categories of tests, freetype-specific / font-specific where I do care about tolerances, and plots where I am less concerned about minor glyph kerning and care more about the placement of a scatter marker in the right spot.

@dstansby
Member
dstansby commented Mar 28, 2025

One thing that isn't clear to me is whether I need internet access to do these things?

If we store the test images in a repository, you could either point the location of the baseline images to be that remote repository, or you could clone it (before starting your internet free time) and point to the local checkout as the location of the baseline images.

That would theoretically give us multiple minor version updates of freetype

The idea of using hashes is that updating the images is (almost) zero cost in terms of effort and main repository size, so I would not worry too much about having to update all the images for different versions of freetype to start with. I think to keep things simple here we should try out storing the images in a separate repository and using hashes - if that works, we could then explore whether there's still a need to reduce the frequency that we're updating test images using a different comparison metric (presumably to keep the size of the test image repository low?)

@dstansby
Member
dstansby commented Mar 28, 2025

Ahh, I have just understood the motivation for using a perceptual difference instead of a hash - it would allow us to replicate the current behaviour of using tolerances in the tests, whereas using a cryptographic hash wouldn't allow us to do that.

@tacaswell
Member

I think our somewhat draconian tolerances arose because a missing glyph could pass looser tolerances.

Yes, we were using \cdot (one dot in the center) instead of \ddots (three dots diagonal) or something similar. Because the contrast is black/white for text, even minor shifts give a huge RMS, and the number of pixels "wrong" from dropping some dots is less than from shifting a lowercase l by 2 pixels. I do not think there exists a tolerance that will reliably catch that sort of error but will ignore the sort of version-to-version drift we see in freetype.


https://github.com/tacaswell/mpl-imageset-demo and #25734 are working, but they lack documentation and we still need to sort out how to make the required caching work on CI.

@tacaswell
Member

I also have general concerns about updating images too often and view it as a high cost (even if they are in a second repo) or high risk. We have a lot of test images, and if we are regularly regenerating most of them, it will either require a lot of manual work to validate that the "small" changes are in fact small from a communication point of view, or we are going to be lax and end up committing very wrong baseline images again.

An additional concern with hashes is that we put the version + date in the files as metadata so we can not hash the files. Additionally, for the vector backends we want the tests to be insensitive to things like the order of things in the file, local identifiers, and other non-meaningful changes to the files (e.g. dropping the number of digits we write out for floats). Currently the way we do this is to check in the svg/pdf/ps and then at test time use the same viewer to rasterize the files and compare the pixels of the rasterizations. If we were to track the hash of that rasterization, we would add a dependency on the version of the viewer we used (and any platform-dependent variation they have).


One idea is that we could try to make a distinction between categories of tests, freetype-specific / font-specific where I do care about tolerances, and plots where I am less concerned about minor glyph kerning and care more about the placement of a scatter marker in the right spot.

If we are going to go through and re-generate a lot of test images, it should be coupled with an audit of whether each image actually needs text in it, and switching them to the new defaults.

@timhoffm
Member

An additional concern with hashes is that we put the version + date in the files as metadata so we can not hash the files.

That's not a fundamental issue. Worst-case, we hash the RGB image values.
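
For example, a sketch of hashing the decoded pixel values instead of the file bytes (illustration only, not existing test-suite code):

```python
import hashlib
import numpy as np
from PIL import Image

def rgb_hash(png_path):
    # Decoding to RGB drops PNG metadata such as the embedded version and date.
    rgb = np.asarray(Image.open(png_path).convert("RGB"))
    # Fold the shape in as well so differently-shaped images never collide
    # just because their raw bytes happen to match.
    return hashlib.sha256(repr(rgb.shape).encode() + rgb.tobytes()).hexdigest()
```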

@rcomer
Member
rcomer commented Mar 28, 2025

Iris’s decision to use perceptual image hashes was before I got involved, so I’m not really up on the pros and cons. I guess @pelson would have been involved back then so might have an opinion whether they’d be appropriate here.

@timhoffm
Member
timhoffm commented Mar 28, 2025

For any kind of hash, we can use it as a pre-test for an image test if the hash is reasonably stable. If the hash is "known-good", we are done; if not, we have to fall back to the image test (possibly downloading the reference). If the image test is successful, we add the hash to the "known-good" hashes.

This works as long as there is a limited number of image variants. It wouldn't work if there's a small randomness in every generated image.
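
A sketch of that pre-test flow (the JSON cache file and the compare_to_baseline callable are placeholders, not real test helpers):

```python
import hashlib
import json
from pathlib import Path

KNOWN_GOOD = Path("known_good_hashes.json")  # assumed per-test store of accepted hashes

def figure_passes(test_name, png_bytes, compare_to_baseline):
    known = json.loads(KNOWN_GOOD.read_text()) if KNOWN_GOOD.exists() else {}
    digest = hashlib.sha256(png_bytes).hexdigest()
    if digest in known.get(test_name, []):
        return True  # fast path: this exact image has passed before
    if compare_to_baseline(png_bytes):  # slow path: the full tolerance-based image test
        known.setdefault(test_name, []).append(digest)
        KNOWN_GOOD.write_text(json.dumps(known, indent=2))
        return True
    return False
```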

@bjlittle
bjlittle commented Mar 28, 2025

@greglucas Thanks for the nudge and looping me in ...

I'm still very keen to close matplotlib/pytest-mpl#146. The original pull request matplotlib/pytest-mpl#150 unfortunately stalled, as I needed some pytest-mpl core dev support to help me cross the line. Also, my proposal was a major breaking change in behaviour, and some big decisions still needed to be agreed on and ironed out. Anyways, I kinda lost momentum, life got in the way (as it does), and I closed the PR to keep things honest. That's all totally on me btw.

In terms of https://github.com/SciTools/iris, it was my decision to implement and introduce perceptual image hashing, and thankfully it's really worked out for us, in that it's given us the stability and control that we required. We've been happily using it for many years now to perform our mpl image testing.

Anyways, I'd love to blow the dust off matplotlib/pytest-mpl#150 and resurrect it.

Perhaps what would give me more confidence and conviction to actually follow through and deliver would be if I could work with someone from the matplotlib dev community? Someone who could be a technical point of contact so that we could kick the tyres of perceptual image hashing to determine whether it addresses some/all/none of your issues. That would really give me the confidence to push the breaking change to pytest-mpl and more importantly it might help mitigate risk on your side, give you guys clarity on the impact and implications, and ultimately help decide whether this might be useful to you or not.

My proposal would be to work with @rcomer, who's also at the Met Office, and logistically that would make a lot of sense. I know this suggestion has come pretty much from left field, so have a think and let me know what you guys decide. In the meantime I'm going to tentatively see whether it's feasible to secure the time and space to do this work during my 9-to-5; I know of several groups at the Met Office that would like to see this happen, so that gives me some leverage to hard-sell the idea.

Anyways, just a thought ...

@dstansby
Member
dstansby commented Mar 31, 2025

I also have general concerns about updating images too often and view it as a high cost (even if they are in a second repo) or high risk.

Yes, I suppose moving to another repo just lowers the cost, but doesn't make it zero. Thinking about this a bit more, I guess updates typically fall into either 1) updating or adding a single image or a handful of images, or 2) wholesale updating a lot of images (e.g., freetype updates). The nice thing about hashes is that they would automatically download only the images you need from the secondary repo when tests fail, which for most contributors results in a big saving in MB downloaded when they're just updating a few images.

An additional concern with hashes is that we put the version + date in the files as metadata so we can not hash the files.

We can easily strip metadata - indeed, pytest-mpl already does this. Or as @timhoffm suggests, hash the RGB values instead of the whole file.

for the vector backends we want the tests to be insensitive to things like the order of things in the file, local identifiers, and other non-meaningful changes to the files (e.g. dropping the number of digits we write out for floats). Currently the way we do this is to check in the svg/pdf/ps and then at test time use the same viewer to rasterize the files and compare the pixels of the rasterizations. If we were to track the hash of that rasterization, we would add a dependency on the version of the viewer we used (and any platform-dependent variation they have).

Is there not a dependency on the version of the viewer used anyway to do the rasterization?


I tried to use pytest-mpl to convert one test to hash-based at #29823, and it didn't go great - the hashes on different platforms were different compared to the one I generated locally. Some of them matched, but some didn't - ironically, I'm on macOS, and the hash I generated only matched some Linux runs and no macOS runs 😆 I will pick away and try and work out why there are differences.


Anyways, I'd love to blow the dust off matplotlib/pytest-mpl#150 and resurrect it. Perhaps what would give me more confidence and conviction to actually follow through and deliver would be if I could work with someone from the matplotlib dev community?

@bjlittle I'd be super happy to 'pair' on this and resurrect it! I think regardless of the potential use case for our figure tests here, it's a neat little feature that I'd support getting into pytest-mpl anyway. Let me know what the best way forward is for you - perhaps re-opening a PR, and we can discuss/iterate from there?

@tacaswell
Member

Is there not a dependency on the version of the viewer used anyway to do the rasterization?

We run the rasterization of both the baseline and the test result on the machine running the test with (presumably) the same version of the viewer, so variation in how the viewers render them should wash out.
