Proposal for the baseline images problem #16447
Comments
At least as far as getting new baseline images built with other FreeType versions goes, you can grab them from https://github.com/QuLogic/mpl-images. I have not yet completely thought through the rest of the proposal, but I suspect that if caching is not sufficient, it might be possible to do the remaining work in a GitHub Action. |
This is the best proposal for handling this problem yet! In summary:
- As part of this we should publish …
- If we have the tools for CI to automatically grab the latest baseline images based on the current commit, we might as well have that be the recommended way for devs to get the test images as well (with generate-in-place as an option if you have local CPU/time but poor internet)?
- One thing we might lose with this is that we currently use exactly the same version of the images on all OSes, all Python versions, all machines, etc. I think we can maintain this, but it is a bit trickier to be 100% sure that we have it right (as opposed to "git says it is these bits").
- A thing that this would give us is a nice way to generate forward- and backward-compatibility reports (run the next tag with the baseline images from the previous tag, list what fails).
- This would also get us around the problem that some of our LaTeX baselines are different with newer versions of LaTeX.
- Are we re-inventing git LFS? |
If you add the machinery to grab the latest test images, perhaps (but then you also need to decide where to host them and may run into bandwidth issues -- similar to why we haven't adopted git-lfs on github); but the intent was really that CI uses CI cache to roll over one run's generated test images as baselines for the next run, and that devs do the same locally. Which is really the fundamental idea here: instead of comparing against a fixed baseline, compare a commit to the next (with some additional shortcuts based on reference points at releases, etc.). |
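For concreteness, here is a minimal local sketch of that "compare a commit to the next" idea rather than comparing against fixed baselines; the cache location, the function names, and keying on `git rev-parse HEAD` are illustrative assumptions, not existing Matplotlib machinery.

```python
# Hypothetical sketch: after a green run, cache the generated images keyed by
# commit; the next run uses the parent commit's cache as its baselines.
import shutil
import subprocess
from pathlib import Path

CACHE = Path.home() / ".cache" / "mpl_baseline_cache"

def current_commit() -> str:
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

def save_run_as_baselines(result_dir: Path) -> None:
    """After a passing test run, stash the generated images keyed by commit."""
    dest = CACHE / current_commit()
    if dest.exists():
        shutil.rmtree(dest)
    shutil.copytree(result_dir, dest)

def baselines_for(parent_commit: str):
    """Baselines for the next run: whatever the parent commit's run produced."""
    cached = CACHE / parent_commit
    return cached if cached.exists() else None
```

On CI, the same roll-over could presumably be done with the provider's cache mechanism instead of a home-directory cache.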
I'm not fully following this, but it sounds painful for contributors. If we are going to be painful to contributors, it seems to me that just having a separate images repository on GitHub would be simpler. Someone would have to tag it to releases, so maybe that makes the release work harder? But then folks could just do shallow fetches if they do not need the full history of the images. But perhaps I'm missing something, or maybe that's more complicated. |
This should be transparent for contributors (except that their first test run is going to be a bit slower if some baseline images changed since the last release, given that mpl will automatically rebuild itself at the correct commit to generate the corresponding baseline). |
I am interested in this issue. This is the kind of problem I would like to work on over my summer. As I know, testing is an important part of a development process. From what I understand, we can actually store the binaries of the baseline images and their format in the release. I need to work a lot more on this. @anntzer please guide me. |
See some suggestions at https://discourse.matplotlib.org/t/gsoc-due-march-31-18-00utc/20975/10?u=anntzer.lee as well. |
Yeah, I have looked at it. I have worked on Python testing in one of my internships, but I have to get comfortable with the codebase and then I can do a pull request. |
Okay, I am looking into them. Thanks. |
The individual changes listed above (generation of baseline images from a new release in wheels, modification of the baseline-image logic, CI changes) are quite vast. So, I think we should tackle them in different PRs after breaking the issue down. This will also ensure modularity. But independent pull requests merged into master can leave master in an inconsistent state. |
We typically prefer merging things in a piecemeal fashion but if that's too tricky (I realize it's going to be a lot of work, that's why it's a GSOC :-)) then I guess putting everything on a branch may be an option; thoughts @tacaswell @QuLogic or anyone else? |
Actually the problem is not to include the baseline images in the repo itself. Developers working on the project get the baseline images using previous versions, as the baseline images can be cached in memory. I would like to get more thoughts on that. Can we include baseline images with a pip package? @anntzer
|
Yes, certainly, that's part of the plan too. That's in fact necessary so that people who test just based on the pip package (e.g., Linux distro packagers) can still run the tests. |
Yes, I am looking at the codebase and looking for ways to include baseline images in the pip package. |
As much as possible this should be done via many small PRs. That both keeps them reviewable and makes sure that if at the end of GSoC we have not finished everything, there is still work in the code base that we can build on in the future. If the PRs are leaving the master branch in an inconsistent state then the work has not been split up correctly. |
A few questions:

Use Case 1: Suppose A is a user of matplotlib (not a contributor). I think A will never get involved in testing, so in A's entire lifespan (assuming they never run pytest) they will never have any baseline images on their computer, nor in their fork repository. Please correct me if I am wrong.

Use Case 2: A developer (maybe also a user of mpl) runs only one test (out of the entire test suite) for a particular feature. The feature involves only two baseline images, B1 and B2, out of the entire set of baseline images B. So, will this developer download B or (B1+B2)?

- Solution 2A: Download B. The first time, this will take a lot of time, since the entire set of baseline images is downloaded from the last release (which has all baseline images). Later on, testing time is reduced.
- Solution 2B: Download B1+B2. This is a faster download, but there is a little overhead each time to connect to the last release and grab baseline images. There would also need to be a check of which baseline images are already present on the developer's machine, so that only missing images are downloaded from the last release; that means logic comparing the last release against the developer's current master branch needs to be created. Another subcase: the developer deleted some previously downloaded baseline images, so while checking the last release they should be able to re-download the images that no longer exist locally.

Considering coupling, and the fact that one test may break a non-corresponding test (by indirect calls), it is better to fetch all baseline images at once. Also, most of the time developers run the full test suite, and an added wait each time for B1+B2 is irritating for the developer. I prefer 2A.

Use Case 3: Deletion of baseline images. I noticed that deletion of baseline images is not mentioned above. Suppose a plotting feature is deprecated; in later releases the baseline images related to it will need to be deleted. Also, deletion of baseline images (if implemented as a reusable submodule/method) can be used for baseline-image modification in two steps: delete the old image(s), then add the new image(s).

What do you guys think/suggest? |
Yes.
The plan is to download everything.
Yes, but the total download is not worse than currently, because currently you always download all baseline images anyways.
It's also more infrastructure to set up to be able to do piecemeal downloads.
That seems fine? One important case that needs to be mentioned is if a baseline image changes but there's no release yet with that change. In that case the developer cannot download that baseline image. Instead (this is one important point of the proposal), this change should be marked in some way in the git history, the testing machinery should detect that, set up a temporary environment with the last "known good" version of Matplotlib, run the test suite locally there to generate the corresponding baseline image, cache it locally (so that we don't have to redo this every time), and then use that as baseline image to run the test on master. |
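A rough sketch of that machinery, assuming the "known good" commit has already been identified from the git history, that a separate git work tree plus an editable install stands in for the temporary environment, and that generated images land in `result_images/` as they do when running the test suite today; none of the names below are existing interfaces.

```python
import shutil
import subprocess
from pathlib import Path

CACHE = Path.home() / ".cache" / "mpl_regen_baselines"

def regenerate_baseline(good_commit: str, test_id: str, image_name: str) -> Path:
    cached = CACHE / good_commit / image_name
    if cached.exists():
        return cached  # generated once already; reuse the cached copy
    (CACHE / good_commit).mkdir(parents=True, exist_ok=True)
    worktree = CACHE / good_commit / "src"
    if not worktree.exists():
        # Check out the known-good commit into a separate work tree.
        subprocess.check_call(["git", "worktree", "add", str(worktree), good_commit])
    # A real implementation would install into an isolated venv, not the
    # current environment; an editable install keeps the sketch short.
    subprocess.check_call(["python", "-m", "pip", "install", "-e", str(worktree)])
    subprocess.check_call(["python", "-m", "pytest", test_id], cwd=worktree)
    produced = worktree / "result_images" / image_name
    cached.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(produced, cached)
    return cached
```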
That is the reason I asked whether we need 2A or 2B. So, aiming at the entire baseline image set will also reduce the code complexity. :)
Suppose we modified a subset of baseline images B = {B1, B2, ..., Bn}. Say we deleted things in release Rx and the current release is Ry, where the two may not be consecutive (y - x >= 1). We would then add the modified B to Ry so that these modified features can be continuously integrated and tested by other developers on the current Ry. The other plan would be to include it in Ry+1. I am in favor of Ry. Both scenarios change the release version number from which baseline images are downloaded. What do you think? |
What does |
When are we aiming at merging #11732? Any plans? |
My opinion is that #11732 can simplify #16447 to a large extent. If we are not maintaining separate test and test-data wheels, then each time we need to detect which release the latest set of baseline images belongs to. If the separation is maintained, then the comparison logic is simplified to looking at the last test-wheel release instead of the entire git history. What do you think? Also, what do you mean by a non-git repo above? Are you considering a future scenario of shifting matplotlib to GitLab, etc.? I think I am missing something here. |
I think we can consider that Matplotlib only ever has one "active" branch from the PoV of baseline images. (The last long-lived branch was 2.2.x for Py2 support, but that has been closed now.) So for example right now 3.2.x is the last release and people are working on the next release, which is 3.3; for simplification let's say that we will never make changes to baseline images on the 3.2.x branch, as that would count as API changes; the only branch on which baseline image changes will occur is 3.3. (Again, we may want to relax this restriction later, but I don't think it's unreasonable.)

So on PyPI there's the set of 3.2.0 baseline images available ("matplotlib-tests", as discussed below); the main remaining problem is for developers who are currently working on 3.3; they may want to run tests locally before 3.3 is released (once 3.3 is released we simultaneously release "matplotlib-tests 3.3" and again everyone has access to it and can download it). (Note that developers are not just the "experienced" group; there are also the "drive-by" contributors who may just want to put in a PR; it should remain as simple as possible for them to just clone the repo, write their feature, and run the test suite locally.)

If between 3.2.0 and the current master (somewhere on the 3.3 branch before the 3.3.0 release) no images have changed, then it's simple: the test machinery can just download "matplotlib-tests-3.2.0" and use that. The tricky part is if some images have changed in between (now the test machinery needs to generate baseline images using a previous commit, e.g. the one just after the image has changed, where the image is known to be good).

(As a side point, I think another simplification for now may be to say that baseline images never "change", they just get deleted and new ones added -- it's not too onerous to add a counter to the filename of images we want to modify.)
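As an illustration of the simple case, here is a sketch that assumes the proposed "matplotlib-tests" package exists on PyPI and exposes its baseline images as package data under a hypothetical `mpl_test_baselines` module (neither exists today):

```python
# Sketch only: grab the test-data package matching the last release and point
# the test machinery at the baselines it ships.
import subprocess
import sys
from importlib import resources

def fetch_release_baselines(version: str = "3.2.0"):
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", f"matplotlib-tests=={version}"])
    # Locate the baseline images shipped inside the installed package.
    return resources.files("mpl_test_baselines") / "baseline_images"
```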
The idea was that if someone downloads matplotlib from PyPI (either as a .tar.gz or as a .whl), they should get something self-contained from which it should be possible to run the test suite without any further internet access or git repo or whatnot -- as they do now. Although this idea should be nuanced by the possibility that the test data actually be in a separate PyPI package ("matplotlib-tests"), such as the one proposed by #11732.
IIRC the main problem there was that I wanted to ensure that the "matplotlib-notestdata" PyPI package (which should be just called "matplotlib" but which I'll call "matplotlib-notestdata" here just to avoid confusion) should never include the test data (otherwise, someone who installs both "matplotlib-notestdata" and "matplotlib-tests" will have the second package overwrite files in the first, likely causing all sorts of confusion -- pypa/pip#3868 and linked issues.) Also I don't exactly remember how #11732 interacts with editable installs (please read about them if you don't know them; our developer guide explicitly suggests using editable installs for developing Matplotlib). Probably fine, but worth checking too.
Yes, I think getting #11732 to work will help a lot. It's just a matter of ironing out the edges.
Copying my reply on gitter, but feel free to ask for clarifications: |
Coming back to this, my vote would be a separate repository for the images and folks who don't want to pull down the whole tree can set the depth to something shallow. The only drawback in my mind is a little more test developer overhead, but "be sure to check in the images and open a related PR" seems easier to me than any of the instructions required for the solutions described above. |
So an additional problem of using the "ancient" Freetype 2.6.1 is that it does not build on the mac-arm64 architecture, which makes it impossible to successfully run on that architecture. Before long, that's going to be the dominant architecture for the macOS platform. So are we committed to going in this direction? Or are we going to bump to a modern Freetype? Or...? |
But we build wheels for arm64 and universal2 with bundled FreeType 2.6.1 already? |
@QuLogic Apparently I missed that. Any idea if there was some magic needed? My recollection the last time I tried to get 2.6.1 to build on macos-arm64 (for conda-forge) was that it basically said "I have no clue what you're trying to build on". |
Apparently nothing special: #20970 |
This issue has been marked "inactive" because it has been 365 days since the last comment. If this issue is still present in recent Matplotlib releases, or the feature request is still wanted, please leave a comment and this label will be removed. If there are no updates in another 30 days, this issue will be automatically closed, but you are free to re-open or create a new issue if needed. We value issue reports, and this procedure is meant to help us resurface and prioritize issues that have not been addressed yet, not make them disappear. Thanks for your help! |
Since this seems to be topical at the moment (#29816), I thought I would share the process used in https://github.com/astropy/astropy and https://github.com/sunpy/sunpy/ for generating/updating test images. I was partially responsible for building the pipeline (I think... it's been a while), and it works quite well. The rough idea is to:
It would need a decent chunk of time to implement (like any solution to this problem I guess), but I think it works well from both a developer and contributor point of view, and it's been tried and tested for a couple of years in other projects. |
Sounds interesting. I assume the hashes imply exact identity. Do you have the case that images change minimally within tolerances, so that the reference image stays the same but hashes have to be regenerated? Since #29816 requires a solution soonish, would a step-wise approach be feasible, i.e. swap out the reference images for hashes in the main library so that #29816 can update to new hashes and be merged? For a transition period we would then only have the hashes as reference and have to resolve any deviations manually. The image-diff infrastructure could be built up in parallel in the background. The worst case would be that we see that the approach does not work and we have to revert the #29816 variant with hashes and instead go with #29816 based on new images. |
As far as I remember …

As well as being used to avoid updating a test image, are tolerances also used to 'hide' differences in the images generated by different environments (OS, Python version, dependency versions, etc.)? In that case I'm not sure off the top of my head how a 'hash based' solution could work. Perhaps one could store multiple hashes for each figure test, and as long as the generated hash matches one of the expected hashes, the test passes? I'm not sure in that case you would know which of the expected hashes to update when the figure changes, though. |
Tolerances are definitely used to avoid issues between different architectures. So if you go the hash route you would need hashes for the oddball architectures as well. Which may be fine? You could still compare to a single baseline image if the hashes don't match. |
@bjlittle has proposed using perceptual image hashes in pytest-mpl: matplotlib/pytest-mpl#146. That seems like it would be nice here as well, and a tolerance could even be added to the hash distance. One thing that isn't clear to me is whether I need internet access to do these things. Is there a local caching mechanism where I can bring the images down locally and work/compare without needing to retrieve the remote file all the time to see the difference (for example, if I'm on a plane, but I want to be able to compare actual image differences while testing, not just hashes)? |
@greglucas Do you know what that looks like? I think our somewhat draconian tolerances arose because a missing glyph could pass looser tolerances. Would we have the same problem here? If we continually update FreeType and thousands of images are changing, no one is likely to go through and check them all? |
No. I have not played around with these things at all which is why I looped Bill in here because I believe he has :) Maybe @rcomer knows if they are using it on Iris too? Maybe there is a way for us to figure out what the tolerance was for an entirely missing glyph and set the threshold just below that. That would theoretically give us multiple minor version updates of freetype with minor kerning adjustments and only once enough of the kerning changes all add up does it go over that threshold. I am probably not the one to explicitly ask about the tolerances as I am more OK with looser tolerances. One idea is that we could try to make a distinction between categories of tests, freetype-specific / font-specific where I do care about tolerances, and plots where I am less concerned about minor glyph kerning and care more about the placement of a scatter marker in the right spot. |
If we store the test images in a repository, you could either point the location of the baseline images to be that remote repository, or you could clone it (before starting your internet free time) and point to the local checkout as the location of the baseline images.
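A small sketch of that workflow, assuming a dedicated baseline-image repository and a hypothetical environment variable for pointing the tests at it (both made up for illustration):

```python
# Clone the (hypothetical) baseline-image repository once, shallowly, then
# point the comparison at the local checkout so everything works offline.
import os
import subprocess
from pathlib import Path

LOCAL_CHECKOUT = Path.home() / "mpl-baseline-images"

def ensure_local_baselines(repo_url="https://github.com/<org>/mpl-baseline-images"):
    if not LOCAL_CHECKOUT.exists():
        subprocess.check_call(
            ["git", "clone", "--depth", "1", repo_url, str(LOCAL_CHECKOUT)])
    os.environ["MPL_BASELINE_DIR"] = str(LOCAL_CHECKOUT)  # hypothetical knob
    return LOCAL_CHECKOUT
```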
The idea of using hashes is that updating the images is (almost) zero cost in terms of effort and main repository size, so I would not worry too much about having to update all the images for different versions of FreeType to start with. I think to keep things simple here we should try out storing the images in a separate repository and using hashes - if that works, we could then explore whether there's still a need to reduce the frequency that we're updating test images using a different comparison metric (presumably to keep the size of the test image repository low?)
Ahh I have just understood the motivation for using a perceptual difference instead of a hash - it would allow us to replicate the current behaviour of using tolerances in the tests, where using a cryptographic hash wouldn't allow us to do that. |
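For illustration, this is roughly what a perceptual-hash comparison with a distance tolerance could look like, using the third-party `imagehash` package; the threshold of 4 bits is an arbitrary placeholder, not a recommendation:

```python
import imagehash
from PIL import Image

def phash_matches(result_png, expected_hash_hex, max_distance=4):
    """Pass if the perceptual hashes differ in at most `max_distance` bits."""
    result_hash = imagehash.phash(Image.open(result_png))
    expected = imagehash.hex_to_hash(expected_hash_hex)
    return (result_hash - expected) <= max_distance
```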
Yes, we were using https://github.com/tacaswell/mpl-imageset-demo and #25734 are working but lack documentation and sorting out how to make the required caching work on CI. |
I also have general concerns about updating images too often and view it as a high cost (even if they are in a second repo) or high risk. We have a lot of test images, and if we are regularly regenerating most of them, it will either require a lot of manual work to validate that the "small" changes are in fact small from a communication point of view, or we are going to be lax and end up committing very wrong baseline images again. An additional concern with hashes is that we put the version + date in the files as metadata, so we cannot hash the files. Additionally, for the vector backends we want the tests to be insensitive to things like the order of things in the file, local identifiers, and other non-meaningful changes to the files (e.g. dropping the number of digits we write out for floats). Currently the way we do this is to check in the svg/pdf/ps and then at test time use the same viewer to rasterize the files and compare the pixels of the rasterization. If we were to track the hash of that rasterization then we would add a dependency on the version of the viewer we used (and any platform-dependent variation they have).
If we are going to go through and re-generate a lot of test images, it should be coupled with an audit of whether each image actually needs text in it, and switching them to the new defaults. |
That's not a fundamental issue. Worst-case, we hash the RGB image values. |
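A minimal sketch of hashing the RGB(A) pixel values rather than the file bytes, so embedded dates and version strings never enter the hash; this assumes the Agg backend and is not existing test machinery:

```python
import hashlib

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

def figure_pixel_hash(fig) -> str:
    fig.canvas.draw()
    rgba = np.asarray(fig.canvas.buffer_rgba())  # pixels only, no file metadata
    return hashlib.sha256(rgba.tobytes()).hexdigest()

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])
print(figure_pixel_hash(fig))
```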
Iris’s decision to use perceptual image hashes was before I got involved, so I’m not really up on the pros and cons. I guess @pelson would have been involved back then so might have an opinion whether they’d be appropriate here. |
For any kind of hash, we can use them as a pre-test for an image test if the hash is reasonably stable. If the hash is "known-good", we are done; if not, we have to fall back to the image test (possibly downloading the reference). If the image test is successful, we add the hash to the "known-good" hashes. This works as long as there is a limited number of image variants. It wouldn't work if there's a small randomness in every generated image. |
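A direct sketch of that scheme: a cheap hash pre-test, a fall-back to the full image comparison, and recording newly confirmed hashes as "known good". The JSON hash store and the `compare_to_baseline` callable are placeholders, not real Matplotlib APIs.

```python
import hashlib
import json
from pathlib import Path

HASH_STORE = Path("known_good_hashes.json")

def check_image(name: str, result_path: Path, compare_to_baseline) -> bool:
    digest = hashlib.sha256(result_path.read_bytes()).hexdigest()
    known = json.loads(HASH_STORE.read_text()) if HASH_STORE.exists() else {}
    if digest in known.get(name, []):
        return True  # hash already vetted: done, no download needed
    if not compare_to_baseline(result_path):  # fall back to the image test
        return False
    known.setdefault(name, []).append(digest)  # remember this variant as good
    HASH_STORE.write_text(json.dumps(known, indent=2))
    return True
```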
@greglucas Thanks for the nudge and looping me in ... I'm still very keen to close matplotlib/pytest-mpl#146. The original pull-request matplotlib/pytest-mpl#150 unfortunately stalled as I needed some …

In terms of https://github.com/SciTools/iris, it was my decision to implement and introduce perceptual image hashing, and thankfully it's really worked out for us, in that it's given us the stability and control that we required. We've been happily using it for many years now to perform our mpl image testing.

Anyways, I'd love to blow the dust off matplotlib/pytest-mpl#150 and resurrect it. Perhaps what would give me more confidence and conviction to actually follow through and deliver would be if I could work with someone from the matplotlib dev community? Someone who could be a technical point of contact so that we could kick the tyres of perceptual image hashing to determine whether it addresses some/all/none of your issues. That would really give me the confidence to push the breaking change to …

My proposal would be to work with @rcomer, who's also at the Met Office, and logistically that would make a lot of sense.

I know this suggestion has come pretty much from left field, so have a think and let me know what you guys decide. In the meantime I'm going to tentatively see whether it's feasible to secure the time and space to do this work during my 9-to-5; I know of several groups at the Met Office that would like to see this happen, so that gives me some leverage to hard-sell the idea.

Anyways, just a thought ... |
Yes, I suppose moving to another repo just lowers the cost, but doesn't make it zero. Thinking about this a bit more, I guess updates typically fall into either 1) updating or adding a single image or a handful of images, or 2) wholesale updating a lot of images (e.g., FreeType updates). The nice thing about hashes is it would automatically download only the images you need from the secondary repo when tests fail, which for most contributors results in a big saving in MB downloaded when they're just updating a few images.
We can easily strip metadata - indeed,
Is there not a dependency on the version of the viewer used anyway to do the rasterization? I tried to use
@bjlittle I'd be super happy to 'pair' on this and resurrect it! I think regardless of the potential use case for our figure tests here, it's a neat little feature that I'd support getting into |
We run the rasterization of both the baseline and the test output on the machine running the test with (presumably) the same version of the viewer, so variation between how the viewers render them should wash out. |
This is a proposal to help with the difficulty in adding/modifying tests which require a baseline image. As a brief reminder, baseline images are problematic because
The idea is to actually not store baseline images in the repo at all.
[baseline-change]: path/to/image
The process described in 2) would also consult the git log to know if there are any images that need to be gen'd in such a way. Likewise, if a baseline image needs to be added for a new test, the commit message needs to be tagged accordingly. Possibly, we could also require that the filename also be changed in that case so that even if you're not in a git repo, you at least don't end up using the wrong baseline image, and instead just get a skip.

Note that this means that if you're using a non-git-repo, non-sdist/wheel, you cannot run the full test suite anymore, because you at least need to know what the last release was, and if any image changed since then, you'd like to know when (however, if you want to write a patch and check that it's good, this still works because you can compare the baselines with what was present before your patch). OTOH distro/conda packagers, which should be working from sdists, would still be able to run the test suite fine, because the sdists/wheels will include baseline images per 1).
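One possible reading of that convention in code: scan the log since the last release tag for `[baseline-change]: <path>` lines to learn which baselines must be regenerated rather than taken from the release. The tag syntax is quoted from the proposal above; the rest is an assumption.

```python
import re
import subprocess

def changed_baselines(since_tag: str) -> dict:
    """Map image path -> most recent commit that declared a change to it."""
    log = subprocess.check_output(
        ["git", "log", "--format=%H%n%B%x00", f"{since_tag}..HEAD"], text=True)
    changed = {}
    for entry in filter(None, log.split("\x00")):
        sha, _, body = entry.strip().partition("\n")
        for match in re.finditer(r"^\[baseline-change\]:\s*(\S+)", body, re.M):
            changed.setdefault(match.group(1), sha)
    return changed
```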
There's a bit of machinery involved in making this work, but I don't think it's overly hard to implement. Thoughts?