Reduce the size of some images in the documentation · Issue #17568 · scikit-learn/scikit-learn · GitHub

Reduce the size of some images in the documentation #17568

Closed
rth opened this issue Jun 11, 2020 · 11 comments · Fixed by #26968 or #26976
Labels
Documentation, Meta-issue (general issue associated with an identified list of tasks)

Comments

@rth
Member
rth commented Jun 11, 2020

Please also take a look at #17568 (comment) for updated context on this issue.

The documentation repository is becoming quite large (#17564 (comment)) and in particular there are 66MB of images. It might be worth checking whether the size of the largest ones could be reduced a bit by adjusting matplotlib options in examples. In the _images/ folder,

$ find . -type f  -exec du -h {} + | sort -r -h
940K    ./sphx_glr_plot_classifier_comparison_001.png
840K    ./sphx_glr_plot_cluster_comparison_0011.png
840K    ./sphx_glr_plot_cluster_comparison_001.png
784K    ./sphx_glr_plot_birch_vs_minibatchkmeans_0011.png
784K    ./sphx_glr_plot_birch_vs_minibatchkmeans_001.png
684K    ./sphx_glr_plot_all_scaling_008.png
644K    ./sphx_glr_plot_linkage_comparison_0011.png
644K    ./sphx_glr_plot_linkage_comparison_001.png
624K    ./sphx_glr_plot_discretization_classification_001.png
432K    ./sphx_glr_plot_anomaly_comparison_0011.png
432K    ./sphx_glr_plot_anomaly_comparison_001.png
416K    ./sphx_glr_plot_mlp_alpha_0011.png
416K    ./sphx_glr_plot_mlp_alpha_001.png
380K    ./sphx_glr_plot_t_sne_perplexity_001.png
380K    ./sphx_glr_plot_manifold_sphere_001.png
348K    ./sphx_glr_plot_compare_methods_0011.png
348K    ./sphx_glr_plot_compare_methods_001.png
348K    ./sphx_glr_plot_color_quantization_0011.png
348K    ./sphx_glr_plot_color_quantization_001.png
348K    ./iris.png
328K    ./sphx_glr_plot_kernel_approximation_0021.png
328K    ./sphx_glr_plot_kernel_approximation_002.png
324K    ./sphx_glr_plot_all_scaling_009.png
320K    ./sphx_glr_plot_stock_market_001.png
300K    ./sphx_glr_plot_spectral_biclustering_002.png
300K    ./sphx_glr_plot_all_scaling_006.png
268K    ./sphx_glr_plot_spectral_biclustering_0031.png
268K    ./sphx_glr_plot_spectral_biclustering_003.png
268K    ./sphx_glr_plot_spectral_biclustering_001.png
268K    ./sphx_glr_plot_rbf_parameters_001.png
260K    ./sphx_glr_plot_all_scaling_007.png
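
For anyone without a Unix shell at hand, the listing above can be approximated in Python. This is only a sketch: the function name `largest_files` is made up for illustration, and it reports the apparent file size (`st_size`) rather than the disk usage that `du` prints, so the numbers may differ slightly.

```python
from pathlib import Path


def largest_files(root, top=10):
    """Return (size_in_bytes, path) pairs for the `top` largest files under `root`."""
    files = [(p.stat().st_size, p) for p in Path(root).rglob("*") if p.is_file()]
    # Sort by size, largest first; ties are broken by path for a stable order.
    files.sort(key=lambda item: (-item[0], str(item[1])))
    return files[:top]


if __name__ == "__main__":
    # Run from inside the _images/ folder to mimic the shell command.
    for size, path in largest_files("."):
        print(f"{size / 1024:8.0f}K  {path}")
```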
@TomDLT
Member
TomDLT commented Jun 12, 2020

If the goal is to reduce the doc repo, we should not necessarily focus on large images, but rather on images that change at each build. Images can change at each build because of different random states or because of different run durations. For example, images in this example are about 30 KB, but they are updated with a change of 1-2 KB for each image at each commit, which can lead to much higher diff sizes than large images that are not updated very often.
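
One possible way to find the images that change at each build is to hash the image folders of two consecutive builds and compare them. The sketch below assumes two local copies of the `_images/` folder; the function names and the directory layout are hypothetical and not part of the scikit-learn tooling.

```python
import hashlib
from pathlib import Path


def file_digests(image_dir):
    """Map each PNG file name in `image_dir` to the SHA-256 of its bytes."""
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(image_dir).glob("*.png"))
    }


def changed_images(old_dir, new_dir):
    """Return the names of images present in both builds whose bytes differ."""
    old, new = file_digests(old_dir), file_digests(new_dir)
    return sorted(name for name in old.keys() & new.keys() if old[name] != new[name])
```

An image that keeps showing up in the output of `changed_images` across builds is a candidate for having its randomness removed, whatever its size.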

Also, if this doc repo is too large, is there a reason not to rewrite the git history, e.g. by squashing many bot commits?

@adrinjalali
Member

This seems like a good candidate for first time contributors. I'm adding this to the pyladies sprint in Berlin.

(The history on the website repo was removed in #21171 (comment))

To contributors:

For each image, you first need to find the example that generates it. For instance, the example file for sphx_glr_plot_classifier_comparison_001.png would be plot_classifier_comparison.py. Then find where the image is produced in the example and where its data comes from, figure out whether there are any random seeds you could set, and make sure you remove all sources of randomness. You can submit one PR for each example.
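
The mapping from image name to example file can be sketched as a small helper. The naming pattern `sphx_glr_<example>_<number>.png` is an assumption generalized from the single example given in this comment, and `example_script_for` is a hypothetical name.

```python
import re

# Assumed pattern: "sphx_glr_<example name>_<figure number>.png".
_IMAGE_RE = re.compile(r"^sphx_glr_(?P<example>.+)_\d+\.png$")


def example_script_for(image_name):
    """Return the guessed example script name, or None if the name does not match."""
    match = _IMAGE_RE.match(image_name)
    return f"{match.group('example')}.py" if match else None
```

Images that do not follow the pattern, such as iris.png, return None and have to be traced by hand.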

@adrinjalali added the Meta-issue label Feb 15, 2023
@thomasjpfan
Member

If the issue is the size of the docs repo, I suspect it's because we keep pushing dev docs into scikit-learn.github.io. If we host the dev docs in another repo and only push to scikit-learn.github.io when we release, then we can have scikit-learn.github.io be a manageable size. This is what matplotlib does with https://github.com/matplotlib/devdocs and https://github.com/matplotlib/matplotlib.github.com .

For PNGs, different versions of matplotlib will not generate the same binary even with the same data. In the simple case, the PNG generated by matplotlib will have its version in the PNG's metadata. If we switch the backend to produce SVGs, I think they can become more reproducible. For SVGs, even if there are changes, they will be in plain text, which is more manageable with git.
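
To check what metadata a given PNG carries, the text chunks can be read with the standard library alone. This is a rough sketch, assuming the metadata is stored in uncompressed tEXt chunks; it ignores zTXt and iTXt chunks and does not verify CRCs. The function name `png_text_metadata` is made up for illustration.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_text_metadata(data):
    """Return the keyword/text pairs stored in the tEXt chunks of a PNG byte string."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    metadata = {}
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type, payload, 4-byte CRC.
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        payload = data[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            # Payload layout: keyword, NUL separator, Latin-1 text.
            keyword, _, text = payload.partition(b"\x00")
            metadata[keyword.decode("latin-1")] = text.decode("latin-1")
        if chunk_type == b"IEND":
            break
        pos += 12 + length  # header (8) + payload + CRC (4)
    return metadata
```

Calling it on the bytes of two builds of the same figure and comparing the `Software` entry, if present, would show whether a version string is one of the differences.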

In any case, I still think it is good to remove the randomness in the examples.
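
As a toy illustration of what removing the randomness means, the sketch below uses only the standard library; `make_points` is a hypothetical helper. In actual scikit-learn examples the equivalent step would more typically be passing a fixed `random_state` to the estimators or dataset generators involved.

```python
import random


def make_points(n, seed=None):
    """Generate n pseudo-random 2D points; a fixed seed makes the result repeatable."""
    rng = random.Random(seed)  # local generator, the global state is left untouched
    return [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]


# Seeded: two "builds" produce identical data, so a plot of it would not change.
assert make_points(5, seed=0) == make_points(5, seed=0)
```

With `seed=None` the generator is seeded from the operating system, so each run would yield different points and a different image.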

@dmitryhits

I checked the folder scikit-learn/doc/_build/html/stable/_images
by running this command there: find . -type f -exec du -h {} + | sort -r -h

124K	./quansight-labs.png
124K	./plot_face_recognition_1.png
100K	./grid_search_workflow.png
 88K	./plot_face_recognition_2.png
 88K	./multilayerperceptron_network.png
 88K	./generated-doc-ci.png
 84K	./visual-studio-build-tools-selection.png
 84K	./infonea.jpg
 64K	./poisson_gamma_tweedie_distributions.png
 64K	./bnp_paribas_cardif.png
 56K	./beta_divergence.png
 48K	./mars.png
 48K	./grid_search_cross_validation.png
 44K	./sklearn-metrics-PredictionErrorDisplay-3.png
 44K	./sklearn-metrics-PredictionErrorDisplay-2.png
 44K	./sklearn-metrics-PredictionErrorDisplay-1.png
 44K	./aweber.png
 40K	./sydney-primary.jpeg
 36K	./telecom.png
...

I don't see any big files there. Maybe we can close this issue...

@betatim
Member
betatim commented Mar 28, 2023

@dmitryhits sorry for the suggestion at the sprint that maybe we can close this. I just re-read the comments and in particular this one. It seems the problem isn't so much large images. So I think we can keep this open and maybe update the top comment to reflect the focus on removing randomness from examples.

@TamaraAtanasoska
Contributor

Me and @AnnaWey will work on this issue, starting with sphx_glr_plot_classifier_comparison_001.png, generated from plot_classifier_comparison.py.

@AnnaWey
Contributor
AnnaWey commented Aug 1, 2023

Me and @TamaraAtanasoska will continue with the plot_cluster_comparison.py

@TamaraAtanasoska
Contributor

@glemaitre this is not yet ready to be closed, there are 29 images left :) I will take them on slowly in multiple subsequent PRs.

@glemaitre reopened this Aug 23, 2023
@glemaitre
Member

Indeed, it was closed automatically because of the mention in the PR. Thanks @TamaraAtanasoska for noticing. Reopening.

@TamaraAtanasoska
Contributor
TamaraAtanasoska commented Aug 24, 2023

Here is a task list so we can keep track of where we stand with the issue, especially if someone else wants to join in.
The tasks in progress will have (in progress) at the end of the image name. I will keep the original list, although the same number of images per file is no longer generated.

Pinging @adrinjalali as we already talked about this. One question is how we update it. Do I post an updated list when I want to take on new files, and you just copy and paste it?

@glemaitre
Member

Closing this issue since all changes have been made (wherever possible).
Thanks @TamaraAtanasoska for following up on this issue.

9 participants