8000 ENH: update only changed examples by jklymak · Pull Request #400 · sphinx-gallery/sphinx-gallery · GitHub
[go: up one dir, main page]

Skip to content

ENH: update only changed examples #400

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

jklymak
Copy link
Contributor
@jklymak jklymak commented Jul 25, 2018

Closes #394, #395 (or I think it does)...

This speeds things up a bit on the sphinx-gallery docs. Not a huge improvement, though...

Master:

First run: 10.69 s
second run 3.55 s

This PR:

first run: 10.65 s
second run: 3.11 s

@codecov-io
Copy link
codecov-io commented Jul 25, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@6a1810e). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master     #400   +/-   ##
=========================================
  Coverage          ?   95.32%           
=========================================
  Files             ?       27           
  Lines             ?     1989           
  Branches          ?        0           
=========================================
  Hits              ?     1896           
  Misses            ?       93           
  Partials          ?        0
Impacted Files Coverage Δ
sphinx_gallery/gen_rst.py 95.98% <100%> (ø)
sphinx_gallery/backreferences.py 96.77% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6a1810e...92723d3. Read the comment docs.

Copy link
Contributor
@larsoner larsoner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me.

Have you tried it with your actual use case (matplotlib)? What kind of build speedup do you see there when modifying one example?

@@ -17,6 +17,8 @@
from time import time
import ast
import codecs
import distutils
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't need this import. You can do just import distutils.file_util

@jklymak
Copy link
Contributor Author
jklymak commented Jul 25, 2018

Have you tried it with your actual use case (matplotlib)? What kind of build speedup do you see there when modifying one example?

Yes, I tried with matplotlib. I still don't fully understand the sphinx build, but the real slow down is the API pages that get generated. In matplotlib the build makes api/_as_gen/api/_as_gen/matplotlib.pyplot.get_fignums.rst and many thousand others, as well as api/_as_gen/matplotlib.pyplot.get_fignums.examples. With this change, these are both "old" when I do an incremental build, yet the html still gets rebuilt:

-rw-r--r-- 1 jklymak staff   0 Jul 24 15:37 api/_as_gen/matplotlib.pyplot.get_fignums.examples
-rw-r--r-- 1 jklymak staff 241 Jul 24 15:31 api/_as_gen/matplotlib.pyplot.get_fignums.rst

and

-rw-r--r-- 1 jklymak staff 11K Jul 25 06:41 build/html/api/_as_gen/matplotlib.pyplot.get_fignums.html

Maybe its just my ignorance, but I thought that the html would not be built if nothing changed in the repo.

@larsoner
Copy link
Contributor

So you don't get any meaningful gains?

What if you disable SG entirely so that the .examples does not need to be incorporated, do you get fast subsequent builds in that case? If so maybe it has to do with including one file in another always causing a rebuild (Sphinx bug or limitation).

@Titan-C
Copy link
Member
Titan-C commented Jul 25, 2018

Are you sure the API generation is the bottle neck. To me this is a fast process. I feel slow process are more on the side of embedding links in the code, or sphinx itself reading all available rst to be processed.

pickle.dump(example_code_obj, fid, pickle.HIGHEST_PROTOCOL)
if (not os.path.isfile(codeobj_fname) or
os.stat(example_file).st_mtime >
os.stat(codeobj_fname).st_mtime):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We abandoned checking against timestamps long ago. They lead to lots of problems. I Try using the change against the md5sum of the sourcefile.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... actually walk me through this. The source and destination files are not the same in the .example case. Are you sugesting caching the md5sum of the source files between runs somewhere?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Titan-C I'm not sure how this could be done unless we embed the MD5s from the previous run somewhere. I guess we could start doing this.

It seems like we could at least allow a default rebuild = True default config arg (current behavior), or rebuild='mtime' (proposed changes here), and later we could add rebuild='md5' which would somehow keep track of the MD5 sums. That would provide a way forward without adding too many new complications for people who are satisfied with mtime.

@jklymak
Copy link
Contributor Author
jklymak commented Jul 25, 2018

I’ll test properly, there may have indeed been gains. But the matpltolib build is still pretty slow despite nothing changing in the source tree. If the example files were all changing every build that made sense to me, but now I guess something else triggers the API rebuild.

@jklymak
Copy link
Contributor Author
jklymak commented Jul 25, 2018

Are you sure the API generation is the bottle neck.

Ha, no, not sure at all. With this change, it spends a lot of time on the writing output... step in Matplotlib.

As for how to best detect if a file is modified or not, more than happy to do what works well across different platforms.

Timing on matplotlib incremental build:

Before

Lots of time on reading sources... and writing output...

101 s

After

Almost no time on reading sources... still lots of time on writing output...
71.56s

@jklymak
Copy link
Contributor Author
jklymak commented Jul 26, 2018

OK, so my tests on this are that I run the sphinx-gallery doc build, not changing the sources between builds.

If I disable all the sphinx.extensions including sphinx-gallery, the incremental build behaves as expected and doesn't output new html files. If I enable sphinx-gallery it rebuilds all the html files, even with the changes in this PR. So I'm still not clear what causes this rebuild step just within sphinx-gallery.

@larsoner
Copy link
Contributor

If I disable all the sphinx.extensions including sphinx-gallery, the incremental build behaves as expected and doesn't output new html files. If I enable sphinx-gallery it rebuilds all the html files, even with the changes in this PR. So I'm still not clear what causes this rebuild step just within sphinx-gallery.

Could it be sphinx.ext.autosummary? It might always regenerate API pages no matter what.

I would try disabling all to get a fast build, then reenabling them one by one to see what causes the slowdown.

@Titan-C
Copy link
Member
Titan-C commented Nov 15, 2018 via email

@NicolasHug
Copy link
Contributor

Hi everyone, I'm super interested in this feature because it makes the scikit-learn doc build much slower than it could be.

I may have missed something important but it seems that the following changes do pretty much what I need: rebuilding the docs go from about 2min to a few seconds.

Tests are passing.

It is using mtime, but that's also what sphinx is doing.

diff --git a/sphinx_gallery/gen_rst.py b/sphinx_gallery/gen_rst.py
index 9e5829d..c5388c0 100644
--- a/sphinx_gallery/gen_rst.py
+++ b/sphinx_gallery/gen_rst.py
@@ -325,6 +325,10 @@ def generate_dir_rst(src_dir, target_dir, gallery_conf, seen_backrefs):
         length=len(sorted_listdir))
     clean_modules(gallery_conf, src_dir)  # fix gh-316
     for fname in iterator:
+        src_file = os.path.normpath(os.path.join(src_dir, fname))
+        target_file = os.path.join(target_dir, fname)
+        if os.path.exists(target_file) and os.path.getmtime(src_file) <= os.path.getmtime(target_file):
+            continue
         intro, time_elapsed = generate_file_rst(
             fname, target_dir, src_dir, gallery_conf)
         clean_modules(gallery_conf, fname)

@NicolasHug
Copy link
Contributor

BTW I'm happy to submit a PR if this is a valid solution (but of course @jklymak feel free to use it here)

@larsoner
Copy link
Contributor

I'm fine with using mtime even if it's not perfect. Especially if it's what sphinx itself does, we're even being consistent.

@larsoner
Copy link
Contributor

(see API proposal above about allowing different values of a new rebuild config var)

@jklymak
Copy link
Contributor Author
jklymak commented Feb 20, 2019

Feel free to close this PR and open another. Or to add the two together. My change didn’t make things that much faster for matplotlib but if the above makes things faster I think it would be nice if it were an option.

@larsoner larsoner added this to the v0.3.0 milestone Feb 21, 2019
@jklymak
Copy link
Contributor Author
jklymak commented Mar 3, 2019

Closing in lieu of #448!

@jklymak jklymak closed this Mar 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
4D65
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0