-
Notifications
You must be signed in to change notification settings - Fork 207
ENH: update only changed examples #400
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: update only changed examples #400
Conversation
Codecov Report
@@ Coverage Diff @@
## master #400 +/- ##
=========================================
Coverage ? 95.32%
=========================================
Files ? 27
Lines ? 1989
Branches ? 0
=========================================
Hits ? 1896
Misses ? 93
Partials ? 0
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable to me.
Have you tried it with your actual use case (matplotlib
)? What kind of build speedup do you see there when modifying one example?
@@ -17,6 +17,8 @@ | |||
from time import time | |||
import ast | |||
import codecs | |||
import distutils |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You shouldn't need this import. You can do just import distutils.file_util
Yes, I tried with matplotlib. I still don't fully understand the sphinx build, but the real slow down is the API pages that get generated. In matplotlib the build makes
and
Maybe its just my ignorance, but I thought that the html would not be built if nothing changed in the repo. |
So you don't get any meaningful gains? What if you disable SG entirely so that the .examples does not need to be incorporated, do you get fast subsequent builds in that case? If so maybe it has to do with including one file in another always causing a rebuild (Sphinx bug or limitation). |
Are you sure the API generation is the bottle neck. To me this is a fast process. I feel slow process are more on the side of embedding links in the code, or sphinx itself reading all available rst to be processed. |
pickle.dump(example_code_obj, fid, pickle.HIGHEST_PROTOCOL) | ||
if (not os.path.isfile(codeobj_fname) or | ||
os.stat(example_file).st_mtime > | ||
os.stat(codeobj_fname).st_mtime): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We abandoned checking against timestamps long ago. They lead to lots of problems. I Try using the change against the md5sum of the sourcefile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... actually walk me through this. The source and destination files are not the same in the .example
case. Are you sugesting caching the md5sum
of the source files between runs somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Titan-C I'm not sure how this could be done unless we embed the MD5s from the previous run somewhere. I guess we could start doing this.
It seems like we could at least allow a default rebuild = True
default config arg (current behavior), or rebuild='mtime'
(proposed changes here), and later we could add rebuild='md5'
which would somehow keep track of the MD5 sums. That would provide a way forward without adding too many new complications for people who are satisfied with mtime.
I’ll test properly, there may have indeed been gains. But the matpltolib build is still pretty slow despite nothing changing in the source tree. If the example files were all changing every build that made sense to me, but now I guess something else triggers the API rebuild. |
Ha, no, not sure at all. With this change, it spends a lot of time on the As for how to best detect if a file is modified or not, more than happy to do what works well across different platforms. Timing on matplotlib incremental build: BeforeLots of time on 101 s AfterAlmost no time on |
OK, so my tests on this are that I run the If I disable all the sphinx.extensions including sphinx-gallery, the incremental build behaves as expected and doesn't output new html files. If I enable |
Could it be I would try disabling all to get a fast build, then reenabling them one by one to see what causes the slowdown. |
@Titan-C I'm not sure how this could be done unless we embed the MD5s from the previous run somewhere. I guess we could start doing this.
Indeed, we can store that in the codeobj.pickle. Although, If my memory
is not failing that file is saved in the same place as the md5 hashfile,
so we could just read from disk without the need of changing to many things.
|
Hi everyone, I'm super interested in this feature because it makes the scikit-learn doc build much slower than it could be. I may have missed something important but it seems that the following changes do pretty much what I need: rebuilding the docs go from about 2min to a few seconds. Tests are passing. It is using mtime, but that's also what sphinx is doing. diff --git a/sphinx_gallery/gen_rst.py b/sphinx_gallery/gen_rst.py
index 9e5829d..c5388c0 100644
--- a/sphinx_gallery/gen_rst.py
+++ b/sphinx_gallery/gen_rst.py
@@ -325,6 +325,10 @@ def generate_dir_rst(src_dir, target_dir, gallery_conf, seen_backrefs):
length=len(sorted_listdir))
clean_modules(gallery_conf, src_dir) # fix gh-316
for fname in iterator:
+ src_file = os.path.normpath(os.path.join(src_dir, fname))
+ target_file = os.path.join(target_dir, fname)
+ if os.path.exists(target_file) and os.path.getmtime(src_file) <= os.path.getmtime(target_file):
+ continue
intro, time_elapsed = generate_file_rst(
fname, target_dir, src_dir, gallery_conf)
clean_modules(gallery_conf, fname) |
BTW I'm happy to submit a PR if this is a valid solution (but of course @jklymak feel free to use it here) |
I'm fine with using |
(see API proposal above about allowing different values of a new |
Feel free to close this PR and open another. Or to add the two together. My change didn’t make things that much faster for matplotlib but if the above makes things faster I think it would be nice if it were an option. |
Closing in lieu of #448! |
Closes #394, #395 (or I think it does)...
This speeds things up a bit on the sphinx-gallery docs. Not a huge improvement, though...
Master:
First run: 10.69 s
second run 3.55 s
This PR:
first run: 10.65 s
second run: 3.11 s