8000 [MRG] DOC: add binder links to examples by lesteve · Pull Request #11221 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG] DOC: add binder links to examples #11221

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Aug 19, 2019

Conversation

lesteve
Copy link
Member
@lesteve lesteve commented Jun 8, 2018

This is using the binder feature in the recent 0.2 sphinx-gallery release.

Summary of the changes:

  • conf.py needs to configure sphinx-gallery to create binder links
  • binder/requirements.txt: requirements for the docker image created on binder
  • build_tools/circle/push_doc.sh: binder needs the binder repo to be at the root of the repo and the binder folder is in dev/binder.

At the moment the latest scikit-learn release is installed so some examples may not work if they use some features from master.
Maybe a question for @choldgraf: if I add -f git+https://github.com/scikit-learn/scikit-learn.git to requirements.txt will it ensure that the docker image is rebuilt on each push and that I have always scikit-learn master in the docker image?

I have to say I haven't thought too much about how stable doc vs dev doc is going to work yet. For now the binder links are only visible on the dev doc anyway.

@lesteve lesteve force-pushed the add-binder-links branch from 1b7d223 to b089c93 Compare June 8, 2018 09:28
@ogrisel
Copy link
Member
ogrisel commented Jun 8, 2018

Maybe we should only do it for the stable doc to avoid having to keep the docker image in sync with scikit-learn master.

@lesteve
Copy link
Member Author
lesteve commented Jun 8, 2018

Something I forgot to mention, this PR needs to be merged before the binder links actually work.

I pushed a test-binder branch in scikit-learn/scikit-learn.github.io with the doc I generated locally. If you want to play with binder a bit you can try links like this:
https://mybinder.org/v2/gh/scikit-learn/scikit-learn.github.io/test-binder?urlpath=lab/tree/notebooks/auto_examples/plot_isotonic_regression.ipynb

@glemaitre
Copy link
Member
glemaitre commented Jun 8, 2018

Maybe we should only do it for the stable doc to avoid having to keep the docker image in sync with scikit-learn master.

Maybe an interesting thing is to actually play with new features which are in master.

@choldgraf
Copy link
Contributor

woot! just one thing I want to point out: mybinder.org runs on fairly limited hardware, so depending on how much number crunching or memory use happens in the examples, it may be a better/worse experience for people. I'm not sure how much sklearn uses CPU/memory-heavy examples, but just wanted to flag it.

@choldgraf
Copy link
Contributor
choldgraf commented Jun 8, 2018

and @lesteve on your question, yep that'll pull the master branch of sklearn, but if you want to keep up-to-date with updates you'll need to make a push to the docs to update the hash so that repo2docker will re-build things

@jnothman
Copy link
Member
jnothman commented Jun 9, 2018

re -f git+https://github.com/scikit-learn/scikit-learn.git

https://github.com/scikit-learn/scikit-learn/archive/master.zip is always faster for pip... A nightly wheel feed would be even better to work with...

@choldgraf, where does the change to the hash go?

Is there any problem with conf.py dynamically generating the requirements.txt to include the git hash of the current HEAD?

@choldgraf
Copy link
Contributor
choldgraf commented Jun 9, 2018

@jnothman nope that'd be fine, but shouldn't be necessary (if you just specify a direct path to the github repo, it'll build from latest master). In deciding to (re)build a repository, binder will just look for the latest git commit hash on the repository, and if it's different from whatever has been built last, it'll trigger a new build

@amueller
Copy link
Member
amueller commented Jun 9, 2018

In a sense slightly less open but I'm not sure what the options are to run on google datalab or azure notebooks. Both are free and can run custom notebooks and have probably more capacity than mybinder.

@qinhanmin2014
Copy link
Member

+1 for the great PR which enables users to easily try scikit-learn at anytime and anywhere as long as the server can afford such a popular project.

Maybe an interesting thing is to actually play with new features which are in master.

+1. It will be cool to announce a feature (e.g., on twitter) with a link to the example notebook, so that users can try it immediately and provide their feedback. If it's possible, I might prefer two images, one for stable version, and one for dev version.

@choldgraf
Copy link
Contributor

re: stable and dev, you could change the configuration in conf.py to accommodate this.

E.g.: here you could pin scikit-learn to a version for non-master documentation, and for the "latest" documentation you could have it build off of master.

@lesteve
Copy link
Member Author
lesteve commented Jun 13, 2018

re: stable and dev, you could change the configuration in conf.py to accommodate this.

We are using a single repo (https://github.com/scikit-learn/scikit-learn.github.io) for the dev doc and the stable doc. I am not sure how binder can handle two different requirements for a single repo.

@choldgraf
Copy link
Contributor

ahh I see - are you just putting the dev docs in a different folder? I thought they'd be built from a tag or something (e.g. what readthedocs does)

@lesteve
Copy link
Member Author
lesteve commented Jun 13, 2018

ahh I see - are you just putting the dev docs in a different folder?

Yes. Happy to hear how other projects are doing it if there is a better (and more binder-friendly) way. I guess the constraint we have is that we are using github-pages and I think you can only use one branch (master of gh-pages) for the website.

@choldgraf
Copy link
Contributor

gotcha - it sounds like the short-term solution is to either choose the latest version or "master" for the Binder links

@choldgraf
Copy link
Contributor

my recommendation there would be to figure out which part of the sphinx-gallery docs gets more traffic (probably whatever the latest released version is) and pin the binder links against that

@lesteve
Copy link
Member Author
lesteve commented Oct 9, 2018

So I gave this another go today, and it looks like the docker image takes a long time to be generated (I have not timed it precisely but my binder tab has been opened with "Repository scikit-learn/scikit-learn.github.io/test-binder is taking longer than usual to load, hang tight!" for ~1 hour now). I remember it was slow to generate the docker image when I worked on this originally, but not that slow I have to admit.

This is very likely related to the fact that our doc repo scikit-learn/scikit-learn.github.io is very big (cloning it with --depth 1 is taking ~4 minutes on my machine). For completeness, I tried to clone without --depth 1 and gave up after 1h30, I was at "Receiving objects: 40%".

Not sure whether we can do something about this in our doc repo. I saw that there was a recent PR in repo2docker jupyterhub/repo2docker#420 to use --depth 1 which may help in some cases.

Another option would to push the generated notebooks into a different repo. Other suggestions more than welcome!

@choldgraf
Copy link
Contributor

@lesteve where's the branch you're working from? I didn't see a binder/ folder or requirements.txt or something in here: https://github.com/lesteve/scikit-learn/tree/add-binder-links

A few quick thoughts:

  1. Matplotlib builds its documentation in a separate repository, partially for this reason (MPL core and MPL docs are both quite beefy). Seems to work relatively well, though it definitely adds to the complexity of deployments/updates etc.
  2. Don't forget that the Docker image will only be built one time if you pin your Binder links to point to a specific version/commit/etc. After that, it'll use the cached image. The other main thing that affects load times is whether the image exists on the node where a session is being launched. If not, then a Docker Pull happens, which takes some time. However this is also a one-time operation per-node, and given how popular the sklearn docs are, I don't imagine this would be much of an issue long-term.

@lesteve
Copy link
Member Author
lesteve commented Oct 9, 2018

@choldgraf thanks for the inputs!

I am currently testing with: https://mybinder.org/v2/gh/scikit-learn/scikit-learn.github.io/test-binder?urlpath=lab/tree/notebooks/auto_examples/plot_isotonic_regression.ipynb. In other words I pushed the doc I generated locally (with the binder requirements) in the test-binder branch) in our scikit-learn/scikit-learn.github.io repo: https://github.com/scikit-learn/scikit-learn.github.io/tree/test-binder.

I am not sure I fully got your point 1. but I'll try to look at how matplotlib does it.

About 2. I thought about the caching mechanism but my understanding is that each time there is a new commit in the repo the docker image needs to gets rebuild (which makes complete sense of course). Because we have commits reasonably frequently in scikit-learn (say once a day), and because we use the same repo for the stable doc and the dev doc, I am not sure the user experience is going to be very smooth (as things stand today. The --depth 1 PR that is in repo2docker master may make help as I said) ...

To be perfectly honest, I need to run simple tests with a toy repo to make sure that I understand a bit better the binder behaviour.

@choldgraf
Copy link
Contributor

@lesteve to clarify the caching point - you can pin a Binder link so that it is linked against a specific tag (or even commit hash). This will ensure that it isn't re-built every time a new commit is pushed to the branch. E.g. this link

https://mybinder.org/v2/gh/scikit-learn/scikit-learn.github.io/35ea7b?urlpath=lab/tree/notebooks/auto_examples/plot_isotonic_regression.ipynb

will build a Binder for that specific commit (the latest one on your test-binder branch), so it wouldn't be re-built if new commits are pushed to that branch.

So, you could pin the Binder links to the branch, and only update them every now and then when you wanted Binder to reflect the latest feature updates

@jnothman
Copy link
Member
jnothman commented Oct 10, 2018 via email

@betatim
Copy link
Member
betatim commented Oct 10, 2018

What are the disadvantages of pushing the generated notebooks into a different repo?

I can't think of one beyond the added maintenance burden. BinderHub doesn't care about the history of the repo, it just wants to use one particular ref (the one explicitly specified or that the branch/tag resolves to)

The --depth 1 PR that is in repo2docker master may make help as I said) ...

Unfortunately not. repo2docker will only use --depth 1 if no explicit ref was specified. BinderHub (the software behind mybinder.org) always resolves branch names to SHAs, so we always specify the ref. We have had an attempt to use a shallow clone (say depth 50 like travis), check if the ref has been cloned and if not unshallow the clone. However we reverted that as it had bugs (jupyterhub/repo2docker#120, jupyterhub/repo2docker#130). If someone wants to make a new attempt that would be great. Maybe now a more modern version of git is part of Ubuntu?

I would go with having a separate repo with generated notebooks in it or take the approach of spacy.io who added a small bit of JS to their docs to make code examples runnable (scroll down a bit). You could use this for the example gallery instead of converting them to notebooks. It degrades well when the user has no JS or mybinder.org is down.

@betatim
Copy link
Member
betatim commented Nov 16, 2018

Below the steps that I think need to happen to get this to work:

  • add support for repo2docker https://github.com/org/repo/archive/master.zip. This would solve the problem that currently cloning the docs repo takes too long. I think how to do it is clear, but it needs someone to actually do it. I'd be happy to help someone implement this.
  • add support to BinderHub (software serving mybinder.org) so that it supports the new "from zip file" feature of repo2docker. This might require a bit of discussion first to figure out what the best way of doing this is. In particular how we can support the current behaviour of only building each SHA-1 once per repo. This is something I'd like to see in BinderHub but low priority for me for the time being. I will help if someone else champions this.

A completely different way of doing would be something like what I suggested in sphinx-gallery/sphinx-gallery#419 (comment) If I had a large example gallery I think I'd take this approach. The hard part here is figuring out how exactly to make the small "side car" repository that gets launched on mybinder.org. You'd want to update this side car repo every time you want to have your examples use a new version of scikit-learn. I'd create a repository that contains a requirements.txt like the following on the master branc 8000 h:

numpy
matplotlib
https://github.com/scikit-learn/scikit-learn/archive/<sha-1-of-master>.zip

and have a process that updates the SHA-1 every time a PR is merged. On branches for each release you can have:

numpy
matplotlib
https://github.com/scikit-learn/scikit-learn/archive/<name-of-the-release-tag>.zip

This way the docs for a release can use the corresponding branch of the side car repo and the dev build of the docs can use master.

@lesteve
Copy link
Member Author
lesteve commented Jul 29, 2019

Just a FYI, after some discussion with @ogrisel and @glemaitre, here is what a simple "side car" repo would yout like: https://github.com/lesteve/test-binder-scikit-learn.

  • git submodule for the scikit-learn repo (binder automatically do the git submodule update --init)
  • notebooks get created from the python examples in the postBuild binder step (sphinx-gallery has a script for the python to notebook conversion).

For now this is for the stable documentation (environment.yml and scikit-learn git submodule). I need to think a little bit more how dev vs stable would work and how the "sidecar repo" would be updated in the doc build process on CircleCI. Any comments more than welcome!

@lesteve lesteve force-pushed the add-binder-links branch from b089c93 to afa4de0 Compare July 29, 2019 13:47
@adrinjalali
Copy link
Member

I was just thinking about this and was reminded of this PR. Even for the stable website this would be awesome to have.

@lesteve lesteve force-pushed the add-binder-links branch 2 times, most recently from fed0546 to 9f5ab10 Compare July 30, 2019 17:45
@lesteve
Copy link
Member Author
lesteve commented Aug 7, 2019

@thomasjpfan there is a 0.21.X branch in scikit-learn/binder-examples that is tracking the 0.21.X branch of scikit-learn/scikit-learn so, unless I am missing something, what you are proposing is already implemented.

@thomasjpfan
Copy link
Member
thomasjpfan commented Aug 7, 2019

@lesteve

I think binder does not know that `0.21.X` is unchanged everytime the `binder-examples` is updated nightly. It will always try to rebuild the initial docker image because of the daily `update submodule to commit` from travis ci.

On the binder website under "how it works 2." it states:

If a new commit has been made, the image will automatically be rebuilt.

EDIT: Okay I see that 0.21.X is not being updated.

@lesteve
Copy link
Member Author
lesteve commented Aug 7, 2019

Sorry if this was not clear enough, there is no daily update of scikit-learn/binder-examples. There is a daily cron job that tries to update the scikit-learn submodule, if no update has happened on the scikit-learn/scikit-learn tracked branch, there is no commit on scikit-learn/binder-examples branch.

Look for example at the commit history of scikit-learn/binder-examples 0.21.X, it has not been updated for a few days (the submodule has not been updated since the initial commit actually):
https://github.com/scikit-learn/binder-examples/commits/0.21.X

Similar thing with scikit-learn/binder-examples master, it has not been updated on August 4 :
https://github.com/scikit-learn/binder-examples/commits/master

@thomasjpfan
Copy link
Member

@lesteve The 0.21.X makes sense. 0.21.x does not update unless we are doing a release. I hope this means, binder does not rebuild the image that often.

For scikit-learn/binder-examples master, https://github.com/scikit-learn/binder-examples/commits/master it looks like its being updated everyday. Binder would most likely need to rebuild the master version quite frequently, which is okay.

@lesteve
Copy link
Member Author
lesteve commented Aug 7, 2019

@lesteve The 0.21.X makes sense. 0.21.x does not update unless we are doing a release. I hope this means, binder does not rebuild the image that often.

Yes it does. From Tim's message above #11221 (comment):

So for the released version of scikit-learn you'd not expect that 99.9% of users wouldn't need to wait for a build. They'd get a prebuilt image. For master builds would happen more frequently because the commit that master points to changes.

@betatim
Copy link
Member
betatim commented Aug 7, 2019

When presented with a tag or branch name mybinder.org will resolve it to a SHA1, then check its cache if there is already an image for that SHA1 (and repository). If there is an image, this is what gets launched. Otherwise the image gets built and stored in the cache.

Some situations where there won't be an image in the cache:

  • the commit is new, it has never been built
    • because we operate two clusters with two separate caches we need to build the image twice, once for each cache. This means there will be two users who need to wait for a build. (We are working on improving this)
  • we empty our cache on purpose (big changes or security vulnerability etc)

@thomasjpfan
Copy link
Member

So for the released version of scikit-learn you'd not expect that 99.9% of users wouldn't need to wait for a build.

This statement seems to contradict itself. By removing the double negative, this states: "So for the released version of scikit-learn you'd expect that 99.9% of the users would need to wait for a build". This statement conflicts with "They'd get a prebuilt image."

@betatim Sorry if this question has already been asked. I want to verify that, for a given branch, if the commit is usually not new (like in our 0.20.X branch), then the image would most likely have been built and users would not need to wait.

@betatim
Copy link
Member
betatim commented Aug 8, 2019

You are right, that sentence with double negative is terrible and incorrect!

I confirm that if the commit doesn't/rarely changes the most likely thing to happen is that users don't have to wait for a build.

Copy link
Member
@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

doc/conf.py Outdated
if 'dev' in version:
binder_branch = 'master'
else:
match = re.match(r'^(\d+)\.(\d+)(?:\.\d)?$', version)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passive groups... awesome. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be sure:

Suggested change
match = re.match(r'^(\d+)\.(\d+)(?:\.\d)?$', version)
match = re.match(r'^(\d+)\.(\d+)(?:\.\d+)?$', version)

@thomasjpfan
Copy link
Member

@betatim Thank you for your patience and your thoughtful explanations! :)

Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to merge this and see how it goes. I don't think it implies much risk.

doc/conf.py Outdated
if 'dev' in version:
binder_branch = 'master'
else:
match = re.match(r'^(\d+)\.(\d+)(?:\.\d)?$', version)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be sure:

Suggested change
match = re.match(r'^(\d+)\.(\d+)(?:\.\d)?$', version)
match = re.match(r'^(\d+)\.(\d+)(?:\.\d+)?$', version)

@lesteve
Copy link
Member Author
lesteve commented Aug 19, 2019

I'm happy to merge this and see how it goes. I don't think it implies much risk.

Glad to hear this!

I pused a commit with your recommended change (and a minor change in the error message)

@jnothman
Copy link
Member

Let's give it a whirl

@jnothman jnothman merged commit f0faaee into scikit-learn:master Aug 19, 2019
@lesteve lesteve deleted the add-binder-links branch August 19, 2019 06:10
@lesteve
Copy link
Member Author
lesteve commented Aug 19, 2019

Great to see this merged. The dev doc has been uploaded, everything seems to work: there is a binder link and the binder works.

Here is an example.

Side-comment for @betatim (not crucial at all): the nbviewer preview does a 404:

image

@betatim
Copy link
Member
betatim commented Aug 19, 2019

Nods, we have "don't show the preview if it would show an error" on our to do list. As well as trying to get better at guessing the correct URL to construct so that the preview works :)

If there is any problem I'll open an issue ok? Or if you have a problem can you either open an issue on https://github.com/jupyterhub/mybinder.org-deploy/ or @jupyterhub/mybinder-org-operators so we get a notification?

@lesteve
Copy link
Member Author
lesteve commented Aug 19, 2019

Great if you have this on your todo list already. Not showing the preview would be completely fine for us I think.

By the way, thinking more about this it completely makes sense that the preview does not work:

  • we are pointing to a specific .ipynb by using a urlpath in the binder URL
  • the .ipynb does not exist in the binder-examples repo: the .ipynb file is generated in postBuild from the .py file

The reason I never noticed this very minor issue before: when I was testing this PR, I was always using a binder URL without urlpath ...

@ogrisel
Copy link
Member
ogrisel commented Aug 19, 2019

Indeed it would be great to have a configuration file in the binder-examples repo to tell binder to not display the nbviewer based preview on that repo (or to automatically disable the preview if the ipynb file is missing).

Edit: Github was not updated with the past two replies when I first wrote this comment.

@lesteve
Copy link
Member Author
lesteve commented Aug 20, 2019

Thanks to @betatim, there is no preview with 404 any more (mentioned in #11221 (comment))!

For example, see this binder link.

For more details about the binderhub fix: jupyterhub/binderhub#934.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

0