[invalid] Dead link in latent dirichlet aloc · Issue #10275 · scikit-learn/scikit-learn

Closed

hnykda opened this issue Dec 9, 2017 · 14 comments

@hnykda (Contributor) commented Dec 9, 2017

Description

The link in the documentation is dead here: https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/decomposition/online_lda.py#L260

I have a proposal to prevent these:

  1. copy all websites, articles, code, and linked content, if legally possible, to scikit-learn.org servers and keep them there, because, unfortunately, people in academia are usually terrible at keeping things alive
  2. add a simple script to the test suite that crawls all documentation references, gathers the links, tries to GET each of them, and checks that the response code is 200. This would point out dead links (a rough sketch follows below).
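A minimal sketch of the checker described in point 2 (hypothetical, not part of scikit-learn's test suite; it assumes the `requests` library is available and that links are harvested from the `.rst`/`.py` sources with a simple regex):

```python
# Hypothetical sketch: harvest every http(s) URL from the .rst and .py sources
# and report anything that does not answer with HTTP 200.
import re
from pathlib import Path

import requests  # assumed available; not a scikit-learn test dependency

URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")

def find_urls(root="."):
    """Yield (file, url) pairs for every link found in doc and code sources."""
    for path in Path(root).rglob("*"):
        if path.suffix in {".rst", ".py"}:
            for url in URL_RE.findall(path.read_text(errors="ignore")):
                yield path, url

def dead_links(root="."):
    """Return (file, url, status) triples for links that do not return 200."""
    dead = []
    for path, url in find_urls(root):
        try:
            status = requests.get(url, timeout=10).status_code
        except requests.RequestException:
            status = None
        if status != 200:
            dead.append((str(path), url, status))
    return dead

if __name__ == "__main__":
    for path, url, status in dead_links():
        print(f"{path}: {url} -> {status}")
```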
@jnothman (Member) commented Dec 9, 2017

Sphinx provides a linkcheck builder which does the latter. We should make a point of running it once in a while, though (@lesteve, another Travis cron job?).

Please provide a PR fixing this specific instance.
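For context, the Sphinx linkcheck builder mentioned above is what `make linkcheck` runs (`sphinx-build -b linkcheck`). A hedged sketch of a small wrapper a cron job could invoke, assuming the doc sources live in `doc/` as in scikit-learn:

```python
# Hypothetical wrapper around Sphinx's linkcheck builder, equivalent to
# `make linkcheck`. It prints the report that linkcheck writes to output.txt
# and propagates sphinx-build's exit code (non-zero when links are broken,
# depending on the Sphinx version).
import subprocess
import sys
from pathlib import Path

def run_linkcheck(source_dir="doc", build_dir="_build/linkcheck"):
    result = subprocess.run(
        ["sphinx-build", "-b", "linkcheck", source_dir, build_dir]
    )
    report = Path(build_dir) / "output.txt"
    if report.exists():
        print(report.read_text())
    return result.returncode

if __name__ == "__main__":
    sys.exit(run_linkcheck())
```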

@hnykda (Contributor, Author) commented Dec 9, 2017

My bad, I was looking at an old revision... abb43c1

hnykda closed this as completed on Dec 9, 2017
hnykda changed the title from "Dead link in latent dirichlet aloc" to "[invalid] Dead link in latent dirichlet aloc" on Dec 9, 2017
@rth (Member) commented Dec 9, 2017

> copy all websites, articles, code, and linked content, if legally possible, to scikit-learn.org servers and keep them there, because, unfortunately, people in academia are usually terrible at keeping things alive

scikit-learn.org is hosted on GitHub Pages, and that's not really suitable for archiving data.
See #7425 for a more reliable solution.

> add a simple script to the test suite that crawls all documentation references, gathers the links, tries to GET each of them, and checks that the response code is 200. This would point out dead links.

That could be useful; the question is how to run it using the infrastructure available to OSS projects. Done as a side project, this could for instance be a cron job on Travis CI for some other repo (the scikit-learn one already uses Travis), which then uploads the list of broken URLs to some server (again, different from scikit-learn.org). Not sure.

@rth (Member) commented Dec 9, 2017

> Sphinx provides a linkcheck builder which does the latter

Good to know. From what I saw, running multiple cron jobs with different tasks doesn't look easy, as TRAVIS_EVENT_TYPE="cron" is the only way to detect the run type.
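To illustrate, a build's script step could branch on that variable; a rough sketch (assuming a standard `doc/Makefile` with a linkcheck target, as in the scikit-learn docs):

```python
# Hypothetical gating script: run the slow linkcheck only for Travis cron
# builds, using the TRAVIS_EVENT_TYPE variable (push, pull_request, api, cron).
import os
import subprocess
import sys

def main():
    if os.environ.get("TRAVIS_EVENT_TYPE") != "cron":
        print("Not a cron build; skipping linkcheck.")
        return 0
    # assumes the docs Makefile lives in doc/ and exposes a linkcheck target
    return subprocess.run(["make", "-C", "doc", "linkcheck"]).returncode

if __name__ == "__main__":
    sys.exit(main())
```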

@lesteve (Member) commented Dec 11, 2017

The easiest is probably to run make linkcheck on CircleCI (where we already have all the needed dependencies), maybe only on master, because it's generally not the PR's fault if a link is broken.

@lesteve (Member) commented Dec 11, 2017

> Good to know. From what I saw, running multiple cron jobs with different tasks doesn't look easy, as TRAVIS_EVENT_TYPE="cron" is the only way to detect the run type.

For the record, you can have multiple builds with type=cron, and it behaves exactly like the push-based Travis builds, i.e. you get a build matrix with multiple builds that are all run by the cron job. Here is an example where I added a Python 3.4 build to the cron job (on my fork): https://travis-ci.org/lesteve/scikit-learn/builds/306332273.

@rth (Member) commented Dec 12, 2017

> For the record, you can have multiple builds with type=cron, and it behaves exactly like the push-based Travis builds, i.e. you get a build matrix with multiple builds that are all run by the cron job.

I meant that, AFAIK, it doesn't work the way regular cron jobs would: one can't run two different tasks with different periodicities. But I guess that's not a big issue.

@jnothman (Member) commented Dec 12, 2017 via email

@lesteve (Member) commented Dec 13, 2017

> it doesn't belong in a usual circle run. Linkcheck is a very slow process.

make linkcheck takes about 14 minutes locally with the fix from #10300. I reckon we could run it only on the master branch (and maybe maintenance branches) in CircleCI.
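A hedged sketch of how the existing CircleCI doc build could restrict linkcheck to master and maintenance branches, using the CIRCLE_BRANCH variable (the 0.19.X-style branch naming is an assumption):

```python
# Hypothetical branch gate for CircleCI: run linkcheck only on master and on
# maintenance branches such as 0.19.X. CIRCLE_BRANCH is set by CircleCI on
# every build.
import os
import re
import subprocess
import sys

branch = os.environ.get("CIRCLE_BRANCH", "")
if branch == "master" or re.fullmatch(r"\d+\.\d+\.X", branch):
    sys.exit(subprocess.run(["make", "-C", "doc", "linkcheck"]).returncode)
print(f"Skipping linkcheck on branch {branch!r}.")
```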

@jnothman (Member) commented Dec 13, 2017 via email

@rth (Member) commented Dec 14, 2017

I'm a bit ambivalent about sphinx's linkcheck. It is nice to have something built into Sphinx, but on the other hand the current CI setup is complicated enough, and this is non-critical (i.e. some links will inevitably go dead after a while, so there is no urgency to fix them), so I'm not convinced that putting it in the regular CI pipeline on Circle CI is that good an idea (it slows things down, adds possible failures, etc.).

Checking for broken links in the documentation is a recurrent problem in OSS projects, and it can be done with a few lines of scrapy (see e.g. here) independently of the way the docs were built. So instead of putting the effort into hacking the current CI setup (with all the constraints that it has, think about notifications etc.), I think it could make sense to do this in a separate GitHub repo, check links with scrapy (or some similar solution), and run it in a Travis cron job (a rough sketch follows below). This way it could be applied to the different versions of the scikit-learn docs without building them, to scikit-learn-contrib projects, or to any other projects for that matter. I can allocate some time for such a project.
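As an illustration of the scrapy idea, a rough sketch only (the start URL, the docs host check, and the output handling are assumptions):

```python
# Hypothetical scrapy spider that crawls deployed docs and records every
# response that does not come back with HTTP 200. It could be run with e.g.
#   scrapy runspider deadlink_spider.py -o broken_links.json
import scrapy
from scrapy.http import HtmlResponse

class DeadLinkSpider(scrapy.Spider):
    name = "deadlinks"
    # assumed entry point; any deployed docs URL would do
    start_urls = ["https://scikit-learn.org/stable/documentation.html"]
    # let non-200 responses reach the callback instead of being dropped
    custom_settings = {"HTTPERROR_ALLOW_ALL": True}

    def parse(self, response):
        if response.status != 200:
            yield {"url": response.url, "status": response.status}
        # only extract further links from HTML pages on the docs host
        if isinstance(response, HtmlResponse) and "scikit-learn.org" in response.url:
            for href in response.css("a::attr(href)").getall():
                if href.startswith(("mailto:", "javascript:")):
                    continue
                yield response.follow(href, callback=self.parse)
```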

@jnothman (Member) commented Dec 14, 2017 via email

@amueller (Member) commented:
Are there bots / services that could do that? I would expect this to be mostly solvable statically, without installing anything.

@jnothman (Member) commented Dec 15, 2017 via email

Labels: None yet
Projects: None yet
Development: No branches or pull requests
5 participants