[invalid] Dead link in latent dirichlet aloc · Issue #10275 · scikit-learn/scikit-learn

Closed

hnykda opened this issue Dec 9, 2017 · 14 comments

@hnykda (Contributor) commented Dec 9, 2017

Description

The link in the documentation is dead here: https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/decomposition/online_lda.py#L260

I have a proposal to prevent these:

  1. copy all websites, articles, code, and linked content, if legally possible, to scikit-learn.org servers and keep them there, because, unfortunately, people in academia are usually terrible at keeping things alive
  2. add a simple script to the test suite that crawls all documentation references, gathers the links, tries to GET each of them, and checks that the response code is 200. This would point out dead links (a rough sketch follows below).
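A minimal sketch of the checker described in point 2 (hypothetical, not part of scikit-learn's test suite; it assumes the `requests` library is available and that links are harvested from the `.rst`/`.py` sources with a simple regex):

```python
# Hypothetical sketch: harvest every http(s) URL from the .rst and .py sources
# and report anything that does not answer with HTTP 200.
import re
from pathlib import Path

import requests  # assumed available; not a scikit-learn test dependency

URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")

def find_urls(root="."):
    """Yield (file, url) pairs for every link found in doc and code sources."""
    for path in Path(root).rglob("*"):
        if path.suffix in {".rst", ".py"}:
            for url in URL_RE.findall(path.read_text(errors="ignore")):
                yield path, url

def dead_links(root="."):
    """Return (file, url, status) triples for links that do not return 200."""
    dead = []
    for path, url in find_urls(root):
        try:
            status = requests.get(url, timeout=10).status_code
        except requests.RequestException:
            status = None
        if status != 200:
            dead.append((str(path), url, status))
    return dead

if __name__ == "__main__":
    for path, url, status in dead_links():
        print(f"{path}: {url} -> {status}")
```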
@jnothman (Member) commented Dec 9, 2017

Sphinx provides a linkcheck builder which does the latter. We should make a point of running it once in a while, though (@lesteve, another Travis cron job?).

Please provide a PR fixing this specific instance.
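For context, the Sphinx linkcheck builder mentioned above is what `make linkcheck` runs (`sphinx-build -b linkcheck`). A hedged sketch of a small wrapper a cron job could invoke, assuming the doc sources live in `doc/` as in scikit-learn:

```python
# Hypothetical wrapper around Sphinx's linkcheck builder, equivalent to
# `make linkcheck`. It prints the report that linkcheck writes to output.txt
# and propagates sphinx-build's exit code (non-zero when links are broken,
# depending on the Sphinx version).
import subprocess
import sys
from pathlib import Path

def run_linkcheck(source_dir="doc", build_dir="_build/linkcheck"):
    result = subprocess.run(
        ["sphinx-build", "-b", "linkcheck", source_dir, build_dir]
    )
    report = Path(build_dir) / "output.txt"
    if report.exists():
        print(report.read_text())
    return result.returncode

if __name__ == "__main__":
    sys.exit(run_linkcheck())
```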

@hnykda (Contributor, Author) commented Dec 9, 2017

My bad, I was looking at an old revision... abb43c1

hnykda closed this as completed on Dec 9, 2017
hnykda changed the title from "Dead link in latent dirichlet aloc" to "[invalid] Dead link in latent dirichlet aloc" on Dec 9, 2017
@rth (Member) commented Dec 9, 2017

> copy all websites, articles, code, and linked content, if legally possible, to scikit-learn.org servers and keep them there, because, unfortunately, people in academia are usually terrible at keeping things alive

scikit-learn.org is hosted on GitHub Pages, and that's not really suitable for archiving data.
See #7425 for a more reliable solution.

> add a simple script to the test suite that crawls all documentation references, gathers the links, tries to GET each of them, and checks that the response code is 200. This would point out dead links.

That could be useful; the question is how to run it using the infrastructure available to OSS projects. Done as a side project, this could for instance be a cron job on Travis CI for some other repo (the scikit-learn one already uses Travis), which then uploads the list of broken URLs to some server (again, different from scikit-learn.org). Not sure.

@rth (Member) commented Dec 9, 2017

> Sphinx provides a linkcheck builder which does the latter

Good to know. From what I saw, running multiple cron jobs with different tasks doesn't look easy, as TRAVIS_EVENT_TYPE="cron" is the only way to detect the run type.
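To illustrate, a build's script step could branch on that variable; a rough sketch (assuming a standard `doc/Makefile` with a linkcheck target, as in the scikit-learn docs):

```python
# Hypothetical gating script: run the slow linkcheck only for Travis cron
# builds, using the TRAVIS_EVENT_TYPE variable (push, pull_request, api, cron).
import os
import subprocess
import sys

def main():
    if os.environ.get("TRAVIS_EVENT_TYPE") != "cron":
        print("Not a cron build; skipping linkcheck.")
        return 0
    # assumes the docs Makefile lives in doc/ and exposes a linkcheck target
    return subprocess.run(["make", "-C", "doc", "linkcheck"]).returncode

if __name__ == "__main__":
    sys.exit(main())
```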

@lesteve (Member) commented Dec 11, 2017

The easiest is probably to run make linkcheck on CircleCI (where we already have all the needed dependencies), maybe only on master, because it's generally not the PR's fault if a link is broken.

@lesteve (Member) commented Dec 11, 2017

> Good to know. From what I saw, running multiple cron jobs with different tasks doesn't look easy, as TRAVIS_EVENT_TYPE="cron" is the only way to detect the run type.

For the record, you can have multiple builds with type=cron, and it behaves exactly like the push-based Travis builds, i.e. you get a build matrix with multiple builds that are all run by the cron job. Here is an example where I added a Python 3.4 build to the cron job (on my fork): https://travis-ci.org/lesteve/scikit-learn/builds/306332273.

@rth (Member) commented Dec 12, 2017

> For the record, you can have multiple builds with type=cron, and it behaves exactly like the push-based Travis builds, i.e. you get a build matrix with multiple builds that are all run by the cron job.

I meant that, AFAIK, it doesn't work the way regular cron jobs would: one can't run two different tasks with different periodicities. But I guess that's not a big issue.

@jnothman (Member) commented Dec 12, 2017 via email

@lesteve (Member) commented Dec 13, 2017

> it doesn't belong in a usual circle run. Linkcheck is a very slow process.

make linkcheck takes about 14 minutes locally with the fix from #10300. I reckon we could run it only on the master branch (and maybe maintenance branches) in CircleCI.
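A hedged sketch of how the existing CircleCI doc build could restrict linkcheck to master and maintenance branches, using the CIRCLE_BRANCH variable (the 0.19.X-style branch naming is an assumption):

```python
# Hypothetical branch gate for CircleCI: run linkcheck only on master and on
# maintenance branches such as 0.19.X. CIRCLE_BRANCH is set by CircleCI on
# every build.
import os
import re
import subprocess
import sys

branch = os.environ.get("CIRCLE_BRANCH", "")
if branch == "master" or re.fullmatch(r"\d+\.\d+\.X", branch):
    sys.exit(subprocess.run(["make", "-C", "doc", "linkcheck"]).returncode)
print(f"Skipping linkcheck on branch {branch!r}.")
```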

@jnothman (Member) commented Dec 13, 2017 via email

@rth (Member) commented Dec 14, 2017

I'm a bit ambivalent about sphinx's linkcheck. It is nice to have something built into Sphinx, but on the other hand the current CI setup is complicated enough, and this is non-critical (i.e. some links will inevitably go dead after a while, so there is no urgency to fix them), so I'm not convinced that putting it in the regular CI pipeline on Circle CI is that good an idea (it slows things down, adds possible failures, etc.).

Checking for broken links in the documentation is a recurrent problem in OSS projects, and it can be done with a few lines of scrapy (see e.g. here) independently of the way the docs were built. So instead of putting the effort into hacking the current CI setup (with all the constraints that it has, think about notifications etc.), I think it could make sense to do this in a separate GitHub repo, check links with scrapy (or some similar solution), and run it in a Travis cron job (a rough sketch follows below). This way it could be applied to the different versions of the scikit-learn docs without building them, to scikit-learn-contrib projects, or to any other projects for that matter. I can allocate some time for such a project.
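As an illustration of the scrapy idea, a rough sketch only (the start URL, the docs host check, and the output handling are assumptions):

```python
# Hypothetical scrapy spider that crawls deployed docs and records every
# response that does not come back with HTTP 200. It could be run with e.g.
#   scrapy runspider deadlink_spider.py -o broken_links.json
import scrapy
from scrapy.http import HtmlResponse

class DeadLinkSpider(scrapy.Spider):
    name = "deadlinks"
    # assumed entry point; any deployed docs URL would do
    start_urls = ["https://scikit-learn.org/stable/documentation.html"]
    # let non-200 responses reach the callback instead of being dropped
    custom_settings = {"HTTPERROR_ALLOW_ALL": True}

    def parse(self, response):
        if response.status != 200:
            yield {"url": response.url, "status": response.status}
        # only extract further links from HTML pages on the docs host
        if isinstance(response, HtmlResponse) and "scikit-learn.org" in response.url:
            for href in response.css("a::attr(href)").getall():
                if href.startswith(("mailto:", "javascript:")):
                    continue
                yield response.follow(href, callback=self.parse)
```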

@jnothman (Member) commented Dec 14, 2017 via email

@amueller (Member) commented:
Are there bots / services that could do that? I would expect this to be mostly solvable statically, without installing anything.

@jnothman (Member) commented Dec 15, 2017 via email

Labels: None yet
Projects: None yet
Development: No branches or pull requests
5 participants