[MRG+1] DOC Clean up datasets loaders as part of the reorganization of the dataset section by jeremiedbb · Pull Request #11319 · scikit-learn/scikit-learn · GitHub

[MRG+1] DOC Clean up datasets loaders as part of the reorganization of the dataset section #11319

Merged: 22 commits into scikit-learn:master on Jul 25, 2018

Conversation

@jeremiedbb (Member) commented Jun 19, 2018

This pull request aims to standardize the dataset information, as part of a more general reorganization of the dataset section in the user guide; see #11083.

Currently, information about datasets can be found in different places depending on the dataset, with some redundancy. It may live in the loader function or module docstrings, in the DESCR attribute of the loaded data, or directly on the dataset loading utilities page.

I propose to store all of the information in the DESCR attribute, for all datasets, and to leave only minimal information in the docstrings. For now, the dataset loading utilities page will display the same information as the DESCR attribute. Whether to keep the full information on this page is still under discussion in #11083.
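To illustrate the intended layout, here is a minimal sketch of how the consolidated description would be accessed (load_iris is used as an example; every built-in loader exposes the same DESCR attribute):

    from sklearn.datasets import load_iris

    # After the cleanup, the full dataset description is stored in DESCR
    data = load_iris()
    print(data.DESCR)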

During this cleanup I found two datasets that are currently not included in the dataset loading utilities page: the California housing dataset and the species distributions dataset.
I didn't find any issue or PR about the species distributions dataset.
Regarding the California housing dataset, there was a PR that aimed to add it to the docs, see #10586. I'm not sure whether it was continued.

edit: Fixes #10555

@jnothman (Member) commented Jun 20, 2018 via email

@jeremiedbb (Member Author)

Yes, the TODOs are mainly to add some introductions for the new subsections in index.rst; they are unrelated to moving things into .../descr.
I'll focus on finishing index.rst and leave this one aside in the meantime.

@jeremiedbb jeremiedbb changed the title Doc : clean up datasets loaders as part of the reorganization of the dataset section #11083 [WIP] Doc : clean up datasets loaders as part of the reorganization of the dataset section #11083 Jun 22, 2018
@qinhanmin2014 (Member)

@jeremiedbb Is this ready for review? Or what are the remaining things to do here?

@jeremiedbb (Member Author)

@qinhanmin2014 I'm not sure, I have to check because I haven't been working on this for two weeks. I'll ping you when I think it's ready. In the meantime, feel free to give me your opinion on the general changes I made. I changed a lot of things and moved some files. Do you think it's fine, or have I been too aggressive with the cuts?

@jeremiedbb (Member Author)

@qinhanmin2014 ready for review :)

@jeremiedbb jeremiedbb changed the title [WIP] Doc : clean up datasets loaders as part of the reorganization of the dataset section #11083 [MRG] Doc : clean up datasets loaders as part of the reorganization of the dataset section #11083 Jul 5, 2018
@aboucaud (Contributor)

PR #11548 adds documentation for the california_housing dataset.
If it is merged before this one, some modifications will still be needed here.

@jeremiedbb (Member Author)

@aboucaud so should #11548 be closed?

@aboucaud (Contributor)

Yes, done.

@qinhanmin2014 (Member) left a comment

This looks pretty good. Thanks a lot @jeremiedbb @aboucaud for your great work and apologies for the delay.
I'll mark it as 0.20 to attract reviewers.

@@ -62,7 +71,7 @@ attribute is the integer index of the category::
>>> newsgroups_train.target.shape
(11314,)
>>> newsgroups_train.target[:10]
array([ 7, 4, 4, 1, 14, 16, 13, 3, 2, 4])
Member:

What's happening here?

Member Author:

I have no idea... I corrected it.
(I may have had errors and thought that the dataset had been modified, but I've just tested it and it works as expected.)
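For reference, the doctest under discussion can be re-checked with something along these lines (a sketch; it downloads the 20 newsgroups data on first use):

    from sklearn.datasets import fetch_20newsgroups

    # Fetch the training subset and inspect the targets, as in the doctest above
    newsgroups_train = fetch_20newsgroups(subset='train')
    print(newsgroups_train.target.shape)   # the doctest expects (11314,)
    print(newsgroups_train.target[:10])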

@@ -105,10 +114,10 @@ components by sample in a more than 30000-dimensional space
(less than .5% non-zero features)::

>>> vectors.nnz / float(vectors.shape[0])
159.01327...
Member:

We generally don't expose so many digits. Try # doctest: +ELLIPSIS
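For reference, the suggested form would look roughly like this in the .rst file:

    >>> vectors.nnz / float(vectors.shape[0])  # doctest: +ELLIPSIS
    159.01327...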

Member Author:

Did it. But I found that many other results expose all the digits. Should I do the same for all of them?

Member:

Yes, please.

@qinhanmin2014 qinhanmin2014 added this to the 0.20 milestone Jul 16, 2018
@GaelVaroquaux (Member)

@qinhanmin2014: I think that you reviewed this PR and were +1. Could you mark it as 👍?

@GaelVaroquaux (Member) left a comment

A few minor comments.

@@ -8,10 +8,19 @@ collected for the task of predicting each patch's cover type,
i.e. the dominant species of tree.
There are seven covertypes, making this a multiclass classification problem.
Each sample has 54 features, described on the
- `dataset's homepage <http://archive.ics.uci.edu/ml/datasets/Covertype>`__.
+ `dataset's homepage <http://archive.ics.uci.edu/ml/datasets/Covertype>`_.
Member:

Why did you change this? "__" means anonymous link, and it was probably a good choice here.
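For context, the trailing underscores are significant in reST; a sketch of the two forms, using the line from the diff:

    .. anonymous hyperlink reference (the original form):
    `dataset's homepage <http://archive.ics.uci.edu/ml/datasets/Covertype>`__

    .. named hyperlink reference (what the PR changed it to):
    `dataset's homepage <http://archive.ics.uci.edu/ml/datasets/Covertype>`_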

Member Author:

Probably a bad copy-paste... reverted.


`This dataset contains a set of face images`_ taken between April 1992 and April
1994 at AT&T Laboratories Cambridge. The
`This dataset contains a set of face images`_ taken between April 1992 and
Member:

It's very surprising to me to see a link here.

Member:

Brain fart. Please ignore.

>>> pred = clf.predict(vectors_test)
>>> metrics.f1_score(newsgroups_test.target, pred, average='macro')
- 0.88213...
+ 0.88213592402729568
Member:

I think that the ellipsis was a good idea. We were just lacking the "doctest: +ELLIPSIS" pragma.
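In other words, the line presumably just needed the pragma, roughly:

    >>> metrics.f1_score(newsgroups_test.target, pred, average='macro')  # doctest: +ELLIPSIS
    0.88213...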

alt.atheism: sgi livesey atheists writes people caltech com god keith edu
comp.graphics: organization thanks files subject com image lines university edu graphics
sci.space: toronto moon gov com alaska access henry nasa edu space
talk.religion.misc: article writes kent people christian jesus sandvik edu com god

Member:

Why did this change?

@jeremiedbb (Member Author)

@GaelVaroquaux I found what happened. I missed some changes made to the file in another PR a few weeks ago (about skipping the dataset download for the doctests).

It should be fine now. However, I think the example now shows weird results.
If you look at the example in 0.19,
http://scikit-learn.org/stable/datasets/twenty_newsgroups.html
the most informative features are relevant and seem OK.
If you look in master,
http://scikit-learn.org/dev/datasets/twenty_newsgroups.html
the most informative features make no sense.
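For context, the "most informative features" lists on that page are printed by a small helper that takes the ten highest-weighted terms per class, roughly along these lines (a sketch close to the helper in twenty_newsgroups.rst; the code there may differ in details):

    import numpy as np

    def show_top10(classifier, vectorizer, categories):
        # Print the 10 features with the largest coefficients for each class
        feature_names = np.asarray(vectorizer.get_feature_names())
        for i, category in enumerate(categories):
            top10 = np.argsort(classifier.coef_[i])[-10:]
            print("%s: %s" % (category, " ".join(feature_names[top10])))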

@jnothman (Member)

That does look strange... especially as predictive performance seems the same...?

@jeremiedbb (Member Author)

This has been changed in #11284.
Looking at the diff, I haven't been able to find the reason for this change, other than making the tests pass.

It seems that the dataset was shuffled before #11284, because the target is not the same... But that shouldn't change the feature importances.

@jeremiedbb (Member Author)

@GaelVaroquaux I fixed your requested changes. I think it's good to go.
But before that, can you take a look at the strange behavior I described above?

@qinhanmin2014 (Member) left a comment

It's hard to ensure the correctness of every detail here, but I'll vote +1.

0.88213...

- (The example :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py` shuffles
+ (The example :ref:`sphx_glr_auto_examples_text_document_classification_20newsgroups.py` shuffles
Member:

What's happening here?

@@ -220,4 +230,4 @@ the ``--filter`` option to compare the results.

* :ref:`sphx_glr_auto_examples_model_selection_grid_search_text_feature_extraction.py`

- * :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py`
+ * :ref:`sphx_glr_auto_examples_text_document_classification_20newsgroups.py`
Member:

What's happening here?

Member Author:

I've had issues with twenty_newsgroup.rst. Some changes happened in the file after I started the PR, and I kept using the older version.
I thought I had caught all the changes, but apparently not. It should be OK now.

@qinhanmin2014 qinhanmin2014 changed the title [MRG] Doc : clean up datasets loaders as part of the reorganization of the dataset section #11083 [MRG+1] DOC Clean up datasets loaders as part of the reorganization of the dataset section Jul 23, 2018
@qinhanmin2014 (Member) left a comment

LGTM, thanks @jeremiedbb @aboucaud for your great work

@jnothman jnothman merged commit 9d649c5 into scikit-learn:master Jul 25, 2018
@jnothman (Member)

Thanks @jeremiedbb
