-
-
Notifications
You must be signed in to change notification settings - Fork 26k
[MRG+1] DOC Clean up datasets loaders as part of the reorganization of the dataset section #11319
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG+1] DOC Clean up datasets loaders as part of the reorganization of the dataset section #11319
Conversation
Were there todos on the datasets/index.rst page unrelated to moving things
into sklearn/datasets/descr, etc? I'd be keen on making the docs *look*
right before we deal with this maintenance issue.
Perhaps you can cherry-pick relevant commits from #10586 wh
8000
ich appears to
be stalled.
|
Yes the TODOs are mainly add some introductions for the new subsections in index.rst, they are unrelated to moving things into .../descr. |
@jeremiedbb Is this ready for review? Or what's the remaining things to do here? |
@qinhanmin2014 I'm not sure, I have to check cause I haven't been working on that for 2 weeks. I'll ping when I think it's ready. In the meanwhile, feel free to give me your opinion on the general changes I made. I changed a lot of things, moved some files. Do you think it's fine or have I been too hard on cuts ? |
@qinhanmin2014 ready for review :) |
PR #11548 adds doc for the california_housing dataset. |
yes, done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty good. Thanks a lot @jeremiedbb @aboucaud for your great work and apologies for the delay.
I'll mark it as 0.20 to attract reviewers.
@@ -62,7 +71,7 @@ attribute is the integer index of the category:: | |||
>>> newsgroups_train.target.shape | |||
(11314,) | |||
>>> newsgroups_train.target[:10] | |||
array([ 7, 4, 4, 1, 14, 16, 13, 3, 2, 4]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's happening here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea... I corrected it.
(I may have had errors and thought that the dataset had been modified, but I've just tested it and it works as expected.)
@@ -105,10 +114,10 @@ components by sample in a more than 30000-dimensional space | |||
(less than .5% non-zero features):: | |||
|
|||
>>> vectors.nnz / float(vectors.shape[0]) | |||
159.01327... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We generally don't expose so many digits. Try # doctest: +ELLIPSIS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did it. But I found that many more results expose all the digits. Should I make the same for all ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please.
@qinhanmin2014 : I think that you reviewed this PR and were +1. Could you mark it as 👍? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor comments.
sklearn/datasets/descr/covtype.rst
Outdated
@@ -8,10 +8,19 @@ collected for the task of predicting each patch's cover type, | |||
i.e. the dominant species of tree. | |||
There are seven covertypes, making this a multiclass classification problem. | |||
Each sample has 54 features, described on the | |||
`dataset's homepage <http://archive.ics.uci.edu/ml/datasets/Covertype>`__. | |||
`dataset's homepage <http://archive.ics.uci.edu/ml/datasets/Covertype>`_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you change this? "__" means anonymous link, and it was probably a good choice here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably bad copy paste... reverted
|
||
`This dataset contains a set of face images`_ taken between April 1992 and April | ||
1994 at AT&T Laboratories Cambridge. The | ||
`This dataset contains a set of face images`_ taken between April 1992 and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's very surprising to me to see a link here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Brain fart. Please ignore.
>>> pred = clf.predict(vectors_test) | ||
>>> metrics.f1_score(newsgroups_test.target, pred, average='macro') | ||
0.88213... | ||
0.88213592402729568 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the ellipsis was a good idea. We were just lacking the "doctest: +ELLIPSIS" pragma.
alt.atheism: sgi livesey atheists writes people caltech com god keith edu | ||
comp.graphics: organization thanks files subject com image lines university edu graphics | ||
sci.space: toronto moon gov com alaska access henry nasa edu space | ||
talk.religion.misc: article writes kent people christian jesus sandvik edu com god | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did this change?
@GaelVaroquaux I found what happened. I missed some changes in the file made in another PR a few weeks ago (about skipping downloading the dataset for the doctest). It should be fine now. However I think the example now shows weird results. |
That does look strange... especially as predictive performance seems the same...? |
@GaelVaroquaux I fixed your requested changes. I think it's good to go. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's hard to ensure the correctness of every details here but I'll vote +1 here.
0.88213... | ||
|
||
(The example :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py` shuffles | ||
(The example :ref:`sphx_glr_auto_examples_text_document_classification_20newsgroups.py` shuffles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's happening here?
@@ -220,4 +230,4 @@ the ``--filter`` option to compare the results. | |||
|
|||
* :ref:`sphx_glr_auto_examples_model_selection_grid_search_text_feature_extraction.py` | |||
|
|||
* :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py` | |||
* :ref:`sphx_glr_auto_examples_text_document_classification_20newsgroups.py` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's happening here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had issues with twenty_newsgroup.rst
. Some changes happened in the file after I started the PR, and I kept using the older version.
I thought I had catch all the changes, but apparently not. It should be ok now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @jeremiedbb @aboucaud for your great work
Thanks @jeremiedbb |
This pull request aims to standardize the datasets informations, as part of a more general reorganization of the dataset section in user guide, see #11083.
Currently, information about datasets can be found a different places, depending on the datasets, with some redundancies. It can be in loader function or module docstrings, in the
DESCR
attribute of the loaded data, or directly on the dataset loading utilities page.I propose to store all the information in the
DESCR
attribute, for all datasets ; Leave in the docstrings the minimal informations. For now, the dataset loading utilities page will display the same information as theDESCR
attribute. Keeping the whole information on this page is still in discussion in #11083.During this cleaning I found 2 datasets that are currently not included in the dataset loading utilities page : the california housing dataset and the species distributions dataset.
I didn't find any issue or PR about the species distributions dataset.
About, the california housing dataset, there have been a PR that aimed to add it to the doc, see #10586. I'm not sure it's been continued.
edit: Fixes #10555