[MRG+1] DOC Clean up datasets loaders as part of the reorganization of the dataset section by jeremiedbb · Pull Request #11319 · scikit-learn/scikit-learn · GitHub

[MRG+1] DOC Clean up datasets loaders as part of the reorganization of the dataset section #11319

Merged: 22 commits into scikit-learn:master on Jul 25, 2018

Conversation

@jeremiedbb (Member) commented Jun 19, 2018

This pull request aims to standardize the dataset information, as part of a more general reorganization of the dataset section in the user guide; see #11083.

Currently, information about datasets can be found in different places depending on the dataset, with some redundancy. It may live in the loader function or module docstrings, in the DESCR attribute of the loaded data, or directly on the dataset loading utilities page.

I propose to store all of the information in the DESCR attribute, for all datasets, and to leave only minimal information in the docstrings. For now, the dataset loading utilities page will display the same information as the DESCR attribute. Whether to keep the full information on this page is still under discussion in #11083.
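To illustrate the intended layout, here is a minimal sketch of how the consolidated description would be accessed (load_iris is used as an example; every built-in loader exposes the same DESCR attribute):

    from sklearn.datasets import load_iris

    # After the cleanup, the full dataset description is stored in DESCR
    data = load_iris()
    print(data.DESCR)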

During this cleanup I found two datasets that are currently not included in the dataset loading utilities page: the California housing dataset and the species distributions dataset.
I didn't find any issue or PR about the species distributions dataset.
Regarding the California housing dataset, there was a PR that aimed to add it to the docs, see #10586. I'm not sure whether it was continued.

edit: Fixes #10555

@jnothman (Member) commented Jun 20, 2018 via email

@jeremiedbb (Member Author)

Yes, the TODOs are mainly to add some introductions for the new subsections in index.rst; they are unrelated to moving things into .../descr.
I'll focus on finishing index.rst and leave this one aside in the meantime.

@jeremiedbb jeremiedbb changed the title Doc : clean up datasets loaders as part of the reorganization of the dataset section #11083 [WIP] Doc : clean up datasets loaders as part of the reorganization of the dataset section #11083 Jun 22, 2018
@qinhanmin2014 (Member)

@jeremiedbb Is this ready for review? Or what are the remaining things to do here?

@jeremiedbb (Member Author)

@qinhanmin2014 I'm not sure, I have to check because I haven't been working on this for two weeks. I'll ping you when I think it's ready. In the meantime, feel free to give me your opinion on the general changes I made. I changed a lot of things and moved some files. Do you think it's fine, or have I been too aggressive with the cuts?

@jeremiedbb (Member Author)

@qinhanmin2014 ready for review :)

@jeremiedbb jeremiedbb changed the title [WIP] Doc : clean up datasets loaders as part of the reorganization of the dataset section #11083 [MRG] Doc : clean up datasets loaders as part of the reorganization of the dataset section #11083 Jul 5, 2018
@aboucaud (Contributor)

PR #11548 adds documentation for the california_housing dataset.
If it is merged before this one, some modifications will still be needed here.

@jeremiedbb (Member Author)

@aboucaud so should #11548 be closed?

@aboucaud (Contributor)

Yes, done.

@qinhanmin2014 (Member) left a comment

This looks pretty good. Thanks a lot @jeremiedbb @aboucaud for your great work and apologies for the delay.
I'll mark it as 0.20 to attract reviewers.

@@ -62,7 +71,7 @@ attribute is the integer index of the category::
>>> newsgroups_train.target.shape
(11314,)
>>> newsgroups_train.target[:10]
array([ 7, 4, 4, 1, 14, 16, 13, 3, 2, 4])
Member:

What's happening here?

Member Author:

I have no idea... I corrected it.
(I may have had errors and thought that the dataset had been modified, but I've just tested it and it works as expected.)
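For reference, the doctest under discussion can be re-checked with something along these lines (a sketch; it downloads the 20 newsgroups data on first use):

    from sklearn.datasets import fetch_20newsgroups

    # Fetch the training subset and inspect the targets, as in the doctest above
    newsgroups_train = fetch_20newsgroups(subset='train')
    print(newsgroups_train.target.shape)   # the doctest expects (11314,)
    print(newsgroups_train.target[:10])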

@@ -105,10 +114,10 @@ components by sample in a more than 30000-dimensional space
(less than .5% non-zero features)::

>>> vectors.nnz / float(vectors.shape[0])
159.01327...
Member:

We generally don't expose so many digits. Try # doctest: +ELLIPSIS
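For reference, the suggested form would look roughly like this in the .rst file:

    >>> vectors.nnz / float(vectors.shape[0])  # doctest: +ELLIPSIS
    159.01327...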

Member Author:

Did it. But I found that many other results expose all the digits. Should I do the same for all of them?

Member:

Yes, please.

@qinhanmin2014 qinhanmin2014 added this to the 0.20 milestone Jul 16, 2018
@GaelVaroquaux (Member)

@qinhanmin2014: I think that you reviewed this PR and were +1. Could you mark it as 👍?

@GaelVaroquaux (Member) left a comment

A few minor comments.

@@ -8,10 +8,19 @@ collected for the task of predicting each patch's cover type,
i.e. the dominant species of tree.
There are seven covertypes, making this a multiclass classification problem.
Each sample has 54 features, described on the
- `dataset's homepage <http://archive.ics.uci.edu/ml/datasets/Covertype>`__.
+ `dataset's homepage <http://archive.ics.uci.edu/ml/datasets/Covertype>`_.
Member:

Why did you change this? "__" means anonymous link, and it was probably a good choice here.
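For context, the trailing underscores are significant in reST; a sketch of the two forms, using the line from the diff:

    .. anonymous hyperlink reference (the original form):
    `dataset's homepage <http://archive.ics.uci.edu/ml/datasets/Covertype>`__

    .. named hyperlink reference (what the PR changed it to):
    `dataset's homepage <http://archive.ics.uci.edu/ml/datasets/Covertype>`_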

Member Author:

Probably a bad copy-paste... reverted.


`This dataset contains a set of face images`_ taken between April 1992 and April
1994 at AT&T Laboratories Cambridge. The
`This dataset contains a set of face images`_ taken between April 1992 and
Member:

It's very surprising to me to see a link here.

Member:

Brain fart. Please ignore.

>>> pred = clf.predict(vectors_test)
>>> metrics.f1_score(newsgroups_test.target, pred, average='macro')
- 0.88213...
+ 0.88213592402729568
Member:

I think that the ellipsis was a good idea. We were just lacking the "doctest: +ELLIPSIS" pragma.
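In other words, the line presumably just needed the pragma, roughly:

    >>> metrics.f1_score(newsgroups_test.target, pred, average='macro')  # doctest: +ELLIPSIS
    0.88213...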

alt.atheism: sgi livesey atheists writes people caltech com god keith edu
comp.graphics: organization thanks files subject com image lines university edu graphics
sci.space: toronto moon gov com alaska access henry nasa edu space
talk.religion.misc: article writes kent people christian jesus sandvik edu com god

Member:

Why did this change?

@jeremiedbb (Member Author)

@GaelVaroquaux I found what happened. I missed some changes made to the file in another PR a few weeks ago (about skipping the dataset download for the doctests).

It should be fine now. However, I think the example now shows weird results.
If you look at the example in 0.19,
http://scikit-learn.org/stable/datasets/twenty_newsgroups.html
the most informative features are relevant and seem OK.
If you look in master,
http://scikit-learn.org/dev/datasets/twenty_newsgroups.html
the most informative features make no sense.
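For context, the "most informative features" lists on that page are printed by a small helper that takes the ten highest-weighted terms per class, roughly along these lines (a sketch close to the helper in twenty_newsgroups.rst; the code there may differ in details):

    import numpy as np

    def show_top10(classifier, vectorizer, categories):
        # Print the 10 features with the largest coefficients for each class
        feature_names = np.asarray(vectorizer.get_feature_names())
        for i, category in enumerate(categories):
            top10 = np.argsort(classifier.coef_[i])[-10:]
            print("%s: %s" % (category, " ".join(feature_names[top10])))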

@jnothman (Member)

That does look strange... especially as predictive performance seems the same...?

@jeremiedbb (Member Author)

This has been changed in #11284.
Looking at the diff, I haven't been able to find the reason for this change, other than making the tests pass.

It seems that the dataset was shuffled before #11284, because the target is not the same... But that shouldn't change the feature importances.

@jeremiedbb (Member Author)

@GaelVaroquaux I fixed your requested changes. I think it's good to go.
But before that, can you take a look at the strange behavior I described above?

@qinhanmin2014 (Member) left a comment

It's hard to ensure the correctness of every detail here, but I'll vote +1.

0.88213...

- (The example :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py` shuffles
+ (The example :ref:`sphx_glr_auto_examples_text_document_classification_20newsgroups.py` shuffles
Member:

What's happening here?

@@ -220,4 +230,4 @@ the ``--filter`` option to compare the results.

* :ref:`sphx_glr_auto_examples_model_selection_grid_search_text_feature_extraction.py`

- * :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py`
+ * :ref:`sphx_glr_auto_examples_text_document_classification_20newsgroups.py`
Member:

What's happening here?

Member Author:

I've had issues with twenty_newsgroup.rst. Some changes happened in the file after I started the PR, and I kept using the older version.
I thought I had caught all the changes, but apparently not. It should be OK now.

@qinhanmin2014 qinhanmin2014 changed the title [MRG] Doc : clean up datasets loaders as part of the reorganization of the dataset section #11083 [MRG+1] DOC Clean up datasets loaders as part of the reorganization of the dataset section Jul 23, 2018
@qinhanmin2014 (Member) left a comment

LGTM, thanks @jeremiedbb @aboucaud for your great work

@jnothman jnothman merged commit 9d649c5 into scikit-learn:master Jul 25, 2018
@jnothman (Member)

Thanks @jeremiedbb
