Reorganize dataset section in user guide · Issue #11083 · scikit-learn/scikit-learn · GitHub

Closed
qinhanmin2014 opened this issue May 10, 2018 · 15 comments
Labels: Documentation, help wanted, Moderate (Anything that requires some knowledge of conventions and best practices), Sprint
Comments

qinhanmin2014 (Member) commented May 10, 2018

I think the dataset section in user guide (http://scikit-learn.org/dev/datasets/index.html) is hard to follow. See the subsections:

5.1. General dataset API
5.2. Toy datasets
5.3. Sample images
5.4. Sample generators
5.5. Datasets in svmlight / libsvm format
5.6. Loading from external datasets
5.7. The Olivetti faces dataset
5.8. The 20 newsgroups text dataset
5.9. Downloading datasets from the mldata.org repository
5.10. The Labeled Faces in the Wild face recognition dataset
5.11. Forest covertypes
5.12. RCV1 dataset
5.13. Kddcup 99 dataset
5.14. Boston House Prices dataset
5.15. Breast Cancer Wisconsin (Diagnostic) Database
5.16. Diabetes dataset
5.17. Optical Recognition of Handwritten Digits Data Set
5.18. Iris Plants Database
5.19. Linnerrud dataset

I can't find the logic behind such an organization.
I think we should first divide the section into two subsections: one for dataset loaders and one for sample generators. The dataset loaders can be further divided according to fetch_ and load_; the sample generators can be further divided by task (e.g., regression, classification).
Related to #10555
What do others think?
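The proposed split maps directly onto the existing naming convention in `sklearn.datasets`. A minimal sketch using the real, current API (no downloads involved):

```python
# Loaders vs. generators, following the naming convention the proposal
# is based on: load_* ships with scikit-learn, fetch_* downloads on
# first use, make_* generates synthetic data.
from sklearn.datasets import load_iris, make_classification

# Dataset loader: a small built-in ("toy") dataset, no download needed.
iris = load_iris()
print(iris.data.shape)  # (150, 4)

# Sample generator: synthetic data for a classification task.
X, y = make_classification(n_samples=100, n_features=20, random_state=0)
print(X.shape, y.shape)  # (100, 20) (100,)
```

Under the proposed structure, the first call would live under "dataset loaders" and the second under "sample generators / classification".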

rth (Member) commented May 13, 2018

I also can't follow the logic of the current organization and agree that having a more hierarchical organization of sections would be better.

qinhanmin2014 added the Documentation, Moderate, Sprint, and help wanted labels May 13, 2018
glemaitre (Member) commented

I also agree, and I think @qinhanmin2014's proposal will make more sense than the current TOC.

jnothman (Member) commented May 13, 2018 via email

amueller (Member) commented

I don't think there is any logic to the TOC, and a hierarchy seems good.
@jnothman what's real world? ;)
I would definitely distinguish generators from loaders, and generic from specific datasets.
I'm not sure about load vs fetch, because it seems much more likely that someone will come looking for digits, iris, or boston than for linnerrud. So having a section on 'built-in' datasets might make sense?

qinhanmin2014 (Member, Author) commented

Thanks all for the replies here :)

So having a section on 'built-in' might make sense?

Agree. By "divide according to fetch_ and load_", I meant having a section for built-in datasets (similar to Section 5.2, Toy datasets, in the current version). I think users sometimes want to explore scikit-learn with these small datasets, which do not require an extra download.

jnothman (Member) commented May 21, 2018 via email

jeremiedbb (Member) commented

Hi, I tried a more hierarchical structure for this section as suggested earlier, with two main subsections (loading datasets and generating datasets) and a further distinction between toy datasets and real-world datasets. However, it results in a tree of depth 5, which is quite unreadable.
The main issue seems to be the descriptions of each dataset. One way to solve it could be to keep in this section only a general description of which types of datasets are available for which purposes, and then make a new page for each individual dataset (toy and real world) with its full description, guide, and links to the corresponding functions in the API.
Is it worth following this direction?

qinhanmin2014 (Member, Author) commented

@jeremiedbb Thanks for taking this up. It would be better if you updated your PR accordingly to give us something to review.

However, it results in a tree of depth 5, quite unreadable.

I think that's acceptable. The main concern is to make the document easier for users to follow. (But it seems that depth 4 (e.g., something like 5.1 dataset loaders, 5.1.1 toy datasets, 5.1.1.1 iris) is enough?)

The main issue seems to be the descriptions of each dataset.

Since we're doing a reorganization, I don't think it's good to remove as many things as your PR does.
So a possible way might be to first provide the structure, fill some of it with existing content, and leave the rest as TODOs. Then you can start on those TODOs.

jeremiedbb (Member) commented

My first PR was just a test, sorry about that.

I updated the PR with a quick draft of the structure. I've only filled in a few datasets for now, to show how it renders. I find the 5.x.x.x subsections hard to distinguish from the rest. Moreover, some datasets have even deeper subsections, like the 20 newsgroups dataset. Do you find this structure more appropriate?

jnothman (Member) commented

Another confusing thing is that many of the dataset descriptions are included from doc/datasets/, while others are included from sklearn/datasets/descr/. I don't know what the difference between those is, but including doc/datasets/kddcup99.rst into doc/datasets/index.rst results in both being translated to HTML, so Sphinx raises warnings like /home/circleci/project/doc/datasets/kddcup99.rst:5: WARNING: duplicate label kddcup99, other instance in /home/circleci/project/doc/datasets/index.rst.
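The duplicate-label warning arises because reST labels must be unique per Sphinx build: when a labelled file is both rendered as its own document and pulled into index.rst via the `include` directive, Sphinx sees the label defined twice. A hypothetical minimal reproduction (file contents condensed into one sketch):

```rst
.. doc/datasets/kddcup99.rst -- rendered on its own AND included below

.. _kddcup99:

Kddcup 99 dataset
-----------------

.. doc/datasets/index.rst -- pulls the same file in, so Sphinx sees
   the ``kddcup99`` label defined in two documents and emits
   "WARNING: duplicate label kddcup99"

.. include:: kddcup99.rst
```

Excluding the included files from the toctree (or keeping the text only in index.rst) avoids the double definition.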

jorisvandenbossche (Member) commented

Another confusing thing is that many of the dataset descriptions are included from doc/datasets/

And since they are rather short files, I also don't really see a reason not to include the text directly in index.rst, given that they are not re-used for the DESCR attribute (in the current situation).

More generally, it seems each dataset is described in several different places.

So for the bigger datasets, this means the information lives in three different places in the code (module docstring, function docstring, user guide entry), which could certainly use some reduction.

But from a user's point of view, my question is also: do we need to include all those explanations in datasets/index.rst? We could also have a separate page for each (although we already have a docstring for that? But then what would still be the difference between the docstring and the DESCR attribute?)
I would personally rather see a good, concise summary for each type of dataset (toy, real-world, generated, ...) listing the available datasets and their characteristics (regression or classification, size, images or numerical, ...).

jeremiedbb (Member) commented

In addition, there are inconsistencies in the DESCR attribute among the "real world" datasets (the fetch_ ones). Some use the function docstring, such as RCV1; some use the module docstring, such as olivetti_faces; some just don't have a DESCR. It couldn't hurt to do some cleanup here.

I propose to make a descr file in sklearn/datasets/descr for each of these datasets, like the ones for the toy datasets. Those files would contain the descriptions currently present in doc/datasets/index.rst, that is, in the .rst files in doc/datasets. Then delete the files in doc/datasets, leaving only index.rst.

In the meanwhile, clean up the docstrings of all the 'fetch' modules and functions, since they won't appear in DESCR.

That way, all information about the datasets would be either in DESCR or in doc/datasets/index.rst. Then we have to decide whether to keep the full description of each dataset in doc/datasets/index.rst or not. I'm +1 with @jorisvandenbossche on leaving only a brief description of each dataset and redirecting to the DESCR attribute for the complete description.

Do you agree with that?
Should I make a separate PR for the cleanup?
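The standardization being proposed would mean every dataset, toy or fetched, exposes its full description through DESCR. A minimal sketch of the user-facing behavior, using a built-in loader so no download is needed (real, current API):

```python
# Every loader returns a Bunch whose DESCR attribute holds the
# plain-text dataset description; the cleanup discussed above aims
# to make this uniform across the load_* and fetch_* datasets.
from sklearn.datasets import load_iris

iris = load_iris()
print(type(iris.DESCR))              # <class 'str'>
print("iris" in iris.DESCR.lower())  # True
```

With a brief summary in the user guide, this attribute would become the canonical place for each dataset's full description.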

jnothman (Member) commented Jun 19, 2018 via email

jeremiedbb (Member) commented

There are now two distinct PRs.
The first, #11328, which is a continuation of #11180, mainly reworks the structure of the dataset loading utilities page itself.
The second, #11319, is a cleanup of the dataset loaders as discussed above: essentially, standardize the DESCR attribute of the datasets and remove redundant information.
The second is on standby until the first is ready, but remarks are welcome on both.

qinhanmin2014 (Member, Author) commented

Closing, since we've finished reorganizing the section in #11180.
I've marked #11319 (which contains some further work on it) for 0.20 to attract reviewers.

jnothman pushed a commit that referenced this issue Jul 25, 2018
…taset section (#11319)

Standardize the dataset information, as part of a more general reorganization of the dataset section in the user guide; see #11083.

Fixes #10555