Reorganize dataset section in user guide · Issue #11083 · scikit-learn/scikit-learn · GitHub

Closed
qinhanmin2014 opened this issue May 10, 2018 · 15 comments
Labels: Documentation, help wanted, Moderate (Anything that requires some knowledge of conventions and best practices), Sprint
Comments

qinhanmin2014 (Member) commented May 10, 2018

I think the dataset section in user guide (http://scikit-learn.org/dev/datasets/index.html) is hard to follow. See the subsections:

5.1. General dataset API
5.2. Toy datasets
5.3. Sample images
5.4. Sample generators
5.5. Datasets in svmlight / libsvm format
5.6. Loading from external datasets
5.7. The Olivetti faces dataset
5.8. The 20 newsgroups text dataset
5.9. Downloading datasets from the mldata.org repository
5.10. The Labeled Faces in the Wild face recognition dataset
5.11. Forest covertypes
5.12. RCV1 dataset
5.13. Kddcup 99 dataset
5.14. Boston House Prices dataset
5.15. Breast Cancer Wisconsin (Diagnostic) Database
5.16. Diabetes dataset
5.17. Optical Recognition of Handwritten Digits Data Set
5.18. Iris Plants Database
5.19. Linnerrud dataset

I can't find the logic behind such an organization.
I think we should first divide the section into two subsections: one for dataset loaders and one for sample generators. The dataset loaders can be further divided according to fetch_ and load_; the sample generators can be further divided by task (e.g., regression, classification).
Related to #10555
What do others think?
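The proposed split maps directly onto the existing naming convention in `sklearn.datasets`. A minimal sketch using the real, current API (no downloads involved):

```python
# Loaders vs. generators, following the naming convention the proposal
# is based on: load_* ships with scikit-learn, fetch_* downloads on
# first use, make_* generates synthetic data.
from sklearn.datasets import load_iris, make_classification

# Dataset loader: a small built-in ("toy") dataset, no download needed.
iris = load_iris()
print(iris.data.shape)  # (150, 4)

# Sample generator: synthetic data for a classification task.
X, y = make_classification(n_samples=100, n_features=20, random_state=0)
print(X.shape, y.shape)  # (100, 20) (100,)
```

Under the proposed structure, the first call would live under "dataset loaders" and the second under "sample generators / classification".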

rth (Member) commented May 13, 2018

I also can't follow the logic of the current organization and agree that having a more hierarchical organization of sections would be better.

qinhanmin2014 added the Documentation, Moderate, Sprint, and help wanted labels May 13, 2018
glemaitre (Member) commented

I also agree, and I think @qinhanmin2014's proposal will make more sense than the current TOC.

jnothman (Member) commented May 13, 2018 via email

amueller (Member) commented

I don't think there is any logic to the TOC, and a hierarchy seems good.
@jnothman what's real world? ;)
I would definitely distinguish generators from loaders, and generic from specific datasets.
I'm not sure about load vs fetch, because it seems much more likely that someone will come looking for digits, iris, or boston than for linnerrud. So having a section on 'built-in' datasets might make sense?

qinhanmin2014 (Member, Author) commented

Thanks all for the replies here :)

So having a section on 'built-in' might make sense?

Agree. By "divide according to fetch_ and load_", I meant having a section for built-in datasets (similar to Section 5.2, Toy datasets, in the current version). I think users sometimes want to explore scikit-learn with these small datasets, which do not require an extra download.

jnothman (Member) commented May 21, 2018 via email

jeremiedbb (Member) commented

Hi, I tried a more hierarchical structure for this section as suggested earlier, with two main subsections (loading datasets and generating datasets) and a further distinction between toy datasets and real-world datasets. However, it results in a tree of depth 5, which is quite unreadable.
The main issue seems to be the descriptions of each dataset. One way to solve it could be to keep in this section only a general description of which types of datasets are available for which purposes, and then make a new page for each individual dataset (toy and real world) with its full description, guide, and links to the corresponding functions in the API.
Is it worth following this direction?

qinhanmin2014 (Member, Author) commented

@jeremiedbb Thanks for taking this up. It would be better if you updated your PR accordingly to give us something to review.

However, it results in a tree of depth 5, quite unreadable.

I think that's acceptable. The main concern is to make the document easier for users to follow. (But it seems that depth 4 (e.g., something like 5.1 dataset loaders, 5.1.1 toy datasets, 5.1.1.1 iris) is enough?)

The main issue seems to be the descriptions of each dataset.

Since we're doing a reorganization, I don't think it's good to remove as many things as your PR does.
So a possible way might be to first provide the structure, fill some of it with existing content, and leave the rest as TODOs. Then you can start on those TODOs.

jeremiedbb (Member) commented

My first PR was just a test, sorry about that.

I updated the PR with a quick draft of the structure. I've only filled in a few datasets for now, to show how it renders. I find the 5.x.x.x subsections hard to distinguish from the rest. Moreover, some datasets have even deeper subsections, like the 20 newsgroups dataset. Do you find this structure more appropriate?

jnothman (Member) commented

Another confusing thing is that many of the dataset descriptions are included from doc/datasets/, while others are included from sklearn/datasets/descr/. I don't know what the difference between those is, but including doc/datasets/kddcup99.rst into doc/datasets/index.rst results in both being translated to HTML, so Sphinx raises warnings like /home/circleci/project/doc/datasets/kddcup99.rst:5: WARNING: duplicate label kddcup99, other instance in /home/circleci/project/doc/datasets/index.rst.
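The duplicate-label warning arises because reST labels must be unique per Sphinx build: when a labelled file is both rendered as its own document and pulled into index.rst via the `include` directive, Sphinx sees the label defined twice. A hypothetical minimal reproduction (file contents condensed into one sketch):

```rst
.. doc/datasets/kddcup99.rst -- rendered on its own AND included below

.. _kddcup99:

Kddcup 99 dataset
-----------------

.. doc/datasets/index.rst -- pulls the same file in, so Sphinx sees
   the ``kddcup99`` label defined in two documents and emits
   "WARNING: duplicate label kddcup99"

.. include:: kddcup99.rst
```

Excluding the included files from the toctree (or keeping the text only in index.rst) avoids the double definition.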

jorisvandenbossche (Member) commented

Another confusing thing is that many of the dataset descriptions are included from doc/datasets/

And since they are rather short files, I also don't really see a reason not to include the text directly in index.rst, given that they are not re-used for the DESCR attribute (in the current situation).

More generally, it seems each dataset is described in several different places.

So for the bigger datasets, this means the information lives in three different places in the code (module docstring, function docstring, user guide entry), which could certainly use some reduction.

But from a user's point of view, my question is also: do we need to include all those explanations in datasets/index.rst? We could also have a separate page for each (although we already have a docstring for that? But then what would still be the difference between the docstring and the DESCR attribute?)
I would personally rather see a good, concise summary for each type of dataset (toy, real-world, generated, ...) listing the available datasets and their characteristics (regression or classification, size, images or numerical, ...).

jeremiedbb (Member) commented

In addition, there are inconsistencies in the DESCR attribute among the "real world" datasets (the fetch_ ones). Some use the function docstring, such as RCV1; some use the module docstring, such as olivetti_faces; some just don't have a DESCR. It couldn't hurt to do some cleanup here.

I propose to make a descr file in sklearn/datasets/descr for each of these datasets, like the ones for the toy datasets. Those files would contain the descriptions currently present in doc/datasets/index.rst, that is, in the .rst files in doc/datasets. Then delete the files in doc/datasets, leaving only index.rst.

In the meanwhile, clean up the docstrings of all the 'fetch' modules and functions, since they won't appear in DESCR.

That way, all information about the datasets would be either in DESCR or in doc/datasets/index.rst. Then we have to decide whether to keep the full description of each dataset in doc/datasets/index.rst or not. I'm +1 with @jorisvandenbossche on leaving only a brief description of each dataset and redirecting to the DESCR attribute for the complete description.

Do you agree with that?
Should I make a separate PR for the cleanup?
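The standardization being proposed would mean every dataset, toy or fetched, exposes its full description through DESCR. A minimal sketch of the user-facing behavior, using a built-in loader so no download is needed (real, current API):

```python
# Every loader returns a Bunch whose DESCR attribute holds the
# plain-text dataset description; the cleanup discussed above aims
# to make this uniform across the load_* and fetch_* datasets.
from sklearn.datasets import load_iris

iris = load_iris()
print(type(iris.DESCR))              # <class 'str'>
print("iris" in iris.DESCR.lower())  # True
```

With a brief summary in the user guide, this attribute would become the canonical place for each dataset's full description.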

jnothman (Member) commented Jun 19, 2018 via email

jeremiedbb (Member) commented

There are now two distinct PRs.
The first, #11328, which is a continuation of #11180, mainly reworks the structure of the dataset loading utilities page itself.
The second, #11319, is a cleanup of the dataset loaders as discussed above: essentially, standardize the DESCR attribute of the datasets and remove redundant information.
The second is on standby until the first is ready, but remarks are welcome on both.

qinhanmin2014 (Member, Author) commented

Closing, since we've finished reorganizing the section in #11180.
I've marked #11319 (which contains some further work on it) for 0.20 to attract reviewers.

jnothman pushed a commit that referenced this issue Jul 25, 2018
…taset section (#11319)

Standardize the dataset information, as part of a more general reorganization of the dataset section in the user guide; see #11083.

Fixes #10555