Reorganize dataset section in user guide #11083
Comments
I also can't follow the logic of the current organization and agree that having a more hierarchical organization of sections would be better. |
I also agree and I think that @qinhanmin2014 will make more sense than the current TOC |
seems a good idea for users trying to find a dataset relevant to a
particular task. I don't think fetchers and loaders should be
distinguished, but generic from specific and generation from real world...
|
I don't think there is a logic to the TOC, and hierarchy seems good. |
Thanks all for the reply here :)
Agree. By saying |
Whereas I think most users don't care about a download that happens once.
Therefore the main thing that characterises built-in datasets is their
being small!
|
Hi, I tried a more hierarchical structure for this section as suggested earlier, with two main subsections: loading datasets and generating datasets. I also added a distinction between toy datasets and real-world datasets. However, this results in a tree of depth 5, which is quite unreadable. |
@jeremiedbb Thanks for taking this up. It will be better if you update your PR accordingly to give us something to review.
I think it's acceptable. The main concern is to make the document easier for users to follow (though it seems that depth 4, e.g. 5.1 dataset loaders, 5.1.1 toy datasets, 5.1.1.1 iris, is enough?).
Since we're reorganizing, I don't think it's good to remove as many things as your PR does. |
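To make the depth question concrete, here is a minimal sketch that numbers a nested outline the way the rendered guide would. The `number_outline` helper and the starting chapter number `5` are assumptions for illustration only; the outline contains just the example path mentioned above.

```python
def number_outline(tree, prefix="5"):
    """Yield (number, title) pairs for a nested dict outline, depth-first.

    `tree` maps a section title to a dict of its subsections.
    `prefix` is the number of the enclosing section (assumed to be chapter 5).
    """
    for i, (title, children) in enumerate(tree.items(), start=1):
        number = f"{prefix}.{i}"
        yield number, title
        # Recurse into subsections, extending the numbering.
        yield from number_outline(children, number)


# Only the example path from the comment above; a real outline would be wider.
outline = {"dataset loaders": {"toy datasets": {"iris": {}}}}

entries = list(number_outline(outline))
# Depth of an entry = number of dot-separated components in its number.
max_depth = max(number.count(".") + 1 for number, _ in entries)

for number, title in entries:
    print(number, title)
```

With this outline the deepest entry is `5.1.1.1 iris`, i.e. depth 4; a dataset with its own subsections would push the tree to depth 5.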
My first PR was just a test, sorry about that. I updated the PR with a quick draft of the structure. I only filled in a few datasets for now to show how it's rendered. I find the 5.x.x.x subsections hard to distinguish from the rest. Moreover, some datasets have deeper subsections, like the 20 newsgroups dataset. Do you find this structure more appropriate? |
Another confusing thing is that many of the dataset descriptions are included from |
And since they are rather short files, I also don't really see a reason to not include the text directly in More in general, so it seems we have for each dataset different places where they are described:
So for the bigger datasets it means that you have information in 3 different places in the code (module docstring, function docstring, user guide entry), which could certainly use some reduction. But from a user point of view, my question is also: do we need to include all those explanations in |
In addition, there are inconsistencies between the I propose to make a descr file in In the meantime, clean up the docstrings of all the 'fetch' modules and functions since they won't appear in Doing that, all information about the datasets would be in Do you agree with that? |
yes, leave the cleanup for another PR
I intend to check out this PR on a desktop and merge if happy. No point
delaying on such changes.
|
There are now 2 distinct PRs. |
I think the dataset section in user guide (http://scikit-learn.org/dev/datasets/index.html) is hard to follow. See the subsections:
I can't find the logic behind such organization.
I think we should first divide the section into two subsections: one for dataset loaders, one for sample generators. For dataset loaders, we can further divide according to fetch_ and load_. For sample generators, we can further divide according to different tasks (e.g., regression, classification).
Related to #10555
What do others think?
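As a rough sketch of the proposed split, the snippet below sorts a few function names from `sklearn.datasets` into sections by prefix. The `section_for` helper and the `GENERATOR_TASK` mapping are hypothetical and written by hand for illustration; only the function names themselves come from the library, and nothing is imported or downloaded.

```python
# Hand-written, hypothetical mapping from sample generator to task.
GENERATOR_TASK = {
    "make_classification": "classification",
    "make_regression": "regression",
}


def section_for(name):
    """Return (section, subsection) for a dataset function name."""
    if name.startswith("load_"):
        return ("dataset loaders", "load_")
    if name.startswith("fetch_"):
        return ("dataset loaders", "fetch_")
    if name in GENERATOR_TASK:
        return ("sample generators", GENERATOR_TASK[name])
    raise ValueError(f"no section defined for {name!r}")


names = ["load_iris", "fetch_20newsgroups", "make_classification", "make_regression"]

# Build the two-level table of contents: section -> subsection -> names.
toc = {}
for name in names:
    section, subsection = section_for(name)
    toc.setdefault(section, {}).setdefault(subsection, []).append(name)

for section, subsections in toc.items():
    print(section)
    for subsection, members in subsections.items():
        print("   ", subsection, members)
```

Loaders can be grouped mechanically from their prefix, while generators need an explicit task label, which is one reason the two subsections would be organized differently.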