8000 When fetching both training and test set of the 20newsgroup dataset, by hhchen1105 · Pull Request #4435 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

When fetching both training and test set of the 20newsgroup dataset, #4435

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

When fetching both training and test set of the 20newsgroup dataset, #4435

wants to merge 1 commit into from

Conversation

hhchen1105
Copy link
Contributor

When fetching both training and test set of the 20newsgroup dataset, the data['data'], data['target'], data['filenames'] should be updated.

the data['data'], data['target'], data['filenames'] should be updated.
@landscape-bot
Copy link

Code Health
Code quality remained the same when pulling e3c9294 on hhchen1105:fix-fetch-20newsgroups-all-error into ad26ae4 on scikit-learn:master.

@coveralls
Copy link

Coverage Status

Coverage remained the same at 95.09% when pulling e3c9294 on hhchen1105:fix-fetch-20newsgroups-all-error into ad26ae4 on scikit-learn:master.

@amueller
Copy link
Member

These are automatically updated for Bunches.

bunch = Bunch(x="x")
bunch.x = "y"
bunch['x']

'y'

Did you run into any issues?

@hhchen1105
Copy link
Contributor Author

Please see the simple test below.

Before fixing:

In [1]: import sklearn.datasets

In [2]: len(sklearn.datasets.fetch_20newsgroups(subset='all')['data'])
Out[2]: 7532

In [3]: len(sklearn.datasets.fetch_20newsgroups(subset='train')['data'])
Out[3]: 11314

In [4]: len(sklearn.datasets.fetch_20newsgroups(subset='test')['data'])
Out[4]: 7532

After:

In [1]: import sklearn.datasets

In [2]: len(sklearn.datasets.fetch_20newsgroups(subset='all')['data'])
Out[2]: 18846

In [3]: len(sklearn.datasets.fetch_20newsgroups(subset='train')['data'])
Out[3]: 11314

In [4]: len(sklearn.datasets.fetch_20newsgroups(subset='test')['data'])
Out[4]: 7532

@amueller
Copy link
Member

I can reproduce, but I do not understand. Your fix is odd as the current version should be working.

@amueller amueller added the Bug label Mar 23, 2015
@amueller
Copy link
Member

Sorry I just broke your PR by doing a small fix.

@amueller
Copy link
Member

Ah, I got it. At the place where this assignment happens data is just a dict not a Bunch.
My recommended fix would be to use a Bunch, that is in line 98 use Bunch instead of dict.
The other possiblity would be to not use data.data or data.anything when data is a dict because it is really confusing.
Also a regression test would be nice.

@hhchen1105
Copy link
Contributor Author

It seems to be more subtle than that. While at line 98 cache is a dict, cache['train'] and cache['test'] are both Bunch. I'm not sure about the root cause though.

@amueller
Copy link
Member

I'm pretty sure cache can not be a bunch unless you changed the code and changed it back (it is pickled in line 100). There is only one call to the Bunch constructor, and that is for the vectorized case.

@hhchen1105
Copy link
Contributor Author

Did I misunderstand anything here?
I know cache is not a Bunch, but cache['train'] and cache['test'] are both Bunch.
At line 229, data is set to cache[subset], not cache.

@amueller
Copy link
Member

Oh, you are right.

@amueller
Copy link
Member
In [2]: len(sklearn.datasets.fetch_20newsgroups(subset='all')['data'])
Out[2]: 18846
In [3]: len(sklearn.datasets.fetch_20newsgroups(subset='all').data)
Out[3]: 7532
In [12]: type(sklearn.datasets.fetch_20newsgroups(subset='all'))
Out[12]: sklearn.datasets.base.Bunch

Looks like something is wrong with our bunch implementation.

@lesteve
Copy link
Member
lesteve commented Apr 17, 2015

This one should be closed now that #4600 has been merged.

@amueller
Copy link
Member

Thanks :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0