8000 [MRG+1] Implementing fix for fetching 20newsgroup dataset by jfraj · Pull Request #4600 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG+1] Implementing fix for fetching 20newsgroup dataset #4600

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 16, 2015

Conversation

jfraj
Copy link
Contributor
@jfraj jfraj commented Apr 15, 2015

Fixing issue #4435 :
the fetched bunch behaved differently for .data and ['data']
(similary for target and filenames)
Fixed by updating both in the fetching (same as was suggested in the issue)
I copied the (three lines) changes from commit e3c9294
instead of doing a rebase.

Also test has been added to check that both .data and ['data'] have the same length
(testing also target and filenames)

@amueller
Copy link
Member

this is the same fix as #4435, right? Did you find out what was wrong with the bunch?
I really don't like this "fix" as it is hiding a bug in our bunch implementation.

@ogrisel
Copy link
Member
ogrisel commented Apr 15, 2015

Yes this is the same fix as #4435 but made on top of current master (manual rebase) and with a test.

The bunch is just a 3-lines hack is not really meant to be "read / write" but only "read only". The problem is that this specific data loader is modifying the attributes of a Bunch instance inplace which is not supported. To get it to work correctly one as to do a setattr when doing a __setitem__ and vice versa. We could fix the Bunch object to do that by default. I am not sure we want to encourage people to use the Bunch object by making it too feature-rich though.

@@ -38,6 +38,19 @@ def test_20news():
entry2 = data.data[np.where(data.target == label)[0][0]]
assert_equal(entry1, entry2)

def test_20news_length():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be more explicit to name that test test_20news_length_consistency.

@jfraj jfraj force-pushed the bug_20newsgroup branch from 37dbd8c to a4d0c0f Compare April 15, 2015 22:59
@@ -18,6 +18,7 @@
from os.path import isdir
from os import listdir
from os import makedirs
import re
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not have been committed.

Adding set/getattr methods that fill/query the same thing as `bunch[key]`.

Add test for a non-regression bug in fetch_20newsgroups.
@jfraj jfraj force-pushed the bug_20newsgroup branch from a4d0c0f to 4161d55 Compare April 15, 2015 23:05
@jfraj
Copy link
Contributor Author
jfraj commented Apr 15, 2015

The above suggestions have been implemented. Bunch "bug" has been fixed.

@ogrisel ogrisel changed the title Implementing fix for fetching 20newsgroup dataset [MRG+1] Implementing fix for fetching 20newsgroup dataset Apr 15, 2015
@ogrisel
Copy link
Member
ogrisel commented Apr 15, 2015

Looks good to me. Thanks for the fix and the test! (if travis is green)

@agramfort
Copy link
Member

thanks !

agramfort added a commit that referenced this pull request Apr 16, 2015
[MRG+1] Implementing fix for fetching 20newsgroup dataset
@agramfort agramfort merged commit 1f74f1b into scikit-learn:master Apr 16, 2015
@amueller
Copy link
Member

Thanks for the fix.
Can someone explain to the slow me why this is necessary?

with the old implementation, I did

bunch = Bunch(x="x")
bunch.x = "y"
bunch['x']

'y'

Why did this work? Why is the current addition necessary?

@jfraj
Copy link
Contributor Author
jfraj commented Apr 16, 2015

Maybe is related to pickle (the bunch is loaded from pickle), I will do some tests...

@jfraj
Copy link
Contributor Author
jfraj commented Apr 16, 2015

Indeed, the feature comes with pickle:

bunch = Bunch(x="x")
b_pkl = pickle.dumps(bunch)
bunch_from_pkl = pickle.loads(b_pkl)
bunch_from_pkl.x = "y"
print(bunch_from_pkl['x'])
print(bunch_from_pkl.x)
x
y

I will add a test to test_base.py and make a PR

@amueller
Copy link
Member

"interesting" ^^

@amueller
Copy link
Member

Thanks for the fix / test :)

jfraj added a commit to jfraj/scikit-learn that referenced this pull request Apr 16, 2015
jfraj added a commit to jfraj/scikit-learn that referenced this pull request Apr 16, 2015
Updated Bunch getattr definition to raise the appropriate error
(should raise Attribute error not KeyError)

See issue scikit-learn#4600
jfraj added a commit to jfraj/scikit-learn that referenced this pull request Apr 16, 2015
Updated Bunch getattr definition to raise the appropriate error
(should raise Attribute error not KeyError)

See issue scikit-learn#4600
ogrisel added a commit that referenced this pull request Apr 16, 2015
Fix (sub)issue #4600 adding test to Bunch class
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0