fetch_20newsgroups 'remove' argument (Python 2.7.11) #6196
Comments
I can reproduce. Looks like this bug was introduced in scikit-learn 0.17 and is still in master.
Actually @boskaiolo can you try removing

From some little investigation, I believe this is related to #4600, which changed the Bunch class. So basically, if you start from a fresh ~/scikit_learn_data, run your snippet with scikit-learn 0.16.1 (meaning you download the data) and then run your snippet with 0.17 (meaning you load the data from the saved joblib pickle), you'll see the problem. If you download with 0.17, everything works fine.
I can confirm: after removing the file, downloaded with scikit-learn (<0.17), the same code works.
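For anyone hitting the same thing, here is a rough sketch of the cleanup described above. It assumes the cached 20 newsgroups pickle sits directly inside the data directory (by default ~/scikit_learn_data, as mentioned earlier) and that its file name starts with `20news-bydate`; both the prefix and the helper name `remove_stale_20news_cache` are assumptions for illustration, so check what is actually in your data directory before deleting anything.

```python
import os

# Assumption: the cache file name starts with this prefix.
CACHE_PREFIX = "20news-bydate"


def remove_stale_20news_cache(data_home=None):
    """Delete cached 20 newsgroups files and return the removed paths."""
    if data_home is None:
        # Default data directory mentioned in the thread.
        data_home = os.path.expanduser("~/scikit_learn_data")
    removed = []
    if not os.path.isdir(data_home):
        return removed
    for name in sorted(os.listdir(data_home)):
        path = os.path.join(data_home, name)
        # Only touch regular files matching the assumed cache prefix.
        if name.startswith(CACHE_PREFIX) and os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return removed
```

Once the stale file is gone, the next call to fetch_20newsgroups should download the data again and write a fresh cache with the installed scikit-learn version.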
I reckon this one should be closed. This looks like a gotcha that we will have to live with when people transition from 0.16 to 0.17 I am afraid ...
For completeness, here is a snippet I used to investigate further. As noted in #4600 (comment), there is some interesting interaction between the changes in the Bunch class and what happens when we load a Bunch from a pickle that was generated before these changes.

import pickle

import sklearn
from sklearn.datasets.base import Bunch


def write():
    # Pickle a Bunch with the currently installed scikit-learn version.
    filename = '/tmp/test_{}.pkl'.format(sklearn.__version__)
    bunch = Bunch(key='original')
    with open(filename, 'wb') as f:
        pickle.dump(bunch, f)


def read():
    print('Bunch pickles read with scikit-learn {}\n{}'.format(
        sklearn.__version__, 80 * '-'))
    for sklearn_version_used_for_pickling in ('0.16.1', '0.17'):
        print('Bunch pickled with scikit-learn version: {}'.format(
            sklearn_version_used_for_pickling))
        filename = '/tmp/test_{}.pkl'.format(
            sklearn_version_used_for_pickling)
        with open(filename, 'rb') as f:
            bunch = pickle.load(f)
        # Change the attribute, then compare attribute and key access.
        bunch.key = 'changed'
        print('bunch.key: {}'.format(bunch.key))
        print("bunch['key']: {}".format(bunch['key']))


# write()
read()

You'll need to write the pickles for both sklearn 0.16 and 0.17 first. Output for sklearn 0.16:
This was the bug fixed in #4600: when the Bunch is loaded from a pickle, changing an attribute of the bunch doesn't change the value associated with the key. Output for sklearn 0.17:
In the case "Bunch pickled with 0.16 and read with 0.17" the huge WTF is that you do
I understood the problem a bit more: 0.16 Bunch objects have a non empty

I opened a PR with a fix. For completeness, here is the Bunch class code for 0.16.1 and 0.17.
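To make the interaction a bit more concrete, here is a toy sketch that does not use scikit-learn at all. It assumes the leftover state carried by old pickles ends up in the instance `__dict__`, and it mimics unpickling by hand instead of reading a real 0.16 pickle. The names `StaleBunch`, `FixedBunch` and `restore` are made up for this illustration, and ignoring the state in `__setstate__` is just one plausible way to address it, not necessarily what the PR does.

```python
class StaleBunch(dict):
    """Hypothetical stand-in: attribute access is forwarded to dict keys."""

    def __setattr__(self, key, value):
        self[key] = value

    def __getattr__(self, key):
        # Only called when normal attribute lookup finds nothing.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)


class FixedBunch(StaleBunch):
    """Same class, but any pickled instance state is ignored."""

    def __setstate__(self, state):
        # Deliberately drop the state so nothing lands in __dict__.
        pass


def restore(cls, items, state):
    """Mimic unpickling by hand: fill the dict items, then apply the state."""
    obj = cls(items)
    setstate = getattr(obj, "__setstate__", None)
    if setstate is not None:
        setstate(state)
    else:
        # Default behaviour: the state is merged into the instance __dict__.
        obj.__dict__.update(state)
    return obj


# Assumed shape of the state left over from an old pickle.
old_state = {"key": "original"}

stale = restore(StaleBunch, {"key": "original"}, old_state)
stale.key = "changed"  # written to the dict key, not to __dict__
print(stale["key"], stale.key)

fixed = restore(FixedBunch, {"key": "original"}, old_state)
fixed.key = "changed"
print(fixed["key"], fixed.key)
```

With the stale state in place, `stale["key"]` picks up the change while `stale.key` keeps returning the old value, because normal attribute lookup finds the `__dict__` entry before `__getattr__` is ever consulted. In `FixedBunch` both access paths agree.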
* tag '0.17.1': (29 commits) Release 0.17.1 MAINT remove non-existing cache folder in 0.17.X branch FIX cythonize TSNE MAINT simplify freeing logic for Barnes-Hut SNE memory leak fix Fix memory leak in Barnes-Hut SNE FIX check_build_doc.py false positive detections MAINT more informative output to circle/check_build_doc.py FIX fetch_california_housing FIX in randomized_svd flip sign Updated examples and tests that use scipy's lena DOC whats_new entry for scikit-learn#6258 fix joblib error in LatentDirichletAllocation MAINT fix / speedup travis on 0.17.X MAINT Upgrade pip in appveyor and display version DOC missing changelog entry for scikit-learn#5857 DOC add fix for scikit-learn#6147 to the changelog FIX 6147: ensure that AUC is always a float TST non-regression test for scikit-learn#6147, roc_auc on memmap data Added changelog entry about scikit-learn#6196 Fix reading of bunch pickles ...
On Python 2.7.11, the remove argument of the fetch_20newsgroups function doesn't work.
Here's an example (you can replace '10' with another index; the problem appears again):
Although the removal of headers, footers and quotes is requested, this is the output:
The problem doesn't appear in Python 3.5.1.