Problem with fetch_20newsgroups_vectorized · Issue #5088 · scikit-learn/scikit-learn

Closed
mblondel opened this issue Aug 5, 2015 · 6 comments

@mblondel
Member
mblondel commented Aug 5, 2015

I was using the 20 newsgroups dataset on a custom estimator and the accuracy was really crappy. It took me a while to realize that the problem was not in my estimator but in fetch_20newsgroups_vectorized. I already had the data serialized on disk and it used to work fine, as far as I remember. I deleted the data to force a rebuild and now it works fine.

Were there changes in the loader or in joblib recently that could explain this?

If we find the change that caused this, we could possibly force a rebuild for when the data was serialized prior to this change.
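One way to get such a forced rebuild is to tag the cached payload with a format version and rebuild whenever the tag is missing or out of date. The sketch below only illustrates the idea: `CACHE_FORMAT_VERSION`, `load_or_rebuild`, and the payload layout are hypothetical names, not part of scikit-learn's loader.

```python
import os
import pickle
import tempfile

# Hypothetical version tag: bump it whenever the cached layout changes.
CACHE_FORMAT_VERSION = 2


def load_or_rebuild(path, build):
    """Return the cached data if it carries the current version tag.

    Otherwise call build(), store the result with the tag, and return it.
    """
    try:
        with open(path, "rb") as f:
            payload = pickle.load(f)
        if isinstance(payload, dict) and payload.get("version") == CACHE_FORMAT_VERSION:
            return payload["data"]
    except (OSError, EOFError, AttributeError, KeyError, pickle.UnpicklingError):
        pass  # missing or unreadable cache: fall through and rebuild

    data = build()
    with open(path, "wb") as f:
        pickle.dump({"version": CACHE_FORMAT_VERSION, "data": data}, f)
    return data


if __name__ == "__main__":
    cache_path = os.path.join(tempfile.mkdtemp(), "cache.pkl")
    # Simulate a cache written before the version tag existed.
    with open(cache_path, "wb") as f:
        pickle.dump(["old", "untagged", "data"], f)
    print(load_or_rebuild(cache_path, lambda: ["fresh"]))  # prints ['fresh']
```

An untagged or older cache is treated the same as a missing one, so users with stale files on disk never see them loaded silently.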

@mblondel
Member Author
mblondel commented Aug 5, 2015

I managed to reproduce the issue on another machine which has the old cached data as well.
git bisect says that the problem was introduced in e4b9ff6. If I go back to the parent commit, which is 75a0d2c, for some reason, the cache fails to load and a re-download is triggered:

$ python test_news20.py 
________________________________________________________________________________
Cache loading failed
________________________________________________________________________________
'__setstate__'
No handlers could be found for logger "sklearn.datasets.twenty_newsgroups"
0.959089461955

The script I used for git bisect:

from sklearn.datasets import fetch_20newsgroups_vectorized
from lightning.classification import CDClassifier

# Load News20 dataset from scikit-learn.
bunch = fetch_20newsgroups_vectorized(subset="all")
X = bunch.data
y = bunch.target

# Set classifier options.
clf = CDClassifier(penalty="l1/l2",
                  loss="squared_hinge",
                  multiclass=True,
                  max_iter=20,
                  alpha=1e-4,
                  C=1.0 / X.shape[0],
                  tol=1e-3)

# Train the model.
clf.fit(X, y)

# Accuracy
print(clf.score(X, y))

Prints 0.95 when working, 0.60 when failing.

When doing git bisect, I labeled all commits that triggered a re-download as good (without actually checking), so I hope the guilty commit was correctly identified.
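As an alternative to labeling re-download commits as good, `git bisect run` accepts exit code 125 to mean "skip this commit". A minimal sketch of that mapping follows; `bisect_exit_code`, the `cache_was_rebuilt` flag, and the 0.9 threshold are hypothetical, and how the re-download is detected is left to the calling script.

```python
# Exit codes understood by `git bisect run`:
# 0 marks the commit good, 125 skips it, and 1 (like most other
# non-zero codes up to 127) marks it bad.
GOOD, BAD, SKIP = 0, 1, 125


def bisect_exit_code(accuracy, cache_was_rebuilt, threshold=0.9):
    """Map one test run to an exit code for `git bisect run`.

    A commit that rebuilt the cache cannot be judged against the old
    serialized data, so it is skipped rather than labeled good.
    """
    if cache_was_rebuilt:
        return SKIP
    return GOOD if accuracy >= threshold else BAD


if __name__ == "__main__":
    print(bisect_exit_code(0.95, cache_was_rebuilt=False))  # prints 0
    print(bisect_exit_code(0.60, cache_was_rebuilt=False))  # prints 1
    print(bisect_exit_code(0.95, cache_was_rebuilt=True))   # prints 125
```

A test script would end with `sys.exit(bisect_exit_code(...))` and be driven with `git bisect run python <script>`. Because earlier commits overwrite the cache, the saved copy of the old data would also have to be restored before each step.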

I made a copy of the old serialized data in case it is needed.

CC @jfraj @ogrisel

@mblondel
Member Author
mblondel commented Aug 5, 2015

It seems to be a combination of e4b9ff6 and earlier changes that causes the problem. It's hard to find which changes, because for earlier commits, a re-download is triggered and the old cached data is overwritten... It's a strange bug. I get an X and a y of the correct shapes but the data inside seems wrong.

@amueller
Member
amueller commented Aug 5, 2015

Hum... interesting. This was to fix #4435. Before that, "all" was just the test set (sometimes?).

@amueller
Member
amueller commented Aug 5, 2015

I guess in the pickled version data['train'] and data.train are inconsistent, and unpickling them breaks because they need to be the same. We could possibly override __setstate__ on the Bunch and check whether, err, self and self.__dict__ (?!) are consistent?
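To make the idea concrete, here is a rough sketch of a dict-with-attribute-access container whose `__setstate__` discards the pickled instance `__dict__`, plus a consistency check. `MiniBunch` and `is_consistent` are hypothetical stand-ins written for illustration; this is not scikit-learn's `Bunch` code.

```python
class MiniBunch(dict):
    """Hypothetical dict subclass exposing its keys as attributes."""

    def __init__(self, **kwargs):
        super().__init__(kwargs)

    def __setattr__(self, key, value):
        self[key] = value

    def __getattr__(self, key):
        # Only reached when normal attribute lookup fails.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

    def __setstate__(self, state):
        # Deliberately ignore the pickled instance __dict__, so that stale
        # attributes can never shadow the dictionary items.
        pass


def is_consistent(bunch):
    """True if no entry in the instance __dict__ disagrees with a key."""
    return all(bunch[k] == v for k, v in vars(bunch).items() if k in bunch)


if __name__ == "__main__":
    # Simulate the suspected inconsistency: a stale attribute shadows a key.
    b = MiniBunch(train="fresh")
    b.__dict__["train"] = "stale"
    print(b.train, b["train"], is_consistent(b))  # prints: stale fresh False

    # With the override, restoring stale state leaves the items untouched.
    restored = MiniBunch(train="fresh")
    restored.__setstate__({"train": "stale"})
    print(restored.train, is_consistent(restored))  # prints: fresh True
```

Ignoring the state is the simplest option; the override could instead call `is_consistent` and raise or warn when the two views disagree.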

@mblondel
Member Author
mblondel commented Aug 6, 2015

But I do get an X and a y. It's just that the data is wrong. One possible reason would be that X and y don't use the same shuffling and so the labels don't match the right instances. This could be the case because y is loaded from the raw data every time, while X is pickled to disk. So maybe my old pickled X is not shuffled the same way as y.
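The hypothesis is easy to illustrate on toy data: shuffling X and y with the same permutation keeps every label attached to its sample, while two different permutations keep the shapes but scramble the pairing. The sketch below uses only the standard library; the data, the seeds, and the helper names are made up for illustration and do not reproduce the actual loader.

```python
import random


def shuffled_indices(n, seed):
    """Return a seeded permutation of range(n)."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return idx


def alignment_rate(X, y, true_label):
    """Fraction of rows whose label equals true_label(row)."""
    return sum(true_label(x) == t for x, t in zip(X, y)) / len(X)


# Toy data: sample i has feature i and label i % 20 (20 classes).
n = 1000
features = list(range(n))
labels = [i % 20 for i in features]

perm_cached = shuffled_indices(n, seed=0)  # order baked into the cached X
perm_fresh = shuffled_indices(n, seed=1)   # order applied to a reloaded y

X_cached = [features[i] for i in perm_cached]
y_same = [labels[i] for i in perm_cached]
y_other = [labels[i] for i in perm_fresh]

# Same lengths in both cases, but only the shared permutation stays aligned.
print(len(X_cached) == len(y_same) == len(y_other))
print(alignment_rate(X_cached, y_same, lambda x: x % 20))
print(alignment_rate(X_cached, y_other, lambda x: x % 20))
```

With a shared permutation the rate is exactly 1.0; with independent permutations it drops to roughly chance level, which is the kind of failure a shape check cannot catch.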

@lesteve
Member
lesteve commented Jan 21, 2025

I guess after all this time, this is unlikely to still be relevant. Let's close this one.

@lesteve closed this as not planned on Jan 21, 2025