Problem with fetch_20newsgroups_vectorized #5088

mblondel · 2015-08-05T06:09:08Z

I was using the 20 newsgroups dataset on a custom estimator and the accuracy was really crappy. It took me a while to realize that the problem was not in my estimator but in fetch_20newsgroups_vectorized. I already had the data serialized on disk and it used to work fine, as far as I remember. I deleted the data to force a rebuild and now it works fine.

Were there changes in the loader or in joblib recently that could explain this?

If we find the change that caused this, we could possibly force a rebuild for when the data was serialized prior to this change.
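One way to force such a rebuild can be sketched as follows. This is a minimal illustration, not the loader's actual code: `CACHE_FORMAT_VERSION`, `load_or_rebuild`, and the `{"version": ..., "data": ...}` payload layout are all hypothetical names introduced here. The idea is to store a format version next to the data and to treat any cache without the current version as stale.

```python
import pickle

# Hypothetical constant: bump it whenever the loader's output changes.
CACHE_FORMAT_VERSION = 2


def load_or_rebuild(cache_path, build):
    """Return cached data written with the current format, else rebuild it."""
    try:
        with open(cache_path, "rb") as f:
            payload = pickle.load(f)
        # Caches written before versioning was introduced have no "version"
        # key (or are not dicts at all), so they fall through to a rebuild.
        if isinstance(payload, dict) and payload.get("version") == CACHE_FORMAT_VERSION:
            return payload["data"]
    except (OSError, EOFError, pickle.UnpicklingError, AttributeError, KeyError):
        pass  # missing or unreadable cache: rebuild below

    data = build()
    with open(cache_path, "wb") as f:
        pickle.dump({"version": CACHE_FORMAT_VERSION, "data": data}, f)
    return data
```

With this layout, an old cache that loads without error but holds data in the old format is still discarded, which is the case that matters here.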


mblondel · 2015-08-05T15:00:11Z

I managed to reproduce the issue on another machine which has the old cached data as well.
git bisect says that the problem was introduced in e4b9ff6. If I go back to the parent commit, which is 75a0d2c, for some reason, the cache fails to load and a re-download is triggered:

$ python test_news20.py 
________________________________________________________________________________
Cache loading failed
________________________________________________________________________________
'__setstate__'
No handlers could be found for logger "sklearn.datasets.twenty_newsgroups"
0.959089461955

The script I used for git bisect:

from sklearn.datasets import fetch_20newsgroups_vectorized
from lightning.classification import CDClassifier

# Load News20 dataset from scikit-learn.
bunch = fetch_20newsgroups_vectorized(subset="all")
X = bunch.data
y = bunch.target

# Set classifier options.
clf = CDClassifier(penalty="l1/l2",
                  loss="squared_hinge",
                  multiclass=True,
                  max_iter=20,
                  alpha=1e-4,
                  C=1.0 / X.shape[0],
                  tol=1e-3)

# Train the model.
clf.fit(X, y)

# Accuracy
print(clf.score(X, y))

Prints 0.95 when working, 0.60 when failing.

When doing git bisect, I labeled all commits that triggered a re-download as good (without actually checking), so I hope the guilty commit was correctly identified.

I made a copy of the old serialized data if needed.

CC @jfraj @ogrisel

mblondel · 2015-08-05T15:29:50Z

It seems to be a combination of e4b9ff6 and earlier changes that causes the problem. It's hard to find which changes, because for earlier commits, a re-download is triggered and the old cached data is overwritten... It's a strange bug. I get an X and a y of the correct shapes but the data inside seems wrong.

amueller · 2015-08-05T15:50:55Z

Hum... interesting. That commit was to fix #4435. Before it, "all" was just the test set (sometimes?)

amueller · 2015-08-05T15:55:19Z

I guess in the pickled version data['train'] and data.train are inconsistent, and unpickling them breaks because they need to be the same. We could possibly override __setstate__ on the Bunch and check whether self and self.__dict__ (?!) are consistent?
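A rough sketch of the `__setstate__` idea, using a simplified stand-in class rather than scikit-learn's actual `Bunch`: the class below is an assumption made for illustration. It takes the simplest route of ignoring the pickled instance `__dict__` entirely, so attribute reads always fall back to the dict items and the two views cannot disagree after loading.

```python
import pickle


class Bunch(dict):
    """Dict whose keys are also readable as attributes (illustrative stand-in)."""

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

    def __setattr__(self, key, value):
        self[key] = value

    def __setstate__(self, state):
        # Deliberately a no-op: drop any pickled instance __dict__, so that
        # bunch.key is always served by __getattr__ from the dict items.
        pass


# Simulate an old, inconsistent object: the instance __dict__ and the
# dict items disagree about "train".
old = Bunch(train="fresh")
old.__dict__["train"] = "stale"

restored = pickle.loads(pickle.dumps(old))
print(restored["train"], restored.train)
```

Without the no-op `__setstate__`, the default unpickling behaviour would restore the stale instance `__dict__`, and `restored.train` would shadow `restored["train"]`.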

mblondel · 2015-08-06T04:05:32Z

But I do get an X and a y. It's just that the data is wrong. One possible reason would be that X and y don't use the same shuffling and so the labels don't match the right instances. This could be the case because y is loaded from the raw data every time, while X is pickled to disk. So maybe my old pickled X is not shuffled the same way as y.
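The effect of this hypothesis can be illustrated with toy data. This is only a sketch of the suspected failure mode, not the loader's code: `shuffled`, `agreement`, and the seeds are made up for the example. Each "feature" is simply its true class, so a correct pairing scores 1.0, and any drop comes purely from X and y being shuffled differently.

```python
import random


def shuffled(items, seed):
    """Return a shuffled copy of items, using a dedicated RNG with this seed."""
    out = list(items)
    random.Random(seed).shuffle(out)
    return out


def agreement(features, labels):
    """Fraction of rows whose label matches the class encoded in the feature."""
    return sum(feat == lab for feat, lab in zip(features, labels)) / len(labels)


# Toy data: 2000 samples spread over 20 classes.
labels = [i % 20 for i in range(2000)]
features = list(labels)

# Same permutation for X and y: rows stay aligned.
same = agreement(shuffled(features, seed=0), shuffled(labels, seed=0))

# X shuffled one way (e.g. an old pickle), y another: rows are misaligned.
different = agreement(shuffled(features, seed=0), shuffled(labels, seed=1))

print(same, different)
```

The shapes are identical in both cases, which matches the symptom described: X and y look fine, but the labels no longer belong to their rows.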

lesteve · 2025-01-21T05:45:45Z

I guess after all this time, this is unlikely to still be relevant. Let's close this one.

cmarmo added the module:datasets label Dec 1, 2021

thomasjpfan moved this to Todo📬 in Quansight's scikit-learn Project Board Apr 15, 2022

thomasjpfan added this to Quansight's scikit-learn Project Board Apr 15, 2022

lesteve closed this as not planned Jan 21, 2025

github-project-automation bot moved this from Todo📬 to Done🚀 in Quansight's scikit-learn Project Board Jan 21, 2025
