Problem with fetch_20newsgroups_vectorized #5088
Comments
I managed to reproduce the issue on another machine which has the old cached data as well.
The script I used for git bisect:

    from sklearn.datasets import fetch_20newsgroups_vectorized
    from lightning.classification import CDClassifier

    # Load News20 dataset from scikit-learn.
    bunch = fetch_20newsgroups_vectorized(subset="all")
    X = bunch.data
    y = bunch.target

    # Set classifier options.
    clf = CDClassifier(penalty="l1/l2",
                       loss="squared_hinge",
                       multiclass=True,
                       max_iter=20,
                       alpha=1e-4,
                       C=1.0 / X.shape[0],
                       tol=1e-3)

    # Train the model.
    clf.fit(X, y)

    # Accuracy
    print(clf.score(X, y))

Prints 0.95 when working, 0.60 when failing. When doing git bisect, I labeled all commits that triggered a re-download as good (without actually checking), so I hope the guilty commit was correctly identified. I made a copy of the old serialized data if needed.
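For anyone repeating the bisection, the good/bad decision can be automated with `git bisect run`, which treats exit code 0 as good, 125 as "skip this commit", and any other code from 1 to 127 as bad. The sketch below is only an illustration: the 0.8 threshold is an assumed cut-off between the 0.95 and 0.60 scores reported above, and `bisect_exit_code` and `run_bisect_step` are hypothetical helper names. Returning 125 when the measurement cannot be taken is one way to avoid labeling untested commits as good.

```python
import sys

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`

# Assumed cut-off between the two scores reported above (0.95 vs. 0.60).
ACCURACY_THRESHOLD = 0.8


def bisect_exit_code(accuracy, threshold=ACCURACY_THRESHOLD):
    """Map a training accuracy to a `git bisect run` exit code."""
    return GOOD if accuracy >= threshold else BAD


def run_bisect_step(measure_accuracy):
    """Call `measure_accuracy()` and translate the outcome to an exit code.

    Any exception (failed build, failed import, ...) is reported as SKIP so
    that the commit is marked neither good nor bad.
    """
    try:
        accuracy = measure_accuracy()
    except Exception as exc:
        print("cannot test this commit: %r" % (exc,), file=sys.stderr)
        return SKIP
    return bisect_exit_code(accuracy)
```

In a real run, `measure_accuracy` would wrap the reproduction script and return `clf.score(X, y)`, the file would end with `sys.exit(run_bisect_step(measure_accuracy))`, and it would be launched with something like `git bisect run python check.py` (`check.py` being a placeholder name).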
It seems to be a combination of e4b9ff6 and earlier changes that causes the problem. It's hard to find which changes, because for earlier commits, a re-download is triggered and the old cached data is overwritten... It's a strange bug. I get an X and a y of the correct shapes but the data inside seems wrong.
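Since the shapes match, comparing shapes cannot tell the stale cache apart from a fresh rebuild. One way to check the contents is to hash the arrays from both copies and compare the digests. This is a minimal sketch under assumptions: `fingerprint` is a hypothetical helper, and the sparse branch assumes `X` is a SciPy sparse matrix that can be converted to CSR.

```python
import hashlib

import numpy as np


def _update(digest, array):
    """Feed an array's dtype, shape and raw bytes into the digest."""
    array = np.ascontiguousarray(array)
    digest.update(str(array.dtype).encode())
    digest.update(str(array.shape).encode())
    digest.update(array.tobytes())


def fingerprint(X, y):
    """Return a SHA-256 hex digest of the contents of X and y."""
    digest = hashlib.sha256()
    if hasattr(X, "tocsr"):  # assumed: a SciPy sparse matrix
        X = X.tocsr(copy=True)  # copy so the caller's matrix is untouched
        X.sort_indices()  # canonical column ordering within each row
        digest.update(str(X.shape).encode())
        for part in (X.data, X.indices, X.indptr):
            _update(digest, part)
    else:
        _update(digest, np.asarray(X))
    _update(digest, np.asarray(y))
    return digest.hexdigest()


# Same shapes, different contents: the labels are reversed relative to X.
X = np.arange(12, dtype=np.float64).reshape(4, 3)
y = np.array([0, 1, 2, 3])
y_misaligned = y[::-1]

same_shape = y.shape == y_misaligned.shape
same_content = fingerprint(X, y) == fingerprint(X, y_misaligned)
```

Applied to this issue, one would compute the fingerprint of the old cached copy and of a freshly rebuilt copy; differing digests would confirm that the contents, not the shapes, changed.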
Hum... interesting. This was to fix #4435. Before "all" was just the test set (sometimes?)
I guess in the pickled version
But I do get an X and a y. It's just that the data is wrong. One possible reason would be that
I guess after all this time, this is unlikely to still be relevant. Let's close this one.
I was using the 20 newsgroups dataset on a custom estimator and the accuracy was really crappy. It took me a while to realize that the problem was not in my estimator but in fetch_20newsgroups_vectorized. I already had the data serialized on disk and it used to work fine, as far as I remember. I deleted the data to force a rebuild and now it works fine. Were there changes in the loader or in joblib recently that could explain this?
If we find the change that caused this, we could possibly force a rebuild whenever the cached data was serialized before that change.
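One common way to force such a rebuild is to store a format version next to the cached payload and to rebuild whenever the stored version is missing or differs from the current one. The sketch below illustrates the idea with plain `pickle`; it is not how scikit-learn implements its cache, and `CACHE_VERSION`, `load_or_rebuild`, and the file layout are all hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical version stamp; bump it whenever the loader changes in a way
# that makes previously cached data invalid.
CACHE_VERSION = 2


def load_or_rebuild(cache_path, build, version=CACHE_VERSION):
    """Return cached data if its version stamp matches, else rebuild it."""
    if os.path.exists(cache_path):
        try:
            with open(cache_path, "rb") as f:
                cached = pickle.load(f)
        except Exception:
            cached = None  # unreadable cache: treat as stale
        # Caches written before versioning have no stamp and count as stale.
        if isinstance(cached, dict) and cached.get("version") == version:
            return cached["payload"]

    payload = build()
    # Write to a temporary file first so a crash cannot leave a partial cache.
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump({"version": version, "payload": payload}, f)
    os.replace(tmp_path, cache_path)
    return payload
```

With this scheme, a cache written before the offending change carries no stamp (or an older one) and is rebuilt automatically on the next call.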