dump_svmlight_file doesn't handle csr, adds comments · Issue #1501 · scikit-learn/scikit-learn · GitHub

Closed
amueller opened this issue Dec 31, 2012 · 21 comments
Labels
Enhancement · Moderate

Comments

@amueller
Member

I just loaded an svmlight file to investigate #1476.
I tried to dump a slice but that wasn't possible without converting the matrices to dense in between.
I noticed two things: the features were backwards when trying to dump a sparse matrix, i.e. the entries in the libsvm file started with the highest non-zero feature index. This might be the cause of the problem.

Also, dump_svmlight_file adds a comment at the beginning of the file.
LibSVM, the de-facto standard SVM implementation, does not recognize these comments and I had to remove them by hand to load the file. There seems to be no way to disable the comment (judging from the docstring).
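
For reference, removing the header by hand amounts to something like this (a rough sketch; "dumped.txt" and "clean.txt" are placeholder names):

with open("dumped.txt") as src, open("clean.txt", "w") as dst:
    for line in src:
        # drop the "# Generated by ..." header lines that LibSVM does not accept
        if not line.startswith("#"):
            dst.write(line)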

@mblondel
Member
mblondel commented Jan 1, 2013

Last time I tried dump_svmlight_file, it was so slow that I had to interrupt my program. Maybe it was the same problem as the one you describe.

@larsmans
Member

I can easily remove the comment, but do you have an example script that dumps a slice?

@amueller
Member Author

@larsmans What do you mean by "dumps a slice"?

@larsmans
Member

You said "I tried to dump a slice but that wasn't possible". Is it just dump_svmlight_file(X[i:j, :]) that fails?

@amueller
Member Author

Oh, yeah, sorry. Yes, that was what I tried.

@amueller
Member Author

But I guess that problem is connected to not being able to dump CSR in the first place...

@larsmans
Member

This is very strange. I'm investigating.

@amueller
Member Author

Thanks. Sorry for not being very helpful atm, I am quite busy. I'll try to get back to you tonight.

larsmans added a commit that referenced this issue Jan 16, 2013
@larsmans
Member

OK. I fixed the comment issue, but I can't reproduce the rest of it:

>>> from sklearn.datasets import dump_svmlight_file
>>> from scipy.sparse import csr_matrix
>>> X = csr_matrix([[1,2,3],[4,5,6]])
>>> dump_svmlight_file(X, [1,2], "foo.dump")
>>> !cat foo.dump
1 0:1.000000 1:2.000000 2:3.000000
2 0:4.000000 1:5.000000 2:6.000000
>>> dump_svmlight_file(X[1,:], [2], "bar.dump")
>>> !cat bar.dump
2 0:4.000000 1:5.000000 2:6.000000

@amueller
Member Author

Thanks. I guess I used "digits", but I'm not sure any more. I'll have a look once I'm on my laptop.

@amueller
Member Author

Couldn't reproduce with digits, but with the gist from #1476.

from sklearn.datasets import load_svmlight_file, dump_svmlight_file
from sklearn.utils import shuffle

X, y = load_svmlight_file("ntcir.en.vec")
X, y = shuffle(X, y)
n_samples = X.shape[0]
X_train = X[:2]
y_train = y[:2]
dump_svmlight_file(X_train, y_train, "asdf.txt")
!cat asdf.txt

Produces

# Generated by dump_svmlight_file from scikit-learn 0.13-git
# Column indices are zero-based
-1.000000 3302:1.0000000000000000e+00 3301:1.0000000000000000e+00 3300:1.0000000000000000e+00 3299:1.0000000000000000e+00 3099:1.0000000000000000e+00 2745:1.0000000000000000e+00 2362:1.0000000000000000e+00 1566:1.0000000000000000e+00 1336:1.0000000000000000e+00 1277:1.0000000000000000e+00 1230:1.0000000000000000e+00 574:1.0000000000000000e+00 347:1.0000000000000000e+00 293:2.0000000000000000e+00 125:1.0000000000000000e+00 23:1.0000000000000000e+00 16:1.0000000000000000e+00 6:1.0000000000000000e+00
1.000000 701:1.0000000000000000e+00 700:1.0000000000000000e+00 699:1.0000000000000000e+00 698:1.0000000000000000e+00 689:1.0000000000000000e+00 685:1.0000000000000000e+00 682:1.0000000000000000e+00 681:1.0000000000000000e+00 680:1.0000000000000000e+00 679:1.0000000000000000e+00 676:1.0000000000000000e+00 675:1.0000000000000000e+00 661:1.0000000000000000e+00 652:1.0000000000000000e+00 635:1.0000000000000000e+00 594:1.0000000000000000e+00 493:2.0000000000000000e+00 404:1.0000000000000000e+00 391:1.0000000000000000e+00 365:1.0000000000000000e+00 332:1.0000000000000000e+00 295:1.0000000000000000e+00 171:1.0000000000000000e+00 145:1.0000000000000000e+00 123:1.0000000000000000e+00 96:1.0000000000000000e+00 94:1.0000000000000000e+00 66:2.0000000000000000e+00 20:2.0000000000000000e+00 17:1.0000000000000000e+00 3:1.0000000000000000e+00
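
A quick way to confirm that the column indices really come out in descending order within each row (a sketch against the same X_train as above):

import numpy as np

# check whether each CSR row has strictly ascending column indices
for i in range(X_train.shape[0]):
    row = X_train.indices[X_train.indptr[i]:X_train.indptr[i + 1]]
    print(i, bool(np.all(np.diff(row) > 0)))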

Not sure what my problem with slicing was, though :-/

@amueller
Member Author

OK, look at the bottom of #1476 for a probable explanation. Also, I'd appreciate your help there, as I'm not so good with scipy.sparse.

@amueller
Member Author

You don't need to fetch the dataset. You can just do

import scipy.sparse as sp
from sklearn.datasets import load_digits

digits = load_digits()
X = digits.data
X = sp.csc_matrix(X)
print(X.data)
# fancy row indexing reorders the underlying .data array
X_slice = X[range(X.shape[0])]
print(X_slice.data)

@amueller
Member Author

So the fix for this and for #1476 seems to be to call X.sort_indices(). Right?
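
Until that lands, a possible user-side workaround along the lines of the snippet above (a sketch; sort_indices() sorts each row's column indices in place):

from sklearn.datasets import load_svmlight_file, dump_svmlight_file
from sklearn.utils import shuffle

X, y = load_svmlight_file("ntcir.en.vec")
X, y = shuffle(X, y)
X_train, y_train = X[:2], y[:2]
X_train.sort_indices()  # restore ascending column indices within each row
dump_svmlight_file(X_train, y_train, "asdf.txt")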

@larsmans
Member

Man, this is surprising! I'm having a hard time figuring out what to do about this. Whether I dump after sorting, or sort after loading, I keep getting different matrices out.

@larsmans
Member

Oh, wait, zero-based vs. one-based just bit me...

@amueller
Member Author

I'd sort after loading and sort before dumping, just to be on the safe side, if it is not too expensive.
We need to add this to the soon-to-come test for estimators on sparse vs. dense matrices. This is really ugly.

@larsmans
Member

It's O(n lg n) in the number of non-zeros, I guess, so that's not very expensive. Too bad there's no "half-copying" version of sort(ed)_indices in scipy.sparse.
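
For reference, the two variants that do exist in scipy.sparse (a small sketch with a hand-built CSR matrix whose indices start out unsorted):

import numpy as np
from scipy.sparse import csr_matrix

# build a 2x3 CSR matrix whose column indices are deliberately out of order
data = np.array([3.0, 1.0, 2.0])
indices = np.array([2, 0, 1])   # row 0: columns 2 and 0; row 1: column 1
indptr = np.array([0, 2, 3])
X = csr_matrix((data, indices, indptr), shape=(2, 3))

Y = X.sorted_indices()   # returns a fully copied matrix with sorted indices
X.sort_indices()         # sorts indices (and data) in place, no copy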

larsmans added a commit that referenced this issue Jan 16, 2013
@larsmans
Member

Should be solved in c59af39.

@amueller
Member Author

Cool, thanks. I would leave the issue open so we can add a test later.

@larsmans
Member

Added a test.
