dump_svmlight_file doesn't handle csr, adds comments #1501
Comments
Last time I tried |
I can easily remove the comment, but do you have an example script that dumps a slice? |
@larsmans how do you mean, that dumps a slice? |
You said "I tried to dump a slice but that wasn't possible"? Is it just |
Oh, yeah, sorry. Yes, that was what I tried. |
But I guess that problem is connected to not being able to dump CSR in the first place... |
This is very strange. I'm investigating. |
Thanks. Sorry for not being very helpful atm, I am quite busy. I'll try to get back to you tonight. |
Ok. I fixed the comment issue, but I can't reproduce the rest of it:
|
Thanks. I guess I used "digits", but I'm not sure any more. I'll have a look once I'm on my laptop. |
Couldn't reproduce with digits, but I could with the gist from #1476:
from sklearn.datasets import load_svmlight_file, dump_svmlight_file
from sklearn.utils import shuffle
X, y = load_svmlight_file("ntcir.en.vec")
X, y = shuffle(X, y)
n_samples = X.shape[0]
X_train = X[:2]
y_train = y[:2]
dump_svmlight_file(X_train, y_train, "asdf.txt")
!cat asdf.txt
Produces
Not sure what my problem with slicing was, though :-/ |
Ok look at the bottom of #1476 for a probable explanation. Also I'd appreciate your help there, as I'm not so good with scipy.sparse. |
You don't need to fetch the dataset. You can just do:
import scipy.sparse as sp
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
X = sp.csc_matrix(X)
print(X.data)
X_slice = X[range(X.shape[0])]
print(X_slice.data) |
So the fix for this and for #1476 seems to be to call |
Man, this is surprising! I'm having a hard time figuring out what to do about this. Whether I dump after sorting, or sort after loading, I keep getting different matrices out. |
Oh, wait, zero-based vs. one-based just bit me... |
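The zero-based vs. one-based mix-up can be sketched in plain Python. The helpers `format_row` and `parse_row` below are hypothetical names made up for illustration and are not scikit-learn code; the real `dump_svmlight_file` and `load_svmlight_file` expose a `zero_based` parameter for the same choice. The sketch only shows that writing with one convention and reading with the other shifts every column index by one.

```python
def format_row(label, pairs, zero_based=True):
    """Format one sample as an svmlight-style line (illustrative helper)."""
    offset = 0 if zero_based else 1
    feats = " ".join(f"{col + offset}:{val:g}" for col, val in pairs)
    return f"{label:g} {feats}"


def parse_row(line, zero_based=True):
    """Parse an svmlight-style line back into (label, [(col, val), ...])."""
    offset = 0 if zero_based else 1
    label, *feats = line.split()
    pairs = []
    for feat in feats:
        col, val = feat.split(":")
        pairs.append((int(col) - offset, float(val)))
    return float(label), pairs


pairs = [(0, 1.5), (3, 2.0)]
line = format_row(1, pairs, zero_based=False)       # written one-based
same = parse_row(line, zero_based=False)[1]         # read one-based: round-trips
shifted = parse_row(line, zero_based=True)[1]       # read zero-based: columns off by one
```

Comparing `same` and `shifted` against `pairs` makes the mismatch visible, which is the kind of difference that can look like "different matrices" after a dump and reload.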
I'd sort after loading and sort before dumping, just to be on the safe side, if it is not too expensive. |
It's O(n lg n) in the number of non-zeros, I guess, so that's not very expensive. Too bad there's no "half-copying" version of |
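To make the sort-before-dumping idea concrete, here is a minimal pure-Python sketch that sorts the column indices within each row of a CSR-style `(indptr, indices, data)` triple. The function name `sort_csr_rows` and the plain-list representation are assumptions for illustration, not scikit-learn or scipy code. On real scipy.sparse CSR/CSC matrices, `sort_indices()` (in place) and `sorted_indices()` (returns a copy) do this job; whether either is the call meant in this thread is an assumption.

```python
def sort_csr_rows(indptr, indices, data):
    """Return copies of indices/data with column indices ascending in each row."""
    new_indices, new_data = [], []
    for start, end in zip(indptr[:-1], indptr[1:]):
        # Pair each column index with its value, then sort the row by column index.
        row = sorted(zip(indices[start:end], data[start:end]))
        new_indices.extend(col for col, _ in row)
        new_data.extend(val for _, val in row)
    return new_indices, new_data


# Two rows whose column indices are stored in descending order,
# similar to the "backwards" features described in this issue.
indptr = [0, 3, 5]
indices = [7, 4, 1, 9, 2]
data = [0.7, 0.4, 0.1, 0.9, 0.2]
sorted_indices, sorted_data = sort_csr_rows(indptr, indices, data)
```

Each row is sorted independently, so the total cost is bounded by O(n lg n) in the number of non-zeros, matching the estimate above.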
Should be solved in c59af39. |
Cool, thanks. I would leave the issue open so we can add a test later. |
Added a test. |
I just loaded an svmlight file to investigate #1476.
I tried to dump a slice but that wasn't possible without converting the matrices to dense in between.
I noticed two things: the features were backwards when trying to dump a sparse matrix, i.e. the entries in the libsvm file started with the highest non-zero entry. This might be the cause of the problem.
Also, dump_svmlight_file adds a comment at the beginning of the file. LibSVM, the de facto standard SVM implementation, does not recognize these comments, and I had to remove them by hand to load the file. There seems to be no way to disable the comment (judging from the docstring).
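As a stopgap for the manual removal described above, a small filter can drop the comment lines before handing the file to LibSVM. This is only a sketch: `strip_comment_lines` is a hypothetical helper, not part of scikit-learn, and the header text in the sample data is made up. It handles full-line comments and blank lines only, not trailing `#` comments at the end of a data line.

```python
def strip_comment_lines(lines):
    """Yield only lines that are non-blank and do not start with '#'."""
    for line in lines:
        if line.strip() and not line.lstrip().startswith("#"):
            yield line


# Sample input; the comment lines are invented placeholders.
raw = [
    "# some header comment\n",
    "#\n",
    "1 1:0.5 3:2\n",
    "-1 2:1\n",
]
cleaned = list(strip_comment_lines(raw))
```

Because the function accepts any iterable of lines, it can be applied to an open file object and the result written out with `writelines`.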