8000 TST Extend tests for `scipy.sparse.*array` in `sklearn/svm/tests/test_sparse.py` by work-mohit · Pull Request #27511 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

TST Extend tests for scipy.sparse.*array in sklearn/svm/tests/test_sparse.py #27511

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

work-mohit
Copy link
Contributor

Refer Issue #27090

currently a draft, would love to take any help.

@github-actions
Copy link
github-actions bot commented Oct 1, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 52222e7. Link to the linter CI: here

@work-mohit
Copy link
Contributor Author

@Charlie-XIAO would love to take any help from you whenever you feel free.
I'm confused in two functions test_svc() and test_sparse_oneclasssvm .

@Charlie-XIAO
Copy link
Contributor
Charlie-XIAO commented Oct 1, 2023

@work-mohit Personally I would suggest refactoring these two tests a bit because I 8000 find them hard to read (not sure if maintainers agree).

First add X_blobs, y_blobs = make_blobs(n_samples=100, centers=10, random_state=0) to be beginning because I think it should be at the same level as the other datasets (X, X2, iris).

Then, I think we can refactor check_svm_model_equal because dense_svm and sparse_svm seems to be exactly the same model (though different copies). For instance:

def check_svm_model_equal(svm_model, X_train, y_train, X_test):
    # Use the original svm model for dense fit and clone an exactly same
    # svm model for sparse fit
    sparse_svm = base.clone(svm_model)

    svm_model.fit(X_train.toarray(), y_train)
    if sparse.issparse(X_test):
        X_test_dense = X_test.toarray()
    else:
        X_test_dense = X_test
    sparse_svm.fit(X_train, y_train)
    assert sparse.issparse(sparse_svm.support_vectors_)
    assert sparse.issparse(sparse_svm.dual_coef_)
    assert_array_almost_equal(
        svm_model.support_vectors_, sparse_svm.support_vectors_.toarray()
    )
    assert_array_almost_equal(svm_model.dual_coef_, sparse_svm.dual_coef_.toarray())
    if svm_model.kernel == "linear":
        assert sparse.issparse(sparse_svm.coef_)
        assert_array_almost_equal(svm_model.coef_, sparse_svm.coef_.toarray())
    assert_array_almost_equal(svm_model.support_, sparse_svm.support_)
    assert_array_almost_equal(
        svm_model.predict(X_test_dense), sparse_svm.predict(X_test)
    )
    assert_array_almost_equal(
        svm_model.decision_function(X_test_dense), sparse_svm.decision_function(X_test)
    )
    assert_array_almost_equal(
        svm_model.decision_function(X_test_dense),
        sparse_svm.decision_function(X_test_dense),
    )
    if isinstance(svm_model, svm.OneClassSVM):
        msg = "cannot use sparse input in 'OneClassSVM' trained on dense data"
    else:
        assert_array_almost_equal(
            svm_model.predict_proba(X_test_dense), sparse_svm.predict_proba(X_test), 4
        )
        msg = "cannot use sparse input in 'SVC' trained on dense data"
    if sparse.issparse(X_test):
        with pytest.raises(ValueError, match=msg):
            svm_model.predict(X_test)

Then test_svc can be simplified as follows (I used both LIL and CSR because X and X2 used the former while iris and blob used the latter; this is subject to further discussion):

@skip_if_32bit
@pytest.mark.parametrize(
    "X_train, y_train, X_test",
    [
        [X, Y, T],
        [X2, Y2, T2],
        [X_blobs[:80], y_blobs[:80], X_blobs[80:]],
        [iris.data, iris.target, iris.data],
    ],
)
@pytest.mark.parametrize("kernel", ["linear", "poly", "rbf", "sigmoid"])
@pytest.mark.parametrize("sparse_container", CSR_CONTAINERS + LIL_CONTAINERS)
def test_svc(X_train, y_train, X_test, kernel, sparse_container):
    """Check that sparse SVC gives the same result as SVC."""
    X_train = sparse_container(X_train)

    clf = svm.SVC(
        gamma=1,
        kernel=kernel,
        probability=True,
        random_state=0,
        decision_function_shape="ovo",
    )
    check_svm_model_equal(clf, X_train, y_train, X_test)

You can then do something similar to test_sparse_oneclasssvm:

@pytest.mark.parametrize(
    "X_train, y_train, X_test",
    [
        [X, None, T],
        [X2, None, T2],
        [X_blobs[:80], None, X_blobs[80:]],
        [iris.data, None, iris.data],
    ],
)
@pytest.mark.parametrize("kernel", ["linear", "poly", "rbf", "sigmoid"])
@pytest.mark.parametrize("sparse_container", CSR_CONTAINERS + LIL_CONTAINERS)
@skip_if_32bit
def test_sparse_oneclasssvm(X_train, y_train, X_test, kernel, sparse_container):
    # Check that sparse OneClassSVM gives the same result as dense OneClassSVM
    X_train = sparse_container(X_train)

    clf = svm.OneClassSVM(gamma=1, kernel=kernel)
    check_svm_model_equal(clf, X_train, y_train, X_test)

@Charlie-XIAO
Copy link
Contributor
Charlie-XIAO commented Oct 1, 2023

I have also seen you getting ValueError: inconsistent shapes in some other tests. Try using --full-trace when doing pytest and try to find out the last stack frame within scikit-learn. Not 100% sure but I doubt there is somewhere that we used * for matrix multiplication. If that's the case, replacing it with @ may solve the issue.

@work-mohit
Copy link
Contributor Author
X_train

Intersting !

@work-mohit
Copy link
Contributor Author

I have also seen you getting ValueError: inconsistent shapes in some other tests. I'm not a 100% sure but try using --full-trace when doing pytest and try to find out the last stack frame within scikit-learn. Did not look into it but I believe there is somewhere that we used * for matrix multiplication. Replacing it with @ may solve the issue.

Yes sure, I have checked test yet, just raised as a draft so that I can take help with others.

@work-mohit
Copy link
Contributor Author

from ._radius_neighbors_classmode import (
E ModuleNotFoundError: No module named 'sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode'
========================================================== short test summary info ==========================================================
ERROR sklearn/svm/tests/test_sparse.py - ModuleNotFoundError: No module named 'sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode'
ERROR sklearn/svm/tests/test_sparse.py

facing this issue, any help ?

@Charlie-XIAO
Copy link
Contributor
Charlie-XIAO commented Oct 3, 2023

Please recompile @work-mohit. This Cython module is created a few months ago (if I remembered correctly).

pip install -v --no-use-pep517 --no-build-isolation -e .

@work-mohit work-mohit marked this pull request as ready for review October 3, 2023 16:06
@work-mohit
Copy link
Contributor Author

C:\Users\Mohit\anaconda3\envs\sklearndev\lib\site-packages\scipy\sparse_compressed.py:415: ValueError
========================================================== short test summary info ==========================================================
FAILED sklearn/svm/tests/test_sparse.py::test_sparse_svc_clone_with_callable_kernel[lil_array] - ValueError: inconsistent shapes
FAILED sklearn/svm/tests/test_sparse.py::test_timeout[lil_array] - ValueError: inconsistent shapes

below is the code at 415 at _commpressed.py
help needed! @Charlie-XIAO

def multiply(self, other):
        """Point-wise multiplication by another matrix, vector, or
        scalar.
        """
        # Scalar multiplication.
        if isscalarlike(other):
            return self._mul_scalar(other)
        # Sparse matrix or vector.
        if issparse(other):
            if self.shape == other.shape:
                other = self.__class__(other)
                return self._binopt(other, '_elmul_')
            # Single element.
            elif other.shape == (1, 1):
                return self._mul_scalar(other.toarray()[0, 0])
            elif self.shape == (1, 1):
                return other._mul_scalar(self.toarray()[0, 0])
            # A row times a column.
            elif self.shape[1] == 1 and other.shape[0] == 1:
                return self._mul_sparse_matrix(other.tocsc())
            elif self.shape[0] == 1 and other.shape[1] == 1:
                return other._mul_sparse_matrix(self.tocsc())
            # Row vector times matrix. other is a row.
            elif other.shape[0] == 1 and self.shape[1] == other.shape[1]:
                other = self._dia_container(
                    (other.toarray().ravel(), [0]),
                    shape=(other.shape[1], other.shape[1])
                )
                return self._mul_sparse_matrix(other)
            # self is a row.
            elif self.shape[0] == 1 and self.shape[1] == other.shape[1]:
                copy = self._dia_container(
                    (self.toarray().ravel(), [0]),
                    shape=(self.shape[1], self.shape[1])
                )`
`                return other._mul_sparse_matrix(copy)
            # Column vector times matrix. other is a column.
            elif other.shape[1] == 1 and self.shape[0] == other.shape[0]:
                other = self._dia_container(
                    (other.toarray().ravel(), [0]),
                    shape=(other.shape[0], other.shape[0])
                )
                return other._mul_sparse_matrix(self)
            # self is a column.
            elif self.shape[1] == 1 and self.shape[0] == other.shape[0]:
                copy = self._dia_container(
                    (self.toarray().ravel(), [0]),
                    shape=(self.shape[0], self.shape[0])
                )
                return copy._mul_sparse_matrix(other)
            else:
                raise ValueError("inconsistent shapes")``

@Charlie-XIAO
Copy link
Contributor
Charlie-XIAO commented Oct 4, 2023

@work-mohit I think I've already told you how to debug this: #27511 (comment). I've checked that this is indeed the case, and it appears in the test instead of in the module.

The thing is, if you have matrix * matrix, this is automatically matrix multiplication, but if you have array * array this is not, thus inconsistent shapes. You should use array @ array (and matrix @ matrix` so that in both cases you will have matrix multiplication.

@work-mohit
Copy link
Contributor Author

@ogrisel I'm getting

ValueError: Only sparse matrices with 32-bit integer indices are accepted. Got int64 indices. Please do report a minimal reproducer on scikit-learn issue tracker so that support for your use-case can be studied by maintainers. See: https://scikit-learn.org/dev/developers/minimal_reproducer.html

should this be reported?

Copy link
Member
@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please quote the full traceback of the exception to understand the context of the problem?

@@ -53,77 +56,71 @@ def check_svm_model_equal(dense_svm, sparse_svm, X_train, y_train, X_test):
assert sparse.issparse(sparse_svm.support_vectors_)
assert sparse.issparse(sparse_svm.dual_coef_)
assert_array_almost_equal(
dense_svm.support_vectors_, sparse_svm.support_vectors_.toarray()
svm_model.support_vectors_, sparse_svm.support_vectors_.toarray()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather not rename this variable. The previous name was more informative.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to point out that, instead of taking two params (dense_svm & sparse_svm) separately, we're combining both. So I guess only naming dense_svm would not be sufficient. You can suggest some more meaningful name.

@work-mohit
Copy link
Contributor Author

csr_container = <class 'scipy.sparse._csr.csr_array'>

@pytest.mark.parametrize("csr_container", CSR_CONTAINERS)
def test_sparse_realdata(csr_container):
    # Test on a subset from the 20newsgroups dataset.
    # This catches some bugs if input is not correctly converted into
    # sparse format or weights are not correctly initialized.

    data = np.array([0.03771744, 0.1003567, 0.01174647, 0.027069])
    indices = np.array([6, 5, 35, 31])
    indptr = np.array(
        [
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            0,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            2,
            4,
            4,
            4,
        ]
    )
    X = csr_container((data, indices, indptr))
    y = np.array(
        [
            1.0,
            0.0,
            2.0,
            2.0,
            1.0,
            1.0,
            1.0,
            2.0,
            2.0,
            0.0,
            1.0,
            2.0,
            2.0,
            0.0,
            2.0,
            0.0,
            3.0,
            0.0,
            3.0,
            0.0,
            1.0,
            1.0,
            3.0,
            2.0,
            3.0,
            2.0,
            0.0,
            3.0,
            1.0,
            0.0,
            2.0,
            1.0,
            2.0,
            0.0,
            1.0,
            0.0,
            2.0,
            3.0,
            1.0,
            3.0,
            0.0,
            1.0,
            0.0,
            0.0,
            2.0,
            0.0,
            1.0,
            2.0,
            2.0,
            2.0,
            3.0,
            2.0,
            0.0,
            3.0,
            2.0,
            1.0,
            2.0,
            3.0,
            2.0,
            2.0,
            0.0,
            1.0,
            0.0,
            1.0,
            2.0,
            3.0,
            0.0,
            0.0,
            2.0,
            2.0,
            1.0,
            3.0,
            1.0,
            1.0,
            0.0,
            1.0,
            2.0,
            1.0,
            1.0,
            3.0,
        ]
    )

    clf = svm.SVC(kernel="linear").fit(X.toarray(), y)
    sp_clf = svm.SVC(kernel="linear").fit(X.tocoo(), y)  <===== this line giving check failure.

ValueError: Only sparse matrices with 32-bit integer indices are accepted. Got int64 indices. Please do report a minimal reproducer on scikit-learn issue tracker so that support for your use-case can be studied by maintainers. See: https://scikit-learn.org/dev/developers/minimal_reproducer.html

@jjerphan
Copy link
Member
jjerphan commented Nov 1, 2023

Hi @work-mohit, are you still working on this PR?

@work-mohit
Copy link
Contributor Author

Hi @work-mohit, are you still working on this PR?

I got stuck somewhere in the issue, now I'm not tracking any more.

@jjerphan
Copy link
Member
jjerphan commented Nov 2, 2023

@Charlie-XIAO: are you interested to continue this PR?

@Charlie-XIAO
Copy link
Contributor

Yes, I can continue this PR if needed. However, if I need to contribute to this branch I will need access granted by @work-mohit since I am not a maintainer. If @work-mohit no longer decides to continue, I would personally prefer to open another PR and mention him as collaborator in a commit. Would that be okay or do you suggest an alternative approach @jjerphan?

@jjerphan
Copy link
Member
jjerphan commented Nov 2, 2023

Yes, this is usually the approach we take, giving credits to the original authors.

@Charlie-XIAO
Copy link
Contributor

@jjerphan Sorry for the delay, you may refer to #27723.

@marenwestermann
Copy link
Member

@Charlie-XIAO this PR can be closed, right?

@jjerphan
Copy link
Member
jjerphan commented Mar 1, 2024

I confirm it can be closed since it was superseded by #27723.

@jjerphan jjerphan closed this Mar 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants
0