10000 Array dimension issue when using sklearn.covariance.fast_mcd · Issue #4512 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Array dimension issue when using sklearn.covariance.fast_mcd #4512

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ThatGeoGuy opened this issue Apr 3, 2015 · 6 comments
Closed

Array dimension issue when using sklearn.covariance.fast_mcd #4512

ThatGeoGuy opened this issue Apr 3, 2015 · 6 comments
Labels
Bug Easy Well-defined and straightforward way to resolve
Milestone

Comments

@ThatGeoGuy
Copy link
Contributor

Lines 362-368 in sklearn/covariance/robust_covariance.py specify the following:

    X = np.asarray(X)
    if X.ndim == 1:
        X = np.reshape(X, (1, -1))
        warnings.warn("Only one sample available. "
                      "You may want to reshape your data array")
    n_samples, n_features = X.shape

The problem with this is that if you pass in a 1D array of shape (n_samples,), you typically want the univariate estimate of the MCD for all those samples (hence you actually wanted to pass in an array of shape (n_samples, 1)). However, the code above assumes you really wanted to pass in a 1D array of shape (1, n_features), which has the following problems:

  1. This is backwards to the assumption made by LIBRA (http://wis.kuleuven.be/stat/robust.html
    ), which contains the reference implementation of FastMCD by Van Dreissen and Rombouts. This can be found in the mcdcov.m file, lines 213-215.
  2. Nobody in their right mind would try to find the covariance amongst n_features using only a single sample. I'm half-kidding, but at the very least, the behaviour appears unconventional to me. Perhaps I am missing some justification for this?
  3. The current behaviour raises an exception stating that the covariance matrix is singular, when in reality it is non-singular, just univariate. If you for example pass in a matrix object instead of an array with the appropriate dimensions, then the function will compute without error.

Because of these reasons, I'm going to assume this is a bug, which appears similar to the bugs found in #4509 and #4466. The fix is simple, just change the above lines to the following:

    X = np.asarray(X)
    if X.ndim == 1:
        X = np.reshape(X, (-1, 1))
        warnings.warn("1D array passed in. " 
                "Assuming the array contains samples, not features. "
                "You may wish to reshape your data.")
    n_samples, n_features = X.shape

With this fix, finding univariate estimates of the MCD becomes much easier. I have made the above changes to my fork at https://github.com/ThatGeoGuy/scikit-learn and can submit a pull request at any time. However, while running nosetests, I could not correctly get the tests to complete. I also could not find any documentation mentioning how to run the tests so sklearn.__check_build will run appropriately.

Any advice is appreciated. I am examining the MinCovDet / fast_mcd so that I can hopefully fix issue #3367, which is currently preventing me from completing a project I am working on.

@agramfort
Copy link
Member
agramfort commented Apr 4, 2015 via email

@ThatGeoGuy
Copy link
Contributor Author

The fix itself does not appear to break anything as per PR #4517. If you choose to merge it, feel free to close this issue.

Otherwise, please let me know if there is any additional work necessary to incorporate this fix. Thanks again! :)

@amueller
Copy link
Member
amueller commented Apr 5, 2015

As I said in the PR, I think the right fix is this: #4511 which should fix it across scikit-learn.

@agramfort
Copy link
Member
agramfort commented Apr 5, 2015 via email

@amueller amueller added Bug Easy Well-defined and straightforward way to resolve labels Sep 9, 2015
@amueller amueller added this to the 0.17 milestone Sep 9, 2015
@amueller
Copy link
Member
amueller commented Sep 9, 2015

Unfortunately this didn't get fixed in #5152, as this is inside a function. High time we have common tests for functions [or deprecate them?].
@vighneshbirodkar would you mind giving this the same treatment as you did in #5152?
The warning should be adjusted and the behavior should be deprecated.

@ogrisel
Copy link
Member
ogrisel commented Oct 12, 2015

Fixed by @vighneshbirodkar in #5234.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Easy Well-defined and straightforward way to resolve
Projects
None yet
Development

No branches or pull requests

4 participants
0