[MRG+1] Fixes issue #3367 -> MCD fails on data with singular covarian… · Sundrique/scikit-learn@79160d8 · GitHub

Commit 79160d8

ThatGeoGuy authored and Sundrique committed
[MRG+1] Fixes issue scikit-learn#3367 -> MCD fails on data with singular covariance matrix (scikit-learn#8328)
* Adds failing test for issue 3367

* Adds extra comments to describe what I'm testing

  Basically in this instance I discovered this bug independently from
  scikit-learn#3367 because I was trying to use principal components to
  estimate plane normals / geometry of points in 3D space. When you have a
  set of points that specify a perfect plane (or, in the case of wanting to
  use MCD, a subset of your points that specify a perfect plane), the code
  fails because the covariance matrix is singular. However, if your
  covariance matrix is singular, you have already found the set of points
  with the lowest determinant. As per Rousseeuw & Van Driessen (1999), at
  this point you can stop searching. The code did stop searching, but it
  raised a ValueError on singular matrices for no reason, so the correct
  fix is to remove that error.

* Fixes issue 3367

  This should work with the test case provided.

* Adds missing argument to test

* Style corrections to pass flake runner

  Implements the style corrections mentioned in pull request scikit-learn#8328.

* Adds entry for PR scikit-learn#8328 to what's new

* Adds review changes from @jnothman
1 parent 50b5176 commit 79160d8
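The scenario the commit message describes can be reproduced with a short sketch (a hypothetical example, not part of this commit, assuming a scikit-learn release that includes this fix): points padded with an all-zero third coordinate lie in a plane, so their covariance matrix is singular.

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.RandomState(42)
X = rng.randn(100, 2)                    # random 2-D points
X = np.hstack([X, np.zeros((100, 1))])   # all-zero Z column -> planar data,
                                         # so the covariance matrix is singular

# Before this fix, _c_step raised ValueError here; with it, the search
# simply stops, since a singular covariance already has the minimum
# possible (zero) determinant.
mcd = MinCovDet(random_state=rng).fit(X)

# The degenerate axis should carry (near-)zero variance in the fit.
print(mcd.covariance_[2, 2])
```
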

File tree

3 files changed: +51 −8 lines changed


doc/whats_new.rst

Lines changed: 7 additions & 0 deletions

@@ -187,6 +187,12 @@ Enhancements

 Bug fixes
 .........
+
+- Fixed a bug in :class:`sklearn.covariance.MinCovDet` where inputting data
+  that produced a singular covariance matrix would cause the helper method
+  `_c_step` to throw an exception.
+  :issue:`3367` by :user:`Jeremy Steward <ThatGeoGuy>`
+
 - Fixed a bug where :class:`sklearn.ensemble.IsolationForest` uses an
   an incorrect formula for the average path length
   :issue:`8549` by `Peter Wang <https://github.com/PTRWang>`_.
@@ -203,6 +209,7 @@ Bug fixes
 - Fixed a bug where :func:`sklearn.model_selection.BaseSearchCV.inverse_transform`
   returns self.best_estimator_.transform() instead of self.best_estimator_.inverse_transform()
   :issue:`8344` by :user:`Akshay Gupta <Akshay0724>`
+
 - Fixed same issue in :func:`sklearn.grid_search.BaseSearchCV.inverse_transform`
   :issue:`8846` by :user:`Rasmus Eriksson <MrMjauh>`

sklearn/covariance/robust_covariance.py

Lines changed: 10 additions & 8 deletions

@@ -96,6 +96,7 @@ def _c_step(X, n_support, random_state, remaining_iterations=30,
             initial_estimates=None, verbose=False,
             cov_computation_method=empirical_covariance):
     n_samples, n_features = X.shape
+    dist = np.inf

     # Initialisation
     support = np.zeros(n_samples, dtype=bool)
@@ -119,8 +120,14 @@ def _c_step(X, n_support, random_state, remaining_iterations=30,

     # Iterative procedure for Minimum Covariance Determinant computation
     det = fast_logdet(covariance)
+    # If the data already has singular covariance, calculate the precision,
+    # as the loop below will not be entered.
+    if np.isinf(det):
+        precision = pinvh(covariance)
+
     previous_det = np.inf
-    while (det < previous_det) and (remaining_iterations > 0):
+    while (det < previous_det and remaining_iterations > 0
+           and not np.isinf(det)):
         # save old estimates values
         previous_location = location
         previous_covariance = covariance
@@ -142,14 +149,9 @@ def _c_step(X, n_support, random_state, remaining_iterations=30,

     previous_dist = dist
     dist = (np.dot(X - location, precision) * (X - location)).sum(axis=1)
-    # Catch computation errors
+    # Check if best fit already found (det => 0, logdet => -inf)
     if np.isinf(det):
-        raise ValueError(
-            "Singular covariance matrix. "
-            "Please check that the covariance matrix corresponding "
-            "to the dataset is full rank and that MinCovDet is used with "
-            "Gaussian-distributed data (or at least data drawn from a "
-            "unimodal, symmetric distribution.")
+        results = location, covariance, det, support, dist
     # Check convergence
     if np.allclose(det, previous_det):
         # c_step procedure converged
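The core of the patch is the observation that the log-determinant of a singular matrix is `-inf`, which already signals the minimum possible determinant, so the C-step search can stop instead of raising. A standalone sketch of that stopping rule (a hypothetical helper, not the scikit-learn internals, using `numpy.linalg.slogdet` in place of `fast_logdet`):

```python
import numpy as np

def singular_so_stop(covariance):
    """True when log|covariance| is -inf, i.e. det == 0: the C-step
    search cannot do better than a zero determinant, so stop."""
    sign, logdet = np.linalg.slogdet(covariance)
    return bool(sign == 0 and np.isinf(logdet))

planar = np.diag([2.0, 1.0, 0.0])   # one zero eigenvalue -> singular
full_rank = np.eye(3)

print(singular_so_stop(planar))      # True: determinant is already 0
print(singular_so_stop(full_rank))   # False: keep iterating
```
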

sklearn/covariance/tests/test_robust_covariance.py

Lines changed: 34 additions & 0 deletions

@@ -4,6 +4,8 @@
 #
 # License: BSD 3 clause

+import itertools
+
 import numpy as np

 from sklearn.utils.testing import assert_almost_equal
@@ -92,6 +94,38 @@ def test_mcd_issue1127():
     mcd.fit(X)


+def test_mcd_issue3367():
+    # Check that MCD completes when the covariance matrix is singular
+    # i.e. one of the rows and columns are all zeros
+    rand_gen = np.random.RandomState(0)
+
+    # Think of these as the values for X and Y -> 10 values between -5 and 5
+    data_values = np.linspace(-5, 5, 10).tolist()
+    # Get the cartesian product of all possible coordinate pairs from above set
+    data = np.array(list(itertools.product(data_values, data_values)))
+
+    # Add a third column that's all zeros to make our data a set of point
+    # within a plane, which means that the covariance matrix will be singular
+    data = np.hstack((data, np.zeros((data.shape[0], 1))))
+
+    # The below line of code should raise an exception if the covariance matrix
+    # is singular. As a further test, since we have points in XYZ, the
+    # principle components (Eigenvectors) of these directly relate to the
+    # geometry of the points. Since it's a plane, we should be able to test
+    # that the Eigenvector that corresponds to the smallest Eigenvalue is the
+    # plane normal, specifically [0, 0, 1], since everything is in the XY plane
+    # (as I've set it up above). To do this one would start by:
+    #
+    #     evals, evecs = np.linalg.eigh(mcd_fit.covariance_)
+    #     normal = evecs[:, np.argmin(evals)]
+    #
+    # After which we need to assert that our `normal` is equal to [0, 0, 1].
+    # Do note that there is floating point error associated with this, so it's
+    # best to subtract the two and then compare some small tolerance (e.g.
+    # 1e-12).
+    MinCovDet(random_state=rand_gen).fit(data)
+
+
 def test_outlier_detection():
     rnd = np.random.RandomState(0)
     X = rnd.randn(100, 10)
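The eigenvector check that the test's comment only sketches can be written out as follows (this mirrors the comment's reasoning; it is not part of the committed test, and assumes a scikit-learn version containing this fix):

```python
import itertools
import numpy as np
from sklearn.covariance import MinCovDet

# Same planar data as in test_mcd_issue3367: an XY grid with Z == 0.
data_values = np.linspace(-5, 5, 10).tolist()
data = np.array(list(itertools.product(data_values, data_values)))
data = np.hstack((data, np.zeros((data.shape[0], 1))))

mcd_fit = MinCovDet(random_state=np.random.RandomState(0)).fit(data)

# The eigenvector of the smallest eigenvalue is the plane normal.
evals, evecs = np.linalg.eigh(mcd_fit.covariance_)
normal = evecs[:, np.argmin(evals)]

# Up to sign, it should match [0, 0, 1] within floating-point tolerance.
assert np.allclose(np.abs(normal), [0.0, 0.0, 1.0], atol=1e-12)
```
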

0 commit comments
