This is the first step in deprecating 1d arrays · scikit-learn/scikit-learn@f760d27 · GitHub


Commit f760d27

This is the first step in deprecating 1d arrays
Passing a 1D array to `check_array` without setting `ensure_2d=False` now raises a deprecation warning before the array is reshaped; this will later become an error. All scaler classes likewise warn when 1D arrays are passed. All unit tests and doctests have been updated so that no 1D arrays are passed, except in explicit 1D-array tests, where the warnings are silenced. Additional tests covering the different 1D-array cases are included. 2D-array tests with a single sample or a single feature are also added, and where they failed, the `check_array` call has been modified to give a more useful error message.
1 parent e2e5fbd commit f760d27
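To make the intent of the change concrete, here is a minimal sketch of the behaviour the commit message describes (the reshaping to a single row follows from `check_array`'s existing behaviour; the exact warning text is not taken from this diff):

```python
import warnings
import numpy as np
from sklearn.utils.validation import check_array

X_1d = np.array([1.0, 2.0, 3.0])

# During the deprecation period a 1D array is still accepted: it is
# reshaped to a single sample, but a deprecation warning is emitted first.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    X_2d = check_array(X_1d)
print(X_2d.shape)   # (1, 3)
print(len(caught))  # at least one warning expected after this commit

# The explicit, warning-free alternatives:
one_sample = check_array(X_1d.reshape(1, -1))   # one sample, three features
one_feature = check_array(X_1d.reshape(-1, 1))  # three samples, one feature
```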


49 files changed (+349, −154 lines)

doc/modules/model_evaluation.rst

Lines changed: 1 addition & 1 deletion
@@ -170,7 +170,7 @@ Here is an example of building custom scorers, and of using the
 >>> # and predictions defined below.
 >>> loss = make_scorer(my_custom_loss_func, greater_is_better=False)
 >>> score = make_scorer(my_custom_loss_func, greater_is_better=True)
->>> ground_truth = [1, 1]
+>>> ground_truth = [[1, 1]]
 >>> predictions = [0, 1]
 >>> from sklearn.dummy import DummyClassifier
 >>> clf = DummyClassifier(strategy='most_frequent', random_state=0)

doc/modules/model_persistence.rst

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ persistence model, namely `pickle <http://docs.python.org/library/pickle.html>`_
 >>> import pickle
 >>> s = pickle.dumps(clf)
 >>> clf2 = pickle.loads(s)
->>> clf2.predict(X[0])
+>>> clf2.predict([X[0]])
 array([0])
 >>> y[0]
 0
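The pattern above — wrapping a single row in a list before calling `predict` — recurs throughout the docs touched by this commit. A small sketch of why, using a toy array rather than the tutorial's `X`:

```python
import numpy as np

X = np.arange(6.0).reshape(3, 2)  # toy data: 3 samples, 2 features

row = X[0]        # shape (2,): a bare 1D row, which now triggers the warning
wrapped = [X[0]]  # shape (1, 2) after conversion: one explicit sample
sliced = X[0:1]   # shape (1, 2): slicing keeps the 2D shape as well
print(row.shape, np.asarray(wrapped).shape, sliced.shape)
```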

doc/tutorial/basic/tutorial.rst

Lines changed: 24 additions & 24 deletions
@@ -5,7 +5,7 @@ An introduction to machine learning with scikit-learn
 
 .. topic:: Section contents
 
-    In this section, we introduce the `machine learning
+    In this section, we introduce the `machine learning
     <http://en.wikipedia.org/wiki/Machine_learning>`_
     vocabulary that we use throughout scikit-learn and give a
     simple learning example.
@@ -14,30 +14,30 @@ An introduction to machine learning with scikit-learn
 Machine learning: the problem setting
 -------------------------------------
 
-In general, a learning problem considers a set of n
+In general, a learning problem considers a set of n
 `samples <http://en.wikipedia.org/wiki/Sample_(statistics)>`_ of
 data and then tries to predict properties of unknown data. If each sample is
 more than a single number and, for instance, a multi-dimensional entry
-(aka `multivariate <http://en.wikipedia.org/wiki/Multivariate_random_variable>`_
+(aka `multivariate <http://en.wikipedia.org/wiki/Multivariate_random_variable>`_
 data), is it said to have several attributes or **features**.
 
 We can separate learning problems in a few large categories:
 
-* `supervised learning <http://en.wikipedia.org/wiki/Supervised_learning>`_,
-  in which the data comes with additional attributes that we want to predict
+* `supervised learning <http://en.wikipedia.org/wiki/Supervised_learning>`_,
+  in which the data comes with additional attributes that we want to predict
   (:ref:`Click here <supervised-learning>`
-  to go to the scikit-learn supervised learning page).This problem
+  to go to the scikit-learn supervised learning page).This problem
   can be either:
 
-  * `classification
+  * `classification
    <http://en.wikipedia.org/wiki/Classification_in_machine_learning>`_:
    samples belong to two or more classes and we
    want to learn from already labeled data how to predict the class
    of unlabeled data. An example of classification problem would
-   be the handwritten digit recognition example, in which the aim is
+   be the handwritten digit recognition example, in which the aim is
    to assign each input vector to one of a finite number of discrete
-   categories. Another way to think of classification is as a discrete
-   (as opposed to continuous) form of supervised learning where one has a
+   categories. Another way to think of classification is as a discrete
+   (as opposed to continuous) form of supervised learning where one has a
    limited number of categories and for each of the n samples provided,
    one is to try to label them with the correct category or class.
 
@@ -48,15 +48,15 @@ We can separate learning problems in a few large categories:
   length of a salmon as a function of its age and weight.
 
 * `unsupervised learning <http://en.wikipedia.org/wiki/Unsupervised_learning>`_,
-  in which the training data consists of a set of input vectors x
-  without any corresponding target values. The goal in such problems
-  may be to discover groups of similar examples within the data, where
-  it is called `clustering <http://en.wikipedia.org/wiki/Cluster_analysis>`_,
-  or to determine the distribution of data within the input space, known as
-  `density estimation <http://en.wikipedia.org/wiki/Density_estimation>`_, or
-  to project the data from a high-dimensional space down to two or three
-  dimensions for the purpose of *visualization*
-  (:ref:`Click here <unsupervised-learning>`
+  in which the training data consists of a set of input vectors x
+  without any corresponding target values. The goal in such problems
+  may be to discover groups of similar examples within the data, where
+  it is called `clustering <http://en.wikipedia.org/wiki/Cluster_analysis>`_,
+  or to determine the distribution of data within the input space, known as
+  `density estimation <http://en.wikipedia.org/wiki/Density_estimation>`_, or
+  to project the data from a high-dimensional space down to two or three
+  dimensions for the purpose of *visualization*
+  (:ref:`Click here <unsupervised-learning>`
   to go to the Scikit-Learn unsupervised learning page).
 
 .. topic:: Training set and testing set
@@ -143,7 +143,7 @@ Learning and predicting
 
 In the case of the digits dataset, the task is to predict, given an image,
 which digit it represents. We are given samples of each of the 10
-possible classes (the digits zero through nine) on which we *fit* an
+possible classes (the digits zero through nine) on which we *fit* an
 `estimator <http://en.wikipedia.org/wiki/Estimator>`_ to be able to *predict*
 the classes to which unseen samples belong.
 
@@ -185,7 +185,7 @@ Now you can predict new values, in particular, we can ask to the
 classifier what is the digit of our last image in the ``digits`` dataset,
 which we have not used to train the classifier::
 
-  >>> clf.predict(digits.data[-1])
+  >>> clf.predict([digits.data[-1]])
   array([8])
 
 The corresponding image is the following:
@@ -223,7 +223,7 @@ persistence model, namely `pickle <http://docs.python.org/library/pickle.html>`_
   >>> import pickle
   >>> s = pickle.dumps(clf)
   >>> clf2 = pickle.loads(s)
-  >>> clf2.predict(X[0])
+  >>> clf2.predict([X[0]])
   array([0])
   >>> y[0]
   0
@@ -235,10 +235,10 @@ and not to a string::
 
   >>> from sklearn.externals import joblib
   >>> joblib.dump(clf, 'filename.pkl') # doctest: +SKIP
-
+
 Later you can load back the pickled model (possibly in another Python process)
 with::
-
+
   >>> clf = joblib.load('filename.pkl') # doctest:+SKIP
 
 .. note::

sklearn/cluster/hierarchical.py

Lines changed: 3 additions & 6 deletions
@@ -713,7 +713,7 @@ def fit(self, X, y=None):
         -------
         self
         """
-        X = check_array(X)
+        X = check_array(X, ensure_min_samples=2)
         memory = self.memory
         if isinstance(memory, six.string_types):
             memory = Memory(cachedir=memory, verbose=0)
@@ -869,11 +869,8 @@ def fit(self, X, y=None, **params):
         -------
         self
         """
-        X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
-        if not (len(X.shape) == 2 and X.shape[0] > 0):
-            raise ValueError('At least one sample is required to fit the '
-                             'model. A data matrix of shape %s was given.'
-                             % (X.shape, ))
+        X = check_array(X, accept_sparse=['csr', 'csc', 'coo'],
+                        ensure_min_features=2)
         return AgglomerativeClustering.fit(self, X.T, **params)
 
     @property
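The `ensure_min_samples`/`ensure_min_features` arguments replace the hand-rolled shape check deleted above. A rough sketch of the kind of error they produce (the exact message text comes from `check_array` and is not shown in this diff):

```python
import numpy as np
from sklearn.utils.validation import check_array

X_one_sample = np.array([[1.0, 2.0, 3.0]])  # shape (1, 3): only one sample

try:
    check_array(X_one_sample, ensure_min_samples=2)
except ValueError as exc:
    # The error names the offending shape and the minimum that was required.
    print(exc)
```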

sklearn/cluster/k_means_.py

Lines changed: 2 additions & 1 deletion
@@ -95,7 +95,8 @@ def _k_init(X, n_clusters, x_squared_norms, random_state, n_local_trials=None):
 
     # Initialize list of closest distances and calculate current potential
     closest_dist_sq = euclidean_distances(
-        centers[0], X, Y_norm_squared=x_squared_norms, squared=True)
+        centers[0, np.newaxis], X, Y_norm_squared=x_squared_norms,
+        squared=True)
     current_pot = closest_dist_sq.sum()
 
     # Pick the remaining n_clusters-1 points
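`centers[0]` returns a bare 1D vector, which would now trigger the deprecation warning inside `euclidean_distances`; indexing with `np.newaxis` keeps a 2D, single-row view. A quick sketch of the shapes involved:

```python
import numpy as np

centers = np.random.RandomState(0).rand(4, 3)  # 4 centers, 3 features

print(centers[0].shape)              # (3,)   -- integer indexing drops a dimension
print(centers[0, np.newaxis].shape)  # (1, 3) -- still 2D: one row, all features
print(centers[0:1].shape)            # (1, 3) -- an equivalent slice form
```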

sklearn/covariance/empirical_covariance_.py

Lines changed: 3 additions & 0 deletions
@@ -70,6 +70,7 @@ def empirical_covariance(X, assume_centered=False):
     X = np.asarray(X)
     if X.ndim == 1:
         X = np.reshape(X, (1, -1))
+
     if X.shape[0] == 1:
         warnings.warn("Only one sample available. "
                      "You may want to reshape your data array")
@@ -79,6 +80,8 @@
     else:
         covariance = np.cov(X.T, bias=1)
 
+    if covariance.ndim == 0:
+        covariance = np.array([[covariance]])
     return covariance
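The new `covariance.ndim == 0` branch guards against a NumPy quirk: with a single feature, `np.cov` collapses to a 0-d array rather than a 1×1 matrix. A sketch of the behaviour being handled:

```python
import numpy as np

X = np.array([[1.0], [2.0], [3.0]])  # 3 samples, a single feature

covariance = np.cov(X.T, bias=1)
print(covariance.ndim, covariance.shape)  # 0 () -- a 0-d array, not a matrix

if covariance.ndim == 0:                  # same guard as the added branch
    covariance = np.array([[covariance]])
print(covariance.shape)                   # (1, 1)
```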

sklearn/covariance/graph_lasso_.py

Lines changed: 2 additions & 2 deletions
@@ -334,7 +334,7 @@ def __init__(self, alpha=.01, mode='cd', tol=1e-4, enet_tol=1e-4,
         self.store_precision = True
 
     def fit(self, X, y=None):
-        X = check_array(X)
+        X = check_array(X, ensure_min_features=2, ensure_min_samples=2)
         if self.assume_centered:
             self.location_ = np.zeros(X.shape[1])
         else:
@@ -557,7 +557,7 @@ def fit(self, X, y=None):
         X : ndarray, shape (n_samples, n_features)
             Data from which to compute the covariance estimate
         """
-        X = check_array(X)
+        X = check_array(X, ensure_min_features=2)
         if self.assume_centered:
             self.location_ = np.zeros(X.shape[1])
         else:

sklearn/covariance/tests/test_covariance.py

Lines changed: 7 additions & 7 deletions
@@ -55,8 +55,8 @@ def test_covariance():
         cov.error_norm(empirical_covariance(X_1d), norm='spectral'), 0)
 
     # test with one sample
-    # FIXME I don't know what this test does
-    X_1sample = np.arange(5)
+    # Create X with 1 sample and 5 features
+    X_1sample = np.arange(5).reshape(1, 5)
     cov = EmpiricalCovariance()
     assert_warns(UserWarning, cov.fit, X_1sample)
     assert_array_almost_equal(cov.covariance_,
@@ -172,8 +172,8 @@ def test_ledoit_wolf():
     assert_array_almost_equal(empirical_covariance(X_1d), lw.covariance_, 4)
 
     # test with one sample
-    # FIXME I don't know what this test does
-    X_1sample = np.arange(5)
+    # warning should be raised when using only 1 sample
+    X_1sample = np.arange(5).reshape(1, 5)
     lw = LedoitWolf()
     assert_warns(UserWarning, lw.fit, X_1sample)
     assert_array_almost_equal(lw.covariance_,
@@ -220,7 +220,7 @@ def test_oas():
     assert_array_almost_equal(scov.covariance_, oa.covariance_, 4)
 
     # test with n_features = 1
-    X_1d = X[:, 0].reshape((-1, 1))
+    X_1d = X[:, 0, np.newaxis]
     oa = OAS(assume_centered=True)
     oa.fit(X_1d)
     oa_cov_from_mle, oa_shinkrage_from_mle = oas(X_1d, assume_centered=True)
@@ -259,8 +259,8 @@ def test_oas():
     assert_array_almost_equal(empirical_covariance(X_1d), oa.covariance_, 4)
 
     # test with one sample
-    # FIXME I don't know what this test does
-    X_1sample = np.arange(5)
+    # warning should be raised when using only 1 sample
+    X_1sample = np.arange(5).reshape(1, 5)
     oa = OAS()
     assert_warns(UserWarning, oa.fit, X_1sample)
     assert_array_almost_equal(oa.covariance_,
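The tests above exercise the two degenerate 2D shapes separately: one sample (`np.arange(5).reshape(1, 5)`) and one feature (`X[:, 0, np.newaxis]`). The single-feature case in isolation:

```python
import numpy as np

X = np.random.RandomState(0).rand(10, 3)

col_1d = X[:, 0]              # shape (10,):  would now trigger the deprecation warning
col_2d = X[:, 0, np.newaxis]  # shape (10, 1): ten samples of a single feature
assert col_2d.shape == (10, 1)
```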

sklearn/decomposition/tests/test_dict_learning.py

Lines changed: 4 additions & 4 deletions
@@ -75,11 +75,11 @@ def test_dict_learning_nonzero_coefs():
     n_components = 4
     dico = DictionaryLearning(n_components, transform_algorithm='lars',
                               transform_n_nonzero_coefs=3, random_state=0)
-    code = dico.fit(X).transform(X[1])
+    code = dico.fit(X).transform(X[np.newaxis, 1])
     assert_true(len(np.flatnonzero(code)) == 3)
 
     dico.set_params(transform_algorithm='omp')
-    code = dico.transform(X[1])
+    code = dico.transform(X[np.newaxis, 1])
     assert_equal(len(np.flatnonzero(code)), 3)
 
 
@@ -173,7 +173,7 @@ def test_dict_learning_online_partial_fit():
                                         random_state=0)
     for i in range(10):
         for sample in X:
-            dict2.partial_fit(sample)
+            dict2.partial_fit(sample[np.newaxis, :])
 
     assert_true(not np.all(sparse_encode(X, dict1.components_, alpha=1) ==
                            0))
@@ -225,4 +225,4 @@ def test_sparse_coder_estimator():
     code = SparseCoder(dictionary=V, transform_algorithm='lasso_lars',
                        transform_alpha=0.001).transform(X)
     assert_true(not np.all(code == 0))
-    assert_less(np.sqrt(np.sum((np.dot(code, V) - X) ** 2)), 0.1)
+    assert_less(np.sqrt(np.sum((np.dot(code, V) - X) ** 2)), 0.1)

sklearn/ensemble/tests/test_forest.py

Lines changed: 4 additions & 1 deletion
@@ -39,6 +39,7 @@
 
 from sklearn.tree.tree import SPARSE_SPLITTERS
 
+
 # toy sample
 X = [[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1]]
 y = [-1, -1, -1, 1, 1, 1]
@@ -724,6 +725,7 @@ def test_memory_layout():
         yield check_memory_layout, name, dtype
 
 
+@ignore_warnings
 def check_1d_input(name, X, X_2d, y):
     ForestEstimator = FOREST_ESTIMATORS[name]
     assert_raises(ValueError, ForestEstimator(random_state=0).fit, X, y)
@@ -735,8 +737,9 @@ def check_1d_input(name, X, X_2d, y):
     assert_raises(ValueError, est.predict, X)
 
 
+@ignore_warnings
 def test_1d_input():
-    X = iris.data[:, 0].ravel()
+    X = iris.data[:, 0]
     X_2d = iris.data[:, 0].reshape((-1, 1))
     y = iris.target
 
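`test_1d_input` deliberately feeds 1D data to check that a `ValueError` is raised, so it is wrapped in `@ignore_warnings` to keep the new deprecation warning from polluting the test output. A hedged sketch of the same silencing pattern (the scaler example is an illustration based on the commit message, not a test from this diff):

```python
import numpy as np
from sklearn.utils.testing import ignore_warnings
from sklearn.preprocessing import StandardScaler


@ignore_warnings  # silence the 1D deprecation warning inside this test only
def test_scaler_still_accepts_1d_for_now():
    X_1d = np.array([1.0, 2.0, 3.0])
    # Scalers warn on 1D input during the deprecation period but still fit.
    StandardScaler().fit(X_1d)


test_scaler_still_accepts_1d_for_now()
```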

0 commit comments