[MRG+1] _preprocess_data consistent with fused types by Henley13 · Pull Request #9093 · scikit-learn/scikit-learn

Merged: 10 commits merged into scikit-learn:master on Jun 23, 2017

Conversation

@Henley13 (Contributor) commented Jun 9, 2017

Reference Issue

Works on #8769

What does this implement/fix? Explain your changes.

Prevent _preprocess_data from casting float32 data into float64.

Any other comments?

Intermediate step for PR #9087
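
For illustration, a minimal sketch of the behaviour this PR aims for, calling the private helper directly (the import path matches scikit-learn at the time of this PR; the call mirrors the test quoted later in this thread):

import numpy as np
from sklearn.linear_model.base import _preprocess_data  # private helper

# Sketch: after this PR, float32 inputs should come out of the
# preprocessing step still as float32 instead of being upcast to float64.
rng = np.random.RandomState(0)
X_32 = rng.rand(5, 3).astype(np.float32)
y_32 = rng.rand(5).astype(np.float32)

Xt_32, yt_32, X_mean_32, y_mean_32, X_norm_32 = _preprocess_data(
    X_32, y_32, fit_intercept=True, normalize=False)

assert Xt_32.dtype == np.float32
assert yt_32.dtype == np.float32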

@GaelVaroquaux changed the title from "[MRG] _preprocess_data consistent with fused types" to "[MRG+1] _preprocess_data consistent with fused types" on Jun 9, 2017
@GaelVaroquaux (Member)

LGTM. +1 for merge

if X.dtype == np.float32:
y_offset = np.float32(0)
else:
y_offset = np.float64(0)
@jmargeta (Contributor), Jun 9, 2017


What about replacing this block with just y_offset = X.dtype.type(0)? I tested the dtype.type method with numpy 1.8.2 and 1.12.1:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.type.html
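
For reference, a standalone sketch of the suggested one-liner (not part of the diff):

import numpy as np

# dtype.type is the scalar type backing the array's dtype, so the zero
# offset is created in the matching precision without any branching.
for dt in (np.float32, np.float64):
    X = np.zeros((3, 2), dtype=dt)
    y_offset = X.dtype.type(0)
    assert y_offset.dtype == dt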

@Henley13 (Author)

It's exactly the function I was looking for, thank you!

@Henley13 (Author)

@GaelVaroquaux I changed some lines of code

@jnothman (Member) left a comment


I haven't checked whether some linear models, e.g. SGD, skip this preprocessing, and whether that explains their absence from the changes.

Apart from that and the wording, this LGTM

@@ -651,7 +651,8 @@ def fit(self, X, y, check_input=True):
Data

y : ndarray, shape (n_samples,) or (n_samples, n_targets)
Target
Target. If it's not the case, y is cast in X.dtype further
@jnothman (Member)

I'd rather this were phrased as "Will be cast to X's dtype."

for normalize in [True, False]:

Xt_32, yt_32, X_mean_32, y_mean_32, X_norm_32 = \
_preprocess_data(X_32, y_32, fit_intercept=fit_intercept,
@jnothman (Member)

Could you avoid using the backslash?
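
For example, the continuation can move inside the call's parentheses (implicit line joining); normalize=normalize is assumed here from the enclosing loop:

Xt_32, yt_32, X_mean_32, y_mean_32, X_norm_32 = _preprocess_data(
    X_32, y_32, fit_intercept=fit_intercept, normalize=normalize)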

@MechCoder (Member) left a comment

Looks fine, just some minor comments.

@@ -460,7 +464,8 @@ def fit(self, X, y, sample_weight=None):
Training data

y : numpy array of shape [n_samples, n_targets]
Target values
Target values. If it's not the case, y is cast in X.dtype further
@MechCoder (Member)

Umm sorry, what does "it's" in "if it's not the case" refer to?

@@ -661,11 +661,6 @@ def test_check_input_false():
clf = ElasticNet(selection='cyclic', tol=1e-8)
# Check that no error is raised if data is provided in the right format
clf.fit(X, y, check_input=False)
X = check_array(X, order='F', dtype='float32')
clf.fit(X, y, check_input=True)
@MechCoder (Member)

Why did you remove these two lines?

@Henley13 (Author)

Because they were used for the test below (assert_raises(ValueError, clf.fit, X, y, check_input=False)), casting X to 32 bits. But now _preprocess_data prevents fit from raising a ValueError, even with check_input=False. Since you suggested a smoke test, I can put them back.

clf.fit(X, y, check_input=True)
# Check that an error is raised if data is provided in the wrong dtype,
# because of check bypassing
assert_raises(ValueError, clf.fit, X, y, check_input=False)
@MechCoder (Member)

I would suggest changing this to a smoke test:

clf.fit(X, y, check_input=False)

and adding a comment saying that because check_input=False, no exhaustive check is made on y; _preprocess_data just casts the dtype of y to the dtype of X, so this passes. (We will definitely forget in the future.)
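
Put together, the suggested smoke test with its comment might read as follows (a sketch, not the merged diff):

# With check_input=False, fit performs no exhaustive validation of y;
# _preprocess_data merely casts y to X's dtype, so this call must not raise.
clf.fit(X, y, check_input=False)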

assert_equal(y_mean_6432.dtype, np.float64)
assert_equal(X_norm_6432.dtype, np.float64)

assert_array_almost_equal(Xt_32, Xt_64)
@MechCoder (Member)

copy is set to True by default. Hence, can you also check that the dtype of the initial array does not change?

@Henley13 (Author)

I just did, a few lines below!

@Henley13 (Author)

But with assert_array_equal(X_32, X_32_initial), I don't know whether the dtype is properly tested...
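
Indeed, assert_array_equal compares values only, so the dtype needs its own assertion; a sketch:

import numpy as np
from numpy.testing import assert_array_equal

# Value equality would still pass after an upcast, so check the dtype
# of the original array explicitly as well.
assert_array_equal(X_32, X_32_initial)
assert X_32.dtype == np.float32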

@jnothman added this to the 0.19 milestone on Jun 18, 2017
@amueller (Member)

has conflicts

@amueller (Member)

@MechCoder I mistook your avatar for a fidget spinner and now I can't unsee it.

@GaelVaroquaux (Member)

@Henley13: can you resolve the merge conflicts, please?

@MechCoder (Member)

@amueller I googled what a fidget spinner is and now I have to change my avatar :-|

@MechCoder (Member)

Can you just change the "If it's not the case" everywhere and I'll be happy to merge.

@Henley13 (Author)

@MechCoder Sorry, I thought I did it. Should be ok now.

@MechCoder merged commit 89962f0 into scikit-learn:master on Jun 23, 2017
@MechCoder (Member)

thanks @Henley13

Commits referencing this pull request were later pushed to the forks of dmohns (Aug 7, 2017), NelleV (Aug 11, 2017), paulha (Aug 19, 2017), AishwaryaRK (Aug 29, 2017), maskani-moh (Nov 15, 2017), and jwjohnson314 (Dec 18, 2017), each carrying the same commit messages:

* add test for _preprocess_data and make it consistent

* fix pep8

* add doc, cast systematically y in X.dtype and update test_coordinate_descent.py

* test if input values don't change with copy=True

* test if input values don't change with copy=True #2

* fix doc

* fix doc #2

* fix doc #3