[MRG] MNT: Optionally skip another input check in ElasticNet, Lasso by jakirkham · Pull Request #11487 · scikit-learn/scikit-learn

[MRG] MNT: Optionally skip another input check in ElasticNet, Lasso #11487


Merged

Conversation

jakirkham
Contributor

Optionally disables a check_array call in a tight loop in Dictionary Learning when fitting with LASSO.

Reference Issues/PRs

Follow-up to PR ( #5133 ) to optionally skip more check_array calls.

What does this implement/fix? Explain your changes.

Any other comments?

@jakirkham jakirkham changed the title MNT: Optionally skip input check in _preprocess_data WIP: MNT: Optionally skip input check in _preprocess_data Jul 12, 2018
@jakirkham jakirkham force-pushed the opt_skip_check_input_preprocess_data branch from e0206f7 to a7da2d9 Compare July 12, 2018 14:34
Adds a `check_input` option to `_preprocess_data` that controls whether `check_array` is run on the input array. As the `check_array` step can be expensive (particularly when detecting non-finite values), it is important to be able to skip it when it shows up in a tight loop, as with Lasso in Dictionary Learning. The default is to keep the check. Even when the check is disabled, the array is still copied if a copy is requested.
Many of the places using `_preprocess_data` do so directly and/or via `_pre_fit`. As such, it makes sense to expose the `check_input` parameter in `_pre_fit` as well.
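A minimal sketch of the pattern this commit describes (illustrative only; `preprocess` below is a hypothetical stand-in, not the actual `_preprocess_data` source): run `check_array` only when requested, but still honor a requested copy when the check is skipped.

import numpy as np
import scipy.sparse as sp
from sklearn.utils import check_array


def preprocess(X, copy=True, check_input=True):
    if check_input:
        # Full validation: dtype coercion, finiteness check, optional copy.
        X = check_array(X, accept_sparse=['csr', 'csc'], copy=copy,
                        dtype=[np.float64, np.float32])
    elif copy:
        # Validation is skipped, but a requested copy must still be made.
        X = X.copy() if sp.issparse(X) else np.array(X, copy=True, order='K')
    return X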
@jakirkham jakirkham force-pushed the opt_skip_check_input_preprocess_data branch from a7da2d9 to 6e47ed6 Compare July 12, 2018 16:57
@jakirkham jakirkham changed the title WIP: MNT: Optionally skip input check in _preprocess_data MNT: Optionally skip input check in _preprocess_data Jul 12, 2018
If we want to skip `check_array` calls in `ElasticNet` or `Lasso`, we
should disable them in `_pre_fit` as well. Otherwise we will still have
to do the same checks that we skipped previously.
@jakirkham jakirkham force-pushed the opt_skip_check_input_preprocess_data branch from 6e47ed6 to 9a59e73 Compare July 12, 2018 17:44
@jakirkham jakirkham changed the title MNT: Optionally skip input check in _preprocess_data [MRG] MNT: Optionally skip input check in _preprocess_data Jul 12, 2018
@jakirkham jakirkham changed the title [MRG] MNT: Optionally skip input check in _preprocess_data [MRG] MNT: Optionally skip another input check in ElasticNet, Lasso Jul 12, 2018
Member
@rth rth left a comment


I have mixed feelings about this one. If I understand correctly, this saves n_target calls to check_array. How much does this impact performance in your use case?

@jakirkham
Contributor Author

It's pretty significant, unfortunately. I haven't done a formal benchmark, but I caught this in the flame graph when profiling the code. I wouldn't be surprised if benchmarking shows the same problem demonstrated in this comment. I'm also using dict_learning_online, as the author of that PR was.

@amueller
Member

@jakirkham can you maybe just show some results on a synthetic dataset? I totally believe it though :-/ the infinity check can be slow.

@jakirkham
Contributor Author

can you maybe just show some results on a synthetic dataset?

What would you propose? Also what are you looking for in this run?

I totally believe it though :-/ the infinity check can be slow.

Yep, that was the next thing up in the flame graph.
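For context, a rough illustration of the per-call cost of the finiteness check in `check_array` (not the PR's benchmark; timings vary by machine):

from timeit import timeit

import numpy as np
from sklearn.utils import check_array

X = np.random.rand(2000, 1000)

with_check = timeit(lambda: check_array(X), number=50)
no_check = timeit(lambda: check_array(X, force_all_finite=False), number=50)
print("with finiteness check:    %.3fs" % with_check)
print("without finiteness check: %.3fs" % no_check)

The gap between the two timings is roughly the overhead paid on each redundant `check_array` call.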

                        dtype=FLOAT_DTYPES)
    elif copy:
        if sp.issparse(X):
            X = X.copy()
Member


This isn't covered by tests.

Contributor Author


OK, I think I've fixed this.

@amueller
Member

I meant doing some benchmarks showing that it actually helps in some cases, just as a sanity test?

@@ -321,6 +321,15 @@ def test_csr_preprocess_data():
    assert_equal(csr_.getformat(), 'csr')


def test_csr_preprocess_data_no_checks():
Member


Maybe better to parametrize the test_csr_preprocess_data above with,

@pytest.mark.parametrize('check_data', (True, False))
def test_csr_preprocess_data(check_data):
    ...

as it's the same test.

Contributor Author


The test I added wasn't covering all the cases we should cover, so I've updated that test instead. I don't think the two should be merged, but please take another look and let me know.

@jakirkham
Contributor Author
jakirkham commented Jul 16, 2018

Thanks for clarifying, @amueller.

I used the script below to perform a simplistic benchmark on a Mac; it was heavily influenced by this example. I also set OPENBLAS_NUM_THREADS=1 when running it.

Script:
#!/usr/bin/env python


from time import time

import numpy as np

from scipy.misc import face
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d


# Get test data
face = face(gray=True)

# Normalize and downsample test data for speed
face = face / 255.
face = face[::2, ::2] + face[1::2, ::2] + face[::2, 1::2] + face[1::2, 1::2]
face /= 4.0

# Distort test data
height, width = face.shape
distorted = face.copy()
distorted[:, width // 2:] += 0.075 * np.random.randn(height, width // 2)

# Generate patches from test data
patch_size = (30, 30)
data = extract_patches_2d(distorted[:, :width // 2], patch_size)
data = data.reshape(data.shape[0], -1)
data -= np.mean(data, axis=0)
data /= np.std(data, axis=0)

# Perform dictionary learning on test data
times = np.empty((20,), dtype=float)
for i in range(len(times)):
    t0 = time()
    dico = MiniBatchDictionaryLearning(
        n_components=10, alpha=0.2, n_iter=40,
        fit_algorithm="cd", n_jobs=1, batch_size=20
    )
    V = dico.fit(data).components_
    times[i] = time() - t0

print(
    "Dictionary learning runtime: %.2fs±%.2fs." % (times.mean(), times.std())
)

Environment:
name: sklearndev
channels:
  - conda-forge
  - defaults
dependencies:
  - alabaster=0.7.10=py36_1
  - appnope=0.1.0=py36_0
  - asn1crypto=0.24.0=py36_0
  - atomicwrites=1.1.5=py36_0
  - attrs=18.1.0=py_0
  - babel=2.6.0=py36_0
  - backcall=0.1.0=py_0
  - blas=1.1=openblas
  - bleach=2.1.3=py_0
  - ca-certificates=2018.4.16=0
  - certifi=2018.4.16=py36_0
  - cffi=1.11.5=py36_0
  - chardet=3.0.4=py36_0
  - cloudpickle=0.5.3=py_0
  - cryptography=2.2.1=py36_0
  - cycler=0.10.0=py36_0
  - cython=0.28.3=py36_0
  - dask-core=0.17.5=py_0
  - decorator=4.3.0=py_0
  - docutils=0.14=py36_0
  - entrypoints=0.2.3=py36_1
  - flake8=3.5.0=py36_0
  - freetype=2.8.1=0
  - html5lib=1.0.1=py_0
  - idna=2.6=py36_1
  - imageio=2.3.0=py36_0
  - imagesize=1.0.0=py36_0
  - ipykernel=4.8.2=py36_0
  - ipython=6.4.0=py36_0
  - ipython_genutils=0.2.0=py36_0
  - jedi=0.12.0=py36_0
  - jinja2=2.10=py36_0
  - jpeg=9b=2
  - jsonschema=2.6.0=py36_1
  - jupyter_client=5.2.3=py36_0
  - jupyter_core=4.4.0=py_0
  - kiwisolver=1.0.1=py36_1
  - libffi=3.2.1=3
  - libgfortran=3.0.0=0
  - libpng=1.6.34=0
  - libsodium=1.0.16=0
  - libtiff=4.0.9=0
  - markupsafe=1.0=py36_0
  - matplotlib=2.2.2=py36_1
  - mccabe=0.6.1=py36_0
  - mistune=0.8.3=py36_1
  - more-itertools=4.2.0=py_0
  - nbconvert=5.3.1=py_1
  - nbformat=4.4.0=py36_0
  - ncurses=5.9=10
  - networkx=2.1=py36_0
  - notebook=5.5.0=py36_0
  - numpy=1.14.3=py36_blas_openblas_200
  - numpydoc=0.8.0=py36_0
  - olefile=0.45.1=py36_0
  - openblas=0.2.20=8
  - openssl=1.0.2o=0
  - packaging=17.1=py_0
  - pandas=0.23.0=py36_1
  - pandoc=2.2.1=hde52d81_0
  - pandocfilters=1.4.2=py36_0
  - parso=0.2.1=py_0
  - pexpect=4.6.0=py36_0
  - pickleshare=0.7.4=py36_0
  - pillow=5.1.0=py36_0
  - pip=9.0.3=py36_0
  - pluggy=0.6.0=py_0
  - prompt_toolkit=1.0.15=py36_0
  - ptyprocess=0.5.2=py36_0
  - py=1.5.3=py_0
  - pycodestyle=2.3.1=py36_0
  - pycparser=2.18=py36_0
  - pyflakes=1.6.0=py36_0
  - pygments=2.2.0=py36_0
  - pyopenssl=18.0.0=py36_0
  - pyparsing=2.2.0=py36_0
  - pysocks=1.6.8=py36_1
  - pytest=3.6.0=py36_1
  - python=3.6.5=1
  - python-dateutil=2.7.3=py_0
  - pytz=2018.4=py_0
  - pywavelets=0.5.2=py36_1
  - pyzmq=17.0.0=py36_4
  - readline=7.0=0
  - requests=2.18.4=py36_1
  - scikit-image=0.14.0=py36hfc679d8_1
  - scipy=1.1.0=py36_blas_openblas_200
  - send2trash=1.5.0=py_0
  - setuptools=39.2.0=py36_0
  - simplegeneric=0.8.1=py36_0
  - six=1.11.0=py36_1
  - snowballstemmer=1.2.1=py36_0
  - sphinx=1.6.2=py36_0
  - sphinx-gallery=0.1.12=py36_0
  - sphinxcontrib-websupport=1.1.0=py36_0
  - sqlite=3.20.1=2
  - terminado=0.8.1=py36_0
  - testpath=0.3.1=py36_0
  - tk=8.6.7=0
  - toolz=0.9.0=py_0
  - tornado=5.0.2=py36_0
  - traitlets=4.3.2=py36_0
  - typing=3.6.4=py36_0
  - urllib3=1.22=py36_0
  - wcwidth=0.1.7=py36_0
  - webencodings=0.5.1=py36_0
  - wheel=0.31.0=py36_0
  - xz=5.2.3=0
  - zeromq=4.2.5=1
  - zlib=1.2.11=h470a237_3
  - pip:
    - click==6.7
    - dask==0.17.5
    - heapdict==1.0.0
    - msgpack==0.5.6
    - psutil==5.4.5
    - scikit-learn==0.20.dev0
    - sortedcontainers==2.0.2
    - tblib==1.3.2
    - zict==0.1.3

With this change, I find the runtime of dictionary learning over 20 runs is 4.28s±0.09s. Without this change ( 153aec4 on master ), it is 4.76s±0.31s. It's not a massive difference, but it is noticeable (especially for something that shouldn't be there at all).

Edit: both cases used the same number of runs; a typo earlier suggested otherwise. Everything else is the same.

@jakirkham jakirkham force-pushed the opt_skip_check_input_preprocess_data branch from 952447a to d2f5087 Compare July 16, 2018 19:18
@jakirkham
Contributor Author

AppVeyor failed for some weird reason unrelated to this PR. Should it be restarted?

Member
@rth rth left a comment


If you can, but the CI queue is quite large at the moment, so it will take a while. I assume that it'll pass, and generally this LGTM.

@rth
Member
rth commented Jul 17, 2018

(I can't restart the build because of how the AppVeyor account is currently set up.)

@jakirkham jakirkham force-pushed the opt_skip_check_input_preprocess_data branch from d2f5087 to 473e80b Compare July 17, 2018 15:05
@jakirkham
Contributor Author

No worries. I went ahead and force-pushed to restart it.

@agramfort
Member

I have mixed feelings here too.

Does it mean that the input check is currently run multiple times within Lasso.fit?

What raises a question is that dict learning already calls lasso.fit(..., check_input=False), so it should not need to make use of private functions that are also called in fit by Lasso et al. estimators.

Am I missing something?

Yes, I know this code is a mess... I plead guilty...

@jakirkham
Contributor Author

Yes, check_array is called at the beginning of Lasso and again within Lasso by these private prep functions. Dictionary Learning also loops over batches of arrays that have already been checked, calling Lasso on each batch (which thus rechecks each batch). So these checks happen many times within a loop. This doesn't seem right to me, as they should only happen once.

I can't comment on how the code should be structured, since I don't know scikit-learn's internal code model very well. That said, lasso.fit(..., check_input=False) was already added in PR ( #5133 ). This PR merely makes sure a few internal functions respect check_input=False where they previously did not.

The internal functions here are being called by Lasso (and possibly other things) where these checks are redundant. Dictionary Learning is not calling them directly; it only calls Lasso, which calls _pre_fit.
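A hedged sketch of the call pattern described above (synthetic data, not the dictionary learning source): validate the shared design matrix once outside the hot loop, then pass `check_input=False` on every fit so `Lasso` does not re-run `check_array`.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.utils import check_array

rng = np.random.RandomState(0)
# Validate once up front: float64 and Fortran-ordered, as the coordinate
# descent solver expects when its own checks are skipped.
X = check_array(rng.rand(500, 30), dtype=np.float64, order='F')
targets = [rng.rand(500) for _ in range(100)]  # already-valid float64 targets

lasso = Lasso(alpha=0.1, fit_intercept=False)
for y in targets:
    # check_input=False (added in #5133) skips the per-fit check_array call;
    # this PR makes _preprocess_data / _pre_fit honor the same flag.
    lasso.fit(X, y, check_input=False)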

Make sure that even when checks are disabled, the sparse input still
gets copied if requested.
@jakirkham jakirkham force-pushed the opt_skip_check_input_preprocess_data branch from 473e80b to 387fb13 Compare July 18, 2018 16:20
@jakirkham
Contributor Author

FWIW CI passes.

@rth
Member
rth commented Jul 19, 2018

@agramfort Did the above answer address your concerns or do you have other comments about this?

@jakirkham
Contributor Author

Friendly nudge 😉

@jakirkham
Contributor Author

Anything else?

@rth
Member
rth commented Jul 27, 2018

As far as I can tell, @agramfort's comments were addressed. Merging.

@rth rth merged commit b3c177c into scikit-learn:master Jul 27, 2018
@rth
Member
rth commented Jul 27, 2018

Thanks @jakirkham !

@jakirkham
Contributor Author

Thanks all for the reviews :)

@jakirkham jakirkham deleted the opt_skip_check_input_preprocess_data branch July 27, 2018 13:29