[MRG] MNT: Optionally skip another input check in ElasticNet, Lasso by jakirkham · Pull Request #11487 · scikit-learn/scikit-learn

[MRG] MNT: Optionally skip another input check in ElasticNet, Lasso #11487


Merged

Conversation

jakirkham
Contributor

Optionally disables a check_array call in a tight loop in Dictionary Learning when fitting with LASSO.

Reference Issues/PRs

Follow-up to PR ( #5133 ) to optionally skip more check_array calls.

What does this implement/fix? Explain your changes.

Any other comments?

@jakirkham jakirkham changed the title MNT: Optionally skip input check in _preprocess_data WIP: MNT: Optionally skip input check in _preprocess_data Jul 12, 2018
@jakirkham jakirkham force-pushed the opt_skip_check_input_preprocess_data branch from e0206f7 to a7da2d9 Compare July 12, 2018 14:34
Adds a `check_input` option to `_preprocess_data` that controls whether `check_array` is run on the input array. As the `check_array` step can be expensive (particularly when detecting non-finite values), it is important to be able to skip it when it shows up in a tight loop, as with Lasso in Dictionary Learning. The default is to keep the check. Even when the check is disabled, the array is still copied if a copy is requested.
Many of the places using `_preprocess_data` do so directly and/or via `_pre_fit`. As such, it makes sense to expose the `check_input` parameter in `_pre_fit` as well.
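A minimal sketch of the pattern this commit describes (illustrative only; `preprocess` below is a hypothetical stand-in, not the actual `_preprocess_data` source): run `check_array` only when requested, but still honor a requested copy when the check is skipped.

import numpy as np
import scipy.sparse as sp
from sklearn.utils import check_array


def preprocess(X, copy=True, check_input=True):
    if check_input:
        # Full validation: dtype coercion, finiteness check, optional copy.
        X = check_array(X, accept_sparse=['csr', 'csc'], copy=copy,
                        dtype=[np.float64, np.float32])
    elif copy:
        # Validation is skipped, but a requested copy must still be made.
        X = X.copy() if sp.issparse(X) else np.array(X, copy=True, order='K')
    return X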
@jakirkham jakirkham force-pushed the opt_skip_check_input_preprocess_data branch from a7da2d9 to 6e47ed6 Compare July 12, 2018 16:57
@jakirkham jakirkham changed the title WIP: MNT: Optionally skip input check in _preprocess_data MNT: Optionally skip input check in _preprocess_data Jul 12, 2018
If we want to skip `check_array` calls in `ElasticNet` or `Lasso`, we
should disable them in `_pre_fit` as well. Otherwise we will still have
to do the same checks that we skipped previously.
@jakirkham jakirkham force-pushed the opt_skip_check_input_preprocess_data branch from 6e47ed6 to 9a59e73 Compare July 12, 2018 17:44
@jakirkham jakirkham changed the title MNT: Optionally skip input check in _preprocess_data [MRG] MNT: Optionally skip input check in _preprocess_data Jul 12, 2018
@jakirkham jakirkham changed the title [MRG] MNT: Optionally skip input check in _preprocess_data [MRG] MNT: Optionally skip another input check in ElasticNet, Lasso Jul 12, 2018
Member
@rth rth left a comment


I have mixed feelings about this one. If I understand correctly, this saves n_target calls to check_array. How much does this impact performance in your use case?

@jakirkham
Contributor Author

It's pretty significant, unfortunately. I haven't done a formal benchmark, but I caught this in the flame graph when profiling the code. I wouldn't be surprised if benchmarking shows the same problem demonstrated in this comment. I'm also using dict_learning_online, as the author of that PR was.

@amueller
Member

@jakirkham can you maybe just show some results on a synthetic dataset? I totally believe it though :-/ the infinity check can be slow.

@jakirkham
Contributor Author

can you maybe just show some results on a synthetic dataset?

What would you propose? Also what are you looking for in this run?

I totally believe it though :-/ the infinity check can be slow.

Yep, that was the next thing up in the flame graph.
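For context, a rough illustration of the per-call cost of the finiteness check in `check_array` (not the PR's benchmark; timings vary by machine):

from timeit import timeit

import numpy as np
from sklearn.utils import check_array

X = np.random.rand(2000, 1000)

with_check = timeit(lambda: check_array(X), number=50)
no_check = timeit(lambda: check_array(X, force_all_finite=False), number=50)
print("with finiteness check:    %.3fs" % with_check)
print("without finiteness check: %.3fs" % no_check)

The gap between the two timings is roughly the overhead paid on each redundant `check_array` call.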

                        dtype=FLOAT_DTYPES)
    elif copy:
        if sp.issparse(X):
            X = X.copy()
Member


This isn't covered by tests.

Contributor Author


OK, I think I've fixed this.

@amueller
Member

I meant doing some benchmarks showing that it actually helps in some cases, just as a sanity test?

@@ -321,6 +321,15 @@ def test_csr_preprocess_data():
    assert_equal(csr_.getformat(), 'csr')


def test_csr_preprocess_data_no_checks():
Member


Maybe better to parametrize the test_csr_preprocess_data above with,

@pytest.mark.parametrize('check_data', (True, False))
def test_csr_preprocess_data(check_data):
    ...

as it's the same test.

Contributor Author


The test I added wasn't covering all the cases we should cover, so I've updated that test instead. I don't think the two should be merged, but please take another look and let me know.

@jakirkham
Contributor Author
jakirkham commented Jul 16, 2018

Thanks for clarifying, @amueller.

I used the script below to perform a simplistic benchmark on a Mac; it was heavily influenced by this example. I also set OPENBLAS_NUM_THREADS=1 when running it.

Script:
#!/usr/bin/env python


from time import time

import numpy as np

from scipy.misc import face
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d


# Get test data
face = face(gray=True)

# Normalize and downsample test data for speed
face = face / 255.
face = face[::2, ::2] + face[1::2, ::2] + face[::2, 1::2] + face[1::2, 1::2]
face /= 4.0

# Distort test data
height, width = face.shape
distorted = face.copy()
distorted[:, width // 2:] += 0.075 * np.random.randn(height, width // 2)

# Generate patches from test data
patch_size = (30, 30)
data = extract_patches_2d(distorted[:, :width // 2], patch_size)
data = data.reshape(data.shape[0], -1)
data -= np.mean(data, axis=0)
data /= np.std(data, axis=0)

# Perform dictionary learning on test data
times = np.empty((20,), dtype=float)
for i in range(len(times)):
    t0 = time()
    dico = MiniBatchDictionaryLearning(
        n_components=10, alpha=0.2, n_iter=40,
        fit_algorithm="cd", n_jobs=1, batch_size=20
    )
    V = dico.fit(data).components_
    times[i] = time() - t0

print(
    "Dictionary learning runtime: %.2fs±%.2fs." % (times.mean(), times.std())
)

Environment:
name: sklearndev
channels:
  - conda-forge
  - defaults
dependencies:
  - alabaster=0.7.10=py36_1
  - appnope=0.1.0=py36_0
  - asn1crypto=0.24.0=py36_0
  - atomicwrites=1.1.5=py36_0
  - attrs=18.1.0=py_0
  - babel=2.6.0=py36_0
  - backcall=0.1.0=py_0
  - blas=1.1=openblas
  - bleach=2.1.3=py_0
  - ca-certificates=2018.4.16=0
  - certifi=2018.4.16=py36_0
  - cffi=1.11.5=py36_0
  - chardet=3.0.4=py36_0
  - cloudpickle=0.5.3=py_0
  - cryptography=2.2.1=py36_0
  - cycler=0.10.0=py36_0
  - cython=0.28.3=py36_0
  - dask-core=0.17.5=py_0
  - decorator=4.3.0=py_0
  - docutils=0.14=py36_0
  - entrypoints=0.2.3=py36_1
  - flake8=3.5.0=py36_0
  - freetype=2.8.1=0
  - html5lib=1.0.1=py_0
  - idna=2.6=py36_1
  - imageio=2.3.0=py36_0
  - imagesize=1.0.0=py36_0
  - ipykernel=4.8.2=py36_0
  - ipython=6.4.0=py36_0
  - ipython_genutils=0.2.0=py36_0
  - jedi=0.12.0=py36_0
  - jinja2=2.10=py36_0
  - jpeg=9b=2
  - jsonschema=2.6.0=py36_1
  - jupyter_client=5.2.3=py36_0
  - jupyter_core=4.4.0=py_0
  - kiwisolver=1.0.1=py36_1
  - libffi=3.2.1=3
  - libgfortran=3.0.0=0
  - libpng=1.6.34=0
  - libsodium=1.0.16=0
  - libtiff=4.0.9=0
  - markupsafe=1.0=py36_0
  - matplotlib=2.2.2=py36_1
  - mccabe=0.6.1=py36_0
  - mistune=0.8.3=py36_1
  - more-itertools=4.2.0=py_0
  - nbconvert=5.3.1=py_1
  - nbformat=4.4.0=py36_0
  - ncurses=5.9=10
  - networkx=2.1=py36_0
  - notebook=5.5.0=py36_0
  - numpy=1.14.3=py36_blas_openblas_200
  - numpydoc=0.8.0=py36_0
  - olefile=0.45.1=py36_0
  - openblas=0.2.20=8
  - openssl=1.0.2o=0
  - packaging=17.1=py_0
  - pandas=0.23.0=py36_1
  - pandoc=2.2.1=hde52d81_0
  - pandocfilters=1.4.2=py36_0
  - parso=0.2.1=py_0
  - pexpect=4.6.0=py36_0
  - pickleshare=0.7.4=py36_0
  - pillow=5.1.0=py36_0
  - pip=9.0.3=py36_0
  - pluggy=0.6.0=py_0
  - prompt_toolkit=1.0.15=py36_0
  - ptyprocess=0.5.2=py36_0
  - py=1.5.3=py_0
  - pycodestyle=2.3.1=py36_0
  - pycparser=2.18=py36_0
  - pyflakes=1.6.0=py36_0
  - pygments=2.2.0=py36_0
  - pyopenssl=18.0.0=py36_0
  - pyparsing=2.2.0=py36_0
  - pysocks=1.6.8=py36_1
  - pytest=3.6.0=py36_1
  - python=3.6.5=1
  - python-dateutil=2.7.3=py_0
  - pytz=2018.4=py_0
  - pywavelets=0.5.2=py36_1
  - pyzmq=17.0.0=py36_4
  - readline=7.0=0
  - requests=2.18.4=py36_1
  - scikit-image=0.14.0=py36hfc679d8_1
  - scipy=1.1.0=py36_blas_openblas_200
  - send2trash=1.5.0=py_0
  - setuptools=39.2.0=py36_0
  - simplegeneric=0.8.1=py36_0
  - six=1.11.0=py36_1
  - snowballstemmer=1.2.1=py36_0
  - sphinx=1.6.2=py36_0
  - sphinx-gallery=0.1.12=py36_0
  - sphinxcontrib-websupport=1.1.0=py36_0
  - sqlite=3.20.1=2
  - terminado=0.8.1=py36_0
  - testpath=0.3.1=py36_0
  - tk=8.6.7=0
  - toolz=0.9.0=py_0
  - tornado=5.0.2=py36_0
  - traitlets=4.3.2=py36_0
  - typing=3.6.4=py36_0
  - urllib3=1.22=py36_0
  - wcwidth=0.1.7=py36_0
  - webencodings=0.5.1=py36_0
  - wheel=0.31.0=py36_0
  - xz=5.2.3=0
  - zeromq=4.2.5=1
  - zlib=1.2.11=h470a237_3
  - pip:
    - click==6.7
    - dask==0.17.5
    - heapdict==1.0.0
    - msgpack==0.5.6
    - psutil==5.4.5
    - scikit-learn==0.20.dev0
    - sortedcontainers==2.0.2
    - tblib==1.3.2
    - zict==0.1.3

With this change, I find the runtime of dictionary learning over 20 runs is 4.28s±0.09s. Without this change ( 153aec4 on master ), it is 4.76s±0.31s. It's not a massive difference, but it is noticeable (especially for something that shouldn't be there at all).

Edit: both cases used the same number of runs; a typo earlier suggested otherwise. Everything else is the same.

@jakirkham jakirkham force-pushed the opt_skip_check_input_preprocess_data branch from 952447a to d2f5087 Compare July 16, 2018 19:18
@jakirkham
Contributor Author

AppVeyor failed for some weird reason unrelated to this PR. Should it be restarted?

Member
@rth rth left a comment


If you can, but the CI queue is quite large at the moment, so it will take a while. I assume that it'll pass, and generally this LGTM.

@rth
Member
rth commented Jul 17, 2018

(I can't restart the build because of how the AppVeyor account is currently set up.)

@jakirkham jakirkham force-pushed the opt_skip_check_input_preprocess_data branch from d2f5087 to 473e80b Compare July 17, 2018 15:05
@jakirkham
Contributor Author

No worries. I went ahead and force-pushed to restart it.

@agramfort
Member

I have mixed feelings here too.

Does it mean that the input check is currently run multiple times within Lasso.fit?

What raises a question is that dict learning already calls lasso.fit(..., check_input=False), so it should not need to make use of private functions that are also called in fit by Lasso et al. estimators.

Am I missing something?

Yes, I know this code is a mess... I plead guilty...

@jakirkham
Contributor Author

Yes, check_array is called at the beginning of Lasso and again within Lasso by these private prep functions. Dictionary Learning also loops over batches of arrays that have already been checked, calling Lasso on each batch (which thus rechecks each batch). So these checks happen many times within a loop. This doesn't seem right to me, as they should only happen once.

I can't comment on how the code should be structured, since I don't know scikit-learn's internal code model very well. That said, lasso.fit(..., check_input=False) was already added in PR ( #5133 ). This PR merely makes sure a few internal functions respect check_input=False where they previously did not.

The internal functions here are being called by Lasso (and possibly other things) where these checks are redundant. Dictionary Learning is not calling them directly; it only calls Lasso, which calls _pre_fit.
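A hedged sketch of the call pattern described above (synthetic data, not the dictionary learning source): validate the shared design matrix once outside the hot loop, then pass `check_input=False` on every fit so `Lasso` does not re-run `check_array`.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.utils import check_array

rng = np.random.RandomState(0)
# Validate once up front: float64 and Fortran-ordered, as the coordinate
# descent solver expects when its own checks are skipped.
X = check_array(rng.rand(500, 30), dtype=np.float64, order='F')
targets = [rng.rand(500) for _ in range(100)]  # already-valid float64 targets

lasso = Lasso(alpha=0.1, fit_intercept=False)
for y in targets:
    # check_input=False (added in #5133) skips the per-fit check_array call;
    # this PR makes _preprocess_data / _pre_fit honor the same flag.
    lasso.fit(X, y, check_input=False)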

Make sure that even when checks are disabled, the sparse input still
gets copied if requested.
@jakirkham jakirkham force-pushed the opt_skip_check_input_preprocess_data branch from 473e80b to 387fb13 Compare July 18, 2018 16:20
@jakirkham
Contributor Author

FWIW CI passes.

@rth
Member
rth commented Jul 19, 2018

@agramfort Did the above answer address your concerns or do you have other comments about this?

@jakirkham
Contributor Author

Friendly nudge 😉

@jakirkham
Contributor Author

Anything else?

@rth
Member
rth commented Jul 27, 2018

As far as I can tell, @agramfort's comments were addressed. Merging.

@rth rth merged commit b3c177c into scikit-learn:master Jul 27, 2018
@rth
Member
rth commented Jul 27, 2018

Thanks @jakirkham !

@jakirkham
Contributor Author

Thanks all for the reviews :)

@jakirkham jakirkham deleted the opt_skip_check_input_preprocess_data branch July 27, 2018 13:29