MNT Add asv benchmark suite by jeremiedbb · Pull Request #17026 · scikit-learn/scikit-learn · GitHub

MNT Add asv benchmark suite #17026


Merged: 34 commits merged into scikit-learn:master on Jul 29, 2020

Conversation

Member
@jeremiedbb jeremiedbb commented Apr 24, 2020

Fixes #16723

It's essentially the benchmark suite started here.

The main goal is to be able to easily ask for a benchmark when a PR might impact performance. The benchmark suite includes only a subset of the sklearn estimators, but we can add new ones later. Obviously, adding more estimators makes the whole run take longer, and having all estimators would take several hours.

Right now, the run takes an hour and a half on my laptop, with n_jobs=1, with an empty cache, with the default configuration. With the fastest config and some stuff cached it goes down to 20 min.

TODO:

  • Add documentation

ping @rth

@jeremiedbb jeremiedbb mentioned this pull request Apr 24, 2020
Member
@rth rth left a comment

// `asv` will cache results of the recent builds in each
// environment, making them faster to install next time. This is
// the number of builds to keep, per environment.
// "build_cache_size": 2,
Member

It's funny they went with JSON, which specifically doesn't allow comments in the spec.

@property
@classmethod
@abstractmethod
def params(self):
Member

Are all 3 necessary? How can it even work? I would have imagined property and classmethod have conflicting signatures.

Member Author

Apparently it works :) but you're right, only 2 are necessary. I changed that.
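For the record, a minimal sketch (not the exact code from this PR) of why two decorators are enough to require a params attribute on subclasses; the class and parameter values below are illustrative:

from abc import ABC, abstractmethod


class Benchmark(ABC):
    # @abstractmethod is what forces subclasses to define `params`;
    # @property only documents that it is read as an attribute.
    @property
    @abstractmethod
    def params(self):
        pass


class KMeansBenchmark(Benchmark):
    # A plain class attribute named `params` satisfies the abstract
    # requirement, which is how asv-style benchmarks declare parameters.
    params = (['dense', 'sparse'], ['full', 'elkan'])


KMeansBenchmark()  # fine; Benchmark() alone would raise TypeError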

from sklearn.model_selection import train_test_split

# memory location for caching datasets
M = Memory(location=os.path.dirname(os.path.realpath(__file__)) + "/cache")
Member

Can't we reuse the scikit-learn datasets dir (maybe with a specific "asv-cache" folder there)?

Member Author

I use the cache folder for the datasets but also for temporary stuff. I think it makes more sense to keep everything in the same place, and also to keep everything related to the benchmarks in the benchmarks folder.

@jeremiedbb
Member Author

Note for later:
We can try to set up a GitHub action to automatically run the benchmark on a specific module/estimator with a special tag in the commit or comment.

@thomasjpfan
Member

Note for later:
We can try to set up a GitHub action to automatically run the benchmark on a specific module/estimator with a special tag in the commit or comment.

Sounds fun! Would it run on master and on the PR, then display the results? Since the benchmark is running on the same instance, the results should be more or less comparable.

@jeremiedbb
Member Author
jeremiedbb commented Apr 30, 2020

Sounds fun! Would it run on master and on the PR, then display the results?

yes

Since the benchmark is running on the same instance, the results should be more or less comparable.

Sorry, I don't understand this part.

@jeremiedbb jeremiedbb changed the title [WIP] Add an asv benchmark suite [MRG] Add an asv benchmark suite May 5, 2020
@jeremiedbb
Member Author

@ogrisel I changed the order to put the command to run on a specific module first. Let me know what you think.

Member
@ogrisel ogrisel left a comment

This looks great. Here are some suggestions to reorganize the order of the doc a bit, to show the most common usage first.

Member
@ogrisel ogrisel left a comment

This looks great. +1 on my side.

Benchmarks for KMeans.
"""

param_names = ['representation', 'algorithm', 'n_jobs']
Member

n_jobs should be removed. It's a bit worrying that the inconsistency between this and params is not picked up anywhere.

Member Author

Yes, it's weird that asv does not check length consistency of params and param_names.
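For illustration, a hypothetical, trimmed-down asv-style benchmark showing the convention: each name in param_names corresponds to one list in params, and asv passes one argument per parameter to setup and to each time_* method, but it never checks that the two sequences have the same length:

class KMeansSketch:
    # One entry per parameter axis; asv will not complain if these two
    # sequences end up with different lengths.
    param_names = ['representation', 'algorithm']
    params = (['dense', 'sparse'], ['full', 'elkan'])

    def setup(self, representation, algorithm):
        # Called once per parameter combination, with one positional
        # argument per entry of param_names.
        self.n_clusters = 8

    def time_fit(self, representation, algorithm):
        pass  # a real benchmark would fit KMeans on cached data here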

Member Author

good catch btw :)

Member
@NicolasHug NicolasHug left a comment

Thanks @jeremiedbb

Mostly looks good but I'm concerned about the dependency on conda. I think we shouldn't expect that.

Do we have a way to clarify that everything benchmark-related is private? (We can never be too careful...)

Comment on lines +786 to +789
The benchmark suite is configured to run against your local clone of
scikit-learn. Make sure it is up to date::

    git fetch upstream
Member

I'm not sure I understand this sentence, and the need to git fetch:

According to the text below,

  • one can select which branches to compare
  • origin/master is used, so git fetch would have no effect

Member Author

It was upstream/master in a previous commit. I must have made a bad copy-paste... I put it back.


asv run --python=same

It's particularly useful when you installed scikit-learn in editable mode. By
Member

Considering that this is the contributing guide, is there any use case where one would not have installed scikit-learn in editable mode?

Member Author
@jeremiedbb jeremiedbb May 26, 2020

You can run the benchmarks once in an env where you have it installed in editable mode, and then in an env where you have installed it from, say, conda, to compare the perf between MKL and OpenBLAS.

Also, even though you have it installed in editable mode, by default asv will create an isolated env to run the benchmarks.

I think both modes are useful.

Member

But is this really related to the editable mode?

What I understand from your comment is that this is useful when one wants to have full control on the dependencies, instead of relying on whatever asv will be installing by default?

@@ -762,6 +762,85 @@ To test code coverage, you need to install the `coverage

3. Loop.

Monitoring performance [*]_
Member

I was a bit confused by the "ref" since it brings you to the bottom of the page, but the footnote isn't obvious.

Maybe we can just add the content of the footnote as the first sentence of this section.

Member Author

I agree. I moved it to the top of the section.

// If missing or the empty string, the tool will be automatically
// determined by looking for tools on the PATH environment
// variable.
"environment_type": "conda",
Member

I don't think we should expect contributors to have conda.

Member Author

In the config file it's the default env. I added an explanation on how to use virtualenv instead.

Member

Should we leave it empty?

// If missing or the empty string, the tool will be automatically
// determined by looking for tools on the PATH environment
// variable

from .utils import neg_mean_inertia


class KMeans_bench(Predictor, Transformer, Estimator, Benchmark):
Member

Nit, but I think PEP 8 says this should be KMeansBench?

Member Author

Done. I went for KMeansBenchmark.

@@ -0,0 +1,220 @@
import os
Member

A few comments on the functions and classes of this file would help.

Member Author

Done. I hope it's clearer. Tell me if and where it needs more.

@jeremiedbb
Member Author

I added the possibility to change config params with env vars. They will have priority over those defined in config.json. It makes it easier to run the benchmark suite with non-default values in an automated setting.
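A rough sketch of that override logic (the variable names follow the SKLBENCH_* snippet quoted later in this thread; parsing the value with json.loads is an assumption for the sketch, not necessarily what the PR does):

import json
import os

# Stand-in for the values read from config.json.
config = {'profile': 'regular', 'n_jobs_vals': [1]}

# Environment variables, when set, take priority over config.json.
profile = os.getenv('SKLBENCH_PROFILE', config['profile'])

n_jobs_vals_env = os.getenv('SKLBENCH_NJOBS')
if n_jobs_vals_env:
    n_jobs_vals = json.loads(n_jobs_vals_env)  # e.g. SKLBENCH_NJOBS="[1, 4]"
else:
    n_jobs_vals = config['n_jobs_vals']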

@jeremiedbb
Member Author

I'm setting up weekly runs of the benchmark suite on a private INRIA cloud. The results will be hosted in this repo and will be visible here. (For now it only stores test results.)

Member
@rth rth left a comment

So I tried to run

asv continuous -b LogisticRegression upstream/master HEAD

on this branch, but I'm getting the following error:

asv continuous -b LogisticRegression upstream/master HEAD
· Creating environments...
· Discovering benchmarks
·· Uninstalling from conda-py3.8-cython-joblib-numpy-scipy
·· Building 15342aa9 <add-asv-benchmarks> for conda-py3.8-cython-joblib-numpy-scipy........................................................
·· Installing 15342aa9 <add-asv-benchmarks> into conda-py3.8-cython-joblib-numpy-scipy.
·· Error running /home/rth/src/scikit-learn/asv_benchmarks/env/32ae2ab96bf2f01aaf986fddd3b291e8/bin/python /home/rth/miniconda3/envs/sklearn-dev/lib/python3.8/site-packages/asv/benchmark.py discover /home/rth/src/scikit-learn/asv_benchmarks/benchmarks /tmp/tmphz8wtj1_/result.json (exit status 1)
   STDOUT -------->
   
   STDERR -------->
   Traceback (most recent call last):
     File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.8/site-packages/asv/benchmark.py", line 1315, in <module>
       main()
     File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.8/site-packages/asv/benchmark.py", line 1308, in main
       commands[mode](args)
     File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.8/site-packages/asv/benchmark.py", line 1004, in main_discover
       list_benchmarks(benchmark_dir, fp)
     File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.8/site-packages/asv/benchmark.py", line 989, in list_benchmarks
       for benchmark in disc_benchmarks(root):
     File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.8/site-packages/asv/benchmark.py", line 896, in disc_benchmarks
       benchmark = _get_benchmark(name, module, module_attr,
     File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.8/site-packages/asv/benchmark.py", line 838, in _get_benchmark
       instance = klass()
   TypeError: Can't instantiate abstract class Estimator with abstract methods setup_cache_

The results will be hosted in this repo and will be visible here. (For now it only stores test results.)

Great! Looking forward to that!


profile = os.getenv('SKLBENCH_PROFILE', config['profile'])

n_jobs_vals_env = os.getenv('SKLBENCH_NJOBS')
Member

Could we list all env variables used in the documentation section on these benchmarks?

Member Author

In the doc section I mention the config file 'config.json'. In this file I describe the env vars. I don't want to expand much on these config vars in the doc section; it's for a very specific use case. If someone needs to change the config, all the info is in the config file.

Member

OK, but what's the use case for env variables here? I imagine one would run these with a config file that can be customized; why would we additionally need env variables?

Member Author

I added these env vars yesterday. It makes it easier to run the benchmark suite with non-default values in an automated setting.

Member

Maybe in the docs we can just mention that it is possible to further configure the benchmark suite by setting env variables, which are described in asv_benchmarks/benchmarks/config.json.

Comment on lines 32 to 36
cache_path = os.path.join(current_path, 'cache')
if not os.path.exists(cache_path):
    os.mkdir(cache_path)
estimators_path = os.path.join(current_path, 'cache', 'estimators')
if not os.path.exists(estimators_path):
Member

For new Python code doing a lot of path manipulations, using pathlib makes it much more pleasant to read/write. Here:

Suggested change:

from pathlib import Path
current_path = Path(__file__).resolve().parent
config_path = current_path / 'config.json'
[...]
cache_path = current_path / 'cache'
if not cache_path.exists():
    os.mkdir(cache_path)
estimators_path = current_path / 'cache' / 'estimators'
if not estimators_path.exists():

https://treyhunner.com/2018/12/why-you-should-be-using-pathlib/

Comment on lines 73 to 74
f_name = (benchmark.__class__.__name__[:-6]
          + '_data_' + '_'.join(list(map(str, params))) + '.pkl')
Member

What's __name__[:-6]? For short benchmark names it will be an empty string.

Maybe re.sub('(?i)benchmark$', '', name) would be a bit more explicit?

Member Author

All benchmarks were named <Estimator>_bench. I just used it to truncate the _bench part, so no risk of getting an empty string.
But I renamed them <Estimator>Benchmark and forgot to update that... I just keep the whole name now, it's simpler :)

path = os.path.join(os.path.dirname(os.path.realpath(__file__)),
                    'cache', 'tmp')

list(map(os.remove, (os.path.join(path, f) for f in os.listdir(path))))
Member

Suggested change:

for fpath in os.listdir(path):
    os.remove(os.path.join(path, fpath))

Otherwise it will be annoying to debug if something doesn't work as expected.

Comment on lines 141 to 142
with open(data_path, 'wb') as f:
    pickle.dump(data, f)
Member

Serializing/cleaning datasets on disk for each estimator class sounds expensive/slow. Can't we rather have a global list of datasets that are lazy-loaded and cached in memory when they are needed, then copied before use? Or are datasets too big to hold them all in memory at once?

Alternatively, I'm OK with joblib.Memory for typically used datasets (as done in the following file), but then I don't understand why this method is necessary. That means 2 levels of cache, right?

Member Author
@jeremiedbb jeremiedbb Jun 10, 2020

Or are datasets too big to hold them all in memory at once?

Yes, some of them are quite big and are sometimes different for different combinations of parameters, so it would take a lot of RAM.

I reworked the whole dataset caching part. There's only joblib.Memory caching now. It's much cleaner.
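As a rough illustration of that single level of caching (the dataset, its size and the cache location below are assumptions for the sketch, not the PR's exact code):

import os

from joblib import Memory
from sklearn.datasets import make_blobs

# Cache results next to the benchmark files, as in the snippet reviewed above.
M = Memory(location=os.path.dirname(os.path.realpath(__file__)) + "/cache")


@M.cache
def _blobs_dataset(n_samples=10000, n_features=20, n_clusters=8):
    X, _ = make_blobs(n_samples=n_samples, n_features=n_features,
                      centers=n_clusters, random_state=0)
    return X


# The first call generates and stores the data; subsequent calls (even in
# later benchmark runs) load it from disk instead of regenerating it.
X = _blobs_dataset()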


First of all you need to install the development version of asv::

    pip install git+https://github.com/airspeed-velocity/asv
Member

Why the development version?

Member Author

So I tried to run
asv continuous -b LogisticRegression upstream/master HEAD
on this branch, but I'm getting the following error

To not get this error :D
(the change that avoids benchmarking abstract classes is not released yet)

@ogrisel
Member
ogrisel commented Jun 9, 2020

For hosting the benchmark data and HTML report I would be in favor of re-purposing the old https://github.com/scikit-learn/scikit-learn-speed repo which is now unmaintained.

@NicolasHug
Member

@jeremiedbb I pushed an empty commit so that docs would build again (artifacts were empty) and it seems like the docs are broken now.

Is this a random conda issue or could it come from the changes in this PR?

@jeremiedbb
Member Author

@NicolasHug merging master fixed it. I guess it was an old unrelated issue.

Member
@NicolasHug NicolasHug left a comment

Made a last pass with some minor comments, but LGTM anyway. Thanks a lot @jeremiedbb, this will be nice to have.

In the contributing guide there are 2 occurrences of the term "benchmark" prior to the new section:

  • PRs should often substantiate the change, through benchmarks...

  • Bonus points for contributions that include a performance analysis with a benchmark...

I'd suggest adding links to the new "Monitoring performance" section (and also removing the part about the mailing list in the second occurrence, which I believe is outdated).





More information on how to write a benchmark and how to use asv can be found in
the `asv documentation <https://asv.readthedocs.io/en/latest/index.html>`_.

Member

Should we finally describe what contributors should do once they have run some benchmarks? i.e. where to retrieve the results, and the preferred way to communicate them in the PR / issue?

Member Author

How to retrieve the results is already described just above. I just added that one should report the results on the GitHub pull request. Is that what you had in mind?



@jeremiedbb
Member Author

But is this really related to the editable mode?
What I understand from your comment is that this is useful when one wants to have full control on the dependencies, instead of relying on whatever asv will be installing by default?

It is, in the sense that when you want to iterate quickly on your PR, you don't want to create a new environment and recompile all of sklearn each time. I added something to explain that.

@jeremiedbb
Member Author

Should we leave it empty?
// If missing or the empty string, the tool will be automatically
// determined by looking for tools on the PATH environment
// variable

I'm not sure what could happen. I looked at pandas, numpy and asv, and none of them leaves it empty. They use either conda or virtualenv. I think leaving it empty would make things harder for us when it does not work for somebody.

@jeremiedbb
Member Author

Maybe in the docs we can just mention that it is possible to further configure the benchmark suite by setting env variables

I introduced these environment variables only to make things easier when running the benchmark suite in an automated setting. For the contributing guide I think it's better to only mention the config.json file (which does exactly the same thing as the env vars, and the env vars are described in it).

@NicolasHug NicolasHug merged commit 9964c50 into scikit-learn:master Jul 29, 2020
@NicolasHug NicolasHug changed the title [MRG] Add an asv benchmark suite MNT Add asv benchmark suite Jul 29, 2020
@NicolasHug
Member

Merging, thanks a lot @jeremiedbb

@jnothman
Member
jnothman commented Jul 30, 2020 via email
