MNT Add asv benchmark suite #17026
Conversation
Thanks @jeremiedbb! I'll try to give it a proper review in the next few days.
// `asv` will cache results of the recent builds in each
// environment, making them faster to install next time. This is
// the number of builds to keep, per environment.
// "build_cache_size": 2,
It's funny they went with JSON, which specifically doesn't allow comments in the spec.
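As a side note, this means whatever reads config.json has to strip those comments before parsing. A minimal sketch of that idea (the helper name and the call at the end are illustrative, not necessarily what this PR does):

    import json

    def load_commented_json(path):
        # Drop lines whose first non-whitespace characters are "//" so that
        # what remains is valid JSON. This sketch only handles full-line
        # comments; inline trailing comments would need more work.
        with open(path) as f:
            lines = [line for line in f if not line.lstrip().startswith("//")]
        return json.loads("".join(lines))

    # e.g. config = load_commented_json("asv_benchmarks/benchmarks/config.json")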
@property
@classmethod
@abstractmethod
def params(self):
Are all 3 necessary? How can it even work? I would have imagined property and classmethod have conflicting signatures.
Apparently it works :) but you're right, only 2 are necessary. I changed that.
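For reference, a minimal sketch of the two-decorator version (class names and parameter values are illustrative):

    from abc import ABC, abstractmethod

    class Benchmark(ABC):
        @property
        @abstractmethod
        def params(self):
            """Force concrete benchmarks to define a parameter grid."""

    class KMeansBenchmark(Benchmark):
        # A plain class attribute is enough to satisfy the abstract property,
        # so KMeansBenchmark() instantiates fine while Benchmark() still
        # raises TypeError.
        params = (['dense', 'sparse'], ['full', 'elkan'])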
from sklearn.model_selection import train_test_split

# memory location for caching datasets
M = Memory(location=os.path.dirname(os.path.realpath(__file__)) + "/cache")
Can't we reuse the scikit-learn datasets dir (maybe with a specific "asv-cache" folder there)?
I use the cache folder for the datasets but also for temporary stuff. I think it makes more sense to keep everything in the same place, and also to keep everything related to the benchmarks in the benchmarks folder.
Note for later:
Sounds fun! Would it run on master and on the PR, then display the results? Since the benchmark is running on the same instance, the results should be more or less comparable.
yes
Sorry, I don't understand this part.
@ogrisel I changed the order to put the command to run on a specific module first. Let me know what you think.
This looks great. Here are some suggestions to reorganize the order of the doc a bit, to show the most common usage first.
This looks great. +1 on my side.
asv_benchmarks/benchmarks/cluster.py
Outdated
Benchmarks for KMeans.
"""

param_names = ['representation', 'algorithm', 'n_jobs']
n_jobs should be removed. It's a bit worrying that the inconsistency between this and params is not picked up anywhere.
Yes, it's weird that asv does not check length consistency of params and param_names.
good catch btw :)
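To illustrate the issue: asv pairs params and param_names purely by position and does not check that their lengths match, so a stale entry like 'n_jobs' silently mislabels the results. A hypothetical, consistent version:

    class KMeansBenchmark:
        # One name per axis in params; asv won't warn if these drift apart,
        # so they have to be kept consistent by hand.
        param_names = ['representation', 'algorithm']
        params = (['dense', 'sparse'], ['full', 'elkan'])

        def time_fit(self, representation, algorithm):
            # asv passes one value per parameter axis to each benchmark method.
            pass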
Thanks @jeremiedbb
Mostly looks good but I'm concerned about the dependency on conda. I think we shouldn't expect that.
Do we have a way to clarify that everything benchmark-related is private? (We can never be too careful...)
The benchmark suite is configured to run against your local clone of
scikit-learn. Make sure it is up to date::

    git fetch upstream
I'm not sure I understand this sentence, and the need to git fetch:
According to the text below,
- one can select which branches to compare
- origin/master is used, so git fetch would have no effect
It was upstream/master in a previous commit. I must have made a bad copy-paste... I put it back.
doc/developers/contributing.rst
Outdated
    asv run --python=same

It's particularly useful when you installed scikit-learn in editable mode. By
Considering that this is the contributing guide, is there any use case where one would not have installed scikit-learn in editable mode?
You can run the benchmarks once in an env where you have it installed in editable mode, and then in an env where you have installed it from, say, conda, to compare the performance of MKL vs OpenBLAS.
Also, even though you have it installed in editable mode, by default asv will create an isolated env to run the benchmarks.
I think both modes are useful.
But is this really related to the editable mode?
What I understand from your comment is that this is useful when one wants to have full control over the dependencies, instead of relying on whatever asv will be installing by default?
doc/developers/contributing.rst
Outdated
@@ -762,6 +762,85 @@ To test code coverage, you need to install the `coverage

3. Loop.

Monitoring performance [*]_
I was a bit confused by the "ref" since it brings you to the bottom of the page, but the footnote isn't obvious.
Maybe we can just add the content of the footnote as the first sentence of this section.
I agree. I moved it to the top of the section
// If missing or the empty string, the tool will be automatically
// determined by looking for tools on the PATH environment
// variable.
"environment_type": "conda",
I don't think we should expect contributors to have conda
In the config file it's the default env. I added an explanation on how to use virtualenv instead
Should we leave it empty?

// If missing or the empty string, the tool will be automatically
// determined by looking for tools on the PATH environment
// variable.
asv_benchmarks/benchmarks/cluster.py
Outdated
from .utils import neg_mean_inertia


class KMeans_bench(Predictor, Transformer, Estimator, Benchmark):
Nit, but I think PEP 8 says this should be KMeansBench?
done. I went for KMeansBenchmark
@@ -0,0 +1,220 @@
import os
A few comments on the functions and classes of this file would help
Done. I hope it's clearer. Tell me if and where it needs more.
I added the possibility to change config params with env vars. They will have priority over those defined in config.json. It makes it easier to run the benchmark suite with non-default values in an automated setting.
So I tried to run asv continuous -b LogisticRegression upstream/master HEAD on this branch, but I'm getting the following error:

asv continuous -b LogisticRegression upstream/master HEAD
· Creating environments...
· Discovering benchmarks
·· Uninstalling from conda-py3.8-cython-joblib-numpy-scipy
·· Building 15342aa9 <add-asv-benchmarks> for conda-py3.8-cython-joblib-numpy-scipy........................................................
·· Installing 15342aa9 <add-asv-benchmarks> into conda-py3.8-cython-joblib-numpy-scipy.
·· Error running /home/rth/src/scikit-learn/asv_benchmarks/env/32ae2ab96bf2f01aaf986fddd3b291e8/bin/python /home/rth/miniconda3/envs/sklearn-dev/lib/python3.8/site-packages/asv/benchmark.py discover /home/rth/src/scikit-learn/asv_benchmarks/benchmarks /tmp/tmphz8wtj1_/result.json (exit status 1)
STDOUT -------->
STDERR -------->
Traceback (most recent call last):
File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.8/site-packages/asv/benchmark.py", line 1315, in <module>
main()
File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.8/site-packages/asv/benchmark.py", line 1308, in main
commands[mode](args)
File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.8/site-packages/asv/benchmark.py", line 1004, in main_discover
list_benchmarks(benchmark_dir, fp)
File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.8/site-packages/asv/benchmark.py", line 989, in list_benchmarks
for benchmark in disc_benchmarks(root):
File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.8/site-packages/asv/benchmark.py", line 896, in disc_benchmarks
benchmark = _get_benchmark(name, module, module_attr,
File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.8/site-packages/asv/benchmark.py", line 838, in _get_benchmark
instance = klass()
TypeError: Can't instantiate abstract class Estimator with abstract methods setup_cache_
The results will be hosted in this repo and will be visible here. (For now it only stores test results.)
Great! Looking forward to that!
profile = os.getenv('SKLBENCH_PROFILE', config['profile'])

n_jobs_vals_env = os.getenv('SKLBENCH_NJOBS')
Could we list all env variables used in the documentation section on these benchmarks?
In the doc section I mention the config file 'config.json'. In this file I describe the env vars. I don't want to expand much on these config vars in the doc section; it's for a very specific use case. If someone needs to change the config, all the info is in the config file.
OK, but what's the use case for env variables here? I imagine one would run these on a config file that can be customized, why would we additionally need env variables?
I added these env vars yesterday. It makes it easier to run the benchmark suite with non-default values in an automated setting.
Maybe in the docs we can just mention that it is possible to further configure the benchmark suite by setting env variables, which are described in asv_benchmarks/benchmarks/config.json.
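For the record, a minimal sketch of how such an override can work (mirroring the snippet quoted above; the key names other than 'profile' and the parsing details are assumptions, not necessarily the exact code in common.py):

    import os

    # Stand-in for the dict parsed from config.json (comments stripped).
    config = {'profile': 'regular', 'n_jobs_vals': [1]}

    # Environment variables, when set, take priority over config.json, which
    # makes it easy to run the suite with non-default values in an automated
    # setting.
    profile = os.getenv('SKLBENCH_PROFILE', config['profile'])

    n_jobs_vals_env = os.getenv('SKLBENCH_NJOBS')
    if n_jobs_vals_env:
        # e.g. SKLBENCH_NJOBS="1,4" -> [1, 4]
        n_jobs_vals = [int(j) for j in n_jobs_vals_env.split(',')]
    else:
        n_jobs_vals = config['n_jobs_vals']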
asv_benchmarks/benchmarks/common.py
Outdated
cache_path = os.path.join(current_path, 'cache')
if not os.path.exists(cache_path):
    os.mkdir(cache_path)
estimators_path = os.path.join(current_path, 'cache', 'estimators')
if not os.path.exists(estimators_path):
For new Python code doing a lot of path manipulations, using pathlib makes it much more pleasant to read/write, e.g. here:
Suggested change, using pathlib:

    from pathlib import Path

    current_path = Path(__file__).resolve().parent
    config_path = current_path / 'config.json'
    [...]
    cache_path = current_path / 'cache'
    if not cache_path.exists():
        os.mkdir(cache_path)
    estimators_path = current_path / 'cache' / 'estimators'
    if not estimators_path.exists():
https://treyhunner.com/2018/12/why-you-should-be-using-pathlib/
asv_benchmarks/benchmarks/common.py
Outdated
f_name = (benchmark.__class__.__name__[:-6]
          + '_data_' + '_'.join(list(map(str, params))) + '.pkl')
What's __name__[:-6]? For short benchmark names it will be an empty string.
Maybe re.sub('(?i)benchmark$', '', name) would be a bit more explicit?
All benchmarks were named <Estimator>_bench. I just used it to truncate the _bench part, so no risk of getting an empty string.
But I renamed them <Estimator>Benchmark and forgot to update that... I just leave the whole name now, it's simpler :)
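So the filename construction ends up something like the following sketch (the class and parameter values are made up for illustration):

    class KMeansBenchmark:
        pass

    benchmark = KMeansBenchmark()
    params = ('dense', 'full')

    # Full class name as the prefix, e.g. "KMeansBenchmark_data_dense_full.pkl"
    f_name = (benchmark.__class__.__name__
              + '_data_' + '_'.join(map(str, params)) + '.pkl')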
asv_benchmarks/benchmarks/common.py
Outdated
path = os.path.join(os.path.dirname(os.path.realpath(__file__)),
                    'cache', 'tmp')

list(map(os.remove, (os.path.join(path, f) for f in os.listdir(path))))
Suggested change:

    for fpath in os.listdir(path):
        os.remove(os.path.join(path, fpath))

Otherwise it will be annoying to debug if something doesn't work as expected.
asv_benchmarks/benchmarks/common.py
Outdated
with open(data_path, 'wb') as f:
    pickle.dump(data, f)
Serializing/cleaning datasets on disk for each estimator class sounds expensive/slow. Can't we rather have a global list of datasets that are lazy loaded and cached in memory when they are needed, then copied before use? Or are datasets too big to hold them all in memory at once?
Alternatively I'm OK with joblib.Memory for typically used datasets (as done in the following file), but then I don't understand why this method is necessary. That means 2 levels of cache, right?
Or are datasets too big to hold them all in memory at once?
Yes, some of them are quite big and are sometimes different for different combinations of parameters, so it would take a lot of RAM.
I reworked the whole part of dataset caching. There's only joblib.Memory caching now. It's much cleaner.
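A minimal sketch of what a single level of joblib.Memory caching can look like for a dataset loader (the dataset, sizes and helper name are illustrative, not the exact loaders in this PR):

    import os

    from joblib import Memory
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import train_test_split

    # Cache generated/fetched datasets next to the benchmark files so that
    # repeated runs don't pay the data-generation cost again.
    M = Memory(location=os.path.join(
        os.path.dirname(os.path.realpath(__file__)), 'cache'))

    @M.cache
    def _blobs_dataset(n_samples=100_000, n_features=100, centers=10):
        X, _ = make_blobs(n_samples=n_samples, n_features=n_features,
                          centers=centers, random_state=0)
        X_train, X_val = train_test_split(X, test_size=0.1, random_state=0)
        return X_train, X_val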
First of all you need to install the development version of asv::

    pip install git+https://github.com/airspeed-velocity/asv
Why the development version?
So I tried to run asv continuous -b LogisticRegression upstream/master HEAD on this branch, but I'm getting the following error
To not get this error :D
(The change that avoids benchmarking abstract classes is not released yet.)
For hosting the benchmark data and HTML report I would be in favor of re-purposing the old https://github.com/scikit-learn/scikit-learn-speed repo, which is now unmaintained.
@jeremiedbb I pushed an empty commit so that docs would build again (artifacts were empty) and it seems like the docs are broken now. Is this a random conda issue or could it come from the changes in this PR?
@NicolasHug merging master fixed it. I guess it was an old unrelated issue.
Made a last pass with some minor comments, but LGTM anyway. Thanks a lot @jeremiedbb, this will be nice to have.
In the contributing guide there are 2 occurrences of the term "benchmark" prior to the new section:
- "PRs should often substantiate the change, through benchmarks..."
- "Bonus points for contributions that include a performance analysis with a benchmark..."
I'd suggest adding links to the new "Monitoring performance" section (and also removing the part about the mailing list in the second occurrence, which I believe is outdated).
More information on how to write a benchmark and how to use asv can be found in
the `asv documentation <https://asv.readthedocs.io/en/latest/index.html>`_.
Should we finally describe what contributors should do once they have run some benchmarks? i.e. where to retrieve the results, and the preferred way to communicate them in the PR / issue?
How to retrieve the results is already described just above. I just added that one should report the results on the GitHub pull request. Is that what you had in mind?
It is related to editable mode in the sense that when you want to iterate quickly on your PR, you don't want to make a new environment and recompile all of scikit-learn each time. I added something to explain that.
I introduced these environment variables only to make things easier when running the benchmark suite in an automated setting. For the contributing guide I think it's better to only mention the config.json file (which does exactly the same thing as the env vars, and the env vars are described in it).
Merging, thanks a lot @jeremiedbb!
Yay! Thank you @jeremiedbb!
Fixes #16723
It's essentially the benchmark suite started here.
The main goal is to be able to easily ask for a benchmark when a PR might impact performance. The benchmark suite includes only a subset of the sklearn estimators, but we can add new ones later. Obviously, adding more estimators makes the whole run take longer, and having all estimators would take several hours.
Right now, the run takes an hour and a half on my laptop, with n_jobs=1, with an empty cache, with the default configuration. With the fastest config and some stuff cached it goes down to 20 min.
TODO:
ping @rth