MNT Add asv benchmark suite #17026
Conversation
Thanks @jeremiedbb! I'll try to give it a proper review in the next few days.
// `asv` will cache results of the recent builds in each
// environment, making them faster to install next time. This is
// the number of builds to keep, per environment.
// "build_cache_size": 2,
It's funny they went with JSON, which specifically doesn't allow comments in the spec.
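As a side note, this means whatever reads config.json has to strip those comments before parsing. A minimal sketch of that idea (the helper name and the call at the end are illustrative, not necessarily what this PR does):

    import json

    def load_commented_json(path):
        # Drop lines whose first non-whitespace characters are "//" so that
        # what remains is valid JSON. This sketch only handles full-line
        # comments; inline trailing comments would need more work.
        with open(path) as f:
            lines = [line for line in f if not line.lstrip().startswith("//")]
        return json.loads("".join(lines))

    # e.g. config = load_commented_json("asv_benchmarks/benchmarks/config.json")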
@property
@classmethod
@abstractmethod
def params(self):
Are all 3 necessary? How can it even work? I would have imagined property and classmethod have conflicting signatures.
Apparently it works :) but you're right, only 2 are necessary. I changed that.
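For reference, a minimal sketch of the two-decorator version (class names and parameter values are illustrative):

    from abc import ABC, abstractmethod

    class Benchmark(ABC):
        @property
        @abstractmethod
        def params(self):
            """Force concrete benchmarks to define a parameter grid."""

    class KMeansBenchmark(Benchmark):
        # A plain class attribute is enough to satisfy the abstract property,
        # so KMeansBenchmark() instantiates fine while Benchmark() still
        # raises TypeError.
        params = (['dense', 'sparse'], ['full', 'elkan'])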
from sklearn.model_selection import train_test_split

# memory location for caching datasets
M = Memory(location=os.path.dirname(os.path.realpath(__file__)) + "/cache")
Can't we reuse the scikit-learn datasets dir (maybe with a specific "asv-cache" folder there)?
I use the cache folder for the datasets but also for temporary stuff. I think it makes more sense to keep everything in the same place, and also to keep everything related to the benchmarks in the benchmarks folder.
Note for later:
Sounds fun! Would it run on master and on the PR, then display the results? Since the benchmark is running on the same instance, the results should be more or less comparable.
yes
Sorry, I don't understand this part.
@ogrisel I changed the order to put the command to run on a specific module first. Let me know what you think.
This looks great. Here are some suggestions to reorganize the order of the doc a bit, to show the most common usage first.
This looks great. +1 on my side.
asv_benchmarks/benchmarks/cluster.py
Outdated
Benchmarks for KMeans.
"""

param_names = ['representation', 'algorithm', 'n_jobs']
n_jobs should be removed. It's a bit worrying that the inconsistency between this and params is not picked up anywhere.
Yes, it's weird that asv does not check length consistency of params and param_names.
good catch btw :)
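To illustrate the issue: asv pairs params and param_names purely by position and does not check that their lengths match, so a stale entry like 'n_jobs' silently mislabels the results. A hypothetical, consistent version:

    class KMeansBenchmark:
        # One name per axis in params; asv won't warn if these drift apart,
        # so they have to be kept consistent by hand.
        param_names = ['representation', 'algorithm']
        params = (['dense', 'sparse'], ['full', 'elkan'])

        def time_fit(self, representation, algorithm):
            # asv passes one value per parameter axis to each benchmark method.
            pass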
Thanks @jeremiedbb
Mostly looks good but I'm concerned about the dependency on conda. I think we shouldn't expect that.
Do we have a way to clarify that everything benchmark-related is private? (We can never be too careful...)
The benchmark suite is configured to run against your local clone of
scikit-learn. Make sure it is up to date::

    git fetch upstream
I'm not sure I understand this sentence, and the need to git fetch:
According to the text below,
- one can select which branches to compare
- origin/master is used, so git fetch would have no effect
It was upstream/master in a previous commit. I must have made a bad copy-paste... I put it back.
doc/developers/contributing.rst
Outdated
    asv run --python=same

It's particularly useful when you installed scikit-learn in editable mode. By
Considering that this is the contributing guide, is there any use case where one would not have installed scikit-learn in editable mode?
You can run the benchmarks once in an env where you have it installed in editable mode, and then in an env where you have installed it from, say, conda, to compare the performance of MKL vs OpenBLAS.
Also, even though you have it installed in editable mode, by default asv will create an isolated env to run the benchmarks.
I think both modes are useful.
But is this really related to the editable mode?
What I understand from your comment is that this is useful when one wants to have full control over the dependencies, instead of relying on whatever asv will be installing by default?
doc/developers/contributing.rst
Outdated
@@ -762,6 +762,85 @@ To test code coverage, you need to install the `coverage

3. Loop.

Monitoring performance [*]_
I was a bit confused by the "ref" since it brings you to the bottom of the page, but the footnote isn't obvious.
Maybe we can just add the content of the footnote as the first sentence of this section.
I agree. I moved it to the top of the section
// If missing or the empty string, the tool will be automatically
// determined by looking for tools on the PATH environment
// variable.
"environment_type": "conda",
I don't think we should expect contributors to have conda
In the config file it's the default env. I added an explanation on how to use virtualenv instead
Should we leave it empty?

// If missing or the empty string, the tool will be automatically
// determined by looking for tools on the PATH environment
// variable.
asv_benchmarks/benchmarks/cluster.py
Outdated
from .utils import neg_mean_inertia


class KMeans_bench(Predictor, Transformer, Estimator, Benchmark):
Nit, but I think PEP 8 says this should be KMeansBench?
done. I went for KMeansBenchmark
@@ -0,0 +1,220 @@
import os
A few comments on the functions and classes of this file would help
Done. I hope it's clearer. Tell me if and where it needs more.
I added the possibility to change config params with env vars. They will have priority over those defined in config.json. It makes it easier to run the benchmark suite with non-default values in an automated setting.
So I tried to run asv continuous -b LogisticRegression upstream/master HEAD on this branch, but I'm getting the following error:

asv continuous -b LogisticRegression upstream/master HEAD
· Creating environments...
· Discovering benchmarks
·· Uninstalling from conda-py3.8-cython-joblib-numpy-scipy
·· Building 15342aa9 <add-asv-benchmarks> for conda-py3.8-cython-joblib-numpy-scipy........................................................
·· Installing 15342aa9 <add-asv-benchmarks> into conda-py3.8-cython-joblib-numpy-scipy.
·· Error running /home/rth/src/scikit-learn/asv_benchmarks/env/32ae2ab96bf2f01aaf986fddd3b291e8/bin/python /home/rth/miniconda3/envs/sklearn-dev/lib/python3.8/site-packages/asv/benchmark.py discover /home/rth/src/scikit-learn/asv_benchmarks/benchmarks /tmp/tmphz8wtj1_/result.json (exit status 1)
STDOUT -------->
STDERR -------->
Traceback (most recent call last):
File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.8/site-packages/asv/benchmark.py", line 1315, in <module>
main()
File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.8/site-packages/asv/benchmark.py", line 1308, in main
commands[mode](args)
File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.8/site-packages/asv/benchmark.py", line 1004, in main_discover
list_benchmarks(benchmark_dir, fp)
File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.8/site-packages/asv/benchmark.py", line 989, in list_benchmarks
for benchmark in disc_benchmarks(root):
File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.8/site-packages/asv/benchmark.py", line 896, in disc_benchmarks
benchmark = _get_benchmark(name, module, module_attr,
File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.8/site-packages/asv/benchmark.py", line 838, in _get_benchmark
instance = klass()
TypeError: Can't instantiate abstract class Estimator with abstract methods setup_cache_
The results will be hosted in this repo and will be visible here. (For now it only stores test results.)
Great! Looking forward to that!
profile = os.getenv('SKLBENCH_PROFILE', config['profile'])

n_jobs_vals_env = os.getenv('SKLBENCH_NJOBS')
Could we list all env variables used in the documentation section on these benchmarks?
In the doc section I mention the config file 'config.json'. In this file I describe the env vars. I don't want to expand much on these config vars in the doc section; it's for a very specific use case. If someone needs to change the config, all the info is in the config file.
OK, but what's the use case for env variables here? I imagine one would run these on a config file that can be customized, why would we additionally need env variables?
I added these env vars yesterday. It makes it easier to run the benchmark suite with non-default values in an automated setting.
Maybe in the docs we can just mention that it is possible to further configure the benchmark suite by setting env variables, which are described in asv_benchmarks/benchmarks/config.json.
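For the record, a minimal sketch of how such an override can work (mirroring the snippet quoted above; the key names other than 'profile' and the parsing details are assumptions, not necessarily the exact code in common.py):

    import os

    # Stand-in for the dict parsed from config.json (comments stripped).
    config = {'profile': 'regular', 'n_jobs_vals': [1]}

    # Environment variables, when set, take priority over config.json, which
    # makes it easy to run the suite with non-default values in an automated
    # setting.
    profile = os.getenv('SKLBENCH_PROFILE', config['profile'])

    n_jobs_vals_env = os.getenv('SKLBENCH_NJOBS')
    if n_jobs_vals_env:
        # e.g. SKLBENCH_NJOBS="1,4" -> [1, 4]
        n_jobs_vals = [int(j) for j in n_jobs_vals_env.split(',')]
    else:
        n_jobs_vals = config['n_jobs_vals']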
asv_benchmarks/benchmarks/common.py
Outdated
cache_path = os.path.join(current_path, 'cache')
if not os.path.exists(cache_path):
    os.mkdir(cache_path)
estimators_path = os.path.join(current_path, 'cache', 'estimators')
if not os.path.exists(estimators_path):
For new Python code doing a lot of path manipulations, using pathlib makes it much more pleasant to read/write, e.g. here:
Suggested change, using pathlib:

    from pathlib import Path

    current_path = Path(__file__).resolve().parent
    config_path = current_path / 'config.json'
    [...]
    cache_path = current_path / 'cache'
    if not cache_path.exists():
        os.mkdir(cache_path)
    estimators_path = current_path / 'cache' / 'estimators'
    if not estimators_path.exists():
https://treyhunner.com/2018/12/why-you-should-be-using-pathlib/
asv_benchmarks/benchmarks/common.py
Outdated
f_name = (benchmark.__class__.__name__[:-6]
          + '_data_' + '_'.join(list(map(str, params))) + '.pkl')
What's __name__[:-6]? For short benchmark names it will be an empty string.
Maybe re.sub('(?i)benchmark$', '', name) would be a bit more explicit?
All benchmarks were named <Estimator>_bench. I just used it to truncate the _bench part, so no risk of getting an empty string.
But I renamed them <Estimator>Benchmark and forgot to update that... I just leave the whole name now, it's simpler :)
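So the filename construction ends up something like the following sketch (the class and parameter values are made up for illustration):

    class KMeansBenchmark:
        pass

    benchmark = KMeansBenchmark()
    params = ('dense', 'full')

    # Full class name as the prefix, e.g. "KMeansBenchmark_data_dense_full.pkl"
    f_name = (benchmark.__class__.__name__
              + '_data_' + '_'.join(map(str, params)) + '.pkl')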
asv_benchmarks/benchmarks/common.py
Outdated
path = os.path.join(os.path.dirname(os.path.realpath(__file__)),
                    'cache', 'tmp')

list(map(os.remove, (os.path.join(path, f) for f in os.listdir(path))))
Suggested change:

    for fpath in os.listdir(path):
        os.remove(os.path.join(path, fpath))

Otherwise it will be annoying to debug if something doesn't work as expected.
asv_benchmarks/benchmarks/common.py
Outdated
with open(data_path, 'wb') as f:
    pickle.dump(data, f)
Serializing/cleaning datasets on disk for each estimator class sounds expensive/slow. Can't we rather have a global list of datasets that are lazy loaded and cached in memory when they are needed, then copied before use? Or are datasets too big to hold them all in memory at once?
Alternatively I'm OK with joblib.Memory for typically used datasets (as done in the following file), but then I don't understand why this method is necessary. That means 2 levels of cache, right?
Or are datasets too big to hold them all in memory at once?
Yes, some of them are quite big and are sometimes different for different combinations of parameters, so it would take a lot of RAM.
I reworked the whole part of dataset caching. There's only joblib.Memory caching now. It's much cleaner.
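A minimal sketch of what a single level of joblib.Memory caching can look like for a dataset loader (the dataset, sizes and helper name are illustrative, not the exact loaders in this PR):

    import os

    from joblib import Memory
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import train_test_split

    # Cache generated/fetched datasets next to the benchmark files so that
    # repeated runs don't pay the data-generation cost again.
    M = Memory(location=os.path.join(
        os.path.dirname(os.path.realpath(__file__)), 'cache'))

    @M.cache
    def _blobs_dataset(n_samples=100_000, n_features=100, centers=10):
        X, _ = make_blobs(n_samples=n_samples, n_features=n_features,
                          centers=centers, random_state=0)
        X_train, X_val = train_test_split(X, test_size=0.1, random_state=0)
        return X_train, X_val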
First of all you need to install the development version of asv::

    pip install git+https://github.com/airspeed-velocity/asv
Why the development version?
So I tried to run asv continuous -b LogisticRegression upstream/master HEAD on this branch, but I'm getting the following error
To not get this error :D
(The change that avoids benchmarking abstract classes is not released yet.)
For hosting the benchmark data and HTML report I would be in favor of re-purposing the old https://github.com/scikit-learn/scikit-learn-speed repo, which is now unmaintained.
@jeremiedbb I pushed an empty commit so that docs would build again (artifacts were empty) and it seems like the docs are broken now. Is this a random conda issue or could it come from the changes in this PR?
@NicolasHug merging master fixed it. I guess it was an old unrelated issue.
Made a last pass with some minor comments, but LGTM anyway. Thanks a lot @jeremiedbb, this will be nice to have.
In the contributing guide there are 2 occurrences of the term "benchmark" prior to the new section:
- "PRs should often substantiate the change, through benchmarks..."
- "Bonus points for contributions that include a performance analysis with a benchmark..."
I'd suggest adding links to the new "Monitoring performance" section (and also removing the part about the mailing list in the second occurrence, which I believe is outdated).
More information on how to write a benchmark and how to use asv can be found in
the `asv documentation <https://asv.readthedocs.io/en/latest/index.html>`_.
Should we finally describe what contributors should do once they have run some benchmarks? i.e. where to retrieve the results, and the preferred way to communicate them in the PR / issue?
How to retrieve the results is already described just above. I just added that one should report the results on the GitHub pull request. Is that what you had in mind?
It is related to editable mode in the sense that when you want to iterate quickly on your PR, you don't want to make a new environment and recompile all of scikit-learn each time. I added something to explain that.
I introduced these environment variables only to make things easier when running the benchmark suite in an automated setting. For the contributing guide I think it's better to only mention the config.json file (which does exactly the same thing as the env vars, and the env vars are described in it).
Merging, thanks a lot @jeremiedbb!
Yay! Thank you @jeremiedbb!
Fixes #16723
It's essentially the benchmark suite started here.
The main goal is to be able to easily ask for a benchmark when a PR might impact performance. The benchmark suite includes only a subset of the sklearn estimators, but we can add new ones later. Obviously, adding more estimators makes the whole run take longer, and having all estimators would take several hours.
Right now, the run takes an hour and a half on my laptop, with n_jobs=1, with an empty cache, with the default configuration. With the fastest config and some stuff cached it goes down to 20 min.
TODO:
ping @rth