Generic benchmarking/profiling tool · Issue #10289 · scikit-learn/scikit-learn · GitHub

Open
jnothman opened this issue Dec 12, 2017 · 32 comments

@jnothman (Member) commented Dec 12, 2017

We have not been proficient at documenting the estimated runtime or space complexity of our estimators and algorithms. Even were we to document asymptotic complexity functions, it would not give a realistic estimate for all parameter settings, etc. for a particular kind of data. Rather we could assist users in estimating complexity functions empirically.

I would like to see a function something like the following:

def benchmark_estimator_cost(est, X, y=None, fit_params=None,
                             vary_n_samples=True, vary_n_features=False,
                             n_fits=100, time_budget=300, profile_memory=True):
    """Profiles the cost of fitting est on samples of different size

    Parameters
    ----------
    est : estimator
    X : array-like
    y : array-like, optional
    fit_params : dict, optional
    vary_n_samples : bool, default=True
        Whether to benchmark for various random sample sizes.
    vary_n_features : bool, default=False
        Whether to benchmark for various random feature set sizes.
    n_fits : int, default=100
        Maximum number of fits to make while benchmarking.
    time_budget : int, default=300
        Maximum number of seconds to use overall.  Current fit will
        be stopped if the budget is exceeded.
    profile_memory : bool, default=True
        Whether to include memory (or just time) profiling. Memory
        profiling will slow down fitting, and hence make fit_time
        estimates more approximate.

    Returns
    -------
    results : dict
        The following keys are each mapped to an array:

        n_samples
            The number of samples
        n_features
            The number of features
        fit_time
            In seconds
        peak_memory
            The memory used at peak of fitting, in KiB.
        model_memory
            The memory in use at the end of fitting, minus that at the
            beginning, in KiB.

    models : dict
        keys 'peak_memory', 'model_memory' and 'fit_time' map to polynomial
        GP regressors whose input is n_samples and n_features and whose
        outputs are each of those targets.

    errors : list of dicts
        lists the parameters that resulted in exceptions
    """

This would run fit successively for different values of n_samples (logarithmically spaced, perhaps guided by a Gaussian process) to estimate the fitting-cost function within the budget. I have not thought extensively about exactly which sampling strategy should be followed. If this is implemented for the library, we would consider it experimental and the algorithm subject to change for a little while.
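For concreteness, here is how a user might call the proposed helper. Everything below is hypothetical (neither benchmark_estimator_cost nor its return values exist in scikit-learn); it only illustrates the intended API:

from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=20000, n_features=50, random_state=0)

# Benchmark SVC fits on growing subsamples, within a 60 second budget.
results, models, errors = benchmark_estimator_cost(
    SVC(), X, y, vary_n_samples=True, n_fits=20, time_budget=60)

# The fitted GP models could then extrapolate the cost to a larger problem,
# e.g. 100000 samples with the same 50 features.
predicted_fit_time = models['fit_time'].predict([[100000, 50]])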

What do others think?

@agramfort (Member) commented Dec 12, 2017 via email

@rth (Member) commented Dec 12, 2017

Even were we to document asymptotic complexity functions, it would not give a realistic estimate for all parameter settings, etc. for a particular kind of data. Rather we could assist users in estimating complexity functions empirically.

Definitely agree with that.

I think there are cases where fit_transform, predict, etc. could also be worth benchmarking. Maybe replace fit_params with method_params and add a method="fit" argument?

This would run fit successively for different values of n_samples (logarithmically spaced, perhaps guided by a gaussian process) to estimate the function for fitting complexity, within budget.

That would also depend on whether the user is interested in the asymptotic scaling or the scaling in some particular range of parameters.

Memory profiling would require the additional optional dependency memory_profiler (which pulls in psutil), wouldn't it?
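A minimal sketch of how the method/method_params variant suggested above might dispatch to something other than fit; the helper name and the argument handling are illustrative only, not a settled API:

def run_method(est, X, y=None, method="fit", method_params=None):
    """Call the requested estimator method for benchmarking purposes."""
    method_params = method_params or {}
    func = getattr(est, method)
    if method in ("fit", "fit_transform", "fit_predict"):
        # training-style methods accept the target as well
        return func(X, y, **method_params)
    # predict/transform-style methods assume est has already been fitted
    return func(X, **method_params)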

@jnothman (Member, Author) commented Dec 13, 2017 via email

@vrishank97 (Contributor)

Can I try working on this feature?

@jnothman (Member, Author) commented Dec 14, 2017 via email

@vrishank97 (Contributor) commented Dec 15, 2017

We have to call fit with the parameters in fit_params and measure the time each iteration takes (if the time exceeds time_budget, the current fit will be stopped). This has to be done for different values of n_samples if vary_n_samples=True. If profile_memory=True, we will use memory_profiler to track the memory used (to estimate space complexity). All of this has to be done for up to n_fits fits.

How do we plan to vary the number of features?

Please let me know if there's anything I'm missing out on.
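A rough sketch of that loop, assuming memory_profiler is installed; the overall budget is only checked between fits here, since interrupting a fit that is already running is much harder (the function name and the head-slice subsampling are illustrative):

import time
import numpy as np
from memory_profiler import memory_usage   # optional dependency, pulls in psutil
from sklearn.base import clone

def benchmark_fits(est, X, y, n_fits=10, time_budget=300):
    """Time est.fit on geometrically growing subsamples until the budget runs out."""
    results = []
    start = time.perf_counter()
    for n_samples in np.geomspace(10, X.shape[0], n_fits, dtype=int):
        if time.perf_counter() - start > time_budget:
            break  # overall budget exhausted, stop scheduling further fits
        sub_est = clone(est)
        t0 = time.perf_counter()
        # memory_usage runs the call and reports the peak memory (MiB) observed
        peak = memory_usage((sub_est.fit, (X[:n_samples], y[:n_samples])),
                            max_usage=True)
        results.append({'n_samples': int(n_samples),
                        'fit_time': time.perf_counter() - t0,
                        'peak_memory': peak})
    return results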

@jnothman (Member, Author) commented Dec 16, 2017 via email

@vrishank97 (Contributor) commented Dec 17, 2017

We can use np.logspace to vary n_samples, and we can also let users use linearly spaced n_samples for smaller datasets. After fitting for the different values of n_samples, we can feed fit_time, peak_memory and model_memory to a GP regressor (and let users choose the kernel). We can also provide an option for KFold cross-validation at each value of n_samples, at the cost of a smaller maximum possible value for n_samples, to give higher accuracy.
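A small sketch of the GP idea, fitting one of the targets (fit_time) against n_samples in log-log space; the measurement values are made up and the kernel choice is just an example:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# hypothetical measurements collected by the benchmarking loop
n_samples = np.array([100, 316, 1000, 3162, 10000], dtype=float)
fit_time = np.array([0.03, 0.05, 0.12, 0.34, 1.41])

# in log-log space a polynomial cost curve becomes roughly linear,
# which is much easier for the GP to model
X_log = np.log10(n_samples).reshape(-1, 1)
y_log = np.log10(fit_time)

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X_log, y_log)

# extrapolate to an unseen problem size, with an uncertainty estimate
pred, std = gp.predict(np.log10([[50000]]), return_std=True)
print(10 ** pred, std)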

@jnothman (Member, Author) commented Dec 17, 2017 via email

@vrishank97 (Contributor) commented Dec 18, 2017

Thanks a lot. I've been wanting to work on something like this for quite some time now.

If we use logspace, n_fits becomes a more meaningful parameter, as it then becomes the number of distinct n_samples values generated. It might also be more intuitive for users, as it is related to the accuracy of the models generated here. Thoughts?

@jnothman (Member, Author) commented Dec 18, 2017 via email

@amueller (Member)

Should we use OpenML, for example OpenML 100? It is only classification right now, but it covers n_features and n_samples pretty well (the distribution looks roughly uniform on a log scale).
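For reference, fetching a single OpenML dataset with today's scikit-learn looks like the following; the dataset name is just an illustration, and fetch_openml was added to scikit-learn after this comment was written:

from sklearn.datasets import fetch_openml

# download one OpenML classification dataset to benchmark against real data
X, y = fetch_openml("phoneme", version=1, return_X_y=True, as_frame=False)
print(X.shape)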

@amueller (Member)

Or did you want to use synthetic data?

@amueller (Member)

This really sounds like a relatively sophisticated AutoML problem...

@jnothman (Member, Author) commented Dec 18, 2017 via email

@amueller (Member)

Well, trying to estimate runtime from hyperparameters is also pretty typical in AutoML. And there is work on extrapolating learning curves for neural nets, for example.

@jnothman (Member, Author) commented Dec 18, 2017 via email

@amueller (Member)

ah, sorry, didn't read correctly. Still related but easier ;)

@vrishank97 (Contributor)

How about we first fit on all samples, then half, then a fourth and three fourths, and so on? That way, even with just a few iterations, our model can be fed data for a varied range of n_samples values.
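A tiny sketch of that schedule as a generator; the fractions come out in the order described (full, half, quarter, three quarters, eighths, ...):

from itertools import islice

def sample_fractions():
    """Yield fractions of the full dataset in bisection order."""
    yield 1.0
    denom = 2
    while True:
        for num in range(1, denom, 2):   # odd numerators only, to avoid repeats
            yield num / denom
        denom *= 2

print(list(islice(sample_fractions(), 7)))
# [1.0, 0.5, 0.25, 0.75, 0.125, 0.375, 0.625]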

@jnothman (Member, Author) commented Dec 20, 2017 via email

@vrishank97 (Contributor)

I have started some work on this. Should I put in a PR with n_samples starting at 8 and doubling for each fit?

@vrishank97 (Contributor)

Or we could let users select a base and a multiplier?
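That option could be as simple as a geometric schedule with a user-chosen base and multiplier, for example:

def sample_sizes(n_total, base=8, multiplier=2):
    """Yield geometrically growing n_samples values, capped at the dataset size."""
    n = base
    while n < n_total:
        yield n
        n *= multiplier
    yield n_total   # always finish with the full dataset

print(list(sample_sizes(1000)))
# [8, 16, 32, 64, 128, 256, 512, 1000]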

@jnothman (Member, Author) commented Dec 21, 2017 via email

@vrishank97 (Contributor) commented Dec 21, 2017

Agreed. Is the final goal a script or something similar to an estimator?

@jnothman (Member, Author) commented Dec 21, 2017 via email

@vrishank97 (Contributor)

Cool. Where should I put the function in the repo?

@amueller (Member)

start with n_classes * 2 for classifiers? ;)

@jnothman (Member, Author) commented Dec 21, 2017 via email

@rth (Member) commented Mar 4, 2018

There are also some low-level concerns that need to be addressed for this to work reliably, IMO, in particular the finite resolution of the timer: on Windows, time.time has a resolution of about 16 ms; time.clock is better, but it was deprecated in Python 3.3 in favor of time.perf_counter. There was some earlier discussion about this in #2844.
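One common way around the coarse timer resolution is to repeat the call until a minimum amount of time has elapsed, similar to what timeit's autorange does; a minimal sketch:

import time

def time_call(func, min_duration=0.2):
    """Repeat func until at least min_duration seconds have elapsed, so the
    measurement is well above the timer resolution; return mean time per call."""
    n_calls = 0
    start = time.perf_counter()
    while True:
        func()
        n_calls += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_duration:
            return elapsed / n_calls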

In a somewhat orthogonal direction to this issue, I have been experimenting with benchmarking lately in the neurtu repo, and I am wondering whether an API similar to joblib.Parallel could work for something like this: basically, a benchmark function that accepts an iterable of delayed calculations, as in the example below,

import numpy as np
from sklearn.cluster import KMeans

from neurtu import delayed, timeit

rng = np.random.RandomState(42)

n_samples_max, n_features = 10000, 10

timeit(delayed(KMeans, tags={'n_samples': n_samples})(n_clusters=8)
                   .fit(rng.rand(n_samples, n_features))
       for n_samples in np.geomspace(100, n_samples_max, 5, dtype='int'))

which here produces a DataFrame with

   n_samples  wall_time_max  wall_time_mean  wall_time_min  wall_time_std
0        100       0.034289        0.032714       0.031805       0.001118
1        316       0.054796        0.053576       0.051900       0.001226
2       1000       0.129308        0.119423       0.107751       0.008891
3       3162       0.387829        0.344845       0.303622       0.034400
4      10000       1.486790        1.414572       1.340358       0.059797

that can then be sent to a GP regressor or just visualized.

The advantage of such an approach is that it can be used to benchmark and compare anything the user might be interested in: n_samples, n_features, some parameters (e.g. n_jobs, solver) or even different estimators.

Here is a more complete example that includes runtime and peak memory usage of LogisticRegression for different n_samples and solver parameters.
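The linked example is not reproduced here; the sketch below approximates the same idea with plain time.perf_counter and memory_profiler rather than neurtu's own helpers (the sizes, solvers and synthetic data are made up):

import time
import numpy as np
from memory_profiler import memory_usage   # optional dependency
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(42)
n_features = 50

results = []
for n_samples in np.geomspace(1000, 30000, 3, dtype=int):
    X = rng.rand(n_samples, n_features)
    y = rng.randint(2, size=n_samples)
    for solver in ['lbfgs', 'liblinear', 'saga']:
        clf = LogisticRegression(solver=solver)
        t0 = time.perf_counter()
        # run the fit under the memory profiler and record the peak usage (MiB);
        # note that the polling itself adds some overhead to the wall time
        peak = memory_usage((clf.fit, (X, y)), max_usage=True, interval=0.01)
        results.append({'n_samples': int(n_samples), 'solver': solver,
                        'wall_time': time.perf_counter() - t0,
                        'peak_memory_MiB': peak})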

@cmarmo (Contributor) commented Aug 24, 2020

@jeremiedbb, @jnothman has #17026 solved this issue? Thanks!

@jnothman (Member, Author)

I'm not sure how well #17026 solves the need of a user estimating how well an algorithm will scale on their specific data. If it does, a tutorial would be beneficial!

A contributor commented:

@jnothman #17026 is an implementation of a benchmarking tool for the sample datasets we use in the sklearn examples; it doesn't exactly cover the use case that was in mind for this profiling tool, which was intended to model the change in performance of estimators as their hyperparameters change.
