Generic benchmarking/profiling tool · Issue #10289 · scikit-learn/scikit-learn · GitHub

Open
jnothman opened this issue Dec 12, 2017 · 32 comments

@jnothman (Member) commented Dec 12, 2017

We have not been proficient at documenting the estimated runtime or space complexity of our estimators and algorithms. Even were we to document asymptotic complexity functions, it would not give a realistic estimate for all parameter settings, etc. for a particular kind of data. Rather we could assist users in estimating complexity functions empirically.

I would like to see a function something like the following:

def benchmark_estimator_cost(est, X, y=None, fit_params=None,
                             vary_n_samples=True, vary_n_features=False,
                             n_fits=100, time_budget=300, profile_memory=True):
    """Profiles the cost of fitting est on samples of different size

    Parameters
    ----------
    est : estimator
    X : array-like
    y : array-like, optional
    fit_params : dict, optional
    vary_n_samples : bool, default=True
        Whether to benchmark for various random sample sizes.
    vary_n_features : bool, default=False
        Whether to benchmark for various random feature set sizes.
    n_fits : int, default=100
        Maximum number of fits to make while benchmarking.
    time_budget : int, default=300
        Maximum number of seconds to use overall.  Current fit will
        be stopped if the budget is exceeded.
    profile_memory : bool, default=True
        Whether to include memory (or just time) profiling. Memory
        profiling will slow down fitting, and hence make fit_time
        estimates more approximate.

    Returns
    -------
    results : dict
        The following keys are each mapped to an array:

        n_samples
            The number of samples
        n_features
            The number of features
        fit_time
            In seconds
        peak_memory
            The memory used at peak of fitting, in KiB.
        model_memory
            The memory in use at the end of fitting, minus that at the
            beginning, in KiB.

    models : dict
        keys 'peak_memory', 'model_memory' and 'fit_time' map to polynomial
        GP regressors whose input is n_samples and n_features and whose
        outputs are each of those targets.

    errors : list of dicts
        lists the parameters that resulted in exceptions
    """

This would run fit successively for different values of n_samples (logarithmically spaced, perhaps guided by a Gaussian process) to estimate the fitting-cost function within the budget. I have not thought extensively about exactly which sampling strategy should be followed. If this is implemented for the library, we would consider it experimental and the algorithm subject to change for a little while.
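For concreteness, here is how a user might call the proposed helper. Everything below is hypothetical (neither benchmark_estimator_cost nor its return values exist in scikit-learn); it only illustrates the intended API:

from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=20000, n_features=50, random_state=0)

# Benchmark SVC fits on growing subsamples, within a 60 second budget.
results, models, errors = benchmark_estimator_cost(
    SVC(), X, y, vary_n_samples=True, n_fits=20, time_budget=60)

# The fitted GP models could then extrapolate the cost to a larger problem,
# e.g. 100000 samples with the same 50 features.
predicted_fit_time = models['fit_time'].predict([[100000, 50]])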

What do others think?

@agramfort (Member) commented Dec 12, 2017 via email

@rth (Member) commented Dec 12, 2017

Even were we to document asymptotic complexity functions, it would not give a realistic estimate for all parameter settings, etc. for a particular kind of data. Rather we could assist users in estimating complexity functions empirically.

Definitely agree with that.

I think there are cases where fit_transform, predict, etc. could also be worth benchmarking. Maybe replace fit_params with method_params and add a method="fit" argument?

This would run fit successively for different values of n_samples (logarithmically spaced, perhaps guided by a gaussian process) to estimate the function for fitting complexity, within budget.

That would also depend on whether the user is interested in the asymptotic scaling or the scaling in some particular range of parameters.

Memory profiling would require the additional optional dependency memory_profiler (which pulls in psutil), wouldn't it?
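A minimal sketch of how the method/method_params variant suggested above might dispatch to something other than fit; the helper name and the argument handling are illustrative only, not a settled API:

def run_method(est, X, y=None, method="fit", method_params=None):
    """Call the requested estimator method for benchmarking purposes."""
    method_params = method_params or {}
    func = getattr(est, method)
    if method in ("fit", "fit_transform", "fit_predict"):
        # training-style methods accept the target as well
        return func(X, y, **method_params)
    # predict/transform-style methods assume est has already been fitted
    return func(X, **method_params)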

@jnothman (Member, Author) commented Dec 13, 2017 via email

@vrishank97 (Contributor)

Can I try working on this feature?

@jnothman (Member, Author) commented Dec 14, 2017 via email

@vrishank97 (Contributor) commented Dec 15, 2017

We have to call fit with the parameters in fit_params and measure the time each iteration takes (if the time exceeds time_budget, the current fit will be stopped). This has to be done for different values of n_samples if vary_n_samples=True. If profile_memory=True, we will use memory_profiler to track the memory used (to estimate space complexity). All of this has to be done for up to n_fits fits.

How do we plan to vary the number of features?

Please let me know if there's anything I'm missing out on.
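A rough sketch of that loop, assuming memory_profiler is installed; the overall budget is only checked between fits here, since interrupting a fit that is already running is much harder (the function name and the head-slice subsampling are illustrative):

import time
import numpy as np
from memory_profiler import memory_usage   # optional dependency, pulls in psutil
from sklearn.base import clone

def benchmark_fits(est, X, y, n_fits=10, time_budget=300):
    """Time est.fit on geometrically growing subsamples until the budget runs out."""
    results = []
    start = time.perf_counter()
    for n_samples in np.geomspace(10, X.shape[0], n_fits, dtype=int):
        if time.perf_counter() - start > time_budget:
            break  # overall budget exhausted, stop scheduling further fits
        sub_est = clone(est)
        t0 = time.perf_counter()
        # memory_usage runs the call and reports the peak memory (MiB) observed
        peak = memory_usage((sub_est.fit, (X[:n_samples], y[:n_samples])),
                            max_usage=True)
        results.append({'n_samples': int(n_samples),
                        'fit_time': time.perf_counter() - t0,
                        'peak_memory': peak})
    return results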

@jnothman (Member, Author) commented Dec 16, 2017 via email

@vrishank97 (Contributor) commented Dec 17, 2017

We can use np.logspace to vary n_samples, and we can also let users use linearly spaced n_samples for smaller datasets. After fitting for the different values of n_samples, we can feed fit_time, peak_memory and model_memory to a GP regressor (and let users choose the kernel). We can also provide an option for KFold cross-validation at each value of n_samples, at the cost of a smaller maximum possible value for n_samples, to give higher accuracy.
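A small sketch of the GP idea, fitting one of the targets (fit_time) against n_samples in log-log space; the measurement values are made up and the kernel choice is just an example:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# hypothetical measurements collected by the benchmarking loop
n_samples = np.array([100, 316, 1000, 3162, 10000], dtype=float)
fit_time = np.array([0.03, 0.05, 0.12, 0.34, 1.41])

# in log-log space a polynomial cost curve becomes roughly linear,
# which is much easier for the GP to model
X_log = np.log10(n_samples).reshape(-1, 1)
y_log = np.log10(fit_time)

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X_log, y_log)

# extrapolate to an unseen problem size, with an uncertainty estimate
pred, std = gp.predict(np.log10([[50000]]), return_std=True)
print(10 ** pred, std)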

@jnothman (Member, Author) commented Dec 17, 2017 via email

@vrishank97 (Contributor) commented Dec 18, 2017

Thanks a lot. I've been wanting to work on something like this for quite some time now.

If we use logspace, n_fits becomes a more meaningful parameter, as it then becomes the number of distinct n_samples values generated. It might also be more intuitive for users, as it is related to the accuracy of the models generated here. Thoughts?

@jnothman (Member, Author) commented Dec 18, 2017 via email

@amueller (Member)

Should we use OpenML, for example OpenML 100? It is only classification right now, but it covers n_features and n_samples pretty well (the distribution looks roughly uniform on a log scale).
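For reference, fetching a single OpenML dataset with today's scikit-learn looks like the following; the dataset name is just an illustration, and fetch_openml was added to scikit-learn after this comment was written:

from sklearn.datasets import fetch_openml

# download one OpenML classification dataset to benchmark against real data
X, y = fetch_openml("phoneme", version=1, return_X_y=True, as_frame=False)
print(X.shape)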

@amueller (Member)

Or did you want to use synthetic data?

@amueller (Member)

This really sounds like a relatively sophisticated AutoML problem...

@jnothman (Member, Author) commented Dec 18, 2017 via email

@amueller (Member)

Well, trying to estimate runtime from hyperparameters is also pretty typical in AutoML. And there is work on extrapolating learning curves for neural nets, for example.

@jnothman (Member, Author) commented Dec 18, 2017 via email

@amueller (Member)

ah, sorry, didn't read correctly. Still related but easier ;)

@vrishank97 (Contributor)

How about we first fit on all samples, then half, then a fourth and three fourths, and so on? That way, even with just a few iterations, our model can be fed data for a varied range of n_samples values.
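A tiny sketch of that schedule as a generator; the fractions come out in the order described (full, half, quarter, three quarters, eighths, ...):

from itertools import islice

def sample_fractions():
    """Yield fractions of the full dataset in bisection order."""
    yield 1.0
    denom = 2
    while True:
        for num in range(1, denom, 2):   # odd numerators only, to avoid repeats
            yield num / denom
        denom *= 2

print(list(islice(sample_fractions(), 7)))
# [1.0, 0.5, 0.25, 0.75, 0.125, 0.375, 0.625]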

@jnothman (Member, Author) commented Dec 20, 2017 via email

@vrishank97 (Contributor)

I have started some work on this. Should I put in a PR with n_samples starting at 8 and doubling for each fit?

@vrishank97 (Contributor)

Or we could let users select a base and a multiplier?
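That option could be as simple as a geometric schedule with a user-chosen base and multiplier, for example:

def sample_sizes(n_total, base=8, multiplier=2):
    """Yield geometrically growing n_samples values, capped at the dataset size."""
    n = base
    while n < n_total:
        yield n
        n *= multiplier
    yield n_total   # always finish with the full dataset

print(list(sample_sizes(1000)))
# [8, 16, 32, 64, 128, 256, 512, 1000]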

@jnothman (Member, Author) commented Dec 21, 2017 via email

@vrishank97 (Contributor) commented Dec 21, 2017

Agreed. Is the final goal a script or something similar to an estimator?

@jnothman (Member, Author) commented Dec 21, 2017 via email

@vrishank97 (Contributor)

Cool. Where should I put the function in the repo?

@amueller (Member)

start with n_classes * 2 for classifiers? ;)

@jnothman (Member, Author) commented Dec 21, 2017 via email

@rth (Member) commented Mar 4, 2018

There are also some low-level concerns that need to be addressed for this to work reliably, IMO, in particular the finite resolution of the timer: on Windows, time.time has a resolution of about 16 ms; time.clock is better, but it was deprecated in Python 3.3 in favor of time.perf_counter. There was some earlier discussion about this in #2844.
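One common way around the coarse timer resolution is to repeat the call until a minimum amount of time has elapsed, similar to what timeit's autorange does; a minimal sketch:

import time

def time_call(func, min_duration=0.2):
    """Repeat func until at least min_duration seconds have elapsed, so the
    measurement is well above the timer resolution; return mean time per call."""
    n_calls = 0
    start = time.perf_counter()
    while True:
        func()
        n_calls += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_duration:
            return elapsed / n_calls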

In a somewhat orthogonal direction to this issue, I have been experimenting with benchmarking lately in the neurtu repo, and I am wondering whether an API similar to joblib.Parallel could work for something like this: basically, a benchmark function that accepts an iterable of delayed calculations, as in the example below,

import numpy as np
from sklearn.cluster import KMeans

from neurtu import delayed, timeit

rng = np.random.RandomState(42)

n_samples_max, n_features = 10000, 10

timeit(delayed(KMeans, tags={'n_samples': n_samples})(n_clusters=8)
                   .fit(rng.rand(n_samples, n_features))
       for n_samples in np.geomspace(100, n_samples_max, 5, dtype='int'))

which here produces a DataFrame with

   n_samples  wall_time_max  wall_time_mean  wall_time_min  wall_time_std
0        100       0.034289        0.032714       0.031805       0.001118
1        316       0.054796        0.053576       0.051900       0.001226
2       1000       0.129308        0.119423       0.107751       0.008891
3       3162       0.387829        0.344845       0.303622       0.034400
4      10000       1.486790        1.414572       1.340358       0.059797

that can then be sent to a GP regressor or just visualized.

The advantage of such an approach is that it can be used to benchmark and compare anything the user might be interested in: n_samples, n_features, some parameters (e.g. n_jobs, solver) or even different estimators.

Here is a more complete example that includes runtime and peak memory usage of LogisticRegression for different n_samples and solver parameters.
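The linked example is not reproduced here; the sketch below approximates the same idea with plain time.perf_counter and memory_profiler rather than neurtu's own helpers (the sizes, solvers and synthetic data are made up):

import time
import numpy as np
from memory_profiler import memory_usage   # optional dependency
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(42)
n_features = 50

results = []
for n_samples in np.geomspace(1000, 30000, 3, dtype=int):
    X = rng.rand(n_samples, n_features)
    y = rng.randint(2, size=n_samples)
    for solver in ['lbfgs', 'liblinear', 'saga']:
        clf = LogisticRegression(solver=solver)
        t0 = time.perf_counter()
        # run the fit under the memory profiler and record the peak usage (MiB);
        # note that the polling itself adds some overhead to the wall time
        peak = memory_usage((clf.fit, (X, y)), max_usage=True, interval=0.01)
        results.append({'n_samples': int(n_samples), 'solver': solver,
                        'wall_time': time.perf_counter() - t0,
                        'peak_memory_MiB': peak})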

@cmarmo (Contributor) commented Aug 24, 2020

@jeremiedbb, @jnothman has #17026 solved this issue? Thanks!

@jnothman (Member, Author)

I'm not sure how well #17026 solves the need of a user estimating how well an algorithm will scale on their specific data. If it does, a tutorial would be beneficial!

A contributor commented:

@jnothman #17026 is an implementation of a benchmarking tool for the sample datasets we use in the sklearn examples; it doesn't exactly cover the use case that was in mind for this profiling tool, which was intended to model the change in performance of estimators as their hyperparameters change.
