Request for project inclusion: scitime · Issue #38 · scikit-learn-contrib/scikit-learn-contrib

Closed · 5 tasks done
nathan-toubiana opened this issue Feb 12, 2019 · 26 comments

@nathan-toubiana commented Feb 12, 2019

Request for project inclusion in scikit-learn-contrib

  • Project name: scitime
  • Project description: a package that provides runtime estimates for scikit-learn algorithms (see the usage sketch after this list)
  • Authors: Nathan Toubiana, Gabriel Lerner
  • Current repository: https://github.com/scitime/scitime
  • Requirements:
  • [NOT APPLICABLE] scikit-learn compatible (check_estimator passed)
  • Documentation
  • Unit tests (coverage: 79%)
  • Python 3 compatible
  • PEP8 compliant
  • Continuous integration
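
For context, a minimal usage sketch assuming the import path quoted later in this thread (from scitime import Estimator, later renamed RuntimeEstimator); signatures follow the project README at the time and may differ between releases:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

from scitime import Estimator  # renamed to RuntimeEstimator later in this thread

# Meta-estimator that predicts how long .fit() would take, without actually fitting
runtime_estimator = Estimator(meta_algo='RF', verbose=3)

rf = RandomForestRegressor()
X, y = np.random.rand(100_000, 10), np.random.rand(100_000)

# Returns a point estimate plus lower/upper bounds, in seconds
estimation, lower_bound, upper_bound = runtime_estimator.time(rf, X, y)
```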
@jnothman (Member) commented Feb 13, 2019 via email

@nathan-toubiana (Author)

Thanks for taking a look @jnothman!

We did look at the initial implementation of the benchmarking tool you pointed us to.
In some regards, the benchmark_estimator_cost function is similar to our generate_data framework, which gathers memory consumption and fit runtime while sweeping through a parameter space for the estimator. However, benchmark_estimator_cost seems to fit a runtime estimator on a per-request basis (for fixed parameters and a varying number of observations, from what we understand). We went with a slightly different approach: we collect this data beforehand and build/store a fit-time estimator.

Building a memory consumption estimator would be a great next step as we continue working on this package.

Let us know if you think we satisfy the scikit-learn-contrib requirements, we're looking forward to continuing our work!

@nathan-toubiana (Author)

Hi @jnothman , just wanted to follow up on our request and make sure it does not get forgotten.
Let us know if you need anything from us, we’re very much looking forward to your review.
Thanks again!

@jnothman (Member) commented Feb 24, 2019 via email

@GaelVaroquaux (Member) commented Feb 24, 2019 via email

@nathan-toubiana (Author)

Hi @jnothman and @GaelVaroquaux - thanks for the update, that makes perfect sense.

We just published a Medium article, featured on freeCodeCamp, describing our process, if that helps.

@amueller (Member)

@GaelVaroquaux I don't understand your point. We need a process that works with the resources we have. I don't see why we could hope to have "enough" people at some point.

We should probably discuss how much of a review we want to do.
I'd like to see this in contrib but it seems unlikely I'll have time to review.

@amueller (Member)

@joaquinvanschoren and @janvanrijn might be interested in this as well.

@nathan-toubiana (Author)

Hi all - just wanted to follow up on this and see if there was any update.
Happy to help if there’s anything we can do to accelerate the review process.

@amueller (Member) commented May 7, 2019

@nathan-toubiana Sorry for the delay, I hope I can get to this in May, maybe someone else will get to it earlier.

@nathan-toubiana (Author)

Thanks for the update! We're very excited to hear that.

@amueller (Member) commented May 8, 2019

Btw, have you compared it to the model within oboe? https://github.com/udellgroup/oboe

@nathan-toubiana (Author) commented May 8, 2019

Thanks for the reference. Based on their paper, it seems that their 'meta models' only account for the number of observations and features as meta inputs (we also add model hyperparameters and machine performance data as meta inputs). Other than that, and the fact that their meta models are polynomial regressions, their logic seems pretty similar.

@amueller (Member) commented May 8, 2019

I think their model is per hyper-parameter setting, but it's not entirely clear to me. I need to check the code. They say they have relatively accurate results with a simple model. I don't think they look across machines at all, though. Anyway, the paper seemed cool and I thought you might be interested.

@nathan-toubiana (Author)

Oh, that makes sense - not sure how they handle non-categorical hyperparameters though; I'll look at the code. Definitely super interesting! Thanks a lot.

@nathan-toubiana (Author)

Hi @amueller, we wanted to know whether you have had a chance to look through our submission.
We are always available if you have any questions, of course.
Thank you very much!

@amueller (Member)

Sorry :-/

@nathan-toubiana (Author)

Hi @amueller ,

Hope you are well. I'm following up on last year's request for our package scitime to be part of scikit-learn-contrib. We've had a significant number of requests and a lot of activity on our repo over the last few months, so I thought it could be a good time to reopen our discussion. We'd love to hear how/if we could improve our package to be part of the scikit-learn community. Thanks!

@rth (Contributor) commented Mar 2, 2021

Thanks @nathan-toubiana, the project looks interesting and it would make sense to have it in scikit-learn-contrib. However, I have a few questions/comments, for instance regarding the usage example:

  • it would help to have a brief description in the documentation of how this works internally (as you did in the article). Initially I assumed that you were extrapolating from a subset of the data instead of using pre-trained meta-models.
  • currently the repo includes vendored pickles of estimators for scikit-learn 0.24.1. Are scripts to retrain them available, and do you have plans to do so for future scikit-learn releases? How long does it take?
  • The naming of from scitime import Estimator can be somewhat confusing, because in scikit-learn an estimator means something different from a runtime estimator, so a more explicit name might have been better. Same for scitime.Model.

The above comments are related to the inclusion process. The following ones are just personal curiosity:

> Why not manually estimate the time complexity with big O notations?

> That’s a fair point. It’s a valid way of approaching the problem and something we thought about at the beginning of the project. One thing, however, is that we would need to formulate the complexity explicitly for each algorithm.

If you use a linear model to predict log(run time) as a function of log(n_samples) and log(n_features), I would have expected this to already provide a starting point that extrapolates nicely. Does that not produce good results on the collected data?

> […] are extrapolated by the NN estimator, whereas the RF estimator predicts the output stepwise.

Yeah, naively I would have thought that RF would not be the best for this use case, particularly if you are not sure if you are going to be extrapolating.

> Additionally, the NN might perform poorly on small to medium predictions. Sometimes, for small durations, the NN might even predict a negative duration, in which case we automatically switch back to RF.

Predicting the log of the duration might help.
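
As a rough sketch of that log-log baseline (an illustration only, not scitime's meta-model): time a fixed estimator configuration over a small grid of dataset sizes, fit a linear model in log space, and extrapolate.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

# Measure fit times over a small grid of dataset sizes (hyperparameters held fixed)
records = []
for n_samples in (1_000, 5_000, 20_000):
    for n_features in (5, 20, 50):
        X = np.random.rand(n_samples, n_features)
        y = np.random.randint(0, 2, n_samples)
        start = time.perf_counter()
        RandomForestClassifier(n_estimators=50).fit(X, y)
        records.append((n_samples, n_features, time.perf_counter() - start))

data = np.array(records)
# Linear model in log-log space: log(t) ≈ a·log(n_samples) + b·log(n_features) + c
log_sizes, log_times = np.log(data[:, :2]), np.log(data[:, 2])
baseline = LinearRegression().fit(log_sizes, log_times)

# Extrapolate to a larger dataset and map back to seconds
predicted_seconds = np.exp(baseline.predict(np.log([[200_000, 50]])))[0]
print(f"predicted fit time: {predicted_seconds:.1f}s")
```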

@nathan-toubiana (Author) commented Mar 3, 2021

Hi @rth

Thanks for your prompt answer.

> it would help to have a brief description in the documentation of how this works internally (as you did in the article). Initially I assumed that you were extrapolating from a subset of the data instead of using pre-trained meta-models.

We opened a PR to add descriptions to the documentation (see the 'how it works' section) - feel free to take a look and let us know. We confirm that we do use pre-trained meta-models to estimate runtimes.

> currently the repo includes vendored pickles of estimators for scikit-learn 0.24.1. Are scripts to retrain them available, and do you have plans to do so for future scikit-learn releases? How long does it take?

Once the runtime data is generated, it’s very quick to update the pickles (see the documentation here), and we actually did this last week when we bumped the package version to 0.1.0 (see our PR here). Users of scitime can build their own pickles by generating their own data, and we also plan to make our training data public.

> The naming of from scitime import Estimator can be somewhat confusing, because in scikit-learn an estimator means something different from a runtime estimator, so a more explicit name might have been better. Same for scitime.Model.

We renamed Estimator to RuntimeEstimator and Model to RuntimeModelBuilder in the same PR - feel free to take a look and let us know, and then we can merge and release.

> If you use a linear model to predict log(run time) as a function of log(n_samples) and log(n_features), I would have expected this to already provide a starting point that extrapolates nicely. Does that not produce good results on the collected data?

Unfortunately, n_samples and n_features are often not the only parameters with a significant impact on runtime. For instance, in RandomForestClassifier, max_depth can significantly change the runtime. This is why we went with this approach. The number of CPUs and available memory can also make a difference.
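
As a quick illustration (timings are machine-dependent and purely indicative), the same data can take very different amounts of time to fit depending on max_depth alone:

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(20_000, 20)
y = np.random.randint(0, 2, 20_000)

# Identical n_samples/n_features; only max_depth changes between runs
for max_depth in (2, None):
    start = time.perf_counter()
    RandomForestClassifier(n_estimators=100, max_depth=max_depth).fit(X, y)
    print(f"max_depth={max_depth}: {time.perf_counter() - start:.2f}s")
```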

> Yeah, naively I would have thought that RF would not be the best for this use case, particularly if you are not sure if you are going to be extrapolating.

Yes, this is why we decided to keep both meta-algos. The NN is best suited for extrapolation, while the RF has been trained on a large number of data points and provides good estimates for cases similar to our training data.

> Predicting the log of the duration might help.

Thanks! We’ll try retraining with that and see if it improves the accuracy.

@nathan-toubiana (Author)

Hi @rth ,

Just wanted to follow up on our conversation; we're ready to make these changes and more if needed. Thanks!

@nathan-toubiana (Author)

Hi,

We just released a new version (v0.1.1) with all the changes discussed above.

@nathan-toubiana (Author)

Hi,

Just following up on this since it has been some time. We would love to understand the next steps needed to get approved.

@adrinjalali (Member)

The repo hasn't seen any updates in the past 2 years. The repo also includes pickle files, which raises a lot of issues, both in terms of security and in terms of version compatibility. I don't think we should include this in the contrib org.

@adrinjalali closed this as not planned on Dec 5, 2023
@nathan-toubiana (Author)

Hi @adrinjalali, thanks for getting back to us. The reason the repo hasn't been updated lately is that we haven't heard back on our request since our last exchange (as you can see in the thread above). However, we are still seeing some usage (~1k weekly downloads) and are more than happy to work on further updates as needed, if we are still being considered for inclusion.

Looking forward to hearing from you.

@adrinjalali (Member)

The repo being active is usually a requirement for it to be moved here, not the other way around. Otherwise, we have no way of knowing whether the repo will go stale after inclusion.

Also, regarding the pickles, I would need to know exactly why they're included. Pickle files are effectively executables, and nobody should load a pickle file unless they really, really trust the source. So we really need to find an alternative here.
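
One possible direction, sketched here as an assumption (the thread itself does not settle on a format): if the shipped meta-models are plain scikit-learn estimators, they could be stored with skops, which serializes without arbitrary code execution and lets the loader audit the contained types before instantiating them.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from skops.io import dump, load, get_untrusted_types

# Stand-in for a runtime meta-model (hypothetical placeholder, not scitime's actual artifact)
meta_model = RandomForestRegressor().fit(np.random.rand(100, 4), np.random.rand(100))

dump(meta_model, "meta_model.skops")

# Inspect which types the file would instantiate before trusting it
untrusted = get_untrusted_types(file="meta_model.skops")
print(untrusted)  # review this list manually before passing it to trusted=
restored = load("meta_model.skops", trusted=untrusted)
```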
