Request for project inclusion: scitime · Issue #38 · scikit-learn-contrib/scikit-learn-contrib

Closed · 5 tasks done
nathan-toubiana opened this issue Feb 12, 2019 · 26 comments

@nathan-toubiana commented Feb 12, 2019

Request for project inclusion in scikit-learn-contrib

  • Project name: scitime
  • Project description: a package that provides runtime estimates for scikit-learn algorithms (see the usage sketch after this list)
  • Authors: Nathan Toubiana, Gabriel Lerner
  • Current repository: https://github.com/scitime/scitime
  • Requirements:
  • [NOT APPLICABLE] scikit-learn compatible (check_estimator passed)
  • Documentation
  • Unit tests (coverage: 79%)
  • Python 3 compatible
  • PEP8 compliant
  • Continuous integration
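
For context, a minimal usage sketch assuming the import path quoted later in this thread (from scitime import Estimator, later renamed RuntimeEstimator); signatures follow the project README at the time and may differ between releases:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

from scitime import Estimator  # renamed to RuntimeEstimator later in this thread

# Meta-estimator that predicts how long .fit() would take, without actually fitting
runtime_estimator = Estimator(meta_algo='RF', verbose=3)

rf = RandomForestRegressor()
X, y = np.random.rand(100_000, 10), np.random.rand(100_000)

# Returns a point estimate plus lower/upper bounds, in seconds
estimation, lower_bound, upper_bound = runtime_estimator.time(rf, X, y)
```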
@jnothman (Member) commented Feb 13, 2019 via email

@nathan-toubiana (Author)

Thanks for taking a look @jnothman!

We did look at the initial implementation of the benchmarking tool you pointed us to.
In some regards, the benchmark_estimator_cost function is similar to our generate_data framework, which gathers memory consumption and fit runtime while sweeping through a parameter space for the estimator. However, benchmark_estimator_cost seems to fit a runtime estimator on a per-request basis (for fixed parameters and a varying number of observations, from what we understand). We went with a slightly different approach: we collect this data beforehand and build/store a fit-time estimator.

Building a memory consumption estimator would be a great next step as we continue working on this package.

Let us know if you think we satisfy the scikit-learn-contrib requirements, we're looking forward to continuing our work!

@nathan-toubiana (Author)

Hi @jnothman , just wanted to follow up on our request and make sure it does not get forgotten.
Let us know if you need anything from us, we’re very much looking forward to your review.
Thanks again!

@jnothman (Member) commented Feb 24, 2019 via email

@GaelVaroquaux (Member) commented Feb 24, 2019 via email

@nathan-toubiana (Author)

Hi @jnothman and @GaelVaroquaux - thanks for the update, that makes perfect sense.

We just published a Medium article, featured on freeCodeCamp, describing our process, if that helps.

@amueller (Member)

@GaelVaroquaux I don't understand your point. We need a process that works with the resources we have. I don't see why we could hope to have "enough" people at some point.

We should probably discuss how much of a review we want to do.
I'd like to see this in contrib but it seems unlikely I'll have time to review.

@amueller (Member)

@joaquinvanschoren and @janvanrijn might be interested in this as well.

@nathan-toubiana (Author)

Hi all - just wanted to follow up on this and see if there was any update.
Happy to help if there’s anything we can do to accelerate the review process.

@amueller (Member) commented May 7, 2019

@nathan-toubiana Sorry for the delay, I hope I can get to this in May, maybe someone else will get to it earlier.

@nathan-toubiana (Author)

Thanks for the update! We're very excited to hear that.

@amueller (Member) commented May 8, 2019

Btw, have you compared it to the model within oboe? https://github.com/udellgroup/oboe

@nathan-toubiana (Author) commented May 8, 2019

Thanks for the reference. Based on their paper, it seems that their 'meta models' only account for the number of observations and features as meta inputs (we also add model hyperparameters and machine performance data as meta inputs). Other than that, and the fact that their meta models are polynomial regressions, their logic seems pretty similar.

@amueller (Member) commented May 8, 2019

I think their model is per hyper-parameter setting, but it's not entirely clear to me. I need to check the code. They say they have relatively accurate results with a simple model. I don't think they look across machines at all, though. Anyway, the paper seemed cool and I thought you might be interested.

@nathan-toubiana (Author)

Oh, that makes sense - not sure how they handle non-categorical hyperparameters though; I'll look at the code. Definitely super interesting! Thanks a lot.

@nathan-toubiana (Author)

Hi @amueller, we wanted to know whether you have had a chance to look through our submission.
We are always available if you have any questions, of course.
Thank you very much!

@amueller (Member)

Sorry :-/

@nathan-toubiana (Author)

Hi @amueller ,

Hope you are well. I'm following up on last year's request for our package scitime to be part of scikit-learn-contrib. We've had a significant number of requests and a lot of activity on our repo over the last few months, so I thought it could be a good time to reopen our discussion. We'd love to hear how/if we could improve our package to be part of the scikit-learn community. Thanks!

@rth (Contributor) commented Mar 2, 2021

Thanks @nathan-toubiana, the project looks interesting and it would make sense to have it in scikit-learn-contrib. However, I have a few questions/comments, for instance regarding the usage example:

  • it would help to have a brief description in the documentation of how this works internally (as you did in the article). Initially I assumed that you were extrapolating from a subset of the data instead of using pre-trained meta-models.
  • currently the repo includes vendored pickles of estimators for scikit-learn 0.24.1. Are scripts to retrain them available, and do you have plans to do so for future scikit-learn releases? How long does it take?
  • The naming of from scitime import Estimator can be somewhat confusing, because in scikit-learn an estimator means something different from a runtime estimator, so a more explicit name might have been better. Same for scitime.Model.

The above comments are related to the inclusion process. The following ones are just personal curiosity:

> Why not manually estimate the time complexity with big O notations?

> That’s a fair point. It’s a valid way of approaching the problem and something we thought about at the beginning of the project. One thing, however, is that we would need to formulate the complexity explicitly for each algorithm.

If you use a linear model to predict log(run time) as a function of log(n_samples) and log(n_features), I would have expected this to already provide a starting point that extrapolates nicely. Does that not produce good results on the collected data?

> […] are extrapolated by the NN estimator, whereas the RF estimator predicts the output stepwise.

Yeah, naively I would have thought that RF would not be the best for this use case, particularly if you are not sure if you are going to be extrapolating.

> Additionally, the NN might perform poorly on small to medium predictions. Sometimes, for small durations, the NN might even predict a negative duration, in which case we automatically switch back to RF.

Predicting the log of the duration might help.
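
As a rough sketch of that log-log baseline (an illustration only, not scitime's meta-model): time a fixed estimator configuration over a small grid of dataset sizes, fit a linear model in log space, and extrapolate.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

# Measure fit times over a small grid of dataset sizes (hyperparameters held fixed)
records = []
for n_samples in (1_000, 5_000, 20_000):
    for n_features in (5, 20, 50):
        X = np.random.rand(n_samples, n_features)
        y = np.random.randint(0, 2, n_samples)
        start = time.perf_counter()
        RandomForestClassifier(n_estimators=50).fit(X, y)
        records.append((n_samples, n_features, time.perf_counter() - start))

data = np.array(records)
# Linear model in log-log space: log(t) ≈ a·log(n_samples) + b·log(n_features) + c
log_sizes, log_times = np.log(data[:, :2]), np.log(data[:, 2])
baseline = LinearRegression().fit(log_sizes, log_times)

# Extrapolate to a larger dataset and map back to seconds
predicted_seconds = np.exp(baseline.predict(np.log([[200_000, 50]])))[0]
print(f"predicted fit time: {predicted_seconds:.1f}s")
```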

@nathan-toubiana (Author) commented Mar 3, 2021

Hi @rth

Thanks for your prompt answer.

> it would help to have a brief description in the documentation of how this works internally (as you did in the article). Initially I assumed that you were extrapolating from a subset of the data instead of using pre-trained meta-models.

We opened a PR to add descriptions to the documentation (see the 'how it works' section) - feel free to take a look and let us know. We confirm that we do use pre-trained meta-models to estimate runtimes.

> currently the repo includes vendored pickles of estimators for scikit-learn 0.24.1. Are scripts to retrain them available, and do you have plans to do so for future scikit-learn releases? How long does it take?

Once the runtime data is generated, it’s very quick to update the pickles (see the documentation here), and we actually did this last week when we bumped the package version to 0.1.0 (see our PR here). Users of scitime can build their own pickles by generating their own data, and we also plan to make our training data public.

> The naming of from scitime import Estimator can be somewhat confusing, because in scikit-learn an estimator means something different from a runtime estimator, so a more explicit name might have been better. Same for scitime.Model.

We renamed Estimator to RuntimeEstimator and Model to RuntimeModelBuilder in the same PR - feel free to take a look and let us know, and then we can merge and release.

> If you use a linear model to predict log(run time) as a function of log(n_samples) and log(n_features), I would have expected this to already provide a starting point that extrapolates nicely. Does that not produce good results on the collected data?

Unfortunately, n_samples and n_features are often not the only parameters with a significant impact on runtime. For instance, in RandomForestClassifier, max_depth can significantly change the runtime. This is why we went with this approach. The number of CPUs and available memory can also make a difference.
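
As a quick illustration (timings are machine-dependent and purely indicative), the same data can take very different amounts of time to fit depending on max_depth alone:

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(20_000, 20)
y = np.random.randint(0, 2, 20_000)

# Identical n_samples/n_features; only max_depth changes between runs
for max_depth in (2, None):
    start = time.perf_counter()
    RandomForestClassifier(n_estimators=100, max_depth=max_depth).fit(X, y)
    print(f"max_depth={max_depth}: {time.perf_counter() - start:.2f}s")
```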

> Yeah, naively I would have thought that RF would not be the best for this use case, particularly if you are not sure if you are going to be extrapolating.

Yes, this is why we decided to keep both meta-algos. The NN is best suited for extrapolation, while the RF has been trained on a large number of data points and provides good estimates for cases similar to our training data.

> Predicting the log of the duration might help.

Thanks! We’ll try retraining with that and see if it improves the accuracy.

@nathan-toubiana (Author)

Hi @rth ,

Just wanted to follow up on our conversation; we're ready to make these changes and more if needed. Thanks!

@nathan-toubiana (Author)

Hi,

We just released a new version (v0.1.1) with all the changes discussed above.

@nathan-toubiana (Author)

Hi,

Just following up on this since it has been some time. We would love to understand the next steps needed to get approved.

@adrinjalali (Member)

The repo hasn't seen any updates in the past 2 years. The repo also includes pickle files, which raises a lot of issues, both in terms of security and in terms of version compatibility. I don't think we should include this in the contrib org.

@adrinjalali closed this as not planned on Dec 5, 2023
@nathan-toubiana (Author)

Hi @adrinjalali, thanks for getting back to us. The reason the repo hasn't been updated lately is that we haven't heard back on our request since our last exchange (as you can see in the thread above). However, we are still seeing some usage (~1k weekly downloads) and are more than happy to work on further updates as needed, if we are still being considered for inclusion.

Looking forward to hearing from you.

@adrinjalali (Member)

The repo being active is usually a requirement for it to be moved here, not the other way around. Otherwise, we have no way of knowing whether the repo will go stale after inclusion.

Also, regarding the pickles, I would need to know exactly why they're included. Pickle files are effectively executables, and nobody should load a pickle file unless they really, really trust the source. So we really need to find an alternative here.
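
One possible direction, sketched here as an assumption (the thread itself does not settle on a format): if the shipped meta-models are plain scikit-learn estimators, they could be stored with skops, which serializes without arbitrary code execution and lets the loader audit the contained types before instantiating them.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from skops.io import dump, load, get_untrusted_types

# Stand-in for a runtime meta-model (hypothetical placeholder, not scitime's actual artifact)
meta_model = RandomForestRegressor().fit(np.random.rand(100, 4), np.random.rand(100))

dump(meta_model, "meta_model.skops")

# Inspect which types the file would instantiate before trusting it
untrusted = get_untrusted_types(file="meta_model.skops")
print(untrusted)  # review this list manually before passing it to trusted=
restored = load("meta_model.skops", trusted=untrusted)
```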
