[Feature Request] Progress Bars #7574
I think we would prefer a design that adds callbacks to the estimators rather than baking in the progress bar. How the callback or logging mechanism should work is a pretty big design decision. I'm not sure if there is any discussion on that anywhere.
Sorry, misclick.
I think it would be extremely useful to monitor the progress of algorithms beyond the current verbosity measures. Does anybody know of any other libraries with a robust logging / callback system that could be used for inspiration / ideas? I'd certainly be willing to help on this.

How would this system ideally work with the existing verbosity messages? These messages are already baked in, and although a callback system would make things even more customizable, I think the baked-in messages are still valuable.

In the short term, though, I was thinking more along the lines of simply improving the verbosity of the costly subroutines in some estimators. By improving the verbosity I really mean providing some timing measurements, like the current rate, estimated time remaining, start time, and total duration. Because I want to add this in multiple places, I thought it would be nice to have a reusable class that can go in the utils.

What I envision for this feature request is almost the same as the way the current verbosity methods work. For example, sklearn.cluster.kmeans_._mini_batch_convergence does print out a notion of absolute progress. All I want to do is add some timing statistics and estimates, and put these features in a reusable class that can be dropped into other estimators.
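To make the idea concrete, here is a minimal sketch of what such a reusable timing utility could look like. The class name `ProgressTimer` and all of its attribute names are purely illustrative; nothing like this exists in sklearn's utils today.

```python
import time


class ProgressTimer:
    """Hypothetical helper tracking timing statistics for an iterative
    routine: elapsed time, current rate, and estimated time remaining.
    Illustrative sketch only, not an existing sklearn utility."""

    def __init__(self, total):
        self.total = total
        self.start = time.perf_counter()
        self.count = 0

    def step(self, n=1):
        """Record that n more iterations have completed."""
        self.count += n

    @property
    def elapsed(self):
        """Seconds since the timer was created."""
        return time.perf_counter() - self.start

    @property
    def rate(self):
        """Iterations per second so far (0.0 before any time has passed)."""
        return self.count / self.elapsed if self.elapsed > 0 else 0.0

    @property
    def eta(self):
        """Estimated seconds remaining, assuming a constant rate."""
        remaining = self.total - self.count
        return remaining / self.rate if self.rate > 0 else float("inf")
```

An estimator's inner loop would create one of these, call `step()` each iteration, and include `rate` and `eta` in its existing verbose output.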
We are sometimes somewhat conservative (or slow?) in adding entirely new functionality, because it's hard to remove anything without breaking user code or at least annoying users. And if we add a lot of custom logging and displaying of logs, it's easy to get feature creep and hard-to-understand code. If you want to add this reusable class, it still needs some form of interface, right? We have a
Btw, there's an issue over here: #78
It's a bit unclear to me whether logging is the right solution for what you want, because a progress bar needs to know what 100% is. Some algorithms have multiple stopping criteria, which makes it tricky to use a bar. And even when there is a single stopping criterion, we sometimes stop when the change in the objective is smaller than some threshold. How does that translate to a progress bar?
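One possible answer for the threshold case: define progress heuristically on a log scale of the change in objective, so the bar reaches 100% exactly when the tolerance is hit. The function below is a hypothetical sketch of that mapping, not anything in sklearn; it assumes the initial change is larger than the tolerance.

```python
import math


def convergence_progress(initial_change, current_change, tol):
    """Map a shrinking change-in-objective onto a [0, 1] progress
    fraction on a log scale: 0 when the change equals the initial
    change, 1 when it reaches the stopping tolerance.
    Heuristic sketch only; assumes initial_change > tol > 0."""
    if current_change <= tol:
        return 1.0
    span = math.log(initial_change) - math.log(tol)
    done = math.log(initial_change) - math.log(current_change)
    # clamp against non-monotone objectives
    return max(0.0, min(1.0, done / span))
```

This gives a bar that moves steadily for linearly converging solvers, though it can stall or jump for algorithms whose objective decreases erratically.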
Improving verbosity is okay, but it would be better to provide structured objects to a logger that the user can configure. The logger then knows how to include timestamps in the display.
Structured logging, with things like:

```python
{"context": ['RandomForests.fit'],
```

On 5 October 2016 at 08:07, Andreas Mueller notifications@github.com wrote:
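For what it's worth, the standard library already supports attaching structured context like this to log records via the `extra` argument, and a configured handler or formatter decides how to display it. A minimal self-contained sketch (the `ListHandler` class is illustrative only):

```python
import logging


class ListHandler(logging.Handler):
    """Minimal handler that collects records so the structured
    fields can be inspected. Illustrative only."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


log = logging.getLogger("sklearn.progress.demo")
log.setLevel(logging.INFO)
log.propagate = False          # keep the demo output self-contained
handler = ListHandler()
log.addHandler(handler)

# Attach structured context via `extra`; a formatter such as
# "%(asctime)s %(context)s %(message)s" would render it with a timestamp.
log.info("iteration %d of %d", 5, 100,
         extra={"context": "RandomForests.fit"})

record = handler.records[0]
print(record.context, record.getMessage())
# -> RandomForests.fit iteration 5 of 100
```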
Might. What's the generic interface that the logger implements? This is additional complexity, but I think we need to do this at some point.

That would have to be faked in Cython.

Do we want the monitor in Cython? I guess so. I would have left it at Python for now.
A few more thoughts on this. For all estimators that either,

using tqdm progress bars is IMO easier and more reliable than writing custom classes, even if it adds one dependency for people who need this functionality. For instance, without modifying the scikit-learn code base, one could well do,

```python
from tqdm import tqdm
# [...]
for sl in tqdm(gen_batches(n_samples, self.batch_size)):
    estimator.partial_fit(X[sl], y[sl])
# [...]
```

or

```python
vect = CountVectorizer()
X = vect.fit_transform(tqdm(iterable_of_file_names))
```

For logging, one can then write a custom logging handler, which offloads this problem to the tqdm community, which probably has more experience handling progress bars.
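Such a handler can be only a few lines. This is a sketch of the common community pattern of routing log records through `tqdm.write` so that log lines do not garble an active progress bar; it is not part of scikit-learn or tqdm itself, and it falls back to plain `print` when tqdm is not installed:

```python
import logging

try:
    from tqdm import tqdm
    _write = tqdm.write    # cooperates with any active progress bars
except ImportError:        # tqdm not installed: degrade to plain print
    _write = print


class TqdmHandler(logging.Handler):
    """Sketch of a logging handler that emits via tqdm.write, so log
    messages appear above an active bar instead of corrupting it."""

    def emit(self, record):
        try:
            _write(self.format(record))
        except Exception:
            self.handleError(record)
```

Usage would be the standard `logging` dance: `logging.getLogger(...).addHandler(TqdmHandler())`.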
It would be a great help to have sklearn use tqdm (or any implementation of a progress bar) to monitor progress for long training runs of complex pipelines.
I would actually recommend against
The implementation that I originally proposed here has been made into its own standalone pip-installable Python package called progiter. It is both fast (very low overhead) and single-threaded, so it can be safely used with multiprocessing. Its API is mostly compatible with tqdm.

An experimental API for callbacks that would allow implementing progress bars was proposed in #16925. Feedback welcome.
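For readers unfamiliar with the callback approach, here is a rough sketch of what such a protocol could look like. The hook names (`on_fit_begin`, `on_fit_iter_end`, `on_fit_end`) and the `DummyEstimator` are invented for illustration and do not reflect the API actually proposed in that PR:

```python
import sys


class ProgressCallback:
    """Hypothetical callback that renders a percentage as a fit
    progresses. Hook names are illustrative, not sklearn API."""

    def on_fit_begin(self, estimator, n_steps):
        self.n_steps = n_steps
        self.seen = 0

    def on_fit_iter_end(self, estimator, step):
        self.seen += 1
        pct = 100.0 * self.seen / self.n_steps
        sys.stderr.write(f"\r{type(estimator).__name__}: {pct:.0f}%")

    def on_fit_end(self, estimator):
        sys.stderr.write("\n")


class DummyEstimator:
    """Stand-in showing where a cooperating fit loop would invoke
    the hooks; real estimators would call them around each step."""

    def fit(self, n_iter, callback):
        callback.on_fit_begin(self, n_iter)
        for i in range(n_iter):
            # ... one optimization step would happen here ...
            callback.on_fit_iter_end(self, i)
        callback.on_fit_end(self)
        return self
```

The appeal of this design is that the estimator only reports events; whether they become a progress bar, a log stream, or nothing at all is entirely the callback's decision.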
I'm currently working with a fork of sklearn where I include progress bars in long-running estimators (particularly MiniBatchKMeans). It takes up to 9 hours to run some of the algorithms, and having an estimated time remaining has been quite helpful for seeing when I should come back and look at the results.

I was wondering if it would be worthwhile to compile some of these into a pull request and add this functionality to the library. If so, there are a few logistics:
Which progress implementation should I use? I see three options here.
All options essentially wrap an iterator and dump some info to stdout about how far through the iterator they are.
These options are ordered from the least work to the most work. The first option is the least work, but it adds a dependency. I have experience with the second option, but it is not as widely used as click (although my port would use a more click-like signature). The third option requires me to delve into the click library, which I haven't had any experience with yet. However, judging from my initial look at its implementation, it looks like my implementation may be more efficient. The click implementation seems to compute and display progress information on every iteration, whereas mine tries to minimally impact the wrapped loop by adjusting its report frequency.
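The frequency-adjustment trick described above can be sketched in a few lines. This is an illustrative reconstruction of the idea, not ProgIter's actual implementation: measure the loop rate, then grow the report interval so the display updates roughly once per `target_seconds`, leaving only a counter increment and a modulo check as per-iteration overhead.

```python
import sys
import time


def adjusted_progress(iterable, total=None, target_seconds=0.25):
    """Yield items from `iterable`, updating a progress line on stderr
    about once per `target_seconds` instead of every iteration.
    Sketch only; names and details are illustrative."""
    if total is None:
        total = len(iterable)
    start = time.perf_counter()
    report_every = 1
    for count, item in enumerate(iterable, start=1):
        yield item
        if count % report_every == 0 or count == total:
            elapsed = time.perf_counter() - start
            if elapsed > 0:
                rate = count / elapsed
                # aim for ~one display update per target_seconds
                report_every = max(1, int(rate * target_seconds))
                sys.stderr.write(f"\r{count}/{total} ({rate:.1f} it/s)")
    sys.stderr.write("\n")
```

For a fast loop this quickly backs off to updating only a handful of times per second, which is why the overhead stays near zero even for millions of iterations.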
Here is some example code. Previously a code block looked like this:
If I port a refactored / pared-down version of my progress iterator to sklearn, then it will look something like this:

Here is a mockup of the simplified ProgIter object. I pared it down rather quickly, so there may be some small issues in this version.
Here is an example showing what this does:
Lastly, here is some timeit information that shows how frequency adjusting causes minimal overhead: