[Feature Request] Progress Bars #7574
I think we would prefer a design that adds callbacks to the estimators rather than baking in the progress bar. How the callback or logging mechanism should work is a pretty big design decision. I'm not sure if there is any discussion on that anywhere.
Sorry, misclick.
I think it would be extremely useful to monitor the progress of algorithms beyond the current verbosity measures. Does anybody know of any other libraries with a robust logging / callback system that could be used for inspiration / ideas? I'd certainly be willing to help on this.

How would this system ideally work with the existing verbosity messages? These messages are already baked in, and although a callback system would make things even more customizable, I think the baked-in messages are still valuable.

In the short term, though, I was thinking more along the lines of simply improving the verbosity of the costly subroutines in some estimators. By improving the verbosity I really mean providing some timing measurements, like the current rate, estimated time remaining, start time, and total duration. Because I want to add this in multiple places, I thought it would be nice to have a reusable class that can go in the utils.

What I envision for this feature request is almost the same as the way the current verbosity methods work. For example, sklearn.cluster.kmeans_._mini_batch_convergence does print out a notion of absolute progress. All I want to do is add some timing statistics and estimates, and put these features in a reusable class that can be dropped into other estimators.
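To make the idea concrete, here is a minimal sketch of what such a reusable timing utility could look like. The class name `ProgressTimer` and all of its attribute names are purely illustrative; nothing like this exists in sklearn's utils today.

```python
import time


class ProgressTimer:
    """Hypothetical helper tracking timing statistics for an iterative
    routine: elapsed time, current rate, and estimated time remaining.
    Illustrative sketch only, not an existing sklearn utility."""

    def __init__(self, total):
        self.total = total
        self.start = time.perf_counter()
        self.count = 0

    def step(self, n=1):
        """Record that n more iterations have completed."""
        self.count += n

    @property
    def elapsed(self):
        """Seconds since the timer was created."""
        return time.perf_counter() - self.start

    @property
    def rate(self):
        """Iterations per second so far (0.0 before any time has passed)."""
        return self.count / self.elapsed if self.elapsed > 0 else 0.0

    @property
    def eta(self):
        """Estimated seconds remaining, assuming a constant rate."""
        remaining = self.total - self.count
        return remaining / self.rate if self.rate > 0 else float("inf")
```

An estimator's inner loop would create one of these, call `step()` each iteration, and include `rate` and `eta` in its existing verbose output.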
We are sometimes somewhat conservative (or slow?) in adding entirely new functionality, because it's hard to remove anything without breaking user code or at least annoying users. And if we add a lot of custom logging and displaying of logs, it's easy to get feature creep and hard-to-understand code. If you want to add this reusable class, it still needs some form of interface, right? We have a
Btw, there's an issue over here: #78
It's a bit unclear to me whether logging is the right solution for what you want, because a progress bar needs to know what 100% is. Some algorithms have multiple stopping criteria, which makes it tricky to use a bar. And even when there is a single stopping criterion, we sometimes stop when the change in the objective is smaller than some threshold. How does that translate to a progress bar?
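One possible answer for the threshold case: define progress heuristically on a log scale of the change in objective, so the bar reaches 100% exactly when the tolerance is hit. The function below is a hypothetical sketch of that mapping, not anything in sklearn; it assumes the initial change is larger than the tolerance.

```python
import math


def convergence_progress(initial_change, current_change, tol):
    """Map a shrinking change-in-objective onto a [0, 1] progress
    fraction on a log scale: 0 when the change equals the initial
    change, 1 when it reaches the stopping tolerance.
    Heuristic sketch only; assumes initial_change > tol > 0."""
    if current_change <= tol:
        return 1.0
    span = math.log(initial_change) - math.log(tol)
    done = math.log(initial_change) - math.log(current_change)
    # clamp against non-monotone objectives
    return max(0.0, min(1.0, done / span))
```

This gives a bar that moves steadily for linearly converging solvers, though it can stall or jump for algorithms whose objective decreases erratically.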
Improving verbosity is okay, but it would be better to provide structured objects to a logger that the user can configure. The logger then knows how to include timestamps in the display.
Structured logging, with things like:

```python
{"context": ['RandomForests.fit'],
```

On 5 October 2016 at 08:07, Andreas Mueller notifications@github.com wrote:
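For what it's worth, the standard library already supports attaching structured context like this to log records via the `extra` argument, and a configured handler or formatter decides how to display it. A minimal self-contained sketch (the `ListHandler` class is illustrative only):

```python
import logging


class ListHandler(logging.Handler):
    """Minimal handler that collects records so the structured
    fields can be inspected. Illustrative only."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


log = logging.getLogger("sklearn.progress.demo")
log.setLevel(logging.INFO)
log.propagate = False          # keep the demo output self-contained
handler = ListHandler()
log.addHandler(handler)

# Attach structured context via `extra`; a formatter such as
# "%(asctime)s %(context)s %(message)s" would render it with a timestamp.
log.info("iteration %d of %d", 5, 100,
         extra={"context": "RandomForests.fit"})

record = handler.records[0]
print(record.context, record.getMessage())
# -> RandomForests.fit iteration 5 of 100
```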
Might. What's the generic interface that the logger implements? This is additional complexity, but I think we need to do this at some point.

That would have to be faked in Cython.

Do we want the monitor in Cython? I guess so. I would have left it at Python for now.
A few more thoughts on this. For all estimators that either,

using tqdm progress bars is IMO easier and more reliable than writing custom classes, even if it adds one dependency for people who need this functionality. For instance, without modifying the scikit-learn code base, one could well do,

```python
from tqdm import tqdm
# [...]
for sl in tqdm(gen_batches(n_samples, self.batch_size)):
    estimator.partial_fit(X[sl], y[sl])
# [...]
```

or

```python
vect = CountVectorizer()
X = vect.fit_transform(tqdm(iterable_of_file_names))
```

For logging, one can then write a custom logging handler, which offloads this problem to the tqdm community, which probably has more experience handling progress bars.
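Such a handler can be only a few lines. This is a sketch of the common community pattern of routing log records through `tqdm.write` so that log lines do not garble an active progress bar; it is not part of scikit-learn or tqdm itself, and it falls back to plain `print` when tqdm is not installed:

```python
import logging

try:
    from tqdm import tqdm
    _write = tqdm.write    # cooperates with any active progress bars
except ImportError:        # tqdm not installed: degrade to plain print
    _write = print


class TqdmHandler(logging.Handler):
    """Sketch of a logging handler that emits via tqdm.write, so log
    messages appear above an active bar instead of corrupting it."""

    def emit(self, record):
        try:
            _write(self.format(record))
        except Exception:
            self.handleError(record)
```

Usage would be the standard `logging` dance: `logging.getLogger(...).addHandler(TqdmHandler())`.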
It would be a great help to have sklearn use tqdm (or any implementation of a progress bar) to monitor progress for long training runs of complex pipelines.
I would actually recommend against
The implementation that I originally proposed here has been made into its own standalone pip-installable Python package called progiter. It is both fast (very low overhead) and single-threaded, so it can be safely used with multiprocessing. Its API is mostly compatible with tqdm.

An experimental API for callbacks that would allow implementing progress bars was proposed in #16925. Feedback welcome.
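For readers unfamiliar with the callback approach, here is a rough sketch of what such a protocol could look like. The hook names (`on_fit_begin`, `on_fit_iter_end`, `on_fit_end`) and the `DummyEstimator` are invented for illustration and do not reflect the API actually proposed in that PR:

```python
import sys


class ProgressCallback:
    """Hypothetical callback that renders a percentage as a fit
    progresses. Hook names are illustrative, not sklearn API."""

    def on_fit_begin(self, estimator, n_steps):
        self.n_steps = n_steps
        self.seen = 0

    def on_fit_iter_end(self, estimator, step):
        self.seen += 1
        pct = 100.0 * self.seen / self.n_steps
        sys.stderr.write(f"\r{type(estimator).__name__}: {pct:.0f}%")

    def on_fit_end(self, estimator):
        sys.stderr.write("\n")


class DummyEstimator:
    """Stand-in showing where a cooperating fit loop would invoke
    the hooks; real estimators would call them around each step."""

    def fit(self, n_iter, callback):
        callback.on_fit_begin(self, n_iter)
        for i in range(n_iter):
            # ... one optimization step would happen here ...
            callback.on_fit_iter_end(self, i)
        callback.on_fit_end(self)
        return self
```

The appeal of this design is that the estimator only reports events; whether they become a progress bar, a log stream, or nothing at all is entirely the callback's decision.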
I'm currently working with a fork of sklearn where I include progress bars in long-running estimators (particularly MiniBatchKMeans). It takes up to 9 hours to run some of the algorithms, and having an estimated time remaining has been quite helpful for seeing when I should come back and look at the results.

I was wondering if it would be worthwhile to compile some of these into a pull request and add this functionality to the library. If so, there are a few logistics:
Which progress implementation should I use? I see three options here.
All options essentially wrap an iterator and dump some info to stdout about how far through the iterator they are.
These options are ordered from the least work to the most work. The first option is the least work, but it adds a dependency. I have experience with the second option, but it is not as widely used as click (although my port would use a more click-like signature). The third option requires me to delve into the click library, which I haven't had any experience with yet. However, judging from my initial look at its implementation, it looks like my implementation may be more efficient. The click implementation seems to compute and display progress information on every iteration, whereas mine tries to minimally impact the wrapped loop by adjusting its report frequency.
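The frequency-adjustment trick described above can be sketched in a few lines. This is an illustrative reconstruction of the idea, not ProgIter's actual implementation: measure the loop rate, then grow the report interval so the display updates roughly once per `target_seconds`, leaving only a counter increment and a modulo check as per-iteration overhead.

```python
import sys
import time


def adjusted_progress(iterable, total=None, target_seconds=0.25):
    """Yield items from `iterable`, updating a progress line on stderr
    about once per `target_seconds` instead of every iteration.
    Sketch only; names and details are illustrative."""
    if total is None:
        total = len(iterable)
    start = time.perf_counter()
    report_every = 1
    for count, item in enumerate(iterable, start=1):
        yield item
        if count % report_every == 0 or count == total:
            elapsed = time.perf_counter() - start
            if elapsed > 0:
                rate = count / elapsed
                # aim for ~one display update per target_seconds
                report_every = max(1, int(rate * target_seconds))
                sys.stderr.write(f"\r{count}/{total} ({rate:.1f} it/s)")
    sys.stderr.write("\n")
```

For a fast loop this quickly backs off to updating only a handful of times per second, which is why the overhead stays near zero even for millions of iterations.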
Here is some example code. Previously a code block looked like this:
If I port a refactored / pared-down version of my progress iterator to sklearn, then it will look something like this:

Here is a mockup of the simplified ProgIter object. I pared it down rather quickly, so there may be some small issues in this version.
Here is an example showing what this does:
Lastly, here is some timeit information that shows how frequency adjusting causes minimal overhead: