Use python logging to report on convergence progress at level INFO for long-running tasks #78
Comments
Work has started on this in https://github.com/GaelVaroquaux/scikit-learn/tree/progress_logger. What remains to be done is most probably fairly mechanical work.
There is also work in the new Gradient Boosting module.
Logging actually isn't that easy to use at all, in my experience, so -1 on this.
Is anyone working on this?
How about we add a logger that by default prints to STDOUT? That should be fairly simple, right?
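Roughly something like this (a sketch only; the "sklearn" logger name and the default handler are assumptions, not existing scikit-learn code):

```python
import logging
import sys

logger = logging.getLogger("sklearn")            # assumed library logger name
logger.setLevel(logging.INFO)

# default handler printing to STDOUT, which users could replace or remove
_default_handler = logging.StreamHandler(sys.stdout)
_default_handler.setFormatter(logging.Formatter("[%(name)s] %(message)s"))
logger.addHandler(_default_handler)

logger.info("iteration 10, objective 0.0123")    # shows up on stdout by default
```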
This issue has been open since 2011, so I wonder whether it is going to be fixed. I've run into this issue with RFECV (scikit-learn/sklearn/feature_selection/rfe.py, line 273 at a24c8b4); currently one has to redirect sys.stdout to make this work, and overriding the logger would be the simple and clean solution.
There are other issues in sklearn such as #8105 and #10973 that would benefit from real logging in sklearn. Overall, I think logging would be a great addition to sklearn.
You're welcome to work on it. Perhaps a callback system is better than logging.
I'm a bit busy right now, but I support customizable logging in sklearn in whatever form (although I prefer standard python logging).
Has there been any discussion about what this would look like in scikit-learn? Given that libraries aren't supposed to do logging configuration, it's up to the user to configure their "application" (which may just be a script or interactive session) to log things appropriately. This isn't always easy to do correctly. My proposal in dask/dask-ml#528 is for fit to temporarily set up a handler while it runs.
Does temporarily also mean that the handler set up is specific to that estimator or estimator type?
This seems like a good balance. Using the logging module directly is not all that user friendly. Another "hack" would be to use …
I think changing the level of specific messages when the user asks for more verbosity is exactly the opposite of how the logging module is meant to work. But the local handler could change from warning to info to debug level on the stream as verbose increases.
@jnothman’s comment matches my thoughts. scikit-learn would always emit the message, and the verbose keyword controls the logger level and handlers.
Okay, let's go with this. The available logging levels are listed at https://docs.python.org/3/library/logging.html#logging-levels. By default, we can use the … level. What do you think?
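To make the idea above concrete, here is a rough sketch (nothing here is existing scikit-learn API; the logger name, the verbose-to-level mapping, and the temporarily attached handler are all assumptions):

```python
import logging
import sys
from contextlib import contextmanager

logger = logging.getLogger("sklearn")     # assumed logger name
logger.setLevel(logging.DEBUG)            # messages are always emitted; handlers filter

@contextmanager
def verbosity_handler(verbose):
    # assumed mapping: 0 -> warnings only, 1 -> info, >=2 -> debug
    level = {0: logging.WARNING, 1: logging.INFO}.get(verbose, logging.DEBUG)
    handler = logging.StreamHandler(sys.stdout)
    handler.setLevel(level)
    logger.addHandler(handler)
    try:
        yield
    finally:
        logger.removeHandler(handler)      # handler only lives for the duration of fit

def fit(X, verbose=0):
    with verbosity_handler(verbose):
        for i in range(3):
            logger.info("iteration %d", i)                # always emitted
            logger.debug("gradient norm: %.3g", 0.1 ** i) # shown only for verbose >= 2

fit(None, verbose=1)    # prints only the info lines
```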
Hi, I'm very interested in this issue. I have some experience with …, and it might be helpful to recap the ideas mentioned here:
My take on these options: Option 1 or option 4 would probably be best.
A fifth option would be to remove verbose.
@grisaitis thanks! See also more recent related discussions in #17439 and #16925 (comment) (regarding callbacks). Your help would be very much appreciated; the main issue is we haven't decided yet what approach would be best :)
I would support removing verbose, as I find per-estimator configuration frustrating, and the numeric values of verbose arbitrary, poorly documented, etc. Per-class configuration would be handled by having multiple scikit-learn logger names.
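For illustration, per-module or per-class control could then rely on the standard logger hierarchy; the dotted names below are assumptions, not an agreed naming scheme:

```python
import logging

# user-side configuration: quiet by default, chatty for one submodule only
logging.basicConfig(level=logging.WARNING)
logging.getLogger("sklearn").setLevel(logging.WARNING)
logging.getLogger("sklearn.ensemble").setLevel(logging.DEBUG)

logging.getLogger("sklearn.ensemble").debug("shown")          # passes the DEBUG level
logging.getLogger("sklearn.linear_model").debug("suppressed") # inherits WARNING from "sklearn"
```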
I think getting rid of verbose …
Also, one thing logging provides is that you can attach extra information to each logging message, not just strings, so a whole dict of useful stuff. If you want to report loss during learning, you can do that and store the numerical value. Even more, you can both store the numerical value as a number and use it to format a user-friendly string, like:
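(a generic sketch of the standard `extra` mechanism, not scikit-learn code; the `loss` field name is illustrative):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("sklearn")   # hypothetical logger name

loss = 0.1234
# the message stays human readable, while record.loss keeps the raw number
logger.info("epoch %d finished, loss: %.4f", 3, loss, extra={"loss": loss})
# a downstream handler or filter can then read `record.loss` without parsing text
```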
I think having module- or estimator-level loggers is a bit overkill for now, and we should start with something simple that allows us to extend it later. How do logging and joblib interact? The logging level is not preserved (as expected, I guess):

import logging
from joblib import Parallel, delayed

logger = logging.getLogger('sklearn')
logger.setLevel(2)

def get_level():
    another_logger = logging.getLogger('sklearn')
    return another_logger.level

results = Parallel(n_jobs=2)(
    delayed(get_level)() for _ in range(2)
)
results
But that's probably not needed, since this works:

import logging
import sys
from joblib import Parallel, delayed

logger = logging.getLogger('sklearn')
logger.setLevel(1)
handler = logging.StreamHandler(sys.stdout)
logger.addHandler(handler)

def log_some():
    another_logger = logging.getLogger('sklearn')
    another_logger.critical("log something")

results = Parallel(n_jobs=2)(
    delayed(log_some)() for _ in range(2)
)

Honestly, I'm not entirely sure how this works, though. Both stdout and stderr don't show up in jupyter, btw. My dream: being able to get an approximation of the current behavior with a single line, but also being able to easily use progress bars or plot convergence instead. Re verbose: it's probably cleaner to deprecate verbose, but deprecating verbose and not having estimator-level logging will make it a bit trickier to log one estimator but not another. I think it's fine to have the user filter the messages, though.
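For example, the "single line" and the convergence-plot use case could look roughly like this on the user side (assuming a hypothetical "sklearn" logger whose messages carry the loss via `extra`; none of this is existing API):

```python
import logging

# (1) one line to get something close to today's verbose output:
logging.basicConfig(level=logging.INFO)

# (2) or collect convergence values for a progress bar / plot instead:
class ConvergenceCollector(logging.Handler):
    def __init__(self):
        super().__init__()
        self.losses = []

    def emit(self, record):
        # assumes estimators would attach the value via extra={"loss": ...}
        if hasattr(record, "loss"):
            self.losses.append(record.loss)

collector = ConvergenceCollector()
logging.getLogger("sklearn").addHandler(collector)
# ... fit estimators ..., then plot collector.losses
```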
hey all, thanks for the friendly replies and info. i read the other issues and have some thoughts.
@amueller that is very weird. i reproduced your example. things do work with concurrent.futures, though; seems like the problem is specific to joblib:

import concurrent.futures
import logging
import os

logger = logging.getLogger("demo🙂")
logger.setLevel("DEBUG")
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(process)d (%(processName)s) %(levelname)s:%(name)s:%(message)s")
)
logger.addHandler(handler)

def get_logger_info(_=None):
    another_logger = logging.getLogger("demo🙂")
    print(os.getpid(), "another_logger:", another_logger, another_logger.handlers)
    another_logger.warning(f"hello from {os.getpid()}")
    return another_logger

if __name__ == "__main__":
    print(get_logger_info())
    print()

    print("concurrent.futures demo...")
    with concurrent.futures.ProcessPoolExecutor(2) as executor:
        results = executor.map(get_logger_info, range(2))
    print(list(results))
    print()

    print("joblib demo (@amueller's example #2)...")
    from joblib import Parallel, delayed
    results = Parallel(n_jobs=2)(delayed(get_logger_info)() for _ in range(2))
    print(results)

which outputs

19817 another_logger: <Logger demo🙂 (DEBUG)> [<StreamHandler <stderr> (NOTSET)>]
19817 (MainProcess) WARNING:demo🙂:hello from 19817
<Logger demo🙂 (DEBUG)>
concurrent.futures demo...
19819 another_logger: <Logger demo🙂 (DEBUG)> [<StreamHandler <stderr> (NOTSET)>]
19819 (SpawnProcess-1) WARNING:demo🙂:hello from 19819
19819 another_logger: <Logger demo🙂 (DEBUG)> [<StreamHandler <stderr> (NOTSET)>]
19819 (SpawnProcess-1) WARNING:demo🙂:hello from 19819
[<Logger demo🙂 (DEBUG)>, <Logger demo🙂 (DEBUG)>]
joblib demo (@amueller's example #2)...
19823 another_logger: <Logger demo🙂 (WARNING)> []
hello from 19823
19823 another_logger: <Logger demo🙂 (WARNING)> []
hello from 19823
[<Logger demo🙂 (DEBUG)>, <Logger demo🙂 (DEBUG)>]
I think you should configure the processes joblib spawns to send logging messages to the main logger in the main process. Then one can control logging in the main process only. Something like this or this. So there are logging queue sinks and sources, and you can tie them together. We use this on our cluster to send all logging from all workers on all machines to a central location, to show it on the user's computer.
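A minimal sketch of that queue-based pattern with the standard library (using multiprocessing directly rather than joblib; the "sklearn" logger name and the message contents are assumptions):

```python
import logging
import logging.handlers
import multiprocessing

def worker(queue, i):
    # worker processes only push records onto the shared queue
    logger = logging.getLogger("sklearn")          # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:                        # avoid duplicates when the pool reuses a process
        logger.addHandler(logging.handlers.QueueHandler(queue))
    logger.info("iteration %d: loss=%.3f", i, 1.0 / (i + 1))

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    queue = manager.Queue(-1)

    # the main process decides what actually happens to the records
    console = logging.StreamHandler()
    listener = logging.handlers.QueueListener(queue, console)
    listener.start()

    with multiprocessing.Pool(2) as pool:
        pool.starmap(worker, [(queue, i) for i in range(4)])

    listener.stop()
```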
@mitar i agree, i think that might be the best bet (not necessarily network-based queues, but inter-process communication queues). i'm actually coding up an example using … also love your other suggestion:
i implemented a visualization of gaussian mixture modeling using the … also, the only limitation of using a QueueHandler that i can think of is that any …
Yes, you should always use a queue handler, because you never know when sending over the socket blocks, and you do not want to slow down the model because logging is blocking. Also, you do not even have to use …
for the …: thanks, yes. i find it useful also to log state without converting it to text, e.g. including a numpy array in …
I think so. I have not used this over process boundaries, but it seems there is documented support for multiprocessing, so it should work with joblib as well. I am using QueueHandler / QueueListener inside the same process, to decouple logging writes from logging transport: QueueHandler -> QueueListener -> send to central logging service. But from the documentation it looks like it can work through a multiprocessing queue.
Yes. What I am saying is that you do not have to use …
@grisaitis FYI if you mention a name in a commit, every time you do anything with the commit (like rebasing it or merging it or pushing it), the person gets an email, so it's generally discouraged ;)
Sorry about that Andreas! 😬 That's embarrassing... I was just trying to have well-documented commits lol. Will avoid in the future. In that repo I figured out how logging can work with … As a first step I'll implement logging with that approach in a part of sklearn where …
Yes, it sounds like it would be necessary to start/stop a monitoring thread (here …)
So I figure the only reason why what I did above "worked" was that it printed to stdout, which was the shared resource, and …
Print is not thread safe; the processes just print to the same file descriptor. Running for a longer time, you would probably see that messages sometimes interleave and lines get mixed up.
Are we gonna do this? (we should). Moving to 2.0
This is a proposal to use Python's logging module instead of using stdout and verbose flags in the models API.
Using the logging module would make it easier for the user to control the verbosity of scikit-learn using a single, well-documented configuration interface and logging API.
http://docs.python.org/library/logging.html