Added fit and score times for learning_curve #13938
Conversation
Thanks for the pull request. We will not make this kind of backwards-incompatible change. Please add a parameter such as return_times. It may also be worth adding a plot like that to an example.
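For concreteness, a minimal sketch of how such an opt-in parameter could look from the caller's side, assuming the times are simply appended to the existing return tuple when `return_times=True` (the shapes and ordering here are an assumption, not a final API):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)

# Default call: the usual three return values, so existing code keeps working.
train_sizes, train_scores, test_scores = learning_curve(GaussianNB(), X, y, cv=5)

# Opt-in: also get per-fold fit and score times for each training size.
train_sizes, train_scores, test_scores, fit_times, score_times = learning_curve(
    GaussianNB(), X, y, cv=5, return_times=True)

print(fit_times.shape)  # (n_train_sizes, n_cv_folds)
```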
Docs will generate automatically on Circle CI. See our contributor docs. I don't see why you would need a new example rather than extend the existing one, though.
Setting it to -1 is likely to make the generation of the documentation crash.
(Review comment on the diff, at the new example section that follows the existing example's plt.show() call:)
Is there a way to actually integrate this into the example above, rather than having an entirely separate demonstration of this functionality?
I guess we can, I will try!
EDIT: I am wondering if it is a good idea to merge the examples, as the first one focuses on showing two distinct learning curves whereas the second one focuses on stacking multiple fit time curves of estimators to find the best one.
So could you be more specific about what you would expect?
For the moment I guess you would like to directly retrieve the fit times of the two estimators of the first example (GaussianNB and SVC) and plot the two corresponding fit time curves (the new feature). If yes, then I can take care of that. Otherwise, if we want to show more curves (and therefore more estimators), we should separate the examples (or would you prefer to add other classifiers to the first example?).
> So could you be more specific about what you would expect?

Why not just add your two new plots (n_samples vs fit_time and fit_time vs score) to both pre-existing estimators? That would be a 3 by 2 grid of plots and would avoid having this whole new code section.
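Roughly, the suggested layout could look like the sketch below (illustrative only: the estimator settings, figure size and mean-over-folds aggregation are assumptions, not the example's final code):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

fig, axes = plt.subplots(3, 2, figsize=(10, 12))

for col, estimator in enumerate([GaussianNB(), SVC(gamma=0.001)]):
    sizes, train_scores, test_scores, fit_times, _ = learning_curve(
        estimator, X, y, cv=cv, return_times=True)

    # Row 1: the usual learning curve (score vs. number of training samples).
    axes[0, col].plot(sizes, train_scores.mean(axis=1), "o-", label="train")
    axes[0, col].plot(sizes, test_scores.mean(axis=1), "o-", label="test")
    axes[0, col].set_title(type(estimator).__name__)
    axes[0, col].legend()

    # Row 2: scalability (fit time vs. number of training samples).
    axes[1, col].plot(sizes, fit_times.mean(axis=1), "o-")
    axes[1, col].set_xlabel("n_samples")
    axes[1, col].set_ylabel("fit time (s)")

    # Row 3: performance vs. cost (test score vs. fit time).
    axes[2, col].plot(fit_times.mean(axis=1), test_scores.mean(axis=1), "o-")
    axes[2, col].set_xlabel("fit time (s)")
    axes[2, col].set_ylabel("test score")

plt.tight_layout()
plt.show()
```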
OK, so here are the resulting plots with changes in ea0cc71:
If you look at the generated doc (https://59951-843222-gh.circle-artifacts.com/0/doc/auto_examples/model_selection/plot_learning_curve.html), the plots are very small. Maybe try with 3 rows and 2 columns instead. (You can build the doc locally following these guidelines: https://scikit-learn.org/dev/developers/contributing.html#building-the-documentation)
I am having trouble generating the doc. I am using a miniconda venv, so I enter the following at the command prompt:

    (py3) path\scikit-learn\doc>set EXAMPLES_PATTERN=plot_learning_curve.py
    (py3) path\scikit-learn\doc>make html

But all the examples are then generated (the regex variable is not taken into account), and this takes forever on my computer...
I also tried with set EXAMPLES_PATTERN=*plot_learning_curve.py but got the same result. How would you do it?
The doc says EXAMPLES_PATTERN=your_regex_goes_here make html, in one command.
This gave me an 'unknown command' error, but make html eventually worked anyway.
A few comments; looks good in general.
I'm not a huge fan of returning tuples with different sizes depending on the input, but I guess it's too late to return a Bunch anyway.
@thomasjpfan wrote on gitter:
This seems like a good candidate for returning a dictionary (or a Bunch). I think going through a deprecation cycle for it is a little rough.
Are you trying to suggest changing the return value without deprecation? I don't think that's a good idea, certainly not unless you have a return value that can unpack as three elements.
Or are you trying to suggest that you want feedback on whether we should consider deprecating and moving to a Bunch? I agree that deprecation would be disruptive to users and tutorials of this function, so I'm not entirely fond of it.
If we want to change to a Bunch, then instead of return_times we should add return_bunch, include the times in the Bunch, change the default value from False to True over a couple of versions, etc.
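For concreteness, a purely hypothetical sketch of what a Bunch-style return could look like; no return_bunch parameter exists, and this only illustrates the attribute-access alternative being weighed here:

```python
import numpy as np
from sklearn.utils import Bunch

# Hypothetical shape of a Bunch-style result; the field names mirror the
# current tuple, with arrays of shape (n_train_sizes, n_cv_folds).
results = Bunch(
    train_sizes=np.array([50, 100, 200]),
    train_scores=np.zeros((3, 5)),
    test_scores=np.zeros((3, 5)),
    fit_times=np.zeros((3, 5)),
    score_times=np.zeros((3, 5)),
)

print(results.fit_times.shape)  # attribute access instead of positional unpacking
print(sorted(results.keys()))   # a Bunch still behaves like a dict
```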
I was looking for feedback on such an API change. (I will be more direct next time.) Even in the example we have something like this:

    train_sizes, train_scores, test_scores, fit_times, _ = \
        learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
                       train_sizes=train_sizes,
                       return_times=True)

Every time I see …
I think it's fair enough. We currently return a similar dict in cross_validate (not a Bunch there, or in GridSearchCV.cv_results_, but it could be), but this API was designed to follow cross_val_score before cross_validate existed.
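For reference, cross_validate already exposes its timings through a plain dict, roughly like this (key set shown for the default return_train_score=False):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

cv_results = cross_validate(SVC(gamma=0.001), X, y, cv=5)
print(sorted(cv_results))            # ['fit_time', 'score_time', 'test_score']
print(cv_results["fit_time"].shape)  # (5,) -- one entry per CV fold
```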
Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>
Minimal comments. LGTM anyway
Otherwise lgtm. We can separately consider a return_bunch option, I suppose.
Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors, if applicable) with :user:.
@H4dr1en, please look at https://github.com/scikit-learn/scikit-learn/pull/13938/files and revert all the unrelated changes.

Thanks @H4dr1en!

It looks like Circle CI is failing on master after this PR was merged?

Raised an issue regarding this at #14098.


What does this implement/fix? Explain your changes.
It would be interesting to have access to the fit and score times of the estimators during the computation of the learning curves. This can be done very easily because the `_fit_and_score` method already has a `return_times` parameter. I suggest setting it to `True` when computing the learning curves.
Any other comments?
Here is how fit and score times can be used to get valuable information:
Using the following setup:
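A plausible reconstruction of such a setup (the estimators follow the figure caption above; the CV scheme and hyperparameters are assumptions, not the author's exact code):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
train_sizes = np.linspace(0.1, 1.0, 5)

estimators = {
    "GBC": GradientBoostingClassifier(),
    "RF": RandomForestClassifier(n_estimators=100),
    "SVC": SVC(gamma=0.001),
}

# Collect test scores and fit times for each estimator and training size.
results = {}
for name, est in estimators.items():
    sizes, train_scores, test_scores, fit_times, score_times = learning_curve(
        est, X, y, cv=cv, train_sizes=train_sizes, return_times=True)
    results[name] = (sizes, test_scores, fit_times)
```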
Having the fit and score times helps us plot the following:
And also plots such as these:
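Plots along these lines (fit time vs. training size, and test score vs. fit time) can be produced from the `results` gathered in the setup sketch above, for example:

```python
import matplotlib.pyplot as plt

fig, (ax_time, ax_tradeoff) = plt.subplots(1, 2, figsize=(10, 4))

for name, (sizes, test_scores, fit_times) in results.items():
    # Scalability: how the fit time grows with the number of training samples.
    ax_time.plot(sizes, fit_times.mean(axis=1), "o-", label=name)
    # Trade-off: test score obtained for a given fitting cost.
    ax_tradeoff.plot(fit_times.mean(axis=1), test_scores.mean(axis=1), "o-", label=name)

ax_time.set_xlabel("n_samples")
ax_time.set_ylabel("fit time (s)")
ax_time.legend()
ax_tradeoff.set_xlabel("fit time (s)")
ax_tradeoff.set_ylabel("test score")
plt.tight_layout()
plt.show()
```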
As you can see, it is easy to determine the best estimator for the considered dataset, taking into account the "scalability" of the model: it helps us tell which model will perform best if we add more data to our dataset.