Added fit and score times for learning_curve by H4dr1en · Pull Request #13938 · scikit-learn/scikit-learn · GitHub

Conversation

@H4dr1en (Contributor) commented May 24, 2019

What does this implement/fix? Explain your changes.

It would be interesting to have access to the fit and score times of the estimators while computing learning curves. This can be done very easily because the _fit_and_score method already has a return_times parameter. I suggest setting it to True when computing learning curves.

Any other comments?

Here is how fit and score times can be used to get valuable information:

Using the following setup:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, Y = make_regression(n_samples=int(1e4), n_features=50, n_informative=25, bias=-92, noise=100)
estimator = RandomForestRegressor()

Having the fit and score times lets us plot the following:

[figure: learning_curve_doc_time_all — fit and score times across training set sizes]

And also such plots:

[figure: learning_curve_doc_diag]

As you can see, it is easy to determine the best estimator for the considered dataset, taking into account the "scalability" of the model: it helps tell which model will perform better if we add more data to our dataset.
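For reference, the proposed flag can be exercised roughly like this (a sketch: Ridge and the small synthetic problem are stand-ins chosen for speed, not part of the PR):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=500, n_features=20, noise=10, random_state=0)

# With return_times=True, learning_curve additionally returns per-split
# fit and score times, each of shape (n_train_sizes, n_cv_folds).
train_sizes, train_scores, test_scores, fit_times, score_times = learning_curve(
    Ridge(), X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5),
    return_times=True)

print(fit_times.shape)  # one row per training size, one column per CV fold
```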

@H4dr1en H4dr1en changed the title [MRG] Added fit and score times for learning curves [WIP] Added fit and score times for learning curves May 24, 2019
@H4dr1en H4dr1en changed the title [WIP] Added fit and score times for learning curves [MRG] Added fit and score times for learning curves May 24, 2019
@jnothman (Member) left a comment:

Thanks for the pull request. We will not make this kind of backwards-incompatible change. Please add a parameter such as return_times. It may also be worth adding a plot like that to an example.

@H4dr1en (Contributor, Author) commented May 27, 2019

I added backwards compatibility (the default behavior is as before, i.e. return_times=False).
I also added plot_learning_curves_times.py under examples/model_selection, which generates the following plot:

[figure: plot generated by plot_learning_curves_times.py]

Should I take care of generating the doc for this example? If yes, how?

@jnothman (Member) commented May 27, 2019 via email


Member (review comment on the example code, at plt.show()):
Is there a way to actually integrate this into the example above, rather than having an entirely separate demonstration of this functionality?

@H4dr1en (Contributor, Author) replied May 28, 2019:
I guess we can, I will try!

EDIT: I am wondering whether it is a good idea to merge the examples, as the first one focuses on showing two distinct learning curves, whereas the second focuses on stacking the fit time curves of multiple estimators to find the best one.

So could you be more specific about what you would expect?

For the moment, I guess you would like to directly retrieve the fit times of the two estimators in the first example (GaussianNB and SVC) and plot the two corresponding fit time curves (the new feature). If so, I can take care of that. Otherwise, if we want to show more curves (and therefore more estimators), we should keep the examples separate (or would you prefer to add other classifiers to the first example?).

Member:

So could you be more specific about what you would expect?

Why not just add your two new plots (n_samples vs. fit_time, and fit_time vs. score) for both pre-existing estimators?

That would be a 3-by-2 grid of plots and would avoid having this whole new code section.
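The three kinds of panels under discussion can be sketched roughly as follows (an illustration, not the PR's actual example code; GaussianNB stands in for one of the two estimators):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=600, random_state=0)
train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
    GaussianNB(), X, y, cv=5, return_times=True)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
# 1) the classic learning curve
axes[0].plot(train_sizes, test_scores.mean(axis=1), "o-")
axes[0].set(xlabel="Training examples", ylabel="Score", title="Learning curve")
# 2) scalability: training set size vs. fit time
axes[1].plot(train_sizes, fit_times.mean(axis=1), "o-")
axes[1].set(xlabel="Training examples", ylabel="fit_time (s)", title="Scalability")
# 3) performance: fit time vs. score
axes[2].plot(fit_times.mean(axis=1), test_scores.mean(axis=1), "o-")
axes[2].set(xlabel="fit_time (s)", ylabel="Score", title="Performance")
fig.savefig("learning_curve_times.png")
```

Repeating the three panels for the second estimator gives the grid discussed above.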

@H4dr1en (Contributor, Author) replied May 29, 2019:

OK, so here are the resulting plots with changes in ea0cc71:

[figure: resulting grid of learning-curve and timing plots]

Member:

If you look at the generated doc (https://59951-843222-gh.circle-artifacts.com/0/doc/auto_examples/model_selection/plot_learning_curve.html), the plots are very small. Maybe try 3 rows and 2 columns instead.
(You can build the doc locally by following these guidelines: https://scikit-learn.org/dev/developers/contributing.html#building-the-documentation)

@H4dr1en (Contributor, Author) replied:

I am having trouble generating the doc. I am using a miniconda venv, so I enter the following at the command prompt:

(py3) path\scikit-learn\doc>set EXAMPLES_PATTERN=plot_learning_curve.py
(py3) path\scikit-learn\doc>make html

But all the examples are then generated (the regex variable is not taken into account), and this takes forever on my computer. I also tried with (py3) path\scikit-learn\doc>set EXAMPLES_PATTERN=*plot_learning_curve.py, but got the same result. How would you do it?

Member:
The doc says EXAMPLES_PATTERN=your_regex_goes_here make html, in one command.
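For the record, the two shells handle this differently (a sketch; only the echo line below is actually executed here, the make invocations are shown as comments):

```shell
# Unix shells support a one-shot environment assignment before a command, e.g.:
#   EXAMPLES_PATTERN=plot_learning_curve make html
# The assignment is visible to that one command only:
EXAMPLES_PATTERN=plot_learning_curve sh -c 'echo "pattern=$EXAMPLES_PATTERN"'
# cmd.exe on Windows has no one-shot form; there, 'set' persists for the
# session, so chain the two commands instead:
#   set EXAMPLES_PATTERN=plot_learning_curve&& make html
```

This explains the "unknown command" error: cmd.exe does not understand the VAR=value prefix syntax.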

@H4dr1en (Contributor, Author) replied Jun 6, 2019:

This gave me an "unknown command" error, but anyway, make html eventually worked.

@NicolasHug (Member) left a comment:

A few comments; looks good in general.

I'm not a huge fan of returning tuples of different sizes depending on the input, but I guess it's too late to return a Bunch anyway.



@jnothman (Member) left a comment:

@thomasjpfan wrote on gitter:

This seems like a good candidate for returning a dictionary (or a Bunch). I think going through a deprecation cycle for it is a little rough.

Are you trying to suggest changing the return value without deprecation? I don't think that's a good idea, certainly not unless you have a return value that can unpack as three elements.

Or are you suggesting that you want feedback on whether we should consider deprecating and moving to a Bunch? I agree that deprecation would be disruptive to users and tutorials, so I'm not entirely fond of it.

If we want to change to a Bunch, then instead of return_times we should add return_bunch and include the times in the Bunch, make the default value change from False to True over a couple of versions, etc.

@thomasjpfan (Member):

Or are you suggesting that you want feedback on whether we should consider deprecating and moving to a Bunch?

I was looking for feedback on such an API change. (I will be more direct next time.) Even in the example we have something like this:

train_sizes, train_scores, test_scores, fit_times, _ = \
    learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
                   train_sizes=train_sizes,
                   return_times=True)

Every time I see \ before the function is even called, I think of returning a Bunch.
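For context, a Bunch is a dict with attribute access, which is what makes it attractive compared to an ever-growing return tuple (a toy sketch with made-up values, not the merged API):

```python
from sklearn.utils import Bunch

# A Bunch is a dict whose keys are also attributes; callers pick the fields
# they want by name instead of positionally unpacking (and discarding parts
# of) a long tuple.
results = Bunch(train_scores=[0.9, 0.95], fit_times=[0.1, 0.2])

print(results.fit_times)        # attribute access
print(results["train_scores"])  # dict-style access also works
```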

@jnothman (Member) commented May 30, 2019 via email

Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>
@NicolasHug (Member) left a comment:

Minimal comments. LGTM anyway.


@jnothman (Member) left a comment:

Otherwise LGTM. We can separately consider a return_bunch option, I suppose.

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors, if applicable) with :user:.

@NicolasHug (Member):

@H4dr1en, please look at https://github.com/scikit-learn/scikit-learn/pull/13938/files and revert all the unrelated changes.

@NicolasHug NicolasHug changed the title [MRG] Added fit and score times for learning curves Added fit and score times for learning_curve Jun 14, 2019
@NicolasHug NicolasHug merged commit b28aadf into scikit-learn:master Jun 14, 2019
@NicolasHug (Member):

Thanks @H4dr1en !

@rth (Member) commented Jun 16, 2019

It looks like Circle CI is failing on master after this PR was merged?

@thomasjpfan (Member):

Raised an issue regarding this at #14098.
