WIP Grid search convenience class. by amueller · Pull Request #1034 · scikit-learn/scikit-learn

Closed · 10 commits
Conversation

@amueller (Member)

This aims at closing #1020.

It introduces a new class to handle the output of GridSearchCV.
I don't like complexity but I haven't found a nice way to do this otherwise. If you have less complex solutions, please let me know.

Basically this transforms the dicts that are usually in GridSearchCV.grid_scores_ (remember, this is in general a list of dictionaries) into a list of parameters (which is sorted(param_grid.keys())) and an array where each axis corresponds to one parameter and the last axis corresponds to the folds.
I think this is already an improvement.
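
For illustration, a minimal sketch of that transformation, assuming the (params, mean_score, fold_scores) tuple layout of grid_scores_ and that its entries are ordered by iterating the grid over the sorted keys (all names here are made up):

import numpy as np

def scores_to_array(grid_scores, param_grid):
    # One axis per parameter (in sorted key order), last axis over CV folds.
    params = sorted(param_grid.keys())
    shape = [len(param_grid[p]) for p in params]
    n_folds = len(grid_scores[0][2])
    fold_scores = np.array([scores for _, _, scores in grid_scores])
    return params, fold_scores.reshape(shape + [n_folds])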

The reason I added the class is that I also want to marginalize parameters: I want to look at one parameter even if I have 5 parameters to adjust. Maximizing over multiple axes is ugly, so I wanted the class to handle this.
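
A hypothetical helper for that marginalization (not the PR's actual API; it operates on the array built in the sketch above):

def marginalize_max(scores, params, param):
    # Average over the fold axis, then maximize over every parameter
    # axis except the one we want to inspect.
    fold_means = scores.mean(axis=-1)
    keep = params.index(param)
    other_axes = tuple(i for i in range(len(params)) if i != keep)
    return fold_means.max(axis=other_axes)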

As always, any comments welcome.
I'll make an example to illustrate the usefulness now :)

Going back to WIP, as I think this should be designed with non-grid evaluations of estimators in mind.

@amueller (Member Author)

The current example is not polished and takes too long, but you can have a look to get the general idea.
Btw, I noticed that using standard deviations is not a good idea and I should use the binomial thing. Could be an option or something. Finishing for today.

@amueller (Member Author)

@agramfort you seemed quite enthusiastic about the topic ;) Any opinions?

An alternative to the marginalized parameter plot would be a scatterplot showing all values along a given dimension, btw. I don't know how to plot uncertainty there, though.

@amueller (Member Author)

The new example looks something like this:
[figure: grid_search_plots]

I'm not terribly happy with the graphs, but the run time shouldn't go up too much. Any ideas for a better example are welcome.

I'm quite happy with the code for the example, though. Also note how the SVM grid search example got a bit nicer.

@ogrisel (Member) commented Aug 23, 2012

It would be great to measure the fit / predict times for each parameter and add them to the collected statistics and plots.
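
For illustration, a sketch of collecting such timings with a toy estimator (not code from this PR):

import time
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
clf = SVC()

# Time the fit for this parameter setting.
start = time.time()
clf.fit(iris.data, iris.target)
fit_time = time.time() - start

# Time the prediction as well.
start = time.time()
clf.predict(iris.data)
predict_time = time.time() - start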

@amueller (Member Author)

I definitely won't have time to do that before the release.
What do you think of the current approach?
It would be easy to give the class another function to get the timings later.

@ogrisel (Member) commented Aug 23, 2012

I don't have time to review in detail right now, but I certainly find the overall idea a good approach.

I would like to have a similar "result class" for the classification_report utility function and have the __repr__ method compute the same string as the current implementation of the classification_report function.
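
A hypothetical sketch of that result-class idea, holding the raw inputs and delegating the string formatting (names are assumptions, not a settled API):

from sklearn.metrics import classification_report

class ClassificationReport(object):
    # Hold the raw predictions; render the familiar text on repr().
    def __init__(self, y_true, y_pred, target_names=None):
        self.y_true = y_true
        self.y_pred = y_pred
        self.target_names = target_names

    def __repr__(self):
        return classification_report(self.y_true, self.y_pred,
                                     target_names=self.target_names)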

"""
print __doc__

import matplotlib.pyplot as plt
Review comment (Member):
import pylab as pl

@agramfort (Member)

Besides this, the design looks good to me.

@amueller (Member Author)

Thanks for the feedback. Glad you like the design :)

axes = axes.ravel()
for ax, param in zip(axes, cv_scores.params):
    means, errors = cv_scores.accumulated(param, 'max')
    ax.errorbar(cv_scores.values[param], means, yerr=errors)
Review comment (Member):
As mentioned on IRC, I think boxplots would be better than errorbars.
I also think the labels should be set on both the X and Y axes.
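
For illustration, a self-contained sketch of the boxplot idea with made-up scores (the parameter name and values are invented):

import numpy as np
import matplotlib.pyplot as plt

values = [0.01, 0.1, 1.0, 10.0]                                  # fake parameter values
fold_scores = [np.random.uniform(0.7, 0.9, 5) for _ in values]   # fake per-fold scores

fig, ax = plt.subplots()
ax.boxplot(fold_scores)                  # one box per parameter value
ax.set_xticklabels([str(v) for v in values])
ax.set_xlabel("C")
ax.set_ylabel("cross-validation score")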

Review comment (Member):

Also, I think the three plots should be scaled the same way.

@amueller (Member Author) commented Sep 2, 2012

About the boxplots: that would mean I have to return all sample points, not only summary statistics.
That could be done, but I don't see a huge benefit in using boxplots. Instead, I'd rather go for the correct statistical measure, which I have to look up ;)
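
If "the correct statistical measure" means a binomial confidence interval on an accuracy, a Wilson score interval would be one candidate (my assumption; the thread never settles this):

import numpy as np

def wilson_interval(accuracy, n, z=1.96):
    # 95% Wilson score interval for a proportion estimated from n samples.
    center = (accuracy + z ** 2 / (2 * n)) / (1 + z ** 2 / n)
    half = (z / (1 + z ** 2 / n)) * np.sqrt(
        accuracy * (1 - accuracy) / n + z ** 2 / (4 * n ** 2))
    return center - half, center + half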

@amueller (Member Author) commented Sep 3, 2012

I am not so happy about this any more. It assumes an actual grid. I think the object should be constructed such that it supports random sampling and other ways to try out parameters.
This opens up a whole new can of worms, though...
Because then you might feed a stored object into the searching object so as to refine the search or something... not sure if these should be addressed together.

What was the API idea for the random search? How do we report / store progress? You definitely want to be able to continue a search that was started before...

@ogrisel (Member) commented Sep 3, 2012

Indeed, supporting @jaberg's randomized grid search would be neat (even if we decide to make it a non-default option). Hence the report object should be able to deal with that, and maybe also with the case where the user interrupts the grid search early (e.g. with a keyboard interrupt), in which case the report would cover an incomplete grid.

I think we should postpone the discussion for after the release.

@amueller (Member Author) commented Sep 3, 2012

Yeah sure, this is not release-related. But I think I should do this together with #455 and maybe an interrupt-robust GridSearchCV or similar...

@agramfort (Member)

Could be a topic for a sprint at PyCon FR...

@amueller (Member Author) commented Sep 3, 2012

Sure, why not.

@amueller mentioned this pull request Sep 21, 2012
@amueller (Member Author)

Will look at @jaberg's JMLR paper before continuing this, and then hopefully do randomized search together with sensible visualization / analysis for both grid and random search.

@GaelVaroquaux (Member)

> I am not so happy about this any more. It assumes an actual grid. I think the object should be constructed such that it supports random sampling and other ways to try out parameters.

That's exactly what I was thinking when reviewing this PR :)

@jnothman (Member)

One nice thing about this feature is being able to examine variation across some parameters while taking the max or average over the others. I would like an argmax equivalent, which basically takes a parameter grid (or other space) and selects the parameter settings (i.e. returns an index array) corresponding to some sub-grid, choosing the max-scoring instance for each point.

With #1842, one could do something like:

# Get an index over the parameters of interest
ind_params, index = sklearn.grid_search.ParameterGrid(grid.param_grid).build_index(interesting_params, ravel=True)
# Reshape our results (assuming structured array output) and max out over remaining params
best = grid.grid_results_[index]['test_score'].argmax(axis=-1)
# Get back indices into original results (perhaps not the best way to do it)
index.flat[best.flat + (index.shape[-1] * np.arange(best.size))]

but it similarly doesn't apply nicely to non-grid shapes.

@amueller (Member Author)

I abandoned the PR because it is way out of date, but I would still like to see something like it.
It should be able to handle non-grid structure, though... if that is possible in a sensible way...

@jnothman (Member)

Yeah, I noticed. Arranging different parameters along different axes really doesn't work for non-grids, but can be really useful for grids, which is why #1842 proposes building such a structure within ParameterGrid, with the constraint that when it uses a list of grids, you can only index one at a time.

I've also privately implemented returning parameters as a record array, which means they can be more easily sliced and diced using numpy comparators and indexing operations. (And if each distinct group of values for some fields can be assigned an integer, calling bincount twice can find the average score; the argmax score can also be found. Or there's pgnumpy.) But because it's possible to get parameter spaces where some parameters are simply not set for some points (and this will happen more often once users are able to replace pipeline steps), one needs to use a masked recarray, which comes with somewhat quirky behaviour that makes it not very user-friendly.
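
The bincount trick in toy form (the ids and scores are made up):

import numpy as np

# Give each distinct parameter setting an integer id; two bincounts then
# yield the mean score per setting.
group_ids = np.array([0, 0, 1, 1, 2, 2])
scores = np.array([0.80, 0.90, 0.70, 0.75, 0.60, 0.65])
mean_per_group = np.bincount(group_ids, weights=scores) / np.bincount(group_ids)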

And perhaps to make this data wrangling user-friendly it should be within an accessory class, which could even provide things like a Wilcoxon signed-rank test between fold results under different parameters. Or perhaps that's all unnecessary bloat.

@jnothman (Member)

Here's my own tool to do some of these things: https://gist.github.com/jnothman/5480026
But I've realised this is more-or-less what the pandas project specialises in.

@amueller (Member Author)

Actually, some people from my lab use pandas for exactly that purpose. So maybe one goal would be to make it easy to create a dataframe from the results? I still haven't got around to reviewing your stuff :-/ Wednesday is the next deadline...
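
For illustration, building a DataFrame could be as simple as this sketch, assuming the (params, mean_score, fold_scores) layout of grid_scores_ at the time:

import pandas as pd

# grid_search is a fitted GridSearchCV (hypothetical context);
# each row holds one parameter setting plus its mean score.
rows = [dict(params, mean_score=mean_score)
        for params, mean_score, fold_scores in grid_search.grid_scores_]
df = pd.DataFrame(rows)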

@amueller (Member Author)

Btw, accumulating with the mean is actually a bad idea, as the variables are, err... beta distributed?

@jnothman (Member)

Okay, so let's say we had two attributes or functions to get back:

  • the parameters and average search results as an mrecarray
  • the parameters and fold results, with an additional fold-number field, as an mrecarray

(I.e. to avoid the heterogeneous data types/shapes, we let users take either the raw fold scores or the aggregate scores, not both at the same time.)

These have the traditional table form and can be played with in numpy, or passed to Pandas, or dumped to spreadsheet / *SQL / mongodb.
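
A toy version of the second table, structured rather than masked for brevity (field names and values are made up):

import numpy as np

dtype = [('C', float), ('gamma', float), ('fold', int), ('test_score', float)]
results = np.array([(1.0, 0.1, 0, 0.80),
                    (1.0, 0.1, 1, 0.85),
                    (10.0, 0.1, 0, 0.78),
                    (10.0, 0.1, 1, 0.82)], dtype=dtype)

# Slice and dice with numpy comparators, e.g. fold scores where C == 1.0:
scores_for_c1 = results[results['C'] == 1.0]['test_score']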

@jnothman (Member)

What do you think of tabulate_results?

@GaelVaroquaux (Member)

> • the parameters and average search results as an mrecarray
> • the parameters and fold results, with an additional fold-number field, as an mrecarray
>
> (I.e. to avoid the heterogeneous data types/shapes, we let users take either the raw fold scores or the aggregate scores, not both at the same time.) These have the traditional table form and can be played with in numpy, or passed to Pandas, or dumped to spreadsheet / *SQL / mongodb.

In terms of data presentation, this seems a reasonable set of choices, IMHO.
