WIP Grid search convenience class. #1034
Conversation
The current example is not polished and takes too long, but you can have a look to get the general idea.
@agramfort you seemed quite enthusiastic about the topic ;) Any opinions? An alternative to the …
It would be great to measure the fit / predict times for each parameter and add them to the collected statistics and plots.
I definitely won't have time to do that before the release.
I don't have the time to review in detail right now but I surely find the global idea a good approach. I would like to have a similar "result class" for the …
""" | ||
print __doc__ | ||
|
||
import matplotlib.pyplot as plt |
`import pylab as pl`
Besides this, the design looks good to me.
Thanks for the feedback. Glad you like the design :)
```python
axes = axes.ravel()
for ax, param in zip(axes, cv_scores.params):
    means, errors = cv_scores.accumulated(param, 'max')
    ax.errorbar(cv_scores.values[param], means, yerr=errors)
```
As mentioned on IRC, I think boxplots would be better than errorbars.
I also think the labels should be set on both the X and Y axes.
Also, I think the three plots should be scaled the same way.
About the boxplots: that would mean I have to return all sample points, not only summarized statistics.
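A rough matplotlib sketch of what the boxplot version could look like, assuming the result object grew a hypothetical `fold_scores(param, value)` accessor returning the raw per-fold scores (that accessor is invented here, precisely the "all sample points" data mentioned above):

```python
import matplotlib.pyplot as plt

# hypothetical: cv_scores.fold_scores(param, value) returns the raw
# per-fold scores for one parameter value (invented for illustration)
fig, axes = plt.subplots(1, len(cv_scores.params), squeeze=False)
for ax, param in zip(axes.ravel(), cv_scores.params):
    values = cv_scores.values[param]
    ax.boxplot([cv_scores.fold_scores(param, v) for v in values])
    ax.set_xticks(range(1, len(values) + 1))
    ax.set_xticklabels([str(v) for v in values])
    ax.set_xlabel(param)              # label both axes, as requested
    ax.set_ylabel('score')
    ax.set_ylim(0, 1)                 # scale all subplots the same way
plt.show()
```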
I am not so happy about this any more. It assumes an actual grid. I think the object should be constructed such that it supports random sampling and other ways to try out parameters. What was the API idea for the random search? How do we report / store progress? You definitely want to be able to continue a search that was started before....
Indeed, supporting @jaberg's randomized grid search would be neat (even if we decide to make that a non-default option). Hence the report object should be able to deal with that, and maybe with the case where the user interrupts the grid search early (e.g. using a keyboard interrupt), leaving the report with an incomplete grid. I think we should postpone the discussion for after the release.
Yeah sure, this is not release-related. But I think I should do this together with #455 and maybe an interrupt-robust …
Can be a topic for a sprint at PyCon FR...
Sure, why not.
Will look at @jaberg's JMLR paper before continuing this, and then hopefully do randomized search together with sensible visualization / analysis for both grid and random search.
That's exactly what I was thinking when reviewing this PR :)
One nice thing about this feature is being able to examine variation in some parameters while taking the max or average over the other parameters. I would like to be able to see an argmax equivalent, which basically takes a parameter grid (or other space) and selects the parameter settings (i.e. returns an index array) corresponding to some sub-grid (choosing the max-scoring instance for each point). With #1842, one could do something like:

```python
# Get an index over the parameters of interest
ind_params, index = sklearn.grid_search.ParameterGrid(grid.param_grid).build_index(interesting_params, ravel=True)
# Reshape our results (assuming structured array output) and max out over remaining params
best = grid.grid_results_[index]['test_score'].argmax(axis=-1)
# Get back indices into original results (perhaps not the best way to do it)
index.flat[best.flat + (index.shape[-1] * np.arange(best.size))]
```

but it similarly doesn't apply nicely to non-grid shapes.
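For a full grid, though, the max-over-remaining-parameters step is straightforward in plain numpy; `scores` and `param_names` below are stand-ins invented for illustration:

```python
import numpy as np

# scores: one axis per parameter, mean test score at each grid point
scores = np.random.rand(4, 3, 5)        # e.g. C x gamma x degree
param_names = ['C', 'gamma', 'degree']

interesting = ['C', 'gamma']
other = tuple(i for i, name in enumerate(param_names) if name not in interesting)
# best achievable score at each (C, gamma) point, maximizing out the rest
best_over_rest = scores.max(axis=other)  # shape (4, 3)
```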
I abandoned the PR because it is way out of date, but I would still like to see something like it.
Yeah, I noticed. Arranging different parameters along different axes really doesn't work for non-grids, but can be really useful for grids, which is why #1842 proposes building such a structure within …

I've also privately implemented returning parameters as a record array, which means it can be more easily sliced and diced using numpy comparators and indexing operations. (And if each distinct group of values for some fields can be assigned an integer, calling …)

And perhaps to make this data wrangling user-friendly it should be within an accessory class, which could even provide things like a Wilcoxon signed-rank test between fold results under different parameters. Or perhaps that's all unnecessary bloat.
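A rough illustration of the record-array idea; the field names and values are invented:

```python
import numpy as np

# one row per candidate, one field per parameter plus the score
results = np.array([(0.1, 'rbf', 0.81),
                    (1.0, 'rbf', 0.92),
                    (1.0, 'linear', 0.88)],
                   dtype=[('C', float), ('kernel', 'U10'), ('score', float)])

# slicing and dicing with numpy comparators and indexing
rbf_rows = results[results['kernel'] == 'rbf']
best = results[results['score'].argmax()]
```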
Here's my own tool to do some of these things: https://gist.github.com/jnothman/5480026
Actually, some people from my lab use pandas exactly for that purpose. So maybe one goal would be to make it easy to create a dataframe from the results? I still haven't got around to reviewing your stuff :-/ Wednesday is the next deadline....
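A minimal sketch of the dataframe idea; the per-candidate entry layout (a parameter dict plus a mean score) is an assumption for illustration:

```python
import pandas as pd

# stand-in for the search results: (parameter dict, mean score) pairs
results = [({'C': 0.1, 'gamma': 0.01}, 0.81),
           ({'C': 1.0, 'gamma': 0.01}, 0.92),
           ({'C': 1.0, 'gamma': 0.10}, 0.88)]
df = pd.DataFrame([dict(params, mean_score=score) for params, score in results])
# the usual pandas machinery then applies, e.g. best score per C value
print(df.groupby('C')['mean_score'].max())
```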
Btw, accumulating with the mean is actually a bad idea, as the variables are, err.. beta distributed?
Okay, so let's say we had two attributes or functions to get back: …
(I.e. to avoid the heterogeneous data types/shapes, we allow users to take either the raw fold scores or the aggregate scores, not both at the same time.) These have the traditional table form and can be played with in numpy, or passed to Pandas, or dumped to spreadsheet / *SQL / mongodb.
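A rough sketch of the two table shapes being proposed; the names and the choice of aggregates are illustrative only:

```python
import numpy as np

n_candidates, n_folds = 12, 5

# option 1: raw fold scores, one row per candidate, one column per fold
fold_scores = np.random.rand(n_candidates, n_folds)

# option 2: aggregates only, e.g. mean and std per candidate
aggregates = np.column_stack([fold_scores.mean(axis=1),
                              fold_scores.std(axis=1)])
```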
What do you think of …
In terms of data presentations, this seems a reasonable set of choices, …
This aims at closing #1020.
It introduces a new class to handle the output of GridSearchCV.
I don't like complexity but I haven't found a nice way to do this otherwise. If you have less complex solutions, please let me know.
Basically this transforms the dicts that are usually in `GridSearchCV.grid_scores_` (remember, this is in general a list of dictionaries) into a list of parameters (which is `sorted(param_grid.keys())`) and an array where each axis corresponds to one parameter and the last axis corresponds to folds.
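A minimal sketch of that transformation, assuming a full grid; the per-entry dict layout (the 'params' and 'fold_scores' keys) is invented here for illustration:

```python
import itertools
import numpy as np

param_grid = {'C': [0.1, 1.0, 10.0], 'gamma': [0.01, 0.1]}
n_folds = 5
params = sorted(param_grid.keys())          # ['C', 'gamma']

# stand-in for grid_scores_: one dict per grid point
grid_scores_ = [{'params': dict(zip(params, combo)),
                 'fold_scores': np.random.rand(n_folds)}
                for combo in itertools.product(*(param_grid[p] for p in params))]

# one axis per parameter, last axis over folds
scores = np.empty([len(param_grid[p]) for p in params] + [n_folds])
for entry in grid_scores_:
    idx = tuple(param_grid[p].index(entry['params'][p]) for p in params)
    scores[idx] = entry['fold_scores']
```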
I think this is already an improvement.
The reason why I added the class is that I also want to marginalize parameters. I want to look at it even if I have 5 parameters to adjust. Maximizing over multiple axes is ugly, so I wanted the class to handle this.
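Continuing the sketch above, marginalizing one parameter could then look like this: maximize the score array over all other parameter axes, then average over folds.

```python
# look at 'C' alone: maximize out every other parameter axis
i = params.index('C')
other_axes = tuple(j for j in range(len(params)) if j != i)
per_fold = scores.max(axis=other_axes)      # shape: (n_C_values, n_folds)
means, errors = per_fold.mean(axis=-1), per_fold.std(axis=-1)
```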
As always, any comments welcome.
I'll make an example to illustrate the usefulness now :)
Going back to WIP as I think this should be designed with non-grid evaluations of estimators in mind.