[MRG] Model selection documentation #4
@@ -143,43 +143,58 @@ Classes
   covariance.oas
   covariance.graph_lasso

.. _model_selection_ref:
As I said, I think I would keep the modules in the references under an "old and will be removed" header. I'm not sure what @vene, @GaelVaroquaux, or @jnothman think about that, though.
Is that model_selection_ref tag used anywhere?
Ok so all the "ref" tags were added in 6709ed5, the usage later got removed, and we kept adding the tags in good old cargo-cult fashion?
I'm +0.5 on keeping the "old and will be removed" section. It might hurt googlability, so we need good links to the new non-deprecated files.
also ping @larsmans ;)
> Ok so all the "ref" tags were added in 6709ed5, the usage later got removed, and we kept adding the tags in good old cargo-cult fashion?

Ah :P
I'm -1 for keeping the old modules listed here, but I don't object to having their docs generated (by putting automodule somewhere hidden, say "deprecated.rst") with a clear deprecation-oriented docstring, if that's what you're more-or-less going for, @amueller. I'm not sure whether there's an easy way to set rel="canonical" to point to the new equivalents, if that's your concern @vene.
I'm also convinced now that we should not list them here. People can always look at the docs in IPython or the source if they like.
+.. _model_selection_ref:

> As I said, I think I would keep the modules in the references under an
> "old and will be removed" header. I'm not sure what @vene,
> @GaelVaroquaux, or @jnothman think about that, though.
Sorry for the slow reply. I think that it is a good suggestion.
Not sure what to do with all the mentions in whatsnew...
@@ -618,24 +610,6 @@ From text
   lda.LDA


.. _learning_curve_ref:
I'm confused that these tags were not used anywhere...
I was also wondering about that :P but I thought maybe it's better to leave it unchanged, since we'll be adding a whatsnew entry to note that they are grouped into
Oh are there? I'll fix them right away! EDIT: fixed!
model_selection.cross_val_score
model_selection.cross_val_predict
model_selection.permutation_test_score
model_selection.check_cv
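For context on the helpers listed above, here is a minimal sketch of calling them from the new top-level module. This assumes a released scikit-learn (>= 0.18, where sklearn.model_selection exists); the particular classifier and dataset are illustrative choices, not from the PR.

```python
# Illustrative sketch (not from the PR): the same helpers that previously
# lived in sklearn.cross_validation are imported from sklearn.model_selection.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_val_predict

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# One score per fold (cv=5 here).
scores = cross_val_score(clf, X, y, cv=5)

# Out-of-fold predictions, one per sample.
predictions = cross_val_predict(clf, X, y, cv=5)
print(scores.shape, predictions.shape)
```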
Maybe a stupid question, but why is check_cv public? Input from @vene @GaelVaroquaux @larsmans @jnothman welcome.
So it can be used by extension modules, I guess. Here are some GitHub code search hits. This is like parts of sklearn.utils in that users don't need it directly, but it's useful for library extensions.
https://github.com/nilearn/nilearn/blob/master/nilearn/decoding/tests/test_searchlight.py#L31
https://github.com/experiencor/Data-analysis-projects/blob/4a3d20efd17a555028b8a9080f5ac6e63eff42ba/Decoding%20Human%20Brain/mne-python-master/mne/decoding/time_gen.py#L88
https://github.com/peret/visualize-bovw/blob/386bdc172a8a7766e915e65fabcc9e49c2e0bf23/weighted_grid_search.py#L13
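A short sketch of what check_cv offers such extension libraries: it normalizes the several accepted forms of the cv argument (an integer, None, or a cross-validator object) into a single cross-validator. This assumes scikit-learn >= 0.18, where check_cv lives in sklearn.model_selection; the toy arrays are illustrative.

```python
# Illustrative sketch: check_cv normalizes the cv argument so extension
# libraries don't have to handle every accepted form themselves.
import numpy as np
from sklearn.model_selection import KFold, check_cv

y = np.array([0, 0, 0, 1, 1, 1])

# An integer is expanded to a (Stratified)KFold with that many splits.
cv = check_cv(3, y, classifier=True)
print(cv.get_n_splits(np.zeros((6, 1)), y))  # 3

# An existing cross-validator is passed through unchanged.
kf = KFold(n_splits=2)
print(check_cv(kf) is kf)  # True
```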
Keep it public.
>>> slo = LabelShuffleSplit(labels, n_iter=4, test_size=0.5,
...                         random_state=0)
>>> for train, test in slo:
>>> slo = LabelShuffleSplit(n_iter=4, test_size=0.5, random_state=0)
slo? Maybe lss?
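The new-style API under review constructs the cross-validator without data and yields indices from a split method. In released scikit-learn (>= 0.18 final) LabelShuffleSplit was further renamed to GroupShuffleSplit and n_iter to n_splits; this sketch uses those final names, so it differs from the diff above on exactly those points.

```python
# Illustrative sketch using the released names (GroupShuffleSplit / n_splits
# rather than LabelShuffleSplit / n_iter as in the diff under review).
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.ones((8, 2))
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])

lss = GroupShuffleSplit(n_splits=4, test_size=0.5, random_state=0)
for train, test in lss.split(X, groups=groups):
    # Each group lands entirely on one side of the split.
    print(train, test)
```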
LGTM apart from minor nitpicks. If we rename
Traceback (most recent call last):
ValueError: 'wrong_choice' is not a valid scoring value. Valid options are ['accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc']
>>> clf = svm.SVC(probability=True, random_state=0)
>>> cross_validation.cross_val_score(clf, X, y, scoring='log_loss') # doctest: +ELLIPSIS
>>> cross_val_score(clf, X, y, scoring='log_loss') # doctest: +ELLIPSIS
array([-0.07..., -0.16..., -0.06...])
Please reverse the order of the two examples: start with the valid use case (scoring='log_loss') before the failure case (scoring='wrong_choice').
Squashed commit messages (for reference)
----------------------------------------

Major
-----

* ENH Reorganize classes/fn from grid_search into search.py
* ENH Reorganize classes/fn from cross_validation into split.py
* ENH Reorganize cls/fn from cross_validation/learning_curve into validate.py
* MAINT Merge _check_cv into check_cv inside the model_selection module
* MAINT Update all the imports to point to the model_selection module
* FIX use iter_cv to iterate through the new style/old style cv objs
* TST Add tests for the new model_selection members
* ENH Wrap the old-style cv obj/iterables instead of using iter_cv
* ENH Use scipy's binomial coefficient function comb for calculation of nCk
* ENH Few enhancements to the split module
* ENH Improve check_cv input validation and docstring
* MAINT _get_test_folds(X, y, labels) --> _get_test_folds(labels)
* TST if 1d arrays for X introduce any errors
* ENH use 1d X arrays for all tests
* ENH X_10 --> X (global var)

Minor
-----

* ENH _PartitionIterator --> _BaseCrossValidator
* ENH CVIterator --> CVIterableWrapper
* TST Import the old SKF locally
* FIX/TST Clean up the split module's tests
* DOC Improve documentation of the cv parameter
* COSMIT consistently hyphenate cross-validation/cross-validator
* TST Calculate n_samples from X
* COSMIT Use separate lines for each import
* COSMIT cross_validation_generator --> cross_validator

Commits merged manually
-----------------------

* FIX Document the random_state attribute in RandomSearchCV
* MAINT Use check_cv instead of _check_cv
* ENH refactor OVO decision function, use it in SVC for sklearn-like decision_function shape
* FIX avoid memory cost when sampling from large parameter grids

ENH Major to Minor incremental enhancements to the model_selection

Squashed commit messages (for reference)

Major
-----

* ENH p --> n_labels
* FIX *ShuffleSplit: all float/invalid type errors at init and int error at split
* FIX make PredefinedSplit accept test_folds in constructor; Cleanup docstrings
* ENH+TST KFold: make rng to be generated at every split call for reproducibility
* FIX/MAINT KFold: make shuffle a public attr
* FIX Make CVIterableWrapper private
* FIX reuse len_cv instead of recalculating it
* FIX Prevent adding *SearchCV estimators from the old grid_search module
* re-FIX In all_estimators: the sorting to use only the 1st item (name), to avoid collision between the old and the new GridSearch classes
* FIX test_validate.py: Use 2D X (1D X is being detected as a single sample)
* MAINT validate.py --> validation.py
* MAINT make the submodules private
* MAINT Support old cv/gs/lc until 0.19
* FIX/MAINT n_splits --> get_n_splits
* FIX/TST test_logistic.py/test_ovr_multinomial_iris: pass predefined folds as an iterable
* MAINT expose BaseCrossValidator
* Update the model_selection module with changes from master

  - From scikit-learn#5161 - MAINT remove redundant p variable; Add check for sparse prediction in cross_val_predict
  - From scikit-learn#5201 - DOC improve random_state param doc
  - From scikit-learn#5190 - LabelKFold and test
  - From scikit-learn#4583 - LabelShuffleSplit and tests
  - From scikit-learn#5300 - shuffle the `labels` not the `indxs` in LabelKFold + tests
  - From scikit-learn#5378 - Make the GridSearchCV docs more accurate
  - From scikit-learn#5458 - Remove shuffle from LabelKFold
  - From scikit-learn#5466 (scikit-learn#4270) - Gaussian Process by Jan Metzen
  - From scikit-learn#4826 - Move custom error / warnings into sklearn.exceptions

Minor
-----

* ENH Make the KFold shuffling test stronger
* FIX/DOC Use the higher level model_selection module as ref
* DOC in check_cv "y : array-like, optional"
* DOC a supervised learning problem --> supervised learning problems
* DOC cross-validators --> cross-validation strategies
* DOC Correct Olivier Grisel's name ;)
* MINOR/FIX cv_indices --> kfold
* FIX/DOC Align the 'See also' section of the new KFold, LeaveOneOut
* TST/FIX imports on separate lines
* FIX use __class__ instead of classmethod
* TST/FIX import directly from model_selection
* COSMIT Relocate the random_state documentation
* COSMIT remove pass
* MAINT Remove deprecation warnings from old tests
* FIX correct import at test_split
* FIX/MAINT Move P_sparse, X, y defns to top; rm unused W_sparse, X_sparse
* FIX random state to avoid doctest failure
* TST n_splits and split wrapping of _CVIterableWrapper
* FIX/MAINT Use multilabel indicator matrix directly
* TST/DOC clarify why we conflate classes 0 and 1
* DOC add comment that this was taken from BaseEstimator
* FIX use of labels is not needed in stratified k fold
* Fix cross_validation reference
* Fix the labels param doc

FIX/DOC/MAINT Addressing the review comments by Arnaud and Andy

* COSMIT Sort the members alphabetically
* COSMIT len_cv --> n_splits
* COSMIT Merge 2 if; FIX Use kwargs
* DOC Add my name to the authors :D
* DOC make labels parameter consistent
* FIX Remove hack for boolean indices; COSMIT idx --> indices; DOC Add Returns
* COSMIT preds --> predictions
* DOC Add Returns and neatly arrange X, y, labels
* FIX idx(s)/ind(s) --> indice(s)
* COSMIT Merge if and else to elif
* COSMIT n --> n_samples
* COSMIT Use bincount only once
* COSMIT cls --> class_i / class_i (ith class indices) --> perm_indices_class_i

FIX/ENH/TST Addressing the final reviews

* COSMIT c --> count
* FIX/TST make check_cv raise ValueError for string cv value
* TST nested cv (gs inside cross_val_score) works for diff cvs
* FIX/ENH Raise ValueError when labels is None for label based cvs; TST if labels is being passed correctly to the cv and that the ValueError is being propagated to the cross_val_score/predict and grid search
* FIX pass labels to cross_val_score
* FIX use make_classification
* DOC Add Returns; COSMIT Remove scaffolding
* TST add a test to check the _build_repr helper
* REVERT the old GS/RS should also be tested by the common tests
* ENH Add a tuple of all/label based CVS
* FIX raise VE even at get_n_splits if labels is None
* FIX Fabian's comments
* PEP8
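One user-visible change buried in the commit log above (n_splits --> get_n_splits, wrapping of old-style cv iterables) is that cross-validators are now constructed without data, expose get_n_splits(), and yield indices from a split method rather than being iterated directly. A minimal sketch, assuming scikit-learn >= 0.18:

```python
# Illustrative sketch of the new-style cross-validator API: construct without
# data, query get_n_splits(), and iterate over .split(X) for fold indices.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

print(kf.get_n_splits())  # 5
for train, test in kf.split(X):
    print(train, test)
```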
Documentation for scikit-learn#4294.
The built docs are over here.
TODO:
- grid_search.rst to search.rst