8000 [WIP] Sample properties by amueller · Pull Request #4696 · scikit-learn/scikit-learn · GitHub

[WIP] Sample properties #4696


Closed · wants to merge 13 commits

Conversation

amueller
Member
@amueller amueller commented May 8, 2015

Shot at #4497.

Todo:

  • handling in pipeline
  • add to all estimators fit
  • add to all estimators fit_transform
  • add to all estimators fit_predict
  • deprecate sample weights
  • don't use sample_weights in the code / examples
  • deprecate slicing of fit_params?
  • testing that all estimators support it in fit
  • testing that all estimators support it in fit_transform
  • testing that all estimators support it in fit_predict
  • test that sample_weights is deprecated everywhere
  • test that sample_weights works the same way as sample_props['sample_weights']
  • testing that estimators that don't support sample_props don't break
  • test that cross-validation and grid-search handle it appropriately.
  • examples
  • docs
  • change all docstrings

Issues

  • the len of dicts behaves differently from that of recarrays and dataframes. Not supporting dicts might lead to a cleaner interface, but I'm not sure.
  • Just using the new interface in GridSearchCV and cross_val_score will break all 3rd party estimators. Should we inspect / try-except and raise a deprecation warning? Or just try/except and not raise a warning?
  • It's unclear to me how to prevent typos in specifying sample_props column names / dictionary keys.
  • should it be passed to the score function, too?
  • should it be passed to partial_fit, too?
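The calling convention sketched by this PR can be illustrated with a toy estimator. This is a hypothetical sketch, not the actual scikit-learn API: the estimator name, the `sample_props` argument, and the `'sample_weights'` key are illustrative only. The point is that every value in the mapping is per-sample, so a meta-estimator can slice all of it alongside X and y without any routing rules.

```python
import numpy as np

class ToyEstimator:
    """Stand-in estimator accepting the proposed sample_props mapping."""
    def fit(self, X, y, sample_props=None):
        props = sample_props or {}
        # 'sample_weights' is one hypothetical key; any per-sample
        # array would be routed the same way.
        w = props.get('sample_weights', np.ones(len(X)))
        self.mean_ = np.average(y, weights=w)
        return self

def slice_props(sample_props, indices):
    # Every value is per-sample by contract, so slicing is unambiguous.
    return {k: np.asarray(v)[indices] for k, v in sample_props.items()}

X = np.arange(6).reshape(-1, 1)
y = np.array([0., 0., 1., 1., 1., 1.])
props = {'sample_weights': np.array([1., 1., 1., 1., 5., 5.])}

# What a CV loop would do for one fold: slice X, y, and every prop together.
train = np.array([0, 1, 4, 5])
est = ToyEstimator().fit(X[train], y[train],
                         sample_props=slice_props(props, train))
```

Contrast this with fit_params, where a meta-estimator cannot tell which entries are per-sample and which are global.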

@amueller amueller changed the title Sample properties [WIP] Sample properties May 8, 2015
@amueller amueller force-pushed the sample_properties branch from ba9cc83 to e7f0509 on May 9, 2015 22:53
@amueller
Member Author
amueller commented May 9, 2015

ping @GaelVaroquaux @ogrisel @jnothman @vene @agramfort
This is somewhat annoying work, and it would be good to get some buy-in ;)

@amueller
Member Author
amueller commented May 9, 2015

And also feedback for the issues.

@amueller
Member Author
amueller commented May 9, 2015

maybe @glouppe, @mblondel and @pprett also have opinions about sample_weight being deprecated?

@glouppe
Contributor
glouppe commented May 10, 2015

maybe @glouppe, @mblondel and @pprett also have opinions about sample_weight being deprecated?

Nothing against deprecating it, as long as there is still a way to pass weights.

However, I don't find the new name of sample_props very intuitive.

@agramfort
Member

Nothing against deprecating it, as long as there is still a way to pass
weights.

+1

However, I don't find the new name of sample_props very intuitive.

what do you suggest?

@glouppe
Contributor
glouppe commented May 11, 2015

what do you suggest?

I would at least not use abbreviations, and go instead for something like sample_properties.

Another thing is not entirely clear to me: why couldn't we simply put these properties into fit_params? (Sorry if I missed the discussion.)

@amueller
Member Author

The handling of fit_params is pretty inconsistent.
In particular, fit_params should really not be a constructor argument of GridSearchCV, but a **kwargs of GridSearchCV.fit, to allow nesting inside cross_val_score.

In a way, this mostly amounts to renaming fit_params, making it an explicit argument rather than **kwargs, and allowing dataframes and recarrays.

One thing that is not nice about fit_params is that it can contain "per sample" information and "global information" so we don't know whether we want to slice it or not.

After doing this, I am no longer entirely convinced that adding a new API is better than what we have plus **kwargs. It would mean that all estimators take **kwargs in fit, though, which is also not great.

@mblondel
Member

One thing that is not nice about fit_params is that it can contain "per sample" information and "global information" so we don't know whether we want to slice it or not.

We could use the convention that any fit_param whose name starts with sample_ needs to be sliced.
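The prefix convention suggested here can be sketched as a small helper a CV loop might use (a sketch only; the helper name and the choice of keys are illustrative, not anything in scikit-learn): entries named `sample_*` get sliced per fold, everything else passes through whole.

```python
import numpy as np

def split_fit_params(fit_params, indices):
    """Slice per-sample entries (prefix 'sample_') for one CV fold;
    pass global entries through untouched."""
    routed = {}
    for name, value in fit_params.items():
        if name.startswith('sample_'):
            routed[name] = np.asarray(value)[indices]  # per-sample: slice
        else:
            routed[name] = value                       # global: keep whole
    return routed

fit_params = {
    'sample_weight': np.array([1., 2., 3., 4.]),  # sliced per fold
    'callback': print,                            # global, never sliced
}
fold = np.array([0, 2])
out = split_fit_params(fit_params, fold)
```

This resolves the "slice or not?" ambiguity purely by naming, at the cost of making the prefix a hard API contract.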

@jnothman
Member

That idea is appealing and I will have to think a little more about it. One questionable advantage of sample_props was the ability to transmit arbitrary metadata through pipelines, with extraneous fields being ignored by intermediate transformers (the converse problem is that changes to weight support would create backwards-compatibility issues), although this could be solved in a more explicit, if user-burdening, manner with a routing parameter to the pipeline. The other advantage stated above is more seamless use of DataFrames, which I think is something we should be considering; the prefixing approach is still fine for that with **my_df, but it requires the dataframe to have awkward column names (X, y, sample_weight, sample_group, etc.).

On 12 May 2015 at 11:12, Mathieu Blondel notifications@github.com wrote:

One thing that is not nice about fit_params is that it can contain "per sample" information and "global information" so we don't know whether we want to slice it or not.

We could use the convention that any fit_param starting by sample_ needs to be sliced.


@agramfort
Member
agramfort commented May 12, 2015 via email

@jnothman
Member

The problem with sample_props is that support in a method for a particular property is then implicit in the API, resulting in documentation and versioning challenges.

On 12 May 2015 at 17:05, Alexandre Gramfort notifications@github.com wrote:

if we use sample_props, column names should be weight, group, which is not that awkward.


@agramfort
Member
agramfort commented May 12, 2015 via email

@mblondel
Member

sample_props doesn't seem fundamentally very different from fit_params: it's a dict containing additional data to be used for fitting. It is different from **kwargs in that the name fit_params appears in the docstring. We just need a rule to decide when the additional data passed in fit_params needs to be sliced for cross-validation. Above, I suggested that any data whose key is prefixed by sample_ needs to be sliced. I didn't completely get why we can't use data frames as input to fit_params.

@agramfort
Member
agramfort commented May 12, 2015 via email

@mblondel
Member

Good point for the name...

@amueller
Member Author

@mblondel the problem is that the way fit_params works is kinda broken.
You cannot do

cross_val_score(GridSearchCV(SGDClassifier(), params), fit_params={'sample_weights':sample_weights})

because then fit_params of cross_val_score will be passed to GridSearchCV.fit, while they would need to be given to GridSearchCV.__init__.

In that way I disagree with

sample_props doesn't seem fundamentally very different from fit_params

I would say

sample_props is exactly the same as fit_params moved to fit
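The nesting problem described above can be illustrated with toy wrappers (illustrative classes, not real scikit-learn ones): once fit-time data is accepted by fit itself rather than the constructor, every meta-estimator can simply forward what it receives to its child, to any depth.

```python
class Inner:
    """Leaf estimator that records the fit-time data it receives."""
    def fit(self, X, y, **fit_params):
        self.received_ = fit_params
        return self

class Wrapper:
    """Toy meta-estimator: forwards fit-time data to its child."""
    def __init__(self, child):
        self.child = child
    def fit(self, X, y, **fit_params):
        # Forwarding works at any nesting depth because the data
        # arrives at fit, not at __init__.
        self.child.fit(X, y, **fit_params)
        return self

inner = Inner()
# Two levels of nesting, analogous to cross_val_score around GridSearchCV.
Wrapper(Wrapper(inner)).fit([[0.0]], [0], sample_weights=[2.0])
```

With the current constructor-level fit_params of GridSearchCV, the outer call has no such channel, which is exactly the breakage shown above.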

@amueller
Member Author

Btw, maybe to make my goals clearer: I want these two things to work:

cross_val_score(GridSearchCV(SGDClassifier(), params), i_dont_care_what_its_called={'sample_weights':sample_weights})

and

cross_val_score(GridSearchCV(SGDClassifier(), params, cv=LeaveOneLabelOut()), i_dont_care_what_its_called={'labels': groups}, cv=LeaveOneLabelOut())
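The second goal above routes per-sample group labels to a group-aware splitter. A self-contained sketch (the splitter below is a tiny stand-in mimicking LeaveOneLabelOut's behaviour, not the scikit-learn class) shows what the splitter needs from the routed 'labels' entry:

```python
import numpy as np

def leave_one_label_out(labels):
    """Stand-in for LeaveOneLabelOut: one fold per unique label,
    testing on that label's samples and training on the rest."""
    labels = np.asarray(labels)
    for label in np.unique(labels):
        test = np.where(labels == label)[0]
        train = np.where(labels != label)[0]
        yield train, test

# The per-sample 'labels' array that would arrive via the routed mapping.
groups = np.array([0, 0, 1, 1, 2, 2])
splits = list(leave_one_label_out(groups))
```

Both the outer and the inner CV need this same per-sample array, which is why it has to travel with the data rather than being bound at construction time.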

@jnothman
Member
jnothman commented Jun 4, 2017

I think this remains the biggest basic design issue. It may deserve discussion at the sprint. It is tricky, and I suspect some backwards compatibility issues will have to be ignored.

@amueller
Member Author
amueller commented Jun 6, 2017

Do we have a list of use cases? Do we want a SLEP? As I pointed out above, the current sample_weight support is lacking. What other applications do we have? The cross-validation groups? Anything else?

@jnothman jnothman mentioned this pull request Aug 16, 2017
11 tasks
@amueller amueller closed this May 22, 2018