8000 [WIP] Sample properties by amueller · Pull Request #4696 · scikit-learn/scikit-learn · GitHub

[WIP] Sample properties #4696


Closed · wants to merge 13 commits

Conversation

amueller
Member
@amueller amueller commented May 8, 2015

Shot at #4497.

Todo:

  • handling in pipeline
  • add to all estimators fit
  • add to all estimators fit_transform
  • add to all estimators fit_predict
  • deprecate sample weights
  • don't use sample_weights in the code / examples
  • deprecate slicing of fit_params?
  • testing that all estimators support it in fit
  • testing that all estimators support it in fit_transform
  • testing that all estimators support it in fit_predict
  • test that sample_weights is deprecated everywhere
  • test that sample_weights works the same way as sample_props['sample_weights']
  • testing that estimators that don't support sample_props don't break
  • test that cross-validation and grid-search handle it appropriately.
  • examples
  • docs
  • change all docstrings

Issues

  • the len of dicts behaves differently from that of recarrays and dataframes. Not supporting dicts might lead to a cleaner interface, but I'm not sure.
  • Just using the new interface in GridSearchCV and cross_val_score will break all 3rd party estimators. Should we inspect / try-except and raise a deprecation warning? Or just try/except and not raise a warning?
  • It's unclear to me how to prevent typos in specifying sample_props column names / dictionary keys.
  • should it be passed to the score function, too?
  • should it be passed to partial_fit, too?
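The calling convention sketched by this PR can be illustrated with a toy estimator. This is a hypothetical sketch, not the actual scikit-learn API: the estimator name, the `sample_props` argument, and the `'sample_weights'` key are illustrative only. The point is that every value in the mapping is per-sample, so a meta-estimator can slice all of it alongside X and y without any routing rules.

```python
import numpy as np

class ToyEstimator:
    """Stand-in estimator accepting the proposed sample_props mapping."""
    def fit(self, X, y, sample_props=None):
        props = sample_props or {}
        # 'sample_weights' is one hypothetical key; any per-sample
        # array would be routed the same way.
        w = props.get('sample_weights', np.ones(len(X)))
        self.mean_ = np.average(y, weights=w)
        return self

def slice_props(sample_props, indices):
    # Every value is per-sample by contract, so slicing is unambiguous.
    return {k: np.asarray(v)[indices] for k, v in sample_props.items()}

X = np.arange(6).reshape(-1, 1)
y = np.array([0., 0., 1., 1., 1., 1.])
props = {'sample_weights': np.array([1., 1., 1., 1., 5., 5.])}

# What a CV loop would do for one fold: slice X, y, and every prop together.
train = np.array([0, 1, 4, 5])
est = ToyEstimator().fit(X[train], y[train],
                         sample_props=slice_props(props, train))
```

Contrast this with fit_params, where a meta-estimator cannot tell which entries are per-sample and which are global.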

@amueller amueller changed the title Sample properties [WIP] Sample properties May 8, 2015
@amueller amueller force-pushed the sample_properties branch from ba9cc83 to e7f0509 on May 9, 2015 22:53
@amueller
Member Author
amueller commented May 9, 2015

ping @GaelVaroquaux @ogrisel @jnothman @vene @agramfort
This is somewhat annoying work, and it would be good to get some buy-in ;)

@amueller
Member Author
amueller commented May 9, 2015

And also feedback for the issues.

@amueller
Member Author
amueller commented May 9, 2015

maybe @glouppe, @mblondel and @pprett also have opinions about sample_weight being deprecated?

@glouppe
Contributor
glouppe commented May 10, 2015

maybe @glouppe, @mblondel and @pprett also have opinions about sample_weight being deprecated?

Nothing against deprecating it, as long as there is still a way to pass weights.

However, I don't find the new name of sample_props very intuitive.

@agramfort
Member

Nothing against deprecating it, as long as there is still a way to pass
weights.

+1

However, I don't find the new name of sample_props very intuitive.

what do you suggest?

@glouppe
Contributor
glouppe commented May 11, 2015

what do you suggest?

I would at least not use abbreviations, and go instead for something like sample_properties.

Another thing is not entirely clear to me: why couldn't we simply put these properties into fit_params? (Sorry if I missed the discussion.)

@amueller
Member Author

The handling of fit_params is pretty inconsistent.
In particular, fit_params should really not be a constructor argument of GridSearchCV, but a **kwargs of GridSearchCV.fit, to allow nesting inside cross_val_score.

In a way, this mostly amounts to renaming fit_params, making it an explicit argument rather than **kwargs, and allowing dataframes and recarrays.

One thing that is not nice about fit_params is that it can contain "per sample" information and "global information" so we don't know whether we want to slice it or not.

After doing this, I am no longer entirely convinced that adding a new API is better than what we have plus **kwargs. It would mean that all estimators take **kwargs in fit, though, which is also not great.

@mblondel
Member

One thing that is not nice about fit_params is that it can contain "per sample" information and "global information" so we don't know whether we want to slice it or not.

We could use the convention that any fit_param whose name starts with sample_ needs to be sliced.
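The prefix convention suggested here can be sketched as a small helper a CV loop might use (a sketch only; the helper name and the choice of keys are illustrative, not anything in scikit-learn): entries named `sample_*` get sliced per fold, everything else passes through whole.

```python
import numpy as np

def split_fit_params(fit_params, indices):
    """Slice per-sample entries (prefix 'sample_') for one CV fold;
    pass global entries through untouched."""
    routed = {}
    for name, value in fit_params.items():
        if name.startswith('sample_'):
            routed[name] = np.asarray(value)[indices]  # per-sample: slice
        else:
            routed[name] = value                       # global: keep whole
    return routed

fit_params = {
    'sample_weight': np.array([1., 2., 3., 4.]),  # sliced per fold
    'callback': print,                            # global, never sliced
}
fold = np.array([0, 2])
out = split_fit_params(fit_params, fold)
```

This resolves the "slice or not?" ambiguity purely by naming, at the cost of making the prefix a hard API contract.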

@jnothman
Member

That idea is appealing and I will have to think a little more about it. One questionable advantage of sample_props was the ability to transmit arbitrary metadata through pipelines, with extraneous fields being ignored by intermediate transformers (the converse problem is that changes to weight support would create backwards-compatibility issues), although this could be solved in a more explicit, if user-burdening, manner with a routing parameter to the pipeline. The other advantage stated above is more seamless use of DataFrames, which I think is something we should be considering; the prefixing approach is still fine for that with **my_df, but it requires the dataframe to have awkward column names (X, y, sample_weight, sample_group, etc.).

On 12 May 2015 at 11:12, Mathieu Blondel notifications@github.com wrote:

One thing that is not nice about fit_params is that it can contain "per sample" information and "global information" so we don't know whether we want to slice it or not.

We could use the convention that any fit_param starting by sample_ needs to be sliced.


@agramfort
Member
agramfort commented May 12, 2015 via email

@jnothman
Member

The problem with sample_props is that support in a method for a particular property is then implicit in the API, resulting in documentation and versioning challenges.

On 12 May 2015 at 17:05, Alexandre Gramfort notifications@github.com wrote:

if we use sample_props, column names should be weight, group, which is not that awkward.


@agramfort
Member
agramfort commented May 12, 2015 via email

@mblondel
Member

sample_props doesn't seem fundamentally very different from fit_params: it's a dict containing additional data to be used for fitting. It is different from **kwargs in that the name fit_params appears in the docstring. We just need a rule to decide when the additional data passed in fit_params needs to be sliced for cross-validation. Above, I suggested that any data whose key is prefixed by sample_ needs to be sliced. I didn't completely get why we can't use data frames as input to fit_params.

@agramfort
Member
agramfort commented May 12, 2015 via email

@mblondel
Member

Good point for the name...

@amueller
Member Author

@mblondel the problem is that the way fit_params works is kinda broken.
You cannot do

cross_val_score(GridSearchCV(SGDClassifier(), params), fit_params={'sample_weights':sample_weights})

because then fit_params of cross_val_score will be passed to GridSearchCV.fit, while they would need to be given to GridSearchCV.__init__.

In that way I disagree with

sample_props doesn't seem fundamentally very different from fit_params

I would say

sample_props is exactly the same as fit_params moved to fit
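The nesting problem described above can be illustrated with toy wrappers (illustrative classes, not real scikit-learn ones): once fit-time data is accepted by fit itself rather than the constructor, every meta-estimator can simply forward what it receives to its child, to any depth.

```python
class Inner:
    """Leaf estimator that records the fit-time data it receives."""
    def fit(self, X, y, **fit_params):
        self.received_ = fit_params
        return self

class Wrapper:
    """Toy meta-estimator: forwards fit-time data to its child."""
    def __init__(self, child):
        self.child = child
    def fit(self, X, y, **fit_params):
        # Forwarding works at any nesting depth because the data
        # arrives at fit, not at __init__.
        self.child.fit(X, y, **fit_params)
        return self

inner = Inner()
# Two levels of nesting, analogous to cross_val_score around GridSearchCV.
Wrapper(Wrapper(inner)).fit([[0.0]], [0], sample_weights=[2.0])
```

With the current constructor-level fit_params of GridSearchCV, the outer call has no such channel, which is exactly the breakage shown above.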

@amueller
Member Author

Btw, maybe to make my goals clearer: I want these two things to work:

cross_val_score(GridSearchCV(SGDClassifier(), params), i_dont_care_what_its_called={'sample_weights':sample_weights})

and

cross_val_score(GridSearchCV(SGDClassifier(), params, cv=LeaveOneLabelOut()), i_dont_care_what_its_called={'labels': groups}, cv=LeaveOneLabelOut())
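The second goal above routes per-sample group labels to a group-aware splitter. A self-contained sketch (the splitter below is a tiny stand-in mimicking LeaveOneLabelOut's behaviour, not the scikit-learn class) shows what the splitter needs from the routed 'labels' entry:

```python
import numpy as np

def leave_one_label_out(labels):
    """Stand-in for LeaveOneLabelOut: one fold per unique label,
    testing on that label's samples and training on the rest."""
    labels = np.asarray(labels)
    for label in np.unique(labels):
        test = np.where(labels == label)[0]
        train = np.where(labels != label)[0]
        yield train, test

# The per-sample 'labels' array that would arrive via the routed mapping.
groups = np.array([0, 0, 1, 1, 2, 2])
splits = list(leave_one_label_out(groups))
```

Both the outer and the inner CV need this same per-sample array, which is why it has to travel with the data rather than being bound at construction time.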

@jnothman
Member
jnothman commented Jun 4, 2017

I think this remains the biggest basic design issue. It may deserve discussion at the sprint. It is tricky, and I suspect some backwards compatibility issues will have to be ignored.

@amueller
Member Author
amueller commented Jun 6, 2017

Do we have a list of use cases? Do we want a SLEP? As I pointed out above, the current sample_weight support is lacking. What other applications do we have? The cross-validation groups? Anything else?

@jnothman jnothman mentioned this pull request Aug 16, 2017
11 tasks
@amueller amueller closed this May 22, 2018