8000 Helpers for managing different set of classes · Issue #8100 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Helpers for managing different set of classes #8100

New issue
Open
jnothman opened this issue Dec 22, 2016 · 16 comments
Open

Helpers for managing different set of classes #8100

jnothman opened this issue Dec 22, 2016 · 16 comments

Comments

@jnothman
Copy link
Member

It seems like half the bugs we've solved in the past couple of months surround problems of having different sets of classes, in the context of:

  • cross-validation splitting that yields different subsets of classes in different training or testing subsets (and hence issues in alignment of class-wise outputs from predict_proba, decision_function or metrics, or in normalising macro-averaged scores)
  • partial_fit where classes are specified upfront, but then repeated calls need matching to those classes
  • warm_start where classes_ from the first fit must be identical to the set of classes in y in each call to fit()

These are all subtly different problems, but at the moment it seems like we're handling them on an ad-hoc (and too often a post-hoc) basis.

It would be amazing if someone could review these issues and identify where either API changes (classes as a constructor parameter to classifiers has been suggested) or helper utilities might help avoid these issues in the future.

@dalmia
Copy link
Contributor
dalmia commented Jan 12, 2017

I'll take this up.

@jnothman
Copy link
Member Author

I think it will be a challenge, @dalmia, but go ahead.

@dalmia
Copy link
Contributor
dalmia commented Jan 12, 2017

Thanks for the heads up @jnothman. :)

@amueller
Copy link
Member
amueller commented Mar 3, 2017

@dalmia are you still working on this?

@dalmia
Copy link
Contributor
dalmia commented Mar 8, 2017

@amueller No Sir, please retain the 'Need Contributor' label.

@amueller
Copy link
Member

Ok so in #6231 we talked about adding classes as a construction parameter, which I think might be the best solution.

If we do that, what's the shape of coef_ in linear models? I guess it needs to reflect all classes to solve the warm-start issue.

@amueller
Copy link
Member
amueller commented Jul 7, 2017

Should we sort the classes that the user provides? That might be surprising to the user, but not doing it might lead to subtle bugs as we rely on that in some places, I think?

It seems a bit strange to me that get_params and set_params return / set classes as searching over it is nonsensical, but I guess it's not the worst....

@jnothman
Copy link
Member Author
jnothman commented Jul 8, 2017 via email

@amueller
Copy link
Member
amueller commented Aug 4, 2017

Also see #8004 #1304 #1183 #7889 (comment)

@amueller
Copy link
Member

Should we have attributes reflect the additional classes in all cases? I think so, because otherwise it will be very confusing. @aarshayj points out that that requires changing coef_ and intercept_ in the linear models, if we want them to correspond to the entries in decision_function.

@thismlguy
Copy link
Contributor

yea i think we'll have to do this. for instance, if the user sets classes=[1,2,3] while y has only [1,2], then an SVM will return just 1 set of coefficients and 1d decision function because the library we use makes use of data only in y. but in our API, we need to supplement both coef and decision function to make them 3d.

the returned values are for the positive class (2). so we can fill in 0s for class (3) and fill in class (1) as negative of the values for class (2).

@jnothman
Copy link
Member Author
jnothman commented Aug 13, 2017 via email

@amueller
Copy link
Member

So I'm not entirely certain this is the right solution any more, because of issues we face in the implementation.

There are two or three main points:

  • are we requiring classes in our API contract for third-party estimators?
  • If we don't, then issues like in cross_val_predict will remain for third-party estimators, and we end up where we started.
  • how much of the internal structure of the estimators should reflect the new classes? coef_ et al is very vague. For GradientBoostingClassifier, what should estimators_ be? Should we actually create trees for classes that don't exist? What should coef_ and intercept_ be? Ideally we want zero probabilities, which means -np.inf for decision function and intercept, which is maybe not great.

If we used a wrapper/meta-estimator approach, these issues would not exist as we could wrap third-party estimators and not expose any internal data structures. I'm not sure if a wrapper would be the right solution to ensure consistent scoring, but it's probably possible.

Ping @jnothman @GaelVaroquaux @ogrisel @agramfort.

@amueller
Copy link
Member

Hm for the third-party estimators, I guess we don't have to require classes, and just raise an error if we want to combine multiple estimators that have different classes_ like in cross_val_predict. Basically someone only needs to implement classes if they want these edge cases to work. Still seems like a bit of a burden on the estimator author that we could avoid with a wrapper approach.

@jnothman
Copy link
Member Author

are we requiring classes in our API contract for third-party estimators?

In short, yes. Eventually. We need to maintain the classes_ mismatch handling in cross_val_predict for now in any case.

We don't end up where we started in that we have a way for metaestimators to pass classes to supporting classifiers (and scorers), and that we remove the problem in partial_fit for us and for third parties who adopt the classes parameter. But yes, when the user has not specified classes, weird errors may be raised in partial_fit still (unless protected by common tests).

A wrapper approach for meta-estimators interpreting the output of decision_function? Okay, but it won't work for interpreting coef_, will it? It won't work for passing classes to scorers, but I suppose we can make that the user's problem as we currently do.

I do see your problems about making the models represent all the classes in a consistent way. I'm not yet sure what the solution is.

@GaelVaroquaux
Copy link
Member
GaelVaroquaux commented Aug 31, 2017 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants
0