WIP simplify naive Bayes parametrization by amueller · Pull Request #1525 · scikit-learn/scikit-learn · GitHub

WIP simplify naive Bayes parametrization #1525


Closed

Conversation

@amueller (Member) commented Jan 6, 2013

After moving class_prior to class_weight as an __init__ parameter in #1499, the parametrization was a bit weird. I now got rid of fit_prior and let class_weight be 'auto', None (= uniform), or an array. This is a bit more consistent with the other estimators.

I'll try to make it pass the common tests now and I guess I'll also allow dictionaries.

@amueller (Member, Author) commented Jan 6, 2013

Does it make sense to support dictionaries as class_weight? What would the value for the left-out classes be? Hmm...
ping @larsmans.

This is fixing #1511 btw.

@amueller (Member, Author) commented Jan 6, 2013

Making the common tests pass is a bit trickier. I think this is a step forward in the interface, though.

@larsmans (Member) commented Jan 6, 2013

Actually, I'm not so sure things have gotten consistent now; class_weight="auto" is more like fit_intercept=True in the linear models. Whether a dict with missing values makes sense depends on what meaning we want to assign to class_weight: is it the prior (exp(intercept_)), or is the prior derived taking class_weight into account?

I think I'd favor the former for simplicity; Naive Bayes is a typical beginner's tool, so it should be simple. The latter would provide better API compatibility with other estimators, but NB's interpretation of alpha is already quite different as well.

@amueller (Member, Author) commented Jan 6, 2013

I see your point about class_weight='auto' and fit_intercept=True. But adjusting the prior is also the same as duplicating samples in a class, right?

I don't know what you mean by "the prior derived taking class_weight into account". In the current / previous implementation, class_prior (now class_weight) simply specifies the prior, i.e. it overrides fit_prior, if I understand correctly.

That these are mutually exclusive is one of the reasons I think having a single parameter is more natural.

One could also imagine class_weight reweighting the prior estimated from the data, but that is not implemented, right?
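
To make the "duplicating samples" point above concrete, here is a small check. It uses today's class_prior parameter rather than the class_weight name discussed in this PR, and a tiny alpha so that smoothing doesn't interfere: duplicating the samples of one class leaves the per-class feature proportions unchanged and only scales the class counts, so it has the same effect as setting the corresponding prior explicitly.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[2, 1], [0, 3], [1, 1], [3, 0]])
y = np.array([0, 0, 1, 1])

# Duplicate the class-1 samples: class counts become [2, 4].
X_dup = np.vstack([X, X[y == 1]])
y_dup = np.hstack([y, y[y == 1]])

# Fitting on the duplicated data (empirical prior [1/3, 2/3]) ...
nb_dup = MultinomialNB(alpha=1e-10).fit(X_dup, y_dup)
# ... matches fitting on the original data with that prior set explicitly.
nb_prior = MultinomialNB(alpha=1e-10, class_prior=[1 / 3, 2 / 3]).fit(X, y)

print(np.allclose(nb_dup.predict_proba(X), nb_prior.predict_proba(X)))  # True
```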

@larsmans (Member) commented Jan 6, 2013

In the current / previous implementation, class_prior (now class_weight) simply specifies the prior, i.e. it overrides fit_prior, if I understand correctly.

Yes, but we could decouple them so that unrepresented classes get a weight of 1 and other classes can be up-/downweighted arbitrarily. We'd just have to fill in the ones and normalize, which seems to be what you're implying as well.
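
A minimal numpy sketch of the "fill in the ones and normalize" idea (the helper name is mine, purely for illustration; it is not code from this PR):

```python
import numpy as np

def class_weight_to_prior(classes, class_weight):
    # Classes missing from the dict implicitly get weight 1,
    # then everything is normalized into a proper prior.
    weights = np.array([class_weight.get(c, 1.0) for c in classes], dtype=float)
    return weights / weights.sum()

print(class_weight_to_prior([0, 1, 2], {0: 2.0}))  # [0.5, 0.25, 0.25]
```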

@amueller (Member, Author) commented Jan 6, 2013

OK, but then the meaning of the parameter would change a lot from the previous class_prior.

@amueller (Member, Author) commented Jan 6, 2013

OK, so you would keep fit_prior and then apply a multiplicative (?) modification, based on the class_weight parameter, to the prior that was found?

@larsmans (Member) commented Jan 6, 2013

No, because I agree the API is too clumsy (and the code is more complicated than an NB estimator needs to be). Let me formulate a proposal:

  • When class_weight="auto", the empirical prior is used. This should be the default.
  • When class_weight is a dict or array-like, we normalize it so it becomes a custom prior. Missing values are implicitly 1, so when users compute their own prior, they should provide the value for all classes. [If we didn't allow an unnormalized class_weight, we'd have to validate it, which is just as much work for us but less convenient for the user.]
  • When class_weight=None, a uniform prior is used.

This way, custom priors can still be given explicitly, but the parameter becomes much more flexible.

Does that sound like something we could almost put in a docstring, aside from the [bracketed rationale]?
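
For concreteness, a rough sketch of what the three cases above could look like; the function name and signature are hypothetical, not the PR's actual code:

```python
import numpy as np

def compute_class_prior(y, classes, class_weight="auto"):
    classes = np.asarray(classes)
    if class_weight is None:
        # uniform prior over all classes
        return np.full(len(classes), 1.0 / len(classes))
    if isinstance(class_weight, str) and class_weight == "auto":
        # empirical prior: relative class frequencies observed in y
        counts = np.array([(y == c).sum() for c in classes], dtype=float)
        return counts / counts.sum()
    if isinstance(class_weight, dict):
        # classes missing from the dict implicitly get weight 1
        weights = np.array([class_weight.get(c, 1.0) for c in classes], dtype=float)
    else:
        weights = np.asarray(class_weight, dtype=float)
    # normalize a custom (possibly unnormalized) prior
    return weights / weights.sum()

y = np.array([0, 0, 0, 1])
print(compute_class_prior(y, [0, 1]))            # [0.75, 0.25]
print(compute_class_prior(y, [0, 1], None))      # [0.5, 0.5]
print(compute_class_prior(y, [0, 1], {1: 3.0}))  # [0.25, 0.75]
```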

@amueller (Member, Author) commented Jan 6, 2013

It sounds a lot like my current docstring, doesn't it ;)

It sounds like a good solution to me. I must admit I haven't really thought about the algorithm, just shuffled the bits around. It actually sounds kinda simple the way you put it.

Do you want to give it a shot or should I try?

@amueller (Member, Author) commented Jan 6, 2013

OK, it sounds more like I should have written that docstring, actually ;) but it is what I meant, I just put it less elegantly.

@amueller (Member, Author) commented Jan 6, 2013

We can actually just reuse the function from sklearn.utils.class_weights, right?
Anything else we need to do?

@larsmans (Member) commented Jan 6, 2013

Looks good!

@amueller (Member, Author) commented Jan 7, 2013

Ok, I'll add some more tests for dictionaries and also for using lists in SVM and SGD.

@amueller (Member, Author) commented

OK, so maybe this doesn't make much sense after all:
In NB, fit_prior (surprise, surprise) fits a prior to the classes, so frequent classes will be more likely to be predicted. In the linear models, class_weight='auto' downweights examples from the overrepresented classes.

So I just unified two functions that do opposite things...

I could still use the same function but take the inverse in the 'auto' case, but then we should definitely document that quite explicitly!
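
A tiny numpy illustration of the clash described above (the exact scaling used by the linear models' 'auto' heuristic is glossed over here; the point is only that the two conventions pull in opposite directions):

```python
import numpy as np

y = np.array([0, 0, 0, 0, 1])            # class 0 is heavily overrepresented
counts = np.bincount(y).astype(float)    # [4., 1.]

# NB's fit_prior: empirical prior, so the frequent class gets *more* weight
empirical_prior = counts / counts.sum()  # [0.8, 0.2]

# class_weight='auto' in the linear models: roughly inverse frequencies,
# so the frequent class gets *less* weight
inverse_freq = (1.0 / counts) / (1.0 / counts).sum()  # [0.2, 0.8]

print(empirical_prior, inverse_freq)
```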

@amueller (Member, Author) commented

Hm, reusing the code is not so trivial given the different normalizations.
Also, we should really think about whether we want class_weight='auto' to mean two completely opposite things :-/

Any ideas how to handle this better? Just leave the current parametrization?

@amueller (Member, Author) commented

closing this as it sucks.

@amueller closed this Feb 20, 2013