Tweedie regression on insurance claims example · Issue #17200 · scikit-learn/scikit-learn · GitHub

Tweedie regression on insurance claims example #17200


Closed
jieliang opened this issue May 13, 2020 · 7 comments

Comments

@jieliang

https://scikit-learn.org/dev/auto_examples/linear_model/plot_tweedie_regression_insurance_claims.html

I have a question about hyperparameter tuning: how were the values of alpha and the other tunable parameters chosen for the Poisson, Gamma (frequency * severity) and Tweedie models? Did you do something like grid search with cross-validation?

I also think it would be nice to have a method to calculate the D-squared score for the composite frequency * severity model, so that it can be compared to the Tweedie model (in the example, the GridSearchCV for the Tweedie model chooses the best value of power based on the D-squared score).

Another question: since there is a discussion of the Gini index at the end of the example, would it make sense to also use the Gini index as one of the scoring metrics in GridSearchCV? In insurance applications, if coming up with the most accurate rates is the goal, MAE/RMSE may be suitable, while the Gini index is better suited to ranking policyholders by risk.

Last suggestion is that a function for deriving the relativities of features would be really useful.

Thank you!

@lorentzenchr
Member

As said in the example, alpha is not set to zero in order to avoid numerical problems, e.g. with collinearity. It is set to a small number, as it does not have a large effect.
The correct, but much longer, approach would be to use cross-validation to search for the optimal alpha. We thought this would be overkill for the example.
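For illustration, such a cross-validated search over alpha could look like the following sketch. The data and the grid of alpha values are made up for this illustration, not taken from the example:

```python
# Illustrative sketch only: searching over alpha with cross-validation, on
# synthetic Poisson data (the grid and data are made up for this illustration).
import numpy as np
from sklearn.linear_model import PoissonRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = rng.poisson(np.exp(0.3 * X[:, 0] - 0.2 * X[:, 1]))  # synthetic counts

search = GridSearchCV(
    PoissonRegressor(max_iter=300),
    param_grid={"alpha": [1e-4, 1e-3, 1e-2, 1e-1, 1.0]},
    cv=5,  # default scoring is the estimator's D^2 score
)
search.fit(X, y)
print(search.best_params_["alpha"])
```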

Then, there is #15244 for adding a score function D^2, like R^2.
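As a sketch of what such a score would compute, here is D^2 ("fraction of deviance explained", analogous to R^2) built from mean_tweedie_deviance. The function name d2_tweedie_score is an assumption for this sketch, not an existing scikit-learn API:

```python
# Hypothetical sketch of a D^2 score, built from mean_tweedie_deviance;
# d2_tweedie_score is a made-up helper name, not scikit-learn API.
import numpy as np
from sklearn.metrics import mean_tweedie_deviance

def d2_tweedie_score(y_true, y_pred, power=1.5):
    """D^2 = 1 - deviance(y, y_pred) / deviance(y, mean(y))."""
    dev = mean_tweedie_deviance(y_true, y_pred, power=power)
    dev_null = mean_tweedie_deviance(
        y_true, np.full_like(y_true, np.mean(y_true), dtype=float), power=power
    )
    return 1.0 - dev / dev_null

y_true = np.array([0.5, 1.0, 2.0, 4.0])
print(d2_tweedie_score(y_true, y_true))                     # -> 1.0 (perfect)
print(d2_tweedie_score(y_true, np.full(4, y_true.mean())))  # -> 0.0 (null model)
```

Such a score would apply equally to the composite frequency * severity predictions, making them comparable to the Tweedie model.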

Concerning your third question: I, personally, am not a big fan of the Gini index/AUC. Especially in a regression setting like this example, I would strongly advise against using AUC for cross-validation, because it is not a strictly consistent scoring function for the expectation (of y), which is what we want to predict, see https://arxiv.org/abs/0912.0902.
If your aim is ranking, I would be interested in a mathematical (supervised learning) formulation of this objective. Maybe minimizing the deviance loss (which is what the estimators in the example do) is then not an optimal strategy.

If by "relativities" you mean the coefficients or weights of the GLMs, those are accessible by the attributes coef_ and intercept_.
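With the log link used by the GLMs in the example, exponentiating those attributes gives multiplicative factors per unit of each feature, which is one common actuarial reading of "relativities". A small sketch on synthetic data:

```python
# Sketch (synthetic data): with a log link, exponentiated GLM coefficients
# act as multiplicative relativities per unit of each feature.
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = rng.poisson(np.exp(1.0 + 0.5 * X[:, 0]))  # true coefficient 0.5 on X[:, 0]

glm = PoissonRegressor(alpha=1e-4).fit(X, y)
base_rate = np.exp(glm.intercept_)   # predicted rate when all features are 0
relativities = np.exp(glm.coef_)     # multiplicative factor per +1 unit
print(base_rate, relativities)
```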

@ogrisel
Member
ogrisel commented Jun 10, 2020

@lorentzenchr how would you select the parameter p of the Tweedie regression model? As discussed in another PR, we cannot use the deviance as the objective metric because it depends on p. I believe the correct way would be to evaluate the test likelihood on a grid of values for p, but then we would need to also estimate the phi parameter of the variance function, which is not possible with the current code base.

Using the Gini criterion for model selection of p (and the regularization strength alpha) would be possible, although I agree it is indirect: the optimal value of p for ranking is not necessarily the optimal value of p from an expected-likelihood point of view.
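A sketch of how the ordered-Lorenz-curve Gini from the end of the example could be wrapped as a scorer. This gini_score is a simplified reconstruction (an assumption), not the example's exact code:

```python
# Sketch: Gini index of the Lorenz curve obtained by ranking observations on
# the predictions; a simplified reconstruction, not the example's exact code.
import numpy as np
from sklearn.metrics import auc, make_scorer

def gini_score(y_true, y_pred):
    """Gini index of the Lorenz curve obtained by ranking on y_pred."""
    order = np.argsort(y_pred)
    ranked = np.asarray(y_true)[order]
    cum_claims = np.cumsum(ranked) / ranked.sum()
    cum_exposure = np.linspace(1 / len(ranked), 1, len(ranked))
    return 1 - 2 * auc(cum_exposure, cum_claims)

# Could be passed as scoring=gini_scorer to GridSearchCV, with the caveats above.
gini_scorer = make_scorer(gini_score, greater_is_better=True)

y = np.array([0.0, 0.0, 0.0, 1.0, 10.0])
print(gini_score(y, y))    # ranking by the target itself: large positive Gini
print(gini_score(y, -y))   # reversed ranking: negative Gini
```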

@lorentzenchr
Member

You're asking a usage question. The issue tracker is mainly for bugs and new features. For usage questions, it is recommended to try Stack Overflow or the Mailing List.

@ogrisel Sorry, couldn't resist 😏

Just to clarify for others, the question is how to choose the parameter p, i.e. the variance power of the Tweedie family: E[Y|X] = μ(X), Var[Y|X] = (phi / weight) * μ(X)**p. It enters sklearn.linear_model.TweedieRegressor(..., power=p) and also sklearn.metrics.mean_tweedie_deviance(..., power=q); to distinguish the two, I call the latter one q.

The following is to the best of my knowledge.
The most important question to raise and to answer is: what do you want to achieve, what is your (business) goal? Leaving that aside and focusing on selecting the p in TweedieRegressor, the answer depends on which assumptions and approximations you are willing to make.

  1. Assume the target is Tweedie distributed. Then full maximum likelihood theory applies, and p as well as phi and μ can be estimated by MLE¹. This is currently not possible in scikit-learn and hard to do in Python.
  2. Only assume that the target has a finite 1st (and 2nd?) moment (and maybe some further regularity conditions) and aim to predict the expectation, i.e. E[Y|X]. Then you can choose a scoring function (see again https://arxiv.org/abs/0912.0902, e.g. Eq. (18)) that is strictly consistent for the expectation/mean, treat p as a hyperparameter and find the best value by cross-validation. Examples of such scoring functions are
    • metrics.mean_squared_error
    • metrics.mean_tweedie_deviance. Now we have to choose a parameter q (independently of p). This can be a matter of your business goal or a matter of asymptotic estimation efficiency. To illustrate the latter point, assume your data is Tweedie(power=1.5) distributed. Then choosing q=1.5 gives you the most efficient scoring function (for the expectation) among all mean_tweedie_deviance variants. For large samples and a correctly specified model, I hypothesize that the estimate of p (by cross-validation) will converge to 1.5, too. Again, a lot of assumptions...
  3. If your data allows you to separate the target multiplicatively into something like "target = count * value_per_count", you can model "count" and "value_per_count" separately, and at least for the count, statistical wisdom (or was it lore?) suggests using PoissonRegressor.
  4. Assume your goal is to predict a quantile of the target or to rank (by) the target. In this case I would not even start with TweedieRegressor but use other estimators that are better suited to this particular task. (You would not consult an ophthalmologist if you had a stomach ache.)

¹ Note that only after selecting a value for p does the Tweedie distribution become a member of the exponential family (more precisely, the exponential dispersion family) with all its nice properties, i.e. a sufficient statistic for the expectation.
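The cross-validation recipe of option 2 above can be sketched as follows. The data is synthetic compound Poisson-gamma and the grid of powers is illustrative; squared error serves as a strictly consistent loss for the mean that does not itself depend on p:

```python
# Sketch of option 2: treat the variance power p as a hyperparameter and select
# it by cross-validation under a p-independent loss that is strictly consistent
# for the mean (here: squared error). Synthetic compound Poisson-gamma data.
import numpy as np
from sklearn.linear_model import TweedieRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
X = rng.normal(size=(800, 2))
freq = rng.poisson(np.exp(0.4 * X[:, 0]))                        # claim counts
y = np.array([rng.gamma(2.0, 0.5, size=k).sum() for k in freq])  # total claims

grid = GridSearchCV(
    TweedieRegressor(link="log", alpha=1e-4, max_iter=1000),
    param_grid={"power": [1.1, 1.3, 1.5, 1.7, 1.9]},
    scoring="neg_mean_squared_error",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_["power"])
```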

To sum up: the situation is quite similar to negative binomial regression for counts 😏

@lorentzenchr
Member
lorentzenchr commented Jun 11, 2020

We might also super kindly ask some real experts like @bbolker or @gksmyth, if they could share some insights.

@bbolker
bbolker commented Jun 11, 2020

I don't consider myself an expert on Tweedie regression: @kaskr implemented this in glmmTMB, I believe. I have always gone with approach 1 above (use MLE). The tricky question is whether the estimation is stable enough to try to estimate p simultaneously with the rest of the parameters (which is what glmmTMB does, I think), or whether it is better to profile over p (i.e., do a one-dimensional optimization and/or grid search over p; for each value of p, do an MLE fit with the value of p held constant).

I agree with the point about negative binomial regression: MASS::glm.nb() uses an iterative method to profile over the dispersion parameter (fitting a GLM for each value), but I have found that fitting the full MLE is sometimes more robust (the collinearity/identifiability problems are not nearly as bad for NB dispersion as for Tweedie p). Another analogous problem is estimating df in a regression model with t-distributed residuals.
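In scikit-learn terms, the profiling idea (fit with p held constant, repeated over a grid) could be approximated as below. Since the full Tweedie likelihood is unavailable in scikit-learn, a p-independent held-out loss stands in for the profile likelihood; this is an approximation, not equivalent to MLE profiling:

```python
# Rough analogue (not true MLE profiling) of profiling over p: fit the GLM
# with p held constant on a grid, then compare fits on held-out data with a
# p-independent loss, since the full Tweedie likelihood (which needs the
# dispersion phi) is not available in scikit-learn.
import numpy as np
from sklearn.linear_model import TweedieRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 2))
counts = rng.poisson(np.exp(0.5 * X[:, 0]))
y = np.array([rng.gamma(2.0, 1.0, size=k).sum() for k in counts])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

losses = {}
for p in [1.1, 1.3, 1.5, 1.7, 1.9]:
    model = TweedieRegressor(power=p, link="log", alpha=1e-4, max_iter=1000)
    model.fit(X_tr, y_tr)
    losses[p] = mean_squared_error(y_te, model.predict(X_te))

best_p = min(losses, key=losses.get)
print(best_p, losses[best_p])
```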

@lorentzenchr
Member

@bbolker Thank you very much for sharing your insights.

@cmarmo
Contributor
cmarmo commented Sep 11, 2020

Thanks @jieliang for reaching out. It seems to me that all your questions have received attention. I'm closing this issue. Feel free to keep in touch with the community on Stack Overflow or the scikit-learn mailing list for new questions. Thanks.

@cmarmo cmarmo closed this as completed Sep 11, 2020