Changed number of features and max iteration limits in plot_tweedie_regression_insurance_claims.py · Pull Request #21622 · scikit-learn/scikit-learn · GitHub
Changed number of features and max iteration limits in plot_tweedie_regression_insurance_claims.py #21622

Closed
wants to merge 2 commits

Conversation

@ghost ghost commented Nov 10, 2021

#21598 @sply88 @adrinjalali @cakiki

Adapted the number of features at the beginning and then reduced the maximum number of iterations

@adrinjalali
Member

@sveneschlbeck please remember to paste before and after outputs and times with your PRs on examples :)

@rth @lorentzenchr you may have a good idea on the details of this example

@adrinjalali adrinjalali mentioned this pull request Nov 10, 2021
@ghost
Author
ghost commented Nov 10, 2021

@adrinjalali Yes, the timing tests are currently running; I will paste them in soon :)

@rth
Member
rth commented Nov 10, 2021

What would also be necessary here is to check the figures / scores in the rendered example before and after this PR (see the link in "ci/circleci: doc artifact") and make sure there are no major changes, or at least that the comments in the example are still valid with this smaller number of samples.

@ghost
Author
ghost commented Nov 10, 2021

@adrinjalali @rth Agreed, I could use your expertise to judge the effect of the changes. I am quite disappointed since the reduction in the number of samples did not really save time; however, the results look quite a bit different. Maybe there is another way to achieve faster execution in this example?
screenshot_before

screenshot_after

@lorentzenchr
Member

@sveneschlbeck Could you check which part of the example code takes the most time? Then we can focus on that part. Maybe it's the fit, maybe it's some plot function.

@ghost
Author
ghost commented Nov 10, 2021

@lorentzenchr Will do :)

@ghost
Author
ghost commented Nov 13, 2021

@lorentzenchr I checked it out; these are the results from analysing the partial execution times:

line 1 - line 207 --> 0.001 sec
line 208 - line 261 --> 31 sec
line 262 - line 293 --> 1.2 sec
line 294 - line 348 --> 0.22 sec
line 349 - end --> 4.1 sec

So, this part seems to be the slowest by far:

# %%
# Loading datasets, basic feature extraction and target definitions
# -----------------------------------------------------------------
#
# We construct the freMTPL2 dataset by joining the freMTPL2freq table,
# containing the number of claims (``ClaimNb``), with the freMTPL2sev table,
# containing the claim amount (``ClaimAmount``) for the same policy ids
# (``IDpol``).

df = load_mtpl2(n_samples=60000)

# Note: filter out claims with zero amount, as the severity model
# requires strictly positive target values.
df.loc[(df["ClaimAmount"] == 0) & (df["ClaimNb"] >= 1), "ClaimNb"] = 0

# Correct for unreasonable observations (that might be data error)
# and a few exceptionally large claim amounts
df["ClaimNb"] = df["ClaimNb"].clip(upper=4)
df["Exposure"] = df["Exposure"].clip(upper=1)
df["ClaimAmount"] = df["ClaimAmount"].clip(upper=200000)

log_scale_transformer = make_pipeline(
    FunctionTransformer(func=np.log), StandardScaler()
)

column_trans = ColumnTransformer(
    [
        ("binned_numeric", KBinsDiscretizer(n_bins=10), ["VehAge", "DrivAge"]),
        (
            "onehot_categorical",
            OneHotEncoder(),
            ["VehBrand", "VehPower", "VehGas", "Region", "Area"],
        ),
        ("passthrough_numeric", "passthrough", ["BonusMalus"]),
        ("log_scaled_numeric", log_scale_transformer, ["Density"]),
    ],
    remainder="drop",
)
X = column_trans.fit_transform(df)

# Insurance companies are interested in modeling the Pure Premium, that is
# the expected total claim amount per unit of exposure for each policyholder
# in their portfolio:
df["PurePremium"] = df["ClaimAmount"] / df["Exposure"]

# This can be indirectly approximated by a 2-step modeling: the product of the
# Frequency times the average claim amount per claim:
df["Frequency"] = df["ClaimNb"] / df["Exposure"]
df["AvgClaimAmount"] = df["ClaimAmount"] / np.fmax(df["ClaimNb"], 1)

with pd.option_context("display.max_columns", 15):
    print(df[df.ClaimAmount > 0].head())
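
For reference, per-section timings like the ones above can be collected with simple perf_counter checkpoints around each block. This is only a minimal sketch; the two sections below are placeholder stand-ins for the actual line ranges of the example:

import time

from sklearn.datasets import fetch_openml

timings = {}

# Section 1: data loading (stand-in for the load_mtpl2 part of the example)
t0 = time.perf_counter()
df_freq = fetch_openml(data_id=41214, as_frame=True)["data"]
df_sev = fetch_openml(data_id=41215, as_frame=True)["data"]
timings["loading"] = time.perf_counter() - t0

# Section 2: the rest of the example (preprocessing, fitting, plotting, ...)
t0 = time.perf_counter()
# ... remaining example code would go here ...
timings["rest"] = time.perf_counter() - t0

for name, seconds in timings.items():
    print(f"{name}: {seconds:.2f} s")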

@ghost
Author
ghost commented Nov 14, 2021

@lorentzenchr What about reducing the number of samples?

@lorentzenchr
Member

If the bottleneck is load_mtpl2, i.e.

  • df_freq = fetch_openml(data_id=41214, as_frame=True)["data"] and
  • df_sev = fetch_openml(data_id=41215, as_frame=True)["data"],

then it depends on the ratio between download time and parsing time. If it's download time, reducing n_samples won't help.
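
One way to estimate that ratio is to time fetch_openml twice: with the default cache=True the downloaded data is stored under data_home, so a second call mostly measures parsing. A minimal sketch (the split is only approximate):

import time

from sklearn.datasets import fetch_openml

t0 = time.perf_counter()
fetch_openml(data_id=41214, as_frame=True)  # cold cache: download + parse
cold = time.perf_counter() - t0

t0 = time.perf_counter()
fetch_openml(data_id=41214, as_frame=True)  # warm cache: mostly parsing
warm = time.perf_counter() - t0

print(f"download (approx.): {cold - warm:.1f} s, parsing (approx.): {warm:.1f} s")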

@ghost
Author
ghost commented Nov 14, 2021

@lorentzenchr n_samples does not seem to make a difference in loading time: 33 sec (60000) compared to 31 sec (30000), so that is not a solution.

@adrinjalali
Member

@sveneschlbeck do we have any conclusions here?

@MysteryMage

Since the bottleneck is df_freq and df_sev, can't we call fetch_openml from two different threads to speed up the process?
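
A minimal sketch of that idea, assuming the two tables can simply be fetched independently; the downloads are network-bound, so threads are enough to overlap them:

from concurrent.futures import ThreadPoolExecutor

from sklearn.datasets import fetch_openml


def fetch(data_id):
    # Each call downloads (or reads from the local cache) and parses one OpenML table.
    return fetch_openml(data_id=data_id, as_frame=True)["data"]


with ThreadPoolExecutor(max_workers=2) as pool:
    future_freq = pool.submit(fetch, 41214)  # freMTPL2freq
    future_sev = pool.submit(fetch, 41215)  # freMTPL2sev
    df_freq = future_freq.result()
    df_sev = future_sev.result()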

@adrinjalali
Member

This will be fixed by #21938. Closing. Thanks for the work @sveneschlbeck
