ENH: LabelEncoder supports pandas Categorical by TomAugspurger · Pull Request #310 · dask/dask-ml · GitHub

ENH: LabelEncoder supports pandas Categorical #310


Merged: 8 commits into dask:master on Jul 20, 2018

@TomAugspurger (Member) commented Jul 19, 2018

Enhances LabelEncoder to use CategoricalDtype for pandas and dask series.

This improves performance and should help in implementing OneHotEncoder efficiently for dask dataframes.

import string
import numpy as np
import pandas as pd
import dask.dataframe as dd
import dask_ml.preprocessing

dtype = pd.api.types.CategoricalDtype(list(string.ascii_letters[:12]))
n = 1_000_000

data = dd.from_pandas(
    pd.Series(np.random.choice(dtype.categories, size=n), dtype=dtype),
    npartitions=n // 100_000
)

Setup

le = dask_ml.preprocessing.LabelEncoder()  # timed twice: categorical handling on / off (True / False)
codes = le.fit_transform(data)

print('fit')
%timeit le.fit(data)
print('transform')
%timeit le.transform(data)

print('transform-compute')
%timeit le.transform(data).compute()

print('inverse_transform')
%timeit le.inverse_transform(codes)

print('inverse_transform-compute')
%timeit le.inverse_transform(codes).compute()

Results

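| method                    | categorical | no categorical |
|---------------------------|-------------|----------------|
| fit                       | 43.5 µs     | 625 ms         |
| transform                 | 678 µs      | 33.1 ms        |
| transform-compute         | 4.98 ms     | 128 ms         |
| inverse_transform         | 3.66 ms     | 240 µs         |
| inverse_transform-compute | 37 ms       | 160 ms         |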
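
For anyone wondering where the gap in the fit and transform rows might come from, here is a minimal pandas-only sketch. It is an illustration under assumptions, not the code in this PR: the helper name `label_encode` is hypothetical, and the non-categorical path simply falls back to `pd.factorize`. The idea is that a categorical series already stores its classes and integer codes, so nothing has to be scanned to find them.

```python
import numpy as np
import pandas as pd


def label_encode(values: pd.Series):
    """Return (classes, codes) for ``values``. Hypothetical helper for illustration."""
    if isinstance(values.dtype, pd.CategoricalDtype):
        # Classes and codes are already stored on the categorical:
        # no pass over the data is needed to discover the classes.
        return np.asarray(values.cat.categories), values.cat.codes.to_numpy()
    # Without dtype information, every value must be scanned to find the classes.
    codes, classes = pd.factorize(values, sort=True)
    return np.asarray(classes), codes


dtype = pd.CategoricalDtype(["a", "b", "c"])
plain = pd.Series(["b", "a", "c", "b"])
categorical = plain.astype(dtype)

classes_cat, codes_cat = label_encode(categorical)
classes_obj, codes_obj = label_encode(plain)

# Going back from codes to labels only needs the dtype.
restored = pd.Categorical.from_codes(codes_cat, dtype=dtype)
```

Both paths give the same codes here because the categories happen to be in sorted order; with a differently ordered dtype, the categorical path would follow the dtype's order instead.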

cc @jrbourbeau

@TomAugspurger (Member, Author)

cc also @jorisvandenbossche, in case you're interested in what a future scikit-learn LabelEncoder that uses a pandas-like CategoricalDtype might look like.

@jorisvandenbossche (Member)

Cool! Note that in the meantime, we no longer use LabelEncoder directly in OneHotEncoder: scikit-learn/scikit-learn#10209 (both now use a common _encode function).
Although I suppose this doesn't matter too much for dask-ml (I don't think it would be possible to subclass OneHotEncoder to re-use the code that called LabelEncoder?)

@TomAugspurger (Member, Author)

> Note that in the meantime, we no longer use LabelEncoder directly in OneHotEncoder: scikit-learn/scikit-learn#10209 (both now use a common _encode function).

Thanks, I was working off an older branch. This will still be a nice standalone addition though, and I think it's worthwhile diverging from (getting ahead of?) scikit-learn here as using the dtype information is more important for large datasets.
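
To illustrate why the dtype information may matter more as data grows, here is a small pandas-only sketch. The two series stand in for partitions of a larger dataset (an assumption for illustration; this is not the dask-ml implementation). A single partition may not contain every label, so inferring classes per partition gives inconsistent codes, while a shared CategoricalDtype keeps codes and one-hot columns aligned without a global pass over the data.

```python
import pandas as pd

dtype = pd.CategoricalDtype(["a", "b", "c"])

# Two chunks of one logical column; neither contains all three labels.
part1 = pd.Series(["a", "b", "a"])
part2 = pd.Series(["c", "b", "c"])

# Inferring classes per chunk: "b" is code 1 in part1 but code 0 in part2.
inferred1, _ = pd.factorize(part1, sort=True)
inferred2, _ = pd.factorize(part2, sort=True)

# Sharing the dtype: "b" is code 1 in both chunks.
shared1 = part1.astype(dtype).cat.codes
shared2 = part2.astype(dtype).cat.codes

# One-hot encoding a categorical yields a column per category,
# including categories that are absent from the chunk.
dummies1 = pd.get_dummies(part1.astype(dtype))
dummies2 = pd.get_dummies(part2.astype(dtype))
```

With the shared dtype, each chunk can be encoded independently and the results concatenated directly.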

@jorisvandenbossche (Member)

Yeah, this is certainly a worthwhile addition!

@TomAugspurger merged commit 8f49b9d into dask:master on Jul 20, 2018
@TomAugspurger deleted the categorical-label-encoder branch on July 20, 2018, 18:47

2 participants