PERF: faster constructors from ea scalars #45854
pandas/core/dtypes/cast.py
Outdated
cls = dtype.construct_array_type()
subarr = cls._from_sequence([value] * length, dtype=dtype)
if isinstance(dtype, CategoricalDtype):
    subarr = cls._from_sequence([value] * length, dtype=dtype)
what happens if you don't special case Categorical?
Calling take on the empty categorical would error because the categories were not defined.
I just pushed another commit which first sets the categories. Categorical now has a similar improvement:
import pandas as pd
N = 1_000_000
%timeit pd.Series(1, index=range(N), dtype='category')
415 ms ± 91.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <- main
4.5 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 100 loops each) <- PR
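A minimal sketch of the idea behind the speedup, using only public pandas API rather than the PR's internal code: build a length-1 array from the scalar, then `take` position 0 repeatedly instead of materialising a Python list of `length` scalars. The variable names and the small `length` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

value, length = 1, 1_000  # illustrative values, not from the PR

# Slow path: a Python list of `length` scalars is built and parsed.
slow = pd.array([value] * length, dtype="category")

# Fast path: the categories are inferred once from a length-1 array,
# then position 0 is taken `length` times.
one = pd.array([value], dtype="category")
fast = one.take(np.zeros(length, dtype=np.intp))

print(len(fast), list(fast.categories))
```

Both paths should yield the same Categorical; the second avoids inferring categories from a long list.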
> Calling take on the empty categorical would error because the categories were not defined.

Hmm, I think this might be a problem for other not-fully-initialized dtype objects. IIRC IntervalDtype without 'closed' set is another one; not sure if we have a way to detect the general case.
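To make "not fully initialized" concrete for the Categorical case only, here is a small sketch with public API: a bare `CategoricalDtype()` carries no categories until an array is built from data. The IntervalDtype case mentioned above is not shown, and the variable names are assumptions.

```python
import pandas as pd

bare = pd.CategoricalDtype()  # no categories specified
print(bare.categories)        # None: the dtype is not fully initialised

# Building from data infers the categories, giving a fully defined dtype.
one = pd.array([1], dtype=bare)
print(list(one.dtype.categories))
```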
Latest update uses .repeat per @jreback's suggestion below.
pandas/core/dtypes/cast.py
Outdated
cls = dtype.construct_array_type()
subarr = cls._from_sequence([value] * length, dtype=dtype)
subarr = cls._from_sequence([value], dtype=dtype)
taker = np.broadcast_to(np.intp(0), length)
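As a NumPy-only illustration of the `taker` line in the diff above: `np.broadcast_to` returns a read-only view with stride 0, so the indexer of zeros costs no extra memory per element. The `length` value and the array holding 42 are illustrative assumptions.

```python
import numpy as np

length = 5  # illustrative
taker = np.broadcast_to(np.intp(0), length)

print(taker.shape)            # one entry per requested row
print(taker.strides)          # stride 0: every entry aliases the same scalar
print(taker.flags.writeable)  # broadcast views are read-only

# Using it as an indexer repeats element 0 `length` times.
repeated = np.asarray([42]).take(taker)
print(repeated)
```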
can we just use .repeat()? (I think we define that generally on EA)
Yes, that's even better - just updated. Thanks for pointing it out.
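A rough sketch of the `.repeat()` variant with public API. `repeat_scalar` is a hypothetical helper written for illustration, not the function changed in this PR.

```python
import pandas as pd

def repeat_scalar(value, length, dtype):
    """Hypothetical helper: build a length-1 extension array, then repeat it."""
    one = pd.array([value], dtype=dtype)
    return one.repeat(length)

ints = repeat_scalar(1, 5, "Int64")
cats = repeat_scalar("a", 3, "category")
print(ints)
print(cats)
```

Compared with the `take` approach, `repeat` needs no indexer array at all.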
very nice @lukemanley keep em coming!
doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
Perf improvement when constructing a DataFrame/Series from a scalar EA value.
Examples: