PERF: faster constructors from ea scalars by lukemanley · Pull Request #45854 · pandas-dev/pandas · GitHub

lukemanley (Member) commented Feb 7, 2022

Perf improvement when constructing a DataFrame/Series from a scalar EA value.

Examples:

import pandas as pd

N = 1_000_000


%timeit pd.DataFrame({"A": pd.NA, "B": 1.0}, index=range(N), dtype=pd.Float64Dtype())
2.18 s ± 214 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)       <- main
22.4 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)   <- PR


%timeit pd.Series(pd.NA, index=range(N), dtype=pd.Float64Dtype())
1.62 s ± 50.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)      <- main
7.33 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR


%timeit pd.Series(1, index=range(N), dtype='Int64')
242 ms ± 21.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)      <- main
11.6 ms ± 231 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <- PR
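
For readers without IPython, the same three calls can be timed with the standard `timeit` module. This is a minimal sketch and not part of the PR: the helper name `time_call` and the smaller `N` are assumptions chosen to keep the run short, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

N = 100_000  # assumption: smaller than the PR's 1_000_000 to keep the run short


def time_call(func, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` single runs."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


# The three constructor calls benchmarked in the PR description.
cases = {
    "DataFrame Float64": lambda: pd.DataFrame(
        {"A": pd.NA, "B": 1.0}, index=range(N), dtype=pd.Float64Dtype()
    ),
    "Series Float64 NA": lambda: pd.Series(
        pd.NA, index=range(N), dtype=pd.Float64Dtype()
    ),
    "Series Int64": lambda: pd.Series(1, index=range(N), dtype="Int64"),
}

timings = {name: time_call(func) for name, func in cases.items()}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.2f} ms")
```

Taking the minimum over a few runs is a common way to reduce noise from other processes; it is not the same statistic as the mean ± std. dev. that `%timeit` reports above.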

lukemanley added the Constructors, ExtensionArray, and Performance labels on Feb 7, 2022

  cls = dtype.construct_array_type()
- subarr = cls._from_sequence([value] * length, dtype=dtype)
+ if isinstance(dtype, CategoricalDtype):
+     subarr = cls._from_sequence([value] * length, dtype=dtype)

Member

what happens if you don't special case Categorical?

Member Author

Calling take on the empty categorical would error because the categories were not defined.

I just pushed another commit which first sets the categories. Categorical now has a similar improvement:

import pandas as pd

N = 1_000_000

%timeit pd.Series(1, index=range(N), dtype='category')
415 ms ± 91.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)     <- main
4.5 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR
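
To see why a single element is enough in the categorical case, it helps to look at the result: every row shares one category, so all integer codes are identical. The sketch below only inspects the public result of the constructor. It does not reproduce the PR's internal code, and the small `N` is an assumption for brevity.

```python
import pandas as pd

N = 1_000  # assumption: small size, only the structure matters here

s = pd.Series(1, index=range(N), dtype="category")

# A scalar-filled categorical has exactly one category ...
categories = s.cat.categories.tolist()
# ... and every row points at it, so all integer codes are 0.
all_codes_zero = bool((s.cat.codes == 0).all())

print(categories, all_codes_zero)
```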

Member

> Calling take on the empty categorical would error because the categories were not defined.

Hmm, I think this might be a problem for other not-fully-initialized dtype objects. IIRC IntervalDtype without 'closed' set is another one; not sure if we have a way to detect the general case.
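
The concern about not-fully-initialized dtypes can be seen from the public API. In this sketch, which is only an illustration and not code from the PR, both dtype objects are created without arguments and expose `None` where a fully specified dtype would carry categories or a subtype.

```python
import pandas as pd

# A CategoricalDtype created without categories has none defined yet;
# they are normally inferred from the data at construction time.
cat_dtype = pd.CategoricalDtype()
print(cat_dtype.categories)  # None

# An IntervalDtype created without arguments has no subtype. On pandas
# versions that expose `closed` on the dtype, that is unset as well.
interval_dtype = pd.IntervalDtype()
print(interval_dtype.subtype)  # None
print(getattr(interval_dtype, "closed", None))
```

Whether there is a general way to detect such dtypes is left open in the comment above; the sketch only shows the two cases it names.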

Member Author

The latest update uses .repeat, per @jreback's suggestion below.

  cls = dtype.construct_array_type()
- subarr = cls._from_sequence([value] * length, dtype=dtype)
+ subarr = cls._from_sequence([value], dtype=dtype)
+ taker = np.broadcast_to(np.intp(0), length)

jreback (Contributor) commented Feb 7, 2022

Can we just use .repeat()? (I think we define that generally on EA.)

Member Author

Yes, that's even better - just updated. Thanks for pointing it out.
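
The two approaches discussed in this thread can be compared side by side. The following is a minimal sketch using the public `pd.array`, `ExtensionArray.take`, and `ExtensionArray.repeat`. The PR itself goes through the internal `cls._from_sequence` path, so the variable names and the small `N` here are assumptions for illustration.

```python
import numpy as np
import pandas as pd

N = 1_000  # assumption: small size for illustration

# Baseline (the list-based path shown in the diff): materialise N scalars first.
from_list = pd.array([1] * N, dtype="Int64")

# Build a one-element array once ...
one = pd.array([1], dtype="Int64")

# ... then either take index 0 N times, as in the diff snippet above ...
taker = np.broadcast_to(np.intp(0), N)
via_take = one.take(taker)

# ... or simply repeat the single element N times, as suggested in review.
via_repeat = one.repeat(N)

print(via_take.equals(from_list), via_repeat.equals(from_list))
```

Both variants avoid building a Python list of `N` scalars; `repeat` additionally avoids constructing an index array.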

jreback added this to the 1.5 milestone Feb 7, 2022
jreback merged commit ab6901c into pandas-dev:main Feb 9, 2022

jreback (Contributor) commented Feb 9, 2022

Very nice @lukemanley, keep 'em coming!

phofl pushed a commit to phofl/pandas that referenced this pull request Feb 14, 2022
lukemanley deleted the ea-from-scalar branch March 2, 2022 01:13
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022