ENH: Improve performance for arrow dtypes in monotonic join by phofl · Pull Request #51365 · pandas-dev/pandas · GitHub

@phofl
Member
phofl commented Feb 13, 2023
  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
from pandas import Index

idx = Index(list(range(1, 1_000_000)), dtype="int64[pyarrow]")
idx2 = Index(list(range(100_000, 1_100_000)), dtype="int64[pyarrow]")
idx.union(idx2)

# main
# %timeit idx.union(idx2)
# 327 ms ± 72.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# pr
# %timeit idx.union(idx2)
# 2.79 ms ± 27.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
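
For readers unfamiliar with why monotonic inputs allow a fast path, here is a minimal pure-Python sketch of the general idea: two already-sorted, duplicate-free sequences can be unioned in a single linear pass. The function name `sorted_union` is hypothetical, and this is an illustration of the technique only, not the code pandas runs.

```python
def sorted_union(left, right):
    """Union of two sorted, duplicate-free sequences in one linear pass.

    Illustrative only: `sorted_union` is a hypothetical helper, not a pandas API.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            out.append(left[i])
            i += 1
        elif left[i] > right[j]:
            out.append(right[j])
            j += 1
        else:
            # Value present on both sides: emit it once and advance both.
            out.append(left[i])
            i += 1
            j += 1
    # At most one of these tails is non-empty.
    out.extend(left[i:])
    out.extend(right[j:])
    return out


small = sorted_union([1, 2, 5], [2, 3, 6])
```

Because each element is visited once, the cost grows linearly with the combined length of the inputs, which is what makes a monotonic fast path attractive.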

@jbrockmendel
Member

no objection here, but eventually we ought to find a way to do this dispatch without special-casing inside the index code (i.e. implement something at the EA level)

how is perf affected on multi-chunk pyarrow objs?

@phofl
Member Author
phofl commented Feb 13, 2023

With 2 chunks: arrays have 2 million entries, initial performance was 380 ms, and on this PR:

%timeit idx.union(idx2)
11.5 ms ± 477 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

I totally agree with your point regarding the EA interface; this is a short-term solution for 2.0.

@phofl
Member Author
phofl commented Feb 15, 2023

@jbrockmendel ok to merge?

@phofl phofl added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) and Arrow (pyarrow functionality) labels Feb 15, 2023
@phofl phofl added this to the 2.0 milestone Feb 15, 2023
elif isinstance(self.values, ArrowExtensionArray):
    import pyarrow as pa

    return type(self.values)(pa.array(result))
Member


_from_sequence?

Member Author


good point, changed

Member
@jbrockmendel jbrockmendel left a comment


LGTM

@phofl phofl merged commit d82f9dd into pandas-dev:main Feb 16, 2023
@phofl phofl deleted the pyarrow_monotonic_join branch February 16, 2023 10:47