PERF: Merge empty frame by lukemanley · Pull Request #45838 · pandas-dev/pandas · GitHub

lukemanley (Member) commented Feb 5, 2022

Faster merge when left or right is empty.
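
As a rough illustration of why an empty side allows a shortcut (this is an assumption about the general idea, not the code in this PR): when one frame has no rows, no key can match, so the row indexers for the join follow directly from the two lengths and `how`, without hashing any keys. The helper name `empty_side_indexers` is hypothetical, and the sketch ignores `sort`.

```python
import numpy as np


def empty_side_indexers(n_left, n_right, how):
    """Hypothetical helper: row indexers for a join where one side is empty.

    Returns (left_indexer, right_indexer); -1 marks "no matching row".
    Illustrative sketch only, not pandas' internal implementation.
    """
    if n_left != 0 and n_right != 0:
        raise ValueError("sketch only covers the case where one side is empty")
    empty = np.array([], dtype=np.intp)
    if n_right == 0 and how in ("left", "outer"):
        # Every left row is kept and has no partner on the right.
        return np.arange(n_left, dtype=np.intp), np.full(n_left, -1, dtype=np.intp)
    if n_left == 0 and how in ("right", "outer"):
        # Every right row is kept and has no partner on the left.
        return np.full(n_right, -1, dtype=np.intp), np.arange(n_right, dtype=np.intp)
    # Remaining cases ("inner", or keeping only the empty side) produce no rows.
    return empty, empty
```

Building these arrays costs at most one pass over the non-empty side, which would be consistent with the kind of speedup reported in this PR, although the actual implementation may take a different route.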

One ASV added:

       before           after         ratio
     [7651c082]       [4f451520]
     <main>           <merge-empty-frame>
-      12.0±0.3ms      1.42±0.09ms     0.12  join_merge.Merge.time_merge_dataframe_empty(False)
-        24.3±2ms      1.29±0.01ms     0.05  join_merge.Merge.time_merge_dataframe_empty(True)

Some additional examples:

import numpy as np
import pandas as pd

N = 10_000_000

df1 = pd.DataFrame(
    np.random.randint(0, 100, (N, 2)),
    columns=['A', 'B'],
)

df2 = df1.set_index('A')

df_empty = pd.DataFrame(columns=['A', 'C'], dtype='int64')
%timeit df1.merge(df_empty, how="right")  
242 ms ± 5.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)        <-- main
1.12 ms ± 50.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- PR
%timeit df2.merge(df_empty, how="left", left_index=True, right_on='A')    
1.07 s ± 47.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- main
241 ms ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- PR
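
For reference, a small sketch of what these merges return, using much smaller frames than the timings above. The frame contents and the `rows_by_how` name are made up for illustration. With an empty right frame, only `how="left"` and `how="outer"` keep any rows, and the column that comes only from the empty side is filled with missing values.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})
right_empty = pd.DataFrame(columns=["A", "C"], dtype="int64")

# Number of result rows for each join type when the right frame is empty.
rows_by_how = {
    how: len(left.merge(right_empty, on="A", how=how))
    for how in ("inner", "left", "right", "outer")
}
print(rows_by_how)

# A left merge keeps every left row; "C" has no source rows, so it is all missing.
result = left.merge(right_empty, on="A", how="left")
print(result)
```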

lukemanley added the Performance (memory or execution speed performance) and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Feb 5, 2022
    def time_merge_dataframe_integer_key(self, sort):
        merge(self.df, self.df2, on="key1", sort=sort)

    def time_merge_dataframe_empty(self, sort):
Contributor

can you add the reverse as well (e.g. left empty)

Member Author

reverse added here
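
A minimal sketch of how a pair of ASV benchmarks covering both directions might look. The class name, the `setup` data, and the method names are assumptions for illustration; the benchmark actually added lives in `join_merge.Merge` and may differ.

```python
import numpy as np
import pandas as pd
from pandas import merge


class MergeEmptySketch:
    # ASV runs each time_* method once per value in params.
    params = [True, False]
    param_names = ["sort"]

    def setup(self, sort):
        # Hypothetical data; sizes are arbitrary.
        n = 10_000
        keys = np.arange(n)
        self.left = pd.DataFrame({"key": keys, "value": np.random.randn(n)})
        self.right = pd.DataFrame({"key": keys, "value2": np.random.randn(n)})

    def time_merge_empty_right(self, sort):
        # Right frame sliced to zero rows, same columns and dtypes.
        merge(self.left, self.right.iloc[:0], on="key", sort=sort)

    def time_merge_empty_left(self, sort):
        # The reverse case: left frame is empty.
        merge(self.left.iloc[:0], self.right, on="key", sort=sort)
```

Slicing with `.iloc[:0]` keeps the column names and dtypes of the original frame, so the empty side still has a valid join key.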

- Performance improvement in :meth:`DataFrame.duplicated` when subset consists of only one column (:issue:`45236`)
- Performance improvement in :meth:`.GroupBy.transform` when broadcasting values for user-defined functions (:issue:`45708`)
- Performance improvement in :meth:`.GroupBy.transform` for user-defined functions when only a single group exists (:issue:`44977`)
- Performance improvement in :meth:`DataFrame.merge` when left and/or right are empty (:issue:`45838`)
phofl (Member) commented Feb 6, 2022

I think we normally use :func: merge instead of DataFrame.merge?

Member Author

updated to :func: merge. thanks

jreback added this to the 1.5 milestone Feb 9, 2022
jreback (Contributor) commented Feb 9, 2022

lgtm will merge on green.

tip to avoid conflicts in the release notes: add the note somewhere in the middle rather than at the bottom

phofl (Member) commented Feb 9, 2022

Greenish

phofl merged commit c51c2a7 into pandas-dev:main Feb 9, 2022
phofl (Member) commented Feb 9, 2022

Thx @lukemanley

phofl pushed a commit to phofl/pandas that referenced this pull request Feb 14, 2022
* faster merge with empty frame

* whatsnew

* docs, tests, asvs

* fix whatsnew

Co-authored-by: Jeff Reback <jeff@reback.net>
lukemanley mentioned this pull request Feb 16, 2022
3 tasks
lukemanley deleted the merge-empty-frame branch March 2, 2022 01:13
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022