Introduction
This ticket is a weekly summary of interesting things happening in DataFusion. Note this is not a complete list (it is what I remember / can find). Please feel free to leave comments on this ticket about things that I may have missed or that you think should get wider attention from the community.
Loosely inspired by https://this-week-in-rust.org/
DataFusion Related Blogs
- Caching in DataFusion: Don't read twice from @XiangpengHao
- Parquet pruning in DataFusion: Read no more than you need from @XiangpengHao
Upcoming Releases
- Release DataFusion 42.2.0 #13166 -- trying to unblock delta-rs upgrade
- Release DataFusion 43.0.0 #12470 (thanks @andygrove)
- Release sqlparser-rs version 0.52.0 datafusion-sqlparser-rs#1423 (huge kudos to @iffyio for all the reviews)
Major Projects / Discussions under way
- [DISCUSSION] Make DataFusion the fastest engine for querying parquet data in ClickBench #12821 -- show the world what you can do with focused engineering effort. Thanks to the epic work of @Rachelint, @goldmedal, @jayzhan211, @Dandandan, @XiangpengHao and others, we are quite close.
- Adaptive Parquet Predicate Pushdown Evaluation arrow-rs#5523 - @XiangpengHao and @tustvold are working to make parquet even better
- [DISCUSS] Document criteria for adding new features / what belongs in core DataFusion (e.g. sql syntax, functions, etc) #12357
- @timsaucer is working on FFI initial implementation #12920, bindings to DataFusion (stable C API). Making good progress with various PRs
- Helping make DataFusion more visible: Enhancing DataFusion's Community Engagement and Visibility #13049 @SamSynnada
- @2010YOUY01 started working on [EPIC] Improved Externalized / Spilling / Large than Memory Hash Aggregation #13123
- @eejbyfeldt has been bashing away at bugs / things that prevent complete TPC-DS run such as feat: Move subquery check from analyzer to PullUpCorrelatedExpr (Fix TPC-DS q41) #13091
Highlights from last week(s):
(I am sorry if I missed you -- please add a note to this ticket with anything you would like to highlight)
- 🎉 Rust ORC implementation is "graduating" from datafusion-contrib: discuss: Move into the Apache ORC PMC and develop as apache/orc-rust datafusion-contrib/datafusion-orc#120. Thanks @waynexia and @Xuanwo
- PR is up to improve predicate pushdown into parquet: Implement predicate pruning for like expressions (prefix matching) #12978 from @adriangb
- @Blizzara, @LatrecheYasser, @vbarua, @westonpace and @tokoko keep hardening the substrait implementation with PRs such as this and this
- @goldmedal and @sgrebnov are making the Plan --> SQL text unparser cover even more SQL: Improve TableScan with filters pushdown unparsing (joins) #13132
- @comphead is methodically making Sort-Merge-Join production ready (e.g. Move filtered SMJ Left Anti filtered join out of join_partial phase #13111)
- @Omega359 and @mnorfolk03 are getting our SQL planner benchmarks in good shape: Add clickbench parquet based queries to sql_planner benchmark #13103 / chore: Added a number of physical planning join benchmarks #13085
- @2010YOUY01 began working on improving memory limited aggregation: Add benchmark for memory-limited aggregation #13090
- @LeslieKid improved the aggregation test coverage: feat: Add Date32/Date64 in aggregate fuzz testing #13041
- March towards all user defined UDWF: Convert ntile builtIn function to UDWF #13040 from @jatin510
- @buraksenn submitted several improvements ❤ : [docs]: migrate lead/lag window function docs to new docs #13095 / Remove unused LogicalPlan::CrossJoin as it is unused #13076 / extended log.rs tests for unary/binary and f32/f64 casting #13034
Looking to get more involved? Try code review!
DataFusion has a long history of community members contributing in all aspects of the project. Reviewing PRs is an especially great way to get introduced to the project, help the community and grow your own knowledge -- researching and understanding the code enough to review PRs also often inspires additional ideas for improvements.
We have docs about reviews. The TL;DR: check for test coverage, whether the change is understandable and well documented, and whether the code can be improved. When you think the PR looks good to merge, try @mentioning one of the committers.
Help wanted
Please feel free to leave your own comments on the ticket if you are looking for help.
Community
- Weekly Call
- Slack/Discord: info links
Upcoming meetups:
- Dec 18 Chicago: https://lu.ma/eq5myc5i @adriangb @timsaucer
- TBD: DISCUSSION: January 2025 DataFusion Meetup in Amsterdam / CIDR 2025 #12988
- Jan 15 Boston
Background:
Previous update: #13035
Andrew's Focus Areas:
We are preparing for the 43.0.0 release, and I am personally pretty excited about the following (and thus actively helping with them / putting them at the top of my review list):
- [DISCUSSION] Make DataFusion the fastest engine for querying parquet data in ClickBench #12821 (thanks to the epic work of @Rachelint, @goldmedal, @jayzhan211, @Dandandan @XiangpengHao and others, we are quite close)
- [Epic] Unify WindowFunction Interface (remove built in list of BuiltInWindowFunctions) #8709 (very close to finishing thanks @jcsherin @jatin510)
- [EPIC] Automatically generate all function documentation from code #12740 (also almost done thanks to @Omega359 and @jonathanc-n)
- Aggregation fuzz testing #12114 (thanks @Rachelint for all your help so far)