Description
I think this project needs someone who wants to build a world-class Python DataFrame library and user experience to take the helm. Below I argue why this is a compelling opportunity to build a great piece of technology and have a wide impact across the data analytics space.
What this project could be
I think this project could be one of the most widely used data analysis libraries out there. Imagine a system that offers BOTH a fast DataFrame API (à la pola.rs) and first-class SQL support (à la DuckDB). Both would be screaming fast (thanks to all the effort that goes into https://github.com/apache/arrow-datafusion), easy to plug into the ecosystem (Arrow / Parquet), and extensible (UDFs, UDAs, etc.).
DataFusion already posts great benchmark numbers, and I will post DataFusion 28.0.0 benchmarks when we have them.
How is this different from the mission of DataFusion?
DataFusion is a great project but is currently focused on building the core analytic engine:
"DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format."
This repository contains basic Python bindings, but the user experience (UX) could be improved in many ways.
The opportunity
This would be a great opportunity for someone to:
- Build some really cool technology
- Learn how to help grow an open source project and community with help and guidance from the rest of the DataFusion community
- Learn about analytic database technology, Arrow, etc.
- Influence the direction of development in DataFusion