README.md: 10 additions & 21 deletions
@@ -24,7 +24,16 @@
 
 This is a Python library that binds to [Apache Arrow](https://arrow.apache.org/) in-memory query engine [DataFusion](https://github.com/apache/arrow-datafusion).
 
-DataFusion's Python bindings can be used as an end-user tool as well as providing a foundation for building new systems.
+DataFusion's Python bindings can be used as a foundation for building new data systems in Python. Here are some examples:
+
+- [Dask SQL](https://github.com/dask-contrib/dask-sql) uses DataFusion's Python bindings for SQL parsing, query
+  planning, and logical plan optimizations, and then transpiles the logical plan to Dask operations for execution.
+- [DataFusion Ballista](https://github.com/apache/arrow-ballista) is a distributed SQL query engine that extends
+  DataFusion's Python bindings for distributed use cases.
+
+It is also possible to use these Python bindings directly for DataFrame and SQL operations, but you may find that
+[Polars](http://pola.rs/) and [DuckDB](http://www.duckdb.org/) are more suitable for this use case, since they have
+more of an end-user focus and are more actively maintained than these Python bindings.
 
 ## Features
 
@@ -35,20 +44,6 @@ DataFusion's Python bindings can be used as an end-user tool as well as providin
 - Serialize and deserialize query plans in Substrait format.
 - Experimental support for transpiling SQL queries to DataFrame calls with Polars, Pandas, and cuDF.
 
-## Comparison with other projects
-
-Here is a comparison with similar projects that may help understand when DataFusion might be suitable and unsuitable
-for your needs:
-
-- [DuckDB](http://www.duckdb.org/) is an open source, in-process analytic database. Like DataFusion, it supports
-  very fast execution, both from its custom file format and directly from Parquet files. Unlike DataFusion, it is
-  written in C/C++ and it is primarily used directly by users as a serverless database and query system rather than
-  as a library for building such database systems.
-
-- [Polars](http://pola.rs/) is one of the fastest DataFrame libraries at the time of writing. Like DataFusion, it
-  is also written in Rust and uses the Apache Arrow memory model, but unlike DataFusion it does not provide full SQL
-  support, nor as many extension points.
-
 ## Example Usage
 
 The following example demonstrates running a SQL query against a Parquet file using DataFusion, storing the results
@@ -143,12 +138,6 @@ See [examples](examples/README.md) for more information.
 
 - [Serialize query plans using Substrait](./examples/substrait.py)
 
-### Executing SQL against DataFrame Libraries (Experimental)
-
-- [Executing SQL on Polars](./examples/sql-on-polars.py)
-- [Executing SQL on Pandas](./examples/sql-on-pandas.py)
-- [Executing SQL on cuDF](./examples/sql-on-cudf.py)