docs: enhance user guide with detailed DataFrame operations and examples · kosiew/datafusion-python@e8308de · GitHub

Commit e8308de

docs: enhance user guide with detailed DataFrame operations and examples
1 parent 818975b commit e8308de

3 files changed

+377
-0
lines changed


docs/source/api/dataframe.rst

Lines changed: 162 additions & 0 deletions
@@ -0,0 +1,162 @@
1+
DataFrames
2+
==========
3+
4+
Overview
5+
--------
6+
7+
DataFusion's DataFrame API provides a powerful interface for building and executing queries against data sources.
8+
It offers a familiar API similar to pandas and other DataFrame libraries, but with the performance benefits of Rust
9+
and Arrow.
10+
11+
A DataFrame represents a logical plan that can be composed through operations like filtering, projection, and aggregation.
12+
Execution is deferred until a terminal operation such as ``collect()`` or ``show()`` is called.
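
The deferred-execution model described here can be illustrated without DataFusion itself. The following is a minimal sketch in plain Python; ``LazyFrame`` and its methods are hypothetical names invented for this illustration and are not part of the DataFusion API. Each transformation only records a step, and nothing runs until the terminal ``collect()`` call.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (hypothetical, not DataFusion)."""

    def __init__(self, rows, steps=()):
        self._rows = rows            # source data: a list of dicts
        self._steps = tuple(steps)   # recorded plan, not yet executed

    def filter(self, predicate):
        # Record the step and return a new frame; no rows are touched here.
        return LazyFrame(self._rows, self._steps + (("filter", predicate),))

    def select(self, *names):
        # Record a projection onto the named columns.
        return LazyFrame(self._rows, self._steps + (("select", names),))

    def collect(self):
        # Terminal operation: replay the recorded steps over the source rows.
        rows = self._rows
        for kind, arg in self._steps:
            if kind == "filter":
                rows = [row for row in rows if arg(row)]
            else:  # "select"
                rows = [{name: row[name] for name in arg} for row in rows]
        return rows


people = [
    {"name": "Ada", "age": 36, "city": "London"},
    {"name": "Bob", "age": 19, "city": "Leeds"},
]
plan = LazyFrame(people).filter(lambda row: row["age"] > 25).select("name", "age")
result = plan.collect()  # execution happens only here
print(result)
```

Building ``plan`` does no work on the rows; only ``collect()`` walks the data, which is the same division of labour the paragraph describes for logical plans and terminal operations.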
13+
14+
Basic Usage
15+
-----------
16+
17+
.. code-block:: python
18+
19+
import datafusion
20+
from datafusion import col, lit
21+
22+
# Create a context and register a data source
23+
ctx = datafusion.SessionContext()
24+
ctx.register_csv("my_table", "path/to/data.csv")
25+
26+
# Create and manipulate a DataFrame
27+
df = ctx.sql("SELECT * FROM my_table")
28+
29+
# Or use the DataFrame API directly
30+
df = (ctx.table("my_table")
31+
.filter(col("age") > lit(25))
32+
.select([col("name"), col("age")]))
33+
34+
# Execute the plan and collect the results as a list of Arrow record batches
35+
result = df.collect()
36+
37+
# Display the first few rows
38+
df.show()
39+
40+
HTML Rendering
41+
--------------
42+
43+
When working in Jupyter notebooks or other environments that support HTML rendering, DataFrames will
44+
automatically display as formatted HTML tables, making it easier to visualize your data.
45+
46+
The ``_repr_html_`` method is called automatically by Jupyter to render a DataFrame. This method
47+
controls how DataFrames appear in notebook environments, providing a richer visualization than
48+
plain text output.
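
As a rough illustration of the display protocol involved, any Python object can opt in to rich notebook output by defining ``_repr_html_``. The sketch below uses only the standard library; ``TinyTable`` is a hypothetical class made up for this example and does not reflect how DataFusion builds its HTML.

```python
from html import escape


class TinyTable:
    """Hypothetical table type that opts in to rich notebook display."""

    def __init__(self, columns, rows):
        self.columns = columns
        self.rows = rows

    def _repr_html_(self):
        # Jupyter/IPython looks for this method and renders the returned HTML.
        head = "".join(f"<th>{escape(str(c))}</th>" for c in self.columns)
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
            for row in self.rows
        )
        return f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"


table = TinyTable(["name", "age"], [["Ada", 36], ["<Bob>", 19]])
html_output = table._repr_html_()
print(html_output)
```

Cell values are escaped before being embedded so that data containing ``<`` or ``&`` cannot break the generated markup.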
49+
50+
Customizing HTML Rendering
51+
--------------------------
52+
53+
You can customize how DataFrames are rendered in HTML by configuring the formatter:
54+
55+
.. code-block:: python
56+
57+
from datafusion.html_formatter import configure_formatter
58+
59+
# Change the default styling
60+
configure_formatter(
61+
max_rows=50, # Maximum number of rows to display
62+
max_width=None, # Maximum width in pixels (None for auto)
63+
theme="light", # Theme: "light" or "dark"
64+
precision=2, # Floating point precision
65+
thousands_separator=",", # Separator for thousands
66+
date_format="%Y-%m-%d", # Date format
67+
truncate_width=20 # Max width for string columns before truncating
68+
)
69+
70+
The formatter settings affect all DataFrames displayed after configuration.
71+
72+
Custom Style Providers
73+
----------------------
74+
75+
For advanced styling needs, you can create a custom style provider:
76+
77+
.. code-block:: python
78+
79+
from datafusion.html_formatter import StyleProvider, configure_formatter
80+
81+
class MyStyleProvider(StyleProvider):
82+
    def get_table_styles(self):
83+
        return {
84+
            "table": "border-collapse: collapse; width: 100%;",
85+
            "th": "background-color: #007bff; color: white; padding: 8px; text-align: left;",
86+
            "td": "border: 1px solid #ddd; padding: 8px;",
87+
            "tr:nth-child(even)": "background-color: #f2f2f2;",
88+
        }
89+
90+
    def get_value_styles(self, dtype, value):
91+
        """Return custom styles for specific values"""
92+
        if dtype == "float" and value < 0:
93+
            return "color: red;"
94+
        return None
95+
96+
# Apply the custom style provider
97+
configure_formatter(style_provider=MyStyleProvider())
98+
99+
Creating a Custom Formatter
100+
---------------------------
101+
102+
For complete control over rendering, you can implement a custom formatter:
103+
104+
.. code-block:: python
105+
106+
from datafusion.html_formatter import Formatter, configure_formatter, get_formatter
107+
108+
class MyFormatter(Formatter):
109+
    def format_html(self, batches, schema, has_more=False, table_uuid=None):
110+
        # Create your custom HTML here
111+
        html = "<div class='my-custom-table'>"
112+
        # ... formatting logic ...
113+
        html += "</div>"
114+
        return html
115+
116+
# Set as the global formatter
117+
configure_formatter(formatter_class=MyFormatter)
118+
119+
# Or call the formatter directly; batches and schema are assumed to come from df.collect() and df.schema()
120+
formatter = get_formatter()
121+
custom_html = formatter.format_html(batches, schema)
122+
123+
Managing Formatters
124+
-------------------
125+
126+
Reset to default formatting:
127+
128+
.. code-block:: python
129+
130+
from datafusion.html_formatter import reset_formatter
131+
132+
# Reset to default settings
133+
reset_formatter()
134+
135+
Get the current formatter settings:
136+
137+
.. code-block:: python
138+
139+
from datafusion.html_formatter import get_formatter
140+
141+
formatter = get_formatter()
142+
print(formatter.max_rows)
143+
print(formatter.theme)
144+
145+
Contextual Formatting
146+
---------------------
147+
148+
You can also use a context manager to temporarily change formatting settings:
149+
150+
.. code-block:: python
151+
152+
from datafusion.html_formatter import formatting_context
153+
154+
# Default formatting
155+
df.show()
156+
157+
# Temporarily use different formatting
158+
with formatting_context(max_rows=100, theme="dark"):
159+
    df.show()  # Will use the temporary settings
160+
161+
# Back to default formatting
162+
df.show()
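
To make the save-and-restore behaviour of such a context manager concrete, here is a minimal sketch built on ``contextlib``. The ``_settings`` dictionary and ``temporary_settings`` function are hypothetical stand-ins; this is not the implementation of ``formatting_context``, only one plausible way to get the same effect.

```python
from contextlib import contextmanager

# Hypothetical module-level settings, standing in for a global formatter.
_settings = {"max_rows": 10, "theme": "light"}


@contextmanager
def temporary_settings(**overrides):
    """Apply overrides for the duration of a with-block, then restore."""
    saved = dict(_settings)
    _settings.update(overrides)
    try:
        yield _settings
    finally:
        # Restore the previous values even if the block raises.
        _settings.clear()
        _settings.update(saved)


with temporary_settings(max_rows=100, theme="dark"):
    inside = dict(_settings)   # snapshot taken while overrides are active

after = dict(_settings)        # snapshot taken after the block exits
print(inside, after)
```

Restoring in a ``finally`` clause means an exception inside the block cannot leave the temporary settings in place.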

docs/source/user-guide/basics.rst

Lines changed: 2 additions & 0 deletions
@@ -72,6 +72,8 @@ DataFrames are typically created by calling a method on :py:class:`~datafusion.c
72 72
calling the transformation methods, such as :py:func:`~datafusion.dataframe.DataFrame.filter`, :py:func:`~datafusion.dataframe.DataFrame.select`, :py:func:`~datafusion.dataframe.DataFrame.aggregate`,
73 73
and :py:func:`~datafusion.dataframe.DataFrame.limit` to build up a query definition.
74 74

75+
For more details on working with DataFrames, including visualization options and conversion to other formats, see :doc:`dataframe`.
76+
75 77
Expressions
76 78
-----------
77 79

docs/source/user-guide/dataframe.rst

Lines changed: 213 additions & 0 deletions
@@ -0,0 +1,213 @@
1+
.. Licensed to the Apache Software Foundation (ASF) under one
2+
.. or more contributor license agreements. See the NOTICE file
3+
.. distributed with this work for additional information
4+
.. regarding copyright ownership. The ASF licenses this file
5+
.. to you under the Apache License, Version 2.0 (the
6+
.. "License"); you may not use this file except in compliance
7+
.. with the License. You may obtain a copy of the License at
8+
9+
.. http://www.apache.org/licenses/LICENSE-2.0
10+
11+
.. Unless required by applicable law or agreed to in writing,
12+
.. software distributed under the License is distributed on an
13+
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14+
.. KIND, either express or implied. See the License for the
15+
.. specific language governing permissions and limitations
16+
.. under the License.
17+
18+
DataFrame Operations
19+
====================
20+
21+
Working with DataFrames
22+
-----------------------
23+
24+
A DataFrame in DataFusion represents a logical plan that defines a series of operations to be performed on data.
25+
This logical plan is not executed until you call a terminal operation like :py:func:`~datafusion.dataframe.DataFrame.collect`
26+
or :py:func:`~datafusion.dataframe.DataFrame.show`.
27+
28+
DataFrames provide a familiar API for data manipulation:
29+
30+
.. ipython:: python
31+
32+
import datafusion
33+
from datafusion import col, lit, functions as f
34+
35+
ctx = datafusion.SessionContext()
36+
37+
# Create a DataFrame from a CSV file
38+
df = ctx.read_csv("example.csv")
39+
40+
# Add transformations
41+
df = df.filter(col("age") > lit(30)) \
42+
.select([col("name"), col("age"), (col("salary") * lit(1.1)).alias("new_salary")]) \
43+
.sort("age")
44+
45+
# Execute the plan
46+
df.show()
47+
48+
Common DataFrame Operations
49+
---------------------------
50+
51+
DataFusion supports a wide range of operations on DataFrames:
52+
53+
Filtering and Selection
54+
~~~~~~~~~~~~~~~~~~~~~~~
55+
56+
.. ipython:: python
57+
58+
# Filter rows
59+
df = df.filter(col("age") > lit(30))
60+
61+
# Select columns
62+
df = df.select([col("name"), col("age")])
63+
64+
# Select by column name
65+
df = df.select_columns(["name", "age"])
66+
67+
# Select using column indexing
68+
df = df["name", "age"]
69+
70+
Aggregation
71+
~~~~~~~~~~~
72+
73+
.. ipython:: python
74+
75+
# Group by and aggregate
76+
df = df.aggregate(
77+
[col("category")], # Group by columns
78+
[f.sum(col("amount")).alias("total"),
79+
f.avg(col("price")).alias("avg_price")]
80+
)
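
For readers who want to see what a grouped sum and average compute, the same result can be spelled out in plain Python. This is only a sketch of the semantics, assuming rows are dictionaries with ``category``, ``amount``, and ``price`` keys; ``aggregate_rows`` is a hypothetical helper and says nothing about how DataFusion executes the aggregation.

```python
from collections import defaultdict


def aggregate_rows(rows, group_key):
    """Group rows by one key, then compute a sum and an average per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    return [
        {
            group_key: key,
            "total": sum(r["amount"] for r in members),
            "avg_price": sum(r["price"] for r in members) / len(members),
        }
        for key, members in groups.items()
    ]


sales = [
    {"category": "a", "amount": 10, "price": 2.0},
    {"category": "a", "amount": 5, "price": 4.0},
    {"category": "b", "amount": 7, "price": 1.0},
]
summary = aggregate_rows(sales, "category")
print(summary)
```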
81+
82+
Joins
83+
~~~~~
84+
85+
.. ipython:: python
86+
87+
# Join two DataFrames
88+
df_joined = df1.join(
89+
df2,
90+
how="inner",
91+
left_on=["id"],
92+
right_on=["id"]
93+
)
94+
95+
# Join with custom expressions
96+
df_joined = df1.join_on(
97+
df2,
98+
[col("df1.id") == col("df2.id")],
99+
how="left"
100+
)
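
The difference between ``how="inner"`` and ``how="left"`` can be sketched in plain Python. The ``join_rows`` helper below is hypothetical and written only for illustration; it merges the key column into a single field, which may differ from how DataFusion names the columns of a joined DataFrame.

```python
def join_rows(left, right, key, how="inner"):
    """Join two lists of dicts on a shared key (illustrative only)."""
    if how not in ("inner", "left"):
        raise ValueError(f"unsupported join type: {how}")

    # Index the right-hand rows by key so each lookup is a dict access.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    right_columns = {name for row in right for name in row if name != key}

    joined = []
    for row in left:
        matches = index.get(row[key], [])
        if matches:
            joined.extend({**row, **match} for match in matches)
        elif how == "left":
            # Keep the unmatched left row; right-hand columns become None.
            joined.append({**row, **dict.fromkeys(right_columns)})
    return joined


users = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}]
orders = [{"id": 1, "total": 30}]
inner = join_rows(users, orders, "id", how="inner")
left = join_rows(users, orders, "id", how="left")
print(inner)
print(left)
```

An inner join drops the user with no matching order, while a left join keeps that user and fills the missing order columns with ``None``.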
101+
102+
DataFrame Visualization
103+
-----------------------
104+
105+
Jupyter Notebook Integration
106+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
107+
108+
When working in Jupyter notebooks, DataFrames automatically display as HTML tables. This is
109+
handled by the :code:`_repr_html_` method, which provides a rich, formatted view of your data.
110+
111+
.. ipython:: python
112+
113+
# DataFrames render as HTML tables in notebooks
114+
df # Just displaying the DataFrame renders it as HTML
115+
116+
Customizing DataFrame Display
117+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
118+
119+
You can customize how DataFrames are displayed using the HTML formatter:
120+
121+
.. ipython:: python
122+
123+
from datafusion.html_formatter import configure_formatter
124+
125+
# Change display settings
126+
configure_formatter(
127+
max_rows=100, # Show more rows
128+
truncate_width=30, # Allow longer strings
129+
theme="light", # Use light theme
130+
precision=2 # Set decimal precision
131+
)
132+
133+
# Now display uses the new format
134+
df.show()
135+
136+
Creating a Custom Style Provider
137+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
138+
139+
For advanced styling needs, you can create a custom style provider:
140+
141+
.. code-block:: python
142+
143+
from datafusion.html_formatter import StyleProvider, configure_formatter
144+
145+
class CustomStyleProvider(StyleProvider):
146+
    def get_table_styles(self):
147+
        return {
148+
            "table": "border-collapse: collapse; width: 100%;",
149+
            "th": "background-color: #4CAF50; color: white; padding: 10px;",
150+
            "td": "border: 1px solid #ddd; padding: 8px;",
151+
            "tr:hover": "background-color: #f5f5f5;",
152+
        }
153+
154+
    def get_value_styles(self, dtype, value):
155+
        if dtype == "float" and value < 0:
156+
            return "color: red; font-weight: bold;"
157+
        return None
158+
159+
# Apply custom styling
160+
configure_formatter(style_provider=CustomStyleProvider())
161+
162+
Managing Display Settings
163+
~~~~~~~~~~~~~~~~~~~~~~~~~
164+
165+
You can temporarily change formatting settings with a context manager, or reset them to the defaults:
166+
167+
.. code-block:: python
168+
169+
from datafusion.html_formatter import formatting_context
170+
171+
# Use different formatting temporarily
172+
with formatting_context(max_rows=5, theme="dark"):
173+
    df.show()  # Will show only 5 rows with dark theme
174+
175+
# Reset to default formatting
176+
from datafusion.html_formatter import reset_formatter
177+
reset_formatter()
178+
179+
Converting to Other Formats
180+
---------------------------
181+
182+
DataFusion DataFrames can be easily converted to other popular formats:
183+
184+
.. ipython:: python
185+
186+
# Convert to Arrow Table
187+
arrow_table = df.to_arrow_table()
188+
189+
# Convert to Pandas DataFrame
190+
pandas_df = df.to_pandas()
191+
192+
# Convert to Polars DataFrame
193+
polars_df = df.to_polars()
194+
195+
# Convert to Python data structures
196+
python_dict = df.to_pydict()
197+
python_list = df.to_pylist()
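
The two Python conversions differ in orientation: a dictionary of columns versus a list of row dictionaries. The sketch below shows the two shapes and how they map onto each other, using only built-in types. The helper names are hypothetical, and the sketch assumes every row has the same keys.

```python
def rows_to_columns(rows):
    """Row-oriented list of dicts -> column-oriented dict of lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def columns_to_rows(columns):
    """Column-oriented dict of lists -> row-oriented list of dicts."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


as_rows = [{"name": "Ada", "age": 36}, {"name": "Bob", "age": 19}]
as_columns = rows_to_columns(as_rows)
round_trip = columns_to_rows(as_columns)
print(as_columns)
print(round_trip)
```

The column-oriented shape suits per-column processing, while the row-oriented shape is convenient when iterating over records one at a time.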
198+
199+
Saving DataFrames
200+
-----------------
201+
202+
You can write DataFrames to various file formats:
203+
204+
.. ipython:: python
205+
206+
# Write to CSV
207+
df.write_csv("output.csv", with_header=True)
208+
209+
# Write to Parquet
210+
df.write_parquet("output.parquet", compression="zstd")
211+
212+
# Write to JSON
213+
df.write_json("output.json")
