Switch to Ruff for Python linting (#529) · samuelcolvin/datafusion-python@76ecf56 · GitHub

Commit 76ecf56

Switch to Ruff for Python linting (apache#529)
1 parent 3a82be0 commit 76ecf56

31 files changed: +239 −496 lines changed

.github/workflows/build.yml

Lines changed: 16 additions & 0 deletions
```diff
@@ -24,6 +24,22 @@ on:
     branches: ["branch-*"]
 
 jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - name: Install Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.11"
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install ruff
+      # Update output format to enable automatic inline annotations.
+      - name: Run Ruff
+        run: ruff check --output-format=github datafusion
+
   generate-license:
     runs-on: ubuntu-latest
     steps:
```
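The new `build` job gives every PR an inline-annotated Ruff report. To reproduce the same check locally before pushing, a minimal sketch (assuming Python 3.11 and an active virtualenv; CI pins no Ruff version, and `--output-format=github` only adds value inside Actions):

```shell
# Install Ruff the same way the workflow does.
python -m pip install --upgrade pip
pip install ruff

# Same invocation as the workflow step; drop --output-format=github
# for ordinary terminal output when running locally.
ruff check --output-format=github datafusion
```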

.github/workflows/test.yaml

Lines changed: 0 additions & 7 deletions
```diff
@@ -103,13 +103,6 @@ jobs:
           source venv/bin/activate
           pip install -r requirements-311.txt
 
-      - name: Run Python Linters
-        if: ${{ matrix.python-version == '3.10' && matrix.toolchain == 'stable' }}
-        run: |
-          source venv/bin/activate
-          flake8 --exclude venv,benchmarks/db-benchmark --ignore=E501,W503
-          black --line-length 79 --diff --check .
-
       - name: Run tests
         env:
           RUST_BACKTRACE: 1
```

.pre-commit-config.yaml

Lines changed: 7 additions & 14 deletions
```diff
@@ -20,21 +20,14 @@ repos:
     rev: v1.6.23
     hooks:
       - id: actionlint-docker
-  - repo: https://github.com/psf/black
-    rev: 22.3.0
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    # Ruff version.
+    rev: v0.3.0
     hooks:
-      - id: black
-        files: datafusion/.*
-        # Explicitly specify the pyproject.toml at the repo root, not per-project.
-        args: ["--config", "pyproject.toml", "--line-length", "79", "--diff", "--check", "."]
-  - repo: https://github.com/PyCQA/flake8
-    rev: 5.0.4
-    hooks:
-      - id: flake8
-        files: datafusion/.*$
-        types: [file]
-        types_or: [python]
-        additional_dependencies: ["flake8-force"]
+      # Run the linter.
+      - id: ruff
+      # Run the formatter.
+      - id: ruff-format
   - repo: local
     hooks:
       - id: rust-fmt
```
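With this configuration, both Ruff hooks run on every `git commit` once the hooks are installed. As a short usage sketch (the hook ids come from the config above; `pre-commit run <id> --all-files` is standard pre-commit usage):

```shell
# One-time setup: register the hooks from .pre-commit-config.yaml.
pre-commit install

# Run only the new Ruff hooks against the whole repository.
pre-commit run ruff --all-files
pre-commit run ruff-format --all-files
```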

README.md

Lines changed: 18 additions & 3 deletions
````diff
@@ -202,7 +202,7 @@ source venv/bin/activate
 # update pip itself if necessary
 python -m pip install -U pip
 # install dependencies (for Python 3.8+)
-python -m pip install -r requirements-310.txt
+python -m pip install -r requirements.in
 ```
 
 The tests rely on test data in git submodules.
@@ -222,12 +222,27 @@ python -m pytest
 
 ### Running & Installing pre-commit hooks
 
-arrow-datafusion-python takes advantage of [pre-commit](https://pre-commit.com/) to assist developers with code linting to help reduce the number of commits that ultimately fail in CI due to linter errors. Using the pre-commit hooks is optional for the developer but certainly helpful for keeping PRs clean and concise.
+arrow-datafusion-python takes advantage of [pre-commit](https://pre-commit.com/) to assist developers with code linting to help reduce
+the number of commits that ultimately fail in CI due to linter errors. Using the pre-commit hooks is optional for the
+developer but certainly helpful for keeping PRs clean and concise.
 
-Our pre-commit hooks can be installed by running `pre-commit install`, which will install the configurations in your ARROW_DATAFUSION_PYTHON_ROOT/.github directory and run each time you perform a commit, failing to complete the commit if an offending lint is found allowing you to make changes locally before pushing.
+Our pre-commit hooks can be installed by running `pre-commit install`, which will install the configurations in
+your ARROW_DATAFUSION_PYTHON_ROOT/.github directory and run each time you perform a commit, failing to complete
+the commit if an offending lint is found allowing you to make changes locally before pushing.
 
 The pre-commit hooks can also be run adhoc without installing them by simply running `pre-commit run --all-files`
 
+## Running linters without using pre-commit
+
+There are scripts in `ci/scripts` for running Rust and Python linters.
+
+```shell
+./ci/scripts/python_lint.sh
+./ci/scripts/rust_clippy.sh
+./ci/scripts/rust_fmt.sh
+./ci/scripts/rust_toml_fmt.sh
+```
+
 ## How to update dependencies
 
 To change test dependencies, change the `requirements.in` and run
````

benchmarks/db-benchmark/groupby-datafusion.py

Lines changed: 4 additions & 12 deletions
```diff
@@ -79,17 +79,13 @@ def execute(df):
 
 data = pacsv.read_csv(
     src_grp,
-    convert_options=pacsv.ConvertOptions(
-        auto_dict_encode=True, column_types=schema
-    ),
+    convert_options=pacsv.ConvertOptions(auto_dict_encode=True, column_types=schema),
 )
 print("dataset loaded")
 
 # create a session context with explicit runtime and config settings
 runtime = (
-    RuntimeConfig()
-    .with_disk_manager_os()
-    .with_fair_spill_pool(64 * 1024 * 1024 * 1024)
+    RuntimeConfig().with_disk_manager_os().with_fair_spill_pool(64 * 1024 * 1024 * 1024)
 )
 config = (
     SessionConfig()
@@ -116,9 +112,7 @@ def execute(df):
 if sql:
     df = ctx.sql("SELECT id1, SUM(v1) AS v1 FROM x GROUP BY id1")
 else:
-    df = ctx.table("x").aggregate(
-        [f.col("id1")], [f.sum(f.col("v1")).alias("v1")]
-    )
+    df = ctx.table("x").aggregate([f.col("id1")], [f.sum(f.col("v1")).alias("v1")])
 ans = execute(df)
 
 shape = ans_shape(ans)
@@ -197,9 +191,7 @@ def execute(df):
 gc.collect()
 t_start = timeit.default_timer()
 if sql:
-    df = ctx.sql(
-        "SELECT id3, SUM(v1) AS v1, AVG(v3) AS v3 FROM x GROUP BY id3"
-    )
+    df = ctx.sql("SELECT id3, SUM(v1) AS v1, AVG(v3) AS v3 FROM x GROUP BY id3")
 else:
     df = ctx.table("x").aggregate(
         [f.col("id3")],
```

benchmarks/db-benchmark/join-datafusion.py

Lines changed: 4 additions & 20 deletions
```diff
@@ -152,11 +152,7 @@ def ans_shape(batches):
 print(f"q2: {t}")
 t_start = timeit.default_timer()
 df = ctx.create_dataframe([ans])
-chk = (
-    df.aggregate([], [f.sum(col("v1")), f.sum(col("v2"))])
-    .collect()[0]
-    .column(0)[0]
-)
+chk = df.aggregate([], [f.sum(col("v1")), f.sum(col("v2"))]).collect()[0].column(0)[0]
 chkt = timeit.default_timer() - t_start
 m = memory_usage()
 write_log(
@@ -193,11 +189,7 @@ def ans_shape(batches):
 print(f"q3: {t}")
 t_start = timeit.default_timer()
 df = ctx.create_dataframe([ans])
-chk = (
-    df.aggregate([], [f.sum(col("v1")), f.sum(col("v2"))])
-    .collect()[0]
-    .column(0)[0]
-)
+chk = df.aggregate([], [f.sum(col("v1")), f.sum(col("v2"))]).collect()[0].column(0)[0]
 chkt = timeit.default_timer() - t_start
 m = memory_usage()
 write_log(
@@ -234,11 +226,7 @@ def ans_shape(batches):
 print(f"q4: {t}")
 t_start = timeit.default_timer()
 df = ctx.create_dataframe([ans])
-chk = (
-    df.aggregate([], [f.sum(col("v1")), f.sum(col("v2"))])
-    .collect()[0]
-    .column(0)[0]
-)
+chk = df.aggregate([], [f.sum(col("v1")), f.sum(col("v2"))]).collect()[0].column(0)[0]
 chkt = timeit.default_timer() - t_start
 m = memory_usage()
 write_log(
@@ -275,11 +263,7 @@ def ans_shape(batches):
 print(f"q5: {t}")
 t_start = timeit.default_timer()
 df = ctx.create_dataframe([ans])
-chk = (
-    df.aggregate([], [f.sum(col("v1")), f.sum(col("v2"))])
-    .collect()[0]
-    .column(0)[0]
-)
+chk = df.aggregate([], [f.sum(col("v1")), f.sum(col("v2"))]).collect()[0].column(0)[0]
 chkt = timeit.default_timer() - t_start
 m = memory_usage()
 write_log(
```

benchmarks/tpch/tpch.py

Lines changed: 1 addition & 3 deletions
```diff
@@ -83,9 +83,7 @@ def bench(data_path, query_path):
                 time_millis = (end - start) * 1000
                 total_time_millis += time_millis
                 print("q{},{}".format(query, round(time_millis, 1)))
-                results.write(
-                    "q{},{}\n".format(query, round(time_millis, 1))
-                )
+                results.write("q{},{}\n".format(query, round(time_millis, 1)))
                 results.flush()
             except Exception as e:
                 print("query", query, "failed", e)
```

ci/scripts/python_lint.sh

Lines changed: 22 additions & 0 deletions
```diff
@@ -0,0 +1,22 @@
+#!/usr/bin/env bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+set -ex
+ruff format datafusion
+ruff check datafusion
```
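Note that `ruff format` rewrites files in place while `ruff check` only reports violations, so running this script can modify the working tree. A hypothetical check-only variant (not part of this commit; `ruff format --check` is a standard Ruff flag) might look like:

```shell
#!/usr/bin/env bash
# Hypothetical non-mutating variant: fail on formatting drift without editing files.
set -ex
ruff format --check datafusion
ruff check datafusion
```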

datafusion/__init__.py

Lines changed: 1 addition & 3 deletions
```diff
@@ -208,9 +208,7 @@ def udaf(accum, input_type, return_type, state_type, volatility, name=None):
     Create a new User Defined Aggregate Function
     """
     if not issubclass(accum, Accumulator):
-        raise TypeError(
-            "`accum` must implement the abstract base class Accumulator"
-        )
+        raise TypeError("`accum` must implement the abstract base class Accumulator")
     if name is None:
         name = accum.__qualname__.lower()
     if isinstance(input_type, pa.lib.DataType):
```

datafusion/cudf.py

Lines changed: 1 addition & 3 deletions
```diff
@@ -68,9 +68,7 @@ def to_cudf_df(self, plan):
         elif isinstance(node, TableScan):
             return cudf.read_parquet(self.parquet_tables[node.table_name()])
         else:
-            raise Exception(
-                "unsupported logical operator: {}".format(type(node))
-            )
+            raise Exception("unsupported logical operator: {}".format(type(node)))
 
     def create_schema(self, schema_name: str, **kwargs):
         logger.debug(f"Creating schema: {schema_name}")
```

datafusion/input/base.py

Lines changed: 2 additions & 6 deletions
```diff
@@ -31,13 +31,9 @@ class BaseInputSource(ABC):
     """
 
     @abstractmethod
-    def is_correct_input(
-        self, input_item: Any, table_name: str, **kwargs
-    ) -> bool:
+    def is_correct_input(self, input_item: Any, table_name: str, **kwargs) -> bool:
         pass
 
     @abstractmethod
-    def build_table(
-        self, input_item: Any, table_name: str, **kwarg
-    ) -> SqlTable:
+    def build_table(self, input_item: Any, table_name: str, **kwarg) -> SqlTable:
         pass
```

datafusion/input/location.py

Lines changed: 1 addition & 3 deletions
```diff
@@ -72,9 +72,7 @@ def build_table(
             for _ in reader:
                 num_rows += 1
             # TODO: Need to actually consume this row into resonable columns
-            raise RuntimeError(
-                "TODO: Currently unable to support CSV input files."
-            )
+            raise RuntimeError("TODO: Currently unable to support CSV input files.")
         else:
             raise RuntimeError(
                 f"Input of format: `{format}` is currently not supported.\
```

datafusion/pandas.py

Lines changed: 1 addition & 3 deletions
```diff
@@ -64,9 +64,7 @@ def to_pandas_df(self, plan):
         elif isinstance(node, TableScan):
             return pd.read_parquet(self.parquet_tables[node.table_name()])
         else:
-            raise Exception(
-                "unsupported logical operator: {}".format(type(node))
-            )
+            raise Exception("unsupported logical operator: {}".format(type(node)))
 
     def create_schema(self, schema_name: str, **kwargs):
         logger.debug(f"Creating schema: {schema_name}")
```

datafusion/polars.py

Lines changed: 3 additions & 9 deletions
```diff
@@ -51,9 +51,7 @@ def to_polars_df(self, plan):
             args = [self.to_polars_expr(expr) for expr in node.projections()]
             return inputs[0].select(*args)
         elif isinstance(node, Aggregate):
-            groupby_expr = [
-                self.to_polars_expr(expr) for expr in node.group_by_exprs()
-            ]
+            groupby_expr = [self.to_polars_expr(expr) for expr in node.group_by_exprs()]
             aggs = []
             for expr in node.aggregate_exprs():
                 expr = expr.to_variant()
@@ -67,17 +65,13 @@ def to_polars_df(self, plan):
                         )
                     )
                 else:
-                    raise Exception(
-                        "Unsupported aggregate function {}".format(expr)
-                    )
+                    raise Exception("Unsupported aggregate function {}".format(expr))
             df = inputs[0].groupby(groupby_expr).agg(aggs)
             return df
         elif isinstance(node, TableScan):
             return polars.read_parquet(self.parquet_tables[node.table_name()])
         else:
-            raise Exception(
-                "unsupported logical operator: {}".format(type(node))
-            )
+            raise Exception("unsupported logical operator: {}".format(type(node)))
 
     def create_schema(self, schema_name: str, **kwargs):
         logger.debug(f"Creating schema: {schema_name}")
```

datafusion/tests/generic.py

Lines changed: 3 additions & 9 deletions
```diff
@@ -50,9 +50,7 @@ def data_datetime(f):
         datetime.datetime.now() - datetime.timedelta(days=1),
         datetime.datetime.now() + datetime.timedelta(days=1),
     ]
-    return pa.array(
-        data, type=pa.timestamp(f), mask=np.array([False, True, False])
-    )
+    return pa.array(data, type=pa.timestamp(f), mask=np.array([False, True, False]))
 
 
 def data_date32():
@@ -61,9 +59,7 @@ def data_date32():
         datetime.date(1980, 1, 1),
         datetime.date(2030, 1, 1),
     ]
-    return pa.array(
-        data, type=pa.date32(), mask=np.array([False, True, False])
-    )
+    return pa.array(data, type=pa.date32(), mask=np.array([False, True, False]))
 
 
 def data_timedelta(f):
@@ -72,9 +68,7 @@ def data_timedelta(f):
         datetime.timedelta(days=1),
         datetime.timedelta(seconds=1),
     ]
-    return pa.array(
-        data, type=pa.duration(f), mask=np.array([False, True, False])
-    )
+    return pa.array(data, type=pa.duration(f), mask=np.array([False, True, False]))
 
 
 def data_binary_other():
```
