Open
Labels: needs triage (needs a response from a contributor)
Description
The following program raises an IndexError (index out of bounds) inside Dask. I have reduced it to a manageable example. Here is the code:
import dask.dataframe as dd
dfnode_6 = dd.read_parquet("tpcds-data-5pc/household_demographics").rename(columns={col: f"{col}_node_6" for col in dd.read_parquet("tpcds-data-5pc/household_demographics").columns})
dfnode_7 = dd.read_parquet("tpcds-data-5pc/ship_mode").rename(columns={col: f"{col}_node_7" for col in dd.read_parquet("tpcds-data-5pc/ship_mode").columns})
dfnode_4 = dfnode_6.repartition(npartitions=3)
dfnode_5 = dfnode_7.repartition(npartitions=7)
dfnode_2 = dfnode_4.drop_duplicates()
dfnode_3 = dfnode_5.dropna(subset=['sm_ship_mode_sk_node_7', 'sm_ship_mode_id_node_7', 'sm_type_node_7'], how='any')
dfnode_1 = dfnode_2.merge(dfnode_3, left_on='hd_buy_potential_node_6', right_on='sm_code_node_7', how='right')
result = dfnode_1.sort_values(by='hd_demo_sk_node_6', ascending=True)
print(result)
And here is the data required to reproduce the example:
It produces the following error:
Traceback (most recent call last):
File "dask-oracle-server/test-bug-zion-08-g_19591-a_19591-dag_3918-dfg2.py", line 11, in <module>
print(result)
File "venv/lib/python3.10/site-packages/dask/dataframe/dask_expr/_collection.py", line 435, in __repr__
data = self._repr_data().to_string(max_rows=5)
File "venv/lib/python3.10/site-packages/dask/dataframe/dask_expr/_collection.py", line 4075, in _repr_data
index = self._repr_divisions
File "venv/lib/python3.10/site-packages/dask/dataframe/dask_expr/_collection.py", line 2629, in _repr_divisions
name = f"npartitions={self.npartitions}"
File "venv/lib/python3.10/site-packages/dask/dataframe/dask_expr/_collection.py", line 358, in npartitions
return self.expr.npartitions
File "venv/lib/python3.10/site-packages/dask/dataframe/dask_expr/_shuffle.py", line 803, in npartitions
return self.operand("npartitions") or len(self._divisions()) - 1
File "venv/lib/python3.10/site-packages/dask/dataframe/dask_expr/_shuffle.py", line 1008, in _divisions
divisions, mins, maxes, presorted = _get_divisions(
File "venv/lib/python3.10/site-packages/dask/dataframe/dask_expr/_shuffle.py", line 1333, in _get_divisions
result = _calculate_divisions(
File "venv/lib/python3.10/site-packages/dask/dataframe/dask_expr/_shuffle.py", line 1357, in _calculate_divisions
divisions, mins, maxes = compute(
File "venv/lib/python3.10/site-packages/dask/base.py", line 685, in compute
results = schedule(expr, keys, **kwargs)
File "venv/lib/python3.10/site-packages/dask/dataframe/partitionquantiles.py", line 351, in process_val_weights
q_target = np.linspace(q_weights[0], q_weights[-1], npartitions + 1)
IndexError: index 0 is out of bounds for axis 0 with size 0
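To make the last frame easier to read, here is a minimal sketch of the failure mode, using only NumPy. It assumes that `q_weights` reaches `process_val_weights` as an empty array; I have not confirmed why it is empty (one guess is that the sort column has no non-null values after the right merge). The helper name `target_quantiles` is hypothetical and only mirrors the expression on the failing line; it is not Dask's code.

```python
import numpy as np


def target_quantiles(q_weights, npartitions):
    # Hypothetical helper: same expression as the failing line in the traceback.
    return np.linspace(q_weights[0], q_weights[-1], npartitions + 1)


# With at least one weight, the expression yields npartitions + 1 targets.
ok = target_quantiles(np.array([1.0, 3.0, 6.0]), 3)

# With no weights at all, indexing element 0 fails before linspace is called.
try:
    target_quantiles(np.array([], dtype=float), 3)
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)
```

If this reading is right, the exception comes from indexing the empty array, not from `np.linspace` itself.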
Environment:
- Dask version: 2026.1.1
- Python version: 3.10.12
- Operating System: Ubuntu 22.04.5 LTS
- Install method (conda, pip, source): pip
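As a possible explanation, not a confirmed diagnosis: if no `hd_buy_potential_node_6` value matches any `sm_code_node_7` value, a right merge keeps every right-hand row and fills the left-hand columns with nulls, so `hd_demo_sk_node_6` would contain no sortable values. The sketch below shows that behaviour in plain pandas. The row values are made up for illustration and are not taken from the attached data.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables; the values are invented.
left = pd.DataFrame(
    {
        "hd_demo_sk_node_6": [1, 2, 3],
        "hd_buy_potential_node_6": ["0-500", "501-1000", "Unknown"],
    }
)
right = pd.DataFrame(
    {
        "sm_ship_mode_sk_node_7": [10, 11],
        "sm_code_node_7": ["AIR", "SEA"],
    }
)

# Same join keys and join type as in the reproducer.
merged = left.merge(
    right,
    left_on="hd_buy_potential_node_6",
    right_on="sm_code_node_7",
    how="right",
)

# Count the values that a sort on this column could actually use.
non_null_count = int(merged["hd_demo_sk_node_6"].notna().sum())
print(len(merged), non_null_count)
```

Running the same `notna().sum()` check on `dfnode_1["hd_demo_sk_node_6"]` (followed by `.compute()`) would show whether the real data is in this state.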