Dask Crashes With Index Out Of Bounds on Print · Issue #12257 · dask/dask · GitHub

Dask Crashes With Index Out Of Bounds on Print #12257

@ahmayun

Description

The following program raises an IndexError (index out of bounds) inside Dask when the result is printed. I have reduced it to a manageable example. Here is the code:

import dask.dataframe as dd


dfnode_6 = dd.read_parquet("tpcds-data-5pc/household_demographics").rename(columns={col: f"{col}_node_6" for col in dd.read_parquet("tpcds-data-5pc/household_demographics").columns})
dfnode_7 = dd.read_parquet("tpcds-data-5pc/ship_mode").rename(columns={col: f"{col}_node_7" for col in dd.read_parquet("tpcds-data-5pc/ship_mode").columns})
dfnode_4 = dfnode_6.repartition(npartitions=3)
dfnode_5 = dfnode_7.repartition(npartitions=7)
dfnode_2 = dfnode_4.drop_duplicates()
dfnode_3 = dfnode_5.dropna(subset=['sm_ship_mode_sk_node_7', 'sm_ship_mode_id_node_7', 'sm_type_node_7'], how='any')
dfnode_1 = dfnode_2.merge(dfnode_3, left_on='hd_buy_potential_node_6', right_on='sm_code_node_7', how='right')
result = dfnode_1.sort_values(by='hd_demo_sk_node_6', ascending=True)
print(result)

And here is the data required to reproduce the example:

minimal-g19591.zip

It produces the following error:

Traceback (most recent call last):
  File "dask-oracle-server/test-bug-zion-08-g_19591-a_19591-dag_3918-dfg2.py", line 11, in <module>
    print(result)
  File "venv/lib/python3.10/site-packages/dask/dataframe/dask_expr/_collection.py", line 435, in __repr__
    data = self._repr_data().to_string(max_rows=5)
  File "venv/lib/python3.10/site-packages/dask/dataframe/dask_expr/_collection.py", line 4075, in _repr_data
    index = self._repr_divisions
  File "venv/lib/python3.10/site-packages/dask/dataframe/dask_expr/_collection.py", line 2629, in _repr_divisions
    name = f"npartitions={self.npartitions}"
  File "venv/lib/python3.10/site-packages/dask/dataframe/dask_expr/_collection.py", line 358, in npartitions
    return self.expr.npartitions
  File "venv/lib/python3.10/site-packages/dask/dataframe/dask_expr/_shuffle.py", line 803, in npartitions
    return self.operand("npartitions") or len(self._divisions()) - 1
  File "venv/lib/python3.10/site-packages/dask/dataframe/dask_expr/_shuffle.py", line 1008, in _divisions
    divisions, mins, maxes, presorted = _get_divisions(
  File "venv/lib/python3.10/site-packages/dask/dataframe/dask_expr/_shuffle.py", line 1333, in _get_divisions
    result = _calculate_divisions(
  File "venv/lib/python3.10/site-packages/dask/dataframe/dask_expr/_shuffle.py", line 1357, in _calculate_divisions
    divisions, mins, maxes = compute(
  File "venv/lib/python3.10/site-packages/dask/base.py", line 685, in compute
    results = schedule(expr, keys, **kwargs)
  File "venv/lib/python3.10/site-packages/dask/dataframe/partitionquantiles.py", line 351, in process_val_weights
    q_target = np.linspace(q_weights[0], q_weights[-1], npartitions + 1)
IndexError: index 0 is out of bounds for axis 0 with size 0
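
The last frame of the traceback indexes `q_weights[0]` and `q_weights[-1]`, and the error message reports an array of size 0. The following is a minimal sketch that isolates only that expression with plain NumPy, assuming `q_weights` is empty at that point. It is not Dask's `process_val_weights` function, and the `npartitions` value is a hypothetical placeholder.

```python
import numpy as np

# Assumption: q_weights is empty, as the "size 0" in the error message suggests.
q_weights = np.array([], dtype="float64")
npartitions = 3  # hypothetical value, for illustration only

message = None
try:
    # Same expression as the failing line in the traceback.
    q_target = np.linspace(q_weights[0], q_weights[-1], npartitions + 1)
except IndexError as exc:
    # Indexing element 0 of an empty array fails before linspace is called.
    message = str(exc)

print(message)
```

If this matches what happens inside Dask, the question becomes why the quantile computation for the sort column ends up with no values at all.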

Environment:

  • Dask version: 2026.1.1
  • Python version: 3.10.12
  • Operating System: Ubuntu 22.04.5 LTS
  • Install method (conda, pip, source): pip
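
A possible but unconfirmed explanation is that the right join finds no matching keys, which would leave the sort column `hd_demo_sk_node_6` entirely null, so there are no non-null values from which to compute divisions. The sketch below only shows the pandas join behavior behind that hypothesis. The two small frames and their values are made up for illustration and are not taken from the attached data.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables; the values are invented.
left = pd.DataFrame(
    {
        "hd_demo_sk_node_6": [1, 2],
        "hd_buy_potential_node_6": ["0-500", "501-1000"],
    }
)
right = pd.DataFrame(
    {
        "sm_ship_mode_sk_node_7": [10, 11],
        "sm_code_node_7": ["AIR", "SEA"],
    }
)

# Right join on keys that never match: every right row is kept,
# and all columns coming from the left frame are filled with NaN.
merged = left.merge(
    right,
    left_on="hd_buy_potential_node_6",
    right_on="sm_code_node_7",
    how="right",
)

all_null = bool(merged["hd_demo_sk_node_6"].isna().all())
non_null_values = merged["hd_demo_sk_node_6"].dropna()
print(all_null, len(non_null_values))
```

Running the same `isna().all()` check on the computed merge result from the reproducer would confirm or rule out this hypothesis.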

Metadata

Labels: needs triage (needs a response from a contributor)
