Remove internal dependencies mapping in update_graph by fjetter · Pull Request #9036 · dask/distributed · GitHub

Remove internal dependencies mapping in update_graph #9036


Merged · 9 commits into dask:main · May 9, 2025

Conversation

@fjetter (Member) commented Apr 4, 2025

This is a long-overdue follow-up to the Task class changes, aiming to reduce transmission overhead and memory utilization on the scheduler during graph submission.

Materializing the dependencies as a separate dict is entirely redundant. For backwards compatibility, _materialize_graph generated a materialized version of the dependencies mapping to keep the changes minimal at the time.
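As a sketch of why the separate mapping is redundant: in the TaskSpec representation, each Task object already carries its own dependencies, so the mapping can be derived on demand. A minimal illustration, assuming dask's dask._task_spec API; the exact helpers used inside the scheduler may differ:

```python
# Sketch: Task objects in dask's TaskSpec already know their own
# dependencies, so a standalone {key: deps} dict duplicates information.
from dask._task_spec import Task, TaskRef


def inc(x):
    return x + 1


dsk = {
    "a": Task("a", inc, 1),
    "b": Task("b", inc, TaskRef("a")),  # "b" depends on "a"
}

# Instead of materializing and shipping a separate mapping, read it
# off the Task objects when needed:
dependencies = {key: task.dependencies for key, task in dsk.items()}
# roughly: {"a": frozenset(), "b": frozenset({"a"})}
```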

The caveat was that Scheduler._remove_done_tasks_from_dsk was actually mutating that view, which turns out to be incorrect behavior.

The dependencies of a task are immutable, and any attempt to treat them otherwise very likely corrupts state or, at the very least, alters the graph itself. This argument alone should suffice to show that changes to dependencies are a bad idea.
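A toy illustration (not scheduler code) of why mutating such a view is dangerous: the view shares its underlying containers with the graph, so "cleaning up" the view silently rewrites the graph's edges:

```python
# The view returned for backwards compatibility shares its sets with the
# graph, so mutating it alters the graph itself.
dependencies = {"b": {"a"}}   # b depends on a
view = dependencies           # a view, not a deep copy

view["b"].discard("a")        # "removing" a done task from the view...
print(dependencies)           # {'b': set()} -- the edge is gone from the graph
```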
After closer review of the code, I believe the entire logic around _remove_done_tasks_from_dsk is redundant and can be removed. This method parsed the dsk graph for tasks that were already in memory or had already erred, and removed them from dsk accordingly. The reasoning is sound, since we do not want to recompute those tasks. However, the transition engine already takes care of this and produces appropriate recommendations that end up doing nothing.
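A toy model of that argument (not the scheduler's actual transition code): asking the engine to move a key that is already in its target state simply yields no recommendations, so already-completed keys fall out naturally without any pre-filtering:

```python
# Toy transition engine: a key already in its target state is a no-op,
# which is why pre-filtering done tasks out of dsk is not required for
# correctness.
states = {"x": "memory", "y": "released"}


def transition(key, finish):
    start = states[key]
    if start == finish:       # already there: nothing to recommend
        return {}
    states[key] = finish
    return {key: finish}      # recommendation for downstream handling


assert transition("x", "memory") == {}                 # completed task: no-op
assert transition("y", "waiting") == {"y": "waiting"}  # genuine work proceeds
```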
Ultimately, this is a performance tradeoff. The logic in _remove_done_tasks_from_dsk allows us not to throw already-computed keys into the transition engine, which is arguably slower than the _remove_done_tasks_from_dsk logic itself. Essentially, this made repeated persists faster at the cost of slowing everything else down.
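The workload in question looks roughly like this (a hypothetical local-cluster example; Client.persist is the real distributed API, but no timings are claimed here):

```python
import dask.array as da
from distributed import Client

client = Client(processes=False)  # hypothetical local cluster for illustration

x = da.ones((1000, 1000), chunks=(100, 100)).sum()
x = client.persist(x)  # first submission: everything computes
x = client.persist(x)  # resubmission: previously these done keys were
                       # pre-filtered out of dsk; now they pass through
                       # the transition engine as no-ops
```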
In particular, _remove_done_tasks_from_dsk contains a call to reverse_dict, which walks the entire graph and all its edges and constructs a dict of dependents. Unlike dependencies, dependents are ephemeral and have to be recomputed every time. This is genuinely expensive, and not doing it is helpful.
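To see why that walk is expensive, here is an illustrative re-implementation of the dependents computation (the real helper lives in dask; this toy version just shows the traversal of every node and edge):

```python
from collections import defaultdict


def reverse_dict(dependencies):
    """Invert {key: set-of-dependencies} into {key: set-of-dependents}.

    Touches every node and every edge, so the cost grows with graph size;
    unlike dependencies, this mapping cannot be cached on the tasks and
    must be rebuilt on every call.
    """
    dependents = defaultdict(set)
    for key, deps in dependencies.items():
        dependents[key]  # ensure keys with no dependents still appear
        for dep in deps:
            dependents[dep].add(key)
    return dict(dependents)


deps = {"a": set(), "b": {"a"}, "c": {"a", "b"}}
assert reverse_dict(deps) == {"a": {"b", "c"}, "b": {"c"}, "c": set()}
```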

I haven't done any measurements, but I strongly suspect this is a win-win (even repeated persists are likely faster).

There may be a couple of wrinkles in CI that I'll have to check on, but I'm very confident about the change itself.

github-actions bot (Contributor) commented Apr 4, 2025

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

21 files (-6) · 21 suites (-6) · 8h 27m 46s ⏱️ (-2h 38m 8s)
4,089 tests (-18): 3,978 ✅ (-16) · 111 💤 (±0) · 0 ❌ (-2)
39,347 runs (-12,144): 37,449 ✅ (-11,756) · 1,898 💤 (-386) · 0 ❌ (-2)

Results for commit e2284bb, compared against base commit 01ea1eb.

This pull request removes 26 and adds 8 tests. Note that renamed tests count towards both.
distributed.cli.tests.test_dask_scheduler ‑ test_signal_handling[Signals.SIGINT]
distributed.cli.tests.test_dask_scheduler ‑ test_signal_handling[Signals.SIGTERM]
distributed.diagnostics.tests.test_nvml ‑ test_1_visible_devices
distributed.diagnostics.tests.test_nvml ‑ test_2_visible_devices[0,1]
distributed.diagnostics.tests.test_nvml ‑ test_2_visible_devices[1,0]
distributed.diagnostics.tests.test_nvml ‑ test_enable_disable_nvml
distributed.diagnostics.tests.test_nvml ‑ test_gpu_metrics
distributed.diagnostics.tests.test_nvml ‑ test_gpu_monitoring_range_query
distributed.diagnostics.tests.test_nvml ‑ test_gpu_monitoring_recent
distributed.diagnostics.tests.test_nvml ‑ test_has_cuda_context
…
distributed.tests.test_client ‑ test_compute_partially_forgotten[False-False]
distributed.tests.test_client ‑ test_compute_partially_forgotten[False-True]
distributed.tests.test_client ‑ test_compute_partially_forgotten[True-False]
distributed.tests.test_client ‑ test_compute_partially_forgotten[True-True]
distributed.tests.test_client ‑ test_map_accepts_nested_futures[False]
distributed.tests.test_client ‑ test_map_accepts_nested_futures[future]
distributed.tests.test_client ‑ test_map_accepts_nested_futures[simple]
distributed.tests.test_scheduler ‑ test_dont_recompute_if_erred_transition_log
This pull request skips 12 tests.
distributed.dashboard.tests.test_scheduler_bokeh ‑ test_counters
distributed.dashboard.tests.test_worker_bokeh ‑ test_counters
distributed.protocol.tests.test_compression ‑ test_compression_thread_safety[snappy-bytes]
distributed.protocol.tests.test_compression ‑ test_compression_thread_safety[snappy-memoryview]
distributed.protocol.tests.test_compression ‑ test_large_messages[snappy]
distributed.protocol.tests.test_compression ‑ test_maybe_compress[snappy-bytes]
distributed.protocol.tests.test_compression ‑ test_maybe_compress[snappy-memoryview]
distributed.protocol.tests.test_compression ‑ test_maybe_compress_memoryviews[snappy]
distributed.protocol.tests.test_compression ‑ test_maybe_compress_sample[snappy]
distributed.tests.test_core ‑ test_tick_logging
…

♻️ This comment has been updated with latest results.

@fjetter fjetter marked this pull request as ready for review April 8, 2025 13:26
@fjetter fjetter requested a review from jacobtomlinson as a code owner April 8, 2025 13:26
@fjetter fjetter requested a review from Copilot April 9, 2025 07:10
Copilot AI left a comment

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

distributed/scheduler.py:5016

  • [nitpick] Consider using an explicit check for key existence (e.g. returning None) instead of using an empty tuple as a default, to improve code clarity and intent.
if tspec := dsk.get(k, ()):
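For contrast, the two styles the nitpick refers to could look like this (sketch only; process is a hypothetical consumer, and note the variants are not strictly equivalent if a present value can itself be falsy):

```python
# Style in the PR: an empty tuple doubles as the "missing key" sentinel,
# so a falsy-but-present value would also be skipped.
if tspec := dsk.get(k, ()):
    process(tspec)  # process is hypothetical, for illustration only

# Suggested alternative: an explicit None check separates "key absent"
# from "value present", making the intent clearer.
tspec = dsk.get(k)
if tspec is not None:
    process(tspec)
```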

@fjetter (Member, Author) commented Apr 16, 2025

There's a genuine error in test_gh2187

@fjetter (Member, Author) commented May 6, 2025

I can reproduce the flakiness of test_gh2187 on main as well. This change seems to make it much more likely to occur.

@fjetter fjetter force-pushed the update_graph_remove_dependencies branch 3 times, most recently from 35a9e8f to 1634935 Compare May 7, 2025 09:24
@fjetter (Member, Author) commented May 7, 2025

This now builds on #9068

@fjetter fjetter force-pushed the update_graph_remove_dependencies branch from 1634935 to 8faeae6 Compare May 8, 2025 14:08
@fjetter fjetter merged commit a7b7e00 into dask:main May 9, 2025
25 of 33 checks passed
@fjetter fjetter deleted the update_graph_remove_dependencies branch May 9, 2025 14:15