Remove internal dependencies mapping in update_graph by fjetter · Pull Request #9036 · dask/distributed · GitHub

Remove internal dependencies mapping in update_graph #9036


Merged · 9 commits into dask:main · May 9, 2025

Conversation

@fjetter (Member) commented Apr 4, 2025

This is a long-overdue follow-up to the Task class changes, aiming to reduce transmission overhead and memory utilization on the scheduler during graph submission.

Materializing the dependencies as a separate dict is entirely redundant. For backwards compatibility, _materialize_graph generated a materialized version of the dependencies mapping to keep the changes minimal at the time.
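As a sketch of why the separate mapping is redundant: in the TaskSpec representation, each Task object already carries its own dependencies, so the mapping can be derived on demand. A minimal illustration, assuming dask's dask._task_spec API; the exact helpers used inside the scheduler may differ:

```python
# Sketch: Task objects in dask's TaskSpec already know their own
# dependencies, so a standalone {key: deps} dict duplicates information.
from dask._task_spec import Task, TaskRef


def inc(x):
    return x + 1


dsk = {
    "a": Task("a", inc, 1),
    "b": Task("b", inc, TaskRef("a")),  # "b" depends on "a"
}

# Instead of materializing and shipping a separate mapping, read it
# off the Task objects when needed:
dependencies = {key: task.dependencies for key, task in dsk.items()}
# roughly: {"a": frozenset(), "b": frozenset({"a"})}
```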

The caveat was that Scheduler._remove_done_tasks_from_dsk was actually mutating that view, which turns out to be incorrect behavior.

The dependencies of a task are immutable, and any attempt to treat them otherwise very likely corrupts state or, at the very least, alters the graph itself. This argument alone should suffice to show that changes to dependencies are a bad idea.
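A toy illustration (not scheduler code) of why mutating such a view is dangerous: the view shares its underlying containers with the graph, so "cleaning up" the view silently rewrites the graph's edges:

```python
# The view returned for backwards compatibility shares its sets with the
# graph, so mutating it alters the graph itself.
dependencies = {"b": {"a"}}   # b depends on a
view = dependencies           # a view, not a deep copy

view["b"].discard("a")        # "removing" a done task from the view...
print(dependencies)           # {'b': set()} -- the edge is gone from the graph
```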
After closer review of the code, I believe the entire logic around _remove_done_tasks_from_dsk is redundant and can be removed. This method parsed the dsk graph for tasks that were already in memory or had already erred, and removed them from dsk accordingly. The reasoning is sound, since we do not want to recompute those tasks. However, the transition engine already takes care of this and produces appropriate recommendations that end up doing nothing.
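A toy model of that argument (not the scheduler's actual transition code): asking the engine to move a key that is already in its target state simply yields no recommendations, so already-completed keys fall out naturally without any pre-filtering:

```python
# Toy transition engine: a key already in its target state is a no-op,
# which is why pre-filtering done tasks out of dsk is not required for
# correctness.
states = {"x": "memory", "y": "released"}


def transition(key, finish):
    start = states[key]
    if start == finish:       # already there: nothing to recommend
        return {}
    states[key] = finish
    return {key: finish}      # recommendation for downstream handling


assert transition("x", "memory") == {}                 # completed task: no-op
assert transition("y", "waiting") == {"y": "waiting"}  # genuine work proceeds
```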
Ultimately, this is a performance tradeoff. The logic in _remove_done_tasks_from_dsk allows us not to throw already-computed keys into the transition engine, which is arguably slower than the _remove_done_tasks_from_dsk logic itself. Essentially, this made repeated persists faster at the cost of slowing everything else down.
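The workload in question looks roughly like this (a hypothetical local-cluster example; Client.persist is the real distributed API, but no timings are claimed here):

```python
import dask.array as da
from distributed import Client

client = Client(processes=False)  # hypothetical local cluster for illustration

x = da.ones((1000, 1000), chunks=(100, 100)).sum()
x = client.persist(x)  # first submission: everything computes
x = client.persist(x)  # resubmission: previously these done keys were
                       # pre-filtered out of dsk; now they pass through
                       # the transition engine as no-ops
```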
In particular, _remove_done_tasks_from_dsk contains a call to reverse_dict, which walks the entire graph and all its edges and constructs a dict of dependents. Unlike dependencies, dependents are ephemeral and have to be recomputed every time. This is genuinely expensive, and not doing it is helpful.
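To see why that walk is expensive, here is an illustrative re-implementation of the dependents computation (the real helper lives in dask; this toy version just shows the traversal of every node and edge):

```python
from collections import defaultdict


def reverse_dict(dependencies):
    """Invert {key: set-of-dependencies} into {key: set-of-dependents}.

    Touches every node and every edge, so the cost grows with graph size;
    unlike dependencies, this mapping cannot be cached on the tasks and
    must be rebuilt on every call.
    """
    dependents = defaultdict(set)
    for key, deps in dependencies.items():
        dependents[key]  # ensure keys with no dependents still appear
        for dep in deps:
            dependents[dep].add(key)
    return dict(dependents)


deps = {"a": set(), "b": {"a"}, "c": {"a", "b"}}
assert reverse_dict(deps) == {"a": {"b", "c"}, "b": {"c"}, "c": set()}
```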

I haven't done any measurements, but I strongly suspect this is a win-win (even repeated persists are likely faster).

There may be a couple of wrinkles in CI that I'll have to check on, but I'm very confident about the change itself.

github-actions bot (Contributor) commented Apr 4, 2025

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

21 files (-6) · 21 suites (-6) · 8h 27m 46s ⏱️ (-2h 38m 8s)
4,089 tests (-18): 3,978 ✅ (-16) · 111 💤 (±0) · 0 ❌ (-2)
39,347 runs (-12,144): 37,449 ✅ (-11,756) · 1,898 💤 (-386) · 0 ❌ (-2)

Results for commit e2284bb, compared against base commit 01ea1eb.

This pull request removes 26 and adds 8 tests. Note that renamed tests count towards both.
distributed.cli.tests.test_dask_scheduler ‑ test_signal_handling[Signals.SIGINT]
distributed.cli.tests.test_dask_scheduler ‑ test_signal_handling[Signals.SIGTERM]
distributed.diagnostics.tests.test_nvml ‑ test_1_visible_devices
distributed.diagnostics.tests.test_nvml ‑ test_2_visible_devices[0,1]
distributed.diagnostics.tests.test_nvml ‑ test_2_visible_devices[1,0]
distributed.diagnostics.tests.test_nvml ‑ test_enable_disable_nvml
distributed.diagnostics.tests.test_nvml ‑ test_gpu_metrics
distributed.diagnostics.tests.test_nvml ‑ test_gpu_monitoring_range_query
distributed.diagnostics.tests.test_nvml ‑ test_gpu_monitoring_recent
distributed.diagnostics.tests.test_nvml ‑ test_has_cuda_context
…
distributed.tests.test_client ‑ test_compute_partially_forgotten[False-False]
distributed.tests.test_client ‑ test_compute_partially_forgotten[False-True]
distributed.tests.test_client ‑ test_compute_partially_forgotten[True-False]
distributed.tests.test_client ‑ test_compute_partially_forgotten[True-True]
distributed.tests.test_client ‑ test_map_accepts_nested_futures[False]
distributed.tests.test_client ‑ test_map_accepts_nested_futures[future]
distributed.tests.test_client ‑ test_map_accepts_nested_futures[simple]
distributed.tests.test_scheduler ‑ test_dont_recompute_if_erred_transition_log
This pull request skips 12 tests.
distributed.dashboard.tests.test_scheduler_bokeh ‑ test_counters
distributed.dashboard.tests.test_worker_bokeh ‑ test_counters
distributed.protocol.tests.test_compression ‑ test_compression_thread_safety[snappy-bytes]
distributed.protocol.tests.test_compression ‑ test_compression_thread_safety[snappy-memoryview]
distributed.protocol.tests.test_compression ‑ test_large_messages[snappy]
distributed.protocol.tests.test_compression ‑ test_maybe_compress[snappy-bytes]
distributed.protocol.tests.test_compression ‑ test_maybe_compress[snappy-memoryview]
distributed.protocol.tests.test_compression ‑ test_maybe_compress_memoryviews[snappy]
distributed.protocol.tests.test_compression ‑ test_maybe_compress_sample[snappy]
distributed.tests.test_core ‑ test_tick_logging
…

♻️ This comment has been updated with latest results.

@fjetter fjetter marked this pull request as ready for review April 8, 2025 13:26
@fjetter fjetter requested a review from jacobtomlinson as a code owner April 8, 2025 13:26
@fjetter fjetter requested a review from Copilot April 9, 2025 07:10
Copilot AI left a comment

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

distributed/scheduler.py:5016

  • [nitpick] Consider using an explicit check for key existence (e.g. returning None) instead of using an empty tuple as a default, to improve code clarity and intent.
if tspec := dsk.get(k, ()):
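For contrast, the two styles the nitpick refers to could look like this (sketch only; process is a hypothetical consumer, and note the variants are not strictly equivalent if a present value can itself be falsy):

```python
# Style in the PR: an empty tuple doubles as the "missing key" sentinel,
# so a falsy-but-present value would also be skipped.
if tspec := dsk.get(k, ()):
    process(tspec)  # process is hypothetical, for illustration only

# Suggested alternative: an explicit None check separates "key absent"
# from "value present", making the intent clearer.
tspec = dsk.get(k)
if tspec is not None:
    process(tspec)
```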

@fjetter (Member, Author) commented Apr 16, 2025

There's a genuine error in test_gh2187

@fjetter (Member, Author) commented May 6, 2025

I can reproduce the flakiness of test_gh2187 on main as well. This change seems to make it much more likely to occur.

@fjetter fjetter force-pushed the update_graph_remove_dependencies branch 3 times, most recently from 35a9e8f to 1634935 Compare May 7, 2025 09:24
@fjetter (Member, Author) commented May 7, 2025

This now builds on #9068

@fjetter fjetter force-pushed the update_graph_remove_dependencies branch from 1634935 to 8faeae6 Compare May 8, 2025 14:08
@fjetter fjetter merged commit a7b7e00 into dask:main May 9, 2025
25 of 33 checks passed
@fjetter fjetter deleted the update_graph_remove_dependencies branch May 9, 2025 14:15