Remove internal dependencies mapping in update_graph #9036
Conversation
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

21 files − 6 · 21 suites − 6 · 8h 27m 46s ⏱️ − 2h 38m 8s

Results for commit e2284bb. ± Comparison against base commit 01ea1eb.

This pull request removes 26 tests and adds 8 tests (renamed tests count towards both). This pull request skips 12 tests.
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1):

distributed/scheduler.py:5016 - [nitpick] Consider using an explicit check for key existence (e.g. returning None) instead of using an empty tuple as a default, to improve code clarity and intent.

`if tspec := dsk.get(k, ()):`
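A minimal sketch of what that suggestion amounts to (illustrative only, not the PR's actual code):

```python
# Walrus with an empty-tuple default conflates "key is missing" with
# "key is present but maps to something falsy":
if tspec := dsk.get(k, ()):
    ...

# Explicit alternative: distinguish absence from falsiness.
tspec = dsk.get(k)
if tspec is not None:
    ...
```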
There's a genuine error in …
I can reproduce flakiness of …
This now builds on #9068.
This is a long overdue follow-up of the `Task` class changes and aims to reduce transmission overhead and memory utilization on the scheduler during graph submission.

Materializing the dependencies as a separate dict is entirely redundant. For backwards compatibility, `_materialize_graph` generated a materialized version of the dependencies mapping to keep the changes minimal at the time.
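As an illustration of the direction here (a sketch; the helper below is hypothetical and assumes the task spec objects in `dsk` expose a `dependencies` attribute, as the `Task` class changes introduced):

```python
# Hypothetical sketch: rather than building and shipping a separate
# {key: set_of_dependencies} dict, derive dependencies lazily from
# the task spec itself whenever they are needed.
def iter_dependencies(dsk):
    for key, task in dsk.items():
        # assumption: each task spec knows its own dependencies
        yield key, task.dependencies
```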
The caveat was that `Scheduler._remove_done_tasks_from_dsk` was actually mutating that view, which turns out to be false behavior. The `dependencies` of a task are immutable, and every attempt to treat them otherwise is very likely to corrupt state or, at the very least, to alter the graph itself. This argument alone should suffice to see that changes to dependencies are a bad idea.
After closer review of the code, I believe that the entire logic around `_remove_done_tasks_from_dsk` is redundant and can be removed. This method parsed the `dsk` graph for tasks that were already in `memory` or had already `erred` and removed them from `dsk` accordingly. The reasoning is sound, since we do not want to recompute those tasks. However, the transition engine already takes care of this and produces appropriate recommendations that end up doing nothing.
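To illustrate the point (a toy model, not the scheduler's actual transition code): a recommendation for a key that is already finished falls through to a no-op.

```python
# Toy model of why resubmitting finished keys is harmless: the
# transition engine inspects the current state and emits no work
# for keys that are already in memory or erred.
def recommend(task_state: str) -> dict:
    if task_state in ("memory", "erred"):
        return {}  # no-op recommendation; nothing gets recomputed
    return {"state": "waiting"}  # normal path for new work
```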
Ultimately, this is a performance tradeoff. The logic in `_remove_done_tasks_from_dsk` allows us to avoid throwing already-computed keys into the transition engine, which is arguably slower than the `_remove_done_tasks_from_dsk` logic itself. Essentially, this makes repeated `persist`s faster at the cost of slowing everything else down.
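For context, the access pattern that fast path targeted looks roughly like this (using the public dask API; the cluster setup is illustrative):

```python
import dask.array as da
from distributed import Client

client = Client()  # illustrative: a local cluster
x = da.random.random((1000, 1000), chunks=(100, 100)).persist()
# The second persist submits a graph whose prefix is already in memory;
# _remove_done_tasks_from_dsk used to prune those keys before they
# reached the transition engine.
y = (x + 1).persist()
```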
In particular, `_remove_done_tasks_from_dsk` contains a call to `reverse_dict`, which walks the entire graph and all edges and constructs a dict of dependents. Contrary to dependencies, dependents are ephemeral and have to be recomputed every time. This is actually expensive, and not doing it is helpful.

I haven't done any measurements, but I strongly suspect this is a win-win (even repeated persists are likely faster).
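For reference, a simplified sketch of what `reverse_dict` computes (not the exact dask implementation): it visits every node and every edge, which is why rebuilding dependents on each submission is expensive.

```python
def reverse_dict(dependencies):
    """Invert {key: deps} into {key: dependents}.

    Touches every node and every edge (O(nodes + edges)), and the
    result cannot be reused across submissions because dependents
    change whenever the graph changes.
    """
    dependents = {key: set() for key in dependencies}
    for key, deps in dependencies.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(key)
    return dependents
```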
There may be a couple of wrinkles in CI that I'll have to check on, but I'm very confident about the change itself.