
Fix DaskWorker and add GitHub Actions workflow for Dask tests #686


Merged · 7 commits merged on Aug 19, 2023

Conversation

@adi611 (Contributor) commented Aug 10, 2023

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

Summary

Fixes the DaskWorker and adds a GitHub Actions workflow file called testdask.yml for Dask tests.

Checklist

  • I have added tests to cover my changes (if necessary)
  • I have updated documentation (if necessary)

@codecov (bot) commented Aug 11, 2023

Codecov Report

Patch coverage has no change; project coverage changes by -0.10% ⚠️

Comparison is base (29d3d1f) 82.88% compared to head (0f646d2) 82.79%.

❗ Current head 0f646d2 differs from pull request most recent head 3d687b0. Consider uploading reports for the commit 3d687b0 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #686      +/-   ##
==========================================
- Coverage   82.88%   82.79%   -0.10%     
==========================================
  Files          22       22              
  Lines        4845     4848       +3     
  Branches     1391        0    -1391     
==========================================
- Hits         4016     4014       -2     
- Misses        825      834       +9     
+ Partials        4        0       -4     
Flag | Coverage Δ
unittests | 82.79% <0.00%> (-0.10%) ⬇️

Flags with carried forward coverage won't be shown.

Files Changed | Coverage Δ
pydra/engine/workers.py | 19.15% <0.00%> (-0.13%) ⬇️

... and 3 files with indirect coverage changes

☔ View full report in Codecov by Sentry.

ghisvail and others added 3 commits August 11, 2023 09:35
YAML gotcha 3.10 -> 3.1, versions should be quoted.
- Add concurrency group
- Add permissions
- Bump actions/setup-python to v4
- Add OS (Ubuntu and macOS) to test matrix
- Streamline job steps for testing
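The YAML gotcha mentioned in the first commit above is easy to reproduce: an unquoted 3.10 in the test matrix is parsed as the float 3.1, so Python versions need to be quoted. A minimal sketch using PyYAML (an assumption for illustration; any YAML 1.1 parser behaves the same way):

```python
# Sketch of the YAML version gotcha: unquoted 3.10 is read as the float 3.1.
import yaml  # PyYAML

unquoted = yaml.safe_load("python-version: [3.9, 3.10, 3.11]")
quoted = yaml.safe_load('python-version: ["3.9", "3.10", "3.11"]')

print(unquoted["python-version"])  # [3.9, 3.1, 3.11]  <- 3.10 became 3.1
print(quoted["python-version"])    # ['3.9', '3.10', '3.11']
```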
@ghisvail (Collaborator)

@adi611 @djarecka Interesting: it looks like testing fails on macOS regardless of the Python version.

The tests succeed on Linux for Python 3.9, 3.10 and 3.11, though there are intermittent failures due to some tests running forever without triggering a timeout. This is quite concerning, I think, as we probably don't want Pydra jobs to run indefinitely for unknown reasons.
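One way to keep a hanging test from running forever is a per-test timeout. A hedged sketch assuming the pytest-timeout plugin is installed; the test name here is hypothetical:

```python
# Hypothetical example of bounding a potentially hanging test with pytest-timeout.
import time
import pytest

@pytest.mark.timeout(300)  # fail the test if it does not finish within 5 minutes
def test_wf_finishes_in_time():
    time.sleep(0.1)  # stand-in for running a small workflow
```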

@ghisvail (Collaborator)

pydra/engine/tests/test_workflow.py::test_wf_3nd_st_1[dask] RERUN [ 88%]

This test takes a very long time to run.

@ghisvail (Collaborator)

After more than an hour:

=========================== short test summary info ============================
FAILED pydra/engine/tests/test_workflow.py::test_wf_3nd_st_1[dask] - Exception: graph is not empty, but not able to get more tasks - something may have gone wrong when retrieving the results of predecessor tasks. This could be caused by a file-system error or a bug in the internal workflow logic, but is likely to be caused by the hash of an upstream node being unstable. 

Hash instability can be caused by an input of the node being modified in place, or by psuedo-random ordering of `set` or `frozenset` inputs (or nested attributes of inputs) in the hash calculation. To ensure that sets are hashed consistently you can you can try set the environment variable PYTHONHASHSEED=0 for all processes, but it is best to try to identify where the set objects are occurring and manually hash their sorted elements. (or use list objects instead)

Blocked tasks
-------------

mult (FunctionTask_9119c450eb4aba771bfa0b0d61c16836) is blocked by add2x (FunctionTask_32acf4c9930ee17f343a230ee86c85d3), which matches names of []; add2y (FunctionTask_ad4669811dc84a9749ce5e6c6ecc1204), which matches names of []
= 1 failed, 546 passed, 374 skipped, 3 xfailed, 84 warnings, 3 rerun in 3987.10s (1:06:27) =
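The error message above suggests either pinning PYTHONHASHSEED or hashing the sorted elements of set inputs. A minimal sketch of the second idea (the helper name is made up for illustration; this is not pydra's actual hashing code):

```python
import hashlib

def stable_set_hash(items):
    # Hash the sorted string representations so the digest does not depend on
    # set iteration order, which varies across processes unless PYTHONHASHSEED
    # is fixed.
    digest = hashlib.blake2b()
    for element in sorted(repr(item) for item in items):
        digest.update(element.encode("utf-8"))
    return digest.hexdigest()

assert stable_set_hash({3, 1, 2}) == stable_set_hash({2, 3, 1})
```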

@djarecka (Collaborator) commented Aug 11, 2023

@ghisvail - yes, I had a problem on my laptop.

How is this related to #673? Should we close one of them? If you only run test_workflow (as in #673), does everything work?

@adi611 (Contributor, Author) commented Aug 11, 2023

Should I add a commit updating the GA workflow file to run the tests as two separate jobs: one for test_workflow.py and one for the rest?

@adi611 (Contributor, Author) commented Aug 11, 2023

On ubuntu-latest:

  • for Python 3.10 and 3.11 - all the tests pass
  • for Python 3.9 - test_wf_3nd_st_1[dask] in test_workflow.py fails after running for a considerable time; the other tests pass

On macos-latest:

  • for Python 3.9, 3.10 and 3.11 - test_duplicate_input_on_split_wf in test_workflow.py fails due to a timeout; the other tests pass. Worth noting: this test runs only with the cf plugin, and it passes in other GA workflows such as testpydra.yml

Also, on ubuntu-latest the Dask GA workflow previously failed for both test_duplicate_input_on_split_wf and test_inner_outer_wf_duplicate due to timeouts, but after the recent commits to the Pydra repo it now passes for both.

@ghisvail (Collaborator)

A first remark: the logs contain many occurrences of UserWarning: Port 8787 is already in use.

This might indicate that we are not cleaning up Dask resources properly during the tests.
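Port 8787 is the default Dask dashboard port, so one way to avoid the collisions is to let each LocalCluster bind its dashboard to a random free port. A hedged sketch using dask.distributed directly; whether this belongs in DaskWorker or in the test fixtures is a separate question:

```python
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    # Ask the dashboard to bind to any free port (":0") instead of the default
    # 8787, so parallel test runs do not fight over the same port.
    with LocalCluster(n_workers=2, dashboard_address=":0") as cluster:
        with Client(cluster) as client:
            print(client.submit(sum, [1, 2, 3]).result())
```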

@ghisvail (Collaborator)

I was also wondering: do we have a test workflow that is not too trivial for Dask parallelization, one we could exercise outside of pytest?
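For example, something like the small script below could be run directly against the dask plugin. This is a sketch assuming the pydra 0.x API that this PR targets, not an existing test:

```python
import pydra

@pydra.mark.task
def add2(x):
    return x + 2

@pydra.mark.task
def mult(x, y):
    return x * y

if __name__ == "__main__":
    # Two chained tasks so that Dask actually has a dependency to schedule.
    wf = pydra.Workflow(name="wf", input_spec=["x"], x=5)
    wf.add(add2(name="add2", x=wf.lzin.x))
    wf.add(mult(name="mult", x=wf.lzin.x, y=wf.add2.lzout.out))
    wf.set_output([("out", wf.mult.lzout.out)])

    with pydra.Submitter(plugin="dask") as sub:
        sub(wf)

    print(wf.result())  # expect out == 5 * (5 + 2) == 35
```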

@ghisvail (Collaborator)

More clues that there is probably something going on with Dask resources:

/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/site-packages/distributed/client.py:1542: RuntimeWarning: coroutine 'wait_for' was never awaited
  self.close()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/site-packages/distributed/client.py:1542: RuntimeWarning: coroutine 'Client._close' was never awaited
  self.close()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
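Those warnings usually mean a Client created in asynchronous mode was dropped without its close coroutine being awaited. A hedged sketch of the pattern that avoids them, not the actual DaskWorker code:

```python
import asyncio
from dask.distributed import Client

async def run_on_dask(fn, *args):
    # Using the client as an async context manager guarantees the close
    # coroutine is awaited instead of being scheduled and garbage-collected.
    async with Client(processes=False, asynchronous=True) as client:
        future = client.submit(fn, *args)
        return await future

if __name__ == "__main__":
    print(asyncio.run(run_on_dask(sum, [1, 2, 3])))
```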

@ghisvail (Collaborator)

I re-ran the tests for Python 3.10 on ubuntu-latest and it failed this time, sadly.

@djarecka (Collaborator)

@adi611 - so is it exactly the same version that was running fine when each test file was run separately? In my case it doesn't change anything...

I've tried to debug today on my macOS machine but haven't got too far yet... :(

@adi611 (Contributor, Author) commented Aug 17, 2023

> I re-ran the tests for Python 3.10 on ubuntu-latest and it failed this time, sadly.

It is confusing to debug, since even for the same environment the results are inconsistent.

@djarecka (Collaborator)

@adi611 - I think I fixed the DaskWorker. You can either accept the PR that I made to your repository/branch or move to #689.
You can also check how it works on your local machine or on Colab.

@adi611 (Contributor, Author) commented Aug 19, 2023

[adi611-patch-updatedask-1](/adi611/pydra/tree/adi611-patch-updatedask-1)

I tried it on my local machine as well as on Colab, and no test fails! Thank you for the help!

@djarecka mentioned this pull request Aug 19, 2023
@djarecka (Collaborator)

OK, great! I will merge this now. I think it runs much faster now, and we can run more tests with the dask plugin, but we can do that in a separate pull request.
