Implementation of proposed syntax changes by tclose · Pull Request #768 · nipype/pydra · GitHub

Merged
merged 512 commits into master on Apr 1, 2025
Conversation

@tclose (Contributor) commented Feb 17, 2025

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Summary

Implements proposed syntax changes in #692

Test modules (to fix)

  • test_python (design)
  • test_shell (design)
  • test_workflow (design)
  • test_boutiques
  • test_dockertask
  • test_environments
  • test_functions
  • test_graph
  • test_helpers_file
  • test_helpers_state
  • test_helpers
  • test_node_task
  • test_numpy_examples
  • test_profiles
  • test_shelltask_inputspec
  • test_shelltask
  • test_singularity
  • test_specs
  • test_state
  • test_submitter
  • test_task
  • test_tasks_files
  • test_workflow (engine)
  • test_hash
  • test_messenger
  • test_typing

Checklist

  • I have added tests to cover my changes (if necessary)
  • I have updated documentation (if necessary)

@satra (Contributor) commented Feb 25, 2025

@tclose - at this point, does it make sense to still keep two separate areas (design/engine) in the codebase? or simply merge them? i think we are all in agreement that this new design is the path we want to take.

@tclose (Contributor, Author) commented Feb 25, 2025

> @tclose - at this point, does it make sense to still keep two separate areas (design/engine) in the codebase? or simply merge them? i think we are all in agreement that this new design is the path we want to take.

Thanks :) My plan was actually to keep the design area separate in the final state. In the refactor I had in mind, I was planning to pull environments and workers out into separate namespaces, e.g. pydra.workers and pydra.environments, and to have design, tasks, workers and environments (do we prefer plural or singular, i.e. task, worker, environment?) as extension points where plugins can add support for new task types, tasks, workers and environment classes.

I also wanted to pull the helper functions out into pydra.utils to avoid some circular import issues, leaving the Submitter class as the only user-facing API within pydra.engine.

@tclose (Contributor, Author) commented Mar 31, 2025

Ok, I believe this is all good to go!! I have added a brief glossary to the docs explaining some of the new concepts and a few of the attributes I renamed (see docs/source/reference/glossary.rst).

Let me know if there is anything else, otherwise it would be great to get this merged and a 1.0a released so I can start updating the rest of my packages to point to that 😁

@satra (Contributor) commented Mar 31, 2025

thanks tom. i will leave it to you and others to pull the trigger.

regarding prereleases and auto (which we use), we may need to set up a few things: https://intuit.github.io/auto/docs/generated/next (may not be a bad thing to set up).

alternatively, we could merge into master (should we rename it to main?) and make a manual pre-release.

@satra (Contributor) commented Mar 31, 2025

one last question is the hashing of arbitrary objects, especially given the issues we had with torch arrays. is there a default serializer that should handle any arbitrary object? it did seem from that issue that we had to write and register our own serializer, and that the default backstop was not sufficient. (see: https://github.com/sensein/senselab/blob/5f1af5b5305579c8bdd698837dc8048d5284229d/src/senselab/utils/data_structures/pydra_helpers.py#L17).

@tclose (Contributor, Author) commented Mar 31, 2025

> one last question is the hashing of arbitrary objects, especially given the issues we had with torch arrays. is there a default serializer that should handle any arbitrary object? it did seem from that issue that we had to write and register our own serializer, and that the default backstop was not sufficient. (see: https://github.com/sensein/senselab/blob/5f1af5b5305579c8bdd698837dc8048d5284229d/src/senselab/utils/data_structures/pydra_helpers.py#L17).

The backstop object bytes-repr is here

pydra/pydra/utils/hash.py

Lines 308 to 346 in bc6b0a9

@singledispatch
def bytes_repr(obj: object, cache: Cache) -> Iterator[bytes]:
    """Default implementation of hashing for generic objects. Single dispatch is used
    to provide hooks for class-specific implementations

    Parameters
    ----------
    obj: object
        the object to hash
    cache : Cache
        a dictionary object used to store a cache of previously cached objects to
        handle circular object references

    Yields
    -------
    bytes
        unique representation of the object in a series of bytes
    """
    cls = obj.__class__
    yield f"{cls.__module__}.{cls.__name__}:{{".encode()
    dct: Dict[str, ty.Any]
    if attrs.has(type(obj)):
        # Drop any attributes that aren't used in comparisons by default
        dct = attrs.asdict(obj, recurse=False, filter=lambda a, _: bool(a.eq))
    elif hasattr(obj, "__slots__") and obj.__slots__ is not None:
        dct = {attr: getattr(obj, attr) for attr in obj.__slots__}
    else:

        def is_special_or_method(n: str):
            return (n.startswith("__") and n.endswith("__")) or inspect.ismethod(
                getattr(obj, n)
            )

        try:
            dct = {n: v for n, v in obj.__dict__.items() if not is_special_or_method(n)}
        except AttributeError:
            dct = {n: getattr(obj, n) for n in dir(obj) if not is_special_or_method(n)}
    yield from bytes_repr_mapping_contents(dct, cache)
    yield b"}"

I did make some changes to it so that it can handle slots classes (as well as adding special handling for attrs-generated objects), so I would be interested to know whether it can now handle PyTorch tensors. If not, it might be worth adding those bytes-repr functions to the hash module, with guards for the case where torch is not installed.
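Such a guarded registration could look something like the sketch below. The hook name and simplified one-argument signature are hypothetical; the real `bytes_repr` hooks in `pydra/utils/hash.py` also thread a `Cache` argument through.

```python
from functools import singledispatch
from typing import Iterator


# Simplified stand-in for pydra's bytes_repr dispatcher (the real one
# also takes a Cache argument to handle circular references).
@singledispatch
def bytes_repr(obj: object) -> Iterator[bytes]:
    yield repr(obj).encode()


try:
    import torch  # guard: only register the hook when torch is installed

    @bytes_repr.register(torch.Tensor)
    def bytes_repr_tensor(obj) -> Iterator[bytes]:
        # Serialise dtype, shape and raw values so that equal tensors
        # produce equal byte streams (and hence equal hashes)
        yield f"torch.Tensor:{obj.dtype}:{tuple(obj.shape)}:".encode()
        yield obj.detach().cpu().numpy().tobytes()

except ImportError:
    pass  # torch absent: the generic fallback above is used instead
```

With this pattern, downstream packages would not need to register their own serializers for tensors, and the hash module itself gains no hard torch dependency.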

@satra (Contributor) commented Mar 31, 2025

perhaps i'm doing something wrong, but let me know if this makes sense to you. it doesn't seem to be reflecting the object fully, i.e. the output doesn't change for the two different tensors.

In [15]: a = torch.tensor([1, 2])

In [16]: list(bytes_repr(a, cache=cache))
Out[16]: [b'torch.Tensor:{', b'}']

In [17]: a = torch.tensor([3, 4])

In [18]: list(bytes_repr(a, cache=cache))
Out[18]: [b'torch.Tensor:{', b'}']

for a backstop i was hoping it would be similar to the cloudpickle/pickle option, where as long as it is pickleable we can potentially hash it.

@tclose (Contributor, Author) commented Mar 31, 2025

> https://intuit.github.io/auto/docs/generated/next

I had a look at the shipit pre-release docs, and I think that would work better once we are past 1.0, as we don't really want to keep the current master as the default until 1.0 is released.

So if you guys are ok with it, I will pull the trigger on the merge and manually release a 1.0a. I'm also happy to rename the master branch to main in the process if you like.

@satra (Contributor) commented Mar 31, 2025

whereas the bytes_repr works fine for numpy:

In [21]: a = np.array([3, 4])

In [22]: list(bytes_repr(a, cache=cache))
Out[22]: 
[b'numpyndarray:2:',
 b'\x03\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00']

In [23]: a = np.array([1, 2])

In [24]: list(bytes_repr(a, cache=cache))
Out[24]: 
[b'numpyndarray:2:',
 b'\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00']

@tclose (Contributor, Author) commented Apr 1, 2025

> perhaps i'm doing something wrong, but let me know if this makes sense to you. it doesn't seem to be reflecting the object fully, i.e. the output doesn't change for the two different tensors.
>
> In [15]: a = torch.tensor([1, 2])
>
> In [16]: list(bytes_repr(a, cache=cache))
> Out[16]: [b'torch.Tensor:{', b'}']
>
> In [17]: a = torch.tensor([3, 4])
>
> In [18]: list(bytes_repr(a, cache=cache))
> Out[18]: [b'torch.Tensor:{', b'}']
>
> for a backstop i was hoping it would be similar to the cloudpickle/pickle option, where as long as it is pickleable we can potentially hash it.

Interesting, torch.Tensor doesn't have slots and its __dict__ is empty. We could just fall back to dir/getattr, although we would need to wrap it in try/except to catch invalid properties (I'm wondering whether we could just check set(dir(obj)) - set(dir(type(obj))), or whether that would miss things).
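As a rough illustration of that idea (the helper name is hypothetical), the set difference does pick up instance attributes on ordinary Python objects, but it would come up empty for objects like torch.Tensor whose state lives in C-level properties shared with the class, which is exactly the "miss things" concern:

```python
def instance_state(obj):
    """Collect attributes present on the instance but not on its class."""
    state = {}
    for name in set(dir(obj)) - set(dir(type(obj))):
        try:
            state[name] = getattr(obj, name)
        except Exception:
            pass  # skip properties that raise when accessed
    return state


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


print(instance_state(Point(1, 2)))  # {'x': 1, 'y': 2} (key order may vary)
```

For a tensor, `dir(obj)` and `dir(type(obj))` are essentially identical, so `instance_state` would return an empty dict and the values would still not be reflected in the hash.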

Cloudpickle would be guaranteed to be unique, but it is not guaranteed to produce the same representation for equivalent objects, so you would get a lot of false-negative cache lookups, e.g.

import torch
import cloudpickle as cp

a = torch.tensor([1, 2])
cp1 = cp.dumps(a)

b = torch.tensor([1, 2])
cp2 = cp.dumps(b)

print(cp1 == cp2)  # False
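By contrast, a value-based digest over dtype, shape and raw contents (which is effectively what the registered numpy hook does via the array's raw buffer) gives equal hashes for equal contents. A stdlib-only sketch of the idea, using plain integer lists in place of tensors:

```python
import hashlib
import struct


def stable_digest(values: list[int]) -> str:
    # Hash the raw contents deterministically: equal values always give
    # equal digests, unlike cloudpickle, whose byte output can differ
    # between equivalent objects.
    h = hashlib.blake2b()
    h.update(struct.pack(f"<{len(values)}q", *values))
    return h.hexdigest()


assert stable_digest([1, 2]) == stable_digest([1, 2])  # cache hit
assert stable_digest([1, 2]) != stable_digest([3, 4])  # distinct values
```

A torch-specific hook could achieve the same determinism by feeding `tensor.detach().cpu().numpy().tobytes()` (plus dtype and shape) into the hash.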

@tclose (Contributor, Author) commented Apr 1, 2025

Maybe this could be addressed in a separate PR though...

@satra (Contributor) commented Apr 1, 2025

thanks for the cloudpickle example - fine with addressing this in a separate PR.

@tclose tclose merged commit a41d30d into master Apr 1, 2025
21 checks passed
@satra (Contributor) commented Apr 1, 2025

woohoo!
