*.define() `no_cache` option for tasks that modify persistent data · Issue #791 · nipype/pydra · GitHub

Open
tclose opened this issue Apr 9, 2025 · 12 comments
Assignees
Labels
feature Feature requests

Comments

@tclose
Contributor
tclose commented Apr 9, 2025

What would you like changed/added and why?

Add a no_cache option to the pydra.compose.*.define() functions to indicate that the tasks they define cannot be cached.

Currently, if a task modifies a persistent data store (e.g. a directory) and either the store's current state isn't fully captured by the input hash or that state is later reset (e.g. an output sub-directory is deleted), then rerunning with the same input parameters hits the cache and the modification won't happen in the subsequent runs.

For such tasks a purely random checksum could be generated to guarantee a unique working directory. However, this would require #784 or similar functionality; otherwise downstream nodes won't know where to find this directory.

What would be the benefit? Does the change make something easier to use?

Subsequent task runs would be executed no matter what, so the modifications are guaranteed to be made.
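The failure mode motivating this can be sketched with a minimal content-addressed cache in plain Python (no pydra APIs are used here; all names are illustrative):

```python
import hashlib

cache: dict[str, str] = {}  # checksum -> stored result
store: list[str] = []       # stand-in for an external persistent store

def run_task(inputs: str) -> str:
    # Toy content-addressed execution: skip the run when the input
    # checksum has been seen before, mimicking hash-based caching
    checksum = hashlib.sha256(inputs.encode()).hexdigest()
    if checksum in cache:
        return cache[checksum]  # cache hit: the side effect is skipped
    store.append(f"modified with {inputs}")  # the real, external effect
    cache[checksum] = f"result for {inputs}"
    return cache[checksum]

run_task("session-001")  # first run: the external store is modified
store.clear()            # the store is reset behind the cache's back
run_task("session-001")  # cache hit: store stays empty, effect is lost
```

After the second call the external store is still empty, even though the task "ran" successfully from the cache's point of view.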

@tclose tclose added the enhancement New feature or request label Apr 9, 2025
@tclose tclose moved this to Proposed in Pydra Roadmap Apr 9, 2025
@satra
Contributor
satra commented Apr 9, 2025

wouldn't this violate a basic assumption of pydra and caching? also can you give a more concrete example of such a task?

@tclose
Contributor Author
tclose commented Apr 9, 2025

The issue potentially arises when you have a pydra task that modifies data outside of the cache directory. In this case you still want the modification to be made regardless of whether a matching cache directory already exists, i.e. the true outputs of the task are not contained within the cache directory.

In practice, I don't imagine it will occur that often but it came up for me theoretically when designing a task that modifies an XNAT imaging session to strip it of identifiable data (I am using Pydra to utilise the infrastructure I have built to bundle pydra tasks into XNAT pipelines).

The input to the task is the imaging session, and the task runs an anonymisation script over all images in the session. If by some quirk the imaging session were reset to its original state (or at least to a close enough state that its hash was identical), then on rerunning the pipeline the session wouldn't actually be deidentified, as Pydra would effectively say, "I don't need to do this again, I have already done it before, here are the outputs from the previous run".

Now in most cases the hashes will be different enough that this won't be an issue, but you could imagine a toy example where a database that data is being synced to is accidentally deleted; you then go to recreate the data and Pydra thinks it doesn't need to do anything.

In such cases you could simply clear out your cache, so it is not a major issue, more a theoretical weakness I suppose.

@satra
Contributor
satra commented Apr 10, 2025

perhaps this is closer to the run always or force rerun concept rather than no cache.

@tclose
Contributor Author
tclose commented Apr 10, 2025

run-always is what I meant; has this been discussed previously?

@tclose
Contributor Author

#541 could be related

@satra
Contributor
satra commented Apr 11, 2025

@tclose
Contributor Author
tclose commented Apr 13, 2025

Being able to force the rerun is useful, but the "always_run" functionality is what I'm seeking here. However, if a task is always run there isn't much point calculating the input checksums, as long as downstream tasks can still locate its output directory. And for tasks that modify an external object passed as an input (e.g. a data store imaging session), it would be good to be able to disable the check on whether the inputs hash has changed.

@satra
Contributor
satra commented Apr 14, 2025

force the rerun is useful, but the "always_run"

the way it's implemented, rerun=True (set in init) == always_run.

tasks that modify an external object passed as an input

if it returns the same checksum for different states, that is not a valid object to hash from a pydra perspective. an example of this in nipype is the SPM.mat file, which SPM keeps updating with its internal state. this is why in nipype we copied it over to work on, instead of pointing to it.

in general, a few conditions:

  1. people overwrite some input (that is not copied over) inside a function, thereby changing the state of that object.
  2. a similar comment applies more generally to a function that takes an api endpoint as input and calls it: that endpoint could return different values each time it's called.

i'm not sure this requires a change in pydra. for 1, this is why we have the equivalent of nipype's copyfile=True, so that the input is not touched; in pydra the task should have been annotated that way. for 2, this is effectively a degenerate one-to-many mapping and should not be a valid task. catching it in general is going to be challenging, but one way of addressing that situation is to set a timestamp as an input before execution. that way the cache always refers to the call of an api at a given time, the result is always retrievable, and the mapping between inputs and checksums stays one-to-one.
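The timestamp workaround for point 2 can be sketched in plain Python (illustrative only, not pydra's hashing code): folding the call time into the hashed inputs makes each invocation a distinct cache entry, restoring the one-to-one mapping.

```python
import hashlib
import time

def checksum(inputs: dict) -> str:
    # Hash a sorted repr of the inputs; any field that differs between
    # calls (such as a timestamp) yields a different checksum
    payload = repr(sorted(inputs.items())).encode()
    return hashlib.sha256(payload).hexdigest()

base = {"endpoint": "https://example.org/api"}  # same logical inputs

# Without a timestamp, repeated calls collide on the same cache entry
assert checksum(base) == checksum(base)

# With a timestamp input, each call gets its own cache entry
t0 = time.time_ns()
a = checksum({**base, "called_at": t0})
b = checksum({**base, "called_at": t0 + 1})  # a later call
assert a != b
```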

@tclose tclose added feature Feature requests and removed enhancement New feature or request labels Apr 29, 2025
@tclose tclose moved this from Proposed to temp in Pydra Roadmap Apr 29, 2025
@tclose tclose moved this from temp to To do (v1.0) in Pydra Roadmap Apr 29, 2025
@tclose
Contributor Author
tclose commented Apr 29, 2025

The use case I'm referring to is specifically when you are updating external data stores. In theory you should be creating a hash of the external data store's state and using that in the cache, but when dealing with remote stores that could be impractical, and in practice the state will almost always be different anyway.

@tclose
Contributor Author
tclose commented May 19, 2025

Maybe it makes more sense to declare that particular classes aren't to be cached, e.g. an XNAT imaging session, by having their registered bytes_repr return None (or similar), with that being interpreted as uncacheable, e.g.

from typing import Iterator

import xnat
# assuming Cache and register_serializer are exposed by pydra.utils.hash
from pydra.utils.hash import Cache, register_serializer

@register_serializer
def bytes_repr_xnat_session(session: xnat.XnatSession, cache: Cache) -> Iterator[bytes] | None:
    return None

@tclose
Contributor Author
tclose commented Jun 2, 2025

Maybe a special value instead of None, as it would be too easy to simply forget to return a value

@tclose tclose self-assigned this Jun 2, 2025
@tclose
Contributor Author
tclose commented Jun 2, 2025
@register_serializer
def bytes_repr_xnat_session(session: xnat.XnatSession, cache: Cache) -> Iterator[bytes | Uncacheable]:
    yield Uncacheable()
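A sketch of how a hasher could honour such a sentinel (pure Python; `Uncacheable` and `hash_object` are hypothetical stand-ins, not pydra's actual implementation):

```python
import hashlib
from typing import Iterator, Union

class Uncacheable:
    # sentinel yielded by a serializer to opt an object out of caching
    pass

def bytes_repr_session(obj: object) -> Iterator[Union[bytes, Uncacheable]]:
    # stand-in for a registered serializer of an external session object
    yield Uncacheable()

def hash_object(obj: object) -> Union[str, None]:
    # return a checksum, or None when any yielded chunk is the sentinel,
    # signalling that the task must always be rerun
    h = hashlib.blake2b(digest_size=16)
    for chunk in bytes_repr_session(obj):
        if isinstance(chunk, Uncacheable):
            return None
        h.update(chunk)
    return h.hexdigest()

assert hash_object(object()) is None  # uncacheable: no checksum produced
```

Yielding a sentinel (rather than returning None) keeps the serializer a normal generator, so forgetting to yield anything remains distinguishable from deliberately opting out.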

@tclose tclose moved this from Triage to v1.0 in Pydra Roadmap Jun 2, 2025