*.define() `no_cache` option for tasks that modify persistent data · Issue #791 · nipype/pydra · GitHub

Open
tclose opened this issue Apr 9, 2025 · 12 comments
Assignees
Labels
feature Feature requests

Comments

@tclose
Contributor
tclose commented Apr 9, 2025

What would you like changed/added and why?

Add a no_cache option to the pydra.compose.*.define() functions to indicate that the tasks they define cannot be cached.

Currently, if a task modifies a persistent data store (e.g. a directory) and either the store's current state isn't fully captured by the input hash or that state is later reset (e.g. an output sub-directory is deleted), then rerunning with the same input parameters hits the cache and the modification won't happen in the subsequent runs.

For such tasks a purely random checksum could be generated to guarantee a unique working directory. However, this would require #784 or similar functionality; otherwise downstream nodes won't know where to find this directory.

What would be the benefit? Does the change make something easier to use?

Subsequent task runs would be executed no matter what, so the modifications are guaranteed to be made.
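The failure mode motivating this can be sketched with a minimal content-addressed cache in plain Python (no pydra APIs are used here; all names are illustrative):

```python
import hashlib

cache: dict[str, str] = {}  # checksum -> stored result
store: list[str] = []       # stand-in for an external persistent store

def run_task(inputs: str) -> str:
    # Toy content-addressed execution: skip the run when the input
    # checksum has been seen before, mimicking hash-based caching
    checksum = hashlib.sha256(inputs.encode()).hexdigest()
    if checksum in cache:
        return cache[checksum]  # cache hit: the side effect is skipped
    store.append(f"modified with {inputs}")  # the real, external effect
    cache[checksum] = f"result for {inputs}"
    return cache[checksum]

run_task("session-001")  # first run: the external store is modified
store.clear()            # the store is reset behind the cache's back
run_task("session-001")  # cache hit: store stays empty, effect is lost
```

After the second call the external store is still empty, even though the task "ran" successfully from the cache's point of view.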

@tclose tclose added the enhancement New feature or request label Apr 9, 2025
@tclose tclose moved this to Proposed in Pydra Roadmap Apr 9, 2025
@satra
Contributor
satra commented Apr 9, 2025

wouldn't this violate a basic assumption of pydra and caching? also can you give a more concrete example of such a task?

@tclose
Contributor Author
tclose commented Apr 9, 2025

The issue potentially arises when you have a pydra task that modifies data outside of the cache directory. In this case you still want the modification to be made regardless of whether a matching cache directory already exists, i.e. the true outputs of the task are not contained within the cache directory.

In practice, I don't imagine it will occur that often but it came up for me theoretically when designing a task that modifies an XNAT imaging session to strip it of identifiable data (I am using Pydra to utilise the infrastructure I have built to bundle pydra tasks into XNAT pipelines).

The input to the task is the imaging session, and the task runs an anonymisation script over all images in the session. If by some quirk the imaging session were reset to its original state (or at least to a close enough state that its hash was identical), then on rerunning the pipeline the session wouldn't actually be deidentified, as Pydra would effectively say, "I don't need to do this again, I have already done it before, here are the outputs from the previous run".

Now in most cases the hashes will be different enough that this won't be an issue, but you could imagine a toy example where a database that data is being synced to is accidentally deleted; you then go to recreate the data and Pydra thinks it doesn't need to do anything.

In such cases you could simply clear out your cache, so it is not a major issue, more a theoretical weakness I suppose.

@satra
Contributor
satra commented Apr 10, 2025

perhaps this is closer to the run always or force rerun concept rather than no cache.

@tclose
Contributor Author
tclose commented Apr 10, 2025

run-always is what I meant; has this been discussed previously?

@tclose
Contributor Author

#541 could be related

@satra
Contributor
satra commented Apr 11, 2025

@tclose
Contributor Author
tclose commented Apr 13, 2025

Being able to force the rerun is useful, but the "always_run" functionality is what I'm seeking here. However, if a task is always run there isn't much point calculating the input checksums, as long as downstream tasks can still locate its output directory. And for tasks that modify an external object passed as an input (e.g. a data store imaging session), it would be good to be able to disable the check on whether the inputs hash has changed.

@satra
Contributor
satra commented Apr 14, 2025

force the rerun is useful, but the "always_run"

the way it's implemented, rerun=True (set in init) == always_run.

tasks that modify an external object passed as an input

if it returns the same checksum for different states, that is not a valid object to hash from a pydra perspective. an example of this in nipype is the SPM.mat file, which SPM keeps updating with its internal state. this is why in nipype we copied it over to work on, instead of pointing to it.

in general, a few conditions:

  1. people overwrite some input (that is not copied over) inside a function, thereby changing the state of that object.
  2. a similar comment applies more generally to a function that takes an api endpoint as input and calls it: that endpoint could return different values each time it's called.

i'm not sure this requires a change in pydra. for 1, this is why we have the equivalent of nipype's copyfile=True, so that the input is not touched; in pydra the task should have been annotated that way. for 2, this is effectively a degenerate one-to-many mapping and should not be a valid task. catching it in general is going to be challenging, but one way of addressing that situation is to set a timestamp as an input before execution. that way the cache always refers to the call of an api at a given time, the result is always retrievable, and the mapping between inputs and checksums stays one-to-one.
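The timestamp workaround for point 2 can be sketched in plain Python (illustrative only, not pydra's hashing code): folding the call time into the hashed inputs makes each invocation a distinct cache entry, restoring the one-to-one mapping.

```python
import hashlib
import time

def checksum(inputs: dict) -> str:
    # Hash a sorted repr of the inputs; any field that differs between
    # calls (such as a timestamp) yields a different checksum
    payload = repr(sorted(inputs.items())).encode()
    return hashlib.sha256(payload).hexdigest()

base = {"endpoint": "https://example.org/api"}  # same logical inputs

# Without a timestamp, repeated calls collide on the same cache entry
assert checksum(base) == checksum(base)

# With a timestamp input, each call gets its own cache entry
t0 = time.time_ns()
a = checksum({**base, "called_at": t0})
b = checksum({**base, "called_at": t0 + 1})  # a later call
assert a != b
```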

@tclose tclose added feature Feature requests and removed enhancement New feature or request labels Apr 29, 2025
@tclose tclose moved this from Proposed to temp in Pydra Roadmap Apr 29, 2025
@tclose tclose moved this from temp to To do (v1.0) in Pydra Roadmap Apr 29, 2025
@tclose
Contributor Author
tclose commented Apr 29, 2025

The use case I'm referring to is specifically when you are updating external data stores. In theory you should be creating a hash of the external data store's state and using that in the cache, but when dealing with remote stores that could be impractical, and in practice the state will almost always be different anyway.

@tclose
Contributor Author
tclose commented May 19, 2025

Maybe it makes more sense to declare that particular classes aren't to be cached, e.g. an XNAT imaging session, by having their registered bytes_repr return None (or similar), with that being interpreted as uncacheable, e.g.

from typing import Iterator

import xnat
# assuming Cache and register_serializer are exposed by pydra.utils.hash
from pydra.utils.hash import Cache, register_serializer

@register_serializer
def bytes_repr_xnat_session(session: xnat.XnatSession, cache: Cache) -> Iterator[bytes] | None:
    return None

@tclose
Contributor Author
tclose commented Jun 2, 2025

Maybe a special value instead of None, as it would be too easy to simply forget to return a value

@tclose tclose self-assigned this Jun 2, 2025
@tclose
Contributor Author
tclose commented Jun 2, 2025
@register_serializer
def bytes_repr_xnat_session(session: xnat.XnatSession, cache: Cache) -> Iterator[bytes | Uncacheable]:
    yield Uncacheable()
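A sketch of how a hasher could honour such a sentinel (pure Python; `Uncacheable` and `hash_object` are hypothetical stand-ins, not pydra's actual implementation):

```python
import hashlib
from typing import Iterator, Union

class Uncacheable:
    # sentinel yielded by a serializer to opt an object out of caching
    pass

def bytes_repr_session(obj: object) -> Iterator[Union[bytes, Uncacheable]]:
    # stand-in for a registered serializer of an external session object
    yield Uncacheable()

def hash_object(obj: object) -> Union[str, None]:
    # return a checksum, or None when any yielded chunk is the sentinel,
    # signalling that the task must always be rerun
    h = hashlib.blake2b(digest_size=16)
    for chunk in bytes_repr_session(obj):
        if isinstance(chunk, Uncacheable):
            return None
        h.update(chunk)
    return h.hexdigest()

assert hash_object(object()) is None  # uncacheable: no checksum produced
```

Yielding a sentinel (rather than returning None) keeps the serializer a normal generator, so forgetting to yield anything remains distinguishable from deliberately opting out.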

@tclose tclose moved this from Triage to v1.0 in Pydra Roadmap Jun 2, 2025