*.define() no_cache option for tasks that modify persistent data #791
wouldn't this violate a basic assumption of pydra and caching? also can you give a more concrete example of such a task?
The issue potentially arises when you have a pydra task that modifies data outside of the cache directory. In this case you still want the modification to be made regardless of an existing cache directory, i.e. the true outputs of the task are not contained within the cache directory.

In practice, I don't imagine it will occur that often, but it came up for me theoretically when designing a task that modifies an XNAT imaging session to strip it of identifiable data (I am using Pydra to utilise the infrastructure I have built to bundle pydra tasks into XNAT pipelines). The input to the task is the imaging session, and the task runs an anonymisation script over all images in the session. If by some quirk the imaging session got reset to its original state (or at least to a close enough state that its hash was identical), then on rerunning the pipeline the session wouldn't actually be deidentified, as Pydra will effectively say, "I don't need to do this again, I have already done it before, here are the outputs from the previous run". In most cases the hashes will be different enough that this won't be an issue, but you could imagine a toy example where a database that data is being synced to is accidentally deleted; when you go to recreate the data, Pydra thinks it doesn't need to do anything. In such cases you could simply clear out your cache, so it is not a major issue, more a theoretical weakness I suppose.
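To make the failure mode concrete, here is a minimal sketch, assuming the `pydra.compose.python.define` decorator from the current API; the task body, names, and exact run/call semantics are illustrative only:

```python
from pydra.compose import python


# Hypothetical task whose *real* output is a side effect on an external
# store, not a file written inside the cache directory.
@python.define
def DeidentifySession(session_id: str) -> bool:
    # ... connect to XNAT and anonymise every image in the session ...
    return True


task = DeidentifySession(session_id="SUBJ001_MR01")
task()  # first run: the anonymisation actually executes

# If the session is later reset to a state whose input hash is identical,
# the next run is a cache hit and the anonymisation is silently skipped:
task()  # looks successful, but no images were modified
```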
perhaps this is closer to the "run always" or "force rerun" concept rather than "no cache".
#541 could be related
in nipype 1.x we did this: https://github.com/nipy/nipype/blob/2f85d927678cb09791531effb7d0141c24e6a500/nipype/pipeline/engine/nodes.py#L445 and in pydra we have the equivalent in `pydra/compose/base/task.py` (line 171 at 92cb5b9)
Being able to force a rerun is useful, but the "always run" functionality is what I'm seeking here. However, if a task is always run, there isn't much point calculating the input checksums as long as downstream tasks can still locate its output. And for tasks that modify an external object passed as an input (e.g. a data store imaging session), it would be good to be able to disable the check of whether the inputs' hash has changed.
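To illustrate, definition-site usage of the requested flag might look like the sketch below; `no_cache` does not exist in pydra, this is purely the proposed interface:

```python
from pydra.compose import python


# Proposed (non-existent) flag: skip input hashing and always execute,
# giving each run a fresh working directory.
@python.define(no_cache=True)
def DeidentifySession(session_id: str) -> bool:
    ...
```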
the way it's implemented, if it returns the same checksum for different states, that is not a valid object to hash from a pydra perspective. an example of this in nipype is the SPM.mat file, which SPM keeps updating with its internal state. this is why in nipype we copied it over to work on, instead of pointing to it. in general, a few conditions:
i'm not sure this requires a change in pydra. for 1, this is why we have the equivalent of nipype's
The use case I'm referring to is specifically when you are updating external data stores. In theory you should create a hash of the external data store's state and use that in the cache, but when dealing with remote stores that can be impractical, and in practice the state will almost always be different anyway.
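For contrast, hashing the store's state might look something like this hedged sketch; the `xnat.XnatSession` type follows the snippets in this thread, and the metadata attributes queried are assumptions rather than the real xnatpy API:

```python
from typing import Iterator

import xnat
from pydra.utils.hash import Cache, register_serializer


@register_serializer
def bytes_repr_xnat_session(session: xnat.XnatSession, cache: Cache) -> Iterator[bytes]:
    # Assumed attributes: derive the hash from some queryable notion of
    # the session's state. For a remote store this can mean a network
    # round-trip per hash, and any metadata change invalidates the cache.
    yield str(session.uri).encode()            # assumed attribute
    yield str(session.last_modified).encode()  # assumed attribute
```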
Maybe it makes more sense to define that particular classes aren't to be cached, e.g. an XNAT imaging session, by having their registered serializer return `None`:

```python
@register_serializer
def bytes_repr_xnat_session(session: xnat.XnatSession, cache: Cache) -> Iterator[bytes] | None:
    return None
```
Maybe a special value instead of `None`:

```python
@register_serializer
def bytes_repr_xnat_session(session: xnat.XnatSession, cache: Cache) -> Iterator[bytes | Uncacheable]:
    yield Uncacheable()
```
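For the snippet above to type-check, an `Uncacheable` sentinel would need to be added to pydra's hashing machinery; a minimal sketch of the idea (nothing like this exists in pydra yet):

```python
class Uncacheable:
    """Proposed sentinel: yielded by a bytes_repr serializer to signal
    that no stable byte representation exists, so the hashing machinery
    should treat the task as always-dirty rather than raising an error."""
```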
What would you like changed/added and why?
Add a `no_cache` option to the `pydra.compose.*.define()` functions to indicate that these tasks cannot be cached.

Currently, if a task attempts to modify a persistent data store (e.g. a directory) multiple times, and the store either doesn't fully capture its current state or the state is reset (e.g. an output sub-directory is deleted) while the same input parameters are reused, the modification won't happen on subsequent runs.
For such tasks a purely random checksum could be generated to guarantee a unique working directory. However, this would require #784 or similar functionality, otherwise downstream nodes won't know where to find this directory.
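As a rough illustration of that workaround, a checksum hook like the one below (entirely hypothetical, including its name) would give every run a fresh working directory:

```python
import uuid


def no_cache_checksum(task_name: str) -> str:
    # Hypothetical replacement for the input-hash-based checksum: a fresh
    # random value means no run can ever match a previous cache entry,
    # but downstream tasks then need another way to locate the output
    # directory (hence the dependency on #784).
    return f"{task_name}-{uuid.uuid4().hex}"
```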
What would be the benefit? Does the change make something easier to use?
Subsequent task runs will be executed no matter what, so the modifications are guaranteed to be made.