Functionality questions for ome-zarr formatted data · Issue #181 · funkelab/gunpowder · GitHub

Functionality questions for ome-zarr formatted data #181

@Christianfoley

Greetings from the Mehta Lab and apologies in advance for the long post!
I am attempting to use gunpowder as a dataloader for float32 data in the ome-zarr format, and I have run into a few issues getting it to work with my data. My questions are listed below.

Support for multiple zarr stores in the OME-HCS Zarr format
When data is stored in ome-zarr format as a series of hierarchical groups (row > col > position > data_arrays), each dataset created inside a source node has to be specified by the full hierarchy path to the array:

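import gunpowder as gp

# zarr_dir is the path to the zarr store (defined elsewhere)
raw = gp.ArrayKey('RAW')
source = gp.ZarrSource(
    filename=zarr_dir,
    datasets={raw: 'Row_0/Col_1/Pos_1/arr_0'},  # full hierarchy path to the array
    array_specs={raw: gp.ArraySpec(interpolatable=True)}
)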

Because of this format, arrays that sit in different rows but all belong to one 'dataset' end up in different zarr stores. Is it possible to create a single source that can access multiple zarr stores?
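
To make the layout concrete, here is a minimal sketch (plain Python, not gunpowder code) of how the per-position dataset paths could be enumerated. The helper name `hcs_dataset_paths` is hypothetical, and the `Row_/Col_/Pos_` naming is assumed from the single path shown above; real stores may name their groups differently.

```python
from itertools import product


def hcs_dataset_paths(rows, cols, positions, array_name="arr_0"):
    """Build 'Row_r/Col_c/Pos_p/<array_name>' for every (row, col, position)."""
    return [
        f"Row_{r}/Col_{c}/Pos_{p}/{array_name}"
        for r, c, p in product(rows, cols, positions)
    ]


# Hypothetical indices, chosen only for illustration.
paths = hcs_dataset_paths(rows=[0], cols=[0, 1], positions=[0, 1])
for path in paths:
    print(path)
```

Each path could then serve as the `datasets` value of its own source. Whether several such sources, possibly pointing at different stores, can be merged into one (for instance with gunpowder's RandomProvider node) is exactly what I am unsure about.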

Inconsistent behavior of BatchRequest objects
When applying some augmentations (for example the SimpleAugment node), reusing a BatchRequest without redefining the request or the pipelines will, seemingly at random, return data with the wrong indices.

For example, I define a dataset and a pipeline with and without a simple augmentation node:

raw = gp.ArrayKey('RAW')

source = gp.ZarrSource(
    zarr_dir,  # the zarr container
    {raw: 'Row_0/Col_1/Pos_1/arr_0'},  # arr_0 is 3 channels of 3D image stacks, dims: (1, 3, 41, 2048, 2048)
    {raw: gp.ArraySpec(interpolatable=True)}
)

simple_augment = gp.SimpleAugment(transpose_only=(-1,-2))

pipelines = [source, source + simple_augment]

Then I define a batch request:

request = gp.BatchRequest()
request[raw] = gp.Roi((0,0,0,0,0), (1,3,1,768,768))

Then I use that request to generate two batches from each pipeline in sequence:

import matplotlib.pyplot as plt

# The first iteration is fine; in the second, the 2nd dimension comes back flipped/transposed
for n in range(2):
    batches = []
    for pipeline in pipelines:  # both the plain and the augmented pipeline
        with gp.build(pipeline):
            batch = pipeline.request_batch(request)  # get batch
            batches.append(batch)

    # visualize the content of the batches (one row per pipeline, one column per channel)
    fig, ax = plt.subplots(len(pipelines), 3, figsize=(14, 10))
    for i in range(len(pipelines)):
        for j in range(3):
            ax[i][j].imshow(batches[i][raw].data[0, j, 0])
    ax[0][1].set_title('source')
    ax[1][1].set_title('source + aug')
    plt.show()

The result is the following:
[Two screenshots, side by side. Left: visualization of the batches from loop 1. Right: visualization of the batches from loop 2.]

I am confused as to why the behavior changes between iterations when the data, pipeline, and batch request haven't changed. Is there a reason that the second batch from the augmented pipeline comes back with reversed channels?
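
One thing that could be checked is whether the request really is unchanged after a call. The sketch below is a generic check rather than gunpowder-specific code: `mutates_argument` is a hypothetical helper, a plain dict stands in for the BatchRequest, and it assumes the object supports deep copying and `!=` comparison. I have not verified whether any gunpowder node modifies the request in place.

```python
import copy


def mutates_argument(func, arg):
    """Return True if calling func(arg) leaves arg different from a prior deep copy."""
    snapshot = copy.deepcopy(arg)
    func(arg)
    return arg != snapshot


# Stand-in callables (hypothetical): one leaves its input alone, one edits it in place.
def well_behaved(req):
    return dict(req)


def swaps_last_two(req):
    shape = req["RAW"]
    shape[-1], shape[-2] = shape[-2], shape[-1]


request_stub = {"RAW": [1, 3, 1, 768, 512]}
print(mutates_argument(well_behaved, request_stub))
print(mutates_argument(swaps_last_two, request_stub))
```

If a check like this flagged a change on the real request, passing `copy.deepcopy(request)` to each `request_batch` call would be one way to keep the original intact while investigating.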


Thanks!!
