
1 Introduction

Image synthesis techniques have improved significantly over the last few years, and synthetic data has been effectively leveraged in a broad range of medical imaging tasks [2], including segmentation [3], classification [4] and reconstruction [5]. However, despite these results, current medical image synthesis approaches have a number of limitations that prevent them from being applied even more widely. Firstly, although many papers now demonstrate impressive high-resolution 2D image generation [6], progress on the generation of 3D (volumetric) images remains limited. Secondly, controllable image synthesis, in which image generation can be meaningfully conditioned on some input (such as a dense semantic mask), is under-explored in the medical setting. Lastly, synthesis of labelled data (e.g. data with multi-class segmentation masks) is strictly more useful than unlabelled image synthesis, yet many papers focus on unlabelled image synthesis, or on synthesis restricted to binary classes [2, 4].

Fig. 1.

Example synthetic cardiac MR images from the proposed model. The first row shows random samples from the learnt multi-tissue anatomical model, the second and third rows show synthetic images generated conditioned on those anatomical model samples with the simple and SPADE-based [1] renderers respectively. The final row shows the closest (\(l_2\)-norm) real image in the (augmented) training set. (Note that the model synthesises 3D volumes and we visualise random slices).

Fig. 2.

An overview of the proposed approach. During training (left) the anatomical model, encoder and decoder networks are learnt through self-supervision. Image volumes are encoded to latent vectors \(z_t\) (encoding the transformation) and \(z_r\) (encoding the rendering), which are both encouraged to have a Gaussian distribution. When synthesising data (right), Gaussian noise is fed to the decoders, and a synthetic volume is produced by realistically deforming and rendering the learnt anatomical model.

In this paper we propose an approach to address these problems. We introduce a model for data synthesis that learns a factored representation of 3D medical data which it then leverages to generate diverse and realistic synthetic images with corresponding dense labels (Fig. 1). Specifically, the model learns an anatomical factor, which captures the spatial structure of the data, and a rendering factor, which describes how the anatomy is rendered to a final image. The anatomical factor is represented as a deformation of a single multi-tissue anatomical model. This anatomical model is also learnt during training. The rendering factor then describes how the various tissues translate into final pixel intensities (see Fig. 2).

We demonstrate that this explicitly enforced factorisation enables the model to synthesise realistic 3D data. Moreover, the proposed framework also provides additional benefits. As all data are represented by (different deformations of) the same underlying anatomical model, we implicitly learn a dense correspondence between all training volumes, as well as between all synthesised images. We demonstrate that this dense correspondence facilitates few-shot, and even one-shot, segmentation via label propagation. Moreover, this dense correspondence allows us to co-register volumes, or to directly apply random realistic deformations.

After training, the anatomical model provides a dense semantic segmentation for every volume in the training set. As a second step we demonstrate that such dense segmentation masks can be used with state-of-the-art conditional image synthesis models [1] to generate sharp, high resolution synthetic image data, and that moreover, this synthesis is controllable in a natural and readily interpretable way, through varying the anatomical and rendering factors.

2 Related Work

Image synthesis has seen impressive recent development [6, 7], and provides a powerful approach to enlarging training sets for arbitrary downstream tasks [2, 8]. A standard generative model (such as the original GAN [9], or DCGAN [10]) is able to synthesise images similar to those in a large set of example images. Given sufficient training data the images produced by state-of-the-art generative models are both realistic and diverse [6]. However, these approaches generate unsegmented data, and there is no (interpretable) control over the specific image generated. Further, the requirement for large training datasets can restrict application in a medical imaging context, where data is limited.

Various approaches have been developed to mitigate these limitations. To address the lack of labels, a common approach is to generate labelled data in the target modality from labelled data in another modality through domain transfer [3]; however, this requires suitable auxiliary labelled data. To achieve controllable image synthesis, various conditional generative models have been proposed [2]. Relevant here, a number of recent works have explored controllable synthesis of natural images conditioned on dense segmentation masks [1, 11]. These methods have produced exceptional results, but the requirement for dense segmentation masks prohibits straightforward application to medical images, especially in the 3D case. Alternatively, the labels can be synthesised as part of the data (e.g. as additional channels), but this significantly increases the data dimensionality and does not facilitate controllable synthesis.

Lastly, a number of methods that straddle the line between augmentation and synthesis have been proposed, and generate realistic 3D data through shape-model based image deformations [8, 12], or from in silico phantoms [13]. However, direct generation of 3D (volumetric) medical image data, to the best of our knowledge, has not yet been demonstrated.

Factored representation learning, i.e. learning representations in which we disentangle “[d]ifferent explanatory factors of the data [that] tend to change independently of each other” [14], is a growing topic in both machine learning and medical image analysis. However, it has recently been shown that factorisation without guiding prior knowledge is not beneficial in general, and that the representations learnt do not facilitate improvement in downstream tasks [15]. In this work our factorisation is explicitly grounded, relying on the fact that medical images result from both a patient’s anatomy and an imaging procedure. We make use of this prior knowledge to learn a powerful model without labelled data. Previous work has demonstrated the benefit of factorising medical images for segmentation tasks [16, 17], and outside of medical imaging there have also been demonstrations of factored representations leveraged to implicitly register data, e.g. on 2D face images [18].

Fig. 3.

An overview of the proposed model. An image volume X is given as input, from which the parameters for an affine transform (\(\theta _a\)) are predicted through a variational encoder-decoder (VED) network (with latent representation \(z_a\)). This affine transformation is applied to X producing \(X'\). Next, parameters for a non-linear transformation (deformation) \(\theta _w\) are predicted from \(X'\) (via another VED), and this deformation is applied to the learnt anatomical model M yielding \(M'\). Next, a final VED predicts parameters \(\theta _r\) from \(X'\), and these parameters are used to render \(M'\), i.e. map it to an image. Finally the rendered image is aligned with the input X by applying the inverse affine transform. During training the encoder networks, decoder networks, and the anatomical model M are learnt. The other components have no learnable parameters.

3 Proposed Approach

In this section we describe the proposed approach. A schematic of the connections between the various model components is shown in Fig. 3, and below we describe each element of the model in detail.

Anatomical Model. The proposed approach involves learning an anatomical model M: a multi-channel volume of size \(S \times W \times H \times C\), where S is the number of slices in the volume, W and H are the in-plane image dimensions (in pixels), and C is the number of channels. C can be seen as a hyper-parameter which defines the maximum number of different ‘materials’ (or ‘tissue types’) in the anatomical model. We restrict M such that, for every voxel, the values across the channel dimension sum to one. Intuitively, this can be understood as letting the values across the channel dimension represent the relative proportion of each tissue type found in each voxel. To implement the model we directly learn the (unconstrained) values of a volume \(M_{pre}\) (identical in size to M) during training, and define \(M := softmax(M_{pre})\) where the softmax is over the channel dimension. An example of a learnt anatomical model is shown in Fig. 4.
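A minimal PyTorch-style sketch of this component (illustrative only; the channel-first layout, zero initialisation and default sizes are implementation choices rather than requirements):

import torch
import torch.nn as nn

class AnatomicalModel(nn.Module):
    """Learnable multi-tissue anatomical model M, stored channel-first as (C, S, W, H)."""
    def __init__(self, slices=8, width=64, height=64, channels=6):
        super().__init__()
        # Unconstrained volume M_pre, learnt directly as a parameter during training.
        self.m_pre = nn.Parameter(torch.zeros(channels, slices, width, height))

    def forward(self):
        # M := softmax(M_pre) over the channel dimension, so the tissue
        # proportions in each voxel sum to one.
        return torch.softmax(self.m_pre, dim=0)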

Fig. 4.

Example of an \(8 \times 64 \times 64 \times 6\) voxel learnt anatomical model. The model was learnt with six tissue types and then clustered into six classes for visualisation, with one colour per class. The base to apical slices are shown from left to right. It can be seen that, although learnt without supervision, various anatomical parts are clearly visible, such as the ventricular cavities, the myocardium and the chest wall.

Variational Encoder-Decoder Networks. Our model employs three variational encoder-decoder [19] networks (VEDs). Each of these networks consists of an encoder, which encodes the input to a latent vector, and a decoder, which maps this latent vector to the required output. The first VED takes as input the original image volume X and predicts affine transformation parameters \(\theta _{a}\). The second VED takes as input the affine-transformed image volume \(X^{\prime }\) and predicts the non-linear warp parameters \(\theta _w\), and the third VED also takes the affine-transformed volume \(X^{\prime }\) as input and predicts the rendering parameters \(\theta _r\). We define \(\theta _t = \{\theta _{a}, \theta _{w}\}\), i.e. the parameters of the full transformation. We employ the VED approach so that, as in the variational auto-encoder [19], we learn a low-dimensional latent representation for each input; during training we encourage the posterior over the latent space to be a standard Gaussian distribution, allowing us to use the model generatively by sampling \(z_t\) and \(z_r\) from standard Gaussian distributions (see Fig. 2). Each encoder consists of three 32-channel \(3\times 3\times 3\) convolutions with stride (1, 2, 2), followed by two 128-neuron dense layers and a final dense layer of the required size with no activation. The decoders have the same structure as the dense layers of the encoders. We use ReLU activations throughout. The latent spaces \(z_w\), \(z_a\) and \(z_r\) are of size 64, 16 and 16 respectively.
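A minimal PyTorch-style sketch of one VED, assuming \(8 \times 64 \times 64\) single-channel inputs, zero padding in the convolutions, and separate mean / log-variance heads for the latent; these details are illustrative rather than prescriptive:

import torch
import torch.nn as nn

class VED(nn.Module):
    """One variational encoder-decoder: 3 x (32-channel 3x3x3 conv, stride (1, 2, 2)),
    two 128-unit dense layers, then latent heads; the decoder mirrors the dense layers."""
    def __init__(self, latent_dim=16, out_dim=12):
        super().__init__()
        def block(c_in):
            return nn.Sequential(nn.Conv3d(c_in, 32, 3, stride=(1, 2, 2), padding=1),
                                 nn.ReLU())
        self.conv = nn.Sequential(block(1), block(32), block(32), nn.Flatten())
        self.dense = nn.Sequential(nn.Linear(32 * 8 * 8 * 8, 128), nn.ReLU(),
                                   nn.Linear(128, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)        # final dense layer, no activation
        self.to_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, 128), nn.ReLU(),
                                     nn.Linear(128, out_dim))

    def forward(self, x):                              # x: (N, 1, 8, 64, 64)
        h = self.dense(self.conv(x))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation trick
        return self.decoder(z), mu, logvar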

Transforms. The first step of the pipeline is to transform the input volume with an affine transform such that the input is approximately aligned with the anatomical model. After this transformation the model and input volume have the same overall orientation and scale, but are not co-registered at the individual voxel level, as this requires a further non-linear registration step (see below). The affine transform is specified by \(\theta _a \in \mathbb {R}^{12}\). Note that later in the pipeline we invert the predicted affine transform. This is only possible if the matrix is non-singular; however, we found that the reconstruction cost itself is sufficient to ensure this condition is met, and no additional regularisation is required.
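A sketch of applying and inverting the predicted affine transform, assuming a PyTorch-style implementation in which \(\theta _a\) is interpreted as a \(3 \times 4\) matrix acting in normalised grid coordinates (an illustrative convention):

import torch
import torch.nn.functional as F

def apply_affine(x, theta_a):
    """x: (N, C, S, H, W) volume; theta_a: (N, 12) predicted affine parameters."""
    A = theta_a.view(-1, 3, 4)                                   # one 3x4 matrix per volume
    grid = F.affine_grid(A, list(x.shape), align_corners=False)
    return F.grid_sample(x, grid, align_corners=False)

def invert_affine(theta_a):
    """Parameters of the inverse transform (requires the matrix to be non-singular)."""
    A = theta_a.view(-1, 3, 4)
    bottom = torch.tensor([0.0, 0.0, 0.0, 1.0], device=A.device).repeat(A.shape[0], 1, 1)
    A_inv = torch.linalg.inv(torch.cat([A, bottom], dim=1))      # (N, 4, 4)
    return A_inv[:, :3, :].reshape(-1, 12)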

The next step is to perform a non-linear deformation of the anatomical model M to produce a dense correspondence with the (affine-aligned) input volume \(X^{\prime }\). We investigated several deformation methods but found that directly predicting a dense offset field (with suitable regularisation, see the loss function below) produced the best results. Thus \(\theta _w \in \mathbb {R}^{S\times W \times H \times 3}\). Although the model does not require this deformation to be invertible, encouraging invertibility provides strong regularisation, and allows for co-registration of volumes via their predicted deformations.
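A sketch of warping the anatomical model with the dense offset field, assuming (purely for illustration) that the offsets are expressed in normalised \([-1, 1]\) grid units:

import torch
import torch.nn.functional as F

def warp_model(model, theta_w):
    """model: (N, C, S, H, W) anatomical model; theta_w: (N, S, H, W, 3) dense offset field."""
    n = model.shape[0]
    identity = torch.eye(3, 4, device=model.device).expand(n, 3, 4)
    # Identity sampling grid in normalised coordinates, shifted by the predicted offsets.
    base_grid = F.affine_grid(identity, list(model.shape), align_corners=False)
    return F.grid_sample(model, base_grid + theta_w, align_corners=False)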

Rendering. The final step is to convert the warped anatomical model into an image; we refer to this step as ‘rendering’. In order to encourage the anatomical model to capture as much information as possible we restrict the renderer to a simple network that assigns a single colour per tissue. Specifically, the simple renderer learns just C weights (and a bias) and performs a weighted sum of the anatomical model’s channels to yield the final image. After training the model we then learn a 2D SPADE-based renderer [1], leveraging the predicted dense segmentation masks, which we up-sample to \(128 \times 128\).
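A sketch of the simple renderer, assuming (for illustration) that \(\theta _r\) packs the C per-tissue weights followed by a single bias:

import torch

def simple_render(warped_model, theta_r):
    """warped_model: (N, C, S, H, W); theta_r: (N, C + 1) per-volume rendering parameters."""
    c = warped_model.shape[1]
    weights = theta_r[:, :c].view(-1, c, 1, 1, 1)
    bias = theta_r[:, c].view(-1, 1, 1, 1, 1)
    # Weighted sum over the tissue channels gives the rendered intensity of each voxel.
    return (warped_model * weights).sum(dim=1, keepdim=True) + bias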

Loss Function. We train the model end-to-end to minimise the mean-absolute-error of the reconstruction, \(L_{MAE}\). Additionally, we minimise the KL divergences of \(z_t\) and \(z_r\) from standard Gaussian distributions (loss component \(L_{KL}\)). This is done using the reparameterisation trick, as in the original VAE [19].

To regularise the non-linear transformation (and encourage invertibility) we minimise \(L_{det(J)} = |1-det(J)|\), where det(J) is the determinant of the Jacobian of the non-linear transformation. We also minimise the overall offset resulting from the combination of the affine and non-linear transformations, \(L_{offset}\); this encourages the model to minimise the distance between a voxel’s initial position in the model and its final position after both transformations. We weight the z direction of this offset to account for the volume’s non-isotropic resolution.

The overall loss function is defined as \(\lambda _1 L_{MAE} + \lambda _2 L_{KL} + \lambda _3 L_{det(J)} + \lambda _4 L_{offset}\) where \(\lambda \)s are hyper-parameters that appropriately scale each loss and determine their relative importances, set empirically to 1, 0.001, 0.1 and 0.0001 respectively.
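A sketch of the combined loss, assuming (for illustration) that the Jacobian of the non-linear transformation is estimated by finite differences of the displacement field and that the z-weighting of \(L_{offset}\) is a simple per-axis scaling:

import torch

def jacobian_determinant(disp):
    """disp: (N, 3, S, H, W) dense displacement field. Builds J = I + grad(u) per voxel."""
    rows = [torch.stack(torch.gradient(disp[:, i], dim=(1, 2, 3)), dim=-1) for i in range(3)]
    J = torch.stack(rows, dim=-2) + torch.eye(3, device=disp.device)
    return torch.linalg.det(J)                                   # (N, S, H, W)

def kl_to_standard_normal(mu, logvar):
    # Mean KL divergence from N(0, I) over batch and latent dimensions.
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

def total_loss(x, x_rec, mu_t, logvar_t, mu_r, logvar_r, disp, offsets,
               z_weight=0.1, lambdas=(1.0, 1e-3, 0.1, 1e-4)):
    """offsets: (N, 3, S, H, W) total per-voxel offset (affine plus non-linear)."""
    l_mae = (x - x_rec).abs().mean()
    l_kl = kl_to_standard_normal(mu_t, logvar_t) + kl_to_standard_normal(mu_r, logvar_r)
    l_det = (1.0 - jacobian_determinant(disp)).abs().mean()
    w = torch.tensor([z_weight, 1.0, 1.0], device=x.device).view(1, 3, 1, 1, 1)
    l_off = (w * offsets).abs().mean()
    return lambdas[0] * l_mae + lambdas[1] * l_kl + lambdas[2] * l_det + lambdas[3] * l_off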

4 Experiment Details

4.1 Data and Pre-processing

We make use of two datasets. ACDC (Automated Cardiac Diagnosis Challenge) [20] consists of cardiac magnetic resonance images (MRI) of 100 patients, both healthy (20%) and unhealthy (80%). We preprocess the data by resampling to \(1.3 \times 1.3\) mm in-plane resolution (keeping the inter-slice resolution unchanged). We then crop the in-plane image to \(144 \times 144\) pixels. In total we have 200 volumes (100 end-systolic (ES) and 100 end-diastolic (ED)), each with an average of 9 slices. The CHAOS (Combined Healthy Abdominal Organ Segmentation) dataset [21,22,23] consists of abdominal CT and MR images from different patients. Here we use the CT data and the T1-DUAL in-phase MR images, and preprocess the data as for ACDC, additionally downsampling to 8 slices.

4.2 Training Details

In all experiments we first train the proposed model for 2,000 epochs using Adam [24] with default parameters, a learning rate of 0.01, and a batch size of 32. We use online data augmentation to increase the data variation seen during training: we randomly select an \(8 \times 128 \times 128\) sub-volume and down-sample it to \(8 \times 64 \times 64\).
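A sketch of this augmentation step, assuming input volumes of size \(S \times 144 \times 144\) and trilinear down-sampling (the interpolation mode is an illustrative choice):

import torch
import torch.nn.functional as F

def augment(volume):
    """volume: (1, S, 144, 144) tensor with S >= 8. Random 8 x 128 x 128 sub-volume,
    then in-plane down-sampling to 8 x 64 x 64."""
    s = torch.randint(0, volume.shape[1] - 8 + 1, (1,)).item()
    y, x = torch.randint(0, 144 - 128 + 1, (2,)).tolist()
    crop = volume[:, s:s + 8, y:y + 128, x:x + 128].unsqueeze(0)   # (1, 1, 8, 128, 128)
    out = F.interpolate(crop, size=(8, 64, 64), mode='trilinear', align_corners=False)
    return out.squeeze(0)                                          # (1, 8, 64, 64)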

After training the initial model we then train the SPADE-based renderer on 2D image-mask pairs (at \(128\times 128\) resolution). The voxels in the anatomical model are not discrete classes, but rather contain ratios of the tissues. Thus, in order to make discrete multi-class segmentation maps we perform K-means clustering on the voxels. This produces K distinct classes of voxel, which we use for the segmentation map. We then train the original SPADE model on our data.
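A sketch of this discretisation step using scikit-learn's K-means (the library choice and default number of classes are illustrative):

from sklearn.cluster import KMeans

def discretise_model(M, n_classes=6):
    """M: (C, S, W, H) soft anatomical model as a NumPy array. Clusters the per-voxel
    channel vectors to produce a discrete class map for the SPADE-based renderer."""
    C = M.shape[0]
    voxels = M.reshape(C, -1).T                          # one C-dimensional vector per voxel
    labels = KMeans(n_clusters=n_classes, n_init=10).fit_predict(voxels)
    return labels.reshape(M.shape[1:])                   # (S, W, H) integer class map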

5 Results

2D and 3D Synthesis. Given Gaussian noise as input, the learnt model synthesises coherent 3D volumes (from which a 2D slice can then be randomly sampled if required, see Fig. 1). We visualise two example volumes in Fig. 5. As can be seen, the data is anatomically coherent both within and between slices, and the synthetic data is not simply memorised from the training set.

Label Transfer. We evaluate few-shot segmentation on ACDC by using a small number of volumes to learn labels for the anatomical model, then encoding test volumes and evaluating the Dice score between the real labels and the labels of the warped anatomical model. Averaged over 10 splits we achieve Dice scores of 69%, 67%, 63%, 60% and 55% for 150-, 50-, 10-, 3- and 1-shot label transfer respectively (over three classes: the myocardium and both ventricular blood-pools). Although these results are lower than those produced with supervision, they are on par with results learned from unpaired data [25]. It should also be noted that the training data is used only to learn the labels for the anatomical model; the model itself remains fixed, and thus the quality of the model's implicit correspondence is at least that of the reported Dice scores (69%).
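For reference, the per-class Dice evaluation can be sketched as follows (the class indexing is illustrative):

import torch

def dice_per_class(pred, target, class_ids=(1, 2, 3)):
    """pred, target: integer label volumes of identical shape, e.g. the warped
    anatomical-model labels and the reference segmentation."""
    scores = []
    for c in class_ids:
        p, t = (pred == c), (target == c)
        denom = p.sum() + t.sum()
        scores.append((2.0 * (p & t).sum() / denom).item() if denom > 0 else float('nan'))
    return scores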

Fig. 5.

Two example 3D synthetic image sets (128 \(\times \) 128 \(\times \) 8 voxels). Each 3D image consists of 8 short-axis slices. The first three rows show the sampled anatomical model, the result of the simple renderer, and the result of the SPADE-based renderer. The final row shows, for each generated slice, the most similar slice (\(l_2\) norm) from the (augmented) training set.

Fig. 6.

Example “4D” synthesis (8 \(\times \) 128 \(\times \) 128 voxels, 10 frames). Each frame consists of 8 short-axis slices; here we show two slices due to space constraints. See text for details.

Fig. 7.

Latent interpolation on ACDC (top) and CHAOS (bottom). In each row the left-most and right-most images are real, and the ten central images show reconstructions (using the simple renderer) when interpolating linearly between the latent representations of the outer images. It can be seen that both the shape and the intensities transition smoothly. Note the second row shows a multi-modal transition between CT and MR.

Latent Space Interpolation. First we perform pseudo-4D synthesis. We take the ES and ED volumes from an ACDC patient and interpolate (in the latent space) between their anatomical model deformations. This produces a smooth continuum of anatomies between the two cardiac phases. We then render all volumes using the SPADE-based renderer, resulting in 4D data for half of a cardiac cycle. The results are shown in Fig. 6. As only ES and ED frames were used for training, the figure should be taken as an example of the model's ability to meaningfully interpolate rather than as realistic synthesis of cardiac motion, since the intricacies of cardiac dynamics may not be captured. We further examine the learnt latent space of the model through additional interpolations in Fig. 7. In particular, results on CHAOS demonstrate multi-modal interpolation.

6 Conclusion and Discussion

We have presented a method for synthesis of medical image data via a learnt anatomical model and a factored representation. Image volumes are represented by an anatomical factor \(z_t\) (model deformation) and a rendering factor \(z_r\). Factoring the task in this way breaks the synthesis process into two simpler problems which can be solved in parallel. Further, the approach has a number of benefits: it emulates the real factored nature of medical image generation into patient and protocol, learns a multi-tissue (generative) shape model, implicitly co-registers all volumes (both training and synthesised), and allows for multi-modal learning by explicitly separating the shared anatomical information from the modality-specific appearance information. Additionally, it yields dense segmentation masks for all volumes, which, combined with the model's modular nature, means the renderer can be replaced by a state-of-the-art conditional synthesis model after training. We believe the proposed method can be readily applied to a range of medical synthesis tasks.

Lastly, our method uses a voxelised anatomical model. Future work on learning a continuous (e.g. mesh-based) anatomy would open up a number of research directions, for example allowing the simulation of k-space acquisition and reconstruction without committing the “inverse crime” [26]. This would allow the rendering process to move towards simulating the full MRI acquisition and reconstruction pipeline.