
1 Introduction

Image synthesis techniques have improved significantly over the last few years, and synthetic data has been effectively leveraged in a broad range of medical imaging tasks [2], including segmentation [3], classification [4] and reconstruction [5]. However, despite these results, current medical image synthesis approaches have a number of limitations that prevent them from being applied even more widely. Firstly, although many papers now demonstrate impressive high-resolution 2D image generation [6], progress on the generation of 3D (volumetric) images remains limited. Secondly, controllable image synthesis, in which image generation can be meaningfully conditioned on some input (such as a dense semantic mask), is under-explored in the medical setting. Lastly, synthesis of labelled data (e.g. data with multi-class segmentation masks) is strictly more useful than unlabelled image synthesis, yet many papers focus on unlabelled image synthesis, or on synthesis restricted to binary classes [2, 4].

Fig. 1.

Example synthetic cardiac MR images from the proposed model. The first row shows random samples from the learnt multi-tissue anatomical model, the second and third rows show synthetic images generated conditioned on those anatomical model samples with the simple and SPADE-based [1] renderers respectively. The final row shows the closest (\(l_2\)-norm) real image in the (augmented) training set. (Note that the model synthesises 3D volumes and we visualise random slices).

Fig. 2.

An overview of the proposed approach. During training (left) the anatomical model, encoder and decoder networks are learnt through self-supervision. Image volumes are encoded to latent vectors \(z_t\) (encoding the transformation) and \(z_r\) (encoding the rendering), which are both encouraged to have a Gaussian distribution. When synthesising data (right), Gaussian noise is fed to the decoders, and a synthetic volume is produced by realistically deforming and rendering the learnt anatomical model.

In this paper we propose an approach to address these problems. We introduce a model for data synthesis that learns a factored representation of 3D medical data which it then leverages to generate diverse and realistic synthetic images with corresponding dense labels (Fig. 1). Specifically, the model learns an anatomical factor, which captures the spatial structure of the data, and a rendering factor, which describes how the anatomy is rendered to a final image. The anatomical factor is represented as a deformation of a single multi-tissue anatomical model. This anatomical model is also learnt during training. The rendering factor then describes how the various tissues translate into final pixel intensities (see Fig. 2).

We demonstrate that this explicitly enforced factorisation enables the model to synthesise realistic 3D data. Moreover, the proposed framework also provides additional benefits. As all data are represented by (different deformations of) the same underlying anatomical model, we implicitly learn a dense correspondence between all training volumes, as well as between all synthesised images. We demonstrate that this dense correspondence facilitates few-shot, and even one-shot, segmentation via label propagation. Moreover, this dense correspondence allows us to co-register volumes, or to directly apply random realistic deformations.

After training, the anatomical model provides a dense semantic segmentation for every volume in the training set. As a second step we demonstrate that such dense segmentation masks can be used with state-of-the-art conditional image synthesis models [1] to generate sharp, high resolution synthetic image data, and that moreover, this synthesis is controllable in a natural and readily interpretable way, through varying the anatomical and rendering factors.

2 Related Work

Image synthesis has seen impressive recent development [6, 7], and provides a powerful approach to enlarging training sets for arbitrary downstream tasks [2, 8]. A standard generative model (such as the original GAN [9], or DCGAN [10]) is able to synthesise images similar to those in a large set of example images. Given sufficient training data the images produced by state-of-the-art generative models are both realistic and diverse [6]. However, these approaches generate unsegmented data, and there is no (interpretable) control over the specific image generated. Further, the requirement for large training datasets can restrict application in a medical imaging context, where data is limited.

Various approaches have been developed to mitigate these limitations. To address the lack of labels, a common approach is to generate labelled data in the target modality from labelled data in another modality through domain transfer [3]; however, this requires suitable auxiliary labelled data. To achieve controllable image synthesis, various conditional generative models have been proposed [2]. Relevant here, a number of recent works have explored controllable synthesis of natural images conditioned on dense segmentation masks [1, 11]. These methods have produced exceptional results, but the requirement for dense segmentation masks prohibits straightforward application to medical images, especially in the 3D case. Alternatively, the labels can be synthesised as part of the data (e.g. as additional channels), but this significantly increases the data dimensionality and does not facilitate controllable synthesis.

Lastly, a number of methods that straddle the line between augmentation and synthesis have been proposed, and generate realistic 3D data through shape-model based image deformations [8, 12], or from in silico phantoms [13]. However, direct generation of 3D (volumetric) medical image data, to the best of our knowledge, has not yet been demonstrated.

Factored representation learning, i.e. learning representations in which we disentangle “[d]ifferent explanatory factors of the data [that] tend to change independently of each other” [14], is a growing topic in both machine learning and medical image analysis. However, it has recently been shown that factorisation without guiding prior knowledge is not beneficial in general, and that the representations learnt do not facilitate improvement in downstream tasks [15]. In this work our factorisation is explicitly grounded, relying on the fact that medical images result from both a patient’s anatomy and an imaging procedure. We make use of this prior knowledge to learn a powerful model without labelled data. Previous work has demonstrated the benefit of factorising medical images for segmentation tasks [16, 17], and outside of medical imaging there have also been demonstrations of factored representations leveraged to implicitly register data, e.g. on 2D face images [18].

Fig. 3.

An overview of the proposed model. An image volume X is given as input, from which the parameters for an affine transform (\(\theta _a\)) are predicted through a variational encoder-decoder (VED) network (with latent representation \(z_a\)). This affine transformation is applied to X producing \(X'\). Next, parameters for a non-linear transformation (deformation) \(\theta _w\) are predicted from \(X'\) (via another VED), and this deformation is applied to the learnt anatomical model M yielding \(M'\). Next, a final VED predicts parameters \(\theta _r\) from \(X'\), and these parameters are used to render \(M'\), i.e. map it to an image. Finally the rendered image is aligned with the input X by applying the inverse affine transform. During training the encoder networks, decoder networks, and the anatomical model M are learnt. The other components have no learnable parameters.

3 Proposed Approach

In this section we describe the proposed approach. A schematic of the connections between the various model components is shown in Fig. 3, and below we describe each element of the model in detail.

Anatomical Model. The proposed approach involves learning an anatomical model M: a multi-channel volume of size \(S \times W \times H \times C\), where S is the number of slices in the volume, W and H are the in-plane image dimensions (in pixels), and C is the number of channels. C can be seen as a hyper-parameter which defines the maximum number of different ‘materials’ (or ‘tissue types’) in the anatomical model. We restrict M such that, for every voxel, the values across the channel dimension sum to one. Intuitively, this can be understood as letting the values across the channel dimension represent the relative proportion of each tissue type found in each voxel. To implement the model we directly learn the (unconstrained) values of a volume \(M_{pre}\) (identical in size to M) during training, and define \(M := softmax(M_{pre})\) where the softmax is over the channel dimension. An example of a learnt anatomical model is shown in Fig. 4.
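A minimal PyTorch-style sketch of this component (illustrative only; the channel-first layout, zero initialisation and default sizes are implementation choices rather than requirements):

import torch
import torch.nn as nn

class AnatomicalModel(nn.Module):
    """Learnable multi-tissue anatomical model M, stored channel-first as (C, S, W, H)."""
    def __init__(self, slices=8, width=64, height=64, channels=6):
        super().__init__()
        # Unconstrained volume M_pre, learnt directly as a parameter during training.
        self.m_pre = nn.Parameter(torch.zeros(channels, slices, width, height))

    def forward(self):
        # M := softmax(M_pre) over the channel dimension, so the tissue
        # proportions in each voxel sum to one.
        return torch.softmax(self.m_pre, dim=0)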

Fig. 4.

Example of an \(8 \times 64 \times 64 \times 6\) voxel learnt anatomical model. The model was learnt with six tissue types and then clustered into six classes for visualisation, with one colour per class. The base to apical slices are shown from left to right. It can be seen that, although learnt without supervision, various anatomical parts are clearly visible, such as the ventricular cavities, the myocardium and the chest wall.

Variational Encoder-Decoder Networks. Our model employs three variational encoder-decoder [19] networks (VEDs). Each of these networks consists of an encoder, which encodes the input to a latent vector, and a decoder, which maps this latent vector to the required output. The first VED takes as input the original image volume X and predicts affine transformation parameters \(\theta _{a}\). The second VED takes as input the affine-transformed image volume \(X^{\prime }\) and predicts the non-linear warp parameters \(\theta _w\), and the third VED also takes the affine-transformed volume \(X^{\prime }\) as input and predicts the rendering parameters \(\theta _r\). We define \(\theta _t = \{\theta _{a}, \theta _{w}\}\), i.e. the parameters of the full transformation. We employ the VED approach so that, as in the variational auto-encoder [19], we learn a low-dimensional latent representation for each input; during training we encourage the posterior over the latent space to be a standard Gaussian distribution, allowing us to use the model generatively by sampling \(z_t\) and \(z_r\) from standard Gaussian distributions (see Fig. 2). Each encoder consists of three 32-channel \(3\times 3\times 3\) convolutions with stride (1, 2, 2), followed by two 128-neuron dense layers and a final dense layer of the required size with no activation. The decoders have the same structure as the dense layers of the encoders. We use ReLU activations throughout. The latent spaces \(z_w\), \(z_a\) and \(z_r\) are of size 64, 16 and 16 respectively.
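A minimal PyTorch-style sketch of one VED, assuming \(8 \times 64 \times 64\) single-channel inputs, zero padding in the convolutions, and separate mean / log-variance heads for the latent; these details are illustrative rather than prescriptive:

import torch
import torch.nn as nn

class VED(nn.Module):
    """One variational encoder-decoder: 3 x (32-channel 3x3x3 conv, stride (1, 2, 2)),
    two 128-unit dense layers, then latent heads; the decoder mirrors the dense layers."""
    def __init__(self, latent_dim=16, out_dim=12):
        super().__init__()
        def block(c_in):
            return nn.Sequential(nn.Conv3d(c_in, 32, 3, stride=(1, 2, 2), padding=1),
                                 nn.ReLU())
        self.conv = nn.Sequential(block(1), block(32), block(32), nn.Flatten())
        self.dense = nn.Sequential(nn.Linear(32 * 8 * 8 * 8, 128), nn.ReLU(),
                                   nn.Linear(128, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)        # final dense layer, no activation
        self.to_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, 128), nn.ReLU(),
                                     nn.Linear(128, out_dim))

    def forward(self, x):                              # x: (N, 1, 8, 64, 64)
        h = self.dense(self.conv(x))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation trick
        return self.decoder(z), mu, logvar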

Transforms. The first step of the pipeline is to transform the input volume with an affine transform such that the input is approximately aligned with the anatomical model. After this transformation the model and input volume have the same overall orientation and scale, but are not co-registered at the individual voxel level, as this requires a further non-linear registration step (see below). The affine transform is specified by \(\theta _a \in \mathbb {R}^{12}\). Note that later in the pipeline we invert the predicted affine transform. This is only possible if the matrix is non-singular; however, we found that the reconstruction cost itself is sufficient to ensure this condition is met, and no additional regularisation is required.
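A sketch of applying and inverting the predicted affine transform, assuming a PyTorch-style implementation in which \(\theta _a\) is interpreted as a \(3 \times 4\) matrix acting in normalised grid coordinates (an illustrative convention):

import torch
import torch.nn.functional as F

def apply_affine(x, theta_a):
    """x: (N, C, S, H, W) volume; theta_a: (N, 12) predicted affine parameters."""
    A = theta_a.view(-1, 3, 4)                                   # one 3x4 matrix per volume
    grid = F.affine_grid(A, list(x.shape), align_corners=False)
    return F.grid_sample(x, grid, align_corners=False)

def invert_affine(theta_a):
    """Parameters of the inverse transform (requires the matrix to be non-singular)."""
    A = theta_a.view(-1, 3, 4)
    bottom = torch.tensor([0.0, 0.0, 0.0, 1.0], device=A.device).repeat(A.shape[0], 1, 1)
    A_inv = torch.linalg.inv(torch.cat([A, bottom], dim=1))      # (N, 4, 4)
    return A_inv[:, :3, :].reshape(-1, 12)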

The next step is to perform a non-linear deformation of the anatomical model M to produce a dense correspondence with the (affine-aligned) input volume \(X^{\prime }\). We investigated several deformation methods but found that directly predicting a dense offset field (with suitable regularisation, see the loss function below) produced the best results. Thus \(\theta _w \in \mathbb {R}^{S\times W \times H \times 3}\). Although the model does not require this deformation to be invertible, encouraging invertibility provides strong regularisation, and allows for co-registration of volumes via their predicted deformations.
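A sketch of warping the anatomical model with the dense offset field, assuming (purely for illustration) that the offsets are expressed in normalised \([-1, 1]\) grid units:

import torch
import torch.nn.functional as F

def warp_model(model, theta_w):
    """model: (N, C, S, H, W) anatomical model; theta_w: (N, S, H, W, 3) dense offset field."""
    n = model.shape[0]
    identity = torch.eye(3, 4, device=model.device).expand(n, 3, 4)
    # Identity sampling grid in normalised coordinates, shifted by the predicted offsets.
    base_grid = F.affine_grid(identity, list(model.shape), align_corners=False)
    return F.grid_sample(model, base_grid + theta_w, align_corners=False)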

Rendering. The final step is to convert the warped anatomical model into an image; we refer to this step as ‘rendering’. In order to encourage the anatomical model to capture as much information as possible we restrict the renderer to a simple network that assigns a single colour per tissue. Specifically, the simple renderer learns just C weights (and a bias) and performs a weighted sum of the anatomical model’s channels to yield the final image. After training the model we then learn a 2D SPADE-based renderer [1], leveraging the predicted dense segmentation masks, which we up-sample to \(128 \times 128\).
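A sketch of the simple renderer, assuming (for illustration) that \(\theta _r\) packs the C per-tissue weights followed by a single bias:

import torch

def simple_render(warped_model, theta_r):
    """warped_model: (N, C, S, H, W); theta_r: (N, C + 1) per-volume rendering parameters."""
    c = warped_model.shape[1]
    weights = theta_r[:, :c].view(-1, c, 1, 1, 1)
    bias = theta_r[:, c].view(-1, 1, 1, 1, 1)
    # Weighted sum over the tissue channels gives the rendered intensity of each voxel.
    return (warped_model * weights).sum(dim=1, keepdim=True) + bias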

Loss Function. We train the model end-to-end to minimise the mean-absolute-error of the reconstruction, \(L_{MAE}\). Additionally, we minimise the KL divergences of \(z_t\) and \(z_r\) from standard Gaussian distributions (loss component \(L_{KL}\)). This is done using the reparameterisation trick, as in the original VAE [19].

To regularise the non-linear transformation (and encourage invertibility) we minimise \(L_{det(J)} = |1-det(J)|\), where det(J) is the determinant of the Jacobian of the non-linear transformation. We also minimise the overall offset resulting from the combination of the affine and non-linear transformations, \(L_{offset}\); this encourages the model to minimise the distance between a voxel’s initial position in the model and its final position after both transformations. We weight the z direction of this offset to account for the volume’s non-isotropic resolution.

The overall loss function is defined as \(\lambda _1 L_{MAE} + \lambda _2 L_{KL} + \lambda _3 L_{det(J)} + \lambda _4 L_{offset}\) where \(\lambda \)s are hyper-parameters that appropriately scale each loss and determine their relative importances, set empirically to 1, 0.001, 0.1 and 0.0001 respectively.
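A sketch of the combined loss, assuming (for illustration) that the Jacobian of the non-linear transformation is estimated by finite differences of the displacement field and that the z-weighting of \(L_{offset}\) is a simple per-axis scaling:

import torch

def jacobian_determinant(disp):
    """disp: (N, 3, S, H, W) dense displacement field. Builds J = I + grad(u) per voxel."""
    rows = [torch.stack(torch.gradient(disp[:, i], dim=(1, 2, 3)), dim=-1) for i in range(3)]
    J = torch.stack(rows, dim=-2) + torch.eye(3, device=disp.device)
    return torch.linalg.det(J)                                   # (N, S, H, W)

def kl_to_standard_normal(mu, logvar):
    # Mean KL divergence from N(0, I) over batch and latent dimensions.
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

def total_loss(x, x_rec, mu_t, logvar_t, mu_r, logvar_r, disp, offsets,
               z_weight=0.1, lambdas=(1.0, 1e-3, 0.1, 1e-4)):
    """offsets: (N, 3, S, H, W) total per-voxel offset (affine plus non-linear)."""
    l_mae = (x - x_rec).abs().mean()
    l_kl = kl_to_standard_normal(mu_t, logvar_t) + kl_to_standard_normal(mu_r, logvar_r)
    l_det = (1.0 - jacobian_determinant(disp)).abs().mean()
    w = torch.tensor([z_weight, 1.0, 1.0], device=x.device).view(1, 3, 1, 1, 1)
    l_off = (w * offsets).abs().mean()
    return lambdas[0] * l_mae + lambdas[1] * l_kl + lambdas[2] * l_det + lambdas[3] * l_off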

4 Experiment Details

4.1 Data and Pre-processing

We make use of two datasets. ACDC (Automated Cardiac Diagnosis Challenge) [20] consists of cardiac magnetic resonance images (MRI) of 100 patients, both healthy (20%) and unhealthy (80%). We preprocess the data by resampling to \(1.3 \times 1.3\) mm in-plane resolution (keeping the inter-slice resolution unchanged). We then crop the in-plane image to \(144 \times 144\) pixels. In total we have 200 volumes (100 end-systolic (ES) and 100 end-diastolic (ED)), each with an average of 9 slices. The CHAOS (Combined Healthy Abdominal Organ Segmentation) dataset [21,22,23] consists of abdominal CT and MR images from different patients. Here we use the CT data and the T1-DUAL in-phase MR images, and preprocess the data as for ACDC, additionally downsampling to 8 slices.

4.2 Training Details

In all experiments we first train the proposed model for 2,000 epochs using Adam [24] with default parameters, a learning rate of 0.01, and a batch size of 32. We use online data augmentation to increase the data variation seen during training: we randomly select an \(8 \times 128 \times 128\) sub-volume and down-sample it to \(8 \times 64 \times 64\).
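A sketch of this augmentation step, assuming input volumes of size \(S \times 144 \times 144\) and trilinear down-sampling (the interpolation mode is an illustrative choice):

import torch
import torch.nn.functional as F

def augment(volume):
    """volume: (1, S, 144, 144) tensor with S >= 8. Random 8 x 128 x 128 sub-volume,
    then in-plane down-sampling to 8 x 64 x 64."""
    s = torch.randint(0, volume.shape[1] - 8 + 1, (1,)).item()
    y, x = torch.randint(0, 144 - 128 + 1, (2,)).tolist()
    crop = volume[:, s:s + 8, y:y + 128, x:x + 128].unsqueeze(0)   # (1, 1, 8, 128, 128)
    out = F.interpolate(crop, size=(8, 64, 64), mode='trilinear', align_corners=False)
    return out.squeeze(0)                                          # (1, 8, 64, 64)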

After training the initial model we then train the SPADE-based renderer on 2D image-mask pairs (at \(128\times 128\) resolution). The voxels in the anatomical model are not discrete classes, but rather contain ratios of the tissues. Thus, in order to make discrete multi-class segmentation maps we perform K-means clustering on the voxels. This produces K distinct classes of voxel, which we use for the segmentation map. We then train the original SPADE model on our data.
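A sketch of this discretisation step using scikit-learn's K-means (the library choice and default number of classes are illustrative):

from sklearn.cluster import KMeans

def discretise_model(M, n_classes=6):
    """M: (C, S, W, H) soft anatomical model as a NumPy array. Clusters the per-voxel
    channel vectors to produce a discrete class map for the SPADE-based renderer."""
    C = M.shape[0]
    voxels = M.reshape(C, -1).T                          # one C-dimensional vector per voxel
    labels = KMeans(n_clusters=n_classes, n_init=10).fit_predict(voxels)
    return labels.reshape(M.shape[1:])                   # (S, W, H) integer class map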

5 Results

2D and 3D Synthesis. Given Gaussian noise as input, the learnt model synthesises coherent 3D volumes (from which a 2D slice can then be randomly sampled if required, see Fig. 1). We visualise two example volumes in Fig. 5. As can be seen, the data is anatomically coherent both within and between slices, and the synthetic data is not simply memorised from the training set.

Label Transfer. We evaluate few-shot segmentation on ACDC by using a small number of volumes to learn labels for the anatomical model, then encoding test volumes and evaluating the Dice score between the real labels and the labels of the warped anatomical model. Averaged over 10 splits we achieve Dice scores of 69%, 67%, 63%, 60% and 55% for 150-, 50-, 10-, 3- and 1-shot label transfer respectively (over three classes: the myocardium and both ventricular blood-pools). Although these results are lower than those produced with supervision, they are on par with results learned from unpaired data [25]. It should also be noted that the training data is used only to learn the labels for the anatomical model; the model itself remains fixed, and thus the quality of the model's implicit correspondence is at least that of the reported Dice scores (69%).
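For reference, the per-class Dice evaluation can be sketched as follows (the class indexing is illustrative):

import torch

def dice_per_class(pred, target, class_ids=(1, 2, 3)):
    """pred, target: integer label volumes of identical shape, e.g. the warped
    anatomical-model labels and the reference segmentation."""
    scores = []
    for c in class_ids:
        p, t = (pred == c), (target == c)
        denom = p.sum() + t.sum()
        scores.append((2.0 * (p & t).sum() / denom).item() if denom > 0 else float('nan'))
    return scores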

Fig. 5.

Two example 3D synthetic image sets (128 \(\times \) 128 \(\times \) 8 voxels). Each 3D image consists of 8 short-axis slices. The first three rows show the sampled anatomical model, the result of the simple renderer, and the result of the SPADE-based renderer. The final row shows, for each generated slice, the most similar slice (\(l_2\) norm) from the (augmented) training set.

Fig. 6.

Example “4D” synthesis (8 \(\times \) 128 \(\times \) 128 voxels, 10 frames). Each frame consists of 8 short-axis slices; here we show two slices due to space constraints. See text for details.

Fig. 7.

Latent interpolation on ACDC (top) and CHAOS (bottom). In each row the left-most and right-most images are real, and the ten central images show reconstructions (using the simple renderer) when interpolating linearly between the latent representations of the outer images. It can be seen that both the shape and the intensities transition smoothly. Note the second row shows a multi-modal transition between CT and MR.

Latent Space Interpolation. First we perform pseudo-4D synthesis. We take the ES and ED volumes from an ACDC patient and interpolate (in the latent space) between their anatomical model deformations. This produces a smooth continuum of anatomies between the two cardiac phases. We then render all volumes using the SPADE-based renderer, resulting in 4D data for half of a cardiac cycle. The results are shown in Fig. 6. As only ES and ED frames were used for training, the figure should be taken as an example of the model's ability to meaningfully interpolate rather than as realistic synthesis of cardiac motion, since the intricacies of cardiac dynamics may not be captured. We further examine the learnt latent space of the model through additional interpolations in Fig. 7. In particular, results on CHAOS demonstrate multi-modal interpolation.

6 Conclusion and Discussion

We have presented a method for synthesis of medical image data via a learnt anatomical model and a factored representation. Image volumes are represented by an anatomical factor \(z_t\) (model deformation) and a rendering factor \(z_r\). Factoring the task in this way breaks the synthesis process into two simpler problems which can be solved in parallel. Further, the approach has a number of benefits: it emulates the real factored nature of medical image generation into patient and protocol, learns a multi-tissue (generative) shape model, implicitly co-registers all volumes (both training and synthesised), and allows for multi-modal learning by explicitly separating the shared anatomical information from the modality-specific appearance information. Additionally, it yields dense segmentation masks for all volumes, which, combined with the model's modular nature, means the renderer can be replaced by a state-of-the-art conditional synthesis model after training. We believe the proposed method can be readily applied to a range of medical synthesis tasks.

Lastly, our method uses a voxelised anatomical model. Future work on learning a continuous (e.g. mesh-based) anatomy would open up a number of research directions, for example allowing the simulation of k-space acquisition and reconstruction without committing the “inverse crime” [26]. This would allow the rendering process to move towards simulating the full MRI acquisition and reconstruction pipeline.