1 Introduction

Improving the trade-off between spatial resolution and signal-to-noise ratio (SNR) in MR imaging motivates research in both hardware and signal processing. As SNR increases monotonically with the strength of the magnetic field, high-field MR scanners (7T, 11.5T) have been designed and succeed in providing higher SNR at the same image resolution. However, the cost increases exponentially with the magnetic field strength. This limits the availability of high-field MR scanners across hospitals and clinical labs, and thus does not solve the problem in practice. There are only \(\sim \)40 clinical 7T scanners in the world, compared to \(\sim \)20000 3T scanners [5]. Thus, developing algorithms to enhance images from low-field (and low-cost) MR scanners serves as an important alternative. Indeed, it has been shown that signal processing techniques can improve the spatial resolution along with a significant increase in SNR [1].

The problem of reconstructing high-field-like images from low-field images is multifaceted and consists of several sub-problems, which include (i) an increase in resolution leading to enhancement of image details, (ii) contrast improvement, and (iii) an increase in signal-to-noise ratio. In addition, approaches that address these problems in less time are more feasible in clinical practice.

One can address the above concerns by learning a highly non-linear mapping from low-field (LF) to high-field (HF) MR images using exemplar LF and HF MR images. Considering this, Khosro et al. [2] attempted to construct 7T-like MR images from 3T MR images using dictionaries defined in a common space, which is estimated by hierarchical application of canonical correlation analysis (CCA). The 7T MR images are reconstructed using the dictionary defined for 7T MR images and the coefficient vector computed by representing the 3T MR images with the dictionary of corresponding exemplar 3T MR images. As it tries to capture the non-linearity of the transformation, it performs better than approaches that solely increase the resolution and SNR [3, 4]. However, the non-linearity of the transformation is still approximated by linear operations, which may incur significant fitting errors depending on the degree of non-linearity.

This is further addressed by approaches defined in the popular framework of neural networks, which can approximate even non-linear transformations well [5, 6]. In [5], the reconstruction of 7T MR images is explored using a convolutional neural network (CNN) with a requirement of anatomical features. Reconstruction, as well as segmentation, of the high-field MR images is performed using a cascaded CNN given the 3T MR images and corresponding segmentation images at the input. Both these approaches divide the image volumes into 3D cubes and execute the algorithm with a 3D CNN. Processing 3D cubes can help in reconstructing local details and maintaining consistency in the x, y, and z directions, but at the same time it may introduce block artifacts and, importantly, increases the time to reconstruct the test MR volumes.

Considering the ill-posed nature of the problem, and the possibility of multiple good solutions, we propose a merged convolutional autoencoder with three decoders, along with a strategy to update the weights adaptively based on the performance of each decoder at every iteration. The final estimate of the HF image is obtained by averaging the reconstructed images from the three decoders. To make the algorithm more usable in clinical practice, we reduce the reconstruction time of test MR volumes, while achieving better reconstruction and segmentation of the high-field MR images, by processing 2D images and removing the requirement of any anatomical or segmentation-based features.

Thus, our contributions can be summarized as: (a) an architecture of a convolutional autoencoder with multiple decoders; (b) an update criterion for the encoder weights on the basis of decoder performance; (c) merge connections to enhance the reconstruction ability; (d) a demonstration of reconstruction and segmentation improvements along with a significant reduction in reconstruction time compared to state-of-the-art approaches; and (e) a demonstration of superior performance across a variety of quantitative metrics such as PSNR, SSIM, sharpness and edge width, unlike [5, 6].

2 Proposed Approach

In this work, we employ a convolutional autoencoder, which tries to learn compact representative features of the image data. The problem of constructing HF-like MR images from LF MR images involves a non-linear transformation, which the convolutional autoencoder learns in latent space at multiple scales of the image obtained by upsampling and downsampling layers. The salient aspects of the proposed approach are detailed below:

2.1 One Encoder with Multiple Decoders

Since the image reconstruction task is an ill-posed problem, many solutions (HF images) may exist for the transformation of an LF image to the HF image estimate. The transformation in our case depends on the filter weights, which ideally should be representative enough to construct image details of complex structures, and discriminative enough to learn the differences between details of HF and LF images. While such a transformation can be learnt with a simple convolutional autoencoder (single encoder and decoder), considering that the transformation can be highly non-linear, there could be different weight combinations that provide good estimates of such a transformation. The proposed multi-decoder model is designed with the notion that decoders initialized randomly, and updated using individual distinct costs, are likely to learn different weights via different optimization paths. The random initializations can yield diverse solutions that can easily be collated for better PSNR. The distinctness between the learnt weights can be observed in Fig. 2 via activation maps at the same layers of different decoders.

Fig. 1. Selective auto-encoder backpropagation

While there can be different configurations of multiple decoders, as an example, in this work we consider three decoders integrated with a single encoder in the proposed architecture (Fig. 2). In this architecture, a selective backpropagation approach (Fig. 1), elaborated next, is proposed to enable the weight updates across the three paths by selecting one decoder out of the three based on their losses in each weight-update iteration (i.e. for each batch, with multiple batches considered within an epoch).

2.2 Updating the Weights

As indicated above, the weights of the architecture are updated in a three-fold manner, which involves the selection of one of the three decoders in each iteration. The selection is based on the minimum loss. Suppose \(\mathbf {E}_i\) represents the error of the network at the \(i^{\text{ th }}\) decoder, such that \(\mathbf {E}_i = g(\mathbf {W}_E, \mathbf {W}_{D_i})\), with encoder weights \(\mathbf {W}_E\) and decoder weights \(\mathbf {W}_{D_i}\) (\(i=1, 2, 3\)). The weight update of the encoder is represented as \(\varDelta \mathbf {W}_E \propto \min _{i}(\mathbf {E}_i)\). In this way, in every iteration the encoder weights have three open but guided paths to follow, and the optimal one (with minimum loss) is chosen.

While the encoder weights are updated with the minimum decoder loss, for updating the decoder weights, we update all the decoders using their respective losses, i.e. \(\varDelta \mathbf {W}_{D_i} \propto \mathbf {E}_i\). We observe that simultaneously updating the decoders helps in minimizing the training loss faster, even as compared to a single encoder and decoder model, and also yields an improved performance.
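The update rule described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear encoder and decoders, the mean-squared-error loss, the plain gradient-descent step, the toy data sizes and all function names (`mse_and_grads`, `selective_step`) are assumptions made purely for illustration. Only the selection logic follows the text: every decoder is updated with its own loss, while the encoder is updated through the decoder with the minimum loss.

```python
import numpy as np

def mse_and_grads(W_E, W_D, X, T):
    """MSE of one encoder/decoder path and its gradients (toy linear model)."""
    Z = W_E @ X                      # latent code
    R = W_D @ Z - T                  # residual of the reconstruction
    loss = np.mean(R ** 2)
    dY = 2.0 * R / R.size            # gradient of the loss w.r.t. the output
    grad_D = dY @ Z.T                # gradient w.r.t. this decoder's weights
    grad_E = (W_D.T @ dY) @ X.T      # gradient w.r.t. encoder via this decoder
    return loss, grad_E, grad_D

def selective_step(W_E, decoders, X, T, lr=0.05):
    """One iteration: all decoders updated, encoder follows the best decoder."""
    results = [mse_and_grads(W_E, W_D, X, T) for W_D in decoders]
    losses = [r[0] for r in results]
    best = int(np.argmin(losses))    # decoder with the minimum loss
    new_decoders = [W_D - lr * r[2] for W_D, r in zip(decoders, results)]
    new_W_E = W_E - lr * results[best][1]
    return new_W_E, new_decoders, losses, best

# Toy data (hypothetical sizes): 16 samples of dimension 8, latent size 4.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))
T = 0.5 * rng.standard_normal((8, 8)) @ X
W_E = 0.3 * rng.standard_normal((4, 8))
decoders = [0.3 * rng.standard_normal((8, 4)) for _ in range(3)]

history = []                         # minimum decoder loss per iteration
for _ in range(200):
    W_E, decoders, losses, best = selective_step(W_E, decoders, X, T)
    history.append(min(losses))
```

In this toy setting the minimum loss recorded in `history` decreases over the iterations; the behaviour of the actual convolutional model depends on details (optimizer, batch size) that the sketch does not reproduce.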

2.3 Merge Connections

We define the proposed architecture with blocks of successive filter layers followed by a max pooling layer in the encoder section, as shown in Fig. 2. To reconstruct the original size of the image at the output, an upsampling layer is introduced in each block of the decoders. While upsampling, some artifacts may be introduced due to missing details in the downsampled input of the decoder. Hence, we concatenate the input of the decoder with its upscaled version from the encoder in order to provide the nature of upscaled details for better reconstruction while upsampling in the decoder. Indeed, we observe that adding the merge connections yields a significant PSNR improvement (of the order of 5 dB). This setting of our architecture is inspired by [7].
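As a rough illustration of a merge connection, the sketch below upsamples a decoder feature map by a factor of 2 and concatenates it with an encoder feature map of matching spatial size along the channel axis. This is one plausible reading of the description; the exact wiring is given in Fig. 2 and is not reproduced here. The nearest-neighbour upsampling, the channel counts, the feature-map sizes and the function names are assumptions.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (channels, height, width) array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def merge(decoder_features, encoder_features):
    """Upsample the decoder features and concatenate encoder features
    of the same spatial size along the channel axis."""
    up = upsample2x(decoder_features)
    if up.shape[1:] != encoder_features.shape[1:]:
        raise ValueError("spatial sizes must match after upsampling")
    return np.concatenate([up, encoder_features], axis=0)

# Hypothetical shapes: 256 decoder channels at 8x8, 128 encoder channels at 16x16.
decoder_in = np.zeros((256, 8, 8))
encoder_skip = np.ones((128, 16, 16))
merged = merge(decoder_in, encoder_skip)   # shape (384, 16, 16)
```

The convolutional layers that follow such a merge then see both the coarse decoder features and the finer encoder details at the same scale.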

2.4 Proposed Architecture

The proposed approach employs a single encoder and multiple decoder architecture with a single channel input as described in Fig. 2. Three convolutional layers are used in each block of an encoder and all the three decoders, followed by a batch normalization layer to maintain the numerical stability.

Fig. 2. Proposed Architecture (Better viewed in color)

The first convolutional block in the encoder has 32 filters, and the number of filters doubles after each convolutional block. In all the decoders, the first convolutional block has 256 filters, and the number of filters is halved after each block. We use a filter size of 3-by-3 in all convolutional blocks.

We use Rectified Linear Unit (ReLU) as an activation function in all the layers except the final layer. Since our data is normalized between 0 and 1, a Sigmoid activation function is used at the final layer.

The pixel values greater than zero (brain area) are passed to the next layer and rest of the pixels (outer part of the brain) are squashed to zero. This phenomenon is clearly visible in the initial layers of the encoder and the last layers of all the three decoders, through the activation maps in Fig. 2.

It is well known that local image details at various scales play a significant role in image reconstruction. The proposed architecture considers images at different scales using hierarchical layers for downsampling (maxpooling) and upsampling (each for a factor of 2) in encoder and decoders, respectively. The encoded representation obtained after three downsampling operations brings the data from a high dimension input to a latent space representation.
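The filter counts and spatial sizes implied by this description can be tabulated with a short sketch. The number of blocks per side (three, matching the three downsampling operations mentioned above) and the 256-by-256 input size are assumptions; the source does not state the input size, and the function names are hypothetical.

```python
def encoder_schedule(input_size, first_filters=32, num_blocks=3):
    """(filters, output size) per encoder block; each block halves the size."""
    schedule, size, filters = [], input_size, first_filters
    for _ in range(num_blocks):
        size //= 2                      # max pooling by a factor of 2
        schedule.append((filters, size))
        filters *= 2                    # filters double after each block
    return schedule

def decoder_schedule(latent_size, first_filters=256, num_blocks=3):
    """(filters, output size) per decoder block; each block doubles the size."""
    schedule, size, filters = [], latent_size, first_filters
    for _ in range(num_blocks):
        size *= 2                       # upsampling by a factor of 2
        schedule.append((filters, size))
        filters //= 2                   # filters are halved after each block
    return schedule

enc = encoder_schedule(256)             # assumed 256x256 input slice
dec = decoder_schedule(enc[-1][1])      # start from the latent spatial size
```

Under these assumptions the latent representation has one eighth of the input size along each spatial dimension, and the decoder restores the original size.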

As the training proceeds, after every epoch the model is validated on 20% of the unseen validation data. The auto-encoder with the minimum mean square error on the training data predicts on the validation data; using the predicted output, we calculate the PSNR on the validation data and save the best weights, corresponding to the maximum PSNR across epochs. The test reconstructions are then computed with these weights. Further, we have observed experimentally that the minimum strategy for backpropagation gives better results than a maximum strategy.

To aid faster convergence, we reduce the learning rate by 10% of its value after every 20 epochs. We also observe that with learning rate decay, the loss converges to a smaller value than without it. The initial value of the learning rate is set to 1e-3. The model is trained for 500 epochs, which is observed to be more than sufficient to ensure convergence.
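The decay schedule can be written as a short function. The sketch below assumes that "reducing by 10% of its value" means multiplying the current rate by 0.9 at each 20-epoch boundary, and that epochs are counted from zero; the function name and parameters are hypothetical.

```python
def learning_rate(epoch, initial_lr=1e-3, drop=0.10, every=20):
    """Step decay: multiply the rate by (1 - drop) after every `every` epochs."""
    return initial_lr * (1.0 - drop) ** (epoch // every)

# Rate at the start of training and after the first two decay steps.
rates = [learning_rate(e) for e in (0, 20, 40)]
```

Under this reading the rate stays at 1e-3 for epochs 0 to 19, drops to 9e-4 at epoch 20, and continues to shrink geometrically until epoch 500.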

Finally, in the testing phase, three reconstruction estimates of the test data are obtained at the three decoder outputs, and we take the average of the three predictions, which improves the results quantitatively. We observe that averaging the predicted outputs also helps in reducing noise-like effects while preserving local features in the reconstructed images. Hence, an improvement in PSNR is observed over the individual decoder outputs as well as over an AE trained with a single decoder.
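The effect of averaging can be illustrated on synthetic data. The sketch below is an idealized example, not a reproduction of the reported results: the three "decoder outputs" are simulated as a common target plus independent Gaussian noise, which is an assumption (errors of real decoders are generally correlated). The PSNR definition assumes images scaled to the range 0 to 1.

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images with the given data range."""
    mse = np.mean((reference - estimate) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

rng = np.random.default_rng(0)
target = rng.random((64, 64))                       # stand-in for an HF slice

# Simulated decoder outputs: target plus independent noise (assumption).
decoder_outputs = [target + 0.05 * rng.standard_normal(target.shape)
                   for _ in range(3)]

averaged = np.mean(decoder_outputs, axis=0)         # final estimate
individual_psnr = [psnr(target, out) for out in decoder_outputs]
averaged_psnr = psnr(target, averaged)
```

In this idealized setting the averaged estimate has a higher PSNR than each individual output, because independent errors partially cancel.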

3 Experimental Results

3.1 Experimental Setup

To evaluate the performance of the proposed algorithm, real MR images scanned by 3T and 7T MR scanners are selected from the dataset available online [8]. From a pool of volumes, 39 MR image volumes are randomly selected, and the 3T MR images are registered with the 7T MR image volumes using the FLIRT software in FSL [9] in order to have pixel-to-pixel correspondence. Further, each MR volume is scaled to the range 0 to 1 for numerical stability. The proposed architecture is trained on MR image volumes from 22 subjects, while volumes of 6 subjects are used for validation and 11 are used for testing. We cross-validate across 3 trials involving random sets of training, validation and test data.
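The scaling and subject split described above can be sketched as follows. Min-max scaling is an assumption (the text only states that each volume is scaled to the range 0 to 1), as are the function names and the use of a seeded random permutation for one trial.

```python
import numpy as np

def scale_to_unit_range(volume):
    """Min-max scale a volume to [0, 1]; a constant volume maps to zeros."""
    lo, hi = volume.min(), volume.max()
    if hi == lo:
        return np.zeros_like(volume, dtype=float)
    return (volume - lo) / (hi - lo)

def split_subjects(num_subjects=39, n_train=22, n_val=6, seed=0):
    """Randomly split subject indices into training, validation and test sets."""
    order = np.random.default_rng(seed).permutation(num_subjects)
    return (order[:n_train],
            order[n_train:n_train + n_val],
            order[n_train + n_val:])

scaled = scale_to_unit_range(np.array([[2.0, 4.0], [6.0, 10.0]]))
train_ids, val_ids, test_ids = split_subjects()     # one random trial
```

Repeating `split_subjects` with different seeds gives the random training, validation and test sets of the separate trials.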

For comparison with existing approaches, we re-implement the 3D CNN approach defined in [5]. As some of the parameters are not mentioned in their work, we have used the same parameters as in the proposed work, e.g. learning rate, learning rate decay strategy, optimizer and batch size. To consider a complementary framework, the sparse-representation approach [4] is also used for comparison. For training all approaches, we use 207 images from each subject volume. However, due to insignificant information in the first and last 20 slices, we select the central 167 slices per volume for reconstruction. All implementations are run on a system with an Nvidia GeForce 1080 Ti GPU, a Xeon E5 processor and 32 GB RAM.

3.2 Reconstruction Results

The test 7T MR image volumes are constructed using the proposed approach and other existing works, and two images are randomly selected to illustrate the quality comparison between different approaches. It can be observed from Figs. 3 and 4 that the sparse-based approach [4] is able to construct the details but with diffused tissue boundaries. The 3D CNN performs well in terms of tissue boundaries but is unable to restore smaller differences in voxel values. Both these aspects are improved upon by the proposed approach.

The improvement is reflected in the quantitative results in Table 1, with higher PSNR and SSIM values. To measure the blurriness of the edges, two parameters, i.e. sharpness and edge width, are computed as defined in [10]. We observe that the algorithm may change the dynamic range of the data. Thus, to be consistent when comparing the quality of the reconstructed images, we first match the histogram (HM) of the reconstructed image with that of the corresponding 3T image. However, we also show the results for the proposed method without HM. The values of the parameters are computed over non-background pixels of the reconstructed images scaled to their original range.
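Histogram matching can be sketched with a standard cumulative-distribution mapping. This is a generic illustration and not necessarily the procedure used in the experiments: the function name, the toy images and the use of linear interpolation between reference intensities are assumptions.

```python
import numpy as np

def match_histogram(source, reference):
    """Map source intensities so that their empirical CDF follows the reference."""
    src_vals, src_idx, src_counts = np.unique(
        source.ravel(), return_inverse=True, return_counts=True)
    ref_vals, ref_counts = np.unique(reference.ravel(), return_counts=True)
    src_cdf = np.cumsum(src_counts) / source.size
    ref_cdf = np.cumsum(ref_counts) / reference.size
    # For each source quantile, look up the reference intensity at that quantile.
    mapped = np.interp(src_cdf, ref_cdf, ref_vals)
    return mapped[src_idx].reshape(source.shape)

rng = np.random.default_rng(0)
reconstructed = 0.5 * rng.random((32, 32))   # toy image with reduced range
reference = rng.random((32, 32))             # toy reference image
matched = match_histogram(reconstructed, reference)
```

The mapping is monotonic, so the ordering of intensities in the reconstructed image is preserved while its dynamic range follows the reference.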

Fig. 3. Example reconstructions and comparison visualized at a finer scale.

Fig. 4. Example reconstructions and comparison visualized at a finer scale.

3.3 Segmentation Results

High-quality images help in improving the segmentation of tissues required for medical analysis. Thus, we compare segmentation labels for images reconstructed by different algorithms, obtained with the FAST software of FSL for gray matter (GM), white matter (WM) and CSF. The dice-ratio improvements in segmentation with reconstruction using the proposed approach are clear from Table 2. The work in [5] has outperformed the sparse-based reconstruction; thus, we do not provide segmentation results for the latter.
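For reference, the dice ratio between two label maps can be computed per tissue as twice the overlap divided by the total size of the two regions. The sketch below uses hypothetical integer labels (1 for CSF, 2 for GM, 3 for WM) and toy label arrays; the convention of returning 1.0 when a tissue is absent from both maps is also an assumption.

```python
import numpy as np

def dice_ratio(labels_a, labels_b, tissue):
    """Dice ratio 2|A ∩ B| / (|A| + |B|) for one tissue label."""
    a = labels_a == tissue
    b = labels_b == tissue
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0                      # tissue absent in both maps
    return 2.0 * np.logical_and(a, b).sum() / denom

# Toy label maps (hypothetical labels: 1 = CSF, 2 = GM, 3 = WM).
from_reconstruction = np.array([1, 2, 2, 3])
from_ground_truth = np.array([1, 2, 3, 3])
gm_dice = dice_ratio(from_reconstruction, from_ground_truth, tissue=2)
```

A value of 1 indicates perfect overlap of the two segmentations for that tissue, and 0 indicates no overlap.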

Table 1. Quantitative comparison of proposed approach
Table 2. Dice ratio for segmentation of images reconstructed by different algorithms

3.4 Computational Complexity

Here, we stress the computational advantage of the proposed approach in terms of run-time for reconstruction, as compared to the approach of [5]. The 3D CNN approach [5] takes 137 min to construct the image volumes of 11 subjects. In contrast, the proposed algorithm is computationally simple and takes less than 2 min for the same task. To justify this, we note that the number of multiplications in the architecture of [5] is 2145 times that in the proposed one. This is largely due to the unpadded 3D convolution in [5].

4 Conclusion

We reported a novel convolutional framework with a single encoder and three decoders for reconstructing 7T-like MR images from 3T MR images as inputs. The proposed approach employs a single-channel input (i.e. it does not require anatomical and segmentation features as input), and yet achieves superior reconstruction quality over some contemporary methods. It also has a significant computational advantage. We also show that the reconstructed 7T-like MR images, when segmented, have a better dice ratio than those of the compared approaches.