1. Introduction
With the development of medical imaging modalities such as ultrasound, X-ray, computed tomography (CT), and magnetic resonance imaging (MRI), much attention is attracted to developing new medical image-processing (MIP) techniques [
1]. Among different types of MIP techniques, medical image segmentation has been a hotspot in recent years, which is considered the most crucial step in clinical diagnosis. Moreover, medical image segmentation is also the prerequisite of many downstream clinical tasks. Nevertheless, medical image segmentation is often characterized by uncertainties. These uncertainties include the uncertainty of the plausible segmentation hypotheses and the uncertainty of segmentation performance. These two types of uncertainties affect the effectiveness of the MIS algorithm and affect the reliability of the subsequent medical diagnosis.
Different medical experts may have their style of segmenting lesions from medical images, and from a medical perspective, all their segmentations could be correct [
2]. Additionally, in organ abnormality segmentation, a lesion area is clear and distinct, but even the top experts may not be able to judge whether the lesion is cancer tissue or not. These result in the uncertainty of the plausible segmentation hypotheses, which is common in medical imaging. On the other hand, there is uncertainty in the segmentation performance of a segmentation algorithm. A segmentation algorithm may perform well on some medical images while performing poorly on the other images. Failure to detect these poor segmentations may affect the performance of downstream tasks and even lead to misdiagnosis in practice. Most automatic segmentation algorithms treat the segmentation target as a one-to-one mapping from the image to the output mask. However, only providing the most likely segmentation hypothesis is not enough for radiologists. Radiologists need all plausible segmentation hypotheses to avoid the uncertainty of the plausible segmentation hypotheses. They also need predictable statistical measures about these hypotheses to assist them in evaluating the segmentation performance. If possible, radiologists even want to use their specified measurement values to generate segmentation hypotheses.
Many studies have been done to solve the uncertainty of the plausible segmentation hypotheses by producing more than one plausible hypothesis from the models. Balaji et al. [
3] propose to learn an ensemble of deep models to generate multiple plausible hypotheses. However, the output segmentations generated by the ensembles lack diversity, and the ensembles usually do not learn rare segmentation hypotheses, because their models are trained independently, and the impact of the rare segmentation hypotheses is weakened by the majority. Abner et al. [
4] present a max-margin formulation and the oracle set loss to directly model the M-Best prediction problem. Enlightened by Abner et al. [
4], Rupprecht et al. [
5] and Ilg et al. [
6] combined the oracle set loss with a common deep network with M heads to generate M hypotheses. Nevertheless, all the above-mentioned methods can only generate a fixed number of hypotheses and are not graceful to extend to a large number of hypotheses. This problem can be solved by a conditional variational autoencoder (CVAE) [
7], as a CVAE can effectively perform probabilistic inference by learning Gaussian latent variables to model complex distributions. With the learned conditional distribution, a CVAE can generate unlimited and diverse segmentation hypotheses. Due to the excellent segmentation performance of the U-Net [
8], Simon et al. [
9] combined the CVAE with the U-Net. The combined method is called probabilistic U-net, which can produce an unlimited number of segmentation hypotheses while delivering better performance than network ensembles [
3] and M-heads [
5,
6]. Recently, probabilistic hierarchical segmentation (PHiSeg) [
10] further extends the prior net and the posterior net of the probabilistic from a single resolution structure to a hierarchical multi-resolution structure and achieves state-of-the-art performance across multiple datasets.
As another type of uncertainty, the uncertainty of segmentation performance has not drawn enough attention. To reduce this type of uncertainty, a feasible method is to provide the corresponding measure predictions along with each segmentation hypothesis. The measures can be precision, accuracy, the true positive rate, the true negative rate, or other measures. These predictive measures can assist radiologists to evaluate the segmentation performance and decide whether the segmentation hypothesis should be accepted. To the best of our knowledge, there exists no other work that has considered providing the corresponding measure predictions along with each segmentation hypothesis; and there exists no other work that has considered generating segmentation hypotheses based on specified measurement values. To fill these two gaps, we propose a hierarchical predictable segmentation network (HPS-Net). An illustration of the application case of the proposed HPS-Net is shown in
Figure 1. HPS-Net can learn a complex probability distribution of the samples in the latent space and has the ability to predict the measurement values; therefore, it can generate an unlimited number of segmentation hypotheses along with their measure predictions. From another perspective, HPS-Net can provide the segmentation hypothesis based on a specified measurement value by continually generating segmentation hypotheses until the predicted measurement value is extremely close to the specified value. HPS-Net is a hierarchical multi-task network for medical image segmentation and segmentation performance prediction, which consists of a new network structure, a new loss function, and a cooperative training mode. As a multi-task network, HPS-Net includes four sub-networks, namely, the posterior network, the prior network, the likelihood network, and the measure network. These subnetworks have multiple symmetrical structures in them, which enable the networks to extract the features of images at different scales. To demonstrate the efficacy of HPS-Net, experiments were conducted on two public datasets—the LIDC-IDRI dataset [
11] and the ISIC2018 dataset [
12]. The results show that the proposed HPS-Net can provide effective measure predictions while keeping the maximum segmentation performance in terms of segmentation accuracies of multiple annotations and a single annotation per image.
The remainder of the article is organized as follows. In
Section 2, we review related work including U-Net, probabilistic U-Net, and PHiSeg. We describe problem formulation and the HPS-Net structure in
Section 3. In
Section 4, we evaluate the performance of HPS-Net on the LIDC-IDRI dataset [
11] and the ISIC2018 dataset [
12]. Finally, we conclude our article in
Section 5.
2. Related Work
Our proposed HPS-NET is derived from PHiSeg, while PHiSeg is derived from the probabilistic U-Net, where U-Net is a milestone in the field of medical image segmentation and is the derivative source of HPS-NET, PHiSeg, and the probabilistic U-Net. To have a better understanding of the proposed method, we review U-Net, the probabilistic U-Net, and PHiSeg in this section.
2.1. U-Net
Due to the outstanding performance of U-Net, many excellent networks [
13,
14,
15] are based on or sourced from U-Net. U-Net [
8] is derived from the fully convolutional network (FCN) [
16] and is a convolutional neural network developed for biomedical image segmentation task. Compared with FCN, the structure of U-Net was modified to adapt to fewer training images and to achieve higher segmenting accuracy. The main idea of U-Net is to supplement a usual contracting network by successive layers, where pooling operations are replaced by upsampling operators. Hence, these layers increase the resolution of the output. Moreover, successive convolutional layers can then learn to assemble a precise output based on the information.
Figure 2 illustrates the structure of U-Net. From
Figure 2, we can see that U-Net has two symmetric paths, namely, the left contracting path and the right expanding path. The contracting path is made up of repeated convolutional layers, ReLU layers, and Max-pooling layers, which focuses on capturing context information. The right expanding path contains repeated convolutional layers, ReLU layers, and upsampling layers, which enables precise localization. There are some skip connections between these two paths, which can compensate for the loss of boundary pixels caused by the convolution operations in the contracting path.
Since the input and output are images of the same size, why construct a symmetric structure that includes downsampling and upsampling? The answer can be explained from the theoretical meaning of symmetry principles. The symmetric structure can be divided into two paths—the downsampling path and the upsampling path. The downsampling path can increase the robustness of some small disturbances of the input image, such as image translation and rotation, thereby reducing the risk of overfitting, reducing the amount of calculation, and increasing the size of the receptive field. The main purpose of the upsampling path is actually to transform the features back to the size of the original image. In addition, since a large amount of information is lost in downsampling, some connections are applied between the layers at the same resolution level to supplement the missing information. The above reasons led to the final adoption of a symmetric structure.
2.2. Probabilistic U-Net
U-Net delivers the best segmentation performance at the time and has been applied to many medical image domains. However, U-Net follows a one-to-one I/O mode, so it cannot solve the uncertainty of the plausible segmentation hypotheses. Based on U-Net and the conditional variational auto encoder (CVAE) [
7], Simon et al. [
9] proposed a segmentation network called probabilistic U-Net, which can output an unlimited number of segmentation hypotheses for an input image. The network structure of the probabilistic U-Net is shown in
Figure 3a. Similar to the general auto encoder, the CVAE is also symmetrical and consists of two symmetric sub-networks called prior net and posterior net, which are corresponding to the encoder and decoder in an auto encoder. The inputs of the prior net and the posterior net are the medical image and the ground truth of the segmentation. Different from the general auto encoder, the CVAE assumes that the data are generated by some random processes, and the CVAE performs probabilistic inference by learning Gaussian latent variables to model complex distributions in the latent space. In CVAE, the parameters in the network are assigned the meaning of
and
in a Gaussian distribution. During the training process, the CVAE optimizes a large number of Gaussian latent variables to construct a large number of multidimensional Gaussian distributions to fit a complex distribution.
As the CVAE can learn a complex probability distribution in the latent space, an unlimited number of random samples can be sampled by the learned distribution. In the probabilistic U-Net, the features of the generated sample are then concatenated to the output of the last layer in U-Net. Finally, three 1 × 1 convolutions are applied to combine the concatenated information and map it to the desired number of classes. The probabilistic U-Net has two characteristics. Firstly, it can model the joint probability of all pixels in the medical image, so the produced segmentation maps can interpret the entire images consistently. Secondly, it models a complex distribution by all training samples, so it also learns the segmentation hypotheses with low probabilities. Due to the above characteristics, the probabilistic U-Net delivers better performance than network ensembles [
3], M-heads [
5,
6], and other related methods.
2.3. PHiSeg
Christian et al. demonstrate that the probabilistic U-Net has limited diversity when generating the segmentation hypotheses, which may be due to the following two reasons. Firstly, the randomly generated sample is concatenated to the output of the last layer in U-Net, so randomness is only introduced in the highest resolution level of U-Net. Secondly, the randomness is only introduced by channel concatenation, thereby the final three 1 × 1 convolutional layers can choose to ignore the random information brought in from the latent space.
Due to the above two reasons, Christian et al. proposed a hierarchical variant of the probabilistic U-Net, called PHiSeg [
10]. The network structure of PHiSeg for two resolution levels is shown in
Figure 3b. Compared with the probabilistic U-Net shown in
Figure 3a, PHiSeg expands the structures of the prior net and the posterior net to the hierarchical multi-resolution structures. The latent space is also learned separately at each downsampled resolution level, and the random segmentation variants are generated from each latent space and are directly input to the likelihood network instead of being input to U-Net by channel concatenation. The likelihood network outputs multiple segmentation hypotheses, which are then resized to the same size of the ground truth and compared with the ground truth to calculate the loss values. Experimental results show that PHiSeg can generate segmentation hypotheses that closely match the ground-truth distribution.
3. The Proposed HPS-Net
PHiSeg achieved state-of-the-art results on solving the uncertainty of the plausible segmentation hypotheses. However, as another type of uncertainty, the uncertainty of segmentation performance has not drawn enough attention. To solve this problem, in this section, we present the hierarchical predictable segmentation network (HPS-Net), which consists of a new network structure, a new loss function, and a cooperative training mode.
3.1. Network Structure
HPS-Net is a multi-task network. The first task of HPS-Net is to generate realistic and diverse segmentation hypotheses, which is the same as PHiSeg and other related methods. The second task is to provide the measure predictions including precision, accuracy, the true positive rate (TPR), the true negative rate (TNR), or other measures. HPS-Net includes four sub-networks, namely, the posterior network, the prior network, the likelihood network, and the measure network. The posterior network, the prior network, and the likelihood network are for medical image segmentation, the structures of which are similar to PHiSeg [
10]. The measure network is for predicting the different measurement values.
Figure 4 illustrates the detailed network structure of the proposed HPS-Net for three latent levels and four resolution levels.
The inputs of the posterior network are the medical image and the corresponding ground-truth segmentation, while the input of the prior network is only the medical image. The goal of these two sub-networks is to learn the latent spaces corresponding to the posterior probability distribution and the prior probability distribution, respectively, and maximize the similarity of the learned latent spaces of these two sub-networks. From
Figure 4, we can see that the posterior network and the prior network are symmetrical. Such a symmetric network structure allows us to learn the reasonable latent spaces in a confrontational way. With the learned latent spaces, random segmentation variants can be generated, which are then to be input to the likelihood network. With the random segmentation variants as the input, the likelihood network is trained to generate diverse segmentation hypotheses at multiple resolution levels. It should be noted that the resolution levels need at least one more level than the latent levels, and whether each resolution level contains a latent level is optional. The example in
Figure 4 has a total of four resolution levels and three latent levels.
Figure 5 shows the example of five resolution levels and two latent levels. We obtained the best results in the experiments with seven resolution levels and five latent levels.
The final and the most important sub-network is the measure network, which enables HPS-Net to generate the predicted measurement values. The measure network takes the medical image and its corresponding segmentation hypotheses as the input. To fuse the information, the segmentation hypotheses at different levels are upsampled to the same size of the medical image and then fused with the medical image by channel concatenation. The resulting multi-channel image is then input into the backbone network. Here, we used the SE-Inception-V4 (the code and the details of SE-Inception-V4 can be found in
https://github.com/taki0112/SENet-Tensorflow, accessed on 27 October 2021) as the backbone network, which is a combination of Inception-v4 [
17] and the squeeze-and-excitation (SE) block [
18]. As a convolutional neural network, Inception-v4 has an excellent performance in the field of image recognition. The SE block can adaptively adjust the relationship between different channels by explicitly modeling interdependencies between channels. As the input of the measure network is a multi-channel image and the importance of each channel is quite different, applying the SE block to Inception-v4 can further improve the performance at a slight additional computational cost. After the combination, SE-Inception-V4 processes the image by convolution, pooling, residual connection, squeeze, excitation, and other operations, breaking down the image into features. The result of this process feeds into the final fully connected layer that drives the final predicted measurement value.
3.2. Loss Functions
In the posterior network, we used the Kullback–Leibler divergence to penalize the difference between the posterior distribution and the prior distribution. For the likelihood network, cross-entropy loss was used for each latent level to penalize the difference between the segmentation hypothesis and the ground truth of the segmentation (mask). Following the practice of PHiSeg [
10], if an input medical image had multiple corresponding masks, then one mask was randomly selected for computing. As all masks were plausible, random selection was therefore reasonable and efficient in this case.
We proposed measure loss to penalize the difference between the predicted measurement value and the ground truth of the measurement. The input of the measure loss includes the ground truth segmentations, the segmentation hypothesis output from each latent level, and the predicted measurement value. The measure loss is formulated as follows:
where
represents the measure loss,
l is the number of latent levels,
represents the sub-loss of the
ith latent level,
N is the batch size,
is the predicted measurement value of the
nth sample,
represents the upper bound of the ground truth measurement value of the
nth segmentation hypothesis,
represents the lower bound of the ground truth measurement value of the
nth segmentation hypothesis,
denotes the
nth segmentation hypothesis,
m is the number of ground truth,
denotes the
jth ground truth of the
nth segmentation hypothesis, and
calculates the measurement value, which can be TPR, TNR, precision, and others.
3.3. Training Procedure
This section describes our cooperative training mode. The training process mainly includes two parts—forward propagation and backward propagation. In forward propagation, each batch of the medical images is firstly input into the posterior network and the prior network. The outputs of the two sub-networks represent multiple multidimensional normal distributions, so as to fit their learned latent spaces respectively. Then, the Kullback–Leibler divergence of each level is calculated as the loss value based on the outputs of the two sub-networks to describe the difference between the learned latent spaces. With the learned latent space of the prior network, samples can be generated and then be input into the likelihood network to get the segmentation hypotheses. By comparing the output segmentation hypotheses and the ground-truth segmentations, the cross entropy losses are computed. After that, the segmentation hypotheses are input into the measure network to generate the predicted measurement value, where the measure loss is applied for calculating the difference between the predicted measurement value and the ground-truth measurement value. The ground-truth measurement value is initially unknown, but it can be calculated with the output segmentation hypotheses and the ground-truth segmentations.
In backward propagation, the gradients of the loss function with respect to the weights of the network are calculated, and this calculation process proceeds backward through the network from the final layer to the first layer. Specifically, the network parameters in the posterior network and the prior network are updated by the calculated Kullback–Leibler divergence of each level. The network parameters in the likelihood network are updated by the cross entropy losses, and the measure loss corresponds to the parameter update in the measure network. It should be noted that although there are connections between the likelihood network and the measure network, we limited the backpropagation starting from the measure loss to the range of the measurement network, so as to avoid interference with the prediction of the segmentation hypothesis. The model was trained using the Adam optimizer with a learning rate of and a batch-size of 12. All the input images and the ground-truth segmentations were resized to 128 × 128. Batch-normalization was applied on all non-output layers. All models were trained for 36 h on an NVIDIA TESLA V100 GPU. After a model is trained, the segmentation hypotheses and the corresponding measurement values for an input image can be generated by first inputting the image to the prior network, then obtaining the samples using the prior network and finally generating the segmentation hypotheses and the measurement values using the likelihood network and the measure network, respectively.
4. Experiments and Results
In the experiments, two aspects of HPS-Net were evaluated. The first aspect was to evaluate whether HPS-Net can perform as well as PHiSeg after introducing the measure network. The second aspect was to evaluate the performance of HSP-Net on predicting different measurement values.
Experiments were conducted on the LIDC-IDRI dataset [
11] and the ISIC2018 dataset [
12]. The LIDC-IDRI dataset contains 1018 thoracic CT images from 1010 subjects with lesions annotated by four experienced thoracic radiologists. The original in-plane resolution of the CT images was between 0.461 mm and 0.977 mm. In the experiments, all CT images were normalized to 0.5 mm × 0.5 mm resolution. Based on the lesion positions annotated by the radiologists, the lesion patches of size 128 × 128 pixels were cropped out. The processed dataset was then partitioned into the ratio of 60:20:20 for training, testing, and validation, respectively. The ISIC2018 dataset consists of 3694 skin lesion images, including a training set, a validation set, and a test set. The training set consists of 2594 images and 2594 corresponding ground-truth response masks. The validation set and the test set consist of 100 images and 1000 images, respectively. The initial size of images in ISIC2018 varied from 1024 × 768 to 512 × 384. In our experiments, these images were normalized to 512 × 512. To compare with PHiSeg [
10], we followed the setting of [
10] in all experiments, setting seven resolution levels for both PHiSeg and HPS-Net.
In the first experiment, we firstly trained the models with the masks from all four radiologists and then trained the models with only one mask from a single radiologist. When training with all masks, one mask was randomly selected per image in each batch. When training with one mask, only the mask from the same radiologist was used for the corresponding image in each batch. We used a conventional Dice score to gauge the similarity between an output segmentation hypothesis and a mask. To compare a set of segmentation hypotheses and a set of masks, namely, to compare two sets of segmentation distributions, we used the generalized energy distance [
9,
19]:
where
Y and
are independent masks from the ground-truth distribution
,
S and
are independent segmentations from the predicted distribution
, and
.
has been proved to be a metric [
20,
21].
is also a metric in this case if
is also a metric, which has been proved in [
22]. Like [
10], we also used the average normalized cross correlation (NCC) to evaluate the correlation between the predicted distribution and the ground-truth distribution from a visual perspective:
where
denotes cross entropy and
is the mean segmentation prediction.
Table 1 show the results of the first experiment, which compares the segmentation performance of different methods on the LIDC-IDRI dataset. From
Table 1, we can observe that, using all annotations, HPS-Net (L = 5) and PHiSeg (L = 5) performed better than prob. U-Net in terms of
and
. With single annotation, HPS-Net (L = 5) and PHiSeg (L = 5) outperformed prob. U-Net and det. U-Net in terms of
,
, and
. Additionally, we can observe that HPS-Net showed slightly better performance than PHiSeg with the same number of resolution levels in the cases of both single annotation and all annotations, but the performance difference was small. This is in line with our expectation, as we limited the backpropagation starting from the measure loss to the range of the measure network; the prediction performance of the likelihood network was therefore not affected by the measure network.
In the second experiment, we trained the HPS-Net models with different latent levels and different numbers of radiologists to evaluate the performance of HSP-Net on predicting different measurement values.
Table 2 shows the results of the second experiment, from which we could observe the following:
Compared with HPS-Net (L = 1) and HPS-Net (L = 3), HPS-Net (L = 5) had the lowest mean squared error (MSE) and the lowest standard deviation (std.) in most cases.
The predictions on TNR were much more accurate than the predictions on TPR and precision, where the MSE ± std. on TPR was lower than the MSE ± std. on precision.
The MSE ± std. on TNR was as low as 0.0001 ± 0.0062 (all annotation) and 0.0010 ± 0.0219 (single annotation), which is close to the requirements of practical application. However, the MSE ± std. on TPR and precision was still as high as 0.0938 ± 0.1080 and 0.1160 ± 0.1449 with single annotation and L = 5, which is far from practical application and still needs to be improved.
To further verify the performance of HPS-Net on image segmentation and measure prediction, an additional experiment was conducted on the ISIC2018 dataset [
12], where we trained the HPS-Net models with different latent levels and compared it with three benchmark methods, namely, det. U-Net, prob. U-Net, and PHiSeg. From
Table 3, we can observe that HPS-Net had the highest Dice score, which means it had the best segmentation performance. Similarly, the MSE ± std. of TNR was the lowest among all three measure predictions, while the MSE ± std. of precision was still the highest.