Abstract
Lesion detection in CT (computed tomography) scans is an important yet challenging task due to the low contrast of soft tissues and the similar appearance of lesions and background. Exploiting 3D context information has been studied extensively to improve detection accuracy. However, previous methods either use a 3D CNN, which usually requires a sliding-window strategy for inference and only acts on local patches, or simply concatenate the feature maps of independent 2D CNNs to obtain 3D context, which is less effective at capturing 3D knowledge. To address these issues, we design a hybrid detector that combines the benefits of both approaches. We propose to build several lightweight 3D CNN subnets that bridge the intermediate features of the 2D CNNs, so that the 2D CNNs are connected with each other and interchange 3D context information while feed-forwarding. Comprehensive experiments on the DeepLesion dataset show that our method combines 3D knowledge effectively and provides higher-quality backbone features. Our detector surpasses the current state-of-the-art by a large margin with comparable speed and GPU memory consumption.
1 Introduction
Lesion detection is an essential task for clinical applications such as computer-aided diagnosis. With the emergence of modern CNNs, object detection in 2D natural images has developed rapidly and achieves promising performance [1, 5,6,7]. However, it is still unclear how to adapt these algorithms to CT scans effectively. The main gap is how to efficiently incorporate 3D context information into these detectors. This problem has attracted much research attention [2, 4, 10], due to its importance for the success of lesion detection.
Current solutions fall into two categories. The first uses fully 3D CNNs, which can directly exploit 3D knowledge for detection and classification. However, due to the GPU memory limit, such networks can only be applied to small patches in a sliding-window fashion [4] or to small-patch candidates generated by a 2D detector [2], leading to high time complexity. They are also unable to make use of ImageNet pretraining, and therefore achieve inferior lesion detection accuracy, as reported in [10]. To alleviate the issues of 3D CNNs, other studies explore how to combine 2D CNN features from consecutive CT slices for classification and regression, so as to better utilize 3D context information. [10] followed R-FCN [1], which uses a Region Proposal Network (RPN) to predict suspicious regions and a Region Classification Network (RCN) to further classify and regress those regions. [10] proposed to concatenate backbone feature maps from neighboring CT slices as input to the RCN, in order to gather 3D information in the RCN subnet. Under this pipeline, the backbone network takes whole CT slices as input and can be trained end-to-end from ImageNet pretrained weights. However, the backbone networks are still independent 2D CNNs, and no 3D information is aggregated until the final backbone features are computed. Another problem is that the central CT slice and the contextual CT slices share the same 2D CNN weights, which may be suboptimal since we expect to distill different and complementary knowledge from those different slices.
We propose a hybrid detector that combines the advantages of fully 3D CNN detectors (strong knowledge of 3D context) [2, 4] and 2D feature-concatenation detectors (efficiency and the ability to use ImageNet pretrained weights) [10]. Similar to [10], we use 2D CNNs on CT slices at different axial locations as our backbone. However, as discussed above, this is suboptimal because these isolated 2D CNNs cannot extract and exploit 3D context information. To address this problem, we propose lightweight 3D CNN subnets, named 3D Fusion Modules (3DFMs), to bridge those 2D CNNs and allow information to flow between slices. These subnets connect the internal layers of the 2D CNNs, so that each 2D CNN can distill knowledge from its neighboring 2D CNNs, exploit 3D information, and focus on different knowledge. The main difference between [10] and our method is that in [10], 3D context information is not exploited in any layer before the RCN, and the RCN cannot fully utilize 3D context since its input features carry only high-level semantics without low-level details, and its very shallow structure is incapable of learning rich 3D information; in contrast, in our method, 3D information is exploited gradually throughout the backbone CNNs, and 3DFMs learn 3D information at low-level, mid-level and high-level layers. Our design breaks the isolation among the 2D CNNs and enables them to distill different knowledge from different input slices, so the backbone provides stronger features with richer 3D context encoded.
3DFMs introduce few parameters and little computation overhead, while greatly improving detection accuracy. Experiments on DeepLesion [11] show that our hybrid detector significantly improves the sensitivity at every false positive (FP) rate and on every lesion type. With 27 CT slices as input, the hybrid detector improves the average sensitivity by 1.4 points and the sensitivity at \(\frac{1}{8}\) FP per image by 2.7 points. Our method surpasses [10] and achieves a new state-of-the-art.
Fig. 1. Backbone of our hybrid lesion detector. Each row illustrates a different 2D CNN and its corresponding input image. The ground-truth boxes are labelled in the central image (with red boundary), and the other 3-channel images (with yellow boundary) serve as 3D context. The central conv5_3 feature (marked in green) is used in the RPN and the fused feature (marked in blue) is used in the RCN. Best viewed in color. (Color figure online)
Fig. 3. 3D Fusion Module. K is 5 in this example. See Subsect. 2.2 for details.
2 Approach
2.1 Overview of Our Detector Pipeline
The backbone of our detector is shown in Fig. 1. Following [10], to make use of ImageNet pretrained weights, we combine 3 adjacent CT slices into a 3-channel image, like a natural image, and feed it to VGG16 [9], which serves as the backbone 2D CNN of our detector. When considering more 3D context, we combine the context slices into additional 3-channel images and feed them to different VGG16 branches. Each VGG16 branch takes a 3-channel image as input and generates a conv5_3 feature map as output. The conv5_3 feature from the central slice (marked in green) is used in the Region Proposal Network (RPN) to generate proposals, and the concatenation of conv5_3 features from all slices (marked in blue) is used in the Region Classification Network (RCN) to classify and regress the proposals. However, unlike [10], where the 2D CNNs feed forward in isolation, we use a novel and efficient 3D Fusion Module (3DFM) to bridge internal features from different 2D CNNs and build a hybrid backbone. The hybrid backbone better exploits 3D context and lets different 2D CNNs learn different patterns, while still utilizing ImageNet pretrained weights. Details of the 3DFM are discussed in Subsect. 2.2.
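To make the slice grouping concrete, the following is a minimal NumPy sketch of how 3K adjacent axial slices around a key slice could be packed into K 3-channel images, one per VGG16 branch. The function name `group_slices` and the index-clipping behavior at volume boundaries are our own assumptions; the text above does not specify these details.

```python
import numpy as np

def group_slices(volume, center, num_images):
    """Pack 3 * num_images adjacent slices around `center` into
    num_images 3-channel images of shape (H, W, 3), one per VGG16 branch.

    volume: (S, H, W) array of axial CT slices.
    """
    half = (3 * num_images) // 2
    # Clip indices at the volume boundary (assumed behavior, not stated in the paper).
    idx = np.clip(np.arange(center - half, center + half + 1), 0, len(volume) - 1)
    stack = volume[idx]                                   # (3 * num_images, H, W)
    return [np.moveaxis(stack[3 * i:3 * i + 3], 0, -1)    # (H, W, 3) per image
            for i in range(num_images)]
```

For example, with `num_images=9` this groups 27 slices into 9 three-channel images, matching the largest input configuration used in the experiments.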
Given the backbone of our hybrid lesion detector, we follow [7] and employ an RPN and an RCN to generate and classify proposals. As Fig. 2 shows, we use the conv5_3 feature of the central branch (marked in green) to generate proposals, and use ROIAlign [3] to extract per-proposal features from the concatenated features of the different branches (marked in blue). Finally, these features are used to classify and regress the proposals and produce the lesion detection results.
2.2 3D Fusion Module
3D context has been shown to be extremely important for detecting objects in CT scan images [2, 4, 10]. However, existing methods for utilizing 3D information are either memory-expensive and only able to process small 3D patches, or inefficient, naively concatenating features from different slices. In this paper, we propose an efficient and computationally cheap 3D Fusion Module (3DFM), shown in Fig. 3, to combine 3D context information within the backbone 2D CNNs.
The 3DFM takes internal features (\(A_{i}\in \mathbb {R}^{C\times H\times W}\), where C, H and W are the channel, height and width of the feature map) from the backbone CNNs as inputs, as shown in the first column of Fig. 3. Given K input images, there are K intermediate features, each generated from a 3-channel CT image as shown in Fig. 1. We first concatenate them into a 4D tensor \(\mathbf {A}\in \mathbb {R}^{K\times C\times H\times W}\) and transpose it so that the channel becomes the first dimension (\(\mathbf {B}\in \mathbb {R}^{C\times K\times H\times W}\)), as shown in the second column of Fig. 3. A 3D convolution is then used to gather 3D context information and generate a 3D-fused feature map \(\mathbf {C}\in \mathbb {R}^{C\times K\times H\times W}\). The kernel size is \(3\times 1\times 1\) along the K, H and W dimensions, so the convolution exploits context along the axial direction by convolving across neighboring slices. We use \(3\times 1\times 1\) instead of \(3\times 3\times 3\) because the context along the other two directions is already captured by the 2D convolutions in the backbone CNN, and we thus only need to consider the axial direction, reducing computation and memory overhead. Finally, \(\mathbf {C}\) is transposed back to \(K\times C\times H\times W\) as \(\mathbf {D}\), and the sum of \(\mathbf {A}\) and \(\mathbf {D}\) (denoted \(\mathbf {E}\)) is split into K feature maps with shape \(C\times H\times W\), which are passed back to the backbone 2D CNNs for further processing.
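Below is a minimal tf.keras sketch of the 3DFM just described (the paper states a TensorFlow implementation, but this particular layer and its channels-last shape handling are our own reconstruction, not the authors' code). With channels-last tensors, the explicit transposes of \(\mathbf{A}\) and \(\mathbf{C}\) are absorbed into the Conv3D's data layout.

```python
import tensorflow as tf

class FusionModule3D(tf.keras.layers.Layer):
    """Fuses K per-slice feature maps along the axial (slice) dimension."""

    def __init__(self, channels, **kwargs):
        super().__init__(**kwargs)
        # 3x1x1 kernel: convolve only across neighboring slices (the K axis);
        # spatial context (H, W) is already handled by the 2D backbone convs.
        self.conv3d = tf.keras.layers.Conv3D(
            filters=channels, kernel_size=(3, 1, 1), padding="same")

    def call(self, slice_features):
        # slice_features: list of K tensors, each (N, H, W, C).
        a = tf.stack(slice_features, axis=1)   # A: (N, K, H, W, C)
        d = self.conv3d(a)                     # 3D-fused features D
        e = a + d                              # skip connection: E = A + D
        return tf.unstack(e, axis=1)           # back to K tensors of (N, H, W, C)
```

For instance, `FusionModule3D(channels=256)` applied to the K pool3 feature maps (256 channels in VGG16) returns K fused maps of the same shape, which then continue through their respective branches.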
The 3DFM is flexible and can be inserted anywhere in the backbone CNNs to fuse 3D information. In our detector, we insert 3DFMs in a sparse manner: only at the pool3 and conv4_3 layers of VGG16, as in Fig. 1. These 3DFMs combine the independent 2D VGG16 branches into a sparsely bridged 3D CNN, which serves as the backbone of our detector. Extensive experiments show that this design is lightweight and incurs very little computation and memory overhead, while effectively exploiting 3D context knowledge and significantly improving accuracy.
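As an illustration of this sparse bridging, the sketch below wires two `FusionModule3D` layers (from the previous listing) between three hypothetical per-branch VGG16 stages: up to pool3, pool3 to conv4_3, and conv4_3 to conv5_3. How the VGG16 stages are split into callables and whether their weights are shared across branches are assumptions made for this sketch only.

```python
def hybrid_backbone(images, stage_to_pool3, stage_to_conv4_3, stage_to_conv5_3,
                    fuse_at_pool3, fuse_at_conv4_3):
    # images: list of K 3-channel inputs, each (N, H, W, 3).
    feats = [stage_to_pool3(x) for x in images]    # per-branch 2D features
    feats = fuse_at_pool3(feats)                   # first 3DFM bridges the branches
    feats = [stage_to_conv4_3(f) for f in feats]
    feats = fuse_at_conv4_3(feats)                 # second 3DFM
    return [stage_to_conv5_3(f) for f in feats]    # K conv5_3 feature maps
```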
3 Experiments
3.1 Implementation Details
Our hybrid detector is implemented in TensorFlow. We use VGG16 as our backbone CNN and remove the pool4 layer to keep the output resolution at \(\frac{1}{8}\) of the input image. We adopt the same CT image preprocessing as [10]: rescaling the CT intensities to 0–255, resizing the images, and clipping the black borders. We use horizontal-flip data augmentation, which is common in object detection. For each sample, we take 3, 9, 15, 21 or 27 adjacent CT slices to generate 1, 3, 5, 7 or 9 three-channel input images, to evaluate the efficacy of the hybrid detector at different levels of 3D context richness and to make a fair comparison with the state-of-the-art 3DCE [10]. For training, we use a batch size of 2 and train the hybrid detector for 120k iterations. The initial learning rate is \(10^{-3}\) and is reduced to \(10^{-4}\) after the first 90k iterations. We use the official train/test split for training and for reporting accuracy. Comprehensive experiments and ablation studies are reported in the following subsections.
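As a small illustration of the stated schedule (\(10^{-3}\) for the first 90k iterations, then \(10^{-4}\)), it can be expressed in tf.keras as below; the choice of SGD with momentum 0.9 is our assumption and is not specified in the text.

```python
import tensorflow as tf

# Learning rate: 1e-3 for the first 90k iterations, then 1e-4 for the remaining 30k.
lr_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[90_000], values=[1e-3, 1e-4])
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)
```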
3.2 Experimental Results
To evaluate the efficacy of our method, we conduct extensive experiments on DeepLesion [11]. Following the metric used in the LUNA challenge [8], we compute the sensitivity at 7 pre-defined false positive (FP) per image rates: \(\frac{1}{8}\), \(\frac{1}{4}\), \(\frac{1}{2}\), 1, 2, 4 and 8 FPs per sample, as well as the average sensitivity over these 7 rates. We also compute the sensitivity at 16 FPs per image, to compare with 3DCE [10]. For all our baselines and hybrid detectors, we train and evaluate four times and report the average performance, to reduce the randomness caused by initialization and training data shuffling.
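For clarity, the sketch below computes these sensitivities from already-matched detections (NumPy); the matching rule that decides which detections count as true positives, and the helper name, are our assumptions and are not the official evaluation code.

```python
import numpy as np

def froc_sensitivities(scores, is_tp, num_lesions, num_images,
                       fp_rates=(0.125, 0.25, 0.5, 1, 2, 4, 8, 16)):
    """Sensitivity at given false-positive-per-image rates.

    scores: (M,) confidence of every detection over the test set.
    is_tp:  (M,) bool, whether each detection matches a ground-truth lesion.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    hits = np.asarray(is_tp, dtype=bool)[order]
    tp = np.cumsum(hits)             # cumulative true positives by descending score
    fp = np.cumsum(~hits)            # cumulative false positives
    sens = []
    for rate in fp_rates:
        # last operating point with at most `rate` false positives per image
        i = np.searchsorted(fp, rate * num_images, side="right") - 1
        sens.append(tp[i] / num_lesions if i >= 0 else 0.0)
    return sens
```

The average sensitivity reported in the tables is then simply the mean of the values at the first seven rates.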
The results on the official test set are shown in Table 1. We also compare with 3DCE, which is the current state-of-the-art and already surpasses fully 3D detectors. 'Baseline' in the table is a Faster R-CNN [7] based detector with feature concatenation after the backbone CNN, and 'Hybrid' is the baseline equipped with the 3DFMs illustrated in Fig. 3. We also plot the free-response receiver operating characteristic (FROC) curves for our baseline and hybrid detectors in Fig. 4. From the table and figure, we find that our hybrid detector with 3DFMs is very effective in improving detection quality. With 27 slices as input, the sensitivity increases significantly at all FP rates, especially in the high-precision regime (i.e., fewer FPs per image). Our hybrid detector greatly surpasses 3DCE with the same train/test split and achieves a new state-of-the-art.
3.3 Ablation Studies
Inference Speed and Memory Overhead. Our 3D Fusion Modules (3DFMs) efficiently combine 3D context information in the backbone. To quantitatively evaluate the computation and memory overhead, we run all our baselines and detectors on a machine with a single NVIDIA Titan Xp GPU. We report the total runtime on the official test set (4817 samples) and the maximum GPU memory consumed during inference. Results are shown in Table 2. Our 3DFMs introduce very small computation overhead and negligible GPU memory overhead. This verifies the efficiency of our method, which may thus be applied to more complex datasets.
Architecture of 3DFM. In this subsection, we compare our 3D Fusion Module with other potential architectures for combining 3D context information:
- Without Skip Connection: the 3D context bridging module is the same as the 3DFM (see Fig. 3), but lacks the skip connection that combines the original backbone features with the 3D-fused features.
- Without 3D Conv: the 3D context bridging module concatenates the K backbone features of size \(C\times H\times W\) into a thicker tensor of size \(KC\times H\times W\) and applies a \(1\times 1\) 2D convolution to fuse information from different slices (a minimal sketch follows this list).
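The following is a minimal sketch of this second variant, for contrast with the 3DFM listing in Subsect. 2.2 (again tf.keras, with our own layer name); how the fused map is routed back to the K branches is not specified above, so only the fusion step is shown.

```python
import tensorflow as tf

class ConcatFusion2D(tf.keras.layers.Layer):
    """Ablation variant: channel concatenation followed by a 1x1 2D conv (no 3D conv)."""

    def __init__(self, channels, **kwargs):
        super().__init__(**kwargs)
        self.conv1x1 = tf.keras.layers.Conv2D(filters=channels, kernel_size=1)

    def call(self, slice_features):
        # slice_features: list of K tensors, each (N, H, W, C).
        stacked = tf.concat(slice_features, axis=-1)   # (N, H, W, K*C)
        return self.conv1x1(stacked)                   # fused map (N, H, W, C)
```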
All experiments are conducted with 27-slice inputs, and results are shown in Table 3. Both architectures described above achieve inferior performance: without the skip connection, the detector has lower sensitivities at high FP levels even compared with our baseline detector; and using a 2D convolution on a concatenated feature map leads to inferior sensitivities at all FP levels.
Number of 3DFMs. 3DFMs bridge 3D context information across the 2D CNN backbones and can be inserted anywhere in the 2D CNNs. In our detector, we insert 3DFMs at the pool3 and conv4_3 layers of VGG16, as in Fig. 1. We also conduct diagnostic experiments by (1) inserting a 3DFM only at the conv4_3 layer and (2) inserting 3DFMs at the pool2, pool3 and conv4_3 layers. Results are shown as '3DFM@4' and '3DFM@234' in Table 3. Compared with '3DFM@4', adding another 3DFM at pool3 significantly improves the performance from 70.62 to 71.04. However, adding an extra 3DFM at pool2 gives only a marginal gain. For simplicity, we use only two 3DFMs in our final detector.
3.4 Analysis on Different Lesion Types
We test our hybrid detector on the different lesion types in DeepLesion [11]. Eight types of lesion are labelled in the dataset (abbreviations in parentheses): bone (BN), abdomen (AB), mediastinum (ME), liver (LV), lung (LU), kidney (KD), soft tissue (ST) and pelvis (PV). In Table 4, we evaluate the sensitivities of our baseline detector and of our hybrid detector equipped with 3DFMs, at 4 FPs per image (27 slices). The results further confirm that our hybrid detector improves detection quality for all 8 lesion types, showing that it is very general and yields consistent gains. We also show qualitative results in Fig. 5, where the baseline detector fails to detect the lesions, but the 3DFM-equipped hybrid detector finds them with scores greater than 0.9 at the 4 FP per image threshold. We observe that our detector is able to find difficult lesions such as small or low-contrast ones.
4 Conclusions
We propose a hybrid detector that bridges 3D context information in 2D CNN backbones. Starting from a baseline detector that processes adjacent CT slices independently with the same 2D CNN, we enhance the backbone feature quality by fusing 3D context knowledge via 3DFMs. Extensive experiments show the efficacy of our hybrid detector, which improves the sensitivity at all false positive levels. The improvement is consistent across different settings (e.g., number of input slices and lesion types). Qualitative analysis also suggests that our method outperforms the baseline and remains effective even for some extremely difficult cases. Our approach surpasses existing methods and establishes a new state-of-the-art. The superior performance demonstrates its potential for various clinical applications.
References
1. Dai, J., Li, Y., He, K., Sun, J.: R-FCN: object detection via region-based fully convolutional networks. In: NIPS, pp. 379–387 (2016)
2. Ding, J., Li, A., Hu, Z., Wang, L.: Accurate pulmonary nodule detection in computed tomography images using deep convolutional neural networks. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) MICCAI 2017. LNCS, vol. 10435, pp. 559–567. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66179-7_64
3. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV, pp. 2961–2969 (2017)
4. Liao, F., Liang, M., Li, Z., Hu, X., Song, S.: Evaluate the malignancy of pulmonary nodules using the 3D deep leaky noisy-or network. arXiv preprint arXiv:1711.08324 (2017)
5. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
6. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: CVPR, pp. 779–788 (2016)
7. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS, pp. 91–99 (2015)
8. Setio, A.A.A., et al.: Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the LUNA16 challenge. Med. Image Anal. 42, 1–13 (2017)
9. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
10. Yan, K., Bagheri, M., Summers, R.M.: 3D context enhanced region-based convolutional neural network for end-to-end lesion detection. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11070, pp. 511–519. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00928-1_58
11. Yan, K., et al.: Deep lesion graphs in the wild: relationship learning and organization of significant radiology image findings in a diverse large-scale lesion database. In: CVPR, pp. 9261–9270 (2018)
Acknowledgements
This work was supported by the Lustgarten Foundation for Pancreatic Cancer Research and NSFC No. 61672336.