Deep Lesion Tracker: Monitoring Lesions in 4D Longitudinal Imaging Studies
Jinzheng Cai¹, Youbao Tang¹, Ke Yan¹, Adam P. Harrison¹, Jing Xiao², Gigin Lin³, Le Lu¹
¹PAII Inc.  ²Ping An Technology  ³Chang Gung Memorial Hospital
across different images captured at different time points and contrast phases by using both appearance and anatomical signals. In Fig. 1, we show two real-life examples of lesion tracking.

Similar to visual tracking in general computer vision, lesion tracking can be viewed as matching instances of the same lesion in neighboring time frames. However, it is challenging due to changes in size and appearance. A lesion can grow to several times its baseline or nadir size. Meanwhile, its appearance varies across follow-up exams because of morphological or functional changes, commonly attributed to necrosis or changes in vascularity. Therefore, an effective tracker should handle both size and visual changes of lesions. Trackers based on image registration [1, 47] are robust to appearance changes, as registration inherently introduces anatomical constraints for lesion matching: the body part and organs surrounding the target lesion are constrained to correspond across images. However, registration algorithms [22, 23, 36, 37, 5] are usually less sensitive to local image changes; thus, they can be inaccurate when tracking small lesions or lesions with large shape changes. On the other hand, appearance-based trackers [40, 14] handle size and appearance changes by projecting lesion images into an embedding space [62, 59], where images of the same lesion have similar embeddings and images of different lesions lie apart from one another. However, these appearance-based trackers may mismatch lesions with visually similar but spurious backgrounds. Therefore, to combine the merits of both strategies, we design our tracker to conduct appearance-based recognition under anatomical constraints.

Because the proposed deep lesion tracker (DLT) is a deep learning model, providing enough training data is a prerequisite for good performance. To this end, we construct a dataset with 3891 lesion pairs, collected from DeepLesion [61], to train and evaluate different tracking solutions. We publicly release this dataset to facilitate related research¹. Although more training pairs can promote a stronger tracker, labor and time costs preclude easily collecting and annotating a large number of longitudinal studies for a specific clinical application. Therefore, we also introduce an effective self-supervised learning (SSL) strategy to train trackers. Importantly, this strategy can train lesion trackers using images from only one time point, meaning that non-longitudinal datasets, which are more readily collected, can also be used. This allows many more lesion instances with varied appearances and sizes to be introduced.

With the proposed DLT and model training strategies, we achieve 89% matching accuracy on a test set of 480 lesion pairs. Meanwhile, we demonstrate that DLT is robust to inaccurate tracking initializations, i.e., the given initial lesion center. In our robustness study, inaccurate initialization causes 10% accuracy drops on SiamRPN++ [29] and DEEDS [22]. In contrast, the accuracy of DLT only decreases by 1.9%. We then apply DLT to an external testing set of 100 real-life clinical longitudinal studies, delivering 88% matching accuracy and demonstrating excellent generalizability. Finally, we plug DLT into a lesion monitoring pipeline to simulate automatic treatment monitoring. The workflow assesses lesion treatment responses with 85% accuracy, which is only 0.46% lower than the accuracy achieved with manual inputs.

¹ https://github.com/JimmyCai91/DLT

2. Related Work

Visual object tracking is an active research topic in general computer vision [8, 35, 18, 54, 53, 6, 39, 51, 52]. We focus our review on recent progress, especially deep learning based approaches.

Tracking as Similarity Learning. Tracking of target objects can be achieved via similarity comparisons between the object template and proposals from the search domain. Similarities are measured by color/intensity representations [19], spatial configurations [63, 33], or combinations of the two [6]. Recently, deep learning features have become widely used for visual tracking [53, 39, 18, 54], as they outperform hand-crafted features with more expressive representations. To efficiently extract and compare deep learning features, SiamFC [7] and CFNet [52] use a cross-correlation layer at the end of Siamese architectures [9]. This cross-correlation layer uses Siamese feature maps extracted from the template image patch as a kernel and performs fully circular convolution over the corresponding Siamese feature maps of the search image. This procedure encodes information about the relative position of the target object inside the search image. Within the same framework as SiamFC, SiamRPN++ [29] introduced strategies that allow Siamese networks to be trained with modern very deep networks, e.g., the dense convolutional network (DenseNet) [25], to further boost tracking accuracy. This is critical for medical image analysis, as many medical applications lack large-scale training data and rely on transfer learning of pre-trained networks for good performance [46].

Siamese networks have also been investigated in medical image analysis. Gomariz et al. [14] applied 2D Siamese networks to track liver landmarks in ultrasound videos. Liu et al. [32] extended similar 2D Siamese networks in a coarse-to-fine fashion. While Rafael-Palou et al. [40] applied 3D Siamese networks to CT series, only shallow network architectures were evaluated, for tracking lung nodules. In contrast, we follow SiamRPN++ [29] in using Siamese networks, equip them with 3D DenseNet backbones, and apply them to universal lesion tracking in whole-body CT images. Processing different types of lesions with a unified deep learning model [45, 49, 61, 62, 59, 60, 10, 12] demonstrates computational efficiency and could also alleviate model overfitting.
Different from prior formulations of Siamese networks, we propose a simple but effective 3D kernel decomposition to speed up 3D cross-correlation operations for object matching. This provides dramatic boosts in efficiency, reducing the FLOPs of our fast cross-correlation (FCC) layer by over 65%.

Tracking as Detector Learning. Tracking as detector learning relies on developing discriminative models to separate the target from background regions [2, 3, 24, 19]. A discriminative model suitable for visual tracking should consist of two core components: a classifier that can be efficiently updated online during tracking [2, 3, 24], and a powerful feature representation, e.g., features extracted by convolutional neural networks (CNNs) [28, 25], that lets the classifier easily differentiate objects in the feature space. Following this strategy, SO-DLT [54], FCNT [53], and MDNet [39] all train CNNs offline on large-scale object recognition tasks so that the learnt feature representation generalizes across visual objects. During tracking, they freeze the lower layers of the network as a feature extractor and update the higher layers to adapt to the specific video domain.

In this work, we consider the strategy of tracking via detector learning and accordingly construct strong lesion tracking baselines. Given the specialty of processing medical data, especially 4D CT images (3D image plus time), there are no baseline methods ready for comparison. Thus, we construct our own lesion tracking baselines by concatenating state-of-the-art lesion detection models [12, 58] with deep learning feature extractors [62, 59]. However, trackers developed with this strategy can be sub-optimal, since the detection models' feature extractors are developed from independent offline tasks. In contrast, our proposed DLT unifies the tasks of feature extraction and target object localization in an end-to-end structure and outperforms these detector learning baselines with higher accuracy and faster speed.

Tracking Priors from Image Registration. Visual tracking in video follows a prior of spatial consistency, meaning the search space in the next video frame can be constrained to be near the current location. This prior helps improve tracking efficiency and makes the model robust to background distractors [51, 7, 14]. Similarly, lesion tracking in CT should follow a spatial consistency governed by anatomical considerations, which implies that the organs and structures surrounding a lesion will not drastically change. Under such constraints, image registration approaches [22, 23, 36, 37, 5] can perform lesion tracking via image alignment. Specifically, registration algorithms are designed to optimize global structural alignment, i.e., to accurately align the boundaries of large organs, while being robust to local changes. Nonetheless, although reported results suggest that registration algorithms are useful for aligning large-sized lesions [47, 41, 64], they can fail to track small-sized lesions and struggle whenever there are local changes in the lesion's appearance.

In this work, we improve upon the capabilities of registration approaches by using deep learning based lesion appearance recognition to match lesions based on both visual and anatomical signals. Specifically, we first roughly initialize the location of a target lesion using image registration, i.e., affine registration [36]. Then, our DLT deep learning model refines the location to the lesion center using appearance-based cues. In contrast with approaches that use spatial and structural priors only in pre- [51, 7] or post-processing [14], DLT takes them as inputs and propagates them together with the CT-based visual signal to generate the final target location. The priors also function as attention guidance, letting the appearance learning focus on vital image regions.

3. Deep Lesion Tracker

We build DLT on the structure of Siamese networks because they are efficient and deliver state-of-the-art visual tracking performance for many computer vision tasks. The core component of Siamese-based tracking is a correlation filter, also known as a cross-correlation layer. It uses Siamese features extracted from the template image patch as a kernel to perform explicit convolutional scanning over the entire extent of the search image feature maps. Fig. 2 shows the overall configuration. Our goal is to apply the proposed model to three-dimensional medical data, i.e., CT images. Therefore, we create network backbones in 3D and introduce an anatomy signal encoder (ASE) to guide lesion tracking with anatomical constraints. To avoid the prohibitive computational expense of 3D cross-correlation between the template and the search image, we introduce a simple but effective formulation to speed up this procedure.

Problem definition. We use It and Is to respectively denote a template and a search CT image. In It, a lesion is known with its center µt and radius rt. Given It, Is, µt, and rt, the task of lesion tracking is to locate the same lesion in Is by predicting its new center µs.

3.1. Image Encoder: 3D DenseFPN

In lesion tracking, the Siamese network needs to process lesions with varied appearances and sizes in 3D images. As shown in Fig. 3, we use a deep 3D image encoder with large model capacity so that it can learn effective feature representations. Specifically, we transform DenseNet into 3D by duplicating its 2D convolutional kernels along the third dimension and then downscaling the weight values by the number of duplications [13]. This configuration is found to be more effective than 3D UNet [17] for modeling universal lesion appearances [12]. We then add a feature pyramid network (FPN) [30] after the 3D DenseNet to generate visual features at three scales.
Figure 2. The configuration of our proposed deep lesion tracker.

We visually depict the detailed configuration of 3D DenseFPN in Fig. 3. For clarity, we use ψ1, ψ2, and ψ3 to refer to the image mapping functions that generate feature maps from the largest to the smallest resolution, respectively.

3.2. Anatomy Signal Encoder and Its Inputs

We observe that directly implementing lesion tracking with Siamese networks can produce matches with visually similar but spurious regions. In contrast, affine registration [36] is a robust approach to roughly align CT images. It is achieved by solving

$$T_{\mathrm{Aff}} = \arg\min_{T_{\mathrm{Aff}} \in \mathcal{A}} \lVert T_{\mathrm{Aff}}(I_t) - I_s \rVert_1, \quad (1)$$

where A is the space of affine transforms. The projected location of the template lesion, TAff(µt), is usually close to the actual target lesion. While prior art has used affine registration as pre- [51, 7] or post-processing [14], these approaches provide no mechanism for incorporating registration into a tracking pipeline that cross-correlates template features across the entire extent of the search image. For example, pre-registering has minimal effect on the translation-invariant cross-correlation. Instead, as shown in Fig. 2, we encode anatomy signals as Gaussian heatmaps centered at lesion locations:

$$G(\mu, nr) = \exp\left(-\frac{\sum_{i \in \{x,y,z\}} (i - \mu^i)^2}{2(nr)^2}\right), \quad (2)$$

where we find that n = 4 delivers the best performance. For It we simply use the template lesion location and size: G(µt, nrt). For Is we use the affine-projected location and size of the template lesion: G(TAff(µt), nTAff(rt)). For clarity, we refer to the template and search anatomy signal maps as Gt and Gs, respectively. We solve Eq. 1 using SimpleElastix [36].
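To make the anatomy signal concrete, here is a minimal numpy sketch of the Eq. 2 heatmap on a (z, y, x) voxel grid. The helper name make_heatmap and the example shapes and centers are ours for illustration only; in the actual pipeline the search-image center would come from the affine projection TAff(µt) produced by SimpleElastix, which we stand in for here with a fixed offset.

```python
import numpy as np

def make_heatmap(shape, center, radius, n=4):
    """Gaussian anatomy signal G(mu, n*r) of Eq. 2 on a (z, y, x) voxel grid."""
    zz, yy, xx = np.meshgrid(
        np.arange(shape[0]), np.arange(shape[1]), np.arange(shape[2]),
        indexing="ij",
    )
    sq_dist = ((zz - center[0]) ** 2 + (yy - center[1]) ** 2
               + (xx - center[2]) ** 2)
    return np.exp(-sq_dist / (2.0 * (n * radius) ** 2))

# Template signal, centered at the annotated lesion center mu_t.
G_t = make_heatmap((32, 384, 384), center=(16, 192, 200), radius=6.0)

# Search signal, centered at the affine-projected center T_Aff(mu_t);
# a fixed offset stands in for the SimpleElastix projection here.
G_s = make_heatmap((32, 384, 384), center=(15, 188, 210), radius=6.0)
```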
Figure 3. Network configurations of the proposed image encoder 3D DenseFPN and anatomy signal encoder (ASE).

Fig. 3 depicts the network configuration of the proposed ASE. It encodes the anatomical signals into high-dimensional anatomical features at three different resolutions. In correspondence with 3D DenseFPN, we denote the network functions for the three scales as φ1, φ2, and φ3, from the largest to the smallest, respectively.

3.3. Fast Cross-Correlation

As mentioned, correlation is the core operation of Siamese-based tracking, creating a correspondence map between target and search features, ψ(It) and φ(Gt), respectively. Because we perform the same operation at each scale, we drop the scale subscripts here for simplicity. To conduct cross-correlation, we first fuse image and anatomy features. For example, to fuse ψ(It) and φ(Gt) we use

$$F = \psi(I_t) \odot \phi(G_t), \quad (3)$$

where ⊙ is element-wise multiplication and we constrain φ(Gt) to have the same shape as ψ(It). We observe from experiments that fusing ψ(It) and φ(Gt) with ⊙ performs better than channel-wise concatenation. Next, we define a cropping function to extract a 3×3×3 template kernel,

$$K = \mathcal{C}(F, \mu_t, (3, 3, 3)), \quad (4)$$

where the kernel is centered at µt after any potential feature downscaling. To better encode the global image context, we also extract a larger kernel, Kg = C(F, µt, (7, 11, 11)). Here we limit its size in the z-direction to 7 because the size of It during model training is only (32, 384, 384).

Following the traditional cross-correlation operation [7], we define the correspondence map as

$$M = (K \star S) + (K_g \star S), \quad (5)$$
where S = ψ(Is) ⊙ φ(Gs) and + is the element-wise sum. Unfortunately, direct use of Kg introduces a heavy computational load. We propose to decompose Kg along the axial, coronal, and sagittal directions, obtaining the flattened kernels Kg,z ∈ R^(1,11,11), Kg,x ∈ R^(7,1,11), and Kg,y ∈ R^(7,11,1), where we omit the batch dimension for clarity. As Fig. 2 demonstrates, the proposed FCC layer performs the flattening using learned 3D convolutions configured to produce an output of identical size to the kernel, except with one dimension flattened. The resulting faster version of Eq. 5 is

$$M = (K \star S) + \sum_{i \in \{x,y,z\}} (K_{g,i} \star S). \quad (6)$$

We also tested kernel decomposition by simply extracting the middle “slices” of Kg along the three dimensions, but it did not perform as well as the learned flattening operations.
Ŷ = σ(W T (M1 + U2 + U3 ) + b), (7) Since our proposed DLT is built upon Siamese pair-
where σ(·) is the Sigmoid function, W and b are parameters wise comparison, it inherently supports learning with self-
of the final fully convolutional layer, U2 is M2 up-scaled by supervision. The key insight is that effective visual rep-
(1, 2, 2), and U3 is M3 up-scaled by (1, 4, 4). The predicted resentation for object recognition can be learned by com-
lesion center µp is the index of the global maximum in Ŷ . paring the template image, It , with its augmented counter-
parts. With It , we implement data augmentations including
4. Supervised and Self-Supervised Learning (1) elastic deformations at random scales ranging from 0
to 0.25, (2) rotations in the xy-plane with a random angle
DLT is capable of both supervised and self-supervised ranging from -10 to 10 degrees, (3) random scales ranging
learning (SSL). It is flexible enough to learn from paired from 0.75 to 1.25, (4) random crops, (5) add Gaussian noise
annotations, when enough are available, and to also use ef- with zero mean and a random variance ranging from 0 to
ficient self-supervised learning. 0.05, and (6) Gaussian blurring with a random sigma rang-
ing from 0.5 to 1.5 [26]. Each augmentation individually
4.1. Supervised Learning
takes place with the probability of 0.5. For clarity, we define
Based on the introduced network architecture, Ŷ , the Taug as any combination of the data augmentations. There-
output of DLT is a dense probability map representing the fore, each self-supervised image “pair” comprises It and
likelihood of each location to be the target lesion center. Taug (It ) with corresponding anatomical signals of Gt and
Therefore, we define the ground truth as a Gaussian kernel Taug (Gt ). The same training procedure as supervised learn-
centered at the target location µs . Formally, we first define ing can then be followed. It is worth mentioning that our
15163
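The center augmentation reduces to sampling a displacement inside the sphere ‖∆µt‖₂ ≤ 0.25 rt; a small sketch using rejection sampling, which is our own implementation choice:

```python
import numpy as np

def jitter_center(mu_t, r_t, max_frac=0.25, rng=None):
    """Shift the template center uniformly within a sphere of radius 0.25*r_t."""
    rng = rng or np.random.default_rng()
    bound = max_frac * r_t
    while True:  # rejection-sample a point inside the sphere
        delta = rng.uniform(-bound, bound, size=3)
        if np.linalg.norm(delta) <= bound:
            return np.asarray(mu_t, dtype=float) + delta
```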
4.2. Self-Supervised Learning

Since our proposed DLT is built upon Siamese pairwise comparison, it inherently supports learning with self-supervision. The key insight is that an effective visual representation for object recognition can be learned by comparing the template image, It, with its augmented counterparts. Given It, we implement data augmentations including (1) elastic deformations at random scales ranging from 0 to 0.25, (2) rotations in the xy-plane by a random angle between -10 and 10 degrees, (3) random scalings between 0.75 and 1.25, (4) random crops, (5) additive Gaussian noise with zero mean and a random variance between 0 and 0.05, and (6) Gaussian blurring with a random sigma between 0.5 and 1.5 [26]. Each augmentation is applied independently with probability 0.5. For clarity, we define Taug as any combination of these data augmentations. Each self-supervised image “pair” then comprises It and Taug(It), with the corresponding anatomical signals Gt and Taug(Gt). The same training procedure as in supervised learning can then be followed. It is worth mentioning that our SSL strategy shares a similar spirit with recent contrastive learning studies that match an image with its transformed version [16], but at the pixel level.

We select non-longitudinal images from DeepLesion [61] and use the bounding box annotations as µt and rt. When bounding box annotations are not available, template lesions could be extracted by applying a pre-trained universal lesion detector to It and randomly selecting top-scoring proposals; however, we do not explore that here.
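A sketch of the self-supervised pair construction. Each augmentation fires independently with probability 0.5, and spatial transforms must be shared between the image and its anatomy signal so that (Taug(It), Taug(Gt)) stays geometrically consistent. For brevity we illustrate only three of the six augmentations (rotation, noise, blur) with scipy, as a stand-in for the full pipeline of [26].

```python
import numpy as np
from scipy.ndimage import gaussian_filter, rotate

def make_ssl_pair(image, heatmap, rng=None):
    """Build (T_aug(I_t), T_aug(G_t)) from a single time point."""
    rng = rng or np.random.default_rng()
    img, hm = image.copy(), heatmap.copy()
    if rng.random() < 0.5:  # in-plane rotation, shared with the anatomy signal
        angle = rng.uniform(-10.0, 10.0)
        img = rotate(img, angle, axes=(1, 2), reshape=False, order=1)
        hm = rotate(hm, angle, axes=(1, 2), reshape=False, order=1)
    if rng.random() < 0.5:  # additive Gaussian noise, intensity only
        img = img + rng.normal(0.0, np.sqrt(rng.uniform(0.0, 0.05)), img.shape)
    if rng.random() < 0.5:  # Gaussian blurring, intensity only
        img = gaussian_filter(img, sigma=rng.uniform(0.5, 1.5))
    return img, hm
```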
Limited by GPU memory, when combining supervised learning with SSL, we switch the training of DLT between the two schemes:

$$L_{mix} = \begin{cases} L_{ssl} & \text{if } \lambda \leq \tau \\ L_{sl} & \text{otherwise} \end{cases} \quad (9)$$

where λ ∈ [0, 1] is a random number and we empirically set the threshold τ to 0.25 in our experiments.
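In training code, the switch in Eq. 9 is just a uniform random draw per optimization step; a sketch under hypothetical helper names (ssl_batch, supervised_batch, compute_loss):

```python
import random

TAU = 0.25  # Eq. 9 threshold, set empirically in the paper

def training_step(model, optimizer, ssl_batch, supervised_batch, compute_loss):
    """One optimization step mixing SSL and supervised learning per Eq. 9."""
    batch = ssl_batch() if random.random() <= TAU else supervised_batch()
    loss = compute_loss(model, batch)   # L_ssl or L_sl, same focal-loss form
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```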
5. Experiments

In total, we have 906 and 960 directed lesion pairs in the validation and test sets, respectively. We define a center point matching (CPM) accuracy, which represents the percentage of correctly matched lesions. A match is counted as correct when the Euclidean distance between the ground truth center and the predicted center is smaller than a threshold. We first set the threshold to the corresponding lesion radius and refer to this matching accuracy as CPM@Radius, or simply CPM. However, this threshold is not tight enough to differentiate trackers, because some lesions are large. We therefore also use an adaptive threshold, min(r, 10mm), to limit the allowed maximum offset for large lesions, and refer to this matching accuracy as CPM@10mm. We empirically use 10mm because 55% of the lesions in the test set have a radius larger than 10mm.

We also measure the absolute offset between ground truth and predicted centers in mm, reporting the mean Euclidean distance (MED) and its projections MEDX, MEDY, and MEDZ in each direction. The speed of trackers is measured in seconds per volume (spv).
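The evaluation metrics reduce to a few lines of numpy; this sketch computes CPM@Radius, CPM@10mm, and the MED, assuming predicted and ground-truth centers are already in millimeter coordinates:

```python
import numpy as np

def evaluate(pred_centers, gt_centers, radii_mm, cap_mm=10.0):
    """CPM@Radius, CPM@10mm, and mean Euclidean distance (MED), all in mm."""
    pred = np.asarray(pred_centers, dtype=float)
    gt = np.asarray(gt_centers, dtype=float)
    r = np.asarray(radii_mm, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=1)            # per-lesion offset
    cpm_radius = np.mean(dist < r)                      # threshold = radius
    cpm_10mm = np.mean(dist < np.minimum(r, cap_mm))    # adaptive threshold
    return cpm_radius, cpm_10mm, dist.mean()
```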
Figure 4. Comparisons of our methods with three state-of-the-art trackers. The top 3 closest-to-center distances are reported in mm.

Method | CPM@10mm | CPM@Radius | MEDX (mm) | MEDY (mm) | MEDZ (mm) | MED (mm) | speed (spv)
Affine [36] | 48.33 | 65.21 | 4.1±5.0 | 5.4±5.6 | 7.1±8.3 | 11.2±9.9 | 1.82
VoxelMorph [4] | 49.90 | 65.59 | 4.6±6.7 | 5.2±7.9 | 6.6±6.2 | 10.9±10.9 | 0.46
LENS-LesionGraph [58, 62] | 63.85 | 80.42 | 2.6±4.6 | 2.7±4.5 | 6.0±8.6 | 8.0±10.1 | 4.68
VULD-LesionGraph [12, 62] | 64.69 | 76.56 | 3.5±5.2 | 4.1±5.8 | 6.1±8.8 | 9.3±10.9 | 9.07
VULD-LesaNet [12, 59] | 65.00 | 77.81 | 3.5±5.3 | 4.0±5.7 | 6.0±8.7 | 9.1±10.8 | 9.05
SiamRPN++ [29] | 68.85 | 80.31 | 3.8±4.8 | 3.8±4.8 | 4.8±7.5 | 8.3±9.2 | 2.24
LENS-LesaNet [58, 59] | 70.00 | 84.58 | 2.7±4.8 | 2.6±4.7 | 5.7±8.6 | 7.8±10.3 | 4.66
DLT-SSL | 71.04 | 81.52 | 3.8±5.3 | 3.7±5.5 | 5.4±8.4 | 8.8±10.5 | 3.57
DEEDS [22] | 71.88 | 85.52 | 2.8±3.7 | 3.1±4.1 | 5.0±6.8 | 7.4±8.1 | 15.3
DLT-Mix | 78.65 | 88.75 | 3.1±4.4 | 3.1±4.5 | 4.2±7.6 | 7.1±9.2 | 3.54
DLT | 78.85 | 86.88 | 3.5±5.6 | 2.9±4.9 | 4.0±6.1 | 7.0±8.9 | 3.58
Table 1. Comparisons between the proposed DLT and state-of-the-art approaches.
id | Kg size (Eq. 6) | learned | ψ, φ dim. | G size n (Eq. 2) | test MED | speed (spv)
a | N/A | N/A | 64 | 4 | 9.3 | 1.44
b | 7,7,7 | ✓ | 64 | 4 | 9.4 | 2.38
c | 7,15,15 | ✓ | 64 | 4 | 7.7 | 24.1
d | 7,11,11 | ✓ | 64 | 2 | 7.4 | 3.51
e | 7,11,11 | ✓ | 64 | 8 | 8.5 | 3.51
f | 7,11,11 | ✓ | 32 | 4 | 8.7 | 2.25
g | 7,11,11 | ✓ | 128 | 4 | 7.9 | 5.83
h | 7,11,11 | ✓ | 64 | N/A | 9.3 | 3.51
i | 7,11,11 | ✗ | 64 | 4 | 9.3 | 3.51
j | 7,11,11 | ✓ | 64 | 4 | 7.9 | 3.51
Table 3. Parameter analysis of the proposed components. The "learned" column marks whether the flattening of Kg uses learned convolutions.

Input generator | MAE (mm) | Growth acc. (%) | Response acc. (%)
DEEDS [22] | 2.69±4.12 | 78.02 | 84.17
DLT | 2.47±3.58 | 79.69 | 85.10
Manual inputs | 2.31±3.16 | 79.69 | 85.56
Table 4. Impact on automatic lesion size measurement when using the OneClick [50] model.

Method | CPM@Radius | speed (spv)
DEEDS [22] | 85.6 | 67.1±17.8
DLT | 88.4 | 4.7±0.35
Table 5. External evaluation.
Parameter Analysis. Table 3 presents our parameter analysis of different model configurations, with model j representing our final configuration without the multiplication fusion of Eq. 3 or the center augmentation of Sec. 4.1. We present test results, but note that our model selection was based on the validation set (see the supplementary material). Model a is identical to our final model except that the global kernel is disabled, resulting in significant MED increases and demonstrating the importance of the global kernel. Models b and c explore different global kernel sizes, indicating that performance can vary somewhat but is not overly sensitive to the choice. However, too large a kernel results in an order of magnitude greater runtime, justifying our choice of a (7, 11, 11) kernel. As model e demonstrates, when the ASE heat map of Eq. 2 covers too large an area it can lose its specificity, resulting in performance degradation. Models f and g show the effect of different embedding feature dimensions, again indicating that performance is not overly sensitive to this choice as long as the embedding dimension is large enough. As for the need for the anatomy signal of ASE, model h demonstrates that its removal considerably increases the MED. Finally, model i's performance shows that the learnable decomposition of Eq. 6 is critical for accurate tracking. Adding Eq. 3 and center augmentation to model j results in our final configuration featured in Table 1.

5.4. Impact on Downstream Measurements

In this experiment, we compare trackers through downstream size measurements. We use a pre-trained model, OneClick [50], that takes the image Is and the predicted lesion center µp as inputs and regresses the RECIST diameters of the target lesion. For simplicity, we only compare the long diameters. Our evaluation metrics are mean absolute error (MAE) in mm, growth accuracy, and treatment response accuracy. With the template diameter dt, search diameter ds, and OneClick-predicted diameter dp, we define dp as a correct growth prediction if and only if the inequality (ds - dt)(dp - dt) > 0 holds. The growth accuracy is the percentage of correct growth predictions. The treatment response, ρ = (ds - dt)/dt, is defined based on the RECIST guideline [20], which classifies a treatment response as partial response if ρ ≤ -0.3, as progressive disease if ρ ≥ 0.2, or as stable disease if ρ ∈ (-0.3, 0.2). We then predict treatment response using ρp = (dp - dt)/dt.
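For reference, the growth and RECIST-response rules above amount to simple thresholding of relative diameter change; a sketch:

```python
def growth_prediction_correct(d_t, d_s, d_p):
    """True when the predicted diameter d_p moves in the same direction
    as the true follow-up diameter d_s, relative to the template d_t."""
    return (d_s - d_t) * (d_p - d_t) > 0

def treatment_response(d_t, d):
    """RECIST-style response from relative change rho = (d - d_t) / d_t."""
    rho = (d - d_t) / d_t
    if rho <= -0.3:
        return "partial response"
    if rho >= 0.2:
        return "progressive disease"
    return "stable disease"  # rho in (-0.3, 0.2)
```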
We tested DLT, DEEDS, and manual inputs, i.e., the ground truth lesion centers. Table 4 shows the results. DLT outperforms DEEDS in MAE by 0.22mm, an 8% improvement. Compared with manual inputs, DLT exhibits the same growth accuracy and is only 0.46% lower in treatment response accuracy.

External Evaluation. We further invited a board-certified radiologist to manually assess DLT on 100 longitudinal studies recruited from real-life clinical workflows. The user provides binarized responses, i.e., inside- or outside-lesion, for the CPM@Radius metric. We compare the tracking results of DLT with DEEDS in Table 5. DLT delivers 88.4% CPM accuracy and outperforms DEEDS by 2.8%. Moreover, DLT requires only 4.67 seconds to process a whole-body CT, over 14 times faster than DEEDS. These results also underscore the value of our DLS dataset.

6. Conclusion & Discussion

In this work, we introduce a new public benchmark for lesion tracking and present DLT as our solution. Due to the different setup of medical applications, DLT differs from general visual trackers in two aspects. First, DLT does not regress bounding boxes for target lesions because, as mentioned in Sec. 5.4, the lesion size can be accurately predicted by the downstream measurement module. Second, DLT does not perform long-term tracking, because longitudinal studies contain far fewer time points than general videos. Also, manual calibration occurs much more often in lesion tracking than in general object tracking.

Our presented DLT has been demonstrated to be effective for lesion tracking, outperforming a comprehensive set of baselines that represent various tracking strategies. DLT can be trained via either supervised or self-supervised learning, and the combination of both training schemes results in the best performance and robustness. We benchmark the task of lesion tracking on our DLS dataset, which will be made available upon request.
References

[1] Diego Ardila, Atilla P. Kiraly, Sujeeth Bharadwaj, Bokyung Choi, Joshua J. Reicher, Lily Peng, Daniel Tse, Mozziyar Etemadi, Wenxing Ye, Greg Corrado, et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nature Medicine, 25(6):954-961, 2019.
[2] Shai Avidan. Support vector tracking. IEEE Trans. Pattern Anal. Mach. Intell., 26(8):1064-1072, 2004.
[3] Boris Babenko, Ming-Hsuan Yang, and Serge J. Belongie. Robust object tracking with online multiple instance learning. IEEE Trans. Pattern Anal. Mach. Intell., 33(8):1619-1632, 2011.
[4] Guha Balakrishnan, Amy Zhao, Mert Sabuncu, John Guttag, and Adrian V. Dalca. An unsupervised learning model for deformable medical image registration. In IEEE Conf. Comput. Vis. Pattern Recog., pages 9252-9260, 2018.
[5] Guha Balakrishnan, Amy Zhao, Mert R. Sabuncu, John V. Guttag, and Adrian V. Dalca. VoxelMorph: A learning framework for deformable medical image registration. IEEE Trans. Med. Imaging, 38(8):1788-1800, 2019.
[6] Luca Bertinetto, Jack Valmadre, Stuart Golodetz, Ondrej Miksik, and Philip H. S. Torr. Staple: Complementary learners for real-time tracking. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1401-1409, 2016.
[7] Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, and Philip H. S. Torr. Fully-convolutional siamese networks for object tracking. In Eur. Conf. Comput. Vis. Worksh., pages 850-865, 2016.
[8] David S. Bolme, J. Ross Beveridge, Bruce A. Draper, and Yui Man Lui. Visual object tracking using adaptive correlation filters. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2544-2550, 2010.
[9] Jane Bromley, James W. Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard Säckinger, and Roopak Shah. Signature verification using a “siamese” time delay neural network. Int. J. Pattern Recognit. Artif. Intell., 7(4):669-688, 1993.
[10] J. Cai, A. P. Harrison, Y. Zheng, K. Yan, Y. Huo, J. Xiao, L. Yang, and L. Lu. Lesion-Harvester: Iteratively mining unlabeled lesions and hard-negative examples at scale. IEEE Trans. Med. Imaging, pages 1-1, 2020.
[11] Jinzheng Cai, Youbao Tang, Le Lu, Adam P. Harrison, Ke Yan, Jing Xiao, Lin Yang, and Ronald M. Summers. Accurate weakly-supervised deep lesion segmentation using large-scale clinical annotations: Slice-propagated 3D mask generation from 2D RECIST. In Medical Image Computing and Computer Assisted Intervention, pages 396-404, 2018.
[12] Jinzheng Cai, Ke Yan, Chi-Tung Cheng, Jing Xiao, Chien-Hung Liao, Le Lu, and Adam P. Harrison. Deep volumetric universal lesion detection using light-weight pseudo 3D convolution and surface point regression. In Medical Image Computing and Computer Assisted Intervention, pages 3-13, 2020.
[13] João Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4724-4733, 2017.
[14] Alvaro Gomariz Carrillo, Weiye Li, Ece Ozkan, Christine Tanner, and Orcun Goksel. Siamese networks with location prior for landmark tracking in liver ultrasound sequences. In IEEE Int. Symposium on Biomedical Imaging, pages 1757-1760, 2019.
[15] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell., 40(4):834-848, 2018.
[16] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. CoRR, abs/2002.05709, 2020.
[17] Özgün Çiçek, Ahmed Abdulkadir, Soeren S. Lienkamp, Thomas Brox, and Olaf Ronneberger. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In Medical Image Computing and Computer Assisted Intervention, pages 424-432, 2016.
[18] Martin Danelljan, Gustav Häger, Fahad Shahbaz Khan, and Michael Felsberg. Convolutional features for correlation filter based visual tracking. In Int. Conf. Comput. Vis. Worksh., pages 621-629, 2015.
[19] Martin Danelljan, Fahad Shahbaz Khan, Michael Felsberg, and Joost van de Weijer. Adaptive color attributes for real-time visual tracking. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1090-1097, 2014.
[20] E. Eisenhauer, P. Therasse, J. Bogaerts, et al. New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1). European Journal of Cancer, 45(2):228-247, 2009.
[21] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conf. Comput. Vis. Pattern Recog., pages 580-587, 2014.
[22] M. P. Heinrich, M. Jenkinson, M. Brady, and J. A. Schnabel. MRF-based deformable registration and ventilation estimation of lung CT. IEEE Trans. Med. Imaging, 32(7):1239-1248, 2013.
[23] Mattias P. Heinrich, Oskar Maier, and Heinz Handels. Multi-modal multi-atlas segmentation using discrete optimisation and self-similarities. In IEEE Int. Symposium on Biomedical Imaging, pages 27-30, 2015.
[24] João F. Henriques, Rui Caseiro, Pedro Martins, and Jorge P. Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In Eur. Conf. Comput. Vis., pages 702-715, 2012.
[25] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2261-2269, 2017.
[26] Fabian Isensee, Jens Petersen, André Klein, David Zimmerer, Paul F. Jaeger, Simon Kohl, Jakob Wasserthal, Gregor Koehler, Tobias Norajitra, Sebastian J. Wirkert, and Klaus H. Maier-Hein. nnU-Net: Self-adapting framework for U-Net-based medical image segmentation. CoRR, abs/1809.10486, 2018.
[27] Chenhan Jiang, Shaoju Wang, Xiaodan Liang, Hang Xu, and Nong Xiao. ElixirNet: Relation-aware network architecture adaptation for medical lesion detection. In AAAI, pages 11093-11100, 2020.
[28] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Adv. Neural Inform. Process. Syst., pages 1106-1114, 2012.
[29] Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. SiamRPN++: Evolution of siamese visual tracking with very deep networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4282-4291, 2019.
[30] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In IEEE Conf. Comput. Vis. Pattern Recog., pages 936-944, 2017.
[31] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell., 42(2):318-327, 2020.
[32] Fei Liu, Dan Liu, Jie Tian, Xiaoyan Xie, Xin Yang, and Kun Wang. Cascaded one-shot deformable convolutional neural networks: Developing a deep learning model for respiratory motion estimation in ultrasound sequences. Med. Image Anal., 65:101793, 2020.
[33] Ting Liu, Gang Wang, and Qingxiong Yang. Real-time part-based visual tracking via adaptive correlation filters. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4902-4912, 2015.
[34] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conf. Comput. Vis. Pattern Recog., pages 3431-3440, 2015.
[35] Chao Ma, Jia-Bin Huang, Xiaokang Yang, and Ming-Hsuan Yang. Hierarchical convolutional features for visual tracking. In Int. Conf. Comput. Vis., pages 3074-3082, 2015.
[36] Kasper Marstal, Floris F. Berendsen, Marius Staring, and Stefan Klein. SimpleElastix: A user-friendly, multi-lingual library for medical image registration. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pages 574-582, 2016.
[37] Shun Miao, Z. Jane Wang, and Rui Liao. A CNN regression approach for real-time 2D/3D registration. IEEE Trans. Med. Imaging, 35(5):1352-1363, 2016.
[38] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 3DV, pages 565-571, 2016.
[39] Hyeonseob Nam and Bohyung Han. Learning multi-domain convolutional neural networks for visual tracking. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4293-4302, 2016.
[40] Xavier Rafael-Palou, Anton Aubanell, Ilaria Bonavita, Mario Ceresa, Gemma Piella, Vicent Ribas, and Miguel A. González Ballester. Re-identification and growth detection of pulmonary nodules without image registration using 3D siamese neural networks. Med. Image Anal., 67:101823, 2021.
[41] Ashwin Raju, Chi-Tung Cheng, Yuankai Huo, Jinzheng Cai, Junzhou Huang, Jing Xiao, Le Lu, Chien-Hung Liao, and Adam P. Harrison. Co-heterogeneous and adaptive segmentation from multi-source and multi-phase CT imaging data: A study on pathological liver and lesion segmentation. In Eur. Conf. Comput. Vis., pages 448-465, 2020.
[42] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer Assisted Intervention, pages 234-241, 2015.
[43] Holger R. Roth, Le Lu, Amal Farag, Hoo-Chang Shin, Jiamin Liu, Evrim B. Turkbey, and Ronald M. Summers. DeepOrgan: Multi-level deep convolutional networks for automated pancreas segmentation. In Medical Image Computing and Computer Assisted Intervention, pages 556-564, 2015.
[44] Holger R. Roth, Le Lu, Ari Seff, Kevin M. Cherry, Joanne Hoffman, Shijun Wang, Jiamin Liu, Evrim Turkbey, and Ronald M. Summers. A new 2.5D representation for lymph node detection using random sets of deep convolutional neural network observations. In Medical Image Computing and Computer Assisted Intervention, pages 520-527, 2014.
[45] Qingbin Shao, Lijun Gong, Kai Ma, Hualuo Liu, and Yefeng Zheng. Attentive CT lesion detection using deep pyramid inference with multi-scale booster. In Medical Image Computing and Computer Assisted Intervention, pages 301-309, 2019.
[46] Hoo-Chang Shin, Holger R. Roth, Mingchen Gao, Le Lu, Ziyue Xu, Isabella Nogues, Jianhua Yao, Daniel J. Mollura, and Ronald M. Summers. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging, 35(5):1285-1298, 2016.
[47] Maxine Tan, Zheng Li, Yuchen Qiu, Scott D. McMeekin, Theresa C. Thai, Kai Ding, Kathleen N. Moore, Hong Liu, and Bin Zheng. A new approach to evaluate drug treatment response of ovarian cancer patients based on deformable image registration. IEEE Trans. Med. Imaging, 35(1):316-325, 2016.
[48] Youbao Tang, Adam P. Harrison, Mohammadhadi Bagheri, Jing Xiao, and Ronald M. Summers. Semi-automatic RECIST labeling on CT scans with cascaded convolutional neural networks. In Medical Image Computing and Computer Assisted Intervention, pages 405-413, 2018.
[49] Youbao Tang, Ke Yan, Yuxing Tang, Jiamin Liu, Jing Xiao, and Ronald M. Summers. ULDor: A universal lesion detector for CT scans with pseudo masks and hard negative example mining. In IEEE Int. Symposium on Biomedical Imaging, pages 833-836, 2019.
[50] Youbao Tang, Ke Yan, Jing Xiao, and Ronald M. Summers. One click lesion RECIST measurement and segmentation on CT scans. In Medical Image Computing and Computer Assisted Intervention, pages 573-583, 2020.
[51] Ran Tao, Efstratios Gavves, and Arnold W. M. Smeulders. Siamese instance search for tracking. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1420-1429, 2016.
[52] Jack Valmadre, Luca Bertinetto, João F. Henriques, Andrea Vedaldi, and Philip H. S. Torr. End-to-end representation learning for correlation filter based tracking. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5000-5008, 2017.
[53] Lijun Wang, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. Visual tracking with fully convolutional networks. In Int. Conf. Comput. Vis., pages 3119-3127, 2015.
[54] Naiyan Wang, Siyi Li, Abhinav Gupta, and Dit-Yan Yeung. Transferring rich feature hierarchies for robust visual tracking. CoRR, abs/1501.04587, 2015.
[55] Xudong Wang, Zhaowei Cai, Dashan Gao, and Nuno Vasconcelos. Towards universal object detection by domain attention. In IEEE Conf. Comput. Vis. Pattern Recog., pages 7289-7298, 2019.
[56] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. Int. J. Comput. Vis., 125(1-3):3-18, 2017.
[57] Zhoubing Xu, Christopher P. Lee, Mattias P. Heinrich, Marc Modat, Daniel Rueckert, Sébastien Ourselin, Richard G. Abramson, and Bennett A. Landman. Evaluation of six registration methods for the human abdomen on clinically acquired CT. IEEE Trans. Biomed. Eng., 63(8):1563-1572, 2016.
[58] Ke Yan, Jinzheng Cai, Youjing Zheng, Adam P. Harrison, Dakai Jin, Youbao Tang, Yuxing Tang, Lingyun Huang, Jing Xiao, and Le Lu. Learning from multiple datasets with heterogeneous and partial labels for universal lesion detection in CT. CoRR, abs/2009.02577, 2020.
[59] Ke Yan, Yifan Peng, Veit Sandfort, Mohammadhadi Bagheri, Zhiyong Lu, and Ronald M. Summers. Holistic and comprehensive annotation of clinically significant findings on diverse CT images: Learning from radiology reports and label ontology. In IEEE Conf. Comput. Vis. Pattern Recog., pages 8523-8532, 2019.
[60] Ke Yan, Youbao Tang, Yifan Peng, Veit Sandfort, Mohammadhadi Bagheri, Zhiyong Lu, and Ronald M. Summers. MULAN: Multitask universal lesion analysis network for joint lesion detection, tagging, and segmentation. In Medical Image Computing and Computer Assisted Intervention, pages 194-202, 2019.
[61] Ke Yan, Xiaosong Wang, Le Lu, and Ronald M. Summers. DeepLesion: Automated mining of large-scale lesion annotations and universal lesion detection with deep learning. J. Med. Imaging, 5(3), 2018.
[62] Ke Yan, Xiaosong Wang, Le Lu, Ling Zhang, Adam P. Harrison, Mohammadhadi Bagheri, and Ronald M. Summers. Deep lesion graphs in the wild: Relationship learning and organization of significant radiology image findings in a diverse large-scale lesion database. In IEEE Conf. Comput. Vis. Pattern Recog., pages 9261-9270, 2018.
[63] Rui Yao, Qinfeng Shi, Chunhua Shen, Yanning Zhang, and Anton van den Hengel. Part-based visual tracking with online latent structural learning. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2363-2370, 2013.
[64] Ling Zhang, Yu Shi, Jiawen Yao, Yun Bian, Kai Cao, Dakai Jin, Jing Xiao, and Le Lu. Robust pancreatic ductal adenocarcinoma segmentation with multi-institutional multi-phase partially-annotated CT scans. In Medical Image Computing and Computer Assisted Intervention, pages 491-500, 2020.
[65] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. CoRR, abs/1904.07850, 2019.