1 Introduction

Instance segmentation plays an important role in biomedical imaging tasks like cell migration analysis, but also in computer vision tasks like scene understanding. It is considerably more difficult than semantic segmentation (e.g., [10]), since instance segmentation not only assigns class labels to pixels, but also distinguishes between instances within each class, e.g., each individual person in an image from a surveillance camera is assigned a unique ID.

Mainly due to the high performance of the U-Net [12], semantic segmentation has been successfully used as a first step in medical instance segmentation tasks, e.g., cell tracking. However, for instances to be separated as connected components during postprocessing, borders of instances have to be treated with special care. In the computer vision community, many instance segmentation methods have in common that they segment only one instance at a time. In [4], all instances are first detected and then segmented independently, while in [11], recurrent networks are used to memorize which instances have already been segmented. Segmenting only one instance at a time can be problematic when hundreds of instances are visible in the image, as is often the case in, e.g., cell instance segmentation. Recent methods segment all instances simultaneously by predicting embeddings for all pixels at once [5, 8]. These embeddings are similar within an instance, but differ between instances. In the task of cell segmentation and tracking, temporal information is an important cue for establishing coherence between frames, thus preserving instances throughout videos. Despite the improvements in instance segmentation achieved with embeddings, to the best of our knowledge, combining them with temporal information to track instance segmentations has not been presented.

Fig. 1.

Overview of our proposed framework showing input image, propagation of cosine embeddings from frame t to frame \(t+1\) (three randomly chosen embedding dimensions as RGB channels), and resulting clustered instances.

In this paper, we propose to use recurrent fully convolutional networks for embedding-based instance segmentation and tracking. To memorize temporal information, we integrate convolutional gated recurrent units (ConvGRU [2]) into a stacked hourglass network [9]. Furthermore, we use a novel embedding loss based on cosine similarities, in which we exploit the four color map theorem [1] by requiring only neighboring instances to have different embeddings.

2 Instance Segmentation and Tracking

Figure 1 shows our proposed framework on a cell instance segmentation and tracking example. To distinguish cell instances, they are represented as embeddings at different time points. By representing temporal sequences of embeddings in a recurrent hourglass network, a predictor can be learnt from the data that allows tracking of embeddings even in the case of mitosis events. To finally generate instance segmentations, the predicted embeddings are clustered.

2.1 Recurrent Stacked Hourglass Network

We modify the stacked hourglass architecture [9] by integrating ConvGRU [2] to propagate temporal information, as shown in Fig. 2. In contrast to the original stacked hourglass network, we use single convolution layers with \(3\times 3\) filters and 64 outputs for all blocks in the contracting and expanding paths, while we use ConvGRU with \(3\times 3\) filters and 64 outputs between the paths. As proposed by [9], we stack two hourglasses in a row to improve network predictions. To do so, we concatenate the output of the first hourglass with the input image and use it as input for the second hourglass. We apply the loss function to the outputs of both hourglasses, while only the outputs of the second hourglass are used for clustering the embeddings.

Fig. 2.

Overview of the recurrent stacked hourglass network with two hourglasses and three levels. Yellow bars: input; blue boxes: convolutions; red boxes: ConvGRU; dashed black box: concatenation; green boxes: embeddings.
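The following is a minimal, self-contained sketch (not the authors' code) of a single recurrent hourglass level in TensorFlow/Keras, illustrating the placement of the ConvGRU between the contracting and expanding paths. Keras does not ship a ConvGRU2D layer, so a small custom cell is included; the \(3\times 3\) filters and 64 outputs follow Sect. 2.1, while the single level, the embedding dimension, and all other details are simplifications of the seven-level, two-hourglass network used in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers


class ConvGRU(layers.Layer):
    """Minimal convolutional GRU cell operating on a single time step."""
    def __init__(self, filters, kernel_size=3):
        super().__init__()
        self.gates = layers.Conv2D(2 * filters, kernel_size, padding="same",
                                   activation="sigmoid")   # update + reset gates
        self.cand = layers.Conv2D(filters, kernel_size, padding="same",
                                  activation="tanh")        # candidate state

    def call(self, x, h):
        z, r = tf.split(self.gates(tf.concat([x, h], axis=-1)), 2, axis=-1)
        h_tilde = self.cand(tf.concat([x, r * h], axis=-1))
        return (1.0 - z) * h + z * h_tilde


class RecurrentHourglass(tf.keras.Model):
    """Single-level hourglass with a ConvGRU between the paths (cf. Fig. 2)."""
    def __init__(self, filters=64, embedding_dim=16):
        super().__init__()
        self.down = layers.Conv2D(filters, 3, padding="same", activation="relu")
        self.pool = layers.MaxPool2D(2)
        self.gru = ConvGRU(filters)
        self.upsample = layers.UpSampling2D(2)
        self.up = layers.Conv2D(filters, 3, padding="same", activation="relu")
        self.head = layers.Conv2D(embedding_dim, 3, padding="same")

    def call(self, frames):
        # frames: (batch, time, height, width, channels)
        h = tf.zeros_like(self.pool(self.down(frames[:, 0])))   # initial GRU state
        outputs = []
        for t in range(frames.shape[1]):
            skip = self.down(frames[:, t])            # contracting path
            h = self.gru(self.pool(skip), h)          # temporal state at the bottleneck
            x = self.up(self.upsample(h)) + skip      # expanding path with skip connection
            outputs.append(self.head(x))              # per-pixel embeddings
        return tf.stack(outputs, axis=1)


# Example usage (hypothetical shapes):
# video = tf.random.normal((1, 10, 256, 256, 1))      # batch, time, H, W, channels
# embeddings = RecurrentHourglass()(video)            # (1, 10, 256, 256, 16)
```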

2.2 Cosine Embedding Loss

We let the network predict a d-dimensional embedding vector \(\varvec{e}_p\in \mathbb {R}^d\) for each pixel p of the image. To separate instances \(i\in \mathbb {I}\), firstly, embeddings of pixels \(p\in \mathbb {S}^{(i)}\) belonging to the same instance i need to be similar, and secondly, embeddings of \(\mathbb {S}^{(i)}\) need to be dissimilar to embeddings of pixels \(p\in \mathbb {S}^{(j)}\) of other instances \(j\ne i\). Here, we treat the background as an independent instance. Following the four color map theorem [1], only neighboring instances need to have different embeddings. Thus, we relax the dissimilarity requirement to neighboring instances only, i.e., \(\mathbb {N}^{(i)}= \bigcup _j\mathbb {S}^{(j)}\) for all instances \(j\ne i\) within a pixel-wise distance \(r_{\mathbb {N}}\) of instance i. This relaxation simplifies the problem by assigning only a limited number of different embeddings to a possibly large number of different instances.
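As an illustration, the neighborhood masks \(\mathbb {N}^{(i)}\) can be precomputed from the groundtruth instance masks by dilating each instance mask by \(r_{\mathbb {N}}\) pixels and intersecting the result with all other instances. The sketch below is our own illustration, not the authors' implementation; the square structuring element is an assumption that approximates the pixel-wise distance.

```python
import numpy as np
from scipy.ndimage import binary_dilation


def neighbor_masks(instance_masks, r_n=10):
    """instance_masks: dict mapping instance id -> boolean mask of shape (H, W)."""
    # Square structuring element as a simple stand-in for the pixel-wise distance r_N.
    structure = np.ones((2 * r_n + 1, 2 * r_n + 1), dtype=bool)
    neighbors = {}
    for i, mask in instance_masks.items():
        dilated = binary_dilation(mask, structure=structure)
        others = np.zeros_like(mask)
        for j, other in instance_masks.items():
            if j != i:
                others |= other
        # N^(i): pixels of all other instances within r_N of instance i
        neighbors[i] = dilated & others
    return neighbors
```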

We compare two embeddings with the cosine similarity

$$\begin{aligned} \text {cos}(\varvec{e}_1,\varvec{e}_2)=\frac{\varvec{e}_1\cdot \varvec{e}_2}{\Vert \varvec{e}_1 \Vert \Vert \varvec{e}_2 \Vert }, \end{aligned}$$
(1)

which ranges from \(-1\) to 1, where \(-1\) indicates that the vectors have opposite directions, 0 that they are orthogonal, and 1 that they have the same direction. We define the cosine embedding loss as

$$\begin{aligned} L=\frac{1}{|\mathbb {I}|}\sum _{i\in \mathbb {I}}\left[ \left( 1-\frac{1}{|\mathbb {S}^{(i)}|}\sum _{p\in \mathbb {S}^{(i)}}{\text {cos}(\varvec{\bar{e}}^{(i)},\varvec{e}_p)}\right) + \left( \frac{1}{|\mathbb {N}^{(i)}|}\sum _{p\in \mathbb {N}^{(i)}}{\text {cos}(\varvec{\bar{e}}^{(i)},\varvec{e}_p)^2}\right) \right] , \end{aligned}$$
(2)

where the mean embedding of instance i is defined as \(\varvec{\bar{e}}^{(i)}=\frac{1}{|\mathbb {S}^{(i)}|}\sum _{p\in \mathbb {S}^{(i)}}{\varvec{e}_p}\). By minimizing \(L\), the first term encourages embeddings \(\varvec{e}_p\) of pixels \(p\in \mathbb {S}^{(i)}\) to have the same direction as the mean \(\varvec{\bar{e}}^{(i)}\), which is the case when \(\text {cos}(\varvec{\bar{e}}^{(i)},\varvec{e}_p)\approx 1\), while the second term pushes embeddings \(\varvec{e}_p\) of pixels \(p\in \mathbb {N}^{(i)}\) towards being orthogonal to the mean \(\varvec{\bar{e}}^{(i)}\), i.e., \(\text {cos}(\varvec{\bar{e}}^{(i)},\varvec{e}_p)\approx 0\).
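A minimal NumPy sketch of Eq. (2) for a single frame could look as follows. Here, `instance_masks` and `neighbor_masks` are assumed to map each instance to boolean masks of \(\mathbb {S}^{(i)}\) and \(\mathbb {N}^{(i)}\) (e.g., built as in the previous sketch); the code illustrates the loss and is not the authors' training implementation.

```python
import numpy as np


def cosine(a, b, eps=1e-8):
    # cosine similarity of Eq. (1), with a small eps for numerical stability
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)


def cosine_embedding_loss(embeddings, instance_masks, neighbor_masks):
    """embeddings: (H, W, d); masks: dicts of boolean (H, W) arrays per instance."""
    loss = 0.0
    for i, mask in instance_masks.items():
        inst = embeddings[mask]                        # embeddings of S^(i), shape (|S^(i)|, d)
        mean = inst.mean(axis=0)                       # mean embedding of instance i
        # first term: pull embeddings of instance i towards their mean direction
        pull = 1.0 - np.mean([cosine(mean, e) for e in inst])
        # second term: push embeddings of neighboring instances towards orthogonality
        neigh = embeddings[neighbor_masks[i]]
        push = np.mean([cosine(mean, e) ** 2 for e in neigh]) if len(neigh) else 0.0
        loss += pull + push
    return loss / len(instance_masks)
```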

2.3 Clustering of Embeddings

To obtain the final segmentations from the predicted embeddings, individual groups of embeddings that describe different instances need to be identified. As the number of instances is not known in advance, we perform this grouping with the clustering algorithm HDBSCAN [3], which estimates the number of clusters automatically. For each dataset, two HDBSCAN parameters have to be adjusted: the minimal number of points \(m_{\text {pts}}\) and the minimal cluster size \(m_{\text {clSize}}\). To simplify clustering and still be able to detect splitting of instances, we cluster only overlapping pairs of consecutive frames at a time. Since our embedding loss allows identical embeddings for different instances that are far apart, we use both the image coordinates and the values of the embeddings as data points for the clustering algorithm. After identifying the embedding clusters with HDBSCAN and filtering out clusters smaller than \(t_{\text {size}}\), the final segmented instances for each frame pair are obtained.
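A possible implementation of this clustering step, using the hdbscan Python package, is sketched below. The parameter values, the coordinate scaling factor, and the exact feature construction are assumptions for illustration and not the values used in our experiments.

```python
import numpy as np
import hdbscan


def cluster_frame_pair(embeddings, m_pts=3, m_cl_size=50, t_size=100, coord_scale=0.01):
    """embeddings: (2, H, W, d), predicted for two consecutive frames."""
    t, h, w, d = embeddings.shape
    yy, xx = np.mgrid[0:h, 0:w]
    coords = coord_scale * np.stack([yy, xx], axis=-1).reshape(-1, 2).astype(np.float32)
    coords = np.tile(coords, (t, 1))                   # same spatial grid for both frames
    # data points: embedding values concatenated with scaled image coordinates
    features = np.concatenate([embeddings.reshape(-1, d), coords], axis=1)
    labels = hdbscan.HDBSCAN(min_samples=m_pts,
                             min_cluster_size=m_cl_size).fit_predict(features)
    labels = labels.reshape(t, h, w)
    # remove clusters smaller than t_size (HDBSCAN marks noise with -1)
    for lab in np.unique(labels):
        if lab != -1 and np.sum(labels == lab) < t_size:
            labels[labels == lab] = -1
    return labels
```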

To merge the segmented instances of overlapping frame pairs, we identify identical instances by the highest intersection over union (IoU) between the segmented instances in the overlapping frame. The resulting segmentations are then upsampled back to the original image size, yielding the final segmented and tracked instances.
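The IoU-based matching on the frame shared by two consecutive frame pairs could be implemented as in the following sketch (our own simplification, not the authors' code); it returns a mapping from the new cluster labels to the previous instance IDs.

```python
import numpy as np


def match_by_iou(prev_labels, new_labels):
    """Map each new instance label to the previous instance with the highest IoU
    in the frame shared by both frame pairs."""
    mapping = {}
    for lab in np.unique(new_labels):
        if lab == -1:                                  # skip background/noise
            continue
        mask = new_labels == lab
        ious = {p: np.sum(mask & (prev_labels == p)) / np.sum(mask | (prev_labels == p))
                for p in np.unique(prev_labels) if p != -1}
        if ious:
            best = max(ious, key=ious.get)
            if ious[best] > 0:
                mapping[lab] = best                    # keep the previous instance ID
    return mapping
```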

3 Experimental Setup and Results

We train the networks with TensorFlow and perform on-the-fly data augmentation with SimpleITK. We use hourglass networks with seven levels and an input size of \(256\times 256\), while we scale the input images to fit. All recurrent networks are trained on sequences of ten frames. We refer to the supplementary material for individual training and augmentation parameters, as well as individual values of the parameters described in Sect. 2.

Left Ventricle Segmentation: To show that our proposed recurrent stacked hourglass network is able to incorporate temporal information, we perform semantic segmentation on videos of short-axis MR slices of the heart from the left ventricle segmentation challenge [14]. We compare the recurrent network with a non-recurrent version, in which we replace each ConvGRU with a convolution layer to keep the network complexity the same. Since the outer slices do not contain parts of the left ventricle, the networks are evaluated on the three central slices that contain both left ventricle myocardium and blood cavity (see Fig. 3a). We train the networks with a softmax cross-entropy loss to segment three labels, i.e., background, myocardium, and blood cavity. We use a three-fold cross-validation setup, in which we randomly split the datasets of 96 patients into three equally sized folds. Table 1a shows the IoU of our internal cross-validation for both the recurrent and the non-recurrent stacked hourglass network.

Leaf Instance Segmentation: We show that the cosine embedding loss and the subsequent clustering are suitable for instance segmentation without temporal information by evaluating on the A1 dataset of the CVPPP challenge for segmenting individual plant leaves [7] (see Fig. 3b). We use the non-recurrent version of the proposed network from the previous experiment to predict embeddings with 32 dimensions. Consequently, the clustering is also performed on single images. As we were not able to submit results on the challenge test set in time before finalizing this paper, we report results of an internal three-fold cross-validation on the 128 training images. Following [13], we report the symmetric best Dice (SBD) and the absolute difference in count (|DiC|) and compare to other methods in Table 1b.

Fig. 3.

Qualitative results of the left ventricle segmentation and the CVPPP leaf instance segmentation. The images on the left side show example inputs, the images on the right side show the predicted segmentations.

Table 1. Quantitative results of the left ventricle segmentation and the CVPPP leaf instance segmentation. Values show mean ± standard deviation. Note that we report our results for both datasets based on a three-fold cross-validation setup. Thus, they are not directly comparable to other published results. IoU: intersection over union; myo: myocardium; cav: blood cavity; SBD: symmetric best Dice; |DiC|: absolute difference in count.

Cell Instance Tracking: As our main experiment, we show the applicability of our full framework for instance segmentation and tracking by evaluating it on six different datasets of cell microscopy videos from the ISBI cell tracking challenge [15]. Each cell tracking dataset consists of two annotated training videos and two testing videos, with image sizes ranging from \(512\times 512\) to \(1200\times 1024\) and with 48 to 138 frames. We refer to [6] for additional imaging and video parameters. As the instance IDs in the groundtruth images are consistent throughout the whole video only for tracking, but not for segmentation, we merge the tracking and segmentation groundtruth of each frame to obtain consistent instance IDs. Furthermore, to learn the background embeddings, we only use the frames in which every cell is segmented. With hyperparameters determined on the two annotated training videos of each dataset, we train the networks to predict embeddings of size 16 on both videos for our challenge submission.

To compete in the tracking metric of the challenge, the framework is required to identify the parent ID of each cell. As the framework is able to identify splitting cells and to assign new instance IDs (i.e., mitosis, as seen in Fig. 1), the parent ID of each newly created instance is determined as the instance with the highest IoU in the previous frames. We further postprocess the cells' family tree to be consistent with the evaluation criteria, e.g., an instance ID may not be used after splitting into children. The results in comparison to the top performing methods are presented in Table 2.

Table 2. Quantitative results on the cell tracking datasets for overall performance (OP), segmentation (SEG), and tracking (TRA), as described in [15].

4 Discussion and Conclusion

To the best of our knowledge, we are the first to present a method that incorporates temporal information into a network to allow tracking of embeddings for instance segmentation. We perform three experiments to show different aspects of our novel method, i.e., temporal segmentation, instance segmentation, and combined instance segmentation and tracking, thus demonstrating the wide applicability of our approach.

We use the left ventricle segmentation experiment to show that our novel recurrent stacked hourglass network can be used to incorporate temporal information. The results of this experiment show that incorporating ConvGRU between the contracting and expanding paths, deep inside the architecture, improves over the baseline stacked hourglass network. Nevertheless, since we simplified the evaluation protocol of the challenge, the results of this experiment should not be directly compared to other reported results. Moreover, the benefits of such deep incorporation compared to placing recurrent layers at other positions in the network [11] remain to be shown.

This paper also contributes a novel embedding loss based on cosine similarities. Most methods that use embeddings to differentiate between instance segmentations are based on maximizing distances of embeddings in Euclidean space, e.g., [8]. When combining such embedding losses with recurrent networks, we observed problems, presumably due to unrestricted embedding values. To overcome these problems, we use cosine similarities, which normalize the embeddings. The only other work that suggests cosine similarities for instance segmentation with embeddings is the unpublished work of [5]. However, compared to their embedding loss, which takes all instances into account, our novel loss focuses only on neighboring instances, which can be beneficial for optimization in the case of a large number of instances. We evaluate our novel loss on the CVPPP challenge dedicated to instance segmentation from still images. While waiting for the results of the competition, our method evaluated with three-fold cross-validation performs in line with the currently leading method and outperforms the second best by a significant margin. Moreover, compared to the leading method [11], the architecture of our method is considerably simpler.

In our main experiment for segmentation and tracking of instances, we evaluate our method on the ISBI cell tracking challenge, which exhibits a large variability in visual appearance, size, and number of cells. Our method achieves two first and two second places among the six submitted datasets in the tracking metric. For the dataset DIC-HeLa, which has a dense layout of cells as seen in Fig. 1, we outperform all other methods in both tracking and segmentation metrics. On the dataset Fluo-GOWT1, we rank second overall. On the datasets Fluo-HeLa and Fluo-SIM+, which consist of images with small cells, our method does not perform well due to the need to downsample images for the network to process them. When the downsampling results in a drastic reduction of cell sizes, our method fails to create instance segmentations, which also explains the unsatisfactory tracking performance. To increase the resolution and consequently improve segmentation and tracking, we could split the input image into multiple smaller parts, similarly as done in [12].

In conclusion, our work has shown that embeddings for instance segmentation can be successfully combined with recurrent networks that incorporate temporal information to perform instance tracking. In future work, we will investigate incorporating the required clustering step inside a single end-to-end trained network, which could simplify the framework and further improve the segmentation and tracking results.