Cluster2Former: Semisupervised Clustering Transformers for Video Instance Segmentation
Figure 1. Illustration of the clustering-based approach. Based on the learned object queries, the decoder output for a pixel, which represents the features of the object containing that pixel, is computed as a probability distribution. Our learning objective guides the decoder to output similar distributions for pixels belonging to the same object and different distributions for pixel pairs belonging to different objects.
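As a rough illustration of how such a per-pixel distribution can be obtained, the following is a minimal sketch assuming a Mask2Former-style decoder with learned object queries and a per-pixel feature map; the function and tensor names are our own, not the paper's code:

```python
import torch
import torch.nn.functional as F

def pixel_distributions(queries: torch.Tensor, pixel_features: torch.Tensor) -> torch.Tensor:
    """Turn decoder outputs into one probability distribution per pixel.

    queries:        (N_q, D)    learned object query embeddings from the decoder
    pixel_features: (D, H, W)   per-pixel embeddings from the pixel decoder
    returns:        (H, W, N_q) softmax over the queries, i.e., for every pixel
                    a distribution over the candidate objects/clusters
    """
    # Affinity of every pixel to every query (dot product in feature space).
    logits = torch.einsum("qd,dhw->hwq", queries, pixel_features)
    # Normalizing over the query axis gives a per-pixel probability distribution;
    # the pairwise objective pushes pixels of the same object toward similar
    # distributions and pixels of different objects toward different ones.
    return F.softmax(logits, dim=-1)
```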
Figure 2. Annotations: the first instance is marked in blue, the second in red. (**Left**): Three annotation methods are displayed: full mask (shown as a surrounding boundary), bounding box (minimal enclosing rectangle), and scribble. (**Right**): For a more challenging example containing both occlusion and blurry regions, neither the full mask nor the bounding box provides an unambiguous annotation. Using scribbles, however, one can select representative regions with high confidence.
Figure 3. Selection of pixel pairs in foreground and background regions. The foreground object (lion) is annotated by a cyan scribble, the background by a pink scribble. Positive pairs (green lines) are selected between points of the foreground object, but not between points of the background. Negative pairs (red lines) are selected between points of different clusters: here between the background and the foreground object. This way, the model is not forced to recognize the background as a single homogeneous object; instead, it can segment it into unlabeled distinct clusters, e.g., other animals, ground, mountain, and sky in this figure.
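A minimal sketch of the pair-selection rule illustrated in Figure 3, assuming each scribble is available as a list of pixel coordinates and the background scribble has a reserved id (the data layout and function name are illustrative assumptions, not the paper's implementation):

```python
import random
import itertools

def sample_pairs(scribbles, pairs_per_set=64, background_id=0, seed=None):
    """Sample positive and negative pixel pairs from scribble annotations.

    scribbles: dict mapping instance id -> list of (y, x) pixel coordinates.
    Positive pairs are drawn only within foreground scribbles; the background
    scribble contributes negatives only, so unlabeled background structures
    (other animals, ground, sky, ...) are not forced into a single cluster.
    """
    rng = random.Random(seed)
    positives, negatives = [], []
    for inst_id, points in scribbles.items():
        if inst_id == background_id:
            continue  # no positive pairs between background points
        positives += [(rng.choice(points), rng.choice(points))
                      for _ in range(pairs_per_set)]
    # Negative pairs between every pair of distinct scribbles
    # (foreground-foreground as well as foreground-background).
    for id_a, id_b in itertools.combinations(scribbles, 2):
        negatives += [(rng.choice(scribbles[id_a]), rng.choice(scribbles[id_b]))
                      for _ in range(pairs_per_set)]
    return positives, negatives
```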
Figure 4. Benefits of omitting positive pairs between background points. **Top** row: ground-truth frames. The scribbles are shown at the original resolution (1-pixel-wide curves), and the bounding boxes are displayed only as visual guides for the scribbles. **Bottom** row: prediction. The parrot in the top-right corner of both frames is detected correctly, even though it was not annotated. Because similarity is not enforced between different regions of the background, the background can remain heterogeneous.
Figure 5. Pairing of points on consecutive frames. The two objects are annotated by a magenta scribble (antelope) and a cyan scribble (cheetah). The dots on the scribbles are sampled pixels (for clarity, only a few representative sample pixels and pixel pairs are shown). Positive pairs are connected by green lines, negative pairs by red lines. The thickness of the connecting lines represents the weight of the pair: close negative and distant positive pairs are given heavier weight. Lines connecting pixels in different frames are dashed.
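A sketch of the weighting idea in Figure 5. The exact weighting function of the paper is not reproduced here; as an assumption, a simple exponential function of spatial distance is used so that close negative and distant positive pairs (the informative cases) receive weights near 1:

```python
import math

def pair_weight(p, q, positive, scale=100.0):
    """Weight a pixel pair by its spatial distance.

    p, q:     (frame_idx, y, x) coordinates; pairs may span consecutive frames.
    positive: True for same-instance pairs, False for different-instance pairs.
    """
    dist = math.hypot(p[1] - q[1], p[2] - q[2])   # spatial distance only
    closeness = math.exp(-dist / scale)           # ~1 when close, -> 0 when far
    # Distant positives and close negatives are the hard, informative pairs.
    return (1.0 - closeness) if positive else closeness
```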
Figure 6. **Left**: Although the image is not completely sharp, the extent of the individuals can be specified with the help of a bounding box. **Right**: The instances are still clearly distinguishable despite the blurred edges, but the bounding boxes that enclose them almost completely coincide, since the two individuals cross each other; it is not clear which bounding box annotates which instance. The surface belonging to each individual, on the other hand, can be clearly marked with scribbles. Blue and red bounding boxes and scribbles identify different instances.
Figure 7. Instance segmentation of an occluded object. The **top** row shows the annotated frames; the **bottom** row shows the prediction. The bounding boxes only guide the eye to localize the annotation scribbles, as in Figure 4.
Figure 8. VIS of directly adjacent instances; the separating edge is correctly found.
Figure 9. VIS of frames containing many occlusions.
Figure 10. Limitation of our model. **Left** column: frames segmented by Cluster2Former (ours) from the test set of the YouTube-VIS 2021 dataset. Even though both instances (person and snowboard) are correctly found, the detected edge of the snowboard is quite far from the ground truth. **Right** column: the same frames segmented by Mask2Former; the edge of the snowboard is accurate.
Abstract
1. Introduction
- We propose a VIS model that is trained with scribbles drawn on the training video frames. Our model achieves competitive performance despite using only 0.5% of the pixel count of the full training masks as annotation.
- The above result is achieved by modifying the learning objective only, leaving the architecture of the transformer (in this work, Mask2Former) intact. This not only eliminates costly architecture-specific hyperparameter optimization, but also enables the application of the same loss function modification to future, more advanced VIS architectures.
- We demonstrate that the pairwise approach to training, based on feature vectors obtained by transformers, provides an efficient solution to video instance segmentation (a minimal sketch of such a pairwise loss is given below).
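One plausible form of such a pairwise objective, in the spirit of pairwise-constraint clustering [12,13,14], is sketched below; this is an illustrative sketch under our own assumptions, not the paper's exact loss. Each sampled pixel pair is scored by the inner product of the two per-pixel distributions (as sketched after Figure 1), which estimates the probability that the two pixels belong to the same cluster, and a weighted binary cross-entropy enforces the scribble-derived constraints.

```python
import torch
import torch.nn.functional as F

def pairwise_constraint_loss(dist_a, dist_b, same_instance, weights=None):
    """Similarity-based constraint loss on sampled pixel pairs.

    dist_a, dist_b: (P, N_q) per-pixel distributions of the two members of
                    each of the P sampled pairs
    same_instance:  (P,) float tensor, 1.0 for positive and 0.0 for negative pairs
    weights:        (P,) optional per-pair weights (e.g., distance-based)
    """
    # Probability that the two pixels land in the same cluster: the inner
    # product of their distributions over the query/cluster axis.
    p_same = (dist_a * dist_b).sum(dim=-1).clamp(1e-6, 1.0 - 1e-6)
    return F.binary_cross_entropy(p_same, same_instance, weight=weights)
```

Because this objective touches only the loss, it can in principle be added to the existing training losses of a transformer segmenter without changing its architecture.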
2. Related Works
3. Methods
3.1. Annotation Based on Scribbles
3.2. Similarity-Based Constraint Loss
3.3. Pairing Strategies
3.4. Architecture and Training
4. Results
4.1. Datasets
4.2. Implementation Details
4.3. Experiments
4.4. Ablation Experiments
5. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Meaning |
---|---|
MCL | meta-classification learning |
PCSA-DEC | pairwise constraint and subset allocation deep embedded clustering [14] |
VIS | video instance segmentation |
References
- Yang, L.; Fan, Y.; Xu, N. Video instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5188–5197.
- Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F.; Madhavan, V.; Darrell, T. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2636–2645.
- Qi, J.; Gao, Y.; Hu, Y.; Wang, X.; Liu, X.; Bai, X.; Belongie, S.; Yuille, A.; Torr, P.H.; Bai, S. Occluded video instance segmentation: A benchmark. Int. J. Comput. Vis. 2022, 130, 2022–2039.
- Cheng, B.; Parkhi, O.; Kirillov, A. Pointly-supervised instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2617–2626.
- Ke, L.; Danelljan, M.; Ding, H.; Tai, Y.W.; Tang, C.K.; Yu, F. Mask-free video instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22857–22866.
- Wu, J.; Jiang, Y.; Zhang, W.; Bai, X.; Bai, S. Seqformer: A frustratingly simple model for video instance segmentation. arXiv 2021, arXiv:2112.08275.
- Cheng, B.; Choudhuri, A.; Misra, I.; Kirillov, A.; Girdhar, R.; Schwing, A.G. Mask2former for video instance segmentation. arXiv 2021, arXiv:2112.10764.
- Ke, L.; Ding, H.; Danelljan, M.; Tai, Y.W.; Tang, C.K.; Yu, F. Video mask transfiner for high-quality video instance segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 731–747.
- Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299.
- Shen, W.; Peng, Z.; Wang, X.; Wang, H.; Cen, J.; Jiang, D.; Xie, L.; Yang, X.; Tian, Q. A survey on label-efficient deep image segmentation: Bridging the gap between weak supervision and dense prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9284–9305.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 213–229.
- Hsu, Y.C.; Kira, Z. Neural network-based clustering using pairwise constraints. arXiv 2015, arXiv:1511.06321.
- Hsu, Y.C.; Xu, Z.; Kira, Z.; Huang, J. Learning to cluster for proposal-free instance segmentation. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–8.
- Wang, Y.; Zou, J.; Wang, K.; Liu, C.; Yuan, X. Semi-supervised deep embedded clustering with pairwise constraints and subset allocation. Neural Netw. 2023, 164, 310–322.
- Fóthi, Á.; Faragó, K.B.; Kopácsi, L.; Milacski, Z.Á.; Varga, V.; Lőrincz, A. Multi Object Tracking for Similar Instances: A Hybrid Architecture. In Proceedings of the International Conference on Neural Information Processing, Vancouver, BC, Canada, 6–12 December 2020; pp. 436–447.
- Yu, Q.; Wang, H.; Kim, D.; Qiao, S.; Collins, M.; Zhu, Y.; Adam, H.; Yuille, A.; Chen, L.C. Cmt-deeplab: Clustering mask transformers for panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2560–2570.
- Zhao, Y.; Liu, S.; Hu, Z. Focal learning on stranger for imbalanced image segmentation. IET Image Process. 2022, 16, 1305–1323.
- Chen, X.; Lian, Y.; Jiao, L.; Wang, H.; Gao, Y.; Lingling, S. Supervised edge attention network for accurate image instance segmentation. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 617–631.
- Bertasius, G.; Torresani, L. Classifying, segmenting, and tracking object instances in video with mask propagation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9739–9748.
- Ke, L.; Li, X.; Danelljan, M.; Tai, Y.W.; Tang, C.K.; Yu, F. Prototypical cross-attention networks for multiple object tracking and segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 1192–1203.
- Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. Yolact: Real-time instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9157–9166.
- Chen, H.; Sun, K.; Tian, Z.; Shen, C.; Huang, Y.; Yan, Y. Blendmask: Top-down meets bottom-up for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8573–8581.
- Wang, Q.; Zhang, L.; Bertinetto, L.; Hu, W.; Torr, P.H. Fast online object tracking and segmentation: A unifying approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1328–1338.
- Kopácsi, L.; Dobolyi, Á.; Fóthi, Á.; Keller, D.; Varga, V.; Lőrincz, A. RATS: Robust Automated Tracking and Segmentation of Similar Instances. In Proceedings of the International Conference on Artificial Neural Networks, Bratislava, Slovakia, 14–17 September 2021; pp. 507–518.
- Wang, Y.; Xu, Z.; Wang, X.; Shen, C.; Cheng, B.; Shen, H.; Xia, H. End-to-end video instance segmentation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8741–8750.
- Li, J.; Yu, B.; Rao, Y.; Zhou, J.; Lu, J. TCOVIS: Temporally Consistent Online Video Instance Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 1097–1107.
- Wang, H.; Jiang, X.; Ren, H.; Hu, Y.; Bai, S. Swiftnet: Real-time video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1296–1305.
- Yang, S.; Fang, Y.; Wang, X.; Li, Y.; Fang, C.; Shan, Y.; Feng, B.; Liu, W. Crossover learning for fast online video instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 8043–8052.
- Wu, J.; Liu, Q.; Jiang, Y.; Bai, S.; Yuille, A.; Bai, X. In defense of online models for video instance segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 588–605.
- Voigtlaender, P.; Krause, M.; Osep, A.; Luiten, J.; Sekar, B.B.G.; Geiger, A.; Leibe, B. Mots: Multi-object tracking and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7942–7951.
- Athar, A.; Mahadevan, S.; Osep, A.; Leal-Taixé, L.; Leibe, B. Stem-seg: Spatio-temporal embeddings for instance segmentation in videos. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 158–177.
- Meinhardt, T.; Kirillov, A.; Leal-Taixe, L.; Feichtenhofer, C. Trackformer: Multi-object tracking with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8844–8854.
- Pang, J.; Qiu, L.; Li, X.; Chen, H.; Li, Q.; Darrell, T.; Yu, F. Quasi-dense similarity learning for multiple object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 164–173.
- Wu, J.; Cao, J.; Song, L.; Wang, Y.; Yang, M.; Yuan, J. Track to detect and segment: An online multi-object tracker. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12352–12361.
- Yan, B.; Jiang, Y.; Sun, P.; Wang, D.; Yuan, Z.; Luo, P.; Lu, H. Towards grand unification of object tracking. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 733–751.
- Hwang, S.; Heo, M.; Oh, S.W.; Kim, S.J. Video instance segmentation using inter-frame communication transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 13352–13363.
- Heo, M.; Hwang, S.; Oh, S.W.; Lee, J.Y.; Kim, S.J. Vita: Video instance segmentation via object token association. Adv. Neural Inf. Process. Syst. 2022, 35, 23109–23120.
- Pathak, D.; Girshick, R.; Dollár, P.; Darrell, T.; Hariharan, B. Learning features by watching objects move. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2701–2710.
- Fu, Y.; Liu, S.; Iqbal, U.; De Mello, S.; Shi, H.; Kautz, J. Learning to track instances without video annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8680–8689.
- Huang, D.A.; Yu, Z.; Anandkumar, A. Minvis: A minimal video instance segmentation framework without video-based training. Adv. Neural Inf. Process. Syst. 2022, 35, 31265–31277.
- Pont-Tuset, J.; Perazzi, F.; Caelles, S.; Arbeláez, P.; Sorkine-Hornung, A.; Van Gool, L. The 2017 DAVIS Challenge on Video Object Segmentation. arXiv 2017, arXiv:1704.00675.
- Caelles, S.; Montes, A.; Maninis, K.K.; Chen, Y.; Van Gool, L.; Perazzi, F.; Pont-Tuset, J. The 2018 DAVIS Challenge on Video Object Segmentation. arXiv 2018, arXiv:1803.00557.
Losses | Annotation | AP | AP50 | AP75 | AR1 | AR10 |
---|---|---|---|---|---|---|
Mask2Former [7] | mask | 46.4 | 68.0 | 50.0 | - | - |
Mask2Former [7] + SC loss (ours) | mask | 46.3 | 68.1 | 50.3 | 47.7 | 59.5 |
Cluster2Former (ours) | mask | 41.7 | 66.1 | 45.6 | 42.9 | 51.5 |
Cluster2Former (ours) | scribble | 38.3 | 62.5 | 42.5 | 39.3 | 46.6 |
Losses | Annotation | AP | AP50 | AP75 | AR1 | AR10 |
---|---|---|---|---|---|---|
Mask2Former [7] | mask | 40.6 | 60.9 | 41.8 | - | - |
Mask2Former [7] + SC loss (ours) | mask | 41.6 | 65.1 | 44.4 | 39.0 | 52.9 |
Cluster2Former (ours) | mask | 34.1 | 55.8 | 37.4 | 33.8 | 42.3 |
Cluster2Former (ours) | scribble | 29.5 | 51.7 | 30.4 | 30.3 | 37.2 |
Method | Annotation | AP | AP50 | AP75 | AR1 | AR10 |
---|---|---|---|---|---|---|
Mask2Former [7] | mask | 46.4 | 68.0 | 50.0 | - | - |
MaskFreeVIS [5] | bbox | 43.8 | 70.7 | 46.9 | 41.5 | 52.3 |
SOLO-Track [41] | w/o video annotations | 30.6 | 50.7 | 33.5 | 31.6 | 37.1 |
Cluster2Former (ours) | scribble | 38.3 | 62.5 | 42.5 | 39.3 | 46.6 |
Method | Annotation | AP | AP50 | AP75 | AR1 | AR10 |
---|---|---|---|---|---|---|
Mask2Former [7] | mask | 40.6 | 60.9 | 41.8 | - | - |
MaskFreeVIS [5] | bbox | 37.2 | 61.9 | 40.3 | 35.3 | 46.1 |
Cluster2Former (ours) | scribble | 29.5 | 51.7 | 30.4 | 30.3 | 37.2 |
Neg BG Only | Weighted Pairs | Temp Pairs | AP | AP50 | AP75 | AR1 | AR10 |
---|---|---|---|---|---|---|---|
✓ | ✓ | 35.1 | 59.8 | 37.5 | 36.5 | 43.1 | |
✓ | ✓ | 33.8 | 58.4 | 36.6 | 36.2 | 42.8 | |
✓ | ✓ | 36.4 | 61.2 | 39.7 | 39.2 | 45.4 | |
✓ | ✓ | ✓ | 38.3 | 62.5 | 42.5 | 39.3 | 46.6 |
Tube Length | AP | AP50 | AP75 | AR1 | AR10 |
---|---|---|---|---|---|
1 | 33.1 | 57.9 | 35.1 | 35.4 | 42.4 |
2 | 38.3 | 62.5 | 42.5 | 39.3 | 46.6 |
3 | 37.0 | 61.6 | 40.8 | 39.7 | 46.7 |
4 | 35.9 | 59.6 | 39.9 | 38.7 | 45.5 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).