Abstract
Vision-and-language (V&L) models take an image and text as input and learn to capture the associations between them. Such models can potentially handle tasks that require understanding medical images together with their associated text. However, applying V&L models in the medical domain is challenging due to the high cost of data annotation and the need for domain knowledge. In this paper, we show that the visual representation used in general-purpose V&L models is not well suited to medical data. To overcome this limitation, we propose BERTHop, a transformer-based model built on PixelHop++ and VisualBERT that better captures the associations between clinical notes and medical images.
Experiments on the OpenI dataset, a commonly used thoracic disease diagnosis benchmark, show that BERTHop achieves an average Area Under the Curve (AUC) of 98.12%, 1.62% higher than the state of the art, while being trained on a 9× smaller dataset (https://github.com/monajati/BERTHop).
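The "average AUC" reported above is the mean of per-disease AUC scores in a multi-label setting. As a minimal sketch of how such a metric is computed (the exact averaging scheme used by BERTHop is an assumption here; the labels and scores below are purely illustrative), scikit-learn's `roc_auc_score` supports this directly:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical multi-label ground truth and predicted probabilities
# for 3 thoracic-disease classes over 4 chest X-ray studies.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.8],
                    [0.1, 0.7, 0.3],
                    [0.8, 0.6, 0.2],
                    [0.3, 0.1, 0.9]])

# AUC for each disease class individually, then the unweighted
# mean across classes (the "macro" average).
per_class_auc = roc_auc_score(y_true, y_score, average=None)
macro_auc = roc_auc_score(y_true, y_score, average="macro")
print(per_class_auc, macro_auc)
```

In this toy example every positive is scored above every negative within each class, so each per-class AUC is 1.0 and the macro average is 1.0.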
Notes
- 1. Part of the input is masked, and the objective is to predict the masked words or image regions from the remaining context.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Monajatipoor, M., Rouhsedaghat, M., Li, L.H., Jay Kuo, CC., Chien, A., Chang, KW. (2022). BERTHop: An Effective Vision-and-Language Model for Chest X-ray Disease Diagnosis. In: Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S. (eds) Medical Image Computing and Computer Assisted Intervention – MICCAI 2022. MICCAI 2022. Lecture Notes in Computer Science, vol 13435. Springer, Cham. https://doi.org/10.1007/978-3-031-16443-9_69
DOI: https://doi.org/10.1007/978-3-031-16443-9_69
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16442-2
Online ISBN: 978-3-031-16443-9
eBook Packages: Computer Science (R0)