Abstract
Vision-and-language (V&L) models take an image and text as input and learn to capture the associations between them. Such models can potentially handle tasks that require understanding medical images together with their associated text. However, applying V&L models in the medical domain is challenging due to the high cost of data annotation and the need for domain knowledge. In this paper, we show that the visual representation used in general-purpose V&L models is not well suited to medical data. To overcome this limitation, we propose BERTHop, a transformer-based model built on PixelHop++ and VisualBERT that better captures the associations between clinical notes and medical images.
Experiments on the OpenI dataset, a commonly used thoracic disease diagnosis benchmark, show that BERTHop achieves an average Area Under the Curve (AUC) of 98.12%, 1.62% higher than the state of the art, while being trained on a 9× smaller dataset (https://github.com/monajati/BERTHop).
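The "average AUC" reported above is the mean of per-disease AUC scores in a multi-label setting. As a minimal sketch of how such a metric is computed (the exact averaging scheme used by BERTHop is an assumption here; the labels and scores below are purely illustrative), scikit-learn's `roc_auc_score` supports this directly:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical multi-label ground truth and predicted probabilities
# for 3 thoracic-disease classes over 4 chest X-ray studies.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.8],
                    [0.1, 0.7, 0.3],
                    [0.8, 0.6, 0.2],
                    [0.3, 0.1, 0.9]])

# AUC for each disease class individually, then the unweighted
# mean across classes (the "macro" average).
per_class_auc = roc_auc_score(y_true, y_score, average=None)
macro_auc = roc_auc_score(y_true, y_score, average="macro")
print(per_class_auc, macro_auc)
```

In this toy example every positive is scored above every negative within each class, so each per-class AUC is 1.0 and the macro average is 1.0.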
Notes
- 1. Part of the input is masked, and the objective is to predict the masked words or image regions from the remaining context.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Monajatipoor, M., Rouhsedaghat, M., Li, L.H., Jay Kuo, CC., Chien, A., Chang, KW. (2022). BERTHop: An Effective Vision-and-Language Model for Chest X-ray Disease Diagnosis. In: Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S. (eds) Medical Image Computing and Computer Assisted Intervention – MICCAI 2022. MICCAI 2022. Lecture Notes in Computer Science, vol 13435. Springer, Cham. https://doi.org/10.1007/978-3-031-16443-9_69
DOI: https://doi.org/10.1007/978-3-031-16443-9_69
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16442-2
Online ISBN: 978-3-031-16443-9
eBook Packages: Computer Science (R0)