

Fine-Tuning BERT for COVID-19 Domain Ad-Hoc IR by Using Pseudo-qrels

  • Conference paper
  • Published in: Advances in Information Retrieval (ECIR 2021)
  • Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12657)


Abstract

This work analyzes the feasibility of training a neural retrieval system for a collection of scientific papers about COVID-19 using pseudo-qrels extracted from the collection itself. We propose a method for generating pseudo-qrels that exploits two characteristics of scientific articles: a) the relationship between title and abstract, and b) the relationship between articles through sentences containing citations. From these signals we generate pseudo-queries and their respective pseudo-positive (relevant) and pseudo-negative (non-relevant) examples. The article retrieval process combines a ranking model based on term-matching techniques with a neural one based on pretrained BERT models. The BERT models are fine-tuned to the task using the generated pseudo-qrels. We compare different BERT models, both open-domain and biomedical, and we also compare fine-tuning on the generated pseudo-qrels against fine-tuning on the open-domain MS MARCO dataset. The results obtained on the TREC-COVID collection show that pseudo-qrels provide a significant improvement to neural models, both over classic term-matching IR baselines and over neural systems trained on MS MARCO.
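As a concrete illustration of signal (a), the sketch below builds pseudo-qrel triples in plain Python: each title acts as a pseudo-query, the article's own abstract as the pseudo-positive, and a lexically similar abstract from another article as a hard pseudo-negative. This is a hypothetical sketch, not the authors' implementation: the names (Paper, make_pseudo_qrels) are invented for the example, a crude token-overlap score stands in for the term-matching ranker, and signal (b), based on citation sentences, is omitted.

    import random
    from dataclasses import dataclass

    @dataclass
    class Paper:
        doc_id: str
        title: str
        abstract: str

    def token_overlap(a: str, b: str) -> float:
        """Crude Jaccard similarity over tokens; a stand-in for BM25."""
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / (len(ta | tb) or 1)

    def make_pseudo_qrels(papers, n_negatives=1, pool_size=10, seed=0):
        """Yield (pseudo_query, positive_doc_id, negative_doc_id) triples.

        Signal (a): a title is used as a pseudo-query and the article's own
        abstract as the pseudo-positive. Negatives are sampled from the most
        lexically similar other abstracts, so they are hard rather than random.
        """
        rng = random.Random(seed)
        for paper in papers:
            query = paper.title
            others = [p for p in papers if p.doc_id != paper.doc_id]
            others.sort(key=lambda p: token_overlap(query, p.abstract), reverse=True)
            pool = others[:pool_size]
            for neg in rng.sample(pool, min(n_negatives, len(pool))):
                yield query, paper.doc_id, neg.doc_id

    # Toy usage: three dummy papers produce one triple per title.
    papers = [
        Paper("d1", "BERT for ad-hoc retrieval", "We fine-tune BERT rankers for retrieval."),
        Paper("d2", "COVID-19 transmission routes", "Airborne transmission of SARS-CoV-2 is studied."),
        Paper("d3", "Neural ranking models", "Neural rankers for ad-hoc retrieval tasks."),
    ]
    for triple in make_pseudo_qrels(papers):
        print(triple)

The resulting (query, positive, negative) triples can then be used directly as training instances for fine-tuning a BERT re-ranker, in the style of Nogueira and Cho [11].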


Notes

  1. Release of July 16th, 2020.

  2. The query and question fields of TREC-COVID topics are joined to form the query sentence.

  3. MAP, BPREF, P@[5, 10, 20], nDCG and nDCG@[10, 20] were considered.

  4. Parameters tuned on the dev set: weight = 0.5, fbterms = 25, fbdocs = 45.

  5. Significance tests are done using the paired randomization test [13]; a minimal sketch follows these notes.
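For reference, the following is a minimal, self-contained Python sketch of the two-sided paired randomization (sign-flipping) test described by Smucker et al. [13], applied to per-topic scores of two runs. The function name and the toy scores are illustrative assumptions, not the authors' evaluation code.

    import random

    def paired_randomization_test(scores_a, scores_b, n_permutations=10_000, seed=0):
        """Two-sided p-value for the observed mean per-topic difference.

        Under the null hypothesis the two systems are exchangeable, so the
        sign of each paired difference can be flipped with probability 0.5.
        """
        rng = random.Random(seed)
        diffs = [a - b for a, b in zip(scores_a, scores_b)]
        observed = abs(sum(diffs) / len(diffs))
        extreme = 0
        for _ in range(n_permutations):
            # Randomly flip the sign of each paired per-topic difference.
            perm = [d if rng.random() < 0.5 else -d for d in diffs]
            if abs(sum(perm) / len(perm)) >= observed:
                extreme += 1
        return extreme / n_permutations

    # Example: hypothetical per-topic nDCG@10 for two runs on five topics.
    run_a = [0.41, 0.55, 0.32, 0.60, 0.48]
    run_b = [0.35, 0.50, 0.30, 0.58, 0.40]
    print(paired_randomization_test(run_a, run_b))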

References

  1. Alsentzer, E., Murphy, J., Boag, W., Weng, W.H., Jin, D., Naumann, T., McDermott, M.: Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop, Minneapolis, Minnesota, USA, pp. 72–78. Association for Computational Linguistics, June 2019. https://doi.org/10.18653/v1/W19-1909. https://www.aclweb.org/anthology/W19-1909

  2. Asadi, N., Metzler, D., Elsayed, T., Lin, J.: Pseudo test collections for learning web search ranking functions. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1073–1082 (2011)

  3. Dehghani, M., Zamani, H., Severyn, A., Kamps, J., Croft, W.B.: Neural ranking models with weak supervision. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 65–74 (2017)

  4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  5. Dietz, L., Verma, M., Radlinski, F., Craswell, N.: TREC complex answer retrieval overview. In: TREC (2017)

  6. Guo, J., Fan, Y., Ai, Q., Croft, W.B.: A deep relevance matching model for ad-hoc retrieval. In: Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pp. 55–64 (2016)

  7. Hui, K., Yates, A., Berberich, K., De Melo, G.: Co-PACRR: a context-aware neural IR model for ad-hoc retrieval. In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 279–287 (2018)

  8. MacAvaney, S., Hui, K., Yates, A.: An approach for weakly-supervised deep information retrieval. arXiv preprint arXiv:1707.00189 (2017)

  9. Mitra, B., Diaz, F., Craswell, N.: Learning to match using local and distributed representations of text for web search. In: Proceedings of the 26th International Conference on World Wide Web, pp. 1291–1299 (2017)

  10. Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., Deng, L.: MS MARCO: a human-generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268 (2016)

  11. Nogueira, R., Cho, K.: Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085 (2019)

  12. Ponte, J.M., Croft, W.B.: A language modeling approach to information retrieval. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 275–281 (1998)

  13. Smucker, M.D., Allan, J., Carterette, B.: A comparison of statistical significance tests for information retrieval evaluation. In: CIKM 2007: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, New York, NY, USA, pp. 623–632. ACM (2007). http://doi.acm.org/10.1145/1321440.1321528

  14. Strohman, T., Metzler, D., Turtle, H., Croft, W.B.: Indri: a language model-based search engine for complex queries. In: Proceedings of the International Conference on Intelligent Analysis, vol. 2, pp. 2–6. Citeseer (2005)

  15. Wang, L.L., et al.: CORD-19: the COVID-19 open research dataset. ArXiv (2020)

  16. Xiong, C., Dai, Z., Callan, J., Liu, Z., Power, R.: End-to-end neural ad-hoc ranking with kernel pooling. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 55–64 (2017)

  17. Yang, W., Zhang, H., Lin, J.: Simple applications of BERT for ad hoc document retrieval. arXiv preprint arXiv:1903.10972 (2019)

  18. Zhang, K., Xiong, C., Liu, Z., Liu, Z.: Selective weak supervision for neural information retrieval. In: Proceedings of The Web Conference 2020, pp. 474–485 (2020)


Acknowledgement

This work has been partially funded by the Basque Government (DeepText, Elkartek grant no. KK-2020/00088) and by the VIGICOVID project FSuperaCovid-5 (Fondo Supera COVID-19/CRUE-CSIC-Santander). We also acknowledge the support of Google's TFRC program.

Author information


Corresponding author

Correspondence to Xabier Saralegi.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Saralegi, X., San Vicente, I. (2021). Fine-Tuning BERT for COVID-19 Domain Ad-Hoc IR by Using Pseudo-qrels. In: Hiemstra, D., Moens, M.F., Mothe, J., Perego, R., Potthast, M., Sebastiani, F. (eds) Advances in Information Retrieval. ECIR 2021. Lecture Notes in Computer Science, vol. 12657. Springer, Cham. https://doi.org/10.1007/978-3-030-72240-1_38


  • DOI: https://doi.org/10.1007/978-3-030-72240-1_38


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-72239-5

  • Online ISBN: 978-3-030-72240-1

  • eBook Packages: Computer Science (R0)
