Fast Approximate Duplicate Detection for 2D-NMR Spectra

Björn Egert¹,
Steffen Neumann¹ &
Alexander Hinneburg²

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4544)

Included in the following conference series:

International Conference on Data Integration in the Life Sciences

Abstract

2D nuclear magnetic resonance (NMR) spectroscopy is a powerful analytical method for elucidating the chemical structure of molecules. In contrast to 1D-NMR spectra, 2D-NMR spectra correlate the chemical shifts of ¹H and ¹³C simultaneously. Curating or merging large spectral libraries requires duplicate detection that is both robust and fast. We propose a definition of duplicates with the robustness properties that 2D-NMR experiments require. A major gain in runtime performance with respect to previously proposed heuristics is achieved by mapping the spectra to simple discrete objects, and we propose several data transformations suited to this task. To compensate for slight variations of the mapped spectra, we use hashing functions chosen according to the locality-sensitive hashing scheme and identify duplicates by hash collisions.
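
The abstract does not spell out the data transformations or the hash functions, so the following is only an illustrative sketch of the general idea, not the chapter's method. It assumes each spectrum is a list of (¹H ppm, ¹³C ppm) peak positions, maps the peaks to a set of grid cells (a hypothetical discretization with made-up step sizes), and uses min-hashing with banding as one common locality-sensitive hashing scheme for sets. The function names and parameters are all assumptions made for the example.

```python
import hashlib
import math
from collections import defaultdict
from itertools import combinations


def discretize(peaks, h_step=0.05, c_step=0.5):
    """Map (1H ppm, 13C ppm) peaks to a set of integer grid cells.

    The step sizes are illustrative assumptions, not values from the chapter.
    """
    return frozenset(
        (math.floor(h / h_step), math.floor(c / c_step)) for h, c in peaks
    )


def _cell_hash(seed, cell):
    """Deterministic 64-bit hash of a grid cell under a given seed."""
    data = f"{seed}:{cell[0]}:{cell[1]}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(cells, num_hashes):
    """Min-hash signature of a non-empty cell set.

    Two sets agree on one signature entry with probability equal to
    their Jaccard similarity.
    """
    return tuple(min(_cell_hash(seed, c) for c in cells)
                 for seed in range(num_hashes))


def candidate_duplicates(spectra, bands=8, rows=2):
    """Return pairs of spectrum names that collide in at least one band."""
    buckets = defaultdict(set)
    for name, peaks in spectra.items():
        sig = minhash_signature(discretize(peaks), bands * rows)
        for b in range(bands):
            # Each band of `rows` signature entries acts as one hash key.
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


if __name__ == "__main__":
    spectra = {
        "a": [(7.26, 128.4), (3.71, 55.2), (1.22, 14.1)],
        "b": [(7.27, 128.3), (3.72, 55.3), (1.23, 14.2)],  # slightly shifted copy of "a"
        "c": [(2.12, 30.7), (5.36, 101.2)],                # unrelated spectrum
    }
    print(candidate_duplicates(spectra))
```

In this sketch, small shifts that stay within a grid cell leave the cell set unchanged, while a peak that crosses a cell boundary changes only part of the set, so the two spectra can still collide in some band. Only colliding pairs need a closer comparison, which is where the runtime gain over pairwise comparison of all spectra would come from.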

Author information

Authors and Affiliations

Leibniz Institute of Plant Biochemistry, Department of Stress and Developmental Biology, Germany
Björn Egert & Steffen Neumann
Institute of Computer Science, Martin-Luther-University of Halle-Wittenberg, Germany
Alexander Hinneburg

Editor information

Sarah Cohen-Boulakia, Val Tannen

About this paper

Cite this paper

Egert, B., Neumann, S., Hinneburg, A. (2007). Fast Approximate Duplicate Detection for 2D-NMR Spectra. In: Cohen-Boulakia, S., Tannen, V. (eds) Data Integration in the Life Sciences. DILS 2007. Lecture Notes in Computer Science, vol 4544. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73255-6_13

DOI: https://doi.org/10.1007/978-3-540-73255-6_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73254-9
Online ISBN: 978-3-540-73255-6
eBook Packages: Computer Science, Computer Science (R0)
