Abstract
Searching and mining nuclear magnetic resonance (NMR) spectra of naturally occurring substances is an important task in the investigation of new, potentially useful chemical compounds. Multi-dimensional NMR-spectra are relational objects like documents, but consist of continuous multi-dimensional points called peaks instead of words. We develop several mappings from continuous NMR-spectra to discrete text-like data. With the help of those mappings, any text retrieval method can be applied. We evaluate the performance of two retrieval methods, namely the standard vector space model and probabilistic latent semantic indexing (PLSI). PLSI learns hidden topics in the data, which, in the case of 2D-NMR data, is interesting in its own right. Additionally, we develop and evaluate a simple direct similarity function that can detect duplicates of NMR-spectra. Our experiments show that the vector space model and PLSI, both of which were designed for text data created by humans, can effectively handle the mapped NMR-data originating from natural products. Furthermore, PLSI is able to find meaningful "topics" in the NMR-data.
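To illustrate the general idea of mapping peaks to text-like tokens, here is a minimal sketch. It is not one of the mappings developed in the paper: the fixed grid, the bin widths `h_step` and `c_step`, the token format, and the example peak lists are all hypothetical choices made for illustration. Each 2D peak, given as a (1H ppm, 13C ppm) pair, becomes a "word" naming its grid cell, and two spectra are then compared with plain cosine similarity over raw term counts, as in a basic vector space model without term weighting.

```python
import math
from collections import Counter


def peaks_to_words(peaks, h_step=0.5, c_step=10.0):
    """Map each (1H ppm, 13C ppm) peak to a discrete grid-cell token.

    The bin widths are illustrative assumptions, not values from the paper.
    """
    words = []
    for h_ppm, c_ppm in peaks:
        h_bin = math.floor(h_ppm / h_step)
        c_bin = math.floor(c_ppm / c_step)
        words.append(f"h{h_bin}_c{c_bin}")
    return words


def cosine_similarity(words_a, words_b):
    """Cosine similarity between two bags of words using raw term counts."""
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up peak lists, used only to exercise the functions.
spectrum_a = [(1.20, 22.5), (3.70, 55.1), (7.10, 128.4)]
spectrum_b = [(1.30, 23.0), (3.80, 56.0), (6.90, 115.2)]

words_a = peaks_to_words(spectrum_a)
words_b = peaks_to_words(spectrum_b)
similarity = cosine_similarity(words_a, words_b)
print(words_a, words_b, round(similarity, 3))
```

A hard grid like this places two nearby peaks in different cells whenever they straddle a bin boundary, which is one reason to consider more than one mapping. Once spectra are token lists, any text retrieval method, including PLSI, can consume them unchanged.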
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Hinneburg, A., Porzel, A., Wolfram, K. (2007). An Evaluation of Text Retrieval Methods for Similarity Search of Multi-dimensional NMR-Spectra. In: Hochreiter, S., Wagner, R. (eds) Bioinformatics Research and Development. BIRD 2007. Lecture Notes in Computer Science, vol 4414. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71233-6_33
DOI: https://doi.org/10.1007/978-3-540-71233-6_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71232-9
Online ISBN: 978-3-540-71233-6
eBook Packages: Computer Science (R0)