Abstract
The Ngram Statistics Package (NSP) is a flexible and easy-to-use software tool that supports the identification and analysis of Ngrams, sequences of N tokens in online text. We have designed and implemented NSP to be easy to customize to particular problems and yet remain general enough to serve a broad range of needs. This paper provides an introduction to NSP while raising some general issues in Ngram analysis, and summarizes several applications where NSP has been successfully employed. NSP is written in Perl and is freely available under the GNU General Public License.
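To make the notion of an Ngram concrete, the following is a minimal sketch of counting sequences of N consecutive tokens. It is written in Python for brevity and is not NSP's own code or interface; the function name `count_ngrams` and the tokenization rule (lowercased runs of word characters) are assumptions made only for this illustration, and NSP's actual tokenization may differ.

```python
from collections import Counter
import re


def count_ngrams(text, n=2):
    """Count every sequence of n consecutive tokens in text."""
    # Assumed tokenization: lowercase runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    # Slide a window of width n across the token list; a text with
    # fewer than n tokens yields an empty Counter.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


counts = count_ngrams("the cat sat on the mat and the cat slept", n=2)
print(counts.most_common(1))  # [(('the', 'cat'), 2)]
```

Frequency counts of this kind are the usual starting point for the statistical analysis of Ngrams that the abstract describes.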
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Banerjee, S., Pedersen, T. (2003). The Design, Implementation, and Use of the Ngram Statistics Package. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2003. Lecture Notes in Computer Science, vol 2588. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36456-0_38
DOI: https://doi.org/10.1007/3-540-36456-0_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00532-2
Online ISBN: 978-3-540-36456-6
eBook Packages: Springer Book Archive