Big Data Analytics, Text Mining and Modern English Language

Journal of Grid Computing

Abstract

The modern English language took centuries to evolve from old English. The old English word ‘hath’, for example, took centuries to become ‘have’ in modern English. Had these changes not occurred, modern words would not have been possible. A text written in the fifteenth century can be difficult to read, and if we go back a couple more centuries, it is like reading a different language. In this paper, we use text mining techniques to analyze the old and modern English languages. We introduce the Common-Words Counting algorithm, which identifies common words of the 15th century that gradually diminish in later centuries. We computed the speed of linguistic change and identified the reasons behind it. For this purpose, 34,000 books by different authors, written between the 15th and 19th centuries, were downloaded from Project Gutenberg and grouped into five categories, one per century. We selected the most common words from the 15th-century books and calculated their frequencies in the other centuries. We calculated the sum of the Term Frequency-Inverse Document Frequency (TF-IDF) values of these words and showed that their frequencies decreased from the 15th century to the 19th century, with some words, such as ‘doth’, ‘hath’, ‘punt’, ‘guise’ and ‘selfe’, even disappearing in later centuries. We calculated the speed at which words changed using the slope formula. We showed that words changed during every century, with the speed of change lowest between the 16th and 17th centuries and highest between the 18th and 19th centuries, which indicates that old words or their spellings were replaced by modern ones during the 18th and 19th centuries. Industrialization, modernization, and British Empire invasions were the key factors that changed the old English language into the modern English language.
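
The pipeline described in the abstract (pick the most common words of the earliest century, sum their TF-IDF scores per century, then measure the rate of change with the slope formula) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy corpus, the choice to treat each century as a single document, and the smoothed IDF variant are all assumptions made here for illustration, since the abstract does not specify them.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_common_words(tokens, k):
    """Return the k most frequent tokens (ties keep first-seen order)."""
    return [word for word, _ in Counter(tokens).most_common(k)]


def summed_tfidf(corpus, words):
    """Sum TF-IDF over `words` for each century.

    Assumption: each century is one document. IDF uses a smoothed form,
    log((1 + N) / (1 + df)) + 1, so words present in every century
    still contribute a non-zero score.
    """
    n_docs = len(corpus)
    counts = {century: Counter(tokens) for century, tokens in corpus.items()}
    scores = {}
    for century, count in counts.items():
        total_tokens = sum(count.values())
        score = 0.0
        for word in words:
            df = sum(1 for other in counts.values() if word in other)
            idf = math.log((1 + n_docs) / (1 + df)) + 1
            score += (count[word] / total_tokens) * idf
        scores[century] = score
    return scores


def slopes(scores):
    """Slope (y2 - y1) / (x2 - x1) between consecutive centuries."""
    centuries = sorted(scores)
    return {
        (a, b): (scores[b] - scores[a]) / (b - a)
        for a, b in zip(centuries, centuries[1:])
    }


# Hypothetical toy corpus keyed by century number; not data from the paper.
texts = {
    15: "he hath said and she doth go hath selfe",
    16: "he hath said and she does go",
    17: "he has said and she does go",
}
corpus = {century: tokenize(text) for century, text in texts.items()}

tracked = top_common_words(corpus[15], 3)  # common words of the earliest century
scores = summed_tfidf(corpus, tracked)     # one summed TF-IDF value per century
rates = slopes(scores)                     # change per century between neighbours

for (start, end), rate in rates.items():
    print(f"{start}th -> {end}th century: slope = {rate:.4f}")
```

A negative slope means the tracked words became less prominent from one century to the next, and a larger magnitude means faster change. On a real corpus, stop-word filtering and per-book documents would likely be needed, which this sketch leaves out.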

Author information

Corresponding author

Correspondence to Saqib Alam.

About this article

Cite this article

Alam, S., Yao, N. Big Data Analytics, Text Mining and Modern English Language. J Grid Computing 17, 357–366 (2019). https://doi.org/10.1007/s10723-018-9452-4

  • DOI: https://doi.org/10.1007/s10723-018-9452-4
