Abstract
Corpora play an important role in many research fields, especially natural language processing. Some research demos display detailed corpus content to highlight their contributions while overlooking the security of the corpus. In this paper, we use a crawler to explore the content leakage that results from such content display. We select a website that displays a corpus and crawl it with a simple crawler algorithm combined with several strategies that we present. We estimate that over 85% of the corpus can be downloaded, which poses a substantial threat to its intellectual property (IP) rights. Finally, we discuss protection measures for content display and offer suggestions for protecting information content through both technology and law.
Acknowledgments
This work was funded by the National Natural Science Foundation of China (No. 2017YFB1002102) and the National Key Research and Development Program of China (No. 91520204).
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Ma, J., Yang, M., Wang, H., Zhu, C., Xu, B. (2018). A Study on Corpus Content Display and IP Protection. In: Zhou, Q., Miao, Q., Wang, H., Xie, W., Wang, Y., Lu, Z. (eds) Data Science. ICPCSEE 2018. Communications in Computer and Information Science, vol 902. Springer, Singapore. https://doi.org/10.1007/978-981-13-2206-8_10
DOI: https://doi.org/10.1007/978-981-13-2206-8_10
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-2205-1
Online ISBN: 978-981-13-2206-8
eBook Packages: Computer Science (R0)