A Study on Corpus Content Display and IP Protection

Jingyi Ma¹⁴,
Muyun Yang¹⁴,
Haoyong Wang¹⁵,
Conghui Zhu¹⁴ &
Bing Xu¹⁴

Part of the book series: Communications in Computer and Information Science (CCIS, volume 902)

Included in the following conference series:

International Conference of Pioneering Computer Scientists, Engineers and Educators


Abstract

Corpora play an important role in many research fields, especially natural language processing. Some research demos display detailed corpus content to highlight their contributions while overlooking the security of the corpus. In this paper, we explore the content leakage that results from such display by means of a crawler. A website that displays a corpus is crawled using a simple crawler algorithm combined with several strategies we present. We estimate that over 85% of the corpus can be downloaded, which poses a substantial threat to its IP rights. Finally, we discuss protection measures for content display and give suggestions for protecting information content through both technology and law.
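The abstract does not spell out the crawler or its strategies, so the following is only a minimal sketch of one plausible approach: breadth-first query expansion against a search-style display page. It assumes the site returns a capped number of sentences per query. The toy `CORPUS`, the `search` stand-in for the website, and the names `crawl` and `coverage` are all hypothetical and are not taken from the paper.

```python
from collections import deque

# Toy stand-in for the displayed corpus (hypothetical data, not from the paper).
CORPUS = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang at dawn",
    "dawn broke over the hills",
    "zebras graze quietly",
]


def search(word, limit=3):
    """Simulate a display page: return at most `limit` sentences containing `word`."""
    hits = [s for s in CORPUS if word in s.split()]
    return hits[:limit]


def crawl(seed_words, limit=3):
    """Breadth-first query expansion: each unseen word in a result becomes a new query."""
    queue = deque(seed_words)
    seen_words = set(seed_words)
    collected = set()
    while queue:
        word = queue.popleft()
        for sentence in search(word, limit):
            collected.add(sentence)
            for token in sentence.split():
                if token not in seen_words:
                    seen_words.add(token)
                    queue.append(token)
    return collected


def coverage(collected, corpus):
    """Fraction of corpus sentences recovered by the crawl."""
    return len(collected & set(corpus)) / len(corpus)


harvested = crawl(["cat"])
print(f"coverage: {coverage(harvested, CORPUS):.0%}")
```

In this toy setting a single seed word recovers four of the five sentences; the fifth shares no vocabulary with the rest and is never reached. Lowering `limit` reduces what each query exposes, which is one way a display page can slow this kind of leakage.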


Acknowledgments

This work is funded by the National Natural Science Foundation of China (No. 2017YFB1002102) and the National Key Research and Development Program of China (No. 91520204).

Author information

Authors and Affiliations

Computer Science and Technology, Harbin Institute of Technology, 92 West Dazhi Street, Harbin, 150001, China
Jingyi Ma, Muyun Yang, Conghui Zhu & Bing Xu
Institute of Foreign Languages, Agricultural University of Hebei, 289, Raining Temple Street, Baoding, 071001, Hebei, China
Haoyong Wang

Corresponding author

Correspondence to Muyun Yang.

Editor information

Editors and Affiliations

Zhengzhou University, Zhengzhou, Henan, China
Qinglei Zhou
Xidian University, Xi’an, Shaanxi, China
Qiguang Miao
Harbin Institute of Technology, Harbin, China
Hongzhi Wang
Harbin University of Science and Technology, Harbin, China
Wei Xie
Zhengzhou Institute of Technology, Zhengzhou, China
Yan Wang
National Academy of Guo Ding Institute of Data Science, Beijing, China
Zeguang Lu

About this paper

Cite this paper

Ma, J., Yang, M., Wang, H., Zhu, C., Xu, B. (2018). A Study on Corpus Content Display and IP Protection. In: Zhou, Q., Miao, Q., Wang, H., Xie, W., Wang, Y., Lu, Z. (eds) Data Science. ICPCSEE 2018. Communications in Computer and Information Science, vol 902. Springer, Singapore. https://doi.org/10.1007/978-981-13-2206-8_10

DOI: https://doi.org/10.1007/978-981-13-2206-8_10
Published: 09 September 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-2205-1
Online ISBN: 978-981-13-2206-8
eBook Packages: Computer Science, Computer Science (R0)
