RIDGES Herbology: designing a diachronic multi-layer corpus

Carolin Odebrecht¹,
Malte Belz¹,
Amir Zeldes²,
Anke Lüdeling¹ &
Thomas Krause¹

Abstract

This paper introduces a multi-layer corpus architecture with multiple tokenizations using the open source historical, diachronic corpus of German called Register in Diachronic German Science. The corpus contains herbal texts printed between the fifteenth and nineteenth centuries and is concerned with the development of a German scientific register, independent of Latin. We will discuss difficulties of transcribing, normalizing and annotating historical texts and will thereby argue for the advantages of multiple layers and multiple tokenizations. A virtually infinite number of annotations can be added to the corpus, without the need for deciding between or discarding interpretations. Thus, this flexible architecture enables multiple normalizations and types of annotation and is open to a wide range of research questions in the humanities. We provide case studies concerning the exploitation of our different normalizations as well as structural, register-specific and linguistic annotations. The corpus architecture allows for its reuse as a resource for corpus-based research approaches.

Notes

http://korpling.german.hu-berlin.de/ridges/index_en.html. The corpus is freely available under a CC-BY license at the LAUDATIO Repository http://hdl.handle.net/11022/0000-0000-2D85-8. Accessed 1 March 2016.
The corpus texts were collected and initially prepared in several graduate and undergraduate seminars at Humboldt-Universität zu Berlin. The texts were extensively corrected and checked for consistency before publication. The corpus is growing; Version 5 (containing 36 excerpts, 183,724 tokens) was published in June 2016.
The size of the text excerpts is chosen depending on the teaching context, i.e. whether the data is collected in a graduate or undergraduate seminar.
Bayerische Staatsbibliothek https://www.bsb-muenchen.de/, Münchener Digitalisierungszentrum http://www.digitale-sammlungen.de/, Universitätsbibliothek Heidelberg http://www.ub.uni-heidelberg.de/helios/digi/digilit.html. Accessed 1 March 2016. The corpus is currently based on printed texts only. We used the original version wherever possible (that is, wherever we were able to find a high-quality scan) and the earliest available version otherwise. The complete bibliographical information for each text is given in the metadata. We plan to add some manuscripts at a later stage, and also envision adding some of the Latin sources.
https://books.google.de/. Accessed 1 March 2016.
http://corpus-tools.org/pepper. Accessed 8 June 2016.
ANNIS, which stands for ANNotation of Information Structure, was originally designed to provide access to the data of the SFB 632—Information Structure, see http://corpus-tools.org/annis/. Accessed 1 March 2016.
LAUDATIO, which stands for Long-term Access and Usage of Deeply Annotated Information, is an open access repository for historical corpora. http://www.laudatio-repository.org. Accessed 1 March 2016.
There is an ongoing discussion in corpus linguistics on what constitutes primary data (cf. Claridge 2008; Himmelmann 2012); it concerns the roles of originals, pictures (scans), transcriptions, and normalizations. Here, we focus on the technical features of a corpus and do not engage in this discussion. We will briefly come back to the different notions of ‘text’ in Sect. 3.5.
In Sects. 3.1 and 3.2 we will discuss the tokenization and normalization for historical German.
http://corpus-tools.org/salt/. Accessed 8 June 2016.
Bird and Liberman (2001) proposed to use character offsets as a substitute for time-stamps in written texts, but since different tokenizations can have different base texts (unlike Fig. 1, where the exact same character sequence is tokenized in different ways) this is not applicable to our model. But even without time-stamps, the structure of a timeline allows us to model the alignment between different tokenizations. In contrast to Salt, the PAULA and ANNIS data models do not have the explicit concept of a timeline and thus need a different way to encode it. The solution to this problem is an automatic creation of a single artificial minimal tokenization (cf. Krause et al. 2012), where each artificial token corresponds to a timeline item. The conceptual tokenizations are represented as annotations on top of these artificial tokens and are flagged as segmentation layers. Technically, a segmentation layer is just a normal annotation layer, but flagging it as a segmentation layer makes it behave like one of a set of alternative tokenization layers that the search engine, ANNIS, treats as the basic text of a document. This affects both the initial view of search results and the ability to define search context and distance between search elements.
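The construction described in the note above can be sketched in Python. This is only an illustration under stated assumptions, not the Salt, PAULA or ANNIS implementation: the `Span` class, the functions `minimal_tokens` and `as_annotations`, the layer names and the word forms are all hypothetical. Each tokenization is given as spans over a shared timeline; the artificial minimal tokenization is obtained by cutting the timeline at every boundary used by any layer, and each layer is then expressed as annotations on these minimal tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A token of one segmentation layer, anchored on the shared timeline."""
    start: int  # first timeline point covered (inclusive)
    end: int    # timeline point where the token ends (exclusive)
    text: str


def minimal_tokens(layers):
    """Cut the timeline at every token boundary found in any layer."""
    cuts = sorted({p for spans in layers.values() for s in spans for p in (s.start, s.end)})
    return list(zip(cuts, cuts[1:]))


def as_annotations(layers):
    """Represent every layer as annotations on the artificial minimal tokens."""
    rows = []
    for start, end in minimal_tokens(layers):
        row = {
            name: s.text
            for name, spans in layers.items()
            for s in spans
            if s.start <= start and end <= s.end
        }
        rows.append(((start, end), row))
    return rows


# Invented example forms; "dipl" and "norm" stand for two alternative tokenizations.
layers = {
    "dipl": [Span(0, 1, "vnder"), Span(1, 2, "einander"), Span(2, 4, "zumachen")],
    "norm": [Span(0, 2, "untereinander"), Span(2, 3, "zu"), Span(3, 4, "machen")],
}

for (start, end), row in as_annotations(layers):
    print(start, end, row)
```

In this sketch neither layer is privileged: a token that is split in one layer and joined in the other simply covers a different number of minimal tokens, which is what allows either layer to serve as the basic text for search context and distance.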
Other corpus projects using a similar corpus architecture are Falko (Reznicek et al. 2013), PCC (Stede and Neumann 2014), Referenzkorpus Altdeutsch (Donhauser 2015), or Coptic Scriptorium (Zeldes and Schroeder 2015).
For the corpus documentation see http://hdl.handle.net/11022/0000-0000-8253-F. Accessed 16 March 2016.
http://korpling.german.hu-berlin.de/ridges. Accessed 16 March 2016.
For the official Unicode table see www.unicode.org. Accessed 1 March 2016. An anonymous reviewer has asked why we have opted to use precomposed characters when possible and not to use combining diacritics. In principle, the TEI standard has taken an agnostic stance in this matter. Precomposed characters circumvent possible problems with regular expression engines that only have level 1 support for Unicode (e.g. when searching for a single grapheme cluster as described in http://unicode.org/reports/tr18/#Grapheme_Cluster_Mode). Not all glyphs have precomposed characters in Unicode and we use combining characters in this case.
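The point about precomposed characters can be illustrated with a small Python sketch. It uses only the standard `unicodedata` and `re` modules; Python's `re` serves here merely as one example of an engine whose `.` matches a single code point rather than a grapheme cluster, and the helper `matches_single_char` and the sample strings are assumptions made for illustration.

```python
import re
import unicodedata

precomposed = "\u00e4"      # ä stored as a single code point
decomposed = "a\u0308"      # a followed by COMBINING DIAERESIS
no_precomposed = "u\u0364"  # u followed by COMBINING LATIN SMALL LETTER E


def matches_single_char(s):
    """True if the whole string is matched by one '.' (one code point)."""
    return re.fullmatch(".", s) is not None


# NFC composes a base letter and diacritic when a precomposed character exists.
composed = unicodedata.normalize("NFC", decomposed)

print(matches_single_char(precomposed))  # the precomposed form is one unit
print(matches_single_char(decomposed))   # the combining sequence is not
print(composed == precomposed)
# Where Unicode has no precomposed character, the combining sequence remains.
print(len(unicodedata.normalize("NFC", no_precomposed)))
```

The last case corresponds to the glyphs for which combining characters have to be kept: searches over such sequences need either a pattern that spells out the combining character or an engine with grapheme cluster support.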
This is generally true even for incunabula which may contain rare glyphs. The Medieval Unicode Fonts Initiative (MUFI, http://folk.uib.no/hnooh/mufi/) is concerned with adding special characters represented in older texts to the Unicode standard. Accessed 1 March 2016.
TEI stands for Text Encoding Initiative, for an introduction see Romary (2009) and Sect. 3.3. http://www.tei-c.org. Accessed 10 May 2016.
See Voigt (2013) for guidelines, http://korpling.german.hu-berlin.de/ridges/download/v4/cleanV2README.txt. Accessed 1 March 2016.
Note that there is a different way of dealing with the search problem, namely the mapping of different forms in the search itself, also known as fuzzy search. For further references on automatic normalization see Sect. 3.5.
Duden (Dudenredaktion 2016) is the standard orthographic lexicon for German. Many other historical corpora follow modern reference lexicons in their normalization, cf. e.g. Rissanen (2012) and Donhauser (2015).
Another problem of this approach is a conceptual one: Is it useful to map forms of one language to forms (and ultimately categories) of another language? Which interesting distinctions and properties are lost? This issue (similar to the debate about the comparative fallacy in second language acquisition research, see Bley-Vroman 1983) is interesting and needs to be discussed further.
The text also contains the form das in both interpretations. The choice between das and dz seems to be driven by typographic needs. It seems that the correct alignment within the print space plays an important role for the early printers and that (at least sometimes) the choice of the shorter/longer form is driven by the need for less/more space rather than by linguistic considerations.
There is also considerable work within the framework of the TEI relating to normalization and tokenization, as well as suggestions for multi-layer standoff approaches within the standard (see Heiden 2010; Pose et al. 2014).
There are, among many others, Deutsches Textarchiv http://www.deutschestextarchiv.de/ (Geyken et al. 2012), the Duisburg-Leipzig Korpus romanischer Zeitungssprachen http://home.uni-leipzig.de/burr/CorpusLing/Korpusanalyse/default.htm (Burr et al. 2015), and Coptic Scriptorium http://copticscriptorium.org/, see Zeldes and Schroeder (2015). Accessed 1 March 2016.
http://www.helsinki.fi/varieng/CoRD/corpora/HelsinkiCorpus/. Accessed 1 March 2016.
Petrova, Svetlana; Donhauser, Karin; Odebrecht, Carolin; T-Codex (Version 2.1), Humboldt-Universität zu Berlin. https://korpling.german.hu-berlin.de/~annis/T-CODEX/corpus_description_tatian2.1.pdf, http://hdl.handle.net/11022/0000-0000-850C-D. Accessed 21 March 2016.
Bennett, Paul; Durrell, Martin; Ensslin, Astrid; Scheible, Silke; Whitt, Richard; GerManC (Version 1.0), University of Manchester. http://www.llc.manchester.ac.uk/research/projects/germanc/. http://hdl.handle.net/11022/0000-0000-2D1B-1. Accessed 21 March 2016.
Fürstinnenkorrespondenzkorpus. Lühr, Rosemarie; Faßhauer, Vera; Prutscher, Daniela; Seidel, Henry; Fuerstinnenkorrespondenz (Version 1.1), Universität Jena, DFG. http://www.indogermanistik.uni-jena.de/Web/Projekte/Fuerstinnenkorr.htm. http://hdl.handle.net/11022/0000-0000-82A0-7. Accessed 21 March 2016.
Donhauser, Karin; Gippert, Jost; Lühr, Rosemarie; ddd-ad (Version 0.1), Humboldt-Universität zu Berlin. https://referenzkorpusaltdeutsch.wordpress.com/. http://hdl.handle.net/11022/0000-0000-7FC2-7. Accessed 21 March 2016.
The element <lang> http://www.tei-c.org/release/doc/tei-p5-doc/de/html/ref-lang.html and attribute xml:lang, and the element <hi> which can be attributed information about the font http://www.tei-c.org/release/doc/tei-p5-doc/de/html/ref-hi.html. Accessed 1 March 2016.
Due to the ongoing history of the corpus and the evolving annotation guidelines, not all texts contain annotation for German. If a document does not have an explicit annotation deu, we counted each dipl token without any annotation in the lang layer as German in the post hoc analysis.
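The post hoc counting rule in the note above can be sketched as follows. This is a hedged illustration, not the script used for the analysis: the function `count_languages`, the token representation as `(dipl form, lang value)` pairs and the sample tokens are hypothetical. How unannotated tokens are treated in documents that do carry an explicit deu annotation is not stated in the note; the sketch assumes they are kept in a separate "unannotated" category.

```python
from collections import Counter


def count_languages(doc_tokens):
    """Count dipl tokens per language for one document.

    doc_tokens: list of (dipl_form, lang) pairs, where lang is an
    ISO 639-2 code such as "deu" or "lat", or None if unannotated.
    """
    has_explicit_deu = any(lang == "deu" for _, lang in doc_tokens)
    counts = Counter()
    for _, lang in doc_tokens:
        if lang is None and not has_explicit_deu:
            lang = "deu"  # rule from the note: unannotated counts as German
        counts[lang if lang is not None else "unannotated"] += 1
    return counts


# Invented sample documents.
older_doc = [("Wegerich", None), ("Plantago", "lat"), ("ist", None)]
newer_doc = [("Wegerich", "deu"), ("Plantago", "lat"), ("/", None)]

print(count_languages(older_doc))
print(count_languages(newer_doc))
```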
http://www.loc.gov/standards/iso639-2/php/code_list.php. Accessed 1 March 2016.
Many of the texts contain Latin passages, ranging from words [often translations of the name of a herb or an illness, as in Example (1b)] to phrases and, sometimes, whole paragraphs. The texts also contain information (also often translations of the names) in other languages, such as Greek, French, or English.
Currently it is the ‘Rat für deutsche Rechtschreibung’ http://www.rechtschreibrat.com/. Accessed 1 March 2016.
http://sfs.uni-tuebingen.de/langbank/de/index.html. Accessed 16 March 2016.

References

Admoni, W. (1990). Historische Syntax des Deutschen. Tübingen: Niemeyer.
Archer, D., Kytö, M., Baron, A., Rayson, P., et al. (2015). Guidelines for normalising Early Modern English corpora. Decisions and Justifications. ICAME Journal, 39(1), 5–24. doi:10.1515/icame-2015-0001.
Baron, A., & Rayson, P. (2008). VARD 2: A tool for dealing with spelling variation in historical corpora. In Proceedings of postgraduate conference in corpus linguistics, PCCL 2008, Birmingham. http://acorn.aston.ac.uk/conf_proceedings.html. Accessed 4 Aug 2015.
Baron, A., Rayson, P., Archer, D., et al. (2009). Word frequency and key word statistics in historical corpus linguistics. International Journal of English Studies, 20(1), 41–67.
Bartsch, N., Dipper, S., Herbers, B., Kwekkeboom, S., Wegera, K.-P., Eschke, L., et al. (2011). Annotiertes Referenzkorpus Mittelhochdeutsch (1050–1350). In 33. Jahrestagung der Deutschen Gesellschaft für Sprachwissenschaft, DGfS-CL Poster Session 2011. Göttingen.
Belz, M., Odebrecht, C., Perlitz, L., & Voigt, V. (2015). Annotationsrichtlinien zu Ridges Herbology Version 4.1, Humboldt-Universität zu Berlin. http://korpling.german.hu-berlin.de/ridges/download/pubs/annotationGuidelines_v4.1.pdf. Accessed 16 March 2016.
Bentzinger, R. (2000). Die Kanzleisprachen. In W. Besch, A. Betten, O. Reichmann, & S. Sonderegger (Eds.), Sprachgeschichte. Ein Handbuch zur Geschichte der deutschen Sprache und ihrer Erforschung (2nd ed., Vol. 2, pp. 1665–1673). Berlin i.a.: de Gruyter.
Besch, W. (2003). Die Entstehung und Ausformung der neuhochdeutschen Schriftsprache/Standardsprache. In W. Besch, A. Betten, O. Reichmann, & S. Sonderegger (Eds.), Sprachgeschichte. Ein Handbuch zur Geschichte der deutschen Sprache und ihrer Erforschung (2nd ed., Vol. 3, pp. 2252–2296). Berlin i.a.: de Gruyter.
Biber, D., & Conrad, S. (2009). Register, genre, and style. Cambridge: Cambridge University Press.
Biber, D., & Gray, B. (2011a). Grammar emerging in the noun phrase: The influence of written language use. English Language and Linguistics, 15, 223–250. doi:10.1017/S1360674311000025.
Biber, D., & Gray, B. (2011b). The historical shift of scientific academic prose in English towards less explicit styles of expression: Writing without Verbs. In V. Bhatia, P. Sánchez, & P. Perez-Paredes (Eds.), Researching specialized languages (pp. 11–24). Amsterdam: John Benjamins. doi:10.1075/scl.47.
Bird, S., & Liberman, M. (2001). A formal framework for linguistic annotation. Speech Communication, 33(1–2), 23–60. doi:10.1016/S0167-6393(00)00068-6.
Bley-Vroman, R. (1983). The comparative fallacy in interlanguage studies: The case of systematicity. Language Learning, 33(1), 1–17.
Bohnet, B. (2010). Top accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd international conference on computational linguistics (pp. 89–97), Coling 2010. Beijing.
Bollmann, M., Dipper, S., Krasselt, J., & Petran, F. (2012). Manual and semi-automatic normalization of historical spelling—Case studies from Early New High German. In Proceedings of the first international workshop on language technology for Historical text(s), KONVENS 2011. Wien.
Bollmann, M., Petran, F., & Dipper, S. (2011). Applying rule-based normalization to different types of historical texts: An evaluation. In Proceedings of the 5th language and technology conference: Human language technologies as a challenge for computer science and linguistics, LTC 2011. Poznan.
Burr, E., Burkhardt, J., Potapenko, E., Sierig, R., & Concepción Durán, A. (2015). Das Duisburg-Leipzig Korpus romanischer Zeitungssprachen und sein Textmodell. In Proceedings Von Daten zu Erkenntnissen. 2. Jahrestagung des Verbandes der Digital Humanities im deutschsprachigen Raum, DHd 2015, Graz. http://gams.uni-graz.at/o:dhd2015.abstracts-gesamt. Accessed 22 March 2016.
Carletta, J., Evert, S., Heid, U., Kilgour, J., Robertson, J., Voormann, H., et al. (2003). The NITE XML toolkit: Flexible annotation for multi-modal language data. Behavior Research Methods, Instruments, and Computers, 35(3), 353–363. doi:10.3758/BF03195511.
Chiarcos, C., Dipper, S., Götze, M., Leser, U., Lüdeling, A., Ritz, J., et al. (2008). A flexible framework for integrating annotations from different tools and tag sets. Traitement Automatique des Langues, 49(2), 271–293.
Claridge, C. (2008). Historical corpora. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics. An international handbook (Vol. 1, pp. 242–259). Berlin i.a.: de Gruyter.
Craig, H., & Whipp, R. (2010). Old spellings, new methods: Automated procedures for indeterminate linguistic data. Literary and Linguistic Computing, 25(1), 37–52. doi:10.1093/llc/fqp033.
Dickinson, M., & Meurers, W. D. (2003). Detecting errors in part-of-speech annotation. In Proceedings of the 10th conference of the European chapter of the association for computational linguistics (pp. 107–114), EACL-03, Budapest. http://decca.osu.edu/publications/dickinson-meurers-03.html. Accessed 22 March 2016.
Diel, M., Fisseni, B., Lenders, W., & Schmitz, H. (2002). XML-Kodierung des Bonner Frühneuhochdeutschkorpus. IKP-Arbeitsbericht NF 02, Bonn. https://korpora.zim.uni-duisburg-essen.de/Fnhd/ikpab-nf02.pdf. Accessed 22 March 2016.
Dipper, S. (2005). XML-based stand-off representation and exploitation of multi-level linguistic annotation. In Proceedings of Berliner XML Tage (pp. 39–50), BXML 2005. Berlin.
Dipper, S., & Schultz-Balluff, S. (2013). The Anselm Corpus: Methods and perspectives of a parallel aligned corpus. In Proceedings of the NODALIDA workshop on computational historical linguistics (pp. 27–42), NEALT, Oslo. http://www.ep.liu.se/ecp/article.asp?issue=087&article=003. Accessed 22 March 2016.
Donhauser, K. (2015). Das Referenzkorpus Altdeutsch: Das Konzept, die Realisierung und die neuen Möglichkeiten. In J. Gippert & R. Gehrke (Eds.), Historical corpora. Challenges and perspectives (pp. 25–50). Tübingen: Narr.
Dudenredaktion. (2016). Duden online Wörterbuch. Berlin: Bibliographisches Institut GmbH. http://www.duden.de/woerterbuch. Accessed 23 March 2016.
Duden. (Ed.) (2005). Dudengrammatik. Band 4. 7. Auflage. Mannheim i.a.: Dudenverlag.
Durrell, M., Ensslin, A., & Bennett, P. (2007). The GerManC project. Sprache und Datenverarbeitung, 31, 71–80.
Ebert, R. P. (1978). Historische Syntax des Deutschen. Stuttgart: J.B. Metzlersche Verlagsbuchhandlung.
Ernst-Gerlach, A. (2013). Retrievalmethoden für historische Korpora mit nicht standardisierten Schreibweisen. Ph.D. thesis. Universität Duisburg. http://duepublico.uni-duisburg-essen.de/servlets/DerivateServlet/Derivate-33270/Ernst-Gerlach_Diss.pdf. Accessed 4 Aug 2015.
Gévaudan, P. (2002). Klassifikation der lexikalischen Entwicklungen. Semantische, morphologische und stratische Filiation. Ph.D. thesis, Universität Tübingen.
Geyken, A., Haaf, S., & Wiegand, F. (2012). The DTA-base format: A TEI-subset for the compilation of interoperable corpora. In Proceedings of the conference of the 11th conference on natural language processing (KONVENS)—Empirical methods in natural language processing (pp. 383–391), LThist 2012 Workshop. Vienna.
Gloning, T. (2007). Deutsche Kräuterbücher des 12. bis 18. Jahrhunderts. Textorganisation, Wortgebrauch, funktionale Syntax. In A. Meyer & J. Schulz-Grobert (Eds.), Gesund und krank im Mittelalter (pp. 9–88). Leipzig: Eudora-Verlag.
Habermann, M. (2001). Deutsche Fachtexte der frühen Neuzeit: naturkundlich-medizinische Wissensvermittlung im Spannungsfeld von Latein und Volkssprache. Studia linguistica Germanica (61). Berlin i.a.: de Gruyter.
Hartweg, F., & Wegera, K. (2005). Frühneuhochdeutsch. Eine Einführung in die Sprache des Spätmittelalters und der frühen Neuzeit. 2. Auflage. Tübingen: Niemeyer.
Heiden, S. (2010). The TXM platform: Building open-source textual analysis software compatible with the TEI encoding scheme. In R. Otoguro, K. Ishikawa, H. Umemoto, K. Yoshimoto, & Y. Harada (Eds.), 24th Pacific Asia conference on language, information and computation (pp. 389–398). Sendai, Japan.
Himmelmann, N. P. (2012). Linguistic data types and the interface between language documentation and description. Language Documentation and Conservation, 6, 187–207.
Höchli, S. (1981). Zur Geschichte der Interpunktion im Deutschen. Eine kritische Darstellung der Lehrschriften von der zweiten Hälfte des 15.Jahrhunderts bis zum Ende des 18. Jahrhunderts. Studia Linguistica Germanica (17). Berlin i.a.: de Gruyter.
Höder, S. (2012). Annotating ambiguity: Insights from a corpus-based study on syntactic change in old Swedish. In T. Schmidt & K. Wörner (Eds.), Multilingual corpora and multilingual corpus analysis (pp. 245–271). Hamburg studies on multilingualism (14). Amsterdam i.a.: Benjamins.
Jurish, B. (2010). More than words: Using token context to improve canonicalization of historical German. Journal for Language Technology and Computational Linguistics, 25(1), 23–39.
Klein, W. P. (1999). Die Geschichte der meteorologischen Kommunikation in Deutschland. Eine historische Fallstudie zur Entwicklung von Wissenschaftssprachen. Postdoctoral thesis, Freie Universität Berlin.
Klein, T. (2013). Verknüpfung digitaler Lemmalisten historischer Sprachstufen des Deutschen. Wie und wozu? Talk at Arbeitsgespräch zur historischen Lexikographie. Universität Trier. https://www.uni-trier.de/fileadmin/forschung/maw/MWB/Arbeitsgespraech2013/Vortrag_Bullay_Klein.pdf. Accessed 22 March 2016.
Krause, T., Lüdeling, A., Odebrecht, C., & Zeldes, A. (2012). Multiple tokenizations in a diachronic corpus. In Exploring ancient languages through corpora conference, EALC 2012. Oslo.
Krause, T., & Zeldes, A. (2016). ANNIS3: A new architecture for generic corpus query and visualization. Digital Scholarship in the Humanities, 31(1), 118–139. doi:10.1093/llc/fqu057. Accessed 22 March 2016.
Kroch, A., Santorini, B., & Delfs, L. (2004). The Penn-Helsinki parsed corpus of Early Modern English (PPCEME), 1st edn. Department of Linguistics, University of Pennsylvania. CD-ROM.
Kroch, A., & Taylor, A. (2000). The Penn-Helsinki parsed corpus of Middle English (PPCME2), 2nd edn. Department of Linguistics, University of Pennsylvania. CD-ROM.
Kübler, S., & Zinsmeister, H. (2015). Corpus linguistics and linguistically annotated corpora. London i.a.: Bloomsbury.
Kytö, M. (1996). Manual to the diachronic part of the Helsinki corpus of English texts: Coding conventions and lists of source texts (3rd ed.). Helsinki: University of Helsinki, Department of English.
Kytö, M. (2011). Corpora and historical linguistics. Revista Brasileira de Linguística Aplicada, 11(2), 417–457. doi:10.1590/S1984-63982011000200007. Accessed 22 March 2016.
Kytö, M., & Pahta, P. (2012). Evidence from historical corpora up to the twentieth century. In T. Nevalainen & E. C. Traugott (Eds.), The Oxford handbook of the history of English (pp. 123–133). Oxford i.a.: Oxford University Press.
Lindauer, T. (1995). Genitivattribute. Eine morphosyntaktische Untersuchung zum deutschen DP/NP-System. Tübingen: Niemeyer.
Lüdeling, A. (2011). Corpora in linguistics: Sampling and annotation. In K. Grandin (Ed.), Going digital. Evolutionary and revolutionary aspects of digitization. Nobel symposium 147 (pp. 220–243). New York: Science History Publications.
Lüdeling, A., Poschenrieder, T., Faulstich, L. C., et al. (2005). DeutschDiachronDigital – Ein diachrones Korpus des Deutschen. Jahrbuch für Computerphilologie, 2004, 119–136.
Nerius, D. (2003). Graphematische Entwicklungstendenzen in der Geschichte des Deutschen. In W. Besch, A. Betten, O. Reichmann, & S. Sonderegger (Eds.), Sprachgeschichte. Ein Handbuch zur Geschichte der deutschen Sprache und ihrer Erforschung (2nd ed., Vol. 3, pp. 2461–2472). Berlin i.a.: de Gruyter.
Nerius, D. (2007). Deutsche Orthographie (4th ed.). Hildesheim i.a.: Olms.
Odebrecht, C. (2014). Modeling linguistic research data for a repository for historical corpora. In Proceedings of digital humanities 2014 conference, DH conference 2014 (pp. 284–285), Université de Lausanne, Lausanne. https://dh2014.files.wordpress.com/2014/07/dh2014_abstracts_proceedings_07-11.pdf. Accessed 22 March 2016.
Odebrecht, C., Krause, T., & Lüdeling, A. (2015). Austausch von historischen Texten verschiedener Sprachen über das LAUDATIO-Repository. In 37. Jahrestagung der Deutschen Gesellschaft für Sprachwissenschaft, DGfS-CL Poster Session 2015. Leipzig.
Pahta, P., & Taavitsainen, I. (2010). Scientific discourse. In A. H. Jucker & I. Taavitsainen (Eds.), Historical pragmatics (Vol. 8, pp. 549–586). Berlin: Mouton de Gruyter.
Paul, H. (1995). Prinzipien der Sprachgeschichte (10th ed.). Tübingen: Niemeyer.
Perlitz, L. (2014). Konkurrenz zwischen Wortbildung und Syntax: Historische Entwicklung von Benennung. Bachelorarbeit. Berlin: Humboldt-Universität zu Berlin.
Petrova, S., Solf, M., Ritz, J., Chiarcos, C., Zeldes, A., et al. (2009). Building and using a richly annotated interlinear diachronic corpus: The case of Old High German Tatian. Traitement automatique des langues, 50(2), 47–71.
Pilz, T. (2009). Nichtstandardisierte Rechtschreibung – Variationsmodellierung und rechnergestützte Variationsverarbeitung. Ph.D. thesis. Universität Duisburg-Essen.
Piotrowski, M. (2012). Natural language processing for historical texts. Synthesis lectures on human language technologies; 17. San Rafael: Morgan & Claypool.
Pörksen, U. (2003). Deutsche Sprachgeschichte und die Entwicklung der Naturwissenschaften – Aspekte einer Geschichte der Naturwissenschaftssprache und ihrer Wechselwirkung zur Gemeinsprache. In W. Besch, A. Betten, O. Reichmann, & S. Sonderegger (Eds.), Sprachgeschichte. Ein Handbuch zur Geschichte der deutschen Sprache und ihrer Erforschung (2nd ed., Vol. 1, pp. 193–210). Berlin i.a.: de Gruyter.
Pose, J., Lopez, P., & Romary, L. (2014). A generic formalism for encoding stand-off annotations in TEI. INRIA technical report. hal-01061548. Accessed 22 March 2016.
Reichmann, O., & Wegera, K.-P. (1993). Schreibung und Lautung. In O. Reichmann & K. P. Wegera (Eds.), Frühneuhochdeutsche Grammatik (pp. 13–163). Tübingen: Niemeyer.
Reynaert, M., Hendricks, I., & Marquilhas, R. (2012). Historical spelling normalization. A comparison of two statistical methods: TICCL and VARD2. In Proceedings of the second workshop on annotation of corpora for research in the humanities, ACRH 2012. Lisbon.
Reznicek, M., Lüdeling, A., & Hirschmann, H. (2013). Competing target hypotheses in the Falko corpus: A flexible multi-layer corpus architecture. In A. Díaz-Negrillo (Ed.), Automatic treatment and analysis of learner corpus data (pp. 101–123). Amsterdam: John Benjamins.
Riecke, J. (2004). Die Frühgeschichte der mittelalterlichen medizinischen Fachsprache im Deutschen. Band 1: Untersuchungen, Band 2: Wörterbuch. Berlin, New York: Walter de Gruyter.
Riecke, J. (2007). Beiträge zum mittelalterlichen deutschen Wortschatz der Heilkunde. In A. Meyer & J. Schulz-Grobert (Eds.), Gesund und krank im Mittelalter. Marburger Beiträge zur Kulturgeschichte der Medizin (pp. 89–106). Leipzig: Eudora-Verlag.
Rissanen, M. (2008). Corpus linguistics and historical linguistics. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics. An international handbook (Vol. 1, pp. 53–68). Berlin i.a.: de Gruyter.
Rissanen, M. (2012). Corpora and the study of English historical syntax. In M. Kytö (Ed.), English corpus linguistics: Crossing paths (pp. 197–220). Amsterdam, New York: Rodopi.
Romary, L. (2009). Questions & answers for TEI newcomers. Jahrbuch für Computerphilologie, 10. Digital Libraries. http://arxiv.org/abs/0812.3563.
Schiller, A., Teufel, S., Stöckert, C., & Thielen, C. (1999). Guidelines für das Tagging deutscher Textcorpora mit STTS (Kleines und großes Tagset). Universität Stuttgart und Tübingen. For STTS Tag Table (1995/1999) see http://www.sfs.uni-tuebingen.de/resources/stts-1999.pdf. Accessed 1 March 2016.
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of international conference on new methods in language processing, 1994. Manchester.
Schmid, H. (2008). Tokenization and part-of-speech tagging. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics. An international handbook (Vol. 1, pp. 527–551). Berlin i.a.: de Gruyter.
Simmler, F. (2003). Geschichte der Interpunktionssysteme im Deutschen. In W. Besch, A. Betten, O. Reichmann, & S. Sonderegger (Eds.), Sprachgeschichte. Ein Handbuch zur Geschichte der deutschen Sprache und ihrer Erforschung (2nd ed., Vol. 3, pp. 2472–2504). Berlin i.a.: de Gruyter.
Splett, J. (2000). Wortbildung des Althochdeutschen. In W. Besch, A. Betten, O. Reichmann, & S. Sonderegger (Eds.), Sprachgeschichte. Ein Handbuch zur Geschichte der deutschen Sprache und ihrer Erforschung (2nd ed., Vol. 2, pp. 1213–1222). Berlin i.a.: de Gruyter.
Squires, C. (2010). Konstantes und Variables im Aufbau von deutschen mittelalterlichen heilkundlichen Texten und angrenzenden Textsorten. In A. Ziegler (Ed.), Diachronie, Althochdeutsch, Mittelhochdeutsch 1: Historische Textgrammatik und Historische Syntax des Deutschen (pp. 561–588). Berlin i.a.: de Gruyter.
Stede, M., & Neumann, A. (2014). Potsdam commentary corpus 2.0: Annotation for discourse research. In Proceedings of the language resources and evaluation conference (pp. 925–929), LREC 2014, Reykjavik.
Springmann, U., & Lüdeling, A. (submitted). Progress of OCR of early printings exemplified by the RIDGES Herbology Corpus.
TEI Consortium. (Eds.) (2015). TEI P5: Guidelines for electronic text encoding and interchange. Version 2.8.0. 2015-04-06. TEI Consortium. http://www.tei-c.org/Guidelines/P5/. Accessed 13 Aug 2015.
Vikør, L. (2004). Lingua Franca and international language. Verkehrssprache und Internationale Sprache. In U. Ammon (Ed.), Sociolinguistics. An international handbook of the science of language and society (pp. 328–334). Berlin i.a.: de Gruyter.
Voigt, V. (2013). Python Script for the Normalization Layer clean. Script and Documentation see point 6 at http://korpling.german.hu-berlin.de/ridges/documentation_v4_en.html. Accessed 1 March 2015.
Wolff, G. (2009). Deutsche Sprachgeschichte von den Anfängen bis zur Gegenwart (6th ed.). Tübingen and Basel: Narr Francke.
Zeldes, A., & Schroeder, C. T. (2015). Computational methods for coptic: Developing and using part-of-speech tagging for digital scholarship in the humanities. Digital Scholarship in the Humanities, 31(1), 164–176. doi:10.1093/llc/fqv043. Accessed 22 March 2016.
Zipser, F., & Romary, L. (2010). A model oriented approach to the mapping of annotation formats using standards. In Proceedings of the workshop on language resource and language technology standards, LREC 2010. Malta.

Acknowledgements

We would like to thank Vivian Voigt and Laura Perlitz, our two very capable student assistants, who helped with many aspects of corpus creation and consistency checking. We would also like to thank the many students who took part in the digitization and basic annotation, as well as Uwe Springmann, Florian Zipser and three anonymous reviewers for comments that greatly improved the manuscript. The project has been generously funded by two Google Digital Humanities Research Awards.

Author information

Authors and Affiliations

Institut für deutsche Sprache und Linguistik, Humboldt-Universität zu Berlin, Dorotheenstraße 24, 10117, Berlin, Germany
Carolin Odebrecht, Malte Belz, Anke Lüdeling & Thomas Krause
Department of Linguistics, Georgetown University, Poulton Hall, 1421 37th St. NW, Washington, DC, 20057, USA
Amir Zeldes

Corresponding author

Correspondence to Carolin Odebrecht.

About this article

Cite this article

Odebrecht, C., Belz, M., Zeldes, A. et al. RIDGES Herbology: designing a diachronic multi-layer corpus. Lang Resources & Evaluation 51, 695–725 (2017). https://doi.org/10.1007/s10579-016-9374-3

Published: 01 December 2016
Issue Date: September 2017
DOI: https://doi.org/10.1007/s10579-016-9374-3