Abstract
This paper introduces a multi-layer corpus architecture with multiple tokenizations using the open source historical, diachronic corpus of German called Register in Diachronic German Science. The corpus contains herbal texts printed between the fifteenth and nineteenth centuries and is concerned with the development of a German scientific register, independent of Latin. We will discuss difficulties of transcribing, normalizing and annotating historical texts and will thereby argue for the advantages of multiple layers and multiple tokenizations. A virtually infinite number of annotations can be added to the corpus, without the need for deciding between or discarding interpretations. Thus, this flexible architecture enables multiple normalizations and types of annotation and is open to a wide range of research questions in the humanities. We provide case studies concerning the exploitation of our different normalizations as well as structural, register-specific and linguistic annotations. The corpus architecture allows for its reuse as a resource for corpus-based research approaches.
Notes
http://korpling.german.hu-berlin.de/ridges/index_en.html. The corpus is freely available under a CC-BY license at the LAUDATIO Repository http://hdl.handle.net/11022/0000-0000-2D85-8. Accessed 1 March 2016.
The corpus texts were collected and initially prepared in several graduate and undergraduate seminars at Humboldt-Universität zu Berlin. The texts were extensively corrected and checked for consistency before publication. The corpus is growing; Version 5 (containing 36 excerpts, 183,724 tokens) was published in June 2016.
The size of the text excerpts is chosen depending on the teaching context, i.e. whether the data is collected in a graduate or undergraduate seminar.
Bayerische Staatsbibliothek https://www.bsb-muenchen.de/, Münchener Digitalisierungszentrum http://www.digitale-sammlungen.de/, Universitätsbibliothek Heidelberg http://www.ub.uni-heidelberg.de/helios/digi/digilit.html. Accessed 1 March 2016. The corpus is currently based on printed texts only. We used the original version wherever possible (that is, wherever we were able to find a high-quality scan) and the earliest available version otherwise. The complete bibliographical information for each text is given in the metadata. We plan to add some manuscripts at a later stage, and also envision adding some of the Latin sources.
https://books.google.de/. Accessed 1 March 2016.
http://corpus-tools.org/pepper. Accessed 8 June 2016.
ANNIS, which stands for ANNotation of Information Structure, was originally designed to provide access to the data of the SFB 632—Information Structure, see http://corpus-tools.org/annis/. Accessed 1 March 2016.
LAUDATIO, which stands for Long-term Access and Usage of Deeply Annotated Information, is an open access repository for historical corpora. http://www.laudatio-repository.org. Accessed 1 March 2016.
There is an ongoing discussion in corpus linguistics on what constitutes primary data (cf. Claridge 2008; Himmelmann 2012); the discussion involves the roles of originals, pictures (scans), transcriptions, and normalizations. Here, we focus on the technical features of a corpus and do not engage in this discussion. We will briefly come back to the different notions of ‘text’ in Sect. 3.5.
http://corpus-tools.org/salt/. Accessed 8 June 2016.
Bird and Liberman (2001) proposed to use character offsets as a substitute for time-stamps in written texts, but since different tokenizations can have different base texts (unlike Fig. 1, where the exact same character sequence is tokenized in different ways) this is not applicable to our model. But even without time-stamps, the structure of a timeline allows us to model the alignment between different tokenizations. In contrast to Salt, the PAULA and ANNIS data models do not have the explicit concept of a timeline and thus need a different way to encode it. The solution to this problem is an automatic creation of a single artificial minimal tokenization (cf. Krause et al. 2012), where each artificial token corresponds to a timeline item. The conceptual tokenizations are represented as annotations on top of these artificial tokens and are flagged as segmentation layers. Technically, a segmentation layer is just a normal annotation layer, but flagging it as a segmentation layer makes it behave like one of a set of alternative tokenization layers that the search engine, ANNIS, treats as the basic text of a document. This affects both the initial view of search results and the ability to define search context and distance between search elements.
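The derivation of artificial minimal tokens described in this note can be sketched as follows. This is a minimal illustration, not the Salt, PAULA or ANNIS implementation: it assumes that each tokenization is available as a list of spans over a shared integer timeline, and the function name `minimal_tokenization`, the layer name `norm` and the token values are invented for the example.

```python
from typing import Dict, List, Tuple

# (start, end, value): a token spanning timeline points start..end.
# The integer timeline and all token values below are invented for illustration.
Span = Tuple[int, int, str]


def minimal_tokenization(layers: Dict[str, List[Span]]):
    """Derive artificial minimal tokens and re-express each layer over them."""
    # Every boundary used by any tokenization becomes a cut point.
    points = sorted({p for spans in layers.values() for s, e, _ in spans for p in (s, e)})
    # Neighbouring cut points delimit one artificial token (one timeline item).
    artificial = list(zip(points, points[1:]))
    first = {start: i for i, (start, _) in enumerate(artificial)}
    last = {end: i for i, (_, end) in enumerate(artificial)}
    # Each conceptual token becomes an annotation span over artificial tokens,
    # given as (index of first artificial token, index of last one, value).
    segmentation = {
        name: [(first[s], last[e], value) for s, e, value in spans]
        for name, spans in layers.items()
    }
    return artificial, segmentation


layers = {
    "dipl": [(0, 1, "zu"), (1, 2, "samen")],  # two tokens in the diplomatic layer
    "norm": [(0, 2, "zusammen")],             # one token in a normalized layer
}
artificial, segmentation = minimal_tokenization(layers)
print(artificial)
print(segmentation["norm"])
```

In this toy case the single normalized token spans both artificial tokens, while each diplomatic token covers exactly one, which is how the alignment between tokenizations survives without time-stamps or character offsets.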
For the corpus documentation see http://hdl.handle.net/11022/0000-0000-8253-F. Accessed 16 March 2016.
http://korpling.german.hu-berlin.de/ridges. Accessed 16 March 2016.
For the official Unicode table see www.unicode.org. Accessed 1 March 2016. An anonymous reviewer asked why we opted for precomposed characters wherever possible rather than combining diacritics. In principle, the TEI standard takes an agnostic stance on this matter. Precomposed characters circumvent possible problems with regular expression engines that only have level 1 support for Unicode (e.g. when searching for a single grapheme cluster as described in http://unicode.org/reports/tr18/#Grapheme_Cluster_Mode). Not all glyphs have precomposed characters in Unicode; in those cases we use combining characters.
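The practical difference can be illustrated with a short sketch. It uses Python's standard `unicodedata` and `re` modules purely as a stand-in for an engine that matches code points rather than grapheme clusters; the helper name `to_precomposed` and the sample strings are assumptions made for this example and are not taken from the corpus pipeline.

```python
import re
import unicodedata


def to_precomposed(text: str) -> str:
    """Compose base letters and combining marks where Unicode allows it (NFC)."""
    return unicodedata.normalize("NFC", text)


decomposed = "a\u0308"                     # a + COMBINING DIAERESIS
precomposed = to_precomposed(decomposed)   # single code point U+00E4

# "." matches exactly one code point in Python's re module, so only the
# precomposed form is found when searching for "one character".
print(re.fullmatch(".", precomposed) is not None)
print(re.fullmatch(".", decomposed) is not None)

# u + COMBINING LATIN SMALL LETTER E has no precomposed counterpart,
# so NFC leaves the combining sequence untouched.
print(to_precomposed("u\u0364") == "u\u0364")
```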
This is generally true even for incunabula which may contain rare glyphs. The Medieval Unicode Fonts Initiative (MUFI, http://folk.uib.no/hnooh/mufi/) is concerned with adding special characters represented in older texts to the Unicode standard. Accessed 1 March 2016.
TEI stands for Text Encoding Initiative, for an introduction see Romary (2009) and Sect. 3.3. http://www.tei-c.org. Accessed 10 May 2016.
See Voigt (2013) for guidelines, http://korpling.german.hu-berlin.de/ridges/download/v4/cleanV2README.txt. Accessed 1 March 2016.
Note that there is a different way of dealing with the search problem, namely the mapping of different forms in the search itself, also known as fuzzy search. For further references on automatic normalization see Sect. 3.5.
Another problem of this approach is a conceptual one: Is it useful to map forms of one language to forms (and ultimately categories) of another language? Which interesting distinctions and properties are lost? This issue (similar to the debate about the comparative fallacy in second language acquisition research, see Bley-Vroman 1983) is interesting and needs to be discussed further.
The text also contains the form das in both interpretations. The choice between das and dz seems to be driven by typographic needs. It seems that the correct alignment within the print space plays an important role for the early printers and that (at least sometimes) the choice of the shorter/longer form is driven by the need for less/more space rather than by linguistic considerations.
There are, among many others, Deutsches Textarchiv http://www.deutschestextarchiv.de/ (Geyken et al. 2012), the Duisburg-Leipzig Korpus romanischer Zeitungssprachen http://home.uni-leipzig.de/burr/CorpusLing/Korpusanalyse/default.htm (Burr et al. 2015), and Coptic Scriptorium http://copticscriptorium.org/, see Zeldes and Schroeder (2015). Accessed 1 March 2016.
http://www.helsinki.fi/varieng/CoRD/corpora/HelsinkiCorpus/. Accessed 1 March 2016.
Petrova, Svetlana; Donhauser, Karin; Odebrecht, Carolin; T-Codex (Version 2.1), Humboldt-Universität zu Berlin. https://korpling.german.hu-berlin.de/~annis/T-CODEX/corpus_description_tatian2.1.pdf, http://hdl.handle.net/11022/0000-0000-850C-D. Accessed 21 March 2016.
Bennett, Paul; Durrell, Martin; Ensslin, Astrid; Scheible, Silke; Whitt, Richard; GerManC (Version 1.0), University of Manchester. http://www.llc.manchester.ac.uk/research/projects/germanc/. http://hdl.handle.net/11022/0000-0000-2D1B-1. Accessed 21 March 2016.
Fürstinnenkorrespondenzkorpus. Lühr, Rosemarie; Faßhauer, Vera; Prutscher, Daniela; Seidel, Henry; Fuerstinnenkorrespondenz (Version 1.1), Universität Jena, DFG. http://www.indogermanistik.uni-jena.de/Web/Projekte/Fuerstinnenkorr.htm. http://hdl.handle.net/11022/0000-0000-82A0-7. Accessed 21 March 2016.
Donhauser, Karin; Gippert, Jost; Lühr, Rosemarie; ddd-ad (Version 0.1), Humboldt-Universität zu Berlin. https://referenzkorpusaltdeutsch.wordpress.com/. http://hdl.handle.net/11022/0000-0000-7FC2-7. Accessed 21 March 2016.
The element <lang> http://www.tei-c.org/release/doc/tei-p5-doc/de/html/ref-lang.html and attribute xml:lang, and the element <hi> which can be attributed information about the font http://www.tei-c.org/release/doc/tei-p5-doc/de/html/ref-hi.html. Accessed 1 March 2016.
Due to the ongoing history of the corpus and the evolving annotation guidelines, not all texts contain annotation for German. If a document does not have an explicit annotation deu, we counted each dipl token without any annotation in the lang layer as German in the post hoc analysis.
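The post hoc counting rule can be written down as a small sketch. It assumes that the lang layer of a document is available as one value per dipl token, with `None` standing for a missing annotation; this representation and the function name `count_german_tokens` are hypothetical. For documents that do carry explicit deu annotation, the sketch simply counts the tokens annotated deu, which is an assumption rather than something stated in this note.

```python
from typing import List, Optional


def count_german_tokens(lang_per_dipl_token: List[Optional[str]]) -> int:
    """Count German dipl tokens in one document under the post hoc rule."""
    if "deu" in lang_per_dipl_token:
        # Assumption: documents with explicit annotation are counted as annotated.
        return sum(1 for lang in lang_per_dipl_token if lang == "deu")
    # No explicit deu anywhere: every token without a lang value counts as German.
    return sum(1 for lang in lang_per_dipl_token if lang is None)


# Invented toy documents, not corpus data.
print(count_german_tokens([None, "lat", None]))
print(count_german_tokens(["deu", "lat", "deu"]))
```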
http://www.loc.gov/standards/iso639-2/php/code_list.php. Accessed 1 March 2016.
Many of the texts contain Latin passages, ranging from words [often translations of the name of a herb or an illness, as in Example (1b)] to phrases and, sometimes, whole paragraphs. The texts also contain information (also often translations of the names) in other languages, such as Greek, French, or English.
Currently it is the ‘Rat für deutsche Rechtschreibung’ http://www.rechtschreibrat.com/. Accessed 1 March 2016.
http://sfs.uni-tuebingen.de/langbank/de/index.html. Accessed 16 March 2016.
References
Admoni, W. (1990). Historische Syntax des Deutschen. Tübingen: Niemeyer.
Archer, D., Kytö, M., Baron, A., Rayson, P., et al. (2015). Guidelines for normalising Early Modern English corpora. Decisions and Justifications. ICAME Journal, 39(1), 5–24. doi:10.1515/icame-2015-0001.
Baron, A., & Rayson, P. (2008). VARD 2: A tool for dealing with spelling variation in historical corpora. In Proceedings of postgraduate conference in corpus linguistics, PCCL 2008, Birmingham. http://acorn.aston.ac.uk/conf_proceedings.html. Accessed 4 Aug 2015.
Baron, A., Rayson, P., Archer, D., et al. (2009). Word frequency and key word statistics in historical corpus linguistics. International Journal of English Studies, 20(1), 41–67.
Bartsch, N., Dipper, S., Herbers, B., Kwekkeboom, S., Wegera, K.-P., Eschke, L., et al. (2011). Annotiertes Referenzkorpus Mittelhochdeutsch (1050–1350). In 33. Jahrestagung der Deutschen Gesellschaft für Sprachwissenschaft, DGfS-CL Poster Session 2011. Göttingen.
Belz, M., Odebrecht, C., Perlitz, L., & Voigt, V. (2015). Annotationsrichtlinien zu Ridges Herbology Version 4.1, Humboldt-Universität zu Berlin. http://korpling.german.hu-berlin.de/ridges/download/pubs/annotationGuidelines_v4.1.pdf. Accessed 16 March 2016.
Bentzinger, R. (2000). Die Kanzleisprachen. In W. Besch, A. Betten, O. Reichmann, & S. Sonderegger (Eds.), Sprachgeschichte. Ein Handbuch zur Geschichte der deutschen Sprache und ihrer Erforschung (2nd ed., Vol. 2, pp. 1665–1673). Berlin i.a.: de Gruyter.
Besch, W. (2003). Die Entstehung und Ausformung der neuhochdeutschen Schriftsprache/Standardsprache. In W. Besch, A. Betten, O. Reichmann, & S. Sonderegger (Eds.), Sprachgeschichte. Ein Handbuch zur Geschichte der deutschen Sprache und ihrer Erforschung (2nd ed., Vol. 3, pp. 2252–2296). Berlin i.a.: de Gruyter.
Biber, D., & Conrad, S. (2009). Register, genre, and style. Cambridge: Cambridge University Press.
Biber, D., & Gray, B. (2011a). Grammar emerging in the noun phrase: The influence of written language use. English Language and Linguistics, 15, 223–250. doi:10.1017/S1360674311000025.
Biber, D., & Gray, B. (2011b). The historical shift of scientific academic prose in English towards less explicit styles of expression: Writing without Verbs. In V. Bhatia, P. Sánchez, & P. Perez-Paredes (Eds.), Researching specialized languages (pp. 11–24). Amsterdam: John Benjamins. doi:10.1075/scl.47.
Bird, S., & Liberman, M. (2001). A formal framework for linguistic annotation. Speech Communication, 33(1–2), 23–60. doi:10.1016/S0167-6393(00)00068-6.
Bley-Vroman, R. (1983). The comparative fallacy in interlanguage studies: The case of systematicity. Language Learning, 33(1), 1–17.
Bohnet, B. (2010). Top accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd international conference on computational linguistics (pp. 89–97), Coling 2010. Beijing.
Bollmann, M., Dipper, S., Krasselt, J., & Petran, F. (2012). Manual and semi-automatic normalization of historical spelling—Case studies from Early New High German. In Proceedings of the first international workshop on language technology for Historical text(s), KONVENS 2011. Wien.
Bollmann, M., Petran, F., & Dipper, S. (2011). Applying rule-based normalization to different types of historical texts: An evaluation. In Proceedings of the 5th language and technology conference: Human language technologies as a challenge for computer science and linguistics, LTC 2011. Poznan.
Burr, E., Burkhardt, J., Potapenko, E., Sierig, R., & Concepción Durán, A. (2015). Das Duisburg-Leipzig Korpus romanischer Zeitungssprachen und sein Textmodell. In Proceedings Von Daten zu Erkenntnissen. 2. Jahrestagung des Verbandes der Digital Humanities im deutschsprachigen Raum, DHd 2015, Graz. http://gams.uni-graz.at/o:dhd2015.abstracts-gesamt. Accessed 22 March 2016.
Carletta, J., Evert, S., Heid, U., Kilgour, J., Robertson, J., Voormann, H., et al. (2003). The NITE XML toolkit: Flexible annotation for multi-modal language data. Behavior Research Methods, Instruments, and Computers, 35(3), 353–363. doi:10.3758/BF03195511.
Chiarcos, C., Dipper, S., Götze, M., Leser, U., Lüdeling, A., Ritz, J., et al. (2008). A flexible framework for integrating annotations from different tools and tag sets. Traitement Automatique des Langues, 49(2), 271–293.
Claridge, C. (2008). Historical corpora. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics. An international handbook (Vol. 1, pp. 242–259). Berlin i.a.: de Gruyter.
Craig, H., & Whipp, R. (2010). Old spellings, new methods: Automated procedures for indeterminate linguistic data. Literary and Linguistic Computing, 25(1), 37–52. doi:10.1093/llc/fqp033.
Dickinson, M., & Meurers, W. D. (2003). Detecting errors in part-of-speech annotation. In Proceedings of the 10th conference of the European chapter of the association for computational linguistics (pp. 107–114), EACL-03, Budapest. http://decca.osu.edu/publications/dickinson-meurers-03.html. Accessed 22 March 2016.
Diel, M., Fisseni, B., Lenders, W., & Schmitz, H. (2002). XML-Kodierung des Bonner Frühneuhochdeutschkorpus. IKP-Arbeitsbericht NF 02, Bonn. https://korpora.zim.uni-duisburg-essen.de/Fnhd/ikpab-nf02.pdf. Accessed 22 March 2016.
Dipper, S. (2005). XML-based stand-off representation and exploitation of multi-level linguistic annotation. In Proceedings of Berliner XML Tage (pp. 39–50), BXML 2005. Berlin.
Dipper, S., & Schultz-Balluff, S. (2013). The Anselm Corpus: Methods and perspectives of a parallel aligned corpus. In Proceedings of the NODALIDA workshop on computational historical linguistics (pp. 27–42), NEALT, Oslo. http://www.ep.liu.se/ecp/article.asp?issue=087&article=003. Accessed 22 March 2016.
Donhauser, K. (2015). Das Referenzkorpus Altdeutsch: Das Konzept, die Realisierung und die neuen Möglichkeiten. In J. Gippert & R. Gehrke (Eds.), Historical corpora. Challenges and perspectives (pp. 25–50). Tübingen: Narr.
Dudenredaktion. (2016). Duden online Wörterbuch. Berlin: Bibliographisches Institut GmbH. http://www.duden.de/woerterbuch. Accessed 23 March 2016.
Duden. (Ed.) (2005). Dudengrammatik. Band 4. 7. Auflage. Mannheim i.a.: Dudenverlag.
Durrell, M., Ensslin, A., & Bennett, P. (2007). The GerManC project. Sprache und Datenverarbeitung, 31, 71–80.
Ebert, R. P. (1978). Historische Syntax des Deutschen. Stuttgart: J.B. Metzlersche Verlagsbuchhandlung.
Ernst-Gerlach, A. (2013). Retrievalmethoden für historische Korpora mit nicht standardisierten Schreibweisen. Ph.D. thesis. Universität Duisburg. http://duepublico.uni-duisburg-essen.de/servlets/DerivateServlet/Derivate-33270/Ernst-Gerlach_Diss.pdf. Accessed 4 Aug 2015.
Gévaudan, P. (2002). Klassifikation der lexikalischen Entwicklungen. Semantische, morphologische und stratische Filiation. Ph.D. thesis, Universität Tübingen.
Geyken, A., Haaf, S., & Wiegand, F. (2012). The DTA-base format: A TEI-subset for the compilation of interoperable corpora. In Proceedings of the conference of the 11th conference on natural language processing (KONVENS)—Empirical methods in natural language processing (pp. 383–391), LThist 2012 Workshop. Vienna.
Gloning, T. (2007). Deutsche Kräuterbücher des 12. bis 18. Jahrhunderts. Textorganisation, Wortgebrauch, funktionale Syntax. In A. Meyer & J. Schulz-Grobert (Eds.), Gesund und krank im Mittelalter (pp. 9–88). Leipzig: Eudora-Verlag.
Habermann, M. (2001). Deutsche Fachtexte der frühen Neuzeit: naturkundlich-medizinische Wissensvermittlung im Spannungsfeld von Latein und Volkssprache. Studia linguistica Germanica (61). Berlin i.a.: de Gruyter.
Hartweg, F., & Wegera, K. (2005). Frühneuhochdeutsch. Eine Einführung in die Sprache des Spätmittelalters und der frühen Neuzeit. 2. Auflage. Tübingen: Niemeyer.
Heiden, S. (2010). The TXM platform: Building open-source textual analysis software compatible with the TEI encoding scheme. In R. Otoguro, K. Ishikawa, H. Umemoto, K. Yoshimoto, & Y. Harada (Eds.), 24th Pacific Asia conference on language, information and computation (pp. 389–398). Sendai, Japan.
Himmelmann, N. P. (2012). Linguistic data types and the interface between language documentation and description. Language Documentation and Conservation, 6, 187–207.
Höchli, S. (1981). Zur Geschichte der Interpunktion im Deutschen. Eine kritische Darstellung der Lehrschriften von der zweiten Hälfte des 15.Jahrhunderts bis zum Ende des 18. Jahrhunderts. Studia Linguistica Germanica (17). Berlin i.a.: de Gruyter.
Höder, S. (2012). Annotating ambiguity: Insights from a corpus-based study on syntactic change in Old Swedish. In T. Schmidt & K. Wörner (Eds.), Multilingual corpora and multilingual corpus analysis (pp. 245–271). Hamburg studies on multilingualism (14). Amsterdam i.a.: Benjamins.
Jurish, B. (2010). More than words: Using token context to improve canonicalization of historical German. Journal for Language Technology and Computational Linguistics, 25(1), 23–39.
Klein, W. P. (1999). Die Geschichte der meteorologischen Kommunikation in Deutschland. Eine historische Fallstudie zur Entwicklung von Wissenschaftssprachen. Postdoctoral thesis, Freie Universität Berlin.
Klein, T. (2013). Verknüpfung digitaler Lemmalisten historischer Sprachstufen des Deutschen. Wie und wozu? Talk at Arbeitsgespräch zur historischen Lexikographie. Universität Trier. https://www.uni-trier.de/fileadmin/forschung/maw/MWB/Arbeitsgespraech2013/Vortrag_Bullay_Klein.pdf. Accessed 22 March 2016.
Krause, T., Lüdeling, A., Odebrecht, C., & Zeldes, A. (2012). Multiple tokenizations in a diachronic corpus. In Exploring ancient languages through corpora conference, EALC 2012. Oslo.
Krause, T., & Zeldes, A. (2016). ANNIS3: A new architecture for generic corpus query and visualization. Digital Scholarship in the Humanities, 31(1), 118–139. doi:10.1093/llc/fqu057. Accessed 22 March 2016.
Kroch, A., Santorini, B., & Delfs, L. (2004). The Penn-Helsinki parsed corpus of Early Modern English (PPCEME), 1st edn. Department of Linguistics, University of Pennsylvania. CD-ROM.
Kroch, A., & Taylor, A. (2000). The Penn-Helsinki parsed corpus of Middle English (PPCME2), 2nd edn. Department of Linguistics, University of Pennsylvania. CD-ROM.
Kübler, S., & Zinsmeister, H. (2015). Corpus linguistics and linguistically annotated corpora. London i.a.: Bloomsbury.
Kytö, M. (1996). Manual to the diachronic part of the Helsinki corpus of English texts: Coding conventions and lists of source texts (3rd ed.). Helsinki: University of Helsinki, Department of English.
Kytö, M. (2011). Corpora and historical linguistics. Revista Brasileira de Linguística Aplicada, 11(2), 417–457. doi:10.1590/S1984-63982011000200007. Accessed 22 March 2016.
Kytö, M., & Pahta, P. (2012). Evidence from historical corpora up to the twentieth century. In T. Nevalainen & E. C. Traugott (Eds.), The Oxford handbook of the history of English (pp. 123–133). Oxford i.a.: Oxford University Press.
Lindauer, T. (1995). Genitivattribute. Eine morphosyntaktische Untersuchung zum deutschen DP/NP-System. Tübingen: Niemeyer.
Lüdeling, A. (2011). Corpora in linguistics: Sampling and annotation. In K. Grandin (Ed.), Going digital. Evolutionary and revolutionary aspects of digitization. Nobel symposium 147 (pp. 220–243). New York: Science History Publications.
Lüdeling, A., Poschenrieder, T., Faulstich, L. C., et al. (2005). DeutschDiachronDigital – Ein diachrones Korpus des Deutschen. Jahrbuch für Computerphilologie, 2004, 119–136.
Nerius, D. (2003). Graphematische Entwicklungstendenzen in der Geschichte des Deutschen. In W. Besch, A. Betten, O. Reichmann, & S. Sonderegger (Eds.), Sprachgeschichte. Ein Handbuch zur Geschichte der deutschen Sprache und ihrer Erforschung (2nd ed., Vol. 3, pp. 2461–2472). Berlin i.a.: de Gruyter.
Nerius, D. (2007). Deutsche Orthographie (4th ed.). Hildesheim i.a.: Olms.
Odebrecht, C. (2014). Modeling linguistic research data for a repository for historical corpora. In Proceedings of digital humanities 2014 conference, DH conference 2014 (pp. 284–285), Université de Lausanne, Lausanne. https://dh2014.files.wordpress.com/2014/07/dh2014_abstracts_proceedings_07-11.pdf. Accessed 22 March 2016.
Odebrecht, C., Krause, T., & Lüdeling, A. (2015). Austausch von historischen Texten verschiedener Sprachen über das LAUDATIO-Repository. In 37. Jahrestagung der Deutschen Gesellschaft für Sprachwissenschaft, DGfS-CL Poster Session 2015. Leipzig.
Pahta, P., & Taavitsainen, I. (2010). Scientific discourse. In A. H. Jucker & I. Taavitsainen (Eds.), Historical pragmatics (Vol. 8, pp. 549–586). Berlin: Mouton de Gruyter.
Paul, H. (1995). Prinzipien der Sprachgeschichte (10th ed.). Tübingen: Niemeyer.
Perlitz, L. (2014). Konkurrenz zwischen Wortbildung und Syntax: Historische Entwicklung von Benennung. Bachelorarbeit. Berlin: Humboldt-Universität zu Berlin.
Petrova, S., Solf, M., Ritz, J., Chiarcos, C., Zeldes, A., et al. (2009). Building and using a richly annotated interlinear diachronic corpus: The case of Old High German Tatian. Traitement automatique des langues, 50(2), 47–71.
Pilz, T. (2009). Nichtstandardisierte Rechtschreibung – Variationsmodellierung und rechnergestützte Variationsverarbeitung. Ph.D. thesis. Universität Duisburg-Essen.
Piotrowski, M. (2012). Natural language processing for historical texts. Synthesis lectures on human language technologies; 17. San Rafael: Morgan & Claypool.
Pörksen, U. (2003). Deutsche Sprachgeschichte und die Entwicklung der Naturwissenschaften – Aspekte einer Geschichte der Naturwissenschaftssprache und ihrer Wechselwirkung zur Gemeinsprache. In W. Besch, A. Betten, O. Reichmann, & S. Sonderegger (Eds.), Sprachgeschichte. Ein Handbuch zur Geschichte der deutschen Sprache und ihrer Erforschung (2nd ed., Vol. 1, pp. 193–210). Berlin i.a.: de Gruyter.
Pose, J., Lopez, P., & Romary, L. (2014). A generic formalism for encoding stand-off annotations in TEI. INRIA technical report. hal-01061548. Accessed 22 March 2016.
Reichmann, O., & Wegera, K.-P. (1993). Schreibung und Lautung. In O. Reichmann & K. P. Wegera (Eds.), Frühneuhochdeutsche Grammatik (pp. 13–163). Tübingen: Niemeyer.
Reynaert, M., Hendrickx, I., & Marquilhas, R. (2012). Historical spelling normalization. A comparison of two statistical methods: TICCL and VARD2. In Proceedings of the second workshop on annotation of corpora for research in the humanities, ACRH 2012. Lisbon.
Reznicek, M., Lüdeling, A., & Hirschmann, H. (2013). Competing target hypotheses in the Falko corpus: A flexible multi-layer corpus architecture. In A. Díaz-Negrillo (Ed.), Automatic treatment and analysis of learner corpus data (pp. 101–123). Amsterdam: John Benjamins.
Riecke, J. (2004). Die Frühgeschichte der mittelalterlichen medizinischen Fachsprache im Deutschen. Band 1: Untersuchungen, Band 2: Wörterbuch. Berlin, New York: Walter de Gruyter.
Riecke, J. (2007). Beiträge zum mittelalterlichen deutschen Wortschatz der Heilkunde. In A. Meyer & J. Schulz-Grobert (Eds.), Gesund und krank im Mittelalter. Marburger Beiträge zur Kulturgeschichte der Medizin (pp. 89–106). Leipzig: Eudora-Verlag.
Rissanen, M. (2008). Corpus linguistics and historical linguistics. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics. An international handbook (Vol. 1, pp. 53–68). Berlin i.a.: de Gruyter.
Rissanen, M. (2012). Corpora and the study of English historical syntax. In M. Kytö (Ed.), English corpus linguistics: Crossing paths (pp. 197–220). Amsterdam, New York: Rodopi.
Romary, L. (2009). Questions & answers for TEI newcomers. Jahrbuch für Computerphilologie, 10. Digital Libraries. http://arxiv.org/abs/0812.3563.
Schiller, A., Teufel, S., Stöckert, C., & Thielen, C. (1999). Guidelines für das Tagging deutscher Textcorpora mit STTS (Kleines und großes Tagset). Universität Stuttgart und Tübingen. For STTS Tag Table (1995/1999) see http://www.sfs.uni-tuebingen.de/resources/stts-1999.pdf. Accessed 1 March 2016.
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of international conference on new methods in language processing, 1994. Manchester.
Schmid, H. (2008). Tokenization and part-of-speech tagging. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics. An international handbook (Vol. 1, pp. 527–551). Berlin i.a.: de Gruyter.
Simmler, F. (2003). Geschichte der Interpunktionssysteme im Deutschen. In W. Besch, A. Betten, O. Reichmann, & S. Sonderegger (Eds.), Sprachgeschichte. Ein Handbuch zur Geschichte der deutschen Sprache und ihrer Erforschung (2nd ed., Vol. 3, pp. 2472–2504). Berlin i.a.: de Gruyter.
Splett, J. (2000). Wortbildung des Althochdeutschen. In W. Besch, A. Betten, O. Reichmann, & S. Sonderegger (Eds.), Sprachgeschichte. Ein Handbuch zur Geschichte der deutschen Sprache und ihrer Erforschung (2nd ed., Vol. 2, pp. 1213–1222). Berlin i.a.: de Gruyter.
Squires, C. (2010). Konstantes und Variables im Aufbau von deutschen mittelalterlichen heilkundlichen Texten und angrenzenden Textsorten. In A. Ziegler (Ed.), Diachronie, Althochdeutsch, Mittelhochdeutsch 1: Historische Textgrammatik und Historische Syntax des Deutschen (pp. 561–588). Berlin i.a.: de Gruyter.
Stede, M., & Neumann, A. (2014). Potsdam commentary corpus 2.0: Annotation for discourse research. In Proceedings of the language resources and evaluation conference (pp. 925–929), LREC 2014, Reykjavik.
Springmann, U., & Lüdeling, A. (submitted). Progress of OCR of early printings exemplified by the RIDGES Herbology Corpus.
TEI Consortium. (Eds.) (2015). TEI P5: Guidelines for electronic text encoding and interchange. Version 2.8.0. 2015-04-06. TEI Consortium. http://www.tei-c.org/Guidelines/P5/. Accessed 13 Aug 2015.
Vikør, L. (2004). Lingua Franca and international language. Verkehrssprache und Internationale Sprache. In U. Ammon (Ed.), Sociolinguistics. An international handbook of the science of language and society (pp. 328–334). Berlin i.a.: de Gruyter.
Voigt, V. (2013). Python Script for the Normalization Layer clean. Script and Documentation see point 6 at http://korpling.german.hu-berlin.de/ridges/documentation_v4_en.html. Accessed 1 March 2015.
Wolff, G. (2009). Deutsche Sprachgeschichte von den Anfängen bis zur Gegenwart (6th ed.). Tübingen and Basel: Narr Francke.
Zeldes, A., & Schroeder, C. T. (2015). Computational methods for Coptic: Developing and using part-of-speech tagging for digital scholarship in the humanities. Digital Scholarship in the Humanities, 31(1), 164–176. doi:10.1093/llc/fqv043. Accessed 22 March 2016.
Zipser, F., & Romary, L. (2010). A model oriented approach to the mapping of annotation formats using standards. In Proceedings of the workshop on language resource and language technology standards, LREC 2010. Malta.
Acknowledgements
We would like to thank Vivian Voigt and Laura Perlitz, our two very capable student assistants, who helped with many aspects of corpus creation and consistency checking. We would also like to thank the many students who took part in the digitization and basic annotation, as well as Uwe Springmann, Florian Zipser and three anonymous reviewers for comments that greatly improved the manuscript. The project has been generously funded by two Google Digital Humanities Research Awards.
Cite this article
Odebrecht, C., Belz, M., Zeldes, A. et al. RIDGES Herbology: designing a diachronic multi-layer corpus. Lang Resources & Evaluation 51, 695–725 (2017). https://doi.org/10.1007/s10579-016-9374-3
DOI: https://doi.org/10.1007/s10579-016-9374-3