Word forms, style and typology

2010, Glottotheory

Glottotheory 2010, Number 3/1, pp. 89–96.

Ioan-Iovitz Popescu¹, Ján Mačutek², Gabriel Altmann³

1 Introduction

Computing the vocabulary richness of a text has always been beset by two serious difficulties: the indicators usually depended on the text size N (= number of tokens in the text), and the variance of the vocabulary, Var(V), could not be derived and computed. There have been several attempts to overcome these difficulties, for example by Ejiri, Smith (1993), who normalized the relationship between V and N and proposed a useful but non-testable indicator. In our previous work (cf. Popescu, Mačutek, Altmann 2009; Popescu et al. 2010) we tried to bring into the computations the arc length formed between the individual frequencies of the rank-frequency distribution. In this case the vocabulary of word forms or lemmas represents simply the inventory (= number of types). Since the maximal value of the arc length can easily be computed, the arc length can be relativized and its variance can be derived, so that an asymptotic test is possible. Nevertheless, the arc length also increases with increasing text length; thus, in the end, one must try to normalize it in such a way that N plays only the role of a constant. In such a case an indicator based on the rank-frequency distribution can be used both for measuring style differences and for measuring differences between languages, i.e. for typological purposes.

In the past, attempts were frequently made to stabilize or normalize the type-token ratio (TTR) and to create an indicator of vocabulary richness, but the impact of text size could not be eliminated. The problems connected with these endeavours are shown in Wimmer, Altmann (1999). In the present paper we touch neither the TTR nor vocabulary richness but concentrate on the rank-frequency distribution alone.

Let V be the vocabulary of word forms in a text, {f_1, f_2, …, f_V} the sequence of ranked frequencies of word forms, and N the text length (= number of tokens).
Let the arc length be defined as

$$L = \sum_{r=1}^{V-1} \left[ (f_r - f_{r+1})^2 + 1 \right]^{1/2}, \tag{1}$$

i.e. the sum of Euclidean distances between neighbouring ranked frequencies. Then we define the indicator lambda as

$$\Lambda = \frac{L \, \log_{10} N}{N}, \tag{2}$$

which is relatively stable: the empirical values oscillate around a horizontal straight line, yet the indicator is able to differentiate texts and languages.

In order to show the relationship of L to N we present the results of computation for 526 texts in 28 languages in Figures 1a to 1c. In Figure 1a the arc length L is evidently dependent on text size N. Relativizing L by N merely changes the direction of regression, as can be seen in Figure 1b.

¹ I.-I. Popescu: iovitzu@gmail.com
² Ján Mačutek: jmacutek@yahoo.com
³ G. Altmann: RAM-Verlag@t-online.de

[Figure 1a. Dependence of L on N]
[Figure 1b. Dependence of L/N on N]
[Figure 1c. The indicator Λ]

However, the lambda indicator does not depend on N.⁴ It can express the difference between two texts or two languages. A text with a greater lambda is richer than a text in the same language with a smaller lambda. Further, a language with a higher mean lambda is more synthetic than a language with a smaller mean lambda. In this way the rank-frequency distribution can be used for at least two different purposes. The difference between two lambdas can be tested asymptotically using the normal distribution. To this end one must know the variance of lambda.
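Definitions (1) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are ours, and splitting on whitespace is a hypothetical stand-in for whatever word counter is actually used; the paper itself prescribes no implementation.

```python
import math
from collections import Counter


def ranked_frequencies(tokens):
    """Word-form frequencies in decreasing order (the rank-frequency sequence)."""
    return sorted(Counter(tokens).values(), reverse=True)


def arc_length(freqs):
    """Formula (1): sum of Euclidean distances between neighbouring ranked frequencies."""
    return sum(math.sqrt((a - b) ** 2 + 1) for a, b in zip(freqs, freqs[1:]))


def lambda_indicator(freqs):
    """Formula (2): Lambda = L * log10(N) / N, where N is the number of tokens."""
    n = sum(freqs)
    return arc_length(freqs) * math.log10(n) / n


# Toy text (assumption: whitespace tokenization, no lemmatization).
tokens = "the cat sat on the mat and the dog sat too".split()
freqs = ranked_frequencies(tokens)  # [3, 2, 1, 1, 1, 1, 1, 1], so N = 11 and V = 8

# Two steps of height 1 contribute sqrt(2) each, five flat steps contribute 1 each.
print("L      =", arc_length(freqs))  # equals 2*sqrt(2) + 5
print("Lambda =", lambda_indicator(freqs))
```

Only the ranked frequencies enter the computation, which is why a simple word counter suffices as input.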
In Popescu, Mačutek, Altmann (2009: 49–53) it has been shown that the variance can be approximated as

$$\operatorname{Var}(L) = \frac{N - f_1}{1 - \hat{p}_1} \sum_{r=2}^{V} \hat{a}_r^2 \, \hat{p}_r \left( 1 - \frac{\hat{p}_r}{1 - \hat{p}_1} \right) - 2 \, \frac{N - f_1}{(1 - \hat{p}_1)^2} \sum_{r=2}^{V-1} \sum_{s=r+1}^{V} \hat{a}_r \hat{a}_s \hat{p}_r \hat{p}_s \tag{3}$$

where

$$\hat{a}_r = - \frac{(N - f_1) \left( \dfrac{\hat{p}_{r-1} - \hat{p}_r}{1 - \hat{p}_1} \right)}{\sqrt{(N - f_1)^2 \left( \dfrac{\hat{p}_{r-1} - \hat{p}_r}{1 - \hat{p}_1} \right)^2 + 1}} + \frac{(N - f_1) \left( \dfrac{\hat{p}_r - \hat{p}_{r+1}}{1 - \hat{p}_1} \right)}{\sqrt{(N - f_1)^2 \left( \dfrac{\hat{p}_r - \hat{p}_{r+1}}{1 - \hat{p}_1} \right)^2 + 1}} \tag{4}$$

for r = 2, …, V−1, and

$$\hat{a}_V = - \frac{(N - f_1) \left( \dfrac{\hat{p}_{V-1} - \hat{p}_V}{1 - \hat{p}_1} \right)}{\sqrt{(N - f_1)^2 \left( \dfrac{\hat{p}_{V-1} - \hat{p}_V}{1 - \hat{p}_1} \right)^2 + 1}}. \tag{5}$$

Hence

$$\operatorname{Var}(\Lambda) = \left( \frac{\log_{10} N}{N} \right)^2 \operatorname{Var}(L). \tag{6}$$

Though the computation of the variances is lengthy, the test

$$u = \frac{\Lambda_1 - \Lambda_2}{\sqrt{\operatorname{Var}(\Lambda_1) + \operatorname{Var}(\Lambda_2)}} \tag{7}$$

can be performed mechanically. Skipping data presentation and computation details, we present only some results.

⁴ Indeed, the linear fit of lambda has a negligible slope of b = −0.00001, in spite of appearances, because of the high scale disparity of the axes. However, if we want a perfect slope b = 0, we need to change the lambda definition slightly to Λ = (L/N)(log N)^1.25.

2 Typology

If one considers several texts in one language, one can compute the mean of the lambdas; the more texts, the better. If we compare mean lambdas, the variances in formula (7) should each be divided by the respective number of texts. In Table 1 we present only the mean lambdas of 20 languages in order to show the differences in synthetism, without performing tests. The strongly synthetic languages have the greatest lambda, the strongly analytic languages the smallest. The data have been taken from Popescu et al. (2009, 2010).
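The test in (7), together with the adjustment for mean lambdas described at the start of this section, can be sketched as follows. The function names and the numerical inputs are hypothetical; dividing each variance by its number of texts is our reading of the sentence above, and the two-sided p-value simply applies the standard normal distribution on which the asymptotic test relies.

```python
import math


def u_statistic(lam1, var1, lam2, var2, n1=1, n2=1):
    """Formula (7). For mean lambdas, pass the numbers of texts n1 and n2;
    each variance is then divided by its number of texts (our interpretation)."""
    return (lam1 - lam2) / math.sqrt(var1 / n1 + var2 / n2)


def two_sided_p(u):
    """Two-sided p-value of u under the standard normal distribution."""
    return math.erfc(abs(u) / math.sqrt(2))


# Hypothetical lambdas and variances, for illustration only (not from the paper).
u = u_statistic(1.80, 0.004, 1.60, 0.005)
print("u =", u, " p =", two_sided_p(u))
```

With n1 = n2 = 1 the function reduces to the comparison of two single texts.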
The results are not fixed values; analysis of further texts would surely change some of them, and their empirical dispersion would become ever smaller.

Table 1. Indicator lambda for 20 languages (100 texts) (using data from Popescu, Mačutek, Altmann 2009, Table 5.18)

Language      Mean Λ (decreasing)
Latin         2.1376
Hungarian     2.1116
Kannada       1.9899
Czech         1.8312
Romanian      1.8100
Russian       1.7231
Marathi       1.7139
German        1.6476
Slovenian     1.6109
Bulgarian     1.5747
Italian       1.5243
Indonesian    1.4757
Tagalog       1.3499
English       1.2795
Lakota        1.2460
Marquesan     0.9421
Maori         0.8954
Rarotongan    0.8836
Samoan        0.8277
Hawaiian      0.6337

The above result probably shows both extremes of the synthetism/analytism scale. Using texts, the indicator lambda can help to trace the morphological evolution of a language without studying particular cases. It is not as simple as Greenberg's indicators, but one does not need pre-analyzed corpora; a simple word counter is sufficient. The index can also be helpful in studying the morphological diversification within a language family.

As an example, we used the Kelih corpus (Kelih 2009, 2009a), containing ten chapters of the Russian original of the book Kak zakaljalas´ stal´ by N. Ostrovskij and its translations into 11 Slavic languages. The results are presented in Table 2. As can be seen, the values for Czech, Russian, Slovenian and Bulgarian, which also appear in Table 1, are slightly different here. This may be caused by the unique style of the text; e.g. Russian has Λ = 1.72 in Table 1 but 1.95 in Table 2. In any case, the table shows that within the Slavic languages Macedonian stands at the lower end of synthetism and Russian at its upper end. Curiously enough, the table also reflects the approximate geographic location of the Slavic languages, dividing them into East, West and South Slavic groups.
The southern branch displays the strongest trend towards analytism.

Table 2. Indicator lambda in 12 Slavic languages (the same text, each of 10 chapters) (using data from Popescu et al. 2010, Table 5.3)

Language      Mean Λ (decreasing)
Russian       1.9485
Belorussian   1.9247
Polish        1.9195
Ukrainian     1.9089
Czech         1.9038
Slovak        1.8842
Sorbian       1.8024
Slovenian     1.7937
Croatian      1.7786
Serbian       1.7761
Bulgarian     1.6134
Macedonian    1.5290

3 Style

Here the results do not cover style as a whole but are restricted to the choice of the morphological means of a language, i.e. the indicator lambda can show the differences between texts in one and the same language, and its validity concerns only the morphological aspect, in no case the choice of words, types of sentences, etc. Here we show two different kinds of texts. The first group contains the End-of-year speeches of Italian presidents, i.e. texts whose form and contents are not quite free but follow a line which displays some commonalities (cf. Tuzzi, Popescu, Altmann 2010). The results are presented in Table 3. Each president gave several speeches, the means of which are presented here.

Table 3. Indicator lambda in the End-of-year speeches of 10 Italian presidents (60 texts) (using data from Tuzzi, Popescu, Altmann 2010, Table 3.4)

President     Mean Λ (decreasing)
Einaudi       1.7279
Gronchi       1.6605
Leone         1.5943
Segni         1.5854
Saragat       1.5792
Napolitano    1.5749
Ciampi        1.5299
Cossiga       1.4588
Scalfaro      1.2991
Pertini       1.2578

As can be seen, the ordering according to lambda is not chronological, but the values are dispersed in the vicinity of the Italian mean presented in Table 1. One naturally asks what other properties of the texts lambda is associated with. At this very point a new field of quantitative investigation can be opened.
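The statement that the presidents' values scatter around the Italian mean of Table 1 can be checked directly from the figures in Table 3. The short sketch below is only a convenience computation on the published means (variable names are ours); it treats every president's mean as equally weighted, which is an assumption, since the number of speeches per president is not given here.

```python
from statistics import mean

# Mean lambdas from Table 3.
president_lambdas = {
    "Einaudi": 1.7279, "Gronchi": 1.6605, "Leone": 1.5943, "Segni": 1.5854,
    "Saragat": 1.5792, "Napolitano": 1.5749, "Ciampi": 1.5299,
    "Cossiga": 1.4588, "Scalfaro": 1.2991, "Pertini": 1.2578,
}
italian_mean_table1 = 1.5243  # Italian value from Table 1

mean_of_presidents = mean(president_lambdas.values())
above = [p for p, lam in president_lambdas.items() if lam > italian_mean_table1]
below = [p for p, lam in president_lambdas.items() if lam < italian_mean_table1]

print(round(mean_of_presidents, 4))  # 1.5268, close to the Table 1 value 1.5243
print(len(above), "presidents above,", len(below), "below the Table 1 mean")
```

The unweighted mean of the ten presidents differs from the Table 1 value only in the third decimal place, although the individual presidents range from 1.2578 to 1.7279.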
If one takes a group of thematically and formally non-uniform texts, one can also order the authors according to the richness of forms. In Table 4 we present the mean lambdas of 26 German authors.

Table 4. Mean lambdas of 26 German writers (253 texts) (using data from Popescu, Mačutek, Altmann 2009, Table 7.8)

Author        Mean Λ (decreasing)
Meyer         1.6891
Paul          1.6522
Droste        1.6108
Goethe        1.6071
Rückert       1.6044
Lessing       1.5514
Heine         1.5480
Chamisso      1.5248
Pseudonym     1.5208
Kafka         1.5153
Novalis       1.5108
Keller        1.4754
Hoffmann      1.4418
Sealsfield    1.3894
Schnitzler    1.3778
Arnim         1.3734
Busch         1.3569
Eichendorff   1.3505
Rieder        1.3436
Wedekind      1.3277
Raabe         1.3081
Löns          1.2247
Tucholsky     1.1688
Immermann     1.1151
Sudermann     1.0216
Storm         0.8886

As can be seen, in German the dispersion is relatively great. The mean for German in Table 1 (1.6476) results from the fact that all texts there are a few of Goethe's poems only (Gott und die Bajadere; Der Erlkönig; Elegies No. 2, 3, 13, 15, 19), while those in Table 4 represent 253 texts by 26 German writers. If we take the mean of the means in Table 4, we obtain for German Λ = 1.4038, which is probably a more realistic characteristic of German, but we assume that the "truth" lies somewhere in between. We may assume that the lambda indicator changes in the course of time and that different text sorts behave differently.

4 Other vistas

A further domain which can profit from this investigation is areal linguistics. It can be supposed that if there is areal influence between two non-related languages, this influence can also be reflected in morphology. However, a great number of samples is necessary to obtain a realistic mean lambda for a language, and in no case can spoken texts be neglected. This direction of research can become very laborious because the historical aspect must be taken into account.
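How strongly a mean lambda depends on the sample can be illustrated with the German figures quoted above. The sketch below recomputes the unweighted mean of the 26 author means in Table 4 and compares it with the German value of Table 1, which rests on a few Goethe poems only. Variable names are ours, and equal weighting of the authors is an assumption.

```python
from statistics import mean

# Mean lambdas of the 26 German writers from Table 4.
german_authors = {
    "Meyer": 1.6891, "Paul": 1.6522, "Droste": 1.6108, "Goethe": 1.6071,
    "Rückert": 1.6044, "Lessing": 1.5514, "Heine": 1.5480, "Chamisso": 1.5248,
    "Pseudonym": 1.5208, "Kafka": 1.5153, "Novalis": 1.5108, "Keller": 1.4754,
    "Hoffmann": 1.4418, "Sealsfield": 1.3894, "Schnitzler": 1.3778,
    "Arnim": 1.3734, "Busch": 1.3569, "Eichendorff": 1.3505, "Rieder": 1.3436,
    "Wedekind": 1.3277, "Raabe": 1.3081, "Löns": 1.2247, "Tucholsky": 1.1688,
    "Immermann": 1.1151, "Sudermann": 1.0216, "Storm": 0.8886,
}
german_table1 = 1.6476  # German value in Table 1

mean_of_means = mean(german_authors.values())
above_table1 = [a for a, lam in german_authors.items() if lam > german_table1]

print(len(german_authors), "authors, mean of means =", round(mean_of_means, 4))  # 26, 1.4038
print("authors above the Table 1 value:", above_table1)  # only Meyer and Paul
```

Only two of the 26 author means exceed the Table 1 value, which shows how far a mean based on a small, homogeneous sample can lie from one based on a broader sample.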
Further research is necessary in order to show that lambda is associated with properties like mean sentence length, the dependence structure of the sentence, the exploitation of word classes, psycholinguistic properties, text sorts, etc. Evidently, the lambda indicator, which liberates us from the weight of text length and is easily computable, has still more application possibilities not presently known to us.

References

Ejiri, K. – Smith, A. E. (1993). Proposal of a new 'Constraint measure' for text. In: Köhler, R., Rieger, B. B. (eds.), Contributions to quantitative linguistics: 195–211. Dordrecht: Kluwer.

Kelih, E. (2009). Slavisches Parallel-Textkorpus: Projektvorstellung von "Kak zakaljalas´ stal´" (KZS). In: Kelih, E., Levickij, V., Altmann, G. (eds.), Methods of text analysis: 106–124. Černivci: ČNU.

Kelih, E. (2009a). Preliminary analysis of a Slavic parallel corpus. In: Levická, J., Garabík, R. (eds.), NLP, Corpus Linguistics, Corpus Based Grammar Research. Fifth International Conference Smolenice, Slovakia, 25–27 November 2009. Proceedings: 175–183. Brno: Tribun.

Popescu, I.-I. – Mačutek, J. – Altmann, G. (2009). Aspects of word frequencies. Lüdenscheid: RAM.

Popescu, I.-I. – Mačutek, J. – Kelih, E. – Čech, R. – Best, K.-H. – Altmann, G. (2010). Vectors and codes of texts. Lüdenscheid: RAM.

Tuzzi, A. – Popescu, I.-I. – Altmann, G. (2010). Quantitative analysis of Italian texts. Lüdenscheid: RAM.