GLOTTOTHEORY 2010, NUMBER 3/1, PP 89 – 96.
Word forms, style and typology
Ioan-Iovitz Popescu¹
Ján Mačutek²
Gabriel Altmann³
1 Introduction
Computing the vocabulary richness of a text has always been beset by two serious
difficulties: the indicators usually depended on the text size N (= number of tokens in the text),
and the variance of the vocabulary, Var(V), could not be derived and computed. There have been
several attempts to overcome these difficulties, for example by Ejiri, Smith (1993), who normalized
the relationship between V and N and proposed a useful but non-testable indicator.
In our previous work (cf. Popescu, Mačutek, Altmann 2009; Popescu et al. 2010) we
tried to bring into the computations the arc length formed between the individual frequencies of
the rank-frequency distribution. In this case the vocabulary of word-forms or lemmas
simply represents the inventory (= number of types). Since the maximal value of the arc
length can easily be computed, the arc length can be relativized and its variance can be
derived, so that an asymptotic test is possible. Nevertheless, the arc length also increases with
increasing text length; thus, in the end, one must normalize it in such a way that N plays only the
role of a constant. In that case an indicator based on the rank-frequency distribution can be
used for measuring both style differences and differences between languages, i.e. for
typological purposes. In the past, one frequently tried to stabilize or normalize the TTR (type-token ratio) and
to create an indicator of vocabulary richness, but one could not eliminate the impact of text size.
The problems connected with these endeavours are shown in Wimmer, Altmann (1999). In
the present paper we touch neither TTR nor vocabulary richness but concentrate on the
rank-frequency distribution alone.
Let V be the vocabulary of word-forms in a text, {f_1, f_2, …, f_V} the sequence of ranked
frequencies of word forms, and N the text length (= number of tokens). Let the arc length be
defined as
$$
L = \sum_{r=1}^{V-1} \left[ (f_r - f_{r+1})^2 + 1 \right]^{1/2}, \tag{1}
$$
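i.e. the sum of Euclidean distances between the neighbouring points (r, f_r) and (r+1, f_{r+1}) of the
rank-frequency sequence. Then we define the indicator lambda as
$$
\Lambda = \frac{L \, \log_{10} N}{N}, \tag{2}
$$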
which is relatively stable: the empirical values oscillate around a horizontal straight line, yet
the indicator is able to differentiate texts and languages.
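As a minimal sketch of formulas (1) and (2), the following Python functions compute L and Λ from a list of word-form frequencies. The function names and the toy frequency list are illustrative assumptions and do not come from the original study.

```python
import math

def arc_length(freqs):
    """Arc length L, formula (1): distances between neighbouring ranked frequencies."""
    ranked = sorted(freqs, reverse=True)  # f_1 >= f_2 >= ... >= f_V
    return sum(math.sqrt((a - b) ** 2 + 1) for a, b in zip(ranked, ranked[1:]))

def lambda_indicator(freqs):
    """Indicator lambda, formula (2): L * log10(N) / N, with N the number of tokens."""
    n_tokens = sum(freqs)
    return arc_length(freqs) * math.log10(n_tokens) / n_tokens

# Toy frequencies, invented for illustration: V = 5 types, N = 12 tokens.
toy = [5, 3, 2, 1, 1]
print(arc_length(toy), lambda_indicator(toy))
```

For the toy list the arc length is the sum √5 + √2 + √2 + 1, one term per pair of neighbouring ranks.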
In order to show the relationship of L to N we present the result of computation in 28
languages and 526 texts in Figures 1a to 1c. In Figure 1a arc length L is evidently dependent
on text size N.
¹ I.-I. Popescu: iovitzu@gmail.com
² Ján Mačutek: jmacutek@yahoo.com
³ G. Altmann: RAM-Verlag@t-online.de
Figure 1a. Dependence of L on N
Figure 1b. Dependence of L/N on N
Relativizing L by N merely changes the direction of regression, as can be seen in Figure 1b.
Figure 1c. The indicator Λ
However, the lambda-indicator does not depend on N.⁴ It can express the difference between
two texts or two languages. A text with a greater lambda is richer in word forms than a text in the same
language with a smaller lambda. Further, a language with a higher mean lambda is more
synthetic than a language with a smaller mean lambda. In this way the rank-frequency
distribution can be used for at least two different purposes.
The difference between two lambdas can be tested asymptotically using the normal distribution. To this end one must know the variance of lambda. In Popescu, Mačutek, Altmann
(2009: 49–53) it has been shown that the variance of L can be approximated as
$$
\mathrm{Var}(L) = \frac{N - f_1}{1 - \hat{p}_1} \sum_{r=2}^{V} \hat{a}_r^{2}\, \hat{p}_r \left( 1 - \frac{\hat{p}_r}{1 - \hat{p}_1} \right)
- 2\, \frac{N - f_1}{(1 - \hat{p}_1)^{2}} \sum_{r=2}^{V-1} \sum_{s=r+1}^{V} \hat{a}_r \hat{a}_s \hat{p}_r \hat{p}_s \tag{3}
$$
where
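$$
\hat{a}_r = - \frac{(N - f_1) \left( \dfrac{\hat{p}_{r-1} - \hat{p}_r}{1 - \hat{p}_1} \right)}{\sqrt{(N - f_1)^{2} \left( \dfrac{\hat{p}_{r-1} - \hat{p}_r}{1 - \hat{p}_1} \right)^{2} + 1}}
+ \frac{(N - f_1) \left( \dfrac{\hat{p}_r - \hat{p}_{r+1}}{1 - \hat{p}_1} \right)}{\sqrt{(N - f_1)^{2} \left( \dfrac{\hat{p}_r - \hat{p}_{r+1}}{1 - \hat{p}_1} \right)^{2} + 1}} \tag{4}
$$
for r = 2, …, V−1, and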
⁴ Indeed, the linear fit of lambda has a negligible slope of b = −0.00001, in spite of appearances, because of the
high scale disparity of the axes. However, if one wants a perfect slope of b = 0, the lambda definition must be
changed slightly to Λ = (L/N)(log N)^1.25.
$$
\hat{a}_V = - \frac{(N - f_1) \left( \dfrac{\hat{p}_{V-1} - \hat{p}_V}{1 - \hat{p}_1} \right)}{\sqrt{(N - f_1)^{2} \left( \dfrac{\hat{p}_{V-1} - \hat{p}_V}{1 - \hat{p}_1} \right)^{2} + 1}}. \tag{5}
$$
Hence
$$
\mathrm{Var}(\Lambda) = \left( \frac{\log_{10} N}{N} \right)^{2} \mathrm{Var}(L). \tag{6}
$$
Though the computation of the variances is lengthy, the test
$$
u = \frac{\Lambda_1 - \Lambda_2}{\sqrt{\mathrm{Var}(\Lambda_1) + \mathrm{Var}(\Lambda_2)}} \tag{7}
$$
can be performed mechanically.
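The chain from formula (3) to formula (7) can be sketched in Python as follows. This is only an illustration under stated assumptions: the relative frequencies are estimated as p̂_r = f_r / N (not spelled out in the text above), the function names are hypothetical, and the double sum in (3) is evaluated through the algebraically equivalent identity Σ_{r<s} x_r x_s = ((Σ x)² − Σ x²) / 2 to avoid a quadratic loop.

```python
import math

def arc_length(freqs):
    """Arc length L, formula (1)."""
    f = sorted(freqs, reverse=True)
    return sum(math.sqrt((a - b) ** 2 + 1) for a, b in zip(f, f[1:]))

def var_arc_length(freqs):
    """Approximate Var(L) following formulas (3)-(5)."""
    f = sorted(freqs, reverse=True)
    n_tokens, n_types = sum(f), len(f)
    if n_types < 2:
        raise ValueError("at least two word-form types are needed")
    p = [x / n_tokens for x in f]          # assumed estimate: p_r = f_r / N
    rest = n_tokens - f[0]                 # N - f_1
    q = [pr / (1 - p[0]) for pr in p]      # p_r / (1 - p_1)

    def slope(r):
        # Fraction in formulas (4)-(5) for the segment between 0-based ranks r, r + 1.
        d = rest * (q[r] - q[r + 1])
        return d / math.sqrt(d * d + 1)

    a = [0.0] * n_types                    # a[0] is unused: the sums start at rank 2
    for r in range(1, n_types - 1):
        a[r] = -slope(r - 1) + slope(r)    # formula (4)
    a[-1] = -slope(n_types - 2)            # formula (5)

    single = sum(a[r] ** 2 * p[r] * (1 - q[r]) for r in range(1, n_types))
    x = [a[r] * p[r] for r in range(1, n_types)]
    cross = (sum(x) ** 2 - sum(v * v for v in x)) / 2   # sum over r < s of x_r * x_s
    return rest / (1 - p[0]) * single - 2 * rest / (1 - p[0]) ** 2 * cross

def lambda_and_variance(freqs):
    """Lambda (formula 2) together with Var(lambda) (formula 6)."""
    n_tokens = sum(freqs)
    scale = math.log10(n_tokens) / n_tokens
    return scale * arc_length(freqs), scale ** 2 * var_arc_length(freqs)

def u_statistic(freqs_1, freqs_2):
    """Asymptotic normal test statistic of formula (7)."""
    lam_1, var_1 = lambda_and_variance(freqs_1)
    lam_2, var_2 = lambda_and_variance(freqs_2)
    return (lam_1 - lam_2) / math.sqrt(var_1 + var_2)

# Two toy frequency lists, invented for illustration only.
print(u_statistic([5, 3, 2, 1, 1], [9, 1, 1, 1]))
```

With real texts the frequency lists would come from a word count of each text; the toy lists serve only to show the call pattern.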
Skipping data presentation and computational details, we present only some results.
2 Typology
If one considers several texts in one language, one can compute the mean of their lambdas; the
more texts, the better. If mean lambdas are compared, each variance in formula (7) should
be divided by the respective number of texts. In Table 1 we present only the mean
lambdas of 20 languages in order to show the differences in synthetism, without performing
tests. The strongly synthetic languages have the greatest lambdas, the strongly analytic
languages the smallest. The data have been taken from Popescu et al. (2009,
2010). The results are not fixed values; analysis of further texts would surely change some of
them, and their empirical dispersion would become ever smaller.
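The adjustment of formula (7) for mean lambdas can be sketched as below. The function name is hypothetical, the variances are left as inputs because the text does not specify how they are pooled over the texts of a language, and all six numbers in the call are invented purely for illustration.

```python
import math

def u_for_means(mean_1, var_1, n_texts_1, mean_2, var_2, n_texts_2):
    """Formula (7) for mean lambdas: each variance is divided by its number of texts."""
    return (mean_1 - mean_2) / math.sqrt(var_1 / n_texts_1 + var_2 / n_texts_2)

# Invented values: two languages with 5 and 6 texts respectively.
u = u_for_means(1.80, 0.020, 5, 1.50, 0.030, 6)
print(u)
```

Under the usual normal approximation, an absolute value of u above 1.96 would indicate a significant difference at the 0.05 level in a two-sided test.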
Table 1
Indicator lambda for 20 languages (100 texts)
(using data from Popescu, Mačutek, Altmann 2009, Table 5.18)

Language      mean Λ (decreasing)
Latin         2.1376
Hungarian     2.1116
Kannada       1.9899
Czech         1.8312
Romanian      1.8100
Russian       1.7231
Marathi       1.7139
German        1.6476
Slovenian     1.6109
Bulgarian     1.5747
Italian       1.5243
Indonesian    1.4757
Tagalog       1.3499
English       1.2795
Lakota        1.2460
Marquesan     0.9421
Maori         0.8954
Rarotongan    0.8836
Samoan        0.8277
Hawaiian      0.6337
The above result probably shows both extremes of the synthetism/analytism scale.
Applied to texts, the indicator lambda can help to trace the morphological evolution of
a language without studying particular cases. It is not as simple as Greenberg's indicators, but
one does not need pre-analyzed corpora; a simple word counter is sufficient.
The index can also be helpful in studying the morphological diversification within a
language family. As an example, we used the Kelih corpus (2009, 2009a), which
contains ten chapters of the Russian original of the book Kak zakaljalas´ stal´ by N.
Ostrovskij and its translations into 11 Slavic languages. The result is presented in Table 2. As
can be seen, the values for Czech, Russian, Slovenian and Bulgarian, which also appear in Table 1,
are slightly different here. This may be caused by the unique style of the text. For example, the
mean lambda of Russian is 1.72 in Table 1 but 1.95 in Table 2. In any case, the table shows that among the Slavic
languages Macedonian stands at the lower end of synthetism and Russian at its upper end.
Curiously enough, the table also reflects the approximate geographic location of the Slavic
languages, dividing them into East, West and South Slavic groups. The southern branch
displays the strongest trend towards analytism.
Table 2
Indicator lambda in 12 Slavic languages (the same text, 10 chapters each)
(using data from Popescu et al. 2010, Table 5.3)

Language      mean Λ (decreasing)
Russian       1.9485
Belorussian   1.9247
Polish        1.9195
Ukrainian     1.9089
Czech         1.9038
Slovak        1.8842
Sorbian       1.8024
Slovenian     1.7937
Croatian      1.7786
Serbian       1.7761
Bulgarian     1.6134
Macedonian    1.5290
3 Style
Here the results do not capture style as a whole but are restricted to the choice of
morphological means of a language. That is, the indicator lambda can show differences
between texts in one and the same language, but its validity concerns only the morphological
aspect and in no case the choice of words, types of sentences, etc. Here we show two
different kinds of texts. The first group contains the End-of-year speeches of Italian
presidents, i.e. texts whose form and content are not quite free but follow a line which
displays some commonalities (cf. Tuzzi, Popescu, Altmann 2010). The results are presented
in Table 3. Each president gave several speeches, the mean lambdas of which are presented here.
Table 3
Indicator lambda in the End-of-year speeches of 10 Italian presidents (60 texts)
(using data from Tuzzi, Popescu, Altmann 2010, Table 3.4)

President     mean Λ (decreasing)
Einaudi       1.7279
Gronchi       1.6605
Leone         1.5943
Segni         1.5854
Saragat       1.5792
Napolitano    1.5749
Ciampi        1.5299
Cossiga       1.4588
Scalfaro      1.2991
Pertini       1.2578
As can be seen, the ordering according to lambda is not chronological, but the values are
dispersed in the vicinity of the Italian mean presented in Table 1. One naturally asks
with which other properties of the texts lambda is associated. At this very point a
new field of quantitative investigations can be started.
If one takes a group of texts that are not uniform thematically and formally, one can also order the
authors according to the richness of forms. In Table 4 we present the mean lambdas of
26 German authors.
Table 4
Mean lambdas of 26 German writers (253 texts)
(using data from Popescu, Mačutek, Altmann 2009, Table 7.8)

Author        mean Λ (decreasing)
Meyer         1.6891
Paul          1.6522
Droste        1.6108
Goethe        1.6071
Rückert       1.6044
Lessing       1.5514
Heine         1.5480
Chamisso      1.5248
Pseudonym     1.5208
Kafka         1.5153
Novalis       1.5108
Keller        1.4754
Hoffmann      1.4418
Sealsfield    1.3894
Schnitzler    1.3778
Arnim         1.3734
Busch         1.3569
Eichendorff   1.3505
Rieder        1.3436
Wedekind      1.3277
Raabe         1.3081
Löns          1.2247
Tucholsky     1.1688
Immermann     1.1151
Sudermann     1.0216
Storm         0.8886
As can be seen, in German the dispersion is relatively great. The mean for German in Table 1 (1.6476)
results from the fact that all its texts are a few poems by Goethe (Gott und
die Bajadere; Der Erlkönig; Elegies No. 2, 3, 13, 15, 19), while Table 4 represents 253
texts by 26 German writers. Taking the mean of the means in Table 4, we obtain for
German Λ = 1.4038, which is probably a more realistic characteristic of German, but we
assume that the “truth” lies somewhere in between. We may also assume that the lambda-indicator
changes in the course of time and that different text sorts behave differently.
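The mean of means quoted above can be re-derived from the Table 4 values with a few lines of Python. This is merely a sketch for checking the arithmetic; the variable names are illustrative.

```python
from statistics import mean

# Mean lambdas of the 26 German writers as listed in Table 4.
table_4 = {
    "Meyer": 1.6891, "Paul": 1.6522, "Droste": 1.6108, "Goethe": 1.6071,
    "Rückert": 1.6044, "Lessing": 1.5514, "Heine": 1.5480, "Chamisso": 1.5248,
    "Pseudonym": 1.5208, "Kafka": 1.5153, "Novalis": 1.5108, "Keller": 1.4754,
    "Hoffmann": 1.4418, "Sealsfield": 1.3894, "Schnitzler": 1.3778,
    "Arnim": 1.3734, "Busch": 1.3569, "Eichendorff": 1.3505, "Rieder": 1.3436,
    "Wedekind": 1.3277, "Raabe": 1.3081, "Löns": 1.2247, "Tucholsky": 1.1688,
    "Immermann": 1.1151, "Sudermann": 1.0216, "Storm": 0.8886,
}

values = list(table_4.values())
mean_of_means = mean(values)
spread = max(values) - min(values)   # distance between the extreme authors

print(len(values), round(mean_of_means, 4), round(spread, 4))  # 26 1.4038 0.8005
```

The spread of about 0.8 between Meyer and Storm illustrates how wide the dispersion among German authors is compared with the single value given for German in Table 1.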
4 Other vistas
A further domain which can profit from this investigation is areal linguistics. It can be supposed
that if there is areal influence between two unrelated languages, this influence may
also be reflected in morphology. However, a great number of samples is necessary to obtain a
realistic mean lambda for a language, and spoken texts must by no means be neglected. This
direction of research can become very laborious because the historical aspect must be taken
into account.
Further research is necessary in order to show that lambda is associated with properties
like mean sentence length, the dependence structure of the sentence, the exploitation of word
classes, psycholinguistic properties, text sorts, etc.
Evidently, the lambda indicator, which frees us from the weight of text length and is
easily computable, has still further possible applications not yet known to us.
References
Ejiri, K. – Smith, A. E. (1993). Proposal of a new ‘Constraint measure’ for text. In: Köhler,
R., Rieger, B.B. (eds.), Contributions to quantitative linguistics: 195–211. Dordrecht:
Kluwer.
Kelih, E. (2009). Slavisches Parallel-Textkorpus: Projektvorstellung von “Kak zakaljalas´
stal´” (KZS). In: Kelih, E., Levickij, V., Altmann, G. (eds.), Methods of text analysis:
106–124. Černivci: ČNU.
Kelih, E. (2009a). Preliminary analysis of a Slavic parallel corpus. In: Levická, J., Garabík,
R. (eds.), NLP, Corpus Linguistics, Corpus Based Grammar Research. Fifth International Conference Smolenice, Slovakia, 25–27 November 2009. Proceedings: 175–183.
Brno: Tribun.
Popescu, I.-I. – Mačutek, J. – Altmann, G. (2009). Aspects of word frequencies. Lüdenscheid: RAM.
Popescu, I.-I. – Mačutek, J. – Kelih, E. – Čech, R. – Best, K.-H. – Altmann, G. (2010).
Vectors and codes of texts. Lüdenscheid: RAM.
Tuzzi, A. – Popescu, I.-I. – Altmann, G. (2010). Quantitative analysis of Italian texts.
Lüdenscheid: RAM.