Models used to explain phenomena are necessarily finer grained than the models used to measure them. In language study, the measures used to assess development (e.g., readability indices) rely on models of language that are too coarse grained to be interpreted in a linguistic framework and so do not participate in linguistic accounts of development. This study argues that the constructionist approaches provide a framework for the development of a practical and interpretable measure of developmental complexity because these approaches feature affordances from which a measurement model may be derived: they describe language knowledge as a comprehensive network of enumerable entities that do not require the imputation of external processes, are extensible to early child language, and hold that the drivers of language development are the learning and generalization of constructions. It is argued here that treating schematic constructions as the unit of language knowledge supports a complexity measure that can reflect developmental changes arising from the learning and productive generalization of these units.
Why Adult Language Learning is Harder: A Computational Model of the Consequences of Cultural Selection for Learnability. Robert N. Nelson Jr. (rnnelson@purdue.edu), Department of English, Purdue University, West Lafayette, IN 47906.
Abstract: This paper reports on a limited model of language evolution that incorporates transmission noise and errorful learning as sources of variation. The model illustrates how the adaptation of language to the statistical learning mechanisms of infants may be a factor in the apparent ceiling on adult second language achievement. The model is limited in its focus to phonotactics because the probabilistic imbalances found in phonotactics have been shown to be effective cues in the very first language learning task, speech segmentation (Saffran & Thiessen, 2003; Mattys & Jusczyk, 2001), and in the organization of lexical memory (Vitevitch, Luce, Pisoni & Auer, 1999). The argument that this model supports is that these probabilistic imbalances...
This short paper discusses shortcomings of the capture-recapture (CR) method of estimating vocabulary size (Meara & Olmos Alcoy, 2010; Williams, Segalowitz & Leclair, 2014). When sampling from a population generated by a power-law process (e.g., a Zipf distribution), the probability that any given member is selected is dependent on its rank, such that higher frequency rank (i.e., 1st, 2nd, 3rd) members are much more likely to be selected than lower rank (i.e., 100th, 1000th) members. Because of this, sampling is much more likely to select from the same limited group of words. The CR measure, however, assumes a uniform distribution, and so drastically underestimates the size of the vocabulary when applied to power-law data. Work with simulated data shows ways that the degree of underestimation may be lessened. Applying these methods to real data shows effects parallel to those in the simulations.
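The sampling problem described here can be demonstrated with a short simulation. The Lincoln-Petersen estimator below stands in for the CR measure; the vocabulary size, sample sizes, and 1/r Zipfian weights are arbitrary choices for illustration, not values from the paper.

```python
import random

def lincoln_petersen(sample1, sample2):
    # Capture-recapture (Lincoln-Petersen) estimate of population size:
    # N_hat = |types in sample 1| * |types in sample 2| / |types in both|.
    t1, t2 = set(sample1), set(sample2)
    overlap = len(t1 & t2)
    return len(t1) * len(t2) / overlap if overlap else float("inf")

def draw(vocab_size, n_tokens, zipf, rng):
    # Draw word tokens from a vocabulary of ids 1..vocab_size, either
    # uniformly or with Zipfian weights P(rank r) proportional to 1/r.
    ranks = list(range(1, vocab_size + 1))
    weights = [1 / r for r in ranks] if zipf else None
    return rng.choices(ranks, weights=weights, k=n_tokens)

rng = random.Random(1)
V = 5_000  # true vocabulary size
uniform_est = lincoln_petersen(draw(V, 2_000, False, rng), draw(V, 2_000, False, rng))
zipf_est = lincoln_petersen(draw(V, 2_000, True, rng), draw(V, 2_000, True, rng))
# Under uniform sampling the estimate lands near V; under Zipfian sampling
# the repeated high-rank words inflate the overlap and the estimate collapses.
```

Because the overlap term in the denominator is dominated by the handful of high-rank words that appear in both samples, the Zipfian estimate sits far below the true vocabulary size, exactly the underestimation the abstract describes.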
Gries (2008, 2021) defined two dispersion measures able to alert corpus analysts to words that have a problematically limited distribution. Gries (2010, 2022) posited that these measures may additionally be relevant to language development research, as the learnability of a pattern may be predicted by the evenness of its distribution in corpora. However, both measures work by comparing vectors of observed and expected frequencies in partitioned corpora, and this method cannot determine that a word is evenly distributed because it cannot distinguish the random noise inherent to an unbiased process from substantial non-random bias. An additional concern with the 2008 measure is raised: the 2008 measure is Manhattan distance scaled to the unit interval and, as such, it is extremely sensitive to the number of corpus parts because this choice sets the dimensionality of the measure space. In sum, this short analysis presents evidence that these measures should not be used to declare a pattern evenly distributed, as neither can tell the difference between statistical noise and systematic bias.
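The noise-floor argument can be made concrete with a small sketch of Gries's (2008) deviation of proportions (DP). The corpus here is simulated: a word scattered uniformly at random over equal parts, i.e. a maximally unbiased process, still yields DP values well above zero, and the average value grows with the number of corpus parts. Part counts and token counts are arbitrary illustration values.

```python
import random

def dp(counts, part_sizes):
    # Gries's (2008) deviation of proportions: half the Manhattan distance
    # between the observed and expected proportion vectors, scaled to [0, 1].
    total = sum(counts)
    corpus = sum(part_sizes)
    return 0.5 * sum(abs(c / total - s / corpus) for c, s in zip(counts, part_sizes))

def random_dp(n_parts, n_tokens, rng):
    # Scatter a word's tokens uniformly at random over equally sized parts,
    # i.e. an unbiased process, and measure the resulting DP.
    counts = [0] * n_parts
    for _ in range(n_tokens):
        counts[rng.randrange(n_parts)] += 1
    return dp(counts, [1] * n_parts)

rng = random.Random(0)
# Average DP for a 100-token word under pure chance, at two partitionings.
dp_10 = sum(random_dp(10, 100, rng) for _ in range(200)) / 200
dp_100 = sum(random_dp(100, 100, rng) for _ in range(200)) / 200
```

Even though nothing about the simulated process is biased, DP never approaches zero, and repartitioning the same corpus into more parts pushes the chance-level value up substantially, which is the dimensionality sensitivity the abstract raises.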
To some extent, we seem to use language in chunks—multiple words that are co-selected and used as gestalt units. By some estimates, these chunks constitute more than 50% of a given text (Erman & Warren, 2000). The extent to which our communication is composed of these units has broad implications for linguistic theory, psycholinguistics, and applied linguistics, and so is the focus of this study. This study shows that claims made regarding the nature of formulaic language (Sinclair, 1991) lead to a method for the automatic detection of holistically used multiword patterns in text corpora, which in turn allows for the estimation of the ‘chunkiness’ of linguistic corpora. These estimates may be useful for materials development in language teaching, as well as corpus linguistic and psycholinguistic studies.
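The abstract does not spell out the paper's detection method, so the following is only a generic illustration of the idea of scoring 'chunkiness': a bigram is treated as a candidate chunk when its forward transition probability is high, and chunkiness is the share of tokens covered by such bigrams. The threshold and the toy data are arbitrary assumptions.

```python
from collections import Counter
import random

def chunkiness(tokens, threshold=0.5):
    # Share of tokens covered by bigrams whose forward transition
    # probability P(w2 | w1) meets the threshold. Illustrative only:
    # this is not the paper's actual detection procedure.
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens[:-1])
    covered = set()
    for i, pair in enumerate(zip(tokens, tokens[1:])):
        if bigrams[pair] / unigrams[pair[0]] >= threshold:
            covered.update((i, i + 1))
    return len(covered) / len(tokens)

rng = random.Random(0)
formulaic = ["of", "course"] * 20                        # fully predictable pairs
varied = [rng.choice("abcdefghij") for _ in range(500)]  # near-random transitions
```

On the formulaic sequence every transition is deterministic, so coverage is total; on the random tokens almost no bigram clears the threshold, so the score stays near zero.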
It has long been recognized that developing measures of the internal structure of collocations is an important goal (Sinclair, 1991). Recently, Gries (2013) presented a measure that captures the asymmetric nature of conditional probabilities in collocations. This paper contributes to the discussion by introducing measures of asymmetry and redundancy that may meet the needs of some researchers. Two asymmetry measures are described: the first captures only frequency asymmetry, while the second is an asymmetric version of the mutual information measure. A measure of semantic redundancy is also described; it takes a higher value when the fact that two words co-occur contains more information than the uncertainty introduced by the occurrence of the individual words.
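The asymmetry at issue can be seen with a toy word pair: the counts below are hypothetical (not from any corpus), chosen so that the two conditional probabilities diverge sharply while symmetric pointwise mutual information, by construction, registers no direction at all.

```python
import math

def directional_probs(f_xy, f_x, f_y):
    # The two conditional probabilities for a word pair (x, y):
    # P(y | x) and P(x | y), which can differ by orders of magnitude.
    return f_xy / f_x, f_xy / f_y

def pmi(f_xy, f_x, f_y, n):
    # Symmetric pointwise mutual information, log2 P(x,y) / (P(x) P(y)):
    # the same value whichever word is treated as the cue.
    return math.log2(f_xy * n / (f_x * f_y))

# Hypothetical counts: 'of' is frequent everywhere, while 'course'
# occurs almost exclusively after 'of'.
n, f_of, f_course, f_both = 1_000_000, 30_000, 500, 450
p_course_given_of, p_of_given_course = directional_probs(f_both, f_of, f_course)
# P(course | of) = 0.015, but P(of | course) = 0.9: a strongly one-sided
# attraction that a single symmetric score cannot express.
```

Any symmetric association score assigns this pair one number, so a researcher who cares about which word predicts which needs a directional measure of the kind the paper proposes.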
Please let me know if you find errors or better ways to say/do the things described in the chapter, or if you have suggestions for content.
Papers by Robert Nelson