Connections and Symbols
During the past two years there has been more discussion of the foundations of cognitive science than in the 25 years preceding. The impetus for this reexamination has been a new approach to studying the mind, called "Connectionism", "Parallel Distributed Processing", or "Neural Networks". The assumptions behind this approach differ in substantial ways from the "central dogma" of cognitive science, that intelligence is the result of the manipulation of structured symbolic expressions. Instead, connectionists suggest that intelligence is to be understood as the result of the transmission of activation levels in large networks of densely interconnected simple units.

Connectionism has spawned an enormous amount of research activity in a
short time. Much of the excitement surrounding the movement has been inspired by the rich possibilities inherent in ideas such as massively parallel processing, distributed representation, constraint satisfaction, neurally-realistic cognitive models, and subsymbolic or microfeatural analyses. Models incorporating various combinations of these notions have been proposed for behavioral abilities as diverse as Pavlovian conditioning, visual recognition, and language acquisition.

Perhaps it is not surprising that in a burgeoning new field there have been few systematic attempts to analyze the core assumptions of the new approach in comparison with those of the approach it is trying to replace, and to juxtapose both sets of assumptions with the most salient facts about human cognition. Analyses of new scientific models have their place, but they are premature before substantial accomplishments in the new field have been reported and digested. Now that many connectionist efforts are well known, it may be time for a careful teasing apart of what is truly new and what is just a relabeling of old notions; of the empirical generalizations that are sound and those that are likely to be false; of the proposals that naturally belong together and those that are logically independent. This special issue of Cognition on Connectionism and Symbol Systems is
intended to start such a discussion. Each of the papers in the issue attempts to analyze in careful detail the accomplishments and liabilities of connectionist models of cognition. The papers were independently and coincidentally submitted to the journal - a sign, perhaps, that the time is especially right for reflection on the status of connectionist theories. Though each makes different points, there are noteworthy common themes. All the papers are highly critical of certain aspects of connectionist models, particularly as applied to language or the parts of cognition employing language-like operations. All
of them try to pinpoint what it is about human cognition that supports the traditional physical symbol system hypothesis. Yet none of the papers is an outright dismissal; in each case the authors discuss aspects of cognition for which connectionist models may yield critical insights.

Perhaps the most salient common theme in these papers is that many current connectionist proposals are not motivated purely by considerations of parallel processing, distributed representation, constraint satisfaction, or other computational issues, but seem to be tied more closely to an agenda of reviving associationism as a central doctrine of learning and mental functioning. As a result, discussions of connectionism involve a reexamination of debates about the strengths and weaknesses of associationist mechanisms that were a prominent part of cognitive theory 30 years ago and 300 years ago.

These papers comprise the first critical examination of connectionism as a scientific theory. The issues they raise go to the heart of our understanding of how the mind works. We hope that they begin a fruitful debate among scientists from different frameworks as to the respective roles of connectionist networks and physical symbol systems in explaining intelligence.
Connectionism and cognitive architecture: A critical analysis*

JERRY A. FODOR, CUNY Graduate Center
ZENON W. PYLYSHYN, University of Western Ontario
Abstract
This paper explores differences between Connectionist proposals for cognitive architecture and the sorts of models that have traditionally been assumed in cognitive science. We claim that the major distinction is that, while both Connectionist and Classical architectures postulate representational mental states, the latter but not the former are committed to a symbol-level of representation, or to a 'language of thought': i.e., to representational states that have combinatorial syntactic and semantic structure. Several arguments for combinatorial structure in mental representations are then reviewed. These include arguments based on the 'systematicity' of mental representation: i.e., on the fact that cognitive capacities always exhibit certain symmetries, so that the ability to entertain a given thought implies the ability to entertain thoughts with semantically related contents. We claim that such arguments make a powerful case that mind/brain architecture is not Connectionist at the cognitive level. We then consider the possibility that Connectionism may provide an account of the neural (or 'abstract neurological') structures in which Classical cognitive architecture is implemented. We survey a number of the standard arguments that have been offered in favor of Connectionism, and conclude that they are coherent only on this interpretation.
*This paper is based on a chapter from a forthcoming book. Authors' names are listed alphabetically. We wish to thank the Alfred P. Sloan Foundation for their generous support of this research. The preparation of this paper was also aided by a Killam Research Fellowship and a Senior Fellowship from the Canadian Institute for Advanced Research to ZWP. We also gratefully acknowledge comments and criticisms of earlier drafts by: Professors Noam Chomsky, William Demopoulos, Lila Gleitman, Russ Greiner, Norbert Hornstein, Keith Humphrey, Sandy Pentland, Steven Pinker, David Rosenthal, and Edward Stabler. Reprints may be obtained by writing to either author: Jerry Fodor, CUNY Graduate Center, 33 West 42 Street, New York, NY 10036, U.S.A.; Zenon Pylyshyn, Centre for Cognitive Science, University of Western Ontario, London, Ontario, Canada N6A 5C2.
1. Introduction
Connectionist or PDP models are catching on. There are conferences and new books nearly every day, and the popular science press hails this new wave of theorizing as a breakthrough in understanding the mind (a typical example is the article in the May issue of Science 86, called "How we think: A new theory"). There are also, inevitably, descriptions of the emergence of Connectionism as a Kuhnian "paradigm shift". (See Schneider, 1987, for an example of this and for further evidence of the tendency to view Connectionism as the "new wave" of Cognitive Science.)

The fan club includes the most unlikely collection of people. Connectionism gives solace both to philosophers who think that relying on the pseudoscientific intentional or semantic notions of folk psychology (like goals and beliefs) misleads psychologists into taking the computational approach (e.g., P.M. Churchland, 1981; P.S. Churchland, 1986; Dennett, 1986); and to those with nearly the opposite perspective, who think that computational psychology is bankrupt because it doesn't address issues of intentionality or meaning (e.g., Dreyfus & Dreyfus, in press). On the computer science side, Connectionism appeals to theorists who think that serial machines are too weak and must be replaced by radically new parallel machines (Fahlman & Hinton, 1986), while on the biological side it appeals to those who believe that cognition can only be understood if we study it as neuroscience (e.g., Arbib, 1975; Sejnowski, 1981). It is also attractive to psychologists who think that much of the mind (including the part involved in using imagery) is not discrete (e.g., Kosslyn & Hatfield, 1984), or who think that cognitive science has not paid enough attention to stochastic mechanisms or to "holistic" mechanisms (e.g., Lakoff, 1986), and so on and on. It also appeals to many young cognitive scientists who view the approach as not only anti-establishment (and therefore desirable) but also rigorous and mathematical (see, however, footnote 2). Almost everyone who is discontent with contemporary cognitive psychology and current "information processing" models of the mind has rushed to embrace "the Connectionist alternative".

When taken as a way of modeling cognitive architecture, Connectionism really does represent an approach that is quite different from that of the Classical cognitive science that it seeks to replace. Classical models of the mind were derived from the structure of Turing and Von Neumann machines. They are not, of course, committed to the details of these machines as exemplified in Turing's original formulation or in typical commercial computers; only to the basic idea that the kind of computing that is relevant to understanding cognition involves operations on symbols (see Fodor 1976, 1987; Newell, 1980, 1982; Pylyshyn, 1980, 1984a, b). In contrast, Connectionists propose to design systems that can exhibit intelligent behavior without storing, retrieving, or otherwise operating on structured symbolic expressions. The style of processing carried out in such models is thus strikingly unlike what goes on when conventional machines are computing some function.

Connectionist systems are networks consisting of very large numbers of simple but highly interconnected "units". Certain assumptions are generally made both about the units and the connections: Each unit is assumed to receive real-valued activity (either excitatory or inhibitory or both) along its input lines. Typically the units do little more than sum this activity and change their state as a function (usually a threshold function) of this sum. Each connection is allowed to modulate the activity it transmits as a function of an intrinsic (but modifiable) property called its "weight". Hence the activity on an input line is typically some non-linear function of the state of activity of its sources. The behavior of the network as a whole is a function of the initial state of activation of the units and of the weights on its connections, which serve as its only form of memory.

Numerous elaborations of this basic Connectionist architecture are possible. For example, Connectionist models often have stochastic mechanisms for determining the level of activity or the state of a unit. Moreover, units may be connected to outside environments. In this case the units are sometimes assumed to respond to a narrow range of combinations of parameter values and are said to have a certain "receptive field" in parameter-space. These are called "value units" (Ballard, 1986). In some versions of Connectionist architecture, environmental properties are encoded by the pattern of states of entire populations of units. Such "coarse coding" techniques are among the ways of achieving what Connectionists call "distributed representation".¹
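The unit-and-weight scheme just described is easy to render concretely. The sketch below is our own minimal illustration, not any particular model from the Connectionist literature: hard-threshold units, hand-wired into a small two-layer network.

```python
# Toy version of the generic Connectionist unit described above (our own
# illustration): each unit sums the real-valued activity on its weighted
# input lines and fires when the sum crosses its threshold. The weights
# and thresholds are the network's only form of memory.

def unit_output(inputs, weights, threshold):
    """One unit: weighted sum of incoming activity, passed through a hard threshold."""
    total = sum(i * w for i, w in zip(inputs, weights))
    return 1.0 if total >= threshold else 0.0

def network_output(x):
    """Three units hand-wired to compute XOR, a mapping that no single
    threshold unit can realize on its own."""
    h_or = unit_output(x, [1.0, 1.0], 0.5)        # fires if at least one input is on
    h_nand = unit_output(x, [-1.0, -1.0], -1.5)   # fires unless both inputs are on
    return unit_output([h_or, h_nand], [1.0, 1.0], 1.5)  # fires if both hidden units fire

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", network_output(x))  # 0.0, 1.0, 1.0, 0.0
```

Real models in this tradition typically use graded, non-linear activation functions rather than a hard threshold, and the weights are set by learning rather than by hand; the sketch only illustrates the basic sum-and-threshold picture.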
The term 'Connectionist model' (like 'Turing Machine' or 'Von Neumann machine') is thus applied to a family of mechanisms that differ in details but share a galaxy of architectural commitments. We shall return to the characterization of these commitments below.

Connectionist networks have been analysed extensively - in some cases
¹The difference between Connectionist networks in which the state of a single unit encodes properties of the world (i.e., the so-called 'localist' networks) and ones in which the pattern of states of an entire population of units does the encoding (the so-called 'distributed' representation networks) is considered to be important by many people working on Connectionist models. Although Connectionists debate the relative merits of localist (or 'compact') versus distributed representations (e.g., Feldman, 1986), the distinction will usually be of little consequence for our purposes, for reasons that we give later. For simplicity, when we wish to refer indifferently to either single unit codes or aggregate distributed codes, we shall refer to the 'nodes' in a network. When the distinction is relevant to our discussion, however, we shall explicitly mark the difference by referring either to units or to aggregates of units.
using advanced mathematical techniques.² They have also been simulated on computers and shown to exhibit interesting aggregate properties. For example, they can be "wired" to recognize patterns, to exhibit rule-like behavioral regularities, and to realize virtually any mapping from patterns of (input) parameters to patterns of (output) parameters - though in most cases multi-parameter, multi-valued mappings require very large numbers of units. Of even greater interest is the fact that such networks can be made to learn; this is achieved by modifying the weights on the connections as a function of certain kinds of feedback (the exact way in which this is done constitutes a preoccupation of Connectionist research and has led to the development of such important techniques as "back propagation").

In short, the study of Connectionist machines has led to a number of striking and unanticipated findings; it's surprising how much computing can be done with a uniform network of simple interconnected elements.
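The idea of "modifying the weights on the connections as a function of certain kinds of feedback" can be illustrated with the simplest error-driven scheme, the classic perceptron update rule. This is our own sketch, not a model from the text; back propagation, mentioned above, generalizes this sort of feedback-driven weight adjustment to multi-layer networks.

```python
# Minimal sketch of learning by weight modification (our illustration,
# using the single-unit perceptron rule): after each example, each weight
# is nudged in proportion to the error signal and the input activity.

def predict(weights, bias, x):
    """Threshold unit: fires if the weighted sum of its inputs is non-negative."""
    return 1.0 if sum(w * xi for w, xi in zip(weights, x)) + bias >= 0 else 0.0

def train(samples, lr=0.1, epochs=50):
    """samples: list of (inputs, target) pairs with 0/1 targets."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)   # the feedback signal
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Learn logical AND from feedback alone: no rule for AND is stored
# anywhere; the input-output mapping ends up encoded in the weights.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train(data)
print([predict(w, b, x) for x, _ in data])  # [0.0, 0.0, 0.0, 1.0]
```

The perceptron rule only works for mappings a single threshold unit can represent; it is the extension of error-driven updates to hidden layers that made techniques like back propagation important.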
Moreover, these models have an appearance of neural plausibility that Classical architectures are sometimes said to lack. Perhaps, then, a new Cognitive Science based on Connectionist networks should replace the old Cognitive Science based on Classical computers. Surely this is a proposal that ought to be taken seriously: if it is warranted, it implies a major redirection of research.
Unfortunately, however, discussions of the relative merits of the two architectures have thus far been marked by a variety of confusions and irrelevances. It's our view that when you clear away these misconceptions what's left is a real disagreement about the nature of mental processes and mental representations. But it seems to us that it is a matter that was substantially put to rest about thirty years ago; and the arguments that then appeared to militate decisively in favor of the Classical view appear to us to do so still.

In the present paper we will proceed as follows. First, we discuss some methodological questions about levels of explanation that have become enmeshed in the substantive controversy over Connectionism. Second, we try to say what it is that makes Connectionist and Classical theories of mental
²One of the attractions of Connectionism for many people is that it does employ some heavy mathematical machinery, as can be seen from a glance at many of the chapters of the two-volume collection by Rumelhart, McClelland and the PDP Research Group (1986). But in contrast to many other mathematically sophisticated areas of cognitive science, such as automata theory or parts of Artificial Intelligence (particularly the study of search, or of reasoning and knowledge representation), the mathematics has not been used to map out the limits of what the proposed class of mechanisms can do. Like a great deal of Artificial Intelligence research, the Connectionist approach remains almost entirely experimental; mechanisms that look interesting are proposed and explored by implementing them on computers and subjecting them to empirical trials to see what they will do. As a consequence, although there is a great deal of mathematical work within the tradition, one has very little idea what various Connectionist networks and mechanisms are good for in general.
structure incompatible. Third, we review and extend some of the traditional arguments for the Classical architecture. Though these arguments have been somewhat recast, very little that we'll have to say here is entirely new. But we hope to make it clear how various aspects of the Classical doctrine cohere, and why rejecting the Classical picture of reasoning leads Connectionists to say the very implausible things they do about logic and semantics. In part four, we return to the question of what makes the Connectionist approach appear attractive to so many people. In doing so we'll consider some arguments that have been offered in favor of Connectionist networks as general models of cognitive processing.
Levels of explanation
There are two major traditions in modern theorizing about the mind, one that we'll call 'Representationalist' and one that we'll call 'Eliminativist'. Representationalists hold that postulating representational (or 'intentional' or 'semantic') states is essential to a theory of cognition; according to Representationalists, there are states of the mind which function to encode states of the world. Eliminativists, by contrast, think that psychological theories can dispense with such semantic notions as representation. According to Eliminativists, the appropriate vocabulary for psychological theorizing is neurological, or perhaps behavioral, or perhaps syntactic; in any event, not a vocabulary that characterizes mental states in terms of what they represent. (For a neurological version of eliminativism, see P.S. Churchland, 1986; for a behavioral version, see Watson, 1930; for a syntactic version, see Stich, 1983.)

Connectionists are on the Representationalist side of this issue. As Rumelhart and McClelland (1986a, p. 121) say, PDPs "are explicitly concerned with the problem of internal representation". Correspondingly, the specification of what the states of a network represent is an essential part of a Connectionist model. Consider, for example, the well-known Connectionist account of the bistability of the Necker cube (Feldman & Ballard, 1982): "Simple units representing the visual features of the two alternatives are arranged in competing coalitions, with inhibitory links between rival features and positive links within each coalition. The result is a network that has two dominant stable states" (see Figure 1). Notice that, in this as in all other such Connectionist models, the commitment to mental representation is explicit: the label of a node is taken to express the representational content of the state that the device is in when the node is excited, and there are nodes corresponding to monadic and to relational properties of the reversible cube when it is seen in one way or the other.
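The competing-coalitions account quoted above can be sketched as a toy relaxation network. What follows is a drastic simplification of our own devising (two aggregate "coalition" activations rather than the many feature units of the Feldman and Ballard model), meant only to show how mutual excitation within a coalition and inhibition between rivals yield two stable states.

```python
# Toy "competing coalitions" dynamics (our simplification, for
# illustration only): each coalition's activation is boosted by its own
# activity (mutual excitation within the coalition) and suppressed by
# the rival coalition's activity (inhibitory links between rivals).

def settle(act_a, act_b, excite=0.2, inhibit=0.3, steps=30):
    """act_a, act_b: initial activations (0..1) of the two coalitions."""
    for _ in range(steps):
        new_a = act_a + excite * act_a - inhibit * act_b
        new_b = act_b + excite * act_b - inhibit * act_a
        act_a = min(1.0, max(0.0, new_a))   # clip activations to [0, 1]
        act_b = min(1.0, max(0.0, new_b))
    return act_a, act_b

# A slight initial bias toward either interpretation is amplified until
# that coalition dominates completely: the network has two stable states.
print(settle(0.55, 0.45))  # coalition A wins: (1.0, 0.0)
print(settle(0.45, 0.55))  # coalition B wins: (0.0, 1.0)
```

The bistability arises from the wiring alone: whichever coalition starts slightly ahead inhibits its rival more than it is inhibited, and the asymmetry compounds until the network settles.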
Sub-symbolic states do have a semantics, though it's not the semantics of representations at the "conceptual level". According to Smolensky, the semantical distinction between symbolic and sub-symbolic theories is just that "entities that are typically represented in the symbolic paradigm by [single] symbols are typically represented in the sub-symbolic paradigm by a large number of sub-symbols".³ Both the conceptual and the sub-symbolic levels thus postulate representational states, but sub-symbolic theories slice them thinner.

We are stressing the Representationalist character of Connectionist theorizing because much Connectionist methodological writing has been preoccupied with the question 'What level of explanation is appropriate for theories of cognitive architecture?' (see, for example, the exchange between Broadbent, 1985, and Rumelhart & McClelland, 1985). And, as we're about to see, what one says about the levels question depends a lot on what stand one takes about whether there are representational states.

It seems certain that the world has causal structure at very many different levels of analysis, with the individuals recognized at the lowest levels being, in general, very small and the individuals recognized at the highest levels being, in general, very large. Thus there is a scientific story to be told about quarks; and a scientific story to be told about atoms; and a scientific story to be told about molecules ... ditto rocks and stones and rivers ... ditto galaxies. And the story that scientists tell about the causal structure that the world has at any one of these levels may be quite different from the story that they tell about its causal structure at the next level up or down. The methodological implication for psychology is this: If you want to have an argument about cognitive architecture, you have to specify the level of analysis that's supposed to be at issue.
If you're not a Representationalist, this is quite tricky, since it is then not obvious what makes a phenomenon cognitive. But specifying the level of analysis relevant for theories of cognitive architecture is no problem for either Classicists or Connectionists. Since Classicists and Connectionists are both Representationalists, for them any level at which states of the system are taken to encode properties of the world counts as a cognitive level; and no other levels do. (Representations of "the world" include, of course, representations of symbols; for example, the concept WORD is a construct at the cognitive level because it represents something, namely words.) Correspondingly, the architecture of representational states and processes is what discussions of cognitive architecture are about. Put differently, the architecture of the cognitive system consists of the set of basic operations, resources, functions, principles, etc. (generally the sorts of properties that would be described in a user's manual for that architecture if it were available on a computer), whose domain and range are the representational states of the organism.⁴

³Smolensky seems to think that the idea of postulating a level of representations with a semantics of subconceptual features is unique to network theories. This is an extraordinary view, considering the extent to which Classical theorists have been concerned with feature analyses in every area of psychology from phonetics to visual perception to lexicography. In fact, the question whether there are 'sub-conceptual' features is neutral with respect to the question whether cognitive architecture is Classical or Connectionist.

It follows that, if you want to make good the Connectionist theory as a theory of cognitive architecture, you have to show that the processes which operate on the representational states of an organism are those specified by a Connectionist architecture. It is not enough to show that the nonrepresentational (e.g., neurological, or molecular, or quantum mechanical) states of an organism constitute a Connectionist network, because that would leave open the question whether the mind is such a network at the psychological level. It is, in particular, perfectly possible that nonrepresentational neurological states are interconnected in the ways described by Connectionist models but that the representational states themselves are not. This is because, just as it is possible to implement a Connectionist cognitive architecture in a network of causally interacting nonrepresentational elements, so too it is perfectly possible to implement a Classical cognitive architecture in such a network.⁵ In fact, the question whether Connectionist networks should be treated as models at some level of implementation is moot, and will be discussed at length below.
It is important to be clear about this matter of levels on pain of simply trivializing the issues about cognitive architecture. Consider, for example, the following remark of Rumelhart's: "It has seemed to me for some years now that there must be a unified account in which the so-called rule-governed and [the] exceptional cases were dealt with by a unified underlying process - a
⁴It is representation that distinguishes the cognitive from noncognitive levels. Thus, for example, although Smolensky (1988) is clearly a Representationalist, his official answer to the question 'What distinguishes those dynamical systems that are cognitive from those that are not?' makes the mistake of appealing to complexity rather than intentionality: a river fails to be a cognitive dynamical system only because it cannot satisfy a large range of goals under a large range of conditions. But, of course, that depends on how you individuate goals and conditions; the river that wants to get to the sea wants first to get half way to the sea, and then to get half way more, and so on; quite a lot of goals all told. The real point, of course, is that states that represent goals play a role in the etiology of the behaviors of people but not in the etiology of the behavior of rivers.

⁵That Classical architectures can be implemented in networks is not disputed by Connectionists; see, for example, Rumelhart and McClelland (1986a, p. 118): "... one can make an arbitrary computational machine out of linear threshold units, including, for example, a machine that can carry out all the operations necessary for implementing a Turing machine; the one limitation is that real biological systems cannot be Turing machines because they have finite hardware."
process which produces rule-like and rule-exception behavior through the application of a single process ... [in this process] ... both the rule-like and non-rule-like behavior is a product of the interaction of a very large number of 'sub-symbolic' processes" (Rumelhart, 1984, p. 60). It's clear from the context that Rumelhart takes this idea to be a very tendentious one; one of the Connectionist claims that Classical theories are required to deny.

But in fact it's not. For, of course, there are 'sub-symbolic' interactions that implement both rule-like and rule-violating behavior; quantum mechanical processes, for example, do. That's not what Classical theorists deny; indeed, it's not denied by anybody who is even vaguely a materialist. Nor does a Classical theorist deny that rule-following and rule-violating behaviors are both implemented by the very same neurological machinery. For a Classical theorist, neurons implement all cognitive processes in precisely the same way: viz., by supporting the basic operations that are required for symbol processing.

What would be an interesting and tendentious claim is that there's no distinction between rule-following and rule-violating mentation at the cognitive or representational or symbolic level; specifically, that it is not the case that the etiology of rule-following behavior is mediated by the representation of explicit rules.⁶ We will consider this idea in Section 4, where we will argue that it too is not what divides Classical from Connectionist architecture; Classical models permit a principled distinction between the etiologies of mental processes that are explicitly rule-governed and mental processes that aren't; but they don't demand one.

In short, the issue between Classical and Connectionist architecture is not about the explicitness of rules; as we'll presently see, Classical architecture is not, per se, committed to the idea that explicit rules mediate the etiology of behavior. And it is not about the reality of representational states; Classicists and Connectionists are all Representational Realists. And it is not about nonrepresentational architecture; a Connectionist neural network can perfectly well implement a Classical architecture at the cognitive level.

So, then, what is the disagreement between Classical and Connectionist architecture about?
⁶There is a different idea, frequently encountered in the Connectionist literature, that this one is easily confused with: viz., that the distinction between regularities and exceptions is merely stochastic (what makes 'went' an irregular past tense is just that the more frequent construction is the one exhibited by 'walked'). It seems obvious that if this claim is correct it can be readily assimilated to Classical architecture (see Section 4).
2. The nature of the dispute

Classicists and Connectionists all assign semantic content to something. Roughly, Connectionists assign semantic content to 'nodes' (that is, to units or aggregates of units; see footnote 1) - i.e., to the sorts of things that are typically labeled in Connectionist diagrams; whereas Classicists assign semantic content to expressions - i.e., to the sorts of things that get written on the tapes of Turing machines and stored at addresses in Von Neumann machines.⁷ But Classical theories disagree with Connectionist theories about what primitive relations hold among these content-bearing entities. Connectionist theories acknowledge only causal connectedness as a primitive relation
among nodes ; when you know how activation and inhibition flow among
them , you know everything there is to know about how the nodes in a net work are related . By contrast , Classical theories acknowledge not only causal relations among the semantically evaluable objects that they posit , but also a range of structural relations , of which constituency is paradigmatic . This difference has far reaching consequences for the ways that the two kinds of theories treat a variety of cognitive phenomena , some of which we will presently examine at length . But , underlying the disagreements about
details are two architectural differences between the theories :
(1) Combinatorial syntax and semantics for mental representations. Classical theories - but not Connectionist theories - postulate a 'language of thought' (see, for example, Fodor, 1975); they take mental representations to have a combinatorial syntax and semantics, in which (a) there is a distinction between structurally atomic and structurally molecular representations; (b) structurally molecular representations have syntactic constituents that are themselves either structurally molecular or structurally atomic; and (c) the semantic content of a (molecular) representation is a function of the semantic contents of its syntactic parts, together with its constituent structure. For purposes of convenience, we'll sometimes abbreviate (a)-(c) by speaking of Classical theories as committed to "complex" mental representations or to "symbol structures".⁸
(2) Structure sensitivity of processes. In Classical models, the principles by which mental states are transformed, or by which an input selects the corresponding output, are defined over structural properties of mental representations. Because Classical mental representations have combinatorial structure, it is possible for Classical mental operations to apply to them by reference to their form. The result is that a paradigmatic Classical mental process operates upon any mental representation that satisfies a given structural description, and transforms it into a mental representation that satisfies another structural description. (So, for example, in a model of inference one might recognize an operation that applies to any representation of the form P&Q and transforms it into a representation of the form P.) Notice that since formal properties can be defined at a variety of levels of abstraction, such an operation can apply equally to representations that differ widely in their structural complexity. The operation that applies to representations of the form P&Q to produce P is satisfied by, for example, an expression like "(AvBvC) & (DvEvF)", from which it derives the expression "(AvBvC)".

We take (1) and (2) as the claims that define Classical models, and we take these claims quite literally; they constrain the physical realizations of symbol structures. In particular, the symbol structures in a Classical model are assumed to correspond to real physical structures in the brain, and the combinatorial structure of a representation is supposed to have a counterpart in structural relations among physical properties of the brain. For example, the relation 'part of', which holds between a relatively simple symbol and a more complex one, is assumed to correspond to some physical relation among brain states.⁹ This is why Newell (1980) speaks of computational systems such as brains and Classical computers as "physical symbol systems".
⁸Sometimes the difference between simply postulating representational states and postulating representations with a combinatorial syntax and semantics is marked by distinguishing theories that postulate symbols from theories that postulate symbol systems. The latter theories, but not the former, are committed to a "language of thought". For this usage, see Kosslyn and Hatfield (1984), who take the refusal to postulate symbol systems to be the characteristic respect in which Connectionist architectures differ from Classical architectures. We agree with this diagnosis.

⁹Perhaps the notion that relations among physical properties of the brain instantiate (or encode) the combinatorial structure of an expression bears some elaboration. One way to understand what is involved is to consider the conditions that must hold on a mapping (which we refer to as the 'physical instantiation mapping') from expressions to brain states if the causal relations among brain states are to depend on the combinatorial structure of the encoded expressions.
This bears emphasis because the Classical theory is committed not only to there being a system of physically instantiated symbols, but also to the claim that the physical properties onto which the structure of the symbols is mapped are the very properties that cause the system to behave as it does. In other words, the physical counterparts of the symbols, and their structural properties, cause the system's behavior. A system which has symbolic expressions, but whose operation does not depend upon the structure of these expressions, does not qualify as a Classical machine since it fails to satisfy condition (2). In this respect, a Classical model is very different from one in which behavior is caused by mechanisms, such as energy minimization, that are not responsive to the physical encoding of the structure of representations.
From now on, when we speak of 'Classical' models, we will have in mind any model that has complex mental representations, as characterized in (1), and structure-sensitive mental processes, as characterized in (2). Our account of Classical architecture is therefore neutral with respect to such issues as whether or not there is a separate executive. For example, Classical machines can have an "object-oriented" architecture, like that of the computer language Smalltalk, or a "message passing" architecture, like that of Hewitt's
(1977) Actors, so long as the objects or the messages have a combinatorial structure which is causally implicated in the processing. Classical architecture is also neutral on the question whether the operations on the symbols are constrained to occur one at a time or whether many operations can occur at the same time.
Here, then, is the plan for what follows. In the rest of this section, we will sketch the Connectionist proposal for a computational architecture that does away with complex mental representations and structure-sensitive operations. (Although our purpose here is merely expository, it turns out that describing exactly what Connectionists are committed to requires substantial reconstruction of their remarks and practices. Since there is a great variety of points of view within the Connectionist community, we are prepared to find that some Connectionists in good standing may not fully endorse the program when it is laid out in what we take to be its bare essentials.) Following this general expository (or reconstructive) discussion, Section 3 provides a series of arguments favoring the Classical story. Then the remainder of the paper considers some of the reasons why Connectionism appears attractive to many people and offers further general comments on the relation between the Classical and the Connectionist enterprise.
2.1. Complex mental representations
To begin with, consider a case of the most trivial sort: two machines, one Classical in spirit and one Connectionist.10 Here is how the Connectionist machine might reason. There is a network of labelled nodes as in Figure 2. Paths between the nodes indicate the routes along which activation can spread (that is, they indicate the consequences that exciting one of the nodes has for determining the level of excitation of others). Drawing an inference from A&B to A thus corresponds to an excitation of node 2 being caused by an excitation of node 1 (alternatively, if the system is in a state in which node 1 is excited, it eventually settles into a state in which node 2 is excited; see footnote 7).
Now consider a Classical machine. This machine has a tape on which it writes expressions. Among the expressions that can appear on this tape are:
10This illustration has no particular Connectionist model in mind, though the caricature presented is, in fact, a simplified version of the Ballard (1987) Connectionist theorem proving system (which actually uses a more restricted proof procedure based on the unification of Horn clauses). To simplify the exposition, we assume a 'localist' approach, in which each semantically interpreted node corresponds to a single Connectionist unit; but nothing relevant to this discussion is changed if these nodes actually consist of patterns over a cluster of units.
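The contrast between the two machines can be put in a few lines of code (our own illustrative sketch, not anything from the paper; the data structures are our invention):

```python
# In the Classical machine the token 'A&B' literally CONTAINS a token
# of 'A'; in the Connectionist network, node 1 (labelled 'A&B') is
# merely causally connected to node 2 (labelled 'A').

def classical_step(tape):
    """Whenever a token of the form P&Q appears, write a token of P."""
    for expr in list(tape):
        if '&' in expr:
            tape.append(expr.split('&')[0])
    return tape

# The Classical inference exploits part/whole structure: 'A' is a
# substring (a literal part) of the symbol 'A&B'.
print(classical_step(['A&B']))    # ['A&B', 'A']
print('A' in 'A&B')               # True: a structural relation holds

# The Connectionist network: only the wiring (node 1 excites node 2)
# matters; the labels play no causal role, and there is no part/whole
# relation between the nodes themselves.
edges = {1: [2]}                  # node 1 sends activation to node 2
active = {1}
active |= {dst for src in active for dst in edges.get(src, [])}
print(sorted(active))             # [1, 2]
```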
'A&B' and 'A'. In the Classical machine, the objects to which the content A&B is ascribed (viz., tokens of the expression 'A&B') literally contain, as proper parts, objects to which the content A is ascribed (viz., tokens of the expression 'A'). Moreover, the semantics (e.g., the satisfaction conditions) of the expression 'A&B' is determined in a uniform way by the semantics of its constituents. By contrast, in the Connectionist machine none of this is true; the object to which the content A&B is ascribed (viz., node 1) is causally connected to the object to which the content A is ascribed (viz., node 2); but there is no structural (e.g., no part/whole) relation that holds between them. In short, it is characteristic of Classical systems, but not of Connectionist systems, to exploit arrays of symbols some of which are atomic (e.g., expressions like 'A') but indefinitely many of which have other symbols as syntactic and semantic parts (e.g., expressions like 'A&B').11
It is easy to overlook this difference between Classical and Connectionist systems when examining a Connectionist model. There are at least four ways in which one might be led to do so: (1) by failing to understand the difference between what arrays of symbols do in Classical machines and what node labels do in Connectionist machines; (2) by confusing the question whether the nodes in Connectionist networks have constituent structure with the question whether they are neurologically distributed; (3) by failing to distinguish between a representation's having semantic and syntactic constituents and a concept's being encoded in terms of microfeatures; and (4) by assuming that since representations of Connectionist networks have a graph structure, it follows that the nodes in the networks have a corresponding constituent structure. We shall now need rather a long digression to clear up these misunderstandings.
11This makes the compositionality of data structures a defining property of Classical architecture. But, of course, it leaves open the question of the degree to which natural languages (like English) are also compositional.
2.1.1. The role of labels in Connectionist theories
In the course of setting out a Connectionist model, intentional content will be assigned to machine states, and the expressions of some language or other will, of course, be used to express this assignment; for example, nodes may be labelled to indicate their representational content. Such labels often have
a combinatorial syntax and semantics ; in this respect , they can look a lot like
Classical mental representations . The point to emphasize, however , is that it doesn't follow (and it isn't true ) that the nodes to which these labels are assigned have a combinatorial syntax and semantics. 'A & B ' , for example ,
can be tokened on the tape of the Classical machine and can also appear as a label in a Connectionist machine, as it does in diagram 2 above. And, of course, the expression 'A&B' is syntactically and semantically complex: it has a token of 'A' as one of its syntactic constituents, and the semantics of the expression 'A&B' is a function of the semantics of the expression 'A'. But it isn't part of the intended reading of the diagram that node 1 itself has constituents; the node, unlike its label, has no semantically interpreted parts. It is, in short, important to understand the difference between Connectionist labels and the symbols over which Classical computations are defined. The difference is this: strictly speaking, the labels play no role at all in determining the operation of a Connectionist machine; in particular, the operation of the machine is unaffected by the syntactic and semantic relations that hold among the expressions that are used as labels. To put this another way, the
node labels in a Connectionist machine are not part of the causal structure
of the machine. Thus, the machine depicted in Figure 2 will continue to make the same state transitions regardless of what labels we assign to the nodes. Whereas, by contrast, the state transitions of Classical machines are causally determined by the structure (including the constituent structure) of the symbol arrays that the machines transform: change the symbols and the system behaves quite differently. (In fact, since the behavior of a Classical machine is sensitive to the syntax of the representations it computes on, even interchanging synonymous, semantically equivalent, representations affects the course of computation.) So, although the Connectionist's labels and the Classicist's
data structures both constitute languages, only the latter language constitutes a medium of computation.12
2.1.2. Connectionist networks and graph structures
The second reason that the lack of syntactic and semantic structure in Connectionist representations has largely been ignored may be that Connectionist networks look like general graphs; and it is, of course, perfectly possible to use graphs to describe the internal structure of a complex symbol. That's precisely what linguists do when they use 'trees' to exhibit the constituent structure of sentences. Correspondingly, one could imagine a graph
notation that expresses the internal structure of mental representations by
using arcs and labelled nodes. So, for example, you might express the syntax of the mental representation that corresponds to the thought that John loves the girl like this:
John → loves → the girl
Under the intended interpretation, this would be the structural description of a mental representation whose content is that John loves the girl, and whose constituents are: a mental representation that refers to John, a mental representation that refers to the girl, and a mental representation that expresses the two-place relation represented by '→ loves →'.
But although graphs can sustain an interpretation as specifying the logical syntax of a complex mental representation, this interpretation is inappropriate for graphs of Connectionist networks. Connectionist graphs are not structural descriptions of mental representations; they're specifications of causal relations. All that a Connectionist can mean by a graph of the form X → Y is: states of node X causally affect states of node Y. In particular, the graph can't mean X is a constituent of Y or X is grammatically related to Y etc., since these sorts of relations are, in general, not defined for the kinds of mental representations that Connectionists recognize. Another way to put this is that the links in Connectionist diagrams are not generalized pointers that can be made to take on different functional significance by an independent interpreter, but are confined to meaning something like "sends activation to". The intended interpretation of the links as causal connections is intrinsic to the theory. If you ignore this point, you are likely to take Connectionism to offer a much richer notion of mental representation than it actually does.
12Labels aren't part of the causal structure of a Connectionist machine, but they may play an essential role in its causal history, insofar as designers wire their machines to respect the semantical relations that the labels express. For example, in Ballard's (1987) Connectionist model of theorem proving, there is a mechanical procedure for wiring a network which will carry out proofs by unification. This procedure is a function from a set of node labels to a wired-up machine. There is thus an interesting and revealing respect in which node labels are relevant to the operations that get performed when the function is executed. But, of course, the machine on which the labels have the effect is not the machine whose states they are labels of; and the effect of the labels occurs at the time that the theorem-proving machine is constructed, not at the time its reasoning process is carried out. This sort of case of labels 'having effects' is thus quite different from the way that symbol tokens (e.g., tokened data structures) can affect the causal processes of a Classical machine.
2.1.3. Distributed representations
The third mistake that can lead to a failure to notice that the mental representations in Connectionist models lack combinatorial syntactic and semantic structure is the fact that many Connectionists view representations as being neurologically distributed; and, presumably, whatever is distributed must have parts. It doesn't follow, however, that whatever is distributed must have constituents; being neurologically distributed is very different from having semantic or syntactic constituent structure. You have constituent structure when (and only when) the parts of semantically evaluable entities are themselves semantically evaluable. Constituency relations thus hold among objects all of which are at the representational level; they are, in that sense, within-level relations.13 By contrast, neural distributedness, the sort of relation that is assumed to hold between 'nodes' and the 'units' by which they are realized, is a between-level relation: the nodes, but not the units, count as representations. To claim that a node is neurally distributed is presumably to claim that its states of activation correspond to patterns of neural activity (to aggregates of neural 'units') rather than to activations of single neurons. The important point is that nodes that are distributed in this sense can perfectly well be syntactically and semantically atomic: complex spatially-distributed implementation in no way implies constituent structure.
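The distinction between being distributed over units and having constituents can be made concrete with a small sketch (our own illustrative code and data, not from the paper):

```python
# A node can be "distributed" -- realized as a pattern of activation
# over many units -- while remaining syntactically and semantically
# ATOMIC. The units are implementation; they are not semantic parts.

# Suppose the node labelled 'A&B' is realized as a pattern over six
# units (the numbers are arbitrary, for illustration only):
realization = {'A&B': [0.2, 0.9, 0.1, 0.7, 0.0, 0.4]}

# No unit (and no subpattern) represents 'A' or 'B'. Contrast a
# Classical symbol, whose parts ARE semantically evaluable:
classical_constituents = {'A&B': ['A', 'B']}
connectionist_constituents = {'A&B': []}   # distributed, yet atomic

print(len(realization['A&B']))             # 6 units realize one node
print(connectionist_constituents['A&B'])   # []: no semantic parts
```

The design point the sketch records: the unit-level pattern is a between-level (implementation) fact, while constituency, when present, is a within-level fact about the representations themselves.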
There is, however , a different sense in which the representational states in a network might be distributed , and this sort of distribution also raises questions relevant to the constituency issue. 2.1.4. Representations as 'distributed ' over microfeatures Many Connectionists hold that the mental representations that correspond to commonsense concepts (CHAIR , JOHN , CUP , etc.) are 'distributed ' over galaxies of lower level units which themselves have representational content . To use common Connectionist terminology (see Smolensky , 1988) , the higher or " conceptual level " units correspond to vectors in a " sub-conceptual " space
13Any relation specified as holding among representational states is, by definition, within the 'cognitive level'. It goes without saying that relations that are 'within-level' by this criterion can count as 'between-level' when we use criteria of finer grain. There is, for example, nothing to prevent hierarchies of levels of representational states.
of microfeatures. The model here is something like the relation between a defined expression and its defining feature analysis: thus, the concept BACHELOR might be thought to correspond to a vector in a space of features that includes ADULT, HUMAN, MALE, and MARRIED; i.e., as an assignment of the value + to the first three features and - to the last. Notice that distribution over microfeatures (unlike distribution over neural units) is a relation among representations, hence a relation at the cognitive level.
Since microfeatures are frequently assumed to be derived automatically (i.e., via learning procedures) from the statistical properties of samples of stimuli, we can think of them as expressing the sorts of properties that are revealed by multivariate analysis of sets of stimuli (e.g., by multidimensional scaling of similarity judgments). In particular, they need not correspond to English words; they can be finer-grained than, or otherwise atypical of, the terms for which a non-specialist needs to have a word. Other than that, however, they are perfectly ordinary semantic features, much like those that lexicographers have traditionally used to represent the meanings of words.
On the most frequent Connectionist accounts, theories articulated in terms of microfeature vectors are supposed to show how concepts are actually encoded, hence the feature vectors are intended to replace "less precise" specifications of macrolevel concepts. For example, where a Classical theorist might recognize a psychological state of entertaining the concept CUP, a Connectionist may acknowledge only a roughly analogous state of tokening the corresponding feature vector. (One reason that the analogy is only rough is that which feature vector 'corresponds' to a given concept may be viewed as heavily context dependent.) The generalizations that 'concept level' theories frame are thus taken to be only approximately true, the exact truth being stateable only in the vocabulary of the microfeatures. Smolensky, for example (p. 11), is explicit in endorsing this picture: "Precise, formal descriptions of the intuitive processor are generally tractable not at the conceptual level, but only at the subconceptual level."14 This treatment of the relation between
14Smolensky (1988, p. 14) remarks that "unlike symbolic tokens, these vectors lie in a topological space, in which some are close together and others are far apart". However, this seems to radically conflate claims about the Connectionist model and claims about its implementation (a conflation that is not unusual in the Connectionist literature, as we'll see in Section 4). If the space at issue is physical, then Smolensky is committed to extremely strong claims about adjacency relations in the brain; claims for which there is, in fact, no reason at all to believe. But if, as seems more plausible, the space at issue is semantical, then what Smolensky says isn't true. Practically any cognitive theory will imply distance measures between mental representations. In Classical theories, for example, the distance between two representations is plausibly related to the number of computational steps it takes to derive one representation from the other. In Connectionist theories, it is plausibly related to the number of intervening nodes (or to the degree of overlap between vectors, depending on the version of Connectionism one has in mind). The interesting claim is not that an architecture offers a distance measure but that it offers the right distance measure, one that is empirically certifiable.
commonsense concepts and microfeatures is exactly analogous to the standard Connectionist treatment of rules; in both cases, macrolevel theory is said to provide a vocabulary adequate for formulating generalizations that roughly approximate the facts about behavioral regularities. But the constructs of the macrotheory do not correspond to the causal mechanisms that generate these regularities. If you want a theory of these mechanisms, you need to replace talk about rules and concepts with talk about nodes, connections, microfeatures, vectors and the like.15
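The picture of a concept as an assignment of values in a microfeature space (the BACHELOR example above) can be sketched in a few lines (our own illustrative code; the feature encoding is our choice, not the paper's):

```python
# A minimal sketch of a concept 'distributed over microfeatures', on
# the defined-expression/feature-analysis model described in the text.

FEATURES = ['ADULT', 'HUMAN', 'MALE', 'MARRIED']

# +1 for a feature the concept entails, -1 for one it excludes.
BACHELOR = {'ADULT': +1, 'HUMAN': +1, 'MALE': +1, 'MARRIED': -1}

def vector(concept):
    """Render a concept as a vector over the fixed microfeature space."""
    return [concept[f] for f in FEATURES]

print(vector(BACHELOR))    # [1, 1, 1, -1]
```

Note that, as the text says, this relation holds among representations: both BACHELOR and the microfeatures are semantically interpreted, so the mapping is a cognitive-level, not an implementation-level, fact.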
Now , it is among the major misfortunes of the Connectionist literature that the issue about whether commonsense concepts should be represented by sets of microfeatures has gotten thoroughly mixed up with the issue about combinatorial structure in mental representations . The crux of the mixup is the fact that sets of microfeatures can overlap , so that , for example , if a microfeature corresponding to ' + has-a-handle ' is part of the array of nodes
over which the commonsense concept CUP is distributed , then you might
think of the theory as representing ' + has-a-handle ' as a constituent of the concept CUP ; from which you might conclude that Connectionists have a notion of constituency after all , contrary to the claim that Connectionism is not a language-of-thought architecture (see Smolensky , 1988) .
A moment ' s consideration will make it clear , however , that even on the
assumption that concepts are distributed over microfeatures, '+has-a-handle' is not a constituent of CUP in anything like the sense that 'Mary' (the word) is a constituent of (the sentence) 'John loves Mary'. In the former case, "constituency" is being (mis)used to refer to a semantic relation between predicates; roughly, the idea is that macrolevel predicates like CUP are defined by sets of microfeatures like 'has-a-handle', so that it's some sort of semantic truth that CUP applies to a subset of what 'has-a-handle' applies to. Notice that while the extensions of these predicates are in a set/subset relation, the predicates themselves are not in any sort of part-to-whole relation. The expression 'has-a-handle' isn't part of the expression CUP any more
15The primary use that Connectionists make of microfeatures is in their accounts of generalization and abstraction (see, for example, Hinton, McClelland, & Rumelhart, 1986). Roughly, you get generalization by using overlap of microfeatures to define a similarity space, and you get abstraction by making the vectors that correspond to types be subvectors of the ones that correspond to their tokens. Similar proposals have quite a long history in traditional Empiricist analysis; and have been roundly criticized over the centuries. (For a discussion of abstractionism, see Geach, 1957; that similarity is a primitive relation, hence not reducible to partial identity of feature sets, was, of course, a main tenet of Gestalt psychology, as well as of more recent approaches based on "prototypes".) The treatment of microfeatures in the Connectionist literature would appear to be very close to early proposals by Katz and Fodor (1963) and Katz and Postal (1964), where both the idea of a feature analysis of concepts and the idea that relations of semantical containment among concepts should be identified with set-theoretic relations among feature arrays are explicitly endorsed.
than the English phrase 'is an unmarried man' is part of the English phrase
' is a bachelor '.
Real constituency does have to do with parts and wholes; the symbol 'Mary' is literally a part of the symbol 'John loves Mary'. It is because their symbols enter into real-constituency relations that natural languages have both atomic symbols and complex ones. By contrast, the definition relation can hold in a language where all the symbols are syntactically atomic; e.g., a language which contains both 'cup' and 'has-a-handle' as atomic predicates. This point is worth stressing. The question whether a representational system has real constituency is independent of the question of microfeature analysis; it arises both for systems in which you have CUP as semantically primitive, and for systems in which the semantic primitives are things like '+has-a-handle' and CUP and the like are defined in terms of these primitives. It really is very important not to confuse the semantic distinction between primitive expressions and defined expressions with the syntactic distinction between atomic symbols and complex symbols.
So far as we know , there are no worked out attempts in the Connectionist
literature to deal with the syntactic and semantic issues raised by relations of real constituency. There is, however, a proposal that comes up from time to time: viz., that what are traditionally treated as complex symbols should actually be viewed as just sets of units, with the role relations that traditionally get coded by constituent structure represented by units belonging to these sets. So, for example, the mental representation corresponding to the belief that John loves Mary might be the feature vector {+John-subject; +loves; +Mary-object}. Here 'John-subject', 'Mary-object' and the like are the labels of units; that is, they are atomic (i.e., micro-) features, whose status is analogous to 'has-a-handle'. In particular, they have no internal syntactic analysis, and there is no structural relation (except the orthographic one) between the feature 'Mary-object' that occurs in the set {John-subject; loves; Mary-object} and the feature 'Mary-subject' that occurs in the set {Mary-subject; loves; John-object}. (See, for example, the discussion in Hinton, 1987 of "role-specific descriptors that represent the conjunction of an identity and a role [by the use of which] we can implement part-whole hierarchies using set intersection as the composition rule." See also McClelland,
Rumelhart & Hinton, 1986, pp. 82-85, where what appears to be the same proposal is endorsed.) It is, however, not clear what these sorts of ideas would come to if
you were to take them seriously . As we understand it , the proposal really has two parts : On the one hand ,
it's suggested that although Connectionist representations cannot exhibit real constituency, nevertheless the Classical distinction between complex symbols and their constituents can be replaced by the distinction between feature sets and their subsets; and, on the other hand, it's suggested that role relations can be captured by features. We'll consider these ideas in turn.
(1) Instead of having complex symbols like "John loves Mary" in the representational system, you have feature sets like {+John-subject; +loves; +Mary-object}. Since this set has {+John-subject}, {+loves; +Mary-object}, and so forth as subsets, it may be supposed that the force of the constituency relation has been captured by employing the subset relation.
However, it's clear that this idea won't work, since not all subsets of features correspond to genuine constituents. For example, among the subsets of {+John-subject; +loves; +Mary-object} are the sets {+John-subject; +Mary-object} and the set {+John-subject; +loves}, which do not, of course, correspond to constituents of the complex symbol "John loves Mary".
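The overgeneration of the subset relation can be shown directly (our own illustrative sketch, not from the paper):

```python
# Once "John loves Mary" is traded for a feature set, every subset is
# on an equal footing -- but only some subsets correspond to genuine
# constituents of the original symbol.

belief = {'+John-subject', '+loves', '+Mary-object'}

# Subsets the text flags as corresponding to NO constituent:
bad1 = {'+John-subject', '+Mary-object'}
bad2 = {'+John-subject', '+loves'}

print(bad1 <= belief, bad2 <= belief)   # True True: both are subsets
# ...yet nothing in the set-theoretic apparatus marks them off from
# subsets (like {'+loves'}) that do answer to constituents: the subset
# relation is blind to which groupings are syntactically real.
```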
one does in Classical architecture , introduce them as microfeatures .
Consider a system in which the mental representation that is entertained when one believes that John loves Mary is the feature set {+John-subject; +loves; +Mary-object}. What representation corresponds to the belief that John loves Mary and Bill hates Sally? Suppose, pursuant to the present proposal, that it's the set {+John-subject; +loves; +Mary-object; +Bill-subject; +hates; +Sally-object}. We now have the problem of distinguishing that belief from the belief that John loves Sally and Bill hates Mary; and from the belief that John hates Mary and Bill loves Sally; and from the belief that John hates Mary and Sally and Bill loves Mary; etc., since these other beliefs will all correspond to precisely the same set of features. The problem is, of course, that nothing in the representation of Mary as +Mary-object specifies whether it's the loving or the hating that she is the object of; similarly, mutatis mutandis, for the representation of John as +John-subject.
What has gone wrong isn't disastrous (yet). All that's required is to enrich the system of representations by recognizing features that correspond not to (for example) just being a subject, but rather to being the subject of a loving of Mary (the property that John has when John loves Mary) and being the subject of a hating of Sally (the property that Bill has when Bill hates Sally). So, the representation of John that's entertained when one believes that John loves Mary and Bill hates Sally might be something like +John-subject-loves-Mary-object.
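The collapse of distinct beliefs into one representation can be exhibited mechanically (our own sketch; the helper function is our invention):

```python
# Representing a conjunctive belief as one undifferentiated feature set
# makes distinct beliefs indistinguishable.

def features(subj, verb, obj):
    """Build the role-feature set for a simple subject-verb-object belief."""
    return {f'+{subj}-subject', f'+{verb}', f'+{obj}-object'}

# "John loves Mary and Bill hates Sally":
b1 = features('John', 'loves', 'Mary') | features('Bill', 'hates', 'Sally')
# "John loves Sally and Bill hates Mary":
b2 = features('John', 'loves', 'Sally') | features('Bill', 'hates', 'Mary')

# Nothing ties +Mary-object to the loving rather than the hating:
print(b1 == b2)    # True: two different beliefs, one representation
```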
The disadvantage of this proposal is that it requires rather a lot of microfeatures.16 How many? Well, a number of the order of magnitude of the sentences of a natural language (whereas one might have hoped to get by with a vocabulary of basic expressions that is not vastly larger than the lexicon of a natural language; after all, natural languages do). We leave it to the reader to estimate the number of microfeatures you would need, assuming that there is a distinct belief corresponding to every grammatical sentence of English of up to, say, fifteen words of length, and assuming that there is an average of, say, five roles associated with each belief. (Hint: George Miller once estimated that the number of well-formed 20-word sentences of English is of the order of magnitude of the number of seconds in the history of the universe.)
The alternative to this grotesque explosion of atomic symbols would be to have a combinatorial syntax and semantics for the features. But, of course, this is just to give up the game, since the syntactic and semantic relations that hold among the parts of the complex feature +((John subject) loves (Mary object)) are the very same ones that Classically hold among the constituents of the complex symbol "John loves Mary"; these include the role relations which Connectionists had proposed to reconstruct using just sets of atomic features. It is, of course, no accident that the Connectionist proposal for dealing with role relations runs into these sorts of problems. Subject, object and the rest are Classically defined with respect to the geometry of constituent structure trees. And Connectionist representations don't have constituents. The idea that we should capture role relations by allowing features like John-subject thus turns out to be bankrupt; and there doesn't seem to be any other way to get the force of structured symbols in a Connectionist architecture.
Or , if there is, nobody has given any indication of how to do it . This becomes clear once the crucial issue about structure in mental representations is disentangled from the relatively secondary (and orthogonal ) issue about whether the representation of commonsense concepts is 'distributed ' (i .e., from questions like whether it 's CUP or 'has-a-handle ' or both that is semantically primitive in the language of thought ) . It 's worth adding that these problems about expressing the role relations are actually just a symptom of a more pervasive difficulty : A consequence of restricting the vehicles of mental representation to sets of atomic symbols is a notation that fails quite generally to express the way that concepts group
16Another disadvantage is that, strictly speaking, it doesn't work; although it allows us to distinguish the belief that John loves Mary and Bill hates Sally from the belief that John loves Sally and Bill hates Mary, we don't yet have a way to distinguish believing that (John loves Mary because Bill hates Sally) from believing that (Bill hates Sally because John loves Mary). Presumably nobody would want to have microfeatures corresponding to these.
into propositions. To see this, let's continue to suppose that we have a network in which the nodes represent concepts rather than propositions (so that what corresponds to the thought that John loves Mary is a distribution of activation over the set of nodes {JOHN; LOVES; MARY} rather than the activation of a single node labelled JOHN LOVES MARY). Notice that it cannot plausibly be assumed that all the nodes that happen to be active at a given time will correspond to concepts that are constituents of the same proposition; least of all if the architecture is "massively parallel" so that many things are allowed to go on (many concepts are allowed to be entertained simultaneously) in a given mind. Imagine, then, the following situation: at time t, a man is looking at the sky (so the nodes corresponding to SKY and BLUE are active) and thinking that John loves Fido (so the nodes corresponding to JOHN, LOVES, and FIDO are active), and the node FIDO is connected to the node DOG (which is in turn connected to the node ANIMAL) in such fashion that DOG and ANIMAL are active too. We can, if you like, throw it in that the man has got an itch, so ITCH is also on.
According to the current theory of mental representation, this man's mind at t is specified by the vector {+JOHN, +LOVES, +FIDO, +DOG, +SKY, +BLUE, +ITCH, +ANIMAL}. And the question is: which subvectors of this vector correspond to thoughts that the man is thinking? Specifically, what is it about the man's representational state that determines that the simultaneous activation of the nodes {JOHN, LOVES, FIDO} constitutes his thinking that John loves Fido, but the simultaneous activation of DOG, ANIMAL and BLUE does not constitute his thinking that Fido is a blue animal? It seems that we made it too easy for ourselves when we identified the thought that John loves Mary with the vector {+JOHN, +LOVES, +MARY}; at best that works only on the assumption that JOHN, LOVES and MARY are the only nodes active when someone has that thought. And that's an assumption to which no theory of mental representation is entitled.
It's important to see that this problem arises precisely because the theory is trying to use sets of atomic representations to do a job that you really need complex representations for. Thus, the question we're wanting to answer is: given the total set of nodes active at a time, what distinguishes the subvectors that correspond to propositions from the subvectors that don't? This question has a straightforward answer if, contrary to the present proposal, complex representations are assumed: when representations express concepts that belong to the same proposition, they are not merely simultaneously active, but also in construction with each other. By contrast, representations that express concepts that don't belong to the same proposition may be simultaneously active; but they are, ipso facto, not in construction with each other. In short, you need two degrees of freedom to specify the thoughts that an agent is entertaining at a time: one determines which concepts are active, and a second determines which of the active concepts are in construction with which others.
Strikingly enough , the point that we've been making in the past several paragraphs is very close to one that Kant made against the Associationists of his day . In " Transcendental Deduction (B )" of The First Critique , Kant remarks that :
... if I investigate ... the relation of the given modes of knowledge in any judgement , and distinguish it , as belonging to the understanding , from the relation according to laws of the reproductive imagination [e.g., according to the principles of association] , which has only subjective validity , I find that a judgement is nothing but the manner in which given modes of knowledge are brought to the objective unity of apperception . This is what is intended by the copula "is" . It is employed to distinguish the objective unity of given representations from the subjective .... Only in this way does there arise from the relation a judgement , that is , a relation which is objectively valid , and so can be adequately distinguished from a relation of the same representations that would have only subjective validity (as when they are connected according to laws of association) . In the latter case , all that I could say would be 'If I support a body , I feel an impression of weight' ; I could not say , 'It , the body , is heavy' . Thus to say 'The body is heavy' is not merely to state that the two representations have always been conjoined in my perception , ... what we are asserting is that they are combined in the object ... (CPR , p . 159 ; emphasis Kant 's)
A modern paraphrase might be : A theory of mental representation must distinguish the case when two concepts (e.g., THIS BODY , HEAVY) are merely simultaneously entertained from the case where , to put it roughly , the property that one of the concepts expresses is predicated of the thing that the other concept denotes (as in the thought : THIS BODY IS HEAVY) . The
relevant distinction is that while both concepts are " active " in both cases , in the latter case but not in the former the active concepts are in construction .
Kant thinks that " this is what is intended by the copula 'is' " . But of course there are other notational devices that can serve to specify that concepts are in construction ; notably the bracketing structure of constituency trees . There are, to reiterate , two questions that you need to answer to specify
the content of a mental state : " Which concepts are ' active ' " and " Which of
the active concepts are in construction with which others ?" Identifying mental states with sets of active nodes provides resources to answer the first of these questions but not the second. That 's why the version of network theory that acknowledges sets of atomic representations but no complex representations fails , in indefinitely many cases, to distinguish mental states that are in fact
distinct .
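The argument in this passage can be made concrete with a small sketch (ours, not the authors'; the node names follow the example in the text). A flat set of active nodes underdetermines which groupings of concepts constitute a single thought, whereas a constituency (bracketing) structure supplies the second degree of freedom:

```python
# Sketch (not from the paper): why a set of active nodes underdetermines
# which groupings of concepts are "in construction" as a single thought.

# The man's total activation state at time t, as in the text:
active_nodes = {"JOHN", "LOVES", "FIDO", "DOG", "ANIMAL", "SKY", "BLUE", "ITCH"}

# Both candidate "thoughts" are subsets of the active nodes, so nothing in
# the flat-set representation distinguishes the one he is thinking from the
# one he is not:
john_loves_fido = {"JOHN", "LOVES", "FIDO"}
fido_blue_animal = {"FIDO", "BLUE", "ANIMAL"}
assert john_loves_fido <= active_nodes
assert fido_blue_animal <= active_nodes   # equally "active", yet not thought

# A complex (constituent-structured) representation adds the second degree
# of freedom: concepts in the same proposition are grouped by bracketing.
# Here a thought is a tree, written as nested tuples.
thoughts = [
    ("LOVES", "JOHN", "FIDO"),   # John loves Fido
    ("BLUE", "SKY"),             # the sky is blue
]

def in_construction(concepts, thought):
    """True iff all the concepts occur together inside one thought tree."""
    def leaves(t):
        return {t} if isinstance(t, str) else set().union(*map(leaves, t))
    return set(concepts) <= leaves(thought)

assert any(in_construction({"JOHN", "LOVES", "FIDO"}, t) for t in thoughts)
assert not any(in_construction({"FIDO", "BLUE", "ANIMAL"}, t) for t in thoughts)
```

The point of the sketch is only that the `thoughts` structure answers both questions (which concepts are active, and which are in construction), while `active_nodes` answers only the first.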
nected) representations of concepts , theories that acknowledge complex symbols define semantic interpretations over sets of representations of concepts together with specifications of the constituency relations that hold among these representations .
But we are not claiming that you can't reconcile a Connectionist architecture with an adequate theory of mental representation (specifically , with a combinatorial syntax and semantics for mental representations) . On the contrary , of course you can : All that's required is that you use your network to implement a Turing machine , and specify a combinatorial structure for its computational language . What it appears that you can't do , however , is have both a combinatorial representational system and a Connectionist architecture at the cognitive level .

So much , then , for our long digression . We have now reviewed one of the major respects in which Connectionist and Classical theories differ ; viz., their accounts of mental representations . We turn to the second major difference , which concerns their accounts of mental processes .
2.2. Structure sensitive operations
Classicists and Connectionists both offer accounts of mental processes , but their theories differ sharply . In particular , the Classical theory relies heavily on the notion of the logico/syntactic form of mental representations to define the ranges and domains of mental operations . This notion is , however , unavailable to orthodox Connectionists since it presupposes that there are nonatomic mental representations .

The Classical treatment of mental processes rests on two ideas , each of which corresponds to an aspect of the Classical theory of computation . Together they explain why the Classical view postulates at least three distinct levels of organization in computational systems : not just a physical level and a semantic (or "knowledge") level , but a syntactic level as well .

The first idea is that it is possible to construct languages in which certain features of the syntactic structures of formulas correspond systematically to certain of their semantic features . Intuitively , the idea is that in such languages the syntax of a formula encodes its meaning ; most especially , those aspects of its meaning that determine its role in inference . All the artificial languages that are used for logic have this property and English has it more or less . Classicists believe that it is a crucial property of the Language of Thought .

A simple example of how a language can use syntactic structure to encode inferential roles and relations among meanings may help to illustrate this point . Thus , consider the relation between the following two sentences :

(1) John went to the store and Mary went to the store.
(2) Mary went to the store.

On the one hand , from the semantic point of view , (1) entails (2) (so , of
course , inferences from (1) to (2) are truth preserving) . On the other hand , from the syntactic point of view , (2) is a constituent of (1) . These two facts can be brought into phase by exploiting the principle that sentences with the syntactic structure 'S1 and S2' entail their sentential constituents . Notice that this principle connects the syntax of these sentences with their inferential roles . Notice too that the trick relies on facts about the grammar of English ; it wouldn't work in a language where the formula that expresses the conjunctive content John went to the store and Mary went to the store is syntactically atomic.18

Here is another example . We can reconstruct such truth preserving inferences as if Rover bites then something bites on the assumption that the sentence 'Rover bites' is of the syntactic type Fa , the sentence 'something bites' is of the syntactic type ∃x (Fx) , and every formula of the first type entails a corresponding formula of the second type (where the notion 'corresponding formula' is cashed syntactically ; roughly , the two formulas must differ only in that the one has an existentially bound variable at the syntactic position that is occupied by a constant in the other) . Once again the point to notice is the blending of syntactical and semantical notions : The rule of existential generalization applies to formulas in virtue of their syntactic form . But the salient property that's preserved under applications of the rule is semantical : What's claimed for the transformation that the rule performs is that it is truth preserving.19

There are , as it turns out , examples that are quite a lot more complicated than these . The whole of the branch of logic known as proof theory is devoted to exploring them.20 It would not be unreasonable to describe Classical Cog-

18 And it doesn't work uniformly for English conjunction . Compare : John and Mary are friends → *John are friends ; or The flag is red , white and blue → The flag is blue . Such cases show either that English is not the language of thought , or that , if it is , the relation between syntax and semantics is a good deal subtler for the language of thought than it is for the standard logical languages .

19 It needn't , however , be strict truth-preservation that makes the syntactic approach relevant to cognition . Other semantic properties might be preserved under syntactic transformation in the course of mental processing (e.g., warrant , plausibility , heuristic value , or simply semantic non-arbitrariness) . The point of Classical modeling isn't to characterize human thought as supremely logical ; rather , it's to show how a family of types of semantically coherent (or knowledge-dependent) reasoning are mechanically possible . Valid inference is the paradigm only in that it is the best understood member of this family ; the one for which syntactical analogues for semantical relations have been most systematically elaborated .

20 It is not uncommon for Connectionists to make disparaging remarks about the relevance of logic to psychology , even though they accept the idea that inference is involved in reasoning . Sometimes the suggestion seems to be that it's all right if Connectionism can't reconstruct the theory of inference that formal deductive logic provides , since it has something even better on offer . For example , in their report to the U.S. National Science Foundation , McClelland , Feldman , Adelson , Bower & McDermott (1986) state that "... connectionist models realize an evidential logic in contrast to the symbolic logic of conventional computing (p. , our emphasis)" and that "evidential logics are becoming increasingly important in cognitive science and
nitive Science as an extended attempt to apply the methods of proof theory to the modeling of thought (and similarly , of whatever other mental processes are plausibly viewed as involving inferences ; preeminently learning and perception) . Classical theory construction rests on the hope that syntactic analogues can be constructed for nondemonstrative inferences (or informal , commonsense reasoning) in something like the way that proof theory has provided syntactic analogues for validity .

The second main idea underlying the Classical treatment of mental processes is that it is possible to devise machines whose function is the transformation of symbols , and whose operations are sensitive to the syntactical structure of the symbols that they operate upon . This is the Classical conception of a computer : it's what the various architectures that derive from Turing and Von Neumann machines all have in common .

Perhaps it's obvious how the two 'main ideas' fit together . If , in principle , syntactic relations can be made to parallel semantic relations , and if , in principle , you can have a mechanism whose operations on formulas are sensitive to their syntax , then it may be possible to construct a syntactically driven machine whose state transitions satisfy semantical criteria of coherence . Such a machine would be just what's required for a mechanical model of the semantical coherence of thought ; correspondingly , the idea that the brain is such a machine is the foundational hypothesis of Classical cognitive science .

So much for the Classical story about mental processes . The Connectionist story must , of course , be quite different : Since Connectionists eschew postulating mental representations with combinatorial syntactic/semantic structure , they are precluded from postulating mental processes that operate on mental representations in a way that is sensitive to their structure .
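A minimal sketch (ours, not the authors') of a "syntactically driven machine" of the kind just described: inference rules that inspect only the form of a formula, yet whose outputs are guaranteed truth-preserving by the semantics the syntax encodes. The tuple encoding and rule names are illustrative assumptions:

```python
# Sketch (ours): structure-sensitive rules operating on formulas represented
# as constituency trees (nested tuples), per the Classical picture.

def and_elimination(formula):
    """'S1 and S2' entails each sentential constituent (by syntax alone)."""
    if isinstance(formula, tuple) and formula[0] == "and":
        return list(formula[1:])
    return []

def existential_generalization(formula):
    """From Fa infer ∃x Fx: replace the constant at the argument position
    with an existentially bound variable, looking only at syntactic form."""
    if isinstance(formula, tuple) and formula[0] == "pred":
        _, f, _const = formula      # e.g. ("pred", "bites", "Rover")
        return ("exists", "x", ("pred", f, "x"))
    return None

conj = ("and",
        ("pred", "went-to-store", "John"),
        ("pred", "went-to-store", "Mary"))
assert and_elimination(conj) == [("pred", "went-to-store", "John"),
                                 ("pred", "went-to-store", "Mary")]

rover = ("pred", "bites", "Rover")
assert existential_generalization(rover) == ("exists", "x", ("pred", "bites", "x"))
```

Neither rule ever consults a meaning; that is the sense in which the machine is syntactically driven while its state transitions nonetheless satisfy semantic criteria of coherence.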
The sorts of operations that Connectionist models do have are of two sorts , depending on whether the process under examination is learning or reasoning .
2.2.1. Learning

If a Connectionist model is intended to learn , there will be processes that determine the weights of the connections among its units as a function of the character of its training . Typically , in a Connectionist machine (such as a 'Boltzmann Machine') the weights among connections are adjusted until the system's behavior comes to model the statistical properties of its inputs . In
have a natural map to connectionist modeling" (p. 7) . It is , however , hard to understand the implied contrast since , on the one hand , evidential logic must surely be a fairly conservative extension of "the symbolic logic of conventional computing" (i.e., most of the theorems of the latter have to come out true in the former) and , on the other , there is not the slightest reason to doubt that an evidential logic would 'run' on a Classical machine . Prima facie , the problem about evidential logic isn't that we've got one that we don't know how to implement ; it's that we haven't got one .
the limit , the stochastic relations among machine states recapitulate the stochastic relations among the environmental events that they represent . This should bring to mind the old Associationist principle that the strength of association between 'Ideas' is a function of the frequency with which they are paired 'in experience' , and the Learning Theoretic principle that the strength of a stimulus-response connection is a function of the frequency with which the response is rewarded in the presence of the stimulus . But though Connectionists , like other Associationists , are committed to learning processes that model statistical properties of inputs and outputs , the simple mechanisms based on co-occurrence statistics that were the hallmarks of old-fashioned Associationism have been augmented in Connectionist models by a number of technical devices . (Hence the 'new' in 'New Connectionism'.) For example , some of the earlier limitations of associative mechanisms are overcome by allowing the network to contain 'hidden' units (or aggregates) that are not directly connected to the environment and whose purpose is , in effect , to detect statistical patterns in the activity of the 'visible' units including , perhaps , patterns that are more abstract or more 'global' than the ones that could be detected by old-fashioned perceptrons.21

In short , sophisticated versions of the associative principles for weight-setting are on offer in the Connectionist literature . The point of present concern , however , is what all versions of these principles have in common with one another and with older kinds of Associationism : viz., these processes are all frequency-sensitive . To return to the example discussed above : if a
Connectionist learning machine converges on a state where it is prepared to infer A from A&B (i.e., to a state in which , when the 'A&B' node is excited , it tends to settle into a state in which the 'A' node is excited) , the convergence will typically be caused by statistical properties of the machine's training experience : e.g., by correlation between firing of the 'A&B' node and firing of the 'A' node , or by correlations of the firing of both with some feedback signal . Like traditional Associationism , Connectionism treats learning as basically a sort of statistical modeling .

2.2.2. Reasoning

Association operates to alter the structure of a network diachronically as a function of its training . Connectionist models also contain a variety of types of 'relaxation' processes which determine the synchronic behavior of a network ; specifically , they determine what output the device provides for a given pattern of inputs . In this respect , one can think of a Connectionist
21 Compare the "little s's" and "little r's" of neo-Hullean "mediational" Associationists like Charles Osgood .
model as a species of analog machine constructed to realize a certain function . The inputs to the function are (i) a specification of the connectedness of the machine (of which nodes are connected to which) ; (ii) a specification of the weights along the connections ; (iii) a specification of the values of a variety of idiosyncratic parameters of the nodes (e.g., intrinsic thresholds ; time since last firing , etc.) ; (iv) a specification of a pattern of excitation over the input nodes . The output of the function is a specification of a pattern of excitation over the output nodes ; intuitively , the machine chooses the output pattern that is most highly associated to its input .

Much of the mathematical sophistication of Connectionist theorizing has been devoted to devising analog solutions to this problem of finding a 'most highly associated' output corresponding to an arbitrary input ; but , once again , the details needn't concern us . What is important , for our purposes , is
another property that Connectionist theories share with other forms of Associationism . In traditional Associationism , the probability that one Idea will elicit another is sensitive to the strength of the association between them (including 'mediating' associations , if any) . And the strength of this association is in turn sensitive to the extent to which the Ideas have previously been correlated . Associative strength was not , however , presumed to be sensitive to features of the content or the structure of representations per se . Similarly , in Connectionist models , the selection of an output corresponding to a given input is a function of properties of the paths that connect them (including the weights , the states of intermediate units , etc.) . And the weights , in turn , are a function of the statistical properties of events in the environment (or of relations between patterns of events in the environment and implicit 'predictions' made by the network , etc.) . But the syntactic/semantic structure of the representation of an input is not presumed to be a factor in determining the selection of a corresponding output since , as we have seen , syntactic/semantic
structure is not defined for the sorts of representations that Connectionist
models acknowledge . To summarize : Classical and Connectionist theories disagree about the nature of mental representation ; for the former , but not for the latter , mental representations characteristically exhibit a combinatorial constituent structure and a combinatorial semantics . Classical and Connectionist theories also
disagree about the nature of mental processes; for the former , but not for the latter , mental processes are characteristically sensitive to the combinator ial structure of the representations on which they operate . We take it that these two issues define the present dispute about the nature
of cognitive architecture . We now propose to argue that the Connectionists
are on the wrong side of both .
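The learning and reasoning processes described in 2.2.1 and 2.2.2 can be caricatured in a few lines (our sketch, not a model from the Connectionist literature): weight-setting that is purely frequency-sensitive, and output selection by strength of association, with no sensitivity to any internal structure of the patterns.

```python
# Toy sketch (ours): frequency-sensitive weight setting and associative
# retrieval. Nodes are labels; a "representation" is just a set of active nodes.

from collections import defaultdict
from itertools import product

weights = defaultdict(float)

def train(pairs):
    """Hebbian-style rule: increment a weight each time an input node and an
    output node fire together. Weights track co-occurrence frequency only."""
    for inputs, outputs in pairs:
        for i, o in product(inputs, outputs):
            weights[(i, o)] += 1.0

def respond(inputs, candidates):
    """Synchronic 'relaxation' stand-in: pick the candidate output pattern
    most strongly associated with the active input nodes."""
    def strength(out):
        return sum(weights[(i, o)] for i, o in product(inputs, out))
    return max(candidates, key=strength)

# The 'A&B' node has often fired together with the 'A' node in training:
train([({"A&B"}, {"A"})] * 10 + [({"A&B"}, {"C"})] * 2)

assert respond({"A&B"}, [frozenset({"A"}), frozenset({"C"})]) == frozenset({"A"})
```

Note that this machine "infers" A from A&B only because of training frequencies; nothing in it responds to the fact that 'A&B' contains 'A' as a constituent, which is exactly the contrast with structure-sensitive processes that the text is drawing.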
3. The need for symbol systems: Productivity , systematicity , compositionality and inferential coherence

Classical psychological theories appeal to the constituent structure of mental representations to explain three closely related features of cognition : its productivity , its compositionality and its inferential coherence . The traditional argument has been that these features of cognition are , on the one hand , pervasive and , on the other hand , explicable only on the assumption that mental representations have internal structure . This argument (familiar in more or less explicit versions for the last thirty years or so) is still intact , so far as we can tell . It appears to offer something close to a demonstration that an empirically adequate cognitive theory must recognize not just causal relations among representational states but also relations of syntactic and semantic constituency ; hence that the mind cannot be , in its general structure , a Connectionist network .

3.1. Productivity of thought

There is a classical productivity argument for the existence of combinatorial structure in any rich representational system (including natural languages and the language of thought) . The representational capacities of such a system are , by assumption , unbounded under appropriate idealization ; in particular , there are indefinitely many propositions which the system can encode.22 However , this unbounded expressive power must presumably be achieved by finite means . The way to do this is to treat the system of representations as consisting of expressions belonging to a generated set . More precisely , the correspondence between a representation and the proposition it expresses is , in arbitrarily many cases , built up recursively out of correspondences between parts of the expression and parts of the proposition . But , of course , this strategy can operate only when an unbounded number of the expressions are non-atomic .
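The generated-set strategy just described can be sketched in a few lines (our illustration; the grammar is a toy stand-in): a finite set of recursive rules yields an unbounded set of expressions, each interpreted via its constituent structure.

```python
# Toy sketch (ours): a finite recursive rule set generates arbitrarily many
# well-formed expressions, each interpreted via its embedded constituent.

def generate(depth):
    """S -> 'it rains' | 'John believes that' S.  Two finite rules, yet the
    set of sentences they generate is unbounded as depth grows."""
    sentences = ["it rains"]
    for _ in range(depth):
        sentences.append("John believes that " + sentences[-1])
    return sentences

out = generate(3)
assert len(out) == 4
assert out[-1] == ("John believes that John believes that "
                   "John believes that it rains")
```

The interpretation of each new sentence is fixed recursively by the interpretation of its embedded constituent; the strategy works only because the expressions are non-atomic.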
So linguistic (and mental ) representations must constitute symbol systems (in the sense of footnote 8) . So the mind cannot be a PDP . Very often , when people reject this sort of reasoning , it is because they doubt that human cognitive capacities are correctly viewed as productive . In
22 This way of putting the productivity argument is most closely identified with Chomsky (e.g., Chomsky , 1965 , 1968) . However , one does not have to rest the argument upon a basic assumption of infinite generative capacity . Infinite generative capacity can be viewed , instead , as a consequence or a corollary of theories formulated so as to capture the greatest number of generalizations with the fewest independent principles . This more neutral approach is , in fact , very much in the spirit of what we shall propose below . We are putting it in the present form for expository and historical reasons .
the long run there can be no a priori arguments for (or against) idealizing to productive capacities ; whether you accept the idealization depends on whether you believe that the inference from finite performance to finite capacity is justified , or whether you think that finite performance is typically a result of the interaction of an unbounded competence with resource constraints . Classicists have traditionally offered a mixture of methodological and empirical considerations in favor of the latter view .

From a methodological perspective , the least that can be said for assuming productivity is that it precludes solutions that rest on inappropriate tricks (such as storing all the pairs that define a function) ; tricks that would be unreasonable in practical terms even for solving finite tasks that place sufficiently large demands on memory . The idealization to unbounded productive capacity forces the theorist to separate the finite specification of a method for solving a computational problem from such factors as the resources that the system (or person) brings to bear on the problem at any given moment .

The empirical arguments for productivity have been made most frequently in connection with linguistic competence . They are familiar from the work of Chomsky (1968) , who has claimed (convincingly , in our view) that the knowledge underlying linguistic competence is generative , i.e., that it allows us in principle to generate (/understand) an unbounded number of sentences . It goes without saying that no one does , or could , in fact utter or understand tokens of more than a finite number of sentence types ; this is a trivial consequence of the fact that nobody can utter or understand more than a finite number of sentence tokens .
But there are a number of considerations which suggest that , despite de facto constraints on performance , one's knowledge of one's language supports an unbounded productive capacity in much the same way that one's knowledge of addition supports an unbounded number of sums . Among these considerations are , for example , the fact that a speaker/hearer's performance can often be improved by relaxing time constraints , increasing motivation , or supplying pencil and paper . It seems very natural to treat such manipulations as affecting the transient state of the speaker's memory and attention rather than what he knows about his language , or how he represents it . But this treatment is available only on the assumption that the character of the subject's performance is determined by interactions between the available knowledge base and the available computational resources .

Classical theories are able to accommodate these sorts of considerations because they assume architectures in which there is a functional distinction between memory and program . In a system such as a Turing machine , where the length of the tape is not fixed in advance , changes in the amount of available memory can be effected without changing the computational structure
of the machine ; viz., by making more tape available . By contrast , in a finite state automaton or a Connectionist machine , adding to the memory (e.g., by adding units to a network) alters the connectivity relations among nodes and thus does affect the machine's computational structure . Connectionist cognitive architectures cannot , by their very nature , support an expandable memory , so they cannot support productive cognitive capacities . The long and short is that if productivity arguments are sound , then they show that the architecture of the mind can't be Connectionist . Connectionists have , by and large , acknowledged this ; so they are forced to reject productivity arguments .

The test of a good scientific idealization is simply and solely whether it produces successful science in the long term . It seems to us that the productivity idealization has more than earned its keep , especially in linguistics and in theories of reasoning . Connectionists , however , have not been persuaded . For example , Rumelhart and McClelland (1986a , p. 119) say that they "... do not agree that [productive] capabilities are of the essence of human computation . As anyone who has ever attempted to process sentences like 'The man the boy the girl hit kissed moved' can attest , our ability to process even moderate degrees of center-embedded structure is grossly impaired relative to an ATN [Augmented Transition Network] parser .... What is needed , then , is not a mechanism for flawless and effortless processing of embedded constructions ... The challenge is to explain how those processes that others have chosen to explain in terms of recursive mechanisms can be better explained by the kinds of processes natural for PDP networks ."
These remarks suggest that Rumelhart and McClelland think that the fact
that center-embedding sentences are hard is somehow an embarrassment for theories that view linguistic capacities as productive . But of course it's not , since , according to such theories , performance is an effect of interactions between a productive competence and restricted resources . There are , in fact , quite plausible Classical accounts of why center-embeddings ought to impose especially heavy demands on resources , and there is a reasonable amount of experimental support for these models (see , for example , Wanner & Maratsos , 1978) . In any event , it should be obvious that the difficulty of parsing center-embeddings can't be a consequence of their recursiveness per se since there are many recursive structures that are strikingly easy to understand . Consider : 'this is the dog that chased the cat that ate the rat that lived in the house that Jack built .' The Classicist's case for productive capacities in parsing rests on
the transparency of sentences like these.23 In short , the fact that center-embedded sentences are hard perhaps shows that there are some recursive structures that we can't parse . But what Rumelhart and McClelland need , if they are to deny the productivity of linguistic capacities , is the much stronger claim that there are no recursive structures that we can parse ; and this stronger claim would appear to be simply false .

Rumelhart and McClelland's discussion of recursion (pp. 119-120) nevertheless repays close attention . They are apparently prepared to concede that PDPs can model recursive capacities only indirectly ; viz., by implementing Classical architectures like ATNs ; so that if human cognition exhibited recursive capacities , that would suffice to show that minds have Classical rather than Connectionist architecture at the psychological level . "We have not dwelt on PDP implementations of Turing machines and recursive processing engines because we do not agree with those who would argue that such capacities are of the essence of human computation" (p. 119 , our emphasis) . Their argument that recursive capacities aren't "of the essence of human computation" is , however , just the unconvincing stuff about center-embedding quoted above . So the Rumelhart and McClelland view is apparently that if you take it to be independently obvious that some cognitive capacities are productive , then you should take the existence of such capacities to argue for Classical cognitive architecture and hence for treating Connectionism as at best an implementation theory . We think that this is quite a plausible understanding of the bearing that the issues about productivity and recursion have on the issues about cognitive architecture ; in Section 4 we will return to the suggestion that Connectionist models can plausibly be construed as models of the implementation of a Classical architecture .

23 McClelland and Kawamoto (1986) discuss this sort of recursion briefly . Their suggestion seems to be that parsing such sentences doesn't really require recovering their recursive structure : "... the job of the parser [with respect to right-recursive sentences] is to spit out phrases in a way that captures their local context . Such a representation may prove sufficient to allow us to reconstruct the correct bindings of noun phrases to verbs and prepositional phrases to nearby nouns and verbs" (p. 324 ; emphasis ours) . It is , however , by no means the case that all of the semantically relevant grammatical relations in readily intelligible embedded sentences are local in surface structure . Consider : 'Where did the man who owns the cat that chased the rat that frightened the girl say that he was going to move to (X) ?' or 'What did the girl that the children loved to listen to promise your friends that she would read (X) to them ?' Notice that , in such examples , a binding element (italicized) can be arbitrarily displaced from the position whose interpretation it controls (marked 'X') without making the sentence particularly difficult to understand . Notice too that the 'semantics' doesn't determine the binding relations in either example .

In the meantime , however , we propose to view the status of productivity arguments for Classical architectures as moot ; we're about to present a different sort of argument for the claim that mental representations need an articulated internal structure . It is closely related to the productivity argument , but it doesn't require the idealization to unbounded competence . Its assumptions
should thus be acceptable even to theorists who , like Connectionists , hold that the finitistic character of cognitive capacities is intrinsic to their architecture .
3.2. Systematicity of cognitive representation

The form of the argument is this : Whether or not cognitive capacities are really productive , it seems indubitable that they are what we shall call 'systematic' . And we'll see that the systematicity of cognition provides as good a reason for postulating combinatorial structure in mental representation as the productivity of cognition does : You get , in effect , the same conclusion , but from a weaker premise .

The easiest way to understand what the systematicity of cognitive capacities amounts to is to focus on the systematicity of language comprehension and production . In fact , the systematicity argument for combinatorial structure in thought exactly recapitulates the traditional Structuralist argument for constituent structure in sentences . But we pause to remark upon a point that we'll re-emphasize later : linguistic capacity is a paradigm of systematic cognition , but it's wildly unlikely that it's the only example . On the contrary , there's every reason to believe that systematicity is a thoroughly pervasive feature of human and infrahuman mentation .

What we mean when we say that linguistic capacities are systematic is that the ability to produce/understand some sentences is intrinsically connected to the ability to produce/understand certain others . You can see the force of this if you compare learning languages the way we really do learn them with learning a language by memorizing an enormous phrase book . The point isn't that phrase books are finite and can therefore exhaustively specify only nonproductive languages ; that's true , but we've agreed not to rely on productivity arguments for our present purposes . Our point is rather that you can learn any part of a phrase book without learning the rest .
Hence, on the phrase book model, it would be perfectly possible to learn that uttering the form of words 'Granny's cat is on Uncle Arthur's mat' is the way to say (in English) that Granny's cat is on Uncle Arthur's mat, and yet have no idea at all how to say that it's raining (or, for that matter, how to say that Uncle Arthur's cat is on Granny's mat). Perhaps it's self-evident that the phrase book story must be wrong about language acquisition because a speaker's knowledge of his native language is never like that. You don't, for example, find native speakers who know how to say in English that John loves the girl but don't know how to say in English that the girl loves John.

Notice, in passing, that systematicity is a property of the mastery of the syntax of a language, not of its lexicon. The phrase book model really does fit what it's like to learn the vocabulary of English, since when you learn English vocabulary you acquire a lot of basically independent capacities. So you might perfectly well learn that using the expression 'cat' is the way to refer to cats and yet have no idea that using the expression 'deciduous conifer' is the way to refer to deciduous conifers. Systematicity, like productivity, is the sort of property of cognitive capacities that you're likely to miss if you concentrate on the psychology of learning and searching lists.

There is, as we remarked, a straightforward (and quite traditional) argument from the systematicity of language capacity to the conclusion that sentences must have syntactic and semantic structure: If you assume that sentences are constructed out of words and phrases, and that many different sequences of words can be phrases of the same type, the very fact that one formula is a sentence of the language will often imply that other formulas must be too: in effect, systematicity follows from the postulation of constituent structure.

Suppose, for example, that it's a fact about English that formulas with the constituent analysis 'NP Vt NP' are well formed; and suppose that 'John' and 'the girl' are NPs and 'loves' is a Vt. It follows from these assumptions that 'John loves the girl,' 'John loves John,' 'the girl loves the girl,' and 'the girl loves John' must all be sentences. It follows too that anybody who has mastered the grammar of English must have linguistic capacities that are systematic in respect of these sentences; he can't but assume that all of them are sentences if he assumes that any of them are. Compare the situation on the view that the sentences of English are all atomic.
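The point can be made concrete with a small sketch (ours, not the authors'): a toy grammar containing the single rule 'S -> NP Vt NP' licenses all four John/girl sentences at once, so mastering the rule yields the whole systematic cluster rather than any proper subset of it.

```python
from itertools import product

# Toy illustration (ours): a single phrase-structure rule 'S -> NP Vt NP'
# generates every combination of its constituents at once.
NPS = ["John", "the girl"]
VTS = ["loves"]

def sentences():
    """All strings licensed by the rule S -> NP Vt NP."""
    return [f"{np1} {vt} {np2}" for np1, vt, np2 in product(NPS, VTS, NPS)]

# Mastering the one rule gives you 'John loves the girl', 'John loves John',
# 'the girl loves the girl', and 'the girl loves John' together; the grammar
# cannot license one of them without licensing the rest.
print(sentences())
```

A phrase book, by contrast, is just a finite list, and any subset of a list is another possible list; nothing in the list format connects mastery of one entry to mastery of any other.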
There is then no structural analogy between 'John loves the girl' and 'the girl loves John', and hence no reason why understanding one sentence should imply understanding the other; no more than understanding 'rabbit' implies understanding 'tree'.24

On the view that the sentences are atomic, the systematicity of linguistic capacities is a mystery; on the view that they have constituent structure, the systematicity of linguistic capacities is what you would predict. So we should prefer the latter view to the former.

Notice that you can make this argument for constituent structure in sentences without idealizing to astronomical computational capacities. There are productivity arguments for constituent structure, but they're concerned with our ability, in principle, to understand sentences that are arbitrarily long. Systematicity, by contrast, appeals to premises that are much nearer home;
24 See Pinker (1984, Chapter 4) for evidence that children never go through a stage in which they distinguish between the internal structures of NPs depending on whether they are in subject or object position; i.e., the dialects that children speak are always systematic with respect to the syntactic structures that can appear in these positions.
such considerations as the ones mentioned above, that no speaker understands the form of words 'John loves the girl' except as he also understands the form of words 'the girl loves John'. The assumption that linguistic capacities are productive "in principle" is one that a Connectionist might refuse to grant. But that they are systematic in fact no one can plausibly deny.

We can now, finally, come to the point: the argument from the systematicity of linguistic capacities to constituent structure in sentences is quite clear. But thought is systematic too, so there is a precisely parallel argument from the systematicity of thought to syntactic and semantic structure in mental representations.

What does it mean to say that thought is systematic? Well, just as you don't find people who can understand the sentence 'John loves the girl' but not the sentence 'the girl loves John,' so too you don't find people who can think the thought that John loves the girl but can't think the thought that the girl loves John. Indeed, in the case of verbal organisms the systematicity of thought follows from the systematicity of language if you assume, as most psychologists do, that understanding a sentence involves entertaining the thought that it expresses; on that assumption, nobody could understand both the sentences about John and the girl unless he were able to think both the thoughts about John and the girl.

But now if the ability to think that John loves the girl is intrinsically connected to the ability to think that the girl loves John, that fact will somehow have to be explained. For a Representationalist (which, as we have seen, Connectionists are), the explanation is obvious: Entertaining thoughts requires being in representational states (i.e., it requires tokening mental representations).
And, just as the systematicity of language shows that there must be structural relations between the sentence 'John loves the girl' and the sentence 'the girl loves John,' so the systematicity of thought shows that there must be structural relations between the mental representation that corresponds to the thought that John loves the girl and the mental representation that corresponds to the thought that the girl loves John;25 namely, the two mental representations, like the two sentences, must be made of the same parts. But if this explanation is right (and there don't seem to be any others on offer), then mental representations have internal structure and there is a
25 It may be worth emphasizing that the structural complexity of a mental representation is not the same thing as, and does not follow from, the structural complexity of its propositional content (i.e., of what we're calling "the thought that one has"). Thus, Connectionists and Classicists can agree that the thought that P&Q is complex (and has the thought that P among its parts) while disagreeing about whether mental representations have internal syntactic structure.
language of thought. So the architecture of the mind is not a Connectionist network.26

To summarize the discussion so far: Productivity arguments infer the internal structure of mental representations from the presumed fact that nobody has a finite intellectual competence. By contrast, systematicity arguments infer the internal structure of mental representations from the patent fact that nobody has a punctate intellectual competence. Just as you don't find linguistic capacities that consist of the ability to understand sixty-seven unrelated sentences, so too you don't find cognitive capacities that consist of the ability to think seventy-four unrelated thoughts. Our claim is that this isn't, in either case, an accident: A linguistic theory that allowed for the possibility of punctate languages would have gone not just wrong, but very profoundly wrong. And similarly for a cognitive theory that allowed for the possibility of punctate minds.

But perhaps not being punctate is a property only of the minds of language users; perhaps the representational capacities of infraverbal organisms do have just the kind of gaps that Connectionist models permit? A Connectionist might then claim that he can do everything "up to language" on the assumption that mental representations lack combinatorial syntactic and semantic structure. Everything up to language may not be everything, but it's a lot. (On the other hand, a lot may be a lot, but it isn't everything. Infraverbal cognitive architecture mustn't be so represented as to make the eventual acquisition of language in phylogeny and in ontogeny require a miracle.)

It is not, however, plausible that only the minds of verbal organisms are systematic. Think what it would mean for this to be the case. It would have to be quite usual to find, for example, animals capable of representing the state of affairs aRb, but incapable of representing the state of affairs bRa.
Such animals would be, as it were, aRb sighted but bRa blind since, presumably, the representational capacities of its mind affect not just what an organism can think, but also what it can perceive. In consequence, such animals would be able to learn to respond selectively to aRb situations but quite unable to learn to respond selectively to bRa situations. (So that, though you could teach the creature to choose the picture with the square larger than the triangle, you couldn't for the life of you teach it to choose the picture with the triangle larger than the square.)

26 These considerations throw further light on a proposal we discussed in Section 2. Suppose that the mental representation corresponding to the thought that John loves the girl is the feature vector {+John-subject; +loves; +the-girl-object}, where 'John-subject' and 'the-girl-object' are atomic features; as such, they bear no more structural relation to 'John-object' and 'the-girl-subject' than they do to one another or to, say, 'has-a-handle'. Since this theory recognizes no structural relation between 'John-subject' and 'John-object', it offers no reason why a representational system that provides the means to express one of these concepts should also provide the means to express the other. This treatment of role relations thus makes a mystery of the (presumed) fact that anybody who can entertain the thought that John loves the girl can also entertain the thought that the girl loves John (and, mutatis mutandis, that any natural language that can express the proposition that John loves the girl can also express the proposition that the girl loves John). This consequence of the proposal that role relations be handled by "role specific descriptors that represent the conjunction of an identity and a role" (Hinton, 1987) offers a particularly clear example of how failure to postulate internal structure in representations leads to failure to capture the systematicity of representational systems.
It is an empirical question whether the representational capacities of infraverbal organisms are often structured that way, but we're prepared to bet that they are not. Ethological cases are the exceptions that prove the rule. There are examples where salient environmental configurations act as gestalten; and in such cases it's reasonable to doubt that the mental representation of the stimulus is complex. But the point is precisely that these cases are exceptional; they're exactly the ones where you expect that there will be some special story to tell about the ecological significance of the stimulus: that it's the shape of a predator, or the song of a conspecific ... etc. Conversely, when there is no such story to tell, you expect structurally similar stimuli to elicit correspondingly similar cognitive capacities. That, surely, is the least that a respectable principle of stimulus generalization has got to require.

That infraverbal cognition is pretty generally systematic seems, in short, to be about as secure as any empirical premise in this area can be. And, as we've just seen, it's a premise from which the inadequacy of Connectionist models as cognitive theories follows quite straightforwardly; as straightforwardly, in any event, as it would from the assumption that such capacities are generally productive.
3.3. Compositionality of representations
Compositionality is closely related to systematicity; perhaps they're best viewed as aspects of a single phenomenon. We will therefore follow much the same course here as in the preceding discussion: first we introduce the concept by recalling the standard arguments for the compositionality of natural languages. We then suggest that parallel arguments secure the compositionality of mental representations. Since compositionality requires combinatorial syntactic and semantic structure, the compositionality of thought is evidence that
the mind is not a Connectionist network.
Systematicity, as we said, is the fact that the ability to produce/understand some of the sentences is intrinsically connected to the ability to produce/understand certain of the others. We now add that which sentences are systematically related is not arbitrary from a semantic point of view. For example, being able to understand 'John loves the girl' goes along with being able to understand 'the girl loves John', and the two sentences are closely related semantically: in order for the first to be true, John must bear to the girl the very same relation that the truth of the second requires the girl to bear to John. By contrast, there is no intrinsic connection between understanding either of the John/girl sentences and understanding semantically unrelated formulas; it looks as though semantical relatedness and systematicity keep quite close company.

You might suppose that this covariance is covered by the same explanation that accounts for systematicity per se; roughly, that sentences that are systematically related are composed from the same syntactic constituents. But, in fact, you need a further assumption, which we'll call the 'principle of compositionality': insofar as a language is systematic, a lexical item must make approximately the same semantic contribution to each expression in which it occurs. It is, for example, only insofar as 'the', 'girl', 'loves' and 'John' make the same semantic contribution to 'John loves the girl' that they make to 'the girl loves John' that understanding the one sentence implies understanding the other. Similarity of constituent structure accounts for the semantic relatedness between systematically related sentences only to the extent that the semantical properties of the shared constituents are context-independent.
Here idioms prove the rule: being able to understand 'the', 'man', 'kicked' and 'bucket' isn't much help with understanding 'the man kicked the bucket', since 'kicked' and 'bucket' don't bear their standard meanings in this context. And, just as you'd expect, 'the man kicked the bucket' is not systematic even with respect to closely related syntactic sentences like 'the man kicked over the bucket' (for that matter, it's not systematic with respect to 'the man kicked the bucket' read literally).

It's uncertain exactly how compositional natural languages actually are (just as it's uncertain exactly how systematic they are). We suspect that the amount of context induced variation of lexical meaning is often overestimated because other sorts of context sensitivity are misconstrued as violations of compositionality. For example, the difference between 'feed the chicken' and 'chicken to eat' must involve an animal/food ambiguity in 'chicken' rather than a violation of compositionality since, if the context 'feed the ...' could induce (rather than select) the meaning animal, you would expect the same of 'feed the veal', 'feed the pork' and the like. Similarly, the difference between 'good book', 'good rest' and 'good fight' is probably not meaning shift but syncategorematicity.27

27 A good fight is one that answers to the relevant interest in fights (viz., it's fun to watch or to be in, or it clears the air); similarly, you can know that a good flurg answers to the relevant interest in flurgs without knowing what flurgs are or what the relevant interest in flurgs is (see Ziff, 1960).

Connectionism and cognitive architecture

In any event, the main argument stands: systematicity depends on compositionality, so to the extent that a natural language is systematic it must be compositional too. This illustrates another respect in which systematicity arguments can do the work for which productivity arguments have previously been employed. The traditional argument for compositionality is that it is required to explain how a finitely representable language can contain infinitely many nonsynonymous expressions. Considerations about systematicity offer one argument for compositionality.
Considerations about entailment offer another: the predicate '... is a brown cow' severally entails '... is brown' and '... is a cow', and is entailed by their conjunction. Moreover, and this is important, this semantical pattern is not peculiar to the cases cited. On the contrary, it holds for a very large range of predicates (see '... is a red square,' '... is a funny old German soldier,' '... is a child prodigy'; and so forth).

How are we to account for these sorts of regularities? The answer seems clear enough; '... is a brown cow' entails '... is brown' because (a) the second expression is a constituent of the first; (b) the syntactical form '(adjective noun)N' has (in many cases) the semantic force of a conjunction, and (c) 'brown' retains its semantical value under simplification of conjunction. Assumption (c) is needed to rule out the possibility that 'brown' shifts its meaning when it occurs as a predicate adjective; in which case '... is a brown cow' wouldn't entail '... is brown' after all. Notice too that (c) is just an application of the principle of composition.
So, here's the argument so far: you need to assume some degree of compositionality of English sentences to account for the fact that systematically related sentences are always semantically related, and to account for certain entailments. So, beyond any serious doubt, the sentences of English must be compositional to some serious extent. But the principle of compositionality governs the semantic relations between words and the expressions of which they are constituents. So compositionality implies that (some) expressions have constituents. So compositionality argues for (specifically, presupposes) syntactic/semantic structure in sentences.
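Assumptions (a) through (c) admit a minimal model-theoretic sketch (our illustration, not the authors'): treat the '(adjective noun)N' form as predicate conjunction, with each constituent keeping its semantic value, and the entailment from '... is a brown cow' to '... is brown' falls out for every individual in the domain.

```python
# Minimal sketch (ours) of assumptions (a)-(c). The individuals and
# extensions below are invented purely for illustration.
def brown(x):
    return x in {"bessie", "mud"}

def cow(x):
    return x in {"bessie", "daisy"}

def brown_cow(x):
    # (a) 'brown' is a constituent of 'brown cow'; (b) the compound has
    # the semantic force of a conjunction of its constituents.
    return brown(x) and cow(x)

domain = ["bessie", "daisy", "mud", "fido"]

# (c) 'brown' contributes the same value inside and outside the compound,
# so simplification of conjunction is valid for every individual:
assert all(brown(x) for x in domain if brown_cow(x))
```

If 'brown' were reassigned a different extension inside the compound (a violation of context invariance), the final assertion could fail, which is exactly the text's point about why (c) is needed.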
Now what about the compositionality of mental representations? The argument parallels the linguistic one. Sentences are used to express thoughts; so if the ability to use some sentences is connected with the ability to use certain other, semantically related sentences, then the ability to think some thoughts must be correspondingly connected with the ability to think certain other, semantically related thoughts. But you can only think the thoughts that your mental representations can express. So, if the ability to think certain thoughts is interconnected, then the corresponding representational capacities must be interconnected too; specifically, the ability to be in some representational states must imply the ability to be in certain other, semantically related representational states.

But then the question arises: how could the mind be so arranged that the ability to be in one representational state is connected with the ability to be in others that are semantically nearby? What account of mental representation would have this consequence? The answer is just what you'd expect from the discussion of the linguistic material. Mental representations must have internal structure, just the way that sentences do. In particular, it must be that the mental representation that corresponds to the thought that John loves the girl contains, as its parts, the same constituents as the mental representation that corresponds to the thought that the girl loves John. That would explain why these thoughts are systematically related; and, to the extent that the semantic value of these parts is context-independent, that would explain why these systematically related thoughts are also semantically related. So, by this chain of argument, evidence for the compositionality of sentences is evidence for the compositionality of the representational states of speaker/hearers.

Finally, what about the compositionality of infraverbal thought? The argument isn't much different from the one that we've just run through. We assume that animal thought is largely systematic: the organism that can perceive (hence learn) that aRb can generally perceive (/learn) that bRa. But systematically related thoughts (just like systematically related sentences) are generally semantically related too. It's no surprise that being able to learn that the triangle is above the square implies being able to learn that the square is above the triangle; whereas it would be very surprising if being able to learn the square/triangle facts implied being able to learn that quarks are made of gluons or that Washington was the first President of America. So, then, what explains the correlation between systematic relations and semantic relations in infraverbal thought?
Connectionist models don't address this question; the fact that a network contains a node labelled X has, so far as the constraints imposed by Connectionist architecture are concerned, no implications at all for the labels of the other nodes in the network; in particular, it doesn't imply that there will be nodes that represent thoughts that are semantically close to X. This is just the semantical side of the fact that network architectures permit arbitrarily punctate mental lives.

But if, on the other hand, we make the usual Classicist assumptions (viz., that systematically related thoughts share constituents and that the semantic values of these shared constituents are context independent) the correlation between systematicity and semantic relatedness follows immediately. For a Classicist, this correlation is an architectural property of minds; it couldn't but hold if mental representations have the properties that Classical models suppose them to.
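The contrast between the two architectures can be sketched in a few lines (our illustration, not the authors'): if a thought is an unanalyzed node label, nothing in the architecture connects it to any other label; if it is built from constituents, the systematically related thought is derivable by a structure-sensitive operation on its parts.

```python
# Sketch (ours) of the two architectures' predictions.

# Atomic ("node label") scheme: each thought is an unanalyzed token, so a
# mind is just whatever set of tokens it happens to contain. The label's
# internal spelling does no work; don't be misled by it.
atomic_mind = {"JOHN-LOVES-THE-GIRL"}

# Structured scheme: a thought is built from constituents with roles.
structured_thought = ("loves", "John", "the girl")  # (relation, agent, patient)

def swap_roles(thought):
    """Structure-sensitive operation: exchange agent and patient."""
    rel, agent, patient = thought
    return (rel, patient, agent)

# On the structured story, capacity for one thought brings with it capacity
# for the systematically related one; on the atomic story, the related label
# simply need not be in the mind at all.
print(swap_roles(structured_thought))
print("THE-GIRL-LOVES-JOHN" in atomic_mind)
```

The atomic mind above is a perfectly legal "mind" on network principles, which is the punctate mental life the text describes; the structured scheme makes such gaps impossible by construction.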
Compositionality is sometimes denied wholesale. For example, Smolensky (1988) claims that we would surely expect quite different representations for the same item as it occurs in different contexts (between 'can with coffee' and 'can without coffee', or 'tree with coffee' and the like), the differences being carried by the representations themselves.

It's certainly true that compositionality is not generally a feature of Connectionist representations. Connectionists can't acknowledge the facts of compositionality because they are committed to mental representations that don't have combinatorial structure. But to give up on compositionality is to take 'kick the bucket' as a model for the relation between syntax and semantics; and the consequence is, as we've seen, that you make the systematicity of language (and of thought) a mystery. On the other hand, to say that 'kick the bucket' is aberrant, and that the right model for the syntax/semantics relation is (e.g.) 'brown cow', is to start down a trail which leads, pretty inevitably, to acknowledging combinatorial structure in mental representation, hence to the rejection of Connectionist networks as cognitive models.

We don't think there's any way out of the need to acknowledge the compositionality of natural languages and of mental representations. However, it's been suggested (see Smolensky, op. cit.) that while the principle of compositionality does not strictly hold, there is nevertheless a family resemblance among the semantic contributions that a symbol makes in the various contexts in which it occurs. Since such proposals generally aren't elaborated, it's unclear how they're supposed to handle the facts of inference.
Consider the inference from (i) 'turtles are slower than rabbits' and (ii) 'rabbits are slower than Ferraris' to (iii) 'turtles are slower than Ferraris'. The inference is good because the same relation (viz., slower than) holds between the turtles and the rabbits and between the rabbits and the Ferraris, and that relation is transitive. If, however, it's assumed (contrary to the principle of compositionality) that 'slower than' means something different in premises (i) and (ii) (and presumably in (iii) as well) so that, strictly speaking, the relation that holds between turtles and rabbits is not the same one that holds between rabbits and Ferraris, then it's hard to see why the inference should be valid.

Talk about the relations being similar only papers over the difficulty since the problem is then to provide a notion of similarity that will guarantee that if (i) and (ii) are true, so too is (iii). And, so far at least, no such notion of similarity has been forthcoming. Notice that it won't do just to require that the relations all be similar in respect of their transitivity, i.e., that they all be transitive. On that account, the argument from 'turtles are slower than rabbits' and 'rabbits are furrier than Ferraris' to 'turtles are slower than Ferraris' would count as valid, which it isn't.

Until these sorts of issues are attended to, the proposal to replace the compositional principle of context invariance with a notion of approximate equivalence seems to us to be little more than hand waving.
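The turtles/rabbits/Ferraris point admits a small formal sketch (ours, not the authors'): a transitivity rule licenses (iii) only when premises (i) and (ii) token one and the same relation, which is just the compositional assumption of context invariance.

```python
# Sketch (ours): transitive inference is licensed only when the premises
# share a single relation symbol (context invariance).
def infer_transitive(premises, relation):
    """Close the named relation under transitivity; return the derived facts."""
    facts = {(r, a, b) for (r, a, b) in premises if r == relation}
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for (_, a, b) in list(derived):
            for (_, c, d) in list(derived):
                if b == c and (relation, a, d) not in derived:
                    derived.add((relation, a, d))
                    changed = True
    return derived - facts

# (i) and (ii) token the very same relation, so (iii) is derivable:
same = [("slower-than", "turtles", "rabbits"),
        ("slower-than", "rabbits", "ferraris")]
print(infer_transitive(same, "slower-than"))

# If 'slower than' meant something different in each premise (distinct
# relation tokens), nothing follows, the transitivity of each relation
# notwithstanding:
different = [("slower-than-1", "turtles", "rabbits"),
             ("slower-than-2", "rabbits", "ferraris")]
print(infer_transitive(different, "slower-than-1"))
```

Replacing the identity test `r == relation` with some similarity measure is exactly the unelaborated step the text complains about: no proposed measure guarantees that the closure preserves truth.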
3.4. The systematicity of inference

In Section 2 we saw that, according to Classical theories, the syntax of mental representations mediates between their semantic properties and their causal role in mental processes. Take a simple case: It's a logical principle that conjunctions entail their constituents (so the argument from P&Q to P and to Q is valid). Correspondingly, it's a psychological law that thoughts that P&Q tend to cause thoughts that P and thoughts that Q, all else being equal. Classical theory exploits the constituent structure of mental representations to account for both these facts, the first by assuming that the combinatorial semantics of mental representations is sensitive to their syntax and the second by assuming that mental processes apply to mental representations in virtue of their constituent structure.

A consequence of these assumptions is that Classical theories are committed to the following striking prediction: inferences that are of similar logical type ought, pretty generally,28 to elicit correspondingly similar cognitive capacities. You shouldn't, for example, find a kind of mental life in which you get inferences from P&Q&R to P but you don't get inferences from P&Q to P. This is because, according to the Classical account, this logically homogeneous class of inferences is carried out by a correspondingly homogeneous computational process.
The idea that organisms should exhibit similar cognitive capacities in respect of logically similar inferences is so natural that it may seem unavoidable. But, on the contrary: there's nothing in principle to preclude a kind of cognitive model in which inferences that are quite similar from the logician's point of view are nevertheless computed by quite different mechanisms; or in which some inferences of a given logical type are computed and other inferences of the same logical type are not. Consider, in particular, the Connectionist account. A Connectionist can certainly model a mental life in which, if you can reason from P&Q&R to P, then you can also reason from P&Q to P.
28 The hedge is meant to exclude cases where inferences of the same logical type nevertheless differ in complexity in virtue of, for example, the length of their premises. The inference from (AvBvCvDvE) and (-B&-C&-D&-E) to A is of the same logical type as the inference from AvB and -B to A. But it wouldn't be very surprising, or very interesting, if there were minds that could handle the second inference but not the first.
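Both halves of the contrast can be sketched (our code, not the authors'): a rule defined over syntactic form applies to every conjunction alike, whereas a table of atomic entries can encode any gappy subset of the logically homogeneous inferences.

```python
# Sketch (ours): a structure-sensitive rule vs. an arbitrary lookup table.

def simplify_conjunction(thought):
    """Classical-style rule: applies to ANY thought of the form ('and', ...)."""
    op, *conjuncts = thought
    assert op == "and"
    return list(conjuncts)  # P&Q&...&Z entails each conjunct

# Because the rule is defined over the syntactic form, it cannot apply to
# ("and", "P", "Q", "R") while failing to apply to ("and", "P", "Q"):
print(simplify_conjunction(("and", "P", "Q", "R")))
print(simplify_conjunction(("and", "P", "Q")))

# A network of atomic nodes is, in effect, a lookup table, and a lookup
# table can encode one inference while omitting a logically identical one:
punctate = {("and", "P", "Q", "R"): ["P"]}  # gap: no entry for P&Q
print(("and", "P", "Q") in punctate)        # a 'gappy' mind is perfectly legal
```

Nothing in the table format rules the gap out, which is the sense in which (on the authors' argument) the network architecture "tolerates gaps in cognitive capacities."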
J.A. Fodor and Z.W. Pylyshyn
But notice that a Connectionist can equally model a mental life in which you get one of these inferences and not the other. In the present case, since there is no structural relation between the P&Q&R node and the P&Q node (remember, all nodes are atomic; don't be misled by the node labels), there's no reason why a mind that contains the first should also contain the second, or vice versa. Analogously, there's no reason why you shouldn't get minds that simplify the premise John loves Mary and Bill hates Mary but no others; or minds that simplify premises with 1, 3, or 5 conjuncts, but don't simplify premises with 2, 4, or 6 conjuncts; or, for that matter, minds that simplify only premises that were acquired on Tuesdays ... etc. In fact, the Connectionist architecture is utterly indifferent as among these possibilities. That's because it recognizes no notion of syntax according to which thoughts that are alike in inferential role (e.g., thoughts that are all subject to simplification of conjunction) are expressed by mental representations of correspondingly similar syntactic form (e.g., by mental representations that are all syntactically conjunctive). So, the Connectionist architecture tolerates gaps in cognitive capacities; it has no mechanism to enforce the requirement that logically homogeneous inferences be carried out by homogeneous processes.

But, we claim, you don't find cognitive capacities that have these sorts of gaps. You don't, for example, get minds that are prepared to infer John went to the store from John and Mary and Susan and Sally went to the store and from John and Mary went to the store but not from John and Mary and Susan went to the store. Given a notion of logical syntax, it is no surprise that you don't; lacking a notion of logical syntax, it is a mystery that you don't.

3.5. Summary

It is perhaps obvious by now that the arguments we've been reviewing (the argument from systematicity, the argument from compositionality, and the argument from inferential coherence) are really much the same: If you hold the kind of theory that acknowledges structured representations, it must perforce acknowledge representations with similar or identical structures. In the linguistic cases, constituent analysis implies a taxonomy of sentences by their syntactic form, and in the inferential cases, it implies a taxonomy of arguments by their logical form. So, if the theory also acknowledges mental processes that are structure sensitive, it will then predict that similarly structured representations will generally play similar roles in thought. A theory that says that the sentence 'John loves the girl' is made
out of the same parts as the sentence 'the girl loves John', and made by applications of the same rules of composition, will have to go out of its way to explain a linguistic competence which embraces one sentence but not the other. And similarly, if a theory says that the mental representation that corresponds to the thought that P&Q&R has the same (conjunctive) syntax as the mental representation that corresponds to the thought that P&Q, and that mental processes of drawing inferences subsume mental representations in virtue of their syntax, it will have to go out of its way to explain inferential capacities which embrace the one thought but not the other. Such a competence would be, at best, an embarrassment for the theory, and at worst a refutation.

By contrast, since the Connectionist architecture recognizes no combinatorial structure in mental representations, gaps in cognitive competence should proliferate arbitrarily. It's not just that you'd expect to get them from time to time; it's that, on the 'no-structure' story, gaps are the unmarked case. It's the systematic competence that the theory is required to treat as an embarrassment. But, as a matter of fact, inferential competences are blatantly systematic. So there must be something deeply wrong with Connectionist architecture.

What's deeply wrong with Connectionist architecture is this: Because it acknowledges neither syntactic nor semantic structure in mental representations, it perforce treats them not as a generated set but as a list. But lists, qua lists, have no structure; any collection of items is a possible list. And, correspondingly, on Connectionist principles, any collection of (causally connected) representational states is a possible mind. So, as far as Connectionist architecture is concerned, there is nothing to prevent minds that are arbitrarily unsystematic. But that result is preposterous.
Cognitive capacities come in structurally related clusters; their systematicity is pervasive. All the evidence suggests that punctate minds can't happen. This argument seemed conclusive against the Connectionism of Hebb, Osgood and Hull twenty or thirty years ago. So far as we can tell, nothing of any importance has happened to change the situation in the meantime.29
29 Historical footnote: Connectionists are Associationists, but not every Associationist holds that mental representations must be unstructured. Hume didn't, for example. Hume thought that mental representations are rather like pictures, and pictures typically have a compositional semantics: the parts of a picture of a horse are generally pictures of horse parts. On the other hand, allowing a compositional semantics for mental representations doesn't do an Associationist much good so long as he is true to the spirit of his Associationism. The virtue of having mental representations with structure is that it allows for structure sensitive operations to be defined over them; specifically, it allows for the sort of operations that eventuate in productivity and systematicity. Association is not, however, such an operation; all it can do is build an internal model of redundancies in experience by
A final comment to round off this part of the discussion. It's possible to imagine a Connectionist being prepared to admit that while systematicity doesn't follow from, and hence is not explained by, Connectionist architecture, it is nonetheless compatible with that architecture. It is, after all, perfectly possible to follow a policy of building networks that have aRb nodes only if they have bRa nodes ... etc. There is therefore nothing to stop a Connectionist from stipulating, as an independent postulate of his theory of mind, that all biologically instantiated networks are, de facto, systematic.

But this misses a crucial point: It's not enough just to stipulate systematicity; one is also required to specify a mechanism that is able to enforce the stipulation. To put it another way, it's not enough for a Connectionist to agree that all minds are systematic; he must also explain how nature contrives to produce only systematic minds. Presumably there would have to be some sort of mechanism, over and above the ones that Connectionism per se posits, the functioning of which insures the systematicity of biologically instantiated networks; a mechanism such that, in virtue of its operation, every network that has an aRb node also has a bRa node ... and so forth. There are, however, no proposals for such a mechanism. Or, rather, there is just one: The only mechanism that is known to be able to produce pervasive systematicity is Classical architecture. And, as we have seen, Classical architecture is not compatible with Connectionism since it requires internally structured representations.
architecture plausible
acknowledges
structured
just
but by
also
" Imagination of of
he
' faculty of
recombi example in
. ( The
is pieced , of course
of to
the
and
associationist precisely
right don
actjve : an
mental answer
. But question
allowing how
what . The
' t have
productive
is that an
structured them
representations is practically
, the irresistible
temptatjon .
postulate
structure
sensitive
operations
executive
in the literature of conventional computers as models of brains. These may be seen as favoring the Connectionist alternative. We will sketch a number of these before discussing the general problems which they appear to raise.

- Rapidity of cognitive processes in relation to neural speeds: the "hundred step" constraint. It has been observed (e.g., Feldman & Ballard, 1982) that the time required to execute computer instructions is in the order of nanoseconds, whereas neurons take tens of milliseconds to fire. Consequently, in the time it takes people to carry out many of the tasks at which they are fluent (like recognizing a word or a picture, either of which may require considerably less than a second) a serial neurally-instantiated program would only be able to carry out about 100 instructions. Yet such tasks might typically require many thousands, or even millions, of instructions in present-day computers (if they can be done at all). Thus, it is argued, the brain must operate quite differently from computers. In fact, the argument goes, the brain must be organized in a highly parallel manner ("massively parallel" is the preferred term of art).

- Difficulty of achieving large-capacity pattern recognition and content-based retrieval in conventional architectures. Closely related to the issues
about time constraints is the fact that humans can store and make use of an enormous amount of information, apparently without effort (Fahlman & Hinton, 1987). One particularly dramatic skill that people exhibit is the ability to recognize patterns from among tens or even hundreds of thousands of alternatives (e.g., word or face recognition). In fact, there is reason to believe that many expert skills may be based on large, fast recognition memories (see Simon & Chase, 1973). If one had to search through one's memory serially, the way conventional computers do, the complexity would overwhelm any machine. Thus, the knowledge that people have must be stored and retrieved differently from the way conventional computers do it.
- Conventional computer models are committed to a different etiology for "rule-governed" behavior and "exceptional" behavior. Classical psychological theories, which are based on conventional computer ideas, typically distinguish between mechanisms that cause regular and divergent behavior by postulating systems of explicit unconscious rules to explain the former, and then attributing departures from these rules to secondary (performance) factors. Since the divergent behaviors occur very frequently, a better strategy would be to try to account for both types of behavior in terms of the same mechanism.
- Lack of progress in dealing with processes that are nonverbal or intuitive. Most of our fluent cognitive skills do not consist in accessing verbal knowledge or carrying out deliberate conscious reasoning (Fahlman & Hinton, 1987; Smolensky, 1988). We appear to know many things that we would have great difficulty in describing verbally, including how to ride a bicycle, what our close friends look like, how to recall the name of the President, etc. Such knowledge, it is argued, must not be stored in linguistic form, but in some other "implicit" form. The fact that conventional computers typically operate in a "linguistic mode", inasmuch as they process information by operating on syntactically structured expressions, may explain why there has been relatively little success in modeling implicit knowledge.

- Acute sensitivity of conventional architectures to damage and noise. Unlike digital circuits, brain circuits must tolerate noise arising from spontaneous neural activity. Moreover, they must tolerate a moderate degree of damage without failing completely. With a few notable exceptions, if a part of the brain is damaged, the degradation in performance is usually not catastrophic but varies more or less gradually with the extent of the damage. This is especially true of memory. Damage to the temporal cortex (usually thought to house memory traces) does not result in selective loss of particular facts and memories. This and similar facts about brain-damaged patients suggest that human memory representations, and perhaps many other cognitive skills as well, are distributed spatially, rather than being neurally localized. This appears to contrast with conventional computers, where hierarchical-style control keeps the crucial decisions highly localized and where memory storage consists of an array of location-addressable registers.

- Storage in conventional architectures is passive.
Conventional computers have a passive memory store which is accessed in what has been called a "fetch and execute cycle". This appears to be quite unlike human memory. For example, according to Kosslyn and Hatfield (1984, pp. 1022, 1029):

In computers the memory is static: once an entry is put in a given location, it just sits there until it is operated upon by the CPU .... But consider a very simple experiment: Imagine a letter A over and over again ... then switch to the letter B. In a model employing a Von Neumann architecture the 'fatigue' that inhibited imaging the A would be due to some quirk in the way the CPU executes a given instruction .... Such fatigue should generalize to all objects imaged because the routine responsible for imaging was less effective. But experiments have demonstrated that this is not true: specific objects become more difficult to image, not all objects. This finding is more easily explained by an analogy to the way invisible ink fades of its own accord ...: with invisible ink, the representation itself is doing something - there is no separate processor working over it ... .
- Conventional rule-based systems depict cognition as "all-or-none". But cognitive skills appear to be characterized by various kinds of continuities. For example:

Continuous variation in degree of applicability of different principles, or in the degree of relevance of different constraints, "rules", or procedures. There are frequent cases (especially in perception and memory retrieval) in which it appears that a variety of different constraints are brought to bear on a problem simultaneously, and the outcome is a combined effect of all the different factors (see, for example, the informal discussion by McClelland, Rumelhart & Hinton, 1986, pp. 3-9). That's why "constraint propagation" techniques are receiving a great deal of attention in artificial intelligence (see Mackworth, 1987).

Nondeterminism of human behavior. Cognitive processes are never rigidly determined or precisely replicable. Rather, they appear to have a significant random or stochastic component. Perhaps that's because there is randomness at a microscopic level, caused by irrelevant biochemical or electrical activity or perhaps even by quantum mechanical events. To model this activity by rigid deterministic rules can only lead to poor predictions because it ignores the fundamentally stochastic nature of the underlying mechanisms. Moreover, deterministic, all-or-none models will be unable to account for the gradual aspect of learning and skill acquisition.

Failure to display graceful degradation. When humans are unable to do a task perfectly, they nonetheless do something reasonable. If the particular task does not fit exactly into some known pattern, or if it is only partly understood, a person will not give up or produce nonsensical behavior. By contrast, if a Classical rule-based computer program fails to recognize the task, or fails to match a pattern to its stored representations or rules, it usually will be unable to do anything at all. This suggests that in order to display graceful degradation, we must be able to represent prototypes, match patterns, recognize problems, etc., in various degrees.

- Conventional models are dictated by current technical features of computers and take little or no account of the facts of neuroscience. Classical symbol processing systems provide no indication of how the kinds of processes that they postulate could be realized by a brain. The fact that this gap between high-level systems and brain architecture is so large might be an indication that these models are on the wrong track.
Whereas the architecture of the mind has evolved under the pressures of natural selection, some of the Classical assumptions about the mind may derive from features that computers have only because they are explicitly designed for the convenience of programmers. Perhaps this includes even the assumption that the description of mental processes at the cognitive level can be divorced from the description of their physical realization. At a minimum, by building our models to take account of what is known about neural structures we may reduce the risk of being misled by metaphors based on contemporary computer architectures.
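The arithmetic behind the "hundred step" point in the first consideration above is worth making concrete. Here is a back-of-the-envelope sketch in Python (ours, not the authors'; the figures are just the rough orders of magnitude cited in the text):

```python
# Back-of-the-envelope version of the "hundred step" argument
# (after Feldman & Ballard, 1982). The figures below are rough
# orders of magnitude from the text, not measurements.

NEURON_FIRING_TIME_S = 0.010  # a neuron takes tens of milliseconds to fire
TASK_DURATION_S = 1.0         # fluent tasks (word recognition) take < 1 s

# If cognition were a *serial*, neurally instantiated program, the
# budget of sequential steps for the whole task would be about:
serial_step_budget = TASK_DURATION_S / NEURON_FIRING_TIME_S
print(int(serial_step_budget))  # -> 100, hence the "hundred step" constraint

# Comparable tasks need thousands or millions of instructions on
# present-day serial computers; that contrast is what the argument
# for massive parallelism trades on.
```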
Replies: Why the usual reasons given for preferring a Connectionist architecture are invalid
It seems to us that, as arguments against Classical cognitive architecture, all these points suffer from one or other of the following two defects:

(1) The objections depend on properties that are not in fact intrinsic to Classical architectures, since there can be perfectly natural Classical models that don't exhibit the objectionable features. (We believe this to be true, for example, of the arguments that Classical rules are explicit and Classical operations are 'all or none'.)

(2) The objections are true of Classical architectures insofar as they are implemented on current computers, but need not be true of such architectures when differently (e.g., neurally) implemented. They are, in other words, directed at the implementation level rather than the cognitive level, as these were distinguished in our earlier discussion. (We believe that this is true, for example, of the arguments about speed, resistance to damage and noise, and the passivity of memory.)
In the remainder of this section we will expand on these two points and relate them to some of the arguments presented above. Following this analysis, we will present what we believe may be the most tenable view of Connectionism; namely, that it is a theory of how (Classical) cognitive systems might be implemented, either in real brains or in some 'abstract neurology'.

Parallel computation and the issue of speed

Consider the argument that cognitive processes must involve large scale parallel computation. In the form that it takes in typical Connectionist discussions, this issue is irrelevant to the adequacy of Classical cognitive architecture. The "hundred step constraint", for example, is clearly directed at the implementation level. All it rules out is the (absurd) hypothesis that cognitive architectures are implemented in the brain in the same way as they are implemented on electronic computers.

If you ever have doubts about whether a proposal pertains to the implementation level or the symbolic level, a useful heuristic is to ask yourself whether what is being claimed is true of a conventional computer, such as the DEC VAX, at its implementation level. Thus, although most algorithms that run on the VAX are serial,30 at the implementation level such computers are 'massively parallel'; they quite literally involve simultaneous electrical activity throughout almost the entire device. For example, every memory access cycle involves pulsing every bit in a significant fraction of the system's memory registers (since memory access is essentially a destructive read and rewrite process), the system clock regularly pulses and activates most of the central processing unit, and so on.
The moral is that the absolute speed of a process is a property par excellence of its implementation. (By contrast, the relative speed with which a system responds to different inputs is often diagnostic of distinct processes; this has always been a prime empirical basis for deciding among alternative algorithms in information processing psychology.) Thus, the fact that individual neurons require tens of milliseconds to fire can have no bearing on the predicted speed at which an algorithm will run unless there is at least a partial, independently motivated, theory of how the operations of the functional architecture are implemented in neurons. Since, in the case of the brain, it is not even certain that the firing31 of neurons is invariably the relevant implementation property (at least for higher level cognitive processes like learning and memory), the "100 step" constraint excludes nothing.

Finally, absolute constraints on the number of serial steps that a mental process can require, or on the time that can be required to execute them, provide weak arguments against Classical architecture, because Classical architecture in no way excludes parallel execution of multiple symbolic processes. Indeed, it seems extremely likely that many Classical symbolic processes
30. Even in the case of a conventional computer, whether it should be viewed as executing a serial or a parallel algorithm depends on what 'virtual machine' is being considered in the case in question. After all, a VAX can be used to simulate (i.e., to implement) a virtual machine with a parallel architecture. In that case the relevant algorithm would be a parallel one.

31. There are, in fact, a number of different mechanisms of neural interaction (e.g., the "local interactions" described by Rakic, 1975). Moreover, a large number of chemical processes take place at the dendrites, covering a wide range of time scales, so even if dendritic transmission were the only relevant mechanism, we still wouldn't know what time scale to use as our estimate of neural action in general (see, for example, Black, 1986).
are going on in parallel in cognition, and that these processes interact with one another (e.g., they may be involved in some sort of symbolic constraint propagation). Operating on symbols can even involve "massively parallel" organizations; that might indeed imply new architectures, but they are all Classical in our sense, since they all share the Classical conception of computation as symbol processing. (For examples of serious and interesting proposals on organizing Classical processors into large parallel networks, see Hewitt's, 1977, Actor system, Hillis', 1985, "Connection Machine", as well as any of a number of recent commercial multi-processor machines.) The point here is that an argument for a network of parallel computers is not, in and of itself, either an argument against Classical architecture or an argument for Connectionist architecture.

Resistance to noise and physical damage (and the argument for distributed representation)

Some of the other advantages claimed for Connectionist architectures over Classical ones are just as clearly aimed at the implementation level. For example, the "resistance to physical damage" criterion is so obviously a matter of implementation that it should hardly arise in discussions of cognitive-level theories.

It is true that a certain kind of damage resistance appears to be incompatible with localization, and it is also true that representations in PDP's are distributed over groups of units (at least when "coarse coding" is used). But distribution over units achieves damage resistance only if it entails that representations are also neurally distributed.32 However, neural distribution of representations is just as compatible with Classical architectures as it is with Connectionist networks. In the Classical case all you need are memory registers that distribute their contents over physical space. You can get that with fancy storage systems like optical ones, or chemical ones, or even with registers made of Connectionist nets. Come to think of it, we already had it in the old style "ferrite core" memories.

32. Unless the 'units' in a Connectionist network really are assumed to have different spatially-focused loci in the brain, talk about distributed representation is likely to be extremely misleading. In particular, if units are merely functionally individuated, any amount of distribution over functional entities is compatible with any amount of spatial compactness of their neural representations. But it is not clear that units do in fact correspond to any anatomically identifiable locations in the brain. In the light of the way Connectionist mechanisms are designed, it may be appropriate to view units and links as functional/mathematical entities (what psychologists would call "hypothetical constructs") whose neurological interpretation remains entirely open. (This is in fact the view that some Connectionists take; see Smolensky, 1988.) The point is that distribution over mathematical constructs does not buy you damage resistance; only neural distribution does.
The physical requirements of a Classical symbol-processing system are easily misunderstood. (Confounding of physical and functional properties is widespread in psychological theorizing in general; for a discussion of this confusion in relation to metrical properties in models of mental imagery, see Pylyshyn, 1981.) For example, conventional architecture requires that there be distinct symbolic expressions for each state of affairs that it can represent. Since such expressions often have a structure consisting of concatenated parts, the adjacency relation must be instantiated by some physical relation when the architecture is implemented (see the discussion in footnote 9). However, since the relation to be physically realized is functional adjacency, there is no necessity that physical instantiations of adjacent symbols be spatially adjacent. Similarly, although complex expressions are made out of atomic elements, and the distinction between atomic and complex symbols must somehow be physically instantiated, there is no necessity that a token of an atomic symbol be assigned a smaller region in space than a token of a complex symbol. In Classical machines (to use our heuristic again) pairs of symbols may certainly be functionally adjacent, but the symbol tokens are nonetheless spatially spread through many memory locations.

... a metric corresponds to some appropriate physical dimension. When that is the case, we may be able to predict adverse consequences that varying the physical property has on objects localized in functional space (e.g., varying the ...). Even when representations are resistant to spatially-local damage, they may not be resistant to damage that is local along some other physical dimensions. Since spatially-local damage is particularly frequent in real world traumas, this may have important practical consequences. But so long as our knowledge of how cognitive processes might be mapped onto brain tissue remains very nearly nonexistent, its message for cognitive science remains moot.
The notion that "soft" constraints, which can vary continuously in degree (as activation does), are incompatible with Classical rule-based symbolic systems is another example of the failure to keep the psychological (or symbol-processing) and the implementation levels separate. One can have a Classical rule system in which the decision concerning which rule will fire resides in the functional architecture and depends on continuously varying magnitudes. Indeed, this is typically how it is done in practical "expert systems" which, for example, use a Bayesian mechanism in their production-system rule-interpreter. The soft or stochastic nature of rule-based processes arises from the interaction of deterministic rules with real-valued properties of the implementation, or with noisy inputs or noisy information transmission.

It should also be noted that rule applications need not issue in "all or none" behaviors, since several rules may be activated at once and can have interactive effects on the outcome. Or, alternatively, each of the activated rules can generate independent effects, which might get sorted out later depending, say, on which of the parallel streams reaches a goal first.

An important, though sometimes neglected, point about such aggregate properties of overt behavior as continuity, "fuzziness", randomness, etc., is that they need not arise from underlying mechanisms that are themselves fuzzy, continuous or random. It is not only possible in principle, but often quite reasonable in practice, to assume that apparently variable or nondeterministic behavior arises from the interaction of multiple deterministic sources.
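As a toy illustration of this separation (ours; the rules and the strengths are invented for the example), one can write a perfectly Classical, deterministic rule-interpreter in which "softness" enters only through continuously varying strengths supplied by the implementation:

```python
# Toy illustration (ours): the rules themselves are discrete and
# deterministic; "softness" enters only through continuously varying
# strengths that the rule-interpreter uses to pick among the rules
# whose conditions hold.

RULES = [
    # (name, condition on working memory, action)
    ("greet",  lambda wm: "hello" in wm, "wave"),
    ("answer", lambda wm: "question" in wm, "reply"),
]

def interpret(working_memory, strengths):
    """Fire the matching rule with the highest real-valued strength."""
    matched = [(strengths[name], action)
               for name, cond, action in RULES if cond(working_memory)]
    if not matched:
        return None
    return max(matched)[1]

# Both rules match here; which one fires depends on the continuously
# varying strengths, not on anything in the rules themselves.
print(interpret({"hello", "question"}, {"greet": 0.42, "answer": 0.58}))
# -> reply
```

Graded behavior thus emerges from an entirely all-or-none rule system interacting with real-valued properties of its implementation, which is the point at issue.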
A similar point can be made about the issue of graceful degradation. Classical architecture does not require that when the conditions for applying the available rules are not precisely met the system should simply fail to do anything at all; rules might instead be activated in some measure depending upon how close their conditions are to holding. Exactly what happens in these cases may depend on how the rule-system is implemented. On the other hand, it could be that the failure to display graceful degradation really is an intrinsic limit of the current class of models, or even of current approaches to designing intelligent systems. It seems clear that the psychological models now available are inadequate over a broad spectrum of measures, so their problems with graceful degradation may be a special case of their general unintelligence: they may simply not be smart enough to know what to do when a given stock of methods fails to apply. But this is not a principled limitation of Classical architectures: there is, to our knowledge, no reason to believe that something like Newell's (1969) hierarchy of "weak methods", or Laird, Rosenbloom and Newell's (1986) "universal subgoaling", is in principle incapable of dealing with the problem of graceful degradation. (Nor, to our knowledge, has any argument yet been offered that Connectionist architectures are in principle capable of dealing with it. In fact current Connectionist models are every bit as graceless in their modes of failure as ones based on Classical architectures. For example, contrary to some claims,
models such as that of McClelland and Kawamoto, 1986, fail quite unnaturally when given incomplete information.)

In short, the Classical theorist can view stochastic properties of behavior as emerging from interactions between the model and the intrinsic properties of the physical medium in which it is realized. It is essential to remember that, from the Classical point of view, overt behavior is par excellence an interaction effect, and symbol manipulations are supposed to be only one of the interacting causes.

These same considerations apply to Kosslyn and Hatfield's remarks (quoted earlier) about the commitment of Classical models to 'passive' versus 'active' representations. It is true, as Kosslyn and Hatfield say, that the representations that Von Neumann machines manipulate 'don't do anything' until a CPU operates upon them (they don't decay, for example). But, even on the absurd assumption that the mind has exactly the architecture of some contemporary (Von Neumann) computer, it is obvious that its behavior, and hence the behavior of an organism, is determined not just by the logical machine that the mind instantiates, but also by the protoplasmic machine in which the logic is realized. Instantiated representations are therefore bound to be active, even according to Classical models; the question is whether the kind of activity they exhibit should be accounted for by the cognitive model or by the theory of its implementation. This question is empirical and must not be begged on behalf of the Connectionist view. (As it is, for example, in such passages as: "The brain itself does not manipulate symbols; the brain is the medium in which the symbols are floating and in which they trigger each other. There is no central manipulator, no central program. There is simply a vast collection of 'teams', patterns of neural firings that, like teams of ants, trigger other patterns of neural firings ... . We feel those symbols churning within ourselves in somewhat the same way we feel our stomach churning." (Hofstadter, 1983, p. 279). This appears to be a serious case of Formicidae in machina: ants in the stomach of the ghost in the machine.)

Explicitness of rules

According to McClelland, Feldman, Adelson, Bower, and McDermott (1986, p. 6), "... Connectionist models are leading to a reconceptualization of key psychological issues, such as the nature of the representation of knowledge. ... One traditional approach to such issues treats knowledge as a body of rules that are consulted by processing mechanisms in the course of processing; in Connectionist models, such knowledge is represented, often in widely distributed form, in the connections among the processing units." As we remarked in the Introduction, we think that the claim that most
What does need to be explicit in a Classical machine is not its program but the symbols that it writes on its tapes (or stores in its registers). These, however, correspond not to the machine's rules of state transition but to its data structures. Data structures are the objects that the machine transforms, not the rules of transformation. In the case of programs that parse natural language, for example, Classical architecture requires the explicit representation of the structural descriptions of sentences, but is entirely neutral on the explicitness of grammars, contrary to what many Connectionists believe.

One of the important inventions in the history of computers, the stored-program computer, makes it possible for programs to take on the role of data structures. But nothing in the architecture requires that they always do so. Similarly, Turing demonstrated that there exists an abstract machine (the so-called Universal Turing Machine) which can simulate the behavior of any target (Turing) machine. A Universal machine is "rule-explicit" about the machine it is simulating (in the sense that it has an explicit representation of that machine which is sufficient to specify its behavior uniquely). Yet the target machine can perfectly well be "rule-implicit" with respect to the rules that govern its behavior.

So, then, you can't attack Classical theories of cognitive architecture by showing that a cognitive process is rule-implicit; Classical architecture permits rule-explicit processes but does not require them. However, you can attack Connectionist architectures by showing that a cognitive process is rule-explicit since, by definition, Connectionist architecture precludes the sorts of logico-syntactic capacities that are required to encode rules and the sorts of executive mechanisms that are required to apply them.34

If, therefore, there should prove to be persuasive arguments for rule-explicit cognitive processes, that would be very embarrassing for Connectionists. A natural place to look for such arguments would be in the theory of the acquisition of cognitive competences. For example, much traditional work in linguistics (see Prince & Pinker, 1988) and all recent work in mathematical learning theory (see Osherson, Stob, & Weinstein, 1984) assumes that the characteristic output of a cognitive acquisition device is a recursive rule system (a grammar, in the linguistic case). Suppose such theories prove to be well-founded; then that would be incompatible with the assumption that the cognitive architecture of the capacities acquired is Connectionist.
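The rule-explicit/rule-implicit distinction can be illustrated with a toy interpreter (our sketch; the machine and its rule table are invented for illustration). The simulated machine's transition rules are explicit, stored as data that the interpreter consults, while the interpreter's own rules remain implicit in its control structure:

```python
# Toy illustration of "rule-explicit" vs "rule-implicit". The simulated
# machine's state-transition rules are explicit: they are data structures
# that the interpreter reads. The interpreter's own rules are implicit:
# they are hard-wired into its control flow, represented nowhere it looks.

# Explicit rule table for a tiny machine that flips bits until it halts:
# (state, symbol) -> (new_state, new_symbol, head_move)
RULES = {
    ("flip", 0): ("flip", 1, +1),
    ("flip", 1): ("flip", 0, +1),
}

def run(tape, state="flip", pos=0):
    """Simulate the target machine by consulting its explicit rule table."""
    while (state, tape[pos]) in RULES:
        state, tape[pos], move = RULES[(state, tape[pos])]
        pos += move
        if pos >= len(tape):  # ran off the tape: halt
            break
    return tape

print(run([0, 1, 1, 0]))  # -> [1, 0, 0, 1]
```

The `run` function is rule-implicit about its own operation in just the sense the text describes: nothing in the data it consults represents the `while` loop that drives it.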
34. Of course, it is possible to simulate a "rule-explicit process" in a Connectionist network, by first implementing a Classical architecture in the network. The slippage between networks as architectures and as implementations is ubiquitous in Connectionist writings, as we remarked above.
terpretations. On the one hand, people like Ballard (1986) and Sejnowski (1981) are explicitly attempting to build models based on properties of neurons and neural organizations, even though the neuronal units in question are idealized (some would say more than a little idealized: see, for example, the commentaries following the Ballard, 1986, paper). On the other hand, Smolensky (1988) views Connectionist units as mathematical objects which can be given an interpretation in either neural or psychological terms. Most Connectionists find themselves somewhere in between, frequently referring to their approach as "brain-style" theorizing.35

Understanding both psychological principles and the way that they are neurophysiologically implemented is much better (and, indeed, more empirically secure) than only understanding one or the other. That is not at issue. The question is whether there is anything to be gained by designing "brain-style" models that are uncommitted about how the models map onto brains.
Presumably the point of "brain-style" modeling is that theories of cognitive processing should be influenced by the facts of biology (especially neuroscience). The biological facts that influence Connectionist models appear to include the following: neuronal connections are important to the patterns of brain activity; the memory "engram" does not appear to be spatially local; to a first approximation, neurons appear to be threshold elements which sum the activity arriving at their dendrites; many of the neurons in the cortex have multi-dimensional "receptive fields" that are sensitive to a narrow range of values of a number of parameters; the tendency for activity at a synapse to cause a neuron to "fire" is modulated by the frequency and recency of past firings.
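The idealized picture of a neuron as a threshold element that sums arriving activity can be sketched in a few lines (our toy illustration, with invented weights; not a model proposed in the text):

```python
# A minimal sketch of the idealized unit described above: a threshold
# element that sums the activity arriving on its input lines. The
# weights and threshold are invented for illustration.

def threshold_unit(inputs, weights, threshold):
    """Fire (return 1) iff the weighted sum of inputs reaches threshold."""
    total = sum(i * w for i, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

# Example: two excitatory inputs and one inhibitory input.
print(threshold_unit([1, 1, 1], [0.6, 0.6, -0.5], 1.0))  # -> 0  (0.7 < 1.0)
print(threshold_unit([1, 1, 0], [0.6, 0.6, -0.5], 1.0))  # -> 1  (1.2 >= 1.0)
```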
Let us suppose that these and similar claims are both true and relevant to the way the brain functions, an assumption that is by no means unproblematic. The question we might then ask is: What follows from such facts that is relevant to inferring the nature of the cognitive architecture? The unavoidable answer appears to be, very little. That's not an a priori claim. The degree of relationship between facts at different levels of organization of a system is an empirical matter. However, there is reason to be skeptical about whether the sorts of properties listed above are reflected in any more-or-less direct
way in the structure of the system that carries out reasoning. Consider, for example, one of the most salient properties of neural systems: they are networks which transmit activation culminating in state changes of some quasi-threshold elements. Surely it is not warranted to conclude that reasoning consists of the spread of excitation among representations, or even among semantic components of representations. After all, a VAX is also correctly characterized as consisting of a network over which excitation is transmitted, culminating in state changes of quasi-threshold elements. Yet at the level at which it processes representations, a VAX is literally organized as a Von Neumann architecture. The point is that the structure of "higher levels" of a system is rarely isomorphic, or even similar, to the structure of "lower levels" of a system.
No one expects the theory of protons to look very much like the theory of
rocks and rivers, even though, to be sure, it is protons and the like that rocks and rivers are 'implemented in'. Lucretius got into trouble precisely by assuming that there must be a simple correspondence between the structure of macrolevel and microlevel theories. He thought, for example, that hooks and eyes hold the atoms together. He was wrong, as it turns out.

There are, no doubt, cases where special empirical considerations suggest detailed structure/function correspondences or other analogies between different levels of a system's organization. For example, the input to the most peripheral stages of vision and motor control must be specified in terms of anatomically projected patterns (of light, in one case, and of muscular activity in the other); and independence of structure and function is perhaps less likely in a system whose input or output must be specified somatotopically. Thus, at these stages it is reasonable to expect an anatomically distributed structure to be reflected by a distributed functional architecture. When, however, the cognitive process under investigation is as abstract as reasoning, there is simply no reason to expect isomorphisms between structure and function; as, indeed, the computer case proves.
Perhaps this is all too obvious to be worth saying. Yet it seems that the commitment to "brain style" modeling leads to many of the characteristic Connectionist claims about psychology, and that it does so via the implicit (and unwarranted) assumption that there ought to be similarity of structure among the different levels of organization of a computational system. This is distressing, since much of the psychology that this search for structural analogies has produced is strikingly recidivist. Thus the idea that the brain is a neural network motivates the revival of a largely discredited Associationist psychology. Similarly, the idea that brain activity is anatomically distributed leads to functionally distributed representations for concepts, which in turn leads to the postulation of microfeatures; yet the inadequacies of feature-based theories of meaning were widely appreciated well before Connectionism came along. And projecting frequency-sensitive connection strengths onto the cognitive level produces a resurgence of statistical models of learning that had been widely acknowledged to be extremely limited in their applicability (e.g., Minsky & Papert, 1972; Chomsky, 1957).

So although, in principle, knowledge of how the brain is organized could direct cognitive modeling in a beneficial manner, in fact a research strategy has to be judged by its fruits. The main fruit of brain-style modeling has been to revive psychological theories whose limitations had previously been widely appreciated. It has done so largely because assumptions about the structure of the brain have been adopted in an all-too-direct manner as hypotheses about cognitive architecture; it is an instructive paradox that the current attempt to be thoroughly modern and take the brain seriously should lead to a psychology not readily distinguishable from the worst of Hume and Berkeley. The moral seems to be that one should be deeply suspicious of the heroic sort of brain modeling that purports to address the problems of cognition. We sympathize with the craving for biologically respectable theories; but truth is more important than respectability.
Concluding comments: Connectionism as a theory of implementation

A recurring theme in the previous discussion is that many of the arguments for Connectionism are best construed as claiming that cognitive architecture is implemented in a certain kind of network (of abstract "units"). Understood this way, these arguments are neutral on the question of what the cognitive architecture is.36 In these concluding remarks we will briefly consider Connectionism from this point of view.

36Rumelhart and McClelland maintain that PDP models are more than just theories of implementation because (1) studying PDPs can add to our understanding of the problem (p. 116), and (2) PDPs can lead to the postulation of different macrolevel processes (p. 126). Both points deal with the heuristic value of brain style theorizing. Hence, though correct in principle, they are irrelevant to the crucial question whether Connectionism is best understood as an attempt to model neural implementation, or whether it really does promise a new theory of the mind incompatible with classical information-processing approaches. It is an empirical question whether the heuristic value of this approach will turn out to be positive or negative; we have already commented on our view of the recent history of this attempt.

Almost every student who enters a course on computational or information-processing models of cognition must be disabused of a very general misunderstanding concerning the role of the physical computer in such models. Students are almost always skeptical about "the computer as a model of cognition" on such grounds as that "computers don't forget or make mistakes", "computers function by exhaustive search", "computers are too logical and unmotivated", "computers can't learn by themselves; they can only do what they're told", "computers are too fast (or too slow)", or "computers never get tired or bored", and so on. If we add to this list such relatively more sophisticated complaints as that "computers don't exhibit graceful degradation" or "computers are too sensitive to physical damage", this list will begin to look much like the arguments put forward by Connectionists. The answer to all these complaints has always been that the implementation, and all properties associated with the particular realization of the algorithm that the theorist happens to use in a particular case, is irrelevant to the psychological theory; only the algorithm and the representations on which it operates are intended as a psychological hypothesis. Students are taught the notion of a "virtual machine" and shown that some virtual machines can learn, forget, get bored, make mistakes and whatever else one likes, providing one has a theory of the origins of each of the empirical phenomena in question.

Given this principled distinction between a model and its implementation, a theorist who is impressed by the virtues of Connectionism has the option of proposing PDPs as theories of implementation. But then, far from providing a revolutionary new basis for cognitive science, these models are in principle neutral about the nature of cognitive processes.
In fact, they might be viewed as advancing the goals of Classical information processing psychology by attempting to explain how the brain (or perhaps some idealized brain-like network) might realize the types of processes that conventional cognitive science has hypothesized.

Connectionists do sometimes explicitly take their models to be theories of implementation. Ballard (1986) even refers to Connectionism as "the implementational approach". Touretzky (1986) clearly views his BoltzCONS model this way; he uses Connectionist techniques to implement conventional symbol processing mechanisms such as pushdown stacks and other LISP facilities.37
37Even in this case, where the model is specifically designed to implement Lisp-like features, some of the rhetoric fails to keep the implementation-algorithm levels distinct. This leads to talk about "emergent properties" and to the claim that, even when they implement Lisp-like mechanisms, Connectionist systems "can compute things in ways in which Turing machines and von Neumann computers can't" (Touretzky, 1986). Such a claim suggests that Touretzky distinguishes different "ways of computing" not in terms of different algorithms, but in terms of different ways of implementing the same algorithm. While nobody has proprietary rights to terms like "ways of computing", this is a misleading way of putting it; it means that a DEC machine has a "different way of computing" from an IBM machine even when executing the identical program.
Rumelhart and McClelland (1986a, p. 117), who are convinced that Connectionism signals a radical departure from the conventional symbol processing approach, nonetheless refer to "PDP implementations" of various mechanisms such as attention. Later in the same essay, they make their position explicit: Unlike "reductionists", they believe "... that new and useful concepts emerge at different levels of organization". Although they then defend the claim that one should understand the higher levels "... through the study of the interactions among lower level units", the basic idea that there are autonomous levels seems implicit everywhere in the essay.

But once one admits that there really are cognitive-level principles distinct from the (putative) architectural principles that Connectionism articulates, there seems to be little left to argue about. Clearly it is pointless to ask whether one should or shouldn't do cognitive science by studying "the interaction of lower levels" as opposed to studying processes at the cognitive level, since we surely have to do both. Some scientists study geological principles, others study "the interaction of lower level units" like molecules. But since the fact that there are genuine, autonomously-stateable principles of geology is never in dispute, people who build molecular level models do not claim to have invented a "new theory of geology" that will dispense with all that old-fashioned "folk geological" talk about rocks, rivers and mountains!

We have, in short, no objection at all to networks as potential implementation models, nor do we suppose that any of the arguments we've given are incompatible with this proposal. The trouble is, however, that if Connectionists do want their models to be construed this way, then they will have to radically alter their practice. For it seems utterly clear that most of the Connectionist models that have actually been proposed must be construed as theories of cognition, not as theories of implementation. This follows from
the fact that it is intrinsic to these theories to ascribe representational content
to the units (and/or aggregates) that they postulate. And, as we remarked at the beginning, a theory of the relations among representational states is ipso facto a theory at the level of cognition, not at the level of implementation. It has been the burden of our argument that when construed as a cognitive theory, rather than as an implementation theory, Connectionism appears to have fatal limitations. The problem with Connectionist models is that all the reasons for thinking that they might be true are reasons for thinking that they couldn't be psychology.
Conclusion

What, in light of all of this, are the options for the further development of Connectionist theories? As far as we can see, there are four routes that they could follow:

(1) Hold out for unstructured mental representations as against the Classical view that mental representations have a combinatorial syntax and semantics. The productivity and systematicity arguments make this option appear unattractive.

(2) Abandon network architecture to the extent of opting for structured mental representations, but continue to insist upon an Associationistic account of the nature of mental processes. This is, in effect, a retreat to Hume's picture of the mind (see footnote 29), and it has a problem that we don't believe can be solved: Although mental representations are, on the present assumption, structured objects, association is not a structure-sensitive relation. The problem is thus how to reconstruct the semantical coherence of thought without postulating psychological processes that are sensitive to the structure of mental representations. (Equivalently, in more modern terms, it's how to get the causal relations among mental representations to mirror their semantical relations without assuming a proof-theoretic treatment of inference and, more generally, a treatment of semantic coherence that is syntactically expressed, in the spirit of proof-theory.) This is the problem on which traditional Associationism foundered, and the prospects for solving it now strike us as not appreciably better than they were a couple of hundred years ago. To put it a little differently: if you need structure in mental representations anyway to account for the productivity and systematicity of minds, why not postulate mental processes that are structure sensitive to account for the coherence of mental processes? Why not be a Classicist, in short?
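To make the notion of a "structure-sensitive" process concrete, here is a minimal sketch (our own illustration, not anything from the Connectionist or Classical literature). The inference rule applies to any representation of the form P&Q solely in virtue of its syntactic shape, regardless of what P and Q happen to mean; a purely associative mechanism, by contrast, could only link particular contents that have co-occurred.

```python
# Illustrative sketch of a structure-sensitive process: an inference rule
# defined over the FORM of a representation. Any expression ("AND", P, Q)
# yields its first conjunct P, whatever P and Q are.

def simplify_conjunction(expr):
    """From (P AND Q), infer P: inference defined over constituent structure."""
    if isinstance(expr, tuple) and expr and expr[0] == "AND":
        return expr[1]          # the first conjunct, whatever its content
    return expr                 # not a conjunction: nothing to infer

# The same rule applies to any thought with the right shape, including
# novel and nested ones it has never "seen" before:
print(simplify_conjunction(("AND", "it-is-raining", "John-is-tall")))  # it-is-raining
print(simplify_conjunction(("AND", ("AND", "a", "b"), "c")))           # ('AND', 'a', 'b')
```

The point of the toy is that the rule's applicability tracks constituent structure, not the identity of the constituents, which is exactly the property that bare association lacks.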
In any event, notice that the present option gives the Classical picture a lot of what it wants: viz., the identification of semantic states with relations to structured arrays of symbols and the identification of mental processes with transformations of such arrays. Notice too that, as things now stand, this proposal is Utopian, since there are no serious proposals for incorporating syntactic structure in Connectionist architectures.

(3) Treat Connectionism as an implementation theory. We have no principled objection to this view (though there are, as Connectionists are discovering, technical reasons why networks are often an awkward way to implement Classical machines). This option would entail rewriting quite a lot of the polemical material in the Connectionist literature, as well as redescribing what the networks are doing as operating on symbol structures, rather than spreading activation among semantically interpreted nodes.

Moreover, this revision of policy is sure to lose the movement a lot of fans. As we have pointed out, many people have been attracted to the Connectionist approach because of its promise to do away with the symbol level of analysis. Construed as an implementation theory, however, Connectionism simply provides an account of the medium in which cognitive processes are carried out, much as biochemistry does; and the structures it postulates may bear no more resemblance to cognitive architecture than biochemical structures do; 'implements' does not, for that matter, preserve structure all the way down.

(4) Give up on the idea that networks offer (to quote Rumelhart & McClelland, 1986a, p. 110) "... a reasonable basis for modeling cognitive processes in general". It could still be held that networks sustain some cognitive processes. A good bet is that they sustain such processes as can be analyzed as the drawing of statistical inferences; as far as we can tell, what network models really are is just analog machines for computing such inferences. Since we doubt that much of cognitive processing does consist of analyzing statistical relations, this would be quite a modest estimate of the prospects for network theory, compared with what the Connectionists themselves have been offering.

This is, for example, one way of understanding what's going on in the Rumelhart and McClelland (1986b) model of the acquisition of the past tense, though it is not the way that Rumelhart and McClelland understand it. The model is, in effect, a machine for computing the correlation between the phonological features of the verb stem and the phonological features of its past tense form, given a set of stem/past-tense pairings specified by a 'teacher'. Since the ontogenetic data consist of just such pairings, the network converges, at asymptote, on a mechanism that inflects verbs analogically, by statistical best fit to the corpus. By contrast, Pinker and Prince argue in quite a lot of detail that a mechanism which merely computes such correlations is a highly implausible account both of the mature speaker's competence and of the course of its acquisition.
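Read as a statistical inference engine, such a network is essentially a pattern associator. The following toy is our own sketch, not Rumelhart and McClelland's code: their model uses "Wickelfeature" encodings of phonological strings and far larger unit sets, but the core mechanism is the same kind of thing shown here, a weight matrix whose error-driven adjustment against teacher-supplied pairs comes to encode stem-to-past-form feature correlations.

```python
# Toy pattern associator (illustrative only). A weight matrix links binary
# "stem feature" units to "past-tense feature" units; training on
# teacher-supplied (stem, past) pairs adjusts the weights by error
# correction until the mapping in the corpus is reproduced.

def train_associator(pairs, n_in, n_out, rate=0.25, epochs=200):
    W = [[0.0] * n_in for _ in range(n_out)]
    for _ in range(epochs):
        for stem, past in pairs:                  # the 'teacher' supplies past
            for j in range(n_out):
                net = sum(W[j][i] * stem[i] for i in range(n_in))
                out = 1 if net > 0.5 else 0
                err = past[j] - out               # error-correction learning
                for i in range(n_in):
                    W[j][i] += rate * err * stem[i]
    return W

def recall(W, stem):
    return [1 if sum(w * x for w, x in zip(row, stem)) > 0.5 else 0 for row in W]

# Two 'verbs' coded as made-up feature vectors; the mapping is acquired
# purely by weight change, with no rule represented anywhere.
pairs = [((1, 0, 1), (1, 1, 0)), ((0, 1, 1), (0, 1, 1))]
W = train_associator(pairs, 3, 3)
print([recall(W, s) for s, _ in pairs])   # -> [[1, 1, 0], [0, 1, 1]]
```

Nothing in the trained matrix corresponds to a rule; there are only pairwise correlations between input and output features, which is precisely why the authors count such machines as statistical inference devices rather than rule systems.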
There is an alternative to the Empiricist idea that all learning consists of a kind of statistical inference, realized by adjusting parameters; it's the Rationalist idea that some learning is a kind of theory construction, effected by framing hypotheses and evaluating them against evidence. We seem to remember having been through this argument before. We find ourselves with a gnawing sense of deja vu.
References
Arbib, M. (1975). Artificial intelligence and brain theory: Unities and diversities. Annals of Biomedical Engineering, 3, 238-274.
Ballard, D.H. (1986). Cortical connections and parallel processing: Structure and function. The Behavioral and Brain Sciences, 9, 67-120.
Ballard, D.H. (1987). Parallel logical inference and energy minimization. Report TR142, Computer Science Department, University of Rochester.
Black, I.B. (1986). Molecular memory mechanisms. In Lynch, G. (Ed.), Synapses, circuits, and the beginnings of memory. Cambridge, MA: M.I.T. Press, A Bradford Book.
Bolinger, D. (1965). The atomization of meaning. Language, 41, 555-573.
Broadbent, D. (1985). A question of levels: Comments on McClelland and Rumelhart. Journal of Experimental Psychology: General, 114, 189-192.
Carroll, L. (1956). What the tortoise said to Achilles and other riddles. In Newman, J.R. (Ed.), The world of mathematics: Volume Four. New York: Simon and Schuster.
Chomsky, N. (1957). Syntactic structures. The Hague: Mouton.
Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: M.I.T. Press.
Chomsky, N. (1968). Language and mind. New York: Harcourt, Brace and World.
Churchland, P.M. (1981). Eliminative materialism and the propositional attitudes. Journal of Philosophy, 78, 67-90.
Churchland, P.S. (1986). Neurophilosophy. Cambridge, MA: M.I.T. Press.
Cummins, R. (1983). The nature of psychological explanation. Cambridge, MA: M.I.T. Press.
Dennett, D. (1986). The logical geography of computational approaches: A view from the east pole. In Brand, M., & Harnish, M. (Eds.), The representation of knowledge. Tucson, AZ: The University of Arizona Press.
Dreyfus, H., & Dreyfus, S. (in press). Making a mind vs. modelling the brain: A.I. back at a branchpoint. Daedalus.
Fahlman, S.E., & Hinton, G.E. (1987). Connectionist architectures for artificial intelligence. Computer, 20, 100-109.
Feldman, J.A. (1986). Neural representation of conceptual knowledge. Report TR189, Department of Computer Science, University of Rochester.
Feldman, J.A., & Ballard, D.H. (1982). Connectionist models and their properties. Cognitive Science, 6, 205-254.
Fodor, J. (1976). The language of thought. Sussex: Harvester Press. (Harvard University Press paperback.)
Fodor, J.D. (1977). Semantics: Theories of meaning in generative grammar. New York: Thomas Y. Crowell.
Fodor, J. (1987). Psychosemantics. Cambridge, MA: M.I.T. Press.
Frohn, H., Geiger, H., & Singer, W. (1987). A self-organizing neural network sharing features of the mammalian visual system. Biological Cybernetics, 55, 333-343.
Hillis, D. (1985). The connection machine. Cambridge, MA: M.I.T. Press.
Hinton, G.E. (1987). Representing part-whole hierarchies in connectionist networks. Unpublished manuscript.
Hinton, G.E., McClelland, J.L., & Rumelhart, D.E. (1986). Distributed representations. In Rumelhart, McClelland and the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations. Cambridge, MA: M.I.T. Press/Bradford Books.
Hofstadter, D.R. (1983). Artificial intelligence: Sub-cognition as computation. In F. Machlup & U. Mansfield (Eds.), The study of information: Interdisciplinary messages. New York: John Wiley & Sons.
Kant, I. (1929). The critique of pure reason. New York: St. Martin's Press.
Katz, J.J. (1972). Semantic theory. New York: Harper & Row.
Katz, J.J., & Fodor, J.A. (1963). The structure of a semantic theory. Language, 39, 170-210.
Kosslyn, S.M., & Hatfield, G. (1984). Representation without symbol systems. Social Research, 51, 1019-1045.
Laird, J., Rosenbloom, P., & Newell, A. (1986). Universal subgoaling and chunking: The automatic generation and learning of goal hierarchies. Boston, MA: Kluwer Academic Publishers.
Lakoff, G. (1986). Connectionism and cognitive linguistics. Seminar delivered at Princeton University, December 8, 1986.
Mackworth, A. (1987). Constraint propagation. In Shapiro, S.C. (Ed.), The encyclopedia of artificial intelligence, Volume 1. New York: John Wiley & Sons.
McClelland, J.L., Feldman, J., Adelson, B., & Bower, G. (1986). Connectionist models and cognitive science: Goals, directions and implications. Report to the National Science Foundation, June, 1986.
McClelland, J.L., & Kawamoto, A.H. (1986). Mechanisms of sentence processing: Assigning roles to constituents. In McClelland, Rumelhart and the PDP Research Group (Eds.), Parallel distributed processing: Volume 2. Cambridge, MA: M.I.T. Press, Bradford Books.
McClelland, J.L., Rumelhart, D.E., & Hinton, G.E. (1986). The appeal of parallel distributed processing. In Rumelhart, McClelland and the PDP Research Group (Eds.), Parallel distributed processing: Volume 1. Cambridge, MA: M.I.T. Press/Bradford Books.
Minsky, M., & Papert, S. (1972). Artificial Intelligence Progress Report. AI Memo 252, Massachusetts Institute of Technology.
Newell, A. (1969). Heuristic programming: Ill-structured problems. In Aronofsky, J. (Ed.), Progress in operations research, III. New York: John Wiley & Sons.
Newell, A. (1980). Physical symbol systems. Cognitive Science, 4, 135-183.
Newell, A. (1982). The knowledge level. Artificial Intelligence, 18, 87-127.
Osherson, D., Stob, M., & Weinstein, S. (1984). Learning theory and natural language. Cognition, 17, 1-28.
Pinker, S. (1984). Language learnability and language development. Cambridge, MA: Harvard University Press.
Prince, A., & Pinker, S. (1988). On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28, this issue.
Pylyshyn, Z.W. (1980). Computation and cognition: Issues in the foundations of cognitive science. Behavioral and Brain Sciences, 3:1, 154-169.
Pylyshyn, Z.W. (1981). The imagery debate: Analogue media versus tacit knowledge. Psychological Review, 88, 16-45.
Pylyshyn, Z.W. (1984a). Computation and cognition: Toward a foundation for cognitive science. Cambridge, MA: M.I.T. Press.
Pylyshyn, Z.W. (1984b). Why computation requires symbols. Proceedings of the Sixth Annual Conference of the Cognitive Science Society, Boulder, Colorado, August, 1984. Hillsdale, NJ: Erlbaum.
Rakic, P. (1975). Local circuit neurons. Neurosciences Research Program Bulletin, 13, 299-313.
Rumelhart, D.E. (1984). The emergence of cognitive phenomena from sub-symbolic processes. In Proceedings of the Sixth Annual Conference of the Cognitive Science Society, Boulder, Colorado, August, 1984. Hillsdale, NJ: Erlbaum.
Rumelhart, D.E., & McClelland, J.L. (1985). Levels indeed! A response to Broadbent. Journal of Experimental Psychology: General, 114, 193-197.
Rumelhart, D.E., & McClelland, J.L. (1986a). PDP models and general issues in cognitive science. In Rumelhart, McClelland and the PDP Research Group (Eds.), Parallel distributed processing, Volume 1. Cambridge, MA: M.I.T. Press, A Bradford Book.
Rumelhart, D.E., & McClelland, J.L. (1986b). On learning the past tenses of English verbs. In McClelland, Rumelhart and the PDP Research Group (Eds.), Parallel distributed processing, Volume 2. Cambridge, MA: M.I.T. Press, A Bradford Book.
Schneider, W. (1987). Connectionism: Is it a paradigm shift for psychology? Behavior Research Methods, Instruments, & Computers, 19, 73-83.
Sejnowski, T.J. (1981). Skeleton filters in the brain. In Hinton, G.E., & Anderson, J.A. (Eds.), Parallel models of associative memory. Hillsdale, NJ: Erlbaum.
Simon, H.A., & Chase, W.G. (1973). Skill in chess. American Scientist, 61, 394-403.
Smolensky, P. (1988). On the proper treatment of connectionism. The Behavioral and Brain Sciences, 11, forthcoming.
Stabler, E. (1985). How are grammars represented? Behavioral and Brain Sciences, 6, 391-420.
Stich, S. (1983). From folk psychology to cognitive science. Cambridge, MA: M.I.T. Press.
Touretzky, D.S. (1986). BoltzCONS: Reconciling connectionism with the recursive nature of stacks and trees. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, Amherst, MA, August, 1986. Hillsdale, NJ: Erlbaum.
Wanner, E., & Maratsos, M. (1978). An ATN approach to comprehension. In Halle, M., Bresnan, J., & Miller, G.A. (Eds.), Linguistic theory and psychological reality. Cambridge, MA: M.I.T. Press.
Watson, J. (1930). Behaviorism. Chicago: University of Chicago Press.
Woods, W.A. (1975). What's in a link? In Bobrow, D., & Collins, A. (Eds.), Representation and understanding. New York: Academic Press.
Ziff, P. (1960). Semantic analysis. Ithaca, NY: Cornell University Press.
On language and connectionism: Analysis of a parallel distributed processing model of language acquisition*
STEVEN PINKER
Massachusetts Institute of Technology
ALAN PRINCE
Brandeis University
Abstract
Does knowledge of language consist of mentally-represented rules? Rumelhart and McClelland have described a connectionist (parallel distributed processing) model of the acquisition of the past tense in English which successfully maps many stems onto their past tense forms, both regular (walk/walked) and irregular (go/went), and which mimics some of the errors and sequences of development of children. Yet the model contains no explicit rules, only a set of neuron-style units which stand for trigrams of phonetic features of the stem, a set of units which stand for trigrams of phonetic features of the past form, and an array of connections between the two sets of units whose strengths are modified during learning. Rumelhart and McClelland conclude that linguistic rules may be merely convenient approximate fictions and that the real causal processes in language use and acquisition must be characterized as the transfer of activation levels among units and the modification of the weights of their connections. We analyze both the linguistic and the developmental assumptions of the model in detail and discover that (1) it cannot represent certain words, (2) it cannot learn many rules, (3) it can learn rules found in no human language, (4) it cannot explain morphological and phonological regularities, (5) it cannot explain the differences between irregular and regular forms, (6) it fails at its assigned task of mastering the past tense of English, (7) it gives an incorrect explanation for two developmental phenomena: stages of overregularization of irregular forms such as bringed, and the appearance of doubly-marked forms such as ated, and (8) it gives accounts of two others (infrequent overregularization of verbs ending in t/d, and the order of acquisition of different irregular subclasses) that are indistinguishable from those of rule-based theories. In addition, we show how many of the failures of the model can be attributed to its connectionist architecture. We conclude that connectionists' claims about the dispensability of rules in explanations in the psychology of language must be rejected, and that, on the contrary, the linguistic and developmental facts provide good evidence for such rules.

*The authors contributed equally to this paper and list their names in alphabetical order. We are grateful to Jane Grimshaw and Brian MacWhinney for providing transcripts of children's speech from the Brandeis Longitudinal Study and the Child Language Data Exchange System, respectively. We also thank Tom Bever, Jane Grimshaw, Stephen Kosslyn, Dan Slobin, an anonymous reviewer from Cognition, and the Boston Philosophy and Psychology Discussion Group for their comments on earlier drafts, and Richard Goldberg for his assistance. Preparation of this paper was supported by NSF grant IST-8420073 to Jane Grimshaw and Ray Jackendoff of Brandeis University, by NIH grant HD 18381-04 and NSF grant 85-18774 to Steven Pinker, and by a grant from the Alfred P. Sloan Foundation to the MIT Center for Cognitive Science. Requests for reprints may be sent to Steven Pinker at the Department of Brain and Cognitive Sciences, MIT, Cambridge, MA 02139, U.S.A., or Alan Prince at the Linguistics and Cognitive Science Program, Brown 125, Brandeis University, Waltham, MA 02254, U.S.A.

If design govern in a thing so small.
Robert Frost

1. Introduction
The study of language is notoriously contentious, but until recently, researchers who could agree on little else have all agreed on one thing: that linguistic knowledge is couched in the form of rules and principles. This conception is consistent with, indeed is one of the prime motivations for, the "central dogma" of modern cognitive science, namely that intelligence is the result of processing symbolic expressions. To understand language and cognition, according to this view, one must break them up into two aspects: the rules or symbol-manipulating processes capable of generating a domain of intelligent human performance, to be discovered by examining systematicities in people's perception and behavior, and the elementary symbol-manipulating mechanisms made available by the information-processing capabilities of neural tissue, out of which the rules or symbol-manipulating processes would be composed (see, e.g., Chomsky, 1965; Fodor, 1968, 1975; Marr, 1982; Minsky, 1963; Newell & Simon, 1961; Putnam, 1960; Pylyshyn, 1984).

One of the reasons this strategy is inviting is that we know of a complex intelligent system, the computer, that can only be understood using this algorithm-implementation or software-hardware distinction. And one of the reasons that the strategy has remained compelling is that it has given us precise, revealing, and predictive models of cognitive domains that have required few assumptions about the underlying neural hardware other than that it makes available some very general elementary processes of comparing and transforming symbolic expressions.
Of course, no one believes that cognitive models explicating the systematicities in a domain of intelligence can fly in the face of constraints provided by the operations made available by neural hardware. Some early cognitive models have assumed an underlying architecture inspired by the historical and technological accidents of current computer design, such as rapid reliable serial processing, limited-bandwidth communication channels, or rigid distinctions between registers and memory. These assumptions are not only inaccurate as descriptions of the brain, composed as it is of slow, noisy and massively interconnected units acting in parallel, but they are unsuited to tasks such as vision where massive amounts of information must be processed in parallel. Furthermore, some cognitive tasks seem to require mechanisms for rapidly satisfying large sets of probabilistic constraints, and some aspects of human performance seem to reveal graded patterns of generalization to large sets of stored exemplars, neither of which is easy to model with standard serial symbol-matching architectures. And progress has sometimes been stymied by the difficulty of deciding among competing models of cognition when one lacks any constraints on which symbol-manipulating processes the neural hardware supplies "for free" and which must be composed of more primitive processes.

1.1. Connectionism and symbol processing

In response to these concerns, a family of models of cognitive processes originally developed in the 1950s and early 1960s has received increased attention. In these models, collectively referred to as "Parallel Distributed Processing" ("PDP") or "Connectionist" models, the hardware mechanisms are networks consisting of large numbers of densely interconnected units, which correspond to concepts (Feldman & Ballard, 1982) or to features (Hinton, McClelland, & Rumelhart, 1981).
These units have activation levels and they transmit signals (graded or 1-0) to one another along weighted connections. Units "compute" their output signals through a process of weighting each of their input signals by the strength of the connection along which the signal is coming in, summing the weighted input signals, and feeding the result into a nonlinear output function, usually a threshold. Learning consists of adjusting the strengths of connections and the threshold-values, usually in a direction that reduces the discrepancy between an actual output in response to some input and a "desired" output provided by an independent set of "teaching" inputs. In some respects, these models are thought to resemble neural networks in meaningful ways; in others, most notably the teaching and learning mechanisms, there is no known neurophysiological analogue, and some authors are completely agnostic about how the units and connections are neurally instantiated. ("Brain-style modeling" is the noncommittal term used by Rumelhart & McClelland, 1986a.)

The computations underlying cognitive processes occur when a set of input units in a network is turned on in a pattern that corresponds in a fixed way to a stimulus or internal input. The activation levels of the input units then propagate through connections to the output units, possibly mediated by one or more levels of intermediate units. The pattern of activation of the output units corresponds to the output of the computation and can be fed into a subsequent network or into response effectors. Many models of perceptual and cognitive processes within this family have been explored recently (for a recent collection of reports, including extensive tutorials, reviews, and historical surveys, see Rumelhart, McClelland, & The PDP Research Group, 1986, and McClelland, Rumelhart, & The PDP Research Group, 1986; henceforth, "PDPI" and "PDPII").

There is no doubt that these models have a different feel than standard symbol processing models. The units, the topology and weights of the connections among them, the functions by which activation levels are transformed in units and connections, and the learning (i.e., weight-adjustment) function are all that is "in" these models; one cannot easily point to rules, algorithms, expressions, and the like inside them. By itself, of course, this means little, because the same is true for a circuit diagram of a digital computer implementing a theorem-prover. How, then, are PDP models related to the more traditional symbol-processing models that have until now dominated cognitive psychology and linguistics?

It is useful to distinguish three possibilities. In one, PDP models would occupy an intermediate level between symbol processing and neural hardware: they would characterize the elementary information processes provided by neural networks that serve as the building blocks of rules or algorithms.
Individual PDP networks would compute the primitive symbol associations (such as matching an input against memory, or pairing the input and output of a rule), but the way the overall output of one network feeds into the input of another would be isomorphic to the structure of the symbol manipulations captured in the statements of rules. Progress in PDP modeling would undoubtedly force revisions in traditional models, because traditional assumptions about primitive mechanisms may be neurally implausible, and complex chains of symbol manipulations may be obviated by unanticipated primitive computational powers of PDP networks. Nonetheless, in this scenario a well-defined division between rule and hardware would remain, each playing an indispensable role in the explanation of a cognitive process. Many existing types of symbol-processing models would survive mostly intact, and, to the extent they have empirical support and explanatory power, would
77
dictate many fundamental aspects of network organization. In some expositions of PDP models, this is the proposed scenario (see, e.g., Hinton, 1981; Hinton, McClelland, & Rumelhart, 1986, p. 78; also Touretzky, 1986, and Touretzky & Hinton, 1985, where PDP networks implement aspects of LISP and production systems, respectively). We call this "implementational connectionism."

An alternative possibility is that once PDP network models are fully developed, they will replace symbol-processing models as explanations of cognitive processes. It would be impossible to find a principled mapping between the components of a PDP model and the steps or memory structures implicated by a symbol-processing theory, to find states of the PDP model that correspond to intermediate states of the execution of the program, to observe stages of its growth corresponding to components of the program being put into place, or states of breakdown corresponding to components wiped out through trauma or loss; the structure of the symbolic model would vanish. Even the input-output function computed by the network model could differ in special cases from that computed by the symbolic model. Basically, the entire operation of the model (to the extent that it is not a black box) would have to be characterized not in terms of interactions among entities possessing both semantic and physical properties (e.g., different subsets of neurons or states of neurons, each of which represents a distinct chunk of knowledge), but in terms of entities that had only physical properties (e.g., the "energy landscape" defined by the activation levels of a large aggregate of interconnected neurons). Perhaps the symbolic model, as an approximate description of the performance in question, would continue to be useful as a heuristic, capturing some of the regularities in the domain in an intuitive or easily-communicated way, or allowing one to make convenient approximate predictions.
But the symbolic model would not be a literal account at any level of analysis of what is going on in the brain, only an analogy or a rough summary of regularities. This scenario, which we will call "eliminative connectionism," sharply contrasts with the hardware/software distinction that has been assumed in cognitive science until now: no one would say that a program is an "approximate" description of the behavior of a computer, with the "exact" description existing at the level of chips and circuits; rather, they are both exact descriptions at different levels of analysis.

Finally, there is a range of intermediate possibilities that we have already hinted at. A cognitive process might be profitably understood as a sequence or system of isolable entities that would be symbolic inasmuch as one could characterize them as having semantic properties such as truth values, consistency relations, or entailment relations, and one might predict the input-output function and systematicities of performance, development, or loss strictly in terms of the interactions among these entities.
Language and connectionism
In this paper we examine in detail one prominent connectionist effort: the model of the acquisition of the past tense in English developed by David Rumelhart and James McClelland (1986b, 1987). Using standard PDP mechanisms, this model learns to map representations of present tense forms of English verbs onto their past tense versions. It handles both regular (walk/walked) and irregular (feel/felt) verbs, it productively yields past forms for novel verbs not in its training set, and it distinguishes the variants of the past tense morpheme (t versus d versus id) conditioned by the final consonant of the verb (walked versus jogged versus sweated). Furthermore, in doing so it displays a number of behaviors reminiscent of children. It passes through stages of conservative acquisition of correct irregular and regular verbs (walked, brought, hit), followed by productive application of the regular rule and overregularization to irregular stems (e.g., bringed, hitted), followed by mastery of both regular and irregular verbs. It acquires subclasses of irregular verbs (e.g., fly/flew; sing/sang; hit/hit) in an order similar to children. It makes certain types of errors (ated, wented) at similar stages. None of the individual units or connections in the model corresponds in any obvious way to a word, a position within a word, a morpheme, a regular rule, an exception, or a paradigm, all of which have been assumed to be an essential part of the explanation of the past tense formation process. The intelligence of the model is distributed in the pattern of connection weights, so that the relation of any of its simple components to the complex output is indirect. Rumelhart and McClelland present these results as strong support for eliminative connectionism, the paradigm in which rule- or symbol-based accounts are simply eliminated from direct explanations of intelligence:

We suggest that implicit knowledge of language may be stored in connections among simple processing units organized into networks. While the behavior of such networks may be describable (at least approximately) as conforming to some system of rules, we suggest that an account of the fine structure of the phenomena of language use and language acquisition can best be formulated in models that make reference to the characteristics of the underlying networks. (Rumelhart & McClelland, 1987, p. 196)

We have, we believe, provided a distinct alternative to the view that children learn the rules of English past-tense formation in any explicit sense. We have shown that a reasonable account of the acquisition of past tense can be provided without recourse to the notion of a "rule" as anything more than a description of the language. We have shown that, for this case, there is no induction problem. The child need not figure out what the rules are, nor even that there are rules. (Rumelhart & McClelland, 1986b, p. 267; their emphasis)
We view this work on past-tense morphology as a step toward a revised understanding of language knowledge, language acquisition, and linguistic information processing in general. (Rumelhart & McClelland, 1986b, p. 268)

The Rumelhart-McClelland (henceforth, RM) model, because it inspires these remarkable claims, figures prominently in general expositions of connectionism that stress its revolutionary nature, such as Smolensky (in press) and McClelland, Rumelhart, and Hinton (1986). Despite the radical nature of these conclusions, it is our impression that they have gained acceptance in many quarters; many researchers have been persuaded that theories of language couched in terms of rules and rule acquisition may be obsolete (see, e.g., Sampson, 1987). Other researchers have attempted to blunt the force of Rumelhart and McClelland's attack on rules by suggesting that the model does not really contain rules, or that past tense acquisition is an unrepresentatively easy problem, or that there is some reason in principle why PDP models are incapable of being extended to language as a whole, or that Rumelhart and McClelland are modeling performance and saying little about competence, or are modeling implementations but saying little about algorithms. We believe that these quick reactions, be they conversion experiences or outright dismissals, are unwarranted. Much can be gained by taking the model at face value as a theory of the psychology of the child and by examining the claims of the model in detail. That is the goal of this paper.

The RM model, like many PDP models, is a tour de force. It is explicit, rather than leaving its assumptions as a degree of freedom. The model is tested not only against the phenomenon that inspired it, the three-stage developmental sequence, but against finer-grained data as well; Rumelhart and McClelland bring these developmental data to bear on the model in an unusually detailed way, examining not only gross effects but also many of its more subtle details. Several non-obvious but interesting empirical predictions are made, and parts of the model operate in surprising ways. These features are virtually unheard of in developmental psycholinguistics (see Pinker, 1979; Wexler & Culicover, 1980). There is no doubt that our understanding of language acquisition would advance more quickly if theories were held to these standards.

Nonetheless, our analysis of the model will come to conclusions very different from those of Rumelhart and McClelland. In their presentation, the
model is evaluated only by a global comparison of its overall output behavior with that of children. There is no unpacking of its underlying theoretical assumptions so as to contrast them with those of a symbolic rule-based alternative, or indeed any alternative. As a result, there is no apportioning of credit or blame for the model's performance to properties that are essential versus accidental, or unique to it versus shared by any equally explicit alternative. In particular, Rumelhart and McClelland do not consider what it is about the standard symbol-processing theories that makes them "standard," beyond their first-order ability to relate stem and past tense. To these ends, we analyze the assumptions and consequences of the RM model, as compared to those of symbolic theories, and point out the crucial tests that distinguish them. In particular, we seek to determine whether the RM model is viable as a theory of human language acquisition; there is no question that it is a valuable demonstration of some of the surprising things that PDP models are capable of, but our concern is whether it is an accurate model of children. Our analysis will lead to the following conclusions:

- Rumelhart and McClelland's actual explanation of children's stages of regularization of the past tense morpheme is demonstrably incorrect.
- Their explanation for one striking type of childhood speech error is also incorrect.
- Their other apparent successes in accounting for developmental phenomena either have nothing to do with the model's parallel distributed processing architecture, and can easily be duplicated by symbolic models, or involve major confounds and hence do not provide clear support for the model.
- The model is incapable of representing certain kinds of words.
- It is incapable of explaining patterns of psychological similarity among words.
- It easily models many kinds of rules that are not found in any human language.
- It fails to capture central generalizations about English sound patterns.
- It makes false predictions about derivational morphology, compounding, and novel words.
- It cannot handle the elementary problem of homophony.
- It makes errors in computing the past tense forms of a large percentage of the words it is tested on.
- It fails to generate any past tense form at all for certain words.
- It makes incorrect predictions about the reality of the distinction between regular rules and exceptions in children and in languages.
We will conclude that the claim that PDP networks can eliminate the need for rules and rule-induction mechanisms in the explanation of human language is unwarranted. We argue that the shortcomings of the RM model are in many cases due to central features of its connectionist architecture and irremediable, or remediable only by copying tenets of the symbolic theory it contrasts with. The paper is organized as follows. First, we describe the broad properties of English verbal inflection and how a simple rule-based symbolic theory handles them. We then describe the RM model and evaluate its ability to handle the empirical facts of the adult state of the language, examining how each of its radical properties (the elimination of rules, the distributed architecture) bears on this evaluation. Next, we evaluate the model's account of language development in children. Finally, we state our view of the status and promise of connectionism for explicating language acquisition.
2. A brief overview of English verbal inflection
2.1. The facts of English inflection

Our aim in this part of the paper is to describe briefly the basic linguistic structure of the English system of verbal inflection and the flavor of a rule-based account of it.¹ When we evaluate the RM model, we present many additional facts about English morphology. As presented to the learner, English verbal inflection is not elaborate: where the verb of classical Greek has about 350 distinct forms, and that of Spanish or Italian about 50, the regular English verb has exactly four:
¹Valuable linguistic studies of the English verbal system include Bybee and Slobin (1982), Curme (1935, 1947), Fries (1940), Hoard (1973), Hockett (1942), Jespersen (1942), Mencken (1936), Sloat and Hoard (1971), and Sweet (1892). Works by Chomsky and by Kiparsky (1982a, b) touch on aspects of the system.
(1) a. walk
    b. walks
    c. walked
    d. walking
As is typical in morphological systems, there is rampant syncretism: use of the same phonological form to express different, often unrelated morphological categories. On syntactic grounds we might distinguish 13 categories filled by the four forms.

(2) a. -∅   Present, everything but 3rd person singular: I, you, we, they open. Infinitive: to open.
    b. -s   Present, 3rd person singular: He opens.
    c. -ed  Past: It opened. Perfect Participle: It has opened. Passive Participle: It was being opened. Verbal adjective: A recently-opened box.
    d. -ing Progressive Participle: He is opening. Present Participle: He tried opening the door. Verbal noun (gerund): His incessant opening of the boxes. Verbal adjective: A quietly-opening door.
The system is rendered more interesting by the presence of about 180 'strong' or 'irregular' verbs, which form the past tense other than by simple suffixation. There are, however, far fewer than 180 ways of modifying a stem to produce a strong past tense; the study upon which Rumelhart and McClelland depend, Bybee and Slobin (1982), divides the strong group into nine coarse and somewhat heterogeneous subclasses, which we discuss later. (See the Appendix for a precis of the entire system.) Many strong verbs also maintain a further formal distinction, lost in (2c), between the past tense itself and the Perfect/Passive Participle, which is frequently marked with -en: 'he ate' vs. 'he has/was eaten'. These verbs mark the outermost boundary of systematic complexity in English, giving the learner five forms to keep track of, two of which (past and perfect/passive participle) are not predictable from totally general rules.²
2.2. Basic features of symbolic models of inflection
Rumelhart and McClelland write that "We chose the study of acquisition of past tense in part because the phenomenon of regularization is an example often cited in support of the view that children do respond according to general rules of language." What they mean is that when Berko (1958) first documented children's ability to inflect novel verbs for past tense (e.g., jicked), and when Ervin (1964) documented overregularizations of irregular past tense forms in spontaneous speech (e.g., breaked), it was effective evidence against any notion that language acquisition consisted of rote imitation. But it is important to note the general point that the ability to generalize beyond rote forms is not the only motivation for using rules, as behaviorists were quick to point out in the 1960s when they offered their own accounts of generalization. In fact, even the existence of competing modes of generalizing, such as the different past tense forms of regular and irregular verbs, or of regular verbs ending in different consonants, is not the most important motivation for positing distinct rules. Rather, rules are generally invoked in linguistic explanations in order to factor a complex phenomenon into simpler components that feed representations into one another. Different types of rules apply to these intermediate representations, forming a cascade of structures and rule components. Rules are individuated not only because they compete and mandate different transformations of the same input structure (such as break → breaked/broke) but because they apply to different kinds of structures and thus impose a factoring of a phenomenon into distinct components, rather than generating the phenomena in a single step mapping inputs to outputs. Such factoring allows orthogonal generalizations to be extracted and stated separately, so that observed complexity can arise through the interaction and feeding of independent rules and processes, which often have rather
²Somewhat beyond this bound lies the verb 'be', with eight distinct forms (be, am, is, are, was, were, been, being), of which only the last is regular.
different parameters and domains of relevance. This is immediately obvious in most of syntax, and indeed, in most domains of cognitive processing (which is why the acquisition and use of internal representations in "hidden units" is an important technical problem in connectionist modeling; see Hinton & Sejnowski, 1986; Rumelhart, Hinton, & Williams, 1986). However, it is not as obvious at first glance how rules feed each other in the case of past tense inflection. Thus to examine in what sense the RM model "has no rules" and thus differs from symbolic accounts, it is crucial to spell out how the different rules in the symbolic accounts are individuated in terms of the components they are associated with. There is one set of "rules" inherent in the generation of the past tense in English that is completely outside the mapping that the RM model computes: those governing the interaction between the use of the past tense form and the type of sentence the verb appears in, which depends on semantic factors such as the relationship between the times of the speech act, referent event, and a reference point, combined with various syntactic and lexical factors such as the choice of a matrix verb in a complex sentence (I helped her leave/*left versus I know she left/*leave) and the modality and mood of a sentence (I went/*go yesterday versus I didn't go/*went yesterday; If my grandmother had/*has balls she'd be my grandfather). In other words, a speaker doesn't choose to produce a past tense form of a verb when and only when he or she is referring to an event taking place before the act of speaking. The distinction between the mechanisms governing these phenomena, and those that associate individual stems and past tense forms, is implicitly accepted by Rumelhart and McClelland.
That is, presumably the RM model would be embedded in a collection of networks that would pretty much reproduce the traditional picture of there being one set of syntactic and semantic mechanisms that selects occasions for use of the past tense, feeding information into a distinct morphological/phonological system that associates individual stems with their tensed forms. As such, one must be cautious at the outset in saying that the RM model is an alternative to a rule-based account of the past tense in general; at most, it is an alternative to whatever decomposition is traditionally assumed within the part of grammar that associates stems and past tense forms.³

In symbolic accounts, this morphological/phonological part is subject to

³Furthermore, the RM model seeks only to generate past forms from stems; it has no facility for retrieving a stem given the past tense form as input. (There is no guarantee that a network will run 'backwards', and in fact some of the more sophisticated learning algorithms presuppose a strictly feed-forward design.) Presumably the human learner can go both ways from the very beginning of the process; later we present examples of children's back-formations in support of this notion. Rule-based theories, as accounts of knowledge rather than use of knowledge, are neutral with respect to the production/recognition distinction.
further decomposition. In particular, rule-based accounts rely on several fundamental distinctions:
Lexical item vs. phoneme string. The lexical item is a unique, idiosyncratic set of syntactic, semantic, morphological, and phonological properties. The phoneme string is just one of these properties. Distinct items may share the same phonological composition (homophony). Thus the notion of lexical representation distinguishes phonologically ambiguous words such as wring and ring.
Morphological category vs. morpheme. There is a distinction between a morphological category, such as 'past tense' or 'perfect aspect' or 'plural' or 'nominative case', and the realization(s) of it in phonological substance. The relation can be many-one in both directions: the same phonological entity can mark several categories (syncretism); and one category may have several (or indeed many) realizations, such as through a variety of suffixes or through other means of marking. Thus in English, -ed syncretistically marks the past, the perfect participle, the passive participle, and a verbal adjective, distinct categories; while the past tense category itself is manifested differently in such items as bought, blew, sat, bled, bent, cut, went, ate, killed.

Morphology vs. phonology. Morphological rules describe the syntax of words (how words are built from morphemes) and the realization of abstract morphological categories. Phonological rules deal with the predictable features of sound structure, including adjustments and accommodations occasioned by juxtaposition and superposition of phonological elements. Morphology trades in such notions as 'stem', 'prefix', 'suffix', 'past tense'; phonology in such as 'vowel', 'voicing', 'obstruence', 'syllable'. As we will see in our examination of English morphology, there can be a remarkable degree of segregation of the two vocabularies into distinct rule systems: there are morphological rules which are blind to phonology, and phonological rules blind to morphological category.

Phonology vs. phonetics. Recent work (Liberman & Pierrehumbert, 1984; Pierrehumbert & Beckman, 1986) refines the distinction between phonology proper, which establishes and maps between one phonological representation and another, and phonetic implementation, which takes a representation and relates it to an entirely different system of parameters (for example, targets in acoustic or articulatory space).
In addition, a rule-system is organized by principles which determine the interactions between rules: whether they compete or feed, and if they compete, which wins. A major factor in regulating the feeding relation is organization into components: morphology, an entire set of formation rules, feeds
phonology, which feeds phonetics.⁴ Competition among morphological alternatives is under the control of a principle of paradigm structure (called the 'Unique Entry Principle' in Pinker, 1984) which guarantees that in general each word will have one and only one form for each relevant morphological category; this is closely related to the 'Elsewhere Condition' of formal linguistics (Kiparsky, 1982a, b). The effect is that when a general rule (like Past(x) = x + ed) formally overlaps a specific rule (like Past(go) = went), the specific rule not only applies but also blocks the general one from applying. The picture that emerges looks like this:
(3) [Diagram: lexical items feed the morphological component, whose competing realizations are regulated by the Unique Entry Principle; morphology feeds the phonological component, which feeds phonetic implementation.]
With this general structure in mind, we can now examine how the RM model differs in "not having rules".
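The blocking logic described above can be sketched in a few lines of code. This is our own illustration of the symbolic account, not anything from the RM model; the tiny lexicon is a hypothetical stand-in, and spelling adjustments to the regular suffix are omitted.

```python
# Sketch of the Unique Entry Principle / blocking logic: a specific
# (irregular) listed form preempts the general Past(x) = x + ed rule.
# The lexicon here is a hypothetical illustration, not from the paper.

IRREGULAR_PASTS = {"go": "went", "bring": "brought", "hit": "hit"}

def past_tense(stem: str) -> str:
    # Specific rule: a listed irregular form blocks the general rule.
    if stem in IRREGULAR_PASTS:
        return IRREGULAR_PASTS[stem]
    # General rule: Past(x) = x + ed (spelling adjustments omitted).
    return stem + "ed"

print(past_tense("go"))    # went
print(past_tense("walk"))  # walked
```

The point of the sketch is the control structure: the specific entry does not merely compete with the general rule, it prevents it from applying at all.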
3. The Rumelhart-McClelland model
Rumelhart and McClelland's goal is to model the acquisition of the past tense, specifically the production of the past tense, considered in isolation
⁴More intricate variations on this basic pattern are explored in recent work in "Lexical Phonology"; see Kiparsky (1982a, b).
from the rest of the English morphological system. They assume that the acquisition process establishes a direct mapping from the phonetic representation of the stem to the phonetic representation of the past tense form. The model therefore takes the following basic shape:

(4) Uninflected stem → Pattern associator → Past form

This proposed organization of knowledge collapses the major distinctions embodied in the linguistic theory sketched in (3). In the following sections we ascertain and evaluate the consequences of this move.

The detailed structure of the RM model is portrayed in Figure 1. In its trained state, the pattern associator is supposed to take any stem as input and emit the corresponding past tense form. The model's pattern associator is a simple network with two layers of nodes, one for representing input, the other for output. Each node represents a different property that an input item may have. Nodes in the RM model may only be 'on' or 'off'; thus the nodes represent binary features, 'off' and 'on' marking the simple absence or presence of a certain property. Each stem must be encoded as a unique subset of turned-on input nodes; each possible past tense form as a unique subset of output nodes turned on. Here a nonobvious problem asserts itself. The natural assumption would be that words are strings on an alphabet, a concatenation of phonemes. But the RM model instead dissolves each string into the set of adjacent phoneme trigrams it contains.
Figure 1. The Rumelhart-McClelland model of past tense acquisition. (Reproduced from Rumelhart and McClelland, 1986b, p. 222, with permission of the publisher, Bradford Books/MIT Press.) [The figure shows a Fixed Encoding Network feeding a Pattern Associator with Modifiable Connections, which feeds a Decoding/Binding Network.]
(In order to locate word-edges, which are essential to phonology and morphology, it is necessary to assume that the word-boundary is a character (#) in the underlying alphabet.) Rumelhart and McClelland call such trigrams Wickelphones. Thus a word like strip translates, in their notation, to {#st, str, tri, rip, ip#}. Each Wickelphone is construed as an atomic property that a string may have or lack. Thus, writing it out as we did above is misleading, because the order of the five Wickelphones is not itself represented; the word is simply an unordered set of Wickelphones. This gives a distributed representation: an individual word does not register on its own node, but is analyzed as an ensemble of properties, the Wickelphones, which are the true primitives of the system. As Figure 1 shows, Rumelhart and McClelland require an encoder of unspecified nature to convert an ordered phonetic string into a set of activated Wickelphone units; we discuss some of its properties later.
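The trigram decomposition just described is mechanical, and can be sketched as follows. This is our own illustration (the paper leaves the actual encoder unspecified), and it uses letters to stand in for phonemes.

```python
# Sketch of the Wickelphone decomposition described above (our
# illustration; the RM model's own encoder is left unspecified).
# A word is represented as the unordered set of its phoneme trigrams,
# with "#" marking the word boundaries; letters stand in for phonemes.

def wickelphones(word: str) -> set[str]:
    padded = "#" + word + "#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

print(sorted(wickelphones("strip")))
# the five trigrams #st, str, tri, rip, ip# (their order is lost in the set)
```

Note that because the result is a set, two words sharing the same trigrams would receive identical representations, a point that becomes important later.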
The Wickelphone contains enough context to detect the gross kind of input-output relationships found in the stem-to-past tense mapping. Imagine a pattern associator mapping from input Wickelphones to output Wickelphones. As is usual in such networks, every input node is connected to every output node, giving each input Wickelphone node the chance to influence every node in the output Wickelphone set. Suppose that a set of input nodes is turned on, representing an input to the network. Whether a given output node will turn on is determined jointly by the strength of its connections to the active input nodes and by the output node's own overall susceptibility to influence, its threshold. The individual on/off decisions for the output units are made probabilistically, on the basis of the discrepancy between the total weighted input and the threshold.
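The output-node decision just described can be sketched as follows. This is our illustration only; in the RM model the decision is probabilistic rather than the all-or-none comparison shown here, and the weights are hypothetical.

```python
# Sketch of the output-node decision described above (our illustration;
# the RM model's decision is probabilistic rather than all-or-none).
# An output node compares its summed weighted input to its threshold.

def output_state(active_inputs, weights, threshold):
    # Only active input lines (value 1) contribute their connection weight.
    total = sum(w for a, w in zip(active_inputs, weights) if a)
    return 1 if total > threshold else 0

# Hypothetical example: two active inputs, net input 0.5 exceeds 0.4.
print(output_state([1, 1, 0], [0.8, -0.3, 0.5], threshold=0.4))  # 1
```

The two determinants the text names, connection strengths and the unit's threshold, appear here as the summed `total` and the cutoff it is compared against.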
The network starts out with its input and output nodes linked with weights at zero or at random, embodying no input-output relations: a tabula rasa that is either blank or meaninglessly noisy. (Rumelhart & McClelland's is blank.) Training involves presenting the network with an input form (in the present case, a representation of a stem) and comparing the output pattern actually obtained with the desired pattern for the corresponding past form, provided by a distinct kind of teaching input (not shown in Figure 1). The corresponding psychological assumption is that the child, through some unspecified process, has already figured out which past tense form is to be associated with which stem form. We call this the juxtaposition process; Rumelhart and McClelland adopt the not unreasonable idealization that it does not interact with the process of abstracting the nature of the mapping between stem and past forms.

The comparison between the actual output pattern computed by the connections between input and output nodes, and the desired pattern provided by the teacher, is made on a node-by-node basis. Any output node that is in the wrong state becomes a target of adjustment. If the network ends up leaving off a node that ought to be on according to the teacher, changes are made to render that node more likely to fire in the presence of the particular input at hand. Specifically, the weights on the links connecting active input units to the recalcitrant output unit are increased slightly; this will increase the tendency for the currently active input units (those that represent the input form) to activate the target node. In addition, the target node's own threshold is lowered slightly, so that it will tend to turn on more easily across the board. If, on the other hand, the network incorrectly turns an output node on, the reverse procedure is employed: the weights of the connections from currently active input units are decremented (potentially driving the connection weight to a negative, inhibitory value) and the target node's threshold is raised; a hyperactive output node is thus made more likely to turn off given the same pattern of input activation. Repeated cycling through input-output pairs, with concomitant adjustments, shapes the behavior of the pattern associator. This is the perceptron convergence procedure (Rosenblatt, 1962), and it is known to produce, in the limit, a set of weights that successfully maps the input activation vectors onto the desired output patterns.

In fact, the RM net, following about 200 training cycles of 420 stem-past pairs (a total of about 80,000 trials), is able to produce correct past forms without the support of the
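The adjustment procedure described above can be sketched schematically. This is our own perceptron-style illustration, not the RM implementation; the learning rate and the example values are assumptions.

```python
# Schematic perceptron-style update, as described above (our sketch,
# not the RM implementation). For one output unit in the wrong state,
# nudge the weights on active input lines and the unit's threshold.

RATE = 0.1  # size of each small adjustment (assumed value)

def train_step(active_inputs, weights, threshold, actual, desired):
    """Return updated (weights, threshold) for a single output unit."""
    if actual == desired:
        return weights, threshold  # node is in the right state: no change
    # +1 if the unit should have fired but did not; -1 for the reverse.
    direction = 1 if desired == 1 else -1
    new_weights = [
        w + direction * RATE * a  # only active inputs (a = 1) change
        for a, w in zip(active_inputs, weights)
    ]
    # Lowering the threshold makes the unit fire more easily; raising
    # it makes a hyperactive unit more likely to stay off.
    new_threshold = threshold - direction * RATE
    return new_weights, new_threshold

# A unit that should have fired: its active lines strengthen slightly
# and its threshold drops slightly.
w, t = train_step([1, 0, 1], [0.0, 0.0, 0.0], 0.5, actual=0, desired=1)
print(w, t)
```

Repeating such steps over many input-output pairs is what gradually shapes the pattern associator's behavior.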
"teaching" inputs. Somewhat surprisingly, a single set of connection weights in the network is able to map look to looked, live to lived, melt to melted, hit to hit, make to made, sing to sang, even go to went. The bits of stored information accomplishing these mappings are superimposed in the connection weights and node thresholds; no single parameter corresponds uniquely to a rule or to any single irregular stem-past pair.

Of course, it is necessary to show how such a network generalizes to stems it has not been trained on, not only how it reproduces a rote list of pairs. The circumstances under which generalization occurs in pattern associators with distributed representations is reasonably well understood. Any encoded (one is tempted to say 'en-noded') property of the input data that participates in a frequently attested pattern of input/output relations will play a major role in the development of the network. Because it is turned on during many training episodes, and because it stands in a recurrent relationship to a set of output nodes, its influence will be repeatedly enhanced by the learning procedure. A connectionist network does more than match input to output; it responds to regularities in the representation of the data and uses them to accomplish the mapping it is trained on and to generalize to new cases. In fact, the distinction between reproducing the memorized input-output pairs and generating novel outputs for novel inputs is absent from pattern associators: a single set of weights both reproduces trained pairs and produces novel outputs, which are blends of the output patterns strongly associated with each of the properties defining the novel input.

The crucial step is therefore the first one: coding the data. If the patterns in the data relevant to generalizing to new forms are not encoded in the representation of the data, no network (in fact, no algorithmic system of any sort) will be able to find them.
(This is after all the reason that so much research in the 'symbolic paradigm' has centered on the nature of linguistic representations.) Since phonological processes and relations (like those involved in past tense formation) do not treat phonemes as atomic, unanalyzable wholes but refer instead to their constituent phonetic properties like voicing, obstruency, tenseness of vowels, and so on, it is necessary that such fine-grained information be present in the network. The Wickelphone, like the phoneme, is too coarse to support generalization. To take an extreme example adapted from Morris Halle, any English speaker who labors to pronounce the celebrated composer's name as [bax] knows that if there were a verb to Bach, its past would be baxt and not baxd or baxid, even though no existing English word contains the velar fricative [x]. Any representation that does not characterize Bach as similar to pass and walk by virtue of ending in an unvoiced segment would fail to make this generalization. Wickelphones, of course, have this problem; they treat segments as opaque quarks and fail
to display vital information about segmental similarity classes. A better representation would have units referring in some way to phonetic features rather than to phonemes, because of the well-known fact that the correct dimension of generalization from old to new forms must be in terms of such
features .
Rumelhart and McClelland present a second reason for avoiding Wickelphone nodes. The number of possible Wickelphones for their representation of English is 35³ + (2 × 35²) = 45,325 (all triliterals + all biliterals beginning and ending with #). The number of distinct connections from the entire input Wickelvector to its output clone would be over two billion (45,325²), too many to handle comfortably. Rumelhart and McClelland therefore assume a phonetic decomposition of segments into features which are in broad outline like those of modern phonology. On the basis of this phonetic analysis, a Wickelphone dissolves into a set of 'Wickelfeatures', a sequence of three features, one from each of the three elements of the Wickelphone. For exam-
verb set. Notice that the actual atomic properties recognized by the model are not phonetic features per se, but entities that can be thought of as 3-feature sequences. The Wickelphone/Wickelfeature is an excellent example of the kind of novel properties that revisionist-symbol-processing connectionism can come up with.
A further refinement is that not all definable Wickelfeatures have units
dedicated to them: the Wickelfeature set was trimmed to exclude, roughly, feature-triplets whose first and third features were chosen from different phonetic dimensions.5 The end result is a system of 460 nodes, each one representing a Wickelfeature. One may calculate that this gives rise to 460² = 211,600 input-output connections.

5Although this move was inspired purely by considerations of computational economy, it or something like it has real empirical support; the reader familiar with current phonology will recognize its relation to the notion of a 'tier' of related features in autosegmental phonology.

The module that encodes words into input Wickelfeatures (the "Fixed Encoding Network" of Figure 1) and the one that decodes output Wickelfeatures into words (the "Decoding/Binding Network" of Figure 1) are perhaps not meant to be taken entirely seriously in the current implementation of the RM model, but several of their properties are crucially important in understanding and evaluating it. The input encoder is deliberately designed to activate some incorrect Wickelfeatures in addition to the precise set of Wickelfeatures in the stem: specifically, a randomly selected subset of those Wickelfeatures that encode the features of the central phoneme properly but encode incorrect feature values for one of the two context phonemes. This "blurred" Wickelfeature representation cannot be construed as random noise; the same set of incorrect Wickelfeatures is activated every time a word is presented, and no Wickelfeature encoding an incorrect choice of the central feature is ever activated. Rather, the blurred representation fosters generalization. Connectionist pattern associators are always in danger of capitalizing too much on idiosyncratic properties of words in the training set in developing their mapping from input to output, and hence of not properly generalizing to new forms. Blurring the input representations makes the connection weights in the RM model less likely to be able to exploit the idiosyncrasies of the words in the training set and hence reduces the model's tendency toward conservatism.
The output decoder faces a formidable task. When an input stem is fed into the model, the result is a set of activated output Wickelfeature units. Which units are on in the output depends on the current weights of the connections from active input units and on the probabilistic process that converts the summed weighted inputs into a decision as to whether or not to turn on.
Nothing in the model ensures that the set of activated output units will fit together to describe a legitimate word: the set of activated units do not have to have neighboring context features that "mesh" and hence implicitly "assemble" the Wickelfeatures into a coherent string; they do not have to be mutually consistent in the feature they mandate for a given position; and they do not have to define a set of features for a given position that collectively define an English phoneme (or any kind of phoneme). In fact, the output Wickelfeatures virtually never define a word exactly, and so there is no clear sense in which one knows which word the output Wickelfeatures are defining. In many cases Rumelhart and McClelland are only interested in assessing how likely the model seems to be to output a given target word, such as the correct past tense form for a given stem; in that case they can peer into the model, count the number of desired Wickelfeatures that are successfully activated and vice versa, and calculate the goodness of the match. However, this does not reveal which phonemes, or which words, the model would actually output.
To assess how likely the model actually is to output a phoneme in a given context, that is, how likely a given Wickelphone is in the output, a Wickelphone Binding Network was constructed as part of the output decoder. This network has units corresponding to Wickelphones; these units "compete"
with one another in an iterative process to "claim" the activated Wickelfeatures: the more Wickelfeatures that a Wickelphone unit uniquely accounts for, the greater its strength (Wickelfeatures accounted for by more than one Wickelphone are "split" in proportion to the number of other Wickelfeatures each Wickelphone accounts for uniquely) and, supposedly, the more likely that Wickelphone is to appear in the output. A similar mechanism, called the Whole-String Binding Network, is defined to estimate the model's relative tendencies to output any of a particular set of words when it is of interest to compare those words with one another as possible outputs. Rumelhart and McClelland choose a set of plausible output words for a given input stem, such as break, broke, breaked and broked for the past tense of break, and define a unit for each one. The units then compete for activated Wickelfeatures in the output vector, each one growing in strength as a function of the number of activated Wickelfeatures it uniquely accounts for (with credit for nonunique Wickelfeatures split between the words that can account for it), and diminishing as a function of the number of activated Wickelfeatures that are inconsistent with it. This amounts to a forced-choice procedure and still does not reveal what the model would output if left to its own devices, which is crucial in evaluating the model's ability to produce correct past tense forms for stems it has not been trained on. Rumelhart and McClelland envision an eventual "sequential readout process" that would convert Wickelfeatures into a single temporally ordered representation, but for now they make do with a more easily implemented substitute: an Unconstrained Whole-String Binding Network, which is a whole-string binding network with one unit for every possible string of phonemes less than 20 phonemes long, that is, a forced-choice procedure among all possible strings.
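A single-pass caricature of this forced-choice competition may help fix ideas. The published networks iterate and operate over Wickelfeatures; this sketch instead scores candidates once, over Wickelphones, with an assumed penalty weight for inconsistent features:

```python
# Single-pass caricature of the whole-string "binding network" competition;
# the real model iterates and uses Wickelfeatures, and the penalty weight
# here is an assumption made for illustration.
def wickelphones(word):
    s = "#" + word + "#"
    return {s[i:i + 3] for i in range(len(s) - 2)}

def strength(candidate, active, rivals, penalty=1.0):
    own = wickelphones(candidate)
    total = 0.0
    for f in active:
        if f in own:                 # credit, split among rival claimants
            sharers = [r for r in rivals if f in wickelphones(r)]
            total += 1.0 / len(sharers)
        else:                        # diminish for inconsistent active features
            total -= penalty
    return total

# pretend the network's output units happen to spell out 'broke'
active = wickelphones("broke")
rivals = ["break", "broke", "breaked", "broked"]
best = max(rivals, key=lambda r: strength(r, active, rivals))
print(best)  # broke
```

As in the text, this is a forced choice among a pre-selected candidate set; nothing in the scoring says what the network would produce left to its own devices.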
Since this process would be intractable to compute on today's computers, and maybe even tomorrow's, they created whole-string units only for a sharply restricted subset of the possible strings, those whose Wickelphones exceed a threshold in the Wickelphone binding network competition. But the set was still fairly large, and thus the model was in principle capable of selecting both correct past tense forms and various kinds of distortions of them. Even with the restricted set of whole strings available in the unconstrained whole-string binding network, the iterative competition process was quite time-consuming in the implementation, and thus Rumelhart and McClelland ran this network only in assessing the model's ability to produce past forms for untrained stems; in all other cases, they either counted features in the output Wickelfeature vector directly, or set up a restricted forced-choice test among a small set of likely alternatives in the whole-string binding network.
In sum, the RM model works as follows. The phonological string is cashed in for a set of Wickelfeatures by an unspecified process that activates all the correct and some of the incorrect Wickelfeature units. The pattern associator excites the Wickelfeature units in the output; during the training phase its parameters (weights and thresholds) are adjusted to reduce the discrepancy between the excited Wickelfeature units and the desired ones provided by the teacher. The activated Wickelfeature units may then be decoded into a string of Wickelphones by the Wickelphone binding network, or into one of a small set of words by the whole-string binding network, or into a free choice of an output word by the unconstrained whole-string binding network.

4. An analysis of the assumptions of the Rumelhart-McClelland model in comparison with symbolic accounts

It is possible to practice psycholinguistics with minimal commitment to explicating the internal representation of language achieved by the learner. Rumelhart and McClelland's work is emphatically not of this sort. Their model is offered precisely as a model of internal representation; the learning process is understood in terms of changes in a representational system as it converges on the mature state. It embodies claims of the greatest psycholinguistic interest: it has a theory of phonological representation, a theory of morphology, a theory (or rather anti-theory) of the role of the notion 'lexical item', and a theory of the relation between regular and irregular forms. In no case are these presupposed theories simply transcribed from familiar views; they constitute a bold new perspective on the central issues in the study of word-forms, rooted in the exigencies and strengths of connectionism.
The model largely exemplifies what we have called revisionist-symbol-processing connectionism, rather than implementational or eliminative connectionism. Standard symbolic rules are not embodied in it; nor does it posit an utterly opaque device whose operation cannot be understood in terms of symbol processing of any sort.
It is possible to isolate an abstract but unorthodox linguistic theory implicit in the model (though Rumelhart and McClelland do not themselves consider it in this light), and that theory can be analyzed and evaluated in the same way that more familiar theories are. These are the fundamental linguistic assumptions of the RM model:

That the Wickelphone/Wickelfeature provides an adequate basis for phonological generalization, circumventing the need to deal with strings.

That the past tense is formed by direct modification of the phonetics of the root, so that there is no need to recognize a more abstract level of morphological structure.

That the formation of strong (irregular) pasts is determined by purely phonetic properties of stems, so that there is no need to recognize the notion 'lexical item' as a locus of irregularity.

That the regular and irregular pasts differ only in the distribution of their exemplars, so that it is appropriate to generate them in a single, indissoluble facility.

These rather specific assumptions stand in a relation of mutual support to the broader claims sketched above, and Rumelhart and McClelland regard it as highly significant that their model supplies such a viable alternative: "We have shown that a reasonable account of the acquisition of past tense can be provided without recourse to the notion of a 'rule' as anything more than a description of the language" (PDPII, p. 267). We will show that these assumptions are grossly false, and that their failure undermines the model not only in its internal workings but in its broader interactions within the language; the linguistic properties at stake, moreover, provide a benchmark for measuring the adequacy of the model's representations.
4.1. Wickelphonology

The Wickelphone/Wickelfeature has some useful properties. Rumelhart and McClelland hold that the finite set of Wickelphones is rich enough to encode strings of arbitrary length (this being false, as we will see); in addition, each of them contains within it a chunk of context, in a way that allows context-dependent patterning to be found in the data (PDPII). These properties allow the RM model to get off the ground. If, however, the Wickelphone/Wickelfeature is to be taken seriously as a system of phonological representation, it must satisfy criteria of phonological adequacy.6

6For other critiques of the Wickelphone, see Halwes and Jenkins (1971) and Savin and Bever (1970).

Preserving distinctness. A representational system must preserve the distinctions present in
the language. English orthography is a familiar representational system that fails to preserve distinctness: for example, the word spelled 'read' may be read as either [rid] or [rEd];7 whatever its other virtues, spelling is not an appropriate medium for phonological computation. The Wickelphone system fails more seriously, because there are distinctions that it is in principle incapable of handling. Certain patterns of repetitions will map distinct string-regions onto the same Wickelphone set, resulting in irrecoverable loss of information. This is not just a mathematical curiosity. For example, the Australian language Oykangand (Sommer, 1980) distinguishes between algal 'straight' and algalgal 'ramrod straight', different strings which share the Wickelphone set {alg, al#, gal, lga, #al}, as can be seen from the analysis in (5):

(5) a. algal    #al alg lga gal al#
    b. algalgal #al alg lga gal alg lga gal al#

Wickelphone sets containing subsets closed under cyclic permutation on the character string, {alg, gal, lga} in the example at hand, are infinitely ambiguous as to the strings they encode. This shows that Wickelphones cannot represent even relatively short strings, much less strings of arbitrary length, without loss of concatenation structure (loss is guaranteed for strings
7We will use the following phonetic notation and terminology (sparingly). Enclosure in square brackets [ ] indicates phonetic spelling. The tense vowels are: [i] as in beat, [e] as in bait.
The lax vowels are: [I] as in bit, [E] as in bet.
The low front vowel [æ] appears in cat. The low central vowel [A] appears in shut. The low back vowel [ɔ] appears in caught. The diphthong [ay] appears in might and bite; the diphthong [aw] in house. The high lax central vowel [ɨ] is the second vowel in melted, rose's. The symbol [č] stands for the voiceless palato-alveolar affricate that appears twice in church; the symbol [ǰ] for its voiced counterpart, which appears twice in judge. [š] is the voiceless palato-alveolar fricative of shoe, and [ž] is its voiced counterpart, the final consonant of rouge. The velar nasal [ŋ] is the final consonant in sing. The term sonorant consonant refers to the liquids l, r and the nasals m, n, ŋ. The term obstruent refers to the complementary set of oral stops, fricatives, and affricates, such as p, t, k, s, š, č, b, d, g, v, z, ž, ǰ. The term coronal refers to sounds made at the dental, alveolar, and palato-alveolar places of articulation. The term sibilant refers to the conspicuously noisy fricatives and affricates [s, z, š, ž, č, ǰ].
over a certain length). On elementary grounds, then, the Wickelphone is demonstrably inadequate.
Supporting generalizations. A second, more sophisticated requirement is that a representation supply the basis for proper generalization. It is here that the phonetic vagaries of the most commonly encountered representation of English (its spelling) receive a modicum of justification. The letter i, for example, is implicated in the spelling of both [ay] and [I], allowing word-relatedness to be overtly expressed as identity of spelling in many pairs like
(6) a. write - written
    b. bite - bit
The Wickelphone/Wickelfeature provides surprisingly little help in finding phonological generalizations. There are two domains in which significant similarities are operative: (1) among items in the input set, and (2) between an input item and its output form. Taking the trigram as the primitive unit of description impedes the discovery of inter-item similarity relations. Consider the fact, noted by Rumelhart and McClelland, that the word silt and the word slit have no Wickelphones in common: the first goes to {#si, sil, ilt, lt#}, the second to {#sl, sli, lit, it#}. The implicit claim is that such pairs have no phonological properties in common. Although this result meets the need to distinguish the distinct, it shows that Wickelphone composition is a very unsatisfactory measure of psychological phonetic similarity. Indeed, historical changes of the type slit → silt and silt → slit, based on phonetic similarity, are fairly common in natural language. In the history of English, for example, we find hross → horse, thrid → third, brid → bird (Jespersen, 1942, p. 58). On pure Wickelphones such changes are equivalent to complete replacements; they are therefore no more likely, and no easier to master, than any other complete replacement, like horse going to slit or bird to clam. The situation is improved somewhat by the transition to Wickelfeatures, but remains unsatisfactory. Since the phonemes l and i share features like voicing, Wickelphones like sil and sli will share Wickelfeatures like Voiceless-Voiced-Voiced. The problem is that the l/i overlap is the same as the overlap of l with any vowel, and the same as the overlap of r with vowels. In Wickelfeatures it is just as costly, counting by number of replacements, to turn brid to the phonetically distant bald or blud as it is to turn it to nearby bird.
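Both failures of the trigram encoding discussed in this section, silt/slit sharing nothing and (from the earlier Oykangand example) distinct strings collapsing to one set, are easy to reproduce with a direct transcription of the Wickelphone decomposition:

```python
# Direct transcription of the trigram decomposition ('#' marks word edges).
def wickelphones(s):
    s = "#" + s + "#"
    return {s[i:i + 3] for i in range(len(s) - 2)}

# silt and slit share not a single Wickelphone...
print(wickelphones("silt") & wickelphones("slit"))        # set()
# ...while the distinct Oykangand strings collapse to one and the same set
print(wickelphones("algal") == wickelphones("algalgal"))  # True
```

The first result is the over-distinction the text complains of (phonetically similar words rendered maximally dissimilar); the second is the under-distinction (distinct words rendered identical).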
encumbrance than a guide. The dominant regularity of the language entails that a verb like kill will simply add one phone [d] in the past; in Wickelphones the map is as in (7):

(7) a. {#kI, kIl, Il#} → {#kI, kIl, Ild, ld#}
    b. Il# → Ild, ld#

The change, shown in (7b), is exactly the full replacement of one Wickelphone by two others. The Wickelphone is in principle incapable of representing an observation like 'add [d] to the end of a word when it ends in a voiced consonant', because there is no way to single out the one word-ending consonant and no way to add a phoneme without disrupting the stem; you must refer to the entire sequence AB#, whether A is relevant or not, and you must
Wickelfeature-by-Wickelfeature basis, but the unifying pattern is undiscoverable. Since the relevant phonological process involves only a pair of representationally adjacent elements, the triune Wickelphone/Wickelfeature is quite generally incompetent to locate the relevant factors and to capitalize on them in learning, with consequences we will see when we examine the model's success in generalizing to new forms.
The "blurring" of the Wickelfeature representation, by which certain input units XBC and ABZ are turned on in addition to authentic ABC, is a tactical response to the problem of finding similarities among the input set. The reason that AYB is not also turned on, as one would expect if "blurring" corresponded to neural noise of some sort, is in part that XBC and ABZ are units preserving the empirically significant adjacency pairing of segments: in many strings of the form ABC, we expect interactions within AB and BC, but not between A and C. Blurring both A and C helps to model processes in which only the presence of B is significant, and, as Lachter and Bever (1988) show, partially recreates the notion of the single phoneme as a phonological unit. Such selective "blurring" is not motivated within Rumelhart and McClelland's theory or by general principles of PDP architecture; it is an external imposition that pushes it along more or less in the right direction. Taken literally, it is scarcely credible: the idea would be that the pervasive adjacency requirement in phonological processes is due to quasirandom confusion, rather than structural features of the representational apparatus and the physical system it serves.
Excluding the impossible. The third and most challenging requirement we can place on a representational system is that it should exclude the impossible. Many kinds of formally simple relations are absent from natural language, presumably because they cannot be mentally represented. Here the Wickelphone/Wickelfeature fails spectacularly. A quintessentially unlinguistic map relates a string to its mirror-image reversal (this would relate pit to tip, brag to garb, dumb to mud, and so on); although neither physiology nor physics forbids it, no language uses such a pattern. But it is as easy to represent and learn in the RM pattern associator as the identity map. The rule is simply to replace each Wickelfeature ABC by the Wickelfeature CBA. In network terms, assuming link-weights from 0 to 1, weight the lines from ABC → CBA at 1 and all the (459) others emanating from ABC at 0. Since all weights start at 0 for Rumelhart and McClelland, this is exactly as easy to achieve as weighting the lines ABC → ABC at 1, with the others from ABC staying at 0; and it requires considerably less modification of weights than most other input-output transforms. Unlike other, more random replacements, the S → Sᴿ map is guaranteed to preserve the stringhood of the input Wickelphone set. It is easy to define other processes over the Wickelphone that are equally unlikely to make their appearance in natural language: for example, no process turns on the identity of the entire first Wickelphone (#AB) or last Wickelphone (AB#); compare in this regard the notions 'first (last) segment', 'first (last) syllable', frequently involved in actual morphological and phonological processes, but which appear as arbitrary disjunctions, if reconstructible at all, in the Wickelphone representation. The Wickelphone tells us as little about unnatural avenues of generalization as it does about the natural ones.
The root cause, we suggest, is that the Wickelphone is being asked to carry two contradictory burdens. Division into Wickelphones is primarily a way of multiplying out possible rule-contexts in advance. Since many phonological interactions are segmentally local, a Wickelphone-like decomposition into short substrings will pick out domains in which interaction is likely.8 But any such decomposition must also retain enough information to allow the string to be reconstituted with a fair degree of certainty. Therefore, the minimum usable unit to reconstruct order is three segments long, even though many contexts for actual phonological processes span a window of only two segments. Similarly, if the blurring process were done thoroughly, so that ABC would set off all XBZ in the input set, there would be a full representation of the presence of B, but the identity of the input string would disappear. The RM model thus establishes a mutually subversive relation between representing the aspects of the string that figure in generalizations and representing its concatenation structure. In the end, neither is done satisfactorily.

8Of course, not all interactions are segmentally local. In vowel harmony, for example, a vowel typically reacts to a nearby vowel over an intervening string of consonants; if there are two intervening consonants, the interacting vowels will never be in the same Wickelphone and generalization will be impossible. Stress rules commonly skip over a string of one or two syllables, which may contain many segments; crucial notions such as 'second syllable' will have absolutely no characterization in Wickelphonology (see Sietsema, 1987, for further discussion). Phenomena like these show the need for more sophisticated representational resources, so that the relevant notion of domain of interaction may be adequately defined (see van der Hulst & Smith, 1982, for an overview of recent work). It is highly doubtful that Wickelphonology can be strengthened to deal with such cases, but we will not explore these broader problems, because our goal is to examine the Wickelphone as an alternative to the segmental concatenative structure which every theory of phonology includes.
Rumelhart and McClelland display some ambivalence about the Wickelfeature. At one point they dismiss the computational difficulty of recovering a string from a Wickelfeature set as one that is easily overcome by parallel processing "in biological hardware" (p. 262). At another point they show how the Wickelfeature-to-Wickelphone reconversion can be done in a binding network that utilizes a certain genus of connectionist mechanisms, implying again that this process is to be taken seriously as part of the model. Yet they write (PDPII, p. 239):
All we claim for the present coding scheme is its sufficiency for the task of representing the past tenses of the 500 most frequent verbs in English and the importance of the basic principles of distributed, coarse (what we are calling blurred), conjunctive coding that it embodies.
This disclaimer is at odds with the centrality of the Wickelfeature in the model's design. The Wickelfeature structure is not some kind of approximation that can easily be sharpened and refined; it is categorically the wrong kind of thing for the jobs assigned to it.9
At the same time, the Wickelphone or something similar is demanded by the most radically distributed forms of distributed representations, which resolve order relations like concatenation into unordered sets of features. Without the Wickelphone, Rumelhart and McClelland have no account of how phonological strings are to be analyzed for significant patterning.
4.2. Phonology and morphology
The RM model maps from input to output in a single step, on the assumption that the past tense derives by direct phonetic modification of the stem. The regular endings t, d, ɨd make their appearance in the same way as the
9Compare in this regard certain other aspects of the model which are clearly inaccurate but represent harmless oversimplifications. The actual set of phonetic features used to describe individual phones (p. 235) doesn't make enough distinctions for English, much less for language at large, nor is it intended to; but the underlying strategy of featural classification is solidly in the scientific mainstream. Similarly, the frequency classifications of the verbs derive from the Kucera-Francis written corpus, which shows obvious divergences from the input encountered by a learner (for examples, see footnote 24). Such aberrations, which have little impact on the model's behavior, could be corrected easily with no structural re-design.
vowel changes i → a (sing - sang) or u → o (choose - chose). Rumelhart and McClelland claim as an advantage of the model that "[a] uniform procedure is applied for producing the past-tense form in every case" (PDPII, p. 267). This sense of uniformity can be sustained, however, only if past tense formation is viewed in complete isolation from the rest of English phonology and morphology. We will show that Rumelhart and McClelland's very local uniformity must be paid for with extreme nonuniformity in the treatment of the broader patterns of the language.
The distribution of t-d-ɨd follows a simple pattern: ɨd goes after those stems ending in t or d; elsewhere, t (voiceless itself) goes after a voiceless segment and d (itself voiced) goes after a voiced segment. The real interest of this rule is that none of it is specifically bound to the past tense. The perfect/passive participle and the verbal adjective use the very same t-d-ɨd scheme: was kicked - was slugged - was patted; a kicked dog - a flogged horse - a patted cat. These categories cannot be simply identified as copies of the past tense, because they have their own distinctive irregular formations. For example, past drank contrasts with the participle drunk and the verbal adjective drunken. Outside the verbal system entirely there is yet another process that uses the t-d-ɨd suffix, with the variants distributed in exactly the same way as in the verb forms, to make adjectives from nouns, with the meaning 'having X' (Jespersen, 1942, p. 426 ff.):

(8) -t  hooked, saber-toothed
    -d  long-nosed, horned
    -ɨd one-handed, talented
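The surface distribution of the three variants can be written down directly; the transcription below is a deliberately simplified stand-in (one ASCII letter per segment, 'id' for [ɨd]), not a serious phonological encoding:

```python
# The t-d-id distribution stated above, over a simplified one-letter-per-
# segment ASCII transcription ('id' stands in for the variant [ɨd]).
VOICELESS = set("ptkfsc")       # 'c' abbreviating the affricate of 'church'

def past_suffix(stem):
    last = stem[-1]
    if last in ("t", "d"):      # pat -> patted, pad -> padded
        return "id"
    return "t" if last in VOICELESS else "d"

print(past_suffix("kik"), past_suffix("slag"), past_suffix("pat"))  # t d id
```

As the text stresses, the same three-way choice recurs in the participle and the denominal adjectives of (8): the function is not specifically a past-tense fact.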
The full generality of the component processes inherent in the t-d-ɨd alternation only becomes apparent when we examine the widespread s-z-ɨz alternation found in the diverse morphological categories collected below:

(9) Category
    a. Noun plural
    b. Verb, 3rd person singular
    c. Possessive
    d. Reduced auxiliary has
    e. Reduced is
    f. Reduced does
    g. Affective
    h. Adverbial
    i. Compound linker

Language and connectionism

These 9 categories show syncretism in a big way: they use the same phonetic resources to express very different distinctions.
The regular noun plural exactly parallels the 3rd person singular marking of the verb, despite the fact that the two categories (noun/verb, singular/plural) have no notional overlap. The rule for choosing among s-z-ɨz is this: ɨz goes after stems ending in sibilants (s, z, š, ž, č, ǰ); elsewhere, s (itself voiceless) goes after voiceless segments, z (voiced itself) goes after voiced segments. The distribution of s/z is exactly the same as that of t/d. The rule for ɨz differs from that for ɨd only inasmuch as z differs from d. In both cases the rule functions to separate elements that are phonetically similar: as the sibilant z is to the sibilants, so the alveolar stop d is to the alveolar stops t and d.
The possessive marker and the fully reduced forms of the auxiliary has and the auxiliary/main verb is repeat the pattern. These three share the further interesting property that they attach not to nouns but to noun phrases, with the consequence that in ordinary colloquial speech they can end up on any kind of word at all, as shown in (10) below:

(10) a. [my mother-in-law]'s hat (cf. plural: mothers-in-law)
     b. [the man you met]'s dog
     c. [the man you spoke to]'s here. (Main verb be)
     d. [the student who did well]'s being escorted home. (Auxiliary be)
     e. [the patient who turned yellow]'s been getting better. (Auxiliary has)
The remaining formal categories (9f-h) share the s/z part of the pattern. The auxiliary does, when unstressed, can reduce colloquially to its final sibilant:10
10 (i) ?What church's he go to?
   (ii) ??Whose lunch's he eat from?
   (iii) ??Which's he like better?
   (iv) ??Whose's he actually prefer?
We suspect that the problem here lies in getting does to reduce at all in such structural environments, regardless of phonology. If this is right, then (i) and (ii) should be as good (or bad) as the structurally identical (v) and (vi), where the sibilant-sibilant problem doesn't arise:
   (v) ?What synagogue's he go to?
   (vi) ?Whose dinner's he eat from?
Sentence forms (iii) and (iv) use the wh-determiners which and whose without following head nouns, which may introduce sufficient additional structural complexity to inhibit reduction. At any rate, this detail, though interesting in itself, is orthogonal to the question of what happens to does when it does reduce.
(11) a. 'Z he like beans?
     b. What's he eat for lunch?
     c. Where's he go for dinner?
The affective marker s/z forms nicknames in some dialects and argots, as in Wills from William, Pats from Patrick, and also shows up in various emotionally-colored neologisms like bonkers, bats, paralleling -y or -o (batty, wacko), with which it sometimes combines (Patsy, fatso). A number of adverbial forms are marked by s/z: unawares, nowadays, besides, backwards, here/there/whereabouts, amidships. A final, quite sporadic (but phonologically regular) use links together elements of compounds, as in huntsman, statesman, kinsman, bondsman.
The reason that the voiced/voiceless choice is made identically throughout English morphology is not hard to find: it reflects the prevailing and inescapable phonetics of consonant-cluster voicing in the language at large. Even in unanalyzable words, final obstruent clusters have a single value for the voicing feature; we find only words like these:

(12) a. ax, fix, box          [ks]
     b. act, fact, product    [kt]
     c. traipse, lapse, corpse [ps]
     d. apt, opt, abrupt      [pt]
     e. blitz, kibitz, Potts  [ts]
     f. post, ghost, list     [st]
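The generalization behind (12), that a word-final obstruent cluster carries a single voicing value, can be checked mechanically; the segment classes below are a crude illustrative stand-in for the real feature system:

```python
# Mechanical check of the cluster-voicing generalization in (12); the
# segment classes are a simplified stand-in for the real feature system.
VOICED = set("bdgvzj")
VOICELESS = set("ptkfsc")
OBSTRUENTS = VOICED | VOICELESS

def final_cluster_ok(word):
    a, b = word[-2], word[-1]
    if a in OBSTRUENTS and b in OBSTRUENTS:
        return (a in VOICED) == (b in VOICED)  # one voicing value per cluster
    return True      # the restriction covers only obstruent clusters

print(all(final_cluster_ok(w)
          for w in ["fiks", "akt", "laps", "apt", "blits", "post"]))  # True
print(final_cluster_ok("bakz"))  # False: hypothetical mixed *[kz]
```

The `return True` branch reflects the point made next in the text: after vowels, liquids, and nasals a voicing contrast is freely permitted, so the constraint says nothing there.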
Entirely absent are words ending in a cluster with mixed voicing: [zt], [gs], [kz], etc.11 Notice that after vowels, liquids, and nasals (non-obstruents) a voicing contrast is permitted:

(13) a. lens - fence  [nz] - [ns]
     b. furze - force [rz] - [rs]
     c. wild - wilt   [ld] - [lt]
     d. bulb - help   [lb] - [lp]
     e. goad - goat   [od] - [ot]
     f. niece - sneeze [is] - [iz]

If we are to achieve uniformity in the treatment of consonant-cluster voicing, we must not spread it out over 10 or so distinct morphological form generators (i.e., 10 different networks), and then repeat it once again in the phonetic component that applies to unanalyzable words. Otherwise, we
11In noncomplex words, obstruent clusters are overwhelmingly voiceless; [dz], in the word adze, pretty much stands alone.
would have no explanation for why English contains, and why generation after generation of children so easily learn, the exact same pattern eleven or so different times. Eleven unrelated sets of cluster patternings would be just as likely.

The shape of English morphemes can be given their due in a rule system. Suppose the phonetic content of the past tense marker is just /d/ and that of the diverse morphemes in (9) is /z/. There is a set of morphological rules that say how morphemes are assembled into words: for example, Verb-past = stem + /d/; Noun-pl = stem + /z/; Verb-3p-sg = stem + /z/. Given this, we can invoke a single rule to derive the occurrences of [t] and [s]:

(14) Voicing Assimilation. Voicelessness spreads from one obstruent to the next in word-final position.12

e. rip + /d/ → [rɪpt]
f. tow + /d/ → no change

The crucial effect of the rule is to devoice /d/ and /z/ after voiceless obstruents; after voiced obstruents its effect is vacuous, and after nonobstruents (vowels, liquids, nasals) it doesn't apply at all, allowing the basic values to emerge unaltered.13 Positing /t/ and /s/ as basic would instead require a special voicing rule, particular to these suffixes, to handle the case of words ending in vowels, liquids, nasals. For example, pea + /s/ would have to go to pea + [z], though even this pattern of voicing is not generally required in the language: cf. the morphologically simplex word peace. Positing /d/ and /z/ as basic, on the other hand, allows the rule (14), which is already part of English, to derive the suffixal voicing pattern without further ado.

The environment of the variant with the reduced vowel [ɨ] is similarly constrained: the suffixes /d/ and /z/ take the vowel just when the stem-final consonant is too similar to
them.14 English has very strong general restrictions against the clustering of identical or highly similar consonants. These are not mere conventions deriving from vocabulary statistics, but real limitations on what native speakers of English have learned to pronounce. (Such sequences are allowed in other languages.) Consequently, forms like [skɪdd] from skid + /d/ or [ǰʌǰz] from judge + /z/ are quite impossible. To salvage them, a vowel comes in to separate the ending from a too-similar stem-final consonant. We can informally state the rule as (16):

(16) Vowel Insertion. Word-finally, separate with the vowel [ɨ] adjacent consonants that are too similar in place and manner of articulation, as defined by the canons of English word phonology.

The two phonological rules have a competitive interaction. Words like passes [pæsɨz] and pitted [pɪtɨd] show that Vowel Insertion will always prevent Voicing Assimilation: from pass + /z/ and pit + /d/ we never get [pæsɨs] or [pɪtɨt], with assimilation to the voiceless final consonant. Various lines of explanation might be pursued; we tentatively suggest that the outcome of the competition follows from the rather different character of the two rules. Voicing Assimilation is highly phonetic in character, and might well be part of the system that implements phonological representations rather than part of the phonology proper, where representations are defined, constructed, and changed. If Vowel Insertion, as seems likely, actually changes the representation prior to implementation, then it is truly phonological in character. Assuming the componential organization of the whole system portrayed above, with a flow between components in the direction Morphology → Phonology → Phonetics, the pieces of the system fall naturally into place. Morphology provides the basic structure of stem + suffix.
Phonology makes various representational adjustments, including Vowel Insertion, and Phonetics then implements the representations. In this scheme, Voicing Assimilation, sitting in the phonetic component, never sees the suffix as adjacent to a too-similar stem-final consonant.
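The division of labor just described (suffixation in the morphology, Vowel Insertion in the phonology, Voicing Assimilation in the phonetics) can be sketched procedurally. The sketch below is our own illustration, not the paper's formalism: segments are single ASCII letters, and the feature classes are crude stand-ins.

```python
# A sketch of the component flow described in the text:
# Morphology -> Phonology (Vowel Insertion) -> Phonetics (Voicing Assimilation).
# Segment coding and feature classes are simplifying assumptions of this
# illustration, not the paper's.

VOICELESS = set("ptkfs")     # stand-in for the voiceless obstruents
ALVEOLAR_STOPS = set("td")   # too similar to the suffix /d/
SIBILANTS = set("sz")        # too similar to the suffix /z/

def morphology(stem, suffix):
    """Suffixation: the rule mentions only the variable 'stem'."""
    return stem + suffix

def vowel_insertion(word):
    """Phonology: separate excessively similar adjacent consonants with i."""
    stem_final, suffix = word[-2], word[-1]
    if (suffix == "d" and stem_final in ALVEOLAR_STOPS) or \
       (suffix == "z" and stem_final in SIBILANTS):
        return word[:-1] + "i" + suffix
    return word

def voicing_assimilation(word):
    """Phonetics: devoice word-final /d/ or /z/ after a voiceless consonant."""
    if word[-1] in "dz" and word[-2] in VOICELESS:
        return word[:-1] + {"d": "t", "z": "s"}[word[-1]]
    return word

def derive(stem, suffix):
    return voicing_assimilation(vowel_insertion(morphology(stem, suffix)))

# Because Vowel Insertion applies first, it bleeds Voicing Assimilation:
# pit + d -> "pitid" (never *pitit), pas + z -> "pasiz" (never *pasis),
# while rip + d -> "ript" and tow + d -> "towd" come out as expected.
```

If Voicing Assimilation were instead allowed to see the raw /td/ cluster first, the suffix would surface devoiced, contrary to fact; ordering the insertion in an earlier component captures the text's observation that insertion always wins.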
Whatever the ultimate fate of the details of the competition, it is abundantly clear that the English system turns on a fundamental distinction between phonology and morphology. Essential phonological and phonetic processes are entirely insensitive to the specifics of morphological composition and sweep across categories with no regard for their semantic or syntactic content. Such processes define equivalences at one level over items that are distinct at the level of phonetics: for English suffixes, t = d = ɨd and s = z
14This is of course a phonological restriction, not an orthographic one. The words petty and pity, for example, have identical consonantal phonology.
= ɨz. As a consequence, the learner infers that there is one suffix for the regular past, not three; one suffix for the plural, not three; and one for each of the 3rd person singular, the possessive, and so on. The phonetic differences emerge automatically; as would be expected in such cases, uninstructed native speakers typically have no awareness of them.
Rumelhart and McClelland's pattern associator is hobbled by a doctrine we might dub morphological localism: the assumption that there is for each morphological category an encapsulated system that handles every detail of its phonetics. This they mischaracterize as a theoretically desirable uniformity. In fact, localism destroys morphological uniformity by preventing generalization across categories and by excluding inference based on larger-scale regularities. Thus it is inconsistent with the fact that the languages that people learn are shaped by such generalizations and inferences.
The shape of the system. It is instructive to note that although the various English morphemes discussed earlier all participate in the general phonological patterns of the language, like the past tense they can also display their own particularities and subpatterns. The 3rd person singular is extremely regular, with a few lexical irregularities (is, says, has, does) and a lexical class (modal auxiliaries) that can't be inflected (can, will, etc.). The plural has a minuscule number of non-/z/ forms (oxen, children, geese, mice, ...), a 0-suffixing class (sheep, deer), and a fricative-voicing subclass (leaf-leaves, wreath-wreathes). The possessive admits no lexical peculiarities (outside of the pronouns), presumably because it adds to phrases rather than lexical items, but it is lost after plural /z/ (men's vs. dogs') and sporadically after other z's. The fully reduced forms of is and has admit no lexical or morphological peculiarities either, presumably because the reduction process is phrasal rather than lexical.
1. [...]
2. All nonsyllabic regular suffixes are formed from the phonetic substance /d/ or /z/; that is, they must be the same up to the one feature distinguishing them.
3. All morphemes are liable to re-shaping by phonology and phonetics.
4. Categories, inasmuch as they are lexical, can support specific lexical peculiarities and subpatterns; inasmuch as they are nonlexical, they must be entirely regular.

Properties (1) and (2) are clearly English-bound generalizations, to be learned by the native speaker. Properties (3) and (4) are replicated from language to language and should therefore be referred to the general capacities of the learner rather than to the accidents of English. Notice that we have lived up to our promise to show that the rules governing the regular past tense are not idiosyncratic to it: even beyond the phonology discussed above, its intrinsic phonetic content is shared up to one feature with the other regular nonsyllabic suffixes, and the rule of inflectional suffixation is itself shared generally across categories. We have found a highly modular system, in which the mapping from the uninflected stem to the phonetic representation of the past tense form breaks down into a cascade of independent rule systems, and each rule system treats its inputs identically regardless of how they were created. It is a nontrivial problem to design a device that arrives at this characterization on its own. An unanalyzed single module like the RM pattern associator that maps from features to features cannot do so.
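The cross-category economy described above can be made concrete. In this sketch (our own, with an invented category table and ASCII segments), four morphological categories share two suffix contents and a single adjustment rule that never looks at which morpheme created its input:

```python
# Several morphological categories, one phonetic content apiece (/d/ or /z/),
# and one shared adjustment; the category table and segment coding are
# invented for illustration, not taken from the paper.

SUFFIX = {"past": "d", "plural": "z", "3sg": "z", "possessive": "z"}
VOICELESS = set("ptkfs")

def adjust(word):
    """Devoice a word-final d/z after a voiceless segment; otherwise do
    nothing. The rule treats its input identically regardless of origin."""
    if word[-1] in "dz" and word[-2] in VOICELESS:
        word = word[:-1] + {"d": "t", "z": "s"}[word[-1]]
    return word

def inflect(stem, category):
    return adjust(stem + SUFFIX[category])

# One voicing pattern serves every category, so the learner acquires it
# once, not once per morpheme:
# inflect("walk", "past") -> "walkt"; inflect("book", "plural") -> "books";
# inflect("dog", "plural") -> "dogz"
```

The point of the sketch is architectural: the devoicing logic lives in one place, downstream of every morphological category, which is exactly the modularity the localist design forfeits.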
The suffixal variants t and d are matched with ɨd, not with oz or og or any conceivable other but phonetically distant form. Similarly, the morphemes which show s/z variants take ɨz in the appropriate circumstances, not ɨd or od or gu. This follows directly from our hypothesis that the morphemes in question have just one basic phonetic content, /d/ or /z/, which is subject to minor contextual adjustments. The RM model, however, cannot grasp this generalization. To see this, consider the Wickelphone map involved in the ɨd case, using the verb melt as an example:

b. lt# → ltɨ, tɨd, ɨd#

Nothing ties the map to ɨd rather than, say, ɨg#. Thus the RM model cannot explain the prevalence across languages of morphemes with a single underlying phonetic content.

The generalizations that the RM model extracts consist of specific correlations between particular phone sequences in the stem and particular phone sequences in the inflected form. Since the model has no way of referring to a stem per se, independent of the particular phone sequences that happen to have exemplified the majority of stems in the model's history, it cannot make any generalization that refers to stems per se, cutting across their individual phonetic contents. Thus a morphological process like reduplication, in which in many languages an entire stem copies (yielding forms roughly analogous to dum-dum and boom-boom), cannot be acquired in its fully general form by the network. In many cases it can memorize particular patterns of reduplication, consisting of mappings between particular feature sequences and their reduplicated counterparts (though even here problems can arise because of the poverty of the Wickelfeature representation, as we pointed out in discussing Wickelphonology), but the concept Copy the stem is itself unlearnable; there is no unitary representation of a thing to be copied and no operation consisting of copying a variable regardless of its specific content. Thus when a new stem comes in that does not share many features with the ones encountered previously, it will not match any stored patterns and reduplication will not apply to it.15

15It is worth noting that reduplication, which always calls on a variable (if not the stem, then one syllable or foot), is one of the most commonly used strategies of word-formation. In one form or another, it is found in hundreds, probably thousands, of the world's languages. For detailed analysis, see McCarthy and Prince (forthcoming).
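The contrast between an operation over a variable and a memorized feature map can be put in one line of code. The example stems below are invented; the point is that a copy operation never consults stored patterns, so a maximally novel stem reduplicates just as well as a familiar one:

```python
# Total reduplication stated over a variable: copy the stem, whatever it is.
# The example stems are invented for illustration.

def reduplicate(stem):
    """Stem -> Stem + Stem, with no reference to the stem's features."""
    return stem + stem

print(reduplicate("dum"))    # -> dumdum
print(reduplicate("boom"))   # -> boomboom
print(reduplicate("khlyv"))  # a novel stem works identically
```

A pattern associator must instead approximate this mapping feature by feature, which is why its behavior degrades on stems unlike its training set.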
The point strikes closer to home as well. The English regular past tense rule adds an affix to a stem. The rule doesn't care about the contents of the stem; it mentions a variable, stem, that is cashed in for information stored independently in particular lexical entries. Thus the rule, once learned, can apply across the board, independent of the set of stems encountered in the learner's history. The RM model, on the other hand, learns the past tense alternation by directly linking phonetic features of the inflected forms to particular features of the stem (for example, in pat → patted the Wickelfeatures for the affix ɨd# are linked directly to the entire set of features for pat: #pæ, pæt, etc.). Though much of the activation for the affix features is eventually contributed by stem features that cut across many individual stems, such as those at the end of a word, not all of it is; some contribution from the word-specific stem features that are well represented in the input sample can play a role as well. Thus the RM model could fail to generate any past tense form for a new stem if the stem did not share enough features with those stems that were encountered in the past and that thus grew their own strong links with past tense features. When we examine the performance of the RM model, we will see how some of its failures can probably be attributed to this.
If Rumelhart and McClelland are right, there can be no homophony between regular and irregular verbs or between distinct items in irregular classes, because words are nothing but phone sequences, and irregular forms are tied directly to these sequences. This basic empirical claim is transparently false. Within the strong class itself, there is a contrast between ring (past: rang) and wring (past: wrung), which are only orthographically distinct. Looking at the broader population, we find the same string shared by the items lie (past: lied) 'prevaricate' and lie (past: lay) 'assume a recumbent position'. In many dialects, regular hang (hanged) refers to a form of execution, while strong hang (hung) means merely 'suspend'. One verb fit is regular, meaning 'adjust'; the other, which is strong, is shown in (18):

(18) a. That shirt never fit/?fitted me.
b. The tailor fitted/*fit me with a shirt.
The sequence [kʌm] belongs to the strong system when it spells the morpheme come, not otherwise: contrast become, overcome with succumb, encumber. An excellent source of counterexamples to the claim that past tense formation collapses the distinctions between words and their featural decomposition is supplied by verbs derived from other categories (like nouns or adjectives). The significance of these examples, which were first noticed in Mencken (1936), has been explored in Kiparsky (1982a, b).16
(19) a. He braked the car suddenly. *broke
b. He flied out to center field. *flew
c. He ringed the city with artillery. *rang
d. Martina 2-setted Chris. *2-set
e. He subletted/sublet the apartment.
f. He sleighed down the hill. *slew
g. He de-flea'd his dog. *de-fled
h. He spitted the pig. *spat
i. He righted the boat. *rote
j. He high-sticked the goalie. *high-stuck
k. He grandstanded to the crowd. *grandstood
This phenomenon becomes intelligible if we assume that irregularity is a property of verb roots. Nouns and adjectives by their very nature do not classify as irregular (or regular) with respect to the past tense, a purely verbal notion. Making a noun into a verb, which is done quite freely in English, cannot produce a new verb root, just a new verb. Such verbs can receive no special treatment and are inflected in accord with the regular system, regardless of any phonetic resemblance to strong roots.

In some cases, there is a circuitous path of derivation: V → N → V. But the end product, having passed through nounhood, must be regular no matter what the status of the original source verb. (By "derivation" we refer to relations intuitively grasped by the native speaker, not to historical etymology.) The baseball verb to fly out, meaning 'make an out by hitting a fly ball that gets caught', is derived from the baseball noun fly (ball), meaning 'ball hit on a conspicuously parabolic trajectory', which is in turn related to the simple strong verb fly 'proceed through the air'. Everyone says "he flied out"; no mere mortal has yet been observed to have "flown out" to left field.
16Examples (19b) and (19h) are from Kiparsky.
A verb like to grandstand may still be related by speakers to the homophonous strong verb, but once it has been made from a noun, its verbal irregularity cannot be resurrected: *he grandstood. A derived noun cannot retain any verbal properties of its base, like irregular tense formation, because nouns in general can't have properties such as tense. Thus it is not simply that derivation erases idiosyncrasy, but departure from the verb class: stand retains its verbal integrity in the verbs withstand, understand, as does throw in the verbs overthrow, underthrow.17 Kiparsky (1982a, b) has pointed out that regularization-by-derivation is quite general and shows up wherever irregularity is to be found. In nouns, for example, we have the Toronto Maple Leafs, not *Leaves, because use in a name strips the morpheme of its original content. Similar patterns of regularization are observed elsewhere.

One might be tempted to try to explain these phenomena in terms of the meanings of regular and irregular versions of a verb. For example, Lakoff (1987) appeals to the distinction between the central and extended senses of polysemous words, and claims that irregularity attaches only to the central sense of an item. It is a remarkable fact indeed, an insult to any naive idea that linguistic form is driven by meaning, that polysemy is irrelevant to the regularization phenomenon. Lakoff's proposed generalization is not sound. Consider these examples:

(20) a. He wetted his pants. (wet regular in central sense)
b. He wet his pants. (wet irregular in extended sense)
(21) a. They heaved the bottle overboard.
b. They hove to.

It appears that a low-frequency irregular can occasionally become locked into a highly specific sense or use, regardless of whether the sense involved is central or extended. Thus the purely semantic or metaphorical aspect of sense extension has no predictive power whatsoever. Verbs like come, do, have, go, in combination with particles (up, in, out, off), [...] yet they march in lockstep
17When the verb is the head of the word, it passes its categorial features on to the whole word, including both verb-ness and more specialized morphological properties like irregularity (Williams, 1981). Deverbal nouns [N V] and denominal verbs [V N] must therefore be headless, whereas prefixed verbs are headed [PREF-V]. Notice that there can be uncertainty and dialect differences in the interpretation of individual cases. The verb sublet can be thought of as denominal, [V [N sublet]], giving subletted, or as a prefixed form headed by the verb to let, giving past tense sublet.
[...] than that every lexical, syntactic, and pragmatic fillip possess an accessible local identity [...] as distributed [...].

Assumption: "All past tenses are formed by direct phonetic modification of the stem." We have shown that the regular forms are derived through affixation followed by phonological and phonetic adjustment.

Assumption: "The inflectional class of any verb (regular, subregular, irregular) can be determined from its phonological representation alone." We have seen that membership in the strong classes depends on lexical and morphological information.

These results still leave open the question of the disparity between the regular and strong systems. To resolve it, we need a firmer understanding of how the strong system works. We will find that the strong system has a number of distinctive peculiarities which are related to its being a partly structured list of exceptions. We will examine five:

Phonetic similarity criteria on class membership.
Prototypicality structure of classes.
Lexical distinctness of stem and past tense forms.
Failure of predictability in verb categorization.
Lack of phonological motivation for the strong-class changes.
4.4.1. Hypersimilarity
The strong classes are often held together, if not exactly defined, by phonetic similarity. The most pervasive constraint is monosyllabism: 90% of the strong verbs are monosyllabic, and the rest are composed of a monosyllable combined with an unstressed and essentially meaningless prefix.19
19The polysyllabic strong verbs are: arise, awake, become, befall, beget, begin, behold, beset, beshit, bespeak, forbear, forbid, forget, forgive, forgo, forsake, forswear, foretell, mistake, partake (continued)
Language and connectionism
Within the various classes there are often significant additional resemblances holding between the members. Consider the following sample of typical classes, arranged by pattern of change in past and past participle. The citation conventions are as follows: + n means that the verb has -n in its past participle; ?Verb means that the past of Verb is somewhat less natural than usual; ??Verb means that Verb is archaic or recherché-sounding in the past tense.

(22) Some strong verb types
a. x - [u] - x(o) + n: blow, grow, know, throw; draw, withdraw; fly; ??slay
b. [e] - [ʊ] - [e] + n: take, mistake, forsake, shake
c. [ay] - [aw] - [aw]: bind, find, grind, wind
d. [d] - [t]: bend, build
e. [ɛ] - [ɔ] + n: swear, tear
f. [ɛ] - [ɑ]: get, forget, ?beget, ?tread

The members of these classes share much more than just a pattern of changes. In the blow-group (22a), for example, the stem vowel becomes [u] in the past; this change could in principle apply to all sorts of stems, but in fact the participating stems are all vowel-final, and all but know begin with a CC cluster. In the find-group (22c) the vowel change [ay] → [aw] could apply to any stem in [ay], but it only applies to a few ending in [nd]. The change of [d] to [t] in (22d) occurs only after sonorants [n, l], and mostly when the stem rhymes in -end. Rhyming is also important in (22b), where everything ends in -ake (and the base also begins with a coronal consonant), and in (22e), where -ear has a run.
Most of the multi-verb classes in the system are in fact organized around clusters of words that rhyme and share other structural similarities, which we will call hypersimilarities. (The interested reader is referred to the Appendix for a complete listing.) The regular system shows no signs of such organization. As we have seen, the regular morpheme can add onto any phonetic form, even those most heavily tied to the strong system, as long as the lexical item involved is not a primary verb root.

4.4.2. Prototypicality
The strong classes often have a kind of prototypicality structure. Along the phonetic dimension, Bybee and Slobin (1982) point out that class cohesion can involve somewhat disjunctive 'family resemblances' rather than satisfaction of a strict set of criteria. In the blow-class (22a), for example, the central exemplars are blow, grow, throw, all of the form [CRo], where R is a sonorant. The verb know [no] lacks the initial C in the modern language, but otherwise behaves like the exemplars. The stems draw [drɔ] and slay [sle] fit a slightly generalized pattern [CRV] and take the generalized pattern x - u - x in place of o - u - o. The verb fly [flay] has the diphthong [ay] for the vowel slot in [CRV], which is unsurprising in the context of English phonology, but unlike draw and slay it takes the concrete pattern of changes of the exemplars: x - u - o rather than x - u - x. Finally, all take -n in the past participle.

Another kind of prototypicality has to do with the degree to which strong forms allow regular variants. (This need not correlate with phonetic centrality: notice that all the words in the blow-class are quite secure in their irregular status.) Consider the class of verbs which add -t and lax the stem vowel:

(23) V̄ - V̆ - V̆ (+t)
keep, sleep, sweep, weep (?weeped/wept), creep (?creeped/crept), leap (leaped/leapt)
feel, deal (?dealed/dealt), kneel (kneeled/?knelt)
mean, dream (dreamed/?dreamt)
Notice the hypersimilarities uniting the class: the almost exclusive prevalence of the vowel [i]; the importance of the terminations [-ip] and [-il]. The parenthetical material contains coexisting variants of the past forms that, according to our judgments, are acceptable to varying degrees. The range of prototypicality runs from 'can only be strong' (keep) through 'may
be either' (leap) to 'may possibly be strong' (dream). The source of such variability is probably the low but nonzero frequency of the irregular form, often due to the existence of conflicting but equally high-status dialects (see Bybee, 1985). The regular system, on the other hand, does not have prototypical exemplars and does not have a gradient of variation of category membership defined by dimensions of similarity. For example, there appears to be no sense in which walked is a better or worse example of the past tense form of walk than genuflected is of genuflect. In the case at hand, there is no reason to assume that regular verbs such as peep, reap function as a particularly powerful attracting cluster, pulling weep, creep, leap away from irregularity. Historically, we can clearly see attraction in the opposite direction: according to the OED, knelt appears first in the 19th century; such regular verbs as heal, peel, peal, reel, seal, squeal failed to protect it. As regular forms they could not do so, on our account, because their phonetic similarity is not perceived as relevant to their choice of inflection, so they do not form an attracting cluster.
4.4.3. Lexicality
The behavior of low-frequency forms suggests that the stem and its strong past are actually regarded as distinct lexical items, while a regular stem and its inflected forms, no matter how rare, are regarded as expressions of a single item.

Consider the verb forgo: though uncommon, it retains a certain liveliness, particularly in the sarcastic phrase "forgo the pleasure of ...". The past tense must surely be forwent rather than *forgoed, but it seems entirely unusable. Contrast the following example, due to Jane Grimshaw:

(24) a. *Last night I forwent the pleasure of grading student papers.
b. You will excuse me if I forgo the pleasure of reading your paper until it's published.

Similarly but more subtly, we find a difference in naturalness between stem and past tense when the verbs bear and stand mean 'tolerate':

(25) a. I don't know how she bears it.
b. (?)I don't know how she bore it.
c. I don't know how she stands him.
d. (?)I don't know how she stood him.

The verb rend enjoys a marginal subsistence in the phrase rend the fabric of society, yet the past seems slightly odd: The Vietnam War rent the fabric of American society. The implication is that familiarity can accrue differentially to stem and past tense forms; the use of one in a given context does
not always carry over to the other. This effect is absent from the regular system. There are no regular items all of whose inflected forms are as odd as the listed form is rare, even for verbs trapped in one narrow idiom, like eke ('eke out') or crook. Furthermore, if irregular past tense items accrue familiarity because each is actually listed in the lexicon, while all the inflected forms of a regular item are generated from a single entry, then irregular forms should be able to part company with their stems in familiarity.
4.4.4. Failures of predictability
Even when a verb matches the characteristic patterns of one of the strong classes, no matter how closely, it cannot be predicted whether the verb will fall in with the subregularity. The verbs flow, glow, crow are as similar to the blow set as can be, yet they remain regular. (Indeed, as the verb crow [...], the change [...] in the last few years.) If a verb is strong, there is no way to predict which of the related subclasses it is associated with; each subcategorization can [...] to any of these similar sets. Consider the subclasses [ɪ] - [æ] - [ʌ] and [ɪ] - [ʌ] - [ʌ]:

(26) a. [ɪ] - [æ] - [ʌ]: sink, stink, ...
b. [ɪ] - [ʌ] - [ʌ]: spin, win; cling, sling, sting, string, swing; wring, fling (?flinged/flung), slink (slinked/?slunk); stick, dig; (hang)
The core members of these related classes end in -ing and -ink. (Bybee and Slobin note the family-resemblance structure here, whereby the hallmark velar nasal accommodates mere nasals on the one side (swim, etc.) and mere velars on the other (stick, dig); the stems run and hang differ from the
norm in a vowel feature or two as well.) Interestingly, no primitive English monosyllabic verb root that ends in -ing is regular. Forms like ding, ping, zing, which show no attraction to class (26), are tainted by onomatopoetic origins; forms like ring 'surround', king (as in checkers), and wing are obviously derived from nouns. Thus the -ing class of verbs is the closest we have in English to a class that can be uniformly, and possibly productively, inflected with anything other than the regular ending. Nevertheless, even for this subclass it is impossible to predict the actual forms from the fact of irregularity: ring-rang contrasts with wring-wrung, spring-sprang with string-strung, and bring belongs to an entirely unrelated class. This observation indicates that learners can pick up the general distinction regular/irregular at some remove from the particular patterns. The regular system, in contrast, offers complete predictability.

4.4.5. Lack of phonological motivation for morphological rules
The rules that determine the shape of the regular morphemes of English are examples of true phonological (or even phonetic) rules: they examine a narrow window of the string and make a small-scale change. Such rules have necessary and sufficient conditions which must be satisfied by elements present in the window under examination in order for the rule to apply. The conditioning factors are intrinsically connected with the change performed: voicelessness in the English suffixes directly reflects the voicelessness of the stem-final consonant; insertion of the vowel resolves the inadmissible adjacency of what English speakers regard as excessively similar consonants.

The relations between stem and past tense in the various strong verb classes are defined on phonological substance, but the factors affecting the relationship are not like those found in true phonological rules. In particular, the changes are for the most part entirely unmotivated by phonological conditions in the string. There is nothing in the environment [nd] that encourages [ay] to become [aw], nothing about [CRo], the basic scheme of the blow class, that causes a change to [CRu] or makes such a change more likely than in some other environment. These are arbitrary (though easily definable) changes, tied arbitrarily to certain canonical forms, in order to mark an abstract morphological category (past tense). The patterns of similarity binding the classes together actually play no causal role in determining the changes that occur. A powerful association may exist, but it is merely conventional and could quite easily be otherwise; and indeed, in the different dialects of the language, spoken now or in the past, there are many different systems. Similarity relations serve essentially to qualify entry into a strong class, rather than to provide an environment that causes a rule to happen.

There is one region of the strong system where discernibly phonological
factors do play a role: the treatment of stems ending in [-t] and [-d]. No strong verb takes the suffix ɨd (bled/*bledded, got/*gotted); the illicit cluster that would be created by suffixing /d/ is resolved instead by eliminating the suffix. This is a strategy that closely resembles the phonological process of degemination (simplification of identical adjacent consonants to a single consonant), which is active elsewhere in English. Nevertheless, if we examine the class of affected items, we see the same arbitrariness, prototypicality, and incomplete predictiveness we have found above. Consider the "no-change" class, which uses a single form for stem, past tense, and past participle; it is by far the largest single class of strong verbs, with about 25 members. In these examples, a word preceded by '?' has no natural-sounding past tense form in our dialect; words followed by two alternatives in parentheses have two possible forms, often with one of them (indicated by '?') worse-sounding than the other:

(27) No-change verbs
hit, slit, split, quit, spit (spit/spat), knit (knitted/?knit), ?shit, ??beshit
bid, rid
shed, spread, wed
let, set, upset, ?beset, wet (wetted/wet)
cut, shut
put
burst, cast, cost, thrust (thrusted/thrust), hurt

Although ending in [-t, d] is a necessary condition for no-change status, it is by no means sufficient. First of all, the general constraint of monosyllabism applies, even though it is irrelevant to degemination. Second, there is a strong favoritism for the vowels [ɪ] and [ɛ], followed by a single consonant; again, this is of no conceivable relevance to a truly phonological process simplifying [td] and [dd] to [t] and [d]. Absent from the class, and under no attraction to it, are such verbs as bat, chat, pat, scat, as well as jot, rot, spot, trot, with the wrong sort of vocalism; and dart, fart, smart, start, thwart, snort, sort, halt, pant, rant, want, with nonprototypical vowel and consonant structure.
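The "characterize rather than define" point lends itself to a small demonstration. The predicate below encodes the cited tendencies (final t/d, a favored lax vowel); the word lists come from the surrounding text, but the predicate itself is our own deliberately crude construction:

```python
# Necessary-but-not-sufficient: a template built from the no-change class's
# phonological tendencies admits the class members, yet also admits regular
# verbs, so it cannot define membership. Lists are from the text; the
# template is our own crude construction.

def fits_no_change_template(verb):
    return verb.endswith(("t", "d")) and any(v in verb for v in "ieu")

NO_CHANGE = {"hit", "slit", "split", "quit", "shed", "let", "set",
             "cut", "shut", "put", "hurt", "spread"}
REGULAR = {"flit", "twit", "knit", "fret", "whet", "butt", "jut",
           "strut", "blurt", "spurt", "bust"}

# Every no-change verb fits the template...
all_members_fit = all(fits_no_change_template(v) for v in NO_CHANGE)
# ...but so does every regular verb above, so the template over-admits:
overadmitted = {v for v in REGULAR if fits_no_change_template(v)}
```

No refinement of such a surface template separates cut from butt or hurt from blurt; membership must in the end be listed, which is the text's point.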
Even in the core class we find arbitrary exceptions: flit, twit, knit are all regular, as are fret, sweat, whet, and some uses of wet. Beside strong cut and shut, we find regular butt, jut, strut. Beside hurt we find blurt, spurt; beside burst, we find regular bust. The phonological constraints on the class far exceed anything relevant to degemination, but in the end they characterize rather than define the class, just as we have come to expect. Morphological classification responds to fairly large-scale measures on
a word: does it rhyme with an exemplar? Does it alliterate (begin with a similar consonant cluster) with an exemplar? Phonological rules look for different and much more local configurations, and the two vocabularies are kept distinct: we are not likely to find a morphological subclass holding together because its members each contain somewhere inside them a pair of adjacent obstruents; nor will we find a rule of voicing-spread that applies only in rhyming monosyllables. If an analytical engine is to generalize effectively over language data, it can ill afford to look upon morphological classification and phonological rules as processes of the same formal type.

4.4.6. Default structure
The regular system also has an internal default structure that is worthy of note, since it contrasts with the RM model's propensities. The Past Tense rule adds /d/ to the stem, and the behavior of regular verbs is then entirely predictable from the properties we suggested above: that the morpheme is a stop rather than a fricative. If the stem ends in [t] or [d], the suffix surfaces as [id]; otherwise, if the stem ends in a voiceless consonant, voicelessness propagates from the stem. Elsewhere (the default case) nothing happens. It appears that language learners are fond of such architectures, which appear repeatedly in languages. (Indeed, the history of English inflection heads in this direction.) Yet the RM network has no such structure: it could register with equal facility a mapping of any arbitrary kind ('suffix t if the word begins with b; prefix ik if the word ends in a vowel; change s to r before ta'; etc.). The RM model treats the regular class as a kind of fortuitously overpopulated subregularity; indeed, as three such classes, since the d-t-id alternation is treated on a par with the choice between strong subclasses. The extreme and categorical uniformity of the regular system disappears from sight, and with it the hope of identifying such uniformity as a benchmark of linguistic generalization.

4.4.7. Why are the regular and strong systems so different?

We have argued that the regular and strong systems have very different properties. The past tense forms of strong verbs must be memorized; the strong system consists of a set of subclasses held together by phonologically-unpredictable hypersimilarities which are neither necessary nor sufficient criteria for membership in the classes. The subregularities fall well short of the productivity of the regular system, supporting the view that the strong system is a cluster of irregular patterns, with perhaps only the ing forms and the no-change forms displaying some active life as partially generalizable subregularities in the adult language. The criteria that do matter are (1) monosyllabism; (2) nonderived verb root status; (3) for the subregularities, resemblance to key exemplars. This means that the system is largely closed, particularly because verb roots very rarely enter the language (new verbs are common enough, but are usually derived from nouns, adjectives, or onomatopoetic expressions). At a few points in history, there have been borrowed items that have met all the criteria: quit and cost are both from French. The regular system contrasts on every point: the rule can apply to any word and is adjusted only by very general phonological regularities; a regular verb is under no constraint of monosyllabism or nonderived root status, and may, for example, be derived from an adjective. Phonetic similarity to an exemplar plays no role either.

Why are they so different? We think the answer comes from the common-sense distinction we have been assuming all along: the past tense forms of strong verbs must be memorized, whereas the past tense forms of regular verbs can be generated by rule. Thus the irregular forms are roughly where grammar leaves off and memory begins. Whatever general properties shape human memory will affect the strong class, but not the regular class, by a kind of Darwinian selection process, because only the easily-memorized strong forms will survive. The 10 most frequent verbs of English are strong, and it has long been noted that as the frequency of a strong form declines historically, the verb becomes more likely to regularize. The standard explanation is that you can only learn a strong past by hearing it, and only if you hear it often enough are you likely to remember it. However, it is important to note that the bulk of the strong verbs are of no more than middling frequency, and some of them are actually rare, raising the question of how they managed to endure. The hypersimilarities and graded membership structure of the strong class might provide an answer. Rosch and Mervis (1975) note that conceptual categories, such as vegetables or tools, tend to consist of members with family resemblances to one another along a set of dimensions, and graded membership determined by similarity; exemplars are easier to remember if they display a family resemblance structure than if they are grouped into categories arbitrarily. Since strong verbs, like Rosch and Mervis's artificial exemplars, must be memorized, the forms that survive, particularly in the middle and low frequencies, will be those displaying a family resemblance structure. In other words, the reason that strong verbs are either frequent or members of families is that strong verbs are memorized, and frequency and family resemblance assist memorization. The regular system must answer to an entirely different set of requirements: the rule must allow the user to compute the past tense form of any regular verb, and so must be generally applicable, predictable in its output, and so on. While it is possible that connectionist models of category formation (e.g. McClelland & Rumelhart, 1985) might offer insights into why family resemblance fosters category formation, it is the difference between fuzzy families of memorized exemplars and formal rules that the models leave unexplained.20 Rumelhart and McClelland's failure to distinguish between mnemonics and productive morphology leads to the lowest-common-denominator 'uniformity' of accomplishing all change through arbitrary Wickelfeature replacement, and thus vitiates the use of psychological principles to explain linguistic regularities.
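The default structure of the regular allomorphy described in Section 4.4.6 can be sketched as a small decision procedure. This is our illustrative gloss, not the authors' formalism; the ASCII phoneme spellings and the VOICELESS set are simplifying assumptions:

```python
# Sketch of the default structure discussed in Section 4.4.6: the past
# suffix is /d/, the special cases are checked first, and "elsewhere"
# nothing happens. Phonemes are crude ASCII approximations, and the
# VOICELESS set is a simplifying assumption, not the authors' feature system.

VOICELESS = {"p", "t", "k", "f", "s", "S", "C", "T", "h"}  # S=sh, C=ch, T=th

def past_suffix(stem_phonemes):
    """Choose among the allomorphs [id], [t], [d] for a regular stem."""
    last = stem_phonemes[-1]
    if last in ("t", "d"):     # suffixing /d/ would create an illicit cluster
        return "id"
    if last in VOICELESS:      # voicelessness propagates from the stem
        return "t"
    return "d"                 # elsewhere (the default case): nothing happens

print(past_suffix(list("pat")))        # pat/patted  -> 'id'
print(past_suffix(list("wok")))        # walk/walked -> 't'
print(past_suffix(["h", "^", "g"]))    # hug/hugged  -> 'd'
```

The ordering of the tests carries the analysis: the marked cases are tried first, and the default case leaves the plain /d/ untouched.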
5. How good is the model's performance?

The bottom-line and most easily grasped claim of the RM model is that it succeeds at its assigned task: producing the correct past tense form. Rumelhart and McClelland are admirably open with their test data, so we can evaluate the model's achievement quite directly. Rumelhart and McClelland submitted 72 new regular verbs to the trained model and submitted each of the resulting activated Wickelfeature vectors to the unconstrained whole-string binding network to obtain the analog of freely-generated responses. The model does not really 'decide' on a unique past tense form and stick with it thereafter; several candidates get strength values assigned to them, and Rumelhart and McClelland interpret those strength values as being related roughly monotonically to the likelihood that the model would output those candidates. Since there is noise in some of the processes that contribute to strength values, they chose a threshold value (.2 on the 0-1 scale), and if a word surpassed that criterion, it was construed as being one of the model's guesses for the past tense form for a given stem. By this criterion, 24 of the 72 probe stems resulted in a strong tendency to incorrect responses: 33% of the sample. Of these, 6 (jump, pump, soak,

20 See Armstrong, Gleitman, and Gleitman (1983) for an analogous argument applied to conceptual categories.
warm, trail, glare) had no response at threshold. Though it is hard to reconstruct the reasons for this, two facts are worth noting. First, these verbs have no special resemblance to the apparently quasi-productive strong verb types, the factor that affects human responses. Second, the no-response verbs tend to cluster in phonetic similarity space either with one another (jump, pump) or with other verbs that the model erred on, discussed below (soak/smoke; trail/mail; glare/tour). This suggests that the reason for the model's muteness is that it failed to learn the relevant transformations, i.e., to generalize appropriately about the regular past. Apparently the steps taken to prevent the model from bogging down in insufficiently general case-by-case output units during learning did not work well enough. But it also reveals one of the inherent deficits of the model we have alluded to: there is no such thing as a variable for any stem, regardless of its phonetic composition, and hence no way for the model to attain the knowledge that you can add /d/ to a stem to get its past. Rather, all the knowledge of the model consists of trained responses to the concrete features in the training set. If the new verbs happen not to share enough of these features with the words in the training set, or happen to possess features with which competing and mutually incompatible outputs had been associated, the model can fail to output any response significantly stronger than the background noise. The regular rule in symbolic accounts, in contrast, doesn't care what's in the word or how often its contents were previously submitted for training; the concept of a stem itself is sufficient. We return to this point below.
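The response criterion described at the beginning of this section (any candidate whose strength exceeds .2 on the 0-1 scale counts as one of the model's guesses) can be stated compactly. The candidate strengths below are invented purely for illustration:

```python
# Sketch of the RM model's response criterion as described above: every
# candidate past tense form carries a strength on a 0-1 scale, and any
# candidate above the .2 threshold counts as one of the model's guesses.
# The strength values for "soak" are invented for illustration.

THRESHOLD = 0.2

def guesses(candidates):
    """Return candidate forms whose strength surpasses the threshold,
    strongest first."""
    ranked = sorted(candidates.items(), key=lambda kv: -kv[1])
    return [form for form, strength in ranked if strength > THRESHOLD]

print(guesses({"soaked": 0.31, "soak": 0.24, "soke": 0.07}))
# -> ['soaked', 'soak']  (two forms surpass the criterion)
```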
(29) shape - shipt; sip - sept; slip - slept; mate - maded; hug - hug; smoke - smoke; brown - brawned; mail - membled

(30) a. type - typeded
b. step - steppeded
c. snap - snappeded
d. map - mappeded
e. drip - drippeded
f. carp - carpeded
g. smoke - smokeded
Well before it has mastered the richly exemplified regular rule, the pattern associator appears to have gained considerable confidence in certain incorrectly-grasped, sparsely exemplified patterns of feature-change among the vowels. This implies that a major "induction problem", latching onto the productive patterns and bypassing the spurious ones, is not being solved successfully. In sum, for 14 of the 18 stems yielding incorrect forms, the forms were quite removed from the confusions we might expect people to make. Taking these with the 6 no-shows, we have 20 out of the 72 test stems resulting in seriously wrong forms, a 28% failure rate. This is the state of the model after it has been trained 190-200 times on each item in a vocabulary of 336 regular verbs. What we have here is not a model of the mature system.

6. On some common objections to arguments based on linguistic evidence

We have found that many psychologists and computer scientists feel uncomfortable about evidence of the sort we have discussed so far, concerning the ability of a model to attain the complex organization of a linguistic system in its mature state, and attempt to dismiss it for a variety of reasons. We consider the evidence crucial and decisive, and in this section we reproduce some of the objections we have heard and show why they are groundless.

"Those philosophical arguments are interesting, but it's really the empirical data that are important." All of the evidence we have discussed is empirical.
It is entirely conceivable that people could go around saying Whatz the answer? or He high-stuck the goalie or The canary pept or I don't know how she bore or Yesterday we chat for an hour, or that upon hearing such sentences, people could perceive them as sounding perfectly normal. In every case it is an empirical datum about the human brain that they don't. Any theory of the psychology of language must account for such data.

"Rule-governed behaviors indeed exist, but they are the products of schooling; rules characterize only the conscious, reflective, problem-solving mode of thought that is distinct from the intuitive processes that PDP models account for (see, for example, Smolensky, in press)." This is completely wrong. The rule adding /d/ to a stem to form the past is not generally taught in school (it doesn't have to be!) except possibly as a rule of spelling, which if anything obscures its nature: for one thing, the plural morpheme, which is virtually identical to the past morpheme in its phonological behavior, is spelled differently (s versus ed). Moreover, the principles we have appealed to, such as the distinction between morphology and phonology, the role of roots in morphology, preservation of stem and affix identity, phonological processes that are oblivious to morphological origin, disjoint conditions for the application of morphological and phonological changes, distinct past tenses for homophones, interactions between the strong and regular systems, and so on, are consciously inaccessible and not to be found in descriptive grammars or language curricula. Many have only recently been adequately characterized.

The irregular system, we have noted, is closely tied to memory as well as to language, so it turns out that people often have some metalinguistic awareness of its patterns, especially since competing regular and irregular past forms have sometimes been treated in a prescriptive manner. For example, H.L. Mencken (1936) noted that people started to use the forms broadcasted and joy-rided in the 1920s (without consciously knowing it, they were adhering to the principle that irregularity is a property of verb roots, hence verbs formed from nouns are regular). The prescriptive guardians of the language made a fruitless attempt to instruct people explicitly. One of the phenomena that the RM model is good at handling is unsystematic analogy formation based on its input history with subregular forms (as opposed to the automatic application of the regular rule where linguistically mandated), and such analogizing is tied to conscious reflection. Thus people, when in a reflective, conscious, problem-solving mode, will seem to act more like the RM model: the overapplication of subregularities that the model is prone to can be seen in modes of language use that bear all the hallmarks of self-conscious speech, such as jocularity (e.g. spaghettus, I got schrod at Legal Seafood, The bear shat in the woods), explicit instruction within a community of specialists (e.g. VAXen as the plural of VAX), and hypercorrection such as the anti-broadcasted campaign documented by Mencken (similarly, we found that some of our informants offered Hurst no-hitted the Blue Jays as their first guess as to the relevant past form but withdrew it in favor of no-hit, which they "conceded" was "more proper").

"We academics speak in complex ways, but if you were to go down to [name of nearest working-class neighborhood] you'd find that people talk very differently." If anything is universal about language, it is probably people's tendency to denigrate the dialects of other ethnic or socioeconomic groups. One
would hope that this prejudice is not taken seriously as a scientific argument ; it has no basis in fact . The set of verbs that are irregular varies according to regional and socioeconomic dialect (see Mencken , 1936, for extensive lists) , as does the character of the subregular patterns , but the principles organizing
the system as a whole show no variation across classes or groups .
" Grammars may characterize some aspects of the ideal behavior of adults, but connectionist models are more consistent with the sloppiness found in
situation when he or she needs to produce the past tense form of soak or glare , none vacillates between kid and kidded , none produces membled for mailed or toureder for toured . Although children equivocate in experimental tasks eliciting inflected nonce forms , these tasks are notorious for the degree to which they underestimate competence with the relevant phenomena (Levy , 1983; Maratsos et al ., 1987; Pinker , Lebeaux , & Frost , 1987)- not to mention the fact that children do not remain children forever . The crucial point is that adults can speak without error and can realize that their errors are errors (by which we mean , needless to say, from the standpoint of the untaxed operation of their own system, not of a normative standard dialect ) . And children 's learning culminates in adult knowledge . These are facts that any theory must
account for .
7. The RM model and the facts of children's development

Rumelhart and McClelland stress that their model's ability to explain the developmental sequence of children's mastery of the past tense is the key point in favor of their model over traditional accounts. In particular, these facts are the "fine structure of the phenomena of language use and language acquisition" that their model is said to provide an exact account of, as opposed to the traditional explanations, which "leave out a great deal of detail", describing the phenomena only "approximately". One immediate problem in assessing this claim is that there is no equally explicit model incorporating rules against which we can compare the RM model. Linguistic theories make no commitment as to how rules increase or decrease in relative strength during acquisition; this would have to be supplied by a learning mechanism that meshed with the assumptions about the
representation of the rules . And theories discussed in the traditional literature
of developmental psycholinguistics are far too vague and informal to yield the kinds of predictions that the RM model makes. There do exist explicit models of the acquisition of inflection , such as that outlined by Pinker ( 1984) , but they tend to be complementary in scope to the RM model ; the Pinker model , for example , attempts to account for how the child realizes that one word is the past tense version of another , and which of two competing past tense candidates is to be retained , which in the RM model is handled by the " teacher " or not at all , and relegates to a black box the process of abstracting the morphological and phonological changes relating past forms and stems, which is what the RM model is designed to learn . The precision of the RM theory is surely a point in its favor , but it is still difficult to evaluate , for it is not obvious what features of the model give it
its empirical successes. More important, it is not clear whether such features are consequences of the model's PDP architecture or simply attributes of fleshed-out processes that would function in the same way in any equally explicit model of the acquisition process. In most cases Rumelhart and McClelland do not apportion credit or blame for the model's behavior to specific aspects of its operation; the model's output is compared against the data rather globally. In other cases the intelligence of the model is so distributed and its output mechanisms are so interactive that it is difficult for anyone to know what aspect of the model makes it successful. And in general, Rumelhart and McClelland do not present critical tests between competing hypotheses embodying minimally different assumptions, only descriptions of goodness of fit between their model and the data. In this section, we unpack the assumptions of the model, and show which ones are doing the work in accounting for the developmental facts, and whether the developmental facts are accounted for to begin with.

7.1. Unique and shared properties of networks and rule systems

Among the RM model's many properties, there are two that are crucial to its accounts of developmental phenomena. First, it has a learning mechanism that makes it type-frequency sensitive: the more verbs it encounters that embody a given type of morphophonological change, the stronger are its graded representations of that morphophonological change, and the greater is the tendency of the model to generalize that change to new input verbs. Furthermore, the different past tense versions of a word that would result from applying various regularities to it are computed in parallel, and there is a competition among them for expression, whose outcome is determined mainly by the strength of the regularity and the goodness of the match between the regularity and the input. (In fact the outcome can also be a blend of competing responses, but the issue of response blending is complex enough for us to defer discussing it to a later section.)

It is crucial to realize that neither frequency-sensitivity nor competition is unique to PDP models. Internal representations that have graded strength values associated with them are probably as old as theories of learning in psychology; in particular, it is commonplace to have greater strength values assigned to representations that are more frequently exemplified in the input during learning, so that the strength of a representation basically corresponds to the degree of confidence in the hypothesis represented. Competition among candidate operations that partially match the input is also a ubiquitous assumption among symbol-processing models in linguistics and cognitive psychology. Spreading-activation models and production systems, which are prototypical
symbol-processing models of cognition, are the clearest examples (see, e.g., Newell & Simon, 1972; Anderson, 1976, 1983; MacWhinney & Sokolov, 1987). To show how these assumptions are part and parcel of standard rule-processing models, we will outline a simplified module for certain aspects of past tense acquisition, which searches for the correct past tense rule or rules, keeping several candidates as possibilities before it is done. We do not mean to propose it as a serious theory, but only as a demonstration that many of the empirical successes of the RM model are the result of assumptions about frequency-sensitivity and competition among output candidates that are independent of parallel distributed processing in networks of simple units.

A simple illustrative module of a rule-based inflection acquisition theory, incorporating assumptions about frequency-sensitivity and competition

Acquiring inflectional systems poses a number of tricky induction problems, discussed at length in Pinker (1984). When a child hears an inflected verb in a single context, it is utterly ambiguous what morphological category the inflection is signaling (the gender, number, person, or some combination of those agreement features? for the subject? for the object? is it tense? aspect? modality? some combination of these?). Pinker (1984) suggested that the child solves this problem by "sampling" from the space of possible hypotheses defined by combinations of an innate finite set of elements, maintaining these hypotheses in the provisional grammar, and testing them against future uses of that inflection, expunging a hypothesis if it is counterexemplified by a future word. Eventually, all incorrect hypotheses about the category features encoded by that affix will be pruned, any correct one will be hypothesized, and only correct ones will survive.
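The hypothesize-and-expunge procedure just summarized can be sketched as follows; the feature names and observations are invented for illustration, not drawn from Pinker (1984):

```python
# Sketch of the hypothesize-and-expunge procedure summarized above: the
# learner maintains several candidate analyses of what an affix encodes
# and expunges any candidate counterexemplified by a later use of the
# affix. The feature names and observations are invented for illustration.

def prune(hypotheses, observations):
    """Keep only hypotheses consistent with every observed use of the affix."""
    return [hyp for hyp in hypotheses
            if all(all(obs.get(feat) == val for feat, val in hyp.items())
                   for obs in observations)]

# Candidate analyses of a hypothetical affix:
hypotheses = [{"tense": "past"},
              {"number": "plural"},
              {"tense": "past", "number": "plural"}]
# Two observed uses: both past tense, but differing in number, so any
# hypothesis mentioning number is counterexemplified and pruned.
observations = [{"tense": "past", "number": "plural"},
                {"tense": "past", "number": "singular"}]
print(prune(hypotheses, observations))   # [{'tense': 'past'}]
```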
The surviving features define the dimensions of a word-specific paradigm structure into whose cells the different inflected forms of a given verb are placed (for example, singular/plural or present/past/future). The system then seeks to form a productive general paradigm, that is, a set of rules for related inflections, by examining the patterns exhibited across the paradigms for the individual words. This poses a new induction problem because of the large number of possible generalizations consistent with the data, and it cannot be solved by examining a single word-specific paradigm or even a set of paradigms. For example, in examining sleep/slept, should one conclude that the regular rule of English laxes and lowers the vowel and adds a t? If so, does it do so for all stems, or only for those ending in a stop, or only those whose stem vowel is i? Or is this simply an isolated irregular form, to be recorded individually with no contribution to the regular rule system? There
is no way to solve the problem other than by trying out various hypotheses and seeing which ones survive when tested against the ever-growing vocabulary. Note that this induction problem is inherent to the task and cannot be escaped from using connectionist mechanisms or any other mechanisms; the RM model attempts to solve the problem in one way, by trying out a large number of hypotheses of a certain type in parallel. A symbolic model would solve the problem using a mechanism that can formulate, provisionally maintain, test, and selectively expunge hypotheses about rules of various degrees of generality. It is this hypothesis-formation mechanism that the simplified module embodies. The module is based on five assumptions:

1. Candidates for rules are hypothesized by comparing base and past tense versions of a word, and factoring apart the changing portion, which serves as the rule operation, from certain morphologically-relevant phonological components of the stem, which serve to define the class of stems over which the operation can apply.21 Specifically, let us assume that when the addition of material to the edge of a base form is noted, the added material is stored as an affix, and the provisional definition of the morphological class will consist of the features of the edge of the stem to which the affix is attached. When a vowel is noted to change, the change is recorded, and the applicable morphological class will be provisionally defined in terms of the features of the adjacent consonants. (In a more realistic model, global properties defining the "basic words" of a language, such as monosyllabicity in English, would also be extracted.)
2. If two rule candidates have been coined that have the same change operation, a single collapsed version is created, in which the phonological features distinguishing their class definitions are eliminated.
3. Rule candidates increase in strength each time they have been exemplified by an input pair.
4. When an input stem has to be processed by the system in its intermediate stages, the input is matched in parallel against all existing rule candidates, and if it falls into several classes, several past tense forms may be generated.
5. The outcome of a competition among the past tense forms is determined by the strength of the relevant rule and the proportion of a word's features that were matched by that rule.
21 More accurately, the changing portion is examined subsequent to the subtraction of any phonological and phonetic changes that have been independently acquired.
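Assumptions 4 and 5 can be sketched as a parallel match followed by a competition. The rules, strengths, and the particular weighting (strength times proportion of the word matched) are illustrative assumptions, not a claim about the exact scoring function:

```python
# Sketch of assumptions 4 and 5 above: an input is matched in parallel
# against every rule candidate, and the competition among the resulting
# past tense forms is decided by rule strength weighted by how much of
# the word the rule's class matched. The rules and strengths below are
# invented for illustration; allomorphy (d/t/id) is ignored.

def compete(stem, rules):
    """rules: list of (class_context, make_past, strength)."""
    scored = []
    for context, make_past, strength in rules:
        if stem.endswith(context):
            # proportion of the word's segments matched by the class;
            # +1 smooths the empty (default) context of the regular rule
            proportion = (len(context) + 1) / (len(stem) + 1)
            scored.append((strength * proportion, make_past(stem)))
    return max(scored)[1] if scored else None

rules = [
    ("",   lambda s: s + "d",       5.0),  # regular: add /d/ to any stem
    ("IN", lambda s: s[:-2] + "aN", 3.0),  # subregular: i -> a before [ng]
]

print(compete("rIN", rules))   # ring -> 'raN': the subregular rule matches more of the word
print(compete("tIp", rules))   # tip  -> 'tIpd': only the regular rule applies
```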
The model works as follows. Imagine its first input pair is speak/spoke. The changing portion is i ~ o. The provisional definition of the class to which such a rule would apply would be the features of the adjacent consonants, which we will abbreviate as p _ k. Thus the candidate rule coined is (32a), which can be glossed as "change i to o for the class of words containing the features of /p/ before the vowel i and the features of /k/ after the vowel". Of course, the candidate rule has such a specific class definition in the example that it is almost like listing the pair directly. Let us make the minimal assumption about strength and increase it by 1 every time a candidate is exemplified. Say the second input is get/got. The resulting rule candidate, with a strength of 1, is (32b). A regular input pair tip/tipped would yield (32c). Similarly, sing/sang would lead to (32d), and hit/hit would lead to (32e), each with unit strength.

(32) a. Change: i ~ o
        Class: p _ k
     b. Change: e ~ a
        Class: g _ t
     c. Suffix: t
        Class: p#
     d. Change: i ~ a
        Class: s _ ŋ
     e. Suffix: ∅; Change: ∅22
        Class: h _ t

Now we can examine the rule-collapsing process. A second regular input, walk/walked, would inspire the learner to coin the rule candidate (33a), which, because it shares the change operation of rule candidate (32c), would be collapsed with it to form a new rule (33b) of strength 2 (summing the strengths of its contributing rules, or, equivalently, the number of times it was exemplified).

22 Let us assume that it is unclear to the child at this point whether there is a null vowel change or a null affix, so both are stored. Actually, we don't think either is accurate, but it will do for the present example.
(33) a. Suffix: t
        Class: k#
     b. Suffix: t
        Class: C#

The context-collapsing operation has left the symbol "C" (for consonant) and its three phonological features as the common material in the definitions of the two previously distinct provisional classes. Now consider the results of a third regular input, pace/paced. First, a fairly word-specific rule (34a) would be coined; then it would be collapsed with the existing rule (33) with which it shares a change operation, yielding a rule (34b) with strength 3.

(34) a. Suffix: t
        Class: s#
     b. Suffix: t
        Class: [-voiced]

Rule candidates based on subregularities would also benefit from the increases in strength that would result from the multiple input types exemplifying them. For example, when the pair ring/rang is processed, it would contribute (35a), which would then be collapsed with (32d) to form (35b). Similar collapsing would strengthen other subregularities as tentative rule candidates, such as the null affix.

(35) a. Change: i ~ a
        Class: r _ ŋ
     b. Change: i ~ a
        Class: C _ ŋ
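The coining and collapsing steps worked through in (32)-(34) can be simulated in miniature. In this sketch, classes are reduced to literal final segments and the feature-level collapse (e.g. p# and k# to a shared consonant class) is caricatured as a single merge to "C#"; the pseudo-phonological ASCII spellings are our own:

```python
# Toy simulation of rule coining (assumption 1), collapsing (assumption 2),
# and strength accumulation (assumption 3). Stems are written in a crude
# pseudo-phonological ASCII ('tIp' for tip, 'wok' for walk, 'pes' for pace);
# real feature decomposition is caricatured: two different class contexts
# collapse to the generic consonant context 'C#'.

def coin(stem, past):
    """Factor a stem/past pair into a (change-operation, class) candidate."""
    if past.startswith(stem):                       # edge material added -> affix
        return ("suffix " + past[len(stem):], stem[-1] + "#")
    return ("change " + stem + "->" + past, "?")    # vowel change (left schematic)

def learn(pairs):
    """Collapse candidates sharing a change operation; strength counts inputs."""
    rules = {}                                      # change -> (class, strength)
    for stem, past in pairs:
        change, cls = coin(stem, past)
        if change in rules:
            old_cls, strength = rules[change]
            shared = old_cls if old_cls == cls else "C#"   # keep common material
            rules[change] = (shared, strength + 1)
        else:
            rules[change] = (cls, 1)
    return rules

pairs = [("spik", "spok"),   # speak/spoke -> word-specific change rule, cf. (32a)
         ("tIp",  "tIpt"),   # tip/tipped  -> suffix t, class p#, cf. (32c)
         ("wok",  "wokt"),   # walk/walked -> collapses with it, strength 2, cf. (33b)
         ("pes",  "pest")]   # pace/paced  -> collapses again, strength 3, cf. (34b)
print(learn(pairs)["suffix t"])   # ('C#', 3)
```

As in (34b), the suffix-t candidate ends up with strength 3 and a class definition stripped of everything the three contexts did not share.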
Though this model is ridiculously simple , one can immediately see that it has several things in common with the RM model . First , regularities , certain subregularities , and irregular alternations are extracted , to be entertained as possible rules , by the same mechanism . Second, mechanisms embodying the different regularities accrue strength values that are monotonically related to the number of inputs that exemplify them . Third , the model can generalize to new inputs that resemble those it has encountered in the past; for example , tick , which terminates in an unvoiced stop , matches the context of rule (34b) ,
calculations, and the phonology would subtract out modifications abstracted from consonant clusters of simple words and perhaps from sets of morphologically unrelated rules. Finally, the general regular paradigm would be used when needed to fill out empty cells of word-specific paradigms with a unique entry, while following the constraint that irregular forms in memory block the product of the regular rule, and only a single form can be generated for a specific stem when more than one productive rule applies to it (multiple entries can exist only when the irregular form is too weakly represented, or when both multiple forms are witnessed in the input; see Pinker, 1984).

Though both our candidate-hypothesization module and the RM model share certain properties, let us be clear about the differences. The RM model is designed to account for the entire process that maps stems to past tense forms, with no interpretable subcomponents, and few constraints on the regularities that can be recorded. The candidate-hypothesization module, on the other hand, is meant to be a part of a larger system, and its outputs, namely rule candidates, are symbolic structures that can be examined, modified, or filtered out by other components of grammar. For example, the phonological acquisition mechanism can note the similarities between t/d/id and s/z/iz and pull out the common phonological regularities, which would be impossible if those allomorphic regularities were distributed across a set of connection weights onto which countless other regularities were superimposed.

It is also important to note that, as we have mentioned, the candidate-hypothesization module is motivated by a requirement of the learnability task facing the child. Specifically, the child at birth does not know whether English has a regular rule, or, if it does, what it is, or whether it has one or several.
He or she must examine the input evidence , consisting of pairs of present and past forms acquired individually , to decide . But the evidence is locally ambiguous in that the nonproductive exceptions to the regular rule are not a random set but display some regularities for historical reasons (such as mul tiple borrowings from other languages or dialects , or rules that have ceased to be productive ) and psychological reasons (easily-memorized forms fall into
family resemblance classes). The child's input thus mixes genuinely productive rules with accidental regularities. Furthermore, there is the intermediate case presented by languages that have several productive rules applying to different classes of stems. The "learnability problem" for the child is to distinguish these cases. Before succeeding, the child must entertain a number of candidates for the regular rule or rules, because it is only by examining large sets of present-past pairs that the spurious regularities can be ruled out and the partially-productive ones assigned to their proper domains; small samples are always ambiguous in this regard. Thus a child who has not yet solved the problem of distinguishing general productive rules from restricted productive rules from accidental patterns will have a number of candidate regularities still open as hypotheses. At this stage there will be competing options for the past tense form of a given verb. The child who has not yet figured out the distinction between regular, subregular, and idiosyncratic cases will display behavior that is similar to a system that is incapable of making the distinction: the RM model.

In sum, any adequate rule-based theory will have to contain a module that extracts multiple regularities at several levels of generality, assigns them strengths related to their frequency of exemplification by input verbs, and lets them compete in generating a past tense form for a given verb. In addition, such a model can attain the adult state by feeding its candidates into paradigm-organization processes, which, following linguistic constraints, distinguish real generalizations from spurious ones. With this alternative model in mind, we can now examine which aspects of the developmental data are attributable to specific features of the RM model's parallel distributed processing architecture (specifically, to its collapsing of linguistic distinctions) and those which are attributable to its assumptions of graded strength, type-frequency sensitivity, and competition, which it shares with symbolic alternatives.
The RM model is, as the authors point out, very rich in its empirical predictions. It is a strong point of their model that it provides accounts for several independent phenomena, all but one of them unanticipated when the model was designed. They consider four phenomena in detail: (1) the U-shaped curve representing the overregularization of strong verbs whose irregular pasts the child had previously used properly; (2) the fact that verbs ending in t or d (e.g. hit) are regularized less often than other verbs; (3) the order of acquisition of the different classes of irregular verbs manifesting different subregularities; (4) the appearance during the course of development of [past + ed] errors such as ated in addition to [stem + ed] errors such as eated.

7.2.1. Developmental sequence of productive inflection (the "U"-shaped curve)
It is by now well-documented that children pass through two stages before attaining adult competence in handling the past tense in English. In the first stage, they use a variety of correct past tense forms, both irregular and regular, and do not readily apply the regular past tense morpheme to nonce words presented in experimental situations. In the second stage, they apply the past
Language and connectionism 137
tense morpheme productively to irregular verbs, yielding overregularizations such as hitted and breaked for verbs that they may have used exclusively in their correct forms during the earlier stage. Correct and overregularized forms coexist for an extended period of time in this stage, and at some point during that stage, children demonstrate the ability to apply inflections to nonce forms in experimental settings. Gradually, irregular past tense forms that the child continues to hear in the input drive out the overregularized forms he or she has created productively, resulting in the adult state where a productive rule coexists with exceptions (see Berko, 1958; Brown, 1973; Cazden, 1968; Ervin, 1964; Kuczaj, 1977, 1981).
A standard account of this sequence is that in the first stage, with no knowledge of the distinction between present and past forms, and no knowledge of what the regularities are in the adult language that relate them, the child is simply memorizing present and past tense forms directly from the input. He or she correctly uses irregular forms because the overregularized forms do not appear in the input and there is no productive rule yet. Regular past tenses are acquired in the same way, with no analysis of them into a stem plus an inflection. Using mechanisms such as those sketched in the preceding section, the child builds a productive rule and can apply it to any stem, including stems of irregular verbs. Because the child will have had the opportunity to memorize irregular pasts before relating stems to their corresponding pasts and before the evidence for the regular relationship between the two has accumulated across inputs, correct usage can in many cases precede overregularization. The adult state results from a realization, which may occur at different times for different verbs, that overregularized and irregular forms are both past tense versions of a given stem, and by the application of a Uniqueness principle that, roughly, allows the cells of an inflectional paradigm for a given verb to be filled by no more and no less than one entry, which is the entry witnessed in the input if there are competing nonwitnessed rule-generated forms and witnessed irregulars (see Pinker, 1984).
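The Uniqueness principle as stated above reduces to a one-entry-per-cell rule. The following is our minimal sketch, with an invented two-verb lexicon: a past form witnessed in the input pre-empts the rule-generated competitor.

```python
# Hedged sketch of the Uniqueness principle: the paradigm cell for a verb's
# past tense holds exactly one entry, and a form witnessed in the input
# blocks the productive regular rule.

def past_tense(stem, witnessed):
    """witnessed: dict mapping stems to irregular pasts heard in the input."""
    if stem in witnessed:
        return witnessed[stem]   # memorized irregular fills the cell
    return stem + "ed"           # otherwise the productive rule applies

witnessed = {"break": "broke", "go": "went"}   # toy lexicon for illustration
```

Here `past_tense("break", witnessed)` yields broke while an unwitnessed stem like walk is regularized to walked.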
The RM model also has the ability to produce an arbitrary past tense form for a given present when they have been exemplified in the input, and to generate regular past tense forms for the same verbs by adding -ed. Of course, it does so without distinct mechanisms of rote and rule. In early stages, the links between the Wickelfeatures of a base irregular form and the Wickelfeatures of its past form are given higher weights. However, as a diverse set of regular forms begins to stream in, links are strengthened between a large set of input Wickelfeatures and the output Wickelfeatures containing features of the regular past morpheme, enough to make the regularized form a stronger output than the irregular form. During the overregularization stage, "the past tenses of similar verbs they are learning show such a consistent pattern that
the generalization from these similar verbs outweighs the relatively small amount of learning that has occurred on the irregular verb in question" (PDPII, p. 268). The irregular form eventually returns as the strongest output because repeated presentations of it cause the network to tune the connection weights so that the Wickelfeatures that are specific to the irregular stem form (and to similar irregular forms manifesting the same kind of stem-past variation) are linked more and more strongly to the Wickelfeatures specific to their past forms, and develop strong negative weights to the Wickelfeatures corresponding to the regular morpheme. That is, the prevalence of a general pattern across a large set of verbs trades off against the repeated presentation of a single specific pattern of a single verb presented many times (with subregularities constituting an intermediate case). This gives the model the ability to be either conservative (correct for an irregular verb) or productive (overregularizing an irregular verb) for a given stem, depending on the mixture of inputs it has received up to a given point.

Since the model's tendency to generalize lies on a continuum, any sequence of stages of correct irregulars or overregularized irregulars is possible in principle, depending on the model's input history. How, then, is the specific shift shown by children, from correct irregular forms to a combination of overregularized and correct forms, mimicked by the model? Rumelhart and McClelland divide the training sequence presented to the model into two stages. In the first, they presented 10 high-frequency verbs to the model, 2 of them regular, 10 times each. In the second, they added 410 verbs to this sample, 334 of them regular, and presented the sample of 420 verbs 190 times.
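The figures quoted for the two training phases imply an abrupt shift in the regular/irregular mixture presented to the model, which can be checked directly:

```python
# The regular/irregular mixture in the two phases of the RM training run,
# computed from the figures quoted above (10 verbs, 2 regular; then 410
# more verbs, 334 regular, for 420 in all).

phase1_regular, phase1_total = 2, 10
phase2_regular = phase1_regular + 334   # 336 regular verbs in phase 2
phase2_total = phase1_total + 410       # 420 verbs in all

print(phase1_regular / phase1_total)    # 0.2 -> 20% regular in phase 1
print(phase2_regular / phase2_total)    # 0.8 -> 80% regular in phase 2
```

The proportion of regular verbs thus jumps from 20% to 80% at the phase boundary, the shift examined against child data below.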
The beginning of the downward arm of the U-shaped plot of percent correct versus time, representing a worsening of performance for the irregular verbs, occurs exactly at the boundary between the first set of inputs and the second. The sudden influx of regular forms causes the links capturing the regular pattern to increase in strength; prior to this influx, the regular pattern was exemplified by only two input forms, not many more than those exemplifying any of the idiosyncratic or subregular patterns. The shift from the first to the second stage of the model's behavior, then, is a direct consequence of a shift in the input mixture from a heterogeneous collection of patterns to a collection in which the regular pattern occurs in the majority.

It is important to realize the theoretical claim inherent in this demonstration. The model's shift from correct to overregularized forms does not emerge from any endogenous process; it is driven directly by shifts in the environment. Given a different environment (say, one in which heterogeneous irregular forms suddenly start to outnumber regular forms), it appears that the model could just as easily go in the opposite direction, regularizing in its first stage and then becoming accurate with the irregular forms. In fact, since the model
always has the potential to be conservative or rule-governed, and continuously tunes itself to the input, it appears that just about any shape of curve at all is possible, given the right shifts in the mixture of regular and irregular forms in the input.

Thus if the model is to serve as a theory of children's language acquisition, Rumelhart and McClelland must attribute children's transition between the first and second stage to a prior transition of the mixture of regular and irregular inputs from the external environment. They conjecture that such a transition might occur because irregular verbs tend to be high in frequency. "Our conception of the nature of [the child's] experience is simply that the child learns first about the present and past tenses of the highest frequency verbs; later on, learning occurs for a much larger ensemble of verbs, including a much larger proportion of regular forms" (p. 241). They concede that there is no abrupt shift in the input to the child, but suggest that children's acquisition of the present tense forms of verbs serves as a kind of filter for the past tense learning mechanism, and that this acquisition of base forms undergoes an explosive growth at a certain stage of development. Because the newly-acquired verbs are numerous and presumably lower in frequency than the small set of early-acquired verbs, they will include a much higher proportion of regular verbs. Thus the shift in the proportion of regular verbs in the input to the model comes about as a consequence of a shift from high-frequency to medium-frequency verbs; Rumelhart and McClelland do not have to adjust the leanness or richness of the input mixture by hand. The shift in the model's input thus is not entirely ad hoc, but is it realistic?
The use of frequency counts of verbs in written samples in order to model children's vocabulary development is, of course, tenuous.24 To determine whether the input to children's past tense learning shifts in the manner assumed by Rumelhart and McClelland, we examined Roger Brown's unpublished grammars summarizing samples of 713 utterances of the spontaneous speech of three children observed at five stages of development. The stages were defined in terms of equally spaced intervals of the children's Mean Length of Utterance (MLU). Each grammar includes an exhaustive list of the child's verbs in the sample and an explicit discussion of whether the child
24For example, in the Kucera and Francis (1967) counts used by Rumelhart and McClelland, medium frequencies are assigned to the verbs flee, seek, mislead, and arise, which are going to be absent from a young child's vocabulary. On the other hand, stick and tear, which play a significant role in the ecology of early childhood, are ranked as low-frequency, and do and be are not in the high-frequency group, where they belong (do belongs because of its ubiquity in questions, a fact not reflected in the written language). Be appears to be out of the study, perhaps because Rumelhart and McClelland count the frequency of the -ing forms.
was overregularizing the past tense rule.25 In addition, we examined the vocabulary of Lisa, the subject of a longitudinal language acquisition study at Brandeis University, in her one-word stage. Two of the children, Adam and Eve, began to overregularize in the Stage III sample; the third child, Sarah, began to overregularize only in the Stage V sample, except for the single form heared appearing in Stage II, which Brown noted might simply be one of Sarah's many cases of unusual pronunciations. We tabulated each child's verb vocabulary and the proportion of verbs that were regular at each stage.26

The results, shown in Table 1 and Figure 2, are revealing. The percentage of the children's verbs that are regular is remarkably stable across children and across stages, never veering very far from 50%. (This is also true in parental speech itself: Slobin, 1971, showed that the percentage of regular verbs in Eve's parents' speech during the period in which she was overregularizing was 43%.) In particular, there is no hint of a consistent increase in the proportion of regular verbs prior to or in the stage at which regularizations first occur. Note also that an explosive growth in vocabulary does not invariably precede the onset of regularization. This stands in stark contrast to the assumed input to the RM model, where the onset of overregularization occurs subsequent to a sudden shift in the proportion of regular forms in the input from 20% to 80%. Neither the extreme rarity of regular forms during the conservative stage, nor the extreme prevalence of regular forms during the overproductive stage, nor the sudden transition from one input mixture to another, can be seen in human children. The explanation for their developmental sequence must lie elsewhere.
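The tabulation just described can be sketched as follows. The irregular list and the sample are invented for illustration; per footnote 26, each verb type is counted once regardless of how many forms or tokens it appears in:

```python
# Sketch of tabulating the proportion of regular verb types in a speech
# sample. IRREGULAR and the sample are toy data, not the Brown corpora.

IRREGULAR = {"go", "make", "break", "hit", "come"}

def prop_regular(verb_tokens):
    types = set(verb_tokens)                      # count each verb type once
    regular = [v for v in types if v not in IRREGULAR]
    return len(regular) / len(types)

sample = ["go", "go", "walk", "make", "play", "jump", "break", "hit"]
```

On this invented sample, 3 of 7 verb types are regular (about 43%), in the range the text reports for the real samples; the point of the real tabulation is that this proportion stays near 50% across stages.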
We expect that this phenomenon is quite general. The plural in English, for example, is overwhelmingly regular even among high-frequency nouns:27 only 4 out of the 25 most frequent concrete count nouns in the Francis and Kucera (1982) corpus are irregular. Since there are so few irregular plurals, children are never in a stage in which irregulars strongly outnumber regulars
25For details of the study, see Brown (1973); for descriptions of the unpublished grammars, see Brown (1973) and Pinker (1984). Verification of some of the details reported in the grammars, and additional analyses of children's speech to be reported in this paper, were based on on-line transcripts of the speech of the Brown children included in the Child Language Data Exchange System; MacWhinney & Snow (1985).

26A verb was counted whether it appeared in the present, progressive, or past tense form, and was counted only once if it appeared in more than one form. Since most of the verbs were in the present, this is of little consequence. We counted a verb once across its appearances alone and with various particles, since past tense inflection is independent of these differences. We excluded modal pairs such as can/could since they only occasionally encode a present/past contrast for adults. We excluded catenative verbs that encode tense and mood in English and hence which do not have obvious past tenses, such as in going to, come on, and gimme.

27We are grateful to Maryellen McDonald for this point.
in the input or in their vocabulary of noun stems. Nonetheless, the U-shaped developmental sequence can be observed in the development of plural inflection in the speech of the Brown children: for example, Adam said feet nine times in the samples starting at age 2;4 before he used foots for the first time at age 3;9; Sarah used feet 18 times starting at 2;9 before uttering foots at 5;1; Eve uttered feet a number of times but never foots.

Examining token frequencies only underlines the unnaturally favorable assumptions about the input used in the RM model's training run. Not only
does the transition from conservatism to overregularization correspond to a
shift from a 20/80 to an 80/20 ratio of regulars to irregulars, but in the first, conservative phase, high-frequency irregular pairs such as go/went and make/made were only presented 10 times each, whereas in the overregularizing phase the hundreds of regular verbs were presented 190 times each. In contrast, irregular verbs are always much higher in token frequency in children's environment. Slobin (1971) performed an exhaustive analysis of the verbs heard by Eve in 49 hours of adults' speech during the phase in which she was overregularizing and found that the ratio of irregular to regular tokens was 3:1. Similarly, in Brown's smaller samples, the ratios were 2.5:1 for Adam's
parents, 5:1 for Eve's parents, and 3.5:1 for Sarah's parents. One wonders
whether presenting the RM model with 10 high-frequency verbs, say, 190 times each in the first phase could have "burned in" the 8 irregulars so strongly that they would never be overregularized in Phase 2.

If children's transition from the first to the second phase is not driven by a change in their environments or in their vocabularies, what causes it? One possibility is that a core assumption of the RM model, that there is no psychological reality to the distinction between rule-generated and memorized forms, is mistaken. Children might have the capacity to memorize independent present and past forms from the beginning, but a second mechanism that coins and applies rules might not go into operation until some maturational change puts it into place, or until the number of verbs exemplifying a rule exceeds a threshold. Naturally, this is not the only possible explanation. An alternative is that the juxtaposition mechanism that relates each stem to its corresponding past tense form has not yet succeeded in pairing up memorized stems and past forms in the child's initial stage. No learning of the past tense regularities has begun because there are no stem-past input pairs that can be fed into the learning mechanism; individually acquired independent forms are the only possibility.
Some of the evidence supports this alternative. Brown notes in the grammars that children frequently used the present tense form in contexts that clearly called for the past, and in one instance did the reverse. As the children developed, past tense forms were used when called for more often, and
evidence for an understanding of the function of the past tense form and the tendency to overregularize both increase. Kuczaj (1977) provides more precise evidence from a cross-sectional study of 14 children. He concluded that once children begin to regularize, they rarely use a present tense form of an irregular verb in contexts where a past is called for.

The general point is that in either case the RM model does not explain children's developmental shift from conservatism to regularization. It attempts to do so only by making assumptions about extreme shifts in the input to rule learning that turn out to be false. Either rules and stored forms are distinct, or some process other than extraction of morphophonological regularity explains the developmental shift: the process of coming to recognize that two forms constitute the present and past versions of the same verb.

Little needs to be said about the shift from the second stage, in which regularization and overregularization occurs, to the third (adult) stage, in which application of the regular rule and storage of irregular pasts cooccur. Though the model does overcome its tendency to overregularize previously acquired irregular verbs, we have shown in a previous section that it never properly attains the third stage. This stage is attained, we suggest, not by incremental strength changes in a pattern-finding mechanism, but by a mechanism that makes categorical decisions about whether a hypothesized rule candidate is a genuine productive rule and about whether to apply it to a given verb.

On the psychological reality of the memorized/rule-generated distinction. In discussing the developmental shift to regularization, we have shown that there can be developmental consequences of the conclusion that was forced upon us by the linguistic data, namely that rule-learning and memorization of individual forms are separate mechanisms. (In particular, we pointed out that one might mature before the other, or one requires prior learning (juxtaposing stems and past forms) and the other does not.) This illustrates a more general point: the psychological reality of the memorized/rule-generated distinction predicts the possibility of finding dissociations between the two processes, whereas a theory such as Rumelhart and McClelland's that denies that
reality predicts that such dissociations should not be found. The developmental facts are clearly on the side of there being such a distinction. First of all, children's behavior with irregular past forms during the first, pre-regularization phase bears all the signs of rote memorization, rather than a tentatively overspecific mapping from a specific set of stem features to a specific set of past features. Brown notes, for example, that Adam used fell-down 10 times in the Stage II sample without ever using fall or falling,
so his production of fell-down cannot be attributed to any sort of mapping at all from stem to past. Moreover, there is no hint in this phase of any interaction or transfer of learning across phonetically similar individual irregular forms: for example, in Sarah's speech, break/broke coexisted with make/made, and neither had any influence on take, which lacked a past form of any sort in her speech over several stages. Similar patterns can be found in the other children's speech.

Another possible dissociation might be found in individual differences. A number of investigators of child language have noted that some children are conservative producers of memorized forms whereas others are far more willing to generalize productively. For example, Cazden (1968) notes that Adam was more prone to overgeneralizations than Eve and Sarah (p. 447), an observation also made by Brown in his unpublished grammars. More specifically, Table 1 shows that Sarah began to regularize the past tense two stages later than the other two children despite comparable verb vocabularies. Maratsos et al. (1987) documented many individual differences in the willingness of children to overgeneralize the causative alternation. If such differences do not reflect differences in the children's environments or vocabularies, they may reflect the generalizing mechanism of some children being stronger or more developed than that of others, without comparable differences in their ability to record forms directly from the input.

A clear example of a dissociation between rote and rule over a span of development in which they coexist comes from Kuczaj (1977), who showed that children's mastery of irregular past tense forms was best predicted by their chronological age, but their mastery of regular past tense forms was best predicted by their Mean Length of Utterance (MLU). Brown (1973) showed that MLU correlates highly with a variety of measures of grammatical sophistication in children acquiring English. Kuczaj's logic was that irregular pasts are simply memorized, so the sheer number of exposures, which increases as the child lives longer, is the crucial factor, whereas regular pasts can be formed by the application of a rule, which must be induced as part of the child's developing grammar, so overall grammatical development is a better predictor. Thus the linguistic distinction between lists of exceptions and rule-generated forms (see Section 4.4) is paralleled by a developmental distinction between opportunities for list-learning and sophistication of a rule system.

The RM model cannot easily account for any of these dissociations (other than by attributing the crucial aspects of the generalization phenomena entirely to mechanisms outside their model), because memorized forms and generalizations are handled by a single mechanism: recall that the identity map in the network must be learned by adjusting a large set of connection weights, just like any of the stem alterations; it is not there
at the outset, and is not intrinsically easy to learn. The question is not closed, but the point is that the different theories can in principle be submitted to decisive empirical tests. It is such tests that should be the basis for debate on the psychological issue at hand. Simply demonstrating that there exist contrived environments in which a network model can be made to mimic some data, especially in the absence of comparisons to alternative models, tells us nothing about the psychology of the child.

7.2.2. Performance with no-change verbs

A class of English verbs does not change in form between stem and past: beat, cut, put, hit, and others. All of these verbs end in a t or d. Bybee and Slobin (1982) suggest that this is no coincidence. They suggest that learners generate a schema for the form of past tense verbs on the basis of prevalent regular forms, which states that past tense verbs end in t or d. A verb whose stem already ends in t or d spuriously appears to have already been inflected for past tense, and the child is likely to assume that it is a past tense form. As a result, it can be entered as the past version of the verb in the child's paradigm, blocking the output of the regular rule. Presumably this tendency could result in the unchanged verb surviving into adulthood, causing the no-change verbs to have entered the language at large in some past generation and to be easily relearned thereafter. We will call this phenomenon misperception.28 In support of this hypothesis, Bybee and Slobin found in an elicitation experiment that for verbs ending in t or d, children were more likely to produce a past tense form identical to the present than a regularized form, whereas for verbs not ending in a t or d, they were more likely to produce a regularized form than an unchanged form.
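As a sketch (our code, not Bybee and Slobin's formulation), the misperception account reduces to a single check against the schema "past tense verbs end in t or d": a stem that already matches the schema is entered as its own past, blocking the regular rule.

```python
# Illustrative reduction of the misperception account: a t/d-final stem
# looks already inflected for past, so it fills the paradigm cell itself
# and the regular -ed rule is blocked.

def childs_past(stem):
    if stem[-1] in "td":       # stem matches the "past tenses end in t/d" schema
        return stem            # entered as its own past; regular rule blocked
    return stem + "ed"
```

So hit and need surface unchanged while play is regularized to played, matching the direction of the elicitation results described above.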
In addition, Kuczaj (1978) found in a judgment task that children were more likely to accept correct no-change forms for nonchanging verbs than correct past tense forms for other irregular verbs such as break or send, and less likely to accept overregularized versions of no-change verbs than overregularized versions of other irregular verbs. Thus not only do children learn that verbs ending in t/d are likely to be unchanged, but this subregularity is easier for them to acquire than the kinds of changes, such as the vowel alternations, found in other classes of irregular verbs.

28Bybee and Slobin do not literally propose that the child misanalyzes t/d-final verbs as (nonexistent) stems inflected by a rule. Rather, they postulate a static template which the child matches against unanalyzed forms during word perception to decide whether the forms are in the past tense or not.

Unlike the three-stage developmental sequence for regularization, children's sensitivity to the no-change subregularity for verbs ending in t/d played no role in the design of the RM model or of its simulation run. Nonetheless, Rumelhart and McClelland point out that during the phase in which the model was overregularizing, it produced stronger regularized past tense candidates for verbs not ending in t/d than for verbs ending in t/d, and stronger unchanged past candidates for verbs ending in t/d than for verbs not ending in t/d. This was true not only across the board, but also within the class of regular verbs, and within the classes of irregular verbs that do change in the past tense, for which no-change responses are incorrect. Furthermore, when
Rumelhart and McClelland examined the total past tense response of the
network (that is, the set of Wickelfeatures activated in the response pool) for verbs in the different irregular subclasses, they found that the no-change verbs resulted in fewer incorrectly activated Wickelfeatures than the other classes of irregulars. Thus both aspects of the acquisition of the no-change pattern fall out of the model with no extra assumptions.

Why does the model display this behavior? Because the results of its learning are distributed over hundreds of thousands of connection weights, it is
hard to tell, and Rumelhart and McClelland do not try to tease apart the
various possible causal factors. Misperception cannot be the explanation, because the model always received correct stem-past pairs. There are two other possibilities. One is that connections from many Wickelfeatures to the Wickelfeatures for word-final t, and the thresholds for those Wickelfeatures,
have been affected by the many regular stem-past pairs fed into the model. The response of the model is a blend of the operation of all the learned subregularities, so there might be some transfer from regular learning in this case. For example, the final Wickelphone in the correct past tense form of hit, namely it#, shares many of its Wickelfeatures with those of the regular past tense allomorphs such as id#. Let us call this effect between-class transfer.

It is important to note that much of the between-class transfer effect may be a consequence, perhaps even an artifact, of the Wickelfeature representation and one of the measures defined over it, namely percentage of incorrect Wickelfeatures activated in the output. Imagine that the model's learning component actually treated no-change verbs and other kinds of verbs identically, generating Wickelfeature sets of equal strength for cutted and taked. Necessarily, taked must contain more incorrect Wickelfeatures than cutted: most of the Wickelfeatures that one would regard as "incorrect" for cutted, such as those that correspond to the Wickelphones tid and id#, happen to characterize the stem perfectly (StopVowelStop, InterruptedFrontInterrupted, etc.), because cut and ted are featurally very similar. On the other hand, the incorrect Wickelfeatures for taked (those corresponding to the Wickelphones Akt and kt#) will not characterize the correct output form took. This
effect is exaggerated further by the fact that there are many more Wickelfeatures representing word boundaries than representing the same phonemes string-internally, as Lachter and Bever (1988) point out (recall that the Wickelfeature set was trimmed so as to exclude those whose two context phonemes belonged to different phonological dimensions; since the word boundary feature # has no phonological properties, such a criterion will leave all Wickelfeatures of the form XY# intact). This difference is then carried over to the current implementation of the response-generation component, which puts response candidates at a disadvantage if they do not account for activated Wickelfeatures.

The entire effect (a consequence of the fact that the model does not keep track of which features go in which positions) can be viewed either as a bug or a feature. On the one hand, it is one way of generating the (empirically correct) phenomenon that no-change responses are more common when stems have the same endings as the affixes that
would be attached to them. On the other hand, it is part of a family of
phonological confusions that result from the Wickelphone/Wickelfeature representations in general (see the section on Wickelphonology) and that hobble the model's ability even to reproduce strings verbatim. If the stem-affix feature confusions really are at the heart of the model's no-change responses, then it should also have recurring problems, unrelated to learning, in generating forms such as pitted or pocketed, where the same Wickelfeatures occur in the stem and affix, or even twice in the same stem, but must be kept distinct. Indeed, the model really does seem prone to make these undesirable errors, such as generating a single CVC sequence when two are necessary, as in the no-change responses for hug, smoke, and brown, or the converse, in overmarking errors such as typeded and steppeded.

A third possible reason that no-change responses are easy for t/d-final stems is that unlike other classes of irregulars in English, the no-change class has a single kind of change (that is, no change at all), and all its members have a phonological property in common: ending with a t or d. It is also the largest irregular subclass. The model has been given relatively consistent evidence of the contingency that verbs ending in t or d tend to have unchanged past tense forms, and it has encoded that contingency, presumably in large part by strengthening links between input Wickelfeatures representing word-final t/ds and identical corresponding output Wickelfeatures. Basically, the model is potentially sensitive to any statistical correlation between input and output feature sets, and it has picked up that one. That is, the acquisition of the simple contingency "ends in t/d → no change" presumably makes the model mimic children. We can call this the within-class uniformity effect.

As we have mentioned, the simplified rule-hypothesization mechanism presented in a previous section can acquire the same contingency (add a null
affix for verbs ending in a nonsonorant noncontinuant coronal), and strengthen it with every no-change pair in the input. If, as we have argued, a rule-learning model considered many rules exemplified by input pairs before being able to determine which of them was the correct productive rule or rules for the language, this rule would exist in the child's grammar and would compete with the regular d rule and with other rules, just as competing outputs are computed in the RM model.

Finally, there is a fourth mechanism that was mentioned in our discussion of the strong verb system. Addition of the regular suffix d to a form ending in t or d produces a phonologically illicit consonant cluster: td or dd. For regular verbs, the phonological rule of vowel insertion places an i between the two consonants. Interestingly, no irregular past ends in id, though some add a t or d. Thus we find tell/told and leave/left, but we fail to find bleed/bledded or get/gotted. A possible explanation is that a phonological rule, degemination, removes an affix after it is added, as an alternative means of avoiding adjacent coronals in the strong class. The no-change verbs would then just be a special case of this generalization, where the vowel doesn't change either. Basically, the child would capitalize on a phonological rule acquired elsewhere in the system, and might overgeneralize by failing to restrict the degemination rule to the strong verbs.

Thus we have an overlapping set of explanations for the early acquisition and overgeneralization of the no-change contingency: Bybee and Slobin cite misperception, Rumelhart and McClelland cite between-class transfer and within-class uniformity, and rule-based theories can cite within-class uniformity or overgeneralized phonology. What is the evidence concerning the reasons that children are so sensitive to this contingency? Unfortunately, a number of confounds in English make the theories difficult to distinguish.
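The between-class transfer artifact discussed earlier (the claim that most of cutted's "incorrect" Wickelfeatures match the stem at the feature level, while taked's do not) can be illustrated with a toy trigram stand-in for Wickelphones. Letters stand in for phonemes and the coarse feature classes are our invention, so this shows the logic of the argument, not the RM model's actual encoding:

```python
# Toy illustration of why taked contains more "incorrect" Wickelfeatures
# than cutted: many of cutted's wrong trigrams still match the target at
# the coarse feature level, so a percent-incorrect-features measure
# penalizes it less.

def wickelphones(word):
    """All letter trigrams of the word, padded with the boundary symbol #."""
    w = "#" + word + "#"
    return {w[i:i + 3] for i in range(len(w) - 2)}

def feature(ch):
    """Map a letter to an invented coarse feature class."""
    if ch == "#":
        return "#"
    if ch in "aeiou":
        return "Vowel"
    if ch in "pbtdkgc":
        return "Stop"
    return "Cons"

def feat_trigram(tri):
    return tuple(feature(ch) for ch in tri)

def truly_wrong(response, target):
    """Wrong trigrams of the response that are wrong even at the feature level."""
    target_tris = wickelphones(target)
    target_feats = {feat_trigram(t) for t in target_tris}
    wrong = wickelphones(response) - target_tris
    return {t for t in wrong if feat_trigram(t) not in target_feats}
```

Here cutted against cut yields four wrong trigrams of which only two (utt, tte) remain wrong at the feature level, whereas taked against took yields five wrong trigrams of which three (tak, ake, ked) remain wrong; an incorrect-features measure therefore favors the no-change response even if learning treated both verbs identically.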
No-change verbs have a diagnostic phonological property in common with one another. They also share a phonological property with regular inflected past tense forms. Unfortunately, they are the same property: ending with t/d. And it is the sharing of that phonological property that triggers the putative phonological rule. So this massive confound prevents one from clearly distinguishing the accounts using the English past tense rule; one cannot say that the Rumelhart-McClelland model receives clear support from its ability to mimic children in this case. In principle, a number of more diagnostic tests are possible. First, one must explain why the no-change class is confounded. The within-class uniformity account, which is one of the factors behind the RM model's success, cannot do this: if it were the key factor, we would surmise that English could just as easily have contained a no-change class defined by any easily-characterized within-class property (e.g. begin with j, end with s). Bybee and Slobin
note that across languages, it is very common for no-change stems to contain the very ending that a rule would add. While ruling out within-class uniformity as the only explanation, this still leaves misperception, transfer, and phonology as possibilities, all of which foster learning of no-change forms for stems resembling the relevant affix. Second, one can look at cases where possessing the features of the regular ending is not confounded with the characteristics of the no-change class. For example, the nouns that do not change when pluralized in English, such as sheep and cod, do not in general end in an s or z sound. If children nonetheless avoid pluralizing nouns like ax or lens or sneeze, it would support one or more of the accounts based on stem-affix similarity. Similarly, we might expect
children to be reluctant to supply progressive forms for verbs such as sting or rethink. If such effects were found, differences among verbs all of which resemble
the affix in question could discriminate the various accounts that exploit the stem-affix similarity effect in different ways. Transfer, which is exploited by the RM model, would, all other things being equal, lead to equally likely no-change responses for all stems with a given degree of similarity to the affix. Phonology would predict that transfer would occur only when the result of adding an affix led to adjacent similar segments; thus it would predict more no-change responses for the plural of ax than the progressive of sting, which is phonologically acceptable without the intervention of any further rule. Returning now to a possible unconfounded test of the within-class uniformity effect (implicated by Rumelhart and McClelland and by the rule-hypothesization module), one could look for some phonological property in common among a set of no-change stems that was independent of the phonological property of the relevant affix, and see whether children were more likely to yield both correct and incorrect no-change responses when a stem had that property. As we have pointed out, monosyllabicity is a property holding of the irregular verbs in general, and of the no-change verbs in particular; presumably it is for this reason that the RM model, it turns out, is particularly susceptible to leaving regular verbs ending in t/d erroneously unchanged when they are monosyllabic. As Rumelhart and McClelland point out, if children are less likely to leave verbs such as decide or devote unchanged than verbs such as cede or raid, it would constitute a test of this aspect of their theory; this test is not confounded by effects of across-class transfer.29
29Actually, this test is complicated by the fact that monosyllabicity and irregularity are not independent: in English, monosyllabicity is an important feature in defining the domain of many morphological and syntactic rules (e.g. nicer/*intelligenter, give/*donate the museum a painting; see Pinker, 1984), presumably because in English a monosyllable constitutes the minimal or basic word (McCarthy & Prince, forthcoming). As we have pointed out, all the irregular verbs in English are monosyllables or contain monosyllabic roots (likewise for nouns), a fact related in some way to irregularity being restricted to roots and to monosyllables being prototypical English roots. So if children know that only roots can be irregular and that roots are monosyllables (see Gordon, 1986, for evidence that children are sensitive to the interaction between roothood and morphology, and Gropen & Pinker, 1986, for evidence that they are sensitive to monosyllabicity), they may restrict their tendency to give no-change responses to monosyllables even if it is not the product of their detecting the first-order correlation between monosyllabicity and unchanged pasts. Thus the ideal test would have to be done for some other language, in which a no-change class had a common phonological property independent of the definition of a basic root in the language, and independent of the phonology of the regular affix.

30Note that the facts of English do not comport well with any strong misperception account that would have the child invariably misanalyze irregular pasts as pseudo-stems followed by the regular affix: the majority of no-change verbs either have lax vowels and hence would leave phonologically impossible pseudo-stems after the affix was subtracted, such as hi or cu, or end in vowel-t sequences, which never occur in regular pasts and only rarely in irregulars (e.g. bought). For the same reason it is crucial to Bybee and Slobin's account that children be constrained to form the schema (past: ...t/d#) rather than several schemas matching the input more accurately, such as (past: ...[unvoiced] t#) and (past: ...[voiced] d#). If they did, they would never misperceive hit and cut as past tense forms.

A possible test of the misperception hypothesis is to look for other kinds of evidence that children misperceive certain stems as falling into a morphological category that is characteristically inflected. If so, then once the regular rule is acquired it could be applied in reverse to such misperceived forms, resulting in back-formations. For no-change verbs, this would result in errors such as bea or blas for beat or blast. We know of no reports of such errors among past tense forms (many would be impossible for phonological reasons), but have observed in Lisa's speech mik for mix, and in her noun system clo(thes), fen(s), sentent (cf. sentence), Santa Clau(s), upstair(s), downstair(s), bok (cf. box), trappy (cf. trapeze), and brefek (cf. brefeks = 'breakfast').30 Finally, the process by which Rumelhart and McClelland exploit stem-affix similarity, namely transfer of the strength of the output features involved in regular pairs to the no-change stems, can be tested by looking at examples of blends of regular and subregular alternations that involve classes of verbs other than the no-change class. One must determine whether children produce such blends, and whether it is a good thing or a bad thing for the RM theory that their model does so. We examine this issue in the next two sections.

In sum, the class of English verbs that do not change in the past tense involves a massive confound of within-class phonological uniformity and stem-affix similarity, leading to a complex nexus of predictions as to why children are so sensitive to the properties of the class. The relations between different models of past tense acquisition, predictions of which linguistic variables should have an effect on languages and on children, and the classes of verbs instantiating those variables are many-to-many-to-many. Painstaking testing of the individual predictions, using unconfounded sets of items in a variety of inflectional classes in English and other languages, could tease the
Language and connectionism
effects apart. At present, however, a full range of possibilities is consistent with the data, ranging from the RM model explaining much of the phenomenon to its being entirely dispensable. The model's ability to duplicate children's performance, in and of itself, tells us relatively little.
7.2.3. Frequency of overregularizing irregular verbs in different vowel-change subclasses
Bybee and Slobin examined eight different classes of irregular past tense verbs (see the Appendix for an alternative, more fine-grained taxonomy). Their Class I contains the no-change verbs we have just discussed. Their Class II contains verbs that change a final d to t to form the past tense, such as send/sent and build/built. The other six classes involve vowel changes, and are defined by Bybee and Slobin as follows:
Class III. Verbs that undergo an internal vowel change and also add a final /t/ or /d/, e.g. feel/felt, lose/lost, say/said, tell/told.
Class IV. Verbs that undergo an internal vowel change, delete a final consonant, and add a final /t/ or /d/, e.g. bring/brought, catch/caught. [Bybee and Slobin include in this class the pair buy/bought even though it does not involve a deleted consonant. Make/made and have/had were also included even though they do not involve a vowel change.]
Class V. Verbs that undergo an internal vowel change whose stems end in a dental, e.g. bite/bit, find/found, ride/rode.
Class VI. Verbs that undergo a vowel change of /ɪ/ to /æ/ or /ʌ/, e.g. sing/sang, sting/stung.
Class VII. All other verbs that undergo an internal vowel change, e.g. give/gave, break/broke.
Class VIII. All verbs that undergo a vowel change and that end in a diphthongal sequence, e.g. blow/blew, fly/flew. [Go/went is also included in this class.]
Bybee and Slobin noted that preschoolers had widely varying tendencies to overregularize the verbs in these different classes, ranging from 10% to 80% of the time (see the first column of Table 2). Class IV and III verbs, whose past tense forms receive a final t/d in addition to their vowel changes, were overregularized the least; Class VII and V verbs, which have unchanged final consonants and a vowel change, were overregularized somewhat more often; Class VI verbs, involving the ing-ang-ung regularity, were regularized more often than that; and Class VIII verbs, which end in a diphthong sequence which is changed in the past, were overregularized most often. Bybee and Slobin again account for this phenomenon by appealing to factors affecting the process of juxtaposing corresponding present and past forms. They
suggest that the presence of an added t/d facilitates the child's recognition that Class III and IV past forms are past forms, and that the small percentage of shared segments between Class VIII present and past versions (e.g., one for see/saw or know/knew) hinders that recognition process. As the likelihood of successful juxtaposition of present and past forms decreases, the likelihood of the regular rule operating, unblocked by an irregular past form, increases, and overregularizations become more common. Rumelhart and McClelland suggest, as in their discussion of no-change verbs, that their model as it stands can reproduce the developmental phenomenon. Since the Bybee and Slobin subjects range from 1½ to 5 years, it is not clear which stage of performance of the model should be compared with that of the children, so Rumelhart and McClelland examined the output of the model at several stages. These stages corresponded to the model's first five trials with the set of medium-frequency, predominantly regular verbs, the next five trials, the next ten trials, and an average over those first twenty trials (these intervals constitute the period in which the tendency of the model to overregularize was highest). The average strength of the overregularized forms within each class was calculated for each of these four intervals. The fit between model and data is good for the interval comprising the first five trials, which Rumelhart and McClelland concentrate on. We calculate the rank-order correlation between degree of overregularization by children and model across classes as .77 in that first interval; however, it then declines to .31 and .14 in the next two intervals and is .31 for the average response over all three intervals.
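For reference, a rank-order (Spearman) correlation of the kind quoted here can be computed as a Pearson correlation over ranks. The sketch below is self-contained; the two per-class vectors are invented placeholders, not Bybee and Slobin's actual proportions or the RM model's actual response strengths.

```python
# Spearman rank-order correlation, as used to compare children's
# per-class overregularization rates with the model's response strengths.

def ranks(xs):
    # Assign 1-based ranks, averaging ranks within tie groups.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Hypothetical per-class figures (six vowel-change subclasses):
children = [0.10, 0.25, 0.40, 0.55, 0.70, 0.80]   # regularization rates
model    = [0.12, 0.30, 0.35, 0.60, 0.50, 0.90]   # overregularized-form strengths
print(round(spearman(children, model), 2))   # -> 0.94
```

Because only the ranks matter, a model can earn a high coefficient by getting the ordering of the classes right even when the absolute strengths are far from the children's proportions.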
The fact that the model is only successful at accounting for Bybee and Slobin's data for one brief interval (less than 3% of the training run), selected post hoc, whereas the data themselves are an average over a span of development of 3½ years, should be kept in mind in evaluating the degree of empirical confirmation this study gives the model. Nonetheless, the tendency of Class VIII verbs (fly/flew) to be most often regularized, and for Class III verbs (feel/felt) to be among those least often regularized, persists across all three intervals. The model, of course, is insensitive to any factor uniquely affecting the juxtaposition of present and past forms, because such juxtaposition is accomplished by the "teacher" in the simulation run. Instead, its fidelity to children's overregularization patterns at the very beginning of its own overregularization stage must be attributed to some other factor. Rumelhart and McClelland point to differences among the classes in the frequency with which their characteristic vowel changes are exemplified by the verb corpus as a whole. Class VIII verbs have vowel shifts that are relatively idiosyncratic to the individual verbs in the class; the vowel shifts of other classes, on the
other hand, might be exemplified by many verbs in many classes. Furthermore, Class III and IV verbs, which require the addition of a final t/d, can benefit from the fact that the connections in the network that effect the addition of a final t/d have been strengthened by the large number of regular verbs. The model creates past tense forms piecemeal, by links between stem and past Wickelfeatures, and with no record of the structure of the individual words that contributed to the strengths of those links. Thus vowel shifts and consonant shifts that have been exemplified by large numbers of verbs can be applied to different parts of a base form even if the exact combination of such shifts exemplified by that base form is not especially frequent. How well could the simplified rule-finding module account for the data? Like the RM model, it would record various subregular rules as candidates for a regular past tense rule. Assuming it is sensitive to type frequency, the rule candidates for more-frequently exemplified subregularities would be stronger. And the stronger an applicable subregular rule candidate is, the less the tendency for its output to lose the competition with the overregularized form contributed by the regular rule. Thus if Rumelhart and McClelland's explanation of their model's fit to the data is correct, a rule-finding model sensitive to type frequency presumably would fit the data as well. This conjecture is hard to test because Bybee and Slobin's data are tabulated in some inconvenient ways. Each class is heterogeneous, containing verbs governed by a variety of vowel shifts and varying widely as to the number of such shifts in the class and the number of verbs exemplifying them within the class and across the classes. Furthermore, there are some quirks in the classification.
Go/went, the most irregular main verb in English, is assigned to Class VIII, which by itself could contribute to the poor performance of children and the RM model on that class. Conversely, have and make, which involve no vowel shift at all, are included in Class IV, possibly contributing to good average performance for the class by children and the model. (See the Appendix for an alternative classification.) It would be helpful to get an estimate of how much of the RM model's empirical success here might be due to the different frequencies of exemplification of the vowel-shift subregularities within each class, because such an effect carries over to a symbolic rule-finding alternative. To get such an estimate, we considered each vowel shift (e.g. i → æ) as a separate candidate rule, strengthened by a unit amount with each presentation of a verb that exemplifies it in the Rumelhart-McClelland corpus of high- and medium-frequency verbs. To allow have and make to benefit from the prevalence of other verbs whose vowels do not change, we pooled the different vowel no-change rules (a → a, i → i, etc.) into a single rule (the RM model gets a similar benefit by using Wickelfeatures, which can code for the presence of
Table 2. Ranks of tendencies to overregularize irregular verbs involving vowel shifts

[The body of Table 2 lists, for each of Bybee and Slobin's vowel-shift subclasses (III-VIII, with example verbs such as feel/felt, seek/sought, bite/bit, sing/sang, break/broke, blow/blew), the children's overregularization ranks*, the RM model's ranks for the 1st, 2nd, and 3rd sets of trials and their average, and the average frequency of exemplification of the vowel shifts within each class**; the cell-by-cell entries are not recoverable from this copy.]

Rank-order correlation with children's proportions: .77 (RM 1st set), .31 (RM 2nd set), .14 (RM 3rd set), .31 (RM average), .71 (avg. freq. of vowel shift).

*Mean proportions of regularizations by children are in parentheses. **Actual numbers of verbs in the irregular corpus exemplifying the vowel shifts within a class are indicated in parentheses.

31In a sense, it would have been more accurate to calculate the strength function of the no-vowel-change rule on the basis of all simple verbs in the corpus, regular and irregular, rather than just the irregular verbs. But this would have stacked the deck overly greatly in favor of correctly predicting regularization rates for the Class IV irregular verbs, and so we counted only irregular exemplifications.
vowels, rather than Wickelphones), whose strength was determined by the number of no-vowel-change verbs in Classes I and II.31 Then we averaged the strengths of all the subregular rules included within each of Bybee and Slobin's classes. These averages allow a prediction of the ordering of overregularization probabilities for the different subclasses, based solely on the number of irregular verbs in the corpus exemplifying the specific vowel alternations among the verbs in the class. Though the method of prediction is crude, it is just about as good at predicting the data as the output of the RM model during the interval at which it did best, and much better than the RM model during the other intervals examined. Specifically, the rank-order correlation between the number of verbs in the corpus exemplifying the vowel shifts in a class and the frequency of children's regularization of verbs in the class is .71. The data, predictions of the RM model, and predictions from our simple tabulations are summarized in Table 2. What about the effect of the addition of t/d on the good performance on Class III and IV verbs? The situation is similar in some ways to that of the no-change verbs discussed in the previous section. The Class III and IV verbs
take some of the most frequently-exemplified vowel changes (including no-change for have and make); they also involve the addition of t or d at the end, causing them to resemble the past tense forms of regular verbs. Given this confound, good performance with these classes can be attributed to either factor, and so the RM model's good performance with them does not favor it over the Bybee-Slobin account focusing on the juxtaposition problem.

The question of blended responses. An interesting issue arises, however, when we consider the possible effects of the addition of t/d in combination with the effects of a common vowel shift. Recall that the RM model generates its output piecemeal. Thus strong regularities pertaining to different parts of a word can affect the word simultaneously, producing a chimerical output that need not correspond in its entirety to previous frequent patterns. To take a simplified example, after the model encounters pairs such as meet/met it has strong links between i and ɛ; after it encounters pairs such as play/played it has strong links between final vowels and final vowel-d sequences; when presented with flee it could then generate fled by combining the two regularities, even if it never encountered an ee/ed alternation before. What is interesting is that this blending phenomenon is the direct result of the RM model's lack of word structure. In an alternative rule-finding account, there would be an i → ɛ rule candidate and there would be a d-affixation rule candidate, but they would generate two distinct competing outputs, not a single blended output. (It is possible in principle that some of the subregular strong verbs such as told and sent involve the superposition of independent subregular rules, especially in the history of the language, but in modern English one cannot simply heap the effect of the regular rule on top of any subregular alternation, as the RM model is prone to do.)
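Such piecemeal blending can be illustrated schematically. The representation below is an invented toy stand-in for Wickelfeature links, with the two sub-mappings playing the roles of the meet/met vowel shift and the play/played affixation:

```python
# Toy blend: two independently acquired sub-mappings applied to one word,
# with no word-level record to keep them apart. (Invented representation,
# not actual Wickelfeatures; "i:" = the vowel of flee, "E" = the vowel of fled.)

VOWEL_SHIFT = {"i:": "E"}             # learned from meet/met-type pairs
FINAL_VOWELS = ("i:", "E", "a", "o")  # contexts where play/played-type +d applies

def past_blend(stem):
    # Apply every applicable sub-regularity simultaneously.
    out = [VOWEL_SHIFT.get(p, p) for p in stem]
    if stem[-1] in FINAL_VOWELS:      # stem ends in a vowel: add the d affix
        out.append("d")
    return out

print(past_blend(["f", "l", "i:"]))   # -> ['f', 'l', 'E', 'd'], i.e. "fled"
print(past_blend(["m", "i:", "t"]))   # -> ['m', 'E', 't'], i.e. "met"
```

A rule-based competitor would instead produce two discrete candidates for flee, ['f', 'l', 'E'] and ['f', 'l', 'i:', 'd'], and let them compete; the single blended output is what distinguishes the network account under the simplest assumptions.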
Thus it is not really fair for us to claim that a rule-hypothesization model can account for good performance with Class III and IV verbs because they involve frequently-exemplified vowel alternations; such alternations only result in correct outputs if they are blended with the addition of a t/d to the end of the word. In principle, this could give us a critical test between the network model and a rule-hypothesization model: unlike the ability to soak up frequent alternations, the automatic superposition of any set of them into a single output is (under the simplest assumptions) unique to the network model. This leads to two questions: Is there independent evidence that children blend subregularities? And does the RM model itself really blend subregularities? We will defer answering the first question until the next section, where it arises again. As for the second, it might seem that the question of whether response blending occurs is perfectly straightforward, but in fact it is not. Say the model's active output Wickelfeatures in response to flee include those for a medial ɛ and those for a final d. Is the overt response of the model fled, a correct blend, or does it set up a competition between [flid] and [flɛ], choosing one of them, as the rule-hypothesization model would? In principle, either outcome is possible, but we are never given the opportunity to find out. Rumelhart and McClelland do not test their model against the Bybee and Slobin data by letting it output its favored response. Rather, they externally assemble alternatives corresponding to the overregularized and correct forms, and assess the relative strengths of those alternatives by observing the outcome of the competition in the restricted-choice whole-string binding network (recall that the output of the associative network, a set of activated Wickelfeatures, is the input to the whole-string binding network). These strengths are determined by the number of activated Wickelfeatures
that each is consistent with. The result is that correct alternatives that also happen to resemble blends of independent subregularities are often the response chosen. But we do not know whether the model, if left to its own devices, would produce a blend as its top-ranked response. Rumelhart and McClelland did not perform this test because it would have been too computationally intensive given the available hardware: recall that
the only way to get the model to produce a complete response form on its
own is by giving it (roughly) all possible output strings (that is, all permutations of segments) and having them compete against each other for active Wickelfeatures in an enormous "unconstrained whole-string binding network". This is an admitted kluge designed to give approximate predictions of the strengths of responses that a more realistic output mechanism would construct. Rumelhart and McClelland only ran the unconstrained whole-string binding network on a small set of new low-frequency verbs in a transfer test involving no further learning. It is hard to predict what will happen when this network operates, because it involves a "rich-get-richer" scheme in the competition among whole strings, by which a string that can uniquely account for some Wickelfeatures (including Wickelfeatures incorrectly turned on as part of the noisy output function) gets disproportionate credit for the features that it and its competitors account for equally well, occasionally leading to unpredictable winners. In fact, the whole-string mechanism does yield blends such as slip/slept. But as mentioned, these blends are also occasionally bizarre, such as mail/membled or tour/toureder. And this is why the question of overt blended outputs is foggy: it is unclear whether tuning the whole-string binding network, or a more reasonable output construction mechanism,
so that the bizarre blends were eliminated, would also eliminate the blends that perhaps turn out to be the correct outputs for Class III and IV.32
32To complicate matters even further, even outright blends are possible in principle within the rule-based framework under some incarnations of it; whether a rule model of that sort could also lead to the behavior in question is unknown.

7.2.4. "Eated" versus "ated" errors
Children produce overregularization errors consisting of an irregular past form with ed affixed to it, such as ated or braked, as well as errors consisting of the base form with ed, such as eated or breaked (Kuczaj, 1977). Such errors tend to occur relatively late in the course of development, considerably later than the eated-type errors. Rumelhart and McClelland compared their model's behavior to Kuczaj's data: in the model, the strength of ated-type outputs relative to eated-type outputs increased later in the course of training. What is the mechanism for the model's mimicking this developmental tendency? There are two possibilities. One is that the child misconstrues the irregular past form as a base form itself, because he or she fails to realize that it is a past form, and gets the doubly-marked form by affixing ed to it. The alternative is that the past + ed form is a doubly-marked blend of the correct irregular past and the regularized form, relatively similar in one way to the
model's tendency to leave t/d-final stems unchanged, discussed in the section before that. As in the previous discussions, the lack of a realistic response-production mechanism makes it unclear whether the model would ever actually produce past + ed blends when it is forced to utter a response on its own, or whether the phenomenon is confined to such forms simply increasing in strength in the three-alternative forced-choice experiment, because only the past + ed form by definition contains three sets of features, all of them strengthened in the course of learning (its idiosyncratic features, the features output by subregularities, and the features of regularized forms). In Rumelhart and McClelland's transfer test on new verbs, they chose a minimum strength value of .2 as a criterion for when a form should be considered as being a likely overt response of the model. By this criterion, the model should be seen as rarely outputting past + ed forms, since such forms on the average never exceed a strength of .15. But let us assume for now that such forms would be output, and that blending is their source. At first one might think that the model has an advantage in that it is consistent with the fact that ated errors increase relative to the eated errors in later stages, a phenomenon not obviously predicted by the misconstrued-stem account.33 However, many of the phenomena we discuss below that favor the misconstrued-stem account over the RM model appear during the same relatively late period as the ated errors (Kuczaj, 1981), so lateness itself does not distinguish the accounts. Moreover, Kuczaj (1977) warns that the late-ated effect is not very robust and is subject to individual differences. In a later study (Kuczaj, 1978) he eliminated these sampling problems by using an experimental task in which children judged whether various versions of past tense forms sounded "silly".
He found in two separate experiments that while children's acceptance of the eated forms declined monotonically, their acceptance of ated forms showed an inverted V-shaped function, first increasing but then decreasing relative to eated-type errors. Since in the RM model the strengths of both forms monotonically approach an asymptote near zero, with the curves crossing only once, the model demonstrates no special ability to track the temporal dynamics of the two kinds of errors. In the discussion below we will concentrate on the reasons that such errors occur in the first place. Once again, a confound in the materials provided by the English language undermines Rumelhart and McClelland's conclusion that their model accounts
for children's ated-type errors. Irregular past tense forms appear in the child's input and hence can be misconstrued as base forms. They also are part of the model's output for irregular base forms and hence can be blended with the regularized response. Until forms which have one of these properties and not the other are examined, the two accounts are at a stalemate.
Fortunately, the two properties can be unconfounded. Though the correct irregular past will usually be the strongest non-regularized response of the network, it is also sensitive to subregularities among vowel changes, and hence one might expect blends consisting of a frequent and consistent but incorrect vowel change plus the regular ed ending. In fact the model does produce such errors for regular verbs it has not been trained on, such as shape/shipped, sip/sepped, slip/slept, and brown/brawned. Since the stems of these responses are either not English verbs or have no semantic relationship to the correct verb, such responses can never be the result of mistakenly feeding the wrong base form of the verb into the past tense formation process. Thus if the
blending assumed in the Rumelhart and McClelland model is the correct explanation for children's past + ed overregularizations, we should see children making these and other kinds of blend errors. We might also expect errors consisting of a blend of a correct irregular alternation of a verb plus a frequent subregular alternation, such as send/soant (a blend of the d → t and ɛ → o subregularities) or think/that (a blend of the ing → ang and final consonant cluster → t subregularities). (As mentioned, though, these last errors are not ruled out in principle in all rule-based models, since superposition may have had a role in the creation of several of the strong past forms
in the history of English, but indiscriminately adding the regular affix onto strong pasts is ruled out by most theories of morphology.) Conversely, if the phenomenon is due to incorrect base input forms, we might expect to see other inflection processes applied to the irregular past, resulting in errors such as wenting and braking or wents and brokes. Since mechanisms for progressive or present indicative inflection would never be exposed to the idiosyncrasies or subregularities of irregular past tense forms under Rumelhart and McClelland's assumptions, such errors could not result from blending of outputs. Similarly, irregular pasts should appear in syntactic contexts calling for bare stems if children misconstrue irregular pasts as stems. In addition, we might expect to find cases where ed is added to incorrect base forms that are plausible confusions of the correct base form but implausible results of the mixing of subregularities. Finally, we might expect that if children are put in a situation in which the correct stem of a verb is provided for them, they would not generate past +
ed errors, since the source of such errors would be eliminated.
All five predictions work against the RM model and in favor of the explanation based on incorrect inputs. Kuczaj (1977) reports that his transcripts contained no examples where the child overapplied any subregularity, let alone a blend of two of them or of a subregularity plus the regular ending. Bybee and Slobin do not report any such errors in children's speech, though they do report them as adult slips of the tongue in a time-pressured speaking task designed to elicit errors. We examined the full set of transcripts of Adam, Eve, Sarah, and Lisa for words ending in -ed. We found 13 examples of irregular past + ed or en errors in past and passive constructions:

(36) Adam: fanned, tooked, staled (twice), braked (participle), felled
Eve: tored
Sarah: flewed (twice), caughted, stucked (participle)
Lisa: tamed (participle), taaken (twice) (participle), sawn (participle)
The participle forms must be interpreted with caution. Because English irregular participles sometimes consist of the stem plus en (e.g. take - took - taken) but sometimes consist of the irregular past plus en (e.g. break - broke - broken), errors like tooken could reflect the child overextending this regularity to past forms of verbs that actually follow the stem + en pattern; the actual stem, or even the child's mistaken hypothesis about it, may play no role.

What about errors consisting of a subregular vowel alternation plus the addition of ed? The only examples where an incorrect vowel other than that of the irregular form appeared with ed are the following:

(37) Adam: I think it's not fulled up to de top.
I think my pockets gonna be all fulled up.
I'm gonna ask Mommy if she has any more grain ... more stuff that she needs grained. [He has been grinding crackers in a meat grinder producing what he calls "grain".]
Sarah: Oo, he hahted.
Lisa: I brekked your work.
Language and connectionism 161
For Adam, neither vowel alternation is exemplified by any of the irregular verbs in the Rumelhart-McClelland corpus, but in both cases the stem is identical to a non-verb that is phonologically and semantically related to the target verb and hence may have been misconstrued as the base form of the verb or converted into a new base verb. Sarah's error can be attributed directly to phonological factors, since she also pronounced dirt, involving no morphological change, as "dawt", according to the transcriber. This leaves Lisa's brekked as the only putative example; note that unlike her single-subregularity errors such as bote for bit, which lasted for extended periods of time, this appeared only once, and the correct form broke was very common in her speech. Furthermore, blending is not a likely explanation: among high and middle frequency verbs, the alternation is found only in say/said and to a lesser extent in pairs such as sleep/slept and leave/left, whereas in many other alternations the e sound is mapped onto other vowels (bear/bore, wear/wore, tear/tore, take/took, and shake/shook). Thus it seems unlikely that the RM model would produce a blend in this case but not in the countless other opportunities for blending that the children avoided. Finally, we note that Lisa was referring to a pile of papers that she scattered, an unlikely example of breaking but a better one of wrecking, which may have been the target serving as the real source of the blend (and not a past tense subregularity) if it was a blend. In sum, except perhaps for this last example under an extremely charitable interpretation, the apparent blends seem far more suggestive of an incorrect stem correctly inflected than of a blend between two past tense subregularities.

This conclusion is strengthened when we note that children do make errors such as wents and wenting, which could only result from inflecting the wrong stem.
Kuczaj (1981) reports frequent use of wenting, ating, and thoughting in the speech of his son, and we find in Eve's speech fells and wents and in Lisa's speech blow awayn, lefting, hidding (= hiding), stoling, to took, to shot, and might loss. These last three errors are examples of a common phenomenon sometimes called 'overtensing', which, because it occurs mostly with irregulars (Maratsos & Kuczaj, 1978), is evidence that irregulars are misconstrued as stems (identical to infinitives in English). Some examples from Pinker (1984) include Can you broke those?, What are you did?, She gonna fell out, and I'm going to sit on him and made him broken. Note that since many of these forms occur at the same time as the ated errors, the relatively late appearance of ated forms may reflect the point at which stem extraction (and mis-extraction) in general is accomplished.

Finally, Kuczaj (1978) presents more direct evidence that past + ed errors are due to irregular pasts misconstrued as stems. In one of his tasks, he had children convert a future tense form (i.e. "X will + (verb stem)") into a past
tense form (i.e. "X already (verb past)"). Past + ed errors virtually vanished (in fact they completely vanished for two of the three age groups). Kuczaj argues that the crucial factor was that children were actually given the proper
base forms. This shows that children's derivation of the errors must be from ate to ated, not, as it is in the RM model, from eat to ated.
Yet another test of the source of apparently blended errors is possible when we turn our attention to the regular system. If the child occasionally misanalyzes a past form as a stem, he or she should do so for regular inflected past forms and not just irregular ones, resulting in errors such as talkeded.
The RM model also produces such errors as blends, but for reasons that
Rumelhart and McClelland do not explain, all these errors involve regular verbs whose stems end in p or k: carpeded, drippeded, mappeded, smokeded, snappeded, steppeded, and typeded, but not browneded, warmeded, teareded, or clingeded, nor, for that matter, irregular stems of any sort: the model did not output creepeded/crepted, weepeded/wepted, diggeded, or stickeded. We suggest the following explanation for this aspect of the model's behavior. The phonemes p and k share most of their features with t. Therefore on a Wickelfeature by Wickelfeature basis, learning that t and d give you id in the
output transfers to p, b, g, and k as well. So there will be a bias toward id
responses after all stops. Since there is also a strong bias toward simply adding t, there will be a tendency to blend the 'add t' and the 'add id' responses. Irregular verbs, as we have noted, never end in id, so to the extent that the novel irregulars resemble trained ones (see Section 4.4), the features of the novel irregulars will inhibit the response of the id Wickelfeatures and double-marking will be less common.

In any case, though Rumelhart and McClelland cannot explain their model's behavior in this case, they are willing to predict that children as well
will double-mark more often for p- and k-final stems. In the absence of an
explanation as to why the model behaved as it did, Rumelhart and McClelland should just as readily extrapolate the model's reluctance to double-mark irregular stems and test the prediction that children should double-mark only regular forms (if our hypothesis about the model's operation is correct, the two predictions stem from a common effect). Checking the transcripts, we did find ropeded and stoppeded (the latter uncertain in transcription) in Adam's speech, and likeded and pickeded in Sarah's, as Rumelhart and McClelland would predict. But Adam also said tieded and Sarah said buyded and makeded (an irregular). Thus the model's prediction that double-marking should be specific to stems ending with p and k, and then only when they are regular, is not borne out. In particular, note that buyded and tieded cannot be the result of a blend of subregularities, because there is no subregularity
according to which buy or tie would tend to attract an id ending.34

Finally, Slobin (1985) notes that Hebrew contains two quite pervasive rules for inflecting the present tense, the first involving a vowel change, the second a consonantal prefix and a different vowel change. Though Israeli children overextend the prefix to certain verbs belonging to the first class, they never blend this prefix of the second class with the vowel change of the first class. This may be part of a larger pattern: children seem to respect the integrity
of the word as a cohesive unit, one that can have affixes added to it and that
can be modified by general phonological processes, but that cannot simply be composed as a blend of bits and pieces contributed by various regular and irregular inflectional regularities. It is suggestive in this regard that Slobin (1985), in his crosslinguistic survey, lists examples from the speech of children learning Spanish, French, German, Hebrew, Russian, and Polish, where the language mandates a stem modification plus the addition of an affix and children err by only adding the affix.

Once again we see that the model does not receive empirical support from its ability to mimic a pattern of developmental data. The materials that Rumelhart and McClelland looked at are again confounded in a way that leaves their explanation and the standard one focusing on the juxtaposition problem equally plausible given only the fact of ated errors. One can do better than that. By looking at unconfounded cases, contrasting predictions leading to critical tests are possible. In this case, six different empirical tests all go against the explanation inherent in the Rumelhart and McClelland model: absence of errors due to blending of subregularities; presence of wenting-type errors; presence of errors where irregular pasts are used in nonpast contexts; presence of errors where the regular past ending is mistakenly applied to non-verb stems; drastic reduction of ated errors when the correct stem is supplied to the child; and presence of errors where the regular ending is applied twice to stems that are irregular or that end in a vowel. These tests show that errors such as ated are the result of the child incorrectly feeding
ate as a base form into the past tense inflection mechanism, and not the result of blending within the past tense system.
34One might argue that the misconstrued-stem account would fail to generate these errors, too, since it would require that the child first generate maked and buyed using a productive past tense rule and then forget that the forms really were in the past tense. Perhaps, the argument would go, some other kind of blending caused the errors, such as a mixture of the two endings d and id, which are common across the language even if the latter is contraindicated for these particular stems. In fact, the misconstrued-stem account survives unscathed, because one can find errors not involving overinflection where child-generated forms are treated as stems: for example, Kuczaj (1976) reports sentences such as They wouldn't haved a house and She didn't goed.
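The feature-transfer explanation offered above for the model's p/k-specific double-marking can be made concrete with a toy simulation of our own devising. The feature assignments and the single "add -id" unit below are simplified assumptions for illustration, not the RM network: a unit trained to respond only after t- and d-final stems acquires weight on the shared "stop" feature, so untrained p and k attract more "-id" activation than an unrelated phoneme such as m.

```python
# Toy sketch (ours, not the RM implementation) of feature-based transfer:
# learning "t, d -> add -id" partially generalizes to p and k because
# they share the "stop" feature with t and d.

FEATURES = ["stop", "fricative", "nasal", "labial", "alveolar", "velar", "voiced"]

PHONEMES = {  # simplified, hypothetical feature assignments
    "t": {"stop", "alveolar"},
    "d": {"stop", "alveolar", "voiced"},
    "p": {"stop", "labial"},
    "k": {"stop", "velar"},
    "m": {"nasal", "labial", "voiced"},
    "s": {"fricative", "alveolar"},
}

def vec(ph):
    """Binary feature vector for a phoneme."""
    return [1.0 if f in PHONEMES[ph] else 0.0 for f in FEATURES]

def activation(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Perceptron-style training: "-id" is correct only after t and d.
w = [0.0] * len(FEATURES)
data = [("t", 1), ("d", 1), ("m", 0), ("s", 0)]
for _ in range(20):
    for ph, target in data:
        out = 1 if activation(w, vec(ph)) > 0.5 else 0
        w = [wi + 0.1 * (target - out) * xi for wi, xi in zip(w, vec(ph))]

# Untrained p and k inherit activation through the shared "stop" feature,
# biasing the unit toward "-id" after all stops.
for ph in ["t", "p", "k", "m"]:
    print(ph, round(activation(w, vec(ph)), 2))
```

After training, t gets the strongest "-id" signal, p and k an intermediate one, and m almost none: the ordering emerges entirely from feature overlap, which is the mechanism we conjecture underlies the model's behavior.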
7.3. Summary of how well the model fares against the facts of children's development
What general conclusions can we draw from our examination of the facts of children's acquisition of the English past tense form and the ability of the RM model to account for them? This comparison has brought several issues to light.
To begin with, one must reject the premise that is implicit in Rumelhart and McClelland's arguments, namely that if their model can duplicate a phenomenon, the traditional explanation of that phenomenon can be rejected.
For one thing, there is no magic in the RM model duplicating correlations in language systems: the model can extract any combination of over 200,000 atomic regularities, and many regularities that are in fact the consequences of an interaction among principles in several grammatical components will be detectable by the model as first-order correlations because they fall into that huge set. As we argued in Section 4, this leaves the structure and constraints on the phenomena unexplained. But in addition, it leaves many of the simple goodness-of-fit tests critically confounded. When the requirements of a learning system designed to attain the adult state are examined, and when unconfounded tests are sought, the picture changes.

First, some of the developmental phenomena can be accounted for by any mechanism that keeps records of regularities at several levels of generality, assigns strengths to them based on type-frequency of exemplification, and lets them compete in producing past tense candidate forms. These phenomena include children's shifts or waffling between irregular and overregularized past tense forms, their tendency not to change verbs ending in t/d, and their tendency to overregularize verbs with some kinds of vowel alternations less than others. Since there are good reasons why rule-hypothesization models should be built in this way, these phenomena do not support the RM model as a whole or in contrast with rule-based models in general, though they do support the more general (and uncontroversial) assumption of competition among multiple regularities of graded strength during acquisition.

Second, the lack of structures corresponding to distinct words in the model, one of its characteristic features in contrast with rule-based models, might be related to the phenomenon of blended outputs incorporating independent subregularities. However, there is no good evidence that children's correct
responses are ever the products of such blends, and there is extensive evidence from a variety of sources that their ated-type errors are not the products of such blends. Furthermore, given that many blends are undesirable, it is not clear that the model should be allowed to output them when a realistic model of its output process is constructed.
Third, the three-stage or U-shaped course of development for regular and irregular past tense forms in no way supports the RM model. In fact, the model provides the wrong explanation for it, making predictions about changes in the mixture of irregular and regular forms in children's vocabularies that are completely off the mark.

This means that in the two hypotheses for which unconfounded tests are available (the cause of the U-shaped overregularization curve, and the genesis of ated errors), both of the processes needed by the RM model to account for developmental phenomena, frequency-sensitivity and blending, have been shown to play no important role, and in each case processes appealing to rules, to the child's initial hypothesization of a rule in one case, and to the child's misapplication of it to incorrect inputs in a second, have received independent support. And since the model's explanations in the two confounded cases (performance with no-change verbs, and order of acquisition of subclasses) appeal in part to the blending process, the evidence against blending in our discussion of the ated errors taints these accounts as well. We conclude that the developmental facts discussed in this section and the linguistic facts discussed in Section 4 converge on the conclusion that knowledge of language involves the acquisition and use of symbolic rules.
8. General discussion
Why subject the RM model to such painstaking analysis? Surely few models of any kind could withstand such scrutiny. We did it for two reasons. First, the conclusions drawn by Rumelhart and McClelland (that PDP networks provide exact accounts of psychological mechanisms that are superior to the approximate descriptions couched in linguistic rules; that there is no induction problem in their network model; that the results of their investigation warrant revising the way in which language is studied) are bold and revolutionary. Second, because the model is so explicit and its domain so rich in data, we have an unusual opportunity to evaluate the Parallel Distributed Processing approach to cognition in terms of its concrete technical properties rather than bland generalities or recycled statements of hopes or prejudices.

In this concluding section we do four things: we briefly evaluate Rumelhart and McClelland's strong claims about language; we evaluate the general claims about the differences between connectionist and symbolic theories of cognition that the RM model has been taken to illustrate; we examine some of the ways that the problems of the RM model are inherently due to its PDP architecture, and hence ways in which our criticisms implicitly extend to
certain kinds of PDP models in general; and we consider whether the model could be salvaged by using more sophisticated connectionist mechanisms.

8.1. On Rumelhart and McClelland's strong claims about language

One thing should be clear. Rumelhart and McClelland's PDP model does not differ from a rule-based theory in providing a more exact account of the facts of language and language behavior. The situation is exactly the reverse. As far as the adult steady state is concerned, the network model gives a crude, inaccurate, and unrevealing description of the very facts that standard linguistic theories are designed to explain, many of them in classic textbook cases. As far as children's development is concerned, the model's accounts are at their best no better than those of a rule-based theory with an equally explicit learning component, and for two of the four relevant developmental phenomena, critical empirical tests designed to distinguish the theories work directly against the RM model's accounts but are perfectly consistent with the notion that children create and apply rules. Given these empirical failings, the ontological issue of whether the PDP and rule-based accounts are realist portrayals of actual mechanisms, as opposed to convenient approximate summaries of higher-order regularities in behavior, is rather moot.

There is also no basis for Rumelhart and McClelland's claim that in their network model, as opposed to traditional accounts, "there is no induction problem". The induction problem in language acquisition consists, among other things, of finding sets of inputs that embody generalizations, extracting the right kinds of generalizations from them, and deciding which generalizations can be extended to new cases. The model does not deal at all with the first problem, which involves recognizing that a given word encodes the past tense and that it constitutes the past tense version of another word.
This juxtaposition problem is relegated to the model's environment (its "teacher"), or more realistically, some unspecified prior process; such a division of labor would be unproblematic if it were not for the fact that many of the developmental phenomena that Rumelhart and McClelland marshal in support of their model may be intertwined with the juxtaposition process (the onset of overregularization, and the source of ated errors, most notably). The second part of the induction problem is dealt with in the theory the old-fashioned way: by providing it with an innate feature space that is supposed to be appropriate for the regularities in that domain. In this case, it is the distinctive features of familiar phonological theories, which are incorporated into the model's Wickelfeature representations (see also Lachter & Bever, 1987). Aspects in which the RM model differs from traditional accounts in how it uses distinctive features, such as representing words as unordered
Languageand connectionism
167
pools of trigrams, do not work to the theory's advantage: they make it perform very poorly on crucial aspects of the problem, overestimating the significance of subgeneralizations and underestimating the generality of the regular rule and the constraints on irregular performance. The extent to which Rumelhart and McClelland can take credit for the success of their model is correspondingly in question. The model does not give superior or radically new answers to the standing questions about the effects of subregularities and frequency, the blending of outputs, the tradeoffs between overt overregularization and correct past forms, developmental transitions, and the juxtaposition or input-extraction process. But it has put these questions in an intriguing light.
8.2. Implications for the metatheory and methodology of connectionism
Often the RM model is put forward not only as a new way to study language, but as a case study of what a cognitive theory is. A persistent theme of this view affirms that 'macro'-level symbolic theories of a domain may under some circumstances provide an approximate description of the macrostructure of cognition, while the real models accurately describe the microstructure:

We view macro-level theories as approximations to the underlying microstructure which the model presented in our paper attempts to capture. As approximations they are often useful, but in some situations it will turn out that an examination of the microstructure may bring much insight; macro-level models are approximations and should not be pushed too far. (Rumelhart & McClelland, in press, p. 21)
In such discussions the relationship between Newtonian physics and Quantum Mechanics typically surfaces as the desired analogy. One of the reasons that connectionist theorists tend to reserve no role for higher-level theories as anything but approximations is that they create a dichotomy that, we think, is misleading. They associate the systematic, rule-based analysis of linguistic knowledge with what they call the "explicit inaccessible rule" view of psychology, which

... holds that the rules of language are stored in explicit form as propositions, and are used by language production, comprehension, and judgment mechanisms. These propositions cannot be described verbally [by the untutored native speaker]. (Rumelhart and McClelland, PDPII, p. 217)

Their own work is intended to provide "an alternative to explicit inaccessible rules ... a mechanism in which there is no explicit representation of a rule" (p. 217). The implication, or invited inference, seems to be that a formal rule is an eliminable descriptive convenience unless inscribed somewhere and examined by the neural equivalent of a read-head in the course of linguistic information processing.

In fact, there is no necessary link between realistic interpretation of rule theories and the "explicit inaccessible" view. Rules could be explicitly inscribed and accessed, but they also could be implemented in hardware in such a way that every consequence of the rule-system holds. If the latter turns out to be the case in a cognitive domain, there is a clear sense in which the rule-theory is validated (it is exactly true) rather than faced with a competing alternative or relegated to the status of an approximate convenience.35 Consider pattern associators like Rumelhart and McClelland's, which give symbolic output from symbolic input. Under a variety of conditions, such a device will function as a rule-implementer.
To take only the simplest case, suppose that all connection weights are 0 except those from the input node for feature fi to the output node for fi, which are set to 1. Then the network will implement the identity map. There is no read-head, write-head, or executive overseeing the operation, yet it is legitimate and even enlightening to speak of it in terms of rules manipulating symbols. More realistically, one can abstract from the RM pattern associator an implicit theory implicating a "representation" consisting of a set of unordered Wickelfeatures and a list of "rules" replacing Wickelfeatures with other Wickelfeatures. Examining the properties of such rules and representations is quite revealing. We can find out what it takes to add /d/ to a stem; what it takes to reverse the order of phonemes in an input; whether simple local modifications of a string are more easily handled than complex global ones; and so on. The results we obtain carry over without modification to the actual pattern associator, where much more complex conditions prevail. The deficiencies of Wickelphone/Wickelfeature transformation are as untouched by the addition of thresholds, logistic probability functions, temperatures, and parameters of that ilk as they are by whether the program implementing the model is written in Fortran or C.

35Note as well that many of the examples offered to give common-sense support to the desirability of eliminating rules are seriously misleading, because they appeal to a confusion between attributing a rule-system to an entity and attributing the wrong rule-system to an entity. An example that Rumelhart and McClelland cite, in which it is noted that bees can create hexagonal cells in their hive with no knowledge of the rules of geometry, gains its intuitive force because of this confusion.
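The identity-map case described above can be made concrete. The following is a minimal sketch of our own construction (not code from the RM simulations): a pattern associator whose only nonzero weights run from input feature i to output feature i copies any input pattern, implementing the rule "output = input" with no read-head, write-head, or explicit rule representation anywhere.

```python
# Sketch (ours, not the RM code): identity weights make a pattern
# associator implement the symbolic rule "copy the input".

N = 8  # number of feature units (illustrative size)

# weights[i][j]: connection from input unit i to output unit j;
# zero everywhere except the diagonal.
weights = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]

def propagate(pattern, threshold=0.5):
    """Linear-threshold units: output unit j fires if its net input
    exceeds the threshold."""
    out = []
    for j in range(N):
        net = sum(pattern[i] * weights[i][j] for i in range(N))
        out.append(1 if net > threshold else 0)
    return out

stem = [1, 0, 1, 1, 0, 0, 1, 0]  # an arbitrary binary feature vector
print(propagate(stem) == stem)   # the network obeys the rule exactly
```

Every consequence of the identity rule holds of the network's behavior, which is the sense in which a rule-theory can be exactly true of a network rather than an approximation to it.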
An important role of higher-level theory, as Marr for one has made clear, is to delineate the basic assumptions that lower-level models must inevitably be built on. From this perspective, the high-level theory is not some approximation whose behavior offers a gross but useful guide to reality. Rather, the relation is one of embodiment: the lower-level theory embodies the higher-level theory, and it does so with exactitude. The RM model has a theory of linguistic knowledge associated with it; it is just that the theory is so unorthodox that one has to look with some care to find it. But if we want to
understand the model, dealing with the embodied theory is not a convenience, but a necessity, and it should be pushed as far as possible.

8.2.1. When does a network implement a rule?

Nonetheless, as we pointed out in the Introduction, it is not a logical necessity that a cognitive model implement a symbolic rule system, either a traditional or a revisionist one; the "eliminative" or rule-as-approximation connectionism that Rumelhart, McClelland, and Smolensky write about (though do not completely succeed in adhering to in the RM model) is a possible outcome of the general connectionist program. How could one tell
the difference? We suggest that the crucial notion is the motivation for a network's structure.
In a radical or eliminative connectionist model, the overall properties of the rule-theory of a domain are not only caused by the mechanisms of the micro-theory (that is, the stipulated properties of the units and connections) but follow in a natural way from micro-assumptions that are well-motivated on grounds that have nothing to do with the structure of the domain under macro-scrutiny. The rule-theory would have second-class status because its assumptions would be epiphenomena: if you really want to understand why things take the shape they do, you must turn not to the axioms of a rule-theory but to the micro-ecology that they follow from. The intuition behind the symbolic paradigm is quite different: here rule-theory drives micro-theory;
we expect to find many characteristics of the micro-level which make no micro-sense, do not derive from natural micro-assumptions or interactions, and can only be understood in terms of the higher-level system being implemented.
The RM pattern associator again provides us with some specific examples. As noted, it is surely significant that the regular past-tense morphology leaves the stem completely unaltered. Suppose we attempt to encode this in the pattern associator by pre-setting it for the identity map; then for the vast majority of items (perhaps more than 95% on the whole vocabulary), most connections will not have to be changed at all. In this way, we might be able to make the learner pay (in learning time) for divergences from identity. But such a setting has no justification from the micro-level perspective, which conduces only to some sort of uniformity (all weights 0, for example, or all random); the labels that we use from our perspective as theorists are invisible to the units themselves, and the connections implementing the identity map are indistinguishable at the micro-level from any other connections. Wiring it in is an implementational strategy driven by outside considerations, a fingerprint of the macro-theory.

An actual example in the RM model as it stands is the selective blurring of Wickelfeature representations. When the Wickelfeature ABC is part of an input stem, extra Wickelfeatures XBC and ABY are also turned on, but AXC is not: as we noted above (see also Lachter & Bever, 1988), this is motivated by the macro-principles that individual phonemes are the significant units of analysis and that phonological interactions, when they occur, generally involve adjacent pairs of segments. It is not motivated by any principle of micro-level connectionism.

Even the basic organization of the RM model, simple though it is, comes from motives external to the micro-level. Why should it be that the stem is mapped to the past tense, that the past tense arises from a modification of the stem? Because a sort of intuitive proto-linguistics tells us so.
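The selective blurring just described can be sketched in a few lines. This is our own reconstruction of the scheme for illustration (the actual model also activates the extra Wickelfeatures only probabilistically): for an input Wickelfeature (A, B, C), variants that alter a flanking element, (X, B, C) or (A, B, Y), are activated too, but central variants (A, X, C) are not, preserving the central phoneme as the significant unit of analysis.

```python
# Sketch (our reconstruction, not the RM code) of macro-motivated
# blurring: flanking contexts are blurred, the central feature is not.

def blur(wickelfeature, alternatives):
    """Return the set of feature triples activated by one Wickelfeature."""
    a, b, c = wickelfeature
    active = {(a, b, c)}
    for x in alternatives:
        active.add((x, b, c))  # vary the left context only
    for y in alternatives:
        active.add((a, b, y))  # vary the right context only
    return active              # (a, X, c) variants are never added

feats = blur(("stop", "vowel", "nasal"),
             ["stop", "vowel", "nasal", "fricative"])
print(("fricative", "vowel", "nasal") in feats)  # flank blurred: True
print(("stop", "fricative", "nasal") in feats)   # centre blurred: False
```

The asymmetry between flanking and central positions is precisely the kind of structure that makes sense only as an implementation of a macro-level principle, not as a consequence of any micro-level assumption.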
It is easy to set up a network in which stem and past tense are represented only in terms of their semantic features, so that generalization gradients are defined over semantic similarity (e.g. hit and strike would be subject to similar changes in the past tense), with the unwelcome consequence that no phonological relations will 'emerge'. Indeed, the telling argument against the RM pattern associator as a model of linguistic knowledge is that its very design forces it to blunder past the major generalizations of the English system. It is not unthinkable that many of the design flaws could be overcome, resulting in a connectionist network that learns more insightfully. But subsymbolism or eliminative connectionism, as a radical metatheory of cognitive science, will not be vindicated if the principal structures of such hypothetical improved models turn out to be dictated by higher-level theory rather than by micro-necessities. To the extent that connectionist models are not mere isotropic node tangles, they will themselves have properties that call out for explanation. We expect that in many cases, these explanations will constitute the macro-theory of the rules that the system would be said to implement.

Here we see, too, why radical connectionism is so closely wedded to the notion of blank slates, simple learning mechanisms, and vectors of "teaching" inputs juxtaposed unit-by-unit with the networks' output vectors. If you really want a network not to implement any rules at all, the properties of the units and connections at the micro-level must suffice to organize the network into something that behaves intelligently. Since these units are too simple and too oblivious to the requirements of the computational problem that the entire network will be required to solve to do the job, the complexity of the system must derive from the complexity of the set of environmental inputs causing the units to execute their simple learning functions. One explains the organization of the system, then, only in terms of the structure of the environment, the simple activation and learning abilities of the units, and the tools and language of those aspects of statistical mechanics apropos to the aggregate behavior of the units as they respond to environmental contingencies (as in Hinton & Sejnowski, 1986; Smolensky, 1986); the rules genuinely would have no role to play.

As it turns out, the RM model requires both kinds of explanation, implemented macrotheory and massive supervised learning, in accounting for its asymptotic organization. Rumelhart and McClelland made up for the model's lack of proper rule-motivated structure by putting it into a teaching environment that was unrealistically tailored to produce much of the behavior they wanted to see. In the absence of macro-organization the environment
must bear a very heavy burden.
Rumelhart and McClelland (1986a, p. 143) recognize this implication clearly and unflinchingly in the two paragraphs they devote in their volumes to answering the question "Why are People Smarter than Rats?":
Given all of the above [the claim that human cognition and the behavior of lower animals can be explained in terms of PDP networks], the question does seem a bit puzzling. ... People have much more cortex than rats do or even than other primates do; in particular they have very much more ... brain structure not dedicated to input/output, and presumably, this extra cortex is strategically placed in the brain to subserve just those functions that differentiate people from rats or even apes. ... But there must be another aspect to the difference between rats and people as well. This is that the human environment includes other people and the cultural devices that they have developed to organize their thinking processes.
We agree completely with one part: that the plausibility of radical connectionism is tied to the plausibility of this explanation.
8.3. Properties of parallel distributed processing models
In our view the more interesting points raised by an examination of the RM model concern the general adequacy of the PDP mechanisms it uses, for it is this issue, rather than the metatheoretical ones, that will ultimately have the most impact on the future of cognitive science. The RM model is just one early example of a PDP model of language, and Rumelhart and McClelland make it clear that it has been simplified in many ways and that there are many paths for improvement and continued development within the PDP framework. Thus it would be especially revealing to try to generalize the results of our analysis to the prospects for PDP models of language in general. Although the past tense rule is a tiny fragment of knowledge of language, many of its properties that pose problems for the RM model are found in spades elsewhere. Here we point out some of the properties of the PDP architecture used in the RM model that seem to contribute to its difficulties and hence
which will pose the most challenging problems to PDP models of language.

8.3.1. Distributed representations

PDP models such as RM's rely on 'distributed' representations: a large-scale entity is represented by a pattern of activation over a set of units rather than by turning on a single unit dedicated to it. This would be a strictly implementational claim, orthogonal to the differences between connectionist and symbol-processing theories, were it not for an additional aspect: the units have semantic content; they stand for (that is, they are turned on in response to) specific properties of the entity, and the entity is thus represented solely in terms of which of those properties it has. The links in a network describe strengths of association between properties, not between individuals. The relation between features and individuals is one-to-many in both directions: each individual is described as a collection of many features, and each feature plays a role in the description of many individuals.

Hinton et al. (1986) point to a number of useful characteristics of distributed representations. They provide a kind of content-addressable memory, from which individual entities may be called up through their properties. They provide for automatic generalization: things true of individual X can be inherited by individual Y inasmuch as the representation of Y overlaps that of X (i.e. inasmuch as Y shares properties with X) and activation of the overlapping portion during learning has been correlated with generalizable
properties. And they allow for the formation of new concepts in a system via new combinations of properties that the system already represents.

It is often asserted that distributed representation using features is uniquely available to PDP models, and stands as the hallmark of a new paradigm of cognitive science, one that calculates not with symbols but with what Smolensky (in press) has dubbed 'subsymbols' (basically, what Rumelhart, McClelland, and Hinton call 'microfeatures'). Smolensky puts it this way:
(18) Symbols and Context Dependence. In the symbolic paradigm, the context of a symbol is manifest around it, and consists of other symbols; in the subsymbolic paradigm, the context of a symbol is manifest inside it, and consists of subsymbols.
It is striking, then, that one aspect of distributed representation, featural decomposition, is a well-established tool in every area of linguistic theory, a branch of inquiry securely located in (perhaps indeed paradigmatic of) the 'symbolic paradigm'. Even more striking, linguistic theory calls on a version of distributed representation to accomplish the very goals that Hinton et al. (1986) advert to. Syntactic, morphological, semantic, and phonological entities are analyzed as feature complexes so that they can be efficiently content-addressed in linguistic rules; so that generalization can be achieved across individuals; so that 'new' categories can appear in a system from fresh combinations of features. Linguistic theory also seeks to make the correct generalizations inevitable given the representation. One influential attempt, the 'evaluation metric' hypothesis, proposed to measure the optimality of linguistic rules (specifically phonological rules) in terms of the number of features they refer to; choosing the most compact grammar would guarantee maximal generality. Compare in this regard Hinton et al.'s (1986, p. 84)
remark about types and instances:

... the relation between a type and an instance can be implemented by the relationship between a set of units [features] and a larger set [of features] that includes it. Notice that the more general the type the smaller the set of units [features] used to encode it. As the number of terms in an intensional [featural] description gets smaller, the corresponding extensional set [of individuals] gets larger.
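The inverse relation between intension and extension that this passage describes can be made concrete in a few lines (a sketch of our own; the miniature feature inventory is invented for the example and is not taken from Halle or from Hinton et al.):

```python
# Sketch (ours): the fewer the features in a description, the larger
# the class of segments it picks out. Each phoneme is nothing but a
# set of (toy) phonological features.

PHONEMES = {
    "p": {"-voice", "-cont", "lab"},
    "b": {"+voice", "-cont", "lab"},
    "t": {"-voice", "-cont", "cor"},
    "d": {"+voice", "-cont", "cor"},
    "k": {"-voice", "-cont", "dor"},
    "g": {"+voice", "-cont", "dor"},
    "s": {"-voice", "+cont", "cor"},
    "z": {"+voice", "+cont", "cor"},
}

def extension(description):
    # All phonemes whose feature sets include the description.
    return {ph for ph, fs in PHONEMES.items() if description <= fs}

print(extension({"-voice", "-cont", "cor"}))  # one segment: 't'
print(extension({"-voice", "-cont"}))         # the voiceless stops: p, t, k
print(extension({"-voice"}))                  # all voiceless obstruents
```

Dropping a feature from the intensional description can only enlarge the extensional set, which is why compact feature specifications pick out the general natural classes.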
The passage echoes exactly Halle's (1957, 1962) observation that the important general classes of phonemes were among those that could be specified by small sets of features. In subsequent linguistic work we find thorough and continuing exploration of a symbol-processing, content-addressing, automatically-generalizing rule-theory built, in part, on featural analysis. No distinction-in-principle between PDP and all that has gone before can be linked to
the presence or absence of featural decomposition (one central aspect of distributed representation) as the key desideratum. Features analyze the structure of paradigms, the way individuals contrast with comparable individuals, and any theory, macro, micro, or mini, that deals with complex entities can use them.

Of course, distributed representation in PDP models implies more than just featural decomposition: an entity is represented as nothing but the features it is composed of. Concatenative structure, constituency, variables, and their binding (in short, syntagmatic organization) are virtually abandoned. This is where the RM model and similar PDP efforts really depart from previous work, and also where they fail most dramatically.

A crucial problem is the difficulty PDP models have in representing individuals and variables (this criticism is also made by Norman, 1986, in his generally favorable appraisal of PDP models). The models represent individual objects as sets of their features. Nothing, however, represents the fact that a collection of features corresponds to an existing individual: that it is distinct from a twin that might share all its features, or that an object similar to a previously viewed one is a single individual that has undergone a change, as opposed to two individual objects that happen to resemble one another, or that a situation has undergone a change if two identical objects have switched positions.36 In the RM model, for example, this problem manifests itself in the inability to supply different past tenses for homophonous verbs such as wring and ring, or to enforce a categorical distinction between morphologically disparate verbs that are given similar featural representations, such as become and succumb, to mention just two of the examples discussed in Section 4.3. As we have mentioned, a seemingly obvious way to handle this problem, just increasing the size of the feature set so that more distinctions can be encoded, will not do.
For one thing, the obvious kinds of features to add, such as semantic features to distinguish homophones, give the model too much power, as we have mentioned: it could use any semantic property or combination of semantic and phonological properties to distinguish inflectional rules, whereas in fact only a relatively small set of features are ever encoded inflectionally in the world's languages (Bybee, 1985; Talmy, 1985). Furthermore, the crucial properties governing choice of inflection are not semantic at all but refer to abstract morphological entities such as basic lexical itemhood or roothood. Finally, this move would commit one to the prediction that semantically-related words are likely to have similar past tenses, which is just not true (compare, e.g., hit/hit versus strike/struck versus slap/slapped
36 We thank David Kirsh for these examples.
(similar meanings, different kinds of past tenses), or stand/stood versus understand/understood versus stand out/stood out (different meanings, same kind of past tense)). Basically, increasing the feature set is only an approximate way to handle the problem of representing individuals; by making finer distinctions it makes it less likely that individuals will be confused but it still does not encode individuals as individuals. The relevant difference between wring and ring as far as the past tense is concerned is that they are different words, pure and simple.37

A second way of handling the problem is to add arbitrary features that simply distinguish words. In the extreme case, there could be a set of n features over which n orthogonal patterns of activation stand in one-to-one correspondence with n lexical items. This won't work, either. The basic problem is that distributed representations, when they are the only representations of objects, face the conflicting demands of keeping individuals distinct and providing the basis for generalization. As it stands, Rumelhart and McClelland must walk a fine line between keeping similar words distinct and getting the model to generalize to new inputs: witness their use of Wickelfeatures over Wickelphones, their decision to encode a certain proportion of incorrect Wickelfeatures, and their use of a noisy output function for the past tense units, all designed to blur distinctions and foster generalization (as mentioned, the effort was only partially successful, as the model failed to generalize properly to many unfamiliar stems). Dedicating some units to representing wordhood would be a big leap in the direction of nongeneralizability. With orthogonal patterns representing words, in the extreme case, word-specific output features could be activated accurately in every case and the discrepancy between computed output and teacher-supplied input needed to strengthen connections from the relevant stem features would never occur. Intermediate solutions, such as having a relatively small set of word-distinguishing features available to distinguish homophones with distinct endings, might help.
But given the extremely delicate balance between discriminability and generalizability, one won't know until it is tried, and in any case, it would at best be a hack that did not tackle the basic problem at hand: individuating individuals, and associating them with the abstract predicates that govern the permissible generalizations in the system.

The lack of a mechanism to bind sets of features together as individuals causes problems at the output end, too. A general problem for coarse-coded
37 Of course, another problem with merely increasing the feature set, especially if the features are conjunctive, is that the network can easily grow too large very quickly. Recall that Wickelphones, which in principle can make finer distinctions than Wickelfeatures, would have required a network with more than two billion connections.
representations is that nothing attributes each feature to the correct individual, leading to "illusory conjunctions" where, say, an observer may be unable to say whether he or she is seeing a blue circle and a red triangle or a red triangle and a blue circle (see Hinton et al., 1986; Treisman & Schmidt, 1982). The RM model simultaneously computes past tense output features corresponding to independent subregularities which it is then unable to keep separate, resulting in incorrect blends such as slept as the past tense of slip, a kind of self-generated phonological illusory conjunction. The current substitute for a realistic binding mechanism, namely the "whole-string binding network", does not do the job, and we are given no reason to believe that a more realistic and successful model is around the corner. The basic point is that the binding problem is a core deficiency of this kind of distributed representation, not a minor detail whose solution can be postponed to some later date.
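The homophone side of the individuation problem can be stated in a few lines of code (a deliberately trivial sketch of ours, not the RM model's actual Wickelfeature machinery): once a word is nothing but a bag of phonological features, wring and ring collapse into a single representation, and any output computed from that representation alone must treat them identically.

```python
# Toy illustration (ours): a purely featural, purely phonological
# encoding with no unit standing for wordhood. Homophones become
# literally identical objects.

def phon_features(phonemes):
    # Stand-in for a Wickelfeature-like encoding: an unordered set of
    # phoneme-pair context features.
    return frozenset(zip(phonemes, phonemes[1:]))

ring  = phon_features("#rIN#")   # ring -> rang
wring = phon_features("#rIN#")   # wring -> wrung, but the same sound

assert ring == wring  # the two representations are indistinguishable

def past_tense(features):
    # Whatever mapping a network computes, it is a function of the
    # featural representation alone ...
    return "rang" if features == ring else "?"

# ... so it necessarily gives the same answer for both words:
print(past_tense(ring), past_tense(wring))
```

No refinement of the mapping can help: since the inputs are identical, only a representation of the words as distinct lexical individuals could separate the outputs.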
The other main problem with features-only distributed representations is that they do not easily provide variables that stand for sets of individuals regardless of their featural decomposition, and over which quantified generalizations can be made. This dogs the RM model in many places. For example, there is the inability to represent certain reduplicative words, in which the distinction between a feature occurring once versus occurring twice is crucial, or in learning the general nature of the rule of reduplication, where a morpheme must be simply copied: one needs a variable standing for an occurrence of a morpheme independent of the particular features it is composed of. In fact, even the English regular rule of adding /d/ is never properly learned (that is, the model does not generalize it properly to many words), because in essence the real rule causes an affix to be added to a "word", which is a variable standing for any admissible phone sequence, whereas the model associates the family of /d/ features with a list of particular phone sequences it has encountered instead. Many of the other problems we have pointed out can also be traced to the lack of variables. We predict that the kind of distributed representation used in two-layer pattern-associators like the one in the RM model will cause similar problems anywhere they are used in modeling middle- to high-level cognitive processes.38 Hinton, McClelland, and Rumelhart (p. 82) themselves provide an example that (perhaps inadvertently) illustrates the general problem:
38 Within linguistic semantics, for example, a well-known problem is that if semantic representation is a set of features, how are propositional connectives defined over such feature sets? If P is a set of features, what function of connectionist representation will give the set for ~P?
People are good at generalizing newly acquired knowledge. If, for example, you learn that chimpanzees like onions you will probably raise your estimate of the probability that gorillas like onions. In a network that uses distributed representations, this kind of generalization is automatic. The new knowledge about chimpanzees is incorporated by modifying some of the connection strengths so as to alter the causal effects of the distributed pattern of activity that represents chimpanzees. The modification automatically changes the causal effects of all similar activity patterns. So if the representation of gorillas is a similar activity pattern over the same set of units, its causal effects will be changed in a similar way.
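The mechanism described in this passage is easy to exhibit in miniature (our own sketch; the feature lists, the delta-rule learning procedure, and the numbers are illustrative assumptions, not Hinton et al.'s actual network): training a single association automatically changes the response to every overlapping pattern.

```python
# Minimal sketch (ours): a one-layer associator over feature units.
# Training on one distributed pattern generalizes automatically to
# any pattern that shares active features with it.

FEATURES = ["hairy", "four-limbed", "large", "ape", "tail"]

def vec(active):
    # Distributed representation: an animal is nothing but its features.
    return [1.0 if f in active else 0.0 for f in FEATURES]

chimp   = vec({"hairy", "four-limbed", "ape"})
gorilla = vec({"hairy", "four-limbed", "ape", "large"})  # overlaps chimp
lizard  = vec({"four-limbed", "tail"})                   # little overlap

weights = [0.0] * len(FEATURES)  # connections to a "likes onions" unit

def likes_onions(x):
    return sum(w * xi for w, xi in zip(weights, x))

def train(x, target, rate=0.2, steps=25):
    # Delta-rule learning on the single output unit.
    for _ in range(steps):
        err = target - likes_onions(x)
        for i in range(len(weights)):
            weights[i] += rate * err * x[i]

train(chimp, 1.0)  # learn only this: chimpanzees like onions

print(likes_onions(chimp))    # close to 1.0
print(likes_onions(gorilla))  # raised to nearly the same level
print(likes_onions(lizard))   # raised only via the shared feature
```

The gorilla pattern inherits almost the full response because it contains all of the chimpanzee pattern's active features; the lizard pattern is affected only through the one feature it shares.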
As an account of human psychology, however, this will not do. People's inductive generalizations are not automatic responses to similarity (in any non-question-begging sense of similarity); they depend on the reasoner's unconscious theory of the domain, on the nature of the inductive inference to be made on that occasion, and on any theory-relevant knowledge (whether acquired verbally, in a single exposure, or inferred through circuitous means), in a way that can completely override similarity relations (Carey, 1985; DeJong & Mooney, 1986; Gelman & Markman, 1986; Keil, 1986; Osherson, Smith, & Shafir, 1986; Pazzani, 1987). To take one example, knowledge of how a set of perceptual features was caused, or knowledge of the kind that an individual is an example of, can override any generalizations inspired by the object's features themselves: for example, an animal that looks exactly like a skunk will nonetheless be treated as a raccoon if one is told that the stripe was painted onto an animal that had raccoon parents and raccoon babies (see Keil, 1986, who demonstrates that this phenomenon occurs in children and is not the result of formal schooling). Similarly, even a basketball ignoramus will not be seduced by the similarity relations holding among the typical starting players of the Boston Celtics and those holding among the starting players of the Los Angeles Lakers, and thus will not be tempted to predict that a blond player entering the game in a yellow shirt will run to the basket simply because other blond, yellow-shirted players did so. (Hair color, nonetheless, might be used in qualitatively different generalizations, such as which players will be selected to endorse hair care products.) The example from Pazzani and Dyer (1987) is one of many demonstrations of explanation-based learning, which has greater usefulness and greater fidelity to people's common-sense reasoning than the similarity-based learning that Hinton et al.'s example system performs automatically (see, e.g., DeJong & Mooney, 1986). Osherson et al. (1987) also analyse the use of similarity as a basis for generalization and show its inherent problems; Gelman and Markman (1986) show how preschool children shelve similarity relations when making inductive generalizations.

Inductive inference, then, is not a response to raw similarity but depends crucially on the structured propositional content of the knowledge involved: learning that all gorillas are exclusively carnivorous will lead to a different generalization about their taste for onions than learning that some are exclusively carnivorous, and learning that a particular gorilla who happens to have a broken leg does not like onions will not necessarily lead to any tendency to project that distaste onto other injured gorillas and chimpanzees. Though similarity surely plays a role in domains of which people are entirely ignorant, knowledge can totally alter or reverse any generalization that similarity suggests. Thus there is reason to doubt the ability of the automatic generalization properties of distributed representations to provide an account of human inductive inference in general. This is analogous to the fact we have been stressing throughout, namely that the past tense inflectional system is not a slave to similarity but is driven in precise ways by speakers' implicit theories of linguistic organization.

In sum, featural decomposition is an essential feature of standard symbolic models of language and cognition, and many of the successes of PDP models simply inherit these advantages. However, what is unique about the RM model and other two-layer pattern associators is the claim that individuals and types are represented as nothing but activated subsets of features. This claim cannot be sustained: features must be assigned to sets of individuals so as to pick out some properties and completely ignore others, differently on different occasions, depending on the knowledge brought to bear, and featural similarity need not enter into all the processes referring to the object. Some symbol referring to the object qua object, and some variable referring to task-relevant types of objects that cut across classes of featural similarity, are required.
8.3.2. Distinctions among subcomponents and abstract internal representations
The RM model collapses into a single input-output module a mapping that in rule-based accounts is a composition of several distinct subcomponents feeding information into one another, such as derivational morphology and inflectional morphology, or inflectional morphology and phonology. This, of course, is what gives it its radical look. If the subcomponents of a traditional account were kept distinct in a PDP model, mapping onto distinct subnetworks or pools of units with their own inputs and outputs, or onto distinct layers of a multilayer network, one would naturally say that the network simply implemented the traditional account. But it is just the factors that differentiate Rumelhart and McClelland's collapsed one-box model from the traditional accounts that cause it to fail so noticeably.

Why was the traditional account a composition of subcomponents to begin with? The principal reason is that when one breaks a system down into components, the components must communicate by passing information, internal representations, among themselves. But because these internal representations are neither inputs to the system nor outputs of it, they cannot be learned in any obvious way. (Chomsky, 1981, calls this the argument from the poverty of the stimulus.) Sequences of morphemes resulting from factoring out phonological changes are one kind of abstract representation used in rule systems; lexical entries distinct from phonetic representations are another; morphological roots are a third. The RM model is thus composed of a single module mapping from input directly to output in part because there is no realistic way for its convergence procedure to learn the internal representations of a modular account properly.
Models of language were not designed for arbitrary reasons and preserved as quaint traditions; the distinctions they make are substantive claims motivated by empirical facts, and they cannot be obliterated unless a new model provides equally compelling accounts of those facts. A model that can record hundreds of thousands of first-order correlations can simulate some but not all of this structure, and it is unable to explain it or to account for the structures that emerge from other cognitive domains that are rich in data and theory. It is unlikely that any model will be able to obliterate distinctions among subcomponents and their corresponding forms of abstract internal representations that have been independently motivated by detailed study of a domain of cognition. This alone will sharply brake any headlong movement away from the kinds of theories that have been constructed within the symbolic framework.
S. Pinker and A. Prince
8.3.3. Discrete, categorical rules
Despite the graded and frequency-sensitive responses made by children, and by adults in their speech errors and analogical extensions of parts of the strong verb system, many aspects of knowledge of language result in categorical judgments of ungrammaticality. This fact is difficult to reconcile with any mechanism that at asymptote leaves a number of candidates at suprathreshold strength and allows them to compete probabilistically for expression (Bowerman, 1987, also makes this point). In the present case, adult speakers assign a single past tense form to words they represent as being regular even if several subregularities bring candidates to mind (e.g. brought/*brang/*bringed); and subregularities that may have been partially productive in childhood are barred from generating past tense forms when verbs are derived from other syntactic categories (e.g. *pang; *high-stuck) or are registered as being lexical items distinct from those exemplifying the subregularities (e.g. *I broke the car). Categorical judgment of ungrammaticality is a common (though not all-pervasive) property of linguistic judgments of novel words and strings, and cannot be predicted by semantic interpretability or any measure of similarity to known words or strings (e.g. *I put; *The child seems sleeping; *What did you see something?). Obviously PDP models can be supplemented with sharpening circuits; the question is whether the models can be built other than by implementing standard symbolic theories, in which the quantitatively strongest output prior to the sharpening circuit invariably corresponds to the unique qualitatively appropriate response.
8.3.4. Unconstrained correlation extraction
It is often considered a virtue of PDP models that they are powerful learners: virtually any amount of statistical correlation among the features in a set of inputs can be soaked up by the weights on the dense set of interconnections. But human generalization is far more constrained. In the case of the RM model, we saw how it can acquire rules that are not found in any language, such as nonlocal conditioning of phonological changes or mirror-reversal of phonetic strings. This problem would get even worse if the set of feature units was expanded to represent other kinds of information in an attempt to distinguish homophonous or phonologically similar forms. The model also exploits subregularities (such as those of the irregular classes) that adults at best do not exploit productively (slip/*slept and peep/*pept) and at worst are completely oblivious to (e.g. lexical causatives like sit/set, lie/lay, fall/fell, rise/raise, which are never generalized to cry/*cray). The types of inflection found across human languages involve a highly constrained subset of the logically possible semantic features, feature combinations, phonological alterations, items admitting of inflection, and agreement relations (Bybee, 1985; Talmy, 1985). For example, to represent the literal meanings of the verbs brake and break the notion of a man-made mechanical device is relevant, but no language has different past tenses or plurals for a distinction between man-made versus natural objects, despite the cognitive salience of that notion. And the constrained nature of the variation in other components of language such as syntax has been a dominant theme of modern linguistic research.
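The point about unconstrained correlation extraction can be demonstrated directly (a sketch of ours using a bare delta-rule linear associator, not the RM architecture itself): such a learner absorbs a mirror-reversal mapping, unattested in any human language, exactly as readily as faithful stem copying, because to the learner both are just sets of input-output correlations.

```python
import itertools

# Sketch (ours): a linear pattern associator trained by the delta rule
# is indifferent to linguistic naturalness. It learns mirror-reversal,
# a "rule" found in no human language, as easily as stem copying.

ALPHABET = "ptka"
SLOTS = 3  # toy three-phoneme "words"
N = SLOTS * len(ALPHABET)

def encode(word):
    # Positional one-hot encoding of the word's phonemes.
    v = [0.0] * N
    for pos, ph in enumerate(word):
        v[pos * len(ALPHABET) + ALPHABET.index(ph)] = 1.0
    return v

words = ["".join(w) for w in itertools.product(ALPHABET, repeat=SLOTS)]

def train_and_score(pairs, epochs=80, rate=0.1):
    W = [[0.0] * N for _ in range(N)]
    for _ in range(epochs):
        for x, t in pairs:
            y = [sum(W[o][i] * x[i] for i in range(N)) for o in range(N)]
            for o in range(N):
                err = t[o] - y[o]
                for i in range(N):
                    W[o][i] += rate * err * x[i]
    # mean squared error over the training set after learning
    tot = 0.0
    for x, t in pairs:
        y = [sum(W[o][i] * x[i] for i in range(N)) for o in range(N)]
        tot += sum((ti - yi) ** 2 for ti, yi in zip(t, y))
    return tot / len(pairs)

copying  = [(encode(w), encode(w)) for w in words]        # natural mapping
reversal = [(encode(w), encode(w[::-1])) for w in words]  # unattested rule

print(train_and_score(copying), train_and_score(reversal))  # both near zero
```

Nothing in the learning mechanism favors the mapping that human languages actually use; the constraint would have to come from somewhere else.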
The most natural response of a PDP theorist to our criticisms would be to retreat from the claim that the RM model in its current form is to be taken seriously. After all, the model uses only the simplest devices in the PDP armamentarium, devices that PDP theorists in general have been moving away from. Perhaps it is the limitations of these simplest devices, two-layer pattern association networks, that cause problems for the RM model, and these problems would all diminish if more sophisticated kinds of PDP networks were used. Thus the claim that PDP networks rather than rules provide an exact and detailed account of language would survive.

In particular, two interesting kinds of networks, the Boltzmann Machine (Hinton & Sejnowski, 1986) and the Back-Propagation scheme (Rumelhart et al., 1986), have been developed recently that have hidden units or intermediate layers between input and output. These hidden units function as internal representations, and as a result such networks are capable of computing functions that cannot be computed by two-layer networks of the RM variety. Furthermore, in many interesting cases the models have been able to learn their internal representations. For example, the Rumelhart et al. model changes not only the weights of the connections to its output units in response to an error with respect to the teaching input, but it propagates the error signal backwards to the intermediate units and changes their weights in the direction that alters their aggregate effect on the output in the right direction. Perhaps, then, a multilayered PDP network with back-propagation learning could avoid the problems of the RM model.

There are three reasons why such speculations are basically irrelevant to the points we have been making.
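The back-propagation scheme just described can be rendered in a minimal sketch (ours, not Rumelhart et al.'s actual implementation; the network size, learning rate, and epoch count are arbitrary choices): the error recorded at the output layer is propagated backwards to tune the hidden units, allowing the network to learn exclusive-or, which no two-layer associator can represent.

```python
import math, random

# Minimal back-propagation sketch (ours): a 2-input, 4-hidden-unit,
# 1-output sigmoid network trained on exclusive-or.

random.seed(0)
sig = lambda z: 1.0 / (1.0 + math.exp(-z))

N_IN, N_HID = 2, 4
w1 = [[random.uniform(-1, 1) for _ in range(N_IN + 1)] for _ in range(N_HID)]
w2 = [random.uniform(-1, 1) for _ in range(N_HID + 1)]

def forward(x):
    # Hidden activations and output; the final slot of each weight row
    # is a bias connection.
    h = [sig(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w1]
    y = sig(sum(w * v for w, v in zip(w2, h + [1.0])))
    return h, y

data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]  # exclusive-or

def total_error():
    return sum((t - forward(x)[1]) ** 2 for x, t in data)

before = total_error()
for _ in range(6000):
    for x, t in data:
        h, y = forward(x)
        d_out = (t - y) * y * (1 - y)  # error signal at the output unit
        # Propagate the error signal backwards to the hidden units:
        d_hid = [d_out * w2[j] * h[j] * (1 - h[j]) for j in range(N_HID)]
        for j in range(N_HID):
            w2[j] += 0.5 * d_out * h[j]
            for i in range(N_IN):
                w1[j][i] += 0.5 * d_hid[j] * x[i]
            w1[j][N_IN] += 0.5 * d_hid[j]
        w2[N_HID] += 0.5 * d_out

print(before, "->", total_error())  # training error drops sharply
```

Note that nothing in this procedure guarantees convergence in general; as discussed below in the text, the outcome depends on the starting configuration, the learning parameters, and the topology of the network.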
First, the claims that there is a problem in standard accounts, and that we must revise our understanding of linguistic information processing, are based on the putative success of their existing model. Given that their existing model does not do the job it is said to do, the claims must be rejected. If a PDP advocate were to eschew the existing RM model and appeal to more powerful mechanisms, the only claim that may be made is that there could exist a model of unspecified design that may or may not account for past tense acquisition without the use of rules, and that if it did, we should revise our understanding of language. This hypothetical claim deserves as little of our attention as any other claim about the consequences of a nonexistent model.

Second, a successful PDP model of more complex design may be nothing more than an implementation of a symbolic rule-based account. The advantage of a multilayered model is precisely that it is free from the constraints that so sharply differentiate the RM model from standard ones, namely, the lack of internal representations and subcomponents. Multilayered networks, and other sophisticated network models such as those that have one network gate the connections between two others, or networks that can simulate semantic networks, production systems, or LISP primitive operations (Hinton, 1981; Touretzky, 1986; Touretzky & Hinton, 1985), are appealing because they have the ability to mimic or implement the operations and representations needed in traditional symbolic accounts (though perhaps with some twists). We do not doubt that it would be possible to implement a rule system in networks with multiple layers: after all, it has been known for over 45 years that neuron-like elements can function as logic gates and that hence networks consisting of interconnected layers of such elements can compute propositions (McCulloch & Pitts, 1943). Furthermore, given what we know about neural information processing and plasticity, it seems likely that the elementary operations of symbolic processing will have to be implemented in a system consisting of massively parallel interconnected stochastic units in which the effects of learning are manifest in changes in connections. These facts have always been very uncontroversial at the foundations of the realist interpretation of symbolic models of cognition; they do not signal a departure of any sort from standard symbolic accounts. Perhaps a multilayered or gated multinetwork system could solve the tasks of inflection acquisition without simply implementing standard grammars (for example, it might behave discrepantly from a set of rules in a way that mimicked people's systematic divergence from their set of rules, or its intermediate layers might be totally opaque in terms of what they represent).
Third, as we mentioned in a previous section, the really radical claim is that there are models that can learn their internal organization through a process that can be exhaustively described as an interaction between the correlational structure of environmental inputs and the aggregate behavior of the units as they execute their simple activation and learning functions in response to those inputs. Again, this is no more than a vague hope. An important technical problem is that when intermediate layers of complex networks have to learn anything in the local unconstrained manner characteristic of PDP models, they are one or more layers removed from the output layer at which discrepancies between actual and desired outputs are recorded. Their inputs and outputs no longer correspond in any direct way to overt stimuli and responses, and the steps needed to modify their weights are no longer transparent. Since differences in the setting of each tunable component of the system cannot be evaluated in isolation (their effects combine in complex ways with the effects of the weight changes of other units before affecting the output layer), it is harder to ensure that the intermediate layers will be properly tuned by local adjustments propagating backwards. Rumelhart et al. (1986) have dealt with this problem in clever ways, with some interesting successes in simple domains such as learning to add two-digit numbers, detecting symmetry, and learning the exclusive-or function. But success is not guaranteed: such networks can settle on incorrect solutions defined by local minima of the energy landscape defined over the space of possible weights, and such factors as the starting configuration, the order of inputs, the several parameters of the learning function, the number of hidden units, and the innate topology of the network can make a difference.

These problems are exactly that, problems. They do not demonstrate that interesting PDP models of language are impossible in principle. At the same time, they show that there is no basis for the belief that connectionism will dissolve the difficult puzzles of language, or even provide radically new solutions to them. As for the present, we have shown that the paradigm example of a PDP model of language can claim nothing more than a superficial fidelity to some first-order regularities of language. More is known than just the
first-order regularities, and when the deeper and more diagnostic patterns are examined with care, one sees not only that the PDP model is not a viable alternative to symbolic theories, but that the symbolic account is supported in virtually every aspect. Principled symbolic theories of language have achieved success with a broad spectrum of empirical generalizations, some of considerable depth, ranging from properties of linguistic structure to patterns
?Verb means that we regard Verb as somewhat less strong than usual, particularly as a form in the class where it is listed. The notation ??Verb means that we regard Verb as obsolete (particularly in the past), but recognizable, the kind of thing one picks up from reading. The notation (+) means that the verb, in our judgment, admits a regular form. Notice that obsolescence does not imply regularizability: a few verbs simply seem to lack a usable past tense or past participle. We have found that judgments differ from dialect to dialect, with willingness-to-regularize running up a cline from British English (south-of-London) to Canadian (Montreal) to American (general).
Prefixed forms are listed when the prefix-root combination is not semantically … by its lax counterpart. In English, due to the Great Vowel Shift, the notion of a tense-lax counterpart is slightly odd: the alternations are not i-I, e-E, u-U, and so on, but rather i-E, o-o/a, ay-I, e-e, u-o/a. The term ablaut refers …
2. T/D with laxing
3a. burn, …
Suffix, with laxing: feel, ?kneel (+), keep, leap (+), sleep, sweep (+), weep
39. He knit a sweater is possible, not ??He knit his brows.
40. As in poker, bridge, or defense contracts.
41. The adjective is only wedded.
42. Mainly an adjective.
3d. x - ought: buy, bring
4. Overt -D
4a. Satellitic: flee, say, hear
4b. Drop stem consonant: have, make
4c. With ablaut [E - o - o]: sell, tell
4d. do
II. E - O ablaut class
1. x - O - O + n: freeze, speak, ??bespeak, steal, weave (+),43 ?heave (+),44 get, forget, ??beget, ??tread,45 swear, tear, wear, ?bear, ??forbear, ??forswear
2. Satellitic x - o - o + n: choose
43. Only in reference to carpets etc. is the strong form possible: *The drunk wove down the road. The adjective is woven.
44. Only nautical: heave to/hove to. *He hove his lunch. Past participle not *hoven.
45. Though trod is common in British English, it is at best quaint in American English.
Language and connectionism
III. I - ae/A group
1. I - ae - A: ring, sing, spring; drink, shrink, sink, stink; swim, begin
2. I - A - A: …
fly, ?slay
2. e - u - e + n: …
ay - aw - aw: bind, find, grind, wind
4a. ay - o - I: rise, arise, write, ?smite, ride, drive, strive
4b. ay - o: dive, shine,47 stride, thrive
Miscellaneous
fall, befall (cf. get, got)
hold, behold (cf. tell, told)
come, become
eat
beat
satellite of blow-class: …
4. Miscellaneous
5. Regular but for past participle
a. Add -n to stem (all allow -ed in the participle): sow, show, sew, prove, shear, strew
b. Add -n to ablauted stem: swell
A remark. A number of strong participial forms survive only as adjectives (most, indeed, somewhat unusual): cleft, cloven, girt, gilt, hewn, pent, bereft, shod, wrought, laden, mown, sodden, clad, shaven, drunken, (mis)shapen. The verb crow admits a strong form only in the phrase the cock crew; notice that the rooster crew is distinctly peculiar and Melvin crew over his victory is unintelligible. Other putative strong forms like leant, clove, abode, durst, chid, and sawn seem to us to belong to another language.
References

Anderson, J.A., & Hinton, G.E. (1981). Models of information processing in the brain. In G.E. Hinton & J.A. Anderson (Eds.), Parallel models of associative memory. Hillsdale, NJ: Erlbaum.
Anderson, J.R. (1976). Language, memory, and thought. Hillsdale, NJ: Erlbaum.
Anderson, J.R. (1983). The architecture of cognition. Cambridge, MA: Harvard University Press.
Armstrong, S.L., Gleitman, L.R., & Gleitman, H. (1983). What some concepts might not be. Cognition, 13, 263-308.
Aronoff, M. (1976). Word formation in generative grammar. Cambridge, MA: MIT Press.
Berko, J. (1958). The child's learning of English morphology. Word, 14, 150-177.
Bloch, B. (1947). English verb inflection. Language, 23, 399-418.
Bowerman, M. (1987). Discussion: Mechanisms of language acquisition. In B. MacWhinney (Ed.), Mechanisms of language acquisition. Hillsdale, NJ: Erlbaum.
Brown, R. (1973). A first language: The early stages. Cambridge, MA: Harvard University Press.
Bybee, J.L. (1985). Morphology: A study of the relation between meaning and form. Philadelphia: Benjamins.
Bybee, J.L., & Slobin, D.I. (1982). Rules and schemes in the development and use of the English past tense. Language, 58, 265-289.
Carey, S. (1985). Conceptual change in childhood. Cambridge, MA: Bradford Books/MIT Press.
Cazden, C.B. (1968). The acquisition of noun and verb inflections. Child Development, 39, 433-448.
Chomsky, N. (1957). Syntactic structures. The Hague: Mouton.
Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.
Chomsky, N. (1981). Lectures on government and binding. Dordrecht, Netherlands: Foris.
Chomsky, N., & Halle, M. (1968). The sound pattern of English. New York: Harper and Row.
Curme, G. (1935). A grammar of the English language II. Boston: Barnes & Noble.
de Jong, G.F., & Mooney, R.J. (1986). Explanation-based learning: An alternative view. Machine Learning, 1, 145-176.
Ervin, S. (1964). Imitation and structural change in children's language. In E. Lenneberg (Ed.), New directions in the study of language. Cambridge, MA: MIT Press.
Feldman, J.A., & Ballard, D.H. (1982). Connectionist models and their properties. Cognitive Science, 6, 205-254.
Fodor, J.A. (1968). Psychological explanation. New York: Random House.
Fodor, J.A. (1975). The language of thought. New York: T.Y. Crowell.
Francis, N., & Kucera, H. (1982). Frequency analysis of English usage: Lexicon and grammar. Boston: Houghton Mifflin.
Fries, C. (1940). American English grammar. New York: Appleton-Century.
Gelman, S.A., & Markman, E.M. (1986). Categories and induction in young children. Cognition, 23, 183-209.
Gleitman, L.R., & Wanner, E. (1982). Language acquisition: The state of the state of the art. In E. Wanner and L.R. Gleitman (Eds.), Language acquisition: The state of the art. New York: Cambridge University Press.
Gordon, P. (1986). Level-ordering in lexical development. Cognition, 21, 73-93.
Gropen, J., & Pinker, S. (1986). Constrained productivity in the acquisition of the dative alternation. Paper presented at the 11th Annual Boston University Conference on Language Development, October.
Halle, M. (1957). In defense of the Number Two. In E. Pulgram (Ed.), Studies presented to J. Whatmough. The Hague: Mouton.
Halle, M. (1962). Phonology in generative grammar. Word, 18, 54-72.
Halwes, T., & Jenkins, J.J. (1971). Problem of serial behavior is not resolved by context-sensitive memory models. Psychological Review, 78, 122-129.
Hinton, G.E. (1981). Implementing semantic networks in parallel hardware. In G.E. Hinton & J.A. Anderson (Eds.), Parallel models of associative memory. Hillsdale, NJ: Erlbaum.
Hinton, G.E., McClelland, J.L., & Rumelhart, D.E. (1986). Distributed representations. In D.E. Rumelhart, J.L. McClelland, and the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations. Cambridge, MA: Bradford Books/MIT Press.
Hinton, G.E., & Sejnowski, T.J. (1986). Learning and relearning in Boltzmann machines. In D.E. Rumelhart, J.L. McClelland, and the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations. Cambridge, MA: Bradford Books/MIT Press.
Hoard, J., & Sloat, C. (1973). English irregular verbs. Language, 49, 107-120.
Hockett, C. (1942). English verb inflection. Studies in Linguistics, 1.2, 1-8.
Jespersen, O. (1942). A modern English grammar on historical principles, VI. Reprinted 1961, London: George Allen & Unwin Ltd.
Keil, F.C. (1986). The acquisition of natural kinds and artifact terms. In W. Demopoulos & A. Marras (Eds.), Language learning and concept acquisition: Foundational issues. Norwood, NJ: Ablex.
Kiparsky, P. (1982a). From cyclical to lexical phonology. In H. van der Hulst & N. Smith (Eds.), The structure of phonological representations. Dordrecht, Netherlands: Foris.
Kiparsky, P. (1982b). Lexical phonology and morphology. In I.S. Yang (Ed.), Linguistics in the morning calm. Seoul: Hansin, pp. 3-91.
Kucera, H., & Francis, N. (1967). Computational analysis of present-day American English. Providence: Brown University Press.
Kuczaj, S.A. (1976). Arguments … Journal of Child Language, 3, 423-427.
Kuczaj, S.A. (1977). The acquisition of regular and irregular past tense forms. Journal of Verbal Learning and Verbal Behavior, 16, 589-600.
Kuczaj, S.A. (1978). Children's judgments of grammatical and ungrammatical irregular past tense verbs. Child Development, 49, 319-326.
Kuczaj, S.A. (1981). More on children's initial failure to relate specific acquisitions. Journal of Child Language, 8, 485-487.
Lachter, J., & Bever, T.G. (1988). The relation between linguistic structure and associative theories of language learning: A constructive critique of some connectionist learning models. Cognition, 28, 195-247.
Lakoff, G. (1987). Connectionist explanations in linguistics: Some thoughts on recent anti-connectionist papers. Unpublished electronic manuscript, ARPAnet.
Levy, Y. (1983). The use of nonce word tests in assessing children's verbal knowledge. Paper presented at the 8th Annual Boston University Conference on Language Development, October, 1983.
Liberman, M., & Pierrehumbert, J. (1984). Intonational invariance under changes in pitch range and length. In M. Aronoff & R. Oehrle (Eds.), Language sound structure. Cambridge, MA: MIT Press.
MacWhinney, B., & Snow, C.E. (1985). The child language data exchange system. Journal of Child Language, 12, 271-296.
MacWhinney, B., & Sokolov, J.L. (1987). The competition model of the acquisition of syntax. In B. MacWhinney (Ed.), Mechanisms of language acquisition. Hillsdale, NJ: Erlbaum.
Maratsos, M., Gudeman, R., Gerard-Nogo, P., & de Hart, G. (1987). A study in novel word learning: The productivity … In B. MacWhinney (Ed.), Mechanisms of language acquisition. Hillsdale, NJ: Erlbaum.
Maratsos, M., & Kuczaj, S.A. (1978). Against the transformationalist … overmarkings. Journal of Child Language, 5, 337-345.
Marr, D. (1982). Vision. San Francisco: Freeman.
McCarthy, J., & Prince, A. (forthcoming). Prosodic morphology.
McClelland, J.L., & Rumelhart, D.E. (1985). Distributed memory and the representation of general and specific information. Journal of Experimental Psychology: General, 114, 159-188.
McClelland, J.L., Rumelhart, D.E., & Hinton, G.E. (1986). The appeal of parallel distributed processing. In D.E. Rumelhart, J.L. McClelland, and the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations. Cambridge, MA: Bradford Books/MIT Press.
McClelland, J.L., Rumelhart, D.E., & The PDP Research Group. (1986). Parallel distributed processing: Explorations in the microstructure of cognition. Volume 2: Psychological and biological models. Cambridge, MA: Bradford Books/MIT Press.
McCulloch, W.S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.
Mencken, H. (1936). The American language. New York: Knopf.
Minsky, M. (1963). Steps toward artificial intelligence. In E.A. Feigenbaum & J. Feldman (Eds.), Computers and thought. New York: McGraw-Hill.
Newell, A., & Simon, H. (1961). Computer simulation of human thinking. Science, 134, 2011-2017.
Newell, A., & Simon, H. (1972). Human problem solving. Englewood Cliffs, NJ: Prentice-Hall.
Norman, D.A. (1986). Reflections on cognition and parallel distributed processing. In J.L. McClelland, D.E. Rumelhart, & The PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition. Volume 2: Psychological and biological models. Cambridge, MA: Bradford Books/MIT Press.
Osherson, D.N., Smith, E.E., & Shafir, E. (1986). Some origins of belief. Cognition, 24, 197-224.
Palmer, H. (1930). A grammar of spoken English on a strictly phonetic basis. Cambridge: W. Heffer.
Pazzani, M. (1987). Explanation-based learning for knowledge-based systems. International Journal of Man-Machine Studies, 26, 413-433.
Pazzani, M., & Dyer, M. (1987). A comparison of concept identification in human learning and network learning with the generalized delta rule. Unpublished manuscript, UCLA.
Pierrehumbert, J., & Beckman, M. (1986). Japanese tone structure. Unpublished manuscript, Bell Laboratories, Murray Hill, NJ.
Pinker, S. (1979). Formal models of language learning. Cognition, 7, 217-283.
Pinker, S., Lebeaux, D.S., & Frost, L.A. (1987). Productivity and constraints in the acquisition of the passive. Cognition, 26, 195-267.
Putnam, H. (1960). Minds and machines. In S. Hook (Ed.), Dimensions of mind: A symposium. New York: NYU Press.
Pylyshyn, Z.W. (1984). Computation and cognition: Toward a foundation for cognitive science. Cambridge, MA: Bradford Books/MIT Press.
Rosch, E., & Mervis, C.B. (1975). Family resemblances: Studies in the internal structure of categories. Cognitive Psychology, 7, 573-605.
Rosenblatt, F. (1962). Principles of neurodynamics. New York: Spartan.
Ross, J.R. (1975). Wording up. Unpublished manuscript, MIT.
Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning internal representations by error propagation. In D.E. Rumelhart, J.L. McClelland, and the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations. Cambridge, MA: Bradford Books/MIT Press.
Rumelhart, D.E., & McClelland, J.L. (1986a). PDP models and general issues in cognitive science. In D.E. Rumelhart, J.L. McClelland, and the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations. Cambridge, MA: Bradford Books/MIT Press.
Rumelhart, D.E., & McClelland, J.L. (1986b). On learning the past tenses of English verbs. In J.L. McClelland, D.E. Rumelhart, and the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition. Volume 2: Psychological and biological models. Cambridge, MA: Bradford Books/MIT Press.
Rumelhart, D.E., & McClelland, J.L. (1987). Learning the past tenses of English verbs: Implicit rules or parallel distributed processing? In B. MacWhinney (Ed.), Mechanisms of language acquisition. Hillsdale, NJ: Erlbaum.
Rumelhart, D.E., McClelland, J.L., and the PDP Research Group. (1986). Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations. Cambridge, MA: Bradford Books/MIT Press.
Sampson, G. (1987). A turning point in linguistics. Times Literary Supplement, June 12, 1987, 643.
Savin, H., & Bever, T.G. (1970). The nonperceptual reality of the phoneme. Journal of Verbal Learning and Verbal Behavior, 9, 295-302.
Shattuck-Hufnagel, S. (1979). Speech errors as evidence for a serial-ordering mechanism in sentence production. In W.E. Cooper & E.C.T. Walker (Eds.), Sentence processing: Psycholinguistic studies presented to Merrill Garrett. Hillsdale, NJ: Erlbaum.
Smolensky, P. (in press). On the proper treatment of connectionism. Behavioral and Brain Sciences.
Sommer, B.A. (1980). The shape of Kunjen syllables. In D.L. Goyvaerts (Ed.), Phonology in the 80's. Ghent: Story-Scientia.
Sweet, H. (1892). A new English grammar, logical and historical. Oxford: Clarendon Press.
Talmy, L. (1985). Lexicalization patterns: Semantic structure in lexical forms. In T. Shopen (Ed.), Language typology and semantic description, Vol. 3: Grammatical categories and the lexicon. New York: Cambridge University Press.
Touretzky, D. (1986). BoltzCONS: Reconciling connectionism with the recursive nature of stacks and trees. Proceedings of the Eighth Annual Conference of the Cognitive Science Society.
Touretzky, D., & Hinton, G.E. (1985). Symbols among the neurons: Details of a connectionist inference architecture. Proceedings of the Ninth International Joint Conference on Artificial Intelligence.
Treisman, A., & Schmidt, H. (1982). Illusory conjunctions in the perception of objects. Cognitive Psychology, 14, 107-141.
van der Hulst, H., & Smith, N. (Eds.) (1982). The structure of phonological representations. Dordrecht, Netherlands: Foris.
Wexler, K., & Culicover, P. (1980). Formal principles of language acquisition. Cambridge, MA: MIT Press.
Wickelgren, W.A. (1969). Context-sensitive coding, associative memory, and serial order in (speech) behavior. Psychological Review, 76, 1-15.
Williams, E. (1981). On the notions "lexically related" and "head of a word". Linguistic Inquiry, 12, 245-274.
The relation between linguistic structure and associative theories of language learning: A constructive critique of some connectionist learning models*

JOEL LACHTER
THOMAS G. BEVER
University of Rochester

*We are grateful for comments on earlier drafts of this paper from Gary Dell, Jeff Elman, Jerry Feldman, Jerry Fodor, Lou Ann Gerken, Steve Hanson, George Lakoff, Steve Pinker, Alan Prince, Zenon Pylyshyn, Patrice Simard, Paul Smolensky, and Ginny Valian. Requests for reprints should be addressed to J. Lachter or T.G. Bever, Department of Psychology, University of Rochester, Rochester, NY 14627, U.S.A. This work was completed while the first author was supported by a National Science Foundation pre-doctoral fellowship.

There's no safety in numbers ... or anything else. (Thurber)

Abstract

Recently proposed connectionist models of acquired linguistic behaviors have linguistic rule-based representations built in. Similar connectionist models of language acquisition have arbitrary devices and architectures which make them mimic the effect of rules. Connectionist models in general are not well-suited to account for the acquisition of structural knowledge, and require predetermined structures to simulate even basic linguistic facts. Such models are more appropriate for describing the formation of complex associations between structures which are independently represented. This makes connectionist models potentially important tools in studying the relations between frequent behaviors and the structures underlying knowledge representations. At the very least, such models may offer computationally powerful ways of demonstrating the limits of associationistic descriptions of behavior.

1. Rules and models

This paper considers the current status of proposals that connectionist systems of cognitive modelling can account for rule-governed behavior without directly representing the corresponding rules (Hinton & Anderson, 1981; Hanson
& Kegl, 1987a, b; McClelland & Rumelhart, 1986; Rumelhart & McClelland, 1986; Smolensky, in press). We find that those models which seem to exhibit regularities defined by structural rules and constraints do so only because of their ad hoc representations and architectures, which are manifestly motivated by such rules. We conclude that, at best, connectionist models may contribute
structures
representation
modelling are at the center of a current war. For the past 20 years, the dominant approach to algorithmic modelling of intelligent behavior has been in terms of 'production systems' (for discussions and references see Anderson, 1983; Neches, Langley, & Klahr, 1987). Production systems characteristically (but not criterially) utilize a set of statements and algorithmic steps
represents the organization underlying the behavior. The relationship between each pair of nodes is an activation function which specifies the strength with which one node's activation level affects another. Such systems are touted in part because their relationship to neuronal nets is transparent and enticing (Feldman & Ballard, 1982). Since the nodes are by definition interconnected, this paradigm for artificial intelligence has become known as 'connectionism' (Dell, 1986; Feldman & Ballard, 1982; Grossberg, 1987; Hinton & Sejnowski, 1983; Hopfield, 1982; McClelland & Rumelhart, 1981; for general references on connectionism, see Feldman et al., 1985; McClelland & Rumelhart, 1986).
Connectionist modelling defines sets of computational languages based on network structures and activation functions. In certain configurations, such languages can map any Boolean input/output function. Thus, connectionism is no more a psychological theory than is Boolean algebra. Its value for psychological theory in general can be assessed only in specific models. Language offers one of the most complex challenges to any theoretical paradigm. Accordingly, we concentrate our attention on some recent connectionist models devoted to the description of language behaviors. After some initial consideration of models of acquired language behavior, we turn to models which purport to learn language behaviors from analogues to normal input.
Such models for language indeed do not contain algorithmic rules of the sort used in production systems. Some of these models, however, work at their pre-defined tasks because they have special representations built in, which are connectionist-style implementations of rule-based descriptions of language (Dell, 1986; Hanson & Kegl, 1987; McClelland & Elman, 1986). Another model, touted because it does not contain rules, actually contains arbitrary computational devices which implement fragments of rule-based representations. We show that it is just these devices that crucially enable this model to simulate some properties of rule-based regularities in the input data (Rumelhart & McClelland, 1986b). Thus, none of these connectionist models succeed in accounting for behaviors which in speakers are the result of linguistic rules.
2. Models of acquired language behaviors
We first consider some models of adult skill. The goal in these cases is to describe a regular input/output behavior in a connectionist configuration. Dell's model of speech production serves as a case-in-point (based on Dell, 1986, and personal communication; see Figure 1). There are two intersecting subsystems: one for the linguistic representation, and one for sequencing the output. In the linguistic representation, the nodes are organized in four separate levels: words, syllables, phonemes, and phonological features. Each word describes a hierarchy specifying the order of the component syllables, which in turn specify their component phonemes, which in turn specify bundles of phonetic features. The sequencing system activates elements which receive input from both the linguistic subsystem and the sequencing subsystem, which results in their being produced in a specified order. As each phoneme is activated, it in turn activates all the features and all the syllables to which it is connected, even ones not relevant in the current word; then those features, words and syllables can in turn activate other phonemes. This pattern of radiating activation automatically activates relatively strongly just those irrelevant words and syllables with structurally similar descriptions. Accordingly, the model predicts that errors can occur as a function of the activation of irrelevant phones, syllables and words, but primarily those in structurally similar positions. It is well known that these are just the kind of
[Figure 1]
speech errors that occur: exchanges between sounds and words in structurally similar positions.
It is crucial that the nodes are arrayed at different levels of lexical and phonological representation. Each of these levels has some pre-theoretic intuitive basis, but actually the units at each level are consistently defined in terms of a theory with rules which range categorically over those units. Hence, if the model is taken as a psychological theory, it would support the validity of particular levels of representation which are uniquely defined in terms of a rule-governed structural theory. The model also makes certain assumptions about the prior availability of sequence instructions, absolute speed of activation, and speed of inhibition of just-uttered units. Thus, the model comprises a fairly complete outline of the sequencing of speech output, given prior linguistically defined information and the prior specification of a number of performance mechanisms and parameters. The model is a talking mechanism for the normal flow of speech which can correctly represent errors because it has the linguistic units and the relations between them wired in.
Such models can represent perceptual as well as productive processes. For example, Elman, McClelland and Rumelhart have developed a series of models for word recognition (Elman & McClelland, 1986; McClelland & Elman, 1986; McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1982). These models are similar to Dell's model in the sense that they have several levels of linguistic representation built into them. For example, TRACE (McClelland & Elman, 1986) recognizes words based on a stylized sequential acoustic feature input. The model utilizes three kinds of detector nodes, for acoustic features, phonemes and words. Feature nodes are grouped into 11 sets, each corresponding to a successive 'time slice' of the acoustic input.
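The scale of the built-in feature level can be checked with simple arithmetic. The figures below (7 feature dimensions, 9 values per dimension, 11 time slices, 15 phoneme detectors per set) are those given in the surrounding description of TRACE; the script itself, and the decomposition of 165 into 15 x 11, are our illustration.

```python
# Bookkeeping for TRACE's input level, using the figures given in the text.
dimensions = 7        # feature dimensions per time slice
values = 9            # possible values per dimension
time_slices = 11      # time slices in each phoneme's feature matrix

feature_nodes_per_phoneme = dimensions * values * time_slices
print(feature_nodes_per_phoneme)   # matches the 693 separate nodes in the text

# One way to recover the text's figure of 165 phoneme units for a
# three-phoneme word: 15 detectors per set of connections x 11 sets
# (the grouping into 11 sets is an assumption on our part).
phoneme_units = 15 * 11
print(phoneme_units)
```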
Each slice presents the value of each of seven feature dimensions; each dimension has 9 possible values. Accordingly, at the input level of this model, each 'phoneme' is represented in terms of a feature/value/time-slice matrix which defines 693 separate nodes. The centers of each set of time slices from a phoneme are spaced 6 slices apart, to simulate the acoustic overlap that real phones have in speech. Every three time slices there is centered a set of connections to 15 phoneme detector nodes, with each phoneme node receiving input from 11 time slices. This means that a three-phoneme word activates 1993 feature/value/time-slice nodes and 165 phoneme units. Finally, there are 211 word nodes, appropriately linked to their constituent phonemes.
TRACE builds in several levels of linguistic analysis. This is not merely an external observation about the model; it is reflected in its internal architecture. All nodes at the same level of representation inhibit each other, while nodes at adjacent levels of representation excite each other. The within-level inhibition serves the function of reducing the active nodes at each level to those that are most active; the across-level facilitation serves the function of increasing the effect of one level on another. The within-level inhibition in TRACE is set at distinct values for each level, as are the levels of excitation between words and phonemes and between phonemes and features. Such effects occur because the nodes at the different levels of representation are interconnected. We take this to be one of the obvious features of connectionist algorithms: insofar as there are interactions between different levels, a parallel system allows multiple simultaneous connections both within and between levels. The result is that the model exhibits qualitatively distinct sets of nodes, grouped and layered in the same way as the corresponding linguistic levels of representation. With all this built-in linguistic apparatus, TRACE can model a number of interesting aspects of lexical access by humans. First, it can recognize words from mock acoustic feature input. Second, it can generate a variety of effects involving interference relations between levels of representation: for example, a family of word superiority effects, in which properties of words influence the perception of particular phonemes.
More generally, the connectionist framework allows for the implementation of linguistic structure in computationally effective ways. If we take such models as psychologically instructive, they inform us that once one knows what the internal structure of a language is, it can be encoded in such a system in a way which predicts linguistic behavior.

A more ambitious goal is to construct a connectionist model which actually learns regularities from normal input, and generates behavior which conforms to the real behavior. In this way, the model can aspire to explain the patterns of the characteristic behavior. Consider the recent model proposed by Rumelhart and McClelland (1986b, R&M), which seems to learn the past tense rules of English verbs without learning any explicit rules. This model has been given considerable attention in a current review article, especially with reference to its empirical and principled failings and how such failings follow from the choice of computational architecture (Pinker & Prince, 1988). Our approach is complementary: we examine the R&M model internally, in order to understand why it works. We argue that the model seems to work for two kinds of reasons: first, it contains arbitrary devices which make it relatively sensitive to those phonological structures which are involved in the past-tense rules; second, it stages the input data and internal learning functions so that it simulates child-like learning behavior. We first describe the past-tense phenomena in rule-governed terms, and then examine how the R&M model conforms to the consequent regularities in the child's stages of acquisition.

The principles of past tense formation in English involve a regular ending ('ed') for most verbs, and a set of internal vowel changes for a minority of verbs. In our discussion we are interested in a particular kind of structural rule, typical of linguistic phenomena: one which performs categorical operations on categorical symbols. In this sense, linguistic rules admit of no exceptions. That is, they are not probabilistically sensitive to inputs, and their effects are not probabilistic. We can see how they work in the formation of the regular past tense from the present in English verbs. The regularities are:

If the verb ends in "t" or "d", add "ed": mat ... mattED; need ... needED
If the verb ends in sounds "sh", "k", "p", "ch", ... add "t": push ... pushT; pick ... pickT
If the verb ends in "z", "g", "b", "j", or a vowel ... add "d": buzz ... buzzD; bug ... bugD; ski ... skiD

These facts are described with rules which draw on and define a set of phonological 'distinctive features', attributes which co-occur in each distinct phoneme. Features are both concrete and abstract: they are concrete in the sense that they define the phonetic/acoustic content of speech; they are abstract in the sense that they are the objects of phonological rules, which in turn can define levels of representation which are not pronounced or pronounceable in a given language. The description of the rules of past tense formation exemplifies these characteristics. (Note that any particular description is theory dependent. We have chosen a fairly neutral form (basically that in Halle, 1962), although we recognize that phonological theory is continually in convulsions. The generalizations we draw are worded to hold across a wide range of such theories.)

1a. Add a phoneme which has all the features in common to T and D, with the voicing dimension unspecified.
1b. Insert a neutral vowel, 'e', between two word-final consonants that have identical features, ignoring voicing.
1c. Assimilate the voicing of the T/D to that of the preceding segment.
Push                  Pit
PushT/D   (rule 1a)   PitT/D    (rule 1a)
                      PiteT/D   (rule 1b)
PushT     (rule 1c)   PiteD     (rule 1c)
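The categorical character of rules (1a)-(1c) can be rendered as a short program. The sketch below is our own illustration: it uses crude single-letter segment classes (with 'S' standing in for "sh" and 'C' for "ch") instead of the distinctive-feature bundles described in the text, so it stands in for the feature-based statement rather than implementing it.

```python
# Categorical application of past-tense rules (1a)-(1c) on toy phoneme
# lists. The segment classes are illustrative assumptions, not the
# feature system of the text ('S' = "sh", 'C' = "ch").
VOICELESS = {"p", "t", "k", "f", "s", "S", "C"}

def past_tense(stem):
    # 1a. Add a T/D archiphoneme with voicing unspecified.
    form = stem + ["T/D"]
    # 1b. Insert a neutral vowel between word-final consonants that are
    #     identical ignoring voicing (i.e. a stem-final t or d).
    if stem[-1] in {"t", "d"}:
        form = stem + ["e", "T/D"]
    # 1c. Assimilate the voicing of T/D to the preceding segment
    #     (vowels, including the inserted 'e', count as voiced).
    ending = "t" if form[-2] in VOICELESS else "d"
    return form[:-1] + [ending]

print(past_tense(["p", "u", "S"]))  # ['p', 'u', 'S', 't']       "pushT"
print(past_tense(["p", "i", "t"]))  # ['p', 'i', 't', 'e', 'd']  "PiteD"
print(past_tense(["b", "u", "g"]))  # ['b', 'u', 'g', 'd']       "bugD"
```

Note that each rule either applies or does not; nothing in the sketch is sensitive to frequency or probability, which is the sense of "categorical" at issue.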
These rules can be viewed as distinct constraints (Goldsmith, 1976) or as applying in strict order (Halle, 1962). Either way, the effects on the output form are categorical, each rule involving … That rule (1b) refers to sequences of consonants which are made in the same way clarifies the sense in which the distinctive features are symbolic. For example, the rules refer to feature values, not to actual phonetic or acoustic objects. The depth of abstract representations becomes clear when we consider that rule (1b) must also break up sequences of s, z, sh, ch created by affixing the plural and present endings, and that rule (1c) applies in those cases as well. The generality of the application of these rules is characteristic of the processes involved in describing English sound systems in general (Chomsky & Halle, 1968).

Consider the verbs mound and mount; in one of their allowed past tense pronunciations, they appear as mouDed (long nasalized vowel preceding a tongue flap) and mouDed (short nasalized vowel preceding a tongue flap). These two words can end up being differentiated acoustically in terms of the length of the first vowel only, even though the underlying basis for the difference is the voicing or absence of it in the final d/t of the stem (see Chomsky, 1964). The rules involved in arriving at these pronunciations include:

1d. Nasalize a vowel before a nasal.
1e. …
1f. …
1g. …
Each these hasindependent of rules application inother ofEnglish parts pronunciation patterns, theyaredistinct. these areseparated so If rules so
that they can apply to isolated cases, they must apply in order when combined. For example, since (1d) requires a nasal, (1e) cannot yet have applied; if (1f) is to apply differentially to 'mound', and not to 'mount', then (1e) must have applied; if the vowel length is to reflect the difference between 'mound' and 'mount', (1f) must apply before (1g). Thus, whether the intermediate stages are serially or logically ordered, the inputs, mound + past and mount + past, have a very abstract relation to their corresponding outputs:

              mound + past      mount + past
(rule 1a)     mound D/T         mount D/T
(rule 1b)     mound eD/T        mount eD/T
(rule 1c)     mound ed          mount ed
(rule 1d)     m-ounded          m-ounted
(rule 1e)     m-ouded           m-outed
(rule 1f)     m-'ouded          (can't apply)
(rule 1g)     m-'ouDed          m-ouDed
That the rules can be optional does not make them probabilistic within the model: they state what is allowed, not how often it happens (indeed, for many speakers, deleting nasals occurs more easily before unvoiced homorganic stops than before voiced stops). Optionality in a rule is a way of expressing the fact that the structure can occur with and without a corresponding property. The fact that linguistic rules apply to their appropriate domain 'without exception' does not mean that the appropriate domain is defined only in terms of phonological units. For example, in English there is a particular set of 'irregular' verbs which are not subject to the three past-tense formation rules described above. Whether a verb is irregular or not depends on its lexical representation: certain verbs are, and others are not, 'regular'. For example, one has to know which 'ring' is meant to differentiate the correct from incorrect past tenses below (see Pinker & Prince, 1988, for other cases).

2a. The Indians ringed (*rang) the settler's encampment.
2b. The Indians rang (*ringed) in the new year.

There are about 200 irregular verbs in modern English. They are the detritus of a more general rule-governed system in Old English (which has an interesting relation to the Indo-European e/o ablaut; see Bever & Langendoen, 1963). They fall into a few groups: those involving no change, which characteristically already end in t or d (beat, rid); those which add t or d (send); those lowering the vowel (drink, give); those involving a reversal of the vowel color between front and back (find, break, come); those which both lower and change vowel color (sting); and those which involve combinations
of all three kinds of change (bring, feel, tell). The point for our purposes is that almost all of the 'irregular' verbs draw on a small set of phonological processes. Only a few involve completely suppletive forms (e.g., go/went).

This brief analysis of past tense formation in terms of features and rules reveals several properties of the structural system. First, the relevant grouping of features for the rules is vertical, with features grouped into phonemic segments. This property is not formally necessary. For example, rules could range over isolated features collected from a series of phones; apparently, it is a matter of linguistic fact that the phoneme is a natural domain of the phonological processes involved in past tense formation (note that there may be suprasegmental processes as well, but these tend to range over different locations of values on the same feature dimension). Second, it is the segment at the end of the verb stem that determines the ultimate features of the regular past tense ending. This, too, is a fact about the English past system, not a logically necessary property.
4. The R&M model: A general picture of PDP models of learning

Rumelhart and McClelland (1986b; R&M) implemented a model which learns to associate the past tense with the present tense of both the majority and minority verb types. The first step in setting up this model is to postulate a description of words in terms of individual feature units. Parallel distributed connectionist models are not naturally suited to represent serially ordered representations, since all components are to be represented simultaneously in one matrix. But phonemes, and their corresponding bundles of distinctive features, clearly are ordered. R&M solve this problem by invoking a form of phonemic representation suggested by Wickelgren (1969), which recasts ordered phonemes into 'Wickelphones', which can be ordered in a given word in only one way. Wickelphones appear to avoid the problem of representing serial order by differentiating each phoneme as a function of its immediate phonemic neighborhood. For example, 'bet' would be represented as composed of the following Wickelphones:
#Be, bEt, eT#
Each Wickelphone is a triple, consisting of the central phoneme and a representation of the preceding and following phonemes as well. As reflected in the above representation, such entities do not have to be represented in
memory as ordered: they can be combined in only one way into an actual sequence, if one follows the rule that the central phone must correspond to the prefix of the following unit and the postfix of the preceding unit. That rule leads to only one output representation for the above three Wickelphones, namely 'bet'. Of course, the number of Wickelphones in a language is much larger than the number of phonemes: roughly the third power. But such a representational scheme seems to circumvent the need for a direct representation of order itself (at least, so long as the vocabulary is restricted so that a given Wickelphone never occurs more than once in a word; see Pinker & Prince, 1988).
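The claim that an unordered Wickelphone set determines a unique ordering can be checked with a small sketch. The tuple notation and function names are ours; the uniqueness assumption (no Wickelphone occurs twice in a word) is the restriction noted above.

```python
def wickelphones(word):
    """Decompose a word into Wickelphones: (left, center, right) triples,
    with '#' marking the word boundary."""
    padded = "#" + word + "#"
    return {(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)}

def reassemble(phones):
    """Recover the unique ordering from an unordered Wickelphone set,
    assuming no Wickelphone occurs twice in the word."""
    # Start from the triple whose left context is the boundary marker.
    current = next(p for p in phones if p[0] == "#")
    out = [current[1]]
    while current[2] != "#":
        # The next triple's (left, center) must match this one's (center, right).
        current = next(p for p in phones
                       if (p[0], p[1]) == (current[1], current[2]))
        out.append(current[1])
    return "".join(out)
```

For example, `reassemble(wickelphones("bet"))` recovers 'bet' from the unordered set of three triples.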
[Figure 2. R&M's chart of phonological features (not legible in this reproduction): each phone is classified on four feature dimensions, including place of articulation (front/middle/back), a hybrid voiced-or-long vs. unvoiced-or-short dimension, and manner (e.g., interrupted: stop vs. nasal).]
R&M assign a set of phonemic distinctive features to each phone within a Wickelphone. There are 4 feature dimensions, two with two values and two with three, yielding 10 individual feature values (see Figure 2). This allows them to represent Wickelphones in feature matrices: for example, the /E/ in 'bet' can be represented as a matrix of its feature values. The matrices are in turn encoded as Wickelfeatures. These consist of a triple of features, [f1, f2, f3], the first taken from the prefix phone, the second from the central phone, and the third from the post-fix phone. Accordingly, the Wickelfeatures for 'bet' are triples drawn from the features of its successive phones.
If the weight between an input node and an output node is zero, the input node does not affect the state of the output node. If the weight is positive, and if the input node is activated by the present tense input, the output node is also activated by the connection between them; if the weighting is negative, then the output node is inhibited by that connection. Each output node also has a threshold, which the summed activation input from all input nodes must exceed for the output node to become active (we note below that the threshold is itself interpreted as a probabilistic function). On each training trial, the machine is given the correct output Wickelfeature set as well as the input set. This makes it possible to assess the extent to which each output Wickelfeature node which should be activated, is, and conversely. The machine then uses a variant on the standard perceptron learning rule (Rosenblatt, 1962), which changes the weights between the active input nodes and the output nodes which were incorrect on the trial: lower the weight and raise the threshold for all nodes that were incorrectly activated; do the opposite for nodes that were incorrectly inactivated.

The machine was given a set of 200 training sessions, with a number of verbs in each session. At the end of this training, the system could take new verbs, that it had not processed before, and correctly associate their past tense in most cases. Hence, the model appears to learn, given a finite input, how to generalize to new cases. Furthermore, the model appears to go through several stages of acquisition which correspond to the stages of learning the past tense of verbs which children go through as well (Brown, 1973; Bybee & Slobin, 1982). During an early phase, the model (and children) produce the correct past tense for a small number of verbs, especially a number of the minority forms (went, ran, etc.).
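The perceptron-style update just described can be sketched as follows. The vector encoding, learning rate, and function name are our assumptions for illustration, not R&M's exact settings.

```python
def step(weights, thresholds, x, target, lr=1.0):
    """One training trial of the perceptron-style rule described above:
    activate each output unit whose weighted input exceeds its threshold,
    then punish incorrectly activated units and reward incorrectly
    inactivated ones."""
    out = [int(sum(w[i] * xi for i, xi in enumerate(x)) > thresholds[j])
           for j, w in enumerate(weights)]
    for j, w in enumerate(weights):
        if out[j] == 1 and target[j] == 0:    # incorrectly activated
            thresholds[j] += lr
            for i, xi in enumerate(x):
                if xi:
                    w[i] -= lr                # lower weights from active inputs
        elif out[j] == 0 and target[j] == 1:  # incorrectly inactivated
            thresholds[j] -= lr
            for i, xi in enumerate(x):
                if xi:
                    w[i] += lr                # raise weights from active inputs
    return out
```

Repeatedly presenting a small set of input/target pairs drives the weights and thresholds to a point where the correct output units, and only those, are activated.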
Then, the model (and children) 'overgeneralize' the attachment of the majority past form, 'ed', and its variants, so that they then make errors on forms on which they had been correct before (goed, wented, runned, etc.). Finally, the model (and children) produce the correct minority and majority forms. It would seem that the model has learned the rule-governed behaviors involved in forming the past tense of novel verbs. Yet, as R&M point out, the model does not 'contain' rules, only matrices of associative strengths between nodes. They argue that the success of the system in learning the rule-governed properties, and in simulating the pattern of acquisition, shows that rules may not be a necessary component of the description of acquisition behavior.

Whenever a model is touted to be as successful as theirs, it is hard to avoid the temptation to take it as a replacement for a rule-governed description and explanation of the phenomena. That is, we can go beyond R&M and take this model as a potential demonstration that the appearance that behavior is rule-governed is an illusion, and that its real nature is explained by nodes and the associative strengths between networks of nodes. A number
of recent reviews have taken this position about connectionist models, in particular because of the apparently successful performance of R&M's past-tense learning model (Langacker, 1987; Sampson, 1987). Below, we examine the model in this light: we find no evidence that the model replaces linguistic rules; rather, it has an internal architecture which is arranged to be particularly sensitive to just those aspects of the data that conform to the rules. Thus, the model may (or may not) be a successful algorithmic description of learning, but it does not dispense with the rules themselves. A number of devices, often presented as simplifications of the model, could actually be explained as accommodations to the rule-governed properties of the system and the resulting behavior. We should note that, by and large, it did not require much detective work to uncover these devices, since R&M are admirably explicit about them. There are two kinds of TRICS ('The Representations It Crucially Supposes'): those that reconstitute crucial aspects of the linguistic system, and those which create the overgeneralization pattern exhibited by children.

5.1.

These devices make available to the associative learning the information relevant to the rule-governed description of the formation of the past tense. Our forensic method is the following: we examine each arbitrary decision about the model, with the rule system in mind, and ask about each decision: would this facilitate or inhibit the behavioral emergence of data which looks like that governed by the past tense rule? TRICS are those decisions which facilitate the emergence of such behavior.
Indeed, the arbitrary decisions about the details of the model are transparently interpretable as making most reliable the information at the end of the verb.

5.1.1.

The first simplification of the model involves reducing the number of within-word Wickelfeatures from about 1000 to 260. One way to do this would be to delete some Wickelfeatures randomly, and rely on the overall
redundancy of the system to carry the behavioral regularities. Another option is to use a principled basis for dropping certain Wickelfeatures. For example, one could drop all Wickelfeatures whose component subfeatures are on different dimensions; this move alone would reduce the number of features considerably. Such a reduction would occur if Wickelfeatures were required to have at least two sub-features on the same dimension. R&M do something like this, but in an eccentric way: they require that all Wickelfeatures have the same dimension for f1 and f3; f2 can range fully across all feature dimensions and values. Accordingly, the potential Wickelfeatures for the vowel /E/ in 'bet' on the left below are possible; those on the right are not.

    possible                              not possible
    [interrupted, vowel, interrupted]     [interrupted, vowel, stop]
    [voiced, mid, unvoiced]               [stop, vowel, unvoiced]
    [front, short, middle]                [front, short, unvoiced]
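The restriction can be stated compactly in code. The dimension inventory below is our own hypothetical simplification of R&M's feature chart, introduced only to make the f1/f3 constraint concrete.

```python
# Hypothetical feature dimensions, loosely modeled on R&M's chart; the exact
# inventory here is our illustration, not their actual feature set.
DIMENSIONS = {
    "sonority": {"interrupted", "continuous", "vowel"},
    "place": {"front", "middle", "back"},
    "voicing": {"voiced", "unvoiced"},
    "length": {"long", "short"},
    "manner": {"stop", "nasal", "fricative", "liquid"},
}

def dimension_of(value):
    # Look up which dimension a feature value belongs to.
    return next(d for d, vals in DIMENSIONS.items() if value in vals)

def allowed_wickelfeature(f1, f2, f3):
    """R&M's restriction: f1 and f3 must lie on the same feature dimension
    (their values may differ); f2 may come from any dimension."""
    return dimension_of(f1) == dimension_of(f3)
```

Note that f2 is never even inspected: only the context features are constrained, which is what concentrates the unconstrained, informative variation in the central position.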
This apparently arbitrary way of cutting down on the number of Wickelfeatures has felicitous consequences for the relative amount of rule-based information contained within each sub-feature. It has the basic effect of reducing the information contained within f1 and f3, since they are heavily predictable; that is, the actually employed Wickelphones are 'centrally informative'. This heightens the relative importance of information in f2, since it can be more varied. This move is entirely arbitrary from the standpoint of the model; but it is an entirely sensible move if the goal were to accommodate to a structural rule account of the phenomena: the rules imply that the relevant information in a phoneme is in f2. The use of centrally informative Wickelphones automatically emphasizes f2.
We can see that this gives a privileged informational status to phones at the word boundary, compared with any other ordinally defined position within the word: the phones at the boundary are the only ones with their own unique set of Wickelfeatures. This makes the information at the boundary uniquely recognizable. That is, the Wickelfeature representation of the words exhibits the property of 'boundary sharpening'. This arbitrary move accommodates another property we noted in the rule-governed account of the past tense: the phone at the word boundary determines the shape of the regular past. Here, too, we see that an apparently arbitrary decision was just the right one to make in order to make sure that the system would accommodate to the rule-governed regularities.
5.1.3.
R&M allow certain input Wickelfeatures to be activated even when only some of the constituent sub-features are present. One theoretically arbitrary way to do this would be to allow some Wickelphone nodes to be activated sometimes when any 1 out of 3 subfeatures does not correspond to the input. R&M do something like this, but yet again in an eccentric way: the model allows a Wickelphone node to be activated if either f1 or f3 is incorrect, but not if f2 is incorrect. That is, the fidelity of the relationship between an input Wickelphone and the nodes which are actually activated is subject to 'peripheral blurring'. This effect is not small: the blurring was set to occur with a probability of .9 (this value was determined by R&M after some trial and error with other values). That is, a given Wickelnode is activated 90 percent of the time when the input does not correspond to its f1 or f3. But it can always count on f2. This dramatic feature of the model is unmotivated within the connectionist framework. But it has the same felicitous result
from the standpoint of the structure of the phenomenon as discussed in 5.1.1. It heightens (in this case, drastically) the relative reliability of the information in f2, and tends to destroy the reliability of information in f1 and f3. This further reflects the fact that the structurally relevant information is in f2.
If the blurring were unrestricted, information would be lost as to how to sequentially order the phones. For this reason, R&M had to institute an arbitrary choice of which Wickelfeatures could be incorrectly activated, so that for each input feature there are some reliable cues to Wickelphone order. This necessitates blurring the Wickelfeatures with less than 100 percent of the false options.
5.1.4.

The use of phonological features is one of the biggest TRICS of all. Features are introduced as an encoding device which reduces the number of internodal connections: the number of connections between two layers of 30,000 Wickelphones is roughly one billion. Some kind of encoding was necessary to reduce this number. Computationally, any kind of binary encoding could solve the problem; R&M chose phonological features as the encoding device because that is the only basis for the model to arrive at appropriate generalizations. Furthermore, the four distinctive features do not all correspond to physically definable dimensions. In order to simplify the number of encoded features, R&M create some feature hybrids. For example, b, v, EE are grouped together along one 'dimension' in opposition to m, l, EY. While such a grouping is not motivated by either phonetics or linguistic theory, it is neither noxious nor helpful to the model, so far as we can tell. However, the other major arbitrary feature grouping lumps together long vowels and voiced consonants in opposition to short vowels and unvoiced consonants. All verbs which end in vowels end in long vowels: accordingly, that particular grouping of features facilitates selective learning of the verbs that end in a segment which must be followed by /d/, namely the verbs ending in vowels and in voiced consonants.

5.1.5.

It is interesting to ponder how successful the devices of central informativeness, peripheral blurring and boundary sharpening are in reconstituting traditional segmental phonemic information. Phonemic representations of words offer a basis for representing similarity between them. One of the properties of Wickelphones is that they do not represent shared properties of words. For example, 'slowed' and 'sold' are no more similar in Wickelphones than 'sold' and 'fig'.
This is obviously an inherent characteristic of Wickelphonology (Pinker & Prince, 1988; Savin & Bever, 1970) which would extend to Wickelfeature representations if there were no TRICS. But there are, and they turn out to go a long way toward restoring the similarity metric represented directly in normal phonemic notations. One way to quantify this is to examine how many shared input Wickelfeatures there are for words which do and do not share phonemes. This is difficult to do in detail, because R&M do not give complete accounts of which features were chosen for blurring. Two arbitrarily chosen words without shared phonemes will also share a few Wickelfeatures: our rough calculation for a pair of arbitrarily chosen 4-letter words is about 10 percent. Words which share initial or final phonemes will have a noticeable number of shared features because of boundary sharpening. Blurring plays an especially important role in reconstituting similarity among
words with shared internal phonemes, such as 'slowed' and 'sold'. Roughly, our calculations show that two such words go from about 20 percent shared features without blurring to around 65 percent with it. In terms of correlating the chances of a node being activated in each word, this represents a rise to about .5. The corresponding proportions for phonemically distinct words are 10 and 30 percent; the correlation in that case stays low even with blurring. The technical reason for this proportional difference is that the blurring has a radiating effect on just the right Wickelfeatures to create an overlap when there are common phonemes. Thus, the model does not reconstitute the phonemic representation, but it does replicate some of it.

5.2.

There are two major behavioral properties of the model. First, it seems to learn, i.e., it changes its behavior; second, it goes through a period of overgeneralizing the regular rule. The fact that the model learns at all has the same formal basis as the fact that the model can represent the present/past mapping. Any mapping a perceptron can carry out, it can 'learn' to carry out using the kind of learning rule described above (Rosenblatt, 1962). Basically, the model is composed of 460 perceptrons which converge on the appropriate mapping representation. In fact, in simple perceptrons the convergence can be quite efficient. On each trial the learning rule adjusts a threshold discrimination function such that only the correct output units are activated. What is striking about R&M's complex perceptron is that it takes so long to learn the mapping. The function for regular verbs is extremely simple and might well be reasonably arrived at with just one training cycle. Thereafter, a few cycles would suffice to straighten out the errors on the irregular forms, especially because, as we pointed out above, most irregular verbs follow a restricted set of rules. In this case there would be little intermediate performance characteristic of learning, and no period during which the regular endings overgeneralized to previously correct irregular verbs.

The reason that the model in fact does exhibit considerable intermediate performance and overgeneralization is due to another one of the TRICS, which imposes a probabilistic function on the output. The probability of an output unit being active is a sigmoid function of its net activation (weighted input minus threshold); this function represents the fact that even when the input is correctly assigned, there are output errors. Such a move qualitatively improves the generalizations made by the system once it has been trained. This is because, in general, as the number of error-correcting trials increases, the difference between activations resulting from inputs for which the unit is
supposed to have positive output and those for which it is supposed to have negative output grows more rapidly than the difference among the inputs of each type. This enhances the clarity of the generalization and also makes the learning proceed more slowly. As R&M put it, its use here is motivated by the fact that it "causes the system to learn more slowly so the effect of regular verbs on the irregulars continues over a much longer period of time" (R&M, p. 224).

The period of overgeneralization of the regular past at the 11th cycle of trials also depends on a real trick, not a technically defined one. For the first 10 cycles, the machine is presented with only 10 verbs, 8 irregular and 2 regular ones. On the 11th cycle it is presented with an additional 410 verbs, of which about 80 percent are regular. Thus, even on the 11th cycle alone, the model is given more instances of regular verbs than the training trials it has received on the entire preceding 10 cycles: it is no wonder that the regular past ending immediately swamps the previously acquired regularities. R&M defend this arbitrary move by suggesting that children also experience a sudden surge of regular past tense experience. We know of no acquisition data which show anything of the sort (see Pinker & Prince, 1988, who compile evidence to the contrary). Furthermore, if there were a sudden increase in the number of verbs a child knows at the time he learns the regular past tense rule, it would be ambiguous evidence between acquiring the rule and acquiring a lot of verbs. The rule allows the child to memorize half as many lexical items for each verb, and learn twice as many verbs from then on. Therefore, even if it were true that children show a sudden increase in the number of verbs they know at the same time that they start overgeneralizing, it would be very difficult to decide which was the cause and which the effect.
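The probabilistic output function discussed above can be sketched in a few lines. The temperature parameter is our placeholder for R&M's noise setting, not their actual value.

```python
import math

def p_active(weighted_input, threshold, temperature=1.0):
    """Probability that an output unit fires: a sigmoid (logistic) function
    of its net activation, i.e. weighted input minus threshold. With a
    nonzero temperature, even a correctly signed net input sometimes
    produces the wrong output, which is what slows learning down."""
    net = (weighted_input - threshold) / temperature
    return 1.0 / (1.0 + math.exp(-net))
```

At net activation zero the unit is at chance (probability .5); only as training drives net inputs far from threshold does the output become reliable, and a higher temperature flattens the curve, prolonging the error-prone period.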
5.3. TRICS aren't for kids

It is clear that a number of arbitrary decisions, made simply to get the model up and working, were made in ways that would facilitate learning the structural regularities inherent to the presented data. To us it seems fairly clear what went on: Wickelphones were the representation of choice because they seem to solve the problem of representing serial order (though they do so only for a restricted vocabulary; see Pinker & Prince, 1988; Savin & Bever, 1970). But Wickelphones also give equal weight to the preceding and following phone, while it is the central phone which is the subject of rule-governed regularities. Accordingly, a number of devices are built into the model to reduce the information and reliability of the preceding and following sub-phones in the Wickelphone. Further devices mark phones at the word boundary as uniquely important elements, as they are in rule-governed accounts of
some phonological changes which happen when morphemes are adjoined. Finally, the behavioral learning properties of the model were insured by making the model learn slowly, and flooding it with regular verbs at a particular point.

The most important claim for the R&M model is that it conforms to the behavioral regularities described in rule-governed accounts, but without any rules. We have not reported on the extent to which the model actually captures the behavioral regularities. Pinker & Prince (1988) demonstrate that, in fact, the model is not adequate, even to the basic facts: hence, the first claim for the model is not correct. We have shown further that, even if the model were empirically adequate, it would be because the model's architecture is designed to extract rule-based regularities in the input data. The impact of the rules for past tense learning is indirectly embedded in the form of representation and the TRICS: even Wickelfeatures involve a linguistic theory with acoustic segments and phonological features within them; the work of the TRICS is to render available the segmental phoneme, and emphasize boundary phonemes in terms of segmental features. That is, garbage in/garbage out: regularities in/regularities out. How crucial the TRICS really are is easy to find out: simply run the model without them, and see what it does. We expect that if the TRICS were replaced with theoretically neutral devices, the new model would not learn with even the current limited 'success', if at all; nor would it exhibit the same behaviors.
If a slightly improved set of TRICS does lead to successful performance, one could argue that this new model is a theory of the innate phonological devices available to children. On this interpretation, the child would come to the language learning situation with uncommitted connectionist networks supported by TRICS of the general kind built into the R&M model. The child operates on the feedback it receives from its attempts to produce phonologically correct sequences, and gradually builds up a network which exhibits rule-like properties, but without any rules, as R&M claim. It is difficult to consider the merits of such a proposal in the abstract: clearly, if the TRICS were sufficiently structured so that they were tantamount to an implementation of universal phonological constraints in rule-governed accounts, then such a theory would be the equivalent of one that is rule-based (see Fodor & Pylyshyn, 1988, for a discussion of connectionist models as potential implementation systems). The theory in R&M self-avowedly and clearly falls short of representing the actual rules. So, we must analyze the nature of the TRICS in the model at hand, to assess their compatibility with a plausible universal theory of phonology. None of the TRICS fares well under this kind of scrutiny. Consider first the limitation on Wickelfeatures which makes them 'centrally informative':
this requires that f1 and f3 be marked for the same feature dimension (although the values on that dimension may be different). Certain phonological processes depend on information about the phone preceding and following the affected segment. For example, the rule which transforms /t/ or /d/ to a voiced tongue flap applies only when both the preceding and following segments are vowels. The 'central-informativeness' TRIC neatly accommodates a process like this, since it makes available a set of Wickelfeatures with f1 and f3 marked as 'vowel'. Unfortunately, the same TRIC makes it hard to learn processes in which f1 and f3 are marked for different dimensions. Since such processes are also quite common, the universal predictions this makes are incorrect.

The second set of representational TRICS has the net result of sharpening the reliability of information at word boundaries: this is well-suited to isolate the relevant information in the regular past-tense formation in English, and would seem like a natural way to represent the fact that morphological processes affect segments at the boundaries of morphemes when they are combined. Unfortunately, such processes do not seem to predominate over within-word processes, such as the formation of the tongue-flap between vowels, or the restrictions on nasal deletion. Furthermore, there are numerous languages which change segments within morphemes as they are combined, as in the many languages with vowel harmony. Thus, there is no empirical support for a system which unambiguously gives priority to phonological processes at morpheme boundaries.

R&M link together distinctive features which are maintained orthogonal to each other in most phonological theories. Hence, in R&M, long vowels and voiced consonants are linked together in opposition to short vowels and unvoiced consonants. This link is prima facie a TRIC, which facilitates learning the regular past tense morphology.
But, taken as a claim of universal phonological theory, it is prima facie incorrect: it would propose to explain a non-fact, the relative frequency, or ease of learning, of processes which apply simultaneously to long vowels and voiced consonants, or to short vowels and unvoiced consonants.

Staging the input data and imposing a sigmoid learning function are not, strictly speaking, components of phonological theory; both devices play a role in guaranteeing overgeneralization of the regular past. Such overgeneralizations are common in mastery of other morpho-phonological phenomena, for example, the present tense in English ('Harry do-es', rhyming with 'news') or the plural ('fishes', 'childrens'). The carefully controlled sigmoid learning function and the staging of input data necessary to yield overgeneralization phenomena have the properties of dei ex machina, with no independent evidence of any kind.
The overall effect of the TRICS is to refocus the reliable information within the central phone of Wickelphone triples. This does reconstitute the segmental property of many phonological processes, obscured by the Wickelphonological representations. However, since those representations are not adequate in general, such reconstitution is of limited value. Most important, the specific TRICS involved in this reconstitution make wrong universal claims in some cases, and obscure and unmotivated claims in the other cases. We conclude that even if the TRICS were fine-tuned to arrive at satisfactory empirical adequacy, they would still be arbitrarily chosen to facilitate the learning of English past tense rules, with no empirical support as the basis for phonological learning in general.

We noted above that there are theories of phonology in which more than one segment contributes to a constraint simultaneously, e.g., 'autosegmental phonology' (Goldsmith, 1976). It might seem that such a variant of phonological theory would be consistent with Wickelphones, and with connectionist learning paradigms in general. Such claims would be incorrect. First, autosegmental phonology does not deny that many processes involve individual segments; rather, it asserts that there are simultaneous suprasegmental structures
as well . Second, Wickelphones are no better suited than simple phones for the kinds of larger phonological unit to which multi -segmental constraints apply , very often the syllable . Finally , the constraints in autosegmental phonology are structural and just as rule -like as those in segmental phonol ogy . It might also seem that the issue between traditional and autosegmental phonological theory concerns the reality of intermediate stages of derivation as resulting from the ordered application of rules like ( la - g)- another way to put this is in terms of whether rules are simple -but -ordered as in traditional phonology , or complex -but -unordered . There may be phonological theories
which differ on these dimensions. But, however this issue is resolved, there will be no particular comfort for connectionist learning models of the type in R&M. The underlying object to which the rules apply will still be an abstract formula, and the output will still differentiate categorically between grammatical and ungrammatical sequences in the particular language.

5.4. Empirical evidence for rules

Up to now, we have relied on the reader's intuitive understanding of what a 'rule' is: a computation which maps one representation onto another. We
have argued further that the R&M model achieves categorical rule-like behavior in the context of an analogue connectionist machine by way of special representational and processing devices. One might reply that the 'rules' we have been discussing actually compute the structure of linguistic 'competence', while the TRIC-ridden model is a 'performance' mechanism which is the real basis for the rule-like behavior. This line of reply would be consistent with the current distinction between three types of description: the computational, the algorithmic, and the implementational (Marr, 1982). It is a line already taken in several defenses of connectionism (Rumelhart & McClelland, 1986a; Smolensky, in press). The distinction between these different types of description might seem to allow for a synthesis of the connectionist and rule-based theories. On this view, rule-based theories describe the structure of language, while connectionist models explain how it actually works. This would-be synthesis is not available, however, since grammatical rules are necessary for the explanation of behavior.

5.4.1. The diachronic maintenance of language systems

Consider first the operation of the processes we have used as examples: they characteristically fall into categories, often even at a physical level of description. For example, if a stop sound is 'unvoiced', it exhibits certain invariants which contrast it from its 'voiced' mode. The indicated processes occur in environments which are categorically described: e.g., /t, d/ becomes a tongue flap between two vowels (actually, between two 'non-consonants' in distinctive feature terms), not between two sounds that are like vowels to a high degree. Variations in language behavior show similar discontinuities: for example, children invent phonological rules which involve rule-governed shifts rather than just groups of changed words. Similarly, dialects differ by entire rule processes, not isolated cases. Finally, stable historical changes occur in precise but broad-ranging shifts: the great vowel shift, involved in the irregular past verbs, included a complete rotation of vowel heights, not isolated changes.

We are not suggesting that developmentally, synchronically, and historically there are no intermediate stages of performance; rather, we emphasize that the stable phenomena and periods are those caused by mental representations that are structural in nature. It is possible to show that the mental representations of a rule-based account of language are necessary to describe the properties of language change. The categorical nature of language change is explained in a rule-based account by the fact that rules themselves are categorical, not incremental; hence linguistic change is resisted, except at those times when it occurs in major shifts. Such facts are clearly consistent with rules, but it must be shown that they are consistent with models like that in R&M. A way to do this is to consider whether successive generations of such models could maintain an approximation of rule-governed behavior without containing rules. The models we have considered achieve 90-95 percent correct output on their training set,
and considerably less on generalization trials (Pinker & Prince, 1988, calculate 66 percent correct on generalizations). This level is, of course, far below a 5-year-old child's ability, but one might argue that improved models will do better. However, if these models are to be taken as anything like correct models of the child, they must exhibit stable, as well as accurate, behavior in the face of imperfect input. We can operationalize this by asking a simple question (or performing the actual experiment): what will an untrained model learn, if it is given as input the less-than-perfect output of a trained model? This question is divisible into two parts: how fast will the 'child' model arrive at its asymptotic performance compared with the 'parent', and what will the asymptotic level be? It is likely that, for a given number of trials before asymptote is reached, the child-model will perform worse than the parent-model. This follows from the fact that the data the child-model is given are less reliably related to the actual structure of the language and, therefore, must require more trials to arrive at a stable output. It is less clear what the final asymptotic level will be. If the parent-model errors were truly random in nature, then the final asymptotic level of performance should be the same in the child model. But, in fact, the parental errors are not random: they tend to occur on just those forms which are hard for the model to learn. R&M offer a case in point: after 80,000 trials, the model still makes a variety of strange errors on the past tense (e.g., 'squawked' for 'squat'; 'membled' for 'mail'; see Pinker & Prince, 1988, especially for an analysis of the 'blending' mechanism which produces cases like the second). It is intuitively clear that some of these errors occur because of phonological coincidences, others because of the overwhelming frequency of the regular past ending.
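The generational question just posed can be made concrete with a deliberately toy simulation (this is our illustration, not R&M's network; the verbs, token frequencies, and memorization threshold below are invented for the example): a learner keeps an irregular past form only if it has seen it often enough, and otherwise applies the regular '+ed' rule; each 'child' is then trained on its 'parent's' output.

```python
# Toy sketch of the parent-to-child training experiment (hypothetical data).
REGULARS = {"walk": "walked", "talk": "talked"}
IRREGULARS = {"go": "went", "sing": "sang"}
FREQ = {"walk": 50, "talk": 40, "go": 30, "sing": 2}  # token frequencies

def learn(corpus, threshold=3):
    """Induce a past-tense function: memorize an irregular pair only if it
    occurs at least `threshold` times; otherwise fall back on '+ed'."""
    counts = {}
    for verb, past in corpus:
        if past != verb + "ed":
            counts[(verb, past)] = counts.get((verb, past), 0) + 1
    memorized = {v: p for (v, p), n in counts.items() if n >= threshold}
    return lambda v: memorized.get(v, v + "ed")

def output_corpus(speak):
    """The (possibly erroneous) corpus a speaker passes to its child."""
    truth = {**REGULARS, **IRREGULARS}
    return [(v, speak(v)) for v in truth for _ in range(FREQ[v])]

# Generation 0 hears the true corpus; each later generation hears its parent.
corpus = [(v, p) for v, p in {**REGULARS, **IRREGULARS}.items()
          for _ in range(FREQ[v])]
for generation in range(3):
    speak = learn(corpus)
    corpus = output_corpus(speak)

# The frequent irregular survives; the rare one is regularized, and the
# error is then transmitted to every descendant generation.
print(speak("go"), speak("sing"))
```

Even in this caricature, the errors are not random: they fall on exactly the low-frequency irregular forms, and once a generation has regularized a form, no later generation can recover it from its input.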
In both kinds of cases, the errors have a systematic basis and are not random; indeed, they are, by operational definition, just the cases which frequency-based computations in both models discriminate with difficulty. So, we can expect the child-model to perform even worse on these cases, once given seductively misleading analyses by the parent model. Eventually, with some number of generations (itself determined by the learning curve parameters and other TRICS), the final descendant-model will stabilize at always getting the critical cases wrong.

There are many indeterminacies in these considerations, and the best way to see what happens will be to train successive generations of models. We think that this is an important empirical test for any model of learning. One must not only show that a particular model can approximate rule-like behavior, given perfect input and perfect feedback information, but that successive generations of exactly the same kind of model continue to re-generate the same rule-like regularities, given the imperfect input of their immediate ancestor-models. R&M, considered in this light, predicts that in successive generations, a language system will degenerate quickly towards the dominant rule, overgeneralizing most of the exceptional cases. But this does not occur in actual linguistic evolution. Rather, it is characteristic that every systematic process has some sub-systematic exceptions. As Sapir put it, 'grammars always leak'. One can speculate as to why this is so (Bever, 1986; Sadock, 1974). The fact that it is so poses particular problems for a model based on imperfect frequency approximations by successive generations.

We do not doubt that a set of diachronic TRICS can be tacked onto the model, which would tend to maintain the rule-like regularities in the face of systematically imperfect input information. We expect, by induction on the properties of the current TRICS, that two things will be true about any new TRICS: (1) insofar as they have any systematic motivation, it will not be from the connectionist framework, but from the rule-based explanation; (2) insofar as they work, it will be because of their relation to the rule-based account.

5.4.2. Linguistic intuitions

It is also striking that children seem to have an explicit differentiation of the way they talk from the way they should talk. Children are both aware that they overgeneralize and that they should not do it. Bever (1975) reports a dialogue demonstrating that his child (age 3;6) had this dual pattern.

Tom: Where's mommy?
Frederick: Mommy goed to the store.
Tom: Mommy goed to the store?
Frederick: NO! (annoyed) Daddy, I say it that way, not you.
Tom: Mommy wented to the store?
Frederick: No!
Tom: Mommy went to the store.
Frederick: That's right, mommy wennn ... mommy goed to the store.

Slobin (1978) reported extended interviews with his child demonstrating a similar sensitivity: 'she rarely uses some of the [strong] verbs correctly in her own speech; yet she is clearly aware of the correct forms.' He reports the following dialogue at 4;7.

Dan: ... Did Barbara read you that whole story ...
Haida: Yeah ... and ...
mama this morning after breakfast, read ('red') the whole book ... I don't know when she readed ('reeded') ...
Dan: You don't know when she what?
Haida: ... she readed the book ...
Dan: M-hm.
Haida: That's the book she read. She read the whole, the whole book.
Dan: ..., huh?
Haida: Yeah ... read! (annoyed)
Dan: ...?
Haida: Babar, yeah. You know 'cause you readed some of it too ... she ... all the rest.
Dan: ...
Haida: ...
Dan: Oh, that's right; yeah, I readed the beginning of it.
Haida: Readed? (annoyed surprise) Read!
Dan: What ...
Dan: Oh, yeah ... Sure ... read.
hand, they clearly made overgeneralization errors; on the other hand, they clearly knew what they should and should not say. This would seem to be evidence that the overgeneralizations are strictly a performance function of the talking algorithm, quite distinct from their linguistic knowledge. The fact that the children know they are making a mistake emphasizes the distinction between the structures that they know and the sequences that they utter. (But see Kuczaj, 1978, who showed that children do not differentiate between experimentally presented correct and incorrect past tense forms. We think that he underestimates the children's competence because of methodological factors. For example, he assumes that children think that everything they say is grammatical, which the above reports show is not true. Finally, in all his studies, the child in general prefers the correct past forms for the irregular verbs.)
In brief, we see that even children are aware that they are following (or should follow) structural systems. Adults also exhibit knowledge of the contrast between what they say and what is grammatical. For example, the first sentence below is recognized as usable but ungrammatical, while the second is recognized as grammatical but unusable (see discussion in Section 7 below).
Either I or you are crazy.
Oysters oysters oysters split split split.

Children and adults who know the contrast between their speech and the correct form have a representation of the structural system. Hence, it is of little interest to claim that a connectionist model can 'learn' the behavior without the structural system. Real people learn both; most interestingly, they sometimes learn the structure before they master its use.
5.4.3. Language behaviors

There is also considerable experimental evidence for the independent role of grammatical structures in the behavior of adults. We discuss this under the rubric of evidence for a 'psychogrammar' (Bever, 1975), an internalized representation of the language that is not necessarily a model of such behaviors as speech perception or production, but a representation of the structure used in those and other language behaviors. Presumably, the psychogrammar is strongly equivalent to some correct linguistic grammar, with a universally definable mental and physiological representation. We set up the concept for this discussion to avoid claiming "psychological" or "physiological reality" for any particular linguistic grammar or mode of implementation. Rather, we wish to outline some simple evidence that a psychogrammar exists: this demonstration is sufficient to invalidate the psychological relevance of those connectionist learning models which do not learn grammars.

The fundamental mental activity in using speech is to relate inchoate ideas with explicit utterances, as in perception and production. There is observational and experimental evidence that multiple levels of linguistic representation are computed during these processes (Bever, 1970; Dell, 1986; Fodor, Bever, & Garrett, 1974; Garrett, 1975; Tanenhaus, Carlson, & Seidenberg, 1985). The data suggest that the processes underlying these two behaviors are not simple inversions, so they may make different use of the grammatical structures, suggesting separate representations of the grammar. Hence, the psychogrammar may be distinct from such systems of speech behavior; in any case, it explains certain phenomena in its own right. In standard linguistic investigations, it allows the isolation of linguistic universals due to psychogrammatical constraints from those due to the other systems of speech behavior.
We think that the achievements of this approach to linguistic research have been prodigious and justify the distinction in themselves. A further argument for the separate existence of a psychogrammar is the empirical evidence that it is an independent source of acceptability intuitions. The crucial data are sequences which are intuitively well-formed but unusable, and sequences which are usable but intuitively ill-formed, as discussed above. Such cases illustrate that behavioral usability and intuitive well-formedness do not overlap completely, suggesting that each is accounted for by (at least partially) independent mental representations.

5.4.4. Conclusion: The explanatory role of rules

The evidence we have reviewed in the previous three sections demonstrates that even if one were to differentiate structural rules from algorithmic rules, it remains the case that the structural rules are directly implicated in the explanation of linguistic phenomena. That is, the rules are not merely abstract descriptions of the regularities underlying language behaviors, but are vital to their explanation, because they characterize certain mental representations or processes. It is because of those mental entities that the rules compactly describe facts of language acquisition, variation, and history; they provide explanations of linguistic knowledge directly available to children and adults; they help explain those mental representations involved in the comprehension and production of behaviorally usable sentences; they are part of the explanation of historical facts about languages.
6. Learning to assign thematic roles
We now turn to a second model, in which a different linguistic behavior is apparently learned: the assignment of thematic roles to nounphrases in different serial positions (McClelland & Kawamoto, 1986; M&K). The system, which learns to assign thematic roles to nouns in specific syntactic positions, is similar, consisting of triples of a syntactic position and a noun or verb; the output nodes represent semantic features, one for the noun, one for the verb, and the thematic noun-verb relation. There is a probabilistic blurring mechanism which turns on feature/role nodes only 85 percent of the time when they should be on, and 15 percent of the time when they should not.
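The blurring mechanism just described can be sketched in a few lines (the teacher vector below is an arbitrary toy example; M&K's actual feature vectors are much larger):

```python
import random

def blur(target, p_on=0.85, p_off=0.15):
    """Turn on each target node with probability 0.85 when it should be
    on, and with probability 0.15 when it should be off."""
    return [1 if random.random() < (p_on if bit else p_off) else 0
            for bit in target]

random.seed(0)                        # for reproducibility
teacher = [1, 0, 1, 1, 0, 0, 1, 0]    # toy feature/role target vector
print(blur(teacher))
```

The network is thus trained against a noisy version of the correct feature/role pattern rather than against the pattern itself.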
The semantic TRICS
Ideally (that is, in one's idealization of how this model must work to be a theory of learning to attach thematic roles to words), the features (human, animate ...) would be taken from some independently defined set of semantic features, intended to account for naming behavior, or within a semantic theory (e.g., a theory to account for synonymy and ...). Then the role of statistical blurring might be interpreted as some interaction between these formal features and the continuous variability which can occur when fitting nouns into thematic roles.
But for all the statistical blurring, it remains the case that the 'role features' do not flow from some theory: rather, they are descriptors of the roles themselves. Here, the features chosen for each role (agent/instrument, object/modifier) reflect the likelihood that the corresponding verb is an action, that it takes modifiers, and so on (see the accompanying figure). Hence, any 'learning' that occurs is trivial: it does not involve isolating independently defined semantic features relevant to roles, but rather an accumulation of activation strengths from having the role-features available, and being given correct instances of words (feature matrices) placed in particular role positions.

[Figure: the feature dimensions used in the M&K model. Nouns: human/nonhuman, soft/hard, male/female/neuter, small/medium/large, compact/1-D/2-D/3-D, pointed/rounded, fragile/unbreakable, food/toy/tool/utensil/furniture, animate/inanimate, natural/... Verbs: DOER (yes/no), CAUSE, TOUCH, NAT_CHNG, AGT_MVMT, PT_MVMT, INTENSITY. AGT = Agent.]

One of the achievements of this model, according to M&K, is that it 'overgeneralizes' thematic role assignments. For example, 'doll' is not marked as 'animate' and therefore is ineligible to be an agent. However, 'doll' is nonetheless assigned significant strength as agent in such sentences as 'the doll moved'. This result seems to be due to the fact that everything is assigned the gender 'neuter' except animate objects and the word 'doll', which is assigned 'female'. Thus, 'neuter' becomes a perfect predictor of inanimacy, except for 'doll'. It is not surprising that 'doll' is treated as though it were animate.
because of some inherent limitation on their computational power. We now consider connectionist learning models with more computational power, and examine some specific instances for TRICS.

Connectionist learning machines are composed of perceptrons. One of the staple theorems about perceptrons is that they cannot perform certain Boolean functions if they have only an input and an output set of nodes (Minsky & Papert, 1969). Exclusive disjunctive conditions are among those functions that cannot be represented in a two-layer perceptron system. Yet even simple phonetic phenomena involved in the past tense require disjunctive descriptions, if one were limited to two-level descriptions. For example, the variants of 'mounded' discussed above involve disjunction of the presence of 'n', the lengthened vowel, and the tongue flap. That is, the tongue-flap pronunciation of 't' or 'd' can occur only if the 'n' has been deleted, and the previous vowel has been lengthened. Furthermore, the distinction between /t/ and /d/ in 'mounted' and 'mounded', in some pronunciations, has been displaced to the length of the preceding vowel. The solution for modelling such disjunctive phenomena within the connectionist framework is the invocation of units that are neither input nor output nodes, but which comprise an intermediate set of units which are 'hidden' (Hinton & Sejnowski, 1983, 1986; Smolensky, 1986). (A formal problem in the use of hidden units is formulating how the perceptron learning rule should apply to a system with them: there are two (or more) layers of connections to be trained on each trial, but only the output layer is directly corrected. Somehow, incorrect weights and thresholds must be corrected at both the output and hidden levels.
A current technique, 'back-propagation', is the instance of such a learning rule used in the examples we discuss below: it apportions 'blame' for incorrect activations to the hidden units which are involved, according to a function too complex for presentation here (see Rumelhart, Hinton, & Williams, 1986).)
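The two-layer limitation and its hidden-unit remedy can be made concrete. In the sketch below the weights are hand-set for illustration (no learning rule is involved): one hidden threshold unit computes OR, another computes AND, and the output unit fires exactly when the two disagree — the exclusive disjunction that no single layer of threshold units can compute on its own.

```python
def step(x):
    """Threshold (perceptron) unit: fire iff net input exceeds zero."""
    return 1 if x > 0 else 0

def xor_net(a, b):
    h1 = step(a + b - 0.5)       # hidden unit 1: a OR b
    h2 = step(a + b - 1.5)       # hidden unit 2: a AND b
    return step(h1 - h2 - 0.5)   # output: OR but not AND, i.e. XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))
```

Back-propagation addresses the training problem this architecture raises: how such hidden-layer weights could be found from examples rather than set by hand.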
Recent work has shown that a model with hidden units can be trained to regenerate a known acoustic sample of speech, with the result that novel speech samples can also be regenerated without further training (Elman & Zipser, 1986). The training technique does not involve explicit segmentation of the signal, nor is there a mapping onto a separate response. The model takes a speech sample in an input acoustic feature representation: the input is mapped onto a set of input nodes, in a manner similar to that of McClelland and Elman. Each input node is connected to a set of hidden nodes, which in turn are connected to a layer of output nodes corresponding to the input nodes. On each trial, the model adjusts the weights between the layers of nodes following the learning rule (the back-propagation variant of it) to improve the match between the input and the output. This model uses an auto-associative technique, in which weights are adjusted to yield an output that is the closest fit to the original input. After many trials (up to a million), the model is impressively successful, regenerating new speech samples from the same speaker. This is an exciting achievement, since it opens up the possibility that an analysis of the speech into a compact internal representation is possible, simply by exposure to a sample. There are several things which remain to be shown. For example, the internal analysis at which the model arrives may or may not correspond to a linguistically relevant analysis.
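A minimal auto-associative sketch in this spirit (the sizes and data are ours, and linear units are used for simplicity; Elman and Zipser's network is far larger and nonlinear): gradient descent adjusts two weight layers so that the input, squeezed through a narrow hidden layer, reconstructs itself.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))              # 20 toy 'samples', 8 features each
W1 = rng.normal(scale=0.1, size=(8, 3))   # input -> 3 hidden units
W2 = rng.normal(scale=0.1, size=(3, 8))   # hidden -> output (same size as input)

def loss(X, W1, W2):
    """Mean squared reconstruction error."""
    return np.mean((X @ W1 @ W2 - X) ** 2)

initial = loss(X, W1, W2)
lr = 0.01
for _ in range(500):
    H = X @ W1                        # hidden activations
    E = H @ W2 - X                    # reconstruction error
    gW2 = H.T @ E / len(X)            # 'blame' for the output weights
    gW1 = X.T @ (E @ W2.T) / len(X)   # 'blame' propagated to hidden weights
    W1 -= lr * gW1
    W2 -= lr * gW2

print(initial, "->", loss(X, W1, W2))   # reconstruction error decreases
```

Because the hidden layer is narrower than the input, a successful network must have found some compact internal recoding of the data — which is exactly why the linguistic status of that recoding becomes the open question.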
Hanson and Kegl (1987; H&K) use the auto-association method with hidden units to re-generate sequences of syntactic categories which correspond to actual sentences. After a period of learning, the model can take in a sequence of lexical categories (e.g., something like determiner, noun, verb, determiner, noun, adverb), and regenerate that sequence. What is interesting is that it can regenerate sequences which correspond to actual English sentences, but it does not regenerate sequences which do not correspond to possible sentences:

If [our model] does not recognize such sentences after nothing more than exposure to data, this would lead us to suspect that rather than being an innate property of the learner, these constraints and conditions follow directly from regularities in the data ... Both [our model] and the child are only exposed to sentences from natural language, and they both must induce general rules and larger constituents from just the regularities to which they are exposed.

That is, H&K take the success of their model to be an existence proof that some linguistic universals can be learned without any internal structure. This makes it imperative to examine the architecture of their model: as we shall see, it incorporates linguistically defined representations in certain crucial
ways which invalidate their empiricist conclusion.

Here is how one of the models works. The model is trained on a set of 1000 actual sentences, ranging from a few to 15 words in length. Each lexical item in every input sentence is manually assigned to a syntactic category, each coded into a 9-bit sequence (input texts were taken from a corpus with grammatical categories already assigned: Francis & Kucera, 1979). These sequences are mapped onto 270 input nodes (135 for 15 word positions, each with nine bits; another distinct set for word boundary codes). The categorized sequences are then treated as input to a set of 45 hidden nodes: each input category node is connected to each hidden node, and each hidden node is connected to a corresponding set of 270 output nodes (see Figure 5). During training, the model is given input sequences of categories; the model matches the input against the self-generated output on each trial. The usual learning rule applies to adjust the weights on each trial (using a variation of back-propagation), with the usual built-in variability in learning on each trial. After 180,000 trials with the training set, the model asymptoted at about 90 percent correct on both the training set and on new sentence-based category sequences which had not been presented before.

H&K highlight four qualitative results in the trained model's responses to new cases. First, the model supplements incomplete information in input sequences, so that the regenerated sequences conform to possible sentence types. For example, given the input in (3a), the model's response fills the missing word with a verb, as in (3b). (We are quoting directly from their examples. Roughly, the lexical categories correspond to distinctions used in Francis & Kucera, 1979; 'p-verb' refers to 'verb in past tense form'.)

3a. article, noun, (BLANK), article, noun
3b. article, noun, p-verb, article, noun

Second, the model corrects deviant input sequences into possible ones: given (4a) as input, it responds with (4b).
[Figure 5. The H&K auto-associative architecture, illustrated with the category-coded sentence the boy threw the ball: each word (the/at, boy/n, threw/vbd, the/at, ball/n) appears as an input category code and as the corresponding regenerated output.]
4a. article, noun, p-verb, adverb, article, noun, p-verb
4b. article, noun, p-verb, preposition, article, noun, p-verb

The interest of this case is based on the claim that (4a) does not correspond to a possible sequence: they say (4a) corresponds (e.g.) to *the horse raced quickly the barn fell, while (4b) corresponds to the horse raced past the barn fell. (Note that (4b) is a tricky sentence: it corresponds to the horse that was raced past the barn, fell.)
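The position-by-position input coding described above can be sketched as follows (the category inventory is a toy subset, and the exact placement of the word-boundary bits is our assumption; H&K's codes come from the Francis & Kucera category set):

```python
CATEGORIES = {"art": 1, "n": 2, "vbd": 3, "prep": 4, "adv": 5}  # toy subset

def encode(seq, positions=15, bits=9):
    """Code each word position as a 9-bit category number; the first 135
    bits hold 15 positions x 9 bits, the second 135 hold boundary codes."""
    vec = [0] * (positions * bits * 2)
    for i, cat in enumerate(seq):
        code = CATEGORIES[cat]
        for b in range(bits):
            vec[i * bits + b] = (code >> b) & 1
        vec[positions * bits + i * bits] = 1   # mark this position as filled
    return vec

v = encode(["art", "n", "vbd", "art", "n"])    # 'the boy threw the ball'
print(len(v))                                  # 270 input nodes
```

Note that nothing in this vector represents words themselves — only hand-assigned categories in serial positions, a point that matters for the argument below.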
Third, the model regenerates single center-embeddings (e.g., corresponding to the rat the cat chased died): this regeneration occurs despite the lack of even one such sequence in the training set. Given a syntactic sequence corresponding to a double embedding (5a), the model responds with (5b). H&K say that this shows that their model can differentially generalize to sentences that can appear in natural language (center-embeddings) but cannot recognize sentences which violate natural language constraints (multiple center-embeddings).
Finally, the model refuses to regenerate sequences in which an adverb interrupts a verb and a following article-noun sequence which would have to be the verb's direct object. For example, given (6a) as input (corresponding to *John gave quickly the book), the model responds with (6b) (corresponding to John quickly was the winner).

6a. noun, verb, adverb, article, noun
6b. noun, adverb, was, article, noun

H&K say that this shows that the model has acquired one of the universal case-marking constraints as expressed in English: a direct object must be adjacent to a verb in order to receive case from it, and thereby be allowed (licensed) to occur in object position (a constraint from a Government-Binding approach (Chomsky, 1981)).
It is enterprising of H&K to put the model to such qualitative tests, over and above the 90 percent level of correct regenerations. As in formal linguistic research, it is the descriptive claims made by a model which justify its
only, but it is not unique to the verb, either. Similarly, one would not be surprised that the model does not accept the input sequence; this is an obverse part of its 90 percent success in regenerating correct input sequences. In this case, the model responds by changing one lexical category so that the output corresponds to a possible English sentence; the question is, does it change the input to a correct or behaviorally salient sentence? Consider a sequence other than (4b) which would result from changing one category in (4a):

4c. art, noun, p-verb, conj, art, noun, p-verb (the horse raced and the crowd roared)

Yet the model apparently chose (4b), a sentence type which has long been understood as an example of a sentence which is difficult for speakers to understand (see Bever, 1970, for discussion). Accordingly, we take the fact that the model corrects (4a) to the sequence corresponding to the horse raced past the barn fell, rather than to one corresponding to the horse raced breathtakingly and the crowd roared, to be a failure of the model, given other options like (4c) which correspond to much easier structures. Finally, the output in (4b) does not correspond to a well-formed category sequence: in the categories used to categorize their input, past verb is differentiated from past participle, so the correct category sequence corresponding to the horse raced past the barn fell would be as in (4d):

4d. article, noun, (past participle verb), article, noun, p-verb

The logic of the interpretation of the next two cases is contradictory; the treatment of multiple center-embedding is conversely puzzling. Having taken as an achievement the model's alleged success at regenerating a behaviorally difficult sequence like (5a), H&K report approvingly that the model rejects doubly embedded sentences. Although they note that others have argued that such constructions are complex due to behavioral reasons, they appear to believe that the model, in rejecting them, is simulating natural language constraints. Others have also argued that the difficulty of center-embedded constructions shows that an adequate model of language structure should not represent them (e.g., McClelland & Kawamoto, 1986; Reich, 1969). For example, McClelland and Kawamoto write:
The unparsability of [doubly-embedded] sentences has usually been explained by an appeal to adjunct assumptions about performance limitations (e.g., working-memory limitations), but it may be, instead, that they are unparsable because the parser, by the general nature of its design, is simply incapable of processing such sentences.
There is a fundamental error here. Multiple center-embedding constructions are not ungrammatical, as shown by cases like (6c) (the unacceptability of (6d) shows that the acceptability of (6c) is not due to semantic constraints alone).

6c. The reporter everyone I have met trusts, is predicting another Irangate.
6d. The reporter the editor the cat scratched fired died.
In fact, the difficulty of center-embedding constructions is a function of the differentiability of the nounphrases and verbphrases: (6c) is acceptable because each of the nounphrases is of a different type, as is each of the verbphrases; conversely, (6d) is unacceptable because the three nounphrases are syntactically identical, as are the three verbphrases. This effect is predicted by any comprehension mechanism which labels nounphrases syntactically as they come in, stores them, and then assigns them to verbphrase argument positions as they appear: the more distinctly labelled each phrase is, the less likely it is to be confused in immediate memory (see Bever, 1970, and Miller, 1962, for more detailed discussions). Thus, there are examples of acceptable center-embedding sentences, and a simple performance theory which predicts the difference between acceptable and unacceptable cases. Hence, H&K are simply in error to claim that it is an achievement of their model to reject center-embeddings, at least if the achievement is to be taken as reflecting universal structural constraints on possible sentences.

The other two qualitative facts suggest to H&K that their model has developed a representation of constituency. The model regenerates single embeddings without exposure to them, and rejects sequences which interrupt a verb-article-noun sequence with an adverb after the verb. Both behaviors indicate to H&K that the model has acquired a representation corresponding to a nounphrase constituent. They buttress this claim with an informal report of a statistical clustering analysis on the response patterns of the hidden units to input nounphrases and verbphrases: the clusters showed that some units responded strongly to nounphrases, while others responded strongly to verbphrases. This seems at first to be an interesting result. But careful analysis of the internal structure of the model and the final weighting patterns are required to determine how it works. The grouping of those sequences which are nounphrases (sequences containing an article and/or some kind of noun) from
230
those which are verbphrases (containing one of the list of verb types) might occur for many reasons. Indeed, since verbphrases often contain nounphrases, it is puzzling that verbphrases did not excite nounphrase patterns at the same time: the very notion of constituency requires that they should. As it stands, the model appears to have encoded something like the following word properties of the 215 sentences: they have an initial set of phrase types. This is an achievement, but not one that exceeds many item-and-arrangement schemata. Hanson and Kegl's brief presentation (imposed on them by the publication) leaves several issues unclear. One issue involves the informativeness of the input categories, which are assigned by hand. The input for the model is richly differentiated into 467 syntactic categories. There are many distinctions made among the function words and morphemes. For example, many forms of 'be' are differentiated, possessive pronouns are differentiated from other personal pronouns, subject pronouns are differentiated from object pronouns, comparative adjectives are differentiated from absolutes, and so on. Given a rich enough analysis of this kind, every English sentence falls into one kind of pattern or another. Indeed, language parsers can operate surprisingly successfully with differential sensitivity to 20 classes of function words and no differentiation of content words at all. The reason is straightforward: function words are the skeleton of English syntax. They tend to begin phrases, not end them; many function words offer unique information: e.g., 'the', 'a', 'my' always signal a noun somewhere to the right, 'towards' always signals a nounphrase or the end of a clause, and so on. In fact, it would be interesting to see if H & K's parser performs any worse if it is trained only on function-word categorized input. As it stands, recognition of a few basic redundancies might well account for the model's 85 percent hit rate.
H & K state that their model begins with no assumptions about syntactic structure nor any special expectations about properties of syntactic categories, other than the fact that they exist. This is a bit misleading. First, syntactic categories are not independently distinguished from the syntax in which they occur. Many, if not all, syntactic categories in a language are like phonological distinctive features, in that they are motivated in part by the rule-based function they serve universally and in a particular language. For example, in English, the words 'in', 'on', 'under' ... are all prepositions just because they act the same way in relation to linguistic constraints: they precede nounphrases, can be used as verb-particles, can coalesce with verbs in the passive form, and so on. Giving the model the information that these privileges of
occurrence coincide with a category is providing crucial information about what words can be expected to pattern together, information which a learner would have to discover. Thus, providing categorized input provides information which itself reflects syntactic structure.
H & K actually give the model even more information: they differentiate members of the same category when they are used in syntactically different ways. For example, the prepositions are pre-categorized differently according to whether they are used in prepositional phrases or as verb particles. In the categorized samples they provide, the instances below on the left are classified as prepositions ('in'), and those on the right as particles ('rb'), as in (7):

7. looked in ('in')        pushed aside ('rb')
   working on the ('in')   pass it up ('rb')
   to the ('in')
   over the ('in')

This differentiation, automatic for the model, solves ahead of time one of the more difficult problems of English parsers, the differentiation of prepositions from particles, as reflected in (7). Such syntactic pre-disambiguation of phonologically indistinguishable categories appears in other cases. For example, 'to' is differentiated as a preposition, as in the instances above, and as a complementizer ('to devote' ('to'), 'to continue' ('to')); 'it' as subject is differentiated from 'it' as object; 'that' as conjunction is differentiated from 'that' as relative pronoun; simple past tense verb forms ('pushed', 'sang') are differentiated from superficially identical past participle forms ('pushed', 'sung'). These are just the categorization disambiguations apparent in the four categorized sample selections which H & K present. In the categorization scheme they used (Francis & Kucera, 1979) other syntactic homophones are disambiguated as well. Thus, one of the hardest problems in the parsing framework may be solved by the categorization of the input: how to distinguish the use of a category in terms of its local syntactic function. This definitely is among the TRICS, in the sense defined above. There seems to be at least another. The categories are further differentiated
according to their syntactic function: more fine-grained categories were assigned initially to function words than to content words. A more complete presentation of the model would make it possible to determine how general such pre-categorization is, and how active the TRICS we found are in its success. We are given no direct information about how fine-grained the grammatical categorization of the input is. The dilemma involves whether each sample sequence is categorized grammatically, as in (8a) and (8b), or as in (8c). For example:

8a. word; word following the determiner and preceding the verb; past-tense verb; word; word following the verb; adverb

8b. determiner; noun, singular, grammatical subject in the determiner phrase, thematic agent of the verb; verb, past-tense, transitive; determiner; noun, singular, grammatical object in the determiner phrase, thematic patient of the verb; adverb modifying the verb

8c. determiner, noun, verb, determiner, noun, adverb

Hence the dilemma which H & K face is the obverse of the claim that the model begins with no structural information. The model regenerates sequences like (8a), roughly in the basic order of the input; but it is the number of grammatical distinctions encoded within each input sequence, as in (7) and (8), which allows it to do so. If the richness of the categorization scheme is what does the work, then the structural universals are in effect 'innate', part of the coding of the input; if the categories are limited to what a child can find in the data, the machine is of limited interest as a model of the child. What may be of interest is whether the model could succeed given only input categorized as in (8c): that would at least be a
demonstration of the kind H & K seek. Even more impressive would be a demonstration that it can learn to map inputs of the complexity in (8c) onto outputs of the complexity in (8b). That would be more like real learning, taking in lexically categorized input and learning to map it onto grammatically categorized output. In such a case, both the simple and complex categories could be viewed as innate. The empirical question would be: can the model learn the regularities of how to map simple onto complex grammatical categorizations of sentences? If such a model were empirically successful, it would offer the potential of testing some aspects of the empiricist hypothesis
which H & K may have in mind.
Aside from TRICS, it is clear that auto-association works both at the phonological and the syntactic level because there are redundancies in the input: the hidden units give the system the computational power to differentiate disjunctive categories, which allows for isolation of disjunctive redundancies. We think that the models involve techniques which may be of great importance for certain applications. But, as always with artificially 'intelligent' systems, the importance of these proofs for psychological models of learning will depend on their rate of success and the internal analysis both of the prior architecture of the model and of the categories implied by the final weight patterns. Ninety percent correct after 180,000 trials is not impressive when compared to a human child, especially when one notes that the child must discover the input categories as well. In fact, it is characteristic of widely divergent linguistic theories that they usually overlap on 95 percent of the data; it is the last 5 percent which brings out deep differences. Many models can achieve 90 percent correct analysis: this is why small differences in how close to 100 percent correct a model is can be an important factor. But, most important is the analysis which the machine gives to each sentence. At the moment, it is hard for us to see how 45 associative units will render the
8. Conclusions
We have interpreted R & M, M & K, and H & K as concluding from the success of their models that artificial intelligence systems that learn are possible without rules. At the same time we have shown that both the learning and adult behavior models contain devices that emphasize the information which carries the rule-based representations that explain the behavior. That is, the models' limited measure of success ultimately depends on structural rules. We leave it to connectionists' productionist brethren in artificial intelligence and cognitive modelling to determine the implications of this for the algorithmic use of rules in artificial intelligence. Our conclusion here is simply that one cannot proclaim these learning models as replacements for the acquisition of linguistic rules and constraints on rules.
9. Some general considerations

9.1. The problem of segmenting the world for stimulus/response models of learning

In the following section we set aside the issue of the acquisition of structure, and consider the status of models like R & M's as performance models of learning. That is, we now stipulate that representations of the kind we have highlighted are built into the models; we then ask, are these learning models with built-in representations plausible candidates as performance models of language acquisition? We find that they suffer from the same deficiencies as all learning models which depend on incrementally building up structures out of isolated trials.
(with secondary and tertiary connections corresponding to between-node and hidden-node connections). The connectionist model operates 'in parallel', thereby allowing for the simultaneous establishment of complex patterns of activation and inhibition. We can view these models as composed of an extremely large number of s-r pairs; for example, the verb learning model would have roughly 211,600 such pairs, one for each input/output Wickelfeature pair. In fact, we can imagine a set of rats, each of whose tails is attached to one of the 460 Wickelfeature inputs and each of whose nose-whiskers is attached to one of the outputs: on each trial, each rat's tail may be stimulated or not. He then may lunge at his nose node. His feedback tells him either that he was supposed to lunge or not, and he adjusts the likelihood of lunging on the next trial using formulae of the kind explored by Hull and his students, Rescorla & Wagner (1972), and others.
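The trial-by-trial adjustment attributed here to each rat can be illustrated with a Rescorla-Wagner style update, in which the change in an associative strength is proportional to the difference between the obtained and the predicted reinforcement. The sketch below is ours, not part of any of the models under review; the function name and the parameter values are illustrative assumptions:

```python
# Sketch of a Rescorla-Wagner style update for one output node (one "rat").
# V[i] is the associative strength from input feature i to this output.
# On each trial, the prediction is the summed strength of the active inputs;
# the error (lambda - prediction) drives the change in each active strength.

def rescorla_wagner_update(V, active_inputs, reinforced, alpha=0.1, beta=1.0):
    """One learning trial. `reinforced` says whether the output was
    supposed to fire (lambda = 1) or not (lambda = 0)."""
    lam = 1.0 if reinforced else 0.0
    prediction = sum(V[i] for i in active_inputs)
    error = lam - prediction
    for i in active_inputs:
        V[i] += alpha * beta * error   # only active inputs are adjusted
    return V

# A toy run: one input feature repeatedly paired with reinforcement.
V = [0.0] * 4
for _ in range(50):
    rescorla_wagner_update(V, active_inputs=[0], reinforced=True)
print(round(V[0], 2))  # the strength approaches the asymptote of 1.0
```

At asymptote the summed strength of the active inputs matches the reinforcement; a full MPR would maintain one such strength for each of its roughly 211,600 (460 × 460) input/output pairs.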
We will call this arrangement of rats a Massively Parallel Rodent (MPR). Clearly, the MPR as a whole can adjust its behavior. That is, if a connection between two nodes is fixed, and if the relevant input and output information is unambiguously specified and reinforced, and if there is a rule for changing the strength of the connection based on the input/output/reinforcement configuration, then the model's connection strengths will change on each trial. In his review of Skinner's Verbal Behavior (1957), Chomsky (1959) accepted
this tautology about changes in s-r connection strengths as a function of reinforcing episodes. But he pointed out that the theory does not offer a way of segmenting which experiences count as stimuli, which as responses, and to which pairs of these a change in drive level (reinforcer) is relevant. Stimulus, response, and reinforcement are all interdefined, which makes discovering the 'laws' of learning impossible in the stimulus/response framework. We solved the corresponding practical problem for our MPR by giving it the relevant information externally: we segment which aspects of its experience are relevant to each other in stimulus/response/reinforcement triples. Accordingly, the MPR works because each rat does not have to determine what input Wickelfeature activation (or lack of it) is relevant to what output Wickelfeature, and whether positively or negatively: he is hard-tailed and -nosed into a given input/output pair. Each learning trial externally specifies the effect of reinforcement on the connection between that pair. In this way, the MPR can gradually be trained to build up the behavior as composed out of differential associative strengths between different units.
Suppose we put the MPR into the field (the verbal one), and wait for it to learn anything at all. Without constant information about when to relate reinforcement information to changes in the threshold and response weights, nothing systematic can occur. The result would appear to be as limited and circular as the apparently simpler model proposed by Skinner. It might seem that giving the MPR an input set of categories would clear this up. It does not, unless one also informs it which categories are relevant when. The hope that enough learning trials will gradually allow the MPR to weed out the irrelevancies begs the question, which is: how does the organism know that a given experience counts as a trial?
The much larger number of component organisms does not change the power of the single-unit machine if nobody tells them what is important in the world and what is important to do about it. There is no solution, even in very, very large numbers of rats. Auto-association in systems with hidden units might seem to offer a solution to the problem of segmenting the world into stimuli, responses, and reinforcement relations: these models operate without separate instruction on each matching trial. Indeed, Elman & Zipser's model apparently arrives at a compact representation of input speech without being selectively reinforced. But, as we said, it will take a thorough study of the final response patterns of the hidden units to show that the model arrives at a correct compact representation. The same point is true of models in the style of Hanson and Kegl. And, in any case, analysis of the potential stimuli is only part of the learning problem: such models still would require identification of which analyzed units serve as stimuli in particular reinforcement relations to which responses.
So, even if we grant them a considerable amount of pre-wired architecture and hand-tailored input, models like the MPR, and their foreseeable elaborations, are not interesting candidates as performance models for actual learning; rather, they serve, at best, as models of how constraints can be modified by experience, with all the interesting formal and behavioral work of setting the constraints done outside the model. The reply might be that every model of learning has such problems: somehow the child must learn what stimuli in the world are relevant for his language behavior responses. Clearly, the child has to have a segmentation of the world into units, which we can grant to every model including the MPR. But if we are accounting for the acquisition of knowledge and not the pairing of input/output behaviors, then there is no restriction on what counts as a relevant pair. The problem of what counts as a reinforcement for an input/output pair only exists for those systems which postulate that learning consists of increasing the probability of particular input/output pairs (and proper subsets of such pairs). The hypothesis-testing child is at liberty to think entirely in terms of confirming systems of knowledge against idiosyncratic experiences. An example may help here. Consider two ways of learning about the game of tag: a stimulus/response model, and an hypothesis-testing model. Both
models can have innate characteristics which lead to the game as a possible representation. But, in a connectionist instantiation of the s/r model, the acquired representation of the game is in terms of pairs of input and output representations. These representations specify the vectors of motion for each player, and whether there is physical contact between them. We have no doubt that a properly configured connectionist model would acquire behavior similar to that observed in a group of children, in which the players successively disperse away from the player who was last in contact with the player who has a chain of dispersal that goes back to the first player of that kind. As above, the machine would be given training in possible successive vector configurations and informed about each so that it could change its activation weights in the usual connectionist way. Without that information, the model will have no way of modulating its input activation configuration to conform to the actual behavior.
Contrast this with the hypothesis-testing model, also with innate structure: in this case, what is innate is a set of possible games, of which tag is one,
stated in terms of rules . Consider what kind of data such a model must have :
it must be given a display of tag actively in progress, in some representational language , perhaps in terms of movement vectors and locations , just like that for the connectionist model . But what it does not need is step-by-step feedback after each projection of where the players will be next . It needs instances of predictions made by the game of tag that are fulfilled : e.g., that after
contact, players' vectors tend to reverse. How many such instances it needs is a matter of the criteria for confirmation that the hypothesis tester requires. Thus, both the stimulus/response pattern association model and the hypothesis-testing model require some instances of categorizable input. But only the s/r model requires massive numbers of instances in order to construct a representation of the behavior.

9.2. The human use for connectionist models
The only relation in connectionist models is strength of association between nodes. This makes them potential models in which to represent the formation of associations which are (almost by definition) frequent, and, in that sense, important, phenomena of everyday life. Given a structural description of a domain and a performance mechanism, a connectionist model may provide a revealing description of the emergence of certain regularities in performance which are not easily described by the structural description or performance mechanism alone. In this section, we explore some ways to think about the usefulness of connectionist models in integrating mental representations of structures and associations between structures.

9.2.1. Behavioral sources of overgeneralization

We turn first to the 'developmental achievement' of Rumelhart and McClelland's model of past tense learning, the overgeneralization of the 'ed' ending after having achieved better-than-chance performance on some irregular verbs. This parallels stages that children go through, although clearly not for the same reasons. R & M coerce the model into overgeneralization and regression in performance by abruptly flooding the model with regular-verb input. Given the relative univocality of the regular past tense ending, such coercion may turn out to be unnecessary: even equal numbers of regular and irregular verbs may lead to a period of overgeneralization (depending on the learning curve function) because the regular ending processes are simpler. In any case, the model is important in that it attempts to address what is a common developmental phenomenon in the mastery of rule-governed structures: at first, there is apparent mastery of a structurally defined concept, and then a subsequent decrease in performance, based on a generalization from the available data. It is true that some scholars have used these periods of regression as evidence that the child is actively mastering structural rules, e.g.,

9. add 'ed' to form the past tense

Clearly, the child is making a mistake. But it is not necessarily a mistaken application of an actual rule of the language (note that the formula above is
not exactly a rule of the language). Rather, it can be interpreted as a performance mistake, the result of an overactive speech production algorithm which captures the behavioral generalization that almost all verbs form the past tense by adding /ed/ (see Macken, 1987, for a much more formal discussion of this type). This interpretation is further supported by the fact that children, like adults, can express awareness of the distinction between what they say and what they know they should say (section 5.4). There are many other examples of overgeneralization in cognitive development, in which rule-based explanations are less compelling than explanations based on performance mechanisms (Bever, 1982; Strauss, 1983). Consider, for example, the emergence of the ability to conserve numerosity judgments between small arrays of objects. Suppose we present the array on the left below to children, and ask which row has more in it (Bever, Mehler, & Epstein, 1968; Mehler & Bever, 1967). Most children believe that it is the row on the bottom. Now suppose we change the array on the left below to the one on the right, and ask children to report which row now has more: 2-year-old children characteristically get the answer correct, and always perform better than 3-year-old children.
* * * *
* * * *
The 3-year-olds characteristically choose the longer row; this kind of pattern occurs in many domains involving quantities of different kinds. In each case, the younger child performs on the specific task better than the older child. But the tasks are chosen so that they bring a structural principle into conflict with a perceptual algorithm. The principle is 'conservation': that if nothing is changed in the quantity of two unequal arrays, the one with more remains the one with more. The perceptual algorithm is that if an array looks larger than another, it has more in it. Such strategies are well-supported in experience, and probably remain in the adult repertoire, though better integrated than in the 3-year-old. Our present concerns make it important that the 'overgeneralized strategy' that causes the decrease in performance is not a structural rule in any sense; it is a behavioral algorithm.
A similar contrast between linguistic structure and perceptual generalization occurs in the development of language comprehension (Bever, 1970). Consider (10a) and (10b).
10a. The horse kicked the cow
10b. The cow was kicked by the horse

Young children act out both kinds of sentence. But the basis on which they do this appears to change from age
2 to age 4: at the older age, the children perform markedly worse than at the younger age on passive sentences like (10b). This, and other facts, suggest that the 4-year-old child depends on a perceptual heuristic,

11. "assign an available NVN-sequence the thematic relations agent, predicate, patient."

This heuristic is consistent with active sentence order, but specifically contradicts passive order. The heuristic may reflect the generalization that in English sentences, agents do usually precede patients. The order strategy appears in other languages only in those cases in which there is a dominant word order. In fact, in heavily inflected languages, children seem to learn very early to depend just on the inflectional endings and to ignore word order (Slobin & Bever, 1980). The important point here is that the heuristics that emerge around 4 years of age are not reflections of the structural rules. In fact, the strategies can interfere with linguistic success of the younger child and result in a decrease in performance. Certain heuristics draw on general knowledge, for example the heuristic that sentences should make worldly sense. 4-year-old children correctly act out sequences which are highly probable (12a), but systematically fail to act out corresponding improbable sequences like (12b).
12a. The horse ate the cookie
12b. The cookie ate the horse
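For illustration only (the sketch and its toy 'world knowledge' table are our own assumptions, not the authors'), the order heuristic in (11) and the plausibility heuristic probed by (12) can be written as two competing assignment procedures:

```python
# Toy sketch of two comprehension heuristics (our illustration):
# (11) the NVN order strategy: map a noun-verb-noun sequence onto the
#      thematic relations agent-predicate-patient by surface order;
# the plausibility strategy: prefer the reading that makes worldly sense.

# A hypothetical mini table of plausible (agent, action) pairs.
PLAUSIBLE_AGENTS = {("horse", "ate"), ("horse", "kicked"), ("cow", "ate")}

def nvn_heuristic(words):
    """Blindly assign agent/predicate/patient by surface order, as in (11)."""
    noun1, verb, noun2 = words
    return {"agent": noun1, "predicate": verb, "patient": noun2}

def plausibility_heuristic(words):
    """Pick whichever noun is a plausible agent of the verb, ignoring order."""
    noun1, verb, noun2 = words
    if (noun1, verb) in PLAUSIBLE_AGENTS:
        return {"agent": noun1, "predicate": verb, "patient": noun2}
    return {"agent": noun2, "predicate": verb, "patient": noun1}

# (12b) "The cookie ate the horse":
print(nvn_heuristic(["cookie", "ate", "horse"])["agent"])           # cookie
print(plausibility_heuristic(["cookie", "ate", "horse"])["agent"])  # horse
```

On (12b) the order procedure happens to agree with the linguistically indicated reading, while the plausibility procedure overrides the syntax, as the 4-year-olds do; on a passive like (10b), it is the order procedure that misassigns the roles.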
The linguistic competence of the young child before such heuristics emerge is quite impressive. They perform both sentences correctly; they often acknowledge that the second is amusing, but act it out as linguistically indicated. This suggests early reliance on structural properties of the language, which is later replaced by reliance on statistically valid generalizations in the child's experience. These examples demonstrate that regressions in cognitive performance seem to be the rule, not the exception. But the acquisition of rules is not what underlies the regressions. Rather, they occur as generalizations which reflect statistical properties of experience. Such systems of heuristics stand in contrast to the systematic knowledge of the structural properties of behavior and experience which children rely on before they have sufficient experience to extract the heuristics. In brief, we are arguing that R & M may be correct in the characterization of the period of overgeneralization as the result of the detection of a statistically reliable pattern (Bever, 1970). The emergence of the overgeneralization is not unambiguous evidence for the acquisition of a rule. Rather, it may reflect the emergence of a statistically supported pattern of behavior. We turn in the next section to the usefulness of connectionist models.
9.2.2. The nodularity of mind

Up to now, we have argued that insofar as connectionist models seem to acquire structural rules, it is because they contain representational devices which approximate aspects of relevant linguistic properties. Such probabilistic models are simply not well-suited to account for the acquisition of categorical structures. But there is another aspect of behavior to which they are more naturally suited: those behaviors which are essentially associative in nature.
Associations can make repeated behaviors more efficient, but are the most empirically impressive in the description of erroneous or arbitrary behavior. For example, Dell's model of speech production predicts a wide range of different kinds of slips of the tongue and the contexts in which they occur. The TRACE model of speech perception predicts interference effects between levels of word and phoneme representations. By the same token, we are willing to stipulate that an improved model of past tense learning might do a better job of modelling the mistakes which children make along the way. All of these phenomena have something in common: they result, not from the structural constraints on the domain, but from the way in which an associative architecture responds to environmental regularities. Using a connectionist architecture may allow for a relatively simple explanation of phenomena such as habits, that seem to be parasitic on regularities in behavioral patterns. This interpretation is consistent with a view of the mind as utilizing two sorts of processes, computational and associative. The computational component represents the structure of behavior; the associative component represents the direct activation of behaviors which accumulates with practice.
In speaking and comprehension, we draw on a small number of phrase- and sentence-types. Our capacity for brute associative memory shows that we can form complex associations between symbols: it is reasonable that such associations will arise between common phrase types and the meaning relations between their constituent lexical categories. The associative network cannot explain the existence or form of phrase structures, but it can associate them efficiently, once they are defined. This commonsense idea about everyday behavior had no explicit formal mechanism which could account for it in traditional models of how associations are formed. Such stimulus-response theories had in their arsenal single chains of associated units. The notion of multiple levels of representation and matrices was not developed. The richest attempt in this direction for
language was the later work of Osgood (1968); but he was insistent that the models should not only describe associations that govern acquired behavior, they should also account for learning structurally distinct levels of representation, which traditional s/r theories cannot consistently represent or learn (see Bever, 1968; Fodor, 1965). There are some serious consequences of our proposal for research on language behavior. For example, associations between phrase types and semantic configurations may totally obscure the operation of more structurally sensitive processes. This concept underlay the proposal that sentence comprehension proceeds by way of 'perceptual strategies' like (11), mapping rules which express the relation between a phrase type and the semantic roles assigned to its constituents (Bever, 1970). Such strategies are non-deterministic and are not specified as a particular function of grammatical structure. The existence of such strategies is supported by the developmental regressions in comprehension reviewed above, as well as their ability to explain a variety of facts about adult sentence perception (including the difficulty of sentences like (5a) which run afoul of the NVN strategy (11)). But the formulation of a complete strategies-based parser met with great difficulty.
Strategies are not 'rules', and there was no clear formalism available in which
to state them so that they can apply simultaneously. These problems were part of the motivation for rejecting a strategies-based parser (Frazier, 1979), in favor of either deterministic processes (production systems of a kind; Wanner & Maratsos, 1978) or current attempts to construct parsers as direct functions of a grammar (Crain & J.D. Fodor, 1985; Ford, Bresnan, & Kaplan, 1981; Frazier, Carson, & Rayner, work in progress). On our interpretation, connectionist methods offer a richer representational system for perceptual 'strategies' than previously available. Indeed, some investigators have suggested connectionist-like models of all aspects of syntactic knowledge and processing (Bates & MacWhinney, 1987; MacWhinney, 1987). We have no quarrel with their attempts to construct models which instantiate strategies of the type we have discussed (if that is what they are in fact doing); however, it seems that they go further, and argue that the strategies comprise an entire grammatical and performance theory at the same time. We must be clear to the point of tedium: a connectionist model of parsing, like the original strategies model, does not explain away mental grammatical structures; in fact, it depends on their independent existence elsewhere in the user's mental repertoire. Furthermore, frequency-based behavioral heuristics do not necessarily comprise a complete theory of performance, since they represent only the associatively based components. Smolensky (in press) has come to a view superficially similar to ours about the relation between connectionist models and rule-governed systems. He
applies the analysis to the distinction between controlled and automatic processing (Schneider, Dumais, & Shiffrin, 1984). A typical secular example is accessing the powers of the number 2. The obvious way to access a large power of two is to multiply 2 times itself that large number of times. Most of us are condemned to this multiplication route, but computer scientists often have memorized large powers of two, because two is the basic currency of computers. Thus, many computer scientists have two ways of arriving at the 20th power of two: calculating it out, or remembering it. Smolensky suggests that this distinction reflects two kinds of knowledge, both of which can be implemented in connectionist architecture: 'conscious' and 'intuitive'. 'Conscious' knowledge consists of memorized rules similar to productions. Frequent application of a conscious rule develops connection strengths between the input and the output of the rule until its effect can be arrived at
without the operation of the production-style system. Thus the production-style system 'trains' the connectionist network to asymptotic performance. The conscious production-style systems are algorithms, not structural representations. For example, the steps involved in successive multiplications by two depend on available memory, particular procedures, knowledge of the relation between different powers of the same number, and so on. Thus, this proposal is that the slow-but-sure kind of algorithms can be used to train the fast-but-probabilistic algorithms. Neither of them necessarily represents the structure. Smolensky, however, suggests further that a connectionist model's structural 'competence' can be represented in the form of what the model would do, given an infinite amount of memory/time. That is, 'ideal performance' can be taken to represent competence (note that 'harmony theory' referred to below is a variant of a connectionist system).
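Smolensky's secular example can be made concrete. The sketch below (ours, purely illustrative) contrasts the 'conscious' production-style route, multiplying 2 by itself n times, with the 'intuitive' route, retrieval from a table built up by prior application of the rule:

```python
# Two routes to the 20th power of two (our illustrative sketch).

def power_of_two_by_rule(n):
    """The 'conscious', production-style route: apply the multiplication
    rule n times. Slow but sure."""
    result = 1
    for _ in range(n):
        result *= 2
    return result

# The 'intuitive' route: answers memorized through frequent prior use of
# the rule. Here the table is simply trained by running the rule in advance.
MEMORIZED = {n: power_of_two_by_rule(n) for n in range(21)}

def power_of_two_by_memory(n):
    """Fast retrieval; fails outside the range that was practiced."""
    return MEMORIZED[n]

print(power_of_two_by_rule(20))    # 1048576
print(power_of_two_by_memory(20))  # 1048576, without any multiplication
```

Both routes converge on the same answer for practiced cases; only the rule route generalizes to unpracticed powers, which is one way of seeing why ideal performance of the trained associations is not the same as a representation of the structure.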
It is a corollary of the way this network embodies the problem domain constraints, and the general theorems of harmony theory, that the system, when given a well-posed problem and infinite relaxation time, will always give the correct answer. So, under that idealization, the competence of the system is described by hard constraints: Ohm's law, Kirchhoff's law. It is as if it had those laws written down inside of it.

Thus, the system conforms completely to structural rules in an ideal situation. But the ability to conform to structure in an infinite time is not the same as the representation of the idealized performance of a model. The structure corresponds to the 'problem domain constraints'.
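The idealization Smolensky appeals to can be made concrete with a toy relaxation. The sketch below is our own minimal caricature, not harmony theory itself: a single voltage-divider node is nudged until Kirchhoff's current law is satisfied. With a short relaxation time the answer is only approximate; only in the limit of unbounded relaxation does the behavior coincide with what the hard constraints dictate.

```python
# Illustrative relaxation toward the 'hard constraints' of a circuit.
# Component values, the update rule, and the learning rate are assumptions
# chosen for the sketch, not part of any published model.

R1, R2, V = 1.0, 1.0, 1.0   # divider resistors and supply voltage
exact = V * R2 / (R1 + R2)  # the answer Ohm's and Kirchhoff's laws require: 0.5

def relax(steps: int, rate: float = 0.1) -> float:
    v = 0.0  # initial guess for the node voltage
    for _ in range(steps):
        current_mismatch = (V - v) / R1 - v / R2  # Kirchhoff residual at the node
        v += rate * current_mismatch              # nudge toward satisfaction
    return v

print(relax(5))     # short relaxation time: only an approximation
print(relax(1000))  # long relaxation time: effectively the constrained answer
```

The system's limiting behavior is "described by" Ohm's and Kirchhoff's laws, yet nothing in the relaxation loop represents those laws as such; that is exactly the gap between ideal performance and represented structure at issue in the text.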
Smolensky directly relates the production-style algorithms to explicit, potentially conscious knowledge, such as knowing how to multiply. This makes the proposal inappropriate for the representation of linguistic rules, since linguistic rules are generally not available to consciousness. In sum, the behaviors that such connectionist networks exhibit are, in the end, the result of associations between structurally defined representations. We think that connectionist models are worth exploring as potential explanations of those behaviors: at the very least, such investigations will give a clearer definition of those aspects of knowledge and performance which cannot be accounted for by testable and computationally powerful associationistic mechanisms.
References

Anderson, J.R. (1983). The architecture of cognition. Cambridge, MA: Harvard University Press.
Bates, E., & MacWhinney, B. (1987). Competition, variation and language learning. In B. MacWhinney (Ed.), Mechanisms of language acquisition (pp. 157-197). Hillsdale, NJ: Lawrence Erlbaum Associates.
Bever, T. (1968). A formal limitation of associationism. In T. Dixon & D. Horton (Eds.), Verbal behavior and general behavior theory. Englewood Cliffs, NJ: Prentice-Hall, Inc.
Bever, T. (1970). The cognitive basis for linguistic universals. In J.R. Hayes (Ed.), Cognition and the development of language (pp. 277-360). New York, NY: Wiley & Sons, Inc.
Bever, T. (1975). Psychologically real grammar emerges because of its role in language acquisition. In D. Dato (Ed.), Developmental psycholinguistics: Theory and applications (pp. 63-75). Georgetown University Round Table on Languages and Linguistics.
Bever, T. (1982). Regression in the service of development. In T. Bever (Ed.), Regression in mental development (pp. 153-188). Hillsdale, NJ: Lawrence Erlbaum Associates.
Bever, T., & Langendoen, D. (1963). (a) The formal justification and descriptive role of variables in phonology. (b) The description of the Indo-European E/O ablaut. (c) The E/O ablaut in Old English. Quarterly Progress Report, RLE, MIT, Summer.
Bever, T., Carroll, J., & Miller, L.A. (1984). Introduction. In T. Bever, J. Carroll & L.A. Miller (Eds.), Talking minds: The study of language in the cognitive sciences. Cambridge, MA: MIT Press.
Bever, T., Mehler, J., & Epstein, J. (1968). What children do in spite of what they know. Science, 162, 921-924.
Brown, R. (1973). A first language: The early stages. Cambridge, MA: Harvard University Press.
Bybee, J., & Slobin, D. (1982). Rules and schemas in the development and use of the English past tense. Language, 58, 265-289.
Chomsky, N. (1959). Review of Skinner's Verbal Behavior. Language, 35, 26-58.
Chomsky, N. (1964). The logical basis of linguistic theory. In Proceedings of the 9th International Conference on Linguistics.
Chomsky, N. (1981). Lectures on government and binding. Dordrecht: Foris.
Chomsky, N., & Halle, M. (1968). The sound pattern of English. New York, NY: Harper and Row.
Crain, S., & Fodor, J.D. (1985). How can grammars help parsers? In D.R. Dowty, L. Karttunen, & A. Zwicky (Eds.), Natural language parsing: Psychological, computational and theoretical perspectives. Cambridge: Cambridge University Press.
Dell, G. (1986). A spreading-activation theory of retrieval in sentence production. Psychological Review, 93, 283-321.
Elman, J.L., & McClelland, J.L. (1986). Exploiting the lawful variability in the speech wave. In J.S. Perkell & D.H. Klatt (Eds.), Invariance and variability of speech processes. Hillsdale, NJ: Lawrence Erlbaum Associates.
Elman, J.L., & Zipser, D. (1987). Learning the hidden structure of speech. UCSD Institute for Cognitive Science Report 8701.
Feldman, J. (1986). Neural representation of conceptual knowledge. University of Rochester Cognitive Science Technical Report URCS-33.
Feldman, J., & Ballard, D. (1982). Connectionist models and their properties. Cognitive Science, 6, 205-254.
Feldman, J., Ballard, D., Brown, C., & Dell, G. (1985). Rochester Connectionist Papers: 1979-1985. University of Rochester Department of Computer Science Technical Report TR-172.
Fodor, J.A. (1965). Could meaning be an rm? Journal of Verbal Learning and Verbal Behavior, 4, 73-81.
Fodor, J.A., Bever, T.G., & Garrett, M. (1974). The psychology of language. New York: McGraw-Hill.
Fodor, J.A., & Pylyshyn, Z.W. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 28, this issue, 3-71.
Ford, M., Bresnan, J., & Kaplan, R. (1981). A competence-based theory of syntactic closure. In J. Bresnan (Ed.), The mental representation of grammatical relations. Cambridge, MA: MIT Press.
Francis, W.N., & Kucera, H. (1979). Manual of information to accompany a standard sample of present-day edited American English, for use with digital computers. Providence, RI: Department of Linguistics, Brown University.
Frazier, L. (1979). On comprehending sentences: Syntactic parsing strategies. Doctoral Thesis, University of Massachusetts.
Frazier, L., Carson, M., & Rayner, K. (1985). Parameterizing the language processing system: Branching patterns within and across languages. Unpublished.
Garrett, M. (1975). The analysis of sentence production. In G. Bower (Ed.), The psychology of learning and motivation (pp. 133-177). New York: Academic Press.
Goldsmith, J. (1976). An overview of autosegmental phonology. Linguistic Analysis, 2, No. 1.
Grice, H.P. (1975). Logic and conversation. In P. Cole and J.L. Morgan (Eds.), Syntax and semantics 3: Speech acts. New York, NY: Seminar Press.
Grossberg, S. (1987). Competitive learning: From interactive activation to adaptive resonance. Cognitive Science, 11, 23-63.
Halle, M. (1962). Phonology in generative grammar. Word, 18, 54-72.
Hanson, S.J., & Kegl, J. (1987). PARSNIP: A connectionist network that learns natural language grammar from exposure to natural language sentences. Proceedings of the Ninth Annual Cognitive Science Society Meeting. Hillsdale, NJ: Lawrence Erlbaum Associates.
Hinton, G., & Anderson, J. (Eds.) (1981). Parallel models of associative memory. Hillsdale, NJ: Lawrence Erlbaum Associates.
Hinton, G., & Sejnowski, T. (1983). Optimal perceptual inference. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Washington, D.C.
Hinton, G., & Sejnowski, T. (1986). Learning and relearning in Boltzmann machines. In D. Rumelhart & J. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 1. Foundations. Cambridge, MA: MIT Press.
Hopfield, J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences USA, 79, 2554-2558.
Kuczaj, S. (1978). Children's judgments of grammatical and ungrammatical irregular past-tense verbs. Child Development, 49, 319-326.
Kuroda, S.-Y. (1987). Where is Chomsky's bottleneck? Reports of the Center for Research in Language, San Diego, Vol. 1, No. 5.
Langacker, R. (1987). The cognitive perspective. In Reports of the Center for Research in Language, San Diego, Vol. 1, No. 3.
Macken, M. (1987). Representation, rules and overgeneralizations in phonology. In B. MacWhinney (Ed.), Mechanisms of language acquisition (pp. 367-397). Hillsdale, NJ: Lawrence Erlbaum Associates.
MacWhinney, B. (1987). The competition model of the acquisition of syntax. In B. MacWhinney (Ed.), Mechanisms of language acquisition (pp. 249-308). Hillsdale, NJ: Lawrence Erlbaum Associates.
Marr, D. (1982). Vision. San Francisco, CA: Freeman.
McClelland, J., & Elman, J. (1986). Interactive processes in speech perception: The TRACE model. In J. McClelland & D. Rumelhart (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 2. Psychological and biological models. Cambridge, MA: MIT Press.
McClelland, J., & Rumelhart, D. (1981). An interactive activation model of context effects in letter perception: Part 1: An account of basic findings. Psychological Review, 88, 375-407.
McClelland, J., & Rumelhart, D. (Eds.) (1986). Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 2. Psychological and biological models. Cambridge, MA: MIT Press.
Mehler, J., & Bever, T. (1967). Cognitive capacity of very young children. Science, 158, 141-142.
Miller, G.A. (1962). Some psychological studies of grammar. American Psychologist, 17, 748-762.
Neches, R., Langley, P., & Klahr, D. (1987). Learning, development and production systems. In D. Klahr, P. Langley & R. Neches (Eds.), Production system models of learning and development. Cambridge, MA: MIT Press.
Osgood, C.E. (1968). Toward a wedding of insufficiencies. In T. Dixon & D. Horton (Eds.), Verbal behavior and general behavior theory. Englewood Cliffs, NJ: Prentice-Hall, Inc.
Pinker, S., & Prince, A. (1988). On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28, this issue, 73-193.
Rescorla, R.A., & Wagner, A.R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A.H. Black & W.F. Prokasy (Eds.), Classical conditioning II. New York: Appleton-Century-Crofts, pp. 64-99.
Rumelhart, D., Hinton, G.E., & Williams, R. (1986). Learning internal representations by error propagation. In D. Rumelhart, J. McClelland & the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 1. Foundations. Cambridge, MA: MIT Press/Bradford Books.
Rumelhart, D., & McClelland, J. (1982). An interactive activation model of context effects in letter perception: Part 2: The contextual enhancement effect and some tests of the model. Psychological Review, 89, 60-94.
Rumelhart, D., & McClelland, J. (Eds.) (1986a). Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 1. Foundations. Cambridge, MA: MIT Press.
Rumelhart, D., & McClelland, J. (1986b). On learning the past tenses of English verbs. In J. McClelland & D. Rumelhart (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 2. Psychological and biological models. Cambridge, MA: MIT Press.
Sadock, J. (1974). Toward a linguistic theory of speech acts. New York: Academic Press.
Sampson, G. (1987). A turning point in linguistics: Review of D. Rumelhart, J. McClelland and the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Times Literary Supplement, June 12, 1987, p. 643.
Sapir, E. (1921/49). Language. New York: Harcourt, Brace and World.
Savin, H., & Bever, T. (1970). The nonperceptual reality of the phoneme. Journal of Verbal Learning and Verbal Behavior, 9, 295-302.
Schneider, W., Dumais, S., & Shiffrin, R. (1984). Automatic and control processing and attention. In R. Parasuraman & D. Davies (Eds.), Varieties of attention. New York: Academic Press, Inc.
Skinner, B. (1957). Verbal behavior. New York, NY: Appleton-Century-Crofts.
Slobin, D. (1978). A case study of early language awareness. In A. Sinclair, R. Jarvella, & W. Levelt (Eds.), The child's conception of language. New York, NY: Springer-Verlag.
Slobin, D., & Bever, T.G. (1982). Children use canonical sentence schemas: A crosslinguistic study of word order. Cognition, 12, 219-277.
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In D. Rumelhart & J. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 1. Foundations. Cambridge, MA: MIT Press.
Smolensky, P. (in press). The proper treatment of connectionism. Behavioral and Brain Sciences.
Strauss, S., & Stavy, R. (1981). U-shaped behavioral growth: Implications for theories of development. In W.W. Hartup (Ed.), Review of child development research (Volume 6). Chicago: University of Chicago Press.
Name Index

Adelson, B., 29n, 59
Anderson, J.A., 195
Anderson, J.R., 130, 196
Arbib, M., 4
Armstrong, S.L., 123n
Aronoff, M., 113, 115n
Ballard, D.H., 5, 7, 15n, 18n, 51, 62, 65, 75, 196
Bates, E., 241
Beckman, M., 86
Berko, J., 84, 137
Bever, T.G., 96n, 99, 147, 166, 170, 195-247
Black, I.B., 53
Bloch, B., 82n
Bolinger, D., 64
Bower, G., 29n, 59
Bresnan, J., 241
Broadbent, D., 9
Brown, R., 137, 139, 140n, 142, 144, 207
Bybee, J., 82n, 84, 116-118, 145, 148, 150n, 151-156, 160, 174, 181, 207
Carey, S., 177
Carlson, G., 221
Carroll, L., 60
Carson, M., 241
Cazden, C.B., 137, 144
Chase, W.G., 51
Chomsky, N.A., 33n, 34, 64, 74, 78, 82n, 179, 181, 202, 227, 234
Churchland, P.M., 4
Churchland, P.S., 4, 7
Crain, S., 241
Epstein, J., 238
Ervin, S., 84, 137
Fahlman, S.E., 4, 51, 52
Feldman, J.A., 5n, 7, 29n, 51, 59, 75, 196
Fodor, J.A., 3-71, 74, 214, 221, 241, 243
Fodor, J.D., 64, 241
Ford, M., 241
Francis, N., 139n
Francis, W.N., 226, 231, 232
Frazier, L., 241
Fries, C., 82n
Frost, L.A., 128
Garrett, M., 221
Geach, P., 21n
Gelman, S.A., 177
Gleitman, H., 123n
Gleitman, L., 123n, 127
Goldsmith, J., 202, 216
Gordon, P., 150n
Grimshaw, J., 117
Gropen, J., 150n
Grossberg, S., 196
Halle, M., 82n, 91, 173, 201, 202
Halwes, T., 96n
Hanson, S.J., 195, 196, 197, 225, 230, 235
Hatfield, G., 4, 13, 52, 59
Hewett, C., 14, 15, 56
Hillis, D., 56
Hinton, G.E., 4, 21, 22, 40n, 51, 52, 53, 75, 77, 80, 85, 171, 173, 176, 181, 182, 195, 196, 224
Hoard, J., 82n
Hockett, C., 82n
Hopfield, J., 196
Jenkins, J.J., 96n
Jespersen, O., 82n, 98, 121
Kaplan, R., 241
Katz, J.J., 21n
Kawamoto, A.H., 35n, 59, 222, 228
Kegl, J., 196, 197, 225, 230, 235
Keil, F.C., 177
Kiparsky, P., 82n, 87, 111, 112
Klahr, D., 196
Kosslyn, S.M., 4, 13, 52, 59
Kucera, H., 139n, 226, 231, 232
Kuczaj, S.A., 137, 143, 145, 157, 158, 160, 161, 163n, 220
McClelland, J.L., 6n, 7, 8, 9, 10n, 21n, 22, 24, 29n, 35, 36, 60n, 62n, 64n, 66, 68, 73-193, 196, 199, 200, 204, 217, 222, 224, 228, 234, 237
Osgood, C., 31n, 49
Osherson, D., 61, 177
Papert, S., 64
Pylyshyn, Z.W., 3-71, 74, 214, 243
Rayner, K., 241
Rescorla, R., 234
Ross, J.R., 113
Rumelhart, D.E., 6n, 7, 8, 9, 10n, 11, 21n, 22, 35, 36, 53, 60n, 62n, 64n, 65, 68, 73-193, 196, 199, 200, 204, 217, 224, 234, 237
Sadock, J., 219
Sampson, G., 80, 208
Sapir, E., 219
Savin, H., 96n, 211, 212
Schmidt, H., 176
Schneider, W., 4, 242
Seidenberg, M., 221
Sejnowski, T.J., 4, 62, 85, 171, 181, 196, 224
Shafir, E., 177
Shattuck-Hufnagel, S., 157n
Shiffrin, R., 242
Sietsema, B., 100n
Simon, H.A., 51, 74, 130
Skinner, B.F., 234, 235
Sloat, C., 82n
Slobin, D., 82n, 84, 116, 118, 140, 145, 148, 150n, 151-156, 160, 163, 207, 219, 239
Smith, E.E., 177
Smith, N., 100
Smolensky, P., 8, 9, 10n, 19, 20, 21, 45, 46, 52, 56, 62, 80, 126, 167, 169, 171, 173, 196, 217, 224, 241, 242
Snow, C.E., 140n
Sokolov, J.L., 130
Sommer, B.A., 97
Stabler, E., 14n, 60
Stich, S., 7
Stov, M., 61
Strauss, S., 238
Sweet, H., 82n
Talmy, L., 174, 181
Tanenhaus, M.K., 221
Tarski, A., 14n
Touretzky, D.S., 65, 77, 182
Treisman, A., 176
Valian, B., 230
Van der Hulst, H., 100n
Wagner, A.R., 234
Wanner, E., 35, 127, 241
Watson, J., 7
Williams, R.J., 85, 224
Ziff, P., 43
Zipser, D., 224, 235
Subject Index

Acoustic features, 199
Activation level, 1, 5, 12, 15, 63, 73, 75, 76, 196, 223
Actors, 15
Animal thought, 41, 44, 45
Augmented transition networks, 35
Back propagation, 6, 181, 183, 224, 225
Back-formation, 111
Bandwidth of communication channels, limits on, in serial machines, 75
Bistability of the Necker cube, 7
Blended responses, in the RM model, 155, 156
Boltzmann machine, 30, 101, 181, 210
Boolean algebra, 196
Brain-style modeling, 62
Cognitive level, 8, 9, 66, 76, 77, 85
Combinatorial structure of mental representations, 34, 35, 80
Competence, 128, 242
Conscious vs. intuitive knowledge, 242
Constituency relations, 175
Constraint satisfaction, 1, 2, 53, 196, 243
Content-addressable memory, 172
Damage-resistance of connectionist machines, 52, 56
Data-structures, 53, 57
Decoding/Binding network, 92
Default structure, in the regular forms, 121
Degemination, 148
Derivational morphology, 178
Developmental psycholinguistics, 80
Diachronic maintenance of language, 217
Diachronic vs. synchronic behavior of a network, 31
Distance between representations, 20n
Distinctive features, 201, 202, 204, 215
Distributed representations, 1, 2, 5, 19, 172
Double-marking errors, 77, 79
Errors, in the past tense, 73
Excitation (see Activation)
Feedback, 6
French, 163
Frequency counts, 139
German, 163
Hanson and Kegl model, 225-233
Hebrew, inflection in, 163
Hidden units, 65, 182
Induction, 79, 80, 165-167, 182
Infinitive, English, 83
Irregular verbs, English, 83, 110, 111, 113, 114, 151, 157, 159, 161, 162, 212
Kant, 27
Labels, syntax and semantics of, 17, 18n
Language acquisition, 1, 37, 73, 79-82, 139, 166, 167, 221
Learnability, 239, 243
Macro-organization, 170, 171
McClelland & Kawamoto model, 222-233
Memory vs. program, 34, 56
Mental processes, 6, 13, 28, 30
Mental representation, 6, 7, 12, 18, 19, 22, 25, 28, 30, 32, 36, 39, 49n, 217, 222
  classical theories of, 13, 17
  complex vs. simple, 13
  constituent structure of, 33
  syntax of, 46
  classical vs. connectionist theories of, 6, 7
Microfeatures, 20, 21, 22, 23, 63, 64
Modality of a sentence, 85
Molecular representational states, 10
Mood of a sentence, 85
Morpheme, 79, 86, 107, 108
Morphology vs. phonology, 86, 106-119, 126
Multilayered networks, 182
Natural kinds, 177
Natural selection, 54
Neologisms, 104
Networks, 6, 76, 129, 196
Neural organization, 62
Neurological states, 10
Neurological distribution, 19
Neuronal nets, 196
Newtonian physics and quantum physics, relation between, 168
No-change verbs, performance with, 145, 148, 149, 150, 151
Nodes, their use in connectionist models, 12, 21
Noise-tolerance of connectionist over classical machines, 52, 56
Nondeterminism, of human behavior, 53, 57
Nonverbal and intuitive processes, 52
Obstruence, 86
Old English, 203
Output decoder, in the RM model, 93
Overactive speech production algorithm, 238
Overgeneralization (see Overregularization)
Overregularization, 74, 84, 137, 138, 140, 144, 146-148, 151, 152, 154, 156, 157, 159-164, 166, 167, 207, 208, 213, 219, 223, 237, 238
Overtensing, 161
Paradigm shift, Kuhnian, 4
Paradigm, as used in rule-based theories of language, 79
Parameter space, 5
Part-of relation, 13
Passive participle, English, 83, 84, 102
Passive storage, in conventional architecture, 52, 59
Past morpheme, 79, 83, 84
Past participle forms, 115, 160
Past tense, English, 79, 85, 86, 88, 101, 102, 115, 135, 137, 142, 176, 201, 204, 208, 216, 224
  acquisition, 73, 87, 134-150, 164, 207, 240
  production, 87
Pattern-associator, in the RM model, 88, 91, 93, 107, 125, 167, 168, 170, 176, 181
Pavlovian conditioning, 1
Perceptron convergence procedure, 90, 179, 207
Perceptual heuristic, 239, 241
Perfect participle, English, 83, 84, 102
Performance models, 234, 236
Phoneme, 91, 93, 173, 197, 199, 204, 205, 211, 212
Phonetic representation, 73, 88, 108, 202
Phonological features, 73, 197, 211
Phonological representation, 95, 201
Phonology and morphology, in the RM model, 101
Phonology vs. phonetics, 86
Physical realization, of descriptions of mental processes, 54
Physical symbol system hypothesis, 2
Polish, 163
Polysemy, 42, 112
Possessive marker, English, 103, 107
Postfix, 204
Prefix, 86, 204
Present morpheme, English, 83
Present participle, English, 83
Present tense form, 79, 137
Probabilistic blurring, 222
Probabilistic constraints, 75
Probabilistic decisions, 89
Probability functions, logistic, 102, 169
Production-style algorithms, 242
Productive rules, general, 135
Productivity of thought, 33, 36 (see also Systematicity, Compositionality)
Progressive participle, English, 83
Proof-theory, 29, 30
Prototypicality structures, in the strong class of verbs, 116, 117, 120
Psychogrammar, 221
Psychological reality of grammars, 221
Quantum-mechanical states, 10
RM model (see Rumelhart-McClelland model)
Regular vs. irregular past tense forms, 74, 86, 107, 108, 137, 139, 213
Regularization, 112
Rule-exception behavior, 11 (see also Rule-governed behavior)
Rule-generated vs. memorized forms, significance of the distinction, 142, 143
Rule-governed behavior, 11, 51, 195, 197, 207, 216, 217
Rule-implicit vs. rule-explicit processes, 60, 61
Rules and principles, underlying language use, 74, 78, 221
Rumelhart-McClelland model of past tense acquisition, 73-184, 200, 204, 207-218, 233, 234
Russian, 163
Semantic theory, its use in the M&K model, 222
Semantics, 7
Sequencing, 197
Serial machines, 55
  their weakness, 4
Serial processing, 75
Similarity relations, 177, 178
Similarity space, 21n
Smalltalk, 14
Software vs. hardware, 74
Sounds and words, exchanges between, 199
Spanish, 163
Spatial vs. functional adjacency, 57
Speech behavior, 221
Speech errors, 199
Speech production, Dell's model (see also TRACE), 197
Speed of activation and inhibition, 199
Stem, 86
Stem and affix, identity of, 108, 126
Stimulus-response mechanisms, 236, 237, 240, 241
Stochastic component of human behavior (see Nondeterminism)
Stochastic relations, among environmental states, 31
  among machine states, 31
Strength of connections, 64
Strong verbs, English, 83, 110, 113, 114, 119, 122, 184-189 (see also Irregular verbs)
Structural constraints, on cognitive phenomena, 196, 240
Structural description, 13
Structural rules, 195, 196, 220, 221, 224
Structure and function, their independence in some systems, 63
Structure-sensitive operations, 13, 28
Structured symbols, 24
Sub-conceptual space of microfeatures, 19, 20
Sub-symbolic models, 8, 9, 60, 167, 173
Sub-systematic exceptions, 219
Subcomputational transitions, 60
Subjunctive, English, 83
Suppletive forms, 204
Syncategorematicity, 42, 43
Syncretism, 83
Systematicity, 33, 37, 38, 42, 43, 48, 50
Universal subgoaling, 58
Validity, 30
Verbal adjective, English, 83
Verbal inflection, English, 83
Verbal noun, English, 82
Visual features, 7, 176
Von Neumann machines, 5, 12, 30, 59
Wickelfeature, 88, 89
Wickelphone, 93, 147, 216