Connections and Symbols


Introduction

During the past two years there has been more discussion of the foundations of cognitive science than in the 25 years preceding. The impetus for this reexamination has been a new approach to studying the mind, called "Connectionism", "Parallel Distributed Processing", or "Neural Networks". The assumptions behind this approach differ in substantial ways from the "central dogma" of cognitive science: that intelligence is the result of the manipulation of structured symbolic expressions. Instead, connectionists suggest that intelligence is to be understood as the result of the transmission of activation levels in large networks of densely interconnected simple units.

Connectionism has spawned an enormous amount of research activity in a short time. Much of the excitement surrounding the movement has been inspired by the rich possibilities inherent in ideas such as massive parallel processing, distributed representation, constraint satisfaction, neurally realistic cognitive models, and subsymbolic or microfeatural analyses. Models incorporating various combinations of these notions have been proposed for behavioral abilities as diverse as Pavlovian conditioning, visual recognition, and language acquisition. Perhaps it is not surprising that in a burgeoning new field there have been few systematic attempts to analyze the core assumptions of the new approach in comparison with those of the approach it is trying to replace, and to juxtapose both sets of assumptions with the most salient facts about human cognition. Analyses of new scientific models have their place, but they are premature before substantial accomplishments in the new field have been reported and digested. Now that many connectionist efforts are well known, it may be time for a careful teasing apart of what is truly new and what is just a relabeling of old notions; of the empirical generalizations that are sound and those that are likely to be false; of the proposals that naturally belong together and those that are logically independent.

This special issue of Cognition on Connectionism and Symbol Systems is intended to start such a discussion. Each of the papers in the issue attempts to analyze in careful detail the accomplishments and liabilities of connectionist models of cognition. The papers were independently and coincidentally submitted to the journal, a sign, perhaps, that the time is especially right for reflection on the status of connectionist theories. Though each makes different points, there are noteworthy common themes. All the papers are highly critical of certain aspects of connectionist models, particularly as applied to language or the parts of cognition employing language-like operations. All


of them try to pinpoint what it is about human cognition that supports the traditional physical symbol system hypothesis. Yet none of the papers is an outright dismissal; in each case the authors discuss aspects of cognition for which connectionist models may yield critical insights.

Perhaps the most salient common theme in these papers is that many current connectionist proposals are not motivated purely by considerations of parallel processing, distributed representation, constraint satisfaction, or other computational issues, but seem to be tied more closely to an agenda of reviving associationism as a central doctrine of learning and mental functioning. As a result, discussions of connectionism involve a reexamination of debates about the strengths and weaknesses of associationist mechanisms that were a prominent part of cognitive theory 30 years ago and 300 years ago.

These papers comprise the first critical examination of connectionism as a scientific theory. The issues they raise go to the heart of our understanding of how the mind works. We hope that they begin a fruitful debate among scientists from different frameworks as to the respective roles of connectionist networks and physical symbol systems in explaining intelligence.

STEVEN PINKER JACQUES MEHLER

Connectionism and cognitive architecture: A critical analysis*

JERRY A. FODOR
CUNY Graduate Center

ZENON W. PYLYSHYN
University of Western Ontario

Abstract

This paper explores differences between Connectionist proposals for cognitive architecture and the sorts of models that have traditionally been assumed in cognitive science. We claim that the major distinction is that, while both Connectionist and Classical architectures postulate representational mental states, the latter but not the former are committed to a symbol-level of representation, or to a 'language of thought': i.e., to representational states that have combinatorial syntactic and semantic structure. Several arguments for combinatorial structure in mental representations are then reviewed. These include arguments based on the 'systematicity' of mental representation: i.e., on the fact that cognitive capacities always exhibit certain symmetries, so that the ability to entertain a given thought implies the ability to entertain thoughts with semantically related contents. We claim that such arguments make a powerful case that mind/brain architecture is not Connectionist at the cognitive level. We then consider the possibility that Connectionism may provide an account of the neural (or 'abstract neurological') structures in which Classical cognitive architecture is implemented. We survey a number of the standard arguments that have been offered in favor of Connectionism, and conclude that they are coherent only on this interpretation.

*This paper is based on a chapter from a forthcoming book. Authors' names are listed alphabetically. We wish to thank the Alfred P. Sloan Foundation for their generous support of this research. The preparation of this paper was also aided by a Killam Research Fellowship and a Senior Fellowship from the Canadian Institute for Advanced Research to ZWP. We also gratefully acknowledge comments and criticisms of earlier drafts by: Professors Noam Chomsky, William Demopoulos, Lila Gleitman, Russ Greiner, Norbert Hornstein, Keith Humphrey, Sandy Pentland, Steven Pinker, David Rosenthal, and Edward Stabler. Reprints may be obtained by writing to either author: Jerry Fodor, CUNY Graduate Center, 33 West 42 Street, New York, NY 10036, U.S.A.; Zenon Pylyshyn, Centre for Cognitive Science, University of Western Ontario, London, Ontario, Canada N6A 5C2.


1. Introduction

Connectionist or PDP models are catching on. There are conferences and new books nearly every day, and the popular science press hails this new wave of theorizing as a breakthrough in understanding the mind (a typical example is the article in the May issue of Science 86, called "How we think: A new theory"). There are also, inevitably, descriptions of the emergence of Connectionism as a Kuhnian "paradigm shift". (See Schneider, 1987, for an example of this, and for further evidence of the tendency to view Connectionism as the "new wave" of Cognitive Science.)

The fan club includes the most unlikely collection of people. Connectionism gives solace both to philosophers who think that relying on the pseudoscientific intentional or semantic notions of folk psychology (like goals and beliefs) misled psychologists into taking the computational approach (e.g., P.M. Churchland, 1981; P.S. Churchland, 1986; Dennett, 1986); and to those with nearly the opposite perspective, who think that computational psychology is bankrupt because it doesn't address issues of intentionality or meaning (e.g., Dreyfus & Dreyfus, in press). On the computer science side, Connectionism appeals to theorists who think that serial machines are too weak and must be replaced by radically new parallel machines (Fahlman & Hinton, 1986), while on the biological side it appeals to those who believe that cognition can only be understood if we study it as neuroscience (e.g., Arbib, 1975; Sejnowski, 1981). It is also attractive to psychologists who think that much of the mind (including the part involved in using imagery) is not discrete (e.g., Kosslyn & Hatfield, 1984), or who think that cognitive science has not paid enough attention to stochastic mechanisms or to "holistic" mechanisms (e.g., Lakoff, 1986), and so on and on. It also appeals to many young cognitive scientists who view the approach as not only anti-establishment (and therefore desirable) but also rigorous and mathematical (see, however, footnote 2). Almost everyone who is discontent with contemporary cognitive psychology and current "information processing" models of the mind has rushed to embrace "the Connectionist alternative".

When taken as a way of modeling cognitive architecture, Connectionism really does represent an approach that is quite different from that of the Classical cognitive science that it seeks to replace. Classical models of the mind were derived from the structure of Turing and Von Neumann machines. They are not, of course, committed to the details of these machines as exemplified in Turing's original formulation or in typical commercial computers; only to the basic idea that the kind of computing that is relevant to understanding cognition involves operations on symbols (see Fodor 1976, 1987; Newell, 1980, 1982; Pylyshyn, 1980, 1984a, b). In contrast, Connectionists propose to design systems that can exhibit intelligent behavior without storing, retrieving, or otherwise operating on structured symbolic expressions. The style of processing carried out in such models is thus strikingly unlike what goes on when conventional machines are computing some function.

Connectionist systems are networks consisting of very large numbers of simple but highly interconnected "units". Certain assumptions are generally made both about the units and the connections: Each unit is assumed to receive real-valued activity (either excitatory or inhibitory or both) along its input lines. Typically the units do little more than sum this activity and change their state as a function (usually a threshold function) of this sum. Each connection is allowed to modulate the activity it transmits as a function of an intrinsic (but modifiable) property called its "weight". Hence the activity on an input line is typically some non-linear function of the state of activity of its sources. The behavior of the network as a whole is a function of the initial state of activation of the units and of the weights on its connections, which serve as its only form of memory.

Numerous elaborations of this basic Connectionist architecture are possible. For example, Connectionist models often have stochastic mechanisms for determining the level of activity or the state of a unit. Moreover, units may be connected to outside environments. In this case the units are sometimes assumed to respond to a narrow range of combinations of parameter values and are said to have a certain "receptive field" in parameter-space. These are called "value units" (Ballard, 1986). In some versions of Connectionist architecture, environmental properties are encoded by the pattern of states of entire populations of units. Such "coarse coding" techniques are among the ways of achieving what Connectionists call "distributed representation".1

The term 'Connectionist model' (like 'Turing Machine' or 'Von Neumann machine') is thus applied to a family of mechanisms that differ in details but share a galaxy of architectural commitments. We shall return to the characterization of these commitments below.

Connectionist networks have been analysed extensively, in some cases

1The difference between Connectionist networks in which the state of a single unit encodes properties of the world (i.e., the so-called 'localist' networks) and ones in which the pattern of states of an entire population of units does the encoding (the so-called 'distributed' representation networks) is considered to be important by many people working on Connectionist models. Although Connectionists debate the relative merits of localist (or 'compact') versus distributed representations (e.g., Feldman, 1986), the distinction will usually be of little consequence for our purposes, for reasons that we give later. For simplicity, when we wish to refer indifferently to either single unit codes or aggregate distributed codes, we shall refer to the 'nodes' in a network. When the distinction is relevant to our discussion, however, we shall explicitly mark the difference by referring either to units or to aggregates of units.


using advanced mathematical techniques.2 They have also been simulated on computers and shown to exhibit interesting aggregate properties. For example, they can be "wired" to recognize patterns, to exhibit rule-like behavioral regularities, and to realize virtually any mapping from patterns of (input) parameters to patterns of (output) parameters, though in most cases multi-parameter, multi-valued mappings require very large numbers of units. Of even greater interest is the fact that such networks can be made to learn; this is achieved by modifying the weights on the connections as a function of certain kinds of feedback (the exact way in which this is done constitutes a preoccupation of Connectionist research and has led to the development of such important techniques as "back propagation").

In short, the study of Connectionist machines has led to a number of striking and unanticipated findings; it's surprising how much computing can be done with a uniform network of simple interconnected elements.
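The unit-and-connection scheme described above can be made concrete in a few lines of Python. This is our own minimal sketch, not code from the paper; the two-unit topology, weights, and thresholds are invented for the example:

```python
# Minimal sketch of the kind of unit described above: each unit sums
# real-valued input activity, weighted by its connections, and fires
# (state 1) when the sum crosses a threshold. Weights and topology
# are illustrative inventions, not a model from the paper.

def step(states, weights, thresholds):
    """One synchronous update of every unit in the network.

    states:     list of 0/1 activation states, one per unit
    weights:    weights[i][j] = weight on the connection from unit j to unit i
    thresholds: firing threshold for each unit
    """
    new_states = []
    for i, row in enumerate(weights):
        total = sum(w * s for w, s in zip(row, states))
        new_states.append(1 if total >= thresholds[i] else 0)
    return new_states

# A two-unit network in which unit 1 excites unit 0 and
# unit 0 inhibits unit 1.
weights = [[0.0, 1.0],
           [-1.0, 0.0]]
thresholds = [0.5, 0.5]

states = [0, 1]
states = step(states, weights, thresholds)
print(states)  # → [1, 0]: unit 0 was excited past threshold, unit 1 got no input
```

Note that, as the text says, the weights are the network's only form of memory: a learning rule would change `weights`, not the update procedure.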

Moreover, these models have an appearance of neural plausibility that Classical architectures are sometimes said to lack. Perhaps, then, a new Cognitive Science based on Connectionist networks should replace the old Cognitive Science based on Classical computers. Surely this is a proposal that ought to be taken seriously: if it is warranted, it implies a major redirection of research.

Unfortunately, however, discussions of the relative merits of the two architectures have thus far been marked by a variety of confusions and irrelevances. It's our view that when you clear away these misconceptions what's left is a real disagreement about the nature of mental processes and mental representations. But it seems to us that it is a matter that was substantially put to rest about thirty years ago; and the arguments that then appeared to militate decisively in favor of the Classical view appear to us to do so still.

In the present paper we will proceed as follows. First, we discuss some methodological questions about levels of explanation that have become enmeshed in the substantive controversy over Connectionism. Second, we try to say what it is that makes Connectionist and Classical theories of mental

2One of the attractions of Connectionism for many people is that it does employ some heavy mathematical machinery, as can be seen from a glance at many of the chapters of the two-volume collection by Rumelhart, McClelland and the PDP Research Group (1986). But in contrast to many other mathematically sophisticated areas of cognitive science, such as automata theory or parts of Artificial Intelligence (particularly the study of search, or of reasoning and knowledge representation), the mathematics has not been used to map out the limits of what the proposed class of mechanisms can do. Like a great deal of Artificial Intelligence research, the Connectionist approach remains almost entirely experimental; mechanisms that look interesting are proposed and explored by implementing them on computers and subjecting them to empirical trials to see what they will do. As a consequence, although there is a great deal of mathematical work within the tradition, one has very little idea what various Connectionist networks and mechanisms are good for in general.


structure incompatible. Third, we review and extend some of the traditional arguments for the Classical architecture. Though these arguments have been somewhat recast, very little that we'll have to say here is entirely new. But we hope to make it clear how various aspects of the Classical doctrine cohere, and why rejecting the Classical picture of reasoning leads Connectionists to say the very implausible things they do about logic and semantics. In part four we return to the question what makes the Connectionist approach appear attractive to so many people. In doing so we'll consider some arguments that have been offered in favor of Connectionist networks as general models of cognitive processing.

Levels of explanation

There are two major traditions in modern theorizing about the mind, one that we'll call 'Representationalist' and one that we'll call 'Eliminativist'. Representationalists hold that postulating representational (or 'intentional' or 'semantic') states is essential to a theory of cognition; according to Representationalists, there are states of the mind which function to encode states of the world. Eliminativists, by contrast, think that psychological theories can dispense with such semantic notions as representation. According to Eliminativists, the appropriate vocabulary for psychological theorizing is neurological, or perhaps behavioral, or perhaps syntactic; in any event, not a vocabulary that characterizes mental states in terms of what they represent. (For a neurological version of eliminativism, see P.S. Churchland, 1986; for a behavioral version, see Watson, 1930; for a syntactic version, see Stich, 1983.)

Connectionists are on the Representationalist side of this issue. As Rumelhart and McClelland (1986a, p. 121) say, PDPs "are explicitly concerned with the problem of internal representation". Correspondingly, the specification of what the states of a network represent is an essential part of a Connectionist model. Consider, for example, the well-known Connectionist account of the bistability of the Necker cube (Feldman & Ballard, 1982): "Simple units representing the visual features of the two alternatives are arranged in competing coalitions, with inhibitory links between rival features and positive links within each coalition. The result is a network that has two dominant stable states" (see Figure 1). Notice that in this, as in all other such Connectionist models, the commitment to mental representation is explicit: the label of a node is taken to express the representational content of the state that the device is in when the node is excited, and there are nodes corresponding to monadic and to relational properties of the reversible cube when it is seen in one way or the other.


Figure 1. A Connectionist network model illustrating the two stable representations of the Necker cube. (Reproduced from Feldman and Ballard, 1982, p. 221, with permission of the publisher, Ablex Publishing Corporation.)
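Feldman and Ballard's coalition scheme can be mimicked in a toy simulation. The following Python sketch is our own illustration, not their model: the unit counts, weights, self-connections, and update rule are all invented for the example.

```python
# Toy "competing coalitions" network in the spirit of the Necker cube
# model: units within a coalition excite one another, units in rival
# coalitions inhibit one another. All numbers here are illustrative
# inventions, not parameters from Feldman and Ballard.

COALITION_A = [0, 1]   # units for one reading of the cube
COALITION_B = [2, 3]   # units for the rival reading

def weight(i, j):
    """Connection weight from unit j to unit i."""
    if i == j:
        return 1.0   # self-connection: an active unit tends to stay active
    same = (i in COALITION_A) == (j in COALITION_A)
    return 1.0 if same else -1.0

def settle(states, max_steps=10):
    """Synchronous threshold updates until the network stops changing."""
    for _ in range(max_steps):
        new = [1 if sum(weight(i, j) * s for j, s in enumerate(states)) > 0 else 0
               for i in range(len(states))]
        if new == states:
            break
        states = new
    return states

# A slight initial bias toward either coalition drives the network into
# the corresponding stable state, suppressing the rival coalition:
print(settle([1, 0, 0, 0]))  # → [1, 1, 0, 0]
print(settle([0, 0, 1, 0]))  # → [0, 0, 1, 1]
```

The two runs land in the network's two dominant stable states, which is the qualitative bistability the quoted passage describes.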


bolic states do have a semantics, though it's not the semantics of representations at the "conceptual level". According to Smolensky, the semantical distinction between symbolic and sub-symbolic theories is just that "entities that are typically represented in the symbolic paradigm by [single] symbols are typically represented in the sub-symbolic paradigm by a large number of sub-symbols".3 Both the conceptual and the sub-symbolic levels thus postulate representational states, but sub-symbolic theories slice them thinner.

We are stressing the Representationalist character of Connectionist theorizing because much Connectionist methodological writing has been preoccupied with the question 'What level of explanation is appropriate for theories of cognitive architecture?' (see, for example, the exchange between Broadbent, 1985, and Rumelhart & McClelland, 1985). And, as we're about to see, what one says about the levels question depends a lot on what stand one takes about whether there are representational states.

It seems certain that the world has causal structure at very many different levels of analysis, with the individuals recognized at the lowest levels being, in general, very small and the individuals recognized at the highest levels being, in general, very large. Thus there is a scientific story to be told about quarks; and a scientific story to be told about atoms; and a scientific story to be told about molecules ... ditto rocks and stones and rivers ... ditto galaxies. And the story that scientists tell about the causal structure that the world has at any one of these levels may be quite different from the story that they tell about its causal structure at the next level up or down. The methodological implication for psychology is this: if you want to have an argument about cognitive architecture, you have to specify the level of analysis that's supposed to be at issue.

If you're not a Representationalist, this is quite tricky, since it is then not obvious what makes a phenomenon cognitive. But specifying the level of analysis relevant for theories of cognitive architecture is no problem for either Classicists or Connectionists. Since Classicists and Connectionists are both Representationalists, for them any level at which states of the system are taken to encode properties of the world counts as a cognitive level; and no other levels do. (Representations of "the world" include, of course, representations of symbols; for example, the concept WORD is a construct at the cognitive level because it represents something, namely words.) Correspondingly, it is the architecture of the representational states and processes that discussions of cognitive architecture are about. Put differently, the architecture of the cognitive system consists of the set of basic operations, resources, functions, principles, etc. (generally the sorts of properties that would be described in a user's manual for that architecture if it were available on a computer), whose domain and range are the representational states of the organism.4

It follows that, if you want to make good the Connectionist theory as a theory of cognitive architecture, you have to show that the processes which operate on the representational states of an organism are those which are specified by a Connectionist architecture. It is, for example, no use at all, from the cognitive psychologist's point of view, to show that the nonrepresentational (e.g., neurological, or molecular, or quantum mechanical) states of an organism constitute a Connectionist network, because that would leave open the question whether the mind is such a network at the psychological level. It is, in particular, perfectly possible that nonrepresentational neurological states are interconnected in the ways described by Connectionist models but that the representational states themselves are not. This is because, just as it is possible to implement a Connectionist cognitive architecture in a network of causally interacting nonrepresentational elements, so too it is perfectly possible to implement a Classical cognitive architecture in such a network.5 In fact, the question whether Connectionist networks should be treated as models at some level of implementation is moot, and will be discussed at some length in Section 4.

3Smolensky seems to think that the idea of postulating a level of representations with a semantics of subconceptual features is unique to network theories. This is an extraordinary view, considering the extent to which Classical theorists have been concerned with feature analyses in every area of psychology from phonetics to visual perception to lexicography. In fact, the question whether there are 'sub-conceptual' features is neutral with respect to the question whether cognitive architecture is Classical or Connectionist.

4Sometimes, however, even Representationalists fail to appreciate that it is representation that distinguishes cognitive from noncognitive levels. Thus, for example, although Smolensky (1988) is clearly a Representationalist, his official answer to the question what distinguishes those dynamical systems that are cognitive from those that are not makes the mistake of appealing to complexity rather than intentionality: a river fails to be a cognitive dynamical system only because it cannot satisfy a large range of goals under a large range of conditions. But, of course, that depends on how you individuate goals and conditions; the river that wants to get to the sea wants first to get half way to the sea, and then to get half way more, and so on; quite a lot of goals all told. The real point, of course, is that states that represent goals play a role in the etiology of the behaviors of people but not in the etiology of the behavior of rivers.

5That Classical architectures can be implemented in networks is not disputed by Connectionists; see, for example, Rumelhart and McClelland (1986a, p. 118): "... one can make an arbitrary computational machine out of linear threshold units, including, for example, a machine that can carry out all the operations necessary for implementing a Turing machine; the one real limitation is that real biological systems cannot be Turing machines because they have finite hardware."

It is important to be clear about this matter of levels on pain of simply trivializing the issues about cognitive architecture. Consider, for example, the following remark of Rumelhart's: "It has seemed to me for some years now that there must be a unified account in which the so-called rule-governed and [the] exceptional cases were dealt with by a unified underlying process, a


process which produces rule-like and rule-exception behavior through the application of a single process ... [In this process] ... both the rule-like and non-rule-like behavior is a product of the interaction of a very large number of 'sub-symbolic' processes" (Rumelhart, 1984, p. 60). It's clear from the context that Rumelhart takes this idea to be a very tendentious one; one of the Connectionist claims that Classical theories are required to deny.

But in fact it's not. For, of course, there are 'sub-symbolic' interactions that implement both rule-like and rule-violating behavior; for example, quantum mechanical processes do. That's not what Classical theorists deny; indeed, it's not denied by anybody who is even vaguely a materialist. Nor does a Classical theorist deny that rule-following and rule-violating behaviors are both implemented by the very same neurological machinery. For a Classical theorist, neurons implement all cognitive processes in precisely the same way: viz., by supporting the basic operations that are required for symbol processing.

What would be an interesting and tendentious claim is that there's no distinction between rule-following and rule-violating mentation at the cognitive or representational or symbolic level; specifically, that it is not the case that the etiology of rule-following behavior is mediated by the representation of explicit rules.6 We will consider this idea in Section 4, where we will argue that it too is not what divides Classical from Connectionist architecture; Classical models permit a principled distinction between the etiologies of mental processes that are explicitly rule-governed and mental processes that aren't; but they don't demand one.

In short, the issue between Classical and Connectionist architecture is not about the explicitness of rules; as we'll presently see, Classical architecture is not, per se, committed to the idea that explicit rules mediate the etiology of behavior. And it is not about the reality of representational states; Classicists and Connectionists are all Representational Realists. And it is not about nonrepresentational architecture; a Connectionist neural network can perfectly well implement a Classical architecture at the cognitive level.

So, then, what is the disagreement between Classical and Connectionist architecture about?

6There is a different idea, frequently encountered in the Connectionist literature, that this one is easily confused with: viz., that the distinction between regularities and exceptions is merely stochastic (what makes 'went' an irregular past tense is just that the more frequent construction is the one exhibited by 'walked'). It seems obvious that if this claim is correct it can be readily assimilated to Classical architecture (see Section 4).


2. The nature of the dispute

Classicists and Connectionists all assign semantic content to something. Roughly, Connectionists assign semantic content to 'nodes' (that is, to units or aggregates of units; see footnote 1), i.e., to the sorts of things that are typically labeled in Connectionist diagrams; whereas Classicists assign semantic content to expressions, i.e., to the sorts of things that get written on the tapes of Turing machines and stored at addresses in Von Neumann machines.7 But Classical theories disagree with Connectionist theories about what primitive relations hold among these content-bearing entities. Connectionist theories acknowledge only causal connectedness as a primitive relation among nodes; when you know how activation and inhibition flow among them, you know everything there is to know about how the nodes in a network are related. By contrast, Classical theories acknowledge not only causal relations among the semantically evaluable objects that they posit, but also a range of structural relations, of which constituency is paradigmatic.

This difference has far-reaching consequences for the ways that the two kinds of theories treat a variety of cognitive phenomena, some of which we will presently examine at length. But, underlying the disagreements about details are two architectural differences between the theories:

(1) Combinatorial syntax and semantics for mental representations. Classical theories (but not Connectionist theories) postulate a 'language of thought' (see, for example, Fodor, 1975); they take mental representations to have a combinatorial syntax and semantics, in which (a) there is a distinction between structurally atomic and structurally molecular representations; (b) structurally molecular representations have syntactic constituents that are themselves either structurally molecular or structurally atomic; and (c) the semantic content of a (molecular) representation is a function of the semantic contents of its syntactic parts, together with its constituent structure. For purposes of convenience, we'll sometimes abbreviate (a)-(c) by speaking of Classical theories as

Connectionism and cognitive architecture

13

committed tures ~' . 8

to

"

complex

"

mental

representations

or

to

"

symbol

struc

(2)

Structure which mental

sensitivity states

of are

processes transformed

In

Classical , or by

models which an

the input

principles selects

by the

corresponding

output

are

defined

over

structural

properties

of

mental

representations binatorial structure

Because , it is

Classical possible for

mental Classical

representations mental operations

have to

com apply

to

them

by

reference

to

their

form

The

result

is

that

paradigmatic

Classical

mental

process

operates

upon

any

mental

representation

that

satisfies

given

structural

description

and

transforms

it

into

mental

representation

that

satisfies

another

structural

description

( So

for

example

in

model

of

inference

one

might

recognize

an

operation

that

applies

to

any

representation

of

the

form

&

and

transforms

it

into

representation be defined at a

of

the variety

form of

P levels

. )

Notice of

that abstraction

since ,

formal such an

properties operation

can can

apply

equally

to

representations

that

differ

widely

in

their

structural

complexity

The

operation

that

applies

to

representations

of

the

form

&

to

produce

is

satisfied

by

for

example

an

expression

like

"

( AvBvC

&

( DvEvF

) "

from

which

it

derives

the

expression

"

( AvBvC

) "

We

take

( 1

and

( 2

as

the

claims

that

define

Classical

models

and

we

take

these

claims

quite

literally

they

constrain

the

physical

realizations

of

symbol

structures

In

particular

the

symbol

structures

in

Classical

model

are

as

sumed

to

correspond

to

real

physical

structures

in

the

brain

and

the

com

binatorial

structure

of

representation

is

supposed

to

have

counterpart

in

structural

relations

among

physical

properties

of

the

brain

For

example

the

relation

' part

of

' ,

which

holds

between

relatively

simple

symbol

and

more

complex brain states

one . 9

, This

is

assumed is why Newell

to

correspond ( 1980 ) speaks

to

some of

physical computational

relation systems

among such

as

brains

and

Classical

computers

as

" physical

symbols

systems

"
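The structure-sensitive operation described in (2) - conjunction elimination applying to any representation of the form P&Q, however complex P and Q are - can be made concrete in a short sketch. The following toy code is our illustration, not anything from the Classical literature; the tuple encoding of expressions is an assumption of the sketch.

```python
# Illustrative sketch (not from the paper): a structure-sensitive Classical
# operation. Expressions are trees: atoms are strings, and complex
# expressions are ("&", left, right) or ("v", left, right) tuples.

def is_conjunction(expr):
    """An expression satisfies the structural description 'P&Q' iff its
    top-level connective is '&', however complex P and Q themselves are."""
    return isinstance(expr, tuple) and expr[0] == "&"

def conjunction_elimination(expr):
    """From any representation of the form P&Q, derive P.
    The operation is defined purely over the form of the representation."""
    if not is_conjunction(expr):
        raise ValueError("applies only to expressions of the form P&Q")
    _, left, _right = expr
    return left

# Applies to a structurally simple conjunction:
print(conjunction_elimination(("&", "A", "B")))   # prints A

# And equally to complex conjuncts: (AvBvC)&(DvEvF) yields (AvBvC).
complex_expr = ("&", ("v", "A", ("v", "B", "C")),
                     ("v", "D", ("v", "E", "F")))
print(conjunction_elimination(complex_expr))
```

The point of the sketch is that the operation inspects only the top-level form of its argument, so one and the same rule applies at every level of structural complexity.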

8Sometimes the difference between simply postulating representational states and postulating representations with a combinatorial syntax and semantics is marked by distinguishing theories that postulate symbols from theories that postulate symbol systems. The latter theories, but not the former, are committed to a "language of thought". For this usage, see Kosslyn and Hatfield (1984), who take the refusal to postulate symbol systems to be the characteristic respect in which Connectionist architectures differ from Classical architectures. We agree with this diagnosis.

9Perhaps the notion that relations among physical properties of the brain instantiate (or encode) the combinatorial structure of an expression bears some elaboration. One way to understand what is involved is to consider the conditions that must hold on a mapping (which we refer to as the 'physical instantiation mapping') from expressions to brain states if the causal relations among brain states are to depend on the

J.A. Fodor and Z.W. Pylyshyn
Connectionism and cognitive architecture

This bears emphasis because the Classical theory is committed not only to there being a system of physically instantiated symbols, but also to the claim that the physical properties onto which the structure of the symbols is mapped are the very properties that cause the system to behave as it does. In other words, the physical counterparts of the symbols, and their structural properties, cause the system's behavior. A system which has symbolic expressions, but whose operation does not depend upon the structure of these expressions, does not qualify as a Classical machine since it fails to satisfy condition (2). In this respect, a Classical model is very different from one in which behavior is caused by mechanisms, such as energy minimization, that are not responsive to the physical encoding of the structure of representations.

From now on, when we speak of 'Classical' models, we will have in mind any model that has complex mental representations, as characterized in (1), and structure-sensitive mental processes, as characterized in (2). Our account of Classical architecture is therefore neutral with respect to such issues as whether or not there is a separate executive. For example, Classical machines can have an "object-oriented" architecture, like that of the computer language Smalltalk, or a "message passing" architecture, like that of Hewett's (1977) Actors - so long as the objects or the messages have a combinatorial structure which is causally implicated in the processing. Classical architecture is also neutral on the question whether the operations on the symbols are constrained to occur one at a time or whether many operations can occur at the same time.

Here, then, is the plan for what follows. In the rest of this section, we will sketch the Connectionist proposal for a computational architecture that does away with complex mental representations and structure-sensitive operations. (Although our purpose here is merely expository, it turns out that describing exactly what Connectionists are committed to requires substantial reconstruction of their remarks and practices. Since there is a great variety of points of view within the Connectionist community, we are prepared to find that some Connectionists in good standing may not fully endorse the program when it is laid out in what we take to be its bare essentials.) Following this general expository (or reconstructive) discussion, Section 3 provides a series of arguments favoring the Classical story. Then the remainder of the paper considers some of the reasons why Connectionism appears attractive to many people and offers further general comments on the relation between the Classical and the Connectionist enterprise.

2.1. Complex mental representations

To begin with, consider a case of the most trivial sort: two machines, one Classical in spirit and one Connectionist.10 Here is how the Connectionist machine might reason. There is a network of labelled nodes as in Figure 2. Paths between the nodes indicate the routes along which activation can spread (that is, they indicate the consequences that exciting one of the nodes has for determining the level of excitation of others). Drawing an inference from A&B to A thus corresponds to an excitation of node 2 being caused by an excitation of node 1 (alternatively, if the system is in a state in which node 1 is excited, it eventually settles into a state in which node 2 is excited; see footnote 7).

Now consider a Classical machine. This machine has a tape on which it writes expressions. Among the expressions that can appear on this tape are:

10This illustration has no particular Connectionist model in mind, though the caricature presented is, in fact, a simplified version of the Ballard (1987) Connectionist theorem proving system (which actually uses a more restricted proof procedure based on the unification of Horn clauses). To simplify the exposition, we assume a 'localist' approach, in which each semantically interpreted node corresponds to a single Connectionist unit; but nothing relevant to this discussion is changed if these nodes actually consist of patterns over a cluster of units.


A, B, A&B, C, D, C&D, A&C&D ... etc.

Figure 2. A possible Connectionist network for drawing inferences from A&B to A or to B. [Figure: node 1, labelled 'A&B', with paths to the nodes labelled 'A' and 'B'.]

The machine's causal constitution is as follows: whenever a token of the form P&Q appears on the tape, the machine writes a token of the form P. An inference from A&B to A thus corresponds to a tokening of type A&B on the tape causing a tokening of type A.

So then, what does the architectural difference between the machines consist in? In the Classical machine, the objects to which the content A&B is ascribed (viz., tokens of the expression A&B) literally contain, as proper parts, objects to which the content A is ascribed (viz., tokens of the expression A). Moreover, the semantics (e.g., the satisfaction conditions) of the expression A&B is determined in a uniform way by the semantics of its constituents.11 By contrast, in the Connectionist machine none of this is true; the object to which the content A&B is ascribed (viz., node 1) is causally connected to the object to which the content A is ascribed (viz., node 2); but there is no structural (e.g., no part/whole) relation that holds between them. In short, it is characteristic of Classical systems, but not of Connectionist systems, to exploit arrays of symbols some of which are atomic (e.g., expressions like A) but indefinitely many of which have other symbols as syntactic and semantic parts (e.g., expressions like A&B).

11This makes the compositionality of data structures a defining property of Classical architecture. But, of course, it leaves open the question of the degree to which natural languages (like English) are also compositional.

It is easy to overlook this difference between Classical and Connectionist architectures when reading the Connectionist polemical literature or examining a Connectionist model. There are at least four ways in which one might be led to do so: (1) by failing to understand the difference between what arrays of symbols do in Classical machines and what node labels do in Connectionist machines; (2) by confusing the question whether the nodes in Connectionist networks have constituent structure with the question whether they are neurologically distributed; (3) by failing to distinguish between a representation having semantic and syntactic constituents and a concept being encoded in terms of microfeatures; and (4) by assuming that since representations of Connectionist networks have a graph structure, it follows that the nodes in the networks have a corresponding constituent structure. We shall now need rather a long digression to clear up these misunderstandings.

2.1.1. The role of labels in Connectionist theories

In the course of setting out a Connectionist model, intentional content will be assigned to machine states, and the expressions of some language or other will, of course, be used to express this assignment; for example, nodes may be labelled to indicate their representational content. Such labels often have
a combinatorial syntax and semantics; in this respect, they can look a lot like Classical mental representations. The point to emphasize, however, is that it doesn't follow (and it isn't true) that the nodes to which these labels are assigned have a combinatorial syntax and semantics. 'A&B', for example, can be tokened on the tape of the Classical machine and can also appear as a label in a Connectionist machine, as it does in diagram 2 above. And, of course, the expression 'A&B' is syntactically and semantically complex: it has a token of 'A' as one of its syntactic constituents, and the semantics of the expression 'A&B' is a function of the semantics of the expression 'A'. But it isn't part of the intended reading of the diagram that node 1 itself has constituents; the node - unlike its label - has no semantically interpreted parts.

It is, in short, important to understand the difference between Connectionist labels and the symbols over which Classical computations are defined. The difference is this: strictly speaking, the labels play no role at all in determining the operation of a Connectionist machine; in particular, the operation of the machine is unaffected by the syntactic and semantic relations that hold among the expressions that are used as labels. To put this another way, the node labels in a Connectionist machine are not part of the causal structure of the machine. Thus, the machine depicted in Figure 2 will continue to make the same state transitions regardless of what labels we assign to the nodes. Whereas, by contrast, the state transitions of Classical machines are causally determined by the structure - including the constituent structure - of the symbol arrays that the machines transform: change the symbols and the system behaves quite differently. (In fact, since the behavior of a Classical machine is sensitive to the syntax of the representations it computes on, even interchanging synonymous - semantically equivalent - representations affects the course of computation.) So, although the Connectionist's labels and the Classicist's


data structures both constitute languages, only the latter language constitutes a medium of computation.12

12Labels aren't part of the causal structure of a Connectionist machine, but they may play an essential role in its causal history insofar as designers wire their machines to respect the semantical relations that the labels express. For example, in Ballard's (1987) Connectionist model of theorem proving, there is a mechanical procedure for wiring a network which will carry out proofs by unification. This procedure is a function from a set of node labels to a wired-up machine. There is thus an interesting and revealing respect in which node labels are relevant to the operations that get performed when the function is executed. But, of course, the machine on which the labels have the effect is not the machine whose states they are labels of; and the effect of the labels occurs at the time that the theorem-proving machine is constructed, not at the time its reasoning process is carried out. This sort of case of labels 'having effects' is thus quite different from the way that symbol tokens (e.g., tokened data structures) can affect the causal processes of a Classical machine.

2.1.2. Connectionist networks and graph structures

The second reason that the lack of syntactic and semantic structure in Connectionist representations has largely been ignored may be that Connectionist networks look like general graphs; and it is, of course, perfectly possible to use graphs to describe the internal structure of a complex symbol. That's precisely what linguists do when they use 'trees' to exhibit the constituent structure of sentences. Correspondingly, one could imagine a graph notation that expresses the internal structure of mental representations by using arcs and labelled nodes. So, for example, you might express the syntax of the mental representation that corresponds to the thought that John loves the girl like this:

John → loves → the girl

Under the intended interpretation, this would be the structural description of a mental representation whose content is that John loves the girl, and whose constituents are: a mental representation that refers to John, a mental representation that refers to the girl, and a mental representation that expresses the two-place relation represented by '→ loves →'.

But although graphs can sustain an interpretation as specifying the logical syntax of a complex mental representation, this interpretation is inappropriate for graphs of Connectionist networks. Connectionist graphs are not structural descriptions of mental representations; they're specifications of causal relations. All that a Connectionist can mean by a graph of the form X → Y is: states of node X causally affect states of node Y. In particular, the graph can't mean X is a constituent of Y or X is grammatically related to Y etc., since these sorts of relations are, in general, not defined for the kinds of mental representations that Connectionists recognize.

Another way to put this is that the links in Connectionist diagrams are not generalized pointers that can be made to take on different functional significance by an independent interpreter, but are confined to meaning something like "sends activation to". The intended interpretation of the links as causal connections is intrinsic to the theory. If you ignore this point, you are likely to take Connectionism to offer a much richer notion of mental representation than it actually does.

2.1.3. Distributed representations

The third mistake that can lead to a failure to notice that the mental representations in Connectionist models lack combinatorial syntactic and semantic structure is the fact that many Connectionists view representations as being neurologically distributed; and, presumably, whatever is distributed must have parts. It doesn't follow, however, that whatever is distributed must have constituents; being neurologically distributed is very different from having semantic or syntactic constituent structure. You have constituent structure when (and only when) the parts of semantically evaluable entities are themselves semantically evaluable. Constituency relations thus hold among objects all of which are at the representational level; they are, in that sense, within-level relations.13 By contrast, neural distributedness - the sort of relation that is assumed to hold between 'nodes' and the 'units' by which they are realized - is a between-level relation: the nodes, but not the units, count as representations. To claim that a node is neurally distributed is presumably to claim that its states of activation correspond to patterns of neural activity - to aggregates of neural 'units' - rather than to activations of single neurons. The important point is that nodes that are distributed in this sense can perfectly well be syntactically and semantically atomic: complex spatially-distributed implementation in no way implies constituent structure.
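The point that distributed realization does not yield constituent structure can be put in a small sketch of our own (the activation values are made up for the example): a node's state may be a pattern over many units while the node itself remains semantically atomic.

```python
# Illustrative sketch (ours): neural distributedness vs. constituent structure.
# A Connectionist 'node' may be realized as a pattern over many units ...
node_A_and_B = [0.9, 0.1, 0.7, 0.3, 0.8]   # hypothetical activation pattern for the node labelled 'A&B'
node_A       = [0.2, 0.8, 0.1, 0.9, 0.4]   # hypothetical pattern for the node labelled 'A'

def contains_as_part(whole, part):
    """True iff `part` occurs as a contiguous sub-sequence of `whole`."""
    n = len(part)
    return any(whole[i:i + n] == part for i in range(len(whole) - n + 1))

# ... but no part of the 'A&B' pattern is the 'A' pattern: the realization
# is distributed, yet the representation has no semantic constituents.
print(contains_as_part(node_A_and_B, node_A))    # prints False

# By contrast, a Classical symbol token literally contains its constituents:
print(contains_as_part(list("A&B"), list("A")))  # prints True
```

The between-level relation (node to units) shows up as the first comparison; the within-level, part/whole relation of Classical symbols shows up as the second.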
There is, however, a different sense in which the representational states in a network might be distributed, and this sort of distribution also raises questions relevant to the constituency issue.

2.1.4. Representations as 'distributed' over microfeatures

Many Connectionists hold that the mental representations that correspond to commonsense concepts (CHAIR, JOHN, CUP, etc.) are 'distributed' over galaxies of lower level units which themselves have representational content. To use common Connectionist terminology (see Smolensky, 1988), the higher or "conceptual level" units correspond to vectors in a "sub-conceptual" space
13Any relation specified as holding among representational states is, by definition, within the 'cognitive level'. It goes without saying that relations that are 'within-level' by this criterion can count as 'between-level' when we use criteria of finer grain. There is, for example, nothing to prevent hierarchies of levels of representational states.


of microfeatures. The model here is something like the relation between a defined expression and its defining feature analysis: thus, the concept BACHELOR might be thought to correspond to a vector in a space of features that includes ADULT, HUMAN, MALE, and MARRIED; i.e., as an assignment of the value + to the first two features and - to the last. Notice that distribution over microfeatures (unlike distribution over neural units) is a relation among representations, hence a relation at the cognitive level.

Since microfeatures are frequently assumed to be derived automatically (i.e., via learning procedures) from the statistical properties of samples of stimuli, we can think of them as expressing the sorts of properties that are revealed by multivariate analysis of sets of stimuli (e.g., by multidimensional scaling of similarity judgments). In particular, they need not correspond to English words; they can be finer-grained than, or otherwise atypical of, the terms for which a non-specialist needs to have a word. Other than that, however, they are perfectly ordinary semantic features, much like those that lexicographers have traditionally used to represent the meanings of words.

On the most frequent Connectionist accounts, theories articulated in terms of microfeature vectors are supposed to show how concepts are actually encoded, hence the feature vectors are intended to replace "less precise" specifications of macrolevel concepts. For example, where a Classical theorist might recognize a psychological state of entertaining the concept CUP, a Connectionist may acknowledge only a roughly analogous state of tokening the corresponding feature vector. (One reason that the analogy is only rough is that which feature vector 'corresponds' to a given concept may be viewed as heavily context dependent.) The generalizations that 'concept level' theories frame are thus taken to be only approximately true, the exact truth being stateable only in the vocabulary of the microfeatures. Smolensky, for example (p. 11), is explicit in endorsing this picture: "Precise, formal descriptions of the intuitive processor are generally tractable not at the conceptual level, but only at the subconceptual level."14 This treatment of the relation between
14Smolensky (1988, p. 14) remarks that "unlike symbolic tokens, these vectors lie in a topological space, in which some are close together and others are far apart". However, this seems to conflate claims about the Connectionist model and claims about its implementation (a conflation that is not unusual in the Connectionist literature, as we'll see in Section 4). If the space at issue is physical, then Smolensky is committed to extremely strong claims about adjacency relations in the brain; claims for which there is, in fact, no reason at all to believe. But if, as seems more plausible, the space at issue is semantical, then what Smolensky says isn't true. Practically any cognitive theory will imply distance measures between mental representations. In Classical theories, for example, the distance between two representations is plausibly related to the number of computational steps it takes to derive one representation from the other. In Connectionist theories, it is plausibly related to the number of intervening nodes (or to the degree of overlap between vectors, depending on the version of Connectionism one has in mind). The interesting claim is not that an architecture offers a distance measure but that it offers the right distance measure - one that is empirically certifiable.


commonsense concepts and microfeatures is exactly analogous to the standard Connectionist treatment of rules; in both cases, macrolevel theory is said to provide a vocabulary adequate for formulating generalizations that roughly approximate the facts about behavioral regularities. But the constructs of the macrotheory do not correspond to the causal mechanisms that generate these regularities. If you want a theory of these mechanisms, you need to replace talk about rules and concepts with talk about nodes, connections, microfeatures, vectors and the like.15

Now, it is among the major misfortunes of the Connectionist literature that the issue about whether commonsense concepts should be represented by sets of microfeatures has gotten thoroughly mixed up with the issue about combinatorial structure in mental representations. The crux of the mixup is the fact that sets of microfeatures can overlap, so that, for example, if a microfeature corresponding to '+has-a-handle' is part of the array of nodes over which the commonsense concept CUP is distributed, then you might think of the theory as representing '+has-a-handle' as a constituent of the concept CUP; from which you might conclude that Connectionists have a notion of constituency after all, contrary to the claim that Connectionism is not a language-of-thought architecture (see Smolensky, 1988).

A moment's consideration will make it clear, however, that even on the assumption that concepts are distributed over microfeatures, '+has-a-handle' is not a constituent of CUP in anything like the sense that 'Mary' (the word) is a constituent of (the sentence) 'John loves Mary'. In the former case, "constituency" is being (mis)used to refer to a semantic relation between predicates; roughly, the idea is that macrolevel predicates like CUP are defined by sets of microfeatures like 'has-a-handle', so that it's some sort of semantic truth that CUP applies to a subset of what 'has-a-handle' applies to. Notice that while the extensions of these predicates are in a set/subset relation, the predicates themselves are not in any sort of part-to-whole relation. The expression 'has-a-handle' isn't part of the expression CUP any more

15The primary use that Connectionists make of microfeatures is in their accounts of generalization and abstraction (see, for example, Hinton, McClelland, & Rumelhart, 1986). Roughly, you get generalization by using overlap of microfeatures to define a similarity space, and you get abstraction by making the vectors that correspond to types be subvectors of the ones that correspond to their tokens. Similar proposals have quite a long history in traditional Empiricist analysis; and have been roundly criticized over the centuries. (For a discussion of abstractionism, see Geach, 1957; that similarity is a primitive relation - hence not reducible to partial identity of feature sets - was, of course, a main tenet of Gestalt psychology, as well as more recent approaches based on "prototypes".) The treatment of microfeatures in the Connectionist literature would appear to be very close to early proposals by Katz and Fodor (1963) and Katz and Postal (1964), where both the idea of a feature analysis of concepts and the idea that relations of semantical containment among concepts should be identified with set-theoretic relations among feature arrays are explicitly endorsed.


than the English phrase 'is an unmarried man' is part of the English phrase 'is a bachelor'.

Real constituency does have to do with parts and wholes; the symbol 'Mary' is literally a part of the symbol 'John loves Mary'. It is because their symbols enter into real-constituency relations that natural languages have both atomic symbols and complex ones. By contrast, the definition relation can hold in a language where all the symbols are syntactically atomic; e.g., a language which contains both 'cup' and 'has-a-handle' as atomic predicates. This point is worth stressing. The question whether a representational system has real-constituency is independent of the question of microfeature analysis; it arises both for systems in which you have CUP as semantically primitive, and for systems in which the semantic primitives are things like '+has-a-handle' and CUP and the like are defined in terms of these primitives. It really is very important not to confuse the semantic distinction between primitive expressions and defined expressions with the syntactic distinction between atomic symbols and complex symbols.
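The semantic/syntactic contrast drawn here can be put in a couple of lines of code. This is our own sketch; the feature list for 'cup' (including 'holds-liquid') is an invented, hypothetical definition used only for illustration.

```python
# Illustrative sketch (ours): the definition relation is semantic; it can
# hold in a language whose symbols are all syntactically atomic.
definitions = {"cup": ["has-a-handle", "holds-liquid"]}  # hypothetical feature analysis

# Semantic fact: 'has-a-handle' figures in the definition of 'cup', so the
# extension of 'cup' is a subset of the extension of 'has-a-handle'.
print("has-a-handle" in definitions["cup"])   # prints True

# Syntactic fact: the atomic symbol 'has-a-handle' is not a part of the
# atomic symbol 'cup' - no part-to-whole relation holds between the symbols.
print("has-a-handle" in "cup")                # prints False
```

The two membership tests come apart precisely because definition is a relation between meanings while constituency is a relation between symbol tokens.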
So far as we know, there are no worked-out attempts in the Connectionist literature to deal with the syntactic and semantical issues raised by relations of real-constituency. There is, however, a proposal that comes up from time to time: viz., that what are traditionally treated as complex symbols should actually be viewed as just sets of units, with the role relations that traditionally get coded by constituent structure represented by units belonging to these sets. So, for example, the mental representation corresponding to the belief that John loves Mary might be the feature vector {+John-subject; +loves; +Mary-object}. Here 'John-subject', 'Mary-object' and the like are the labels of units; that is, they are atomic (i.e., micro-) features, whose status is analogous to 'has-a-handle'. In particular, they have no internal syntactic analysis, and there is no structural relation (except the orthographic one) between the feature 'Mary-object' that occurs in the set {John-subject; loves; Mary-object} and the feature 'Mary-subject' that occurs in the set {Mary-subject; loves; John-object}. (See, for example, the discussion in Hinton, 1987 of "role-specific descriptors that represent the conjunction of an identity and a role [by the use of which] we can implement part-whole hierarchies using set intersection as the composition rule." See also McClelland, Rumelhart & Hinton, 1986, p. 82-85, where what appears to be the same treatment is proposed in somewhat different terms.)

Since, as we remarked, these ideas aren't elaborated in the Connectionist literature, detailed discussion is probably not warranted here. But it's worth a word to make clear what sort of trouble you would get into if you were to take them seriously.

As we understand it, the proposal really has two parts: on the one hand,


it's suggested that although Connectionist representations cannot exhibit real-constituency, nevertheless the Classical distinction between complex symbols and their constituents can be replaced by the distinction between feature sets and their subsets; and, on the other hand, it's suggested that role relations can be captured by features. We'll consider these ideas in turn.

(1) Instead of having complex symbols like "John loves Mary" in the representational system, you have feature sets like {+John-subject; +loves; +Mary-object}. Since this set has {+John-subject}, {+loves; +Mary-object} and so forth as sub-sets, it may be supposed that the force of the constituency relation has been captured by employing the subset relation.

However, it's clear that this idea won't work since not all subsets of features correspond to genuine constituents. For example, among the subsets of {+John-subject; +loves; +Mary-object} are the sets {+John-subject; +Mary-object} and the set {+John-subject; +loves}, which do not, of course, correspond to constituents of the complex symbol "John loves Mary".

(2) Instead of defining roles in terms of relations among constituents, as one does in Classical architecture, introduce them as microfeatures.
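The failure of the subset proposal in (1) is easy to exhibit concretely; in this sketch of ours, frozensets stand in for the proposed feature vectors.

```python
# Illustrative sketch (ours): subsets of a feature set do not line up with
# the constituents of the complex symbol "John loves Mary".
belief = frozenset({"+John-subject", "+loves", "+Mary-object"})

# {+John-subject; +Mary-object} is a perfectly good subset ...
spurious = frozenset({"+John-subject", "+Mary-object"})
print(spurious <= belief)   # prints True

# ... yet "John ... Mary" (minus the verb) is not a constituent of
# "John loves Mary": subset-hood overgenerates 'constituents'.
```

Every one of the 2^3 subsets is equally available, but only a few of them would correspond to anything a grammar recognizes as a constituent.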

Consider a system in which the mental representation that is entertained when one believes that John loves Mary is the feature set {+John-subject; +loves; +Mary-object}. What representation corresponds to the belief that John loves Mary and Bill hates Sally? Suppose, pursuant to the present proposal, that it's the set {+John-subject; +loves; +Mary-object; +Bill-subject; +hates; +Sally-object}. We now have the problem of distinguishing that belief from the belief that John loves Sally and Bill hates Mary; and from the belief that John hates Mary and Bill loves Sally; and from the belief that John hates Mary and Sally and Bill loves Mary; etc., since these other beliefs will all correspond to precisely the same set of features. The problem is, of course, that nothing in the representation of Mary as +Mary-object specifies whether it's the loving or the hating that she is the object of; similarly, mutatis mutandis, for the representation of John as +John-subject.

What has gone wrong isn't disastrous (yet). All that's required is to enrich the system of representations by recognizing features that correspond not to (for example) just being a subject, but rather to being the subject of a loving of Mary (the property that John has when John loves Mary) and being the subject of a hating of Sally (the property that Bill has when Bill hates Sally). So, the representation of John that's entertained when one believes that John loves Mary and Bill hates Sally might be something like +John-subject-hates-Mary-object.
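The loss of binding information just described can be reproduced directly; the helper below is our own hypothetical encoding of the microfeature proposal, with frozensets modeling the feature vectors.

```python
# Illustrative sketch (ours): representing beliefs as unstructured feature
# sets loses the role bindings between subjects, verbs, and objects.
def features(*clauses):
    """Build the feature set for a conjunction of subject-verb-object clauses."""
    feats = set()
    for subj, verb, obj in clauses:
        feats |= {f"+{subj}-subject", f"+{verb}", f"+{obj}-object"}
    return frozenset(feats)

john_loves_mary_bill_hates_sally = features(("John", "loves", "Mary"),
                                            ("Bill", "hates", "Sally"))
john_loves_sally_bill_hates_mary = features(("John", "loves", "Sally"),
                                            ("Bill", "hates", "Mary"))

# Distinct beliefs, identical representations - the binding is gone:
print(john_loves_mary_bill_hates_sally == john_loves_sally_bill_hates_mary)  # prints True
```

Both beliefs flatten to the same six features, which is exactly the ambiguity the text describes: nothing in +Mary-object records which verb Mary is the object of.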

24

I .A . Fodor and Z . W. Pylyshyn

The disadvantage of this proposal is that it requires rather a lot of microfeatures.16 How many? Well, a number of the order of magnitude of the sentences of a natural language (whereas one might have hoped to get by with a vocabulary of basic expressions that is not vastly larger than the lexicon of a natural language; after all, natural languages do). We leave it to the reader to estimate the number of microfeatures you would need, assuming that there is a distinct belief corresponding to every grammatical sentence of English of up to, say, fifteen words of length, and assuming that there is an average of, say, five roles associated with each belief. (Hint: George Miller once estimated that the number of well-formed 20-word sentences of English is of the order of magnitude of the number of seconds in the history of the universe.)

The alternative to this grotesque explosion of atomic symbols would be to have a combinatorial syntax and semantics for the features. But, of course, this is just to give up the game since the syntactic and semantic relations that hold among the parts of the complex feature +((John subject) loves (Mary object)) are the very same ones that Classically hold among the constituents of the complex symbol "John loves Mary"; these include the role relations which Connectionists had proposed to reconstruct using just sets of atomic features. It is, of course, no accident that the Connectionist proposal for dealing with role relations runs into these sorts of problems. Subject, object and the rest are Classically defined with respect to the geometry of constituent structure trees. And Connectionist representations don't have constituents. The idea that we should capture role relations by allowing features like John-subject thus turns out to be bankrupt; and there doesn't seem to be any other way to get the force of structured symbols in a Connectionist architecture.
Or, if there is, nobody has given any indication of how to do it. This becomes clear once the crucial issue about structure in mental representations is disentangled from the relatively secondary (and orthogonal) issue about whether the representation of commonsense concepts is 'distributed' (i.e., from questions like whether it's CUP or 'has-a-handle' or both that is semantically primitive in the language of thought).

It's worth adding that these problems about expressing the role relations are actually just a symptom of a more pervasive difficulty: A consequence of restricting the vehicles of mental representation to sets of atomic symbols is a notation that fails quite generally to express the way that concepts group

16 Another disadvantage is that, strictly speaking, it doesn't work; although it allows us to distinguish the belief that John loves Mary and Bill hates Sally from the belief that John loves Sally and Bill hates Mary, we don't yet have a way to distinguish believing that (John loves Mary because Bill hates Sally) from believing that (Bill hates Sally because John loves Mary). Presumably nobody would want to have microfeatures corresponding to these.

Connectionism and cognitive architecture

25

into propositions. To see this, let's continue to suppose that we have a network in which the nodes represent concepts rather than propositions (so that what corresponds to the thought that John loves Mary is a distribution of activation over the set of nodes {JOHN; LOVES; MARY} rather than the activation of a single node labelled JOHN LOVES MARY). Notice that it cannot plausibly be assumed that all the nodes that happen to be active at a given time will correspond to concepts that are constituents of the same proposition; least of all if the architecture is "massively parallel" so that many things are allowed to go on: many concepts are allowed to be entertained simultaneously in a given mind. Imagine, then, the following situation: at time t, a man is looking at the sky (so the nodes corresponding to SKY and BLUE are active) and thinking that John loves Fido (so the nodes corresponding to JOHN, LOVES, and FIDO are active), and the node FIDO is connected to the node DOG (which is in turn connected to the node ANIMAL) in such fashion that DOG and ANIMAL are active too. We can, if you like, throw it in that the man has got an itch, so ITCH is also on.

According to the current theory of mental representation, this man's mind at t is specified by the vector {+JOHN, +LOVES, +FIDO, +DOG, +SKY, +BLUE, +ITCH, +ANIMAL}. And the question is: which subvectors of this vector correspond to thoughts that the man is thinking? Specifically, what is it about the man's representational state that determines that the simultaneous activation of the nodes {JOHN, LOVES, FIDO} constitutes his thinking that John loves Fido, but the simultaneous activation of FIDO, ANIMAL and BLUE does not constitute his thinking that Fido is a blue animal? It seems that we made it too easy for ourselves when we identified the thought that John loves Mary with the vector {+JOHN, +LOVES, +MARY}; at best that works only on the assumption that JOHN, LOVES and MARY are the only nodes active when someone has that thought. And that's an assumption to which no theory of mental representation is entitled.

It's important to see that this problem arises precisely because the theory is trying to use sets of atomic representations to do a job that you really need complex representations for. Thus, the question we're wanting to answer is: Given the total set of nodes active at a time, what distinguishes the subvectors that correspond to propositions from the subvectors that don't? This question has a straightforward answer if, contrary to the present proposal, complex representations are assumed: When representations express concepts that belong to the same proposition, they are not merely simultaneously active, but also in construction with each other. By contrast, representations that express concepts that don't belong to the same proposition may be simultaneously active; but they are ipso facto not in construction with each other. In short, you need two degrees of freedom to specify the thoughts that a system is entertaining at a time: one parameter picks out which concepts are active, and the other determines which of the active concepts are in construction with which others.


Strikingly enough, the point that we've been making in the past several paragraphs is very close to one that Kant made against the Associationists of his day. In "Transcendental Deduction (B)" of The First Critique, Kant remarks that:

... if I investigate ... the relation of the given modes of knowledge in any judgement, and distinguish it, as belonging to the understanding, from the relation according to laws of the reproductive imagination [e.g., according to the principles of association], which has only subjective validity, I find that a judgement is nothing but the manner in which given modes of knowledge are brought to the objective unity of apperception. This is what is intended by the copula "is". It is employed to distinguish the objective unity of given representations from the subjective .... Only in this way does there arise from the relation a judgement, that is a relation which is objectively valid, and so can be adequately distinguished from a relation of the same representations that would have only subjective validity, as when they are connected according to laws of association. In the latter case, all that I could say would be 'If I support a body, I feel an impression of weight'; I could not say, 'It, the body, is heavy'. Thus to say 'The body is heavy' is not merely to state that the two representations have always been conjoined in my perception, ... what we are asserting is that they are combined in the object ... (CPR, p. 159; emphasis Kant's)

A modern paraphrase might be: A theory of mental representation must distinguish the case when two concepts (e.g., THIS BODY, HEAVY) are merely simultaneously entertained from the case where, to put it roughly, the property that one of the concepts expresses is predicated of the thing that the other concept denotes (as in the thought: THIS BODY IS HEAVY). The relevant distinction is that while both concepts are "active" in both cases, in the latter case but not in the former the active concepts are in construction.

Kant thinks that "this is what is intended by the copula 'is'". But of course there are other notational devices that can serve to specify that concepts are in construction; notably the bracketing structure of constituency trees.

There are, to reiterate, two questions that you need to answer to specify the content of a mental state: "Which concepts are 'active'?" and "Which of the active concepts are in construction with which others?" Identifying mental states with sets of active nodes provides resources to answer the first of these questions but not the second. That's why the version of network theory that acknowledges sets of atomic representations but no complex representations fails, in indefinitely many cases, to distinguish mental states that are in fact distinct.
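The two degrees of freedom lend themselves to a toy illustration (the code and names below are hypothetical, not a model from the paper): a bare set of active nodes answers only the question of which concepts are active; an added record of constituency answers the question of which are in construction.

```python
# One degree of freedom: a mind-state as a bare set of active nodes.
# The node labels follow the paper's example of the man watching the sky.
active = {"JOHN", "LOVES", "FIDO", "DOG", "SKY", "BLUE", "ITCH", "ANIMAL"}

# Both candidate subvectors are equally present in the active set;
# nothing in it marks one as a thought and the other as mere co-activation.
print({"JOHN", "LOVES", "FIDO"} <= active)   # True
print({"FIDO", "ANIMAL", "BLUE"} <= active)  # True

# Second degree of freedom: an explicit record of which active concepts
# are in construction with which others (here, as ordered tuples).
in_construction = {("JOHN", "LOVES", "FIDO")}

print(("JOHN", "LOVES", "FIDO") in in_construction)   # True
print(("FIDO", "ANIMAL", "BLUE") in in_construction)  # False
```

The second structure is doing exactly the work the text describes: it distinguishes proposition-forming subsets of the active nodes from accidental ones, which no amount of information about activation alone can do.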

nected) representations of concepts, theories that acknowledge complex symbols define semantic interpretations over sets of representations of concepts together with specifications of the constituency relations that hold among these representations.


But we are not claiming that you can't reconcile a Connectionist architecture with an adequate theory of mental representation (specifically, with a combinatorial syntax and semantics for mental representations). On the contrary, of course you can: All that's required is that you use your network to implement a Turing machine, and specify a combinatorial structure for its computational language. What it appears that you can't do, however, is have both a combinatorial representational system and a Connectionist architecture at the cognitive level.

So much, then, for our long digression. We have now reviewed one of the major respects in which Connectionist and Classical theories differ; viz., their accounts of mental representations. We turn to the second major difference, which concerns their accounts of mental processes.
2.2. Structure sensitive operations

Classicists and Connectionists both offer accounts of mental processes, but their theories differ sharply. In particular, the Classical theory relies heavily on the notion of the logico/syntactic form of mental representations to define the ranges and domains of mental operations. This notion is, however, unavailable to orthodox Connectionists, since it presupposes that there are nonatomic mental representations.

The Classical treatment of mental processes rests on two ideas, each of which corresponds to an aspect of the Classical theory of computation. Together they explain why the Classical view postulates at least three distinct levels of organization in computational systems: not just a physical level and a semantic (or "knowledge") level, but a syntactic level as well.

The first idea is that it is possible to construct languages in which certain features of the syntactic structures of formulas correspond systematically to certain of their semantic features. Intuitively, the idea is that in such languages the syntax of a formula encodes its meaning; most especially, those aspects of its meaning that determine its role in inference. All the artificial languages that are used for logic have this property and English has it more or less. Classicists believe that it is a crucial property of the Language of Thought.

A simple example of how a language can use syntactic structure to encode inferential roles and relations among meanings may help to illustrate this point. Thus, consider the relation between the following two sentences:

(1) John went to the store and Mary went to the store.
(2) Mary went to the store.

On the one hand, from the semantic point of view, (1) entails (2) (so, of


course, inferences from (1) to (2) are truth preserving). On the other hand, from the syntactic point of view, (2) is a constituent of (1). These two facts can be brought into phase by exploiting the principle that sentences with the syntactic structure 'S1 and S2' entail their sentential constituents. Notice that this principle connects the syntax of these sentences with their inferential roles. Notice too that the trick relies on facts about the grammar of English; it wouldn't work in a language where the formula that expresses the conjunctive content John went to the store and Mary went to the store is syntactically atomic.18

18 And it doesn't work uniformly for English conjunction. Compare: John and Mary are friends → *John are friends; or: The flag is red, white and blue → The flag is blue. Such cases show either that English is not the language of thought, or that, if it is, the relation between syntax and semantics is a good deal subtler for the language of thought than it is for the standard logical languages.

Here is another example. We can reconstruct such truth preserving inferences as 'if Rover bites then something bites' on the assumption that the sentence 'Rover bites' is of the syntactic type Fa, the sentence 'something bites' is of the syntactic type ∃x(Fx), and every formula of the first type entails a corresponding formula of the second type. (Where the notion 'corresponding formula' is cashed syntactically; roughly, the two formulas must differ only in that the one has an existentially bound variable at the syntactic position that is occupied by a constant in the other.)

Once again, the point to notice is the blending of syntactical and semantical notions: The rule of existential generalization applies to formulas in virtue of their syntactic form. But the salient property that's preserved under applications of the rule is semantical: what's claimed for the transformation that the rule performs is that it is truth preserving.19

19 It needn't, however, be strict truth-preservation that makes the syntactic approach relevant to cognition. Other semantic properties might be preserved under syntactic transformation in the course of mental processing: e.g., warrant, plausibility, heuristic value, or simply semantic non-arbitrariness. The point of Classical modeling isn't to characterize human thought as supremely logical; rather, it's to show how a family of types of semantically coherent (or knowledge-dependent) reasoning are mechanically possible. Valid inference is the paradigm only in that it is the best understood member of this family; the one for which syntactical analogues for semantical relations have been most systematically elaborated.

There are, as it turns out, examples that are quite a lot more complicated than these. The whole of the branch of logic known as proof theory is devoted to exploring them.20 It would not be unreasonable to describe Classical Cognitive Science as an extended attempt to apply the methods of proof theory to the modeling of thought (and similarly, of whatever other mental processes are plausibly viewed as involving inferences; preeminently learning and perception). Classical theory construction rests on the hope that syntactic analogues can be constructed for nondemonstrative inferences (or informal, commonsense reasoning) in something like the way that proof theory has provided syntactic analogues for validity.

20 It is not uncommon for Connectionists to make disparaging remarks about the relevance of logic to psychology, even though they accept the idea that inference is involved in reasoning. Sometimes the suggestion seems to be that it's all right if Connectionism can't reconstruct the theory of inference that formal deductive logic provides, since it has something even better on offer. For example, in their report to the U.S. National Science Foundation, McClelland, Feldman, Adelson, Bower & McDermott (1986) state that "... connectionist models realize an evidential logic in contrast to the symbolic logic of conventional computing" (p. , our emphasis) and that "evidential logics are becoming increasingly important in cognitive science and have a natural map to connectionist modeling" (p. 7). It is, however, hard to understand the implied contrast since, on the one hand, evidential logic must surely be a fairly conservative extension of "the symbolic logic of conventional computing" (i.e., most of the theorems of the latter have to come out true in the former) and, on the other, there is not the slightest reason to doubt that an evidential logic would 'run' on a Classical machine. Prima facie, the problem about evidential logic isn't that we've got one that we don't know how to implement; it's that we haven't got one.

The second main idea underlying the Classical treatment of mental processes is that it is possible to devise machines whose function is the transformation of symbols, and whose operations are sensitive to the syntactical structure of the symbols that they operate upon. This is the Classical conception of a computer: it's what the various architectures that derive from Turing and Von Neumann machines all have in common.

Perhaps it's obvious how the two 'main ideas' fit together. If, in principle, syntactic relations can be made to parallel semantic relations, and if, in principle, you can have a mechanism whose operations on formulas are sensitive to their syntax, then it may be possible to construct a syntactically driven machine whose state transitions satisfy semantical criteria of coherence. Such a machine would be just what's required for a mechanical model of the semantical coherence of thought; correspondingly, the idea that the brain is such a machine is the foundational hypothesis of Classical cognitive science.

So much for the Classical story about mental processes. The Connectionist story must, of course, be quite different: Since Connectionists eschew postulating mental representations with combinatorial syntactic/semantic structure, they are precluded from postulating mental processes that operate on mental representations in a way that is sensitive to their structure. The sorts of operations that Connectionist models do have are of two sorts, depending on whether the process under examination is learning or reasoning.

2.2.1. Learning

If a Connectionist model is intended to learn, there will be processes that determine the weights of the connections among its units as a function of the character of its training. Typically, in a Connectionist machine (such as a 'Boltzman Machine') the weights among connections are adjusted until the system's behavior comes to model the statistical properties of its inputs. In


the limit, the stochastic relations among machine states recapitulate the stochastic relations among the environmental events that they represent.

This should bring to mind the old Associationist principle that the strength of association between 'Ideas' is a function of the frequency with which they are paired 'in experience', and the Learning Theoretic principle that the strength of a stimulus-response connection is a function of the frequency with which the response is rewarded in the presence of the stimulus. But though Connectionists, like other Associationists, are committed to learning processes that model statistical properties of inputs and outputs, the simple mechanisms based on co-occurrence statistics that were the hallmarks of old-fashioned Associationism have been augmented in Connectionist models by a number of technical devices. (Hence the 'new' in 'New Connectionism'.) For example, some of the earlier limitations of associative mechanisms are overcome by allowing the network to contain 'hidden' units (or aggregates) that are not directly connected to the environment and whose purpose is, in effect, to detect statistical patterns in the activity of the 'visible' units; including, perhaps, patterns that are more abstract or more 'global' than the ones that could be detected by old-fashioned perceptrons.21

In short, sophisticated versions of the associative principles for weight-setting are on offer in the Connectionist literature. The point of present concern, however, is what all versions of these principles have in common with one another and with older kinds of Associationism: viz., these processes are all frequency-sensitive. To return to the example discussed above: if a
Connectionist learning machine converges on a state where it is prepared to infer A from A&B (i.e., to a state in which, when the 'A & B' node is excited, it tends to settle into a state in which the 'A' node is excited), the convergence will typically be caused by statistical properties of the machine's training experience: e.g., by correlation between firing of the 'A & B' node and firing of the 'A' node, or by correlations of the firing of both with some feedback signal. Like traditional Associationism, Connectionism treats learning as basically a sort of statistical modeling.

2.2.2. Reasoning

Association operates to alter the structure of a network diachronically as a function of its training. Connectionist models also contain a variety of types of 'relaxation' processes which determine the synchronic behavior of a network; specifically, they determine what output the device provides for a given pattern of inputs. In this respect, one can think of a Connectionist
21 Compare the "little s's" and "little r's" of neo-Hullean "mediational" Associationists like Charles Osgood.


model as a species of analog machine constructed to realize a certain function. The inputs to the function are (i) a specification of the connectedness of the machine (of which nodes are connected to which); (ii) a specification of the weights along the connections; (iii) a specification of the values of a variety of idiosyncratic parameters of the nodes (e.g., intrinsic thresholds, time since last firing, etc.); (iv) a specification of a pattern of excitation over the input nodes. The output of the function is a specification of a pattern of excitation over the output nodes; intuitively, the machine chooses the output pattern that is most highly associated to its input.

Much of the mathematical sophistication of Connectionist theorizing has been devoted to devising analog solutions to this problem of finding a 'most highly associated' output corresponding to an arbitrary input; but, once again, the details needn't concern us. What is important, for our purposes, is another property that Connectionist theories share with other forms of Associationism.

In traditional Associationism, the probability that one Idea will elicit another is sensitive to the strength of the association between them (including 'mediating' associations, if any). And the strength of this association is in turn sensitive to the extent to which the Ideas have previously been correlated. Associative strength was not, however, presumed to be sensitive to features of the content or the structure of representations per se. Similarly, in Connectionist models, the selection of an output corresponding to a given input is a function of properties of the paths that connect them (including the weights, the states of intermediate units, etc.). And the weights, in turn, are a function of the statistical properties of events in the environment (or of relations between patterns of events in the environment and implicit 'predictions' made by the network, etc.). But the syntactic/semantic structure of the representation of an input is not presumed to be a factor in determining the selection of a corresponding output since, as we have seen, syntactic/semantic structure is not defined for the sorts of representations that Connectionist models acknowledge.

To summarize: Classical and Connectionist theories disagree about the nature of mental representation; for the former, but not for the latter, mental representations characteristically exhibit a combinatorial constituent structure and a combinatorial semantics. Classical and Connectionist theories also disagree about the nature of mental processes; for the former, but not for the latter, mental processes are characteristically sensitive to the combinatorial structure of the representations on which they operate. We take it that these two issues define the present dispute about the nature of cognitive architecture. We now propose to argue that the Connectionists are on the wrong side of both.
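The contrast between the two kinds of mental process can be illustrated with a minimal sketch (hypothetical code, not drawn from either literature): a Classical, structure-sensitive operation such as conjunction elimination applies to a representation in virtue of its syntactic form, an operation that is simply undefined over a bare set of co-active units.

```python
# A minimal structure-sensitive operation in the Classical style: the rule
# "from (S1 and S2) infer S1, and infer S2" applies to any representation
# with the right syntactic shape, regardless of its content.

def conjuncts(expr):
    """Return the sentential constituents of a conjunction, recursively."""
    if isinstance(expr, tuple) and expr[0] == "AND":
        out = []
        for part in expr[1:]:
            out.extend(conjuncts(part))
        return out
    return [expr]  # an atomic sentence is its own sole constituent

s = ("AND", "John went to the store", "Mary went to the store")
print(conjuncts(s))
# ['John went to the store', 'Mary went to the store']

# Nested structure is handled for free: the operation tracks the geometry
# of the constituent tree, not the identity of the sentences at its leaves.
nested = ("AND", s, "It rained")
print(conjuncts(nested))
# ['John went to the store', 'Mary went to the store', 'It rained']
```

Note what makes this "structure sensitive" in the paper's sense: the rule inspects only the form `("AND", ..., ...)`; swap in any other conjuncts and the inference still goes through, which is exactly the property a frequency-trained mapping over unstructured activation vectors does not have.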


3. The need for symbol systems: Productivity, systematicity, compositionality and inferential coherence

Classical psychological theories appeal to the constituent structure of mental representations to explain three closely related features of cognition: its productivity, its compositionality and its inferential coherence. The traditional argument has been that these features of cognition are, on the one hand, pervasive and, on the other hand, explicable only on the assumption that mental representations have internal structure. This argument, familiar in more or less explicit versions for the last thirty years or so, is still intact, so far as we can tell. It appears to offer something close to a demonstration that an empirically adequate cognitive theory must recognize not just causal relations among representational states but also relations of syntactic and semantic constituency; hence that the mind cannot be, in its general structure, a Connectionist network.

3.1. Productivity of thought

There is a classical productivity argument for the existence of combinatorial structure in any rich representational system (including natural languages and the language of thought). The representational capacities of such a system are, by assumption, unbounded under appropriate idealization; in particular, there are indefinitely many propositions which the system can encode.22 However, this unbounded expressive power must presumably be achieved by finite means. The way to do this is to treat the system of representations as consisting of expressions belonging to a generated set. More precisely, the correspondence between a representation and the proposition it expresses is, in arbitrarily many cases, built up recursively out of correspondences between parts of the expression and parts of the proposition. But, of course, this strategy can operate only when an unbounded number of the expressions are non-atomic. So linguistic (and mental) representations must constitute symbol systems (in the sense of footnote 8). So the mind cannot be a PDP.

Very often, when people reject this sort of reasoning, it is because they doubt that human cognitive capacities are correctly viewed as productive. In

22 This way of putting the productivity argument is most closely identified with Chomsky (e.g., Chomsky, 1965; 1968). However, one does not have to rest the argument upon a basic assumption of infinite generative capacity. Infinite generative capacity can be viewed, instead, as a consequence or a corollary of theories formulated so as to capture the greatest number of generalizations with the fewest independent principles. This more neutral approach is, in fact, very much in the spirit of what we shall propose below. We are putting it in the present form for expository and historical reasons.


the long run there can be no a priori arguments for (or against) idealizing to productive capacities; whether you accept the idealization depends on whether you believe that the inference from finite performance to finite capacity is justified, or whether you think that finite performance is typically a result of the interaction of an unbounded competence with resource constraints. Classicists have traditionally offered a mixture of methodological and empirical considerations in favor of the latter view.

From a methodological perspective, the least that can be said for assuming productivity is that it precludes solutions that rest on inappropriate tricks (such as storing all the pairs that define a function); tricks that would be unreasonable in practical terms even for solving finite tasks that place sufficiently large demands on memory. The idealization to unbounded productive capacity forces the theorist to separate the finite specification of a method for solving a computational problem from such factors as the resources that the system (or person) brings to bear on the problem at any given moment.

The empirical arguments for productivity have been made most frequently in connection with linguistic competence. They are familiar from the work of Chomsky (1968), who has claimed (convincingly, in our view) that the knowledge underlying linguistic competence is generative; i.e., that it allows us in principle to generate (/understand) an unbounded number of sentences. It goes without saying that no one does, or could, in fact utter or understand tokens of more than a finite number of sentence types; this is a trivial consequence of the fact that nobody can utter or understand more than a finite number of sentence tokens. But there are a number of considerations which suggest that, despite de facto constraints on performance, one's knowledge of one's language supports an unbounded productive capacity in much the same way that one's knowledge of addition supports an unbounded number of sums. Among these considerations are, for example, the fact that a speaker/hearer's performance can often be improved by relaxing time constraints, increasing motivation, or supplying pencil and paper. It seems very natural to treat such manipulations as affecting the transient state of the speaker's memory and attention rather than what he knows about, or how he represents, his language. But this treatment is available only on the assumption that the character of the subject's performance is determined by interactions between the available knowledge base and the available computational resources.

Classical theories are able to accommodate these sorts of considerations because they assume architectures in which there is a functional distinction between memory and program. In a system such as a Turing machine, where the length of the tape is not fixed in advance, changes in the amount of available memory can be effected without changing the computational structure


of the machine; viz., by making more tape available. By contrast, in a finite state automaton or a Connectionist machine, adding to the memory (e.g., by adding units to a network) alters the connectivity relations among nodes and thus does affect the machine's computational structure. Connectionist cognitive architectures cannot, by their very nature, support an expandable memory, so they cannot support productive cognitive capacities. The long and short is that if productivity arguments are sound, then they show that the architecture of the mind can't be Connectionist. Connectionists have, by and large, acknowledged this; so they are forced to reject productivity arguments.

The test of a good scientific idealization is simply and solely whether it produces successful science in the long term. It seems to us that the productivity idealization has more than earned its keep, especially in linguistics and in theories of reasoning. Connectionists, however, have not been persuaded. For example, Rumelhart and McClelland (1986a, p. 119) say that they "... do not agree that [productive] capabilities are of the essence of human computation. As anyone who has ever attempted to process sentences like 'The man the boy the girl hit kissed moved' can attest, our ability to process even moderate degrees of center-embedded structure is grossly impaired relative to an ATN [Augmented Transition Network] parser.... What is needed, then, is not a mechanism for flawless and effortless processing of embedded constructions ... The challenge is to explain how those processes that others have chosen to explain in terms of recursive mechanisms can be better explained by the kinds of processes natural for PDP networks."
These remarks suggest that Rumelhart and McClelland think that the fact that center-embedding sentences are hard is somehow an embarrassment for theories that view linguistic capacities as productive. But of course it's not, since, according to such theories, performance is an effect of interactions between a productive competence and restricted resources. There are, in fact, quite plausible Classical accounts of why center-embeddings ought to impose especially heavy demands on resources, and there is a reasonable amount of experimental support for these models (see, for example, Wanner & Maratsos, 1978). In any event, it should be obvious that the difficulty of parsing center-embeddings can't be a consequence of their recursiveness per se since there are many recursive structures that are strikingly easy to understand. Consider: 'this is the dog that chased the cat that ate the rat that lived in the house that Jack built.' The Classicist's case for productive capacities in parsing rests on the transparency of sentences like these.23 In short, the fact that center-embedded sentences are hard perhaps shows that there are some recursive structures that we can't parse. But what Rumelhart and McClelland need if they are to deny the productivity of linguistic capacities is the much stronger claim that there are no recursive structures that we can parse; and this stronger claim would appear to be simply false.

23McClelland and Kawamoto (1986) discuss this sort of recursion briefly. Their suggestion seems to be that parsing such sentences doesn't really require recovering their recursive structure: "... the job of the parser [with respect to right-recursive sentences] is to spit out phrases in a way that captures their local context. Such a representation may prove sufficient to allow us to reconstruct the correct bindings of noun phrases to verbs and prepositional phrases to nearby nouns and verbs" (p. 324; emphasis ours). It is, however, by no means the case that all of the semantically relevant grammatical relations in readily intelligible embedded sentences are local in surface structure. Consider: 'Where did the man who owns the cat that chased the rat that frightened the girl say that he was going to move to (X)?' or 'What did the girl that the children loved to listen to promise your friends that she would read (X) to them?' Notice that, in such examples, a binding element (italicized) can be arbitrarily displaced from the position whose interpretation it controls (marked 'X') without making the sentence particularly difficult to understand. Notice too that the 'semantics' doesn't determine the binding relations in either example.

36

J.A. Fodor and Z.W. Pylyshyn

Rumelhart and McClelland's discussion of recursion (pp. 119-120) nevertheless repays close attention. They are apparently prepared to concede that PDPs can model recursive capacities only indirectly, viz., by implementing Classical architectures like ATNs; so that if human cognition exhibited recursive capacities, that would suffice to show that minds have Classical rather than Connectionist architecture at the psychological level. "We have not dwelt on PDP implementations of Turing machines and recursive processing engines because we do not agree with those who would argue that such capacities are of the essence of human computation" (p. 119, our emphasis). Their argument that recursive capacities aren't "of the essence of human computation" is, however, just the unconvincing stuff about center-embedding quoted above.

So the Rumelhart and McClelland view is apparently that if you take it to be independently obvious that some cognitive capacities are productive, then you should take the existence of such capacities to argue for Classical cognitive architecture and hence for treating Connectionism as at best an implementation theory. We think that this is quite a plausible understanding of the bearing that the issues about productivity and recursion have on the issues about cognitive architecture; in Section 4 we will return to the suggestion that Connectionist models can plausibly be construed as models of the implementation of a Classical architecture.

In the meantime, however, we propose to view the status of productivity arguments for Classical architectures as moot; we're about to present a different sort of argument for the claim that mental representations need an articulated internal structure. It is closely related to the productivity argument, but it doesn't require the idealization to unbounded competence. Its assumptions should thus be acceptable even to theorists who, like Connectionists, hold that the finitistic character of cognitive capacities is intrinsic to their architecture.

3.2. Systematicity of cognitive representation

The form of the argument is this: Whether or not cognitive capacities are really productive, it seems indubitable that they are what we shall call 'systematic'. And we'll see that the systematicity of cognition provides as good a reason for postulating combinatorial structure in mental representation as the productivity of cognition does: you get, in effect, the same conclusion, but from a weaker premise.

The easiest way to understand what the systematicity of cognitive capacities amounts to is to focus on the systematicity of language comprehension and production. In fact, the systematicity argument for combinatorial structure in thought exactly recapitulates the traditional Structuralist argument for constituent structure in sentences. But we pause to remark upon a point that we'll re-emphasize later: linguistic capacity is a paradigm of systematic cognition, but it's wildly unlikely that it's the only example. On the contrary, there's every reason to believe that systematicity is a thoroughly pervasive feature of human and infrahuman mentation.

What we mean when we say that linguistic capacities are systematic is that the ability to produce/understand some sentences is intrinsically connected to the ability to produce/understand certain others. You can see the force of this if you compare learning languages the way we really do learn them with learning a language by memorizing an enormous phrase book. The point isn't that phrase books are finite and can therefore exhaustively specify only nonproductive languages; that's true, but we've agreed not to rely on productivity arguments for our present purposes. Our point is rather that you can learn any part of a phrase book without learning the rest.
Hence, on the phrase book model, it would be perfectly possible to learn that uttering the form of words 'Granny's cat is on Uncle Arthur's mat' is the way to say (in English) that Granny's cat is on Uncle Arthur's mat, and yet have no idea at all how to say that it's raining (or, for that matter, how to say that Uncle Arthur's cat is on Granny's mat). Perhaps it's self-evident that the phrase book story must be wrong about language acquisition because a speaker's knowledge of his native language is never like that. You don't, for example, find native speakers who know how to say in English that John loves the girl but don't know how to say in English that the girl loves John.

Notice, in passing, that systematicity is a property of the mastery of the syntax of a language, not of its lexicon. The phrase book model really does fit what it's like to learn the vocabulary of English, since when you learn English vocabulary you acquire a lot of basically independent capacities. So you might perfectly well learn that using the expression 'cat' is the way to refer to cats and yet have no idea that using the expression 'deciduous conifer' is the way to refer to deciduous conifers. Systematicity, like productivity, is the sort of property of cognitive capacities that you're likely to miss if you concentrate on the psychology of learning and searching lists.

There is, as we remarked, a straightforward (and quite traditional) argument from the systematicity of language capacity to the conclusion that sentences must have syntactic and semantic structure: If you assume that sentences are constructed out of words and phrases, and that many different sequences of words can be phrases of the same type, the very fact that one formula is a sentence of the language will often imply that other formulas must be too: in effect, systematicity follows from the postulation of constituent structure.

Suppose, for example, that it's a fact about English that formulas with the constituent analysis 'NP Vt NP' are well formed; and suppose that 'John' and 'the girl' are NPs and 'loves' is a Vt. It follows from these assumptions that 'John loves the girl,' 'John loves John,' 'the girl loves the girl,' and 'the girl loves John' must all be sentences. It follows too that anybody who has mastered the grammar of English must have linguistic capacities that are systematic in respect of these sentences; he can't but assume that all of them are sentences if he assumes that any of them are.

Compare the situation on the view that the sentences of English are all atomic. There is then no structural analogy between 'John loves the girl' and 'the girl loves John' and hence no reason why understanding one sentence should imply understanding the other; no more than understanding 'rabbit' implies understanding 'tree'.24 On the view that the sentences are atomic, the systematicity of linguistic capacities is a mystery; on the view that they have constituent structure, the systematicity of linguistic capacities is what you would predict. So we should prefer the latter view to the former.

24See Pinker (1984, Chapter 4) for evidence that children never go through a stage in which they distinguish between the internal structures of NPs depending on whether they are in subject or object position; i.e., the dialects that children speak are always systematic with respect to the syntactic structures that can appear in these positions.

Notice that you can make this argument for constituent structure in sentences without idealizing to astronomical computational capacities. There are productivity arguments for constituent structure, but they're concerned with our ability, in principle, to understand sentences that are arbitrarily long. Systematicity, by contrast, appeals to premises that are much nearer home; such considerations as the ones mentioned above, that no speaker understands the form of words 'John loves the girl' except as he also understands the form of words 'the girl loves John'. The assumption that linguistic capacities are productive "in principle" is one that a Connectionist might refuse to grant. But that they are systematic in fact no one can plausibly deny.

We can now, finally, come to the point: the argument from the systematicity of linguistic capacities to constituent structure in sentences is quite clear. But thought is systematic too, so there is a precisely parallel argument from the systematicity of thought to syntactic and semantic structure in mental representations.

What does it mean to say that thought is systematic? Well, just as you don't find people who can understand the sentence 'John loves the girl' but not the sentence 'the girl loves John,' so too you don't find people who can think the thought that John loves the girl but can't think the thought that the girl loves John. Indeed, in the case of verbal organisms the systematicity of thought follows from the systematicity of language if you assume, as most psychologists do, that understanding a sentence involves entertaining the thought that it expresses; on that assumption, nobody could understand both the sentences about John and the girl unless he were able to think both the thoughts about John and the girl.

But now if the ability to think that John loves the girl is intrinsically connected to the ability to think that the girl loves John, that fact will somehow have to be explained. For a Representationalist (which, as we have seen, Connectionists are), the explanation is obvious: Entertaining thoughts requires being in representational states (i.e., it requires tokening mental representations).
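How systematicity falls out of constituent structure can be made concrete with a toy sketch (ours, not the authors'; only the 'NP Vt NP' template and the John/girl lexicon are taken from the example above):

```python
from itertools import product

# Toy version of the 'NP Vt NP' analysis: knowing the language is knowing
# the constituents plus the template, so the generated sentence set is
# automatically systematic rather than an arbitrary list.
NPS = ["John", "the girl"]
VTS = ["loves"]

def sentences():
    """Every string licensed by the template 'NP Vt NP'."""
    return {f"{s} {v} {o}" for s, v, o in product(NPS, VTS, NPS)}

lang = sentences()
# If 'John loves the girl' is licensed, 'the girl loves John' must be too:
assert "John loves the girl" in lang
assert "the girl loves John" in lang
assert len(lang) == 4  # the four sentences listed in the text
```

On the competing 'atomic' view, the language is just a finite list of unanalyzed strings, and nothing forces these four sentences to stand or fall together.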
And, just as the systematicity of language shows that there must be structural relations between the sentence 'John loves the girl' and the sentence 'the girl loves John,' so the systematicity of thought shows that there must be structural relations between the mental representation that corresponds to the thought that John loves the girl and the mental representation that corresponds to the thought that the girl loves John;25 namely, the two mental representations, like the two sentences, must be made of the same parts. But if this explanation is right (and there don't seem to be any others on offer), then mental representations have internal structure and there is a language of thought. So the architecture of the mind is not a Connectionist network.26

25It may be worth emphasizing that the structural complexity of a mental representation is not the same thing as, and does not follow from, the structural complexity of its propositional content (i.e., of what we're calling "the thought that one has"). Thus, Connectionists and Classicists can agree that the thought that P&Q is complex (and has the thought that P among its parts) while disagreeing about whether mental representations have internal syntactic structure.

To summarize the discussion so far: Productivity arguments infer the internal structure of mental representations from the presumed fact that nobody has a finite intellectual competence. By contrast, systematicity arguments infer the internal structure of mental representations from the patent fact that nobody has a punctate intellectual competence. Just as you don't find linguistic capacities that consist of the ability to understand sixty-seven unrelated sentences, so too you don't find cognitive capacities that consist of the ability to think seventy-four unrelated thoughts. Our claim is that this isn't, in either case, an accident: A linguistic theory that allowed for the possibility of punctate languages would have gone not just wrong, but very profoundly wrong. And similarly for a cognitive theory that allowed for the possibility of punctate minds.

But perhaps not being punctate is a property only of the minds of language users; perhaps the representational capacities of infraverbal organisms do have just the kind of gaps that Connectionist models permit? A Connectionist might then claim that he can do everything "up to language" on the assumption that mental representations lack combinatorial syntactic and semantic structure. Everything up to language may not be everything, but it's a lot. (On the other hand, a lot may be a lot, but it isn't everything. Infraverbal cognitive architecture mustn't be so represented as to make the eventual acquisition of language in phylogeny and in ontogeny require a miracle.)

It is not, however, plausible that only the minds of verbal organisms are systematic. Think what it would mean for this to be the case. It would have to be quite usual to find, for example, animals capable of representing the state of affairs aRb, but incapable of representing the state of affairs bRa.
Such animals would be, as it were, aRb sighted but bRa blind since, presumably, the representational capacities of its mind affect not just what an organism can think, but also what it can perceive. In consequence, such animals would be able to learn to respond selectively to aRb situations but quite unable to learn to respond selectively to bRa situations. (So that, though you could teach the creature to choose the picture with the square larger than the triangle, you couldn't for the life of you teach it to choose the picture with the triangle larger than the square.)

It is, to be sure, an empirical question whether the cognitive capacities of infraverbal organisms are often structured that way, but we're prepared to bet that they are not. Ethological cases are the exceptions that prove the rule. There are examples where salient environmental configurations act as gestalten; and in such cases it's reasonable to doubt that the mental representation of the stimulus is complex. But the point is precisely that these cases are exceptional; they're exactly the ones where you expect that there will be some special story to tell about the ecological significance of the stimulus: that it's the shape of a predator, or the song of a conspecific ... etc. Conversely, when there is no such story to tell you expect structurally similar stimuli to elicit correspondingly similar cognitive capacities. That, surely, is the least that a respectable principle of stimulus generalization has got to require.

That infraverbal cognition is pretty generally systematic seems, in short, to be about as secure as any empirical premise in this area can be. And, as we've just seen, it's a premise from which the inadequacy of Connectionist models as cognitive theories follows quite straightforwardly; as straightforwardly, in any event, as it would from the assumption that such capacities are generally productive.

26These considerations throw further light on a proposal we discussed in Section 2. Suppose that the mental representation corresponding to the thought that John loves the girl is the feature vector {+John-subject; +loves; +the-girl-object}, where 'John-subject' and 'the-girl-object' are atomic features; as such, they bear no more structural relation to 'John-object' and 'the-girl-subject' than they do to one another or to, say, 'has-a-handle'. Since this theory recognizes no structural relation between 'John-subject' and 'John-object', it offers no reason why a representational system that provides the means to express one of these concepts should also provide the means to express the other. This treatment of role relations thus makes a mystery of the (presumed) fact that anybody who can entertain the thought that John loves the girl can also entertain the thought that the girl loves John (and, mutatis mutandis, that any natural language that can express the proposition that John loves the girl can also express the proposition that the girl loves John). This consequence of the proposal that role relations be handled by "role specific descriptors that represent the conjunction of an identity and a role" (Hinton, 1987) offers a particularly clear example of how failure to postulate internal structure in representations leads to failure to capture the systematicity of representational systems.
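The feature-vector treatment of role relations criticized in footnote 26 can be contrasted with a structured treatment in a few lines (our illustration; the feature names follow the {+John-subject; +loves; +the-girl-object} example):

```python
# Atomic role-features: 'John-subject' and 'John-object' are distinct,
# unanalyzed symbols, no more related to each other than to 'has-a-handle'.
atomic_1 = frozenset({"John-subject", "loves", "the-girl-object"})
atomic_2 = frozenset({"the-girl-subject", "loves", "John-object"})

# Structured representations: the same constituents, differently arranged.
structured_1 = ("loves", "John", "the girl")  # loves(John, the girl)
structured_2 = ("loves", "the girl", "John")  # loves(the girl, John)

# The structured pair share all their parts, so a system that can token one
# has everything it needs to token the other; the atomic pair share only
# 'loves'.
assert set(structured_1) == set(structured_2)
assert atomic_1 & atomic_2 == {"loves"}
```

Nothing about `atomic_1` predicts the availability of `atomic_2`; that is the arbitrarily punctate mental life the text complains of.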
3.3. Compositionality of representations

Compositionality is closely related to systematicity; perhaps they're best viewed as aspects of a single phenomenon. We will therefore follow much the same course here as in the preceding discussion: first we introduce the concept by recalling the standard arguments for the compositionality of natural languages. We then suggest that parallel arguments secure the compositionality of mental representations. Since compositionality requires combinatorial syntactic and semantic structure, the compositionality of thought is evidence that the mind is not a Connectionist network.

We said that the systematicity of linguistic competence consists in the fact that the ability to produce/understand some of the sentences is intrinsically connected to the ability to produce/understand certain of the others. We now add that which sentences are systematically related is not arbitrary from a semantic point of view. For example, being able to understand 'John loves the girl' goes along with being able to understand 'the girl loves John', and there are correspondingly close semantic relations between these sentences: in order for the first to be true, John must bear to the girl the very same relation that the truth of the second requires the girl to bear to John. By contrast, there is no intrinsic connection between understanding either of the John/girl sentences and understanding semantically unrelated formulas like 'quarks are made of gluons' or 'the cat is on the mat' or '2 + 2 = 4'; it looks as though semantical relatedness and systematicity keep quite close company.

You might suppose that this covariance is covered by the same explanation that accounts for systematicity per se; roughly, that sentences that are systematically related are composed from the same syntactic constituents. But, in fact, you need a further assumption, which we'll call the 'principle of compositionality': insofar as a language is systematic, a lexical item must make approximately the same semantic contribution to each expression in which it occurs. It is, for example, only insofar as 'the girl', 'loves' and 'John' make the same semantic contribution to 'John loves the girl' that they make to 'the girl loves John' that understanding the one sentence implies understanding the other. Similarity of constituent structure accounts for the semantic relatedness between systematically related sentences only to the extent that the semantical properties of the shared constituents are context-independent.

Here it's idioms that prove the rule: being able to understand 'the', 'man', 'kicked' and 'bucket' isn't much help with understanding 'the man kicked the bucket', since 'kicked' and 'bucket' don't bear their standard meanings in this context. And, just as you'd expect, 'the man kicked the bucket' is not systematic even with respect to syntactically closely related sentences like 'the man kicked over the bucket' (for that matter, it's not systematic with respect to 'the man kicked the bucket' read literally).

It's uncertain exactly how compositional natural languages actually are (just as it's uncertain exactly how systematic they are). We suspect that the amount of context induced variation of lexical meaning is often overestimated because other sorts of context sensitivity are misconstrued as violations of compositionality. For example, the difference between 'feed the chicken' and 'chicken to eat' must involve an animal/food ambiguity in 'chicken' rather than a violation of compositionality since, if the context 'feed the ...' could induce (rather than select) the meaning animal, you would expect 'feed the veal', 'feed the pork' and the like.27 Similarly, the difference between 'good book', 'good rest' and 'good fight' is probably not meaning shift but syncategorematicity. 'Good NP' means something like NP that answers to the relevant interest in NPs: a good book is one that answers to our interest in books (viz., it's good to read); a good rest is one that answers to our interest in rests (viz., it leaves one refreshed); a good fight is one that answers to our interest in fights (viz., it's fun to watch or to be in, or it clears the air); and so on. It's because the meaning of 'good' is syncategorematic and has a variable in it for relevant interests, that you can know that a good flurg is a flurg that answers to the relevant interest in flurgs without knowing what flurgs are or what the relevant interest in flurgs is (see Ziff, 1960).

27We are indebted to Steve Pinker for this point.

In any event, the main argument stands: systematicity depends on compositionality, so to the extent that a natural language is systematic it must be
compositional too. This illustrates another respect in which systematicity arguments can do the work for which productivity arguments have previously been employed. The traditional argument for compositionality is that it is required to explain how a finitely representable language can contain infinitely many nonsynonymous expressions.

Considerations about systematicity offer one argument for compositionality; considerations about entailment offer another. Consider predicates like '... is a brown cow'. This expression bears a straightforward semantical relation to the predicates '... is a cow' and '... is brown'; viz., that the first predicate is true of a thing if and only if both of the others are. That is, '... is a brown cow' severally entails '... is brown' and '... is a cow' and is entailed by their conjunction. Moreover, and this is important, this semantical pattern is not peculiar to the cases cited. On the contrary, it holds for a very large range of predicates (see '... is a red square', '... is a funny old German soldier', '... is a child prodigy'; and so forth).

How are we to account for these sorts of regularities? The answer seems clear enough; '... is a brown cow' entails '... is brown' because (a) the second expression is a constituent of the first; (b) the syntactical form '(adjective noun)N' has (in many cases) the semantic force of a conjunction, and (c) 'brown' retains its semantical value under simplification of conjunction. Notice that you need (c) to rule out the possibility that 'brown' means brown when it modifies a noun but (as it might be) dead when it's a predicate adjective; in which case '... is a brown cow' wouldn't entail '... is brown' after all. Notice too that (c) is just an application of the principle of composition.
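The (a)-(c) analysis of '... is a brown cow' amounts to a compositional semantic rule, which can be sketched as follows (our rendering for illustration, not a formalism from the text):

```python
# (b): the form '(adjective noun)N' has the semantic force of a conjunction;
# (c): 'brown' keeps its usual value inside the combination.
def is_brown(x): return x.get("color") == "brown"
def is_cow(x):   return x.get("kind") == "cow"

def modify(adj, noun):
    """Compose an (adjective noun) predicate as a conjunction."""
    return lambda x: adj(x) and noun(x)

is_brown_cow = modify(is_brown, is_cow)

bessie = {"color": "brown", "kind": "cow"}
ferrari = {"color": "red", "kind": "car"}
# '... is a brown cow' severally entails '... is brown' and '... is a cow':
assert is_brown_cow(bessie) and is_brown(bessie) and is_cow(bessie)
assert not is_brown_cow(ferrari)
```

If (c) were dropped, so that context could reassign 'brown' a different value inside the combination, the entailment from 'brown cow' to 'brown' would no longer be guaranteed; that is exactly the possibility (c) excludes.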

So, here's the argument so far: you need to assume some degree of compositionality of English sentences to account for the fact that systematically related sentences are always semantically related; and to account for certain regular parallelisms between the syntactical structure of sentences and their entailments. So, beyond any serious doubt, the sentences of English must be compositional to some serious extent. But the principle of compositionality governs the semantic relations between words and the expressions of which

they are constituents. So compositionality implies that (some) expressions have constituents. So compositionality argues for (specifically, presupposes) syntactic/semantic structure in sentences.

Now what about the compositionality of mental representations? There is, as you'd expect, a bridging argument based on the usual psycholinguistic premise that one uses language to express one's thoughts: Sentences are used to express thoughts; so if the ability to use some sentences is connected with the ability to use certain other, semantically related sentences, then the ability to think some thoughts must be correspondingly connected with the ability to think certain other, semantically related thoughts. But you can only think the thoughts that your mental representations can express. So, if the ability to think certain thoughts is interconnected, then the corresponding representational capacities must be interconnected too; specifically, the ability to be in some representational states must imply the ability to be in certain other, semantically related representational states.

But then the question arises: how could the mind be so arranged that the ability to be in one representational state is connected with the ability to be in others that are semantically nearby? What account of mental representation would have this consequence? The answer is just what you'd expect from the discussion of the linguistic material. Mental representations must have internal structure, just the way that sentences do. In particular, it must be that the mental representation that corresponds to the thought that John loves the girl contains, as its parts, the same constituents as the mental representation that corresponds to the thought that the girl loves John. That would explain why these thoughts are systematically related; and, to the extent that the semantic value of these parts is context-independent, that would explain why these systematically related thoughts are also semantically related. So, by this chain of argument, evidence for the compositionality of sentences is evidence for the compositionality of the representational states of speaker/hearers.

Finally, what about the compositionality of infraverbal thought? The argument isn't much different from the one that we've just run through. We assume that animal thought is largely systematic: the organism that can perceive (hence learn) that aRb can generally perceive (/learn) that bRa. But, systematically related thoughts (just like systematically related sentences) are generally semantically related too. It's no surprise that being able to learn that the triangle is above the square implies being able to learn that the square is above the triangle; whereas it would be very surprising if being able to learn the square/triangle facts implied being able to learn that quarks are made of gluons or that Washington was the first President of America.

So, then, what explains the correlation between systematic relations and


semantic relations in infraverbal thought? Clearly, Connectionist models don't address this question; the fact that a network contains a node labelled X has, so far as the constraints imposed by Connectionist architecture are concerned, no implications at all for the labels of the other nodes in the network; in particular, it doesn't imply that there will be nodes that represent thoughts that are semantically close to X. This is just the semantical side of the fact that network architectures permit arbitrarily punctate mental lives.

But if, on the other hand, we make the usual Classicist assumptions (viz., that systematically related thoughts share constituents and that the semantic values of these shared constituents are context independent) the correlation between systematicity and semantic relatedness follows immediately. For a Classicist, this correlation is an architectural property of minds; it couldn't but hold if mental representations have the general properties that Classical models suppose them to.

What have Connectionists to say about these matters? There is some textual evidence that they are tempted to deny the facts of compositionality wholesale. For example, Smolensky (1988) claims that: "Surely ... we would get quite a different representation of coffee if we examined the difference between 'can with coffee' and 'can without coffee' or 'tree with coffee' and 'tree without coffee'; or 'man with coffee' and 'man without coffee' ... context insensitivity is not something we expect to be reflected in Connectionist representations ..."

It's certainly true that compositionality is not generally a feature of Connectionist representations. Connectionists can't acknowledge the facts of compositionality because they are committed to mental representations that don't have combinatorial structure. But to give up on compositionality is to take 'kick the bucket' as a model for the relation between syntax and semantics; and the consequence is, as we've seen, that you make the systematicity of language (and of thought) a mystery. On the other hand, to say that 'kick the bucket' is aberrant, and that the right model for the syntax/semantics relation is (e.g.) 'brown cow', is to start down a trail which leads, pretty inevitably, to acknowledging combinatorial structure in mental representation, hence to the rejection of Connectionist networks as cognitive models.

We don't think there's any way out of the need to acknowledge the compositionality of natural languages and of mental representations. However, it's been suggested (see Smolensky, op. cit.) that while the principle of compositionality is false (because content isn't context invariant) there is nevertheless a 'family resemblance' between the various meanings that a symbol has in the various contexts in which it occurs. Since such proposals generally aren't elaborated, it's unclear how they're supposed to handle the salient facts about systematicity and inference. But surely there are going to


be serious problems. Consider, for example, such inferences as

(i) Turtles are slower than rabbits.
(ii) Rabbits are slower than Ferraris.
(iii) Turtles are slower than Ferraris.

The soundness of this inference appears to depend upon (a) the fact that the same relation (viz., slower than) holds between turtles and rabbits on the one hand, and rabbits and Ferraris on the other; and (b) the fact that that relation is transitive. If, however, it's assumed (contrary to the principle of compositionality) that slower than means something different in premises (i) and (ii) (and presumably in (iii) as well) so that, strictly speaking, the relation that holds between turtles and rabbits is not the same one that holds between rabbits and Ferraris, then it's hard to see why the inference should be valid.

Talk about the relations being similar only papers over the difficulty, since the problem is then to provide a notion of similarity that will guarantee that if (i) and (ii) are true, so too is (iii). And, so far at least, no such notion of similarity has been forthcoming. Notice that it won't do to require just that the relations all be similar in respect of their transitivity, i.e., that they all be transitive. On that account, the argument from turtles are slower than rabbits and rabbits are furrier than Ferraris to turtles are slower than Ferraris would be valid, since furrier than is transitive too.

Until these sorts of issues are attended to, the proposal to replace the compositional principle of context invariance with a notion of "approximate equivalence ... across contexts" (Smolensky, 1988) doesn't seem to be much more than hand waving.
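The trouble with appealing to mere similarity of relations can be made concrete. The sketch below is our illustration, not the authors'; the relation names and stored facts are hypothetical. Each relation is transitive on its own, yet a chaining rule is sound only when both premises token the same relation; transitivity of each relation separately does not license the mixed inference.

```python
# Each relation is transitive in isolation, but chaining across
# *different* relations is invalid, mirroring the slower/furrier example.

# Facts, stored per relation (hypothetical data for illustration).
facts = {
    "slower_than": {("turtle", "rabbit")},
    "furrier_than": {("rabbit", "ferrari")},
}

def chain(rel1, rel2, a, b, c):
    """License a-c only when both premises use the SAME transitive relation."""
    return rel1 == rel2 and (a, b) in facts[rel1] and (b, c) in facts[rel2]

# Same relation in both premises: the inference goes through.
facts["slower_than"].add(("rabbit", "ferrari"))
assert chain("slower_than", "slower_than", "turtle", "rabbit", "ferrari")

# Each relation transitive, but different relations: no license for
# "turtles are slower than Ferraris" from slower_than plus furrier_than.
assert not chain("slower_than", "furrier_than", "turtle", "rabbit", "ferrari")
```

The point of the sketch is that "being transitive" is a property the two relations share without thereby making the chained conclusion follow; some stronger, as-yet-unspecified notion of similarity would be needed.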

3.4. The systematicity of inference

In Section 2 we saw that, according to Classical theories, the syntax of mental representations mediates between their semantic properties and their causal role in mental processes. Take a simple case: It's a logical principle that conjunctions entail their constituents (so the argument from P&Q to P and to Q is valid). Correspondingly, it's a psychological law that thoughts that P&Q tend to cause thoughts that P and thoughts that Q, all else being equal. Classical theory exploits the constituent structure of mental representations to account for both these facts, the first by assuming that the combinatorial semantics of mental representations is sensitive to their syntax and the second by assuming that mental processes apply to mental representations in virtue of their constituent structure.

A consequence of these assumptions is that Classical theories are committed to the following striking prediction: inferences that are of similar logical type ought, pretty generally,28 to elicit correspondingly similar cognitive capacities. You shouldn't, for example, find a kind of mental life in which
you get inferences from P&Q&R to P but you don't get inferences from P&Q to P. This is because, according to the Classical account, this logically homogeneous class of inferences is carried out by a correspondingly homogeneous class of psychological mechanisms: the premises of both inferences are expressed by mental representations that satisfy the same syntactic analysis (viz., S1&S2&S3& ... &Sn); and the process of drawing the inference corresponds, in both cases, to the same formal operation of detaching the constituent that expresses the conclusion.
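The Classical point is that one structure-sensitive operation covers every conjunction at once. A minimal sketch (ours, using a hypothetical tuple encoding of conjunctive form, not anything from the paper) might look like this:

```python
# ONE structure-sensitive rule detaches any constituent of any conjunction,
# so a mind that simplifies P&Q&R automatically simplifies P&Q as well.

def conjuncts(expr):
    """A conjunction is encoded as the tuple ('and', S1, S2, ..., Sn)."""
    if isinstance(expr, tuple) and expr and expr[0] == "and":
        return list(expr[1:])
    return [expr]

def simplify(premise):
    """Conjunction elimination: from S1&S2&...&Sn infer each Si."""
    return conjuncts(premise)

assert simplify(("and", "P", "Q", "R")) == ["P", "Q", "R"]
assert simplify(("and", "P", "Q")) == ["P", "Q"]   # comes for free
```

Because the rule applies in virtue of syntactic form, there is no way to have the three-conjunct case without the two-conjunct case: the homogeneity of the inference class falls out of the single mechanism.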

The idea that organisms should exhibit similar cognitive capacities in respect of logically similar inferences is so natural that it may seem unavoidable. But, on the contrary: there's nothing in principle to preclude a kind of cognitive model in which inferences that are quite similar from the logician's point of view are nevertheless computed by quite different mechanisms; or in which some inferences of a given logical type are computed and other inferences of the same logical type are not. Consider, in particular, the Connectionist account. A Connectionist can certainly model a mental life in which, if you can reason from P&Q&R to P, then you can also reason from P&Q to P. For example, the network in (Figure 3) would do:


Figure 3. A possible Connectionist network which draws inferences from P&Q&R to P and also draws inferences from P&Q to P.

28The hedge is meant to exclude cases where inferences of the same logical type nevertheless differ in complexity in virtue of, for example, the length of their premises. The inference from (AvBvCvDvE) and (-B&-C&-D&-E) to A is of the same logical type as the inference from AvB and -B to A. But it wouldn't be very surprising, or very interesting, if there were minds that could handle the second inference but not the first.


But notice that a Connectionist can equally model a mental life in which you get one of these inferences and not the other. In the present case, since there is no structural relation between the P&Q&R node and the P&Q node (remember, all nodes are atomic; don't be misled by the node labels), there's no reason why a mind that contains the first should also contain the second, or vice versa. Analogously, there's no reason why you shouldn't get minds that simplify the premise John loves Mary and Bill hates Mary but no others; or minds that simplify premises with 1, 3, or 5 conjuncts, but don't simplify premises with 2, 4, or 6 conjuncts; or, for that matter, minds that simplify only premises that were acquired on Tuesdays ... etc.

In fact, the Connectionist architecture is utterly indifferent as among these possibilities. That's because it recognizes no notion of syntax according to which thoughts that are alike in inferential role (e.g., thoughts that are all subject to simplification of conjunction) are expressed by mental representations of correspondingly similar syntactic form (e.g., by mental representations that are all syntactically conjunctive). So, the Connectionist architecture tolerates gaps in cognitive capacities; it has no mechanism to enforce the requirement that logically homogeneous inferences should be executed by correspondingly homogeneous computational processes.

But, we claim, you don't find cognitive capacities that have these sorts of gaps. You don't, for example, get minds that are prepared to infer John went to the store from John and Mary and Susan and Sally went to the store and from John and Mary went to the store but not from John and Mary and Susan went to the store. Given a notion of logical syntax, the very notion that the Classical theory of mentation requires to get its account of mental processes off the ground, it is a truism that you don't get such minds. Lacking a notion of logical syntax, it is a mystery that you don't.

3.5. Summary

It is perhaps obvious by now that all the arguments that we've been reviewing, the argument from systematicity, the argument from compositionality, and the argument from inferential coherence, are really much the same: If you hold the kind of theory that acknowledges structured representations, it must perforce acknowledge representations with similar or identical structures. In the linguistic cases, constituent analysis implies a taxonomy of sentences by their syntactic form, and in the inferential cases, it implies a taxonomy of arguments by their logical form. So, if your theory also acknowledges mental processes that are structure sensitive, it will then predict that similarly structured representations will generally play similar roles in thought. A theory that says that the sentence 'John loves the girl' is made
out of the same parts as the sentence 'the girl loves John', and made by applications of the same rules of composition, will have to go out of its way to explain a linguistic competence which embraces one sentence but not the other. And similarly, if a theory says that the mental representation that corresponds to the thought that P&Q&R has the same (conjunctive) syntax as the mental representation that corresponds to the thought that P&Q, and that mental processes of drawing inferences subsume mental representations in virtue of their syntax, it will have to go out of its way to explain inferential capacities which embrace the one thought but not the other. Such a competence would be, at best, an embarrassment for the theory, and at worst a refutation.

By contrast, since the Connectionist architecture recognizes no combinatorial structure in mental representations, gaps in cognitive competence should proliferate arbitrarily. It's not just that you'd expect to get them from time to time; it's that, on the 'no-structure' story, gaps are the unmarked case. It's the systematic competence that the theory is required to treat as an embarrassment.

But, as a matter of fact, inferential competences are blatantly systematic. So there must be something deeply wrong with Connectionist architecture. What's deeply wrong with Connectionist architecture is this: Because it acknowledges neither syntactic nor semantic structure in mental representations, it perforce treats them not as a generated set but as a list. But lists, qua lists, have no structure; any collection of items is a possible list. And, correspondingly, on Connectionist principles, any collection of (causally connected) representational states is a possible mind. So, as far as Connectionist architecture is concerned, there is nothing to prevent minds that are arbitrarily unsystematic. But that result is preposterous. Cognitive capacities come in structurally related clusters; their systematicity is pervasive. All the evidence suggests that punctate minds can't happen. This argument seemed conclusive against the Connectionism of Hebb, Osgood and Hull twenty or thirty years ago. So far as we can tell, nothing of any importance has happened to change the situation in the meantime.29
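The contrast between a generated set and a mere list can be sketched as follows (our illustration; the wiring table is hypothetical). When inferences are just entries in a table of transitions between atomic states, nothing enforces closure, and punctate gaps are the default:

```python
# If inferences are just a LIST of wired-in transitions between atomic
# states, nothing stops a "punctate mind" with arbitrary gaps.

# Node labels are unanalyzed atoms; the structure suggested by the
# names is invisible to the mechanism.
wired_transitions = {
    "P&Q&R": ["P"],   # this inference happens to be wired in ...
    # "P&Q": ["P"],   # ... while this one happens not to be.
}

def infer(state):
    return wired_transitions.get(state, [])

assert infer("P&Q&R") == ["P"]   # the mind draws this inference
assert infer("P&Q") == []        # but tolerates this gap
```

Since any table at all is a possible set of wirings, the systematic case and the gappy case are on an equal footing; only a structure-sensitive rule rules the gappy case out.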

29Historical footnote: Connectionists are Associationists, but not every Associationist holds that mental representations must be unstructured. Hume didn't, for example. Hume thought that mental representations are rather like pictures, and pictures typically have a compositional semantics: the parts of a picture of a horse are generally pictures of horse parts. On the other hand, allowing a compositional semantics for mental representations doesn't do an Associationist much good so long as he is true to the spirit of his Associationism. The virtue of having mental representations with structure is that it allows for structure sensitive operations to be defined over them; specifically, it allows for the sort of operations that eventuate in productivity and systematicity. Association is not, however, such an operation; all it can do is build an internal model of redundancies in experience by altering the probabilities of transitions among mental states. So far as the problems of productivity and systematicity are concerned, an Associationist who acknowledges structured representations is in the position of having the can but not the opener.
Hume, in fact, cheated: he allowed himself not just Association but also 'Imagination', which he takes to be an 'active' faculty that can produce new concepts out of old parts by a process of analysis and recombination. (The idea of a unicorn is pieced together out of the idea of a horse and the idea of a horn, for example.) Qua Associationist Hume had, of course, no right to active mental faculties. But allowing imagination in gave Hume precisely what modern Connectionists don't have: an answer to the question how mental processes can be productive. The moral is that if you've got structured representations, the temptation to postulate structure sensitive operations and an executive to apply them is practically irresistible.

A final comment to round off this part of the discussion. It's possible to imagine a Connectionist being prepared to admit that while systematicity doesn't follow from, and hence is not explained by, Connectionist architecture, it is nonetheless compatible with that architecture. It is, after all, perfectly possible to follow a policy of building networks that have aRb nodes only if they have bRa nodes ... etc. There is therefore nothing to stop a Connectionist from stipulating, as an independent postulate of his theory of mind, that all biologically instantiated networks are, de facto, systematic.

But this misses a crucial point: It's not enough just to stipulate systematicity; one is also required to specify a mechanism that is able to enforce the stipulation. To put it another way, it's not enough for a Connectionist to agree that all minds are systematic; he must also explain how nature contrives to produce only systematic minds. Presumably there would have to be some sort of mechanism, over and above the ones that Connectionism per se posits, the functioning of which insures the systematicity of biologically instantiated networks; a mechanism such that, in virtue of its operation, every network that has an aRb node also has a bRa node ... and so forth. There are, however, no proposals for such a mechanism. Or, rather, there is just one: The only mechanism that is known to be able to produce pervasive systematicity is Classical architecture. And, as we have seen, Classical architecture is not compatible with Connectionism since it requires internally structured representations.

4. The lure of Connectionism

The current popularity of the Connectionist approach among psychologists and philosophers is puzzling in view of the sorts of problems raised above; problems which were largely responsible for the development of a syntax-based (proof theoretic) notion of computation and a Turing-style, symbol-processing notion of cognitive architecture in the first place. There are, however, a number of apparently plausible arguments, repeatedly encountered in the literature, that stress certain limitations of conventional computers as
models of brains. These may be seen as favoring the Connectionist alternative. We will sketch a number of these before discussing the general problems which they appear to raise.

- Rapidity of cognitive processes in relation to neural speeds: the "hundred step" constraint. It has been observed (e.g., Feldman & Ballard, 1982) that the time required to execute computer instructions is in the order of nanoseconds, whereas neurons take tens of milliseconds to fire. Consequently, in the time it takes people to carry out many of the tasks at which they are fluent (like recognizing a word or a picture, either of which may require considerably less than a second) a serial neurally-instantiated program would only be able to carry out about 100 instructions. Yet such tasks might typically require many thousands, or even millions, of instructions in present-day computers (if they can be done at all). Thus, it is argued, the brain must operate quite differently from computers. In fact, the argument goes, the brain must be organized in a highly parallel manner ("massively parallel" is the preferred term of art).

- Difficulty of achieving large-capacity pattern recognition and content-based retrieval in conventional architectures. Closely related to the issues about time constraints is the fact that humans can store and make use of an enormous amount of information, apparently without effort (Fahlman & Hinton, 1987). One particularly dramatic skill that people exhibit is the ability to recognize patterns from among tens or even hundreds of thousands of alternatives (e.g., word or face recognition). In fact, there is reason to believe that many expert skills may be based on large, fast recognition memories (see Simon & Chase, 1973). If one had to search through one's memory serially, the way conventional computers do, the complexity would overwhelm any machine. Thus, the knowledge that people have must be stored and retrieved differently from the way conventional computers do it.

- Conventional computer models are committed to a different etiology for "rule-governed" behavior and "exceptional" behavior. Classical psychological theories, which are based on conventional computer ideas, typically distinguish between mechanisms that cause regular and divergent behavior by postulating systems of explicit unconscious rules to explain the former, and then attributing departures from these rules to secondary (performance) factors. Since the divergent behaviors occur very frequently, a better strategy would be to try to account for both types of behavior in terms of the same mechanism.


- Lack of progress in dealing with processes that are nonverbal or intuitive. Most of our fluent cognitive skills do not consist in accessing verbal knowledge or carrying out deliberate conscious reasoning (Fahlman & Hinton, 1987; Smolensky, 1988). We appear to know many things that we would have great difficulty in describing verbally, including how to ride a bicycle, what our close friends look like, how to recall the name of the President, etc. Such knowledge, it is argued, must not be stored in linguistic form, but in some other "implicit" form. The fact that conventional computers typically operate in a "linguistic mode", inasmuch as they process information by operating on syntactically structured expressions, may explain why there has been relatively little success in modeling implicit knowledge.

- Acute sensitivity of conventional architectures to damage and noise. Unlike digital circuits, brain circuits must tolerate noise arising from spontaneous neural activity. Moreover, they must tolerate a moderate degree of damage without failing completely. With a few notable exceptions, if a part of the brain is damaged, the degradation in performance is usually not catastrophic but varies more or less gradually with the extent of the damage. This is especially true of memory. Damage to the temporal cortex (usually thought to house memory traces) does not result in selective loss of particular facts and memories. This and similar facts about brain-damaged patients suggest that human memory representations, and perhaps many other cognitive skills as well, are distributed spatially, rather than being neurally localized. This appears to contrast with conventional computers, where hierarchical-style control keeps the crucial decisions highly localized and where memory storage consists of an array of location-addressable registers.

- Storage in conventional architectures is passive. Conventional computers have a passive memory store which is accessed in what has been called a "fetch and execute cycle". This appears to be quite unlike human memory. For example, according to Kosslyn and Hatfield (1984, pp. 1022, 1029):
In computers the memory is static: once an entry is put in a given location, it just sits there until it is operated upon by the CPU .... But consider a very simple experiment: Imagine a letter A over and over again ... then switch to the letter B. In a model employing a Von Neumann architecture the 'fatigue' that inhibited imaging the A would be due to some quirk in the way the CPU executes a given instruction .... Such fatigue should generalize to all objects imaged because the routine responsible for imaging was less effective. But experiments have demonstrated that this is not true: specific objects become more difficult to image, not all objects. This


finding is more easily explained by an analogy to the way invisible ink fades of its own accord ...: with invisible ink, the representation itself is doing something; there is no separate processor working over it ....


- Conventional rule-based systems depict cognition as "all-or-none". But cognitive skills appear to be characterized by various kinds of continuities. For example:

  - Continuous variation in degree of applicability of different principles, or in the degree of relevance of different constraints, "rules", or procedures. There are frequent cases (especially in perception and memory retrieval) in which it appears that a variety of different constraints are brought to bear on a problem simultaneously and the outcome is a combined effect of all the different factors (see, for example, the informal discussion by McClelland, Rumelhart & Hinton, 1986, pp. 3-9). That's why "constraint propagation" techniques are receiving a great deal of attention in artificial intelligence (see Mackworth, 1987).

  - Nondeterminism of human behavior: Cognitive processes are never rigidly determined or precisely replicable. Rather, they appear to have a significant random or stochastic component. Perhaps that's because there is randomness at a microscopic level, caused by irrelevant biochemical or electrical activity, or perhaps even by quantum mechanical events. To model this activity by rigid deterministic rules can only lead to poor predictions because it ignores the fundamentally stochastic nature of the underlying mechanisms. Moreover, deterministic, all-or-none models will be unable to account for the gradual aspect of learning and skill acquisition.

  - Failure to display graceful degradation. When humans are unable to do a task perfectly, they nonetheless do something reasonable. If the particular task does not fit exactly into some known pattern, or if it is only partly understood, a person will not give up or produce nonsensical behavior. By contrast, if a Classical rule-based computer program fails to recognize the task, or fails to match a pattern to its stored representations or rules, it usually will be unable to do anything at all. This suggests that in order to display graceful degradation, we must be able to represent prototypes, match patterns, recognize problems, etc., in various degrees.

- Conventional models are dictated by current technical features of computers and take little or no account of the facts of neuroscience. Classical symbol-processing systems provide no indication of how the kinds of processes that they postulate could be realized by a brain. The fact that this gap between high-level systems and brain architecture is so large might be an indication that these models are on the wrong track.


Whereas the architecture of the mind has evolved under the pressures of natural selection, some of the Classical assumptions about the mind may derive from features that computers have only because they are explicitly designed for the convenience of programmers. Perhaps this includes even the assumption that the description of mental processes at the cognitive level can be divorced from the description of their physical realization. At a minimum, by building our models to take account of what is known about neural structures we may reduce the risk of being misled by metaphors based on contemporary computer architectures.

Replies: Why the usual reasons given for preferring a Connectionist architecture are invalid

It seems to us that, as arguments against Classical cognitive architecture, all these points suffer from one or other of the following two defects.

(1) The objections depend on properties that are not in fact intrinsic to Classical architectures, since there can be perfectly natural Classical models that don't exhibit the objectionable features. (We believe this to be true, for example, of the arguments that Classical rules are explicit and Classical operations are 'all or none'.)

(2) The objections are true of Classical architectures insofar as they are implemented on current computers, but need not be true of such architectures when differently (e.g., neurally) implemented. They are, in other words, directed at the implementation level rather than the cognitive level, as these were distinguished in our earlier discussion. (We believe that this is true, for example, of the arguments about speed, resistance to damage and noise, and the passivity of memory.)

In the remainder of this section we will expand on these two points and relate them to some of the arguments presented above. Following this analysis, we will present what we believe may be the most tenable view of Connectionism; namely, that it is a theory of how (Classical) cognitive systems might be implemented, either in real brains or in some 'abstract neurology'.

Parallel computation and the issue of speed

Consider the argument that cognitive processes must involve large-scale parallel computation. In the form that it takes in typical Connectionist discussions, this issue is irrelevant to the adequacy of Classical cognitive architecture. The "hundred step constraint", for example, is clearly directed at the implementation level. All it rules out is the (absurd) hypothesis that cognitive architectures are implemented in the brain in the same way as they are implemented on electronic computers.

If you ever have doubts about whether a proposal pertains to the implementation level or the symbolic level, a useful heuristic is to ask yourself whether what is being claimed is true of a conventional computer, such as the DEC VAX, at its implementation level. Thus, although most algorithms that run on the VAX are serial,30 at the implementation level such computers are 'massively parallel'; they quite literally involve simultaneous electrical activity throughout almost the entire device. For example, every memory access cycle involves pulsing every bit in a significant fraction of the system's memory registers (since memory access is essentially a destructive read and rewrite process), the system clock regularly pulses and activates most of the central processing unit, and so on.
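The arithmetic behind the "hundred step" estimate, and its sensitivity to which neural event is taken as the implementation primitive, can be sketched. The numbers below are our illustrations in the style of the Feldman & Ballard estimate, not figures from the paper:

```python
# The serial step budget depends entirely on which neural event is
# assumed to be the implementation primitive.
# Integer microseconds are used to keep the arithmetic exact.

task_duration_us = 500_000    # a fluent task, e.g. ~0.5 s to recognize a word
firing_time_us = 10_000       # ~10 ms per neural firing

serial_steps = task_duration_us // firing_time_us
assert serial_steps == 50     # order of 100 steps: the "constraint"

# If some faster dendritic or chemical event were the relevant primitive
# (a hypothetical 0.1 ms), the same arithmetic allows far more steps.
fast_event_us = 100
assert task_duration_us // fast_event_us == 5000
```

The sketch makes the rebuttal concrete: without an independently motivated theory of which neural event implements a primitive operation, the step budget, and hence the "constraint", is unfixed.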

The moral is that the absolute speed of a process is a property par excellence of its implementation. (By contrast, the relative speed with which a system responds to different inputs is often diagnostic of distinct processes; but this has always been a prime empirical basis for deciding among alternative algorithms in information processing psychology.) Thus, the fact that individual neurons require tens of milliseconds to fire can have no bearing on the predicted speed at which an algorithm will run unless there is at least a partial, independently motivated, theory of how the operations of the functional architecture are implemented in neurons. Since, in the case of the brain, it is not even certain that the firing31 of neurons is invariably the relevant implementation property (at least for higher-level cognitive processes like learning and memory), the 100 step "constraint" excludes nothing.

Finally, absolute constraints on the number of serial steps that a mental process can require, or on the time that can be required to execute them, provide weak arguments against Classical architecture, because Classical architecture in no way excludes parallel execution of multiple symbolic processes. Indeed, it seems extremely likely that many Classical symbolic processes

30Even in the case of a conventional computer, whether it should be viewed as executing a serial or a parallel algorithm depends on what 'virtual machine' is being considered in the case in question. After all, a VAX can be used to simulate (i.e., to implement) a virtual machine with a parallel architecture. In that case the relevant algorithm would be a parallel one.
31There are, in fact, a number of different mechanisms of neural interaction (e.g., the "local interactions" described by Rakic, 1975). Moreover, a large number of chemical processes take place at the dendrites, covering a wide range of time scales, so even if dendritic transmission were the only relevant mechanism, we still wouldn't know what time scale to use as our estimate of neural action in general (see, for example, Black, 1986).


are going on in parallel in cognition, and that these processes interact with one another (e.g., they may be involved in some sort of symbolic constraint propagation). Operating on symbols can even involve "massively parallel" organizations; that might indeed imply new architectures, but they are all Classical in our sense, since they all share the Classical conception of computation as symbol processing. (For examples of serious and interesting proposals on organizing Classical processors into large parallel networks, see Hewitt's, 1977, Actor system, Hillis', 1985, "Connection Machine", as well as any of a number of recent commercial multi-processor machines.) The point here is that an argument for a network of parallel computers is not in and of itself either an argument against Classical architecture or an argument for Connectionist architecture.

Resistance to noise and physical damage, and the argument for distributed representation

Some of the other advantages claimed for Connectionist architectures over Classical ones are just as clearly aimed at the implementation level. For example, the "resistance to physical damage" criterion is so obviously a matter of implementation that it should hardly arise in discussions of cognitive-level theories.

It is true that a certain kind of damage resistance appears to be incompatible with localization, and it is also true that representations in PDP's are distributed over groups of units (at least when "coarse coding" is used). But distribution over units achieves damage resistance only if it entails that representations are also neurally distributed.32 However, neural distribution of representations is just as compatible with Classical architectures as it is with Connectionist networks. In the Classical case all you need are memory registers that distribute their contents over physical space. You can get that with fancy storage systems like optical ones, or chemical ones, or even with registers made of Connectionist nets. Come to think of it, we already had it in the old-style "ferrite core" memories.

32Unless the 'units' in a Connectionist network really are assumed to have different spatially-focused loci in the brain, talk about distributed representation is likely to be extremely misleading. In particular, if units are merely functionally individuated, any amount of distribution of functional entities is compatible with any amount of spatial compactness of their neural representations. But it is not clear that units do in fact correspond to any anatomically identifiable locations in the brain. In the light of the way Connectionist mechanisms are designed, it may be appropriate to view units and links as functional/mathematical entities (what psychologists would call "hypothetical constructs") whose neurological interpretation remains entirely open. (This is in fact the view that some Connectionists take; see Smolensky, 1988.) The point is that distribution over mathematical constructs does not buy you damage resistance; only neural distribution does.

The physical requirements of a Classical symbol-processing system are easily misunderstood. (Confounding of physical and functional properties is widespread in psychological theorizing in general; for a discussion of this confusion in relation to metrical properties in models of mental imagery, see Pylyshyn, 1981.) For example, conventional architecture requires that there be distinct symbolic expressions for each state of affairs that it can represent. Since such expressions often have a structure consisting of concatenated parts, the adjacency relation must be instantiated by some physical relation when the architecture is implemented (see the discussion in footnote 9). However, since the relation to be physically realized is functional adjacency, there is no necessity that physical instantiations of adjacent symbols be spatially adjacent. Similarly, although complex expressions are made out of atomic elements, and the distinction between atomic and complex symbols must somehow be physically instantiated, there is no necessity that a token of an atomic symbol be assigned a smaller region in space than a token of a complex symbol; even a token of a complex symbol of which it is a constituent. In Classical architectures, as in Connectionist networks, functional elements can be physically distributed or localized to any extent whatever. In a VAX (to use our heuristic again) pairs of symbols may certainly be functionally adjacent, but the symbol tokens are nonetheless spatially spread through many locations in physical memory.

In short, the fact that a property (like the position of a symbol within an expression) is functionally local has no implications one way or the other for damage-resistance or noise tolerance unless the functional-neighborhood metric corresponds to some appropriate physical dimension. When that is the case, we may be able to predict adverse consequences that varying the physical property has on objects localized in functional space (e.g., varying the voltage or line frequency might damage the left part of an expression). But, of course, the situation is exactly the same for Connectionist systems: even when they are resistant to spatially-local damage, they may not be resistant to damage that is local along some other physical dimensions. Since spatially-local damage is particularly frequent in real world traumas, this may have important practical consequences. But so long as our knowledge of how cognitive processes might be mapped onto brain tissue remains very nearly nonexistent, its message for cognitive science remains moot.
Soft constraints, continuous magnitudes, stochastic mechanisms, and active symbols

The notion that 'soft constraints', which can vary continuously in degree (as activation does), are incompatible with Classical rule-based symbolic systems is another example of the failure to keep the psychological (or symbol-processing) and the implementation levels separate. One can have a Classical rule system in which the decision concerning which rule will fire resides in the functional architecture and depends on continuously varying magnitudes. Indeed, this is typically how it is done in practical 'expert systems' which, for example, use a Bayesian mechanism in their production-system rule-interpreter. The soft or stochastic nature of rule-based processes arises from the interaction of deterministic rules with real-valued properties of the implementation, or with noisy inputs or noisy information transmission.

It should also be noted that rule applications need not issue in 'all or none' behaviors, since several rules may be activated at once and can have interactive effects on the outcome. Or, alternatively, each of the activated rules can generate independent effects, which might get sorted out later, depending, say, on which of the parallel streams reaches a goal first. An important, though sometimes neglected, point about such aggregate properties of overt behavior as continuity, 'fuzziness', randomness, etc., is that they need not arise from underlying mechanisms that are themselves fuzzy, continuous or random. It is not only possible in principle, but often quite reasonable in practice, to assume that apparently variable or nondeterministic behavior arises from the interaction of multiple deterministic sources.

A similar point can be made about the issue of graceful degradation. Classical architecture does not require that when the conditions for applying the available rules aren't precisely met, the process should simply fail to do anything at all. As noted above, rules could be activated in some measure depending upon how close their conditions are to holding. Exactly what happens in these cases may depend on how the rule-system is implemented. On the other hand, it could be that the failure to display graceful degradation really is an intrinsic limit of the current class of models or even of current approaches to designing intelligent systems. It seems clear that the psychological models now available are inadequate over a broad spectrum of measures, so their problems with graceful degradation may be a special case of their general unintelligence: They may simply not be smart enough to know what to do when a limited stock of methods fails to apply. But this needn't be a principled limitation of Classical architectures: There is, to our knowledge, no reason to believe that something like Newell's (1969) hierarchy of weak methods, or Laird, Rosenbloom and Newell's (1986) 'universal subgoaling', is in principle incapable of dealing with the problem of graceful degradation. (Nor, to our knowledge, has any argument yet been offered that Connectionist architectures are in principle capable of dealing with it. In fact, current Connectionist models are every bit as graceless in their modes of failure as ones based on Classical architectures. For example, contrary to some claims,

models such as that of McClelland and Kawamoto, 1986, fail quite unnaturally when given incomplete information.)

In short, the Classical theorist can view stochastic properties of behavior as emerging from interactions between the model and the intrinsic properties of the physical medium in which it is realized. It is essential to remember that, from the Classical point of view, overt behavior is par excellence an interaction effect, and symbol manipulations are supposed to be only one of the interacting causes.

These same considerations apply to Kosslyn and Hatfield's remarks (quoted earlier) about the commitment of Classical models to 'passive' versus 'active' representations. It is true, as Kosslyn and Hatfield say, that the representations that Von Neumann machines manipulate 'don't do anything' until a CPU operates upon them (they don't decay, for example). But, even on the absurd assumption that the mind has exactly the architecture of some contemporary (Von Neumann) computer, it is obvious that its behavior, and hence the behavior of an organism, is determined not just by the logical machine that the mind instantiates, but also by the protoplasmic machine in which the logic is realized. Instantiated representations are therefore bound to be active, even according to Classical models; the question is whether the kind of activity they exhibit should be accounted for by the cognitive model or by the theory of its implementation. This question is empirical and must not be begged on behalf of the Connectionist view. (As it is, for example, in such passages as: "The brain itself does not manipulate symbols; the brain is the medium in which the symbols are floating and in which they trigger each other. There is no central manipulator, no central program. There is simply a vast collection of 'teams', patterns of neural firings that, like teams of ants, trigger other patterns of neural firings ... We feel those symbols churning within ourselves in somewhat the same way we feel our stomach churning." (Hofstadter, 1983, p. 279). This appears to be a serious case of Formicidae in machina: ants in the stomach of the ghost in the machine.)

Explicitness of rules

According to McClelland, Feldman, Adelson, Bower, and McDermott (1986, p. 6), "... Connectionist models are leading to a reconceptualization of key psychological issues, such as the nature of the representation of knowledge. ... One traditional approach to such issues treats knowledge as a body of rules that are consulted by processing mechanisms in the course of processing; in Connectionist models, such knowledge is represented, often in widely distributed form, in the connections among the processing units." As we remarked in the Introduction, we think that the claim that most


What does need to be explicit in a Classical machine is not its program but the symbols that it writes on its tapes (or stores in its registers). These, however, correspond not to the machine's rules of state transition but to its data structures. Data structures are the objects that the machine transforms, not the rules of transformation. In the case of programs that parse natural language, for example, Classical architecture requires the explicit representation of the structural descriptions of sentences, but is entirely neutral on the explicitness of grammars, contrary to what many Connectionists believe.

One of the important inventions in the history of computers, the stored-program computer, makes it possible for programs to take on the role of data structures. But nothing in the architecture requires that they always do so. Similarly, Turing demonstrated that there exists an abstract machine (the so-called Universal Turing Machine) which can simulate the behavior of any target (Turing) machine. A Universal machine is "rule-explicit" about the machine it is simulating (in the sense that it has an explicit representation of that machine which is sufficient to specify its behavior uniquely). Yet the target machine can perfectly well be "rule-implicit" with respect to the rules that govern its behavior.

So, then, you can't attack Classical theories of cognitive architecture by showing that a cognitive process is rule-implicit; Classical architecture permits rule-explicit processes but does not require them. However, you can attack Connectionist architectures by showing that a cognitive process is rule-explicit since, by definition, Connectionist architecture precludes the sorts of logico-syntactic capacities that are required to encode rules and the sorts of executive mechanisms that are required to apply them.34

If, therefore, there should prove to be persuasive arguments for rule-explicit cognitive processes, that would be very embarrassing for Connectionists. A natural place to look for such arguments would be in the theory of the acquisition of cognitive competences. For example, much traditional work in linguistics (see Prince & Pinker, 1988) and all recent work in mathematical learning theory (see Osherson, Stob, & Weinstein, 1984) assumes that the characteristic output of a cognitive acquisition device is a recursive rule system (a grammar, in the linguistic case). Suppose such theories prove to be well-founded; then that would be incompatible with the assumption that the cognitive architecture of the capacities acquired is Connectionist.

34 Of course, it is possible to simulate a "rule explicit process" in a Connectionist network by first implementing a Classical architecture in the network. The slippage between networks as architectures and as implementations is ubiquitous in Connectionist writings, as we remarked above.
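The rule-explicit/rule-implicit contrast can be made concrete with a toy sketch (ours, not the authors'; the pluralization rules are merely illustrative). Both machines compute the same function, but only the first contains a data structure that explicitly represents its rules; in the second, the same regularities are implicit in the control flow.

```python
# Toy contrast (illustrative only): the same function computed rule-explicitly
# (an interpreter consults rules stored as data) and rule-implicitly (the
# regularities are baked into the code, with no symbolic representation of them).

# Rule-EXPLICIT: rules are data structures scanned by an interpreter.
RULES = [
    (lambda w: w.endswith(("s", "x", "z")), lambda w: w + "es"),
    (lambda w: w.endswith("y"),             lambda w: w[:-1] + "ies"),
    (lambda w: True,                        lambda w: w + "s"),  # default rule
]

def pluralize_explicit(word):
    for condition, action in RULES:  # the interpreter consults the rule list
        if condition(word):
            return action(word)

# Rule-IMPLICIT: identical behavior, but no object in the system stands for a rule.
def pluralize_implicit(word):
    if word.endswith(("s", "x", "z")):
        return word + "es"
    if word.endswith("y"):
        return word[:-1] + "ies"
    return word + "s"

assert pluralize_explicit("fox") == pluralize_implicit("fox") == "foxes"
assert pluralize_explicit("fly") == pluralize_implicit("fly") == "flies"
```

Only the first machine could inspect, report, or modify its rules, which is why the distinction matters even when input-output behavior is identical.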


On "Brain style" modeling


The relation of Connectionist models to neuroscience is open to many interpretations. On the one hand, people like Ballard (1986) and Sejnowski (1981) are explicitly attempting to build models based on properties of neurons and neural organizations, even though the neuronal units in question are idealized (some would say more than a little idealized: see, for example, the commentaries following the Ballard, 1986, paper). On the other hand, Smolensky (1988) views Connectionist units as mathematical objects which can be given an interpretation in either neural or psychological terms. Most Connectionists find themselves somewhere in between, frequently referring to their approach as "brain-style" theorizing.35

Understanding both psychological principles and the way that they are neurophysiologically implemented is much better (and, indeed, more empirically secure) than only understanding one or the other. That is not at issue. The question is whether there is anything to be gained by designing "brain style" models that are uncommitted about how the models map onto brains.

Presumably the point of "brain style" modeling is that theories of cognitive processing should be influenced by the facts of biology (especially neuroscience). The biological facts that influence Connectionist models appear to include the following: neuronal connections are important to the patterns of brain activity; the memory "engram" does not appear to be spatially local; to a first approximation, neurons appear to be threshold elements which sum the activity arriving at their dendrites; many of the neurons in the cortex have multidimensional "receptive fields" that are sensitive to a narrow range of values of a number of parameters; the tendency for activity at a synapse to cause a neuron to "fire" is modulated by the frequency and recency of past firings.
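The "threshold element" idealization in the list above can be sketched in a few lines (our illustration; the weights and threshold are arbitrary): a unit sums the weighted activity arriving on its input lines and fires just in case the sum exceeds its threshold.

```python
# Minimal sketch of the threshold-element idealization of a neuron:
# fire (output 1) iff the weighted sum of the inputs exceeds the threshold.
# The weights and threshold below are made up for illustration.

def threshold_unit(inputs, weights, threshold):
    total = sum(i * w for i, w in zip(inputs, weights))
    return 1 if total > threshold else 0

weights, theta = [0.6, 0.6, -1.0], 0.5  # two excitatory lines, one inhibitory

assert threshold_unit([1, 0, 0], weights, theta) == 1  # one strong excitatory input fires it
assert threshold_unit([1, 1, 1], weights, theta) == 0  # inhibition vetoes firing
```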
Let us suppose that these and similar claims are both true and relevant to the way the brain functions, an assumption that is by no means unproblematic. The question we might then ask is: What follows from such facts that is relevant to inferring the nature of the cognitive architecture? The unavoidable answer appears to be, very little. That's not an a priori claim. The degree of relationship between facts at different levels of organization of a system is an empirical matter. However, there is reason to be skeptical about whether the sorts of properties listed above are reflected in any more-or-less direct


way in the structure of the system that carries out reasoning. Consider, for example, one of the most salient properties of neural systems: they are networks which transmit activation, culminating in state changes of some quasi-threshold elements. Surely it is not warranted to conclude that reasoning consists of the spread of excitation among representations, or even among semantic components of representations. After all, a VAX is also correctly characterized as consisting of a network over which excitation is transmitted, culminating in state changes of quasi-threshold elements. Yet at the level at which it processes representations, a VAX is literally organized as a Von Neumann architecture. The point is that the structure of "higher levels" of a system is rarely isomorphic, or even similar, to the structure of "lower levels" of a system. No one expects the theory of protons to look very much like the theory of rocks and rivers, even though, to be sure, it is protons and the like that rocks and rivers are 'implemented in'. Lucretius got into trouble precisely by assuming that there must be a simple correspondence between the structure of macrolevel and microlevel theories. He thought, for example, that hooks and eyes hold the atoms together. He was wrong, as it turns out.

There are, no doubt, cases where special empirical considerations suggest detailed structure/function correspondences or other analogies between different levels of a system's organization. For example, the input to the most peripheral stages of vision and motor control must be specified in terms of anatomically projected patterns (of light, in one case, and of muscular activity in the other); and independence of structure and function is perhaps less likely in a system whose input or output must be specified somatotopically. Thus, at these stages it is reasonable to expect an anatomically distributed structure to be reflected by a distributed functional architecture. When, however, the cognitive process under investigation is as abstract as reasoning, there is simply no reason to expect isomorphisms between structure and function; as, indeed, the computer case proves.
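The computer case can be made concrete (our sketch, not the authors'): a NAND gate is itself a quasi-threshold element, and a network of NAND gates implements a piece of classical symbol manipulation, binary addition. The "lower level" is a net of threshold units; the "higher level" is arithmetic, and nothing about the former dictates the shape of the latter.

```python
# Our illustration: a NAND gate realized as a threshold unit (weights -1, -1,
# threshold -1.5), and a classical half-adder built from nothing but NANDs.

def nand(a, b):
    # fires unless both inputs are active: -a - b > -1.5
    return 1 if (-1 * a) + (-1 * b) > -1.5 else 0

def half_adder(a, b):
    n1 = nand(a, b)
    s = nand(nand(a, n1), nand(b, n1))  # sum bit:   a XOR b
    c = nand(n1, n1)                    # carry bit: a AND b
    return s, c

assert half_adder(0, 0) == (0, 0)
assert half_adder(1, 0) == (1, 0)
assert half_adder(1, 1) == (0, 1)
```

A description of this system in terms of activation passing among threshold units is true but tells you almost nothing about the arithmetic it is computing.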

Perhaps this is all too obvious to be worth saying. Yet it seems that the commitment to "brain style" modeling leads to many of the characteristic Connectionist claims about psychology, and that it does so via the implicit (and unwarranted) assumption that there ought to be similarity of structure among the different levels of organization of a computational system. This is distressing since much of the psychology that this search for structural analogies has produced is strikingly recidivist. Thus the idea that the brain is a neural network motivates the revival of a largely discredited Associationist psychology. Similarly, the idea that brain activity is anatomically distributed leads to functionally distributed representations for concepts, which in turn leads to the postulation of microfeatures; yet the inadequacies of feature-based theories of concepts are well-known and, to our knowledge, micro-feature theory has done nothing to address them (see Bolinger, 1965; J.D. Fodor, 1977). Or again, the idea that the strength of a connection between neurons is affected by the frequency of their co-activation gets projected onto the cognitive level. The consequence is a resurgence of statistical models of learning that had been widely acknowledged (both in Psychology and in AI) to be extremely limited in their applicability (e.g., Minsky & Papert, 1972, Chomsky, 1957).

So although, in principle, knowledge of how the brain works could direct cognitive modeling in a beneficial manner, in fact a research strategy has to be judged by its fruits. The main fruit of "brain style" modeling has been to revive psychological theories whose limitations had previously been pretty widely appreciated. It has done so largely because assumptions about the structure of the brain have been adopted in an all-too-direct manner as hypotheses about cognitive architecture; it's an instructive paradox that the current attempt to be thoroughly modern and take the brain seriously should lead to a psychology not readily distinguishable from the worst of Hume and Berkeley. The moral seems to be that one should be deeply suspicious of the heroic sort of brain modeling that purports to address the problems of cognition. We sympathize with the craving for biologically respectable theories that many psychologists seem to feel. But, given a choice, truth is more important than respectability.

Concluding comments: Connectionism as a theory of implementation

A recurring theme in the previous discussion is that many of the arguments for Connectionism are best construed as claiming that cognitive architecture is implemented in a certain kind of network (of abstract "units"). Understood this way, these arguments are neutral on the question of what the cognitive architecture is.36 In these concluding remarks we'll briefly consider Connectionism from this point of view.

36 Rumelhart and McClelland maintain that PDP models are more than just theories of implementation because (1) they add to our understanding of the problem (p. 116), and (2) studying PDPs can lead to the postulation of different macrolevel processes (p. 126). Both points deal with the heuristic value of "brain style" theorizing. Hence, though correct in principle, they are irrelevant to the crucial question whether Connectionism is best understood as an attempt to model neural implementation, or whether it really does promise a new theory of the mind incompatible with classical information-processing approaches. It is an empirical question whether the heuristic value of this approach will turn out to be positive or negative; we have already commented on our view of the recent history of this attempt.

Almost every student who enters a course on computational or information-processing models of cognition must be disabused of a very general misunderstanding concerning the role of the physical computer in such models. Students are almost always skeptical about "the computer as a model of cognition" on such grounds as that "computers don't forget or make mistakes", "computers function by exhaustive search", "computers are too logical and unmotivated", "computers can't learn by themselves; they can only do what they're told", or "computers are too fast (or too slow)", or "computers never get tired or bored", and so on. If we add to this list such relatively more sophisticated complaints as that "computers don't exhibit graceful degradation" or "computers are too sensitive to physical damage", this list will begin to look much like the arguments put forward by Connectionists.

The answer to all these complaints has always been that the implementation, and all properties associated with the particular realization of the algorithm that the theorist happens to use in a particular case, is irrelevant to the psychological theory; only the algorithm and the representations on which it operates are intended as a psychological hypothesis. Students are taught the notion of a "virtual machine" and shown that some virtual machines can learn, forget, get bored, make mistakes and whatever else one likes, providing one has a theory of the origins of each of the empirical phenomena in question.

Given this principled distinction between a model and its implementation, a theorist who is impressed by the virtues of Connectionism has the option of proposing PDP's as theories of implementation. But then, far from providing a revolutionary new basis for cognitive science, these models are in principle neutral about the nature of cognitive processes.
In fact, they might be viewed as advancing the goals of Classical information processing psychology by attempting to explain how the brain (or perhaps some idealized brain-like network) might realize the types of processes that conventional cognitive science has hypothesized.

Connectionists do sometimes explicitly take their models to be theories of implementation. Ballard (1986) even refers to Connectionism as "the implementational approach". Touretzky (1986) clearly views his BoltzCONS model this way; he uses Connectionist techniques to implement conventional symbol processing mechanisms such as pushdown stacks and other LISP facilities.37

37 Even in this case, where the model is specifically designed to implement Lisp-like features, some of the rhetoric fails to keep the implementation-algorithm levels distinct. This leads to talk about "emergent properties" and to the claim that, even when they implement Lisp-like mechanisms, Connectionist systems "can compute things in ways in which Turing machines and von Neumann computers can't" (Touretzky, 1986). Such a claim suggests that Touretzky distinguishes different "ways of computing" not in terms of different algorithms, but in terms of different ways of implementing the same algorithm. While nobody has proprietary rights to terms like "ways of computing", this is a misleading way of putting it; it means that a DEC machine has a "different way of computing" from an IBM machine even when executing the identical program.


Rumelhart and McClelland (1986a, p. 117), who are convinced that Connectionism signals a radical departure from the conventional symbol processing approach, nonetheless refer to "PDP implementations" of various mechanisms such as attention. Later in the same essay, they make their position explicit: Unlike "reductionists," they believe "... that new and useful concepts emerge at different levels of organization". Although they then defend the claim that one should understand the higher levels "... through the study of the interactions among lower level units", the basic idea that there are autonomous levels seems implicit everywhere in the essay.

But once one admits that there really are cognitive-level principles distinct from the (putative) architectural principles that Connectionism articulates, there seems to be little left to argue about. Clearly it is pointless to ask whether one should or shouldn't do cognitive science by studying "the interaction of lower levels" as opposed to studying processes at the cognitive level, since we surely have to do both. Some scientists study geological principles, others study "the interaction of lower level units" like molecules. But since the fact that there are genuine, autonomously-stateable principles of geology is never in dispute, people who build molecular level models do not claim to have invented a "new theory of geology" that will dispense with all that old-fashioned "folk geological" talk about rocks, rivers and mountains!

We have, in short, no objection at all to networks as potential implementation models, nor do we suppose that any of the arguments we've given are incompatible with this proposal. The trouble is, however, that if Connectionists do want their models to be construed this way, then they will have to radically alter their practice. For it seems utterly clear that most of the Connectionist models that have actually been proposed must be construed as theories of cognition, not as theories of implementation. This follows from the fact that it is intrinsic to these theories to ascribe representational content to the units (and/or aggregates) that they postulate. And, as we remarked at the beginning, a theory of the relations among representational states is ipso facto a theory at the level of cognition, not at the level of implementation. It has been the burden of our argument that when construed as a cognitive theory, rather than as an implementation theory, Connectionism appears to have fatal limitations. The problem with Connectionist models is that all the reasons for thinking that they might be true are reasons for thinking that they couldn't be psychology.


Conclusion

What, in light of all of this, are the options for the further development of Connectionist theories? As far as we can see, there are four routes that they could follow:

(1) Hold out for unstructured mental representations as against the Classical view that mental representations have a combinatorial syntax and semantics. Productivity and systematicity arguments make this option appear not attractive.

(2) Abandon network architecture to the extent of opting for structured mental representations but continue to insist upon an Associationistic account of the nature of mental processes. This is, in effect, a retreat to Hume's picture of the mind (see footnote 29), and it has a problem that we don't believe can be solved: Although mental representations are, on the present assumption, structured objects, association is not a structure sensitive relation. The problem is thus how to reconstruct the semantical coherence of thought without postulating psychological processes that are sensitive to the structure of mental representations. (Equivalently, in more modern terms, it's how to get the causal relations among mental representations to mirror their semantical relations without assuming a proof-theoretic treatment of inference and, more generally, a treatment of semantic coherence that is syntactically expressed, in the spirit of proof-theory.) This is the problem on which traditional Associationism foundered, and the prospects for solving it now strike us as not appreciably better than they were a couple of hundred years ago. To put it a little differently: if you need structure in mental representations anyway to account for the productivity and systematicity of minds, why not postulate mental processes that are structure sensitive to account for the coherence of mental processes? Why not be a Classicist, in short.

In any event, notice that the present option gives the Classical picture a lot of what it wants: viz., the identification of semantic states with relations to structured arrays of symbols and the identification of mental processes with transformations of such arrays. Notice too that, as things now stand, this proposal is Utopian since there are no serious proposals for incorporating syntactic structure in Connectionist architectures.

(3) Treat Connectionism as an implementation theory. We have no principled objection to this view (though there are, as Connectionists are discovering, technical reasons why networks are often an awkward way to implement Classical machines). This option would entail rewriting quite a lot of the polemical material in the Connectionist literature, as well as redescribing


what the networks are doing as operating on symbol structures, rather than spreading activation among semantically interpreted nodes.

Moreover, this revision of policy is sure to lose the movement a lot of fans. As we have pointed out, many people have been attracted to the Connectionist approach because of its promise to (a) do away with the symbol level of analysis, and (b) elevate neuroscience to the position of providing evidence that bears directly on issues of cognition. If Connectionism is considered simply as a theory of how cognition is neurally implemented, it may constrain cognitive models no more than theories in biophysics, biochemistry, or, for that matter, quantum mechanics do. All of these theories are also concerned with processes that implement cognition, and all of them are likely to postulate structures that are quite different from cognitive architecture. The point is that 'implements' is transitive, and it goes all the way down.

(4) Give up on the idea that networks offer (to quote Rumelhart & McClelland, 1986a, p. 110) "... a reasonable basis for modeling cognitive processes in general". It could still be held that networks sustain some cognitive processes. A good bet might be that they sustain such processes as can be analyzed as the drawing of statistical inferences; as far as we can tell, what network models really are is just analog machines for computing such inferences. Since we doubt that much of cognitive processing does consist of analyzing statistical relations, this would be quite a modest estimate of the prospects for network theory compared to what the Connectionists themselves have been offering.

This is, for example, one way of understanding what's going on in the argument between Rumelhart and McClelland (1986b) and Prince and Pinker (1988), though neither paper puts it in quite these terms. In effect, Rumelhart and McClelland postulate a mechanism which, given a corpus of pairings that a 'teacher' provides between verb stems and their past tense forms, computes the statistical correlation between the phonological form of the stem ending and the phonological form of the past tense morphology. (The correlations so computed are represented in the weights of the network.) The machine then inflects a new verb analogically, by choosing the past tense ending that was most highly correlated with the phonological form of its stem ending. By contrast, Prince and Pinker argue (in effect) that such a mechanism provides neither a close fit to the ontogenetic data (the sequence of competences the child exhibits in the course of training) nor a plausible account of the asymptote, the adult competence on which the learning converges; so it seems to us that estimating statistical correlations cannot be all that is going on in the acquisition of past tense morphology.
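The "estimating correlations" reading of the past tense discussion above can be caricatured in a few lines (this is our deliberately crude sketch, not the actual Rumelhart-McClelland network, which operated over distributed phonological feature representations): tally how often each stem ending co-occurs with each past tense ending in teacher-supplied pairs, then inflect a new verb by the ending most strongly associated with its stem ending.

```python
# Deliberately crude caricature (ours) of past-tense learning as correlation
# estimation: count stem-ending / past-tense-ending co-occurrences in a
# teacher-supplied corpus, then inflect new verbs by the strongest association.
from collections import Counter, defaultdict

def train(pairs):
    table = defaultdict(Counter)
    for stem, past in pairs:
        suffix = past[len(stem):] if past.startswith(stem) else past
        table[stem[-1]][suffix] += 1  # condition (crudely) on the final segment
    return table

def inflect(table, stem):
    suffix, _count = table[stem[-1]].most_common(1)[0]
    return stem + suffix

corpus = [("walk", "walked"), ("talk", "talked"), ("kiss", "kissed"),
          ("play", "played"), ("stay", "stayed")]
table = train(corpus)

assert inflect(table, "pick") == "picked"  # generalizes by stem-ending analogy
assert inflect(table, "pray") == "prayed"
```

Prince and Pinker's point, on this reading, is that no amount of such tabulation fits either the child's developmental course or the adult competence.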
There is an alternative to the Empiricist idea that all learning consists of a kind of statistical inference, realized by adjusting parameters; it's the Rationalist idea that some learning is a kind of theory construction, effected by framing hypotheses and evaluating them against evidence. We seem to remember having been through this argument before. We find ourselves with a gnawing sense of déjà vu.

References
Arbib, M. (1975). Artificial intelligence and brain theory: Unities and diversities. Biomedical Engineering, 3, 238-274.
Ballard, D.H. (1986). Cortical connections and parallel processing: Structure and function. The Behavioral and Brain Sciences, 9, 67-120.
Ballard, D.H. (1987). Parallel logical inference and energy minimization. Report TR142, Computer Science Department, University of Rochester.
Black, I.B. (1986). Molecular memory mechanisms. In Lynch, G. (Ed.), Synapses, circuits, and the beginnings of memory. Cambridge, MA: M.I.T. Press, A Bradford Book.
Bolinger, D. (1965). The atomization of meaning. Language, 41, 555-573.
Broadbent, D. (1985). A question of levels: Comments on McClelland and Rumelhart. Journal of Experimental Psychology: General, 114, 189-192.
Carroll, L. (1956). What the tortoise said to Achilles and other riddles. In Newman, J.R. (Ed.), The world of mathematics: Volume Four. New York: Simon & Schuster.
Chomsky, N. (1957). Syntactic structures. The Hague: Mouton.
Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: M.I.T. Press.
Chomsky, N. (1968). Language and mind. New York: Harcourt, Brace and World.
Churchland, P.M. (1981). Eliminative materialism and the propositional attitudes. Journal of Philosophy, 78, 67-90.
Churchland, P.S. (1986). Neurophilosophy. Cambridge, MA: M.I.T. Press.
Cummins, R. (1983). The nature of psychological explanation. Cambridge, MA: M.I.T. Press.
Dennett, D. (1986). The logical geography of computational approaches: A view from the east pole. In Brand, M. & Harnish, M. (Eds.), The representation of knowledge. Tucson, AZ: The University of Arizona Press.
Dreyfus, H., & Dreyfus, S. (in press). Making a mind vs. modelling the brain: A.I. back at a branchpoint. Daedalus.
Fahlman, S.E., & Hinton, G.E. (1987). Connectionist architectures for artificial intelligence. Computer, 20, 100-109.
Feldman, J.A. (1986). Neural representation of conceptual knowledge. Report TR189, Department of Computer Science, University of Rochester.
Feldman, J.A., & Ballard, D.H. (1982). Connectionist models and their properties. Cognitive Science, 6, 205-254.
Fodor, J. (1976). The language of thought. Sussex: Harvester Press. (Harvard University Press paperback.)
Fodor, J.D. (1977). Semantics: Theories of meaning in generative grammar. New York: Thomas Y. Crowell.
Fodor, J. (1987). Psychosemantics. Cambridge, MA: M.I.T. Press.
Frohn, H., Geiger, H., & Singer, W. (1987). A self-organizing neural network sharing features of the mammalian visual system. Biological Cybernetics, 55, 333-343.


J.A. Fodor and Z.W. Pylyshyn

Geach, P. (1957). Mental acts. London: Routledge and Kegan Paul.
Hewitt, C. (1977). Viewing control structures as patterns of passing messages. The Artificial Intelligence Journal, 8, 323-364.
Hillis, D. (1985). The connection machine. Cambridge, MA: M.I.T. Press.
Hinton, G. (1987). Representing part-whole hierarchies in connectionist networks. Unpublished manuscript.
Hinton, G.E., McClelland, J.L., & Rumelhart, D.E. (1986). Distributed representations. In Rumelhart, D.E., McClelland, J.L., and the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations. Cambridge, MA: M.I.T. Press/Bradford Books.
Hofstadter, D.R. (1983). Artificial intelligence: Sub-cognition as computation. In F. Machlup & U. Mansfield (Eds.), The study of information: Interdisciplinary messages. New York: John Wiley & Sons.
Kant, I. (1929). The critique of pure reason. New York: St. Martins Press.
Katz, J.J. (1972). Semantic theory. New York: Harper & Row.
Katz, J.J., & Fodor, J.A. (1963). The structure of a semantic theory. Language, 39, 170-210.
Katz, J., & Postal, P. (1964). An integrated theory of linguistic descriptions. Cambridge, MA: M.I.T. Press.
Kosslyn, S.M., & Hatfield, G. (1984). Representation without symbol systems. Social Research, 51, 1019-1054.
Laird, J., Rosenbloom, P., & Newell, A. (1986). Universal subgoaling and chunking: The automatic generation and learning of goal hierarchies. Boston, MA: Kluwer Academic Publishers.
Lakoff, G. (1986). Connectionism and cognitive linguistics. Seminar delivered at Princeton University, December 8, 1986.
Mackworth, A. (1987). Constraint propagation. In Shapiro, S.C. (Ed.), The encyclopedia of artificial intelligence, Volume 1. New York: John Wiley & Sons.
McClelland, J.L., Feldman, J., Adelson, B., Bower, G., & McDermott, D. (1986). Connectionist models and cognitive science: Goals, directions and implications. Report to the National Science Foundation, June, 1986.
McClelland, J.L., & Kawamoto, A.H. (1986). Mechanisms of sentence processing: Assigning roles to constituents. In McClelland, J.L., Rumelhart, D.E., and the PDP Research Group (Eds.), Parallel distributed processing: Volume 2. Cambridge, MA: M.I.T. Press, Bradford Books.
McClelland, J.L., Rumelhart, D.E., & Hinton, G.E. (1986). The appeal of parallel distributed processing. In Rumelhart, D.E., McClelland, J.L., and the PDP Research Group (Eds.), Parallel distributed processing: Volume 1. Cambridge, MA: M.I.T. Press/Bradford Books.
Minsky, M., & Papert, S. (1972). Artificial Intelligence Progress Report. AI Memo 252, Massachusetts Institute of Technology.
Newell, A. (1969). Heuristic programming: Ill-structured problems. In Aronofsky, J. (Ed.), Progress in operations research, III. New York: John Wiley & Sons.
Newell, A. (1980). Physical symbol systems. Cognitive Science, 4, 135-183.
Newell, A. (1982). The knowledge level. Artificial Intelligence, 18, 87-127.
Osherson, D., Stob, M., & Weinstein, S. (1984). Learning theory and natural language. Cognition, 17, 1-28.
Pinker, S. (1984). Language learnability and language development. Cambridge, MA: Harvard University Press.
Prince, A., & Pinker, S. (1988). On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28, this issue.
Pylyshyn, Z.W. (1980). Cognition and computation: Issues in the foundations of cognitive science. Behavioral and Brain Sciences, 3:1, 154-169.
Pylyshyn, Z.W. (1981). The imagery debate: Analogue media versus tacit knowledge. Psychological Review, 88, 16-45.
Pylyshyn, Z.W. (1984a). Computation and cognition: Toward a foundation for cognitive science. Cambridge, MA: M.I.T. Press, A Bradford Book.
Pylyshyn, Z.W. (1984b). Why computation requires symbols. Proceedings of the Sixth Annual Conference of the Cognitive Science Society, Boulder, Colorado, August, 1984. Hillsdale, NJ: Erlbaum.

Connectionism and cognitive architecture


Rakic, P. (1975). Local circuit neurons. Neurosciences Research Program Bulletin, 13, 299-313.
Rumelhart, D.E. (1984). The emergence of cognitive phenomena from sub-symbolic processes. In Proceedings of the Sixth Annual Conference of the Cognitive Science Society, Boulder, Colorado, August, 1984. Hillsdale, NJ: Erlbaum.
Rumelhart, D.E., & McClelland, J.L. (1985). Levels indeed! A response to Broadbent. Journal of Experimental Psychology: General, 114, 193-197.
Rumelhart, D.E., & McClelland, J.L. (1986a). PDP models and general issues in cognitive science. In Rumelhart, McClelland and the PDP Research Group (Eds.), Parallel distributed processing, Volume 1. Cambridge, MA: M.I.T. Press, A Bradford Book.
Rumelhart, D.E., & McClelland, J.L. (1986b). On learning the past tenses of English verbs. In Rumelhart, McClelland and the PDP Research Group (Eds.), Parallel distributed processing, Volume 2. Cambridge, MA: M.I.T. Press, A Bradford Book.
Schneider, W. (1987). Connectionism: Is it a paradigm shift for psychology? Behavior Research Methods, Instruments, & Computers, 19, 73-83.
Sejnowski, T.J. (1981). Skeleton filters in the brain. In Hinton, G.E., & Anderson, J.A. (Eds.), Parallel models of associative memory. Hillsdale, NJ: Erlbaum.
Simon, H.A., & Chase, W.G. (1973). Skill in chess. American Scientist, 61, 394-403.
Smolensky, P. (1988). On the proper treatment of connectionism. The Behavioral and Brain Sciences, 11, forthcoming.
Stabler, E. (1983). How are grammars represented? Behavioral and Brain Sciences, 6, 391-402.
Stich, S. (1983). From folk psychology to cognitive science. Cambridge, MA: M.I.T. Press.
Touretzky, D.S. (1986). BoltzCONS: Reconciling connectionism with the recursive nature of stacks and trees. Proceedings of the Eighth Annual Conference of the Cognitive Science Society, Amherst, MA, August, 1986. Hillsdale, NJ: Erlbaum.
Wanner, E., & Maratsos, M. (1978). An ATN approach to comprehension. In Halle, M., Bresnan, J., & Miller, G.A. (Eds.), Linguistic theory and psychological reality. Cambridge, MA: M.I.T. Press.
Watson, J. (1930). Behaviorism. Chicago: University of Chicago Press.
Woods, W.A. (1975). What's in a link? In Bobrow, D., & Collins, A. (Eds.), Representation and understanding. New York: Academic Press.
Ziff, P. (1960). Semantic analysis. Ithaca, NY: Cornell University Press.

On language and connectionism: Analysis of a parallel distributed processing model of language acquisition*

STEVEN PINKER
Massachusetts Institute of Technology

ALAN PRINCE
Brandeis University

Abstract

Does knowledge of language consist of mentally-represented rules? Rumelhart and McClelland have described a connectionist (parallel distributed processing) model of the acquisition of the past tense in English which successfully maps many stems onto their past tense forms, both regular (walk/walked) and irregular (go/went), and which mimics some of the errors and sequences of development of children. Yet the model contains no explicit rules, only a set of neuron-style units which stand for trigrams of phonetic features of the stem, a set of units which stand for trigrams of phonetic features of the past form, and an array of connections between the two sets of units whose strengths are modified during learning. Rumelhart and McClelland conclude that linguistic rules may be merely convenient approximate fictions and that the real causal processes in language use and acquisition must be characterized as the transfer of activation levels among units and the modification of the weights of their connections. We analyze both the linguistic and the developmental assumptions of the model in detail and discover that (1) it cannot represent certain words, (2) it cannot learn many rules, (3) it can learn rules found in no human language, (4) it cannot explain morphological and phonological regularities, (5) it cannot explain the differences between irregular and regular forms, (6) it fails at its assigned task of mastering the past tense of English, (7) it gives an incorrect explanation for two developmental phenomena: stages of overregularization of irregular forms such as bringed, and the appearance of doubly-marked forms such as ated, and (8) it gives accounts of two others (infrequent overregularization of verbs ending in t/d, and the order of acquisition of different irregular subclasses) that are indistinguishable from those of rule-based theories. In addition, we show how many failures of the model can be attributed to its connectionist architecture. We conclude that connectionists' claims about the dispensability of rules in explanations in the psychology of language must be rejected, and that, on the contrary, the linguistic and developmental facts provide good evidence for such rules.

*The authors contributed equally to this paper and list their names in alphabetical order. We are grateful to Jane Grimshaw and Brian MacWhinney for providing transcripts of children's speech from the Brandeis Longitudinal Study and the Child Language Data Exchange System, respectively. We also thank Tom Bever, Jane Grimshaw, Stephen Kosslyn, Dan Slobin, an anonymous reviewer from Cognition, and the Boston Philosophy and Psychology Discussion Group for their comments on earlier drafts, and Richard Goldberg for his assistance. Preparation of this paper was supported by NSF grant IST-8420073 to Jane Grimshaw and Ray Jackendoff of Brandeis University, by NIH grant HD 18381-04 and NSF grant 85-18774 to Steven Pinker, and by a grant from the Alfred P. Sloan Foundation to the MIT Center for Cognitive Science. Requests for reprints may be sent to Steven Pinker at the Department of Brain and Cognitive Sciences, MIT, Cambridge, MA 02139, U.S.A., or Alan Prince, Linguistics and Cognitive Science Program, Brown 125, Brandeis University, Waltham, MA 02254, U.S.A.

If design govern in a thing so small.
Robert Frost

1. Introduction

The study of language is notoriously contentious, but until recently, researchers who could agree on little else have all agreed on one thing: that linguistic knowledge is couched in the form of rules and principles. This conception is consistent with - indeed, is one of the prime motivations for - the "central dogma" of modern cognitive science, namely that intelligence is the result of processing symbolic expressions. To understand language and cognition, according to this view, one must break them up into two aspects: the rules or symbol-manipulating processes capable of generating a domain of intelligent human performance, to be discovered by examining systematicities in people's perception and behavior, and the elementary symbol-manipulating mechanisms made available by the information-processing capabilities of neural tissue, out of which the rules or symbol-manipulating processes would be composed (see, e.g., Chomsky, 1965; Fodor, 1968, 1975; Marr, 1982; Minsky, 1963; Newell & Simon, 1961; Putnam, 1960; Pylyshyn, 1984).

One of the reasons this strategy is inviting is that we know of a complex intelligent system, the computer, that can only be understood using this algorithm-implementation or software-hardware distinction. And one of the reasons that the strategy has remained compelling is that it has given us precise, revealing, and predictive models of cognitive domains that have required few assumptions about the underlying neural hardware other than that it makes available some very general elementary processes of comparing and transforming symbolic expressions.


Of course, no one believes that cognitive models explicating the systematicities in a domain of intelligence can fly in the face of constraints provided by the operations made available by neural hardware. Some early cognitive models have assumed an underlying architecture inspired by the historical and technological accidents of current computer design, such as rapid reliable serial processing, limited-bandwidth communication channels, or rigid distinctions between registers and memory. These assumptions are not only inaccurate as descriptions of the brain, composed as it is of slow, noisy and massively interconnected units acting in parallel, but they are unsuited to tasks such as vision where massive amounts of information must be processed in parallel. Furthermore, some cognitive tasks seem to require mechanisms for rapidly satisfying large sets of probabilistic constraints, and some aspects of human performance seem to reveal graded patterns of generalization to large sets of stored exemplars, neither of which is easy to model with standard serial symbol-matching architectures. And progress has sometimes been stymied by the difficulty of deciding among competing models of cognition when one lacks any constraints on which symbol-manipulating processes the neural hardware supplies "for free" and which must be composed of more primitive processes.

1.1. Connectionism and symbol processing

In response to these concerns, a family of models of cognitive processes originally developed in the 1950s and early 1960s has received increased attention. In these models, collectively referred to as "Parallel Distributed Processing" ("PDP") or "Connectionist" models, the hardware mechanisms are networks consisting of large numbers of densely interconnected units, which correspond to concepts (Feldman & Ballard, 1982) or to features (Hinton, McClelland, & Rumelhart, 1981).
These units have activation levels and they transmit signals (graded or 1-0) to one another along weighted connections. Units "compute" their output signals through a process of weighting each of their input signals by the strength of the connection along which the signal is coming in, summing the weighted input signals, and feeding the result into a nonlinear output function, usually a threshold. Learning consists of adjusting the strengths of connections and the threshold-values, usually in a direction that reduces the discrepancy between an actual output in response to some input and a "desired" output provided by an independent set of "teaching" inputs. In some respects, these models are thought to resemble neural networks in meaningful ways; in others, most notably the teaching and learning mechanisms, there is no known neurophysiological analogue, and some authors are completely agnostic about how the units and connections are neurally instantiated. ("Brain-style modeling" is the noncommittal term used by Rumelhart & McClelland, 1986a.) The computations underlying cognitive processes occur when a set of input units in a network is turned on in a pattern that corresponds in a fixed way to a stimulus or internal input. The activation levels of the input units then propagate through connections to the output units, possibly mediated by one or more levels of intermediate units. The pattern of activation of the output units corresponds to the output of the computation and can be fed into a subsequent network or into response effectors. Many models of perceptual and cognitive processes within this family have been explored recently (for a recent collection of reports, including extensive tutorials, reviews, and historical surveys, see Rumelhart, McClelland, & The PDP Research Group, 1986, and McClelland, Rumelhart, & The PDP Research Group, 1986; henceforth, "PDP I" and "PDP II").

There is no doubt that these models have a different feel than standard symbol-processing models. The units, the topology and weights of the connections among them, the functions by which activation levels are transformed in units and connections, and the learning (i.e., weight-adjustment) function are all that is "in" these models; one cannot easily point to rules, algorithms, expressions, and the like inside them. By itself, of course, this means little, because the same is true for a circuit diagram of a digital computer implementing a theorem-prover. How, then, are PDP models related to the more traditional symbol-processing models that have until now dominated cognitive psychology and linguistics?

It is useful to distinguish three possibilities. In one, PDP models would occupy an intermediate level between symbol processing and neural hardware: they would characterize the elementary information processes provided by neural networks that serve as the building blocks of rules or algorithms.
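The unit computation and weight-adjustment scheme described above can be sketched in a few lines of code. This is an illustrative toy of our own, not any model discussed in this paper: the learning rule shown is the classic perceptron rule, one simple instance of the error-reducing adjustment the text describes, and all function names and constants are assumptions of the sketch.

```python
# A toy connectionist unit: weight each input signal by its connection
# strength, sum the weighted inputs, and pass the total through a
# nonlinear output function (here, a simple threshold).
def unit_output(weights, threshold, inputs):
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

# Learning adjusts connection strengths and the threshold so as to reduce
# the discrepancy between the actual output and a "teacher"-supplied
# desired output (a perceptron-style rule).
def train(weights, threshold, examples, rate=0.1, epochs=50):
    for _ in range(epochs):
        for inputs, desired in examples:
            error = desired - unit_output(weights, threshold, inputs)
            for i, x in enumerate(inputs):
                weights[i] += rate * error * x
            threshold -= rate * error
    return weights, threshold

# Teach a single unit the logical-OR input-output pattern.
examples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, threshold = train([0.0, 0.0], 0.0, examples)
print([unit_output(weights, threshold, x) for x, _ in examples])  # [0, 1, 1, 1]
```

As the text notes, nothing in the trained weights looks like a rule; the "knowledge" of OR is distributed across the connection strengths and the threshold.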
Individual PDP networks would compute the primitive symbol associations (such as matching an input against memory, or pairing the input and output of a rule), but the way the overall output of one network feeds into the input of another would be isomorphic to the structure of the symbol manipulations captured in the statements of rules. Progress in PDP modeling would undoubtedly force revisions in traditional models, because traditional assumptions about primitive mechanisms may be neurally implausible, and complex chains of symbol manipulations may be obviated by unanticipated primitive computational powers of PDP networks. Nonetheless, in this scenario, a well-defined division between rule and hardware would remain, each playing an indispensable role in the explanation of a cognitive process. Many existing types of symbol-processing models would survive mostly intact, and, to the extent they have empirical support and explanatory power, would dictate many fundamental aspects of network organization. In some expositions of PDP models, this is the proposed scenario (see, e.g., Hinton, 1981; Hinton, McClelland, & Rumelhart, 1986, p. 78; also Touretzky, 1986, and Touretzky & Hinton, 1985, where PDP networks implement aspects of LISP and production systems, respectively). We call this "implementational connectionism".

An alternative possibility is that once PDP network models are fully developed, they will replace symbol-processing models as explanations of cognitive processes. It would be impossible to find a principled mapping between the components of a PDP model and the steps or memory structures implicated by a symbol-processing theory, to find states of the PDP model that correspond to intermediate states of the execution of the program, to observe stages of its growth corresponding to components of the program being put into place, or states of breakdown corresponding to components wiped out through trauma or loss - the structure of the symbolic model would vanish. Even the input-output function computed by the network model could differ in special cases from that computed by the symbolic model. Basically, the entire operation of the model (to the extent that it is not a black box) would have to be characterized not in terms of interactions among entities possessing both semantic and physical properties (e.g., different subsets of neurons or states of neurons each of which represent a distinct chunk of knowledge), but in terms of entities that had only physical properties (e.g., the "energy landscape" defined by the activation levels of a large aggregate of interconnected neurons). Perhaps the symbolic model, as an approximate description of the performance in question, would continue to be useful as a heuristic, capturing some of the regularities in the domain in an intuitive or easily-communicated way, or allowing one to make convenient approximate predictions.
But the symbolic model would not be a literal account at any level of analysis of what is going on in the brain, only an analogy or a rough summary of regularities. This scenario, which we will call "eliminative connectionism", sharply contrasts with the hardware-software distinction that has been assumed in cognitive science until now: no one would say that a program is an "approximate" description of the behavior of a computer, with the "exact" description existing at the level of chips and circuits; rather, they are both exact descriptions at different levels of analysis.

Finally, there is a range of intermediate possibilities that we have already hinted at. A cognitive process might be profitably understood as a sequence or system of isolable entities that would be symbolic inasmuch as one could characterize them as having semantic properties such as truth values, consistency relations, or entailment relations, and one might predict the input-output function and systematicities in performance, development, or loss strictly in


1.2. The Rumelhart-McClelland model and theory


One of the most influential efforts in the PDP school has been a model of the acquisition of the marking of the past tense in English developed by David Rumelhart and James McClelland (1986b, 1987). Using standard PDP mechanisms, this model learns to map representations of present tense forms of English verbs onto their past tense versions. It handles both regular (walk/walked) and irregular (feel/felt) verbs, productively yielding past forms for novel verbs not in its training set, and it distinguishes the variants of the past tense morpheme (t versus d versus id) conditioned by the final consonant of the verb (walked versus jogged versus sweated). Furthermore, in doing so it displays a number of behaviors reminiscent of children. It passes through stages of conservative acquisition of correct irregular and regular verbs (walked, brought, hit), followed by productive application of the regular rule and overregularization to irregular stems (e.g., bringed, hitted), followed by mastery of both regular and irregular verbs. It acquires subclasses of irregular verbs (e.g., fly/flew; sing/sang; hit/hit) in an order similar to children. It makes certain types of errors (ated, wented) at similar stages. Nonetheless, nothing in the model corresponds in any obvious way to the rules that have been assumed to be an essential part of the explanation of the past tense formation process. None of the individual units or connections in the model corresponds to a word, a position within a word, a morpheme, a regular rule, an exception, or a paradigm. The intelligence of the model is distributed in the pattern of weights linking the simple input and output units, so that any relation to a rule-based account is complex and indirect at best.

Rumelhart and McClelland take the results of this work as strong support for eliminative connectionism, the paradigm in which rule- or symbol-based accounts are simply eliminated from direct explanations of intelligence:

    We suggest instead that implicit knowledge of language may be stored in connections among simple processing units organized into networks. While the behavior of such networks may be describable (at least approximately) as conforming to some system of rules, we suggest that an account of the fine structure of the phenomena of language use and language acquisition can best be formulated in models that make reference to the characteristics of the underlying networks. (Rumelhart & McClelland, 1987, p. 196)

    We have, we believe, provided a distinct alternative to the view that children learn the rules of English past-tense formation in any explicit sense. We have shown that a reasonable account of the acquisition of past tense can be provided without recourse to the notion of a "rule" as anything more than a description of the language. We have shown that, for this case, there is no induction problem. The child need not figure out what the rules are, nor even that there are rules. (Rumelhart & McClelland, 1986b, p. 267; their emphasis)
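The regular alternation the model must master - t after voiceless consonants (walked), d after voiced segments (jogged), id after t or d (sweated) - can be stated directly as a rule on the stem's final segment. The sketch below is ours, not part of the RM model or of any analysis in this paper; the one-letter phonemic notation and the reduced consonant inventory are assumptions made purely for illustration.

```python
# Regular past-tense allomorphy stated as an explicit rule on the
# stem-final segment: /id/ after t or d, /t/ after other voiceless
# consonants, /d/ elsewhere. Segments use a toy one-letter phonemic
# notation in which "S" stands for the "sh" sound.
VOICELESS = set("ptkfsS")

def past_tense_suffix(final_segment):
    if final_segment in ("t", "d"):
        return "id"   # sweat -> sweat + id ("sweated")
    if final_segment in VOICELESS:
        return "t"    # walk -> walk + t ("walked")
    return "d"        # jog -> jog + d ("jogged"); play -> play + d

for stem, final in [("walk", "k"), ("jog", "g"), ("sweat", "t")]:
    print(stem, "->", past_tense_suffix(final))
```

The point of contention is not whether this regularity exists but whether the child's knowledge of it is a rule like the one above or a pattern of connection weights that merely behaves as if it followed such a rule.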

    We view this work on past-tense morphology as a step toward a revised understanding of language knowledge, language acquisition, and linguistic information processing in general. (Rumelhart & McClelland, 1986b, p. 268)

The Rumelhart-McClelland (henceforth, RM) model, because it inspires these remarkable claims, figures prominently in general expositions of connectionism that stress its revolutionary nature, such as Smolensky (in press) and McClelland, Rumelhart, and Hinton (1986). Despite the radical nature of these conclusions, it is our impression that they have gained acceptance in many quarters; many researchers have been persuaded that theories of language couched in terms of rules and rule acquisition may be obsolete (see, e.g., Sampson, 1987). Other researchers have attempted to blunt the force of the Rumelhart and McClelland attack on rules by suggesting that the model does really contain rules, or that past tense acquisition is an unrepresentatively easy problem, or that there is some reason in principle why PDP models are incapable of being extended to language as a whole, or that Rumelhart and McClelland are modeling performance but saying little about competence, or are modeling implementations but saying little about algorithms. We believe that these reactions, be they quick conversion experiences or outright dismissals, are unwarranted. Much can be gained by taking the model at face value as a theory of the psychology of the child and by examining the claims of the model in detail. That is the goal of this paper.

The RM model, like many PDP models, is a tour de force. It is explicit and mechanistic: precise empirical predictions flow out of the model as it operates autonomously, rather than being continuously molded or reshaped to fit the facts by a theorist acting as deus ex machina. The authors have made a commitment to the underlying computational architecture of the model, rather than leaving it as a degree of freedom. The model is tested not only against the phenomenon that inspired it - the three-stage developmental sequence of generalizations of the regular past tense morpheme - but against several unrelated phenomena as well. Furthermore, Rumelhart and McClelland bring these developmental data to bear on the model in an unusually detailed way, examining not only gross effects but also many of its more subtle details. Several non-obvious but interesting empirical predictions are raised in these examinations. Finally, the model uses clever mechanisms that operate in surprising ways. These features are virtually unheard of in developmental psycholinguistics (see Pinker, 1979; Wexler & Culicover, 1980). There is no doubt that our understanding of language acquisition would advance more rapidly if theories in developmental psycholinguistics were held to such standards.

Nonetheless, our analysis of the model will come to conclusions very different from those of Rumelhart and McClelland. In their presentation, the


model is evaluated only by a global comparison of its overall output behavior with that of children. There is no unpacking of its underlying theoretical assumptions so as to contrast them with those of a symbolic rule-based alternative, or indeed any alternative. As a result, there is no apportioning of credit or blame for the model's performance to properties that are essential versus accidental, or unique to it versus shared by any equally explicit alternative. In particular, Rumelhart and McClelland do not consider what it is about the standard symbol-processing theories that makes them "standard", beyond their first-order ability to relate stem and past tense. To these ends, we analyze the assumptions and consequences of the RM model, as compared to those of symbolic theories, and point out the crucial tests that distinguish them. In particular, we seek to determine whether the RM model is viable as a theory of human language acquisition - there is no question that it is a valuable demonstration of some of the surprising things that PDP models are capable of, but our concern is whether it is an accurate model of children. Our analysis will lead to the following conclusions:

- Rumelhart and McClelland's actual explanation of children's stages of regularization of the past tense morpheme is demonstrably incorrect.
- Their explanation for one striking type of childhood speech error is also incorrect.
- Their other apparent successes in accounting for developmental phenomena either have nothing to do with the model's parallel distributed processing architecture, and can easily be duplicated by symbolic models, or involve major confounds and hence do not provide clear support for the model.
- The model is incapable of representing certain kinds of words.
- It is incapable of explaining patterns of psychological similarity among words.
- It easily models many kinds of rules that are not found in any human language.
- It fails to capture central generalizations about English sound patterns.
- It makes false predictions about derivational morphology, compounding, and novel words.
- It cannot handle the elementary problem of homophony.
- It makes errors in computing the past tense forms of a large percentage of the words it is tested on.
- It fails to generate any past tense form at all for certain words.
- It makes incorrect predictions about the reality of the distinction between regular rules and exceptions in children and in languages.


We will conclude that the claim that parallel distributed processing networks can eliminate the need for rules and for rule induction is unwarranted. In particular, we argue that the shortcomings of the model are in many cases due to central features of connectionist ideology, and are irremediable; or if remediable, only by copying tenets of the maligned symbolic theory. The implications for the promise of connectionism in explicating human language are, we think, profound.

The paper is organized as follows. First, we describe in broad outline the phenomena of English verbal inflection and how a symbolic rule-based explanation handles them. Then we describe the operation of the RM model and examine how its properties contrast with those of the rule-based alternative. Next we evaluate the model's empirical merits: its ability to handle the adult state of the language, and its ability to account for the path of development in children, comparing its claims with those of a simple rule-based account of acquisition. Finally, we determine the extent to which the model's performance is a direct consequence of its parallel distributed processing architecture, and thus what its status implies about the promise of radical connectionism and of PDP models for accounting for language acquisition.

2. A brief overview of English verbal inflection

2.1. The basic facts of English inflection

As background to our examination of the Rumelhart and McClelland model, we briefly describe the basic facts of the English verbal inflection system and the flavor of rule-based theories about it; many additional details of the facts and of the theories will be presented when we evaluate the model.¹

English inflectional morphology is not notably complicated. Where the verb of classical Greek has about 350 distinct forms, and the current Spanish or Italian verb about 50, the English regular verb has exactly four:

¹Valuable linguistic studies of the English verbal system include Bloch (1947), Bybee and Slobin (1982), Curme (1935), Fries (1940), Hoard (1973), Hockett (1942), Jespersen (1942), Mencken (1936), Palmer (1930), Sloat and Hoard (1971), and Sweet (1892). Chomsky and Halle (1968) and Kiparsky (1982a, b) are important general works touching on aspects of the system.

Language and connectionism

(1) a. walk
    b. walks
    c. walked
    d. walking

As is typical in morphological systems, there is rampant syncretism: the use of the same phonological form to express different, often unrelated morphological categories. On syntactic grounds we might distinguish 13 categories filled by the four forms.

(2) a. -Ø   Present - everything but 3rd person singular: I, you, we, they open.
            Infinitive: They may open, They tried to open.
            Imperative: Open!
            Subjunctive: They insisted that it open.
    b. -s   Present - 3rd person singular: He, she, it opens.
    c. -ed  Past: It opened.
            Perfect Participle: It has opened.
            Passive Participle: It was being opened.
            Verbal adjective: A recently-opened box.
    d. -ing Progressive Participle: He is opening.
            Present Participle: He tried opening the door.
            Verbal noun (gerund): His incessant opening of the boxes.
            Verbal adjective: A quietly-opening door.
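The many-to-one mapping in (2), with 13 syntactically distinguished categories realized by only four forms, can be made concrete in a small table (ours, purely illustrative; the category names are informal labels, not a standard inventory):

```python
# Illustrative only: the 13 syntactic categories of (2) mapped to the
# four surface suffixes of the English regular verb.
CATEGORY_TO_SUFFIX = {
    "present_non3sg": "", "infinitive": "", "imperative": "", "subjunctive": "",
    "present_3sg": "s",
    "past": "ed", "perfect_participle": "ed", "passive_participle": "ed",
    "verbal_adjective_ed": "ed",
    "progressive_participle": "ing", "present_participle": "ing",
    "gerund": "ing", "verbal_adjective_ing": "ing",
}

assert len(CATEGORY_TO_SUFFIX) == 13               # 13 categories...
assert len(set(CATEGORY_TO_SUFFIX.values())) == 4  # ...but only 4 forms

def inflect(stem, category):
    return stem + CATEGORY_TO_SUFFIX[category]

print(inflect("open", "present_3sg"))  # opens
```

The point of the table is purely the syncretism itself: the category inventory is much richer than the set of surface forms that realizes it.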

The system is rendered more interesting by the presence of about 180 'strong' or 'irregular' verbs, which form the past tense other than by simple suffixation. There are, however, far fewer than 180 ways of modifying a stem to produce a strong past tense; the study upon which Rumelhart and McClelland depend (Bybee and Slobin, 1982) divides the strong group into nine coarse and somewhat heterogeneous subclasses, which we discuss later. (See the Appendix for a precis of the entire system.) Many strong verbs also maintain a further formal distinction, lost in (2c), between the past tense itself and the Perfect/Passive Participle, which is frequently marked with -en: 'he ate' vs. 'he has/was eaten'. These verbs mark the outermost boundary of systematic complexity in English, giving the learner five forms to keep track of, two of which (the past and the perfect/passive participle) are not predictable from totally general rules.²

2.2. Basic features of symbolic models of inflection

Rumelhart and McClelland write that "We chose the study of acquisition of past tense in part because the phenomenon of regularization is an example often cited in support of the view that children do respond according to general rules of language." What they mean is that when Berko (1958) first documented children's ability to inflect novel verbs for the past tense (e.g., jicked), and when Ervin (1964) documented overregularizations of irregular past tense forms in spontaneous speech (e.g., breaked), it was effective evidence against any notion that language acquisition consisted of rote imitation. But it is important to note the general point that the ability to generalize beyond rote forms is not the only motivation for using rules, as behaviorists were quick to point out in the 1960s when they offered their own accounts of generalization. In fact, even the existence of competing modes of generalizing (such as the different past tense forms of regular and irregular verbs, or of regular verbs ending in different consonants) is not the most important motivation for positing distinct rules. Rather, rules are generally invoked in linguistic explanations in order to factor a complex phenomenon into simpler components that feed representations into one another. Different types of rules apply to these intermediate representations, forming a cascade of structures and rule components. Rules are individuated not only because they compete and mandate different transformations of the same input structure (such as break → breaked/broke), but because they apply to different kinds of structures and thus impose a factoring of a phenomenon into distinct components, rather than generating the phenomena in a single step mapping inputs to outputs. Such factoring allows orthogonal generalizations to be extracted and stated separately, so that observed complexity can arise through the interaction and feeding of independent rules and processes, which often have rather

²Somewhat beyond this bound lies the verb 'be', with eight distinct forms (be, am, is, are, was, were, been, being), of which only the last is regular.


different parametersand domains of relevance This is immediately obvious . in most of syntax, and indeed, in most domainsof cognitive processing (which is why the acquisition and use of internal representationsin "hidden units" is an important technical problem in connectionist modeling; see Hinton & Sejnowski, 1986 Rumelhart, Hinton , & Williams, 1986 . ; ) However, it is not as obvious at first glance how rules feed each other in the case of past tense inflection. Thus to examine in what sensethe RM model "has no rules" and thus differs from symbolic accounts it is crucial to , spell out how the different rules in the symbolic accountsare individuated in terms of the componentsthey are associated with . There is one set of "rules" inherent in the generation of the past tense in English that is completely outsidethe mappingthat the RM model computes : those governing the interaction between the use of the past tense form and the type of sentencethe verb appearsin, which dependson semanticfactors such as the relationship betweenthe times of the speechact, referent event, and a reference point, combined with various syntactic and lexical factors such as the choice of a matrix verb in a complex sentence(I helped her leave *left versus I know she left/ *leave and the modality and mood of a / ) sentence (I went *go yesterdayversus I didn't go/ *went yesterday If my / ,. grandmother had/ *has balls she'd be my grandfather . In other words, a ) speakerdoesn chooseto produce a past tenseform of a verb when and only 't when he or sheis referring to an event taking placebefore the act of speaking . The distinction between the mechanismsgoverning these phenomena and , those that associateindividual stems and past tense forms, is implicitly accepted by Rumelhart and McClelland. 
That is, presumably the RM model would be embedded in a collection of networks that would pretty much reproduce the traditional picture of there being one set of syntactic and semantic mechanisms that selects occasions for use of the past tense, feeding information into a distinct morphological/phonological system that associates individual stems with their tensed forms. As such, one must be cautious at the outset in saying that the RM model is an alternative to a rule-based account of the past tense in general; at most, it is an alternative to whatever decomposition is traditionally assumed within the part of grammar that associates stems and past tense forms.³

In symbolic accounts, this morphological/phonological part is subject to

³Furthermore, the RM model seeks only to generate past forms from stems; it has no facility for retrieving a stem given the past tense form as input. (There is no guarantee that a network will run 'backwards', and in fact some of the more sophisticated learning algorithms presuppose a strictly feed-forward design.) Presumably the human learner can go both ways from the very beginning of the process; later we present examples of children's back-formations in support of this notion. Rule-based theories, as accounts of knowledge rather than use of knowledge, are neutral with respect to the production/recognition distinction.


further decomposition. In particular, rule-based accounts rely on several fundamental distinctions:

Lexical item vs. phoneme string. The lexical item is a unique, idiosyncratic set of syntactic, semantic, morphological, and phonological properties. The phoneme string is just one of these properties. Distinct items may share the same phonological composition (homophony). Thus the notion of lexical representation distinguishes phonologically ambiguous words such as wring and ring.

Morphological category vs. morpheme. There is a distinction between a morphological category, such as 'past tense' or 'perfect aspect' or 'plural' or 'nominative case', and the realization(s) of it in phonological substance. The relation can be many-one in both directions: the same phonological entity can mark several categories (syncretism), and one category may have several (or indeed many) realizations, such as through a variety of suffixes or through other means of marking. Thus in English, -ed syncretistically marks the past, the perfect participle, the passive participle, and a verbal adjective - distinct categories; while the past tense category itself is manifested differently in such items as bought, blew, sat, bled, bent, cut, went, ate, killed.

Morphology vs. phonology. Morphological rules describe the syntax of words - how words are built from morphemes - and the realization of abstract morphological categories. Phonological rules deal with the predictable features of sound structure, including adjustments and accommodations occasioned by juxtaposition and superposition of phonological elements. Morphology trades in such notions as 'stem', 'prefix', 'suffix', 'past tense'; phonology in such as 'vowel', 'voicing', 'obstruence', 'syllable'. As we will see in our examination of English morphology, there can be a remarkable degree of segregation of the two vocabularies into distinct rule systems: there are morphological rules which are blind to phonology, and phonological rules blind to morphological category.

Phonology vs. phonetics. Recent work (Liberman & Pierrehumbert, 1984; Pierrehumbert & Beckman, 1986) refines the distinction between phonology proper, which establishes and maps between one phonological representation and another, and phonetic implementation, which takes a representation and relates it to an entirely different system of parameters (for example, targets in acoustic or articulatory space).

In addition, a rule-system is organized by principles which determine the interactions between rules: whether they compete or feed, and if they compete, which wins. A major factor in regulating the feeding relation is organization into components: morphology, an entire set of formation rules, feeds


phonology, which feeds phonetics.⁴ Competition among morphological alternatives is under the control of a principle of paradigm structure (called the 'Unique Entry Principle' in Pinker, 1984), which guarantees that in general each word will have one and only one form for each relevant morphological category; this is closely related to the 'Elsewhere Condition' of formal linguistics (Kiparsky, 1982a, b). The effect is that when a general rule (like Past(x) = x + ed) formally overlaps a specific rule (like Past(go) = went), the specific rule not only applies but also blocks the general one from applying. The picture that emerges looks like this:

(3) [Diagram: the lexicon (stems, affixes, etc.) feeds morphology; competition among morphological outputs is regulated by the Unique Entry Principle; morphology feeds phonology, which feeds phonetics, which interfaces with perceptual and motor systems.]
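The competition-and-blocking regime just described - a specific listed form pre-empting the general rule - can be illustrated with a toy sketch of ours; the three-verb lexicon and the bare suffixation rule are drastic simplifications for illustration only, not anyone's model:

```python
# Toy sketch of blocking: a listed specific form (Past(go) = went)
# pre-empts the general rule (Past(x) = x + "ed").
IRREGULAR_PASTS = {"go": "went", "break": "broke", "hit": "hit"}

def past(stem):
    # Specific rule applies and blocks the general one (Unique Entry).
    if stem in IRREGULAR_PASTS:
        return IRREGULAR_PASTS[stem]
    # Elsewhere: the fully general suffixation rule.
    return stem + "ed"

print(past("go"))    # went
print(past("walk"))  # walked
```

The design point is that the two rules do not merely coexist: the order of consultation enforces that a word never surfaces with both an irregular and a regularized past.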

With this general structure in mind, we can now examine how the RM model differs in "not having rules".

3. The Rumelhart-McClelland model

Rumelhart and McClelland's goal is to model the acquisition of the past tense, specifically the production of the past tense, considered in isolation

⁴More intricate variations on this basic pattern are explored in recent work in "Lexical Phonology"; see Kiparsky (1982a, b).


from the rest of the English morphological system. They assume that the acquisition process establishes a direct mapping from the phonetic representation of the stem to the phonetic representation of the past tense form. The model therefore takes the following basic shape:

(4) Uninflected stem → Pattern associator → Past form

This proposed organization of knowledge collapses the major distinctions embodied in the linguistic theory sketched in (3). In the following sections we ascertain and evaluate the consequences of this move.

The detailed structure of the RM model is portrayed in Figure 1. In its trained state, the pattern associator is supposed to take any stem as input and emit the corresponding past tense form. The model's pattern associator is a simple network with two layers of nodes, one for representing input, the other for output. Each node represents a different property that an input item may have. Nodes in the RM model may only be 'on' or 'off'; thus the nodes represent binary features, 'off' and 'on' marking the simple absence or presence of a certain property. Each stem must be encoded as a unique subset of turned-on input nodes, and each possible past tense form as a unique subset of output nodes turned on.

Here a nonobvious problem asserts itself. The natural assumption would be that words are strings on an alphabet, a concatenation of phonemes. But
Figure 1. The Rumelhart-McClelland model of past tense acquisition. (Reproduced from Rumelhart and McClelland, 1986b, p. 222, with permission of the publisher, Bradford Books/MIT Press.) [The figure shows a Fixed Encoding Network converting the phonological representation of the root form into a Wickelfeature representation; a Pattern Associator with modifiable connections mapping it to the Wickelfeature representation of the past tense; and a Decoding/Binding Network converting that into the phonological representation of the past tense.]


each datum fed to a network must decompose into an unordered set of properties (coded as turned-on units), and a string is a prime example of an ordered entity. To overcome this, Rumelhart and McClelland turn to a scheme proposed by Wickelgren (1969), according to which a string is represented as the set of the trigrams (3-character sequences) that it contains. (In order to locate word-edges, which are essential to phonology and morphology, it is necessary to assume that word-boundary is a character (#) in the underlying alphabet.) Rumelhart and McClelland call such trigrams Wickelphones. Thus a word like strip translates, in their notation, to {#st, str, tri, rip, ip#}. Note that the word strip is uniquely reconstructible from the cited trigram set. Although certain trigram sets are in principle consistent with more than one string, Rumelhart and McClelland find that all words in their sample are uniquely encoded. Crucially, each possible trigram must be construed as an atomic property that a string may have or lack. Thus, writing it out as we did above is misleading, because the order of the five Wickelphones is not represented anywhere in the RM system, and there is no selective access to the central phoneme t in a Wickelphone such as str, or to its context phonemes s and r. It is more faithful to the actual mechanism to list the Wickelphones in arbitrary (e.g., alphabetical) order and avoid any spurious internal decomposition of Wickelphones, hence: {ip#, rip, str, tri, #st}.

For immediate expository purposes, we can think of each unit in the input layer of the network as standing for one of the possible Wickelphones; likewise for each unit in the output layer. Any given word is encoded as a pattern of node activations over the whole set of Wickelphone nodes - as a set of Wickelphones. This gives a distributed representation: an individual word does not register on its own node, but is analyzed as an ensemble of properties, Wickelphones, which are the true primitives of the system. As Figure 1 shows, Rumelhart and McClelland require an encoder of unspecified nature to convert an ordered phonetic string into a set of activated Wickelphone units; we discuss some of its properties later.
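The trigram decomposition itself is easy to state procedurally. The following sketch (ours, operating on ordinary spelled strings rather than phonetic transcriptions) returns the unordered Wickelphone set for a word:

```python
def wickelphones(word):
    """Return the unordered set of trigrams ('Wickelphones') of a word,
    treating '#' as the word-boundary character."""
    padded = "#" + word + "#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

print(sorted(wickelphones("strip")))
# ['#st', 'ip#', 'rip', 'str', 'tri']
```

Note that the result is a set, which is exactly the point made above: nothing in the representation itself records the order of the trigrams, so recovering the string requires extra machinery.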

The Wickelphone contains enough context to detect the gross kind of input-output relationships found in the stem-to-past-tense mapping. Imagine a pattern associator mapping from input Wickelphones to output Wickelphones. As is usual in such networks, every input node is connected to every output node, giving each input Wickelphone node the chance to influence every node in the output Wickelphone set. Suppose that a set of input nodes is turned on, representing an input to the network. Whether a given output node will turn on is determined jointly by the strength of its connections to the active input nodes and by the output node's own overall susceptibility to influence, its threshold. The individual on/off decisions for the output units are made probabilistically, on the basis of the discrepancy between total input and threshold: the nearer the input is to the threshold, the more random the decision.

An untrained pattern associator starts out with no preset relations between input and output nodes - link weights at zero or with random input-output relations; a tabula that is either blank or meaninglessly noisy. (Rumelhart & McClelland's is blank.) Training involves presenting the network with an input form (in the present case, a representation of a stem) and comparing the output pattern actually obtained with the desired pattern for the past tense form, which is provided to the network by a 'teacher' as a distinct kind of teaching input (not shown in Figure 1). The corresponding psychological assumption is that the child, through some unspecified process, has already figured out which past tense form is to be associated with which stem form. We call this the juxtaposition process; Rumelhart and McClelland adopt the not unreasonable idealization that it does not interact with the process of abstracting the nature of the mapping between stem and past forms.

The comparison between the actual output pattern computed by the connections between input and output nodes, and the desired pattern provided by the teacher, is made on a node-by-node basis. Any output node that is in the wrong state becomes a target of adjustment. If the network ends up leaving off a node that ought to be on according to the teacher, changes are made to render that node more likely to fire in the presence of the particular input at hand. Specifically, the weights on the links connecting active input units to the recalcitrant output unit are increased slightly; this will increase the tendency for the currently active input units (those that represent the input form) to activate the target node. In addition, the target node's own threshold is lowered slightly, so that it will tend to turn on more easily across the board. If, on the other hand, the network incorrectly turns an output node on, the reverse procedure is employed: the weights of the connections from currently active input units are decremented (potentially driving the connection weight to a negative, inhibitory value) and the target node's threshold is raised; a hyperactive output node is thus made more likely to turn off given the same pattern of input node activation. Repeated cycling through input-output pairs, with concomitant adjustments, shapes the behavior of the pattern associator. This is the perceptron convergence procedure (Rosenblatt, 1962), and it is known to produce, in the limit, a set of weights that successfully maps the input activation vectors onto the desired output activation vectors, as long as such a set of weights exists.
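The adjustment procedure just described can be rendered as a short sketch of ours for a single output unit; the step size of 0.1 is illustrative rather than a parameter of the RM model, and the probabilistic firing decision is replaced by a deterministic threshold test for clarity:

```python
# One output unit of a pattern associator: weights from each input unit,
# plus its own threshold. Both are nudged as described in the text.
def update_unit(weights, threshold, active_inputs, desired_on, step=0.1):
    total = sum(weights[i] for i in active_inputs)
    actual_on = total > threshold
    if actual_on == desired_on:
        return weights, threshold          # correct: no adjustment
    if desired_on:
        for i in active_inputs:
            weights[i] += step             # strengthen active links
        threshold -= step                  # fire more easily
    else:
        for i in active_inputs:
            weights[i] -= step             # weaken (possibly inhibitory)
        threshold += step                  # fire less easily
    return weights, threshold

# Repeated cycling on one input-output pair shapes the unit's behavior:
w, th = [0.0, 0.0, 0.0], 0.5
for _ in range(10):
    w, th = update_unit(w, th, active_inputs=[0, 2], desired_on=True)
print(sum(w[i] for i in [0, 2]) > th)      # True: the unit now fires
```

Note that after the unit's response matches the teacher, further presentations leave the weights alone; errors, not mere exposure, drive the change.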

In fact, the RM net, following about 200 training cycles of 420 stem-past pairs (a total of about 80,000 trials), is able to produce correct past forms for the stems when the stems are presented alone, that is, in the absence of


"teaching inputs. Somewhatsurprisingly, a single set of connection weights " in the network is able to map look to looked, live to lived, melt to melted hit , to hit, make to made sing to sang, even go to went. The bits of stored , information accomplishingthese mappingsare superimposedin the connec tion weights and node thresholds no single parameter correspondsuniquely ; to a rule or to any single irregular stem -past pair. Of course it is necessary show how sucha network generalizes stems , to to it has not been trained on, not only how it reproduces a rate list of pairs. The circumstances under which generalization occurs in pattern associators with distributed representationsis reasonablywell understood. Any encoded (one is tempted to say 'en-noded') property of the input data that participates in a frequently attested pattern of input/output relations will playa major role in the developmentof the network. Becauseit is turned on during many training episodes and because standsin a recurrent relationship to a set of , it output nodes its influence will be repeatedly enhancedby the learning pro, cedure. A connectionist network does more than match input to output; it respondsto regularities in the representation of the data and usesthem to accomplishthe mapping it is trained on and to generalizeto new cases In . fact, the distinction between reproducing the memorized input- output pairs and generating novel outputs for novel inputs is absent from pattern as sociators a single set of weights both reproducestrained pairs and produces : novel outDuts which are blends of the output patterns strongly associated ~ with each of the properties defining the novel input. The crucial step is therefore the first one: coding the data. If the patterns in the data relevant to generalizing to new forms are not encoded in the representationof the data, no network- in fact, no algorithmic systemof any sort- will be able to find them. 
(This is, after all, the reason that so much research in the 'symbolic paradigm' has centered on the nature of linguistic representations.) Since phonological processes and relations (like those involved in past tense formation) do not treat phonemes as atomic, unanalyzable wholes, but refer instead to their constituent phonetic properties like voicing, obstruency, tenseness of vowels, and so on, it is necessary that such fine-grained information be present in the network. The Wickelphone, like the phoneme, is too coarse to support generalization. To take an extreme example adapted from Morris Halle, any English speaker who labors to pronounce the celebrated composer's name as [bax] knows that if there were a verb to Bach, its past would be baxt and not baxd or baxid, even though no existing English word contains the velar fricative [x]. Any representation that does not characterize Bach as similar to pass and walk, by virtue of ending in an unvoiced segment, would fail to make this generalization. Wickelphones, of course, have this problem; they treat segments as opaque quarks and fail


to display vital information about segmental similarity classes. A better representation would have units referring in some way to phonetic features rather than to phonemes, because of the well-known fact that the correct dimension of generalization from old to new forms must be stated in terms of such features.
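The size considerations that, as the next paragraphs discuss, push Rumelhart and McClelland from Wickelphone units to Wickelfeature units reduce to simple arithmetic (35 phonemes plus the boundary symbol #):

```python
# All phoneme triples, plus all pairs flanked by a word boundary
# on the left or on the right: the full Wickelphone inventory.
wickelphone_count = 35**3 + 2 * 35**2
print(wickelphone_count)        # 45325

# A fully connected Wickelphone-to-Wickelphone pattern associator:
print(wickelphone_count ** 2)   # 2054355625 - over two billion connections

# The trimmed Wickelfeature layer actually used instead:
print(460 ** 2)                 # 211600 connections
```
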

Rumelhart and McClelland present a second reason for avoiding Wickelphone nodes. The number of possible Wickelphones for their representation of English is 35³ + (2 × 35²) = 45,325 (all triliterals, plus all biliterals beginning and ending with #). The number of distinct connections from the entire input Wickelvector to its output clone would be over two billion (45,325²), too many to handle comfortably. Rumelhart and McClelland therefore assume a phonetic decomposition of segments into features which are in broad outline like those of modern phonology. On the basis of this phonetic analysis, a Wickelphone dissolves into a set of 'Wickelfeatures', each a sequence of three features, one from each of the three elements of the Wickelphone. For example, the features "VowelUnvoicedInterrupted" and "HighStopStop" are two of the Wickelfeatures in the ensemble that would correspond to the Wickelphone "ipt". In the RM model, units represent Wickelfeatures, not Wickelphones; Wickelphones themselves play no role in the model and are only represented implicitly as sets of Wickelfeatures. Again, there is the potential for nondistinct representations, but it never occurred in practice for their verb set. Notice that the actual atomic properties recognized by the model are not phonetic features per se, but entities that can be thought of as 3-feature sequences. The Wickelphone/Wickelfeature is an excellent example of the kind of novel properties that revisionist-symbol-processing connectionism can come up with.

A further refinement is that not all definable Wickelfeatures have units dedicated to them: the Wickelfeature set was trimmed to exclude, roughly, feature-triplets whose first and third features were chosen from different phonetic dimensions.⁵ The end result is a system of 460 nodes, each one representing a Wickelfeature. One may calculate that this gives rise to 460² = 211,600 input-output connections.

⁵Although this move was inspired purely by considerations of computational economy, it or something like it has real empirical support; the reader familiar with current phonology will recognize its relation to the notion of a 'tier' of related features in autosegmental phonology.

The module that encodes words into input Wickelfeatures (the "Fixed Encoding Network" of Figure 1) and the one that decodes output Wickelfeatures into words (the "Decoding/Binding Network" of Figure 1) are perhaps not meant to be taken entirely seriously in the current implementation of the RM model, but several of their properties are crucially important in understanding and evaluating it. The input encoder is deliberately designed to activate some incorrect Wickelfeatures in addition to the precise set of Wickelfeatures in the stem: specifically, a randomly selected subset of those Wickelfeatures that encode the features of the central phoneme properly but encode incorrect feature values for one of the two context phonemes. This "blurred" Wickelfeature representation cannot be construed as random noise; the same set of incorrect Wickelfeatures is activated every time a word is presented, and no Wickelfeature encoding an incorrect choice of the central feature is ever activated. Rather, the blurred representation fosters generalization. Connectionist pattern associators are always in danger of capitalizing too much on idiosyncratic properties of words in the training set in developing their mapping from input to output, and hence of not properly generalizing to new forms. Blurring the input representations makes the connection weights in the RM model less likely to be able to exploit the idiosyncrasies of the words in the training set, and hence reduces the model's tendency toward conservatism.

The output decoder faces a formidable task. When an input stem is fed into the model, the result is a set of activated output Wickelfeature units. Which units are on in the output depends on the current weights of the connections from active input units and on the probabilistic process that converts the summed weighted inputs into a decision as to whether or not to turn on.
Nothing in the model ensures that the set of activated output units will fit together to describe a legitimate word: the activated units do not have to have neighboring context features that "mesh" and hence implicitly "assemble" the Wickelfeatures into a coherent string; they do not have to be mutually consistent in the feature they mandate for a given position; and they do not have to define a set of features for a given position that collectively define an English phoneme (or any kind of phoneme). In fact, the output Wickelfeatures virtually never define a word exactly, and so there is no clear sense in which one knows which word the output Wickelfeatures are defining. In many cases, Rumelhart and McClelland are only interested in assessing how likely the model seems to be to output a given target word, such as the correct past tense form for a given stem; in that case they can peer into the model, count the number of desired Wickelfeatures that are successfully activated and vice versa, and calculate the goodness of the match. However, this does not reveal which phonemes, or which words, the model would actually output. To assess how likely the model actually is to output a phoneme in a given context, that is, how likely a given Wickelphone is in the output, a Wickelphone Binding Network was constructed as part of the output decoder. This network has units corresponding to Wickelphones; these units "compete"


with one another in an iterative processto " claim" the activated Wickelfeatures: the more Wickelfeatures that a Wickelphone unit uniquely accounts for , the greater its strength (Wickelfeaturesaccountedfor by more than one Wickelphone are " split" in proportion to the number of other Wickelfeatures each Wickelphone accountsfor uniquely) and, supposedly the more likely , that Wickelphone is to appearin the output. A similar mechanism called the , Whole -String Binding Network, is defined to estimate the model's relative tendenciesto output any of a particular set of words when it is of interest to compare those words with one another as possible outputs. Rumelhart and McClelland choose a set of plausible output words for a given input stem, such as break, broke, breaked and broked for the past tense of break, and define a unit for each one. The units then compete for activated Wickelfeatures in the output vector, each one growing in strength as a function of the number of activated Wickelfeatures it uniquely accountsfor (with credit for nonunique Wickelfeatures split between the words that can account for it), and diminishing as a function of the number of activatedWickelfeaturesthat are inconsistentwith it . This amountsto a forced-choice procedure and still doesnot reveal what the model would output if left to its own devices which is crucial in evaluatingthe model's ability to produce correct past tenseforms for stemsit has not been trained on. Rumelhart and McClelland envision an eventual " sequentialreadout process that would convert Wickelfeaturesinto " a single temporally ordered representation but for now they make do with a , more easily implemented substitute: an UnconstrainedWhole -String Binding Network, which is a whole-string binding network with one unit for every possible string of phonemesless than 20 phonemeslong- that is, a forcedchoice procedure among all possible strings. 
Since this process would be intractable to compute on today's computers, and maybe even tomorrow's, they created whole-string units only for a sharply restricted subset of the possible strings, those whose Wickelphones exceed a threshold in the Wickelphone binding network competition. But the set was still fairly large, and thus the model was in principle capable of selecting both correct past tense forms and various kinds of distortions of them. Even with the restricted set of whole strings available in the unconstrained whole-string binding network, the iterative competition process was quite time-consuming in the implementation, and thus Rumelhart and McClelland ran this network only in assessing the model's ability to produce past forms for untrained stems; in all other cases, they either counted features in the output Wickelfeature vector directly, or set up a restricted forced-choice test among a small set of likely alternatives in the whole-string binding network.

In sum, the RM model works as follows. The phonological string is cashed in for a set of Wickelfeatures by an unspecified process that activates all the correct and some of the incorrect Wickelfeature units. The pattern associator excites the Wickelfeature units in the output; during the training phase its parameters (weights and thresholds) are adjusted to reduce the discrepancy between the excited Wickelfeature units and the desired ones provided by the teacher. The activated Wickelfeature units may then be decoded into a string of Wickelphones by the Wickelphone binding network, or into one of a small set of words by the whole-string binding network, or into a free choice of an output word by the unconstrained whole-string binding network.

4. An analysis of the assumptions of the Rumelhart-McClelland model in comparison with symbolic accounts

It is possible to practice psycholinguistics with minimal commitment to explicating the internal representation of language achieved by the learner. Rumelhart and McClelland's work is emphatically not of this sort. Their model is offered precisely as a model of internal representation; the learning process is understood in terms of changes in a representational system as it converges on the mature state. It embodies claims of the greatest psycholinguistic interest: it has a theory of phonological representation, a theory of morphology, a theory (or rather anti-theory) of the role of the notion 'lexical item', and a theory of the relation between regular and irregular forms. In no case are these presupposed theories simply transcribed from familiar views; they constitute a bold new perspective on the central issues in the study of word-forms, rooted in the exigencies and strengths of connectionism.

The model largely exemplifies what we have called revisionist-symbol-processing connectionism, rather than implementational or eliminative connectionism. Standard symbolic rules are not embodied in it; nor does it posit an utterly opaque device whose operation cannot be understood in terms of symbol processing of any sort.
It is possible to isolate an abstract but unorthodox linguistic theory implicit in the model (though Rumelhart and McClelland do not themselves consider it in this light), and that theory can be analyzed and evaluated in the same way that more familiar theories are. These are the fundamental linguistic assumptions of the RM model:

- That the Wickelphone/Wickelfeature provides an adequate basis for phonological generalization, circumventing the need to deal with strings.
- That the past tense is formed by direct modification of the phonetics of the root, so that there is no need to recognize a more abstract level of morphological structure.
- That the formation of strong (irregular) pasts is determined by purely


phonetic considerations, so that there is no need to recognize the notion 'lexical item' to serve as a locus of idiosyncrasy.
- That the regular system is qualitatively the same as the irregular, differing only in the number and uniformity of their populations of stem/past exemplars, so that it is appropriate to handle the stem/past relation as a whole in a single, indissoluble facility.

These rather specific assumptions combine to support the broader claim that connectionism supplies a viable alternative to highly structured symbol-processing theories such as the one sketched above. "We have shown," they write (PDPII, p. 267), "that a reasonable account of the acquisition of the past tense can be provided without recourse to the notion of a 'rule' as anything more than a description of the language." By this they mean that rules, as mere summaries of the data, are not intrinsically involved in or causally responsible for internal representations and behavior. Rumelhart and McClelland's argument for the broader claim is based entirely on the accuracy of their past tense model.

We will show that each of the assumptions listed above is grossly false; each seriously mischaracterizes the domain it is relevant to, in a way that positively undermines the model's claim to 'reasonableness'. More than this, we will show how past tense formation takes its place within a larger, more inclusive system of phonological and morphological properties, and how the interactions within the larger system provide us with a clear benchmark for measuring the value of linguistic and psycholinguistic models.

4.1. Wickelphonology

The Wickelphone/Wickelfeature has some useful properties. Rumelhart and McClelland hold that the finite set of Wickelphones can encode all strings of arbitrary length (PDPII, p. 269) and, though this is false, it is close enough to being true to give them a way in which to distinguish the words in their data. In addition, a Wickelphone contains a chunk of context within which dependencies can be found. These properties allow the RM model to get off the ground. If, however, the Wickelphone/Wickelfeature is to be taken seriously as even an approximate model of phonological representation, it must satisfy certain basic, uncontroversial criteria.6

6For other critiques of the Wickelphone hypothesis, antedating the RM model, see Halwes and Jenkins (1971) and Savin and Bever (1970).

Preserving distinctions. First of all, a phonological representation system for a language must preserve all the distinctions that are actually present in


the language English orthography is a familiar representationalsystemthat . fails to preserve distinctness for example, the word spelled 'read' may be : read as either [rid] or [rEd ;7 whatever its other virtues, spelling is not an ] appropriate medium for phonologicalcomputation. The Wickelphone system fails more seriously because , there are distinctions that it is in principle incapable of handling. Certain patterns of repetitions will map distinct string-regions onto the sameWickelphone set, resulting in irrecoverablelossof information. This is not just a mathematicalcuriosity. For example, the Australian language Oykangand(Sommer 1980 distinguishes , ) betweenalgal'straight' and algalgal 'ramrod straight' , different strings which share the Wickelphone set {alg, al#, gal, 19a #al} , ascanbe seenfrom the analysisin (5): , (5) a. algal #al alg 19a gal all b. algalgal #al alg 19a gal alg 19a gal al#

Wickelphone sets containing subsets closed under cyclic permutation on the character string ({alg, gal, lga} in the example at hand) are infinitely ambiguous as to the strings they encode. This shows that Wickelphones cannot represent even relatively short strings, much less strings of arbitrary length, without loss of concatenation structure (loss is guaranteed for strings
7We will use the following phonetic notation and terminology (sparingly). Enclosure in square brackets [ ] indicates phonetic spelling. The tense vowels are: [i] as in beat, [e] as in bait, [u] as in shoe, [o] as in go. The lax vowels are: [I] as in bit, [E] as in bet, [U] as in put, [ɔ] as in lost. The low front vowel [æ] appears in cat. The low central vowel [ʌ] appears in shut. The low back vowel [ɔ] appears in caught. The diphthong [ay] appears in might and bite; the diphthong [aw] in house. The high lax central vowel [ɨ] is the second vowel in melted, rose's. The symbol [č] stands for the voiceless palato-alveolar affricate that appears twice in church; the symbol [ǰ] for its voiced counterpart, which appears twice in judge. [š] is the voiceless palato-alveolar fricative of shoe and [ž] is its voiced counterpart, the final consonant of rouge. The velar nasal [ŋ] is the final consonant in sing. The term sonorant consonant refers to the liquids l, r and the nasals m, n, ŋ. The term obstruent refers to the complement set of oral stops, fricatives, and affricates, such as p, t, k, f, s, š, č, b, d, g, v, z, ž, ǰ. The term coronal refers to sounds made at the dental, alveolar, and palato-alveolar places of articulation. The term sibilant refers to the conspicuously noisy fricatives and affricates [s, z, š, ž, č, ǰ].


over a certain length). On elementary grounds, then, the Wickelphone is demonstrably inadequate.

Supporting generalizations. A second, more sophisticated requirement is that a representation supply the basis for proper generalization. It is here that the phonetic vagaries of the most commonly encountered representation of English (its spelling) receive a modicum of justification. The letter i, for example, is implicated in the spelling of both [ay] and [I], allowing word-relatedness to be overtly expressed as identity of spelling in many pairs like

those in (6):

(6) a. write - written
    b. bite - bit
    c. ignite - ignition
    d. senile - senility
    e. derive - derivative

The Wickelphone/Wickelfeature provides surprisingly little help in finding phonological generalizations. There are two domains in which significant similarities are operative: (1) among items in the input set, and (2) between an input item and its output form. Taking the trigram as the primitive unit of description impedes the discovery of inter-item similarity relations. Consider the fact, noted by Rumelhart and McClelland, that the word silt and the word slit have no Wickelphones in common: the first goes to {#si, sil, ilt, lt#}, the second to {#sl, sli, lit, it#}. The implicit claim is that such pairs have no phonological properties in common. Although this result meets the need to distinguish the distinct, it shows that Wickelphone composition is a very unsatisfactory measure of psychological phonetic similarity. Indeed, historical changes of the type slit → silt and silt → slit, based on phonetic similarity, are fairly common in natural language. In the history of English, for example, we find hross → horse, thrid → third, brid → bird (Jespersen, 1942, p. 58). On pure Wickelphones such changes are equivalent to complete replacements; they are therefore no more likely, and no easier to master, than any other complete replacement, like horse going to slit or bird to clam. The situation is improved somewhat by the transition to Wickelfeatures, but remains unsatisfactory. Since phonemes l and i share features like voicing, Wickelphones like sil and sli will share Wickelfeatures like Voiceless-Voiced-Voiced. The problem is that the l/i overlap is the same as the overlap of l with any vowel and the same as the overlap of r with vowels. In Wickelfeatures it is just as costly (counting by number of replacements) to turn brid to phonetically distant bald or blud as it is to turn it to nearby bird.
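The disjointness of silt and slit is easy to confirm (again with our own trigram helper, not the authors' code):

```python
def wickelphones(word):
    # Context trigrams over the word with '#' as boundary marker.
    s = "#" + word + "#"
    return {s[i:i+3] for i in range(len(s) - 2)}

print(sorted(wickelphones("silt")))                # ['#si', 'ilt', 'lt#', 'sil']
print(sorted(wickelphones("slit")))                # ['#sl', 'it#', 'lit', 'sli']
print(wickelphones("silt") & wickelphones("slit")) # set()
```

A metathesis that a speaker finds minimal registers here as a total replacement of the representation, which is the point of the argument above.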

Even in the home territory of the past tense, Wickelphonology is more an


encumbrance than a guide. The dominant regularity of the language entails that a verb like kill will simply add one phone [d] in the past; in Wickelphones the map is as in (7):

(7) a. {#ki, kil, il#} → {#ki, kil, ild, ld#}
    b. il# → ild, ld#
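The map in (7) can be checked the same way, spelling the stem as kil and the past as kild (our own toy transcription, not the authors' code):

```python
def wickelphones(word):
    # Context trigrams with '#' as the word-boundary marker.
    s = "#" + word + "#"
    return {s[i:i+3] for i in range(len(s) - 2)}

stem, past = wickelphones("kil"), wickelphones("kild")
print(sorted(stem - past))  # ['il#']         -- the one Wickelphone lost
print(sorted(past - stem))  # ['ild', 'ld#']  -- the two that replace it
```

What a segmental theory describes as "append [d]" shows up here as the wholesale replacement of one trigram by two others, with no unit corresponding to the suffix itself.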

The change, shown in (7b), is exactly the full replacement of one Wickelphone by two others. The Wickelphone is in principle incapable of representing an observation like 'add [d] to the end of a word when it ends in a voiced consonant', because there is no way to single out the one word-ending consonant and no way to add a phoneme without disrupting the stem; you must refer to the entire sequence AB#, whether A is relevant or not, and you must replace it entirely, regardless of whether the change preserves input string structure. Given time and space, the facts can be registered on a Wickelfeature-by-Wickelfeature basis, but the unifying pattern is undiscoverable. Since the relevant phonological process involves only a pair of representationally adjacent elements, the triune Wickelphone/Wickelfeature is quite generally incompetent to locate the relevant factors and to capitalize on them in learning, with consequences we will see when we examine the model's success in generalizing to new forms. The "blurring" of the Wickelfeature representation, by which certain input
units XBC and ABZ are turned on in addition to authentic ABC , is a tactical

response to the problem of finding similarities among the input set. The reason that AYB is not also turned on (as one would expect, if "blurring" corresponded to neural noise of some sort) is in part that XBC and ABZ are units preserving the empirically significant adjacency pairing of segments: in many strings of the form ABC, we expect interactions within AB and BC, but not between A and C. Blurring both A and C helps to model processes in which only the presence of B is significant, and, as Lachter and Bever (1988) show, partially recreates the notion of the single phoneme as a phonological unit. Such selective "blurring" is not motivated within Rumelhart and McClelland's theory or by general principles of PDP architecture; it is an external imposition that pushes it along more or less in the right direction. Taken literally, it is scarcely credible: the idea would be that the pervasive adjacency requirement in phonological processes is due to quasirandom confusion, rather than structural features of the representational apparatus and the physical system it serves.

Excluding the impossible. The third and most challenging requirement we can place on a representational system is that it should exclude the impossible. Many kinds of formally simple relations are absent from natural language, presumably because they cannot be mentally represented. Here the Wickelphone/Wickelfeature fails spectacularly. A quintessentially unlinguistic map is relating a string to its mirror-image reversal (this would relate pit to tip, brag to garb, dumb to mud, and so on); although neither physiology nor physics forbids it, no language uses such a pattern. But it is as easy to represent and learn in the RM pattern associator as the identity map. The rule is simply to replace each Wickelfeature ABC by the Wickelfeature CBA. In network terms, assuming link-weights from 0 to 1, weight the lines from ABC → CBA at 1 and all the (459) others emanating from ABC at 0. Since all weights start at 0 for Rumelhart and McClelland, this is exactly as easy to achieve as weighting the lines ABC → ABC at 1, with the others from ABC staying at 0; and it requires considerably less modification of weights than most other input-output transforms. Unlike other, more random replacements, the S → SR map is guaranteed to preserve the stringhood of the input Wickelphone set. It is easy to define other processes over the Wickelphone that are equally unlikely to make their appearance in natural language: for example, no process turns on the identity of the entire first Wickelphone (#AB) or last Wickelphone (AB#); compare in this regard the notions 'first (last) segment', 'first (last) syllable', frequently involved in actual morphological and phonological processes, but which appear as arbitrary disjunctions, if reconstructible at all, in the Wickelphone representation. The Wickelphone tells us as little about unnatural avenues of generalization as it does about the natural ones.

The root cause, we suggest, is that the Wickelphone is being asked to carry two contradictory burdens. Division into Wickelphones is primarily a way of multiplying out possible rule-contexts in advance. Since many phonological interactions are segmentally local, a Wickelphone-like decomposition into short substrings will pick out domains in which interaction is likely.8 But any such decomposition must also retain enough information to allow the string to be reconstituted with a fair degree of certainty. Therefore, the minimum usable unit to reconstruct order is three segments long, even though many contexts for actual phonological processes span a window of only two segments. Similarly, if the blurring process were done thoroughly, so that ABC would set off all XBZ in the input set, there would be full representation of the presence of B, but the identity of the input string would disappear. The RM model thus establishes a mutually subversive relation between representing the aspects of the string that figure in generalizations and representing its concatenation structure. In the end, neither is done satisfactorily.

8Of course, not all interactions are segmentally local. In vowel harmony, for example, a vowel typically reacts to a nearby vowel over an intervening string of consonants; if there are two intervening consonants, the interacting vowels will never be in the same Wickelphone, and generalization will be impossible. Stress rules commonly skip over a string of one or two syllables, which may contain many segments; crucial notions such as 'second syllable' will have absolutely no characterization in Wickelphonology (see Sietsema, 1987, for further discussion). Phenomena like these show the need for more sophisticated representational resources, so that the relevant notion of domain of interaction may be adequately defined (see van der Hulst & Smith, 1982, for an overview of recent work). It is highly doubtful that Wickelphonology can be strengthened to deal with such cases, but we will not explore these broader problems, because our goal is to examine the Wickelphone as an alternative to the segmental concatenative structure which every theory of phonology includes.

Rumelhart and McClelland display some ambivalence about the Wickelfeature. At one point they dismiss the computational difficulty of recovering a string from a Wickelfeature set as one that is easily overcome by parallel processing "in biological hardware" (p. 262). At another point they show how the Wickelfeature-to-Wickelphone reconversion can be done in a binding network that utilizes a certain genus of connectionist mechanisms, implying again that this process is to be taken seriously as part of the model. Yet they write (PDPII, p. 239):

All we claim for the present coding scheme is its sufficiency for the task of representing the past tenses of the 500 most frequent verbs in English and the importance of the basic principles of distributed, coarse (what we are calling blurred), conjunctive coding that it embodies.

This disclaimer is at odds with the centrality of the Wickelfeature in the model's design. The Wickelfeature structure is not some kind of approximation that can easily be sharpened and refined; it is categorically the wrong kind of thing for the jobs assigned to it.9 At the same time, the Wickelphone or something similar is demanded by the most radically distributed forms of distributed representations, which resolve order relations like concatenation into unordered sets of features. Without the Wickelphone, Rumelhart and McClelland have no account of how phonological strings are to be analyzed for significant patterning.

9Compare in this regard certain other aspects of the model which are clearly inaccurate but represent harmless oversimplifications. The actual set of phonetic features used to describe individual phones (p. 235) doesn't make enough distinctions for English, much less language at large (nor is it intended to), but the underlying strategy of featural analysis is solidly supported in the scientific literature. Similarly, the frequency classifications of the verbs in the study derive from the Kucera-Francis count over a written corpus, which shows obvious divergences from the input encountered by a learner (for examples, see footnote 24). Such aberrations, which have little impact on the model's behavior, could be corrected easily with no structural re-design.

4.2. Phonology and morphology

The RM model maps from input to output in a single step, on the assumption that the past tense derives by direct phonetic modification of the stem. The regular endings t, d, id make their appearance in the same way as the

vowel changes i ~ a (sing - sang) or u ~ o (choose - chose). Rumelhart and McClelland claim as an advantage of the model that "[a] uniform procedure is applied for producing the past-tense form in every case." (PDPII, p. 267) This sense of uniformity can be sustained, however, only if past tense formation is viewed in complete isolation from the rest of English phonology and morphology. We will show that Rumelhart and McClelland's very local uniformity must be paid for with extreme nonuniformity in the treatment of the broader patterns of the language.
The distribution of t-d-id follows a simple pattern: id goes after those stems ending in t or d; elsewhere, t (itself voiceless) goes after a voiceless segment and d (itself voiced) goes after a voiced segment. The real interest of this rule is that none of it is specifically bound to the past tense. The perfect/passive participle and the verbal adjective use the very same t-d-id scheme: was kicked - was slugged - was patted; a kicked dog - a flogged horse - a patted cat. These categories cannot be simply identified as copies of the past tense, because they have their own distinctive irregular formations. For example, past drank contrasts with the participle drunk and the verbal adjective drunken. Outside the verbal system entirely there is yet another process that uses
the t-d-id suffix, with the variants distributed in exactly the same way as in the verb forms, to make adjectives from nouns, with the meaning 'having X' (Jespersen, 1942, p. 426 ff.):

(8) -t:  hooked, saber-toothed, pimple-faced, foul-mouthed, thick-necked
    -d:  long-nosed, horned, winged, moneyed, bad-tempered
    -id: one-handed, talented, kind-hearted, warm-blooded, bareheaded

The full generality of the component processes inherent in the t-d-id alternation only becomes apparent when we examine the widespread s-z-iz alternation found in the diverse morphological categories collected below:

(9)    Category     -s            -z                 -iz
    a. Plural       hawks         dogs               hoses
    b. 3psg         hits          sheds              chooses
    c. Possessive   Pat's         Fred's             George's
    d. has          Pat's         Fred's             George's
    e. is           Pat's         Fred's             George's
    f. does         what's        where's            -
    g. Affective    Pats(y)       Wills, bonkers
    h. Adverbial    thereabouts   towards, nowadays
    i. Linking -s   huntsman      landsman


These nine categories show syncretism in a big way: they use the same phonetic resources to express very different distinctions.

The regular noun plural exactly parallels the 3rd person singular marking of the verb, despite the fact that the two categories (noun/verb, singular/plural) have no notional overlap. The rule for choosing among s-z-iz is this: iz goes after stems ending in sibilants (s, z, š, ž, č, ǰ); elsewhere, s (itself voiceless) goes after voiceless segments, z (itself voiced) goes after voiced segments. The distribution of s/z is exactly the same as that of t/d. The rule for iz differs from that for id only inasmuch as z differs from d. In both cases the rule functions to separate elements that are phonetically similar: as the sibilant z is to the sibilants, so the alveolar stop d is to the alveolar stops t and d.
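The parallel between the two alternations can be made concrete with a pair of nearly identical functions. This is our own sketch with ASCII stand-in phones (capital S, Z, C, J for the palato-alveolar sibilants and affricates), not a formalism from the paper:

```python
SIBILANTS = set("szSZCJ")       # S,Z,C,J stand in for the palato-alveolars
ALVEOLAR_STOPS = set("td")
VOICELESS = set("ptkfsSC")      # anything else here counts as voiced

def z_suffix(stem):             # plural, 3psg, possessive, reduced has/is ...
    last = stem[-1]
    if last in SIBILANTS:
        return "iz"
    return "s" if last in VOICELESS else "z"

def d_suffix(stem):             # past tense, participle, denominal adjective
    last = stem[-1]
    if last in ALVEOLAR_STOPS:
        return "id"
    return "t" if last in VOICELESS else "d"

print(z_suffix("hawk"), z_suffix("dog"), z_suffix("hoz"))   # s z iz
print(d_suffix("kik"), d_suffix("slug"), d_suffix("pat"))   # t d id
```

The two functions differ only in which class counts as "too similar" (sibilants vs. alveolar stops) and in the basic consonant, which is exactly the text's point: the rule for iz differs from that for id only inasmuch as z differs from d.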

The possessive marker and the fully reduced forms of the auxiliary has and the auxiliary/main verb is repeat the pattern. These three share the further interesting property that they attach not to nouns but to noun phrases, with the consequence that in ordinary colloquial speech they can end up on any kind of word at all, as shown in (10) below:

(10) a. [my mother-in-law]'s hat (cf. plural: mothers-in-law)
     b. [the man you met]'s dog
     c. [the man you spoke to]'s here. (Main verb be)
     d. [the student who did well]'s being escorted home. (Auxiliary be)
     e. [the patient who turned yellow]'s been getting better. (Auxiliary has)

The remaining formal categories (9f-h) share the s/z part of the pattern. The auxiliary does, when unstressed, can reduce colloquially to its final sibilant:10

10(i) ?What church's he go to?
(ii) ??Whose lunch's he eat from?
(iii) ??Which's he like better?
(iv) ??Whose's he actually prefer?
We suspect that the problem here lies in getting does to reduce at all in such structural environments, regardless of phonology. If this is right, then (i) and (ii) should be as good (or bad) as structurally identical (v) and (vi), where the sibilant-sibilant problem doesn't arise:
(v) ?What synagogue's he go to?
(vi) ?Whose dinner's he eat from?
Sentence forms (iii) and (iv) use the wh-determiners which and whose without following head nouns, which may introduce sufficient additional structural complexity to inhibit reduction. At any rate, this detail, though interesting in itself, is orthogonal to the question of what happens to does when it does reduce.


(11) a. 'Z he like beans?
     b. What's he eat for lunch?
     c. Where's he go for dinner?

The affective marker s/z forms nicknames in some dialects and argots, as in Wills from William, Pats from Patrick, and also shows up in various emotionally-colored neologisms like bonkers, bats, paralleling -y or -o (batty, wacko), with which it sometimes combines (Patsy, fatso). A number of adverbial forms are marked by s/z: unawares, nowadays, besides, backwards, here/there/whereabouts, amidships. A final, quite sporadic (but phonologically regular) use links together elements of compounds, as in huntsman, statesman, kinsman, bondsman.

The reason that the voiced/voiceless choice is made identically throughout English morphology is not hard to find: it reflects the prevailing and inescapable phonetics of consonant-cluster voicing in the language at large. Even in unanalyzable words, final obstruent clusters have a single value for the voicing feature; we find only words like these:

(12) a. ax, fix, box           [ks]
     b. act, fact, product     [kt]
     c. traipse, lapse, corpse [ps]
     d. apt, opt, abrupt       [pt]
     e. blitz, kibitz, Potts   [ts]
     f. post, ghost, list      [st]

Entirely absent are words ending in a cluster with mixed voicing: [zt], [gs], [kz], etc.11 Notice that after vowels, liquids, and nasals (non-obstruents) a voicing contrast is permitted:

(13) a. lens - fence   [nz] - [ns]
     b. furze - force  [rz] - [rs]
     c. wild - wilt    [ld] - [lt]
     d. bulb - help    [lb] - [lp]
     e. goad - goat    [od] - [ot]
     f. niece - sneeze [is] - [iz]

If we are to achieve uniformity in the treatment of consonant-cluster voicing, we must not spread it out over 10 or so distinct morphological form generators (i.e., 10 different networks), and then repeat it once again in the phonetic component that applies to unanalyzable words. Otherwise, we

11In noncomplex words, obstruent clusters are overwhelmingly voiceless; the word adze [dz] stands pretty much alone.


would have no explanation for why English contains, and why generation after generation of children so easily learn, the exact same pattern eleven or so different times. Eleven unrelated sets of cluster patternings would be just as likely. Rather, the voicing pattern must be factored out of the morphology and allowed to stand on its own.

Let's see how the cross-categorial generalizations that govern the surface shape of English morphemes can be given their due in a rule system. Suppose the phonetic content of the past tense marker is just /d/ and that of the diverse morphemes in (9) is /z/. There is a set of morphological rules that say how morphemes are assembled into words: for example, Verb-past = stem + /d/; Noun-pl = stem + /z/; Verb-3psg = stem + /z/. Given this, we can invoke a single rule to derive the occurrences of [t] and [s]:

(14) Voicing Assimilation. Spread the value of voicing from one obstruent to the next in word-final position.12

12More likely, syllable-final position.

Rule (14) is motivated by the facts of simplex words shown above: it holds of ax and adze, and is restricted so as to allow goat and horse to escape unaffected; they end in single obstruents, not clusters. When a final cluster comes about via morphology, the rule works like this:

(15) a. pig + /z/   Vacuous
     b. pit + /z/   → [pIts]
     c. pea + /z/   No Change
     d. rub + /d/   Vacuous
     e. rip + /d/   → [rIpt]
     f. tow + /d/   No Change

The crucial effect of the rule is to devoice /d/ and /z/ after voiceless obstruents; after voiced obstruents its effect is vacuous, and after non-obstruents (vowels, liquids, nasals) it doesn't apply at all, allowing the basic values to emerge unaltered.13

13Notice that if /t/ and /s/ were taken as basic, we would require a special rule of voicing, restricted to suffixes, to handle the case of words ending in vowels, liquids, and nasals. For example, pea + /s/ would have to go to pea + [z], even though this pattern of voicing is not generally required in the language: cf. the morphologically simplex word peace. Positing /d/ and /z/ as basic, on the other hand, allows the rule (14), which is already part of English, to derive the suffixal voicing pattern without further ado.

The environment of the variant with the reduced vowel i is similarly constant across all morphological categories, entailing the same sort of uniform treatment. Here again the simplex forms in the English vocabulary provide the key to understanding: in no case are the phonetic sequences [tt], [dd], or [sibilant-sibilant] tolerated at the end of unanalyzable words, or even inside
them.14 English has very strong general restrictions against the clustering of identical or highly similar consonants. These are not mere conventions deriving from vocabulary statistics, but real limitations on what native speakers of English have learned to pronounce. (Such sequences are allowed in other languages.) Consequently, forms like [skIdd] from skid + /d/ or [jAjz] from judge + /z/ are quite impossible. To salvage them, a vowel comes in to separate the ending from a too-similar stem-final consonant. We can informally state the rule as (16):

(16) Vowel Insertion. Word-finally, separate with the vowel i adjacent consonants that are too similar in place and manner of articulation, as defined by the canons of English word phonology.

The two phonological rules have a competitive interaction. Words like passes [pæsiz] and pitted [pItid] show that Vowel Insertion will always prevent Voicing Assimilation: from pass + /z/ and pit + /d/ we never get [pæsis] or [pItit], with assimilation to the voiceless final consonant. Various lines of explanation might be pursued; we tentatively suggest that the outcome of the competition follows from the rather different character of the two rules. Voicing Assimilation is highly phonetic in character, and might well be part of the system that implements phonological representations, rather than part of the phonology proper, where representations are defined, constructed, and changed. If Vowel Insertion, as seems likely, actually changes the representation prior to implementation, then it is truly phonological in character. Assuming the componential organization of the whole system portrayed above, with a flow between components in the direction Morphology → Phonology → Phonetics, the pieces of the system fall naturally into place. Morphology provides the basic structure of stem + suffix. Phonology makes various representational adjustments, including Vowel Insertion, and Phonetics then implements the representations. In this scheme, Voicing Assimilation, sitting in the phonetic component, never sees the suffix as adjacent to a too-similar stem-final consonant.
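The componential ordering described above can be made concrete with a minimal sketch. This is our own toy illustration, not the authors' formalism: it uses an ASCII transcription in which "i" stands for the inserted vowel, and it only handles the nonsyllabic suffix /d/. The point it shows is the one argued in the text: because Vowel Insertion (phonology) applies before Voicing Assimilation (phonetics), the suffix is never devoiced after a stem-final t or d.

```python
# Toy sketch (ours, not from the paper) of the cascade
# Morphology -> Phonology -> Phonetics for the regular past tense.

VOICELESS = set("ptkfs")  # toy set of voiceless stem-final consonants

def morphology(stem):
    """Morphology: concatenate the stem with the basic suffix /d/."""
    return stem + "d"

def phonology(form):
    """Vowel Insertion: separate a too-similar stem-final consonant
    (here, t or d) from the suffix /d/ with the vowel i."""
    if form[-1] == "d" and form[-2] in ("t", "d"):
        return form[:-1] + "i" + "d"
    return form

def phonetics(form):
    """Voicing Assimilation: devoice the suffix after a voiceless
    consonant. Because Vowel Insertion has already applied, the suffix
    is never adjacent to t/d at this stage."""
    if form[-1] == "d" and form[-2] in VOICELESS:
        return form[:-1] + "t"
    return form

def past(stem):
    return phonetics(phonology(morphology(stem)))

print(past("plan"))  # pland  (suffix stays voiced)
print(past("walk"))  # walkt  (assimilation to voiceless k)
print(past("pit"))   # pitid  (insertion bleeds assimilation)
```

Reversing the order of the two calls in `past` would wrongly yield *pitit, which is exactly the competition the text describes.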

Whatever the ultimate fate of the details of the competition , it is abundantly clear that the English system turns on a fundamental distinction between phonology and morphology . Essential phonological and phonetic pro cesses are entirely insensitive to the specifics of morphological composition and sweep across categories with no regard for their semantic or syntactic content . Such processes define equivalences at one level over items that are distinct at the level of phonetics : for English suffixes , t = d = id and s = z
14This is of course a phonological restriction, not an orthographic one. The words petty and pity, for example, have identical consonantal phonology.


= iz. As a consequence, the learner infers that there is one regular suffix for the past, not three; one suffix each for the plural, 3rd person singular, possessive, and so on. The phonetic differences emerge automatically; as would be expected in such cases, uninstructed native speakers typically have no awareness of them.
Rumelhart and McClelland's pattern associator is hobbled by a doctrine we might dub morphological localism: the assumption that there is for each morphological category an encapsulated system that handles every detail of its phonetics. This they mischaracterize as a theoretically desirable uniformity. In fact, morphological localism destroys uniformity by preventing generalization across categories and by excluding inference based on larger-scale regularities. Thus it is inconsistent with the fact that the languages that people learn are shaped by these generalizations and inferences.
The shape of the system. It is instructive to note that although the various English morphemes discussed earlier all participate in the general phonological patterns of the language, like the past tense they can also display their own particularities and subpatterns. The 3rd person singular is extremely regular, with a few lexical irregularities (is, has, does, says) and a lexical class (modal auxiliaries) that can't be inflected (can, will, etc.). The plural has a 0-suffixing class (sheep, deer), a minuscule number of non-/z/ forms (oxen, children, geese, mice, ...), and a fricative-voicing subclass (leaf-leaves, wreath-wreathes). The possessive admits no lexical peculiarities (outside of the pronouns), presumably because it adds to phrases rather than lexical items, but it is lost after plural /z/ (men's vs. dogs') and sporadically after other z's. The fully reduced forms of is and has admit no lexical or morphologically-based peculiarities whatever, presumably because they are syntactic rather than lexical.

From these observations, we can put together a general picture of how the morphological system works. There are some embracing regularities:

1. All inflectional morphology is suffixing.
2. All nonsyllabic regular suffixes are formed from the phonetic substance /d/ or /z/; that is, they must be the same up to the one feature distinguishing d from z: sibilance.
3. All morphemes are liable to re-shaping by phonology and phonetics.
4. Categories, inasmuch as they are lexical, can support specific lexical peculiarities and subpatterns; inasmuch as they are nonlexical, they must be entirely regular.

Properties (1) and (2) are clearly English-bound generalizations, to be learned by the native speaker. Properties (3) and (4) are replicated from


language to language and should therefore be referred to the general capacities of the learner rather than to the accidents of English. Notice that we have lived up to our promise to show that the rules governing the regular past tense are not idiosyncratic to it: even beyond the phonology discussed above, its intrinsic phonetic content is shared up to one feature with the other regular nonsyllabic suffixes; and the rule of inflectional suffixation is itself shared generally across categories. We have found a highly modular system, in which the mapping from the uninflected stem to the phonetic representation of the past tense form breaks down into a cascade of independent rule systems, and each rule system treats its inputs identically regardless of how they were created. It is a nontrivial problem to design a device that arrives at this characterization on its own. An unanalyzed single module like the RM pattern associator that maps from features to features cannot do so.


The other side of the morphological coin is the preservation of affix identity. The suffixal variants t and d are matched with /d/, not with iz or oz or any other conceivable but phonetically distant form. Similarly, the morphemes which show s/z variants take iz in the appropriate circumstances, not id or od or gu. This follows directly from our hypothesis that the morphemes in question have just one basic phonetic content, /d/ or /z/, which is subject to minor contextual adjustments. The RM model, however, cannot grasp this generalization. To see this, consider the Wickelphone map involved in the id case, using the verb melt as an example:

(17) a. {#me, mel, elt, lt#} → {#me, mel, elt, lti, tid, id#}
     b. lt# → lti, tid, id#

The replacement Wickelphones (or more properly, Wickelfeature set) id# has no relation to the stem-final consonant and could just as well be iz# or ig#. Thus the RM model cannot explain the prevalence across languages of inflectional alternations that preserve stem and affix identities.

4.3.2. Operations on lexical items
The generalizations that the RM model extracts consist of specific correlations between particular phone sequences in the stem and particular phone sequences in the past form. Since the model contains no symbol corresponding to a stem per se, independent of the particular phone sequences that happen to have exemplified the majority of stems in the model's history, it cannot make any generalization that refers to stems per se, cutting across their individual phonetic contents. Thus a morphological process like reduplication, in which (in many languages) an entire stem copies, yielding forms roughly analogous to dum-dum and boom-boom, cannot be acquired in its fully general form by the network. In many cases it can memorize particular patterns of reduplication, consisting of mappings between particular feature sequences and their reduplicated counterparts (though even here problems can arise because of the poverty of the Wickelfeature representation, as we pointed out in discussing Wickelphonology), but the concept 'Copy the stem' itself is unlearnable; there is no unitary representation of a thing to be copied and no operation consisting of copying a variable regardless of its specific content. Thus when a new stem comes in that does not share many features with the ones encountered previously, it will not match any stored patterns and reduplication will not apply to it.15

15It is worth noting that reduplication, which always calls on a variable (if not the stem, then one syllable or foot), is one of the most commonly used strategies of word-formation. In one form or another, it's found in hundreds, probably thousands, of the world's languages. For detailed analysis, see McCarthy and Prince (forthcoming).
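The two points at issue here, what a Wickelphone decomposition contains and what a stem variable buys, can be sketched in a few lines. This is our own illustration, not the RM implementation; the nonce stem "blick" and the toy forms are invented. A rule that copies a variable applies to any novel stem; a table of memorized feature-to-feature mappings simply fails to match it.

```python
# Toy sketch (ours, not the RM implementation).

def wickelphones(word):
    """Decompose a word into its boundary-padded context triples,
    with '#' marking word edges: melt -> {#me, mel, elt, lt#}."""
    padded = "#" + word + "#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

# The replacement triples in meltid (lti, tid, id#) contain nothing
# tying the suffix to the stem-final consonant.
print(sorted(wickelphones("melt")))  # ['#me', 'elt', 'lt#', 'mel']

def reduplicate(stem):
    """'Copy the stem': one operation over a variable."""
    return stem + "-" + stem

# A memorized mapping has no variable; a novel stem fails to match.
memorized = {"dum": "dum-dum", "boom": "boom-boom"}

print(reduplicate("blick"))    # blick-blick: applies to any novel stem
print(memorized.get("blick"))  # None: no stored pattern matches
```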


When we examine the performance of the RM model, we will see how some of its failures can probably be attributed to the fact that what it learns is associated with particular phone sequences, as opposed to variables standing for stems in general.

4.3.3. Lexical items as the locus of idiosyncrasy
The point strikes closer to home as well. The English regular past tense rule adds an affix to a stem. The rule doesn't care about the contents of the stem; it mentions a variable, a stem, that is cashed in for information stored independently in particular lexical entries. Thus the rule, once learned, can apply across the board, independent of the set of stems encountered in the learner's history. The RM model, on the other hand, learns the past tense alternation by linking phonetic features of the inflected forms directly to particular features of the stem (for example, in pat-patted the affix Wickelfeatures are directly linked to the entire set of features for pat: #pæ, pæt, etc.). Though much of the activation for the affix features is eventually contributed by stem features that cut across many individual stems, such as those at the end of a word, not all of it is; some contribution from the word-specific stem features that are well-represented in the input sample can play a role as well. Thus the RM model could fail to generate any past tense form for a new stem if the stem did not share enough features with those stems that were encountered in the past and that thus grew their own strong links with past tense features.

For the RM model, membership in the strong classes is determined entirely by phonological criteria; there is no notion of a lexical item, as distinct from the phone-sequences that make up the item, to which an irregular tag can be affixed. In assessing their model, Rumelhart and McClelland write:

The child need not decide whether a verb is regular or irregular. There is no question as to whether the inflected form should be stored directly in the lexicon or derived from more general principles. (PDPII, p. 267)

If Rumelhart and McClelland are right, there can be no homophony between regular and irregular verbs or between distinct items in irregular classes, because words are nothing but phone-sequences, and irregular forms are tied directly to these sequences. This basic empirical claim is transparently false. Within the strong class itself, there is a contrast between ring (past: rang) and wring (past: wrung), which are only orthographically distinct. Looking at the broader population, we find the string lie shared by the items lie (past: lied) 'prevaricate' and lie (past: lay) 'assume a recumbent position'. In many dialects, hang (regular) refers to a form of execution, while hang (strong) means merely 'suspend'. One verb fit is regular, meaning 'adjust'; the other, which refers to the shape-or-size appropriateness of its subject, can be strong:

(18) a. That shirt never fit/?fitted me.
     b. The tailor fitted/*fit me with a shirt.

The sequence [kAm] belongs to the strong system when it spells the morpheme come, not otherwise: contrast become, overcome with succumb, encumber. An excellent source of counterexamples to the claim that past tense formation collapses the distinctions between words and their featural decomposition is supplied by verbs derived from other categories (like nouns or adjectives). The significance of these examples, which were first noticed in Mencken (1936), has been explored in Kiparsky (1982a, b).16

(19) a. He braked the car suddenly. *broke
     b. He flied out to center field. *flew
     c. He ringed the city with artillery. *rang
     d. Martina 2-setted Chris. *2-set
     e. He subletted/sublet the apartment.
     f. He sleighed down the hill. *slew
     g. He de-flea'd his dog. *de-fled
     h. He spitted the pig. *spat
     i. He righted the boat. *rote
     j. He high-sticked the goalie. *high-stuck
     k. He grandstanded to the crowd. *grandstood

This phenomenon becomes intelligible if we assume that irregularity is a property of verb roots. Nouns and adjectives by their very nature do not classify as irregular (or regular) with respect to the past tense, a purely verbal notion. Making a noun into a verb, which is done quite freely in English, cannot produce a new verb root, just a new verb. Such verbs can receive no special treatment and are inflected in accord with the regular system, regardless of any phonetic resemblance to strong roots.

In some cases there is a circuitous path of derivation: V → N → V. But the end product, having passed through nounhood, must be regular no matter what the status of the original source verb. (By "derivation" we refer to relations intuitively grasped by the native speaker, not to historical etymology.) The baseball verb to fly out, meaning 'make an out by hitting a fly ball that gets caught', is derived from the baseball noun fly (ball), meaning 'ball hit on a conspicuously parabolic trajectory', which is in turn related to the simple strong verb fly 'proceed through the air'. Everyone says "he flied out"; no mere mortal has yet been observed to have "flown out" to left field.

16Examples (19b) and (h) are from Kiparsky.


Similarly, the noun stand in the lexical compound grandstand is surely felt by speakers to be related to the homophonous strong verb, but once made a noun, its verbal irregularity cannot be resurrected: *he grandstood. A derived noun cannot retain any verbal properties of its base, like irregular tense formation, because nouns in general can't have properties such as tense. Thus it is not simply that derivation erases idiosyncrasy, but departure from the verb class: stand retains its verbal integrity in the verbs withstand, understand, as does throw in the verbs overthrow, underthrow.17

Kiparsky (1982a, b) has pointed out that regularization-by-derivation is quite general and shows up wherever irregularity is to be found. In nouns, for example, we have the Toronto Maple Leafs, not *Leaves, because use as a name strips the morpheme of its original content. Similar patterns of regularization are observed very widely in the world's languages.

One might be tempted to try to explain these phenomena in terms of the meanings of regular and irregular versions of a verb. For example, Lakoff (1987) appeals to the distinction between the central and extended senses of polysemous words, and claims that irregularity attaches only to the central sense of an item. It is a remarkable fact indeed, an insult to any naive idea that linguistic form is driven by meaning, that polysemy is irrelevant to the regularization phenomenon. Lakoff's proposed generalization is not sound. Consider these examples:

(20) a. He wetted his pants.  [wet regular in central sense]
     b. He wet his pants.  [wet irregular in extended sense]

(21) a. They heaved the bottle overboard.  [heave regular in central sense]
     b. They hove to.  [heave irregular in extended sense]

It appears that a low-frequency irregular can occasionally become locked into a highly specific use, regardless of whether the sense involved is central or extended. Thus the purely semantic aspect of metaphorical or sense extension has no predictive power whatsoever. Verbs like come, do, have, go, set, get, put, stand ... are magnificently polysemous (and become more so in combination with particles like up, in, out, off), yet they march in lockstep.

17When the verb is the head of the word it belongs to, it passes on its categorial features to the whole word, including both verb-ness and more specialized morphological properties like irregularity (Williams, 1981). Deverbal nouns [N V] and denominal verbs [V N] must therefore be headless, whereas prefixed verbs are headed [V PREF-V]. Notice that there can be uncertainty and dialect differences in the interpretation of individual cases. The verb sublet can be thought of as denominal, giving [V [N sublet]] subletted, or as a prefixed form headed by the verb to let, giving past tense sublet.


Rather than appeal to every possible semantic, syntactic, and pragmatic fillip, we can as well conclude that lexical items do indeed possess an accessible local identity, as distinct from a distributed featural decomposition.
4.4. The strong system and the regular system


The RM model embodies the claim that the distinction between regular and irregular modes of formation is spurious. At this point, we have established the incorrectness of the two assumptions of uniformity that were supposed to support the broader claim.

Assumption: "All past tenses are formed by direct phonetic modification of the stem." We have shown that the regular forms are derived through affixation followed by phonological and phonetic adjustment.

Assumption: "The inflectional class of any verb (regular, subregular, irregular) can be determined from its phonological representation alone." We have seen that membership in the strong classes depends on lexical and morphological information.

These results still leave open the question of the disparity between the regular and strong systems. To resolve it, we need a firmer understanding of how the strong system works. We will find that the strong system has a number of distinctive peculiarities, which are related to its being a partly structured list of exceptions. We will examine five:

1. Phonetic similarity criteria on class membership.
2. Prototypicality structure of classes.
3. Lexical distinctness of stem and past tense forms.
4. Failure of predictability in verb categorization.
5. Lack of phonological motivation for the strong-class changes.

4.4.1. Hypersimilarity
The strong classes are often held together, if not exactly defined, by phonetic similarity. The most pervasive constraint is monosyllabism: 90% of the strong verbs are monosyllabic, and the rest are composed of a monosyllable combined with an unstressed and essentially meaningless prefix.19
19The polysyllabic strong verbs are: arise, awake, become, befall, beget, begin, behold, beset, beshit, bespeak, forbear, forbid, forget, forgive, forgo, forsake, forswear, foretell, mistake, partake (continued)


Within the various classes there are often significant additional resemblances holding between the members. Consider the following sample of typical classes, arranged by pattern of change in past tense and past participle ("x-y-z" will mean that the verb has the vowel [x] in its stem, [y] in its past tense form, and [z] in its past participle). Our judgments about the cited irregular past forms are indicated as follows: ?Verb means that irregular past usage of Verb is somewhat less natural than usual; ??Verb means that Verb is archaic or recherche-sounding in the past tense.

(22) Some strong verb types
a. x-[u]-x(o) + n: blow, grow, know, throw; draw, withdraw; fly; ??slay
b. [e]-[U]-[e] + en: take, mistake, forsake, shake
c. [ay]-[aw]: bind, find, grind, wind
d. [d]-[t]: bend, send, spend, ?lend, ??rend; build
e. [E]-[o] + n: swear, tear, wear, ?bear, ??forswear, ??forbear
f. [E]-[a]: get, forget, ?beget; ?tread

The members of these classes share much more than just a pattern of changes. In the blow-group (22a), for example, the stem-vowel becomes [u] in the past; this change could in principle apply to all sorts of stems, but in fact the participating stems are all vowel-final, and all but know begin with a CC cluster. In the find-group (22c), the vowel change [ay] to [aw] could apply to any stem in [ay], but it only applies to a few ending in [nd]. The change of [d] to [t] in (22d) occurs only after the sonorants [n, l], and mostly when the stem rhymes in -end. Rhyming is also important in (22b), where everything ends in -ake (and the base also begins with a coronal consonant), and in (22e), where -ear has a run.

19(continued) understand, undergo, upset, withdraw, withstand. (There is nothing in the prefixes a-, be-, for-, under-, with- that helps us interpret any particular example of these forms: for- and -get, for instance, do not carry any component of the meaning of forget, nor in fact do most of the other stems and prefixes. Their independent existence is sufficient to support some abstract sense of compositeness; see Aronoff (1976).)
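The notion that a class is held together by a canonical shape can be sketched briefly. This is our own toy illustration, not the authors' analysis: stems are given in invented ASCII transcriptions (blow = "blo", know = "no", etc.), and the template is a rough rendering of the [CRo] scheme. The point is that fitting the shape is not the same as belonging to the class: flow and glow fit the template but are regular.

```python
# Toy sketch (ours): a shape template characterizes, but does not
# define, membership in the blow-class.
import re

# [C](R)o: optional consonant, optional sonorant, vowel "o"
TEMPLATE = re.compile(r"^.?[rln]?o$")

# Actual class members in toy transcription (blow, grow, know, throw).
STRONG = {"blo", "gro", "no", "thro"}

for stem in ["blo", "gro", "no", "flo", "glo"]:
    fits = bool(TEMPLATE.match(stem))
    member = stem in STRONG
    print(stem, fits, member)
# flo and glo fit the template (True) but are not members (False):
# similarity qualifies entry into the class; it does not compel it.
```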
Most of the multi -verb classes in the system are in fact organized around

clusters of words that rhyme and share other structural similarities, which we will call hypersimilarities. (The interested reader is referred to the Appendix for a complete listing.) The regular system shows no signs of such organization. As we have seen, the regular morpheme can add onto any phonetic form, even those most heavily tied to the strong system, as long as the lexical item involved is not a primary verb root.

4.4.2. Prototypicality
The strong classes often have a kind of prototypicality structure. Along the phonetic dimension, Bybee and Slobin (1982) point out that class cohesion can involve somewhat disjunctive 'family resemblances' rather than satisfaction of a strict set of criteria. In the blow-class (22a), for example, the central exemplars are blow, grow, throw, all of the form [CRo], where R is a sonorant. The verb know [no] lacks the initial C in the modern language, but otherwise behaves like the exemplars. The stems draw [dro] and slay [sle] fit a slightly generalized pattern [CRV] and take the generalized pattern x-u-x in place of o-u-o. The verb fly [flay] has the diphthong [ay] for the vowel slot in [CRV], which is unsurprising in the context of English phonology, but unlike draw and slay it takes the concrete pattern of changes of the exemplars: x-u-o rather than x-u-x. Finally, all take -n in the past participle.

Another kind of prototypicality has to do with the degree to which strong forms allow regular variants. (This need not correlate with phonetic centrality: notice that all the words in the blow-class are quite secure in their irregular status.) Consider the class of verbs which add -t and lax the stem-vowel:

(23) V:-V(+t)
keep, sleep, sweep, weep (?weeped/wept), creep (?creeped/crept), leap (leaped/leapt)
feel, deal (?dealed/dealt), kneel (kneeled/?knelt)
mean
dream (dreamed/?dreamt)
leave
lose

Notice the hypersimilarities uniting the class: the almost exclusive prevalence of the vowel [i]; the importance of the terminations [-ip] and [-il]. The parenthetical material contains coexisting variants of the past forms that, according to our judgments, are acceptable to varying degrees. The range of prototypicality runs from 'can only be strong' (keep) through 'may


be either' (leap) to 'may possibly be strong' (dream). The source of such variability is probably the low but nonzero frequency of the irregular form, often due to the existence of conflicting but equally high-status dialects (see Bybee, 1985). The regular system, on the other hand, does not have prototypical exemplars and does not have a gradient of variation of category membership defined by dimensions of similarity. For example, there appears to be no sense in which walked is a better or worse example of the past tense form of walk than genuflected is of genuflect. In the case at hand, there is no reason to assume that regular verbs such as peep, reap function as a particularly powerful attracting cluster, pulling weep, creep, leap away from irregularity. Historically, we can clearly see attraction in the opposite direction: according to the OED, knelt appears first in the 19th century; such regular verbs as heal, peel, peal, reel, seal, squeal failed to protect it; as regular forms they could not do so, on our account, because their phonetic similarity is not perceived as relevant to their choice of inflection, so they do not form an attracting cluster.

4.4.3. Lexicality
The behavior of low-frequency forms suggests that the stem and its strong past are actually regarded as distinct lexical items, while a regular stem and its inflected forms, no matter how rare, are regarded as expressions of a single item.

Consider the verb forgo: though uncommon, it retains a certain liveliness, particularly in the sarcastic phrase "forgo the pleasure of ...". The past tense must surely be forwent rather than *forgoed, but it seems entirely unusable. Contrast the following example, due to Jane Grimshaw:

(24) a. *Last night I forwent the pleasure of grading student papers.
     b. You will excuse me if I forgo the pleasure of reading your paper until it's published.

Similarly but more subtly, we find a difference in naturalness between stem and past tense when the verbs bear and stand mean 'tolerate':

(25) a. I don't know how she bears it.
     b. (?) I don't know how she bore it.
     c. I don't know how she stands him.
     d. (?) I don't know how she stood him.

The verb rend enjoys a marginal subsistence in the phrase rend the fabric of society, yet the past seems slightly odd: The Vietnam War rent the fabric of American society. The implication is that familiarity can accrue differentially to stem and past tense forms; the use of one in a given context does


not always entail the naturalness of the other.

This phenomenon appears to be absent from the regular system. There are regular verbs trapped in a narrow range of idioms, like "eke" in eke out, "crook" in crook one's finger, "stint" in stint no effort; yet the past tense forms of such verbs seem no odder than the stems themselves. Even self-conscious rare verbs like anastomose, fleech, fleer, incommode, prescind show no equivalent uncertainty in the past tense. Furthermore, it is only because regular inflected forms are rule-generated from a single listed item, rather than listed individually, that each use of one form should increment the familiarity of the whole paradigm: all the inflected forms gain familiarity together. Irregular forms, on the other hand, being unpredictable, must be listed individually, and stem and past can therefore part company.
4 .4 . 4 . Even in the patterns will set they years the change fall blow remain .) clear As the verb

Failures when a

of verb

predictability matches , no . If matter the cannot like , . flow know , as glow the verb the characteristic how is closely strong , , its predict crow are of as the regular of the [ I the re strong A ] and there similarity which similar set in are the of to to patterns can of be to any no the these the each last few , A of the classes that

strong will of into , grow the .

system be strong subclasses Verbs , regular for throw

guarantee characteristic subclasses words other hundred consider A ] vowel in

always ,

it the ; yet

members has into turned one

( Indeed

, crow

subcategorization associated

subclasses [ I

subregularity classes :

with

( 26

a .

re

A , .

. smg , shrink , sprIng ,

nng drink SWIm begin ( run b . I A A

sink

stink

, )

spin

win

cling, sling, sting, string, swing: slink (slinked/?slunk) stick dig (hang)

wring

fling

( ? flinged

/ flung

) ,

The core members of these related classes end in -ing and -ink. (Bybee and Slobin note the family resemblance structure here, whereby the hallmark 'velar nasal' accommodates mere nasals on the one side (swim, etc.) and mere velars on the other (stick, dig); the stems run and hang differ from the


norm in a vowel feature or two as well.) Interestingly, no primitive English monosyllabic verb root that ends in -ing is regular. Forms like ding, ping, zing, which show no attraction to class (26), are tainted by onomatopoetic origins; forms like ring 'surround', king (as in checkers), and wing are obviously derived from nouns. Thus the -ing class of verbs is the closest we have in English to a class that can be uniformly (and possibly productively) inflected with anything other than the regular ending. Nevertheless, even for this subclass it is impossible to predict the actual forms from the fact of irregularity: ring-rang contrasts with wring-wrung, spring-sprang with string-strung, and bring belongs to an entirely unrelated class. This observation indicates that learners can pick up the general distinction regular/irregular at some remove from the particular patterns. The regular system, in contrast, offers complete predictability.

4.4.5. Lack of phonological motivation for morphological rules
The rules that determine the shape of the regular morphemes of English are examples of true phonological, or even phonetic, rules: they examine a narrow window of the string and make a small-scale change. Such rules have necessary and sufficient conditions which must be satisfied by elements present in the window under examination in order for the rule to apply. The conditioning factors are intrinsically connected with the change performed: voicelessness in the English suffixes directly reflects the voicelessness of the stem-final consonant; insertion of the vowel resolves the inadmissible adjacency of what English speakers regard as excessively similar consonants.

The relations between stem and past tense in the various strong verb classes are defined on phonological substance, but the factors affecting the relationship are not like those found in true phonological rules. In particular, the changes are for the most part entirely unmotivated by phonological conditions in the string. There is nothing in the environment [-nd] that encourages [ay] to become [aw]; nothing about [CRo], the basic scheme of the blow-class, that causes a change to [CRu] or makes such a change more likely than in some other environment. These are arbitrary, though easily definable, changes, tied arbitrarily to certain canonical forms, in order to mark an abstract morphological category (past tense). The patterns of similarity binding the classes together actually play no causal role in determining the changes that occur. A powerful association may exist, but it is merely conventional and could quite easily be otherwise; and indeed, in the different dialects of the language, spoken now or in the past, there are many different systems. Similarity relations serve essentially to qualify entry into a strong class, rather than to provide an environment that causes a rule to happen.

There is one region of the strong system where discernibly phonological


factors do playa role: the treatment of stems ending in [-t] and [-d] . No strong verb takes the suffix id (bled/ *bledded got/ *gotted the illicit cluster , ); that would be created by suffixing /d/ is resolved instead by eliminating the suffix. This is a strategy that closely resemblesthe phonological processof degemination(simplification of identical adjacent consonants a singleconto sonant , which is active elsewherein English. Nevertheless if we examine ) , the classof affecteditems, we seethe samearbitrariness prototypicality, and , incomplete predictivenesswe have found above. Consider the " no-change " class which usesa single form for stem, past tense, and past participle- by , far the largest single classof strong verbs, with about 25 members In these . examples a word precededby '?' has no natural-soundingpast tenseform in , our dialect; words followed by two alternativesin parentheses have two possible forms, often with one of them (indicated by '?') worse -sounding than the other: (27) No-changeverbs hit , slit, split, quit , spit (spit/spat) , knit (knitted/?knit ), ?shit, ??beshit bid, rid shed, spread wed , let, set, upset, ?beset, wet (wetted/wet) cut, shut put burst, cast, cost thrust (thrusted/thrust), hurt Although ending in [-t , d] is a necessary condition for no-changestatus it , is by no meanssufficient. First of all, the generalconstraint of monosyllabism applies, even though it is irrelevant to degemination Second there is a strong . , favoritism for the vowels [I] and [E , followed by a single consonant again, ] ; this is of no conceivablerelevanceto a truly phonological processsimplifying [td] and [dd] to [t] and [d] . Absent from the class and under no attraction , to it , are such verbs as bat, chat, pat, scat, as well asjot, rot, spot, trot, with the wrong sort of vocalism; and dart, fart, smart, start, thwart, snort, sort, halt, pant, rant, want with nonprototypical vowel and consonant structure. 
Even in the core class, we find arbitrary exceptions: flit, twit, knit are all regular, as are fret, sweat, whet, and some uses of wet. Beside strong cut and shut, we find regular butt, jut, strut. Beside hurt we find blurt, spurt; beside burst, we find regular bust. The phonological constraints on the class far exceed anything relevant to degemination, but in the end they characterize rather than define the class, just as we have come to expect.

Language and connectionism 121

Morphological classification responds to fairly large-scale measures on word structure: is the word a monosyllable? does it rhyme with a key exemplar? does it alliterate (begin with a similar consonant cluster) as an exemplar? Phonological rules look for different and much more local configurations: is this segment an obstruent that follows a voiceless consonant? are these adjacent consonants nearly identical in articulation? In many ways, the two vocabularies are kept distinct: we are not likely to find a morphological subclass holding together because its members each contain somewhere inside them a pair of adjacent obstruents; nor will we find a rule of voicing-spread that applies only in rhyming monosyllables. If an analytical engine is to generalize effectively over language data, it can ill afford to look upon morphological classification and phonological rules as processes of the same formal type.

4.4.6. Default structure

We have found major differences between the strong system and the regular system, supporting the view that the strong system is a cluster of irregular patterns, with perhaps only the -ing forms and the no-change forms displaying some active life as partially generalizable subregularities in the adult language. Membership in the strong system is governed by several criteria: (1) monosyllabism; (2) nonderived verb root status; (3) for the subregularities, resemblance to key exemplars. This means that the system is largely closed, particularly because verb roots very rarely enter the language (new verbs are common enough, but are usually derived from nouns, adjectives, or onomatopoetic expressions). At a few points in history, there have been borrowed items that have met all the criteria: quit and cost are both from French, for example (Jespersen, 1942). The regular system is free from such constraint. No canonical structure, such as being a monosyllable, is required. No information about derivational status, such as not being derived from an adjective, is required. Phonetic similarity to an exemplar plays no role either. Furthermore, the behavior of regular verbs is entirely predictable on general grounds. The regular rule of formation is an extremely simple default with very few characteristics of its own, perhaps only one, as we suggested above: that the morpheme is a stop rather than a fricative.

The regular system also has an internal default structure that is worthy of note, since it contrasts with the RM model's propensities. The rule stem + /d/ covers all possible cases. Under narrowly defined circumstances, some phonology takes place: a vowel intrudes to separate stem and affix, voicelessness propagates from the stem. Elsewhere, in the default case, nothing happens. It appears that language learners are fond of such architectures, which appear repeatedly in languages. (Indeed, the history of English inflection heads in this direction.) Yet the RM network, unlike the rule theory, offers us no insight. The network is equally able to learn a set of scattered, nonlocal, phonetically unrelated subregularities: for example, suffix t if the word begins with b; prefix ik if the word ends in a vowel; change all s to r before ta; etc. The RM model treats the regular class as a kind of fortuitously overpopulated subregularity; indeed, as three such classes, since the d-t-id alternation is treated on a par with the choice between strong subclasses. The extreme categorical uniformity of the regular system disappears from sight, and with it the hope of identifying such uniformity as a benchmark of linguistic generalization.

4.4.7. Why are the regular and strong systems so different?

We have argued that the regular and strong systems have very different properties: the regular system obeys a categorical rule that is stated in a form that can apply to any word and that is adjusted only by very general phonological regularities; whereas the strong system consists of a set of subclasses held together by phonologically-unpredictable hypersimilarities which are neither necessary nor sufficient criteria for membership in the classes. Why are they so different? We think the answer comes from the commonsense characterization of the psychological difference between regular and strong verbs. The past tense forms of strong verbs must be memorized; the past tense forms of regular verbs can be generated by rule. Thus the irregular forms are roughly where grammar leaves off and memory begins. Whatever properties affect human memory in general will shape the strong class, but not the regular class, by a kind of Darwinian selection process, because only the easily-memorized strong forms will survive. The 10 most frequent verbs of English are strong, and it has long been noted that as the frequency of a strong form declines historically, the verb becomes more likely to regularize. The standard explanation is that you can only learn a strong past by hearing it, and only if you hear it often enough are you likely to remember it. However, it is important to note that the bulk of the strong verbs are of no more than middling frequency, and some of them are actually rare, raising the question of how they managed to endure. The hypersimilarities and graded membership structure of the strong class might provide an answer. Rosch and Mervis (1975) note that conceptual categories, such as vegetables or tools, tend to consist of members with family resemblances to one another along a set of dimensions, and graded membership determined by similarity to a prototype. They also showed in two experiments that it is easier for subjects to memorize the members of an artificial category if those members display a family resemblance structure than if they are grouped into categories arbitrarily. Since strong verbs, like Rosch and Mervis's artificial exemplars, must be learned one by one, it is reasonable to expect that the ones that


survive, particularly in the middle and low frequencies, will be those displaying a family resemblance structure. In other words, the reason that strong verbs are either frequent or members of families is that strong verbs are memorized, and frequency and family resemblance assist memorization. The regular system must answer to an entirely different set of requirements: the rule must allow the user to compute the past tense form of any regular verb, and so must be generally applicable, predictable in its output, and so on. While it is possible that connectionist models of category formation (e.g., McClelland & Rumelhart, 1985) might offer insights into why family resemblance fosters category formation, it is the difference between fuzzy families of memorized exemplars and formal rules that the models leave unexplained.20 Rumelhart and McClelland's failure to distinguish between mnemonics and productive morphology leads to the lowest-common-denominator 'uniformity' of accomplishing all change through arbitrary Wickelfeature replacement, and thus vitiates the use of psychological principles to explain linguistic regularities.
5. How good is the model's performance?

The bottom-line and most easily grasped claim of the RM model is that it succeeds at its assigned task: producing the correct past tense form. Rumelhart and McClelland are admirably open with their test data, so we can evaluate the model's achievement quite directly. Rumelhart and McClelland submitted 72 new regular verbs to the trained model and submitted each of the resulting activated Wickelfeature vectors to the unconstrained whole-string binding network to obtain the analog of freely-generated responses. The model does not really 'decide' on a unique past tense form and stick with it thereafter; several candidates get strength values assigned to them, and Rumelhart and McClelland interpret those strength values as being related roughly monotonically to the likelihood that the model would output those candidates. Since there is noise in some of the processes that contribute to strength values, they chose a threshold value (.2 on the 0-1 scale), and if a word surpassed that criterion, it was construed as being one of the model's guesses for the past tense form for a given stem.

By this criterion, 24 of the 72 probe stems resulted in a strong tendency to incorrect responses, 33% of the sample. Of these, 6 (jump, pump, soak,

20 See Armstrong, Gleitman, and Gleitman (1983) for an analogous argument applied to conceptual categories.


warm, trail, glare) had no response at threshold. Though it is hard to reconstruct the reasons for this, two facts are worth noting. First, these verbs have no special resemblance to the apparently quasi-productive strong verb types, the factor that affects human responses. Second, the no-response verbs tend to cluster in phonetic similarity space either with one another (jump, pump) or with other verbs that the model erred on, discussed below (soak/smoke; trail/mail; glare/tour). This suggests that the reason for the model's muteness is that it failed to learn the relevant transformations; i.e., to generalize appropriately about the regular past. Apparently the steps taken to prevent the model from bogging down in insufficiently general case-by-case learning, such as blurring the Wickelfeatures and using probabilistic and noisy output units during learning, did not work well enough. But it also reveals one of the inherent deficits of the model we have alluded to: there is no such thing as a variable for any stem, regardless of its phonetic composition, and hence no way for the model to attain the knowledge that you can add /d/ to a stem to get its past. Rather, all of the knowledge of the model consists of responses to the concrete features trained in the training set. If the new verbs happen not to share enough of these features with the words in the training set, or happen to possess features to which competing and mutually incompatible outputs had been associated, the model can fail to output any response significantly stronger than the background noise. The regular rule in symbolic accounts, in contrast, doesn't care what's in the word or how often its contents were previously submitted for training; the concept of a stem itself is sufficient. We return to this point when discussing some of the limitations of connectionist architecture in general.

Of the remaining 18 verbs for which the model did not output a single correct choice, 4 yielded grossly bizarre candidates:

(28) a. squat - squakt
b. mail - membled
c. tour - toureder
d. mate - maded

Three other candidates were far off the systematic mark:

(29) a. hug - hug
b. smoke - smoke
c. brown - brawned

Seven showed a strong or more exclusive tendency to double marking with the regular past tense morpheme (later we examine whether children make errors of this sort):


(30) a. type - typeded
b. step - steppeded
c. snap - snappeded
d. map - mappeded
e. drip - drippeded
f. carp - carpeded
g. smoke - smokeded

Note that the model shows an interesting tendency to make ill-advised vowel changes:

(31) a. shape - shipt
b. sip - sept
c. slip - slept
d. brown - brawned
e. mail - membled

Well before it has mastered the richly exemplified regular rule, the pattern associator appears to have gained considerable confidence in certain incorrectly-grasped, sparsely exemplified patterns of feature-change among the vowels. This implies that a major "induction problem", latching onto the productive patterns and bypassing the spurious ones, is not being solved successfully.

In sum, for 14 of the 18 stems yielding incorrect forms, the forms were quite removed from the confusions we might expect people to make. Taking these with the 6 no-shows, we have 20 out of the 72 test stems resulting in seriously wrong forms, a 28% failure rate. This is the state of the model after it has been trained 190-200 times on each item in a vocabulary of 336 regular verbs. What we have here is not a model of the mature system.

6. On some common objections to arguments based on linguistic evidence

We have found that many psychologists and computer scientists feel uncomfortable about evidence of the sort we have discussed so far, concerning the ability of a model to attain the complex organization of a linguistic system in its mature state, and attempt to dismiss it for a variety of reasons. We consider the evidence crucial and decisive, and in this section we reproduce some of the objections we have heard and show why they are groundless.

"Those philosophical arguments are interesting, but it's really the empirical data that are important." All of the evidence we have discussed is empirical.


It is entirely conceivable that people could go around saying Whatz the answer? or He high-stuck the goalie or The canary pept or I don't know how she bore him or Yesterday we chat for an hour, or that upon hearing such sentences, people could perceive them as sounding perfectly normal. In every case it is an empirical datum about the human brain that they don't. Any theory of the psychology of language must account for such data.

"Rule-governed behaviors do indeed exist, but they are the products of schooling or explicit instruction, and are deployed by people only when in a conscious, reflective, problem-solving mode of thought that is distinct from the intuitive processes that PDP models account for (see, for example, Smolensky, in press)." This is completely wrong. The rule adding /d/ to a stem to form the past is not generally taught in school (it doesn't have to be!), except possibly as a rule of spelling, which if anything obscures its nature: for one thing, the plural morpheme, which is virtually identical to the past morpheme in its phonological behavior, is spelled differently (s versus ed). The more abstract principles we have discussed, such as distinctions between morphology and phonology, the role of roots in morphology, preservation of stem and affix identity, phonological processes that are oblivious to morphological origin, disjoint conditions for the application of morphological and phonological changes, distinct past tenses for homophones, interactions between the strong and regular systems, and so on, are consciously inaccessible and not to be found in descriptive grammars or language curricula. Many have only recently been adequately characterized; traditional prescriptive grammars tend to be oblivious to them or to treat them in a ham-fisted manner. For example, H.L. Mencken (1936) noted that people started to use the forms broadcasted and joy-rided in the 1920s (without consciously knowing it, they were adhering to the principle that irregularity is a property of verb roots, hence verbs formed from nouns are regular). The prescriptive guardians of the language made a fruitless attempt to instruct people explicitly to use broadcast and joy-rode instead, based on their similarity to cast-cast and ride-rode.

In fact, the objection gets the facts exactly backwards. One of the phenomena that the RM model is good at handling is unsystematic analogy formation based on its input history with subregular forms (as opposed to the automatic application of the regular rule where linguistically mandated). The irregular system, we have noted, is closely tied to memory as well as to language, so it turns out that people often have metalinguistic awareness of some of its patterns, especially since competing regular and irregular past tense forms carry different degrees of prestige and other socioeconomic connotations. Thus some of the fine points of use of the irregulars depend on exposure to standard dialects, on normative instruction, and on conscious


reflection. Thus people, when in a reflective, conscious, problem-solving mode, will seem to act more like the RM model: the overapplication of subregularities that the model is prone to can be seen in modes of language use that bear all the hallmarks of self-conscious speech, such as jocularity (e.g. spaghettus, I got schrod at Legal Seafood, The bear shat in the woods), explicit instruction within a community of specialists (e.g. VAXen as the plural of VAX), pseudoerudition (rhinoceri, axia for axioms), and hypercorrection such as the anti-broadcasted campaign documented by Mencken (similarly, we found that some of our informants offered Hurst no-hitted the Blue Jays as their first guess as to the relevant past form, but withdrew it in favor of no-hit, which they "conceded" was "more proper").

"We academics speak in complex ways, but if you were to go down to [name of nearest working-class neighborhood] you'd find that people talk very differently." If anything is universal about language, it is probably people's tendency to denigrate the dialects of other ethnic or socioeconomic groups. One would hope that this prejudice is not taken seriously as a scientific argument; it has no basis in fact. The set of verbs that are irregular varies according to regional and socioeconomic dialect (see Mencken, 1936, for extensive lists), as does the character of the subregular patterns, but the principles organizing the system as a whole show no variation across classes or groups.

"Grammars may characterize some aspects of the ideal behavior of adults, but connectionist models are more consistent with the sloppiness found in children's speech and adults' speech errors, which are more 'psychological' phenomena." Putting aside until the next section the question of whether connectionist models really do provide a superior account of adults' or children's errors, it is important to recognize a crucial methodological asymmetry that this kind of objection fails to acknowledge. The ability to account for patterns of error is a useful criterion for evaluating competing theories, each of which can account for successful performance equally well. But a theory that can only account for errorful or immature performance, with no account of why the errors are errors or how children mature into adults, is of limited value (Pinker, 1979, 1984; Wexler & Culicover, 1980; Gleitman & Wanner, 1982). (Imagine a "model" of the internal combustion engine that could mimic its ability to fail to start on cold mornings, by doing nothing, but could not mimic its ability to run under any circumstances.) Thus it is not legitimate to suggest, as Rumelhart and McClelland do, that "people--or at least children, even in early grade-school years--are not perfect rule-applying machines either. ... Thus we see little reason to believe that our model's 'deficiencies' are significantly greater than those of native speakers of comparable experience" (PDPII, pp. 265-266). Unlike the RM model, no adult speaker is utterly stumped in an unpressured naturalistic


situation when he or she needs to produce the past tense form of soak or glare, none vacillates between kid and kidded, none produces membled for mailed or toureder for toured. Although children equivocate in experimental tasks eliciting inflected nonce forms, these tasks are notorious for the degree to which they underestimate competence with the relevant phenomena (Levy, 1983; Maratsos et al., 1987; Pinker, Lebeaux, & Frost, 1987), not to mention the fact that children do not remain children forever. The crucial point is that adults can speak without error and can realize that their errors are errors (by which we mean, needless to say, from the standpoint of the untaxed operation of their own system, not of a normative standard dialect). And children's learning culminates in adult knowledge. These are facts that any theory must account for.

7. The RM model and the facts of children's development

Rumelhart and McClelland stress that their model's ability to explain the developmental sequence of children's mastery of the past tense is the key point in favor of their model over traditional accounts. In particular, these facts are the "fine structure of the phenomena of language use and language acquisition" that their model is said to provide an exact account of, as opposed to the traditional explanations, which "leave out a great deal of detail", describing the phenomena only "approximately". One immediate problem in assessing this claim is that there is no equally explicit model incorporating rules against which we can compare the RM model. Linguistic theories make no commitment as to how rules increase or decrease in relative strength during acquisition; this would have to be supplied by a learning mechanism that meshed with the assumptions about the representation of the rules. And theories discussed in the traditional literature of developmental psycholinguistics are far too vague and informal to yield the kinds of predictions that the RM model makes. There do exist explicit models of the acquisition of inflection, such as that outlined by Pinker (1984), but they tend to be complementary in scope to the RM model; the Pinker model, for example, attempts to account for how the child realizes that one word is the past tense version of another, and which of two competing past tense candidates is to be retained (which in the RM model is handled by the "teacher" or not at all), and relegates to a black box the process of abstracting the morphological and phonological changes relating past forms and stems, which is what the RM model is designed to learn.

The precision of the RM theory is surely a point in its favor, but it is still difficult to evaluate, for it is not obvious what features of the model give it


its empirical successes. More important, it is not clear whether such features are consequences of the model's PDP architecture or simply attributes of fleshed-out processes that would function in the same way in any equally-explicit model of the acquisition process. In most cases, Rumelhart and McClelland do not apportion credit or blame for the model's behavior to specific aspects of its operation; the model's output is compared against the data rather globally. In other cases the intelligence of the model is so distributed, and its output mechanisms are so interactive, that it is difficult for anyone to know what aspect of the model makes it successful. And in general, Rumelhart and McClelland do not present critical tests between competing hypotheses embodying minimally different assumptions, only descriptions of goodness of fit between their model and the data. In this section, we unpack the assumptions of the model, and show which ones are doing the work in accounting for the developmental facts, and whether the developmental facts are accounted for to begin with.

7.1. Unique and shared properties of networks and rule systems

Among the RM model's many properties, there are two that are crucial to its accounts of developmental phenomena. First, it has a learning mechanism that makes it type-frequency sensitive: the more verbs it encounters that embody a given type of morphophonological change, the stronger are its graded representations of that morphophonological change, and the greater is the tendency of the model to generalize that change to new input verbs. Furthermore, the different past tense versions of a word that would result from applying various regularities to it are computed in parallel, and there is a competition among them for expression, whose outcome is determined mainly by the strength of the regularity and the goodness of the match between the regularity and the input. (In fact, the outcome can also be a blend of competing responses, but the issue of response blending is complex enough for us to defer discussing it to a later section.)

It is crucial to realize that neither frequency-sensitivity nor competition is unique to PDP models. Internal representations that have graded strength values associated with them are probably as old as theories of learning in psychology; in particular, it is commonplace to have greater strength values assigned to representations that are more frequently exemplified in the input during learning, so that the strength of a representation basically corresponds to the degree of confidence in the hypothesis represented. Competition among candidate operations that partially match the input is also a ubiquitous assumption among symbol-processing models in linguistics and cognitive psychology. Spreading-activation models and production systems, which are prototypical


symbol-processing models of cognition, are the clearest examples (see, e.g., Newell & Simon, 1972; Anderson, 1976, 1983; MacWhinney & Sokolov, 1987).

To show how these assumptions are part and parcel of standard rule-processing models, we will outline a simplified module for certain aspects of past tense acquisition, which searches for the correct past tense rule or rules, keeping several candidates as possibilities before it is done. We do not mean to propose it as a serious theory, but only as a demonstration that many of the empirical successes of the RM model are the result of assumptions about frequency-sensitivity and competition among output candidates that are independent of parallel distributed processing in networks of simple units.

A simple illustrative module of a rule-based inflection acquisition theory, incorporating assumptions about frequency-sensitivity and competition

Acquiring inflectional systems poses a number of tricky induction problems, discussed at length in Pinker (1984). When a child hears an inflected verb in a single context, it is utterly ambiguous what morphological category the inflection is signaling (the gender, number, person, or some combination of those agreement features? for the subject? for the object? is it tense? aspect? modality? some combination of these?). Pinker (1984) suggested that the child solves this problem by "sampling" from the space of possible hypotheses defined by combinations of an innate finite set of elements, maintaining these hypotheses in the provisional grammar, and testing them against future uses of that inflection, expunging a hypothesis if it is counterexemplified by a future word. Eventually, all incorrect hypotheses about the category features encoded by that affix will be pruned, any correct one will be hypothesized, and only correct ones will survive.
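The expunge-on-counterexample procedure just described can be sketched in a few lines (our illustration, not Pinker's implementation; the feature inventory and the observed uses are invented):

```python
# Sketch of hypothesis pruning: the learner maintains a set of candidate
# category features for an affix and expunges any candidate that a later
# use of the affix counterexemplifies. Features and observations invented.
candidates = {"past tense", "plural subject", "third person subject"}

def observe(features_of_use):
    """Keep only hypotheses consistent with this use of the affix."""
    global candidates
    candidates = {h for h in candidates if h in features_of_use}

observe({"past tense", "third person subject"})   # e.g. a "he walked" context
observe({"past tense", "plural subject"})         # e.g. a "they walked" context
print(candidates)  # -> {'past tense'}: only the correct hypothesis survives
```

Each observation eliminates the hypotheses it fails to exemplify, so the candidate set shrinks monotonically toward the correct analysis.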
The surviving features define the dimensions of a word-specific paradigm structure into whose cells the different inflected forms of a given verb are placed (for example, singular-plural, or present-past-future). The system then seeks to form a productive general paradigm, that is, a set of rules for related inflections, by examining the patterns exhibited across the paradigms for the individual words. This poses a new induction problem because of the large number of possible generalizations consistent with the data, and it cannot be solved by examining a single word-specific paradigm or even a set of paradigms. For example, in examining sleep/slept, should one conclude that the regular rule of English laxes and lowers the vowel and adds a t? If so, does it do so for all stems, or only for those ending in a stop, or only those whose stem vowel is i? Or is this simply an isolated irregular form, to be recorded individually with no contribution to the regular rule system? There


is no way to solve the problem other than by trying out various hypotheses and seeing which ones survive when tested against the ever-growing vocabulary. Note that this induction problem is inherent to the task and cannot be escaped from using connectionist mechanisms or any other mechanisms; the RM model attempts to solve the problem in one way, by trying out a large number of hypotheses of a certain type in parallel. A symbolic model would solve the problem using a mechanism that can formulate, provisionally maintain, test, and selectively expunge hypotheses about rules of various degrees of generality. It is this hypothesis-formation mechanism that the simplified module embodies. The module is based on five assumptions:

1. Candidates for rules are hypothesized by comparing base and past tense versions of a word, and factoring apart the changing portion, which serves as the rule operation, from certain morphologically-relevant phonological components of the stem, which serve to define the class of stems over which the operation can apply.21 Specifically, let us assume that when the addition of material to the edge of a base form is noted, the added material is stored as an affix, and the provisional definition of the morphological class will consist of the features of the edge of the stem to which the affix is attached. When a vowel is noted to change, the change is recorded, and the applicable morphological class will be provisionally defined in terms of the features of the adjacent consonants. (In a more realistic model, global properties defining the "basic words" of a language, such as monosyllabicity in English, would also be extracted.)
2. If two rule candidates have been coined that have the same change operation, a single collapsed version is created, in which the phonological features distinguishing their class definitions are eliminated.
3. Rule candidates increase in strength each time they have been exemplified by an input pair.
4. When an input stem has to be processed by the system in its intermediate stages, an input is matched in parallel against all existing rule candidates, and if it falls into several classes, several past tense forms may be generated.
5. The outcome of a competition among the past tense forms is determined by the strength of the relevant rule and the proportion of a word's features that were matched by that rule.

21 More accurately, the changing portion is examined subsequent to the subtraction of any phonological and phonetic changes that have been independently acquired.
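The five assumptions above can be rendered as a toy program (ours, not the authors'). The representations are drastically simplified: letters stand in for phonological feature matrices, class definitions are omitted, and the pseudo-phonemic spellings sidestep orthography, so this illustrates the bookkeeping rather than the theory:

```python
# Toy module: coin rule candidates from base/past pairs (assumption 1),
# pool strength across candidates sharing a change operation (2-3), and
# let all rules apply in parallel with competition by strength (4-5).
rules = {}  # change operation -> strength (number of exemplifying pairs)

def coin(base, past):
    """Assumption 1: factor the changing portion out of a base/past pair."""
    if past.startswith(base):
        op = ("suffix", past[len(base):])        # added edge material -> affix
    else:
        # find the differing vowel (assumes equal-length stems for simplicity)
        i = next(j for j, (b, p) in enumerate(zip(base, past)) if b != p)
        op = ("vowel", base[i], past[i])         # record the vowel change
    rules[op] = rules.get(op, 0) + 1             # assumptions 2-3: pooled strength

def past_candidates(stem):
    """Assumptions 4-5: all rules apply in parallel; rank by strength."""
    outs = {}
    for op, strength in rules.items():
        if op[0] == "suffix":
            outs[stem + op[1]] = strength
        elif op[1] in stem:
            outs[stem.replace(op[1], op[2], 1)] = strength
    return sorted(outs, key=outs.get, reverse=True)

coin("wak", "wakt")    # walk/walked  -> suffix t
coin("tip", "tipt")    # tip/tipped   -> suffix t again, strength now 2
coin("sing", "sang")   # sing/sang    -> vowel change i -> a, strength 1
print(past_candidates("ring"))  # ['ringt', 'rang']: the regular rule is stronger
```

A new stem like ring matches both the suffixing rule and the subregular vowel change, so two candidates are generated and the more frequently exemplified one wins, which is exactly the frequency-and-competition behavior the text attributes to the RM model.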


The model works as follows. Imagine its first input pair is speak/spoke. The changing portion is i → o. The provisional definition of the class to which such a rule would apply would be the features of the adjacent consonants, which we will abbreviate as 'p_k'. Thus the candidate rule coined is (32a), which can be glossed as "change i to o for the class of words containing the features of /p/ before the vowel and the features of /k/ after the vowel". Of course, the candidate rule has such a specific class definition in the example that it is almost like listing the pair directly. Let us make the minimal assumptions about the strength function and simply increase it by 1 every time a rule is exemplified. Thus the strength of this rule candidate is 1. Say the second input is get/got. The resulting rule candidate, with a strength of 1, is (32b). A regular input pair tip/tipped would yield (32c). Similarly, sing/sang would lead to (32d), and hit/hit would lead to (32e), each with unit strength.22

(32) a. Change: i → o
Class: p_k
b. Change: e → a
Class: g_t
c. Suffix: t
Class: p#
d. Change: i → a
Class: s_ŋ
e. Suffix: Ø
Class: t#
Change: Ø
Class: h_t

Now we can examine the rule-collapsing process. A second regular input, walk/walked, would inspire the learner to coin the rule candidate (33a), which, because it shares the change operation of rule candidate (32c), would be collapsed with it to form a new rule (33b) of strength 2 (summing the strengths of its contributing rules, or, equivalently, the number of times it has been exemplified).

22 Let us assume that it is unclear to the child at this point whether there is a null vowel change or a null affix, so both are stored. Actually, we don't think either is accurate, but it will do for the present example.


(33) a. Suffix: t
        Class: _k#
     b. Suffix: t
        Class: _C#
               [-voiced]
               [-continuant]
               [-sonorant]
The context-collapsing operation has left the symbol "C" (for consonant) and its three phonological features as the common material in the definitions of the two previously distinct provisional classes. Now consider the results of a third regular input, pace/paced. First, a fairly word-specific rule (34a) would be coined; then it would be collapsed with the existing rule (33b) with which it shares a change operation, yielding a rule (34b) with strength 3.

(34) a. Suffix: t
        Class: _s#
     b. Suffix: t
        Class: _C#
               [-voiced]
[ -voiced ] Rule candidates based on subregularities would also benefit from the increases in strength that would result from the multiple input types exemplify ing it . For example , when the pair ring / rang is processed, it would contribute (35a) , which would then be collapsed with (32d) to form (35b) . Similar collapsing would strengthen other subregularities as tentative rule candidates ,
such as the null affix .

(35) a. Change: i → a
        Class: r_ŋ
     b. Change: i → a
        Class: C_ŋ
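To make the mechanics concrete, here is a minimal Python sketch of the coining-and-collapsing process just described. This is our illustration, not an implementation from the paper: stems are plain strings rather than feature bundles, and the merged context is crudely marked with '?' where the paper would intersect phonological features. A word-specific candidate is extracted from each stem/past pair, and candidates sharing a change operation are merged, summing their strengths.

```python
# Illustrative sketch (not Pinker & Prince's implementation): coin a
# word-specific rule candidate from a stem/past pair, then collapse
# candidates that share a change operation, summing their strengths.

def coin_candidate(stem, past):
    """Return a provisional rule: the change plus its immediate context."""
    # Strip the shared prefix and suffix to isolate the changing portion.
    i = 0
    while i < min(len(stem), len(past)) and stem[i] == past[i]:
        i += 1
    j = 0
    while (j < min(len(stem), len(past)) - i
           and stem[len(stem) - 1 - j] == past[len(past) - 1 - j]):
        j += 1
    change = (stem[i:len(stem) - j], past[i:len(past) - j])
    left = stem[i - 1] if i > 0 else '#'           # context before the change
    right = stem[len(stem) - j] if j > 0 else '#'  # context after the change
    return {'change': change, 'class': (left, right), 'strength': 1}

def collapse(candidates, new):
    """Merge a new candidate with an existing one sharing its change."""
    for cand in candidates:
        if cand['change'] == new['change']:
            cand['strength'] += new['strength']        # sum strengths
            # Keep only shared context; '?' stands in for feature overlap.
            cand['class'] = tuple(a if a == b else '?'
                                  for a, b in zip(cand['class'], new['class']))
            return candidates
    return candidates + [new]

rules = []
# Rough phonetic spellings of speak/spoke, tip/tipped, walk/walked.
for stem, past in [('spik', 'spok'), ('tip', 'tipt'), ('wak', 'wakt')]:
    rules = collapse(rules, coin_candidate(stem, past))
# rules now holds an i -> o change of strength 1 and a suffix-t rule
# of strength 2 whose context has collapsed to "? #".
```

The suffix-t rule ends with strength 2 and a partly generalized context, mirroring the collapse of (32c) with (33a) into (33b).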

Though this model is ridiculously simple , one can immediately see that it has several things in common with the RM model . First , regularities , certain subregularities , and irregular alternations are extracted , to be entertained as possible rules , by the same mechanism . Second, mechanisms embodying the different regularities accrue strength values that are monotonically related to the number of inputs that exemplify them . Third , the model can generalize to new inputs that resemble those it has encountered in the past; for example , tick , which terminates in an unvoiced stop , matches the context of rule (34b) ,



calculations, and the phonology would subtract out modifications abstracted from consonant clusters of simple words and perhaps from sets of morphologically unrelated rules. Finally, the general regular paradigm would be used when needed to fill out empty cells of word-specific paradigms with a unique entry, while following the constraint that irregular forms in memory block the product of the regular rule, and only a single form can be generated for a specific stem when more than one productive rule applies to it (multiple entries can exist only when the irregular form is too weakly represented, or when multiple forms are witnessed in the input; see Pinker, 1984).

Though both our candidate-hypothesization module and the RM model share certain properties, let us be clear about the differences. The RM model is designed to account for the entire process that maps stems to past tense forms, with no interpretable subcomponents, and few constraints on the regularities that can be recorded. The candidate-hypothesization module, on the other hand, is meant to be a part of a larger system, and its outputs, namely rule candidates, are symbolic structures that can be examined, modified or filtered out by other components of grammar. For example, the phonological acquisition mechanism can note the similarities between t/d/id and s/z/iz and pull out the common phonological regularities, which would be impossible if those allomorphic regularities were distributed across a set of connection weights onto which countless other regularities were superimposed. It is also important to note that, as we have mentioned, the candidate-hypothesization module is motivated by a requirement of the learnability task facing the child. Specifically, the child at birth does not know whether English has a regular rule, or if it does, what it is or whether it has one or several.
He or she must examine the input evidence, consisting of pairs of present and past forms acquired individually, to decide. But the evidence is locally ambiguous in that the nonproductive exceptions to the regular rule are not a random set but display some regularities, for historical reasons (such as multiple borrowings from other languages or dialects, or rules that have ceased to be productive) and psychological reasons (easily-memorized forms fall into family resemblance structures). So the child must distinguish real from apparent regularities. Furthermore, there is the intermediate case presented by languages that have several productive rules applying to different classes of stems. The "learnability problem" for the child is to distinguish these cases. Before succeeding, the child must entertain a number of candidates for the regular rule or rules, because it is only by examining large sets of present-past pairs that the spurious regularities can be ruled out and the partially-productive ones assigned to their proper domains; small samples are always ambiguous in this regard. Thus a child who has not yet solved the problem of distinguishing general productive rules from restricted productive rules from accidental patterns will have a number of candidate regularities still open as hypotheses. At this stage there will be competing options for the past tense form of a given verb. The child who has not yet figured out the distinction between regular, subregular, and idiosyncratic cases will display behavior that is similar to a system that is incapable of making the distinction: the RM model.

In sum, any adequate rule-based theory will have to contain a module that extracts multiple regularities at several levels of generality, assigns them strengths related to their frequency of exemplification by input verbs, and lets them compete in generating a past tense form for a given verb. In addition, such a model can attain the adult state by feeding its candidates into paradigm-organization processes, which, following linguistic constraints, distinguish real generalizations from spurious ones. With this alternative model in mind, we can now examine which aspects of the developmental data are attributable to specific features of the RM model's parallel distributed processing architecture (specifically, to its collapsing of linguistic distinctions) and those which are attributable to its assumptions of graded strength, type frequency sensitivity, and competition, which it shares with symbolic alternatives.
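The competition among candidate regularities summarized above can be given a toy rendering. The sketch below is our illustration, not a module from the paper: each candidate scores its strength times the proportion of its class contexts matched by the stem (the competition criterion of the earlier numbered list), and the strongest candidate generates the past form. The particular rules and strengths are hypothetical.

```python
# Hedged sketch of rule competition: score = strength * (proportion of
# the rule's class contexts found in the stem); the top scorer wins.

def score(stem, rule):
    matched = sum(1 for ctx in rule['class'] if ctx in stem)
    return rule['strength'] * matched / len(rule['class'])

def past(stem, rules):
    return max(rules, key=lambda r: score(stem, r))['apply'](stem)

rules = [
    # sing-sang subregularity: needs both of its contexts in the stem
    {'class': ['i', 'ng'], 'strength': 5,
     'apply': lambda s: s.replace('i', 'a')},
    # regular -ed rule: vacuous context, matches every stem
    {'class': [''], 'strength': 3,
     'apply': lambda s: s + 'ed'},
]
# past('sing', rules) -> 'sang'; past('walk', rules) -> 'walked'
```

With these strengths the subregularity beats the regular rule only when its full context is matched; raise the regular rule's strength past 5 and sing would be overregularized to singed, which is the trade-off the developmental discussion below turns on.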

7.2. Developmental phenomena claimed to support the Rumelhart-McClelland model

The RM model is, as the authors point out, very rich in its empirical predictions. It is a strong point of their model that it provides accounts for several independent phenomena, all but one of them unanticipated when the model was designed. They consider four phenomena in detail: (1) the U-shaped curve representing the overregularization of strong verbs whose irregular pasts the child had previously used properly; (2) The fact that verbs ending in t or d (e.g. hit) are regularized less often than other verbs; (3) The order of acquisition of the different classes of irregular verbs manifesting different subregularities; (4) The appearance during the course of development of [past + ed] errors such as ated in addition to [stem + ed] errors such as eated.

7.2.1. Developmental sequence of productive inflection (the "U"-shaped curve)

It is by now well-documented that children pass through two stages before attaining adult competence in handling the past tense in English. In the first stage, they use a variety of correct past tense forms, both irregular and regular, and do not readily apply the regular past tense morpheme to nonce words presented in experimental situations. In the second stage, they apply the past


tense morpheme productively to irregular verbs , yielding overregularizations such as hitted and breaked for verbs that they may have used exclusively in their correct forms during the earlier stage. Correct and overregularized forms coexist for an extended period of time in this stage, and at some point during that stage, children demonstrate the ability to apply inflections to nonce forms in experimental settings . Gradually , irregular past tense forms that the child continues to hear in the input drive out the overregularized forms he or she has created productively , resulting in the adult state where a productive rule coexists with exceptions (see Berko , 1958; Brown , 1973; Cazden , 1968; Ervin , 1964; Kuczaj , 1977, 1981) .
A standard account of this sequence is that in the first stage, with no knowledge of the distinction between present and past forms, and no knowledge of what the regularities are in the adult language that relate them, the child is simply memorizing present and past tense forms directly from the input. He or she correctly uses irregular forms because the overregularized forms do not appear in the input and there is no productive rule yet. Regular past tenses are acquired in the same way, with no analysis of them into a stem plus an inflection. Using mechanisms such as those sketched in the preceding section, the child builds a productive rule and can apply it to any stem, including stems of irregular verbs. Because the child will have had the opportunity to memorize irregular pasts before relating stems to their corresponding pasts and before the evidence for the regular relationship between the two has accumulated across inputs, correct usage can in many cases precede overregularization. The adult state results from a realization, which may occur at different times for different verbs, that overregularized and irregular forms are both past tense versions of a given stem, and by the application of a Uniqueness principle that, roughly, allows the cells of an inflectional paradigm for a given verb to be filled by no more and no less than one entry, which is the entry witnessed in the input if there are competing nonwitnessed rule-generated forms and witnessed irregulars (see Pinker, 1984).
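The blocking logic of the Uniqueness principle just described can be sketched in a few lines. This is our illustration, not a model from the paper, and the verb entries are hypothetical examples: a paradigm cell holds exactly one past form, and a past tense witnessed in the input blocks the rule-generated form.

```python
# Minimal sketch of Uniqueness/blocking: a witnessed irregular past
# fills the paradigm cell and blocks the output of the regular rule.

witnessed_pasts = {'break': 'broke', 'hit': 'hit'}   # memorized from input

def past_tense(stem):
    if stem in witnessed_pasts:          # irregular entry blocks the rule
        return witnessed_pasts[stem]
    return stem + 'ed'                   # regular rule fills the empty cell

# past_tense('walk') -> 'walked'; past_tense('break') -> 'broke'
```

A child who has not yet entered broke in the paradigm would fall through to the regular rule and produce breaked, which is the overregularization pattern at issue.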
The RM model also has the ability to produce an arbitrary past tense form for a given present when they have been exemplified in the input, and to generate regular past tense forms for the same verbs by adding -ed. Of course, it does so without distinct mechanisms of rote and rule. In early stages, the links between the Wickelfeatures of a base irregular form and the Wickelfeatures of its past form are given higher weights. However, as a diverse set of regular forms begins to stream in, links are strengthened between a large set of input Wickelfeatures and the output Wickelfeatures containing features of the regular past morpheme, enough to make the regularized form a stronger output than the irregular form. During the overregularization stage, "the past tenses of similar verbs they are learning show such a consistent pattern that


the generalization from these similar verbs outweighs the relatively small amount of learning that has occurred on the irregular verb in question" (PDPII, p. 268). The irregular form eventually returns as the strongest output because repeated presentations of it cause the network to tune the connection weights so that the Wickelfeatures that are specific to the irregular stem form (and to similar irregular forms manifesting the same kind of stem-past variation) are linked more and more strongly to the Wickelfeatures specific to their past forms, and develop strong negative weights to the Wickelfeatures corresponding to the regular morpheme. That is, the prevalence of a general pattern across a large set of verbs trades off against the repeated presentation of a single specific pattern of a single verb presented many times (with subregularities constituting an intermediate case). This gives the model the ability to be either conservative (correct for an irregular verb) or productive (overregularizing an irregular verb) for a given stem, depending on the mixture of inputs it has received up to a given point. Since the model's tendency to generalize lies on a continuum, any sequence of stages of correct irregulars or overregularized irregulars is possible in principle, depending on the model's input history. How, then, is the specific shift shown by children, from correct irregular forms to a combination of overregularized and correct forms, mimicked by the model? Rumelhart and McClelland divide the training sequence presented to the model into two stages. In the first, they presented 10 high-frequency verbs to the model, 2 of them regular, 10 times each. In the second, they added 410 verbs to this sample, 334 of them regular, and presented the sample of 420 verbs 190 times.
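The trade-off between pooled regular-pattern strength and verb-specific strength can be caricatured in a few lines. This is our toy tally, not the RM network: no Wickelfeatures, no weights, just counts, and the sharing factor and input numbers (which loosely echo the two training phases described above) are hypothetical. The point it illustrates is that flipping the input mixture alone flips the output.

```python
# Toy caricature (not the RM network): the regularized output wins
# whenever the pooled strength of the regular pattern, discounted by a
# sharing factor, exceeds the verb-specific strength for 'go'.

def simulate(inputs, share=0.05):
    """inputs: list of (verb, past, is_regular); returns output for 'go'."""
    item_strength = 0.0      # specific go/went links
    regular_strength = 0.0   # pooled strength of the -ed pattern
    for verb, past, is_regular in inputs:
        if is_regular:
            regular_strength += 1.0
        if verb == 'go':
            item_strength += 1.0
    return 'went' if item_strength > share * regular_strength else 'goed'

# Phase 1: mostly irregular tokens; Phase 2: a flood of regular types.
phase1 = [('go', 'went', False)] * 10 + [('walk', 'walked', True)] * 2
phase2 = phase1 + [('v%d' % i, '', True) for i in range(400)]
# simulate(phase1) -> 'went'; simulate(phase2) -> 'goed'
```

The correct form survives phase 1 and is swamped in phase 2 purely because of the change in input mixture, which is exactly the property of the RM demonstration criticized below.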
The beginning of the downward arm of the U-shaped plot of percent correct versus time, representing a worsening of performance for the irregular verbs, occurs exactly at the boundary between the first set of inputs and the second. The sudden influx of regular forms causes the links capturing the regular pattern to increase in strength; prior to this influx, the regular pattern was exemplified by only two input forms, not many more than those exemplifying any of the idiosyncratic or subregular patterns. The shift from the first to the second stage of the model's behavior, then, is a direct consequence of a shift in the input mixture from a heterogeneous collection of patterns to a collection in which the regular pattern occurs in the majority. It is important to realize the theoretical claim inherent in this demonstration. The model's shift from correct to overregularized forms does not emerge from any endogenous process; it is driven directly by shifts in the environment. Given a different environment (say, one in which heterogeneous irregular forms suddenly start to outnumber regular forms), it appears that the model could just as easily go in the opposite direction, regularizing in its first stage and then becoming accurate with the irregular forms. In fact, since the model


always has the potential to be conservative or rule-governed, and continuously tunes itself to the input, it appears that just about any shape of curve at all is possible, given the right shifts in the mixture of regular and irregular forms in the input. Thus if the model is to serve as a theory of children's language acquisition, Rumelhart and McClelland must attribute children's transition between the first and second stage to a prior transition of the mixture of regular and irregular inputs from the external environment. They conjecture that such a transition might occur because irregular verbs tend to be high in frequency. "Our conception of the nature of [the child's] experience is simply that the child learns first about the present and past tenses of the highest frequency verbs; later on, learning occurs for a much larger ensemble of verbs, including a much larger proportion of regular forms" (p. 241). They concede that there is no abrupt shift in the input to the child, but suggest that children's acquisition of the present tense forms of verbs serves as a kind of filter for the past tense learning mechanism, and that this acquisition of base forms undergoes an explosive growth at a certain stage of development. Because the newly-acquired verbs are numerous and presumably lower in frequency than the small set of early-acquired verbs, the larger ensemble will include a much higher proportion of regular verbs. Thus the shift in the proportion of regular verbs in the input to the model comes about as a consequence of a shift from high frequency to medium frequency verbs; Rumelhart and McClelland do not have to adjust the leanness or richness of the input mixture by hand. The shift in the model's input thus is not entirely ad hoc, but is it realistic?
The use of frequency counts of verbs in written samples in order to model children's vocabulary development is, of course, tenuous.24 To determine whether the input to children's past tense learning shifts in the manner assumed by Rumelhart and McClelland, we examined Roger Brown's unpublished grammars summarizing samples of 713 utterances of the spontaneous speech of three children observed at five stages of development. The stages were defined in terms of equally spaced intervals of the children's Mean Length of Utterance (MLU). Each grammar includes an exhaustive list of the child's verbs in the sample, and an explicit discussion of whether the child

24For example, in the Kucera and Francis (1967) counts used by Rumelhart and McClelland, medium frequencies are assigned to the verbs flee, seek, mislead, and arise, which are going to be absent from a young child's vocabulary. On the other hand, stick and tear, which play a significant role in the ecology of early childhood, are ranked as low-frequency, and do and be are not in the high-frequency group where they belong. Do belongs there because of its ubiquity in questions, a fact not reflected in the written language. Be appears to be out of the study, perhaps because Rumelhart and McClelland count the frequency of the -ing forms.


was overregularizing the past tense rule.25 In addition, we examined the vocabulary of Lisa, the subject of a longitudinal language acquisition study at Brandeis University, in her one-word stage. Two of the children, Adam and Eve, began to overregularize in the Stage III sample; the third child, Sarah, began to overregularize only in the Stage V sample, except for the single form heared appearing in Stage II, which Brown noted might simply be one of Sarah's many cases of unusual pronunciations.26 We tabulated the size of each child's verb vocabulary and the proportion of verbs that were regular at each stage.

We expect that this phenomenon is quite general . The plural in English , for example , is overwhelmingly regular even among high -frequency nouns :27 only 4 out of the 25 most frequent concrete count nouns in the Francis and Kucera ( 1982) corpus are irregular . Since there are so few irregular plurals , children are never in a stage in which irregulars strongly outnumber regulars

25For details of the study, see Brown (1973); for descriptions of the unpublished grammars, see Brown (1973) and Pinker (1984). Verification of some of the details reported in the grammars, and additional analyses of children's speech to be reported in this paper, were based on on-line transcripts of the speech of the Brown children included in the Child Language Data Exchange System; MacWhinney & Snow (1985). 26A verb was counted whether it appeared in the present, progressive, or past tense form, and was counted only once if it appeared in more than one form. Since most of the verbs were in the present, this is of little consequence. We counted a verb once across its appearances alone and with various particles, since past tense inflection is independent of these differences. We excluded modal pairs such as can/could since they only occasionally encode a present/past contrast for adults. We excluded catenative verbs that encode tense and mood in English and hence which do not have obvious past tenses, such as in going to, come on, and gimme. 27We are grateful to Maryellen McDonald for this point.


in the input or in their vocabulary of noun stems. Nonetheless, the U-shaped developmental sequence can be observed in the development of plural inflection in the speech of the Brown children: for example, Adam said feet nine times in the samples starting at age 2;4 before he used foots for the first time at age 3;9; Sarah used feet 18 times starting at 2;9 before uttering foots at 5;1; Eve uttered feet a number of times but never foots. Examining token frequencies only underlines the unnaturally favorable assumptions about the input used in the RM model's training run. Not only
does the transition from conservatism to overregularization correspond to a

shift from a 20/80 to an 80/20 ratio of regulars to irregulars, but in the first, conservative phase, high-frequency irregular pairs such as go/went and make/made were only presented 10 times each, whereas in the overregularizing phase the hundreds of regular verbs were presented 190 times each. In contrast, irregular verbs are always much higher in token frequency in children's environment. Slobin (1971) performed an exhaustive analysis of the verbs heard by Eve in 49 hours of adults' speech during the phase in which she was overregularizing, and found that the ratio of irregular to regular tokens was 3:1. Similarly, in Brown's smaller samples, the ratios were 2.5:1 for Adam's parents, 5:1 for Eve's parents, and 3.5:1 for Sarah's parents. One wonders

whether presenting the RM model with 10 high-frequency verbs, say, 190 times each in the first phase could have burned in the 8 irregulars so strongly that they would never be overregularized in Phase 2. If children's transition from the first to the second phase is not driven by a change in their environments or in their vocabularies, what causes it? One possibility is that a core assumption of the RM model, that there is no psychological reality to the distinction between rule-generated and memorized forms, is mistaken. Children might have the capacity to memorize independent present and past forms from the beginning, but a second mechanism that coins and applies rules might not go into operation until some maturational change put it into place, or until the number of verbs exemplifying a rule exceeded a threshold. Naturally, this is not the only possible explanation. An alternative is that the juxtaposition mechanism that relates each stem to its corresponding past tense form has not yet succeeded in pairing up memorized stems and past forms in the child's initial stage. No learning of the past tense regularities has begun because there are no stem-past input pairs that can be fed into the learning mechanism; individually acquired independent forms are the only possibility.
Some of the evidence supports this alternative. Brown notes in the grammars that children frequently used the present tense form in contexts that clearly called for the past, and in one instance did the reverse. As the children developed, past tense forms were used when called for more often, and


evidence for an understanding of the function of the past tense form and the tendency to overregularize both increase. Kuczaj (1977) provides more precise evidence from a cross-sectional study of 14 children. He concluded that once children begin to regularize they rarely use a present tense form of an irregular verb in contexts where a past is called for. The general point is that in either case the RM model does not explain children's developmental shift from conservatism to regularization. It attempts to do so only by making assumptions about extreme shifts in the input to rule learning that turn out to be false. Either rules and stored forms are distinct, or some process other than extraction of morphophonological regularity explains the developmental shift. The process of coming to recognize that two forms constitute the present and past tense variants of the same verb, that is, the juxtaposition process, seems to be the most likely candidate.

Little needs to be said about the shift from the second stage, in which regularization and overregularization occurs, to the third (adult) stage, in which application of the regular rule and storage of irregular pasts cooccur. Though the model does overcome its tendency to overregularize previously acquired irregular verbs, we have shown in a previous section that it never properly attains the third stage. This stage is attained, we suggest, not by incremental strength changes in a pattern-finding mechanism, but by a mechanism that makes categorical decisions about whether a hypothesized rule candidate is a genuine productive rule and about whether to apply it to a given verb.

On the psychological reality of the memorized/rule-generated distinction. In discussing the developmental shift to regularization, we have shown that there can be developmental consequences of the conclusion that was forced upon us by the linguistic data, namely that rule-learning and memorization of individual forms are separate mechanisms. (In particular, we pointed out that one might mature before the other, or one requires prior learning (juxtaposing stems and past forms) and the other does not.) This illustrates a more general point: the psychological reality of the memorized/rule-generated distinction predicts the possibility of finding dissociations between the two processes, whereas a theory such as Rumelhart and McClelland's that denies that

reality predicts that such dissociations should not be found. The developmental facts are clearly on the side of there being such a distinction. First of all, children's behavior with irregular past forms during the first, pre-regularization phase bears all the signs of rote memorization, rather than a tentatively overspecific mapping from a specific set of stem features to a specific set of past features. Brown notes, for example, that Adam used fell-down 10 times in the Stage II sample without ever using fall or falling,


so his production of fell-down cannot be attributed to any sort of mapping at all from stem to past. Moreover, there is no hint in this phase of any interaction or transfer of learning across phonetically similar individual irregular forms: for example, in Sarah's speech, break/broke coexisted with make/made, and neither had any influence on take, which lacked a past form of any sort in her speech over several stages. Similar patterns can be found in the other children's speech.

A clear example of a dissociation between rote and rule over a span in which they coexist comes from Kuczaj (1977), who showed that children's mastery of irregular past tense forms was best predicted by their chronological age, but their mastery of regular past tense forms was best predicted by their Mean Length of Utterance. Brown (1973) showed that MLU correlates highly with a variety of measures of grammatical sophistication in children acquiring English. Kuczaj's logic was that irregular pasts are simply memorized, so the sheer number of exposures, which increases as the child lives longer, is the crucial factor, whereas regular pasts can be formed by the application of a rule, which must be induced as part of the child's developing grammar, so overall grammatical development is a better predictor. Thus the linguistic distinction between lists of exceptions and rule-generated forms (see Section 4.4) is paralleled by a developmental distinction between opportunities for list-learning and the sophistication of a rule system.

Another possible dissociation might be found in individual differences. A number of investigators of child language have noted that some children are conservative producers of memorized forms whereas others are far more willing to generalize productively. For example, Cazden (1968) notes that Adam was more prone to overgeneralizations than Eve and Sarah (p. 447), an observation also made by Brown in his unpublished grammars. More specifically, Table 1 shows that Sarah began to regularize the past tense two stages later than the other two children despite comparable verb vocabularies. Maratsos et al. (1987) documented many individual differences in the willingness of children to overgeneralize the causative alternation. If such differences do not reflect differences in the children's environments or vocabularies (they don't in the case of the past tense), presumably they result from the generalizing mechanism of some children being stronger or more developed than that of others, without comparable differences in their ability to record forms directly from the input. The RM model cannot easily account for any of these dissociations (other than by attributing the crucial aspects of the generalization phenomena entirely to mechanisms outside their model), because memorized forms and generalizations are handled by a single mechanism; recall that the identity map in the network must be learned by adjusting a large set of connection weights, just like any of the stem alterations; it is not there


at the outset, and is not intrinsically easy to learn. The question is not closed, but the point is that the different theories can in principle be submitted to decisive empirical tests. It is such tests that should be the basis for debate on the psychological issue at hand. Simply demonstrating that there exist contrived environments in which a network model can be made to mimic some data, especially in the absence of comparisons to alternative models, tells us nothing about the psychology of the child.

7.2.2. Performance with no-change verbs
A class of English verbs does not change in form between stem and past: beat, cut, put, hit, and others. All of these verbs end in a t or d. Bybee and Slobin (1982) suggest that this is no coincidence. They suggest that learners generate a schema for the form of past tense verbs, on the basis of prevalent regular forms, which states that past tense verbs end in t or d. A verb whose stem already ends in t or d spuriously appears to have already been inflected for past tense, and the child is likely to assume that it is a past tense form. As a result, it can be entered as the past version of the verb in the child's paradigm, blocking the output of the regular rule. Presumably this tendency could result in the unchanged verb surviving into adulthood, causing the no-change verbs to have entered the language at large in some past generation and to be easily relearned thereafter. We will call this phenomenon misperception.28

In support of this hypothesis, Bybee and Slobin found in an elicitation experiment that for verbs ending in t or d, children were more likely to produce a past tense form identical to the present than a regularized form, whereas for verbs not ending in a t or d, they were more likely to produce a regularized form than an unchanged form.
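The misperception account can be sketched as follows. This is our gloss of Bybee and Slobin's schema idea, not their formulation: a stem that already ends in t or d matches the child's past-tense template and so may be entered directly as its own past, blocking the regular rule.

```python
# Sketch of "misperception": a t/d-final stem matches the child's
# past-tense schema and is treated as already inflected.

def looks_past(form):
    return form.endswith(('t', 'd'))     # the child's past-tense schema

def childs_past(stem):
    if looks_past(stem):                 # misperceived as already inflected
        return stem                      # no-change past (e.g. hit -> hit)
    return stem + 'ed'                   # otherwise apply the regular rule

# childs_past('hit') -> 'hit'; childs_past('walk') -> 'walked'
```

Of course the real data are probabilistic (children regularize t/d-final verbs less often, not never), so this categorical sketch only marks the direction of the bias.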
28Bybee and Slobin do not literally propose that the child misanalyzes t/d-final verbs as (nonexistent) stems inflected by a rule. Rather, they postulate a static template which the child matches against unanalyzed forms during word perception to decide whether the forms are in the past tense or not.

In addition, Kuczaj (1978) found in a judgment task that children were more likely to accept correct no-change forms for nonchanging verbs than correct past tense forms for other irregular verbs such as break or send, and less likely to accept overregularized versions of no-change verbs than overregularized versions of other irregular verbs. Thus not only do children learn that verbs ending in t/d are likely to be unchanged, but this subregularity is easier for them to acquire than the kinds of changes such as the vowel alternations found in other classes of irregular verbs. Unlike the three-stage developmental sequence for regularization, children's sensitivity to the no-change subregularity for verbs ending in t/d played no role in the design of the RM model or of its simulation run. Nonetheless, Rumelhart and McClelland point out that during the phase in which the model was overregularizing, it produced stronger regularized past tense candidates for verbs not ending in t/d than for verbs ending in t/d, and stronger unchanged past candidates for verbs ending in t/d than for verbs not ending in t/d. This was true not only across the board, but also within the class of regular verbs, and within the classes of irregular verbs that do change in the past tense, for which no-change responses are incorrect. Furthermore, when
Rumelhart and McClelland examined the total past tense response of the network (that is, the set of Wickelfeatures activated in the response pool) for verbs in the different irregular subclasses, they found that the no-change verbs resulted in fewer incorrectly activated Wickelfeatures than the other classes of irregulars. Thus both aspects of the acquisition of the no-change pattern fall out of the model with no extra assumptions.

Why does the model display this behavior? Because the results of its learning are distributed over hundreds of thousands of connection weights, it is hard to tell, and Rumelhart and McClelland do not try to tease apart the various possible causal factors. Misperception cannot be the explanation, because the model always received correct stem-past pairs. There are two other possibilities. One is that connections from many Wickelfeatures to the Wickelfeatures for word-final t, and the thresholds for those Wickelfeatures,
have been affected by the many regular stem-past pairs fed into the model. The response of the model is a blend of the operation of all the learned subregularities, so there might be some transfer from regular learning in this case. For example, the final Wickelphone in the correct past tense form of hit, namely it#, shares many of its Wickelfeatures with those of the regular past tense allomorphs such as id#. Let us call this effect between-class transfer.

Language and connectionism

It is important to note that much of the between-class transfer effect may be a consequence (perhaps even an artifact) of the Wickelfeature representation and one of the measures defined over it, namely percentage of incorrect Wickelfeatures activated in the output. Imagine that the model's learning component actually treated no-change verbs and other kinds of verbs identically, generating Wickelfeature sets of equal strength for cutted and taked. Necessarily, taked must contain more incorrect Wickelfeatures than cutted: most of the Wickelfeatures that one would regard as "incorrect" for cutted, such as those that correspond to the Wickelphones tid and id#, happen to characterize the stem perfectly (StopVowelStop, InterruptedFrontInterrupted, etc.), because cut and ted are featurally very similar. On the other hand, the incorrect Wickelfeatures for taked (those corresponding to Wickelphones Akt and kt#) will not characterize the correct output form took. This effect is exaggerated further by the fact that there are many more Wickelfeatures representing word boundaries than representing the same phonemes string-internally, as Lachter and Bever (1988) point out (recall that the Wickelfeature set was trimmed so as to exclude those whose two context phonemes belonged to different phonological dimensions; since the word boundary feature # has no phonological properties, such a criterion will leave all Wickelfeatures of the form XY# intact). This difference is then carried over to the current implementation of the response-generation component, which puts response candidates at a disadvantage if they do not account for activated Wickelfeatures. The entire effect (a consequence of the fact that the model does not keep track of which features go in which positions) can be viewed either as a bug or a feature. On the one hand, it is one way of generating the (empirically correct) phenomenon that no-change responses are more common when stems have the same endings as the affixes that
would be attached to them. On the other hand, it is part of a family of phonological confusions that result from the Wickelphone/Wickelfeature representations in general (see the section on Wickelphonology) and that hobble the model's ability even to reproduce strings verbatim. If the stem-affix feature confusions really are at the heart of the model's no-change responses, then it should also have recurring problems, unrelated to learning, in generating forms such as pitted or pocketed, where the same Wickelfeatures occur in the stem and affix, or even twice in the same stem, but must be kept distinct. Indeed, the model really does seem prone to make these undesirable errors, such as generating a single CVC sequence when two are necessary, as in the no-change responses for hug, smoke, and brown, or the converse, in overmarking errors such as typeded and steppeded.

A third possible reason that no-change responses are easy for t/d-final stems is that unlike other classes of irregulars in English, the no-change class has a single kind of change (that is, no change at all), and all its members have a phonological property in common: ending with a t or d. It is also the largest irregular subclass. The model has been given relatively consistent evidence of the contingency that verbs ending in t or d tend to have unchanged past tense forms, and it has encoded that contingency, presumably in large part by strengthening links between input Wickelfeatures representing word-final t/ds and identical corresponding output Wickelfeatures. Basically, the model is potentially sensitive to any statistical correlation between input and output feature sets, and it has picked up that one. That is, the acquisition of the simple contingency "end in t/d, then no change" presumably makes the model mimic children. We can call this the within-class uniformity effect. As we have mentioned, the simplified rule-hypothesization mechanism presented in a previous section can acquire the same contingency (add a null
affix for verbs ending in a nonsonorant noncontinuant coronal), and strengthen it with every no-change pair in the input. If, as we have argued, a rule-learning model considered many rules exemplified by input pairs before being able to determine which of them was the correct productive rule or rules for the language, this rule would exist in the child's grammar and would compete with the regular d rule and with other rules, just as competing outputs are computed in the RM model.

Finally, there is a fourth mechanism, one that was mentioned in our discussion of the strong verb system. Addition of the regular suffix d to a form ending in t or d produces a phonologically illicit consonant cluster: td or dd. For regular verbs, the phonological rule of vowel insertion places an i between the two consonants. Interestingly, no irregular past ends in id, though some add a t or d. Thus we find tell/told and leave/left, but we fail to find bleed/bledded or get/gotted. A possible explanation is that a phonological rule, degemination, removes an affix after it is added, as an alternative means of avoiding adjacent coronals in the strong class. The no-change verbs would then just be a special case of this generalization, where the vowel doesn't change either. Basically, the child would capitalize on a phonological rule acquired elsewhere in the system, and might overgeneralize by failing to restrict the degemination rule to the strong verbs.

Thus we have an overlapping set of explanations for the early acquisition and overgeneralization of the no-change contingency: Bybee and Slobin cite misperception, Rumelhart and McClelland cite between-class transfer and within-class uniformity, and rule-based theories can cite within-class uniformity or overgeneralized phonology. What is the evidence concerning the reasons that children are so sensitive to this contingency? Unfortunately, a number of confounds in English make the theories difficult to distinguish.

No-change verbs have a diagnostic phonological property in common with one another. They also share a phonological property with regular inflected past tense forms. Unfortunately, they are the same property: ending with t/d. And it is the sharing of that phonological property that triggers the putative phonological rule. So this massive confound prevents one from clearly distinguishing the accounts using the English past tense rule; one cannot say that the Rumelhart-McClelland model receives clear support from its ability to mimic children in this case.

In principle, a number of more diagnostic tests are possible. First, one must explain why the no-change class is confounded. The within-class uniformity account, which is one of the factors behind the RM model's success, cannot do this: if it were the key factor, we would surmise that English could just as easily have contained a no-change class defined by any easily characterized within-class property (e.g., begin with j, end with s). Bybee and Slobin
note that across languages, it is very common for no-change stems to contain the very ending that a rule would add. While ruling out within-class uniformity as the only explanation, this still leaves misperception, transfer, and phonology as possibilities, all of which foster learning of no-change forms for stems resembling the relevant affix.

Second, one can look at cases where possessing the features of the regular ending is not confounded with the characteristics of the no-change class. For example, the nouns that do not change when pluralized in English, such as sheep and cod, do not in general end in an s or z sound. If children nonetheless avoid pluralizing nouns like ax or lens or sneeze, it would support one or more of the accounts based on stem-affix similarity. Similarly, we might expect children to be reluctant to add -ing to form verbs like ring or hamstring or rethink. If such effects were found, differences among verbs all of which resemble the affix in question could discriminate the various accounts that exploit the stem-affix similarity effect in different ways. Transfer, which is exploited by the RM model, would, all other things being equal, lead to equally likely no-change responses for all stems with a given degree of similarity to the affix. Phonology would predict that transfer would occur only when the result of adding an affix led to adjacent similar segments; thus it would predict more no-change responses for the plural of ax than the progressive of sting, which is phonologically acceptable without the intervention of any further rule.

Returning now to a possible unconfounded test of the within-class uniformity effect (implicated by Rumelhart and McClelland and by the rule-hypothesization module), one could look for some phonological property in common among a set of no-change stems that was independent of the phonological property of the relevant affix, and see whether children were more likely to yield both correct and incorrect no-change responses when a stem had that property. As we have pointed out, monosyllabicity is a property holding of the irregular verbs in general, and of the no-change verbs in particular; presumably it is for this reason that the RM model, it turns out, is particularly susceptible to leaving regular verbs ending in t/d erroneously unchanged when they are monosyllabic. As Rumelhart and McClelland point out, if children are less likely to leave verbs such as decide or devote unchanged than verbs such as cede or raid, it would constitute a test of this aspect of their theory; this test is not confounded by effects of across-class transfer.29

29 Actually, this test is complicated by the fact that monosyllabicity and irregularity are not independent: in English, monosyllabicity is an important feature in defining the domain of many morphological and syntactic rules (e.g. nicer/*intelligenter, give/*donate the museum a painting; see Pinker, 1984), presumably because in English a monosyllable constitutes the minimal or basic word (McCarthy & Prince, forthcoming). As we have pointed out, all the irregular verbs in English are monosyllables or contain monosyllabic roots (likewise for nouns, a fact related in some way to irregularity being restricted to roots and to monosyllables being prototypical English roots). So if children know that only roots can be irregular and that roots are monosyllables (see Gordon, 1986, for evidence that children are sensitive to the interaction between roothood and morphology, and Gropen & Pinker, 1986, for evidence that they are sensitive to monosyllabicity), they may restrict their tendency to no-change responses to monosyllables even if it is not the product of their detecting the first-order correlation between monosyllabicity and unchanged pasts. Thus the ideal test would have to be done for some other language, in which a no-change class had a common phonological property independent of the definition of a basic root in the language, and independent of the phonology of the regular affix.

30 Note that the facts of English do not comport well with any strong misperception account that would have the child invariably misanalyze irregular pasts as pseudo-stems followed by the regular affix: the majority of no-change verbs either have lax vowels and hence would leave phonologically impossible pseudo-stems after the affix was subtracted, such as hi- or cu-, or end in vowel-t sequences, which never occur in regular pasts and only rarely in irregulars (e.g. bought). For the same reason it is crucial to Bybee and Slobin's account that children be constrained to form the schema (past: ...t/d#) rather than several schemas matching the input more accurately, such as (past: ...[unvoiced] t#) and (past: ...[voiced] d#). If they did, they would never misperceive hit and cut as past tense forms.

A possible test of the misperception hypothesis is to look for other kinds of evidence that children misperceive certain stems as falling into a morphological category that is characteristically inflected. If so, then once the regular rule is acquired, it could be applied in reverse to such misperceived forms, resulting in back-formations. For no-change verbs, this would result in errors such as bea or blas for beat or blast. We know of no reports of such errors among past tense forms (many would be impossible for phonological reasons), but we have observed in Lisa's speech mik for mix, and in her noun system clo(thes), fen(s), sentent (cf. sentence), Santa Clau(s), upstair(s), downstair(s), bok (cf. box), trappy (cf. trapeze), and brefek (cf. brefeks = 'breakfast').30

Finally, the process by which Rumelhart and McClelland exploit stem-affix similarity, namely transfer of the strength of the output features involved in regular pairs to the no-change stems, can be tested by looking at examples of blends of regular and subregular alternations that involve classes of verbs other than the no-change class. One must determine whether children produce such blends, and whether it is a good thing or a bad thing for the RM theory that their model does so. We examine this issue in the next two sections.

In sum, the class of English verbs that do not change in the past tense involves a massive confound of within-class phonological uniformity and stem-affix similarity, leading to a complex nexus of predictions as to why children are so sensitive to the properties of the class. The relations between different models of past tense acquisition, predictions of which linguistic variables should have an effect on languages and on children, and the classes of verbs instantiating those variables are many-to-many-to-many. Painstaking testing of the individual predictions, using unconfounded sets of items in a variety of inflectional classes in English and other languages, could tease the
effects apart. At present, however, a full range of possibilities remains consistent with the data, ranging from the RM model explaining much of the phenomenon to its being entirely dispensable. The model's ability to duplicate children's performance, in and of itself, tells us relatively little.
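The between-class transfer artifact discussed earlier in this section (cutted necessarily containing a smaller proportion of incorrect features than taked) can be illustrated with a deliberately crude stand-in for Wickelphones: character trigrams over '#'-padded spellings. This toy is our own and is far simpler than the RM model's phonetic Wickelfeatures, but it shows the same asymmetry, because cutted shares its word-initial material with the correct form cut, while taked shares nothing with took.

```python
# Crude stand-in for Wickelphones: character trigrams over '#'-padded words.
def trigrams(word: str) -> set:
    padded = "#" + word + "#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def incorrect_fraction(candidate: str, correct: str) -> float:
    """Fraction of the candidate's trigrams absent from the correct past form."""
    cand = trigrams(candidate)
    return len(cand - trigrams(correct)) / len(cand)

# Overregularized 'cutted' vs. the correct no-change past 'cut': the
# word-initial trigrams (#cu, cut) are shared with the target, so only
# 4 of cutted's 6 trigrams count as incorrect.
cutted = incorrect_fraction("cutted", "cut")
# Overregularized 'taked' vs. the correct past 'took': no trigram is
# shared, so all 5 of taked's trigrams count as incorrect.
taked = incorrect_fraction("taked", "took")
```

The point carries over: any measure that scores candidates by the proportion of incorrect output features will automatically favor no-change responses for stems that already end in the affix material, independently of anything the learning component has done.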
7.2.3. Frequency of overregularizing irregular verbs in different vowel-change subclasses

Bybee and Slobin examined eight different classes of irregular past tense verbs (see the Appendix for an alternative, more fine-grained taxonomy). Their Class I contains the no-change verbs we have just discussed. Their Class II contains verbs that change a final d to t to form the past tense, such as send/sent and build/built. The other six classes involve vowel changes, and are defined by Bybee and Slobin as follows:
Class III. Verbs that undergo an internal vowel change and also add a final /t/ or /d/, e.g. feel/felt, lose/lost, say/said, tell/told.

Class IV. Verbs that undergo an internal vowel change, delete a final consonant, and add a final /t/ or /d/, e.g. bring/brought, catch/caught. [Bybee and Slobin include in this class the pair buy/bought even though it does not involve a deleted consonant. Make/made and have/had were also included even though they do not involve a vowel change.]

Class V. Verbs that undergo an internal vowel change whose stems end in a dental, e.g. bite/bit, find/found, ride/rode.

Class VI. Verbs that undergo a vowel change of /ɪ/ to /æ/ or /ʌ/, e.g. sing/sang, sting/stung.

Class VII. All other verbs that undergo an internal vowel change, e.g. give/gave, break/broke.

Class VIII. All verbs that undergo a vowel change and that end in a diphthongal sequence, e.g. blow/blew, fly/flew. [Go/went is also included in this class.]

Bybee and Slobin noted that preschoolers had widely varying tendencies to overregularize the verbs in these different classes, ranging from 10% to 80% of the time (see the first column of Table 2). Class IV and III verbs, whose past tense forms receive a final t/d in addition to their vowel changes, were overregularized the least; Class VII and V verbs, which have unchanged final consonants and a vowel change, were overregularized somewhat more often; Class VI verbs, involving the ing-ang-ung regularity, were regularized more often than that; and Class VIII verbs, which end in a diphthong sequence which is changed in the past, were overregularized most often. Bybee and Slobin again account for this phenomenon by appealing to factors affecting the process of juxtaposing corresponding present and past forms. They
suggest that the presence of an added t/d facilitates the child's recognition that Class III and IV past forms are past forms, and that the small percentage of shared segments between Class VIII present and past versions (e.g., one for see/saw or know/knew) hinders that recognition process. As the likelihood of successful juxtaposition of present and past forms decreases, the likelihood of the regular rule operating, unblocked by an irregular past form, increases, and overregularizations become more common.

Rumelhart and McClelland suggest, as in their discussion of no-change verbs, that their model as it stands can reproduce the developmental phenomenon. Since the Bybee and Slobin subjects range from 1+ to 5 years, it is not clear which stage of performance of the model should be compared with that of the children, so Rumelhart and McClelland examined the output of the model at several stages. These stages corresponded to the model's first five trials with the set of medium-frequency, predominantly regular verbs, the next five trials, the next ten trials, and an average over those first twenty trials (these intervals constitute the period in which the tendency of the model to overregularize was highest). The average strength of the overregularized forms within each class was calculated for each of these four intervals. The fit between model and data is good for the interval comprising the first five trials, which Rumelhart and McClelland concentrate on. We calculate the rank-order correlation between degree of overregularization by children and model across classes as .77 in that first interval; however, it then declines to .31 and .14 in the next two intervals, and is .31 for the average response over all three intervals.
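The rank-order correlations reported here can be computed with the standard Spearman formula, rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)), where d is the difference between paired ranks. The sketch below is ours, and the two rank lists in it are made-up numbers for six subclasses, purely to show the arithmetic; they are not Bybee and Slobin's data.

```python
# Spearman rank-order correlation between two lists of ranks (no ties):
# rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), d_i = difference of paired ranks.
def spearman_rho(ranks_a, ranks_b):
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical overregularization ranks for six verb subclasses (III-VIII);
# illustrative values only.
children = [1, 2, 3, 4, 5, 6]
model = [2, 1, 3, 4, 6, 5]
rho = spearman_rho(children, model)
```

Identical rankings give rho = 1 and fully reversed rankings give rho = -1, so values like .77 versus .14 across intervals quantify how much the model's ordering of the subclasses drifts away from the children's.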
The fact that the model is only successful at accounting for Bybee and Slobin's data for one brief interval (less than 3% of the training run) selected post hoc, whereas the data themselves are an average over a span of development of 3+ years, should be kept in mind in evaluating the degree of empirical confirmation this study gives the model. Nonetheless, the tendency of Class VIII verbs (fly/flew) to be most often regularized, and for Class III verbs (feel/felt) to be among those least often regularized, persists across all three intervals.

The model, of course, is insensitive to any factor uniquely affecting the juxtaposition of present and past forms, because such juxtaposition is accomplished by the "teacher" in the simulation run. Instead, its fidelity to children's overregularization patterns at the very beginning of its own overregularization stage must be attributed to some other factor. Rumelhart and McClelland point to differences among the classes in the frequency with which their characteristic vowel changes are exemplified by the verb corpus as a whole. Class VIII verbs have vowel shifts that are relatively idiosyncratic to the individual verbs in the class; the vowel shifts of other classes, on the
other hand, might be exemplified by many verbs in many classes. Furthermore, Class III and IV verbs, which require the addition of a final t/d, can benefit from the fact that the connections in the network that effect the addition of a final t/d have been strengthened by the large number of regular verbs. The model creates past tense forms piecemeal, by links between stem and past Wickelfeatures, and with no record of the structure of the individual words that contributed to the strengths of those links. Thus vowel shifts and consonant shifts that have been exemplified by large numbers of verbs can be applied to different parts of a base form even if the exact combination of such shifts exemplified by that base form is not especially frequent.

How well could the simplified rule-finding module account for the data? Like the RM model, it would record various subregular rules as candidates for a regular past tense rule. Assuming it is sensitive to type frequency, the rule candidates for more frequently exemplified subregularities would be stronger. And the stronger an applicable subregular rule candidate is, the less is the tendency for its output to lose the competition with the overregularized form contributed by the regular rule. Thus if Rumelhart and McClelland's explanation of their model's fit to the data is correct, a rule-finding model sensitive to type frequency presumably would fit the data as well.

This conjecture is hard to test, because Bybee and Slobin's data are tabulated in some inconvenient ways. Each class is heterogeneous, containing verbs governed by a variety of vowel shifts and varying widely as to the number of such shifts in the class and the number of verbs exemplifying them within the class and across the classes. Furthermore, there are some quirks in the classification. Go/went, the most irregular main verb in English, is assigned to Class VIII, which by itself could contribute to the poor performance of children and the RM model on that class. Conversely, have and make, which involve no vowel shift at all, are included in Class IV, possibly contributing to good average performance for the class by children and the model. (See the Appendix for an alternative classification.)

It would be helpful to get an estimate as to how much of the RM model's empirical success here might be due to the different frequencies of exemplification of the vowel-shift subregularities within each class, because such an effect carries over to a symbolic rule-finding alternative. To get such an estimate, we considered each vowel shift (e.g. i → æ) as a separate candidate rule, strengthened by a unit amount with each presentation of a verb that exemplifies it in the Rumelhart-McClelland corpus of high- and medium-frequency verbs. To allow have and make to benefit from the prevalence of other verbs whose vowels do not change, we pooled the different vowel no-change rules (a → a, i → i, etc.) into a single rule (the RM model gets a similar benefit by using Wickelfeatures, which can code for the presence of
vowels, rather than Wickelphones), whose strength was determined by the number of no-vowel-change verbs in Classes I and II.31 Then we averaged the strengths of all the subregular rules included within each of Bybee and Slobin's classes. These averages allow a prediction of the ordering of overregularization probabilities for the different subclasses, based solely on the number of irregular verbs in the corpus exemplifying the specific vowel alternations among the verbs in the class. Though the method of prediction is crude, it is just about as good at predicting the data as the output of the RM model during the interval at which it did best, and much better than the RM model during the other intervals examined. Specifically, the rank-order correlation between the number of verbs in the corpus exemplifying the vowel shifts in a class and the frequency of children's regularization of verbs in the class is .71. The data, predictions of the RM model, and predictions from our simple tabulations are summarized in Table 2.

Table 2. Ranks of tendencies to overregularize irregular verbs involving vowel shifts

[The body of this table is garbled in the scanned original. Its rows are the vowel-shift subclasses III (feel/felt), IV (seek/sought), V (bite/bit), VI (sing/sang), VII (break/broke), and VIII (blow/blew); its columns give ranks for Children*, the RM model's 1st, 2nd, and 3rd sets of trials and their average, and the average frequency of the vowel shift**. The individual rank entries are not recoverable here.]

Rank-order correlation with children's proportions: .77 (RM 1st set), .31 (RM 2nd set), .14 (RM 3rd set), .31 (RM average), .71 (avg. freq. of vowel shift).

*Mean proportions of regularizations by children are in parentheses. **Actual numbers of verbs in the irregular corpus exemplifying the vowel shifts within a class are indicated in parentheses.

31 In a sense, it would have been more accurate to calculate the strength of the no-vowel-change rule as a function of all the verbs exemplifying it in the corpus, regular and irregular, rather than just the irregular verbs. But this would have greatly stacked the deck in favor of correctly predicting the low overregularization rates for the Class IV verbs, and so we counted only irregular exemplifications of the no-vowel-change pattern.

What about the effect of the addition of t/d on the good performance on Class III and IV verbs? The situation is similar in some ways to that of the no-change verbs discussed in the previous section. The Class III and IV verbs
take some of the most frequently exemplified vowel changes (including no change for have and make); they also involve the addition of t or d at the end, causing them to resemble the past tense forms of regular verbs. Given this confound, good performance with these classes can be attributed to either factor, and so the RM model's good performance with them does not favor it over the Bybee-Slobin account focusing on the juxtaposition problem.

The question of blended responses. An interesting issue arises, however, when we consider the possible effects of the addition of t/d in combination with the effects of a common vowel shift. Recall that the RM model generates its output piecemeal. Thus strong regularities pertaining to different parts of a word can affect the word simultaneously, producing a chimerical output that need not correspond in its entirety to previous frequent patterns. To take a simplified example, after the model encounters pairs such as meet/met it has strong links between i and ɛ; after it encounters pairs such as play/played it has strong links between final vowels and final vowel-d sequences; when presented with flee it could then generate fled by combining the two regularities, even if it never encountered an ee/ed alternation before. What is interesting is that this blending phenomenon is the direct result of the RM model's lack of word structure. In an alternative rule-finding account, there would be an i → ɛ rule candidate and there would be a d-affixation rule candidate, but they would generate two distinct competing outputs, not a single blended output. (It is possible in principle that some of the subregular strong verbs such as told and sent involve the superposition of independent subregular rules, especially in the history of the language, but in modern English one cannot simply heap the effect of the regular rule on top of any subregular alternation, as the RM model is prone to do.)
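The contrast just drawn, superposition of subregularities in the network versus competition among rule outputs, can be caricatured in a few lines. The two "subregularities" below are stated over spelling rather than phonology, and both functions are our own toy inventions, but the structural point survives: applying both regularities to a single output yields the blend fled, whereas a rule-based competition offers only the two unblended candidates.

```python
# Two learned subregularities, stated (crudely) over spelling.
def vowel_shift(stem: str) -> str:
    # ee -> e, on the model of meet/met
    return stem.replace("ee", "e")

def d_suffix(stem: str) -> str:
    # add -d, on the model of play/played
    return stem + "d"

# Network-style superposition: both regularities act on one output form.
blend = d_suffix(vowel_shift("flee"))  # yields "fled"

# Rule-style competition: each candidate rule produces its own complete
# output, and one of the distinct candidates must win outright.
competitors = {vowel_shift("flee"), d_suffix("flee")}  # {"fle", "fleed"}
```

Under the rule-based scheme the blended form is simply not among the candidates, which is why the existence of genuine blends in children's speech would in principle discriminate the two architectures.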
Thus it is not really fair for us to claim that a rule-hypothesization model can account for good performance with Class III and IV verbs because they involve frequently exemplified vowel alternations; such alternations only result in correct outputs if they are blended with the addition of a t/d to the end of the word. In principle, this could give us a critical test between the network model and a rule-hypothesization model: unlike the ability to soak up frequent alternations, the automatic superposition of any set of them into a single output is (under the simplest assumptions) unique to the network model.

This leads to two questions: Is there independent evidence that children blend subregularities? And does the RM model itself really blend subregularities? We will defer answering the first question until the next section, where it arises again. As for the second, it might seem that the question of whether response blending occurs is perfectly straightforward, but in fact it is not. Say the model's active output Wickelfeatures in response to flee include those for medial ɛ and those for word-final d. Is the overt response of the model fled, a correct blend, or does it set up a competition between [flid] and [flɛ], choosing one of them, as the rule-hypothesization model would? In principle, either outcome is possible, but we are never given the opportunity to find out. Rumelhart and McClelland do not test their model against the Bybee and Slobin data by letting it output its favored response. Rather, they externally assemble alternatives corresponding to the overregularized and correct forms, and assess the relative strengths of those alternatives by observing the outcome of the competition in the restricted-choice whole-string binding network (recall that the output of the associative network, a set of activated Wickelfeatures, is the input to the whole-string binding network). These strengths are determined by the number of activated Wickelfeatures
that each is consistent with. The result is that correct alternatives that also happen to resemble blends of independent subregularities are often the response chosen. But we do not know whether the model, if left to its own devices, would produce a blend as its top-ranked response.
devices, would produce a blend as its top -ranked response. Rumelhart and McClelland did not perform this test because it would have been too .computationally intensive given the available hardware : recall that
the only way to get the model to produce a complete response form on its

own is by giving it (roughly ) all possible output strings (that is, all permuta tions of segments) and having them compete against each other for active Wickelfeatures in an enormous " unconstrained whole -string binding network " . This is an admitted kluge designed to give approximate predictions of the strengths of responses that a more realistic output mechanism would construct . Rumelhart and McClelland only ran the unconstrained whole string binding network on a small set of new low -frequency verbs in a transfer test involving no further learning . It is hard to predict what will happen when
this network operates because it involves a " rich -get -richer " scheme in the

competition among whole strings , by which a string that can uniquely account for some Wickelfeatures (including Wickelfeatures incorrectly turned on as part of the noisy output function ) gets disproportionate credit for the features that it and its competitors account for equally well , occasionally leading to unpredictable winners . In fact , the whole -string mechanism does yield blends such as slip / slept. But as mentioned , these blends are also occasionally bizarre , such as mailed / membled or tour / toureder . And this is why the question of overt blended outputs is foggy : it is unclear whether tuning the whole string binding network , or a more reasonable output construction mechanism ,
so that the bizarre blends were eliminated , would also eliminate the blends

that perhaps turn out to be the correct outputs for Class III and Iv .32
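The restricted-choice competition just described can be sketched as a toy calculation. This is our illustration, not the actual RM implementation: simple letter trigrams stand in for Wickelfeatures, and the strings and feature sets are hypothetical simplifications.

```python
# Toy sketch of a restricted-choice whole-string competition: candidate
# strings are scored by how many of the active Wickelfeature-like trigrams
# they are consistent with. All strings and features here are hypothetical.

def trigrams(word):
    """Decompose a word into trigrams, with '#' marking word boundaries."""
    padded = "#" + word + "#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

# Suppose the associative net activates features for a medial-E vowel change
# and for a word-final d (two independent subregularities): a blended set.
active = trigrams("flE") | trigrams("flEd")

def strength(candidate, active_features):
    # Number of active features the candidate string is consistent with.
    return len(trigrams(candidate) & active_features)

candidates = ["flid", "flE", "flEd"]  # regularized, vowel-change-only, blend
scores = {c: strength(c, active) for c in candidates}
best = max(candidates, key=lambda c: scores[c])
print(scores, best)
```

On this toy scoring the blended string accounts for more of the active features than either single-subregularity alternative, which is why a restricted-choice test can favor a blend without showing that the model's own top-ranked free response would be one.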
32To complicate matters even further, even outright blends are possible in principle within the rule-based ...

Language and connectionism 157
In sum, the relative tendencies of children to overregularize different classes of irregular verbs shows a moderately high correlation with the behavior of the RM model. But much of this correlation is simply due to the frequencies with which the vowel alternations of these classes are exemplified in the corpus, frequencies on the basis of which the model assigns strength to overregularized forms and to which a rule-hypothesization model would also be sensitive, even under the simplest assumptions. The one ability that could naturally lead to tests distinguishing the models is the network model's blending of independent subregularities into a single output, which follows from its lack of word structure. Unfortunately, whether children, or the model in its present incarnation, actually output blended responses is unknown; it is a question we will turn to shortly.
7.2.4. "Eated" versus "ated" errors

The developmental phenomenon examined here concerns errors consisting of the base form of an irregular verb affixed with ed, such as eated or breaked, as compared with errors consisting of the irregular past form affixed with ed, such as ated or braked. Kuczaj (1977) found that errors of the eated type tend to occur earlier in the course of development than errors of the ated type. In the RM model, the strength of the ated outputs increased considerably relative to that of the eated outputs over the course of training; thus the model is capable of mimicking this developmental change. What causes the model, and the child, to produce such doubly-marked forms? There are two possibilities. One is that the child sometimes misconstrues the irregular past tense form as a base form itself and feeds it into the past tense formation mechanism, which he or she always applies correctly: ate is taken as a stem, and the result is a correctly inflected error. The alternative is that the ated form is a blend of the irregular past form and the regularized base + ed form. This explanation is similar to the one provided for the blended forms for relatively infrequent Class III and IV verbs in the previous section, and to one of the explanations of the
model's tendency to leave t/d-final stems unchanged discussed in the section before that. As in the previous discussions, the lack of a realistic response production mechanism makes it unclear whether the model would ever actually produce past + ed blends when it is forced to utter a response on its own, or whether the phenomenon is confined to such forms simply increasing in strength in the three-alternative forced-choice experiment because only the past + ed form by definition contains three sets of features, all of them strengthened in the course of learning (its idiosyncratic features, the features output by subregularities, and the features of regularized forms). In Rumelhart and McClelland's transfer test on new verbs, they chose a minimum strength value of .2 as a criterion for when a form should be considered a likely overt response of the model. By this criterion, the model should be seen as rarely outputting past + ed forms, since such forms on average never exceed a strength of .15. But let us assume for now that such forms would be output, and that blending is their source.

At first one might think that the model has an advantage in that it is consistent with the fact that ated errors increase relative to the eated errors in later stages, a phenomenon not obviously predicted by the misconstrued-stem account.33 However, many of the phenomena we discuss below that favor the misconstrued-stem account over the RM model appear during the same relatively late period as the ated errors (Kuczaj, 1981), so lateness itself does not distinguish the accounts. Moreover, Kuczaj (1977) warns that the late-ated effect is not very robust and is subject to individual differences. In a later study (Kuczaj, 1978), he eliminated these sampling problems by using an experimental task in which children judged whether various versions of past tense forms sounded "silly". He found in two separate experiments that while children's acceptance of the eated forms declined monotonically, their acceptance of ated forms showed an inverted V-shaped function, first increasing but then decreasing relative to eated-type errors. Since in the RM model the strengths of both forms monotonically approach an asymptote near zero, with the curves crossing only once, the model demonstrates no special ability to track the temporal dynamics of the two kinds of errors. In the discussion below we will concentrate on the reasons that such errors occur in the first place.

Once again, a confound in the materials provided by the English language undermines Rumelhart and McClelland's conclusion that their model accounts for

children's ated-type errors. Irregular past tense forms appear in the child's input and hence can be misconstrued as base forms. They also are part of the model's output for irregular base forms and hence can be blended with the regularized response. Until forms which have one of these properties and not the other are examined, the two accounts are at a stalemate.

Fortunately, the two properties can be unconfounded. Though the correct irregular past will usually be the strongest non-regularized response of the network, the network is also sensitive to subregularities among vowel changes, and hence one might expect blends consisting of a frequent and consistent but incorrect vowel change plus the regular ed ending. In fact the model does produce such errors for regular verbs it has not been trained on, such as shape/shipped, sip/sepped, slip/slept, and brown/brawned. Since the stems of these responses are either not English verbs or have no semantic relationship to the correct verb, such responses can never be the result of mistakenly feeding the wrong base form of the verb into the past tense formation process. Thus if the

blending assumed in the Rumelhart and McClelland model is the correct explanation for children's past + ed overregularizations, we should see children making these and other kinds of blend errors. We might also expect errors consisting of a blend of a correct irregular alternation of a verb plus a frequent subregular alternation, such as send/soant (a blend of the d → t and ɛ → o subregularities) or think/that (a blend of the ing → ang and final-consonant-cluster → t subregularities). (As mentioned, though, these last errors are not ruled out in principle in all rule-based models, since superposition may have had a role in the creation of several of the strong past forms

in the history of English, but indiscriminately adding the regular affix onto strong pasts is ruled out by most theories of morphology.) Conversely, if the phenomenon is due to incorrect base input forms, we might expect to see other inflection processes applied to the irregular past, resulting in errors such as wenting and braking or wents and brokes. Since mechanisms for progressive or present indicative inflection would never be exposed to the idiosyncrasies or subregularities of irregular past tense forms under Rumelhart and McClelland's assumptions, such errors could not result from blending of outputs. Similarly, irregular pasts should appear in syntactic contexts calling for bare stems if children misconstrue irregular pasts as stems. In addition, we might expect to find cases where ed is added to incorrect base forms that are plausible confusions of the correct base form but implausible results of the mixing of subregularities. Finally, we might expect that if children are put in a situation in which the correct stem of a verb is provided for them, they would not generate past + ed errors, since the source of such errors would be eliminated.

All five predictions work against the RM model and in favor of the explanation based on incorrect inputs. Kuczaj (1977) reports that his transcripts contained no examples where the child overapplied any subregularity, let alone a blend of two of them or of a subregularity plus the regular ending. Bybee and Slobin do not report any such errors in children's speech, though they do report them as adult slips of the tongue in a time-pressured speaking task designed to elicit errors. We examined the full set of transcripts of Adam, Eve, Sarah, and Lisa for words ending in -ed. We found 13 examples of irregular past + ed or en errors in past and passive constructions:

(36) Adam: fanned, tooked, staled (twice), braked (participle), felled
Eve: tored
Sarah: flewed (twice), caughted, stucked (participle)
Lisa: tamed (participle), taaken (twice) (participle), sawn (participle)

The participle forms must be interpreted with caution. Because English irregular participles sometimes consist of the stem plus en (e.g. take - took - taken) but sometimes consist of the irregular past plus en (e.g. break - broke - broken), errors like tooken could reflect the child overextending this regularity to past forms of verbs that actually follow the stem + en pattern; the actual stem or even the child's mistaken hypothesis about it may play no role. What about errors consisting of a subregular vowel alternation plus the addition of ed? The only examples where an incorrect vowel other than that of the irregular form appeared with ed are the following:

(37) Adam: I think it's not fulled up to de top.
I think my pockets gonna be all fulled up.
I'm gonna ask Mommy if she has any more grain ... more stuff that she needs grained. [He has been grinding crackers in a meat grinder, producing what he calls "grain".]
Sarah: Oo, he hahted.
Lisa: I brekked your work.


For Adam, neither vowel alternation is exemplified by any of the irregular verbs in the Rumelhart-McClelland corpus, but in both cases the stem is identical to a non-verb that is phonologically and semantically related to the target verb and hence may have been misconstrued as the base form of the verb or converted into a new base verb. Sarah's error can be attributed directly to phonological factors, since she also pronounced dirt, involving no morphological change, as "dawt", according to the transcriber. This leaves Lisa's brekked as the only putative example; note that unlike her single-subregularity errors such as bote for bit, which lasted for extended periods of time, this appeared only once, and the correct form broke was very common in her speech. Furthermore, blending is not a likely explanation: among high and middle frequency verbs, the alternation is found only in say/said and to a lesser extent in pairs such as sleep/slept and leave/left, whereas in many other alternations the e sound is mapped onto other vowels (bear/bore, wear/wore, tear/tore, take/took, and shake/shook). Thus it seems unlikely that the RM model would produce a blend in this case but not in the countless other opportunities for blending that the children avoided. Finally, we note that Lisa was referring to a pile of papers that she scattered, an unlikely example of breaking but a better one of wrecking, which may have been the target serving as the real source of the blend (and not a past tense subregularity) if it was a blend. In sum, except perhaps for this last example under an extremely charitable interpretation, the apparent blends seem far more suggestive of an incorrect stem correctly inflected than a blend between two past tense subregularities.

This conclusion is strengthened when we note that children do make errors such as wents and wenting, which could only result from inflecting the wrong stem.
Kuczaj (1981) reports frequent use of wenting, ating, and thoughting in the speech of his son, and we find in Eve's speech fells and wents, and in Lisa's speech blow awayn, lefting, hidding (= hiding), stoling, to took, to shot, and might loss. These last three errors are examples of a common phenomenon sometimes called 'overtensing', which, because it occurs mostly with irregulars (Maratsos & Kuczaj, 1978), is evidence that irregulars are misconstrued as stems (identical to infinitives in English). Some examples from Pinker (1984) include Can you broke those?, What are you did?, She gonna fell out, and I'm going to sit on him and made him broken. Note that since many of these forms occur at the same time as the ated errors, the relatively late appearance of ated forms may reflect the point at which stem extraction (and mis-extraction) in general is accomplished.

Finally, Kuczaj (1978) presents more direct evidence that past + ed errors are due to irregular pasts misconstrued as stems. In one of his tasks, he had children convert a future tense form (i.e. "X will + (verb stem)") into a past


tense form (i.e. "X already (verb past)"). Past + ed errors virtually vanished (in fact they completely vanished for two of the three age groups). Kuczaj argues that the crucial factor was that children were actually given the proper base forms. This shows that children's derivation of the errors must be from ate to ated, not, as it is in the RM model, from eat to ated.
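The contrast between the two derivations can be made concrete with a toy sketch. This is our illustration, not the RM model's machinery; the pseudo-phonemic spellings eyt/iyt and the id allomorph are illustrative assumptions.

```python
# Toy contrast between the two hypothesized sources of "ated".
# Rough pseudo-phonemic stand-ins: "iyt" = eat, "eyt" = ate.

def regular_past(stem):
    # A correct, productive regular rule (toy): affix the past morpheme,
    # rendered "id" after a t/d-final stem.
    return stem + "id"

# Misconstrued-stem account: stem extraction goes wrong (ate is taken as
# a stem), but the rule itself is applied correctly.
from_wrong_stem = regular_past("eyt")              # "eytid", i.e. ated

# Blending account: the right stem goes in, but the irregular output and
# the regularized output are superimposed instead of competing.
irregular_past = "eyt"                             # ate
regularized = regular_past("iyt")                  # "iytid", i.e. eated
blend = irregular_past + regularized[len("iyt"):]  # splice the ending on

print(from_wrong_stem, blend)  # both routes yield the same surface form
```

Because the two routes converge on the same surface form, the form ated by itself cannot decide between them; that is why the unconfounded tests discussed in the text are needed.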

Yet another test of the source of apparently blended errors is possible when we turn our attention to the regular system. If the child occasionally misanalyzes a past form as a stem, he or she should do so for regular inflected past forms and not just irregular ones, resulting in errors such as talkeded.
The RM model also produces such errors as blends, but for reasons that Rumelhart and McClelland do not explain, all these errors involve regular verbs whose stems end in p or k: carpeded, drippeded, mappeded, smokeded, snappeded, steppeded, and typeded, but not browneded, warmeded, teareded, or clingeded, nor, for that matter, irregular stems of any sort: the model did not output creepeded/crepted, weepeded/wepted, diggeded, or stickeded. We suggest the following explanation for this aspect of the model's behavior. The phonemes p and k share most of their features with t. Therefore on a Wickelfeature-by-Wickelfeature basis, learning that t and d give you id in the output transfers to p, b, g, and k as well. So there will be a bias toward id responses after all stops. Since there is also a strong bias toward simply adding t, there will be a tendency to blend the 'add t' and the 'add id' responses. Irregular verbs, as we have noted, never end in id, so to the extent that the novel irregulars resemble trained ones (see Section 4.4), the features of the novel irregulars will inhibit the response of the id Wickelfeatures and double-marking will be less common.

In any case, though Rumelhart and McClelland cannot explain their model's behavior in this case, they are willing to predict that children as well will double-mark more often for p- and k-final stems. In the absence of an

explanation as to why the model behaved as it did, Rumelhart and McClelland should just as readily extrapolate the model's reluctance to double-mark irregular stems and test the prediction that children should double-mark only regular forms (if our hypothesis about the model's operation is correct, the two predictions stem from a common effect). Checking the transcripts, we did find ropeded and stoppeded (the latter uncertain in transcription) in Adam's speech, and likeded and pickeded in Sarah's, as Rumelhart and McClelland would predict. But Adam also said tieded and Sarah said buyded and makeded (an irregular). Thus the model's prediction that double-marking should be specific to stems ending with p and k, and then only when they are regular, is not borne out. In particular, note that buyded and tieded cannot be the result of a blend of subregularities, because there is no subregularity


according to which buy or tie would tend to attract an id ending.34 Finally, Slobin (1985) notes that Hebrew contains two quite pervasive rules for inflecting the present tense, the first involving a vowel change, the second a consonantal prefix and a different vowel change. Though Israeli children overextend the prefix to certain verbs belonging to the first class, they never blend this prefix of the second class with the vowel change of the first class. This may be part of a larger pattern: children seem to respect the integrity of the word as a cohesive unit, one that can have affixes added to it and that

can be modified by general phonological processes, but that cannot simply be composed as a blend of bits and pieces contributed by various regular and irregular inflectional regularities . It is suggestive in this regard that Slobin (1985) , in his crosslinguistic survey , lists examples from the speech of children learning Spanish, French , German , Hebrew , Russian , and Polish , where the language mandates a stem modification plus the addition of an affix and children err by only adding the affix . Once again we see that the model does not receive empirical support from its ability to mimic a pattern of developmental data . The materials that Rumelhart and McClelland looked at are again confounded in a way that leaves their explanation and the standard one focusing on the juxtaposition problem equally plausible given only the fact of ated errors . One can do better than that . By looking at unconfounded cases, contrasting predictions leading to critical tests are possible . In this case, six different empirical tests all go against the explanation inherent in the Rumelhart and McClelland model : absence of errors due to blending of subregularities , presence of wenting-type errors , presence of errors where irregular pasts are used in nonpast contexts , presence of errors where the regular past ending is mistakenly applied to non -verb stems, drastic reduction of ated-errors when the correct stem is supplied to the child , and presence of errors where the regular ending is applied twice to stems that are irregular or that end in a vowel . These tests show that errors such as ated are the result of the child incorrectly feeding
ate as a base form into the past tense inflection mechanism , and not the result

of blending components of ate and eated outputs .

34One might argue that the misconstrued-stem account would fail to generate these errors, too, since it would require that the child first generate maked and buyed using a productive past tense rule and then forget that the forms really were in the past tense. Perhaps, the argument would go, some other kind of blending caused the errors, such as a mixture of the two endings d and id, which are common across the language even if the latter is contraindicated for these particular stems. In fact, the misconstrued-stem account survives unscathed, because one can find errors not involving overinflection where child-generated forms are treated as stems: for example, Kuczaj (1976) reports sentences such as They wouldn't haved a house and She didn't goed.
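The feature-sharing explanation offered above for why the model's double-marking clusters on p- and k-final stems can be illustrated with a toy overlap count. This is our sketch: the six-valued feature vectors are simplified standard values, not the model's actual Wickelfeature encoding.

```python
# Toy illustration of feature-based transfer: phonemes that share more
# distinctive-feature values with t or d inherit more of the support for
# an 'add id' response. Feature values are simplified, not the RM coding.

FEATURES = {  # (consonantal, sonorant, continuant, voiced, coronal, anterior)
    "t": (1, 0, 0, 0, 1, 1),
    "d": (1, 0, 0, 1, 1, 1),
    "p": (1, 0, 0, 0, 0, 1),
    "b": (1, 0, 0, 1, 0, 1),
    "k": (1, 0, 0, 0, 0, 0),
    "g": (1, 0, 0, 1, 0, 0),
}

def overlap(a, b):
    """Number of feature values two phonemes share."""
    return sum(x == y for x, y in zip(FEATURES[a], FEATURES[b]))

# Support learned for t/d-final stems transfers in proportion to overlap.
transfer = {p: max(overlap(p, "t"), overlap(p, "d")) for p in FEATURES}
print(transfer)
```

On this count p, b, k, and g all sit within one or two feature values of t or d, consistent with the suggestion in the text that learning keyed to t/d Wickelfeatures leaks to the other stops.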


7.3. Summary of how well the model fares against the facts of children's development

What general conclusions can we make from our examination of the facts of children's acquisition of the English past tense form and the ability of the RM model to account for them? This comparison has brought several issues to light.

To begin with, one must reject the premise that is implicit in Rumelhart and McClelland's arguments, namely that if their model can duplicate a phenomenon, the traditional explanation of that phenomenon can be rejected.
For one thing, there is no magic in the RM model duplicating correlations in language systems: the model can extract any combination of over 200,000 atomic regularities, and many regularities that are in fact the consequences of an interaction among principles in several grammatical components will be detectable by the model as first-order correlations because they fall into that huge set. As we argued in Section 4, this leaves the structure and constraints on the phenomena unexplained. But in addition, it leaves many of the simple goodness-of-fit tests critically confounded. When the requirements of a learning system designed to attain the adult state are examined, and when unconfounded tests are sought, the picture changes.

First, some of the developmental phenomena can be accounted for by any mechanism that keeps records of regularities at several levels of generality, assigns strengths to them based on type-frequency of exemplification, and lets them compete in producing past tense candidate forms. These phenomena include children's shifts or waffling between irregular and overregularized past tense forms, their tendency not to change verbs ending in t/d, and their tendency to overregularize verbs with some kinds of vowel alternations less than others. Since there are good reasons why rule-hypothesization models should be built in this way, these phenomena do not support the RM model as a whole or in contrast with rule-based models in general, though they do support the more general (and uncontroversial) assumption of competition among multiple regularities of graded strength during acquisition.

Second, the lack of structures corresponding to distinct words in the model, one of its characteristic features in contrast with rule-based models, might be related to the phenomenon of blended outputs incorporating independent subregularities. However, there is no good evidence that children's correct
responses are ever the products of such blends, and there is extensive evidence from a variety of sources that their ated-type errors are not the products of such blends. Furthermore, given that many blends are undesirable, it is not clear that the model should be allowed to output them when a realistic model of its output process is constructed.


Third, the three-stage or U-shaped course of development for regular and irregular past tense forms in no way supports the RM model. In fact, the model provides the wrong explanation for it, making predictions about changes in the mixture of irregular and regular forms in children's vocabularies that are completely off the mark.

This means that in the two hypotheses for which unconfounded tests are available (the cause of the U-shaped overregularization curve, and the genesis of ated errors), both of the processes needed by the RM model to account for developmental phenomena, frequency-sensitivity and blending, have been shown to play no important role, and in each case processes appealing to rules (to the child's initial hypothesization of a rule in one case, and to the child's misapplication of it to incorrect inputs in the second) have received independent support. And since the model's explanations in the two confounded cases (performance with no-change verbs, and order of acquisition of subclasses) appeal in part to the blending process, the evidence against blending in our discussion of the ated errors taints these accounts as well. We conclude that the developmental facts discussed in this section and the linguistic facts discussed in Section 4 converge on the conclusion that knowledge of language involves the acquisition and use of symbolic rules.
8. General discussion

Why subject the RM model to such painstaking analysis? Surely few models of any kind could withstand such scrutiny. We did it for two reasons. First, the conclusions drawn by Rumelhart and McClelland are bold and revolutionary: that PDP networks provide exact accounts of psychological mechanisms that are superior to the approximate descriptions couched in linguistic rules; that there is no induction problem in their network model; that the results of their investigation warrant revising the way in which language is studied. Second, because the model is so explicit and its domain so rich in data, we have an unusual opportunity to evaluate the Parallel Distributed Processing approach to cognition in terms of its concrete technical properties rather than bland generalities or recycled statements of hopes or prejudices.

In this concluding section we do four things: we briefly evaluate Rumelhart and McClelland's strong claims about language; we evaluate the general claims about the differences between connectionist and symbolic theories of cognition that the RM model has been taken to illustrate; we examine some of the ways that the problems of the RM model are inherently due to its PDP architecture, and hence ways in which our criticisms implicitly extend to


certain kinds of PDP models in general; and we consider whether the model could be salvaged by using more sophisticated connectionist mechanisms.

8.1. On Rumelhart and McClelland's strong claims about language

One thing should be clear. Rumelhart and McClelland's PDP model does not differ from a rule-based theory in providing a more exact account of the facts of language and language behavior. The situation is exactly the reverse. As far as the adult steady state is concerned, the network model gives a crude, inaccurate, and unrevealing description of the very facts that standard linguistic theories are designed to explain, many of them in classic textbook cases. As far as children's development is concerned, the model's accounts are at their best no better than those of a rule-based theory with an equally explicit learning component, and for two of the four relevant developmental phenomena, critical empirical tests designed to distinguish the theories work directly against the RM model's accounts but are perfectly consistent with the notion that children create and apply rules. Given these empirical failings, the ontological issue of whether the PDP and rule-based accounts are realist portrayals of actual mechanisms as opposed to convenient approximate summaries of higher-order regularities in behavior is rather moot.

There is also no basis for Rumelhart and McClelland's claim that in their network model, as opposed to traditional accounts, "there is no induction problem". The induction problem in language acquisition consists, among other things, of finding sets of inputs that embody generalizations, extracting the right kinds of generalizations from them, and deciding which generalizations can be extended to new cases. The model does not deal at all with the first problem, which involves recognizing that a given word encodes the past tense and that it constitutes the past tense version of another word.
This juxtaposition problem is relegated to the model's environment (its "teacher"), or more realistically, some unspecified prior process; such a division of labor would be unproblematic if it were not for the fact that many of the developmental phenomena that Rumelhart and McClelland marshal in support of their model may be intertwined with the juxtaposition process (the onset of overregularization, and the source of ated errors, most notably). The second part of the induction problem is dealt with in the theory the old-fashioned way: by providing it with an innate feature space that is supposed to be appropriate for the regularities in that domain. In this case, it is the distinctive features of familiar phonological theories, which are incorporated into the model's Wickelfeature representations (see also Lachter & Bever, 1987). Aspects in which the RM model differs from traditional accounts in how it uses distinctive features, such as representing words as unordered


pools of feature trigrams, do not clearly work to the advantage of the model, to put it mildly. Finally, the theory deals very poorly with the crucial third aspect of the induction problem, the proper constraints on generalization to new items. It cannot make the proper generalizations in the morphosyntactic or the phonological domain, and in its actual performance it errs in two ways with respect to the application of the regular rule: overestimating the significance of subgeneralizations among the irregular verbs and underestimating the generality of the regular rule.

As for the third of their claims, that the success of their model calls for a revised understanding of language and language acquisition, it is hardly warranted in light of the problems we have discussed. We do not wish to deny that Rumelhart and McClelland's work has increased our understanding of language acquisition; to give credit where credit is due, it has raised intriguing questions about the role of subregularities and family resemblance structure in the mixture of rote and rule, the role of frequencies of exemplification in generating subregularity effects and in the causes of overregularization, the tradeoffs between blending and the generation of overt outputs, the effects of input frequencies on the developmental transitions between stages of regular and irregular forms, and the relative roles of the juxtaposition process, the stem-extraction process, and the generalization process in the developmental pattern. But the present model does not give radically new or superior answers to the questions it raises.

8.2. Implications for the metatheory and methodology of connectionism

Often the RM model is presented as a paradigm case not only of a new way to study language but of a new way to understand what a cognitive theory can be. In particular, a persistent theme in connectionist metatheory affirms the claim that symbolic theories provide at best a convenient approximate 'macro-level' description of some domain of inquiry, a description that may be useful in some circumstances but that is never exact and never captures what is really going on:

Subsymbolic models accurately describe the microstructure of cognition, while symbolic models provide an approximate description of the macrostructure. (Smolensky, in press, p. 21)

We view macrotheories as approximations to the underlying microstructure which the distributed model presented in our paper attempts to capture. As approximations they are often useful, but in some situations it will turn out that an examination of the microstructure may bring much deeper insight. (Rumelhart & McClelland, PDPI, p. 125)

... these [macro-level] models are approximations and should not be pushed too far. (Rumelhart & McClelland, p. 126; bracketed material ours here and elsewhere)
168 S. Pinker and A. Prince

In such discussions the relationship between Newtonian physics and Quantum Mechanics typically surfaces as the desired analogy. One of the reasons that connectionist theorists tend to reserve no role for higher-level theories as anything but approximations is that they create a dichotomy that, we think, is misleading. They associate the systematic, rule-based analysis of linguistic knowledge with what they call the "explicit inaccessible rule" view of psychology, which

... holds that the rules of language are stored in explicit form as propositions, and are used by language production, comprehension, and judgment mechanisms. These propositions cannot be described verbally [by the untutored native speaker]. (Rumelhart and McClelland, PDPII, p. 217)

Their own work is intended to provide "an alternative to explicit inaccessible rules ... a mechanism in which there is no explicit representation of a rule" (p. 217). The implication, or invited inference, seems to be that a formal rule is an eliminable descriptive convenience unless inscribed somewhere and examined by the neural equivalent of a read-head in the course of linguistic information processing. In fact, there is no necessary link between realistic interpretation of rule theories and the "explicit inaccessible" view. Rules could be explicitly inscribed and accessed, but they also could be implemented in hardware in such a way that every consequence of the rule-system holds. If the latter turns out to be the case in a cognitive domain, there is a clear sense in which the rule-theory is validated - it is exactly true - rather than faced with a competing alternative or relegated to the status of an approximate convenience.35

35 Note as well that many of the examples offered to give common-sense support to the desirability of eliminating rules are seriously misleading, because they appeal to a confusion between attributing a rule-system to an entity and attributing the wrong rule-system to an entity. An example that Rumelhart and McClelland cite, in which it is noted that bees can create hexagonal cells in their hive with no knowledge of the rules of geometry, gains its intuitive force because of this confusion.

Consider pattern-associators like Rumelhart and McClelland's, which give symbolic output from symbolic input. Under a variety of conditions, such a network will function as a rule-implementer. To take only the simplest: suppose that all connection weights are 0 except those from the input node for feature fi to the output node for fi, which are set to 1. Then the network will implement the identity map. There is no read-head, write-head, or executive overseeing the operation, yet it is legitimate and even enlightening to speak of it in terms of rules manipulating symbols.

More realistically, one can abstract from the RM pattern associator an implicit theory implicating a "representation" consisting of a set of unordered Wickelfeatures and a list of "rules" replacing Wickelfeatures with other Wickelfeatures. Examining the properties of such rules and representations is quite revealing. We can find out what it takes to add /d/ to a stem; what it takes to reverse the order of phonemes in an input; whether simple local modifications of a string are more easily handled than complex global ones; and so on. The results we obtain carry over without modification to the actual pattern associator, where much more complex conditions prevail. The deficiencies of Wickelphone/Wickelfeature transformation are as untouched by the addition of thresholds, logistic probability functions, temperatures, and parameters of that ilk as they are by whether the program implementing the model is written in Fortran or C.
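The simplest case discussed above - a network whose preset weights implement the identity map - can be written out directly. The following is a minimal illustrative sketch; the feature count, 0/1 activations, and threshold value are assumptions made for the example, not the RM model's actual Wickelfeature encoding:

```python
# Minimal sketch of a rule-implementing pattern associator: all connection
# weights are 0 except input feature i -> output feature i, which are set to 1.
# (Eight binary features and a 0.5 threshold are illustrative assumptions.)

def associate(weights, in_pattern, threshold=0.5):
    """Each output unit fires iff its summed weighted input exceeds threshold."""
    n = len(in_pattern)
    return [int(sum(weights[i][j] * in_pattern[i] for i in range(n)) > threshold)
            for j in range(n)]

n_features = 8
identity_weights = [[1 if i == j else 0 for j in range(n_features)]
                    for i in range(n_features)]

stem = [1, 0, 1, 1, 0, 0, 1, 0]          # an arbitrary input feature vector
past = associate(identity_weights, stem)

assert past == stem   # no executive or read-head, yet the rule 'copy the input' holds
```

Every consequence of the rule "map each input to itself" holds of this network exactly, which is the sense in which the network can be said to implement the rule.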

An important role of higher-level theory, as Marr for one has made clear, is to delineate the basic assumptions that lower-level models must inevitably be built on. From this perspective, the high-level theory is not some approximation whose behavior offers a gross but useful guide to reality. Rather, the relation is one of embodiment: the lower-level theory embodies the higher-level theory, and it does so with exactitude. The RM model has a theory of linguistic knowledge associated with it; it is just that the theory is so unorthodox that one has to look with some care to find it. But if we want to understand the model, dealing with the embodied theory is not a convenience, but a necessity, and it should be pushed as far as possible.

8.2.1. When does a network implement a rule?

Nonetheless, as we pointed out in the Introduction, it is not a logical necessity that a cognitive model implement a symbolic rule system, either a traditional or a revisionist one; the "eliminative" or rule-as-approximation connectionism that Rumelhart, McClelland, and Smolensky write about (though do not completely succeed in adhering to in the RM model) is a possible outcome of the general connectionist program. How could one tell the difference? We suggest that the crucial notion is the motivation for a network's structure.

In a radical or eliminative connectionist model, the overall properties of the rule-theory of a domain are not only caused by the mechanisms of the micro-theory (that is, the stipulated properties of the units and connections) but follow in a natural way from micro-assumptions that are well-motivated on grounds that have nothing to do with the structure of the domain under macro-scrutiny. The rule-theory would have second-class status because its assumptions would be epiphenomena: if you really want to understand why things take the shape they do, you must turn not to the axioms of a rule-theory but to the micro-ecology that they follow from. The intuition behind the symbolic paradigm is quite different: here rule-theory drives micro-theory; we expect to find many characteristics of the micro-level which make no micro-sense, do not derive from natural micro-assumptions or interactions, and can only be understood in terms of the higher-level system being implemented.

The RM pattern associator again provides us with some specific examples. As noted, it is surely significant that the regular past-tense morphology leaves the stem completely unaltered. Suppose we attempt to encode this in the pattern associator by pre-setting it for the identity map; then for the vast majority of items (perhaps more than 95% of the whole vocabulary), most connections will not have to be changed at all. In this way, we might be able to make the learner pay (in learning time) for divergences from identity. But such a setting has no justification from the micro-level perspective, which conduces only to some sort of uniformity (all weights 0, for example, or all random); the labels that we use from our perspective as theorists are invisible to the units themselves, and the connections implementing the identity map are indistinguishable at the micro-level from any other connections. Wiring it in is an implementational strategy driven by outside considerations, a fingerprint of the macro-theory.

An actual example in the RM model as it stands is the selective blurring of Wickelfeature representations. When the Wickelfeature ABC is part of an input stem, extra Wickelfeatures XBC and ABY are also turned on, but AXC is not: as we noted above (see also Lachter & Bever, 1988), this is motivated by the macro-principles that individual phonemes are the significant units of analysis and that phonological interactions, when they occur, generally involve adjacent pairs of segments. It is not motivated by any principle of micro-level connectionism.

Even the basic organization of the RM model, simple though it is, comes from motives external to the micro-level. Why should it be that the stem is mapped to the past tense, that the past tense arises from a modification of the stem? Because a sort of intuitive proto-linguistics tells us so. It is easy to set up a network in which stem and past tense are represented only in terms of their semantic features, so that generalization gradients are defined over semantic similarity (e.g. hit and strike would be subject to similar changes in the past tense), with the unwelcome consequence that no phonological relations will 'emerge'. Indeed, the telling argument against the RM pattern associator as a model of linguistic knowledge is that its very design forces it to blunder past the major generalizations of the English system. It is not unthinkable that many of the design flaws could be overcome, resulting in a connectionist network that learns more insightfully. But subsymbolism or eliminative connectionism, as a radical metatheory of cognitive science, will not be vindicated if the principal structures of such hypothetical improved models turn out to be dictated by higher-level theory rather than by micro-necessities. To the extent that connectionist models are not mere isotropic node tangles, they will themselves have properties that call out for explanation. We expect that in many cases, these explanations will constitute the macro-theory of the rules that the system would be said to implement.

Here we see, too, why radical connectionism is so closely wedded to the notion of blank slates, simple learning mechanisms, and vectors of "teaching" inputs juxtaposed unit-by-unit with the networks' output vectors. If you really want a network not to implement any rules at all, the properties of the units and connections at the micro-level must suffice to organize the network into something that behaves intelligently. Since these units are too simple and too oblivious to the requirements of the computational problem that the entire network will be required to solve to do the job, the complexity of the system must derive from the complexity of the set of environmental inputs causing the units to execute their simple learning functions. One explains the organization of the system, then, only in terms of the structure of the environment, the simple activation and learning abilities of the units, and the tools and language of those aspects of statistical mechanics apropos to the aggregate behavior of the units as they respond to environmental contingencies (as in Hinton & Sejnowski, 1986; Smolensky, 1986) - the rules genuinely would have no role to play.

As it turns out, the RM model requires both kinds of explanation - implemented macrotheory and massive supervised learning - in accounting for its asymptotic organization. Rumelhart and McClelland made up for the model's lack of proper rule-motivated structure by putting it into a teaching environment that was unrealistically tailored to produce much of the behavior they wanted to see. In the absence of macro-organization the environment must bear a very heavy burden.
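The selective Wickelfeature blurring described above can be sketched as follows. Two simplifications should be flagged: the trigram alphabet is a made-up stand-in, and the sketch activates every peripheral variant, where the RM model turned on only a random subset of them:

```python
# Sketch of selective blurring: for an input trigram (A, B, C), peripheral
# variants (X, B, C) and (A, B, Y) are also activated, but central variants
# (A, X, C) never are. (The alphabet is an illustrative stand-in.)

def blur(trigram, alphabet):
    a, b, c = trigram
    active = {trigram}
    for x in alphabet:
        active.add((x, b, c))   # vary the left, peripheral position
        active.add((a, b, x))   # vary the right, peripheral position
        # (a, x, c) is deliberately never added: the central phoneme is preserved
    return active

alphabet = ["p", "t", "k", "a", "i"]
active = blur(("p", "a", "t"), alphabet)

assert ("k", "a", "t") in active and ("p", "a", "k") in active   # peripheral blur
assert ("p", "i", "t") not in active                             # no central blur
```

The asymmetry between peripheral and central positions is exactly the kind of macro-motivated wiring at issue: nothing at the unit level distinguishes the central connection from the peripheral ones.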

Rumelhart and McClelland (1986a, p. 143) recognize this implication clearly and unflinchingly in the two paragraphs they devote in their volumes to answering the question "Why are People Smarter than Rats?":

Given all of the above [the claim that human cognition and the behavior of lower animals can be explained in terms of PDP networks], the question does seem a bit puzzling. ... People have much more cortex than rats do or even than other primates do; in particular they have very much more ... brain structure not dedicated to input/output - and presumably, this extra cortex is strategically placed in the brain to subserve just those functions that differentiate people from rats or even apes. ... But there must be another aspect to the difference between rats and people as well. This is that the human environment includes other people and the cultural devices that they have developed to organize their thinking processes.

We agree completely with one part: that the plausibility of radical connectionism is tied to the plausibility of this explanation.

8.3. On the properties of parallel distributed processing models

In our view the more interesting points raised by an examination of the RM model concern the general adequacy of the PDP mechanisms it uses, for it is this issue, rather than the metatheoretical ones, that will ultimately have the most impact on the future of cognitive science. The RM model is just one early example of a PDP model of language, and Rumelhart and McClelland make it clear that it has been simplified in many ways and that there are many paths for improvement and continued development within the PDP framework. Thus it would be especially revealing to try to generalize the results of our analysis to the prospects for PDP models of language in general. Although the past tense rule is a tiny fragment of knowledge of language, many of its properties that pose problems for the RM model are found in spades elsewhere. Here we point out some of the properties of the PDP architecture used in the RM model that seem to contribute to its difficulties and hence which will pose the most challenging problems to PDP models of language.

8.3.1. Distributed representations

PDP models such as RM's rely on 'distributed' representations: a large-scale entity is represented by a pattern of activation over a set of units rather than by turning on a single unit dedicated to it. This would be a strictly implementational claim, orthogonal to the differences between connectionist and symbol-processing theories, were it not for an additional aspect: the units have semantic content; they stand for (that is, they are turned on in response to) specific properties of the entity, and the entity is thus represented solely in terms of which of those properties it has. The links in a network describe strengths of association between properties, not between individuals. The relation between features and individuals is one-to-many in both directions: each individual is described as a collection of many features, and each feature plays a role in the description of many individuals.

Hinton et al. (1986) point to a number of useful characteristics of distributed representations. They provide a kind of content-addressable memory, from which individual entities may be called up through their properties. They provide for automatic generalization: things true of individual X can be inherited by individual Y inasmuch as the representation of Y overlaps that of X (i.e. inasmuch as Y shares properties with X) and activation of the overlapping portion during learning has been correlated with generalizable properties. And they allow for the formation of new concepts in a system via new combinations of properties that the system already represents.

It is often asserted that distributed representation using features is uniquely available to PDP models, and stands as the hallmark of a new paradigm of cognitive science, one that calculates not with symbols but with what Smolensky (in press) has dubbed 'subsymbols' (basically, what Rumelhart, McClelland, and Hinton call 'microfeatures'). Smolensky puts it this way:

(18) Symbols and Context Dependence. In the symbolic paradigm, the context of a symbol is manifest around it, and consists of other symbols; in the subsymbolic paradigm, the context of a symbol is manifest inside it, and consists of subsymbols.

It is striking, then, that one aspect of distributed representation - featural decomposition - is a well-established tool in every area of linguistic theory, a branch of inquiry securely located in (perhaps indeed paradigmatic of) the 'symbolic paradigm'. Even more striking, linguistic theory calls on a version of distributed representation to accomplish the very goals that Hinton et al. (1986) advert to. Syntactic, morphological, semantic, and phonological entities are analyzed as feature complexes so that they can be efficiently content-addressed in linguistic rules; so that generalization can be achieved across individuals; so that 'new' categories can appear in a system from fresh combinations of features. Linguistic theory also seeks to make the correct generalizations inevitable given the representation. One influential attempt, the 'evaluation metric' hypothesis, proposed to measure the optimality of linguistic rules (specifically phonological rules) in terms of the number of features they refer to; choosing the most compact grammar would guarantee maximal generality. Compare in this regard Hinton et al.'s (1986, p. 84) remark about types and instances:

... the relation between a type and an instance can be implemented by the relationship between a set of units [features] and a larger set [of features] that includes it. Notice that the more general the type the smaller the set of units [features] used to encode it. As the number of terms in an intensional [featural] description gets smaller, the corresponding extensional set [of individuals] gets larger.

This echoes exactly Halle's (1957, 1962) observation that the important general classes of phonemes were among those that could be specified by small sets of features. In subsequent linguistic work we find thorough and continuing exploration of a symbol-processing, content-addressing, automatically-generalizing rule-theory built, in part, on featural analysis. No distinction-in-principle between PDP and all that has gone before can be linked to

the presence or absence of featural decomposition (one central aspect of distributed representation) as the key desideratum. Features analyze the structure of paradigms - the way individuals contrast with comparable individuals - and any theory, macro, micro, or mini, that deals with complex entities can use them.

Of course, distributed representation in PDP models implies more than just featural decomposition: an entity is represented as nothing but the features it is composed of. Concatenative structure, constituency, variables, and their binding - in short, syntagmatic organization - are virtually abandoned. This is where the RM model and similar PDP efforts really depart from previous work, and also where they fail most dramatically.

A crucial problem is the difficulty PDP models have in representing individuals and variables (this criticism is also made by Norman, 1986, in his generally favorable appraisal of PDP models). The models represent individual objects as sets of their features. Nothing, however, represents the fact that a collection of features corresponds to an existing individual: that it is distinct from a twin that might share all its features; or that an object similar to a previously viewed one is a single individual that has undergone a change, as opposed to two individual objects that happen to resemble one another; or that a situation has undergone a change if two identical objects have switched positions.36 In the RM model, for example, this problem manifests itself in the inability to supply different past tenses for homophonous verbs such as wring and ring, or to enforce a categorical distinction between morphologically disparate verbs that are given similar featural representations, such as become and succumb, to mention just two of the examples discussed in Section 4.3.

As we have mentioned, a seemingly obvious way to handle this problem - just increase the size of the feature set so that more distinctions can be encoded - will not do. For one thing, the obvious kinds of features to add, such as semantic features to distinguish homophones, give the model too much power, as we have mentioned: it could use any semantic property or combination of semantic and phonological properties to distinguish inflectional rules, whereas in fact only a relatively small set of features are ever encoded inflectionally in the world's languages (Bybee, 1985; Talmy, 1985). Furthermore, the crucial properties governing choice of inflection are not semantic at all but refer to abstract morphological entities such as basic lexical itemhood or roothood. Finally, this move would commit one to the prediction that semantically related words are likely to have similar past tenses, which is just not true (compare e.g. hit/hit versus strike/struck versus slap/slapped
36 We thank David Kirsh for these examples.


(similar meanings, different kinds of past tenses) or stand/stood versus understand/understood versus stand out/stood out (different meanings, same kind of past tense). Basically, increasing the feature set is only an approximate way to handle the problem of representing individuals; by making finer distinctions it makes it less likely that individuals will be confused, but it still does not encode individuals as individuals. The relevant difference between wring and ring as far as the past tense is concerned is that they are different words, pure and simple.37

A second way of handling the problem is to add arbitrary features that simply distinguish words. In the extreme case, there could be a set of n features over which n orthogonal patterns of activation stand in one-to-one correspondence with n lexical items. This won't work, either. The basic problem is that distributed representations, when they are the only representations of objects, face the conflicting demands of keeping individuals distinct and providing the basis for generalization. As it stands, Rumelhart and McClelland must walk a fine line between keeping similar words distinct and getting the model to generalize to new inputs - witness their use of Wickelfeatures over Wickelphones, their decision to encode a certain proportion of incorrect Wickelfeatures, their use of a noisy output function for the past tense units, all designed to blur distinctions and foster generalization (as mentioned, the effort was only partially successful, as the model failed to generalize properly to many unfamiliar stems). Dedicating some units to representing wordhood would be a big leap in the direction of nongeneralizability. With orthogonal patterns representing words, in the extreme case, word-specific output features could be activated accurately in every case, and the discrepancy between computed output and teacher-supplied input needed to strengthen connections from the relevant stem features would never occur. Intermediate solutions, such as having a relatively small set of word-distinguishing features available to distinguish homophones with distinct endings, might help. But given the extremely delicate balance between discriminability and generalizability, one won't know until it is tried, and in any case, it would at best be a hack that did not tackle the basic problem at hand: individuating individuals, and associating them with the abstract predicates that govern the permissible generalizations in the system.

The lack of a mechanism to bind sets of features together as individuals causes problems at the output end, too. A general problem for coarse-coded

37 Of course, another problem with merely increasing the feature set, especially if the features are conjunctive, is that the network can easily grow too large very quickly. Recall that Wickelphones, which in principle can make finer distinctions than Wickelfeatures, would have required a network with more than two billion connections.


distributed representations is that when two individuals are simultaneously represented, the system can lose track of which feature goes with which individual - leading to "illusory conjunctions" where, say, an observer may be unable to say whether he or she is seeing a blue circle and a red triangle or a red circle and a blue triangle (see Hinton et al., 1986; Treisman & Schmidt, 1982). The RM model simultaneously computes past tense output features corresponding to independent subregularities which it is then unable to keep separate, resulting in incorrect blends such as slept as the past tense of slip - a kind of self-generated phonological illusory conjunction. The current substitute for a realistic binding mechanism, namely the "whole-string binding network", does not do the job, and we are given no reason to believe that a more realistic and successful model is around the corner. The basic point is that the binding problem is a core deficiency of this kind of distributed representation, not a minor detail whose solution can be postponed to some later date.
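The binding problem described above can be illustrated in a few lines; the feature names are illustrative, and the union of sets stands in for superimposing two patterns of activation over one pool of units:

```python
# Illustration of the binding problem for feature-only representations:
# when two objects are superimposed as one pool of active features, nothing
# records which colour goes with which shape.

blue_circle   = {"blue", "circle"}
red_triangle  = {"red", "triangle"}
red_circle    = {"red", "circle"}
blue_triangle = {"blue", "triangle"}

scene1 = blue_circle | red_triangle   # blue circle + red triangle
scene2 = red_circle | blue_triangle   # red circle + blue triangle

assert scene1 == scene2   # the two scenes produce identical feature pools
```

Because the two scenes activate exactly the same features, no downstream process reading only the feature pool can distinguish them - the analogue of the blended outputs like slept for slip.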

The other main problem with features-only distributed representations is that they do not easily provide variables that stand for sets of individuals regardless of their featural decomposition, and over which quantified generalizations can be made. This dogs the RM model in many places. For example, there is the inability to represent certain reduplicative words, in which the distinction between a feature occurring once versus occurring twice is crucial, or in learning the general nature of the rule of reduplication, where a morpheme must be simply copied: one needs a variable standing for an occurrence of a morpheme independent of the particular features it is composed of. In fact, even the English regular rule of adding /d/ is never properly learned (that is, the model does not generalize it properly to many words), because in essence the real rule causes an affix to be added to a "word", which is a variable standing for any admissible phone sequence, whereas the model associates the family of /d/ features with a list of particular phone sequences it has encountered instead. Many of the other problems we have pointed out can also be traced to the lack of variables. We predict that the kind of distributed representation used in two-layer pattern-associators like the one in the RM model will cause similar problems anywhere they are used in modeling middle- to high-level cognitive processes.38

Hinton, McClelland, and Rumelhart (p. 82) themselves provide an example that (perhaps inadvertently) illustrates the general problem:

38 Within linguistic semantics, for example, a well-known problem is that if semantic representation is a set of features, how are propositional connectives defined over such feature sets? If P is a set of features, what function of connectionist representation will give the set for ~P?


People are good at generalizing newly acquired knowledge. If, for example, you learn that chimpanzees like onions you will probably raise your estimate of the probability that gorillas like onions. In a network that uses distributed representations, this kind of generalization is automatic. The new knowledge about chimpanzees is incorporated by modifying some of the connection strengths so as to alter the causal effects of the distributed pattern of activity that represents chimpanzees. The modification automatically changes the causal effects of all similar activity patterns. So if the representation of gorillas is a similar activity pattern over the same set of units, its causal effects will be changed in a similar way.
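The mechanism described in the quoted passage can be sketched in a few lines. The five binary features, the degree of chimp/gorilla overlap, and the 0.1 learning increment are illustrative assumptions, not Hinton et al.'s actual encoding:

```python
# Sketch of automatic generalization over overlapping distributed patterns:
# strengthening connections along the "chimpanzee" pattern also changes the
# response to the overlapping "gorilla" pattern.

def activation(weights, pattern):
    """Weighted input to a single 'likes onions' output unit."""
    return sum(w * x for w, x in zip(weights, pattern))

chimp   = [1, 1, 1, 1, 0]   # distributed pattern for chimpanzees
gorilla = [1, 1, 1, 0, 1]   # overlaps chimp on three of five units

weights = [0.0] * 5
before = activation(weights, gorilla)

# Learn "chimpanzees like onions": strengthen connections along chimp's pattern.
weights = [w + 0.1 * x for w, x in zip(weights, chimp)]
after = activation(weights, gorilla)

assert before == 0.0 and after > before   # the new knowledge spreads via overlap
```

Note that the spread is driven entirely by featural overlap, which is precisely the property the following discussion argues against as a model of human inductive inference.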

This venerable associationist hypothesis about inductive reasoning has been convincingly discredited by contemporary research in cognitive psychology. People's inductive generalizations are not automatic responses to similarity (in any non-question-begging sense of similarity); they depend on the reasoner's unconscious theory of the domain, and on any theory-relevant fact about the domain acquired through any route whatsoever (communicated verbally, acquired in a single exposure, inferred through circuitous means, etc.), in a way that can completely override similarity relations (Carey, 1985; deJong & Mooney, 1986; Gelman & Markman, 1986; Keil, 1986; Osherson, Smith, & Shafir, 1986; Pazzani, 1987). To take one example, knowledge of how a set of perceptual features was caused, or knowledge of the kind that an individual is an example of, can override any generalizations inspired by the object's features themselves: for example, an animal that looks exactly like a skunk will nonetheless be treated as a raccoon if one is told that the stripe was painted onto an animal that had raccoon parents and raccoon babies (see Keil, 1986, who demonstrates that this phenomenon occurs in children and is not the result of formal schooling). Similarly, even a basketball ignoramus will not be seduced by the similarity relations holding among the typical starting players of the Boston Celtics and those holding among the starting players of the Los Angeles Lakers, and thus will not be tempted to predict that a blond yellow-shirted player entering the game will run to the Celtics' basket when he gets the ball just because all previous blond players did so. (Hair color, nonetheless, might be used in qualitatively different generalizations, such as which players will be selected to endorse hair care products.) The example from Pazzani and Dyer (1987) is one of many that have led to artificial intelligence systems based on explanation-based learning, which has greater usefulness and greater fidelity to people's commonsense reasoning than the similarity-based learning that Hinton et al.'s system performs automatically (see, e.g., deJong & Mooney, 1986). Osherson et al. (1987) also analyse the use of similarity as a basis for generalization and show its inherent problems; Gelman and Markman (1986) show how preschool children shelve similarity relations when making inductive generalizations about natural kinds.

The point is that people's inductive inferences depend on variables assigned to sets of individuals that pick out some properties and completely ignore others, differently on different occasions, depending in knowledge-specific ways on the nature of the inductive inference to be made on that occasion. Furthermore, knowledge that can totally alter or reverse an inductive inference is not just another pattern of trained feature correlations, but depends crucially on the structured propositional content of the knowledge: learning that all gorillas are exclusively carnivorous will lead to a different generalization about their taste for onions than learning that some or many gorillas are exclusively carnivorous, or that it is not the case that all gorillas are exclusively carnivorous; and learning that a particular gorilla who happens to have a broken leg does not like onions will not necessarily lead to any tendency to project that distaste onto other injured gorillas and chimpanzees. Though similarity surely plays a role in domains with which people are entirely unfamiliar, or perhaps in initial gut reactions, full-scale intuitive inference is not a mere reflection of patterns of featural similarity that have been intercorrelated in the past. Therefore one would not want to use the automatic generalization properties of distributed representations to provide an account of human inductive inference in general. This is analogous to the fact we have been stressing throughout, namely that the past tense inflectional system is not a slave to similarity but is driven in precise ways by speakers' implicit theories of linguistic organization.

In sum, featural decomposition is an essential feature of standard symbolic models of language and cognition, and many of the successes of PDP models simply inherit these advantages. However, what is unique about the RM model and other two-layer pattern associators is the claim that individuals and types are represented as nothing but activated subsets of features. This impoverished mechanism is viable neither in language nor in cognition in general. The featural decomposition of an object must be available to certain processes, but it can be only one of the records associated with the object, and it need not enter into all the processes referring to the object. Some symbol referring to the object qua object, and some variable referring to task-relevant types of objects that cut across classes of featural similarity, are required.

8.3.2. Distinctions among subcomponents and abstract internal representations
Language and connectionism

179

The RM model collapses into a single input-output module a mapping that in rule-based accounts is a composition of several distinct subcomponents feeding information into one another, such as derivational morphology and inflectional morphology, or inflectional morphology and phonology. This, of course, is what gives it its radical look. If the subcomponents of a traditional account were kept distinct in a PDP model, mapping onto distinct subnetworks or pools of units with their own inputs and outputs, or onto distinct layers of a multilayer network, one would naturally say that the network simply implemented the traditional account. But it is just the factors that differentiate Rumelhart and McClelland's collapsed one-box model from the traditional accounts that cause it to fail so noticeably.
Why do Rumelhart and McClelland have to obliterate the traditional decomposition to begin with? The principal reason is that when one breaks down a system into components, the components must communicate by passing information (internal representations) among themselves. But because these are internal representations, the environment cannot see them and so cannot adjust them during learning via the perceptron convergence procedure used in the RM model. Furthermore, the internal representations do not correspond directly to environmental inputs and outputs, and so the criteria for matches and mismatches necessary to drive the convergence procedure are not defined. In other words, the representations used in decomposed, modular systems are abstract, and many aspects of their organization cannot be learned in any obvious way. (Chomsky, 1981, calls this the argument from the poverty of the stimulus.) Sequences of morphemes resulting from factoring out phonological changes are one kind of abstract representation used in rule systems; lexical entries distinct from phonetic representations are another; morphological roots are a third. The RM model is thus composed of a single module mapping from input directly to output in part because there is no realistic way for their convergence procedure to learn the internal representations of a modular account properly.
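The convergence procedure at issue is easy to state compactly. The sketch below is our own minimal illustration with arbitrary toy patterns, not Rumelhart and McClelland's simulation: in a two-layer pattern associator every tunable weight joins an observable input unit to an observable output unit, so each adjustment is driven by a mismatch between an output unit's actual state and the state the teacher supplies. For a unit whose correct state the environment never supplies, the update is simply undefined.

```python
# A two-layer pattern associator trained by the perceptron convergence
# procedure (our own toy sketch, not the RM simulation). Every tunable
# weight links an observable input unit to an observable output unit.
weights = [[0.0] * 3 for _ in range(2)]   # weights[j][i]: input i -> output j
THETA = 0.5                               # fixed response threshold

def respond(inp):
    return [1 if sum(w * x for w, x in zip(ws, inp)) > THETA else 0
            for ws in weights]

# Teaching set: input feature vectors paired with desired output vectors.
patterns = [([1, 0, 1], [1, 0]),
            ([0, 1, 1], [0, 1]),
            ([1, 1, 0], [1, 1]),
            ([0, 0, 1], [0, 0])]

for sweep in range(50):
    for inp, teacher in patterns:
        out = respond(inp)
        for j in range(2):
            # The error driving each adjustment is a mismatch between an
            # observable output and the teacher's signal; for a hidden unit,
            # which has no teacher, this quantity would be undefined.
            error = teacher[j] - out[j]
            for i in range(3):
                weights[j][i] += 0.1 * error * inp[i]

print([respond(inp) for inp, _ in patterns])
```

Because the toy mapping above is linearly separable, the procedure provably converges; the point is that every step of it consults only environmentally given inputs and outputs.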

A very general point we hope to have made in this paper is that symbolic models of language were not designed for arbitrary reasons and preserved as quaint traditions; the distinctions they make are substantive claims motivated by empirical facts and cannot be obliterated unless a new model provides equally compelling accounts of those facts. A model designed to record hundreds of thousands of first-order correlations can simulate some but not all of this structure, and is unable to explain it or to account for the structures that do not occur across languages. Similar conclusions, we predict, will emerge from other cognitive domains that are rich in data and theory. It is unlikely that any model will be able to obliterate distinctions among subcomponents and their corresponding abstract internal representations and forms that have been independently motivated by detailed study of a domain of cognition. This alone will sharply brake any headlong movement away from the kinds of theories that have been constructed within the symbolic framework.


8.3.3. Discrete, categorical rules

Despite the graded and frequency-sensitive responses made by children, and by adults in their speech errors and analogical extensions in parts of the strong verb system, many aspects of knowledge of language result in categorical judgments of ungrammaticality. This fact is difficult to reconcile with any mechanism that at asymptote leaves a number of candidates at suprathreshold strength and allows them to compete probabilistically for expression (Bowerman, 1987, also makes this point). In the present case, adult speakers assign a single past tense form to words they represent as being regular even if subregularities bring several candidates to mind (e.g. brought/*brang/*bringed); subregularities that may have been partially productive in childhood are barred from generating past tense forms when verbs are derived from other syntactic categories (e.g. *pang; *highstuck) or are registered as being lexical items distinct from those exemplifying the subregularities (e.g. *I broke the car). Categorical judgment of ungrammaticality is a common (though not all-pervasive) property of linguistic judgments of novel words and strings, and cannot be predicted by semantic interpretability or any prior measure of similarity to known words or strings (e.g. *I put; *The child seems sleeping; *What did you see something?). Obviously PDP models can display categorical judgments by various kinds of sharpening and threshold circuits; the question is whether the models can be built other than by implementing standard symbolic theories in which the quantitatively strongest output prior to the sharpening circuit invariably corresponds to the unique qualitatively appropriate response.

8.3.4. Unconstrained correlation extraction

It is often considered a virtue of PDP models that they are powerful learners; virtually any amount of statistical correlation among a set of features in a set of inputs can be soaked up by the weights on the dense set of interconnections among the units. But this property is a liability if human learners are more constrained. In the case of the RM model, we saw how it can acquire rules that are not found in any language, such as nonlocal conditioning of phonological changes or mirror-reversal of phonetic strings. This problem would get even worse if the set of feature units was expanded to represent other kinds of information in an attempt to distinguish homophonous or phonologically similar forms. The model also exploits subregularities (such as those of the irregular classes) that adults at best do not exploit productively (slip/*slept and peep/*pept) and at worst are completely oblivious to (e.g. lexical causatives like sit/set, lie/lay, fall/fell, rise/raise, which are never generalized to cry/*cray). The types of inflection found across human languages involve a highly constrained subset of the logically possible semantic features, feature combinations, phonological alterations, items admitting of inflection, and agreement relations (Bybee, 1985; Talmy, 1985). For example, to represent the literal meanings of the verbs brake and break the notion of a man-made mechanical device is relevant, but no language has different past tenses or plurals for a distinction between man-made versus natural objects, despite the cognitive salience of that notion. And the constrained nature of the variation in other components of language such as syntax has been the dominant theme of linguistic investigations for a quarter of a century (e.g. Chomsky, 1981). These constraints are facts that any theory of language acquisition must be able to account for; a model that can learn all possible degrees of correlation among a set of features is not a model of the human being.
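The worry about unconstrained correlation extraction can be made concrete. The toy demonstration below (ours; the bit-string coding and the task are arbitrary assumptions, not the RM model's phonological features) trains a two-layer associator of the same general kind on the mirror-reversal of four-bit strings, a mapping attested in no human language; the associator soaks it up as readily as any natural pattern, because each output bit is perfectly correlated with one input bit.

```python
from itertools import product

N = 4
weights = [[0.0] * N for _ in range(N)]   # one row of weights per output unit
THETA = 0.5

def respond(bits):
    return [1 if sum(w * b for w, b in zip(ws, bits)) > THETA else 0
            for ws in weights]

# The target "rule": mirror-reverse the input string. No human language
# conditions an alternation this way, but the correlations are there to be
# soaked up, so the associator absorbs it without complaint.
train = [(list(bits), list(reversed(bits)))
         for bits in product([0, 1], repeat=N)]

for sweep in range(25):
    for inp, target in train:
        out = respond(inp)
        for j in range(N):
            err = target[j] - out[j]
            for i in range(N):
                weights[j][i] += 0.1 * err * inp[i]

print(all(respond(inp) == target for inp, target in train))  # learned perfectly
```

Nothing in the learning procedure distinguishes this unattested mapping from a linguistically natural one; the constraint would have to come from elsewhere.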
8.4. Can the model be recast using more powerful PDP mechanisms?

The most natural response of a PDP theorist to our criticisms would be to retreat from the claim that the RM model in its current form is to be taken as a literal model of inflection acquisition. The RM model uses some of the simplest of the devices in the PDP armamentarium, devices that PDP theorists in general have been moving away from. Perhaps it is the limitations of these simplest devices (two-layer pattern association networks) that cause problems for the RM model, and these problems would all diminish if more sophisticated kinds of PDP networks were used. Thus the claim that PDP networks rather than rules provide an exact and detailed account of language would survive.

In particular, two interesting kinds of networks, the Boltzmann Machine (Hinton & Sejnowski, 1986) and the Back-Propagation scheme (Rumelhart et al., 1986), have been developed recently that have hidden units or intermediate layers between input and output. These hidden units function as internal representations, and as a result such networks are capable of computing functions that are uncomputable in two-layer pattern associators of the RM variety. Furthermore, in many interesting cases the models have been able to learn the internal representations. For example, the Rumelhart et al. model changes not only the weights of the connections to its output units in response to an error with respect to the teaching input, but it propagates the error signal backwards to the intermediate units and changes their weights in the direction that alters their aggregate effect on the output in the right direction. Perhaps, then, a multilayered PDP network with back-propagation learning could avoid the problems of the RM model.

There are three reasons why such speculations are basically irrelevant to the points we have been making.
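The back-propagation scheme just described can be sketched in a few lines. The code below is our own minimal illustration, not the Rumelhart et al. simulations: a network with one layer of hidden units is trained on exclusive-or, the error being recorded at the output layer and propagated backwards to adjust the hidden units' incoming weights.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Exclusive-or: the classic mapping that no two-layer pattern associator
# of the RM variety can compute.
data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
        ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

random.seed(0)                       # an arbitrary starting configuration
H = 2                                # number of hidden units
w_h = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(H)]
b_h = [0.0] * H
w_o = [random.uniform(-1, 1) for _ in range(H)]
b_o = 0.0
RATE = 0.5

def forward(x1, x2):
    h = [sigmoid(w_h[j][0] * x1 + w_h[j][1] * x2 + b_h[j]) for j in range(H)]
    y = sigmoid(sum(w_o[j] * h[j] for j in range(H)) + b_o)
    return h, y

def total_error():
    return sum((forward(x1, x2)[1] - t) ** 2 for (x1, x2), t in data)

sse_before = total_error()
for epoch in range(8000):
    for (x1, x2), t in data:
        h, y = forward(x1, x2)
        # Error is recorded only at the output layer, where a teacher exists...
        d_y = (y - t) * y * (1.0 - y)
        # ...and is then propagated backwards to apportion blame to the hidden
        # units, whose correct states the environment never specifies.
        d_h = [d_y * w_o[j] * h[j] * (1.0 - h[j]) for j in range(H)]
        for j in range(H):
            w_o[j] -= RATE * d_y * h[j]
            w_h[j][0] -= RATE * d_h[j] * x1
            w_h[j][1] -= RATE * d_h[j] * x2
            b_h[j] -= RATE * d_h[j]
        b_o -= RATE * d_y
sse_after = total_error()

print(sse_before, sse_after)
```

Whether such a run ends at the correct exclusive-or table, rather than in a suboptimal configuration, depends on the starting weights and learning parameters, a sensitivity discussed later in this section.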

First, there is the gap between revolutionary manifestos and actual accomplishments. Rumelhart and McClelland's surprising claims (that language can only be described approximately by rules, that there is no induction problem in their account, and that we must revise our understanding of linguistic information processing) are based on the putative success of their existing model. Given that their existing model does not do the job it is said to do, the claims must be rejected. If a PDP advocate were to eschew the existing RM model and appeal to more powerful mechanisms, the only claim that may be made is that there could exist a model of unspecified design that may or may not account for past tense acquisition without the use of rules, and that if it did, we should revise our understanding of language, treat rules as mere approximations, and so on. Such an assertion, of course, would have as little claim to our attention as any other hypothetical claim about the consequences of a nonexistent model.

Second, a successful PDP model of more complex design may be nothing more than an implementation of a symbolic rule-based account. The advantage of a multilayered model is precisely that it is free from the constraints that so sharply differentiate the RM model from standard ones, namely the lack of internal representations and subcomponents. Multilayered networks, and other sophisticated network models such as those that have one network that can gate the connections between two others, or networks that can simulate semantic networks, production systems, or LISP primitive operations (Hinton, 1981; Touretzky, 1986; Touretzky & Hinton, 1985), are appealing because they have the ability to mimic or implement the operations and representations in traditional symbolic accounts (though perhaps with some needed twists). We do not doubt that it would be possible to implement a rule system in networks with multiple layers: after all, it has been known for over 45 years that nonlinear neuron-like elements can function as logic gates and that hence networks consisting of interconnected layers of such elements can compute propositions (McCulloch & Pitts, 1943). Furthermore, given what we know about neural information processing and plasticity, it seems likely that the elementary operations of symbolic processing will have to be implemented in a system consisting of massively parallel interconnected stochastic units in which the effects of learning are manifest in changes in connections. These facts have always been uncontroversial at the very foundations of the realist interpretation of symbolic models of cognition; they do not signal a departure of any sort from standard symbolic accounts. Perhaps a multilayered or gated multinetwork system could solve the tasks of inflection acquisition without simply implementing standard grammars (for example, it might behave discrepantly from a set of rules in a way that mimicked people's systematic divergence from that set of rules, or its intermediate layers might be totally opaque in terms of what they represent), and thus would call for a revised understanding of language, but there is no reason to believe that this will be true.
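The McCulloch-Pitts point can be verified directly. In the sketch below (our illustration, with arbitrarily chosen weights and thresholds), a "neuron" is a linear threshold element; single units realize the logic gates AND, OR, and NOT, and an interconnected two-layer circuit of them computes a proposition, exclusive-or, that no single such unit can.

```python
def unit(weights, threshold):
    """A McCulloch-Pitts neuron: fires (1) iff its weighted input sum
    reaches the threshold."""
    return lambda *xs: 1 if sum(w * x for w, x in zip(weights, xs)) >= threshold else 0

AND = unit([1, 1], 2)
OR  = unit([1, 1], 1)
NOT = unit([-1], 0)

# Interconnected layers of such elements compute propositions:
# XOR(a, b) = (a OR b) AND NOT(a AND b)
def XOR(a, b):
    return AND(OR(a, b), NOT(AND(a, b)))

print([XOR(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 0]
```

The weights here are wired in by hand; the question in the text is whether such configurations can be learned, not whether they can exist.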

As we mentioned in a previous section, the really radical claim is that there are models that can learn their internal organization through a process that can be exhaustively described as an interaction between the correlational structure of environmental inputs and the aggregate behavior of the units as they execute their simple activation and learning functions in response to those inputs. Again, this is no more than a vague hope. An important technical problem is that when intermediate layers of complex networks have to learn anything in the local unconstrained manner characteristic of PDP models, they are one or more layers removed from the output layer at which discrepancies between actual and desired outputs are recorded. Their inputs and outputs no longer correspond in any direct way to overt stimuli and responses, and the steps needed to modify their weights are no longer transparent. Since differences in the setting of each tunable component of the intermediate layers have consequences that are less dramatic at the comparison stage (their effects combine in complex ways with the effects of weight changes of other units before affecting the output layer), it is harder to ensure that the intermediate layers will be properly tuned by local adjustments propagating backwards. Rumelhart et al. (1986) have dealt with this problem in clever ways with some interesting successes in simple domains such as learning to add two-digit numbers, detecting symmetry, or learning the exclusive-or operator. But there is always the danger in such systems of converging on incorrect solutions defined by local minima of the energy landscape defined over the space of possible weights, and such factors as the starting configuration, the order of inputs, several parameters of the learning function, the number of hidden units, and the innate topology of the network (such as whether all input units are connected to all intermediate units, and whether they are connected to all output units via direct paths or only through intervening links) can all influence whether the models will converge properly even in some of the simple cases. There is no reason to predict with certainty that these models will fail to acquire complex abilities such as mastery of the past tense system without wiring in traditional theories by hand, but there is even less reason to predict that they will.
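The dependence on starting configuration is visible even in one dimension. The toy sketch below (ours; the "energy landscape" is an arbitrary quartic of our choosing, not any network's actual error surface) runs the identical gradient-descent procedure from two different starting weights; each run settles into a different minimum, so which solution is found is decided entirely by where the search begins.

```python
# Gradient descent on a one-dimensional "energy landscape" with two
# minima, near w = -1 and w = +2.
def energy(w):
    return (w + 1) ** 2 * (w - 2) ** 2

def gradient(w):
    return 2 * (w + 1) * (w - 2) ** 2 + 2 * (w + 1) ** 2 * (w - 2)

def descend(w, rate=0.01, steps=2000):
    for _ in range(steps):
        w -= rate * gradient(w)
    return w

# Identical procedure, identical landscape; only the starting weight differs.
print(round(descend(-2.0), 3))   # settles in the basin around -1
print(round(descend(3.0), 3))    # settles in the basin around +2
```

In a high-dimensional weight space the same phenomenon is harder to visualize but no less real, which is why the factors listed above can decide whether a network converges on the desired solution at all.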

These problems are exactly that: problems. They do not demonstrate that interesting PDP models of language are impossible in principle. At the same time, they show that there is no basis for the belief that connectionism will dissolve the difficult puzzles of language, or even provide radically new solutions to them. As for the present, we have shown that the paradigm example of a PDP model of language can claim nothing more than a superficial fidelity to some first-order regularities of language. More is known than just the first-order regularities, and when the deeper and more diagnostic patterns are examined with care, one sees not only that the PDP model is not a viable alternative to symbolic theories, but that the symbolic account is supported in virtually every aspect. Principled symbolic theories of language have achieved success with a broad spectrum of empirical generalizations, some of considerable depth, ranging from properties of linguistic structure to patterns in the development of children. It is only such success that can warrant confidence in the reality and exactitude of our claims to understanding.


Appendix: English strong verbs

Here we provide, for the reader's convenience, an informally classified listing of all the strong verbs that we recognize in our own vocabulary (thus we omit, for example, Rumelhart & McClelland's drag-drug). Prefixed forms are listed when the prefix-root combination is not semantically transparent. The notation ?Verb means that we regard Verb as somewhat less than usual, particularly as a strong form in the class where it's listed. The notation ??Verb means that we regard Verb as obsolete (particularly in the past) but recognizable, the kind of thing one picks up from reading. The notation (+) means that the verb, in our judgment, admits a regular form. Notice that obsolescence does not imply regularizability: a few verbs simply seem to lack a usable past tense or past participle. We have found that judgments differ from dialect to dialect, with willingness-to-regularize running up a cline from British English (south-of-London) to Canadian (Montreal) to American (general). When in doubt, we've taken the American way.

The term laxing refers to the replacement of a tense vowel or diphthong by its lax counterpart. In English, due to the Great Vowel Shift, the notion of lax counterpart is slightly odd: the tense-lax alternations are not i-I, e-E, u-U, and so on, but rather i-E, ay-I, o-o/a, u-o/a. The term ablaut refers to all other vowel changes.


I. T/D Superclass

1. T/D + Ø
hit, slit, split, quit, ?knit(+)39, ?spit, ??shit, ??beshit
bid40, rid, ?forbid, shed, spread, wed(+)41
let, set, beset42, upset, wet(+)
cut, shut
put
burst, cast, cost, thrust
hurt(+)

2. T/D with laxing
bleed, breed, feed, lead, mislead, read, speed(+), ?plead(+)
meet
hide (+en), slide
bite (+en), light(+), alight(+!)
shoot

3. Overt -t ending
3a. Suffix -t
burn, ??learn, ??spoil, ?dwell, ??spell, ???smell, ??spill
3b. Devoicing
bend, send, spend, ?lend, ?rend
build
3c. -t with laxing
lose
deal, feel, ?kneel(+), mean, ?dream
creep, keep, leap(+), sleep, sweep(+), weep
leave

39 He knit a sweater is possible, not ??He knit his brows.
40 As in poker, bridge, or defense contracts.
41 The adjective is only wedded.
42 Mainly an adjective.


3d. x - ought
buy, bring, catch, fight, seek, teach, think

4. Overt -d ending with laxing (cf. bleed group)
4a. Satellitic
flee, say, hear
4b. With stem consonant drop
have, make
4c. With ablaut [E - o - o]
sell, tell, foretell
4d. With unique vowel change and + n participle
do
1. iff - ofJ - ofJ + n freeze , speak, ??bespeak, steal, weave( + )43 ?heave( + )44 , get , for ~et , ??beget ??tread 5 swear, tear , wear , ?bear , ??forbear , ??forswear 2. Satellitic x - 0 - o + n

choose

awake, wake , break

430nlyin reference carpetsetc. is the strongform possible* Thedrunk wovedownthe road The to , . . adjective woven is . 440nly nautical heave to/hoveto. *He havehislunch Pastparticiple *hoven . not . 45Though is common BritishEnglishit is at bestquaintin American trod in , English .


III. I - æ/A - A group

1. I - æ - A
ring, sing, spring
drink, shrink, sink, stink
swim
begin

2. I - A - A
cling, ?fling, sling, sting, string, swing, wring
stick
dig
win, spin
?stink, ?slink

3. Satellites of I - æ/A - A
run (cf. I - æ - A)
hang, strike46, ?sneak (cf. I - A - A)

IV. Residual clusters

1. x - u - x/o + n
blow, grow, know, throw
draw, withdraw
fly
?slay

2. e - U - e + n
take, mistake, forsake, shake, partake

46 Stricken as participle in 'from the record', otherwise as an adjective.


3. ay - aw - aw
bind, find, grind, wind

4a. ay - o - I + n
rise, arise
write, ??smite
ride
drive, strive

4b. ay - o
dive, shine47, stride, thrive
V. Miscellaneous

1. Pure suppletion
be
go, forgo, undergo

2. Backwards ablaut
fall, befall (cf. get - got)
hold, behold (cf. tell - told)
come, become
eat
beat
see (possibly satellite of blow-class)
give, forgive, forbid, ??bid48

47 Typically intransitive: *He shone his shoes.
48 As in 'ask or command to'. The past bade is very peculiar, bidded is impossible, and the past participle is obscure, though certainly not bidden.

4. Miscellaneous
sit, spit
stand, understand, withstand (possibly satellite of I - A - A class)
lie

5. Regular but for past participle
a. Add -n to stem (all allow -ed in participle)
sow, show, sew, prove, shear, strew
b. Add -n to ablauted stem
swell

A remark. A number of strong participial forms survive only as adjectives (most, indeed, somewhat unusual): cleft, cloven, girt, gilt, hewn, pent, bereft, shod, wrought, laden, mown, sodden, clad, shaven, drunken, (mis)shapen. The verb crow admits a strong form only in the phrase the cock crew; notice that the rooster crew is distinctly peculiar and Melvin crew over his victory is unintelligible. Other putative strong forms like leant, clove, abode, durst, chid, and sawn seem to us to belong to another language.
References

Anderson, J.A., & Hinton, G.E. (1981). Models of information processing in the brain. In G.E. Hinton & J.A. Anderson (Eds.), Parallel models of associative memory. Hillsdale, NJ: Erlbaum.
Anderson, J.R. (1976). Language, memory, and thought. Hillsdale, NJ: Erlbaum.
Anderson, J.R. (1983). The architecture of cognition. Cambridge, MA: Harvard University Press.
Armstrong, S.L., Gleitman, L.R., & Gleitman, H. (1983). What some concepts might not be. Cognition, 13, 263-308.
Aronoff, M. (1976). Word formation in generative grammar. Cambridge, MA: MIT Press.
Berko, J. (1958). The child's learning of English morphology. Word, 14, 150-177.
Bloch, B. (1947). English verb inflection. Language, 23, 399-418.
Bowerman, M. (1987). Discussion: Mechanisms of language acquisition. In B. MacWhinney (Ed.), Mechanisms of language acquisition. Hillsdale, NJ: Erlbaum.
Brown, R. (1973). A first language: The early stages. Cambridge, MA: Harvard University Press.
Bybee, J.L. (1985). Morphology: A study of the relation between meaning and form. Philadelphia: Benjamins.
Bybee, J.L., & Slobin, D.I. (1982). Rules and schemas in the development and use of the English past tense. Language, 58, 265-289.
Carey, S. (1985). Conceptual change in childhood. Cambridge, MA: Bradford Books/MIT Press.
Cazden, C.B. (1968). The acquisition of noun and verb inflections. Child Development, 39, 433-448.
Chomsky, N. (1957). Syntactic structures. The Hague: Mouton.
Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.
Chomsky, N. (1981). Lectures on government and binding. Dordrecht, Netherlands: Foris.


Chomsky, N., & Halle, M. (1968). The sound pattern of English. New York: Harper and Row.
Curme, G. (1935). A grammar of the English language II. Boston: Barnes & Noble.
de Jong, G.F., & Mooney, R.J. (1986). Explanation-based learning: An alternative view. Machine Learning, 1, 145-176.
Ervin, S. (1964). Imitation and structural change in children's language. In E. Lenneberg (Ed.), New directions in the study of language. Cambridge, MA: MIT Press.
Feldman, J.A., & Ballard, D.H. (1982). Connectionist models and their properties. Cognitive Science, 6, 205-254.
Fodor, J.A. (1968). Psychological explanation. New York: Random House.
Fodor, J.A. (1975). The language of thought. New York: T.Y. Crowell.
Francis, N., & Kucera, H. (1982). Frequency analysis of English usage: Lexicon and grammar. Boston: Houghton Mifflin.
Fries, C. (1940). American English grammar. New York: Appleton-Century.
Gelman, S.A., & Markman, E.M. (1986). Categories and induction in young children. Cognition, 23, 183-209.
Gleitman, L.R., & Wanner, E. (1982). Language acquisition: The state of the state of the art. In E. Wanner and L.R. Gleitman (Eds.), Language acquisition: The state of the art. New York: Cambridge University Press.
Gordon, P. (1986). Level-ordering in lexical development. Cognition, 21, 73-93.
Gropen, J., & Pinker, S. (1986). Constrained productivity in the acquisition of the dative alternation. Paper presented at the 11th Annual Boston University Conference on Language Development, October.
Halle, M. (1957). In defense of the Number Two. In E. Pulgram (Ed.), Studies presented to J. Whatmough. The Hague: Mouton.
Halle, M. (1962). Phonology in generative grammar. Word, 18, 54-72.
Halwes, T., & Jenkins, J.J. (1971). Problem of serial behavior is not resolved by context-sensitive memory models. Psychological Review, 78, 122-129.
Hinton, G.E. (1981). Implementing semantic networks in parallel hardware. In G.E. Hinton & J.A. Anderson (Eds.), Parallel models of associative memory. Hillsdale, NJ: Erlbaum.
Hinton, G.E., McClelland, J.L., & Rumelhart, D.E. (1986). Distributed representations. In D.E. Rumelhart, J.L. McClelland, and the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations. Cambridge, MA: Bradford Books/MIT Press.
Hinton, G.E., & Sejnowski, T.J. (1986). Learning and relearning in Boltzmann machines. In D.E. Rumelhart, J.L. McClelland, and the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations. Cambridge, MA: Bradford Books/MIT Press.
Hoard, J., & Sloat, C. (1973). English irregular verbs. Language, 49, 107-120.
Hockett, C. (1942). English verb inflection. Studies in Linguistics, 1.2, 1-8.
Jespersen, O. (1942). A modern English grammar on historical principles, VI. Reprinted 1961, London: George Allen & Unwin Ltd.
Keil, F.C. (1986). The acquisition of natural kinds and artifact terms. In W. Demopoulos & A. Marras (Eds.), Language learning and concept acquisition: Foundational issues. Norwood, NJ: Ablex.
Kiparsky, P. (1982a). From cyclical to lexical phonology. In H. van der Hulst & N. Smith (Eds.), The structure of phonological representations. Dordrecht, Netherlands: Foris.
Kiparsky, P. (1982b). Lexical phonology and morphology. In I.S. Yang (Ed.), Linguistics in the morning calm. Seoul: Hanshin, pp. 3-91.
Kucera, H., & Francis, N. (1967). Computational analysis of present-day American English. Providence: Brown University Press.
Kuczaj, S.A. (1976). Arguments against Hurford's auxiliary copying rule. Journal of Child Language, 3, 423-427.
Kuczaj, S.A. (1977). The acquisition of regular and irregular past tense forms. Journal of Verbal Learning and Verbal Behavior, 16, 589-600.


Kuczaj, S.A. (1978). Children's judgments of grammatical and ungrammatical irregular past tense verbs. Child Development, 49, 319-326.
Kuczaj, S.A. (1981). More on children's initial failure to relate specific acquisitions. Journal of Child Language, 8, 485-487.
Lachter, J., & Bever, T.G. (1988). The relation between linguistic structure and associative theories of language learning: A constructive critique of some connectionist learning models. Cognition, 28, 195-247, this issue.
Lakoff, G. (1987). Connectionist explanations in linguistics: Some thoughts on recent anti-connectionist papers. Unpublished electronic manuscript, ARPAnet.
Levy, Y. (1983). The use of nonce word tests in assessing children's verbal knowledge. Paper presented at the 8th Annual Boston University Conference on Language Development, October, 1983.
Liberman, M., & Pierrehumbert, J. (1984). Intonational invariance under changes in pitch range and length. In M. Aronoff & R. Oehrle (Eds.), Language sound structure. Cambridge, MA: MIT Press.
MacWhinney, B., & Snow, C.E. (1985). The child language data exchange system. Journal of Child Language, 12, 271-296.
MacWhinney, B., & Sokolov, J.L. (1987). The competition model of the acquisition of syntax. In B. MacWhinney (Ed.), Mechanisms of language acquisition. Hillsdale, NJ: Erlbaum.
Maratsos, M., Gudeman, R., Gerard-Nogo, P., & de Hart, G. (1987). A study in novel word learning: The productivity of the causative. In B. MacWhinney (Ed.), Mechanisms of language acquisition. Hillsdale, NJ: Erlbaum.
Maratsos, M., & Kuczaj, S.A. (1978). Against the transformationalist account: A simpler analysis of auxiliary overmarkings. Journal of Child Language, 5, 337-345.
Marr, D. (1982). Vision. San Francisco: Freeman.
McCarthy, J., & Prince, A. (forthcoming). Prosodic morphology.
McClelland, J.L., & Rumelhart, D.E. (1985). Distributed memory and the representation of general and specific information. Journal of Experimental Psychology: General, 114, 159-188.
McClelland, J.L., Rumelhart, D.E., & Hinton, G.E. (1986). The appeal of parallel distributed processing. In D.E. Rumelhart, J.L. McClelland, and the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations. Cambridge, MA: Bradford Books/MIT Press.
McClelland, J.L., Rumelhart, D.E., & The PDP Research Group. (1986). Parallel distributed processing: Explorations in the microstructure of cognition. Volume 2: Psychological and biological models. Cambridge, MA: Bradford Books/MIT Press.
McCulloch, W.S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.
Mencken, H. (1936). The American language. New York: Knopf.
Minsky, M. (1963). Steps toward artificial intelligence. In E.A. Feigenbaum & J. Feldman (Eds.), Computers and thought. New York: McGraw-Hill.
Newell, A., & Simon, H. (1961). Computer simulation of human thinking. Science, 134, 2011-2017.
Newell, A., & Simon, H. (1972). Human problem solving. Englewood Cliffs, NJ: Prentice-Hall.
Norman, D.A. (1986). Reflections on cognition and parallel distributed processing. In J.L. McClelland, D.E. Rumelhart, & The PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition. Volume 2: Psychological and biological models. Cambridge, MA: Bradford Books/MIT Press.
Osherson, D.N., Smith, E.E., & Shafir, E. (1986). Some origins of belief. Cognition, 24, 197-224.
Palmer, H. (1930). A grammar of spoken English on a strictly phonetic basis. Cambridge: W. Heffer.
Pazzani, M. (1987). Explanation-based learning for knowledge-based systems. International Journal of Man-Machine Studies, 26, 413-433.
Pazzani, M., & Dyer, M. (1987). A comparison of concept identification in human learning and network learning with the generalized delta rule. Unpublished manuscript, UCLA.


Pierrehumbert, J., & Beckman, M. (1986). Japanese tone structure. Unpublished manuscript, AT&T Bell Laboratories, Murray Hill, NJ.
Pinker, S. (1979). Formal models of language learning. Cognition, 7, 217-283.
Pinker, S. (1984). Language learnability and language development. Cambridge, MA: Harvard University Press.
Pinker, S., Lebeaux, D.S., & Frost, L.A. (1987). Productivity and conservatism in the acquisition of the passive. Cognition, 26, 195-267.
Putnam, H. (1960). Minds and machines. In S. Hook (Ed.), Dimensions of mind: A symposium. New York: NYU Press.
Pylyshyn, Z.W. (1984). Computation and cognition: Toward a foundation for cognitive science. Cambridge, MA: Bradford Books/MIT Press.
Rosch, E., & Mervis, C.B. (1975). Family resemblances: Studies in the internal structure of categories. Cognitive Psychology, 7, 573-605.
Rosenblatt, F. (1962). Principles of neurodynamics. New York: Spartan.
Ross, J.R. (1975). Wording up. Unpublished manuscript, MIT.
Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning internal representations by error propagation. In D.E. Rumelhart, J.L. McClelland, and the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations. Cambridge, MA: Bradford Books/MIT Press.
Rumelhart, D.E., & McClelland, J.L. (1986a). PDP models and general issues in cognitive science. In D.E. Rumelhart, J.L. McClelland, and the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations. Cambridge, MA: Bradford Books/MIT Press.
Rumelhart, D.E., & McClelland, J.L. (1986b). On learning the past tenses of English verbs. In J.L. McClelland, D.E. Rumelhart, and the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition. Volume 2: Psychological and biological models. Cambridge, MA: Bradford Books/MIT Press.
Rumelhart, D.E., & McClelland, J.L. (1987). Learning the past tenses of English verbs: Implicit rules or parallel distributed processing? In B. MacWhinney (Ed.), Mechanisms of language acquisition. Hillsdale, NJ: Erlbaum.
Rumelhart, D.E., McClelland, J.L., and the PDP Research Group. (1986). Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations. Cambridge, MA: Bradford Books/MIT Press.
Sampson, G. (1987). A turning point in linguistics. Times Literary Supplement, June 12, 1987, 643.
Savin, H., & Bever, T.G. (1970). The nonperceptual reality of the phoneme. Journal of Verbal Learning and Verbal Behavior, 9, 295-302.
Shattuck-Hufnagel, S. (1979). Speech errors as evidence for a serial-ordering mechanism in sentence production. In W.E. Cooper & E.C.T. Walker (Eds.), Sentence processing: Psycholinguistic studies presented to Merrill Garrett. Hillsdale, NJ: Erlbaum.
Sietsema, B. (1987). Theoretical commitments underlying Wickelphonology. Unpublished manuscript, MIT.
Sloat, C., & Hoard, J. (1971). The inflectional morphology of English. Glossa, 5, 47-56.
Slobin, D.I. (1971). On the learning of morphological rules: A reply to Palermo and Eberhart. In D.I. Slobin (Ed.), The ontogenesis of grammar: A theoretical symposium. New York: Academic Press.
Slobin, D.I. (1985). Crosslinguistic evidence for the language-making capacity. In D.I. Slobin (Ed.), The crosslinguistic study of language acquisition. Volume II: Theoretical issues. Hillsdale, NJ: Erlbaum.
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In D.E. Rumelhart, J.L. McClelland, and the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations. Cambridge, MA: Bradford Books/MIT Press.

Language and connectionism

193

Smolensky, P. (in press). The proper treatment of connectionism. Behavioral and Brain Sciences.
Sommer, B.A. (1980). The shape of Kunjen syllables. In D.L. Goyvaerts (Ed.), Phonology in the 1980's. Ghent: Story-Scientia.
Sweet, H. (1892). A new English grammar, logical and historical. Oxford: Clarendon Press.
Talmy, L. (1985). Lexicalization patterns: Semantic structure in lexical forms. In T. Shopen (Ed.), Language typology and syntactic description. Vol. 3: Grammatical categories and the lexicon. New York: Cambridge University Press.
Touretzky, D. (1986). BoltzCONS: Reconciling connectionism with the recursive nature of stacks and trees. Proceedings of the Eighth Annual Conference of the Cognitive Science Society.
Touretzky, D., & Hinton, G.E. (1985). Symbols among the neurons: Details of a connectionist inference architecture. Proceedings of the Ninth International Joint Conference on Artificial Intelligence.
Treisman, A., & Schmidt, H. (1982). Illusory conjunctions in the perception of objects. Cognitive Psychology, 14, 107-141.
van der Hulst, H., & Smith, N. (Eds.) (1982). The structure of phonological representations. Dordrecht, The Netherlands: Foris.
Wexler, K., & Culicover, P. (1980). Formal principles of language acquisition. Cambridge, MA: MIT Press.
Wickelgren, W.A. (1969). Context-sensitive coding, associative memory, and serial order in (speech) behavior. Psychological Review, 76, 1-15.
Williams, E. (1981). On the notions "lexically related" and "head of a word". Linguistic Inquiry, 12, 245-274.

The relation between linguistic structure and associative theories of language learning: A constructive critique of some connectionist learning models*

JOEL LACHTER
THOMAS G. BEVER
University of Rochester

There's no safety in numbers ... or anything else (Thurber)

Abstract

Recently proposed connectionist models of acquired linguistic behaviors have linguistic rule-based representations built in. Similar connectionist models of language acquisition have arbitrary devices and architectures which make them mimic the effect of rules. Connectionist models in general are not well-suited to account for the acquisition of structural knowledge, and require predetermined structures to simulate even basic linguistic facts. Such models are more appropriate for describing the formation of complex associations between structures which are independently represented. This makes connectionist models potentially important tools in studying the relations between frequent behaviors and the structures underlying knowledge representations. At the very least, such models may offer computationally powerful ways of demonstrating the limits of associationistic descriptions of behavior.

1. Rules and models

This paper considers the status of current proposals that connectionist systems of cognitive modelling can account for rule-governed behavior without directly representing the corresponding rules (Hinton & Anderson, 1981; Hanson

*We are grateful for comments on earlier drafts of this paper from Gary Dell, Jeff Elman, Jerry Feldman, Jerry Fodor, Lou Ann Gerken, Steve Hanson, George Lakoff, Steve Pinker, Alan Prince, Zenon Pylyshyn, Patrice Simard, Paul Smolensky, and Ginny Valian. Requests for reprints should be addressed to J. Lachter or T.G. Bever, Department of Psychology, University of Rochester, Rochester, NY 14627, U.S.A. This work was completed while the first author was supported by a National Science Foundation pre-doctoral fellowship.

196 J. Lachter and T. G. Bever

& Kegl, 1987a, b; McClelland & Rumelhart, 1986; Rumelhart & McClelland, 1986; Smolensky, in press). We find that those models which seem to exhibit regularities defined by structural rules and constraints do so only because of their ad hoc representations and architectures, which are manifestly motivated by such rules. We conclude that, at best, connectionist models may contribute to our understanding of complex associations between independently defined structures.

For the purposes of our discussion, a rule is a function which maps one representation onto another. The status of rules within the field of cognitive modelling is at the center of a current war. For the past 20 years, the dominant approach to algorithmic modelling of intelligent behavior has been in terms of 'production systems' (for discussions and references see Anderson, 1983; Neches, Langley, & Klahr, 1987). Production systems characteristically (but not criterially) utilize a set of statements and algorithmic steps which result in a behavior. Such models characteristically (but not criterially) operate linearly; that is, a model first consults a proposition, applies it if relevant, then goes on to the next proposition, and so on.

Recently, a different paradigm in cognitive modelling has been proposed, using information arranged in systems which can apply as a set of parallel constraint satisfactions. In these systems, a network of interconnected nodes represents the organization underlying the behavior. The relationship between each pair of nodes is an activation function which specifies the strength with which one node's activation level affects another. Such systems are touted as meeting many potential objections to production systems, in that the effect of the nodes on output behavior can be simultaneous, and the relationship to neuronal nets is transparent and enticing (Feldman & Ballard, 1982). Since the nodes are by definition interconnected, this paradigm for artificial intelligence has become known as 'connectionism' (Dell, 1986; Feldman & Ballard, 1982; Grossberg, 1987; Hinton & Sejnowski, 1983; Hopfield, 1982; McClelland & Rumelhart, 1981; for general references on connectionism, see Feldman et al., 1985; McClelland & Rumelhart, 1986).

Connectionist modelling defines sets of computational languages based on network structures and activation functions. In certain configurations, such languages can map any Boolean input/output function. Thus, connectionism is no more a psychological theory than is Boolean algebra. Its value for psychological theory in general can be assessed only in specific models. Language offers one of the most complex challenges to any theoretical paradigm. Accordingly, we concentrate our attention on some recent connectionist models devoted to the description of language behaviors. After some initial consideration of models of acquired language behavior, we turn to models which purport to learn language behaviors from analogues to normal input.
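The claim that such networks can, in certain configurations, map any Boolean input/output function can be illustrated with a small hand-wired example (ours, not from the paper): two layers of binary threshold units computing exclusive-or, a function no single threshold unit can compute.

```python
def unit(inputs, weights, threshold):
    """A binary threshold unit: fires iff its weighted input sum exceeds the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

def xor(a, b):
    # Hidden layer: one unit computes OR, another computes AND.
    h_or = unit([a, b], [1.0, 1.0], 0.5)
    h_and = unit([a, b], [1.0, 1.0], 1.5)
    # Output: OR but not AND, i.e. exclusive-or.
    return unit([h_or, h_and], [1.0, -1.0], 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor(a, b))
```

The weights and thresholds here are fixed by hand; the learning question, i.e. whether such settings can be induced from data, is the topic taken up below.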


Such models are most important because they might be ambitiously taken as solving the problem of how rule-governable behavior is induced from the environment without the invocation of rules.

What we shall demonstrate, roughly, is this: particular connectionist models for language indeed do not contain algorithmic rules of the sort used in production systems. Some of these models, however, work at their pre-defined tasks because they have special representations built in, which are connectionist-style implementations of rule-based descriptions of language (Dell, 1986; Hanson & Kegl, 1987; McClelland & Elman, 1986). Another model, touted because it does not contain rules, actually contains arbitrary computational devices which implement fragments of rule-based representations: we show that it is just these devices that crucially enable this model to simulate some properties of rule-based regularities in the input data (Rumelhart & McClelland, 1986b). Thus, none of these connectionist models succeed in explaining language behavior without containing linguistic representations, which in speakers are the result of linguistic rules.

2. Models of acquired language behaviors

We first consider some models of adult skill. The goal in these cases is to describe a regular input/output behavior in a connectionist configuration. Dell's model of speech production serves as a case-in-point (based on Dell, 1986; personal communication; see Figure 1). There are two intersecting systems of interconnected nodes, one for linguistic representations and one for sequencing the output. In the linguistic representation, the nodes are organized in four separate levels: words, syllables, phonemes, and phonological features. Each word describes a hierarchy specifying the order of its component syllables, which in turn specify their component phonemes, which in turn specify bundles of phonetic features. The sequencing system activates elements in phonologically allowable orders. Each phone receives activation input from both the linguistic subsystem and the sequencing subsystem, which results in their being produced in a specified order. As each phoneme is activated, it in turn activates all the features and all the syllables to which it is connected, even ones not relevant in the current word; then those features, words and syllables can in turn activate other phonemes. This pattern of radiating activation automatically activates relatively strongly just those irrelevant words and syllables with structurally similar descriptions. Accordingly, the model predicts that errors can occur as a function of the activation of irrelevant phones, syllables and words, but primarily those in structurally similar positions.
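The radiating-activation account of exchange errors can be caricatured in a few lines (our toy sketch, not Dell's implementation; the lexicon and the position-matching spread function are invented for illustration):

```python
# Toy lexicon: each word is a list of phonemes (invented example).
lexicon = {
    "bet": ["b", "e", "t"],
    "bat": ["b", "a", "t"],
    "pen": ["p", "e", "n"],
}

def phoneme_activation(target, lexicon, spread=0.5):
    """Activate the target word's phonemes, then let activation radiate to
    other words through shared (position, phoneme) units and back down to
    those words' own phonemes."""
    act = {}
    # Top-down activation of the target's phonemes, indexed by position.
    for pos, ph in enumerate(lexicon[target]):
        act[(pos, ph)] = act.get((pos, ph), 0.0) + 1.0
    # Each other word is activated in proportion to its positional overlap
    # with the target, and passes a fraction back to its own phonemes.
    for word, phones in lexicon.items():
        if word == target:
            continue
        overlap = sum(1 for pos, ph in enumerate(phones)
                      if pos < len(lexicon[target]) and lexicon[target][pos] == ph)
        for pos, ph in enumerate(phones):
            act[(pos, ph)] = act.get((pos, ph), 0.0) + spread * overlap / len(phones)
    return act

act = phoneme_activation("bet", lexicon)
# The structurally similar 'bat' leaves its vowel more active than 'pen'
# leaves its final consonant, so a same-position substitution is favored:
print(act[(1, "a")] > act[(2, "n")])  # True
```

The point of the sketch is only the qualitative prediction: competing units in structurally similar positions end up with the most residual activation, which is where exchange errors are predicted to occur.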





It is well known that these are just the kind of speech errors that occur: exchanges between sounds and words in structurally similar positions. It is crucial that the nodes are arrayed at different levels of lexical and phonological representation. Each of these levels has some pre-theoretic intuitive basis, but actually the units at each level are consistently defined in terms of a theory with rules which range categorically over those units. Hence, if the model is taken as a psychological theory, it would support the validity of particular levels of representation which are uniquely defined in terms of a rule-governed structural theory. The model also makes certain assumptions about the prior availability of sequence instructions, absolute speed of activation, and speed of inhibition of just-uttered units. Thus, the model comprises a fairly complete outline of the sequencing of speech output, given prior linguistically defined information and the prior specification of a number of performance mechanisms and parameters. The model is a talking mechanism for the normal flow of speech which can correctly represent errors because it has the linguistic units and the relations between them wired in.

Such models can represent perceptual as well as productive processes. For example, Elman, McClelland and Rumelhart have developed a series of models for word recognition (Elman & McClelland, 1986; McClelland & Elman, 1986; McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1982). These models are similar to Dell's model in the sense that they have several levels of linguistic representation built into them. For example, TRACE (McClelland & Elman, 1986) recognizes words based on a stylized acoustic feature sequential input. The model utilizes three kinds of detector nodes, for acoustic features, phonemes and words. Feature nodes are grouped into 11 sets, each corresponding to a successive 'time slice' of the acoustic input. Each slice presents the value of each of seven feature dimensions; each dimension has 9 possible values. Accordingly, at the input level of this model, each 'phoneme' is represented in terms of a feature/value/time-slice matrix which defines 693 separate nodes. The centers of each set of time slices from a phoneme are spaced 6 slices apart, to simulate the acoustic overlap that real phones have in speech. Every three time slices there is centered a set of connections to 15 phoneme detector nodes, with each phoneme node receiving input from 11 time slices. This means that a three-phoneme word activates 1993 feature/value/time-slice nodes and 165 phoneme units. Finally, there are 211 word nodes, appropriately linked to their constituent phonemes.

TRACE builds in several levels of linguistic analysis. This is not merely an external observation about the model; it is reflected in its internal architecture. All nodes at the same level of representation inhibit each other, while nodes at adjacent levels of representation excite each other. The within-level inhibition serves the function of reducing the active nodes at each level to


those that are most active; the across-level facilitation serves the function of increasing the effect of one level on another. The within-level inhibition in TRACE is set at distinct values for each level, as are the levels of excitation for phonemes and features. Such effects occur because the nodes at the different levels of representation are interconnected. We take this to be one of the obvious features of connectionist algorithms: insofar as there are interactions between levels of representation, especially influence from higher to lower levels, a parallel system with multiple simultaneous connections both within and between levels is indicated. The result is that the model exhibits qualitatively distinct sets of nodes, grouped and layered in the same way as the corresponding linguistic levels of representation. With all this built-in linguistic apparatus, the TRACE model can capture a number of interesting aspects of lexical access by humans. First, it can recognize words from mock acoustic feature input. Second, it can generate a variety of effects involving interference relations between different levels of representation: for example, a family of word superiority effects, in which properties of words influence the perception of particular phonemes. Indeed, it is difficult to see how a production system of rules could be elegantly configured to capture such phenomena.
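The node counts given in the TRACE description above follow from the stated geometry; a quick check of the per-phoneme figure (our arithmetic only, using the numbers given in the text):

```python
dimensions = 7           # feature dimensions per time slice
values = 9               # possible values on each dimension
slices_per_phoneme = 11  # time slices feeding one phoneme's input matrix

# Feature/value/time-slice nodes defining a single input 'phoneme':
nodes_per_phoneme = dimensions * values * slices_per_phoneme
print(nodes_per_phoneme)  # 693, matching the figure in the text
```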


3. A model which seems to learn past tense rules

The previous models embed an already established linguistic analysis into a model of adult behavior: these models demonstrate that the connectionist framework allows for the implementation of linguistic structure in computationally effective ways. If we take such models as psychologically instructive, they inform us that once one knows what the internal structure of a language is, it can be encoded in such a system in a way which predicts linguistic behavior. A more ambitious goal is to construct a connectionist model which actually learns regularities from normal input, and generates behavior which conforms to the real behavior. In this way, the model can aspire to explain the patterns of the characteristic behavior. Consider the recent model proposed by Rumelhart and McClelland (1986b, R&M) which seems to learn the past tense rules of English verbs without learning any explicit rules. This model has been given considerable attention in a current review article, especially with reference to its empirical and principled failings and how such failings follow from the choice of computational architecture (Pinker & Prince, 1988). Our approach is complementary: we examine the R&M model internally, in order to understand why it works. We argue that the model seems to work for two kinds of reasons: first, it contains arbitrary devices which make it


relatively sensitive to those phonological structures which are involved in the past-tense rules; second, it stages the input data and internal learning functions so that it simulates child-like learning behavior. We first describe the past-tense phenomena in rule-governed terms, and then examine how the R&M model conforms to the consequent regularities in the child's stages of acquisition.

The principles of past tense formation in English involve a regular ending ('ed') for most verbs, and a set of internal vowel changes for a minority of verbs. In our discussion we are interested in a particular kind of structural rule, typical of linguistic phenomena: one which performs categorical operations on categorical symbols. In this sense, linguistic rules admit of no exceptions. That is, they are not probabilistically sensitive to inputs, and their effects are not probabilistic. We can see how they work in the formation of the regular past tense from the present in English verbs. The regularities are:

    mat ... mattED     If the verb ends in "t" or "d", add "ed"
    need ... needED

    push ... pushT     If the verb ends in sounds "sh", "k", "p", "ch", ... add "t"
    pick ... pickT

    buzz ... buzzD     If the verb ends in "z", "g", "b", "j" or a vowel ... add "d"
    bug ... bugD
    ski ... skiD

These facts are described with rules which draw on and define a set of phonological 'distinctive features', attributes which co-occur in each distinct phoneme. Features are both concrete and abstract: they are concrete in the sense that they define the phonetic/acoustic content of speech; they are abstract in the sense that they are the objects of phonological rules, which in turn can define levels of representation which are not pronounced or pronounceable in a given language. The description of the rules of past tense formation exemplifies these characteristics. (Note that any particular description is theory dependent. We have chosen a fairly neutral form (basically that in Halle, 1962), although we recognize that phonological theory is continually in convulsions. The generalizations we draw are worded to hold across a wide range of such theories.)

1a. Add a phoneme which has all the features in common to T and D, with the voicing dimension unspecified.
1b. Insert a neutral vowel, 'e', between two word-final consonants that have identical features, ignoring voicing.
1c. Assimilate the voicing of the T/D to that of the preceding segment.
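The surface regularities can be sketched as a selection procedure over stem-final sounds (our illustration; a crude orthographic stand-in for the feature-based rules (1a)-(1c) above, with an inventory limited to the examples in the text):

```python
UNVOICED_FINALS = ("sh", "k", "p", "ch", "s", "f")  # illustrative, not exhaustive

def past_suffix(final_sound):
    """Pick the regular past-tense ending from the stem-final sound
    (a crude stand-in for the stem-final phoneme's features)."""
    if final_sound in ("t", "d"):
        return "ed"   # mat -> mattED, need -> needED
    if final_sound in UNVOICED_FINALS:
        return "t"    # push -> pushT, pick -> pickT
    return "d"        # buzz -> buzzD, bug -> bugD, ski -> skiD

print(past_suffix("t"), past_suffix("sh"), past_suffix("z"))  # ed t d
```

Note that listing final sounds case by case is exactly what the feature-based formulation avoids: rules (1a)-(1c) derive all three outcomes from voicing and identity of features, rather than enumerating segments.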


These rules define distinct internal representations of each past form:

    Rule    Push       Pit
    1a      PushT/D    PitT/D
    1b      (n/a)      PiteT/D
    1c      PushT      PiteD

There are models in which such rules are interpreted as simultaneous distinct constraints (Goldsmith, 1976) or as applying in strict order (Halle, 1962). Either way, the effects of the rules on each output form are categorical, involving a distinct, but abstract, phonetic form. The specific shape of the rules also clarifies the sense in which the distinctive features are symbolic. For example, rule (1b) refers to sequences of consonants which are made in the same manner and place, regardless of their voicing status. This rule also applies to break up sequences of s, z, sh, ch created by affixing the plural and present tense S/Z to nouns (glitches) and verbs (botches); the voicing assimilation rule (1c) applies in those cases as well. The generality of the application of such rules highlights the fact that they apply to abstract subsets of features, not to actual phonetic or acoustic objects.

The depth of abstract representations becomes clear when we consider how these rules are embedded within a fuller set of optional phonological processes involved in describing English sound systems in general (Chomsky & Halle, 1968). Consider the verbs mound and mount; in one of their allowed past tense pronunciations, they appear as mouDed (long nasalized vowel preceding a tongue flap) and mouDed (short nasalized vowel preceding a tongue flap). These two words can end up being differentiated acoustically only in terms of the length of the first vowel, even though the underlying basis for the difference is the voicing, or absence of it, in the final d/t of the stem (see Chomsky, 1964). The rules involved in arriving at these pronunciations include:

1d. nasalize a vowel before a nasal
1e. drop a nasal before a homorganic stop
1f. lengthen a vowel before a voiced consonant
1g. change t or d to a tongue flap, D, between vowels

Each of these rules has independent application in other parts of English pronunciation patterns, so they are distinct. If these rules are separated so


that they can apply to isolated cases, they must apply in order when combined. For example, since (1d) requires a nasal, (1e) cannot have applied; if (1f) is to apply differentially to 'mound', and not to 'mount', then (1e) must have applied; if the vowel length is to reflect the difference between 'mound' and 'mount', (1f) must apply before (1g). Thus, whether the intermediate stages are serially or logically ordered, the inputs, mound + past, mount + past, have a very abstract relation to their corresponding outputs:

                 mound + past    mount + past
    (rule 1a)    mound D/T       mount D/T
    (rule 1b)    mound eD/T      mount eD/T
    (rule 1c)    mound ed        mount ed
    (rule 1d)    m-ounded        m-ounted
    (rule 1e)    m-ouded         m-outed
    (rule 1f)    m-'ouded        (can't apply)
    (rule 1g)    m-'ouDed        m-ouDed

That the rules can be optional does not make them probabilistic within the model: they state what is allowed, not how often it happens (indeed, for many speakers, deleting nasals can occur more easily before unvoiced homorganic stops than before voiced stops). Optionality in a rule is a way of expressing the fact that the structure can occur with and without a corresponding property.

The fact that linguistic rules apply to their appropriate domain 'without exception' does not mean that the appropriate domain is defined only in terms of phonological units. For example, in English there is a particular set of 'irregular' verbs which are not subject to the three past-tense formation rules described above. Whether a verb is irregular or not depends on its lexical representation: certain verbs are and others are not 'regular'. For example, one has to know which 'ring' is meant to differentiate the correct from incorrect past tenses below (see Pinker & Prince, 1988 for other cases).

2a. The Indians ringed (*rang) the settler's encampment.
2b. The Indians rang (*ringed) in the new year.

There are about 200 irregular verbs in modern English. They are the detritus of a more general rule-governed system in Old English (which have an interesting relation to the Indo-European e/o ablaut; see Bever & Langendoen, 1963). They fall into a few groups: those involving no change (which characteristically already end in t or d (beat, rid)); those which add t (or d) (send); those lowering the vowel (drink, give); those involving a reversal of the vowel color between front and back (find, break, come); those which both lower and change vowel color (sting); those which involve combinations

That the rules can be optional does not make them probabilistic within the model : they state what is allowed , not how often it happens (indeed , for many speakers, deleting nasals can occur more easily before unvoiced homor ganic stops, than before voiced stops) . Optionality in a rule is a way of expressing the fact that the structure can occur with and without a corre sponding property . The fact that linguistic rules apply to their appropriate domain 'without exception ' does not mean that the appropriate domain is defined only in terms of phonological units . For example , in English there is a particular set of 'irregular ' verbs which are not subject to the three past-tense formation rules described above. Whether a verb is irregular or not depends on its lexical representation : certain verbs are and others are not 'regular ' . For example , one has to know which 'ring ' is meant to differentiate the correct from incorrect past tenses below (see Pinker & Prince , 1988 for other cases . ) 2a. The Indians ringed ( *rang) the settler 's encampment . 2b . The Indians rang ( *ringed ) in the new year . There are about 200 irregular verbs in modern English . They are the detritus of a more general rule -governed system in Old English (which have an interesting relation to the Indo -European e/o ablaut , see Bever & Langen doen , 1963) . They fall into a few groups ; those involving no change (which characteristically already end in t or d (beat , rid ) ; those which add t (or d) (send) ; those lowering the vowel (drink , give) ; those involving a reversal of the vowel color between front and back (find , break ; come) ; those which both lower and change vowel color (sting) ; those which involve combinations


of all three kinds of change (bring, feel, tell). The point for our purposes is that almost all of the 'irregular' verbs draw on a small set of phonological processes. Only a few involve completely suppletive forms (e.g., go/went).

This brief analysis of past tense formation in terms of features and rules reveals several properties of the structural system. First, the relevant grouping of features for the rules is vertical, with features grouped into phonemic segments. This property is not formally necessary. For example, rules could range over isolated features collected from a series of phones: apparently, it is a matter of linguistic fact that the phoneme is a natural domain of the phonological processes involved in past tense formation (note that there may be suprasegmental processes as well, but these tend to range over different locations of values on the same feature dimension). Second, it is the segment at the end of the verb stem that determines the ultimate features of the regular past tense ending. This, too, is a fact about the English past system, not a logically necessary property.
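The ordering argument above can be made concrete by coding rules (1d)-(1g) as string transformations applied in a fixed order (our sketch; '~' marks nasalization, ':' vowel length, 'D' the tongue flap, and the rule environments are simplified to just what these two verbs need):

```python
VOWELS = {"ou", "e"}   # stem and suffix vowels used in these derivations
NASALS = {"n"}
STOPS = {"t", "d"}

def apply_rules(segs):
    """Apply the optional rules (1d)-(1g), in that order, to a segment list."""
    segs = list(segs)
    # (1d) nasalize a vowel before a nasal
    for i in range(len(segs) - 1):
        if segs[i] in VOWELS and segs[i + 1] in NASALS:
            segs[i] += "~"
    # (1e) drop a nasal before a homorganic stop
    segs = [s for i, s in enumerate(segs)
            if not (s in NASALS and i + 1 < len(segs) and segs[i + 1] in STOPS)]
    # (1f) lengthen a vowel before a voiced consonant
    # (restricted here to the stem vowel, for illustration)
    for i in range(len(segs) - 1):
        if segs[i].startswith("ou") and segs[i + 1] == "d":
            segs[i] += ":"
    # (1g) flap t or d between vowels
    for i in range(1, len(segs) - 1):
        if segs[i] in STOPS and segs[i - 1].rstrip("~:") in VOWELS and segs[i + 1] in VOWELS:
            segs[i] = "D"
    return "".join(segs)

print(apply_rules(["m", "ou", "n", "d", "e", "d"]))  # mou~:Ded  ('mound' + past)
print(apply_rules(["m", "ou", "n", "t", "e", "d"]))  # mou~Ded   ('mount' + past)
```

As in the derivation table, the two surface forms differ only in vowel length, even though the underlying difference is the voicing of the stem-final stop; reordering the rules (e.g. running (1e) before (1d)) destroys the contrast.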
4. The R&M model: A general picture of PDP models of learning

Rumelhart and McClelland (1986b; R&M) implemented a model which learns to associate the past tense with the present tense of both the majority and minority verb types. The first step in setting up this model is to postulate a description of words in terms of individual feature units. Parallel distributed connectionist models are not naturally suited to represent serially ordered representations, since all components are to be represented simultaneously in one matrix. But phonemes, and their corresponding bundles of distinctive features, clearly are ordered. R&M solve this problem by invoking a form of phonemic representation suggested by Wickelgren (1969), which recasts ordered phonemes into 'Wickelphones', which can be ordered in a given word in only one way. Wickelphones appear to avoid the problem of representing serial order by differentiating each phoneme as a function of its immediate phonemic neighborhood. For example, 'bet' would be represented as composed of the following Wickelphones:
    eT#, bEt, #bE

Each Wickelphone is a triple, consisting of the central phoneme, and a representation of the preceding and following phonemes as well. As reflected in the above representation, such entities do not have to be represented in memory as ordered: they can be combined in only one way into an actual sequence, if one follows the rule that the central phone must correspond to the prefix of the following unit and the postfix of the preceding unit. That rule leads to only one output representation for the above three Wickelphones, namely b...e...t. Of course, the number of Wickelphones in a language is much larger than the number of phonemes, roughly the third power. But such a representational scheme seems to circumvent the need for a direct representation of order itself (at least, so long as the vocabulary is restricted so that a given Wickelphone never occurs more than once in a word; see Pinker & Prince, 1988).

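The decomposition and the unique-reassembly rule can be sketched directly (our illustration, using '#' as the word-boundary symbol and assuming a vocabulary in which no Wickelphone recurs within a word):

```python
def wickelphones(word):
    """Decompose a word into Wickelphones: (left, center, right) triples,
    with '#' as the word boundary."""
    padded = "#" + word + "#"
    return {(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)}

def reassemble(phones):
    """Recover the unique ordering: start from the triple whose left context
    is '#', then repeatedly find the triple whose left pair matches the
    current (center, right) pair."""
    current = next(p for p in phones if p[0] == "#")
    word = current[1]
    while current[2] != "#":
        current = next(p for p in phones
                       if (p[0], p[1]) == (current[1], current[2]))
        word += current[1]
    return word

print(sorted(wickelphones("bet")))      # the unordered set of triples
print(reassemble(wickelphones("bet")))  # bet
```

The reassembly loop is exactly the "prefix must match postfix" rule stated above; it fails to be deterministic as soon as some Wickelphone occurs twice in a word, which is the restriction noted in the text.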
Figure 2. Categorization of phonemes on four simple dimensions

                               Place
                  Front          Middle          Back
                  V/L    U/S     V/L    U/S      V/L    U/S
  Interrupted
    Stop          b      p       d      t        g      k
    Nasal         m      -       n      -        N      -
  Cont. Consonant
    Fric.         v/D    f/T     z      s        Z/j    S/C
    Liq/SV        w/l    -       r      -        y      h
  Vowel
    High          E      i       O      ^        U      u
    Low           A      e       I      a/α      W      */o

Key: N = ng in sing; D = th in the; T = th in with; Z = z in azure; S = sh in ship; C = ch in chip; E = ee in beet; i = i in bit; O = oa in boat; ^ = u in but or schwa; U = oo in boot; u = oo in book; A = ai in bait; e = e in bet; I = i_e in bite; a = a in bat; α = a in father; W = ow in cow; * = aw in saw; o = o in hot.

R&M assign a set of phonemic distinctive features to each phone within a Wickelphone. There are 4 feature dimensions, two with two values and two with three, yielding 10 individual feature values (see Figure 2). This allows them to represent Wickelphones in feature matrices: for example, the bEt in bet would be represented as shown below.

          Dimension 1    Dimension 2    Dimension 3    Dimension 4
    f1    Interrupted    Stop           Voiced         Front
    f2    Vowel          Low            Short          Front
    f3    Interrupted    Stop           Unvoiced       Middle

The verb learning model represents each Wickelphone in a set of Wickelfeatures. These consist of a triple of features, [f1, f2, f3], the first taken from the prefix phone, the second from the central phone, and the third from the postfix phone. Accordingly, some of the Wickelfeatures for the bEt would be the following:
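The construction can be sketched as a cross-product over the three feature vectors (our illustration; R&M restrict the set of triples actually used, so this enumerates the full combinatorial space rather than their inventory):

```python
# Feature values for the phones of 'bet', copied from the matrix above.
FEATURES = {
    "b": ("Interrupted", "Stop", "Voiced", "Front"),
    "E": ("Vowel", "Low", "Short", "Front"),
    "t": ("Interrupted", "Stop", "Unvoiced", "Middle"),
}

def wickelfeatures(left, center, right):
    """Enumerate candidate Wickelfeatures for a Wickelphone: every triple
    [f1, f2, f3] with f1 drawn from the prefix phone's features, f2 from
    the central phone's, and f3 from the postfix phone's."""
    return [(f1, f2, f3)
            for f1 in FEATURES[left]
            for f2 in FEATURES[center]
            for f3 in FEATURES[right]]

feats = wickelfeatures("b", "E", "t")
print(len(feats))   # 4 * 4 * 4 = 64 candidate triples for the Wickelphone bEt
print(feats[0])     # ('Interrupted', 'Vowel', 'Interrupted')
```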


input node does not affect the state of the output node. If the weight is positive, and if the input node is activated by the present tense input, the output node is also activatedby the connectionbetweenthem; if the weighting is negative then the output node would be inhibited by that connection . Each output node also has a threshold, which the summedactivation input from all input nodesmust exceedfor the output node to becomeactive (we note below that the thresholdis itself interpreted asa probabilistic function). On each training trial , the machineis given the correct output Wickelfeature set as well as the input set. This makes it possibleto assess extent the to which each output Wickelfeature node which should be activated, is, and conversely The machine then uses a variant on the standard perceptron . learning rule (Rosenblatt, 1962 , which changesthe weights between the ) active input nodes and the output nodes which were incorrect on the trial : lower the weight and raise the threshold for all nodes that were incorrectly activated; do the opposite for nodes that were incorrectly inactivated. The machine was given a set of 200 training sessions with a number of , verbs in each session At the end of this training, the systemcould take new . verbs, that it had not processedbefore, and correctly associatetheir past tense in most cases Hence, the model appearsto learn, given a finite input , . how to generalize to new cases Furthermore, the model appears to go . through severalstagesof acquisitionwhich correspondto the stagesof learning the past tense of verbs which children go through as well (Brown, 1973 ; Bybee & Slobin, 1982 . During an early phase the model (and children) ) , produce the correct past tense for a small number of verbs, especially a number of the minority forms (went, ran, etc.). 
Then, the model (and children) overgeneralize the attachment of the majority past form, 'ed' and its variants, so that they then make errors on forms on which they had been correct before (goed, wented, runned, etc.). Finally, the model (and children) produce the correct minority and majority forms. It would seem that the model has learned the rule-governed behaviors involved in forming the past tense of novel verbs. Yet, as R&M point out, the model does not 'contain' rules, only matrices of associative strengths between nodes. They argue that the success of the system in learning the rule-governed properties, and in simulating the pattern of acquisition, shows that rules may not be a necessary component of the description of acquisition behavior. Whenever a model is touted to be as successful as theirs, it is hard to avoid the temptation to take it as a replacement for a rule-governed description and explanation of the phenomena. That is, we can go beyond R&M and take this model as a potential demonstration that the appearance that behavior is rule-governed is an illusion, and that its real nature is explained by networks of nodes and the associative strengths between nodes. A number

208

J. Lachter and T. G. Bever

of linguistic commentators have drawn this conclusion about connectionist models, in particular because of the apparently successful performance of R&M's past-tense learning model (Langackre, 1987; Sampson, 1987). Below, we examine the model in this light: we find no evidence that the model replaces linguistic rules; rather, it has an internal architecture which is arranged to be particularly sensitive to aspects of the data that conform to the rules. Thus, the model may (or may not) be a successful algorithmic description of learning; but, insofar as it works, it actually confirms the existence of rules as the basis for natural language.

5. The model's TRICS (The Representations It Crucially Supposes)

We now turn to some special characteristics of the model which contribute to making it work. In each case, we note that apparently arbitrary decisions, often presented as simplifications of the model, actually could be explained as accommodations to the rule-governed properties of the system and the resulting behavior. We should note that, by and large, it did not require much detective work to isolate the arbitrary properties; in most cases, R&M are explicit about them. There are two kinds of TRICS, those that reconstitute crucial aspects of the linguistic system, and those which create the overgeneralization pattern exhibited by children.

5.1.

A set of properties of the model serve to re-focus on the phoneme-by-phoneme clustering of features, and to emphasize the clarity of the information at the end of the verb. That is, the arbitrary decisions about the details of the model are transparently interpretable as making most reliable, for associative learning, the information relevant to the rule-governed description of the formation of the past tense. Our forensic method is the following: we examine each arbitrary decision about the model with the rule system in mind, and ask about each decision: would this facilitate or inhibit the behavioral emergence of data which looks like that governed by the past tense rules? Without exception, we find that the decision would facilitate the emergence of such behavior.

5.1.1.

The first simplification of the model involves reducing the number of within-word Wickelfeatures from about 1000 to 260. One way to do this would be to delete some Wickelfeatures randomly, and rely on the overall


redundancy of the system to carry the behavioral regularities. Another option is to use a principled basis for dropping certain Wickelfeatures. For example, one could drop all Wickelfeatures whose component subfeatures are on different dimensions; this move alone would reduce the number of features considerably. Such reduction would occur if Wickelfeatures were required to have at least two sub-features on the same dimension. R&M do something like this, but in an eccentric way: they require that all Wickelfeatures have the same dimension for f1 and f3; f2 can range fully across all feature dimensions and values. Accordingly, of the potential Wickelfeatures for the vowel /E/ in 'bet', the first three below are possible, the last three are not.

[interrupted, vowel, interrupted]
[voiced, mid, unvoiced]
[front, short, middle]

[interrupted, vowel, stop]
[stop, vowel, unvoiced]
[front, short, unvoiced]

This apparently arbitrary way of cutting down on the number of Wickelfeatures has felicitous consequences for the relative amount of rule-based information contained within each sub-feature. It has the basic effect of reducing the information contained within f1 and f3, since they are heavily predictable; that is, the actually employed Wickelphones are 'centrally informative'. This heightens the relative importance of information in f2, since it can be more varied. This move is entirely arbitrary from the standpoint of the model; but it is an entirely sensible move if the goal were to accommodate to a structural rule account of the phenomena: the rules imply that the relevant information in a phoneme is in f2. The use of centrally informative Wickelphones automatically emphasizes f2.
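The effect of this restriction can be illustrated with a toy feature inventory. The dimension names and values below are invented stand-ins (R&M's actual inventory differs, and the real reduction is from about 1000 to 260 features), but the filter is the one just described: f1 and f3 must lie on the same dimension, while f2 ranges freely.

```python
from itertools import product

# A toy stand-in for a phonological feature inventory: each feature value
# belongs to one dimension.  Names and values here are invented.
DIMENSIONS = {
    "interruptedness": ["stop", "fricative", "sonorant"],
    "place":           ["front", "middle", "back"],
    "voicing":         ["voiced", "unvoiced"],
}

# Each candidate Wickelfeature is a triple (f1, f2, f3) of (dimension, value) pairs.
all_values = [(dim, v) for dim, vals in DIMENSIONS.items() for v in vals]

def centrally_informative(f1, f3):
    """R&M's restriction: f1 and f3 must lie on the SAME dimension
    (their values may differ); f2 is unrestricted."""
    return f1[0] == f3[0]

unrestricted = [(f1, f2, f3) for f1, f2, f3 in product(all_values, repeat=3)]
restricted   = [t for t in unrestricted if centrally_informative(t[0], t[2])]

print(len(unrestricted), len(restricted))
```

With these toy dimensions the filter cuts the candidate set by roughly two-thirds, while leaving f2 free to carry the full range of values, which is the informational asymmetry the text describes.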


[vowel, interrupted, end]
[front, stop, end]
[low, unvoiced, end]
[short, middle, end]

We can see that this gives a privileged informational status to phones at the word boundary, compared with any other ordinally defined position within the word: the phones at the boundary are the only ones with their own unique set of Wickelfeatures. This makes the information at the boundary uniquely recognizable. That is, the Wickelfeature representation of the words exhibits the property of 'boundary sharpening'. This arbitrary move accommodates another property we noted in the rule-governed account of the past tense: the phone at the word boundary determines the shape of the regular past. Here, too, we see that an apparently arbitrary decision was just the right one to make in order to make sure that the system would accommodate to the rule-governed regularities.

5.1.3.

R&M allow certain input Wickelfeatures to be activated even when only some of the constituent sub-features are present. One theoretically arbitrary way to do this would be to allow some Wickelphone nodes to be activated sometimes when any 1 out of 3 subfeatures does not correspond to the input. R&M do something like this, but yet again, in an eccentric way: the model allows a Wickelphone node to be activated if either f1 or f3 is incorrect, but not if f2 is incorrect. That is, the fidelity of the relationship between an input Wickelphone and the nodes which are actually activated is subject to 'peripheral blurring'. This effect is not small: the blurring was set to occur with a probability of .9 (this value was determined by R&M after some trial and error with other values). That is, a given Wickelnode is activated 90 percent of the time when the input does not correspond to its f1 or f3. But it can always count on f2. This dramatic feature of the model is unmotivated within the connectionist framework. But it has the same felicitous result

from the standpoint of the structure of the phenomenon as discussed in 5.1.1. It heightens (in this case, drastically) the relative reliability of the information in f2, and tends to destroy the reliability of information in f1 and f3. This further reflects the fact that the structurally relevant information is in f2.

Blurring, however, has to be kept under careful control. Clearly, if blurring occurred on 100 percent of the features with incorrect f1 or f3, the crucial information will be lost as to how to sequentially order the phones. For this reason, R&M had to institute an arbitrary choice of which Wickelfeatures could be incorrectly activated, so that for each input feature there are some reliable cues to Wickelphone order. This necessitates blurring the Wickelfeatures with less than 100 percent of the false options.
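A minimal sketch of peripheral blurring, under two simplifying assumptions flagged here: every near-miss feature is treated as eligible for blurring (R&M, as just noted, blurred only an arbitrarily chosen subset, to preserve cues to phone order), and the .9 probability is applied uniformly.

```python
import random

def activated_nodes(input_feature, all_features, p_blur=0.9, rng=random):
    """Peripheral blurring: an input Wickelfeature turns on its own node,
    plus (with probability p_blur) any node that differs from it ONLY in
    f1 or ONLY in f3.  A node whose central feature f2 mismatches is never
    activated; f2 'can always count' on being reliable."""
    f1, f2, f3 = input_feature
    active = {input_feature}
    for g1, g2, g3 in all_features:
        if g2 != f2:
            continue                      # f2 must always match
        mismatches = (g1 != f1) + (g3 != f3)
        if mismatches == 1 and rng.random() < p_blur:
            active.add((g1, g2, g3))
    return active
```

Setting `p_blur` to 1.0 makes the behavior deterministic for inspection: all single-mismatch neighbors in f1 or f3 light up, and f2 mismatches never do.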


5.1.4.

The use of phonological features is one of the biggest TRICS of all. Features are introduced as an encoding device which reduces the number of internodal connections: the number of connections between two layers of 30,000 Wickelphones is roughly one billion. Some kind of encoding was necessary to reduce this number. Computationally, any kind of binary encoding could solve the problem: R&M chose phonological features as the encoding device because that is the only basis for the model to arrive at appropriate generalizations. Furthermore, the four distinctive features do not all correspond to physically definable dimensions. In order to simplify the number of encoded features, R&M create some feature hybrids. For example, b, v, EE are grouped together along one 'dimension' in opposition to m, l, EY. While such a grouping is not motivated by either phonetics or linguistic theory, it is neither noxious nor helpful to the model, so far as we can tell. However, the other major arbitrary feature grouping lumps together long vowels and voiced consonants in opposition to short vowels and unvoiced consonants. All verbs which end in vowels end in long vowels: accordingly, that particular grouping of features facilitates selective learning of the verbs that end in a segment which must be followed by /d/, namely the verbs ending in vowels and in voiced consonants.

5.1.5.

It is interesting to ponder how successful the devices of central informativeness, peripheral blurring and boundary sharpening are in reconstituting traditional segmental phonemic information. Phonemic representations of words offer a basis for representing similarity between them. One of the properties of Wickelphones is that they do not represent shared properties of words. For example, 'slowed' and 'sold' are no more similar in Wickelphones than 'sold' and 'fig'. This is obviously an inherent characteristic of Wickelphonology (Pinker & Prince, 1988; Savin & Bever, 1970) which would extend to Wickelfeature representations if there were no TRICS. But there are, and they turn out to go a long way to restoring the similarity metric represented directly in normal phonemic notations. One way to quantify this is to examine how many shared input Wickelfeatures there are for words which do and do not share phonemes. This is difficult to do in detail, because R&M do not give complete accounts of which features were chosen for blurring. Two arbitrarily chosen words without shared phonemes will also share few Wickelfeatures; our rough calculation for a pair of arbitrarily chosen 4-letter words is about 10 percent. Words which share initial or final phonemes will have a noticeable number of shared features because of boundary sharpening. Blurring plays an especially important role in reconstituting similarity among


words with shared internal phonemes, such as 'slowed' and 'sold'. Roughly, our calculations show that two such words go from about 20 percent shared features without blurring to around 65 percent with it. In terms of correlating the chances of a node being activated in each word, this represents a rise to about .5. The corresponding proportions for phonemically distinct words are 10 and 30 percent; the correlation in that case stays flat even with blurring. The technical reason for this proportional difference is that the blurring has a radiating effect on just the right Wickelfeatures to create an overlap when there are common phonemes. Thus, the model does not correctly reconstitute the phonemic representation, but it does replicate some of its effects.
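The failure of raw Wickelphones to register shared phonemes can be illustrated with a toy decomposition in which letters stand in for phonemes; the numbers it produces are therefore illustrative only, not R&M's Wickelfeature counts.

```python
def wickelphones(word):
    """Trigram decomposition with explicit word boundaries ('#').
    Letters stand in for phonemes, which is only a rough approximation."""
    padded = "#" + word + "#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def overlap(w1, w2):
    """Proportion of shared Wickelphones (Jaccard overlap of the two sets)."""
    a, b = wickelphones(w1), wickelphones(w2)
    return len(a & b) / len(a | b)

# 'slowed' and 'sold' share several phonemes but no Wickelphones at all,
# so a pure Wickelphone code treats them as unrelated as 'sold' and 'fig'.
print(overlap("slowed", "sold"), overlap("sold", "fig"))
```

Both comparisons come out at zero overlap, which is the point made in the text: without the TRICS, the representation carries no similarity metric over shared phonemes.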

5.2.

There are two major behavioral properties of the model. First, it seems to learn, i.e., it changes its behavior; second, it goes through a period of overgeneralizing the regular rule. The fact that the model learns at all has the same formal basis as the fact that the model can represent the present-past mapping. Any mapping a perceptron can carry out, it can 'learn' to carry out, using the kind of learning rule described above (Rosenblatt, 1962). Basically, the model is composed of 460 perceptrons which converge on the appropriate mapping representation. In fact, in simple perceptrons, the convergence can be quite efficient. On each trial, the learning rule adjusts a threshold discrimination function such that only the correct output units are activated. What is striking about R&M's complex perceptron is that it takes so long to learn the mapping. The function for regular verbs is extremely simple, and might well be reasonably arrived at with just one training cycle. Thereafter, a few cycles would suffice to straighten out the errors on the irregular forms, especially because, as we pointed out above, most irregular verbs follow a restricted set of rules. In this case, there would be little intermediate performance characteristic of learning, and no period during which the regular endings overgeneralized to previously correct irregular verbs.

The reason that the model in fact does exhibit considerable intermediate performance and overgeneralization is due to another one of the TRICS, which imposes a probabilistic function on the output. The probability of an output unit being active is a sigmoid function of its net activation (weighted input minus threshold); this function represents the fact that even when the input is correctly assigned, there are output errors. Such a move qualitatively improves the generalizations made by the system once it has been trained. This is because, in general, as the number of error-correcting trials increases, the difference between activations resulting from inputs for which the unit is


supposed to have positive output and those for which it is supposed to have negative output, grows more rapidly than the difference among the inputs of each type. This enhances the clarity of the generalization and also makes the learning proceed more slowly. As R&M put it, its use here is motivated by the fact that it "causes the system to learn more slowly so the effect of regular verbs on the irregulars continues over a much longer period of time" (R&M, p. 224).

The period of overgeneralization of the regular past at the 11th cycle of trials also depends on a real trick, not a technically defined one. For the first 10 cycles, the machine is presented with only 10 verbs, 8 irregular and 2 regular ones. On the 11th cycle, it is presented with an additional 410 verbs, of which about 80 percent are regular. Thus, even on the 11th cycle alone, the model is given more instances of regular verbs than the training trials it has received on the entire preceding 10 cycles: it is no wonder that the regular past ending immediately swamps the previously acquired regularities. R&M defend this arbitrary move by suggesting that children also experience a sudden surge of regular past tense experience. We know of no acquisition data which show anything of the sort (see Pinker & Prince, 1988, who compile evidence to the contrary). Furthermore, if there were a sudden increase in the number of verbs a child knows at the time he learns the regular past tense rule, it would be ambiguous evidence between acquiring the rule, and acquiring a lot of verbs. The rule allows the child to memorize half as many lexical items for each verb, and learn twice as many verbs from then on. Therefore, even if it were true that children show a sudden increase in the number of verbs they know at the same time that they start overgeneralizing, it would be very difficult to decide which was the cause and which the effect.
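The probabilistic output function described above can be sketched as follows. The logistic form matches the description (a sigmoid of weighted input minus threshold); the explicit temperature parameter is our illustrative addition for showing how flattening the curve prolongs learning.

```python
import math

def p_active(net_input, threshold, temperature=1.0):
    """Probability that an output unit fires: a sigmoid (logistic) function
    of the net activation (weighted input minus threshold).  Even a unit
    whose net input is on the correct side of threshold fires only
    probabilistically, so output errors persist during training."""
    return 1.0 / (1.0 + math.exp(-(net_input - threshold) / temperature))

# A unit well above threshold still errs about 12 percent of the time:
print(p_active(2.0, 0.0))
```

Raising the temperature flattens the curve and raises the error rate for the same net input, which is the sense in which the probabilistic output "causes the system to learn more slowly."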

5.3. TRICS aren't for kids

It is clear that a number of arbitrary decisions, made simply to get the model up and working, were made in ways that would facilitate learning the structural regularities inherent to the presented data. To us it seems fairly clear what went on: Wickelphones were the representation of choice because they seem to solve the problem of representing serial order (though they do so only for a restricted vocabulary; see Pinker & Prince, 1988; Savin & Bever, 1970). But Wickelphones also give equal weight to the preceding and following phone, while it is the central phone which is the subject of rule-governed regularities. Accordingly, a number of devices are built into the model to reduce the information and reliability of the preceding and following sub-phones in the Wickelphone. Further devices mark phones at word-boundary as uniquely important elements, as they are in rule-governed accounts of


some phonological changes which happen when morphemes are adjoined. Finally, the behavioral learning properties of the model were insured by making the model learn slowly, and flooding it with regular verbs at a particular point.

The most important claim for the R&M model is that it conforms to the behavioral regularities described in rule-governed accounts, but without any rules. We have not reported on the extent to which the model actually captures the behavioral regularities. Pinker & Prince (1988) demonstrate that, in fact, the model is not adequate, even to the basic facts: hence, the first claim for the model is not correct. We have shown further that, even if the model were empirically adequate, it would be because the model's architecture is designed to extract rule-based regularities in the input data. The impact of the rules for the past tense learning is indirectly embedded in the form of representation and the TRICS: even Wickelfeatures involve a linguistic theory with acoustic segments and phonological features within them; the work of the TRICS is to render available the segmental phoneme, and emphasize boundary phonemes in terms of segmental features. That is, garbage in/garbage out: regularities in/regularities out. How crucial the TRICS really are is easy to find out: simply run the model without them, and see what it does. We expect that if the TRICS were replaced with theoretically neutral devices, the new model would not learn with even the current limited 'success', if at all; nor would it exhibit the same behaviors.

If a slightly improved set of TRICS does lead to successful performance, one could argue that this new model is a theory of the innate phonological devices available to children. On this interpretation, the child would come to the language learning situation with uncommitted connectionist networks supported by TRICS of the general kind built into the R&M model. The child operates on the feedback it receives from its attempts to produce phonologically correct sequences, and gradually builds up a network which exhibits rule-like properties, but without any rules, as R&M claim. It is difficult to consider the merits of such a proposal in the abstract: clearly, if the TRICS were sufficiently structured so that they were tantamount to an implementation of universal phonological constraints in rule-governed accounts, then such a theory would be the equivalent of one that is rule-based (see Fodor & Pylyshyn, 1988, for a discussion of connectionist models as potential implementation systems). The theory in R&M self-avowedly and clearly falls short of representing the actual rules. So, we must analyze the nature of the TRICS in the model at hand, to assess their compatibility with a plausible universal theory of phonology. None of the TRICS fares well under this kind of scrutiny.

Consider first the limitation on Wickelfeatures which makes them 'centrally informative':


this requires that f1 and f3 be marked for the same feature dimension (although the values on that dimension may be different). Certain phonological processes depend on information about the phone preceding and following the affected segment. For example, the rule which transforms /t/ or /d/ to a voiced tongue flap applies only when both the preceding and following segments are vowels. The 'central-informativeness' TRIC neatly accommodates a process like this, since it makes available a set of Wickelfeatures with f1 and f3 marked as 'vowel'. Unfortunately, the same TRIC makes it hard to learn processes in which f1 and f3 are marked for different dimensions. Since such processes are also quite common, the universal predictions this makes are incorrect.

The second set of representational TRICS has the net result of sharpening the reliability of information at word boundaries: this is well-suited to isolate the relevant information in the regular past-tense formation in English, and would seem like a natural way to represent the fact that morphological processes affect segments at the boundaries of morphemes when they are combined. Unfortunately, such processes do not seem to predominate over within-word processes, such as the formation of the tongue-flap between vowels, or the restrictions on nasal deletion. Furthermore, there are numerous languages which change segments within morphemes as they are combined, as in the many languages with vowel harmony. Thus, there is no empirical support for a system which unambiguously gives priority to phonological processes at morpheme boundaries.

R&M link together distinctive features which are maintained orthogonal to each other in most phonological theories. Hence, in R&M, long vowels and voiced consonants are linked together in opposition to short vowels and unvoiced consonants. This link is prima facie a TRIC, which facilitates learning the regular past tense morphology. But, taken as a claim of universal phonological theory, it is prima facie incorrect: it would propose to explain a non-fact, the relative frequency, or ease of learning, of processes which apply simultaneously to long vowels and voiced consonants, or to short vowels and unvoiced consonants.

Staging the input data, and imposing a sigmoid learning function, are not, strictly speaking, components of phonological theory; both devices play a role in guaranteeing overgeneralization of the regular past. Such overgeneralizations are common in mastery of other morpho-phonological phenomena, for example, the present tense in English ('Harry do-es', rhymes with 'news') or the plural ('fishes', 'childrens'). The carefully controlled sigmoid learning function and the staging of input data necessary to yield overgeneralization phenomena have the properties of dei ex machina, with no independent evidence of any kind.


In brief, the net effect of the TRICS is to refocus the reliable information within the central phone of Wickelphone triples. This does reconstitute the segmental property of many phonological processes, obscured by the Wickelphonological representations. However, since those representations are not adequate in general, such reconstitution is of limited value. Most important, the specific TRICS involved in this reconstitution make wrong universal claims in some cases, and obscure and unmotivated claims in the other cases. We conclude that even if the TRICS were fine-tuned to arrive at

satisfactory empirical adequacy, they would still be arbitrarily chosen to facilitate the learning of English past tense rules, with no empirical support as the basis for phonological learning in general.

We noted above that there are theories of phonology in which more than one segment contributes to a constraint simultaneously, e.g., 'auto-segmental phonology' (Goldsmith, 1976). It might seem that such a variant of phonological theory would be consistent with Wickelphones, and the connectionist learning paradigms in general. Such claims would be incorrect. First, autosegmental phonology does not deny that many processes involve individual segments; rather, it asserts that there are simultaneous suprasegmental structures as well. Second, Wickelphones are no better suited than simple phones for the kinds of larger phonological unit to which multi-segmental constraints apply, very often the syllable. Finally, the constraints in autosegmental phonology are structural and just as rule-like as those in segmental phonology.

It might also seem that the issue between traditional and autosegmental phonological theory concerns the reality of intermediate stages of derivation as resulting from the ordered application of rules like (1a-g); another way to put this is in terms of whether rules are simple-but-ordered, as in traditional phonology, or complex-but-unordered. There may be phonological theories which differ on these dimensions. But, however this issue is resolved, there

will be no particular comfort for connectionist learning models of the type in R&M. The underlying object to which the rules apply will still be an abstract formula, and the output will still differentiate categorically between grammatical and ungrammatical sequences in the particular language.

5.4. Empirical evidence for rules

Up to now, we have relied on the reader's intuitive understanding of what a 'rule' is: a computation which maps one representation onto another. We have argued further that the R&M model achieves categorical rule-like behavior in the context of an analogue connectionist machine by way of special representational and processing devices. One might reply that the 'rules' we have been discussing actually compute the structure of linguistic 'competence', while the TRIC-ridden model is a 'performance' mechanism which is the real basis for the rule-like behavior. This line of reply would be consistent with the current distinction between three types of description: the computational, the algorithmic, and the implementational (Marr, 1982). It is a line already taken in several defenses of connectionism (Rumelhart & McClelland, 1986a; Smolensky, in press). The distinction between these different types of description might seem to allow for a synthesis of the connectionist and rule-based theories. On this view, rule-based theories describe the structure of language, while connectionist models explain how it 'actually works'. This would-be synthesis is not available, however, since grammatical rules are necessary for the explanation of behavior.

5.4.1. The diachronic maintenance of language systems

Consider first the operation of the processes we have used as examples: they characteristically fall into categories, often even at a physical level of description. For example, if a stop sound is 'unvoiced', it exhibits certain invariants which contrast it from its 'voiced' mode. The indicated processes occur in environments which are categorically described: e.g., /t,d/ becomes a tongue flap between two vowels (actually, between two 'non-consonants' in distinctive feature terms), not between two sounds that are like vowels to a high degree. Variations in language behavior show similar discontinuities: for example, children invent phonological rules which involve rule-governed shifts, rather than just groups of changed words. Similarly, dialects differ by entire rule processes, not isolated cases; finally, stable historical changes occur in precise but broad-ranging shifts: the great vowel shift involved in the irregular past verbs included a complete rotation of vowel heights, not isolated changes. We are not suggesting that developmentally, synchronically, and historically there are no intermediate stages of performance; rather, we emphasize that the stable phenomena and periods are those caused by mental representations that are structural in nature.

It is possible to show that mental representations of the rule-based account of language are necessary to describe the properties of language change. The categorical nature of language change is explained in a rule-based account by the fact that rules themselves are categorical, not incremental; hence, linguistic change is resisted, except at those times when it occurs in major shifts. Such facts are clearly consistent with rules, but it must be shown that they are consistent with models like that in R&M. A way to do this is to consider whether successive generations of such models could maintain an approximation of rule-governed behavior without containing rules. The models we have considered achieve 90-95 percent correct output on their training set


and considerably less on generalization trials (Pinker & Prince, 1988, calculate 66 percent correct on generalizations). This level is, of course, far below a 5-year-old child's ability, but one might argue that improved models will do better. However, if these models are to be taken as anything like correct models of the child, they must exhibit stable as well as accurate behavior, in the face of imperfect input. We can operationalize this by asking a simple question (or performing the actual experiment): what will an untrained model learn, if it is given as input the less-than-perfect output of a trained model? This question is divisible into two parts: how fast will the 'child' model arrive at its asymptotic performance, compared with the 'parent', and what will the asymptotic level be? It is likely that, for a given number of trials before asymptote is reached, the child-model will perform worse than the parent-model. This follows from the fact that the data the child-model is given are less reliably related to the actual structure of the language, and therefore must require more trials to arrive at a stable output. It is less clear what the final asymptotic level will be. If the parent-model errors were truly random in nature, then the final asymptotic level of performance should be the same in the child model. But, in fact, the parental errors are not random: they tend to occur on just those forms which are hard for the model to learn. R&M offer a case in point: after 80,000 trials, the model still makes a variety of strange errors on the past tense (e.g., 'squawked' for 'squat'; 'membled' for 'mail'; see Pinker & Prince, 1988, especially for an analysis of the 'blending' mechanism which produces cases like the second). It is intuitively clear that some of these errors occur because of phonological coincidences, others because of the overwhelming frequency of the regular past ending. In both kinds of cases, the errors have a systematic basis, and are not random; indeed, they are, by operational definition, just the cases which frequency-based computations in both models discriminate with difficulty: so, we can expect the child-model to perform even worse on these cases, once given seductively misleading analyses by the parent model. Eventually, with some number of generations (itself determined by the learning curve parameters and other TRICS), the final descendant-model will stabilize at always getting the critical cases wrong.
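The generational question posed above can be operationalized in a toy simulation. Everything here is invented for illustration (the verb inventory, the error rate, and the bias of errors toward the regular form); it shows only the qualitative prediction under discussion, that systematic, rule-favoring errors compound across generations.

```python
import random

rng = random.Random(0)

# A toy lexicon: 10 irregular verbs and 40 regular ones, by stipulation.
VERBS = list(range(50))
TRUE_PAST = {v: ("regular" if v >= 10 else "irregular") for v in VERBS}

def learn(training_data, error_rate):
    """A toy 'child model': it acquires its parent's output, except that each
    irregular item is mis-acquired with some probability.  Crucially, the
    errors are not random: they always replace an irregular form with the
    dominant regular one, mirroring the systematic overgeneralization
    discussed in the text."""
    grammar = {}
    for verb, form in training_data.items():
        if form == "irregular" and rng.random() < error_rate:
            grammar[verb] = "regular"       # systematic, not random, error
        else:
            grammar[verb] = form
    return grammar

generation = TRUE_PAST
for g in range(5):
    generation = learn(generation, error_rate=0.3)
    n_irregular = sum(1 for f in generation.values() if f == "irregular")
    print(f"generation {g}: {n_irregular} irregular forms survive")
```

Because a lost irregular is never recovered, the count of surviving irregulars can only fall, and the lexicon drifts toward the dominant rule, which is exactly the degeneration that, as the text notes, real language transmission does not exhibit.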
In both kinds of cases the errors have a systematicbasis and , , are not random- indeed, they are by operational definition , just the cases which frequency -basedcomputations in both models discriminate with diffi culty: so, we can expect the child-model to perform even worse on these cases once given seductivelymisleadinganalyses the parent model. Even, by tually, with some number of generations(itself determined by the learning curve parameters and other TRICS) , the final descendant , -model will stabilize at always getting the critical caseswrong. There are many indeterminaciesin theseconsiderations and the best wav , .. to see what happenswill be to train successive generationsof models. We think that this is an important empirical test for any model of learning. One must not only show that a particular model can approximaterule-like behav ior , given perfect input and perfect feedbackinformation, but that successive generationsof exactly the same kind of model continue to re-generate the same rule-like regularities, given the impet6fect input of their immediate ancestor -models. R& M consideredin this light, predicts that in successive gen-

Linguistic structure and associative theories


a language system will degenerate quickly towards the dominant rule, overgeneralizing most of the exceptional cases. But this does not occur in actual linguistic evolution. Rather, it is characteristic that every systematic process has some sub-systematic exceptions. As Sapir put it, 'grammars always leak'. One can speculate as to why this is so (Bever, 1986; Sadock, 1974). The fact that it is so poses particular problems for a model based on imperfect frequency approximations by successive generations.

We do not doubt that a set of diachronic TRICS can be tacked onto the model, which would tend to maintain the rule-like regularities in the face of systematically imperfect input information. We expect, by induction on the properties of the current TRICS, that two things will be true about any new TRICS: (1) insofar as they have any systematic motivation, it will not be from the connectionist framework, but from the rule-based explanation; (2) insofar as they work, it will be because of their relation to the rule-based account.

5.4.2. Linguistic intuitions

It is also striking that children seem to have explicit differentiation of the way they talk from the way they should talk. Children are both aware that they overgeneralize and that they should not do it. Bever (1975) reports a dialogue demonstrating that his child (age 3;6) had this dual pattern.

Tom: Where's mommy?
Frederick: Mommy goed to the store.
Tom: Mommy goed to the store?
Frederick: NO! (annoyed) Daddy, I say it that way, not you.
Tom: Mommy wented to the store?
Frederick: No!
Tom: Mommy went to the store.
Frederick: That's right, mommy wennn ... mommy goed to the store.

Slobin (1978) reported extended interviews with his child demonstrating a similar sensitivity: 'she rarely uses some of the [strong] verbs correctly in her own speech; yet she is clearly aware of the correct forms.' He reports the following dialogue at 4;7.

Dan: ... Did Barbara read you that whole story? ...
Haida: Yeah ... and ...
mama this morning after breakfast, read ('red') the whole book ... I don't know when she readed ('reeded') ...
Dan: You don't know when she what?
Haida: ... she readed the book ...
Dan: M-hm.
Haida: That's the book she read. She read the whole, the whole book.


J. Lachter and T. G. Bever

Dan: That's the book Barbara readed you, huh?
Haida: Yeah ... read! (annoyed)
Dan: She readed you Babar?
Haida: Babar, yeah. You know cause you readed some of it too ... she read all the rest.
Dan: She read the whole thing to you, huh?
Haida: Yeah ... nu-uh, you read some.
Dan: Oh, that's right; yeah, I readed the beginning of it.
Haida: Readed? (annoyed surprise) Read!
Dan: Oh, yeah. Sure, read.
Haida: Will you stop that, Papa?

What are we to make of Frederick's and Haida's competence? On the one hand, they clearly made overgeneralization errors; on the other hand, they clearly knew what they should and should not say. This would seem to be evidence that the overgeneralizations are strictly a performance function of the talking algorithm, quite distinct from their linguistic knowledge. The fact that the children know they are making a mistake emphasizes the distinction between the structures that they know and the sequences that they utter. (But see Kuczaj, 1978, who showed that children do not differentiate between experimentally presented correct and incorrect past tense forms. We think that he underestimates the children's competence because of methodological factors. For example, he assumes that children think that everything they say is grammatical, which the above reports show is not true. Finally, in all his studies, the child in general prefers the correct past forms for the irregular verbs.)

In brief, we see that even children are aware that they are following (or should follow) structural systems. Adults also exhibit knowledge of the contrast between what they say and what is grammatical. For example, the first sentence below is recognized as usable but ungrammatical, while the second is recognized as grammatical but unusable (see discussion in Section 7 below).

Either I or you are crazy.
Oysters oysters oysters split split split.

Children and adults who know the contrast between their speech and the correct form have a representation of the structural system. Hence, it is of little interest to claim that a connectionist model can 'learn' the behavior without the structural system. Real people learn both; most interestingly, they sometimes learn the structure before they master its use.


5.4.3. Language behaviors

There is also considerable experimental evidence for the independent role of grammatical structures in the behavior of adults. We discuss this under the rubric of evidence for a 'psychogrammar' (Bever, 1975), an internalized representation of the language that is not necessarily a model of such behaviors as speech perception or production, but a representation of the structure used in those and other language behaviors. Presumably, the psychogrammar is strongly equivalent to some correct linguistic grammar, with a universally definable mental and physiological representation. We set up the concept for this discussion to avoid claiming 'psychological' or 'physiological' reality for any particular linguistic grammar or mode of implementation. Rather, we wish to outline some simple evidence that a psychogrammar exists: this demonstration is sufficient to invalidate the psychological relevance of those connectionist learning models which do not learn grammars.

The fundamental mental activity in using speech is to relate inchoate ideas with explicit utterances, as in perception and production. There is observational and experimental evidence that multiple levels of linguistic representation are computed during these processes (Bever, 1970; Dell, 1986; Fodor, Bever, & Garrett, 1974; Garrett, 1975; Tanenhaus, Carlson, & Seidenberg, 1985). The data suggest that the processes underlying these two behaviors are not simple inversions, so they may make different use of the grammatical structures, suggesting separate representations of the grammar. Hence, the psychogrammar may be distinct from such systems of speech behavior; in any case, it explains certain phenomena in its own right. In standard linguistic investigations, it allows the isolation of linguistic universals due to psychogrammatical constraints from those due to the other systems of speech behavior.
We think that the achievements of this approach to linguistic research have been prodigious and justify the distinction in themselves. A further argument for the separate existence of a psychogrammar is the empirical evidence that it is an independent source of acceptability intuitions. The crucial data are sequences which are intuitively well-formed but unusable, and sequences which are usable but intuitively ill-formed, as discussed above. Such cases illustrate that behavioral usability and intuitive well-formedness do not overlap completely, suggesting that each is accounted for by (at least partially) independent mental representations.

5.4.4. Conclusion: The explanatory role of rules

The evidence we have reviewed in the previous three sections demonstrates that even if one were to differentiate structural rules from algorithmic rules, it remains the case that the structural rules are directly implicated in the explanation of linguistic phenomena. That is, the rules are not merely abstract


descriptions of the regularities underlying language behaviors, but are vital to their explanation, because they characterize certain mental representations or processes. It is because of those mental entities that the rules compactly describe facts of language acquisition, variation, and history; they provide explanations of linguistic knowledge directly available to children and adults; they help explain those mental representations involved in the comprehension and production of behaviorally usable sentences; they are part of the explanation of historical facts about languages.
6. Learning to assign thematic roles
We now turn to a second model in which a different linguistic behavior is learned: the assignment of thematic roles to nounphrases in different, apparently serial, positions (McClelland & Kawamoto, 1986; M&K). The system, which learns to assign the thematic roles of nouns in specific sentences, has representational properties similar to the model which learns past tenses of verbs. The input nodes are triples, consisting of a syntactic position and two semantic features for a noun or verb; the output nodes represent triples of a semantic feature for a noun, one for a verb, and a thematic noun-verb relation. There is a probabilistic blurring mechanism which turns on feature/role nodes only 85 percent of the time when they should be on, and 15 percent of the time when they should not be.

The semantic TRICS. Ideally (that is, in one's idealization of how this model must work to be a significant psychological theory of learning to attach thematic roles to words), one would start with an independently defined set of semantic features (human, animate ...) taken from some semantic theory (e.g., a theory intended to account for naming behavior, or within a semantic theory, to account for synonymy and entailment). Then the role of statistical blurring might be interpreted as allowing for some interaction between these formal features and the continuous variability which can occur when fitting nouns into thematic roles. But for all the statistical blurring, it remains the case that the 'semantic role features' do not flow from some independent theory: rather, they are just descriptors of the roles themselves. Here, the semantic features are chosen for each noun to reflect the probability that it is an agent/object/instrument/modifier; the corresponding features for verbs are chosen to capture the likelihood that the verb is an action, involves modifiers, and so on (see Figure 4). Hence, any 'learning' that occurs is trivial. The 'learning' does not involve isolating independently defined semantic features which are

Figure 4. Feature dimensions and values.

Nouns:
HUMAN: human, nonhuman
SOFTNESS: soft, hard
GENDER: male, female, neuter
VOLUME: small, medium, large
FORM: compact, 1-D, 2-D, 3-D
POINTINESS: pointed, rounded
BREAKABILITY: fragile, unbreakable
OBJTYPE: food, toy, tool, utensil, furniture, animate, nat-inan

Verbs:
DOER: yes, no
CAUSE: yes, no-cause, no-change
TOUCH: agent, inst, both, none, AisP
NAT_CHNG: pieces, shreds, chemical, none, unused
AGT_MVMT: trans, part, none, NA
PT_MVMT: trans, part, none, NA
INTENSITY: low, high

Note: AisP = Agent is Patient; nat-inan = natural inanimate; NA = not applicable.

relevant to roles, but rather an accumulation of activation strengths from having the role-features available, and being given correct instances of words (feature matrices) placed in particular role positions.

One of the achievements of this model, according to M&K, is that it 'overgeneralizes' thematic role assignments. For example, 'doll' is not marked as 'animate' and therefore is ineligible to be an agent. However, 'doll' is nonetheless assigned significant strength as agent in such sentences as 'the doll moved'. This result seems to be due to the fact that everything except animate objects and the word 'doll' is assigned the gender 'neuter', while 'doll' is assigned 'female'. Thus, 'neuter' becomes a perfect predictor of inanimacy, except for 'doll'. It is not surprising that 'doll' is treated as though it were animate.
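M&K's probabilistic blurring mechanism (a node that should be on fires 85 percent of the time; one that should be off fires 15 percent of the time) can be sketched directly. The feature vector below is a made-up stand-in, not one of M&K's actual encodings.

```python
import random

# Stochastically corrupt a binary feature vector, per the blurring scheme
# described in the text: P(fire | should be on) = .85, P(fire | off) = .15.
def blur(features, rng, p_on=0.85, p_off=0.15):
    return [1 if rng.random() < (p_on if f else p_off) else 0
            for f in features]

rng = random.Random(42)
target = [1, 1, 0, 0, 1, 0, 0, 0]          # hypothetical feature vector
samples = [blur(target, rng) for _ in range(20000)]
on_rate = sum(s[0] for s in samples) / len(samples)   # a should-be-on node
off_rate = sum(s[2] for s in samples) / len(samples)  # a should-be-off node
print(round(on_rate, 2), round(off_rate, 2))
```

Over many samples the empirical firing rates settle near the two target probabilities, which is all the mechanism guarantees.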


7. The power of units unseen


It might seem that the models we have discussed are dependent on TRICS

because of some inherent limitation on their computational power. We now consider connectionist learning models with more computational power, and examine some specific instances for TRICS. Connectionist learning machines are composed of perceptrons. One of the staple theorems about perceptrons is that they cannot perform certain Boolean functions if they have only an input and an output set of nodes (Minsky & Papert, 1969). Exclusive disjunctive conditions are among those functions that cannot be represented in a two-layer perceptron system. Yet even simple phonetic phenomena involved in the past tense involve disjunctive descriptions, if one were limited to two-level descriptions. For example, the variants of 'mounded' discussed above involve disjunction among the presence of /n/, a lengthened vowel, and the tongue flap. That is, the tongue-flap pronunciation of 't' or 'd' can occur only if the 'n' has been deleted and the previous vowel has been lengthened. Furthermore, the distinction between /t/ and /d/ in 'mounted' and 'mounded' has, in some pronunciations, been displaced to the length of the preceding vowel. The solution for modelling such disjunctive phenomena within the connectionist framework is the invocation of units that are neither input nor output nodes, but which comprise an intermediate set of units which are 'hidden' (Hinton & Sejnowski, 1983, 1986; Smolensky, 1986). (A formal problem in the use of hidden units is formulating how the perceptron learning rule should apply to a system with them: there are two (or more) layers of connections to be trained on each trial, but only the output layer is directly corrected. Somehow, incorrect weights and thresholds must be corrected at both the output and hidden levels.
A current technique, 'back-propagation', is the instance of such a learning rule used in the examples we discuss below; it apportions 'blame' for incorrect activations to the hidden units which are involved, according to a function too complex for presentation here (see Rumelhart, Hinton, & Williams, 1986).)
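The point about exclusive disjunction and hidden units can be made concrete. The sketch below is a generic implementation of gradient descent with back-propagated blame, our illustration rather than anyone's published code: a network with one hidden layer trained on XOR, the relation no two-layer perceptron can represent. With this initialization it typically masters all four cases; at minimum, the summed squared error falls.

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Exclusive disjunction: representable only with hidden units.
DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

rng = random.Random(7)
n_hidden = 3
w_h = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(n_hidden)]  # incl. bias
w_o = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]                  # incl. bias

def forward(x):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_h]
    o = sigmoid(sum(w_o[j] * h[j] for j in range(n_hidden)) + w_o[-1])
    return h, o

def total_error():
    return sum((forward(x)[1] - t) ** 2 for x, t in DATA)

err_before = total_error()
for _ in range(5000):
    for x, t in DATA:
        h, o = forward(x)
        d_o = (o - t) * o * (1 - o)                 # output-layer delta
        d_h = [d_o * w_o[j] * h[j] * (1 - h[j])     # 'blame' apportioned to
               for j in range(n_hidden)]            # each hidden unit
        for j in range(n_hidden):
            w_o[j] -= 0.5 * d_o * h[j]
            w_h[j][0] -= 0.5 * d_h[j] * x[0]
            w_h[j][1] -= 0.5 * d_h[j] * x[1]
            w_h[j][2] -= 0.5 * d_h[j]
        w_o[-1] -= 0.5 * d_o
err_after = total_error()
print(round(err_before, 3), round(err_after, 3))
```

The only correction signal arrives at the output layer; the hidden-layer weights move because each hidden unit's share of the output error is propagated back through the output weights.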
Recent work has shown that a model with hidden units can be trained to

regenerate a known acoustic sample of speech, with the result that novel speech samples can also be regenerated without further training (Elman & Zipser , 1986) . The training technique does not involve explicit segmentation of the signal , nor is there a mapping onto a separate response. The model takes a speech sample in an input acoustic feature representation : the input is mapped onto a set of input nodes, in a manner similar to that of McClelland and Elman . Each input node is connected to a set of hidden nodes, which in turn are connected to a layer of output nodes corresponding to the input nodes. On each trial , the model adjusts weights between the layers of nodes


following the learning rule (the back-propagation variant of it) to improve the match between the input and the output. This model uses an auto-associative technique, in which weights are adjusted to yield an output that is the closest fit to the original input. After many trials (up to a million), the model is impressively successful, regenerating new speech samples from the same speaker. This is an exciting achievement, since it opens up the possibility that an analysis of the speech into a compact internal representation is possible, simply by exposure to a sample. There are several things which remain to be shown. For example, the internal analysis which the model arrives at may or may not correspond to a linguistically relevant analysis; if it does correspond to such units, it is not clear how they can be integrated with higher-order relations between them.
7.1. A hidden unit model of syntax acquisition

Hanson and Kegl (1987; H&K) use the auto-association method with hidden units to re-generate sequences of syntactic categories which correspond to actual sentences. After a period of learning, the model can take in a sequence of lexical categories (e.g., something like determiner, noun, verb, determiner, noun, adverb) and regenerate that sequence. What is interesting is that it can regenerate sequences which correspond to actual English sentences, but it does not regenerate sequences which do not correspond to English sequences; in this way, the model approximates the ability to render grammaticality distinctions. Hanson and Kegl disavow their model as appropriate for the language learning child, but they make an extraordinarily strong claim for what it shows about the linguistic structures which govern the ungrammaticality of certain sentences:

If [our model] does not recognize such sentences after nothing more than exposure to data, this would lead us to suspect that, rather than being an innate property of the learner, these constraints and conditions follow directly from regularities in the data ... Both [our model] and the child are exposed only to sentences from natural language, and they both must induce general rules and larger constituents from just the regularities to which they are exposed.

That is, H&K take the success of their model to be an existence proof that some linguistic universals can be learned without any internal structure. This makes it imperative to examine the architecture of their model; as we shall see, it incorporates linguistically defined representations in certain crucial ways which invalidate their empiricist conclusion.

Here is how one of the models works. The model is trained on a set of 1000 actual sentences, ranging from a few to 15 words in length. Each lexical


item in every input sentence is 'manually' assigned to a syntactic category, each coded into a 9-bit sequence (input texts were taken from a corpus with grammatical categories already assigned: Francis & Kucera, 1979). These sequences are mapped onto 270 input nodes (135 for 15 word positions, each with nine bits; another distinct set for word boundary codes). The categorized sequences are then treated as input to a set of 45 hidden nodes: each input category node is connected to each hidden node, and each hidden node is connected to a corresponding set of 270 output nodes (see Figure 5). During training, the model is given input sequences of categories; the model matches the input against the self-generated output on each trial. The usual learning rule applies to adjust weights on each trial (using a variation of back-propagation), with the usual built-in variability in learning on each trial. After 180,000 trials with the training set, the model asymptoted at about 90 percent correct on both the training set and on new sentence-based category sequences which had not been presented before.

H&K highlight four qualitative results in the trained model's responses to new cases. First, the model supplements incomplete information in input sequences so that the regenerated sequences conform to possible sentence types. For example, given the input in (3a), the model's response fills the missing word with a verb, as in (3b). (We are quoting directly from their examples. Roughly, the lexical categories correspond to distinctions used in Francis & Kucera, 1979; 'p-verb' refers to 'verb in past tense form'.)

3a. article, noun, (BLANK), article, noun
3b. article, noun, p-verb, article, noun
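It is worth noting how little machinery such fill-in behavior requires. The sketch below is a deliberately trivial heuristic of our own, not H&K's network: it knows only that an article signals an upcoming noun, and that a blank elsewhere is most safely the one category every sentence needs, a verb.

```python
# A dumb gap-filler over category sequences; None marks the blank.
def guess_blank(tags):
    out = list(tags)
    for i, tag in enumerate(out):
        if tag is None:
            after_article = i > 0 and out[i - 1] == "article"
            out[i] = "noun" if after_article else "p-verb"
    return out

# (3a) with its blank: the heuristic restores (3b) with no training at all.
filled = guess_blank(["article", "noun", None, "article", "noun"])
print(filled)  # ['article', 'noun', 'p-verb', 'article', 'noun']
```

The same two rules also fill a blank between an article and a verb with a noun, anticipating the point made below that such completions do not by themselves show much about induced structure.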
Figure 5. Auto-Associator for Natural Language Syntax: 585 units, 24,615 connections, compression ratio 7:1. (The input and output layers are shown encoding the tagged sequence 'at the, n boy, vbd threw, at the, n ball', i.e., 'the boy threw the ball'.)

Second, the model corrects incorrect syntactic input information; for example, given (4a) as input, it responds with (4b).

4a. article, noun, p-verb, adverb, article, noun, p-verb
4b. article, noun, p-verb, preposition, article, noun, p-verb

The interest of this case is based on the claim that (4a) does not correspond to a possible sequence: they say (4a) corresponds (e.g.) to *the horse raced quickly the barn fell, while (4b) corresponds to the horse raced past the barn fell. (Note that (4b) is a tricky sentence: it corresponds to the horse that was raced past the barn, fell.)

Third, the model regenerates one center-embedding (corresponding to the rat the cat chased died): this regeneration occurs despite the lack of even a single occurrence of a center-embedded sentence within the 1000-sentence corpus. Furthermore, the model rejects sequences corresponding to two center-embeddings (the rat the cat the dog bit chased died). Given a syntactic sequence corresponding to a double embedding (5a), the model responds with (5b). H&K say that this shows that their model can differentially generalize to sentences that can appear in natural language (center-embeddings), but rejects sequences which violate natural language constraints (multiple center-embeddings).

5a. article, noun, article, noun, article, noun, p-verb, p-verb, p-verb
5b. article, noun, article, noun, article, noun, p-verb, noun, verb

Finally, the model refuses to regenerate sequences in which an adverb interrupts a verb and a following article and noun which would have to be the verb's direct object; for example, given (6a) as input (corresponding to *John gave quickly the book), the model responds with (6b) (corresponding to John quickly was the winner).

6a. noun, verb, adverb, article, noun
6b. noun, adverb, was, article, noun

H&K say that this shows that the model has acquired one of the universal case-marking constraints as expressed in English: a direct object must be adjacent to a verb in order to receive case from it and thereby be allowed (licensed) to occur in object position (from a Government-Binding approach; Chomsky, 1981).

It is enterprising of H&K to put the model to such qualitative tests, over and above the 90 percent level of correct regenerations. As in formal linguistic research, it is the descriptive claims made by a model which justify its acceptance as much as its empirical adequacy to generate correct sentences. Unfortunately, the tests which H&K cite do not provide crucial support for their model, for a variety of reasons. First, many kinds of pattern recognition


models would isolate the fact that every sentence contains a verb: in their case, one of a set of pre-categorized input items, such as p-verb, was, do, etc. It is not trivial that the model fills in a blank word as a verb, but it is not unique either: one would not be surprised if the model filled in a noun for a blank between article and verb, or an article or preposition for a blank between a verb and a noun. Similarly, it is no surprise that the model does not accept the input sequence (4a), which is an obverse part of its 90 percent success in regenerating correct input sequences. In this case, the model responds by changing one lexical category so that the output corresponds to a possible English sentence; the question is, does it change the input to a correct or behaviorally salient sentence? Consider a sequence other than (4b) which would result from changing one category in (4a).
4c. art, noun, p-verb, conj, art, noun, p-verb (the horse raced and the crowd roared)

The logic of the interpretation of the next two cases is contradictory. H&K cite with approval the fact that the model rejects a sequence corresponding to *the horse raced breathtakingly the crowd roared and corrects it to one corresponding to the horse raced past the barn fell. Yet the model apparently chose (4b), and this sentence type has long been understood as an example of a sentence which is difficult for speakers to understand (see Bever, 1970, for discussion). Accordingly, we take the fact that the model regenerates this behaviorally difficult sequence to be an empirical failure of the model, given other options like (4c) which correspond to much easier structures. Finally, the output in (4b) does not correspond to a well-formed sentence anyway. In the Francis et al. categorization schema H&K used to categorize their input, the past verb is differentiated from the past participle, so the correct category sequence corresponding to the horse raced past the barn fell would be as in (4d):

4d. article, noun, (past participle verb), article, noun, p-verb

The treatment of multiple center-embedding is conversely puzzling. Having taken the alleged success of the model at regenerating a behaviorally difficult sequence as an achievement, H&K report approvingly that the model rejects doubly embedded sequences like (5a). Although they note that others have argued that such constructions are complex due to behavioral reasons, they appear to believe that in rejecting them, the model is simulating natural language constraints. Others have also argued that the difficulty of center-embedded constructions shows that an adequate model of language structure should not represent them (e.g., McClelland & Kawamoto, 1986; Reich, 1969). For example, McClelland and Kawamoto write:


The unparsability of [doubly-embedded] sentences has usually been explained by an appeal to adjunct assumptions about performance limitations (e.g., working-memory limitations), but it may be, instead, that they are unparsable because the parser, by the general nature of its design, is simply incapable of processing such sentences.

There is a fundamental error here. Multiple center-embedding constructions are not ungrammatical, as shown by cases like (6c) (the unacceptability of (6d) shows that the acceptability of (6c) is not due to semantic constraints alone).

6c. The reporter everyone I have met trusts, is predicting another Irangate.
6d. The reporter the editor the cat scratched fired died.

In fact, the difficulty of center-embedding constructions is a function of the differentiability of the nounphrases and verbphrases: (6c) is acceptable because each of the nounphrases is of a different type, as is each of the verbphrases; conversely, (6d) is unacceptable because the three nounphrases are syntactically identical, as are the three verbphrases. This effect is predicted by any comprehension mechanism which labels nounphrases syntactically as they come in, stores them, and then assigns them to verbphrase argument positions as they appear: the more distinctly labelled each phrase is, the less likely it is to be confused in immediate memory (see Bever, 1970; Miller, 1962, for more detailed discussions). Thus, there are examples of acceptable center-embedding sentences, and a simple performance theory which predicts the difference between acceptable and unacceptable cases. Hence, H&K are simply in error to claim that it is an achievement of their model to reject center-embeddings, at least if the achievement is to be taken as reflecting universal structural constraints on possible sentences.

The other two qualitative facts suggest to H&K that their model has developed a representation of constituency. The model regenerates single embeddings without exposure to them, and rejects sequences which interrupt a verb-article-noun sequence with an adverb after the verb. Both behaviors indicate to H&K that the model has acquired a representation corresponding to a nounphrase constituent. They buttress this claim with an informal report of a statistical clustering analysis on the response patterns of the hidden units to input nounphrases and verbphrases: the clusters showed that some units responded strongly to nounphrases, while others responded strongly to verbphrases. This seems at first to be an interesting result. But careful analysis of the internal structure of the model and the final weighting patterns are required to determine how it works. The grouping of those sequences which are nounphrases (sequences containing an article and/or some kind of noun) from


those which are verbphrases (containing one of the list of verb types) might occur for many reasons. Indeed, since verbphrases often contain nounphrases, it is puzzling that verbphrases did not excite nounphrase patterns at the same time: the very notion of constituency requires that they should. As it stands, the model appears to have encoded something like the following properties of sentences: they have an initial set of phrase types, and then another set of phrase types beginning with some kind of verb. This kind of item arrangement is an achievement, but not one that exceeds many schemata.

Hanson and Kegl's presentation is brief (imposed on them by the publication format) and does not allow us to examine the model completely for TRICS. We can note, however, several factors which may be important. The most serious issue involves the informativeness of the input categories, which are assigned by hand. The input is richly differentiated for the model into 467 syntactic categories. There are many distinctions made among the function words and morphemes. For example, many forms of 'be' are differentiated, personal possessive pronouns are differentiated from other pronouns, subject pronouns are differentiated from object pronouns, comparative adjectives are differentiated from absolutes, and so on. Given a rich enough analysis of this kind, every English sentence falls into one kind of pattern or another. Indeed, language parsers can operate surprisingly successfully with differential sensitivity to 20 classes of function words and no differentiation of content words at all. The reason is straightforward: function words are the skeleton of English syntax. They tend to begin phrases, not end them, and they offer much unique information: e.g., 'the', 'a', 'my' always signal a noun somewhere to the right; 'towards' always signals a nounphrase or the end of a clause; and so on. In fact, it would be interesting to see if H&K's parser performs any worse if it is trained only on function-word categorized input. As it stands, recognition of a few basic redundancies might well account for the model's 85 percent hit rate.

H&K state that their model begins with no assumptions about syntactic structure, nor any special expectations about properties of syntactic categories other than the fact that they exist. This is a bit misleading. First, syntactic categories are not independently distinguished from the syntax in which they occur. Many, if not all, syntactic categories in a language are like phonological distinctive features, in that they are motivated in part by the rule-based function they serve universally and in a particular language. For example, in English, the words 'in', 'on', 'under' ... are all prepositions just because they act the same way in relation to linguistic constraints, i.e., they precede nounphrases, can be used as verb-particles, can coalesce with verbs in the passive form, and so on. Giving the model the information that the privileges of


occurrence

coincide

with

a category

is providing

crucial

information

about

what words can be expected to pattern together learner would have to discover . Thus , providing vides information which itself ref1.ects syntactic H & K actually give members of the same

- something which a real the correct categories pro .

structure

the model even more information : they differentiate category when they are used in syntactically different

ways. For example, the prepositions are pre-categorized differently according to whether they are used in prepositional phrases or as verb particles. In the categorized samples, prepositional uses (below on the left) are classified with the symbol 'in', particle uses (on the right) with 'rb':

working on ('in') the ...     pushed aside ('rb')
to ('in') the ...             pass it up ('rb')
over ('in') the ...           looked ahead ('rb')

This differentiation, automatic as it is, solves ahead of time one of the more difficult problems in English for parsers, the differentiation of prepositions from particles, as reflected in the ambiguity of (7).

7. Harry looked up the street.

H&K also differentiate the different instances of 'to', used as a preposition, as above, and as a complementizer ('to devote', 'to continue'), as below. Such syntactic pre-disambiguation of phonologically indistinguishable categories appears in other cases. For example, 'it' as subject is differentiated from 'it' as object; 'that' as conjunction is differentiated from 'that' as a relative pronoun; simple past tense verb forms ('pushed', 'sang') are differentiated from past participle forms, which are superficially identical except in strong verbs ('pushed', 'sung'). We have adduced the above disambiguations from the four categorized sample selections which H&K present. In the categorization scheme they used (Francis & Kucera, 1979), other syntactic homophones are disambiguated as well. Thus, one of the hardest problems for parsers may be solved in the categorization framework by the categorization of the input - how to distinguish the use of a category in terms of its local syntactic function. This definitely is among the TRICS, in the sense defined above. There seems to be at least another. The categories are further differentiated

232

J. Lachter and T. G. Bever

according to their frequency: more frequent categories were initially assigned a greater number of active units, an input coding which may greatly facilitate the detection of the relevant regularities. A complete analysis of H&K's model awaits a more complete presentation of it. We tentatively conclude that, insofar as the model is successful, it is because TRICS of the same general kind found in the other models pre-categorize the input: grammatical characteristics are encoded in the input patterns so that the regularities are simpler and more directly accessible to the hidden units.
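How much work such pre-categorization does can be seen in a toy sketch. This is our illustration, with invented bracket notation; the tag strings mirror the 'in'/'rb' labels discussed above:

```python
# Toy illustration: once the input arrives pre-tagged, the classically
# hard preposition-vs-particle decision ("Harry looked up the street")
# is no decision at all -- the answer is read off the tag.
def attach(word, tag):
    if tag == "in":    # prepositional use: heads a prepositional phrase
        return f"[PP {word} ...]"
    if tag == "rb":    # particle use: attaches to the verb
        return f"[V+particle ... {word}]"
    raise ValueError(tag)

# The same phoneme string receives opposite analyses purely from the tag:
print(attach("up", "in"))   # Harry looked [PP up the street]
print(attach("up", "rb"))   # Harry looked it up
```

Whatever disambiguation power the network displays on such items is therefore supplied by the categorization of the input, not discovered by the learner.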

The decision about how fine-grained the categories are involves a dilemma. The possible extremes are to give no grammatical information in the categories, and to differentiate every syntactic category afforded by some grammatical theory. For example, the sample sequence in (8c) could be entered either as (8a) or as (8b):

8a. definite determiner modifying the following noun; noun, singular, grammatical subject of the verb, in the determiner phrase preceding the verb, thematic agent of the verb; verb, past tense, transitive; definite determiner modifying the following noun; noun, singular, grammatical object of the verb, in the determiner phrase following the verb, thematic patient of the verb; adverb modifying the preceding verb

8b. the same sequence with the same grammatical categories, but without the thematic relations

8c. determiner noun verb determiner noun adverb
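The two poles of the granularity dilemma can be illustrated with a toy encoding of a sample sentence. This is our sketch; the label strings are invented and only gesture at the categories under discussion:

```python
# Toy illustration of the granularity dilemma: one sentence coded with
# bare lexical categories versus with grammatical/thematic detail.
sentence = "the horse kicked the cow yesterday".split()

coarse = ["det", "noun", "verb", "det", "noun", "adverb"]

fine = [
    "det(modifies following noun)",
    "noun(sg, grammatical subject, thematic agent)",
    "verb(past, transitive)",
    "det(modifies following noun)",
    "noun(sg, grammatical object, thematic patient)",
    "adverb(modifies preceding verb)",
]

# The coarse code leaves all relational structure to be learned;
# the fine code hands most of it over in the input itself.
assert len(coarse) == len(fine) == len(sentence)
```

The dilemma is that the coarse code makes an interesting learning claim but may be unlearnable, while the fine code makes learning easy precisely by presupposing the structure.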

Regenerating sequences like (8a) is of no interest. The reason is this: the code for each word in (8a) itself represents that word's relations to the other words, so regenerating the serial order of the words would be straightforward, and of little interest - the input coding, not the model, has the information. The same would be true of regenerating sequences like (8b), if the input code represents the number and boundaries of the words in all the input sequences. This clarifies what one can think of the claim that the Hanson & Kegl categorization scheme succeeds in regenerating basic structural universals.

Learning to regenerate grammatical distinctions of a kind not encoded in the input sequences is the obverse case: there, H&K could not claim that the learner learned the categories from the input. This shows the dilemma which H&K face about the richness of the categorization scheme: if the categories of the 'innate' structural universals are part of the input, regenerating sequences like those in 7-8 allows only limited claims about what the machine learns from the data. What may be of interest is a model of learning in which the innate complexity is embedded in an induction machine operating on limited input. For example, if a model given only the intermediate grammatical categorization available to the child (see Valian, 1986) could learn to regenerate sequences like (8c), it might provide an existence


demonstration of the kind H&K seek. Even more impressive would be a demonstration that it can learn to map inputs of the complexity in (8c) onto outputs of the complexity in (8b). That would be more like real learning, taking in lexically categorized input and learning to map it onto grammatically categorized output. In such a case, both the simple and complex categories could be viewed as innate. The empirical question would be: can the model learn the regularities of how to map simple onto complex grammatical categorizations of sentences? If such a model were empirically successful, it would offer the potential of testing some aspects of the empiricist hypothesis which H&K may have in mind.

Aside from TRICS, it is clear that auto-association works both at the phonological and syntactic level because there are redundancies in the input: the hidden units give the system the computational power to differentiate disjunctive categories, which allows for isolation of disjunctive redundancies. We think that the models involve techniques which may be of great importance for certain applications. But, as always with artificially 'intelligent' systems, the importance of these proofs for psychological models of learning will depend on their rate of success and the internal analysis both of the prior architecture of the model and of the categories implied by the final weight patterns. Ninety percent correct after 180,000 trials is not impressive when compared to a human child, especially when one notes that the child must discover the input categories as well. In fact, it is characteristic of widely divergent linguistic theories that they usually overlap on 95 percent of the data - it is the last 5 percent which brings out deep differences. Many models can achieve 90 percent correct analysis: this is why small differences in how close to 100 percent correct a model is can be an important factor. But, most important is the analysis which the machine gives to each sentence. At the moment, it is hard for us to see how 45 associative units will render the subtlety we know to be true of syntactic structures.

8. Conclusions

We have interpreted R&M, M&K, and H&K as concluding from the success of their models that artificial intelligence systems that learn are possible without rules. At the same time we have shown that both the learning and adult behavior models contain devices that emphasize the information which carries the rule-based representations that explain the behavior. That is, the models' limited measure of success ultimately depends on structural rules. We leave it to connectionists' productionist brethren in artificial intelligence and cognitive modelling to determine the implications of this for the algorithmic use


of rules in artificial intelligence. Our conclusion here is simply that one cannot proclaim these learning models as replacements for the acquisition of linguistic rules and constraints on rules.

9. Some general considerations

9.1. The problem of segmenting the world for stimulus/response models of learning

In the following section we set aside the issue of the acquisition of structure, and consider the status of models like R&M's as performance models of learning. That is, we now stipulate that representations of the kind we have highlighted are built into the models; we then ask, are these learning models with built-in representations plausible candidates as performance models of language acquisition? We find that they suffer from the same deficiencies as all learning models which depend on incrementally building up structures out of isolated trials.

The learning models in McClelland & Rumelhart (1986) are a complex variant on traditional - or at least Hullean - s-r connections formed in time (with secondary and tertiary connections-between-connections corresponding to between-node and hidden-node connections). The connectionist model operates 'in parallel', thereby allowing for the simultaneous establishment of complex patterns of activation and inhibition. We can view these models as composed of an extremely large number of s-r pairs; for example, the verb learning model would have roughly 211,600 such pairs, one for each input/output Wickelfeature pair. In fact, we can imagine a set of rats each of whose tails are attached to one of the 460 Wickelfeature inputs and each of whose nose-whiskers are all attached to one of the 460 Wickelfeature outputs: on each trial, each rat is either stimulated at his tail node or not. He then may lunge at his nose node. His feedback tells him either that he was supposed to lunge or not, and he adjusts the likelihood of lunging on the next trial using formulae of the kind explored by Hull and his students, Rescorla & Wagner (1972), and others.
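The arithmetic of the rodent image, and the kind of per-pair adjustment each 'rat' performs, can be sketched as follows. This is our toy illustration: the 0.1 learning rate and the generic Rescorla-Wagner-style delta rule are illustrative assumptions, not R&M's actual formula.

```python
# Toy illustration of the MPR arithmetic and of one "rat's" update rule.
inputs, outputs = 460, 460
print(inputs * outputs)           # 211600 input/output Wickelfeature pairs

def delta_update(weight, stimulated, should_lunge, rate=0.1):
    """Rescorla-Wagner style update for one input/output pair:
    nudge the connection toward the reinforced outcome (rate is arbitrary)."""
    prediction = weight if stimulated else 0.0
    error = (1.0 if should_lunge else 0.0) - prediction
    return weight + rate * error * (1.0 if stimulated else 0.0)

w = 0.0
for _ in range(5):                # five reinforced trials
    w = delta_update(w, stimulated=True, should_lunge=True)
print(round(w, 5))                # the weight creeps toward 1.0
```

Note that the update only runs at all because each trial externally pairs a stimulation, a response node, and a reinforcement signal - which is precisely the segmentation problem discussed below.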

We will call this arrangement of rats a Massively Parallel Rodent (MPR). Clearly, the MPR as a whole can adjust its behavior. That is, if a connection between two nodes is fixed, and if the relevant input and output information is unambiguously specified and reinforced, and if there is a rule for changing the strength of the connection based on the input/output/reinforcement configuration, then the model's connection strengths will change on each trial. In his review of Skinner's Verbal Behavior (1957), Chomsky (1959) accepted


this tautology about changes in s-r connection strengths as a function of reinforcing episodes. But he pointed out that the theory does not offer a way of segmenting which experiences count as stimuli, which as responses, and to which pairs of these a change in drive level (reinforcer) is relevant. Stimulus, response, and reinforcement are all interdefined, which makes discovering the 'laws' of learning impossible in the stimulus/response framework. We solved the corresponding practical problem for our MPR by giving it the relevant information externally - we segment which aspects of its experience are relevant to each other in stimulus/response/reinforcement triples. Accordingly, the MPR works because each rat does not have to determine what input Wickelfeature activation (or lack of it) is relevant to what output Wickelfeature, and whether positive or negative: he is hard-tailed and -nosed into a given input/output pair. Each learning trial externally specifies the effect of reinforcement on the connection between that pair. In this way, the MPR can gradually be trained to build up the behavior as composed out of differential associative strengths between different units. Suppose we put the MPR into the field (the verbal one), and wait for it to learn anything at all. Without constant information about when to relate reinforcement information to changes in the threshold and response weights, nothing systematic can occur. The result would appear to be as limited and circular as the apparently simpler model proposed by Skinner. It might seem that giving the MPR an input set of categories would clear this up. It does not, unless one also informs it which categories are relevant when. The hope that enough learning trials will gradually allow the MPR to weed out the irrelevancies begs the question, which is: how does the organism know that a given experience counts as a trial?
The much larger number of component organisms does not change the power of the single-unit machine, if nobody tells them what is important in the world and what is important to do about it. There's no solution, even in very, very large numbers of rats. Auto-association in systems with hidden units might seem to offer a solution to the problem of segmenting the world into stimuli, responses, and reinforcement relations: these models operate without separate instruction on each matching trial. Indeed, Elman & Zipser's model apparently arrives at a compact representation of input speech without being selectively reinforced. But, as we said, it will take a thorough study of the final response patterns of the hidden units to show that the model arrives at a correct compact representation. The same point is true of models in the style of Hanson and Kegl. And, in any case, analysis of the potential stimuli is only part of the learning problem - such models still would require identification of which analyzed units serve as stimuli in particular reinforcement relations to which responses.


So, even if we grant them a considerable amount of pre-wired architectures and hand-tailored input, models like the MPR, and their foreseeable elaborations, are not interesting candidates as performance models for actual learning; rather, they serve, at best, as models of how constraints can be modified by experience, with all the interesting formal and behavioral work of setting the constraints done outside the model. The reply might be that every model of learning has such problems - somehow the child must learn what stimuli in the world are relevant for his language behavior responses. Clearly, the child has to have a segmentation of the world into units, which we can grant to every model including the MPR. But if we are accounting for the acquisition of knowledge and not the pairing of input/output behaviors, then there is no restriction on what counts as a relevant pair. The problem of what counts as a reinforcement for an input/output pair only exists for those systems which postulate that learning consists of increasing the probability of particular input/output pairs (and proper subsets of such pairs). The hypothesis-testing child is at liberty to think entirely in terms of confirming systems of knowledge against idiosyncratic experiences. An example may help here. Consider two ways of learning about the game of tag: a stimulus/response model, and an hypothesis testing model. Both
models can have innate characteristics which lead to the game as a possible representation. But, in a connectionist instantiation of the s/r model, the

acquired representation of the game is in terms of pairs of input and output representations. These representations specify the vectors of motion for each player, and whether there is physical contact between them. We have no doubt that a properly configured connectionist model would acquire behavior similar to that observed in a group of children, in which the players successively disperse away from the player who was last in contact with the player who has a chain of dispersal that goes back to the first player of that kind. As above, the machine would be given training in possible successive vector configurations and informed about each so that it could change its activation weights in the usual connectionist way. Without that information, the model will have no way of modulating its input activation configuration to conform
to the actual behavior .

Contrast this with the hypothesis testing model, also with innate structure: in this case, what is innate is a set of possible games, of which tag is one, stated in terms of rules. Consider what kind of data such a model must have:

it must be given a display of tag actively in progress, in some representational language , perhaps in terms of movement vectors and locations , just like that for the connectionist model . But what it does not need is step-by-step feedback after each projection of where the players will be next . It needs instances of predictions made by the game of tag that are fulfilled : e.g., that after


contact, players' vectors tend to reverse. How many such instances it needs is a matter of the criteria for confirmation that the hypothesis tester requires. Thus, both the stimulus/response pattern association model and the hypothesis testing model require some instances of categorizable input. But only the s/r model requires massive numbers of instances in order to construct a representation of the behavior.

9.2. The human use for connectionist models

The only relation in connectionist models is strength of association between nodes. This makes them potential models in which to represent the formation of associations which are (almost by definition) frequent - and, in that sense, important - phenomena of everyday life. Given a structural description of a domain and a performance mechanism, a connectionist model may provide a revealing description of the emergence of certain regularities in performance which are not easily described by the structural description or performance mechanism alone. In this section, we explore some ways to think about the usefulness of connectionist models in integrating mental representations of structures and associations between structures.

9.2.1. Behavioral sources of overgeneralization

We turn first to the 'developmental achievement' of Rumelhart and McClelland's model of past tense learning, the overgeneralization of the 'ed' ending after having achieved better-than-chance performance on some irregular verbs. This parallels stages that children go through, although clearly not for the same reasons. R&M coerce the model into overgeneralization and regression in performance by abruptly flooding the model with regular-verb input. Given the relative univocality of the regular past tense ending, such coercion may turn out to be unnecessary: even equal numbers of regular and irregular verbs may lead to a period of overgeneralization (depending on the learning curve function) because the regular ending processes are simpler. In any case, the model is important in that it attempts to address what is a common developmental phenomenon in the mastery of rule-governed structures - at first, there is apparent mastery of a structurally defined concept and then a subsequent decrease in performance, based on a generalization from the available data. It is true that some scholars have used these periods of regression as evidence that the child is actively mastering structural rules, e.g., 9.
add 'ed' to form the past tense. Clearly, the child is making a mistake. But it is not necessarily a mistaken application of an actual rule of the language (note that the formula above is


not exactly a rule of the language). Rather, it can be interpreted as a performance mistake, the result of an overactive speech production algorithm which captures the behavioral generalization that almost all verbs form the past tense by adding /ed/ (see Macken, 1987, for a much more formal discussion of this type). This interpretation is further supported by the fact that children, like adults, can express awareness of the distinction between what they say, and what they know they should say (section 5.4). There are many other examples of overgeneralization in cognitive development, in which rule-based explanations are less compelling than explanations based on performance mechanisms (Bever, 1982; Strauss, 1983). Consider, for example, the emergence of the ability to conserve numerosity judgments between small arrays of objects. Suppose we present the array on the left below to children, and ask which row has more in it (Bever, Mehler, & Epstein, 1968; Mehler & Bever, 1967). Most children believe that it is the row on the bottom. Now suppose we change the array on the left below to the one on the right, and ask children to report which row now has more: 2-year-old children characteristically get the answer correct, and always perform better than 3-year-old children.
(arrays of objects: two rows of four items each; in the changed array, one row is spread out so that it looks longer)

The 3-year-olds characteristically choose the longer row - this kind of pattern occurs in many domains involving quantities of different kinds. In each case, the younger child performs on the specific task better than the older child. But the tasks are chosen so that they bring a structural principle into conflict with a perceptual algorithm. The principle is 'conservation': that if nothing is changed in the quantity of two unequal arrays, the one with more remains the one with more. The perceptual algorithm is that if an array looks larger than another, it has more in it. Such strategies are well-supported in experience, and probably remain in the adult repertoire, though better integrated than in the 3-year-old. Our present concerns make it important that the 'overgeneralized strategy' that causes the decrease in performance is not a structural rule in any sense; it is a behavioral algorithm. A similar contrast between linguistic structure and perceptual generalization occurs in the development of language comprehension (Bever, 1970). Consider (10a) and (10b).
10a. The horse kicked the cow

10b. The cow got kicked by the horse


At all ages between 2 and 6 , children can make puppets act out the first kind

of sentence. But the basis on which they do this appears to change from age


2 to age 4: at the older age, the children perform markedly worse than at the younger age on passive sentences like (10b). This, and other facts, suggest that the 4-year-old child depends on a perceptual heuristic:

11. "Assign an available NVN sequence the thematic relations agent, predicate, patient."

This heuristic is consistent with active sentence order , but specifically contradicts passive order . The heuristic may reflect the generalization that in English sentences, agents do usually precede patients . The order strategy appears in other languages only in those cases in which there is a dominant
word order. In fact, in heavily inflected languages, children seem to learn

very early to depend just on the inflectional endings and to ignore word order (Slobin & Bever , 1980) . The important point here is that the heuristics that
emerge around 4 years of age are not reflections of the structural rules . In

fact , the strategies can interfere with linguistic success of the younger child and result in a decrease in performance . Certain heuristics draw on general knowledge , for example the heuristic that sentences should make worldly sense. 4-year-old children correctly act out sequences which are highly probable ( 12a) , but systematically fail to act out corresponding improbable sequences like ( 12b) .
12a. The horse ate the cookie
12b. The cookie ate the horse
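Heuristic 11 can be written down directly as a sketch (ours; the role names follow the text), and it succeeds on actives while failing on passives exactly as the children do:

```python
# Toy illustration of heuristic 11: map an available N-V-N sequence
# onto the thematic relations agent-predicate-patient, blindly.
def nvn_heuristic(n1, v, n2):
    return {"agent": n1, "predicate": v, "patient": n2}

# Active (10a): the heuristic gets the roles right.
print(nvn_heuristic("horse", "kicked", "cow"))
# {'agent': 'horse', 'predicate': 'kicked', 'patient': 'cow'}

# Passive (10b) "The cow got kicked by the horse": the surface N-V-N
# order is cow-kicked-horse, so the heuristic assigns the roles
# backwards -- the 4-year-old's characteristic error.
print(nvn_heuristic("cow", "kicked", "horse"))
```

The sketch makes the key point visible: the heuristic is a mapping over surface order, not a rule of the grammar, and nothing in it represents the passive construction at all.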

The linguistic competence of the young child before such heuristics emerge is quite impressive. They perform both sentences correctly - they often acknowledge that the second is amusing, but act it out as linguistically indicated. This suggests early reliance on structural properties of the language, which is later replaced by reliance on statistically valid generalizations in the child's experience. These examples demonstrate that regressions in cognitive performance seem to be the rule, not the exception. But the acquisition of rules is not what underlies the regressions. Rather, they occur as generalizations which reflect statistical properties of experience. Such systems of heuristics stand in contrast to the systematic knowledge of the structural properties of behavior and experience which children rely on before they have sufficient experience to extract the heuristics. In brief, we are arguing that R&M may be correct in the characterization of the period of overgeneralization as the result of the detection of a statistically reliable pattern (Bever, 1970). The emergence of the overgeneralization is not unambiguous evidence for the acquisition of a rule. Rather, it may reflect the emergence of a statistically supported pattern
of behavior . We turn in the next section to the usefulness of connectionist

models in accounting for the formation and role of such habits .
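To see how a regression of the past-tense kind can emerge from statistics alone, here is a toy sketch. It is ours, not R&M's network; the vocabulary and the majority threshold are invented. The learner stores attested forms, but switches to blanket 'ed' suffixation once regulars dominate its lexicon, overriding irregulars it had previously produced correctly:

```python
# Toy illustration: overgeneralization as a frequency effect, not a rule.
def past_tense(verb, lexicon, regular_share_threshold=0.5):
    """Produce a past tense from a lexicon of attested (verb, past) pairs.
    Once the regular pattern dominates, apply it across the board."""
    regulars = sum(1 for v, p in lexicon.items() if p == v + "ed")
    if lexicon and regulars / len(lexicon) > regular_share_threshold:
        return verb + "ed"           # overgeneralize the dominant pattern
    return lexicon.get(verb, verb + "ed")

early = {"go": "went", "sing": "sang"}             # irregulars only
print(past_tense("go", early))                      # 'went' (correct)

flooded = dict(early, **{v: v + "ed" for v in ["walk", "talk", "push"]})
print(past_tense("go", flooded))                    # 'goed' (overgeneralized)
```

Nothing in the sketch is a rule of English; the 'regression' is purely an artifact of the changing composition of the input, which is the interpretation argued for above.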


9.2.2. The nodularity of mime

Up to now, we have argued that insofar as connectionist models seem to acquire structural rules, it is because they contain representational devices which approximate aspects of relevant linguistic properties. Such probabilistic models are simply not well-suited to account for the acquisition of categorical structures. But there is another aspect of behavior to which they are more naturally suited - those behaviors which are essentially associative in nature.
Associations can make repeated behaviors more efficient, but are the source of error in novel behaviors. Accordingly, we find it significant that the most empirically impressive connectionist models have been devoted to the description of erroneous or arbitrary behavior. For example, Dell's model of speech production predicts a wide range of different kinds of slips of the tongue and the contexts in which they occur. The TRACE model of speech perception predicts interference effects between levels of word and phoneme representations. By the same token, we are willing to stipulate that an improved model of past tense learning might do a better job of modelling the mistakes which children make along the way. All of these phenomena have something in common - they result, not from the structural constraints on the domain, but from the way in which an associative architecture responds to environmental regularities. Using a connectionist architecture may allow for a relatively simple explanation of phenomena such as habits, that seem to be parasitic on regularities in behavioral patterns. This interpretation is consistent with a view of the mind as utilizing two sorts of processes, computational and associative. The computational component represents the structure of behavior; the associative component represents the direct activation of behaviors which accumulates with practice.

Clearly , humans have knowledge of the structural form of language, levels


of representation and relations between them . Yet , much of the time in both

speaking and comprehension , we draw on a small number of phrase- and sentence-types . Our capacity for brute associative memory shows that we can form complex associations between symbols : it is reasonable that such associations will arise between common phrase types and the meaning relations between their constituent lexical categories . The associative network cannot explain the existence or form of phrase structures , but it can associate them efficiently , once they are defined . This commonsense idea about everyday behavior had no explicit formal
mechanism which could account for it in traditional stimulus-response theories of how associations are formed. Such models had in their arsenal single chains of associated units. The notion of multiple levels of representation and matrices was not developed. The richest attempt in this direction for


language was the later work of Osgood (1968); but he was insistent that the models should not only describe associations that govern acquired behavior, they should also account for learning structurally distinct levels of representation - which traditional s/r theories cannot consistently represent or learn

(see Bever, 1968; Fodor, 1965). There are some serious consequences of our proposal for research on language behavior. For example, associations between phrase types and semantic configurations may totally obscure the operation of more structurally sensitive processes. This concept underlay the proposal that sentence comprehension proceeds by way of 'perceptual strategies' like (11), mapping rules which express the relation between a phrase type and the semantic roles assigned to its constituents (Bever, 1970). Such strategies are non-deterministic and are not specified as a particular function of grammatical structure. The existence of such strategies is supported by the developmental regressions in comprehension reviewed above, as well as their ability to explain a variety of facts about adult sentence perception (including the difficulty of sentences like (5a) which run afoul of the NVN strategy (11)). But the formulation of a complete strategies-based parser met with great difficulty.
Strategies are not 'rules', and there was no clear formalism available in which

to state them so that they can apply simultaneously. These problems were part of the motivation for rejecting a strategies-based parser (Frazier, 1979), in favor of either deterministic processes (production systems of a kind; Wanner & Maratsos, 1978) or current attempts to construct parsers as direct functions of a grammar (Crain & J.D. Fodor, 1985; Ford, Bresnan, & Kaplan, 1981; Frazier, Carson, & Rayner, work in progress). On our interpretation, connectionist methods offer a richer representational system for perceptual 'strategies' than previously available. Indeed, some investigators have suggested connectionist-like models of all aspects of syntactic knowledge and processing (Bates & MacWhinney, 1987; MacWhinney, 1987). We have no quarrel with their attempts to construct models which instantiate strategies of the type we have discussed (if that is what they are in fact doing); however, it seems that they go further, and argue that the strategies comprise an entire grammatical and performance theory at the same time. We must be clear to the point of tedium: a connectionist model of parsing, like the original strategies model, does not explain away mental grammatical structures - in fact, it depends on their independent existence elsewhere in the user's mental repertoire. Furthermore, frequency-based behavioral heuristics do not necessarily comprise a complete theory of performance, since they represent only the associatively based components. Smolensky (in press) has come to a view superficially similar to ours about the relation between connectionist models and rule-governed systems. He


applies the analysis to the distinction between controlled and automatic processing (Schneider, Dumais, & Shiffrin, 1984). A typical secular example is accessing the powers of the number 2. The obvious way to access a large power of two is to multiply 2 times itself that large number of times. Most of us are condemned to this multiplication route, but computer scientists often have memorized large powers of two, because two is the basic currency of computers. Thus, many computer scientists have two ways of arriving at the 20th power of two: calculating it out, or remembering it. Smolensky suggests that this distinction reflects two kinds of knowledge, both of which can be implemented in connectionist architecture - 'conscious' and 'intuitive'. 'Conscious' knowledge consists of memorized rules similar to productions. Frequent application of a conscious rule develops connection strengths between the input and the output of the rule until its effect can be arrived at without the operation of the production-style system. Thus the production-style system 'trains' the connectionist network to asymptotic performance. The conscious production-style systems are algorithms, not structural representations. For example, the steps involved in successive multiplications by two depend on available memory, particular procedures, knowledge of the relation between different powers of the same number, and so on. Thus, this proposal is that the slow-but-sure kind of algorithms can be used to train the fast-but-probabilistic algorithms. Neither of them necessarily represents the structure. Smolensky, however, suggests further that a connectionist model's structural 'competence' can be represented in the form of what the model would do, given an infinite amount of memory/time. That is, 'ideal performance' can be taken to represent competence (note that 'harmony theory' referred to below is a variant of a connectionist system).
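Smolensky's secular example can be sketched directly. This is our illustration, not Smolensky's implementation: a slow procedural route to the 20th power of two, and a memorized lookup which the procedure itself 'trains'.

```python
# Toy illustration of the two routes to a power of two.
def power_of_two_procedural(n):
    """The 'conscious', step-by-step multiplication algorithm."""
    result = 1
    for _ in range(n):
        result *= 2
    return result

memory = {}                      # the 'intuitive' route: a trained cache

def power_of_two(n):
    if n not in memory:          # fall back on the slow procedure ...
        memory[n] = power_of_two_procedural(n)
    return memory[n]             # ... which trains the fast lookup

print(power_of_two(20))          # 1048576, computed then remembered
print(power_of_two(20))          # 1048576, now retrieved directly
```

Note what the sketch shares with Smolensky's point: neither the loop nor the cache is a representation of the structure of exponentiation; one is an algorithm, the other an association between an input and an output.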
It is a corollary of the way this network embodies the problem domain constraints, and the general theorems of harmony theory, that the system, when given a well-posed problem and infinite relaxation time, will always give the correct answer. So, under that idealization, the competence of the system is described by hard constraints: Ohm's law, Kirchhoff's law. It is as if it had those laws written down inside of it.

Thus, the system conforms completely to structural rules in an ideal situation. But the ability to conform to structure in an infinite time is not the same as the representation of the structure. The structure has an existence outside of the idealized performance of a model. The structure corresponds to the 'problem domain constraints'.
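The idealization at issue can be illustrated with a toy relaxation network (a hedged sketch with invented weights and biases, not harmony theory itself): given enough settling time, the network always ends in a state satisfying its hard constraint, as if the constraint were written down inside of it.

```python
# Toy relaxation network -- a hedged sketch, not harmony theory itself.
# Two mutually inhibitory units with positive bias; the only stable states
# satisfy the 'hard constraint' that exactly one unit is active.

W = [[0, -2], [-2, 0]]   # symmetric inhibitory connections (invented values)
bias = [1, 1]

def settle(state, steps=10):
    """Asynchronous threshold updates; with enough settling time the
    network always ends in a constraint-satisfying state."""
    for t in range(steps):
        i = t % 2
        net = sum(W[i][j] * state[j] for j in range(2)) + bias[i]
        state[i] = 1 if net > 0 else 0
    return state

print(settle([1, 1]))  # both units on violates the constraint; settles to [0, 1]
print(settle([0, 0]))  # settles to [1, 0]
```

Note that nothing in the weight matrix is a stored statement of the constraint; the constraint describes the settled states, which is exactly the distinction between conforming to structure and representing it.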

Smolensky directly relates the production-style algorithms to explicit, potentially conscious knowledge, such as knowing how to multiply. This makes the proposal inappropriate for the representation of linguistic rules, since

Linguistic structure and associative theories   243

they are generally not conscious, and most of them operate on abstract objects. The seven rules involved in the description of the regular English past tense are a case in point: many of them operate over abstract structures, and the abstract objects they use (similarly, autosegmental phones) are unpronounceable. Humans have such rich abstract schemata and operations, and use them whenever they speak; it is not meaningful to think that explicit teaching, or learning algorithms defined in terms of explicit phones, has been their source. The problem all along has been to understand whatever structure the knowledge has.

9.3. The end: habits and rules

We have attempted to put the connectionist debate about rules in perspective. The debate was initially among those interested in cognitive modelling, contrasting the proposition that behavior can be modelled with rule-based algorithms with the proposition that it can be modelled as constraint satisfaction in connectionist systems. That debate has taken many forms, with differing shades of emphasis, extended to include the question of whether connectionist models can exhibit rule-governed behavior without rules. With emphasis on theories of acquiring language, there have been three specific answers to the challenge posed by R&M's model. Some mental processes require rules anyway (Fodor & Pylyshyn, this issue); R&M's model, in particular, does not work, empirically or theoretically (Pinker & Prince, this issue). Our emphasis in this paper has been on the fact that the connectionist models considered here arrive at rule-like behavior only insofar as the equivalents of rules, the categorical regularities of language and the structural mental representations of humans, are already contained in the architectures of the devices.

This is not surprising, since there is no natural way for an associative device to represent categorical rules: hence models that are to behave as adult machines require the equivalents of rules built in, as an artificially rich starting structure. The limitation that all changes in their responses are the accumulated result of pairing stimuli is a feature the connectionist models share with all associative systems, as is the positive framework they provide for describing the formation of complex associative habits. In sum, we have reminded the reader of some of the behaviors of human language which associative models cannot explain, namely those which result from structure. But it is equally obvious that the networks can explain some behaviors, namely those which can be described as the accumulated result of associations


between structurally defined representations . We think that connectionist models are worth exploring as potential explanations of those behaviors : at the very least, such investigations will give a clearer definition of those aspects of knowledge and performance which cannot be accounted for by testable and computationally powerful associationistic mechanisms.
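The kind of associative device at issue can be made concrete with a minimal delta-rule pattern associator (a hedged sketch for illustration only, not the Rumelhart-McClelland model; the feature coding and the final 'marker' unit are invented). It learns input-output pairings that happen to follow a rule-like regularity, without ever storing the rule:

```python
# Minimal delta-rule pattern associator -- an illustrative sketch only,
# not the Rumelhart-McClelland model. The feature coding and the final
# 'marker' unit are invented for the example.

N_IN, N_OUT = 6, 6
# weights[i][j]: associative strength from input unit i to output unit j
weights = [[0.0] * N_OUT for _ in range(N_IN)]

def respond(x):
    """Each output unit fires if its summed, weighted input exceeds threshold."""
    return [1 if sum(x[i] * weights[i][j] for i in range(N_IN)) > 0.5 else 0
            for j in range(N_OUT)]

def train(pairs, epochs=50, rate=0.25):
    """Delta rule: adjust each connection in proportion to the output error."""
    for _ in range(epochs):
        for x, target in pairs:
            y = respond(x)
            for i in range(N_IN):
                for j in range(N_OUT):
                    weights[i][j] += rate * (target[j] - y[j]) * x[i]

# Toy regularity: the output copies the stem and turns on the last unit
# (a crude stand-in for 'add the past-tense marker').
stems = [[1, 0, 1, 0, 0, 0], [0, 1, 0, 1, 0, 0], [1, 1, 0, 0, 0, 0]]
pairs = [(s, s[:-1] + [1]) for s in stems]
train(pairs)

print([respond(s) == t for s, t in pairs])  # the trained pairings are reproduced
# An unseen stem may get no marker at all: what the device accumulates are
# pairings between representations, not the categorical rule itself.
print(respond([0, 0, 1, 1, 0, 0]))
```

The device reproduces its trained pairings by connection strengths alone; whether it extends the regularity to a novel stem depends entirely on featural overlap with the trained items, which is the sense in which such systems model habits rather than rules.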
References
Anderson, J.R. (1983). The architecture of cognition. Cambridge, MA: Harvard University Press.
Bates, E., & MacWhinney, B. (1987). Competition, variation and language learning. In B. MacWhinney (Ed.), Mechanisms of language acquisition (pp. 157-197). Hillsdale, NJ: Lawrence Erlbaum Associates.
Bever, T. (1968). A formal limitation of associationism. In T. Dixon & D. Horton (Eds.), Verbal behavior and general behavior theory. Englewood Cliffs, NJ: Prentice-Hall, Inc.
Bever, T. (1970). The cognitive basis for linguistic universals. In J.R. Hayes (Ed.), Cognition and the development of language (pp. 277-360). New York, NY: Wiley & Sons, Inc.
Bever, T. (1975). Psychologically real grammar emerges because of its role in language acquisition. In D. Dato (Ed.), Developmental psycholinguistics: Theory and applications (pp. 63-75). Georgetown University Round Table on Languages and Linguistics.
Bever, T. (1982). Regression in the service of development. In T. Bever (Ed.), Regression in mental development (pp. 153-188). Hillsdale, NJ: Lawrence Erlbaum Associates.
Bever, T., & Langendoen, D. (1963). (a) The formal justification and descriptive role of variables in phonology. (b) The description of the Indo-European E/O ablaut. (c) The E/O ablaut in Old English. Quarterly Progress Report, RLE, MIT, Summer.
Bever, T., Carroll, J., & Miller, L.A. (1984). Introduction. In T. Bever, J. Carroll & L.A. Miller (Eds.), Talking minds: The study of language in the cognitive sciences. Cambridge, MA: MIT Press.
Bever, T., Mehler, J., & Epstein, J. (1968). What children do in spite of what they know. Science, 162, 921-924.
Brown, R. (1973). A first language: The early stages. Cambridge, MA: Harvard University Press.
Bybee, J., & Slobin, D. (1982). Rules and schemes in the development and use of the English past tense. Language, 58, 265-289.
Chomsky, N. (1959). Review of Skinner's Verbal Behavior. Language, 35, 26-58.
Chomsky, N. (1964). The logical basis of linguistic theory. In Proceedings of the 9th International Conference on Linguistics.
Chomsky, N. (1981). Lectures on government and binding. Dordrecht: Foris.
Chomsky, N., & Halle, M. (1968). The sound pattern of English. New York, NY: Harper and Row.
Crain, S., & Fodor, J.D. (1985). How can grammars help parsers? In D.R. Dowty, L. Kartunen, & A. Zwicky (Eds.), Natural language parsing: Psychological, computational and theoretical perspectives. Cambridge: Cambridge University Press.
Dell, G. (1986). A spreading-activation theory of retrieval in sentence production. Psychological Review, 93, 283-321.
Elman, J.L., & McClelland, J.L. (1986). Exploiting the lawful variability in the speech wave. In J.S. Perkell & D.H. Klatt (Eds.), Invariance and variability of speech processes. Hillsdale, NJ: Lawrence Erlbaum Associates.
Elman, J.L., & Zipser, D. (1987). Learning the hidden structure of speech. UCSD Institute for Cognitive Science Report 8701.
Feldman, J. (1986). Neural representation of conceptual knowledge. University of Rochester Cognitive Science Technical Report URCS-33.
Feldman, J., & Ballard, D. (1982). Connectionist models and their properties. Cognitive Science, 6, 205-254.
Feldman, J., Ballard, D., Brown, C., & Dell, G. (1985). Rochester Connectionist Papers: 1979-1985. University of Rochester Department of Computer Science Technical Report TR-172.
Fodor, J.A. (1965). Could meaning be an rm? Journal of Verbal Learning and Verbal Behavior, 4, 73-81.
Fodor, J.A., Bever, T.G., & Garrett, M. (1974). The psychology of language. New York: McGraw-Hill.
Fodor, J.A., & Pylyshyn, Z.W. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 28, 3-71, this issue.
Ford, M., Bresnan, J., & Kaplan, R. (1981). A competence-based theory of syntactic closure. In J. Bresnan (Ed.), The mental representation of grammatical relations. Cambridge, MA: MIT Press.
Francis, W.N., & Kucera, H. (1979). Manual of information to accompany a standard sample of present-day edited American English, for use with digital computers. Providence, RI: Department of Linguistics, Brown University.
Frazier, L. (1979). On comprehending sentences: Syntactic parsing strategies. Doctoral Thesis, University of Massachusetts.
Frazier, L., Carson, M., & Rayner, K. (1985). Parameterizing the language processing system: Branching patterns within and across languages. Unpublished manuscript.
Garrett, M. (1975). The analysis of sentence production. In G. Bower (Ed.), The psychology of learning and motivation (pp. 133-177). New York: Academic Press.
Goldsmith, J. (1976). An overview of autosegmental phonology. Linguistic Analysis, 2, No. 1.
Grice, H.P. (1975). Logic and conversation. In P. Cole & J.L. Morgan (Eds.), Syntax and semantics 3: Speech acts. New York, NY: Seminar Press.
Grossberg, S. (1987). Competitive learning: From interactive activation to adaptive resonance. Cognitive Science, 11, 23-63.
Halle, M. (1962). Phonology in generative grammar. Word, 18, 54-72.
Hanson, S.J., & Kegl, J. (1987). PARSNIP: A connectionist network that learns natural language grammar from exposure to natural language sentences. Proceedings of the Ninth Annual Cognitive Science Society Meeting. Hillsdale, NJ: Lawrence Erlbaum Associates.
Hinton, G., & Anderson, J. (Eds.) (1981). Parallel models of associative memory. Hillsdale, NJ: Lawrence Erlbaum Associates.
Hinton, G., & Sejnowski, T. (1983). Optimal perceptual inference. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Washington, D.C.
Hinton, G., & Sejnowski, T. (1986). Learning and relearning in Boltzmann machines. In D. Rumelhart & J. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 1. Foundations. Cambridge, MA: MIT Press.
Hopfield, J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences USA, 79, 2554-2558.
Kuczaj, S. (1978). Children's judgments of grammatical and ungrammatical irregular past-tense verbs. Child Development, 49, 319-326.
Kuroda, S.-Y. (1987). Where is Chomsky's bottleneck? Reports of the Center for Research in Language, San Diego, Vol. 1, No. 5.
Langacker, R. (1987). The cognitive perspective. Reports of the Center for Research in Language, San Diego, Vol. 1, No. 3.
Macken, M. (1987). Representation, rules and overgeneralizations in phonology. In B. MacWhinney (Ed.), Mechanisms of language acquisition (pp. 367-397). Hillsdale, NJ: Lawrence Erlbaum Associates.
MacWhinney, B. (1987). The competition model of the acquisition of syntax. In B. MacWhinney (Ed.), Mechanisms of language acquisition (pp. 249-308). Hillsdale, NJ: Lawrence Erlbaum Associates.
Marr, D. (1982). Vision. San Francisco, CA: Freeman.
McClelland, J., & Elman, J. (1986). Interactive processes in speech perception: The TRACE model. In J. McClelland & D. Rumelhart (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 2. Psychological and biological models. Cambridge, MA: MIT Press.
McClelland, J., & Kawamoto, A. (1986). Mechanisms of sentence processing: Assigning roles to constituents. In J. McClelland & D. Rumelhart (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 2. Psychological and biological models. Cambridge, MA: MIT Press.
McClelland, J., & Rumelhart, D. (1981). An interactive activation model of context effects in letter perception: Part 1: An account of basic findings. Psychological Review, 88, 375-407.
McClelland, J., & Rumelhart, D. (Eds.) (1986). Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 2. Psychological and biological models. Cambridge, MA: MIT Press.
Mehler, J., & Bever, T. (1967). Cognitive capacity of very young children. Science, 158, 141-142.
Miller, G.A. (1962). Some psychological studies of grammar. American Psychologist, 17, 748-762.
Minsky, M., & Papert, S. (1969). Perceptrons. Cambridge, MA: MIT Press.
Neches, R., Langley, P., & Klahr, D. (1987). Learning, development and production systems. In D. Klahr, P. Langley & R. Neches (Eds.), Production system models of learning and development. Cambridge, MA: MIT Press.
Osgood, C.E. (1968). Toward a wedding of insufficiencies. In T. Dixon & D. Horton (Eds.), Verbal behavior and general behavior theory. Englewood Cliffs, NJ: Prentice-Hall, Inc.
Pinker, S., & Prince, A. (1988). On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28, 73-193, this issue.
Reich, P.A. (1969). The finiteness of natural language. Language, 45, 831-843.
Rescorla, R.A., & Wagner, A.R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement. In A.H. Black & W.F. Prokasy (Eds.), Classical conditioning. New York: Appleton-Century-Crofts, pp. 64-99.
Rosenblatt, F. (1962). Principles of neurodynamics. New York, NY: Spartan.
Rumelhart, D., Hinton, G.E., & Williams, R. (1986). Learning internal representations by error propagation. In D. Rumelhart, J. McClelland & the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 1. Foundations. Cambridge, MA: MIT Press/Bradford Books.
Rumelhart, D., & McClelland, J. (1982). An interactive activation model of context effects in letter perception: Part 2: The contextual enhancement effect and some tests of the model. Psychological Review, 89, 60-94.
Rumelhart, D., & McClelland, J. (Eds.) (1986a). Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 1. Foundations. Cambridge, MA: MIT Press.
Rumelhart, D., & McClelland, J. (1986b). On learning the past tenses of English verbs. In J. McClelland & D. Rumelhart (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 2. Psychological and biological models. Cambridge, MA: MIT Press.
Sadock, J. (1974). Toward a linguistic theory of speech acts. New York: Academic Press.
Sampson, G. (1987). A turning point in linguistics: Review of D. Rumelhart, J. McClelland and the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Times Literary Supplement, June 12, 1987, p. 643.
Sapir, E. (1921). Language. New York: Harcourt, Brace and World.
Savin, H., & Bever, T. (1970). The nonperceptual reality of the phoneme. Journal of Verbal Learning and Verbal Behavior, 9, 295-302.
Schneider, W., Dumais, S., & Shiffrin, R. (1984). Automatic and control processing and attention. In R. Parasuraman & D. Davies (Eds.), Varieties of attention. New York: Academic Press, Inc.
Skinner, B. (1957). Verbal behavior. New York, NY: Appleton-Century-Crofts.
Slobin, D. (1978). A case study of early language awareness. In A. Sinclair, R. Jarvella & W. Levelt (Eds.), The child's conception of language. New York, NY: Springer-Verlag.
Slobin, D., & Bever, T.G. (1982). Children use canonical sentence schemas: A crosslinguistic study of word order. Cognition, 12, 219-277.
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In D. Rumelhart & J. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 1. Foundations. Cambridge, MA: MIT Press.
Smolensky, P. (in press). The proper treatment of connectionism. Behavioral and Brain Sciences.
Strauss, S., & Stavy, R. (1981). U-shaped behavioral growth: Implications for theories of development. In W.W. Hartup (Ed.), Review of child development research (Volume 6). Chicago: University of Chicago Press.
Tanenhaus, M.K., Carlson, G., & Seidenberg, M. (1985). Do listeners compute linguistic representations? In D. Dowty, L. Kartunnen, & A. Zwicky (Eds.), Natural language parsing: Psychological, computational, and theoretical perspectives (pp. 359-408). Cambridge: Cambridge University Press.
Wanner, E., & Maratsos, M. (1978). An ATN approach to comprehension. In M. Halle, J. Bresnan and G. Miller (Eds.), Linguistic theory and psychological reality. Cambridge, MA: MIT Press.
Wickelgren, W. (1969). Context-sensitive coding, associative memory, and serial order in (speech) behavior. Psychological Review, 76, 1-15.

Name Index
Adelson, B., 29n, 59
Anderson, J.A., 195
Anderson, J.R., 130, 196
Arbib, M., 4
Armstrong, S.L., 123n
Aronoff, M., 113, 115n
Ballard, D.H., 5, 7, 15n, 18n, 51, 62, 65, 75, 196
Bates, E., 241
Beckman, M., 86
Berko, J., 84, 137
Bever, T.G., 96n, 99, 147, 166, 170, 195-247
Black, I.B., 53
Bloch, B., 82n
Bolinger, D., 64
Bower, G., 29n, 59
Bresnan, J., 241
Broadbent, D., 9
Brown, R., 137, 139, 140n, 142, 144, 207
Bybee, J., 82n, 84, 116-118, 145, 148, 150n, 151-156, 160, 174, 181, 207
Carey, S., 177
Carlson, G., 221
Carroll, L., 60
Carson, M., 241
Cazden, C.B., 137, 144
Chase, W.G., 51
Chomsky, N.A., 33n, 34, 64, 74, 78, 82n, 179, 181, 202, 227, 234
Churchland, P.M., 4
Churchland, P.S., 4, 7
Crain, S., 241
Culicover, P., 80, 127
Cummins, R., 60
Curme, G., 82n
De Jong, G.F., 177
Dell, G., 196, 197, 199, 221, 240
Dennett, D., 4
Dreyfus, H., 4
Dreyfus, S., 4
Dumais, S., 242
Dyer, M., 177
Elman, J.L., 197, 199, 224, 235
Epstein, J., 238
Ervin, S., 84, 137
Fahlman, S.E., 4, 51, 52
Feldman, J.A., 5n, 7, 29n, 51, 59, 75, 196
Fodor, J.A., 3-71, 74, 214, 221, 241, 243
Fodor, J.D., 64, 241
Ford, M., 241
Francis, N., 139n
Francis, W.N., 226, 231, 232
Frazier, L., 241
Fries, C., 82n
Frost, L.A., 128
Garrett, M., 221
Geach, P., 21n
Gelman, S.A., 177
Gleitman, H., 123n
Gleitman, L., 123n, 127
Goldsmith, J., 202, 216
Gordon, P., 150n
Grimshaw, J., 117
Gropen, J., 150n
Grossberg, S., 196
Halle, M., 82n, 91, 173, 201, 202
Halwes, T., 96n
Hanson, S.J., 195, 196, 197, 225, 230, 235
Hatfield, G., 4, 13, 52, 59
Hewitt, C., 14, 15, 56
Hillis, D., 56
Hinton, G.E., 4, 21, 22, 40n, 51, 52, 53, 75, 77, 80, 85, 171, 173, 176, 181, 182, 195, 196, 224
Hoard, J., 82n
Hockett, C., 82n
Hopfield, J., 196
Jenkins, J.J., 96n
Jespersen, O., 82n, 98, 121
Kaplan, R., 241
Katz, J.J., 21n
Kawamoto, A.H., 35n, 59, 222, 228
Kegl, J., 196, 197, 225, 230, 235
Keil, F.C., 177
Kiparsky, P., 82n, 87, 111, 112
Klahr, D., 196
Kosslyn, S.M., 4, 13, 52, 59
Kucera, H., 139n, 226, 231, 232
Kuczaj, S.A., 137, 143, 145, 157, 158, 160, 161, 163n, 220
Lachter, J., 99, 147, 166, 170, 195-247
Laird, J., 58
Lakoff, G., 4, 112
Markman, E.
Marr, D.
McCarthy, J.
McClelland, J.L., 6n, 7, 8, 9, 10n, 21n, 22, 24, 29n, 35, 36, 53, 59, 60n, 62n, 64n, 66, 68, 73-193, 196, 199, 200, 204, 217, 222, 224, 228, 234, 237
McCulloch, W.
McDermott, D.
Mehler, J., 238
Mencken, H.L., 109
Mervis, C.
Miller, G.A.
Minsky, M., 74, 224
Mooney, R.
Neches, R., 196
Newell, A., 4, 13, 58, 74, 130, 174
Norman, D.A.
Osgood, C.E., 31n, 49
Osherson, D., 61, 177
Palmer, S.
Papert, S., 64, 224
Pazzani, M., 177
Pierrehumbert, J., 86
Pinker, S., 38n, 42n, 68, 73-193, 200, 203, 205, 211, 214, 218, 243
Pitts, W.
Postal, P.
Prince, A., 68, 73-193, 200, 203, 205, 211, 214, 218, 243
Putnam, H.
Pylyshyn, Z.W., 3-71, 74, 214, 243
Rakic, P., 55
Rayner, K., 241
Reich, P., 228
Rescorla, R.A., 234
Rosch, E., 122
Rosenblatt, F., 90, 207, 211
Rosenbloom, P., 58
Ross, J.R., 113
Rumelhart, D.E., 6n, 7, 8, 9, 10n, 11, 21n, 22, 35, 36, 53, 60n, 62n, 64n, 65, 68, 73-193, 196, 199, 200, 204, 217, 224, 234, 237
Sadock, J., 219
Sampson, G., 80, 208
Sapir, E., 219
Savin, H., 96n, 211, 212
Schmidt, H., 176
Schneider, W., 4, 242
Seidenberg, M., 221
Sejnowski, T.J., 4, 62, 85, 171, 181, 196, 224
Shafir, E., 177
Shattuck-Hufnagel, S., 157n
Shiffrin, R., 242
Sietsema, B., 100n
Simon, H.A., 51, 74, 130
Skinner, B.F., 234, 235
Sloat, C., 82n
Slobin, D., 82n, 84, 116, 118, 140, 145, 148, 150n, 151-156, 160, 163, 207, 219, 239
Smith, E.E., 177
Smith, N., 100
Smolensky, P., 8, 9, 10n, 19, 20, 21, 45, 46, 52, 56, 62, 80, 126, 167, 169, 171, 173, 196, 217, 224, 241, 242
Snow, C.E., 140n
Sokolov, J.L., 130
Sommer, B.A., 97
Stabler, E., 14n, 60
Stich, S., 7
Stob, M., 61
Strauss, S., 238
Sweet, H., 82n
Talmy, L., 174, 181
Tanenhaus, M.K., 221
Tarski, A., 14n
Touretzky, D.S., 65, 77, 182
Treisman, A., 176
Valian, V., 230
Van der Hulst, H., 100n
Wagner, A.R., 234
Wanner, E., 35, 127, 241
Watson, J., 7
Weinstein, S., 61
Wexler, K., 80, 127
Wickelgren, W.A., 89, 204
Williams, E., 112
Williams, R.J., 85, 224
Ziff, P., 43
Zipser, D., 224, 235

Subject Index
Acoustic Activation 196 . 223 Actors , 15 features level , 199 , 1 , 5 , 12 , 15 , 63 , 73 , 75 , 76 , Cognitive Cognitive Combinatorial tions Competence 128 . 242 Complex sys Compositionality Computation Computational tional Concatenative Concept - level ConceDtuallevel Conscious Conservation Conservatism Consistency . associative , 7 , 14 , 55 , 60 , 5 , 28 , 54 , 78 , 52n , 63 , 240 level , 8 , 9 , 66 , 76 , 77 , 85 , of mental representa , 34 , 35 , 80 , structure versus symbols

Acquisition , stages of , 201

processes

, 13 , 14 , 15 , 37 performance , 22 of cognitive , syntax - based , 217 , 174 representations notion of , 50 ,

Ale :orithm . 65 . 80 vs . implementation , 74 Algorithmic use of rules in AI , 233 , 234


All - or - none terns , 53 character of conventional rule

33 , 41 , 43 , 44 , 48 vs . algorithmic structure vs . implementa

Animal
classical von

thought , 41 , 44 , 45

Architecture connectionist Neumann

description

theories , 20 . 9 ~ 19 knowledge overregularization , 77 ; ~ 104 ( see also Voic , 242 , 139 , , 238 versus relations

Associationism , 27 , 31 , 49n , 63 Associative learning , 208

vs . intuitive

Associative strengths between units , 235 Atomic predicates , 22


Atomic
Auto

140 , 142 , 143 Consonant Consonant shifts , 153 - cluster voicing ) , 12 , 17 , 19 , 21 , 23 , 24 ,

renresentations transition

. 12 networks , 216 , 35

Augmented

- association

. 225 . 233 . 235

Autosegmental phonology Auxiliary , does , 103 Auxiliary , has , 103

ing assimilation Constituency 175 Constrained Constraint Content Continuous of principles Conventional Neumann Correlation Damage sical

relations

Back propagation6, 181 183 224 225 , , , , Back-formation, 111 Bandwidthof communication channelslimits , on, in serialmachines 75 , Bistability of the Neckertube, 7 Blendedresponsesin the RM model 155 156 , , , ,

nature of linguistic entities , 181 satisfaction , 1 , 2 , 53 , 196 ,- 243 memory , 172 of applicability Von , 180 over clas variation , in degree

- addressable

157159161 162165 167 , , , , , Blurringof wickelfeature , representation , , 99

-machine 101181 ,210 Boltzman ,30 , Boolean algebra ,196 Brain - modeling style ,62
machines Data - structures

, 53 , 57 machines )

, 5 ( see also , unconstrained

machines

extraction

- resistance , 61

of connectionist , 52 , 56

Decoding / Binding network , 92 Default structure , in the regular Degemination Derivational , 148 morphology , 178

forms

, 121

Categorial operations , 201 rules discrete180 , ,


symbols 201 , Categorization disambiguations231 , Centerembedded sentences35, 227 229 , Centralprocessing unit, 52n 59 , Centralinformativeness TRICS, in the R&M model 214 215 . . Classicalmachines connectionist vs. machines , 16 17 , Cognitivearchitecture 9, 10 35 , ,

Developmental psycholinguistics , 80 Diachronic maintenance , of language 217 Diachronic work Distance Distinctive Distributed , 31 vs . synchronic behavior

systems of a net -

between representations , 20n features . 201 . 202 . 204 . 215 representations , 1 , 2 , 5 , 19 , 172 ,

173 , 174 , 175 , 176 , 177

252

Subject Index

Double

-marking

errors stem

, 157 , 162 of , 158 , 161 , 163n ,

Implementational Implicit knowledge

connectionism , of - with , 174 problem , in language language

, 77 , 79 , 25 , 26 , 27

misconstrued 165

account

In - construction Individuals

relation

Eliminativist Radical Empiricism Rationalism Energy Entailment

theories connectionism vs . nativism )

, 7 , 169 , 170 (see also ) , 60 , 171 ( see also , 14 , 77 of the English level past

Induction

acquisition

, 79 ,

80 , 165 - 167 , 182 Inductive Inferential Infinitive Inflection acquisition English , 82 models morphology processing content of , 84 107 models assigned to , 179 , 4 machine states , of , 128 , 130 , 181 reasoning coherence , English , 83 , 177 , 178

, 33 , 48

, minimization relations acquisition

Errors , in the tense . 73 Excitation Exclusive Expert Explicit 168 Explicit models Featural

of nodes ' or ' , 183 systems , 58

. See Activation

symbolic Inflectional Information

inaccessible rules , 73 decomposition , their

rule

view

, of psychology

Intentional 17

absence

in connectionist

Internal of

representations , 95 and past use

, 7 , 44 , 129

, 179

, 181

language

acquisition

of , 85 formst , 149 English , 151 , 212 , 83 , 110 , 157 , 111 , 113 , 114 , , 73 , 74 , 79 ,

, 174 , 205 models , 178

Irregular

tense

and its role in symbolic Feedback , 6 Folk psychology French , 163 Frequency samples Frequency Generative counts , 139 - sensitivity syntax , 4

95 , 139 , 144 , 147 Irregular verbs

, English

136 , 137 , 138 , 139 , 153 165 , 203

, 159

, 161

, 162 ,

of verbs

, their

use in written
Kant , 27

, 31 , 13Off .
Labels

, 14n competence noun to display encoding , in in , 34

combinatorial and role in

syntax connectionist

and

semantics machines

of , 17 , 17 , 18n

Generativity of linguistic German . 163 Gerund Graceful classical Grammatical Hanson Habits Hanson Hebrew Heuristic Hidden Hierarchy Higher Holistic Hume and and and rules Kegl , English

Language acquisition behavior knowled2e of use thought , 79 , 80 problem access entries by , 135 humans , 200 , 110 , 117 , 179 , , 1 , 37 , 73 , 79 - 82 , 139 , 166 , 167 , 221 of . 73 , 40 , 222 , 140 , 181

. See Verbal , failure , 53 , 58 , their model

degradation machines Kegl regularities

, 232

Learnability

, 239 , 243 model tense , 225 - 233 inflection in , 163 models , 224

Lexical Lexical 199 competence Linguistic Linguistic

, 86 , 96 , 108 , 109

, present . 239 units - level of weak

vs . performance intuitions rules structure , 197 , 195 , 238 , 125 , 219

, 34 , 35

, use in connectionist methods , 4 , 58 theories , 175 testing model , of cognition

, 168 , 169

Linguistic LISP Locality Logic

mechanisms , 49n , 50n

, 65 , 182 , of , 7 ( see - level phonological also theories semantics , of representations ) cognition , 169 , 100

Homophones Hypothesis 236

of game - playing

Lower

Imperative Implementation 67

, English

, 83 level , 55 , 58 , 64 ,

Macro -organization , 170171 , 167 , McClelland Kawamoto & model222233 , Memory vs. program 34, 56 , Mentalprocesses6, 13 28, 30 , ,

vs . symbolic

Subject Index

253

Mental representation 7, 12 18 19 22 25 , 6, , , , , , 28 30 32 36 39 49n 217222 , , , , , , , classical theories 13 17 of, , complex simple13 vs. , constituent structure 33 of, syntax 46 of, classical connectionist vs. theories 7 , 6, Microfeatures , 21 22 23 63 64 , 20 , , , , Modalityof a sentence , , 85 Molecular states10(see representational , also states ) Mood of a sentence , , 85 Morpheme , 86 107108 , 79 , , Morphology phonology , 106119126 vs. , 86 , , Multilayered networks182 ,
Natural kinds , 177 Natural selection , 54 Neologisms . 104 Networks . 6 . 76 . 129 . 196 Neural organization , 62 Neurological states , 10 Neurological distribution , 19 Neuronal nets ~ 196 Newtonian physics and Quantum physics , rela tion between , 168 No -change verbs , performance with , 145, 148, 149 , 150, 151 Nodes , their use in connectionist models , 12, 21 Noise -tolerance of connectionist over classical machines , 52 , 56 Nondeterminism , of human behavior , 53 , 57 Nonverbal and intuitive processes , 52

Obstruence , 86 OldEnglish203 , Output decoder theRMmodel93 , in , Overactive speech production algorithm , 238 Overgeneralization Overregularization (see ) Overregularization, 84 137138140144 , 74 , , , , , 146148 151152154 156157 159164 , , , - , , , , , 166167207208213 219223 237238 , , , , , , , , Overtensing , 161 Paradigm , Kuhnian4 shift , Paradigm used rule , as in -based theories lan of guage , 79 Parameter , 5 space Part relation13 -of , Passive participleEnglish83 84 102 , , , , Passive storage conventional in architecture , 52 59 ,

Past morpheme79. 83. 84 . participleforms, 115 160 , tense English 79, 85, 86, 88, 101 102 115 , , , , , 135 137 142 176 201 204 208 216 224 , , , , , , , tense English acquisition 73, 87, 134 150 , , , , , 164 207 240 . . tense English production 87 , , , Pattem -associatorin the RM model 88 91, , , 93, 107 125 167 168 170 176 181 , , , , , , Pavlovianconditionin 1 ~. Perceptron convergence procedure 90, 179 , , 207 Perceptual heuristic 239 241 , , Perfectparticiple, English 83, 84, 102 , Performance Models 234 236 , t Phoneme91, 93, 173 197 199 204 205 211 , , , , , , , 212 Phoneticrepresentation73. 88. 108 202 . . Phonological features 73. 197 211 , , Phonological representation95, 201 , Phonology morphology in the RM model, and , 101 Phonology phonetics 86 vs. , Physicalrealization of descriptions mental , of processes54 , Physicalsymbolsystemhypothesis 2 , Polish 163 , Polysemy 42, 112 , Possessive marker English 103 107 , , , Postfix 204 , Prefix, 86, 204 Present morphemeEnglish 83 , , participle, English 83 , tenseform, 79, 137 Probabilis tic blurring, 222 -, constraints 75 , decisions 89 , Probabilityfunctions lO2istic 169 . . . Production -style algorithms 242 , Productiverules, general 135 ,

Productivity thought33 36(see Sys , of , , also ,tematicityCompositionality , ) Progressive participleEnglish83 , , Proof -theory 29 30 , , Prototypicality structures thestrong , in class of verbs 116 117120 . . . Psychogrammar , 221 Psychological , of grammars reality , 221 Quantum -mechanical , 10 states

254

Subject Index

RM model (see Rumelhart-McClelland model)
Radical connectionism (see also Eliminative connectionism), 169, 171, 172
Rapidity, of cognitive processes vis-à-vis neural speeds, 5, 62
Rationalist vs. empiricist ideas, 31
Reasoning, 51, 54
Receptive fields, of neurons, 36, 109
Recursion, 69
Reduplication, 163
Registers, memory, 176
Regressions, in cognitive performance, 239, 241
Regular vs. irregular distinction, 107, 137, 213
Regular past tense forms, English, 73, 74, 79, 108, 110, 114, 117, 118, 122, 123, 136, 139, 149, 153, 165, 170, 201, 210
Regularization, 112; of verbs formed from nouns, 95
Representationalist theories, 7-11
Representations, complex, 25
Response effectors, 76
Role relations, 12, 23, 24
Root, 110, 113, 126, 179
Rule-based representations, 195
Rule-based theories, of acquisition, 74, 86, 136, 148, 166, 168
Rule-exception behavior, 11 (see also Rule-governed behavior)
Rule-generated vs. memorized forms, significance of the distinction, 142, 143
Rule-governed behavior, 11, 51, 195, 197, 207, 216, 217
Rule-governed systems, 10, 139, 241
Rule-hypothesization mechanism, 134, 135, 147, 153, 155, 156, 157, 164, 165
Rule-implicit vs. rule-explicit processes, 60, 61
Rules and principles, underlying language use, 74, 78, 221
Rumelhart-McClelland model of past tense acquisition, 73-184, 200, 204, 207-218, 233, 234
Russian, 163
S-R connections, 234, 235, 236
Semantic relatedness, and constituent structure, 42; and systematicity, 45
Semantic theory, its use in the M&K model, 222
Semantics, 7
Sequencing, 197
Serial machines, 55; their weakness, 4
Serial processing, 75
Similarity relations, 177, 178
Similarity space, 21n
Smalltalk, 14
Software vs. hardware, 74
Sounds and words, exchanges between, 199
Spanish, 163
Spatial vs. functional adjacency, 57
Speech behavior, 221
Speech errors, 199
Speech production, Dell's model, 197 (see also TRACE)
Speed of activation and inhibition, 199
Stem, 86
Stem and affix identity, 108, 126
Stimulus-response mechanisms, 234, 236, 237, 240, 241
Stochastic component, of human behavior (see Nondeterminism)
Stochastic relations, among environmental states, 31; among machine states, 31
Strength of connections, 64
Strong verbs, English, 83, 110, 113, 114, 119, 122, 184, 189 (see also Irregular verbs)
Structural constraints, on cognitive phenomena, 196, 240
Structural description, 13
Structural rules, 195, 196, 220, 221, 224
Structure and function, their independence in some systems, 63
Structure-sensitive operations, 13, 28
Structured symbols, 24
Sub-conceptual space, of microfeatures, 19, 20
Sub-symbolic models, 8, 9, 60, 167, 173
Sub-systematic exceptions, 219
Subcomputational transitions, 60
Subjunctive, English, 83
Subregularities, 160
Suffix, in the English past tense, 86
Suppletive forms, 204
Suprasegmental processes, 204
Syllable, 86, 197
Symbol-processing models, in linguistics and cognition, 11, 59, 75, 129, 130, 131
Symbolic models, 167, 179
Symbolic paradigm, 9, 78, 91, 173
Symbolic structures, 13, 78, 135
Symbolic theories of acquisition, 184
Symbols and context dependence, 173
Syncategorematicity, 42, 43


Syncretism, 83, 103
Synonymous representations, 17
Syntactic categories, 42, 43
Syntactic errors, in the H&K model, 230
Systematicity (see also Productivity), of cognitive representations, 33, 37; of inference, 37, 39; of language acquisition, 46, 218; of language comprehension, 37, 38; of thought, 48, 50
TRACE, 199
Talking, 240
Teacher, 90
Thematic Rules, 75, 220, 222
Thresholds, 89, 90, 91, 169, 207, 212
Trees, 18, 61
Truth-preserving transformations, 29, 30, 34, 36
Turing machines, 4, 5, 10n, 12, 28, 61
Universal subgoaling, 58
Validity, of inference, 30
Variables, 174, 176
Variation, constrained and unconstrained, in linguistic systems, 181
Verbal adjective, English, 83
Verbal inflection, English, 83
Verbal noun, English, 82
Visual features, 7, 176, 211
Voiced vs. voiceless distinction, 86, 104, 105, 106
Von Neumann machines, 5, 12, 30, 59
Vowel harmony, 86
Vowel insertion, 100n, 106
Vowel shift, 153
Weights, on connections, 5, 6, 31, 76, 91, 92, 94, 138, 175, 175n, 176, 206
Whole-string binding network, 94, 95, 96, 98, 146, 205
Wickelfeature, 9, 79, 88, 89, 91, 93, 95, 100, 101, 109, 110, 123, 124, 137, 147, 153, 154, 155, 156, 162, 168, 169, 170, 204, 205, 209, 210, 211, 213, 215, 216, 234, 235
Wickelphone, 93, 95, 100, 101, 108, 109, 146, 147, 169, 197, 216
Word, as a collection of feature-units, 199, 200, 204
Word superiority effects, 197
Word-boundary, as bi-polar feature, 209
