COGNITIVE
NEUROSCIENCE
The LibreTexts mission is to unite students, faculty and scholars in a cooperative effort to develop an easy-to-use online
platform for the construction, customization, and dissemination of OER content to reduce the burdens of unreasonable
textbook costs to our students and society. The LibreTexts project is a multi-institutional collaborative venture to develop
the next generation of open-access texts to improve postsecondary education at all levels of higher learning by developing
an Open Access Resource environment. The project currently consists of 14 independently operating and interconnected
libraries that are constantly being optimized by students, faculty, and outside experts to supplant conventional paper-based
books. These free textbook alternatives are organized within a central environment that is both vertically (from advanced to
basic level) and horizontally (across different fields) integrated.
The LibreTexts libraries are Powered by MindTouch® and are supported by the Department of Education Open Textbook
Pilot Project, the UC Davis Office of the Provost, the UC Davis Library, the California State University Affordable
Learning Solutions Program, and Merlot. This material is based upon work supported by the National Science Foundation
under Grant No. 1246120, 1525057, and 1413739. Unless otherwise noted, LibreTexts content is licensed by CC BY-NC-
SA 3.0.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do
not necessarily reflect the views of the National Science Foundation nor the US Department of Education.
Have questions or comments? For information about adoptions or adaptations contact info@LibreTexts.org. More
information on our activities can be found via Facebook (https://facebook.com/Libretexts), Twitter
(https://twitter.com/libretexts), or our blog (http://Blog.Libretexts.org).
1: INTRODUCTION
1.1: INTRODUCTION
1.2: SOME PHENOMENA WE'LL EXPLORE
1.3: THE COMPUTATIONAL APPROACH
1.4: EMERGENT PHENOMENA
1.5: WHY SHOULD WE CARE ABOUT THE BRAIN?
1.6: HOW TO READ THIS BOOK
2: NEURON
2.1: INTRODUCTION
2.2: BASIC BIOLOGY OF A NEURON AS DETECTOR
2.3: DYNAMICS OF INTEGRATION- EXCITATION VS. INHIBITION AND LEAK
2.4: COMPUTING ACTIVATION OUTPUT
2.5: MATHEMATICAL FORMULATIONS
2.6: EXPLORATION OF THE INDIVIDUAL NEURON
2.7: BACK TO THE DETECTOR
2.8: SUBTOPICS
2.9: EXPLORATIONS
3: NETWORKS
3.1: INTRODUCTION
3.2: BIOLOGY OF THE NEOCORTEX
3.3: CATEGORIZATION AND DISTRIBUTED REPRESENTATIONS
3.4: BIDIRECTIONAL EXCITATORY DYNAMICS AND ATTRACTORS
3.5: INHIBITORY COMPETITION AND ACTIVITY REGULATION
3.6: SUBTOPICS AND EXPLORATIONS
4: LEARNING MECHANISMS
4.1: INTRODUCTION
4.2: BIOLOGY OF SYNAPTIC PLASTICITY
4.3: THE EXTENDED CONTRASTIVE ATTRACTOR LEARNING (XCAL) MODEL
4.4: WHEN, EXACTLY, IS THERE AN OUTCOME THAT SHOULD DRIVE LEARNING
4.5: THE LEABRA FRAMEWORK
4.6: SUBTOPICS AND EXPLORATIONS
4.7: REFERENCES
5: BRAIN AREAS
5.1: INTRODUCTION
5.2: NAVIGATING THE FUNCTIONAL ANATOMY OF THE BRAIN
5.3: PERCEPTION AND ATTENTION- WHAT VS. WHERE
5.4: MOTOR CONTROL- PARIETAL AND MOTOR CORTEX INTERACTING WITH BASAL GANGLIA AND
CEREBELLUM
5.5: LEARNING AND MEMORY- TEMPORAL CORTEX AND THE HIPPOCAMPUS
5.6: LANGUAGE- ALL TOGETHER NOW
5.7: EXECUTIVE FUNCTION- PREFRONTAL CORTEX AND BASAL GANGLIA
1 1/5/2022
6.2: BIOLOGY OF PERCEPTION
6.3: ORIENTED EDGE DETECTORS IN PRIMARY VISUAL CORTEX
6.4: INVARIANT OBJECT RECOGNITION IN THE "WHAT" PATHWAY
6.5: SPATIAL ATTENTION AND NEGLECT IN THE "WHERE/HOW" PATHWAY
6.6: EXPLORATIONS
9: LANGUAGE
9.1: INTRODUCTION
9.2: BIOLOGY OF LANGUAGE
9.3: READING AND DYSLEXIA IN THE TRIANGLE MODEL
9.4: SPELLING TO SOUND MAPPINGS IN WORD READING
9.5: LATENT SEMANTICS IN WORD CO-OCCURRENCE
9.6: SYNTAX AND SEMANTICS IN A SENTENCE GESTALT
9.7: NEXT STEPS IN LANGUAGE MODELING OF SENTENCES AND BEYOND
9.8: SUBTOPICS AND EXPLORATIONS
BACK MATTER
INDEX
GLOSSARY
CHAPTER OVERVIEW
1: INTRODUCTION
1.1: INTRODUCTION
1.2: SOME PHENOMENA WE'LL EXPLORE
1.3: THE COMPUTATIONAL APPROACH
1.4: EMERGENT PHENOMENA
1.5: WHY SHOULD WE CARE ABOUT THE BRAIN?
1.6: HOW TO READ THIS BOOK
1.1: Introduction
You are about to embark on one of the most fascinating scientific journeys possible: inside your own brain! We start this
journey by understanding what individual neurons in your neocortex do with the roughly 10,000
synaptic input signals that they receive from other neurons. The neocortex is the most
evolutionarily recent part of the brain, which is also most enlarged in humans, and is where most
of your thinking takes place. The numbers of neurons and synapses between neurons in the neocortex are
astounding: roughly 20 billion neurons, each of which is interconnected with roughly 10,000 others. That is several times
more neurons than people on earth. And each neuron is far more social than we are as people -- estimates of the size of
stable human social networks are only around 150-200 people, compared to the 10,000 for neurons. We've got a lot going
on under the hood. At these scales, the influence of any one neuron on any other is relatively small. We'll see that these
small influences can be shaped in powerful ways through learning mechanisms, to achieve complex and powerful forms of
information processing. And this information processing prowess does not require much complexity from the individual
neurons themselves -- fairly simple forms of information integration both accurately describe the response properties of
actual neocortical neurons, and enable sophisticated information processing at the level of aggregate neural networks.
After developing an understanding of these basic neural information processing mechanisms in Part I of this book, we
continue our journey in Part II by exploring many different aspects of human thought (cognition), including perception and
attention, motor control and reinforcement learning, learning and memory, language, and executive function. Amazingly,
all these seemingly different cognitive functions can be understood using the small set of common neural mechanisms
developed in Part I. In effect, our neocortex is a fantastic form of silly putty, which can be molded by the learning process
to take on many different cognitive tasks. For example, we will find striking similarities across different brain areas and
cognitive functions -- the development of primary visual cortex turns out to tell us a lot about the development of rich
semantic knowledge of word meanings!
Figure 1.1: Simple example of an emergent phenomenon in a very simple physical system: two gears. Panels a and b
contain the same parts. Only panel b exhibits an emergent phenomenon, through the interaction of the two gears, producing
things like torque and speed differentials on the two different gears.
Emergence is about interactions between parts. Computer models can capture many complex interactions and reveal
nonobvious kinds of emergence. Emergence can be illustrated in a very simple physical system, two interacting gears, as
shown in Figure 1.1. It is not mysterious or magical. On the
other hand, it really is. You can make the gears out of any kind of sufficiently hard material, and they will still work. There
might be subtle factors like friction and durability that vary. But over a wide range, it doesn't matter what the gears are
made from. Thus, there is a level of transcendence that occurs with emergence, where the behavior of the more complex
interacting system does not depend on many of the detailed properties of the lower level parts. In effect, the interaction
itself is what matters, and the parts are mere place holders. Of course, they have to be there, and meet some basic criteria,
but they are nevertheless replaceable.
Taking this example into the domain of interest here, does this mean that we can switch out our biological neurons for
artificial ones, and everything should still function the same, as long as we capture the essential interactions in the right
way? Some of us believe this to be the case, and that when we finally manage to put enough neurons in the right
configuration into a big computer simulation, the resulting brain will support consciousness and everything else, just like
the ones in our own heads. One interesting further question arises: how important are all the interactions between our
physical bodies and the physical environment? There is good reason to believe that this is critical. Thus, we'll have to put
this brain in a robot. Or perhaps more challengingly, in a virtual environment in a virtual reality, still stuck inside the
computer. It will be fascinating to ponder this question on your journey through the simulated brain...
Figure 1.3: Models that are relatively unconstrained, e.g., by not addressing biological constraints, or detailed behavioral
data, are like jigsaw puzzles of a featureless blue sky -- very hard to solve -- you just don't have enough clues as to how
everything fits together.
A couple of the most satisfying instances of all the pieces coming together to complete a puzzle include:
The detailed biology of the hippocampus, including high levels of inhibition and broad diffuse connectivity, fit together
with its unique role in rapidly learning new episodic information, and the remarkable data from patient HM who had
his hippocampus resected to prevent intractable epilepsy. Through computational models in the Memory Chapter, we
can see that these biological details produce high levels of pattern separation which keep memories highly distinct, and
thus enable rapid learning without creating catastrophic levels of interference.
2.1: INTRODUCTION
2.2: BASIC BIOLOGY OF A NEURON AS DETECTOR
2.3: DYNAMICS OF INTEGRATION- EXCITATION VS. INHIBITION AND LEAK
2.4: COMPUTING ACTIVATION OUTPUT
2.5: MATHEMATICAL FORMULATIONS
2.6: EXPLORATION OF THE INDIVIDUAL NEURON
2.7: BACK TO THE DETECTOR
2.8: SUBTOPICS
2.9: EXPLORATIONS
2.1: Introduction
One major reason the brain can be so plastic and learn to do so many different things is that it is made up of a highly-sculptable form of silly putty: billions of individual neurons that are densely interconnected with each other, and capable of
shaping what they do by changing these patterns of interconnections. The brain is like a massive LEGO set, where each of
the individual pieces is quite simple (like a single LEGO piece), and all the power comes from the nearly infinite ways that
these simple pieces can be recombined to do different things.
So the good news for you the student is, the neuron is fundamentally simple. Lots of people will try to tell you otherwise,
but as you'll see as you go through this book, simple neurons can account for much of what we know about how the brain
functions. So, even though they have a lot of moving parts and you can spend an entire career learning about even just one
tiny part of a neuron, we strongly believe that all this complexity is in the service of a very simple overall function.
What is that function? Fundamentally, it is about detection. Neurons receive thousands of different input signals from
other neurons, looking for specific patterns that are "meaningful" to them. A very simple analogy is with a smoke detector,
which samples the air and looks for telltale traces of smoke. When these exceed a specified threshold limit, the alarm goes
off. Similarly, the neuron has a threshold and only sends an "alarm" signal to other neurons when it detects something
significant enough to cross this threshold. The alarm is called an action potential or spike and it is the fundamental unit of
communication between neurons.
Figure 2.1: Trace of a simulated neuron spiking action potentials in response to an excitatory input -- the blue v_m
membrane potential (voltage of the neuron) increases (driven by the excitatory net input) until it reaches threshold (around
.5), at which point a green act activation spike (action potential) is triggered, which then resets the membrane potential
back to its starting value (.3) and the process continues. The spike is communicated to other neurons, and the overall rate of
spiking (tracked by the purple act_eq value) is proportional to the level of excitatory net input (relative to other opposing
factors such as inhibition -- the balance of all these factors is reflected in the net current I_net, in red). You can produce
this graph and manipulate all the relevant parameters in the Neuron exploration for this chapter.
Our goal in this chapter is to understand how the neuron receives input signals from other neurons, integrates them into an
overall signal strength that is compared against the threshold, and communicates the result to other neurons. We will see
how these processes can be characterized mathematically in computer simulations (summarized in Figure 2.1). In the rest
of the book, we will see how this simple overall function of the neuron ultimately enables us to perceive the world, to
think, to communicate, and to remember.
Math warning: This chapter and the Learning Mechanisms Chapter are the only two in the entire book with significant
amounts of math (because these two chapters describe in detail the equations for our simulations). We have separated the
conceptual from the mathematical content, and those with an aversion to math can get by without understanding all the
details. So, don't be put off or overwhelmed by the math here!
Figure 2.3: The neuron is a tug-of-war battleground between inhibition and excitation -- the relative strength of each is
what determines the membrane potential, Vm, which is what must get over threshold to fire an action potential output from
the neuron.
The integration process can be understood in terms of a tug-of-war ( Figure 2.3). This tug-of-war takes place in the space
of electrical potentials that exist in the neuron relative to the surrounding extracellular medium in which neurons live
(interestingly, this medium, and the insides of neurons and other cells as well, is basically salt water with sodium (Na+),
chloride (Cl-) and other ions floating around -- we carry our remote evolutionary environment around within us at all
times). The core function of a neuron can be understood entirely in electrical terms: voltages (electrical potentials) and
currents (flow of electrically charged ions in and out of the neuron through tiny pores called ion channels).
To see how this works, let's just consider excitation versus inhibition (inhibition and leak are effectively the same for our
purposes at this time). The key point is that the integration process reflects the relative strength of excitation versus
inhibition -- if excitation is stronger than inhibition, then the neuron's electrical potential (voltage) increases, perhaps to
the point of getting over threshold and firing an output action potential. If inhibition is stronger, then the neuron's electrical
potential decreases, and thus moves further away from getting over the threshold for firing.
Before we consider specific cases, let's introduce some obscure terminology that neuroscientists use to label the various
actors in our tug-of-war drama (going from left to right in the Figure):

\(g_i\) -- the inhibitory conductance (g is the symbol for a conductance, and i indicates inhibition) -- this is the total strength
of the inhibitory input (i.e., how strong the inhibitory guy is tugging), and plays a major role in determining how strong
of an inhibitory current there is. This corresponds biologically to the proportion of inhibitory ion channels that are
currently open and allowing inhibitory ions to flow (these are chloride or Cl- ions in the case of GABA inhibition, and
potassium or K+ ions in the case of leak currents). For electricity buffs, the conductance is the inverse of resistance --
most people find conductance more intuitive than resistance, so we'll stick with it.

\(E_i\) -- the inhibitory driving potential -- in the tug-of-war metaphor, this just amounts to where the inhibitory guy
happens to be standing relative to the electrical potential scale that operates within the neuron. Typically, this value is
around -75mV, where mV stands for millivolts -- one thousandth (1/1,000) of a volt. These are very small electrical
potentials for very small neurons.

\(\theta\) -- the action potential threshold -- this is the electrical potential at which the neuron will fire an action potential
output to signal other neurons. This is typically around -50mV. This is also called the firing threshold or the spiking
threshold, because neurons are described as "firing a spike" when they get over this threshold.

\(V_m\) -- the membrane potential of the neuron (V = voltage or electrical potential, and m = membrane). This is the current
electrical potential of the neuron relative to the extracellular space outside the neuron. It is called the membrane
potential because it is the cell membrane (a thin layer of fat, basically) that separates the inside and outside of the neuron,
and that is where the electrical potential really happens. An electrical potential or voltage is a relative comparison
between the amount of electric charge in one location versus another. It is called a "potential" because when there is a
Computing Inputs
We begin by formalizing the "strength" by which each side of the tug-of-war pulls, and then show how that causes the \(V_m\)
"flag" to move as a result. This provides explicit equations for the tug-of-war dynamic integration process. Then, we show
how to actually compute the conductance factors in this tug-of-war equation as a function of the inputs coming into the
neuron, and the synaptic weights (focusing on the excitatory inputs for now). Finally, we provide a summary equation for
the tug-of-war which can tell you where the flag will end up in the end, to complement the dynamical equations which
show you how it moves over time.
Neural Integration
The key idea behind these equations is that each guy in the tug-of-war pulls with a strength that is proportional to both its
overall strength (conductance), and how far the "flag" (\(V_m\)) is away from its position (indicated by the driving potential E).
Imagine that the tuggers are planted in their position, and their arms are fully contracted when the \(V_m\) flag gets to their
position (E), and they can't re-grip the rope, such that they can't pull any more at this point. To put this idea into an
equation, we can write the "force" or current that the excitatory guy exerts as:

excitatory current:

\[ I_e = g_e (E_e - V_m) \tag{2.5.1} \]
The excitatory current is \(I_e\) (I is the traditional term for an electrical current, and e again for excitation), and it is the
product of the conductance \(g_e\) times how far the membrane potential is away from the excitatory driving potential. If
\(V_m = E_e\) then the excitatory guy has "won" the tug of war, and it no longer pulls anymore, and the current goes to zero
(regardless of how big the conductance might be -- anything times 0 is 0). Interestingly, this also means that the excitatory
guy pulls the strongest when the \(V_m\) "flag" is furthest away from it -- i.e., when the neuron is at its resting potential. The
same form of equation applies to the inhibitory and leak currents:

inhibitory current:

\[ I_i = g_i (E_i - V_m) \tag{2.5.2} \]

leak current:

\[ I_l = g_l (E_l - V_m) \tag{2.5.3} \]
The membrane potential is then updated from the net current, which sums the three currents above:

\[ I_{net} = I_e + I_i + I_l \]

\[ V_m(t) = V_m(t-1) + dt_{vm} \, I_{net} \]

\(V_m(t)\) is the current value of \(V_m\), which is updated from its value on the previous time step \(V_m(t-1)\), and \(dt_{vm}\) is
a rate constant that determines how fast the membrane potential changes -- it mainly reflects the capacitance of the
neuron's membrane.
The above two equations are the essence of what we need to be able to simulate a neuron on a computer! They tell us how the
membrane potential changes as a function of the inhibitory, leak and excitatory inputs -- given specific numbers for these
input conductances, and a starting \(V_m\) value, we can then iteratively compute the new \(V_m\) value according to the above
equations, and this will accurately reflect how a real neuron would respond to similar such inputs!
To summarize, here's a single version of the above equations that does everything:

\[ V_m(t) = V_m(t-1) + dt_{vm} \left[ g_e (E_e - V_m) + g_i (E_i - V_m) + g_l (E_l - V_m) \right] \]
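As a concrete sketch, the summary update can be written directly in code. This is an illustration in Python (not the book's simulator); the driving potentials, leak conductance, and \(dt_{vm}\) are the normalized default values given in Table 2.1 later in the chapter, while the excitatory and inhibitory conductances here are arbitrary example inputs.

```python
# Iterative Euler update of the membrane potential (tug-of-war integration),
# in the book's normalized units.  Parameter values follow Table 2.1; the
# g_e and g_i inputs below are arbitrary illustrative values.
E_e, E_i, E_l = 1.0, 0.25, 0.3   # driving potentials (normalized)
dt_vm = 0.355                    # rate constant (reflects 1/C, per Table 2.1)

def update_vm(vm, g_e, g_i, g_l=0.1):
    """One update step: each current pulls Vm toward its driving potential."""
    i_net = g_e * (E_e - vm) + g_i * (E_i - vm) + g_l * (E_l - vm)
    return vm + dt_vm * i_net

vm = 0.3                         # start at the resting potential
for _ in range(100):             # with constant inputs, Vm settles
    vm = update_vm(vm, g_e=0.4, g_i=0.2)
```

With constant inputs, `vm` converges to the equilibrium weighted-average value derived later in this section.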
For those of you who noticed the issue with the minus sign above, or are curious how all of this relates to Ohm's law and
the process of diffusion, please see Electrophysiology of the Neuron. If you're happy enough with where we've come, feel
free to move along to finding out how we compute these input conductances, and what we then do with the \(V_m\) value to
drive the neuron's output.

Each conductance can be separated into a constant maximum value (the "g-bar" term) times a dynamically changing
proportion of open channels:

excitatory conductance:

\[ \bar{g}_e \, g_e(t) \tag{2.5.6} \]

inhibitory conductance:

\[ \bar{g}_i \, g_i(t) \tag{2.5.7} \]

leak conductance:

\[ \bar{g}_l \tag{2.5.8} \]

(note that because leak is a constant, it does not have a dynamically changing value, only the constant g-bar value).
This separation of terms makes it easier to compute the conductance, because all we need to focus on is computing the
proportion or fraction of open ion channels of each type. This can be done by computing the average number of ion
channels open at each synaptic input to the neuron:

\[ g_e(t) = \frac{1}{n} \sum_i x_i w_i \]
where \(x_i\) is the activity of a particular sending neuron indexed by the subscript i, \(w_i\) is the synaptic weight strength that
connects sending neuron i to the receiving neuron, and n is the total number of channels of that type (in this case,
excitatory) across the synaptic inputs to the neuron.

The equilibrium membrane potential is the value that \(V_m\) settles to for steady
excitatory and inhibitory input conductances (if these aren't steady, then the \(V_m\) will likely be constantly changing as they
change). This equilibrium value is interesting because it tells us more clearly how the tug-of-war process inside the neuron
actually balances out in the end. Also, we will see in the next section that it is useful mathematically.
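The conductance computation can be sketched in a few lines of code; this is an illustration only, and the activity and weight values below are made up.

```python
# Excitatory conductance as the average of sending activities times weights,
# g_e(t) = (1/n) * sum_i x_i * w_i.  Input values are illustrative.
def excitatory_conductance(xs, ws):
    """xs: sending activities x_i; ws: synaptic weights w_i."""
    assert len(xs) == len(ws)
    return sum(x * w for x, w in zip(xs, ws)) / len(xs)

g_e = excitatory_conductance([1.0, 0.0, 0.5, 1.0], [0.8, 0.9, 0.4, 0.2])
# (0.8 + 0.0 + 0.2 + 0.2) / 4 = 0.3
```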
To compute the equilibrium membrane potential, we can use an important mathematical technique: set the change in
membrane potential (according to the iterative \(V_m\) updating equation from above) to 0, and then solve the equation for the
value of \(V_m\) under this condition. In other words, if we want to find out what the equilibrium state is, we simply compute
what the numbers need to be such that \(V_m\) is no longer changing (i.e., its rate of change is 0). Here are the mathematical
steps:

just the change part (time constant omitted as we are looking for equilibrium):

\[ \Delta V_m = g_e (E_e - V_m) + g_i (E_i - V_m) + g_l (E_l - V_m) \]

set it to zero:

\[ 0 = g_e (E_e - V_m) + g_i (E_i - V_m) + g_l (E_l - V_m) \]

solve for \(V_m\):

\[ V_m = \frac{g_e}{g_e + g_i + g_l} E_e + \frac{g_i}{g_e + g_i + g_l} E_i + \frac{g_l}{g_e + g_i + g_l} E_l \tag{2.5.12} \]
In other words, the equilibrium \(V_m\) is a weighted average of the driving potentials, with each E value weighted by its
conductance relative to the sum of all the conductances \((g_e + g_i + g_l)\). This is just what we expect from the tug-of-war picture: if we ignore \(g_l\), then the \(V_m\) "flag" is
positioned as a function of the relative balance between \(g_e\) and \(g_i\) -- if they are equal, then
\(\frac{g_e}{g_e + g_i}\) is .5 (e.g., just put a "1" in
for each of the g's -- 1/2 = .5), which means that the \(V_m\) flag is half-way between \(E_i\) and \(E_e\). So, all this math just to
rediscover what we knew already intuitively! (Actually, that is the best way to do math -- if you draw the right picture, it
should tell you the answers before you do all the algebra.) But we'll see that this math will come in handy next.
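The closed-form equilibrium can be checked directly in code; this is an illustrative sketch using the same normalized driving potentials, and it confirms the half-way intuition above.

```python
# Closed-form equilibrium membrane potential (Eq. 2.5.12) as a weighted
# average of the driving potentials.  Values are normalized and illustrative.
def vm_equilibrium(g_e, g_i, g_l, E_e=1.0, E_i=0.25, E_l=0.3):
    g_tot = g_e + g_i + g_l
    return (g_e * E_e + g_i * E_i + g_l * E_l) / g_tot

# With leak ignored and g_e == g_i, Vm sits half-way between E_i and E_e:
halfway = vm_equilibrium(g_e=1.0, g_i=1.0, g_l=0.0)
```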
Here is a version with the conductance terms explicitly broken out into the "g-bar" constants and the time-varying "g(t)"
parts:
For those who really like math, the equilibrium membrane potential equation can be shown to be a Bayesian Optimal
Detector.
Generating Outputs
The output of the neuron can be simulated at two different levels: discrete spiking (which is how neurons actually behave
biologically), or using a rate code approximation. We cover each in turn, and show how the rate code must be derived to
match the behavior of the discrete spiking neuron, when averaged over time (it is important that our approximations are
valid in the sense that they match the more detailed biological behavior where possible, even as they provide some
simplification).
Discrete Spiking
To compute discrete action potential spiking behavior from the neural equations we have so far, we need to determine
when the membrane potential gets above the firing threshold, and then emit a spike, and subsequently reset the membrane
potential back down to a value, from which it can then climb back up and trigger another spike again, etc. This is actually
best expressed as a kind of simple computer program:
if (Vm > θ) then: y = 1; Vm = Vm_r; else y = 0
where y is the activation output value of the neuron, and Vm_r is the reset potential that the membrane potential is reset to
after a spike is triggered. Biologically, there are special potassium (K+) channels that bring the membrane potential back
down after a spike.
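The if/else spiking rule can be combined with the membrane-potential update into a short simulation loop. This is an illustrative Python sketch, not the book's simulator: the threshold (.5) and reset (.3) values follow Figure 2.1, while the conductance inputs are made up.

```python
# Simple threshold-and-reset spiking on top of the tug-of-war Vm update,
# in normalized units.  Threshold/reset per Figure 2.1; inputs illustrative.
theta, vm_r = 0.5, 0.3            # firing threshold and reset potential
E_e, E_i, E_l = 1.0, 0.25, 0.3
dt_vm = 0.355

def step(vm, g_e, g_i=0.1, g_l=0.1):
    """One time step: integrate, then spike (y=1) and reset if over threshold."""
    vm += dt_vm * (g_e * (E_e - vm) + g_i * (E_i - vm) + g_l * (E_l - vm))
    if vm > theta:
        return vm_r, 1            # spike and reset
    return vm, 0

vm, spikes = 0.3, 0
for _ in range(200):
    vm, y = step(vm, g_e=0.4)
    spikes += y
```

With this constant excitatory input the equilibrium \(V_m\) is above threshold, so the neuron fires regularly, tracing the sawtooth pattern shown in Figure 2.1.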
This simplest of spiking models is not quite sufficient to account for the detailed spiking behavior of actual cortical
neurons. However, a slightly more complex model can account for actual spiking data with great accuracy (as shown by
Gerstner and colleagues (Brette & Gerstner, 2005), and winning several international competitions even!). This model is
known as the Adaptive Exponential or AdEx model -- click on the link to read more about it. We typically use this AdEx
model when simulating discrete spiking, although the simpler model described above is also still an option. The critical
feature of the AdEx model is that the effective firing threshold adapts over time, as a function of the excitation coming into
the cell, and its recent firing history. The net result is a phenomenon called spike rate adaptation, where the rate of
spiking tends to decrease over time for otherwise static input levels. Otherwise, however, the AdEx model is identical to
the one described above.
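To make spike-rate adaptation concrete, here is a minimal AdEx-style sketch in Python. It uses the standard published Brette & Gerstner (2005) parameter values (in biological units); the input current, integration step, and variable names are our own illustrative choices, not the book's implementation.

```python
# Minimal AdEx integrate-and-fire sketch showing spike-rate adaptation.
# Parameters are the standard Brette & Gerstner (2005) values; the input
# current I is an arbitrary illustrative choice.
import math

C, g_L, E_L = 281.0, 30.0, -70.6          # pF, nS, mV
V_T, D_T = -50.4, 2.0                     # soft threshold and slope (mV)
tau_w, a, b = 144.0, 4.0, 80.5            # ms, nS, pA
V_r, V_peak = -70.6, 20.0                 # reset and spike cutoff (mV)
dt, I = 0.1, 800.0                        # ms, pA constant input current

V, w, spike_times = E_L, 0.0, []
for step in range(20000):                 # 2 seconds of simulated time
    dV = (-g_L * (V - E_L) + g_L * D_T * math.exp((V - V_T) / D_T)
          - w + I) / C
    dw = (a * (V - E_L) - w) / tau_w
    V += dt * dV
    w += dt * dw
    if V >= V_peak:                       # spike: reset V, bump adaptation w
        V = V_r
        w += b
        spike_times.append(step * dt)

# Spike-rate adaptation: inter-spike intervals grow as w accumulates.
isis = [t2 - t1 for t1, t2 in zip(spike_times, spike_times[1:])]
```

The growing adaptation current `w` is what makes the effective threshold adapt, so the inter-spike intervals lengthen over time for a constant input, which is the spike rate adaptation phenomenon described above.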
Even though actual neurons communicate via discrete spiking (action potential) events, it is often useful in our
computational models to adopt a somewhat more abstract rate code approximation, where the neuron continuously
outputs a single graded value (typically normalized between 0-1) that reflects the overall rate of spiking that the neuron
should be exhibiting given the level of inputs it is receiving. In other words, we could count up the number of discrete
spikes the neuron fires, and divide that by the amount of time we did the counting over, and this would give us an average
spiking rate. Instead of having the neuron communicate this rate of spiking distributed in discrete spikes over that period of
time, we can have it communicate that rate value instantly, as a graded number. Computationally, this should be more
efficient, because it is compressing the amount of time required to communicate a particular spiking rate, and it also tends
to reduce the overall level of noise in the network, because instead of switching between spiking and not-spiking, the
neuron can continuously output a more steady rate code value.
As noted earlier, the rate code value can be thought of in biological terms as the output of a small population (e.g., 100) of
neurons that are generally receiving the same inputs, and giving similar output responses -- averaging the number of spikes
at any given point in time over this population of neurons is roughly equivalent to averaging over time from a single
spiking neuron. As such, we can consider our simulated rate code computational neurons to correspond to a small
population of actual discrete spiking neurons.
To actually compute the rate code output, we need an equation that provides a real-valued number that matches the number
of spikes emitted by a spiking neuron with the same level of inputs. Interestingly, you cannot use the membrane potential
\(V_m\) as the input to this equation -- it does not have a one-to-one relationship with spiking rate! That is, when we run our
spiking model and measure the actual rate of spiking for different combinations of excitatory and inhibitory input, and then
plot that against the equilibrium \(V_m\) value that those input values produce (without any spiking taking place), there are
multiple spiking rate values for each \(V_m\) value -- you cannot predict the correct firing rate value knowing only the \(V_m\)
(Figure 2.5).
Instead, we compute the rate code output as a function of the excitatory net input, compared against
an appropriate threshold value (Figure 2.6). For the membrane potential, we know that \(V_m\) is compared to the threshold \(\theta\)
to determine when output occurs. What is the appropriate threshold to use for the excitatory net input? We need to
somehow convert \(\theta\) into a \(g_e^{\Theta}\) value -- a threshold in excitatory input terms. Here, we can leverage the equilibrium
membrane potential equation, derived above. We can use this equation to solve for the level of excitatory input
conductance that would put the equilibrium membrane potential right at the firing threshold \(\theta\):

equilibrium \(V_m\) at threshold:

\[ \Theta = \frac{g_e^{\Theta} E_e + g_i E_i + g_l E_l}{g_e^{\Theta} + g_i + g_l} \tag{2.5.13} \]

solved for the threshold conductance:

\[ g_e^{\Theta} = \frac{g_i (E_i - \Theta) + g_l (E_l - \Theta)}{\Theta - E_e} \]
The rate code output is then a function of the difference between the actual excitatory conductance and this threshold
value, \(y = f(g_e - g_e^{\Theta})\), and all we need to do is figure out what this function f() should look like.
There are three important properties that this function should have:
threshold -- it should be 0 (or close to it) when \(g_e\) is less than its threshold value (neurons should not respond when
below threshold).
saturation -- when \(g_e\) gets very strong relative to the threshold, the neuron cannot actually keep firing at increasingly
high rates -- there is an upper limit to how fast it can spike (typically around 100-200 Hz, or spikes per second). Thus,
our rate code function also needs to exhibit this leveling-off or saturation at the high end.
smoothness -- there shouldn't be any abrupt transitions (sharp edges) to the function, so that the neuron's behavior is
smooth and continuous.
The function that satisfies these constraints is the X-over-X-plus-1 (XX1) function, \(f(x) = \frac{x}{x+1}\), where x is the
positive portion of \(g_e - g_e^{\Theta}\), with an extra gain factor \(\gamma\), which just multiplies everything:

\[ x = \gamma \left[ g_e - g_e^{\Theta} \right]_+ \]
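In code, the thresholded, gain-scaled XX1 function might look like the following sketch; the gain and threshold-conductance values here are arbitrary illustrations, not the simulator's defaults.

```python
# X-over-X-plus-1 (XX1) rate-code function: zero below threshold,
# saturating toward 1 far above it.  gamma and g_e_theta are illustrative.
def xx1(g_e, g_e_theta=0.4, gamma=30.0):
    x = gamma * max(g_e - g_e_theta, 0.0)   # thresholded, gain-scaled drive
    return x / (x + 1.0)                    # saturates toward 1 for large x

below = xx1(0.3)   # below threshold -> exactly 0
above = xx1(2.0)   # far above threshold -> close to 1
```

Note the one property this plain version lacks: it is not smooth right at threshold, which is what the noise convolution described next repairs.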
As you can see in Figure 2.7 (Noise=0), the basic XX1 function is not smooth at the point of the threshold. To remedy
this problem, we convolve the XX1 function with normally-distributed (gaussian) noise, which smooths it out as shown in
the Noise=0.005 case in Figure 2.7. Convolving amounts to adding to each point in the function some contribution from
its nearby neighbors, weighted by the gaussian (bell-shaped) curve. It is what photo editing programs do when they do
"smoothing" or "blurring" on an image. In the software, we perform this convolution operation and then store the results in
a lookup table of values, to make the computation very fast. Biologically, this convolution process reflects the fact that
neurons experience a large amount of noise (random fluctuations in the inputs and membrane potential), so that even if
they are slightly below the firing threshold, a random fluctuation can sometimes push it over threshold and generate a
spike. Thus, the spiking rate around the threshold is smooth, not sharp as in the plain XX1 function.
For completeness' sake, and strictly for the mathematically inclined, here is the equation for the convolution operation:

\[ y^*(x) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-z^2/(2\sigma^2)} \, y(z - x) \, dz \]

where y(z-x) is the XX1 function applied to the z-x input instead of just x. In practice, a finite kernel of width \(3\sigma\) on either
side of x is used in the numerical convolution.
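The numerical convolution with a finite \(3\sigma\) kernel can be sketched as follows; the \(\sigma\) value matches the Noise=0.005 case in Figure 2.7, while the grid resolution is our own choice.

```python
# Numerically convolve the XX1 function with gaussian noise (the "NXX1"
# smoothing), using a finite kernel of width 3*sigma on either side of x.
# sigma follows Figure 2.7's Noise=0.005 case; the grid size is our choice.
import math

def xx1(x):
    return x / (x + 1.0) if x > 0.0 else 0.0

def nxx1(x, sigma=0.005, width=3.0, n=601):
    """Average XX1 over gaussian jitter of its input (numerical integral)."""
    lo = -width * sigma
    dz = 2 * width * sigma / (n - 1)
    norm = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    total = 0.0
    for i in range(n):
        z = lo + i * dz
        total += norm * math.exp(-z * z / (2 * sigma * sigma)) * xx1(x + z) * dz
    return total

smooth_at_zero = nxx1(0.0)   # nonzero: noise pushes some mass over threshold
```

Right at threshold (x = 0) the smoothed value is small but positive, capturing the fact that random fluctuations can push a slightly sub-threshold neuron over threshold; in the software this function is precomputed into a lookup table for speed, as noted above.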
After convolution, the XX1 function ( Figure 2.7) approximates the average firing rate of many neuronal models with
discrete spiking, including AdEx. A mathematical explanation is here: Frequency-Current Curve.
Restoring Iterative Dynamics in the Activation
There is just one last problem with the equations as written above. They don't evolve over time in a graded fashion. In
contrast, the \(V_m\) value does evolve in a graded fashion by virtue of being iteratively computed, where it incrementally
approaches its equilibrium value over time. We can restore this graded quality by updating the activation iteratively as well:

\[ y(t) = y(t-1) + dt_{vm} \left( y^*(x) - y(t-1) \right) \]

This causes the actual final rate code activation output at the current time t, y(t), to iteratively approach the driving value
given by \(y^*(x)\), with the same time constant \(dt_{vm}\) that is used in updating the membrane potential. In practice this works
extremely well, better than any prior activation function used with Leabra.
Summary of Neuron Equations and Normalized Parameters
| Parameter | Bio Val | Norm Val | Parameter | Bio Val | Norm Val |
|---|---|---|---|---|---|
| Time | 0.001 sec | 1 ms | Voltage | 0.1 V or 100 mV | -100..100 mV = 0..2 dV |
| Current | 1×10⁻⁸ A | 10 nA | Conductance | 1×10⁻⁹ S | 1 nS |
| Capacitance | 1×10⁻¹² F | 1 pF | C (memb capacitance) | 281 pF | 1/C = .355 = dt.vm |
| gl (leak) | 10 nS | 0.1 | gi (inhibition) | 100 nS | 1 |
| ge (excitation) | 100 nS | 1 | e_rev_l (leak) and Vm_r | -70 mV | 0.3 |
| e_rev_i (inhibition) | -75 mV | 0.25 | e_rev_e (excitation) | 0 mV | 1 |
| θ (act.thr, V_T in AdEx) | -50 mV | 0.5 | spike.spk_thr (exp cutoff in AdEx) | 20 mV | 1.2 |
| spike.exp_slope (Δ_T in AdEx) | 2 mV | 0.02 | adapt.dt_time (τ_W in AdEx) | 144 ms | dt = 0.007 |
| adapt.vm_gain (a in AdEx) | 4 nS | 0.04 | adapt.spk_gain (b in AdEx) | 0.0805 nA | 0.00805 |
Table 2.1: The parameters used in our simulations are normalized using the above conversion factors so that the typical
values that arise in a simulation fall within the 0-1 normalized range. For example, the membrane potential is represented
in the range between 0 and 2, where 0 corresponds to -100 mV, 2 corresponds to +100 mV, and 1 is thus 0 mV (and
most membrane potential values stay within 0-1 on this scale). The biological values given are the default values for the
AdEx model. Other biological values can be input using the BioParams button on the LeabraUnitSpec, which
automatically converts them to normalized values.
Table 2.1 shows the normalized values of the parameters used in our simulations. We use these normalized values instead
of the normal biological parameters so that everything fits naturally within a 0..1 range, thereby simplifying many practical
aspects of working with the simulations.
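The conversion factors in Table 2.1 are simple linear mappings, and are easy to apply in code. A minimal sketch (the helper names here are our own, not part of the simulator, which does this via the BioParams button):

```python
def norm_voltage(mv):
    """Biological mV -> normalized units: -100..100 mV maps onto 0..2."""
    return (mv + 100.0) / 100.0

def norm_conductance(ns):
    """Biological nS -> normalized units: 100 nS maps onto 1."""
    return ns / 100.0

# Spot-checks against Table 2.1:
#   e_rev_l = -70 mV -> 0.3,  theta = -50 mV -> 0.5,  spk_thr = 20 mV -> 1.2
#   gl = 10 nS -> 0.1,        gi = ge = 100 nS -> 1
```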
The final equations used to update the neuron, in computational order, are shown here, with all variables that change over
time indicated as a function of (t):
1. Compute the excitatory input conductance (inhibition would be similar, but we'll discuss this more in the next chapter,
so we'll omit it here):
g_e(t) = \frac{1}{n} \sum_i x_i(t) w_i
3a. For discrete spiking, compare membrane potential to threshold and trigger a spike and reset Vm if above threshold:
if (Vm(t) > θ) then: y(t) = 1; Vm(t) = Vm_r; else y(t) = 0
3b. For rate code approximation, compute output activation as NXX1 function of g_e and Vm:
y^*(x) = f_{NXX1}(g_e^*(t)) \approx \frac{1}{1 + \frac{1}{\gamma [g_e - g_e^{\Theta}]_+}} \quad \text{(convolution with noise not shown)}

y(t) = y(t-1) + dt_{vm} (y^*(x) - y(t-1)) \quad \text{(restoring iterative dynamics based on time constant of membrane potential changes)}
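To make the computational order concrete, here is a minimal Python sketch of one update. The θ = 0.5, Vm_r = 0.3, and dt_vm = 0.355 values follow Table 2.1, but the gain γ and threshold conductance g_e^Θ are illustrative assumptions, and the membrane potential update step itself is not reproduced here (v_m is simply passed in) -- this is a sketch of the logic, not the actual Leabra implementation:

```python
import numpy as np

def update_neuron(x, w, v_m, y_prev, theta=0.5, v_m_r=0.3, dt_vm=0.355,
                  gamma=100.0, g_e_theta=0.4, spiking=True):
    """One update in the computational order above; v_m is taken as given."""
    # 1. Excitatory input conductance: average of sending activations times weights
    g_e = float(np.mean(x * w))
    if spiking:
        # 3a. Discrete spiking: spike and reset V_m if above threshold
        if v_m > theta:
            return g_e, v_m_r, 1.0
        return g_e, v_m, 0.0
    # 3b. Rate code: XX1-style function of excitatory drive (noise conv. omitted)
    drive = gamma * max(g_e - g_e_theta, 0.0)
    y_star = drive / (drive + 1.0)        # algebraically equal to 1/(1 + 1/drive)
    # Restoring iterative dynamics with the membrane time constant dt_vm
    y = y_prev + dt_vm * (y_star - y_prev)
    return g_e, v_m, y
```

In rate-code mode the output y approaches the driving value y* gradually, one dt_vm-sized step per call, mirroring the iterative dynamics equation above.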
3.1: INTRODUCTION
The cerebral cortex or neocortex is composed of roughly 85% excitatory neurons and 15% inhibitory interneurons. We focus
primarily on the excitatory pyramidal neurons, which perform the bulk of the information processing in the cortex. Unlike the local
inhibitory interneurons, they engage in long-range connections between different cortical areas, and it is clear that learning takes place
in the synapses between these excitatory neurons.
3.1: Introduction
In this chapter, we build upon the Neuron Chapter to understand how networks of detectors can produce emergent behavior
that is more than the sum of their simple neural constituents. We focus on the networks of the neocortex ("new cortex",
often just referred to as "cortex"), which is the evolutionarily most recent, outer portion of the brain where most of
advanced cognitive functions take place. There are three major categories of emergent network phenomena:
Categorization of diverse patterns of activity into relevant groups: For example, faces can look very different from one
another in terms of their raw "pixel" inputs, but we can categorize these diverse inputs in many different ways, to treat
some patterns as more similar than others: male vs. female, young vs. old, happy vs. sad, "my mother" vs. "someone
else", etc. Forming these categories is essential for enabling us to make the appropriate behavioral and cognitive
responses (approach vs. avoid, borrow money from, etc.). Imagine trying to relate all the raw inputs of a visual image
of a face to appropriate behavioral responses, without the benefit of such categories. The relationship ("mapping")
between pixels and responses is just too complex. These intermediate, abstract categories organize and simplify
cognition, just like file folders organize and simplify documents on your computer. One can argue that much of
intelligence amounts to developing and using these abstract categories in the right ways. Biologically, we'll see how
successive layers of neural detectors, organized into a hierarchy, enable this kind of increasingly abstract
categorization of the world. We will also see that many individual neural detectors at each stage of processing can work
together to capture the subtlety and complexity necessary to encode complex conceptual categories, in the form of a
distributed representation. These distributed representations are also critical for enabling multiple different ways of
categorizing an input to be active at the same time -- e.g., a given face can be simultaneously recognized as female, old,
and happy. A great deal of the emergent intelligence of the human brain arises from multiple successive levels of
cascading distributed representations, constituting the collective actions of billions of excitatory pyramidal neurons
working together in the cortex.
Bidirectional excitatory dynamics are produced by the pervasive bidirectional (e.g., bottom-up and top-down or
feedforward and feedback) connectivity in the neocortex. The ability of information to flow in all directions
throughout the brain is critical for understanding phenomena like our ability to focus on the task at hand and not get
distracted by irrelevant incoming stimuli (did my email inbox just beep??), and our ability to resolve ambiguity in
inputs by bringing higher-level knowledge to bear on lower-level processing stages. For example, if you are trying to
search for a companion in a big crowd of people (e.g., at a sporting event or shopping mall), you can maintain an image
of what you are looking for (e.g., a red jacket), which helps to boost the relevant processing in lower-level stages. The
overall effects of bidirectional connectivity can be summarized in terms of an attractor dynamic or multiple
constraint satisfaction, where the network can start off in a variety of different states of activity, and end up getting
"sucked into" a common attractor state, representing a cleaned-up, stable interpretation of a noisy or ambiguous input
pattern. Probably the best subjective experience of this attractor dynamic is when viewing an Autostereogram (NOTE:
links to Wikipedia for now) -- you just stare at this random-looking pattern with your eyes crossed, until slowly your
brain starts to fall into the 3D attractor, and the image slowly emerges. The underlying image contains many individual
matches of the random patterns between the two eyes at different lateral offsets -- these are the constraints in the
multiple constraint satisfaction problem that eventually work together to cause the 3D image to appear -- this 3D image
is the one that best satisfies all those constraints.
Inhibitory competition, mediated by specialized inhibitory interneurons, is important for providing dynamic
regulation of overall network activity, which is especially important when there are positive feedback loops between
neurons as in the case of bidirectional connectivity. The existence of epilepsy in the human neocortex indicates that
achieving the right balance between inhibition and excitation is difficult -- the brain obtains so many benefits from this
bidirectional excitation that it apparently lives right on the edge of controlling it with inhibition. Inhibition gives rise to
sparse distributed representations (having a relatively small percentage of neurons active at a time, e.g., 15% or so),
which have numerous advantages over distributed representations that have many neurons active at a time. In addition,
we'll see in the Learning Chapter that inhibition plays a key role in the learning process, analogous to the Darwinian
"survival of the fittest" dynamic, as a result of the competitive dynamic produced by inhibition.
We begin with a brief overview of the biology of neural networks in the neocortex.
Layered Structure
Figure 3.1: Neural constituents of the neocortex. (A) shows excitatory pyramidal neurons, which constitute roughly 85%
of neurons, and convey the bulk of the information content via longer-range axonal projections (some of which can go all
the way across the brain). (B) shows inhibitory interneurons, which have much more local patterns of connectivity, and
represent the remaining 15% of neurons. Reproduced from Crick & Asanuma (1986).
Figure 3.2: A slice of the visual cortex of a cat, showing the six major cortical layers (I - VI), with sublayers of layer IV
that are only present in visual cortex. The first layer (I) is primarily axons ("white matter"). Reproduced from Sejnowski
and Churchland (1989).
Figure 3.4: Function of the cortical layers: layer 4 processes input information (e.g., from sensory inputs) and drives
superficial layers 2/3, which provide a "hidden" internal re-processing of the inputs (extracting behaviorally-relevant
categories), which then drive deep layers 5/6 to output a motor response. Green triangles indicate excitation, and red
circles indicate inhibition via inhibitory interneurons. BG = basal ganglia, which is important for driving motor outputs, and
Subcortex includes a large number of other subcortical areas.
The neocortex has a characteristic 6-layer structure ( Figure 3.2), which is present throughout all areas of cortex ( Figure
3.3). However, the different cortical areas, which have different functions, have different thicknesses of each of the 6
layers, which provides an important clue to the function of these layers, as summarized in ( Figure 3.4). The anatomical
patterns of connectivity in the cortex are also an important source of information giving rise to the following functional
picture:
Input areas of the cortex (e.g., primary visual cortex) receive sensory input (typically via the thalamus), and these areas
have a greatly enlarged layer 4, which is where the axons from the thalamus primarily terminate. The input layer
contains a specialized type of excitatory neuron called the stellate cell, which has a dense bushy dendrite that is
relatively localized, and seems particularly good at collecting the local axonal input to this layer.
Hidden areas of the cortex are so-called because they don't directly receive sensory input, nor do they directly drive
motor output -- they are "hidden" somewhere in between. The bulk of the cortex is "hidden" by this definition.
Patterns of Connectivity
Figure 3.6: Connectivity matrix between cortical areas, showing that when a given area sends a feedforward projection to
another area, it typically also receives a feedback projection from that same area. Thus, cortical connectivity is
predominantly bidirectional. Reproduced from Sporns & Zwi (2004).
The other significant aspect of cortical connectivity that will become quite important for our models is that the
connectivity is largely bidirectional ( Figure 3.6). Thus, an area that sends a feedforward projection to another area also
typically receives a reciprocal feedback projection from that same area. This bidirectional connectivity is important for
enabling the network to converge into a coherent overall state of activity across layers, and is also important for driving
error-driven learning as we'll see in the Learning Chapter.
Next, let's see how feedforward excitatory connections among areas can support intelligent behavior by developing
categorical representations of inputs.
Figure 3.7: Schematic of a hierarchical sequence of categorical representations processing a face input stimulus.
Representations are distributed at each level (multiple neural detectors active). At the lowest level, there are elementary
feature detectors (oriented edges). Next, these are combined into junctions of lines, followed by more complex visual
features. Individual faces are recognized at the next level (even here multiple face units are active in graded proportion to
how similar people look). Finally, at the highest level are important functional "semantic" categories that serve as a good
basis for actions that one might take -- being able to develop such high level categories is critical for intelligent behavior.
As explained in the introduction to this chapter, the process of forming categorical representations of inputs coming into
a network enables the system to behave in a much more powerful and "intelligent" fashion ( Figure 3.7). Philosophically,
it is an interesting question as to where our mental categories come from -- is there something objectively real underlying
our mental categories, or are they merely illusions we impose upon reality? Does the notion of a "chair" really exist in the
real world, or is it just something that our brains construct for us to enable us to get by (and rest our weary legs)? This
issue has been contemplated since the dawn of philosophy, e.g., by Plato with his notion that we live in a cave perceiving
only shadows on the wall of the true reality beyond the cave. It seems plausible that there is something "objective" about
chairs that enables us to categorize them as such (i.e., they are not purely a collective hallucination), but providing a
rigorous, exact definition thereof seems to be a remarkably challenging endeavor (try it! don't forget the cardboard box, or
the lump of snow, or the miniature chair in a dollhouse, or the one in the museum that nobody ever sat on..). It doesn't
seem like most of our concepts are likely to be true "natural kinds" that have a very precise basis in nature. Things like
Newton's laws of physics, which would seem to have a strong objective basis, are probably dwarfed by everyday things
like chairs that are not nearly so well defined (and "naive" understanding of physics is often not actually correct in many
cases either).
The messy ontological status of conceptual categories doesn't bother us very much. As we saw in the previous chapter,
neurons are very capable detectors that can integrate many thousands of different input signals, and can thereby deal with
complex and amorphous categories. Furthermore, we will see that learning can shape these category representations to
pick up on things that are behaviorally relevant, without requiring any formality or rigor in defining what these things
might be. In short, our mental categories develop because they are useful to us in some way or another, and the outside
world produces enough reliable signals for our detectors to pick up on these things. Importantly, a major driver for learning
these categories is social and linguistic interaction, which enables very complex and obscure things to be learned and
shared -- the strangest things can be learned through social interactions (e.g., you now know that the considerable extra
space in a bag of chips is called the "snackmosphere", courtesy of Rich Hall). Thus, our cultural milieu plays a critical role
in shaping our mental representations, and is clearly a major force in what enables us to be as intelligent as we are (we do
occasionally pick up some useful ideas along with things like "snackmosphere"). If you want to dive deeper into the
philosophical issues of truth and relativism that arise from this lax perspective on mental categories, see Philosophy of
Categories.
One intuitive way of understanding the importance of having the right categories (and choosing them appropriately for the
given situation) comes from insight problems. These problems are often designed so that our normal default way of
categorizing the situation leads us in the wrong direction, and it is necessary to re-represent the problem in a new way
("thinking outside the box"), to solve it. For example, consider this "conundrum" problem: "Two men are dead in a cabin in
the woods. What happened?" -- you then proceed to ask a bunch of true/false questions and eventually realize that you need
to recategorize the "cabin": it is the cabin of an airplane that crashed in the woods.
Distributed Representations
In addition to our mental categories being somewhat amorphous, they are also highly polymorphous: any given input can
be categorized in many different ways at the same time -- there is no such thing as the appropriate level of categorization
for any given thing. A chair can also be furniture, art, trash, firewood, doorstopper, plastic and any number of other such
things. Both the amorphous and polymorphous nature of categories are nicely accommodated by the notion of a
distributed representation. Distributed representations are made up of many individual neurons-as-detectors, each of
which is detecting something different. The aggregate pattern of output activity ("detection alarms") across this population
of detectors can capture the amorphousness of a mental category, because it isn't just one single discrete factor that goes
into it. There are many factors, each of which plays a role. Chairs have seating surfaces, and sometimes have a backrest,
and typically have a chair-like shape, but their shapes can also be highly variable and strange. They are often made of
wood or plastic or metal, but can also be made of cardboard or even glass. All of these different factors can be captured by
the whole population of neurons firing away to encode these and many other features (e.g., including surrounding context,
history of actions and activities involving the object in question).
The same goes for the polymorphous nature of categories. One set of neurons may be detecting chair-like aspects of a
chair, while others are activating based on all the different things that it might represent (material, broader categories,
appearance, style etc). All of these different possible meanings of the chair input can be active simultaneously, which is
well captured by a distributed representation with neurons detecting all these different categories at the same time.
Figure 3.9: Distributed representations of different shapes mapped across regions of inferotemporal (IT) cortex in the
monkey. Each shape activates a large number of different neurons distributed across the IT cortex, and these neurons
overlap partially in some places. Reproduced from Tanaka (2003).
Figure 3.10: Schematic diagram of topographically organized shape representations in monkey IT cortex, from Tanaka
(2003) -- each small area of IT responds optimally to a different stimulus shape, and neighboring areas tend to have similar
but not identical representations.
Another demonstration of distributed representations comes from a landmark study by Haxby and colleagues (2001), using
functional magnetic resonance imaging (fMRI) of the human brain, while viewing different visual stimuli ( Figure 3.11).
They showed that contrary to prior claims that the visual system was organized in a strictly modular fashion, with
completely distinct areas for faces vs. other visual categories, for example, there is in fact a high level of overlap in
activation over a wide region of the visual system for these different visual inputs. They showed that you can distinguish
which object is being viewed by the person in the fMRI machine based on these distributed activity patterns, at a high level
of accuracy. Critically, this accuracy level does not go down appreciably when you exclude the area that exhibits the
maximal response for that object. Prior "modularist" studies had only reported the existence of these maximally responding
areas. But as we know from the monkey data, neurons will respond in a graded way even if the stimulus is not a perfect fit
to their maximally activating input, and Haxby et al. showed that these graded responses convey a lot of information about
the nature of the input stimulus.
Coarse Coding
Figure 3.12: Coarse coding, which is an instance of a distributed representation with neurons that respond in a graded
fashion. This example is based on the coding of color in the eye, which uses only 3 different photoreceptors tuned to
different frequencies of light (red, green, blue) to cover the entire visible spectrum. This is a very efficient representation
compared to having many more receptors tuned more narrowly and discretely to different frequencies along the spectrum.
Figure 3.12 illustrates an important specific case of a distributed representation known as coarse coding. This is not
actually different from what we've described above, but the particular example of how the eye uses only 3 photoreceptors
to capture the entire visible spectrum of light is a particularly good example of the power of distributed representations.
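To make coarse coding concrete, here is a small Python sketch in the spirit of Figure 3.12: a scalar stimulus value (e.g., a wavelength) is encoded as the graded responses of just three broadly tuned detectors, and can be roughly decoded back from the population response. The tuning centers and width here are made-up illustrative numbers, not measured cone sensitivities:

```python
import numpy as np

CENTERS = np.array([450.0, 530.0, 560.0])   # illustrative tuning peaks (nm)

def coarse_code(value, centers=CENTERS, width=60.0):
    """Graded (Gaussian) responses of a few broadly tuned detectors."""
    return np.exp(-(value - centers) ** 2 / (2.0 * width ** 2))

def decode(responses, centers=CENTERS):
    """Response-weighted average of the tuning centers (a rough readout)."""
    responses = np.asarray(responses, dtype=float)
    return float((responses * centers).sum() / responses.sum())
```

Each input value activates all three detectors to some graded degree, and the pattern across them (not any single detector) carries the information -- the essence of a distributed, coarse-coded representation.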
Localist Representations
The opposite of a distributed representation is a localist representation, where a single neuron is active to encode a given
category of information. Although we do not think that localist representations are characteristic of the actual brain, they
are nevertheless quite convenient to use for computational models, especially for input and output patterns to present to a
network. It is often quite difficult to construct a suitable distributed pattern of activity to realistically capture the
similarities between different inputs, so we often resort to a localist input pattern with a single input neuron active for each
different type of input, and just let the network develop its own distributed representations from there.
Figure 3.13: The famous case of a Halle Berry neuron recorded from a person with epilepsy who had electrodes implanted
in their brain. The neuron appears sensitive to many different presentations of Halle Berry (including just seeing her name
in text), but not to otherwise potentially similar people. Although this would seem to suggest the presence of localist
"grandmother cells", in fact there are many other distributed neurons activated by any given input such as this within the
same area, and even this neuron does exhibit some level of firing to similar distractor cases. Reproduced from Quian
Quiroga et al. (2005).
Figure 3.13 shows the famous case of a "Halle Berry" neuron, recorded from a person with epilepsy who had electrodes
implanted in their brain. This would appear to be evidence for an extreme form of localist representation, known as a
grandmother cell (a term apparently coined by Jerry Lettvin in 1969), denoting a neuron so specific yet abstract that it
only responds to one's grandmother, based on any kind of input, but not to any other people or things. People had long
scoffed at the notion of such grandmother cells. Even though the evidence for them is fascinating (including also other
neurons for Bill Clinton and Jennifer Aniston), it does little to change our basic understanding of how the vast majority of
neurons in the cortex respond. Clearly, when an image of Halle Berry is viewed, a huge number of neurons at all levels of
the cortex will respond, so the overall representation is still highly distributed. But it does appear that, amongst all the
different ways of categorizing such inputs, there are a few highly selective "grandmother" neurons! One other outstanding
question is the extent to which these neurons actually do show graded responses to other inputs -- there is some indication
of this in the figure, and more data would be required to really test this more extensively.
Figure 3.14: Illustration of attractor dynamics, in terms of a "gravity well". In the familiar gravity wells that suck in coins
at science museums, the attractor state is the bottom hole in the well, where the coin inevitably ends up. This same
dynamic can operate in more abstract cases inside bidirectionally connected networks. For example, the x and y axes in
this diagram could represent the activities of two different neurons, and the attractor state indicates that the network
connectivity prefers to have neuron x highly active, while neuron y is weakly active. The attractor basin indicates that
regardless of what configuration of activations these two neurons start in, they'll end up in this same overall attractor state.
The overall process of converging on a good internal representation given a noisy, weak or otherwise ambiguous input can
be summarized in terms of attractor dynamics ( Figure 3.14). An attractor is a concept from dynamical systems theory,
representing a stable configuration that a dynamical system will tend to gravitate toward. A familiar example of attractor
dynamics is the coin gravity well, often found in science museums. You roll your coin down a slot at the top of the device,
and it rolls out around the rim of an upside-down bell-shaped "gravity well". It keeps orbiting around the central hole of
this well, but every revolution brings it closer to the "attractor" state in the middle. No matter where you start your coin, it
will always get sucked into the same final state. This is the key idea behind an attractor: many different inputs all get
sucked into the same final state. If the attractor dynamic is successful, then this final state should be the correct
categorization of the input pattern.
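The attractor idea can be illustrated with a toy network: two neurons with symmetric (bidirectional) weights and a fixed external input, updated with graded iterative dynamics. This is a deliberately simplified sigmoid-unit sketch, not the Leabra equations; the point is only that very different starting states settle into the same final state, as in the gravity-well picture:

```python
import numpy as np

def settle(y0, W, b, steps=100, dt=0.2):
    """Settle a tiny network with symmetric (bidirectional) weights W and
    constant external input b, using graded iterative updates."""
    y = np.array(y0, dtype=float)
    for _ in range(steps):
        net = W @ y + b                              # net input to each neuron
        target = 1.0 / (1.0 + np.exp(-4.0 * net))    # graded activation function
        y += dt * (target - y)                       # move a step toward target
    return y

# Weights and input that prefer neuron 0 strongly active and neuron 1 weakly
# active -- a single attractor basin, as in the diagram above.
W = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
b = np.array([1.0, -1.0])
```

Running `settle` from starting states like [0, 1], [1, 0], or [0.5, 0.5] produces essentially the same final activation pattern: the network gets "sucked into" the one attractor state that its connectivity prefers.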
Similarly, the average activation is just the average of the activation values (y_i):

\langle y \rangle = \frac{1}{n} \sum_i y_i
We compute the overall inhibitory conductance applied uniformly to all the units in the layer / group with just a few key
parameters applied to each of these two averages. Because the feedback component tends to drive oscillations (alternately
over and under reacting to the average activation), we apply a simple time integration dynamic on that term. The
feedforward does not require this time integration, but it does require an offset term, which was determined by fitting the
actual inhibition generated by our earlier kWTA equations. Thus, the overall inhibitory conductance is just the sum of the
two terms (ff and fb), with an overall inhibitory gain factor gi:
g_i(t) = gi \, [ff(t) + fb(t)]
This gi factor is typically the only parameter manipulated to determine how active overall a layer is. A value of 1.5 is
about as low as is used, giving a more widely distributed activation pattern, while values around 2.0 (often 2.1 or 2.2 work
best) are most typical. For very sparse layers (e.g., a single output unit active), values up to around 3.5 or so can be used.
The feedforward (ff) term is:

ff(t) = ff \, [\langle \eta \rangle - ff0]_+

where ff is the overall gain factor for the feedforward component (set to 1.0 by default), and ff0 is an offset (set to 0.1 by
default) that is subtracted from the average netinput value ⟨η⟩.
The feedback (fb) term is:

fb(t) = fb(t-1) + dt \, [fb \, \langle y \rangle - fb(t-1)]
where fb is the overall gain factor for the feedback component (0.5 default), dt is the time constant for integrating the
feedback inhibition (0.7 default), and the t-1 indicates the previous value of the feedback inhibition -- this equation
specifies a graded folding-in of the new inhibition factor on top of what was there before, and the relatively fast dt value of
0.7 makes it track the new value fairly quickly -- there is just enough lag to iron out the oscillations.
Overall, it should be clear that this FFFB inhibition is extremely simple to compute (much simpler than the previous
kWTA computation), and it behaves in a much more proportional manner relative to the excitatory drive on the units -- if
there is higher overall excitatory input, then the average activation overall in the layer will be higher, and vice-versa. The
previous kWTA-based computation tended to be more rigid and imposed a stronger set-point like behavior. The FFFB
dynamics, being much more closely tied to the way inhibitory interneurons actually function, should provide a more
biologically accurate simulation.
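Indeed, the whole FFFB computation fits in a few lines. The following Python sketch directly transcribes the three equations above, using the stated defaults (ff gain 1.0, ff0 = 0.1, fb gain 0.5, dt = 0.7) and an overall gi of 2.0:

```python
import numpy as np

def fffb_inhibition(eta, y, fb_prev, gi=2.0, ff_gain=1.0, ff0=0.1,
                    fb_gain=0.5, dt=0.7):
    """One step of FFFB inhibition for a layer.
    eta: unit net inputs; y: unit activations; fb_prev: fb(t-1) state."""
    eta_avg = np.mean(eta)                            # <eta>, average netinput
    y_avg = np.mean(y)                                # <y>, average activation
    ff = ff_gain * max(eta_avg - ff0, 0.0)            # feedforward, offset at ff0
    fb = fb_prev + dt * (fb_gain * y_avg - fb_prev)   # time-integrated feedback
    return gi * (ff + fb), fb                         # g_i(t), plus new fb state
```

Because g_i(t) scales directly with the average net input and activation, inhibition rises and falls in proportion to the excitatory drive on the layer -- the proportional behavior described above, in contrast to the set-point behavior of kWTA.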
Explorations
Here are all the explorations covered in the main portion of the Networks chapter:
Face Categorization (face_categ.proj) -- face categorization, including bottom-up and top-down processing (used for
multiple explorations in Networks chapter) (Questions 3.1 - 3.3).
Cats and Dogs (cats_and_dogs.proj) -- Constraint satisfaction in the Cats and Dogs model. (Question 3.4).
Necker Cube (necker_cube.proj) -- Constraint satisfaction and the role of noise and accommodation in the Necker Cube
model. (Question 3.5).
Inhibition (inhib.proj) -- Inhibitory interactions. (Questions 3.6 - 3.8).
4.1: INTRODUCTION
4.2: BIOLOGY OF SYNAPTIC PLASTICITY
4.3: THE EXTENDED CONTRASTIVE ATTRACTOR LEARNING (XCAL) MODEL
4.4: WHEN, EXACTLY, IS THERE AN OUTCOME THAT SHOULD DRIVE LEARNING
4.5: THE LEABRA FRAMEWORK
4.6: SUBTOPICS AND EXPLORATIONS
4.7: REFERENCES
4.1: Introduction
How do we learn to read, do math, and play sports? Learning in a neural network amounts to the modification of synaptic
weights, in response to the local activity patterns of the sending and receiving neurons. As emphasized in previous
chapters, these synaptic weights are what determine what an individual neuron detects, and thus are the critical parameters
for determining neuron and network behavior.
In other words, everything you know is encoded in the patterns of your synaptic weights, and these have been shaped by
every experience you've had (as long as those experiences got your neurons sufficiently active). Many of those experiences
don't leave a very strong mark, and in much of the brain, traces of individual experiences are all blended together, so it is
difficult to remember them distinctly (we'll see in the Memory Chapter that this blending can be quite beneficial for overall
intelligence, actually). But each experience nevertheless drives some level of learning, and our big challenge in this
chapter is to figure out how the mere influences of patterns of activity among individual neurons can add up to enable us to
learn big things.
Biologically, synaptic plasticity (the modification of synaptic weights through learning) has been extensively studied, and
we now know a tremendous amount about the detailed chemical processes that take place as a result of neural activity.
We'll provide multiple levels of detail here (including a discussion of spike timing dependent plasticity (STDP), which
has captured the imaginations of many researchers in this area), but the high-level story is fairly straightforward: the
overall level of neural activity on both ends of the synapse (sending and receiving neural firing) drives the influx of
calcium ions (Ca++) via NMDA channels, and synaptic weight changes are driven by the level of postsynaptic Ca++ in
the dendritic spine associated with a given synapse. Low levels of Ca++ cause synapses to get weaker, and higher levels
cause them to get stronger.
Computationally, many different sets of equations have been developed that can drive synaptic weight changes to
accomplish many different computational goals. Which of these correspond to what the biology is actually doing? That is
the big question. While a definitive answer remains elusive, we nevertheless have a reasonable candidate that aligns well
with the biological data, and also performs computationally very useful forms of learning, which can solve the most
challenging of cognitive tasks (e.g., learning to read or recognize objects).
There are two primary types of learning:
Self-organizing learning, which extracts longer time-scale statistics about the environment, and can thus be useful for
developing an effective internal model of the outside world (i.e., what kinds of things tend to reliably happen in the
world -- we call these statistical regularities).
Error-driven learning, which uses more rapid contrasts between expectations and outcomes to correct these
expectations, and thus form more detailed, specific knowledge about contingencies in the world. For example, young
children seem endlessly fascinated learning about what happens when they push stuff off their high chair trays: will it
still fall to the ground and make a huge mess this time? Once they develop a sufficiently accurate expectation about
exactly what will happen, it starts to get a bit less interesting, and other more unpredictable things start to capture their
interest. As we can see in this example, error-driven learning is likely intimately tied up with curiosity, surprise, and
other such motivational factors. For this reason, we hypothesize that neuromodulators such as dopamine,
norepinephrine and acetylcholine likely play an important role in modulating this form of learning, as they have been
implicated in different versions of surprise, that is, when there is a discrepancy between expectations and outcomes.
Interestingly, the main computational difference between these two forms of learning has to do with the time scale over
which one of the critical variables is updated -- self-organizing learning involves averaging over a long time scale, whereas
error-driven learning is much quicker. This difference is emphasized in the above descriptions as well, and provides an
important source of intuition about the differences between these types of learning. Self-organizing learning is what
happens when you blur your eyes and just take stuff in over a period of time, whereas error-driven learning requires much
more alert and rapid forms of neural activity. In the framework that we will use in the rest of the book, we combine these
types of learning into a single set of learning equations, to explore how we come to perceive, remember, read, and plan.
Figure 4.1: Critical steps in allowing calcium ions (Ca++) to enter postsynaptic cell via NMDA channels, inducing
synaptic plasticity. 1. The postsynaptic membrane potential (Vm) must be elevated (from collective excitatory synaptic
inputs to existing AMPA receptors, and backpropagating action potential that comes back down the dendrite when the
postsynaptic neuron fires). 2. Elevated Vm causes magnesium (Mg+) ions to be expelled from NMDA channel openings,
thus unblocking them. 3. Presynaptic neuron fires an action potential, releasing glutamate. 4. Glutamate binds to NMDA
receptors, causing them to open, allowing Ca++ to enter (only when also unblocked, per step 2). 5. The concentration of
Ca++ in the postsynaptic spine drives second messenger systems (indicated by the X) that result in change in AMPA
receptor efficacy, thereby changing the synaptic weight. Ca++ can also enter from voltage-gated calcium channels
(VGCC's), which depend only on postsynaptic Vm levels, and not sending activity -- these are weaker contributors to
Ca++ levels.
Figure 4.1 shows the five critical steps in the cascade of events that drives change in AMPA receptor efficacy. The
NMDA receptors and the calcium ion (Ca++) play a central role -- NMDA channels allow Ca++ to enter the postsynaptic
spine. Across all cells in the body, Ca++ typically plays an important role in regulating cellular function, and in the neuron,
it is capable of setting off a series of chemical reactions that ends up controlling how many AMPA receptors are functional
in the synapse. For details on these reactions, see Detailed Biology of Learning. Here's what it takes for the Ca++ to get
into the postsynaptic cell:
1. The postsynaptic membrane potential (Vm) must be elevated, as a result of all the excitatory synaptic inputs coming
into the cell. The most important contributor to this Vm level is actually the backpropagating action potential --
when a neuron fires an action potential, it not only goes forward out the axon, but also backward down the dendrites
(via active voltage-sensitive Na+ channels along the dendrites). Thus, the entire neuron gets to know when it fires --
we'll see that this is incredibly useful computationally.
2. The elevated Vm causes magnesium ions (Mg+) to be repelled (positive charges repel each other) out of the openings
of NMDA channels, unblocking them.
Figure 4.2: Direction of synaptic plasticity (LTP = increase, LTD = decrease) as a function of Ca++ concentration in the
postsynaptic spine (accumulated over several hundred milliseconds). Low levels of Ca++ cause LTD, while higher levels drive
LTP. Threshold levels indicated by theta values represent levels where the function changes sign.
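The qualitative sign rule in Figure 4.2 can be sketched in a few lines of Python; the function name and the particular threshold values here are illustrative placeholders, not measured quantities:

```python
def plasticity_sign(ca, theta_d=0.3, theta_p=0.6):
    """Qualitative direction of synaptic change as a function of
    postsynaptic Ca++ level (cf. Figure 4.2): below theta_d there is
    no change, between theta_d and theta_p the synapse weakens (LTD),
    and above theta_p it strengthens (LTP). Thresholds illustrative."""
    if ca < theta_d:
        return 0       # too little Ca++: no change
    if ca < theta_p:
        return -1      # moderate Ca++: LTD (weakening)
    return +1          # high Ca++: LTP (strengthening)
```

The theta parameters play the role of the threshold levels marked in the figure, where the function changes sign.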
This form of learning fits well with the classic postulate of Donald Hebb (1949):
Let us assume that the persistence or repetition of a reverberatory activity (or "trace") tends to induce lasting cellular
changes that add to its stability.… When an axon of cell A is near enough to excite a cell B and repeatedly or
persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that
A's efficiency, as one of the cells firing B, is increased.
This can be more concisely summarized as cells that fire together, wire together. The NMDA channel is essential for this
process, because it requires both pre and postsynaptic activity to allow Ca++ to enter and drive learning. It can detect the
coincidence of neural firing. Interestingly, Hebb is reputed to have said something to the effect of "big deal, I knew it had
to be that way already" when someone told him that his learning principle had been discovered in the form of the NMDA
receptor.
Mathematically, we can summarize Hebbian learning as:

$$\Delta w = x y$$

where $\Delta w$ is the change in synaptic weight $w$, as a function of sending activity $x$ and receiving activity $y$.
Anytime you see this kind of pre-post product in a learning rule, it tends to be described as a form of Hebbian learning. For
a more detailed treatment of Hebbian learning and various popular variants of it, see Hebbian Learning.
As we'll elaborate below, this most basic form of Hebbian learning is very limited, because weights will only go up (given
that neural activities are rates of spiking and thus only positive quantities), and will do so without bound. Interestingly,
Hebb himself only seemed to have contemplated LTP, not LTD, so perhaps this is fitting. But it won't do anything useful in
a computational model. Before we get to the computational side of things, we cover one more important result in the
biology.
Figure 4.3: Spike timing dependent plasticity demonstrated in terms of the temporal offset of firing of pre and postsynaptic
neurons. If pre fires before post, the weights go up; otherwise they go down. This fits with a causal flow of
information from pre to post. However, more complex sequences of spikes wash out such precise timing and result in more
generic forms of Hebbian-like learning. Reproduced (for now) from the STDP article on Scholarpedia, which is based on the
original from Bi and Poo (1998).
Figure 4.3 shows the results from an experiment by Bi and Poo in 1998 that captured the imagination of many a scientist,
and has resulted in extensive computational modeling work. This experiment showed that the precise order of firing
between a pre and postsynaptic neuron determined the sign of synaptic plasticity, with LTP resulting when the presynaptic
neuron fired before the postsynaptic one, while LTD resulted otherwise. This spike timing dependent plasticity (STDP)
was so exciting because it fits with the causal role of the presynaptic neuron in driving the postsynaptic one. If a given pre
neuron actually played a role in driving the post neuron to fire, then it will necessarily have to have fired in advance of it,
and according to the STDP results, its weights will increase in strength. Meanwhile, pre neurons that have no causal role in
firing the postsynaptic cell will have their weights decreased. However, as we discuss in more detail in STDP, this STDP
pattern does not generalize well to realistic spike trains, where neurons are constantly firing and interacting with each other
over 100's of milliseconds. Nevertheless, the STDP data does provide a useful stringent test for computational models of
synaptic plasticity. We base our learning equations on a detailed model that uses more basic, biologically-grounded synaptic
plasticity mechanisms to capture these STDP findings (Urakubo, Honda, Froemke, & Kuroda, 2008), but which
nevertheless results in quite simple learning equations when considered at the level of firing rates.
$$\Delta w = f_{xcal}(xy, \theta_p)$$

where $f_{xcal}$ is the piecewise linear function shown in Figure 4.4. The weight change also depends on an additional
dynamic threshold parameter $\theta_p$, which determines the point at which it crosses over from negative to positive weight
changes -- i.e., the point at which weight changes reverse sign. For completeness, here is the mathematical expression of
this function, but you only need to understand its shape as shown in the figure:

$$f_{xcal}(xy, \theta_p) = \begin{cases} xy - \theta_p & \text{if } xy > \theta_p \theta_d \\ -xy \,(1 - \theta_d)/\theta_d & \text{otherwise} \end{cases}$$

where $\theta_d = 0.1$ is a constant that determines the point where the function reverses direction (i.e., back toward zero within
the weight decrease regime) -- this reversal point occurs at $\theta_p \theta_d$, so that it adapts according to the dynamic $\theta_p$ value.
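To make the shape of this function concrete, here is a minimal Python sketch of the XCAL dWt function (function and variable names are our own):

```python
def f_xcal(xy, theta_p, theta_d=0.1):
    """Piecewise-linear XCAL dWt function: weight change as a function
    of short-term synaptic activity (xy) and the dynamic threshold
    theta_p. The reversal point sits at theta_p * theta_d; below it,
    weight decreases head back toward zero as activity goes to zero."""
    if xy > theta_p * theta_d:
        return xy - theta_p                   # linear regime: LTD below theta_p, LTP above
    return -xy * (1.0 - theta_d) / theta_d    # below reversal: back toward zero
```

For example, with theta_p = 0.4, activity of 0.5 gives a weight increase of 0.1, activity of 0.2 gives a decrease of 0.2, and the two branches meet continuously at the reversal point xy = 0.04.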
As noted in the previous section, the dependence of the NMDA channel on activity of both sending and receiving neurons
can be summarized with this simple Hebbian product, and the level of intracellular Ca++ is likely to reflect this value.
Thus, the XCAL dWt function makes very good sense in these terms: it reflects the qualitative nature of weight changes as
a function of Ca++ that has been established from empirical studies and postulated by other theoretical models for a long
time. The Urakubo model simulates detailed effects of pre/postsynaptic spike timing on Ca++ levels and associated
LTP/LTD, but what emerges from these effects at the level of firing rates is this much simpler fundamental function.
As a learning function, this basic XCAL dWt function has some advantages over a plain Hebbian function, while sharing
its basic nature due to the "pre * post" term at its core. For example, because of the shape of the dWt function, weights will
go down as well as up, whereas the Hebbian function only causes weights to increase. But it still has the problem that
weights will increase without bound (as long as activity levels are often greater than the threshold). We'll see in the next
section that some other top-down computationally-motivated modifications can result in a more powerful form of learning
while maintaining this basic form.
The BCM learning rule can be written as:

$$\Delta w = x y \,(y - \theta)$$

where again $x$ = sending activity, $y$ = receiving activity, and $\theta$ is a floating threshold reflecting a long time average of
the receiving neuron's activity:

$$\theta = \langle y^2 \rangle$$

where $\langle \rangle$ indicates the expected value or average, in this case of the square of the receiving neuron's activation. Figure
4.5 shows what this function looks like overall -- a shape that should be becoming rather familiar. Indeed, the fact that the
BCM learning function anticipated the qualitative nature of synaptic plasticity as a function of Ca++ (Figure 4.2) is an
amazing instance of theoretical prescience. Furthermore, BCM researchers have shown that it does a good job of
accounting for various behavioral learning phenomena, providing a better fit than a comparable Hebbian learning
mechanism (Figure 4.6; Cooper, Intrator, Blais, & Shouval, 2004).
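A minimal sketch of the BCM rule and its floating threshold, using a simple exponential running average to approximate the expected value of y² (the time constant here is our choice, for illustration):

```python
def bcm_dw(x, y, theta):
    """BCM weight change: the Hebbian product x*y, gated by (y - theta),
    so activity below the floating threshold weakens the weight and
    activity above it strengthens the weight."""
    return x * y * (y - theta)

def update_theta(theta, y, tau=10.0):
    """Floating threshold as a running average of y**2 (an exponential
    approximation of <y^2>; tau is an illustrative time constant)."""
    return theta + (y * y - theta) / tau
```

A chronically underactive neuron drives theta down, making LTP easier to obtain -- the same homeostatic logic as the dark-reared-rat data in Figure 4.6.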
Figure 4.6: Synaptic plasticity data from dark reared (filled circles) and normally-reared (open circles) rats, showing that
dark reared rats appear to have a lower threshold for LTP, consistent with the BCM floating threshold. Neurons in these
animals are presumably much less active overall, and thus their threshold moves down, making them more likely to exhibit
LTP relative to LTD. Reproduced from Kirkwood, Rioult, & Bear (1996).
BCM has typically been applied in simple feedforward networks in which, given an input pattern, there is only one
activation value for each neuron. But how should weights be updated in a more realistic bidirectionally connected system
with attractor dynamics in which activity states continuously evolve through time? We confront this issue in the XCAL
version of the BCM equations:
$$\Delta w = f_{xcal}(xy, \langle y \rangle_l) = f_{xcal}(xy, y_l)$$

where $xy$ is understood to be the short-term average synaptic activity (on a time scale of a few hundred milliseconds --
the time scale of Ca++ accumulation that drives synaptic plasticity), which could be more formally expressed as $\langle xy \rangle_s$,
and $y_l$ is the long-term average of the receiving neuron's activity (without the squaring), which plays the role of the
$\theta_p$ floating threshold value in the XCAL function.
After considerable experimentation, we have found the following way of computing the yl floating threshold to provide
the best ability to control the threshold and achieve the best overall learning dynamics:
$$y_l \leftarrow \begin{cases} y_l + \frac{1}{\tau_l}(\max - y_l) & \text{if } y > 0.2 \\ y_l + \frac{1}{\tau_l}(\min - y_l) & \text{otherwise} \end{cases}$$

This produces a well-controlled exponential approach to either the max or min extremes, depending on whether the
receiving unit activity exceeds the basic activity threshold of 0.2. The time constant for integration $\tau_l$ is 10 by default --
integrating over around 10 trials. See the XCAL_Details sub-topic for more discussion.
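In code, this exponential approach to the max or min extreme looks like the following (max, min, and the 0.2 activity threshold as given in the text; the function name is ours):

```python
def update_yl(yl, y, tau_l=10.0, lo=0.0, hi=1.0, thr=0.2):
    """Update the long-term average activation y_l: move it a fraction
    1/tau_l of the way toward hi when the unit is active (y > thr),
    and toward lo otherwise. tau_l = 10 integrates over ~10 trials."""
    target = hi if y > thr else lo
    return yl + (target - yl) / tau_l
```

For example, from yl = 0.5, an active trial (y = 0.9) moves the average up to 0.55, while an inactive trial (y = 0.1) moves it down to 0.45.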
Figure 4.7: How the floating threshold as a function of long-term average receiver neural activity $\langle y \rangle_l$ drives homeostatic
behavior. Neurons that have low average activity are much more likely to increase their weights because the threshold is
low, while those that have high average activity are much more likely to decrease their weights because the threshold is
high.
Figure 4.7 shows the main qualitative behavior of this learning mechanism: when the long term average activity of the
receiver is low, the threshold moves down, and thus it is more likely that the short term synaptic activity value will fall into
the positive weight change territory. This will tend to increase synaptic weights overall, and thus make the neuron more
likely to get active in the future, achieving the homeostatic objective. Conversely, when the long term average activity of
the receiver is high, the threshold is also high, and thus the short term synaptic activity is more likely to drive weight
decreases than increases. This will take these over-active neurons down a notch or two, so they don't end up dominating
the activity of the network.
The size of weight changes is scaled by a learning rate parameter $\epsilon$ (epsilon). Thus, a bigger epsilon means larger weight
changes, and thus quicker learning, and vice versa for a smaller value. A typical starting value for the learning rate is 0.04,
and we often have it decrease over time (which is true of the brain as well -- younger brains are much more plastic than
older ones) -- this typically results in the fastest overall learning and the best final performance.
Many researchers (and drug companies) have the potentially dangerous belief that a faster learning rate is better, and
various drugs have been developed that effectively increase the learning rate, causing rats to learn some kind of standard
task faster than normal, for example. However, we will see in the Learning and Memory Chapter that actually a slow
learning rate has some very important advantages. Specifically, a slower learning rate enables the system to incorporate
more statistics into learning -- the learning rate determines the effective time window over which experiences are averaged
together, and a slower learning rate gives a longer time window, which enables more information to be integrated. Thus,
learning can be much smarter with a slower learning rate. But the tradeoff of course is that the results of this smarter
learning take that much longer to impact actual behavior. Many have argued that humans are distinctive in our extremely
protracted period of developmental learning, so we can learn a lot before we need to start earning a paycheck. This allows
us to have a pretty slow learning rate, without too many negative consequences.
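The claim that the learning rate sets an effective averaging window can be seen in a simple exponential-average sketch (purely illustrative; not part of any particular simulator):

```python
def running_estimate(samples, lrate):
    """Apply delta-rule style updates: each sample moves the estimate
    toward it by a fraction lrate. A smaller lrate averages over a
    longer window (roughly 1/lrate recent samples), yielding smoother,
    statistically 'smarter' estimates, but slower to take effect."""
    est = 0.0
    for s in samples:
        est += lrate * (s - est)
    return est
```

With samples alternating between 0 and 1 (true mean 0.5), a slow rate of 0.04 settles near 0.5, while a fast rate of 0.9 swings wildly with each individual sample.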
In error-driven learning, the floating threshold is instead computed over a medium time frame. This medium time frame reflects the development of a pattern of neural activity that encodes an
expectation about what will happen next. The most recent short-term synaptic activity (which drives learning) represents
the actual outcome of what did happen next. Because of the (nearly) linear nature of the dWt function, it effectively
computes the difference between outcome and expectation. Qualitatively, if the outcome produces greater activation of a
population of neurons than did the expectation, the corresponding weights go up, while neurons that decreased their activity
states as a result of the outcome will have their weights go down. This is illustrated above in the case of low vs. high
expectations.
Figure 4.8 shows how the same floating threshold behavior from the BCM-like self-organizing aspect of XCAL learning
can be adapted to perform error-driven learning, in the form of differences between an outcome vs. an expectation.
Specifically, we speed up the time scale for computing the floating threshold (and also have it reflect synaptic activity, not
just receiver activity):
$$\theta_p = \langle xy \rangle_m$$

where $\langle xy \rangle_m$ is this new medium-time-scale average synaptic activity, which we think of as reflecting an emerging
expectation about the current situation, which develops over approximately 75 msec of neural activity. The most recent,
short-term (last 25 msec) neural activity ($\langle xy \rangle_s$) reflects the actual outcome, and it is the same calcium-based signal that
drives learning.
Biologically, the deep neocortical layers (layers 5, 6) and the thalamus have a natural oscillatory rhythm at the alpha
frequency (Buffalo, Fries, Landman, Buschman, & Desimone, 2011; Lorincz, Kekesi, Juhasz, Crunelli, & Hughes,
2009; Franceschetti et al., 1995; Luczak, Bartho, & Harris, 2013). Specific dynamics in these layers organize the
cycle of expectation vs. outcome within the alpha cycle.
Biologically, the superficial neocortical layers (layers 2, 3) have a gamma frequency oscillation (Buffalo, Fries,
Landman, Buschman, & Desimone, 2011), supporting the quarter-level organization.
A Cycle represents 1 msec of processing, where each neuron updates its membrane potential according to the equations
covered in the Neuron chapter.
The XCAL learning mechanism coordinates with this timing by comparing the most recent synaptic activity
(predominantly driven by plus phase / outcome states) to that integrated over the medium-time scale, which effectively
includes both minus and plus phases. Because the XCAL learning function is (mostly) linear, the association of the floating
threshold with this synaptic activity over the medium time frame (including expectation states), to which the short-term
outcome is compared, directly computes their difference:
$$\Delta w \approx x_s y_s - x_m y_m$$
Intuitively, we can understand how this error-driven learning rule works by thinking about different specific cases. The
easiest case is when the expectation is equivalent to the outcome (i.e., a correct expectation) -- the two terms above will be
the same, and thus their subtraction is zero, and the weights remain the same. So once you obtain perfection, you stop
learning. What if your expectation was higher than your outcome? The difference will be a negative number, and the
weights will thus decrease, so that you will lower your expectations next time around. Intuitively, this makes perfect sense
-- if you have an expectation that all movies by M. Night Shyamalan are going to be as cool as The Sixth Sense, you might
end up having to reduce your weights to better align with actual outcomes. Conversely, if the expectation is lower than the
outcome, the weight change will be positive, and thus increase the expectation. You might have thought this class was
going to be deadly boring, but maybe you were amused by the above mention of M. Night Shyamalan, and now you'll have
to increase your weights just a bit. It should hopefully be intuitively clear that this form of learning will work to minimize
the differences between expectations and outcomes over time. Note that while the example given here was cast in terms of
deviations from expectations having value (i.e., things turned out better or worse than expected, as we cover in more detail in
the Motor Control and Reinforcement Learning Chapter), the same principle applies when outcomes deviate from other sorts
of expectations.
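These three cases can be checked directly against the approximate error-driven rule above (a toy illustration with made-up activity values):

```python
def err_dw(xs, ys, xm, ym):
    """Approximate error-driven weight change: the short-term (outcome)
    co-product minus the medium-term (expectation) co-product."""
    return xs * ys - xm * ym
```

A correct expectation (outcome equals expectation) yields zero change; an expectation higher than the outcome yields a negative change (lower your expectations); an expectation lower than the outcome yields a positive change.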
Because of its explicitly temporal nature, there are a few other interesting ways of thinking about what this learning rule
does, in addition to the explicit timing defined above. To reiterate, the rule says that the outcome comes immediately after
a preceding expectation -- this is a direct consequence of making it learn toward the short-term (most immediate) average
synaptic activity, compared to a slightly longer medium-term average that includes the time just before the immediate
present.
Figure 4.9: Illustration of the Contrastive Attractor Learning (CAL) principle, which is the core idea behind the XCAL error-driven
learning mechanism. The network learns on the contrast between the early phase of settling (the minus phase, or medium
time frame activation average $\langle xy \rangle_m$) versus the late phase of settling (the plus phase, or short time frame activation
average $\langle xy \rangle_s$). The late phase has integrated more of the overall constraints in the network and thus represents a "better"
overall interpretation or representation of the current situation than the early phase, so it makes sense for the late phase to
serve as the "training signal" relative to the earlier phase.
We can think of this learning in terms of the attractor dynamics discussed in the Networks Chapter. Specifically, the name
Contrastive Attractor Learning (CAL) reflects the idea that the network is settling into an attractor state, and it is the
contrast between the final attractor state that the network settles into (i.e., the "outcome" in this case) and the network's
earlier state during settling (the "expectation") that drives learning.
Figure 4.10: Intuition for how bidirectional connections enable the backpropagation of learning signals from other parts of
the network -- when there is a difference between an expectation and an outcome in any part of the network, neurons in
other parts of the network "feel" that difference via the bidirectional connection. All neurons experience an impact on their
own activation of both the expectation and the outcome, and thus when they learn on the difference between these two
points in time (later training earlier), they are learning about their own impact on the outcome - expectation error, and
weight changes based on this difference will end up minimizing the overall error in the network as a whole. Neurons closer
to the source of the error learn the most, with error decreasing with distance from this source.
Biologically, the bidirectional connectivity in our models enables these error signals to propagate in this manner ( Figure
4.10). Thus, changes in any given location in the network radiate backward (and every which way the connections go) to
affect activation states in all other layers, via bidirectional connectivity, and this then influences the learning in these other
layers. In other words, XCAL uses bidirectional activation dynamics to communicate error signals throughout the network,
whereas backpropagation uses a biologically implausible procedure that propagates error signals backward across synaptic
connections, in the opposite direction of the way that activation typically flows. Furthermore, the XCAL network
experiences a sequence of activation states, going from an expectation to experiencing a subsequent outcome, and learns
on the difference between these two states. In contrast, backpropagation computes a single error delta value that is
effectively the difference between the outcome and the expectation, and then sends this single value backwards across the
connections. See the Backpropagation subsection for how these two different things can be mathematically equivalent.
Also, it is a good idea to look at the discussion of the credit assignment process in this subsection, to obtain a fuller
understanding of how error-driven learning works.
However, computationally, it is clearer and simpler to just combine separate XCAL functions, each with their own
weighting function -- due to the linearity of the function, this is mathematically equivalent:
$$\Delta w = \lambda_l \, f_{xcal}(x_s y_s, y_l) + \lambda_m \, f_{xcal}(x_s y_s, x_m y_m)$$
It is reasonable that these lambda parameters may differ according to brain area (i.e., some brain systems learn more about
statistical regularities, whereas others are more focused on minimizing error), and even that it may be dynamically
regulated (i.e. transient changes in neuromodulators like dopamine and acetylcholine can influence the degree to which
error signals are emphasized).
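Putting the pieces together, the combined rule can be sketched as follows (self-contained, with f_xcal restated; the default lambda values are illustrative, not calibrated):

```python
def f_xcal(xy, theta_p, theta_d=0.1):
    """Piecewise-linear XCAL dWt function (as defined earlier)."""
    if xy > theta_p * theta_d:
        return xy - theta_p
    return -xy * (1.0 - theta_d) / theta_d

def combined_dw(xs, ys, xm, ym, yl, lam_l=0.1, lam_m=1.0):
    """Combined XCAL learning: a self-organizing (BCM-like) term
    thresholded by the long-term receiver average yl, plus an
    error-driven term thresholded by the medium-term expectation
    xm*ym. lam_l and lam_m balance the two contributions."""
    return (lam_l * f_xcal(xs * ys, yl)
            + lam_m * f_xcal(xs * ys, xm * ym))
```

Because both terms share the same pre * post core, the linearity of f_xcal lets the two contributions simply add.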
There are small but reliable computational advantages to automating this balancing of self-organizing vs. error-driven
learning (i.e., a dynamically-computed $\lambda_l$ value, while keeping $\lambda_m = 1$), based on two factors: the magnitude of the $y_l$
receiving-unit running average activation, and the average magnitude of the error signals present in a layer (see Leabra
Details).
As you can see in Figure 4.11, this function creates greater contrast for weight values around the .5 central value -- they
get pushed up or down to the extremes. This contrast-enhanced weight value is then used for communication among the
neurons, and is what shows up as the wt value in the simulator.
Figure 4.11: Weight contrast enhancement function, gain (gamma) = 6, offset (theta) = 1.25.
Biologically, we think of the plain weight value w, which is involved in the learning functions, as an internal variable that
accurately tracks the statistics of the learning functions, while the contrast-enhanced weight value is the actual synaptic
efficacy value that you measure and observe as the strength of interaction among neurons. Thus, the plain w value may
correspond to the phosphorylation state of CAMKII or some other appropriate internal value that mediates synaptic
plasticity.
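The text does not give the equation for the contrast-enhancement function; one common sigmoidal form consistent with the gain and offset parameters in Figure 4.11 is the following, which we present as an assumption rather than the definitive implementation:

```python
def wt_sig(w, gain=6.0, off=1.25, eps=1e-6):
    """Sigmoidal contrast enhancement of a linear weight w in (0, 1):
    midrange values get pushed toward the extremes. gain and off
    match Figure 4.11; the exact functional form is our assumption."""
    w = min(max(w, eps), 1.0 - eps)  # keep away from 0/1 to avoid division by zero
    return 1.0 / (1.0 + (off * (1.0 - w) / w) ** gain)
```

Low weights get squashed further down and high weights pushed up, while the function remains monotonic, so the rank order of synaptic strengths is preserved.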
Finally, see Implementational Details for a few implementational details about the way that the time averages are
computed, which don't affect anything conceptually, but which are there if you really want to know exactly what is going on.
Figure 4.12: Different situations that give rise to a contrast between expectations and outcomes. a) The simplest case of
explicit teacher / parent input -- a visual input (e.g., an object) at time t drives a verbal output (e.g., the name of the object),
and the teacher then corrects (or confirms) the output. b) The same scenario can go through without actually producing a
verbal output -- instead just an expectation of what someone else might say, and this can be compared with what is actually
said to derive useful error signals. c) A specific instance of when many expectations are likely to be generated: a
motor action (e.g., pushing food off of a high chair) drives an expectation about the visual outcomes associated with the
action, which then occur (to the seemingly endless delight of the mischievous infant). d) Making an "expectation"
about what you actually just saw -- reconstructing or generating the input (otherwise known as a generative model or an
auto-encoder) -- the input itself serves as its own training signal in this case.
This is the biggest remaining question for error-driven learning. You may not have even noticed this issue, but once you
start to think about implementing the XCAL equations on a computer, it quickly becomes a major problem. We have talked
about how the error-driven learning reflects the difference between an outcome and an expectation, but it really matters
that the short-term average activation representing the outcome state reflects some kind of actual outcome that is worth
learning about. Figure 4.12 illustrates four primary categories of situations in which an outcome state can arise, which
can play out in myriad ways in different real-world situations.
In our most recent framework described briefly above, the expectation-outcome timing is specified in terms of the 100
msec alpha trial. And within this trial, the combined circuitry between the deep neocortical layers and the thalamus end up
producing an outcome state that drives predictive auto-encoder learning, which is basically the last case (d) in Figure
4.12, with an extra twist that during every 100 msec alpha trial, the network attempts to predict what will happen in the
next 100 msec -- the predictive aspect of the auto-encoder idea. Specifically, the deep layers attempt to predict what the
bottom-up driven activity pattern over the thalamus will look like in the final plus-phase quarter of the alpha trial, based on
activations present during the prior alpha trial. Because of the extensive bidirectional connectivity between brain areas, the
cross-modal expectation / output sequence shown in panel (b) of Figure 4.12 is also supported by this mechanism. A later
revision of this text will cover these ideas in more detail. Preliminary versions are available: (O'Reilly, Wyatte, & Rohrlich,
2014; Kachergis, Wyatte, O'Reilly, Kleijn, & Hommel, 2014).
Another hypothesis for something that "marks" the presence of an important outcome is a phasic burst of a neuromodulator
like dopamine. It is well established that dopamine bursts occur when an unexpected outcome arises, at least in the context
of expectations of reward or punishment (we'll discuss this in detail in the Motor Control and Reinforcement Learning
Chapter). Furthermore, we know from a number of studies that dopamine plays a strong role in modulating synaptic
plasticity. Under this hypothesis, the cortical network is always humming along doing standard BCM-like self-organizing
learning at a relatively low learning rate (due to a small lambda parameter in the combined XCAL equation, which
presumably corresponds to the rate of synaptic plasticity associated with the baseline tonic levels of dopamine), and then,
when something unexpected occurs, a dopamine burst drives stronger error-driven learning, with the immediate short-term
average "marked" by the dopamine burst as being associated with this important (salient) outcome. The XCAL learning
will automatically contrast this immediate short-term average with the immediately available medium-term average, which
presumably reflects an important contribution from the prior expectation state that was just violated by the outcome.
Figure 4.13: Summary of all the mechanisms in the Leabra framework used in this text, providing a summary of the last
three chapters.
Figure 4.13 provides a summary of the Leabra framework, which is the name given to the combination of all the neural
mechanisms that have been developed to this point in the text. Leabra stands for Learning in an Error-driven and
Associative, Biologically Realistic Algorithm -- the name is intended to evoke the "Libra" balance scale, where in this case
the balance is reflected in the combination of error-driven and self-organizing learning ("associative" is another name for
Hebbian learning). It also represents a balance between low-level, biologically-detailed models, and more abstract
computationally-motivated models. The biologically-based way of doing error-driven learning requires bidirectional
connectivity, and the Leabra framework is relatively unique in its ability to learn complex computational tasks in the
context of this pervasive bidirectional connectivity. Also, the FFFB inhibitory function producing k-Winners-Take-All
dynamics is unique to the Leabra framework, and is also very important for its overall behavior, especially in managing the
dynamics that arise with the bidirectional connectivity.
The different elements of the Leabra framework are therefore synergistic with each other, and as we have discussed, highly
compatible with the known biological features of the neocortex. Thus, the Leabra framework provides a solid foundation
for the cognitive neuroscience models that we explore next in the second part of the text.
For those already familiar with the Leabra framework, see Leabra Details for a brief review of how the XCAL version of
the algorithm described here differs from the earlier form of Leabra described in the original CECN textbook (O'Reilly &
Munakata, 2000).
Exploration of Leabra
Open the Family Trees simulation to explore Leabra learning in a deep multi-layered network running a more complex
task with some real-world relevance. This simulation is very interesting for showing how networks can create their own
similarity structure based on functional relationships, refuting the common misconception that networks are driven purely
by input similarity structure.
Explorations
Here are all the explorations covered in the main portion of the Learning chapter:
Self Organizing (self_org.proj) -- Self organizing learning using BCM-like dynamic of XCAL (Questions 4.1-4.2).
Pattern Associator (pat_assoc.proj) -- Basic two-layer network learning simple input/output mapping tasks with
Hebbian and Error-driven mechanisms (Questions 4.3-4.6).
Error Driven Hidden (err_driven_hidden.proj) -- Full error-driven learning with a hidden layer, can solve any input
output mapping (Question 4.7).
Family Trees (family_trees.proj) -- Learning in a deep (multi-hidden-layer) network, reshaping internal representations
to encode relational similarity (Questions 4.8-4.9).
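The BCM-like XCAL function driving learning in these simulations can be sketched as a piecewise-linear "checkmark": weight change is positive when the sender-receiver activity product exceeds a floating threshold, and negative below it, reversing back toward zero near zero activity. This is an illustrative sketch, not the full simulator implementation:

```python
def xcal_dwt(xy, theta_p, d_rev=0.1):
    """XCAL "checkmark" weight-change function.
    xy      -- sender * receiver activity product
    theta_p -- floating modification threshold
    d_rev   -- reversal point as a fraction of theta_p (illustrative)"""
    if xy > theta_p * d_rev:
        return xy - theta_p                  # LTP region (above threshold)
    return -xy * (1.0 - d_rev) / d_rev       # LTD region, reversing to zero
```

The two branches meet continuously at xy = theta_p * d_rev, giving the characteristic checkmark shape: no change at zero activity, maximal depression just below the reversal point, and potentiation above threshold.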
5.1: INTRODUCTION
5.2: NAVIGATING THE FUNCTIONAL ANATOMY OF THE BRAIN
5.3: PERCEPTION AND ATTENTION- WHAT VS. WHERE
5.4: MOTOR CONTROL- PARIETAL AND MOTOR CORTEX INTERACTING WITH BASAL GANGLIA AND
CEREBELLUM
5.5: LEARNING AND MEMORY- TEMPORAL CORTEX AND THE HIPPOCAMPUS
5.6: LANGUAGE- ALL TOGETHER NOW
5.7: EXECUTIVE FUNCTION- PREFRONTAL CORTEX AND BASAL GANGLIA
1 1/5/2022
5.1: Introduction
In Part I of this book, we have developed a toolkit of basic neural mechanisms, going from the activation dynamics of
individual neurons, to networks of neurons, and the learning mechanisms that configure them in both self-organizing and
error-driven ways. Here at the start of Part II, we begin the transition to exploring a wide range of cognitive phenomena. As an
important foundational step along this path, this chapter attempts to provide a big picture view of the overall functional
organization of the brain, in a relatively non-controversial way that is roughly meant to correspond to what is generally
agreed upon in the literature. This should help you understand at a broad level how different brain areas work together to
perform different cognitive functions, and situate the more specific models in the subsequent chapters into a larger overall
framework.
We proceed in the same sequence as the subsequent chapters, which roughly follows the evolutionary trajectory of the
brain itself, starting with basic perceptual and motor systems, and then proceeding to explore different forms of learning
and memory (including the role of the hippocampus in episodic memory). Building upon these core capacities, we then
examine language and executive function, which build upon and extend the functionality of these basic cognitive systems.
As usual, we begin with a basic foundation in biology: the gross anatomy of the brain.
Figure 5.1: Gross anatomy of the brain. Left panel shows the major lobes of the outer neocortex layer of the brain, and
right panel shows some of the major brain areas internal to the neocortex.
Figure 5.2: Brodmann's numbering system for the different areas of the neocortex, based on anatomical distinctions such
as the thickness of different cortical layers, as we discussed in the Networks Chapter. These anatomical distinctions are
remarkably well correlated with the functional differences in what different brain areas do.
Figure 5.1 shows the "gross" (actually quite beautiful and amazing!) anatomy of the brain. The outer portion is the
"wrinkled sheet" (upon which our thoughts rest) of the neocortex, showing all of the major lobes. This is where most of
our complex cognitive function occurs, and what we have been focusing on to this point in the text. The rest of the brain
lives inside the neocortex, with some important areas shown in the figure. These are generally referred to as subcortical
brain areas, and we include some of them in our computational models, including:
Hippocampus -- this brain area is actually an "ancient" form of cortex called "archicortex", and we'll see in Learning
and Memory how it plays a critical role in learning new "everyday" memories about events and facts (called episodic
memories).
Amygdala -- this brain area is important for recognizing emotionally salient stimuli, and alerting the rest of the brain
about them. We'll explore it in Motor Control and Reinforcement Learning, where it plays an important role in
reinforcing motor (and cognitive) actions based on reward (and punishment).
Cerebellum -- this massive brain structure contains 1/2 of the neurons in the brain, and plays an important role in
motor coordination. It is also active in most cognitive tasks, but understanding exactly what its functional role is in
cognition remains somewhat elusive. We'll explore it in Motor Control and Reinforcement Learning.
Thalamus -- provides the primary pathway for sensory information on its way to the neocortex, and is also likely
important for attention, arousal, and other modulatory functions. We'll explore the role of visual thalamus in Perception
and Attention and of motor thalamus in Motor Control and Reinforcement Learning.
Basal Ganglia -- this is a collection of subcortical areas that plays a critical role in Motor Control and Reinforcement
Learning, and also in Executive Function. It helps to make the final "Go" call on whether (or not) to execute particular
actions that the cortex 'proposes', and whether or not to update cognitive plans in the prefrontal cortex. Its policy for
making these choices is learned based on the prior history of reinforcement/punishment.
Figure 5.3: Color delineated map of Brodmann areas on the external cortical surface. Top: anterior view. Bottom: posterior
view.
Figure 5.4: Terminology for referring to different parts of the brain -- for everything except lateral and medial, three
different terms for the same thing are given.
Figure 5.2 and Figure 5.3 show more detail on the structure of the neocortex, in terms of Brodmann areas -- these
areas were identified by Korbinian Brodmann on the basis of anatomical differences (principally the differences in
thickness of different cortical layers, which we covered in the Networks Chapter). We won't refer too much to things at this
level of detail, but learning some of these numbers is a good idea for being able to read the primary literature in cognitive
neuroscience. Here is a quick overview of the functions of the cortical lobes ( Figure 5.5):
Figure 5.5: Summary of functions of cortical lobes -- see text for details.
Occipital lobe -- this contains primary visual cortex (V1) (Brodmann's area 17 or BA17), located at the very back tip
of the neocortex, and higher-level visual areas that radiate out (forward) from it. Clearly, its main function is in visual
processing.
Table 5.1: Comparison of learning mechanisms and activity/representational dynamics across four primary areas of the
brain. +++ means that the area definitely has the given property, with fewer +'s indicating less confidence in and/or importance
of this feature. --- means that the area definitely does not have the given property, again with fewer -'s indicating lower
confidence or importance.
Table 5.1 shows a comparison of four major brain areas according to the learning rules and activation dynamics that they
employ. The evolutionarily older areas of the basal ganglia, cerebellum, and hippocampus employ a separating form of
activation dynamics, meaning that they tend to make even somewhat similar inputs map onto more separated patterns of
neural activity within the structure. This is a very conservative, robust strategy akin to "memorizing" specific answers to
specific inputs -- it is likely to work OK, even though it is not very efficient, and does not generalize to new situations very
well. Each of these structures can be seen as optimizing a different form of learning within this overall separating dynamic.
The basal ganglia are specialized for learning on the basis of reward expectations and outcomes. The cerebellum uses a
simple yet effective form of error-driven learning (basically the delta rule as discussed in the Learning Chapter). And the
hippocampus relies more on Hebbian-style self-organizing learning. Thus, the hippocampus is constantly encoding new
episodic memories regardless of error or reward (though these can certainly modulate the rate of learning, as indicated by
the weaker + signs in the table), while the basal ganglia is learning to select motor actions on the basis of potential reward
or lack thereof (and is also a control system for regulating the timing of action selection), while the cerebellum is learning
to swiftly perform those motor actions by using error signals generated from differences in the sensory feedback relative to
the motor plan. Taken together, these three systems are sufficient to cover the basic needs of an organism to survive and
adapt to the environment, at least to some degree.
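The delta rule attributed above to the cerebellum can be sketched in a few lines: weights change in proportion to the difference between a target (e.g., sensory feedback) and the current output. A simple linear unit is assumed here purely for illustration:

```python
import numpy as np

def delta_rule_update(w, x, target, lr=0.1):
    """One delta-rule step on a linear unit."""
    y = w @ x                   # current output (e.g., a motor command)
    err = target - y            # error from sensory feedback
    return w + lr * err * x     # weight change proportional to the error

# Repeated updates drive the output toward the target.
w = np.zeros(2)
x = np.array([1.0, 0.5])
for _ in range(300):
    w = delta_rule_update(w, x, target=1.0)
```

After enough iterations the output w @ x converges on the target, which is the essence of error-driven calibration of motor commands.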
The hippocampus does introduce one critical innovation beyond what is present in the basal ganglia and cerebellum: it has
attractor dynamics. Specifically the recurrent connections between CA3 neurons are important for retrieving previously-
encoded memories, via pattern completion as we explored in the Networks Chapter. The price for this innovation is that
the balance between excitation and inhibition must be precisely maintained, to prevent epileptic activity dynamics. Indeed,
the hippocampus is the single most prevalent source of epileptic activity, in people at least.
Against this backdrop of evolutionarily older systems, the neocortex represents a few important innovations. In terms of
activation dynamics, it builds upon the attractor dynamic innovation from the hippocampus (appropriately so, given that
hippocampus represents an ancient "proto" cortex), and adds to this a strong ability to develop representations that
integrate across experiences to extract generalities, instead of always keeping everything separate all the time. The cost for
this integration ability is that the system can now form the wrong kinds of generalizations, which might lead to bad overall
behavior. But the advantages apparently outweigh the risks, by giving the system a strong ability to apply previous
learning to novel situations. In terms of learning mechanisms, the neocortex employs a solid blend of all three major forms
of learning, integrating the best of all the available learning signals into one system.
Figure 5.6: Hierarchy of visual detectors of increasing complexity achieves sophisticated perceptual categorization, with
the higher levels being able to recognize 1000's of different objects, people, etc.
Figure 5.7: Felleman & Van Essen's (1991) diagram of the anatomical connectivity of visual processing pathways, starting
with retinal ganglion cells (RGC) to the LGN of the thalamus, then primary visual cortex (V1) and on up.
The perceptual system provides an excellent example of the power of hierarchically organized layers of neural detectors, as
we discussed in the Networks Chapter. Figure 5.6 summarizes this process, with associated cortical areas noted below
each stage of processing. Figure 5.7 shows the actual anatomical connectivity patterns of all of the major visual areas,
showing that information really is processed in a hierarchical fashion in the brain (although there are many
interconnections outside of a strict hierarchy as well). Figure 5.8 puts these areas into their anatomical locations, showing
more clearly the what vs where (ventral vs dorsal) split in visual processing. Here is a quick summary of the flow of
information up the what side of the visual pathway (pictured on the right side of Figure 5.7):
V1 -- primary visual cortex, which encodes the image in terms of oriented edge detectors that respond to edges
(transitions in illumination) along different angles of orientation. We will see in Perception and Attention how these
edge detectors develop through self-organizing learning, driven by the reliable statistics of natural images.
V2 -- secondary visual cortex, which encodes combinations of edge detectors to develop a vocabulary of intersections
and junctions, along with many other basic visual features (e.g., 3D depth selectivity, basic textures, etc), that provide
the foundation for detecting more complex shapes. These V2 neurons also encode these features in a broader range of locations and sizes than V1 neurons, contributing to the increasing invariance of responding up the hierarchy.
Figure 5.8: Division of What vs Where (ventral vs. dorsal) pathways in visual processing.
We'll explore a model of invariant object recognition in Perception and Attention that shows how this deep hierarchy of
detectors can develop through learning. The Language Chapter builds upon this object recognition process to understand
how words are recognized and translated into associated verbal motor outputs during reading, and associated with semantic
knowledge as well.
The where aspect of visual processing, going up in a dorsal direction through the parietal cortex (areas MT, VIP, LIP, MST),
contains areas that are important for processing motion, depth, and other spatial features. As noted above, these areas are
also critical for translating visual input into appropriate motor output, leading Goodale and Milner to characterize this as
the how pathway. In Perception and Attention we'll see how this dorsal pathway can interact with the ventral what pathway
in the context of visual attention, producing the characteristic effects of parietal damage in hemispatial neglect, for
example.
Figure 5.9: The hippocampus sits on "top" of the cortical hierarchy and can encode information from all over the brain,
binding it together into an episodic memory.
There is one brain area, however, that looms so large in the domain of memory, that we'll spend a while focusing on it.
This is the hippocampus, which seems to be particularly good at rapidly learning new information, in a way that doesn't
interfere too much with previously learned information ( Figure 5.9). When you need to remember the name associated
with a person you recently met, you're relying on this rapid learning ability of the hippocampus. We'll see that the neural
properties of the hippocampal system are ideally suited to producing this rapid learning ability. One key neural property is
the use of extremely sparse representations, which produce a phenomenon called pattern separation, where the neural
activity pattern associated with one memory is highly distinct from that associated with other similar memories. This is
what minimizes interference with prior learning -- interference arises as a function of overlap. We'll see how this pattern
separation process is complemented by a pattern completion process for recovering memories during retrieval from partial
information.
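A toy demonstration of pattern separation: projecting two similar binary input patterns through random connections and keeping only a few winners (an extremely sparse code) yields output patterns that overlap much less than the inputs do. The layer sizes and sparsity levels here are arbitrary illustrative choices, not anatomical values:

```python
import numpy as np

rng = np.random.default_rng(0)

def kwta(x, k):
    """Binary code keeping only the k most active units."""
    out = np.zeros_like(x)
    out[np.argsort(x)[-k:]] = 1.0
    return out

def overlap(a, b):
    """Fraction of active units shared between two equally-sparse codes."""
    return np.sum(a * b) / np.sum(a)

# Two similar input patterns: 50 of 100 units active, 40 of them shared.
a = np.zeros(100); a[:50] = 1.0
b = np.zeros(100); b[10:60] = 1.0

W = rng.normal(size=(1000, 100))    # random divergent projection
a_code = kwta(W @ a, 25)            # very sparse (2.5%) output codes
b_code = kwta(W @ b, 25)

in_overlap = overlap(a, b)                # 0.8: highly overlapping inputs
out_overlap = overlap(a_code, b_code)     # typically much lower
```

This is the sense in which sparse activity minimizes interference: memories that share most of their inputs end up stored on largely distinct sets of neurons.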
We'll also see how the learning rate plays a crucial role in learning. Obviously, to learn rapidly, you need a fast learning
rate. But what happens with a slow learning rate? Turns out this enables you to integrate across many different
experiences, to produce wisdom and semantic knowledge. This slower learning rate is characteristic of most of the
neocortex (it also enables the basal ganglia to learn probabilities of positive and negative outcomes for each action across a
range of experience, rather than just their most recent outcomes). Interestingly, even with a slow learning rate, neocortex
can exhibit measurable effects of a single trial of learning, in the form of priming effects and familiarity signals that can
drive recognition memory (i.e., your ability to recognize something as familiar, without any more explicit episodic
memory). This form of recognition memory seems to depend on medial temporal lobe (MTL) areas including perirhinal
cortex. Another form of one trial behavioral learning involves mechanisms that support active maintenance of memories in
an attractor state (working memory in the prefrontal cortex). This form of memory does not require a weight change at all,
but can nevertheless rapidly influence behavioral performance from one instance to the next.
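The effect of learning rate can be illustrated with a simple running-average value estimate: the identical delta-style update either tracks the most recent outcomes (fast rate) or integrates the probability of reward across many experiences (slow rate). The outcome sequence below is made up purely for illustration:

```python
def running_estimate(outcomes, lr):
    """Delta-rule value estimate; behavior depends only on the rate."""
    v = 0.0
    for r in outcomes:
        v += lr * (r - v)    # same update rule in both cases
    return v

outcomes = [1, 0, 1, 1, 0] * 40              # 60% of trials rewarded
fast = running_estimate(outcomes, lr=0.9)    # dominated by recent outcomes
slow = running_estimate(outcomes, lr=0.05)   # integrates across history
```

The slow estimate settles near the true 60% reward rate, while the fast estimate swings with the last few trials, capturing the complementary virtues of hippocampal-style fast learning and neocortical/basal-ganglia-style slow statistical learning.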
Figure 5.10: The What vs. How distinction for posterior cortex can be carried forward into prefrontal cortex, to understand
the distinctive roles of the ventral and dorsal areas of PFC.
We also build upon the functional divisions of the posterior cortex to understand how the ventral vs. dorsal areas of
prefrontal cortex are functionally organized. Figure 5.10 shows an overall schematic for how this occurs. It also
illustrates how the lateral surface is more associated with "cold" cognitive function, while the medial surface is more
involved in "hot" emotional and motivational processing.
We'll see how the PFC can provide top-down cognitive control over processing in the posterior cortex, with the classic
example being the Stroop task.
Then we'll explore how PFC and BG can interact to produce a dynamically gated working memory system that allows the
system to hold multiple pieces of information 'in mind', and to separately update some pieces of information while
continuing to maintain some existing information. The role of the BG in this system builds on the more established role of
the BG in motor control, by interacting in very similar circuits with PFC instead of motor cortex. In both cases, the BG
provide a gating signal for determining whether or not a given frontal cortical 'action' should be executed. It's just
that PFC actions are more cognitive than motor cortex, and include things like updating of working memory states, or of
goals, plans, etc. Once updated, these PFC representations can then provide that top-down cognitive control mentioned
above, and hence can shape action selection in BG-motor circuits, but also influence attention to task-relevant features in
sensory cortex. Interestingly, the mechanisms for reinforcing which cognitive actions to execute (including whether or not
to update working memory, or to attend to particular features, or to initiate a high level plan) seem to depend on very
similar dopaminergic reinforcement learning mechanisms that are so central to motor control. This framework also
provides a link between motivation and cognition which is very similar to the well established link between motivation and
action.
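The BG-gated working memory idea can be caricatured in a few lines: a "Go" signal from the BG determines whether the PFC state updates to the current input or continues to robustly maintain its prior contents. This is a deliberately minimal abstraction of the dynamics described above, with made-up labels standing in for PFC contents:

```python
def pfc_step(pfc_state, sensory_input, go_signal):
    """One step of gated working memory: the BG 'Go' signal decides
    whether PFC loads the current input or keeps maintaining."""
    if go_signal:
        return sensory_input    # update: load new content into PFC
    return pfc_state            # no-go: maintain existing content

state = None
for inp, go in [("plan A", True), ("distractor", False), ("plan B", True)]:
    state = pfc_step(state, inp, go)
```

Note how the distractor never enters the maintained state because no Go signal fired for it, which is exactly the selective-updating property described above.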
6.1: INTRODUCTION
6.2: BIOLOGY OF PERCEPTION
6.3: ORIENTED EDGE DETECTORS IN PRIMARY VISUAL CORTEX
6.4: INVARIANT OBJECT RECOGNITION IN THE "WHAT" PATHWAY
6.5: SPATIAL ATTENTION AND NEGLECT IN THE "WHERE/HOW" PATHWAY
6.6: EXPLORATIONS
6.1: Introduction
Perception is at once obvious and mysterious. It is so effortless to us that we have little appreciation for all the amazing
computation that goes on under the hood. And yet we often use terms like "vision" as a metaphor for higher-level concepts
(does the President have a vision or not?) -- perhaps this actually reflects a deep truth: that much of our higher-level
cognitive abilities depend upon our perceptual processing systems for doing a lot of the hard work. Perception is not the
mere act of seeing, but is leveraged whenever we imagine new ideas, solutions to hard problems, etc. Many of our most
innovative scientists (e.g., Einstein, Richard Feynman) used visual reasoning processes to come up with their greatest
insights. Einstein tried to visualize catching up to a speeding ray of light (in addition to trains stretching and contracting in
interesting ways), and one of Feynman's major contributions was a means of visually diagramming complex mathematical
operations in quantum physics.
Pedagogically, perception serves as the foundation for our entry into cognitive phenomena. It is the most well-studied and
biologically grounded of the cognitive domains. As a result, we will cover only a small fraction of the many fascinating
phenomena of perception, focusing mostly on vision. But we do focus on a core set of issues that capture many of the
general principles behind other perceptual phenomena.
We begin with a computational model of primary visual cortex (V1), which shows how self-organizing learning
principles can explain the origin of oriented edge detectors, which capture the dominant statistical regularities present in
natural images. This model also shows how excitatory lateral connections can result in the development of topography in
V1 -- neighboring neurons tend to encode similar features, because they have a tendency to activate each other, and
learning is determined by activity.
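A classic way to see how lateral neighborhood interactions produce topography is Kohonen's self-organizing map, used here as a simplified stand-in for the lateral-connectivity mechanism described above: the winning unit and its neighbors along the sheet all learn toward the current input, so neighbors come to encode similar features. Sizes and parameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.random((20, 2))    # 1D sheet of 20 units in a 2D feature space

def som_step(W, x, lr=0.2, sigma=2.0):
    winner = np.argmin(np.sum((W - x) ** 2, axis=1))   # best-matching unit
    d = np.arange(len(W)) - winner                     # distance on the sheet
    nbr = np.exp(-d ** 2 / (2 * sigma ** 2))           # lateral "excitation"
    return W + lr * nbr[:, None] * (x - W)             # neighbors learn together

for _ in range(2000):
    W = som_step(W, rng.random(2))
```

After training, adjacent units on the sheet have far more similar feature preferences than distant ones, which is the essence of the topographic maps seen in V1.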
Building on the features learned in V1, we explore how higher levels of the ventral what pathway can learn to recognize
objects regardless of considerable variability in the superficial appearance of these objects as they project onto the retina.
Object recognition is the paradigmatic example of how a hierarchically-organized sequence of feature category detectors
can incrementally solve a very difficult overall problem. Computational models based on this principle can exhibit high
levels of object recognition performance on realistic visual images, and thus provide a compelling suggestion that this is
likely how the brain solves this problem as well.
Next, we consider the role of the dorsal where (or how) pathway in spatial attention. Spatial attention is important for
many things, including object recognition when there are multiple objects in view -- it helps focus processing on one of the
objects, while degrading the activity of features associated with the other objects, reducing potential confusion. Our
computational model of this interaction between what and where processing streams can account for the effects of brain
damage to the where pathway, giving rise to hemispatial neglect for damage to only one side of the brain, and a
phenomenon called Balint's syndrome with bilateral damage. This ability to account for both neurologically intact and
brain damaged behavior is a powerful advantage of using neurally-based models.
As usual, we begin with a review of the biological systems involved in perception.
Figure 6.1: The pathway of early visual processing from the retina through lateral geniculate nucleus of the thalamus
(LGN) to primary visual cortex (V1), showing how information from the different visual fields (left vs. right) are routed to
the opposite hemisphere.
Figure 6.2: How the retina compresses information by only responding to areas of contrasting illumination, not solid
uniform illumination. The response properties of retinal cells can be summarized by these Difference-of-Gaussian (DoG)
filters, with a narrow central region and a wider surround (also called center-surround receptive fields). The excitatory and
inhibitory components exactly cancel when both are uniformly illuminated, but when light falls more on the center vs. the
surround (or vice-versa), they respond, as illustrated with an edge where illumination transitions between darker and
lighter.
Figure 6.1 shows the basic optics and transmission pathways of visual signals, which come in through the retina, and
progress to the lateral geniculate nucleus of the thalamus (LGN), and then to primary visual cortex (V1). The primary
organizing principles at work here, and in other perceptual modalities and perceptual areas more generally, are:
Transduction of different information -- in the retina, photoreceptors are sensitive to different wavelengths of light
(red = long wavelengths, green = medium wavelengths, and blue = short wavelengths), giving us color vision, but the
retinal signals also differ in their spatial frequency (how coarse or fine of a feature they detect -- photoreceptors in the
central fovea region can have high spatial frequency = fine resolution, while those in the periphery are lower
resolution), and in their temporal response (fast vs. slow responding, including differential sensitivity to motion).
Organization of information in a topographic fashion -- for example, the left vs. right visual fields are organized
into the contralateral hemispheres of cortex -- as the figure shows, signals from the left part of visual space are routed
to the right hemisphere, and vice-versa. Information within LGN and V1 is also organized topographically in various
ways. This organization generally allows similar information to be contrasted, producing an enhanced signal, and also
grouped together to simplify processing at higher levels.
Extracting relevant signals, while filtering irrelevant ones -- Figure 6.2 shows how retinal cells respond only to
contrast, not uniform illumination, by using center-surround receptive fields (e.g., on-center, off-surround, or vice-
versa). Only when one part of this receptive field gets different amounts of light compared to the others do these neurons respond.
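A center-surround DoG filter of this kind is easy to construct and verify: with each Gaussian normalized to unit sum, uniform illumination cancels exactly, while an edge produces a response. The filter size and Gaussian widths below are arbitrary illustrative values:

```python
import numpy as np

def dog_filter(size=15, sigma_c=1.0, sigma_s=3.0):
    """Center Gaussian minus surround Gaussian, each normalized to
    sum to 1 so that uniform illumination cancels exactly."""
    x = np.arange(size) - size // 2
    xx, yy = np.meshgrid(x, x)
    r2 = xx ** 2 + yy ** 2
    center = np.exp(-r2 / (2 * sigma_c ** 2))
    surround = np.exp(-r2 / (2 * sigma_s ** 2))
    return center / center.sum() - surround / surround.sum()

dog = dog_filter()
uniform = np.ones((15, 15))                   # solid uniform illumination
edge = np.ones((15, 15)); edge[:, :7] = 0.0   # dark-to-light transition

uniform_resp = np.sum(dog * uniform)    # excitation and inhibition cancel
edge_resp = np.sum(dog * edge)          # contrast drives a response
```

This captures in miniature how the retina compresses the image: it transmits contrast transitions and discards regions of constant illumination.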
Figure 6.3: A V1 simple cell that detects an oriented edge of contrast in the image, by receiving from a line of LGN on-
center cells aligned along the edge. The LGN cells will fire due to the differential excitation vs. inhibition they receive (see
previous figure), and then will activate the V1 neuron that receives from them.
Figure 6.4: Simple and complex cell types within V1 -- the complex cells integrate over the simple cell properties,
including abstracting across the polarity (positions of the on vs. off coding regions), and creating larger receptive fields by
integrating over multiple locations as well (the V1-Simple-Max cells are only doing this spatial integration). The end stop
cells are the most complex, detecting any form of contrasting orientation adjacent to a given simple cell. In the simulator,
the V1 simple cells are encoded more directly using gabor filters, which mathematically describe their oriented edge
sensitivity.
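A standard Gabor filter (a sinusoidal grating under a Gaussian envelope) captures this oriented edge sensitivity; the parameter values below are generic illustrative choices, not those used in the simulator:

```python
import numpy as np

def gabor(size=15, theta=0.0, lam=8.0, sigma=3.0, gamma=0.5, psi=0.0):
    """Sinusoidal grating under a Gaussian envelope: a standard model
    of a V1 simple cell's oriented receptive field."""
    x = np.arange(size) - size // 2
    xx, yy = np.meshgrid(x, x)
    xt = xx * np.cos(theta) + yy * np.sin(theta)    # rotate coordinates
    yt = -xx * np.sin(theta) + yy * np.cos(theta)
    env = np.exp(-(xt ** 2 + (gamma * yt) ** 2) / (2 * sigma ** 2))
    return env * np.cos(2 * np.pi * xt / lam + psi)

g_pref = gabor(theta=0.0)
g_orth = gabor(theta=np.pi / 2)
resp_pref = np.sum(g_pref * g_pref)   # response to preferred orientation
resp_orth = np.sum(g_pref * g_orth)   # weak response to orthogonal grating
```

Correlating the filter with gratings at different orientations reproduces the orientation selectivity of simple cells: strong responses at the preferred angle, weak responses orthogonal to it.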
Figure 6.5: Felleman & Van Essen's (1991) diagram of the anatomical connectivity of visual processing pathways, starting
with retinal ganglion cells (RGC) to the LGN of the thalamus, then primary visual cortex (V1) and on up.
Moving up beyond the primary visual cortex, the perceptual system provides an excellent example of the power of
hierarchically organized layers of neural detectors, as we discussed in the Networks Chapter. Figure 6.5 shows the
anatomical connectivity patterns of all of the major visual areas, starting from the retinal ganglion cells (RGC) to LGN to
V1 and on up. The specific patterns of connectivity allow a hierarchical structure to be extracted, as shown, even though
there are many interconnections outside of a strict hierarchy as well.
Figure 6.6: Division of What vs Where (ventral vs. dorsal) pathways in visual processing.
Figure 6.6 puts these areas into their anatomical locations, showing more clearly a what vs where (ventral vs dorsal)
split in visual processing. The projections going in a ventral direction from V1 to V4 to areas of inferotemporal cortex (IT) make up the what pathway, supporting recognition of object identity.
Figure 6.7: Orientation tuning of an individual V1 neuron in response to bar stimuli at different orientations -- this neuron
shows a preference for vertically oriented stimuli.
Neurons in primary visual cortex (V1) detect the orientation of edges or bars of light within their receptive field (RF -- the
region of the visual field that a given neuron receives input from). Figure 6.7 shows characteristic data from
electrophysiological recordings of an individual V1 neuron in response to oriented bars. This neuron responds maximally
to the vertical orientation, with a graded fall off on either side of that. This is a very typical form of tuning curve. Figure
6.8 shows that these orientation tuned neurons are organized topographically, such that neighbors tend to encode similar
orientations, and the orientation tuning varies fairly continuously over the surface of the cortex.
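A Gaussian tuning curve, wrapping at 180 degrees (since a bar at 0 and at 180 degrees has the same orientation), is a common descriptive fit to data like that in Figure 6.7; the firing rate and width values here are illustrative, not fit to any particular neuron:

```python
import numpy as np

def tuning_curve(theta, pref=90.0, r_max=30.0, sigma=20.0):
    """Firing rate as a Gaussian function of the angular distance
    from the preferred orientation (illustrative parameters)."""
    d = np.abs(theta - pref)
    d = np.minimum(d, 180.0 - d)    # orientations wrap at 180 degrees
    return r_max * np.exp(-d ** 2 / (2 * sigma ** 2))
```

The graded fall-off on either side of the preferred orientation, rather than an all-or-nothing response, is what makes this a typical tuning curve.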
Figure 6.8: Topographic organization of oriented edge detectors in V1 -- neighboring regions of neurons have similar
orientation tuning, as shown in this colorized map where different colors indicate orientation preference as shown in panel
C. Panel B shows how a full 360 degree loop of orientations nucleates around a central point -- these are known as pinwheel
structures.
The question we attempt to address in this section is why such a topographic organization of oriented edge detectors exists in primary visual cortex. There are multiple levels of answer to this question. At the most abstract level, such topography may serve to minimize the total amount of wiring required, by placing neurons that communicate with each other close together.
Simulation Exploration
Open V1Rf to explore the development of oriented edge detectors in V1. This model gets exposed to a set of natural
images, and learns to encode oriented edges, because they are the dominant statistical regularities present in these images. Figure 6.9
shows the resulting map of orientations that develops.
Figure 6.9: Topographic organization of oriented edge detectors in simulation of V1 neurons exposed to small windows of
natural images (mountains, trees, etc). The neighborhood connectivity of neurons causes a topographic organization to
develop.
Figure 6.10: Why object recognition is hard: things that should be categorized as the same (i.e., have the same output
label) often have no overlap in their retinal input features when they show up in different locations, sizes, etc, but things
that should be categorized as different often have high levels of overlap when they show up in the same location. Thus, the
bottom-up similarity structure is directly opposed to the desired output similarity structure, making the problem very
difficult.
The reason object recognition is so hard is that there can often be no overlap at all among visual inputs of the same object
in different locations (sizes, rotations, colors, etc), while there can be high levels of overlap among different objects in the
same location ( Figure 6.10). Therefore, you cannot rely on the bottom-up visual similarity structure -- instead it often
works directly against the desired output categorization of these stimuli. As we saw in the Learning Chapter, successful
learning in this situation requires error-driven learning, because self-organizing learning tends to be strongly driven by the
input similarity structure.
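This opposition between input similarity and desired output similarity is easy to verify with toy binary "retinal" patterns: the same object at two positions shares no active inputs at all, while two different objects at the same position share most of theirs. The patterns below are made up for illustration:

```python
import numpy as np

obj_at_0 = np.zeros(10); obj_at_0[0:3] = 1.0       # an object at position 0
obj_at_5 = np.zeros(10); obj_at_5[5:8] = 1.0       # the same object, shifted
other_at_0 = np.zeros(10); other_at_0[1:4] = 1.0   # a different object at 0

same_obj_overlap = np.sum(obj_at_0 * obj_at_5)     # shared inputs: none
diff_obj_overlap = np.sum(obj_at_0 * other_at_0)   # shared inputs: most
```

The patterns that should map to the same output overlap not at all, while the patterns that should map to different outputs overlap heavily, which is exactly why self-organizing learning alone cannot solve this problem.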
Figure 6.12: Another way of representing the hierarchy of increasing featural complexity that arises over the areas of the
ventral visual pathways. V1 has elementary feature detectors (oriented edges). Next, these are combined into junctions of
lines in V2, followed by more complex visual features in V4. Individual faces are recognized at the next level in IT (even
here multiple face units are active in graded proportion to how similar people look). Finally, at the highest level are
important functional "semantic" categories that serve as a good basis for actions that one might take -- being able to
develop such high level categories is critical for intelligent behavior -- this level corresponds to more anterior areas of IT.
The most successful approach to the object recognition problem, which was advocated initially in a model by Fukushima
(1980), is to incrementally solve two problems over a hierarchically organized sequence of layers ( Figure 6.11, Figure
6.12):
The invariance problem, by having each layer integrate over a range of locations (and sizes, rotations, etc) for the
features in the previous layer, such that neurons become increasingly invariant as one moves up the hierarchy.
The pattern discrimination problem (distinguishing an A from an F, for example), by having each layer build up more
complex combinations of feature detectors, as a result of detecting combinations of the features present in the previous
layer, such that neurons are better able to discriminate even similar input patterns as one moves up the hierarchy.
The critical insight from these models is that breaking these two problems down into incremental, hierarchical steps
enables the system to solve both problems without one causing trouble for the other. For example, if you had a simple fully
invariant vertical line detector that responded to a vertical line in any location, it would be impossible to know what spatial relationship that line had to other features, and such relational information is critical for discriminating different patterns.
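A minimal sketch in the spirit of Fukushima's approach pairs feature detection (template matching at every position) with max-pooling over positions, so that the top-level response becomes tolerant to where the feature appears. The images and feature template here are toy examples:

```python
import numpy as np

def detect(image, feature):
    """Feature layer: match a small template at every image position."""
    h, w = feature.shape
    H, W = image.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * feature)
    return out

# A vertical-bar feature, and the same bar at two retinal positions.
feat = np.ones((2, 1))
img1 = np.zeros((4, 4)); img1[:, 1] = 1.0
img2 = np.zeros((4, 4)); img2[:, 2] = 1.0

# Max-pooling over all positions: the pooled response is identical
# regardless of where the bar appeared (position invariance).
resp1 = detect(img1, feat).max()
resp2 = detect(img2, feat).max()
```

In the full hierarchical solution, pooling is done over limited neighborhoods at each layer rather than globally, so that invariance accumulates gradually while spatial relationships among features are still preserved for discrimination.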
Figure 6.13: Summary of neural response properties in V2, V4, and IT for the macaque monkey, according to both the extent
to which the areas respond to complex vs. simple visual features (the Smax / MAX column, showing how the response
to simple visual inputs (Smax) compares to the maximum response to any visual input image tested (MAX)), and the
overall size of the visual receptive field, over which the neurons exhibit relatively invariant responding to visual features.
For V2, nearly all neurons responded maximally to simple stimuli, and the receptive field sizes were the smallest. For V4,
only 50% of neurons had simple responses as their maximal response, and the receptive field sizes increase over V2.
Posterior IT increases (slightly) on both dimensions, while anterior IT exhibits almost entirely complex featural responding
and significantly larger receptive fields. These incremental increases in complexity and invariance (receptive field size) are
exactly as predicted by the incremental computational solution to invariant object recognition as shown in the previous
figure. Reproduced from Kobatake & Tanaka (1994).
In a satisfying convergence of top-down computational motivation and bottom-up neuroscience data, this incremental,
hierarchical solution provides a nice fit to the known properties of the visual areas along the ventral what pathway (V1,
V2, V4, IT). Figure 6.13 summarizes neural recordings from these areas in the macaque monkey, and shows that neurons
increase in the complexity of the stimuli that drive their responding, and the size of the receptive field over which they
exhibit an invariant response to these stimuli, as one proceeds up the hierarchy of areas. Figure 6.14 shows example
complex stimuli that evoked maximal responding in each of these areas, to give a sense of what kind of complex feature
conjunctions these neurons can detect.
Figure 6.14: Complex stimuli that evoked a maximal response from neurons in V2, V4, and IT, providing some suggestion
for what kinds of complex features these neurons can detect. Most V2 neurons responded maximally to simple stimuli
(oriented edges, not shown). Reproduced from Kobatake & Tanaka (1994).
See Ventral Path Data for a more detailed discussion of the data on neural responses to visual shape features in these
ventral pathways, including several more data figures. There are some interesting subtleties and controversies in this
literature, but the main conclusions presented here still hold.
Figure 6.15: Set of 20 objects composed from horizontal and vertical line elements used for the object recognition
simulation. By using a restricted set of visual feature elements, we can more easily understand how the model works, and
also test for generalization to novel objects (objects 18 and 19 are not trained initially, and are then subsequently trained in
only a few locations -- learning there generalizes well to other locations).
Go to Objrec for the computational model of object recognition, which demonstrates the incremental hierarchical solution
to the object recognition problem. We use a simplified set of "objects" ( Figure 6.15) composed from vertical and
horizontal line elements. This simplified set of visual features allows us to better understand how the model works, and
also enables testing generalization to novel objects composed from these same sets of features. You will see that the model
learns simpler combinations of line elements in area V4, and more complex combinations of features in IT, which are also
invariant over the full receptive field. These IT representations are not identical to entire objects -- instead they represent
an invariant distributed code for objects in terms of their constituent features. The generalization test shows how this
distributed code can support rapid learning of new objects, as long as they share this set of features. Although they are
likely much more complex and less well defined, a similar vocabulary of visual shape features seems to be learned in
primate IT representations.
Hemispatial Neglect
Figure 6.18: Progression of self portraits by an artist with hemispatial neglect, showing gradual remediation of the neglect
over time.
Some of the most striking evidence that the parietal cortex is important for spatial attention comes from patients with
hemispatial neglect, who tend to ignore or neglect one side of space ( Figure 6.17, Figure 6.18, Figure 6.19). This
condition typically arises from a stroke or other form of brain injury affecting the right parietal cortex, which then gives
rise to a neglect of the left half of space (due to the crossing over of visual information shown in the biology section).
Interestingly, the neglect applies to multiple different spatial reference frames, as shown in Figure 6.19, where lines on
the left side of the image tend to be neglected, and also each individual line is bisected more toward the right, indicating a
neglect of the left portion of each line.
Figure 6.19: Results of a line bisection task for a person with hemispatial neglect. Notice that neglect appears to operate at
two different spatial scales here: for the entire set of lines, and within each individual line.
Figure 6.21: Typical data from the Posner spatial cueing task, showing a speedup for valid trials, and a slowdown for
invalid trials, compared to a neutral trial with no cueing. The data for patients with hemispatial neglect is also shown, with
their overall slower reaction times normalized to that of the intact case.
As shown in Figure 6.21, patients with hemispatial neglect show a disproportionate increase in reaction times for the
invalid cue case, specifically when the cue is presented to the good visual field (typically the right), while the target
appears in the left. Posner took this data to suggest that these patients have difficulty disengaging attention, according to
his box-and-arrow model of the spatial cueing task ( Figure 6.22).
Figure 6.23: Interacting spatial and object recognition pathways can explain Posner spatial attention effects in terms of
spatial influences on early object recognition processing, in addition to top-down influence on V1 representations.
Importantly, these models make distinct predictions regarding the effects of bilateral parietal damage. Patients with this
condition are known to suffer from Balint's syndrome, which is characterized by a profound inability to recognize objects
when more than one is present in the visual field. This is suggestive of the important role that spatial attention plays in
facilitating object recognition in crowded visual scenes. According to Posner's disengage model, bilateral damage should
result in difficulty disengaging from both sides of space, producing slowing in invalid trials for both sides of space. In
contrast, the competition-based model makes the opposite prediction: the lesions serve to reduce competition on both sides
of space, such that there should be reduced attentional effects on both sides. That is, the effect of the invalid cue actually
decreases in magnitude. The data is consistent with the competition model, and not Posner's model.
Exploration
Open AttnSimple to explore a model with spatial and object pathways interacting in the context of multiple spatial
attention tasks, including perceiving multiple objects, and the Posner spatial cueing task. It reproduces the behavioral data
shown above, and correctly demonstrates the observed pattern of reduced attentional effects for Balint's patients.
7.1: INTRODUCTION
7.2: BASAL GANGLIA, ACTION SELECTION AND REINFORCEMENT LEARNING
7.3: DOPAMINE AND TEMPORAL DIFFERENCE REINFORCEMENT LEARNING
7.4: CEREBELLUM AND ERROR-DRIVEN LEARNING
7.5: SUBTOPICS AND EXPLORATIONS
7.1: Introduction
The foundations of cognition are built upon the sensory-motor loop -- processing sensory inputs to determine which motor
action to perform next. This is the most basic function of any nervous system. The human brain has a huge number of such
loops, spanning the evolutionary timescale from the most primitive reflexes in the peripheral nervous system, up to the
most abstract and inscrutable plans, such as the decision to apply to, and attend, graduate school, which probably involves
the highest levels of processing in the prefrontal cortex (PFC) (or perhaps some basic level of insanity... who knows).
Table 7.1: Comparison of learning mechanisms and activity/representational dynamics across four primary areas of the
brain. +++ means that the area definitely has the given property, with fewer +'s indicating less confidence in and/or
importance of this feature. --- means that the area definitely does not have the given property, again with fewer -'s
indicating lower confidence or importance.
Figure 7.1: Illustration of the role of the basal ganglia in action selection -- multiple possible actions are considered in the
cortex, and the basal ganglia selects the best (most rewarding) one to actually execute. Reproduced from Gazzaniga et al
(2002).
In this chapter, we complete the loop that started in the previous chapter on Perception and Attention, by covering a few of
the most important motor output and control systems, and the learning mechanisms that govern their behavior. At the
subcortical level, the cerebellum and basal ganglia are the two major motor control areas, each of which has specially
adapted learning mechanisms that differ from the general-purpose cortical learning mechanisms described in the Learning
Mechanisms chapter (see Comparing and Contrasting Major Brain Areas for a high-level summary of these differences --
the key summary table is reproduced here (Table 7.1)). The basal ganglia are specialized for learning from
reward/punishment signals, in comparison to expectations for reward/punishment, and this learning then shapes the
action selection that the organism will make under different circumstances (selecting the most rewarding actions and
avoiding punishing ones; Figure 7.1). This form of learning is called reinforcement learning. The cerebellum is
specialized for learning from error, specifically errors between the sensory outcomes associated with motor actions,
relative to expectations for these sensory outcomes associated with those motor actions. Thus, the cerebellum can refine
the implementation of a given motor plan, to make it more accurate, efficient, and well-coordinated.
There is a nice division of labor here, where the basal ganglia help to select one out of many possible actions to perform,
and the cerebellum then makes sure that the selected action is performed well. Consistent with this rather clean division of
labor, there are no direct connections between the basal ganglia and cerebellum -- instead, each operates in interaction with
various areas in the cortex, where the action plans are formulated and coordinated. Both basal ganglia and cerebellum are
connected with the frontal cortex through parallel loops of connectivity.
Figure 7.2: Parallel circuits through the basal ganglia for different regions of the frontal cortex -- each region of frontal
cortex has a corresponding basal ganglia circuit, for controlling action selection/initiation in that frontal area. Motor loop:
SMA = supplementary motor area -- the associated striatum (putamen) also receives from premotor cortex (PM), and
primary motor (M1) and somatosensory (S1) areas -- everything needed to properly contextualize motor actions.
Oculomotor loop: FEF = frontal eye fields, also receives from dorsolateral PFC (DLPFC), and posterior parietal cortex
(PPC) -- appropriate context for programming eye movements. Prefrontal loop: DLPFC also controlled by posterior
parietal cortex, and premotor cortex. Orbitofrontal loop: OFC = orbitofrontal cortex, also receives from inferotemporal
cortex (IT), and anterior cingulate cortex (ACC). Cingulate loop: ACC also modulated by hippocampus (HIP), entorhinal
cortex (EC), and IT.
The basal ganglia performs its action selection function over a wide range of frontal cortical areas, by virtue of a sequence
of parallel loops of connectivity ( Figure 7.2). These areas include motor (skeletal muscle control) and oculomotor (eye
movement control), but also prefrontal cortex, orbitofrontal cortex, and anterior cingulate cortex, which are not directly
motor control areas. Thus, we need to generalize our notion of action selection to include cognitive action selection --
more abstract forms of selection that operate in higher-level cognitive areas of prefrontal cortex. For example, the basal
ganglia can control the selection of large-scale action plans and strategies in its connections to the prefrontal cortex. The
orbitofrontal cortex is important for encoding the reward value associated with different possible stimulus outcomes, so the
basal ganglia connection here is important for driving the updating of these representations as a function of contingencies
in the environment. The anterior cingulate cortex is important for encoding the costs of motor actions (time, effort,
uncertainty), and basal ganglia similarly can help control updating of these costs as different actions are considered. We
can summarize the role of basal ganglia in these more abstract frontal areas as controlling working memory updating, as
is discussed further in the Executive Function chapter.
Interestingly, the additional inputs that converge into the basal ganglia for a given area all make good sense. Motor control
needs to know about the current somatosensory state, as well as inputs from the slightly higher-level motor control area
known as premotor cortex. Orbitofrontal cortex is all about encoding the reward value of stimuli, and thus needs to get
input from IT cortex, which provides the identity of relevant objects in the environment.
Figure 7.4: Biology of the basal ganglia system, with two cases shown: a) Dopamine burst activity that drives the direct
"Go" pathway neurons in the striatum, which then inhibit the tonic activation in the globus pallidus internal segment (GPi),
which releases specific nuclei in the thalamus from this inhibition, allowing them to complete a bidirectional excitatory
circuit with the frontal cortex, resulting in the initiation of a motor action. The increased Go activity during dopamine
bursts results in potentiation of corticostriatal synapses, and hence learning to select actions that tend to result in positive
outcomes. b) Dopamine dip (pause in tonic dopamine neuron firing), leading to preferential activity of indirect "NoGo"
pathway neurons in the striatum, which inhibit the external segment globus pallidus neurons (GPe), which are otherwise
tonically active, and inhibiting the GPi. Increased NoGo activity thus results in disinhibition of GPi, making it more active
and thus inhibiting the thalamus, preventing initiation of the corresponding motor action. The dopamine dip results in
potentiation of corticostriatal NoGo synapses, and hence learning to avoid selection actions that tend to result in negative
outcomes. From Frank, 2005.
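The disinhibitory logic summarized in Figure 7.4 can be sketched with a few illustrative firing rates (the numbers here are arbitrary choices, not values from the Frank (2005) model): Go activity inhibits GPi, releasing the thalamus; NoGo activity inhibits GPe, which disinhibits GPi and keeps the thalamus suppressed.

```python
# Sketch of the Go/NoGo disinhibition chain (illustrative rates only).
def thalamus_activity(go, nogo):
    gpe = max(0.0, 1.0 - nogo)            # GPe: tonically active, inhibited by NoGo
    gpi = max(0.0, 1.5 - go - 0.5 * gpe)  # GPi: tonic drive, inhibited by Go and GPe
    return max(0.0, 1.0 - gpi)            # thalamus released only when GPi falls

print(thalamus_activity(go=0.0, nogo=0.0))  # 0.0 baseline: action gated off
print(thalamus_activity(go=1.0, nogo=0.0))  # 1.0 Go: disinhibition, action initiated
print(thalamus_activity(go=0.0, nogo=1.0))  # 0.0 NoGo: GPi driven even higher
```

The key design point is that the striatum never excites the thalamus directly; it acts only by removing (or reinforcing) tonic inhibition, which is why the pathway names are "Go" and "NoGo" rather than "on" and "off".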
Zooming in on any one of these loops, the critical elements of the basal ganglia system are diagrammed in Figure 7.4,
with two important activation patterns shown. First, the basal ganglia system involves the following subregions:
The striatum, which is the major input region, consisting of the caudate and putamen subdivisions (as shown in
Figure 7.2). The striatum is anatomically subdivided into many small clusters of neurons, with two major types of
clusters: patch/striosomes and matrix/matrisomes. The matrix clusters contain direct (Go) and indirect (NoGo)
pathway medium spiny neurons, which together make up 95% of striatal cells; both receive excitatory inputs
from all over the cortex but are inhibitory on their downstream targets in the globus pallidus as described next. The
patch cells project to the dopaminergic system, and thus appear to play a more indirect role in modulating learning
signals. There are also relatively few widely spaced tonically active neurons (TANs), which release acetylcholine as a
neuromodulator within the striatum.
Figure 7.5: Characteristic patterns of neural firing of the dopaminergic neurons in the ventral tegmental area (VTA) and
substantia nigra pars compacta (SNc), in a simple conditioning task. Prior to conditioning, when a reward is delivered, the
dopamine neurons fire a burst of activity (top panel -- histogram on top shows sum of neural spikes across the repeated
recording traces shown below, with each row being a different recording trial). After the animal has learned to associate a
conditioned stimulus (CS) (e.g., a tone) with the reward, the dopamine neurons now fire to the onset of the CS, and not to
the reward itself. If a reward is withheld after the CS, there is a dip or pause in dopamine firing, indicating that there was
some kind of prediction of the reward, and when it failed to arrive, there was a negative prediction error. This overall
pattern of firing across conditions is highly consistent with reinforcement learning models based on reward prediction
error.
Although we considered above how phasic changes in dopamine can drive Go and NoGo learning to select the most
rewarding actions and to avoid less rewarding ones, we have not yet addressed how dopamine neurons come to
represent these phasic signals for driving learning. One of the most exciting discoveries in recent years was the finding that
dopamine neurons in the ventral tegmental area (VTA) and substantia nigra pars compacta (SNc) behave in accord with
reinforcement learning models based on reward prediction error. Unlike some popular misconceptions, these dopamine
neurons do not encode raw reward value directly. Instead, they encode the difference between reward received versus an
expectation of reward. This is shown in Figure 7.5: if there is no expectation of reward, then dopamine neurons fire to
the reward, reflecting a positive reward prediction error (zero expectation, positive reward). If a conditioned stimulus (CS,
e.g., a tone or light) reliably predicts a subsequent reward, then the neurons no longer fire to the reward itself, reflecting
the lack of reward prediction error (expectation = reward). Instead, the dopamine neurons fire to the onset of the CS. If the
reward is omitted following the CS, then the dopamine neurons actually go the other way (a "dip" or "pause" in the
otherwise low tonic level of dopamine neuron firing), reflecting a negative reward prediction error (positive reward
prediction, zero reward).
Computationally, the simplest model of reward prediction error is the Rescorla-Wagner conditioning model, which is
mathematically identical to the delta rule as discussed in the Learning Mechanisms chapter, and is simply the difference
between the actual reward and the expected reward:
δ = r − r̂ = r − ∑ xw
where δ ("delta") is the reward prediction error, r is the amount of reward actually received, and r̂ = ∑ xw is the amount
of reward expected, computed as a weighted sum over input stimuli x with weights w. The weights adapt to try to
accurately predict the actual reward values, and in fact this delta value specifies the direction in which the weights should
change:
Δw = δx
This is identical to the delta learning rule, including the important dependence on the stimulus activity x -- you only want
to change the weights for stimuli that are actually present (i.e., non-zero x's).
When the reward prediction is correct, then the actual reward value is cancelled out by the prediction, as shown in the
second panel in Figure 7.5. This rule also accurately predicts the other cases shown in the figure too (positive and
negative reward prediction errors).
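As a concrete illustration, the Rescorla-Wagner updates above can be simulated directly; the learning rate and trial count are arbitrary illustrative choices, and a single CS with x = 1 is assumed.

```python
# Rescorla-Wagner sketch: delta = r - r_hat, dw = lrate * delta * x.
lrate = 0.3
w = 0.0                       # weight from a single CS (x = 1 when present)

for trial in range(20):       # CS repeatedly paired with reward r = 1
    delta = 1.0 - w           # reward prediction error
    w += lrate * delta * 1.0  # delta rule, with x = 1 (CS present)

print(w)        # close to 1.0: the CS now fully predicts the reward
print(1.0 - w)  # delta near 0: no burst to a predicted reward
print(0.0 - w)  # delta near -1: a dip when the predicted reward is omitted
```

The three printed quantities correspond to the three panels of Figure 7.5: prediction learned, predicted reward canceled, omitted reward producing a negative prediction error.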
What the Rescorla-Wagner model fails to capture is the firing of dopamine to the onset of the CS in the second panel in
Figure 7.5. However, a slightly more complex model known as the temporal differences (TD) learning rule does capture
this CS-onset firing, by introducing time into the equation (as the name suggests). Relative to Rescorla-Wagner, TD just
adds one additional term to the delta equation, representing the future reward values that might come later in time:
δ = (r + f) − r̂
where f represents the future rewards, and now the reward expectation r̂ = ∑ xw has to try to anticipate both the current
reward r and this future reward f. In a simple conditioning task, where the CS reliably predicts a subsequent reward, the
onset of the CS results in an increase in this f value, because once the CS arrives, there is a high probability of reward in
the near future. Furthermore, this f itself is not predictable, because the onset of the CS is not predicted by any earlier cue
(and if it were, then that earlier cue would be the real CS, and would drive the dopamine burst). Therefore, the r̂ expectation
cannot cancel out the f value, and a dopamine burst ensues.
Although this f value explains CS-onset dopamine firing, it raises the question of how can the system know what kind of
rewards are coming in the future? Like anything having to do with the future, you fundamentally just have to guess, using
the past as your guide as best as possible. TD does this by trying to enforce consistency in reward estimates over time. In
effect, the estimate at time t is used to train the estimate at time t − 1 , and so on, to keep everything as consistent as
possible across time, and consistent with the actual rewards that are received over time.
This can all be derived in a very satisfying way by specifying something known as a value function, V(t) that is a sum of
all present and future rewards, with the future rewards discounted by a "gamma" factor, which captures the intuitive
notion that rewards further in the future are worth less than those that will occur sooner. As the Wimpy character says in
Popeye, "I'll gladly pay you Tuesday for a hamburger today." Here is that value function, which is an infinite sum going
into the future:
V(t) = r(t) + γ r(t + 1) + γ² r(t + 2) + …
And because we don't know anything for certain, all of these value terms are really estimates, denoted by the little "hats"
above them:
V̂(t) = r(t) + γ V̂(t + 1)
So this equation tells us what our estimate at the current time t should be, in terms of the future estimate at time t + 1 .
Next, we subtract V̂(t) from both sides, which gives us an expression that is another way of expressing the above equality --
that the difference between these terms should be equal to zero:
0 = (r(t) + γ V̂(t + 1)) − V̂(t)
This is mathematically stating the point that TD tries to keep the estimates consistent over time -- their difference should
be zero. But as we are learning our V^ estimates, this difference will not be zero, and in fact, the extent to which it is not
zero is the extent to which there is a reward prediction error:
δ = (r(t) + γ V̂(t + 1)) − V̂(t)
If you compare this to the equation with f in it above, you can see that:
f = γ V̂(t + 1)
and otherwise everything else is the same, except we've clarified the time dependence of all the variables, and our reward
expectation is now a "value expectation" instead (replacing the r̂ with a V̂). Also, as with Rescorla-Wagner, the delta value
here drives learning of the value expectations.
The TD learning rule can be used to explain a large number of different conditioning phenomena, and its fit with the firing
of dopamine neurons in the brain has led to a large amount of research progress. It represents a real triumph of the
computational modeling approach for understanding (and predicting) brain function.
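The temporal-consistency idea can be sketched with a tabular TD(0) simulation of a single conditioning trial. The four time-step states, the clamping of the start state (capturing the point above that nothing predicts CS onset), and the parameter values are simplifying assumptions.

```python
# TD(0) sketch of simple conditioning. States are time steps within a
# trial: [start, CS, delay, reward-time]. The start state's value is
# clamped to 0 because nothing predicts when the CS will arrive.
gamma, lrate = 0.9, 0.3
V = [0.0, 0.0, 0.0, 0.0]
r = [0.0, 0.0, 0.0, 1.0]          # reward arrives at the final step

for episode in range(500):
    for t in range(4):
        next_v = V[t + 1] if t < 3 else 0.0
        delta = r[t] + gamma * next_v - V[t]   # TD error
        if t > 0:                              # start state stays clamped
            V[t] += lrate * delta

cs_delta = gamma * V[1] - V[0]     # transition start -> CS: burst (~0.73)
reward_delta = r[3] - V[3]         # predicted reward: no burst (~0.0)
omit_delta = 0.0 - V[3]           # omitted reward: dip (~ -1.0)
print(cs_delta, reward_delta, omit_delta)
```

After learning, the TD error has migrated from the time of reward to the time of CS onset, and omitting the reward produces a negative error, matching the dopamine recordings in Figure 7.5.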
Figure 7.6: Basic structure of the actor critic architecture for motor control. The critic is responsible for processing reward
inputs (r), turning them into reward prediction errors (δ), which are suitable for driving learning in both the critic and the
actor. The actor is responsible for producing motor output given relevant sensory input, and doesn't process reward or
reward expectations directly. This is an efficient division of labor, and it is essential for learning to transform rewards into
reward prediction errors, otherwise the system would overlearn on simple tasks that it mastered long ago.
Now that you have a better idea about how dopamine works, we can revisit its role in modulating learning in the basal
ganglia (as shown in Figure 7.4). From a computational perspective, the key idea is the distinction between an actor and
a critic ( Figure 7.6), where it is assumed that rewards result at least in part from correct performance by the actor. The
basal ganglia is the actor in this case, and the dopamine signal is the output of the critic, which then serves as a training
signal for the actor (and the critic too as we saw earlier). The reward prediction error signal produced by the dopamine
system is a good training signal because it drives stronger learning early in a skill acquisition process, when rewards are
more unpredictable, and reduces learning as the skill is perfected, and rewards are thus more predictable. If the system
instead learned directly on the basis of external rewards, it would continue to learn about skills that have long been
mastered, and this would likely lead to a number of bad consequences (synaptic weights growing ever stronger,
interference with other newer learning, etc).
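This division of labor can be sketched with a toy two-action bandit (not the basal ganglia model itself; the reward values, learning rate, and softmax choice rule are illustrative assumptions): the critic converts raw reward into a prediction error, and that same error trains both the critic and the actor.

```python
import math, random

# Toy actor-critic: one state, two actions with different rewards.
random.seed(1)
prefs = [0.0, 0.0]      # actor: preference per action
V = 0.0                 # critic: expected reward
lrate = 0.1
reward = [1.0, 0.2]     # action 0 is the better one

for trial in range(1000):
    exps = [math.exp(p) for p in prefs]          # softmax action selection
    probs = [e / sum(exps) for e in exps]
    a = 0 if random.random() < probs[0] else 1
    delta = reward[a] - V        # critic's reward prediction error
    V += lrate * delta           # train the critic
    prefs[a] += lrate * delta    # train the actor with the SAME delta

print(prefs[0] > prefs[1])   # the actor comes to favor the rewarding action
```

Because the actor learns from δ rather than from r, its weights stop changing once the critic's predictions are accurate, which is exactly the "no overlearning on mastered skills" property described above.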
Figure 7.7: The Opponent Actor Learning (OpAL) scheme. This is a modified actor critic whereby the actor contains
separate G and N weights representing the Go and NoGo pathways. The activities of the pathways are scaled by dopamine
levels during choice, and the relative activation differences for each action are compared to make a choice. The figure
depicts selection among three actions that have different learned costs and benefits (think coffee, tea, water: clearly coffee
has a better benefit than tea, but it also has higher costs (jitters etc)). When dopamine levels are low (left), the costs are
amplified, and the benefits diminished, and the system chooses to avoid the highest cost and selects action 3 (water). When
dopamine levels are high, the benefits are amplified and the costs diminished, and it chooses action 1 (coffee). Moderate
dopamine levels are associated with action 2 (tea; not shown). This accounts for differential effects of dopamine on
learning and choice among actions with different costs and benefits. From Collins & Frank, 2014.
Furthermore, the sign of the reward prediction error is appropriate for the effects of dopamine on the Go and NoGo
pathways in the striatum, as we saw in the BG model project above. Positive reward prediction errors, when unexpected
rewards are received, indicate that the selected action was better than expected, and thus Go firing for that action should be
increased in the future. The increased activation produced by dopamine on these Go neurons will have this effect,
assuming learning is driven by these activation levels. Conversely, negative reward prediction errors will facilitate NoGo
firing, causing the system to avoid that action in the future. Indeed, the complex neural network model of BG Go/NoGo
circuitry can be simplified with more formal analysis in a modified actor-critic architecture called Opponent Actor
Learning (OpAL; Figure 7.7), where the actor is divided into independent G and N opponent weights, and where their
relative contribution is itself affected by dopamine levels during both learning and choice (Collins & Frank 2014).
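The OpAL choice rule can be sketched as follows; the particular G/N weights and the linear dopamine scaling are illustrative assumptions, not the fitted Collins & Frank (2014) model.

```python
# Sketch of OpAL-style choice: each action has separate benefit (G) and
# cost (N) weights; dopamine scales their relative contribution.
G = [1.0, 0.7, 0.3]   # learned benefits: coffee > tea > water (illustrative)
N = [0.9, 0.4, 0.1]   # learned costs:    coffee > tea > water (illustrative)

def choose(dopamine):
    """dopamine in [0,1]: high DA amplifies benefits, low DA amplifies costs."""
    beta_g, beta_n = dopamine, 1.0 - dopamine
    act = [beta_g * g - beta_n * n for g, n in zip(G, N)]
    return act.index(max(act))

print(choose(0.9))   # 0: high DA -> benefits dominate -> coffee
print(choose(0.5))   # 1: moderate DA -> balanced -> tea
print(choose(0.1))   # 2: low DA -> costs dominate -> water
```

This reproduces the qualitative pattern in Figure 7.7: the same learned weights yield different choices depending on dopamine level at the time of choice.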
Figure 7.8: Biological mapping of the PVLV algorithm, which has two separate subsystems: Primary Value (PV) and
Learned Value (LV). Each subsystem has excitatory and inhibitory subcomponents, so named for their effect on dopamine
firing. PVe = primary rewards that excite dopamine, associated with the lateral hypothalamic nucleus (LHA). PVi =
inhibitory canceling of dopamine firing to rewards, driven by patch-like neurons in the ventral striatum (VS_Patch). LVe =
excitatory drive from the central nucleus of the amygdala (CNA), which represents CS's. LVi = inhibitory canceling of LVe
excitation, also via patch-like neurons in the ventral striatum. The pedunculopontine tegmental nucleus (PPTN) may
transform sustained inputs into phasic dopamine responses via a simple temporal delta operation.
You might have noticed that we haven't yet explained at a biological level how the dopamine neurons in the VTA and SNc
actually come to exhibit their reward prediction error firing. There is a growing body of data supporting the involvement of
the brain areas shown in Figure 7.8:
Lateral hypothalamus (LHA) provides a primary reward signal for basic rewards like food, water etc.
Patch-like neurons in the ventral striatum (VS-patch) have direct inhibitory connections onto the dopamine neurons
in the VTA and SNc, and likely play the role of canceling out the influence of primary reward signals when these
rewards have successfully been predicted.
Central nucleus of the amygdala (CNA) is important for driving dopamine firing to the onset of conditioned stimuli.
It receives broadly from the cortex, and projects directly and indirectly to the VTA and SNc. Neurons in the CNA
exhibit CS-related firing.
Figure 7.9: How the different components of PVLV contribute to the overall pattern of dopamine firing in a simple
conditioning paradigm. At CS onset, LVe responds with an excitatory dopamine burst, due to prior learning at the time of
external reward (PVe), which associated the CS with the LVe system (in the CNA). When the external reward
(unconditioned stimulus or US) comes in, the PVi system partially cancels out the excitation from PVe -- over time, this
cancellation becomes complete as the reward becomes fully predicted.
Given that there are distinct brain areas involved in these different aspects of the dopamine firing, it raises the question
of how the seemingly unified TD learning algorithm could be implemented across such different brain areas. In response
to this basic question, the PVLV model of dopamine firing was developed. PVLV stands for Primary Value, Learned Value.
The dopamine signal for primary values applies at the time of external rewards, and is:
δpv = PVe − PVi
where the excitatory (e) and inhibitory (i) subscripts denote the two components of the primary value system, and the sign
of their influence on dopamine firing.
The dopamine signal for learned values (LV) applies whenever PV does not (i.e., when external rewards are not present or
expected), and it has a similar form:
δlv = LVe − LVi
where LVe is the excitatory drive on dopamine from the CNA, which learns to respond to CS's. LVi is a counteracting
inhibitory drive, again thought to be associated with the patch-like neurons of the ventral striatum. It learns much more
slowly than the LVe system, and will eventually learn to cancel out CS-associated dopamine responses, once these CS's
become highly familiar (beyond the short timescale of most experiments).
The PVi values are learned in the same way as in the delta rule or Rescorla-Wagner, and the LVe and LVi values are
learned in a similar fashion as well, except that their training signal is driven directly from the PVe reward values, and
only occurs when external rewards are present or expected. This is critical for allowing LVe, for example, to get activated at
the time of CS onset, when there isn't any actual reward value present. If LVe were always learning to match the current
value of PVe, then this absence of PVe value at CS onset would quickly eliminate the LVe response then. See PVLV
Learning for the full set of equations governing the learning of the LV and PV components.
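A drastically simplified sketch of these two dopamine signals follows; the learning rate, the single CS, and the omission of LVi learning (which is very slow) are simplifying assumptions, and the full equations are in Hazy et al., 2010.

```python
# Simplified PVLV sketch: PVi is trained toward the actual reward, and
# LVe is trained ONLY when external reward is present (a key PVLV
# constraint). LVi is treated as ~0 on experimental timescales.
lrate = 0.5
PVi, LVe = 0.0, 0.0
cs_present = 1.0            # the CS is active at the time of reward

for trial in range(20):
    PVe = 1.0                                 # primary reward (e.g., LHA)
    delta_pv = PVe - PVi                      # dopamine at reward time
    PVi += lrate * delta_pv                   # VS-patch cancels the reward
    LVe += lrate * (PVe - LVe) * cs_present   # LVe learns only at reward time

delta_lv = LVe * cs_present - 0.0   # dopamine at CS onset (LVi ~ 0)
print(delta_lv)     # near 1.0: burst at CS onset
print(1.0 - PVi)    # near 0.0: no burst to the now-predicted reward
```

Note how the CS-onset burst emerges even though LVe never sees any reward at CS onset: its weights were trained earlier, at the time of the US, which is exactly the constraint discussed next.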
There are a number of interesting properties of the learning constraints in the PVLV system. First, the CS must still be
active at the time of the external reward in order for the LV system to learn about it, since LVe only learns at the time of
external reward. If the CS itself goes off, then some memory of it must be sustained. This fits well with known constraints
on CS learning in conditioning paradigms. Second, the dopamine burst at the time of CS onset cannot influence learning in
the LV system itself -- otherwise there would be an unchecked positive feedback loop. One implication of this is that the
LV system cannot support second-order conditioning, where a first CS predicts a second CS which then predicts reward.
Consistent with this constraint, the CNA (i.e., LVe) appears to only be involved in first-order conditioning, while the
basolateral nucleus of the amygdala (BLA) is necessary for second-order conditioning. Furthermore, there does not appear
to be much of any evidence for third or higher orders of conditioning. Finally, there is a wealth of specific data on
differences in CS vs. US associated learning that are consistent with the PVLV framework (see Hazy et al, 2010 for a
thorough review).
In short, the PVLV system can explain how the different biological systems are involved in generating phasic dopamine
responses as a function of reward associations, in a way that seems to fit with otherwise somewhat peculiar constraints on
the system. Also, we will see in the Executive Function Chapter that PVLV provides a cleaner learning signal for
controlling the basal ganglia's role in the prefrontal cortex working memory system.
Exploration of PVLV
PVLV (pvlv.proj) -- PVLV model of the same simple conditioning cases as explored in the TD model.
Figure 7.10: Areas of the cortex that project to the cerebellum -- unlike the basal ganglia, the cerebellum receives
exclusively from motor-related areas, including the parietal cortex (which includes primary somatosensory cortex), and
motor areas of frontal cortex. Notably, it does not receive from prefrontal cortex or temporal cortex.
Now that we understand how the basal ganglia can select an action to perform based on reinforcement learning, we turn to
the cerebellum, which takes over once the action has been initiated, and uses error-driven learning to shape the
performance of the action so that it is accurate and well-coordinated. As shown in Figure 7.10, the cerebellum only
receives from cortical areas directly involved in motor production, including the parietal cortex and the motor areas of
frontal cortex. Unlike the basal ganglia, it does not receive from prefrontal cortex or temporal cortex, which makes sense
according to their respective functions. Prefrontal cortex and temporal cortex are really important for high-level planning
and action selection, but not for action execution. However, we do know from neuroimaging experiments that the
cerebellum is engaged in many cognitive tasks -- this must reflect its extensive connectivity with the parietal cortex, which
is also activated in many cognitive tasks. One idea is that the cerebellum can help shape learning and processing in parietal
cortex by virtue of its powerful error-driven learning mechanisms -- this may help to explain how the parietal cortex can
learn to do all the complex things it does. However, at this point both the parietal cortex and cerebellum are much better
understood from a motor standpoint than a cognitive one.
Figure 7.11: Circuitry and structure of the cerebellum -- see text for details. Reproduced from ???
Figure 7.12: Schematic circuitry of the cerebellum. MF = mossy fiber input axons. GC = granule cell. GcC = Golgi cell. PF
= parallel fiber. PC = Purkinje cell. BC = basket cell, SC = stellate cell. CF = climbing fiber. DCN = deep cerebellar nuclei.
IO = inferior olive.
Figure 7.13: Lookup table solution to function learning -- the appropriate value for the function can be memorized for each
input value X, with perhaps just a little bit of interpolation around the X values. For the cerebellum, X are the inputs
(sensory signals, etc), and f(X) is the motor output commands.
Putting all these pieces together, David Marr and James Albus argued that the cerebellum is a system for error-driven
learning, with the error signal coming from the climbing fibers. It is clear that it has the machinery to associate stimulus
inputs with motor output commands, under the command of the climbing fiber inputs. One important principle of
cerebellar function is the projection of inputs into a very high-dimensional space over the granule cells -- computationally
this achieves the separation form of learning, where each combination of inputs activates a unique pattern of granule cell
neurons. This unique pattern can then be associated with a different output signal from the cerebellum, producing
something approximating a lookup table of input/output values (Figure 7.13). A lookup table provides a very robust
solution to learning even very complex, arbitrary functions -- it will always be able to encode any kind of function. The
drawback is that it does not generalize to novel input patterns very well. However, it may be better overall in motor control
to avoid improper generalization, rather than eke out a bit more efficiency from some form of generalization. This
high-dimensional expansion is also used successfully by the support vector machine (SVM), one of the most successful machine
learning algorithms.
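The lookup-table idea can be illustrated with a small sketch: project inputs through a fixed random "granule cell" layer with k-winners-take-all sparsening, then train a linear readout with a delta rule standing in for the climbing-fiber error signal. All sizes, names, and parameters here are invented for the example:

```python
import random

# Sketch of the Marr/Albus scheme: a fixed random expansion into many
# sparse "granule cells", plus a delta-rule readout where the error
# term stands in for the climbing fiber. Sizes/parameters are invented.

random.seed(0)
N_IN, N_GC, K = 10, 500, 10     # input size, granule cells, active GCs

# fixed random mossy-fiber -> granule-cell weights
gc_wts = [[random.random() for _ in range(N_IN)] for _ in range(N_GC)]

def granule_code(x):
    """k-winners-take-all sparse binary code over the granule layer."""
    net = [sum(w * xi for w, xi in zip(row, x)) for row in gc_wts]
    winners = sorted(range(N_GC), key=lambda i: net[i])[-K:]
    code = [0.0] * N_GC
    for i in winners:
        code[i] = 1.0
    return code

out_w = [0.0] * N_GC            # "Purkinje" readout weights

def train(x, target, lr=0.05):
    code = granule_code(x)
    err = target - sum(w * c for w, c in zip(out_w, code))
    for i, c in enumerate(code):          # climbing-fiber-driven update
        out_w[i] += lr * err * c

def predict(x):
    code = granule_code(x)
    return sum(w * c for w, c in zip(out_w, code))
```

Because the sparse granule codes for different inputs barely overlap, each input/output pairing is stored nearly independently, approximating the lookup table of Figure 7.13 while generalizing very little.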
Exploration of Cerebellum
Cereb (cereb.proj) -- Cerebellum role in motor learning, learning from errors.
Explorations
Here are all the explorations covered in the main portion of the Motor Control and Reinforcement Learning chapter:
BG (bg.proj) -- action selection / gating and reinforcement learning in the basal ganglia. (Questions 7.1 -- 7.4)
RL (rl_cond.proj) -- Pavlovian Conditioning using Temporal Differences Reinforcement Learning. (Questions 7.5 --
7.6)
PVLV (pvlv.proj) -- Pavlovian Conditioning using the Primary Value Learned Value algorithm. (Questions 7.7 -- 7.9)
Cereb (cereb.proj) -- Cerebellum role in motor learning, learning from errors. (Questions 7.10 -- 7.11)
8.1: INTRODUCTION
8.2: EPISODIC MEMORY
8.3: FAMILIARITY AND RECOGNITION MEMORY
8.4: PRIMING- WEIGHT AND ACTIVATION-BASED
8.5: SUBTOPICS AND EXPLORATIONS
1 1/5/2022
8.1: Introduction
When you think of memory, you probably think of episodic memory -- memory for specific episodes or events. Maybe
you can remember some special times from childhood (birthdays, family trips, etc), or some traumatic times (ever get lost
in a supermarket, or get left behind on a hike or other family outing?). Probably you can remember what you had for
dinner last night, and who you ate with? Although this aspect of memory is the most salient for us, it is just one of many
different types of memory.
One broad division in memory, in mechanistic, computational terms, is between weight-based and activation-based forms
of memory. Weight-based memory is a result of synaptic plasticity, and is generally relatively long lasting (at least several
tens of minutes, and sometimes several decades, up to a whole lifetime). Activation-based memory is supported by
ongoing neural activity, and is thus much more transient and fleeting, but also more flexible. Because weight-based
memory exists at every modifiable synapse in the brain, it can manifest in innumerable ways. In this chapter, we focus on
some of the most prominent types of memory studied by psychologists, starting with episodic memory, then looking at
familiarity-based recognition memory, followed by weight-based priming, and activation-based priming. We'll look at
more robust forms of activation-based memory, including working memory, in the Executive Function chapter.
Probably most people have heard of the hippocampus and its critical role in episodic memory -- the movie Memento for
example does a great job of portraying what it is like to not have a functional hippocampus. We'll find out through our
computational models why the hippocampus is so good at episodic memory -- it has highly sparse patterns of neural
activity (relatively few neurons active at a time), which allows even relatively similar memories to have very different,
non-overlapping neural representations. These distinct neural patterns dramatically reduce interference, which is the
primary nemesis of memory. Indeed, the highly distributed, overlapping representations in the neocortex -- while useful for
reasons outlined in the first half of this book -- by themselves produce catastrophic interference when they are driven to
learn too rapidly. But it is this rapid one-shot learning that is required for episodic memory! Instead, it seems that the brain
leverages two specialized, complementary learning systems -- the hippocampus for rapid encoding of new episodic
memories, and the neocortex for slow acquisition of rich webs of semantic knowledge, which benefit considerably from
the overlapping distributed learning and slower learning rates, as we'll see.
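The interference argument rests on a simple statistical fact: if a fraction f of the neurons participate in each memory, two unrelated memories share roughly a fraction f of each other's active neurons, so sparser codes overlap (and hence interfere) far less. A quick numerical check, with arbitrary numbers standing in for cortical vs. hippocampal activity levels:

```python
import random

# Why sparse codes interfere less: if a fraction f of N neurons is
# active in each memory, two unrelated memories share about f of each
# other's active neurons. All numbers here are arbitrary.

random.seed(1)
N = 1000

def random_pattern(n_active):
    """A random set of active neurons out of N."""
    return set(random.sample(range(N), n_active))

def mean_overlap(n_active, trials=200):
    """Average fraction of one pattern's neurons shared with another."""
    total = 0.0
    for _ in range(trials):
        a, b = random_pattern(n_active), random_pattern(n_active)
        total += len(a & b) / n_active
    return total / trials

dense_overlap = mean_overlap(500)   # cortex-like: 50% of neurons active
sparse_overlap = mean_overlap(25)   # DG/CA3-like: 2.5% active
```

With 50% activity, two unrelated memories share about half their neurons; at 2.5% activity they share almost none, which is the essence of the hippocampal pattern-separation advantage.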
Countering the seemingly ever-present urge to oversimplify and modularize the brain, it is critical to appreciate that
memory is a highly distributed phenomenon, with billions of synapses throughout the brain being tweaked by any given
experience. Several studies have shown preserved learning of new memories of relatively specific information in people
with significant hippocampal damage -- but it is critical to consider how these memories are cued. This is an essential
aspect to remember about memory in general: whether a given memory can actually be retrieved depends critically on how
the system is probed. We've probably all had the experience of a flood of memories coming back as a result of visiting an
old haunt -- the myriad of cues available enable (seemingly spontaneous) recall of memories that otherwise are not quite
strong enough to rise to the surface. The memories encoded without the benefit of the hippocampus are weaker and more
vague, but they do exist.
In addition to being highly distributed, memory in the brain is also highly interactive. Information that is initially encoded
in one part of the brain can appear to "spread" to other parts of the brain, if those memories are reactivated and these other
brain areas get further opportunities to learn them. A classic example is that episodic memories initially encoded in the
hippocampus can be strengthened in the surrounding neocortical areas through repeated retrieval of those memories. This
can even happen while we are sleeping, when patterns of memories experienced during the day have been shown to be
reactivated! Furthermore, the prefrontal cortex Executive Function system, and affective states, can
significantly influence the encoding and retrieval of memory. Thus, far from the static "hard drive" metaphor from
computers, memory in the brain is a highly dynamic, constantly evolving process that reflects the complexity and
interactions present across all the areas of the brain.
Figure 8.1: Data from humans (a) and a generic (cortical) neural network model (b) on the classic AB-AC list learning task,
which generates considerable interference by re-pairing the A list items with new associates in the AC list after having first
learned the AB list. People's performance on the AB items after learning the AC list definitely degrades (red line), but
nowhere near as catastrophically as in the neural network model. Data reproduced from McCloskey and Cohen (1989).
After 1, 5, 10, and 20 iterations of learning this AC list, people are tested on their ability to recall the original AB items,
without any additional training on those items. Figure 8.1 shows that there is a significant amount of interference on the
AB list as a result of learning the AC items, due to the considerable overlap between the two lists, but even after 20
Hippocampal Anatomy
Figure 8.2: The hippocampus sits on "top" of the cortical hierarchy and can encode information from all over the brain,
binding it together into an episodic memory. Dorsal (parahippocampal) and Ventral (perirhinal) pathways from posterior
cortex converge into the entorhinal cortex, which is then the input and output pathway of the hippocampus proper,
consisting of the dentate gyrus (DG) and areas of "Ammon's horn" (cornu ammonis, CA) -- CA3 and CA1. CA3 represents
the primary "engram" for the episodic memory, while CA1 is an invertible encoding of EC, such that subsequent recall of
the CA3 engram can activate CA1 and then EC, to reactivate the full episodic memory out into the cortex.
The anatomy of the hippocampus proper and the areas that feed into it is shown in Figure 8.2. The hippocampus
represents one of two "summits" on top of the hierarchy of interconnected cortical areas (where the bottom are sensory
input areas, e.g., primary visual cortex) -- the other such summit is the prefrontal cortex explored in the Executive Function
Chapter. Thus, it possesses a critical feature for an episodic memory system: access to a very high-level summary of
everything of note going on in your brain at the moment. This information, organized along the dual-pathway dorsal vs.
ventral pathways explored in the Perception and Attention Chapter, converges on the parahippocampal (PHC) (dorsal)
and perirhinal (PRC) (ventral) areas, which then feed into the entorhinal cortex (EC), and then into the hippocampus
proper. The major hippocampal areas include the dentate gyrus (DG) and the areas of "Ammon's horn" (cornu ammonis
(CA) in Latin), CA3 and CA1 (what happened to CA2? It turns out it is basically the same as CA3, so we just use that label).
Figure 8.3: Comparison of activity patterns across different areas of the hippocampus, showing that CA fields (CA3, CA1)
are much more sparse and selective than the cortical input areas (Entorhinal cortex (EC) and subiculum). This sparse,
pattern separated encoding within the hippocampus enables it to rapidly learn new episodes while suffering minimal
interference. Activation of sample neurons within each area are shown for a rat running on an 8 arm radial maze, with the
bars along each arm indicating how much the neuron fired for each direction of motion along the arm. The CA3 neuron
fires only for one direction in one arm, while EC has activity in all arms (i.e., a much more overlapping, distributed
representation).
Exploration
Now, let's explore how the hippocampus encodes and recalls memories, using the AB-AC task. Just click on the following
exploration and follow the instructions from there:
Hippocampus
Theta Waves
Figure 8.7: Different areas of the hippocampal system fire out of phase with respect to the overall theta rhythm, producing
dynamics that optimize encoding vs. retrieval. We consider the strength of the EC and CA3 inputs to CA1. When the EC
input is strong and CA3 is weak, CA1 can learn to encode the EC inputs. This serves as a plus phase for an error-driven
learning dynamic in the Leabra framework. When CA3 is strong and EC is weak, the system recalls information driven by
prior CA3 -> CA1 learning. This serves as a minus phase for Leabra error-driven learning, relative to the plus phase
encoding state. (adapted from Hasselmo et al, 2002)
An important property of the hippocampus is an overall oscillation in the rate of neural firing, in the so-called theta
frequency band in rats, which ranges from about 8-12 times per second. As shown in Figure 8.7, different areas of the
hippocampus are out of phase with each other with respect to this theta oscillation, and this raises the possibility that these
phase differences may enable the hippocampus to learn more effectively. Hasselmo et al (2002) argued that this theta phase
relationship enables the system to alternate between encoding of new information vs. recall of existing information. This is
an appealing idea, because as we discussed earlier, there can be a benefit by altering the hippocampal parameters to
optimize encoding or retrieval based on various other kinds of demands.
The emergent software now supports an extension to this basic theta encoding vs. retrieval idea that enables Leabra error-
driven learning to shape two different pathways of learning in the hippocampus, all within one standard trial of processing
(see KetzMorkondaOReilly13 for the published paper). Each pathway has an effective minus and plus phase activation
state (although in fact they share the same plus phase). The main pathway, trained on the standard minus to plus phase
difference, involves CA3-driven recall of the corresponding CA1 activity pattern, which can then reactivate EC and so on
out to cortex. The second pathway, trained using a special initial phase of settling within the minus phase, is the CA1 <->
EC invertible auto-encoder, which ensures that CA1 can actually reactivate the EC if it is correctly recalled. In our
standard hippocampal model explored previously, this auto-encoder pathway is trained in advance on all possible sub-
patterns within a single subgroup of EC and CA1 units (which we call a "slot"). This new model suggests how this auto-
encoder can instead be learned via the theta phase cycle.
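The idea can be caricatured in a few lines: within one theta cycle, CA1 is driven first by CA3 recall and then by the EC input, and each of those states is compared against the same EC-driven "plus" state to drive error-driven learning in the corresponding pathway. This is a purely illustrative linear sketch (the target here is simply the EC pattern, standing in for the invertible CA1 <-> EC encoding); real models use the full Leabra dynamics:

```python
# Caricature of theta-phase error-driven learning within one trial.
# Linear units, hand-picked sizes; the "plus phase" target for CA1 is
# simply the EC pattern (our stand-in for the invertible encoding).

def theta_trial(w_ca3_ca1, w_ec_ca1, ca3, ec, lr=0.5):
    """Update both CA1 pathways from their theta-phase states."""
    # CA3-strong / EC-weak phase: CA1 reflects recall via CA3 -> CA1
    ca1_recall = [sum(w * a for w, a in zip(row, ca3)) for row in w_ca3_ca1]
    # EC-strong / CA3-weak phase: CA1 reflects the EC -> CA1 auto-encoder
    ca1_encode = [sum(w * a for w, a in zip(row, ec)) for row in w_ec_ca1]
    target = ec  # plus phase: EC input clamps the target CA1 state
    for j, t in enumerate(target):
        for i, a in enumerate(ca3):   # train recall pathway (minus phase 1)
            w_ca3_ca1[j][i] += lr * (t - ca1_recall[j]) * a
        for i, a in enumerate(ec):    # train auto-encoder (minus phase 2)
            w_ec_ca1[j][i] += lr * (t - ca1_encode[j]) * a
```

After a few such trials, presenting the CA3 pattern alone drives the CA1 state that corresponds to the stored EC pattern, which is the recall pathway the text describes.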
See Hippocampus Theta Phase for details on this theta phase version of the hippocampus, which is recommended for any
computationally demanding hippocampal applications.
Theta oscillations are also thought to play a critical role in the grid cell activations in the EC layers, and perhaps may also
serve to encode temporal sequence information, because place field activity firing shows a theta phase precession, with
let___
and respond with words like "window" or "winter", "letter" or "lettuce". The priming effect is revealed by first exposing
people to one of the possible words for these stems, often in a fairly disguised, incidental manner, and then comparing how
much this influences the subsequent likelihood of completing the stem with it. By randomizing which of the different
words people are exposed to, you can isolate the effects of prior exposure relative to whatever baseline preferences people
might otherwise have. We know that those priming effects are not due to learning in the hippocampus, because they remain
intact in people with hippocampal lesions.
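The logic of this counterbalanced design can be made concrete with a small simulation. The behavioral numbers here (a 50% baseline and a fixed priming boost) are invented purely for illustration:

```python
import random

# Toy simulation of the counterbalanced stem-completion design. The
# behavioral numbers (baseline, boost) are invented for illustration.

BASELINE = 0.5      # no prior preference between the two completions
PRIME_BOOST = 0.2   # extra tendency to produce the word seen earlier
WORDS = ['window', 'winter']   # two completions of the stem "win___"

def run_trial(rng):
    primed = rng.choice(WORDS)   # randomized (counterbalanced) exposure
    if rng.random() < BASELINE + PRIME_BOOST:
        response = primed        # prior exposure biases the completion
    else:
        response = [w for w in WORDS if w != primed][0]
    return primed, response

def priming_effect(n=4000):
    """Completion rate for 'window' when primed vs. unprimed."""
    rng = random.Random(0)
    counts = {True: [0, 0], False: [0, 0]}   # [hits, opportunities]
    for _ in range(n):
        primed, response = run_trial(rng)
        key = (primed == 'window')
        counts[key][1] += 1
        counts[key][0] += (response == 'window')
    return (counts[True][0] / counts[True][1],
            counts[False][0] / counts[False][1])
```

Because exposure is randomized across participants, the difference between the two completion rates isolates the priming effect from whatever baseline preference people might otherwise have.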
Exploration
Now we can explore both weight-based and activation-based priming on a simple stem-completion like task, using a very
generic cortical learning model.
WtPriming (wt_priming.proj)
ActPriming (act_priming.proj)
Explorations
Here are all the explorations covered in the main portion of the Memory chapter:
ABAC (ab_ac_interference.proj) -- Paired associate learning and catastrophic interference. (Questions 8.1 -- 8.3)
Hippocampus (hip.proj) -- Hippocampus model and overcoming interference. (Questions 8.4 -- 8.6)
WtPriming (wt_priming.proj) -- Weight-based (long-term) priming. (Question 8.7)
ActPriming (act_priming.proj) -- Activation-based (short-term) priming. (Question 8.8)
9.1: INTRODUCTION
9.2: BIOLOGY OF LANGUAGE
The classic "textbook" brain areas for language are Broca's and Wernicke's areas, which have been associated with syntax and
semantics, respectively. For example, a person who suffers a stroke or other form of damage to Wernicke's area can produce fluent,
syntactically-correct speech, which is essentially devoid of meaning.
9.1: Introduction
Language involves almost every part of the brain, as covered in other chapters in the text:
Perception and Attention: language requires the perception of words from auditory sound waves, and written text.
Attention is critical for pulling out individual words on the page, and individual speakers in a crowded room. In this
chapter, we see how a version of the object recognition model from the perception chapter can perform written word
recognition, in a way that specifically leverages the spatial invariance property of this model.
Motor control: Language production obviously requires motor output in the form of speech, writing, etc. Fluent speech
depends on an intact cerebellum, and the basal ganglia have been implicated in a number of linguistic phenomena.
Learning and Memory: early word learning likely depends on episodic memory in the hippocampus, while longer-term
memory for word meaning depends on slow integrated learning in the cortex. Memory for recent topics of discourse
and reading (which can span months in the case of reading a novel) likely involves the hippocampus and sophisticated
semantic representations in temporal cortex.
Executive Function: language is a complex mental facility that depends critically on coordination and working memory
from the prefrontal cortex (PFC) and basal ganglia -- for example encoding syntactic structures over time, pronoun
binding, and other more transient forms of memory.
One could conclude from this that language is not particularly special, and instead represents a natural specialization of
domain-general cognitive mechanisms. Of course, people have a specialized articulatory apparatus for producing speech
sounds, which is not shared by other primate species, but one could argue that everything on top of this is just language
infecting pre-existing cognitive brain structures. Certainly reading and writing are too recent to have any evolutionary
adaptations to support it (but it is also the least "natural" aspect of language, requiring explicit schooling, compared to the
essentially automatic manner in which people absorb spoken language).
But language is fundamentally different from any other cognitive activity in a number of important ways:
Symbols -- language requires thought to be reduced to a sequence of symbols, transported across space and time, to be
reconstructed in the receiver's brain.
Syntax -- language obeys complex abstract regularities in the ordering of words and letters/phonemes.
Temporal extent and complexity -- language can unfold over a very long time frame (e.g., Tolstoy's War and Peace),
with a level of complexity and richness conveyed that far exceeds any naturally occurring experiences that might arise
outside of the linguistic environment. If you ever find yourself watching a movie on an airplane without the sound,
you'll appreciate that visual imagery represents the lesser half of most movies' content (the interesting ones, anyway).
Generativity -- language is "infinite" in the sense that the number of different possible sentences that could be
constructed is so large as to be effectively infinite. Language is routinely used to express new ideas. You may find
some of those here.
Culture -- much of our intelligence is imparted through cultural transmission, conveyed through language. Thus,
language shapes cognition in the brain in profound ways.
The "special" nature of language, and its dependence on domain-general mechanisms, represent two poles in the
continuum of approaches taken by different researchers. Within this broad span, there is plenty of room for controversy
and contradictory opinions. Noam Chomsky famously and influentially theorized that we are all born with an innate
universal grammar, with language learning amounting to discovering the specific parameters of that language instance. On
the other extreme, connectionist language modelers such as Jay McClelland argue that completely unstructured, generic
neural mechanisms (e.g., backpropagation networks) are sufficient for explaining (at least some of) the special things about
language.
Our overall approach is clearly based in the domain-general approach, given that the same general-purpose neural
mechanisms used to explore a wide range of other cognitive phenomena are brought to bear on language here. However,
we also think that certain features of the PFC / basal ganglia system play a special role in symbolic, syntactic processing.
At present, these special contributions are only briefly touched upon here, and elaborated just a bit more in the executive
function chapter, but future plans call for further elaboration. One hint at these special contributions comes from mirror
neurons discovered in the frontal cortex of monkeys, in an area thought to be analogous to Broca's area in humans -- these
Figure 9.1: Brain areas associated with two of the most well-known forms of aphasia, or deficit in speech produced by
damage to these areas. Broca's aphasia is associated with impaired syntax but intact semantics, while Wernicke's is the
opposite. This makes sense given their respective locations in the brain: temporal cortex for semantics, and frontal cortex for
syntax.
Figure 9.2: The different components of the vocal tract, which are important for producing the range of speech sounds that
people can produce.
The classic "textbook" brain areas for language are Broca's and Wernicke's areas ( Figure 9.1), which have been
associated with syntax and semantics, respectively. For example, a person who suffers a stroke or other form of damage to
Wernicke's area can produce fluent, syntactically-correct speech, which is essentially devoid of meaning. Here is one
example:
"You know that smoodle pinkered and that I want to get him round and take care of him like you want before", which
apparently was intended to mean: "The dog needs to go out so I will take him for a walk".
In contrast, a person with damage to Broca's area has difficulty producing syntactically correct speech output, typically
producing single content words with some effort, e.g., "dog....walk".
The more modern term for Broca's aphasia is expressive aphasia, indicating a primary deficit in expressing speech.
Comprehension is typically intact, although interestingly there can be deficits in understanding more syntactically complex
sentences. Wernicke's aphasia is known as receptive aphasia, indicating a deficit in comprehension, but also expression of
meaning.
Biologically, the locations of the damage associated with these aphasias are consistent with what we know about these
areas more generally. The ventral posterior area of frontal cortex known as Broca's area (corresponding to Brodmann's
areas 44 and 45) is adjacent to the primary motor area associated with control over the mouth, and thus it represents
supplementary motor cortex for vocal output. Even though Broca's patients can physically move their mouths and other
articulatory systems, they cannot perform the complex sequencing of these motor commands that is necessary to produce
fluid speech. Interestingly, these higher-order motor control areas also seem to be important for syntactic processing, even
for comprehension. This is consistent with the idea that frontal cortex is important for temporally-extended patterning of
behavior according to increasingly complex plans as one moves more anterior in frontal cortex.
Figure 9.3: International Phonetic Alphabet (IPA) for vowels, as a function of where the tongue is positioned (front vs.
back, organized horizontally in figure), and the shape of the lips (vertical axis in figure) -- these two dimensions define a
space of vowel sounds.
Figure 9.4: Version of IPA vowel space with vowel labels used in simulations -- these are all standard roman letters and
thus easier to manipulate in computer programs. Only the subset present in English is used.
Figure 9.5: International Phonetic Alphabet (IPA) for consonants, which are defined in terms of the location where the
flow of air is restricted (place, organized horizontally in the table) and the manner in which it is restricted (plosive,
fricative, etc, organized vertically).
The vocal tract in people ( Figure 9.2) is capable of producing a wide range of different speech sounds, by controlling the
location and manner in which sound waves are blocked or allowed to pass. There are two basic categories of speech
sounds: vowels and consonants. Vowels occur with unobstructed airflow (you can sing a vowel sound over an extended
period), and differ in the location of the tongue and lips ( Figure 9.3 and Figure 9.4). For example, the long "E" vowel
sound as in "seen" is produced with the tongue forward and the lips relatively closed. Consonants involve the blockage of
Figure 9.6: Triangle model of reading pathways: Visual word input (orthography) can produce speech output of the word
either directly via projections to phonology (direct path), or indirectly via projections to semantics that encode the
meanings of words. There is no single "lexicon" of words in this model -- word representations are instead distributed
across these different pathways. Damage to different pathways can account for properties of acquired dyslexia.
There are three major forms of acquired dyslexia that can be simulated with the model:
Phonological -- characterized by difficulty reading nonwords (e.g., "nust" or "mave"). This can be produced by
damage to the direct pathway between orthography and phonology (there shouldn't be any activation in semantics for
nonwords), such that people have difficulty mapping spelling to sound according to learned regularities that can be
applied to nonwords. We'll explore this phenomenon in greater detail in the next simulation.
Deep -- is a more severe form of phonological dyslexia, with the striking feature that people sometimes make semantic
substitutions for words, pronouncing the word "orchestra" as "symphony" for example. There are also visual errors,
so-named because they seem to reflect a misperception of the word inputs (e.g., reading the word "dog" as "dot").
Interestingly, we'll see how more significant damage to the direct pathway can give rise to this profile -- the semantic
errors occur due to everything going through the semantic layer, such that related semantic representations can be
activated. In the normal intact brain, the direct pathway provides the relevant constraints to produce the actual written
word, but absent this constraint, an entirely different but semantically related word can be output.
Surface -- here nonword reading is intact, but access to semantics is impaired (as in Wernicke's aphasia), strongly
implicating a lesion in the semantics pathway. Interestingly, pronunciation of exception words (e.g., "yacht") is
impaired. This suggests that people typically rely on the semantic pathway to "memorize" how to pronounce odd words
like yacht, and the direct pathway is used more for pronouncing regular words.
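The division of labor behind these three profiles can be caricatured as two routes: a direct route that applies productive spelling-to-sound regularities (so it handles nonwords but regularizes exceptions), and a semantic route that retrieves known words (so it handles exceptions but fails on nonwords). The tiny lexicon and the single "-ave" rule below are made-up stand-ins, with lesions modeled as disabled routes:

```python
# Toy two-route caricature of the triangle model, with lesions as
# disabled routes. The lexicon and the single "-ave" rule are made-up
# stand-ins for the learned pathways.

LEXICON = {            # semantic route: known words -> pronunciation
    'have': 'hav',     # exception word (irregular pronunciation)
    'gave': 'gAv',     # regular word
}

def regular_rule(spelling):
    """Direct route: productive spelling-to-sound regularity."""
    if spelling.endswith('ave'):
        return spelling[:-3] + 'Av'   # the regular "-ave" pronunciation
    return spelling

def read(word, direct_ok=True, semantic_ok=True):
    if semantic_ok and word in LEXICON:
        return LEXICON[word]          # semantic route handles known words
    if direct_ok:
        return regular_rule(word)     # direct route generalizes to nonwords
    return None                       # no intact route can read this
```

Disabling the direct route leaves nonwords like "mave" unreadable (phonological dyslexia), while disabling the semantic route leaves nonwords intact but regularizes exception words like "have" (surface dyslexia).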
Exploration
Open Dyslexia for the simulation of the triangle model and associated forms of dyslexia. This model allows you to
simulate the different forms of acquired dyslexia, in addition to normal reading, using the small corpus of words as
shown in Figure 9.7. In the next section, we expand upon the direct pathway and examine nonword reading, which
requires a much larger corpus of words to acquire the relevant statistical regularities that support generalization.
See Dyslexia Details for detailed graphs of the effects of partial lesions to the different pathways within the triangle model
-- you can reproduce this data in the simulation, but the graphs provide a much clearer overall visualization of the results.
The qualitative patterns produced by complete lesions of either the direct or semantic pathway hold up quite well over the
range of partial damage.
I cnduo't bvleiee taht I culod aulaclty uesdtannrd waht I was rdnaieg. Unisg the icndeblire pweor of the hmuan mnid,
aocdcrnig to rseecrah at Cmabrigde Uinervtisy, it dseno't mttaer in waht oderr the lterets in a wrod are, the olny
irpoamtnt tihng is taht the frsit and lsat ltteer be in the rhgit pclae. The rset can be a taotl mses and you can sitll raed it
whoutit a pboerlm. Tihs is bucseae the huamn mnid deos not raed ervey ltteer by istlef, but the wrod as a wlohe.
Aaznmig, huh? Yaeh and I awlyas tghhuot slelinpg was ipmorantt! See if yuor fdreins can raed tihs too.
Clearly this is more effortful than properly spelled text, but the ability to read it at all indicates that just extracting
individual letters in an invariant manner goes a long way.
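Paragraphs like the one above can be generated mechanically: shuffle the interior letters of each word while keeping the first and last letters in place. A minimal sketch (punctuation handling is ignored for simplicity):

```python
import random

# Mechanical generator for scrambled text like the paragraph above:
# shuffle interior letters, keep first and last letters in place.

def jumble_word(word, rng):
    if len(word) <= 3 or not word.isalpha():
        return word          # too short (or punctuated) to scramble
    inner = list(word[1:-1])
    rng.shuffle(inner)
    return word[0] + ''.join(inner) + word[-1]

def jumble(text, seed=0):
    rng = random.Random(seed)
    return ' '.join(jumble_word(w, rng) for w in text.split())
```

Every output word is an anagram of the input word with its endpoints fixed, which is exactly the property the scrambled paragraph relies on.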
Nonword Generalization Performance
Nonword Set | ss Model | PMSP | People
Table 9.1: Comparison of nonword reading performance for our spelling-to-sound model (ss Model), the PMSP model,
and data from people, across a range of different nonword datasets as described in the text. Our model performs
comparably to people, after learning on nearly 3,000 English monosyllabic words.
To test the performance of this object-recognition based approach, we ran it on a number of standard sets of nonwords,
several of which were also used to test the PMSP model. The results are shown in Table 9.1.
Exploration
Open Spelling to Sound to explore the spelling-to-sound model, and test its performance on both word and nonword
stimuli.
Exploration
Open Semantics for the sem.proj exploration of semantic learning of word co-occurrences. The model here was trained
on an early draft of the first edition of this textbook, and thus has relatively specialized knowledge, hopefully much of
which is now shared by you the reader.
Figure 9.10: Syntactic diagram of a basic sentence. S = sentence; NP = noun phrase; Art = article; N = noun; VP = verb
phrase; V = verb.
Having covered some of the interesting properties of language at the level of individual words, we now take one step
higher, to the level of sentences. This step brings us face-to-face with the thorny issue of syntax. The traditional approach
to syntax assumes that people assemble something akin to those tree-like syntactic structures you learned (or maybe not) in
school ( Figure 9.10). But given that these things need to be explicitly taught, and don't seem to be the most natural way
of thinking for many people, it seems perhaps unlikely that this is how our brains actually process language.
These syntactic structures also assume a capacity for role-filler binding that is actually rather challenging to achieve in
neural networks. For example, the assumption is that you somehow "bind" the noun boy into a variable slot that is
designated to contain the subject of the sentence. And once you move on to the next sentence, this binding is replaced with
the next one. This constant binding and unbinding is rather like the rotation of a wheel on a car -- it tends to rip apart
anything that might otherwise try to attach to the wheel. One important reason people have legs instead of wheels is that
we need to provide those legs with a blood supply, nerves, etc, all of which could not survive the rotation of a wheel.
Similarly, our neurons thrive on developing longer-term stable connections via physical synapses, and are not good at this
rapid binding and unbinding process. We focus on these issues in greater depth in the Executive Function Chapter.
An alternative way of thinking about sentence processing that is based more directly on neural network principles is
captured in the Sentence Gestalt model of St. John & McClelland (1990). The key idea is that both syntax and semantics
merge into an evolving distributed representation that captures the overall gestalt meaning of a sentence, without requiring
all the precise syntactic bindings assumed in the traditional approach. We don't explicitly bind boy to subject, but rather
encode the larger meaning of the overall sentence, which implies that the boy is the subject (or more precisely, the agent),
because he is doing the chasing.
One advantage of this way of thinking is that it more naturally deals with all the ambiguity surrounding the process of
parsing syntax, where the specific semantics of the collection of words can dramatically alter the syntactic interpretation. A
classic demonstration of this ambiguity is the sentence:
Time flies like an arrow.
which may not seem very ambiguous, until you consider alternatives, such as:
Fruit flies like a banana.
The word flies can be either a verb or a noun depending on the semantic context. Further reflection reveals several more ambiguous interpretations of the first sentence, which are fun to let take hold of your brain as you re-read the sentence. Another example from Rohde (2002) is:
The slippers were found by the nosy dog.
The slippers were found by the sleeping dog.
Just a single subtle word change recasts the entire meaning of the sentence, from one where the dog is the agent to one where it plays a more peripheral role (the exact syntactic term for which is unknown to the authors).
If you don't bother with the syntactic parse in the first place, and just try to capture the meaning of the sentence, then none
of this ambiguity really matters. The meaning of a sentence is generally much less ambiguous than the syntactic parse --
getting the syntax exactly right requires making a lot of fine-grained distinctions that people may not actually bother with.
But the meaning does depend on the exact combination of words, so there is a lot of emergent meaning in a sentence.
The model structure ( Figure 9.11) has single word inputs (using localist single-unit representations of words) projecting
up through an encoding hidden layer to the gestalt layer, which is where the distributed representation of sentence meaning
develops. The memory for prior words and meaning interpretations of the sentence is encoded via a context layer, which is
a copy of the gestalt layer activation state from the previous word input. This context layer is known as a simple recurrent
network (SRN), and it is widely used in neural network models of temporally extended tasks (we discuss this more in the
next chapter on executive function). The network training comes from repeated probing of the network for the various
semantic roles enumerated above (e.g., agent vs. patient). A role input unit is activated, and then the network is trained to
activate the appropriate response in the filler output layer.
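The context-copy mechanism of the SRN can be sketched in a few lines of NumPy. This is an illustrative skeleton only (untrained random weights, tanh units, and names of our own choosing), not the actual trained sentence gestalt model:

```python
import numpy as np

def srn_step(x, context, W_in, W_ctx, b):
    """One SRN step: the hidden (gestalt) state is computed from the
    current word input plus a context layer that holds a copy of the
    previous hidden state."""
    return np.tanh(W_in @ x + W_ctx @ context + b)

def run_sentence(words, W_in, W_ctx, b, n_hidden):
    """Present a sequence of localist (one-hot) word inputs; the
    context starts empty and is updated to the hidden state after
    each word, so the final state is an order-sensitive gestalt."""
    context = np.zeros(n_hidden)
    for x in words:
        context = srn_step(x, context, W_in, W_ctx, b)
    return context
```

Because the context layer feeds back into each new hidden state, presenting the same words in a different order generally yields a different final gestalt, which is exactly the order sensitivity that sentence meaning requires.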
Figure 9.12: Cluster plot over the gestalt layer of patterns associated with the different nouns, showing that these
distributed representations capture the semantic similarities of the words (much as in the LSA-like semantics model
explored in the previous section).
Figure 9.14: Cluster plot over the gestalt layer of patterns associated with a set of test sentences designed to test for
appropriate similarity relationships. sc = schoolgirl; st = stirred; ko = kool-aid; te = teacher; bu = busdriver; pi = pitcher;
dr = drank; ic = iced-tea; at = ate; so = soup; st = steak;
Figure 9.12 shows a cluster plot of the gestalt layer representations of the different nouns, while Figure 9.13 shows the
different verbs. These indicate that the network does develop sensible semantic similarity structure for these words.
Probing further, Figure 9.14 shows the cluster plot for a range of related sentences, indicating a sensible verb-centric
semantic organization (sentences sharing the same verb are all clustered together).
Exploration
Open Sentence Gestalt to explore the sentence gestalt model.
Explorations
Here are all the explorations covered in the main portion of the Language chapter:
Dyslexia (dyslex.proj) -- Normal and disordered reading and the distributed lexicon. (Questions 9.1 -- 9.6)
Spelling to Sound (ss.proj) -- Orthography to Phonology mapping and regularity, frequency effects. (Questions 9.7 --
9.8)
Semantics (sem.proj) -- Semantic Representations from World Co-occurrences and Hebbian Learning. (Questions 9.9 -
- 9.11)
Sentence Gestalt (sg.proj) -- The Sentence Gestalt model. (Question 9.12)
10.1: INTRODUCTION
10.2: BIOLOGY OF PFC/BG AND DOPAMINE SUPPORTING ROBUST ACTIVE MAINTENANCE
10.3: THE PBWM COMPUTATIONAL MODEL
10.4: TOP-DOWN COGNITIVE CONTROL FROM SUSTAINED PFC FIRING- THE STROOP MODEL
10.5: DEVELOPMENT OF PFC ACTIVE MEMORY STRENGTH AND THE A-NOT-B TASK
10.6: DYNAMIC UPDATING OF PFC ACTIVE MEMORY- THE SIR MODEL
10.7: MORE COMPLEX DYNAMIC UPDATING OF PFC ACTIVE MEMORY- THE N-BACK TASK
10.8: HIERARCHICAL ORGANIZATION OF PFC- SUBTASKS, GOALS, COGNITIVE SEQUENCING
10.9: AFFECTIVE INFLUENCES OVER EXECUTIVE FUNCTION- ROLES OF THE OFC AND ACC
10.10: OTHER EXECUTIVE FUNCTIONS
10.11: ALTERNATIVE FRAMEWORKS AND MODELING APPROACHES
10.12: SUMMARY OF KEY POINTS
10.13: SUBTOPICS AND EXPLORATIONS
10.14: EXTERNAL RESOURCES
10.1: Introduction
We have now reached the top of the cognitive neuroscience hierarchy: the "executive" level. In a business, an executive
makes important decisions and plans, based on high-level information coming in from all the different divisions of the
company, and with a strong consideration of "the bottom line." In a person, the executive level of processing, thought to
occur primarily within the prefrontal cortex (PFC), similarly receives high-level information from posterior cortical
association areas, and is also directly interconnected with motivational and emotional areas that convey "the bottom line"
forces that ultimately guide behavior. Although many of us walk around with the impression (delusion?) that our actions
are based on rational thought and planning, instead it is highly likely that basic biological motivations and affective signals
play a critical role in shaping what we do. At least, this is what the underlying biology of the PFC and associated brain
areas suggests. And yet, it is also clear that the PFC is critical for supporting more abstract reasoning and planning
abilities, including the ability to ignore distraction and other influences in the pursuit of a given goal. We will try to
unravel the mystery of this seemingly contradictory coexistence of abilities in the PFC in this chapter.
Evidence for the importance of the PFC in higher-level cognitive control comes from the environmental dependency
syndrome associated with damage to PFC. In one classic example, a patient with PFC damage visited a researcher's home
and, upon seeing the bed, proceeded to get undressed (including removal of his toupee!), got into bed, and prepared to
sleep. The environmental cues overwhelmed any prior context about what one should do in the home of someone you don't
know very well. In other words, without the PFC, behavior is much more reflexive and unthinking, driven by the
affordances of the immediate sensory environment, instead of by some more abstract and considered plan or goals. You
don't need actual PFC damage to experience this syndrome -- certainly you have experienced yourself absent-mindedly
doing something cued by the immediate sensory environment that you hadn't otherwise planned to do (e.g., brushing your
teeth a second time before going to bed because you happened to see the toothbrush). We all experience lapses in attention
-- the classic stereotype of an absent-minded professor is not explained by lack of PFC in professors, but rather that the
PFC is apparently working on something else and thus leaves the rest of the brain to fend for itself in an environmentally-
dependent manner.
Another great source of insight into the cognitive contributions of the PFC is available to each of us every night, in the
form of our dreams. It turns out that the PFC is one of the brain areas most inactivated during dreaming phases of sleep. As
a result, our dreams often lack continuity, and seem to jump from one disconnected scene to another, with only the most
tangential thread connecting them. For example, one moment you might be reliving a tense social situation from high
school, and the next you're trying to find out when the airplane is supposed to leave, with a feeling of general dread that
you're hopelessly late for it.
So what makes the PFC uniquely capable of serving as the brain's executive? Part of the answer is its connectivity, as
alluded to above -- it sits on top of the overall information processing hierarchy of the brain, and thus receives highly-
processed "status reports" about everything important going on in your brain. In this sense it is similar to the hippocampus
as we saw in the Memory Chapter, and indeed these areas appear to work together. However, the PFC is also especially
well placed to exert control over our actions -- the PFC is just in front of the frontal motor areas (see the Motor Chapter),
and has extensive connectivity to drive overt (and covert) motor behavior. Furthermore, the medial and ventral areas of
PFC are directly interconnected with affective processing areas in subcortical regions such as the amygdala, thus enabling
it to be driven by, and reciprocally, to amplify or override, motivational and affective signals.
In addition to being in the right place, the PFC also has some special biological properties that enable it to hold onto
information in the face of distraction, e.g., from incoming sensory signals. Thus, with an intact PFC, you can resist the idea
of lying down in someone else's bed, and remain focused on the purpose of your visit. We refer to this ability as robust
active maintenance because it depends on the ability to keep a population of neurons actively firing over the duration
needed to maintain a goal or other relevant pieces of information. This ability is also referred to as working memory, but
this latter term has been used in many different ways in the literature, so we are careful to define it as synonymous with
robust active maintenance of information in the PFC, in this context. We will see later how active maintenance works
together with a gating system that allows us to hold in mind more than one item at a time, to selectively update and
manipulate some information while continuing to maintain others, in a way that makes the integrated system support more
sophisticated forms of working memory.
Figure 10.1: Schematic of functional relationships and connectivity of PFC, BG, and SNc phasic dopamine signals in
relationship to basic loop between sensory input and motor output. The PFC provides top-down context and control over
posterior cortical processing pathways to ensure that behavior is task and context appropriate. The BG exerts a
disinhibitory gating influence over PFC, switching between robust maintenance and rapid updating. The SNc (substantia
nigra pars compacta) exhibits phasic dopamine (DA) firing to drive learning and modulation of BG circuits, thereby
training the BG gating signals in response to task demands (external reward signals).
The overall connectivity of the areas that are particularly important for executive function is shown in Figure 10.1, in
relation to the sensory and motor processing associated with posterior cortex (temporal, parietal and occipital lobes) and
motor frontal cortex (i.e., frontal cortex posterior to the prefrontal cortex). The PFC is interconnected with higher-level
association cortical areas in posterior cortex where highly processed and abstracted information about the sensory world is
encoded. It also interconnects with higher-level motor control areas (premotor cortex, supplementary motor areas), which
coordinate lower-level motor control signals to execute sequences of coordinated motor outputs. With this pattern of
connectivity, PFC is in a position to both receive from, and exert influence over, the processing going on in posterior and
motor cortex.
Figure 10.2: Parallel circuits through the basal ganglia for different regions of the frontal cortex -- each region of frontal
cortex has a corresponding basal ganglia circuit, for controlling action selection/initiation in that frontal area. Motor loop:
SMA = supplementary motor area -- the associated striatum (putamen) also receives from premotor cortex (PM), and
primary motor (M1) and somatosensory (S1) areas -- everything needed to properly contextualize motor actions.
Oculomotor loop: FEF = frontal eye fields, also receives from dorsolateral PFC (DLPFC), and posterior parietal cortex
(PPC) -- appropriate context for programming eye movements. Prefrontal loop: DLPFC also controlled by posterior
parietal cortex, and premotor cortex. Orbitofrontal loop: OFC = orbitofrontal cortex, also receives from inferotemporal
cortex (IT), and anterior cingulate cortex (ACC). Cingulate loop: ACC also modulated by hippocampus (HIP), entorhinal
cortex (EC), and IT.
The Basal Ganglia (BG), which consists principally of the striatum (caudate, putamen, nucleus accumbens), globus
pallidus, and subthalamic nucleus, is densely interconnected with the PFC by way of specific nuclei of the thalamus ( Figure 10.2). As described in detail in Motor Control and Reinforcement Learning, the BG provides a dynamic, adaptive
gating influence on the frontal cortex, by disinhibiting the excitatory loop between PFC and the thalamus. In the context of
the PFC, this gating influence controls the updating of information that is actively maintained in the PFC, using the same
mechanisms that control the initiation of motor actions in the context of motor control. Also, top-down projections from
PFC to the subthalamic nucleus support a type of inhibitory control over behavior by detecting conditions under which
ongoing action selection should be halted or switched and preventing the rest of the BG circuitry from gating the planned
motor action.
The ability of PFC neurons to exhibit sustained active firing over delays, as initially discovered by Fuster & Alexander (1971) and Kubota & Niki (1971), is shown in Figure 10.3 panel B ("Neuron with delay signal"), in the context of
the delayed saccading task described in the introduction. Other subsets of PFC neurons also exhibit other firing patterns,
such as responding transiently to visual inputs (Panel C) and initiating movements (Panel D). This differentiation of neural
response patterns in PFC has important functional implications that we capture in the PBWM model described later.
Figure 10.4: Two types of reverberant loops can support actively maintained representations in cortical tissue: 1)
Corticocortical interconnections among pyramidal neurons within the same PFC stripe (horizontal blue arrows); 2)
Thalamocortical connections between PFC and thalamic relay cells (TRC's) (vertical blue arrows). Both use mutually
supportive recurrent excitation plus intrinsic maintenance currents (NMDARs; mGluRs).
Figure 10.5: Brodmann numbers for areas of the prefrontal cortex, each of which has been associated with a different
mixture of executive functions. Reproduced from Fuster, 2001.
Anatomically, the frontal lobes constitute those cortical areas anterior to the central sulcus. Immediately anterior to the
central sulcus, and thus most posteriorly in frontal cortex, is the primary motor cortex (M1), which is most prominently
seen on the lateral surface but extends all the way over the dorsal surface and onto the medial side. Contiguous tissue
roughly anterior to M1 makes up planning motor areas, the premotor (PM) cortex (laterally) and supplementary motor
areas (SMA, pre-SMA; medially). Then, anterior to that are the PFC areas, labeled with their Brodmann numbers in
Figure 10.5.
Within each functional PFC area, there is some interesting topographic organization of neurons into hypercolumns,
macrocolumns or stripes (each of these terms is typically associated with a similar type of neural organization, but in
different parts of the cortex, with stripes being specific to the PFC) ( Figure 10.7, Figure 10.8). In all areas of cortex,
one can identify the smallest level of neural topological organization as a cortical column or microcolumn (to more clearly
distinguish it from the larger macrocolumn), which contains roughly 20 pyramidal neurons in a region that is roughly 50
microns across. A stripe contains roughly 100 of these microcolumns, generally organized in an elongated shape that is
roughly 5 microcolumns wide (250 microns) by 20 microcolumns long (1000 microns or 1 millimeter). Each such stripe is
interconnected with a set of roughly 10 or more other stripes, which we can denote as a stripe cluster. Given the size of the
human frontal cortex, there may be as many as 20,000 stripes within all of frontal cortex (including motor areas).
In PFC and other areas, neurons within a microcolumn tend to encode very similar information, and may be considered
equivalent to a single rate-coded neuron of the sort that we typically use in our models. We can then consider an individual
stripe as containing roughly 100 such rate-coded neuron-equivalents, which provides sufficient room to encode a
reasonably large number of different things using sparse distributed representations across microcolumns.
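Putting the anatomical numbers above together (all are rough order-of-magnitude estimates from the text, not precise counts):

```python
# Back-of-the-envelope totals implied by the stripe anatomy described above.
NEURONS_PER_MICROCOLUMN = 20        # pyramidal neurons in a ~50 micron column
MICROCOLUMNS_PER_STRIPE = 100       # ~5 wide x ~20 long
STRIPES_IN_FRONTAL_CORTEX = 20_000  # upper estimate, including motor areas

neurons_per_stripe = NEURONS_PER_MICROCOLUMN * MICROCOLUMNS_PER_STRIPE
rate_coded_units_per_stripe = MICROCOLUMNS_PER_STRIPE  # one rate-coded unit per microcolumn
frontal_neurons = STRIPES_IN_FRONTAL_CORTEX * neurons_per_stripe

print(neurons_per_stripe)   # 2000 pyramidal neurons per stripe
print(frontal_neurons)      # 40,000,000 across all of frontal cortex
```

So each stripe corresponds to roughly 100 rate-coded units and 2,000 actual neurons, and the stripe-level estimate implies on the order of 40 million pyramidal neurons across frontal cortex.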
Functionally, we hypothesize in the PBWM model that each stripe can be independently updated by a corresponding
stripe-wise loop of connectivity with an associated stripe of neurons through the BG system. This allows for very fine-
grained control by the BG over the updating and maintenance of information in PFC, as we describe next.
As we discussed in the Motor and Reinforcement Learning Chapter, the Basal Ganglia (BG) is in a position to modulate
the activity of the PFC, by virtue of its control over the inhibition of the thalamic neurons that are bidirectionally
connected with the PFC ( Figure 10.9). In the default state of no striatal activity, or firing of indirect (NoGo) pathway
neurons, the SNr (substantia nigra pars reticulata) or GPi (globus pallidus internal segment) neurons tonically inhibit the
thalamus. This prevents the thalamocortical loop from being activated, and it is activation of this loop that is thought to be
critical for initiating motor actions or updating PFC active memory representations. When the striatal Go (direct) pathway
neurons fire, they inhibit the tonic SNr/GPi inhibition, thereby allowing the excitatory thalamocortical loop to be activated.
This wave of excitatory activation can activate a new population of PFC neurons, which are then actively maintained until
a new Go signal is fired.
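The disinhibitory gating logic just described can be summarized in a toy sketch (all values, names, and the threshold are illustrative, not fit to biology):

```python
def gpi_thal_gate(go_activity, nogo_activity, threshold=0.5):
    """Disinhibition sketch: SNr/GPi tonically inhibits the thalamus
    (baseline inhibition of 1.0); striatal Go firing suppresses that
    inhibition while NoGo firing reinforces it. The thalamocortical
    loop activates only when inhibition drops below threshold."""
    snr_gpi_inhibition = max(0.0, 1.0 - go_activity + nogo_activity)
    return snr_gpi_inhibition < threshold

def pfc_update(maintained, superficial, gate_open):
    """On a Go (gate open), PFC maintenance is updated to the current
    superficial-layer state; otherwise prior contents are maintained."""
    return superficial if gate_open else maintained
```

The key point captured here is that the default (no striatal activity) leaves the gate closed, so PFC robustly maintains its prior contents until a Go signal explicitly opens the gate.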
Figure 10.11: Components of a PBWM model, based on biological connections and functions of the PFC (robust active maintenance of task-relevant information), Basal Ganglia (BG; dynamic gating of PFC active maintenance), and PVLV (phasic dopamine signals for training the BG gating). Each component performs its own specialized job, and in interaction they produce a capable overall executive function system, after sufficient learning experience.
The biological properties of the PFC/BG system that we reviewed above are captured in a computational model called
PBWM (prefrontal cortex basal ganglia working memory) (O'Reilly & Frank, 2006; Hazy et al., 2006, 2007) ( Figure
10.11). The PFC neurons in this model are organized into separately-updatable stripes, and also into separate functional
groups of maintenance and output gating (described more below). Furthermore, each PFC stripe is represented in terms of
superficial layers (2,3) and deep layers (5,6) -- the deep layer neurons specifically have the ability to sustain firing over
time through a variety of mechanisms, representing the effects of NMDA and mGluR channels and excitatory loops
through the thalamus. The flow of activation from the superficial to deep layers of a given PFC stripe is dependent on BG
gating signals, with the BG layers also organized into corresponding maintenance and output gating stripes. The Matrix
layer of the BG (representing the matrisomes of the striatum) has separate Go and NoGo neurons that project to a
combined GPi and thalamus (GPiThal) layer with a single neuron per stripe that fires if the Go pathway is sufficiently
stronger than the NoGo (this mechanism abstracts away from the detailed BG gating circuitry involving the GPe, GPi/SNr,
STN and thalamus, as simulated in the motor chapter, and simply summarizes functionality in a single GPiThal layer). A
GPiThal Go signal will update the PFC deep layer activations to reflect the current superficial layer activations, while a
NoGo leaves the PFC alone to continue to maintain prior information (or nothing at all).
The PVLV phasic dopamine system drives learning of the BG Go and NoGo neurons, with positive DA bursts leading to
facilitation of Go and depression of NoGo weights, and vice-versa for DA dips -- using the same reinforcement learning
mechanisms described in the Motor chapter.
The main dynamics of behavior of the different PBWM components are illustrated in Figure 10.12 (Not Created Yet).
Perhaps the single most important key for understanding how the system works is that it uses trial and error exploration of
different gating strategies in the BG, with DA reinforcing those strategies that are associated with positive reward, and
punishing those that are not. In the current version of the model, Matrix learning is driven exclusively by dopamine firing
at the time of rewards, and it uses a synaptic-tag-based trace mechanism to reinforce/punish all prior gating actions that led
up to this dopaminergic outcome. Specifically, when a given Matrix unit fires for a gated action, synapses with active input
establish a synaptic tag, which persists until a subsequent phasic dopaminergic outcome signal. Extensive research has
shown that these synaptic tags, based on actin fiber networks in the synapse, can persist for up to 90 minutes, and when a
subsequent strong learning event occurs, the tagged synapses are also strongly potentiated (Redondo & Morris, 2011;
Rudy, 2015; Bosch & Hayashi, 2012). This form of trace-based learning is very effective computationally, because it does
not require any other mechanisms to enable learning about the reward implications of earlier gating events. In earlier
versions of the PBWM model, we relied on CS (conditioned stimulus) based phasic dopamine to reinforce gating, but this
scheme requires that the PFC maintained activations function as a kind of internal CS signal, and that the amygdala learn
to decode these PFC activation states to determine if a useful item had been gated into memory. Compared to the trace-
based mechanism, this CS-dopamine approach is much more complex and error-prone. Instead, in general, we assume that
the CS's that drive Matrix learning are more of the standard external type, which signal progress toward a desired outcome,
and thus reinforce actions that led up to that intermediate state (i.e., the CS represents the achievement of a subgoal).
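A minimal sketch of the synaptic-tag trace mechanism follows (names, learning rate, and data structures are our own illustrative choices; the actual PBWM implementation is considerably more detailed):

```python
def tag_synapses(tags, active_inputs, unit_gated):
    """When a Matrix unit fires for a gated action, synapses with
    active inputs establish a persistent tag."""
    if unit_gated:
        for i in active_inputs:
            tags[i] = 1.0
    return tags

def apply_dopamine(weights, tags, da, lr=0.1, is_go=True):
    """At a phasic DA outcome, tagged synapses are modified: bursts
    (da > 0) strengthen Go and weaken NoGo weights, and dips (da < 0)
    do the reverse; tags are cleared afterward."""
    sign = 1.0 if is_go else -1.0
    for i, tag in enumerate(tags):
        weights[i] += lr * sign * da * tag
        tags[i] = 0.0
    return weights, tags
```

Because only tagged synapses are modified at outcome time, credit flows back to all the gating actions that led up to the reward without any extra bookkeeping, which is what makes this trace-based scheme computationally simple and effective.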
The presence of multiple stripes is typically important for the PBWM model to learn rapidly, because it allows different gating strategies to be explored in parallel, instead of having a single stripe sequentially explore all the different such strategies on its own.
Output Gating
Figure 10.13: Schematic to illustrate the division of labor between maintenance-specialized stripes and corresponding
output-specialized stripes. A - Maintenance stripe (left) in maintenance mode, with corticothalamocortical reverberant
activity shown (red). Information from that stripe projects via layer Vb pyramidals to a thalamic relay cell for the
corresponding output stripe, but the BG gate is closed from tonic SNr/GPi inhibition so nothing happens (gray). B - Output
gate opens due to Go-signal generated disinhibition of SNr/GPi output (green), triggering burst firing in the thalamic
relay cell, which in turn activates the corresponding cortical stripe representation for the appropriate output. Projection
from output stripe's layer Vb pyramidal cells then activates cortical and subcortical action/output areas, completing a
handoff from maintenance to output. MD = mediodorsal nucleus of the thalamus; VP/VL = ventroposterior or ventrolateral
(motor) thalamic nuclei.
As we saw in Figure 10.3, some PFC neurons exhibit delay-period (active maintenance) firing, while others exhibit
output response firing. These populations do not appear to mix: a given neuron does not typically exhibit a combination of
both types of firing. This is captured in the PBWM framework by having a separate set of PFC stripes that are output gated
instead of maintenance gated, which means that maintained information can be subject to further gating to determine
whether or not it should influence downstream processing (e.g., attention or motor response selection). We typically use a
simple pairing of maintenance and output gating stripes, with direct one-to-one projections from maintenance to output
PFC units, but there can be any form of relationship between these stripes. The output PFC units are only activated,
however, when their corresponding stripe-level BG/GPiThal Go pathway fires. Thus, information can be maintained in an
active but somewhat "offline" form, before being actively output to drive behavior. Figure 10.13 illustrates this division
of labor between the maintenance side and the output side for gating and how a "handoff" can occur.
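The maintenance/output handoff can be caricatured as follows (a toy sketch with hypothetical names; the gating signals are supplied externally here, whereas in PBWM they are learned by the corresponding BG circuits):

```python
class PairedStripes:
    """Toy sketch of a maintenance stripe paired one-to-one with an
    output stripe, as in the simple pairing described above."""
    def __init__(self):
        self.maintained = None  # deep-layer maintenance contents
        self.output = None      # output-stripe activation

    def step(self, superficial, maint_go, out_go):
        if maint_go:            # maintenance gate: store new contents
            self.maintained = superficial
        # output gate: contents drive downstream areas only on Go
        self.output = self.maintained if out_go else None
        return self.output
```

Note how information can sit in the maintenance stripe, actively held but "offline," across any number of steps before an output Go finally lets it drive behavior.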
For more PBWM details, including further considerations for output gating, how maintained information is cleared when
no longer needed (after output gating), and gating biases that can help improve learning, see PBWM details Subtopic,
which also includes relevant equations and default parameters.
Figure 10.14: The Stroop task requires either reading the word or naming the ink color of stimuli such as these. When
there is a conflict between the word and the ink color, the word wins because reading is far more practiced. Top-down biasing (executive control) is required to overcome the dominance of word reading, by providing excitatory support
for the weaker ink color naming pathway.
Figure 10.15: Typical data from neurologically intact participants on the Stroop task, showing differentially slowed
performance on the conflict (incongruent) color naming condition. Damage to the PFC produces a differential impairment
in this condition relative to the others, indicating that PFC is providing top-down excitatory biasing to support color
naming.
In the Stroop paradigm ( Figure 10.14) subjects are presented with color words (e.g., "red", "green") one at a time and are
required to either read the word (e.g., "red"), or name the color of the ink that the word is written in. Sometimes the word
"red" appears in green ink, which represents the incongruent or conflict condition. The "Stroop effect" is that error rates
and response times are larger for this incongruent condition, especially in the case of color naming ( Figure 10.15). That
color naming is particularly difficult in the incongruent condition has been attributed to the relatively "automatic", well-
practiced nature of reading words, so that the natural tendency to read the word interferes with attending to, and naming,
the color of the ink.
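The competition just described can be caricatured with two scalar pathway strengths (the values are arbitrary stand-ins for "more practiced" vs. "less practiced," not fit to any data):

```python
def stroop_response(word_strength=1.0, color_strength=0.4,
                    pfc_color_bias=0.0):
    """Toy pathway competition: word reading is stronger because it is
    far more practiced. Top-down PFC bias adds excitatory support to
    the weaker color-naming pathway so it can win when the task is
    color naming. All strengths are illustrative assumptions."""
    word_act = word_strength
    color_act = color_strength + pfc_color_bias
    return "word" if word_act > color_act else "color"
```

Without top-down support the prepotent word-reading pathway dominates on incongruent trials; adding sufficient PFC bias to the color pathway flips the outcome, which is the essence of the Stroop model explored below.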
Figure 10.16: Hierarchical action selection across multiple prefrontal basal ganglia loops. On the far right, at the most
anterior level, the PFC represents contextual information that is gated by its corresponding BG loop based on the
probability that maintaining this context for guiding lower level actions is predictive of reward. The middle loop involves
both input and output gating. The input gating mechanism allows stimulus representations S to update a PFC_maint layer,
while the output gating mechanism gates out a subset of maintained information conditional on the context in anterior
PFC. Its associated BG layer learns the reward probability of output gating given the maintained stimulus S and the
context. Finally, the left-most motor loop learns to gate simple motor responses based on their reward probabilities
conditional on the stimulus, as in the single loop BG model described in the Motor chapter, but where here relevant
stimulus features are selected by the more anterior loops. Reproduced from Frank & Badre (2012).
For related models simulating hierarchical control over action across multiple PFC-BG circuits, see Reynolds & O'Reilly,
2009; Frank & Badre, 2012 and Collins & Frank (2013). The latter model considers situations in which there are multiple
potential rule sets signifying which actions to select in particular sensory states, and where the appropriate rule set might
depend on a higher level context. (For example, your tendency to greet someone with a hug, kiss, handshake, or wave
might depend on the situation: your relationship to the person, whether you are in the street or at work, etc. And when you
go to a new country (or city), the rule set to apply may be the same as that you've applied in other countries, or it might
require creating a new rule set). More generally, we refer to the higher level rule as a "task-set" which contextualizes how
to act in response to many different stimuli. Hierarchical PFC-BG networks can learn to create these PFC task-sets, and
simultaneously, which actions to select in each task-set. Critically, with this hierarchical representation, the learned PFC
representations are abstract and independent of the contexts that cue them, facilitating generalization and transfer to other
contexts, while also identifying when new task-sets need to be created. They also allow for new knowledge to be appended to
existing abstract task structures, which then can be immediately transferred to other contexts that cue them (much like
learning a new word in a language: you can immediately then re-use that word in other contexts and with other people). To
see this network in action, including demonstrations of generalization and transfer, see the Collins & Frank network linked
here. Various empirical data testing this model have shown that indeed humans (including babies!) represent such task-sets
in a hierarchical manner (even when not cued to do so, and even when it is not beneficial for learning) in such a way that
facilitates generalization and transfer; and that the extent of this hierarchical structure is related to neural signatures in PFC
and BG (see e.g., Badre & Frank, 2012; Collins et al., 2014; Collins & Frank, 2016; Werchan et al, 2016).
To put many of the elements explored above to their most important use, we explore how the coordinated interactions of
various regions of the PFC (including the affective areas explored previously), together with BG gating, enable the system
to behave in a coherent, task-driven manner over multiple sequential steps of cognitive processing. This is really the
hallmark of human intelligence: we can solve complex problems by performing a sequence of simpler cognitive steps, in a
flexible, adaptive manner. More abstract cognitive models such as ACT-R provide a nice characterization of the functional capabilities of this kind of sequential cognitive processing.
Explorations
Here are all the explorations covered in the main portion of this chapter collected in one place for convenience/easy
browsing. These may or may not be optional for a given course, depending on the instructor's specifications for what to
cover.
Stroop (stroop.proj) -- The Stroop effect and PFC top-down biasing (Questions 10.1 - 10.3)
A Not B (a_not_b.proj) -- Development of PFC active maintenance and the A-not-B task (Questions 10.4 - 10.6)
SIR (sir.proj) -- Store/Ignore/Recall Task - Updating and Maintenance in more complex PFC model (Questions 10.7 -
10.8)