known memory allocators, including several not usually covered. It also explains why such mechanism-based taxonomies are very limited, and may obscure more important policy issues. Some of those policy issues are sketched.

Section 4 reviews the literature on memory allocation. A major point of this section is that the mainstream of allocator research over the last several decades has focused on oversimplified (and unrealistic) models of program behavior, and that little is actually known about how to design allocators, or what performance to expect.

Section 5 concludes by summarizing the major points of the paper, and suggesting avenues for future research.
1.1 Motivation

This paper is motivated by our perception that there is considerable confusion about the nature of memory allocators, and about the problem of memory allocation in general. Worse, this confusion is often unrecognized, and allocators are widely thought to be fairly well understood. In fact, we know little more about allocators than was known twenty years ago, which is not as much as might be expected. The literature on the subject is rather inconsistent and scattered, and considerable work appears to be done using approaches that are quite limited. We will try to sketch a unifying conceptual framework for understanding what is and is not known, and suggest promising approaches for new research.

This problem with the allocator literature has considerable practical importance. Aside from the human effort involved in allocator studies per se, there are effects in the real world, both on computer system costs, and on the effort required to create real software.

We think it is likely that the widespread use of poor allocators incurs a loss of main and cache memory (and CPU cycles) upwards of a billion U.S. dollars worldwide -- a significant fraction of the world's memory and processor output may be squandered, at huge cost.9

Perhaps even worse is the effect on programming style due to the widespread use of allocators that are simply bad ones -- either because better allocators are known but not widely known or understood, or because allocation research has failed to address the proper issues. Many programmers avoid heap allocation in many situations, because of perceived space or time costs.10

It seems significant to us that many articles in non-refereed publications -- and a number in refereed publications outside the major journals of operating systems and programming languages -- are motivated by extreme concerns about the speed or memory costs of general heap allocation. (One such paper [GM85] is discussed in Section 4.1.) Often, ad hoc solutions are used for applications that should not be problematic at all, because at least some well-designed general allocators should do quite well for the workload in question.

We suspect that in some cases, the perceptions are wrong, and that the costs of modern heap allocation are simply overestimated. In many cases, however, it appears that poorly-designed or poorly-implemented allocators have led to a widespread and quite understandable belief that general heap allocation is necessarily expensive. Too many poor allocators have been supplied with widely-distributed operating systems and compilers, and too few practitioners are aware of the alternatives.

This appears to be changing, to some degree. Many operating systems now supply fairly good allocators, and there is an increasing trend toward marketing libraries that include general allocators which are at least claimed to be good, as a replacement for default allocators. It seems likely that there is simply a lag between the improvement in allocator technology and its widespread adoption, and another lag before programming style adapts. The combined lag is quite long, however, and we have seen several magazine articles in the last year on how to avoid using a general allocator. Postings praising ad hoc allocation schemes are very common in the Usenet newsgroups oriented toward real-world programming.

The slow adoption of better technology and the lag in changes in perceptions may not be the only problems, however. We have our doubts about how well allocators are really known to work, based on a fairly thorough review of the literature. We wonder whether some part of the perception is due to occasional programs that interact pathologically with common allocator designs, in ways that have never been observed by researchers.

9 This is an unreliable estimate based on admittedly casual last-minute computations, approximately as follows: there are on the order of 100 million PCs in the world. If we assume that they have an average of 10 megabytes of memory at $30 per megabyte, there is 30 billion dollars worth of RAM at stake. (With the expected popularity of Windows 95, this seems like it will soon become a fairly conservative estimate, if it isn't already.) If just one fifth (6 billion dollars worth) is used for heap-allocated data, and one fifth of that is unnecessarily wasted, the cost is over a billion dollars.

10 It is our impression that UNIX programmers' usage of heap allocation went up significantly when Chris Kingsley's allocator was distributed with BSD 4.2 UNIX -- simply because it was much faster than the allocators they'd been accustomed to. Unfortunately, that allocator is somewhat wasteful of space.
This does not seem unlikely, because most experiments have used non-representative workloads, which are extremely unlikely to generate the same problematic request patterns as real programs. Sound studies using realistic workloads are too rare. The total number of real, nontrivial programs that have been used for good experiments is very small, apparently less than 20. A significant number of real programs could exhibit problematic behavior patterns that are simply not represented in studies to date.

Long-running processes such as operating systems, interactive programming environments, and networked servers may pose special problems that have not been addressed. Most experiments to date have studied programs that execute for a few minutes (at most) on common workstations. Little is known about what happens when programs run for hours, days, weeks or months. It may well be that some seemingly good allocators do not work well in the long run, with their memory efficiency slowly degrading until they perform quite badly. We don't know -- and we're fairly sure that nobody knows. Given that long-running processes are often the most important ones, and are increasingly important with the spread of client/server computing, this is a potentially large problem.

The worst case performance of any general allocator amounts to complete failure due to memory exhaustion or virtual memory thrashing (Section 1.2). This means that any real allocator may have lurking "bugs" and fail unexpectedly for seemingly reasonable inputs.

Such problems may be hidden, because most programmers who encounter severe problems may simply code around them using ad hoc storage management techniques -- or, as is still painfully common, by statically allocating "enough" memory for variable-sized structures. These ad hoc approaches to storage management lead to "brittle" software with hidden limitations (e.g., due to the use of fixed-size arrays). The impact on software clarity, flexibility, maintainability, and reliability is quite important, but difficult to estimate. It should not be underestimated, however, because these hidden costs can incur major penalties in productivity and, to put it plainly, human costs in sheer frustration, anxiety, and general suffering.

A much larger and broader set of test applications and experiments is needed before we have any assurance that any allocator works reliably, in a crucial performance sense -- much less works well. Given this caveat, however, it appears that some allocators are clearly better than others in most cases, and this paper will attempt to explain the differences.

1.2 What an Allocator Must Do

An allocator must keep track of which parts of memory are in use, and which parts are free. The goal of allocator design is usually to minimize wasted space without undue time cost, or vice versa. The ideal allocator would spend negligible time managing memory, and waste negligible space.

A conventional allocator cannot control the number or size of live blocks -- these are entirely up to the program requesting and releasing the space managed by the allocator. A conventional allocator also cannot compact memory, moving blocks around to make them contiguous and free contiguous memory. It must respond immediately to a request for space, and once it has decided which block of memory to allocate, it cannot change that decision -- that block of memory must be regarded as inviolable until the application11 program chooses to free it. It can only deal with memory that is free, and only choose where in free memory to allocate the next requested block. (Allocators record the locations and sizes of free blocks of memory in some kind of hidden data structure, which may be a linear list, a totally or partially ordered tree, a bitmap, or some hybrid data structure.)

11 We use the term "application" rather generally; the "application" for which an allocator manages storage may be a system program such as a file server, or even an operating system kernel.
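To make that bookkeeping concrete, here is a minimal sketch in C of the simplest such hidden data structure: a singly-linked free list recording the address and size of each free block. The code and all names are ours, for illustration only; as noted above, real allocators may instead use ordered trees, bitmaps, or hybrids.

```c
#include <stddef.h>

/* One record per free block; the record itself typically lives in the
   free memory it describes. */
typedef struct free_block {
    size_t size;              /* usable bytes in this free block */
    struct free_block *next;  /* next free block in the list */
} free_block;

static free_block *free_list; /* head of the allocator's free list */

/* Find a free block large enough for a request.  Which candidate to
   take is exactly the placement policy discussed below; this version
   simply takes the first one found ("first fit"). */
static free_block **find_block(size_t request)
{
    for (free_block **p = &free_list; *p != NULL; p = &(*p)->next)
        if ((*p)->size >= request)
            return p;  /* pointer to the link, so a caller can unlink */
    return NULL;       /* no sufficiently large free block exists */
}
```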
An allocator is therefore an online algorithm, which must respond to requests in strict sequence, immediately, and its decisions are irrevocable.

The problem the allocator must address is that the application program may free blocks in any order, creating "holes" amid live objects. If these holes are too numerous and small, they cannot be used to satisfy future requests for larger blocks. This problem is known as fragmentation, and it is a potentially disastrous one. For the general case that we have outlined -- where the application program may allocate arbitrary-sized objects at arbitrary times and free them at any later time -- there is no reliable algorithm for ensuring efficient memory usage, and none is possible. It has been proven that for any possible allocation algorithm, there will always be the possibility that some application program will allocate and deallocate blocks in some fashion that defeats the allocator's strategy, and forces it into severe fragmentation [Rob71, GGU72, Rob74, Rob77]. Not only are there no provably good allocation algorithms, there are proofs that any allocator will be "bad" for some possible applications.

The lower bound on worst case fragmentation is generally proportional to the amount of live data12 multiplied by the logarithm of the ratio between the largest and smallest block sizes, i.e., M log2 n, where M is the amount of live data and n is the ratio between the smallest and largest object sizes [Rob71]. (In discussing worst-case memory costs, we generally assume that all block sizes are evenly divisible by the smallest block size, and n is sometimes simply called "the largest block size," i.e., in units of the smallest.)

Of course, for some algorithms, the worst case is much worse, often proportional to the simple product of M and n.

So, for example, if the minimum and maximum object sizes are one word and a million words, then fragmentation in the worst case may cost an excellent allocator a factor of ten or twenty in space. A less robust allocator may lose a factor of a million, in its worst case, wasting so much space that failure is almost certain.

12 We use "live" here in a fairly loose sense. Blocks are "live" from the point of view of the allocator if it doesn't know that it can safely reuse the storage -- i.e., if the block was allocated but not yet freed. This is different from the senses of liveness used in garbage collection or in compilers' flow analyses.
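The arithmetic behind this example, spelled out (our own illustration of the bounds quoted above, not a derivation from the paper): with

\[
n = 10^6, \qquad \log_2 n = \log_2 10^6 \approx 19.9,
\]

an allocator that achieves the M log2 n lower bound needs at most roughly 20M of memory for M of live data -- the "factor of ten or twenty" above -- while an allocator whose worst case is proportional to M n may need on the order of 10^6 M, the factor of a million.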
Given the apparent insolubility of this problem, it may seem surprising that dynamic memory allocation is used in most systems, and the computing world does not grind to a halt due to lack of memory. The reason, of course, is that there are allocators that are fairly good in practice, in combination with most actual programs. Some allocation algorithms have been shown in practice to work acceptably well with real programs, and have been widely adopted. If a particular program interacts badly with a particular allocator, a different allocator may be used instead. (The bad cases for one allocator may be very different from the bad cases for other allocators of different design.)

The design of memory allocators is currently something of a black art. Little is known about the interactions between programs and allocators, and which programs are likely to bring out the worst in which allocators. However, one thing is clear -- most programs are "well behaved" in some sense. Most programs combined with most common allocators do not squander huge amounts of memory, even if they may waste a quarter of it, or a half, or occasionally even more. That is, there are regularities in program behavior that allocators exploit, a point that is often insufficiently appreciated even by professionals who design and implement allocators. These regularities are exploited by allocators to prevent excessive fragmentation, and make it possible for allocators to work in practice.

These regularities are surprisingly poorly understood, despite 35 years of allocator research, and scores of papers by dozens of researchers.

1.3 Strategies, Placement Policies, and Splitting and Coalescing

The main technique used by allocators to keep fragmentation under control is placement choice. Two subsidiary techniques are used to help implement that choice: splitting blocks to satisfy smaller requests, and coalescing of free blocks to yield larger blocks.

Placement choice is simply the choosing of where in free memory to put a requested block. Despite potentially fatal restrictions on an allocator's online choices, the allocator also has a huge freedom of action -- it can place a requested block anywhere it can find a sufficiently large range of free memory, and anywhere within that range. (It may also be able to simply request more memory from the operating system.) An allocator algorithm therefore should be regarded as the mechanism that implements a placement policy, which is motivated by a strategy for minimizing fragmentation.

Strategy, policy, and mechanism. The strategy takes into account regularities in program behavior, and determines a range of acceptable policies as to where to allocate requested blocks. The chosen policy is implemented by a mechanism, which is a set of algorithms and the data structures they use. This three-level distinction is quite important. In the context of general memory allocation,

- a strategy attempts to exploit regularities in the request stream,
- a policy is an implementable decision procedure for placing blocks in memory, and
- a mechanism is a set of algorithms and data structures that implement the policy, often over-simply called "an algorithm."13
An ideal strategy is "put blocks where they won't cause fragmentation later"; unfortunately that's impossible to guarantee, so real strategies attempt to heuristically approximate that ideal, based on assumed regularities of application programs' behavior. For example, one strategy is "avoid letting small long-lived objects prevent you from reclaiming a larger contiguous free area." This is part of the strategy underlying the common "best fit" family of policies. Another part of the strategy is "if you have to split a block and potentially waste what's left over, minimize the size of the wasted part."

The corresponding (best fit) policy is more concrete -- it says "always use the smallest block that is at least large enough to satisfy the request."

The placement policy determines exactly where in memory requested blocks will be allocated. For the best fit policies, the general rule is "allocate objects in the smallest free block that's at least big enough to hold them." That's not a complete policy, however, because there may be several equally good fits; the complete policy must specify which of those should be chosen, for example, the one whose address is lowest.

The chosen policy is implemented by a specific mechanism, chosen to implement that policy efficiently in terms of time and space overheads. For best fit, a linear list or ordered tree structure might be used to record the addresses and sizes of free blocks, and a tree search or list search would be used to find the one dictated by the policy.

13 This set of distinctions is doubtless indirectly influenced by work in very different areas, notably Marr's work in natural and artificial visual systems [Mar82] and McClamrock's work in the philosophy of science and cognition [McC91, McC95]. The distinctions are important for understanding a wide variety of complex systems, however. Similar distinctions are made in many fields, including empirical computer science, though often without making them quite clear.
In "systems" work, mechanism and policy are often distinguished, but strategy and policy are usually not distinguished explicitly. This makes sense in some contexts, where the policy can safely be assumed to implement a well-understood strategy, or where the choice of strategy is left up to someone else (e.g., designers of higher-level code not under discussion).
In empirical evaluations of very poorly understood strategies, however, the distinction between strategy and policy is often crucial. (For example, errors in the implementation of a strategy are often misinterpreted as evidence that the expected regularities don't actually exist, when in fact they do, and a slightly different strategy would work much better.)
Mistakes are possible at each level; equally important, mistakes are possible between levels, in the attempt to "cash out" (implement) the higher-level strategy as a policy, or a policy as a mechanism.
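As a concrete illustration of the policy/mechanism split, here is one possible mechanism for the best-fit policy just described, continuing our illustrative free_block list sketch from Section 1.2 (again our own code, not any particular allocator's): scan all free blocks, keep the smallest block that is large enough, and break ties by lowest address. A size-ordered tree would be a different mechanism implementing the same policy, without the linear scan.

```c
/* Best fit over the free list: smallest sufficiently large block,
   ties broken by lowest address (the complete policy given above). */
static free_block **find_best_fit(size_t request)
{
    free_block **best = NULL;
    for (free_block **p = &free_list; *p != NULL; p = &(*p)->next) {
        if ((*p)->size < request)
            continue;                        /* too small: not a fit */
        if (best == NULL
            || (*p)->size < (*best)->size    /* strictly better fit   */
            || ((*p)->size == (*best)->size  /* equal fit: lower addr */
                && *p < *best))
            best = p;
    }
    return best;  /* NULL if no free block can satisfy the request */
}
```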
These levels of the allocator design process interact. A strategy may not yield an obvious complete policy, and the seemingly slight differences between similar policies may actually implement interestingly different strategies. (This results from our poor understanding of the interactions between application behavior and allocator strategies.) The chosen policy may not be obviously implementable at reasonable cost in space, time, or programmer effort; in that case some approximation may be used instead.

The strategy and policy are often very poorly-defined, as well, and the policy and mechanism are arrived at by a combination of educated guessing, trial and error, and (often dubious) experimental validation.14

14 In case the important distinctions between strategy, policy, and mechanism are not clear, a metaphorical example may help. Consider a software company that has a strategy for improving productivity: reward the most productive programmers. It may institute a policy of rewarding programmers who produce the largest numbers of lines of program code. To implement this policy, it may use the mechanisms of instructing the managers to count lines of code, and providing scripts that count lines of code according to some particular algorithm.
This example illustrates the possible failures at each level, and in the mapping from one level to another. The strategy may simply be wrong, if programmers aren't particularly motivated by money. The policy may not implement the intended strategy, if lines of code are an inappropriate metric of productivity, or if the policy has unintended "strategic" effects, e.g., due to programmer resentment. The mechanism may also fail to implement the specified policy, if the rules for line-counting aren't enforced by managers, or if the supplied scripts don't correctly implement the intended counting function.
This distinction between strategy and policy is oversimplified, because there may be multiple levels of strategy that shade off into increasingly concrete policies. At different levels of abstraction, something might be viewed as a strategy or policy. The key point is that there are at least three qualitatively different kinds of levels of abstraction involved [McC91]; at the upper levels, there is the general design goal of exploiting expected regularities, and a set of strategies for doing so; there may be subsidiary strategies, for example to resolve conflicts between strategies in the best possible way. At a somewhat lower level there is a general policy of where to place objects, and below that is a more detailed policy that exactly determines placement. Below that there is an actual mechanism that is intended to implement the policy (and presumably effect the strategy), using whatever algorithms and data structures are deemed appropriate. Mechanisms are often layered, as well, in the usual manner of structured programming [Dij69]. Problems at (and between) these levels are the best understood -- a computation may be improperly specified, or may not meet its specification. (Analogous problems occur at the upper levels as well -- if expected regularities don't actually occur, or if they do occur but the strategy doesn't actually exploit them, and so on.)

Splitting and coalescing. Two general techniques for supporting a range of (implementations of) placement policies are splitting and coalescing of free blocks. (These mechanisms are important subsidiary parts of the larger mechanism that is the allocator implementation.)

The allocator may split large blocks into smaller blocks arbitrarily, and use any sufficiently-large sub-block to satisfy the request. The remainders from this splitting can be recorded as smaller free blocks in their own right and used to satisfy future requests.

The allocator may also coalesce (merge) adjacent free blocks to yield larger free blocks. After a block is freed, the allocator may check to see whether the neighboring blocks are free as well, and merge them into a single, larger block. This is often desirable, because one large block is more likely to be useful than two small ones -- large or small requests can be satisfied from large blocks.

Completely general splitting and coalescing can be supported at fairly modest cost in space and/or time, using simple mechanisms that we'll describe later. This allows the allocator designer the maximum freedom in choosing a strategy, policy, and mechanism for the allocator, because the allocator can have a complete and accurate record of which ranges of memory are available at all times.

The cost may not be negligible, however, especially if splitting and coalescing work too well -- in that case, freed blocks will usually be coalesced with neighbors to form large blocks of free memory, and later allocations will have to split smaller chunks off of those blocks to obtain the desired sizes. It often turns out that most of this effort is wasted, because the sizes requested later are largely the same as the sizes freed earlier, and the old small blocks could have been reused without coalescing and splitting. Because of this, many modern allocators use deferred coalescing -- they avoid coalescing and splitting most of the time, but use it intermittently, to combat fragmentation.
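The following sketch, again continuing our illustrative free_block example (the layout and constants are ours, not from the paper), shows the two subsidiary mechanisms: splitting carves the requested bytes off a free block and returns the remainder to the free list; coalescing merges a freed block with an address-adjacent free neighbor. Real allocators typically find neighbors in constant time using boundary tags rather than the linear scan shown here.

```c
#include <stdint.h>

#define HDR        sizeof(free_block)
#define MIN_SPLIT  (HDR + 16)  /* avoid creating uselessly tiny remainders */

/* Split: unlink the chosen block, carve off `request` bytes, and put
   any sufficiently large remainder back on the free list. */
static void *split_block(free_block **link, size_t request)
{
    free_block *b = *link;
    *link = b->next;                           /* unlink chosen block */
    if (b->size >= request + MIN_SPLIT) {
        free_block *rest = (free_block *)((uint8_t *)b + HDR + request);
        rest->size = b->size - request - HDR;  /* remainder minus header */
        rest->next = free_list;
        free_list  = rest;
        b->size    = request;
    }
    return (uint8_t *)b + HDR;                 /* usable memory after header */
}

/* Coalesce: if some free block f ends exactly where b begins, merge
   b into f; otherwise just push b onto the free list. */
static void free_and_coalesce(free_block *b)
{
    for (free_block *f = free_list; f != NULL; f = f->next)
        if ((uint8_t *)f + HDR + f->size == (uint8_t *)b) {
            f->size += HDR + b->size;          /* b directly follows f */
            return;
        }
    b->next = free_list;
    free_list = b;
}
```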
2 A Closer Look at Fragmentation, and How to Study It

In this section, we will discuss the traditional conception of fragmentation, and the usual techniques used for studying it. We will then explain why the usual understanding is not strong enough to support scientific design and evaluation of allocators. We then propose a new (though nearly obvious) conception of fragmentation and its causes, and describe more suitable techniques used to study it. (Most of the experiments using sound techniques have been performed in the last few years, but a few notable exceptions were done much earlier, e.g., [MPS71] and [LH82], discussed in Section 4.)

2.1 Internal and External Fragmentation

Traditionally, fragmentation is classed as external or internal [Ran69], and is combatted by splitting and coalescing free blocks.

External fragmentation arises when free blocks of memory are available for allocation, but can't be used to hold objects of the sizes actually requested by a program. In sophisticated allocators, that's usually because the free blocks are too small, and the program requests larger objects. In some simple allocators, external fragmentation can occur because the allocator is unwilling or unable to split large blocks into smaller ones.

Internal fragmentation arises when a large-enough free block is allocated to hold an object, but there is a poor fit because the block is larger than needed. In some allocators, the remainder is simply wasted, causing internal fragmentation. (It's called internal because the wasted memory is inside an allocated block, rather than being recorded as a free block in its own right.)

To combat internal fragmentation, most allocators will split blocks into multiple parts, allocating part of a block, and then regarding the remainder as a smaller free block in its own right. Many allocators will also coalesce adjacent free blocks (i.e., neighboring free blocks in address order), combining them into larger blocks that can be used to satisfy requests for larger objects.

In some allocators, internal fragmentation arises due to implementation constraints within the allocator -- for speed or simplicity reasons, the allocator design restricts the ways memory may be subdivided. In other allocators, internal fragmentation may be accepted as part of a strategy to prevent external fragmentation -- the allocator may be unwilling to fragment a block, because if it does, it may not be able to coalesce it again later and use it to hold another large object.
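A small worked example of such an implementation constraint (ours, for illustration; the paper discusses size restrictions only in general terms): suppose an allocator rounds every request up to the next power of two, so a request for s bytes receives a block of size

\[
\mathrm{alloc}(s) = 2^{\lceil \log_2 s \rceil}.
\]

A request for 65 bytes then receives a 128-byte block, wasting 63 bytes -- nearly half the block -- as internal fragmentation; in general the waste under this rule is bounded by, and can approach, 50% of allocated memory.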
The main weakness of the \statistical mechanics"
2.2 The Traditional Methodology: approach is that there are important systematic in-
Probabilistic Analyses, and Simulation Using teractions that occur in memory allocation, due to
Synthetic Traces phase behavior of programs. No matter how large the
system is, basing probabilistic analyses on individual
(Note: readers who are uninterested in experimental events is likely to yield the wrong answers, if there
methodology may wish to skip this section, at least are systematic eects involved which are not captured
on a rst reading. Readers uninterested in the history by the theory. Assuming that the analyses are appro-
of allocator research may skip the footnotes. The fol- priate for \suciently large" systems does not help
lowing section (2.3) is quite important, however, and here|the systematic errors will simply attain greater
should not be skipped.) statistical signicance.
Allocators are sometimes evaluated using proba- Consider the case of evolutionary biology. If an
bilistic analyses. By reasoning about the likelihood of overly simple statistical approach about individual
certain events, and the consequences of those events animals' interactions is used, the theory will not cap-
for future events, it may be possible to predict what ture predator/prey and host/symbiote relationships,
will happen on average. For the general problem of sexual selection, or other pervasive evolutionary ef-
dynamic storage allocation, however, the mathemat- fects as niche lling.16 Developing a highly predictive
ics are too dicult to do this for most algorithms
and most workloads. An alternative is to do simu- 15 This usage of \statistical mechanics" should perhaps be
lations, and nd out \empirically" what really hap- regarded as metaphorical, since it is not really about
pens when workloads interact with allocator policies. simple interactions of large numbers of molecules in
This is more common, because the interactions are so a gas or liquid. Several papers on memory allocation
poorly understood that mathematical techniques are have used it loosely, however, to describe the analo-
dicult to apply. gous approach to analyzing memory allocation. Statis-
Unfortunately, in both cases, to make probabilistic tical mechanics has literally provided a paradigm|in
techniques feasible, important characteristics of the the original, smaller sense of a \model" or \examplar,"
rather than in a larger Kuhnian sense|which many nd
workload must be known|i.e., the probabilities of attractive.
relevant characteristics of \input" events to the al- 16 Some of these eects may emerge from lower-level mod-
location routine. The relevant characteristics are not eling, but for simulations to reliably predict them, many
understood, and so the probabilities are simply un- important lower-level issues must be modeled correctly,
known. and sucient data are usually not available, or su-
Developing a highly predictive evolutionary theory is extremely difficult -- and some would say impossible -- because too many low-level (or higher-level) details matter,17 and there may be intrinsic unpredictabilities in the systems described [Den95].18

We are not saying that the development of a good theory of memory allocation is as hard as developing a predictive evolutionary theory -- far from it. The problem of memory allocation seems far simpler, and we are optimistic that a useful predictive theory can be developed.19

Our point is simply that the paradigm of simple statistical mechanics must be evaluated relative to other alternatives, which we find more plausible in this domain. There are major interactions between workloads and allocator policies, which are usually ignored. No matter how large the system, and no matter how asymptotic the analyses, ignoring these effects seems likely to yield major errors -- e.g., analyses will simply yield the wrong asymptotes.

A useful probabilistic theory of memory allocation may be possible, but if so, it will be based on a quite different set of statistics from those used so far -- statistics which capture effects of systematicities, rather than assuming such systematicities can be ignored. As in biology, the theory must be tested against reality, and refined to capture systematicities that had previously gone unnoticed.

17 For example, the different evolutionary strategies implied by the varying replication techniques and mutation rates of RNA-based vs. DNA-based viruses, or the impact of environmental change on host/parasite interactions [Gar94].

18 For example, a single chance mutation that results in an adaptive characteristic in one individual may have a major impact on the subsequent evolution of a species and its entire ecosystem [Dar59].

19 We are also not suggesting that evolutionary theory provides a good paradigm for allocator research; it is just an example of a good scientific paradigm that is very different from the ones typically seen in memory allocation research. It demonstrates the important and necessary interplay between high-level theories and detailed empirical work.

Random simulations. The traditional technique for evaluating allocators is to construct several traces (recorded sequences of allocation and deallocation requests) thought to resemble "typical" workloads, and use those traces to drive a variety of actual allocators. Since an allocator normally responds only to the request sequence, this can produce very accurate simulations of what the allocator would do if the workload were real -- that is, if a real program generated that request sequence.

Typically, however, the request sequences are not real traces of the behavior of actual programs. They are "synthetic" traces that are generated automatically by a small subprogram; the subprogram is designed to resemble real programs in certain statistical ways. In particular, object size distributions are thought to be important, because they affect the fragmentation of memory into blocks of varying sizes. Object lifetime distributions are also often thought to be important (but not always), because they affect whether blocks of memory are occupied or free.

Given a set of object size and lifetime distributions, the small "driver" subprogram generates a sequence of requests that obeys those distributions. This driver is simply a loop that repeatedly generates requests, using a pseudo-random number generator; at any point in the simulation, the next data object is chosen by "randomly" picking a size and lifetime, with a bias that (probabilistically) preserves the desired distributions. The driver also maintains a table of objects that have been allocated but not yet freed, ordered by their scheduled death (deallocation) time. (That is, the step at which they were allocated, plus their randomly-chosen lifetime.) At each step of the simulation, the driver deallocates any objects whose death times indicate that they have expired. One convenient measure of simulated "time" is the volume of objects allocated so far -- i.e., the sum of the sizes of objects that have been allocated up to that step of the simulation.20

20 In many early simulations, the simulator modeled real time, rather than just discrete steps of allocation and deallocation. Allocation times were chosen based on randomly chosen "arrival" times, generated using an "interarrival distribution" and their deaths scheduled in continuous time -- rather than discrete time based on the number and/or sizes of objects allocated so far. We will generally ignore this distinction in this paper, because we think other issues are more important. As will become clear, in the methodology we favor, this distinction is not important because the actual sequences of actions are sufficient to guarantee exact simulation, and the actual sequence of events is recorded rather than being (approximately) emulated.
An important feature of these simulations is that they tend to reach a "steady state." After running for a certain amount of time, the volume of live (simulated) objects reaches a level that is determined by the size and lifetime distributions, and after that objects are allocated and deallocated in approximately equal numbers. The memory usage tends to vary very little, wandering probabilistically (in a random walk) around this "most likely" level. Measurements are typically made by sampling memory usage at points after the steady state has presumably been reached, or by averaging over a period of "steady-state" variation. These measurements "at equilibrium" are assumed to be important.

There are three common variations of this simulation technique. One is to use a simple mathematical function to determine the size and lifetime distributions, such as uniform or (negative) exponential. Exponential distributions are often used because it has been observed that programs are typically more likely to allocate small objects than large ones,21 and are more likely to allocate short-lived objects than long-lived ones.22 (The size distributions are generally truncated at some plausible minimum and maximum object size, and discretized, rounding them to the nearest integer.)

The second variation is to pick distributions intuitively, i.e., out of a hat, but in ways thought to resemble real program behavior. One motivation for this is to model the fact that many programs allocate objects of some sizes in large numbers, and objects of other sizes in small numbers or not at all; we refer to these distributions as "spiky."23

The third variation is to use statistics gathered from real programs, to make the distributions more realistic. In almost all cases, size and lifetime distributions are assumed to be independent -- the fact that different sizes of objects may have different lifetime distributions is generally assumed to be unimportant.

In general, there has been something of a trend toward the use of more realistic distributions,24 but this trend is not dominant. Even now, researchers often use simple and smooth mathematical functions to generate traces for allocator evaluation.25 The use of smooth distributions is questionable, because it bears directly on issues of fragmentation -- if objects of only a few sizes are allocated, the free (and uncoalescable) blocks are likely to be of those sizes, making it possible to find a perfect fit. If the object sizes are smoothly distributed, the requested sizes will almost always be slightly different, increasing the chances of fragmentation.

21 Historically, uniform size distributions were the most common in early experiments; exponential distributions then became increasingly common, as new data became available showing that real systems generally used many more small objects than large ones. Other distributions have also been used, notably Poisson and hyper-exponential. Still, relatively recent papers have used uniform size distributions, sometimes as the only distribution.

22 As with size distributions, there has been a shift over time toward non-uniform lifetime distributions, often exponential. This shift occurred later, probably because real data on size information was easier to obtain, and lifetime data appeared later.

23 In general, this modeling has not been very precise. Sometimes the sizes chosen out of a hat are allocated in uniform proportions, rather than in skewed proportions reflecting the fact that (on average) programs allocate many more small objects than large ones.

24 The trend toward more realistic distributions can be explained historically and pragmatically. In the early days of computing, the distributions of interest were usually the distribution of segment sizes in an operating system's workload. Without access to the inside of an operating system, this data was difficult to obtain. (Most researchers would not have been allowed to modify the implementation of the operating system running on a very valuable and heavily-timeshared computer.) Later, the emphasis of study shifted away from segment sizes in segmented operating systems, and toward data object sizes in the virtual memories of individual processes running in paged virtual memories.

25 We are unclear on why this should be, except that a particular theoretical and experimental paradigm [Kuh70] had simply become thoroughly entrenched in the early 1970's. (It's also somewhat easier than dealing with real data.)
Probabilistic analyses. Since Knuth's derivation of the "fifty percent rule" [Knu73] (discussed later, in Section 4), there have been many attempts to reason probabilistically about the interactions between program behavior and allocator policy, and assess the overall cost in terms of fragmentation (usually) and/or CPU time.

These analyses have generally made the same assumptions as random-trace simulation experiments -- e.g., random object allocation order, independence of sizes and lifetimes, steady-state behavior -- and often stronger assumptions as well.

These simplifying assumptions have generally been made in order to make the mathematics tractable. In particular, assumptions of randomness and independence make it possible to apply well-developed theory of stochastic processes (Markov models, etc.) to derive analytical results about expected behavior. Unfortunately, these assumptions tend to be false for most real programs, so the results are of limited utility.

It should be noted that these are not merely convenient simplifying assumptions that allow solution of problems that closely resemble real problems. If that were the case, one could expect that with refinement of the analyses -- or with sufficient empirical validation that the assumptions don't matter in practice -- the results would come close to reality. There is no reason to expect such a happy outcome. These assumptions dramatically change the key features of the problem; the ability to perform the analyses hinges on the very facts that make them much less relevant to the general problem of memory allocation.

Assumptions of randomness and independence make the problem irregular, in a superficial sense, but they make it very smooth (hence mathematically tractable) in a probabilistic sense. This smoothness has the advantage that it makes it possible to derive analytical results, but it has the disadvantage that it turns a real and deep scientific problem into a mathematical puzzle that is much less significant for our purposes.

The problem of dynamic storage allocation is intractable, in the vernacular sense of the word. As an essentially data-dependent problem, we do not have a grip on it, because we simply do not understand the inputs. "Smoothing" the problem to make it mathematically tractable "removes the handles" from something that is fundamentally irregular, making it unlikely that we will get any real purchase or leverage on the important issues. Removing the irregularities removes some of the problems -- and most of the opportunities as well.

A note on exponentially-distributed random lifetimes. Exponential lifetime distributions have become quite common in both empirical and analytic studies of memory fragmentation over the last two decades. In the case of empirical work (using random-trace simulations), this seems an admirable adjustment to some observed characteristics of real program behavior. In the case of analytic studies, it turns out to have some very convenient mathematical properties as well. Unfortunately, it appears that the apparently exponential appearance of real lifetime distributions is often an artifact of experimental methodology (as will be explained in Sections 2.3 and 4.1), and that the emphasis on distributions tends to distract researchers from the strongly patterned underlying processes that actually generate them (as will be explained in Section 2.4).

We invite the reader to consider a randomly-ordered trace with an exponential lifetime distribution. In this case there is no correlation at all between an object's age and its expected time until death -- the "half-life" decay property of the distribution and the randomness ensure that allocated objects die completely at random with no way to estimate their death times from any of the information available to the allocator.26 (An exponential random function exhibits only a half-life property, and no other pattern, much like radioactive decay.) In a sense, exponential lifetimes are thus the reductio ad absurdum of the synthetic trace methodology -- all of the time-varying regularities have been systematically eliminated from the input. If we view the allocator's job as an online problem of detecting and exploiting regularities, we see that this puts the allocator in the awkward position of trying to extract helpful hints from pure noise.

This does not necessarily mean that all allocators will perform identically under randomized workloads, however, because there are regularities in size distributions, whether they are real distributions or simple mathematical ones, and some allocators may simply shoot themselves in the foot.

Analyses and experiments with exponentially distributed random lifetimes may say something revealing about what happens when an allocator's strategy is completely orthogonal to the actual regularities. We have no real idea whether this is a situation that occurs regularly in the space of possible combinations of real workloads and reasonable strategies.27 (It's clear that it is not the usual case, however.) The terrain of that space is quite mysterious to us.

26 We are indebted to Henry Baker, who has made quite similar observations with respect to the use of exponential lifetime distributions to estimate the effectiveness of generational garbage collection schemes [Bak93].

27 In particular, certain effects of randomized traces may (or may not) resemble the cumulative effect of allocator strategy errors over much longer periods. This resemblance cannot be assumed, however -- there are good reasons to think it may occur in some cases, but not in others, and empirical validation is necessary.
A note on Markov models. Many probabilistic studies of memory allocation have used first-order Markov processes to approximate program and allocator behavior, and have derived conclusions based on the well-understood properties of Markov models.

In a first-order Markov model, the probabilities of state transitions are known and fixed. In the case of fragmentation studies, this corresponds to assuming that a program allocates objects at random, with fixed probabilities of allocating different sizes.

The space of possible states of memory is viewed as a graph, with a node for each configuration of allocated and free blocks. There is a start state, representing an empty memory, and a transition probability for each possible allocation size. For a given placement policy, there will be a known transition from a given state for any possible allocation or deallocation request. The state reached by each possible allocation is another configuration of memory.

For any given request distribution, there is a network of possible states reachable from the start state, via successions of more or less probable transitions. In general, for any memory above a very, very small size, and for arbitrary distributions of sizes and lifetimes, this network is inconceivably large. As described so far, it is therefore useless for any practical analyses.

To make the problem more tractable, certain assumptions are often made. One of these is that lifetimes are exponentially distributed as well as random, and have the convenient half-life property described above, i.e., they die completely at random as well as being born at random.

This assumption can be used to ensure that both the states and the transitions between states have definite probabilities in the long run. That is, if one were to run a random-trace simulation for a long enough period of time, all reachable states would be reached, and all of them would be reached many times -- and the number of times they were reached would reflect the probabilities of their being reached again in the future, if the simulation were continued indefinitely. If we put a counter on each of the states to keep track of the number of times each state was reached, the ratio between these counts would eventually stabilize, plus or minus small short-term variations. The relative weights of the counters would "converge" to a stable solution.

Such a network of states is called an ergodic Markov model, and it has very convenient mathematical properties. In some cases, it's possible to avoid running a simulation at all, and analytically derive what the network's probabilities would converge to.
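In standard Markov chain terms (a textbook formulation, not the paper's notation): for an ergodic chain with transition matrix P, the visit-count ratios just described converge to the unique stationary distribution \pi satisfying

\[
\pi = \pi P, \qquad \sum_i \pi_i = 1,
\]

where \pi_i is the long-run fraction of steps spent in state i. Deriving \pi analytically is what replaces running the simulation.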
Unfortunately, this is a very inappropriate model for real program and allocator behavior. An ergodic Markov model is a kind of (probabilistic) finite automaton, and as such the patterns it generates are very, very simple, though randomized and hence unpredictable. They're almost unpatterned, in fact, and hence very predictable in a certain probabilistic sense.

Such an automaton is extremely unlikely to generate many patterns that seem likely to be important in real programs, such as the creation of the objects in a linked list in one order, and their later destruction in exactly the same order, or exactly the reverse order.28 There are much more powerful kinds of machines -- which have more complex state, like a real program -- which are capable of generating more realistic patterns. Unfortunately, the only machines that we are sure generate the "right kinds" of patterns are actual real programs.

We do not understand what regularities exist in real programs well enough to model them formally and perform probabilistic analyses that are directly applicable to real program behavior. The models we have are grossly inaccurate in respects that are quite relevant to problems of memory allocation.

There are problems for which Markov models are useful, and a smaller number of problems where assumptions of ergodicity are appropriate. These problems involve processes that are literally random, or can be shown to be effectively random in the necessary ways. The general heap allocation problem is not in either category. (If this is not clear, the next section should make it much clearer.)

Ergodic Markov models are also sometimes used for problems where the basic assumptions are known to be false in some cases -- but they should only be used in this way if they can be validated, i.e., shown by extensive testing to produce the right answers most of the time, despite the oversimplifications they're based on. For some problems it "just turns out" that the differences between real systems and the mathematical models are not usually significant. For the general problem of memory allocation, this turns out to be false as well -- recent results clearly invalidate the use of simple Markov models [ZG94, WJNB95].29

28 Technically, a Markov model will eventually generate such patterns, but the probability of generating a particular pattern within a finite period of time is vanishingly small if the pattern is large and not very strongly reflected in the arc weights. That is, many quite probable kinds of patterns are extremely improbable in a simple Markov model.

29 It might seem that the problem here is the use of first-order Markov models, whose states (nodes in the reachability graph) correspond directly to states of memory. Perhaps "higher-order" Markov models would work, where nodes in the graph represent sequences of concrete state transitions. We think this is false as well. The important kinds of patterns produced by real programs are generally not simple very-short-term sequences of a few events, but large-scale patterns involving many events. To capture these, a Markov model would have to be of such high order that analyses would be completely infeasible. It would essentially have to be pre-programmed to generate specific literal sequences of events. This not only begs the essential question of what real programs do, but seems certain not to concisely capture the right regularities. Markov models are simply not powerful enough -- i.e., not abstract enough in the right ways -- to help with this problem. They should not be used for this purpose, or any similarly poorly understood purpose, where complex patterns may be very important. (At least, not without extensive validation.) The fact that the regularities are complex and unknown is not a good reason to assume that they're effectively random [ZG94, WJNB95] (Section 4.2).

2.3 What Fragmentation Really Is, and Why the Traditional Approach is Unsound

    A single death is a tragedy. A million deaths is a statistic.
    -- Joseph Stalin

We suggested above that the shape of a size distribution (and its smoothness) might be important in determining the fragmentation caused by a workload. However, even if the distributions are completely realistic, there is reason to suspect that randomized synthetic traces are likely to be grossly unrealistic.

As we said earlier, the allocator should embody a strategy designed to exploit regularities in program behavior -- otherwise it cannot be expected to do particularly well. The use of randomized allocation order eliminates some regularities in workloads, and introduces others, and there is every reason to think that the differences in regularities will affect the performance of different strategies differently. To make this concrete, we must understand fragmentation and its causes.

The technical distinction between internal and external fragmentation is useful, but in attempting to design experiments measuring fragmentation, it is worthwhile to stop for a moment and consider what fragmentation really is, and how it arises.

Fragmentation is the inability to reuse memory that is free. This can be due to policy choices by the allocator, which may choose not to reuse memory that in principle could be reused. More importantly for our purposes, the allocator may not have a choice at the moment an allocation request must be serviced: there may be free areas that are too small to service the request and whose neighbors are not free, making it impossible to coalesce adjacent free areas into a sufficiently large contiguous block.30

30 Beck [Bec82] makes the only clear statement of this principle which we have found in our exhausting review of the literature. As we will explain later (in our chronological review, Section 4.1), Beck also made some important inferences from this principle, but his theoretical model and his empirical methodology were weakened by working within the dominant paradigm. His paper is seldom cited, and its important ideas have generally gone unnoticed.
14
moment-by-moment decisions as to where to place them. Best fit will do well for this example, but other allocators do better for some other examples where best fit performs abysmally.
We leave the concept of fragmentation somewhat poorly defined, because in the general case the actual phenomenon is poorly defined.31

31 Our concept of fragmentation has been called "startlingly nonoperational," and we must confess that it is, to some degree. We think that this is a strength, however, because it is better to leave a concept somewhat vague than to define it prematurely and incorrectly. It is important to first identify the "natural kinds" in the phenomena under study, and then figure out what their most important characteristics are [Kri72, Put77, Qui77]. (We are currently working on developing operational measures of "fragmentation-related" program behavior.) Later in the paper we will express experimental "fragmentation" results as percentages, but this should be viewed as an operational shorthand for the effects of fragmentation on memory usage at whatever point or points in program execution measurements were made; this should be clear in context.

Fragmentation is caused by isolated deaths. A crucial issue is the creation of free areas whose neighboring areas are not free. This is a function of two things: which objects are placed in adjacent areas and when those objects die. Notice that if the allocator places objects together in memory, and they die "at the same time" (with no intervening allocations), no fragmentation results: the objects are live at the same time, using contiguous memory, and when they die they free contiguous memory. An allocator that can predict which objects will die at approximately the same time can exploit that information to reduce fragmentation, by placing those objects in contiguous memory.

Fragmentation is caused by time-varying behavior. Fragmentation arises from changes in the way a program uses memory—for example, freeing small blocks and requesting large ones. This much is obvious, but it is important to consider patterns in the changing behavior of a program, such as the freeing of large numbers of objects and the allocation of large numbers of objects of different types. Many programs allocate and free different kinds of objects in different stereotyped ways. Some kinds of objects accumulate over time, but other kinds may be used in bursty patterns. (This will be discussed in more detail in Section 2.4.) The allocator's job is to exploit these patterns, if possible, or at least not let the patterns undermine its strategy.

Implications for experimental methodology. (Note: this section is concerned only with experimental techniques; uninterested readers may skip to the following section.)
The traditional methodology of using random program behavior implicitly assumes that there is no ordering information in the request stream that could be exploited by the allocator—i.e., there's nothing in the sequencing of requests which the allocator will use as a hint to suggest which objects should be allocated adjacent to which other objects. Given a random request stream, the allocator has little control—wherever objects are placed by the allocator, they die at random, randomly creating holes among the live objects. If some allocators do in fact tend to exploit real regularities in the request stream, the randomization of the order of object creations (in simulations) ensures that the information is discarded before the allocator can use it. Likewise, if an algorithm tends to systematically make mistakes when faced with real patterns of allocations and deallocations, randomization may hide that fact.
It should be clear that random object deaths may systematically create serious fragmentation in ways that are unlikely to be realistic. Randomization also has a potentially large effect on the large-scale aggregate behavior of large numbers of objects. In real programs, the total volume of objects varies over time, and often the relative volumes of objects of different sizes vary as well. This often occurs due to phase behavior—some phases may use many more objects than others, and the objects used by one phase may be of very different sizes than those used by another phase.
Now consider a randomized synthetic trace—the overall volume of objects is determined by a random walk, so that the volume of objects rises gradually until a steady state is reached. Likewise the volume of memory allocated to objects of a given size is a similar random walk. If the number of objects of a given size is large, the random walk will tend to be relatively smooth, with mostly gradual and small changes in overall allocated volume.
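To make the contrast concrete, the following is a minimal sketch (in C) of the kind of randomized simulation this methodology produces. The size range, probabilities, and one-letter event format are arbitrary placeholders, not taken from any particular study; the events are schematic (a real generator would track object identities). The point is structural: deaths strike the live set at random, so the volume in use does a smooth random walk rather than forming the ramps, peaks, and plateaus of real programs.

    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_LIVE 10000

    static size_t live[MAX_LIVE];   /* sizes of currently live objects */
    static int nlive = 0;

    int main(void)
    {
        srand(42);
        for (long step = 0; step < 100000; step++) {
            /* Allocate or free with equal probability once warmed up,
               so total volume rises gradually to a rough steady state. */
            if (nlive == 0 || (nlive < MAX_LIVE && (rand() & 1))) {
                size_t size = 16 + (size_t)(rand() % 240);  /* arbitrary */
                live[nlive++] = size;
                printf("a %zu\n", size);          /* allocation event */
            } else {
                int victim = rand() % nlive;      /* deaths strike at random */
                printf("f %zu\n", live[victim]);  /* free event */
                live[victim] = live[--nlive];
            }
        }
        return 0;
    }

Note that nothing in such a trace reflects phase structure: whatever placement decisions an allocator makes, the random deaths destroy any correspondence between allocation order and death order.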
This implies that the proportions of memory allocated to different-sized objects tend to be relatively stable.
This has major implications for external fragmentation. External fragmentation means that there are free blocks of memory of some sizes, but those are the wrong sizes to satisfy current needs. This happens when objects of one size are freed, and then objects of another size are allocated—that is, when there is an unfortunate change in the relative proportions of objects of one size and objects of a larger size. (For allocators that never split blocks, this can happen with requests for smaller sizes as well.) For synthetic random traces, this is less likely to occur—they don't systematically free objects of one size and then allocate objects of another. Instead, they tend to allocate and free objects of different sizes in relatively stable proportions. This minimizes the need to coalesce adjacent free areas to avoid fragmentation; on average, a free memory block of a given size will be reused relatively soon. This may bias experimental results by hiding an allocator's inability to deal well with external fragmentation, and favor allocators that deal well with internal fragmentation at a cost in external fragmentation.
Notice that while random deaths cause fragmentation, the aggregate behavior of random walks may reduce the extent of the problem. For some allocators, this balance of unrealistically bad and unrealistically good properties may average out to something like realism, but for others it may not. Even if—by sheer luck—random traces turn out to yield realistic fragmentation "on average," over many allocators, they are inadequate for comparing different allocators, which is usually the primary goal of such studies.

2.4 Some Real Program Behaviors

...and suddenly the memory returns.
—Marcel Proust, Swann's Way

Real programs do not generally behave randomly—they are designed to solve actual problems, and the methods chosen to solve those problems have a strong effect on their patterns of memory usage. To begin to understand the allocator's task, it is necessary to have a general understanding of program behavior. This understanding is almost absent in the literature on memory allocators, apparently because many researchers consider the infinite variation of possible program behaviors to be too daunting.
There are strong regularities in many real programs, however, because similar techniques are applied (in different combinations) to solve many problems. Several common patterns have been observed.

Ramps, peaks, and plateaus. In terms of overall memory usage over time, three patterns have been observed in a variety of programs in a variety of contexts. Not all programs exhibit all of these patterns, but most seem to exhibit one or two of them, or all three, to some degree. Any generalizations based on these patterns must therefore be qualitative and qualified. (This implies that to understand the quantitative importance of these patterns, a small set of programs is not sufficient.)

- Ramps. Many programs accumulate certain data structures monotonically over time. This may be because they keep a log of events, or because the problem-solving strategy requires building a large representation, after which a solution can be found quickly.
- Peaks. Many programs use memory in bursty patterns, building up relatively large data structures which are used for the duration of a particular phase, and then discarding most or all of those data structures. Note that the "surviving" data structures are likely to be of different types, because they represent the results of a phase, as opposed to intermediate values which may be represented differently. (A peak is like a ramp, but of shorter duration.)
- Plateaus. Many programs build up data structures quickly, and then use those data structures for long periods (often nearly the whole running time of the program).

These patterns are well-known, from anecdotal experience by many people (e.g., [Ros67, Han90]), from research on garbage collection (e.g., [Whi80, WM89, UJ88, Hay91, Hay93, BZ95, Wil95]),32 and from a recent study of C and C++ programs [WJNB95].

32 It may be thought that garbage collected systems are sufficiently different from those using conventional storage management that these results are not relevant. It appears, however, that these patterns are common in both kinds of systems, because similar problem-solving strategies are used by programmers in both kinds of systems. (For any particular problem, different qualitative program behaviors may result, but the general categories seem to be common in conventional programs as well. See [WJNB95].)
(Other patterns of overall memory usage also occur, but appear less common. As we describe in Section 4, backward ramp functions have been observed [GM85]. Combined forward and backward ramp behavior has also been observed, with one data structure shrinking as another grows [Abr67].)
Notice that in the case of ramps and ramp-shaped peaks, looking at the statistical distributions of object lifetimes may be very misleading. A statistical distribution suggests a random decay process of some sort, but it may actually reflect sudden deaths of groups of objects that are born at different times. In terms of fragmentation, the difference between these two models is major. For a statistical decay process, the allocator is faced with isolated deaths, which are likely to cause fragmentation. For a phased process where many objects often die at the same time, the allocator is presented with an opportunity to get back a significant amount of memory all at once.
In real programs, these patterns may be composed in different ways at different scales of space and time. A ramp may be viewed as a kind of peak that grows over the entire duration of program execution. (The distinction between a ramp and a peak is not precise, but we tend to use "ramp" to refer to something that grows slowly over the whole execution of a program, and drops off suddenly at the end, and "peak" to refer to faster-growing volumes of objects that are discarded before the end of execution. A peak may also be flat on top, making it a kind of tall, skinny plateau.)
While the overall long-term pattern is often a ramp or plateau, it often has smaller features (peaks or plateaus) added to it. This crude model of program behavior is thus recursive. (We note that it is not generally fractal33—features at one scale may bear no resemblance to features at another scale. Attempting to characterize the behavior of a program by a simple number such as fractal dimension is not appropriate, because program behavior is not that simple.34)

33 We are using the term "fractal" rather loosely, as is common in this area. Typically, "fractal" models of program behavior are not infinitely recursive, and are actually graftals or other finite fractal-like recursive entities.
34 We believe that this applies to studies of locality of reference as well. Attempts to characterize memory-referencing behavior as fractal-like (e.g., [VMH+83, Thi89]) are ill-conceived or severely limited—if only because memory allocation behavior is not generally fractal, and memory-referencing behavior depends on memory allocation policy. (We suspect that it's ill-conceived for understanding program behavior at the level of references to objects, as well as at the level of references to memory.) If the fractal concept is used in a strong sense, we believe it is simply wrong. If it is taken in a weak sense, we believe it conveys little useful information that couldn't be better summarized by simple statistical curve-fitting; using a fractal conceptual framework tends to obscure more issues than it clarifies. Average program behavior may resemble a fractal, because similar features can occur at different scales in different programs; however, an individual program's behavior is not fractal-like in general, any more than it is a simple Markov process. Both kinds of models fail to capture the "irregularly regular" and scale-dependent kinds of patterns that are most important.

Ramps, peaks, and plateaus have very different implications for fragmentation.
An overall ramp or plateau profile has a very convenient property, in that if short-term fragmentation can be avoided, long-term fragmentation is not a problem either. Since the data making up a plateau are stable, and those making up a ramp accumulate monotonically, inability to reuse freed memory is not an issue—nothing is freed until the end of program execution. Short-term fragmentation can be a cumulative problem, however, leaving many small holes in the mass of long-lived objects.
Peaks and tall, skinny plateaus can pose a challenge in terms of fragmentation, since many objects are allocated and freed, and many other objects are likely to be allocated and freed later. If an earlier phase leaves scattered survivors, it may cause problems for later phases that must use the spaces in between.
More generally, phase behavior is the major cause of fragmentation: a program's needs for blocks of particular sizes change over time, sometimes in an awkward way. If many small objects are freed at the end of a phase—but scattered objects survive—a later phase may run into trouble. On the other hand, if the survivors happen to have been placed together, large contiguous areas will come free.

Fragmentation at peaks is important. Not all periods of program execution are equal. The most important periods are usually those when the most memory is used. Fragmentation is less important at times of lower overall memory usage than it is when memory usage is "at its peak," either during a short-lived peak or near the end of a ramp of gradually
increasing memory usage. This means that average fragmentation is less important than peak fragmentation—scattered holes in the heap most of the time may not be a problem if those holes are well-filled when it counts.
This has implications for the interpretation of analyses and simulations based on steady-state behavior (i.e., equilibrium conditions). Real programs may exhibit some steady-state behavior, but there are usually ramps and/or peaks as well. It appears that most programs never reach a truly steady state, and if they reach a temporary steady state, it may not matter much. (It can matter, however, because earlier phases may result in a configuration of blocks that is more or less problematic later on, at peak usage.)
Overall memory usage is not the whole story, of course. Locality of reference matters as well. All other things being equal, however, a larger total "footprint" matters even for locality. In virtual memories, many programs never page at all, or suffer dramatic performance degradations if they do. Keeping the overall memory usage lower makes this less likely to happen. (In a time-shared machine, a larger footprint is likely to mean that a different process has its pages evicted when the peak is reached, rather than its own less-recently-used pages.)

Exploiting ordering and size dependencies. If the allocator can exploit the phase information from the request stream, it may be able to place objects that will die at about the same time in a contiguous area of memory. This may suggest that the allocator should be adaptive,35 but much simpler strategies also seem likely to work [WJNB95]:

- Objects allocated at about the same time are likely to die together at the end of a phase; if consecutively-allocated objects are allocated in contiguous memory, they will free contiguous memory.
- Objects of different types may be likely to serve different purposes and die at different times. Size is likely to be related to type and purpose, so avoiding the intermingling of different sizes (and likely types) of objects may reduce the scattering of long-lived objects among short-lived ones.

This suggests that objects allocated at about the same time should be allocated adjacent to each other in memory, with the possible amendment that different-sized objects should be segregated [WJNB95].36

35 Barrett and Zorn have recently built an allocator using profile information to heuristically separate long-lived objects from short-lived ones [BZ93]. (Section 4.2.)
36 We have not found any other mention of these heuristics in the literature, although somewhat similar ideas underlie the "zone" allocator of Ross [Ros67] and Hanson's "obstack" system (both discussed later). Beck [Bec82], Demers et al. [DWH+90], and Barrett and Zorn [BZ93] have developed systems that predict the lifetimes of objects for similar purposes.
We note that for our purposes, it is not necessary to predict which groups of objects will die when. It is only necessary to predict which groups of objects will die at similar times, and which will die at dissimilar times, without worrying about which group will die first. We refer to this as "death time discrimination." This simpler discrimination seems easier to achieve than lifetime prediction, and possibly more robust. Intuitively, it also seems more directly related to the causes of fragmentation.

Implications for strategy. The phased behavior of many programs provides an opportunity for the allocator to reduce fragmentation. As we said above, if successive objects are allocated contiguously and freed at about the same time, free memory will again be contiguous. We suspect that this happens with many existing allocators—even though they were not designed with this principle in mind, as far as we can tell. It may well be that this accidental "strategy" is the major way that good allocators keep fragmentation low.

Implications for research. A major goal of allocator research should be to determine which patterns are common, and which can be exploited (or at least guarded against). Strategies that work well for one program may work poorly for another, but it may be possible to combine strategies in a single robust policy that works well for almost all programs. If that fails, it may be possible to have a small set of allocators with different properties, at least one of which works well for the vast majority of real problems.
We caution against blindly experimenting with different combinations of programs and complex, optimized allocators, however.
It is more important to determine what regularities exist in real program behavior, and only then decide which strategies are most appropriate, and which good strategies can be combined successfully. This is not to say that experiments with many variations on many designs aren't useful—we're in the midst of such experiments ourselves—but that the goal should be to identify fundamental interactions rather than just "hacking" on things until they work well for a few test applications.

Profiles of some real programs. To make our discussion of memory usage patterns more concrete, we will present profiles of memory use for some real programs. Each figure plots the overall amount of live data for a run of the program, and also the amounts of data allocated to objects of the five most popular sizes. ("Popularity" here means most volume allocated, i.e., sum of sizes, rather than object counts.) These are profiles of program behavior, independent of any particular allocator.

GCC. Figure 1 shows memory usage for GCC, the GNU C compiler, compiling the largest file of its own source code (combine.c). (A high optimization switch was used, encouraging the compiler to perform extensive inlining, analyses, and optimization.) We used a trace processor to remove "obstack" allocation from the trace, creating a trace with the equivalent allocations and frees of individual objects; obstacks are heavily used in this program.37 The use of obstacks may affect programming style and memory usage patterns; however, we suspect that the memory usage patterns would be similar without obstacks, and that obstacks are simply used to exploit them.38
This is a heavily phased program, with several strong and similar peaks. These are two-horned peaks, where one (large) size is allocated and deallocated, and a much smaller size is allocated and deallocated, out of phase.39 (This is an unusual feature, in our limited experience.) Notice that this program exhibits very different usage profiles for different-sized objects. The use of one size is nearly steady, another is strongly peaked, and others are peaked, but different.

37 See the discussion of [Han90] (Section 4.1) for a description of obstacks.
38 We've seen similarly strong peaks in a profile of a compiler of our own, which relies on garbage collection rather than obstacks.
39 Interestingly, the first of the horns usually consists of a size that is specific to that peak—different peaks use different-sized large objects, but the out-of-phase partner horn consists of the same small size each time. The differences in sizes used by the first horn explain why only three of these horns show up in the plot, and they show up for the largest peaks—for the other peaks' large sizes, the total memory used does not make it into the top five.

Grobner. Figure 2 shows memory usage for the Grobner program,40 which decomposes complex expressions into linear combinations of polynomials (Grobner bases).41 As we understand it, this is done by a process of expression rewriting, rather like term rewriting or rewrite-based theorem proving techniques.
Overall memory usage tends upward in a general ramp shape, but with minor short-term variations, especially small plateaus. While the profiles for usage of different-sized objects are roughly similar, their ramps start at different points during execution and have different slopes and irregularities—the proportions of different-sized objects vary somewhat.42

40 This program (and the hypercube simulator described below) were also used by Detlefs in [Det92] for evaluation of a garbage collector. Based on several kinds of profiles, we now think that Detlefs' choice of test programs may have led to an overestimation of the costs of his garbage collector for C++. Neither of these programs is very friendly to a simple GC, especially one without compiler or OS support.
41 The function of this program is rather analogous to that of a Fourier transform, but the basis functions are polynomials rather than sines and cosines, and the mechanism used is quite different.
42 Many of the small irregularities in overall usage come from sizes that don't make it into the top five—small but highly variable numbers of these objects are used.

Hypercube. Figure 3 shows memory usage for a hypercube message-passing simulator, written by Don Lindsay while at CMU. It exhibits a large and simple plateau.
This program allocates a single very large object near the beginning of execution, which lives for almost the entire run; it represents the nodes in a hypercube and their interconnections.43

43 In these plots, "time" advances at the end of each allocation. This accounts for the horizontal segments visible after the allocations of large objects—no other objects are allocated or deallocated between the beginning and end of the allocation of an individual object, and allocation time advances by the size of the object.
[Figure 1: cc1 -O2 -pipe -c combine.c, memory in use by object sizes (Top 5). KBytes in use (0 to 2500) versus allocation time in megabytes (0 to 18), for all objects and for 178600-, 16-, 132184-, 20-, and 69720-byte objects.]
A very large number of other objects are created, but they are small and very short-lived; they represent messages sent between nodes randomly.44 This program quickly reaches a steady state, but the steady state is quite different from the one reached by most randomized allocator simulations—a very few sizes are represented, and lifetimes are both extremely skewed and strongly correlated with sizes.

44 These objects account for the slight increase and irregularity in the overall lifetime curve at around 2MB, after the large, long-lived objects have been allocated.

Perl. Figure 4 shows memory usage for a script (program) written in the Perl scripting language. This program processes a file of string data. (We're not sure exactly what it is doing with the strings, to be honest; we do not really understand this program.) This program reaches a steady state, with heavily skewed usage of different sizes in relatively fixed proportions. (Since Perl is a fairly general and featureful programming language, its memory usage may vary tremendously depending on the program being executed.)

LRUsim. Figure 5 shows memory usage for a locality profiler written by Doug van Wieren. This program processes a memory reference trace, keeping track of how recently each block of memory has been touched and accumulating a histogram of hits to blocks at different recencies (LRU queue positions). At the end of a run, a PostScript grayscale plot of the time-varying locality characteristics is generated. The recency queue is represented as a large modified AVL tree, which dominates memory usage—only a single object size really matters much.
[Figure 2: Grobner, memory in use by object sizes (Top 5). KBytes in use (0 to 160) versus allocation time in megabytes (0 to 4), for all objects and for 12-, 24-, 22-, 18-, and 14-byte objects.]
At the parameter setting used for this run, no blocks are ever discarded, and the tree grows monotonically; essentially no heap-allocated objects are ever freed, so memory usage is a simple ramp. At other settings, only a bounded number of items are kept in the LRU tree, so that memory usage ramps up to a very stable plateau. This program exhibits a kind of dynamic stability, either by steady accumulation (as shown) or by exactly replacing the least-recently-used objects within a plateau (when used with a fixed queue length).
This is a small and simple program, but a very real one, in the sense that we have used it to tie up many megabytes of memory for about a trillion instruction cycles.45

45 We suspect that in computing generally, a large fraction of CPU time and memory usage is devoted to programs with more complex behavior, but another significant fraction is dominated by highly regular behavior of simple useful programs, or by long, regular phases of more complex programs.

Espresso. Figure 6 shows memory usage for a run of Espresso, an optimizer for programmable logic array designs.
Espresso appears to go through several qualitatively different kinds of phases, using different sizes of objects in quite different ways.

Discussion of program profiles. In real programs, memory usage is usually quite different from the memory usage of randomized traces. Ramps, peaks, and plateaus are common, as is heavily skewed usage of a few sizes. Memory usage is neither Markov nor interestingly fractal-like in most cases. Many programs exhibit large-scale and small-scale patterns which may be of any of the common feature types, and different at different scales.
[Figure 3: lindsay (hypercube simulator), memory in use by object sizes (Top 5). KBytes in use (0 to 2500) versus allocation time in megabytes (0 to 8), for all objects and for 1687552-, 393256-, 52-, 1024-, and 28-byte objects.]
Usage of different sizes may be strongly correlated, or it may not be, or may be related in more subtle time-varying ways. Given the wide variation within this small sample, it is clear that more programs should be profiled to determine which other patterns occur in a significant number of programs, and how often various patterns are likely to occur.

Summary. In summary, this section makes six related points:

- Program behavior is usually time-varying, not steady,
- Peak memory usage is important; fragmentation at peaks is more important than at intervening points,
- Fragmentation is caused by time-varying behavior, especially peaks using different sizes of objects,
- Known program behavior invalidates previous experimental and analytical results,
- Nonrandom behavior of programs can be exploited, and
- Different programs may display characteristically different nonrandom behavior.

2.5 Deferred Coalescing and Deferred Reuse

Deferred coalescing. Many allocators attempt to avoid coalescing blocks of memory that may be repeatedly reused for short-lived objects of the same size. This deferred coalescing can be added to any allocator, and usually avoids coalescing blocks that will soon be split again to satisfy requests for small objects. Blocks of a given size may be stored on a simple free list, and reused without coalescing, splitting, or formatting (e.g., putting in headers and/or footers).
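To make the mechanism concrete, the following is a minimal sketch of such size-specific free lists ("quick lists") layered over a general allocator. The names general_alloc() and general_free(), the size-class grain, and the explicit size argument to qfree() are our own placeholders—in a real system the size would usually be read back from the block's header.

    #include <stddef.h>

    #define NUM_QUICK 16           /* quick lists for sizes up to 128 */
    #define GRAIN 8

    extern void *general_alloc(size_t size);  /* slow path: may split    */
    extern void general_free(void *block);    /* slow path: may coalesce */

    static void *quick_list[NUM_QUICK];       /* free-list heads */

    void *qalloc(size_t size)
    {
        size_t idx = (size == 0) ? 0 : (size - 1) / GRAIN;
        if (idx < NUM_QUICK) {
            if (quick_list[idx] != NULL) {
                /* Fast path: pop a pre-formatted block in constant time,
                   with no coalescing, splitting, or re-formatting. */
                void *block = quick_list[idx];
                quick_list[idx] = *(void **)block;  /* link stored in block */
                return block;
            }
            /* Round requests up to the class size, so any block on a
               quick list can satisfy any request in its class. */
            return general_alloc((idx + 1) * GRAIN);
        }
        return general_alloc(size);
    }

    void qfree(void *block, size_t size)
    {
        size_t idx = (size == 0) ? 0 : (size - 1) / GRAIN;
        if (idx < NUM_QUICK) {
            *(void **)block = quick_list[idx];  /* defer coalescing: */
            quick_list[idx] = block;            /* just push the block */
        } else {
            general_free(block);
        }
    }

A real implementation must also decide when to flush quick lists back to the general allocator for coalescing; as the discussion below makes clear, that decision has fragmentation and placement consequences, not just speed consequences.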
[Figure 4: perl (words small data), memory in use by object sizes (Top 5). KBytes in use (0 to 70) versus allocation time in megabytes (0 to 35), for all objects and for 32-, 8200-, 52-, 36-, and 5632-byte objects.]
If the application requests the same size block soon after one is freed, the request can be satisfied by simply popping the pre-formatted block off of a free list in very small constant time.
While deferred coalescing is traditionally thought of as a speed optimization, it is important to note that fragmentation considerations come into play, in three ways.46

46 To our knowledge, none of these effects has been noted previously in the literature, although it's likely we've seen at least the first but forgotten where. In any event, these effects have received little attention, and don't seem to have been studied directly.

- The lower fragmentation is, the more important deferred coalescing will be in terms of speed—if adjacent objects generally die at about the same time, aggressive coalescing and splitting will be particularly expensive, because large areas will be coalesced together by repeatedly combining adjacent blocks, only to be split again into a large number of smaller blocks. If fragmentation is low, deferred coalescing may be especially beneficial.
- Deferred coalescing may have significant effects on fragmentation, by changing the allocator's decisions as to which blocks of memory to use to hold which objects. For example, blocks cannot be used to satisfy requests for larger objects while they remain uncoalesced. Those larger objects may therefore be allocated in different places than they would have been if small blocks were coalesced immediately; that is, deferred coalescing can affect placement policy.
[Figure 5: LRUsim, memory in use by object sizes (Top 5). KBytes in use (0 to 1400) versus allocation time in megabytes (0 to 1.4), for all objects and for 36-, 8200-, 4104-, 3164-, and 136-byte objects.]
- Deferred coalescing may decrease locality of reference for the same reason, because recently-freed small blocks will usually not be reused to hold larger objects. This may force the program to touch more different areas of memory than if small blocks were coalesced immediately and quickly used again. On the other hand, deferred coalescing is very likely to increase locality of reference if used with an allocator that otherwise would not reuse most memory immediately—the deferred coalescing mechanism will ensure that most freed blocks are reused soon.

Deferred reuse. Another related notion—which is equally poorly understood—is deferred reuse.47 Deferred reuse is a property of some allocators that recently-freed blocks tend not to be the soonest reused. For many allocators, free memory is managed in a mostly stack-like way. For others, it is more queue-like, with older free blocks tending to be reused in preference to newly-freed blocks.

47 Because it is not generally discussed in any systematic way in the literature, we coined this term for this paper.

Deferred reuse may have effects on locality, because the allocator's choices affect which parts of memory are used by the program—the program will tend to use memory briefly, and then use other memory before reusing that memory.
Deferred reuse may also have effects on fragmentation, because newly-allocated objects will be placed in holes left by old objects that have died. This may make fragmentation worse, by mixing objects created by different phases (which may die at different times) in the same area of memory. On the other hand, it may be very beneficial because it may gradually pack the "older" areas of memory with long-lived objects, or because it gives the neighbors of a freed block more time to die before the freed block is reused. That may allow slightly longer-lived objects to avoid causing much fragmentation, because they will die relatively soon, and be coalesced with their neighbors whose reuse was deferred.
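The difference between stack-like and queue-like reuse is easy to see at the level of free-list insertion. The following sketch shows both disciplines on a doubly-linked circular list; the names are ours, for illustration only, and everything besides the insertion point is the same in both cases.

    #include <stddef.h>

    struct freeblock {
        struct freeblock *next;
        struct freeblock *prev;
    };

    static struct freeblock head = { &head, &head };  /* circular list */

    /* LIFO: push freed blocks at the front, where allocation searches
       begin, so the most recently freed block is reused soonest. */
    void free_lifo(struct freeblock *b)
    {
        b->next = head.next;  b->prev = &head;
        head.next->prev = b;  head.next = b;
    }

    /* FIFO (deferred reuse): push freed blocks at the rear, so a freed
       block is not reused until every older free block has been
       considered first—giving its neighbors more time to die. */
    void free_fifo(struct freeblock *b)
    {
        b->prev = head.prev;  b->next = &head;
        head.prev->next = b;  head.prev = b;
    }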
[Figure 6: espresso (largest_data), memory in use by object sizes (Top 5). KBytes in use (0 to 300) versus allocation time in megabytes (0 to 120), for all objects and for 38496-, 28-, 55072-, 24464-, and 36704-byte objects.]
2.6 A Sound Methodology: Simulation Using Real Traces

The traditional view has been that programs' fragmentation-causing behavior is determined only by their object size and lifetime distributions. Recent experimental results show that this is false ([ZG94, WJNB95], Section 4.2), because orderings of requests have a large effect on fragmentation. Until a much deeper understanding of program behavior is available, and until allocator strategies and policies are as well understood as allocator mechanisms, the only reliable method for allocator simulation is to use real traces—i.e., the actual record of allocation and deallocation requests from real programs.

Tracing and simulation. Allocation traces are not particularly difficult to obtain (but see the caveats about program selection in Section 5.5). A slightly modified allocator can be used, which writes information about each allocation and deallocation request to a file—i.e., whether the request is an allocation or deallocation, the address of the block, and (for allocations) the requested block size. This allocator can be linked with a program of interest and used when running the program. These traces tend to be long, but they can be stored in compressed form, on inexpensive serial media (e.g., magnetic tape), and later processed serially during simulation.
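A minimal sketch of such a tracing allocator, wrapping the C library's malloc() and free(); the one-character record format and file name are placeholders, and a production tracer would likely buffer records and use a compact binary format.

    #include <stdio.h>
    #include <stdlib.h>

    static FILE *trace;

    static FILE *trace_file(void)
    {
        if (!trace)
            trace = fopen("alloc.trace", "w");   /* placeholder name */
        return trace;
    }

    void *trace_malloc(size_t size)
    {
        void *p = malloc(size);
        if (p)  /* 'a', block address, requested size */
            fprintf(trace_file(), "a %p %zu\n", p, size);
        return p;
    }

    void trace_free(void *p)
    {
        if (p)  /* 'f', block address */
            fprintf(trace_file(), "f %p\n", p);
        free(p);
    }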
(Allocation traces are generally very compressible, due to the strong regularities in program behavior.48) Large amounts of disk space and/or main memory are not required, although they are certainly convenient.

48 Conventional text-string-oriented compression algorithms [Nel91] (e.g., UNIX compress or GNU gzip) work quite well, although we suspect that sophisticated schemes could do significantly better by taking advantage of the numerical properties of object identifiers or addresses; such schemes have been proposed for use in compressed paging and addressing [WLM91, FP91]. (Text-oriented compression generally makes Markov-like modeling assumptions, i.e., that literal sequences are likely to recur. This is clearly true to a large degree for allocation and reference traces, but other regularities could probably be exploited as well [WB95].)
Dain Samples [Sam89] used a simple and effective approach for compressing memory-reference traces; his "Mache" trace compactor used a simple preprocessor to massage the trace into a different format, making the relevant regularities easier for standard string-oriented compression algorithms to recognize and exploit. A similarly simple system may work well for allocation traces.

To use the trace for a simulation, a driver routine reads request records out of the file, and submits them to the allocator being tested by calling the allocator in the usual way. The driver maintains a table of objects that are currently allocated, which maps the object identifier from the trace file to the address where it is allocated during simulation; this allows it to request the deallocation of the block when it encounters the deallocation record in the trace.
This simulated program doesn't actually do anything with the allocated blocks, as a real program would, but it imitates the real program's request sequences exactly, which is sufficient for measuring the memory usage. Modern profiling tools [BL92, CK93] can also be used with the simulation program to determine how many instruction cycles are spent in the allocator itself.
An alternative strategy is to actually link the program with a variety of allocators, and actually re-run the program for each "simulation". This has the advantage that the traces needn't be stored. It has the disadvantages that it requires being able to re-run the program at will (which may depend on having similar systems, input data sets being available and in the right directories, environment variables, etc.) and doesn't allow convenient sharing of traces between different experimenters for replication of experiments. It also has the obvious disadvantage that instructions spent executing the actual program are wasted, but on fast machines this may be preferable to the cost of trace I/O, for many programs.
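The following sketch shows such a driver loop, replaying the record format from the tracing sketch above against the allocator under test. The linear-scan object table is a simplification (a real driver would use a hash table), and the trace file name is assumed.

    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_OBJS 100000

    static struct { void *id; void *addr; } table[MAX_OBJS];
    static int nobjs = 0;

    int main(void)
    {
        FILE *f = fopen("alloc.trace", "r");
        char op;
        void *id;
        size_t size;

        if (!f)
            return 1;
        while (fscanf(f, " %c %p", &op, &id) == 2) {
            if (op == 'a' && fscanf(f, "%zu", &size) == 1) {
                table[nobjs].id = id;                /* traced identifier    */
                table[nobjs].addr = malloc(size);    /* allocator under test */
                nobjs++;
            } else if (op == 'f') {
                for (int i = 0; i < nobjs; i++) {    /* look up and free */
                    if (table[i].id == id) {
                        free(table[i].addr);
                        table[i] = table[--nobjs];
                        break;
                    }
                }
            }
        }
        fclose(f);
        return 0;
    }

The driver never reads or writes the allocated blocks; as the text notes, imitating the request sequence exactly is sufficient for measuring memory usage.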
Locality studies. While locality is mostly beyond the scope of this paper, it is worth making a few comments about locality studies. Several tools are available to make it relatively easy to gather memory-reference traces, and several cache and virtual memory simulators are available for processing these traces.
Larus' QPT tool (a successor to the earlier AE system [BL92]) modifies an executable program to make it self-tracing. The Shade tool from SunLabs [CK93] is essentially a CPU emulator, which runs a program in emulation and records various kinds of events in an extremely flexible way. For good performance, it uses dynamic compilation techniques to increase speed relative to a straightforward interpretive simulator.
Either of these systems can save a reference trace to a file, but the file is generally very large for long-running programs. Another alternative is to perform incremental simulation, as the trace is recorded—event records are saved to a fairly small buffer, and batches of event records are passed to a cache simulator which consumes them on the fly.
Efficient cache simulators are available for processing reference traces, including Mark Hill's Tycho and Dinero systems [HS89].49

49 Before attempting locality studies, however, allocation researchers should become familiar with the rather subtle issues in cache design, in particular the effects and interactions of associativity, fetch and prefetch policies, write buffers, victim buffers, and subblock placement. Such details have been shown to be important in assessing the impact of locality of allocation on performance; a program with apparently "poor" locality for a simple cache design may do quite well in a memory hierarchy well-suited to its behavior. The literature on garbage collection is considerably more sophisticated in terms of locality studies than the literature on memory allocation, and should not be overlooked. (See, e.g., [Bae73, KLS92, Wil90, WLM92, DTM93, Rei94, GA95, Wil95].) Many of the same issues must arise in conventionally-managed heaps as well.

3 A Taxonomy of Allocators

Allocators are typically categorized by the mechanisms they use for recording which areas of memory are free, and for merging adjacent free blocks into
larger free blocks (coalescing). Equally important are the policy and strategy implications—i.e., whether the allocator properly exploits the regularities in real request streams.
In this section, we survey the policy issues and mechanisms in memory allocation; since deferred coalescing can be added to any allocator, it will be discussed after the basic general allocator mechanisms have been covered, in Section 3.11.

3.1 Allocator Policy Issues

We believe that there are several important policy issues that must be made clear, and that real allocators' performance must be interpreted with regard to them:

- Patterns of Memory Reuse. Are recently-freed blocks reused in preference to older free areas? Are free blocks in an area of memory preferentially reused for objects of the same size (and perhaps type) as the live objects nearby? Are free blocks in some areas reused in preference to free blocks in other areas (e.g., preferentially reusing free blocks toward one end of the heap area)?
- Splitting and Coalescing. Are large free blocks split into smaller blocks to satisfy requests for smaller objects? Are adjacent free blocks merged into larger areas at all? Are all adjacent free areas coalesced, or are there restrictions on when coalescing can be done because it simplifies the implementation? Is coalescing always done when it's possible, or is it deferred to avoid needless merging and splitting over short periods of time?
- Fits. When a block of a particular size is reused, are blocks of about the same size used preferentially, or blocks of very different sizes? Or perhaps blocks whose sizes are related in some other useful way to the requested size?
- Splitting thresholds. When a too-large block is used to satisfy a request, is it split and the remainder made available for reuse? Or is the remainder left unallocated, causing internal fragmentation, either for implementation simplicity or as part of a policy intended to trade internal fragmentation for reduced external fragmentation?

All of these issues may affect overall fragmentation, and should be viewed as policies, even if the reason for a particular choice is to make the mechanism (implementation) simpler or faster. They may also have effects on locality; for example, reusing recently-freed blocks may increase temporal locality of reference by reusing memory that is still cached in high-speed memory, in preference to memory that has gone untouched for a longer while. (Locality is beyond the scope of this paper, but it is an important consideration. We believe that the best policies for reducing fragmentation are good for locality as well, by and large, but we will not make that argument in detail here.50)

50 Briefly, we believe that the allocator should heuristically attempt to cluster objects that are likely to be used at about the same times and in similar ways. This should improve locality [Bae73, WLM91]; it should also increase the chances that adjacent objects will die at about the same time, reducing fragmentation.

3.2 Some Important Low-Level Mechanisms

Several techniques are used in different combinations with a variety of allocators, and can help make sophisticated policies surprisingly easy to implement efficiently. We will describe some very low-level mechanisms that are pieces of several "basic" (higher-level) mechanisms, which in turn implement a policy. (The casual reader may wish to skim this section.)

Header fields and alignment. Most allocators use a hidden "header" field within each block to store useful information. Most commonly, the size of the block is recorded in the header. This simplifies freeing, in many algorithms, because most standard allocator interfaces (e.g., the standard C free() routine) do not require a program to pass the size of the freed block to the deallocation routine at deallocation time.
Typically, the allocation function (e.g., C's malloc() memory allocation routine) passes only the requested size, and the allocator returns a pointer to the block allocated; the free routine is only passed that address, and it is up to the allocator to infer the size if necessary. (This may not be true in some systems with stronger type systems, where the sizes of objects are usually known statically. In that case, the compiler may generate code that supplies the object size to the freeing routine automatically.)
Other information may be stored in the header as well, such as information about whether the block is in use, its relationship to its neighbors, and so on. Having information about the block stored with the block makes many common operations fast.
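A minimal sketch of this convention, assuming a single hidden size word; real_alloc() is a placeholder for whatever mechanism provides raw memory.

    #include <stddef.h>

    extern void *real_alloc(size_t bytes);   /* placeholder back end */

    void *my_malloc(size_t size)
    {
        size_t *block = real_alloc(size + sizeof(size_t));
        block[0] = size;           /* hidden header word */
        return block + 1;          /* caller sees memory past the header */
    }

    size_t block_size(void *p)
    {
        return ((size_t *)p)[-1];  /* free() et al. read the header back */
    }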
Header fields are usually one machine word; on most modern machines, that is four 8-bit bytes, or 32 bits. (For convenience, we will assume that the word size is 32 bits, unless indicated otherwise.) In most situations, there is enough room in one machine word to store a size field plus two or three one-bit "flags" (boolean fields). This is because most systems allocate all heap-allocated objects on whole-word or doubleword address boundaries, but most hardware is byte-addressable.51 (This constraint is usually imposed by compilers, because hardware issues make unaligned data slower—or even illegal—to operate on.)
This alignment means that partial words cannot be allocated—requests for non-integral numbers of words are rounded up to the nearest word. The rounding to word (or doubleword) boundaries ensures that the low two (or three) bits of a block address are always zero.
Header fields are convenient, but they consume space—e.g., a word per block. It is common for block sizes in many modern systems to average on the order of 10 words, give or take a factor of two or so, so a single word per header may increase memory usage by about 10% [BJW70, Ung86, ZG92, DDZ93, WJNB95].

51 For doubleword-aligned systems, it is still possible to use a one-word header while maintaining alignment. Blocks are allocated "off by one" from the doubleword boundary, so that the part of the block that actually stores an object is properly aligned.

Boundary tags. Many allocators that support general coalescing are implemented using boundary tags (due to Knuth [Knu73]) to support the coalescing of free areas. Each block of memory has both a header and a "footer" field, both of which record the size of the block and whether it is in use. (A footer, as the name suggests, is a hidden field within the block, at the opposite end from the header.) When a block is freed, the footer of the preceding block of memory is examined to see if it is free; likewise, the header of the following block is examined. Adjacent free areas are merged to form larger free blocks.
Header and footer overhead are likely to be significant—with an average object size of about ten words, for example, a one-word header incurs a 10% overhead and a one-word footer incurs another 10%.
Luckily there is a simple optimization that can avoid the footer overhead.52 Notice that when a block is in use (holding a live object), the size field in the footer is not actually needed—all that is needed is the flag bit saying that the storage is unavailable for coalescing. The size field is only needed when the block is free, so that its header can be located for coalescing. The size field can therefore be taken out of the last word of the block of memory—when the block is allocated, it can be used to hold part of the object; when the object is freed, the size field can be copied from the header into the footer, because that space is no longer needed to hold part of the object.
The single bit needed to indicate whether a block is in use can be stolen from the header word of the following block without unduly limiting the range of the size field.53

52 This optimization is described in [Sta80], but it appears not to have been noticed and exploited by most implementors of actual systems, or by researchers in recent years.
53 Consider a 32-bit byte-addressed system where blocks may be up to 4GB. As long as blocks are word-aligned, the least significant bits of a block address are always zero, so those two "low bits" can be used to hold the two flags. In a doubleword-aligned system, three "low bits" are available.

Link fields within blocks. For allocators using free lists or indexing trees to keep track of free blocks, the list or tree nodes are generally embedded in the free blocks themselves. Since only free blocks are recorded, and since their space would otherwise be wasted, it is usually considered reasonable to use the space within the "empty" blocks to hold pointers linking them together. Space for indexing structures is therefore "free" (almost).
Many systems use doubly-linked linear lists, with a "previous" and "next" pointer taken out of the free area. This supports fast coalescing; when objects are merged together, at least one of them must be removed from the linked list so that the resulting block will appear only once in the list. Having pointers to both the predecessor and successor of a block makes it possible to quickly remove the block from the list, by adjusting those objects' "next" and "previous" pointers to skip the removed object.
Some other allocators use trees, with space for the "left child" and "right child" (and possibly "parent") pointers taken out of the free area.
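The following sketch combines the mechanisms just described: a header word with the size and flag bits packed into the low (always-zero) bits, the footer optimization—with the preceding block's in-use bit stored in the current block's header—and free-list links embedded in the freed block itself. The layout and names are ours, for illustration only; heap-boundary checks and the free-list insertion routine are omitted.

    #include <stdint.h>

    typedef uint32_t word;

    #define INUSE      0x1u         /* this block holds a live object */
    #define PREV_INUSE 0x2u         /* the block just below is in use */
    #define SIZE(h)    ((h) & ~3u)  /* sizes are multiples of 4 bytes */

    struct freeblk { struct freeblk *next, *prev; };  /* embedded links */

    extern void push_free(struct freeblk *b);         /* free-list insert */

    static void unlink_blk(struct freeblk *b)         /* O(1) removal */
    {
        b->prev->next = b->next;
        b->next->prev = b->prev;
    }

    /* blk points at a block's header word; the payload (and, when the
       block is free, the link fields) starts at blk + 1. */
    void coalescing_free(word *blk)
    {
        word size = SIZE(*blk);

        /* Merge with the lower neighbor if our PREV_INUSE bit says it
           is free; its footer—the word just below our header—gives its
           size, which locates its header. */
        if (!(*blk & PREV_INUSE)) {
            word psize = blk[-1];
            blk = (word *)((char *)blk - psize);
            unlink_blk((struct freeblk *)(blk + 1));
            size += psize;
        }

        /* Merge with the upper neighbor if its own INUSE bit is clear. */
        word *nxt = (word *)((char *)blk + size);
        if (!(*nxt & INUSE)) {
            unlink_blk((struct freeblk *)(nxt + 1));
            size += SIZE(*nxt);
            nxt = (word *)((char *)blk + size);
        }

        /* Rewrite the header, copy the size into the footer now that
           the block is free, and tell the upper neighbor about us. */
        *blk = size | PREV_INUSE;   /* our lower neighbor must be live */
        nxt[-1] = size;             /* footer = last word of the block */
        *nxt &= ~PREV_INUSE;
        push_free((struct freeblk *)(blk + 1));
    }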
The hidden cost of putting link fields within blocks is that the block must be big enough to hold them, along with the header field and footer field, if any. This imposes a minimum block size on the allocator implementation, and any smaller request must be rounded up to that size. A common situation is having a header with a size field and boundary tags, plus two pointers in each block. This means that the smallest block size must be at least three words. (For doubleword alignment, it must be four.)
Assuming only the header field is needed on allocated blocks, the effective object size is three words for one-, two-, or three-word objects. If many objects are only one or two words long—and two is fairly common—significant space may be wasted.
Lookup tables. Some allocators treat blocks within ranges of sizes similarly—rather than indexing free blocks by their exact size, they lump together blocks of roughly the same size. The size range may also be important to the coalescing mechanism. Powers of two are often used, because it is easy to use bit selection techniques on a binary representation of the size to figure out which power-of-two range it falls into. Powers of two are coarse, however, and can have drawbacks, which we'll discuss later.
Other functions (such as Fibonacci series) may be more useful, but they are more expensive to compute at run time. A simple and effective solution is to use a lookup table, which is simply an array, indexed by the size, whose values are the numbers of the ranges. To look up which range a size falls into, you simply index into the array and fetch the value stored there. This technique is simple and very fast.
If the values used to index into the table are potentially large, however, the lookup table itself may be too big. This is often avoided by using lookup tables only for values below some threshold (see below).
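A minimal sketch of the threshold approach: exact table lookup below a cutoff, and a slower computed mapping above it. The grain, cutoff, and class boundaries here are arbitrary placeholders.

    #include <stddef.h>

    #define SMALL_MAX 256

    static unsigned char size_class[SMALL_MAX + 1];  /* filled in once */

    void init_classes(void)
    {
        /* Example policy: one class per 8-byte grain below the cutoff. */
        for (size_t s = 0; s <= SMALL_MAX; s++)
            size_class[s] = (unsigned char)((s + 7) / 8);
    }

    unsigned which_class(size_t size)
    {
        if (size <= SMALL_MAX)
            return size_class[size];   /* one array index: very fast */

        /* Large sizes: fall back to computed power-of-two ranges. */
        unsigned c = SMALL_MAX / 8;
        for (size_t s = SMALL_MAX; s < size; s <<= 1)
            c++;
        return c;
    }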
Special treatment of small objects. In most systems, many more small objects are allocated than large ones. It is therefore often worthwhile to treat small objects specially, in one sense or another. This can usually be done by having the allocator check to see if the size is small, and if so, use an optimized technique for small values; for large values, it may use a slower technique.
One application of this principle is to use a fast allocation technique for small objects, and a space-efficient technique for large ones. Another is to use fast lookup table techniques for small values, and slower computations for large ones, so that the lookup tables don't take up much space. In this case, consider the fact that it is very difficult for a program to use a large number of large objects in a short period of time—it generally must do something with the space it allocates, e.g., initialize the fields of the allocated objects, and presumably do something more with at least some of their values. For some moderate object size and above, the possible frequency of allocations is so low that a little extra overhead is not significant. (Counterexamples are possible, of course, but we believe they are rare.) The basic idea here is to ensure that the time spent allocating a block is small relative to the computations on the data it holds.

Special treatment of the end block of the heap. The allocator allocates memory to programs on request, but the allocator itself must get memory from somewhere. The most common situation in modern systems is that the heap occupies a range of virtual addresses and grows "upward" through the address space. To request more (virtual) memory, a system call such as the UNIX brk()54 call is used to request that storage be mapped to that region of address space, so that it can be used to hold data.55 Typically, the allocator keeps a "high-water mark" that divides memory into the part that is backed by storage and the part that is not.
(In systems with a fixed memory, such as some non-virtual memory systems, many allocators maintain a similar high-water mark for their own purposes, to keep track of which part of memory is in use and which part is a large contiguous free space.)
We will generally assume that a paged virtual memory is in use. In that case, the system call that obtains more memory obtains some integral number of pages (e.g., 4KB, 8KB, 12KB, or 16KB on a machine with 4KB pages). If a larger block is requested, a larger request (for as many pages as necessary) is made.
Typically the allocator requests memory from the operating system when it cannot otherwise satisfy a memory request, but it actually only needs a small amount of memory to satisfy the request (e.g., 10 words). This raises the question of what is done with the rest of the memory returned by the operating system.

54 brk() is often called indirectly, via the library routine sbrk().
55 Other arrangements are possible. For example, the heap could be backed by a (growable) memory-mapped file, or several files mapped to non-contiguous ranges of address space.
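A minimal sketch of this arrangement using sbrk(): the heap is grown by whole pages, and small requests are carved out of the "end block" between the last-used address and the high-water mark. The page size, alignment, and error handling are simplified, and a real allocator would fold the end block into its normal block structure rather than keeping bare pointers.

    #include <unistd.h>
    #include <stdint.h>
    #include <stddef.h>

    #define PAGE 4096

    static char *frontier   = NULL;  /* first unused byte                */
    static char *high_water = NULL;  /* first byte not backed by storage */

    void *end_block_alloc(size_t size)
    {
        size = (size + 7) & ~(size_t)7;              /* doubleword round */
        if (frontier == NULL)
            frontier = high_water = (char *)sbrk(0); /* current break */

        if (frontier + size > high_water) {
            /* Grow by whole pages—usually more than this one request
               needs; the remainder stays in the end block. */
            size_t grow = (size + PAGE - 1) & ~(size_t)(PAGE - 1);
            if (sbrk((intptr_t)grow) == (void *)-1)
                return NULL;                         /* out of memory */
            high_water += grow;
        }
        void *p = frontier;
        frontier += size;
        return p;
    }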
While this seems like a trivial bookkeeping matter, it appears that the treatment of this "end block" of memory may have significant policy consequences under some circumstances. (We will return to this issue in Section 3.5.)
The section on sequential fits, below, is particularly important—many basic policy issues arise there, and the policy discussion is applicable to many different mechanisms. After describing these basic allocators, we will discuss deferred coalescing techniques applicable to all of them.

3.4 Sequential Fits

Several classic allocator algorithms are based on having a single linear list of all free blocks of memory. (The list is often doubly-linked and/or circularly-linked.) Typically, sequential fits algorithms use Knuth's boundary tag technique, and a doubly-linked list to make coalescing simple and fast.
In considering sequential fits, it is probably most important to keep strategy and policy issues in mind. The classic linear-list implementations may not scale well to large heaps, in terms of time costs; as the number of free blocks grows, the time to search the list may become unacceptable.56 More efficient and scalable techniques are available, using totally or partially ordered trees, or segregated fits (see Section 3.6).57

56 This is not necessarily true, of course, because the average search time may be much lower than the worst case. For robustly good performance, however, it appears that simple linear lists should generally be avoided for large heaps.
57 The confusion of mechanism with strategy and policy has sometimes hampered experimental evaluations; even after obviously scalable implementations had been discussed in the literature, later researchers often excluded sequential fit policies from consideration due to their apparent time costs.

Best fit. Best fit searches the list for the smallest free block large enough to satisfy the request. A possible concern is that this policy of minimizing the leftover part might accumulate many tiny, useless fragments.58 This appears not to happen in practice, or at least not commonly.

58 This potential accumulation of small fragments (often called "splinters" or "sawdust") was noted by Knuth [Knu73], but it seems not to be a serious problem for best fit, with either real or synthetic workloads.

First fit. First fit simply searches the list from the beginning, and uses the first free block large enough to satisfy the request. If the block is larger than necessary, it is split and the remainder is put on the free list.
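A minimal sketch of a first fit search with splitting, using the free-list and header conventions from the sketches in Section 3.2; footer and PREV_INUSE maintenance for the allocated block are omitted for brevity.

    #include <stdint.h>
    #include <stddef.h>

    typedef uint32_t word;
    #define INUSE      0x1u
    #define PREV_INUSE 0x2u
    #define SIZE(h)    ((h) & ~3u)
    #define MIN_BLOCK  16   /* smallest splittable remainder */

    struct freeblk { struct freeblk *next, *prev; };
    extern struct freeblk head;                 /* list anchor */
    extern void unlink_blk(struct freeblk *b);
    extern void push_free(struct freeblk *b);

    void *first_fit(word need)   /* need includes the header word */
    {
        for (struct freeblk *b = head.next; b != &head; b = b->next) {
            word *hdr  = (word *)b - 1;         /* header precedes links */
            word  size = SIZE(*hdr);
            if (size < need)
                continue;                       /* keep searching */
            unlink_blk(b);                      /* first big-enough block */
            if (size - need >= MIN_BLOCK) {
                /* Split: keep the front part, return the remainder
                   (still free) to the list. */
                word *rem = (word *)((char *)hdr + need);
                *rem = (size - need) | PREV_INUSE;
                push_free((struct freeblk *)(rem + 1));
                size = need;
            }
            *hdr = size | INUSE;
            return b;                           /* payload address */
        }
        return NULL;                            /* no fit found */
    }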
satisfy the request. If the block is larger than neces- In experiments with both real and synthetic traces,
sary, it is split and the remainder is put on the free it appears that address-ordered rst t may cause sig-
list. nicantly less fragmentation than LIFO-ordered rst
A problem with sequential first fit is that the larger blocks near the beginning of the list tend to be split first, leaving a lot of small fragments near the beginning of the list. These "splinters" can increase search times because many small free blocks accumulate, and the search must go past them each time a larger block is requested. Classic (linear) first fit therefore may scale poorly to systems in which many objects are allocated and many different-sized free blocks accumulate.

As with best fit, however, more scalable implementations of first fit are possible, using more sophisticated data structures. This is somewhat more difficult for first fit, however, because a first fit search must find the first block that is also large enough to hold the object being allocated. (These techniques will be discussed under the heading of Indexed Fits, in Section 3.8.)

This brings up an important policy question: what ordering is used so that the "first" fit can be found? When a block is freed, at what position is it inserted into the ordered set of free blocks? The most obvious ordering is probably to simply push the block onto the front of the free list. Recently-freed blocks would therefore be "first," and tend to be reused quickly, in LIFO (last-in-first-out) order. In that case, freeing is very fast but allocation requires a sequential search.

Another possibility is to insert blocks in the list in address order, requiring list searches when blocks are freed, as well as when they are allocated.
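The difference between these orderings is literally a few lines of insertion code; a hypothetical sketch of both disciplines, using the same minimal free-block header as above:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct fblock {
        size_t         size;
        struct fblock *next;
    } fblock_t;

    /* LIFO order: constant-time free; recently-freed blocks reused first. */
    static void free_lifo(fblock_t **list, fblock_t *b) {
        b->next = *list;
        *list = b;
    }

    /* Address order: linear-time free; the list stays sorted by address,
       so free-block adjacency can be read off the list for coalescing. */
    static void free_addr(fblock_t **list, fblock_t *b) {
        fblock_t **p = list;
        while (*p != NULL && (uintptr_t)*p < (uintptr_t)b)
            p = &(*p)->next;
        b->next = *p;
        *p = b;
    }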
An advantage of address-ordered first fit is that the address ordering encodes the adjacency of free blocks; this information can be used to support fast coalescing. No boundary tags or double linking (backpointers) are necessary. This can decrease the minimum object size relative to other schemes.59

Footnote 59: Another possible implementation of address-ordered first fit is to use a linked list of all blocks, allocated or free, and use a size field in the header of each block as a "relative" pointer (offset) to the beginning of the next block. This avoids the need to store a separate link field, making the minimum object size quite small. (We've never seen this technique described, but would be surprised if it hasn't been used before, perhaps in some of the allocators described in [KV85].) If used straightforwardly, such a system is likely to scale very poorly, because live blocks must be traversed during search, but this technique might be useful in combination with some other indexing structure.

In experiments with both real and synthetic traces, it appears that address-ordered first fit may cause significantly less fragmentation than LIFO-ordered first fit (e.g., [Wei76, WJNB95]); the address-ordered variant is the most studied, and apparently the most used.

Another alternative is to simply push freed blocks onto the rear of a (doubly-linked) list, opposite the end where searches begin. This results in a FIFO (first-in-first-out) queue-like pattern of memory use. This variant has not been considered in most studies, but recent results suggest that it can work quite well--better than the LIFO ordering, and perhaps as well as address ordering [WJNB95].

A first fit policy may tend over time toward behaving rather like best fit, because blocks near the front of the list are split preferentially, and this may result in a roughly size-sorted list.60 Whether this happens for real workloads is unknown.

Footnote 60: This has also been observed by Ivor Page [Pag82] in randomized simulations, and similar (but possibly weaker) observations were made by Knuth and Shore and others in the late 1960's and 1970's. (Section 4.)

Next fit. A common "optimization" of first fit is to use a roving pointer for allocation [Knu73]. The pointer records the position where the last search was satisfied, and the next search begins from there. Successive searches cycle through the free list, so that searches do not always begin in the same place and result in an accumulation of splinters. The usual rationale for this is to decrease average search times when using a linear list, but this implementation technique has major effects on the policy (and effective strategy) for memory reuse.

Since the roving pointer cycles through memory regularly, objects from different phases of program execution may become interspersed in memory. This may affect fragmentation if objects from different phases have different expected lifetimes. (It may also seriously affect locality. The roving pointer itself may have bad locality characteristics, since it examines each free block before touching the same block again. Worse, it may affect the locality of the program it allocates for, by scattering objects used by certain phases and intermingling them with objects used by other phases.)

In several experiments using both real traces [WJNB95] and synthetic traces (e.g., [Bay77, Wei76, Pag84, KV85]), next fit has been shown to cause more fragmentation than best fit or address-ordered first fit, and the LIFO-order variant may be significantly worse than address order [WJNB95].
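A hypothetical sketch of the roving-pointer mechanism over a circularly-linked free list (unlinking and splitting are left to the caller):

    #include <stddef.h>

    typedef struct fblock {
        size_t         size;
        struct fblock *next;   /* circularly linked free list */
    } fblock_t;

    static fblock_t *rover;    /* where the last search left off */

    /* Next fit: resume at the rover rather than the list head, so
       successive searches cycle through the entire free list. */
    static fblock_t *next_fit(size_t request) {
        fblock_t *start = rover;
        if (start == NULL)
            return NULL;
        do {
            if (rover->size >= request) {
                fblock_t *found = rover;
                rover = rover->next;   /* next search starts past the hit */
                return found;          /* caller unlinks and splits       */
            }
            rover = rover->next;
        } while (rover != start);
        return NULL;                   /* wrapped all the way around */
    }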
As with the other sequential fits algorithms, scalable implementations of next fit are possible, using various kinds of trees rather than linear lists.

3.5 Discussion of Sequential Fits and General Policy Issues.

The sequential fits algorithms have many possible variations, which raise policy issues relevant to most other kinds of allocators as well.

List order and policy. The classic first fit or next fit mechanisms may actually implement very different policies, depending on exactly how the free list is maintained. These policy issues are relevant to many other allocation mechanisms as well, but we will discuss them in the context of sequential fits for concreteness.

LIFO-ordered variants of first fit and next fit push freed blocks onto the front of the list, where they will be the next considered for reuse. (In the case of next fit, this immediate reuse only happens if the next allocation request can be satisfied by that block; otherwise the roving pointer will rove past it.)

If a FIFO-ordered free list is used, freed blocks may tend not to be reused for a long time. If an address-ordered free list is used, blocks toward one end of memory will tend to be used preferentially. Seemingly minor changes to a few lines of code may change the placement policy dramatically, and in effect implement a whole new strategy with respect to the regularities of the request stream.

Address-ordered free lists may have an advantage in that they tend to pack one end of memory with live objects, and gradually move upward through the address space. In terms of clustering related objects, the effects of this strategy are potentially complex. If adjacent objects tend to die together, large contiguous areas of memory will come free, and later be carved up for consecutively-allocated objects. If deaths are scattered, however, scattered holes will be filled with related objects, perhaps decreasing the chances of contiguous areas coming free at about the same time. (Locality considerations are similarly complex.)

Even for best fit, the general strategy does not determine an exact policy. If there are multiple equally-good best fits, how is the tie broken? We do not know whether this choice actually occurs often in practice. It may be that large blocks tend to come free due to clustered deaths. If free blocks become scattered, however, choosing among them may be particularly significant.

Splitting. A common variation is to impose a splitting threshold, so that blocks will not be split if they are already small. Blocks generally can't be split if the resulting remainder is smaller than the minimum block size (big enough to hold the header (and possibly a footer) plus the free list link(s)). In addition, the allocator may choose not to split a block if the remainder is "too small," either in absolute terms [Knu73] or relative to the size of the block being split [WJNB95]. This policy is intended to avoid allocating in the remainder a small object that may outlive the large object, and prevent the reclamation of a larger free area. Splitting thresholds do not appear to be helpful in practice, unless (perhaps) they are very small.

Splitting raises other policy questions: when a block is split, where is the remainder left in the free list? For address-ordered variants, there is no choice, but for others, there are several possibilities--leave it at the point in the list where the split block was found (this seems to be common), or put it on one end or the other of the free list, or anywhere in between.61 And when the block is split, is the first part used, or the last, or even the middle?62

Footnote 61: Our guess is that putting it at the head of the list would be advantageous, all other things being equal, to increase the chances that it would be used soon. This might tend to place related objects next to each other in memory, and decrease fragmentation if they die at about the same time. On the other hand, if the remainder is too small and only reusable for a different size, this might make it likely to be used for a different purpose, and perhaps it should not be reused soon.

Footnote 62: Using the last part has the minor speed advantage that the first part can be left linked where it is in the free list--if that is the desired policy--rather than unlinking the first part and having to link the remainder back into the list.
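Returning to the threshold variation described above: as code, the decision reduces to a small guard on the split; a hypothetical sketch, with purely illustrative constants:

    #include <stddef.h>
    #include <stdbool.h>

    #define MIN_BLOCK   16  /* header (and footer) plus free-list links */
    #define SPLIT_DENOM  8  /* relative threshold: 1/8 of the block     */

    /* Should a free block of 'have' bytes be split to satisfy a request
       of 'want' bytes, or be handed out whole? */
    static bool should_split(size_t have, size_t want) {
        size_t remainder = have - want;
        if (remainder < MIN_BLOCK)           /* remainder can't be a block */
            return false;
        if (remainder < have / SPLIT_DENOM)  /* remainder "too small"      */
            return false;                    /* relative to the block      */
        return true;
    }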
Other policies. Sequential fits techniques may also be used to intentionally implement unusual policies. One policy is worst fit, where the largest free block is always used, in the hope that small fragments will not accumulate. The idea of worst fit is to avoid creating small, unusable fragments by making the remainder as large as possible. This extreme policy seems to work quite badly (in synthetic trace studies, at least)--probably because of its tendency to ensure that there are no very large blocks available. The general idea may have some merit, however, as part of a combination of strategies.
Another policy is so-called "optimal fit," where a limited search of the list is usually used to "sample" the list, and a further search finds a fit that is as good or better [Cam71].63

Footnote 63: This is not really optimal in any useful sense, of course. See also Page's critique in [Pag82] (Section 4.1).

Another policy is "half fit" [FP74], where the allocator preferentially splits blocks twice the requested size, in hopes that the remainder will come in handy if a similar request occurs soon.

Scalability. As mentioned before, the use of a sequentially-searched list poses potentially serious scalability problems--as heaps become large, the search times can in the worst case be proportional to the size of the heap. The use of balanced binary trees, self-adjusting ("splay") trees,64 or partially ordered trees can reduce the worst-case performance so that it is logarithmic in the number of free blocks, rather than linear.65

Footnote 64: Splay trees are particularly interesting for this application, since they have an adaptive characteristic that may adjust well to the patterns in allocator requests, as well as having amortized complexity within a constant factor of optimal [ST85].

Footnote 65: We suspect that earlier researchers often simply didn't worry about this because memory sizes were quite small (and block sizes were often rather large). Since this point was not generally made explicit, however, the obvious applicability of scalable data structures was simply left out of most discussions, and the confusion between policy and mechanism became entrenched.

Scalability is also sensitive to the degree of fragmentation. If there are many small fragments, the free list will be long and may take much longer to search.

Plausible pathologies. It may be worth noting that LIFO-ordered variants of first fit and next fit can suffer from severe fragmentation in the face of certain simple and plausible patterns of allocation and deallocation. The simplest of these is when a program repeatedly does the following:

1. allocates a (short-lived) large object,
2. allocates a long-lived small object, and
3. allocates another short-lived large object of the same size as the freed large object.

In this case, each time a large block is freed, a small block is soon taken out of it to satisfy the request for the small object. When the next large object is allocated, the block used for the previously-deallocated large object is now too small to hold it, and more memory must be requested from the operating system. The small objects therefore end up effectively wasting the space for large objects, and fragmentation is proportional to the ratio of their sizes. This may not be a common occurrence, but it has been observed to happen in practice more than once, with severe consequences.66

Footnote 66: One example is in an early version of the large object manager for the Lucid Common Lisp system (Jon L. White, personal communication, 1991); another is mentioned in [KV85] (Section 4.1).

A more subtle possible problem with next fit is that clustered deallocations of different-sized objects may result in a free list that has runs of similar-sized blocks, i.e., batches of large blocks interspersed with batches of small blocks. The occasional allocation of a large object may often force the free pointer past many small blocks, so that subsequent allocations are more likely to carve small blocks out of large blocks. (This is a generalization of the simple kind of looping behavior that has been shown to be a problem for some programs.)

We do not yet know whether this particular kind of repetitive behavior accounts for much of the fragmentation seen for next fit in several experiments.

Treatment of the end block. As mentioned before, the treatment of the last block in the heap--at the point where more memory is obtained from the operating system, or from a preallocated pool--can be quite important. This block is usually rather large, and a mistake in managing it can be expensive. Since such blocks are allocated whenever heap memory grows, consistent mistakes could be disastrous [KV85]--all of the memory obtained by the allocator could get "messed up" soon after it comes under the allocator's control.

There is a philosophical question of whether the end block is "recently freed" or not. On the one hand, the block just became available, so perhaps it should be put on whichever end of the free list freed blocks are put on. On the other hand, it's not being freed--in a sense, the end block has been there all along, ignored until needed. Perhaps it should go on the opposite end of the list because it's conceptually the oldest block--the very large block that contains all as-yet-unused memory.
Such philosophical fine points aside, there is the practical question of how to treat a virgin block of significant size, to minimize fragmentation. (This block is sometimes called "wilderness" [Ste83] to signify that it is as yet unspoiled.)

Consider what happens if a first fit or next fit policy is being used. In that case, the allocator will most likely carve many small objects out of it immediately, greatly increasing the chances of being unable to recover the contiguous free memory of the block. On the other hand, putting it on the opposite end of the list will tend to leave it unused for at least a while, perhaps until it gets used for a larger block or blocks. An alternative strategy is to keep the wilderness block out of the main ordering data structure entirely, and only carve blocks out of it when no other space can be found. (This "wilderness" block can also be extended to include more memory by expanding the heap segment, so that the entire area above the high-water mark is viewed as a single huge block.67) Korn and Vo call this a "wilderness preservation heuristic," and report that it is helpful for some allocators [KV85]. (No quantitative results are given, however.)

Footnote 67: In many simple UNIX and roughly UNIX-like systems, the allocator should be designed so that other routines can request pages from the operating system by extending the (single) "data segment" of the address space. In that case, the allocator must be designed to work with a potentially non-contiguous set of pages, because there may be intervening pages that belong to different routines. (For example, our Texas persistent store allows the data segment to contain interleaved pages belonging to a persistent heap and a transient heap [SKW92].) Despite this possible interleaving of pages used by different modules, extending the heap will typically just extend the "wilderness block," because it's more likely that successive extensions of the data segment are due to requests by the allocator, than that memory requests from different sources are interleaved.
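A sketch of the wilderness preservation heuristic in this form; the two helper routines are hypothetical placeholders for the general-pool search and for the wilderness carving and extension logic:

    #include <stddef.h>

    /* Hypothetical helpers: search the ordinary free-block index, and
       carve from the wilderness block, extending the heap (e.g., with
       sbrk()) when the wilderness itself is too small. */
    extern void *search_general_pool(size_t request);
    extern void *carve_from_wilderness(size_t request);

    /* Wilderness preservation: the end block is kept out of the main
       ordering data structure and is touched only as a last resort. */
    static void *wp_alloc(size_t request) {
        void *p = search_general_pool(request);
        if (p != NULL)
            return p;
        return carve_from_wilderness(request);
    }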
For policies like best fit and address-ordered first fit, it seems natural to simply put the end block in the indexing structure like any other block. If the end block is viewed as part of the (very large) block of as-yet-unused memory, this means that a best fit or address-ordered first fit policy will always use any other available memory before carving into the wilderness. If it is not viewed this way, the end block will usually be a little less than a page (or whatever unit is used to obtain memory from the operating system); typically, it will not be used to satisfy small requests unless there are no other similarly-large blocks available.

We therefore suspect--but do not know--that it does not matter much whether the block is viewed as the beginning of a huge block, or as a moderate-sized block in its own right, as long as the allocator tends to use smaller or lower-addressed blocks in preference to larger or higher-addressed blocks.68

Footnote 68: It is interesting to note, however, that the direction of the address ordering matters for first fit, if the end block is viewed as the beginning of a very large block of all unused memory. If reverse-address-order is used, it becomes pathological. It will simply march through all of "available" memory--i.e., all memory obtainable from the operating system--without reusing any memory. This suggests to us that address-ordered first fit (using the usual preference order) is somehow more "right" than its opposite, at least in a context where the size of memory can be increased.

Summary of policy issues. While best fit and address-ordered first fit seem to work well, it is not clear that other policies can't do quite as well; FIFO-ordered first fit may be about as good.

The sensitivity of such results to slight differences in mechanical details suggests that we do not have a good model of program behavior and allocator performance--at this point, it is quite unclear which seemingly small details will have significant policy consequences.

Few experiments have been performed with novel policies and real program behavior; research has largely focused on the obvious variations of algorithms that date from the early 1960's or before.69

Footnote 69: Exceptions include Fenton and Payne's "half fit" policy (Section 4.1), and Beck's "age match" policy (Section 4.1). Barrett and Zorn's "lifetime prediction" allocator (Section 4.2) is the only recent work we know of (for conventional allocators) that adopts a novel and explicit strategy to exploit interesting regularities in real request streams.

Speculation on strategy issues. We have observed that best fit and address-ordered first fit perform quite similarly, for both real and synthetic traces.

Page [Pag82] has observed that (for random traces using uniform distributions), the short-term placement choices made by best fit and address-ordered first fit were usually identical. That is, if one of these policies was used up to a certain point in a trace, switching to the other for the next allocation request usually did not change the placement decision made for that request.
We speculate that this reflects a fundamental similarity between best fit and address-ordered first fit, in terms of how they exploit regularities in the request stream. These allocators seem to perform well--and very similarly--for both real and randomized workloads. In some sense, perhaps, each is an approximation of the other.

But a more important question is this: what is the successful strategy that both of these policies implement?

One possibility is something we might call the "open space preservation" heuristic, i.e., try not to cut into relatively large unspoiled areas.70 At some level, of course, this is obvious--it's the same general idea that was behind best fit in the first place, over three decades ago.

Footnote 70: Korn and Vo's "wilderness preservation heuristic" can be seen as a special case or variant of the "open space preservation heuristic."

As we mentioned earlier, however, there are at least two ideas behind best fit, at least in our view:

- Minimize the remainder--i.e., if a block must be split, split the block that will leave the smallest remainder. If the remainder goes unused, the smaller it is, the better.

- Don't break up large free areas unnecessarily--preferentially split areas that are already small, and hence less likely to be flexibly usable in the future.

In some cases, the first principle may be more important, while the second may be more important in other cases. Minimizing the remainder may have a tendency to result in small blocks that are unlikely to be used soon; the result may be similar to having a splitting threshold, and thus to respecting the second principle.71

Footnote 71: This could explain why explicit splitting thresholds don't seem to be very helpful--policies like best fit may already implement a similar strategy indirectly, and adding an explicit splitting threshold may be overkill.

These are very different strategies, at least on the surface. It's possible that these strategies can be combined in different ways--and perhaps they are combined in different ways by best fit and address-ordered first fit.

Shore [Sho75] designed and implemented a hybrid best fit/first fit policy that outperformed either plain first fit or plain best fit for his randomized workloads. (Discussed in Section 4.1.) The strategic implications of this hybrid policy have not been explored, and it is unclear whether they apply to real workloads. Shore's results should be interpreted with considerable caution, because real workloads exhibit regularities (e.g., plateaus and ramps) that seem likely to interact with these strategies in subtle ways.72

Footnote 72: For example, address-ordered first fit has a tendency to pack one end of memory with live data, and leave larger holes toward the other end. This seems particularly relevant to programs that allocate large and very long-lived data structures near the beginning of execution.

Address-ordered first fit seems likely to have other strategic implications as well. The use of address ordering seems likely to result in clustering of related data under some circumstances, increasing the chances that contiguous areas will come free, if the related objects die together. However, in cases where free blocks are small, of varied sizes, and widely scattered, first fit may tend to decluster related objects, as will best fit. Amending these policies may allow better clustering, which could be important for long-run fragmentation.

It should now be quite unclear why best fit and address-ordered first fit work well in practice, and whether they work for the same reasons under randomized workloads as for real workloads.

For randomized workloads, which cause more scattered random deaths, there may be very few placement choices, and little contiguous free memory. In that case, the strategy of minimizing the remainder may be crucial. For real workloads, however, large contiguous areas may come free at the ends of phases, and tend to be carved up into small blocks by later phases as live data accumulate. This may often result in contiguous allocation of successively-allocated blocks, which will again create large free blocks when they die together at the end of the later phase. In that case, the effects of small "errors" due to unusually long-lived objects may be important; they may lead to cumulative fragmentation for long-running programs, or fragmentation may stabilize after a while. We simply don't know.

There are many possible subtle interactions and strategic implications, all of which are quite poorly understood for these seemingly simple and very popular policies.
3.6 Segregated Free Lists

One of the simplest allocators uses an array of free lists, where each list holds free blocks of a particular size [Com64]. When a block of memory is freed, it is simply pushed onto the free list for that size. When a request is serviced, the free list for the appropriate size is used to satisfy the request. There are several important variations on this segregated free lists scheme.

It is important to note that blocks in such schemes are logically segregated in terms of indexing, but usually not physically segregated in terms of storage. Many segregated free list allocators support general splitting and coalescing, and therefore must allow mixing of blocks of different sizes in the same area of memory.

One common variation is to use size classes to lump similar sizes together for indexing purposes, and use free blocks of a given size to satisfy a request for that size, or for any size that is slightly smaller (but still larger than any smaller size class). A common size-class scheme is to use sizes that are a power of two apart (e.g., 4 words, 8 words, 16 words...) and round the requested size up to the nearest size class; however, closer size class spacings have also been used, and are usually preferable.
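For powers-of-two size classes, the class lookup is a simple rounding computation; a hypothetical sketch:

    #include <stddef.h>

    #define MIN_CLASS_LOG 2   /* smallest class: 4 words (illustrative) */

    /* Map a request, in words, to a size class index; class i holds
       blocks of 1 << (MIN_CLASS_LOG + i) words. */
    static unsigned size_class(size_t words) {
        unsigned i = 0;
        size_t cap = (size_t)1 << MIN_CLASS_LOG;
        while (cap < words) {  /* round up to the next power of two */
            cap <<= 1;
            i++;
        }
        return i;
    }

free_lists[size_class(n)] then selects the list to push onto or pop from; a lookup table or a count-leading-zeros instruction can replace the loop.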
Simple segregated storage. In this variant, no splitting of free blocks is done to satisfy requests for smaller sizes. When a request for a given size is serviced, and the free list for the appropriate size class is empty, more storage is requested from the underlying operating system (e.g., using UNIX sbrk() to extend the heap segment); typically one or two virtual memory pages are requested at a time, and split into same-sized blocks which are then strung together and put on the free list. We call this simple segregated storage because the result is that pages (or some other relatively large unit) contain blocks of only one size class. (This differs from the traditional terminology in an important way. "Segregated storage" is commonly used to refer both to this kind of scheme and what we call segregated fits [PSC71]. We believe this terminology has caused considerable confusion, and will generally avoid it; we will refer to the larger class as "segregated free list" schemes, or use the more specific terms "simple segregated storage" and "segregated fits."73)

Footnote 73: Simple segregated storage is sometimes incorrectly called a buddy system; we do not use that terminology because simple segregated storage does not use a buddy rule for coalescing--no coalescing is done at all. (Standish [Sta80] refers to simple segregated storage as "partitioned storage.")

An advantage of this simple scheme is that no headers are required on allocated objects; the size information can be recorded for a page of objects, rather than for each object individually. This may be important if the average object size is very small. Recent studies indicate that in modern programs, the average object size is often quite small by earlier standards (e.g., around 10 words [WJNB95]), and that header and footer overheads alone can increase memory usage by ten percent or twenty percent [ZG92, WJNB95]. This is comparable to the "real" fragmentation for good allocators [WJNB95].

Simple segregated storage is quite fast in the usual case, especially when objects of a given size are repeatedly freed and reallocated over short periods of time. The freed blocks simply wait until the next allocation of the same size, and can be reallocated without splitting. Allocation and freeing are both fast constant-time operations.

The disadvantage of this scheme is that it is subject to potentially severe external fragmentation--no attempt is made to split or coalesce blocks to satisfy requests for other sizes. The worst case is a program that allocates many objects of one size class and frees them, then does the same for many other size classes. In that case, separate storage is required for the maximum volume of objects of all sizes, because none of the memory allocated to one size class can be reused for another.

There is some tradeoff between expected internal fragmentation and external fragmentation; if the spacing between size classes is large, more different sizes will fall into each size class, allowing space for some sizes to be reused for others. (In practice, very coarse size classes generally lose more memory to internal fragmentation than they save in external fragmentation.) In the worst case, memory usage is proportional to the product of the maximum amount of live data (plus the worst-case internal fragmentation due to the rounding up of sizes) and the number of size classes.

A crude but possibly effective form of coalescing for simple segregated storage (used by Mike Haertel in a fast allocator [GZH93, Vo95], and in several garbage collectors [Wil95]) is to maintain a count of live objects for each page, and notice when a page is entirely empty. If a page is empty, it can be made available for allocating objects in a different size class, preserving the invariant that all objects in a page are of a single size class.74

Footnote 74: This invariant can be useful in some kinds of systems, especially systems that provide persistence [SKW92] and/or garbage collection for languages such as C or C++ [BW88, WDH89, WJ93], where pointers may point into the interior parts of objects, and it is important to be able to find the object headers quickly. In garbage-collected systems, it is common to segregate objects by type, or by implementation-level characteristics, to facilitate optimizations of type checking and/or garbage collection [Yua90, Del92, DEB94].
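A hypothetical sketch of simple segregated storage's fast path; the helpers for obtaining a page and mapping a class to its block size are assumed, and the per-page record of the size class (which lets objects be freed without headers) is elided:

    #include <stddef.h>

    #define PAGE_SIZE 4096
    #define NCLASSES  32

    typedef struct fblock { struct fblock *next; } fblock_t;

    static fblock_t *free_lists[NCLASSES];

    extern void  *get_page(void);           /* hypothetical, e.g., via sbrk() */
    extern size_t class_size(unsigned c);   /* block size (bytes) of class c  */

    /* Allocate from class c; refill from a fresh page when the list is
       empty, splitting the page into same-sized blocks strung together. */
    static void *ssalloc(unsigned c) {
        if (free_lists[c] == NULL) {
            char  *page = get_page();
            size_t sz   = class_size(c);
            for (size_t off = 0; off + sz <= PAGE_SIZE; off += sz) {
                fblock_t *b = (fblock_t *)(page + off);
                b->next = free_lists[c];
                free_lists[c] = b;
            }
        }
        fblock_t *b = free_lists[c];        /* pop: constant time */
        free_lists[c] = b->next;
        return b;
    }

    /* Free: push back onto the class's list, also constant time. */
    static void ssfree(void *p, unsigned c) {
        fblock_t *b = p;
        b->next = free_lists[c];
        free_lists[c] = b;
    }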
Segregated fits. This variant uses an array of free lists, with each list holding free blocks within a size class. When servicing a request for a particular size, the free list for the corresponding size class is searched for a block at least large enough to hold it. The search is typically a sequential fits search, and many significant variations are possible (see below). Typically first fit or next fit is used. It is often pointed out that the use of multiple free lists makes the implementation faster than searching a single free list. What is sometimes not appreciated is that this also affects the placement in a very important way--the use of segregated lists excludes blocks of very different sizes, meaning good fits are usually found--the policy therefore embodies a good fit or even best fit strategy, despite the fact that it's often described as a variation on first fit.

If there is not a free block in the appropriate free list, segregated fits algorithms try to find a larger block and split it to satisfy the request. This usually proceeds by looking in the list for the next larger size class; if it is empty, the lists for larger and larger sizes are searched until a fit is found. If this search fails, more memory is obtained from the operating system to satisfy the request. For most systems using size classes, this is a logarithmic-time search in the worst case. (For example, for powers-of-two size classes, the total number of lists is equal to the logarithm of the maximum block size. For a somewhat more refined series, it is still generally logarithmic, but with a larger constant factor.)
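A hypothetical sketch of that search order, with the splitting and refill logic abstracted away:

    #include <stddef.h>

    #define NCLASSES 32

    typedef struct fblock {
        size_t         size;
        struct fblock *next;
    } fblock_t;

    static fblock_t *free_lists[NCLASSES];

    extern unsigned size_class(size_t size);               /* hypothetical */
    extern void    *split_block(fblock_t *b, size_t size); /* split, refile */
    extern void    *get_more_memory(size_t size);          /* e.g., sbrk()  */

    /* Segregated fits: start at the request's own class; if its list
       yields no fit, climb to larger and larger classes before giving up. */
    static void *segfit_alloc(size_t request) {
        for (unsigned c = size_class(request); c < NCLASSES; c++) {
            for (fblock_t **p = &free_lists[c]; *p != NULL; p = &(*p)->next) {
                fblock_t *b = *p;
                if (b->size >= request) {  /* range lists may need this scan */
                    *p = b->next;          /* unlink the block found         */
                    return split_block(b, request);
                }
            }
        }
        return get_more_memory(request);   /* every list exhausted */
    }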
In terms of policy, this search order means that smaller blocks are used in preference to larger ones, as with best fit. In some cases, however, the details of the size class system and the searching of size-class lists may cause deviations from the best fit policy.

Note that in a segregated fits scheme, coalescing may increase search times. When blocks of a given size are freed, they may be coalesced and put on different free lists (for the resulting larger sizes); when the program requests more objects of that size, it may have to find the larger block and split it, rather than still having the same small blocks on the appropriate free list. (Deferred coalescing can reduce the extent of this problem, and the use of multiple free lists makes segregated fits a particularly natural context for deferred coalescing.)

Segregated fits schemes fall into three general categories:

1. Exact Lists. In exact list systems, there is (conceptually) a separate free list for each possible block size [Com64]. This can result in a very large number of free lists, but the "array" of free lists can be represented sparsely. Standish and Tadman's "Fast Fits" scheme75 uses an array of free lists for small size classes, plus a binary tree of free lists for larger sizes (but only the ones that actually occur) [Sta80, Tad78].76

Footnote 75: Not to be confused with Stephenson's better-known indexed fits scheme of the same name.

Footnote 76: As with most tree-based allocators, the nodes of the tree are embedded in the blocks themselves. The tree is only used for larger sizes, and the large blocks are big enough to hold left and right child pointers, as well as doubly-linked list pointers. One block of each large size is part of the tree, and it acts as the head of the doubly-linked list of same-sized blocks.

2. Strict Size Classes with Rounding. When sizes are grouped into size classes (e.g., powers of two), one approach is to maintain an invariant that all blocks on a size list are exactly of the same size. This can be done by rounding up requested sizes to one of the sizes in the size class series, at some cost in internal fragmentation. In this case, it is also necessary to ensure that the size class series is carefully designed so that split blocks always result in a size that is also in the series; otherwise blocks will result that aren't the right size for any free list. (This issue will be discussed in more detail when we come to buddy systems.)

3. Size Classes with Range Lists. The most common way of dealing with the ranges of sizes that fall into size classes is to allow the lists to contain blocks of slightly different sizes, and search the size lists sequentially, using the classic best fit, first fit, or next fit technique [PSC71]. (The choice affects the policy implemented, of course, though probably much less than in the case of a single free list.) This could introduce a linear component to search times, though this does not seem likely to be a common problem in practice, at least if size classes are closely spaced.77 78 If it is, then exact list schemes are preferable.

Footnote 77: Lea's allocator uses very closely spaced size classes, dividing powers of two linearly into four uniform ranges.

Footnote 78: Typical size distributions appear to be both spiky and heavily skewed, so it seems likely that for small size ranges, only zero or one actual sizes (or popular sizes) will fall into a given range. In that case, a segregated fits scheme may approximate a best fit scheme very closely.
An efficient segregated fits scheme with general coalescing (using boundary tags) was described and shown to perform well in 1971 [PSC71], but it did not become well-known; Standish and Tadman's apparently better scheme was published (but only in a textbook) in 1980, and similarly did not become particularly well known, even to the present. Our impression is that these techniques have received too little attention, while considerably more attention has been given to techniques that are inferior in terms of scalability (sequential fits) or generality (buddy systems).

Apparently, too few researchers realized the full significance of Knuth's invention of boundary tags for a wide variety of allocation schemes--boundary tags can support fast and general splitting and coalescing, independently of the basic indexing scheme used by the allocator. This frees the designer to use more sophisticated higher-level mechanisms and policies to implement almost any desired strategy. (It seems likely that the original version of boundary tags was initially viewed as too costly in space, in a time when memory was a very scarce resource, and the footer optimization [Sta80] simply never became well-known.)

3.7 Buddy Systems

Buddy systems [Kno65, PN77] are a variant of segregated lists that supports a limited but efficient kind of splitting and coalescing. In the simple buddy schemes, the entire heap area is conceptually split into two large areas, those areas are further split into two smaller areas, and so on. This hierarchical division of memory is used to constrain where objects are allocated, what their allowable sizes are, and how they may be coalesced into larger free areas. For each allowable size, a separate free list is maintained, in an array of free lists. Buddy systems are therefore actually a special case of segregated fits, using size classes with rounding, and a peculiar limited technique for splitting and coalescing.

Buddy systems therefore implement an approximation of a best fit policy, but with potentially serious variations due to peculiarities in splitting and coalescing.

(In practical terms, buddy systems appear to be distinctly inferior to more general schemes supporting arbitrary coalescing; without heroic efforts at optimization and/or hybridization, their cost in internal fragmentation alone seems to be comparable to the total fragmentation costs of better schemes.)

A free block may only be merged with its buddy, which is its unique neighbor at the same level in the binary hierarchical division. The resulting free block is therefore always one of the free areas at the next higher level in the memory-division hierarchy--at any level, the first block may only be merged with the following block, which follows it in memory; conversely, the second block may only be merged with the first, which precedes it in memory. This constraint on coalescing ensures that the resulting merged free area will always be aligned on one of the boundaries of the hierarchical splitting.

(This is perhaps best understood by example; the reader may wish to skip ahead to the description of binary buddies, which are the simplest kind of buddy systems.)

The purpose of the buddy allocation constraint is to ensure that when a block is freed, its (unique) buddy can always be found by a simple address computation, and its buddy will always be either a whole, entirely free block, or an unavailable block. An unavailable block may be entirely allocated, or may have been split and have some of its sub-parts allocated but not others. Either way, the address computation will always be able to locate the beginning of the buddy--it will never find the middle of an allocated object. The buddy will be either a whole (allocated or free) block of a determinate size, or the beginning of a block of that size that has been split in a determinate way. If (and only if) it turns out to be the header of a free block, and the block is the whole buddy, the buddies can be merged. If the buddy is entirely or partly allocated, the buddies cannot be merged--even if there is an adjacent free area within the (split) buddy.
Buddy coalescing is relatively fast, but perhaps the biggest advantage in some contexts is that it requires little space overhead per object--only one bit is required per buddy, to indicate whether the buddy is a contiguous free area. This can be implemented with a single-bit header per object or free block. Unfortunately, for this to work, the size of the block being freed must be known--the buddy mechanism itself does not record the sizes of the blocks. This is workable in some statically-typed languages, where object sizes are known statically and the compiler can supply the size argument to the freeing routine. In most current languages and implementations, however, this is not the case, due to the presence of variable-sized objects and/or because of the way libraries are typically linked. Even in some languages where the sizes of objects are known, the "single" bit ends up costing an entire word per object, because a single bit cannot be "stolen" from the space for an allocated object--objects must be aligned on word boundaries for architectural reasons, and there is no provision for stealing a bit from the space allocated to an object.79

Footnote 79: In some implementations of some languages, this is less of a problem, because all objects have headers that encode type information, and one bit can be reserved for use by the allocator and ignored by the language implementation. This complicates the language implementation, but may be worthwhile if a buddy system is used.

Stealing a bit from each object can be avoided, however, by keeping the bits in a separate table "off to the side" [IGK71], but this is fairly awkward, and such a bit table could probably be put to better use with an entirely different basic allocation mechanism.

In practical terms, therefore, buddy systems usually require a header word per object, to record the type and/or size. Other, less restrictive schemes can get by with a word per object as well. Since buddy systems also incur internal fragmentation, this apparently makes buddy systems unattractive relative to more general coalescing schemes such as segregated fits.80

Footnote 80: Of course, buddy systems could become more attractive if it were to turn out that the buddy policy has significant beneficial interactions with actual program behavior, and unexpectedly reduced external fragmentation or increased locality. At present, this does not appear to be the case.

In experiments using both real and synthetic traces, buddy systems generally exhibit significantly more fragmentation than segregated fits and indexed fits schemes using boundary tags to support general coalescing. (Most of these results come from synthetic trace studies, however; it appears that only two buddy systems have ever been studied using real traces [WJNB95].)

Several significant variations on buddy systems have been devised:

Binary buddies. Binary buddies are the simplest and best-known kind of buddy system [Kno65]. In this scheme, all buddy sizes are a power of two, and each size is divided into two equal parts. This makes address computations simple, because all buddies are aligned on a power-of-two boundary offset from the beginning of the heap area, and each bit in the offset of a block represents one level in the buddy system's hierarchical splitting of memory--if the bit is 0, it is the first of a pair of buddies, and if the bit is 1, it is the second. These operations can be implemented efficiently with bitwise logical operations.

On the other hand, systems based on closer size class spacings may be similarly efficient if lookup tables are used to perform size class mappings quickly.

A major problem with binary buddies is that internal fragmentation is usually relatively high--the expected case is (very roughly) about 28% [Knu73, PN77],81 because any object size must be rounded up to the nearest power of two (minus a word for the header, if the size field is stored).

Footnote 81: This figure varies somewhat depending on the expected range and skew of the size distribution [PN77].
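The buddy computation itself is a single bit flip on the block's offset; a hypothetical sketch:

    #include <stddef.h>

    /* For a block of size 2^k bytes at byte offset 'off' from the start
       of the heap area, the buddy is the same-sized neighbor whose
       offset differs only in bit k: one XOR finds it. */
    static size_t buddy_offset(size_t off, unsigned k) {
        return off ^ ((size_t)1 << k);
    }

    /* Merging two size-2^k buddies yields the size-2^(k+1) block whose
       offset is the lower of the two, i.e., with bit k cleared. */
    static size_t merged_offset(size_t off, unsigned k) {
        return off & ~((size_t)1 << k);
    }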
Fibonacci buddies. This variant of the buddy scheme uses a more closely-spaced set of size classes, based on a Fibonacci series, to reduce internal fragmentation [Hir73]. Since each number in the Fibonacci series is the sum of the two previous numbers, a block can always be split (unevenly) to yield two blocks whose sizes are also in the series. As with binary buddies, the increasing size of successive size ranges limits the number of free lists required.

A further refinement, called generalized Fibonacci buddies [Hir73, Bur76, PN77], uses a Fibonacci-like number series that starts with a larger number and generates a somewhat more closely-spaced set of sizes.

A possible disadvantage of Fibonacci buddies is that when a block is split to satisfy a request for a particular size, the remaining block is of a different size, which is less likely to be useful if the program allocates many objects of the same size [Wis78].

Weighted buddies. Weighted buddy systems [SP74] use a different kind of size class series than either binary or Fibonacci buddy systems. Some size classes can be split only one way, while other size classes can be split in two ways. The size classes include the powers of two, but in between each pair of successive sizes, there is also a size that is three times a power of two. The series is thus 2, 3, 4, 6, 8, 12... (words). (Often, the series actually starts at 4 words.)

Sizes that are powers of two may only be split evenly in two, as in the binary buddy system. This always yields another size in the series, namely the next lower power of two.

Sizes that are three times a power of two can be split in two ways. They may be split evenly in two, yielding a size that is another three-times-a-power-of-two size. (E.g., a six may be split into two threes.) They may also be split unevenly into two sizes that are one third and two thirds of the original size; these sizes are always a power of two. (E.g., six may be split into two and four.)
Double buddies. Double buddy systems use a different technique to allow a closer spacing of size classes [Wis78, PH86, WJNB95]. They use two different binary buddy systems, with staggered sizes. For example, one buddy system may use powers-of-two sizes (2, 4, 8, 16...) while another uses a powers-of-two spacing starting at a different size, such as 3. (The resulting sizes are 3, 6, 12, 24...) This is the same set of sizes used in weighted buddies, but the splitting rule is quite different. Blocks may only be split in half, as in the binary buddy system, so the resulting blocks are always in the same binary buddy series.

Request sizes are rounded up to the nearest size class in either series. This reduces the internal fragmentation by about half, but means that space used for blocks in one size series can only be coalesced or split into sizes in that series. That is, splitting a size whose place in the combined series is odd always produces another size whose place is odd; likewise, splitting an even-numbered size always produces an even-numbered size. (E.g., a block of size 16 can be split into 8's and 4's, and a block of size 24 can be split into 12's and 6's, but not 8's or 4's.)

This may cause external fragmentation if blocks in one size series are freed, and blocks in the other are requested. As an optimization, free areas of a relatively large size (e.g., a whole free page) may be made available to the other size series and split according to that size series' rules. (This complicates the treatment of large objects, which could be treated entirely differently, or by another buddy system for large units of free storage such as pages.)

Naturally, more than two buddy systems could be combined, to decrease internal fragmentation at a possible cost in external fragmentation due to limitations on sharing free memory between the different buddy systems.

As with simple segregated storage, it is possible to keep per-page counts of live objects, and notice when an entire page is empty. Empty pages can be transferred from one buddy series to another. To our knowledge, such an optimization has never been implemented for a double buddy scheme.

Buddy systems can easily be enhanced with deferred coalescing techniques, as in "recombination delaying" buddy systems [Kau84]. Another optimization is to tailor a buddy system's size class series to a particular program, picking a series that produces little internal fragmentation for the object sizes the program uses heavily.

3.8 Indexed Fits

As we saw in Section 3.4, simple linear list mechanisms can be used to implement a wide variety of policies, with general coalescing.

An alternative is to use a more sophisticated indexing data structure, which indexes blocks by exactly the characteristics of interest to the desired policy, and supports efficient searching according to those characteristics. We call this kind of mechanism indexed fits. (This is really an unsatisfying catch-all category, showing the limitations of a mechanism-based taxonomy.)

The simplest example of an indexed fits scheme was mentioned earlier, in the discussion of sequential fits: a best fit policy implemented using a balanced or self-adjusting binary tree ordered by block size. (Best fit policies may be easier to implement scalably than address-ordered first fit policies.)

Another example was mentioned in the section on segregated free lists (3.6); Standish and Tadman's exact lists scheme is the limiting case of a segregated fits scheme, where the indexing is precise enough that no linear searching is needed to find a fit.
On the other hand, it is also a straightforward two-step optimization of the simple balanced-tree best fit. (The first optimization is to keep a tree with only one node per size that occurs, and hang the extra blocks of the same sizes off of those nodes in linear lists. The second optimization is to keep the most common size values in an array rather than the tree itself.) Our mechanism-based taxonomy is clearly showing its seams here, because the use of hybrid data structures blurs the distinctions between the basic classes of allocators.
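The tree-based best fit search just described is the standard "smallest size at least as large as the request" lookup; a hypothetical sketch (balancing and the per-node lists of same-sized blocks omitted):

    #include <stddef.h>

    typedef struct tnode {
        size_t        size;         /* one node per distinct free size */
        struct tnode *left, *right;
    } tnode_t;

    /* Best fit: find the node with the smallest size >= request by
       descending the tree, remembering the best candidate so far.
       Logarithmic in the number of distinct sizes if kept balanced. */
    static tnode_t *best_fit(tnode_t *root, size_t request) {
        tnode_t *best = NULL;
        while (root != NULL) {
            if (root->size < request) {
                root = root->right;   /* too small: look at larger sizes */
            } else {
                best = root;          /* fits: try to find a tighter fit */
                root = root->left;
            }
        }
        return best;
    }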
The best-known example of an indexed fits scheme is probably Stephenson's "Fast Fits" allocator [Ste83], which uses a Cartesian tree sorted on both size and address. A Cartesian tree [Vui80] encodes two-dimensional information in a binary tree, using two constraints on the tree shape. It is effectively sorted on a primary key and a secondary key. The tree is a normal totally-ordered tree with respect to the primary key. With respect to the secondary key, it is a "heap" data structure, i.e., a partially ordered tree whose nodes each have a value greater than their descendants. This dual constraint limits the ability to rebalance the tree, because the shape of the tree is highly constrained by the dual indexing keys.

In Stephenson's system, this indexing data structure is embedded in the free blocks of memory themselves, i.e., the blocks become the tree nodes in much the same way that free blocks become list nodes in a sequential fits scheme. The addresses of blocks are used as the primary key, and the sizes of blocks are used as the secondary key.

Stephenson uses this structure to implement either an address-ordered first fit policy (called "leftmost fit") or a "better fit" policy, which is intended to approximate best fit. (It is unclear how good an approximation this is.)

As with address-ordered linear lists, the address ordering of free blocks is encoded directly in the tree structure, and the indexing structure can be used to find adjacent free areas for coalescing, with no additional overhead for boundary tags. In most situations, however, a size field is still required, so that blocks being freed can be inserted into the tree in the appropriate place.

While Cartesian trees give logarithmic expected search times for random inputs, they may become unbalanced in the face of patterned inputs, and in the worst case provide only linear time searches.82

Footnote 82: Data from [Zor93] suggest that actual performance is reasonable for real data, being among the faster algorithms used in that study, and having good memory usage. On the other hand, data from a different experiment [GZ93] show it being considerably slower than a set of allocators designed primarily for speed. Very recent data [Vo95] show it being somewhat slower than some other algorithms with similar memory usage, on average.

Discussion of indexed fits. In terms of implementation, it appears that size-based policies may be easier to implement efficiently than address-based policies; a tree that totally orders all actual block sizes will typically be fairly small, and quick to search. If a FIFO- or LIFO-ordering of same-sized blocks implements an acceptable policy, then a linear list can be used and no searching among same-sized blocks is required.83 Size-based policies are also easier to optimize for the common case, namely small sizes.

Footnote 83: If an algorithm relies on an awkward secondary key, e.g., best fit with address-ordered tie breaking, then it may not make much difference what the ordering function is--one total ordering of blocks is likely to cost about as much as another.

A tree that totally orders all block addresses may be very much larger, and searches will take more time. On the other hand, adaptive structures (e.g., splay trees) may make these searches fast in the common case, though this depends on subtleties of the request stream and the policy that are not currently understood.

Deferred coalescing may be able to reduce tree searches to the point where the differences in speed are not critical, making the fragmentation implications of the policy more important than minor differences in speed.

Totally ordered trees may not be necessary to implement the best policy, whatever that should turn out to be. Partial orders may work just as well, and lend themselves to very efficient and scalable implementations. At this point, the main problem does not seem to be time costs, but understanding what policy will yield the least fragmentation and the best locality.

Many other indexed fits policies and mechanisms are possible, using a variety of data structures to accelerate searches. One of these is a set of free lists segregated by size, as discussed earlier, and another is a simple bitmap, discussed next.
3.9 Bitmapped Fits

A particularly interesting form of indexed fits is bitmapped fits, where a bitmap is used to record which parts of the heap area are in use, and which parts are not. A bitmap is a simple vector of one-bit flags, with one bit corresponding to each word of the heap area. (We assume here that heap memory is allocated in word-aligned units that are multiples of one word. In some systems, double-word alignment is required for architectural reasons. In that case, the bitmap will include one bit for each double-word alignment boundary.)

To our knowledge, bitmapped allocation has never been used in a conventional allocator, but it is quite common in other contexts, particularly mark-sweep garbage collectors (notably the conservative collectors of Boehm, et al. from Xerox PARC [BW88, BDS91, DWH+90]84) and file systems' disk block managers. We suspect that the main reason it has not been used for conventional memory allocation is that it is perceived as too slow.

Footnote 84: Actually, these systems use bitmaps to detect contiguous areas of free memory, but then accumulate free lists of the detected free blocks. The advantage of this is that a single scan through a region of the bitmap can find blocks of all sizes, and make them available for fast allocation by putting them on free lists for those sizes.

We believe that bitmap operations can be made fast enough to use in allocators by the use of clever implementation techniques. For example, a bitmap can be quickly scanned a byte at a time using a 256-way lookup table to detect whether there are any runs of a desired length.85

Footnote 85: This can be enhanced in several ways. One enhancement allows the fast detection of longer runs that cross 8-bit boundaries by using a different lookup table to compute the number of leading and trailing zeroes, so that a count can be maintained of the number of zeroes seen so far. Another is to use redundant encoding of the size by having headers in large objects, obviating long scans when determining the size of a block being freed.
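A hypothetical sketch of such byte-at-a-time scanning, using 256-entry tables for the longest, leading, and trailing zero-bit runs of each byte value (here zero bits mark free words, and bit 0 of each byte corresponds to its first word):

    #include <stddef.h>
    #include <stdint.h>

    /* Per-byte tables, built once at startup. */
    static uint8_t max_run[256];     /* longest zero run inside the byte */
    static uint8_t head_zeros[256];  /* zero run at the byte's low end   */
    static uint8_t tail_zeros[256];  /* zero run at the byte's high end  */

    static void build_tables(void) {
        for (int v = 0; v < 256; v++) {
            int run = 0, best = 0;
            for (int bit = 0; bit < 8; bit++) {
                if (v & (1 << bit)) run = 0;
                else if (++run > best) best = run;
            }
            max_run[v] = (uint8_t)best;
            head_zeros[v] = 0;
            while (head_zeros[v] < 8 && !(v & (1 << head_zeros[v])))
                head_zeros[v]++;
            tail_zeros[v] = 0;
            while (tail_zeros[v] < 8 && !(v & (0x80 >> tail_zeros[v])))
                tail_zeros[v]++;
        }
    }

    /* Find 'need' consecutive zero bits, scanning a byte at a time.
       Returns the starting bit index of the run, or -1 if none exists. */
    static long find_free_run(const uint8_t *bm, size_t nbytes, int need) {
        int carry = 0;  /* zero run carried in from preceding bytes */
        for (size_t i = 0; i < nbytes; i++) {
            uint8_t v = bm[i];
            if (carry + head_zeros[v] >= need)      /* run crosses into byte */
                return (long)(i * 8) - carry;
            if (need <= 8 && max_run[v] >= need) {  /* run inside this byte  */
                int run = 0;
                for (int bit = 0; bit < 8; bit++) {
                    if (v & (1 << bit)) run = 0;
                    else if (++run == need)
                        return (long)(i * 8) + bit - need + 1;
                }
            }
            carry = (v == 0) ? carry + 8 : tail_zeros[v];
        }
        return -1;
    }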
If object sizes are small, bitmapped allocation may have a space advantage over systems that use whole-word headers. A bit per word of heap memory only incurs a 3% overhead, while for object sizes averaging 10 words, a header incurs a 10% overhead. In the most obvious scheme, two bitmaps are required (one to encode the boundaries of blocks, and another to encode whether blocks are in use), but we believe there are ways around that.86

Footnote 86: It is increasingly common for allocators to ensure double-word alignment (even on 32-bit machines), padding requests as necessary, for architectural reasons. In that case, half as many bits are needed. There may also be clever encodings that can make some of the bits in a bitmap do double duty, especially if the minimum object size is more than two alignment units.

Bitmapped allocators have two other advantages compared to conventional schemes. One is that they support searching the free memory indexed by address order, or localized searching, where the search may begin at a carefully-chosen address. (Address-ordered searches may result in less fragmentation than similar policies using some other orderings.) Another advantage is that bitmaps are "off to the side," i.e., not interleaved with the normal data storage area. This may be exploitable to improve the locality of searching itself, as opposed to traversing lists or trees embedded in the storage blocks themselves. (It may also reduce checkpointing costs in systems that checkpoint heap memory, by improving the locality of writes; freeing an object does not modify heap memory, only the bitmap.) Bitmapped techniques therefore deserve further consideration.

It may appear that bitmapped allocators are slow, because search times are linear, and to a first approximation this may be true. But notice that if a good heuristic is available to decide which area of the bitmap to search, searching is linear in the size of the area searched, rather than the number of free blocks. The cost of bitmapped allocation may then be proportional to the rate of allocation, rather than the number of free blocks, and may scale better than other indexing schemes. If the associated constants are low enough, bitmapped allocation may do quite well. It may also be valuable in conjunction with other indexing schemes.

3.10 Discussion of Basic Allocator Mechanisms.

By now it should be apparent that our conventional taxonomy is of only very limited utility, because the implementation focus obscures issues of policy. At a sufficiently high level of abstraction, all of these allocators are really "indexed" fits--they record which areas of memory are free in some kind of data structure--but they vary in terms of the policies they implement, how efficiently their mechanisms support the desired policy, and how flexible the mechanisms are in supporting policy variations. Even in its own mechanism-based terms, the taxonomy is collapsing under its own weight due to the use of hybrid algorithms that can be categorized in several ways.
Simple segregated storage is simple and quite fast--allocation and deallocation usually take only a few instructions each--but lacks freedom to split and coalesce memory blocks to support later requests for different-sized objects. It is therefore subject to serious external fragmentation, as well as internal fragmentation, with some tradeoff between the two.

Buddy systems support fairly flexible splitting, but significantly restricted coalescing.

Sequential fits support flexible splitting and (with boundary tags) general coalescing, but cannot support most policies without major scalability concerns. (More precisely, the boundary tag implementation technique supports completely general coalescing, but the "index" is so simple that searches may be very expensive for some policies.)

This leaves us with the more general indexed storage techniques, which include tree-structured indexes, segregated fits using boundary tags, and bitmapped techniques using bitmaps for both boundary tags and indexing. All of these can be used to implement a variety of policies, including exact or approximate best fit. None of them require more space overhead per object than buddy systems, for typical conventional language systems, and all can be expected to have lower internal fragmentation.

In considering any indexing scheme, issues of strategy and policy should be considered carefully. Scalability is a significant concern for large systems, and may become increasingly important.

Constant factors should not be overlooked, however. Alignment and header and footer costs may be just as significant as actual fragmentation. Similarly, the speed of common operations is quite important, as well as scalability to large heaps. In the next section, we discuss techniques for increasing the speed of a variety of general allocators.

3.11 Quick Lists and Deferred Coalescing

Deferred coalescing can be used with any of the basic allocator mechanisms we have described. The most common way of doing this is to keep an array of free lists, often called "quick lists" or "subpools" [MPS71], one for each size of block whose coalescing is to be deferred. Usually, this array is only large enough to have a separate free list for each individual size up to some maximum, such as 10 or 32 words; only those sizes will be treated by deferred coalescing [Wei76]. Blocks larger than this maximum size are simply returned directly to the "general" allocator, of whatever type.

The following discussion describes what seems to be a typical (or at least reasonable) arrangement. (Some allocators differ in significant details, notably Lea's segregated fits scheme.)

To the general allocator, a block on a quick list appears to be allocated, i.e., uncoalescable. For example, if boundary tags are used for coalescing, the flag indicates that the block is allocated. The fact that the block is free is encoded only in its presence on the quick list.

When allocating a small block, the quick list for that size is consulted. If there is a free block of that size on the list, it is removed from the list and used. If not, the search may continue by looking in other quick lists for a larger-sized block that will do. If this fails, the general allocator is used, to allocate a block from the general pool. When freeing a small block, the block is simply added to the quick list for that size. Occasionally, the blocks in the quick lists are removed and added to the general pool using the general allocator to coalesce neighboring free blocks.
The quick lists therefore act as caches for the location and size information about free blocks for which coalescing has not been attempted, while the general allocator acts as a "backing store" for this information, and implements general coalescing. (Most often, the backing store has been managed using an unscalable algorithm such as address-ordered first fit using a linear list.) Using a scalable algorithm for the general allocator seems preferable.

Another alternative is to use an allocator which in its usual operation maintains a set of free lists for different sizes or size classes, and simply to defer the coalescing of the blocks on those lists. This may be a buddy system (as in [Kau84]) or a segregated lists allocator such as segregated fits.87

87 The only deferred coalescing segregated fits algorithm that we know of is Doug Lea's allocator, distributed freely and used in several recent studies (e.g., [GZH93, Vo95, WJNB95]).

Some allocators, which we will call "simplified quick fit" allocators, are structured similarly but don't do any coalescing for the small blocks on the quick lists. In effect, they simply use a non-coalescing segregated lists allocator for small objects and an entirely different allocator for large ones. (Examples include Weinstock and Wulf's simplification of their own Quick Fit allocator [WW88], and an allocator developed by Grunwald and Zorn, using Lea's allocator as the general allocator [GZH93].) One of the advantages of such
a scheme is that the minimum block size can be very small--only big enough to hold a header and a single link pointer. (Doubly-linked lists aren't necessary, since no coalescing is done for small objects.)

These simplified designs are not true deferred coalescing allocators, except in a degenerate sense. (With respect to small objects, they are non-coalescing allocators, like simple segregated storage.)

True deferred coalescing schemes vary in significant ways besides what general allocator is used, notably in how often they coalesce items from quick lists, and which items are chosen for coalescing. They also may differ in the order in which they allocate items from the quick lists, e.g., LIFO or FIFO, and this may have a significant effect on placement policies.

Scheduling of coalescing. Some allocators defer all coalescing until memory runs out, and then coalesce all coalescable memory. This is most common in early designs, including Comfort's original proposal [Com64]88 and Weinstock's "Quick Fit" scheme [Wei76].

This is not an attractive strategy in most modern systems, however, because in a virtual memory, the program never "runs out of space" until backing store is exhausted. If too much memory remains uncoalesced, wasting virtual memory, locality may be degraded and extra paging could result. Most systems therefore attempt to limit the amount of memory that may be wasted because coalescing has not been attempted.

Some systems wait until a request cannot be satisfied without either coalescing or requesting more memory from the operating system. They then perform some coalescing. They may perform all possible coalescing at that time, or just enough to satisfy that request, or some intermediate amount.

Another possibility is to periodically flush the quick lists, returning all of the items on the quick lists to the general store for coalescing. This may be done incrementally, removing only the older items from the quick lists.
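
Continuing the earlier quick-list sketch, a periodic flush might look like the following (again our own illustration; the threshold is an arbitrary tuning knob, and this version flushes whole lists rather than only the older items):

    static size_t alloc_since_flush;          /* words allocated since last flush */
    #define FLUSH_THRESHOLD (64 * 1024)       /* arbitrary bound on deferral */

    static void flush_quick_lists(void) {
        for (size_t size = 1; size <= MAX_QUICK; size++) {
            while (quick_list[size] != NULL) {
                qnode *b = quick_list[size];
                quick_list[size] = b->next;
                general_free(b, size);        /* general allocator coalesces here */
            }
        }
        alloc_since_flush = 0;
    }

    /* Called on each allocation, e.g. at the top of qalloc(). */
    static void maybe_flush(size_t words) {
        alloc_since_flush += words;
        if (alloc_since_flush >= FLUSH_THRESHOLD)
            flush_quick_lists();
    }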
In Margolin et al.'s scheme [MPS71], the lengths of the free lists are bounded, and those lengths are based on the expected usage of different sizes. This ensures that only a bounded amount of memory can be wasted due to deferred coalescing, but if the estimates of usage are wrong, deferred coalescing may not work as well--memory may sit idle on some quick lists when it could otherwise be used for other sizes.

In Oldehoeft and Allan's system [OA85], the number of quick lists varies over time, according to a FIFO or Working Set policy. This has an adaptive character, especially for the Working Set policy, in that sizes that have not been freed recently are quickly coalesced, while "active" sizes are not. This adaptation may not be sufficient to ensure that the memory lost to deferred coalescing remains small, however; if the system only frees blocks of a few sizes over a long period of time, uncoalesced blocks may remain on another quick list indefinitely. (This appears to happen for some workloads in a similar system developed by Zorn and Grunwald [ZG94], using a fixed-length LRU queue of quick lists.)

Doug Lea's segregated fits allocator uses an unusual and rather complex policy to perform coalescing in small increments. (It is optimized as much for speed as for space.) Coalescing is only performed when a request cannot otherwise be satisfied without obtaining more memory from the operating system, and only enough coalescing is done to satisfy that request. This incremental coalescing cycles through the free lists for the different size classes. This ensures that coalescable blocks will not remain uncoalesced indefinitely, unless the heap is not growing.

In our view, the best policy for minimizing space usage without undue time costs is probably an adaptive one that limits the volume of uncoalesced blocks--i.e., the actual amount of potentially wasted space--and adapts the lengths of the free lists to the recent usage patterns of the program. Simply flushing the quick lists periodically (after a bounded amount of allocation) may be sufficient, and may not incur undue costs if the general allocator is reasonably fast.89 90

88 In Comfort's proposed scheme, there was no mechanism for immediate coalescing. (Boundary tags had not been invented.) The only way memory could be coalesced was by examining all of the free lists, and this was considered awkward and expensive.

89 The issues here are rather analogous to some issues in the design and tuning of generational garbage collectors, particularly the setting of generation sizes and advancement thresholds [Wil95].

90 If absolute all-out speed is important, Lea's strategy of coalescing only when a search fails may be more attractive--it does not require incrementing or checking an allocation total at each allocation or deallocation. (Another possibility would be to use a timer interrupt, but this is quite awkward. Most allocator designers do not wish to depend on using interrupts for what is otherwise a fairly simple library, and it also raises obscure issues of reentrancy--the interrupt handler must be careful not to do anything that would interfere with an allocation or deallocation that is interrupted.)
On the other hand, it may be preferable to avoid attempting to coalesce very recently-freed blocks, which are very likely to be usable for another request soon. One possible technique is to use some kind of "high-water mark" pointer into each list to keep track of which objects were freed after some point in time, such as the last allocate/coalesce cycle. However, it may be easier to accomplish by keeping two lists, one for recently-freed blocks and one for older blocks. At each attempt at coalescing, the older blocks are given to the general allocator, and the younger blocks are promoted to "older" status.91 (If a more refined notion of age is desired, more than two lists can be used.)
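
A two-list version of this aging scheme might be sketched as follows (our illustration only; a real allocator would keep one such pair of lists per deferred size, and `general_free' is the assumed interface to the general allocator, as before):

    typedef struct node { struct node *next; } node;

    static node *young_list;   /* blocks freed since the last coalescing cycle */
    static node *old_list;     /* blocks that have survived one cycle */

    void aged_free(void *p) {            /* deferred free: just record the block */
        node *b = p;
        b->next = young_list;
        young_list = b;
    }

    void coalesce_cycle(size_t words) {
        while (old_list != NULL) {       /* hand the older blocks back... */
            node *b = old_list;
            old_list = b->next;
            general_free(b, words);      /* ...so neighboring blocks can merge */
        }
        old_list = young_list;           /* promote the young blocks */
        young_list = NULL;
    }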
What to coalesce. As mentioned earlier, several systems defer the coalescing of small objects, but not large ones. If allocations of large objects are relatively infrequent--and they generally are--immediately coalescing them is likely to be worthwhile, all other things being equal. (This is true both because the time costs are low and the savings in potentially wasted memory are large.) Deferred coalescing usually affects the placement policy, however, and the effects of that interaction are not understood.

Discussion. There are many possible strategies for deferred coalescing, and any of them may affect the general allocator's placement policy and/or the locality of the program's references to objects. For example, it appears that for normal free lists, FIFO ordering may produce less fragmentation than LIFO ordering, but it is unknown whether that applies to items on quick lists in a deferred coalescing scheme.92 Similarly, when items are removed from the quick list and returned to the general allocator, it is unknown which items should be returned, and which should be kept on the quick lists.

To date, only a few sound experiments evaluating deferred coalescing have been performed, and those that have been performed are rather limited in terms of identifying basic policy issues and the interactions between deferred coalescing and the general allocator. Most experiments before 1992 used synthetic traces, and are of very dubious validity. To understand why, consider a quick list to be a buffer that absorbs variations in the number of blocks of a given size. If variations are small, most allocation requests can be satisfied from a small buffer. If there are frequent variations in the sizes in use, however, many buffers (quick lists) will be required in order to absorb them.

Randomization may reduce clustered usage of the same sizes, spreading all requested sizes out over the whole trace. This may make the system look bad, because it could increase the probability that the buffers (i.e., the set of quick lists) contain objects of the wrong sizes. On the other hand, the smoothed (random walk) nature of a synthetic trace may flatter deferred coalescing by ensuring that allocations and frees are fairly balanced over small periods of time; real phase behavior could overwhelm a too-small buffer by performing many frees and later many allocations.

3.12 A Note on Time Costs

An allocator can be made extremely fast if space costs are not a major issue. Simple segregated storage can be used to allow allocation or deallocation in a relatively small number of instructions--a few for a table lookup to find the right size class, a few for indexing into the free list array and checking to ensure the free list is not empty, and a few for the actual unlinking or linking of the allocated block.93
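
In outline, that usual-case path can be this short (a sketch under our own naming; `size_class', `class_free_list', and `slow_alloc' are hypothetical, with the first word of each free block holding its link):

    #include <stddef.h>

    #define MAX_SMALL 256

    static unsigned char size_class[MAX_SMALL + 1];  /* request size -> class */
    static void *class_free_list[32];                /* one free list per class */

    void *slow_alloc(size_t size);    /* refill path / large requests (assumed) */

    void *fast_alloc(size_t size) {
        if (size <= MAX_SMALL) {                     /* usual case */
            unsigned c = size_class[size];           /* table lookup */
            void *b = class_free_list[c];
            if (b != NULL) {                         /* list nonempty: unlink */
                class_free_list[c] = *(void **)b;
                return b;
            }
        }
        return slow_alloc(size);
    }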
91 This is similar to the "bucket brigade" advancement technique used in some generational garbage collectors [Sha88, WM89, Wil95]. A somewhat similar technique is used in Lea's allocator, but for a different purpose. Lea's allocator has a quick list (called the "dirty" list) for each size class used by the segregated fits mechanism, rather than for every small integer word size. (This means that allocations from the quick list have to search for a block that fits, but a close spacing of size classes ensures that there is usually only one popular size per list; the searches are usually short.) The quick lists are stored in the same array as the main ("clean") free lists.

92 Informal experiments by Lea suggest that FIFO produces less fragmentation, at least for his scheme. (Lea, personal communication 1995.)

93 For a closely-spaced series of size classes, it may be necessary to spend a few more instructions on checking the size to ensure that (in the usual case) it's small enough to use table lookup, and occasionally use a slower computation to find the appropriate list for large-sized requests.

This scheme can be made considerably faster if the allocator can be compiled together with the
application program, rather than linked as a library in the usual way. The usual-case code for the allocator can be compiled as an "inline" procedure rather than a runtime procedure call, and compile-time analyses can perform the size-class determination at compile time. In the usual case, the runtime code will simply directly access the appropriate free list, check that it is not empty, and link or unlink a block. This inlined routine will incur no procedure call overhead. (This kind of allocation inlining is quite common in garbage collected systems. It can be a little tricky to code the inlined allocation routine so that a compiler will optimize it appropriately, but it is not too hard.)
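
Reusing the names from the previous sketch, the inlined version is just a `static inline' wrapper; when the request size is a compile-time constant, the size-class lookup folds away and only the list check and unlink remain at the call site (again an illustration of ours, not any particular system's code):

    static inline void *inline_alloc(size_t size) {
        unsigned c = size_class[size];    /* folded when `size' is constant */
        void *b = class_free_list[c];
        if (b != NULL) {
            class_free_list[c] = *(void **)b;  /* usual case: a few instructions */
            return b;
        }
        return slow_alloc(size);               /* out-of-line slow path */
    }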
If space is an issue, naturally things are more complicated--space-efficient allocators are more complicated than simple segregated storage. However, deferred coalescing should ensure that a complex allocator behaves like simple segregated storage most of the time, with some space/time tradeoff. If extreme speed is desired, coalescing can be deferred for a longer period, to ensure that quick lists usually have free blocks on them and allocation is fast.94 Adjusting this space-time tradeoff is a topic for future research, however.

4 A Chronological Review of The Literature

Given the background presented by earlier sections, we will chronologically review the literature, paying special attention to methodological considerations that we believe are important. To our knowledge, this is by far the most thorough review to date, but it should not be considered detailed or exhaustive; valuable points or papers may have escaped our notice.95 We have left out work on concurrent and parallel allocators (e.g., [GW82, Sto82, BAO85, MK88, EO88, For88, Joh91, JS92, MS93, Iye93]), which are beyond the scope of this paper. We have also neglected mainly analytical work (e.g., [Kro73, Bet73, Ree79, Ree80, McI82, Ree82, BCW85]) to some degree, because we are not yet familiar enough with all of this literature to do it justice.

The two subsections below cover periods before and after 1991. The period from 1960 to 1990 was dominated by the gradual development of various allocator designs and by the synthetic trace methodology. The period after 1990 has (so far) shown that that methodology is in fact unsound and biased, and that much still needs to be done, both in terms of reevaluating old designs and inventing new ones on the basis of new results. (Readers who are uninterested in the history of allocator design and evaluation may wish to skip to Section 4.2.)

In much of the following, empirical results are presented qualitatively (e.g., allocator A was found to use space more efficiently than allocator B). In part, this is due to the fact that early results used figures of merit that are awkward to explain in a brief review, and difficult to relate to measures that current readers are likely to find most interesting. In addition, workloads have changed so much over the last three decades that precise numbers would be of mostly historical interest. (Early papers were mostly about managing operating system segments (or overlays) in fixed main memories,96 while recent papers are mostly about managing small objects within the memory of a single process.) The qualitative presentation is also due in part to our skepticism of the methodology underlying most of the results before 1991; citing precise numbers would lend undue weight to quantities we consider questionable.

4.1 The first three decades: 1960 to 1990

Structure of this section. Our review of the work in this period is structured chronologically, and divided into three parts, roughly a decade each. Each of the three sections begins with an overview; the casual reader may want to read the overviews first, and skim the rest. We apologize in advance for a certain amount of redundancy--we have attempted to make this section relatively free-standing, so that it can be read straight through (by a reader with sufficient fortitude) given the basic concepts presented by earlier sections.

94 This is not quite necessarily true. For applications that do little freeing, the initial carving of memory requested from the operating system will be a significant fraction of the allocation cost. This can be made quite fast as well, however.

95 A few papers have not escaped our attention but seem to have escaped our library. In particular, we have had to rely on secondary sources for Graham's influential work in worst-case analyses.

96 Several very early papers (e.g., [Mah61, IJ62]) discussed memory fragmentation, but in systems where segments could be compacted together or swapped to secondary storage when fragmentation became a problem; these papers generally do not give any quantitative results at all, and few qualitative results comparing different allocation strategies.
1960 to 1969.

Overview. Most of the basic designs still in use were conceived in the 1960's, including sequential fits, buddy systems, simple segregated storage, and segregated lists using exact lists. (Some of these, particularly sequential fits, already existed in the late 50's, but were not well described in the literature. Knuth [Knu73] gives pointers to the early history of linked list processing.) In the earliest days, interest was largely in managing memory overlays or segments in segmented operating systems, i.e., managing mappings of logical (program and data) segments to physical memory.97 By the mid-1960's, the problem of managing storage for different-sized objects within the address space of a single process was also recognized as an important one, largely due to the increasing use (and sophistication) of list processing techniques and languages [Ros61, Com64, BR64].98

Equally important, the 1960's saw the invention of the now-traditional methodology for allocator evaluation. In early papers, the assumptions underlying this scheme were explicit and warned against, but as the decade progressed, the warnings decreased in frequency and seriousness.

Some of the assumptions underlying this model made more sense then than they do now, at least for some purposes. For example, most computers were based on segmented memory systems, and highly loaded. In these systems, the memory utilization was often kept high, by long-term scheduling of jobs. (In some cases, segments belonging to a process might be evicted to backing storage to make room when a request couldn't otherwise be satisfied.) This makes steady-state and independence assumptions somewhat more plausible than in later decades, when the ...

For example, multitasking may introduce phase behavior, since the segments belonging to a process are usually only released when that process is running, or when it terminates. Between time slices, a program does not generally acquire or release segments. Operations on the segments associated with a process may occur periodically.

Other assumptions that became common during the 1960's (and well beyond) also seem unwarranted in retrospect. It was widely assumed that segment sizes were independent, perhaps because most systems were used by many users at the same time, so that most segments were typically "unrelated." On reflection, even in such a system there is good reason to think that particular segment sizes may be quite common, for at least three reasons. First, if the same program is run in different processes simultaneously, the statically-allocated data segment sizes of frequently-used programs may appear often. Second, some important programs may use data segments of particular characteristic sizes. (Consider a sort utility that uses a fixed amount of memory chosen to make internal sorting fast, but using merging from external storage to avoid bringing all of the data into memory.) Third, some segment sizes may be used in unusually large numbers due to peculiarities of the system design, e.g., the minimum and/or maximum segment size. (Segments or overlays were also typically fairly large compared to total memory, so statistical mechanics would not be particularly reliable even for random workloads.)

The original paraphernalia for the lottery had been lost long ago, and the black box had been put into use even before Old Man Warner, the oldest man in town, was born.
--Shirley Jackson, "The Lottery"
Collins described his simulations as a "game," in the terminology of game theory. The application program and the allocator are players; the application makes moves by requesting memory allocations or deallocations, and the allocator responds with moves that are placement decisions.99

Collins noted that this methodology required further validation, and that experiments with real workloads would be better. Given this caveat, best fit worked best, but first fit (apparently address-ordered) was almost equally good. No quantitative results were reported, and the distributions used were not specified.

Comfort, in a paper about list processing for different-sized objects [Com64], briefly described the segregated lists technique with splitting and coalescing, as well as address-ordered first fit, using an ordered linear list.100 (The address order would be used to support coalescing without any additional space overhead.) Comfort did not mention that his "multiple free lists" technique (segregated fits with exact lists) was an implementation of a best fit policy, or something very similar; later researchers would often overlook this scheme. Comfort also proposed a simple form of deferred coalescing, where no coalescing was done until memory was exhausted, and then it was all done at once. (Similar coalescing schemes seem to have been used in some early systems, with process swapping or segment eviction used when coalescing failed to obtain enough contiguous free memory.) No empirical results were reported.

Totschek [Tot65] reported the distribution of job sizes (i.e., memory associated with each process) in the SDC (Systems Development Corporation) time-sharing system. Later papers refer to this as "the SDC distribution". Naturally, the "block" sizes here were rather large. Totschek found a roughly trimodal distribution, with most jobs being either around 20,000 words, or either less than half or more than twice that. He did not find a significant correlation between job size and running time.

Knowlton [Kno65] published the first paper on the (binary) buddy system, although Knuth [Knu73] reports that the same idea was independently invented and used by H. Markowitz in the Simscript system around 1963. Knowlton also suggested the use of deferred coalescing to avoid unneeded overheads in the common case where objects of the same size were frequently used.

Ross, in [Ros67], described a sophisticated storage management system for the AED engineering design support system. While no empirical results were reported, Ross describes different patterns of memory usage that programs may exhibit, such as mostly monotonic accumulation (ramps), and fragmentation caused by different characteristic lifetimes of different-sized objects.

The storage allocation scheme divided available memory into "zones," which could be managed by different allocators suitable to different applications' usual behavior.101 Zones could be nested, and the system was extensible--a zone could use one of the default allocators, or provide its own allocation and deallocation routines. It was also possible to free an entire zone at once, rather than freeing each object individually. The default allocators included first fit and simple segregated storage. (This is the first published mention of simple segregated storage that we have found, though Comfort's multiple free list scheme is similar.)

Graham, in an unpublished technical report [Gra], described the problem of analyzing the worst-case memory use of allocators, and presented lower bounds on worst case fragmentation.102 (An earlier memo by Doug McIlroy may have motivated this work, as well as Robson's later work.)

Graham characterized the problem metaphorically as a board game between an "attacker," who knows the exact policy used by the allocator ("defender") and submits requests ("makes moves") that will force the defender's policy to do as badly as possible. (This is a common metaphor in "minimax" game theory; such an omniscient, malevolent opponent is commonly called a "devil" or "evil demon.")

99 We suspect that the history of allocator research might have been quite different if this metaphor had been taken more seriously--the application program in the randomized methodology is a very unstable individual, or one using a very peculiar strategy.

100 Knuth [Knu73] reports that this paper was written in 1961, but unpublished until 1964.

101 Comparable schemes were apparently used in other early systems, including one that was integrated with overlaying in the IBM PL/I compiler [Boz84].

102 We do not have a copy of this report at this writing. Our information comes from secondary sources.

Knuth surveyed memory allocation techniques in Volume One of The Art of Computer Programming
([Knu73], first edition 1968), which has been a standard text and reference ever since. It has been particularly influential in the area of memory allocation ...

... assumptions now appear to be false for most programs, as we will explain later in the discussions of [MPS71], [ZG94] and [WJNB95]. Shore would later show that Knuth's simplifying assumptions about the lack of systematicity in the allocator's placement were also unwarranted.105 Betteridge [Bet82] provides a somewhat different critique of the fifty percent rule.)

In a worst-case analysis, Knuth showed that the binary buddy system requires at most 2M log2 n memory ...

... mentation would outweigh the decrease in internal fragmentation.106 (Given the smoothing effects of the randomization of requests, and its possibly different effects on internal and external fragmentation, this result should be interpreted with caution.)

103 One consisted of the six powers of two from 1 to 32, chosen with probability inversely proportional to size, and the other consisted of 22 sizes from 10 to 4000, chosen with equal probability. The latter distribution appears (now) to be unrealistic in that most real programs' size distributions are not only spiky, but skewed toward a few heavily-used sizes.

104 This contrasts strongly with our own recent results for synthetic traces using randomized order (but real sizes and lifetimes), described later. We are unsure why this is, but there are many variables involved, including the relative sizes of memories, pages, and objects, as well as the size and lifetime distributions.

105 Nonetheless, his fifty-percent rule (and others' corollaries) are still widely quoted in textbooks on data structures and operating systems. (To our minds, the fault for this does not lie with Knuth, who presented eminently reasonable first-cut analyses in the course of writing a tremendously ambitious, valuable and general series of books.)

106 On first reading, Randell's grain sizes seem quite large--the smallest (nonzero) value used was 16 words. Examining Totschek's distribution, however, it is clear that this is quite small relative to the average "object" (segment) size [Tot65].

Randell used three different placement algorithms. The first (called RELOC) was an idealized algorithm that continually compacted memory to obtain the best possible space usage. The other two (non-compacting) algorithms were best fit (called MIN) and random. Comparisons between these two are not given. The only quantitative data obtainable from the paper are from figures 2 and 3, which show that for best fit, the SDC distribution exhibits less
fragmentation (about 11 or 12 percent) than an exponential distribution (about 17 or 18 percent), and both suffer considerably as the grain size is increased.

Minker et al. [M+69] published a technical report which contained a distribution of "buffer sizes" in the University of Maryland UNIVAC Exec 8 system.107 Unfortunately, these data are imprecise, because they give counts of buffers within ranges of sizes, not exact sizes.

These data were later used by other researchers, some of whom described the distribution as roughly exponential. The distribution is clearly not a simple exponential, however, and the use of averaging over ranges may conceal distinct spikes.108

107 We have not yet obtained a copy of this report--our information is taken from [Rus77] and other secondary sources. We are unclear on exactly what sense of "buffer" is meant, but believe that it means memory used to cache logical segments for processes; we suspect that the sizes reported are ranges because the system used a set of fixed buffer sizes, and recorded those, rather than the exact sizes of segments allocated in those buffers. We are also unsure of the exact units used.

108 Our tentative interpretation of the data is that the distribution is at least bimodal, with modes somewhere around roughly 5 units (36% of all requests) and roughly 20 units (30% of all requests).

1970 to 1979.

Overview. The 1970's saw a few significant innovations in allocator design and methodology. However, most research was focused on attempts to refine known allocator designs (e.g., the development of various buddy systems), on experiments using different combinations of distributions and allocators, or on attempts to derive analytical formulae that could predict the performance of actual implementations for (randomized) workloads.

Analytic techniques had much greater success within a certain limited scope. Bounds were found for worst-case fragmentation, both for specific algorithms and for all algorithms. The results were not encouraging. Building on Graham's analysis framework, Robson's 1971 paper dashed any hope of finding an allocator with low fragmentation in the worst case.

Most empirical studies used synthetic trace techniques, which were refined as more information about real lifetime and size distributions became available, and as it became obvious that the relative performance of different algorithms depended on those factors. Exponential distributions became the most common size distribution, and a common lifetime distribution, because empirical data showed that allocations of small and short-lived objects were frequent. The fact that these distributions were often spiky--or effectively smoothed in the statistics-gathering process--was often overlooked, as was the non-independence of requests.

Perhaps the most innovative and empirical paper of this period was Margolin's, which used sound methodology, and evaluated a new form of deferred coalescing.

Fenton and Payne's "half fit" policy is also novel and interesting; it is based on a very different strategy from those used in other allocators. Wise's (unpublished) double buddy design is also well-motivated. Purdom, Stigler and Cheam introduced the segregated fits mechanism, which did not receive the attention it was due.

Batson and Brundage's statistics for Algol-60 segment sizes and lifetimes were quite illuminating, and their commentary insightfully questioned the plausibility of the usual assumptions of randomness and independence. They underscored the difficulty of predicting allocator performance. Unfortunately, though their results and commentary were available in 1974 in a technical report, they were not published in a journal until 1977.

Denning [Den70] used Knuth's fifty percent rule to derive an "unused memory rule", which states that under assumptions of randomness and steady-state behavior, fragmentation generally increases memory usage by about half; he also pointed out that sequential free list searches tend to be longer when memory is heavily loaded. Gelenbe also derived a similar "two thirds rule" [Gel71] in a somewhat different way. (These essentially identical rules are both subject to the same criticisms as Knuth's original rule.)

Purdom and Stigler [PS70] performed statistical analyses of the binary buddy system, and argued that limitations on buddy system coalescing were seldom a problem. Their model was based on strong assumptions of independence and randomness in the workload, including exponentially distributed random lifetimes.

Batson, Ju and Wood [BJW70] reported segment size and lifetime distributions in the
University of Virginia B5500 system. Most segments were "small"--about 60 percent of the segments in use were 40 (48-bit) words or less in length.

About 90 percent of the programs run on this system, including system programs, were written in Algol, and the sizes of segments often corresponded to the sizes of individual program objects, e.g., Algol arrays. (In many other systems, e.g., Totschek's SDC system, segments were usually large and might contain many individual program objects.) The data were obtained by sampling at various times, and reflect the actual numbers of segments in use, not the number of allocation requests.

This distribution is weighted toward small objects, but Batson et al. note that it is not well described as an exponential. Unfortunately, their results are presented only in graphs, and in roughly exponentially spaced bins (i.e., more precise for smaller objects than large ones). This effectively smooths the results, making it unclear what the actual distribution is, e.g., whether it is spiky. The general shape (after smoothing) has a rounded peak for the smaller sizes, and is roughly exponential after that. (In a followup study [BB77], described later, Batson and Brundage would find spikes.)

A note about Algol-60 is in order here. Algol-60 does not support general heap allocation--all data allocations are associated with procedure activations, and have (nested) dynamic extents. (In the case of statically allocated data, that extent is the entire program run.) In the B5500 Algol system, scalar variables associated with a procedure were apparently allocated in a segment; arrays were allocated in separate segments, and referenced via an indirection. Because of the B5500's limit of 1023 words per segment, large arrays were represented as a set of smaller arrays indexed by an array of descriptors (indirections).109

Because of this purely block-structured approach to storage allocation, Algol-60 data lifetimes may be more closely tied to the phase structure of the program than would be expected for programs in more modern languages with a general heap. On the other hand, recent data for garbage-collected systems [Wil95] and for C and C++ programs [WJNB95] suggest that the majority of object lifetimes in modern programs are also tied to the phase structure of programs, or to the single large "phase" that covers the whole duration of execution.

109 Algol-60's dynamically sized arrays may complicate this scenario somewhat, requiring general heap allocation, but apparently a large majority of arrays were statically sized and stack-like usage predominated.

Campbell introduced an "optimal fit" policy, which is a variant of next fit intended to improve the chances of a good fit without too much cost in extra searching [Cam71]. (It is not optimal in any useful sense.) The basic idea is that the allocator looks forward through the linear list for a bounded number of links, recording the best fit found. It then proceeds forward looking for another fit at least as good as what it found in that (sample) range. If it fails to find one before traversing the whole list, it uses the best fit it found in the sample range. (That is, it degenerates into exhaustive best fit search when the sample contains the best fit.)
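
Our reading of that description can be sketched in C as follows (an illustration only, not Campbell's code; the behavior when the sample contains no fit at all is not specified in the description, so the sketch simply reports failure in that case):

    #include <stddef.h>

    typedef struct blk { size_t size; struct blk *next; } blk;

    blk *optimal_fit(blk *free_list, size_t want, int sample_len) {
        blk *best = NULL;
        blk *b = free_list;
        /* Pass 1: record the best fit within a bounded sample of links. */
        for (int i = 0; b != NULL && i < sample_len; i++, b = b->next)
            if (b->size >= want && (best == NULL || b->size < best->size))
                best = b;
        if (best == NULL)
            return NULL;                  /* no fit in the sample */
        /* Pass 2: take the next block that fits at least as well. */
        for (; b != NULL; b = b->next)
            if (b->size >= want && b->size <= best->size)
                return b;
        return best;                      /* degenerates to the sample's best fit */
    }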
Campbell tested this technique with a real program (a physics problem), but the details of his design and experiment were strongly dependent on unusual coordination between the application program and the memory allocator. After an initial phase, the application can estimate the number of blocks of different sizes that will be needed later. Campbell's algorithm exploited this information to construct a randomized free list containing a good mix of block sizes.

While Campbell's algorithm worked well in his experiment, it seems that his results are not applicable to the general allocation problem, and other techniques might have worked as well or better. (For example, constructing multiple free lists segregated by size, rather than a random unified free list that must be searched linearly. See also the discussion of [Pag82], later in this section.)

Purdom, Stigler, and Cheam [PSC71] introduced segregated fits using size classes with range lists (called "segregated storage" in their paper). The nature and importance of this efficient mechanism for best-fit-like policies was not generally appreciated by later researchers (an exception being Standish [Sta80]). This may be because their paper's title gave no hint that a novel algorithm was presented.

Purdom et al. used the random trace methodology to compare first fit, binary buddy, and segregated fits. (It is unclear which kind of first fit was used, e.g., LIFO-ordered or address-ordered.) Their segregated fits scheme used powers-of-two size classes.

They reported that memory usage for segregated fits was almost identical to that of first fit, while binary buddy's was much worse.
Every year, after the lottery, Mr. Summers began talking again about a new box, but every year the subject was allowed to fade off without anything's being done. The black box grew shabbier each year; by now it was no longer completely black but splintered badly along one side to show the original wood color, and in some places faded or stained.
--Shirley Jackson, "The Lottery"

Margolin et al. used real traces to study memory allocation in the CP-67 control program of an IBM System/360 mainframe [MPS71]. (Note that this allocator allocated storage used by the operating system itself, not for application programs.)

They warned that examination of their system showed that several assumptions underlying the usual methodology were false, for their system's workload: uncorrelated sizes and lifetimes, independence of successive requests, and well-behaved distributions. Unfortunately, these warnings were to go generally unheeded for two decades, despite the fact that some later researchers used the distributions they reported to generate randomly-ordered synthetic traces. (We suspect that their careful analysis of a single system was not given the attention it deserved because it seemed too ad hoc.)

Their size distribution was both spiky and skewed, with several strong modes of different sizes. Nearly half (46.7%) of all objects were of size 4 or 5 doublewords; sizes 1 and 8 (doublewords) accounted for about 11% each, and size 29 accounted for almost 16% of the remainder. Many sizes were never allocated at all.

Margolin et al. began with an address-ordered first fit scheme, and added deferred coalescing. Their major goal was to decrease the time spent in memory management inside the CP-67 control program, without an undue increase in memory usage. Their deferred coalescing subpools (quick lists) pre-allocated some fraction (50% or 95%) of the expected maximum usage of objects of those sizes. (This scheme does not appear to adapt to changes in program behavior.) Deferred coalescing was only used for frequently-allocated sizes.

For their experiments, they used several traces taken from their system on different days. They tuned the free list sizes using one subset of the traces, and evaluated them using another. (Their system was thus tuned to a particular installation, but not a particular run.)

They found that using deferred coalescing increased memory usage by approximately zero to 25%, while generally decreasing search traversals to a small fraction of the original algorithm's. In actual tests in the real system, time spent in memory management was cut by about a factor of six.

Robson [Rob71] showed that the worst-case performance of a worst-case-optimal algorithm is bounded from below by a function that rises logarithmically with the ratio n (the ratio of the largest and smallest block sizes), i.e., M log2 n times a constant.

Isoda, Goto and Kimura [IGK71] introduced a bitmapped technique for keeping track of allocated and unallocated buddies in the (binary) buddy system. Rather than taking a bit (or several, as in Knowlton's original scheme) out of the storage for each block, their scheme maintains a bit vector corresponding to the words of memory. The bit for the last word of each block, and the bit for the last word occupied by a block, is set. The buddy placement constraint lets these be used as "tail lamps" to efficiently look through memory to find the ends of preceding blocks.

Hirschberg [Hir73] followed Knuth's suggestion and devised a Fibonacci buddy system; he compared this experimentally to a binary buddy. His experiment used the usual synthetic trace methodology, using a real distribution of block sizes (from the University of Maryland UNIVAC Exec 8 system [M+69]) and an exponential lifetime distribution. His results agreed well with the analytically derived estimates; Fibonacci buddy's fragmentation increased memory usage by about 25%, compared to binary buddy's 38%.

Hirschberg also suggested a generalization of the buddy system allowing Fibonacci-like series where each size was the sum of the previous size and a size a fixed distance further down in the size series. (For some fixed integer k, the i-th size in the series may be split into two blocks of the (i-1)-th and (i-k)-th sizes; with k = 2, this yields the Fibonacci series itself.)

Robson [Rob74] put fairly tight upper and lower bounds on the worst-case performance of the best possible allocation algorithm. He showed that a worst-case-optimal strategy's worst-case memory usage was somewhere between 0.5 M log2 n and about 0.84 M log2 n.

Shen and Peterson introduced the weighted buddy method [SP74], whose allowable block sizes are either powers of two, or three times a power of two.
They compared this scheme to binary buddy, using the synthetic trace methodology; they used only a uniform lifetime distribution, and only two size distributions, both smooth (uniform and exponential). This is unfortunate, because skew in object size requests may affect the effectiveness of different block-splitting schemes.

They found that for a uniform size distribution, weighted buddy lost more memory to fragmentation than binary buddy, about 7%. For an exponential distribution (which is apparently more realistic) this was reversed--weighted buddy did better by about 7%.

By default, they used FIFO-ordered free lists. With LIFO-ordered free lists, memory usage was about 3% worse.

Using a variation of the random trace methodology intended to approximate a segment-based multiprogramming system,110 Fenton and Payne [FP74] compared best fit (called "least fit"), first fit, next fit, worst fit, and "half fit." The half fit allocator attempts to find a block about twice the desired size, in the hopes that if there is a bias toward particular sizes, remainders from splitting will be more likely to be a good fit for future requests. They found that best fit worked best, followed by first fit, half fit, next fit, and worst fit, in that order. Half fit was almost as good as first fit, with next fit performing significantly worse, and worst fit much worse.

All of the size distributions used in their experiments were smooth. For many of their experiments, they used a smooth distribution based on generalizations about Totschek's SDC distribution and Batson, Ju, and Wood's B5500 distribution. (This is a "deformed exponential" distribution, which rises quickly, rounds off at the top, and then descends in a roughly exponential fashion.) Fenton and Payne apparently didn't consider the possibility that smooth distributions (and randomized order) might make their half fit policy work worse than it would in practice, by decreasing the chance that a request for a particular size would be repeated soon.

110 In this model, each object (segment) is assumed to be associated with a different process. When a request cannot be satisfied, that process blocks (i.e., the death time of the segment is delayed, but time advances so that other segments may die). This model embodies an oversimplification relative to most real systems, in that processes in most systems may have multiple associated segments whose death times cannot be postponed independently.

Hinds [Hin75] presented a fast scheme for recombination in binary and generalized Fibonacci buddy systems. Each block has a "left buddy count" indicating whether it is a right buddy at the lowest level (in which case the LBC is zero), or indicating for how many levels above the lowest it is a left buddy. This supports splitting and merging nearly as quickly as in the binary buddy scheme.

Cranston and Thomas [CT75] presented a method for quickly finding the buddy of a block in various buddy systems, using only three bits per block. This reduces the time cost of splitting and merging relative to Hirschberg's scheme, as well as incurring minimal space cost.

Shore [Sho75] compared best fit and address-ordered first fit more thoroughly than had been done previously, and also experimented with worst fit and a novel hybrid of best fit and first fit. He used the then-standard methodology, generating random synthetic traces with (only) uniformly distributed lifetimes. Size distributions were uniform, normal, exponential, and hyperexponential. He also performed limited experiments with "partial populations" (i.e., spiky distributions). The figure of merit was the space-time product of memory usage over time. (This essentially corresponds to the average memory usage, rather than peak usage.)

This study was motivated in part by Wald's report of the "somewhat puzzling success" of best fit in actual use in the Automatic Operating and Scheduling Program of the Burroughs D-825 system [Wal66]. (Fragmentation was expected to be a problem; plans were made for compaction, but none was needed.)

Shore found that best fit and (address-ordered) first fit worked about equally well, but that first fit had an advantage when the distribution included block sizes that were relatively large compared to the memory size. Following Knuth [Knu73], he hypothesized that this was due to its tendency to fit small objects into holes near one end of memory, accumulating larger free areas toward the other end.111

111 We are actually unsure what Shore's claim is here. It is not clear to us whether he is making the general claim that first fit tends to result in a free list that is approximately size-ordered, or only the weaker claim that first fit more often has unusually large free blocks in the higher address range, and that this is important for distributions that include occasional very large blocks.

For partial populations, Shore found that increasing degrees of spikiness seemed to favor best fit over
first fit slightly, but that the variance increased so quickly that this result was not reliable.112

Shore noted that while first fit and best fit policies are roughly similar, they seem to have somewhat different strengths and weaknesses; he hypothesized that these might be combinable in a hybrid algorithm that would outperform either.

Shore experimented with a novel parameterized algorithm ...

... tention than his thorough (and influential) experimentation within the random trace paradigm.

Burton introduced a generalization of the Fibonacci buddy system [Bur76] which is more general than Hirschberg's. Rather than using a fixed function for generating successive sizes (such as always adding sizes i-1 and i-3 to generate size i), Burton ...
to the same block allocate exactly the same number and sizes of segments. They stated that they had no success fitting any simple curve to their data, and that this casts doubts on analyses and experiments assuming well-behaved distributions.

They also suggested that the experiments of Randell, Knuth, and Shore could be redone using realistic distributions, but warned that "we must wait for a better understanding" of "the dynamics of the way in which the allocated space is used--before we can make reasonable predictions about the comparative performance of different mechanisms." They go on to say that "there is no reason to suppose that stochastic processes could possibly generate the observed request distributions."

Though based on a 1974 technical report, this paper was not published until 1977, the same year that saw publication of a flurry of papers based on random traces with well-behaved distributions. (Described below.)

Weinstock [Wei76] surveyed most of the important work in allocators before 1976, and presented new empirical results. He also introduced the "QuickFit" algorithm, a deferred coalescing scheme using size-specific lists for small block sizes, backed by LIFO-ordered first fit as the general allocator.115 (Weinstock reported that this scheme was invented several years earlier for use in the Bliss/11 compiler [WJW+75], and notes that a similar scheme was independently developed and used in the Simscript II.5 language [Joh72]. Margolin's prior work was overlooked, however.)

Weinstock used the conventional synthetic trace methodology; randomly-ordered synthetic traces were generated, using two real size distributions and four artificial ones. One of the real size-and-lifetime distributions came from the Bliss/11 compiler [WJW+75], and the other was from Batson and Brundage's measurements of the University of Virginia B5500 system [BB77], described above. The four artificial size distributions were uniform, exponential, Poisson, and a two-valued distribution designed to be a bad case for first fit and best fit. (The two-valued distribution was not used in the final evaluation of allocators.)

The Bliss/11 distribution is heavily weighted toward small objects, but is not well-described by an exponential curve. It has distinct spikes at 2 words (44% of all objects) and 9 words (14%). In between those spikes is another elevation at 5 words and 6 words (9% each).

115 This is not to be confused with the later variant of QuickFit [WW88], which does no coalescing for small objects, or Standish and Tadman's indexed fits allocator.

The figures of merit for space usage in this study were probabilities of failure in different-sized memories. (That is, how likely it was that the synthetic program would exhaust memory and fail, given a particular limited memory size.) This makes the results rather difficult reading, but the use of fixed memory sizes allows experimentation with allocators which perform (deferred) coalescing only when memory is otherwise exhausted.

Weinstock experimented with QuickFit, best fit, first fit, next fit, and binary buddies. Variations of best fit used address-ordered or size-ordered free lists. Variations of first fit and next fit used address-ordered and LIFO-ordered free lists. The address-ordered versions of best, first, and next fit were also tried with immediate coalescing and deferred coalescing. Two binary buddy systems were used, with immediate and deferred coalescing. (In all cases, deferred coalescing was only performed when memory was exhausted; no intermediate strategies were used.)

In general, Weinstock found that address-ordered best fit had the best space usage, followed closely by address-ordered first fit. (Both did about equally well under light loadings, i.e., when memory was more plentiful.)

After address-ordered best fit came a cluster of algorithms whose ranking changed depending on the loading and on the distributions used: address-ordered first fit, address-ordered best fit with deferred coalescing, size-ordered best fit, and Quick Fit.

After that came a cluster containing address-ordered first fit with deferred coalescing and address-ordered next fit. This was followed by address-ordered next fit with deferred coalescing, followed in turn by LIFO-ordered first fit. Binary buddies performed worst, with little difference between the immediate and deferred coalescing variants.

In summary, address-ordered variants tended to outperform other variants, and deferred coalescing (in the extreme form used) usually increased fragmentation. FIFO-ordered lists were not tried, however.

In terms of speed, QuickFit was found to be fastest, followed by binary buddy with deferred coalescing. Then came binary buddy with immediate coalescing. Rankings are given for the remaining allocators, but these are probably not particularly useful; the
ing algorithms were based on linear list implementa- internal fragmentation due to more-rened size series
tions, and could doubtless be considerably improved were usually oset by similar increases in external
by the use of more sophisticated indexing systems fragmentation.
such as splay trees or (in the case of best t) seg-
regated ts. Robson [Rob77] showed that the worst-case perfor-
Weinstock made the important point that seem- mance of address-ordered rst t is about log2 , M n
ingly minor variations in algorithms could have a sig- while best t's is far worse, at about . He also
Mn
nicant eect on performance; he therefore took great noted that the roving pointer optimization made next
care in the describing of the algorithms he used, and t's worst case similarly bad|both best t and next
some of the algorithms used in earlier studies. t can suer about as much from fragmentation as
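To make the structure of such a scheme concrete, the following is a minimal sketch of a QuickFit-style front end (our illustration, not Weinstock's code): size-specific LIFO lists for small sizes, with a general allocator as backing storage. The names, the size cutoff, and the helpers general_alloc()/general_free() are assumptions; a faithful implementation would also handle splitting, alignment, and coalescing on memory exhaustion.

    /* A minimal QuickFit-style front end (illustrative sketch only).
       Small requests are served LIFO from per-size "quick lists";
       everything else falls through to an assumed general allocator. */

    #include <stddef.h>

    #define WORD      sizeof(void *)
    #define NUM_QUICK 16                  /* quick lists cover 1..16 words */

    typedef struct qnode { struct qnode *next; } qnode;

    static qnode *quick[NUM_QUICK + 1];   /* one LIFO list per small size */

    extern void *general_alloc(size_t words);   /* assumed backing allocator */
    extern void  general_free(void *p, size_t words);

    static size_t words_for(size_t bytes) {
        size_t w = (bytes + WORD - 1) / WORD;
        return w == 0 ? 1 : w;            /* every block holds a pointer */
    }

    void *qf_alloc(size_t bytes) {
        size_t w = words_for(bytes);
        if (w <= NUM_QUICK && quick[w] != NULL) {   /* fast path: pop LIFO */
            qnode *n = quick[w];
            quick[w] = n->next;
            return n;
        }
        return general_alloc(w);          /* miss: use general allocator */
    }

    void qf_free(void *p, size_t bytes) {
        size_t w = words_for(bytes);
        if (w <= NUM_QUICK) {             /* defer coalescing: push LIFO */
            qnode *n = (qnode *)p;
            n->next = quick[w];
            quick[w] = n;
        } else {
            general_free(p, w);
        }
    }

Note that qf_free() never coalesces; in the scheme Weinstock evaluated, quick lists were flushed back to the general pool only when memory was exhausted, which is the extreme form of deferral that his results found usually increased fragmentation.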
In a brief technical communication, Bays [Bay77] replicated some of Shore's results comparing first fit and best fit, and showed that next fit was distinctly inferior when average block sizes were small. When block sizes were large, all three methods degraded to similar (poor) performance. (Only uniformly distributed lifetimes and exponentially distributed sizes were used.)

    "Seems like there's no time at all between
    lotteries any more," Mrs. Delacroix said to
    Mrs. Graves in the back row.
    -- Shirley Jackson, "The Lottery"

Peterson and Norman [PN77] described a very general class of buddy systems, and experimentally compared several varieties of buddy systems: binary, Fibonacci, a generalized Fibonacci [HS64, Fer76], and weighted. They used the usual random trace methodology, with both synthetic (uniform and exponential) and real size distributions. Their three size distributions were Margolin's CP-67 distribution, the University of Maryland distribution, and a distribution from an IBM 360 OS/MVT system at Brigham Young University. (This "BYU" distribution was also used in several later studies.) They point out that the latter two distributions were imprecise, grouping sizes into ranges; they generated sizes randomly within those ranges. (The implication of this is that these distributions were smoothed somewhat; only the CP-67 distribution is truly natural.)

(The BYU distribution is clearly not exponential, although some later researchers would describe it that way; while it is skewed toward small sizes, it is at least bimodal. Given that it is reported in averages over ranges, there may be other regularities that have been smoothed away, such as distinct spikes.)

We are unsure what lifetime distribution was used. Peterson and Norman found that these buddy systems all had similar memory usage; the decreases in internal fragmentation due to more-refined size series were usually offset by similar increases in external fragmentation.

Robson [Rob77] showed that the worst-case performance of address-ordered first fit is about M log2 n, while best fit's is far worse, at about Mn. He also noted that the roving pointer optimization made next fit's worst case similarly bad: both best fit and next fit can suffer about as much from fragmentation as any allocator with general splitting and coalescing.
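Stated compactly, in notation that is our gloss rather than Robson's own (we take M to be the maximum amount of live data and n the ratio of largest to smallest block size, as in the usual statements of these results):

    % Worst-case heap sizes for the policies cited above
    % (our rendering; M and n as defined in the lead-in).
    \[
      \mathrm{HeapSize}_{\text{addr-ordered first fit}} \;\approx\; M \log_2 n,
      \qquad
      \mathrm{HeapSize}_{\text{best fit}} \;\approx\; M\,n .
    \]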
Nielsen [Nie77] studied the performance of memory allocation algorithms for use in simulation programs. His main interest was in finding fast allocators, rather than memory-efficient allocators. He used a variation of the usual random trace methodology intended to model the workloads generated by discrete-event simulation systems. A workload was modeled as a set of streams of event objects; each stream generated only requests of a single size, but these requests were generated randomly according to size and interarrival time distributions associated with the streams. To construct a workload, between 3 and 25 request streams were combined to simulate a simulation with many concurrent activities.

Eighteen workloads (stream combinations) were used. Of these, only two modeled any phase behavior, and only one modeled phases that affected different streams (and object sizes) in correlated ways.[116]

[116] In our view, this does not constitute a valid cross-section of discrete event simulation programs, for several reasons. (They may better reflect the state of the art in simulation at the time the study was done, however.) First, in many simulations, events are not generated at random, but in synchronized pulses or other patterns. Second, many events in some simulations are responses to emergent interactions of other events, i.e., patterns in the domain-level systems being simulated. Third, many simulation programs have considerable state local to simulated objects, in addition to the event records themselves. Fourth, many simulation systems include analysis facilities which may create objects with very different lifetime characteristics than the simulation objects themselves; for example, an event log that accumulates monotonically until the simulation terminates.

Nielsen's experiments were done in two phases. In the first phase a single workload was used to test 35 variants of best fit, first fit, next fit, binary buddies, and segregated fits. (This workload consisted of 10 streams, and modeled no phase behavior.) Primarily on the basis of time costs, all but seven of the initial set of allocators were eliminated from consideration. (This is unfortunate, because different implementation strategies could implement many of the same policies more efficiently. Best fit and address-ordered first fit were among the policies eliminated.) Of the surviving seven allocators, six had poor memory usage. The seventh allocator, which performed quite well in terms of both speed and memory usage, was "multiple free lists," i.e., segregated fits with exact lists.

In [Sho77], Shore analyzed address-ordered first fit theoretically, and showed that the allocator itself violates a statistical assumption underlying Knuth's fifty percent rule. He argued that systematicity in the placement of objects interacts with "the statistics of the release process" to affect the length of the free list under equilibrium conditions.

Shore demonstrated that the relative performance of best fit and (address-ordered) first fit depended on the shape of the lifetime distribution.

Shore was primarily concerned with simple, well-behaved distributions, however, and made the usual assumptions of randomness (e.g., independence of successive allocations, independence of size and lifetime). He did not consider possible systematicities in the application program's allocations and releases, such as patterned births and deaths. (He did aptly note that "the dynamics of memory usage comprise complicated phenomena in which observable effects often have subtle causes.")

Russell [Rus77] attempted to derive formulas for expected fragmentation in a Fibonacci and a generalized Fibonacci buddy system,[117] based on the assumption that size distributions followed a generalization of Zipf's law (i.e., a decreasing function inversely related to the sizes). Based on this assumption, he derived estimated lower and upper bounds, as well as estimated average performance. He compared this to simulation results, using the conventional synthetic trace methodology and basing size distributions on three real distributions (Margolin's CP-67 distribution, the BYU distribution, and the U. of Maryland distribution). For the generalized Fibonacci system, average fragmentation for the three workloads was close to what was predicted (22% predicted, 21% observed). For the plain Fibonacci system, the error was significant (29% predicted, 22% observed). For binary buddy the error was rather large (44% predicted, 30% observed).

[117] See also Bromley [Bro80].

Russell notes that the CP-67 data do not closely resemble a Zipf distribution, and for this distribution the fragmentation using conventional Fibonacci is in fact lower (at 15%) than his estimated lower bound (24%). Averaging just the results for the other two distributions brings the results closer to the predicted values on average, but for generalized Fibonacci they move further away. We believe that his estimation technique is unreliable, partly because we do not believe that distributions are generally exponential, and partly because of the randomness of request order that he assumes.

Wise, in an unpublished technical report [Wis78], described a double buddy system and its advantages over Fibonacci systems in terms of external fragmentation (producing free blocks of the same size as requested blocks). This report apparently went unnoticed until well after double buddy was reinvented by Page and Hagins [PH86].[118]

[118] The first author of the present paper overlooked both and reinvented it yet again in 1992. It is next expected to appear in the year 2000.

Reeves [Ree79, Ree80, Ree82, Ree83] used analytic techniques to determine the effect of a random fit allocator policy in the face of random workloads, using a "generating function" approach originated by Knuth [Knu73]. This work relies extremely heavily on randomness assumptions, usually in both the workload and the allocator, to enable the analyses of memories of significant size.

1980 to 1990.

    People at first were not so much concerned
    with what the story meant; what they
    wanted to know was where these lotteries
    were held, and whether they could go there
    and watch.
    -- Shirley Jackson, "On the Morning of June
    28, 1948, and 'The Lottery'"

Overview. The 1980-1990 period saw only modest development of new allocator techniques, and little new in the way of evaluation methodologies, at least in academic publications. Despite doubts cast by Margolin and Batson, most experimenters continued to use synthetic traces, often with smooth and well-behaved distributions. This is probably due to the lack of a comprehensive survey addressing methodological concerns. (The present paper is an attempt to remedy that problem.) By this time, there were many papers on allocators, and Margolin's and Batson's were probably not among the most studied.[119] Most theoretical papers continued to make strong assumptions of randomness and independence, as well, with the exception of papers about worst-case performance.

[119] Margolin's paper was published in an IBM journal, while the main stream of allocator papers was published in Communications of the ACM. Batson and Brundage's paper was published in CACM, but its title may not have conveyed the significance of their data and conclusions.

Among the more interesting designs from this period are Standish and Tadman's exact lists scheme, Page and Hagins' double buddy system, Beck's age-match algorithm, and Hanson's obstack system.

Standish surveyed memory allocation research in a (short) chapter of a book on data structures [Sta80], describing segregated fits and introducing a segregated free lists method using exact lists. Citing Tadman's masters thesis [Tad78], he reported that an experimental evaluation showed this scheme to perform quite similarly to best fit (which is not surprising, because it is best fit, in policy terms) and that it was fast. (These experiments used the usual synthetic trace methodology, and Standish summarized some of Weinstock's results as well.)

Page [Pag84] analyzed a "cyclic placement" policy similar to next fit, both analytically and in randomized simulations. (Only uniformly distributed sizes and lifetimes were used.) The cyclic placement scheme generally resulted in significantly more fragmentation than first fit or best fit.

    "...over in the north village they're talking
    of giving up the lottery."
    -- Shirley Jackson, "The Lottery"

Leverett and Hibbard [LH82] performed one of the all-too-rare studies evaluating memory allocators using real traces. Unfortunately, their workload consisted of five very small programs (e.g., towers of Hanoi, knight's tour) coded in Algol-68; none was more than 100 lines. It is unclear how well such textbook-style programs represent larger programs in general use. Algol-68 did support general heap allocation, an improvement over Algol-60. The Algol-68 system used for the experiments used reference counting to reclaim space automatically.[120] (Deferred) coalescing was performed only when memory was exhausted. The general allocator was first fit with a LIFO-ordered free list.

[120] A possibly misleading passage says that memory is freed "explicitly," but that is apparently referring to a level of abstraction below the reference counting mechanism. Another potentially confusing term, "garbage collection," is used to refer to deferred coalescing where coalescing is performed only when there is no sufficiently large block to satisfy a request. This is very different from the usual current usage of the term [Wil95], but it is not uncommon in early papers on allocators.

LIFO-ordered quick lists for different-sized blocks were used, as well as per-procedure lists for activation records,[121] and some lists for specific data types. Deferred coalescing greatly improved the speed of their allocator, and usually decreased overall memory usage.

[121] Activation records were apparently allocated on the general heap; presumably this was used to support closures with indefinite extent (i.e., "block retention"), and/or "thunks" (hidden parameterless subroutines) for call-by-name parameter passing [Ing61].

Leverett and Hibbard also found that Knuth's roving pointer modification (i.e., next fit) was disappointing; search lengths did not decrease by much, and for some programs got longer.

Page [Pag82] evaluated Campbell's "optimal fit" method analytically and in randomized trace simulations. (Page's version of optimal fit was somewhat different from Campbell's, of necessity, since Campbell's was intertwined with a particular application program structure.) Page showed that Campbell's analysis erred in assuming randomness in first-fit-like placement policies, and that systematicities in placement matter considerably. In Page's analysis and simulations, Campbell's "optimal" fit was distinctly inferior to first fit and best fit in both search times and memory usage. (Only uniformly distributed sizes and lifetimes were used, however.)

Page also showed that (for uniformly distributed sizes and lifetimes) a first fit policy resulted in the same placement decisions as best fit most of the time, if given the same configuration of memory and the same request. He also showed that the free list for first fit tended toward being roughly sorted in size order. (See also similar but possibly weaker claims in [Sho75], discussed earlier.)

Betteridge [Bet82] attempted to compute fragmentation probabilities for different allocators using first-order Markov modeling. (This book is apparently Betteridge's dissertation, completed in 1979.) The basic idea is to model all possible states of memory occupancy (i.e., all arrangements of allocated and free blocks), and the transition probabilities between those states. Given a fixed set of transition probabilities, it is possible to compute the likelihood of the system being in any particular state over the long run. This set of state probabilities can then be used to summarize the likelihood of different degrees of fragmentation.
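To make the combinatorics concrete (this is our illustration, not Betteridge's notation): if each of M memory units is either allocated or free, the chain has on the order of 2^M states, and the long-run behavior is the stationary distribution of the transition matrix:

    % Stationary distribution of a first-order Markov model of
    % memory states (illustrative). P_{st} is the probability that
    % one allocation/deallocation step moves the memory from
    % occupancy state s to state t; pi_s is the long-run
    % probability of state s.
    \[
      \pi P = \pi, \qquad \sum_{s} \pi_s = 1,
    \]
    % Expected fragmentation is then a weighted sum over states,
    % for some fragmentation measure f(s):
    \[
      E[\text{fragmentation}] \;=\; \sum_{s} \pi_s \, f(s).
    \]

The exponential growth of the state space in M is precisely why, as described next, only tiny memories could be analyzed this way.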
Unfortunately, the number of possible states of memory is exponential in the size of memory, and Betteridge was only able to compute probabilities for memories of sizes up to twelve units. (These units may be words, or they may be interpreted as some larger grain size. However, earlier results suggest that small grain sizes are preferred.) He suggests several techniques to make it easier to use somewhat larger models, but had little success with the few he tried. (See also [Ben81, Ree82, McI82].) We are not optimistic that this approach is useful for realistic memory sizes, especially since memory sizes tend to increase rapidly over time.

To allow the use of a first-order Markov model, Betteridge assumed that object lifetimes were completely independent: not only must death times be random with respect to allocation order, but there could be no information in the request stream that might give an allocator any exploitable hint as to when objects might die. For this, Betteridge had to assume a random exponential lifetime function, i.e., a half-life function where any live object was exactly as likely to die as any other at a given time. (Refer to Section 2.2 for more on the significance of this assumption.) This is necessary to ensure that the frequencies of actual transitions would stabilize over the long run (i.e., the Markov model is ergodic; see Section 2.2), and allows the computation of the transition probabilities without running an actual simulation for an inconveniently infinite period of time. The system need not keep track of the sequences of transitions that result in particular states; actual sequences are abstracted away, and only the states where histories intersect are represented.

Even with these extremely strong assumptions of randomness, this problem is combinatorially explosive. (This is true even when various symmetries and rotations are exploited to combine (exactly) equivalent states [Ben81, McI82].)

We believe that the only way to make this kind of problem remotely tractable is with powerful abstractions over the possible states of memory. For the general memory allocation problem, this is simply not possible: for an arbitrary interesting allocator and real request streams, there is always the possibility of systematic and even chaotic interactions. The only way to make the real problem formalizable is to find a useful qualitative model that captures the likely range of program behaviors, each allocator's likely responses to classes of request streams, and (most importantly) allows reliable characterization of request streams and allocators in the relevant ways. We are very far away from this deep understanding at present.

Beck [Bec82] described the basic issue of fragmentation clearly, and designed two interesting classes of allocators, one idealized and one implementable. Beck pointed out that the basic goal of an allocator is to reduce the number of isolated free blocks, and that the existence of isolated free blocks is due to neighboring blocks having different death times.

This motivated the design of an idealized offline allocator that looks ahead into the future to find out when objects will die; it attempts to place new objects near objects that will die at about the same time. This policy can't be used in practice, because allocators must generally make their decisions online, but it provides an idealized standard for comparison. This "release-match" algorithm is philosophically similar to Belady's well-known MIN (or OPT) algorithm for optimal demand-paging. (It is heuristic, however, rather than optimal.)

Beck also described an implementable "age match" algorithm intended to resemble release-match, using allocation time to heuristically estimate the deallocation (release) time.

For an exponential size distribution and uniform lifetime distribution, he found that the effectiveness of the age-match heuristic depended on the lifetime variance (i.e., the range of the uniform distribution). This is not surprising, because when lifetimes are similar, objects will tend to be deallocated in the order that they are allocated. As the variance in lifetimes increases, however, the accuracy of prediction is reduced.

Beck also experimented with hyper-exponential lifetime distributions. In this case, the age-match heuristic systematically failed, because in that case the age of an object is negatively correlated with the time until it will die. This should not be surprising. (In this case it might work to reverse the order of estimated death times.)
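Purely as an illustration of the age-match idea (this sketch is entirely our construction, with invented names and a crude lifetime estimate; it is not Beck's code), a placement loop might estimate each candidate's neighboring death time from allocation times and pick the closest match:

    /* Illustrative sketch of an age-match-style placement policy.
       Each free block records the allocation time of a live
       neighbor; a new object's release time is guessed as
       now + mean_lifetime, and we place it where the estimated
       death times match best. All details are assumptions. */

    #include <stdlib.h>
    #include <limits.h>

    typedef struct block {
        struct block *next;           /* next block on the free list     */
        long          nbr_alloc_time; /* allocation time of live neighbor */
        size_t        size;
    } block;

    static long estimate_release(long alloc_time, long mean_lifetime) {
        return alloc_time + mean_lifetime;   /* crude age-based estimate */
    }

    block *age_match_place(block *free_list, size_t want,
                           long now, long mean_lifetime) {
        long target = estimate_release(now, mean_lifetime);
        block *best = NULL;
        long best_delta = LONG_MAX;
        for (block *b = free_list; b != NULL; b = b->next) {
            if (b->size < want)
                continue;                    /* must fit the request */
            long nbr = estimate_release(b->nbr_alloc_time, mean_lifetime);
            long delta = labs(target - nbr); /* closeness of death times */
            if (delta < best_delta) {
                best_delta = delta;
                best = b;
            }
        }
        return best;    /* NULL if nothing fits */
    }

When lifetimes are uniform with low variance, this estimate tracks actual death order; under hyper-exponential lifetimes the estimate is anti-correlated with reality, which matches the systematic failure described above.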
Stephenson [Ste83] introduced the "Fast Fits" technique, using a Cartesian tree of free blocks ordered primarily by address and secondarily by block size. He evaluated the leftmost fit (address-ordered first fit) and better fit variants experimentally. Details of the experiment are not given, but the general result was that the space usage of the two policies was similar, with better fit appearing to have a time advantage. (A caveat is given, however, that this result appears to be workload-dependent, in that different distributions may give different results. This may be a response to the then-unpublished experiments in [BBDT84], but no details are given.)
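The two orderings can be stated as a pair of structural invariants; the sketch below (ours, for illustration, not Stephenson's implementation) shows the node layout and an invariant check:

    /* Sketch of a Cartesian tree of free blocks, as in "Fast Fits":
       an in-order walk visits blocks in address order (binary search
       tree on address), while each parent's size is at least as
       large as its children's sizes (heap order on size). */

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct cnode {
        char         *addr;     /* start address of the free block */
        size_t        size;     /* size of the free block          */
        struct cnode *left, *right;
    } cnode;

    /* Verify both invariants over a subtree. */
    bool cartesian_ok(const cnode *t) {
        if (t == NULL)
            return true;
        if (t->left  && !(t->left->addr  < t->addr &&
                          t->left->size  <= t->size))
            return false;
        if (t->right && !(t->right->addr > t->addr &&
                          t->right->size <= t->size))
            return false;
        return cartesian_ok(t->left) && cartesian_ok(t->right);
    }

The heap order on size is what makes leftmost fit fast: if a subtree's root is too small for a request, the entire subtree can be skipped, since no descendant is larger.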
Kaufman [Kau84] presented two buddy system allocators using deferred coalescing. The first, "tailored list" buddy systems, use a set of size-specific free lists whose contents are not usually coalesced.[122] This system attempts to keep the lengths of the free lists proportional to the expected usage of the corresponding sizes; it requires estimates of program behavior. The second scheme, "recombination delaying" buddy systems, adapts dynamically to the actual workload. In experiments using the usual synthetic trace methodology, Kaufman found that both systems worked quite well at reducing the time spent in memory management. These results are suspect, however, due to the load-smoothing effects of random traces, which flatter small caches of free blocks (Section 3.11).[123]

[122] This tailoring of list length should not be confused with the tailoring of size classes as mentioned in [PN77].

[123] The tailored list scheme worked better than the recombination delaying scheme, but this result is especially suspect; the tailored list scheme does not respond dynamically to the changing characteristics of the workload, but this weakness is not stressed by an artificial trace without significant phase behavior.

Bozman et al. [BBDT84] studied a wide variety of allocators, including sequential fits, deferred coalescing schemes, buddy systems, and Stephenson's Cartesian tree system. (Not all allocators were compared directly to each other, because some were tailored to an IBM operating system and others were not.) They used synthetic traces based on real lifetime distributions, primarily from two installations of the same IBM operating system, VM-SP. (Their main goal was to develop an efficient allocator for that system.) They also measured the performance of a resulting algorithm in actual use in the VM-SP system.

First, Bozman et al. compared first fit, next fit and best fit with the VM-SP algorithm. This algorithm, based on earlier research by Margolin et al., used deferred coalescing with a general pool managed by address-ordered first fit. In terms of fragmentation, VM-SP was best, followed by best fit, which was significantly better than first fit. This result is unclear, however, because they don't state which variety of first fit they were using (e.g., address-ordered or LIFO-ordered free lists). Next fit was considerably worse, using about 50% more memory than the VM-SP algorithm.

They then compared best-fit-first (taking the first of several equally good fits) with best-fit-last (taking the last), and found that best-fit-last was better. They also added a splitting threshold, which reduced the difference between best fit and first fit. (We are not sure whether these got better or worse in absolute terms.) Adding the splitting threshold also reversed the order of best-fit-first and best-fit-last.

Bozman et al. also tested a binary buddy and a modified Fibonacci buddy. They found that the memory usage of both was poor, but both were fast; the memory usage of the modified Fibonacci buddy was quite variable.

Testing Stephenson's Cartesian tree allocator, they found that the leftmost fit (address-ordered first fit) policy worked better than the "better fit" policy; the latter suffered from "severe" external fragmentation for the test workload. They suggest that leftmost fit would make a good general allocator in a system with deferred coalescing.

After these initial experiments, Bozman et al. developed a fast deferred coalescing allocator. This allocator used 2 to 15 percent more memory than best fit, but was much faster. We note that the extra memory usage was likely caused at least in part by the policy of keeping "subpools" (free lists caching free blocks of particular sizes) long enough that the miss rate was half a percent or less. (That is, no more than one in two hundred allocations required the use of the general allocator.)

This allocator was deployed and evaluated in the same installations of the VM-SP operating system from which their test statistics had been gathered. The performance results were favorable, and close to what was predicted. From this Bozman et al. make the general claim, which is clearly far too strong, that the statistical assumptions underlying the random-trace methodology are not a problem, and that the results are highly predictive. (We believe that this conclusion is difficult to support with what amount to two data points, especially since their validation was primarily relevant to variations on a single optimized design, not the wide variety of basic allocators they experimented with using synthetic traces.)

In a related paper, Bozman [Boz84] described a general "software lookaside buffer" technique for caching search results in data structures. One of his three applications (and empirical evaluations) was for deferred coalescing with best fit and address-ordered first fit allocators. In that application, the buffer is a FIFO queue storing the size and address of individual blocks that have been freed recently. It is searched linearly at allocation time.

For his evaluation, Bozman used the conventional synthetic trace methodology, using a real size distribution from a VM-SP system and exponentially distributed lifetimes; he reported considerable reductions in search lengths, in terms of combined FIFO buffer and general allocator searches. (It should be noted that both general allocators used were based on linear lists, and hence not very scalable to large heaps; since the FIFO buffer records individual free blocks, it too would not scale well. With better implementations of the general allocator, this would be less attractive. It also appears that the use of a randomized trace is likely to have a significant effect on the results (Section 3.11).)
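As an illustration of the mechanism (our sketch, not Bozman's code; the exact-size matching rule and ring size are assumptions), the buffer can be a small ring of (address, size) pairs scanned linearly before falling back to the general allocator:

    /* Sketch of a "software lookaside buffer" for deferred
       coalescing: a small FIFO ring of recently freed blocks,
       scanned linearly at allocation time. general_alloc() is an
       assumed fallback; in a fuller version, an entry evicted by
       lb_record_free() would be returned to the general pool
       rather than simply overwritten. */

    #include <stddef.h>

    #define LB_SIZE 32

    typedef struct { void *addr; size_t size; } lb_entry;

    static lb_entry ring[LB_SIZE];   /* addr == NULL marks empty slot */
    static int      head;           /* next slot to overwrite (FIFO) */

    extern void *general_alloc(size_t size);

    void lb_record_free(void *addr, size_t size) {
        ring[head].addr = addr;
        ring[head].size = size;
        head = (head + 1) % LB_SIZE;
    }

    void *lb_alloc(size_t size) {
        for (int i = 0; i < LB_SIZE; i++) {
            if (ring[i].addr != NULL && ring[i].size == size) {
                void *p = ring[i].addr;   /* hit: reuse block as-is */
                ring[i].addr = NULL;
                return p;
            }
        }
        return general_alloc(size);      /* miss: search general pool */
    }

Because the ring records individual blocks rather than whole size classes, its capacity bounds the hit rate, which is consistent with the scalability concern noted above.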
Coffman, Kadota, and Shepp [CKS85] have conjectured that address-ordered first fit approaches optimal as the size of memory increases. They make very strong assumptions of randomness and independence, including assuming that lifetimes are unrelated and exponentially distributed.

In support of this conjecture, they present results of simulations using pseudo-random synthetic traces, which are consistent with their conjecture. They claim that "we can draw strong engineering conclusions from the above experimental result."

Naturally, we are somewhat skeptical of this statement, because of the known non-randomness and non-independence observed in most real systems. Coffman, Kadota, and Shepp suggest that their result indicates that large archival storage systems should use first fit rather than more complex schemes, but we believe that this result is inapplicable there. (We suspect that there are significant regularities in file usage that are extremely unlikely to occur with random traces using smooth distributions, although the use of compression may smooth size distributions somewhat.)

We also note that for secondary and tertiary storage more generally, contiguous storage is not strictly required; freedom from this restriction allows schemes that are much more flexible and less vulnerable to fragmentation. (Many systems divide all files into blocks of one or two fixed sizes, and only preserve logical contiguity (e.g., [RO91, VC90, SKW92, CG91, AS95]).) If access times are important, other considerations are likely to be much more significant, such as locality. (For rotating media and especially for tapes, placement has more important effects on speed than on space usage.)

Oldehoeft and Allan [OA85] experimented with variants of deferred coalescing, using a working-set or FIFO policy to dynamically determine which sizes would be kept on quick lists for deferred coalescing. The system maintained a cache of free lists for recently-freed sizes. (Note that where Bozman had maintained a cache of individual free blocks, Oldehoeft and Allan maintained a cache of free lists for recently-freed sizes.) For the FIFO policy, this cache contains a fixed number of free lists. For the Working Set policy, a variable number of free lists are maintained, depending on how many sizes have been freed within a certain time window. In either policy, when a free list is evicted from the cache, the blocks on that list are returned to the general pool and coalesced if possible. Note that the number and size of uncoalesced free blocks is potentially quite variable in this scheme, but probably less so than in schemes with fixed-length quick lists.

One real trace was used, and two synthetic traces generated from real distributions. The real trace was from a Pascal heap (program type not stated) and the real distributions were Margolin's CP-67 data and Leverett and Hibbard's data for small Algol programs.

Oldehoeft and Allan reported results for FIFO and Working Set with comparable average cache sizes. The FIFO policy may defer the coalescing of blocks for a very variable time, depending on how many different sizes of object are freed. The Working Set policy is guaranteed to coalesce all blocks of sizes that haven't been freed within its time window. Neither policy bounds the volume of memory contained in the quick lists, although it would appear that Working Set is less likely to have excessive amounts of idle memory on quick lists.

The Working Set policy yielded higher hit rates, i.e., more allocations were satisfied from the size-specific lists, avoiding use of the general allocator.

They also experimented with a totally synthetic workload using uniform random size and lifetime distributions. For that workload, Working Set and FIFO performed about equally, and poorly, as would be expected.

Effects on actual memory usage were not reported, so the effect of their deferred coalescing on overall memory usage is unknown.
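A minimal sketch of the Working Set variant as we read it (our code, not Oldehoeft and Allan's; the bound on cached sizes and the general-pool helper are assumptions): each size's quick list records when that size was last freed, and lists idle longer than the window are flushed and coalesced.

    /* Sketch of a Working-Set-style cache of quick lists.
       pool_free_and_coalesce() is an assumed general-pool routine. */

    #include <stddef.h>

    #define MAX_SIZES 64          /* hypothetical bound on cached sizes */

    typedef struct qblock { struct qblock *next; } qblock;

    typedef struct {
        size_t  size;             /* object size this list caches */
        long    last_freed;       /* time of most recent free     */
        qblock *blocks;
    } quicklist;

    static quicklist cache[MAX_SIZES];
    static int       ncached;

    extern void pool_free_and_coalesce(qblock *b, size_t size);

    /* Evict every quick list whose size hasn't been freed within
       `window` time units; its blocks rejoin the general pool. */
    void ws_expire(long now, long window) {
        for (int i = 0; i < ncached; ) {
            if (now - cache[i].last_freed > window) {
                for (qblock *b = cache[i].blocks; b != NULL; ) {
                    qblock *next = b->next;
                    pool_free_and_coalesce(b, cache[i].size);
                    b = next;
                }
                cache[i] = cache[--ncached];  /* remove list from cache */
            } else {
                i++;
            }
        }
    }

Under this policy the number of cached lists, and hence the volume of uncoalesced memory, floats with how many distinct sizes the program has freed recently, which is the behavior contrasted with fixed-length quick lists above.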
Korn and Vo [KV85] evaluated a variety of UNIX memory allocators, both production implementations distributed with several UNIX systems, and new implementations and variants. Despite remarking on the high fragmentation observed for a certain usage pattern combined with a next fit allocator (the simple loop described in Section 3.5), they used the traditional synthetic trace methodology. (Vo's recent work uses real traces, as described later.) Only uniform size and lifetime distributions were used. They were interested in both time and space costs, and in scalability to large heaps.

Five of their allocators were variants of next fit.[124] The others included simple segregated storage (with powers of two size classes),[125] address-ordered first fit (using a self-adjusting "splay" tree [ST85]), segregated fits (using Fibonacci-spaced size classes), better fit (using Stephenson's Cartesian tree scheme), and two best fit algorithms (one using a balanced binary tree, and the other a splay tree).

[124] Next fit is called "first fit" in their paper, as is common.

[125] This allocator (implemented by Chris Kingsley and widely distributed with the BSD 4.2 UNIX system) is called a buddy system in their paper, but it is not; it does no coalescing at all.

It may be significant that Korn and Vo modified most of their allocators to include a "wilderness preservation heuristic," which treats the last block of the heap memory area specially; this is the point (called the "break") where the heap segment may be extended, using the UNIX sbrk() system call, to obtain more virtual memory pages from the operating system. (See Section 3.5.)
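In outline (our sketch of the general idea, not Korn and Vo's code; find_fit() and carve() are hypothetical helpers, and a wilderness block is assumed to always exist), the heuristic means: satisfy a request from any ordinary free block if possible, and carve from the block bordering the break, growing it with sbrk(), only as a last resort.

    /* Sketch of a wilderness preservation heuristic. */

    #include <stddef.h>
    #include <stdint.h>
    #include <unistd.h>     /* sbrk() */

    typedef struct wblock { size_t size; /* ...free-list links... */ } wblock;

    extern wblock *find_fit(size_t size);          /* search ordinary blocks */
    extern void   *carve(wblock *b, size_t size);  /* split and return mem   */
    static wblock *wilderness;    /* the free block bordering the break */

    void *wp_alloc(size_t size) {
        wblock *b = find_fit(size);
        if (b != NULL)
            return carve(b, size);        /* prefer non-wilderness fits */

        /* Last resort: use the wilderness, extending it if necessary. */
        if (wilderness->size < size) {
            size_t grow = size - wilderness->size;  /* real code rounds  */
            if (sbrk((intptr_t)grow) == (void *)-1) /* up to page size   */
                return NULL;                        /* out of memory     */
            wilderness->size += grow;    /* new pages join wilderness */
        }
        return carve(wilderness, size);
    }

Preserving the wilderness keeps the one block that can be cheaply extended from being whittled away by requests that ordinary free blocks could have served.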
To summarize their results, we will give approximate numbers obtained by visual inspection of their Figure 3. (These numbers should be considered very approximate, because the space wastage varied somewhat with mean object size and lifetimes.)

Space waste (expressed as an increase over the amount of live data, and in increasing order) was as follows. Best fit variants worked best, with space wastage of roughly 6 to 11 percent (in order of increasing waste: best fit (splay), best fit (balanced), better fit (Cartesian)). Segregated fits followed at about 16 percent. Address-ordered next fit wasted about 20 percent, and address-ordered first fit wasted about 24 percent. Standard next fit and a variant using adaptive search followed, both at about 26 percent. Two other variants of next fit followed at a considerable distance; one used a restricted search (42 percent) and the other treated small blocks specially (45 percent). Simple segregated storage (powers of two sizes) was worst at about 47 percent. (These numbers should be interpreted with some caution, however; besides the general problem of using synthetic workloads, there is variation among the allocators in per-block overheads.)

In terms of time costs, two implementations scaled very poorly, being fast for small mean lifetimes (and hence heap sizes), but very slow for large ones. The implementations of these algorithms both used linear lists of all blocks, allocated or free. These algorithms were a standard next fit and an address-ordered next fit.

Among the other algorithms, there were four clusters at different time performance levels. (We will name the algorithms within a cluster in approximately increasing cost order.) The first cluster contained only simple segregated storage, which was by far the fastest. The second cluster contained next fit with restricted search, next fit with special treatment of small blocks, segregated fits, and next fit with adaptive search. (This last appeared to scale the worst of this cluster, while segregated fits scaled best.) The third cluster contained best fit (splay), better fit (Cartesian), and address-ordered first fit (splay).

Gai and Mezzalama [GM85] presented a very simple deferred coalescing scheme, where only one size class is treated specially, and the standard C library allocator routines are used for backing storage. (The algorithms used in this library are not stated, and are not standardized.)

Their target application domain was concurrent simulations, where many variations of a design are tested in a single run. As the run progresses, faults are detected and faulty designs are deleted.[126] An interesting characteristic of this kind of system is that memory usage follows a backward (decreasing) ramp function after the initialization phase; aside from short-term variations due to short-lived objects, the general shape of the memory-use function is monotonically decreasing.

[126] This is actually intended to test a test system; faulty designs are intentionally included in the set, and should be weeded out by the test system. If not, the test system must be improved.

To test their allocator, they used a synthetic workload where memory usage rises sharply at the beginning and oscillates around a linearly descending ramp. The use of this synthetic trace technique is somewhat more reasonable for this specialized allocator than for the general allocation problem; since there's essentially no external fragmentation,[127] there's little difference between a real trace and a synthetic one in that regard.

[127] Since memory usage is dominated by a single size, almost all requests can be satisfied by almost any free block.

They reported that this quick list technique was quite fast, relative to the (unspecified) general allocator.

From our point of view, we find the experimental results less interesting than the explanation of the overall pattern of memory usage in this class of application, and what the attractiveness of this approach indicates about the state of heap management in the real world (refer to Section 1.1).
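The scheme is simple enough to sketch in a few lines of C (ours, not theirs; the cell type and the particular hot size are placeholders): a single free list caches blocks of the one special size, and malloc()/free() serve as the backing store.

    /* Sketch of a one-size deferred coalescing cache in the style
       of Gai and Mezzalama: one quick list in front of the standard
       C library allocator. */

    #include <stdlib.h>

    #define HOT_SIZE 64          /* hypothetical: the one special size */

    typedef union cell { union cell *next; char payload[HOT_SIZE]; } cell;

    static cell *quick_list;     /* cached free blocks of HOT_SIZE */

    void *hot_alloc(void) {
        if (quick_list != NULL) {        /* fast path: pop the list */
            cell *c = quick_list;
            quick_list = c->next;
            return c;
        }
        return malloc(sizeof(cell));     /* backing storage */
    }

    void hot_free(void *p) {
        cell *c = (cell *)p;     /* never returned to the library;  */
        c->next = quick_list;    /* coalescing deferred indefinitely */
        quick_list = c;
    }

With a monotonically decreasing memory-use ramp, blocks cached this way are almost always reused rather than stranded, which is why the approach works well in this application class.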
Page and Hagins [PH86] provided the first published double buddy system, and experimentally compared it to binary and weighted buddy systems. Using the standard simulation techniques, and only uniformly distributed sizes and lifetimes, they show that double buddies suffer from somewhat less fragmentation than binary and weighted buddies. They also present an analysis that explains this result.[128]

[128] While we believe that double buddies are indeed effective, we disagree somewhat with their methodology and their analysis. Uniform random distributions do not exhibit the skewed and non-uniform size distributions often seen in real programs, or pronounced phase behavior. All of these factors may affect the performance of the double buddy system; a skew towards a particular size favors double buddies, where splitting always results in same-sized free blocks. Phase behavior may enhance this effect, but on the other hand may cause problems due to uneven usage of the two component (binary) buddy systems, causing external fragmentation.
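To illustrate the size structure (our sketch; details such as the minimum class are assumptions), a double buddy system interleaves two binary buddy series, 2^k and 3*2^k, so that halving any class yields another class and every split produces two same-sized free blocks:

    /* Size-class arithmetic for a double buddy system (sketch).
       Classes are 2^k and 3*2^(k-1): 4, 6, 8, 12, 16, 24, 32, ...
       Halving any class yields another class in the same series. */

    #include <stddef.h>

    #define MIN_CLASS 4             /* hypothetical minimum block size */

    /* Round a request up to the nearest double buddy size class. */
    size_t db_class(size_t request) {
        size_t pow2 = MIN_CLASS;
        while (pow2 < request)
            pow2 <<= 1;             /* smallest 2^k >= request */
        size_t three = (pow2 / 2) + (pow2 / 4);   /* the 3*2^j class */
        if (three >= request && three >= MIN_CLASS)
            return three;           /* lies between 2^(k-1) and 2^k */
        return pow2;
    }

    /* Splitting a block of either series halves it within its series. */
    size_t db_half(size_t class_size) {
        return class_size / 2;      /* 2^k -> 2^(k-1); 3*2^j -> 3*2^(j-1) */
    }

Since adjacent classes now differ by factors of 3/2 and 4/3 rather than 2, expected internal fragmentation drops accordingly, while splitting remains as simple as in a binary buddy.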
Brent [Bre89] presented a scalable algorithm for the address-ordered first fit policy, using a "heap" data structure (i.e., a partially-ordered tree, not to be confused with the sense of "heap" as a pool for dynamic storage allocation) embedded in an array. To keep the size of this heap array small, a two-level scheme is used. Memory is divided into equal-sized chunks, and the heap records the size of the largest free block in each chunk. Within a chunk, conventional linear searching is used. While this scheme appears to scale well, it has the drawback that the constant factors are apparently rather high. Other scalable indexing schemes may provide higher performance for address-ordered first fit.

    Although the villagers had forgotten the ritual
    and lost the original black box, they still
    remembered to use stones...
    "It isn't fair, it isn't right," Mrs. Hutchison
    screamed and then they were upon her.
    -- Shirley Jackson, "The Lottery"

Coffman and Leighton, in a paper titled "A Provably Efficient Algorithm for Dynamic Storage Allocation" [CL89], describe an algorithm combining some characteristics of best fit and address-ordered first fit, and prove that its memory usage is asymptotically optimal as system size increases toward infinity.

To enable this proof, they make the usual assumptions of randomness and independence, including randomly ordered and exponentially distributed lifetimes. (See Section 2.2.) They also make the further assumption that the distribution of object sizes is known a priori, which is generally not the case in real systems.

Coffman and Leighton say that probabilistic results are less common than worst-case results, "but far more important," that their result has "strong consequences for practical storage allocation systems," and that algorithms designed to "create sufficiently large holes when none exist will not be necessary except in very special circumstances."

It should be no surprise that we feel compelled to take exception with such strongly-stated claims. In our view, the patterned time-varying nature of real request streams is the major problem in storage allocation, and in particular the time-varying shifts in the requested sizes. Assuming that request distributions are known and stable makes the problem mathematically tractable, but considerably less relevant.

Coffman and Leighton offer an asymptotic improvement in memory usage, but this amounts to no more than a small constant factor in practice, since real algorithms used in real systems apparently seldom waste more than a factor of two in space, and usually much less.[129]

[129] We also note that their algorithm requires log2 n time, where n is the number of free blocks, which tends toward infinity as n tends toward infinity. In practical terms, it becomes rather slow as systems become very large. However, more scalable (sublogarithmic) algorithms could presumably exploit the same statistical tendencies of very large systems, if real workloads resembled stochastic processes.

While we believe that this result is of limited relevance to real systems, it does seem likely that for extremely large systems with many complex and independent tasks, there may be significant smoothing effects that tend in this direction. In that case, there may be very many effectively random holes, and thus a likely good fit for any particular request.

Unfortunately, we suspect that the result given is not directly relevant to any existing system, and for any sufficiently large and complex systems, other considerations are likely to be more important. For the foreseeable future, time-varying behavior is the essential policy consideration. If systems eventually become very large (and heterogeneous), locality concerns are likely to be crucial. (Consider the effects on locality in a large system when objects are placed in effectively randomly-generated holes; the scattering of related data seems likely to be a problem.)

Hanson [Han90] presents a technique for allocating objects and deallocating them en masse. This is often more efficient and convenient than traversing data structures being deallocated, and freeing each object individually. A special kind of heap can be created on demand. In the GNU C compiler system, these are called "obstacks," short for "object stacks," and we will adopt that term here. Objects known to die at the end of a phase can be allocated on an obstack, and all freed at once when the phase is over. More generally, nested phases are supported, so that objects can be deallocated in batches whose extents are nested. Freeing an object simply frees that object and all objects allocated after it. (This is actually a very old idea, dating at least to Collins' "zone" system.[130] The fact that this idea has been independently developed by a variety of system implementors attests to the obvious and exploitable phase behavior evident in many programs.)

[130] Similar techniques have been used in Lisp systems (notably the Lisp Machine systems), and are known by a variety of names.

The obstack scheme has two advantages. First, it is often easier for the programmer to manage batches of objects than to code freeing routines that free each object individually. Second, the allocator implementation can be optimized for this usage style, reducing space and time costs for freeing objects. In Hanson's system, storage for a specially-managed heap is allocated as a linked list of large chunks, and objects can be allocated contiguously within a chunk; no header is required on each small object. The usual time cost for allocation is just the incrementing of a pointer into a chunk, plus a check to see if the chunk is full. The time cost for freeing in a large specially-managed heap is roughly proportional to the number of chunks freed, with fairly small constant factors, rather than the number of small objects freed.
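The mechanics are easy to show in miniature (our sketch, patterned on the description above rather than on Hanson's or GCC's actual code; alignment and oversized requests are ignored): allocation bumps a pointer within the current chunk, and freeing to a mark discards whole chunks.

    /* Miniature obstack-style allocator (sketch). */

    #include <stdlib.h>

    #define CHUNK_BYTES 4096

    typedef struct chunk {
        struct chunk *prev;      /* older chunk in the list */
        char         *fill;      /* next free byte          */
        char          mem[CHUNK_BYTES];
    } chunk;

    static chunk *top;           /* newest chunk */

    void *obstack_alloc(size_t n) {
        if (top == NULL ||
            (size_t)(top->mem + CHUNK_BYTES - top->fill) < n) {
            chunk *c = malloc(sizeof *c);  /* chunk full: add a new one */
            if (c == NULL) return NULL;    /* (real code rounds n and   */
            c->prev = top;                 /* handles n > CHUNK_BYTES)  */
            c->fill = c->mem;
            top = c;
        }
        void *p = top->fill;     /* usual case: just a pointer bump */
        top->fill += n;
        return p;
    }

    /* Free object p and all objects allocated after it. */
    void obstack_free_to(void *p) {
        while (top != NULL &&
               !((char *)p >= top->mem &&
                 (char *)p <  top->mem + CHUNK_BYTES)) {
            chunk *dead = top;   /* p is in an older chunk: discard */
            top = top->prev;     /* this whole chunk                */
            free(dead);
        }
        if (top != NULL)
            top->fill = (char *)p;   /* roll the fill pointer back */
    }

Freeing costs are proportional to the number of chunks discarded, as the text notes, because individual objects within a chunk are never visited.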
Obstack allocation must be used very carefully, because it intertwines the management of data structures with the control structure of a program. It is easy to make mistakes where objects are allocated on the obstack, but the data objects they manage are allocated on the general heap. (E.g., a queue object may be allocated on an obstack, but allocate its queue nodes on the general heap.) When the controlling objects are freed, the controlled objects are not; this is especially likely to happen in large systems, where intercalling libraries do not obey the same storage management conventions.[131]

[131] The opposite kind of mistake is also easy to make, if the controlling objects' routines are coded on the assumption that the objects they control will be freed automatically when they are freed, but the controlling object is actually allocated on the general heap rather than an obstack. In that case, a storage leak results. These kinds of errors (and many others) can usually be avoided if garbage collection [Wil95] is used to free objects automatically. Henry Baker reports that the heavy use of an obstack-like scheme used in MIT Lisp machines was a continuing source of bugs (Baker, personal communication 1995). David Moon reports that a similar facility in the Symbolics system often resulted in obscure bugs, and its use was discouraged after an efficient generational garbage collector [Moo84] was developed (Moon, personal communication 1995); generational techniques heuristically exploit the lifetime distributions of typical programs [LH83, Wil95]. For systems without garbage collection, however, the resulting problems may be no worse than those introduced by other explicit deallocation strategies, when used carefully and in well-documented ways.

4.2 Recent Studies Using Real Traces

    "Some places have already quit lotteries,"
    Mrs. Adams said.
    "Nothing but trouble in that," Old Man
    Warner said stoutly.
    -- Shirley Jackson, "The Lottery"

Zorn, Grunwald, et al. Zorn and Grunwald and their collaborators have performed a variety of experimental evaluations of allocators and garbage collectors with respect to space, time, and locality costs. This is the first major series of experiments using valid methodology, i.e., using real traces of program behavior for a variety of programs.

Our presentation here is sketchy and incomplete, for several reasons. Zorn and Grunwald are largely interested in time costs, while we are (here) more interested in placement policies' effect on fragmentation. They have often used complicated hybrid allocator algorithms, making their results difficult to interpret in terms of our basic policy consideration, and in general, they do not carefully separate out the effects of particular implementation details (such as per-object overheads and minimum block sizes) from "true" fragmentation. (Nonetheless, their work is far more useful than most prior experimental work.) Some of Zorn and Grunwald's papers, and much of their data and their test programs, are available via anonymous Internet FTP (from cs.colorado.edu) for further analysis and experimentation.

In [ZG92], Zorn and Grunwald present various allocation-related statistics on six allocation-intensive C programs, i.e., programs for which the speed of the allocator is important. (Not all of these use large amounts of memory, however.) They found that for each of these programs, the two most popular sizes accounted for at least half (and as much as 93%) of all allocations. In each, the top ten sizes accounted for at least 85% of all allocations.

Zorn and Grunwald [ZG94] attempted to find fairly conventional models of memory allocation that would allow the generation of synthetic traces useful for evaluating allocators. They used several models of varying degrees of sophistication, some of which modeled phase behavior and one of which modeled fine-grained patterns stochastically (using a first-order Markov model). To obtain the relevant statistics, they gathered real traces and analyzed them to quantify various properties, then constructed various drivers using pseudo-random numbers to generate request streams accordingly.

In general, the more refined attempts at modeling real behavior failed. (Our impression is that they did not necessarily expect to succeed; their earlier empirical work shows a strong disposition toward the use of real workloads.) They found that their most accurate predictor was a simple "mean value" model, which uses only the mean size and lifetime, and generates a request stream with uniformly distributed sizes and lifetimes. (Both vary from zero to twice the mean, uniformly.) Unfortunately, even their best model is not very accurate, exhibiting errors of around 20%. For a small set of allocators, this was sufficient to predict the rank ordering (in terms of fragmentation) in most cases, but with ordering errors when the allocators were within a few percent of each other.

From this Zorn and Grunwald conclude that the only reliable method currently available for studying allocators is trace-driven simulation with real traces. While this result has received too little attention, we believe that this was a watershed experiment, invalidating most of the prior experimental work in memory allocation.

Ironically, Zorn and Grunwald's results show that some of the most simplistic models, embodying clearly false assumptions of uniform size and lifetime distributions, generally produce more accurate results than more "realistic" models. It appears that some earlier results using unsound methods have obtained the right results by sheer luck: the "better" algorithms do in fact tend to work better for real program behavior as well. (Randomization introduces biases that tend to cancel each other out for most policies tested in earlier work.) The errors produced are still large, however, often comparable to the total fragmentation for real programs, once various overheads are accounted for.

(Our own later experiments [WJNB95], described later, show that the random trace methodology can introduce serious and systematic errors for some allocators which are popular in practice but almost entirely absent in the experimental literature. This is ironic as well: earlier experimenters happened to choose a combination of policies and experimental methodology that gave some of the right answers. It is clear from our review of the literature that there was, and still is, no good model that predicts such a happy coincidence.)

Zorn, Grunwald, and Henderson [GZH93] measured the locality effects of several allocators: next fit, the G++ segregated fits allocator by Doug Lea, simple segregated storage using powers of two size classes (the Berkeley 4.2 BSD allocator by Chris Kingsley), and two simplified quick fit schemes (i.e., "Quick Fit" in the sense of [WW88], i.e., without coalescing for small objects).

One of these simplified quick fit allocators (written by Mike Haertel) uses first fit as the general allocator, and allocates small objects in powers-of-two sized blocks. (We are not sure which variant of first fit is used.) As an optimization, it stores information about the memory use within page-sized (4KB) chunks and can reclaim space for entirely empty pages, so that they can be reused for objects of other sizes. It can also use the pagewise information in an attempt to improve the locality of free list searches.

The other simplified quick fit allocator uses the G++ segregated fits system as its general allocator, and uses quick lists for each size, rounded to the nearest word, up to 8 words (32 bytes).

Using Larus' QPT tracing tool [BL92], Zorn et al. traced five C programs combined with their five allocators, and ran the traces through virtual memory and cache simulators.

They found that next fit had by far the worst locality, and attribute this to the roving pointer mechanism: as free list searches cycle through the free list, they may touch widely separated blocks only once per cycle. We suspect that there is more to it than this, however, and that the poor locality is also due to the effects of the free list policy; it may intersperse objects belonging to one phase among objects belonging to others as it roves through memory.

Because of the number of variables (use of quick lists, size ranges of quick lists, type of general allocator, etc.), we find the other results of this study difficult to summarize. It appears that the use of coarse size ranges degrades locality, as does excessive per-object overhead due to boundary tags. (The version of Lea's allocator they used had one-word footers as well as one-word headers; we have since removed the footers.) FIFO-managed segregated lists promote rapid reuse of memory, improving locality at the small granularities relevant to cache memories. Effects on larger-scale locality are less clear.

Barrett and Zorn [BZ93] present a very interesting scheme for avoiding fragmentation by heuristically segregating short-lived objects from other objects. Their "lifetime prediction" allocator uses offline profile information from "training" runs on sample data to predict which call sites will allocate short-lived objects. During normal (non-training) runs, the allocator examines the procedure call stack to distinguish between different patterns of procedure calls that result in allocations. Based on profile information, it predicts whether the lifetimes of objects created by that call pattern can be reliably predicted to be short. (This is essentially a refinement of a similar scheme used by Demers et al. for lifetime prediction in a garbage collector; that scheme [DWH+90] uses only the size and stack pointer, however, not the call chain.)

For five test applications, Barrett and Zorn found that examining the stack to a depth of four calls generally worked quite well, enabling discrimination between qualitatively different patterns that result in allocations from the same allocator call site.

Their predictor was able to correctly predict that 18% to 99% of all allocated bytes would be short-lived. (For other allocations, no prediction is made; the distinction is between "known short-lived" and "don't know.") While we are not sure whether this is the best way of exploiting regularities in real workloads,[132] it certainly shows that exploitable regularities exist, and that program behavior is not random in the manner assumed (implicitly or explicitly) by earlier researchers. (Barrett and Zorn found that using only the requested size was less predictive, but still provided useful information.)

[132] As noted in Section 2.4, we suspect that death time discrimination is easier than lifetime prediction.
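The core discrimination step can be sketched as follows (our hypothetical reconstruction of the idea, not Barrett and Zorn's code; return_address(), the table, and the arena routines are all assumptions): hash the requested size together with the top few return addresses, and look the result up in a profile-derived table of known-short-lived allocation contexts.

    /* Sketch of lifetime prediction keyed on the call chain.
       Real implementations walk frame pointers or use compiler
       support to obtain the call chain. */

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define DEPTH 4                  /* stack depth that worked well */

    extern uintptr_t return_address(int level);   /* assumed primitive */
    extern bool table_lookup(uint64_t key);       /* from training runs */
    extern void *short_lived_arena_alloc(size_t);
    extern void *general_alloc(size_t);

    static uint64_t context_key(size_t size) {
        uint64_t h = (uint64_t)size;
        for (int i = 0; i < DEPTH; i++)           /* mix in call chain */
            h = h * 1000003u ^ return_address(i);
        return h;
    }

    void *predicting_alloc(size_t size) {
        if (table_lookup(context_key(size)))   /* "known short-lived" */
            return short_lived_arena_alloc(size);
        return general_alloc(size);            /* "don't know"        */
    }

Segregating the predicted-short-lived objects into their own arena is what avoids fragmentation: their deaths cluster together, so their memory is reclaimed in large contiguous runs.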
Zorn and Grunwald [GZ93] have investigated the tailoring of allocators to particular programs, primarily to improve speed without undue space cost. One important technique is the use of inlining (incorporating the usual-case allocator code at the point of call, rather than requiring an out-of-line call to a subroutine). The judicious use of inlining, quick lists for the important size classes, and a general coalescing backing allocator appears to be able to provide excellent speed with reasonable memory costs.

Another useful empirical result is that when programs are run on different data sets, they typically allocate the same sizes in roughly similar proportions: the most important size classes in one run are likely to be the most important size classes in another, allowing offline tailoring of the algorithm using profile data.
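The inlining technique amounts to exposing only the usual case at the call site, in the spirit of the following sketch (ours; the names are placeholders):

    /* Sketch of an inlined allocation fast path: the common case,
       popping a quick list for a profiled hot size, is compiled
       into the caller; everything else goes out of line to a
       coalescing general allocator. */

    #include <stddef.h>

    typedef struct fnode { struct fnode *next; } fnode;

    extern fnode *hot_list;                 /* quick list for a hot size */
    extern void  *slow_path_alloc(size_t);  /* out-of-line general case  */

    static inline void *fast_alloc(size_t size, size_t hot_size) {
        if (size == hot_size && hot_list != NULL) {
            fnode *n = hot_list;            /* a few instructions inline */
            hot_list = n->next;
            return n;
        }
        return slow_path_alloc(size);       /* rare: call out of line */
    }

Because profile data identifies the hot sizes offline, the fast path can be specialized per program while the backing allocator keeps overall memory costs reasonable.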
Vo. In a forthcoming article, Vo reports on the design of a new allocator framework and empirical results comparing several allocators using real traces [Vo95]. (Because this is work in progress, we will not report the empirical results in detail.)

Vo's vmalloc() allocator is conceptually similar to Ross' zone system, allowing different "regions" of memory to be managed by different policies.133 (Regions are subsets of the overall heap memory, and are not contiguous in general; to a first approximation, they are sets of pages.) A specific allocator can be chosen at link time by setting appropriate UNIX environment variables. This supports experimentation with different allocators to tune memory management to specific applications, or to different parts of the same application, which may allocate in zones that are managed differently. Various debugging facilities are also provided.

133 See also Delacour's [Del92] and Attardi's [AF94] sophisticated systems for low-level storage management in (mostly) garbage-collected systems using mixed languages and implementation strategies.

The default allocator provided by Vo's system is a deferred coalescing scheme using best fit for the general allocator. (The size ordering of blocks is maintained using a splay tree.) In comparisons with several other allocators, this allocator is shown to be consistently among the fastest and among the most space efficient, for several varied test applications.

Wilson, Johnstone, Neely, and Boles. In a forthcoming report [WJNB95], we will present results of a variety of memory allocation experiments using real traces from eight varied C and C++ programs, and more than twenty variants of six general allocator types (first fit, best fit, next fit, buddy systems, and simple segregated storage). We will briefly describe some of the major results of that study here.

To test the usual experimental assumptions, we used both real and synthetic traces, and tried to make the synthetic traces as realistic as possible in terms of size and lifetime distributions. We then compared results of simulations using real traces with those from randomly-ordered traces. (To generate the random traces, we simply "shuffled" the real traces, preserving the size and lifetime distributions much more accurately than most synthetic trace generation schemes do.) We found that there was a significant correlation between the results from real traces and those from shuffled traces, but there were major and systematic errors as well. In an initial test of eight varied allocators, the correlations accounted for only about a third of the observed variation in performance. This shows that the random ordering of synthetic traces discards the majority of the information relevant to estimating real fragmentation. Results from most pre-1992 experiments are therefore highly questionable.

Using real traces, we measured fragmentation for our eight programs using our large set of allocators. We will report results for the twelve we consider most interesting here; for more complete and detailed information, see the forthcoming report [WJNB95]. These allocators are best fit (using FIFO-ordered free lists134), first fit (using LIFO-ordered, FIFO-ordered, and address-ordered free lists), next fit (also using LIFO, FIFO, and address order), Lea's segregated fits allocator, binary and double buddy systems, simple segregated storage using powers-of-two size classes, and simple segregated storage using twice as many size classes (powers of two, and three times powers of two, as in the weighted buddy system).

134 No significant differences were found between results for variations of best fit using different free list orders. This is not too surprising, given that the best fit policy severely restricts the choice of free blocks.

We attempted to control as many implementation-specific costs as possible. In all cases, objects were aligned on double-word (eight-byte) boundaries, and the minimum block size was four words. Fragmentation costs will be reported as a percentage increase, relative to the baseline of the number of actual bytes of memory devoted to program objects at the point of maximum memory usage. All allocators had one-word headers, except for the simple segregated storage allocators, which had no headers.135 (As explained earlier, we believe that in most systems, these will be the usual header sizes for well-implemented allocators of these types.)

135 Rather than varying the actual implementations' header and footer schemes, we simulated different header sizes by compensating at allocation time and in our measurements. The sequential fits, segregated fits, and simple segregated storage allocators actually use two-word headers or one-word headers and one-word footers, but we reduced the request sizes by one word at allocation time to "recover" one of those words by counting it as available to hold a word of an object.
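To make the fragmentation measure described above concrete, the following sketch shows the bookkeeping a trace-driven simulator might do. It is our illustration only, not the measurement code from the study, and the exact accounting in [WJNB95] may differ in detail: the simulator tracks the bytes the program has requested and the heap bytes the allocator consumes, and records both at the point of peak usage.

    /* Illustrative sketch of the fragmentation metric (assumed
       accounting; not the actual code used in [WJNB95]). */
    #include <stddef.h>

    static size_t live_bytes;    /* bytes currently devoted to program objects */
    static size_t heap_bytes;    /* heap bytes consumed, incl. headers, rounding */
    static size_t peak_heap;     /* largest heap footprint seen so far */
    static size_t live_at_peak;  /* live bytes at that same point */

    /* Call after simulating each allocation or deallocation event. */
    static void record_event(void)
    {
        if (heap_bytes > peak_heap) {
            peak_heap = heap_bytes;
            live_at_peak = live_bytes;
        }
    }

    /* Fragmentation as a percentage increase over the live-byte baseline. */
    static double fragmentation_percent(void)
    {
        return 100.0 * (double)(peak_heap - live_at_peak)
                     / (double)live_at_peak;
    }

By this measure, an allocator that needs 1.22 megabytes of heap memory when the program's live objects occupy 1.00 megabyte at the peak shows 22% fragmentation.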
We will summarize fragmentation costs for twelve allocators, in increasing order of space cost. We note that some of these numbers may change slightly before [WJNB95] appears, due to minor changes in our experiments. The numbers for next fit are also somewhat suspect; we are currently trying to determine whether they are affected by a failure to respect Korn and Vo's wilderness preservation heuristic.136

136 Most of the allocators appear fairly insensitive to this issue, and the others (our first fit and best fit) were designed to respect it by putting the end block at the far end of the free list from the search pointer.

It should also be noted that our experimental methodology could introduce errors on the order of a percent or two. Worse, we found that the variance for some of these allocators was quite high, especially for some of the poorer algorithms. (We are also concerned that any sample of eight programs cannot be considered representative of all real programs, though we have done our best [WJNB95].) The rank ordering here should thus be considered very approximate, especially within clusters.

To our great surprise, we found that best fit, address-ordered first fit, and FIFO-ordered first fit all performed extremely well, and nearly identically well. All three of these allocators had only about 22% fragmentation, including losses due to header costs, rounding up for doubleword alignment, and rounding small block sizes up to four words.

They were followed by a cluster containing address-ordered next fit, segregated fits, and FIFO-ordered next fit, at 28%, 31%, and 32%. Then came a cluster consisting of LIFO-ordered first fit, double buddy, and LIFO-ordered next fit, at 54%, 56%, and 59%. These were followed by a cluster consisting of simple segregated storage using closely-spaced size classes (73%) and binary buddy (74%). Simple segregated storage using powers-of-two sizes came last, at 85%.

For first fit and next fit, we note that the LIFO free list order performed far worse than the FIFO free list order or the address order. For many programmers (including us), LIFO ordering seems most natural; all other things being equal, it would also appear to be advantageous in terms of locality. Its fragmentation effects are severe, however, typically increasing fragmentation by a factor of two or three relative to either address order or FIFO order. We are not sure why this is; the main characteristic the latter two seem to have in common is deferred reuse. It may be that a deferred reuse strategy is more important than the details of the actual policy. If so, that suggests that a wide variety of policies may have excellent memory usage. This is encouraging, because it suggests that some of those policies may be amenable to very efficient and scalable implementations.
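The policy difference between these free list orders is small enough to state in a few lines of code. The sketch below is ours, purely for illustration (the study's allocators are more elaborate); it shows the two ways a freed block can be returned to a singly-linked free list that is always searched from the head. LIFO order makes a freed block the first candidate for reuse, while FIFO order defers its reuse until the rest of the list has been passed over.

    /* Illustrative sketch only: two free-list disciplines for
       sequential fits. The list is searched from 'head', so LIFO
       reuses freed blocks immediately and FIFO defers their reuse. */
    #include <stddef.h>

    typedef struct block { struct block *next; size_t size; } block;

    static block *head;   /* searches start here */
    static block *tail;   /* needed only for FIFO appends */

    static void free_lifo(block *b)      /* push at head: immediate reuse */
    {
        b->next = head;
        head = b;
        if (tail == NULL) tail = b;
    }

    static void free_fifo(block *b)      /* append at tail: deferred reuse */
    {
        b->next = NULL;
        if (tail != NULL) tail->next = b;
        else head = b;
        tail = b;
    }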
Double buddy worked as it was designed to: if we assume that it reduced internal fragmentation by the expected (approximate) 14%, it seems that the dual buddy scheme did not introduce significant external fragmentation, relative to binary buddies, as Fibonacci and weighted schemes are believed to do. Still, its performance was far worse than that of the best allocators.

In simulations of two of the best allocators (address-ordered first fit and best fit), eliminating all header overhead reduced their memory waste to about 14%. We suspect that using one-word alignment and a smaller minimum object size could reduce this by several percent more. This suggests that the "real" fragmentation produced by these policies, as opposed to waste caused by the implementation mechanisms we used, may be less than 10%. (This is comparable to the loss we expect just from the double-word alignment and minimum block sizes.)

While the rankings of best fit and address-ordered first fit are similar to results obtained by random-trace methods, we found them quite surprising, due to the evident methodological problems of random-trace studies. We know of no good model to explain them.137

137 We have several just-so stories that could explain them, of course, but we haven't yet convinced ourselves that any of them are true.

While the three excellent allocators fared well with both real and randomized traces, other allocators fared differently in the two sets of simulations. The segregated storage schemes did unrealistically well, relative to other allocators, when traces were randomized.

The results for randomized traces show clearly that size and lifetime distributions are not sufficient to predict allocator performance for real workloads. The ordering information interacts with the allocator's policies in ways that are often more important than the distributions alone. Some of these results were not unexpected, given our understanding of the methodology. For example, the unrealistically good performance of the simple segregated storage schemes relative to the others was expected, because of the smoothing effect of random walks: synthetic traces tend not to introduce large amounts of external fragmentation, which is the Achilles' heel of non-splitting, non-coalescing policies.

Like Zorn and Grunwald, we will make the test programs we used available for others to use for replication of our results and for other experiments.138

138 Our anonymous FTP repository is on ftp.cs.utexas.edu in the directory pub/garbage. This repository also contains the BibTeX bibliography file used for this paper and [Wil95], several papers on persistence and memory hierarchies, and numerous papers on garbage collection by ourselves and others.

5 Summary and Conclusions

"[People refused to believe that the world was round] because it looked like the world was flat."
"What would it have looked like if it had looked like the world was round?"
--attributed to Ludwig Wittgenstein

There is a very large space of possible allocator policies, and a large space of mechanisms that can support them. Only small parts of these spaces have been explored to date, and the empirical and analytical techniques used have usually produced results of dubious validity.

There has been a widespread failure to recognize anomalous data as undermining the dominant paradigm, and to push basic causal reasoning through: to recognize what data could be relevant, and what other theories might be consistent with the observed facts. We find this curious, and suspect it has two main causes.

One cause is simply the (short) history of the field, and expectations that computer science issues would be easily formalized, after many striking early successes. (Ullman [Ull95] eloquently describes this phenomenon.)

Another is doubtless the same kind of paradigm entrenchment that occurs in other, more mature sciences [Kuh70]. Once the received view has been used as a theoretical underpinning of enough seemingly successful experiments, and reiterated in textbooks without the caveats buried in the original research papers, it is very hard for people to see the alternatives.

The history of memory allocation research may serve as a cautionary tale for empirical computer science. Hartmanis has observed that computer science seems less prone to paradigm shifts than most fields [Har95]. We agree in part with this sentiment, but the successes of computer science can lead to a false sense of confidence. Computer scientists often have less to worry about in terms of the validity of "known" results, relative to other scientists, but in fact they often worry less about it, which can be a problem, too.

5.1 Models and Theories

There has been a considerable amount of theoretical work done in the area of memory allocation, if we use "theory" in the parlance of computer science, to mean a particular subdiscipline using particular kinds of logical and mathematical analyses. There has been very little theoretical work done, however, if we use the vernacular and central sense of "theory," i.e., what everyday working scientists do.

We simply have no theory of program behavior, much less a theory of how allocators exploit that behavior. (Batson made similar comments in 1976, in a slightly different context [Bat76], but after nearly two decades the situation is much the same.)

Aside from several useful studies of worst-case performance, most of the analytical work to date seems to be based on several assumptions that turn out to be incorrect, and the results cannot be expected to apply directly to the real problems of memory allocation. Like much work in mathematics, however, theoretical results may yet prove to be enlightening. To make sense of these results and apply them properly will require considerable thought, and the development of a theory in the vernacular sense.

For example, the striking similarities in performance between best fit and address-ordered first fit for randomized workloads should be explained. How is it that such different policies are so comparable, for an essentially unpredictable sequence of requests? More importantly, how does this relate to real request sequences? The known dependencies of these algorithms on lifetime distributions should also be explained more clearly. Randomization of input order may eliminate certain important variables, and allow others to be explored more or less in isolation. On the other hand, interactions with real programs may be so systematically different that these phenomena have nothing important in common; for example, dependence on size distributions may be an effect that has little importance in the face of systematic interactions between placement policy and phase behavior.
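To see how different these two policies are on their face, consider the following sketch (ours, for illustration only; real implementations use indexed data structures rather than linear scans). Both select a free block of at least the requested size, but by entirely different criteria:

    /* Illustrative sketch of the two placement policies discussed above.
       Linear scans are used here only for clarity. */
    #include <stddef.h>

    typedef struct block { struct block *next; size_t size; } block;

    /* Best fit: the smallest free block that is large enough. */
    static block *best_fit(block *list, size_t n)
    {
        block *choice = NULL;
        for (block *b = list; b != NULL; b = b->next)
            if (b->size >= n && (choice == NULL || b->size < choice->size))
                choice = b;
        return choice;
    }

    /* Address-ordered first fit: the lowest-addressed block that is
       large enough (the first fit on a list kept sorted by address). */
    static block *addr_first_fit(block *list, size_t n)
    {
        block *choice = NULL;
        for (block *b = list; b != NULL; b = b->next)
            if (b->size >= n && (choice == NULL || b < choice))
                choice = b;
        return choice;
    }

That two such unrelated selection rules should yield nearly identical fragmentation, on both real and randomized traces, is exactly the kind of regularity a theory in the vernacular sense ought to explain.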
Understanding real program behavior still remains the most important first step in formulating a theory of memory management. Without doing that, we cannot hope to develop the science of memory management; we can only fumble around doing ad hoc engineering, in the too-often-used pejorative sense of the word. At this point, the needs of good science and of good engineering in this area are the same: a deeper qualitative understanding. We must try to discern what is relevant and characterize it; this is necessary before formal techniques can be applied usefully.

5.2 Strategies and Policies

Most policies used by current allocators are derived fairly straightforwardly from ideas that date from the 1960's, at least. Best fit and address-ordered first fit policies seem to work well in practice, but after several decades the reasons why are not much clearer than they were then. It is not clear which regularities in real request streams they exploit. (It is not even very clear how they exploit regularities in synthetic request streams, where the regularities are minimal and presumably much easier to characterize.) Because our current understanding of these issues is so weak, we will indulge in some speculation.

Given that there is no reason to think that these early policies were so well thought out that nothing could compete with them, it is worthwhile to wonder whether there is a large space of possible policies that work at least as well as these two. Recent results for FIFO-ordered sequential fits may suggest that close fits and address ordering are not crucial for good performance.

It may well be that the better allocators perform well because it's very easy to perform well. Program behavior may be so patterned and redundant (in certain relevant ways) that the important regularities in request streams are trivial to exploit. The known good policies may only be correlated to some more fundamental strategy, or combination of strategies, yet to be discovered.

Given the real and striking regularities in request streams due to common programming techniques, it seems likely that better algorithms could be designed if we only had a good model of program behavior, and a good understanding of how that interacts with allocation policies. Clustered deaths due to phase behavior, for example, suggest that contiguous allocation of consecutively-allocated blocks may tend to keep fragmentation low. (It probably has beneficial effects on locality as well.)

Segregation of different kinds of objects may avoid fragmentation due to differing death times of objects used for different purposes. (Again, this may increase locality as well, by keeping related objects clustered after more ephemeral objects have been deallocated.)

On the other hand, it is possible that the regularities exploited by good existing allocators are so strong and simple that we cannot improve memory usage by much; it's possible that all of our best current algorithms exploit them to the fullest, however accidentally. The other patterns in program behavior may be so subtle, or interact in such complex ways, that no strategy can do much better. Or it may turn out that once the regularities are understood, the task of exploiting them online is just too expensive. (That doesn't seem likely to us, though some intermediate situation seems plausible.)

If all else fails, relying on best fit and first fit usually won't be a disaster, as long as the mechanisms used are scalable. (If one of them doesn't work well for your program, it's likely that the other will, or that some other simple policy will suffice.)

On the other hand, it is not clear that our best policies are robust enough to count on; so far, only a few experiments have been performed to assess the interactions between real program behavior and allocator policies. It is entirely possible that there is a non-negligible percentage of programs for which our "best" algorithms will fail miserably.

5.3 Mechanisms

Many current allocator policies are partly artifacts of primitive implementation techniques: they are mostly based on obvious ways of managing linear lists. Modern data structure techniques allow us to build much more sophisticated indexing schemes, either to improve performance or to support better-designed policies.

Segregated fits and (other) indexing schemes can be used to implement policies known to work well in practice, and many others. More sophisticated indexing schemes will probably allow us to exploit whatever exploitable regularities we are clever enough to characterize, in a scalable way.

Deferred coalescing allows optimization of common patterns of short-term memory use, so that scalable mechanisms don't incur high overheads in practice. The techniques for deferred coalescing must be studied carefully, however, to ensure that this mechanism doesn't degrade memory usage unacceptably by changing placement policies and undermining strategies.
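As a concrete illustration of the mechanism (ours, under assumed details; not a recommendation of any particular policy), a deferred-coalescing front end can keep per-size "quick lists" of recently freed small blocks. Frees push onto the matching list and small allocations pop from it; only on a miss, or when the lists are flushed, does the general coalescing allocator get involved:

    /* Illustrative sketch of a deferred-coalescing front end.
       The flushing policy, which is the delicate part discussed
       above, is left out entirely. Sizes are in words. */
    #include <stddef.h>

    #define MAX_QUICK 32                    /* largest quick-listed size */

    typedef struct qnode { struct qnode *next; } qnode;
    static qnode *quick[MAX_QUICK + 1];     /* one LIFO list per size */

    /* The general allocator, with splitting and coalescing (assumed given). */
    extern void *general_alloc(size_t words);
    extern void general_free(void *p, size_t words);

    void *quick_alloc(size_t words)
    {
        if (words <= MAX_QUICK && quick[words] != NULL) {
            qnode *n = quick[words];        /* hit: reuse without coalescing */
            quick[words] = n->next;
            return n;
        }
        return general_alloc(words);        /* miss: fall back */
    }

    void quick_free(void *p, size_t words)
    {
        if (words <= MAX_QUICK) {           /* small block: defer coalescing */
            qnode *n = (qnode *)p;
            n->next = quick[words];
            quick[words] = n;
        } else {
            general_free(p, words);         /* large block: coalesce now */
        }
    }

A real implementation must also decide when to flush the quick lists back to the general allocator; as noted above, that choice changes the effective placement policy and must be evaluated, not assumed harmless.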
5.4 Experiments

New experimental methods must be developed for the testing of new theories. Trace-driven simulations of real program/allocator pairs will be quite important, of course; they are an indispensable reality check. These trace-driven simulations should include locality studies as well as conventional space and time measurements. Sound work of both sorts has barely begun; there is a lot to do.

If we are to proceed scientifically, however, just running experiments with a grab-bag of new allocators may be doing things backwards. Program behavior should be studied in (relative) isolation, to identify the fundamental regularities that are relevant to various allocators and memory hierarchies. After that, it should be easier to design strategies and policies intelligently.

5.5 Data

Clearly, in order to formulate useful theories of memory management, more data are required. The current set of programs used for experimentation is not large enough or varied enough to be representative. Some kinds of programs that are not represented are:

- scientific computing programs (especially those using sophisticated sparse matrix representations),
- long-running system programs such as operating system kernels, name servers, file servers, and graphics display servers,
- business data analysis programs such as spreadsheets, report generators, and so on,
- graphical programs such as desktop publishing systems, CAD interaction servers, and interactive 3-D systems (e.g., virtual reality),
- interactive programming environments with source code management systems and interactive debugging facilities,
- heavily object-oriented programs using sophisticated kits and frameworks composed in a variety of ways,
- automatically-generated programs of a variety of types, created using specialized code-generation systems or compilers for very-high-level languages.

This partial list is just a beginning; there are many kinds of programs, written in a variety of styles, and test application suites should include as many of them as possible.

There are some difficulties in obtaining and using such programs that can't be overlooked. The first is that the most easily obtainable programs are often not the most representative: freely available code is often of a few types, such as script language interpreters, which do not represent the bulk of actual computer use, particularly memory use.

Those programs that are available are often difficult to analyze, for various reasons. Many use hand-optimized memory allocators, which must be removed to reveal the "true" memory usage, and this "true" memory usage itself may be skewed by the awkward programming styles used to avoid general heap allocation.

5.6 Challenges and Opportunities

"Computer Science and Engineering is a field that attracts a different kind of thinker ... Such people are especially good at dealing with situations where different rules apply in different cases; they are individuals who can rapidly change levels of abstraction, simultaneously seeing things 'in the large' and 'in the small.'"
--Donald Knuth, quoted in [Har95]

Memory management is a fundamental area of computer science, spanning several very different levels of abstraction: from the programmer's strategies for dealing with data, through language-level features for expressing those concepts and language implementations for managing actual storage, to the varied hardware memories that real machines contain. Memory management is where the rubber meets the road; if we do the wrong thing at any level, the results will not be good. And if we don't make the levels work well together, we are in serious trouble. In many areas of computer science, problems can be decomposed into levels of abstraction, and different problems addressed at each level, in nearly complete isolation. Memory management requires this kind of thinking, but that is not enough; it also requires the ability to reason about phenomena that span multiple levels. This is not easy.

Unfortunately, the compartmentalization of computing disciplines has discouraged the development of a coherent memory management community. Memory management tends to be an orphan, sometimes
harbored by the programming language community, sometimes by the operating systems community, and usually ignored by the architecture community.

It seems obvious that memory management policies can have a profound impact on locality of reference, and therefore on the overall performance of modern computers, but in the architecture community locality of reference is generally treated as a mysterious, incomprehensible substance. (Or maybe two or three substances, all fairly mysterious.) A program is pretty much a black box, however abraded and splintered, and locality comes out of the box if you're lucky. It is not generally recognized that different memory management policies can have an effect on memory hierarchies that is sometimes as significant as differences in programs' intrinsic behavior. Recent work in garbage collection shows this to be true ([WLM92, Wil95, GA95]), but few architects are aware of it, or aware that similar phenomena must occur (to at least some degree) in conventionally-managed memories as well [GZH93].

The challenge is to develop a theory that can span all of these levels. Such a theory will not come all at once, and we think it is unlikely to be primarily mathematical, at least not for a long time, because of the complex and ill-defined interactions between different phenomena at different levels of abstraction. Computer science has historically been biased toward the paradigms of mathematics and physics (and often a rather naive view of the scientific process in those fields) rather than the "softer" natural sciences. We recommend a more naturalistic approach [Den95], which we believe is more appropriate for complex multilevel systems that are only partly hierarchically decomposable.

The fact that we study mostly deterministic processes in formally-describable machines is sometimes irrelevant and misleading. The degrees of complexity and uncertainty involved in building real systems require that we examine real data, theorize carefully, and keep our eyes open.

Computer science is often a very "hard" science, which develops along the lines of the great developments in the physical sciences and mathematics in the seventeenth, eighteenth, and nineteenth centuries. It owes a great deal to the examples set by Newton and Descartes. But the nineteenth century also saw a very great theory that was tremendously important without being formalized at all: a theory that to this day can only be usefully formalized in special, restricted cases, but which is arguably the single most important scientific theory ever. Perhaps we should look to Darwin as an exemplar, too.

Acknowledgements

We would like to thank Hans Boehm and especially Henry Baker for many enlightening discussions of memory management over the last few years, and for comments on earlier versions of this paper.

Thanks to Ivor Page, for comments that seem to connect important pieces of the puzzle more concretely than we expected, and to Ben Zorn, Dirk Grunwald, and Dave Detlefs for making their test applications available.

Thanks also to Dave Barrett, Sheetal Kakkad, Doug Lea, Doug McIlroy, and Phong Vo for comments that have improved our understanding and presentation, and to Henry Baker and Janet Swisher for their help and extraordinary patience during the paper's preparation. (Of course, we bear sole responsibility for any opinions and errors.)

References

[Abr67] John Abramowich. Storage allocation in a certain iterative process. Communications of the ACM, 10(6):368-370, June 1967.
[AF94] G. Attardi and T. Flagella. A customizable memory management framework. In Proceedings of the USENIX C++ Conference, Cambridge, Massachusetts, 1994.
[AS95] Sedat Akyurek and Kenneth Salem. Adaptive block rearrangement. ACM Transactions on Computer Systems, 13(2):95-121, May 1995.
[Bae73] H. D. Baecker. Aspects of reference locality in list structures in virtual memory. Software Practice and Experience, 3(3):245-254, 1973.
[Bak93] Henry G. Baker. Infant mortality and generational garbage collection. SIGPLAN Notices, 28(4):55-57, April 1993.
[BAO85] B. M. Bigler, S. J. Allan, and R. R. Oldehoeft. Parallel dynamic storage allocation. In 1985 International Conference on Parallel Processing, pages 272-275, 1985.
[Bat76] Alan Batson. Program behavior at the symbolic level. IEEE Computer, pages 21-26, November 1976.
[Bay77] C. Bays. A comparison of next-fit, first-fit and best-fit. Communications of the ACM, 20(3):191-192, March 1977.
[BB77] A. P. Batson and R. E. Brundage. Segment sizes and lifetimes in ALGOL 60 programs. Communications of the ACM, 20(1):36-44, January 1977.
[BBDT84] G. Bozman, W. Buco, T. P. Daly, and W. H. Tetzlaff. Analysis of free storage algorithms--revisited. IBM Systems Journal, 23(1):44-64, 1984.
[BC79] Daniel G. Bobrow and Douglas W. Clark. Compact encodings of list structure. ACM Transactions on Programming Languages and Systems, 1(2):266-286, October 1979.
[BC92] Yves Bekkers and Jacques Cohen, editors. International Workshop on Memory Management, number 637 in Lecture Notes in Computer Science, St. Malo, France, September 1992. Springer-Verlag.
[BCW85] B. S. Baker, E. G. Coffman, Jr., and D. E. Willard. Algorithms for resolving conflicts in dynamic storage allocation. Journal of the ACM, 32(2):327-343, April 1985.
[BDS91] Hans-J. Boehm, Alan J. Demers, and Scott Shenker. Mostly parallel garbage collection. In Proceedings of the 1991 SIGPLAN Conference on Programming Language Design and Implementation [PLD91], pages 157-164.
[Bec82] Leland L. Beck. A dynamic storage allocation technique based on memory residence time. Communications of the ACM, 25(10):714-724, October 1982.
[Ben81] V. E. Benes. Models and problems of dynamic storage allocation. In Applied Probability and Computer Science--the Interface. Institute of Management Science and Operations Research Society of America, January 1981.
[Bet73] Terry Betteridge. An analytical storage allocation model. Acta Informatica, 3:101-122, 1973.
[Bet82] Terry Betteridge. An Algebraic Analysis of Storage Fragmentation. UMI Research Press, Ann Arbor, Michigan, 1982.
[BJW70] A. P. Batson, S. M. Ju, and D. C. Wood. Measurements of segment size. Communications of the ACM, 13(3):155-159, March 1970.
[BL92] Ball and Larus. Optimal profiling and tracing of programs. In Conference Record of the Nineteenth Annual ACM Symposium on Principles of Programming Languages, pages 59-70. ACM Press, January 1992.
[Boz84] Gerald Bozman. The software lookaside buffer reduces search overhead with linked lists. Communications of the ACM, 27(3):222-227, March 1984.
[BR64] Daniel G. Bobrow and Bertram Raphael. A comparison of list-processing computer languages. Communications of the ACM, 7(4):231-240, April 1964.
[Bre89] R. Brent. Efficient implementation of the first-fit strategy for dynamic storage allocation. ACM Transactions on Programming Languages and Systems, July 1989.
[Bro80] A. G. Bromley. Memory fragmentation in buddy methods for dynamic storage allocation. Acta Informatica, 14(2):107-117, August 1980.
[Bur76] Warren Burton. A buddy system variation for disk storage allocation. Communications of the ACM, 19(7):416-417, July 1976.
[BW88] Hans-Juergen Boehm and Mark Weiser. Garbage collection in an uncooperative environment. Software Practice and Experience, 18(9):807-820, September 1988.
[BZ93] David A. Barrett and Benjamin G. Zorn. Using lifetime predictors to improve memory allocation performance. In Proceedings of the 1993 SIGPLAN Conference on Programming Language Design and Implementation [PLD93], pages 187-196.
[BZ95] David A. Barrett and Benjamin G. Zorn. Garbage collection using a dynamic threatening boundary. In Proceedings of the 1995 SIGPLAN Conference on Programming Language Design and Implementation, pages 301-314, La Jolla, California, June 1995. ACM Press.
[Cam71] J. A. Campbell. A note on an optimal-fit method for dynamic allocation of storage. Computer Journal, 14(1):7-9, February 1971.
[CG91] Vincent Cate and Thomas Gross. Combining the concepts of compression and caching for a two-level file system. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV), pages 200-209, Santa Clara, California, April 1991.
[CK93] Robert Cmelik and David Keppel. Shade: A fast instruction-set simulator for execution profiling. Technical Report UWCSE 93-06-06, Dept. of Computer Science and Engineering, University of Washington, Seattle, Washington, 1993.
[CKS85] E. G. Coffman, Jr., T. T. Kadota, and L. A. Shepp. On the asymptotic optimality of first-fit storage allocation. IEEE Transactions on Software Engineering, SE-11(2):235-239, February 1985.
[CL89] E. G. Coffman, Jr. and F. T. Leighton. A provably efficient algorithm for dynamic storage allocation. Journal of Computer and System Sciences, 38(1):2-35, February 1989.
[Col61] G. O. Collins. Experience in automatic storage allocation. Communications of the ACM, 4(10):436-440, October 1961.
[Com64] W. T. Comfort. Multiword list items. Communications of the ACM, 7(6), June 1964.
[CT75] B. Cranston and R. Thomas. A simplified recombination scheme for the Fibonacci buddy system. Communications of the ACM, 18(6):331-332, July 1975.
[Dar59] Charles Darwin. The Origin of Species. 1859.
[DDZ93] David Detlefs, Al Dosser, and Benjamin Zorn. Memory allocation costs in large C and C++ programs. Technical Report CU-CS-665-93, University of Colorado at Boulder, Dept. of Computer Science, Boulder, Colorado, August 1993.
[DEB94] R. Kent Dybvig, David Eby, and Carl Bruggeman. Don't stop the BIBOP: Flexible and efficient storage management for dynamically typed languages. Technical Report 400, Indiana University Computer Science Dept., March 1994.
[Del92] V. Delacour. Allocation regions and implementation contracts. In Bekkers and Cohen [BC92], pages 426-439.
[Den70] Peter J. Denning. Virtual memory. Computing Surveys, 3(2):153-189, September 1970.
[Den95] Daniel Dennett. Darwin's Dangerous Idea. 1995.
[Det92] David L. Detlefs. Garbage collection and run-time typing as a C++ library. In USENIX C++ Conference, Portland, Oregon, August 1992. USENIX Association.
[Dij69] Edsger W. Dijkstra. Notes on structured programming. In Structured Programming. Academic Press, 1969.
[Dou93] Fred Douglis. The compression cache: Using on-line compression to extend physical memory. In Proceedings of 1993 Winter USENIX Conference, pages 519-529, San Diego, California, January 1993.
[DTM93] Amer Diwan, David Tarditi, and Eliot Moss. Memory subsystem performance of programs with intensive heap allocation. Submitted for publication, August 1993.
[DWH+90] Alan Demers, Mark Weiser, Barry Hayes, Daniel Bobrow, and Scott Shenker. Combining generational and conservative garbage collection: Framework and implementations. In Conference Record of the Seventeenth Annual ACM Symposium on Principles of Programming Languages, pages 261-269, San Francisco, California, January 1990. ACM Press.
[EO88] C. S. Ellis and T. J. Olson. Algorithms for parallel memory allocation. International Journal of Parallel Programming, 17(4):303-345, 1988.
[Fer76] H. R. P. Ferguson. On a generalization of the Fibonacci numbers useful in memory allocation schema. The Fibonacci Quarterly, 14(3):233-243, October 1976.
[For88] R. Ford. Concurrent algorithms for real-time memory management. IEEE Software, pages 10-23, September 1988.
[FP74] J. S. Fenton and D. W. Payne. Dynamic storage allocations of arbitrary sized segments. In Proc. IFIPS, pages 344-348, 1974.
[FP91] Matthew Farrens and Arvin Park. Dynamic base register caching: A technique for reducing address bus width. In 18th Annual International Symposium on Computer Architecture, pages 128-137, Toronto, Canada, May 1991. ACM Press.
[GA95] Marcelo J. R. Goncalves and Andrew W. Appel. Cache performance of fast-allocating programs. In FPCA '95, 1995.
[Gar94] Laurie Garrett. The Coming Plague: Newly Emerging Diseases in a World out of Balance. Farrar, Straus and Giroux, New York, 1994.
[Gel71] E. Gelenbe. The two-thirds rule for dynamic storage allocation under equilibrium. Information Processing Letters, 1(2):59-60, July 1971.
[GGU72] M. R. Garey, R. L. Graham, and J. D. Ullman. Worst-case analysis of memory allocation algorithms. In Fourth Annual ACM Symposium on the Theory of Computing, 1972.
[GM85] S. Gai and M. Mezzalama. Dynamic storage allocation: Experiments using the C language. Software Practice and Experience, 15(7):693-704, July 1985.
[Gra] R. L. Graham. Unpublished technical report on worst-case analysis of memory allocation algorithms, Bell Labs.
[GW82] A. Gottlieb and J. Wilson. Parallelizing the usual buddy algorithm. Technical Report System Software Note 37, Courant Institute, New York University, 1982.
[GZ93] Dirk Grunwald and Benjamin Zorn. CustoMalloc: Efficient synthesized memory allocators. Software Practice and Experience, 23(8):851-869, August 1993.
[GZH93] Dirk Grunwald, Benjamin Zorn, and Robert Henderson. Improving the cache locality of memory allocation. In Proceedings of the 1993 SIGPLAN Conference on Programming Language Design and Implementation [PLD93], pages 177-186.
[Han90] David R. Hanson. Fast allocation and deallocation of memory based on object lifetimes. Software Practice and Experience, 20(1), January 1990.
[Har95] Juris Hartmanis. Turing award lecture: On computational complexity and the nature of computer science. Computing Surveys, 27(1):7-16, March 1995.
[Hay91] Barry Hayes. Using key object opportunism to collect old objects. In Andreas Paepcke, editor, Conference on Object Oriented Programming Systems, Languages and Applications (OOPSLA '91), pages 33-46, Phoenix, Arizona, October 1991. ACM Press.
[Hay93] Barry Hayes. Key Objects in Garbage Collection. PhD thesis, Stanford University, March 1993.
[Hin75] J. A. Hinds. An algorithm for locating adjacent storage blocks in the buddy system. Communications of the ACM, 18(4):221-222, April 1975.
[Hir73] D. S. Hirschberg. A class of dynamic memory allocation algorithms. Communications of the ACM, 16(10):615-618, October 1973.
[HS64] V. C. Harris and C. C. Styles. A generalization of the Fibonacci numbers. The Fibonacci Quarterly, 2(4):227-289, December 1964.
[HS89] Mark D. Hill and Alan Jay Smith. Evaluating associativity in CPU caches. IEEE Transactions on Computers, 38(12):1612-1629, December 1989.
[IGK71] S. Isoda, E. Goto, and I. Kimura. An efficient bit table technique for dynamic storage allocation of 2^n-word blocks. Communications of the ACM, 14(9):589-592, September 1971.
[IJ62] J. K. Iliffe and J. G. Jodeit. A dynamic storage allocation scheme. Computer Journal, 5(3):200-209, October 1962.
[Ing61] P. Z. Ingerman. Thunks. Communications of the ACM, 4(1):55-58, January 1961.
[Iye93] Arun K. Iyengar. Parallel dynamic storage allocation algorithms. In Fifth IEEE Symposium on Parallel and Distributed Processing, 1993.
[Joh72] G. D. Johnson. Simscript II.5 User's Manual, S/360-370 Version, Release 6, 1972.
[Joh91] Theodore Johnson. A concurrent fast fits memory manager. Technical Report 91-009, University of Florida, 1991.
[JS92] T. Johnson and D. Sasha. Parallel buddy memory management. Parallel Processing Letters, 2(4):391-398, 1992.
[Kau84] Arie Kaufman. Tailored-list and recombination-delaying buddy systems. ACM Transactions on Programming Languages and Systems, 6(4):118-125, 1984.
[KLS92] Phillip J. Koopman, Jr., Peter Lee, and Daniel P. Siewiorek. Cache performance of combinator graph reduction. ACM Transactions on Programming Languages and Systems, 14(2):265-297, April 1992.
[Kno65] Kenneth C. Knowlton. A fast storage allocator. Communications of the ACM, 8(10):623-625, October 1965.
[Knu73] Donald E. Knuth. The Art of Computer Programming, volume 1: Fundamental Algorithms. Addison-Wesley, Reading, Massachusetts, 1973. First edition published in 1968.
[Kri72] Saul A. Kripke. Naming and Necessity. Harvard University Press, 1972.
[Kro73] S. Krogdahl. A dynamic storage allocation problem. Information Processing Letters, 2:96-99, 1973.
[Kuh70] Thomas S. Kuhn. The Structure of Scientific Revolutions (Second Edition, Enlarged). University of Chicago Press, Chicago, Illinois, 1970.
[KV85] David G. Korn and Kiem-Phong Vo. In search of a better malloc. In Proc. USENIX Summer 1985, pages 489-506, Portland, Oregon, June 1985. USENIX Association.
[LH82] B. W. Leverett and P. G. Hibbard. An adaptive system for dynamic storage allocation. Software Practice and Experience, 12(6):543-556, June 1982.
[LH83] Henry Lieberman and Carl Hewitt. A real-time garbage collector based on the lifetimes of objects. Communications of the ACM, 26(6):419-429, June 1983.
[M+69] J. Minker et al. Analysis of data processing systems. Technical Report 69-99, University of Maryland, College Park, Maryland, 1969.
[Mah61] R. J. Maher. Problems of storage allocation in a multiprocessor multiprogrammed system. Communications of the ACM, 4(10):421-422, October 1961.
[Mar82] David Marr. Vision. Freeman, New York, 1982.
[McC91] Ronald McClamrock. Marr's three levels: a re-evaluation. Minds and Machines, 1:185-196, 1991.
[McC95] Ronald McClamrock. Existential Cognition: Computational Minds in the World. University of Chicago Press, 1995.
[McI82] M. D. McIlroy. The number of states of a dynamic storage allocation system. Computer Journal, 25(3):388-392, August 1982.
[MK88] Marshall Kirk McKusick and Michael J. Karels. Design of a general-purpose memory allocator for the 4.3BSD UNIX kernel. In Proceedings of the Summer 1988 USENIX Conference, San Francisco, California, June 1988. USENIX Association.
[Moo84] David Moon. Garbage collection in a large Lisp system. In Conference Record of the 1984 ACM Symposium on LISP and Functional Programming, pages 235-246, Austin, Texas, August 1984. ACM Press.
[MPS71] B. H. Margolin, R. P. Parmelee, and M. Schatzoff. Analysis of free-storage algorithms. IBM Systems Journal, 10(4):283-304, 1971.
[MS93] Paul E. McKenney and Jack Slingwine. Efficient kernel memory allocation on shared-memory multiprocessors. In USENIX 1993 Winter Technical Conference, San Diego, California, January 1993. USENIX Association.
[Nel91] Mark Nelson. The Data Compression Book. M & T Books, 1991.
[Nie77] N. R. Nielsen. Dynamic memory allocation in computer simulation. Communications of the ACM, 20(11):864-873, November 1977.
[OA85] R. R. Oldehoeft and S. J. Allan. Adaptive exact-fit storage management. Communications of the ACM, 28(5):506-511, May 1985.
[Pag82] Ivor P. Page. Optimal fit of arbitrary sized segments. Computer Journal, 25(1), January 1982.
[Pag84] Ivor P. Page. Analysis of a cyclic placement scheme. Computer Journal, 27(1):18-25, January 1984.
[PH86] Ivor P. Page and Jeff Hagins. Improving the performance of buddy systems. IEEE Transactions on Computers, C-35(5):441-447, May 1986.
[PLD91] Proceedings of the 1991 SIGPLAN Conference on Programming Language Design and Implementation, Toronto, Ontario, June 1991. ACM Press. Published as SIGPLAN Notices 26(6), June 1992.
[PLD93] Proceedings of the 1993 SIGPLAN Conference on Programming Language Design and Implementation, Albuquerque, New Mexico, June 1993. ACM Press.
[PN77] J. L. Peterson and T. A. Norman. Buddy systems. Communications of the ACM, 20(6):421-431, June 1977.
[PS70] P. W. Purdom and S. M. Stigler. Statistical properties of the buddy system. Journal of the ACM, 17(4):683-697, October 1970.
[PSC71] P. W. Purdom, S. M. Stigler, and Tat-Ong Cheam. Statistical investigation of three storage allocation algorithms. BIT, 11:187-195, 1971.
[Put77] Hilary Putnam. Meaning and reference. In Stephen P. Schwartz, editor, Naming, Necessity, and Natural Kinds. Cornell University Press, Ithaca, New York, 1977.
[Qui77] W. V. Quine. Natural kinds. In Stephen P. Schwartz, editor, Naming, Necessity, and Natural Kinds. Cornell University Press, Ithaca, New York, 1977.
[Ran69] Brian Randell. A note on storage fragmentation and program segmentation. Communications of the ACM, 12(7):365-372, July 1969.
[Ree79] C. M. Reeves. Free store distribution under random-fit allocation. Computer Journal, 22(4):346-351, November 1979.
[Ree80] C. M. Reeves. Free store distribution under random-fit allocation: Part 2. Computer Journal, 23(4):298-306, November 1980.
[Ree82] C. M. Reeves. A lumped-state model of clustering in dynamic storage allocation. Computer Journal, 27(2):135-142, 1982.
[Ree83] C. M. Reeves. Free store distribution under random-fit allocation, part 3. Computer Journal, 26(1):25-35, February 1983.
[Rei94] Mark B. Reinhold. Cache performance of garbage-collected programs. In Proceedings of the 1994 SIGPLAN Conference on Programming Language Design and Implementation, pages 206-217, Orlando, Florida, June 1994. ACM Press.
[RO91] Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. In Proceedings of the Thirteenth Symposium on Operating Systems Principles, pages 1-15, Pacific Grove, California, October 1991. ACM Press. Published as Operating Systems Review 25(5).
[Rob71] J. M. Robson. An estimate of the store size necessary for dynamic storage allocation. Journal of the ACM, 18(3):416-423, July 1971.
[Rob74] J. M. Robson. Bounds for some functions concerning dynamic storage allocation. Journal of the ACM, 21(3):491-499, July 1974.
[Rob77] J. M. Robson. Worst case fragmentation of first fit and best fit storage allocation strategies. Computer Journal, 20(3):242-244, August 1977.
[Ros61] D. T. Ross. A generalized technique for symbol manipulation and numerical calculation. Communications of the ACM, 4(3):147-150, March 1961.
[Ros67] D. T. Ross. The AED free storage package. Communications of the ACM, 10(8):481-492, August 1967.
[Rus77] D. L. Russell. Internal fragmentation in a class of buddy systems. SIAM J. Comput., 6(4):607-621, December 1977.
[Sam89] A. Dain Samples. Mache: No-loss trace compaction. In ACM SIGMETRICS, pages 89-97, May 1989.
[Sha88] Robert A. Shaw. Empirical Analysis of a Lisp System. PhD thesis, Stanford University, Palo Alto, California, February 1988. Technical Report CSL-TR-88-351, Stanford University Computer Systems Laboratory.
[Sho75] J. E. Shore. On the external storage fragmentation produced by first-fit and best-fit allocation strategies. Communications of the ACM, 18(8):433-440, August 1975.
[Sho77] J. E. Shore. Anomalous behavior of the fifty-percent rule in dynamic memory allocation. Communications of the ACM, 20(11):558-562, November 1977.
[SKW92] Vivek Singhal, Sheetal V. Kakkad, and Paul R. Wilson. Texas: an efficient, portable persistent store. In Antonio Albano and Ron Morrison, editors, Fifth International Workshop on Persistent Object Systems, pages 11-33, San Miniato, Italy, September 1992. Springer-Verlag.
[SP74] K. K. Shen and J. L. Peterson. A weighted buddy method for dynamic storage allocation. Communications of the ACM, 17(10):558-562, October 1974.
[ST85] Daniel Dominic Sleator and Robert Endre Tarjan. Self-adjusting binary search trees. Journal of the ACM, 32(3), 1985.
[Sta80] Thomas Standish. Data Structure Techniques. Addison-Wesley, Reading, Massachusetts, 1980.
[Ste83] C. J. Stephenson. Fast fits: New methods for dynamic storage allocation. In Proceedings of the Ninth Symposium on Operating Systems Principles, pages 30-32, Bretton Woods, New Hampshire, October 1983. ACM Press. Published as Operating Systems Review 17(5), October 1983.
[Sto82] Harold S. Stone. Parallel memory allocation using the FETCH-AND-ADD instruction. Technical report, IBM Thomas J. Watson Research Center, Yorktown Heights, New York, November 1982.
[Tad78] M. Tadman. Fast-fit: A new hierarchical dynamic storage allocation technique. Master's thesis, UC Irvine, Computer Science Dept., 1978.
[Thi89] Dominique Thiebaut. The fractal dimension of computer programs and its application to the prediction of the cache miss ratio. IEEE Transactions on Computers, pages 1012-1026, July 1989.
[Tot65] R. A. Totschek. An empirical investigation into the behavior of the SDC timesharing system. Technical Report SP2191, Systems Development Corporation, 1965.
[UJ88] David Ungar and Frank Jackson. Tenuring policies for generation-based storage reclamation. In Norman Meyrowitz, editor, Conference on Object Oriented Programming Systems, Languages and Applications (OOPSLA '88) Proceedings, pages 1-17, San Diego, California, September 1988. ACM Press.
[Ull95] Jeffrey D. Ullman. The role of theory today. Computing Surveys, 27(1):43-44, March 1995.
[Ung86] David Ungar. Design and Evaluation of a High-Performance Smalltalk System. MIT Press, Cambridge, Massachusetts, 1986.
[VC90] P. Vongsathorn and S. D. Carson. A system for adaptive disk rearrangement. Software Practice and Experience, 20(3):225-242, March 1990.
[VMH+83] J. Voldman, B. Mandelbrot, L. W. Hoevel, J. Knight, and P. Rosenfeld. Fractal nature of software-cache interaction. IBM Journal of Research and Development, 27(2):164-170, March 1983.
[Vo95] Kiem-Phong Vo. Vmalloc: A general and efficient memory allocator. Software Practice and Experience, 1995. To appear.
[Vui80] Jean Vuillemin. A unifying look at data structures. Communications of the ACM, 29(4):229-239, April 1980.
[Wal66] B. Wald. Utilization of a multiprocessor in command and control. Proceedings of the IEEE, 53(12):1885-1888, December 1966.
[WB95] Paul R. Wilson and V. B. Balayoghan. Compressed paging. In preparation, 1995.
[WDH89] Mark Weiser, Alan Demers, and Carl Hauser. The portable common runtime approach to interoperability. In Proceedings of the Twelfth Symposium on Operating Systems Principles, December 1989.
[Wei76] Charles B. Weinstock. Dynamic Storage Allocation Techniques. PhD thesis, Carnegie-Mellon University, Pittsburgh, Pennsylvania, April 1976.
[Whi80] Jon L. White. Address/memory management for a gigantic Lisp environment, or, GC considered harmful. In LISP Conference, pages 119-127, Redwood Estates, California, August 1980.
[Wil90] Paul R. Wilson. Some issues and strategies in heap management and memory hierarchies. In OOPSLA/ECOOP '90 Workshop on Garbage Collection in Object-Oriented Systems, October 1990. Also appears in SIGPLAN Notices 23(3):45-52, March 1991.
[Wil91] Paul R. Wilson. Operating system support for small objects. In International Workshop on Object Orientation in Operating Systems, pages 80-86, Palo Alto, California, October 1991. IEEE Press.
[Wil92] Paul R. Wilson. Uniprocessor garbage collection techniques. In Bekkers and Cohen [BC92], pages 1-42.
[Wil95] Paul R. Wilson. Garbage collection. Computing Surveys, 1995. Expanded version of [Wil92]. Draft available via anonymous internet FTP from cs.utexas.edu as pub/garbage/bigsurv.ps. In revision, to appear.
[Wis78] David S. Wise. The double buddy-system. Technical Report 79, Computer Science Department, Indiana University, Bloomington, Indiana, December 1978.
[WJ93] Paul R. Wilson and Mark S. Johnstone. Truly real-time non-copying garbage collection. In OOPSLA '93 Workshop on Memory Management and Garbage Collection, December 1993. Expanded version of workshop position paper submitted for publication.
[WJNB95] Paul R. Wilson, Mark S. Johnstone, Michael Neely, and David Boles. Memory allocation policies reconsidered. Technical report, University of Texas at Austin Department of Computer Sciences, 1995.
[WJW+75] William A. Wulf, R. K. Johnsson, C. B. Weinstock, S. O. Hobbs, and C. M. Geschke. Design of an Optimizing Compiler. American Elsevier, 1975.
[WLM91] Paul R. Wilson, Michael S. Lam, and Thomas G. Moher. Effective static-graph reorganization to improve locality in garbage-collected systems. In Proceedings of the 1991 SIGPLAN Conference on Programming Language Design and Implementation [PLD91], pages 177-191. Published as SIGPLAN Notices 26(6), June 1992.
[WLM92] Paul R. Wilson, Michael S. Lam, and Thomas G. Moher. Caching considerations for generational garbage collection. In Conference Record of the 1992 ACM Symposium on LISP and Functional Programming, pages 32-42, San Francisco, California, June 1992. ACM Press.
[WM89] Paul R. Wilson and Thomas G. Moher. Design of the Opportunistic Garbage Collector. In Conference on Object Oriented Programming Systems, Languages and Applications (OOPSLA '89) Proceedings, pages 23-35, New Orleans, Louisiana, 1989. ACM Press.
[Wol65] Eric Wolman. A fixed optimum cell-size for records of various lengths. Journal of the ACM, 12(1):53-70, January 1965.
[WW88] Charles B. Weinstock and William A. Wulf. Quickfit: an efficient algorithm for heap storage allocation. ACM SIGPLAN Notices, 23(10):141-144, October 1988.
[Yua90] Taichi Yuasa. The design and implementation of Kyoto Common Lisp. Journal of Information Processing, 13(3), 1990.
[ZG92] Benjamin Zorn and Dirk Grunwald. Empirical measurements of six allocation-intensive C programs. Technical Report CU-CS-604-92, University of Colorado at Boulder, Dept. of Computer Science, July 1992.
[ZG94] Benjamin Zorn and Dirk Grunwald. Evaluating models of memory allocation. ACM Transactions on Modeling and Computer Simulation, 1(4):107-131, 1994.
[Zor93] Benjamin Zorn. The measured cost of conservative garbage collection. Software Practice and Experience, 23(7):733-756, July 1993.

This article was processed using the LaTeX macro package with LLNCS style.