The Computer Systems Research Institute (CSRI) is an interdisciplinary group formed to conduct research
and development relevant to computer systems and their application. It is an Institute within the Faculty of
Applied Science and Engineering, and the Faculty of Arts and Science, at the University of Toronto, and is
supported in part by the Natural Sciences and Engineering Research Council of Canada.
Performance Issues for
Multiprocessor Operating Systems
Benjamin Gamsa Orran Krieger Eric W. Parsons Michael Stumm
1 Introduction
Designing an operating system for good performance is fundamentally more difficult for shared-memory
multiprocessors than it is for uniprocessors. Multiprocessors require that a distinct set of issues be considered.
For example, the overhead of cache consistency requires careful attention to the placement of data in order to
reduce the number of cache misses. Similarly, in large systems the distribution of memory across the system
also requires attention to the placement of data but this time to improve memory access locality. Dealing
with these issues might involve such optimizations as data alignment, padding, regrouping, replication, and
migration, or even the use of alternative data structures. Unfortunately, current compiler technology is not
capable of performing many of these optimizations on code typical of operating systems. Hence, the systems
programmer needs to be aware of the intricacies of multiprocessors and explicitly optimize the code for these
systems.
This article first reviews how shared memory multiprocessor hardware affects the performance of system
software and then presents a set of principles for organizing operating system data to maximize performance
on these systems. The principles have been derived from a synthesis of our own experience developing (and
performance tuning) two parallel operating systems and the experiences reported by other operating system
developers. The principles are not only useful for adapting and performance tuning existing multiprocessor
operating systems, but are also useful when designing a system from scratch. In the latter case, they
lead to a system structure that is both novel and effective. As an example, we illustrate the use of these
principles on selected components of a (complete and running) object-oriented operating system we have
been implementing.
Figure 1: This figure depicts the two basic classes of multiprocessors under consideration. The left figure shows
a small-scale bus-based multiprocessor, which commonly ranges in size from 4 to 16 processors. The figure on the
right shows the general architecture of a large-scale system, typically built from nodes containing a small number of
processors, memory, and disks, and connected by some scalable interconnect (represented here by a cloud), which
could be a mesh, tree, torus, etc. Important features common to both types of systems are the use of an invalidation-
based coherence scheme (rather than the less common update-based scheme), and in-cache synchronization primitives
(rather than network- or memory-based primitives).
Since much of the material presented in this section is tutorial in nature, the well-informed reader should
feel free to scan over it quickly.
In addition, there are complex tradeoffs to consider in the implementation of locks on a multiprocessor.
Whereas the choice is straightforward for uniprocessors where the only sensible option is a simple blocking
lock, multiprocessor locking must consider such issues as context-switch overhead, wasted spin-cycles, bus
and network contention due to spinning, fairness, and preemption effects for both lock holders and spinners.
Fortunately, this area has been well-studied [2, 3, 5, 13, 18, 32], and in most cases spin-then-block locks
with exponential back-off for the spinning phase have proven to be highly effective [13] (see Footnote 1).
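To make the trade-off concrete, the following sketch shows a spin-then-block lock with exponential back-off, written in C++ with standard atomics. The spin-attempt limit, the back-off constants, and the use of a short sleep to stand in for true blocking are our own illustrative assumptions, not details taken from the systems cited above.

    #include <algorithm>
    #include <atomic>
    #include <chrono>
    #include <thread>

    class SpinThenBlockLock {
    public:
        void lock() {
            int backoff = 1;
            // Spin phase: a bounded number of attempts with exponential back-off
            // to keep bus and network contention low.
            for (int attempt = 0; attempt < kSpinAttempts; ++attempt) {
                if (try_acquire()) return;
                for (int i = 0; i < backoff; ++i)
                    cpu_relax();                        // busy-wait without coherence traffic
                backoff = std::min(2 * backoff, kMaxBackoff);
            }
            // Block phase: give up the processor between retries rather than
            // wasting further spin cycles (a kernel lock would sleep on a wait queue).
            while (!try_acquire())
                std::this_thread::sleep_for(std::chrono::microseconds(100));
        }
        void unlock() { locked_.store(false, std::memory_order_release); }

    private:
        bool try_acquire() {
            return !locked_.exchange(true, std::memory_order_acquire);
        }
        static void cpu_relax() { /* e.g., a PAUSE/delay instruction on real hardware */ }

        static constexpr int kSpinAttempts = 10;
        static constexpr int kMaxBackoff = 1024;
        std::atomic<bool> locked_{false};
    };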
Footnote 1: Distributed queue-based spin locks are another well-studied option, whose prime value is for the less common cases that can benefit from continual spinning (even under high contention) and which also require low latency and a high degree of fairness [20, 23].
Footnote 2: To be more precise, invalidation-based cache coherence protocols require that the cache obtain exclusive access to a line the first time it is written to, even if no sharing is taking place. These misses are sometimes referred to as upgrade misses or initial-write misses. Hence there is an extra cost in multiprocessors both for sharing misses and for upgrade misses.
Footnote 3: Some results suggest that there is often sufficient locality that locks can generally be assumed to reside in the cache [30]. However, later work (by the same author) suggests that coherence traffic due to locking is a significant problem [35]. The effects of false sharing may well explain this apparent contradiction.
Footnote 4: Even the Digital 8400 multiprocessor, which is expressly optimized for read-miss latency, has a 50 percent higher latency than Digital's earlier, lower-performance uniprocessors [10].
in early work on Mach on the RP3 where it took 2.5 hours to boot the system due to memory contention [6],
and in the work porting Solaris to the Cray SuperServer where misses in the idle process slowed non-idle
processors by 33 percent [21]. Third, the directory-based coherence schemes of larger systems must often
send multiple messages to retrieve up-to-date copies of data or invalidate multiple sharers of a cache-line,
further increasing both latency and network load. This effect is clearly illustrated in the operating system
investigation by Chapin et al., which found that local data miss costs were twice the local instruction cache
miss costs, due to the need to send multiple remote messages [7].
In the case of large systems, the physical distribution of memory has a considerable effect on performance.
Such systems are generally referred to as NUMA systems, or Non-Uniform Memory Access time systems,
since the time to access memory varies with the distance between the processor and the memory module.
In these systems, physical locality plays an important role in addition to temporal and spatial locality. An
example can be seen in the work of Unrau et al., where, due to the lack of physical locality in the data
structures used, the uncontended cost of a page fault increased by 25 percent when the system was scaled
from 1 to 16 processors [33].
As a result of the high cost of cache misses, newer systems are being designed to support non-blocking
caches and write buffers in order to hide load and store latencies. Many of these systems allow loads and
stores to proceed while previous loads and stores are still outstanding. This results in a system in which the
order of loads and stores may not appear the same to each processor; these systems are sometimes referred
to as being weakly consistent [12].
The implication of weakly consistent systems is that regular memory accesses cannot be used for
synchronization purposes, since memory access can appear out of order. For example, if one processor fills
a buffer and marks it filled, other processors might see the buffer first being marked filled and then actually
filled with data. This could cause some processors to access stale data when they see the buffer marked
filled. As a result of this behavior, software must generally switch from using simple flags to using full
locking, with significantly more expensive special memory synchronization instructions. This difference is
illustrated by the cost of a lock/unlock pair in AIX on the PowerPC, which is 100 times more expensive than
a cached store [29] (see Footnote 5).
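The buffer-and-flag scenario can be illustrated with the following C++ sketch, which uses release/acquire atomics as a portable stand-in for the special memory synchronization instructions discussed above; the buffer size and function names are assumptions made for the example.

    #include <array>
    #include <atomic>

    std::array<int, 64> buffer;            // data produced by one processor
    std::atomic<bool> filled{false};       // flag examined by other processors

    // Producer: fill the buffer, then mark it filled.  The release store
    // guarantees that the buffer writes become visible to other processors
    // no later than the flag itself.
    void produce() {
        for (int i = 0; i < 64; ++i)
            buffer[i] = i;
        filled.store(true, std::memory_order_release);
    }

    // Consumer: with an ordinary (non-atomic) flag, a weakly consistent
    // machine could make the flag visible before the buffer contents, so the
    // consumer would read stale data.  The acquire load supplies the memory
    // barrier that rules this out.
    int consume() {
        while (!filled.load(std::memory_order_acquire)) {
            // spin until the producer publishes the buffer
        }
        int sum = 0;
        for (int i = 0; i < 64; ++i)
            sum += buffer[i];              // guaranteed to see the produced values
        return sum;
    }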
Another result of the high cache-miss latency is the movement towards larger cache lines of 128 or even
256 bytes in length. These large cache lines are an attempt to substitute bandwidth (which is relatively easy
to design into a system) for latency (which is much harder to design in). Unfortunately, large cache lines
tend to degrade performance in SMMPs due to increased false sharing (see Footnote 6). The example cited earlier of a single
falsely-shared cache line being responsible for 18 percent of all coherence misses was due in part to the
fact that the hardware had 128-byte cache lines, causing a critical lock to share its cache line with 22 other randomly
placed variables [27].
2.4 Summary
The memory access times in a multiprocessor span several orders of magnitude due to architectural and
contention effects (see Table 1). To mitigate these effects, system software must be structured to reduce
cache misses and increase physical locality. In the next section we present a set of structuring techniques
and the circumstances under which they should be applied.
Footnote 5: In principle, locks would not be required for single flag variables if it were possible to insert (or have the compiler insert) the required memory barrier instructions at the right places, but such support is not generally available at this time [1].
Footnote 6: To further complicate matters, with such a wide variety of cache line sizes (sometimes even within the same product line), ranging from 16 to 256 bytes, it is becoming increasingly difficult to optimize for a fixed or "average" cache line size.
Access type          Latency (processor cycles)
Primary cache        1
Secondary cache      10
Local memory         100
Remote memory        200–500
Hot spot             1000–10000

Table 1: Latencies for different types of accesses in a typical large-scale shared-memory multiprocessor, measured in
processor cycles.
3 Design principles
The issues described in the previous section clearly have a major impact on the performance of multiprocessor
system software. In order to achieve good performance, system software must be designed to take the
hardware characteristics into account. There is no magic solution for dealing with them; the design
of each system data structure must take into account the expected access pattern, degree of sharing, and
synchronization requirements. Nevertheless, we have found a number of principles and design strategies that
have repeatedly been useful. These include principles previously proposed by us and others [7, 32, 33, 35],
refinements of previously proposed principles [14] to address the specific needs of system software, and a
number of new principles.
S1: Segregate read-mostly data from frequently modified data. Read-mostly data should not reside in the
same cache line as frequently modified data in order to avoid false sharing. Segregation can often be
achieved by properly padding, aligning, or regrouping the data. Consider a linked list whose structure
is static but whose elements are frequently modified. To avoid having the modifications of the elements
affect the performance of list traversals, the search keys and link pointers should be segregated from
the other data of the list elements (see Footnote 7).
S2: Segregate independently accessed read/write data from each other. This principle prevents false sharing
of read/write data, by ensuring that data that is accessed independently by multiple processors ends
up in different cache lines.
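As an illustration of S1 and S2, the following C++ sketch pads and aligns a linked-list element so that its read-mostly search fields occupy a different cache line than its frequently modified statistics, and forces independently updated per-processor counters onto separate lines. The 64-byte line size and the field names are assumptions made for the example.

    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t kCacheLine = 64;      // assumed cache-line size

    // S1: the read-mostly search fields (key and link) are kept apart from the
    // frequently modified statistics, so list traversals are not slowed down by
    // coherence misses caused by updates to other elements' statistics.
    struct ListElement {
        struct SearchFields {
            std::uint64_t key;
            ListElement  *next;
        };
        struct Stats {
            std::uint64_t hit_count;
            std::uint64_t last_access_time;
        };

        alignas(kCacheLine) SearchFields search;   // read-mostly, own cache line
        alignas(kCacheLine) Stats        stats;    // frequently modified, separate line
    };

    // S2: independently updated read/write data (here, one counter per processor)
    // is aligned so that each instance occupies its own cache line; the alignment
    // pads the structure to a full line and prevents false sharing.
    struct alignas(kCacheLine) PerProcessorCounter {
        std::uint64_t value;
    };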
S3: Privatize write-mostly data. Where practical, generate a private copy of the data for each processor so
that modifications are always made to the private copy and global state is determined by combining
the state of all copies. This principle avoids coherence overhead in the common case, since processors
update only their private copy of the data. For example, it is often necessary to maintain a reference
count on an object to ensure that the object is not deleted while it is being accessed. Such a reference
count can be decomposed into multiple reference counts, each updated by a different processor and,
applying S2, forced into separate cache lines.
Footnote 7: This is in marked contrast to uniprocessors, where it is better to collocate the linked-list state and the list elements so that when an element is reached some of its data will already have been loaded into the cache.
S4: Use strictly per-processor data wherever possible. If data is accessed mostly by a single processor, it
is often a good idea to restrict access to only that processor, forcing other processors to pay the extra
cost of inter-processor communication on their infrequent accesses. In addition to the caching benefits
(as in S3), strictly per-processor structures allow the use of uniprocessor solutions to synchronize
access to the data. For example, for low-level structures, disabling interrupts is sufficient to ensure
atomic access. Alternatively, since data is only accessed by a single processor, the software can rely
on the ordering of writes for synchronization, something not otherwise possible on weakly consistent
multiprocessors (see Section 2.3).
L1: Use per-processor reader/writer locks for read-mostly data. A lock for read-mostly data should be
implemented using a separate lock for each processor. To obtain a read-lock, a processor need only
acquire its own lock, while to obtain a write-lock it must acquire all locks. This strategy allows the
processor to acquire a read-lock and access the shared data with no coherence overhead in the common
case of read access. (This principle can be viewed as a special case of principle S3.)
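A minimal sketch of such a per-processor reader/writer lock is given below in C++. The fixed processor count, the use of a standard mutex per processor, and the convention that callers pass their processor number are assumptions made for illustration.

    #include <array>
    #include <mutex>

    constexpr int kNumProcessors = 16;      // assumed machine size

    // L1: one lock per processor.  A reader touches only its own processor's
    // lock, so the common read case causes no coherence traffic; a writer must
    // acquire every per-processor lock, always in the same (ascending) order so
    // that concurrent writers cannot deadlock.
    class PerProcessorRWLock {
    public:
        void read_lock(int cpu)   { locks_[cpu].lock(); }
        void read_unlock(int cpu) { locks_[cpu].unlock(); }

        void write_lock() {
            for (auto &l : locks_)
                l.lock();
        }
        void write_unlock() {
            for (auto &l : locks_)
                l.unlock();
        }

    private:
        // In a kernel, only one thread runs per processor at a time, so a simple
        // mutex per processor suffices; each should also sit in its own cache
        // line (principle S2), which a real implementation would enforce by padding.
        std::array<std::mutex, kNumProcessors> locks_;
    };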
L2: Segregate contended locks from their associated data if the data is frequently modified. If there is a
high probability of multiple processes attempting to modify data at the same time, then it is important
to segregate the lock from the data so that the processors trying to access the lock will not interfere
with the processor that has acquired the lock (see Footnote 8).
L3: Collocate uncontended locks with their data if the data is frequently modified. When a lock is brought
into the cache for locking, some of its associated data is then brought along with it, and hence
subsequent cache misses are avoided.
D1: Replicate read-mostly data. Read-mostly data should be replicated to multiple memory modules so
that processors’ requests can be handled by nearby replicas. Typically, replication should occur on
demand so that the overhead of replicating data is only incurred if necessary.
D2: Partition and migrate read/write data. Data should be partitioned into constituent components according
to how the data will be accessed, allowing the components to be stored in different memory modules.
Each component should be migrated on use if it is primarily accessed by one processor at a time.
Alternatively, if most of the requests to the data are from a particular client, then the data should be
migrated with that client.
Footnote 8: Special hardware support for locks has been proposed that results in no cache-coherence traffic on an unsuccessful attempt to acquire the lock.
D3: Privatize write-mostly data. Privatizing write-mostly data, as described in principle S3, can be used
not only to avoid coherence overhead, but also to distribute data across the system to localize memory
accesses.
Although the following principles are not strictly for structuring data for locality, we have found them equally
important in achieving efficient localization of operating system data.
D4: Use approximate local information rather than exact global information. For certain operating system
policies, it is possible to sacrifice some degree of accuracy in exchange for performance by using
local approximate information to make reasonable decisions rather than exact global information. For
example, having per-processor run-queues reduces short-term fairness globally, but minimizes the cost
to the dispatcher when a process is being scheduled to run.
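The run-queue example might be realized as in the following C++ sketch, where each processor dispatches from a private queue and publishes only an approximate, possibly stale length for others to consult; the queue representation and the balancing heuristic are illustrative assumptions.

    #include <array>
    #include <atomic>
    #include <deque>

    constexpr int kNumProcessors = 16;

    // D4: each processor dispatches from its own run queue and publishes only an
    // approximate queue length; other processors consult these possibly stale
    // counters when balancing load, instead of locking exact global state.
    struct RunQueue {
        std::deque<int>  threads;            // runnable thread ids, local access only
        std::atomic<int> approx_length{0};   // hint that remote processors may read
    };

    std::array<RunQueue, kNumProcessors> run_queues;

    // Dispatcher fast path: purely local, no remote memory accesses.
    int pick_next(int cpu) {
        RunQueue &q = run_queues[cpu];
        if (q.threads.empty())
            return -1;                        // idle
        int tid = q.threads.front();
        q.threads.pop_front();
        q.approx_length.store(static_cast<int>(q.threads.size()),
                              std::memory_order_relaxed);
        return tid;
    }

    // A periodic balancer uses the stale hints; its decision may be imperfect,
    // but it never interferes with the dispatchers' local queues.
    int most_loaded_cpu() {
        int best_cpu = 0, best_len = -1;
        for (int cpu = 0; cpu < kNumProcessors; ++cpu) {
            int len = run_queues[cpu].approx_length.load(std::memory_order_relaxed);
            if (len > best_len) { best_cpu = cpu; best_len = len; }
        }
        return best_cpu;
    }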
D5: Avoid barrier-based synchronization for global state changes. When data is replicated or partitioned,
it is necessary to synchronize when making global changes to the replicas and when determining a
globally consistent value for the partitioned data. In this case, the system should avoid using barrier-
based synchronization because it wastes processor cycles while waiting for other processors to reach
the barrier, and results in a high overhead in interrupting (and restarting) the tasks running on the
processors [35]. There are a variety of alternative asynchronous schemes (e.g., as used for lazy TLB
shoot-down [25]) which can, for example, allow multiple requests to be combined together to reduce
the amount of inter-processor communication.
In applying the S*, L*, and D* principles, the programmer must be aware that their over-zealous application
may actually reduce performance. For example, while a naive application of the S* principles
may result in a system with reduced coherence overhead, it may also result in an increased total number of
cache misses due to the fragmentation of data structures. Hence, significant expertise is required to apply
the design principles in a balanced fashion.
and to allow each component to be placed close to processors that are accessing it (D*). However, this
requires a mechanism that allows a processor to efficiently locate the component of the object that it should
access. Although an object could provide a redirection table for this purpose, a naive implementation would
require all accesses to pass through a centralized table, thereby limiting the benefits of distributing the data
of the object. The approach we have taken is based on a new model of objects, called clustered objects, that
allows objects to be easily partitioned while preserving the fundamental object abstraction [24].
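The following C++ sketch conveys the essence of the translation mechanism described above; the class names, the per-processor table layout, and the demand-creation hook are our own illustrative assumptions rather than the actual clustered-object interfaces.

    #include <array>
    #include <unordered_map>

    constexpr int kNumProcessors = 16;
    using OID = unsigned long;                    // object identifier

    // Each clustered object supplies processor-local representatives on demand.
    struct Representative { /* processor-local portion of the object's state */ };

    struct ClusteredObject {
        virtual Representative *create_rep(int cpu) = 0;   // called on first access
        virtual ~ClusteredObject() = default;
    };

    // Per-processor translation table: OID -> local representative.  Entries are
    // filled lazily, so an object pays for a representative only on processors
    // that actually access it.
    std::array<std::unordered_map<OID, Representative *>, kNumProcessors> xlate;
    std::unordered_map<OID, ClusteredObject *> objects;   // consulted only on a miss

    Representative *translate(OID oid, int cpu) {
        auto &table = xlate[cpu];
        auto it = table.find(oid);
        if (it != table.end())
            return it->second;                    // fast path: purely local lookup
        // Miss: ask the object to create (or share) a representative for this cpu.
        Representative *rep = objects.at(oid)->create_rep(cpu);
        table.emplace(oid, rep);
        return rep;
    }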
Figure 2: This figure illustrates an application accessing a matrix in a partitioned fashion, and the corresponding
implementation of the region in the kernel. Each processor possesses its own region representative (Rep 1 through
Rep 4), stored in processor-local memory, to which page fault and prefetch requests are directed. These representatives
keep track of the physical pages allocated to the region as well as the virtual-to-physical mappings. Since each
representative manages different portions of the region, a representative can be locked without delaying the progress
of other processors.
Footnote 9: More detailed aspects of the implementation are described elsewhere [24].
Footnote 10: Our system is based on C++, and hence the cost to invoke a non-clustered object is the cost of a C++ virtual function call.
Figure 3: The clustered object framework allows a clustered object to associate a different representative with each
processor (and share representatives among processors if desired). This relationship is stored in a per-processor
object translation table. In this way, an object identifier (OID) can be transparently used anywhere in the system; each
access will be directed to the appropriate object representative. For scalability, the table entries are filled on demand,
so although all the entries are shown as if they were filled, they would only be filled if all processors were actually
accessing the two objects shown.
use an asynchronous technique for object destruction (D5). This ensures that processors are never blocked
(avoiding wasted cycles spinning), and allows destruction-requests to be batched (reducing inter-processor
communication).
A reference count scheme is used to determine when an object’s memory can be reclaimed. Rather than
maintaining a separate reference count for each object, a single per-processor reference count keeps track
of the number of threads accessing any clustered object on the processor (L1, S3, D3). When a thread first
enters an address space to invoke a method on a clustered object, it increments the counter for the processor;
subsequent object invocations within the address space require no further increments.
Destroying an object is performed in two phases. In the first phase the object invalidates all references
to itself held by other objects and notifies the appropriate set of processors that it is in the process of being
destroyed. When a processor’s reference count reaches zero, it invalidates its translation table entry; until
this point, it is possible that a thread running on that processor will have made a temporary copy of the
reference (say, in a register), and hence be capable of making new method invocations to the object. The
second phase starts once all translation table entries have been invalidated; it consists of lazily reclaiming
the memory used by the representatives (see Footnote 11).
Footnote 11: This object destruction algorithm is similar in nature to lazy TLB shoot-down algorithms that re-use physical pages only once all processors are known to have discarded their associated TLB entry [25].
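A greatly simplified C++ sketch of this reference-counting scheme follows; the counter layout, the enter/leave hooks, and the reclamation test are assumptions made for illustration and omit the translation-table invalidation details described above.

    #include <array>
    #include <atomic>
    #include <cstddef>

    constexpr int kNumProcessors = 16;
    constexpr std::size_t kCacheLine = 64;

    // One counter per processor (S3/D3), each on its own cache line (S2), counting
    // the threads currently active in the address space on that processor.
    struct alignas(kCacheLine) ActiveCount {
        std::atomic<long> value{0};
    };
    std::array<ActiveCount, kNumProcessors> active;

    void enter_address_space(int cpu) {
        active[cpu].value.fetch_add(1, std::memory_order_relaxed);
    }
    void leave_address_space(int cpu) {
        active[cpu].value.fetch_sub(1, std::memory_order_release);
    }

    // Phase 2 of destruction: once external references and translation-table
    // entries have been invalidated (phase 1), memory is reclaimed lazily; this
    // test is simply retried later (batched with other destruction requests)
    // until every processor's count has drained, so no processor ever blocks.
    bool safe_to_reclaim() {
        for (auto &c : active)
            if (c.value.load(std::memory_order_acquire) != 0)
                return false;
        return true;
    }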
Figure 4: The kernel maintains a per-processor table for the ports corresponding to servers in the system. When
a client makes a cross-address space method invocation to such a port, it first traps into the kernel (1). The kernel
then removes a pre-initialized worker (stored in processor-local memory) from the worker list for the port (2). It then
continues execution in the server’s address space (3) at the object translation entry point. When the server thread
terminates, the worker structure in the kernel is returned to the port’s worker list.
and concurrency [11]. We use per-processor data structures for common-case operations, resulting in the
fastest multiprocessor cross-address-space method invocation we know of, matching that of any uniprocessor
implementation [19].
To accept cross-address space requests, a server must first allocate a port from the kernel to be used
by clients in making requests. Per-processor tables are used to maintain port information, which includes
information about the allocating server (D1) as well as a private pool of pre-initialized worker process
descriptors (PDs) that reside in local memory (D2), as shown in Figure 4 (see Footnote 12). Handling a cross-address-
space request consists of removing the first PD from the processor’s private worker queue, and (possibly)
re-mapping a page of memory to pass parameters that cannot fit in registers. Hence, this implementation
of cross-address-space method invocation requires no remote data accesses and no locking (given that the
kernel performs this at interrupt level), resulting in a null method call requiring only 300 instructions.
Footnote 12: The importance of having local worker structures has been observed in numerous other contexts, including network protocol processing [28] and process dispatching [30].
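The fast path described above might look like the following C++ sketch; the port identifiers, worker descriptor contents, and table sizes are illustrative assumptions rather than the actual kernel structures.

    #include <array>

    constexpr int kNumProcessors = 16;
    constexpr int kMaxPorts = 256;

    // A pre-initialized worker descriptor, stored in processor-local memory.
    struct Worker {
        Worker *next;            // link in the per-port free list
        void   *kernel_stack;    // stack on which the server-side code will run
    };

    // Per-processor port table: each entry holds the server information and a
    // private list of workers, so the fast path below touches no remote memory
    // and needs no lock (the kernel runs it at interrupt level).
    struct PortEntry {
        void   *server_entry_point;
        Worker *free_workers;
    };
    std::array<std::array<PortEntry, kMaxPorts>, kNumProcessors> port_tables;

    // Step (2) of Figure 4: take a pre-initialized worker from the local list.
    Worker *begin_invocation(int cpu, int port) {
        PortEntry &e = port_tables[cpu][port];
        Worker *w = e.free_workers;
        if (w != nullptr)
            e.free_workers = w->next;
        return w;                // caller then switches to the server's address space (3)
    }

    // When the server thread terminates, the worker returns to the local list.
    void end_invocation(int cpu, int port, Worker *w) {
        PortEntry &e = port_tables[cpu][port];
        w->next = e.free_workers;
        e.free_workers = w;
    }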
[Figure 5 diagram: a global virtual layer; a cluster layer with a remote return list; and a per-processor layer holding lock-protected free lists for 16-, 24-, 32-, and 64-byte blocks on each of processors P1 and P2.]
Figure 5: The kernel memory allocator consists of four layers: the processor layer manages blocks of memory for each
processor; the cluster layer balances the allocation of memory blocks among a cluster of processors sharing a common
memory module; the global physical layer manages the physical pages available in the system; finally, the global
virtual layer manages the allocation of virtual space for memory allocation.

Because the bottom two layers are accessed most frequently, we concentrate on these. Each processor maintains a list
of free blocks for memory of various sizes. These are used to allocate and free memory blocks of a particular size on
the given processor. Since contention is expected to be low, each list is protected by a lock embedded in the head
pointer for the list. If a processor frees memory that belongs to a different processor, however, the memory is
returned to a return list at the cluster level, which then redistributes the blocks in bulk. Since there might be
contention in this case, the lock is maintained separately from the pointer to the list.
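A simplified C++ sketch of the bottom two layers follows, with the size classes taken from Figure 5; the explicit mutexes (standing in for the lock embedded in the head pointer) and the data-structure names are assumptions made for illustration.

    #include <array>
    #include <mutex>

    constexpr int kNumProcessors = 16;
    constexpr int kNumSizeClasses = 4;        // 16-, 24-, 32-, and 64-byte blocks (Figure 5)

    struct Block { Block *next; };

    // Processor layer: one free list per (processor, size class).  In the real
    // allocator the lock is embedded in the list-head pointer because contention
    // is expected to be low (L3); an explicit mutex stands in for it here.
    struct FreeList {
        std::mutex lock;
        Block     *head = nullptr;
    };
    std::array<std::array<FreeList, kNumSizeClasses>, kNumProcessors> per_cpu;

    // Cluster layer: blocks freed on the "wrong" processor are collected on a
    // return list and later redistributed in bulk.  Because this list can be
    // contended, its lock is kept apart from the list pointer (L2).
    struct ReturnList {
        alignas(64) std::mutex lock;
        alignas(64) Block     *head = nullptr;
    };
    ReturnList cluster_return;

    void free_block(int cpu, int owner_cpu, int size_class, Block *b) {
        if (cpu == owner_cpu) {                       // common, purely local case
            FreeList &fl = per_cpu[cpu][size_class];
            std::lock_guard<std::mutex> g(fl.lock);
            b->next = fl.head;
            fl.head = b;
        } else {                                      // remote free: hand to the cluster layer
            std::lock_guard<std::mutex> g(cluster_return.lock);
            b->next = cluster_return.head;
            cluster_return.head = b;
        }
    }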
5 Concluding remarks
We have described the effects of multiprocessor hardware on the performance of system software, provided
a number of design principles that are useful in optimizing multiprocessor system software, and shown
how these design principles can be applied to real software. Our focus throughout has been on low-level
issues related to the organization of system data structures. In doing so we have not addressed many
other important multiprocessor issues. For example, we did not discuss tradeoffs between different
synchronization algorithms, or how code should be distributed across the system for locality.
Multiprocessor operating system programmers also face high-level challenges, such as developing
techniques to manage the complex resources of large-scale systems and handling the resource requirements
of parallel programs. While it is crucial to address these high-level challenges, the benefits can only be fully
realized if the basic operating system data is properly structured.
The work described is part of a large project involving the design and implementation of the NUMAchine
multiprocessor, the Tornado operating system, and a variety of tools, compilers, and applications. For high
application-level performance it is necessary that all levels involved perform well, including the hardware,
the operating system, and the application. In this paper we have described some of the techniques we have
found to be important in achieving high performance at the operating system level.
References
[1] S.V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. Technical Report ECE 9512,
Rice University, September 1995. Also available as Digital Western Research Laboratory Research Report 95/7.
[2] T. Anderson. The performance of spin lock alternatives for shared-memory multiprocessors. IEEE
Transactions on Parallel and Distributed Systems, 1(1):6–16, January 1990.
[3] David L. Black, Avadis Tevanian Jr., David B. Golub, and Michael W. Young. Locking and reference
counting in the Mach kernel. In Proc. 1991 ICPP, volume II (Software), pages II-167–II-173, August 1991.
[4] Brad Calder, Dirk Grunwald, and Ben Zorn. Quantifying behavioral differences between C and C++
programs. Journal of Programming Languages, 2(4):313–351, 1994.
[5] M. Campbell, R. Holt, and J. Slice. Lock granularity tuning mechanisms in SVR4/MP. In Symposium
on Experiences with Distributed and Multiprocessor Systems (SEDMS II.), pages 221–228, March
1991.
[6] H.H.Y. Chang and B. Rosenburg. Experience porting Mach to the RP3 large-scale shared-memory
multiprocessor. Future Generation Computer Systems, 7(2–3):259–267, April 1992.
[7] John Chapin, Stephen A. Herrod, Mendel Rosenblum, and Anoop Gupta. Memory system performance
of UNIX on CC-NUMA multiprocessors. In Proceedings of the 1995 ACM SIGMETRICS Joint
International Conference on Measurement and Modeling of Computer Systems, May 1995.
[8] J. Bradley Chen and Brian N. Bershad. The impact of operating system structure on memory system
performance. In Proc. 14th ACM SOSP, pages 120–133, 1993.
[9] J. R. Eykholt, S. R. Kleiman, S. Barton, R. Faulkner, D. Stein, M. Smith, A. Shivalingiah, J. Voll,
M. Weeks, and D. Williams. Beyond multiprocessing: Multithreading the System V Release 4 kernel.
In USENIX Conference Proceedings, pages 11–18, Summer 1992.
[10] David M. Fenwick, Denis J. Foley, William B. Gist, Stephen R. VanDoren, and Daniel Wissell. The
AlphaServer 8000 series: High-end server platform development. Digital Technical Journal, 7(1):43–
65, 1995.
[11] Benjamin Gamsa, Orran Krieger, and Michael Stumm. Optimizing IPC performance for shared-memory
multiprocessors. In Proc. 1994 ICPP, pages 208–211, August 1994.
[12] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. L. Hennessy. Memory consistency
and event ordering in scalable shared-memory multiprocessors. In Proc. 17th Annual Int'l Symp. on
Computer Architecture, pages 15–26, June 1990.
[13] Anoop Gupta, Andrew Tucker, and Shigeru Urushibara. The impact of operating system scheduling
policies and synchronization methods on the performance of parallel applications. In Proceedings of the
ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 120–132,
1991.
[14] Mark D. Hill and James R. Larus. Cache considerations for multiprocessor programmers. CACM,
33(8):97–102, August 1990.
[15] Norman C. Hutchinson and Larry L. Peterson. The x-Kernel: An architecture for implementing network
protocols. IEEE Transactions on Software Engineering, 17(1):64–76, January 1991.
[16] Tor E. Jeremiassen and Susan J. Eggers. Reducing false sharing on shared memory multiprocessors
through compile time data transformations. In Proc. 5th ACM SIGPLAN Symposium on Principles and
Practice of Parallel Programming, PPoPP’95, pages 179–188, July 1995.
[17] Alain Kagi, Nagi Aboulenein, Douglas C. Burger, and James R. Goodman. An analysis of the interactions
of overhead-reducing techniques for shared-memory multiprocessors. In Proc. International
Conference on Supercomputing, pages 11–20, July 1995.
[18] Anna R. Karlin, Kai Li, Mark S. Manasse, and Susan Owicki. Empirical studies of competitive spinning
for a shared-memory multiprocessor. In Proceedings of 13th ACM Symposium on Operating Systems
Principles, pages 41–55, October 1991.
[19] J. Liedtke. Improving IPC by kernel design. In Proceedings of the Fourteenth ACM Symposium on
Operating Systems Principles, pages 175–187, 1993.
[20] Peter Magnussen, Anders Landin, and Erik Hagersten. Queue locks on cache coherent multiprocessors.
In 8th IPPS, pages 26–29, 1994.
[21] Drew McCrocklin. Scaling Solaris for enterprise computing. In Spring 1995 Cray Users Group
Meeting, 1995.
[22] Paul E. McKenney and Jack Slingwine. Efficient kernel memory allocation on shared-memory
multiprocessors. In USENIX Technical Conference Proceedings, pages 295–305, Winter 1993.
[24] Eric Parsons, Ben Gamsa, Orran Krieger, and Michael Stumm. (De-)Clustering objects for multiprocessor
system software. In Fourth International Workshop on Object Orientation in Operating Systems
(IWOOOS'95), pages 72–81, 1995.
[25] J. Kent Peacock, Sunil Saxena, Dean Thomas, Fred Yang, and Wilfred Yu. Experiences from
multithreading System V Release 4. In Symposium on Experiences with Distributed and Multiprocessor
Systems (SEDMS), pages 77–92, March 1992.
[26] Dennis M. Ritchie and Ken Thompson. The UNIX time-sharing system. CACM, 17(7):365–375, July
1974.
[27] Mendel Rosenblum, Edouard Bugnion, Emmett Witchel, Stephen A. Herrod, and Anoop Gupta. The
impact of architectural trends on operating system performance. In Proceedings of the 15th ACM
Symposium on Operating Systems Principles, December 1995.
[28] James D. Salehi, James F. Kurose, and Don Towsley. The performance impact of scheduling for cache
affinity in parallel network processing. In Fourth IEEE International Symposium on High Performance
Distributed Computing, August 1995.
[29] Jacques Talbot. Turning the AIX operating system into an MP-capable OS. In USENIX 1995 Technical
Conference Proceedings, January 1995.
[30] Josep Torrellas, Anoop Gupta, and John Hennessy. Characterizing the caching and synchronization
performance of a multiprocessor operating system. In Proceedings of the 5th International Conference
on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 162–
174, September 1992.
[31] Josep Torrellas, Monica S. Lam, and John L. Hennessy. False sharing and spatial locality in multiprocessor
caches. IEEE Transactions on Computers, 43(6), June 1994.
[32] Ronald C. Unrau, Orran Krieger, Benjamin Gamsa, and Michael Stumm. Experiences with locking in
a NUMA multiprocessor operating system kernel. In Operating System Design and Implementation,
pages 139–152, November 1994.
[33] Ronald C. Unrau, Orran Krieger, Benjamin Gamsa, and Michael Stumm. Hierarchical clustering: A
structure for scalable multiprocessor operating system design. Journal of Supercomputing, 9(1/2):105–
134, 1995.
[34] Z. Vranesic, S. Brown, M. Stumm, S. Caranci, A. Grbic, R. Grindley, M. Gusat, O. Krieger, G. Lemieux,
K. Loveless, N. Manjikian, Z. Zilic, T. Abdelrahman, B. Gamsa, P. Pereira, K. Sevcik, A. Elkateeb,
and S. Srbljic. The NUMAchine multiprocessor. Technical Report 324, University of Toronto, April
1995.
[35] Chun Xia and Josep Torrellas. Improving the data cache performance of multiprocessor operating
systems. In Proceedings of the Second International Symposium on High-Performance Computer Architecture (HPCA-2), 1996.