UNIX Internals
The New Frontiers
Uresh Vahalia
EMC Corporation
Hopkinton, MA
Prentice Hall
Upper Saddle River, New Jersey 07458
Library of Congress Cataloging-in-Publication Data
Vahalia, Uresh.
UNIX internals : the new frontiers / Uresh Vahalia.
p. cm.
Includes index.
ISBN 0-13-101908-2
1. UNIX (Computer file) 2. Operating systems (Computers)
I. Title.
QA76.76.O63V33 1996                                   95-25213
005.4'3--dc20                                         CIP
UNIX is a registered trademark licensed exclusively by X/Open Co., Ltd. SunOS and Solaris are registered
trademarks of Sun Microsystems, Inc. Digital UNIX is a trademark of Digital Equipment Corporation.
Other designations used by vendors as trademarks to distinguish their products may appear in this book. In
all cases where the publisher is aware of a current trademark claim, the designations have been printed in
initial capitals or all capitals.
© 1996 by Prentice-Hall, Inc.
Simon & Schuster/A Viacom Company
Upper Saddle River, New Jersey 07458
The author and publisher of this book used their best efforts in preparing this book. These efforts include
the development, research, and testing of the theories and programs to determine their effectiveness. The
author and publisher make no warranty of any kind, expressed or implied, with regard to these programs
or the documentation contained in this book. The author and publisher shall not be liable in any event for
incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or
use of these programs.
10 9 8 7 6 5
ISBN 0-13-101908-2
There are more flavors of UNIX than of most brands of ice cream. Despite the industrial impetus on
the part of X/Open and its members, the single UNIX specification appears to be ever-further from
our grasp. In fact, it may not be an important goal. Ever since Interactive Systems produced the first
commercial UNIX system and Whitesmiths produced the first UNIX clone, the user community has
been confronted by a variety of implementations running on multiple platforms.
Created in 1969, UNIX was not even a decade old when versions began to proliferate. Be-
fore it was 20 years old, there were rival consortia (the Open Software Foundation and UNIX Inter-
national) and a large number of versions. The two main streams were those of AT&T (now Novell)
and the University of California at Berkeley. Descriptions of those UNIXes were made easily avail-
able by Maurice Bach [Bach 86] and Sam Leffler, Kirk McKusick, Mike Karels, and John Quarter-
man [Leff 89].
No single book offered the interested student a view of the UNIX Operating System's vari-
ous implementations. Uresh Vahalia has now done this. He has gone boldly where none have gone
before and elucidated the internals of SVR4, 4.4BSD, and Mach. Even more, he presents elaborate
discussions of both Solaris and SunOS, Digital UNIX, and HP-UX.
He has done so clearly and without the bias that some writers have displayed toward this
UNIX or that. With relatively new UNIX clones such as Linux already developing variants and even
Berkeley derivatives diverging from one another, a book like this, which exposes the internals and
principles that motivated UNIX's growth and popularity, is of exceptional value.
On June 12, 1972, Ken Thompson and Dennis Ritchie released the UNIX Programmer's
Manual, Second Edition. In its Preface the authors remark: "The number of UNIX installations has
grown to 10, with more expected." They could never have expected what has actually happened.
I have traced the paleontology and history of the system elsewhere [Salu 94], but Vahalia
has given us a truly original and comprehensive view of the comparative anatomy of the species.
References
[Bach 86] Bach, M.J., The Design of the UNIX Operating System, Prentice-Hall, Englewood
Cliffs, NJ, 1986.
[Leff 89] Leffler, S.J., McKusick, M.K., Karels, M.J., and Quarterman, J.S., The Design and
Implementation of the 4.3BSD UNIX Operating System, Addison-Wesley, Reading,
MA, 1989.
[Salu 94] Salus, P.H., A Quarter Century of UNIX, Addison-Wesley, Reading, MA, 1994.
[Thom 72] Thompson, K., and Ritchie, D.M., UNIX Programmer's Manual, Second Edition,
Bell Telephone Laboratories, Murray Hill, NJ, 1972.
Preface
Since the early 1970s, the UNIX system has undergone considerable metamorphosis. It started as a
small, experimental operating system distributed freely (almost) by Bell Telephone Laboratories to
a growing band of loyal followers. Over the years, it absorbed contributions from numerous mem-
bers of academia and industry, endured battles over ownership and standardization, and evolved into
its current state as a stable, mature operating system. Today there are several commercial and re-
search variants of the UNIX system, each different from the other in many respects, yet all similar
enough to be recognizable as different members of the same family. A UNIX programmer who has
gained experience on one specific UNIX system can be productive on a number of different hard-
ware platforms and UNIX variants without skipping a beat.
Hundreds of books have described various features of the UNIX system. Although most of
them describe user-visible aspects such as the command shell or the programming interface, only a
small number of books discuss UNIX internals. UNIX internals refers to a study of the UNIX ker-
nel, which comprises the heart of the operating system. To date, each book on UNIX internals has
focused on one specific UNIX release. Bach's The Design of the UNIX Operating System [Bach 86]
is a landmark book on the System V Release 2 (SVR2) kernel. Leffler et al.'s The Design and Im-
plementation of the 4.3BSD UNIX Operating System [Leff 89] is a comprehensive description of the
4.3BSD release by some of its principal designers. Goodheart and Cox's The Magic Garden Ex-
plained [Good 94] describes the internals of System V Release 4.0 (SVR4).
Design Perspectives
This book views the UNIX kernel from a system design perspective. It describes a number of main-
stream commercial and research UNIX variants. For each component of the kernel, the book ex-
plores its architecture and design, how the major UNIX systems have chosen to implement the
component, and the advantages and drawbacks of alternative approaches. Such a comparative treat-
ment gives the book a unique flavor and allows the reader to examine the system from a critical
viewpoint. When studying an operating system, it is important to note both its strengths and its
weaknesses. This is only possible by analyzing a number of alternatives.
UNIX Variants
Although this book gives most attention to SVR4.2, it also explores 4.4BSD, Solaris 2.x, Mach, and
Digital UNIX in detail. Further, it describes interesting features of a number of other variants, in-
cluding some research that has not yet made it into commercial releases. It analyzes the major de-
velopments in UNIX from the mid-1980s to the mid-1990s. For completeness it includes a brief de-
scription of traditional UNIX functionality and implementation. Where necessary, it provides an
historical treatment, starting with the traditional approach, analyzing its drawbacks and limitations,
and presenting the modern solutions.
Intended Audience
UNIX Internals is useful for university courses and as a professional reference. As a university text,
it is suitable for an advanced undergraduate or graduate course on operating systems. It is not an in-
troductory book and assumes knowledge of concepts such as the kernel, processes, and virtual
memory. Each chapter contains a set of exercises designed to stimulate further thought and research,
and to provide additional insight into the system design. Many of the exercises are open-ended, and
some require additional reading on the part of the student. Each chapter also has an exhaustive list
of references, which should be useful for the student seeking to explore further.
UNIX Internals is also suitable as a professional reference for operating system developers,
application programmers, and system administrators. Operating system designers and architects can
use it to study the kernel architecture in contemporary systems, evaluate the relative merits and
drawbacks of different designs, and use the insight to develop the next generation of operating sys-
tems. Application programmers can use the knowledge of the system internals to write more effi-
cient programs that take better advantage of the characteristics of the operating system. Finally,
system administrators can do a better job of configuring and tuning their systems by understanding
how various parameters and usage patterns affect the system behavior.
and login session management. Chapter 5 describes the UNIX scheduler and the growing support
for real-time applications. Chapter 6 deals with interprocess communications (IPC), including the
set of features known as System V IPC. It also describes the Mach architecture, which uses IPC as
the fundamental primitive for structuring the kernel. Chapter 7 discusses the synchronization
frameworks used in modern uniprocessor and multiprocessor systems.
The next four chapters explore file systems. Chapter 8 describes the file system interface as
seen by the user, and the vnode/vfs interface that defines the interactions between the kernel and the
file system. Chapter 9 provides details of some specific file system implementations, including the
original System V file system (s5fs), the Berkeley Fast File System (FFS), and many small, special-
purpose file systems that take advantage of the vnode/vfs interface to provide useful services.
Chapter 10 describes a number of distributed file systems, namely Sun Microsystems' Network File
System (NFS), AT&T's Remote File Sharing (RFS), Carnegie-Mellon University's Andrew File
System (AFS), and Transarc Corporation's Distributed File System (DFS). Chapter 11 describes
some advanced file systems that use journaling to provide higher availability and performance, and
a new file system framework based on stackable vnode layers.
Chapters 12 through 15 describe memory management. Chapter 12 discusses kernel memory
allocation and explores several interesting allocation algorithms. Chapter 13 introduces the notion of
virtual memory and uses the 4.3BSD implementation to illustrate several issues. Chapter 14 de-
scribes the virtual memory architecture of SVR4 and Solaris. Chapter 15 describes the Mach and
4.4BSD memory models. It also analyzes the effects of hardware features such as translation look-
aside buffers and virtually addressed caches.
The last two chapters address the I/O subsystem. Chapter 16 describes the device driver
framework, the interaction between the kernel and the I/O subsystem, and the SVR4 device driver
interface/driver kernel interface specification. Chapter 17 talks about the STREAMS framework for
writing network protocols and network and terminal drivers.
Typographical Conventions
I have followed a small set of typographical conventions throughout this book. All system calls, li-
brary routines, and shell commands are in italics (for instance, fork, fopen, and ls -l). The first oc-
currence of any term or concept is also italicized. Names of internal kernel functions and variables,
as well as all code examples, are in fixed-width font, such as ufs_lookup(). When specifying the
calling syntax, the system call name is italicized, but the arguments are in fixed-width font. Finally,
all file and directory names are in bold face (for instance, /etc/passwd). In the figures, solid arrows
represent direct pointers, whereas a dashed arrow implies that the relationship between the source
and destination of the arrow is inferred indirectly.
Despite my best efforts, some errors are inevitable. Please send me all corrections, com-
ments, and suggestions by electronic mail at vahalia@acm.org.
Acknowledgments
A number of people deserve credit for this book. First of all, I want to thank my son, Rohan, and my
wife, Archana, whose patience, love, and sacrifice made this book possible. Indeed, the hardest
thing about writing the book was justifying to myself the weekends and evenings that could have
been spent with them. They have shared my travails with a smile and have encouraged me every
step of the way. I also thank my parents for their love and support.
Next, I want to thank my friend Subodh Bapat, who gave me the confidence to undertake
this project. Subodh has helped me maintain focus throughout the project and has spent countless
hours advising, counseling, and encouraging me. I owe him special thanks for access to the tools,
templates, and macros used for his book, Object-Oriented Networks [Bapa 94], for his meticulous
reviews of my drafts, and for his lucid discourses on writing style.
A number of reviewers contributed an incredible amount of their time and expertise to im-
prove the book, going through several drafts and providing invaluable comments and suggestions. I
want to thank Peter Salus, for his constant encouragement and support, and Benson Marguiles,
Terry Lambert, Mark Ellis, and William Bully for their in-depth feedback on the content and or-
ganization of my work. I also thank Keith Bostic, Evi Nemeth, Pat Parseghian, Steven Rago, Margo
Seltzer, Richard Stevens, and Lev Vaitzblit, who reviewed parts of my book.
I want to thank my manager, Percy Tzelnic, for his support and understanding throughout
my project. Finally, I want to thank my publisher Alan Apt, both for proposing the book and for
helping me at every stage, and the rest of the team at Prentice-Hall and at Spectrum Publisher Serv-
ices, in particular, Shirley McGuire, Sondra Chavez, and Kelly Ricci, for their help and support.
References
[Bach 86] Bach, M.J., The Design of the UNIX Operating System, Prentice-Hall, 1986.
[Bapa 94] Bapat, S.G., Object-Oriented Networks, Prentice-Hall, 1994.
[Good 94] Goodheart, B., and Cox, J., The Magic Garden Explained-The Internals of UNIX
System V Release 4, An Open Systems Design, Prentice-Hall, 1994.
[Leff 89] Leffler, S.J., McKusick, M.K., Karels, M.J., and Quarterman, J.S., The Design and
Implementation of the 4.3BSD UNIX Operating System, Addison-Wesley, 1989.
Contents
1 INTRODUCTION 1
1.1 Introduction 1
1.1.1 A Brief History 2
1.1.2 The Beginning 2
1.1.3 Proliferation 3
1.1.4 BSD 4
1.1.5 System V 5
1.1.6 Commercialization 5
1.1.7 Mach 6
1.1.8 Standards 6
1.1.9 OSF and UI 7
1.1.10 SVR4 and Beyond 8
1.2.3 Performance 10
1.2.4 Hardware Changes 10
1.2.5 Quality Improvement 11
1.2.6 Paradigm Shifts 11
1.2.7 Other Application Domains 12
1.2.8 Small is Beautiful 12
1.2.9 Flexibility 13
1.5 References 17
2.1 Introduction 19
2.5 Synchronization 33
2.5.1 Blocking Operations 35
2.5.2 Interrupts 35
2.5.3 Multiprocessors 37
2.7 Signals 38
2.9 Summary 45
2.10 Exercises 45
2.11 References 46
3.1 Introduction 48
3.1.1 Motivation 49
3.1.2 Multiple Threads and Processors 49
3.1.3 Concurrency and Parallelism 52
3.10 Summary 79
3.11 Exercises 80
3.12 References 80
4.1 Introduction 83
4.7 Exceptions 95
17 STREAMS 547
Index 587
1
Introduction
1.1 Introduction
In 1994 the computer industry celebrated the twenty-fifth birthday of the UNIX operating system.
Since its inception in 1969, the UNIX system has been ported to dozens of hardware platforms, and
has been released in many forms by commercial vendors, universities, and research organizations.
Starting as a small collection of programs, it has grown into a versatile operating system used in a
wide range of environments and applications. Today, versions of UNIX run on platforms ranging
from small embedded processors, to workstations and desktop systems, to high-performance multi-
processor systems serving a large community of users.
The UNIX system consists of a collection of user programs, libraries, and utilities, running
on the UNIX operating system, which provides a run-time environment and system services for
these applications. This book examines the design and implementation of the operating system it-
self, and does not describe the applications and tools that run on it. While UNIX began life in Bell
Telephone Laboratories (BTL), which was responsible for all its early releases, it has since been
embraced by several companies and universities. This has led to a proliferation of UNIX variants in
the marketplace. All these variants loosely support a core set of interfaces, applications, and features
routinely expected from a "UNIX system." They differ in their internal implementation, detailed
semantics of the interfaces, and the set of "value-added" features they provide. This book devotes
greater attention to baseline releases such as Novell, Inc.'s System V Release 4 (SVR4), the Uni-
versity of California's Berkeley Software Distribution (4.xBSD), and Carnegie-Mellon University's
Mach. It also discusses a number of commercial implementations such as SunOS and Solaris from
Sun Microsystems, Digital UNIX from Digital Equipment Corporation, and HP-UX from Hewlett-
Packard Corporation.
This chapter introduces the UNIX operating system. It begins with a brief history of the
birth, maturation, and industry acceptance of the UNIX system. It then discusses the factors that
have influenced the evolution of the system. Finally it discusses directions in which UNIX may
continue to evolve.
UNIX Programmer's Manual. Since then, there have been a total of ten editions of this manual, cor-
responding to ten versions of UNIX released by BTL.
The first several releases were strictly internal to BTL. The third edition, in February 1973,
included cc, the C compiler. That same year, UNIX was rewritten in C (resulting in version 4 in
November 1973), a step that had a tremendous impact on its future success. Thompson and Ritchie
co-authored the first UNIX paper, The UNIX Time-Sharing System [Thom 74]. It was presented at
the ACM Symposium on Operating Systems Principles (SOSP) in October 1973 and published in the
Communications of the ACM in July 1974. This paper gave the outside world its first look at UNIX.
1.1.3 Proliferation
In 1956, as a result of antitrust litigation by the Department of Justice against AT&T and the West-
ern Electric Company, AT&T signed a "consent decree" with the federal government. The terms of
this agreement prevented AT&T from manufacturing any equipment not related to telephone or
telegraph services, or engaging in business other than furnishing "common carrier communication
services."
As a result, AT&T took the view that it could not market computing products. On the other
hand, the SOSP presentation resulted in numerous requests for UNIX software and sources. AT&T
provided the UNIX system to universities for educational and research purposes, royalty-free and
under simple licensing agreements. It did not advertise or market the system and did not support its
releases. One of the earliest such licensees was the University of California at Berkeley, which ob-
tained the UNIX system in December 1973.
Under these conditions UNIX systems quickly proliferated throughout the world. By 1975
they had spread to sites as far apart as the Hebrew University of Jerusalem, the University of New
South Wales in Australia, and the University of Toronto in Canada. The first UNIX port was to the
Interdata machine. The port was completed independently by the University of Wollongong in
1976, and again by Ritchie and Steve Johnson at BTL in 1977.
Version 7 UNIX, released in January 1979, was the first truly portable UNIX system, and
greatly influenced future development of UNIX. Its initial release ran on the PDP-11 and the Inter-
data 8/32. It was both more robust and provided significantly greater functionality than version 6; it
was also considerably slower. Several UNIX licensees responded by improving its performance in
several areas; AT&T incorporated many of these improvements in future releases. This spirit of co-
operation between its keepers and users (which, unfortunately, deteriorated considerably once UNIX
became commercially successful) was a key factor in the rapid growth and rising popularity of
UNIX.
Soon UNIX was ported to several other architectures. Microsoft Corporation and the Santa
Cruz Operation (SCO) collaborated to port UNIX to the Intel 8086, resulting in XENIX, one of the
earliest commercial UNIX variants. In 1978 Digital introduced the 32-bit VAX-11 computer. After
being turned down by Ritchie, Thompson, and Johnson, Digital approached a group in the Holmdel,
New Jersey, branch of BTL to port UNIX to the VAX. This was the first port to a 32-bit machine,
and the resulting version was called UNIX/32V. This version was sent to the University of Califor-
nia at Berkeley, where it evolved into 3BSD in 1979.
1.1.4 BSD
The University of California at Berkeley obtained one of the first UNIX licenses in December 1974.
Over the next few years, a group of graduate students including Bill Joy and Chuck Haley devel-
oped several utilities for it, including the ex editor (which was later followed by vi) and a Pascal
compiler. They bundled these additions into a package called the Berkeley Software Distribution
(BSD) and sold it in the spring of 1978 at $50 per license. The initial BSD releases (version 2 was
shipped in late 1978) consisted solely of applications and utilities, and did not modify or redistribute
the operating system. One of Joy's early contributions was the C shell [Joy 86], which provided
facilities such as job control and command history not available in the Bourne shell.
In 1978 Berkeley obtained a VAX-11/780 and the UNIX/32V that had been ported to it by
the BTL group in Holmdel, New Jersey. The VAX had a 32-bit architecture, allowing a 4-gigabyte
address space, but only 2 megabytes of physical memory. Around the same time, Ozalp Babaoglu
designed a paging-based virtual memory system for the VAX, and incorporated it into UNIX. The
result, released as 3BSD in late 1979, was the first operating system release from Berkeley.
The virtual memory work prompted the Defense Advanced Research Projects Agency
(DARPA) to fund the development of UNIX systems at Berkeley. One of the major goals of the
DARPA project was to integrate the Transmission Control Protocol/Internet Protocol (TCP/IP)
network protocol suite. With DARPA funding, Berkeley produced several BSD releases collectively
called 4BSD: 4.0BSD in 1980, 4.1BSD in 1981,3 4.2BSD in 1983, 4.3BSD in 1986, and 4.4BSD in
1993.
The Berkeley team was responsible for many important technical contributions. Besides
virtual memory and the incorporation of TCP/IP, BSD UNIX introduced the Fast File System
(FFS), a reliable signals implementation, and the sockets facility. 4.4BSD replaced the original vir-
tual memory design with a new version based on Mach (see Section 1.1.7), and added other en-
hancements such as a log-structured file system.
The work on UNIX at Berkeley was performed by the Computer Science Research Group
(CSRG). With 4.4BSD, CSRG decided to close shop and discontinue UNIX development. The ma-
jor reasons cited [Bost 93] were:
• Scarcity of grants and funds.
• BSD features were now available in a number of commercial systems.
• The system had become too large and complex for a small group to architect and maintain.
A company called Berkeley Software Design, Inc. (BSDI) was formed to commercialize and
market 4.4BSD. Since most of the original UNIX source code had been replaced with new code de-
veloped at Berkeley, BSDI claimed that the source code in its BSD/386 release was completely free
of AT&T licenses. UNIX System Laboratories, the AT&T subsidiary responsible for UNIX devel-
opment, filed a lawsuit against BSDI and the Regents of the University of California, claiming
3 There were three separate releases of 4.1BSD: 4.1a, 4.1b, and 4.1c.
copyright infringement, breach of contract, and misappropriation of trade secrets [Gerb 92]. The
lawsuit was sparked by BSDI's use of the phone number 1-800-ITS-UNIX to sell the source code.
The university countersued, and the resulting litigation delayed the release. On February 4, 1994,
the case was settled out of court, with all parties dropping their claims. BSDI announced the avail-
ability of 4.4BSD-lite, sold with unencumbered source code, for around $1000.
1.1.5 System V
Going back to AT&T, its legal battles with the Justice Department culminated in a landmark decree
in 1982. As a result of this decree, Western Electric was dissolved, the regional operating companies
were divested from AT&T and formed the "Baby Bells," and Bell Telephone Laboratories was
separated and renamed AT&T Bell Laboratories. Also, AT&T was allowed to enter the computer
business.
While the research group at BTL continued to work on UNIX, the responsibility for external
releases shifted from them to the UNIX Support Group, then to the UNIX System Development
Group, and then to AT&T Information Systems. Among them, these groups released System III in
1982, System V in 1983, System V Release 2 (SVR2) in 1984, and Release 3 (SVR3) in 1987. AT&T
marketed System V aggressively, and several commercial UNIX implementations are based on it.
System V UNIX introduced many new features and facilities. Its virtual memory implemen-
tation, called the regions architecture, was quite different from that of BSD. SVR3 introduced an
interprocess communication facility (including shared memory, semaphores, and message queues),
remote file sharing, shared libraries, and the STREAMS framework for device drivers and network
protocols. The latest System V version is Release 4 (SVR4), which will be discussed in Section
1.1.10.
1.1.6 Commercialization
The growing popularity of UNIX attracted the interest of several computer companies, who rushed
to commercialize and market their own versions of UNIX. Each began with a base release of UNIX
from either AT&T or Berkeley, ported it to their hardware, and enhanced it with their own value-
added features. In 1977 Interactive Systems became the first commercial UNIX vendor. Their first
release was called IS/1 and ran on the PDP-11s.
In 1982 Bill Joy left Berkeley to cofound Sun Microsystems, which released a 4.2BSD-
based variant called SunOS (and later, an SVR4-based variant called Solaris). Microsoft and
SCO jointly released XENIX. Later, SCO ported SVR3 onto the 386 and released it as SCO UNIX.
The 1980s saw a number of commercial offerings, including AIX from IBM, HP-UX from Hewlett-
Packard Corporation, and ULTRIX (followed by DEC OSF/1, later renamed to Digital UNIX) from
Digital.
The commercial variants introduced many new features, some of which were subsequently
incorporated in newer releases of the baseline systems. SunOS introduced the Network File System
(NFS), the vnode/vfs interface to support multiple file system types, and a new virtual memory ar-
chitecture that was adopted by SVR4. AIX was among the first to provide a commercial journaling
file system for UNIX. ULTRIX was one of the first multiprocessor UNIX systems.
1.1.7 Mach
A major reason for the popularity of the UNIX system was that it was small and simple, yet offered
many useful facilities. As the system incorporated more and more features, the kernel became large,
complex, and increasingly unwieldy. Many people felt that UNIX was moving away from the prin-
ciples that had made it elegant and successful.
In the mid-1980s researchers at Carnegie-Mellon University in Pittsburgh, PA, began work-
ing on a new operating system called Mach [Acce 86]. Their objective was to develop a microker-
nel, which provides a small set of essential services and a framework for implementing other operat-
ing system functions at the user level. The Mach architecture would support the UNIX
programming interface, run on uniprocessor and multiprocessor systems, and be suitable for a dis-
tributed environment. By starting afresh, they hoped to avoid many of the problems afflicting UNIX
at the time.
The basic approach was to have the microkernel export a few simple abstractions, and to
provide most of the functionality through a collection of user-level tasks called servers. Mach held
another advantage-it was unencumbered by AT&T licenses, making it attractive to many vendors.
Mach 2.5 is the most popular release, and commercial systems like OSF/1 and NextStep have been
based on it. The early versions of Mach featured monolithic kernels, with a higher-level layer pro-
viding a 4BSD UNIX interface. Mach 3.0 was the first microkernel implementation.
1.1.8 Standards
The proliferation of UNIX variants led to several compatibility problems. While all variants "looked
like UNIX" from a distance, they differed in many important respects. Initially, the industry was
torn by differences between AT&T's System V releases (the official UNIX), and the BSD releases
from Berkeley. The introduction of commercial variants worsened the situation.
System V and 4BSD differ in many ways-they have different, incompatible physical file
systems, networking frameworks, and virtual memory architectures. Some of the differences are re-
stricted to kernel design and implementation, but others manifest themselves at the programming
interface level. It is not possible to write a complex application that will run unmodified on System
V and BSD systems.
The commercial variants were each derived from either System V or BSD, and then aug-
mented with value-added features. These extra features were often inherently unportable. As a re-
sult, the application programmers were often very confused, and spent inordinate amounts of effort
making sure their programs worked on all the different flavors of UNIX.
This led to a push for a standard set of interfaces, and several groups began working on
them. The resulting standards were almost as numerous and diverse as the UNIX variants. Eventu-
ally, most vendors agreed upon a few standards. These include the System V Interface Definition
(SVID) from AT&T, the IEEE POSIX specifications, and the X/Open Portability Guide from the
X/Open Consortium.
Each standard deals with the interface between the programmer and the operating system,
and not with how the system implements the interface. It defines a set of functions and their detailed
semantics. Compliant systems must meet these specifications, but may implement the functions ei-
ther in the kernel, or in user-level libraries.
The standards deal with a subset of the functions provided by most UNIX systems. Theoreti-
cally, if programmers restrict themselves to using this subset, the resulting application should be
portable to any system that complies with the standard. This precludes the programmer from taking
advantage of added features of a particular variant, or making optimizations based on specific hard-
ware or operating system peculiarities, without compromising the portability of the code.
The SVID is essentially a detailed specification of the System V programming interface.
AT&T published three versions-SVID, SVID2, and SVID3 [AT&T 89], corresponding to SVR2,
SVR3, and SVR4, respectively. They allowed vendors to call their operating systems "System V"
only if they conformed to the SVID. AT&T also published the System V Verification Suite (SVVS),
which verifies if a system conforms to the SVID.
In 1986 the IEEE appointed a committee to publish a formal standard for operating system
environments. They adopted the name POSIX (Portable Operating Systems based on UNIX), and
their standard approximates an amalgam of the core parts of SVR3 and 4.3BSD UNIX. The
POSIX 1003.1 standard, commonly known as POSIX.1, was published in 1990 [IEEE 90]. It has
gained wide acceptance, in part because it does not align itself closely with a single UNIX variant.
X/Open is a consortium of international computer vendors. It was formed in 1984, not to
produce new standards, but to develop an open Common Applications Environment (CAE) based on
existing de facto standards. It published a seven-volume X/Open Portability Guide (XPG), whose
latest release is Issue 4 in 1993 [XPG4 93]. It is based on a draft of the POSIX.1 standard, but goes
beyond it by addressing many additional areas such as internationalization, window interfaces, and
data management.
ing system. It contained many advanced features not found in SVR4, such as complete multiproces-
sor support, dynamic loading, and logical volume management. The plan was for its founding
members to develop commercial operating systems based on OSF/1.
OSF and UI began as great rivals, but were quickly faced with a common outside threat. The
economic downturn in the early 1990s, along with the surge of Microsoft Windows, jeopardized the
growth, and even survival, of UNIX. UI went out of business in 1993, and OSF abandoned many of
its ambitious plans (such as the Distributed Management Environment). DEC OSF/1, released by
Digital in 1993, was the only major commercial system based on OSF/1. Over time, though, Digital
removed many OSF/1 dependencies from their operating system, and in 1995, changed its name to
Digital UNIX.
erating system, its originators began with a simple, extensible framework that was built upon incre-
mentally by contributions from all over-from the industry, academia, and enthusiastic users.
It is useful to examine the factors that motivate change and growth in an operating system.
In this section we look at the main factors that have influenced the growth of the UNIX system, and
speculate about the direction of its future growth.
1.2.1 Functionality
The biggest motivation for change is adding new features to the system. In the beginning, new
functionality was provided mainly by adding user-level tools and utilities. As the system matured,
its developers added many features to the UNIX kernel itself.
Much of the new functionality helps support more complex programs. The primary example
is the System V Interprocess Communications (IPC) suite, consisting of shared memory, sema-
phores, and message queues. Together, they allow cooperating processes to share data, exchange
messages, and synchronize their actions. Most modern UNIX systems also provide several levels of
support for writing multithreaded applications.
IPC and threads help the development of complex applications, such as those based on a cli-
ent-server model. In such programs, the server usually sits in a loop, waiting for client requests.
When a request arrives, the server processes it and waits for the next one. Since the server may have
to service several clients, it is desirable to handle multiple requests concurrently. With IPC, the
server may use a different process for each request, and these processes can share data and syn-
chronize with one another. A multithreaded system can allow the server to be implemented as a
single process with multiple, concurrently executing threads sharing a common address space.
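To make the use of these primitives concrete, the sketch below shows a process attaching a System V shared memory segment and updating a counter under the protection of a System V semaphore. This is an illustrative fragment, not an example taken from any particular system: the ftok() key, the permissions, the single-counter layout, and the omission of most error handling are assumptions made for brevity.

    /*
     * Illustrative sketch: System V shared memory plus a semaphore.
     * On some systems union semun is already defined in <sys/sem.h>;
     * here it is declared explicitly, as POSIX requires of the caller.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/sem.h>

    union semun { int val; struct semid_ds *buf; unsigned short *array; };

    int main(void)
    {
        key_t key = ftok("/tmp", 'V');            /* key shared by the cooperating processes */
        int shmid = shmget(key, sizeof(int), IPC_CREAT | 0600);
        int semid = semget(key, 1, IPC_CREAT | 0600);
        if (shmid < 0 || semid < 0) { perror("ipc"); exit(1); }

        union semun init = { .val = 1 };
        semctl(semid, 0, SETVAL, init);           /* binary semaphore, initially free
                                                     (in practice only the creator does this) */
        int *counter = shmat(shmid, NULL, 0);     /* map the segment into this address space */

        struct sembuf lock = { 0, -1, 0 }, unlock = { 0, 1, 0 };
        semop(semid, &lock, 1);                   /* P: enter the critical section */
        (*counter)++;                             /* shared data seen by all attached processes */
        semop(semid, &unlock, 1);                 /* V: leave the critical section */

        printf("counter = %d\n", *counter);
        shmdt(counter);
        return 0;
    }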
Perhaps the most visible part of an operating system is its file system, which too has incor-
porated many new features. These include support for first-in, first-out (FIFO) files, symbolic links,
and files larger than a disk partition. Modern UNIX systems support file and byte-range locks, ac-
cess-control lists, and per-user disk quotas.
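As one example of these facilities, byte-range locks are commonly requested through the fcntl() system call. The fragment below is only a sketch under assumed conventions (fixed-size records, a blocking F_SETLKW request, no error handling) and is not drawn from any particular implementation.

    /* Illustrative sketch: lock one fixed-size record of an open file for writing. */
    #include <fcntl.h>
    #include <unistd.h>

    int lock_record(int fd, off_t recno, off_t recsize)
    {
        struct flock fl = {
            .l_type   = F_WRLCK,          /* exclusive (write) lock */
            .l_whence = SEEK_SET,
            .l_start  = recno * recsize,  /* byte offset of the record */
            .l_len    = recsize,          /* length of the locked range */
        };
        return fcntl(fd, F_SETLKW, &fl);  /* block until the range becomes free */
    }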
1.2.2 Networking
The part of the kernel that has undergone the greatest change is the networking subsystem. The
early UNIX systems ran standalone and could not communicate with other machines. The prolifera-
tion of computer networks made it imperative for UNIX to support them. The first major undertak-
ing was at Berkeley, where DARPA funded the project to integrate the TCP/IP suite into 4BSD.
Today UNIX systems support a number of network interfaces (such as Ethernet, FDDI, and ATM),
protocols (such as TCP/IP, UDP/IP, and SNA), and frameworks (such as sockets and STREAMS).
The ability to connect to other machines impacted the system in many ways. Soon users
wanted to share files among connected machines and run programs on remote nodes. To meet this
challenge, UNIX systems evolved in three directions:
• Many new distributed file systems were developed, which allow almost transparent access
to files on remote nodes. The most successful of these are Sun Microsystems' Network
File System (NFS), Carnegie-Mellon University's Andrew File System (AFS), and Tran-
sarc Corporation's Distributed File System (DFS).
• A number of distributed services allow sharing of information in a network. These are
normally implemented as user-level programs based on a client-server model, and use re-
mote procedure calls to invoke operations on other machines. Some examples are Sun Mi-
crosystems' Network Information Service (NIS) and the Open Software Foundation's Dis-
tributed Computing Environment (DCE).
• Distributed operating systems such as Mach, Chorus, and Sprite provided varying
amounts of UNIX compatibility and were marketed as base technologies on which to
build future versions of distributed UNIX systems.
1.2.3 Performance
Improving system performance is a constant motivation for change. Competing UNIX vendors
make great efforts to demonstrate or claim that their system performs better than that of their rivals.
Nearly every kernel subsystem has seen major changes solely to improve performance.
In the early 1980s Berkeley introduced the Fast File System, which took advantage of intel-
ligent disk block allocation policies to improve performance. Faster file systems followed, using
extent-based allocation and journaling techniques. Performance improvements also motivated many
developments in the areas of interprocess communications, memory management, and multi-
threaded processes. One processor was insufficient for many applications, and vendors developed
multiprocessor UNIX systems, some with hundreds of CPUs.
hundred. Memory sizes and the disk space per user have grown by more than a factor of twenty.
Memory and disk speeds, on the other hand, have barely doubled.
In the 1970s, UNIX performance was limited by the processor speed and memory size.
Hence the UNIX kernel made heavy use of techniques such as swapping and (later) paging to juggle
a number of processes in the small memory. As time progressed, memory and CPU speed became
less of an issue, and the system became I/O-bound, spending much of its time moving pages be-
tween the disks and main memory. This provoked considerable research in file system, storage, and
virtual memory architectures to reduce the disk bottleneck, leading to the invention of Redundant
Arrays of Inexpensive Disks (RAID) and the proliferation of log-structured file systems.
tific number-crunching applications). Database servers run a database engine and handle queries
and transactions submitted by clients. The servers are powerful, high-end machines with fast proc-
essors, and plenty of memory and disk space. The client workstations have relatively less processing
power, memory, and storage, but have good display and interactive features.
As workstations grew more powerful, the differences between clients and servers began to
blur. Moreover, centralizing important services on a small number of servers led to network con-
gestion and server overload. The result was one more paradigm shift, this time to distributed com-
puting. In this model, a number of machines collaborated to provide a network-based service. For
instance, each node might have a local file system, which it makes available to other nodes. Hence
each node acts as a server for its local files, and a client for files on other nodes. This avoids net-
work congestion and single points of failure.
The UNIX system has adapted to the different models of computing. For instance, early
UNIX releases had only a local file system. The support for network protocols was followed by the
development of distributed file systems. Some of these, such as early versions of AFS, required
centralized, dedicated servers. In time, they evolved into distributed file systems, where a single
machine could be both a client and a server.
Many people felt that the change was not entirely for the better, and that the system had be-
come large, cluttered, and disorganized. This led to many efforts to rewrite the system, or to write a
new operating system that was based on the original UNIX philosophy, but was more extensible and
modular. The most successful of these was Mach, which was the basis of commercial implementa-
tions such as OSF/1 and NextStep. Mach migrated to a microkernel architecture (see Section 1.1.7),
in which a small kernel provides the framework for running programs, and user-level server tasks
provide other functions.
Efforts at controlling the kernel size have been only moderately successful. Microkernels
have never been able to provide performance comparable to the traditional, monolithic kernel, pri-
marily due to the overhead of message passing. Some less ambitious efforts have been more bene-
ficial, such as modularization (see Section 1.2.9), pageable kernels, and dynamic loading, which al-
lows some components to be loaded into and out of the kernel as necessary.
1.2.9 Flexibility
In the 1970s and the early 1980s UNIX kernels were not very versatile. They supported a single
type of file system, scheduling policy, and executable file format (Figure 1-1). The only flexibility
was offered by the block and character device switches, which allow different types of devices to be
accessed through a common interface. The development of distributed file systems in the mid-1980s
made it essential for UNIX systems to support both remote and local file systems. Similarly, fea-
tures such as shared libraries required different executable file formats. The UNIX system had to
support these new formats, as well as the traditional a.out format for compatibility. The coexistence
of multimedia and real-time applications with normal interactive programs required scheduler sup-
port for different classes of applications.
In summary, the broadening use of UNIX systems required a more flexible operating system
that could support several different methods of performing the same task. This need instigated the
development of many flexible frameworks, such as the vnode/vfs interface, exec switch, scheduling
classes, and segment-based memory architecture. The modern UNIX kernel is very similar to the
system shown in Figure 1-2. Each of the outer circles represents an interface that may be imple-
mented in a number of ways.
[Figure 1-2: The modern UNIX kernel. Pluggable implementations sit behind common interfaces: executable formats (e.g., elf), file systems (NFS, FFS, s5fs, device mappings), scheduling classes (time-sharing, real-time, and system processes), and device drivers (network driver, tty driver).]
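The following sketch suggests, in simplified form, how such an interface can be expressed in C as a table of function pointers, in the spirit of the vnode/vfs interface mentioned above. The structure and member names are invented for illustration and do not correspond to any specific kernel's definitions.

    /*
     * Illustrative only: a vnode-style operations vector.  Each file system
     * (FFS, NFS, s5fs, ...) supplies its own table; file-system-independent
     * code calls through the table without knowing which implementation
     * lies behind it.
     */
    #include <sys/types.h>

    struct vnode;

    struct vnodeops {
        int (*vop_lookup)(struct vnode *dir, const char *name, struct vnode **vpp);
        int (*vop_read)(struct vnode *vp, void *buf, size_t len, off_t offset);
        int (*vop_write)(struct vnode *vp, const void *buf, size_t len, off_t offset);
    };

    struct vnode {
        const struct vnodeops *v_op;   /* implementation selected when the file is looked up */
        void *v_data;                  /* file-system-private state */
    };

    /* The file-system-independent layer is the same for every file system type. */
    static int vn_read(struct vnode *vp, void *buf, size_t len, off_t offset)
    {
        return vp->v_op->vop_read(vp, buf, len, offset);
    }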
The efforts at Berkeley furthered this trend. Overall, UNIX evolved through an extremely
open process (or lack of process). Contributions to the operating system came from academia, in-
dustry, and enthusiastic hackers from several different countries and continents. Even when UNIX
became commercialized, many vendors recognized the value of open systems and made their inno-
vations accessible to others, creating open specifications such as NFS.
The original UNIX system was very well designed and formed a successful basis for a num-
ber of later versions and offshoots. One of its greatest strengths was its adherence to the "Small is
Beautiful" philosophy [Allm 87]. A small kernel provided a minimal set of essential services. Small
utilities performed simple manipulations of data. The pipe mechanism, along with the programma-
ble shell, allowed users to combine these utilities in many different ways to create powerful tools.
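The pipe mechanism behind such combinations is itself small. The sketch below shows roughly what a shell might do to run a two-command pipeline such as who | wc -l; the particular commands, the single pipe, and the omission of error handling are simplifications for illustration only.

    /* Rough sketch of a two-command pipeline built from pipe(), fork(), dup2(), and exec. */
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        int fd[2];
        pipe(fd);                            /* fd[0]: read end, fd[1]: write end */

        if (fork() == 0) {                   /* first child: the producer */
            dup2(fd[1], 1);                  /* its standard output becomes the pipe */
            close(fd[0]); close(fd[1]);
            execlp("who", "who", (char *)0);
        }
        if (fork() == 0) {                   /* second child: the consumer */
            dup2(fd[0], 0);                  /* its standard input comes from the pipe */
            close(fd[0]); close(fd[1]);
            execlp("wc", "wc", "-l", (char *)0);
        }
        close(fd[0]); close(fd[1]);          /* the parent keeps no pipe ends open */
        wait(NULL);
        wait(NULL);
        return 0;
    }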
The UNIX file system exemplified the small, simple approach. Unlike other contemporary
operating systems, which had complex file access methods such as Indexed Sequential Access
Method (ISAM) or Hierarchical Sequential Access Method (HSAM), UNIX treated files as merely a
sequence of bytes. Applications could impose any structure on the file's contents and devise their
own access methods, without the file system getting in their way.
Most system applications used lines of text to represent their data. For instance, important
system databases such as the /etc/passwd, /etc/fstab, and /etc/ttys files were ordinary text files.
While it may have been more efficient to store the information in structured, binary format, the text
representation allowed users to read and manipulate these files without special tools. Text is a famil-
iar, universal, and highly portable data form, easily manipulated by a variety of utilities.
Another outstanding feature of UNIX was its simple, uniform interface to I/O devices. By
representing all devices as files, UNIX allows the user to use the same set of commands and system
calls to manipulate and access devices as well as files. Developers can write programs that perform
I/O without having to check if the I/O is performed to a file, user terminal, printer, or other device.
This, along with the I/O redirection features of the shell, provides a simple and powerful I/O inter-
face.
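A minimal sketch of this uniformity follows: the same read()/write() loop serves a disk file, a terminal, a pipe, or another device, with no device-specific code. The paths opened here are arbitrary examples chosen for the illustration.

    /* Sketch: one copy loop for any pair of descriptors, whatever they refer to. */
    #include <fcntl.h>
    #include <unistd.h>

    static void copy(int in, int out)
    {
        char buf[4096];
        ssize_t n;

        while ((n = read(in, buf, sizeof buf)) > 0)
            write(out, buf, (size_t)n);            /* identical call for files and devices */
    }

    int main(void)
    {
        int src = open("/etc/passwd", O_RDONLY);   /* an ordinary file ...        */
        int dst = open("/dev/tty", O_WRONLY);      /* ... copied to a device node */
        copy(src, dst);
        return 0;
    }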
A key to the success and proliferation of UNIX is its portability. The bulk of the kernel is
written in the C language. This allows it to be ported to new machines with relatively little effort. It
was first available on the popular PDP-11, and then ported to the VAX-11, also a popular machine.
Many vendors could develop new machines and simply port UNIX to them, rather than having to
write new operating systems.
example, while the read and write system calls are atomic (indivisible) operations on the file, the
buffering in the I/O library loses the atomicity.
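A brief, hedged illustration of the difference: when a complete message is issued as a single write(), other writers cannot interleave within it, whereas the standard I/O library may flush the same message in several pieces as its buffer fills. The function names below are invented for this example.

    /* Illustration: unbuffered write() versus buffered stdio output to a shared log. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    void log_line_atomic(int fd, const char *line)
    {
        write(fd, line, strlen(line));   /* one system call: the line is written as a unit */
    }

    void log_line_buffered(FILE *fp, const char *line)
    {
        fputs(line, fp);                 /* copied into the stdio buffer; it may later be
                                            flushed in pieces, so concurrent writers can
                                            interleave within a line */
    }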
While UNIX is an excellent operating system, most users want not an operating system, but
simply the ability to do a particular task. These users are not interested in the elegance of the under-
lying file system structure or process model. They want to run specific applications (such as editors,
financial packages, drawing programs) with a minimum of expense and bother. The lack of a sim-
ple, uniform (and preferably graphical) user interface in early UNIX systems was a major deterrent
to its acceptance among the masses. In Ritchie's words, "UNIX is simple and coherent, but it takes a
genius (or at any rate, a programmer) to understand and appreciate its simplicity."
The building-block approach to tools is as much a bane as it is a boon. While elegant and
aesthetically pleasing, it requires creativity and imagination to use effectively. Many users prefer the
integrated, do-it-all programs such as those available for personal computers.
In some ways UNIX was a victim of its own success. Its simple licensing terms and port-
ability encouraged uncontrolled growth and proliferation. As people tinkered with the system, each
group changed it in a different way, often with incompatible results. At first there were two major
strains-AT&T and BSD, each with a different file system, memory architecture, and signal and
terminal handling framework. Soon many vendors released their own variants, trying for some level
of compatibility with both AT&T and BSD versions. The situation became more chaotic, and many
application developers had to spend great effort porting their programs to all the different flavors of
UNIX.
The standardization efforts were only partly successful, since they met with opposition from
the very people contributing to the process. This is because vendors needed to add unique features
for "product differentiation," to show that their product was different from, and superior to, that of
their competitors.
Richard Rashid, one of the principal developers of Mach, offers further insight into the fail-
ures of UNIX. In the introductory talk of the Mach Lecture Series [Rash 89], he explains how the
motivation for Mach grew out of observations on the evolution of the UNIX system. UNIX has a
minimalist, building-block approach to tool building. Large, complex tools are created by combin-
ing small, simple ones. Yet the same approach is not carried over to the kernel.
The traditional UNIX kernel is not sufficiently flexible or extensible, and has few facilities
for code reuse. As UNIX grew, developers simply added code to the kernel, which became a
"dumping ground" for new features. Very soon the kernel became bloated, unmodular, and com-
plex. Mach tries to solve these problems by rewriting the operating system from the ground up,
based on a small number of abstractions. Modern UNIX systems have tackled this problem differ-
ently, adding flexible frameworks to several subsystems, as described in Section 1.2.9.
ants are derived from any one of the baseline systems and contain value-added features and en-
hancements from the vendor. These include Sun Microsystems' SunOS and Solaris 2.x, IBM's AIX,
Hewlett-Packard's HP-UX, and Digital's ULTRIX and Digital UNIX.
This book does not focus on a specific release or variant of the UNIX system. Instead, it ex-
amines a number of important implementations and compares their architecture and approach to
many important problems. SVR4 receives the most attention, but there is ample coverage of
4.3BSD, 4.4BSD, and Mach. Among commercial variants, the book gives maximum coverage to
SunOS and Solaris 2.x, not only because of their importance in the UNIX market, but also because
Sun Microsystems has been responsible for many technical contributions that have subsequently
been integrated into the baseline releases, and because of the plethora of published work on their
systems.
Often the book makes generic references to traditional UNIX or modern UNIX. By tradi-
tional UNIX we mean SVR3, 4.3BSD, and earlier versions. We often discuss features or properties
of traditional systems (for instance, "traditional UNIX systems had a single type of file system").
While there are many differences between SVR3 and 4.3BSD in each subsystem, there is also a lot
of common ground, and such generic discussions focus on these common themes. When talking
about modern UNIX systems, we mean SVR4, 4.4BSD, Mach, and systems derived from these.
Again, general comments, such as "Modern UNIX systems provide some kind of a journaling file
system," describe features available in a large number of modern systems, but not necessarily in all
of them.
1.5 References
[Acce 86] Accetta, M., Baron, R., Golub, D., Rashid, R., Tevanian, A., and Young, M., "Mach:
A New Kernel Foundation for UNIX Development," Proceedings of the Summer
1986 USENIX Technical Conference, Jun. 1986, pp. 93-112.
[Allm 87] Allman, E., "UNIX: The Data Forms," Proceedings of the Winter 1987 USENIX
Technical Conference, Jan. 1987, pp. 9-15.
[AT&T 89] American Telephone and Telegraph, The System V Interface Definition (SVID), Third
Edition, 1989.
[Bost 93] Bostic, K., "4.4BSD Release," ;login:, Vol. 18, No. 5, Sep.-Oct. 1993, pp. 29-31.
[Bour 78] Bourne, S.R., "The UNIX Shell," The Bell System Technical Journal, Vol. 57, No. 6,
Part 2, Jul.-Aug. 1978, pp. 1971-1990.
[Gerb 92] Gerber, C., "USL Vs. Berkeley," UNIX Review, Vol. 10, No. 11, Nov. 1992, pp. 33-
36.
[IEEE 90] Institute of Electrical and Electronics Engineers, Information Technology-Portable
Operating System Interface (POSIX) Part 1: System Application Program Interface
(API) [C Language], 1003.1-1990, IEEE, Dec. 1990.
[Joy 86] Joy, W.N., Fabry, R.S., Leffler, S.J., McKusick, M.K., and Karels, M.J., "An
Introduction to the C Shell," UNIX User's Supplementary Documents, 4.3 Berkeley
Software Distribution, Virtual VAX-11 Version, USENIX Association, 1986, pp.
4:1-46.
[Orga 72] Organick, E.J., The Multics System: An Examination of Its Structure, The MIT Press,
Cambridge, MA, 1972.
[Rash 89] Rashid, R.F., "Mach: Technical Innovations, Key Ideas, Status," Mach 2.5 Lecture
Series, OSF Research Institute, 1989.
[Rich 82] Richards, M., and Whitby-Strevens, C., BCPL: The Language and Its Compiler,
Cambridge University Press, Cambridge, UK, 1982.
[Ritc 78] Ritchie, D.M., and Thompson, K., "The UNIX Time-Sharing System," The Bell
System Technical Journal, Vol. 57, No. 6, Part 2, pp. 1905-1930, Jul.-Aug. 1978.
[Ritc 87] Ritchie, D.M., "Unix: A Dialectic," Proceedings of the Winter 1987 USENIX
Technical Conference, Jan. 1987, pp. 29-34.
[Salu 94] Salus, P.H., A Quarter Century of UNIX, Addison-Wesley, Reading, MA, 1994.
[Thorn 74] Thompson, K., and Ritchie, D.M., "The UNIX Time-Sharing System,"
Communications of the ACM, Vol. 17, No. 7, Jul. 1974, pp. 365-375.
[XPG4 93] The X/Open Portability Guide (XPG), Issue 4, Prentice-Hall, Englewood Cliffs, NJ,
1993.
2
The Process and the Kernel
2.1 Introduction
The principal function of an operating system is to provide an execution environment in which user
programs (applications) may run. This involves defining a basic framework for program execution,
and providing a set of services-such as file management and I/O-and an interface to these serv-
ices. The UNIX system presents a rich and versatile programming interface [Kern 84] that can effi-
ciently support a variety of applications. This chapter describes the main components of the UNIX
systems and how they interact to provide a powerful programming paradigm.
There are several different UNIX variants. Some of the important ones are the System V re-
leases from AT&T (SVR4, the latest System V release, is now owned by Novell), the BSD releases
from the University of California at Berkeley, OSF/1 from the Open Software Foundation, and
SunOS and Solaris from Sun Microsystems. This chapter describes the kernel and process architec-
ture of traditional UNIX systems, that is, those based on SVR2 [Bach 86], SVR3 [AT&T 87],
4.3BSD [Leff 89], or earlier versions. Modern UNIX variants such as SVR4, OSF/1, 4.4BSD, and
Solaris 2.x differ significantly from this basic model; the subsequent chapters explore the modern
releases in detail.
The UNIX application environment contains one fundamental abstraction-the process. In
traditional UNIX systems, the process executes a single sequence of instructions in an address
space. The address space of a process comprises the set of memory locations that the process may
reference or access. The control point of the process tracks the sequence of instructions, using a
hardware register typically called the program counter (PC). Many newer UNIX releases support
multiple control points (called threads [IEEE 94]), and hence multiple instruction sequences, within
a single process.
The UNIX system is a multiprogramming environment, i.e., several processes are active in
the system concurrently. To these processes, the system provides some features of a virtual ma-
chine. In a pure virtual machine architecture the operating system gives each process the illusion
that it is the only process on the machine. The programmer writes an application as if only its code
were running on the system. In UNIX systems each process has its own registers and memory, but
must rely on the operating system for I/O and device control.
The process address space is virtual, 1 and normally only part of it corresponds to locations in
physical memory. The kernel stores the contents of the process address space in various storage ob-
jects, including physical memory, on-disk files, and specially reserved swap areas on local or re-
mote disks. The memory management subsystem of the kernel shuffles pages (fixed-size chunks) of
process memory between these objects as convenient.
Each process also has a set of registers, which correspond to real, hardware registers. There
are many active processes in the system, but only one set of hardware registers. The kernel keeps
the registers of the currently running process in the hardware registers and saves those of other
processes in per-process data structures.
Processes contend for the various resources of the system, such as the processor (also known
as the Central Processing Unit or CPU), memory, and peripheral devices. An operating system must
act as a resource manager, distributing the system resources optimally. A process that cannot ac-
quire a resource it needs must block (suspend execution) until that resource becomes available.
Since the CPU is one such resource, only one process can actually run at a time on a uniprocessor
system. The rest of the processes are blocked, waiting for either the CPU or other resources. The
kernel provides an illusion of concurrency by allowing one process to have the CPU for a brief pe-
riod of time (called a quantum, usually about 10 milliseconds), then switching to another. In this
way each process receives some CPU time and makes progress. This method of operation is known
as time-slicing.
From another perspective, the computer provides several facilities to the user, such as the
processor, disks, terminals, and printers. Application programmers do not wish to be concerned with
the low-level details of the functionality and architecture of these components. The operating sys-
tem assumes complete control of these devices and offers a high-level, abstract programming inter-
face that applications can use to access these components. It hides all the details of the hardware,
greatly simplifying the work of the programmer. 2 By centralizing all control of the devices, it also
provides additional facilities such as access synchronization (if two users want the same device at
the same time) and error recovery. The application programming interface (API) defines the se-
mantics of all interactions between user programs and the operating system.
I There are some UNIX systems that do not use virtual memory. These include the earliest UNIX releases (the first
virtual memory systems appeared in the late 1970s-see Section 1.1.4) and some real-time UNIX variants. This book
deals only with UNIX systems that have virtual memory.
2 The UNIX system takes this too far in some cases; for example, its treatment of tape drives as character streams
makes it difficult for applications to properly handle errors and exceptional cases. The tape interface is inherently re-
cord-based and does not fit nicely with the UNIX device framework [Allm 87].
We have already started referring to the operating system as an entity that does things. What
exactly is this entity? On one hand, an operating system is a program (often called the kernel) that
controls the hardware and creates, destroys, and controls all processes (see Figure 2-1). From a
broader perspective, an operating system includes not just the kernel, but also a host of other pro-
grams and utilities (such as the shells, editors, compilers, and programs like date, ls, and who) that
together provide a useful work environment. Obviously, the kernel alone is of limited use, and users
purchasing the UNIX system expect many of these other programs to come with it. The kernel,
however, is special in many ways. It defines the programming interface to the system. It is the only
indispensable program, without which nothing can run. While several shells or editors may run con-
currently, only a single kernel may be loaded at a time. This book is devoted to studying the kernel,
and when it mentions the operating system, or UNIX, it means the kernel, unless specified other-
wise.
To rephrase the earlier question, "What exactly is the kernel?" Is it a process, or something
distinct from all processes? The kernel is a special program that runs directly on the hardware. It
implements the process model and other system services. It resides on disk in a file typically called
/vmunix or /unix (depending on the UNIX vendor). When the system starts up, it loads the kernel
from disk using a special procedure called bootstrapping. The kernel initializes the system and sets
up the environment for running processes. It then creates a few initial processes, which in turn cre-
ate other processes. Once loaded, the kernel remains in memory until the system is shut down. It
manages the processes and provides various services to them.
The UNIX operating system provides functionality in four ways:
• User processes explicitly request services from the kernel through the system call interface
(see Figure 2-1), the central component of the UNIX API. The kernel executes these re-
quests on behalf of the calling process.
• Some unusual actions of a process, such as attempting to divide by zero, or overflowing
the user stack, cause hardware exceptions. Exceptions require kernel intervention, and the
kernel handles them on behalf of the process.
• The kernel handles hardware interrupts from peripheral devices. Devices use the interrupt
mechanism to notify the kernel of I/O completion and status changes. The kernel treats
interrupts as global events, unrelated to any specific process.
• A set of special system processes, such as the swapper and the pagedaemon, perform sys-
tem-wide tasks such as controlling the number of active processes or maintaining a pool of
free memory.
The following sections describe these different mechanisms and define the notion of the
execution context of a process.
the MMU registers have the necessary information. Occasionally, the kernel must access the address
space of a process other than the current one. It does so indirectly, using special, temporary map-
pings.
While the kernel is shared by all processes, system space is protected from user-mode ac-
cess. Processes cannot directly access the kernel, and must instead use the system call interface.
When a process makes a system call, it executes a special sequence of instructions to put the system
in kernel mode (this is called a mode switch) and transfer control to the kernel, which handles the
operation on behalf of the process. After the system call is complete, the kernel executes another set
of instructions that returns the system to user mode (another mode switch) and transfers control
back to the process. The system call interface is described further in Section 2.4.1.
There are two important per-process objects that, while managed by the kernel, are often
implemented as part of the process address space. These are the u area (also called the user area)
and the kernel stack. The u area is a data structure that contains information about a process of in-
terest to the kernel, such as a table of files opened by the process, identification information, and
saved values of the process registers when the process is not running. The process should not be al-
lowed to change this information arbitrarily, and hence the u area is protected from user-mode ac-
cess. (Some implementations allow the process to read, but not modify, the u area.)
The UNIX kernel is re-entrant, meaning that several processes may be involved in kernel
activity concurrently. In fact, they may even be executing the same routine in parallel. (Of course,
only one process can actually run at a time; the others are blocked or waiting to run.) Hence each
process needs its own private kernel stack, to keep track of its function call sequence when execut-
ing in the kernel. Many UNIX implementations allocate the kernel stack in the address space of
each process, but do not allow user-mode access to it. Conceptually, both the u area and the kernel
stack, while being per-process entities in the process space, are owned by the kernel.
Another important concept is the execution context. Kernel functions may execute either in
process context or in system context. In process context, the kernel acts on behalf of the current
process (for instance, while executing a system call), and may access and modify the address space,
u area, and kernel stack of this process. Moreover, the kernel may block the current process if it
must wait for a resource or device activity.
The kernel must also perform certain system-wide tasks such as responding to device inter-
rupts and recomputing process priorities. Such tasks are not performed on behalf of a particular
process, and hence are handled in system context (also called interrupt context). When running in
system context, the kernel may not access the address space, u area, or kernel stack of the current
process. The kernel may not block when executing in system context, since that would block an in-
nocent process. In some situations there may not even be a current process, for example, when all
processes are blocked awaiting I/O completion.
This far, we have noted the distinctions between user and kernel mode, process and system
space, and process and system context. Figure 2-2 summarizes these notions. User code runs in user
mode and process context, and can access only the process space. System calls and exceptions are
handled in kernel mode but in process context, and may access process and system space. Interrupts
are handled in kernel mode and system context, and must only access system space.
[Figure 2-2. Execution mode, context, and space:
    user mode, process context    - access process space only
    kernel mode, process context  - access process and system space
    kernel mode, system context   - access system space only]
2.3 The Process Abstraction
What exactly is a UNIX process? One oft-quoted answer is, "A process is an instance of a running
program." Going beyond perfunctory definitions, it is useful to describe various properties of the
process. A process is an entity that runs a program and provides an execution environment for it. It
comprises an address space and a control point. The process is the fundamental scheduling entity-
only one process runs on the CPU at a time. In addition, the process contends for and owns various
system resources such as devices and memory. It also requests services from the system, which the
kernel performs on its behalf.
The process has a definite lifetime-most processes are created by a fork or vfork system
call and run until they terminate by calling exit. During its lifetime, a process may run one or many
programs (usually one at a time). It invokes the exec system call to run a new program.
UNIX processes have a well-defined hierarchy. Each process has one parent, and may have
one or more child processes. The process hierarchy can be described by an inverted tree, with the
init process at the top. The init process (so named because it executes the program /etc/init) is the
first user process created when the system boots. It is the ancestor of all user processes. A few sys-
tem processes, such as the swapper and the pagedaemon (also called the pageout daemon), are cre-
ated during the bootstrapping sequence and are not descendants of init. If, when a process termi-
nates, it has any active child processes, they become orphans and are inherited by init.
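The following user-level sketch illustrates the adoption of orphans. It relies only on standard calls (fork, getpid, getppid); on traditional systems the orphaned child reports init, process 1, as its new parent.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int
    main(void)
    {
        pid_t pid = fork();
        if (pid == 0) {                     /* child */
            sleep(2);                       /* let the parent exit first */
            printf("child %ld: parent is now %ld\n",
                   (long)getpid(), (long)getppid());
            exit(0);
        } else if (pid > 0) {               /* parent */
            printf("parent %ld created child %ld\n",
                   (long)getpid(), (long)pid);
            exit(0);                        /* child becomes an orphan */
        }
        perror("fork");
        return 1;
    }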
[Figure 2-3. Process states and state transitions; labeled transitions include fork, wakeup, stop, continue, and wait. The stopped states exist in 4.2/4.3BSD but not in SVR2/SVR3.]
3 Interrupts can also occur when the system is in kernel mode. In this case, the system will remain in kernel mode after
the handler completes.
sleep(), which puts the process on a queue of sleeping processes, and changes its state to asleep.
When the event occurs or the resource becomes available, the kernel wakes up the process, which
now becomes ready to run and waits to be scheduled.
When a process is scheduled to run, it initially runs in kernel mode (kernel running state),
where it completes the context switch. Its next transition depends on what it was doing before it was
switched out. If the process was newly created or was executing user code (and was descheduled to
let a higher priority process run), it returns immediately to user mode. If it was blocked for a re-
source while executing a system call, it resumes execution of the system call in kernel mode.
Finally, the process terminates by calling the exit system call, or because of a signal (signals
are notifications issued by the kernel-see Chapter 4). In either case, the kernel releases all the re-
sources of the process, except for the exit status and resource usage information, and leaves the
process in the zombie state. The process remains in this state until its parent calls wait (or one of its
variants), which destroys the process and returns the exit status to the parent (see Section 2.8.6).
4BSD defines some additional states that are not supported in SVR2 or SVR3. A process is
stopped, or suspended, by a stop signal (SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU). Unlike other
signals, which are handled only when the process runs, a stop signal changes the process state im-
mediately. If the process is in the running or ready to run state, its state changes to stopped. If the
process is asleep when this signal is generated, its state changes to asleep and stopped. A stopped
process may be resumed by a continue signal (SIGCONT), which returns it to the ready to run state. If
the process was stopped as well as asleep, SIGCONT returns the process to the asleep state. System V
UNIX incorporated these features in SVR4 (see Section 4.5). 4
variable=value
which are inherited from the parent. Most UNIX systems store these strings at the bottom
of the user stack. The standard user library provides functions to add, delete, or modify
these variables, and to translate the variable and return its value. When invoking a new
4 SVR3 provides a stopped state for the process solely for the purpose of process tracing (see Section 6.2.4). When a
traced process receives any signal, it enters the stopped state, and the kernel awakens its parent.
program, the caller may ask exec to retain the original environment or provide a new set of
variables to be used instead.
• Hardware context: This includes the contents of the general-purpose registers, and of a
set of special system registers. The system registers include:
• The program counter (PC), which holds the address of the next instruction to exe-
cute.
• The stack pointer (SP), which contains the address of the uppermost element of the
stack. 5
• The processor status word (PSW), which has several status bits containing informa-
tion about the system state, such as current and previous execution modes, current
and previous interrupt priority levels, and overflow and carry bits.
• Memory management registers, which map the address translation tables of the proc-
ess.
• Floating point unit (FPU) registers.
The machine registers contain the hardware context of the currently running process. When a con-
text switch occurs, these registers are saved in a special part of the u area (called the process control
block, or PCB) of the current process. The kernel selects a new process to run and loads the hard-
ware context from its PCB.
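The layout of the PCB is machine-dependent. A purely illustrative sketch follows; the structure and field names are hypothetical and not taken from any particular implementation.

    /* Hypothetical, simplified process control block. */
    struct pcb {
        unsigned long pcb_pc;          /* program counter               */
        unsigned long pcb_sp;          /* stack pointer                 */
        unsigned long pcb_psw;         /* processor status word         */
        unsigned long pcb_regs[16];    /* general-purpose registers     */
        unsigned long pcb_mmu[4];      /* memory management registers   */
        double        pcb_fpregs[16];  /* floating point unit registers */
    };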
5 Or lowermost, on machines where the stack grows downward. Also, on some systems, the stack pointer contains the
address at which the next item can be pushed onto the stack.
can access the file (see Section 8.2.2 for more details). The real UID and real GID identify the real
owner of the process and affect the permissions for sending signals. A process without superuser
privileges can signal another process only if the sender's real or effective UID matches the real UID
of the receiver.
There are three system calls that can change the credentials. If a process calls exec to run a
program installed in suid mode (see Section 8.2.2), the kernel changes the effective UID of the
process to that of the owner of the file. Likewise, if the program is installed in sgid mode, the kernel
changes the effective GID of the calling process.
UNIX provides this feature to grant special privileges to users for particular tasks. The clas-
sic example is the passwd program, which allows the user to modify his own password. This pro-
gram must write to the password database, which users should not be allowed to directly modify (to
prevent them from changing passwords of other users). Hence the passwd program is owned by the
superuser and has its SUID bit set. This allows the user to gain superuser privileges while running
the passwd program.
A user can also change his credentials by calling setuid or setgid. The superuser can invoke
these system calls to change both the real and effective UID or GID. Ordinary users can use this call
only to change their effective UID or GID back to the real ones.
There are some differences in the treatment of credentials in System V and BSD UNIX.
SVR3 also maintains a saved UID and saved GID, which are the values of the effective UID and
GID prior to calling exec. The setuid and setgid calls can also restore the effective IDs to the saved
values. While 4.3BSD does not support this feature, it allows a user to belong to a set of supplemen-
tal groups (using the setgroups system call). While files created by the user belong to his or her
primary group, the user can access files belonging either to the principal or to a supplemental group
(provided the file allows access to group members).
SVR4 incorporates all the above features. It supports supplemental groups, and maintains
the saved UID and GID across exec.
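A small user-level sketch of these credentials, using the standard getuid, geteuid, and seteuid calls (seteuid is the POSIX spelling; older programs use setuid or setreuid). Run from a set-uid executable, the two IDs differ until the program drops the borrowed identity.

    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        printf("real uid %ld, effective uid %ld\n",
               (long)getuid(), (long)geteuid());
        if (seteuid(getuid()) == -1)       /* drop the borrowed identity */
            perror("seteuid");
        printf("effective uid is now %ld\n", (long)geteuid());
        return 0;
    }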
The u area contains data that is needed only when the process is running. The proc structure contains informa-
tion that may be needed even when the process is not running.
The major fields in the u area include:
• The process control block-stores the saved hardware context when the process is not
running.
• A pointer to the proc structure for this process.
• The real and effective UID and GID.6
• Arguments to, and return values or error status from, the current system call.
• Signal handlers and related information (see Chapter 4).
• Information from the program header, such as text, data, and stack sizes and other memory
management information.
• Open file descriptor table (see Section 8.2.3). Modern UNIX systems such as SVR4 dy-
namically extend this table as necessary.
• Pointers to vnodes of the current directory and the controlling terminal. Vnodes represent
file system objects and are further described in Section 8.7.
• CPU usage statistics, profiling information, disk quotas, and resource limits.
• In many implementations, the per-process kernel stack is part of the u area.
The major fields in the proc structure include:
• Identification: Each process has a unique process ID (PID) and belongs to a specific proc-
ess group. Newer releases also assign a session ID to each process.
• Location of the kernel address map for the u area of this process.
• The current process state.
• Forward and backward pointers to link the process onto a scheduler queue or, for a
blocked process, a sleep queue.
• Sleep channel for blocked processes (see Section 7.2.3).
• Scheduling priority and related information (see Chapter 5).
• Signal handling information: masks of signals that are ignored, blocked, posted, and han-
dled (see Chapter 4).
• Memory management information.
• Pointers to link this structure on lists of active, free, or zombie processes.
• Miscellaneous flags.
• Pointers to keep the structure on a hash queue based on its PID.
• Hierarchy information, describing the relationship of this process to others.
Figure 2-4 illustrates the process relationships in 4.3BSD UNIX. The fields that describe the
hierarchy are p_pid (process ID), p_ppid (parent process ID), p_pptr (pointer to the parent's proc
structure), p_cptr (pointer to the oldest child), p_ysptr (pointer to next younger sibling), and
p_ osptr (pointer to next older sibling).
6 Modern UNIX systems such as SVR4 store user credentials in a dynamically allocated, reference-counted data
structure, and keep a pointer to it in the proc structure. Section 8.10.7 discusses this arrangement further.
[Figure 2-4. Process relationships in 4.3BSD: a proc structure with p_pid = 50 and p_ppid = 38, linked to related processes through p_pptr, p_cptr, p_ysptr, and p_osptr.]
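A minimal sketch of these hierarchy fields is shown below, keeping only the members named above; a real proc structure contains many more fields, and the exact declarations vary by release.

    struct proc {
        pid_t        p_pid;     /* process ID                           */
        pid_t        p_ppid;    /* parent process ID                    */
        struct proc *p_pptr;    /* pointer to parent's proc structure   */
        struct proc *p_cptr;    /* pointer to oldest child              */
        struct proc *p_ysptr;   /* pointer to next younger sibling      */
        struct proc *p_osptr;   /* pointer to next older sibling        */
        /* ... state, scheduling, signal, and memory management fields ... */
    };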
Many modern UNIX variants have modified the process abstraction to support several
threads of control in a single process. This notion is explained in detail in Chapter 3.
exception handler therefore runs in process context; it may access the address space and u area of
the process and block if necessary. Software interrupts, or traps, occur when a process executes a
special instruction, such as in system calls, and are handled synchronously in process context.
Table 2-1. Setting the interrupt priority level in 4.3BSD and SVR4

    4.3BSD          SVR4               Purpose
    spl0            spl0 or splbase    enable all interrupts
    splsoftclock    spltimeout         block functions scheduled by timers
    splnet                             block network protocol processing
                    splstr             block STREAMS interrupts
    spltty          spltty             block terminal interrupts
    splbio          spldisk            block disk interrupts
    splimp                             block network device interrupts
    splclock                           block hardware clock interrupt
    splhigh         spl7 or splhi      disable all interrupts
    splx            splx               restore ipl to previously saved value
schedule low-priority clock-related tasks. While these interrupts are synchronous to normal system
activity, they are handled just like normal interrupts.
Since there are several different events that may cause interrupts, one interrupt may occur
while another is being serviced. UNIX systems recognize the need to prioritize different kinds of
interrupts and allow high-priority interrupts to preempt the servicing of low-priority interrupts. For
example, the hardware clock interrupt must take precedence over a network interrupt, since the lat-
ter may require a large amount of processing, spanning several clock ticks.
UNIX systems assign an interrupt priority level (ipl) to each type of interrupt. Early UNIX
implementations had ipls in the range 0-7. In BSD, this was expanded to 0-31. The processor
status register typically has bit-fields that store the current (and perhaps previous) ipl. 7 Normal ker-
nel and user processing occurs at the base ipl. The number of interrupt priorities varies both across
different UNIX variants and across different hardware architectures. On some systems, ipl 0 is the
lowest priority, while on others it is the highest. To make things easier for kernel and device driver
developers, UNIX systems provide a set of macros to block and unblock interrupts. However, dif-
ferent UNIX variants use different macros for similar purposes. Table 2-1 lists some of the macros
used in 4.3BSD and in SVR4.
When an interrupt occurs, if its ipl is higher than the current ipl, the current processing is
suspended and the handler for the new interrupt is invoked. The handler begins execution at the new
ipl. When the handler completes, the ipl is lowered to its previous value (which is obtained from the
old processor status word saved on the interrupt stack), and the kernel resumes execution of the in-
terrupted process. If the kernel receives an interrupt of ipl lower than or equal to the current ipl, that
interrupt is not handled immediately, but is stored in a saved interrupt register. When the ipl drops
sufficiently, the saved interrupt will be handled. This is described in Figure 2-5.
The ipls are compared and set in hardware in a machine-dependent way. UNIX also provides
the kernel with mechanisms to explicitly check or set the ipl. For instance, the kernel may raise the
ipl to block interrupts while executing some critical code. This is discussed further in Section 2.5.2.
7 Some processors, such as the Intel 80x86, do not support interrupt priorities in hardware. On these systems, the op-
erating system must implement ipls in software. The exercises explore this problem further.
Some machines provide a separate global interrupt stack used by all the handlers. In ma-
chines without an interrupt stack, handlers run on the kernel stack of the current process. They must
ensure that the rest of the kernel stack is insulated from the handler. The kernel implements this by
pushing a context layer on the kernel stack before calling the handler. This context layer, like a
stack frame, contains the information needed by the handler to restore the previous execution con-
text upon return.
2.5 Synchronization
The UNIX kernel is re-entrant. At any time, several processes may be active in the kernel. Of these,
only one (on a uniprocessor) can be actually running; the others are blocked, waiting either for the
CPU or some other resource. Since they all share the same copy of the kernel data structures, it is
necessary to impose some form of synchronization, to prevent corruption of the kernel.
Figure 2-6 shows one example of what can happen in the absence of synchronization. Sup-
pose a process is trying to remove element B from the linked list. It executes the first line of code,
but is interrupted before it can execute the next line, and another process is allowed to run. If this
second process were to access this same list, it would find it in an inconsistent state, as shown in
Figure 2-6(b). Clearly, we need to ensure that such problems never occur.
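The operation in Figure 2-6 can be written as the small routine below. The structure and function names are illustrative; the point is that the list is inconsistent between the two pointer assignments, so the sequence must not be interrupted by code that walks the same list.

    #include <stdlib.h>

    struct elem {
        struct elem *prev;
        struct elem *next;
    };

    void
    list_remove(struct elem *b)
    {
        b->prev->next = b->next;   /* list is now inconsistent:        */
                                   /* b->next->prev still points to b  */
        b->next->prev = b->prev;   /* list is consistent again         */
        free(b);
    }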
UNIX uses several synchronization techniques. The first line of defense is that the UNIX
kernel is nonpreemptive. This means that if a process is executing in kernel mode, it cannot be pre-
empted by another process, even though its time quantum may expire. The process must voluntarily
relinquish the CPU. This typically happens when the process is about to block while waiting for a
resource or event, or when it has completed its kernel mode activity and is about to return to user
mode. In either case, since the CPU is relinquished voluntarily, the process can ensure that the ker-
nel remains in a consistent state. (Modern UNIX kernels with real-time capability allow preemption
under certain conditions-see Section 5.6 for details.)
Making a kernel nonpreemptive provides a broad, sweeping solution to most synchroniza-
tion problems. In the example of Figure 2-6, for instance, the kernel can manipulate the linked list
without locking it, if it does not have to worry about preemption. There are three situations where
synchronization is still necessary-blocking operations, interrupts, and multiprocessor synchroniza-
tion.
[Figure 2-6. Removing element B from a linked list of A, B, and C: (b) after B->prev->next = B->next; (c) after B->next->prev = B->prev; (d) after free(B).]
2.5.2 Interrupts
While the kernel is normally safe from preemption by another process, a process manipulating ker-
nel data structures may be interrupted by devices. If the interrupt handler tries to access those very
data structures, they may be in an inconsistent state. This problem is handled by blocking interrupts
while accessing such critical data structures. The kernel uses macros such as those in Table 2-1 to
8 Recent versions of UNIX offer several alternatives to wakeup(), such as wake_one() and wakeprocs().
explicitly raise the ipl and block interrupts. Such a region of code is called a critical region (see
Example 2-1).
• Two different interrupts can have the same priority level. For instance, on many systems
both terminal and disk interrupts occur at ipl 21.
• Blocking an interrupt also blocks all interrupts at the same or lower ipl.
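A critical region protected this way typically looks like the fragment below, which uses the 4.3BSD-style macros from Table 2-1; the buffer-queue manipulation and variable names are illustrative.

    int s;

    s = splbio();                    /* block disk interrupts            */
    bp->b_next = queue_head;         /* manipulate data also touched by  */
    queue_head = bp;                 /* the disk interrupt handler       */
    splx(s);                         /* restore the previous ipl         */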
Note: The word block is used in many different ways when describing
the UNIX system. A process blocks on a resource or event when it
enters the asleep state waiting for the resource to be available or the
event to occur. The kernel blocks an interrupt or a signal by tempo-
rarily disabling its delivery. Finally, the I/O subsystem transfers data
to and from storage devices in fixed-size blocks.
2.5.3 Multiprocessors
Multiprocessor systems lead to a new class of synchronization problems, since the fundamental
protection offered by the nonpreemptive nature of the kernel is no longer present. On a uniproces-
sor, the kernel manipulates most data structures with impunity, knowing that it cannot be pre-
empted. It only needs to protect data structures that may be accessed by interrupt handlers, or those
that need to be consistent across calls to sleep().
On a multiprocessor, two processes may execute in kernel mode on different processors and
may also execute the same function concurrently. Thus any time the kernel accesses a global data
structure, it must lock that structure to prevent access from other processors. The locking mecha-
nisms themselves must be multiprocessor-safe. If two processes running on different processors at-
tempt to lock an object at the same time, only one must succeed in acquiring the lock.
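A minimal sketch of such a multiprocessor-safe lock is shown below, written with C11 atomics purely to illustrate the idea of an atomic test-and-set; it is not how any particular UNIX kernel implements its locks (Chapter 7 describes the real primitives).

    #include <stdatomic.h>

    typedef struct {
        atomic_flag locked;            /* initialize with ATOMIC_FLAG_INIT */
    } spinlock_t;

    void
    spin_lock(spinlock_t *l)
    {
        while (atomic_flag_test_and_set(&l->locked))
            ;                          /* busy-wait until the lock is free */
    }

    void
    spin_unlock(spinlock_t *l)
    {
        atomic_flag_clear(&l->locked);
    }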
Protecting against interrupts also is more complicated, because all processors may handle
interrupts. It is usually not advisable to block interrupts on every processor, since that might degrade
performance considerably. Multiprocessors clearly require more complex synchronization mecha-
nisms. Chapter 7 explores these issues in detail.
vation of any process, since eventually the priority of any process that is waiting to run will rise
high enough for it to be scheduled.
A process executing in kernel mode may relinquish the CPU if it must block for a resource
or event. When it becomes runnable again, it is assigned a kernel priority. Kernel priorities are
higher than any user priorities. In traditional UNIX kernels, scheduling priorities have integer values
between 0 and 127, with smaller numbers meaning higher priorities. (As the UNIX system is written
almost entirely in C, it follows the standard convention of beginning all counts and indices at 0). In
4.3BSD, for instance, the kernel priorities range from 0 to 49, and user priorities from 50 to 127.
While user priorities vary with CPU usage, kernel priorities are fixed, and depend on the reason for
sleeping. Because of this, kernel priorities are also known as sleep priorities. Table 2-2 lists the
sleep priorities in 4.3BSD UNIX.
Chapter 5 provides further details of the UNIX scheduler.
2.7 Signals
UNIX uses signals to inform a process of asynchronous events, and to handle exceptions. For ex-
ample, when a user types control-C at the terminal, a SIGINT signal is sent to the foreground proc-
ess. Likewise, when a process terminates, a SIGCHLD signal is sent to its parent. UNIX defines a
number of signals (31 in 4.3BSD and SVR3). Most are reserved for specific purposes, while two
(SIGUSR1 and SIGUSR2) are available for applications to use as they wish.
Signals are generated in many ways. A process may explicitly send a signal to one or more
processes using the kill system call. The terminal driver generates signals to processes connected to
it in response to certain keystrokes and events. The kernel generates a signal to notify the process of
a hardware exception, or of a condition such as exceeding a quota.
Each signal has a default response, usually process termination. Some signals are ignored by
default, and a few others suspend the process. The process may specify another action instead of the
default by using the signal (System V), sigvec (BSD), or sigaction (POSIX.l) calls. This other ac-
tion may be to invoke a user-specified signal handler, or it may be to ignore the signal, or even to
revert to the default. The process may also choose to block a signal temporarily; such a signal will
only be delivered to a process after it is unblocked.
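A minimal sketch using the POSIX.1 sigaction interface mentioned above (older programs would use signal or sigvec): it installs a handler for SIGINT and blocks no additional signals while the handler runs.

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static void
    on_sigint(int signo)
    {
        write(1, "caught SIGINT\n", 14);   /* handler runs in user mode */
    }

    int
    main(void)
    {
        struct sigaction sa;
        sa.sa_handler = on_sigint;
        sigemptyset(&sa.sa_mask);          /* block no extra signals */
        sa.sa_flags = 0;
        if (sigaction(SIGINT, &sa, NULL) == -1)
            perror("sigaction");
        pause();                           /* wait for a signal */
        return 0;
    }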
A process does not instantaneously respond to a signal. When the signal is generated, the
kernel notifies the process by setting a bit in the pending signals mask in its proc structure. The
process must become aware of the signal and respond to it, and that can only happen when it is
scheduled to run. Once it runs, the process will handle all pending signals before returning to its
normal user-level processing. (This does not include the signal handlers themselves, which run in
user mode.)
What should happen if a signal is generated for a sleeping process? Should the signal be kept
pending until the process awakens, or should the sleep be interrupted? The answer depends on why
the process is sleeping. If the process is sleeping for an event that is certain to occur soon, such as
disk I/O completion, there is no need to wake up the process. If, on the other hand, the process is
waiting for an event such as terminal input, there is no limit to how long it might block. In such a
case, the kernel interrupts the sleep and aborts the system call in which the process had blocked.
4.3BSD provides the siginterrupt system call to control how signals should affect system call han-
dling. Using siginterrupt, the user can specify whether system calls interrupted by signals should be
aborted or restarted. Chapter 4 covers the topic of signals in greater detail.
call returns different values to the parent and child-fork returns 0 to the child, and the child's PID
to the parent.
Most often, the child process will call exec shortly after returning from fork, and thus begin
executing a new program. The C library provides several alternate forms of exec, such as exece,
execve, and execvp. Each takes a slightly different set of arguments and, after some preprocessing,
calls the same system call. The generic name exec refers to any unspecified function of this group.
The code that uses fork and exec resembles that in Example 2-2.
if ((result = fork()) == 0) {
    /* child code */
Since exec overlays a new program on the existing process, the child does not return to the
old program unless exec fails. Upon successful completion of exec, the child's address space is re-
placed with that of the new program, and the child returns to user mode with its program counter set
to the first executable instruction of the new program.
Since fork and exec are so often used together, it may be argued that a single system call
could efficiently accomplish both tasks, resulting in a new process running a new program. Older
UNIX systems [Thorn 78] also incurred a large overhead in duplicating the parent's address space
for the child (during fork), only to have the child discard it completely and replace it with that of the
new program.
There are many advantages of keeping the calls separate. In many client-server applications,
the server program may fork numerous processes that continue to execute the same program. 9 In
contrast, sometimes a process wants merely to invoke a new program, without creating a new proc-
ess. Finally, between the fork and the exec, the child may optionally perform a number of tasks to
ensure that the new program is invoked in the desired state. These tasks include:
• Redirecting standard input, output, or error.
• Closing open files inherited from the parent that are not needed by the new program.
• Changing the UID or process group.
• Resetting signal handlers.
A single system call that tries to perform all these functions would be unwieldy and ineffi-
cient. The existing fork-exec framework provides greater flexibility and is clean and modular. In
9 Modern multithreaded UNIX systems make this unnecessary - the server simply creates a number of threads.
Section 2.8.3 we will examine ways of minimizing the performance problems associated with this
division.
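A sketch of the idiom, in which the child redirects its standard output before invoking a new program; the file and program names are illustrative.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int
    main(void)
    {
        pid_t pid = fork();
        if (pid == 0) {                               /* child */
            int fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd >= 0) {
                dup2(fd, 1);                          /* redirect standard output */
                close(fd);
            }
            execl("/bin/ls", "ls", "-l", (char *)NULL);
            perror("execl");                          /* reached only if exec fails */
            _exit(127);
        } else if (pid > 0) {                         /* parent */
            int status;
            wait(&status);                            /* reap the child */
        } else {
            perror("fork");
        }
        return 0;
    }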
pages that are modified must be copied, not the entire address space. If the child calls exec or exit,
the pages revert to their original protection, and the copy-on-write flag is cleared.
BSD UNIX provided another solution-a new vfork system call. A user may call vfork in-
stead of fork if he or she expects to call exec shortly afterward. vfork does no copying. Instead, the
parent loans its address space to the child and blocks until the child returns it. The child then exe-
cutes using the parent's address space, until it calls exec or exit, whereupon the kernel returns the
address space to the parent, and awakens it. vfork is extremely fast, since not even the address maps
are copied. The address space is passed to the child simply by copying the address map registers. It
is, however, a dangerous call, because it permits one process to use and even modify the address
space of another process. Some programs such as csh exploit this feature.
10 This division is simply functional in nature; the kernel does not recognize so many different components. SVR4, for
instance, views the address space as merely a collection of shared and private mappings.
shared libraries as value-added features. In the following description of exec, we will consider a
simple program that uses neither of these features.
UNIX supports many executable file formats. The oldest is the a.out format, which has a
32-byte header followed by text and data sections and the symbol table. The program header con-
tains the sizes of the text, initialized data, and uninitialized data regions, and the entry point, which
is the address of the first instruction the program must execute. It also contains a magic number,
which identifies the file as a valid executable file and gives further information about its format,
such as whether the file is demand paged, or whether the data section begins on a page boundary.
Each UNIX variant defines the set of magic numbers it supports.
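A simplified version of the traditional header (often declared as struct exec in <a.out.h>) is sketched below; the exact field names and widths vary across UNIX variants.

    struct exec {
        unsigned long a_magic;    /* magic number                      */
        unsigned long a_text;     /* size of text segment              */
        unsigned long a_data;     /* size of initialized data          */
        unsigned long a_bss;      /* size of uninitialized data        */
        unsigned long a_syms;     /* size of symbol table              */
        unsigned long a_entry;    /* entry point                       */
        unsigned long a_trsize;   /* size of text relocation info      */
        unsigned long a_drsize;   /* size of data relocation info      */
    };  /* eight fields; 32 bytes on the 32-bit machines a.out was designed for */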
The exec system call must perform the following tasks:
1. Parse the pathname and access the executable file.
2. Verify that the caller has execute permission for the file.
3. Read the header and check that it is a valid executable. 11
4. If the file has SUID or SGID bits set in its mode, change the caller's effective UID or GID
respectively to that of the owner of the file.
5. Copy the arguments to exec and the environment variables into kernel space, since the
current user space is going to be destroyed.
6. Allocate swap space for the data and stack regions.
7. Free the old address space and the associated swap space. If the process was created by
vfork, return the old address space to the parent instead.
8. Allocate address maps for the new text, data, and stack.
9. Set up the new address space. If the text region is already active (some other process is
already running the same program), share it with this process. Otherwise, it must be
initialized from the executable file. UNIX processes are usually demand paged, meaning
that each page is read into memory only when the program needs it.
10. Copy the arguments and environment variables back onto the new user stack.
11. Reset all signal handlers to default actions, because the handler functions do not exist in
the new program. Signals that were ignored or blocked before calling exec remain ignored
or blocked.
12. Initialize the hardware context. Most registers are reset to zero, and the program counter is
set to the entry point of the program.
The wait system call allows a process to wait for a child to terminate. Since a child may have
terminated before the call, wait must also handle that condition. wait first checks if the caller has
any deceased or suspended children. If so, it returns immediately. If there are no deceased children,
wait blocks the caller until one of its children dies and returns once that happens. In both cases, wait
returns the PID of the deceased child, writes the child's exit status into stat_loc, and frees its proc
structure (if more than one child is dead, wait acts only on the first one it finds). If the child is being
traced, wait also returns when the child receives a signal. wait returns an error if the caller has no
children (dead or alive), or if wait is interrupted by a signal.
4.3BSD provides a wait3 call (so named because it requires three arguments), which also
returns resource usage information about the child (user and system times of the child and all its de-
ceased children). The POSIX.l standard [IEEE 90] adds the waitpid call, which uses the pi d argu-
ment to wait for a child with a specific process ID or process group. Both wait3 and waitpid support
two options: WNOHANG and WUNTRACED. WNOHANG causes wait3 to return immediately if there are no
deceased children. WUNTRACED also returns if a child is suspended or resumed. The SVR4 waitid call
provides a superset of all the above features. It allows the caller to specify the process ID or group
to wait for and the specific events to trap, and also returns more detailed information about the child
process.
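A minimal sketch of waitpid with the options described above; the parent waits for one specific child and decodes the exit status with the standard macros.

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int
    main(void)
    {
        int status;
        pid_t pid = fork();

        if (pid == 0)
            _exit(7);                        /* child exits with status 7 */

        /* WUNTRACED also reports stopped children; adding WNOHANG would
           make the call return immediately if no child has changed state. */
        if (waitpid(pid, &status, WUNTRACED) == pid && WIFEXITED(status))
            printf("child %ld exited with status %d\n",
                   (long)pid, WEXITSTATUS(status));
        return 0;
    }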
2.9 Summary
We have described the interactions between the kernel and user processes in traditional UNIX ker-
nels. This provides a broad perspective, giving us the context needed to examine specific parts of
the system in greater detail. Modern variants such as SVR4 and Solaris 2.x introduce several ad-
vanced facilities, which will be detailed in the following chapters.
2.10 Exercises
1. What elements of the process context must the kernel explicitly save when handling (a) a
context switch, (b) an interrupt, or (c) a system call?
2. What are the advantages of allocating objects such as proc structures and descriptor table
blocks dynamically? What are the drawbacks?
3. How does the kernel know which system call has been made? How does it access the
arguments to the call (which are on the user stack)?
4. Compare and contrast the handling of system calls and of exceptions. What are the similarities
and differences?
5. Many UNIX systems provide compatibility with another version of UNIX by providing user
library functions to implement the system calls of the other version. Why, if at all, does the
application developer care if a function is implemented by a library or by a system call?
6. What issues must a library developer be concerned with when choosing to implement a
function in the user library instead of as a system call? What if the library must use multiple
system calls to implement the function?
7. Why is it important to limit the amount of work an interrupt handler can do?
8. On a system with n distinct interrupt priority levels, what is the maximum number of
interrupts that may be nested at a time? What repercussions can this have on the sizes of
various stacks?
9. The Intel 80x86 architecture does not support interrupt priorities. It provides two instructions
for interrupt management - CLI to disable all interrupts, and STI to enable all interrupts.
Write an algorithm to implement interrupt priority levels in software for such a machine.
10. When a resource becomes available, the wakeup() routine wakes up all processes blocked on
it. What are the drawbacks of this approach? What are the alternatives?
11. Propose a new system call that combines the functions of fork and exec. Define its interface
and semantics. How would it support features such as I/O redirection, foreground or
background execution, and pipes?
12. What is the problem with returning an error from the exec system call? How can the kernel
handle this problem?
13. For a UNIX system of your choice, write a function that allows a process to wait for its parent
to terminate.
14. Suppose a process does not wish to block until its children terminate. How can it ensure that
child processes are cleaned up when they terminate?
15. Why does a terminating process wake up its parent?
2.11 References
[Allm 87] Allman, E., "UNIX: The Data Forms," Proceedings of the Winter 1987 USENIX
Technical Conference, Jan. 1987, pp. 9-15.
[AT&T 87] American Telephone and Telegraph, The System V Interface Definition (SVID), Issue
2, 1987.
[Bach 86] Bach, M.J., The Design of the UNIX Operating System, Prentice-Hall, Englewood
Cliffs, NJ, 1986.
[Kern 84] Kernighan, B.W., and Pike, R., The UNIX Programming Environment, Prentice-Hall,
Englewood Cliffs, NJ, 1984.
[Leff 89] Leffler, S.J., McKusick, M.K., Karels, M.J., and Quarterman, J.S., The Design and
Implementation of the 4.3 BSD UNIX Operating System, Addison-Wesley, Reading,
MA, 1989.
[IEEE 90] Institute for Electrical and Electronic Engineers, Information Technology-Portable
Operating System Interface (POSIX) Part 1: System Application Program Interface
(API) [C Language], 1003.1-1990, IEEE, Dec. 1990.
[IEEE 94] Institute for Electrical and Electronic Engineers, POSIX P1003.4a, Threads
Extension for Portable Operating Systems, 1994.
[Sale 92] Salemi, C., Shah, S., and Lund, E., "A Privilege Mechanism for UNIX System V
Release 4 Operating Systems," Proceedings of the Summer 1992 USENIX Technical
Conference, Jun. 1992, pp. 235-241.
[Thorn 78] Thompson, K., "UNIX Implementation," The Bell System Technical Journal, Vol.
57, No.6, Part 2, Jul.-Aug. 1978, pp. 1931-1946.
3
Threads and Lightweight Processes
3.1 Introduction
The process model has two important limitations. First, many applications wish to perform several
largely independent tasks that can run concurrently, but must share a common address space and
other resources. Examples of such applications include server-side database managers, transaction-
processing monitors, and middle- and upper-layer network protocols. These processes are inherently
parallel in nature and require a programming model that supports parallelism. Traditional UNIX
systems force such applications to serialize these independent tasks or to devise awkward and inef-
ficient mechanisms to manage multiple operations.
Second, traditional processes cannot take advantage of multiprocessor architectures, because
a process can use only one processor at a time. An application must create a number of separate
processes and dispatch them on the available processors. These processes must find ways of sharing
memory and resources, and synchronizing their tasks with each other.
Modern UNIX variants address these problems by providing a variety of primitives in the
operating system to support concurrent processing. The lack of standard terminology makes it diffi-
cult to describe and compare the wide assortment of intraprocess parallelization mechanisms. Each
UNIX variant uses its own nomenclature, including terms such as kernel threads, user threads, ker-
nel-supported user threads, C-threads, pthreads, and lightweight processes. This chapter clarifies
the terminology, explains the basic abstractions, and describes the facilities provided by some im-
portant UNIX variants. Finally, it evaluates the strengths and weaknesses of these mechanisms. We
begin by investigating the necessity and benefits of threads.
3.1.1 Motivation
Many programs must perform several largely independent tasks that do not need to be serialized.
For instance, a database server may listen for and process numerous client requests. Since the re-
quests do not need to be serviced in a particular order, they may be treated as independent execution
units, which in principle could run in parallel. The application would perform better if the system
provided mechanisms for concurrent execution of the subtasks.
On traditional UNIX systems, such programs use multiple processes. Most server applica-
tions have a listener process that waits for client requests. When a request arrives, the listener forks
a new process to service it. Since servicing of the request often involves I/O operations that may
block the process, this approach yields some concurrency benefits even on uniprocessor systems.
Next, consider a scientific application that computes the values of various terms in an array,
each term being independent of the others. It could create a different process for each element of the
array and achieve true parallelism by dispatching each process to run on a different computer, or
perhaps on different CPUs of a multiprocessor system. Even on uniprocessor machines, it may be
desirable to divide the work among multiple processes. If one process must block for I/O or page
fault servicing, another process can progress in the meantime. As another example, the UNIX make
facility allows users to compile several files in parallel, using a separate process for each.
Using multiple processes in an application has some obvious disadvantages. Creating all
these processes adds substantial overhead, since fork is usually an expensive system call (even on
systems that support copy-on-write sharing of address spaces). Because each process has its own
address space, it must use interprocess communication facilities such as message passing or shared
memory. Additional work is required to dispatch processes to different machines or processors, pass
information between these processes, wait for their completion, and gather the results. Finally,
UNIX systems have no appropriate frameworks for sharing certain resources, e.g., network connec-
tions. Such a model is justified only if the benefits of concurrency offset the cost of creating and
managing multiple processes.
These examples serve primarily to underscore the inadequacies of the process abstraction
and the need for better facilities for parallel computation. We can now identify the concept of a
fairly independent computational unit that is part of the total processing work of an application.
These units have relatively few interactions with one another and hence low synchronization re-
quirements. An application may contain one or more such units. The thread abstraction represents a
single computational unit. The traditional UNIX process is single-threaded, meaning that all compu-
tation is serialized within the same unit.
The mechanisms described in this chapter address the limitations of the process model. They
too have their own drawbacks, which are discussed at the end of the chapter.
number of processors, the threads must be multiplexed on the available processors. 1 Ideally, an ap-
plication will have n threads running on n processors and will finish its work in 1/nth the time re-
quired by a single-threaded version of the program. In practice, the overhead of creating, managing,
and synchronizing threads, and that of the multiprocessor operating system, will reduce the benefit
well below this ideal ratio.
Figure 3-1 shows a set of single-threaded processes executing on a uniprocessor machine.
The system provides an illusion of concurrency by executing each process for a brief period of time
(time slice) before switching to the next. In this example the first three processes are running the
server side of a client-server application. The server program spawns a new process for each active
client. The processes have nearly identical address spaces and share information with one another
using interprocess communication mechanisms. The lower two processes are running another server
application.
Figure 3-2 shows two servers running in a multithreaded system. Each server runs as a sin-
gle process, with multiple threads sharing a single address space. Interthread context switching may
be handled by either the kernel or a user-level threads library, depending on the operating system.
For both cases, this example shows some of the benefits of threads. Eliminating multiple, nearly
identical address spaces for each application reduces the load on the memory subsystem. (Even
modern systems using copy-on-write memory sharing must manage separate address translation
maps for each process.) Since all threads of an application share a common address space, they can
use efficient, lightweight, interthread communication and synchronization mechanisms.
The potential disadvantages of this approach are evident. A single-threaded process does not
have to protect its data from other processes. Multithreaded processes must be concerned with every
1 This means using each available processor to service the runnable threads. Of course, a processor can run only one
thread at a time.
[Figure 3-2 (diagram): multiple threads within each address space, shown over time.]
object in their address space. If more than one thread can access an object, they must use some form
of synchronization to avoid data corruption.
Figure 3-3 shows two multithreaded processes running on a multiprocessor. All threads of
one process share the same address space, but each runs on a different processor. Hence they all run
concurrently. This improves performance considerably, but also complicates the synchronization
problems.
Although the two facilities combine well together, they are also useful independently. A
multiprocessor system is also useful for single-threaded applications, as several processes can run in
parallel. Likewise, there are significant benefits of multithreaded applications, even on single-
processor systems. When one thread must block for I/O or some other resource, another thread can
be scheduled to run, and the application continues to progress. The thread abstraction is more suited
for representing the intrinsic concurrency of a program than for mapping software designs to multi-
processor hardware architectures.
A traditional UNIX process has a single thread of control. Multithreaded systems extend this concept by
allowing more than one thread of control in each process.
Centralizing resource ownership in the process abstraction has some drawbacks. Consider a
server application that carries out file operations on behalf of remote clients. To ensure compliance
with file access permissions, the server assumes the identity of the client while servicing a request.
To do so, the server is installed with superuser privileges, and calls setuid, setgid, and setgroups to
temporarily change its user credentials to match those of the client. Multithreading this server to in-
crease the concurrency causes security problems. Since the process has a single set of credentials, it
can only pretend to be one client at a time. Hence the server is forced to serialize (single-thread) all
system calls that check for security.
There are several different types of threads, each having different properties and uses. In this
section, we describe three important types-kernel threads, lightweight processes, and user threads.
They can make system calls and block for I/O or resources. On a multiprocessor system, a process
can enjoy the benefits of true parallelism, because each LWP can be dispatched to run on a different
processor. There are significant advantages even on a uniprocessor, since resource and I/O waits
block individual LWPs, not the entire process.
Besides the kernel stack and register context, an LWP also needs to maintain some user
state. This primarily includes the user register context, which must be saved when the LWP is
preempted. While each LWP is associated with a kernel thread, some kernel threads may be dedicated
to system tasks and not have an LWP.
Such multithreaded processes are most useful when each thread is fairly independent and
does not interact often with other threads. User code is fully preemptible, and all LWPs in a process
share a common address space. If any data can be accessed concurrently by multiple LWPs, such
access must be synchronized. The kernel therefore provides facilities to lock shared variables and to
block an LWP if it tries to access locked data. These facilities, such as mutual exclusion (mutex)
locks, semaphores, and condition variables, are further detailed in Chapter 7.
It is important to note the limitations of LWPs. Most LWP operations, such as creation, de-
struction, and synchronization, require system calls. System calls are relatively expensive operations
for several reasons: Each system call requires two mode switches--one from user to kernel mode on
invocation, and another back to user mode on completion. On each mode switch, the LWP crosses a
protection boundary. The kernel must copy the system call parameters from user to kernel space and
validate them to protect against malicious or buggy processes. Likewise, on return from the system
call, the kernel must copy data back to user space.
When the LWPs frequently access shared data, the synchronization overhead can nullify any
performance benefits. Most multiprocessor systems provide locks that can be acquired at the user
level if not already held by another thread [Muel 93]. If a thread wants a resource that is currently
unavailable, it may execute a busy-wait (loop until the resource is free), again without kernel in-
volvement. Busy-waiting is reasonable for resources that are held only briefly; in other cases, it is
necessary to block the thread. Blocking an LWP requires kernel involvement and hence is expensive.
Each LWP consumes significant kernel resources, including physical memory for a kernel
stack. Hence a system cannot support a large number of LWPs. Moreover, since the system has a
single LWP implementation, it must be general enough to support most reasonable applications. It
will therefore be burdened with a lot of baggage that many applications do not need. LWPs are un-
suitable for applications that use a large number of threads, or that frequently create and destroy
them. Finally, LWPs must be scheduled by the kernel. Applications that must often transfer control
from one thread to another cannot do so easily using LWPs. LWP use also raises some fairness is-
sues-a user can monopolize the processor by creating a large number of LWPs.
In summary, while the kernel provides the mechanisms for creating, synchronizing, and
managing LWPs, it is the responsibility of the programmer to use them judiciously. Many applica-
tions are better served by a user-level threads facility, such as that described in the next section.
Note: The term LWP is borrowed from the SVR4/MP and Solaris 2.x
terminology. It is somewhat confusing, since version 4.x of SunOS
[Kepe 85] uses the term LWPs to refer to the user-level threads de-
scribed in the next section. In this book, however, we consistently use
LWP to refer to kernel-supported user threads. Some systems use the
term virtual processor, which is essentially the same as an LWP.
2 Many threads library features require the kernel to provide facilities for asynchronous I/O.
The threads library schedules and switches context between user threads by saving the current thread's stack and
registers, then loading those of the newly scheduled one.
The kernel retains the responsibility for process switching, because it alone has the privilege
to modify the memory management registers. User threads are not truly schedulable entities, and the
kernel has no knowledge of them. The kernel simply schedules the underlying process or LWP,
which in turn uses library functions to schedule its threads. When the process or LWP is preempted,
so are its threads. Likewise, if a user thread makes a blocking system call, it blocks the underlying
LWP. If the process has only one LWP (or if the user threads are implemented on a single-threaded
system), all its threads are blocked.
The library also provides synchronization objects to protect shared data structures. Such an
object usually comprises a type of lock variable (such as a semaphore) and a queue of threads
blocked on it. Threads must acquire the lock before accessing the data structure. If the object is al-
ready locked, the library blocks the thread by linking it onto its blocked threads queue and transfer-
ring control to another thread.
Modern UNIX systems provide asynchronous I/O mechanisms, which allow processes to
perform I/O without blocking. SVR4, for example, offers an I_SETSIG ioctl operation on any
STREAMS device. (STREAMS are described in Chapter 17.) A subsequent read or write to the
stream simply queues the operation and returns without blocking. When the I/O completes, the
process is informed via a SIGPOLL signal.
Asynchronous I/O is a very useful feature, because it allows a process to perform other tasks
while waiting for I/O. However, it leads to a complex programming model. It is desirable to restrict
asynchrony to the operating system level and give applications a synchronous programming environment.
A threads library achieves this by providing a synchronous interface that uses the asynchronous
mechanisms internally. Each request is synchronous with respect to the calling thread,
which blocks until the I/O completes. The process, however, continues to make progress, since the
library invokes the asynchronous operation and schedules another user thread to run in the meantime.
When the I/O completes, the library reschedules the blocked thread.
User threads have several benefits. They provide a more natural way of programming many
applications such as windowing systems. User threads also provide a synchronous programming
paradigm by hiding the complexities of asynchronous operations in the threads library. This alone
makes them useful, even in systems lacking any kernel support for threads. A system can provide
several threads libraries, each optimized for a different class of applications.
The greatest advantage of user threads is performance. User threads are extremely light-
weight and consume no kernel resources except when bound to an LWP. Their performance gains
result from implementing the functionality at user level without using system calls. This avoids the
overhead of trap processing and moving parameters and data across protection boundaries. A useful
notion is the critical thread size [Bita 95], which indicates the amount of work a thread must do to
be useful as a separate entity. This size depends on the overhead associated with creating and using
a thread. For user threads, the critical size is of the order of a few hundred instructions and may be
reduced to less than a hundred with compiler support. User threads require much less time for crea-
tion, destruction, and synchronization. Table 3-1 compares the latency for different operations on
processes, LWPs, and user threads on a SPARCstation 2 [Sun 93].
On the other hand, user threads have several limitations, primarily due to the total separation
of information between the kernel and the threads library. Since the kernel does not know about user
threads, it cannot use its protection mechanisms to protect them from each other. Each process has
its own address space, which the kernel protects from unauthorized access by other processes. User
Table 3-1. Latency of user thread, LWP, and process operations on a SPARCstation 2

                    Creation time       Synchronization time using
                    (microseconds)      semaphores (microseconds)
    User thread           52                       66
    LWP                  350                      390
    Process             1700                      200
threads enjoy no such protection, operating in the common address space owned by the process. The
threads library must provide synchronization facilities, which requires cooperation from the threads.
The split scheduling model causes many other problems. The threads library schedules the
user threads, the kernel schedules the underlying processes or LWPs, and neither knows what the
other is doing. For instance, the kernel may preempt an LWP whose user thread is holding a spin
lock. If another user thread on a different LWP tries to acquire this lock, it will busy-wait until the
holder of the lock runs again. Likewise, because the kernel does not know the relative priorities of
user threads, it may preempt an LWP running a high-priority user thread to schedule an LWP
running a lower-priority user thread.
The user-level synchronization mechanisms may behave incorrectly in some instances. Most
applications are written on the assumption that all runnable threads are eventually scheduled. This is
true when each thread is bound to a separate LWP, but may not be valid when the user threads are
multiplexed onto a small number of LWPs. Since an LWP may block in the kernel when its user
thread makes a blocking system call, a process may run out of LWPs even when there are runnable
threads and available processors. The availability of an asynchronous I/O mechanism may help to
mitigate this problem.
Finally, without explicit kernel support, user threads may improve concurrency, but do not
increase parallelism. Even on a multiprocessor, user threads sharing a single LWP cannot execute in
parallel.
This section explains three commonly used thread abstractions. Kernel threads are primitive
objects not visible to applications. Lightweight processes are user-visible threads that are recognized
by the kernel and are based on kernel threads. User threads are higher-level objects not visible to the
kernel. They may use lightweight processes if supported by the system, or they may be implemented
in a standard UNIX process without special kernel support. Both LWPs and user threads have major
drawbacks that limit their usefulness. Section 3.5 describes a new framework based on scheduler
activations, which addresses many of these problems. First, however, we examine the issues related
to LWP and user thread design in greater detail.
There are several factors that influence the design of lightweight processes. Foremost is the need to
properly preserve UNIX semantics, at least for the single-threaded case. This means that a process
containing a single LWP must behave exactly like a traditional UNIX process. (Note again that the
term LWP refers to kernel-supported user threads, not to the SunOS 4.0 lightweight processes,
which are purely user-level objects.)
There are several areas where UNIX concepts do not map easily to a multithreaded system.
The following sections examine these issues and present possible solutions.
System calls that manipulate process-wide resources (such as open files) must be redesigned. There are two important guidelines. First, the system call must sat-
isfy traditional UNIX semantics in the single-threaded case; second, when issued by a multithreaded
process, the system call should behave in a reasonable manner that closely approximates the single-
threaded semantics. With that in mind, let us examine some important system calls that are affected
by the multithreading design.
In traditional UNIX, fork creates a child process, which is almost an exact clone of the par-
ent. The only differences are those necessary to distinguish the parent from the child. The semantics
of fork are clear for a single-threaded process. In the case of a multithreaded process, there is an op-
tion of duplicating all LWPs of the parent or only the one that invokes the fork.
Suppose fork copies only the calling LWP into the new process. This is definitely more ef-
ficient. It is also a better model for the case where the child soon invokes another program by call-
ing exec. This interface has several problems [Powe 91]. LWPs are often used to support user-level
thread libraries. Such libraries represent each user thread by a data structure in user space. If fork
duplicates only the calling LWP, the new process will contain user-level threads that do not map to
any LWP. Furthermore, the child process must not try to acquire locks held by threads that do not
exist in the child, because this could result in deadlock. This may be difficult to enforce, because
libraries often create hidden threads of which the programmer is unaware.
On the other hand, suppose fork duplicates all the LWPs of the parent. This is more useful
when fork is used to clone the entire process, rather than to run another program. It also has many
problems. An LWP in the parent may be blocked in a system call. Its state will be undefined in the
child. One possibility is to make such calls return the status code EINTR (system call interrupted),
allowing the LWP to restart them if necessary. An LWP may have open network connections.
Closing the connection in the child may cause unexpected messages to be sent to the remote node.
An LWP may be manipulating an external shared data structure, which could become corrupted if
fork clones the LWP.
Neither solution handles all situations correctly. Many systems compromise by offering two
variants of fork, one to duplicate the whole process and the other to duplicate a single thread. For
the latter case, these systems define a set of safe functions that may be called by the child prior to
the exec. Another alternative is to allow the process to register one or more fork handlers, which are
functions that run in the parent or the child, before or after the fork, as specified during registration.
Likewise, the process has a single current working directory and a single user credentials
structure. Since the credentials may change at any time, the kernel must sample them atomically,
only once per system call.
All LWPs of a process share a common address space and may manipulate it concurrently
through system calls such as mmap and brk. These calls must be made thread-safe so that they do
not corrupt the address space in such situations. Programmers must be careful to serialize such op-
erations, otherwise the results may be unexpected.
Some signals, such as those resulting from hardware exceptions, are caused by a thread itself. It makes more sense to deliver such a signal to the thread that caused it.
Other signals, such as SIGTSTP (stop signal generated from the terminal) and SIGINT (interrupt sig-
nal), are generated by external events and cannot logically be associated with any particular thread.
Another related issue is that of signal handling and masking. Must all threads share a com-
mon set of signal handlers, or can each define its own? Although the latter approach is more versa-
tile and flexible, it adds considerable overhead to each thread, which defeats the main purpose in
having multithreaded processes. The same considerations do not hold for signal masking. Signals
are normally masked to protect critical regions of code. Hence it is better to allow each thread to
specify its own signal mask. The overhead of per-thread masks is relatively low and acceptable.
3.3.4 Visibility
It is important to decide to what extent an LWP is visible outside the process. The kernel undoubtedly
knows about LWPs and schedules them independently. Most implementations do not allow
processes to know about or interact with specific LWPs of another process.
Within a process, however, it is often desirable for the LWPs to know about each other.
Many systems therefore provide a special system call that allows one LWP to send a signal to an-
other specific LWP within the same process.
3 Up to a configurable limit. In SVR4, the stack size is limited by the resource limit RLIMIT_STACK. This value comprises a
hard limit and a soft limit. The getrlimit system call retrieves these limits. The setrlimit call may lower the hard limit,
or lower or raise the soft limit so long as it does not exceed the hard limit.
4 Some multithreaded systems, such as SVR4.2/MP, provide facilities for automatic extension of a user thread stack.
5 Of course, this signal must be handled on a special stack, since the normal stack has no room for the signal handler to
operate. Modern UNIX systems provide a way for an application to specify an alternate stack for signal handling (see
Section 4.5).
• Multiplex user threads on a (smaller) set of LWPs. This is more efficient, as it consumes
fewer kernel resources. This method works well when all threads in a process are roughly
equivalent. It provides no easy way of guaranteeing resources to a particular thread.
• Allow a mixture of bound and unbound threads in the same process. This allows the appli-
cation to fully exploit the concurrency and parallelism of the system. It also allows prefer-
ential handling of a bound thread, by increasing the scheduling priority of its underlying
LWPs or by giving its LWP exclusive ownership of a processor (see the sketch below).
The threads library contains a scheduling algorithm that selects which user thread to run. It
maintains per-thread state and priority, which has no relation to the state or priority of the underly-
ing LWPs. Consider the example in Figure 3-7, which shows six user threads multiplexed onto two
LWPs. The library schedules one thread to run on each LWP. These threads (u5 and u6) are in the
running state, even though the underlying LWPs may be blocked in the middle of a system call, or
preempted and waiting to be scheduled.
A thread (such as u1 or u2 in Figure 3-7) enters the blocked state when it tries to acquire a
synchronization object locked by another thread. When the lock is released, the library unblocks the
thread, and puts it on the scheduler queue. The thread (such as u3 and u4 in Figure 3-7) is now in
the runnable state, waiting to be scheduled. The threads scheduler selects a thread from this queue
based on priority and LWP affiliation. This mechanism closely parallels the kernel's resource wait
and scheduling algorithms. As mentioned previously, the threads library acts as a miniature kernel
for the threads it manages.
[Doep 87], [Muel 93] and [Powe 91] discuss user threads in more detail.
[Figure 3-7. User threads multiplexed onto two LWPs, showing running, runnable, and blocked threads.]
The rest of this chapter describes the threads implementation in Solaris, SVR4, Mach, and
Digital UNIX.
6 Kernel threads were introduced in Solaris 2.0 and the user-visible interface in Solaris 2.2.
Kernel threads are used to handle asynchronous activity, such as deferred disk writes,
STREAMS service procedures, and callouts (see Section 5.2.1). This allows the kernel to associate
a priority to each such activity (by setting the thread's priority), and thus schedule them appropri-
ately. They are also used to support lightweight processes. Each LWP is attached to a kernel thread
(although not all kernel threads have an LWP).
LWPs have no global name space and hence are invisible to other processes. A process can-
not direct a signal to a specific LWP in another process or know which LWP sent a message to it.
[Figure 3-8. Unbound and bound user threads sharing the address space of a process.]
Having bound and unbound user threads in the same application can be very useful in situa-
tions that involve time-critical processing. Such processing can be handled by threads bound to
LWPs that are assigned a real-time scheduling priority, whereas other threads are responsible for
lower-priority background processing. In the previous windowing example, a real-time thread can
be assigned to respond to mouse movements, since those must be reflected immediately on the dis-
play.
The traditional approach of blocking interrupts by raising the interrupt priority level (ipl) degrades performance in many ways. On multiprocessor systems, these problems are magnified. The
kernel must protect many more objects and usually must block interrupts on all processors.
Solaris replaces the traditional interrupt and synchronization model with a new implementa-
tion [Eykh 92, Klei 95] that aims to improve performance, particularly for multiprocessors. To be-
gin with, it does not utilize IPLs to protect from interrupts. Instead, it uses a variety of kernel syn-
chronization objects, such as mutex locks and semaphores. Next, it employs a set of kernel threads
to handle interrupts. These interrupt threads can be created on the fly and are assigned a higher pri-
ority than all other types of threads. They use the same synchronization primitives as other threads
and thus can block if they need a resource held by another thread. The kernel blocks interrupts only
in a few exceptional situations, such as when acquiring the mutex lock that protects a sleep queue.
Although the creation of kernel threads is relatively lightweight, it is still too expensive to
create a new thread for each interrupt. The kernel maintains a pool of interrupt threads, which are
preallocated and partially initialized. By default, this pool contains one thread per interrupt level for
each CPU, plus a single systemwide thread for the clock. Since each thread requires about 8 kilo-
bytes of storage for the stack and thread data, the pool uses a significant amount of memory. On
systems where memory is scarce, it is better to reduce the number of threads in this pool, since all
interrupts are unlikely to be active at once.
Figure 3-9 describes interrupt handling in Solaris. Thread T1 is executing on processor P1
when it receives an interrupt. The interrupt handler first raises the ipl to prevent further interrupts of
the same or lower level (preserving UNIX semantics). It then allocates an interrupt thread T2 from
the pool and switches context to it. While T2 executes, T1 is pinned, which means it may not run on
another CPU. When T2 returns, it switches context back to T1, which resumes execution.
The interrupt thread T2 runs without being completely initialized. This means it is not a full-
fledged thread and cannot be descheduled. Initialization is completed only if the thread has a reason
to block. At this time, it saves its state and becomes an independent thread, capable of running on
any CPU. If T2 blocks, it returns control to T1, thus unpinning it. This way, the overhead of com-
plete thread initialization is restricted to cases where the interrupt thread must block.
Figure 3-9. Using threads to handle interrupts.
Implementing interrupts as threads adds some overhead (about 40 instructions on the SPARC).
On the other hand, it avoids having to block interrupts for each synchronization object, which saves
about 12 instructions each time. Since synchronization operations are much more frequent than in-
terrupts, the net result is a performance improvement, as long as interrupts do not block too fre-
quently.
3.7 Threads in Mach
Mach was designed as a multithreaded operating system from the outset. It supports threads
both in the kernel and through user-level libraries. It provides additional mechanisms to control the
allocation of processors to threads on multiprocessors. Mach provides full 4.3BSD UNIX semantics
at the programming interface level, including all system calls and libraries. 7 This section describes
the threads implementation of Mach. Section 3.8 discusses the UNIX interface in Digital UNIX,
which is derived from Mach. Section 3.9 describes a new mechanism called continuations, which
was introduced in Mach 3.0.
7 The Mach 2.5 implementation itself provides 4.3BSD functionality within its kernel. Mach 3.0 provides this func-
tionality as a server program at the application level.
Each thread has its own kernel stack, used for system call handling. It also has its own computation state (program counter, stack
pointer, general registers, etc.) and is independently scheduled onto a processor. Threads that belong
to user tasks are equivalent to lightweight processes. Pure kernel threads belong to the kernel task.
Mach also supports the notion of processor sets, which are further described in Section
5.7.1. The available processors in the system can be divided into nonoverlapping processor sets.
Each task and thread can be assigned to any processor set (many processor set operations require
superuser privilege). This allows dedicating some CPUs of a multiprocessor to one or more specific
tasks, thus guaranteeing resources to some high-priority tasks.
The task structure represents a task, and contains the following information:
• Pointer to the address map, which describes the virtual address space of the task.
• The header of the list of threads belonging to the task.
• Pointer to the processor set to which the task is assigned.
• Pointer to the utask structure (see Section 3.8.1).
• Ports and other IPC-related information (see Section 6.4).
The resources held by the task are shared by all its threads. Each thread is described by a thread
structure, which contains:
• Links to put the thread on a scheduler or wait queue.
• Pointers to the task and the processor set to which it belongs.
• Links to put the thread on the list of threads in the same task and on the list of threads in
the same processor set.
• Pointer to the process control block (PCB) to hold its saved register context.
• Pointer to its kernel stack.
• Scheduling state (runnable, suspended, blocked, etc.).
• Scheduling information, such as priority, scheduling policy, and CPU usage.
• Pointers to the associated uthread and utask structures (see Section 3.8.1).
• Thread-specific IPC information (see Section 6.4.1).
Tasks and threads play complementary roles. The task owns resources, including the address
space. The thread executes code. A traditional UNIX process comprises a task containing a single
thread. A multithreaded process consists of one task and several threads.
Mach provides a set of system calls to manipulate tasks and threads. The task_create,
task_terminate, task_suspend, and task_resume calls operate on tasks. The thread_create,
thread_terminate, thread_suspend, and thread_resume calls operate on threads. These calls have the
obvious meanings. In addition, thread_status and thread_mutate allow reading and modification of
the register state of the thread, and task_threads returns a list of all threads in a task.
The C threads library [Coop 90] provides a higher-level interface; its cthread_fork (func, arg) call
creates a new thread that invokes the function func(). A thread can call cthread_exit() to terminate itself.
3.8 Digital UNIX
Digital UNIX derives much of its UNIX code from the 4.3BSD compatibility layer in Mach 2.5, which in turn was ported from the native 4.3BSD implementation. Likewise,
many device drivers were ported from Digital's ULTRIX, which also is BSD-based. The ported
code makes extensive references to the proc and user structures, making it desirable to preserve
these interfaces.
There are two problems with retaining the user and proc structures in their original forms.
First, some of their functionality is now provided by the task and thread structures. Second, they
do not adequately represent a multithreaded process. For instance, the traditional u area contains the
process control block, which holds the saved register context of the process. In the multithreaded
case, each thread has its own register context. Thus these structures must be modified significantly.
The u area is replaced by two objects: a single utask structure for the task as a whole, and
one uthread structure for each thread in the task. These structures are no longer at fixed addresses
in the process and are not swapped out with the process.
The utask structure contains the following information:
• Pointers to vnodes of the current and root directories.
• Pointer to the proc structure.
• Array of signal handlers and other fields related to signaling.
• Open file descriptors table.
• Default file creation mask (cmask).
• Resource usage, quotas, and profiling information.
If one thread opens a file, the descriptor is shared by all threads in the task. Likewise, they
all have a common current working directory. The uthread structure describes the per-thread re-
sources of a UNIX process, which include:
• pointer to saved user-level registers
• pathname traversal fields
• current and pending signals
• thread-specific signal handlers
To ease the porting effort, references to fields of the old u area have been converted to refer-
ences to utask or uthread fields. This conversion is achieved by macros such as:
The proc structure is retained with few changes, but much of its functionality is now pro-
vided by the task and thread structures. As a result, many of its fields are unused, although they
are retained for historical reasons. For instance, the fields related to scheduling and priority are un-
necessary because Digital UNIX schedules each thread individually. The Digital UNIX proc struc-
ture contains the following information:
• Links to put the structure on the allocated, zombie, or free process list.
• Signal masks.
• Pointer to the credentials structure.
Figure 3-10. Digital UNIX data structures for tasks and threads.
[Figure: thread T1 continues other work while a second thread issues the I/O request and blocks; when the I/O completes, the second thread resumes, notifies T1, and terminates, and T1 processes the notification and continues.]
3.9 Mach 3.0 Continuations
The UNIX kernel uses a process model of programming. Each thread has a kernel stack, used when
it traps into the kernel for a system call or exception. When the thread blocks in the kernel, the stack
contains its execution state, including its call sequence and automatic variables. This has the advan-
tage of simplicity, as kernel threads can block without having to explicitly save any state. The main
drawback is the excessive memory consumption.
Some operating systems, such as QuickSilver [Hask 88] and V [Cher 88] use an interrupt
model of programming. The kernel treats system calls and exceptions as interrupts, using a single
per-processor kernel stack for all kernel operations. Consequently, if a thread needs to block while
in the kernel, it must first explicitly save its state somewhere. The kernel uses this saved information
to restore the thread's state the next time it runs.
The main advantage of the interrupt model is the memory saved by having a single kernel
stack. The main drawback is that a thread must save its state explicitly for each potentially blocking
operation. This makes the model difficult to program, because the information that must be saved
may span module boundaries. Hence if a thread blocks while in a deeply nested procedure, it must
determine what state is needed by all functions in the call chain.
The conditions under which a thread must block dictate which model is more suitable. If a
thread blocks deep inside a call chain, it will benefit from the process model. If, however, the thread
has little state to save when it blocks, the interrupt model will work better. Many server programs,
for instance, repeatedly block in the kernel to wait for a client request, and then process the request
when it arrives. Such a program does not have much state to maintain in the kernel and can easily
relinquish its stack.
The Mach 3.0 continuations facility combines the advantages of the two models, and allows
the kernel to choose the blocking method depending on the circumstances. We now examine its de-
sign and implementation.
where contfn() is the continuation function to be invoked the next time the thread runs. Passing a
NULL argument indicates that traditional blocking behavior is required. This way, the thread can
choose to use continuations selectively.
When a thread wishes to use a continuation, it first saves any state that might be needed after
resuming. The thread structure contains a 28-byte scratch area for this purpose; if more space is
needed, the thread must allocate an additional data structure. The kernel blocks the thread and recap-
tures its stack. When the thread is resumed, the kernel gives it a new stack and invokes the con-
tinuation function. This function recovers the state from where it was saved. This requires that both
the continuation and the calling function must have a detailed understanding about what state was
saved and where.
The following example illustrates the use of continuations. Example 3-1 uses the traditional
approach to blocking a thread:
syscall_t (arg1)
{
        thread_block ();
        f2 (arg1);
        return;
}
f2 (arg1)
{
        ...
}

Using a continuation, the state needed by f2() is saved explicitly and f2 is passed to the blocking call:

syscall_t (arg1)
{
        save arg1 and any other needed state;
        thread_block (f2);      /* does not return */
}
f2 ()
{
        restore arg1 and any other state information;
        thread_syscall_return (status);
}
Note that when thread_block() is called with an argument, it does not return to the caller;
when the thread resumes, the kernel transfers control to f2(). The thread_syscall_return()
function is used to return to the user level from a system call. The entire process is transparent to the
user, who sees only a synchronous return from the system call.
The kernel uses continuations when only a small amount of state must be saved when
blocking. For instance, one of the most common blocking operations occurs during page fault han-
dling. In traditional implementations, the handler code issues a disk read request and blocks until
the read completes. When this happens, the kernel simply returns the thread to user level, and the
application can resume. The work that must be done after the read completes requires little
saved state (perhaps a pointer to the page that was read in, and the memory mapping data that must
be updated). This is a good example of how continuations are useful.
3.9.3 Optimizations
The direct benefit of continuations is to reduce the number of kernel stacks in the system. Continua-
tions also allow some important optimizations. Suppose, during a context switch, the kernel discov-
ers that both the old and new threads have used continuations. The old thread has relinquished its
kernel stack, and the new thread does not have one. The kernel can directly transfer the stack from
the old thread to the new, as shown in Figure 3-12. Besides saving the overhead of allocating a new
stack, this also helps reduce the cache and translation lookaside buffer (TLB) misses (see Section
13.3 .1) associated with a context switch, since the same memory is reused.
The Mach IPC (interprocess communication) implementation (see Section 6.4) takes this
one step further. A message transfer involves two steps: a client thread uses the mach_msg system
call to send a message and wait for a reply, and a server thread uses mach_msg to send replies to
clients and wait for the next request. The message is sent to and received from a port, which is a
protected queue of messages. The sending and receiving are independent of each other. If a receiver
is not ready, the kernel queues the message on the port.
When a receiver is waiting, the transfer can be optimized using a continuation. If the sender
finds a receiver waiting, it hands off its stack to the receiver and blocks itself with a
mach_msg_ continue() continuation. The receiving thread resumes using the sender's stack, which
already contains all the information about the message to be transferred. This avoids the overhead of
queuing and dequeuing the message and considerably speeds up the message transfer. When the
server replies, it will hand off its stack to the client thread and resume it in a similar fashion.
3.9.4 Analysis
Continuations have proved extremely effective in Mach. Because their use is optional, it is unneces-
sary to change the entire programming model, and their use can be extended incrementally. Con-
tinuations greatly reduce the demands placed on kernel memory. Performance measurements
[Drav 91] determined that on average, the system required only 2.002 kernel stacks per processor,
reducing the per-thread kernel space from 4664 to 690 bytes.
Mach 3.0 is particularly well suited for continuations, since it is a microkernel that exports
only a small interface and has a small number of abstractions. In particular, the UNIX compatibility
code has been removed from the kernel and is provided by user-level servers [Golu 90]. As a result
there are only about 60 places where the kernel can block, and 99% of the blocks occur at just six
"hot spots." Concentrating on those provides a large benefit for a small programming effort. Tradi-
tional UNIX systems, in contrast, may block at several hundred places and have no real hot spots.
3.10 Summary
We have seen several different ways of designing multithreaded systems. There are many types of
thread primitives, and a system can combine one or more of them to create a rich concurrent pro-
gramming environment. Threads may be supported by the kernel, by user libraries, or by both.
Application developers must also choose the right blend of kernel and user facilities. One
problem they face is that each operating system vendor provides a different set of system calls to
create and manage threads, making it difficult to write portable multithreaded code that efficiently
uses the system resources. The POSIX 1003.4a standard defines the threads library functions, but
does not address the kernel interfaces or implementation.
3.11 Exercises
1. For each of the following applications, discuss the suitability of lightweight processes, user
threads, or other programming models:
(a) The server component of a distributed name service.
(b) A windowing system, such as the X server.
(c) A scientific application that runs on a multiprocessor and performs many parallel
computations.
(d) A make utility that compiles files in parallel whenever possible.
2. In what situations is an application better off using multiple processes rather than LWPs or
user threads?
3. Why does each LWP need a separate kernel stack? Can the system save resources by
allocating a kernel stack only when an LWP makes a system call?
4. The proc structure and the u area contain process attributes and resources. In a multithreaded
system, which of their fields may be shared by all LWPs of the process, and which must be
per-LWP?
5. Suppose one LWP invokes fork at the same instant that another LWP of the same
process invokes exit. What would be the result if the system uses fork to duplicate all LWPs of
the process? What if fork duplicates only one LWP?
6. Would the problems with fork in a multithreaded system be addressed by having a single
system call to do an atomic fork and exec?
7. Section 3.3.2 described the problems with having a single shared set of resources such as file
descriptors and the current directory. Why should these resources not be per-LWP or per-user-
thread? [Bart 88] explores this idea further.
8. The standard library defines a per-process variable called errno, which contains the error
status from the last system call. What problems does this create for a multithreaded process?
How can these problems be solved?
9. Many systems classify library functions as thread-safe or thread-unsafe. What causes a
function to be unsafe for use by a multithreaded application?
10. What are the drawbacks of using threads to run interrupt handlers?
11. What are the drawbacks of having the kernel control LWP scheduling?
12. Suggest an interface that would allow a user to control which of its LWPs is scheduled first.
What problems can this cause?
13. Compare the multithreading primitives of Solaris and Digital UNIX. What are the advantages
of each?
3.12 References
[Ande 91] Anderson, T.E., Bershad, B.N., Lazowska, E.D., and Levy, H.M., "Scheduler
Activations: Effective Kernel Support for the User-Level Management of
Parallelism," Proceedings of the Thirteenth Symposium on Operating System
Principles, Oct. 1991, pp. 95-109.
[Arma 90] Armand, F., Hermann, F., Lipkis, J., and Rozier, M., "Multi-threaded Processes in
Chorus/MIX," Proceedings of the Spring 1990 European UNIX Users Group
Conference, Apr. 1990.
[Bart 88] Barton, J.M., and Wagner, J.C., "Beyond Threads: Resource Sharing in UNIX,"
Proceedings of the Winter 1988 USENIX Technical Conference, Jan. 1988, pp. 259-
266.
[Bita 95] Bitar, N., "Selected Topics in Multiprocessing," USENIX 1995 Technical
Conference Tutorial Notes, Jan. 1995.
[Blac 90] Black, D.L., "Scheduling Support for Concurrency and Parallelism in the Mach
Operating System," IEEE Computer, May 1990, pp. 35-43.
[Cher 88] Cheriton, D.R., "The V Distributed System," Communications of the ACM, Vol. 31,
No.3, Mar. 1988, pp. 314-333.
[Coop 90] Cooper, E.C., and Draves, R.P., "C Threads," Technical Report CMU-CS-88-154,
Department of Computer Science, Carnegie Mellon University, Sep. 1990.
[DEC 94] Digital Equipment Corporation, DEC OSF/1 Guide to DECthreads, Part No. AA-
Q2DPB-TK, July 1994.
[Doep 87] Doeppner, T.W., Jr., "Threads, A System for the Support of Concurrent
Programming," Brown University Technical Report CS-87-11, Jun. 1987.
[Drav 91] Draves, R.P., Bershad, B.N., Rashid, R.F., and Dean, R.W., "Using Continuations to
Implement Thread Management and Communication in Operating Systems,"
Technical Report CMU-CS-91-115R, Department of Computer Science, Carnegie
Mellon University, Oct. 1991.
[Eykh 92] Eykholt, J.R., Kleiman, S.R., Barton, S., Faulkner, R., Shivalingiah, A., Smith, M.,
Stein, D., Voll, J., Weeks, M., and Williams, D., "Beyond Multiprocessing:
Multithreading the SunOS Kernel," Proceedings of the Summer 1992 USENIX
Technical Conference, Jun. 1992, pp. 11-18.
[Golu 90] Golub, D., Dean, R., Forin, A., and Rashid, R., "UNIX as an Application Program,"
Proceedings of the Summer 1990 USENIX Technical Conference, Jun. 1990, pp. 87-
95.
[IEEE 94] Institute of Electrical and Electronics Engineers, POSIX P1003.4a, Threads
Extension for Portable Operating Systems, 1994.
[Hask 88] Haskin, R., Malachi, Y., Sawdon, W., and Chan, G., "Recovery Management in
QuickSilver," ACM Transactions on Computer Systems, Vol. 6, No. 1, Feb. 1988,
pp. 82-108.
[Kepe 85] Kepecs, J., "Lightweight Processes for UNIX Implementation and Applications,"
Proceedings of the Summer 1985 USENIX Technical Conference, Jun. 1985, pp.
299-308.
[Kepp 91] Keppel, D., "Register Windows and User-Space Threads on the SPARC," Technical
Report 91-08-01, Department of Computer Science and Engineering, University of
Washington, Seattle, WA, Aug. 1991.
[Klei 95] Kleiman, S.R., and Eykholt, J.R., "Interrupts as Threads," Operating Systems
Review, Vol. 29, No.2, Apr. 1995.
[Muel 93] Mueller, F., "A Library Implementation of POSIX Threads under UNIX,"
Proceedings of the Winter 1993 USENIX Technical Conference, Jan. 1993, pp. 29-
41.
[OSF 93] Open Software Foundation, Design of the OSF/1 Operating System-Release 1.2,
Prentice-Hall, Englewood Cliffs, NJ, 1993.
[Powe 91] Powell, M.L., Kleiman, S.R., Barton, S., Shah, D., Stein, D., and Weeks, M.,
"SunOS Multi-thread Architecture," Proceedings of the Winter 1991 USENIX
Technical Conference, Jan. 1991, pp. 65-80.
[Sun 93] Sun Microsystems, SunOS 5.3 System Services, Nov. 1993.
[Teva 87] Tevanian, A., Jr., Rashid, R.F., Golub, D.B., Black, D.L., Cooper, E., and Young,
M.W., "Mach Threads and the UNIX Kernel: The Battle for Control," Proceedings
of the Summer 1987 USENIX Technical Conference, Jun. 1987, pp. 185-197.
[Vand 88] Vandevoorde, M., and Roberts, E., "WorkCrews: An Abstraction for Controlling
Parallelism," International Journal of Parallel Programming, Vol. 17, No. 4, Aug.
1988, pp. 347-366.
4
Signals and Session Management
4.1 Introduction
Signals provide a mechanism for notifying processes of system events. They also function as a
primitive mechanism for communication and synchronization between user processes. The pro-
gramming interface, behavior, and internal implementation of signals differ greatly from one ver-
sion of UNIX to another, and also, for any single variant, from one release to another. To make
matters more confusing for the programmer, the operating system provides additional system calls
and library routines to support earlier interfaces and maintain backward compatibility. 1
The original System V implementation of signals was inherently unreliable and defective.
Many of its problems are addressed in 4.2BSD UNIX (with further enhancements in 4.3BSD),
which introduced a new, robust signal mechanism. The 4.2BSD signal interface, however, is in-
compatible with the System V interface in several respects. This causes problems both for applica-
tion developers, who wish to write portable code, and for other UNIX vendors, who want their ver-
sion of UNIX to be compatible with both System V and BSD.
The POSIX 1003.1 standard [IEEE 90] (also known as POSIX.l) imposes some order amid
the chaos created by the plethora of signal implementations. It defines a standard interface that all
compliant implementations must support. POSIX standards, however, do not regulate how the inter-
face must be implemented. The operating system is free to decide whether to provide the implemen-
tation in the kernel, through user-level libraries, or through a combination of both.
1 This creates other problems. If a library using one set of signal interfaces is linked with an application using another,
the program may behave incorrectly.
2 4.4BSD calls this file core.prog, where prog is the first 16 characters of the program that the process was executing
when it received the signal.
[Figure: normal execution is interrupted when a signal is delivered; the handler runs, after which normal execution resumes.]
For example, the terminal driver's interrupt handler identifies the target process and posts the signal to it. 3 Upon return from the interrupt, the process will check for and find the
signal.
Exceptions, however, result in synchronous signals. They are usually caused by a program-
ming error (division by zero, illegal instruction, etc.) and will occur at the same point if the program
is rerun in the same manner (i.e., if the same execution path is repeated). When an exception occurs
in a program, it causes a trap to the kernel mode. The trap handler in the kernel recognizes the ex-
ception and sends the appropriate signal to the current process. When the trap handler is about to
return to user mode, it calls issig(), thus receiving the signal.
It is possible for several signals to be pending to the process simultaneously. In that case, the
signals are processed one at a time. A signal might also arrive while executing a signal handler; this
can cause nesting of handlers. In most implementations, users can ask the kernel to selectively block
certain signals before invoking a specific handler (see Section 4.4.3). This allows users to disable or
control the nesting of signal handlers.
if (issig())
        psig();
3 On a multiprocessor, the target process may be running on a different processor than the one that handles the termi-
nal interrupt. In this case, the interrupt handler must arrange a special cross-processor interrupt so that the target sees
the signal.
main()
{
        signal (SIGINT, sigint_handler);        /* install the handler */
        ...
}
This, however, leads to a race condition. Suppose that the user types control-C twice in rapid
succession. The first causes a SI GI NT signal that resets the action to default and invokes the handler.
If the second control-C is typed before the handler is reinstalled, the kernel will take the default ac-
tion and terminate the process. This leaves a window between the time the handler is invoked and
the time it is reinstalled, during which the signal cannot be caught. For this reason, the old imple-
mentation is often referred to as unreliable signals.
There is also a performance problem regarding sleeping processes. In the old implementa-
tion, all information regarding signal disposition is stored in a u_signal[] array in the u area,
which contains one entry for each signal type. This entry contains the address of a user-defined
handler, SIG_DFL to specify that the default action should be taken, or SIG_IGN to specify that the
signal should be ignored.
Since the kernel can only read the u area of the current process, it has no way of knowing
how another process will deal with a signal. Specifically, if the kernel has to post a signal to a proc-
ess in an interruptible sleep, it cannot know if the process is ignoring the signal. It will thus post the
signal and wake up the process, assuming that the process is handling the signal. If the process finds
that it has awakened because of a signal that was to be ignored, it will simply go back to sleep. This
spurious wakeup results in unnecessary context switches and wasteful processing. It is far better if
the kernel can recognize and discard ignored signals without ever waking up the process.
Finally, the SVR2 implementation lacks a facility to block a signal temporarily, deferring its
delivery until unblocked. It also lacks support for job control, where groups of processes can be
suspended and resumed in order to control access to the terminal.
The sigpause call atomically unblocks a signal and blocks the process until it receives a signal. If the unmasked signal is already
pending when the system call is issued, the call returns immediately.
int sig_received = 0;

void
handler (int sig)
{
        sig_received++;
}

main()
{
        sigset (SIGQUIT, handler);
        ...
This example illustrates some features of SVR3 signaling. The sighold and sigrelse calls al-
low blocking and unblocking of a signal. The sigpause call atomically unblocks a signal and puts
the process to sleep until it receives a signal that is not ignored or blocked. The sigset system call
specifies a persistent handler that is not reset to default when the signal occurs. The old signal call is
retained for backward compatibility; handlers specified through signal are not persistent.
This interface still has several deficiencies [Stev 90]. Most important, the sighold, sigrelse,
and sigpause calls deal with only one signal at a time. There is no way to atomically block or un-
block multiple signals. In Example 4-2, if the handler was used by multiple signals, there is no satis-
factory way to code the critical region. We could block the signals one at a time, but sigpause can-
not atomically unblock all of them and then wait.
SVR3 also lacks support for job control and facilities such as automatic restarting of system
calls. These features, and many others, are provided in the 4BSD framework.
In 4.2BSD, an interrupted system call is automatically restarted after the handler returns instead of being aborted with an EINTR error. 4.3BSD adds the siginterrupt system
call, which allows selective enabling or disabling of this feature on a per-signal basis.
The BSD signal interface is powerful and flexible. Its main drawback is the lack of com-
patibility with the original AT&T interface (and even with the SVR3 interface, although that was
released later). These incompatibilities drove third-party vendors to develop various library inter-
faces that tried to satisfy both camps. Ultimately, SVR4 introduced a POSIX-compliant interface
that is backward compatible with previous releases of System V as well as with BSD semantics.
SA_ONSTACK - Handle this signal on the alternate stack, if one has been specified by
sigaltstack.
SA_NOCLDWAIT - Used only with SIGCHLD; asks the system not to create zombie processes
(see Section 2.8.7) when children of the calling process terminate. If
this process subsequently calls wait, it will sleep until all its children
terminate.
SA_SIGINFO - Provide additional information to the signal handler. Used for handling
hardware exceptions, etc.
SA_NODEFER - Do not automatically block this signal while its handler is running.
SA_RESETHAND - Reset the action to default before calling the handler.
SA_NODEFER and SA_RESETHAND provide backward compatibility with the original unreliable
signals implementation. In all cases, oact returns the previously installed sigaction
data.
• Compatibility interface
To provide compatibility with older releases, SVR4 also supports the signal, sigset, sig-
hold, sigrelse, sigignore, and sigpause calls. Systems that do not require binary compati-
bility may implement these calls as library routines.
Except for the last set, these system calls directly correspond to the POSIX.l functions in name,
calling syntax, and semantics.
4.7 Exceptions
An exception5 occurs when a program encounters an unusual condition, usually an error. Examples
include accessing an invalid address and attempting to divide by zero. This results in a trap to the
kernel, which normally generates a signal to notify the process of the exception.
In UNIX, the kernel uses signals to notify the user of exceptions. The type of signal depends
on the nature of the exception. For instance, an invalid address exception may result in a SIGSEGV
signal. If the user has declared a handler for that signal, the kernel invokes the handler. If not, the
default action is to terminate the process. This allows individual programs to install their own ex-
5 This section describes hardware exceptions, which must not be confused with software exceptions supported by cer-
tain languages such as C++.
ception handlers. Some programming languages, such as Ada, have built-in exception handling
mechanisms; these are implemented by the language library as signal handlers.
Exceptions are also used extensively by debuggers. Debugged (traced) programs generate
exceptions at breakpoints and upon completion of the exec system call. The debugger must intercept
these exceptions to control the program. The debugger may also wish to intercept other selected ex-
ceptions and signals generated by the debugged program. The ptrace system call in UNIX enables
this interception; it is described further in Section 6.2.4.
There are several drawbacks to the way UNIX handles exceptions. First, the signal handler
runs in the same context as the exception. This means that it cannot access the full register context
as it was at the time of the exception. When the exception occurs, the kernel passes some of the ex-
ception context to the handler. The amount of context passed depends on the specific UNIX variant
and on the hardware on which it runs. In general, a single thread must deal with two contexts-that
of the handler and that of the context in which the exception occurred.
Second, signals are designed for single-threaded processes. UNIX variants that support
multithreaded processes find it difficult to adapt signals to such an environment. Finally, due to
limitations of the ptrace system call, a traditional ptrace-based debugger can control only its im-
mediate children.
[Figure: exception handling in Mach; the victim thread raises the exception and waits, the handler thread receives it and replies, and the victim then resumes or terminates.]
Two messages are involved in handling a single exception. When the victim raises the ex-
ception, it sends a message to the handler and waits for the reply. The handler catches the exception
when it receives the message and clears it by sending a reply message to the victim. When the vic-
tim receives the reply, it can resume execution.
Since an exception could use either the task or the thread exception port, we need a way of
resolving the conflict. To do so, we observe that the thread exception port is used for error handlers
that should be transparent to debuggers. For example, a handler may respond to a floating point un-
derflow error by substituting zero as the result of the operation. Such an exception is usually of no
interest to the debugger, which would normally wish to intercept unrecoverable errors only. Hence
if an error handler is installed, Mach invokes it in preference to the debugger.
When an exception occurs, it is sent to the thread exception port if one exists. Thus excep-
tions that invoke error handlers are not seen by the debugger. If the installed error handler cannot
successfully clear the exception, it forwards it to the task exception port. (Since the error handler is
another thread in the same task, it has access to the victim's task exception port.) If neither handler
can handle the exception, the kernel terminates the victim thread.
A user-level network message server extends Mach IPC transparently across the
network. It allocates "proxy" ports for all remote ports, receives messages intended for them, and
forwards these messages across the network transparently to the sender. This allows a debugger to
control a task on any node on the network, just as it would control a local task.
4.8.4 Analysis
The Mach exception handling facility addresses many of the problems faced by UNIX. It is also
more robust and provides functionality not available in UNIX. Some of its important advantages
are:
• A debugger is not restricted to controlling its immediate children. It can debug any task,
provided it has the required permissions.
• A debugger can attach itself to a running task.6 It does so by registering one of its ports as
that task's exception port. It can also detach itself from a task, by resetting the task's ex-
ception port to its former value. This port is the only connection between the debugger and
the target, and the kernel contains no special support for debugging.
• The extension of Mach IPC over the network allows the development of distributed de-
buggers.
• Having a separate error handler thread allows a clean separation of the handler and victim
contexts and allows the handler to access the entire context of the victim.
• Multithreaded processes are handled cleanly. Only the thread that caused the exception is
suspended, while others remain unaffected. If several threads cause exceptions, each gen-
erates a separate message and is handled independently.
6 Currently, most UNIX debuggers are written using the /proc file system, which allows access to address spaces of
unrelated processes. Hence debuggers can easily attach and detach running processes. At the time when Mach ex-
ception handling was designed, this ability was uncommon.
Controlling terminal - Each process may have a controlling terminal. This is usually the login
terminal at which this process was created. All processes in the same group share the same controlling
terminal.
The /dev/tty file - The special file /dev/tty is associated with the controlling terminal of each
process. The device driver for this file simply routes all requests to the appropriate terminal. For instance,
in 4.3BSD, the device number of the controlling terminal is stored in the u_ttyd field of the
u area. A read to the terminal is thus implemented as
7 The terminal driver maintains the tty structure for each terminal.
[Figure: processes, process groups, the foreground process group, and the controlling terminal (tty) in a login session.]
Typical scenario - The init process forks a child for each terminal listed in the /etc/inittab file.
The child process calls setpgrp, becoming a group leader, and then execs the getty program, which
displays a login prompt and waits for input. When a user types in his login name, getty execs the
login program, which asks for and verifies the password, and finally, execs the login shell. Hence
the login shell is a direct child of init and is also a process group leader. Usually, other processes do
not create their own groups (except for system daemons started from a login session); hence all
processes belonging to a login session will be in the same process group.
Terminal access -There is no support for job control. All processes that have a terminal open can
access it equally, whether they are in the foreground or background. Output from such processes
will be randomly intermingled on the screen. If several processes try to read the terminal concur-
rently, it is purely a matter of chance which process will read any particular line of input.
Terminal signals- Signals such as SIGQUIT and SIGINT that are generated at the keyboard are
sent to all processes in the terminal's controlling group, thus usually to all processes in the login
session. These signals are really intended for foreground processes only. Hence when the shell cre-
ates a process that will run in the background, it sets them up to ignore these signals. It also redirects
the standard input of such processes to /dev/null, so that they may not read from the terminal
through that descriptor (they may still open other descriptors to read from the terminal).
Detaching the terminal - A terminal is detached from its controlling group when its t_pgrp field
is set to zero. This happens when no more processes have the terminal open or when the group
leader (usually the login process) exits.
Death of group leader - The group leader becomes the controlling process of its terminal and is
responsible for managing the terminal for the entire group. When it dies, its controlling terminal is
disassociated from the group (its t_pgrp is set to zero). Moreover, all other processes in its group
are sent a SIGHUP signal, and their p_pgrp is set to zero, so they do not belong to a process group
(they become orphaned).
Implementation - The p_pgrp field of the proc structure contains the process group ID. The u
area has two terminal-related fields - u_ttyp (pointer to the tty structure of the controlling terminal) and
u_ttyd (device number of the controlling terminal). The t_pgrp field in the tty structure contains the
controlling process group of the terminal.
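The fields named above might be declared roughly as follows; this is an illustrative fragment only, with all unrelated members of the structures omitted:

    /* Illustrative fragments; only the fields mentioned in the text are shown. */
    #include <sys/types.h>

    struct tty {
        short        t_pgrp;         /* controlling process group of the terminal */
        /* ... */
    };

    struct proc {
        short        p_pgrp;         /* process group ID */
        /* ... */
    };

    struct user {                    /* the u area */
        struct tty  *u_ttyp;         /* tty structure of the controlling terminal */
        dev_t        u_ttyd;         /* device number of the controlling terminal */
        /* ... */
    };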
4.9.3 Limitations
The SVR3 process group framework has several limitations [Lenn 86]:
• There is no way for a process group to close its controlling terminal and allocate another.
• Although the process groups are modeled after login sessions, there is no way to preserve
a login session after disconnecting from its controlling terminal. Ideally, we would like to
have such a session persist in the system, so that it can attach to another terminal at a later
time, preserving its state in the meantime.
• There is no consistent way of handling "loss of carrier" by a controlling terminal. The se-
mantics of whether such a terminal remains allocated to the group and can be reconnected
to the group differ from one implementation to another.
• The kernel does not synchronize access to the terminal by different processes in the group.
Foreground and background processes can read from or write to the terminal in an unregu-
lated manner.
• When the process group leader terminates, the kernel sends a SIGHUP signal to all proc-
esses in the group. Processes that ignore this signal can continue to access the controlling
terminal, even after it is assigned to another group. This can result in a new user receiving
unsolicited output from such a process, or worse, the process can read data typed by the
new user, causing a possible security breach.
• If a process other than the login process invokes setpgrp, it will be disconnected from the
controlling terminal. It can continue to access the terminal through any existing file de-
scriptors. The process, however, is not controlled by the terminal and will not receive
SIGHUP signals.
• There are no job control facilities, such as the ability to move processes between the fore-
ground and the background.
• A program such as a terminal emulator, which opens devices other than its controlling
terminal, has no way of receiving carrier loss notification from those devices.
4BSD addresses some of these problems. The next section describes the BSD approach.
[Figure: 4.3BSD job control - a login session containing several process groups (jobs) sharing the controlling terminal (tty), one of which is the foreground process group.]
8 It is also possible to combine two or more unconnected processes into a single process group by issuing multiple
shell commands on the same line, separated by semicolons and placed within parentheses, as for example:
% (cc tanman.c; cp file1 file2; echo done >newfile)
In 4.3BSD, several process groups, or jobs, may be active at the same time, all sharing the same controlling terminal. The t_pgrp field of the terminal's tty structure always contains the foreground job's process group.
Controlling terminals - If a process with a group ID of zero opens a terminal, the terminal becomes
the controlling terminal for that process, and the process joins the terminal's current controlling
group (the p_pgrp of the process is set to the t_pgrp of the terminal). If the terminal is currently
not a controlling terminal for another group, then this process is first made a group leader
(thus, both p_pgrp of the process and t_pgrp of the terminal are set to the process's PID). Direct
descendants of init (thus, all login shell processes) initially have a group ID of zero. Other than that,
only the superuser can reset a process's group ID to zero.
Terminal access - The foreground processes (the terminal's current controlling group, obtained
from t_pgrp) always have unobstructed access to the terminal. If a background process tries to read
from the terminal, the driver sends a SIGTTIN signal to all processes in its process group. SIGTTIN
suspends the receiving process by default. Writes by background processes are permitted by default.
4.3BSD provides a terminal option (the LTOSTOP bit manipulated by the TIOCLSET ioctl) that causes
a SIGTTOU signal to be sent to a background process that tries to write to the terminal. Jobs stopped
by SIGTTIN or SIGTTOU can later be resumed by sending them a SIGCONT signal.
Controlling group - A process that has read access to the terminal can use the TIOCSPGRP ioctl
call to change the terminal's controlling group (t_pgrp) to any other value. The shell uses this facility
to move jobs to the foreground or background. For example, a user can resume a suspended
process group and move it to the foreground by making it the controlling group and sending it a
SIGCONT signal. csh and ksh provide the fg and bg commands for this purpose.
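A job-control shell built on these facilities performs, in essence, the two steps in the hedged sketch below; TIOCSPGRP and killpg() are the 4.3BSD interfaces named above, while the wrapper function and the omission of error handling are illustrative.

    #include <sys/ioctl.h>
    #include <signal.h>

    /* Bring a stopped job (process group pgrp) to the foreground of the
     * terminal open on descriptor ttyfd.  Error handling is omitted. */
    void foreground(int ttyfd, int pgrp)
    {
        ioctl(ttyfd, TIOCSPGRP, &pgrp);   /* make pgrp the terminal's controlling group */
        killpg(pgrp, SIGCONT);            /* resume the stopped processes in the job */
    }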
Closing the terminal - When no process has the terminal open, the terminal is disassociated from
the group and its t_pgrp is set to zero. This is done by the terminal driver's close routine, called
when the last descriptor to the terminal is closed.
Reinitializing the terminal line - 4.3BSD provides a vhangup system call, typically used by init
to terminate a login session and start another. vhangup traverses the open file table, finds each entry
that resolves to this terminal, and makes it unusable. It can do so by deleting the open mode in the
file table entry or, in implementations that support the vnode interface (see Section 8.6), by chang-
ing the vnodeops pointer to point to a set of functions that simply return an error. vhangup then
calls the close() routine of the terminal and, finally, sends the SIGHUP signal to the terminal's
controlling group. This is the 4.3BSD solution to handling processes that continue after the login
session terminates.
4.9.5 Drawbacks
While 4.3BSD job control is powerful and versatile, it has some important drawbacks:
• There is no clear representation of a login session. The original login process is not special
and may not even be a group leader. SIGHUP is typically not sent when the login process
terminates.
• No single process is responsible for controlling the terminal. Thus a loss of carrier condi-
tion sends a SIGHUP signal to its current controlling group, which could even be ignoring
this signal. For instance, a remote user connected via a modem would remain logged in if
he or she simply disconnected the line.
• A process can change the terminal's controlling group to any value, even a nonexistent
one. If a group is later created with that group ID, it will inherit the terminal and receive
signals from it unintentionally.
• The programming interface is incompatible with that of System V.
Clearly, we want an approach that will preserve the concepts of login sessions and of tasks
within such sessions. The next section looks at the sessions architecture of SVR4 and how it deals
with these issues.
[Figure 4-5 depicts a session containing process groups (one designated the foreground group), the session object, and the controlling terminal (tty).]
Figure 4-5. SVR4 sessions architecture.
Its function is to change the process group of the target process, identified by pid, to the value
specified by pgid. If pgid is zero, the process group is set to the same value as the pid, thus making
the process a group leader. If pid is zero, the call acts on the calling process itself. There are, however,
some important restrictions. The target process must be either the caller itself or a child of the
caller that has not yet called exec. The caller and the target processes must both belong to the same
session. If pgid is not the same as the target's PID (or zero, which has the same effect), it must be
equal to another existing group ID within the same session.
Hence processes may move from one group to another within a session. The only way they
can leave the session is by calling setsid to start a new session with themselves as the sole member.
A process that is a group leader may relinquish leadership of its group by moving to another group.
Such a process, however, cannot start a new session as long as its PID is the group ID of any other
process (that is, the group whose leadership the process relinquished is not empty). This prevents
the confusing situation in which a process group has the same ID as a session of which it is not a
part.
Likewise, a terminal's foreground (controlling) group may only be changed by a process in
the session that controls the terminal, and it can only be changed to another valid group in the same
session. This feature is used by job control shells to move jobs to the foreground or background.
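These restrictions shape the familiar idiom used by POSIX job-control shells. The sketch below is illustrative rather than any particular shell's code: the new job is placed in its own group with setpgid() in both parent and child (to avoid a race), and tcsetpgrp() makes that group the terminal's foreground group; signal setup and error handling are omitted.

    #include <sys/types.h>
    #include <unistd.h>

    pid_t start_foreground_job(int ttyfd, char *argv[])
    {
        pid_t pid = fork();
        if (pid == 0) {                 /* child */
            setpgid(0, 0);              /* become leader of a new process group */
            tcsetpgrp(ttyfd, getpid()); /* make the new group the foreground group */
            execvp(argv[0], argv);
            _exit(127);                 /* exec failed */
        }
        setpgid(pid, pid);              /* parent repeats both steps to avoid a race */
        tcsetpgrp(ttyfd, pid);
        return pid;
    }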
[Figure: the session leader and the other processes in the session point, through p_sessp, to a shared session structure, which records the controlling tty's device number and a pointer to its vnode.]
In addition, the driver sends a SIGTSTP signal to the foreground process group, if it is differ-
ent from that of the session leader. This prevents foreground processes from receiving unexpected
errors when trying to access the terminal. The controlling terminal remains allocated to the session.
This gives the session leader the option of trying to reconnect to the terminal after the connection is
reestablished.
A session leader may disconnect the current controlling terminal and open a new one. The
kernel will set the session's vnode pointer to point to the vnode of the new terminal. As a result, all
processes in this login session will switch transparently to the new controlling terminal. The indi-
rection provided by /dev/tty makes it easy to propagate this change of controlling terminal.
When the session leader terminates, it ends the login session. The controlling terminal is
deallocated by setting the session's vnode pointer to NULL. As a result, none of the processes in
this session can access the terminal through /dev/tty (they can continue to access the terminal if
they have explicitly opened its device file). Processes in the foreground group of the terminal are
sent a SIGHUP signal. All direct children of the exiting process are inherited by the init process.
[Figure: the SVR4 linkage among the process-group structures (pg_mem, pg_session, pg_id), the tty structure (t_session, t_pgrp), and the session structure, whose s_ttyvp field points to the vnode of the controlling tty.]
4.11 Summary
The POSIX 1003.1 standard has helped bring together divergent and mutually incompatible methods
of signal and controlling-terminal handling. The resulting interfaces are robust and closely
match the expectations of typical applications and users.
4.12 Exercises
Note - Some of the questions have different answers for each major UNIX variant. The student
may answer such questions for the UNIX system with which he or she is most familiar.
1. Why are signal handlers not preserved across an exec system call?
2. Why is the SIGCHLD signal ignored by default?
3. What happens if a signal is generated for a process while it is in the middle of a fork, exec, or
exit system call?
4. Under what circumstances will a kill signal not terminate a process immediately?
5. Traditional UNIX systems use the sleep priority for two purposes-to decide if a signal
should wake up the sleeping process and to determine the scheduling priority of the process
after waking up. What is the drawback of this approach, and how do modern systems address
it?
6. What is the drawback of having signal handlers be persistent (remain installed after being
invoked)? Are there any specific signals that should not have persistent handlers?
7. How does the 4.3BSD sigpause call differ from that of SVR3? Describe a situation in which it
is more useful.
8. Why is it desirable to have the kernel restart an interrupted system call rather than have the
user do so?
9. What happens if a process receives several instances of the same signal before it can handle
the first instance? Would other semantics be more useful for this situation?
10. Suppose a process has two signals pending and has declared handlers for each of them. How
does the kernel ensure that the process handles the second signal immediately after handling
the first?
11. What if a process receives a signal while handling another? How may a process control its
behavior in this case?
12. When should a process use the SA_NOCLDWAIT feature of SVR4? When should it not use it?
13. Why would an exception handler need the full context of the process that raised the
exception?
14. Which process may create a new process group in (a) 4.3BSD and (b) SVR4?
15. What benefits does the SVR4 sessions architecture offer over the 4.3BSD terminal and job
control facilities?
16. [Bell 88] describes a user-level session manager to support login sessions. How does this
compare with the SVR4 sessions architecture?
17. What should the SVR4 kernel do when a session leader deallocates its controlling terminal?
18. How does SVR4 allow a session to reconnect to its controlling terminal? In what situations is
this useful?
4.13 References
[AT&T 86] American Telephone & Telegraph, UNIX System V Release 3: Programmer's
Reference Manual, 1986.
[Bach 86] Bach, M.J., The Design of the UNIX Operating System, Prentice-Hall, Englewood
Cliffs, NJ, 1986.
[Bell 88] Bellovin, S.M., "The Session Tty Manager," Proceedings of the Summer 1988
USENIX Technical Conference, Jun. 1988.
[Blac 88] Black, D.L., Golub, D.B., Hauth, K., Tevanian, A., and Sanzi, R., "The Mach
Exception Handling Facility," CMU-CS-88-129, Computer Science Department,
Carnegie Mellon University, Apr. 1988.
[IEEE 90] Institute of Electrical and Electronics Engineers, Information Technology-Portable
Operating System Interface (POSIX) Part I: System Application Program Interface
(API) [C Language], 1003.1-1990, Dec. 1990.
[Joy 80] Joy, W., "An Introduction to the C Shell," Computer Science Division, University of
California at Berkeley, Nov. 1980.
[Leff 89] Leffler, S.J., McKusick, M.K., Karels, M.J., and Quarterman, J.S., The Design and
Implementation of the 4.3BSD UNIX Operating System, Addison-Wesley, Reading,
MA, 1989.
[Lenn 86] Lenner, D.C., "A System V Compatible Implementation of 4.2 BSD Job Control,"
Proceedings of the Summer 1986 USENIX Technical Conference, Jun. 1986, pp.
459-474.
[Stev 90] Stevens, W.R., UNIX Network Programming, Prentice-Hall, Englewood Cliffs, NJ,
1990.
[Stev 92] Stevens, W.R., Advanced Programming in the UNIX Environment, Addison-Wesley,
Reading, MA, 1992.
[UNIX 92] UNIX Systems Laboratories, Operating System API Reference: UNIX SVR4.2, UNIX
Press, 1992.
[Will 89] Williams, T., "Session Management in System V Release 4," Proceedings of the
Winter 1989 USENIX Technical Conference, Jan. 1989, pp. 365-375.
5
Process Scheduling
5.1 Introduction
Like memory and terminals, the CPU is a shared resource for which processes in the system con-
tend. The operating system must decide how to apportion this resource among all the processes. The
scheduler is the component of the operating system that determines which process to run at any
given time, and how long to let it run. UNIX is essentially a time-sharing system, which means it
allows several processes to run concurrently. To some extent this is an illusion (at least on a uni-
processor), because a single processor can run only one process at any given instant. The UNIX
system emulates concurrency by interleaving processes on a time-share basis. The scheduler gives
the CPU to each process for a brief period of time before switching to another process. This period
is called a time quantum or time slice.
A description of the UNIX scheduler must focus on two aspects: The first deals with pol-
icy-the rules used to decide which process to run and when to switch to another. The second deals
with implementation-the data structures and algorithms used to carry out these policies. The
scheduling policy must try to meet several objectives-fast response time for interactive applica-
tions, high throughput for background jobs, avoidance of process starvation, and so forth. These
goals often conflict with each other, and the scheduler must balance them the best that it can. It also
must implement its policy efficiently and with minimum overhead.
At the lowest level, the scheduler arranges for the processor to switch from one process to
another. This is called a context switch. The kernel saves the hardware execution context of the cur-
rent process in its process control block (PCB), which is traditionally part of the u area of the process.
The context is a snapshot of the values of the general-purpose, memory management, and other
special registers of the process. The kernel then loads the hardware registers with the context of the
next process to be run. (The context is obtained from the PCB of this process.) This causes the CPU
to begin executing the next process from the saved context. The primary responsibilities of the
scheduler are to decide when to perform a context switch and which process to run.
Context switches are expensive operations. Besides saving a copy of the process registers,
the kernel must perform many architecture-specific tasks. On some systems, it must flush the data,
instruction, or address translation cache to avoid incorrect memory access (see Sections 15.9-15.13)
by the new process. As a result, the new process incurs several main memory accesses when it starts
running. This degrades the performance of the process, because memory access is significantly
slower than cache access. Finally, on pipelined architectures such as Reduced Instruction Set Com-
puters (RISC), the kernel must flush the instruction pipeline prior to switching context. These fac-
tors may influence not only the implementation, but also the scheduling policy.
This chapter first describes the handling of the clock interrupt and timer-based tasks. The
clock is critical to the operation of the scheduler, because the scheduler often wants to preempt
running processes when their time slice expires. The rest of this chapter examines various scheduler
designs and how they affect the behavior of the system.
1 This is far from universal and depends on the UNIX variant. It also depends on the resolution of the system's hard-
ware clock.
• Updates the time-of-day clock and other related clocks. For instance, SVR4 maintains a
variable called lbolt to store the number of ticks that have elapsed since the system was
booted.
• Handles callouts (see Section 5.2.1).
• Wakes up system processes such as the swapper and pagedaemon when appropriate.
• Handles alarms (see Section 5.2.2).
Some of these tasks do not need to be performed on every tick. Most UNIX systems define a notion
of a major tick, which occurs once every n clock ticks, where n depends on the specific UNIX vari-
ant. The scheduler performs some of its tasks only on major clock ticks. For instance, 4.3BSD per-
forms priority recomputation on every fourth tick, while SVR4 handles alarms and wakes up system
processes once a second if necessary.
5.2.1 Callouts
A callout records a function that the kernel must invoke at a later time. In SVR4, for example, any
kernel subsystem may register a callout through a kernel routine that takes three arguments: fn(), the
kernel function to invoke; arg, an argument to pass to fn(); and delta, the time interval in CPU ticks
after which the function must be invoked. The kernel invokes the callout function in system context.
Hence the function must neither sleep nor access process context. The registration routine returns an
identifier, to_ID, which may be used to cancel the callout through a corresponding cancellation routine.
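As a concrete illustration, many System V derived kernels name the registration and cancellation routines timeout() and untimeout(); the declarations and the flush_buffers() callout below are assumptions made for the sake of the sketch, not the definitive SVR4 prototypes.

    #include <sys/types.h>

    extern int  timeout(void (*fn)(caddr_t), caddr_t arg, long delta);
    extern void untimeout(int to_ID);

    static void flush_buffers(caddr_t arg)     /* hypothetical callout function */
    {
        /* runs in system context: must not sleep or touch process context */
    }

    void schedule_flush(void)
    {
        int to_ID = timeout(flush_buffers, (caddr_t)0, 100);  /* fire in 100 ticks */
        /* ... later, if the flush is no longer wanted ... */
        untimeout(to_ID);
    }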
2 Many implementations provide an optimization when no other interrupt is pending when the primary handler com-
pletes. In this case, the clock handler directly lowers the interrupt priority and invokes the callout handler.
The time required to insert a new callout into the list is less critical, since insertions typically occur at
lower priority and much less frequently than once per tick.
There are several ways to implement the callout list. One method used in 4.3BSD [Leff 89]
sorts the list in order of the "time to fire". Each entry stores the difference between its time to fire
and that of the previous callout. The kernel decrements the time of the first entry at each clock tick
and issues the callout if the time reaches zero. Other callouts due at the same time are also issued.
This is described in Figure 5-1.
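A sketch of this differential list follows. The field and variable names echo the 4.3BSD ones, but the code is a simplification rather than kernel source; returning fired entries to a free list and protecting the list against interrupts are omitted.

    struct callout {
        struct callout *c_next;
        int             c_time;           /* ticks remaining after the previous entry */
        void          (*c_func)(void *);
        void           *c_arg;
    };

    struct callout *calltodo;              /* head of the time-sorted list */

    /* Called once per clock tick. */
    void run_callouts(void)
    {
        if (calltodo == 0)
            return;
        if (--calltodo->c_time > 0)        /* only the first entry is decremented */
            return;
        while (calltodo != 0 && calltodo->c_time <= 0) {
            struct callout *c = calltodo;  /* due now: fire it */
            calltodo = c->c_next;
            (*c->c_func)(c->c_arg);
            /* the fired entry would be returned to a free list here */
        }
    }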
Another approach uses a similarly sorted list, but stores the absolute time of expiration for
each entry. This way, at each tick, the kernel compares the current absolute time with that of the
first entry and issues the callout when the times are equal.
Both methods require maintaining a sorted list, which can be expensive if the list is large.
An alternative solution is to use a timing wheel, which is a fixed-size, circular array of callout
queues. At every tick, the clock interrupt handler advances a current time pointer to the next element
in the array, wrapping around at the end of the array. If there are any callouts on that queue, their
expiration time is checked. New callouts are inserted on the queue that is N elements away from the
current queue, where N is the time to fire measured in ticks.
In effect, the timing wheel hashes the callouts based on the expiry time (time at which they
are due). Within each queue, the callouts can be kept either unsorted or sorted. Sorting the callouts
reduces the time required to process non-empty queues, but increases the insertion time. [Varg 87]
describes further refinements to this method that use multiple hierarchical timing wheels to optimize
timer performance.
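The sketch below shows one way such a wheel might be coded, assuming a 256-slot wheel and unsorted per-slot queues; all names are illustrative.

    #define WHEEL_SIZE 256

    struct twheel_callout {
        struct twheel_callout *next;
        unsigned long          expires;    /* absolute tick at which to fire */
        void                 (*func)(void *);
        void                  *arg;
    };

    static struct twheel_callout *wheel[WHEEL_SIZE];
    static unsigned long now;              /* current tick */

    void twheel_insert(struct twheel_callout *c, unsigned long delta)
    {
        unsigned slot = (now + delta) % WHEEL_SIZE;
        c->expires = now + delta;
        c->next = wheel[slot];             /* push onto that slot's (unsorted) queue */
        wheel[slot] = c;
    }

    /* Called once per clock tick: advance the wheel and fire whatever is due. */
    void twheel_tick(void)
    {
        struct twheel_callout **pp = &wheel[++now % WHEEL_SIZE];
        while (*pp) {
            struct twheel_callout *c = *pp;
            if (c->expires <= now) {       /* due on this revolution */
                *pp = c->next;
                (*c->func)(c->arg);
            } else {
                pp = &c->next;             /* due on a later revolution: skip */
            }
        }
    }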
5.2.2 Alarms
A process can request the kernel to send it a signal after a specific amount of time, much like an
alarm clock. There are three types of alarms-real-time, profiling, and virtual-time. A real-time
alarm relates to the actual elapsed time, and notifies the process via a SIGALRM signal. The profiling
alarm measures the amount of time the process has been executing and uses the SIGPROF signal for
notification. The virtual-time alarm monitors only the time spent by the process in user mode and
sends the SIGVTALRM signal.
In BSD UNIX, the setitimer system call allows the process to request any type of alarm and
specify the time interval in microseconds. Internally, the kernel converts this interval to the appro-
priate number of CPU ticks, because that is the highest resolution the kernel can provide. In System
V, the alarm system call asks for a real-time alarm. The time must be a whole number of seconds.
SVR4 adds the hrtsys system call, which provides a high-resolution timer interface that allows time
to be specified in microseconds. This allows compatibility with BSD by implementing setitimer
(also getitimer, gettimeofday, and settimeofday) as a library routine. Likewise, BSD provides alarm
as a library routine.
The high resolution of real-time alarms does not imply high accuracy. Suppose a user re-
quests a real-time alarm to sound after 60 milliseconds. When that time expires, the kernel promptly
delivers the SIGALRM signal to the calling process. The process, however, will not see and respond to
the signal until it is next scheduled to run. This could introduce a substantial delay depending on the
receiver's scheduling priority and the amount of activity in the system. High-resolution timers are
helpful when used by high-priority processes, which are less likely to have scheduling delays. Even
these processes can be delayed if the current process is executing in kernel mode and does not reach
a preemption point. These concepts are explained further in Section 5.5.4.
The profiling and virtual alarms do not have this problem, because they are not concerned
with the actual clock time. Their accuracy is affected by another factor. The clock interrupt handler
charges the whole tick to the current process, even though it may have used only a part of it. Thus
the time measured by these alarms reflects the number of clock interrupts that have occurred while
this process was running. In the long run, this averages out and is a good indicator of the time used
by the process. For any single alarm, however, it results in significant inaccuracy.
• Batch - Activities such as software builds and scientific computations do not require
user interaction and are often submitted as background jobs. For such tasks, the measure
of scheduling efficiency is the task's completion time in the presence of other activity, as
compared to the time required on an otherwise inactive system.
• Real-time - This is a catchall class of applications that are often time-critical. Although
there are many types of real-time applications, each with its own set of requirements, they
share many common features. They normally need predictable scheduling behavior with
guaranteed bounds on response times. For instance, a video application may want to dis-
play a fixed number of video frames per second (fps). It may care more about minimizing
the variance than simply obtaining more CPU time. Users may prefer a constant rate of 15
fps to one that fluctuates noticeably between 10 and 30 fps, with an average of 20 fps.
A typical workstation may run many different types of applications simultaneously. The
scheduler must try to balance the needs of each. It must also ensure that kernel functions such as
paging, interrupt handling, and process management can execute promptly when required.
In a well-behaved system, all applications must continue to progress. No application should
be able to prevent others from progressing, unless the user has explicitly permitted it. Moreover, the
system should always be able to receive and process interactive user input; otherwise, the user
would have no way to control the system.
The choice of scheduling policy has a profound effect on the system's ability to meet the
requirements of different types of applications. The next section reviews the traditional
(SVR3/4.3BSD) scheduler, which supports interactive and batch jobs only. The rest of this chapter
examines schedulers in modern UNIX systems, which also provide some support for real-time ap-
plications.
We begin by describing the traditional UNIX scheduling algorithm, which is used in both SVR3 and
4.3BSD UNIX. These systems are primarily targeted at time-sharing, interactive environments with
several users running several batch and foreground processes simultaneously. The scheduling policy
aims to improve response times of interactive users, while ensuring that low-priority, background
jobs do not starve.
Traditional UNIX scheduling is priority-based. Each process has a scheduling priority that
changes with time. The scheduler always selects the highest-priority runnable process. It uses pre-
emptive time-slicing to schedule processes of equal priority, and dynamically varies process priori-
ties based on their CPU usage patterns. If a higher-priority process becomes ready to run, the
scheduler preempts the current process even if it has not completed its time slice or quantum.
The traditional UNIX kernel is strictly nonpreemptible. If a process is executing kernel code
(due to a system call or interrupt), it cannot be forced to yield the CPU to a higher-priority process.
The running process may voluntarily relinquish the CPU when blocking on a resource. Otherwise, it
can be preempted when it returns to user mode. Making the kernel nonpreemptible solves many
synchronization problems associated with multiple processes accessing the same kernel data struc-
tures (see Section 2.5).
The following subsections describe the design and implementation of the 4.3BSD scheduler.
The SVR3 implementation differs only in a few minor respects, such as some function and variable
names.
3 The nice(1) command is normally used for this purpose. It accepts any value between -20 and 19 (only the superuser
can specify negative values). This value is used as an increment to the current nice value.
At every clock tick, the clock handler increments p_cpu for the current process, to a maximum of 127. Moreover,
every second, the kernel invokes a routine called schedcpu() (scheduled by a callout) that reduces
the p_cpu value of each process by a decay factor. SVR3 uses a fixed decay factor of 1/2. 4.3BSD
uses the following formula:

    p_cpu = p_cpu * (2 * load_average) / (2 * load_average + 1) + p_nice

where load_average is the average number of runnable processes over the last minute.
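As a sketch, the decay applied once per second can be written as a small function; the load average is treated here as a plain integer, and the requeueing work that schedcpu() also performs is omitted.

    /* One process's per-second decay, 4.3BSD style (simplified). */
    int decay_cpu(int p_cpu, int p_nice, int loadav)
    {
        int newcpu = (2 * loadav * p_cpu) / (2 * loadav + 1) + p_nice;
        return newcpu > 127 ? 127 : newcpu;
    }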
[Figure: the whichqs bitmap and the qs array of 32 run queues; each queue holds runnable processes whose priorities fall in one band of four values (0-3, 4-7, 8-11, 12-15, 16-19, 20-23, ...).]
This simplifies the task of selecting a process to run. The swtch() routine, which performs
the context switch, examines whichqs to find the index of the first set bit. This index identifies the
scheduler queue containing the highest priority runnable process. swtch() removes a process from
the head of the queue, and switches context to it. When swtch() returns, the newly scheduled process
resumes execution.
The context switch involves saving the register context (general purpose registers, program
counter, stack pointer, memory management registers, etc.) of the current process in its process
control block (pcb), which is part of the u area, and then loading the registers with the saved context
of the new process. The p_addr field in the proc structure points to the page table entries of the u
area, and swtch() uses this to locate the new pcb.
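In outline, the selection step amounts to the sketch below. The real 4.3BSD queues are doubly linked and the bit search is a single FFS instruction on the VAX; a singly linked simplification with an explicit loop is shown here.

    struct proc {
        struct proc *p_link;                  /* run-queue link (other fields omitted) */
    };

    extern unsigned int  whichqs;             /* bitmap: bit i set if qs[i] is non-empty */
    extern struct proc  *qs[32];              /* qs[0] holds the most favored priorities */

    struct proc *select_next(void)
    {
        int q;
        for (q = 0; q < 32; q++) {
            if (whichqs & (1u << q)) {
                struct proc *p = qs[q];       /* head of the first non-empty queue */
                qs[q] = p->p_link;            /* dequeue it */
                if (qs[q] == 0)
                    whichqs &= ~(1u << q);    /* the queue is now empty */
                return p;
            }
        }
        return 0;                             /* nothing runnable: idle */
    }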
Since the VAX-11 was the reference target for both 4BSD and the early System V releases,
its architecture [DEC 86] has greatly influenced the scheduler implementation. The VAX has two
special instructions-FFS, or Find First Set, and FFC, or Find First Clear-to manipulate 32-bit
fields. This made it desirable to collapse the 128 priorities into 32 queues. It also has special in-
structions (INSQHI and REMQHI) to atomically insert and remove elements from doubly linked lists,
and others (LDPCTX and SVPCTX) to load and save a process context. This allows the VAX to execute
the entire scheduling algorithm using only a small number of machine instructions.
The schedcpu () routine recomputes the priority of each process once every second. Since
the priority cannot change while the process is on a run queue, schedcpu () removes the process
from the queue, changes its priority, and puts it back, perhaps on a different run queue. The clock
interrupt handler recomputes the priority of the current process every four ticks.
There are three situations where a context switch is indicated:
• The current process blocks on a resource or exits. This is a voluntary context switch.
• The priority recomputation procedure results in the priority of another process becoming
greater than that of the current one.
• The current process, or an interrupt handler, wakes up a higher-priority process.
The voluntary context switch is straightforward-the kernel directly calls swtch() from the
sleep() or exit() routines. Events that cause involuntary switches occur when the system is in
kernel mode, and hence cannot preempt the process immediately. The kernel sets a flag called
runrun, which indicates that a higher priority process is waiting to be scheduled. When the process is
about to return to user mode, the kernel checks the runrun flag. If set, it transfers control to the
swtch() routine, which initiates a context switch.
5.4.4 Analysis
The traditional scheduling algorithm is simple and effective. It is adequate for a general time-
sharing system with a mixture of interactive and batch jobs. Dynamic recomputation of the priorities
prevents starvation of any process. The approach favors I/O-bound jobs that require small infrequent
bursts of CPU cycles.
The scheduler has several limitations that make it unsuitable for use in a wide variety of
commercial applications:
• It does not scale well-if the number of processes is very large, it is inefficient to recom-
pute all priorities every second.
• There is no way to guarantee a portion of CPU resources to a specific process or group of
processes.
• There are no guarantees of response time to applications with real-time characteristics.
• Applications have little control over their priorities. The nice value mechanism is simplis-
tic and inadequate.
• Since the kernel is nonpreemptive, higher-priority processes may have to wait a significant
amount of time even after being made runnable. This is called priority inversion.
Modern UNIX systems are used in many kinds of environments. In particular, there is a
strong need for the scheduler to support real-time applications that require more predictable behav-
ior and bounded response times. This requires a complete redesign of the scheduler. The rest of this
chapter examines the new scheduling facilities in SVR4, Solaris 2.x, and OSF/1, as well as some
non-mainstream variants.
[Figure: the SVR4 dispatcher structures, consisting of the dqactmap bitmap and the dispq array of per-priority dispatch queues.]
A major limitation of UNIX for use in real-time applications is the nonpreemptive nature of
the kernel. Real-time processes need to have a low dispatch latency, which is the delay between the
time they become runnable and the time they actually begin running. If a real-time process becomes
runnable while the current process is executing a system call, there may be a significant delay be-
fore the context switch can occur.
To address this problem, the SVR4 kernel defines several preemption points. These are
places in the kernel code where all kernel data structures are in a stable state, and the kernel is about
to embark on a lengthy computation. When such a preemption point is reached, the kernel checks a
flag called kprunrun. If set, it indicates that a real-time process is ready to run, and the kernel pre-
empts the current process. This bounds the amount of time a real-time process must wait before be-
ing scheduled.4 The PREEMPT() macro checks kprunrun and calls the preempt() routine to actually
preempt the process; a sketch of this check appears after the examples below. Some examples of preemption points are:
• In the pathname parsing routine 1ookuppn (), before beginning to parse each individual
pathname component
• In the open system call, before creating the file if it does not exist
• In the memory subsystem, before freeing the pages of a process
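In essence the check is tiny, as the sketch below suggests; the actual macro is implementation specific.

    extern int  kprunrun;        /* set when a real-time process is waiting to run */
    extern void preempt(void);   /* class-dependent processing, then swtch() */

    #define PREEMPT() \
        do { \
            if (kprunrun) \
                preempt(); \
        } while (0)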
The runrun flag is used as in traditional systems, and only preempts processes that are about
to return to user mode. The preempt() function invokes the CL_PREEMPT operation to perform
class-dependent processing, and then calls swtch() to initiate the context switch.
swtch() calls pswtch() to perform the machine-independent part of the context switch, and
then invokes lower-level assembly code to manipulate the register context, flush translation buffers,
etc. pswtch() clears the runrun and kprunrun flags, selects the highest-priority runnable process,
4 This code is not class-dependent, despite the explicit mention of real-time processes. The kernel merely checks
kprunrun to determine if it should preempt the process. Currently, only the real-time class sets this flag, but in future
there may be new classes that also require kernel preemption.
and removes it from the dispatch queue. It updates the dqactmap, and sets the state of the process to
SONPROC (running on a processor). Finally, it updates the memory management registers to map the
u area and virtual address translation maps of the new process.
[Figure: entries in the scheduling class table point to each class's initialization function (for example, sys_init or ts_init) and its vector of class functions (for example, ts_classfuncs); proc structures reference their class through this table.]
The class-dependent functions can be accessed in this manner from the class-independent code and
from the priocntl system call.
The scheduling class decides the policies for priority computation and scheduling of the
processes that belong to it. It determines the range of priorities for its processes, and if and under
what conditions the process priority can change. It decides the size of the time slice each time a
process runs. The time slice may be the same for all processes or may vary according to the priority.
It may be anywhere from one tick to infinity. An infinite quantum is appropriate for some real-time
tasks that must run to completion.
The entry points of the class-dependent interface include:
CL_TICK         Called from the clock interrupt handler - monitors the time slice, recomputes
                priority, handles time quantum expiration, and so forth.
CL_FORK, CL_FORKRET
                Called from fork - CL_FORK initializes the child's class-specific data
                structure. CL_FORKRET may set runrun, allowing the child process to run
                before the parent.
CL_ENTERCLASS, CL_EXITCLASS
                Called when a process enters or exits a scheduling class - responsible for
                allocating and deallocating the class-dependent data structures, respectively.
CL_SLEEP        Called from sleep() - may recompute process priority.
CL_WAKEUP       Called from wakeprocs() - puts the process on the appropriate run
                queue; may set runrun or kprunrun.
The scheduling class decides what actions each function will take, and each class may im-
plement these functions differently. This allows for a very versatile approach to scheduling. For in-
stance, the clock interrupt handler of the traditional scheduler charges each tick to the current proc-
ess and recomputes its priority on every fourth tick. In the new architecture, the handler simply calls
the CL_TICK routine for the class to which the process belongs. This routine decides how to process
the clock tick. The real-time class, for example, uses fixed priorities and does no recomputation.
The class-dependent code determines when the time quantum has expired and sets runrun to initiate
a context switch.
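The vector of class-dependent operations can be pictured as a structure of function pointers; the member names, signatures, and the class_of() accessor below are illustrative, and the real SVR4 classfuncs structure contains more entry points than are shown.

    struct proc;                                          /* kernel process structure */

    struct classfuncs {
        void (*cl_tick)(struct proc *);                   /* CL_TICK */
        void (*cl_fork)(struct proc *, struct proc *);    /* CL_FORK */
        void (*cl_forkret)(struct proc *, struct proc *); /* CL_FORKRET */
        void (*cl_sleep)(struct proc *);                  /* CL_SLEEP */
        void (*cl_wakeup)(struct proc *);                 /* CL_WAKEUP */
    };

    extern struct classfuncs *class_of(struct proc *);    /* hypothetical accessor */

    /* The class-independent clock code simply forwards the tick to the class. */
    void clock_tick_hook(struct proc *curproc)
    {
        (*class_of(curproc)->cl_tick)(curproc);
    }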
By default, the 160 priorities are divided into the following three ranges:
0-59 time-sharing class
60-99 system priorities
100-159 real-time class
The following sub-sections describe the implementation of the time-sharing and real-time
classes.
The default parameter table assigns larger time slices to lower priorities. The class-dependent data
of a real-time process is stored in a struct rtproc, which includes the current time quantum, time
remaining in the quantum, and the current priority.
Real-time processes require bounded dispatch latency, as well as bounded response time.
These concepts are explained in Figure 5-5. The dispatch latency is the time between when the
process becomes runnable and when it begins to run. The response time is the time between the oc-
currence of an event that requires the process to respond and the response itself. Both these times
need to have a well-defined upper bound that is within a reasonable limit.
The response time is the sum of the time required by the interrupt handler to process the
event, the dispatch latency, and the time taken by the real-time process itself to respond to the event.
The dispatch latency is of great concern to the kernel. Traditional kernels cannot provide reasonable
bounds, since the kernel itself is nonpreemptible, and the process may have to wait for a long period
of time if the current process is involved in some elaborate kernel processing. Measurements have
shown that some code paths in the kernel can take several milliseconds, which is clearly unaccept-
able for most real-time applications.
SVR4 uses preemption points to divide lengthy kernel algorithms into smaller, bounded
units of work. When a real-time process becomes runnable, the rt_wakeup() routine that handles
the class-dependent wakeup processing sets the kernel flag kprunrun. When the current process
(presumably executing kernel code) reaches a preemption point, it checks this flag and initiates a
context switch to the waiting real-time process. Thus the wait is bounded by the maximal code path
between two preemption points, which is a much more acceptable solution.
Finally, we must note that any guarantees on the latency bounds apply only when the real-
time process is the highest-priority runnable process on the system. If, at any time during its wait, a
higher-priority process becomes runnable, it will be scheduled preferentially, and the latency calcu-
lation must restart from zero after that process yields the CPU.
5.5.6 Analysis
SVR4 has replaced the traditional scheduler with one that is completely different in design and be-
havior. It provides a flexible approach that allows the addition of scheduling classes to a system. A
vendor can tailor the scheduler to the needs of his applications. Dispatcher tables give much more
control to the system administrator, who can alter the behavior of the system by changing the set-
tings in the tables and rebuilding the kernel.
Traditional UNIX systems recompute the priority of each process once every second. This
can take an inordinate amount of time if there are many processes. Hence the algorithm does not
scale well to systems that have thousands of processes. The SVR4 time-sharing class changes proc-
ess priorities based on events related to that process. Since each event usually affects only one proc-
ess, the algorithm is fast and highly scalable.
Event-driven scheduling deliberately favors I/O-bound and interactive jobs over CPU-bound
ones. This approach has some important drawbacks. Interactive users whose jobs also require large
computations may not find the system to be responsive, since these processes may not generate
enough priority-elevating events to offset the effects of CPU usage. Also, the optimal boosts and
penalties to associate with different events depend on the total load on the system and the character-
istics of the jobs running at any given time. Thus it may be necessary to retune these values fre-
quently to keep the system efficient and responsive.
Adding a scheduling class does not require access to the kernel source code. The developer
must take the following steps:
1. Provide an implementation of each class-dependent scheduling function.
2. Initialize a classfuncs vector to point to these functions.
3. Provide an initialization function to perform setup tasks such as allocating internal data
structures.
4. Add an entry for this class in the class table in a master configuration file, typically lo-
cated in the master.d subdirectory of the kernel build directory. This entry contains point-
ers to the initialization function and the classfuncs vector.
5. Rebuild the kernel.
An important limitation is that SVR4 provides no good way for a time-sharing class process
to switch to a different class. The priocntl call is restricted to the superuser. A mechanism to map
either specific user IDs or specific programs to a nondefault class would be very useful.
Although the real-time facilities represent a major first step, they still fall short of the desired
capability. There is no provision for deadline-driven scheduling (see Section 5.9.2). The code path
between preemption points is too long for many time-critical applications. In addition, true real-time
systems require a fully preemptible kernel. Some of these issues have subsequently been addressed
in Solaris 2.x, which provides several enhancements to the SVR4 scheduler. We describe this ap-
proach in the next section.
A major problem with the SVR4 scheduler is that it is extremely difficult to tune the system
properly for a mixed set of applications. [Nieh 93] describes an experiment that ran three different
programs-an interactive typing session, a batch job, and a video film display program-
concurrently. This became a difficult proposition since both the typing and the film display required
interaction with the X-server.
The authors tried several permutations of priority and scheduling class assignments to the
four processes (the three applications plus the X-server). It was very difficult to find a combination
that allowed all applications to progress adequately. For instance, the intuitive action is to place
only the video in the real-time class. However, this was catastrophic, and not even the video appli-
cation could progress. This was because the X-server, on which the video display depended, did not
receive sufficient CPU time. Placing both video and the X-server in the real-time class gave ade-
quate video performance (after correctly tweaking their relative priorities), but the interactive and
batch jobs crawled to a halt, and the system stopped responding to mouse and keyboard events.
Through careful experimentation, it may be possible to find the right combination of priori-
ties for a given set of applications. Such settings, however, might only work for that exact mix of
programs. The load on a system varies constantly, and it is not reasonable to require careful manual
tuning each time the load changes.
must support these features. Additionally, Solaris makes several optimizations to lower the dispatch
latency for high-priority, time-critical processes. The result is a scheduler that is more suitable for
real-time processing.
[Figure: threads T1 through T5 (priorities 120, 130, 100, 132, 135) running on processors P1 through P5; one thread is about to be switched out; P3's cpu_chosen_level is 130 and thread T6 (priority 130) is about to be scheduled on it; thread T7 (priority 115) waits on the dispatcher queues.]
will examine the cpu_chosen_level of P3, find that it is 130, and realize that a higher-priority
thread is about to run on this processor. Thus, in this case, cpu_choose() will leave T7 on the dis-
patch queue, avoiding the conflict.
There are certain situations where a low-priority thread can block a higher-priority thread for
a long period of time. These situations are caused either by hidden scheduling or by priority inver-
sion. Solaris eliminates many of these effects, as described in the following subsections.
and hence cannot be preempted by T3 (Figure 5-7(c)). When Tl releases the resource, its priority
returns to its original value, allowing T2 to preempt it. T3 will be scheduled only after Tl has re-
leased the resource and T2 has run and relinquished the CPU.
Priority inheritance must be transitive. In Figure 5-8, T4 is blocked on a resource held by
T5, which in turn is blocked on a resource held by T6. If the priority of T4 is higher than that of T5
and T6, then T6 must inherit the priority of T4 via T5. Otherwise, a thread T7 whose priority is
greater than that of T5 and T6, but less than that of T4, will preempt T6 and cause priority inver-
sion with respect to T4. Thus, the inherited priority of a thread must be that of the highest-priority
thread that it is directly or indirectly blocking.
The Solaris kernel must maintain extra state about locked objects to implement priority in-
heritance. It needs to identify which thread is the current owner of each locked object, and also the
object for which each blocked thread is waiting. Since inheritance is transitive, the kernel must be
able to traverse all the objects and blocked threads in the synchronization chain starting from any
given object. The next subsection shows how Solaris implements priority inheritance.
5 This algorithm is known as the computation of transitive closure, and the chain traversed forms a directed acyclic
graph.
[Figure 5-9: resources, their owners, and the threads blocked on them form a synchronization chain; key: gp = global priority, ip = inherited priority.]
Consider the example of Figure 5-9. Thread T6, which is currently running and has a global
priority of 110, wants resource R4 that is held by thread T5. The kernel calls pi_willto(), which
traverses the synchronization chain starting at R4, taking the following actions:
1. The owner of R4 is thread T5, which has a global priority of 80 and no inherited priority.
Since this is lower than 110, set the inherited priority of T5 to 110.
2. T5 is blocked on resource R3, which is owned by thread T1. T1 has a global priority of 60
but an inherited priority of 100 (through R2). That is also smaller than 110, so raise the
inherited priority of T1 to 110.
3. Since T1 is not blocked on any resource, terminate the chain traversal and return.
After pi_willto() returns, the kernel blocks T6 and selects another thread to run. Since the
priority of T1 was just raised to 110, it is likely to be scheduled immediately. Figure 5-10 shows the
situation after the context switch.
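The traversal just described can be sketched as follows; the structures and field names are illustrative, not the actual Solaris definitions, and all locking is omitted.

    struct syncobj;

    struct thread {
        int             gp;                  /* global priority */
        int             ip;                  /* inherited priority (0 if none) */
        struct syncobj *blocked_on;          /* object this thread is waiting for */
    };

    struct syncobj {
        struct thread  *owner;               /* current owner of the object */
    };

    static int effective(struct thread *t) { return t->ip > t->gp ? t->ip : t->gp; }

    /* Propagate the waiter's priority down the chain starting at the object it wants. */
    void pi_willto_sketch(struct thread *waiter, struct syncobj *wanted)
    {
        int pri = effective(waiter);
        struct syncobj *o;
        for (o = wanted; o != 0 && o->owner != 0; o = o->owner->blocked_on) {
            if (effective(o->owner) >= pri)  /* owner already runs at least this high */
                break;
            o->owner->ip = pri;              /* owner inherits the waiter's priority */
        }
    }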
When a thread releases an object, it surrenders its inherited priority by calling pi_waive().
Sometimes, as in the previous example, a thread may have locked multiple objects. Its inherited pri-
ority is then the maximum of all priorities inherited from these objects. When this thread releases an
object, its inherited priority must be recomputed based on the remaining objects it owns. This in-
heritance loss may reduce the thread's priority to one below that of another runnable thread, in
which case the former thread will be preempted.
[Figure 5-10: the synchronization chain after the context switch; key: gp = global priority, ip = inherited priority.]
Even with priority inheritance, the blocking time can still exceed what is acceptable for many real-time applications. One reason is that the blocking chain can grow
arbitrarily long. Another is that if a high-priority process has several critical regions in its execution
path, it might block on each of them, resulting in a large total delay. This problem has received great
attention in the research community. Some alternative solutions have been proposed, such as the
ceiling protocol [Sha 90], which controls the locking of resources by processes to ensure that a
high-priority process blocks at most once per activation on a resource held by a lower-priority proc-
ess. Although this limits the blocking delay for high-priority processes, it causes low-priority proc-
esses to block much more often. It also requires a priori knowledge of all processes in the system
and their resource requirements. These drawbacks limit the usefulness of this protocol to a small set
of applications.
5.6.7 Turnstiles
The kernel contains hundreds of synchronization objects, one for each data structure that must be
separately protected. Such objects must maintain a great deal of information, such as a queue of
threads that are blocked on it. Having a large data structure for each object is wasteful, since there
are hundreds of synchronization objects in the kernel, but only a few of them are in use at any given
instant. Solaris provides a space-effective solution using an abstraction called a turnstile. A syn-
chronization object contains a pointer to a turnstile, which contains all the data needed to manipu-
late the object, such as the queue of blocked threads and a pointer to the thread that currently owns
the resource (Figure 5-11 ). Turnstiles are dynamically allocated from a pool that grows in size with
the number of allocated threads in the system. The turnstile is allocated by the first thread that must
block on the object. When no more threads are blocked on the object, the turnstile is released back
into the pool.
In traditional UNIX systems, the kernel associates a specific sleep channel with each re-
source or event on which a process may block (see Section 7.2.3). The channel is typically an ad-
dress associated with that resource or event. The kernel hashes the process onto a sleep queue based
[Figure 5-11: synchronization objects pointing to active turnstiles, each holding the queue of threads blocked on that object.]
on this wait channel. Because different wait channels may map to the same sleep queue, the time
taken to traverse the queue is bounded only by the total number of threads in the system. Solaris 2.x
replaces this mechanism with turnstiles. Turnstiles restrict the sleep queue to threads blocked on that
very resource, thus providing a more reasonable bound on the time taken to process the queue.
Threads in a turnstile are queued in order of their priority. Synchronization objects support
two kinds of unlocking behavior - signal,6 which wakes up a single blocked thread, and broadcast,
which wakes up all threads blocked on the resource. In Solaris, signal wakes up the highest-priority
thread from the queue.
5.6.8 Analysis
Solaris 2.x provides a sophisticated environment for multithreaded and real-time processing for uni-
processors and multiprocessors. It addresses several shortcomings in the SVR4 scheduling imple-
mentation. Measurements on a Sparcstation 1 [Khan 92] show that the dispatch latency was under 2
milliseconds for most situations. This is largely due to the fully preemptible kernel and priority in-
heritance.
Although Solaris provides an environment suitable for many real-time applications, it is
primarily a general-purpose operating system. A system designed purely for real-time would pro-
vide many other features such as gang scheduling of processors and deadline-driven or priority-
based scheduling of I/O devices. These issues are discussed further in Section 5.9.
Let us now review some other scheduling algorithms in commercial and experimental UNIX
variants.
6 This signaling behavior is completely unrelated to the traditional UNIX signals. As UNIX inherits terminology from
multiple sources, some terms are overused in this fashion.
In Mach, the priority recomputation algorithm is distributed. Each thread monitors its own CPU usage and recomputes it when it awak-
ens after blocking. The clock interrupt handler adjusts the usage factor of the current thread. To
avoid starving low-priority threads that remain on run queues without getting a chance to recompute
their priorities, an internal kernel thread runs every two seconds, recomputing the priorities of all
runnable threads.
The scheduled thread runs for a fixed time quantum. At the end of this quantum, it can be
preempted by another thread of equal or higher priority. The current thread's priority may drop be-
low that of other runnable threads before its initial quantum expires. In Mach, such reductions do
not cause context switches. This feature reduces the number of context switches that are solely re-
lated to usage balancing. The current thread can be preempted if a higher-priority thread becomes
runnable, even though its quantum has not expired.
Mach provides a feature called handoff scheduling, whereby a thread can directly yield the
processor to another thread without searching the run queues. The interprocess communication
(IPC) subsystem uses this technique for message passing-if a thread is already waiting to receive a
message, the sending thread directly yields the processor to the receiver. This improves the per-
formance of the IPC calls.
[Figure 5-12 shows the application, the server, and the kernel exchanging the requests listed below.]
Figure 5-12. Processor allocation in Mach.
1. The application creates a new processor set.
2. The application requests the server for processors for this set.
3. The server asks the kernel to assign processors to this set.
4. The server replies to the application indicating that the processors have been allocated.
5. The application asks the kernel to assign threads to this set.
6. The application uses the processors and notifies the server when it is finished.
7. The server reassigns the processors.
This allows tremendous flexibility in managing CPU utilization in the system, especially on
a massively parallel system with a large number of processors. It is possible, for instance, to dedi-
cate several processors to a single task or group of tasks, thereby guaranteeing a portion of the
available resources to these tasks, regardless of the total load on the system. In the extreme case, an
application may seek to dedicate one processor to each of its threads. This is known as gang
scheduling.
Gang scheduling is useful for applications that require barrier synchronization. Such appli-
cations create several threads that operate independently for a while, then reach a synchronization
point called a barrier. Each thread must wait at the barrier until the rest of the threads get there. Af-
ter all threads synchronize at the barrier, the application may run some single-threaded code, then
create another batch of threads and repeat the pattern of activity.
For such an application to perform optimally, the delay at the barrier must be minimized.
This requires that all threads reach the barrier at about the same time. Gang scheduling allows the
application to begin the threads together, and bind each to a separate processor. This helps minimize
the barrier synchronization delay.
Gang scheduling is also useful for fine-grained applications whose threads interact fre-
quently. With these applications, if a thread is preempted, it may block other threads that need to
interact with it. The drawback of dedicating processors to single threads is that if the thread must
block, the processor cannot be used.
All processors in the system may not be equivalent-some may be faster, some may have
floating point units attached to them, and so forth. Processor sets make it easy to use the right proc-
essors for the right jobs. For example, processors with floating point units may be assigned only to
threads that need to perform floating point arithmetic.
Additionally, a thread may be temporarily bound to a specific processor. This feature serves
mainly to support the unparallelized (not multiprocessor-safe) portion of Mach's UNIX compatibil-
ity code, which runs on a single, designated master processor. Each processor has a local run queue,
and each processor set has a global run queue shared by all processors in that set. Processors first
examine their local run queue, thereby giving absolute preference to bound threads (even over
higher-priority unbound threads). This decision was made in order to provide maximum throughput
to the unparallelized UNIX code, thus avoiding a bottleneck.
• The CPU usage factor reduces the priority of time-sharing processes according to the
amount of CPU time they receive.
• System processes have fixed priorities in the range 20-31.
• Fixed-priority processes may be assigned any priority from 0 to 63. Superuser privileges
are required, however, to assign priorities higher than 19. Priorities in the range 32-63 are
real-time priorities, since such processes cannot be preempted by system processes.
The sched_setparam call changes priorities of processes in the FIFO and round-robin
classes. The sched_yield call puts the calling process at the end of the queue for its priority, thereby
yielding the processor to any runnable process at the same priority. If there is no such runnable
process, the caller continues to run.
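The calls just described correspond to the POSIX.1b real-time scheduling interface. The following is a minimal user-level sketch of their use, not taken from the text; the priority values are arbitrary, and sched_setscheduler is used here to enter the fixed-priority FIFO class (which typically requires superuser privilege):

    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        struct sched_param sp;

        /* Join the fixed-priority FIFO class at an arbitrary priority. */
        sp.sched_priority = sched_get_priority_min(SCHED_FIFO) + 5;
        if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
            perror("sched_setscheduler");    /* usually needs superuser privilege */
            return 1;
        }

        /* Adjust the priority of the calling process (pid 0 means "self"). */
        sp.sched_priority += 1;
        sched_setparam(0, &sp);

        /* Move to the end of the queue for this priority, yielding the
           processor to any runnable process at the same priority. */
        sched_yield();
        return 0;
    }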
Researchers have proposed many other scheduling algorithms, some of which have found their way into various UNIX implementations. This section examines a
few interesting treatments.
Deadline-driven scheduling is a suitable approach for a system that primarily runs processes
with known response time requirements. The same priority scheme can be applied to scheduling
disk I/O requests, and so forth.
When incoming network traffic exceeds a critical level, the server throughput drops rapidly. This is known as receive livelock. The
three-level scheduler addresses this problem by moving all network processing to real-time tasks,
which bound the amount of traffic they will service in a single invocation. If the incoming traffic
exceeds the critical value, the server will drop excess requests, but still be able to make progress on
the requests that it accepts. Hence, after peaking, the throughput remains nearly constant instead of
declining.
5.10 Summary
We have examined several scheduling architectures and shown how they affect system response to
different types of applications. Because computers are used in very different environments, each
with its own set of requirements, no one scheduler is ideal for all systems. The Solaris 2.x scheduler
is adequate for many applications, and provides the framework for dynamically adding other
scheduling classes to suit the needs of specific environments. It lacks some features such as real-
time streams I/O and user-controlled disk scheduling, but is still an improvement over the tradi-
tional UNIX scheduler. Some other solutions we have seen are targeted primarily at a specific appli-
cation domain such as parallel processing or multimedia.
5.11 Exercises
1. Why are callouts not handled by the primary clock interrupt handler?
2. In which situations will timing wheels be more efficient than the 4.3BSD algorithm for
managing callouts?
3. What are the advantages and disadvantages of using delta times as opposed to absolute times
in callouts?
4. Why do UNIX systems usually favor I/O-bound processes over CPU-bound processes?
5. What are the benefits of the object-oriented interface of the SVR4 scheduler? What are the
drawbacks?
6. Why are slpret and lwait given higher values than tqexp in each row of Table 5-1?
7. For what reasons are real-time processes given higher priorities than kernel processes? What
are the drawbacks of doing this?
8. How does event-driven scheduling favor I/O-bound and interactive applications?
9. Regarding the [Nieh 93] experiments described on page 130, what would be the effect if the X
server, the video application, and the interactive task were all assigned real-time priorities,
while the batch job was given a time-sharing priority?
10. Suppose a process releases a resource for which several processes are waiting. When is it
preferable to wake up all such processes, and when is it better to wake only one? If waking
just one process, how should that process be chosen?
11. Gang scheduling assumes that each thread runs on a separate processor. How will an
application requiring barrier synchronization behave if there are fewer processors than
runnable threads? In such a situation, can the threads busy-wait at the barrier?
12. What are the different ways in which Solaris 2.x supports real-time applications? In what
respects is this inadequate?
13. Why is deadline-driven scheduling unsuitable for a conventional operating system?
14. What are the characteristics of real-time processes? Give some examples of periodic and
nonperiodic real-time applications.
15. It is possible to reduce response time and dispatch latency simply by using a faster processor.
What distinguishes a real-time system from a fast, high-performance system? Can a system
that is slower overall be better suited for real-time applications?
16. What is the difference between hard real-time and soft real-time requirements?
17. Why is admission control important in a real-time system?
5.12 References
[AT&T 90] American Telephone and Telegraph, UNIX System V Release 4 Internals Students
Guide, 1990.
[Blac 90] Black, D.L., "Scheduling Support for Concurrency and Parallelism in the Mach
Operating System," IEEE Computer, May 1990, pp. 35-43.
[Bond 88] Bond, P.G., "Priority and Deadline Scheduling on Real-Time UNIX," Proceedings
of the Autumn 1988 European UNIX Users' Group Conference, Oct. 1988, pp. 201-
207.
[DEC 86] Digital Equipment Corporation, VAX Architecture Handbook, Digital Press, 1986.
[DEC 94] Digital Equipment Corporation, DEC OSF/1 Guide to Realtime Programming, Part
No. AA-PS33C-TE, Aug. 1994.
[Denh 94] Denham, J.M., Long, P., and Woodward, J.A., "DEC OSF/1 Version 3.0 Symmetric
Multiprocessing Implementation," Digital Technical Journal, Vol. 6, No. 3, Summer
1994, pp. 29-54.
[Henr 84] Henry, G.J., "The Fair Share Scheduler," AT&T Bell Laboratories Technical
Journal, Vol. 63, No. 8, Oct. 1984, pp. 1845-1857.
[IEEE 93] Institute of Electrical and Electronics Engineers, POSIX P1003.4b, Real-Time
Extensions for Portable Operating Systems, 1993.
[Khan 92] Khanna, S., Sebree, M., and Zolnowsky, J., "Realtime Scheduling in SunOS 5.0,"
Proceedings of the Winter 1992 USENIX Technical Conference, Jan. 1992.
[Lamp 80] Lampson, B.W. and Redell, D.D., "Experiences with Processes and Monitors in
Mesa," Communications of the ACM, Vol. 23, No. 2, Feb. 1980, pp. 105-117.
[Liu 73] Liu, C.L., and Layland, J.W., "Scheduling Algorithms for Multiprogramming in a
Hard Real-Time Environment," Journal of the ACM, Vol. 20, No. 1, Jan. 1973, pp.
46-61.
[Leff 89] Leffler, S.J., McKusick, M.K., Karels, M.J., and Quarterman, J.S., The Design and
Implementation of the 4.3 BSD UNIX Operating System, Addison-Wesley, Reading,
MA, 1989.
[Nieh 93] Nieh, J., "SVR4 UNIX Scheduler Unacceptable for Multimedia Applications,"
Proceedings of the Fourth International Workshop on Network and Operating
System Support for Digital Audio and Video, 1993.
[Rama 95] Ramakrishnan, K.K., Vaitzblit, L., Gray, C.G., Vahalia, U., Ting, D., Tzelnic, P.,
Glaser, S., and Duso, W.W., "Operating System Support for a Video-on-Demand File
Server," Multimedia Systems, Vol. 3, No. 2, May 1995, pp. 53-65.
[Sha 86] Sha, L., and Lehoczky, J.P., "Performance of Real-Time Bus Scheduling
Algorithms," ACM Performance Evaluation Review, Special Issue, Vol. 14, No. 1,
May 1986.
[Sha 90] Sha, L., Rajkumar, R., and Lehoczky, J.P., "Priority Inheritance Protocols: An
Approach to Real-Time Synchronization," IEEE Transactions on Computers, Vol. 39,
No. 9, Sep. 1990, pp. 1175-1185.
[Stra 86] Straathof, J.H., Thareja, A.K., and Agrawala, A.K., "UNIX Scheduling for Large
Systems," Proceedings of the Winter 1986 USENIX Technical Conference, Jan.
1986.
[Varg 87] Varghese, G., and Lauck, T., "Hashed and Hierarchical Timing Wheels: Data
Structures for the Efficient Implementation of a Timer Facility," Eleventh ACM
Symposium on Operating Systems Principles, Nov. 1987, pp. 25-38.
6
Interprocess
Communications
6.1 Introduction
A complex programming environment often uses multiple cooperating processes to perform related
operations. These processes must communicate with each other and share resources and informa-
tion. The kernel must provide mechanisms that make this possible. These mechanisms are collec-
tively referred to as interprocess communications, or IPC. This chapter describes the IPC facilities
in major UNIX variants.
Interprocess interactions have several distinct purposes:
• Data transfer - One process may wish to send data to another process. The amount of
data sent may vary from one byte to several megabytes.
• Sharing data - Multiple processes may wish to operate on shared data, such that if a
process modifies the data, that change will be immediately visible to other processes
sharing it.
• Event notification - A process may wish to notify another process or set of processes
that some event has occurred. For instance, when a process terminates, it may need to in-
form its parent process. The receiver may be notified asynchronously, in which case its
normal processing is interrupted. Alternatively, the receiver may choose to wait for the
notification.
• Resource sharing - Although the kernel provides default semantics for resource alloca-
tion, they are not suitable for all applications. A set of cooperating processes may wish to
define their own protocol for accessing specific resources. Such rules are usually imple-
mented by a locking and synchronization scheme, which must be built on top of the basic
set of primitives provided by the kernel.
• Process control - A process such as a debugger may wish to assume complete control
over the execution of another (target) process. The controlling process may wish to inter-
cept all traps and exceptions intended for the target and be notified of any change in the
target's state.
UNIX provides several different IPC mechanisms. This chapter first describes a core set of
facilities found in all versions of UNIX, namely signals, pipes, and process tracing. It then examines
the primitives collectively described as System V IPC. Finally, it looks at message-based IPC in
Mach, which provides a rich set of facilities from a single, unified framework.
6.2 Universal IPC Facilities

6.2.1 Signals
Signals serve primarily to notify a process of asynchronous events. Originally intended for handling
errors, they are also used as primitive IPC mechanisms. Modern UNIX versions recognize 31 or
more different signals. Most have predefined meanings, but at least two, SIGUSR1 and SIGUSR2, are
available for applications to use as they please. A process may explicitly send a signal to another
process or processes using the kill or killpg system calls. Additionally, the kernel generates signals
internally in response to various events. For instance, typing control-C at the terminal sends a
SIGINT signal to the foreground process.
Each signal has a default action, which is typically to terminate the process. A process may
specify an alternative response to any signal by providing a signal handler function. When the signal
occurs, the kernel interrupts the process, which responds to the signal by running the handler. When
the handler completes, the process may resume normal processing.
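For illustration, here is a minimal user-level sketch (not kernel code and not from the text) that installs a handler with the POSIX sigaction interface and sends itself SIGUSR1; the message text is arbitrary:

    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static void on_usr1(int sig)
    {
        /* Only async-signal-safe functions should be used here; write() is safe. */
        const char msg[] = "caught SIGUSR1\n";
        (void) sig;
        write(STDOUT_FILENO, msg, sizeof(msg) - 1);
    }

    int main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_usr1;            /* override the default action */
        sigemptyset(&sa.sa_mask);
        sigaction(SIGUSR1, &sa, NULL);

        raise(SIGUSR1);                     /* deliver the signal to ourselves */
        /* Normal processing resumes here after the handler returns. */
        return 0;
    }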
This way processes are notified of, and respond to, asynchronous events. Signals can also be
used for synchronization. A process may use the sigpause call to wait for a signal to arrive. In early
UNIX releases, many applications developed resource sharing and locking protocols based entirely
on signals.
The original intent was to use signals primarily for handling errors; for example, the kernel
translates hardware exceptions, such as division by zero or invalid instruction, into signals. If the
process does not have an error handler for that exception, the kernel terminates the process.
1 The earliest UNIX systems in Bell Telephone Laboratories did not have these features. For instance, pipes were de-
veloped by Doug Mcilroy and Ken Thompson and made available in Version 3 UNIX in 1973 [Salu 94].
As an IPC mechanism, signals have several limitations: Signals are expensive. The sender
must make a system call; the kernel must interrupt the receiver and extensively manipulate its stack,
so as to invoke the handler and later resume the interrupted code. Moreover, they have a very lim-
ited bandwidth-because only 31 different signals exist (in SVR4 or 4.3BSD; some variants such as
AIX provide more signals), a signal can convey only a limited amount of information. It is not pos-
sible to send additional information or arguments with user-generated signals.2 Signals are useful
for event notification, but are inefficient for more complicated interactions.
Signals are discussed in detail in Chapter 4.
6.2.2 Pipes
In traditional implementations, a pipe is a unidirectional, first-in first-out, unstructured data stream
of fixed maximum size.3 Writers add data to the end of the pipe; readers retrieve data from the front
of the pipe. Once read, the data is removed from the pipe and is unavailable to other readers. Pipes
provide a simple flow-control mechanism. A process attempting to read from an empty pipe blocks
until more data is written to the pipe. Likewise, a process trying to write to a full pipe blocks until
another process reads (and thus removes) data from the pipe.
The pipe system call creates a pipe and returns two file descriptors-one for reading and one
for writing. These descriptors are inherited by child processes, which thus share access to the file.
This way, each pipe can have several readers and writers (Figure 6-1). A given process may be a
reader or writer, or both. Normally, however, the pipe is shared between two processes, each own-
ing one end. I/O to the pipe is much like I/O to a file and is performed through read and write sys-
tem calls to the pipe's descriptors. A process is often unaware that the file it is reading or writing is
in fact a pipe.
[Figure 6-1. Several senders and receivers sharing a pipe; data flows from the writers to the readers.]
2 Signals generated by the kernel in response to hardware exceptions return additional information via the siginfo
structure passed to the handler.
3 Traditional UNIX systems such as SVR2 implement pipes in the file system and use the direct block address fields in
the inode (see Section 9.2.2) to locate data blocks of the pipe. This limits the pipe size to ten blocks. Newer UNIX
systems retain this limit, even though they implement pipes differently.
Typical applications such as the shell manipulate the descriptors so that a pipe has exactly
one reader and one writer, thus using it for one-way flow of data. The most common use of pipes is
to let the output of one program become the input for another. Users typically join two programs by
a pipe using the shell's pipe operator ('|').
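A minimal sketch of the usual two-process arrangement, in which a child writes into the pipe and the parent reads from it (the message text is arbitrary):

    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int fd[2];
        char buf[64];

        if (pipe(fd) == -1)            /* fd[0] is the read end, fd[1] the write end */
            return 1;

        if (fork() == 0) {             /* child: the writer */
            close(fd[0]);
            write(fd[1], "hello through the pipe\n", 23);
            _exit(0);
        }

        close(fd[1]);                  /* parent: the reader */
        ssize_t n = read(fd[0], buf, sizeof(buf));   /* blocks until data arrives */
        if (n > 0)
            write(STDOUT_FILENO, buf, (size_t) n);
        wait(NULL);
        return 0;
    }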
From the IPC perspective, pipes provide an efficient way of transferring data from one proc-
ess to another. They have, however, some important limitations:
• Since reading data removes it from the pipe, a pipe cannot be used to broadcast data to
multiple receivers.
• Data in a pipe is treated as a byte-stream and has no knowledge of message boundaries. If
a writer sends several objects of different length through the pipe, the reader cannot de-
termine how many objects have been sent, or where one object ends and the next begins.4
• If there are multiple readers on a pipe, a writer cannot direct data to a specific reader.
Likewise, if there are multiple writers, there is no way to determine which of them sent the
data.5
There are several ways to implement pipes. The traditional approach (in SVR2, for instance)
is to use the file system mechanisms and associate an inode and a file table entry with each pipe.
Many BSD-based variants use sockets to implement a pipe. SVR4 provides bidirectional,
STREAMS-based pipes, described in the following section.
A related facility available in System V UNIX, and in many commercial variants, is the
FIFO (first-in, first-out) file, also called a named pipe. These differ from pipes mainly in the way
they are created and accessed. A user creates a FIFO file by calling mknod, passing it a filename and
a creation mode. The mode field includes a file type of S_IFIFO and the usual access permissions.
Thereafter, any process that has the appropriate permissions can open the FIFO and read or write to
it. The semantics of reading and writing a FIFO are very similar to those of a pipe, and are further
described in Section 8.4.2. The FIFO continues to exist until explicitly unlinked, even if no readers
or writers are active.
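A short sketch of creating a FIFO as described above; the pathname and permissions are arbitrary, and most modern systems also provide mkfifo as a convenience:

    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/tmp/example_fifo";        /* arbitrary pathname */

        /* S_IFIFO marks the node as a FIFO; 0644 supplies the access permissions. */
        if (mknod(path, S_IFIFO | 0644, 0) == -1)
            return 1;

        /*
         * Any process with the right permissions can now open(path) for reading
         * or writing and exchange data with pipe-like semantics.  The FIFO
         * persists until it is explicitly removed:
         */
        unlink(path);
        return 0;
    }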
FIFOs offer some important advantages over pipes. They may be accessed by unrelated
processes. They are persistent, and hence are useful for data that must outlive the active users. They
have a name in the file system name space. FIFOs also have some important drawbacks. They must
be explicitly deleted when not in use. They are less secure than pipes, since any process with the
right privileges can access them. Pipes are easier to set up and consume fewer resources.
4 Cooperating applications could agree on a protocol to store packet boundary information in each object.
5 Again, cooperating applications could agree on a protocol for tagging the data source for each object.
6 This new functionality applies only to pipes. SVR4 FIFOs behave much like traditional ones.
In SVR4, the pipe call creates two independent, first-in first-out, I/O channels that are represented by the
two descriptors. Data written to fildes[1] can be read from fildes[0], and data written to
fildes[0] can be read from fildes[1]. This is very useful because many applications require
two-way communication, and used two separate pipes in pre-SVR4 implementations.
SVR4 also allows a process to attach any STREAMS file descriptor to an object in the file
system [Pres 90]. An application can create a pipe using pipe, and then bind either of its descriptors
to a file name by calling

    fattach(fildes, path);

where path is the pathname of a file system object owned by, and writable by, the caller.7 This ob-
ject can be an ordinary file, directory, or other special file. It cannot be an active mount point
(cannot have a file system mounted on it) or an object in a remote file system, and also cannot be
attached to another STREAMS file descriptor. It is possible to attach a descriptor to several paths,
thereby associating multiple names with it.
Once attached, all subsequent operations on path will operate on the STREAMS file until
the descriptor is detached from path through fdetach. Using this facility, a process can create a pipe
and then allow unrelated processes to access it.
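A sketch of this facility as found on SVR4-derived systems such as Solaris (fattach and fdetach are declared in <stropts.h>); the attachment pathname is arbitrary and must name an existing file writable by the caller:

    #include <fcntl.h>
    #include <stropts.h>
    #include <unistd.h>

    int main(void)
    {
        int fd[2];

        pipe(fd);                                  /* in SVR4, a bidirectional STREAMS pipe */

        /* Create an ordinary file to serve as the attachment point. */
        close(open("/tmp/pipe_name", O_WRONLY | O_CREAT, 0600));

        fattach(fd[1], "/tmp/pipe_name");          /* unrelated processes may now open() it */
        /* ... serve requests arriving on fd[0] ... */
        fdetach("/tmp/pipe_name");
        return 0;
    }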
Finally, users can push STREAMS modules onto a pipe or FIFO. These modules intercept
the data flowing through the stream and process it in some way. Because the modules execute inside
the kernel, they can provide functionality not possible with a user-level application. Nonprivileged
users cannot add modules to the system, but can push modules onto streams they have opened.
6.2.3 Process Tracing

UNIX provides rudimentary support for process tracing through the ptrace system call, invoked as

    ptrace(cmd, pid, addr, data);

where pid is the process ID of the target process, addr refers to a location in the target's address
space, and interpretation of the data argument depends on cmd. The cmd argument allows the parent
to perform the following operations:
• Read or write a word in the child's address space.
• Read or write a word in the child's u area.
• Read or write to the child's general-purpose registers.
• Intercept specific signals. When an intercepted signal is generated for the child, the kernel
will suspend the child and notify the parent of the event.
• Set or delete watchpoints in the child's address space.
• Resume the execution of a stopped child.
• Single-step the child-resume its execution, but arrange for it to stop again after executing
one instruction.
• Terminate the child.
One command (cmd == 0) is reserved for the child. The child uses this command to inform
the kernel that it (the child) will be traced by its parent. The kernel sets the child's traced flag (in its
proc structure), which affects how the child responds to signals. If a signal is generated to a traced
process, the kernel suspends the process and notifies its parent via a SIGCHLD signal, instead of in-
voking the signal handler. This allows the parent to intercept the signal and act appropriately. The
traced flag also changes the behavior of the exec system call. When the child invokes a new pro-
gram, exec generates a SIGTRAP signal to the child before returning to user mode. Again, this allows
the parent to gain control of the child before the child begins to run.
The parent typically creates the child, which invokes ptrace to allow the parent to control it.
The parent then uses the wait system call to wait for an event that changes the child's state. When
the event occurs, the kernel wakes up the parent. The return value of wait indicates that the child has
stopped rather than terminated and supplies information about the event that caused the child to
stop. The parent then controls the child by one or more ptrace commands.
Although ptrace has allowed the development of many debuggers, it has several important
drawbacks and limitations:
• A process can only control the execution of its direct children. If the traced process forks,
the debugger cannot control the new process or its descendants.
• ptrace is extremely inefficient, requiring several context switches to transfer a single word
from the child to the parent. These context switches are necessary because the debugger
does not have direct access to the child's address space.
• A debugger cannot trace a process that is already running, because the child first needs to
call ptrace to inform the kernel that it is willing to be traced.
• Tracing a setuid program raises a security problem if such a program subsequently calls
exec. A crafty user could use the debugger to modify the target's address space, so that
exec invokes the shell instead of the program it was asked to run. As a result, the user ob-
tains a shell with superuser privileges. To avoid this problem, UNIX either disables trac-
ing of setuid programs or inhibits the setuid and setgid actions on subsequent exec calls.
For a long time, ptrace was the only tool for debugging programs. Modern UNIX systems
such as SVR4 and Solaris provide much more efficient debugging facilities using the /proc file
system [Faul 91], described in Section 9.11.2. It is free from the limitations of ptrace and provides
additional capabilities such as allowing a process to debug unrelated processes, or allowing a de-
bugger to attach to a running process. Many debuggers have been rewritten to use /proc instead of
ptrace.
6.3 System V IPC
8 If the key is the special value IPC_PRIVATE, the kernel creates a new resource. This resource cannot be accessed
through other get calls (since the kernel will generate a new resource each time), and hence the caller has exclusive
ownership of it. The owner can share the resource with its children, who inherit it through the fork system call.
For example, a process can place data on a message queue and then exit; at a later time, another process can retrieve this data. The IPC resource can
persist and be usable beyond the lifetime of the processes accessing it.
The drawback, however, is that the kernel cannot determine if a resource has deliberately
been left active for use by future processes, or if it has been abandoned accidentally, perhaps be-
cause the process that would have freed it terminated abnormally before doing so. As a result, the
kernel must retain the resource indefinitely. If this happens often, the system could run out of that
resource. At the very least, the resource ties up memory that could be better used.
Only the creator, current owner, or a superuser process can issue the IPC_RMID command.
Removing a resource affects all processes that are currently accessing it, and the kernel must ensure
that these processes handle this event gracefully and consistently. The specifics of this issue differ
for each IPC mechanism and are discussed in the following sections.
To implement this interface, each type of resource has its own fixed-size resource table. The
size of this table is configurable and limits the total number of instances of that IPC mechanism that
can simultaneously be active in the system. The resource table entry comprises a common ipc_perm
structure and a part specific to the type of resource. The ipc_perm structure contains the common
attributes of the resources (the key, creator and owner IDs, and permissions), as well as a sequence
number, which is a counter that is increased each time the entry is reused.
When a user allocates an IPC resource, the kernel returns the resource ID, which it computes
by the formula

    id = seq * table_size + index;

The user passes the id as an argument to subsequent system calls on that resource. The ker-
nel translates the id to locate the resource in the table using the formula

    index = id % table_size;
6.3.2 Semaphores
Semaphores [Dijk 65] are integer-valued objects that support two atomic operations, P() and V().9
The P() operation decrements the value of the semaphore and blocks if its new value is less than
9 The names P () and V() are derived from the Dutch words for these operations.
zero. The V() operation increments its value; if the resulting value becomes greater than or equal to
zero, V() wakes up a waiting thread or process (if any). The operations are atomic in nature.
Semaphores may be used to implement several synchronization protocols. For example,
consider the problem of managing a counted resource, that is, a resource with a fixed number of in-
stances. Processes try to acquire an instance of the resource, and release it when they finish using it.
This resource can be represented by a semaphore that is initialized to the number of instances. The
P() operation is used while trying to acquire the resource; it will decrement the semaphore each
time it succeeds. When the value reaches zero (no free resources), further P() operations will block.
Releasing a resource results in a V() operation, which increments the value of the semaphore,
causing blocked processes to awaken.
In many UNIX systems, the kernel uses semaphores internally to synchronize its operations.
It is also desirable to provide the same facility for applications. System V provides a generalized
version of semaphores. The semget system call creates or obtains an array of semaphores (there is a
configurable upper bound on the array size). Its syntax is

    semid = semget(key, count, flag);

where key is a 32-bit value supplied by the caller. semget returns an array of count semaphores as-
sociated with key. If no semaphore set is associated with key, the call fails unless the caller has
supplied the IPC_CREAT flag, which creates a new semaphore set. If the IPC_EXCL flag was also
provided, semget returns an error if a semaphore set already exists for that key. The semid value is
used in subsequent semaphore operations to identify this semaphore array.
The semop system call is used to perform operations on the individual semaphores in this ar-
ray. Its syntax is

    semop(semid, sops, nsops);

where sops is a pointer to an nsops-element array of sembuf structures. Each sembuf, as described
below, represents one operation on a single semaphore in the set:
struct sembuf {
    unsigned short  sem_num;
    short           sem_op;
    short           sem_flg;
};
sem_num identifies one semaphore from the array and sem_op specifies the action to perform on it.
The value of sem_op is interpreted as follows:
sem_op > 0     Add sem_op to the semaphore's current value. This may result in waking
               up processes that are waiting for the value to increase.
sem_op == 0    Block until the semaphore's value becomes zero.
sem_op < 0     Block until the semaphore's value becomes greater than or equal to the ab-
               solute value of sem_op, then subtract the absolute value of sem_op from it.
               If the semaphore's value is already greater than or equal to the absolute
               value of sem_op, the caller does not block.
Thus, a single semop call can specify several individual operations, and the kernel guaran-
tees that either all or none of the operations will complete. Moreover, the kernel guarantees that no
other semop call on this array will begin until this one completes or blocks. If a semop call must
block after completing some of its component operations, the kernel rewinds the operation to the
beginning (undoes all modifications) to ensure atomicity of the entire call.
The sem_flg argument can supply two flags to the call. The IPC_NOWAIT flag asks the ker-
nel to return an error instead of blocking. Also, a deadlock may occur if a process holding a sema-
phore exits prematurely without releasing it. Other processes waiting to acquire it may block forever
in the P() operation. To avoid this, a SEM_UNDO flag can be passed to semop. If so, the kernel re-
members the operation, and automatically rolls it back if the process exits.
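A sketch that puts these calls together, using a one-element semaphore set as a simple lock; the key is arbitrary, and on some systems the union semun shown here is already supplied by <sys/sem.h>:

    #include <sys/ipc.h>
    #include <sys/sem.h>
    #include <sys/types.h>

    union semun { int val; struct semid_ds *buf; unsigned short *array; };

    int main(void)
    {
        int semid = semget(0x1234, 1, IPC_CREAT | 0600);   /* arbitrary key */

        union semun arg;
        arg.val = 1;
        semctl(semid, 0, SETVAL, arg);   /* initialize; note this is not atomic with semget */

        struct sembuf p = { 0, -1, SEM_UNDO };   /* P(): blocks if the value would drop below 0 */
        struct sembuf v = { 0, +1, SEM_UNDO };   /* V(): release */

        semop(semid, &p, 1);
        /* ... critical section ... */
        semop(semid, &v, 1);

        semctl(semid, 0, IPC_RMID);      /* semaphores must be removed explicitly */
        return 0;
    }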
Finally, semaphores must be explicitly removed by the IPC_RMID command of the semctl
call. Otherwise, the kernel retains them even if they are no longer being used by any process. This al-
lows semaphores to be used in situations that span process lifetimes, but can tie up resources if ap-
plications exit without destroying the semaphores.
When a process issues the IPC_RMID command, the kernel frees up the semaphore in the re-
source table. The kernel also wakes up any processes that have blocked on some semaphore opera-
tion; these processes return an EIDRM status from the semop call. Once the semaphore is removed,
processes can no longer access it (whether using the key or the semid).
Implementation Details
The kernel translates the semid to obtain the semaphore resource table entry, which is described by
the following data structure:
struct semid_ds {
    struct ipc_perm  sem_perm;    /* see Section 6.3.1 */
    struct sem      *sem_base;    /* pointer to array of semaphores in set */
    ushort           sem_nsems;   /* number of semaphores in set */
    time_t           sem_otime;   /* last operation time */
    time_t           sem_ctime;   /* last change time */
};
For each semaphore in the set, the kernel maintains its value and synchronization information in the
following structure:
struct sem {
    ushort  semval;     /* current value */
    pid_t   sempid;     /* pid of process that invoked the last operation */
    ushort  semncnt;    /* num of procs waiting for semval to increase */
    ushort  semzcnt;    /* num of procs waiting for semval to equal 0 */
};
Finally, the kernel maintains an undo list for each process that has invoked semaphore op-
erations with the SEM_UNDO flag. This list contains a record of each operation that must be rolled
back. When a process exits, the kernel checks if it has an undo list; if so, the kernel traverses the list
and reverses all the operations.
Discussion
Semaphores allow the development of complex synchronization facilities for use by cooperating
processes. On early UNIX systems that did not support semaphores, applications requiring syn-
chronization sought and used other atomic operations in UNIX. One alternative is the link system
call, which fails if the new link already exists. If two processes try the same link operation at the
same time, only one of them will succeed. It is, however, expensive and senseless to use file system
operations such as link merely for interprocess synchronization, and semaphores fill a major need of
application programmers.
The major problems with semaphores involve race conditions and deadlock avoidance.
Simple semaphores (as opposed to semaphore arrays) can easily cause a deadlock if processes must
acquire multiple semaphores. For example, in Figure 6-2, process A holds semaphore S1 and tries to
acquire semaphore S2, while process B holds S2 and tries to acquire S1. Neither process can prog-
ress. Although this simple case is easy to detect and avoid, a deadlock can occur in an arbitrarily
complex scenario involving several semaphores and processes.
It is impractical to have deadlock detection and avoidance code in the kernel. Moreover,
there are no general, bounded algorithms that apply to all possible situations. The kernel thus leaves
all deadlock detection to the applications. By providing semaphore sets with compound atomic op-
erations, the kernel supplies mechanisms to handle multiple semaphores intelligently. Applications
can choose from several well-known deadlock avoidance techniques, some of which are discussed
in Section 7.10.1.
[Figure 6-2. A deadlock: each process holds one semaphore and wants the one held by the other.]
One major problem in the System V semaphore implementation is that the allocation and
initialization of the semaphores are not atomic. The user calls semget to allocate a semaphore set,
followed by semctl to initialize it. This can lead to race conditions that must be prevented at the
application level [Stev 90].
Finally, the need to explicitly delete the resource through IPC_RMID is a common problem
with all IPC mechanisms. Although it allows the lifetime of the resource to exceed that of its crea-
tor, it creates a garbage collection problem if processes exit without destroying their resources.
6.3.3 Message Queues

A process allocates a message queue by calling

    msgqid = msgget(key, flag);

The semantics of the call are similar to those of semget. The key is a user-chosen integer.
The IPC_CREAT flag is required to create a new message queue, and IPC_EXCL causes the call to fail
if a queue already exists for that key. The msgqid value is used in further calls to access the queue.
The user places messages on the queue by calling

    msgsnd(msgqid, msgp, count, flag);

where msgp points to the message buffer (containing a type field followed by data area), and count
is the total number of bytes in the message (including the type field). The IPC_NOWAIT flag can be
used to return an error if the message cannot be sent without blocking (if the queue is full, for ex-
ample-the queue has a configurable limit on the total amount of data it can hold).
Figure 6-3 describes the operation of a message queue. Each queue has an entry in the mes-
sage queue resource table, and is represented by the following structure:
struct msqid_ds {
    struct ipc_perm  msg_perm;    /* described in Section 6.3.1 */
    struct msg      *msg_first;   /* first message on queue */
    struct msg      *msg_last;    /* last message on queue */
    ushort           msg_cbytes;  /* current byte count on queue */
    ushort           msg_qbytes;  /* max bytes allowed on queue */
    ushort           msg_qnum;    /* number of messages currently on queue */
};
Messages are maintained in the queue in order of arrival. They are removed from the queue
(in first-in, first-out order) when read by a process, using the call

    count = msgrcv(msgqid, msgp, maxcnt, msgtype, flag);
[Figure 6-3. Using a message queue: messages are read from the head of the queue, and new messages are added at the tail.]
Here msgp points to a buffer into which the incoming message will be placed, and maxcnt
limits the amount of data that can be read. If the incoming message is larger than maxcnt bytes, it
will be truncated. The user must ensure that the buffer pointed to by msgp is large enough to hold
maxcnt bytes. The return value specifies the number of bytes successfully read.
If msgtype equals zero, msgrcv returns the first message on the queue. If msgtype is greater
than zero, msgrcv returns the first message of type msgtype. If msgtype is less than zero, msgrcv
returns the first message of the lowest type that is less than or equal to the absolute value of msgtype.
Again, the IPC_NOWAIT flag causes the call to return immediately if an appropriate message is not
on the queue.
Once read, the message is removed from the queue and cannot be read by other processes.
Likewise, if a message is truncated because the receiving buffer is too small, the truncated part is
lost forever, and no indication is given to the receiver.
A process must explicitly delete a message queue by calling msgctl with the IPC_RMID
command. When this happens, the kernel frees the message queue and deletes all messages on it. If
any processes are blocked waiting to read or write to the queue, the kernel awakens them, and they
return from the call with an EIDRM (message ID has been removed) status.
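A sketch combining the calls described above; the key, message type, and message text are arbitrary:

    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/msg.h>

    /* The application defines the message layout: a long type field followed by data. */
    struct mymsg {
        long mtype;
        char mtext[64];
    };

    int main(void)
    {
        int msgqid = msgget(0x2345, IPC_CREAT | 0600);    /* arbitrary key */

        struct mymsg out = { 2, "urgent request" };
        msgsnd(msgqid, &out, strlen(out.mtext) + 1, 0);   /* place the message on the queue */

        struct mymsg in;
        /* Retrieve the first message of type 2; a type of 0 would return the first
           message of any type, and IPC_NOWAIT would return an error instead of blocking. */
        msgrcv(msgqid, &in, sizeof(in.mtext), 2, 0);

        msgctl(msgqid, IPC_RMID, NULL);                   /* delete the queue and its messages */
        return 0;
    }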
Discussion
Message queues and pipes provide similar services, but message queues are more versatile and ad-
dress several limitations of pipes. Message queues transmit data as discrete messages rather than as
an unformatted byte-stream. This allows data to be processed more intelligently. The message type
field can be used in various ways. For instance, it can associate priorities with messages, so that a
receiver can check for urgent messages before reading nonurgent ones. In situations where a mes-
sage queue is shared by multiple processes, the type field can be used to designate a recipient.
Message queues are effective for transferring small amounts of data, but become very ex-
pensive for large transfers. When one process sends a message, the kernel copies the message into
an internal buffer. When another process retrieves this message, the kernel copies the data to the re-
ceiver's address space. Thus each message transfer involves two data copy operations, resulting in
poor performance. Later in this chapter, we describe how Mach IPC allows efficient transfer of large
amounts of data.
Another limitation of message queues is that they cannot specify a receiver. Any process
with the appropriate permissions can retrieve messages from the queue. Although, as mentioned
earlier, cooperating processes can agree on a protocol to specify recipients, the kernel does not assist
with this. Finally, message queues do not supply a broadcast mechanism, whereby a process can
send a single message to several receivers.
The STREAMS framework, available in most modern UNIX systems, provides a rich set of
facilities for message passing. It provides more functionality than, and is more efficient than, mes-
sage queues, rendering the latter almost obsolete. One feature that is available in message queues
but not in STREAMS is the ability to retrieve messages selectively, based on their types. This may
be useful for certain applications. Most applications, however, find STREAMS more useful, and
message queues have been retained in modern UNIX variants primarily for backward compatibility.
Chapter 17 describes STREAMS in detail.
6.3.4 Shared Memory

A process creates a shared memory region by calling

    shmid = shmget(key, size, flag);

where size is the size of this region, and the other parameters and flags are the same as for semget
or msgget.

[Figure: a shared memory region may be attached at different virtual addresses in different processes' address spaces.]

10 On a multiprocessor, additional operations are required to ensure cache consistency. Section 15.13 discusses some of
these issues.

The process then attaches the region to a virtual address, using

    shmaddr = shmat(shmid, shmaddr, shmflag);
The shmaddr argument suggests an address to which the region may be attached. The shmflag ar-
gument can specify the SHM_RND flag, which asks the kernel to round shmaddr down by an appro-
priate alignment factor. If shmaddr is zero, the kernel is free to choose any address. The
SHM_RDONLY flag specifies that the region should be attached as read-only. The shmat call returns
the actual address to which the region was attached.
A process can detach a shared memory region from its address space by calling
    shmdt(shmaddr);
To destroy the region completely, a process must use the IPC_RMID command of the shmctl
system call. This marks the region for deletion, and the region will be destroyed when all processes
detach it from their address space. The kernel maintains a count of processes attached to each re-
gion. Once a region has been marked for deletion, new processes cannot attach to it. If the region is
not explicitly deleted, the kernel will retain it even if no process is attached to it. This may be desir-
able for some applications: a process could leave some data in a shared memory region and termi-
nate; at a later time, another cooperating process could attach to this region using the same key and
retrieve the data.
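A sketch of the shared memory interface described above; the key and region size are arbitrary:

    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        /* Create (or look up) a 4096-byte region; the key is arbitrary. */
        int shmid = shmget(0x4567, 4096, IPC_CREAT | 0600);

        /* Attach at an address of the kernel's choosing. */
        char *addr = (char *) shmat(shmid, NULL, 0);

        strcpy(addr, "data visible to every process attached to this region");

        shmdt(addr);                        /* detach from this address space */
        shmctl(shmid, IPC_RMID, NULL);      /* destroyed once all processes detach */
        return 0;
    }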
The implementation of shared memory depends heavily on the virtual memory architecture
of the operating system. Some variants use a single page table to map the shared memory region
and share the table with all processes attached to it. Others have separate, per-process address
translation maps for the region. In such a model, if one process performs an action that changes the
address mapping of a shared page, the change must be applied to all mappings for that page. SVR4,
whose memory management is described in Chapter 14, uses an anon_map structure to locate the
pages of the shared memory region. Its shared memory resource table contains entries represented
by the following structure:
struct shmid_ds {
    struct ipc_perm   shm_perm;    /* described in Section 6.3.1 */
    int               shm_segsz;   /* segment size in bytes */
    struct anon_map  *shm_amp;     /* pointer to memory mgmt info */
    ushort            shm_nattch;  /* number of current attaches */
};
Shared memory provides a very fast and versatile mechanism that allows a large amount of
data to be shared without copying or using system calls. Its main limitation is that it does not pro-
vide synchronization. If two processes attempt to modify the same shared memory region, the ker-
nel will not serialize the operations, and the data written may be arbitrarily intermingled. Processes
sharing a shared memory region must devise their own synchronization protocol, and they usually
do so using primitives such as semaphores. These primitives involve one or more system calls that
impose an overhead on shared memory performance.
Most modern UNIX variants (including SVR4) also provide the mmap system call, which
maps a file (or part of a file) into the address space of the caller. Processes can use mmap for IPC by
mapping the same file into their address space (in the MAP_SHARED mode). The effect is similar to a
shared memory region that is initialized to the contents of the file. If a process modifies a mapped
file, the change is immediately visible to all processes that have mapped that file; the kernel will
also update the file on disk. One advantage of mmap is that it uses the file system name space in-
stead of keys. Unlike shared memory, whose pages are backed by swap space (see Section 14.7.6),
mmap'ed pages are backed by the file to which they are mapped. Section 14.2 describes mmap in
detail.
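A sketch of file-based shared memory through mmap; the file name is arbitrary, and each cooperating process would map the same file:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/shared_file", O_RDWR | O_CREAT, 0600);   /* arbitrary file */
        ftruncate(fd, 4096);                  /* give the file a definite size */

        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        strcpy(p, "visible to every process that maps this file");
        /* The kernel will also write the change back to the file on disk. */

        munmap(p, 4096);
        close(fd);
        return 0;
    }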
6.3.5 Discussion
There are several similarities between the IPC mechanisms and the file system. The resource ID is
like a file descriptor. The get calls resemble open, IPC_RMID resembles unlink, and the send and
receive calls resemble read and write. The shmdt call provides close-like functionality for shared
memory. For message queues and semaphores, however, there is no equivalent of the close system
call, which might be desirable for removing resources cleanly. As a result, processes using a mes-
sage queue or semaphore may suddenly find that the resource no longer exists.
In contrast, the keys associated with a resource form a name space that is distinct from the
file system name space. 11 Each mechanism has its own name space, and the key uniquely identifies
a resource within it. Because the key is a simple, user-chosen integer, it is useful only on a single
machine and is unsuitable for a distributed environment. Also, it is difficult for unrelated processes
to choose and agree upon an integer-valued key and avoid conflicts with keys used by other appli-
cations. Hence, UNIX provides a library routine called ftok (described in the stdipc(3C) manual
page) to generate a key that is based on a file name and an integer. Its syntax is

    key = ftok(path, ndx);
The ftok routine generates a key value, usually based on ndx and the inode number of the
file. An application can choose a unique file name more easily than a unique integer value (for in-
stance, it can use the pathname of its own executable file), and hence reduce the likelihood of key
conflicts. The ndx parameter allows greater flexibility and can be used, for example, to specify a
project ID known to all cooperating processes.
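For example, cooperating processes might derive a common key from a well-known pathname and project number (both arbitrary here):

    #include <sys/ipc.h>
    #include <sys/sem.h>

    int main(void)
    {
        key_t key = ftok("/usr/local/bin/myapp", 42);    /* arbitrary path and project id */
        int semid = semget(key, 1, IPC_CREAT | 0600);    /* all callers obtain the same set */
        (void) semid;
        return 0;
    }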
Security is a problem, because the resource IDs are actually indexes into a global resource
table. An unauthorized process can access a resource simply by guessing the ID. It can thus read or
write messages or shared memory, or tamper with semaphores used by other processes. The per-
missions associated with each resource offer some protection, but many applications must share the
resources with processes belonging to different users, and hence cannot use very restrictive permis-
sions. Using the sequence number as a component of the resource ID provides a little more protec-
tion, since there are more IDs to guess from, but still poses a serious problem for applications that
are concerned about security.
11 Several operating systems, such as Windows/NT and OS/2, use pathnames to name shared memory objects.
Much of the functionality provided by System V IPC can be duplicated by other file system
facilities such as file locking or pipes. However, the IPC facilities are much more versatile and
flexible, and offer better performance than their file system counterparts.
6.4 Mach IPC

The remainder of this chapter discusses Mach's message-based IPC facility. In Mach, IPC is the
central and most important kernel component. Instead of the operating system supporting IPC
mechanisms, Mach provides an IPC facility that supports much of the operating system. There were
several important goals in the design of Mach IPC:
• Message passing must be the fundamental communication mechanism.
• The amount of data in a single message may range from a few bytes to an entire address
space (typically up to four gigabytes). The kernel should enable large transfers without
unnecessary data copying.
• The kernel should provide secure communications and allow only authorized threads to
send and receive messages.
• Communication and memory management are tightly coupled. The IPC subsystem uses
the copy-on-write mechanisms of the memory subsystem to efficiently transfer large
amounts of data. Conversely, the memory subsystem uses IPC to communicate with user-
level memory managers (known as "external pagers").
• Mach IPC must support communication between user tasks, and also between the user and
the kernel. In Mach, a thread makes a system call by sending a message to the kernel, and
the kernel returns the result in a reply message.
• The IPC mechanism should be suitable for applications based on the client-server model.
Mach uses user-level server programs to perform many services (such as file system and
memory management) that are traditionally handled by the operating system kernel. These
servers use Mach IPC to handle requests for service.
• The interface should be transparently extensible to a distributed environment. The user
should not need to know whether he is sending a message to a local receiver or to a remote
node.
Mach IPC has evolved steadily over various releases. Sections 6.4 to 6.9 discuss the IPC
facility in Mach 2.5, which is the most popular Mach release and the basis for operating systems
such as OSF/1 and Digital UNIX. Mach 3.0 enhances the IPC mechanisms in several respects; these
features are discussed in Section 6.10.
This chapter makes several references to Mach tasks and threads. Section 3.7.1 discusses
these abstractions in detail. In brief, a task is a collection of resources, including an address space in
which one or more threads execute. A thread is a dynamic entity that represents an independent
program counter and stack-thus a logical control sequence-in a program. A UNIX process is
equivalent to a task containing a single thread. All threads in a task share the resources of that task.
[Figure: senders queue messages at a port; the receiver retrieves them.]
12 Earlier versions of Mach had separate ownership and receive rights. Mach 2.5 and newer releases replace the owner-
ship rights with the backup port facility, which is described in Section 6.8.2.
[Figure: two tasks, each with its own local port names (task A: ports 6 and 9; task B: ports 6 and 3).]
Each thread also has rights to a reply port, used to receive replies from system calls and remote procedure calls to other
tasks. There is also an exception port associated with each task and each thread. The rights to the
per-thread ports are owned by the task in which the thread runs; hence these ports can be accessed
by all threads within the task.
Tasks also inherit other port rights from their parent. Each task has a list of registered ports.
These allow the task to access various system-wide services. These ports are inherited by new tasks
during task creation.
6.5 Messages
Mach is a message-passing kernel, and most system services are accessed by exchanging messages.
Mach IPC provides communication between user tasks, between users and the kernel, and between
different kernel subsystems. A user-level program called the netmsgserver transparently extends
Mach IPC across the network, so that tasks can exchange messages with remote tasks as easily as
with local ones. The fundamental abstractions of Mach IPC are the message and the port. This sec-
tion describes the data structures and functions that implement these abstractions.
[Figure: format of a Mach message. The message header contains the type, size, local (reply) port, destination port, and message ID; it is followed by one or more type descriptors (type name, size, number, flags), each describing the data item that follows.]
The msg_send call sends a message but does not expect a reply. The call may block if the
destination port's message queue is full. Likewise, msg_rcv blocks until a message is received. Each
call accepts a SEND_TIMEOUT or RCV_TIMEOUT option; if specified, the call blocks for a maximum
of timeout milliseconds. After the timeout period expires, the call returns with a timed out status
instead of remaining blocked. The RCV_NO_SENDERS option causes msg_rcv to return if no one else
has a send right to the port.
The msg_rpc call sends an outgoing message, then waits for a reply to arrive. It is merely an
optimized way of performing a msg_send followed by a msg_rcv. The reply reuses the message
buffer used by the outgoing message. The options for msg_rpc include all the options of msg_send
and msg_rcv.
The header contains the size of the message. When calling msg_rcv, the header contains the
maximum size of the incoming message that the caller can accept; upon return, the header contains
the actual size received. In the msg_rpc call, rcv_size must be specified separately, because the
header contains the size of the outgoing message.
6.6 Ports
Ports are protected queues of messages. Tasks can acquire send or receive rights or capabilities to a
port. Ports can only be accessed by holders of the appropriate rights. Although many tasks may have
send rights to a port, only one task can hold a receive right. The holder of a receive right automati-
cally has a send right to that port.
Ports are also used to represent Mach objects such as tasks, threads, and processors. The
kernel holds the receive rights to such ports. Ports are reference-counted, and each send right consti-
tutes a reference to the object represented by the port. Such a reference allows the holder to manipu-
late the underlying object. For instance, the task_self port of a task represents that task. The task can
send messages to that port to request kernel services that affect the task. If another task, perhaps a
debugger, also has send rights to this port, it can perform operations on this task, such as suspending
it, by sending messages to the port. The specific operations that are permitted depend on the object
and the interface it wishes to export.
This section describes the port name space and the data structures used to represent a port.
• Pointer to a backup port. If this port is deallocated, the backup port receives all messages
sent to it.
• Doubly linked list of messages.
• Queue of blocked senders.
• Queue of blocked receivers. Although a single task has receive rights, many threads in the
task may be waiting to receive a message.
• Linked list of all translations for this object.
• Pointer to the port set, and pointers to next and previous ports in this set, if the port be-
longs to a port set (Section 6.8.3).
• Count of messages currently in the queue.
• Maximum number of messages (backlog) allowed in the queue.
[Figure: port name translations are kept in the TP_table, hashed by task and local port name.]
6.7 Message Passing
A single message transfer requires several operations:
1. The sender creates the message in its own address space.
2. The sender issues the msg_send system call to send the message. The message header contains
the destination port.
3. The kernel copies the message into an internal data structure (kern_msg_t) using the
msg_copyin() routine. In this process, port rights are converted to pointers to the ports' ker-
nel objects, and out-of-line memory is copied into a holding map.
4. (a) If a thread is waiting to receive the message (the thread is on the blocked receivers
queue of this port), it is awakened and the message is given to it directly.
(b) Otherwise, if the port's message queue is full, the sender blocks until a message is re-
moved from the queue.
(c) Otherwise, the message is queued at the port, where it remains until a thread in the re-
ceiver task performs a msg_rcv.
5. The kernel returns from msg_send once the message has been queued or given to the receiver.
6. When the receiver calls msg_rcv, the kernel calls msg_dequeue() to remove a message from
the queue. If the queue is empty, the receiver blocks until a message arrives.
7. The kernel copies the message into the receiver's address space using the msg_copyout()
function, which performs further translations on out-of-line memory and port rights.
8. Often the sender expects a reply. For this to happen, the receiver must have a send right to
a port owned by the sender. The sender sends this right to the receiver using the reply port
field in the message header. In such a case, the sender would normally use the msg_rpc
call to optimize this exchange. This call is semantically equivalent to a msg_send followed
by a msg_rcv.
Figure 6-9 describes the transformations on the different components of a message during
the transfer process. Let us now take a closer look at some important issues of message passing.
[Figure 6-9. Transformations during message transfer: in-line data is copied from the outgoing message into the kernel's internal kern_msg_t and copied again into the received message; out-of-line memory is represented by address map entries copied into a holding map and then into the receiver's address space.]
Figure 6-10. Messages can contain the send rights to a reply port.
Another common situation involves the interactions between a server program, a client, and
a name server (Figure 6-11). The name server holds send rights to several server programs in the
system. Typically, the servers register themselves with the name server when they begin executing
(a). All tasks inherit a send right to a name server during task creation (this value is stored in the
bootstrap port field in the task structure).
When a client wishes to access a server program, it must first acquire send rights to a port
owned by the server. To do so, it queries the name server (b), which returns a send right to that
server (c). The client uses that right to send a request to the server (d). The request contains a reply
port that the server can use to reply to the client (e). Further interactions between this server-client
pair need not involve the name server.
The sender sends a port right using its local name for the port. The type descriptor for that
component of the message informs the kernel that the item is a port right. The local name means
nothing to the receiver, and hence the kernel must translate it. To do this, the kernel searches the
translation entries by hashing on <task, local_name> and identifies the kernel object (global name)
for that port.
Figure 6-11. Using a name server to initiate contact between a client and a server.
When the message is retrieved, the kernel must translate this global name to a local name in
the receiving task. It first checks if the receiver already has a right for this port (by hashing on <task,
port>). If so, the kernel translates it to the same name. Otherwise, the kernel allocates a new port
name in the receiver and creates a translation entry that maps the name to the port. Port names are
usually small integers, and the kernel uses the lowest available integer for the new name.
Because the kernel creates an extra reference for this port, it must increment the reference
count in the port object. The kernel does so when it copies the message into system space, since the
new reference is created at that point. Alternatively, the sender could have specified the deallocate
flag in the type descriptor. In that case, the kernel deallocates the right in the sender task and does
not need to increment the port reference count.
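A sketch of the translation just described might look like the following; the structure and function names are illustrative assumptions, not Mach's actual internal interfaces.

```c
/* Hypothetical sketch of global-to-local port name translation. */
struct task;
struct port { int refcount; /* ... */ };

struct translation {
        struct task *task;
        struct port *port;
        int          local_name;      /* small integer: the task's name for the port */
};

struct translation *hash_lookup(struct task *t, struct port *p);
struct translation *alloc_translation(void);
void hash_insert(struct translation *tr);
int  lowest_free_name(struct task *t);

int
translate_to_local(struct task *t, struct port *p)
{
        struct translation *tr = hash_lookup(t, p);   /* hash on <task, port> */

        if (tr != NULL)
                return tr->local_name;                /* task already holds a right */

        tr = alloc_translation();                     /* create a new right */
        tr->task = t;
        tr->port = p;
        tr->local_name = lowest_free_name(t);         /* lowest available small integer */
        hash_insert(tr);
        p->refcount++;                                /* extra reference for the new right */
        return tr->local_name;
}
```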
(Figure: copy-on-write (COW) transfer of out-of-line memory, showing the sender's and receiver's address maps and the pages in memory: (a) message copied to holding map; (b) message copied to receiver task; (c) receiver modifies a page.)
This approach works best when neither the sender nor the receiver modifies the shared
pages. This is true of many applications. Even if the pages are modified, this approach saves a copy
operation. In-line memory is copied twice: once from sender to kernel, then again from kernel to
receiver. Out-of-line memory is copied at most once, the first time either task tries to modify it.
The sender may set the deallocate flag in the type descriptor. In this case, the kernel does not
use copy-on-write sharing. It simply copies the address map entries to the holding map during
msg_copyin() and deallocates them from the sender's address map. When the message is retrieved,
msg_copyout() copies the entries to the receiver's address map and deletes the holding map. As a
result, the pages move from the sender's address space to that of the receiver without any data
copying.
The message transfer proceeds along one of two paths-fast or slow. The slow path applies if a re-
ceiver is not waiting when a message is sent. In this case, the sender queues the message at the port
and returns. When a receiver does a msg_rcv, the kernel dequeues the message and copies it into the
receiver's address space.
Each port has a configurable limit, called its backlog, on the maximum number of messages
that may be queued to it. When that limit is reached, the port is full and new senders will block until
some messages are retrieved from the queue. Each time a message is retrieved from a port that has
blocked senders, one sender will be awakened. When the last message is dequeued from the port, all
blocked senders are awakened.
The fast path scenario occurs when a receiver is already waiting for the message. In this
case, msg_send does not queue the message to the port. Instead, it wakes up the receiver and directly
hands the message to it. Mach provides a facility called handoff scheduling [Drav 91], where one
thread directly yields the processor to another specific thread. The fast path code uses this facility to
switch to the receiver thread, which completes its msg_rcv call, using the msg_copyout() routine to
copy the message to its address space. This eliminates the overhead of queuing and dequeuing the
message, and also speeds up the context switch, because the new thread to run is directly selected.
6.7.4 Notifications
A notification is an asynchronous message sent by the kernel to inform a task of certain events. The
kernel sends the message to the task's notify port. Mach IPC uses three types of notifications:
NOTIFY_PORT_DESTROYED    When a port is destroyed, this message is sent to the owner
                         of its backup port (if any). Port destruction and backup
                         ports are discussed in the next section.
NOTIFY_PORT_DELETED      When a port is destroyed, this message is sent to all tasks
                         that hold send rights to the port.
NOTIFY_MSG_ACCEPTED      When sending a message to a port whose queue is full, the
                         sender can request this notification (using the SEND_NOTIFY
                         option) when a message is removed from the queue.
The last case requires some elaboration. When the SEND_NOTIFY option is used, the message
is transferred even if the queue is full. The kernel returns a SEND_WILL_NOTIFY status, which asks
the sender not to send more messages to the queue until it receives the NOTIFY_MSG_ACCEPTED no-
tification. This allows senders to send messages without blocking.
(Figure: backup ports, showing P1 with backup port P2, (a) before and (b) after P1 is destroyed.) When P1 is destroyed, the kernel sends a
NOTIFY_PORT_DESTROYED message to P2's owner. P1 is not deallocated, but holds send rights to
P2. All messages sent to P1 are automatically routed to P2, and can be retrieved by P2's owner.
When a thread performs a msg_rcv on a port set, the kernel retrieves the first message in the set's queue, regardless of the
component port to which the message was queued.
Port sets are created and destroyed by the port_set_allocate and port_set_deallocate calls.
Individual ports in the set are inserted and removed by the port_set_insert and port_set_remove
calls.
Once this is accomplished, the debugger intercepts any messages sent by the target to its
task_notify port (Mach system calls). The debugger processes the call and ensures that a reply is
eventually sent to the reply port specified in the message. The debugger can choose to emulate the
call and send the reply itself. Alternatively, it can forward the message to the kernel using the
target's original task_notify port (to which the debugger now has send rights). When sending the
message to the kernel, it can direct the reply to the target's reply port or specify its own reply port,
thus intercepting the kernel's reply.
Likewise, the debugger intercepts any notifications sent to this task and decides whether to
handle them on the target's behalf or forward them to the target. Hence a debugger (or other task)
can control any port of a target task, provided it has send rights to the task's task_self port.
6.9 Extensibility
Mach IPC is designed to be transparently extensible to a distributed environment. A user-level pro-
gram called the netmsgserver extends Mach IPC across a network, so that users can communicate
with tasks on remote machines as easily as with local tasks. Applications are unaware of the remote
connection and continue to use the same interface and system calls used for local communications.
An application is typically unaware if it is communicating with a local or remote task.
There are two important reasons why Mach is able to provide such transparent extensibility.
First, the port rights provide a location-independent name space. The sender simply sends the mes-
sage using a local name of the port (the send right). It does not have to know if the port represents a
local or remote object. The kernel maintains the mappings between the task's local name of the port
and the port's kernel object.
Second, senders are anonymous. Messages do not identify the senders. The sender may pass
the send right to a reply port in the message. The kernel translates this so that the receiver sees only
a local name for this right, and cannot determine who the sender is. Moreover, the sender need not
own the reply port; it only needs to own a send right to it. By specifying a reply port that is owned
by another task, the sender can direct the reply to a different task. This is also useful for debugging,
emulation, and so forth.
The netmsgserver operation is simple. Figure 6-16 shows a typical scenario. Each machine
on the network runs a netmsgserver program. If a client on node A wishes to communicate with a
server on node B, the netmsgserver on A sets up a proxy port to which the client sends the message.
It then retrieves messages sent to that port and forwards them to the netmsgserver on machine B,
which in turn forwards them to the server's port.
If the client expects a reply, it specifies a reply port in the message. The netmsgserver on A
retains the send right to the reply port. The netmsgserver on B creates a proxy port for the reply port
and sends the right to the proxy port to the server. The server replies to the proxy port, and the reply
is routed via the two netmsgservers to the client's reply port.
Servers register themselves with the local netmsgserver and pass it the send right to a port to
which the server listens. The netmsgservers maintain a distributed database of such registered net-
work ports and provide them the same services (protection, notifications, etc.) that the kernel pro-
vides to local ports. Thus the netmsgservers query each other to provide a global name lookup
service. Tasks use this service to acquire send rights to ports registered with remote netmsgservers.
In the absence of a network, this degenerates into a simple local name service.
The netmsgservers communicate with each other using low-level network protocols, not
through IPC messages.
Mach 3.0 has three separate enhancements that address this general problem: send-once
rights, notification requests, and user-references for send rights.
A task may acquire the same send right more than once, in various ways. If a client wants to communicate with a server, the client acquires
(through the name server) a send right to it. If multiple threads in the client independently initiate
contact with the server, they will each receive a send right, which will be combined into a single
name. If one of these threads deallocates the name, it will have an impact on all other threads.
Mach 3.0 addresses this problem by associating a user-reference with each send right. Thus
in the previous example, the kernel increments the reference count each time the task obtains the
same send right, and decrements it each time a thread deallocates the right. When the last reference
is released, the kernel can remove the right safely.
6.11 Discussion
Mach uses IPC not only for communication between processes, but also as a fundamental kernel
structuring primitive. The virtual memory subsystem uses IPC to implement copy-on-write
[Youn 87], and the kernel uses IPC to control tasks and threads. The basic abstractions of Mach,
such as tasks, threads, and ports, interact with one another through message passing.
This architecture provides some interesting functionality. For instance, the netmsgserver
transparently extends the IPC mechanisms to a distributed system, so that a task may control and
interact with objects on remote nodes. This allows Mach to provide facilities such as remote de-
bugging, distributed shared memory, and other client-server programs.
In contrast, the extensive use of message passing results in poor performance. For a while
there was great interest in building microkernel operating systems, where most of the facilities are
provided by user-level server tasks that communicate with one another using IPC. While many ven-
dors are still working on such solutions, the performance concerns have driven these efforts away
from the mainstream.
The proponents of Mach have argued that IPC performance is not an important factor in de-
signing microkernel operating systems [Bers 92], for the following reasons:
• Improvements in IPC performance have been far greater than in other areas of the operat-
ing system.
• With the increasing reliance on hardware caches, the cost of operating system services will
be dominated by cache hit patterns. Since the IPC code is well localized, it can be easily
tuned to use the cache optimally.
• Some data transfer can be achieved through other mechanisms such as shared memory.
• Migrating some of the kernel functionality into user-level servers reduces the number of
mode switches and protection boundary crossings, which are expensive.
Researchers have devoted considerable attention to improving IPC performance [Bers 90,
Barr 91]. So far, however, Mach IPC has had a limited impact in the commercial world. Even Digi-
tal UNIX, which is based on Mach, does not use Mach IPC in many of its kernel subsystems.
6.12 Summary
This chapter described several IPC mechanisms. Signals, pipes, and ptrace are universal facilities,
available in all but the earliest UNIX systems. The System V IPC suite, comprising shared memory,
semaphores, and message queues, is also available in most modern variants. In the Mach kernel, all
objects use IPC to interact with each other. Mach IPC is extensible through the netmsgserver, allow-
ing the development of distributed, client-server applications.
Some other IPC mechanisms are covered elsewhere in the book-file locking in Chapter 8,
memory-mapped files in Chapter 14, and STREAMS pipes in Chapter 17.
6.13 Exercises
1. What are the limitations of ptrace as a tool for writing debuggers?
2. The pid argument of ptrace must specify the process ID of a child of the caller. What are the
implications of relaxing this requirement? Why should processes not be able to use ptrace to
interact with arbitrary processes?
3. Compare the IPC functionality provided by pipes and message queues. What are the
advantages and drawbacks of each? When is one more suitable than the other?
4. Most UNIX systems allow a process to attach the same shared memory region to more than
one location in its address space. Is this a bug or a feature? When would this be useful? What
problems could it cause?
5. What issues must a programmer be concerned with in choosing an address to attach a shared
memory region to? What errors would the operating system protect against?
6. How can the IPC_NOWAIT flag be used to prevent deadlocks when using semaphores?
7. Write programs to allow cooperating processes to lock a resource for exclusive use, using (a)
FIFO files, (b) semaphores, (c) the mkdir system call, and (d) flock or lockf system calls.
Compare and explain their performance.
8. What side effects, if any, must the programmer be concerned with in each of the above cases?
9. Is it possible to implement resource locking through (a) signals alone or (b) shared memory
and signals? What would be the performance of such a facility?
10. Write programs to transfer a large amount of data between two processes, using (a) a pipe, (b)
a FIFO, (c) a message queue, and (d) shared memory with semaphores for synchronization.
Compare and explain their performance.
11. What are the security problems associated with System V IPC? How can a malicious program
eavesdrop on, or interfere with, communications between other processes?
12. Semaphores are created with semget but initialized with semctl. Hence creation and
initialization cannot be accomplished in a single atomic operation. Describe a situation where
this might lead to a race condition and suggest a solution to the problem.
13. Can System V message queues be implemented on top of Mach IPC? What problems must
such an implementation solve?
6.14 References
[Bach 86] Bach, M.J., The Design of the UNIX Operating System, Prentice-Hall, Englewood
Cliffs, NJ, 1986.
[Baro 90] Baron, R.V., Black, D., Bolosky, W., Chew, J., Draves, R.P., Golub, D.B., Rashid,
R.F., Tevanian, A., Jr., and Young, M.W., Mach Kernel Interface Manual,
Department of Computer Science, Carnegie-Mellon University, Jan. 1990.
[Barr 91] Barrera, J.S.,III, "A Fast Mach Network IPC Implementation," Proceedings of the
USENIX Mach Symposium, Nov. 1991, pp. 1-12.
[Bers 90] Bershad, B.N., Anderson, T.E., Lazowska, E.D., and Levy, H.M., "Lightweight
Remote Procedure Call," ACM Transactions on Computer Systems, Vol. 8, No. 1,
Feb. 1990, pp. 37-55.
[Bers 92] Bershad, B.N., "The Increasing Irrelevance of IPC Performance for Microkernel-
Based Operating Systems," USENIX Workshop on Micro-Kernels and Other Kernel
Architectures, Apr. 1992, pp. 205-212.
[Dijk 65] Dijkstra, E.W., "Solution of a Problem in Concurrent Programming Control,"
Communications of the ACM, Vol. 8, Sep. 1965, pp. 569-578.
[Drav 90] Draves, R.P., "A Revised IPC Interface," Proceedings of the First Mach USENIX
Workshop, Oct. 1990, pp. 101-121.
[Drav 91] Draves, R.P., Bershad, B.N., Rashid, R.F., and Dean, R.W., "Using Continuations to
Implement Thread Management and Communication in Operating Systems,"
Technical Report CMU-CS-91-115R, School of Computer Science, Carnegie-
Mellon University, Oct. 1991.
[Faul 91] Faulkner, R. and Gomes, R., "The Process File System and Process Model in UNIX
System V," Proceedings of the 1991 Winter USENIX Conference, Jan. 1991, pp.
243-252.
[Pres 90] Presotto, D.L., and Ritchie, D.M., "Interprocess Communications in the Ninth
Edition UNIX System," UNIX Research System Papers, Tenth Edition, Vol. II,
Saunders College Publishing, 1990, pp. 523-530.
[Rash 86] Rashid, R.F., "Threads of a New System," UNIX Review, Aug. 1986, pp. 37-49.
[Salu 94] Salus, P.H., A Quarter Century of UNIX, Addison-Wesley, Reading, MA, 1994.
[Stev 90] Stevens, R.W., UNIX Network Programming, Prentice-Hall, Englewood Cliffs, NJ,
1990.
[Thom 78] Thompson, K., "UNIX Implementation," The Bell System Technical Journal, Vol.
57, No. 6, Part 2, Jul.-Aug. 1978, pp. 1931-1946.
[Youn 87] Young, M., Tevanian, A., Rashid, R.F., Golub, D., Eppinger, J., Chew, J., Bolosky,
W., Black, D., and Baron, R., "The Duality of Memory and Communication in the
Implementation of a Multiprocessor Operating System," Proceedings of the Eleventh
ACM Symposium on Operating Systems Principles, Nov. 1987, pp. 63-76.
7
Synchronization and
Multiprocessors
7.1 Introduction
The desire for more processing power has led to several advances in hardware architectures. One of
the major steps in this direction has been the development of multiprocessor systems. These systems
consist of two or more processors sharing the main memory and other resources. Such configura-
tions offer several advantages. They provide a flexible growth path for a project, which may start
with a single processor and, as its computing needs grow, expand seamlessly by adding extra proc-
essors to the machine. Systems used for compute-intensive applications are often CPU-bound. The
CPU is the main bottleneck, and other system resources such as the I/O bus and memory are un-
derutilized. Multiprocessors add processing power without duplicating other resources, and hence
provide a cost-effective solution for CPU-bound workloads.
Multiprocessors also provide an extra measure of reliability; if one of the processors should
fail, the system could still continue to run without interruption. This, however, is a double-edged
sword, since there are more potential points of failure. To ensure a high mean time between failures
(MTBF), multiprocessor systems must be equipped with fault-tolerant hardware and software. In
particular, the system should recover from the failure of one processor without crashing.
Several variants of UNIX have evolved to take advantage of such systems. One of the earli-
est multiprocessing UNIX implementations ran on the AT&T 3B20A and the IBM 370 architectures
[Bach 84]. Currently, most major UNIX implementations are either native multiprocessing systems
(Digital UNIX, Solaris 2.x) or have multiprocessing variants (SVR4/MP, SCO/MPX).
Ideally, we would like to see the system performance scale linearly with the number of proc-
essors. Real systems fall short of this goal for several reasons. Since the other components of the
system are not duplicated, they can become bottlenecks. The need to synchronize when accessing
shared data structures, and the extra functionality to support multiple processors, adds CPU over-
head and reduces the overall performance gains. The operating system must try to minimize this
overhead and allow optimal CPU utilization.
The traditional UNIX kernel assumes a uniprocessor architecture and needs major modifica-
tions to run on multiprocessor systems. The three main areas of change are synchronization, paral-
lelization, and scheduling policies. Synchronization involves the basic primitives used to control
access to shared data and resources. The traditional primitives of sleep/wakeup combined with inter-
rupt blocking are inadequate in a multiprocessing environment and must be replaced with more
powerful facilities.
Parallelization concerns the efficient use of the synchronization primitives to control access
to shared resources. This involves decisions regarding lock granularity, lock placement, deadlock
avoidance, and so forth. Section 7.10 discusses some of these issues. The scheduling policy also
needs to be changed to allow the optimal utilization of all processors. Section 7.4 analyzes some
issues related to multiprocessor scheduling.
This chapter first describes the synchronization mechanisms in traditional UNIX systems
and analyzes their limitations. It follows with an overview of multiprocessor architectures. Finally
the chapter describes synchronization in modern UNIX systems. The methods described work well
on both uniprocessor and multiprocessor platforms.
In traditional UNIX systems, the process is the basic scheduling unit, and it has a single
thread of control. As described in Chapter 3, many modern UNIX variants allow multiple threads of
control in each process, with full kernel support for these threads. Such multithreaded systems are
available both on uniprocessor and multiprocessor architectures. In these systems, individual threads
contend for and lock the shared resources. In the rest of this chapter, we refer to a thread as the basic
scheduling unit since it is the more general abstraction. For a single-threaded system, a thread is
synonymous with a process.
A thread running in the kernel can thus manipulate several data structures without any locking,
knowing that no other thread can access them until the current thread is done with them and is
ready to relinquish the kernel in a consistent state.
When a thread releases the resource, it checks whether any other threads are waiting for (blocked on) this resource. In that case, it examines the sleep queue and wakes up all such threads.
Waking a thread involves unlinking it from the sleep queue, changing its state to runnable, and put-
ting it on the scheduler queue. When one of these threads is eventually scheduled, it again checks
the locked flag, finds that it is clear, sets it, and proceeds to use the resource.
The time required for a wakeup depends on the number of processes sleeping on that queue. This sort of unpredictable delay is usually undesirable and may
be unacceptable for kernels that support real-time applications requiring bounded dispatch latency
(see Section 5.5.4).
One alternative is to associate a separate sleep queue for each resource or event (Figure 7-2).
This approach would optimize the latency of the wakeup algorithm, at the expense of memory over-
head for all the extra queues. The typical queue header contains two pointers (forward and back-
ward) as well as other information. The total number of synchronization objects in the system may
be quite large and putting a sleep queue on each of them may be wasteful.
Solaris 2.x provides a more space-efficient solution [Eykh 92]. Each synchronization object
has a two-byte field that locates a turnstile structure that contains the sleep queue and some other
information (Figure 7-3). The kernel allocates turnstiles only to those resources that have threads
blocked on them. To speed up allocation, the kernel maintains a pool of turnstiles, and the size of
this pool is greater than the number of active threads. This approach provides more predictable real-
time behavior with minimal storage overhead. Section 5.6.7 describes turnstiles in greater detail.
(Figure 7-4: multiprocessor memory models, (a) UMA, (b) NUMA, (c) hybrid, and (d) NORMA, in which each CPU has local RAM and communicates over a high-speed network interconnect.)
1 However, the data, instruction, and address translation caches are local to each processor.
In a UMA system (Figure 7-4(a)), all processors share uniform access to main memory, with
everything on a single system bus. This is a simple model from the operating system perspective. Its
main drawback is scalability. UMA architectures can support only a small number of processors. As
the number of processors increases, so does the contention on the bus. One of the largest UMA sys-
tems is the SGI Challenge, which supports up to 36 processors on a single bus.
In a NUMA system (Figure 7-4(b)), each CPU has some local memory, but can also access
memory local to another processor. The remote access is slower, usually by an order of magnitude,
than local access. There are also hybrid systems (Figure 7-4(c)), where a group of processors shares
uniform access to its local memory, and has slower access to memory local to another group. The
NUMA model is hard to program without exposing the details of the hardware architecture to the
applications.
In a NORMA system (Figure 7-4(d)), each CPU has direct access only to its own local
memory and may access remote memory only through explicit message passing. The hardware
provides a high-speed interconnect that increases the bandwidth for remote memory access. Build-
ing a successful system for such an architecture requires cache management and scheduling support
in the operating system, as well as compilers that can optimize the code for such hardware.
This chapter restricts itself to UMA systems.
Atomic Test-and-Set
An atomic test-and-set operation usually acts on a single bit in memory. It tests the bit, sets it to one,
and returns its old value. Thus at the completion of the operation the value of the bit is one (locked),
and the return value indicates whether it was already set to one prior to this operation. The operation
is guaranteed to be atomic, so if two threads on two processors both issue the same instruction on
the same bit, one operation will complete before the other starts. Further, the operation is also
atomic with respect to interrupts, so that an interrupt can occur only after the operation completes.
Such a primitive is ideally suited for simple locks. If the test-and-set returns zero, the calling
thread owns the resource. If it returns one, the resource is locked by another thread. Unlocking the
resource is done by simply setting the bit to zero. Some examples of test-and-set instructions are
BBSSI (Branch on Bit Set and Set Interlocked) on the VAX-11 [Digi 87] and LDSTUB (LoaD and
STore Unsigned Byte) on the SPARC.
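A simple spin lock can be built directly on such an instruction. In the following sketch, test_and_set() is an assumed wrapper for whatever instruction the hardware provides (BBSSI, LDSTUB, and so on); it atomically sets the lock byte and returns its previous value.

```c
typedef volatile unsigned char spinlock_t;

/* Assumed to compile to the hardware test-and-set instruction. */
int test_and_set(spinlock_t *lk);

void
spin_lock(spinlock_t *lk)
{
        while (test_and_set(lk) != 0)
                ;                       /* busy-wait until the lock is released */
}

void
spin_unlock(spinlock_t *lk)
{
        *lk = 0;                        /* releasing the lock is a simple store */
}
```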
Some processors such as the MIPS R4000 and Digital's Alpha AXP use a pair of special load and
store instructions to provide an atomic read-modify-write operation. The load-linked instruction
(also called the load-locked instruction) loads a value from memory into a register and sets a flag
that causes the hardware to monitor the location. If any processor writes to such a monitored loca-
tion, the hardware will clear the flag. The store-conditional instruction stores a new value into the
location provided the flag is still set. In addition, it sets the value of another register to indicate if
the store occurred.
Such a primitive may be used to generate an atomic increment operation. The variable is
read using load-linked, and its new value is set using store-conditional. This sequence is repeated
until it succeeds. Event counters in DG/UX [Kell 89] are based on this facility.
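The retry loop can be sketched as follows; load_linked() and store_conditional() are hypothetical wrappers for the machine instructions, with store_conditional() returning nonzero only if no other processor wrote the location since the matching load.

```c
long load_linked(volatile long *addr);
int  store_conditional(volatile long *addr, long value);

long
atomic_increment(volatile long *counter)
{
        long old;

        do {
                old = load_linked(counter);               /* begin monitoring the location */
        } while (!store_conditional(counter, old + 1));   /* retry if someone else wrote it */
        return old + 1;
}
```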
Some systems such as the Motorola MC88100 use a third approach based on a swap-atomic
instruction. This method is explored further in the exercises at the end of this chapter. Any of these
hardware mechanisms becomes the first building block for a powerful and comprehensive synchro-
nization facility. The high-level software abstractions described in the following sections are all
built on top of the hardware primitives.
On a multiprocessor, blocking interrupts on one processor does not protect data structures that an
interrupt handler running on another processor may be accessing. This is compounded by the fact
that the handler cannot use the sleep/wakeup synchronization model, since most implementations do
not permit interrupt handlers to block. The system
should provide some mechanism for blocking interrupts on other processors. One possible solution
is a global ipl managed in software.
This problem is not as acute on a uniprocessor, since by the time a thread runs, whoever had
locked the resource is likely to have released it. On a multiprocessor, however, if several threads
were blocked on a resource, waking them all may cause them to be simultaneously scheduled on
different processors, and they would all fight for the same resource again. This is frequently referred
to as the thundering herd problem.
Even if only one thread was blocked on the resource, there is still a time delay between its
waking up and actually running. In this interval, an unrelated thread may grab the resource, causing
the awakened thread to block again. If this happens frequently, it could lead to starvation of this
thread.
We have examined several problems with the traditional synchronization model that affect
correct operation and performance. The rest of this chapter describes several synchronization
mechanisms that function well on both uniprocessors and multiprocessors.
7.5 Semaphores
The early implementations of UNIX on multiprocessors relied almost exclusively on Dijkstra's
semaphores [Dijk 65] (also called counted semaphores) for synchronization. A semaphore is an
integer-valued variable that supports two basic operations, P() and V(). P() decrements the sema-
phore and blocks if its new value is less than zero. V() increments the semaphore; if the resulting
value is less than or equal to zero, it wakes up a thread blocked on it (if any). Example 7-1 describes
these functions, plus an initialization function initsem() and a CP() function, which is a nonblocking
version of P().
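A minimal sketch of such an interface is shown below; the spinlock_t and sleepq_t types and the helpers spinlock_init(), sleepq_init(), sleep_and_unlock(), and wakeup_one() are assumptions, not taken from any particular kernel.

```c
typedef struct semaphore {
        int        count;
        spinlock_t lock;         /* makes the operations atomic */
        sleepq_t   queue;        /* threads blocked in P() */
} semaphore;

void
initsem(semaphore *s, int value)
{
        s->count = value;
        spinlock_init(&s->lock);
        sleepq_init(&s->queue);
}

void
P(semaphore *s)                  /* acquire; may block */
{
        spin_lock(&s->lock);
        if (--s->count < 0)
                sleep_and_unlock(&s->queue, &s->lock);  /* releases the lock atomically */
        else
                spin_unlock(&s->lock);
}

void
V(semaphore *s)                  /* release */
{
        spin_lock(&s->lock);
        if (++s->count <= 0)
                wakeup_one(&s->queue);   /* ownership passes to the woken thread */
        spin_unlock(&s->lock);
}

int
CP(semaphore *s)                 /* conditional P(); never blocks */
{
        int acquired = 0;

        spin_lock(&s->lock);
        if (s->count > 0) {
                s->count--;
                acquired = 1;
        }
        spin_unlock(&s->lock);
        return acquired;
}
```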
The kernel guarantees that the semaphore operations will be atomic, even on a multiproces-
sor system. Thus if two threads try to operate on the same semaphore, one operation will complete
or block before the other starts. The P() and V() operations are comparable to sleep and wakeup,
but with somewhat different semantics. The CP () operation allows a way to poll the semaphore
without blocking and is used in interrupt handlers and other functions that cannot afford to block. It
is also used in deadlock avoidance cases, where a P() operation risks a deadlock.
/* During initialization */
semaphore sem;
initsem (&sem, 1);
/* On each use */
P (&sem);
Use resource;
V (&sem);
Example 7-2. Semaphore used to lock resource for exclusive use.
/* During initialization */
semaphore event;
initsem (&event, 0);            /* probably at boot time */

/* During initialization */
semaphore counter;
initsem (&counter, resourceCount);
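The two fragments above suggest the other common idioms: waiting for an event, and counting a pool of identical resources. Their use would look roughly like this (the usage lines are illustrative, following the pattern of Example 7-2):

```c
/* Waiting for an event (semaphore initialized to 0) */
P (&event);              /* blocks until the event has occurred */
/* ... the thread that completes the event performs: */
V (&event);

/* Managing a pool of resourceCount interchangeable resources */
P (&counter);            /* blocks if all instances are in use */
Use one instance of the resource;
V (&counter);            /* return the instance to the pool */
```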
Semaphore operations may involve context switches and manipulation of sleep and scheduler queues, all of which make the operations slow. This ex-
pense may be tolerable for some resources that need to be held for a long time, but is unacceptable
for locks held for a short time.
The semaphore abstraction also hides information about whether the thread actually had to
block in the P() operation. This is often unimportant, but in some cases it may be crucial. The
UNIX buffer cache, for instance, uses a function called getblk() to look for a particular disk block
in the buffer cache. If the desired block is found in the cache, getblk() attempts to lock it by call-
ing P(). If P() were to sleep because the buffer was locked, there is no guarantee that, when awak-
ened, the buffer would contain the same block that it originally had. The thread that had locked the
buffer may have reassigned it to some other block. Thus after P() returns, the thread may have
locked the wrong buffer. This problem can be solved within the framework of semaphores, but the
solution is cumbersome and inefficient, and indicates that other abstractions might be more suitable
[Ruan 90].
7.5.5 Convoys
Compared to the traditional sleep/wakeup mechanism, semaphores offer the advantage that proc-
esses do not wake up unnecessarily. When a thread wakes up within a P() , it is guaranteed to have
the resource. The semantics ensure that the ownership of the semaphore is transferred to the woken
up thread before that thread actually runs. If another thread tries to acquire the semaphore in the
meantime, it will not be able to do so. This very fact, however, leads to a performance problem
called semaphore convoys [Lee 87]. A convoy is created when there is frequent contention on a
semaphore. Although this can degrade the performance of any locking mechanism, the peculiar se-
mantics of semaphores compound the problem.
Figure 7-7 shows the formation of a convoy. R1 is a critical region protected by a sema-
phore. At instant (a), thread T2 holds the semaphore, while T3 is waiting to acquire it. T1 is run-
ning on another processor, and T4 is waiting to be scheduled. Now suppose T2 exits the critical re-
gion and releases the semaphore. It wakes up T3 and puts it on the scheduler queue. T3 now holds
the semaphore, as shown in (b).
Now suppose T1 needs to enter the critical region. Since the semaphore is held by T3, T1
will block, freeing up processor P1. The system will schedule thread T4 to run on P1. Hence in (c),
T3 holds the semaphore and T1 is blocked on it; neither thread can run until T2 or T4 yields its
processor.
The problem lies in step (c). Although the semaphore has been assigned to T3, T3 is not
running and hence is not in the critical region. As a result, T1 must block on the semaphore even
though no thread is in the critical region. The semaphore semantics force allocation in a first-come,
first-served order.2 This forces a number of unnecessary context switches. Suppose the semaphore
was replaced by an exclusive lock, or mutex. Then, in step (b), T2 would release the lock and wake
up T3, but T3 would not own the lock at this point. Consequently, in step (c), T1 would acquire the
lock, eliminating the context switch.
Note: Mutex, short for mutual exclusion lock, is a general term that
refers to any primitive that enforces exclusive access semantics.
2 Some implementations may choose the thread to wake up based on priority. The effect in this example would be the
same.
The most important characteristic of spin locks is that a thread ties up a CPU while waiting for the
lock to be released. It is essential, then, to hold spin locks only for extremely short durations. In
particular, they must not be held across blocking operations. It may also be desirable to block inter-
rupts on the current processor prior to acquiring a spin lock, so as to guarantee low holding time on
the lock.
The basic premise of a spin lock is that a thread busy-waits on a resource on one processor
while another thread is using the resource on a different processor. This is only possible on a multi-
processor. On a uniprocessor, if a thread tries to acquire a spin lock that is already held, it will loop
forever. Multiprocessor algorithms, however, must operate correctly regardless of the number of
processors, which means that they should handle the uniprocessor case as well. This requires strict
adherence to the rule that threads not relinquish control of the CPU while holding a spin lock. On a
uniprocessor, this ensures that a thread will never have to busy-wait on a spin lock.
The major advantage of spin locks is that they are inexpensive. When there is no contention
on the lock, both the lock and the unlock operations typically require only a single instruction each.
They are ideal for locking data structures that need to be accessed briefly, such as while removing
an item from a doubly linked list or while performing a load-modify-store type of operation on a
variable. Hence they are used to protect those data structures that do not need protection in a uni-
processor system. They are also used extensively to protect more complex locks, as shown in the
following sections. Semaphores, for instance, use a spin lock to guarantee atomicity of their opera-
tions, as shown in Example 7-7.
spinlock_t listlock;
spin_lock (&listlock);
item->forw->back = item->back;
item->back->forw = item->forw;
spin_unlock (&listlock);
Example 7-7. Using a spin lock to access a doubly linked list.
struct condition {
        proc *next;              /* doubly linked list */
        proc *prev;              /* of blocked threads */
        spinlock_t listlock;     /* protects this list */
};
spin_unlock (&c->listlock);
We thus have a situation where a thread tries to acquire one spin lock while holding another.
This is not disastrous since the restriction on spin locks is only that threads are not allowed to block
while holding one. Deadlocks are avoided by maintaining a strict locking order: the lock on the
predicate must be acquired before listlock.
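A sketch of how a wait operation and do_signal() might be built on this structure follows; the caller of do_wait() holds the spin lock protecting the predicate, and all helper names (add_to_list(), mark_asleep(), swtch(), setrun(), and so on) are assumptions rather than code from any particular kernel.

```c
void
do_wait(struct condition *c, spinlock_t *pred_lock)
{
        spin_lock(&c->listlock);        /* predicate lock first, then listlock */
        add_to_list(c, curthread);      /* queue the caller on the condition */
        mark_asleep(curthread);         /* done before dropping locks, so the
                                           wakeup cannot be lost */
        spin_unlock(pred_lock);
        spin_unlock(&c->listlock);
        swtch();                        /* give up the processor until signaled */
}

void
do_signal(struct condition *c)
{
        struct proc *t;

        spin_lock(&c->listlock);
        t = remove_one_from_list(c);
        spin_unlock(&c->listlock);
        if (t != NULL)
                setrun(t);              /* make the thread runnable again */
}
```

do_broadcast() would differ only in draining the entire list; a caller of do_wait() re-acquires the predicate lock and rechecks the predicate after it returns.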
It is not necessary for the queue of blocked threads to be a part of the condition structure it-
self. Instead, we may have a global set of sleep queues as in traditional UNIX. In that case, the
listlock in the condition is replaced by a mutex protecting the appropriate sleep queue. Both
methods have their own advantages, as discussed earlier.
One of the major advantages of a condition variable is that it provides two ways to handle
event completion. When an event occurs, there is the option of waking up just one thread with
do_signal() or all threads with do_broadcast(). Each may be appropriate in different circum-
stances. In the case of the server application, waking one thread is sufficient, since each request is
handled by a single thread. However, consider several threads running the same program, thus
sharing a single copy of the program text. More than one of these threads may try to access the same
nonresident page of the text, resulting in page faults in each of them. The first thread to fault initi-
ates a disk access for that page. The other threads notice that the read has already been issued and
block waiting for the I/O to complete. When the page is read into memory, it is desirable to call
do_broadcast() and wake up all the blocked threads, since they can all access the page without
conflict.
7.7.2 Events
Frequently, the predicate of the condition is simple. Threads need to wait for a particular task to
complete. The completion may be flagged by setting a global variable. This situation may be better
expressed by a higher-level abstraction called an event that combines a done flag, the spin lock pro-
tecting it, and the condition variable into a single object. The event object presents a simple inter-
face, allowing two basic operations: awaitDone() and setDone(). awaitDone() blocks until the
event occurs, while setDone() marks the event as having occurred and wakes up all threads
blocked on it. In addition, the interface may support a nonblocking testDone() function and a
reset() function, which once again marks the event as not done. In some cases, the boolean done
flag may be replaced by a variable that returns more descriptive completion information when the
event occurs.
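Such an event object can be sketched directly in terms of the pieces just listed; do_wait() and do_broadcast() follow the condition-variable usage of the previous section, and the remaining details are assumptions.

```c
struct event {
        int              done;       /* has the event occurred? */
        spinlock_t       lock;       /* protects the done flag */
        struct condition cond;       /* threads blocked in awaitDone() */
};

void
awaitDone(struct event *e)
{
        spin_lock(&e->lock);
        while (!e->done) {
                do_wait(&e->cond, &e->lock);   /* drops e->lock while blocked */
                spin_lock(&e->lock);           /* re-acquire before rechecking */
        }
        spin_unlock(&e->lock);
}

void
setDone(struct event *e)
{
        spin_lock(&e->lock);
        e->done = 1;
        spin_unlock(&e->lock);
        do_broadcast(&e->cond);                /* wake all waiting threads */
}

void
reset(struct event *e)
{
        spin_lock(&e->lock);
        e->done = 0;
        spin_unlock(&e->lock);
}
```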
For performance reasons, blocking locks might be provided as fundamental primitives. In particular, if each
resource has its own sleep queue, a single spin lock might protect both the flag and the queue.
7.8.2 Implementation
Example 7-9 implements a read-write lock facility:
struct rwlock {
        int nActive;           /* num of active readers, or -1 if a writer is active */
        int nPendingReads;
        int nPendingWrites;
        spinlock_t sl;
        condition canRead;
        condition canWrite;
};
r->nActive = -1;
spin_unlock (&r->sl);
spin_unlock (&r->sl);
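A sketch of how the locking routines for this structure might be written, using the do_wait() and do_signal()/do_broadcast() operations of Section 7.7, is shown below; the function names and the writer-priority policy are assumptions, not the book's Example 7-9 code.

```c
void
lockShared(struct rwlock *r)
{
        spin_lock(&r->sl);
        r->nPendingReads++;
        while (r->nActive == -1 || r->nPendingWrites > 0) {  /* writers get priority */
                do_wait(&r->canRead, &r->sl);
                spin_lock(&r->sl);
        }
        r->nPendingReads--;
        r->nActive++;
        spin_unlock(&r->sl);
}

void
unlockShared(struct rwlock *r)
{
        spin_lock(&r->sl);
        if (--r->nActive == 0 && r->nPendingWrites > 0)
                do_signal(&r->canWrite);       /* let one writer proceed */
        spin_unlock(&r->sl);
}

void
lockExclusive(struct rwlock *r)
{
        spin_lock(&r->sl);
        r->nPendingWrites++;
        while (r->nActive != 0) {              /* wait for all readers and writers */
                do_wait(&r->canWrite, &r->sl);
                spin_lock(&r->sl);
        }
        r->nPendingWrites--;
        r->nActive = -1;
        spin_unlock(&r->sl);
}

void
unlockExclusive(struct rwlock *r)
{
        spin_lock(&r->sl);
        r->nActive = 0;
        if (r->nPendingWrites > 0)
                do_signal(&r->canWrite);
        else if (r->nPendingReads > 0)
                do_broadcast(&r->canRead);     /* wake all waiting readers */
        spin_unlock(&r->sl);
}
```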
Figure 7-8. Possible deadlock when using spin locks.
There are situations in which the ordering must be violated. Consider a buffer cache imple-
mentation that maintains disk block buffers on a doubly linked list, sorted in least recently used
(LRU) order. All buffers that are not actively in use are on the LRU list. A single spin lock protects
both the queue header and the forward and backward pointers in the buffers on that queue. Each
buffer also has a spin lock to protect the other information in the buffer. This lock must be held
while the buffer is actively in use.
When a thread wants a particular disk block, it locates the buffer (using hash queues or other
pointers not relevant to this discussion) and locks it. It then locks the LRU list in order to remove
the buffer from it. Thus the normal locking order is "first the buffer, then the list."
Sometimes a thread simply wants any free buffer and tries to get it from the head of the LRU
list. It first locks the list, then locks the first buffer on the list and removes it from the list. This,
however, reverses the locking order, since it locks the list before the buffer.
It is easy to see how a deadlock can occur. One thread locks the buffer at the head of the list
and tries to lock the list. At the same time, another thread that has locked the list tries to lock the
buffer at the head. Each will block waiting for the other to release the lock.
The kernel uses stochastic locking to handle this situation. When a thread attempts to acquire
a lock that would violate the hierarchy, it uses a try_lock() operation instead of lock(). This
function attempts to acquire the lock, but returns failure instead of blocking if the lock is already
held. In this example, the thread that wants to get any free buffer will lock the list and then go down
the list, using try_lock() until it finds a buffer it can lock. Example 7-10 describes an implemen-
tation of try_lock() for spin locks.
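For spin locks, such an operation is a one-line variation on spin_lock(); a sketch, reusing the hypothetical test_and_set() wrapper from the earlier spin lock example:

```c
/* Returns nonzero if the lock was acquired, zero if it was already held. */
int
try_lock(spinlock_t *lk)
{
        return test_and_set(lk) == 0;
}
```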
locked briefly while adding or removing entries to it in memory. When disk I/O is required, the list
must be locked for a long time. Thus neither a spin lock nor a blocking lock is by itself a good solu-
tion. One alternative is to provide two locks, with the blocking lock being used only when the list is
being replenished. It is preferable, however, to have a more flexible locking primitive.
These issues can be effectively addressed by storing a hint in the lock that suggests whether
contending threads should spin or block. The hint is set by the owner of the lock and examined
whenever an attempt to acquire the lock does not immediately succeed. The hint may be either advi-
sory or mandatory.
An alternative solution is provided by the adaptive locks of Solaris 2.x [Eykh 92]. When a
thread T1 tries to acquire an adaptive lock held by another thread T2, it checks to see if T2 is cur-
rently active on any processor. As long as T2 is active, T1 executes a busy-wait. If T2 is blocked,
T1 blocks as well.
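The acquisition policy can be sketched as follows; the owner field, try_lock(), and the other helpers are illustrative assumptions, and Solaris's actual adaptive mutex code differs in detail.

```c
struct adaptive_mutex {
        spinlock_t     lock;
        struct thread *owner;            /* thread currently holding the mutex */
};

void
adaptive_lock(struct adaptive_mutex *m)
{
        for (;;) {
                if (try_lock(&m->lock)) {
                        m->owner = curthread;
                        return;
                }
                if (owner_is_running(m->owner))
                        continue;        /* owner is on a CPU: spin, it should
                                            release the lock shortly */
                block_on(m);             /* owner is blocked: sleep instead */
        }
}
```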
3 Sometimes, even SMP systems use a master processor to run code that is not multiprocessor-safe. This is known as
funneling.
bottleneck. Systems such as Mach, however, use a fine-grained locking structure and associate locks
with individual data objects.
Locking duration, too, must be carefully examined. It is best to hold the lock for as short a
time as possible, so as to minimize contention on it. Sometimes, however, this may result in extra
locking and unlocking. Suppose a thread needs to perform two operations on an object, both requir-
ing a lock on it. In between the two operations, the thread needs to do some unrelated work. It could
unlock the object after the first operation and lock it again for the second one. It might be better, in-
stead, to keep the object locked the whole time, provided that the unrelated work is fairly short.
Such decisions must be made on a case-by-case basis.
7.11.1 SVR4.2/MP
SVR4.2/MP is the multiprocessor version of SVR4.2. It provides four types of locks-basic locks,
sleep locks, read-write locks, and synchronization variables [UNIX 92]. Each lock must be explic-
itly allocated and deallocated through xxx_ALLOC and xxx_DEALLOC operations, where xxx_ is the
type-specific prefix. The allocation operation takes arguments that are used for debugging.
Basic Locks
The basic lock is a nonrecursive mutex lock that allows short-term locking of resources. It may not
be held across a blocking operation. It is implemented as a variable of type lock_t. It is locked and
unlocked by the following operations:
The LOCK call raises the interrupt priority level to new_ipl before acquiring the lock and returns the
previous priority level. This value must be passed to the UNLOCK operation, so that it may restore the
ipl to the old level.
Read-Write Locks
A read-write lock is a nonrecursive lock that allows short-term locking with single-writer, multiple-
reader semantics. It may not be held across a blocking operation. It is implemented as a variable of
type rwlock_t and provides the following operations:
The treatment of interrupt priorities is identical to that for basic locks. The locking operations raise
the ipl to the specified level and return the previous ipl. RW_UNLOCK restores the ipl to the old level.
The lock also provides nonblocking operations RW_TRYRDLOCK and RW_TRYWRLOCK.
Sleep Locks
A sleep lock is a nonrecursive mutex lock that permits long-term locking of resources. It may be
held across a blocking operation. It is implemented as a variable of type sleep_t, and provides the
following operations:
The pri parameter specifies the scheduling priority to assign to the process after it awakens. If a
process blocks on a call to SLEEP_LOCK, it will not be interrupted by a signal. If it blocks on a call to
SLEEP_LOCK_SIG, a signal will interrupt the process; the call returns TRUE if the lock is acquired,
and FALSE if the sleep was interrupted. The lock also provides other operations, such as
SLEEP_LOCK_AVAIL (checks if lock is available), SLEEP_LOCKOWNED (checks if caller owns the
lock), and SLEEP_TRYLOCK (returns failure instead of blocking if lock cannot be acquired).
Synchronization Variables
A synchronization variable is identical to the condition variables discussed in Section 7.7. It is im-
plemented as a variable of type sv_t, and its predicate, which is managed separately by users of the
lock, must be protected by a basic lock. It supports the following operations:
As in sleep locks, the pri argument specifies the scheduling priority to assign to the process after it
wakes up, and SV_WAIT_SIG allows interruption by a signal. The lockp argument is used to pass a
pointer to the basic lock protecting the predicate of the condition. The caller must hold lockp before
calling SV_WAIT or SV_WAIT_SIG. The kernel atomically blocks the caller and releases lockp.
When the caller returns from SV_WAIT or SV_WAIT_SIG, lockp is not held. The predicate is not
guaranteed to be true when the caller runs after blocking. Hence the call to SV_WAIT or
SV_WAIT_SIG should be enclosed in a while loop that checks the predicate each time.
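The resulting usage pattern can be sketched as follows; the predicate, the priority value, and the ipl level are placeholders, and the exact signatures should be taken from the DDI/DKI reference rather than from this sketch.

```c
/* 'resource_ready' is the predicate, protected by the basic lock pred_lock. */
pl_t pl = LOCK(&pred_lock, PL_SOME_LEVEL);     /* PL_SOME_LEVEL is a placeholder ipl */

while (!resource_ready) {
        SV_WAIT(&sv, SOME_PRI, &pred_lock);    /* atomically releases pred_lock and blocks */
        pl = LOCK(&pred_lock, PL_SOME_LEVEL);  /* returns without the lock; re-acquire it */
}
/* ... use the resource ... */
UNLOCK(&pred_lock, pl);
```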
It is initialized to the unlocked state and cannot be held across blocking operations or context
switches.
The complex lock is a single high-level abstraction supporting a number of features, such as
shared and exclusive access, blocking, and recursive locking. It is a reader-writer lock and provides
two options-sleep and recursive. The sleep option can be enabled or disabled while initializing the
lock or at any later time. If set, the kernel will block requesters if the lock cannot be granted imme-
diately. Further, the sleep option must be set if a thread wishes to block while holding the lock. The
recursive option can be set only by a thread that has acquired the lock for exclusive use; it can be
cleared only by the same thread that set the option.
The interface provides nonblocking versions of various routines, which return failure if the
lock cannot be acquired immediately. There are also functions to upgrade (shared to exclusive) or
downgrade (exclusive to shared). The upgrade routine will release the shared lock and return failure
if there is another pending upgrade request. The nonblocking version of upgrade returns failure but
does not drop the shared lock in this situation.
When a thread needs to block, it calls assert_wait(), which puts it on the appropriate sleep
queue. It then releases the spin lock and calls thread_block() to initiate a context switch. If the event
occurs between the release of the spin lock and the context switch, the kernel will remove the thread
from the sleep queue and put it on the scheduler queue. Thus the thread does not lose the wakeup.
[Ruan 90] is based on conditions. DG/UX [Kell 89] uses indivisible event counters to implement
sequenced locks, which provide a somewhat different way of waking one process at a time.
7.12 Summary
Synchronization problems on a multiprocessor are intrinsically different from and more complex
than those on a uniprocessor. There are a number of different solutions, such as sleep/wakeup,
conditions, events, read-write locks, and semaphores. These primitives are more similar than differ-
ent, and it is possible, for example, to implement semaphores on top of conditions and vice-versa.
Many of these solutions are not limited to multiprocessors and may also be applied to synchroniza-
tion problems on uniprocessors or on loosely coupled distributed systems. Many multiprocessing
UNIX systems are based on existing uniprocessing variants and, for these, porting considerations
strongly influence the decision of which abstraction to use. Mach and Mach-based systems are
mainly free of these considerations, which is reflected in their choice of primitives.
7.13 Exercises
1. Many systems have a swap-atomic instruction that swaps the value of a register with that of a
memory location. Show how such an instruction may be used to implement an atomic test-
and-set.
2. How can an atomic test-and-set be implemented on a machine using load-linked and store-
conditional?
3. Suppose a convoy forms due to heavy contention on a critical region that is protected by a
semaphore. If the region could be divided into two critical regions, each protected by a
separate semaphore, would it reduce the convoy problem?
4. One way to eliminate a convoy is to replace the semaphore with another locking mechanism.
Could this risk starvation of threads?
5. How is a reference count different from a shared lock?
6. Implement a blocking lock on a resource, using a spin lock and a condition variable, with a
locked flag as the predicate (see Section 7.7.3).
7. In exercise 6, is it necessary to hold the spin lock protecting the predicate while clearing the
flag? [Ruan 90] discusses a waitlock() operation that can improve this algorithm.
8. How do condition variables avoid the lost wakeup problem?
9. Implement an event abstraction that returns a status value to waiting threads upon event
completion.
10. Suppose an object is accessed frequently for reading or writing. In what situations is it better
to protect it with a simple mutex, rather than with a read-write lock?
11. Does a read-write lock have to be blocking? Implement a read-write lock that causes threads
to busy-wait if the resource is locked.
12. Describe a situation in which a deadlock may be avoided by making the locking granularity
finer.
13. Describe a situation in which a deadlock may be avoided by making the locking granularity
coarser.
14. Is it necessary for a multiprocessor kernel to lock each variable or resource before accessing
it? Enumerate the kinds of situations where a thread may access or modify an object without
locking it.
15. Monitors [Hoar 74] are language-supported constructs providing mutual exclusion to a region
of code. For what sort of situations do they form a natural solution?
16. Implement upgrade() and downgrade() functions to the read-write lock implementation in
Section 7.8.2.
7.14 References
[Bach 84] Bach, M., and Buroff, S., "Multiprocessor UNIX Operating Systems," AT&T Bell
Laboratories Technical Journal, Vol. 63, Oct. 1984, pp. 1733-1749.
[Birr 89] Birrell, A.D., "An Introduction to Programming with Threads," Digital Equipment
Corporation Systems Research Center, 1989.
[Camp 91] Campbell, M., Barton, R., Browning, J., Cervenka, D., Curry, B., Davis, T.,
Edmonds, T., Holt, R., Slice, R., Smith, T., and Wescott, R., "The Parallelization of
UNIX System V Release 4.0," Proceedings of the Winter 1991 USENIX Conference,
Jan. 1991, pp. 307-323.
[Denh 94] Denham, J.M., Long, P., and Woodward, J.A., "DEC OSF/1 Version 3.0 Symmetric
Multiprocessing Implementation," Digital Technical Journal, Vol. 6, No. 3, Summer
1994, pp. 29-54.
[Digi 87] Digital Equipment Corporation, VAX Architecture Reference Manual, 1984.
[Dijk 65] Dijkstra, E.W., "Solution of a Problem in Concurrent Programming Control,"
Communications of the ACM, Vol. 8, Sep. 1965, pp. 569-578.
[Eykh 92] Eykholt, J.R., Kleinman, S.R., Barton, S., Faulkner, R., Shivalingiah, A., Smith, M.,
Stein, D., Voll, J., Weeks, M., and Williams, D., "Beyond Multiprocessing:
Multithreading the SunOS Kernel," Proceedings of the Summer 1992 USENIX
Conference, Jun. 1992, pp. 11-18.
[Gobl 81] Goble, G.H., "A Dual-Processor VAX 11/780," USENIX Association Conference
Proceedings, Sep. 1981.
[Hoar 74] Hoare, C.A.R., "Monitors: An Operating System Structuring Concept,"
Communications of the ACM, Vol. 17, Oct. 1974, pp. 549-557.
[Hitz 90] Hitz, D., Harris, G., Lau, J.K., and Schwartz, A.M., "Using UNIX as One
Component of a Lightweight Distributed Kernel for Multiprocessor File Servers,"
Proceedings of the Winter 1990 USENIX Technical Conference, Jan. 1990, pp. 285-
295.
[Kell 89] Kelley, M.H., "Multiprocessor Aspects of the DG/UX Kernel," Proceedings of the
Winter 1989 USENIX Conference, Jan. 1989, pp. 85-99.
[Lee 87] Lee, T.P., and Luppi, M.W., "Solving Performance Problems on a Multiprocessor
UNIX System," Proceedings of the Summer 1987 USENJX Conference, Jun. 1987,
pp. 399-405.
[Nati 84] National Semiconductor Corporation, Series 32000 Instruction Set Reference
Manual, 1984.
[Peac 92] Peacock, J.K., Saxena, S., Thomas, D., Yang, F., and Yu, F., "Experiences from
Multithreading System V Release 4," The Third USENIX Symposium of Experiences
with Distributed and Multiprocessor Systems (SEDMS III), Mar. 1992, pp. 77-91.
[Ruan 90] Ruane, L.M., "Process Synchronization in the UTS Kernel," Computing Systems,
Vol. 3, Summer 1990, pp. 387-421.
[Sink 88] Sinkewicz, U., "A Strategy for SMP ULTRIX," Proceedings of the Summer 1988
USENIX Technical Conference, Jun. 1988, pp. 203-212.
[UNIX 92] UNIX System Laboratories, Device Driver Reference-UNIX SVR4.2, UNIX Press,
Prentice-Hall, Englewood Cliffs, NJ, 1992.
8
File System Interface and Framework
8.1 Introduction
The operating system must provide facilities for persistent storage and management of data. In
UNIX, the file abstraction acts as a container for data, and the file system allows users to organize,
manipulate, and access different files. This chapter describes the interface between the file system
and user applications, as well as the framework used by the kernel to support different kinds of file
systems. Chapters 9, 10, and 11 describe several different file system implementations that allow
users to access data on local and remote machines.
The file system interface comprises the system calls and utilities that allow user programs to
operate on files. This interface has remained fairly stable over the years, changing incrementally and
in compatible ways. The file system framework, however, has been overhauled completely. The
initial framework supported only one type of file system. All files were local to the machine and
were stored on one or more disks physically connected to the system. This has been replaced by the
vnode/vfs interface, which allows multiple file system types, both local and remote, to coexist on the
same machine.
The early commercially released versions of UNIX contained a simple file system now
known as the System V file system (s5fs) [Thom 78]. All versions of System V UNIX, as well as
Berkeley UNIX versions earlier than 4.2BSD, support this file system. 4.2BSD introduced a new,
Fast File System (FFS) [McKu 84], which provides much better performance and greater function-
ality than s5fs. FFS has since gained wide acceptance, culminating in its inclusion in SVR4. Chap-
ter 9 describes both s5fs and FFS, as well as some special-purpose file systems based on the
vnode/vfs interface.
Note: The terms FFS and ufs (UNIX file system) are often used inter-
changeably. To be specific, FFS refers to the original implementation
of the Berkeley Fast File System. The term ufs refers to the implemen-
tation of FFS within the vnode/vfs framework. In this book, we use the
two terms in this manner.
As it became easy to connect computers through a network, developers began to find ways
to access files on remote nodes. The mid-1980s saw the emergence of several competing technolo-
gies that provided transparent file sharing among interconnected computers. Chapter 10 describes
the three most important alternatives-the Network File System (NFS), Remote File Sharing (RFS),
and the Andrew File System (AFS).
In recent years, several new file systems have been developed that improve upon FFS or ad-
dress the needs of specific applications. Most use sophisticated techniques such as journaling, snap-
shots, and volume management to provide better performance, reliability, and availability. Chapter
11 describes some of these modern file systems.
From a user's perspective, UNIX organizes files in a hierarchical, tree-structured name space
(Figure 8-1). The tree consists of files and directories, with the files at the leaf nodes.1 A directory
contains name information about files and other directories that reside in it. Each file or directory
name may contain any ASCII characters except for '/' and the NULL character. The file system may
impose a limit on the length of a filename. The root directory is called "/". Filenames only need to
be unique within a directory. In Figure 8-1, both the bin and etc directories have a file called
passwd. To uniquely identify a file, it is necessary to specify its complete pathname. The pathname
is composed of all the components in the path from the root directory to the node, separated by '/'
characters. Hence the two passwd files have the same filename, but different pathnames-
/bin/passwd and /etc/passwd. The '/' character in UNIX serves both as the name of the root direc-
tory and as a pathname component separator.
UNIX supports the notion of a current working directory for each process, maintained as
part of the process state. This allows users to refer to files by their relative pathnames, which are
interpreted relative to the current directory. There are two special pathname components: the first is
".",which refers to the directory itself; the second is" .. ", which refers to the parent directory. The
root directory has no parent, and its" .. " component refers to the root directory itself. In Figure 8-1, a
user whose current directory is /usr/local may refer to the lib directory either as /usr/lib (absolute
pathname) or as .. /lib (relative pathname). A process may change its current directory by making the
chdir system call.
A directory entry for a file is called a hard link, or simply a link, to that file. Any file may
have one or more links to it, either in the same or in different directories. Thus a file is not bound to
a single directory and does not have a unique name. The name is not an attribute of the file. The file
continues to exist as long as its link count is greater than zero. The file links are equal in all ways
and are simply different names for the same file. The file may be accessed through any of its links,
and there is no way to tell which is the original link. Modern UNIX file systems also provide an-
other type of link called a symbolic link, described in Section 8.4.1.
Each different type of file system has its own internal directory format. Since application
programmers want to read the contents of directories in a portable way, the POSIX.1 standard
1 The existence of hard links means the correct abstraction is a directed acyclic graph (ignoring the " .. " entries that
refer back to the parent directory), but for most practical purposes the tree abstraction is simpler and just as adequate.
    char d_name[NAME_MAX + 1];  /* null-terminated filename */
};
The value of NAME_MAX depends on the file system type. SVR4 also provides a getdents system
call to read directory entries in file-system-independent format. The format of entries returned by
getdents is different from that of struct dirent. Hence users should use the more portable POSIX
functions wherever possible.
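As a brief sketch of the portable approach, a program might list a directory with the POSIX opendir, readdir, and closedir functions; the directory passed in is, of course, whatever the caller chooses:

#include <dirent.h>
#include <stdio.h>

/* List the entries of a directory using the portable POSIX interface. */
void list_directory(const char *path)
{
    DIR *dirp = opendir(path);
    struct dirent *dp;

    if (dirp == NULL)
        return;
    while ((dp = readdir(dirp)) != NULL)
        printf("%s\n", dp->d_name);   /* d_name is the field POSIX.1 guarantees */
    closedir(dirp);
}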
• Timestamps - There are three timestamps for each file: the time the file was last ac-
cessed, the time it was last modified, and the time its attributes (excluding the other
timestamps) were last changed.
• Permissions and mode flags, described below.
There are three types of permissions associated with each file-read, write, and execute. The
users trying to access a file are also divided into three categories-the owner of the file,2 people in
the same group as the owner, and everybody else (that is, owner, group, and others). This means
that all permissions associated with a file can be specified by nine bits. Directory permissions are
handled differently. Write access to a directory allows you to create and delete files in that directory.
Execute access allows you to access files in that directory. A user may never directly write to a di-
rectory, even if the permissions allow it. The contents of a directory are modified only by file crea-
tions and deletions.
The permissions mechanism is simple but primitive. Today, most UNIX vendors offer en-
hanced security features, either in their default implementations or in special secure versions.3
These normally involve some form of an access-control list, which allows a more detailed specifi-
cation of who may access a file and in what manner [Fern 88].
There are three mode flags-suid, sgid, and sticky. The suid and sgid flags apply to execu-
table files. When a user executes the file, if the suid flag is set, the kernel sets the user's effective
UID to that of the owner of the file. The sgid flag affects the effective GID in the same way.
(Section 2.3.3 defines UIDs and GIDs.) Since the sgid flag serves no purpose if the file is not execu-
table, it is overloaded for another purpose. If the file does not have group-execute permission and
the sgid flag is set, then mandatory file/record locking [UNIX 92] has been enabled on the file.
The sticky flag also is used for executable files and requests the kernel to retain the program
image in the swap area (see Section 13.2.4) after execution terminates. System administrators often
set the sticky flag for frequently executed files in order to improve performance. Most modern
UNIX systems leave all images on swap as long as possible (using some form of a least recently
used replacement algorithm), and hence do not use the sticky flag.
The sgid and sticky flags are used differently for directories. If the sticky flag is set and the
directory is writable, then a process may remove or rename a file in that directory only if its effec-
tive UID is that of the owner of the file or directory, or if the process has write permissions for the
file. If the sticky flag is clear, then any process that has write access to the directory may remove or
rename files in it.
When a file is created, it inherits the effective UID of the creating process. Its owner GID
may take one of two values. In SVR3, the file inherits the effective GID of the creator. In 4.3BSD, it
inherits the GID of the directory in which it is created. SVR4 uses the sgid flag of the parent direc-
tory to select the behavior. If the sgid flag is set for a directory, then new files created in the direc-
2 Also known as the user, for the sake of the chmod command, which uses the abbreviations u, g, and o to refer to user
(owner), group, and others.
3 Most secure UNIX versions base their requirements on a set of criteria published by the Department of Defense for
evaluating trusted computer systems. This document, informally known as the Orange Book [DoD 85], defines vari-
ous levels of security that a system may provide.
tory inherit the GID from the parent directory. If the sgid flag is clear, then new files inherit the GID
of the creator.
UNIX provides a set of system calls to manipulate file attributes. These calls take the path-
name of the file as an argument. The link and unlink calls create and delete hard links respectively.
The kernel deletes the file only after all its hard links have been removed, and no one is actively
using the file. The utimes system call changes the access and modify timestamps of the file. The
chown call changes the owner UID and GID. The chmod system call changes the permissions and
mode flags of the file.
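A file is made available for I/O by the open system call. A sketch of its usual form follows; the exact prototype varies between implementations:

    fd = open(path, oflag, mode);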
where path is the absolute or relative pathname of the file and mode specifies the permissions to as-
sociate with the file if it must be created. The flags passed in oflag specify if the file is to be
opened for read, write, read-write, or append, if it must be created, and so forth. The creat system
call also creates a file. It is functionally equivalent to an open call with the O_WRONLY, O_CREAT,
and O_TRUNC flags (see Section 8.10.4).
Each process has a default file creation mask, which is a bitmask of permissions that should
not be granted to newly created files. When the user specifies a mode to open or creat, the kernel
clears the bits specified in the default mask. The umask system call changes the value of the default
mask. The user can override the mask by calling chmod after the file is created.
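As a short illustration (the file name is hypothetical), a process whose creation mask is 022 and that creates a file with mode 0666 ends up with permissions 0644:

umask(022);                                     /* default creation mask */
fd = open("scratch.dat", O_WRONLY | O_CREAT, 0666);
/* resulting permissions: 0666 & ~022 == 0644 (rw-r--r--) */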
When the user calls open, the kernel creates an open file object to represent the open in-
stance of the file. It also allocates a file descriptor, which acts as a handle, or reference, to the open
file object. The open system call returns the file descriptor (fd) to the caller. A user may open the
same file several times; also, multiple users may open the same file. In each case, the kernel gener-
ates a new open file object and a new file descriptor.
The file descriptor is a per-process object. The same descriptor number in two different
processes may, and usually does, refer to different files. The process passes the file descriptor to
I/O-related system calls such as read and write. The kernel uses the descriptor to quickly locate the
open file object and other data structures associated with the open file. This way, the kernel can per-
form tasks such as pathname parsing and access control once during the open, rather than on each
I/O operation. Open files may be closed by the close system call and also are closed automatically
when the process terminates.
Each file descriptor represents an independent session with the file. The associated open file
object holds the context for that session. This includes the mode in which the file was opened and
the offset pointer at which the next read or write must start. In UNIX, files are accessed sequentially
by default. When the user opens the file, the kernel initializes the offset pointer to zero. Subse-
quently, each read or write advances the pointer by the amount of data transferred.
Keeping the offset pointer in the open file object allows the kernel to insulate different ses-
sions to the file from one another (Figure 8-2). If two processes open the same file,4 a read or write
by one process advances only its own offset pointer and does not affect that of the other. This allows
multiple processes to share the file transparently. This functionality must be used with care. In many
cases, multiple processes accessing the same file should synchronize access to it using the file
locking facilities described in Section 8.2.6.
A process may duplicate a descriptor through the dup or dup2 system calls. These calls cre-
ate a new descriptor that references the same open file object and hence shares the same session
(Figure 8-3). Similarly, the fork system call duplicates all the descriptors in the parent and passes
them on to the child. Upon return from fork, the parent and child share the set of open files. This
form of sharing is fundamentally different from having multiple open instances. Since the two de-
scriptors share the same session to the file, they both see the same view of the file and use the same
offset pointer. If an operation on one descriptor changes the offset pointer, the change will be visible
to the other as well.
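A small fragment (the file name is hypothetical) illustrates the shared offset:

fd1 = open("logfile", O_WRONLY);
fd2 = dup(fd1);              /* fd2 references the same open file object */
write(fd1, "abc", 3);        /* shared offset advances to 3 */
write(fd2, "def", 3);        /* continues at offset 3, not at 0 */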
A process may pass a file descriptor to another unrelated process. This has the effect of
passing a reference to the open file object. The kernel copies the descriptor into the first free slot in
the receiver's descriptor table. The two descriptors share the same open file object and hence the
same offset pointer. Usually, the sending process closes its descriptor after sending it. This does not
close the open file, even if the other process has not yet received the descriptor, since the kernel
holds on to the second descriptor while it is in transit.
The interface for descriptor passing depends on the UNIX variant. SVR4 passes the descrip-
tor over a STREAMS pipe (see Section 17.9), using the ioctl call with the I_SENDFD command. The
other process receives it through the I_RECVFD ioctl call. 4.3BSD uses sendmsg and recvmsg calls to
a socket connection (see Section 17.10.3) between the two processes. Descriptor passing is useful
for implementing some types of network applications. One process, the connection server, can set
up a network connection on behalf of a client process, then pass the descriptor representing the con-
[Figures 8-2 and 8-3: Descriptors fd1 and fd2 obtained by separate opens each reference their own open file object and offset pointer into the file; descriptors obtained by dup share a single open file object and offset.]
nection back to the client. Section 11.10 describes the portal file system in 4.4BSD, which takes this
concept one step further.
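Data is transferred from an open file by the read system call. A sketch of its usual form follows; the exact prototype varies between implementations:

    nread = read(fd, buf, count);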
where fd is the file descriptor, buf is a pointer to a buffer in the user address space into which the
data must be read, and count is the number of bytes to read.
The kernel reads data from the file associated with fd, starting at the offset stored in the
open file object. It may read fewer than count bytes if it reaches the end of the file or, in case of FI-
FOs or device files, if there is not enough data available. For instance, a read issued to a terminal in
canonical mode returns when the user types a carriage return, even if the line contains fewer bytes
than requested. Under no circumstances will the kernel transfer more than count bytes. It is the
user's responsibility to ensure that buf is large enough to hold count bytes of data. The read call
returns the actual number of bytes transferred (nread). read also advances the offset pointer by
nread bytes so that the next read or write will begin where this one finished.
While the kernel allows multiple processes to open and share the file at the same time, it se-
rializes I/O operations to it. For instance, if two processes issue writes to the file at (almost) the
same time, the kernel will complete one write before beginning the other. This allows each opera-
tion to have a consistent view of the file.
5 Most programmers do not use the read and write system calls directly. Instead, they use the standard library func-
tions fread and fwrite, which provide additional functionality such as data buffering.
A file may be opened in append mode by passing the 0_APPEND flag to the open system call.
This causes the kernel to set the offset pointer to the end of the file prior to each write system call
through that descriptor. Again, if one user opens a file in append mode, it has no effect on opera-
tions on the file through other descriptors.
Multithreaded systems must handle additional complications resulting from sharing descrip-
tors between threads. For instance, one thread may do an lseek just before another thread issues a
read, causing the second thread to read from a different offset than intended. Some systems, such as
Solaris, provide pread/pwrite system calls to perform an atomic seek and read/write. Section 3.6.6
discusses this issue in more detail.
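Scatter-gather I/O is performed by the readv and writev system calls. A sketch of the writev form follows; the exact prototype varies between implementations:

    nbytes = writev(fd, iov, iovcnt);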
where fd is the file descriptor, iov is a pointer to an array of <base, length> pairs (struct iovec)
that describe the set of source buffers, and iovcnt is the number of elements in the array. As with
write, the return value nbytes specifies the number of bytes actually transferred. The offset pointer
determines the location of the start of the data in the file, and the kernel advances it by nbytes bytes
upon completion of the call.
Figure 8-4 illustrates the effect of a scatter-gather write. The kernel creates a struct uio to
manage the operation and initializes it with information from the system call arguments and the
open file object. It then passes a pointer to the uio structure to lower-level functions that perform
the I/O. The operation atomically transfers data from all the specified user buffers to the file.
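A sketch of a gather write, assuming fd is an open descriptor and struct iovec is declared in <sys/uio.h>, shows how the array of buffers is set up:

struct iovec iov[2];
char hdr[]  = "header: ";
char body[] = "payload\n";

iov[0].iov_base = hdr;   iov[0].iov_len = strlen(hdr);
iov[1].iov_base = body;  iov[1].iov_len = strlen(body);
nbytes = writev(fd, iov, 2);    /* both buffers written in one atomic operation */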
[Figure 8-4: A scatter-gather write - the uio structure describes the set of user buffers whose contents are transferred to the file on disk.]
This behavior is unsuitable for many applications that need to protect a file across multiple
accesses. Hence UNIX provides facilities to lock files. File locking may be advisory or mandatory.
Advisory locks are not enforced by the kernel and only protect the file from cooperating processes
that explicitly check for the lock. Mandatory locks are enforced by the kernel, which will reject op-
erations conflicting with the lock. Locking requests may be blocking or nonblocking; the latter re-
turn an error code of EWOULDBLOCK if the lock cannot be granted.
4BSD provides the flock system call, which only supports advisory locking of open files, but
allows both shared and exclusive locks. File locking in System V UNIX varies with the release.
SVR2 supports only advisory locking, for both files and records (byte ranges within files). SVR3
adds mandatory locking, but requires that the file first be enabled for mandatory locking through a
chmod call as described in Section 8.2.2. This feature is an artifact of XENIX binary compatibility.
SVR4 adds BSD compatibility and supports single-writer, multiple-reader locks. The fcntl call pro-
vides the locking functions, but most applications use a simpler programming interface offered by
the C library function lockf.
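A sketch of an advisory write lock on an entire file through fcntl follows; many programs use lockf instead, which wraps this interface:

#include <fcntl.h>
#include <unistd.h>

/* Request an exclusive (write) lock on the whole file referenced by fd. */
int lock_file(int fd)
{
    struct flock fl;

    fl.l_type   = F_WRLCK;    /* exclusive lock */
    fl.l_whence = SEEK_SET;
    fl.l_start  = 0;          /* from the beginning of the file ...         */
    fl.l_len    = 0;          /* ... for its entire length (0 = to the end) */
    return fcntl(fd, F_SETLK, &fl);   /* nonblocking; F_SETLKW would block  */
}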
lated to an access to the root directory of the mounted file system. The mounted file system remains
visible until it is unmounted.
Figure 8-5 shows a file hierarchy composed of two file systems. In this example, fs0 is in-
stalled as the root file system of the machine, and the file system fs1 is mounted on the /usr direc-
tory of fs0. /usr is called the "mounted-on" directory or the mount point, and any attempt to access
/usr results in accessing the root directory of the file system mounted on it.
If the /usr directory of fs0 contains any files, they are hidden, or covered, when fs1 is
mounted on it, and may no longer be accessed by users. When fs1 is unmounted, these files become
visible and are accessible once again. Pathname parsing routines in the kernel must understand
mount points and behave correctly when traversing mount points in either direction. The original
s5fs and FFS implementations used a mount table to track mounted file systems. Modern UNIX
systems use some form of a vfs list (virtual file system list), as is described in Section 8.9.
The notion of mountable subsystems serves to hide the details of the storage organization
from the user. The file name space is homogeneous, and the user does not need to specify the disk
drive as part of the file name (as required in systems such as MS-DOS or VMS). File systems may
be taken off-line individually to perform backups, compaction, or repair. The system administrator
may vary the protections on each file system, perhaps making some of them read-only.
Mountable file systems impose some restrictions on the file hierarchy. A file cannot span file
systems and hence may grow only as much as the free space on the file system to which it belongs.
Rename and hard link operations cannot span file systems. Each file system must reside on a single
logical disk and is limited by the size of that disk.
[Figure 8-5: A file hierarchy composed of two file systems - fs0, the root file system (containing sys, dev, etc, usr, and bin), and fs1 (containing local, adm, users, and bin), which is mounted on /usr.]
A file system is fully contained in a single logical disk, and one logical disk may contain only one file
system. Some logical disks may not contain file systems, but are used by the memory subsystem for
swapping.
Logical disks allow physical storage to be mapped in a variety of useful ways. In the sim-
plest case, each logical disk is mapped to a single, entire, physical disk. The most common feature is
to divide a disk into a number of physically contiguous partitions, each a logical device. Older
UNIX systems provided only this feature. As a result, the word partition is often used to describe
the physical storage of a file system.
Modern UNIX systems support many other useful storage configurations. Several disks may
be combined into a single logical disk or volume, thus supporting files larger than the size of a sin-
gle disk. Disk mirroring allows a redundant copy of all data, increasing the reliability of the file
system. Stripe sets provide increased throughput from a file system by striping the data across a set
of disks. Several types of RAID (Redundant Arrays of Inexpensive Disks) configurations provide a
mix of reliability and performance enhancements to suit the requirements of different types of in-
stallations [Patt 88].
were not executed atomically [Bach 86]. SVR3 added the mkdir and rmdir system calls, but contin-
ued to allow linking to directories to maintain backward compatibility with older applications.
Hard links also create control problems. Suppose user X owns a file named /usr/X/file1.
Another user Y may create a hard link to this file and call it /usr/Y/link1 (Figure 8-6). To do so, Y
only needs execute permission for the directories in the path and write permission to the /usr/Y di-
rectory. Subsequently, user X may unlink file1 and believe that the file has been deleted (typically,
users do not often check the link counts on their own files). The file, however, continues to exist
through the other link.
Of course, /usr/Y/link1 is still owned by user X, even though the link was created by user Y.
If X had write-protected the file, then Y will not be able to modify it. Nevertheless, X may not wish
to allow the file to persist. In systems that impose disk-usage quotas, X will continue to be charged
for it. Moreover, there is no way X can discover the location of the link, particularly if Y has read-
protected the /usr/Y directory (or if X no longer knows the inode number of the file).
4.2BSD introduced symbolic links to address many of the limitations of hard links. They
were soon adopted by most vendors and incorporated into s5fs in SVR4. The symlink system call
creates a symbolic link. It is a special file that points to another file (the linked-to file). The file type
attribute identifies it as a symbolic link. The data portion of the file contains the pathname of the
linked-to file. Many systems allow small pathnames to be stored in the inode of the symbolic link.
This optimization was first introduced in ULTRIX.
The pathname contained in the symbolic link may be absolute or relative. The pathname tra-
versal routines recognize symbolic links and translate them to obtain the name of the linked-to file.
If the name is relative, it is interpreted relative to the directory containing the link. While symbolic
link handling is transparent to most programs, some utilities need to detect and handle symbolic
links. This is enabled by the lstat system call, which suppresses translation of the final symbolic
link in a pathname. Hence if file mylink is a symbolic link to the file myfile, then lstat(mylink,
...) returns the attributes of mylink, while stat(mylink, ...) returns the attributes of myfile.
Having detected a symbolic link through lstat, the user may call readlink to retrieve the contents of
the link.
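A sketch of how a utility might detect and read a symbolic link, using the file name mylink from the example above:

#include <sys/stat.h>
#include <unistd.h>

char target[1024];
struct stat sb;
int n;

if (lstat("mylink", &sb) == 0 && S_ISLNK(sb.st_mode)) {
    n = readlink("mylink", target, sizeof(target) - 1);
    if (n >= 0) {
        target[n] = '\0';     /* readlink does not null-terminate */
        /* target now holds the pathname stored in the link */
    }
}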
6 Only those descendants that are created after the pipe can share access to it.
While some vendors preferred FFS for its performance and features, others chose to retain s5fs for backward compatibility.
Either way, it was not a happy solution.
Moreover, while both s5fs and FFS were adequate for general time-sharing applications,
many applications found neither suitable to their needs. Database applications, for example, need
better support for transaction processing. Applications that use large, read-mostly or read-only files
would prefer extent-based allocation, which improves the performance of sequential reads. The
early UNIX systems had no way for vendors to add a custom file system without extensively over-
hauling the kernel. This was too restrictive for UNIX to become the operating system of choice for a
wider variety of environments.
There also was a growing need for supporting non-UNIX file systems. This would permit a
UNIX system running on a personal computer to access files on DOS partitions on the same ma-
chine or, for that matter, floppy disks written by DOS.
Most important, the proliferation of computer networks led to an increased demand for
sharing files between computers. The mid-1980s saw the emergence of a number of distributed file
systems-such as AT&T's Remote File Sharing (RFS) and Sun Microsystems' Network File System
(NFS)-that provided transparent access to files on remote nodes.
These developments necessitated fundamental changes in the UNIX file system framework
to support multiple file system types. Here again there were several alternative approaches, such as
AT&T's file system switch [Rifk 86], Sun Microsystems' vnode/vfs architecture [Klei 86], and
Digital Equipment Corporation's gnode architecture [Rodr 86]. For a while, these rival technologies
battled for acceptance and dominance. Eventually, AT&T integrated Sun's vnode/vfs and NFS
technologies into SVR4, enabling them to become de facto industry standards.
The vnode/vfs interface has evolved substantially from its original implementation. While
all major variants have embraced its fundamental approach, each provides a different interface and
implementation. This chapter concentrates on the SVR4 version of this interface. Section 8.11
summarizes other important implementations.
8.6.1 Objectives
The vnode/vfs architecture has several important objectives:
• The system should support several file system types simultaneously. These include UNIX
(s5fs or ufs) and non-UNIX (DOS, A/UX, etc.) file systems.
• Different disk partitions may contain different types of file systems. Once they are
mounted on each other, however, they must present the traditional picture of a single ho-
mogenous file system. The user is presented with a consistent view of the entire file tree
and need not be aware of the differences in the on-disk representations of the subtrees.
• There should be complete support for sharing files over a network. A file system on a re-
mote machine should be accessible just like a local file system.
• Vendors should be able to create their own file system types and add them to the kernel in
a modular manner.
The main goal was to provide a framework in the kernel for file access and manipulation,
and a well-defined interface between the kernel and the modules that implemented specific file sys-
tems.
struct cdevsw {
    int (*d_open)();
    int (*d_close)();
    int (*d_read)();
    int (*d_write)();
    ...
} cdevsw[];
Hence cdevsw[] is a global array of struct cdevsw's and is called the character device
switch. The fields of the structure define the interface to an abstract character device. Each different
type of device provides its own set of functions that implements this interface. For example, the line
printer may provide the functions lpopen(), lpclose(), and so forth. Each device type has a dif-
ferent major device number associated with it. This number forms an index into the global
cdevsw[] array, giving each device its own entry in the switch. The fields of the entry are initialized
to point to the functions provided by that device.
Suppose a user issues a read system call to a character device file. In a traditional UNIX
system, the kernel will:
1. Use the file descriptor to get to the open file object.
2. Check the entry to see if the file is open for read.
3. Get the pointer to the in-core inode from this entry. In-core inodes (also called in-memory
inodes) are file system data structures that keep attributes of active files in memory. They
are described in detail in Section 9.3.1.
4. Lock the inode so as to serialize access to the file.
5. Check the inode mode field and find that the file is a character device file.
6. Use the major device number (stored in the inode) to index into a table of character de-
vices and obtain the cdevsw entry for this device. This entry is an array of pointers to
functions that implement specific operations for this device.
7. From the cdevsw, obtain the pointer to the d_read routine for this device.
8. Invoke the d_read operation to perform the device-specific processing of the read request.
The code looks like the following:
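A simplified sketch of the dispatch in steps 6 through 8; the variable names ip and dev, and the trailing arguments, are assumptions:

dev = ip->i_rdev;                            /* device number stored in the in-core inode */
(*cdevsw[major(dev)].d_read)(dev, ...);      /* invoke the device-specific read routine */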
The vnode/vfs interface was designed using object-oriented programming concepts. These
concepts have since been applied to other areas of the UNIX kernel, such as memory man-
agement, message-based communications, and process scheduling. It is useful to briefly
review the fundamentals of object-oriented programming as they apply to UNIX kernel
development. Although such techniques are naturally suited to object-oriented languages
such as C++ [Elli 90], UNIX developers have chosen to implement them in C to be con-
sistent with the rest of the kernel.
The object-oriented approach is based on the notion of classes and objects. A class is a
complex data type, made up of data member fields and a set of member functions. An ob-
ject is an instance of a class and has storage associated with it. The member functions of a
class operate on individual objects of that class. Each member (data field or function) of a
class may be either public or private. Only the public members are visible externally to
users of the class. Private data and functions may only be accessed internally by other
functions of that class.
From any given class, we may generate one or more derived classes, called subclasses
(see Figure 8-7). A subclass may itself be a base for further derived classes, thus forming
a class hierarchy. A subclass inherits all the attributes (data and functions) of the base
class. It may also add its own data fields and extra functions. Moreover, it may override
some of the functions of the base class and provide its own implementation of these.
Because a subclass contains all the attributes of the base class, an object of the subclass
type is also an object of the base class type. For instance, the class directory may be a de-
rived class of the base class file. This means that every directory is also a file. Of course,
the reverse is not true-not every file is a directory. Likewise, a pointer to a directory ob-
ject is also a pointer to a file object. The attributes added by the derived class are not visi-
ble to the base class. Therefore a pointer to a base object may not be used to access the
data and functions specific to a derived class.
Frequently, we would like to use a base class simply to represent an abstraction and de-
fine an interface, with derived classes providing specific implementations of the member
functions. Thus the file class may define a function called create(), but when a user calls
this function for an arbitrary file, we would like to invoke a different routine depending on
whether the file is actually a regular file, directory, symbolic link, device file, and so on.
Indeed, we may have no generic implementation of create() that creates an arbitrary file.
Such a function is called a pure virtual function.
Object-oriented languages provide such facilities. In C++, for instance, we can define an
abstract base class as one that contains at least one pure virtual function. Since the base
class has no implementation for this function, it cannot be instantiated. It may only be
used for deriving subclasses, which provide specific implementations of the virtual func-
tions. All objects are instances of one subclass or another, but the user may manipulate
them using a pointer to a base class, without knowing which subclass they belong to.
When a virtual function is invoked for such an object, the implementation automatically
determines which specific function to call, depending on the actual subtype of the object.
As mentioned earlier, languages such as C++ and SmallTalk have built-in constructs to
describe the notions like classes and virtual functions. In C, these concepts are supported
with some smoke and mirrors. In the next section, we see how the vnode/vfs layer is im-
plemented in an object-oriented manner.
The v_data field of the vnode is a pointer (of type caddr_t) to a private data structure that holds the file-system-specific data of the
vnode. For s5fs and ufs files, this structure is simply the traditional (s5fs and ufs, respectively)
inode structure. NFS uses an rnode structure, tmpfs (see Section 9.10.2) uses a tmpnode, and so
forth. Since this structure is accessed indirectly through v_data, it is opaque to the base vnode class,
and its fields are only visible to functions internal to the specific file system.
The v_op field points to a struct vnodeops, which consists of a set of pointers to functions
that implement the virtual interface of the vnode. Both the v_data and v_op fields are filled in when
the vnode is initialized, typically during an open or create system call. When the file-system-
independent code calls a virtual function for an arbitrary vnode, the kernel dereferences the v_op
pointer and calls the corresponding function of the appropriate file system implementation. For ex-
ample, the VOP_CLOSE operation allows the caller to close the file associated with the vnode. This is
accessed by a macro such as
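(One plausible form of such a macro; the exact SVR4 definition spells out the full argument list.)

#define VOP_CLOSE(vp, ...)   (*((vp)->v_op->vop_close))(vp, ...)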
where the ellipsis represent the other arguments to the close routine. Once vnodes have been prop-
erly initialized, this macro ensures that invoking the VOP_CLOSE operation would call the
ufs _close() routine for a ufs file, the nfs _close() routine for an NFS file, and so forth.
Similarly, the base class vfs has two fields-vfs_data and vfs_op-that allow proper link-
age to data and functions that implement specific file systems. Figure 8-9 shows the components of
the vfs abstraction.
In C, a base class is implemented simply as a struct, plus a set of global kernel functions
(and macros) that define the public nonvirtual functions. The base class contains a pointer to another
structure that consists of a set of function pointers, one for each virtual function. The v_op and
v_data pointers (vfs_op and vfs_data for the vfs class) allow the linkage to the subclass and
hence provide run time access to the file-system-dependent functions and data.
[Figure 8-9: Components of the vfs abstraction - the vfs object points to a struct vfsops of virtual functions (vfs_mount, vfs_unmount, vfs_root, vfs_statvfs, vfs_sync, ...) and to file-system-dependent private data; the specific file system supplies the implementation of the vfsops functions.]
8.7.1 Objectives
A set of implementation goals evolved to allow the development of a flexible interface that can be
used efficiently by a large variety of diverse file systems:
• Each operation must be carried out on behalf of the current process, which may be put to
sleep if a function must block on a resource or event.
• Certain operations may need to serialize access to the file. These may lock data structures
in the file-system-dependent layer and must unlock them before the operation completes.
• The interface must be stateless. There must be no implicit use of global variables such as u
area fields to pass state information between operations.
• The interface must be reentrant. This requirement disallows use of global variables such as
u_error and u_rval1 to store error codes or return values. In fact, all operations return
error codes as return values.
• File system implementations should be allowed, but not forced, to use global resources
such as the buffer cache.
• The interface must be usable by the server side of a remote file system to satisfy client re-
quests.
• The use of fixed-size static tables must be avoided. Dynamic storage allocation should be
used wherever possible.
[Figure 8-10. File-system-independent data structures: the file descriptor refers to a struct file holding the open mode flags, the offset pointer, and a pointer to the struct vnode, whose v_data and v_op fields lead to the file-system-dependent objects.]
The open file object holds the context that manages a session with that file. If multiple users
have the file open (or the same user has it open multiple times), each has its own open file object. Its
fields include:
• Offset in the file at which the next read or write should start.
• Reference count of the number of file descriptors that point to it. This is normally 1, but
could be greater if descriptors are cloned through dup or fork.
• Pointer to the vnode of the file.
• Mode with which the file was opened. The kernel checks this mode on each I/O operation.
Hence if a user has opened a file for read only, he cannot write to the file using that de-
scriptor even if he has the necessary privileges.
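Gathered into a structure, these fields might look roughly as follows; this is a sketch only, and the actual struct file carries additional fields and somewhat different names across variants:

struct file {
    ushort        f_flag;     /* open mode flags */
    int           f_count;    /* reference count (descriptors pointing here) */
    struct vnode  *f_vnode;   /* vnode of the open file */
    off_t         f_offset;   /* offset for the next read or write */
    struct cred   *f_cred;    /* credentials of the opener */
};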
Traditional UNIX systems use a static, fixed-size file descriptor table in the u area. The de-
scriptor returned to the user is an index into this table. The size of the table (typically 64 elements)
limits the number of files the user could keep open at a time. In modern UNIX systems, the descrip-
tor table is not limited in size, but may grow arbitrarily large.7
Some implementations, such as SVR4 and SunOS, allocate descriptors in chunks of
(usually) 32 entries and keep the chunks in a linked list, with the first chunk in the u area of the
process. This complicates the task of dereferencing the descriptor. Instead of simply using the de-
scriptor as an index into the table, the kernel must first locate the appropriate chunk, then index into
that chunk. This scheme removes the restrictions on the number of files a process may have open, at
the cost of some increased code complexity and performance.
Some newer SVR4-based systems allocate the descriptor table dynamically and extend it
when necessary by calling kmem_realloc(), which either extends the table in-place or copies it into
a new location where it has room to grow. This allows dynamic growth of the descriptor table and
quick translation of descriptors, at the cost of copying the table over when allocating it.
struct vnode {
    u_short         v_flag;             /* V_ROOT, etc. */
    u_short         v_count;            /* reference count */
    struct vfs      *v_vfsmountedhere;  /* for mount points */
    struct vnodeops *v_op;              /* vnode operations vector */
    struct vfs      *v_vfsp;            /* file system to which it belongs */
    struct stdata   *v_stream;          /* pointer to associated stream, if any */
    struct page     *v_pages;           /* resident page list */
    enum vtype      v_type;             /* file type */
    dev_t           v_rdev;             /* device ID for device files */
    caddr_t         v_data;             /* pointer to private data structure */
};
• The pathname traversal routine acquires a reference to each intermediate directory it en-
counters. It holds the reference while searching the directory and releases it after acquiring
a reference to the next pathname component.
Reference counts ensure persistence of the vnode and also of the underlying file. When a
process deletes a file that another process (or perhaps the same process) still has open, the file is not
physically deleted. The directory entry for that file is removed so no one else may open it. The file
itself continues to exist since the vnode has a nonzero reference count. The processes that currently
have the file open may continue to access it until they close the file. This is equivalent to marking
the file for deletion. When the last reference to the file is released, the file-system-independent code
invokes the VOP_INACTIVE operation to complete the deletion of the file. For a ufs or s5fs file, for
example, the inode and the data blocks are freed at this point.
This feature is very useful in creating temporary files. An application like a compiler uses
several temporary files to store results of intermediate phases. These files should be cleaned up if
the application were to terminate abnormally. The application ensures this by opening the file and
then immediately unlinking it. The link count becomes zero and the kernel removes the directory
entry. This prevents other users from seeing or accessing the file. Since the file is open, the in-core
reference count is 1. Hence the file continues to exist, and the application may read and write to it.
When the application closes the file, either explicitly or implicitly when the process terminates, the
reference count becomes zero. The kernel completes the file deletion and frees its data blocks and
inode. Many UNIX systems have a standard library function called tmpfile, which creates a tempo-
rary file.
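A sketch of the open-then-unlink idiom follows; the file name is hypothetical, and a real application would pick a unique name or simply call tmpfile:

#include <fcntl.h>
#include <unistd.h>

int fd = open("/tmp/scratch01", O_RDWR | O_CREAT | O_EXCL, 0600);
unlink("/tmp/scratch01");     /* directory entry gone; link count drops to zero */
/* ... read and write intermediate results through fd ... */
close(fd);                    /* last reference released; kernel frees inode and data blocks */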
struct vfs {
    struct vfs     *vfs_next;           /* next VFS in list */
    struct vfsops  *vfs_op;             /* operations vector */
    struct vnode   *vfs_vnodecovered;   /* vnode mounted on */
    int            vfs_fstype;          /* file system type index */
    caddr_t        vfs_data;            /* private data */
    dev_t          vfs_dev;             /* device ID */
};
Figure 8-11 shows the relationships between the vnode and vfs objects in a system contain-
ing two file systems. The second file system is mounted on the /usr directory of the root file system.
The global variable rootvfs points to the head of a linked list of all vfs objects. The vfs for the root
file system is at the head of the list. The vfs_vnodecovered field points to the vnode on which the
file system is mounted.
[Figure 8-11: Vnode and vfs objects for a system with a mounted file system - the vnodes of "/", "/usr", and the root of the mounted file system each point to their vfs through v_vfsp; the root vnodes carry the VROOT flag, and the v_vfsmountedhere field of the "/usr" vnode points to the vfs of the file system mounted on it.]
The v_vfsp field of each vnode points to the vfs to which it belongs. The root vnodes of
each file system have the VROOT flag set. If a vnode is a mount point, its v_vfsmountedhere field
points to the vfs object of the file system mounted on it. Note that the root file system is not
mounted anywhere and does not cover any vnode.
[Figure 8-12: The base (file-system-independent) vnode and the file-system-dependent (subclass) data structure - the relationship required by the interface (v_data points to the subclass structure) and the standard implementation (the vnode is embedded in the subclass structure).]
It would be perfectly acceptable to have separate data structures for the vnode and the file-system-dependent
portion, as long as the v_data field is initialized appropriately. Figure 8-12 illustrates both relation-
ships.
struct vnodeops {
    int  (*vop_open)();
    int  (*vop_close)();
    int  (*vop_read)();
    int  (*vop_write)();
    int  (*vop_ioctl)();
    int  (*vop_getattr)();
    int  (*vop_setattr)();
    int  (*vop_access)();
    int  (*vop_lookup)();
    int  (*vop_create)();
    int  (*vop_remove)();
    int  (*vop_link)();
    int  (*vop_rename)();
    int  (*vop_mkdir)();
    int  (*vop_rmdir)();
    int  (*vop_readdir)();
    int  (*vop_symlink)();
    int  (*vop_readlink)();
    void (*vop_inactive)();
    void (*vop_rwlock)();
    void (*vop_rwunlock)();
    int  (*vop_realvp)();
    int  (*vop_getpage)();
    int  (*vop_putpage)();
    int  (*vop_map)();
    int  (*vop_poll)();
};
Each file system implements this interface in its own way, and provides a set of functions to
do so. For instance, ufs implements the VOP _READ operation by reading the file from the local disk,
while NFS sends a request to the remote file server to get the data. Hence each file system provides
an instance of the struct vnodeops-ufs, for example, defines the object:
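(A sketch of such an initializer; the entry-point names other than ufs_open and ufs_close are assumptions, and the real initializer supplies one function for every operation in struct vnodeops.)

struct vnodeops ufs_vnodeops = {
    ufs_open,
    ufs_close,
    ufs_read,
    ufs_write,
    ...
};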
The v_op field of the vnode points to the vnodeops structure for the associated file system
type. As Figure 8-13 shows, all files of the same file system type share a single instance of this
structure and access the same set of functions.
struct vfsops {
    int (*vfs_mount)();
    int (*vfs_unmount)();
    int (*vfs_root)();
    int (*vfs_statvfs)();
    int (*vfs_sync)();
};
[Figure 8-13: Vnodes and their operations vectors - every vnode of a given file system type points through v_op to a single shared struct vnodeops, whose entries (ufs_open, ufs_close, ... for ufs; nfs_open, nfs_close, ... for NFS) are the file-system-specific functions.]
Each file system type provides its own implementation of these operations. Hence there is
one instance of struct vfsops for each file system type-ufs_vfsops for ufs, nfs_vfsops for
NFS, and so forth. Figure 8-14 shows the vfs layer data structures for a system containing two ufs
and one NFS file system.
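8.9 Mounting a File System
A file system is attached to the name space by the mount system call. A sketch of its usual SVR4 form follows; the mflag (mount flags) argument is an assumption, and the remaining arguments are described below:

    error = mount(spec, dir, mflag, type, dataptr, datalen);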
where spec is the name of the device file representing the file system, dir is the pathname of the
mount point directory, type is a string that specifies which kind of file system it is, dataptr is a
pointer to additional file-system-dependent arguments, and datalen is the total size of these extra
parameters. In this section, we describe how the kernel implements the mount system call.
SVR4 uses a mechanism called the virtual file system switch, which is a global table contain-
ing one entry for each file system type. Its elements are described by
struct vfssw {
    char           *vsw_name;     /* file system type name */
    int            (*vsw_init)(); /* address of initialization routine */
    struct vfsops  *vsw_vfsops;   /* vfs operations vector for this fs */
} vfssw[];
The kernel then stores a pointer to the vfs structure in the v_vfsmountedhere field of the
covered directory's vnode. Finally, it invokes the VFS_MOUNT operation of the vfs to perform the
file-system-dependent processing of the mount call.
9 A process can call chroot to change its notion of the root directory to something other than the system root. This af-
fects how the kernel interprets absolute pathnames for this process. Usually, certain login shells call chroot on behalf
of the process. System administrators may use this facility to restrict some users to a part of the global file tree. To
allow for this case, lookuppn() first examines the u.u_rdir field in the u area and, if that is NULL, checks the
rootdir variable.
2. If the component is "..", and the current directory is the system root, move on to the next
component. The system root directory acts as its own parent.
3. If the component is "..", and the current directory is the root of a mounted file system, ac-
cess the mount point directory. All root directories have the VROOT flag set. The v_vfsp
field points to the vfs structure for that file system, which contains a pointer to the mount
point in the field vfs _ vnodecovered.
4. Invoke the VOP_LOOKUP operation on this vnode. This results in a call to the lookup func-
tion of this specific file system (s5lookup(), ufs_lookup(), etc.). This function
searches the directory for the component and, if found, returns a pointer to the vnode of
that file (allocating it if not already in the kernel). It also acquires a hold on that vnode.
5. If the component was not found, check to see if this is the last component. If so, return
success (the caller may have intended to create the file) and also pass back a pointer to the
parent directory without releasing the hold on it. Otherwise, return with an ENOENT error.
6. If the new component is a mount point (v_vfsmountedhere != NULL), follow the pointer
to the vfs object of the mounted file system and invoke its vfs _root operation to return
the root vnode of that file system.
7. If the new component is a symbolic link (v_type == VLNK), invoke its VOP_READLINK op-
eration to translate the symbolic link. Append the rest of the pathname to the contents of
the link and restart the iteration. If the link contains an absolute pathname, the parsing
must resume from the system root.
The caller of lookuppn() may pass a flag that suppresses symbolic link evaluation for
the last component of the pathname. This is to accommodate certain system calls such as
lstat that do not want to traverse symbolic links at the end of pathnames. Also, a global
parameter called MAXSYMLINKS (usually set to 20) limits the maximum number of sym-
bolic links that may be traversed during a single call to lookuppn(). This prevents the
function from going into a possibly infinite loop due to badly conceived symbolic links-
for example, if /x/y were a symbolic link to /x.
8. Release the directory it just finished searching. The hold was acquired by the VOP_LOOKUP
operation. For the starting point, it was explicitly obtained by lookuppn().
9. Finally, go back to the top of the loop and search for the next component in the directory
represented by the new vnode.
10. When no components are left, or if a component is not found, terminate the search. If the
search was successful, do not release the hold on the final vnode and return a pointer to
this vnode to the caller.
If a file system wishes to use the name cache, its lookup function (the one that implements
the VOP_LOOKUP operation) first searches the cache for the desired file name. If found, it simply in-
crements the reference count of the vnode and returns it to the caller. This avoids the directory
search, thus saving several disk reads. Cache hits are likely since programmers typically make sev-
eral requests on a few frequently used files and directories. In event of a cache miss, the lookup
function searches the directory. If the component is found, it adds a new cache entry for future use.
Since the file system can access vnodes without going through pathname traversal, it may
perform operations that invalidate a cache entry. For instance, a user may unlink a file, and the ker-
nel may later reassign the vnode to another file. Without proper precautions, a subsequent search for
the old file may result in an incorrect cache hit, fetching a vnode that does not belong to that file.
The cache must therefore provide a way of ensuring or checking the validity of its entries. Both
4.3BSD and SVR4 implement a directory lookup cache, and each uses a different technique to re-
solve this issue.
4.3BSD does not use the vnode/vfs interface, and its name lookup cache directly locates the
in-memory inode of the file. The inode has a generation number, also called a capability, which is
incremented each time the inode is reassigned to a new file. The name lookup cache is hint-based.
When adding a new entry to the cache, the file system copies the inode generation number of the
file into the entry. The cache lookup function checks this number against the current generation
number of the inode. If the numbers are unequal, the entry is invalid and results in a cache miss.
In SVR4, the cache entry holds a reference to the vnode of the cached file and releases it
when the entry is flushed or reassigned. Although this method ensures that cache entries are always
valid, it has certain drawbacks. For example, the kernel must retain some inactive vnodes simply
because there is a name cache entry that references them. Also, it prevents other parts of the kernel
from ensuring exclusive use of a file or device.
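Lookup proceeds one component at a time through the VOP_LOOKUP operation. A sketch of the call follows; the exact argument list is an assumption:

    error = VOP_LOOKUP(vp, compname, &tvp, ...);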
where vp is a pointer to the parent directory vnode and compname is the component name. On suc-
cessful return, tvp must point to the vnode of compname and its reference count must be incre-
mented.
Like other operations in this interface, this results in a call to a file-system-specific lookup
function. Usually, this function first searches the name lookup cache. If there is a cache hit, it in-
crements the reference count and returns the vnode pointer. In case of a cache miss, it searches the
parent directory for the name. Local file systems perform the search by iterating through the direc-
tory entries block by block. Distributed file systems send a search request to the server node.
If the directory contains a valid match for the component, the lookup function checks if the
vnode of the file is already in memory. Each file system has its own method of keeping track of its
in-memory objects. In ufs, for instance, the directory search results in an inode number, which ufs
uses to index into a hash table and search for the inode. The in-memory inode contains the vnode. If
the vnode is found in memory, the lookup function increments its reference count and returns it to
the caller.
Often the directory search produces a match for the component, but the vnode is not in
memory. The lookup function must allocate and initialize a vnode, as well as the file-system-
dependent private data structure. Usually, the vnode is part of the private data structure, and hence
both are allocated as one unit. The two objects are initialized by reading in the attributes of the file.
The v_op field of the vnode is set to point to the vnodeops vector for this file system, and a hold is
added to the vnode. Finally, the lookup function adds an entry to the directory name lookup cache
and places it at the end of the LRU list of the cache.
Note that lookuppn() increments the reference count on the vnode and also initializes its
v_op pointer. This ensures that subsequent system calls can access the file using the file descriptor
(the vnode will remain in memory) and that the file-system-dependent functions will be correctly
routed.
is encapsulated in a credentials object (struct cred), which is explicitly passed (through a pointer)
to most file operations.
Each process has a statically allocated credentials object, typically in the u area or proc
structure. For operations on local files, we pass a pointer to this object, which seems no different
from the earlier treatment of obtaining the information directly from the u area. The benefit of the
new method is in the handling of remote file operations, which are executed by a server process on
behalf of remote clients. Here, the permissions are determined by the credentials of the client, not by
those of the server process. Thus the server can dynamically allocate a credentials structure for each
client request and initialize it with the UID and GIDs of the client.
Since these credentials are passed from one operation to another, and must be retained until
the operations complete, the kernel associates a reference count with each object and frees the
structure when the count drops to zero.
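Such an object might be declared roughly as follows; the field names are a sketch, and the real SVR4 declaration differs in detail:

struct cred {
    long    cr_ref;         /* reference count */
    uid_t   cr_uid;         /* effective user ID */
    gid_t   cr_gid;         /* effective group ID */
    uid_t   cr_ruid;        /* real user ID */
    gid_t   cr_rgid;        /* real group ID */
    long    cr_ngroups;     /* number of supplementary groups */
    gid_t   cr_groups[1];   /* supplementary group list (variable length) */
};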
8.11 Analysis
The vnode/vfs interface provides a powerful programming paradigm. It allows multiple file system
types to coexist. Vendors may add file systems to the kernel in a modular fashion. The object-
oriented framework effectively separates the file system from the rest of the kernel. This has led to
the development of several interesting file system implementations. The file system types found in a
typical SVR4 installation include:
Many variants, such as Solaris, also support the MS-DOS FAT file system. This is particularly use-
ful for moving files to and from DOS machines through floppy disks. The next few chapters de-
scribe a number of file systems in greater detail.
The SunOS vnode/vfs design, now incorporated into SVR4, has gained wide acceptance. It
is important, however, to examine some of its drawbacks and to see how some other UNIX variants
have addressed these issues. Its shortcomings are mainly the result of the way it implements path-
name lookup. The remainder of this section examines these drawbacks and looks at some recent
variants that address them.
pathname. If the parent directory is the same as in the previous call to namei(), the search begins at
this cached offset instead of at the start of the directory (wrapping around the end if necessary).
The 4.4BSD file system interface provides many other interesting features, such as stackable
vnodes and union mounts. These are described in Section 11.12.
8.12 Summary
The vnode/vfs interface provides a powerful mechanism for modular development and addition of
file systems to the UNIX kernel. It allows the kernel to deal simply with abstract representations of
files called vnodes and relegates the file-system-dependent code to a separate layer accessed through
a well-defined interface. Vendors may build file systems that implement this interface. The process
is similar to writing a device driver.
It is important, however, to note the great variation in the different incarnations of this inter-
face. While various implementations such as SVR4, BSD, and OSF/1 are all based on similar gen-
eral principles, they differ substantially both in the specifics of the interface (such as the set of op-
erations and their arguments, as well as the format of the vnode and vfs structures), and in their
policies regarding state, synchronization, etc. This means that file system developers would have to
make major modifications to make their file system compatible with the different vfs interfaces.
8.13 Exercises
1. What are the advantages of having a byte-stream representation of files? In what ways is this
model inadequate?
2. Suppose a program makes repeated calls to readdir to list the contents of a directory. What
would happen if other users were to create and delete files in the directory in between?
3. Why are users never allowed to directly write to a directory?
4. Why are file attributes not stored in the directory entry itself?
5. Why does each process have a default creation mask? Where is this mask stored? Why does
the kernel not use the mode supplied to open or creat directly?
6. Why should a user not be allowed to write to a file opened in read-only mode, if he or she has
privileges to do so?
7. Consider the following shell script called myscript:
date
cat /etc/motd
What is the effect of executing the following command? How are the file descriptors shared?
myscript > result.log
8. What is the advantage of having lseek be a separate system call, instead of passing the starting
offset to every read or write? What are the drawbacks?
9. When would a read return fewer bytes than requested?
10. What are the benefits of scatter-gather I/O? What applications are most likely to use it?
11. What is the difference between advisory and mandatory locks? What kind of applications are
likely to use byte-range locks?
12. Suppose a user's current working directory is /usr/mnt/kaumu. If the administrator mounts a
new file system on the /usr/mnt directory, how will it affect this user? Would the user be able
to continue to see the files in kaumu? What would be the result of a pwd command? What
other commands would behave unexpectedly?
13. What are the drawbacks of using a symbolic link instead of a hard link?
14. Why are hard links not allowed to span file systems?
15. What problems could arise from incorrect use of hard links to directories?
16. What should the kernel do when the reference count on a vnode drops to zero?
17. Discuss the relative merits and drawbacks of hint-based and reference-based directory name
lookup caches.
18. [Bark 90] and [John 95] describe two implementations that dynamically allocate and
deallocate vnodes. Can such a system use a hint-based name lookup cache?
19. Give an example of an infinite loop caused by symbolic links. How does lookuppn() handle
this?
20. Why does the VOP_LOOKUP operation parse only one component at a time?
21. 4.4BSD allows a process to lock a vnode across multiple vnode operations in a single system
call. What would happen if the process was killed by a signal while holding the lock? How
can the kernel handle this situation?
8.14 References
[Bach 86] Bach, M.J., The Design of the UNIX Operating System, Prentice-Hall, Englewood
Cliffs, NJ, 1986.
[Bark 90] Barkley, R.E., and Lee, T.P., "A Dynamic File System Inode Allocation and
Reallocation Policy," Proceedings of the Winter 1990 USENIX Technical
Conference, Jan. 1990, pp. 1-9.
[DoD 85] Department of Defense, Trusted Computer System Evaluation Criteria, DOD
5200.28-STD, Dec. 1985.
[Elli 90] Ellis, M.A., and Stroustrup, B., The Annotated C++ Reference Manual, Addison-
Wesley, Reading, MA, 1990.
[Fern 88] Fernandez, G., and Allen, L., "Extending the UNIX Protection Model with Access
Control Lists," Proceedings of the Summer 1988 USENIX Technical Conference,
Jun. 1988, pp. 119-132.
[John 95] John, A., "Dynamic Vnodes-Design and Implementation," Proceedings of the
Winter 1995 USENIX Technical Conference, Jan. 1995, pp. 11-23.
[Kare 86] Karels, M.J. and McKusick, M.K., "Toward a Compatible Filesystem Interface,"
Proceedings of the Autumn 1986 European UNIX Users' Group Conference, Oct.
1986, pp. 481-496.
[Klei 86] Kleiman, S.R., "Vnodes: An Architecture for Multiple File System Types in Sun
UNIX," Proceedings of the Summer 1986 USENIX Technical Conference, Jun. 1986,
pp. 238-247.
[LoVe 91] LoVerso, S., Paciorek, N., Langerman, A., and Feinberg, G., "The OSF/1 UNIX
Filesystem (UFS)," Proceedings of the Winter 1991 USENIX Conference, Jan. 1991,
pp. 207-218.
[McKu 84] McKusick, M.K., Joy, W.N., Leffler, S.J., and Fabry, R.S., "A Fast File System for
UNIX," ACM Transactions on Computer Systems, vol. 2, (Aug. 1984), pp. 181-197.
[Patt 88] Patterson, D.A., Gibson, G.A., and Katz, R.H., "A Case for Redundant Arrays of
Inexpensive Disks (RAID)," Proceedings of the 1988 ACM SIGMOD Conference on
Management of Data, Jun. 1988.
[Rifk 86] Rifkin, A.P., Forbes, M.P., Hamilton, R.L., Sabrio, M., Shah, S., and Yueh, K., "RFS
Architectural Overview," Proceedings of the Summer 1986 USENIX Technical
Conference, Jun. 1986, pp. 248-259.
[Rodr 86] Rodriguez, R., Koehler, M., and Hyde, R., "The Generic File System," Proceedings
of the Summer 1986 USENIX Technical Conference, Jun. 1986, pp. 260-269.
[Thom 78] Thompson, K., "UNIX Implementation," The Bell System Technical Journal, Jul.-
Aug. 1978, Vol. 57, No. 6, Part 2, pp. 1931-1946.
[UNIX 92] UNIX System Laboratories, Operating System API Reference, UNIX SVR4.2, UNIX
Press, Prentice-Hall, Englewood Cliffs, NJ, 1992.
9
File System
Implementations
9.1 Introduction
The previous chapter described the vnode/vfs interface, which provides a framework for supporting
multiple file system types and defines the interface between the file system and the rest of the ker-
nel. Today's UNIX systems support many different types of file systems. These can be classified as
local or distributed. Local file systems store and manage their data on devices directly connected to
the system. Distributed file systems allow a user to access files residing on remote machines. This
chapter describes many local file systems. Chapter 10 discusses distributed file systems, and Chap-
ter 11 describes some newer file systems that provide advanced features such as journaling, volume
management, and high availability.
The two local, general-purpose file systems found in most modern UNIX systems are the
System V file system (s5fs) and the Berkeley Fast File System (FFS). s5fs [Thom 78] is the original
UNIX file system. All versions of System V, as well as several commercial UNIX systems, support
s5fs. FFS, introduced by Berkeley UNIX in release 4.2BSD, provides more performance, robust-
ness, and functionality than s5fs. It gained wide commercial acceptance, culminating in its inclusion
in SVR4. (SVR4 supports three general-purpose file systems: s5fs, FFS, and VxFS, the Veritas
journaling file system.)
When FFS was first introduced, the UNIX file system framework could support only one
type of file system. This forced vendors to choose between s5fs and FFS. The vnode/vfs interface,
introduced by Sun Microsystems [Klei 86], allowed multiple file system types to coexist on a single
machine. The file system implementations also required modifications to integrate with the
vnode/vfs framework. The integrated version of FFS is now known as the UNIX file system (ufs).1
[Bach 86] provides a comprehensive discussion of s5fs, and [Leff 89] does so for FFS. This chapter
summarizes and compares the two implementations, both for completeness and to provide the back-
ground for understanding the advanced file systems described in the following chapters.
In UNIX, the file abstraction includes various I/O objects, including network connections
through sockets or STREAMS, interprocess communication mechanisms such as pipes and FIFOs,
and block and character devices. The vnode/vfs architecture builds on this philosophy by represent-
ing both files and file systems as abstractions that present a modular interface to the rest of the ker-
nel. This motivated the development of several special-purpose file systems. Many of these have
little to do with files or I/O and merely exploit the abstract nature of this interface to provide special
functionality. This chapter examines some of the interesting implementations.
Finally, this chapter describes the UNIX block buffer cache. In earlier UNIX versions such
as SVR3 and 4.3BSD, all file I/O used this cache. Modern releases such as SVR4 integrate file I/O
and memory management, accessing files by mapping them into the kernel's address space. Al-
though this chapter provides some details about this approach, most of the discussion must be de-
ferred to Chapter 14, which describes the SVR4 virtual memory implementation. The traditional
buffer cache mechanism is still used for metadata blocks. The term metadata refers to the attributes
and ancillary information about a file or file system. Rather than being part of a specific file system
implementation, the buffer cache is a global resource shared by all file systems.
We begin with s5fs, and describe its on-disk layout and kernel organization. Although FFS
differs from s5fs in many important respects, it also has many similarities, and the basic operations
are implemented in the same way. Our discussion of FFS will focus on the differences. Except
where noted, the general algorithms of s5fs, described in Section 9.3, also apply to FFS.
1 Initially, the term ufs differed in meaning for different variants. System V-based releases used it to refer to their na-
tive file system, which is now known as s5fs. BSD-based systems used the terms ufs and s5fs as we use them in this
book. The confusion was resolved when SVR4 also adopted this convention.
9.2 The System V File System (s5fs)
At the beginning of the partition is the boot area, which may contain code required to boot-
strap (load and initialize) the operating system. Although only one partition needs to contain this
information, each partition contains a possibly empty boot area. The boot area is followed by the
superblock, which contains attributes and metadata of the file system itself.
Following the superblock is the inode list, which is a linear array of inodes. There is one
inode for each file. Inodes are described in Section 9.2.2. Each inode can be identified by its inode
number, which equals its index in the inode list. The size of an inode is 64 bytes. Several inodes fit
into a single disk block. The starting offsets of the superblock and the inode list are the same for all
partitions on a system. Consequently, an inode number can be easily translated to a block number
and the offset of the inode from the start of that block. The inode list has a fixed size (configured
while creating the file system on that partition), which limits the maximum number of files the par-
tition can contain. The space after the inode list is the data area. It holds data blocks for files and
directories, as well as indirect blocks, which hold pointers to file data blocks and are described in
Section 9.2.2.
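Because the inode list begins at a fixed offset and on-disk inodes have a fixed size, this translation is simple arithmetic. The following sketch illustrates it; the constants are only representative, and a real implementation takes them from the superblock.

/* Translate an s5fs inode number into the disk block that holds the inode
 * and the byte offset of the inode within that block.  The constants are
 * illustrative; real values come from the superblock. */
#define BLOCK_SIZE     1024            /* bytes per disk block */
#define INODE_SIZE     64              /* bytes per on-disk inode */
#define INODE_LIST_BLK 2               /* first block of the inode list */
#define INODES_PER_BLK (BLOCK_SIZE / INODE_SIZE)

void inode_location(unsigned int inum, unsigned int *blk, unsigned int *off)
{
    /* Inode numbers start at 1, so index (inum - 1) into the list. */
    *blk = INODE_LIST_BLK + (inum - 1) / INODES_PER_BLK;
    *off = ((inum - 1) % INODES_PER_BLK) * INODE_SIZE;
}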
9.2.1 Directories
An s5fs directory is a special file containing a list of files and subdirectories. It contains fixed-size
records of 16 bytes each. The first two bytes contain the inode number, and the next fourteen the file
name. This places a limit of 65535 files per disk partition (since 0 is not a valid inode number) and
14 characters per file name. If the filename has fewer than fourteen characters, it is terminated by a
NULL character. Because the directory is a file, it also has an inode, which contains a field identify-
ing the file as a directory. The first two entries in the directory are ".", which represents the direc-
tory itself, and "..", which denotes the parent directory. If the inode number of an entry is zero, it
indicates that the corresponding file no longer exists. The root directory of a partition, as well as its
".." entry, always has an inode number of 2. This is how the file system can identify its root direc-
tory. Figure 9-2 shows a typical directory.
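A directory entry can therefore be described by a small fixed-size structure. The following sketch shows such an entry, along with a linear scan of an in-memory directory block; the type and field names are illustrative rather than the kernel's exact declarations.

/* An s5fs directory entry: a 2-byte inode number followed by a 14-byte
 * name field, NULL-terminated only if the name is shorter than 14 bytes. */
#include <stdint.h>
#include <string.h>

#define DIRSIZ 14

struct s5_direct {
    uint16_t d_ino;                    /* inode number; 0 marks a free slot */
    char     d_name[DIRSIZ];           /* file name, not always terminated */
};

/* Scan an array of directory entries for a name; return its inode number,
 * or 0 if the name is not present. */
uint16_t dir_lookup(const struct s5_direct *entries, int nentries,
                    const char *name)
{
    for (int i = 0; i < nentries; i++) {
        if (entries[i].d_ino != 0 &&
            strncmp(entries[i].d_name, name, DIRSIZ) == 0)
            return entries[i].d_ino;
    }
    return 0;
}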
9.2.2 Inodes
Each file has an inode associated with it. The word inode derives from index node. The inode con-
tains administrative information, or metadata, of the file. It is stored on disk within the inode list.
When a file is open, or a directory is active, the kernel stores the data from the disk copy of the
inode into an in-memory data structure, also called an inode. This structure has many additional
fields that are not saved on disk. Whenever it is ambiguous, we use the term on-disk inode to refer
Figure 9-2. A typical directory. Each 16-byte entry pairs an inode number with a name: 73 ".", 38 "..", 9 "file1", 0 "deletedfile" (deleted), 110 "subdirectory1", 65 "archana".
to the on-disk data structure (struct dinode) and in-core inode to refer to the in-memory structure
(struct inode). Table 9-1 describes the fields of the on-disk inode.
The di_mode field is subdivided into several bit-fields (Figure 9-3). The first four bits spec-
ify the file type, which may be IFREG (regular file), IFDIR (directory), IFBLK (block device), IFCHR
(character device), etc. The nine low-order bits specify read, write, and execute permissions for the
owner, members of the owner's group, and others, respectively.
The di_addr field requires elaboration. UNIX files are not contiguous on disk. As a file
grows, the kernel allocates new blocks from any convenient location on the disk. This has the ad-
vantage that it is easy to grow and shrink files without the disk fragmentation inherent in contiguous
allocation schemes. Obviously, fragmentation is not completely eliminated, because the last block
of each file may contain unused space. On average, each file wastes half a block of space.
This approach requires the file system to maintain a map of the disk location of every block
of the file. Such a list is organized as an array of physical block addresses. The logical block number
within a file forms an index into this array. The size of this array depends on the size of the file. A
very large file may require several disk blocks to store this array. Most files, however, are quite
small [Saty 81], and a large array would only waste space. Moreover, storing the disk block array on
a separate block would incur an extra read when the file is accessed, resulting in poor performance.
Figure 9-3. Bit-fields of di_mode.
The UNIX solution is to store a small list in the inode itself and use extra blocks for large
files. This is very efficient for small files, yet flexible enough to handle very large files. Figure 9-4
illustrates this scheme. The 39-byte di_addr field comprises a thirteen-element array, with each
element storing a 3-byte physical block number. Elements 0 through 9 in this array contain the
block numbers of blocks 0 through 9 of the file. Thus, for a file containing 10 blocks or fewer, all
the block addresses are in the inode itself. Element 10 is the block number of an indirect block, that
is, a block that contains an array of block numbers. Element 11 points to a double-indirect block,
which contains block numbers of other indirect blocks. Finally, element 12 points to a triple-
indirect block, which contains block numbers of double-indirect blocks.
Such a scheme, for a 1024-byte block size, allows addressing of 10 blocks directly, 256
more blocks through the single indirect block, 65536 (256 x 256) more blocks through the double
indirect block, and 16,777,216 (256 x 256 x 256) more blocks through the triple indirect block.
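The following sketch classifies a logical block number of a file under this scheme, using the same illustrative parameters as above (1024-byte blocks, ten direct entries, and 256 block pointers per indirect block).

/* Classify a file's logical block number into the s5fs addressing scheme:
 * direct (di_addr[0..9]), single, double, or triple indirect. */
#include <stdio.h>

#define NDIR   10                      /* direct entries in di_addr */
#define NINDIR 256                     /* block pointers per indirect block */

void classify_block(unsigned long lbn)
{
    if (lbn < NDIR)
        printf("direct, di_addr[%lu]\n", lbn);
    else if ((lbn -= NDIR) < NINDIR)
        printf("single indirect, slot %lu\n", lbn);
    else if ((lbn -= NINDIR) < (unsigned long)NINDIR * NINDIR)
        printf("double indirect, slots %lu and %lu\n",
               lbn / NINDIR, lbn % NINDIR);
    else {
        lbn -= (unsigned long)NINDIR * NINDIR;
        printf("triple indirect, slots %lu, %lu, and %lu\n",
               lbn / ((unsigned long)NINDIR * NINDIR),
               (lbn / NINDIR) % NINDIR, lbn % NINDIR);
    }
}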
UNIX files may contain holes. A user may create a file, seek (set the offset pointer in the
open file object by calling lseek; see Section 8.2.4) to a large offset, and write data to it. The space
before this offset contains no data and is a hole in the file. If a process tries to read that part of the
file, it will see NULL (zero-valued) bytes.
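A user-level program can create such a hole directly, as in the following example, which writes one byte, seeks about a megabyte past it, and writes another byte; the file name is arbitrary.

/* Create a file with a hole: the blocks between the two writes are never
 * allocated, and reading them returns zeros.  "ls -ls" on the result shows
 * far fewer allocated blocks than the file size suggests. */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("sparse.dat", O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0)
        return 1;

    write(fd, "A", 1);                 /* data at offset 0 */
    lseek(fd, 1024 * 1024, SEEK_SET);  /* skip ~1 MB without writing */
    write(fd, "Z", 1);                 /* data at offset 1 MB */

    close(fd);
    return 0;
}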
Such holes can sometimes be large, spanning entire blocks. It is wasteful to allocate disk
space for such blocks. Instead, the kernel sets the corresponding elements of the di_addr array, or
of the indirect block, to zero. When a user tries to read such a block, the kernel returns a zero-filled
block. Disk space is allocated only when someone tries to write data to the block.
Refusing to allocate disk space for holes has some important consequences. A process may
unexpectedly run out of disk space while trying to write data to the hole. If a file containing a hole is
copied, the new file will have zero-filled pages on the disk instead of the hole. This happens because
copying involves reading the file's contents and writing them to the destination file. When the ker-
nel reads a hole, it creates zero-filled pages, which are then copied without further interpretation.
This can cause problems for backup and archiving utilities such as tar or cpio that operate at the
file level rather than the raw disk level. A system administrator may back up a file system, and dis-
cover that the same disk does not have enough room to restore its contents.
The superblock contains the first part of the free block list; the kernel adds and removes blocks at its tail. The first
element in this list points to the block containing the next part of the list, and so forth.
At some point, the block allocation routine discovers that the free block list in the super-
block contains only a single element. The value stored in that element is the number of the block
containing the next part of the free list (block a in Figure 9-5). It copies the list from that block into
the superblock, and that block now becomes free. This has the advantage that the space required to
store the free block list depends directly on the amount of free space on the partition. For a nearly
full disk, no space needs to be wasted to store the free block list.
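The allocation routine can be sketched as follows. The structure is simplified, and read_list_block() stands in for the buffer-cache read that fetches the next chunk of the free list from disk.

/* Simplified sketch of s5fs free-block allocation.  The superblock caches
 * up to NICFREE entries of the free list; the last entry names the block
 * holding the next chunk. */
#define NICFREE 50

struct s5_superblock {
    int  s_nfree;                      /* valid entries in s_free[] */
    long s_free[NICFREE];
};

/* Stand-in for reading the next chunk of the free list from disk block
 * blkno; a real kernel would go through the buffer cache.  Here it simply
 * reports an empty chunk, which ends the list. */
static void read_list_block(long blkno, struct s5_superblock *sb)
{
    (void)blkno;
    sb->s_nfree = 0;
}

/* Allocate one block: take a block number from the tail of the cached
 * list.  If that was the last cached entry, it names the block holding
 * the next part of the list; copy that part into the superblock, and the
 * block itself becomes the allocated block. */
long s5_alloc_block(struct s5_superblock *sb)
{
    if (sb->s_nfree <= 0)
        return -1;                     /* the file system is full */

    long blkno = sb->s_free[--sb->s_nfree];
    if (sb->s_nfree == 0)
        read_list_block(blkno, sb);
    return blkno;
}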
(Figure: the in-core inode table, organized as a set of hash queues, hash queue 0 through hash queue 3.)
If the directory contains a valid entry for the file, s5lookup() obtains the inode number
from the entry. It then calls iget() to locate that inode. iget() hashes on the inode number and
searches the appropriate hash queue for the inode. If the inode is not in the table, iget() allocates
an inode (this is described in Section 9.3.4), and initializes it by reading in the on-disk inode. While
copying the on-disk inode fields to the in-core inode, it expands the di_addr[] elements to four
bytes each. It then puts the inode on the appropriate hash queue. It also initializes the vnode, setting
its v_op field to point to the s5vnodeops vector, v_data to point to the inode itself, and v_vfsp to
point to the vfs to which the file belongs. Finally, it returns a pointer to the inode to s5lookup().
s5lookup(), in turn, returns a pointer to the vnode to lookuppn().
Note that iget() is the only function in s5fs that allocates and initializes inodes and vnodes.
For instance, when creating a new file, s5create() allocates an unused inode number (from the
free inode list in the superblock) and calls iget() to bring that inode into memory.
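The hash-queue lookup that iget() performs can be sketched as follows. The real implementation also matches on the file system (a given inode number is unique only within one partition), handles locking, and manages the free list, all of which are omitted here.

/* Sketch of the in-core inode hash lookup performed by iget(). */
#include <stddef.h>

#define NHASH 64

struct incore_inode {
    unsigned int         i_number;     /* inode number */
    unsigned int         i_count;      /* reference count */
    struct incore_inode *i_forw;       /* next inode on the same hash queue */
};

static struct incore_inode *ihash[NHASH];

static unsigned int ihashfn(unsigned int inum)
{
    return inum % NHASH;
}

/* Return a referenced in-core inode if it is already cached, else NULL,
 * in which case the caller allocates one and reads the on-disk inode. */
struct incore_inode *ihash_lookup(unsigned int inum)
{
    struct incore_inode *ip;

    for (ip = ihash[ihashfn(inum)]; ip != NULL; ip = ip->i_forw) {
        if (ip->i_number == inum) {
            ip->i_count++;
            return ip;
        }
    }
    return NULL;
}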
For read and write, the kernel first verifies that the file descriptor is valid and that the file
is opened in the correct mode. If it is, the kernel obtains the vnode pointer from the file structure.
Before starting I/O, the kernel invokes the VOP_RWLOCK operation to serialize access to the file. s5fs
implements this by acquiring an exclusive lock on the inode.2 This ensures that all data read or
written in a single system call is consistent and all write calls to the file are single-threaded. The
kernel then invokes the vnode's VOP_READ or VOP_WRITE operation. This results in a call to
s5read() or s5write(), respectively.
In earlier implementations, the file I/O routines used the block buffer cache, an area of
memory reserved for file system blocks. SVR4 unifies file I/O with virtual memory and uses the
buffer cache only for metadata blocks. Section 9.12 describes the old buffer cache. This section
summarizes the SVR4 operations. They are further detailed in the discussion of virtual memory in
Section 14.8.
Let us use the read operation as an example. s5read() translates the starting offset for the
I/O operation to the logical block number in the file and the offset from the beginning of the block.
It then reads the data one page3 at a time, by mapping the block into the kernel virtual address space
and calling uiomove() to copy the data into user space. uiomove() calls the copyout() routine to
perform the actual data transfer. If the page is not in physical memory, or if the kernel does not have
a valid address translation for it, copyout() will generate a page fault. The fault handler will iden-
tify the file to which the page belongs and invoke the VOP_GETPAGE operation on its vnode.
In s5fs, this operation is implemented by s5getpage(), which first calls a function called
bmap() to convert the logical block number to a physical block number on the disk. It then searches
the vnode's page list (pointed to by v_page) to see if the page is already in memory. If it is not,
s5getpage() allocates a free page and calls the disk driver to read the data from disk.
The calling process sleeps until the I/O completes. When the block is read, the disk driver
wakes up the process, which resumes the data copy in copyout(). Before copying the data to user
space, copyout() verifies that the user has write access to the buffer into which it must copy the
data. Otherwise, the user may inadvertently or maliciously specify a bad address, causing many
problems. For instance, if the user specifies a kernel address, the kernel will overwrite its own text
or data structures.
s5read() returns when all data has been read or an error has occurred. The file-system-
independent code unlocks the vnode (using VOP_RWUNLOCK), advances the offset pointer in the file
structure by the number of bytes read, and returns to the user. The return value of read is the total
number of bytes read. This usually equals the number requested, unless the end of file is reached or
some other error occurs.
The write system call proceeds similarly, with a few differences. The modified blocks are
not written immediately to disk, but remain in memory, to be written out later according to the
cache heuristics. Besides, the write may increase the file size and may require the allocation of data
blocks, and perhaps indirect blocks, for disk addresses. Finally, if only part of a block is being writ-
ten, the kernel must first read the entire block from disk, modify the relevant part, and then write it
back to the disk.
2 Many UNIX variants use a single-writer, multiple-readers lock, which allows better concurrency.
3 A page is a memory abstraction. It may contain one block, many blocks, or part of a block.
It is therefore better to reuse those inodes that have no pages cached in memory. When the
vnode reference count reaches zero, the kernel invokes its VOP_INACTIVE operation to release the
vnode and its private data object (in this case, the inode). When releasing the inode, the kernel
checks the vnode's page list. It releases the inode to the front of the free list if the page list is empty
and to the back of the free list if any pages of the file are still in memory. In time, if the inode re-
mains inactive, the paging system frees its pages.
[Bark 90] describes a new inode allocation and reclaim policy, which allows the number of
in-core inodes to adjust to the load on the system. Instead of using a fixed-size inode table, the file
systems allocate inodes dynamically using the kernel memory allocator. This allows the number of
inodes in the system to rise and fall as needed. The system administrator no longer needs to guess
the appropriate number of inodes to pre-configure into the system.
When iget() cannot locate an inode on its hash queue, it removes the first inode from the
free list. If this inode still has pages in memory, iget() returns it to the back of the free list and
calls the kernel memory allocator to allocate a new inode structure. It is possible to generalize the
algorithm to scan the free list for an inode with no in-memory pages, but the implementation de-
scribed here is simple and efficient. Its only drawback is that it may allocate a few more inodes in
memory than are absolutely necessary.
Experiments using a multiuser, time-sharing workload benchmark [Gaed 82] show that the
new algorithm reduces system time (amount of CPU time spent in kernel mode) usage by 12% to
16%. Although initially implemented for s5fs, the optimization is general enough to be applied to
other file systems such as FFS.
4 FFS first appeared in 4.1bBSD, which was a test release internal to Berkeley. It was also part of 4.1cBSD, another
test release that was sent to about 100 sites [Salu 94].
Figure 9-7. Conceptual view of a hard disk (platters, with sector 0 marked).
5 Actually, both sides of the platter are used, and there is a separate disk head for each side.
UNIX views the disk as a linear array of blocks. The number of sectors in a block is a small
power of two; for this section, let us assume there is exactly one sector per block. When a UNIX
process wants to read a particular block number, the device driver translates that into a logical sector
number, and from it, computes the physical track, head, and sector number. In this scheme, the sec-
tor numbers increase first, then the head number, and finally the cylinder (track) number. Each cyl-
inder thus contains a sequential set of block numbers. After computing the location of the desired
block, the driver must move the disk heads to the appropriate cylinder. This head seek is the most
time-consuming component of the disk I/O operation, and the seek latency depends directly on how
far the heads must move. Once the heads move into position, we must wait while the disk spins until
the correct sector passes under the heads. This delay is called the rotational latency. Once the cor-
rect sector is under the disk head, the transfer may begin. The actual transfer time is usually just the
time for one sector to move across the head. Optimizing I/O bandwidth thus requires minimizing the
number and size of the head seeks, and reducing rotational latency by proper placement of the
blocks on the disk.
9.7 On-Disk Organization
A disk partition comprises a set of consecutive cylinders on the disk, and a formatted partition
holds a self-contained file system. FFS further divides the partition into one or more cylinder
groups, each containing a small set of consecutive cylinders. This allows UNIX to store related data
in the same cylinder group, thus minimizing disk head movements. Section 9.7.2 discusses this in
greater detail.
The information in the traditional superblock is divided into two structures. The FFS super-
block contains information about the entire file system-the number, sizes and locations of cylinder
groups, block size, total number of blocks and inodes, and so forth. The data in the superblock does
not change unless the file system is rebuilt. Furthermore, each cylinder group has a data structure
describing summary information about that group, including the free inode and free block lists. The
superblock is kept at the beginning of the partition (after the boot block area), but that is not enough.
The data in the superblock is critical, and must be protected from disk errors. Each cylinder group
therefore contains a duplicate copy of the superblock. FFS maintains these duplicates at different
offsets in each cylinder group in such a way that no single track, cylinder, or platter contains all
copies of the superblock. The space between the beginning of the cylinder group and the superblock
copy is used for data blocks, except for the first cylinder group.
The FFS block size is at least 4096 bytes, which allows a file of up to four
gigabytes to be addressed with only two levels of indirection. FFS does not use the triple indirect
block, although some variants use it to support file sizes greater than 4 gigabytes.
Typical UNIX systems have numerous small files that need to be stored efficiently
[Saty 81]. The 4K block size wastes too much space for such files. FFS solves this problem by al-
lowing each block to be divided into one or more fragments. The fragment size is also fixed for each
file system and is set when the file system is created. The number of fragments per block may be set
to 1, 2, 4, or 8, allowing a lower bound of 512 bytes, the same as the disk sector size. Each fragment
is individually addressable and allocable. This requires replacing the free block list with a bitmap
that tracks each fragment.
An FFS file is composed entirely of complete disk blocks, except for the last block, which
may contain one or more consecutive fragments. The file block must be completely contained
within a single disk block. Even if two adjacent disk blocks have enough consecutive free fragments
to hold a file block, they may not be combined. Furthermore, if the last block of a file contains more
than one fragment, these fragments must be contiguous and part of the same block.
This scheme reduces space wastage, but requires occasional recopying of file data. Consider
a file whose last block occupies a single fragment. The remaining fragments in that block may be
allocated to other files. If that file grows by one more fragment, we need to find another block with
two consecutive free fragments. The first fragment must be copied from the original position, and
the second fragment filled with the new data. If the file usually grows in small increments, its frag-
ments may have to be copied several times, thus impacting performance. FFS controls this by allow-
ing only direct blocks to contain fragments.
Hence, for best performance, applications should write a full block at a time to files
whenever possible. Different file systems on the same machine may have different block sizes.
Applications can use the stat system call to obtain the attributes of a file in a file-system-
independent format. One attribute returned by stat is a hint as to the best unit size for I/O operations,
which, in the case of FFS, is the block size. This information is used by the Standard I/O library, as
well as other applications that manage their own I/O.
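A program can obtain this hint portably through stat, as in the following example.

/* Print the preferred I/O transfer size for a file; st_blksize is the
 * hint described above, and st_blocks is the space actually allocated. */
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
    struct stat st;

    if (argc != 2 || stat(argv[1], &st) != 0) {
        perror("stat");
        return 1;
    }
    printf("%s: preferred I/O size %ld bytes, %ld 512-byte blocks allocated\n",
           argv[1], (long)st.st_blksize, (long)st.st_blocks);
    return 0;
}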
• Create each new directory in a different cylinder group from its parent, so as to distribute
data uniformly over the disk. The allocation routine chooses the new cylinder group from
groups with an above-average free inode count; from these, it selects the one with the few-
est directories.
• Try to place the data blocks of the file in the same cylinder group as the inode, because
typically the inode and data will be accessed together.
• To avoid filling an entire cylinder group with one large file, change the cylinder group
when the file size reaches 48 kilobytes and again at every megabyte. The 48-kilobyte mark
was chosen because, for a 4096-byte block size, the inode's direct block entries describe
the first 48 kilobytes. 6 The selection of the new cylinder group is based on its free block
count.
• Allocate sequential blocks of a file at rotationally optimal positions, if possible. When a
file is being read sequentially, there is a time lag between when a block read completes
and when the kernel processes the I/O completion and initiates the next read. Because the
disk is spinning during this time, one or more sectors may have passed under the disk
head. Rotational optimization tries to determine the number of sectors to skip so that the
desired sector is under the disk head when the read is initiated. This number is called the
rotdelay factor, or the disk's interleave.
The implementation must balance the localization efforts with the need to distribute the data
throughout the disk. Too much localization causes all data to be crammed into the same cylinder
group; in the extreme case, we have a single large cylinder group, as in s5fs. The rules that begin
subdirectories in different groups and that break large files prevent such a scenario.
This implementation is highly effective when the disk has plenty of free space, but deterio-
rates rapidly once the disk is about 90% full. When there are very few free blocks, it becomes diffi-
cult to find free blocks in optimal locations. Thus the file system maintains a free space reserve pa-
rameter, usually set at 10%. Only the superuser can allocate space from this reserve.
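As an illustration, the directory-placement rule described above (choose, among cylinder groups with an above-average free inode count, the one with the fewest directories) might be implemented along the following lines; the structure and field names summarize the per-cylinder-group bookkeeping and are illustrative.

/* Pick a cylinder group for a new directory: among groups whose free
 * inode count is at least the file system average, choose the one with
 * the fewest existing directories. */
struct cg_summary {
    int cg_nifree;                     /* free inodes in this group */
    int cg_ndir;                       /* directories in this group */
};

int pick_dir_group(const struct cg_summary *cg, int ncg)
{
    long total_free = 0;
    int  i, best = -1;

    for (i = 0; i < ncg; i++)
        total_free += cg[i].cg_nifree;

    for (i = 0; i < ncg; i++) {
        if ((long)cg[i].cg_nifree * ncg < total_free)
            continue;                  /* below-average group: skip it */
        if (best < 0 || cg[i].cg_ndir < cg[best].cg_ndir)
            best = i;
    }
    return best;                       /* index of the chosen group */
}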
6 The number of direct blocks in the disk address array was increased from 10 to 12 in FFS.
(Figure: FFS directory entries; each variable-length entry holds an inode number, the allocation size of the entry, the name length, and the name itself, padded as needed.)
Symbolic links
Symbolic links (see Section 8.4.1) address many limitations of hard links. A symbolic link is a file
that points to another file, called the target of the link. The type field in the inode identifies the file
as a symbolic link, and the file data is simply the pathname of the target file. The pathname may be
absolute or relative. The pathname traversal routine recognizes and interprets symbolic links. If the
name is relative, it is interpreted relative to the directory containing the link. Although symbolic
link handling is transparent to most programs, some utilities need to detect and handle symbolic
links. They can use the lstat system call, which does not translate the final symbolic link in the
pathname, and the readlink call, which returns the contents (target) of the link. Symbolic links are
created by the symlink system call.
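For example, the following program creates a symbolic link and reads back its target with readlink; note that readlink does not NULL-terminate the result.

/* Create a symbolic link and read back its target without following it. */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    char target[1024];
    ssize_t n;

    if (symlink("/etc/motd", "motd.link") != 0)
        perror("symlink");             /* may already exist; keep going */

    n = readlink("motd.link", target, sizeof(target) - 1);
    if (n < 0) {
        perror("readlink");
        return 1;
    }
    target[n] = '\0';                  /* readlink does not add the NULL */
    printf("motd.link -> %s\n", target);
    return 0;
}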
Other enhancements
4.2BSD added a rename system call to allow atomic renames of files and directories, which previ-
ously required a link followed by an unlink. It added a quota mechanism to limit the file system re-
sources available to any user. Quotas apply to both inodes and disk blocks and have a soft limit that
triggers a warning, along with a hard limit that the kernel enforces.
Some of these features have been subsequently incorporated into s5fs. In SVR4, s5fs allows
symbolic links and supports atomic renaming. It does not, however, support long filenames or disk
quotas.
9.9 Analysis
The performance gains of FFS are substantial. Measurements on a VAX/750 with a UNIBUS
adapter [Krid 83] show that read throughput increases from 29 kilobytes/sec in s5fs (with 1-kilobyte
blocks) to 221 kilobytes/sec in FFS (4-kilobyte blocks, 1-kilobyte fragments), and CPU utilization
increases from 11% to 43%. With the same configurations, write throughput increased from 48 to
142 kilobytes/sec, and CPU utilization from 29% to 43%.
It is also important to examine disk space wastage. The average wastage in the data blocks is
half a block per file in s5fs, and half a fragment per file in FFS. If the fragment size in FFS equals
the block size in s5fs, this factor will even out. The advantage of having large blocks is that less
space is required to map all the blocks of a large file. Thus the file system requires few indirect
blocks. In contrast, more space is required to monitor the free blocks and fragments. These two
factors also tend to cancel out, and the net result of disk utilization is about the same when the new
fragment size equals the old block size.
The free space reserve, however, must be counted as wasted space, because it is not avail-
able to user files. When this is factored in, the percentage of waste in an s5fs with 1K blocks ap-
proximately equals that in an FFS with 4K blocks, 512-byte fragments, and the free space reserve
set at 5%.
The disk layout described in Figure 9-7 is obsolete for many newer disks. Modern SCSI
(Small Computer Systems Interface) disks [ANSI 92] do not have fixed-size cylinders. They take
advantage of the fact that outer tracks can hold more data than inner ones and divide the disk into
several zones. Within each zone, each track has the same number of sectors.
FFS is oblivious to this, and hence users are forced to use completely fictional cylinder
sizes. To support the FFS notion of equal-sized tracks, vendors usually take the total number of 512-
byte sectors on the disk and factor it into a number of tracks and sectors per track. This factoring is
performed in a convenient way that does not particularly resemble the physical characteristics of the
drive. As a result, the careful rotational placement optimizations of FFS accomplish very little, and
may hurt performance in many cases. Grouping the cylinders is still useful, because blocks on
nearby cylinders, as seen by FFS, are still located on nearby tracks on the disks.
Overall, FFS provides great benefits, which are responsible for its wide acceptance. System
V UNIX also added FFS as a supported file system type in SVR4. Moreover, SVR4 incorporated
many features of FFS into s5fs. Hence, in SVR4, s5fs also supports symbolic links, shared and ex-
clusive file locking, and the rename system call.
Although FFS is a substantial improvement over s5fs, it is far from being the last word in
file systems. There are several ways to improve performance further. One way is by chaining kernel
buffers together, so that several buffers can be read or written in a single disk operation. This would
require modifying all disk drivers. Another possibility is to pre-allocate several blocks to rapidly
growing files, releasing the unused ones when closing the file. Other important approaches, includ-
ing log-structured and extent-based file systems, are described in Chapter 11.
FFS itself has had several enhancements since it was first introduced. 4.3BSD added two
types of caching to speed up name lookups [McKu 85]. First, it uses a hint-based directory name
lookup cache. This cache shows a hit rate of about 70% of all name translations. When FFS was
ported to SVR4, the implementors moved this cache out of the file-system-dependent code and
made it a global resource available to all file systems. They also changed its implementation so that
it held references (instead of hints) to the cached files. Section 8.10.2 discusses the directory name
lookup cache in greater detail.
Second, each process caches the directory offset of the last component of the most recently
translated pathname. If the next translation is for a file in the same directory, the search begins at
this point instead of at the top of the directory. This is helpful in cases where a process is scanning a
directory sequentially, which accounts for about 10-15% of name lookups. The SVR4 implementa-
tion moved this cached offset into the in-core inode. This allows FFS to cache an offset for each di-
rectory, instead of one per process. On the other hand, if multiple processes are concur-
rently using the same directory, its cached offset is unlikely to be useful.
file-system private data), wakes up the mount process, and sleeps while the request is serviced. The
mount process satisfies the request by copying the data from or to the appropriate portion of its ad-
dress space, and awakens the caller.
Since the file system is in the virtual memory of the mount process, it can be paged out like
any other data, by the standard memory management mechanisms. The pages of the mfs files com-
pete with all the other processes for physical memory. Pages that are not actively referenced are
written to the swap area, and must be faulted in if needed later. This allows the system to support a
temporary file system that may be much larger than the physical memory.
Although this file system is substantially faster than an on-disk system, it has several draw-
backs, largely due to the limitations of the BSD memory architecture. Using a separate process to
handle all I/O requires two context switches for each operation. The file system still resides in a
separate (albeit virtual) address space, which means we still need extra in-memory copy operations.
The format of the file system is the same as that of FFS, even though concepts such as cylinder
groups are meaningless for a memory-based system.
(Figure: a struct vnode pointing to a struct tmpnode, which refers to the file's page in memory.)
The ufs routines that write blocks to disk (including metadata) are modified to first check this flag. If the flag is set, these routines defer the write by
simply marking the buffer as dirty. This approach has several advantages. It does not use a separate
RAM disk, and hence avoids the overhead in space and time to maintain two in-memory copies of
the blocks. Its performance measurements are impressive. The main drawback is that it requires
changes to several ufs routines, so it cannot be added easily to an existing kernel without access to
the ufs source code. On a system that supports multiple file system types, each implementation
would have to be modified to use the delay mount option.
We run into problems, however, when multiple device files refer to the same underlying
device. When different users access a device using different file names, the kernel must synchronize
access to the device. For block devices, it must also ensure consistency of copies of its blocks in the
buffer cache. Clearly, the kernel needs to be aware of which different vnodes actually represent the
same device.
The specfs layer creates a shadow vnode for each device file. Its file-system-dependent data
structure is called an snode. Lookup operations to the device file return a pointer to the shadow
vnode instead of the real vnode. The real vnode can be obtained, if necessary, by the VOP_REALVP
operation. The snode has a field called s_commonvp, which points to a common vnode (associated
with another snode) for that device. There is only one common vnode for each device (identified by
the device number and type), and multiple snodes may point to it. All operations that require syn-
chronization, as well as block device reads and writes, are routed through this common vnode. Sec-
tion 16.4 describes the implementation in greater detail.
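The relationship can be sketched as follows. Only s_commonvp is named in the text; the remaining type and field names are illustrative.

/* Sketch of the specfs arrangement: each device file has its own shadow
 * vnode and snode, but every snode for the same device points to a single
 * common vnode, through which synchronization and block I/O are routed. */
struct vnode_sk;

struct snode_sk {
    struct vnode_sk *s_realvp;         /* vnode of the underlying device file */
    struct vnode_sk *s_commonvp;       /* common vnode shared by all aliases */
    unsigned long    s_dev;            /* device number and type */
};

struct vnode_sk {
    void *v_data;                      /* points back to the owning snode */
};

/* Two device files name the same device exactly when their snodes share
 * the same common vnode. */
int same_device(const struct snode_sk *a, const struct snode_sk *b)
{
    return a->s_commonvp == b->s_commonvp;
}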
map This read-only file describes the virtual address map of the process. It
contains an array of prmap structures, each element of which describes a
single contiguous address range in the process. Process address maps are
explained in Section 14.4.3.
as This read-write file maps the virtual address space of the process. Any ad-
dress in the process may be accessed by lseeking to that offset in this file,
and then performing a read or write (see the sketch following this list).
sigact This read-only file contains signal handling information. The file contains
an array of sigaction structures (see Section 4.5), one for each signal.
cred This read-only file contains the user credentials of the process. Its format
is defined by struct prcred.
object This directory contains one file for each object mapped into the address
space of the process (see Section 14.2). A user can get a file descriptor for
the object by opening the corresponding file.
lwp This directory contains one subdirectory for each LWP (see Chapter 3) of
the process. Each subdirectory contains three files (lwpstatus, lwpsinfo,
and lwpctl), which provide per-LWP status and control operations, similar
to the status, psinfo, and ctl files, respectively.
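The following sketch reads part of a target process's address space through its as file, as described in the list above. The /proc/<pid>/as path follows the SVR4-style layout discussed here and differs across UNIX variants, so this is illustrative rather than portable.

/* Read len bytes at virtual address addr in process pid via /proc. */
#include <stdio.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

int read_target(pid_t pid, unsigned long addr, void *buf, size_t len)
{
    char path[64];
    int  fd;

    snprintf(path, sizeof(path), "/proc/%ld/as", (long)pid);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Seek to the virtual address of interest and read from it. */
    if (lseek(fd, (off_t)addr, SEEK_SET) == (off_t)-1 ||
        read(fd, buf, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}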
It is important to note that these are not physical files with real storage. They merely provide
an interface to the process. Operations on these files are translated by the /proc file system to ap-
propriate actions on the target process or its address space. Several users may open a /proc file con-
currently. The O_EXCL flag provides advisory locking when opening an as, ctl, or lwpctl file for
writing. The ctl and lwpctl files provide several control and status operations, including the follow-
ing:
PCSTOP Stops all LWPs of the process.
PCWSTOP Waits for all LWPs of the process to stop.
PCRUN Resumes a stopped LWP. Additional actions may be specified by optional
flags, such as PRCSIG to clear the current signal, or PRSTEP to single-step
the process.
PCKILL Sends a specified signal to the process.
PCSENTRY Instructs the LWP to stop on entry to specified system calls.
PCSEXIT Instructs the LWP to stop on exit from specified system calls.
There is no explicit support for breakpoints. They may be implemented simply by using the
write system call to deposit a breakpoint instruction at any point in the text segment. Most systems
designate an approved breakpoint instruction. Alternatively, we could use any illegal instruction that
causes a trap to the kernel.
The /proc interface provides mechanisms to handle children of a target process. The debug-
ger can set an inherit-on-fork flag in the target process and monitor exits from fork and vfork calls.
This causes both parent and child to stop on return from fork. When the parent stops, the debugger
can examine the return value from fork to determine the child PID and open the /proc files of the
child. Since the child stops before it returns from fork, the debugger has complete control over it
from that point.
This interface has allowed the development of several sophisticated debuggers and profilers.
For instance, /proc allows dbx to attach to and detach from running programs. The implementation
works correctly with /proc files on remote machines accessed via RFS [Rifk 86]. This allows appli-
cations to debug and control remote and local processes in an identical manner. The ptrace system call
has become obsolete and unnecessary. Several other commands, notably ps, have been reimple-
mented to use /proc. A generalized data watchpoint facility has evolved based on the VM system's
ability to dynamically change protections on memory pages.
TFS provides these facilities using copy-on-write semantics for the file system. Files from
the shared hierarchy are copied to the user's private hierarchy as they are modified. To achieve this,
a TFS directory is composed of several layers, where each layer is a physical directory. The layers
are joined by hidden files called searchlinks, which contain the directory name of the next layer.
Each layer is like a revision of the directory, and the front layer is the newest revision.
The files seen in a TFS directory are the union of the files in all the layers. The latest revi-
sion of a file is accessed by default (layers are searched front to back). If an earlier version is de-
sired, it is necessary to explicitly follow the searchlinks chain. This can be done at the user level,
since each layer is merely a directory. Copy-on-write is implemented by making all layers except
the front layer read-only; a file in another layer must be copied to the front layer before it can be
modified.
TFS performance suffers because each lookup may have to search several layers (the number
of layers can become quite large in a typical environment). TFS addresses this problem by aggres-
sively using name lookup caches. It also provides facilities for variant layers corresponding to dif-
ferent machine architectures, because object files are different for each variant. User programs do
not need to be changed to access TFS files. The system administrator must perform some initial
setup to take advantage of TFS. Although TFS was initially designed to run as an NFS server, it has
since been changed to directly use the vnode/vfs interface.
Section 11.12.1 describes the union mount file system in 4.4BSD, which provides similar
functionality but is based on the 4.4BSD stackable vnode interface.
7 Certain metadata updates are written back synchronously, as described in Section 9.12.5.
The second involves a dirty buffer that reaches the head of the list, at which time it is removed from
the list and put on the disk driver's write queue. When the write completes, the buffer is marked as
clean and can be returned to the free list. Because it had already reached the list head without being
accessed again, it is returned to the head of the list instead of the end.
Each buffer is represented by a buffer header. The kernel uses the header to identify and locate the
buffer, synchronize access to it, and to perform cache management. The header also serves as the
interface to the disk driver. When the kernel wants to read or write the buffer from or to the disk, it
loads the parameters of the I/0 operation in the header and passes the header to the disk driver. The
header contains all the information required for the disk operation. Table 9-2 lists the important
fields of the struct buf, which represents the buffer header.
The b_flags field is a bitmask of several flags. The kernel uses the B_BUSY and B_WANTED flags to
synchronize access to the buffer. B_DELWRI marks a buffer as dirty. The flags used by the disk driver
include B_READ, B_WRITE, B_ASYNC, B_DONE, and B_ERROR. The B_AGE flag indicates an aged buffer
that is a good candidate for reuse.
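A simplified sketch of the header and its flag bits follows; only the flag names come from the text, and the bit values and remaining fields are illustrative (the real struct buf also carries the device number, byte count, driver linkage, and more).

#define B_BUSY    0x0001               /* buffer is locked */
#define B_WANTED  0x0002               /* a process is waiting for the lock */
#define B_DELWRI  0x0004               /* dirty; delayed write pending */
#define B_READ    0x0008               /* I/O direction is read (write if clear) */
#define B_ASYNC   0x0010               /* do not wait for I/O completion */
#define B_DONE    0x0020               /* I/O has completed */
#define B_ERROR   0x0040               /* I/O failed */
#define B_AGE     0x0080               /* aged buffer, good candidate for reuse */

struct buf_sk {
    int   b_flags;                     /* bitmask of the flags above */
    long  b_blkno;                     /* block number on the device */
    char *b_addr;                      /* memory address of the data */
};

/* Mark a buffer dirty without writing it: a delayed write. */
void mark_delayed_write(struct buf_sk *bp)
{
    bp->b_flags |= B_DELWRI;
}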
9.12.3 Advantages
The primary motivation for the buffer cache is to reduce disk traffic and eliminate unnecessary disk
I/O, and it achieves this effectively. Well-tuned caches report hit rates of up to 90% [Oust 85].
There are also several other advantages. The buffer cache synchronizes access to disk blocks
through the locked and wanted flags. If two processes try to access the same block, only one will be
able to lock it. The buffer cache offers a modular interface between the disk driver and the rest of
the kernel. No other part of the kernel can access the disk driver, and the entire interface is encapsu-
lated in the fields of the buffer header. Moreover, the buffer cache insulates the rest of the kernel
from the alignment requirements of disk I/O, since the buffers themselves are page aligned. There
are no problems of arbitrary disk I/O requests to possibly unaligned kernel addresses.
9.12.4 Disadvantages
Despite the tremendous advantages, there are some important drawbacks of the buffer cache. First,
the write-behind nature of the cache means that data may be lost if the system crashes. This could
also leave the disk in an inconsistent state. This issue is further explored in Section 9.12.5. Second,
although reducing disk access greatly improves performance, the data must be copied twice-first
from the disk to the buffer, then from the buffer to the user address space. The second copy is orders
of magnitude faster than the first, and normally the savings in disk access more than compensate for
the expense of the additional memory-to-memory copy. It can become an important factor, however,
when sequentially reading or writing a very large file. In fact, such an operation creates a related
problem called cache wiping. If a large file is read end-to-end and then not accessed again, it has the
effect of flushing the cache. Since all blocks of this file are read in a short period of time, they con-
sume all the buffers in the cache, flushing out the data that was in them. This causes a large number
of cache misses for a while, slowing down the system until the cache is again populated with a more
useful set of blocks. Cache wiping can be avoided if the user can predict it. The Veritas File System
(VxFS), for example, allows users to provide hints as to how a file will be accessed. Using this fea-
ture, a user can disable caching of large files and ask the file system to transfer the data directly
from the disk to user address space.
the directory entry would point to an unallocated inode (or one that is reassigned to another file).
Such damage to the file system must be avoided.
There are two ways in which UNIX tries to prevent such corruption. First, the kernel
chooses an order of metadata writes that minimizes the impact of a system crash. In the previous
example, consider the effect of reversing the order of the writes. Now suppose the system were to
crash with the inode updated but not the directory. When it reboots, this file has an extra link, but
the original directory entry is valid and the file can be accessed without any problems. If someone
were to delete the file, the directory entry would go away, but the inode and data blocks would not
be freed because the link count is still one. Although this does not prevent corruption, it causes less
severe damage than the earlier order.
Thus the order of metadata writes must be carefully chosen. The problem of enforcing the
order still remains, since the disk driver does not service the requests in the order that they are re-
ceived. The only way the kernel can order the writes is to make them synchronous. Hence in the
above case, the kernel will write the inode to the disk, wait until the write completes, and then issue
the directory write. The kernel uses such synchronous metadata writes in many operations that re-
quire modifying more than one related object [Bach 86].
The second way of combating file system corruption is the fsck (file system check) utility
[Kowa 78, Bina 89]. This program examines a file system, looks for inconsistencies, and repairs
them if possible. When the correction is not obvious, it prompts the user for instructions. By default,
the system administrator runs fsck each time the system reboots and may also run it manually at any
time. fsck uses the raw interface to the disk driver to access the file system. It is further described in
Section 11.2.4.
9.13 Summary
The vnode/vfs interface allows multiple file systems to coexist on a machine. In this chapter, we
have described the implementation of several file systems. We began with the two most popular lo-
cal file systems-s5fs and FFS. We then described several special-purpose file systems that took
advantage of the special properties of the vnode/vfs interface to provide useful functionality. Fi-
nally, we discussed the buffer cache, which is a global resource shared by all file systems.
In the following chapters, we describe many other file systems. The next chapter discusses
distributed file systems-in particular, NFS, RFS, AFS, and DFS. Chapter 11 deals with advanced
and experimental file systems that use techniques such as logging to provide better functionality and
performance.
9.14 Exercises
1. Why do s5fs and FFS have a fixed number of on-disk inodes in each file system?
2. Why is the inode separate from the directory entry of the file?
3. What are the advantages and drawbacks of having each file allocated contiguously on disk?
Which applications are likely to desire such a file system?
9.15 References
[ANSI 92] American National Standard for Information Systems, Small Computer Systems
Interface-2 (SCSI-2), X3.131-199X, Feb. 1992.
[Bach 86] Bach, M.J., The Design of the UNIX Operating System, Prentice-Hall, Englewood
Cliffs, NJ, 1986.
[Bark 90] Barkley, R.E., and Lee, T.P., "A Dynamic File System Inode Allocation and Reclaim
Policy," Proceedings of the Winter 1990 USENIX Technical Conference, Jan. 1990,
pp. 1-9.
[Bina 89] Bina, E.J., and Emrath, P.A., "A Faster fsck for BSD UNIX," Proceedings of the
Winter 1989 USENIX Technical Conference, Jan. 1989, pp. 173-185.
[Faul 91] Faulkner, R. and Gomes, R., "The Process File System and Process Model in UNIX
System V," Proceedings of the 1991 Winter USENIX Conference, Jan. 1991, pp.
243-252.
[Gaed 82] Gaede, S., "A Scaling Technique for Comparing Interactive System Capacities,"
Conference Proceedings of CMG XIII, Dec. 1982, pp. 62-67.
[Ging 87] Gingell, R.A., Moran, J.P., and Shannon, W.A., "Virtual Memory Architecture in
SunOS," Proceedings of the Summer 1987 USENIX Technical Conference, Jun.
1987, pp. 81-94.
[Hend 90] Hendricks, D., "A FileSystem for Software Development," Proceedings of the
Summer 1990 USENIX Technical Conference, Jun. 1990, pp. 333-340.
[Klei 86] Kleiman, S.R., "Vnodes: An Architecture for Multiple File System Types in Sun
UNIX," Proceedings of the Summer 1986 USENIX Technical Conference, Jun. 1986,
pp. 238-247.
[Kowa 78] Kowalski, T., "FSCK-The UNIX System Check Program," Bell Laboratory,
Murray Hill, N.J. 07974, Mar. 1978.
[Krid 83] Kridle, R., and McKusick, M., "Performance Effects of Disk Subsystem Choices for
VAX Systems Running 4.2BSD UNIX," Technical Report No. 8, Computer Systems
Research Group, Dept. of EECS, University of California at Berkeley, CA, 1983.
[Leff 89] Leffler, S.J., McKusick, M.K., Karels, M.J., and Quarterman, J.S., The Design and
Implementation of the 4.3BSD UNIX Operating System, Addison-Wesley, Reading,
MA, 1989.
[McKu 84] McKusick, M.K., Joy, W.N., Leffler, S.J., and Fabry, R.S., "A Fast File System for
UNIX," ACM Transactions on Computer Systems, vol. 2, (Aug. 1984), pp. 181-197.
[McKu 85] McKusick, M.K., Karels, M., and Leffler, S.J., "Performance Improvements and
Functional Enhancements in 4.3BSD," Proceedings of the Summer 1985 USENIX
Conference, Jun. 1985, pp. 519-531.
[McKu 90] McKusick, M.K., Karels, M.K., and Bostic, K., "A Pageable Memory Based
Filesystem," Proceedings of the Summer 1990 USENIX Technical Conference, Jun.
1990.
[Nadk 92] Nadkarni, A.V., "The Processor File System in UNIX SVR4.2," Proceedings of the
1992 USENIX Workshop on File Systems, May 1992, pp. 131-132.
[Ohta 90] Ohta, M. and Tezuka, H., "A Fast /tmp File System by Delay Mount Option,"
Proceedings of the Summer 1990 USENIX Conference, Jun. 1990, pp. 145-149.
[Oust 85] Ousterhout, J.K., Da Costa, H., Harrison, D., Kunze, J.A., Kupfer, M. and
Thompson, J.G., "A Trace-Driven Analysis of the UNIX 4.2 BSD File System,"
Proceedings of the Tenth Symposium on Operating System Principles, Dec. 1985, pp.
15-24.
[Rifk 86] Rifkin, A.P., Forbes, M.P., Hamilton, R.L., Sabrio, M., Shah, S., and Yueh, K., "RFS
Architectural Overview," Proceedings of the Summer 1986 USENIX Technical
Conference, Jun. 1986, pp. 248-259.
[Salu 94] Salus, P.H., A Quarter Century of UNIX, Addison-Wesley, Reading, MA, 1994.
[Saty 81] Satyanarayanan, M., "A Study of File Sizes and Functional Lifetimes," Proceedings of
the Eighth Symposium on Operating Systems Principles, 1981, pp. 96-108.
[Snyd 90] Snyder, P., "tmpfs: A Virtual Memory File System," Proceedings of the Autumn
1990 European UNIX Users' Group Conference, Oct. 1990, pp. 241-248.
[Thom 78] Thompson, K., "UNIX Implementation," The Bell System Technical Journal, Jul.-
Aug. 1978, Vol. 57, No. 6, Part 2, pp. 1931-1946.
10
Distributed File Systems
10.1 Introduction
Since the 1970s, the ability to connect computers to each other on a network has revolutionized the
computer industry. The increase in network connectivity has fueled a desire to share files between
different computers. The early efforts in this direction were restricted to copying entire files from
one machine to another, such as the UNIX-to-UNIX copy (uucp) program [Nowi 90] and File
Transfer Protocol (ftp) [Post 85]. Such solutions, however, do not come close to fulfilling the vision
of being able to access files on remote machines as though they were on local disks.
The mid-1980s saw the emergence of several distributed file systems that allow transparent
access to remote files over a network. These include the Network File System (NFS) from Sun Mi-
crosystems [Sand 85a], the Remote File Sharing system (RFS) from AT&T [Rifk 86], and the
Andrew File System (AFS) from Carnegie-Mellon University [Saty 85]. All three are sharply differ-
ent in their design goals, architecture, and semantics, even though they try to solve the same funda-
mental problem. Today, RFS is available on many System V-based systems. NFS has gained much
wider acceptance and is available on numerous UNIX and non-UNIX systems. AFS development
has passed on to Transarc Corporation, where it has evolved into the Distributed File System (DFS)
component of Open Software Foundation's Distributed Computing Environment (DCE).
This chapter begins by discussing the characteristics of distributed file systems. It then de-
scribes the design and implementation of each of the above-mentioned file systems and examines
their strengths and weaknesses.
• Stateful or stateless operation - A stateful server is one that retains information about
client operations between requests and uses this state information to service subsequent
requests correctly. Requests such as open and seek are inherently stateful, since someone
must remember which files a client has opened, as well as the seek offset in each open file.
In a stateless system, each request is self-contained, and the server maintains no persistent
state about the clients. For instance, instead of maintaining a seek offset, the server may
require the client to specify the offset for each read or write. Stateful servers are faster,
since the server can take advantage of its knowledge of client state to eliminate a lot of
network traffic. However, they have complex consistency and crash recovery mechanisms.
Stateless servers are simpler to design and implement, but do not yield as good perform-
ance.
• Semantics of sharing - The distributed file system must define the semantics that apply
when multiple clients access a file concurrently. UNIX semantics require that changes
made by one client be visible to all other clients when they issue the next read or write
system call. Some file systems provide session semantics, where the changes are propa-
gated to other clients at the open and close system call granularity. Some provide even
weaker guarantees, such as a time interval that must elapse before the changes are certain
to have propagated to other clients.
• Remote access methods - A pure client-server model uses the remote service method of
file access, wherein each action is initiated by the client, and the server is simply an agent
that does the client's bidding. In many distributed systems, particularly stateful ones, the
server plays a much more active role. It not only services client requests, but also partici-
pates in cache coherency mechanisms, notifying clients whenever their cached data is in-
valid.
We now look at the distributed file systems that are popular in the UNIX world, and see how
they deal with these issues.
1 Different UNIX variants have their own rules governing the granularity of the exports. Some may only allow an
entire file system to be exported, whereas some may permit only one subtree per file system.
2 Digital's ULTRIX, for instance, allows any user to mount an NFS file system so long as the user has write permis-
sion to the mount point directory.
3 Not all implementations allow this.
The design of NFS was guided by several goals:
• NFS should not be restricted to UNIX. Any operating system should be able to implement
an NFS server or client.
• The protocol should not be dependent on any particular hardware.
• There should be simple recovery mechanisms from server or client crashes.
• Applications should be able to access remote files transparently, without using special
pathnames or libraries and without recompiling.
• UNIX file system semantics must be maintained for UNIX clients.
• NFS performance must be comparable to that of a local disk.
• The implementation must be transport-independent.
• The NFS protocol defines the set of requests that clients may send to the server. This chapter concentrates on version 2 of the protocol; version 3, which was published in 1993 and has been implemented by a number of vendors, is described in Section 10.10. Table 10-1 enumerates the complete set of NFSv2 requests.
• The Remote Procedure Call (RPC) protocol defines the format of all interactions between
the client and the server. Each NFS request is sent as an RPC packet.
• The Extended Data Representation (XDR) provides a machine-independent method of en-
coding data to send over the network. All RPC requests use XDR encoding to pass data.
Note that XDR and RPC are used for many other services besides NFS.
• The NFS server code is responsible for processing all client requests and providing access
to the exported file systems.
• The NFS client code implements all client system calls on remote files by sending one or
more RPC requests to the server.
• The Mount protocol defines the semantics for mounting and unmounting NFS file sys-
tems. Table 10-2 contains a brief description of the protocol.
• Several daemon processes are used by NFS. On the server, a set of nfsd daemons listen for
and respond to client NFS requests, and the mountd daemon handles mount requests. On
the client, a set of biod daemons handles asynchronous I/O for blocks of NFS files.
• The Network Lock Manager (NLM) and the Network Status Monitor (NSM) together pro-
vide the facilities for locking files over a network. These facilities, while not formally tied
to NFS, are found on most NFS implementations and provide services not possible in the
base protocol. NLM and NSM implement the server functionality via the lockd and statd
daemons, respectively.
10.3.4 Statelessness
The single most important characteristic of the NFS protocol is that the server is stateless and does
not need to maintain any information about its clients to operate correctly. Each request is com-
pletely independent of others and contains all the information required to process it. The server need
not maintain any record of past requests from clients, except optionally for caching or statistics
gathering purposes.
For example, the NFS protocol does not provide requests to open or close a file, since that
would constitute state information that the server must remember. For the same reason, the READ
and WRITE requests pass the starting offset as a parameter, unlike read and write operations on local
files, which obtain the offset from the open file object (see Section 8.2.3). 5
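A minimal sketch of the difference follows, using hypothetical structure names (the real NFSv2 argument layouts are defined in XDR in the protocol specification):

    /* Sketch: a stateless READ request carries everything the server needs. */
    struct nfs_read_args {
        unsigned char fh[32];      /* file handle identifying the file            */
        unsigned int  offset;      /* starting offset, supplied on every request  */
        unsigned int  count;       /* number of bytes to read                     */
    };

    /* A local read, by contrast, takes its offset from the open file object:
     *     ssize_t read(int fd, void *buf, size_t count);
     */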
A stateless protocol makes crash recovery simple. No recovery is required when a client
crashes, since the server maintains no persistent information about the client. When the client re-
boots, it may remount the file systems and start up applications that access the remote files. The
server neither needs to know nor cares about the client crashing.
When a server crashes, the client finds that its requests are not receiving a response. It then
continues to resend the requests until the server reboots. 6 At that time, the server will receive the
requests and can process them since the requests did not depend on any prior state information.
When the server finally replies to the requests, the client stops retransmitting them. The client has
no way to determine if the server crashed and rebooted or was simply slow.
Stateful protocols, however, require complex crash-recovery mechanisms. The server must
detect client crashes and discard any state maintained for that client. When a server crashes and re-
boots, it must notify the clients so that they can rebuild their state on the server.
A major problem with statelessness is that the server must commit all modifications to stable
storage before replying to a request. This means that not only file data, but also any metadata such
as inodes or indirect blocks must be flushed to disk before returning results. Otherwise, a server
crash might lose data that the client believes has been successfully written out to disk. (A system
5 Some systems provide pread and pwrite calls, which accept the offset as an argument. This is particularly useful for
multithreaded systems (see Section 3.3.2).
6 This is true only for hard mounts (which are usually the default). For soft mounts, the client gives up after a while
and returns an error to the application.
crash can lose data even on a local file system, but in that case the users are aware of the crash and
of the possibility of data loss.) Statelessness also has other drawbacks. It requires a separate protocol
(NLM) to provide file locking. Also, to address the performance problems of synchronous writes,
most clients cache data and metadata locally. This compromises the consistency guarantees of the
protocol, as discussed in detail in Section 10.7.2.
The primary protocols in the NFS suite are RPC, NFS, and Mount. They all use XDR for data en-
coding. Other related protocols are the NLM, NSM, and the portmapper. This section describes
XDR and RPC.
Programs that deal with network-based communications between computers have to worry about
several issues regarding the interpretation of data transferred over the network. Since the computers
at each end might have very different hardware architectures and operating systems, they may have
different notions about the internal representation of data elements. These differences include byte
ordering, sizes of data types such as integers, and the format of strings and arrays. Such issues are
irrelevant for communication within a single machine or even between two like machines, but must
be resolved for heterogeneous environments.
Data transmitted between computers can be divided into two categories--opaque and typed.
Opaque, or byte-stream, transfers may occur, for example, in file transfers or modem communica-
tions. The receiver simply treats the data as a stream of bytes and makes no attempt to interpret it.
Typed data, however, is interpreted by the receiver, and this requires that the sender and receiver
agree on its format. For instance, a little-endian machine may send out a two-byte integer with a
value of 0x0103 (259 in decimal). If this is received by a big-endian machine, it would (in the absence
of prior conventions) be interpreted as 0x0301 (decimal 769). Obviously, these two machines will
not be able to understand each other.
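The following fragment makes the problem concrete; it is purely illustrative and simply shows how the same two bytes are interpreted under each convention:

    #include <stdio.h>

    int main(void)
    {
        /* the two bytes as a little-endian sender emits them for 0x0103 */
        unsigned char buf[2] = { 0x03, 0x01 };

        unsigned int little = buf[0] | (buf[1] << 8);   /* 0x0103 = 259 */
        unsigned int big    = (buf[0] << 8) | buf[1];   /* 0x0301 = 769 */

        printf("as little-endian: %u, as big-endian: %u\n", little, big);
        return 0;
    }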
The XDR standard [Sun 87] defines a machine-independent representation for data trans-
mission over a network. It defines several basic data types and rules for constructing complex data
types. Since it was introduced by Sun Microsystems, it has been influenced by the Motorola 680x0
architecture (the Sun-2 and Sun-3 workstations were based on the 680x0 hardware) in issues such as
byte ordering. Some of the basic definitions of XDR are as follows:
• Integers are 32-bit entities, with byte 0 (numbering the bytes from left to right) represent-
ing the most significant byte. Signed integers are represented in two's complement nota-
tion.
• Variable-length opaque data is described by a length field (which is a four-byte integer),
followed by the data itself. The data is NULL-padded to a four-byte boundary. The length
field is omitted for fixed-length opaque data.
• Strings are represented by a length field followed by the ASCII bytes of the string,
NULL-padded to a four-byte boundary. If the string length is an exact multiple of four,
there is no (UNIX-style) trailing NULL byte.
• Arrays of homogeneous elements are encoded by a size field followed by the array ele-
ments in their natural order. The size field is a four-byte integer and is omitted for fixed-
size arrays. Each element's size must be a multiple of four bytes. While the elements must
be of the same type, they may have different sizes, for instance, in an array of strings.
• Structures are represented by encoding their components in their natural order. Each
component is padded to a four-byte boundary.
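The sketch below shows how a string would be laid out under these rules. It is a hand-rolled illustration, not a routine of the Sun XDR library (such code is normally generated by rpcgen or invoked through functions such as xdr_string()):

    #include <string.h>

    /* Encode a string as XDR: a four-byte big-endian length, the ASCII bytes,
     * then NULL padding up to a four-byte boundary. Returns the bytes written.
     * The caller must supply a buffer of at least 4 + strlen(s), rounded up. */
    static size_t xdr_put_string(unsigned char *buf, const char *s)
    {
        size_t len = strlen(s);
        size_t padded = (len + 3) & ~(size_t)3;   /* round up to a multiple of 4 */

        buf[0] = (len >> 24) & 0xff;              /* most significant byte first */
        buf[1] = (len >> 16) & 0xff;
        buf[2] = (len >>  8) & 0xff;
        buf[3] =  len        & 0xff;
        memcpy(buf + 4, s, len);
        memset(buf + 4 + len, 0, padded - len);   /* NULL padding */
        return 4 + padded;
    }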
Figure 10-2 illustrates some examples of XDR encoding. In addition to this set of defini-
tions, XDR also provides a formal language specification to describe data. The RPC specification
language, described in the next section, simply extends the XDR language. Likewise, the rpcgen
compiler understands the XDR specification and generates routines that encode and decode data in
XDR form.
XDR forms a universal language for communication between arbitrary computers. Its major
drawback is the performance penalty paid by computers whose natural data representation semantics
do not match well with that of XDR. Such computers must perform expensive conversion opera-
tions for each data element transmitted. This is most wasteful when the two computers themselves
are of like type and do not need data encoding when communicating with each other.
For instance, consider two VAX-11 machines using a protocol that relies on XDR encoding.
Since the VAX is little-endian (byte 0 is least significant byte), the sender would have to convert
each integer to big-endian form (mandated by XDR), and the receiver would have to convert it back
to little-endian form. This wasteful exercise could be prevented if the representation provided some
means of communicating the machine characteristics, so that conversions were required only for
unlike machines. DCE RPC [OSF 92] uses such an encoding scheme in place ofXDR.
There are several different RPC implementations. NFS uses the RPC protocol introduced by Sun Microsystems [Sun 88], which is known as Sun RPC or ONC-RPC. (ONC stands for Open Network Computing.) Throughout this book, the term RPC refers to Sun RPC, except when explicitly stated otherwise. The only other RPC facility mentioned in the book is that of OSF's Distributed Computing Environment, which is referred to as DCE RPC.
Unlike DCE RPC, which provides synchronous and asynchronous operations, Sun RPC uses
synchronous requests only. When a client makes an RPC request, the calling process blocks until it
receives a response. This makes the behavior of the RPC similar to that of a local procedure call.
The RPC protocol provides reliable transmission of requests, meaning it must ensure that a
request gets to its destination and that a reply is received. Although RPC is fundamentally transport-
independent, it is often implemented on top of UDP/IP (User Datagram Protocol/Internet Protocol),
which is inherently unreliable. The RPC layer implements a reliable datagram service by keeping
track of unanswered requests and retransmitting them periodically until a response is received.
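A client-side loop for this might look roughly like the sketch below; send_request(), await_reply(), and TIMEOUT are assumed helpers rather than real interfaces, and the behavior shown corresponds to a hard mount, which retries indefinitely:

    /* Sketch: send a request and retransmit until a reply with a matching xid
     * arrives. Real implementations also back off the timeout between retries. */
    int rpc_call(struct rpc_req *req, struct rpc_reply *rep)
    {
        for (;;) {
            send_request(req);                         /* transmit over UDP      */
            if (await_reply(req->xid, rep, TIMEOUT) == 0)
                return 0;                              /* got the matching reply */
            /* no reply within the timeout; fall through and retransmit */
        }
    }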
Figure 10-3 describes a typical RPC request and (successful) reply. The xid is a transmis-
sion ID, which tags a request. The client generates a unique xid for each request, and the server re-
turns the same xid in the reply. This allows the client to identify the request for which the response
has arrived and the server to detect duplicate requests (caused by retransmissions from the client).
The direction field identifies the message as a request or a reply. The rpc_ vers field identifies
the version number of the RPC protocol (current version = 2). prog and vers are the program and
version number of the specific RPC service. An RPC service may register multiple protocol ver-
sions. The NFS protocol, for instance, has a program number of I 00003 and supports version num-
bers 2 and 3. proc identifies the specific procedure to call within that service program. In the reply,
the reply_stat and accept_stat fields contain status information.
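Expressed as C structures, a simplified view of these headers is sketched below; the names are illustrative, and the actual on-the-wire layout is XDR-encoded as defined by the RPC specification:

    /* Sketch of the fixed part of an RPC call and reply (simplified). */
    struct rpc_call_hdr {
        unsigned int xid;          /* transmission ID, echoed in the reply       */
        unsigned int direction;    /* CALL                                       */
        unsigned int rpc_vers;     /* RPC protocol version (2)                   */
        unsigned int prog;         /* service program number, e.g. 100003 = NFS  */
        unsigned int vers;         /* version number of that service             */
        unsigned int proc;         /* procedure to invoke within the program     */
        /* followed by credentials, a verifier, and procedure-specific arguments */
    };

    struct rpc_reply_hdr {
        unsigned int xid;          /* copied from the request                    */
        unsigned int direction;    /* REPLY                                      */
        unsigned int reply_stat;   /* accepted or denied                         */
        unsigned int accept_stat;  /* success, program or version mismatch, ...  */
        /* followed by a verifier and the procedure-specific results             */
    };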
RPC uses five authentication mechanisms to identify the caller to the server-AUTH_NULL,
AUTH_UNIX, AUTH_SHORT, AUTH_DES, and AUTH_KERB. AUTH_NULL means no authentication.
AUTH_UNIX is composed of UNIX-style credentials, including the client machine name, a UID, and
one or more GIDs. The server may generate an AUTH_SHORT upon receipt of an AUTH_UNIX creden-
tial, and return it to the caller for use in subsequent requests. The idea is that the server can decipher
AUTH_SHORT credentials very quickly to identify known clients, thus providing faster authentication.
This is an optional feature and not many services support it. AUTH_DES is a secure authentication
facility using a mechanism called private keys [Sun 89]. AUTH_KERB is another secure facility based
on the Kerberos authentication mechanism [Stei 88]. Each service decides which authentication
mechanisms to accept. NFS allows all five, except that it allows AUTH_NULL only for the NULL pro-
cedure. Most NFS implementations, however, use AUTH_UNIX exclusively.
Sun also provides an RPC programming language, along with an RPC compiler called
rpcgen. An RPC-based service can be fully specified in this language, resulting in a formal interface
definition. When rpcgen processes this specification, it generates a set of C source files containing
XDR conversion routines and stub versions of the client and server routines, and a header file con-
taining definitions used by both client and server.
NFS associates an object called a file handle with each file or directory. The server generates this handle when the client
first accesses or creates the file through a LOOKUP, CREATE, or MKDIR request. The server returns the
handle to the client in the reply to the request, and the client can subsequently use it in other opera-
tions on this file.
The client sees the file handle as an opaque, 32-byte object and makes no attempt to interpret
the contents. The server can implement the file handle as it pleases, as long as it provides a unique
one-to-one mapping between files and handles. Typically, the file handle contains a file system ID,
which uniquely identifies the local file system, the inode number of the file, and the generation
number of that inode. It may also contain the inode number and generation number of the exported
directory through which the file was accessed.
The generation number was added to the inode to solve problems peculiar to NFS. It is pos-
sible that, between the client's initial access of the file (typically through LOOKUP, which returns the
file handle) and when the client makes an I/O request on the file, the server deletes the file and re-
uses its inode. Hence the server needs a way of determining that the file handle sent by the client is
obsolete. It does this by incrementing the generation number of the inode each time the inode is
freed (the associated file is deleted). The server can now recognize requests that refer to the old file
and respond to them with an ESTALE (stale file handle) error status.
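One plausible layout of such a handle, together with the staleness check, is sketched below; the structure and field names are hypothetical, since each server is free to choose its own encoding of the 32 bytes:

    #include <errno.h>

    /* A possible server-side interpretation of a 32-byte NFS file handle. */
    struct svr_fhandle {
        unsigned int  fsid;        /* identifies the exported local file system */
        unsigned int  inum;        /* inode number of the file                  */
        unsigned int  gen;         /* generation number of that inode           */
        unsigned int  exp_inum;    /* inode number of the exported directory    */
        unsigned int  exp_gen;     /* its generation number                     */
        unsigned char pad[12];     /* unused; pads the handle to 32 bytes       */
    };

    /* Reject a handle whose inode has since been freed and reused. */
    static int fhandle_check(const struct svr_fhandle *fh, unsigned int cur_gen)
    {
        return (fh->gen == cur_gen) ? 0 : ESTALE;   /* stale file handle */
    }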
On the client, lookup begins at the current or root directory (depending on whether the path-
name is relative or absolute) and proceeds one component at a time. If the current directory is an
NFS directory or if we cross a mount point and get to an NFS directory, the lookup operation calls
the NFS-specific VOP_LOOKUP function. This function sends a LOOKUP request to the server, passing
it the file handle of the parent directory (which the client had saved in the rnode) and the name of
the component to be searched.
The server extracts the file system ID from the handle and uses it to locate the vfs structure
for the file system. It then invokes the VFS _VGET operation on this file system, which translates the
file handle and returns a pointer to the vnode of the parent directory (allocating a new one if it is not
already in memory). The server then invokes the VOP_LOOKUP operation on that vnode, which calls
the corresponding function of the local file system. This function searches the directory for the file
and, if found, brings its vnode into memory (unless it is already there, of course) and returns a
pointer to it.
The server next invokes the VOP_GETATTR operation on the vnode of the target file, followed
by VOP_FID, which generates the file handle for the file. Finally, it generates the reply message,
which contains the status, the file handle of the component, and its file attributes.
When the client receives the reply, it allocates a new rnode and vnode for the file (if this file
had been looked up previously, the client may already have a vnode for it). It copies the file handle
and attributes into the rnode and proceeds to search for the next component.
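The server's side of a single LOOKUP, as just described, can be paraphrased in terms of the vnode/vfs operations; the sketch below simplifies the argument lists and omits error handling and reference counting:

    /* Sketch: server processing of a LOOKUP request (simplified signatures). */
    int nfs_serve_lookup(struct svr_fhandle *dirfh, char *name,
                         struct lookup_reply *rep)
    {
        struct vfs   *vfsp;
        struct vnode *dvp, *vp;

        vfsp = find_exported_vfs(dirfh->fsid);  /* locate the exported file system */
        VFS_VGET(vfsp, &dvp, dirfh);            /* file handle -> directory vnode  */
        VOP_LOOKUP(dvp, name, &vp);             /* search the directory            */
        VOP_GETATTR(vp, &rep->attr);            /* attributes of the target file   */
        VOP_FID(vp, &rep->fh);                  /* construct its file handle       */
        rep->status = 0;                        /* success                         */
        return 0;
    }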
Searching for one component at a time is slow and requires several RPCs for a single path-
name. The client may avoid some RPCs by caching directory information (see Section 10.7.2). Al-
though it would seem more efficient to send the entire pathname to the server in a single LOOKUP
call, that approach has some important drawbacks. First, since the client may have mounted a file
system on an intermediate directory in the path, the server needs to know about all the client's
mount points to parse the name correctly. Second, parsing an entire pathname requires the server to
understand UNIX pathname semantics. This conflicts with the design goals of statelessness and op-
erating system independence. NFS servers have been ported to diverse systems such as VMS and
Novell NetWare, which have very different pathname conventions. Complete pathnames are used
only in the mount operation, which uses a different protocol altogether.
Since NFS was primarily intended for UNIX clients, it was important that UNIX semantics be pre-
served for remote file access. The NFS protocol, however, is stateless, which means that clients
cannot maintain open files on the server. This leads to a few incompatibilities with UNIX, which we
describe in the following paragraphs.
In UNIX, permissions are checked only when a file is opened; a process may continue to read or write an open file even if its permissions are later changed to forbid such access. The NFS server, lacking the concept of open files, checks the permissions on each read or write. It would therefore return an error in such a case, which the client would not expect to happen.
Although there is no way to fully reconcile this issue, NFS provides a work-around. The
server always allows the owner of the file to read or write the file, regardless of the permissions. On
the face of it, this seems a further violation of UNIX semantics, since it appears to allow owners to
modify their own write-protected files. The NFS client code, however, prevents that from happen-
ing. When the client opens the file, the LOOKUP operation returns the file attributes along with the
handle. The attributes contain the file permissions at the time of the open. If the file is write-
protected, the client code returns an EACCES (permission denied) error from the open call. It is impor-
tant to note that the server's security mechanisms rely on proper behavior of the client. In this in-
stance, the problem is not serious, since it only affects the owner's lack of write access. Section 10.9
discusses the major problems with NFS security.
In UNIX, read and write system calls are atomic with respect to one another. Over NFS, however, a large transfer may be split into several requests (the maximum size of an NFS message is 8192 bytes), and the server, being stateless, maintains no locks between requests. NFS offers no protection against such overlapping I/O requests.
Cooperating processes can use the Network Lock Manager (NLM) protocol to lock either
entire files or portions thereof. This protocol only offers advisory locking, which means that another
process can always bypass the locks and access the file if it so chooses.
If cached attributes are accessed after the quantum expires, the client fetches them from the server again. Likewise,
for file data blocks, the client checks cache consistency by verifying that the file's modify time has
not changed since the cached data was read from the server. The client may use the cached value of
this timestamp or issue a GETATTR if it has expired.
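In outline, the client's validity check might look like the sketch below; the helper names are invented, and real clients typically keep the cached attributes and timestamps in the rnode, with separate timeouts for files and directories:

    /* Sketch: is the cached data for this file still usable? */
    int nfs_cache_valid(struct rnode *rp)
    {
        if (time_now() - rp->attr_time > rp->attr_timeout) {
            nfs_getattr(rp);                     /* refresh attributes via GETATTR */
            rp->attr_time = time_now();
        }
        if (rp->attrs.mtime != rp->cached_mtime) {
            invalidate_cached_blocks(rp);        /* file changed on the server     */
            rp->cached_mtime = rp->attrs.mtime;
            return 0;                            /* caller must refetch the data   */
        }
        return 1;                                /* cached blocks may be used      */
    }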
Client-side caching is essential for acceptable performance. The precautions described here
reduce, but do not eliminate, the consistency problems. In fact, they introduce some new race con-
ditions, as described in [Mack 91] and [Jusz 89].
Some servers rely on an uninterruptible power supply (UPS) to flush blocks to disk in case
of a crash. Some simply ignore the NFS requirement of synchronous writes, expecting crashes to be
rare occurrences. The plethora of solutions and work-arounds to this problem simply highlights its
severity. NFSv3, described in Section 10.10, provides a protocol change that allows clients and
servers to use asynchronous writes safely.
Early server implementations cached the xids of recent requests to detect retransmitted duplicates, but this cache proved inadequate-it covered only some of the loopholes and opened up new consistency problems.
Moreover, it does not address the performance problems, since the server does not check the cache
until after it has processed the request.
[Jusz 89] provides a detailed analysis of the problems in handling retransmissions. Based on
that, Digital revamped the xid cache in Ultrix. The new implementation caches all requests and
checks the cache before processing new requests. Each cache entry contains request identification
(client ID, xid, and procedure number), a state field, and a timestamp. If the server finds the request
in the cache, and the state of the request is in progress, it simply discards the duplicate. If the state is
done, the server discards the duplicate if the timestamp indicates that the request has completed re-
cently (within a throwaway window set at 3-6 seconds). Beyond the throwaway window, the server
processes the request if idempotent. For nonidempotent requests, the server checks to see if the file
has been modified since the original timestamp. If not, it sends a success response to the client; oth-
erwise, it retries the request. Cache entries are recycled on a least recently used basis so that if the
client continues to retransmit, the server will eventually process the request again.
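In pseudocode, the decision procedure amounts to roughly the following; the entry layout and helper names are invented for illustration:

    /* Sketch: duplicate detection in the revamped xid cache. */
    enum req_state { IN_PROGRESS, DONE };

    void serve_request(struct rpc_req *req)
    {
        struct xid_entry *e = cache_lookup(req->client_id, req->xid, req->proc);

        if (e == NULL) {
            cache_insert(req, IN_PROGRESS);
            process(req);                        /* normal path                   */
        } else if (e->state == IN_PROGRESS) {
            /* original still being serviced: discard the duplicate */
        } else if (now() - e->done_time < THROWAWAY_WINDOW) {
            /* completed very recently: discard the duplicate */
        } else if (is_idempotent(req->proc) || modified_since(req, e->done_time)) {
            process(req);                        /* safe, or necessary, to redo   */
        } else {
            send_success_reply(req);             /* nonidempotent, file unchanged */
        }
    }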
The xid cache helps eliminate several duplicate operations, improving both the performance
and the correctness of the server. It is possible to take this one step further. If the server caches the
reply message along with the xid information, then it can handle a duplicate request by retransmit-
ting the cached reply. Duplicates that arrive within the throwaway window may still be discarded
altogether. This will further reduce wasteful reprocessing, even for idempotent requests. This ap-
proach requires a large cache capable of saving entire reply messages. Some replies, such as those
for READ or READDIR requests, can be very large. It is better to exclude these requests from the xid
cache and process them again if necessary.
7 The original design separated the storage functionality into a third subsystem. The recent line of products does not
feature separate storage processors.
The functional processors communicate with one another through message passing. One processor (also a 68020) runs a modified version of SunOS 4.1 (with FMK
support added to it) and provides management functionality. Figure 10-5 shows the basic design.
The UNIX front end can communicate directly with each of the functional processors. It talks
to network processors through the standard network interface driver and to file system processors through a
special local file system, which implements the vfs interface. The UNIX processor also has direct
access to the storage through a special device driver that represents an Auspex disk and converts disk
I/O requests into FMK messages. This allows utilities such as fsck and newfs to work without
change.
Normal NFS requests bypass the UNIX processor altogether. The request comes in at a net-
work processor, which implements the IP, UDP, RPC, and NFS layers. It then passes the request to
the file system processor, which may issue I/O requests to the storage processor. Eventually, the
network processor sends back the reply message to the client.
The FMK kernel supports a small set of primitives including light-weight processes, mes-
sage passing, and memory allocation. By eliminating a lot of the baggage associated with the tradi-
tional UNIX kernel, FMK provides extremely fast context switching and message passing. For in-
stance, FMK has no memory management and its processes never terminate.
This architecture provided the basis for a high-throughput NFS server that established Aus-
pex Systems as a leader in the high-end NFS market. Recently, its position has been challenged by
cluster-based NFS servers from vendors such as Sun Microsystems and Digital Equipment Corpo-
ration.
HA-NFS divides the reliability problem into three distinct components-network reliability, disk reliability, and server reliability. It uses disk mirroring and optional
network replication to address the first two problems and uses a pair of cooperating servers to pro-
vide server reliability.
Figure 10-6 illustrates the HA-NFS design. Each server has two network interfaces and, cor-
respondingly, two IP addresses. A server designates one of its network interfaces as the primary in-
terface and uses it for normal operation. It uses the secondary interface only when the other server
fails.
HA-NFS also uses dual-ported disks, which are connected to both servers through a shared
SCSI bus. Each disk has a primary server, which alone accesses it during normal operation. The
secondary server takes over the disk when the primary server fails. Thus the disks are divided into
two groups, one for each server.
The two servers communicate with each other through periodic heartbeat messages. When a
server does not receive a heartbeat from the other, it initiates a series of probes to make sure the
other server has actually failed. If so, it initiates a failover procedure. It takes control of the failed
server's disks and sets the IP address of its secondary network interface to that of the failed server's
primary interface. This allows it to receive and service messages intended for the other server.
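The takeover sequence can be summarized in pseudocode; the routine names below are placeholders, not part of the HA-NFS implementation:

    /* Sketch: surviving server's reaction to a missing heartbeat. */
    void heartbeat_timeout(struct peer *p)
    {
        if (probe_peer(p) == PEER_ALIVE)
            return;                                  /* false alarm, peer still up */

        take_over_disks(p);                          /* claim the dual-ported disks */
        set_ip_address(secondary_if, p->primary_ip); /* assume the peer's address   */
        /* from here on, requests addressed to the failed server arrive locally */
    }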
The takeover is transparent to clients, who only see reduced performance. The server seems
to be unresponsive while the failover is in progress. Once failover completes, the surviving server
may be slow, since it now handles the load normally meant for two servers. There is, however, no
loss in service.
Each server runs IBM's AIX operating system, which uses a metadata logging file system.
HA-NFS adds information about the RPC request to log entries for NFS operations. When a server
takes over a disk during failover, it replays the log to restore the file system to a consistent state and
recovers its retransmission cache from the RPC information in the log. This prevents inconsistencies
due to retransmissions during failover. The two servers also exchange information about clients that
have made file locking requests using NSM and NLM [Bhid 92]. This allows recovery of the lock
manager state after failure of one of the servers.
There are two ways to make the IP address takeover transparent to clients. One is to use
special network interface cards that allow their hardware addresses to be changed. During failover,
the server changes both the IP address and the hardware address of the secondary interface to those
of the failed server's primary interface. In the absence of such hardware, the server can take advantage
of some side effects of certain address resolution protocol (ARP) requests [Plum 82] to update the
new <hardware address, IP address> mapping in the clients.
8 Many client implementations allow only privileged users to mount NFS file systems.
Moreover, any user who can send packets to the server can construct RPC requests containing fake authentication data, perhaps even appearing to come from a different machine. This would
allow break-ins even by users who do not have root permission on a machine or by users on ma-
chines that do not have NFS access to a server.
Many of the above problems are not restricted to NFS. The traditional UNIX security
framework is designed for an isolated (no network), multiuser environment and is barely adequate
even in that domain. The introduction of a network where nodes trust each other severely compro-
mises the security and opens several loopholes. This has led to the development of several network
security and authentication services, the most notable in the UNIX world being Kerberos [Stei 88].
A more detailed discussion of network security is beyond the scope of this book.
In NFSv2, a READDIR request returns only a list of the filenames in the directory. The client then issues a LOOKUP and a GETATTR for each file
in the list. For a large directory, this can cause excessive network traffic.
NFSv3 provides a READDIRPLUS operation, which returns the names, file handles, and at-
tributes of the files in the directory. This allows a single NFSv3 request to replace the entire se-
quence of NFSv2 requests. The READDIRPLUS request must be used with care, since it returns a large
amount of data. If the client wants information about one file only, it may be cheaper to use the old
sequence of calls.
Implementations that support NFSv3 must also support NFSv2. The client and server nor-
mally use the highest version of the protocol that both of them support. When it first contacts the
server, the client uses its highest protocol version; if the server does not understand the request, the
client tries the next lowest version and so on, until they find a commonly supported version.
Only time will tell how effective and successful NFSv3 will be. The changes described
above are very welcome improvements to the NFS protocol, and should result in great performance
improvement. NFSv3 also cleans up a few minor problems with NFSv2. Some of the smaller
changes reduce the performance ofNFSv3, but the benefits of asynchronous writes and READDIR-
PLUS are expected to more than compensate for that.
If the client mounts other directories of the server, all the mounts are multiplexed on the same circuit. The circuit is kept
open for the duration of the mounts. If either the client or server crashes, the circuit breaks, and the
other becomes aware of the crash and can take appropriate action.
Network independence is achieved by implementing RFS on top of the STREAMS frame-
work (see Chapter 17) and using AT&T's transport provider interface (TPI). RFS can communicate
over multiple streams and thus use several different transport providers on the same machine. Figure
10-7 illustrates the communication setup between the client and the server.
RFS associates a symbolic resource name with each directory advertised (exported) by any
server. A centralized name server maps resource names to their network location. This allows the
resources (exported file trees) to be moved around in the network; clients can access the resource
without having to know its current location.10
Since RFS is intended to work over large networks, resource management can become
complex. Therefore, RFS provides the concept of a domain, which is a logical grouping of a set of
machines in the network. Resources are identified by the domain name and the resource name,
which now must be unique only within the domain. If the domain name is not specified, the current
domain is assumed. The name server may only store the information about resources in its own do-
main and forward other requests to name servers of the respective domains.
10 Of course, the resource cannot be moved while any client has it mounted.
When a process makes a system call on a remote file, the client packages the call and its arguments into an RFS request. The server recreates the client's environment and executes the system call. The client
process blocks until the server processes the request and sends back a response message, containing
the results of the system call. The client then interprets the results and completes the system call,
returning control to the user. This implementation was called the RFS 1.0 protocol.
When RFS was integrated with the vnode/vfs interface, it was necessary for RFS to imple-
ment each vnode operation. In the port to SunOS [Char 87], each vnode operation was implemented
in terms of one or more RFS 1.0 requests. For instance, vn_open could simply use the RFS_OPEN re-
quest, whereas vn_setattr required an RFS_OPEN, followed by one or more of RFS_CHMOD,
RFS_CHOWN, and RFS_UTIME.
SVR4 introduced a new version of the RFS protocol, called RFS 2.0. It provided a set of re-
quests that directly mapped vnode and vfs operations, thus providing a cleaner integration with the
vnode/vfs interface. This did, however, bring up the problem of backward compatibility, since dif-
ferent machines on the network may be running different UNIX releases and, thus, different ver-
sions of RFS.
To address this, SVR4 clients and servers understand both RFS protocols. When the con-
nection is made (during the first mount operation), the client and the server exchange information
about which protocols each of them can handle and agree on the protocol they both understand.
Thus RFS 2.0 is only used when both machines support it. If one of the machines is running SVR4
and the other SVR3, they will use the RFS 1.0 protocol.
This requires SVR4 to implement each vnode and vfs operation in two ways-one when
speaking to another SVR4 machine and the other when talking to an older system.
When advertising a resource, the server may specify a list of client machines authorized to access the resource. In addition, the server may require a password
check to be performed during virtual circuit establishment.
The server calls advfs to advertise a directory. advfs creates an entry for the directory in a re-
source list in the kernel (Figure 10-8). This entry contains the resource name, a pointer to the vnode
of the exported directory, and a list of authorized clients. It also contains the head of a list of mount
entries for each of the clients that mount this resource. In SVR4, the advfs system call has been re-
placed by the rfsys call, which exports several subfunctions, including one to advertise a file system.
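A resource-list entry of this kind might be declared along the following lines; the structure and field names are illustrative rather than the actual SVR3 or SVR4 definitions:

    /* Sketch: one kernel resource-list entry per advertised directory. */
    struct rfs_resource {
        char                 r_name[64];     /* symbolic resource name             */
        struct vnode        *r_rootvp;       /* vnode of the advertised directory  */
        struct rfs_client   *r_auth;         /* list of authorized client machines */
        struct rfs_mount    *r_mounts;       /* one entry per client mount of this
                                                resource                           */
        struct rfs_resource *r_next;         /* next advertised resource           */
    };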
Figure 10-9 describes the interactions between the RFS server, the name server, and the cli-
ent. The server invokes the adv(1) command to register its advertised resource with the name server.
Some time later, a client mounts an RFS resource, specifying the resource name and a local mount
point. The mount operation allocates a vfs structure for the mounted resource and obtains a
vnode for the root directory of the resource. The v_data field of an RFS vnode points to a data
structure called a send descriptor, which contains information about the virtual circuit (such as a
pointer to the stream), as well as a file handle that the server can use to locate the corresponding lo-
cal vnode.
The first mount operation between a client and a server establishes the virtual circuit. All
subsequent mounts (and RFS operations) are multiplexed over this circuit, which is maintained until
the last resource is unmounted. The mount operation initiates a connection to a daemon process on
the server, using the transport interface. Once the connection is established, the client and the server
negotiate the run-time parameters, including the protocol version number, and the hardware archi-
tecture type. If the two machines have different architectures, they use XDR for data encoding.
The initial virtual circuit establishment occurs in user mode, using the standard network pro-
gramming interface in the system. Once established, the user calls the FWFD function of the rfsys
system call, to pass the virtual circuit into the kernel.
On the server, a pool of RFS daemon processes handles incoming requests. Each daemon waits
for incoming requests and services each request to completion before attending to the next one.
While servicing a request, the daemon assumes the identity of the client process, using the creden-
tials, resource limits, and other attributes passed in the message. The RFS daemons may sleep if
they need to wait for resources and are scheduled just like normal processes.
11 In practice, using RFS to share devices is problem-prone, except in completely homogeneous environments. Small
differences in system call semantics may make device sharing impossible.
[Figure: RFS cache consistency on a write. Client C1 writes to file1 and sends the request to the server; the server blocks the write and sends an invalidate message to client C2; C2 invalidates its cached entries for file1 and replies to the server; the server then responds to C1, which completes the write and returns from the write call.]
Clients that close a file may retain its blocks in their cache. It is important to prevent them from using stale cached blocks if they reopen the file.
This is achieved by associating a version number with each file, which is incremented each time the
file is modified. The server returns the version number in the response to each open request, and the
client stores it in its cache. If a file has been modified since the client closed it, the client will get
back a different version number when it tries to reopen the file. When that happens, the client can
flush all the blocks associated with that file, ensuring that it will not access stale cached data.
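The reopen check amounts to a comparison of version numbers, as in the sketch below; the names are hypothetical:

    /* Sketch: on receiving the reply to an open, validate cached blocks. */
    void rfs_open_reply(struct rfs_file *fp, unsigned int server_version)
    {
        if (fp->cached_version != server_version) {
            flush_cached_blocks(fp);             /* file changed while it was closed */
            fp->cached_version = server_version;
        }
    }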
Under normal circumstances, the RFS consistency mechanisms provide strong consistency
guarantees at a reasonable cost. Problems may arise if one of the clients crashes or becomes unre-
sponsive. It may then take a long time to respond to cache invalidation requests, preventing other
nodes from completing their operations. In this way, a single errant client can cause problems for
the whole network. Overall, the benefits of caching are far greater than the cost of maintaining con-
sistency, and the RFS cache has demonstrated a performance improvement of about a factor of two
(over the old implementation) on benchmarks with one to five clients.
AFS controls network congestion and server overload by segmenting the network into a
number of independent clusters. Unlike NFS and RFS, AFS uses dedicated servers. Each machine is
either a client or a server, but not both. Figure 10-12 shows the organization of an AFS network.
Each cluster contains a number of clients, plus a server that holds the files of interest to those cli-
ents, such as the user directories of the owners of the client workstations.
This configuration provides the fastest access to files residing on the server on the same network
segment. Users can access files on any other server, but the performance will be slower. The net-
work can be dynamically reconfigured to balance loads on servers and network segments.
AFS uses aggressive caching of files, coupled with a stateful protocol, to minimize network
traffic. Clients cache recently accessed files on their local disks. The original implementation
cached entire files. Since that was not practical for very large files, AFS 3.0 divides the file into
64-kilobyte chunks, and caches individual chunks separately. The AFS servers participate actively
in client cache management, by notifying clients whenever the cached data becomes invalid. Section
10.16.1 describes this protocol further.
AFS also reduces server load by moving the burden of name lookups from the server to the
clients. Clients cache entire directories and parse the filenames themselves. Section 10.16.2 de-
scribes this in detail.
The volume provides a unit of file system storage that is distinct from partitions, which pro-
vide units of physical storage. This separation has several advantages. Volumes may be moved
freely from one location to another, without affecting active users. This may be done for load bal-
ancing, or to adjust for permanent moves of users. If a user moves his or her home workstation to a
different part of the network, the system administrator can move the user's volume to the local
server. Volumes also allow files that are much larger than a single disk. Read-only volumes can be
replicated on several servers to increase availability and performance. Finally, each volume can be
individually backed up and restored.
AFS provides a single, uniform name space that is independent of the storage location. Each
file is identified by an fid, which consists of a volume ID, a vnode number, and a vnode uniquifier.
Historically, AFS uses the term vnode to mean a Vice inode; hence, the vnode number is an index
into the inode list of the volume. The uniquifier is a generation number, incremented each time the
vnode is reused.
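In C, the fid is essentially a three-word structure along the following lines (the field names are illustrative):

    /* Sketch of an AFS file identifier (fid). */
    struct afs_fid {
        unsigned int volume;       /* volume ID, resolved through the volume
                                      location database                           */
        unsigned int vnode;        /* index into the volume's list of Vice inodes */
        unsigned int uniquifier;   /* generation number, incremented each time
                                      the vnode is reused                          */
    };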
The volume location database provides location independence and transparency. It provides
a mapping between a volume ID and the physical location of the volume. The database is replicated
on each server, so that it does not become a bottleneck resource. If a volume is moved to another
location, the original server retains its forwarding information, so that the databases on the other
servers need not be updated immediately. While the volume is being transferred, the original server
may still handle updates, which are later migrated to the new server. At some point, the volume is
temporarily quiesced to transfer the recent changes.
Each client workstation must have a local disk. This disk contains a few local files, plus a di-
rectory on which it mounts the shared file hierarchy. Conventionally, each workstation mounts the
shared tree on the same directory. The local files include the system files essential for minimal op-
eration, plus some files the user may want to keep local for reasons of security or performance.
Hence each client sees the same shared name space, plus its own, unique, local files. The local disk
also acts as a cache for recently accessed shared files.
In a centralized UNIX system, if a process modifies a file, other processes see the new data on the
next read system call. Enforcing UNIX semantics in a distributed file system causes excessive net-
work traffic and performance degradation. AFS 2.0 uses a less restrictive consistency protocol,
called session semantics, which performs cache consistency operations only at file open or close.
Clients flush modified data to the server only when the file is closed. When that happens, the server
notifies other clients that their cached copies have become invalid. Clients do not check data valid-
ity on every read or write access to the file, but continue to use stale data until they open the file
again. Hence users on different machines see changes to a shared file at the open and close system
call granularity, rather than at the read and write system call granularity as in UNIX.
AFS provides stronger guarantees for metadata operations, which are updated to the server
(and from the server, to other clients) immediately. For instance, if a rename system call completes
on one client, no other machine on the network can open the file under the old name, and all can
open it under its new name.
The consistency guarantees of session semantics are much weaker than those of UNIX se-
mantics. AFS 3.0 checks the data validity on every read or write, thus providing better consistency.
It still falls short of UNIX semantics, since the client does not flush changes to the server until the
file is closed. DFS, the new incarnation of AFS, enforces UNIX semantics through a token passing
mechanism, which is described in Section 10.18.2.
12 Prior to AFS 3.0, clients discarded the stale data only on the next open. This, along with the large chunk size, made
those implementations unsuitable for transaction processing and database systems.
In that case, the client must discard its cached copy of the data and fetch it from the server again. This may cause some extra network traffic and slow down the operation. Such a situation requires an unlikely combination of events and, in practice, does not happen very often.
A more severe problem arises when a temporary network failure prevents the delivery of a call-
back-breaking message. In AFS, the client may run for a long time without contacting the server.
During this time, it will incorrectly assume that its cached data is correct. To bound this time, the
client regularly probes each file server from which it has callback promises (once every ten minutes,
by default).
The callback mechanism implies a stateful server. The server keeps track of all callbacks it
has issued for each file. When a client modifies the file, the server must break all outstanding call-
backs for that file. If the volume of this information becomes unmanageable, the server can break
some existing callbacks and reclaim storage. The client must maintain validation information for
cached files.
10.16.3 Security
AFS considers Vice (the collection of servers) as the boundary of security. It considers both user
workstations and the network as inherently insecure (with good reason). It avoids passing unencrypted passwords over the network, since they are too easily captured by machines that can snoop on the network.
AFS uses the Kerberos authentication system [Stei 88], developed at the Massachusetts In-
stitute of Technology. Kerberos clients authenticate themselves not by transmitting a password
known to the client and the server, but by answering encrypted challenges from the server. The
server encrypts the challenge with a key known to both the server and the client. The client de-
crypts the challenge, encrypts the answer with the same key, and sends the encrypted answer to the
server. Since the server uses a different challenge each time, the client cannot reuse the same re-
sponse.
[Hone 92] identifies several loopholes in the way in which AFS 3.0 uses Kerberos. The cli-
ent keeps several important data structures unencrypted in its address space, making it vulnerable to
users who can acquire root privilege on their own workstations. Such users can traverse the kernel
data structures to obtain authenticated Kerberos tickets of other users. Moreover, the challenge-
response protocol in AFS 3.0 is susceptible to attack from another node on the network that sends
out fake challenges to the client. Transarc subsequently fixed these loopholes in AFS 3.1.
AFS also provides access-control lists (ACLs) for directories (but not for individual files).
Each ACL is an array of pairs. The first item in each pair is a user or group name, and the second
defines rights granted to that user or group. The ACLs support four types of permissions on a direc-
tory-lookup, insert, delete, and administer (modify the ACL for this directory). In addition, they
allow three types of permissions for files in that directory-read, write, and lock. AFS also retains
the standard UNIX permission bits, and a user must pass both tests (ACL and UNIX permissions) to
operate on a file.
Second, the write system call often succeeds when it should not (e.g., when the write extends the
file but the disk is full). Both situations have unexpected results on the client.
Finally, shifting the pathname lookup to the client decreases the server load, but requires the
client to understand the directory format of the server. Contrast this with NFS, which provides direc-
tory information in a hardware and operating system independent format.
Some of the drawbacks of AFS are addressed by DFS, which we describe in the next
section.
In DFS, the client does not need to understand the format of the server's directories. Hence DFS clients cache the results of individual lookups,
rather than entire directories.
The DFS server design is very different from AFS. In AFS, the access protocol and the file
system are a single, integrated entity. In DFS, the two are separated and interact through the
vnode/vfs interface. This allows DFS to export the server's native file system. It also allows local
applications on the server to access the exported file system. The DFS server uses an extended vfs
interface (called VFS+), which has additional functions to support volumes and access-control lists.
Episode supports all VFS+ operations, and hence provides full DFS functionality. Other local file
systems may not support the extensions and may provide only a subset of the DFS functionality.
The protocol exporter services requests from DFS clients. It maintains state information for
each client and informs the client whenever some of its cached data becomes invalid. The glue layer
in the vfs interface maintains consistency between the protocol exporter and other file access meth-
ods (local access and other distributed protocols supported by the server), as explained in Section
10.18.2.
To implement these semantics, the DFS server includes a token manager, which keeps track
of all active client references to files. On each reference, the server gives the client one or more to-
kens, which guarantee the validity of file data or attributes. The server may cancel the guarantee at
any time by revoking the token. The client must then treat the corresponding cached data as invalid,
and fetch it again from the server if needed.
DFS supports four types of tokens, each dealing with a different set of file operations:
• Data tokens - There are two types of data tokens: read and write. Each applies to a range of bytes within a file. If the client holds a read token, its cached copy of that part of the file is valid. If it holds a write token, it may modify its cached data without flushing it to the server. When the server revokes a read token, the client must discard its cached data. When the server revokes a write token, the client must write any modifications back to the server and then discard the data.
• Status tokens - These tokens provide guarantees about cached file attributes. Again there are two types: status read and status write. Their semantics are similar to those of data read and write tokens. If a client holds the status write token for a file, the server will block other clients that try to even read the file's attributes.
• Lock tokens - These allow the holder to set different types of file locks on byte ranges in the file. As long as the client holds a lock token, it does not need to contact the server to lock the file, since it is guaranteed that the server will not grant a conflicting lock to another client.
• Open tokens - These allow the holder to open a file. There are different types of open tokens, corresponding to different open modes: read, write, execute, and exclusive write. For instance, a client holding an open-for-execute token is assured that no other client will be able to modify the file. This particular guarantee is difficult for other distributed file systems to support. It is necessary because most UNIX systems access executable files one page at a time (demand paging, see Section 13.2). If a file were modified while being executed, the client would get part of the old program and part of the new one, leading to strange and unpredictable results.
DFS defines a set of compatibility rules when different clients want tokens for the same file.
Tokens of different types are mutually compatible, since they relate to separate components of a
file. For tokens of the same type, the rules vary by token type. For data and lock tokens, the read
and write tokens are incompatible if their byte ranges overlap. Status read and write tokens are al-
ways incompatible. For open tokens, exclusive writes are incompatible with any other subtype, and
execute tokens are incompatible with normal writes as well. The rest of the combinations are mu-
tually compatible.
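These rules can be condensed into a small decision routine, sketched below; the type and helper names are invented, and the real token manager must additionally handle queuing, revocation, and deadlock avoidance:

    /* Sketch: do two token requests on the same file conflict? */
    int tokens_conflict(const struct token *a, const struct token *b)
    {
        if (a->class != b->class)
            return 0;                  /* tokens of different types are compatible */

        switch (a->class) {
        case TOK_DATA:
        case TOK_LOCK:
            /* reads are mutually compatible; anything involving a write
             * conflicts only if the byte ranges overlap */
            return (a->mode != TOK_READ || b->mode != TOK_READ) &&
                   ranges_overlap(&a->range, &b->range);
        case TOK_STATUS:
            /* status read and status write are always incompatible */
            return a->mode != TOK_READ || b->mode != TOK_READ;
        case TOK_OPEN:
            /* exclusive write conflicts with everything; execute also
             * conflicts with normal write; the rest are compatible */
            return open_modes_conflict(a->open_mode, b->open_mode);
        }
        return 0;
    }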
Tokens are similar to the AFS callbacks-both provide cache consistency guarantees that
the server may revoke at any time. Unlike callbacks, tokens are typed objects. AFS defines a single
type of callback for file data and one for file attributes. DFS provides several token types, as previ-
ously described. This allows a greater degree of concurrency in the file system, and enables UNIX-
style single-user semantics for access to shared files.
Replication server (rpserver) DFS supports fileset replication for increasing avail-
ability of important data. Replication protects
against network and server outages, and also reduces
bottlenecks by distributing the load for heavily used
filesets across several machines. The replicas are
read-only, but the original may be read-write. DFS
allows two forms of replication-release and
scheduled. With release replication, clients must is-
sue explicit fts release commands to update the rep-
licas from the original. Scheduled replication auto-
matically updates replicas at fixed intervals.
10.18.5 Analysis
DFS provides a comprehensive set of facilities for distributed file access. Its Episode file system
uses logging to reduce crash recovery time, thereby increasing file system availability. It uses two
separate abstractions-aggregates and filesets-to organize the file system. Aggregates are units of
physical storage, while filesets are logical divisions of the file system. In this way, it decouples the
logical and physical organization of data.
Episode uses POSIX-compliant access-control lists for finer granularity file protection. Al-
though this allows for a more flexible and robust security scheme than that of UNIX, it is unfamiliar
to system administrators and users. Similarly, Kerberos provides a secure authentication framework
for DFS, but requires modification to several programs such as login, ftp, and various batch and
mail utilities.
DFS uses fileset replication to increase the availability of data and to reduce access times by
distributing the load among different servers. Replication also allows online backup of individual
filesets, since a replica is a frozen, consistent snapshot of the fileset. The fileset location database
provides location independence and transparency.
The DFS architecture is based on client-side caching with server-initiated cache invalidation.
This approach is suitable for large-scale networks, since it reduces network congestion under normal
usage patterns. By implementing both server and client on top of the vnode/vfs interface, DFS
achieves interoperability with other physical file systems and with other local and remote file access
protocols.
However, the DCE DFS architecture is complex and difficult to implement. It requires not
only DCE RPC but also a variety of related services, such as the X.500 global directory service
[OSF 93]. In particular, it is not easy to support DFS on small machines and simple operating sys-
tems (MS-DOS readily comes to mind). This will be a barrier to its acceptance in truly heterogene-
ous environments.
The cache consistency and deadlock avoidance mechanisms are highly complex as well. The
algorithms must also recover correctly from failure of individual clients, servers, or network seg-
ments. This is a problem with any distributed file system that provides fine-granularity concurrent
access semantics.
10.19 Summary
This chapter describes the architecture and implementation of four important distributed file sys-
tems-NFS, RFS, AFS, and DFS. NFS is the simplest to implement and is the most portable archi-
tecture. It has been ported to a large variety of platforms and operating systems, making it the proto-
col of choice for truly heterogeneous environments. However, it does not scale well, falls far short
of UNIX semantics, and suffers from poor write performance. RFS provides UNIX semantics and
also sharing of devices, but only works with System V UNIX and variants derived from it. AFS and
DFS are highly scalable architectures. DFS provides UNIX semantics, and is interoperable with
other access protocols. It is, however, complex and unwieldy, and difficult to implement. It is an
emerging technology, and only time will tell how successful it will become.
There are few published measurements of relative performance of these file systems.
[Howa 88] compares the performance of NFS and AFS using identical hardware configurations. The
results show that for a single server, NFS is faster at low loads (less than 15 clients), but deteriorates
rapidly for higher loads. Both systems have evolved substantially since then, but the factors affect-
ing scalability have not changed significantly.
10.20 Exercises
1. Why is network transparency important in a distributed file system?
2. What is the difference between location transparency and location independence?
3. What are the benefits of a stateless file system? What are the drawbacks?
4. Which distributed file system provides UNIX semantics for shared access to files? Which
provides session semantics?
5. Why is the mount protocol separate from the NFS protocol?
6. How would an asynchronous RPC request operate? Suggest a client interface to send an
asynchronous request and receive a reply.
7. Write an RPC program that allows the client to send a text string to be printed on the server.
Suggest a use for such a service.
8. Suppose an NFS server crashes and reboots. How does it know what file systems its clients
have mounted? Does it care?
9. Consider the following shell command, executed from an NFS-mounted directory:
echo hello > krishna.txt
What sequence of NFS requests will this cause? Assume the file krishna.txt does not already
exist.
10. In Exercise 9, what would be the sequence of requests if krishna.txt already existed?
11. NFS clients fake the deletion of open files by renaming the file on the server and deleting it
when the file is closed. If the client crashes before deleting the file, what happens to the file?
Suggest a possible solution.
12. The write system call is asynchronous and does not wait for the data to be committed to stable
storage. Why should the NFS write operation be synchronous? How do server or client
crashes affect outstanding writes?
10.21 References
[Bach 87] Bach, M.J., Luppi, M.W., Melamed, A.S., and Yueh, K., "A Remote-File Cache for
RFS," Proceedings of the Summer 1987 USENIX Technical Conforence, Jun. 1987,
pp. 273-279.
[Bhid 91] Bhide, A., Elnozahy, E., and Morgan, S., "A Highly Available Network File Server,"
Proceedings of the Winter 1991 USENIX Technical Conference, Jan. 1991, pp. 199-
205.
[Bhid 92] Bhide, A., and Shepler, S., "A Highly Available Lock Manager for HA-NFS,"
Proceedings of the Summer 1992 USENIX Technical Conference, Jun. 1992, pp.
177-184.
[Char 87] Chartok, H., "RFS in SunOS," Proceedings of the Summer 1987 USENIX Technical
Conference, Jun. 1987, pp. 281-290.
[Cher 88] Cheriton, D.R., "The V Distributed System," Communications of the ACM, Vol. 31,
No. 3, Mar. 1988, pp. 314-333.
[Chut 92] Chutani, S., Anderson, O.T., Kazar, M.L., Leverett, B.W., Mason, W.A., and
Sidebotham, R.N., "The Episode File System," Proceedings of the Winter 1992
USENIX Technical Conference, Jan. 1992, pp. 43-59.
[Gerb 92] Gerber, B., "AFS: A Distributed File System that Supports Worldwide Networks,"
Network Computing, May 1992, pp. 142-148.
[Hitz 90] Hitz, D., Harris, G., Lau, J.K., and Schwartz, A.M., "Using UNIX as One
Component of a Lightweight Distributed Kernel for Multiprocessor File Servers,"
Proceedings of the Winter 1990 USENIX Technical Conference, Jan. 1990, pp. 285-
295.
[Hitz 94] Hitz, D., Lau, J., and Malcolm, M., "File System Design for an NFS File Server
Appliance," Proceedings of the Winter 1994 USENIX Technical Conference, Jan.
1994, pp. 235-245.
[Hone 92] Honeyman, P., Huston, L.B., and Stolarchuk, M.T., "Hijacking AFS," Proceedings
of the Winter 1992 USENIX Technical Conference, Jan. 1992, pp. 175-181.
[Howa 88] Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., and
Sidebotham, R.N., "Scale and Performance in a Distributed File System," ACM
Transactions on Computer Systems, Vol. 6, No. 1, Feb. 1988, pp. 55-81.
[Jusz 89] Juszczak, C., "Improving the Performance and Correctness of an NFS Server,"
Proceedings of the Winter 1989 USENIX Technical Conference, Jan. 1989, pp. 53-63.
[Jusz 94] Juszczak, C., "Improving the Write Performance of an NFS Server," Proceedings of
the Winter 1994 USENIX Technical Conference, Jan. 1994, pp. 247-259.
[Kaza 88] Kazar, M.L., "Synchronization and Caching Issues in the Andrew File System,"
Proceedings of the Winter 1988 USENIX Technical Conference, Feb. 1988, pp. 27-
36.
[Kaza 90] Kazar, M.L., Leverett, B.W., Anderson, O.T., Apostolides, V., Bottos, B.A., Chutani,
S., Everhart, C.F., Mason, W.A., Tu, S.-T., and Zayas, E.R., "Decorum File System
Architectural Overview," Proceedings of the Summer 1990 USENIX Technical
Conference, Jun. 1990.
[Levy 90] Levy, E., and Silberschatz, A., "Distributed File Systems: Concepts and Examples,"
ACM Computing Surveys, Vol. 22, No. 4, Dec. 1990, pp. 321-374.
[Mack 91] Macklem, R., "Lessons Learned Tuning the 4.3BSD Reno Implementation of the
NFS Protocol," Proceedings of the Winter 1991 USENIX Technical Conference, Jan.
1991, pp. 53-64.
[Mora 90] Moran, J., Sandberg, R., Coleman, D., Kepecs, J. and Lyon, B., "Breaking Through
the NFS Performance Barrier," Proceedings of the Spring 1990 European UNIX
Users' Group Conference, Apr. 1990, pp. 199-206.
[Morr 86] Morris, J.H., Satyanarayanan, M., Conner, M.H., Howard, J.H., Rosenthal, D.S.H.,
and Smith, F.D., "Andrew: A Distributed Personal Computing Environment,"
Communications of the ACM, Vol. 29, No. 3, Mar. 1986, pp. 184-201.
[Nowi 90] Nowitz, D.A., "UUCP Administration," UNIX Research System Papers, Tenth
Edition, Vol. II, Saunders College Publishing, 1990, pp. 563-580.
[OSF 92] Open Software Foundation, OSF DCE Application Environment Specification,
Prentice-Hall, Englewood Cliffs, NJ, 1992.
[OSF 93] Open Software Foundation, OSF DCE Administration Guide-Extended Services,
Prentice-Hall, Englewood Cliffs, NJ, 1993.
[Pawl 94] Pawlowski, B., Juszczak, C., Staubach, P., Smith, C., Lebel, D., and Hitz, D., "NFS
Version 3 Design and Implementation," Proceedings of the Summer 1994 USENIX
Technical Conference, Jun. 1994, pp. 137-151.
[Post 85] Postel, J., and Reynolds, J., "The File Transfer Protocol," RFC 959, Oct. 1985.
[Plum 82] Plummer, D.C., "An Ethernet Address Resolution Protocol," RFC 826, Nov. 1982.
[Rifk 86] Rifkin, A.P., Forbes, M.P., Hamilton, R.L., Sabrio, M., Shah, S., and Yueh, K., "RFS
Architectural Overview," Proceedings of the Summer 1986 USENIX Technical
Conference, Jun. 1986, pp. 248-259.
[Sand 85a] Sandberg, R., Goldberg, D., Kleiman, S.R., Walsh, D., and Lyon, B., "Design and
Implementation of the Sun Network Filesystem," Proceedings of the Summer 1985
USENIX Technical Conference, Jun. 1985, pp. 119-130.
[Sand 85b] Sandberg, R., "Sun Network Filesystem Protocol Specification," Sun Microsystems,
Inc., Technical Report, 1985.
[Saty 85] Satyanarayanan, M., Howard, J.H., Nichols, D.A., Sidebotham, R.N., Spector, A.Z.,
and West, M.J., "The ITC Distributed File System: Principles and Design," Tenth
ACM Symposium on Operating Systems Principles, Dec. 1985, pp. 35-50.
[Side 86] Sidebotham, R.N., "VOLUMES-The Andrew File System Data Structuring
Primitive," Proceedings of the Autumn 1986 European UNIX Users' Group
Conference, Oct. 1986, pp. 473-480.
[Spec 89] Spector, A.Z., and Kazar, M.L., "Uniting File Systems," UNIX Review, Vol. 7, No. 3,
Mar. 1989, pp. 61-70.
[Stei 88] Steiner, J.G., Neuman, C., and Schiller, J.I., "Kerberos: An Authentication Service
for Open Network Systems," Proceedings of the Winter 1988 USENIX Technical
Conference, Jan. 1988, pp. 191-202.
[Stol 93] Stolarchuk, M.T., "Faster AFS," Proceedings of the Winter 1993 USENIX Technical
Conference, Jan. 1993, pp. 67-75.
[Sun 87] Sun Microsystems, Inc., "XDR: External Data Representation Standard," RFC 1014,
DDN Network Information Center, SRI International, Jun. 1987.
[Sun 88] Sun Microsystems, Inc., "RPC: Remote Procedure Call, Protocol Specification,
Version 2," RFC 1057, DDN Network Information Center, SRI International, Jun.
1988.
[Sun 89] Sun Microsystems, Inc., "Network File System Protocol Specification," RFC 1094,
DDN Network Information Center, SRI International, Mar. 1989.
[Sun 95] Sun Microsystems, Inc., "NFS Version 3 Protocol Specification," RFC 1813, DDN
Network Information Center, SRI International, Jun. 1995.
[Tann 85] Tanenbaum, A.S., and Van Renesse, R., "Distributed Operating Systems," ACM
Computing Surveys, Vol. 17, No. 4, Dec. 1985, pp. 419-470.
[Tann 90] Tanenbaum, A.S., Van Renesse, R., Van Staveren, H., Sharp, G.J., Mullender, S.J.,
Jansen, J., and Van Rossum, G., "Experiences with the Amoeba Distributed
Operating System," Communications of the ACM, Vol. 33, No. 12, Dec. 1990, pp.
46-63.
[Witt 93] Wittle, M., and Keith, B., "LADDIS: The Next Generation in NFS File Server
Benchmarking," Proceedings of the Summer 1993 USENIX Technical Conference,
Jun. 1993, pp. 111-128.
11
Advanced File Systems
11.1 Introduction
Operating systems need to adapt to changes in computer hardware and architecture. As newer and
faster machines are designed, the operating system must change to take advantage of them. Often
developments in some components of the computer outpace those in other parts of the system. This
changes the balance of the resource utilization characteristics, and the operating system must ree-
valuate its policies accordingly.
Since the early 1980s, the computer industry has made very rapid strides in the areas of CPU
speed and memory size and speed [Mash 87]. In 1982, UNIX was typically run on a VAX 11/780,
which had a 1-mips (million instructions per second) CPU and 4-8 megabytes of RAM, and was
shared by several users. By 1995, machines with a 100-mips CPU and 32 megabytes or more of
RAM have become commonplace on individual desktops. Unfortunately, hard disk technology has
not kept pace, and although disks have become larger and cheaper, disk speeds have not increased
by more than a factor of two. The UNIX operating system, designed to function with moderately
fast disks but small memories and slow processors, has had to adapt to these changes.
Using traditional file systems on today's computers results in severely I/O-bound systems,
unable to take advantage of the faster CPUs and memories. As described in [Stae 91], if the time
taken for an application on a system is c seconds for CPU processing and i seconds for I/O, then the
performance improvement seen by making the CPU infinitely fast is restricted to the factor (1 + c/i).
If i is large compared to c, then reducing c yields little benefit. It is essential to find ways to reduce
the time the system spends doing disk I/O, and one obvious target for performance improvements is
the file system.
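A small, self-contained calculation makes the bound concrete (the numbers are purely illustrative):

    #include <stdio.h>

    /* Upper bound on speedup from an infinitely fast CPU: (c + i) / i = 1 + c/i. */
    int main(void)
    {
        double c = 2.0;   /* seconds of CPU time (illustrative)  */
        double i = 8.0;   /* seconds of disk I/O (illustrative)  */

        printf("max speedup = %.2f\n", 1.0 + c / i);   /* prints 1.25 */
        return 0;
    }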
Throughout the mid- and late 1980s, an overwhelming majority of UNIX systems had either
s5fs or FFS (see Chapter 9) on their local disks. Both are adequate for general time-sharing applica-
tions, but their deficiencies are exposed when used in diverse commercial environments. The
vnode/vfs interface made it easier to add new file system implementations into UNIX. Its initial use,
however, was restricted to small, special-purpose file systems, which did not seek to replace s5fs or
FFS. Eventually, the limitations of s5fs and FFS motivated the development of several advanced file
systems that provide better performance or functionality. By the early 1990s, many of these had
gained acceptance in mainstream UNIX versions. In this chapter, we discuss the drawbacks of tra-
ditional file systems, consider various ways of addressing them, and examine some of the major file
systems that have emerged as alternatives to s5fs and FFS.
Several applications (for example, in the database and multimedia domains) use much larger
files. In fact, the constraint that the file size be less than 4 gigabytes (since the size field in
the inode is 32 bits long) is also considered too restrictive.
Let us now examine the performance and crash recovery issues in greater detail, identify
their underlying causes, and explore ways in which they may be addressed.
1 Some modern SCSI disks cache writes on a per-track basis, using the rotational energy of the drive to write cached
data in case of a power failure.
[Figure: a disk track with 8 sectors per track and rotdelay = 2]
Finally, fsck provides a limited form of crash recovery-it returns the file system to a consis-
tent state. A reliable file system should deliver more than that. The ideal, of course, would be full
recovery, which requires each operation to be committed to stable storage before returning control
to the user. While that policy is followed by NFS and some non-UNIX file systems such as that of
MS-DOS, it suffers from poor performance. A more reasonable objective is to limit the damage
caused by the crash, without sacrificing performance. As we shall see (Section 11.7), such a goal
can indeed be attained.
Write clustering requires a change to the ufs_putpage() routine, which flushes a page to
disk. In Sun-FFS, this routine simply leaves the pages in the cache and returns successfully to the
caller, until a full cluster is in the cache or the sequential write pattern is broken. When that hap-
pens, it calls bmap() to find the physical location of these pages and writes them out in a single op-
eration. If the allocator has not been able to place the pages contiguously on disk, bmap() returns a
smaller length, and ufs_putpage() spreads the write over two or more operations.
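The following user-space model sketches the idea under stated assumptions: it collapses logical and physical block numbers into a single value for brevity, and all names and sizes are invented, so it is not the Sun-FFS code itself.

    #include <stdio.h>

    #define PAGES_PER_CLUSTER 8

    static long pending[PAGES_PER_CLUSTER];  /* disk block of each deferred page */
    static int  npending;

    static void flush_cluster(void)
    {
        int i = 0;
        while (i < npending) {
            int run = 1;
            /* Extend the run while the allocator placed blocks contiguously. */
            while (i + run < npending && pending[i + run] == pending[i] + run)
                run++;
            printf("write %d page(s) starting at disk block %ld\n", run, pending[i]);
            i += run;
        }
        npending = 0;
    }

    void put_page(long diskblock)
    {
        int sequential = (npending == 0 ||
                          diskblock == pending[npending - 1] + 1);
        if (!sequential)
            flush_cluster();             /* pattern broken: flush what we have */
        pending[npending++] = diskblock;
        if (npending == PAGES_PER_CLUSTER)
            flush_cluster();             /* full cluster: as few writes as possible */
    }

    int main(void)
    {
        long b;
        for (b = 100; b < 110; b++)      /* ten sequential pages          */
            put_page(b);
        put_page(500);                   /* non-sequential: forces a flush */
        flush_cluster();
        return 0;
    }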
While Sun-FFS adds a few refinements to address issues such as cache wiping, the above
changes describe the bulk of the clustering enhancements. Performance studies have shown that se-
quential reads and writes are improved by about a factor of two, whereas random access occurs at
about the same or slightly better speed than traditional FFS. The clustering approach does not en-
hance NFS write performance, since NFS requires all changes to be committed to disk synchro-
nously. To extend the benefits of clustering to NFS writes, it is necessary to incorporate the NFS
write-gathering optimizations described in Section 10.7.3.
log-structured file system, the log is the only representation of the file system on disk.
Such an approach, of course, requires full logging (data as well as metadata).
• Redo and undo logs - There are two types of logs: redo-only and undo-redo. A redo-
only log records only the modified data. An undo-redo log records both old and new val-
ues of the data. The redo-only log simplifies crash recovery, but places greater constraints
on the ordering of writes to the log and of in-place metadata updates (see Section 11.7.2
for details). The undo-redo log is larger and has more complex recovery mechanisms, but
allows greater concurrency during normal use.
• Garbage collection - Although a small number of implementations expand the log end-
lessly, moving old portions of the log onto tertiary storage, the popular approach is to have
a finite-sized log. This requires garbage collection of obsolete portions of the log, which is
treated as a logically circular file. This can be done on a running system or may require
stand-alone operation.
• Group commit - In order to meet the performance goals, the file system must write the
log in large chunks, bundling together several small writes if necessary. In deciding the
frequency and granularity of these writes, we need to make a tradeoff between perform-
ance and reliability, since the unwritten chunk is vulnerable to a crash.
• Retrieval - In a log-structured file system, we need an efficient way of retrieving data
from the log. Although the normal expectation is that a large cache will satisfy most reads,
making disk access a rarity, we still need to make sure that cache misses can be handled in
a reasonable time. This requires an efficient indexing mechanism to locate arbitrary file
blocks in the log.
but slow disks). On a system in steady state (one that has been running for a while), a large cache
could easily have a hit rate of more than 90%. Nevertheless, for those blocks that must be accessed
from disk (and there will be many of these when the system is initially booted), we need a way to
locate the data in the log in a reasonable time. Hence a fully log-structured file system must provide
an efficient way of addressing its contents.
The 4.4BSD log-structured file system, known as BSD-LFS [Selt 93], is based on similar
work in the Sprite operating system [Rose 90a]. In the rest of this section, we describe its structure
and implementation, and see how it achieves its objectives of reliability and performance.
[Figure: the log laid out as segments - full segments (1) and (2), partial segments (3) and (4), next-segment pointers, and the current end of the log]
locates the inode on disk. Instead of computing the disk address directly from the inode number, it
looks up the address in the inode map, using the inode number as an index.
In the cache, data blocks are identified and hashed by vnode and logical block number. The
indirect blocks do not easily fit into this scheme. In FFS, indirect blocks are identified by the vnode
of the disk device and the physical block number. Because LFS does not assign disk addresses until
the segment is ready to be written, there is no convenient way to map these blocks. To get around
this problem, LFS uses negative logical block numbers to refer to indirect blocks. Each indirect
block number is the negative of that of the first block it references. Each double indirect block has a
number equal to one less than that of the first indirect block it points to, and so on.
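A minimal sketch of this numbering convention follows; the helper names and the example values are illustrative, not the BSD-LFS source.

    #include <stdio.h>

    /* A single indirect block is the negative of the first data block it maps. */
    long single_indirect_lbn(long first_data_block)
    {
        return -first_data_block;
    }

    /* A double indirect block is one less than the first indirect block it maps. */
    long double_indirect_lbn(long first_indirect_lbn)
    {
        return first_indirect_lbn - 1;
    }

    int main(void)
    {
        long ind  = single_indirect_lbn(12);   /* indirect block covering block 12 -> -12 */
        long dind = double_indirect_lbn(ind);  /* double indirect pointing at it   -> -13 */
        printf("%ld %ld\n", ind, dind);
        return 0;
    }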
[Figure: cached copies of a data block and its inode in memory, and the new copies of the block and inode appended to the log on disk]
11.6.5 Analysis
There are three areas that create awkward problems for BSD-LFS. First, when a directory operation
involves more than one metadata object, these modifications may not all make it to the same partial
segment. This requires additional code to detect such cases and recover correctly if only a part of the
operation survives the crash.
Second, disk block allocation occurs when the segment is being written, not when the block
is first created in memory. Careful accounting of free space is necessary, or else a user may see a
successful return from a write system call, but the kernel may later find that there is no room on the
disk.
Finally, efficient operation of BSD-LFS requires a large physical memory, not only for the
buffer cache, but also for the large data structures and staging buffers required for logging and gar-
bage collection.
[Selt 93] and [Selt 95] describe detailed experiments comparing the performance of BSD-
LFS with traditional FFS and with Sun-FFS.
[Figure: segment cleaning - the segment being cleaned, and the end of the log before and after garbage collection]
The results show that BSD-LFS provides superior performance to traditional FFS in most circumstances (the exception being under high degrees of mul-
tiprogramming, where its performance is slightly worse). In comparison with Sun-FFS, BSD-LFS is
clearly superior in metadata-intensive tests (which focus on operations such as create, remove,
mkdir, and rmdir). In measurements of read and write performance and general multiuser bench-
marks, the results are less clear. Sun-FFS is faster in most I/O-intensive benchmarks, especially
when the BSD-LFS cleaner is turned on. The two are comparable for general, multiuser simulations
such as the Andrew benchmark [Howa 88].
The performance gains of BSD-LFS are questionable at best, since Sun-FFS provides equal
or better gains at a mere fraction of the implementation cost. LFS requires rewriting not only the file
system, but also a host of utilities, such as newfs and fsck, that understand the on-disk structures.
The real advantages of BSD-LFS are that it provides fast crash recovery and improves the perform-
ance of metadata operations. Section 11.7 shows how metadata logging can provide the same
benefits with a lot less effort.
Another log-structured file system worthy of note is the Write-Anywhere File Layout
(WAFL) system used by Network Appliance Corporation in their FAServer family of dedicated NFS
servers [Hitz 94]. WAFL integrates a log-structured file system with nonvolatile memory
(NV-RAM) and a RAID-4 disk array to achieve extremely fast response times for NFS access.
WAFL adds a useful facility called snapshots. A snapshot is a frozen, read-only copy of an active
file system. The file system can maintain a number of snapshots of itself, taken at different times,
subject to space constraints. Users can access the snapshots to retrieve older versions of files or to
undelete accidentally removed files. System administrators can use a snapshot to back up the file
system, since it provides a consistent picture of the file system at a single instant in time.
This approach provides the primary benefits of logging-rapid crash recovery and faster
metadata operations-without the drawbacks of a log-structured file system (complex, requires re-
writing of utilities, garbage collection degrades performance). Metadata logging has minimal impact
on normal I/O operations, but needs careful implementation to prevent the logging overhead from
reducing overall system performance.
The metadata log typically records changes to inodes, directory blocks and indirect blocks. It
may also include changes to superblocks, cylinder group summaries, and disk allocation bitmaps, or
the system may opt to reconstruct this information during crash recovery.
The log may reside either inside the file system itself or externally as an independent object.
The choice is governed by considerations regarding efficient disk usage and performance. The Ce-
dar file system [Hagm 87], for instance, implements the log as a fixed-size, circular file, using preal-
located blocks near the middle cylinders of the disk (so that it can be accessed quickly). The log file
is just like any other file: It has a name and an inode, and it may be accessed without special
mechanisms. In the Calaveras file system [Vaha 95], all file systems on a machine share a single
log, which resides on a separate disk. The Veritas file system [Yage 95] keeps the log separate from
the file system, but allows the system administrator to decide whether to dedicate a separate disk.
As a first example, let us consider a redo-only, new-value (the log records the new values of the
changed objects) logging scheme, such as the one in the Cedar file system [Hagm 87]. The log does
not deal with file data writes, which continue to be handled in the usual way. Figure 11-5 describes
an operation such as setattr, which modifies a single inode. The kernel executes the following se-
quence of actions:
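Broadly, and only as an illustrative sketch, such an operation modifies the cached inode, commits a new-value record to the log, and defers the in-place write; the structures and routine below are invented for the example, not the Cedar code.

    #include <stdio.h>

    struct inode  { int ino; int mode; int uid; };
    struct logrec { int ino; struct inode newval; };   /* new-value (redo) record */

    static struct logrec log_on_disk[64];
    static int           log_tail;

    /* Append the new inode image to the log and treat it as committed. */
    static void log_commit(const struct inode *ip)
    {
        log_on_disk[log_tail].ino    = ip->ino;
        log_on_disk[log_tail].newval = *ip;
        log_tail++;                      /* pretend this write reached the disk */
    }

    void setattr_logged(struct inode *ip, int new_mode)
    {
        ip->mode = new_mode;             /* 1. modify the cached (in-core) inode */
        log_commit(ip);                  /* 2. commit the new-value log record   */
        /* 3. the in-place write of the inode block is deferred; it may be batched
         *    with other dirty inodes, or eliminated if the inode changes again
         *    before it is flushed.                                              */
        printf("inode %d: setattr logged, in-place update deferred\n", ip->ino);
    }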
This simple example illustrates how logging can impact system performance. On one hand,
each metadata update is written to disk twice-once in the log entry and once during the in-place
update. On the other hand, since the in-place updates are delayed, they are often eliminated or
batched. For instance, the same inode may be modified several times before it is flushed to disk, and
multiple inodes in the same disk block are written back in a single I/O operation. For a metadata
logging implementation to perform reasonably, the reduction in the in-place updates should com-
pensate for the logging overhead.
Batching can be applied to log writes as well. Many file operations modify multiple
metadata objects. For instance, a mkdir modifies the parent directory and its inode, and also allo-
cates and initializes a new directory and inode. Some file systems combine all changes caused by a
single operation into one log entry. Some go further and collect changes from a number of opera-
tions into a single entry.
This decision affects not only the performance, but also the reliability and consistency guar-
antees of the file system. If the file system crashes, it will lose any changes that were not written out
to the log. If the file system is used for NFS access, it cannot reply to the client requests until the
changes have been committed to the log. If multiple operations modify the same object, the changes
must be serialized to avoid inconsistency. This is discussed in detail in Section 11.7.2.
Since the log is fixed in size, it wraps around when it reaches its end. The file system must
prevent it from overwriting useful data. A log entry is considered active until all its objects have
been flushed to their on-disk locations (the in-place updates). There are two ways to deal with the
wraparound condition. One is to perform explicit garbage collection, as in BSD-LFS. The cleaner
must constantly stay one step ahead of the log and free up space by moving active entries to the tail
of the log [Hagm 87]. A simpler approach is to be more proactive with in-place updates. If the log is
large enough for in-place updates to keep it clean at peak loads, garbage collection can be avoided
altogether. This does not require a very large log-a few megabytes are sufficient for small servers
or time-sharing systems.
Suppose an operation modifies two metadata objects-A and B, in that order. A robust file
system may provide either an ordering or a transactional guarantee for multiple updates. An order-
ing guarantee promises that after recovering from a crash, the disk would have the new contents of
object B only if it also had the new contents of object A. A transactional guarantee is stronger,
promising that either both modifications would survive the crash, or neither would.
A redo log can satisfy the ordering guarantee by delaying the in-place update of any
metadata object until after its log entry is committed to disk. It writes objects to the log in the order
in which they are modified or created. In the previous example, it writes the log entry for object A
before that of B (it may also write them out in a single log entry). This preserves the ordering of the
changes even if the in-place update of B precedes that of A.
Transactional guarantee in a redo-only log requires that neither object may be flushed to
disk until the log entries for both blocks are written out successfully. This may be trivial if the two
blocks have been bundled into a single log entry, but this cannot always be guaranteed. Suppose the
in-place update of A occurred before writing the log entry for B, and the system were to crash in
between. There is no way of recovering the old copy of A or the new copy of B. Hence we need to
force both log entries to disk before writing back either cached entry. The log also needs to add in-
formation that identifies the two updates as belonging to the same transaction, so that we do not re-
play partial transactions during recovery.
There is, in fact, a stronger requirement, which applies to concurrent operations on the same
object. It is incorrect to even read a modified object until it has been written to the log. Figure 11-6
shows a potential race condition. Process p1 modifies object A and is about to write it out, first to
the log and then to disk. Before it can do so, process p2 reads object A and, based on its contents,
modifies object B. It then writes B to the log, and is about to write it to the disk. If the system were
to crash at this instant, the log contains the new value of B, but the new value of A is neither in the
log nor on disk. Since the change to B depends on the change to A, this situation is potentially in-
consistent.
To take a concrete example, suppose p1 is deleting a file from a directory, while p2 is creat-
ing a file with the same name in the same directory. p1 deletes the file name from block A of the
directory. p2 finds that the directory does not have a file with this name and proceeds to make a di-
rectory entry in block B of the directory. When the system recovers from the crash, it has the old
block A and the new block B, both of which have a directory entry for the same file name.
11.7.3 Recovery
If the system crashes, the file system recovers by replaying the log, and using its entries to update
metadata objects on disk. This section describes recovery in a redo-only log. Section 11.8.3 dis-
cusses the issues related to undo-redo logs. The main problem is to determine the beginning and end
of the log, since it wraps around continuously. [Vaha 95] describes one solution. During normal op-
eration, the file system assigns an entry number to each entry. This number is monotonically in-
creasing and corresponds to the location of the entry in the log. When the log wraps around, the en-
try number continues to increase. Hence at any time, the relationship between the entry number and
its location in the log (its offset from the start of the log, measured in 512-byte units) is given by
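Presumably (this is an assumed reconstruction rather than a quotation of [Vaha 95]) the relationship is simply the entry number reduced modulo the log size:

    /* Assumed reconstruction: with a log of log_size 512-byte units, an entry's
     * offset within the circular log is its entry number modulo the log size.  */
    unsigned long log_offset(unsigned long entry_number, unsigned long log_size)
    {
        return entry_number % log_size;
    }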
11.7.4 Analysis
Metadata logging provides the important benefits of logging, namely rapid crash recovery and faster
metadata operations, without the complexity and implementation effort of log-structured file sys-
tems. The system recovers from a crash by replaying the log, writing its metadata objects to their
on-disk locations. This usually takes a fraction of the time required by disk-checking utilities such
as fsck.
Metadata logging also speeds up operations that modify multiple metadata objects, such as
mkdir and rmdir, by collecting all changes made by the operation into a single log entry, thus reduc-
ing the number of synchronous writes. In this way, it also provides ordering or transactional guaran-
tees (depending on the implementation) for related metadata changes. This makes the file system
more robust than traditional architectures.
The overall impact on performance is unclear. Logging has no impact on operations that do
not modify the file system and little impact on data writes. Overall, logging is meant to reduce the
number of in-place metadata writes by deferring them. To obtain adequate performance, this reduc-
tion should compensate for the logging overhead.
[Vaha 95] shows that the log may become a performance bottleneck and describes several
optimizations to prevent this. It also describes a number of experiments comparing two file systems
whose sole difference is that one uses logging and the other does not. The logging file system is
much faster in metadata-intensive benchmarks, but is only marginally better in a LADDIS bench-
mark [Witt 93], which simulates multiuser NFS access.
Metadata logging has some important drawbacks and limitations. Although it minimizes the
metadata changes lost in a crash, it does not limit the loss of ordinary file data (other than by run-
ning the update daemon). This also means that we cannot assure transactional consistency for all
operations-if an operation modifies both a file and its inode, the two are not updated atomically.
Hence a crash may result in just one of the two components of the transaction being recovered.
Overall, metadata logging offers increased robustness and rapid recovery, as well as modest
performance gains, without changing the on-disk structure of the file system. It is also relatively
easy to implement, since only the part of the file system that deals with writing metadata to disk
needs to be modified.
The debate between metadata logging and log-structured file systems has raged for some
time in the UNIX community. Metadata logging is winning the argument and has been the basis of
several successful commercial implementations, including the Veritas File System (VxFS) from
Veritas Corporation, IBM's Journaling File System (JFS), and Transarc's Episode File System (see
Section 11.8). Moreover, since metadata logging does not affect the data transfer code, it is possible
to combine it with other enhancements such as file-system clustering (Section 11.3) or NFS write-
gathering (Section 10.7.3), resulting in a file system that performs well for both data and metadata
operations.
tions from the logical file system structure. Section 10.18 described DCE's Distributed File System
(DCE DFS). In this section, we discuss the structure and features of Episode.
[Figure: filesets grouped into aggregates spanning multiple disks - e.g., aggregate 1 containing filesets 1, 2, 5, and 6, and aggregate 3 on disk 4 containing filesets 3, 4, and 7]
2 IBM's Journaling File System (JFS) was among the first UNIX file systems to allow logical volumes to span disk
partitions.
3 Current DCE tools only allow all the filesets in an aggregate to be exported together.
11.8.2 Structure
The aggregate comprises several containers. The fileset container stores all its files and anodes. The
anodes reside in the fileset anode table at the head of the container and are followed by the data and
indirect blocks. A container does not occupy contiguous storage within the aggregate, so it can
shrink and grow dynamically with ease. Thus the file block addresses refer to block numbers within
the aggregate and not within the container.
The bitmap container allows aggregate-wide block allocation. For each fragment in the ag-
gregate, it stores whether the fragment is allocated and whether it is used for logged or unlogged
data. This last information is used for special functions that must be performed when reusing a
logged fragment for unlogged data, and vice versa.
The log container contains an undo-redo, metadata-only log of the aggregate. The advan-
tages of undo-redo logs are discussed in Section 11.8.3. The log is fixed in size and is used in a cir-
cular fashion. Although current implementations place the log in the same aggregate as the one it
represents, that is not a strict requirement.
The aggregate fileset table (Figure 11-8) contains the superblock and the anode for each
container in the aggregate. Directory entries reference files by the fileset ID and the anode number
within the fileset. A file is located by first searching the aggregate fileset table for the anode of the
fileset and then indexing into the fileset's anode table for the desired file. Of course, appropriate use
of caching speeds up most of these operations.
Containers allow three modes of storage-inline, fragmented, and blocked. Each anode has
some extra space, and the inline mode stores small amounts of data in that. This is useful for sym-
bolic links, access-control lists, and small files. In the fragmented mode, several small containers
may share a single disk block. The blocked mode allows large containers and supports four levels of
indirection. This allows a maximum file size of 2^31 disk blocks.4
[Figure 11-8: layout of an aggregate - the superblock, the aggregate fileset table anode, the bitmap anode, and the log anode, followed by fileset anodes that point to file data blocks]
11.8.3 Logging
Episode uses a redo-undo metadata log, which provides the strong transactional guarantees de-
scribed in Section 11.7.2. The redo-undo log offers greater flexibility, since each entry stores both
the old and the new value of the object. During crash recovery, the file system has the option of re-
playing an entry by writing its new value to the on-disk object, or rolling it back by writing the old
value.
Transactional guarantees in redo-only logs require a two-phase locking protocol, which
locks all objects involved in a transaction until the entire transaction is committed to disk, so that no
other transaction will even read any uncommitted data. This reduces the concurrency of the system,
and incurs a substantial penalty in performance. Episode avoids this by using a mechanism called an
equivalence class, which contains all active transactions that involve the same metadata objects. The
equivalence class has the property that either all its transactions commit or none do.
In the event of a crash, the recovery procedure replays all complete equivalence classes, but
rolls back all transactions of an incomplete equivalence class. This allows a high degree of concur-
rency during normal operation. However, it doubles the size of each log entry, increases the I/O
traffic to the log disk, and complicates log recovery.
In Episode, the buffer cache is tightly integrated with the logging facility. Higher-level
functions do not modify buffers directly, but call the logging functions. The logger correlates the
buffers with log entries and ensures that a buffer is not flushed to disk until its log entry has been
written successfully.
user:rohan:rwx-i-
Each dash indicates a permission that is not granted. In this example, rohan does not have
control and delete permissions.
11.9 Watchdogs
A file system implementation defines its policies on several issues such as naming, access control,
and storage. These semantics are applied uniformly to all its files. Often, it is desirable to override
the default policies for some files that might benefit from special treatment, such as in the following
examples:
• Allow users to implement different access control mechanisms.
• Monitor and log all access to a particular file.
• Take some automatic actions upon receipt of mail.
• Store the file in compressed form and automatically decompress it when read.
Such functionality was provided in an extension to FFS developed at the University of
Washington [Bers 88]. The basic idea was to associate a user-level process called a watchdog with a
file or directory. This process intercepts selected operations on the file and can provide its own im-
plementation of those functions. Watchdog processes have no special privileges, are completely
transparent to applications accessing the files, and incur additional processing expense only for the
operations they override.
A wdlink system call was added to associate a watchdog process with a file, thus making it a
guarded file. The arguments to wdlink specified the filename and the name of the watchdog pro-
gram. The program name was stored in a 20-byte area in the inode that was reserved for "future use"
in BSD UNIX. To circumvent the 20-character limit, the names referred to entries in a public direc-
tory called /wdogs, which contained symbolic links to the real watchdog programs.
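The exact prototype of wdlink is not shown here, so the call below assumes a simple (path, watchdog-name) signature; the mailbox path is likewise illustrative, and the program links only on a system that actually provides the watchdog extension.

    #include <stdio.h>

    /* Assumed prototype for the wdlink call described above. */
    extern int wdlink(const char *path, const char *watchdog);

    int main(void)
    {
        /* Guard a mailbox with the mail-notification watchdog (wdbiff). */
        if (wdlink("/usr/spool/mail/rohan", "wdbiff") < 0) {
            perror("wdlink");
            return 1;
        }
        return 0;
    }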
When a process tries to open a guarded file, the kernel sends a message to the watchdog
process (starting it up if not already running). The watchdog may use its own policies to permit or
deny access, or it may pass the decision back to the kernel. If the open is permitted, the watchdog
informs the kernel of the set of operations on the file that it is interested in guarding. This set of
guarded operations may be different for different open instances of the file, thus providing multiple
views of the same file.
Once opened, whenever a user tries to invoke a guarded operation, the kernel relays it to the
watchdog (Figure 11-9). The watchdog must do one of three things:
• Perform the operation. This may involve passing additional data between the kernel and
the watchdog (such as for read or write operations). To avoid loops, the watchdog is al-
lowed direct access to the file it is guarding.
• Deny the operation, passing back an error code.
• Simply acknowledge the operation, and ask the kernel to perform it in the usual manner.
The watchdog may perform some additional processing, such as accounting, before defer-
ring the operation to the kernel.
[Figure: handling of a guarded operation - an unguarded operation receives normal system call processing; a guarded one is relayed to the watchdog process through the session table and message queue]
The watchdog reads the messages and sends its replies using the file descriptor returned by
createwmc. This maps to an entry for the WMC in the open file table, which points back to the ker-
nel end of the WMC. Thus both the watchdog and the kernel can access the message queue, and put
and get messages from it.
A master watchdog process manages all watchdog processes. It controls their creation (when
the guarded file is opened) and termination (usually upon the last close of the file). It may choose to
keep some frequently used watchdogs active even when no one has the associated file open, to
avoid the cost of starting up new processes each time.
11.9.3 Applications
The original implementation described several interesting applications:
wdacl Associates an access-control list with a file. A single watchdog may con-
trol access to many files.
wdcompact Provides on-line compression and decompression.
wdbiff Watches a user's mailbox and notifies the user when new mail arrives.
This may be extended to provide auto-answering or auto-forwarding ca-
pabilities.
wdview Presents different views of a directory to different users.
wddate Allows users to read the current date and time from a file. The file itself
contains no data; the watchdog reads the system clock whenever the file is
accessed.
User interfaces that provide graphical views of the file tree can also benefit from watchdogs.
Whenever a user creates or deletes a file, the watchdog can ask the user interface to redraw itself to
reflect the new state of the directory. As these examples show, watchdogs provide a versatile
mechanism to extend the file system in several ways, limited only by the imagination. The ability to
redefine individual operating system functions at the user level is extremely useful and merits con-
sideration in modern operating systems.
[Figure: the portal file system - a user process opens /p/rest-of-path; the portal file system in the kernel passes rest-of-path to the portal daemon over a UNIX socket, and the daemon returns an open file descriptor]
4. portal_open() passes the pathname to the portal daemon, which returns from the accept
system call.
5. The portal daemon processes the name as it sees fit and generates a file descriptor.
6. The daemon sends the descriptor back to the caller over a socket-pair connection set up by
the kernel.
7. The kernel copies the descriptor into the first unused slot in the caller's descriptor table.
8. The portal daemon dismantles the connection and calls accept to await further connection
requests.
Usually, the daemon creates a new child process to handle each request. The child executes
steps 5 through 7 and exits, thus dismantling the connection. The parent calls accept immediately
after creating the child.
The portal file system can be used in a number of ways. The portal daemon determines the func-
tionality it provides and also how it interprets the name space. One important application is the con-
nection server, mentioned earlier in Section 8.2.4. This server opens network connections on behalf
of other processes. Using portals, a process can create a TCP (Transmission Control Protocol) con-
nection simply by opening a file called
/p/tcp/node/service
where node is the name of the remote machine to connect to, and service is the TCP service (such as
ftp or rlogin) that the caller wishes to access. For instance, opening the file /p/tcp/archana/ftp
opens a connection to the ftp server on node archana.
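From a program's point of view, using the connection server is just an open followed by ordinary reads and writes; this sketch assumes the portal file system is mounted on /p as above.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char    buf[512];
        ssize_t n;

        int fd = open("/p/tcp/archana/ftp", O_RDWR);   /* connect to ftp on archana */
        if (fd < 0) {
            perror("open");
            return 1;
        }
        n = read(fd, buf, sizeof buf - 1);             /* e.g., the FTP greeting    */
        if (n > 0) {
            buf[n] = '\0';
            printf("%s", buf);
        }
        close(fd);
        return 0;
    }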
The daemon performs all the work required to set up the connection. It determines the net-
work address of the remote machine, contacts the portmapper on that node to determine the port
number for the service, creates a TCP socket, and connects to the server. It passes the file descriptor
for the connection back to the calling process, which can use the descriptor to communicate to the
server.
This makes TCP connections available to naive applications. A naive application is one that
only uses stdin, stdout, and stderr, and does not use special knowledge about other devices. For in-
stance, a user can redirect the output of a shell or awk script to a remote node by opening the appro-
priate portal file.
Similar to watchdogs, the portal file system allows a user process to intercept file operations
by other processes, and implement them on their behalf. There are a few important differences. The
portal daemon only intercepts the open system call, whereas a watchdog may intercept a number of
operations of its choosing. Watchdogs may also intercept an operation, perform some work, and
then ask the kernel to complete the operation. Finally, the portal daemon defines its name space.
This is possible because, in 4.4BSD, namei() passes the rest of the pathname to portal_lookup()
when it crosses the mount point. Watchdogs usually operate on the existing file hierarchy, although
directory watchdogs can extend the name space in a limited way.
[Figure 11-12: an encryption/decryption layer stacked above a physical file system]
tion on a file, the kernel dynamically routes it to the file system to which the file belongs. This file
system is responsible for complete implementation of the operation.
The stackable layers framework allows multiple file systems to be mounted on top of each
other. Each file is represented by a vnode stack, with one vnode for each file system in the stack.
When the user invokes a file operation, the kernel passes it to the topmost vnode. This vnode may
do one of two things: It may execute the operation completely and pass the results back to the caller.
Alternatively, it may perform some processing, and pass the operation down to the next vnode in the
stack. This way, the operation can pass through all the layers. On return, the results again go up
through all the layers, giving each vnode the chance to do some additional processing.
This allows incremental file system development. For instance, a vendor may provide an en-
cryption-decryption module, which sits on top of any physical file system (Figure 11-12). This
module intercepts all I/0 operations, encrypting data while writing and decrypting it while reading.
All other operations are passed directly to the lower layer.
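A schematic, user-space model of such a pass-through layer is shown below; the structure, the single read operation, and the trivial XOR transform are all invented for the illustration and stand in for real encryption and the real vnode interface.

    #include <stddef.h>

    struct xvnode {
        struct xvnode *lower;                       /* next vnode down the stack */
        size_t (*op_read)(struct xvnode *, char *, size_t);
    };

    static char xor_key = 0x5a;                     /* toy "encryption"          */

    size_t crypt_read(struct xvnode *vp, char *buf, size_t len)
    {
        /* Pass the operation to the lower layer, then post-process the result. */
        size_t n = vp->lower->op_read(vp->lower, buf, len);
        for (size_t i = 0; i < n; i++)
            buf[i] ^= xor_key;                      /* decrypt on the way up     */
        return n;
    }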
The stacking may allow fan-in or fan-out configurations. A fan-in stack allows multiple
higher layers to use the same lower layer. For example, a compression layer may compress data
while writing and decompress it while reading. A backup program may want to read the compressed
data directly while copying it to a tape. This results in the fan-in configuration shown in Figure
11-13.
A fan-out stack allows a higher layer to control multiple lower layers. This could be used for
a hierarchical storage manager (HSM) layer, which keeps recently accessed files on local disks, and
migrates rarely used files onto optical disks or tape jukeboxes (Figure 11-14). The HSM layer inter-
cepts each file access, both to track file usage and to download files from tertiary storage when
needed. [Webb 93] describes an HSM implementation using a framework that combines features of
stackable layers and watchdogs.
[Figure 11-14: a hierarchical storage manager layer fanned out over a ufs layer on local disk and a tertiary storage manager]
[Figure: vnodes in a stack - each vnode holds a pointer to its vnode operations vector, a pointer to private data, a pointer to its vfs, and a pointer to the next vnode below it]
• A vnode may hold a reference to another vnode as part of its private data. For example,
the root directory vnode of a file system keeps a reference to the mount point.
• The vfs operations must also be passed on to lower layers. To achieve this, many vfs op-
erations were converted to vnode operations that can be invoked on any vnode of the file
system.
• Many operations in the current interface operate on multiple vnodes. To function cor-
rectly, these must be broken up into separate suboperations on each vnode. For instance,
VOP_LINK must be divided into two operations: one on the file vnode to fetch its file ID
and increment its link count, and another on the directory vnode to add the entry for the
file.
• A transaction facility is needed to ensure atomic execution of suboperations invoked by
the same high-level operation.
• The <vnode, offset> name space for the page cache does not map well to the stacked
vnode interface, since the page now belongs to multiple vnodes. The interface with the
virtual memory system must be redesigned to handle this.
The following section describes two interesting applications of the stackable file system in-
terface.
11.13 Summary
We have seen several advanced file systems in this chapter. Some of them replace existing imple-
mentations such as FFS and s5fs, whereas some extend the traditional file systems in different ways.
These file systems offer higher performance, quicker crash recovery, increased reliability, or en-
hanced functionality. Some of these systems have already gained commercial acceptance; most re-
cent UNIX releases feature an enhanced file system that uses some form of logging.
The vnode/vfs interface has been an important enabling technology, allowing these new im-
plementations to be integrated into the UNIX kernel. The stackable layers framework addresses
many limitations of the vnode interface and promotes incremental file system development. 4.4BSD
has already adopted this approach, and commercial vendors are exploring it.
11.14 Exercises
1. Why does FFS use a rotdelay factor to interleave disk blocks? What does it assume about
usage patterns and buffer cache sizes?
2. How would file system clustering affect the performance of an NFS server?
3. What is the difference between file system clustering and write-gathering (described in
Section 10.7.3)? What situation is each one useful in? When is it beneficial to combine the
two?
4. Would file system clustering reduce the benefit of nonvolatile memory? What is a good way
of using NV-RAM in a system that supports clustering?
11.15 References
[Bers 88] Bershad, B.N., and Pinkerton, C.B., "Watchdogs-Extending the UNIX File
System," Computing Systems, Vol. I, No. 2, Spring I988, pp. I69-I88.
[Chut 92] Chutani, S., Anderson, O.T., Kazar, M.L., Mason, W.A., and Sidebotham, R.N.,
"The Episode File System," Proceedings of the Winter 1992 USENIX Technical
Conference, Jan. 1992, pp. 43-59.
[Hagm 87] Hagmann, R., "Reimplementing the Cedar File System Using Logging and Group
Commit," Proceedings of the 11th Symposium on Operating Systems Principles,
Nov. 1987, pp. 155-162.
[Heid 94] Heidemann, J.S., and Popek, G.J., "File-System Development with Stackable
Layers," ACM Transactions on Computer Systems, Vol. 12, No. 1, Feb. 1994, pp.
58-89.
[Hitz 94] Hitz, D., Lau, J., and Malcolm, M., "File System Design for an NFS File Server
Appliance," Proceedings of the Winter 1994 USENIX Technical Conference, Jan.
1994, pp. 235-245.
[Howa 88] Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., and
Sidebotham, R.N., "Scale and Performance in a Distributed File System," ACM
Transactions on Computer Systems, Vol. 6, No. 1, Feb. 1988, pp. 55-81.
[Leff 89] Leffler, S.J., McKusick, M.K., Karels, M.J., and Quarterman, J.S., The Design and
Implementation of the 4.3 BSD UNIX Operating System, Addison-Wesley, Reading,
MA, 1989.
[Mash 87] Mashey, J.R., "UNIX Leverage-Past, Present, Future," Proceedings of the Winter
1987 USENIX Technical Conference, Jan. 1987, pp. 1-8.
[McKu 84] McKusick, M.K., Joy, W.N., Leffler, S.J., and Fabry, R.S., "A Fast File System for
UNIX," ACMTransactions on Computer Systems, Vol. 2, No.3, Aug 1984, pp. 181-
197.
[McKu 95] McKusick, M.K., "The Virtual Filesystem Interface in 4.4BSD," Computing
Systems, Vol. 8, No. 1, Winter 1995, pp. 3-25.
[McVo 91] McVoy, L.W., and Kleiman, S.R., "Extent-like Performance from a UNIX File
System," Proceedings of the 1991 Winter USENIX Conference, Jan. 1991, pp. 33-
43.
[OSF 93] Open Software Foundation, OSF DCE Administration Guide-Extended Services,
Prentice-Hall, Englewood Cliffs, NJ, 1993.
[Oust 85] Ousterhout, J.K., Da Costa, H., Harrison, D., Kunze, J.A., Kupfer, M., and
Thompson, J.G., "A Trace-Driven Analysis of the UNIX 4.2 BSD File System,"
Proceedings of the 10th Symposium on Operating System Principles, Dec. 1985, pp.
15-24.
[Pend 95] Pendry, J.-S., and McKusick, M.K., "Union Mounts in 4.4BSD-Lite," Proceedings of
the Winter 1995 USENIX Technical Conference, Jan. 1995, pp. 25-33.
[Rose 90a] Rosenblum, M., and Ousterhout, J.K., "The LFS Storage Manager," Proceedings of
the Summer 1990 USENIX Technical Conference, Jun. 1990, pp. 315-324.
[Rose 90b] Rosenthal, D.S.H., "Evolving the Vnode Interface," Proceedings of the Summer
1990 USENIX Technical Conference, Jun. 1990, pp. 107-118.
[Rose 92] Rosenthal, D.S.H., "Requirements for a "Stacking" Vnode/VFS Interface," UNIX
International Document SF-01-92-N014, Parsippany, NJ, 1992.
[Selt 93] Seltzer, M., Bostic, K., McKusick, M.K., and Staelin, C., "An Implementation of a
Log-Structured File System for UNIX," Proceedings of the Winter 1993 USENIX
Technical Conference, Jan. 1993, pp. 307-326.
[Selt 95] Seltzer, M., and Smith, K.A., "File System Logging Versus Clustering: A
Performance Comparison," Proceedings of the Winter 1995 USENIX Technical
Conference, Jan. 1995, pp. 249-264.
[Skin 93] Skinner, G.C., and Wong, T.K., "Stacking Vnodes: A Progress Report," Proceedings
of the Summer 1993 USENIX Technical Conference, Jun. 1993, pp. 161-174.
[Stae 91] Staelin, C., "Smart Filesystems," Proceedings of the Winter 1991 USENIX
Conference, Jan. 1991, pp. 45-51.
[Stev 95] Stevens, W.R., and Pendry, J.-S., "Portals in 4.4BSD," Proceedings of the Winter 1995 USENIX Technical Conference, Jan. 1995, pp. 1-10.
[Vaha 95] Vahalia, U., Gray, C., and Ting, D., "Metadata Logging in an NFS Server,"
Proceedings of the Winter 1995 USENIX Technical Conference, Jan. 1995, pp. 265-
276.
[Webb 93] Webber, N., "Operating System Support for Portable Filesystem Extensions," Proceedings of the Winter 1993 USENIX Technical Conference, Jan. 1993, pp. 219-228.
[Witt 93] Wittle, M., and Keith, B., "LADDIS: The Next Generation in NFS File Server Benchmarking," Proceedings of the Summer 1993 USENIX Technical Conference, Jun. 1993, pp. 111-128.
[Yage 95] Yager, T., "The Great Little File System," Byte, Feb. 1995, pp. 155-158.
12
Kernel Memory Allocation
12.1 Introduction
The operating system must manage all the physical memory and allocate it both to other kernel sub-
systems and to user processes. When the system boots, the kernel reserves part of physical memory
for its own text and static data structures. This portion is never released and hence is unavailable for
any other purpose. 1 The rest of the memory is managed dynamically-the kernel allocates portions
of it to various clients (processes and kernel subsystems), which release it when it is no longer
needed.
UNIX divides memory into fixed-size frames or pages. The page size is a power of two,
with 4 kilobytes being a fairly typical value. 2 Because UNIX is a virtual memory system, pages that
are logically contiguous in a process address space need not be physically adjacent in memory. The
next three chapters describe virtual memory. The memory management subsystem maintains map-
pings between the logical (virtual) pages of a process and the actual location of the data in physical
memory. As a result, it can satisfy a request for a block of logically contiguous memory by allocat-
ing several physically non-contiguous pages.
This simplifies the task of page allocation. The kernel maintains a linked list of free pages.
When a process needs some pages, the kernel removes them from the free list; when the pages are
released, the kernel returns them to the free list. The physical location of the pages is unimportant.
1 Many modern UNIX systems (AIX, for instance) allow part of the kernel to be pageable.
2 This is a software-defined page size and need not equal the hardware page size, which is the granularity for protec-
tion and address translation imposed by the memory management unit.
The memall() and memfree() routines in 4.3BSD and the getpage() and freepage() routines in SVR4 implement this page-level allocator.
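In outline, a page-level allocator of this kind is little more than a linked list of free page frames. The following minimal sketch uses illustrative names; it is not the actual memall() or getpage() code.

    struct page {
        struct page *next;            /* link used while the page is on the free list */
    };

    static struct page *free_pages;   /* head of the free-page list */

    struct page *page_alloc(void)     /* remove one page from the free list */
    {
        struct page *p = free_pages;
        if (p != NULL)
            free_pages = p->next;
        return p;                     /* NULL if no pages are free */
    }

    void page_free(struct page *p)    /* return a page to the free list */
    {
        p->next = free_pages;
        free_pages = p;
    }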
The page-level allocator has two principal clients (Figure 12-1). One is the paging system,
which is part of the virtual memory system. It allocates pages to user processes to hold portions of
their address space. In many UNIX systems, the paging system also provides pages for disk block
buffers. The other client is the kernel memory allocator, which provides odd-sized buffers of mem-
ory to various kernel subsystems. The kernel frequently needs chunks of memory of various sizes,
usually for short periods of time.
The following are some common users of the kernel memory allocator:
• The pathname translation routine may allocate a buffer (usually 1024 bytes) to copy a
pathname from user space.
• The allocb() routine allocates STREAMS buffers of arbitrary size.
• Many UNIX implementations allocate zombie structures to retain exit status and resource
usage information about deceased processes.
• In SVR4, the kernel allocates many objects (such as proc structures, vnodes, and file de-
scriptor blocks) dynamically when needed.
Most of these requests are much smaller than a page, and hence the page-level allocator is
inappropriate for this task. A separate mechanism is required to allocate memory at a finer granular-
ity. One simple solution is to avoid dynamic memory allocation altogether. Early UNIX implemen-
tations [Bach 86] used fixed-size tables for vnodes, proc structures, and so forth. When memory
was required for holding temporary pathnames or network messages, they borrowed buffers from
the block buffer cache. Additionally, a few ad hoc allocation schemes were devised for special
situations, such as the clists used by the terminal drivers.
This approach has several problems. It is highly inflexible, because the sizes of all tables and caches are fixed at boot time (often at compile time) and cannot adjust to the changing demands on the system. The default sizes of these tables are selected by the system developers based on the us-
age patterns expected with typical workloads. Although system administrators can usually tune
these sizes, they have little guidance for doing so. If any table size is set too low, the table could
overflow and perhaps crash the system without warning. If the system is configured conservatively,
with large sizes of all tables, it wastes too much memory, leaving little for the applications. This
causes the overall performance to suffer.
Clearly, the kernel needs a general-purpose memory allocator that can handle requests for
large and small chunks of data efficiently. In the following section, we describe the requirements for
this allocator and the criteria by which we can judge different implementations. We then describe
and analyze various memory allocators used by modern UNIX systems.
small to be useful. The allocator reduces fragmentation by coalescing adjacent chunks of free mem-
ory into a single large chunk.
A KMA must be fast, because it is used extensively by various kernel subsystems, including
interrupt handlers, whose performance is usually critical. Both the average and the worst-case la-
tency are important. Because kernel stacks are small, the kernel uses dynamic allocation in many
situations where a user process would simply allocate the object on its stack. This makes allocation
speed all the more important. A slow allocator degrades the performance of the entire system.
The allocator must have a simple programming interface that is suitable for a wide variety of
clients. One possibility is to have an interface similar to the malloc() and free() functions of the
user-level memory allocator provided by the standard library:
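    /* The familiar user-level declarations from the standard library; shown here
       as an illustration of the interface style, not as the kernel's own routines. */
    void *malloc(size_t size);
    void free(void *ptr);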
An important advantage of this interface is that the free() routine does not need to know
the size of the region being freed. Often, one kernel function allocates a chunk of memory and
passes it to another subsystem, which eventually frees it. For example, a network driver may allo-
cate a buffer for an incoming message and send it to a higher-level module to process the data and
free the buffer. The module releasing the memory may not know the size of the allocated object. If
the KMA can monitor this information, it will simplify the work of its clients.
Another desirable interface feature is that the client not be forced to release the entire allo-
cated area all at once. If a client wants to release only part of the memory, the allocator should han-
dle it correctly. The malloc()/free() interface does not permit this. The free() routine will release the entire region and will fail if called with a different address from that returned by malloc(). Allowing clients to grow a buffer (for instance by a realloc() function) would also be
useful.
Allocated memory should be properly aligned for faster access. On many RISC architec-
tures, this is a requirement. For most systems, longword alignment is sufficient, but 64-bit machines
such as DEC's Alpha AXP [DEC 92] may require alignment on an eight-byte boundary. A related
issue is the minimum allocation size, which is usually eight or sixteen bytes.
Many commercial environments have a cyclical usage pattern. For example, a machine may
be used for database queries and transaction processing during the day and for backups and database
reorganization at night. These activities may have different memory requirements. Transaction
processing might consume several small chunks of kernel memory to implement database locking,
while backups may require that most of the memory be dedicated to user processes.
Many allocators partition the pool into separate regions, or buckets, for requests of different
sizes. For instance, one bucket may contain all 16-byte chunks, while another may contain all
64-byte chunks. Such allocators must guard against a bursty or cyclical usage pattern as described
above. In some allocators, once memory has been assigned to a particular bucket, it cannot be re-
used for requests of another size. This may result in a large amount of unused memory in some
buckets, and hence not enough in others. A good allocator provides a way to dynamically recover
excess memory from one bucket for use by another.
Finally, the interaction with the paging system is an important criterion. The KMA must be
able to borrow memory from the paging system when it uses up its initial quota. The paging system
must be able to recover unused memory from the KMA. This exchange should be properly con-
trolled to ensure fairness and avoid starvation of either system.
We now look at several allocation methods, and analyze them using the above criteria.
[Figure: resource map operations - (c) after rmfree(128, 128), the map holds the free regions <128,256> and <576,448>; (d) after many more operations, it holds <128,32>, <288,64>, <544,128>, and <832,32>, with free and in-use regions interleaved.]
12.3.1 Analysis
The resource map provides a simple allocator. The following are its main advantages:
• The algorithm is easy to implement.
• The resource map is not restricted to memory allocation. It can manage collections of arbi-
trary objects that are sequentially ordered and require allocation and freeing in contiguous
chunks (such as page table entries and semaphores, as described below).
• It can allocate the exact number of bytes requested without wasting space. In practice, it
will usually round up requests to four- or eight-byte multiples for simplicity and align-
ment.
• A client is not constrained to release the exact region it has allocated. As the previous ex-
ample shows, the client can release any part of the region, and the allocator will handle it
correctly. This is because the arguments to rmfree() provide the size of the region being
freed, and the bookkeeping information (the map) is maintained separately from the allo-
cated memory.
• The allocator coalesces adjacent free regions, allowing memory to be reused for different
sized requests.
However, the resource map allocator also has some major drawbacks:
• After the allocator has been running for a while, the map becomes highly fragmented,
creating many small free regions. This results in low utilization. In particular, the resource
map allocator does poorly in servicing large requests.
• As the fragmentation increases, so does the size of the resource map, since it needs one
entry for each free region. If the map is preconfigured with a fixed number of entries, it
might overflow, and the allocator may lose track of some free regions.
• If the map grows dynamically, it needs an allocator for its own entries. This is a recursive
problem, to which we offer one solution below.
• To coalesce adjacent free regions, the allocator must keep the map sorted in order of in-
creasing base offsets. Sorting is expensive, even more so if it must be performed in-place,
such as when the map is implemented as a fixed array. The sorting overhead is significant,
even if the map is dynamically allocated and organized as a linked list.
• The allocator must perform a linear search of the map to find a free region that is large
enough. This is extremely time consuming and becomes slower as fragmentation in-
creases.
• Although it is possible to return free memory at the tail of the pool to the paging system,
the algorithm is really not designed for this. In practice, the allocator never shrinks its
pool.
The poor performance of the resource map is the main reason why it is unsuitable as a gen-
eral-purpose kernel memory allocator. It is, however, used by some kernel subsystems. The System
V interprocess communication facility uses resource maps to allocate semaphore sets and data areas
for messages. The virtual memory subsystem in 4.3BSD uses this algorithm to manage system page
table entries that map user page tables (see Section 13.4.2).
The map management can be improved in some circumstances. It is often possible to store
the map entry in the first few bytes of the free region. This requires no extra memory for the map
and no dynamic allocation for map entries. A single global variable can point to the first free region,
and each free region stores its size and a pointer to the next free entry. This requires free regions to
be at least two words long (one for the size, one for the pointer), which can be enforced by requiring
allocations and frees to be in multiples of two words. The Berkeley Fast File System (FFS), described in
Section 9.5, uses a variation of this approach to manage free space within directory blocks.
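A sketch of this layout follows; the declarations are illustrative, since the kernel's actual names are not shown in the text. Each free region begins with its size and a pointer to the next free region, so the map costs no memory beyond the free regions themselves.

    #include <stddef.h>

    struct free_region {
        size_t              size;    /* size of this free region in bytes */
        struct free_region *next;    /* next free region, in order of increasing address */
    };

    static struct free_region *free_head;   /* single global pointer to the first free region */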
While this optimization is suitable for the general memory allocator, it cannot be applied to
other uses of the resource map, such as for semaphore sets or page table entries, where the managed
objects have no room for map entry information.
12.4 Simple Power-of-Two Free Lists
[Figure: simple power-of-two free lists - an array of list headers, each pointing to a linked chain of free buffers of one size.]
Consequently, if a list becomes empty, the allocator may handle a new malloc() request for that size in
one of three ways:
• Block the request until a buffer of the appropriate size is released.
• Satisfy the request with a larger buffer, beginning with the next list and continuing the
search until it finds a nonempty list.
• Obtain additional memory from the page-level allocator to create more buffers of that size.
Each method has its benefits and drawbacks, and the proper choice depends on the situation. For
example, a kernel implementation of this algorithm may use an additional priority argument for al-
location requests. In this case, the allocator may block low-priority requests that cannot be satisfied
from the correct free list, but complete high-priority requests by one of the other two methods.
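As a concrete illustration of these three strategies, an allocation path might look like the sketch below. The list names, the priority flag, and the helper routines are assumptions for illustration only, not any particular kernel's code.

    #define NLISTS   8                /* illustrative: lists for 16, 32, ..., 2048 bytes */
    #define PRI_LOW  0

    struct buflist { void *head; };   /* minimal free-list type (assumed) */
    extern struct buflist freelist[NLISTS];
    extern void *list_remove_head(struct buflist *l);
    extern void *wait_for_free_buffer(int ndx);       /* hypothetical: sleep until one is freed */
    extern void *carve_page_into_buffers(int ndx);    /* hypothetical: grow from the page allocator */

    void *alloc_pow2(int ndx, int pri)
    {
        void *buf = list_remove_head(&freelist[ndx]);
        if (buf != NULL)
            return buf;

        if (pri == PRI_LOW)
            return wait_for_free_buffer(ndx);          /* 1: block until a buffer is released */

        for (int i = ndx + 1; i < NLISTS; i++) {       /* 2: satisfy from the next larger size */
            buf = list_remove_head(&freelist[i]);
            if (buf != NULL)
                return buf;
        }
        return carve_page_into_buffers(ndx);           /* 3: create more buffers of this size */
    }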
12.4.1 Analysis
The above algorithm is simple and reasonably fast. Its main appeal is that it avoids the lengthy lin-
ear searches of the resource map method and eliminates the fragmentation problem entirely. In
situations where a buffer is available, its worst-case performance is well bounded. The allocator also
presents a familiar programming interface, with the important advantage that the free() routine
need not be given the buffer size as an argument. As a result, an allocated buffer can be passed to
other functions and subsystems and eventually freed using only the pointer to the buffer. On the
other hand, the interface does not allow a client to release only part of the allocated buffer.
There are many important drawbacks of this algorithm. The rounding of requests to the next
power of two often leaves a lot of unused space in the buffer, resulting in poor memory utilization.
The problem becomes worse due to the need to store the header in the allocated buffers. Many
memory requests are for an exact power-of-two bytes. For such requests, the wastage is almost
100%, since the request must be rounded to the next power of two to allow for the header. For ex-
ample, a 512-byte request would consume a 1024-byte buffer.
There is no provision for coalescing adjacent free buffers to satisfy larger requests. Gener-
ally, the size of each buffer remains fixed for its lifetime. The only flexibility is that large buffers
may sometimes be used for small requests. Although some implementations allow the allocator to
steal memory from the paging system, there is no provision to return surplus free buffers to the
page-level allocator.
While the algorithm is much faster than the resource map method, it can be further im-
proved. In particular, the round-up loop, shown in Example 12-1, is slow and inefficient:
void *malloc(size)
{
    int ndx = 0;                    /* free list index */
    int bufsize = 1 << MINPOWER;    /* size of smallest buffer */

    size += 4;                      /* account for header */
    assert(size <= MAXBUFSIZE);
    while (bufsize < size) {        /* the round-up loop: find the right free list */
        ndx++;
        bufsize <<= 1;
    }
    /* at this point, ndx is the index of the appropriate free list */
#define NDX(size) \
    (size) > 128 \
        ? (size) > 256 ? 4 : 3 \
        : (size) > 64 \
            ? 2 \
            : (size) > 32 ? 1 : 0
The main advantage of using a macro is that when the allocation size is known at the time of
compilation, the NDX() macro reduces to a compile-time constant, saving a substantial number of instructions. Another macro handles the simple cases of buffer release, calling the free() function
in only a few cases, such as when freeing large buffers.
12.5.1 Analysis
The McKusick-Karels algorithm is a significant improvement over the simple power-of-two alloca-
tor described in Section 12.4. It is faster, wastes less memory, and can handle large and small re-
quests efficiently. However, the algorithm suffers from some of the drawbacks inherent in the
power-of-two approach. There is no provision for moving memory from one list to another. This
makes the allocator vulnerable to a bursty usage pattern that consumes a large number of buffers of
one particular size for a short period. Also, there is no way to return memory to the paging system.
12.6 The Buddy System
3 This is the binary buddy system, which is the simplest and most popular buddy system. We can implement other
buddy algorithms by splitting buffers into four, eight, or more pieces.
2. allocate (128): Finds the 128-byte free list empty. It checks the 256-byte list, re-
moves B' from it, and splits it into C and C'. Then, it puts C' on the
128-byte free list and returns C to the client.
3. allocate (64): Finds the 64-byte list empty, and hence removes C' from 128-byte
free list. It splits C' into D and D', puts D' on the 64-byte list, and re-
turns D to the client. Figure 12-5 shows the situation at this point.
4. allocate (128): Finds the 128-byte and 256-byte lists empty. It then checks the 512-
byte free list and removes A' from it. Next, it splits A' into E and E',
and further splits E into F and F'. Finally, it puts E' onto the 256-byte
list, puts F' on the 128-byte list, and returns F to the client.
5. release (C, 128): Returns C to the 128-byte free list. This leads to the situation shown
in Figure 12-6.
6. release (D, 64): The allocator will note that D' is also free and will coalesce D with D'
to obtain C'. It will further note that C is also free and will coalesce it
with C' to get back B'. Finally, it will return B' to the 256-byte list,
resulting in the situation in Figure 12-7.
• There is the usual rounding of the request to the next power of two.
• For each request in this example, the corresponding free list is empty. Often, this is not the
case. If there is a buffer available on the appropriate free list, the allocator uses it, and no
splitting is required.
• The address and size of a buffer provide all the information required to locate its buddy.
This is because the algorithm automatically gives each buffer an alignment factor equal to
its size. Thus, for example, a 128-byte buffer at offset 256 has its buddy at offset 384,
while a 256-byte buffer at the same offset has its buddy at offset 0 (the short sketch following this list shows the computation).
• Each request also updates the bitmap to reflect the new state of the buffer. While coalesc-
ing, the allocator examines the bitmap to determine whether a buffer's buddy is free.
• While the above example uses a single, 1024-byte page, the allocator can manage several
disjoint pages simultaneously. The single set of free list headers can hold free buffers from
all pages. The coalescing will work as before, since the buddy is determined from the
buffer's offset in the page. The allocator will, however, maintain a separate bitmap for
each page.
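As promised in the first bullet above, the buddy computation can be written as a single exclusive-or, assuming offsets are measured from the start of the page and sizes are powers of two. This is a sketch, not the actual SVR4 code.

    unsigned long buddy_of(unsigned long offset, unsigned long size)
    {
        return offset ^ size;   /* e.g., buddy_of(256, 128) == 384 and buddy_of(256, 256) == 0 */
    }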
12.6.1 Analysis
The buddy system does a good job of coalescing adjacent free buffers. That provides flexibility, al-
lowing memory to be reused for buffers of a different size. It also allows easy exchange of memory
between the allocator and the paging system. Whenever the allocator needs more memory it can
obtain a new page from the paging system and split it as necessary. Whenever the release routine
coalesces an entire page, the page can be returned to the paging system.
The main disadvantage of this algorithm is its performance. Every time a buffer is released,
the allocator tries to coalesce as much as possible. When allocate and release requests alternate, the
algorithm may coalesce buffers, only to split them again immediately. The coalescing is recursive,
resulting in extremely poor worst-case behavior. In the next section, we examine how SVR4 modi-
fies this algorithm to overcome this performance bottleneck.
Another drawback is the programming interface. The release routine needs both the address
and size of the buffer. Moreover, the allocator requires that an entire buffer be released. Partial re-
lease is insufficient, since a partial buffer has no buddy.
12.7 The SVR4 Lazy Buddy Algorithm

N = A + L + G

where N is the total number of buffers of a given size class, A is the number currently allocated (in use), L is the number that are locally free (released lazily and not marked free in the bitmap), and G is the number that are globally free (marked free in the bitmap).
Depending on the values of these parameters, a buffer class is said to be in one of three states:
• lazy - buffer consumption is in a steady state (allocation and release requests are about equal) and coalescing is not necessary.
• reclaiming - consumption is borderline; coalescing is needed.
• accelerated - consumption is not in a steady state, and the allocator must coalesce faster.
The critical parameter that determines the state is called the slack, defined as
slack = N - 2L - G
The system is in the lazy state when slack is 2 or more, in the reclaiming state when slack equals 1,
and in the accelerated state when slack is zero. The algorithm ensures that slack is never negative.
[Bark 89] provides comprehensive proof of why the slack is an effective measure of the buffer class
state.
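In code, the state determination reads roughly as follows; the names are illustrative, not the SVR4 source.

    enum lazy_state { LAZY, RECLAIMING, ACCELERATED };

    enum lazy_state buddy_class_state(int N, int L, int G)
    {
        int slack = N - 2 * L - G;   /* never negative, by construction */

        if (slack >= 2)
            return LAZY;             /* steady state: defer coalescing */
        else if (slack == 1)
            return RECLAIMING;       /* borderline: coalesce the freed buffer if possible */
        else
            return ACCELERATED;      /* coalesce the freed buffer plus one delayed buffer */
    }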
When a buffer is released, the SVR4 allocator puts it on the free list and examines the result-
ing state of the class. If the list is in the lazy state, the allocator does no more. The buffer is not
marked as free in the bitmap. Such a buffer is called a delayed buffer and is identified as such by a
flag in the buffer header (the header is present only on buffers on the free list). Although it is avail-
able for other same-size requests, it cannot be coalesced with adjacent buffers.
If the list is in the reclaiming state, the allocator marks the buffer as free in the bitmap and
coalesces it if possible. If the list is in accelerated state, the allocator coalesces two buffers-the one
just released and an additional delayed buffer, if there is one. When it releases the coalesced buffer
to the next higher-sized list, the allocator checks the state of that class to decide whether to coalesce
further. Each of these operations changes the slack value, which must be recomputed.
To implement this algorithm efficiently, the buffers are doubly linked on the free lists. De-
layed buffers are released to the head of the list, and non-delayed buffers to the tail. This way, de-
layed buffers are reallocated first; this is desirable because they are the least expensive to allocate
(no bitmap update is required). Moreover, in the accelerated stage, the additional delayed buffer can
be quickly checked for and retrieved from the head of the list. If the first buffer is non-delayed, there
are no delayed buffers on the list.
This is a substantial improvement over the basic buddy system. In steady state, all lists are in
the lazy state, and no time is wasted in coalescing and splitting. Even when a list is in the acceler-
ated state, the allocator coalesces at most two buffers on each request. Hence, in the worst-case
situation, there are at most two coalescing delays per class, which is at most twice as bad as the
simple buddy system.
[Bark 89] analyzes the performance of the buddy and lazy buddy algorithms under various
simulated workloads. It shows that the average latency of the lazy buddy method is 10% to 32%
better than that of the simple buddy system. As expected, however, the lazy buddy system has
greater variance and poorer worst-case behavior for the release routine.
and we need a separate function to gather the 256-byte buffers of a pool. This is time-consuming,
and should be performed in the background. A system process called the kmdaemon runs periodi-
cally to coalesce the pools and return free pools to the page-level allocator.
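A zone is created by a call of roughly the following form; the argument types shown here are only illustrative, since the original declaration is not reproduced:

    zone_t zinit(vm_size_t size, vm_size_t max, vm_size_t alloc, char *name);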
where size is the size of each object, max is the maximum size in bytes the zone may reach, alloc is the amount of memory to add to the zone each time the free list becomes empty (the kernel rounds it to a whole number of pages), and name is a string that describes the objects in the zone. zinit() allocates a zone structure from the zone of zones and records the size, max, and alloc values in it. zinit() then allocates an initial alloc-byte region of memory from the page-level allocator and divides it into size-byte objects, which it puts on the free list. All active zone structures are maintained on a linked list, described by the global variables first_zone and last_zone (Figure 12-8). The first element on this list is the zone of zones, from which all other elements are allocated.
Thereafter, allocation and release are extremely fast, and involve nothing more than remov-
ing objects from and returning objects to the free list. If an allocation request finds the free list
empty, it asks the page-level allocator for alloc more bytes. If the size of the pool reaches max
bytes, further allocations will fail.
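A sketch of the allocation fast path implied by this description follows. The zone structure fields and the zone_grow() helper are illustrative assumptions, not the actual Mach code.

    struct zone {
        void *free_list;     /* singly linked list of free objects */
        long  cur_size;      /* bytes currently owned by the zone */
        long  max_size;      /* max bytes the zone may reach */
        long  alloc_size;    /* bytes to add when the free list is empty */
        long  elem_size;     /* size of each object */
    };

    extern void zone_grow(struct zone *z);   /* hypothetical helper: gets alloc_size bytes from
                                                the page-level allocator and adds new objects
                                                to the free list */

    void *zalloc_sketch(struct zone *z)
    {
        void *obj;

        if (z->free_list == NULL) {
            if (z->cur_size + z->alloc_size > z->max_size)
                return NULL;              /* pool has reached max bytes; allocation fails */
            zone_grow(z);
        }
        obj = z->free_list;
        z->free_list = *(void **)obj;     /* pop the first free object */
        return obj;
    }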
[Figure 12-8: The zone allocator - zone structures linked between first_zone and last_zone, the first being the zone of zones; each zone heads a free list of its objects (obj1 ... objn).]
The garbage collector relies on two counts kept for each page used by a zone:
• in_free_list is the number of objects from that page currently on the zone's free list.
• alloc_count is the total number of objects from that page assigned to the zone.
The alloc_count is set whenever the page is acquired by the zone from the page-level allocator. Since the page size may not be an exact multiple of the object size, an object may occasionally span two pages. In this case, it is included in the alloc_count of both pages. The in_free_list count is not updated with each allocation and release operation, but is recomputed each time the garbage collector runs. This minimizes the latency of individual allocator requests.
The garbage collector routine, zone_gc(), is invoked by the swapper task each time it runs. It walks through the list of zones and, for each zone, makes two passes through its free list. In the first pass, it scans each free element and increments the in_free_list count of the page to which it belongs. At the end of this scan, if the in_free_list and alloc_count of any page are equal, all objects on that page are free, and the page can be recaptured.4 Hence, in the second pass, zone_gc() removes all such objects from the free list. Finally, it calls kmem_free() to return each free page to the page-level allocator.
12.8.2 Analysis
The zone allocator is fast and efficient. It has a simple programming interface. Objects are
allocated by
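    obj = zalloc(z);    /* a sketch of the call elided here; zalloc() is the Mach zone
                           allocation routine */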
4 If the page being recaptured has objects at its top or bottom that span two pages, such objects must be removed from
the free list, and the alloc_count of the other page must be decremented to reflect this.
where z points to the zone for that class of objects, set up by an earlier call to zinit(). The objects
are released by
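    zfree(z, obj);      /* again a sketch of the elided call: zfree() returns obj to zone z */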
[Figure 12-9: The Dynix allocator - three layers: a coalesce-to-page layer with per-page free lists, a global layer with a global free list and a bucket list (target = 3), and per-CPU caches, each with main and aux free lists (target = 3).]
The per-CPU layer manages one set of power-of-two pools for each processor. These pools
are insulated from the other processors, and hence can be accessed without acquiring global locks.
Allocation and release are fast in most cases, as only the local free list is involved.
Whenever the per-CPU free list becomes empty, it can be replenished from the global layer,
which maintains its own power-of-two pools. Likewise, excess buffers in the per-CPU cache can be
returned to the global free list. As an optimization, buffers are moved between these two layers in
target-sized groups (three buffers per move in the case shown in Figure 12-9), preventing unneces-
sary linked-list operations.
To accomplish this, the per-CPU layer maintains two free lists-main and aux. Allocation
and release primarily use the main free list. When this becomes empty, the buffers on aux are
moved to main, and the aux list is replenished from the global layer. Likewise, when the main list
overflows (size exceeds target), it is moved to aux, and the buffers on aux are returned to the global
layer. This way, the global layer is accessed at most once per target-number of accesses. The value
of target is a tunable parameter. Increasing target reduces the number of global accesses, but ties up
more buffers in per-CPU caches.
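The allocation path through the two per-CPU lists can be sketched as follows; the structure and helper names are assumptions, not the Dynix source.

    struct buflist;                          /* opaque free-list type (assumed) */
    extern void *list_remove_head(struct buflist *l);
    extern void  list_move_all(struct buflist *from, struct buflist *to);
    extern void  global_fetch_group(struct buflist *l, int n);   /* one target-sized group */

    struct percpu_cache {
        struct buflist *main;     /* list used for ordinary allocation and release */
        struct buflist *aux;      /* holds one spare target-sized group of buffers */
        int             target;   /* buffers moved per exchange with the global layer */
    };

    void *percpu_alloc(struct percpu_cache *c)
    {
        void *buf = list_remove_head(c->main);
        if (buf == NULL) {
            /* main is empty: move aux's buffers onto main, then refill aux with
               one target-sized group obtained from the global layer */
            list_move_all(c->aux, c->main);
            global_fetch_group(c->aux, c->target);
            buf = list_remove_head(c->main);
        }
        return buf;
    }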
The global layer maintains global power-of-two free lists, and each list is subdivided into
groups of target buffers. Occasionally, it is necessary to transfer odd-sized groups of blocks to the
global layer, due to low-memory operations or per-CPU cache flushes. Such blocks are added to a
separate bucket list, which serves as a staging area for the global free list.
When a global list exceeds a global target value, excess buffers are returned to the coalesce-
to-page layer. This layer maintains per-page free lists (all buffers from the page are the same size).
This layer places the buffers on the free list to which they belong, and increases the free count for
that page. When all buffers on a page are returned to this list, the page can be given back to the
paging system. Conversely, the coalesce-to-page layer can borrow memory from the paging system
to create new buffers.
The coalesce-to-page layer sorts its lists based on the number of free blocks on each page.
This way, it allocates buffers from pages having the fewest free blocks. Pages with many free blocks
get more time to recover other free blocks, increasing the probability of returning them to the paging
system. This results in a high coalescing efficiency.
12.9.1 Analysis
The Dynix algorithm provides efficient memory allocation for shared memory multiprocessors. It
supports the standard System V interface, and allows memory to be exchanged between the alloca-
tor and the paging system. The per-CPU caches reduce the contention on the global lock, and the
dual free lists provide a fast exchange of buffers between the per-CPU and global layers.
It is interesting to contrast the Dynix coalescing approach with that of the Mach zone-based
allocator. The Mach algorithm employs a mark-and-sweep method, linearly scanning the entire pool
each time. This is computationally expensive, and hence is relegated to a separate background task.
In Dynix, each time blocks are released to the coalesce-to-page layer, the per-page data structures
are updated to account for them. When all the buffers in a page are freed, the page can be returned to
the paging system. This happens in the foreground, as part of the processing of release operations.
The incremental cost for each release operation is small; hence it does not lead to unbounded worst-
case performance.
Benchmark results [McKe 93] show that for a single CPU, the Dynix algorithm is faster than
the McKusick-Karels algorithm by a factor of three to five. The improvement is even greater for
multiprocessors (a hundred to a thousand-fold for 25 processors). These comparisons, however, are
for the best-case scenario, where allocations occur from the per-CPU cache. This study does not de-
scribe more general measurements.
fixed, initial state. The deconstruction phase deals with the same fields, and in many cases, leaves
them in their initial state before deallocating the memory.
For instance, a vnode contains the header of a linked list of its resident pages. When the
vnode is initialized, this list is empty. In many UNIX implementations [Bark 90], the kernel deallo-
cates the vnode only after all its pages have been flushed from memory. Hence, just before freeing
the vnode (the deconstruction stage), its linked list is empty again.
If the kernel reuses the same object for another vnode, it does not need to reinitialize the
linked list header, for the deconstruction took care of that. The same principle applies to other ini-
tialized fields. For instance, the kernel allocates objects with an initial reference count of one, and
deallocates them when the last reference is released (hence, the reference count is one, and is about
to become zero). Mutexes are initialized to an unlocked state, and must be unlocked before releasing
the object.
This shows the advantage of caching and reusing the same object, rather than allocating and
initializing arbitrary chunks of memory. Object caches are also space-efficient, as we avoid the typi-
cal rounding to the next power of two. The zone allocator (Section 12.8) is also based on object
caching and gives efficient memory utilization. However, because it is not concerned with the object
state, it does not eliminate the reinitialization overhead.
[Figure: object caches (mbufs, vnodes, procs, msgbs), each managing its own active and free objects and drawing memory from the page-level allocator.]
This interface does not construct or deconstruct objects when reusing them. Hence the kernel
must restore the object to its initial state before releasing it. As explained in Section 12.10.1, this
usually happens automatically and does not require additional actions.
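The object-cache interface referred to here is the one described in [Bonw 94]; in simplified form (argument types abbreviated for illustration) it looks roughly like this:

    struct kmem_cache *kmem_cache_create(char *name, size_t size, int align,
            void (*ctor)(void *obj, size_t size),    /* constructs a newly created object */
            void (*dtor)(void *obj, size_t size));   /* deconstructs it before reclaiming memory */

    void *kmem_cache_alloc(struct kmem_cache *cp, int flags);
    void  kmem_cache_free(struct kmem_cache *cp, void *obj);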
When the cache is empty, it calls kmem_cache_grow() to acquire a slab of memory from the page-level allocator and create objects from it. The slab is composed of several contiguous pages managed as a monolithic chunk by the cache. It contains enough memory for several instances of the object. The cache uses a small part of the slab to manage the memory in the slab and divides the rest of the slab into buffers that are the same size as the object. Finally, it initializes the objects by calling their constructor (specified in the ctor argument to kmem_cache_create()), and adds them to the cache.
When the page-level allocator needs to recover memory, it calls kmem_cache_reap() on a cache. This function finds a slab whose objects are all free, deconstructs these objects (calling the function specified in the dtor argument to kmem_cache_create()), and removes the slab from the cache.
12.10.5 Implementation
The slab allocator uses different techniques for large and small objects. We first discuss small objects, many of which can fit into a one-page slab. The allocator divides the slab into three parts-the kmem_slab structure, the set of objects, and some unused space (Figure 12-11). The kmem_slab structure occupies 32 bytes and resides at the end of the slab. Each object uses an extra four bytes to store a free list pointer. The unused space is the amount left over after creating the maximum possible number of objects from the slab. For instance, if the inode size is 300 bytes, a 4096-byte slab will hold 13 inodes, leaving 104 bytes unused (accounting for the kmem_slab structure and the free list pointers). This space is split into two parts, for reasons explained below.
[Figure 12-11: A one-page slab - a coloring area at the head, the objects (each followed by a four-byte free-list pointer), the remaining unused space, and the kmem_slab structure at the end of the page.]
The kmem_slab structure contains a count of its in-use objects. It also contains pointers to
chain it in a doubly linked list of slabs of the same cache, as well as a pointer to the first free object
in the slab. Each slab maintains its own, singly linked, free buffer list, storing the linkage informa-
tion in a four-byte field immediately following the object. This field is needed only for free objects.
It must be distinct from the object itself, since we do not want to overwrite the constructed state of
the object.
The unused space is split into two parts: a slab coloring area at the head of the slab and the
rest just before the kmem_slab structure. The cache tries to use a different-sized coloring area in
each of its slabs, subject to alignment restrictions. In our inode cache example, if the inodes require
an 8-byte alignment, the slabs can have 14 different coloring sizes (0 through 104 in 8-byte incre-
ments). This allows a better distribution of the starting offsets of the objects of this class, resulting
in more balanced and efficient use of the hardware cache and memory buses.
The kernel allocates an object from a slab by removing the first element from the free list
and incrementing the slab's in-use count. When freeing the object, it identifies the slab by a simple
computation:
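    /* A plausible form of the computation (illustrative; assumes a one-page slab
       whose kmem_slab structure sits at the end of the page): */
    slab = (struct kmem_slab *)(((unsigned long)obj & ~(PAGESIZE - 1))
                                + PAGESIZE - sizeof(struct kmem_slab));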
It then returns the object to the slab's free list, and decrements the in-use count.
When its in-use count becomes zero, the slab is free, or eligible for reclaiming. The cache
chains all its slabs on a partly sorted, doubly linked list. It stores fully active slabs (all objects in
use) at the beginning, partially active slabs in the middle, and free slabs at the tail. It also maintains
a pointer to the first slab that has a free object and satisfies allocations from that slab. Hence the
cache does not allocate objects from a completely free slab until all partly active slabs are ex-
hausted. If the page-level allocator must reclaim memory, it checks the slab at the tail of the list and
removes the slab if free.
12.10.6 Analysis
The slab allocator is a well-designed, powerful facility. It is space-efficient, because its space over-
head is limited to the kmem_slab structure, the per-object linkage field, and an unused area no larger
than one object per slab. Most requests are serviced extremely quickly by removing an object from
the free list and updating the in-use count. Its coloring scheme results in better hardware cache and
memory bus utilization, thus improving overall system performance. It also has a small footprint, as
it accesses only one slab for most requests.
The garbage collection algorithm is much simpler than that of the zone allocator, which is
based on similar principles. The cost of garbage collection is spread over all requests, since each
operation changes the in-use count. The actual reclaim operation involves some additional overhead,
for it must scan the different caches to find a free slab. The worst-case performance is proportional
to the total number of caches, not the number of slabs.
One drawback of the slab allocator is the management overhead inherent in having a sepa-
rate cache for each type of object. For common classes of objects, where the cache is large and often
used, the overhead is insignificant. For small, infrequently used caches, the overhead is often unac-
ceptable. This problem is shared by the Mach zone allocator and is solved by having a set of power-
of-two buffers for objects that do not merit a cache of their own.
The slab allocator would benefit from the addition of per-processor caches such as those of
Dynix. [Bonw 94] acknowledges this and mentions it as a possible future enhancement.
12.11 Summary
The design of a general-purpose kernel memory allocator raises many important issues. It must be
fast, easy to use, and use memory efficiently. We have examined several allocators and analyzed
their advantages and drawbacks. The resource map allocator is the only one that permits release of
part of the allocated object. Its linear search methods yield unacceptable performance for most ap-
plications. The McKusick-Karels allocator has the simplest interface, using the standard malloc() and free() syntax. It has no provision for coalescing buffers or returning excess memory to the
page-level allocator. The buddy system constantly coalesces and breaks buffers to adjust to shifting
memory demands. Its performance is usually poor, particularly when there is frequent coalescing.
The zone allocator is normally fast, but has inefficient garbage collection mechanisms.
The Dynix and slab allocators offer significant improvements over these methods. Dynix
uses a power-of-two method, but adds per-processor caches and fast garbage collection (the coa-
lesce-to-page layer). The slab allocator is a modified zone algorithm. It improves performance
through object reuse and balanced address distribution. It also uses a simple garbage collection al-
gorithm that bounds the worst-case performance. As was previously noted, adding per-CPU caches
to the slab algorithm would provide an excellent allocator.
Table 12-1 summarizes the results of a set of experiments [Bonw 94] comparing the slab al-
locator with the SVR4 and McKusick-Karels allocators. The experiments also show that object re-
use reduces the time required for allocation plus initialization by a factor of 1.3 to 5.1, depending on
the object. This benefit is in addition to the improved allocation time noted in the table.
Many of these techniques can also be applied to user-level memory allocators. However, the
requirements of user-level allocators are quite different; hence, a good kernel allocator may not
work as well at the user level, and vice versa. User-level allocators deal with a very large amount of
(virtual) memory, practically limitless for all but the most memory-intensive applications. Hence,
coalescing and adjusting to shifting demands are less critical than rapid allocation and deallocation.
A simple, standard interface is also extremely important, since they are used by many diverse, inde-
pendently written applications. [Korn 85] describes several different user-level allocators.
12.12 Exercises
1. In what ways do the requirements for a kernel memory allocator differ from those for a user-
level allocator?
2. What is the maximum number of resource map entries required to manage a resource with n
items?
3. Write a program that evaluates the memory utilization and performance of a resource map
allocator, using a simulated sequence of requests. Use this to compare the first-fit, best-fit, and
worst-fit approaches.
4. Implement the free() function for the McKusick-Karels allocator.
5. Write a scavenge() routine that coalesces free pages in the McKusick-Karels allocator and
releases them to the page-level allocator.
6. Implement a simple buddy algorithm that manages a 1024-byte area of memory with a
minimum allocation size of 16 bytes.
7. Determine a sequence of requests that would cause the worst-case behavior for the simple
buddy algorithm.
8. In the SVR4 lazy buddy algorithm described in Section 12.7, how would each of the following
events change the values of N, A, L, G, and slack?
(a) A buffer is released when slack is greater than 2.
(b) A delayed buffer is reallocated.
(c) A non-delayed buffer is allocated (there are no delayed buffers).
(d) A buffer is released when slack equals 1, but none of the free buffers can be coalesced
because their buddies are not free.
(e) A buffer is coalesced with its buddy.
9. Which of the other memory allocators can be modified to have a Dynix-style per-CPU free list
in case of multiprocessors? Which algorithms cannot adopt this technique? Why?
10. Why does the slab allocator use different implementations for large and small objects?
11. Which of the allocators described in this chapter have simple programming interfaces?
12. Which of the allocators allow a client to release part of an allocated block?
13. Which of the allocators can reject an allocation request even if the kernel has a block of
memory large enough to satisfy the request?
12.13 References
[Bach 86] Bach, M.J., The Design of the UNIX Operating System, Prentice-Hall, Englewood
Cliffs, NJ, 1986.
[Bark 89] Barkley, R.E., and Lee, T.P., "A Lazy Buddy System Bound By Two Coalescing
Delays per Class," Proceedings of the Twelfth ACM Symposium on Operating
Systems Principles, Dec. 1989, pp. 167-176.
[Bark 90] Barkley, R.E., and Lee, T.P., "A Dynamic File System Inode Allocation and Reclaim
Policy," Proceedings of the Winter 1990 USENIX Technical Conference, Jan. 1990,
pp. 1-9.
[Bonw 94] Bonwick, J., "The Slab Allocator: An Object-Caching Kernel Memory Allocator,"
Proceedings of the Summer 1994 USENIX Technical Conference, Jun. 1994, pp. 87-
98.
[Cekl 92] Cekleov, M., Frailong, J.-M., and Sindhu, P., Sun-4D Architecture, Revision 1.4,
Sun Microsystems, 1992.
[DEC 92] Digital Equipment Corporation, Alpha Architecture Handbook, Digital Press, 1992.
[DEC 93] Digital Equipment Corporation, DEC OSF/1 Internals Overview-Student
Workbook, 1993.
[Knut 73] Knuth, D., The Art of Computer Programming, Vol. 1, Fundamental Algorithms,
Addison-Wesley, Reading, MA, 1973.
[Korn 85] Korn, D.G., and Vo, K.-P., "In Search of a Better Malloc," Proceedings of the
Summer 1985 USENIX Technical Conference, Jun. 1985, pp. 489-505.
[Lee 89] Lee, T.P., and Barkley, R.E., "A Watermark-Based Lazy Buddy System for Kernel
Memory Allocation," Proceedings of the Summer 1989 USENIX Technical
Conference, Jun. 1989, pp. 1-13.
[McKe 93] McKenney, P.E., and Slingwine, J., "Efficient Kernel Memory Allocation on Shared-
Memory Multiprocessors," Proceedings of the Winter 1993 USENIX Technical
Conference, Jan. 1993, pp. 295-305.
[McKu 88] McKusick, M.K., and Karels, M.J., "Design of a General-Purpose Memory Allocator
for the 4.3BSD UNIX Kernel," Proceedings of the Summer 1988 USENIX Technical
Conference, Jun. 1988, pp. 295-303.
[Pete 77] Peterson, J.L., and Norman, T.A., "Buddy Systems," Communications of the ACM,
Vol. 20, No. 6, Jun. 1977, pp. 421-431.
[Sciv 90] Sciver, J.V., and Rashid, R.F., "Zone Garbage Collection," Proceedings of the
USENIX Mach Workshop, Oct. 1990, pp. 1-15.
[Step 83] Stephenson, C.J., "Fast Fits: New Methods for Dynamic Storage Allocation,"
Proceedings of the Ninth ACM Symposium on Operating Systems Principles, Vol.
17, No. 5, 1983, pp. 30-32.
13
Virtual Memory
13.1 Introduction
One of the primary functions of the operating system is to manage the memory resources of the
system efficiently. Each system has a high-speed, randomly accessible primary memory, also
known as main memory, physical memory, or simply, as memory or core. Its access time is of the
order of a few CPU cycles. 1 A program may directly reference code or data that is resident in main
memory. Such memory is relatively expensive and therefore limited. The system uses a number of
secondary storage devices (usually disks or other server machines on a network) to store informa-
tion that does not fit in main memory. Access to such devices is several orders of magnitude slower than to primary memory and requires explicit action on the part of the operating system. The memory
management subsystem in the kernel is responsible for distributing information between main
memory and secondary storage. It interacts closely with a hardware component called the memory
management unit (MMU), which is responsible for getting data to and from main memory.
Life would be very simple for the operating system in the absence of memory management.
The system would keep only one program in memory at a time, loaded contiguously at a known,
fixed address. This would simplify the task of linking and loading, and absolve the hardware of any
address translation chores. All addressing would be directly to physical addresses, and the program
would have the entire machine to itself (shared, of course, with the operating system). This would
be the fastest, most efficient way of running any single program.
1 For instance, in 1995, a typical desktop system has a 75 MHz processor and a 70-nanosecond memory access time,
which equals 5.25 CPU cycles.
• Run programs larger than physical memory. Ideally, we should be able to run programs of
arbitrary size.
• Run partially loaded programs, thus reducing program startup time.
• Allow more than one program to reside in memory at one time, thereby increasing CPU
utilization.
• Allow relocatable programs, which may be placed anywhere in memory and moved
around during execution.
• Write machine-independent code-there should be no a priori correspondence between
the program and the physical memory configuration.
• Relieve programmers of the burden of allocating and managing memory resources.
• Allow sharing-for example, if two processes are running the same program, they should
be able to share a single copy of the program code.
These goals are realized through the use of virtual memory [Denn 70]. The application is
given the illusion that it has a large main memory at its disposal, although the computer may have a
relatively small memory. This requires the notion of an address space as distinct from memory lo-
cations. The program generates references to code and data in its address space, and these addresses
must be translated to locations in main memory. The hardware and software must cooperate to bring
information into main memory when it is needed for processing by the program and to perform the
address translations for each access.
Virtual memory does not come without cost. The translation tables and other data structures
used for memory management reduce the physical memory available to programs. The cost of ad-
dress translation is added to the execution time for each instruction and is particularly severe when
it involves extra memory accesses. When a process attempts to access a page that is not resident in
memory, the system generates a fault. It handles the fault by bringing the page into memory, which
may require time-consuming disk I/O operations. In all, memory management activities take up a
significant amount of CPU time (about 10% on a busy system). The usable memory is further re-
duced by fragmentation-for instance, in a page-based system, if only a part of a page contains use-
ful data, the rest is wasted. All these factors underscore the importance of an efficient design that
emphasizes performance as well as functionality.
data spaces, but that still restricted the process size to 128 kilobytes. This limitation led to the de-
velopment of various software overlay schemes for both user programs and the kernel [Coli 91].
Such methods reuse memory by overwriting a part of the address space that is no longer useful with
another part of the program. For example, once the system is up and running, it no longer needs the
system initialization code and can reclaim that space for use by other parts of the program. Such
overlay schemes require explicit actions by the application developer, who needs to be familiar with
the details of the program and the machine on which it runs. Programs using overlays are inherently
unportable, since the overlay scheme depends on the physical memory configuration. Even adding
more memory to the machine requires modifying these programs.
The memory management mechanisms were restricted to swapping (Figure 13-1). Processes
were loaded in physical memory contiguously and in their entirety. A small number of processes
could fit into physical memory at the same time, and the system would time-share between them. If
another process wanted to run, one of the existing processes needed to be swapped out. Such a proc-
ess would be copied to a predefined swap partition on a disk. Swap space was allocated on this partition for each process at process creation time, so as to guarantee its availability when needed.
[Figure 13-1: Swapping - the contents of physical memory at two times, t1 and t2. The operating system is resident; whole processes (P0, P1, P2, P3) are loaded contiguously and swapped in and out as units, leaving regions of unused memory.]
Demand paging made its appearance in UNIX with the introduction of the VAX-11/780 in
1978, with its 32-bit architecture, 4-gigabyte address space, and hardware support for demand pag-
ing [DEC 80]. 3BSD was the first major UNIX release to support demand paging [Baba 79,
Baba 81]. By the mid-1980s, all versions of UNIX used demand paging as the primary memory
management technique, with swapping relegated to a secondary role.
In a demand-paged system, both memory and process address space are divided into fixed-
size pages, and these pages are brought into and out of memory as required. A page of physical
memory is often called a page frame (or a physical page). Several processes may be active at any
time, and physical memory may contain just some of the pages of each process (Figure 13-2). Each
program runs as though it is the only program on the system. Program addresses are virtual and are
divided by the hardware into a page number and an offset in the page. The hardware, in conjunction
with the operating system, translates the virtual page number in the program address space to a
physical page frame number and accesses the appropriate location. If the required page is not in
memory, it must be brought into memory. In pure demand paging, no page is brought into memory
until needed (referenced). Most modern UNIX systems do some amount of anticipatory paging,
bringing in pages the system predicts will be needed soon.
A demand paging scheme may be used along with or instead of swapping. There are several
benefits:
• Program size is limited only by virtual memory, which, for 32-bit machines, can be up to
4 gigabytes.
• Program startup is fast since the whole program does not need to be in memory in order to
run.
• More programs may be loaded at the same time, since only a few pages of each program need to be resident in memory.
[Figure 13-2: Demand paging - the pages of processes P1 through P6 are scattered through physical memory; for each process, only some pages are in core and the rest are not resident.]
2 This book uses the term 80x86 (or more simply, x86) to refer to properties generic to the Intel 80386, 80486, and
Pentium architectures.
call, the kernel releases the old address space and allocates a new one that corresponds to
the new program. Other major operations on the address space include changing the size
of the data region or the stack, and adding a new region (such as shared memory).
• Address translation - For each instruction that accesses memory, the MMU needs to
translate virtual addresses generated by the process to physical addresses in main memory.
In a demand-paged system, a page is the unit of memory allocation, protection, and ad-
dress translation. The virtual address is converted into a virtual page number and an offset
within that page. The virtual page number is converted to a physical page number using
some kind of address translation maps. If an instruction accesses a page that is not in
physical memory (not resident), it will cause a page fault exception. The fault handler in
the kernel must resolve the fault by bringing the page into memory.
• Physical memory management- Physical memory is the most important resource con-
trolled by the memory management subsystem. Both the kernel and the user processes
contend for memory, and these requests must be satisfied quickly. The total size of all ac-
tive processes is usually much larger than that of physical memory, which can hold only a
limited subset of this data. Hence the system uses physical memory as a cache of useful
data. The kernel must optimize the utilization of this cache and ensure consistency and
currency of the data.
• Memory protection - Processes must not access or modify pages in an unauthorized
manner. The kernel must protect its own code and data against modification by user proc-
esses. Otherwise, a program may accidentally (or maliciously) corrupt the kernel. For security reasons, the kernel must also prevent processes from reading its code or data. Proc-
esses should not be able to access pages belonging to other processes. A part of the
process's address space may even be protected from the process itself. For example, the
text region of a process is usually write-protected, so that the process cannot accidentally
corrupt it. The kernel implements the system's memory protection policies using the
available hardware mechanisms. If the kernel detects an attempt to access an illegal loca-
tion, it notifies the offending process by sending it a segmentation violation (SIGSEGV)
signal.3
• Memory sharing - The characteristics of UNIX processes and their interactions natu-
rally suggest sharing of certain portions of their address spaces. For instance, all processes
running the same program can share a single copy of the text region of the program. Proc-
esses may explicitly ask to share a region of memory with other cooperating processes.
The text regions of standard libraries may be shared in the same manner. These are exam-
ples of high-level sharing. There is also potential for low-level sharing of individual
pages. For instance, after a fork, the parent and child may share a single copy of data and
stack pages as long as neither tries to modify them.
These and other forms of sharing improve performance by reducing the contention on
physical memory and by eliminating the in-memory copying and disk I/O needed to
maintain multiple copies of the same data. The memory management subsystem must de-
cide what forms of sharing it supports and how to implement such sharing.
3 In some cases, the kernel sends the SIGBUS (bus error) signal instead.
• Monitoring system load- Usually, the paging system is able to cope with the demands
of the active processes. Sometimes, however, the system may become overloaded. When
that happens, processes do not get enough memory for their active pages and hence are
unable to make progress. The load on the paging system depends on the number and size
of the active processes, as well as their memory reference patterns. The operating system
needs to monitor the paging system to detect such a situation and to take corrective action
when required. This may involve controlling the system load by preventing new processes
from starting up or by deactivating some existing processes.
• Other facilities - Some of the other functions provided by the memory management
system include support for memory-mapped files, dynamically linked shared libraries, and
execution of programs residing on remote nodes.
The memory management architecture has a great impact on overall system performance,
and therefore the design must be sensitive to performance and scalability. Portability is important as
well, to allow the system to run on different types of machines. Finally, the memory subsystem
should be transparent to the user, who should be able to write code without worrying about the un-
derlying memory architecture.
The address space of a process contains several different types of pages:
• text
• initialized data
• uninitialized data
• modified data
• stack
• heap
• shared memory
• shared libraries
These page types differ in the protections, method of initialization, and how they are shared
by processes. Text pages are usually read-only, while data, stack, and heap pages are read-write.
Protections on shared memory pages are usually set when the region is first allocated.
Text pages are normally shared by all processes running the same program. Pages in a
shared memory region are shared by all processes that attach the region to their address space. A
shared library may contain both text and data pages. The text pages are shared by all processes ac-
cessing the library. Library data pages are not shared, and each process gets its own private copy of
these pages. (Some implementations may allow them to be shared until they are modified.)
offset the faster swap-ins. Most modern implementations do not swap out the text pages, but read
them back from the file if needed.
Hardware address translation maps - The MMU provides some hardware address translation mechanism, so that the operating system does not need to
be involved in each translation. Section 13.3 examines three examples of memory architectures, all
of which involve some form of translation lookaside buffers (TLBs) and page tables. Although the
hardware dictates the form of these data structures, the operating system is responsible for their
setup and maintenance.
The hardware address translation maps are the only data structures known to the MMU
hardware. The other maps described in this section are known only to the operating system.
Address space map - When the hardware is unable to translate an address, it generates a page
fault. This might happen because the page is not in physical memory or because the hardware does
not have a valid translation for it. The fault handler in the kernel must resolve the fault by bringing
the page into memory if necessary and loading a valid hardware translation entry.
The hardware-recognized maps may not provide complete information about the address
space of a process. For example, on the MIPS R3000, the hardware uses only a small TLB. The op-
erating system may maintain additional maps that fully describe the address space.
Physical memory map- Frequently, the kernel also needs to perform the reverse mapping and
determine the owning process and the virtual page number for a given physical page. For instance,
when the kernel removes an active page from physical memory, it must invalidate any translation
for that page. To do so, it must locate the page table entry and/or TLB entry for the page; otherwise,
the hardware will continue to expect the page to be at this physical location. Thus the kernel main-
tains a physical memory map that keeps track of what data is stored in each physical page.
Backing store map- When the fault handler cannot find a page in physical memory, it allocates a
new page and initializes it in one of two ways-by filling it with zeroes or by reading it in from
secondary storage. In the latter case, the page could be obtained from the executable file, from a
shared library object file, or from its saved copy in the swap area. These objects comprise the back-
ing store for the pages of a process. The kernel must maintain maps to locate pages on the backing
store.
To make room for a new page, the kernel must reclaim a page that is currently in memory. The page
replacement policy deals with how the kernel decides which page to reclaim [Bela 66]. The ideal
candidate is a dead page, that is, one that will never be required again (for example, a page belong-
ing to a terminated process). If there are no dead pages (or not enough of them), the kernel may
choose a local or global replacement policy. A local policy allocates a certain number of pages to
each process or group of related processes. If a process needs a new page, it must replace one of its
own pages. If the kernel uses a global policy, it can steal a page from any process, using global se-
lection criteria such as usage patterns.
Local policies are necessary when it is important to guarantee resources to certain processes.
For example, the system administrator may allocate a larger set of pages to a more important proc-
ess. Global policies, on the other hand, are simpler to implement and more suitable for a general
time-sharing system. Most UNIX variants implement a global replacement policy, but reserve a
minimum number of resident pages for each active process.
410 Chapter 13 Virtual Memory
For a global replacement policy, the kernel must choose criteria for deciding which pages
should be kept in memory. Ideally, we want to keep those pages that are going to be needed in the
near future. We call this set of pages the working set of a process. If the page reference behavior of
the processes were known in advance, the working set could be determined exactly, at least in the-
ory. In practice, we have little advance knowledge of the access pattern of processes, and we must
rely on empirical studies of typical processes to guide our implementation.
Such studies have shown that processes tend to exhibit locality of reference, that is, a proc-
ess tends to localize its references to a small subset of its pages, and this subset changes slowly. For
instance, when executing a function, all instructions are on the page (or pages) containing that func-
tion, and after a while, the process may move to a different function, thus changing its working set.
Similarly for data references, loops that operate on arrays and functions that perform several opera-
tions on a structure are examples of code that exhibits locality of reference.
The practical inference is that recently accessed pages are more likely to be accessed again
in the near future. Thus, a good approximation to the working set is the set of pages most recently
accessed. This leads to a least recently used (LRU) policy for page replacement--discard those
pages that have not been accessed for the longest time. Such an LRU policy is also adopted by the
filesystem buffer cache, since file access patterns exhibit similar trends. For memory management,
however, the LRU policy must be modified due to practical considerations, as seen in Section
13.5.3.
Finally, the kernel must decide when to free active pages. One option is to look for a page to
reclaim only when a process actually needs to bring a page into memory. This is inefficient and de-
grades system performance. The better solution is to maintain a pool of free pages and periodically
add pages to this pool, so that the load on the paging system is more evenly distributed over time.
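To make the approximation concrete, the following fragment sketches a one-handed clock (second-chance) scan over an array of frame descriptors. It is an illustrative sketch only: frame_t, frames[], nframes, write_to_swap(), and put_on_free_list() are invented names, and the actual 4.3BSD algorithm is the two-handed clock variant discussed in Section 13.5.3.

/*
 * Minimal sketch of a clock (second-chance) scan that approximates LRU.
 * All types and helpers here are hypothetical stand-ins for real kernel
 * data structures such as the core map.
 */
typedef struct frame {
    int referenced;     /* set on each reference to the page */
    int modified;       /* set when the page has been written */
    int free;           /* already on the free list */
} frame_t;

extern frame_t frames[];        /* one descriptor per page frame */
extern int     nframes;

extern void write_to_swap(frame_t *f);      /* save dirty contents */
extern void put_on_free_list(frame_t *f);

static int hand;                /* current position of the clock hand */

/* Scan frames, clearing referenced bits, until one page is reclaimed. */
void reclaim_one_page(void)
{
    for (;;) {
        frame_t *f = &frames[hand];
        hand = (hand + 1) % nframes;

        if (f->free)
            continue;               /* nothing to reclaim here */
        if (f->referenced) {
            f->referenced = 0;      /* recently used: give it a second chance */
            continue;
        }
        if (f->modified)
            write_to_swap(f);       /* must not lose modified contents */
        put_on_free_list(f);
        return;
    }
}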
A page table entry (PTE) typically contains the physical page number, a valid bit, protection bits, a modified bit, and, optionally, a referenced bit. The page table format is hardware-prescribed; other than that, page tables
are simply data structures located in main memory. Several page tables reside in memory at any
time. The MMU uses only the active tables, whose locations are loaded in hardware page table reg-
isters. Typically, on a uniprocessor, there are two active page tables--one for the kernel and one for
the currently running process.
The MMU breaks a virtual address into the virtual page number and the offset within the
page. It then locates the page table entry for this page, extracts the physical page number, and com-
bines that with the offset to compute the physical address.
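The translation step can be pictured in a few lines of C. This is only a schematic model of a one-level page table, with made-up constants and a made-up pte_t layout rather than any particular MMU's format:

#include <stdint.h>

#define PAGE_SHIFT 12                   /* assume 4-kilobyte pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

typedef struct {
    uint32_t pfn;          /* physical page (frame) number */
    unsigned valid : 1;    /* page is resident and mapped */
} pte_t;

/* Translate a virtual address through a one-level page table.
 * Returns 0 and fills in *paddr on success, or -1 to indicate a
 * page fault that the kernel's fault handler must resolve. */
int translate(const pte_t *page_table, uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;    /* virtual page number */
    uint32_t offset = vaddr & PAGE_MASK;      /* offset within the page */
    pte_t    pte    = page_table[vpn];

    if (!pte.valid)
        return -1;                            /* not resident: page fault */
    *paddr = (pte.pfn << PAGE_SHIFT) | offset;
    return 0;
}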
Address translation may fail for three reasons:
• Bounds error - The address does not lie in the range of valid addresses for the process.
There is no page table entry for that page.
• Validation error- The page table entry is marked invalid. This usually means the page
is not resident in memory. There are some situations where the valid bit is clear even if the
page is valid and resident; they are covered in Section 13.5.3.
• Protection error- The page does not permit the type of access desired (e.g., write ac-
cess to a read-only page or user access to a kernel page).
In all such cases, the MMU raises an exception and passes control to a handler in the kernel.4
Such an exception is called a page fault, and the fault handler is passed the offending virtual address
as well as the type of fault (validation or protection; bounds errors result in a validation fault). The
handler may either service the fault by bringing the page into memory or notify the process by
sending it a signal (usually SIGSEGV).
If the fault can be successfully handled, the process (when it eventually runs again) must re-
start the instruction that caused the fault. This requires the hardware to save the correct information
required to restart the offending instruction prior to generating the page fault.
Every time a page is written to, the hardware sets the modified bit in its PTE. If the operating
system finds this bit set, it saves the page on stable storage before recycling it. If the hardware sup-
ports a referenced bit in its PTEs, it sets this bit on each reference to a page. This allows the operat-
ing system to monitor the usage of resident pages and recycle those that do not seem useful.
If a process has a large virtual address space, its page table may become extremely large. For
instance, if a process has an address space of 2 gigabytes and the page size is 1024 bytes, then the
page table must have 2 million entries, and thus be 8 megabytes in size. It is impractical to keep
such a large table in physical memory. Moreover, most of this address space is probably not really
being used-a typical process address space comprises a number of regions (text, data, stack, etc.)
scattered in different parts of this space. Thus the system needs a more compact way of describing
the address space.
This problem is addressed by having segmented page tables or by paging the page table it-
self. The first approach works best when the system explicitly supports segmentation. Each segment
of the process has its own page table, which is just large enough to hold the valid address range for
4 Some architectures do not support protection faults. On such systems, the kernel must force validation faults by un-
mapping the page, then determine the appropriate action based on whether the page is in memory and the type of ac-
cess attempted.
that segment. In the second approach, the page table itself is paged, which means an additional
higher-level page table is used to map the lower-level page table. With such a multitiered page table
hierarchy, we need to allocate only those pages of the lower-level table that map valid addresses of
the process. The two-level approach is more common, but some architectures such as the SPARC
Reference MMU [SPARC 91] allow three levels of page tables.
The page table thus forms a link between the MMU and the kernel, both of which can ac-
cess, use, and modify the PTEs. The hardware also has a set of MMU registers, which point to the
page tables. The MMU is responsible for using the PTE to translate virtual addresses, checking the
valid and protection bits in the process, and for setting referenced and modified bits as appropriate.
The kernel must set up the page tables, fill in the PTEs with the correct data, and set the MMU reg-
isters to point to them. These registers usually need to be reset on each context switch.
Page tables alone cannot provide efficient address translation. Each instruction would require sev-
eral memory accesses--one to translate the virtual address of the program counter, one to fetch the
instruction, and similarly, two accesses for each memory operand involved. If the page tables them-
selves are paged or tiered, the number of accesses increases further. Since each memory access re-
quires at least one CPU cycle, so many accesses will saturate the memory bandwidth and increase
the instruction execution time to unacceptable limits.
This problem is addressed in two ways. The first is by adding a high-speed cache that is
searched before each memory access. Machines may support separate data and instruction caches or
a single cache for both. Getting data from the cache is much faster than accessing main memory. On
many machines (especially older ones), the cache is physically addressed. It is completely
managed by the hardware and is transparent to the software. Cache access takes place after the ad-
dress translation, so that the benefits are modest. Many newer architectures such as Hewlett-
Packard's PA-RISC [Lee 89] use a virtually addressed cache, which allows the cache search to pro-
ceed in parallel with address translation. This approach greatly improves performance, but has a
number of cache consistency problems, which must be dealt with by the kernel (see Section 15.13).
The second approach to reducing memory accesses is an on-chip translation cache, called a
translation lookaside buffer (TLB). The TLB is an associative cache of recent address translations.
TLB entries are similar to page table entries and contain address translation and protection informa-
tion. They may also have a tag that identifies the process to which the address belongs. The cache is
associative in the sense that the lookup is content-based rather than index-based-the virtual page
number is searched for simultaneously in all the TLB entries.
The MMU usually controls most TLB operations. When it cannot find a translation in the
TLB, it looks it up in the software address maps (such as the page tables) and loads it into a TLB
entry. The operating system also needs to cooperate in some situations. If the kernel changes a page
table entry, the change is not automatically propagated to the TLB's cached copy of the entry. The
kernel must explicitly purge any TLB entry that it invalidates, so that the MMU will reload it from
memory when the page is next accessed. For example, the kernel may write-protect a page in re-
sponse to an explicit user request (through the mprotect system call). It must purge any old TLB
entry that maps to this page, or else the process would still be able to write to the page (since the old
mapping allowed writes).
The hardware defines the way in which the kernel can operate on the TLB. It may either
provide explicit instructions for loading or invalidating TLB entries, or such functions may occur as
a byproduct of certain instructions. On some systems (such as the MIPS R3000), the hardware uses
only the TLBs, and any page tables or other maps are managed solely by the kernel. On such sys-
tems, the operating system is involved on each TLB miss and must load the correct translation into
the TLB.
Although all MMUs must provide the same basic functionality, the way in which they do so
may vary a lot. The MMU architecture dictates the virtual and physical page size, the types of pro-
tection available, and the format of the address translation entries. On machines with hardware-
supported page tables, the MMU defines the page table hierarchy and the registers that map the page
tables. It also defines the division of labor between itself and the kernel, and the extent of the ker-
nel's role in manipulating the address and translation caches. With that in mind, we now look at
three different architectures, with an emphasis on their impact on UNIX memory management im-
plementation.
[Figure: x86 segmented address translation - a virtual address selects a segment descriptor from the segment descriptor table, and the resulting linear address is translated by the page tables.]
There is also a single global descriptor table (GDT), which has entries for the kernel code, data, and stack segments, plus
some special objects including the per-process LDTs. When translating a virtual address, the MMU
uses the segment selector to identify the correct segment descriptor, either from the GDT or from
the current LDT (depending on a bit in the selector). It makes sure the offset is less than the segment
size and adds it to the segment base address to obtain the linear address.
UNIX implementations on the x86 use segmentation only for memory protection, kernel en-
try, and context switching [Robb 87]. The kernel hides the notion of segmentation from user processes, which
see a flat address space. To achieve this, the kernel sets up all user segments with base address 0 and
a large size that excludes only the high end of virtual memory, which is reserved for kernel code and
data. Code segments are protected as read-only and data as read-write, but both refer to the same
locations. Each LDT also refers to some special segments-a call gate segment for system call entry
and a task state segment (TSS) to save the register context across a context switch.
The x86 uses a two-level page table scheme (Figure 13-6). The 4-kilobyte page size implies
that a process may have up to one million pages. Rather than having one huge page table, the x86
uses a number of small page tables for each process. Each page table is one page in size, and thus
holds 1024 PTEs. It maps a contiguous region 4 megabytes in size, aligned at a 4-megabyte bound-
ary. Hence a process may have up to 1024 page tables. Most processes, however, have sparse ad-
dress spaces and use only a few page tables.
Each process also has a page directory, which contains PTEs that map the page tables them-
selves. The directory is one page in size and holds 1024 PTEs, one for each page table. The page
directory is the level-1 page table, and the page tables themselves are the level-2 tables. The rest of
this discussion uses the terms page directory and page table instead of level-1 and level-2 tables.
The control register CR3 stores the physical page number of the current page directory in its
20 high-order bits. Hence it is also known as the PDBR (Page Directory Base Register). Virtual ad-
dresses on the 80x86 can be broken into 3 parts. The top 10 bits contain the DIR field, which be-
comes the index into the page directory. This is combined with the page number from CR3 to give
the physical address of the page directory entry for the appropriate page table. The next 10 bits
contain the PAGE field, which stores the virtual page number relative to the start of that region. This
is used as an index in the page table to get the desired PTE, which in turn contains the physical page
number of the desired page. This in turn is combined with the low-order 12 bits of the virtual ad-
dress, which contain the byte offset in the page, to yield the physical address.
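The walk just described can be condensed into a short C sketch. The field widths follow the text (10-bit DIR, 10-bit PAGE, 12-bit offset); the phys_to_virt() helper, which lets this illustrative code read physical memory, is an assumption, and the valid and protection checks are reduced to a comment:

#include <stdint.h>

/* Illustrative 80386-style two-level translation; not kernel code. */
extern uint32_t *phys_to_virt(uint32_t paddr);   /* hypothetical helper */

uint32_t x86_translate(uint32_t cr3, uint32_t vaddr)
{
    uint32_t dir    = (vaddr >> 22) & 0x3FF;     /* top 10 bits: DIR */
    uint32_t page   = (vaddr >> 12) & 0x3FF;     /* next 10 bits: PAGE */
    uint32_t offset =  vaddr        & 0xFFF;     /* low 12 bits: byte offset */

    /* CR3 holds the physical page number of the page directory. */
    uint32_t *pgdir = phys_to_virt(cr3 & 0xFFFFF000);
    uint32_t  pde   = pgdir[dir];                /* page directory entry */

    /* A real walk would check the valid and protection bits of both
     * the directory entry and the page table entry at this point. */
    uint32_t *pgtbl = phys_to_virt(pde & 0xFFFFF000);
    uint32_t  pte   = pgtbl[page];               /* page table entry */

    return (pte & 0xFFFFF000) | offset;          /* physical address */
}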
Each page table entry contains the physical page number, protection field, and valid, refer-
enced, and modified bits (Figure 13-7). The protection field has two bits-one to specify read-only
(bit clear) or read-write (bit set) access and another to specify if the page is a user page (bit clear) or
supervisor (bit set). When the process is running in kernel mode, all pages are accessible read-write
(write protection is ignored). In user mode, supervisor pages are inaccessible regardless of the read-
write bit setting, which only applies to user pages. Since both page directory entries as well as page
table entries have protection fields, access must be permitted by both entries.
The support for the referenced bit simplifies the kernel's task of monitoring page usage. The
CR3 register needs to be reset on each context switch so that it points to the page directory of the
new process.
The TLB in the x86 architecture is never directly accessed by the kernel. The entire TLB,
however, is flushed automatically whenever the PDBR is written to, either explicitly by a move in-
struction or as an indirect result of a context switch. The UNIX kernel flushes the TLB whenever it
invalidates a page table entry (for example, when reusing a page).
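Flushing the TLB on this architecture therefore amounts to rewriting CR3 with its current contents. As a modern illustration (GCC inline-assembly syntax, which a 1990s UNIX kernel would not have contained verbatim), the operation looks like this and must run in ring 0:

#include <stdint.h>

/* Flush the entire x86 TLB by reloading CR3 with its current value.
 * Requires kernel (ring 0) privilege. */
static inline void tlb_flush_all(void)
{
    uint32_t cr3;
    __asm__ __volatile__("mov %%cr3, %0" : "=r"(cr3));
    __asm__ __volatile__("mov %0, %%cr3" : : "r"(cr3) : "memory");
}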
The x86 supports four privilege levels or protection rings, of which UNIX uses only two.
The kernel runs in the innermost ring, which is the most privileged. This allows it to execute privi-
leged instructions (such as those that modify MMU registers) and to access all segments and all
pages (user and supervisor). User code runs in the outermost, least privileged ring. It may execute
only nonprivileged instructions and access only the user pages in its own segments. The call gate
segment allows the user to make system calls. It puts the system into the inner ring and transfers
control to a location specified in the call gate, which is under control of the kernel.
[Figure: RS/6000 address space layout - each process (A and B) loads 16 segment registers (0 through 15) that select segments such as kernel text, user text, private data, shared data segments, shared text, VM data, kernel data, and I/O space; the kernel and shared segments are common to both processes.]
As explained earlier, the RS/6000 does not maintain a direct virtual to physical address
translation map. Instead, it maintains an inverted page table called the page frame table (PFT), with
one entry for each physical page. The system uses a hashing technique to translate virtual addresses,
as shown in Figure 13-10. A data structure called the hash anchor table (HAT) contains information
used to convert a system virtual page number to a hash value, which points to a linked list of PFT
entries. Each PFT entry contains the following fields:
• The virtual page number to which it maps.
• A pointer to the next entry in the hash chain.
• Flags such as valid, referenced, and modified.
• Protection and locking information.
The RS/6000 uses the HAT to locate the hash chain and traverses the chain till it finds an entry for
the desired virtual page number. The index of the entry in the PFT equals the physical page number,
which is 20 bits in size. This is combined with the 12-bit page offset from the process virtual ad-
dress to obtain the physical address.
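The hashed lookup can be outlined in C as follows. The structures and the hash function are simplified stand-ins for the hardware-defined HAT and PFT formats, not the real RS/6000 layouts:

#include <stdint.h>

#define PAGE_SHIFT 12
#define NO_FRAME   (~0u)

/* Simplified page frame table (PFT) entry, one per physical page. */
struct pft_entry {
    uint32_t vpn;          /* virtual page number held by this frame */
    uint32_t next;         /* next entry on the hash chain, or NO_FRAME */
    unsigned valid : 1;
};

extern struct pft_entry pft[];      /* indexed by physical page number */
extern uint32_t         hat[];      /* hash anchor table: chain heads */
extern uint32_t         hat_size;   /* number of buckets (power of two) */

/* Translate a virtual address by searching the inverted page table.
 * Returns the physical address, or NO_FRAME if no mapping exists. */
uint32_t inverted_translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);

    /* The HAT turns the virtual page number into a hash chain head. */
    for (uint32_t i = hat[vpn & (hat_size - 1)]; i != NO_FRAME; i = pft[i].next)
        if (pft[i].valid && pft[i].vpn == vpn)
            return (i << PAGE_SHIFT) | offset;   /* PFT index == frame number */

    return NO_FRAME;    /* miss: the fault handler takes over */
}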
This translation process is slow and expensive and should not be required for each memory
access. The RS/6000 maintains two separate TLBs-a 32-entry instruction TLB and a 128-entry
data TLB. In normal operation, these buffers should take care of most address translation, and the
hashed lookup is required only when there is a TLB miss. In addition, the RS/6000 has separate data
and instruction caches. The data cache is 32 or 64 kilobytes in size, and the instruction cache is 8 or
32 kilobytes, depending on the individual system model [Chak 94]. These caches are virtually ad-
dressed; therefore, address translation is not required when there is a cache hit. Virtually addressed
caches require kernel involvement to address some consistency problems. These issues are de-
scribed in Section 15.13.
[Figure: MIPS R3000 address space - kuseg (beginning at 0) is mapped through address translation and goes through the data and instruction caches; kseg0 (beginning at 80000000) and kseg1 (beginning at A0000000) map directly to physical memory, cached and uncached respectively; kseg2 (C0000000 through FFFFFFFF) is mapped kernel space.]
Each TLB entry carries a 6-bit process tag called the tlbpid, which ranges from 0 through 63 and is not the same as the traditional process ID. Each process that may have active
TLB entries will be assigned a tlbpid between 0 and 63. The kernel sets the PID field in the entryhi
register to the tlbpid of the current process. The hardware compares it to the corresponding field in
the TLB entries, and rejects translations that do not match. This allows the TLB to contain entries
for the same virtual page number belonging to different processes without conflict.
The N (no-cache) bit, if set, says that the page should not go through the data or instruction
caches. The G (global) bit specifies that the PID should be ignored for this page. If the V (valid) bit is
clear, the entry is invalid, and if the D (dirty) bit is clear, the entry is write-protected. Note that there
is neither a referenced bit nor a modified bit.
In translating kuseg or kseg2 addresses, the virtual page number is compared with all TLB
entries simultaneously. If a match is found and the G bit is clear, the PID of the entry is compared
with the current tlbpid, stored in the entryhi register. If they are equal (or if the G bit is set) and the
V bit is set, the PFN field yields the valid physical page number. If not, a TLBmiss exception is
raised. For write (store) operations, the D bit must be set, or else a TLBmod exception will be
raised.
Since the hardware provides no further facilities (such as page table support), these excep-
tions must be handled by the kernel. The kernel will look at its own mappings, and either locate a
valid translation or send a signal to the process. In the former case, it must load a valid TLB entry
and restart the faulting instruction. The hardware imposes no requirements on whether the kernel
mappings should be page table-based and what the page table entries should look like. In practice,
however, UNIX implementations on MIPS use page tables so as to retain the basic memory man-
agement design. The format of the entrylo register is the natural form of the PTEs, and the eight
low-order bits, which are unused by hardware, may be used by the kernel in any way.
The lack of referenced and modified bits places further demands on the kernel. The kernel
must know which pages are modified, since they must be saved before reuse. This is achieved by
write-protecting all clean pages (clearing the D bit in their TLBs), so as to force a TLBmod excep-
tion on the first write to them. The exception handler can then set the D bit in the TLB and set ap-
propriate bits in the software PTE to mark the page as dirty. Reference information must also be
collected indirectly, as shown in Section 13.5.3.
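Putting these pieces together, a software TLB refill on an R3000-like machine might look roughly like the sketch below. It is conceptual C only: lookup_softpte(), tlb_write_random(), and send_sigsegv() are hypothetical helpers, and the bit positions behind TLB_V and TLB_D are illustrative rather than a register specification.

#include <stdint.h>

#define TLB_V (1u << 9)        /* valid bit (position illustrative) */
#define TLB_D (1u << 10)       /* dirty bit; clear means write-protected */

/* Software PTE kept in the kernel's own maps (format is up to the kernel). */
struct softpte { uint32_t pfn; unsigned valid:1, writable:1, modified:1; };

extern struct softpte *lookup_softpte(uint32_t vpn);   /* kernel's own maps */
extern void tlb_write_random(uint32_t entryhi, uint32_t entrylo);
extern void send_sigsegv(void);

void tlbmiss_handler(uint32_t badvaddr, uint32_t tlbpid)
{
    uint32_t vpn = badvaddr >> 12;
    struct softpte *pte = lookup_softpte(vpn);

    if (pte == NULL || !pte->valid) {
        send_sigsegv();        /* or first page the data in, then retry */
        return;
    }

    /* entryhi carries the virtual page number and the tlbpid tag;
     * entrylo carries the frame number plus the V and D bits. Leaving
     * D clear on clean pages forces a TLBmod exception on the first
     * store, which is how the kernel simulates the modified bit. */
    uint32_t entryhi = (vpn << 12) | (tlbpid << 6);
    uint32_t entrylo = (pte->pfn << 12) | TLB_V;
    if (pte->writable && pte->modified)
        entrylo |= TLB_D;
    tlb_write_random(entryhi, entrylo);
}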
This architecture leads to a larger number of page faults, since every TLB miss must be
handled by the software. The need to track page modifications and references causes even more
page faults. This is offset by the speed gained by a simpler memory architecture, which allows very
fast address translation when there is a TLB cache hit. Further, the faster CPU speed helps keep
down the cost of the page fault handling. Finally, the unmapped region kseg0 is used to store the
static text and data of the kernel. This increases the speed of execution of kernel code, since address
translations are not required. It also reduces contention on the TLB, which is needed only for user
addresses and for some dynamically allocated kernel data structures.
13.4 4.3BSD - A Case Study
So far we have described the basic concepts of demand paging, and how hardware characteristics
can influence the design. To understand the issues involved more clearly, we use 4.3BSD memory
management as a case study. The first UNIX system to support virtual memory was 3BSD. Its
memory architecture evolved incrementally over the subsequent releases. 4.3BSD was the last Ber-
keley release based on this memory model. 4.4BSD adopted a new memory architecture based on
Mach; this is described in Section 15.8. [Leff 89] provides a more complete treatment of 4.3BSD
memory management. In this chapter, we summarize its important features, evaluate its strengths
and drawbacks, and develop the motivation for the more sophisticated approaches described in the
following chapters.
Although the target platform for the BSD releases was the VAX-11, BSD has been successfully
ported to several other platforms. The hardware characteristics impact many kernel algorithms, in
particular the lower-level functions that manipulate page tables and the translation buffer. Porting
BSD memory management has not been easy, since the hardware dependencies permeate through
all parts of the system. As a result, several BSD-based implementations emulate the VAX memory
architecture in software, including its address space layout and its page table entry format. We
avoid a detailed description of the VAX memory architecture, since the machine is now obsolete.
Instead, we describe some of its important features as part of the BSD description.
4.3BSD uses a small number of fundamental data structures-the core map describes physi-
cal memory, the page tables describe virtual memory, and the disk maps describe the swap areas.
There are also resource maps to manage allocation of resources such as page tables and swap space.
Finally, some important information is stored in the proc structure and u area of each process.
The low end of physical memory holds the kernel text and static data, which are considered to be nonpageable. The very high end of physical memory is reserved for error messages generated
during a system crash. In between these two regions is the paged pool, which occupies the bulk of
physical memory. It contains all the pages belonging to user processes, as well as dynamically allo-
cated kernel pages. The latter are marked nonpageable, even though they are part of the paged pool.
These physical pages are called page frames, and the frame holds the contents of a process
page. The page stored in the frame can be replaced at any time by a different page, and thus we need
to maintain information about the contents of each frame. This is done using a core map, which is
an array of struct cmap entries, one entry for each frame in the paged pool.5 The core map itself is
a kernel data structure, allocated at boot time and resident in the nonpaged pool. The core map entry
contains the following information about the frame:
• Name- The name or identity of the page stored in the frame. The name space for a proc-
ess page is described by the owner process ID, the type (data or stack), and the virtual
page number of the page in that region. Text pages may be shared by several processes,
and thus their owner is the text structure for that program. The core map entry stores the
index into the process table or the text table. The name <type, owner, virtual page num-
ber> allows the kernel to perform reverse address translation, that is, to locate the PTE
corresponding to the page frame, as seen in Figure 13-14.
• Free list- Forward and backward pointers to link free pages onto a free list. This list is
maintained in approximate least recently used order (Section 13.5.3) and is used by the
memory allocation routines to allocate and free physical memory.
• Text page cache - The name of a page is meaningful only as long as the owner process
is alive. This is okay for data and stack pages, because such pages are garbage once the
process exits. In case of a text page, however, there is a chance that another process may
soon try to rerun the same program. If this happens and some of the text pages are still
resident in memory, it makes sense to reuse them instead of reading them afresh from
disk. To identify such pages even after their owner(s) have terminated, the core map entry
stores the disk locations (device and block number) of text pages. Such pages are also
5 Actually, there is one cmap entry for each cluster of frames. Clusters provide the notion of a logical page composed
of a (fixed) number of physical pages. This enhances performance by increasing the granularity of several operations
and reducing the size of data structures such as the core map.
[Figure 13-14: Reverse address translation - the <type, owner, VPN> name in a cmap entry identifies a proc table or text table entry, whose page table contains the PTE for that frame.]
hashed onto a set of hash queues (based on device and block number) for quick retrieval.
The disk location may either identify the page in the executable file or on the swap device.
• Synchronization - A set of flags synchronizes access to the page. The page is locked
while moving it to or from the disk.
page tables. In such a case, the kernel invokes the swapper to swap out a process in an attempt to
free up space in Userptmap (see Section 13.5.4).
Can page tables be shared? In particular, if two processes are running the same program, can
they share the page table for the text region? This is generally possible, and many variants of UNIX
allow such sharing. BSD UNIX, however, has subtle problems with this approach. Each process
must have a single page table for the PO region, which must be contiguous in system virtual address
space and, hence, be described by a contiguous set of system PTEs in Userptmap. Since the data re-
gion is not shared, only a part of the PO page table is sharable. Because each process has its own set
of Userptmap entries, the PTEs for the page table pages for the text region must point to the same
set of pages. This in turn means that the beginning of the data region page table must start on a new
page and be described by a new PTE in Userptmap. This requires the data region to be aligned on a
64K boundary.
Such a requirement would have resulted in an incompatible, user-visible change. To avoid
that, BSD requires each process to have its own text page table. If multiple processes share a text
region, their text page table entries need to be kept in sync. For example, if one of the processes
brings a page into memory, that change must be propagated to the PTE for that page in all processes
sharing that region. Figure 13-15 shows how the kernel locates all the page tables mapping a par-
ticular text region.
[Figure 13-15: Multiple mappings for a text page - the text structure's x_caddr field heads a list of proc structures (linked through p_xlink) for all processes sharing the text, and each process has its own page table for the region.]
A nonresident page falls into one of the following categories:
• Fill-from-text - Text and initialized data pages are read in from the executable file
upon first access.
• Zero-fill- Uninitialized data, heap, and stack pages are created and filled with ze-
ros when required.
• Outswapped - These are pages that have once been read into memory and subsequently
paged out to make room for other pages. These pages may be recovered from their swap
area locations.
The kernel must maintain sufficient information about all nonresident pages, so that it can
bring them in when needed. For swapped out pages, it must store their locations on the swap device.
For zero-fill pages, the kernel only needs to recognize them as such. For fill-from-text pages, it must
determine their location in the filesystem. This can be done by the file system routines that read the
disk block array in the inode. That, however, is inefficient, since it frequently requires accessing
other disk blocks (indirect blocks) to locate the page.
A better approach is to store all such translations in memory management data structures
when the program is initially invoked. This allows a single pass through the block array in the inode
and the indirect blocks to locate all the text and initialized data pages. This could be done using a
second table that maps all the nonresident pages. The disk block descriptor table in SVR3 UNIX
provides this functionality. This solution, however, involves significant memory overhead, requiring
an additional table essentially the same size as the page table.
The 4.3BSD solution relies on the fact that, except for the protection and valid bits, the rest
of the fields in the page table entry are not examined by the hardware unless the valid bit is set.
Since all nonresident pages have the valid bit clear, those fields can be replaced by other informa-
tion that tracks these pages. Figure 13-16(a) shows the hardware-defined format of the VAX-11
page table entry. 4.3BSD uses bit 25, which is not used by the hardware, to define fill-on-demand
entries, described as follows.
[Figure 13-16: VAX-11 page table entry formats - (a) the hardware-defined entry; (b) an ordinary 4.3BSD entry; (c) a fill-on-demand entry, which holds a filesystem block number and a bit indicating fill-from-text (1) or zero-fill (0).]
For ordinary page table entries (Figure 13-16(b)), the fill-on-demand bit is clear. When the
bit is set, it indicates that the page is a fill-on-demand page (valid bit must be clear) and that the
page table entry is a special, fill-on-demand entry with a different set of fields (Figure 13-16(c)).
Instead of the page frame number and the modified bit, such an entry stores a filesystem block num-
ber and a bit specifying if the page is fill-from-text (bit set) or zero-fill (bit clear). For a fill-from-
text page, the device number is obtained from the text structure for that program.
The treatment of outswapped pages is different. The PTEs for such pages have the valid and
fill-on-demand bits clear and the page frame number set to zero. The kernel maintains separate swap
maps to locate these pages on the swap device, as explained in Section 13.4.4.
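The decision tree implied by these formats can be written out as a small fragment of C. The bit-field structure below is schematic (the real VAX entry packs the same information into fixed bit positions), and the field names are invented:

/* Schematic 4.3BSD-style page table entry; the layout is illustrative only. */
typedef struct {
    unsigned valid        : 1;   /* hardware uses the entry only if set */
    unsigned fod          : 1;   /* fill-on-demand (bit 25 on the VAX) */
    unsigned from_text    : 1;   /* fod: 1 = fill-from-text, 0 = zero-fill */
    unsigned pfn_or_blkno : 24;  /* frame number, or filesystem block number */
} bsd_pte_t;

enum page_source { RESIDENT, FILL_FROM_TEXT, ZERO_FILL, OUTSWAPPED };

/* Classify a page from its PTE, mirroring the cases described in the text. */
enum page_source classify(bsd_pte_t pte)
{
    if (pte.fod)                          /* valid is clear for these entries */
        return pte.from_text ? FILL_FROM_TEXT : ZERO_FILL;
    if (pte.pfn_or_blkno != 0)
        return RESIDENT;                  /* resident, even if valid is clear */
    return OUTSWAPPED;                    /* location comes from the swap maps */
}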
During a fork, the kernel duplicates the parent's address space region by region:
• Text region - The child is added to the list of processes sharing the text structure used
by the parent, and the page table entries for the text region are copied from the parent.
• Data and stack- Data and stack must be copied one page at a time. For pages that are
still fill-on-demand, only the PTEs need to be copied. The rest of the pages are duplicated
by allocating physical memory for them and copying the pages from the parent. If the
page in the parent's space has been swapped out, it must be read in from swap and then
copied. The child's PTEs are set to point to the new copies of these pages. All newly cop-
ied pages are marked modified, so that they will be saved to swap before reuse.
The fork operation is expensive, largely due to all the copying involved in the last step.
Copying the entire data and stack regions seems wasteful, considering that most processes will
either exit or call exec to invoke a new program soon after forking, thus discarding the whole ad-
dress space.
There have been two major approaches to reduce this overhead. The first is called copy-on-
write, which was adopted by System V UNIX. Here, the child and parent refer to a single copy of
the data and stack pages, whose protections are changed to read-only. If either tries to modify any of
the pages, we get a protection fault. The fault handler recognizes the situation and makes a new
copy of that page, changes the protections back to read-write, and updates the PTEs in the parent
and the child. This way, only those pages modified by either the parent or child need to be copied,
reducing the cost of process creation.
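In outline, a copy-on-write protection-fault handler behaves as sketched below. This is a generic illustration with invented helper names and an assumed per-page reference count, not System V (or BSD) source code:

/* Outline of copy-on-write fault handling; all helpers are hypothetical. */
struct page;
struct pte;

extern struct page *alloc_page(void);
extern void copy_page(struct page *dst, struct page *src);
extern int  page_refcount(struct page *pg);
extern void unref_page(struct page *pg);
extern void map_page(struct pte *pte, struct page *pg, int writable);

void cow_fault(struct pte *pte, struct page *shared_pg)
{
    if (page_refcount(shared_pg) == 1) {
        /* Last remaining user: no copy needed, just restore write access. */
        map_page(pte, shared_pg, 1);
        return;
    }

    /* Still shared: give the faulting process its own writable copy. */
    struct page *copy = alloc_page();
    copy_page(copy, shared_pg);
    map_page(pte, copy, 1);
    unref_page(shared_pg);
    /* A real implementation would also purge any stale TLB entry here. */
}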
Implementing copy-on-write requires reference counts maintained on a per-page basis,
which was one of the reasons it was not adopted by BSD UNIX. Instead, BSD provides an alternate
system call named vfork (virtual fork), which addresses the problem in a different way.
vfork is used when the fork is expected to be soon followed by an exit or exec. Instead of
duplicating the address space, the parent passes its own space to the child and then sleeps until the
child execs or exits. When that happens, the kernel wakes up the parent, who recaptures the space
from the child. The only resources created for the child are the u area and the proc structure. The
passing of the address space is accomplished by simply copying the page table registers from the
parent to the child. Not even the page tables need to be copied. Only the PTEs mapping the u area
need to be changed.
vfork is extremely lightweight and a lot faster than copy-on-write. Its drawback is that it al-
lows the child to modify the contents or size of the parent's address space. The burden lies on the
programmer to ensure that vfork is properly used.
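The canonical safe use of vfork, then, is to exec (or exit) immediately in the child without touching the borrowed address space. A typical user-level fragment looks like this:

#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = vfork();            /* parent sleeps until the child execs or exits */

    if (pid == 0) {
        /* Child: it is borrowing the parent's address space, so it only
         * execs a new program (or calls _exit) and modifies nothing else. */
        execlp("ls", "ls", "-l", (char *)NULL);
        _exit(127);                 /* exec failed; must not call exit() */
    } else if (pid > 0) {
        int status;
        waitpid(pid, &status, 0);   /* parent resumes after the exec or _exit */
    } else {
        perror("vfork");
    }
    return 0;
}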
A bounds error normally results in a signal to the process, unless the faulting address lies just beyond the end of the stack, in which case the kernel calls a routine to grow the stack automatically. Protection errors likewise result in a signal to the process;
systems implementing copy-on-write must check for that scenario, and handle such protection faults
by making a new writable copy of that page. For all other cases, a routine called pagein() is called
to handle the fault.
pagein() is passed the faulting virtual address, from which it obtains the PTE. If the page is
resident (the PTE is not fill-on-demand, and the page frame number is not zero), the cmap entry for
that page is also obtained. Together, these contain information about the state of the page and gov-
ern the actions of pagein(). Figure 13-18 shows the basic pagein() algorithm. There are seven
different scenarios:
1. The PTE may have simply been marked invalid for referenced bit simulation, as explained
in Section 13.5.3. This is the case when the page is resident, and the cmap entry is not
marked free. pagein() simply sets the valid bit and returns.
2. The page is resident and on the free list. This is similar to case 1, except that the cmap entry
is marked free. pagein() sets the valid bit and removes the cmap entry from the free list.
3. For a text page, another process could have started a read on the page. This happens if two
processes sharing a text region fault on the same page around the same time. The second
process finds that the page frame number is nonzero, but the core map entry is marked
locked and in-transit. pagein() will set the wanted flag and block the second process,
which will sleep on the address (see Section 7.2.3) of the text structure for this page.
When the first process unlocks the page after it has been read in, it will wake up the sec-
ond process. Because the second process may not run immediately after being awakened,
it cannot assume the page is still in memory and must begin the search all over again.
4. Text pages could be in memory even though the PTE does not have a page frame number
for them. This would happen if they were left behind by another process that terminated a
short while back. Such pages can be located by searching the appropriate hash queue using
the <device, block number> pair as a key. If found, the page can be removed from the free
list and reused.
In the remaining cases, the page is not in memory. After determining its location, pagein() must
first allocate a page from the free page list and then read in the page as follows:
5. The page is on the swap device. The fill-on-demand bit is clear, the page frame number is
zero, and case 4 does not apply. The swap maps are consulted to locate the page on swap,
and the page is read in from the swap device.
6. Zero-fill pages are handled by filling the newly allocated page with zeroes.
7. The page is fill-from-text and was not found on the hash queue (case 4). It is read in from
the executable file. This read occurs directly from the file to the physical page, bypassing
the buffer cache. This may cause a consistency problem if the disk copy of the page is ob-
solete. Hence the kernel searches the buffer cache for this page, and if found, flushes the
cache copy to disk before reading it in to the process page. This solution requires two disk
copy operations and is inefficient, but was retained for historical reasons. It would be bet-
ter to copy directly from the buffer cache if the page was found there.
In cases 5 and 6, the new page is marked as modified, so it will be saved on swap before reuse.
[Figure 13-18: The pagein() algorithm (flowchart).]
Dirty pages are written to the swap device asynchronously, so the pagedaemon can continue to examine other pages in the meantime. When the writes com-
plete, the completion routine puts these pages onto a cleaned list, from which they are returned to
the free memory list by a routine called cleanup().
13.5.4 Swapping
Although the paging system works admirably most of the time, it can break down under heavy load.
The major problem, called thrashing, occurs when there is not enough memory to contain the
working sets of the active processes. This may happen because there are too many active processes
or because their access patterns are too random (and hence their working sets are too large). This
results in a sharp increase in the page fault rate. When the pages are faulted in, they replace other
pages that were part of the working set of an active process, which escalates the problem further.
The situation can worsen until the system is spending most of its time in page fault handling, and
the processes can make little progress.
This problem may be addressed by reducing the number of active processes, thus controlling
the system load. Processes that are "deactivated" may not be scheduled to run. It then makes sense
to free up as much of the memory used by such processes as possible, if necessary, by copying data
to the swap space. This operation is known as swapping the process out. When the load on the sys-
tem reduces, the process may be swapped back in.
A special process called the swapper monitors the system load, and swaps processes in and
out when needed. During system initialization, the kernel creates a process with PID 0, which fi-
nally calls sched(), the central function of the swapper. Process 0 thus becomes the swapper. It
sleeps most of the time, but wakes up periodically to check the system state and takes further action
if required. The swapper will swap out a process in the following cases:
13.6 Analysis
The BSD memory management design provides powerful functionality using a small number of
primitives. The only hardware requirement is demand-paging support (since segmentation is not
used). There are, however, several important shortcomings and drawbacks to keep in mind:
• There is no support for execution of remote programs (across a network). This is because
there is no support in the vanilla BSD file system for accessing remote files. If the file
system provides this facility, the extensions to the memory subsystem are simple.
• There is no support for sharing of memory, other than read-only sharing of the text region.
In particular, there is no equivalent of the System V shared memory facility.
• vfork is not a true substitute for fork, and the lack of copy-on-write hurts the performance
of applications that rely extensively on fork. In particular, daemons and other server appli-
cations that fork a child process for each incoming request are heavily impacted.
• Each process must have its own copy of the page table for a shared text region. This not
only wastes space, but also requires keeping these page tables synchronous by migrating
changes made by one process to the corresponding PTEs of other processes sharing the
text.
• There is no support for memory-mapped files. Section 14.2 discusses this facility in detail.
• There is no support for shared libraries.
• There is a problem with debugging a program that is being run by multiple processes. If
the debugger deposits a breakpoint in the program, it modifies the corresponding text
page. This modification is seen by all processes running this program, which can have un-
expected results. To avoid that, the system disallows putting breakpoints in a shared text
and disallows new processes from running a program that is being debugged. These solu-
tions are obviously unsatisfactory.
• The BSD implementation reserves enough swap space in advance to page out every single
page in the process address space. Such a policy ensures that a process can run out of swap
space only when it tries to grow (or in fork or exec) and not arbitrarily in the middle of
execution. This conservative approach requires a large amount of swap space on the sys-
tem. From another perspective, the swap space on the system limits the size of the pro-
grams you can run.
• There is no support for using swap space on remote nodes, which is required for facilities
such as diskless operation.
• The design is heavily influenced by and optimized for the VAX architecture. This makes it
less suitable for the wide range of machines to which UNIX has been ported. Further, the
machine dependencies are scattered all over the code, making the porting effort even
greater.
• The code is not modular, so it is difficult to add features and change individual compo-
nents or policies. For example, storing the filesystem block number in invalid (fill-on-
demand) PTEs prevents a clean separation of the address translation and the page fetch
tasks.
Despite these shortcomings, the 4.3BSD design provides a sound foundation for the modern
memory architectures-such as those of SVR4, 4.4BSD, and Mach-described in the following
chapters. These architectures have retained many of the BSD methods, but they have changed the
underlying design in order to provide more functionality and address many of the limitations of the
BSD approach.
The 4.3BSD architecture was sensible for the systems available in the 1980s, which typically
had slow CPUs and small memories, but relatively large disks. Hence the algorithms were opti-
mized to reduce memory consumption at the cost of doing extra I/0. In the 1990s, typical desktop
systems have large memories and fast processors, but relatively small disks. Most user files reside
on dedicated file servers. The 4.3BSD memory management model is not suitable for such systems.
4.4BSD introduced a new memory architecture based on that of Mach. This is described in Section
15.8.
13.7 Exercises
1. Which of the objectives listed in Section 13.2.1 can be met by a system that used swapping as
the only memory management mechanism?
2. What are the advantages of demand paging compared with segmentation?
3. Why do UNIX systems use anticipatory paging? What are its drawbacks?
4. What are the benefits and drawbacks of copying text pages to the swap area?
5. Suppose an executable program resides on a remote node. Would it be better to copy the
entire image to the local swap area before executing it?
6. The hardware and the operating system cooperate to translate virtual addresses. How is the
responsibility divided? Explore how the answer to this question varies for the three
architectures described in Section 13.3.
7. What are the benefits and drawbacks of a global page replacement policy as compared with a
local policy?
8. What steps can a programmer take to minimize the working set of an application?
9. What are the advantages of inverted page tables?
10. Why does the MIPS R3000 cause a large number of spurious page faults? What are the
advantages of this architecture that offset the cost of processing these additional faults?
11. Suppose a 4.3BSD process faults on a page that is both nonresident and protected (does not
permit the type of access desired). Which case should the fault handler check for first? What
should the handler do?
12. Why does the core map manage only the pages in the paged pool?
13. What do we mean by the name of a page? Does a page have just one name? What are the
different name spaces for pages in 4.3BSD?
14. What is the minimum amount of swap space a 4.3BSD system must have? What is the
advantage of having an extremely large swap area?
15. What are the factors that limit the maximum amount of virtual address space a process may
have? Why, if at all, is it important for a process to be thrifty in its use of virtual memory?
16. Is it better to distribute the swap space over multiple physical disks? Why or why not?
17. Why is a pure LRU policy unsuitable for page replacement?
18. Early BSD releases [Baba 81, Leff 89] used a one-handed clock algorithm, which turned off
referenced bits in the first pass, and swapped out pages whose referenced bits were still off in
the second pass. Why is this algorithm inferior to the two-handed clock?
13.8 References
[Baba 79] Babaoglu, O., Joy, W.N., and Porcar, J., "Design and Implementation of the
Berkeley Virtual Memory Extensions to the UNIX Operating System," Technical
Report, CS Division, EECS Department, University of California, Berkeley, CA,
Dec. 1979.
[Baba 81] Babaoglu, O., and Joy, W.N., "Converting a Swap-Based System to Do Paging in an
Architecture Lacking Page-Referenced Bits," Proceedings of the Eighth ACM
Symposium on Operating Systems Principles, Dec. 1981, pp. 78-86.
[Bach 86] Bach, M.J., The Design of the UNIX Operating System, Prentice-Hall, Englewood
Cliffs, NJ, 1986.
[Bako 90] Bakoglu, H.B., Grohoski, G.F., and Montoye, R.K., "The IBM RISC System/6000
Processor: Hardware Overview," IBM Journal of Research and Development, Vol.
34, Jan. 1990.
[Bela 66] Belady, L.A., "A Study of Replacement Algorithms for Virtual Storage Systems,"
IBM Systems Journal, Vol. 5, No. 2, 1966, pp. 78-101.
[Chak 94] Chakravarty, D., Power RISC System/6000 - Concepts, Facilities, and Architecture,
McGraw-Hill, 1994.
[Coll 91] Collinson, P., "Virtual Memory," SunExpert Magazine, Apr. 1991, pp. 28-34.
[DEC 80] Digital Equipment Corporation, VAX Architecture Handbook, Digital Press, 1980.
[Denn 70] Denning, P.J., "Virtual Memory," Computing Surveys, Vol. 2, No. 3, Sep. 1970, pp.
153-189.
[Intel 86] Intel Corporation, 80386 Programmer's Reference Manual, 1986.
[Kane 88] Kane, G., MIPS RISC Architecture, Prentice-Hall, Englewood Cliffs, NJ, 1988.
[Lee 89] Lee, R.B., "Precision Architecture," IEEE Computer, Vol. 21, No. 1, Jan. 1989, pp.
78-91.
[Leff 89] Leffler, S.J., McKusick, M.K., Karels, M.J., and Quarterman, J.S., The Design and
Implementation of the 4.3 BSD UNIX Operating System, Addison-Wesley, Reading,
MA, 1989.
[Robb 87] Robboy, D., "A UNIX Port to the 80386," UNIX Papers for UNIX Developers and
Power Users, The Waite Group, 1987, pp. 400-426.
[SPARC 91] SPARC International, SPARC Architecture Manual Version 8, 1991.
14
The SVR4 VM Architecture
14.1 Motivation
In SunOS 4.0, Sun Microsystems introduced a memory management architecture called VM (for
Virtual Memory). The previous versions of SunOS were based on the BSD memory management
model, which had all the limitations described in the previous chapter. In particular, SunOS wished
to provide support for memory sharing, shared libraries, and memory-mapped files, which was not
possible without major changes to the BSD design. Moreover, since SunOS ran on several different
hardware platforms (Motorola 680x0, Intel386 and Sun's own SPARC systems), it needed a highly
portable memory architecture. The VM architecture became very successful. Later, when a joint
team of engineers from AT&T and Sun Microsystems set out to design SVR4 UNIX, they based the
SVR4 memory management on this design, rather than on the regions architecture that existed in
SVR3.
The concept of file mapping is central to the VM architecture. The term file mapping is used
to describe two different but related ideas. At one level, file mapping provides a useful facility to
users, allowing them to map part of their address space to a file and then use simple memory access
instructions to read and write the file. It can also be used as a fundamental organizational scheme in
the kernel, which may view the entire address space simply as a collection of mappings to different
objects such as files. The SVR4 architecture incorporates both aspects of file mapping.1 Before
1 These two ideas are independent. HP-UX 9.x, for instance, has user-level file mapping while retaining the traditional
kernel organization. AIX 3.1, in contrast, uses file mapping as its fundamental I/O strategy, but does not export it to
the user level (no mmap system call).
moving to the VM design itself, we discuss the notion of memory-mapped files and why it is useful
and important.
The traditional way of accessing files in UNIX is to first open them with the open system call and
then use read, write, and lseek calls to do sequential or random I/O. This method is inefficient, as it
requires one system call (two for random access) for each I/O operation. Moreover, if several proc-
esses are accessing the same file, each maintains copies of the file data in its own address space,
needlessly wasting memory. Figure 14-1 depicts a situation where two processes read the same page
of a file. This requires one disk read to bring the page into the buffer cache and one in-memory copy
operation for each process to copy the data from the buffer to its address space. Furthermore, there
are three copies of this page in memory-one in the buffer cache, plus one in the address space of
each process. Finally, each process needs to make one read system call, as well as an lseek if the
access was random.
Now consider an alternative approach, where the processes map the page into their address
space (Figure 14-2). The kernel creates this mapping simply by updating some memory manage-
ment data structures. When process A tries to access the data in this page, it generates a page fault.
The kernel resolves it by reading the page into memory and updating the page table to point to it.
Subsequently, when process B faults on the page, the page is already in memory, and the kernel
merely changes B's page table entry to point to it.
This illustrates the considerable benefits of accessing files by mapping them into memory.
The total cost for the two reads is one disk access. After the mappings are set up, no further system
calls are necessary to read or write the data. Only one copy of the page is in memory, thus saving
two pages of physical memory and two in-memory copy operations. Reducing the demands on
physical memory yields further benefits by reducing paging operations.
Figure 14-1. Two processes read the same page in traditional UNIX.
Figure 14-2. Two processes map the same page into their address space.
What happens when a process writes to a mapped page? A process may establish two types
of mappings to files-shared and private. For a shared mapping, modifications are made to the
mapped object itself. The kernel applies all changes directly to this shared copy of the page and
writes them back to the file on disk when the page is flushed. If a mapping is private, any modifica-
tion results in making a private copy of the page, to which the changes are applied. Such writes do
not modify the underlying object; that is, the kernel does not write back the changes to the file when
flushing the page.
It is important to note that private mappings do not protect against changes made by others
who have shared mappings to the file. A process receives its private copy of a page only when it at-
tempts to modify it. It therefore sees all modifications made by other processes between the time it
establishes the mapping and the time it tries to write to the page.
Memory-mapped file I/O is a powerful mechanism that allows efficient file access. It cannot,
however, fully replace the traditional read and write system calls. One major difference is the
atomicity of the I/O. A read or write system call locks the inode during the data transfer, guarantee-
ing that the operation is atomic. Memory-mapped files are accessed by ordinary program instruc-
tions, so at most one word will be read or written atomically. Such access is not governed by tradi-
tional file locking semantics, and synchronization is entirely the responsibility of the cooperating
processes.
Another important difference is the visibility of changes. If several processes have shared
mappings to a file, changes made by one are immediately visible to all others. This is starkly differ-
ent from the traditional model, where the other processes must issue another read to see these
changes. With mapped access, a process sees the contents of a page as they are at the time of access,
not at the time the mapping was created.
These issues, however, relate more to an application's decision to use mapped access to
files. They do not detract from the merits and desirability of this mechanism. In fact, 4.3 BSD
specified an interface for the mmap system call to perform file mapping, but did not provide an im-
plementation. The next section describes the semantics of the mmap interface.
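The call display itself did not survive in this copy of the text; it has the SVR4 form paddr = mmap(addr, len, prot, flags, fd, offset). The short user-level sketch below (file name taken from the command line, error handling abbreviated) shows the interface in use; it is an illustration rather than code from the original.

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 1;

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    /* Map the whole file read-only and private, letting the system
     * choose paddr (no MAP_FIXED, addr hint of 0).                  */
    char *paddr = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (paddr == MAP_FAILED)
        return 1;

    /* File data is now read with ordinary memory accesses; no
     * further read or lseek system calls are needed.             */
    printf("first byte of %s: %c\n", argv[1], paddr[0]);

    munmap(paddr, st.st_size);
    close(fd);
    return 0;
}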
This establishes a mapping between the byte range2 [offset, offset+len) in the file represented
by fd, and the address range [paddr, paddr+len) in the calling process. The flags include the
mapping type, which may be MAP_SHARED or MAP_PRIVATE. The caller may set prot to a combination of
PROT_READ, PROT_WRITE, and PROT_EXECUTE. Some systems, whose hardware does not support
separate execute permissions, equate PROT_EXECUTE to PROT_READ.
The system chooses a suitable value for paddr. paddr will never be 0, and the mapping will
not overlay existing mappings. mmap ignores the addr parameter unless the caller specifies the
MAP_FIXED flag.3 In that case, paddr must be exactly the same as addr. If addr is unsuitable (it ei-
ther is not page-aligned or does not fall in the range of valid user addresses), mmap returns an error.
The use of MAP_FIXED is discouraged, for it results in non-portable code.
mmap works on whole pages. This requires that offset be page-aligned. If MAP_FIXED has
been specified, addr should also be page-aligned. If len is not a multiple of the page size, it will be
rounded upward to make it so.
The mapping remains in effect until it is unmapped by a call to munmap, or until the address
range is remapped to another file by calling mmap with the MAP_RENAME flag. Protections may be
changed on a per-page basis by mprotect.
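The displays for these two calls are likewise not reproduced here. As a rough sketch, assuming addr and len describe an existing mapping:

#include <stddef.h>
#include <sys/mman.h>

/* Make an existing mapping read-only, then remove it entirely.
 * addr and len are assumed to come from an earlier mmap call.  */
static void retire_mapping(void *addr, size_t len)
{
    /* Per-page protection change over [addr, addr+len). */
    mprotect(addr, len, PROT_READ);

    /* Remove the mapping; any further access to the range faults. */
    munmap(addr, len);
}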
2 We follow the standard convention for specifying ranges: square brackets indicate inclusive boundaries, while paren-
theses indicate exclusive boundaries.
3 This is true of current implementations. The semantics of the call do specify that mmap will use addr as a hint if
MAP_FIXED is not set.
14.4 Fundamental Abstractions
Note: This chapter uses the word object in two different ways. A
memory object represents a mapping, while a data object represents
a backing store item, such as a file. The meaning is usually clear from
the context; where it might be ambiguous, we specifically use the
terms memory object or data object.
The VM architecture is object-oriented. Section 8.6.2 explains the basic concepts of object-
oriented design as they apply to UNIX systems. Using this terminology, the common interface to
the memory object constitutes an abstract base class [Elli 90]. Each type of memory object
(differentiated by its backing store type) is a derived class, or subclass, of the base class. Every
specific mapping is an instance, or object, of the corresponding subclass.
The address space of a process comprises a set of mappings to different data objects. The
only valid addresses are those that are mapped to an object. The object provides a persistent backing
store for the pages mapped to it. The mapping renders the object directly addressable by the process.
The mapped object itself is neither aware of nor affected by the mapping.
The file system provides the name space for memory objects and mechanisms to access their
data. The vnode layer allows the VM subsystem to interact with the file system. The relationship
between memory objects and vnodes is many-to-one. Each named memory object is associated with
a unique vnode, but a single vnode may be associated with several memory objects. Some memory
objects, such as user stacks, are not associated with files and do not have names. They are repre-
sented by the anonymous object.
Physical memory serves as a cache for data from the mapped objects. The kernel attempts to
hold the most useful pages in physical memory, so as to minimize paging activity.
The memory is page-based, and the page is the smallest unit of allocation, protection, ad-
dress translation, and mapping. The address space, in this context, is merely an array of pages. The
page is a property of the address space, not of the data object. Abstractions such as regions may be
implemented at a higher level, using the page as a fundamental primitive.
The VM architecture is independent of UNIX, and all UNIX semantics such as text, data,
and stack regions are provided by a layer above the basic VM system. This allows future non-UNIX
operating systems to use the VM code. To make the code portable to other hardware architectures,
VM relegates all machine dependencies to a separate hardware address translation (HAT) layer,
which is accessed via a well-defined interface.
Whenever possible, the kernel uses copy-on-write to reduce in-memory copy operations and
the number of physical copies of a page in memory. This technique is necessary when processes
have private mappings to an object, since any modifications must affect neither the underlying data
object nor other processes sharing the page.
[Figure: fields of the page structure: vnode pointer, offset in the vnode, hash chain pointers, pointers for the vnode page list, pointers for the free list or I/O list, flags, and HAT-related information.]
The page structure contains several pairs of list pointers. To find a physical page quickly, pages are hashed based on the vnode and off-
set, and each page is on one of the hash chains. Each vnode also maintains a list of all pages of the
object that are currently in physical memory, using a second pair of pointers in the page structure.
This list is used by routines that must operate on all in-memory pages of an object. For instance, if a
file is deleted, the kernel must invalidate all in-memory pages of the file. The final pair of pointers
keeps the page either on a free page list or on a list of pages waiting to be written to disk. The page
cannot be on both lists at the same time.
The page structure also maintains a reference count of the number of processes sharing this
page using copy-on-write semantics. There are flags for synchronization (locked, wanted, in-transit)
and copies of modified and referenced bits (from the HAT information). There is also a HAT-
dependent field, which is used to locate all translations for this page (Section 14.4.5).
The page structure has a low-level interface comprising routines that find a page given the
vnode and offset, move it onto and off the hash queues and free list, and synchronize access to it.
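As a rough summary of this description, the declaration below sketches the shape of the page structure; the field names are illustrative and not the literal SVR4 declaration.

#include <sys/types.h>

struct vnode;                  /* defined by the file system layer */

struct page {
    struct vnode *p_vnode;     /* object to which the page belongs           */
    u_int         p_offset;    /* offset of the page within that vnode       */

    struct page  *p_hash;      /* hash chain, keyed by <vnode, offset>       */
    struct page  *p_vpnext;    /* list of in-memory pages of the vnode       */
    struct page  *p_vpprev;
    struct page  *p_next;      /* free list OR pending-I/O list              */
    struct page  *p_prev;

    u_short       p_cowcnt;    /* processes sharing the page copy-on-write   */
    u_short       p_flags;     /* locked, wanted, in-transit, ...            */
    u_char        p_mod;       /* copy of HAT modified bit                   */
    u_char        p_ref;       /* copy of HAT referenced bit                 */

    void         *p_hatinfo;   /* HAT-dependent: locates all translations    */
};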
Figure 14-5 provides a high-level description of the data structures that describe the virtual address
space of a process. The address space (struct as) is the primary per-process abstraction and pro-
vides a high-level interface to the process address space. The proc structure for each process con-
tains a pointer to its as structure. An as contains the header for a linked list of the mappings for the
process, each of which is described by a seg structure. The mappings represent non-overlapping,
page-aligned address ranges and are sorted by their base address. The hat structure is also part of
the as structure. The as also contains a hint to the last segment that had a page fault, as well as other
information such as synchronization flags and the sizes of the address space and resident set.
The as layer supports two basic sets of operations. The first consists of operations performed
on the entire address space, including:
• as_alloc(), used by fork and exec to allocate a new address space.
• as_free(), called by exec and exit to release an address space.
• as_dup(), used by fork to duplicate an address space.
[Figure 14-5. The proc structure's p_as field points to the struct as, which holds the segment list, the hint, and the struct hat; each struct seg on the list points to its private segvn_data and to the seg_vn entry in the struct segops vector, mapping the text, data, stack, and u area of the process.]
The second set of functions operate on a range of pages within the as. They include the fol-
lowing:
• as_map() and as_unmap(), to map and unmap memory objects into the as (called by
mmap, munmap, and several other routines).
• as_setprot() and as_checkprot(), called by mprotect to set and check protections on
parts of the as.
• as_fault(), the starting point for page fault handling.
• as_faulta(), used for anticipatory paging (fault ahead).
Many of these functions are implemented by determining the mapping or mappings affected
and calling lower-level functions in the mappings interface, which is described in the next section.
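The prototypes below summarize the as-layer interface just described; the argument lists are simplified and do not reproduce the exact SVR4 signatures.

#include <sys/types.h>

struct as;

/* Operations on an entire address space. */
struct as *as_alloc(void);                    /* fork, exec                */
void       as_free(struct as *as);            /* exec, exit                */
struct as *as_dup(struct as *as);             /* fork                      */

/* Operations on a range of pages within an address space. */
int as_map(struct as *as, caddr_t addr, size_t len, void *createargs);
int as_unmap(struct as *as, caddr_t addr, size_t len);
int as_setprot(struct as *as, caddr_t addr, size_t len, u_int prot);
int as_checkprot(struct as *as, caddr_t addr, size_t len, u_int prot);
int as_fault(struct as *as, caddr_t addr, u_int type);   /* fault handling */
int as_faulta(struct as *as, caddr_t addr, size_t len);  /* fault ahead    */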
A segment describes one such mapping: a memory object, that is, a contiguous range of virtual addresses of a
process mapped to a contiguous byte range in a data object, with the same type of mapping (shared or private).
All segments present an identical interface to the rest of the VM subsystem. In object-
oriented terminology, this interface defines an abstract base class. There are several types of seg-
ments, and each specific segment type is a derived class, or subclass, of the base class. The VM
system also provides a set of generic functions to allocate and free a segment, attach it to the address
space, and unmap it.
The seg structure contains the public, or type-independent, fields of the segment, such as the
base and size of the address range it maps, and a pointer to the as structure to which it belongs. All
segments of an address space are maintained on a doubly linked list sorted by base address
(segments may not overlap). The as structure has a pointer to the first segment, and each seg struc-
ture has forward and backward pointers to keep it on this list.
The seg structure has a pointer to a seg_ops vector, which is a set of virtual functions that
define the type-independent interface of the segment class. Each specific subclass, that is, each
segment type, must implement the operations in this vector. The seg structure contains a pointer
(s _data) to a type-dependent data structure, which holds private, type-dependent data of this seg-
ment. This structure is opaque to the rest of the kernel and is only used by the type-dependent func-
tions that implement the segment operations.
The operations defined in seg_ops include the following:
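The list itself did not survive in this copy of the text. The vector below is a representative sketch, based on the operations discussed elsewhere in this chapter; it is not the complete SVR4 declaration.

#include <sys/types.h>

struct seg;

struct seg_ops {
    int  (*dup)(struct seg *seg, struct seg *newseg);  /* duplicate on fork     */
    int  (*unmap)(struct seg *seg, caddr_t addr, size_t len);
    void (*free)(struct seg *seg);
    int  (*fault)(struct seg *seg, caddr_t addr);      /* resolve a page fault  */
    int  (*faulta)(struct seg *seg, caddr_t addr);     /* fault ahead           */
    int  (*setprot)(struct seg *seg, caddr_t addr, size_t len, u_int prot);
    int  (*checkprot)(struct seg *seg, caddr_t addr, size_t len, u_int prot);
    int  (*sync)(struct seg *seg, caddr_t addr, size_t len); /* write back      */
    int  (*swapout)(struct seg *seg);                  /* used by the swapper   */
};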
Each segment also has a create routine. Although this routine is type-dependent, it is not ac-
cessed via the seg_ops vector, because it must be called before the segment, and hence the seg_ops
vector, is initialized. The kernel knows the names and calling syntax of all create routines (the ar-
guments to the create routine may also differ for each segment type) and calls the appropriate one
for the segment it wants to create.
Note the distinction between a virtual function and its specific implementation by a subclass.
For example, faulta is a virtual function, defining a generic operation on a segment. There is no ker-
nel function called faulta(). Each subclass or segment type has a different function to implement
this operation-for example, the segment type seg_vn provides the function segvn_faulta().
Because they have no permanent home in a file, anonymous pages can be discarded when the process terminates or unmaps the page. Meanwhile,
they may be saved on the swap device if necessary.
Anonymous pages are widely used by all segments that support private mappings. For ex-
ample, although initialized data pages are initially mapped to the executable file, they become
anonymous pages when first modified. The swap layer provides the backing store for anonymous
pages.
A related but distinct concept is that of the anonymous object. There is a single anonymous
object in the entire system. It is represented by the NULL vnode pointer (or in some implementa-
tions, by the file /dev/zero) and is the source of all zero-filled pages. The uninitialized data and
stack regions of the process are MAP_PRIVATE mappings to the anonymous object, while shared
memory regions are MAP_SHARED mappings to it.
When a page mapped to the anonymous object is first accessed, it becomes an anonymous
page, regardless of whether the mapping was shared or private. This is because the anonymous ob-
ject does not provide backup storage for its pages, so the kernel must save them to the swap device.
The struct anon represents an anonymous page. It is opaque to the other components of the
VM system and is manipulated solely by a procedural interface. Because an anonymous page may
be shared, the anon structures are reference-counted. A segment that has an anonymous page merely
holds a reference to the anon structure for that page. If the mapping is private, then each segment
holds a separate reference to that page. If the mapping is shared, then the segments share the refer-
ence itself. Sharing of anonymous pages is discussed further in Section 14.7.4.
The anon layer exports a procedural interface to the rest of VM. It includes the following
functions:
• anon_dup() duplicates references to a set of anonymous pages. This increments the refer-
ence count of each anon structure in the set.
• anon_free() releases references to a set of anonymous pages, decrementing reference
counts on its anon structures. If the count falls to zero, it discards the page and releases the
anon structure.
• anon_private() makes a private copy of a page and associates a new anon structure with it.
• anon_zero() creates a zero-filled page and associates an anon structure with it.
• anon_getpage() resolves a fault to an anonymous page, reading it back from swap if
necessary.
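In outline, and with simplified argument lists that do not reproduce the exact SVR4 signatures, this interface looks like the following:

#include <sys/types.h>

struct anon;
struct page;

void         anon_dup(struct anon **oldrefs, struct anon **newrefs, size_t npages);
void         anon_free(struct anon **refs, size_t npages);
struct page *anon_private(struct anon **app, struct page *oldpage);
struct page *anon_zero(struct anon **app);
struct page *anon_getpage(struct anon *ap);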
The HAT layer's primary data structure is the struct hat, which is part of the as structure
of each process. While this positioning underscores the one-to-one relationship between an address
space and its set of hardware mappings, the HAT layer is opaque to the as layer and the rest of the
VM system. It is accessed through a procedural interface, which includes three types of functions:
• Operations on the HAT layer itself, such as:
• hat_alloc() and hat_free(), to allocate and free the hat structures.
• hat_dup(), to duplicate the translations during fork.
• hat_swapin() and hat_swapout(), to rebuild and release the HAT information
when a process is swapped in or out.
• Operations on a range of pages of a process. If other processes share these pages, their
translations are unaffected by these operations. They include:
• hat_chgprot() to change protections.
• hat_unload() to unload or invalidate the translations and flush the corresponding
TLB entries.
• hat_memload() and hat_devload() load the translation for a single page. The latter
is used by seg_dev to load translations to device pages.
• Operations on all translations of a given page. A page can be shared by several processes,
each having its own translation to it. These operations include:
• hat_pageunload() unloads all translations for a given page. This involves opera-
tions such as invalidating its PTE and flushing its TLB entry.
• hat_pagesync() updates modified and referenced bits in all translations for the
page, using the values in its page structure.
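The prototypes below summarize this interface in simplified form; the real signatures are machine-dependent and differ in detail.

#include <sys/types.h>

struct as;
struct page;

/* Operations on the HAT layer itself. */
void hat_alloc(struct as *as);
void hat_free(struct as *as);
void hat_dup(struct as *as, struct as *newas);
void hat_swapin(struct as *as);
void hat_swapout(struct as *as);

/* Operations on a range of pages of one process. */
void hat_chgprot(struct as *as, caddr_t addr, size_t len, u_int prot);
void hat_unload(struct as *as, caddr_t addr, size_t len);
void hat_memload(struct as *as, caddr_t addr, struct page *pp, u_int prot);
void hat_devload(struct as *as, caddr_t addr, u_int pfnum, u_int prot);

/* Operations on all translations of a given page. */
void hat_pageunload(struct page *pp);
void hat_pagesync(struct page *pp);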
All information managed by the HAT layer is redundant. It may be discarded at will and re-
built from the information available in the machine-independent layer. The interface makes no as-
sumption about what data is retained by the HAT layer and for how long. The HAT layer is free to
purge any translation at any time-if a fault occurs on that address, the machine-independent layer
will simply ask the HAT layer to reload the translation. Of course, rebuilding the HAT information
is expensive, and the HAT layer avoids doing this as much as possible.
The hat structure is highly machine-dependent. It may contain pointers to page tables and
other related information. To support operations such as hat _pageun load(), the HAT layer must
be able to find all translations to a page, including those belonging to other processes sharing the
page. To implement this, the HAT layer chains all translations for a shared page on a linked list, and
stores a pointer to this list in the HAT-dependent field of the page structure (Figure 14-6).
The reference port for SVR4 is on the Intel 80x86 architecture [Bala 92]. Its HAT layer uses
a data structure called a mapping chunk to monitor all translations for a physical page. Each active
page table entry has a corresponding mapping chunk entry. Because non-active translations do not
have mapping chunk entries, the size of the mapping chunk is much less than that of the page table.
Each physical page has a linked list of mapping chunk entries, one for each active translation to the
page. The struct page holds a pointer to this mapping chain.
[Figure 14-6. All translations for a shared physical page (one per hat structure) are chained together, and the struct page points to this chain.]
14.5 Segment Drivers
14.5.1 seg_vn
The vnode segment-also known as seg_vn-maps user addresses to regular files and to the anony-
mous object. The latter is represented by the NULL vnode pointer, or by the file /dev/zero, and
maps zero-fill regions such as uninitialized (bss) data and the user stack. Initial page faults to such
pages are handled by returning a zero-filled page. The text and initialized data regions are mapped
to the executable file using the seg_vn driver. Additional seg_vn segments may be created to handle
shared memory and files explicitly mapped with the mmap system call.
Figure 14-7 describes the data structures associated with vnode segments. Each vnode seg-
ment maintains a private data structure (segvn_data) holding additional information used by the
driver, such as the <vnode, offset> of the mapped object, the mapping type, the protections, and
references to the segment's anonymous pages.
14.5.2 seg_map
UNIX files may be accessed in three ways-demand paging of executable files, direct access to
mmap'ed files, and read or write calls to open files. The first case is similar to the second, since the
[Figure 14-7. Data structures for vnode segments: the <vnode, offset> of the mapped file, the resident pages, the segment's anonymous pages, and per-page protections.]
kernel maps the text and data segments to the executable file, and the subsequent access is analo-
gous to that of mmap'ed files. In both cases, the kernel uses the memory subsystem and page faults
to access the data.
Treating read and write system calls differently from mmap'ed access can lead to inconsis-
tency. Traditionally, the read system call reads the data from the disk into the block buffer cache,
and from there to the process address space. If another process mmaps the same file, there will be
two copies of the data in memory--one in the buffer cache and another in a physical page mapped
into the address space of the second process. If both processes modify the file, the results will be
unpredictable.
To avoid this problem, the VM system unifies the treatment of all three access methods.
When a user issues a read system call on an open file, the kernel first maps the required pages of the
file into its own virtual address space using the seg_map driver, then copies the data to the process's
address space. The seg_map driver manages its own virtual address space as a cache, so only re-
cently accessed mappings are in memory. This allows the VM system to subsume the role of the
buffer cache, which now becomes largely redundant. It also allows full synchronization of all types
of access to a file.
There is only one seg_map segment in the system. It belongs to the kernel and is created
during system initialization. The driver provides two additional functions-segmap_getmap() to
map part of a vnode to a virtual address, and segmap_release() to release such a mapping, writing
data back to disk if modified. The role of these functions is similar to that of the traditional bread()
and brelse()/bwrite() functions of the buffer cache, and is further described in Section 14.8. The
seg_map driver is an optimized version of the vnode driver, providing quick but transitory map-
pings of files to the kernel.
14.5.3 seg_dev
The seg_dev driver maps character devices that implement an mmap interface. It is commonly used
to map frame buffers, physical memory, kernel virtual memory, and bus memory. It only supports
shared mappings.
14.5.4 seg_kmem
This driver maps portions of kernel address space such as kernel text, data, and bss regions and dy-
namically allocated kernel memory. These mappings are non-paged, and their address translations
do not change unless the kernel unmaps the object (for example, when releasing dynamically allo-
cated memory).
14.5.5 seg_kp
The seg_kp driver allocates thread, kernel stack, and light-weight process (lwp) structures for mul-
tithreaded implementations such as Solaris 2.x (see Section 3.6). These structures may be from
swappable or non-swappable regions of memory. seg_kp also allocates red zones to prevent kernel
stack overflow. The red zone is a single write-protected page at the end of a stack. Any attempt to
write to this page results in a protection fault, thus protecting neighboring pages from corruption.
14.6 The Swap Layer
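The call display did not survive in this copy of the text; the system call has the form swapctl(cmd, arg). A hedged user-level sketch follows, in which the device path is hypothetical and the swapres field names follow the usual SVR4/Solaris layout.

#include <sys/types.h>
#include <sys/swap.h>     /* swapctl, SC_ADD, struct swapres */
#include <stdio.h>

int add_swap_area(void)
{
    struct swapres sr;

    sr.sr_name   = "/dev/dsk/c0t0d0s1";   /* hypothetical swap device   */
    sr.sr_start  = 0;                     /* start of the swap area     */
    sr.sr_length = 262144;                /* size of the area           */

    if (swapctl(SC_ADD, &sr) < 0) {
        perror("swapctl");
        return -1;
    }
    return 0;
}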
where cmd is SC_ADD or SC_REMOVE,4 and arg is a pointer to a swapres structure. This structure
contains the pathname of the swap file (for a local swap partition, this would be the device special
file) and the location and size of the swap area in this file.
The kernel sets up a swapinfo structure for each swap device and adds it to a linked list
(Figure 14-8). It also allocates an array of anon structures, with one element for each page on the
device. The swapinfo structure contains the vnode pointer and starting offset of the swap area, as
well as pointers to the beginning and end of its anon array. Free anon structures in the array are
linked together, and the swapinfo structure has a pointer to this list. Initially, all elements are free.
Segments must both reserve and allocate swap space. When the kernel creates a segment that
potentially will require swap space (typically, all writable private mappings), it reserves as much
space as necessary (usually, equal to the size of the segment). The swap layer monitors the total
available swap space and reserves the required amount from this pool. This does not set aside spe-
cific swap pages; it merely ensures that the reserved space will be available if and when needed.
This reservation policy is conservative. It requires that processes always reserve backing
store for all anonymous memory, even though they may never use all of it. If the system will be
used for large applications, it needs a large swap device. On the other hand, the policy guarantees
that failures due to memory shortage only occur synchronously, that is, during calls such as exec
and mmap. Once a process has set up its address space, it will always have the swap space it needs,
unless it attempts to grow.
A segment allocates swap space on a per-page basis, whenever it creates a new anonymous
page. Allocations may only be made against a previous reservation. The swap_alloc() routine al-
locates a free swap page and associates it with the anonymous page through an anon structure. It
attempts to distribute the load evenly on the swap devices, by allocating from a different device af-
ter every few pages.
In SVR4, the position of the anon structure in the anon array equals the position of the swap
page on the corresponding swap device. swap_alloc() returns a pointer to the anon structure. This
pointer serves as the name of the anonymous page, since it can be used to locate the page on swap.
4 There are two other commands-SC_LIST and SC_GETNSWP-for administrative purposes.
[Figure 14-8. Each struct swapinfo records the vnode and offset of its swap area, the start and end of its anon array, a free list of anon structures, and a pointer to the next swapinfo; anon entries correspond to resident, free, or outswapped pages on the swap device.]
14.7 VM Operations
Having described the data structures and interfaces that encapsulate the major VM abstractions, we
now show how these components interact to provide the memory management functionality.
5 VOP_MAP, along with VOP_GETPAGE and VOP_PUTPAGE, was added to the vnodeops vector to support the VM system.
These are virtual functions, as described in Section 8.6.2. The actual function invoked depends on the filesystem to
which the vnode belongs.
6 For ELF format files, exec maps in the interpreter that is specified in the program header. The interpreter, in turn,
maps in the actual program.
Each segment that creates anonymous pages has an anon_map structure, which contains a
pointer to, and the size of, an anon reference array, which has one entry for each page of the seg-
ment. Each entry is a reference to the anon structure for the corresponding page, or NULL if that
page is not yet anonymous. The anon structure locates the page in physical memory or on the swap
device and contains a count of the number of references to it. When the reference count falls to zero,
the page and the anon structure may be deallocated.
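The declarations below sketch these structures as just described; the names are illustrative, not the literal SVR4 declarations.

#include <sys/types.h>

struct page;

struct anon {
    struct page *an_page;     /* in-memory copy of the page, if resident    */
    int          an_refcnt;   /* references from segments (COW sharing)     */
    /* In SVR4 the swap location is implied by this structure's position
     * in the per-device anon array; Solaris swapfs instead records an
     * explicit <vnode, offset> name here (Section 14.9).                   */
};

struct anon_map {
    size_t        am_size;    /* size of the range covered                  */
    struct anon **am_anon;    /* anon reference array: one entry per page,  */
                              /* NULL if that page is not (yet) anonymous   */
};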
These data structures are not created along with the segment. Rather, they are created and
initialized when needed, that is, on the first write to a page in that segment. This lazy approach is
beneficial, because many vnode segments may never create anonymous pages (text pages, for in-
stance, are never modified unless the program is being debugged and the debugger deposits a break-
point). By delaying the work, the kernel may avoid it altogether.
The first attempt to write to a privately mapped page causes a protection fault. The fault
handler recognizes the situation, for the mapping type is MAP_PRIVATE and the segment protections
are not read-only (as opposed to protections in the hardware address translation for the page, which
have been deliberately set to read-only so as to trap this write attempt). It allocates an anon struc-
ture, thus allocating swap space (since each anon structure corresponds to a unique page on the
swap device). It creates a reference to the anon structure (Section 7.9 explains object references in
more detail) and stores it in the corresponding element of the anon reference array.
The handler then makes a new copy of the page, using a newly allocated physical page. It
stores the pointer to the page structure for this page in the anon structure. Finally, the handler calls
the HAT layer to load a new translation for the page, which is write-enabled, and translates to the
new copy. All further modifications thus occur to this private copy of the page.
Some special cases require additional processing. If this is the first anonymous page created
for the segment, the handler allocates and initializes the anon_map and the anon reference array.
Also, if the faulting page was not in memory, the handler reads it in from its backing storage.
The kernel may eventually move this page to the swap device. This may cause the process to
fault on it again. This time, the fault handler discovers that the segment has a reference to the anon
structure for this page and uses it to retrieve the page from the swap device.
When a process forks, all its anonymous pages are shared copy-on-write with the child. The
parent and the child have their own anon_map and anon reference array, but they refer to the same
anon pages. This is described further in the next section.
[Figure 14-10. Sharing anonymous pages (part 1): processes A and B each have their own anon reference array, but corresponding entries refer to the same anon structures, backed by the swap device.]
In Figure 14-11, the child has modified pages 0 and 1. Page 0 was initially not an anonymous
page. When the child modified it, the kernel allocated a new anon
structure and physical page, and added a reference to it in the child's anon reference array. Because
the parent has not modified that page, it is still mapped to the vnode in the parent's address space.
Page 1, in contrast, was an anonymous page shared by both parent and child. It therefore had
a reference count of 2. Since the mapping was private, the modifications by the child could not be
applied to the shared copy. Thus the kernel made a new copy of the page and allocated a new anon
structure for it. The child released its reference on the original anon structure and obtained a refer-
ence to the new one. As a result, the parent and the child now reference different anon structures for
that page, and each has a reference count of 1.
This shows that the sharing is indeed on a per-page basis. At this point, as shown in Figure
14-11, the parent and the child are sharing the anonymous pages for pages 2 and 4 of the segment,
but not for pages 0, 1, or 3.
Figure 14-11. Sharing anonymous pages (part 2).
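The per-page copy-on-write behavior just described can be observed from user level. The sketch below maps a private page from /dev/zero (the anonymous object); after the fork, the child's write causes a copy-on-write fault, and the parent's view of the page is unchanged. Error handling is omitted.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int   fd = open("/dev/zero", O_RDWR);        /* the anonymous object      */
    char *p  = mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

    strcpy(p, "parent data");                    /* page becomes anonymous    */

    if (fork() == 0) {
        strcpy(p, "child data");                 /* copy-on-write fault: the  */
        _exit(0);                                /* child gets a private copy */
    }
    wait(NULL);
    printf("parent still sees: %s\n", p);        /* prints "parent data"      */

    munmap(p, 4096);
    close(fd);
    return 0;
}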
Case 3(a) involves a private mapping to a file, where modifications must be made to a pri-
vate copy of the page. Case 3(b) deals with copy-on-write sharing of anonymous pages, as occurs
after a fork. In either case, the fault handler calls anon_private() to make a private copy of the
page.
Finally, the handler calls hat_memload() to load the new translation for the page into the
hardware translation structures (page tables and TLBs).
Swapping
When a UNIX system boots, the initialization code creates the swapper process. This is a system
process with no user context. Its PID is zero, and it executes a routine called sched(). It executes
this routine forever (or until the system crashes or is shut down) and is normally asleep. It is awak-
ened once every second, and also if certain other events occur.
When the swapper wakes up, it checks the amount of free memory and decides whether to
perform any swapping activity. If the free memory is less than a tunable parameter named
t_gpgslo, the swapper swaps out a process. To choose a process to swap out, it invokes the
scheduling priority class-dependent CL_SWAPOUT operation of each class, which returns a candidate
from that priority class (see Section 5.5). Similarly, the priority class routine CL_SWAPIN chooses the
process to be swapped back in when sufficient memory is available.
The swapper swaps out a process by calling as_swapout(), which cycles through each
segment and calls the swapout operation for each segment. The segment driver in turn must write
out all in-memory pages of the segment to the backing store. Most segments are of type seg_vn and
use the segvn_swapout() routine to implement this operation. Finally, the swapper swaps out the u
area of the process.
To swap in a process, the swapper simply swaps in its u area. When the process eventually
runs, it will fault in other pages as needed.
Because VOP_GETPAGE handles all file accesses, it is able to perform optimizations such as
read-ahead. Moreover, since each file system knows its optimal transfer size and disk layout details,
it may perform an optimization called klustering,7 wherein it reads additional physically adjacent
pages in the same disk I/O operation when suitable. It may also perform vnode operations such as
updating the access or modify times of the underlying inode.
The VOP_PUTPAGE operation is called to flush potentially dirty pages back to the file. Its ar-
guments include a flag that specifies whether the write-back is synchronous or asynchronous. When
VOP_PUTPAGE is called by the pagedaemon to free some memory, a deadlock can occur.
VOP_PUTPAGE needs to determine the physical location of the pages on disk. To do so, it may have
to read in an indirect block (see Section 9.2.2), which it cannot do because no memory is available.
One way to avoid this deadlock is to store the translation information from the indirect
blocks in memory as long as a file is mapped. This also improves performance of both
VOP_GETPAGE and VOP_PUTPAGE, since they avoid having to read in the indirect blocks from disk.
However, locking the information in memory incurs considerable space overhead.
[Figure: Unified file access. Data access (a page fault) from a user process goes through as_fault in the as layer and segvn_fault in seg_vn; file access (read, write) goes through the file subsystem's high-level VOP_READ and VOP_WRITE vnode operations and segmap_fault in seg_map. Both paths reach the disk through the low-level VOP_GETPAGE and VOP_PUTPAGE vnode operations.]
7 This should not be confused with clustering, which refers to the logical grouping of adjacent physical pages in memory.
The read and write system calls copy data to and from the user's address space. The seg_map driver reads file system blocks into paged memory; the
buffer cache is no longer required for this purpose. This extends the mapped file access semantics to
traditional access methods. Such a unified treatment of files eliminates consistency problems that
may occur when the same file is accessed in different ways at the same time.
The seg_map driver is simply an optimized version of the vnode driver, providing quick but
transitory mappings of files to the kernel. Figure 14-14 describes the data structures used. The pri-
vate data (struct segmap_data) for the segment consists of pointers to an array of struct smap
entries, as well as to a hash table and to a list of free smap entries. Each smap entry contains pointers
to keep it on the hash queue and free list, the vnode pointer and offset for the page it represents, and
a reference count that monitors how many processes are currently accessing the entry.
Each smap represents one page of the segment. The kernel virtual address of the page is de-
termined by

    addr = base + (entrynum * MAXBSIZE)
where base is the starting address of the seg_map segment, entrynum is the index of the smap in the
array, and MAXBSIZE is the (machine-dependent) size of each page. Because several file systems
with different block sizes may coexist on the machine, one page in this segment may correspond to
one file system block, several blocks, or part of a block.
When a process issues a read system call, the file system determines the vnode and offset for
the data, and calls segmap_getmap() to establish a mapping for the page. segmap_getmap() checks
the hash queues to find if a mapping already exists; if so, it increments the reference count for the
smap. If there is no mapping, it allocates a free smap and stores the vnode pointer and offset in it.
Finally, it returns the virtual address of the page represented by the smap.
The file system next calls uiomove() to copy the data from the page to the user address
space. If the page is not in physical memory, or if its translation is not loaded in the HAT, a page
fault occurs, and the fault handler calls segmap_fault() to handle it. segmap_fault() calls the
VOP_GETPAGE operation of the file system to bring the page into memory and then calls
hat_memload() to load a translation for it.
After copying the data to user space, the file system calls segmap_release() to return the
smap entry to the free list (unless it has another reference). In case of a write system call,
segmap_release() also writes the page back to the file, usually synchronously. The page remains
mapped to the same vnode and offset until it is reassigned. This caches the translation, so the page
may be found quickly the next time.
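In outline, the read path through seg_map looks like the sketch below. The prototypes are simplified; segmap_getmap(), uiomove(), and segmap_release() are the operations named in the text, while the surrounding details are illustrative.

#include <sys/types.h>

struct vnode;
struct uio;

caddr_t segmap_getmap(struct vnode *vp, u_int off);
int     uiomove(caddr_t addr, size_t n, int rw, struct uio *uio);
int     segmap_release(caddr_t addr, int flags);

int fs_read_one_page(struct vnode *vp, u_int off, size_t n, struct uio *uio)
{
    /* Map the page of <vp, off> into the kernel's seg_map segment,
     * reusing a cached smap slot if one already maps it.            */
    caddr_t kaddr = segmap_getmap(vp, off);

    /* Copy to the user's buffer. A fault on kaddr is resolved by
     * segmap_fault(), which calls VOP_GETPAGE and hat_memload().  */
    int error = uiomove(kaddr, n, 0 /* read */, uio);

    /* Drop the reference; for a write this would also flush the page. */
    segmap_release(kaddr, 0);
    return error;
}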
This may appear to be a complex set of operations merely to read a page from a file. How-
ever, it is necessary, because the role of the seg_map functions is to integrate file access through
read and write system calls with mapped access. This ensures that even when a file is accessed si-
multaneously through both methods, only one copy of the page is in memory, and the behavior is
consistent.
The VOP_GETPAGE operation performs the real work of fetching the page from the file. The
page may already be in memory, as the seg_map hash table finds only those pages that are mapped
into the seg_map segment (that is, those pages that have been recently accessed through read and
write system calls). VOP_GETPAGE retrieves the page, either from memory or from the file, as de-
scribed in Section 14.8.1.
An important requirement of unifying file access is to have a single name space for file and
process pages. Traditional UNIX implementations identify a file system buffer by the <device,
block number> pair. The VM architecture uses the <vnode, offset> pair to define its name space,
and a global hash table to find a page by name. When disk access is needed, VM calls the file sys-
tem to translate the name to the on-disk location of the page.
The buffer cache is not completely eliminated. We still need it to cache file system metadata
(superblocks, inodes, indirect blocks and directories), which cannot be represented in the <vnode,
offset> name space. A small buffer cache is retained exclusively for these pages.
The page size may not be the same as the file system block size. Indeed, there may be several differ-
ent file systems with different block sizes on the same machine. The VM system deals only with
pages; it asks the file system to bring pages into memory and to write them out to disk. It is the
vnode object manager's (file system's) task to read pages from disk. This may involve reading one
block, several blocks, or part of a block. If the file does not end on a page boundary, the vnode man-
ager must zero out the remainder of the last page.
The VM system deals with a name space defined by the <vnode, offset> pair. Thus, for each
disk access, the file system must translate the offsets to the physical block number on disk. If the
translation information (indirect blocks, etc.) is not in memory, it must be read from disk. When this
occurs as part of a pageout operation, VOP_PUTPAGE must be careful to avoid deadlocks. Deadlock
may occur if there is no free memory to read in the indirect block that has information about where
the dirty page must be written.
8 Reserving swap space does not set aside a specific area on a swap device; it merely guarantees that the space will be
available somewhere when required.
14.9 Virtual Swap Space in Solaris
Each anonymous page is now described by the vnode of the virtual (swapfs) file and the offset of the page in that file. Instead of per-device
anon arrays, swap_alloc() dynamically allocates anon structures. It explicitly stores the name
( <vnode, offset> pair) of the page in the anon structure, instead of inferring it from the address of
the structure. Note that to this point, the routine has not allocated and bound any physical swap
space to this page (Figure 14-15(a)).
The page needs physical swap space only when it must be paged out. At that time, the
pageout daemon obtains the name of the page and calls the VOP_PUTPAGE operator of the corre-
sponding vnode. For an anonymous page, this invokes the corresponding swapfs routine, with the
vnode and offset in the virtual file as arguments. This routine performs the following actions:
1. Calls the swap layer to allocate backing store for this page.
2. Calls the VOP_PUTPAGE operator of the physical swap device to write out the page.
3. Writes the new name of the page (the vnode and offset of the physical swap page) into
both the anon and the page structures for the page (Figure 14-15(b)).
Eventually the page will be freed from main memory. Later, it may be read back from swap
if needed, using the new information in the anon structure (Figure 14-15(c)).
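The sketch below restates these three steps as code; the structure layouts and helper names are approximations for clarity, not actual Solaris declarations.

#include <sys/types.h>

struct vnode;
struct swap_name { struct vnode *vp; u_int off; };   /* <vnode, offset> name */

void swap_alloc_phys(struct vnode **vpp, u_int *offp);   /* step 1 */
void phys_putpage(struct vnode *vp, u_int off);          /* step 2 */

void swapfs_putpage(struct swap_name *anon_name, struct swap_name *page_name)
{
    struct vnode *physvp;
    u_int         physoff;

    swap_alloc_phys(&physvp, &physoff);  /* 1. allocate physical backing store */
    phys_putpage(physvp, physoff);       /* 2. write the page out to swap      */

    /* 3. Rename the page: both the anon and the page structure now carry
     *    the physical swap location instead of the swapfs name.            */
    anon_name->vp = physvp;  anon_name->off = physoff;
    page_name->vp = physvp;  page_name->off = physoff;
}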
[Figure 14-15. Virtual swap space in swapfs: (a) on initial creation of an anonymous page, the anon and page
structures hold the swapfs vnode and offset; (b) after pageout, both hold the vnode and offset of the physical
swap page; (c) after the page is freed from memory, the anon structure retains the physical swap location.]
Because swapfs allows allocations against main memory, it may not be able to find physical
swap space for a pageout operation. In that case, it simply wires down the page in memory and does
not mark it as clean.
14.9.3 Discussion
The Solaris implementation allows the system to run with very little swap space (as low as 20% of
main memory) without degrading performance. This is very useful when disk space is limited.
The Solaris framework allows enhancements to perform intelligent swap management. For
instance, it is possible to have the pagedaemon batch its anonymous page writes. This permits the
swap layer to allocate contiguous swap space for them and write them out in a single I/O operation.
Alternatively, swapfs could define a separate vnode for each client process. It could then organize
the backing store such that pages of the same process are allocated close together on the same de-
vice, resulting in better paging performance.
14.10 Analysis
The VM architecture is vastly different from the 4.3BSD memory management architecture. On one
hand, it seems more complex, with a greater number of basic abstractions and data structures. On
the other hand, the separation of responsibility within the different layers results in cleaner inter-
faces between the components. The bottom line, however, is functionality and performance. What
have we gained with this approach, and at what cost? The VM architecture has the following major
advantages:
• The design is highly modular, with each major component represented by an object-
oriented interface that encapsulates its functionality and hides its inner implementation
from the rest of the system. This brings with it the traditional advantages of object-
oriented methods-each component may be easily modified or enhanced, and support for
new functionality or new machines may be added fairly easily. One such example is the
addition of a seg_u driver to manage u area allocation.
• In particular, isolating the hardware translation functionality into the HAT layer has made
the architecture highly portable. It has already been ported successfully to many different
systems, including Motorola 680x0, Intel 80x86, SPARC, AT&T 3B2, and IBM 370/XA.
• The architecture supports various forms of sharing: copy-on-write sharing of individual
pages, MAP_SHARED mappings to the anonymous object for traditional shared memory re-
gions, and shared access to files through the mmap interface. This sharing reduces the
overall contention on physical memory and eliminates excess disk I/O to maintain multi-
ple copies of the same page in memory.
• The mmap interface is particularly powerful, not only because of the sharing capabilities it
offers, but also because it allows direct access to file data without system call overhead.
• Though shared libraries are not explicitly part of the kernel, they can be supported easily
by mapping them into the process address space.
• Because the VM architecture uses vnode operations for all file and disk accesses, it can
take advantage of all the benefits of the vnode interface. In particular, the VM system does
not require special code to support the execution of files on remote nodes. Swap devices,
likewise, may be specified on remote nodes, thus supporting truly diskless operation.
• Integrating the buffer cache with the VM system provides an automatic tuning of physical
memory. In traditional architectures, the size of the buffer cache was fixed when the ker-
nel was built at approximately 10% of physical memory. In reality, the ideal size of the
buffer cache depends on how the system is used. An I/O-intensive system such as a file
server requires a large buffer cache. A system used primarily for timesharing applications
would prefer a small buffer cache, with much of the memory used for process pages. Even
for a single machine, the ideal cache size varies in time as the usage pattern changes. The
VM architecture dynamically changes the allocation of memory between process pages
and file pages to reflect the actual demands on memory, and thus effectively addresses the
needs of the users at all times.
• The problem with breakpoints in a shared text region (see Section 13.6) is solved neatly,
because the text region is mapped MAP_PRIVATE. When a debugger tries to set a break-
point by calling ptrace, the kernel makes a private copy of that page for the process and
writes the breakpoint instruction to that copy. This ensures that other processes running
this program do not see the breakpoint.
Although the advantages are impressive, they are not without cost. There are several important
shortcomings of this design, primarily related to performance:
• The VM system has to maintain a lot more information about its fundamental abstractions.
This results in many more data structures, and those having BSD counterparts are often
larger in size. For example, the page structure is over 40 bytes in size, as opposed to the
16-byte cmap structure in BSD. This means that the VM system uses up more memory to
maintain its own state, leaving less memory for user processes.
• The VM system does not save text and unmodified data pages to swap. Instead, it reads
them back from the executable file when needed. This reduces the total swap space
needed, and saves the expense of swapping out such pages. On the other hand, reading a
page back from file is slower than retrieving it from swap, because of the greater overhead
of the file system code and, in some cases, the need to read additional metadata blocks to
locate the page. The effect of this policy on overall performance depends on how often
such pages are retrieved.
• The algorithms are more complex and slower. The layering of components involves more
function calls, many of them indirect via function table lookups. This has an impact on
overall system performance. For example, [Chen 90] finds that function calls add a 20%
overhead to page fault handling.
• The VM system has abandoned the BSD practice of computing disk addresses of all fill-
from-text pages during exec, and storing them in the PTEs. This means that each disk ad-
dress may have to be computed individually when there is a page fault on that page. If the
indirect blocks for that file are no longer in physical memory, they must be read from disk.
This slows down the demand paging of text and initialized data pages.
• The object-oriented method results in invariant interfaces to its abstractions, flexible
enough to allow vastly different implementations of the underlying objects. This means,
however, that the system is not tuned for a specific implementation. Optimizations such as
those in the preceding paragraph are not possible in SVR4.
• Copy-on-write may not always be faster than anticipatory copying. When a page is shared
copy-on-write, more page faults are generated, and the TLB entry also needs to be flushed
when the page is initially write-protected. If pages need to be copied anyway, it is surely
more efficient to do it directly rather than waiting for copy-on-write faults.
• Swap space is allocated on a per-page basis, randomly on the swap devices. This loses the
clustering and pre-paging benefits of a BSD-like approach, which allocates contiguous
chunks of swap space for each process. In contrast, the BSD approach wastes the unused
space in each chunk.
These drawbacks are all performance-related and may be compensated for by the perform-
ance benefits of the new facilities such as memory sharing and file mapping. On balance, with the
rapid increase in CPU speeds and memory sizes, performance considerations are less important
than functionality, and in that regard the VM architecture provides substantial advantages.
may do so only after it has been paged out again. The drawback is that there are many more page
faults. Many of these faults are spurious-the page is actually in memory, but the faulting process
does not have a valid translation for it.
The original VM implementation also uses a lazy approach to initialize the translation maps.
Unlike BSD, which allocates and initializes all page tables during fork or exec, VM defers the task
as far as possible. It initializes each translation map entry only when the process first faults on that
page. Likewise, it allocates each page of the map only when it is first needed. While this method
eliminates some amount of work, it incurs many more page faults.
The lazy approach is beneficial if the total overhead of the extra page faults is less than the
time saved by eliminating unnecessary operations. Under VM, the fault overhead is fairly large.
Measurements on the AT&T 3B2/400 show that the cost of a spurious validation fault (one where
the page is already in memory, but the system does not have a valid translation for it) is 3.5 milli-
seconds in the VM architecture, but only 1.9 milliseconds in the regions architecture. The difference
is largely due to the modular design of VM, which results in many more function calls and longer
code paths.
A similar tradeoff occurs in copy-on-write sharing during fork. Here, the objective is to copy
only those pages that are modified by either the parent or the child. To do so, VM defers copying
any page until the parent or the child faults on it. The drawback, once again, is that this causes addi-
tional page faults. On the 3B2, the cost of handling the protection fault and copying the page is 4.3
milliseconds, while that of copying alone is only 1 millisecond. Hence, for copy-on-write to be
beneficial, less than one in four pages should need to be copied. If more than 1/4 of the in-memory
pages are modified after a fork, it would have been better to copy all of them in advance.
The benchmarks also show that the initial VM implementation causes about three times the
number of faults as the regions architecture. The critical activities are fork and exec, which are re-
sponsible for most of the address space set up and memory sharing. To reduce the fault rate, it is
important to examine the paging behavior of these calls in typical situations.
Three important enhancements were made to VM based on these factors. They are described
in the following section.
process may not access those pages while they are in memory. The overall benefit, again, is due to
the fact that the cost of setting up the mappings is less than that of the page faults avoided.
The final change applies to copy-on-write sharing. Here, since the cost of copying all the
pages is very high (which is why we do copy-on-write in the first place), it is important to guess
which pages will need to be copied regardless. Copy-on-write occurs primarily because of fork,
which is mostly called by the shells to implement command execution. An analysis of the memory
access patterns of the different shells (sh, csh, ksh, etc.) shows that a single shell process will fork
several times and will use fork in a similar way each time. The same variables are likely to be
modified after each fork operation. In terms of pages, the pages that experience copy-on-write faults
after one fork will probably do so after the next.
When a copy-on-write page is modified, it becomes an anonymous page. This provides an
easy optimization. In the new implementation, fork examines the set of anonymous pages of the
parent and physically copies each page that is in memory. It expects this set to be a good predictor
of the pages that will be modified (and have to be copied) after the fork. The pre-copying eliminates
the overhead of the copy-on-write faults.
14.12 Summary
This chapter describes the SVR4 VM architecture. It has several advanced features and functional-
ity, and provides many forms of memory inheritance and sharing required for sophisticated applica-
tions. It must, however, be carefully optimized to yield good performance. The modern tendency is
to concentrate on adding functionality and expect upgrades to faster hardware to compensate for the
performance.
Kernel memory allocation in SVR4 is discussed in Section 12.7. The SVR4 treatment of
translation buffer consistency is described in Section 15.11. The Mach memory management archi-
tecture, which has several parallels to SVR4 VM, is also discussed in Section 15.2.
14.13 Exercises
1. In what ways is mapped file access semantically different from access through read and write
system calls?
2. Can mmap semantics be preserved by a distributed file system? Explain the effects of
mapping files exported by NFS, RFS, and DFS servers.
3. What are the differences between the page structure in SVR4 and the cmap structure in
4.3BSD?
4. Why does the as structure have a hint to the segment that had the last page fault?
5. What is the difference between an anonymous page and the anonymous object?
6. Why do anonymous pages not need a permanent backing store?
7. Why is there only one seg_map segment in the system?
8. SVR4 delays the allocation of swap pages until the process creates an anonymous page, while
4.3BSD pre-allocates all swap pages during process creation. What are the benefits and
drawbacks of each approach?
9. Does each mmap call create a new segment? Is it always a vnode segment?
10. When do processes share an anon_map?
11. Why does the shared memory IPC structure acquire a reference to the anon_map for the
segment?
12. What support does the file system provide to the VM subsystem?
13. Why does SVR4 not use the block buffer cache for file data pages?
14. Why do both SVR4 and Solaris reserve swap space at segment creation time?
15. In what situations does SVR4 use anticipatory paging? What are its benefits and drawbacks?
16. What is lazy evaluation? When does SVR4 use lazy evaluation?
17. What are the benefits and drawbacks of copy-on-write?
14.14 References
[Bala 92] Balan, R., and Gollhardt, K., "A Scalable Implementation of Virtual Memory HAT
Layer for Shared Memory Multiprocessor Machines," Proceedings of the Summer
1992 USENIX Technical Conference, Jun. 1992, pp. 107-115.
[Char 91] Chartock, H., and Snyder, P., "Virtual Swap Space in SunOS," Proceedings of the
Autumn 1991 European UNIX Users' Group Conference, Sep. 1991.
[Chen 90] Chen, D., Barkley, R.E., and Lee, T.P., "Insuring Improved VM Performance: Some
No-Fault Policies," Proceedings of the Winter 1990 USENIX Technical Conference,
Jan. 1990, pp. 11-22.
[Elli 90] Ellis, M.A., and Stroustrup, B., The Annotated C++ Reference Manual, Addison-
Wesley, Reading, MA, 1990.
[Ging 87] Gingell, R.A., Moran, J.P., and Shannon, W.A., "Virtual Memory Architecture in
SunOS," Proceedings of the Summer 1987 USENIX Technical Conference, Jun.
1987, pp. 81-94.
[Klei 86] Kleiman, S.R., "Vnodes: An Architecture for Multiple File System Types in Sun
UNIX," Proceedings of the Summer 1986 USEN/X Technical Conference, Jun. 1986,
pp. 238-247.
[Mora 88] Moran, J.P., "SunOS Virtual Memory Implementation," Proceedings of the Spring
1988 European UNIX Users Group Conference, Apr. 1988.
[UNIX 92] UNIX System Laboratories, Operating System API Reference, UNIX SVR4.2, UNIX
Press, Prentice-Hall, Englewood Cliffs, NJ, 1992.
15
More Memory Management Topics
15.1 Introduction
This chapter discusses three important topics. The first is the Mach virtual memory architecture,
which has some unique features such as the ability to provide much of the functionality through
user-level tasks. The second is the issue of translation lookaside buffer consistency on multiproces-
sors. The third is the problem of using virtually addressed caches correctly and efficiently.
Similar to SVR4, the Mach VM design is motivated by the limitations of the 4.3BSD memory architec-
ture, which is heavily influenced by the VAX hardware and hence is difficult to port. Moreover, the
4.3BSD functionality is primitive and restricted to demand-paging support. It lacks mechanisms for
memory sharing other than read-only sharing of text segments. Finally, 4.3BSD memory manage-
ment cannot be extended to a distributed environment. Although Mach provides full binary com-
patibility with 4.3BSD UNIX, it aims to support a richer set of features, including the following:
• Copy-on-write and read-write sharing of memory between related and unrelated tasks.
• Memory-mapped file access.
• Large, sparsely populated address spaces.
• Memory sharing between processes on different machines.
• User control over page replacement policies.
Mach separates all machine-dependent code into a small pmap layer. This makes it easy to port
Mach to a new hardware architecture, since only the pmap layer needs to be rewritten. The rest of
the code is machine-independent and is not modeled after any specific MMU architecture.
An important objective in the Mach VM design is to push much of the VM functionality out
of the kernel. From its conception, Mach was intended to evolve into a microkernel architecture,
with much of the traditional kernel functionality provided by user-level server tasks. Hence Mach
VM relegates functions such as paging to external (user-level) tasks.
Finally, Mach integrates the memory management and interprocess communication (IPC)
subsystems, to gain two advantages. The location-independence of Mach IPC (see Section 6.9) al-
lows virtual memory facilities to be transparently extended to a distributed environment. Section
15.5.1 shows one example of how a user-level program can provide shared memory between appli-
cations on different machines. Conversely, the copy-on-write sharing supported by the VM subsys-
tem allows fast transfer of large messages.
This discussion makes frequent references to the five fundamental abstractions of Mach,
namely tasks, threads, ports, messages, and memory objects. A task is a collection of resources in-
cluding an address space and some ports, in which one or more threads may run. A thread is a con-
trol point in a program; it is an executable and schedulable entity that runs within a task. Mach rep-
resents a UNIX process as a task with a single thread. Mach tasks and threads are described in
Section 3.7. A port is a protected queue of messages. Many tasks may hold send rights to a port, but
only one task has the right to receive messages from it. A message is a typed collection of data. Its
size ranges from a few bytes to an entire address space. Mach ports and messages are described in
Section 6.4. Memory objects provide the backing store for virtual memory pages and are described
in this chapter.
• Memory allocation - A user may allocate one or more pages of virtual memory by
calling vm_allocate for zero-filled pages or vm_map for pages backed by a specific mem-
ory object (for example, a file). This does not consume resources immediately, since Mach
does not allocate physical pages until they are first referenced. The vm_deallocate call re-
leases virtual memory pages.
• Protection - Mach supports read, write, and execute permissions for each page, but their
enforcement depends on the hardware. Many MMUs do not recognize execute permis-
sions. On such systems, the hardware will allow execute access to any readable page. Each
page has a current and maximum protection. Once set, the maximum protection may only
be lowered. The current protection may never exceed the maximum protection. The
vm_protect call modifies both types of protections.
• Inheritance - Each page has an inheritance value, which determines what happens to
that page when the task creates a child task using task_create. This attribute can take one
of three values (Figure 15-1):
VM_INHERIT_NONE    The page is not inherited by the child and does not appear in
                   its address space.
VM_INHERIT_SHARE   The page is shared between the parent and child. Both tasks
                   access a single copy of the page, and changes made by one are
                   immediately visible to the other.
VM_INHERIT_COPY    The child gets its own copy of the page. Mach implements this
                   using copy-on-write, so the data is actually copied only if and
                   when the parent or the child tries to modify the page. This is
                   the default inheritance value of newly allocated pages.
The vm_inherit call modifies the inheritance value of a set of pages. It is important to note
that this attribute is independent of how the task is currently sharing the page with others.
For example, task A allocates a page, sets its inheritance to VM_ INHERIT_SHARE, and then
creates child task B. Task B sets the page's inheritance to VM_INHERIT_COPY and then
creates task C. The page is thus shared read-write between A and B, but copy-on-write
with C (Figure 15-2). This is in contrast to SVR4, where the inheritance method is identi-
cal to the current mapping type (shared or private) for the page in that process.
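These calls are easiest to see in a short user-level fragment. The following sketch assumes a Mach 3.0-style <mach.h> header and the mach_task_self() call (older releases use task_self()); it allocates a zero-filled page and marks it for shared inheritance, as task A does in the example above:

    #include <mach.h>              /* Mach VM interface; header name varies by release */

    int
    main(void)
    {
            vm_address_t  addr = 0;
            vm_size_t     size = vm_page_size;      /* one page */
            kern_return_t kr;

            /* Allocate one zero-filled page anywhere in the address space. */
            kr = vm_allocate(mach_task_self(), &addr, size, TRUE);
            if (kr != KERN_SUCCESS)
                    return 1;

            /* Share the page read-write with any child task created later by
               task_create(); a child that resets the inheritance to
               VM_INHERIT_COPY passes only a copy-on-write copy to its own
               children, as in the task A/B/C example above. */
            vm_inherit(mach_task_self(), addr, size, VM_INHERIT_SHARE);

            /* Independently, the current protection can be lowered to read-only;
               the maximum protection is unchanged, so write access can later be
               restored with another vm_protect call. */
            vm_protect(mach_task_self(), addr, size, FALSE, VM_PROT_READ);

            return 0;
    }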
Figure 15-2. The page is shared read-write between tasks A and B, but copy-on-write with task C.
• Miscellaneous - The vm_read and vm_write calls allow a task to access pages belonging
to other tasks. They are typically used by debuggers and profilers. The vm_regions call
returns information about all the regions of the address space, and the vm_statistics call
returns statistical data related to the virtual memory subsystem.
The vm object provides the interface to the pages of a memory object. The memory object
[Youn 87] is an abstract object representing a collection of data bytes on which several operations,
such as read and write, are defined. These operations are executed by the data manager, or pager,
for that object. The pager is a task (user or kernel level) that manages one or more memory objects.
Examples of memory objects are files, databases, and network shared memory servers (Section
15.5.1).
The memory object is represented by a port owned by the object's pager (that is, the pager
has the receive rights to this port). The vm object has a reference, or send right, to this port and can
use it to communicate with the memory object. This is described in more detail in Section 15.4.3.
The vm object also maintains a linked list of all the resident pages of the memory object. This list
speeds up operations such as deallocating the object or invalidating or flushing all of its pages.
Each memory object is associated with a unique vm object. If two tasks map the same mem-
ory object into their address space, they share the vm object, as described in Section 15.3. The vm
object maintains a reference count to implement the sharing.
The similarity with SVR4 is striking. The vm_map corresponds to the struct as, and the
vm_map_entry corresponds to struct seg. The pager's role is like that of the segment driver
(except that the pager is implemented as a separate task), while the vm object and memory object
together describe a specific data source such as a file. An important difference is the lack of a per-
page protections array. If a user changes protections on a subset of pages of a region (the word re-
gion refers to the address range represented by an address map entry), Mach splits the region into
two¹ different regions mapped to the same memory object. Other operations, likewise, could result
in the merging of adjacent regions.
There are two other important data structures-the resident page table and the pmap. The
resident page table is an array (struct vm_page[]) with one entry for each physical page. The size
of a physical page is some power-of-two multiple of the hardware page size. Physical memory is
treated as a cache for the contents of memory objects. The name space for these pages is described
by the <object, offset> pair, which specifies the memory object the page belongs to and its starting
offset in that object. Each page in this table is kept on three lists:
• A memory object list chains all pages of the same object and speeds up object deallocation
and copy-on-write operations.
• Memory allocation queues maintained by the paging daemon. The page is on one of three
queues-active, inactive, or free.
• Object/offset hash chains for fast lookup of a page in memory.
The vm_page[] array is very similar to the page[] array in SVR4. Finally, each task has a
machine-dependent pmap structure (analogous to SVR4's HAT layer) that describes the hardware-
defined virtual-to-physical address translation map. This structure is opaque to the rest of the system
and is accessed by a procedural interface. The pmap interface assumes only a simple, paged MMU.
The following are some of the functions it supports:
• pmap_create() is called when the system begins using a new address space. It creates a
new pmap structure and returns a pointer to it.
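Other representative entry points of the interface are sketched below. The declarations and argument types are illustrative approximations of the interface described in [Rash 88], not the exact prototypes of any particular Mach release:

    pmap_t pmap_create(vm_size_t size);                  /* new, empty translation map */
    void   pmap_destroy(pmap_t pmap);                    /* release a map              */
    void   pmap_enter(pmap_t pmap, vm_offset_t va,       /* map one virtual page       */
                      vm_offset_t pa, vm_prot_t prot,
                      boolean_t wired);
    void   pmap_remove(pmap_t pmap, vm_offset_t start,   /* unmap a range              */
                       vm_offset_t end);
    void   pmap_protect(pmap_t pmap, vm_offset_t start,  /* change protections         */
                        vm_offset_t end, vm_prot_t prot);
    void   pmap_activate(pmap_t pmap, int cpu);          /* bind the map to a CPU      */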
1 Or three different regions, if the subset is in the middle of the range of pages mapped by this region.
Mach supports read-write and copy-on-write sharing between related and unrelated tasks. Tasks in-
herit memory from their parent during task_create. This allows a task to share regions of its mem-
ory with its descendants. Unrelated tasks can share memory by mapping a region of their address
space to the same memory object. Each of these facilities is described in this section.
To resolve a page fault, the kernel searches the shadow chain from the top down. Thus task
A finds its page 1 from its shadow object, but pages 2 and 3 from the original object. Likewise, task
B finds page 3 in its shadow object, but pages 1 and 2 from the original object.
As a task creates other children, it can build up a long chain of shadow objects. This not only
wastes resources, but also slows down page fault handling since the kernel must traverse several
shadow links. Mach therefore has algorithms to detect such situations and collapse shadow chains
when possible. If all the pages managed by an object appear in objects above it in the chain, that
object can be destroyed. This compaction method, however, is inadequate; if some of the pages of
an object have been swapped out, the object cannot be deleted.
Copy-on-write sharing is also used for large message transfers. A task can send a message
containing out-of-line memory (see Section 6.7.2). Applications use this facility to transfer a large
amount of data without physically copying it if possible. The kernel maps such pages into the ad-
dress space of the new process by creating a new vm_map_entry, which shares the pages copy-on-
write with the sender. This is similar to copy-on-write sharing between parent and child.
One important difference is that while data is in transit, the kernel temporarily maps the data
into its own address space. To do so, it creates a vm_map_entry in the kernel map and shares the
pages with the sender. When the data has been mapped to the receiver's address space, the kernel
destroys this temporary mapping. This protects against changes made to the data by the sender be-
fore the receiver retrieves the message. It also guards against other events such as termination of the
sender.
Since this is a fundamentally different form of sharing, the kernel needs a different way to
implement it. Mach uses the notion of a share map to describe a read-write shared region. A share
map is itself a vm_map structure, with a flag set to denote that it describes shared memory and a ref-
erence count of the number of tasks sharing it. It contains a list of vm_map_entrys; initially, the list
contains a single entry.
Figure 15-6 illustrates the implementation of share maps. It requires that a vm_map_entry
may point either to a vm_object or to a share map. In this example, the shared region contains three
pages, initially with read and write permissions enabled. Subsequently, one of the tasks calls
vm_protect to make page 3 read-only. This splits the region into two, as shown in Figure 15-7.
Share maps allow such operations on shared regions to be implemented easily.
Figure 15-6. Implementation of share maps.
To map a memory object, a task must first acquire send rights to the port representing that object. It obtains the rights from the object's pager, since only the owner of a port may
issue rights to it. This step is outside the scope of the VM subsystem and may involve some other
interaction between the user task and the pager (and possibly other entities as well). For instance,
the vnode pager2 manages file system objects and provides a facility for user tasks to open files by
name. When a user opens a file, the pager returns a send right to the port representing that file.
Once a task acquires a port (to acquire a port means to acquire send rights to it), it maps the
memory object into its address space, using the system call
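The call has roughly the following form (a sketch; the argument names match the description below, and the exact prototype varies slightly across Mach releases):

    kern_return_t
    vm_map(vm_task_t       target_task,     /* task whose address space is modified */
           vm_address_t   *base_addr,       /* in/out: where the object is mapped   */
           vm_size_t       size,
           vm_address_t    mask,            /* alignment constraint                 */
           boolean_t       anywhere,        /* may the kernel pick another address? */
           memory_object_t memory_object,   /* port representing the object         */
           vm_offset_t     offset,          /* starting offset within the object    */
           boolean_t       copy,            /* map a copy rather than share         */
           vm_prot_t       cur_protection,
           vm_prot_t       max_protection,
           vm_inherit_t    inheritance);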
This is similar to the mmap call in SVR4. It maps the byte range [offset, offset+size) in
the memory object to the address range [base_addr, base_addr+size) in the calling task. The
flag specifies whether the kernel may map the object to a different base address. The function re-
turns the actual address to which the object was mapped.
The first time an object is mapped, the kernel creates two additional ports for the object-a
control port, used by the pager to make cache management requests of the kernel, and a name port,
which identifies the object to other tasks who may retrieve information about the object using the
vm_regions call. The kernel owns these ports, and holds both send and receive rights to them. It
then calls
    memory_object_init (memory_object_port,
                        control_port,
                        name_port,
                        page_size);
to ask the pager to initialize the memory object. This way, the pager acquires send rights to the re-
quest and name ports. Figure 15-8 describes the resulting setup.
    memory_object_data_request (memory_object_port,
                                control_port, offset,
                                length, desired_access);
2 Earlier releases of Mach used an inode pager and mapped the file with a vm_allocate_with_pager call [Teva 87].
The vnode pager was introduced to support the vnode/vfs interface [Klei 86].
Figure 15-8. Ports used for communication between the kernel, the user task, and the pager.
Each of these functions results in an asynchronous message. If the kernel makes a request of
the pager, the pager responds by sending another message, using a kernel interface function. The
following subsection shows some of the interactions between the kernel and the pager.
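For reference, the exchange falls into two groups of messages. The names below are those commonly used for the Mach external memory management interface; the list is illustrative rather than exhaustive, and the arguments are omitted:

    /* Messages the kernel sends to the pager: */
    memory_object_init()              /* object mapped for the first time     */
    memory_object_data_request()      /* a page fault needs data              */
    memory_object_data_write()        /* the kernel is paging out dirty data  */
    memory_object_lock_completed()    /* reply to a lock/clean/flush request  */
    memory_object_terminate()         /* the object is no longer mapped       */

    /* Messages the pager sends to the kernel, using the control port: */
    memory_object_data_provided()     /* here is the requested data           */
    memory_object_data_unavailable()  /* supply a zero-filled page instead    */
    memory_object_lock_request()      /* clean, flush, or restrict pages      */
    memory_object_set_attributes()    /* mark the object ready; set caching   */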
3 It is possible to have even the default pager run as a user task. [Golu 91] describes such an implementation.
The default pager handles temporary memory objects, which have no permanent backing store. These objects may be divided into
two types:
• Shadow objects, containing modified pages of regions that are shared copy-on-write.
• Zero-fill regions, such as stacks, heaps, and uninitialized data. Zero-fill regions are created
by the vm_allocate call described earlier.
Figure 15-9. Two clients connect to the network shared memory server.
4 If, instead, task T2 were trying to write to the page, the server would use the should_flush flag, asking kernel A to
invalidate its copy of the page.
4. Kernel A reduces the permissions on the page to read-only and writes back the modified
page to the server via memory_object_data_write().
5. Now the server has the most recent version of this page. It sends it to kernel B via
memory_object_data_provided().
6. Kernel B updates the address translation tables and resumes task T2.
Obviously, this is a slow process. If the two tasks frequently modify the page, they will re-
peatedly go through this iteration, resulting in numerous IPC messages, repeatedly copying the page
from one node to another. Much of this overhead is inherent in the problem of synchronizing data
across a network. Note that each write fault causes the page to be copied over the network twice-
first from the other client to the server, then from the server to the faulting client. We could think of
reducing this to a single copy operation if the server could somehow ask one client to directly send
the page to another. That, however, would break the modularity of the design and lead to quite
complex interactions if there are more than two clients. Obviously, having the server reside on one
of the clients would eliminate one copy operation and also reduce other message traffic on the net-
work.
In spite of these shortcomings, the example illustrates how the close coupling of message
passing and memory management allows us to build powerful and versatile applications. Another
example [Subr 91] is an external pager that manages discardable pages, used by applications that
perform their own garbage collection. The pager receives information from client tasks regarding
which pages are discardable and influences page replacement by preflushing such pages. This frees
up more memory for useful pages in the system.
function of the back hand is performed in steps 3 and 4, which examine the pages at the head of the
inactive queue.
The major difference is in how the pagedaemon selects the active pages to move to the inac-
tive queue. In the clock algorithm, the pages are picked sequentially based on their location in
physical memory and not on their usage pattern. In Mach, the active list is also FIFO, so that the
pagedaemon selects those pages that were activated the earliest. This is an improvement over the
clock algorithm.
Mach does not provide alternative replacement policies. Some applications, such as data-
bases, do not exhibit strong locality of reference and may find LRU-like policies inappropriate.
[McNa 90] suggests a simple extension to the external pager interface that would permit a user-level
task to choose its own replacement policy.
15.7 Analysis
Mach has a well-designed VM architecture with several advanced features. Much of its functionality
is similar to that of SVR4, such as copy-on-write sharing, memory-mapped file access, and support
for large, sparse address spaces. Like SVR4, it is based on an object-oriented approach,
using a small set of objects that present a modular programming interface. It cleanly separates the
machine-independent and dependent parts of the code and isolates the machine-dependent code in
the pmap layer, which is accessed through a narrow, well-defined interface. When porting to a new
hardware architecture, only the pmap layer needs to be rewritten.
Moreover, Mach VM offers many features not found in SVR4. It provides more flexible
memory sharing facilities, by separating the notions of sharing and inheritance. It integrates memory
management and interprocess communication. IPC uses VM to allow large messages (up to the en-
tire address space of a task) to be transferred efficiently using copy-on-write. VM uses IPC to pro-
vide location independence for its objects and to extend the VM facilities to a distributed environ-
ment. In particular, user-level tasks may manage the backing store for memory objects, and the
kernel communicates with these tasks through IPC messages. This coupling creates a highly flexible
environment, allowing multiple user-level (external) pagers to coexist and provide different types of
paging behavior. The network shared memory manager is an excellent example of how this inter-
face may be used to add new functionality.
There are, however, some important drawbacks, many of them similar to those of SVR4.
The VM system is larger, slower, and more complex than the BSD design. It uses more and larger
data structures. Hence it consumes more physical memory for itself, leaving less available for the
processes. Since the design keeps machine-dependent code to a minimum, it cannot be properly op-
timized for any particular MMU architecture.
In addition, the use of message passing adds considerable overhead. The cost is reduced in
some cases by optimizing kernel-to-kernel message transfers. Overall, though, message passing is
still a lot more expensive than simple function calls. Except for the network shared memory man-
ager, external pagers are not used commonly. This raises questions about whether the external pager
interface is useful enough to justify its high cost. Digital UNIX, the major commercial UNIX sys-
tem based on Mach, does not support external pagers and does not export the Mach VM interface.
Its VM subsystem has diverged from Mach in many respects.
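15.8 Memory Management in 4.4BSD

4.4BSD provides memory-mapped file access through the mmap call. In outline, and with argument names chosen to match the description below, the call looks as follows (a sketch, not the exact prototype):

    paddr = mmap(addr,    /* suggested address (may be 0)                    */
                 len,     /* length of the mapping in bytes                  */
                 prot,    /* PROT_READ, PROT_WRITE, PROT_EXECUTE             */
                 flags,   /* MAP_FILE or MAP_ANON, plus MAP_SHARED,
                             MAP_PRIVATE, MAP_FIXED, MAP_INHERIT,
                             MAP_HASSEMAPHORE                                */
                 fd,      /* descriptor naming the file                      */
                 off);    /* starting offset within the file                 */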
This establishes a mapping between the locations [paddr, paddr+len) in the process and the byte
range [off, off+len) in the file represented by fd. 5 As in SVR4, addr suggests what address to
5 We follow the standard convention for specifying ranges: square brackets indicate inclusive boundaries, while paren-
theses indicate exclusive boundaries.
map the file to, and prot specifies protections (a combination of PROT_READ, PROT_WRITE, and
PROT_EXECUTE). The flags MAP_SHARED, MAP_PRIVATE, and MAP_FIXED have the same meaning as
in SVR4.
The 4.4BSD mmap has a few additional features. The flags argument must contain either
MAP_FILE (mapping to a file or device) or MAP_ANON (mapping to anonymous memory). There are
two additional flags-MAP_INHERIT, which specifies that the mapping should be retained after an
exec system call, and MAP_HASSEMAPHORE, which specifies that the region may contain a semaphore.
Processes may share memory in two ways. They may map the same file into their address
space, in which case the file provides the initial contents and backing store for the region. Alterna-
tively, a process may map an anonymous region, associate a file descriptor with it, and pass the de-
scriptor to other processes that wish to attach to the region. This avoids the overhead of a mapped
file, and the descriptor is used only for naming.
4.4BSD allows fast synchronization between processes by allowing semaphores to be placed
in a shared memory region. In traditional UNIX systems, processes use semaphores to synchronize
access to shared memory (see Section 6.3). Manipulating semaphores requires system calls, which
impose a lot of overhead and negate most of the performance benefits of shared memory. 4.4BSD
reduces this overhead by placing semaphores in shared memory regions.
Empirical studies show that in most applications that use synchronization, when a process
tries to acquire a shared resource, it finds the resource unlocked in a majority of cases. If cooperat-
ing processes place semaphores in a shared memory region, they can try to acquire the semaphore
without making a system call, provided the system supports an atomic test-and-set or equivalent
instruction (see Section 7.3.2). Only if the resource is locked does the process need to make a sys-
tem call to wait for the resource to be unlocked. The kernel checks the semaphore again and blocks
the process only if the semaphore is still locked. Likewise, a process freeing the resource can release
the semaphore at the user level. It then checks if other processes are waiting for the resource; if so, it
must make a system call to awaken them.
4.4BSD provides the following interface for semaphore management. The shared memory
region containing the semaphore must be created with the MAP_HASSEMAPHORE flag. A process can
acquire a semaphore by calling mset(sem, wait), where sem is a pointer to the semaphore and wait is
a boolean that is set to true if the caller wants to block on the semaphore. On return, the value is
zero if the process has acquired the semaphore. To release a semaphore, a process calls mclear(sem).
Systems that have an atomic test-and-set instruction implement mset and mclear as user-
level functions. To block and unblock on the semaphore, 4.4BSD provides the calls:
    msleep(sem);
checks the semaphore, and blocks the caller if sem is still locked. The call
    mwakeup(sem);
wakes up at least one process blocked on this semaphore and does nothing if there are no blocked
processes.
The semaphore interface in the System V IPC facility (see Section 6.3) is quite different
from this. System V allows a set of semaphores to be manipulated in a single system call. The ker-
nel ensures that all operations in a single call are committed atomically. This interface can be im-
plemented in 4.4BSD by associating a single guardian semaphore with each such set. For any op-
eration on the set, the process always starts by acquiring the guardian semaphore. It then checks if it
can complete the desired operations; if not, it releases the guardian semaphore and blocks on one of
the semaphores that it could not acquire. When it wakes up, it must repeat the whole process.
• Invalidate a single TLB entry, identified by the virtual address. If the TLB has no entry for
that address, nothing is done. The TBIS (Translation Buffer Invalidate Single) instruction
on the VAX-11 is an example of this method.
• Invalidate the entire TLB cache. For instance, the Intel 80386 flushes the entire TLB any
time the Page Directory Base Register (PDBR) is written to, either explicitly by a move
instruction, or indirectly during a context switch. The VAX-11's TBIA (Translation
Buffer Invalidate All) instruction has the same effect.
• Load a new TLB entry, overwriting the previous entry for that address if one exists. This
method is used by architectures such as the MIPS R3000, which allow software reloading
of the TLB.
• An active flag, which shows whether the processor is actively using some page table. If
this flag is clear, the processor is participating in shootdown and will not access any
modifiable pmap entry. (The pmap is the hardware address translation map for a task and
usually consists of page tables.)
• A queue of invalidation requests. Each request specifies a mapping that must be flushed
from the TLB.
• A set of currently active pmaps. Each processor usually has two active pmaps-the kernel
pmap and that of the current task.
Each pmap is protected by a spin lock, which serializes operations on it. Each pmap also has
a list of processors on which the pmap is currently active.
The kernel invokes the shootdown algorithm when one processor makes a change to an ad-
dress translation that may invalidate TLB entries on other processors. Figure 15-13 illustrates the
case of a single responder. The initiator first disables all interrupts and clears its own active flag.
Next, it locks the pmap and posts TLB flush requests to every processor on which the pmap is ac-
tive. It then sends cross-processor interrupts to those processors and waits for them to be acknowl-
edged.
When the responder receives the interrupt, it also disables all interrupts. It then acknowl-
edges the interrupt by clearing its active flag and spin-waits for the initiator to unlock the pmap.
Meanwhile, the initiator has been waiting for all the relevant processors to become inactive. When
they have all acknowledged the interrupt, the initiator flushes its own TLB, changes the pmap, and
unlocks it. The responders now get out of their spin loop, process their request queue, and flush all
obsolete TLBs. Finally, both the initiator and the responders reset their active flags, reenable inter-
rupts, and resume normal operation.
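In outline, the two halves of the algorithm look like the following sketch. The data structures and helper routines are invented for this illustration and are not the actual Mach code:

    #define NCPU 8

    struct cpu {
            volatile int active;            /* 1 while using some page table   */
    };

    struct pmap {
            volatile int lock;              /* spin lock protecting the pmap   */
            int          active_on[NCPU];   /* CPUs on which this pmap is live */
    };

    extern struct cpu cpus[NCPU];
    extern int  curcpu(void);
    extern void disable_interrupts(void), enable_interrupts(void);
    extern void spin_lock(volatile int *), spin_unlock(volatile int *);
    extern void post_request_and_interrupt(int cpu);  /* queue flush + CP intr */
    extern void flush_local_tlb_requests(void);       /* drain own queue       */
    extern void flush_local_tlb(void);
    extern void edit_page_tables(struct pmap *pmap);

    /* Initiator: the CPU that wants to change a mapping in pmap. */
    void
    shootdown_initiator(struct pmap *pmap)
    {
            int me = curcpu(), i;

            disable_interrupts();
            cpus[me].active = 0;
            spin_lock(&pmap->lock);

            for (i = 0; i < NCPU; i++)          /* post requests, interrupt    */
                    if (i != me && pmap->active_on[i])
                            post_request_and_interrupt(i);

            for (i = 0; i < NCPU; i++)          /* wait for acknowledgments    */
                    if (i != me && pmap->active_on[i])
                            while (cpus[i].active)
                                    ;           /* spin                        */

            flush_local_tlb();
            edit_page_tables(pmap);             /* the actual mapping change   */
            spin_unlock(&pmap->lock);

            cpus[me].active = 1;
            enable_interrupts();
    }

    /* Responder: runs in the cross-processor interrupt handler. */
    void
    shootdown_responder(struct pmap *pmap)
    {
            int me = curcpu();

            disable_interrupts();
            cpus[me].active = 0;                /* acknowledge the interrupt   */
            while (pmap->lock)
                    ;                           /* wait for the change         */
            flush_local_tlb_requests();         /* flush the obsolete entries  */
            cpus[me].active = 1;
            enable_interrupts();
    }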
Figure 15-13. TLB shootdown with a single responder.
Consider a case where processor P1 holds a spin lock that P2 is trying to acquire with interrupts
disabled, and P3 initiates a shootdown with P1 and P2 as responders. P3 sends cross-processor
interrupts to P1 and P2, and blocks till they are acknowledged. P1 acknowledges its interrupt and blocks until the pmap is re-
leased. P2 is blocked on the lock with interrupts disabled and hence does not see or respond to the
interrupt. As a result, we have a three-way deadlock. To prevent this, the system must enforce a
fixed interrupt state for each lock: Either a lock should always be acquired with interrupts disabled
or always with interrupts enabled.
15.10.2 Discussion
The Mach TLB shootdown algorithm solves a complex problem while making no assumptions
about hardware features, other than support for cross-processor interrupts. It is, however, expensive,
and does not scale well. All responders must busy-wait while the initiator changes the pmap. On a
large multiprocessor with tens or hundreds of CPUs, shootdown can idle several processors at once.
The complexity is necessary for two reasons. First, many MMUs write back TLB entries
to the page tables automatically to update modified and referenced bits. This update overwrites the
entire pmap entry. Second, the hardware and software page sizes are often different. Hence when the
kernel changes the mapping of a single page, it may have to change several pmap entries. These
changes must appear atomic to all processors. The only way to accomplish this is to idle all proces-
sors that may be using the pmap while making the change.
Many other TLB shootdown algorithms have been suggested and implemented. The next
section describes some ad hoc solutions to reducing the frequency of TLB flushes. Other methods
depend on some hardware characteristics that simplify the problem. For instance, [Rose 89] de-
scribes an efficient algorithm for the IBM Research Parallel Processor Prototype (RP3) [Pfis 85]. It
uses the facts that the RP3 does not automatically write back TLB entries to main memory and that
the large hardware page size (16 kilobytes) makes it unnecessary for a software page to span multi-
ple hardware pages. Other research [Tell 88] suggests modifications to MMU architectures to assist
in TLB shootdown.
15.11.1 SVR4/MP
The Mach shootdown algorithm solves the general TLB synchronization problem as efficiently as
possible, while making no assumptions about hardware characteristics (other than the availability of
cross-processor interrupts) or about the nature of the event that necessitates the shootdown. The
SVR4/MP approach is to analyze the events leading to shootdowns and find better ways of handling
them. There are four types of events that require TLB synchronization:
1. A process shrinks its address space, either by calling brk or sbrk or by releasing a region
of memory.
2. The pagedaemon invalidates a page, either to free it or to simulate reference bits.
3. The kernel remaps a system virtual address to another physical page.
4. A process writes to a page that is shared copy-on-write.
In SVR4/MP, the hardware automatically flushes the TLB on each context switch [Peac 92].
Hence, case 1 is not a problem, unless the operating system supports multithreaded processes
(which this release of SVR4/MP does not). SVR4/MP provides optimizations for cases 2 and 3.
To reduce TLB flushes in case 2, the pagedaemon batches a number of invalidate operations
and flushes all TLBs in a single operation. This amortizes the cost of the global TLB flush over a
number of page invalidations.
The major cause of TLB synchronization in SVR4 is the seg_map driver, used to support the
read and write system calls. The kernel implements these calls by mapping the file into its own ad-
dress space and then copying the data to the user process. It manages the seg_map segment to dy-
namically map and unmap file pages into its address space. As a result, the physical mapping of
virtual addresses in this segment changes frequently. Since all processes share the kernel, it must en-
sure that such addresses are not accessed through obsolete mappings in the TLBs on other processors.
To track stale TLBs, the SVR4/MP kernel maintains a global generation count, as well as a
local generation count for each processor. When a processor flushes its local TLB, it increments the
global counter and copies the new value into its local counter. When seg_map releases the mapping
for a page, the kernel tags the address with the global generation count. When seg_map reallocates
the address to a new physical page, the kernel compares the saved generation count with the local
counter of each processor. If any local counter has a lower value than the saved counter, the TLB
may have stale entries, and the kernel performs a global TLB flush. Otherwise, all processors have
done a local flush since this address was invalidated, and hence no stale TLBs exist for this page.
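The bookkeeping can be sketched as follows; the names and data layout are invented for this illustration and do not come from the SVR4/MP sources:

    #define NCPU 32

    extern unsigned long tlb_generation;             /* global generation count */
    extern unsigned long local_generation[NCPU];     /* one per processor       */
    extern int  ncpu;
    extern void global_tlb_flush(void);

    /* Called whenever a processor flushes its own TLB. */
    void
    local_flush_done(int cpu)
    {
            local_generation[cpu] = ++tlb_generation;
    }

    /* saved_gen was recorded when seg_map released the virtual address;
       called when seg_map reassigns that address to a new physical page. */
    void
    segmap_reuse(unsigned long saved_gen)
    {
            int i;

            for (i = 0; i < ncpu; i++) {
                    if (local_generation[i] < saved_gen) {
                            /* Some processor has not flushed since the address
                               was released; its TLB may hold a stale entry. */
                            global_tlb_flush();
                            return;
                    }
            }
            /* Every processor has flushed since the release: nothing to do. */
    }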
The SVR4/MP optimization for this situation is based on the assumption that once a
seg_map mapping is invalidated, the kernel will not access that address until it is reallocated to a
new physical address. To minimize the need for flushing, pages released by seg_map are reused in a
first-in, first-out order. This increases the time between freeing and reallocating an address, making
it more probable that other processors will flush their TLBs in the meantime.
15.11.2 SVR4.2/MP
SVR4.2/MP is a multiprocessor, multithreaded release of SVR4.2. Its TLB shootdown policies and
implementation [Bala 92] have some features in common with the SVR4/MP work described in
Section 15.11.1, but provide several important enhancements. All interactions with the TLB are re-
stricted to the HAT layer (see Section 14.4), which is machine-dependent. The reference port for
SVR4.2/MP is to the Intel 386/486 architecture, but the TLB consistency algorithms and interfaces
are designed to be easily portable.
As in SVR4/MP, the kernel has complete control over its own address space, and hence can
guarantee that it will not access invalid kernel mappings (such as those released by the seg_map
segment). Hence the kernel may use a lazy shootdown policy for TLBs that map kernel addresses.
SVR4.2/MP, however, supports lightweight processes (lwps, described in Section 3.2.2), and it is
possible for multiple lwps of the same process to be running concurrently on different processors.
Since the kernel does not control the memory access patterns of user processes, it uses an immediate
shootdown policy for invalid user TLBs.
The global shootdowns used in SVR4/MP do not scale well for a system with a large num-
ber of processors, since the whole system idles while the shootdown proceeds. Hence SVR4.2/MP
maintains a processor list for each hat structure. The hat of the kernel address space has a list of all
on-line processors, since the kernel is potentially active on all of them. The hat of each user process
has a list of all processors on which the process is active (an lwp of the process is running on it).
This list is accessed through the following interface:
• hat_online() and hat_offline() add and remove CPUs to the list in the kernel hat
structure.
• hat_asload(as) adds the processor to the list in the hat structure of the address space
as and loads the address space into the MMU.
• hat_asunload(as, flags) unloads the MMU mappings for this process and removes
the processor from the list of this as. The flags argument supports a single flag, which
indicates whether the local TLB must be flushed after unloading the mappings. 6
The kernel also uses an object called a cookie, which is visible only to the HAT layer. It may
be implemented either as a timestamp or as a generation count (the reference port uses a timestamp),
and must satisfy the condition that newer cookies are always greater in value. The routine
hat_getshootcookie() returns a new cookie, whose value is a measure of the age of the TLB. The
kernel passes the cookie to the hat_shootdown() routine, which is responsible for shootdown of
kernel TLBs. If any other CPU has an older cookie, its TLB may be stale and needs to be flushed.
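A sketch of how the lazy policy might use the cookie is shown below. Only hat_getshootcookie() and hat_shootdown() are names from the interface described above; the cookie type and the kmapping structure are invented for the illustration:

    typedef unsigned long cookie_t;             /* hypothetical cookie type */

    extern cookie_t hat_getshootcookie(void);
    extern void     hat_shootdown(cookie_t cookie);

    struct kmapping {
            char    *vaddr;                     /* kernel virtual address   */
            cookie_t cookie;                    /* recorded at release time */
    };

    /* Lazy policy: when a kernel mapping (for example, a seg_map slot) is
       released, just remember how old other TLBs are allowed to be. */
    void
    kernel_addr_release(struct kmapping *m)
    {
            m->cookie = hat_getshootcookie();
    }

    /* When the address is reassigned to a new physical page, shoot down only
       those processors whose TLBs predate the saved cookie. */
    void
    kernel_addr_reuse(struct kmapping *m)
    {
            hat_shootdown(m->cookie);
    }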
The following subsections explain how the kernel implements the lazy and immediate
shootdown algorithms.
6 The Intel context switch implementation does not ask hat_asunload() to flush the TLB, since the TLB is flushed
anyway after the new u area is mapped in.
7 The treatment of seg_kmem pages is somewhat different. These pages are managed by a bitmap that is divided into
zones (by default, a zone is 128 bits in size). A cookie is associated with each zone, and is set when an address in the
zone is freed.
15.11.5 Discussion
SVR4/MP and SVR4.2/MP seek to optimize the TLB shootdown by taking advantage of the unique
features of the hardware. Moreover, they treat each shootdown situation differently, taking advan-
tage of the synchronization inherent in the functions that trigger the shootdown.
This approach achieves better performance than that of Mach, which uses a single, simple
algorithm for all machines and all situations. However, the SVR4 approach is more difficult to port,
since there are many dependencies on hardware and software specifics. For instance, the MIPS
R3000, with its support for tagged TLB entries, presents a different set of problems, since there is
no automatic TLB flush on each context switch. Section 15.12 presents a solution specific to such
an architecture.
In brief, we again see a tradeoff between using a single solution that applies universally and
using several ad hoc methods that tackle each situation optimally.
15.12 Other TLB Consistency Algorithms
Such a system needs to correctly handle the shrinking of an address space. Suppose a proc-
ess runs first on processor A and then on processor B. While running on B, it shrinks its data region
and flushes the invalid TLB entries on B. If it later runs on A again, it can access invalid pages
through stale TLB entries left behind on A.
To solve this problem, the kernel assigns a new TLBpid to the process when it shrinks its
address space. This automatically invalidates all its existing TLB entries on all processors. 8 The
kernel must perform a global TLB flush when it reassigns the old TLBpid to another process. It re-
duces these events by reallocating TLBpids in first-in, first-out order, allowing stale entries to be
flushed naturally.
The MIPS implementation also handles the case where a process writes to a copy-on-write
page. The kernel makes a new copy of that page, and assigns it to the writing process. It also flushes
the page from the local TLB. If the process had previously run on another processor, its TLB may
have a stale translation for the page. The kernel maintains a record of the processors on which a
process has run. If, after writing to the copy-on-write page, the process runs on one of those proces-
sors again, the kernel first flushes the TLB of that processor.
The optimizations described in this section reduce the need for global TLB synchronization
and may improve system performance. In particular, the seg_map optimization is very useful, since
kernel mappings are shared by all processors, and seg_map is heavily used. The solutions, however,
are ad hoc and depend on specifics of both the hardware and the operating system function that trig-
gers the synchronization. There is no single general algorithm (other than that of Mach) that is
hardware-independent and caters to all situations.
Many modern architectures use a virtually addressed cache, in some cases eliminating the
TLB altogether. Figure 15-16 shows a typical scenario. The MMU first searches for the virtual ad-
dress in the cache. If found, there is no need to look further. If the data is not in the cache, the MMU
proceeds with the address translation and obtains the data from physical memory.
It is also permissible to have both a virtual address cache and a TLB. In such architectures,
such as the MIPS R4000 [MIPS 90] and the Hewlett-Packard PA-RISC [Lee 89], the MMU simul-
taneously searches the cache and the TLB. This has even better performance, at the cost of architec-
tural complexity.
A virtual address cache is composed of a number of cache lines, each of which maps a num-
ber of contiguous bytes of memory. For instance, the Sun-3 [Sun 86] has a 64-kilobyte cache made
up of 16-byte lines. The cache is indexed by virtual address, or optionally, by a combination of vir-
tual address and a process ID or context ID. Since many virtual addresses (both from the same and
different address spaces) map to the same cache line, the line must contain a tag that identifies the
process and virtual address to which it maps.
Using the virtual address as a retrieval index has one important consequence-an alignment
factor may be defined for the cache, such that if two virtual addresses differ by that value, they both
map to the same cache line. This alignment factor usually equals the cache size or a multiple
thereof. We use the term aligned addresses to refer to addresses that map to the same cache line.
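As a concrete illustration, for a cache of the Sun-3's dimensions, the line selected by a virtual address and the alignment relationship can be expressed as:

    #define CACHE_SIZE  (64 * 1024)         /* Sun-3: 64-kilobyte cache      */
    #define LINE_SIZE   16                  /* 16-byte cache lines           */

    /* Cache line selected by a virtual address va. */
    #define CACHE_LINE(va)  (((va) % CACHE_SIZE) / LINE_SIZE)

    /* Two addresses are aligned (map to the same line) when they differ by a
       multiple of the cache size; for example, 0x1000 and 0x11000 both select
       line 0x100, so an access through one can hit or displace a cached entry
       belonging to the other. */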
Although physical address caches are completely transparent to the operating system, the
hardware cannot guarantee consistency of virtual address caches. A given physical address may map
to several virtual addresses and hence to multiple cache lines, causing an internal consistency prob-
lem. The write-back nature of the cache may cause main memory to become stale relative to the
cache. There are three types of consistency problems: mapping changes (homonyms), address
aliases (synonyms), and direct memory access (DMA) operations.
Figure 15-18. Synonyms lead to multiple cache entries for the same data.
DMA operations, too, must be handled differently for virtual address caches. Before starting
a DMA read, the kernel must flush from the cache any dirty data for the pages to be read. This en-
sures that main memory is not stale with respect to the cache. Similarly, in the case of a DMA write,
the kernel must first purge any cache entries for the data to be overwritten. Otherwise, stale cache
entries may later be written back to memory, destroying the more recent data from the DMA opera-
tion.
15.13.5 Analysis
A virtually addressed cache may improve memory access times substantially. However, it poses a
variety of consistency problems that must be dealt with in software. Moreover, it changes the mem-
ory architecture in a fundamental way, which requires rethinking of several operating system design
issues. Although it was designed to improve MMU performance, the cache conflicts with certain
assumptions made by the operating system and may adversely affect overall system performance.
Modern UNIX systems support many forms of memory sharing and mapping, such as Sys-
tem V shared memory, memory-mapped file access, and copy-on-write techniques for memory in-
heritance and interprocess communications. In traditional architectures, these techniques reduce the
amount of in-memory data copying and save memory by eliminating multiple resident copies of the
same data. This results in substantial performance improvements.
On a virtual address cache architecture, however, such memory sharing results in synonyms.
The operating system needs elaborate recovery procedures to ensure cache consistency, such as
flushing the cache, making certain pages noncacheable, or eliminating or restricting certain facili-
ties. These operations may reduce or eliminate any performance gains of memory sharing.
[Chen 87] showed that although cache flushing took only 0.13% of the total time in typical
benchmarks, certain tests raised this figure to 3.0%.
In many situations, several algorithms must be redesigned to perform efficiently on systems
with virtual address caches. [Inou 92] describes several changes to operations in Mach and Chorus
to address this problem. [Whee 92] describes many ways of eliminating unnecessary cache consis-
tency operations, resulting in dramatic performance gains. Some of its suggestions are specific to
the peculiarities of Hewlett-Packard's PA-RISC architecture. On that machine, the TLB lookup oc-
curs in parallel with the address cache search, and the cache is tagged by the physical address. This
allows it to detect many inconsistencies in software and take more efficient corrective measures.
15.14 Exercises
1. How does memory inheritance in Mach differ from that in SVR4?
2. In the example shown in Figure 15-2, what happens if task A tries to write to the page?
3. Why does the Mach external pager interface result in poor performance?
4. What is the difference between a vm object and a memory object?
5. How does the network shared memory server behave if one of its clients crashes? What
happens if the server crashes?
6. Why does Mach not need a per-page protections array such as the one in SVR4?
7. Suppose a vendor wished to provide System V IPC in a system based on a Mach kernel. How
could he or she implement the shared memory semantics? What issues need to be addressed?
8. What are the differences and similarities between the Mach vm_map call, the 4.4BSD mmap
call, and the SVR4 mmap call?
9. Section 15.8 mentions a guardian semaphore to implement System V-like semaphores in
4.4BSD. Would the guardian be allocated and managed by the kernel or a user library?
Describe a skeletal implementation.
10. Why is the Mach page replacement policy called FIFO with second chance?
11. What is the benefit of having the operating system reload the TLB rather than the hardware?
12. Suppose the TLB entry contains a hardware-supported referenced bit. How will the kernel use
this bit?
13. Since the UNIX kernel is nonpaged, what could lead to a change in a TLB entry for a kernel
page?
14. Why is TLB shootdown expensive? Why is it more expensive in Mach than in SVR4/MP or
SVR4.2/MP?
15. How do you think SVR4/MP and SVR4.2/MP handle TLB invalidations caused by writing a
copy-on-write page?
16. What additional TLB consistency problems are caused by lightweight processes?
17. Why is lazy shootdown preferable to immediate shootdown in many cases? When is
immediate shootdown necessary?
18. Does an MMU with a virtually addressed cache still need a TLB? What would be the benefits
and drawbacks?
19. What is the difference between an address alias and a mapping change?
20. How does the kernel ensure consistency of the TLB and the virtual address cache during an
exec system call?
15.15 References
[Bala 92] Balan, R., and Gollhardt, K., "A Scalable Implementation of Virtual Memory HAT
Layer for Shared Memory Multiprocessor Machines," Proceedings of the Summer
1992 USENIX Technical Conference, Jun. 1992, pp. 107-115.
[Blac 89] Black, D.L., Rashid, R., Golub, D., Hill, C., and Baron, R., "Translation Lookaside
Buffer Consistency: A Software Approach," Proceedings of the Third International
Conference on Architectural Support for Programming Languages and Operating
Systems, Apr. 1989, pp. 113-132.
[Chao 90] Chao, C., Mackey, M., and Sears, B., "Mach on a Virtually Addressed Cache
Architecture," Proceedings of the First Mach USENIX Workshop, Oct. 1990, pp. 31-
51.
[Chen 87] Cheng, R., "Virtual Address Cache in UNIX," Proceedings of the Summer 1987
USENIX Technical Conference, Jun. 1987, pp. 217-224.
15.15 References 509
[Cleg 86] Clegg, F.W., Ho, G.S.-F., Kusmer, S.R., and Sontag, J.R., "The HP-UX Operating
System on HP Precision Architecture Computers," Hewlett-Packard Journal, Vol.
37, No. 12, 1986, pp. 4-22.
[Drav 91] Draves, R.P., "Page Replacement and Reference Bit Emulation in Mach,"
Proceedings of the Second USENIX Mach Symposium, Nov. 1991, pp. 201-212.
[Golu 91] Golub, D.B., and Draves, R.P., "Moving the Default Memory Manager Out of the
Mach Kernel," Proceedings of the Second USENIX Mach Symposium, Nov. 1991,
pp. 177-188.
[Inou 92] Inouye, J., Konuru, R., Walpole, J., and Sears, B., "The Effects of Virtually
Addressed Caches on Virtual Memory Design and Performance," Operating Systems
Review, Vol. 26, No. 4, Oct. 1992, pp. 14-29.
[Kane 88] Kane, G., Mips RISC Architecture, Prentice-Hall, Englewood Cliffs, NJ, 1988.
[Klei 86] Kleiman, S.R., "Vnodes: An Architecture for Multiple File System Types in Sun
UNIX," Proceedings of the Summer 1986 USENIX Technical Conference, Jun. 1986,
pp. 238-247.
[Lee 89] Lee, R.B., "Precision Architecture," IEEE Computer, Vol. 21, No. 1, Jan. 1989, pp.
78-91.
[McKu 95] McKusick, M.K., "A New Virtual Memory Implementation for Berkeley UNIX,"
Computing Systems, Vol. 8, No. 1, Winter 1995.
[McNa 90] McNamee, D., and Armstrong, K., "Extending the Mach External Pager Interface to
Accommodate User-Level Page Replacement," Proceedings of the First Mach
USENIX Workshop, Oct. 1990, pp. 17-29.
[MIPS 90] MIPS Computer Systems Inc., MIPS R4000 Preliminary Users Guide, 1990.
[Peac 92] Peacock, J.K., Saxena, S., Thomas, D., Yang, F., and Yu, W., "Experiences from
Multithreading System V Release 4," Proceedings of the Third USENIX Symposium
on Distributed and Multiprocessor Systems (SEDMS III), Mar. 1992, pp. 77-91.
[Pfis 85] Pfister, G.F., Brantley, W.C., George, D.A., Harvey, S.L., Kleinfelder, W.J.,
McAuliffe, K.P., Melton, E.A., Norton, V.A., and Weiss, J., "The IBM Research
Parallel Processor Prototype (RP3): Introduction and Architecture," Proceedings of the 1985
International Conference on Parallel Processing, IEEE Computer Society, 1985, pp.
764-771.
[Rash 88] Rashid, R.F., Tevanian, A., Young, M., Golub, D., Black, D., Bolosky, W., and
Chew, J., "Machine-Independent Virtual Memory Management for Paged Uni-
processor and Multiprocessor Architectures," IEEE Transactions on Computers, Vol.
37, No. 8, Aug. 1988, pp. 896-908.
[Rose 89] Rosenburg, B.S., "Low-Synchronization Translation Lookaside Buffer Consistency
in Large-Scale Shared-Memory Multiprocessors," Eleventh ACM Symposium on
Operating Systems Principles, Nov. 1987, pp. 137-146.
[Subr 91] Subramanian, I., "Managing Discardable Pages with an External Pager,"
Proceedings of the Second USENIX Mach Symposium, Nov. 1991, pp. 201-212.
[Sun 86] Sun Microsystems, Inc., "Sun-3 Architecture: A Sun Technical Report," Aug. 1986.
[Tell 88] Teller, P., Kenner, R., and Snir, M., "TLB Consistency on Highly Parallel Shared
Memory Multiprocessors," Proceedings of the Twenty-First Annual Hawaii Inter-
national Conference on System Sciences, IEEE Computer Society, 1988, pp. 184-192.
[Teva 87] Tevanian, A., Rashid, R.F., Young, M.W., Golub, D.B., Thompson, M.R., Bolosky,
W., and Sanzi, R., "A UNIX Interface for Shared Memory and Memory Mapped
Files Under Mach," Technical Report CMU-CS-l-87, Department of Computer
Science, Carnegie-Mellon University, Jul. 1987.
[Thom 88] Thompson, M.Y., Barton, J.M., Jermoluk, T.A., and Wagner, J.C., "Translation
Lookaside Buffer Synchronization in a Multiprocessor System," Proceedings of the
Winter 1988 USENIX Technical Conference, Jan. 1988, pp. 297-302.
[Whee 92] Wheeler, R., and Bershad, B.N., "Consistency Management for Virtually Indexed
Caches," Proceedings of the Fifth International Conference on Architectural Support
for Programming Languages and Operating Systems, Oct. 1992.
[Youn 87] Young, M.W., Tevanian, A., Rashid, R.F., Golub, D.B., Eppinger, J., Chew, J.,
Bolosky, W., Black, D., and Baron, R., "The Duality of Memory and
Communication in the Implementation of a Multiprocessor Operating System,"
Proceedings of the Eleventh ACM Symposium on Operating Systems Principles,
Nov. 1987, pp. 63-76.
16
Device Drivers and I/O
16.1 Introduction
The I/O subsystem handles the movement of data between memory and peripheral devices such as
disks, printers, and terminals. The kernel interacts with these devices through device drivers. A
driver controls one or more devices and is the only interface between the device and the rest of the
kernel. This separation hides the intricacies of the device hardware from the kernel, which can ac-
cess the device using a simple, procedural interface.
A comprehensive discussion of device drivers is beyond the scope of this book. Many books
[Paja 92, Egan 88] devote themselves exclusively to this topic. Moreover, each UNIX vendor pub-
lishes detailed manuals [Sun 93] that explain how to write drivers for their platforms. This chapter
simply provides an overview of the UNIX device driver framework. It deals primarily with the
SVR4 interfaces, discusses their strengths and drawbacks, and describes some alternative ap-
proaches. It also describes the I/O subsystem, which is the part of the operating system that imple-
ments the device-independent processing of I/O requests.
16.2 Overview
A device driver is part of the kernel-it is a collection of data structures and functions that controls
one or more devices and interacts with the rest of the kernel through a well-defined interface. In
many ways, though, a driver is different and separate from the core components of the kernel. It is
the only module that may interact with the device. It is often written by a third-party vendor, usu-
ally the vendor of the device itself. It does not interact with other drivers, and the kernel may access
it only through a narrow interface. There are many benefits to such an approach:
• We can isolate device-specific code in a separate module.
• It is easy to add new devices.
• Vendors can add devices without kernel source code.
• The kernel has a consistent view of all devices and accesses them through the same inter-
face.
Figure 16-1 illustrates the role of the device driver. User applications communicate with pe-
ripheral devices through the kernel using the system call interface. The I/O subsystem in the kernel
handles these requests. It, in turn, uses the device driver interface to communicate with the devices.
Each layer has a well-defined environment and responsibilities. User applications need not
know whether they are communicating with a device or an ordinary file. A program that writes data
to a file should be able to write the same data to a terminal or serial line without modification or re-
compilation. Hence the operating system provides a consistent, high-level view of the system to
user processes.
The kernel passes all device operations to the I/O subsystem, which is responsible for all
device-independent processing. The I/O subsystem does not know the characteristics of individual
devices. It views devices as high-level abstractions manipulated by the device driver interface and
takes care of issues such as access control, buffering, and device naming.
The driver itself is responsible for all interaction with the device. Each driver manages one
or more similar devices. For example, a single disk driver may manage a number of disks. It alone
knows about the hardware characteristics of the device, such as the number of sectors, tracks, and
heads of a disk, or the baud rates of a serial line.
The driver accepts commands from the I/O subsystem through the device driver interface. It
also receives control messages from the device, which include completion, status, and error notifi-
cations. The device usually gets the driver's attention by generating an interrupt. Each driver has an
interrupt handler, which the kernel invokes when it fields the appropriate interrupt.
[Figure 16-1: The role of the device driver. User processes enter the kernel through the system call interface; the I/O subsystem passes requests across the device driver interface to the drivers, which operate the device controllers and their devices.]
1 Many 80486 machines also have a PCI (Peripheral Component Interconnect) local bus.
sion. Likewise, if it tries to read a register to which it has just written, the value read may be quite
different from what was written. 2
The I/O space of a computer includes the set of all device registers, as well as frame buffers for memory-mapped devices such as graphics terminals. Each register has a well-defined address in the I/O space. These addresses are usually assigned at boot time, using a set of parameters specified in a configuration file used to build the system. The system might assign a range of addresses to each controller, which in turn might allocate space for each device it manages.
There are two ways in which I/O space is configured in a system. On some architectures such as the Intel 80x86, the I/O space is separate from main memory and is accessed by special I/O instructions (such as inb and outb). Others, such as the Motorola 680x0, use an approach called memory-mapped device I/O. This approach maps I/O registers into a part of main memory and uses ordinary memory access instructions to read and write the registers.
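To make the distinction concrete, the fragment below sketches both styles of register access. The port and memory addresses, and the register layout, are invented for illustration and do not correspond to any real device.

/* Port-mapped I/O (Intel 80x86 style): special instructions reach the register. */
unsigned char status = inb(0x3f8 + 5);    /* read a status register at a port address */
outb(0x3f8, 'x');                         /* write a character to a data register */

/* Memory-mapped I/O (Motorola 680x0 style): the register appears at a memory address. */
volatile unsigned char *csr = (volatile unsigned char *) 0xfff01000;  /* invented address */
unsigned char state = *csr;               /* an ordinary load reads the register */
*(csr + 1) = 'x';                         /* an ordinary store writes the data register */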
Likewise, there are two ways of transferring data between the kernel and the device, and the method used depends on the device itself. We can classify devices into two categories based on their data transfer method: Programmed I/O (PIO) and Direct Memory Access (DMA). PIO devices require the CPU to move data to or from the device one byte at a time. Whenever the device is ready
for the next byte, it issues an interrupt. If a device supports DMA, the kernel may give it the loca-
tion (source or destination) of the data in memory, the amount of data to transfer, and other relevant
information. The device will complete the transfer by directly accessing memory, without CPU in-
tervention. When the transfer completes, the device will interrupt the CPU, indicating that it is ready
for the next operation.
Typically, slow devices such as modems, character terminals, and line printers are PIO de-
vices, while disks and graphics terminals are DMA devices. Some architectures such as the SPARC
also support Direct Virtual Memory Access (DVMA), where the device interacts directly with the
MMU to transfer data to virtual addresses. In such a case, a device may directly transfer data to an-
other device without going through main memory.
Devices use interrupts to get the attention of the CPU. Interrupt handling is highly machine-
dependent, but we can discuss a few general principles. Many UNIX systems define a set of inter-
rupt priority levels (ipls). The number of ipls supported is different for each system. The lowest
ipl is zero; in fact, all user code and most of the normal kernel code runs at ipl 0. The highest ipl
is implementation-dependent: Some common values are 6, 7, 15, and 31. If the ipl of an arriving
interrupt is lower than the current ipl of the system, the interrupt is blocked until the system ipl
falls below that of the pending interrupt. This allows the system to prioritize different types of
interrupts.
Each device interrupts at a fixed ipl; usually, all devices on a single controller have the same
ipl. When the kernel handles an interrupt, it first sets the system ipl to that of the interrupt, so as to
block further interrupts from that device (as well as others of the same or lower priority). Moreover,
2 A single register could serve as a control and status register, allowing both reads and writes.
some kernel routines raise the ipl temporarily to block certain interrupts. For instance, the routine
that manipulates the queue of disk block buffers raises the ipl to block out the disk interrupts. Oth-
erwise, a disk interrupt may occur while the queue is in an inconsistent state, confusing the disk
driver.
The kernel uses a set of routines to manipulate the ipl. For instance, spltty() raises the ipl to that of the terminal interrupt. The splx() routine lowers the ipl to a previously saved value. These routines are usually implemented as macros for efficiency.
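A minimal sketch of this usage pattern follows; the queue being protected is hypothetical, and the convention that spltty() returns the previous ipl is the common one, though details vary between implementations.

int s;

s = spltty();                   /* block terminal interrupts while the queue is inconsistent */
enqueue(&tty_input_queue, bp);  /* hypothetical manipulation of a shared queue */
splx(s);                        /* restore the previously saved ipl */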
Usually all interrupts invoke a common routine in the kernel and pass it some information
that identifies the interrupt. This routine saves the register context, raises the ipl of the system to
that of the interrupt, and calls the handler for that interrupt. When the handler completes, it returns
control to the common routine, which restores the ipl to its previous value, restores the saved regis-
ter context, and returns from the interrupt.
How does the kernel identify the correct interrupt handler? This depends on whether the
system supports vectored or polled interrupts. In a completely vectored system, each device pro-
vides a unique interrupt vector number, which is an index into an interrupt vector table. The entries
in the table are pointers to the appropriate interrupt handlers.
On some systems, the interrupt may only supply the ipl. Alternatively, it may supply a vec-
tor, but multiple devices may map to the same vector. In either case, the kernel may have to decide
which of several interrupt handlers to invoke. It maintains a linked list of all handlers that share the
same ipl (or the same vector). When an interrupt arrives, the common routine loops through the
chain and polls each driver. The driver in turn checks if the interrupt was generated by one of its
devices. If so, it handles the interrupt and returns success to the common routine. If not, it returns
failure, and the common routine polls the next device.
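The following sketch shows the shape of such a polling loop; the handler-list structure and return convention are hypothetical, since each implementation defines its own.

struct intr_handler {
        int (*ih_func)(void *);           /* returns nonzero if it claimed the interrupt */
        void *ih_arg;                     /* typically the driver's per-controller data */
        struct intr_handler *ih_next;
};

void
intr_dispatch(struct intr_handler *chain)
{
        struct intr_handler *ih;

        for (ih = chain; ih != NULL; ih = ih->ih_next)
                if ((*ih->ih_func)(ih->ih_arg))   /* driver checks its own devices */
                        return;                   /* handled; stop polling the chain */
        /* no driver claimed the interrupt; it is stray */
}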
It is possible to combine the two methods. Systems that support vectoring may also access
the handlers through a linked list. This provides an easy way of dynamically loading a device driver
into a running system. It also allows vendors to write override drivers, which are installed at the
front of the linked list. Such a driver sits between the device and its default driver. It selectively
traps and handles certain interrupts, and it passes the rest on to the default driver.
Interrupt handling is the most important task of the system, and the handler executes in pref-
erence to any user or system processing. Since the handler interrupts all other activity (except for
higher priority interrupts), it must be extremely quick. Most UNIX implementations do not allow
interrupt handlers to sleep. If a handler needs a resource that might be locked, it must try to acquire
it in a nonblocking way.
These considerations influence what work the interrupt handler must do. On one hand, it
must be short and quick, and hence do as little as possible. On the other hand, it must do enough
work to make sure the device does not idle under a heavy load. For instance, when a disk I/O operation completes, the disk interrupts the system. The handler must notify the kernel of the results of the operation. It must also initiate the next I/O if a request is pending. Otherwise, the disk would
idle until the kernel regained control and started the next request.
Although these mechanisms are common to a large number of UNIX variants, they are far
from universal. Solaris 2.x, for instance, moves away from the use of ipls except in a small number
of cases. It uses kernel threads to handle interrupts and allows such threads to block if needed (see
Section 3.6.5).
physical memory. The null device is a bit sink: it only allows writes and simply discards all data written to it. The zero device is a source of zero-filled memory. Such devices are called pseudodevices.
One important advantage of a pseudodevice driver is that it is often the only way a third-
party vendor can add functionality to a UNIX kernel. UNIX drivers support a general-purpose entry
point called ioctl. This may be invoked with an arbitrary number of driver-specific commands. This
allows a pseudodevice driver to provide a rich set of kernel functions to the user, without actually
modifying the kernel itself.
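As an illustration, the program below shows how an application might use such an entry point; the device name, command, and statistics structure are entirely hypothetical.

#include <stdio.h>
#include <sys/types.h>
#include <sys/ioctl.h>          /* _IOR lives in <sys/ioccom.h> on some systems */
#include <fcntl.h>

struct pdev_stats { long reads; long writes; };          /* hypothetical */
#define PDEV_GETSTATS   _IOR('p', 1, struct pdev_stats)  /* hypothetical command */

int
main(void)
{
        struct pdev_stats st;
        int fd = open("/dev/pseudo0", O_RDONLY);         /* hypothetical pseudodevice */

        if (fd >= 0 && ioctl(fd, PDEV_GETSTATS, &st) == 0)
                printf("%ld reads, %ld writes\n", st.reads, st.writes);
        return 0;
}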
Modern UNIX systems support a third class of drivers, called STREAMS drivers.
STREAMS drivers typically control network interfaces and terminals, and replace character drivers
used in earlier implementations for such devices. For compatibility reasons, the STREAMS driver
interface is derived from that of character drivers, as described in Section 16.3.3.
struct bdevsw {
        int     (*d_open)();
        int     (*d_close)();
        int     (*d_strategy)();
        int     (*d_size)();
        int     (*d_xhalt)();
} bdevsw[];

struct cdevsw {
        int     (*d_open)();
        int     (*d_close)();
        int     (*d_read)();
        int     (*d_write)();
        int     (*d_ioctl)();
        int     (*d_mmap)();
        int     (*d_segmap)();
        int     (*d_xpoll)();
        int     (*d_xhalt)();
        struct streamtab *d_str;
} cdevsw[];
The switch defines the abstract interface. Each driver provides specific implementations of
these functions. The next subsection describes each entry point. Whenever the kernel wants to per-
form an action on a device, it locates the driver in the switch table and invokes the appropriate
function of the driver. For example, to read data from a character device, the kernel invokes the
d_read() function of the device. In the case of a terminal driver, this might dereference to a routine called ttread(). This is further described in Section 16.4.6.
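A simplified sketch of this dispatch follows; the argument list passed through the switch varies between UNIX versions, so the one shown is only representative.

/* Read from a character device: index the switch by major number and
   call through the driver's d_read entry point. */
int major = getmajor(dev);
error = (*cdevsw[major].d_read)(dev, uiop, credp);
/* for a terminal driver using the "tt" prefix, this resolves to ttread() */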
Device drivers follow a standard naming convention for the switch functions. Each driver
uses a two-letter abbreviation to describe itself. This becomes a prefix for each of its functions. For
instance, the disk driver may use the prefix dk and name its routines dkopen(), dkclose(),
dkstrategy(), and dksize().
A device may not support all entry points. For instance, a line printer does not normally allow reads. For such entry points, the driver can use the global routine nodev(), which simply returns the error code ENODEV. For some entry points, the driver may wish to take no action. For instance, many devices perform no special action when closed. In such a case, the driver may use the global routine nulldev(), which simply returns 0 (indicating success).
As mentioned earlier, STREAMS drivers are nominally treated and accessed as character
device drivers. They are identified by the d_str field, which is NULL for ordinary character drivers.
For a STREAMS driver, this field points to a struct streamtab, which contains pointers to
STREAMS-specific functions and data. Chapter 17 discusses STREAMS in detail.
d_mmap()    Not used if the d_segmap() routine is supplied. If d_segmap() is NULL, the mmap system call on a character device calls spec_segmap(), which in turn calls d_mmap(). Checks if the specified offset in the device is valid and returns the corresponding virtual address.
d_xpoll()   Polls the device to check if an event of interest has occurred. Can be used to check if a device is ready for reading or writing without blocking, if an error condition has occurred, and so on.
d_xhalt()   Shuts down the devices controlled by this driver. Called during system shutdown or when unloading a driver from the kernel.
The switch structures vary a little between different UNIX versions. Some variants, for instance, expand the block device switch to include functions such as d_ioctl(), d_read(), and d_write(). Others include functions for initialization or for responding to bus resets.
Except for d_xhalt() and d_strategy(), all of the above are top-half routines. d_xhalt() is called during shutdown and, hence, cannot assume any user context or even the presence of interrupts. It therefore must not sleep.
The d_strategy() operation is special for several reasons. It is frequently invoked to read or write buffers that are not relevant to the calling process. For instance, a process trying to allocate a free buffer finds that the first buffer on the freelist is dirty, and invokes the strategy routine to flush it to disk. Having issued the write, the process allocates the next free buffer (assuming it is clean) and proceeds to use it. It has no further interest in the buffer that is being written, nor does it need to wait for the write to complete. Moreover, disk I/O operations are often asynchronous (as in this example), and the driver must not block the caller.
Hence d_strategy() is treated as a bottom-half routine. It initiates the I/O operation and returns immediately without waiting for I/O completion. If the device is busy when the request arrives, d_strategy() simply adds the request to an internal queue and returns. Eventually, other bottom-half routines invoked from the interrupt code will dequeue and execute the request. If the caller needs to wait for the I/O to complete, it does so outside the d_strategy() routine.
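A sketch of an asynchronous call through this entry point appears below; the flag and field names follow common SVR4 usage but may differ in detail between implementations.

/* Issue a write for bp without waiting for it to complete. */
void
start_async_write(struct buf *bp, dev_t dev)
{
        bp->b_edev = dev;
        bp->b_flags &= ~B_READ;        /* this is a write request */
        bp->b_flags |= B_ASYNC;        /* the caller will not wait for completion */
        (*bdevsw[getmajor(dev)].d_strategy)(bp);
        /* a synchronous caller would omit B_ASYNC and later call biowait(bp) */
}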
The driver entry points for interrupt handling and initialization are typically not accessed
through the switch table. Instead, they are specified in a master configuration file, which is used to
build the kernel. This file contains an entry for each controller and driver. The entry also contains
information such as the ipl, interrupt vector number, and the base address of the CSRs for the driver.
The specific contents and format of this file are different for each implementation.
SVR4 defines two initialization routines for each driver: init and start. Each driver registers
these routines in the io_init[] and io_start[] arrays, respectively. The bootstrapping code in-
vokes all init functions before initializing the kernel and all start functions after the kernel is initial-
ized.
16.4 The I/O Subsystem
The I/O subsystem is the portion of the kernel that controls the device-independent part of I/O operations and interacts with the device drivers to handle the device-dependent part. It is also responsible for device naming and protection, and for providing user applications with a consistent interface to all devices.
The kernel passes the device number as an argument to the driver's d_open() routine. The
device driver maintains internal tables to translate the minor device number to specific CSRs or
controller port numbers. It extracts the minor number from dev and uses it to access the correct device.
A single driver may be configured with multiple major numbers. This is useful if the driver
manages different types of devices that perform some common processing. Likewise, a single de-
vice may be represented by multiple minor numbers. For example, a tape drive may use one minor
number to select an auto-rewind mode and another for no-rewind mode. Finally, if a device has both
a block and a character interface, it uses separate entries in both switch tables, and hence separate
major numbers for each.
In earlier UNIX releases, dev_t was a 16-bit field, with 8 bits each for the major and minor
numbers. This imposed a limit of 256 minor devices for a major device type, which was too restric-
tive for some systems. To circumvent that, drivers used multiple major device numbers that mapped
to the same major device. Drivers also used multiple major numbers if they controlled devices of
different types.
Another problem is that the switch tables may grow very large if they contain entries for
every possible device, including those that are not connected to the system or whose drivers are not
linked with the kernel. This happens because vendors do not want to customize the switch table for
each different configuration they ship, and hence tend to throw everything into the switches.
SVR4 makes several changes to address this problem. The dev_t type is 32 bits in size,
usually divided into 14 bits for a major number and 18 for a minor number. It also introduces the
notion of internal and external device numbers. The internal device numbers identify the driver and
serve as indexes into the switches. The external device numbers form the user-visible representation
of the device and are stored in the i_rdev field of the inode of the device special file (see Section
16.4.2).
On many systems, such as the Intel x86, the internal and external numbers are identical. On
systems that support autoconfiguration, such as the AT&T 3B2, the two are different. On these sys-
tems, the bdevsw[] and cdevsw[] are built dynamically when the system boots and only contain
entries for the drivers that are configured into the system. The kernel maintains an array called
MAJOR[], which is indexed by the external major number. Each element of this array stores the cor-
responding internal major number.
The mapping between external and internal major numbers is many-to-one. The kernel provides the macros etoimajor() and itoemajor() to translate between the two numbers. Because the mapping is many-to-one, the itoemajor() macro must be called repeatedly to generate all possible external major numbers. Likewise, there are internal and external minor numbers. For instance, if a driver supports two external major numbers with eight devices on each, they would internally map to minor numbers 0 to 15 for the single internal major number.
The getmajor() and getminor() macros return internal device numbers. The getemajor() and geteminor() macros return external device numbers.
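The sketch below shows how a driver typically uses these macros in its open entry point; the argument list follows the SVR4 DDI/DKI convention, and xx_nunits is a hypothetical configuration limit.

int
xxopen(dev_t *devp, int flag, int otyp, cred_t *crp)
{
        int unit = getminor(*devp);    /* internal minor number selects the device */

        if (unit >= xx_nunits)
                return ENXIO;          /* no such device configured */
        /* ... device-specific open processing ... */
        return 0;
}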
the kernel to translate from the user-level device name (the pathname) to the internal device name
(the <major, minor> pair). The translation mechanism is further described in the next section.
The device file cannot be created in the usual way. Only a superuser may create a device file,
using the privileged system call

        mknod(path, mode, dev);

where path is the pathname of the special file, mode specifies the file type (IFBLK or IFCHR) and permissions, and dev is the combined major and minor device number. The mknod call creates a special file and initializes the di_mode and di_rdev fields of the inode from the arguments.
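For example, a superuser process could create a character special file somewhat as follows; the pathname and device numbers are invented, and S_IFCHR is the modern spelling of IFCHR.

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mkdev.h>   /* makedev(); found in <sys/sysmacros.h> on some systems */

mknod("/dev/xx0", S_IFCHR | 0600, makedev(12, 0));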
Unifying the file and device name space has great advantages. Device I/O uses the same set of system calls as file I/O. Programmers may write applications without worrying about whether the
input or output is to a device or to a file. Users see a consistent view of the system and may use de-
scriptive character-string names to reference devices.
Another important benefit is access control and protection. Some operating systems such as
DOS allow all users unrestricted access to all devices, whereas some mainframe operating systems
allow no direct access to devices. Neither scheme is satisfactory. By unifying the file system and
device name space, UNIX transparently extends the file protection mechanism to devices. Each de-
vice file is assigned the standard read/write/execute permissions for owner, group, and others. These
permissions are initialized and modified in the usual way, just as for files. Typically, some devices
such as disks are directly accessible only by the superuser, while others such as tape drives may be
accessed by all.
subsystem must ensure that, when a user opens a device file, he or she acquires a reference to the specfs vnode, and all operations to the file are routed to it.
To see how this happens, let us take an example where a user opens the file /dev/lp. The directory /dev is in the root file system, which is of type ufs. The open system call translates the pathname by repeatedly calling ufs_lookup(), first to locate the vnode for dev, then the vnode for lp. When ufs_lookup() obtains the vnode for lp, it finds that the file type is IFCHR. It then extracts the major and minor device numbers from the inode and passes them to a routine called specvp().
The specfs file system keeps all snodes in a hash table, indexed by the device numbers. specvp() searches the hash table and, if the snode is not found, creates a new snode and vnode. The snode has a field called s_realvp, in which specvp() stores a pointer to the vnode of /dev/lp. Finally, it returns a pointer to the specfs vnode to ufs_lookup(), which passes it back to the open system call. Hence open sees the specfs vnode and not the vnode of the file /dev/lp. The specfs vnode shadows the vnode of /dev/lp, and its v_op field points to the vector of specfs operations (such as spec_read() and spec_write()), which in turn call the device driver entry points. Figure 16-3 illustrates the resulting configuration.
Before returning, open invokes the VOP_OPEN operation on the vnode, which calls spec_open() in the case of a device file. The spec_open() function calls the d_open() routine of the driver, which performs the necessary steps to open the device. The term snode refers to shadow node. In effect, the specfs vnode shadows the "real" vnode and intercepts all operations on it.
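A condensed sketch of what specvp() does follows; the hash-table helpers and the STOV() conversion are hypothetical names standing in for the implementation's own.

struct vnode *
specvp(struct vnode *realvp, dev_t dev, vtype_t type, struct cred *cr)
{
        struct snode *sp;

        if ((sp = stable_lookup(dev, type)) == NULL) {   /* hypothetical hash lookup */
                sp = snode_alloc();                      /* new snode plus its vnode */
                sp->s_realvp = realvp;                   /* remember the /dev file's vnode */
                sp->s_dev = dev;
                stable_insert(sp);
        }
        return STOV(sp);     /* return the shadowing specfs vnode to the caller */
}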
Another problem involves page addressing. In SVR4, the name of a page in memory is de-
fined by the vnode that owns the page and the offset of the page in the file. For a page associated
with a device (such as memory-mapped frame buffers or disk blocks accessed through the raw inter-
face), the name is ambiguous if multiple files refer to the same device. Two processes accessing the
device through different device files could create two copies of the same page in memory, resulting
in a consistency problem.
When we have multiple file names for the same device, we can classify device operations
into two groups. Most of the operations are independent of the file name used to access the device,
and thus can be funneled through a common object. At the same time, there are a few operations
that depend on the file used to access the device. For instance, each file may have a different owner
and permissions; therefore, it is important to keep track of the "real" vnode (that of the device file)
and route those operations to it.
The specfs file system uses the notion of a common snode to allow both types of operations.
Figure 16-4 describes the data structures. Each device has only one common snode, created when
the device is first accessed. There is also one snode for each device file. The snodes of all files rep-
resenting the same device share the common snode and reference it through the s_commonvp field.
The first time a user opens a device file for a particular device, the kernel creates an snode
and a common snode. Subsequently, if another user opens the same file, it will share these objects.
If a user opens another file that represents the same device, the kernel will create a new snode,
which will reference the common snode through its s_commonvp field. The common snode is not directly associated with a device file; hence, its s_realvp field is NULL. Its s_commonvp field points to itself.
[Figure 16-4: Device file snodes and the common snode; each device file's snode references the device's single common snode through its s_commonvp field.]
When a user makes a read system call, for example, the kernel dereferences the file descriptor to access the struct file and, from it, the vnode of the file (which is part of the snode of the device). It performs some validation, such as making sure the file is open for reading. It then invokes the VOP_READ operation on the vnode, which results in a call to spec_read().
The spec_read() function checks the vnode type and finds that it is a character device. It looks up the cdevsw[] table, indexing by the major device number (which is stored in v_rdev). If
the device is a STREAMS device, it calls strread() to perform the operation. For a character de-
vice, it calls the d_read() routine of the device, passing it the uio structure containing all the pa-
rameters of the read, such as the destination address in the user space and the number of bytes to be
transferred.
Since d_read() is a synchronous operation, it can block the calling process if the data is not
immediately available. When the data arrives, the interrupt handler wakes up the process, which
copies it to user space. d_read() calls the kernel function uiomove() to copy data to user space. uiomove() must verify that the user has write access to the locations to which the data is being
copied. Otherwise, a careless or malicious user could overwrite his or her text segment, or even the
kernel address space. When the transfer completes, the kernel returns the count of bytes actually
read to the user.
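The following sketch shows the shape of a simple character driver read routine built around uiomove(); xx_getc() and the unit bookkeeping are hypothetical, and the entry-point argument list follows the SVR4 DDI/DKI convention.

int
xxread(dev_t dev, struct uio *uiop, cred_t *crp)
{
        int unit = getminor(dev);
        char c;

        while (uiop->uio_resid > 0) {
                if (xx_getc(unit, &c) != 0)       /* hypothetical: may sleep until data arrives */
                        return EIO;
                if (uiomove(&c, 1, UIO_READ, uiop) != 0)
                        return EFAULT;            /* bad user address */
        }
        return 0;
}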
16.5 The poll System Call
The syntax of the poll system call is

        poll(fds, nfds, timeout);

where fds points to an array of size nfds, whose elements are described by

        struct pollfd {
                int    fd;         /* file descriptor */
                short  events;     /* events of interest */
                short  revents;    /* returned events */
        };
For each descriptor, events specifies which events are of interest to the caller, and on return,
revents contains the events that have occurred. Both values are bitmasks. The types of defined
events include POLLIN (data may be read without blocking), POLLOUT (data may be written without
blocking), POLLERR (an error has occurred on the device or stream), POLLHUP (a hang-up has oc-
curred on the stream), and others. Hence, in normal usage, poll checks if a device is ready for I/O or
has encountered an error condition.
poll examines all the specified descriptors. If any event of interest has occurred, it returns
immediately after examining all descriptors. If not, it blocks the process until any interesting event
occurs. When it returns, the revents field of each pollfd shows which, if any, of the events of in-
terest have occurred on that descriptor. poll also returns if timeout milliseconds expire, even if no
events have occurred. If timeout is 0, poll returns immediately. If timeout is INFTIM or -1, poll
returns only when an event of interest occurs (or the system call is interrupted). The return value of
poll equals the number of events that have occurred, or 0 if the call times out, or -1 if it fails for
another reason.
In our example, the server can issue a poll system call, specifying the POLLIN flag for each
descriptor. When the call returns with a value greater than 0, the server knows that a message has
arrived on at least one connection and examines the pollfd structures to find which ones. It can
then read the message from that descriptor, process it, and poll again for new messages.
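The server loop might look somewhat like the sketch below; NCONN, connfd[], and handle_message() are hypothetical.

#include <poll.h>       /* <stropts.h> on some older SVR4 systems */

struct pollfd fds[NCONN];
int i, n;

for (i = 0; i < NCONN; i++) {
        fds[i].fd = connfd[i];
        fds[i].events = POLLIN;        /* interested only in incoming data */
}

n = poll(fds, NCONN, -1);              /* block until at least one message arrives */
for (i = 0; n > 0 && i < NCONN; i++)
        if (fds[i].revents & POLLIN)
                handle_message(fds[i].fd);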
16.5.1 poll Implementation
Although the descriptors passed to poll may refer to any files, they are normally used for character
or STREAMS devices, and we focus on this case here. The tricky part of poll is to block a process
in such a way that it can be woken up when any one of a set of events occurs. To implement this,
the kernel uses two data structures: pollhead and polldat. The struct pollhead is associated with a device file. It maintains a queue of polldat structures. Each polldat structure identifies a blocked process and the events on which it is blocked. A process that blocks on multiple devices has one struct polldat for each device, and they are chained together as shown in Figure 16-5.
The poll system call first loops through all the specified descriptors and invokes the
VOP_POLL operation on the associated vnodes. The syntax for this call is

        VOP_POLL(vp, events, anyyet, &revents, &php);

where vp is a pointer to the vnode, events is a bitmask of the events to poll for, and anyyet is the number of events of interest already detected by the poll system call on other descriptors. On return, revents contains the set of events that has already occurred and php contains a pointer to a struct pollhead.
In the case of a character device, the VOP_POLL operation is implemented by spec_poll(), which indexes the cdevsw[] table and calls the d_xpoll() routine of the driver. This routine checks if a specified event is already pending on the device. If so, it updates the revents mask and returns. If no event is pending and if anyyet is zero, it returns a pointer to the pollhead structure for the device. Character drivers typically allocate a pollhead for each minor device they manage.
On return from VOP_POLL on a device, poll checks revents and anyyet. If both are zero, no events of interest are pending on the devices checked so far. poll obtains the pollhead pointer from php, allocates a polldat structure, and adds it to the pollhead's queue. It stores a pointer to the
proc structure and the mask of events for this device in the polldat, and chains it to other polldat structures for the same process.
If a device returns a nonzero value in revents, it means an event is already pending and poll does not need to block the process. In this case, poll removes all the polldat structures from the pollhead queues and frees them. It increments anyyet by the number of events set in revents. When it polls the next device, the driver will find anyyet to be nonzero and will not return a pollhead structure.
If no specified event is pending on any device, poll blocks the process. The drivers, meanwhile, maintain information about the events on which any process is waiting. When such an event occurs, the driver calls pollwakeup(), passing it the event and the pointer to the pollhead for that device. pollwakeup() goes through the polldat queue in the pollhead and wakes up every process waiting for that event. For each such process, it also traverses its polldat chain and removes and releases each polldat from its pollhead queue.
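To tie the pieces together, the sketch below shows a driver's d_xpoll entry point and the matching wakeup from its interrupt path; the softc layout and field names are hypothetical, and the argument list follows common SVR4 usage.

int
xxpoll(dev_t dev, short events, int anyyet, short *reventsp, struct pollhead **phpp)
{
        struct xx_softc *sc = &xx_softc[getminor(dev)];
        short revents = 0;

        if ((events & POLLIN) && sc->rx_count > 0)
                revents |= POLLIN;              /* data is ready to be read */
        if ((events & POLLOUT) && sc->tx_room > 0)
                revents |= POLLOUT;             /* output would not block */

        *reventsp = revents;
        if (revents == 0 && !anyyet)
                *phpp = &sc->xx_pollhead;       /* let poll queue a polldat on our pollhead */
        return 0;
}

/* later, in the receive interrupt handler, once data has arrived: */
/*      pollwakeup(&sc->xx_pollhead, POLLIN);                       */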
Each file system and device must implement polling. Ordinary file systems such as ufs and s5fs do so by calling the kernel routine fs_poll(), which simply copies the flags from events into revents, and returns. This causes poll to return immediately without blocking. Block devices usually do the same thing. STREAMS devices use a routine called strpoll(), which implements polling for any generic stream.
The syntax of the select call is

        select(nfds, readfds, writefds, exceptfds, timeout);

where readfds, writefds, and exceptfds are pointers to descriptor sets for read, write, and ex-
ception events respectively. In 4.3BSD, each descriptor set is an integer array of size nfds, with
nonzero elements specifying descriptors whose events are of interest to the caller. For example, a
user wishing to wait for descriptors 2 or 4 to be ready for reading will set elements 2 and 4 in
readfds and clear all other elements in the three sets.
The timeout argument points to a struct timeval, which contains the maximum time to
wait for an event. If this time is zero, the call checks the descriptor and returns immediately. If
timeout itself is NULL, the call blocks indefinitely until an event occurs on a specified descriptor.
Upon return, select modifies the descriptor sets to indicate the descriptors on which the specified
events have occurred. The return value of select equals the total number of descriptors that are
ready.
Most modern UNIX systems support select, either as a system call or as a library routine. Many of these implement the descriptor set in different ways, most commonly as a bitmask. To hide the details of the implementation, each system provides the POSIX-compliant FD_SET(), FD_CLR(), FD_ISSET(), and FD_ZERO() macros to manipulate descriptor sets. The constant FD_SETSIZE defines the default size of the descriptor set. It equals 1024 on most systems (including SVR4).
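A small example matching the earlier scenario (waiting for descriptors 2 or 4 to become readable) might look like this; header locations for the FD_* macros vary between systems.

#include <sys/types.h>
#include <sys/time.h>

fd_set readfds;
int n;

FD_ZERO(&readfds);
FD_SET(2, &readfds);
FD_SET(4, &readfds);

n = select(5, &readfds, NULL, NULL, NULL);    /* NULL timeout: block indefinitely */
if (n > 0 && FD_ISSET(2, &readfds))
        ;                                      /* descriptor 2 is ready for reading */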
The implementation of select is similar to that of poll in some respects. Each descriptor of interest must correspond to an open file, otherwise the call fails. An ordinary file or a block device is always considered ready for I/O, and select is really useful only for character device files. For each such descriptor, the kernel calls the d_select routine of the appropriate character driver (the BSD counterpart of the d_xpoll entry point) to check if the descriptor is ready. If not, the routine records that the process has selected an event on the descriptor. When the event occurs, the driver must arrange to wake up the process, which then checks all descriptors once again.
The 4.3BSD implementation of select is complicated by the fact that the drivers can record only a single selecting process. If multiple processes select on the same descriptor, there is a collision. Checking for and handling such collisions may result in spurious wakeups.
Figure 16-6 describes the various stages in the handling of a block read operation (the algo-
rithm for writes is similar). In all cases, the kernel uses the page fault mechanism to initiate the read.
The fault handler fetches the page from the vnode associated with the block. The vnode, in turn,
calls the d_strategy() routine of the device driver to read the block.
A file may reside in many different places-on a local hard disk, on local removable media
such as a CD-ROM or floppy disk, or on another machine on the network. In the last case, I/O oc-
curs through network drivers, which are often STREAMS devices. This section only considers files
on local hard disks. It begins by describing the buf structure, then examines the different ways in
which the I/O subsystem accesses the block devices.
about a cached block. In modern UNIX systems such as SVR4, the buffer cache only manages file
metadata blocks, such as those containing inodes or indirect blocks (see Section 9.2.2). It caches the
most recently used blocks, in the expectation that they are more likely to be needed again soon, be-
cause of the locality of reference principle (see Section 13.2.6). A struct buf is associated with
each such block; it contains the following additional fields used for cache management:
• A pointer to the vnode of the device file.
• Flags that specify whether the buffer is free or busy, and whether it is dirty (modified).
• The aged flag, which is explained in the following paragraph.
• Pointers to keep the buffer on an LRU freelist.
• Pointers to chain the buffer in a hash queue. The hash table is indexed by the vnode and
block number.
The aged flag requires elaboration. When a dirty buffer is released, the kernel puts it at the
end of the freelist. Eventually, it migrates to the head of the list, unless it is accessed in the interim.
When the buffer reaches the head of the list, a process may try to allocate it and notice that it must
be written back to disk first. Before issuing the write, it sets the aged flag on the buffer, indicating
that the buffer has already traversed the freelist. Such an aged buffer must be reused before buffers
that have been put on the freelist for the first time, since the aged buffers have been unreferenced for
a longer time. Hence, when the write completes, the interrupt handler releases the aged buffer to the
head of the freelist instead of the tail.
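The effect of the aged flag on buffer release can be sketched as follows; the freelist helpers are hypothetical, though B_AGE and B_BUSY are the conventional flag names.

void
brelse(struct buf *bp)
{
        if (bp->b_flags & B_AGE)
                freelist_insert_head(bp);      /* aged buffers are reused first */
        else
                freelist_insert_tail(bp);      /* others age their way to the head */
        bp->b_flags &= ~(B_BUSY | B_AGE);
}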
cates a page in which to read the block and associates a buf structure with it. It obtains the disk's
device number from the file's inode (which it accesses through the vnode). Finally, it calls the
d_strategy() routine of the disk driver to perform the read, passing it a pointer to the buf, and
waits for the read to complete (the calling process sleeps). When the I/O completes, the interrupt
handler wakes up the process. ufs_getpage() also takes care of some other details, such as issuing
read-aheads when necessary.
In the case when the block does not contain file data, it is associated with the vnode of the device file. Hence the spec_getpage() function is invoked to read the block. It too searches memory to check if the block is already there, and issues a disk read otherwise. Unlike a regular file, spec_getpage() does not need to convert from logical to physical block numbers, since its block numbers are already device-relative.
Pageout Operations
Every page in the pageable part of memory has a struct page associated with it. This structure has
fields to store the vnode pointer and offset that together name the page. The virtual memory subsys-
tem initializes the fields when the page is first brought into memory.
The pagedaemon periodically flushes dirty pages to disk. It chooses the pages to flush based
on their usage patterns (following a not recently used algorithm, described in Section 13.5.2), so as
to keep the most useful pages in memory. There are several other kernel operations that result in
writing pages out to disk, such as swapping out an entire process or calling fsync for a file.
To write a page back to disk, the kernel locates the vnode from the page structure and in-
vokes its VOP_PUTPAGE operation. If the page belongs to a device file, this results in a call to
spec_putpage(), which obtains the device number from the vnode and calls the d_strategy()
routine for that device.
If the page belongs to an ordinary file, the operation is implemented by the corresponding
file system. The ufs_putpage() function, for example, writes back pages of ufs files. It calls
ufs_bmap() to compute the physical block number, then calls the d_strategy() routine for the
device (getting the device number from the inode, which it accesses through the vnode).
ufs_putpage() also handles optimizations such as clustering, where it gathers adjacent dirty pages and writes them out in the same I/O request.
ments (seg_vn), the fault is handled by the segvn_fault() routine. It invokes the VOP_GETPAGE op-
eration on the vnode of the file, which is pointed to by the private data of the segment.
Likewise, when a process modifies a page to which it has a shared mapping, the page must
be written back to the underlying file. This usually happens when the pagedaemon flushes the page,
as previously described.
In SVR4, reads and writes to an ordinary file go through the seg_map driver. When a user invokes the read system call, for example, the kernel dereferences the file descriptor to get the file structure, and from it the vnode of the file. It invokes the VOP_READ operation on the vnode, which is implemented by a file-system-dependent function, such as ufs_read() for ufs files. ufs_read() performs the read as follows:
1. Calls segmap_getmap() to create a kernel mapping for the required range of bytes in the file. This function returns the kernel address to which it maps the data.
2. Calls uiomove() to transfer the data from the file to user space. The source address for the transfer is the kernel address obtained in the previous step.
3. Calls segmap_release() to free the mapping. The seg_map driver caches these mappings in LRU order, in case the same pages are accessed again soon.
If the page is not already in memory, or if the kernel does not have a valid hardware address translation to it, uiomove() causes a page fault. The fault handler determines that the page belongs to the seg_map segment, and calls segmap_fault() to fetch the page. segmap_fault() invokes the VOP_GETPAGE operation on the vnode, which retrieves the page from disk if necessary, as described above.
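Put together, the read path inside ufs_read() resembles the following sketch; segkmap, MAXBSIZE, MAXBMASK, MAXBOFFSET, and MIN follow common SVR4 usage, and the exact segmap_getmap()/segmap_release() argument lists vary between releases.

caddr_t base;
u_int mapoff = uiop->uio_offset & MAXBOFFSET;            /* offset within the mapped chunk */
u_int nbytes = MIN(MAXBSIZE - mapoff, uiop->uio_resid);

base  = segmap_getmap(segkmap, vp, uiop->uio_offset & MAXBMASK);  /* step 1: map the range */
error = uiomove(base + mapoff, nbytes, UIO_READ, uiop);           /* step 2: copy to user space */
(void) segmap_release(segkmap, base, 0);                          /* step 3: release the mapping */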
A user may directly access a block device, assuming he or she has the appropriate permissions, by
issuing read or write system calls to its device file. In such a case, the kernel dereferences the file
descriptor to access the file structure, and from that, the vnode. It then invokes the VOP_READ or
VOP_WRITE operation on the vnode, which in this case, calls the spec_read() or spec_write() functions. These functions operate much like the corresponding ufs functions, calling segmap_getmap(), uiomove(), and finally segmap_release(). Hence the actual I/O occurs as a result
of page faults or page flushes, just as in the previous cases.
Alternatively, a user could map a block device into its address space with the mmap system
call. In that case, reads to mapped locations would cause page faults, which would be handled by the
seg_vn (vnode segment) driver. The segvn _fault() routine would invoke the VOP_GETPAGE opera-
tion on the vnode, which would result in a call to spec_getpage(). This would call the d_strategy() routine of the device, if the page is not already in memory. Writes would be similarly handled by spec_putpage().
allocation, and synchronization. Moreover, since multiple, independently written drivers coexist in
the kernel and may be active concurrently, it is important that they not interfere with each other or
with the kernel.
To reconcile the goals of independent driver development and peaceful coexistence, the in-
terface between the kernel and the driver must be rigorously defined and regulated. To achieve this,
SVR4 introduced the Device-Driver Interface/Driver-Kernel Interface (DDI/DKI) specification
[UNIX 92b], which formalizes all interactions between the kernel and the driver.
The interface is divided into several sections, similar to the organization of the UNIX man
pages. These sections are:
• Section 1 describes the data definitions that a driver needs to include. The way in which
the kernel accesses this information is implementation-specific and depends on how it
handles device configuration.
• Section 2 defines the driver entry point routines. It includes the functions defined in the
device switches, as well as interrupt handling and initialization routines.
• Section 3 specifies the kernel routines that the driver may invoke.
• Section 4 describes the kernel data structures that the driver may use.
• Section 5 contains the kernel #define statements that a driver may need.
The interface is divided into three parts:
• Driver-kernel - This is the largest part of the interface. It includes the driver entry
points and the kernel support routines.
• Driver-hardware - This part describes routines that support interactions between the
driver and the device. These routines are highly machine-dependent, but many of them are
defined in the DDI/DKI specification.
• Driver-boot - This part deals with how a driver is incorporated into the kernel. It is not
contained in the DDI/DKI specification, but is described in various vendor-specific device
driver programming guides [Sun 93].
The specification also describes a number of general-purpose utility functions that provide
services such as character and string manipulation. These are not considered a part of the DDI/DKI
interface.
Each function in the interface is assigned a commitment level, which may be 1 or 2. A level-1 function will remain in future revisions of the DDI/DKI specification and will only be modified in upward-compatible ways. Hence code written using level-1 functions will be portable to future SVR4 releases. The commitment to support level-2 routines, however, is limited to three years after a routine enters level 2. Each such routine has an entry date associated with it. After three years, new revisions of the specification may drop the routine entirely or modify it in incompatible ways.
A level-1 routine may contain some features that are defined as level-2. Further, the entire routine may be moved to level 2 in a new release of the specification (for example, the rminit() function, discussed in Section 16.7.2). The date of that release becomes the entry date for that routine, and it will continue to be supported as defined for a minimum of three more years.
[Figure: Arguments to uiomove(). The uio structure's uio_iov field points to an array of iovec structures (uio_iovcnt of them), each giving an iov_base address and iov_len length; uiomove() copies nbytes between the data described by the iovecs, in user or kernel space, and a destination buffer, in the direction given by the rwflag argument (here UIO_WRITE).]
Finally, the section describes the prefixinfo structure that must be supplied by STREAMS drivers.
Section 2 specifies the driver entry points described earlier in this chapter. These include all the switch functions, as well as the interrupt routine and the initialization functions prefixinit() and prefixstart().
Section 4 describes data structures shared between the kernel and the drivers. These include
the buf structure, described in Section 16.6.1, and the uio and iovec structures, described in Sec-
tion 8.2.5. The rest of the structures are used by STREAMS (see Chapter 17) and by the machine-
specific DMA interface.
Section 5 contains the relevant kernel #define values. These include errno values (error
codes), STREAMS messages, and signal numbers.
3 Some vendors provided loadable drivers long before SVR4. Sun Microsystems, for instance, had them in SunOS 4.1. OSF/1 has this feature as well.
D_RDWEQ    Device accesses require strict equality under the mandatory access control policy.
handler and the switch table entries, and removing all references to the driver from the rest of the
kernel.
SVR4.2 provides a set of facilities to perform all the above tasks. It adds the following rou-
tines to the DDI/DKI specification:
prefix_load()
This section 2 routine must be provided by the driver. It performs driver initialization, and
the kernel invokes it when the driver is loaded. It handles the tasks usually performed by
the init and start routines, since those functions are not invoked when a driver is dynami-
cally loaded. It allocates memory for private data, initializing various data structures. It
then calls the mod_drvattach() routine to install the interrupt handler and, finally, initial-
izes all devices associated with this driver.
prefix_unload()
This section 2 routine must be provided by the driver. The kernel invokes it to handle
driver cleanup when unloading the driver. Typically, it undoes the actions of the prefix_load() routine. It calls mod_drvdetach() to disable and uninstall interrupts for this
driver, releases memory it had allocated, and performs any necessary shutdown on the
driver or its devices.
mod_drvattach()
This is a kernel-supplied, section 3 routine. It installs the interrupt handler for the driver
and enables interrupts from the driver's devices. It must be called from prefix_load() with a single argument, a pointer to the driver's prefixattach_info structure. This struc-
ture is defined and initialized by the kernel's configuration tools when the driver is con-
figured. It is opaque to the driver, and the driver must not attempt to reference any of its
fields.
mod_drvdetach()
This is a kernel-supplied, section 3 routine, which disables interrupts to the driver and uninstalls its interrupt handler. It must be called from prefix_unload() with the prefixattach_info pointer as an argument.
Wrapper Macros
The DDI/DKI specification supplies a set of macros that generate wrapper code for a loadable module. There is one macro for each type of module. Each macro takes five arguments. For MOD_DRV_WRAPPER, the syntax is

        MOD_DRV_WRAPPER(prefix, load, unload, halt, desc)

where prefix is the driver prefix, load and unload are the names of the prefix_load() and prefix_unload() routines, halt is the name of the driver's prefix_halt() routine, if any, and desc is a character string that describes the module. The wrapper code arranges to call prefix_load() when the driver is loaded and prefix_unload() when the driver is unloaded. The moddefs.h file defines the wrapper macros.
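As an illustration, a loadable driver using the prefix xx might declare its wrapper somewhat as follows; the routine names and description string are invented.

#include "moddefs.h"

MOD_DRV_WRAPPER(xx, xx_load, xx_unload, xx_halt, "xx sample device driver");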
There are two ways in which a driver may be loaded into the SVR4.2 kernel [UNIX 92a]. A
user, typically the system administrator, may explicitly load and unload the driver using the modload and moduload system calls. Alternatively, the system may automatically load the driver on first
reference. For instance, the kernel loads a STREAMS module (see Section 17.3.5) the first time it is
pushed onto a stream. If a module remains inactive for a time greater than a tunable parameter
called DEF_UNLOAD_DELAY, it becomes a candidate for unloading. The kernel may unload such a
module automatically if there is a memory shortage. Modules may override DEF_UNLOAD_DELAY by
specifying their own unload delay value in the configuration file.
The SVR4.2 dynamic loading facility is powerful and beneficial, but it has some limitations.
It requires explicit support in the driver; hence, older drivers cannot use this facility transparently.
[Konn 90] describes a version of dynamic loading that does not suffer from these limitations. It al-
lows transparent loading of the driver when first opened, and optionally, transparent unloading on
the last close. When the system boots, it inserts a special open routine in the cdevsw[] and
bdevsw[] for all major numbers that have no driver configured. When the device is first opened,
this routine loads the driver, then calls the driver's real open routine. Since the loading process up-
dates the device switch table, subsequent opens will directly call the driver's open routine. Like-
wise, the autounload on close feature uses a special close routine that is installed in the device
switch entry if the driver specifies an autounload flag in the configuration file. This routine first calls
the real close routine, then checks to see if all the minor devices are closed. If so, it unloads the driver.
The [Konn 90] implementation is transparent to the driver, since it uses the driver's init and
start routines instead of special load and unload routines. It works with well-behaved drivers that
satisfy certain requirements. The driver should not have any functions or variables directly refer-
enced by the kernel or other drivers. On the last close, it should perform complete cleanup and leave
no state behind. This includes releasing any kernel memory, canceling pending timeouts, and so on.
In spite of this, each device manufacturer must provide his or her own drivers. Each must
implement all the entry points for the driver, which constitute a very high-level interface. Each
driver contains not only the ASIC-dependent code, but also the high-level, ASIC-independent code.
As a result, different drivers of the same device class duplicate much of the functionality. This leads
to a lot of wasted effort and also increases the size of the kernel unnecessarily.
UNIX has addressed this problem in several ways. STREAMS provides a modular way of
writing character drivers. Each stream is built by stacking a number of modules together, each
module performing a very specific operation on the data. Drivers can share code at the module level,
since multiple streams may share a common module.
SCSI devices offer another possibility for code sharing. A SCSI controller manages many
different types of devices, such as hard and floppy disks, tape drives, audio cards, and CD-ROM
drives. Each SCSI controller has a number of controller-specific features and thus requires a differ-
ent driver. Each device type, too, has different semantics and hence must be processed differently. If
we have m different types of controllers and n different types of device, we may have to write m x n
drivers.
It is preferable, however, to divide the code into a device-dependent and a controller-
dependent part. There would be one device-dependent module for each device type, and one control-
ler-dependent driver for each controller. This requires a well-defined interface between the two
pieces, with a standard set of commands that is understood by each controller and issued by each
device-dependent module.
There are several efforts to create such a standard. SVR4 has released a Portable Device In-
terface (PDI), consisting of the following:
• A set of section 2 functions that each host bus adapter must implement.
• A set of section 3 functions that perform common tasks required by SCSI devices, such as
allocating command and control blocks.
• A set of section 4 data structures that are used by the section 3 functions.
Two other interfaces aimed at a similar layering are SCSI CAM (Common Access Method)
[ANSI 92b], supported by Digital UNIX [DEC 93] and ASPI (Adaptec SCSI Peripherals Interface),
popular in the personal computer world.
The I/O subsystem in Mach 3.0 [Fori 91] extends such layering to all devices on a case-by-
case basis. It optimizes the code sharing for each device class, providing device-independent mod-
ules to implement the common code. It also moves the device-independent processing to the user
level, thereby reducing the size of the kernel. It also provides location transparency by implement-
ing device interactions as IPC (interprocess communication) messages. This allows users to trans-
parently access devices on remote machines. This interface, however, is incompatible with the
UNIX driver framework.
16.10 Summary
This chapter examines the device driver framework and how the I/O subsystem interacts with the
devices. It describes the SVR4 DDI/DKI specification, which allows vendors to write drivers that
will be portable across future SVR4 releases. In addition, it describes several recent features in the
driver framework, such as support for multiprocessors and dynamic loading and interfaces for shar-
ing code between drivers.
16.11 Exercises
1. Why does UNIX use device switch tables to channel all device activity?
2. What is the difference between DMA and DVMA?
3. When is it necessary to poll the devices rather than rely on interrupts?
4. Why are pseudodevices such as mem and null implemented through the device driver framework?
5. Give some examples of hardware devices that are not used for I/O. Which functions of the driver interface do they support?
6. Give some examples of hardware devices that do not map well to the UNIX driver framework. What aspects of the interface are unsuitable for these devices?
7. Why do top-half routines need to protect data structures from bottom-half routines?
8. What are the advantages of associating a device special file with each device?
9. How does the specfs file system handle multiple files associated with the same device?
10. What is the use of the common snode?
11. What functionality must the driver implement to support cloning? Give some examples of devices that provide this feature.
12. In what ways is I/O to a character device treated differently than I/O to a block device or a file?
13. What are the differences in functionality between the poll and select system calls? Describe how each may be implemented as a library function using the other and the problems that may arise in doing so.
14. What does it mean for a device to support memory-mapped access? What kind of devices benefit from this functionality?
15. The DDI/DKI specification discourages direct access to its data structures and requires drivers to use a procedural interface instead. Why? What are the advantages and drawbacks of using function calls to access fields of a data structure?
16. What are the main problems in writing a multiprocessor-safe driver?
17. What are the benefits of loadable drivers? What problems must the driver writer be careful to avoid?
16.12 References
[ANSI 92a] American National Standard for Information Systems, Small Computer Systems
Interface-2 (SCSI-2), X3.131-199X, Feb. 1992.
[ANSI 92b] American National Standard for Information Systems, SCSI-2 Common Access
Method: Transport and SCSI Interface Module, working draft, X3T9.2/90-186, rev.
3.0, Apr. 1992.
[DEC 93] Digital Equipment Corporation, Guide to Writing Device Drivers for the SCSI/CAM
Architecture Interfaces, Mar. 1993.
[Egan 88] Egan, J.I., and Texeira, T.J., Writing a UNIX Device Driver, John Wiley & Sons,
1988.
[Fori 91] Forin, A., Golub, D., and Bershad, B.N., "An I/O System for Mach 3.0," Technical
Report CMU-CS-91-191, School of Computer Science, Carnegie Mellon
University, Oct. 1991.
[Goul 85] Gould, E., "Device Drivers in a Multiprocessor Environment," Proceedings of the
Summer 1992 USENIX Technical Conference, Jun. 1992, pp. 357-360.
[Klei 86] Kleiman, S.R., "Vnodes: An Architecture for Multiple File System Types in Sun
UNIX," Proceedings of the Summer 1986 Usenix Technical Conference, Jun. 1986,
pp. 238-247.
[Konn 90] Konnerth, D., Bartel, E., and Adler, 0., "Dynamic Driver Loading for UNIX System
V," Proceedings of the Spring 1990 European UNIX Users Group Conference, Apr.
1990, pp. 133-138.
[Paja 92] Pajari, G., Writing UNIX Device Drivers, Addison-Wesley, Reading, MA, 1992.
[Sun 93] Sun Microsystems, Writing Device Drivers, Part No. 800-5117-11, 1993.
[UNIX 92a] UNIX System Laboratories, Device Driver Programming-UNIX SVR4.2, UNIX
Press, Prentice-Hall, Englewood Cliffs, NJ, 1992.
[UNIX 92b] UNIX System Laboratories, Device Driver Reference-UNIX SVR4.2, UNIX Press,
Prentice-Hall, Englewood Cliffs, NJ, 1992.
17
STREAMS
17.1 Motivation
The traditional device driver framework has many flaws. First, the kernel interfaces with the drivers
at a very high level (the driver entry points), making the driver responsible for most of the process-
ing of an I/O request. Device drivers are usually written independently by device vendors. Many
vendors write drivers for the same type of device. Only part of the driver code is device-dependent;
the rest implements high-level, device-independent I/O processing. As a result, these drivers dupli-
cate much of their functionality, creating a larger kernel and greater likelihood of conflict.
Another shortcoming lies in the area of buffering. The block device interface provides rea-
sonable support for buffer allocation and management. However, there is no such uniform scheme
for character drivers. The character device interface was originally designed to support slow devices
that read or wrote one character at a time, such as teletypewriters or slow serial lines. Hence the
kernel provided minimal buffering support, leaving that responsibility to individual devices. This
resulted in the development of several ad hoc buffer and memory management schemes, such as the
clists used by traditional terminal drivers. The proliferation of such mechanisms resulted in ineffi-
cient memory usage and duplication of code.
Finally, the interface provides limited facilities to applications. I/O to character devices re-
quires read and write system calls, which treat the data as a FIFO (first-in, first-out) byte stream.
There is no support for recognizing message boundaries, distinguishing between regular data and
control information, or associating priorities with different messages. There is no provision for flow
control, and each driver and application devises ad hoc mechanisms to address this issue.
The requirements of network devices highlight these limitations. Network protocols are de-
signed in layers. Data transfers are message- or packet-based, and each layer of the protocol per-
forms some processing on the packet and then passes it to the next layer. Protocols distinguish be-
tween ordinary and urgent data. The layers contain interchangeable parts, and a given protocol may
be combined with different protocols in other layers. This suggests a modular framework that sup-
ports layering and allows drivers to be built by combining several independent modules.
The STREAMS subsystem addresses many of these problems. It provides a modular ap-
proach to writing drivers. It has a fully message-based interface that contains facilities for buffer
management, flow control, and priority-based scheduling. It supports layered protocol suites by
stacking protocol modules to function as a pipeline. It encourages code sharing, as each stream is
composed of several reusable modules that can be shared by different drivers. It offers additional
facilities to user-level applications for message-based transfers and separation of control informa-
tion from data.
Originally developed by Dennis Ritchie [Ritc 83], STREAMS is now supported by most
UNIX vendors and has become the preferred interface for writing network drivers and protocols.
SVR4 also uses STREAMS to replace the traditional terminal drivers, as well as the
pipe mechanism. This chapter summarizes the design and implementation of STREAMS and ana-
lyzes its strengths and shortcomings.
17.2 Overview
A stream is a full-duplex processing and data transfer path between a driver in kernel space and a
process in user space. STREAMS is a collection of system calls, kernel resources, and kernel utility
routines that create, use, and dismantle a stream. It is also a framework for writing device drivers. It
specifies a set of rules and guidelines for driver writers and provides the mechanisms and utilities
that allow such drivers to be developed in a modular manner.
Figure 17-1 describes a typical stream. A stream resides entirely in kernel space, and its op-
erations are implemented in the kernel. It comprises a stream head, a driver end, and zero or more
optional modules between them. The stream head interfaces with the user level and allows applica-
tions to access the stream through the system call interface. The driver end communicates with the
device itself (alternatively, it may be a pseudodevice driver, in which case it may communicate with
another stream), and the modules perform intermediate processing of the data.
Each module contains a pair of queues-a read queue and a write queue. The stream head
and driver also contain such a queue pair. The stream transfers data by putting it in messages. The
write queues send messages downstream from the application to the driver. The read queues pass
them upstream, from the driver to the application. Although most messages originate at the stream
head or the driver, intermediate modules may also generate messages and pass them up or down the
stream.
Each queue may communicate with the next queue in the stream. For example, in Figure
17-1, the write queue of module 1 may send messages to the write queue of module 2 (but not vice-
versa). The read queue of module 1 may send messages to the read queue of the stream head. A
queue may also communicate with its mate, or companion queue. Thus, the read queue of module 2
Figure 17-1. A typical stream: the stream head, optional modules, and the driver end, with write queues carrying messages downstream and read queues carrying them upstream.
may pass a message to the write queue of the same module, which may then send it downstream. A
queue does not need to know whether the queue it is communicating with belongs to the stream
head, the driver end, or another intermediate module.
Without further explanation, the preceding description shows the advantages of this ap-
proach. Each module can be written independently, perhaps by different vendors. Modules can be
mixed and matched in different ways, similar to combining various commands with pipes from a
UNIX shell.
Figure 17-2 shows how different streams may be formed from only a few components. A
vendor developing networking software may wish to add the TCP/IP (Transmission Control Proto-
col/Internet Protocol) suite to a system. Using STREAMS, he develops a TCP module, a UDP (User
Datagram Protocol) module, and an IP module. Other vendors who make network interface cards
independently write STREAMS drivers for ethernet and token ring.
Once these modules are available, they may be configured dynamically to form different
types of streams. In Figure 17-2(a), a user has formed a TCP/IP stream that connects to a token ring.
Figure 17-2(b) shows a new combination, featuring a UDP/IP stream connected to an ethernet
driver.
STREAMS supports a facility called multiplexing. A multiplexing driver can connect to
multiple streams at the top or bottom. There are three types of multiplexors-upper, lower, and two-
way. An upper, or fan-in, multiplexor can connect to multiple streams above it. A lower, or fan-out,
multiplexor can connect to multiple streams below it. A two-way multiplexor supports multiple
connections both above and below it.
Figure 17-2. Forming different streams from the same components: (a) a TCP/IP stream over a token ring driver; (b) a UDP/IP stream over an ethernet driver.
By writing the TCP, UDP, and IP modules as multiplexing drivers, we can combine the
above streams into a single compound object that supports multiple data paths. Figure 17-3 shows a
possible layout. TCP and UDP act as upper multiplexors, while IP serves as a two-way multiplexor.
This allows applications to make various kinds of network connections and enables several users to
access any given combination of protocols and drivers. The multiplexor drivers must manage all the
different connections correctly and route the data up or down the stream to the correct queue.
Figure 17-3. A multiplexed configuration: TCP and UDP act as upper multiplexors above IP, which serves as a two-way multiplexor.
17.3.1 Messages
The simplest message consists of three objects: a struct msgb (of type mblk_t), a struct datab
(type dblk_t), and a data buffer. A multipart message may be constructed by chaining such triplets
together, as shown in Figure 17-4. In the msgb, the b_next and b_prev fields link a message onto a
queue, while b_cont chains the different parts of the same message. The b_datap field points to the
associated datab.
Both the msgb and the datab contain information about the actual data buffer. The db_base
and db_lim fields of the datab point to the beginning and end of the buffer. Only part of the buffer
may contain useful data, so the b_rptr and b_wptr fields of the msgb point to the beginning and
end of the valid data in the buffer. The allocb() routine allocates a buffer and initializes both
b_rptr and b_wptr to point to the beginning of the buffer (db_base). As a module writes data into
the buffer, it advances b_wptr (always checking against db_lim). As a module reads data from the
buffer, it advances b_rptr (checking that it does not read past b_wptr), thus removing the data
from the buffer.
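To make the b_rptr and b_wptr convention concrete, here is a minimal sketch (not taken from the text) of a helper that allocates a message and copies data into it; the helper name, its arguments, and the use of BPRI_MED are illustrative.

#include <sys/types.h>
#include <sys/stream.h>

/* Allocate a message and copy len bytes of data into it. */
mblk_t *
build_msg(unsigned char *src, int len)
{
	mblk_t *mp = allocb(len, BPRI_MED);	/* msgb + datab + data buffer */

	if (mp == NULL)
		return (NULL);			/* caller must retry later, e.g., via bufcall() */

	bcopy((caddr_t)src, (caddr_t)mp->b_wptr, len);	/* b_rptr == b_wptr == db_base here */
	mp->b_wptr += len;			/* advance past the valid data just written */
	return (mp);
}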
Figure 17-4. A multipart message: each part is a msgb-datab-data buffer triplet. The msgbs are chained through b_cont, each b_datap points to a datab (db_base, db_lim, db_ref, db_type), and b_rptr and b_wptr delimit the active part of each data buffer.
Allowing multipart messages (each a msgb-datab-data buffer triplet) has many advantages.
Network protocols are layered, and each protocol layer usually adds its own header or footer to a
message. As the message travels downstream, each layer may add its header or footer using a new
message block and link it to the beginning or end of the message. This way, it is unnecessary for
higher-level protocols to know about lower-level protocol headers or footers or leave space for them
while allocating the message. When messages arrive from the network and travel upstream, each
protocol layer strips off its header or footer in reverse order by adjusting the b_rptr and b_wptr
fields.
The b_band field contains the priority band of the message and is used for priority-based
scheduling (described later in this chapter). Each datab has a db_type field, which contains one of
several message types defined by STREAMS. This allows modules to prioritize and process messages
differently based on their type. Section 17.3.3 discusses message types in detail. The db_f field
holds information used for message allocation (see Section 17.7). The db_ref field stores a reference count, which is used for
virtual copying (see Section 17.3.2).
Figure 17-5. Virtual copying: two msgbs share a single datab (db_ref = 2) and its data buffer, each with its own b_rptr and b_wptr.
The reference count allows several messages to share the same data buffer without physically
copying the data. Figure 17-5 shows an example where two messages share a datab. Both share
the associated data buffer, but each maintains its independent read and write offset into it.
Normally, such a shared buffer is used in read-only mode, since two independent writes to it
may interfere with each other. Such semantics, however, must be enforced by the modules or drivers
processing these buffers. STREAMS is neither aware of nor concerned with how or when modules
read or write the buffers.
One example of the use of virtual copying is in the TCP/IP protocol. The TCP layer provides
reliable transport, and hence must ensure that every message reaches its destination. If the receiver
does not acknowledge the message within a specified period of time, the sender retransmits it. To do
so, it must retain a copy of each message it sends until the message is acknowledged. Physically
copying every message is wasteful, hence TCP uses the virtual copying mechanism. When the TCP
module receives a message to send downstream, it calls the STREAMS routine dupmsg(), which
creates another msgb that references the same datab. This results in two logical messages, each ref-
erencing the same data. TCP sends one message downstream, while holding on to the other.
When the driver sends the message and releases the msgb, the datab and data buffer are not
freed, because the reference count is still non-zero. Eventually, when the receiver acknowledges the
message, the TCP module frees the other msgb. This drops the reference count on the datab to zero,
and STREAMS releases the datab and the associated data buffer.
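As an illustration of this pattern, here is a hedged sketch (not from the text) of how a TCP-like module might use dupmsg() before sending a segment downstream; tcp_send_segment and save_for_retransmit are hypothetical names.

/* Send a segment downstream while retaining a virtual copy for retransmission. */
static int
tcp_send_segment(queue_t *q, mblk_t *mp)
{
	mblk_t *dup = dupmsg(mp);	/* new msgb sharing the same datab; db_ref is incremented */

	if (dup == NULL)
		return (0);		/* allocation failed; try again later */

	save_for_retransmit(mp);	/* hypothetical: keep mp until the peer acknowledges it */
	putnext(q, dup);		/* the duplicate travels downstream toward the driver */
	return (1);
}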
[Figure: adjacent modules in a stream, each with a write (W) and read (R) queue described by a struct qinit, with messages linked on the queues.]
Some of these fields are also present in the queue structure. This enables users to override these parame-
ters dynamically by changing the values in the queue, without destroying the defaults saved in the
module_info. The module_stat object is not directly used by STREAMS. Each module is free to
perform its own statistics gathering using fields in this object.
The subsequent sections discuss the queue procedures in detail. In brief, the open and close
procedures are called synchronously by processes opening and closing the stream. The put proce-
dure performs immediate processing of a message. If a message cannot be processed immediately,
the put procedure adds the message to the queue's message queue. Later, when the service proce-
dure is invoked, it will perform delayed processing of these messages.
Each queue must provide a put procedure, but the service procedure is optional. If there is
no service procedure, the put procedure cannot queue messages for deferred processing, but must
immediately process each message and send it to the next module. In the simplest case, a queue will
have no service procedure, and its put procedure will merely pass the message to the next queue
without processing it.
Stream I/O is asynchronous. The only place where an I/O operation may block the process is
at the stream head. The put and service procedures of the module and driver are non-blocking. If the
put procedure cannot send the data to the next queue, it places the message on its own message
queue, from where it may be retrieved later by the service procedure. If the service procedure re-
moves a message from the queue and discovers that it cannot process it at this time, it returns the
message to the queue and tries again later.
These two functions complement each other. The put procedure is required for processing
that cannot wait. For instance, a terminal driver must immediately echo the characters it receives, or
else the user will find it unresponsive. The service procedure handles all non-urgent actions, such as
canonical processing of incoming characters.
Because neither procedure is allowed to block, they must ensure that they do not call any
routine that may block. Hence STREAMS provides its own facilities for operations such as memory
allocation. For instance, the allocb() routine allocates a message. If it cannot do so for any reason
(it may not find a free msgb, datab, or data buffer), it returns failure instead of blocking. The caller
then invokes the bufcall() routine, passing a pointer to a callback function. bufcall() adds the
caller to a list of queues that need to allocate memory. When memory becomes available,
STREAMS invokes the callback function, which usually calls the stream's service routine to retry
the call to allocb().
The asynchronous operation is central to the design of STREAMS. On the read side
(upstream), the driver receives the data via device interrupts. The read-side put procedures run at
interrupt level, and hence cannot afford to block. The design could have allowed blocking in the
write-side procedures, but that was rejected in the interest of symmetry and simplicity.
The service procedures are scheduled in system context, not in the context of the process that
initiated the data transfer. Hence blocking a service procedure could put an innocent process to
sleep. If, for example, a user shell process is blocked because an unrelated transfer cannot complete,
the results would be unacceptable. Making all put and service procedures non-blocking solves these
problems.
The put and service procedures must synchronize with each other while accessing common
data structures. Because the read-side put procedure may be called from interrupt handlers, it may
interrupt the execution of either service procedure, or of the write-side put procedure. Additional
locking is required on multiprocessors, since the procedures may run concurrently on different proc-
essors [Garg 90, Saxe 93].
When the put procedure defers the processing of data, it calls putq() to place the message onto the
queue and then calls qenable() to schedule the queue for servicing. qenable() sets the QENAB flag
for the queue and adds the queue to the tail of the list of queues waiting to be scheduled. If the
QENAB flag is already set, qenable() does nothing, since the queue has already been scheduled. Fi-
nally, qenable() sets a global flag called qrunflag, which specifies that a queue is waiting to be
scheduled.
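The following is a minimal sketch, assuming the SVR4 DDI/DKI entry-point conventions, of a write-side put procedure that follows this pattern. mod_wput is a hypothetical name, and on many implementations putq() itself enables the queue, making the explicit qenable() call unnecessary.

/* Write-side put procedure: handle urgent messages now, defer the rest. */
static int
mod_wput(queue_t *q, mblk_t *mp)
{
	if (mp->b_datap->db_type >= QPCTL) {	/* high-priority message: not flow-controlled */
		putnext(q, mp);			/* pass it along immediately */
		return (0);
	}
	putq(q, mp);				/* queue the message for the service procedure */
	qenable(q);				/* schedule the queue, as described above */
	return (0);
}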
STREAMS scheduling is implemented by a routine called runqueues() and has no relation
to UNIX process scheduling. The kernel calls runqueues() whenever a process tries to perform an
I/O or control operation on a stream. This allows many operations to complete quickly before a
context switch occurs. The kernel also calls runqueues() just before returning to user mode after a
context switch.
The runqueues() routine checks if any streams need to be scheduled. If so, it calls
queuerun(), which scans the scheduler list and calls the service procedure of each queue on it. The
service procedure must try to process all the messages on the queue, as described in the next section.
On a uniprocessor, the kernel guarantees that all scheduled service procedures will run be-
fore returning to user mode. Because any arbitrary process may be running at the time, the service
procedures must run in system context and not access the address space of the current process.
blockage in one or more of its components and handle the situation correctly without blocking a put
or service procedure.
Flow control is optional in a queue. A queue that supports flow control interacts with the
nearest modules on either side that also support it. A queue without flow control has no service pro-
cedure. Its put procedure processes all messages immediately and sends them along to the next
queue. Its message queue is not used.
A queue that supports flow control defines low- and high-water marks, which control the
total amount of data that may be queued to it. These values are initially copied from the
module_info structure (which is statically initialized when compiling the module), but may be changed
later by ioctl messages.
Figure 17-8 shows the operation of flow control. Queues A and C are flow-controlled, while
queue B is not. When a message arrives at queue A, its put procedure is invoked. It performs any
immediate processing that is necessary on the data and calls putq(). The putq() routine adds the
message to queue A's own message queue and puts queue A on the list of queues that need to
be serviced. If the message causes queue A's high-water mark to be exceeded, it sets a flag indi-
cating that the queue is full.
At a later time, the STREAMS scheduler selects queue A and invokes its service procedure.
The service procedure retrieves messages from the queue in FIFO order. After processing them, it
calls canput() to check if the next flow-controlled queue can accept the message. The canput()
routine chases the q_next pointers until it finds a queue that is flow-controlled, which is queue C in
our example. It then checks the queue's state and returns TRUE if the queue can accept more mes-
sages or FALSE if the queue is full. Queue A's service procedure behaves differently in the two
cases, as shown in Example 17-1 (a reconstructed sketch appears below).
If canput() returns TRUE, queue A calls putnext(), which passes the message to queue B.
This queue is not flow-controlled and immediately processes the message and passes it to queue C,
which is known to have room for the message.
If canput() returns FALSE, queue A calls putbq() to return the message to its message
queue. The service procedure now returns without rescheduling itself for servicing.
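A sketch of the logic of Example 17-1, reconstructed from the description above, follows; mod_wsrv is a hypothetical name, and high-priority messages, which bypass flow control, are omitted for brevity.

/* Service procedure for queue A: drain the message queue, honoring flow control. */
static int
mod_wsrv(queue_t *q)
{
	mblk_t *mp;

	while ((mp = getq(q)) != NULL) {	/* messages are retrieved in FIFO order */
		if (!canput(q->q_next)) {	/* next flow-controlled queue (C) is full */
			putbq(q, mp);		/* put the message back on our own queue */
			return (0);		/* stop; we will be back-enabled later */
		}
		/* ... perform any deferred processing on mp here ... */
		putnext(q, mp);			/* pass the message along to queue B */
	}
	return (0);
}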
Eventually, queue C will process its messages and fall below its low-water mark. When this
happens, STREAMS automatically checks if the previous flow-controlled queue (A in this example)
is blocked. If so, it reschedules the queue for servicing. This operation is known as back-enabling a
queue.
Flow control requires consistency on the part of the module writer. All messages of the same
priority must be treated equally. If the put procedure queues messages for the service procedure, it
must do so for every message. Otherwise, messages will not retain their sequencing order, leading
to incorrect results.
When the service procedure runs, it must process every message in the queue, unless it can-
not do so due to allocation failures or because the next flow-controlled queue is full. Otherwise, the
flow control mechanism breaks down, and the queue may never be scheduled.
High-priority messages are not subject to flow control. A put procedure that queues ordinary
messages may process high-priority messages immediately. If high-priority messages must be
queued, they are placed in front of the queue, ahead of any ordinary messages. High-priority mes-
sages retain FIFO ordering with respect to one another.
Figure 17-9. Data structures for configuring a module or driver: the streamtab holds pointers (st_rdinit, st_wrinit, st_muxrinit, st_muxwinit) to qinit structures, whose entry points include qi_putp, qi_srvp, qi_qopen, and qi_qclose; each qinit's qi_minfo field references a module_info containing mi_idnum, mi_idname, mi_minpsz, mi_maxpsz, mi_hiwat, and mi_lowat.
The module_info structure contains default parameters of the module. When the module is
first opened, these parameters are copied into the queue structure. A user may override them by sub-
sequent ioctl calls. Each queue may have its own module_info structure, or both may share a single
object, as in the previous example.
The rest of the configuration is different for modules and drivers. Many UNIX systems use
an fmodsw[] table to configure STREAMS modules. Each entry in the table (Figure 17-10(a)) com-
prises a module name and a pointer to the streamtab structure for the module. Modules, therefore,
are identified and referenced by name. The module name should be the same as the mi_idname in
the module_info structure, though STREAMS does not enforce this.
STREAMS device drivers are identified through the character device switch table. Each
cdevsw entry has a field called d_str, which is NULL for ordinary character devices. For
STREAMS devices, this field contains the address of the streamtab structure for the driver (Figure
17-10(b)). To complete the configuration, it is necessary to create the appropriate device files, with
the major number equal to the index of the driver in the cdevsw[] array (except for clone opens,
which are described in Section 17.5.4). STREAMS drivers must handle device interrupts and need
an additional mechanism to install their interrupt handlers into the kernel. This procedure is system-
dependent.
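As an illustration of the configuration data structures just described (Figure 17-9), here is a hedged sketch of how a simple module might declare them statically. The names and numeric limits are illustrative, and the entry-point prototypes follow the SVR4 DDI/DKI conventions, which vary slightly across releases.

#include <sys/types.h>
#include <sys/stream.h>

static int mod_open(queue_t *, dev_t *, int, int, cred_t *);
static int mod_close(queue_t *, int, cred_t *);
static int mod_rput(queue_t *, mblk_t *), mod_wput(queue_t *, mblk_t *);
static int mod_rsrv(queue_t *), mod_wsrv(queue_t *);

static struct module_info mod_minfo = {
	77,		/* mi_idnum: module ID number */
	"mymod",	/* mi_idname: should match the fmodsw[] entry */
	0,		/* mi_minpsz: minimum packet size */
	INFPSZ,		/* mi_maxpsz: maximum packet size */
	4096,		/* mi_hiwat: high-water mark */
	1024		/* mi_lowat: low-water mark */
};

/* qi_putp, qi_srvp, qi_qopen, qi_qclose, qi_qadmin, qi_minfo, qi_mstat */
static struct qinit mod_rinit = {
	mod_rput, mod_rsrv, mod_open, mod_close, NULL, &mod_minfo, NULL
};
static struct qinit mod_winit = {
	mod_wput, mod_wsrv, NULL, NULL, NULL, &mod_minfo, NULL
};

/* st_rdinit, st_wrinit, st_muxrinit, st_muxwinit */
struct streamtab mymodinfo = { &mod_rinit, &mod_winit, NULL, NULL };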
Once the driver or module is configured, it is ready to be used by applications when the ker-
nel is booted. The following subsections describe how that happens.
Figure 17-10. Configuring (a) a STREAMS module through the fmodsw[] table, indexed by module name (f_name), and (b) a STREAMS driver through the cdevsw[] table, indexed by major device number.
[Figure: data structures of an open STREAMS device — the common snode's v_stream field points to the stream head (stdata), whose sd_strtab field references the driver's streamtab; the read and write queues of the stream head and the driver end each reference their qinit structures through q_qinfo.]
A user may push a module onto an open stream by making an ioctl call with the I _PUSH command.
The kernel allocates a queue pair and calls qattach() to add it to the stream. qattach() initializes
the module by locating its strtab entry from the fmodsw[] table. It links the module into the
stream immediately below the stream head and calls its open procedure.
A user may remove a module from the queue using the I _POP ioctl command. This always
removes the module nearest to the stream head. Thus modules are popped in last-in, first-out (LIFO)
order.
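For example, a user-level program might push and pop a module as follows; this is a minimal sketch, and the device pathname and module name are merely illustrative.

#include <fcntl.h>
#include <unistd.h>
#include <stropts.h>

int
push_pop_example(void)
{
	int fd = open("/dev/term/a", O_RDWR);	/* open a STREAMS device */

	if (fd < 0)
		return (-1);
	ioctl(fd, I_PUSH, "ldterm");		/* push a module, identified by name */
	/* ... use the stream with the module in place ... */
	ioctl(fd, I_POP, 0);			/* pop the module nearest the stream head */
	close(fd);
	return (0);
}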
STREAMS provides an autopush mechanism, using ioctl commands for a special driver
called the STREAMS administrative driver (sad(8)). Using this, an administrator may specify a set
of modules to push onto a given stream when it is first opened. The stropen() routine checks if
autopush has been enabled for the stream, and finds and pushes all specified modules in order.
There are two other common mechanisms for pushing modules onto a stream. One is to
provide library routines that open the stream and push the correct modules onto it. Another is to
start up a daemon process during system initialization to perform this task. Thereafter, whenever
applications open the device file, they will be connected to the same stream, with all the right mod-
ules already pushed onto it.
The stream head is responsible for synchronization. If it can handle the command itself, it
does so synchronously and in process context, and there is no problem. If the stream head must send
the command downstream, it blocks the process and sends down an M_IOCTL message containing
the command and its parameters. When a module handles the command, it returns the results in an
M_IOCACK message. If no module or driver can handle the message, the driver generates an
M_IOCNAK message. When the stream head receives either of these messages, it wakes up the proc-
ess and passes the results to it.
The data movement problem is concerned with the exchange of arguments and results be-
tween the user program and the module or driver that handles the ioctl. When a user issues an ioctl
command to an ordinary character device, the driver processes the command in the context of the
calling process. Each ioctl command is usually accompanied by a parameter block, whose size and
contents are command-specific. The driver copies the block from user space into the kernel, proc-
esses the command, and copies the results to user space.
This method breaks down for STREAMS drivers and modules. The module receives the
command as an M_IOCTL message, asynchronous to the process and in system context. Because the
module does not have access to the process's address space, it cannot copy in the parameter block,
or copy the results back to the user space.
STREAMS provides two ways of overcoming this problem. The preferred solution involves
a special type of ioctl command called I_STR. The other method handles ordinary ioctl commands
and is necessary to maintain compatibility with older applications. It is called transparent ioctl
handling, as it does not require modification of existing applications.
An application issues an ioctl as ioctl(fd, cmd, arg), where fd is the file descriptor, cmd is an integer that specifies a command, and arg is an optional,
command-specific value, which often contains the address of a parameter block. The driver inter-
prets the contents of arg based on the cmd and copies the parameters from user space accordingly.
A user may issue a special STREAMS ioctl by specifying the constant I_STR as the
cmd value and setting arg to point to a strioctl structure, which has the following format:
struct strioctl {
	int	ic_cmd;		/* the actual command to issue */
	int	ic_timout;	/* timeout period */
	int	ic_len;		/* length of parameter block */
	char	*ic_dp;		/* address of parameter block */
};
If the stream head cannot handle the ioctl, it creates a message of type M_IOCTL and copies
the ic_cmd value into it. It also extracts the parameter block (specified by ic_len and ic_dp) from
user space and copies it into the message. It then passes the message downstream. When the module
that handles the command receives the message, it contains all the information required to process
the command. If the command requires data to be returned to the user, the module writes it into the
same message, changes the message type to M_IOCACK, and sends it back upstream. The stream head
will copy the results to the parameter block in user space.
Hence the stream head passes the message downstream until it reaches a module that can
recognize and handle it. That module sends back an M_IOCACK message to the stream head, indicat-
ing that the command was successfully intercepted. If no module can recognize the message, it
reaches the driver. If the driver cannot recognize it either, it sends back an M_IOCNAK message, upon
which the stream head generates an appropriate error code.
This solution is efficient, but imposes some restrictions on the commands it can handle. It
will not work with older applications that do not use I_STR commands. Moreover, since the stream
head cannot interpret the parameters, they must be contained directly in the parameter block. For
example, if one parameter is a pointer to a string stored elsewhere in user space, the stream head
will copy the pointer but not the string. Hence it is essential to have a general solution that will work
in all cases, even if it is slower or less efficient.
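To illustrate, a hedged sketch of issuing an I_STR ioctl from an application follows. MYDRV_GETSTATS and struct mydrv_stats are hypothetical; only the strioctl wrapping is the point.

#include <stropts.h>

int
get_driver_stats(int fd)
{
	struct mydrv_stats stats;	/* hypothetical parameter block */
	struct strioctl sio;

	sio.ic_cmd = MYDRV_GETSTATS;	/* hypothetical command, interpreted by the module */
	sio.ic_timout = 15;		/* seconds to wait for M_IOCACK or M_IOCNAK */
	sio.ic_len = sizeof (stats);	/* length of the parameter block */
	sio.ic_dp = (char *)&stats;	/* address of the parameter block */

	if (ioctl(fd, I_STR, &sio) < 0)	/* the stream head builds and sends the M_IOCTL message */
		return (-1);
	/* on success, sio.ic_len and stats hold the data returned by the module */
	return (0);
}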
STREAMS needs an efficient mechanism to allocate and free messages. put and service procedures must be non-blocking. If the
allocator cannot supply the memory immediately, they must handle the situation without blocking,
perhaps by retrying at a later time. In addition, many STREAMS drivers allow direct memory ac-
cess (DMA) from device buffers. STREAMS allows such memory to be directly converted into
messages instead of copying it into main memory.
The main memory management routines are allocb(), freeb(), and freemsg().
allocb(size, pri) allocates a msgb, a datab, and a data buffer at least size bytes long; it returns a
pointer to the msgb. It initializes them so that the msgb points to the datab, which contains the be-
ginning and end of the data buffer. It also sets the b_rptr and b_wptr fields in the msgb to point to
the beginning of the data buffer. The pri argument is no longer used and is retained only for back-
ward compatibility. The freeb() routine frees a single msgb, while freemsg() traverses the b_cont
chain, freeing all msgbs in the message. In both cases, the kernel decrements the reference count of
the associated databs. If the count falls to zero, it also releases the datab and the data buffer to
which it points.
Allocating three objects individually is inefficient and slow. STREAMS provides a faster
solution using a data structure called mdbblock. Each mdbblock is 128 bytes in size and includes a
msgb, a datab, and a pointer to a release handler, which is discussed in the next section. The re-
maining space in the structure may be used for a data buffer.
Let us examine what happens when a module calls allocb() to allocate a message.
allocb() calls kmem_alloc() to allocate a struct mdbblock, passing it the NO_SLP flag. This en-
sures that kmem_alloc() returns an error instead of blocking if the memory is not available imme-
diately. If the allocation succeeds, allocb() checks if the requested size is small enough to fit into
the mdbblock. If so, it initializes the structure and returns a pointer to the msgb within it (Figure
17-12). Hence a single call to kmem_alloc() provides the msgb, datab, and data buffer.
If the requested size is larger, allocb() calls kmem_alloc() once more, this time to allocate
the data buffer. In this case, the extra space in the mdbblock is not used. If either call to
kmem_alloc() fails, allocb() releases any resources it had acquired and returns NULL, indicating
failure.
The module or driver must handle an allocb() failure. One possibility is to discard the data
with which it is working. This approach is used by many network drivers when they are unable to
keep pace with incoming traffic. Often, though, the module wants to wait until memory is available.
Because put and service procedures must be non-blocking, it must find another way of waiting for
memory.
STREAMS provides a routine called bufcall() to handle this situation. When a module
cannot allocate a message, it calls bufcall(), passing it a pointer to a callback function and the size
of the message it wanted to allocate. STREAMS adds this callback to an internal queue. When suf-
ficient memory becomes available, STREAMS processes this queue and invokes each callback
function on it.
Figure 17-12. The mdbblock structure: a 128-byte block containing a msgb, a datab, a pointer to a release handler, and remaining space usable as a small data buffer.
Often, the callback function is the service procedure itself. The callback may not, however,
assume that enough memory is indeed available. By the time the callback runs, other activity may
have depleted the available memory. In that case, the module typically reissues the bufcall().
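The following sketch shows the bufcall() retry pattern just described, assuming the SVR4/Solaris DDI argument types (which vary slightly between releases); MSGSIZE and the routine names are illustrative.

#define	MSGSIZE	1024			/* illustrative allocation size */

static void
drv_retry(void *arg)
{
	qenable((queue_t *)arg);	/* reschedule the queue so its service procedure retries */
}

static int
drv_rsrv(queue_t *q)
{
	mblk_t *mp, *bp;

	while ((mp = getq(q)) != NULL) {
		if ((bp = allocb(MSGSIZE, BPRI_MED)) == NULL) {
			/* wait for memory: STREAMS calls drv_retry when some becomes available;
			   a production driver would fall back to a timeout if bufcall() itself fails */
			bufcall(MSGSIZE, BPRI_MED, drv_retry, (void *)q);
			putbq(q, mp);		/* keep the unprocessed message on our queue */
			return (0);
		}
		/* ... copy or transform data from mp into bp ... */
		putnext(q, bp);
		freemsg(mp);
	}
	return (0);
}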
Some STREAMS drivers support I/O cards containing dual-access RAM (also called dual-ported
RAM). The card has memory buffers which may be accessed both by the device hardware and by the
CPU. Such a buffer may be mapped into the kernel or user address space, allowing an application to
access and modify its contents without copying it to or from main memory.
STREAMS drivers place their data into messages and pass them upstream. To avoid copying
the contents of the I/O card's buffers, STREAMS provides a way to use them directly as the data
buffer of the message. Instead of using allocb(), the driver calls a routine called esballoc(),
passing it the address of the buffer to be used. STREAMS allocates a msgb and datab (from a
mdbblock), but not a data buffer. Instead, it uses the caller-supplied buffer and adjusts the msgb and
datab fields to reference it.
This causes a problem when the buffer is freed. Normally, when a module calls freeb() or
freemsg(), the kernel frees the msgb, datab, and the data buffer (assuming no other references to
the datab). The kmem_free() routine releases these objects and recovers the memory. Driver-
supplied buffers, however, cannot be released to the general memory pool since they belong to the
I/O card.
Hence esballoc() takes another parameter, which is the address of a release handler func-
tion. When the message is freed, the kernel frees the msgb and datab, and calls the release handler
to free the data buffer. The handler takes the necessary actions to mark the buffer as free, so that the
I/O card may reuse it. esballoc() takes base and size, which describe the buffer to be used, and
free_rtnp, the address of the release handler. The pri argument is for compatibility only and is not
used in SVR4. esballoc() returns a pointer to the msgb.
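A brief sketch of esballoc() use follows. Note that in the SVR4/Solaris DDI the release handler is actually passed through a small frtn_t structure (a function pointer plus an argument); the buffer, its length, and card_buf_free() are hypothetical.

static void card_buf_free(char *arg);		/* marks the card's buffer as reusable */

static frtn_t card_frtn = { card_buf_free, NULL };	/* free_func, free_arg */

static mblk_t *
wrap_card_buffer(unsigned char *buf, int len)
{
	/* no data buffer is allocated; the card's dual-access RAM is used directly */
	return (esballoc(buf, len, BPRI_MED, &card_frtn));
}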
17.8 Multiplexing
STREAMS provides a facility called multiplexing, which allows multiple streams to be linked to a
single stream, called a multiplexor. Multiplexing is restricted to drivers and is not supported for
modules. There are three types of multiplexors-upper, lower, and two-way. An upper multiplexor
connects multiple streams at the top to a single stream at the bottom. It is also known as a fan-in, or
M-to-1, multiplexor. A lower multiplexor, also called fan-out or 1-to-N, connects multiple lower
streams below a single upper stream. A two-way, or M-to-N, multiplexor supports multiple streams
above and below. Multiplexors may be combined in arbitrary ways to form complex configurations,
such as the one shown earlier in Figure 17-3.
STREAMS provides the framework and some support routines for multiplexing, but the
drivers are responsible for managing the multiple streams and routing data appropriately.
so that it can send the data to any stream. It manages its own flow control, since STREAMS does not
provide flow control for multiplexors.
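The example discussed in the next paragraph is reconstructed here as a hedged sketch consistent with its description; the device pathnames are assumed.

#include <fcntl.h>
#include <stropts.h>

int
build_ip_mux(void)
{
	int enet_fd, fddi_fd, ip_fd;

	enet_fd = open("/dev/enet", O_RDWR);	/* ethernet driver */
	fddi_fd = open("/dev/fddi", O_RDWR);	/* FDDI driver */
	ip_fd   = open("/dev/ip",   O_RDWR);	/* ip multiplexing driver */

	ioctl(ip_fd, I_LINK, enet_fd);		/* link the enet stream below ip */
	ioctl(ip_fd, I_LINK, fddi_fd);		/* link the fddi stream below ip */
	return (ip_fd);
}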
The example omits the statements to check the return values and handle errors. The first
three statements open the enet (ethernet), fddi, and ip drivers, respectively. Next, the user links the
ip stream onto the enet stream and then onto the fddi stream. Figure 17-14 shows the resulting con-
figuration.
The next section examines the process of setting up the lower multiplexor in detail.
multiplexor, the st_rdinit and st_wrinit fields reference the qinit structures for the upper
queue pair, while the st_muxrinit and st_muxwinit fields point to the lower queue pair. In the
queues, only some procedures are required. The upper read queue must contain the open and close
procedures. The lower read queue must have a put procedure, and so must the upper write queue.
All other procedures are optional.
Figure 17-15 describes the ip and enet streams before the I_LINK command. The strdata
and stwdata are shared by all stream heads and contain the read and write qinit structures, re-
spectively. Only the ip driver has a lower queue pair, and it is not used as yet.
Now let us look at what happens when the user issues the first I_LINK command. The
strioctl() routine does the initial processing of all ioctl requests. For the I_LINK case, it takes the
following actions:
1. Checks that both upper and lower streams are valid and that the upper stream is a
multiplexor.
2. Checks the stream for cycles. A cycle could occur if the lower stream was already
connected above the upper stream, directly or indirectly. STREAMS fails any I_ LINK call
that results in such a cycle.
3. Changes the queues in the enet stream head to point to the lower queue pair of the ip
driver.
4. Zeroes out the q_ptr fields in the enet stream head, so that they no longer point to its
stdata structure.
5. Creates a linkblk structure, containing pointers to the queues to be linked. These are
q_ top, which points to the write queue of the ip driver, and q_bot, which points to the
write queue of the enet stream head. The linkblk also contains a link ID, which later may
be used in routing decisions. STREAMS generates a unique link ID for each connection
and also passes it back to the user as the return value of the I_LINK ioctl.
6. Sends the linkblk downstream to the ip driver in an M_IOCTL message and waits for it to
return.
Figure 17-16 shows the connections after the I_LINK completes. The heavy arrows show the
new connections set up by STREAMS.
Figure 17-15. The ip and enet streams before the I_LINK command.
The ip driver manages other details of the multiplexor configuration. It maintains data
structures describing all streams connected below it and, when it receives the M_IOCTL message,
adds an entry for the enet stream to them. This entry, at a minimum, must contain the lower queue
pointer and link ID (from the linkblk structure passed in the message), so it can pass messages
down to the lower stream. In the next section, we describe the data flow through the multiplexor.
Figure 17-16. IP and enet streams after linking.
the qinit structure of the lower read queue of the multiplexor. Hence the message is processed by
the ip driver, which sends it up toward the ip stream head.
STREAMS does not directly support flow control for multiplexors. Hence the ip driver must
handle any flow control it requires.
modules. This daemon process then blocks indefinitely, holding an open descriptor to the control-
ling streams. This prevents the setup from being dismantled when no other process is using it.
Other processes may now use this configuration by issuing open calls to the topmost drivers
(TCP or UDP in the example of Figure 17-3). These are typically clone devices and opening them
creates new minor devices and, correspondingly, new streams to the same driver.
This solution requires a process to be dedicated to keeping the streams open. It does not
protect the system against accidental death of that process. STREAMS provides an alternative solu-
tion, using the I_PLINK and I_PUNLINK commands in place of I_LINK and I_UNLINK. I_PLINK
creates persistent links, which remain active even if no process has the stream open. Such a link
must be explicitly removed by I_PUNLINK, passing it the link ID returned by I_PLINK.
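A short user-level sketch of a persistent link follows (pathnames illustrative): the multiplexor ID returned by I_PLINK must be saved, since it is the handle later passed to I_PUNLINK.

#include <fcntl.h>
#include <stropts.h>

int
plink_example(void)
{
	int ip_fd   = open("/dev/ip",   O_RDWR);
	int enet_fd = open("/dev/enet", O_RDWR);
	int muxid   = ioctl(ip_fd, I_PLINK, enet_fd);	/* link persists even after these descriptors close */

	/* ... later, a process holding a descriptor for the multiplexor may dismantle the link: */
	ioctl(ip_fd, I_PUNLINK, muxid);
	return (muxid);
}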
[Figure: data structures for an open FIFO — the file descriptor table in the u area and the struct file reference the FIFO's vnode, whose associated fifonode records readers (fn_rcnt), writers (fn_wcnt), and opens (fn_open).]
Whenever a user writes data to the FIFO, the stream head sends it down the write queue,
which immediately sends it back to the read queue, where the data waits until it is read. Readers re-
trieve data from the read queue and block at the stream head if no data is available. When no proc-
esses have the FIFO open, the stream is dismantled. The stream will be rebuilt if the FIFO is opened
again. The FIFO file itself persists until explicitly unlinked.
The pipe system call creates an unnamed pipe. Prior to SVR4, data flow in the pipe was unidirec-
tional. The pipe call returned two file descriptors, one for writing and the other for reading. SVR4
reimplemented pipes using STREAMS. The new approach allows bidirectional pipes.
As before, pipe returns two file descriptors. In SVR4, however, both are open for reading
and writing. Data written to one descriptor is read from the other, and vice versa. This is achieved
by using a pair of streams. The pipe system call creates two fifonodes and a stream head for each of
them. It then fixes the queues such that the write queue of each stream head is connected to the read
queue of the other. Figure 17-18 describes the resulting configuration.
This approach has some important advantages. The pipe is now bidirectional, which makes it
much more useful. Many applications require bidirectional communication between processes. Prior
to SVR4, they had to open and manage two pipes. Moreover, implementing the pipe via streams
allows many more control operations on it. For instance, it allows the pipe to be accessed by unre-
lated processes.
The C library routine fattach provides this functionality [Pres 90].
Figure 17-18. A STREAMS-based pipe: two fifonodes, each with its own stream head (v_stream), cross-connected through their fn_mate fields so that data written at one end is read at the other.
fattach takes two arguments: fd is a file descriptor associated with a stream, and path is the pathname of a file owned by
the caller (or the caller must be root). The caller must have write access to the file. fattach uses a
special file system called namefs and mounts an instance of this file system onto the file represented
by path. Unlike other file systems, which may only be mounted on directories, namefs allows
mounting on ordinary files. On mounting, it binds the stream file descriptor fd to the mount point.
Once so attached, any reference to that pathname accesses the stream bound to it. The asso-
ciation persists until removed by fdetach, at which time the pathname is bound back to the original
file associated with it. Frequently, fattach is used to bind one end of a pipe to a filename. This al-
lows applications to create a pipe and then dynamically associate it with a filename, thus providing
unrelated processes with access to the pipe.
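For example, a process might create a pipe and publish one end under a pathname as follows; this is a minimal sketch without error handling, the pathname is illustrative, and the file must already exist and be writable by the caller.

#include <unistd.h>
#include <stropts.h>

int
publish_pipe(void)
{
	int fds[2];

	pipe(fds);				/* two cross-connected stream heads */
	fattach(fds[0], "/tmp/service");	/* bind one end to the pathname */
	/* unrelated processes may now open /tmp/service; this process talks to them via fds[1] */
	/* ... */
	fdetach("/tmp/service");		/* restore the original file */
	return (fds[1]);
}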
17.10 Networking Interfaces
STREAMS provides the kernel infrastructure for networking in System V UNIX. Programmers
need a higher-level interface to write network applications. The sockets framework, introduced in
4.1cBSD in 1982, provides comprehensive support for network programming. System V UNIX
handles this problem through a set of interfaces layered on top of STREAMS. These include the
Transport Provider Interface (TPI), which defines the interactions between transport providers and
transport users, and the Transport Layer Interface (TLI), which provides high-level programming
facilities. Since the sockets framework came long before STREAMS, there are a large number of
applications that use it. To ease the porting of these applications, SVR4 added support for sockets
through a collection of libraries and STREAMS modules.
2. The T_UNITDATA_REQ and T_UNITDATA_IND types are used for datagrams, while T_DATA_REQ and T_DATA_IND are used for byte-stream data.
[Figure: client-server interaction over a connection-oriented transport — the client sends its request with t_snd() and waits for the reply with t_rcv(); the server processes the request and replies with t_snd().]
Connectionless protocols operate differently (Figure 17-20). Both the server and the client
call t_open() and t_bind(), just as before. Since there are no connections to be made, we do not
need t_listen(), t_connect(), or t_accept() calls. Instead, the server sits in a loop, calling
t_rcvudata(), which blocks until a client sends a message. When a message arrives, the call re-
turns to the server with the address and port number of the sender, along with the body of the mes-
sage. The server processes the message and replies to the client by calling t_sndudata(). The client
likewise calls t_sndudata() to send messages and t_rcvudata() to receive replies.
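A hedged sketch of this connectionless loop appears below; the transport device name is illustrative, error checking is omitted, and a real server would bind a well-known address rather than letting the provider choose one.

#include <fcntl.h>
#include <tiuser.h>

void
datagram_server(void)
{
	struct t_unitdata *ud;
	int fd, flags;

	fd = t_open("/dev/udp", O_RDWR, NULL);		/* open the transport provider */
	t_bind(fd, NULL, NULL);				/* bind; NULL lets the provider pick an address */
	ud = (struct t_unitdata *)t_alloc(fd, T_UNITDATA, T_ALL);

	for (;;) {
		t_rcvudata(fd, ud, &flags);	/* blocks until a datagram arrives; ud->addr names the sender */
		/* ... process the request carried in ud->udata ... */
		t_sndudata(fd, ud);		/* reply to the address that sent the request */
	}
}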
17.10.3 Sockets
Sockets [Leff 86], introduced in 4.1BSD in 1982, provide a programming interface, which may be
used both for interprocess and network communications. A socket is a communication endpoint and
represents an abstract object that a process may use to send or receive messages. Although sockets
are not native to System V UNIX, SVR4 provides full BSD socket functionality in order to support
the huge number of applications written using the socket interface.
The socket interface is similar to TLI in several respects. There is almost a one-to-one corre-
spondence of socket and TLI functions. Table 17-1 shows the common TLI functions and the
equivalent socket calls.
There are, however, substantial differences in the arguments to and semantics of the TLI and
socket calls. Although TLI and STREAMS were designed to be mutually compatible, there were
several problems in adding sockets support to STREAMS [Vess 90]. Let us examine some of the
important factors that cause incompatibility between the two frameworks.
The sockets framework is procedural, not message-based. When an application calls a socket
function, the kernel sends the data to the network by directly calling lower-level transport functions.
It finds the transport-specific functions through a table lookup and routes the data to them. This al-
lows higher layers of the socket interface to share state information with transport layers through
global data structures. In STREAMS, each module is insulated from others and has no global state.
Although such a modular framework has many advantages, it is difficult to duplicate some socket
functionality that depends on shared state.
Socket calls execute in the context of the calling process. Hence any errors can be synchro-
nously reported to the caller. STREAMS processes data asynchronously, and calls such as write or
putmsg succeed as soon as the data is copied in by the stream head. If a lower-level module gener-
ates an error, it can only affect subsequent write attempts; the call that caused the error has already
succeeded.
Some problems are associated with the decision to implement sockets on top of TPI. Certain
options are processed in different places by sockets than by TPI. For instance, socket applications
specify the maximum number of unaccepted connect indications (backlog) in the listen() call,
after the socket has been opened and bound. TPI, however, requires the backlog to be specified in
the T_BIND_REQ message, sent during the bind operation itself.
[Figure: the SVR4 sockets implementation — the socket library (socklib) in user space issues read, write, and related calls on a stream that has the sockmod module pushed onto it in kernel space.]
This way, it can reject sendmsg calls to unconnected sockets. If sockmod alone maintained the con-
nection information, there would be no way to return the correct error status to the caller.
Likewise, it is not sufficient to maintain state in socklib alone. This is because a process
may create a socket and then call exec, wiping out any state maintained in socklib. The socket im-
plementation detects this because exec initializes socklib to a known state. When a user tries to use
a socket after an exec, socklib sends an ioctl to sockmod to recover the lost state. Since sockmod is
in kernel space, its state is not wiped out by the exec call.
There are many interesting issues and problems concerning the SVR4 sockets implementa-
tion. They are discussed in detail in [Vess 90].
17.11 Summary
STREAMS provides a framework for writing device drivers and network protocols. It enables a
high degree of configurability and modularity. STREAMS does for drivers what pipes do for UNIX
shell commands. It allows the writing of independent modules, each of which acts as a filter and
performs some specific processing on a data stream. It then allows users to combine these modules
in different ways to form a stream. This stream acts like a bidirectional pipe, moving data between
the application and the device or network interface, with appropriate processing in between.
The modular design allows network protocols to be implemented in a layered manner, each
layer contained in a separate module. STREAMS are also used for interprocess communication, and
SVR4 has reimplemented pipes and FIFOs using streams. Finally, many character drivers, including
the terminal driver subsystem, have been rewritten as STREAMS drivers. Some of the recent en-
hancements to STREAMS include multiprocessor support [Garg 90, Saxe 93].
584 Chapter 17 STREAMS
17.12 Exercises
1. Why does STREAMS use separate msgb and datab data structures, instead of having a single
buffer header?
2. What is the difference between a STREAMS module and a STREAMS driver?
3. What is the relationship between the two queues of a module? Must they perform similar
functions?
4. How does the presence or absence of a service procedure affect the behavior of a queue?
5. Both the read and the getmsg system calls may be used to retrieve data from a stream. What
are the differences between them? For what situations is each of them more suitable?
6. Why are most STREAMS procedures not allowed to block? What can a put procedure do if it
cannot process a message immediately?
7. Why are priority bands useful?
8. Explain how and when a queue is back-enabled.
9. What functionality does the stream head provide? Why do all stream heads share a common
set of routines?
10. Why are STREAMS drivers accessed through the cdevsw table?
11. Why do STREAMS devices require special support for ioctl? Why can there be only one
active ioctl on a stream?
12. Why is it often reasonable to discard incoming network messages if there is a memory
shortfall?
13. What is the difference between a multiplexor and a module that is used independently in two
different streams?
14. Why does STREAMS not provide flow control for multiplexors?
15. In Figure 17-14, why is the IP layer implemented as a STREAMS driver and not as a module?
16. What are the benefits of persistent links?
17. STREAMS pipes allow bidirectional traffic, while traditional pipes do not. Describe an
application that takes advantage of this feature. How would you provide this functionality
without using STREAMS pipes?
18. What is the difference between TPI and TLI? What interactions does each of them pertain to?
19. Compare sockets and TLI as frameworks for writing network applications. What are the
advantages and drawbacks of each? Which features of one are not easily available in the
other?
20. Section 17.10.4 describes how SVR4 implements sockets on top of STREAMS. Could a
BSD-based system implement a STREAMS-like interface using sockets? What important
issues will need to be addressed?
21. Write a STREAMS module that converts all newline characters to "carriage-return + line-
feed" on the way up and the reverse transformation on the way down. Assume the messages
contain only printable ASCII characters.
22. A user may configure a stream dynamically by pushing a number of modules on the stack.
Each module does not know what module is above or below it. How, then, does a module
know how to interpret the messages sent by the neighboring module? What restrictions does
this place on which modules may be stacked together and in what order? In what way does
TPI address this problem?
17.13 References
[AT&T 89] American Telephone and Telegraph, UNIX System V Release 4 Network
Programmer's Guide, 1991.
[AT&T 91] American Telephone and Telegraph, UNIX System V Release 4 Internals Students
Guide, 1991.
[Garg 90] Garg, A., "Parallel STREAMS: A Multi-Processor Implementation," Proceedings of
the Winter 1990 USENIX Technical Conference, Jan. 1990.
[ISO 84] International Standards Organization, Open Systems Interconnection-Basic
Reference Model, ISO 7498, 1984.
[Leff 86] Leffler, S., Joy, W., Fabry, R., and Karels, M., "Networking Implementation Notes-
4.3BSD Edition," University of California, Berkeley, CA, Apr. 1986.
[Pres 90] Presotto, D.L., and Ritchie, D.M., "Interprocess Communications in the Ninth
Edition UNIX System," UNIX Research System Papers, Tenth Edition, Vol. II,
Saunders College Publishing, 1990, pp. 523-530.
[Rago 89] Rago, S., "Out-of-band Communication in STREAMS," Proceedings of the Summer
1989 USENIX Technical Conference, Jun. 1989, pp. 29-37.
[Ritc 83] Ritchie, D.M., "A Stream Input-Output System," AT&T Bell Laboratories Technical
Journal, Vol. 63, No. 8, Oct. 1984, pp. 1897-1910.
[Saxe 93] Saxena, S., Peacock, J.K., Verma, V., and Krishnan, M., "Pitfalls in Multithreading
SVR4 STREAMS and Other Weightless Processes," Proceedings of the Winter 1993
USENIX Technical Conference, Jan. 1993, pp. 85-95.
[USL 92a] UNIX System Laboratories, STREAMS Modules and Drivers, UNIX SVR4.2, UNIX
Press, Prentice-Hall, Englewood Cliffs, NJ, 1992.
[USL 92b] UNIX System Laboratories, Operating System API Reference, UNIX SVR4.2, UNIX
Press, Prentice-Hall, Englewood Cliffs, NJ, 1992.
[Vess 90] Vessey, I., and Skinner, G., "Implementing Berkeley Sockets in System V Release 4,"
Proceedings of the Winter 1990 USENIX Technical Conference, Jan. 1990, pp. 177-
193.