OS Notes

UNIT-1

What is an Operating System?


 A program that acts as an intermediary between a user of a computer and
the computer hardware
 Operating system goals:
o Execute user programs and make solving user problems easier
o Make the computer system convenient to use
o Use the computer hardware in an efficient manner

Computer System Structure


 Computer system can be divided into four components:
o Hardware – provides basic computing resources
 CPU, memory, I/O devices
o Operating system
 Controls and coordinates use of hardware among various
applications and users
o Application programs – define the ways in which the system
resources are used to solve the computing problems of the users
 Word processors, compilers, web browsers, database
systems, video games
o Users
 People, machines, other computers

Abstract view of the components of a computer system

USER VIEW

 Depends on the point of view


 Users want convenience, ease of use and good performance
 Don’t care about resource utilization
 But a shared computer such as a mainframe or minicomputer must keep all users happy
 Users of dedicated systems such as workstations have dedicated resources but
frequently use shared resources from servers
 Handheld computers are resource poor, optimized for usability and battery life
 Some computers have little or no user interface, such as embedded computers in
devices and automobiles

SYSTEM VIEW

 OS is a resource allocator
o Manages all resources
o Decides between conflicting requests for efficient and fair resource use
 OS is a control program
o Controls execution of programs to prevent errors and improper use of the
computer

OPERATING-SYSTEM OBJECTIVES

• OS is a layer of software whose job is to manage all devices and provide user
programs with a simpler interface to the hardware
• Objectives of OS
– Convenience
– Efficiency
– Ability to evolve

Layered Approach

• End user views a computer system in terms of a set of applications
• Applications are developed in a programming language by an application programmer
• To make the applications usable on the computer hardware, system programs or
utilities are provided
• OS comprises a collection of system programs
• OS masks the details of the hardware from the programmer and provides the
programmer with a convenient interface for using the system.

OS as User/Computer Interface
• OS provides services in the following areas:
– Program Development- Editors/Debuggers assist programmer in creating
programs. These are provided as application program development tool.
– Program execution – load a program into memory, run it, and end its execution normally or abnormally
– Access to I/O devices - OS provides a uniform interface for I/O devices
which are hidden from end users
– Controlled Access to files - OS needs to understand I/O, structure of file and
also provide protection to users in multiuser environment
– System Access: Provide access to system as whole and to specific system
resources. Resolve conflicts for resource contention.
– Error detection – OS needs to be constantly aware of possible errors
• May occur in the CPU and memory hardware, in I/O devices, in user
program
• For each type of error, OS should take the appropriate action to ensure
correct and consistent computing
• Debugging facilities can greatly enhance the user’s and programmer’s
abilities to efficiently use the system
– Accounting - To keep track of which users use how much and what kinds of
computer resources

OS as resource manager
• OS controls the basic functions of the computer: the movement, storage, and processing of
data
• But the control mechanism is unusual
– The OS functions as ordinary computer software
– The OS frequently relinquishes control and depends on the processor to regain it
– The control mechanism is internal to the system, not external

Fig: Resources managed by the OS


• Resources the OS control are
– Main memory
– I/O devices and Files
– Processor
• OS controlling main memory
– A portion of OS (Kernel) resides in the main memory
– Kernel contain
• Frequently used functions
• Portions of OS which are currently in use
– Main memory = Kernel + user programs + data
– Main memory allocation is done by OS and memory management hardware
in the processor
• OS controls I/O and files
– Decides when an I/O can be used by a program in execution
– Controlled access of files
• OS controls the processor
– Determines the processor time for the execution of a program

Ease of Evolution of an OS

• Hardware upgrades plus new types of hardware
• New services
• Fixes: faults are found and fixed, and a fix may itself introduce new faults
The need to change the OS regularly places certain requirements on its design.
The system should be modular in construction, with clearly defined interfaces between
modules, and should be well documented.

OPERATING SYSTEM DEFINITION

 The common functions of controlling and allocating resources are brought together
into one piece of software, called the operating system.
 No universally accepted definition
 “Everything a vendor ships when you order an operating system” is a good
approximation
o But varies wildly
 “The one program running at all times on the computer” is the kernel.
 Everything else is either
o a system program (ships with the operating system) , or
o an application program.

COMPUTER-SYSTEM ORGANIZATION - What are all the parts, and how do they fit together?

A modern computer system

o One or more CPUs, device controllers connect through common bus providing
access to shared memory
o Concurrent execution of CPUs and devices competing for memory cycles

Computer Startup

 bootstrap program is loaded at power-up or reboot


o Typically stored in ROM or EPROM, generally known as firmware
o Initializes all aspects of system
o Loads operating system kernel and starts execution

Computer-system operation

 I/O devices and the CPU can execute concurrently


 Each device controller is in charge of a particular device type
 Each device controller has a local buffer
 CPU moves data from/to main memory to/from local buffers
 I/O is from the device to local buffer of controller
 Device controller informs CPU that it has finished its operation by causing an
interrupt

Common Functions of Interrupts

 Interrupt transfers control to the interrupt service routine generally, through the
interrupt vector, which contains the addresses of all the service routines
 Interrupt architecture must save the address of the interrupted instruction
 A trap or exception is a software-generated interrupt caused either by an error or a
user request
 An operating system is interrupt driven

Interrupt Handling

 The operating system preserves the state of the CPU by storing registers and the
program counter
 Determines which type of interrupt has occurred:
o polling
o vectored interrupt system
 Separate segments of code determine what action should be taken for each type of
interrupt
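
To make the vectored approach concrete, the sketch below is a hypothetical user-space model (not real kernel code): the interrupt vector is represented as an array of handler addresses indexed by the interrupt number, whereas a polled system would instead query each device in turn to find the one that raised the interrupt.

/* Hypothetical sketch of vectored interrupt dispatch: the "interrupt
   vector" is an array of handler addresses, indexed by the interrupt
   number supplied by the hardware. */
#include <stdio.h>

#define NUM_VECTORS 4

typedef void (*isr_t)(void);            /* interrupt service routine        */

static void timer_isr(void)    { printf("timer interrupt\n"); }
static void keyboard_isr(void) { printf("keyboard interrupt\n"); }
static void default_isr(void)  { printf("unexpected interrupt\n"); }

static isr_t interrupt_vector[NUM_VECTORS] = {
    timer_isr, keyboard_isr, default_isr, default_isr
};

/* Called (conceptually) by the hardware with the interrupt number. */
static void dispatch(int irq)
{
    /* the state save of the interrupted instruction would happen here */
    if (irq >= 0 && irq < NUM_VECTORS)
        interrupt_vector[irq]();        /* vectored: a direct table lookup  */
    /* state restore and return to the interrupted program */
}

int main(void)
{
    dispatch(0);                        /* simulate a timer interrupt       */
    dispatch(1);                        /* simulate a keyboard interrupt    */
    return 0;
}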

Interrupt Timeline

Storage Structure

 Main memory – only large storage media that the CPU can access directly
o Random access
o Typically volatile
 Secondary storage – extension of main memory that provides large nonvolatile
storage capacity
 Hard disks – rigid metal or glass platters covered with magnetic recording material
o Disk surface is logically divided into tracks, which are subdivided into
sectors
o The disk controller determines the logical interaction between the device and
the computer
 Solid-state disks – faster than hard disks, nonvolatile
o Various technologies
o Becoming more popular
 Storage systems organized in hierarchy
o Speed
o Cost
o Volatility
 Caching – copying information into faster storage system; main memory can be
viewed as a cache for secondary storage
 Device Driver for each device controller to manage I/O
o Provides uniform interface between controller and kernel

Storage-device hierarchy

I/O Structure

 After I/O starts, control returns to user program only upon I/O completion
o Wait instruction idles the CPU until the next interrupt
o Wait loop (contention for memory access)
o At most one I/O request is outstanding at a time, no simultaneous I/O
processing
 After I/O starts, control returns to user program without waiting for I/O completion
o System call – request to the OS to allow user to wait for I/O completion
o Device-status table contains entry for each I/O device indicating its type,
address, and state
o OS indexes into I/O device table to determine device status and to modify
table entry to include interrupt

 Typical operation involves I/O requests, direct memory access ( DMA ), and
interrupt handling.
 Used for high-speed I/O devices able to transmit information at close to
memory speeds
 Device controller transfers blocks of data from buffer storage directly to main
memory without CPU intervention
 Only one interrupt is generated per block, rather than the one interrupt per byte

How a modern computer system works

COMPUTER-SYSTEM ARCHITECTURE - Different Operating Systems for
Different Kinds of Computer Environments

Single-Processor Systems

 One main CPU which manages the computer and runs user apps.
 Other specialized processors ( disk controllers, GPUs, etc. ) do not run user
apps.

Multiprocessor Systems

Multiprocessor systems (also known as parallel systems or tightly coupled systems) are
growing in importance. Such systems have two or more processors in close communication,
sharing the computer bus and sometimes the clock, memory, and peripheral devices.

Multiprocessor systems have three main advantages:

1. Increased throughput - Faster execution, but not 100% linear speedup.


2. Economy of scale - Peripherals, disks, memory, shared among processors.
3. Increased reliability
o Failure of a CPU slows system, doesn't crash it.
o Redundant processing provides system of checks and balances. ( e.g.
NASA )

Two types:

1. Asymmetric Multiprocessing – each processor is assigned a specific task.
2. Symmetric Multiprocessing – each processor performs all tasks

Symmetric multiprocessing architecture

 Multi-chip and multicore
 Systems containing all chips
 Chassis containing multiple separate systems

A dual-core design with two cores placed on the same chip

Clustered Systems

 Like multiprocessor systems, but multiple systems working together

 Usually sharing storage via a storage-area network (SAN)


 Provides a high-availability service which survives failures

o Asymmetric clustering has one machine in hot-standby mode


o Symmetric clustering has multiple nodes running applications,
monitoring each other

 Some clusters are for high-performance computing (HPC)

 Applications must be written to use parallelization

 Some have distributed lock manager (DLM) to avoid conflicting operations

General structure of a clustered system

OPERATING-SYSTEM STRUCTURE

 Multiprogramming (Batch system) needed for efficiency


o Single user cannot keep CPU and I/O devices busy at all times
o Multiprogramming organizes jobs (code and data) so CPU always has
one to execute
o A subset of total jobs in system is kept in memory
o One job selected and run via job scheduling
o When it has to wait (for I/O for example), OS switches to another job

Memory layout for a multiprogramming system

 Timesharing (multitasking) is logical extension in which CPU switches jobs so


frequently that users can interact with each job while it is running, creating interactive
computing
o Response time should be < 1 second
o Each user has at least one program executing in memory, called a process
o If several jobs are ready to run at the same time, the OS must choose among them: CPU scheduling
o If processes don't fit in memory, swapping moves them in and out to run
 Virtual memory allows execution of processes not completely in memory

OPERATING-SYSTEM OPERATIONS

Interrupt-driven nature of modern OSes requires that erroneous processes not be able to
disturb anything else.

Dual-Mode Operation

 Dual-mode operation allows OS to protect itself and other system components


o User mode and kernel mode
o Mode bit provided by hardware
 Provides ability to distinguish when system is running user code or
kernel code
 Some instructions designated as privileged, only executable in kernel
mode
 System call changes mode to kernel, return from call resets it to user
 Increasingly CPUs support multi-mode operations
o i.e. virtual machine manager (VMM) mode for guest VMs

Transition from user to kernel mode

Timer

 Timer to prevent infinite loop / process hogging resources


o Timer is set to interrupt the computer after some time period
o Keep a counter that is decremented by the physical clock.
o Operating system sets the counter (a privileged instruction)
o When the counter reaches zero, an interrupt is generated
o Set up before scheduling process to regain control or terminate program that
exceeds allotted time
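
A user-space analogy of this mechanism, assuming the standard POSIX alarm() and SIGALRM calls: arm a countdown, then spin; when the countdown expires the handler runs, much as the OS regains control when the hardware timer reaches zero. This only imitates the idea; the real timer is a privileged hardware facility.

/* User-space analogy for the OS timer, using POSIX alarm()/SIGALRM.
   The kernel's real timer is a hardware counter loaded by a privileged
   instruction; this just mimics the "interrupt at zero" behaviour. */
#include <signal.h>
#include <unistd.h>

static void on_timer(int sig)
{
    static const char msg[] = "timer expired - regaining control\n";
    (void)sig;
    write(STDOUT_FILENO, msg, sizeof msg - 1);  /* async-signal-safe output */
    _exit(0);                                   /* "terminate the program"  */
}

int main(void)
{
    signal(SIGALRM, on_timer);   /* install the "interrupt handler"         */
    alarm(2);                    /* arm a 2-second countdown                */
    for (;;)                     /* simulate a program hogging the CPU      */
        ;
}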

PROCESS MANAGEMENT
 A process is a program in execution. It is a unit of work within the system. Program is
a passive entity, process is an active entity.
 Process needs resources to accomplish its task
o CPU, memory, I/O, files
o Initialization data
 Process termination requires reclaim of any reusable resources

 Single-threaded process has one program counter specifying location of next
instruction to execute
o Process executes instructions sequentially, one at a time, until completion
 Multi-threaded process has one program counter per thread
 Typically system has many processes, some user, some operating system running
concurrently on one or more CPUs
o Concurrency by multiplexing the CPUs among the processes / threads

The operating system is responsible for the following activities in connection with process
management:
 Creating and deleting both user and system processes
 Suspending and resuming processes
 Providing mechanisms for process synchronization
 Providing mechanisms for process communication
 Providing mechanisms for deadlock handling

MEMORY MANAGEMENT

 To execute a program all (or part) of the instructions must be in memory


 All (or part) of the data that is needed by the program must be in memory.
 Memory management determines what is in memory and when
o Optimizing CPU utilization and computer response to users
Memory management activities
o Keeping track of which parts of memory are currently being used and by
whom
o Deciding which processes (or parts thereof) and data to move into and out of
memory
o Allocating and deallocating memory space as needed

STORAGE MANAGEMENT
 OS provides uniform, logical view of information storage
o Abstracts physical properties to logical storage unit - file
o Each medium is controlled by device (i.e., disk drive, tape drive)
 Varying properties include access speed, capacity, data-transfer rate,
access method (sequential or random)
File-System management
o Files usually organized into directories
o Access control on most systems to determine who can access what
OS activities include
 Creating and deleting files and directories
 Primitives to manipulate files and directories
 Mapping files onto secondary storage
 Backup files onto stable (non-volatile) storage media
Mass-Storage Management
 Usually disks used to store data that does not fit in main memory or data that must be
kept for a “long” period of time
 Proper management is of central importance
 Entire speed of computer operation hinges on disk subsystem and its algorithms
OS activities

o Free-space management
o Storage allocation
o Disk scheduling
 Some storage need not be fast
o Tertiary storage includes optical storage, magnetic tape
o Still must be managed – by OS or applications
o Varies between WORM (write-once, read-many-times) and RW (read-write)

Performance of Various Levels of Storage

Migration of data “A” from Disk to Register

 Multitasking environments must be careful to use most recent value, no matter where
it is stored in the storage hierarchy
 Multiprocessor environment must provide cache coherency in hardware such that all
CPUs have the most recent value in their cache
 Distributed environment situation even more complex
o Several copies of a datum can exist

COMPUTING ENVIRONMENTS
Traditional Computing
 Stand-alone general purpose machines
 But blurred as most systems interconnect with others (i.e., the Internet)
 Portals provide web access to internal systems
 Network computers (thin clients) are like Web terminals
 Mobile computers interconnect via wireless networks
 Networking becoming ubiquitous – even home systems use firewalls to protect
home computers from Internet attacks
Mobile Computing
• Handheld smartphones, tablets, etc
• What is the functional difference between them and a “traditional” laptop?
• Extra feature – more OS features (GPS, gyroscope)
• Allows new types of apps like augmented reality

• Use IEEE 802.11 wireless, or cellular data networks for connectivity
• Leaders are Apple iOS and Google Android

Distributed computing
 Distributed computing
o Collection of separate, possibly heterogeneous, systems networked
together
 Network is a communications path, TCP/IP most common
 Local Area Network (LAN)
 Wide Area Network (WAN)
 Metropolitan Area Network (MAN)
 Personal Area Network (PAN)
o Network Operating System provides features between systems across
network
 Communication scheme allows systems to exchange messages
 Illusion of a single system

Client-Server Computing
 Client-Server Computing
 Dumb terminals supplanted by smart PCs
 Many systems now servers, responding to requests generated by clients
 Compute-server system provides an interface to client to request services (i.e.,
database)
 File-server system provides interface for clients to store and retrieve files

Peer-to-Peer Computing
 Another model of distributed system
 P2P does not distinguish clients and servers
o Instead all nodes are considered peers
o May each act as client, server or both
o Node must join P2P network
 Registers its service with central lookup service on network, or
 Broadcast request for service and respond to requests for service via
discovery protocol
o Examples include Napster and Gnutella, Voice over IP (VoIP) such as Skype

Cloud Computing
 Delivers computing, storage, even apps as a service across a network
 Logical extension of virtualization because it uses virtualization as the base for its
functionality.
o Amazon EC2 has thousands of servers, millions of virtual machines,
petabytes of storage available across the Internet, pay based on usage
 Many types
o Public cloud – available via Internet to anyone willing to pay
o Private cloud – run by a company for the company’s own use
o Hybrid cloud – includes both public and private cloud components
o Software as a Service (SaaS) – one or more applications available via the
Internet (i.e., word processor)
o Platform as a Service (PaaS) – software stack ready for application use via the
Internet (i.e., a database server)
o Infrastructure as a Service (IaaS) – servers or storage available over Internet
(i.e., storage available for backup use)
 Cloud computing environments composed of traditional OSes, plus VMMs, plus
cloud management tools
o Internet connectivity requires security like firewalls
o Load balancers spread traffic across multiple applications

OPERATING-SYSTEM SERVICES

A view of operating system services

OSes provide environments in which programs run, and services for the users of the system,
including:

 User Interfaces - Means by which users can issue commands to the system.
Depending on the system these may be a command-line interface ( e.g. sh, csh, ksh,
tcsh, etc. ), a GUI interface ( e.g. Windows, X-Windows, KDE, Gnome, etc. ), or a

batch command system. The latter are generally older systems using punch cards and
job-control language, JCL, but may still be used today for specialty systems designed
for a single purpose.
 Program Execution - The OS must be able to load a program into RAM, run the
program, and terminate the program, either normally or abnormally.
 I/O Operations - The OS is responsible for transferring data to and from I/O devices,
including keyboards, terminals, printers, and storage devices.
 File-System Manipulation - In addition to raw data storage, the OS is also
responsible for maintaining directory and subdirectory structures, mapping file names
to specific blocks of data storage, and providing tools for navigating and utilizing the
file system.
 Communications - Inter-process communications, IPC, either between processes
running on the same processor, or between processes running on separate processors
or separate machines. May be implemented as either shared memory or message
passing, ( or some systems may offer both. )
 Error Detection - Both hardware and software errors must be detected and handled
appropriately, with a minimum of harmful repercussions. Some systems may include
complex error avoidance or recovery systems, including backups, RAID drives, and
other redundant systems. Debugging and diagnostic tools aid users and administrators
in tracing down the cause of problems.

Other systems aid in the efficient operation of the OS itself:

 Resource Allocation - E.g. CPU cycles, main memory, storage space, and peripheral
devices. Some resources are managed with generic systems and others with very
carefully designed and specially tuned systems, customized for a particular resource
and operating environment.
 Accounting - Keeping track of system activity and resource usage, either for billing
purposes or for statistical record keeping that can be used to optimize future
performance.
 Protection and Security - Preventing harm to the system and to resources, either
through wayward internal processes or malicious outsiders. Authentication,
ownership, and restricted access are obvious parts of this system. Highly secure
systems may log all process activity down to excruciating detail, and security
regulations may dictate the storage of those records on permanent non-erasable media for
extended periods in secure ( off-site ) facilities.

USER OPERATING-SYSTEM INTERFACE

Command Interpreter

 Gets and processes the next user request, and launches the requested programs.
 In some systems the CI may be incorporated directly into the kernel.
 More commonly the CI is a separate program that launches once the user logs in or
otherwise accesses the system.
 UNIX, for example, provides the user with a choice of different shells, which may
either be configured to launch automatically at login, or which may be changed on the
fly. ( Each of these shells uses a different configuration file of initial settings and
commands that are executed upon startup. )

 Different shells provide different functionality, in terms of certain commands that are
implemented directly by the shell without launching any external programs. Most
provide at least a rudimentary command interpretation structure for use in shell script
programming ( loops, decision constructs, variables, etc. )
 An interesting distinction is the processing of wild card file naming and I/O re-
direction. On UNIX systems those details are handled by the shell, and the program
which is launched sees only a list of filenames generated by the shell from the wild
cards. On a DOS system, the wild cards are passed along to the programs, which can
interpret the wild cards as the program sees fit.
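
A tiny illustration of that difference (a sketch; the program name showargs is made up): on a UNIX system, by the time main() runs the shell has already replaced *.c with the matching file names in argv, so the program never sees the '*' character, while a DOS-style command interpreter would pass the literal pattern through.

/* showargs.c - print the argument vector exactly as the program receives it.
   Run as:  ./showargs *.c
   On UNIX the shell expands the wild card before the program starts. */
#include <stdio.h>

int main(int argc, char *argv[])
{
    for (int i = 0; i < argc; i++)
        printf("argv[%d] = %s\n", i, argv[i]);
    return 0;
}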

Graphical User Interface, GUI

 Generally implemented as a desktop metaphor, with file folders, trash cans, and
resource icons.
 Icons represent some item on the system, and respond accordingly when the icon is
activated.
 First developed in the early 1970's at Xerox PARC research facility.
 In some systems the GUI is just a front end for activating a traditional command line
interpreter running in the background. In others the GUI is a true graphical shell in its
own right.
 Mac has traditionally provided ONLY the GUI interface. With the advent of OSX (
based partially on UNIX ), a command line interface has also become available.
 Because mice and keyboards are impractical for small mobile devices, these normally
use a touch-screen interface today, that responds to various patterns of swipes or
"gestures". When these first came out they often had a physical keyboard and/or a
trackball of some kind built in, but today a virtual keyboard is more commonly
implemented on the touch screen.

Choice of interface

 Most modern systems allow individual users to select their desired interface, and to
customize its operation, as well as the ability to switch between different interfaces as
needed. System administrators generally determine which interface a user starts with
when they first log in.
 GUI interfaces usually provide an option for a terminal emulator window for entering
command-line commands.
 Command-line commands can also be entered into shell scripts, which can then be
run like any other programs.

SYSTEM CALLS

 System calls provide a means for user or application programs to call upon the
services of the operating system.
 Generally written in C or C++, although some are written in assembly for optimal
performance.
 Figure 2.4 illustrates the sequence of system calls required to copy a file:

Example of how system calls are used.

 You can use "strace" to see more examples of the large number of system calls
invoked by a single simple command. Read the man page for strace, and try some
simple examples. ( strace mkdir temp, strace cd temp, strace date > t.t, strace cp t.t t.2,
etc. )
 Most programmers do not use the low-level system calls directly, but instead use an
"Application Programming Interface", API. The following sidebar shows the read( )
call available in the API on UNIX-based systems:
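
The sidebar itself is not reproduced in these notes; the sketch below assumes the standard POSIX signature ssize_t read( int fd, void *buf, size_t count ) and simply copies standard input to standard output through the API, which in turn traps into the kernel's read system call.

/* Minimal sketch of the POSIX read() API: copy standard input to standard
   output one buffer at a time.  The library routine read() invokes the
   kernel's read system call on the program's behalf. */
#include <unistd.h>

int main(void)
{
    char buf[512];
    ssize_t n;

    while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0)
        write(STDOUT_FILENO, buf, (size_t)n);   /* write() is the companion call */

    return (n < 0) ? 1 : 0;                     /* n == -1 signals an error      */
}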

The use of APIs instead of direct system calls provides for greater program portability
between different systems. The API then makes the appropriate system calls through the
system call interface, using a table lookup to access specific numbered system calls, as
shown in Figure 2.6:

Figure 2.6 - The handling of a user application invoking the open( ) system call

 Parameters are generally passed to system calls via registers, or less commonly, by
values pushed onto the stack. Large blocks of data are generally accessed indirectly,
through a memory address passed in a register or on the stack, as shown in Figure 2.7:

Passing of parameters as a table

TYPES OF SYSTEM CALLS

Six major categories, as outlined in Figure and the following six subsections:

 Standard library calls may also generate system calls, as shown here:

Process Control

 Process control system calls include end, abort, load, execute, create process,
terminate process, get/set process attributes, wait for time or event, signal event, and
allocate and free memory.
 Processes must be created, launched, monitored, paused, resumed,and eventually
stopped.
 When one process pauses or stops, then another must be launched or resumed
 When processes stop abnormally it may be necessary to provide core dumps and/or
other diagnostic or recovery tools.

 Compare DOS ( a single-tasking system ) with UNIX ( a multi-tasking system ).
o When a process is launched in DOS, the command interpreter first unloads as
much of itself as it can to free up memory, then loads the process and transfers
control to it. The interpreter does not resume until the process has completed,
as shown in Figure 2.9:

MS-DOS execution. (a) At system startup. (b) Running a program.

o Because UNIX is a multi-tasking system, the command interpreter remains


completely resident when executing a process, as shown in Figure below.
 The user can switch back to the command interpreter at any time, and
can place the running process in the background even if it was not
originally launched as a background process.
 In order to do this, the command interpreter first executes a "fork"
system call, which creates a second process which is an exact duplicate
( clone ) of the original command interpreter. The original process is
known as the parent, and the cloned process is known as the child, with
its own unique process ID and parent ID.
 The child process then executes an "exec" system call, which replaces
its code with that of the desired process.
 The parent ( command interpreter ) normally waits for the child to
complete before issuing a new command prompt, but in some cases it
can also issue a new prompt right away, without waiting for the child
process to complete. ( The child is then said to be running "in the
background", or "as a background process". )

FreeBSD running multiple programs

File Management

 File management system calls include create file, delete file, open, close, read, write,
reposition, get file attributes, and set file attributes.
 These operations may also be supported for directories as well as ordinary files.

Device Management

 Device management system calls include request device, release device, read, write,
reposition, get/set device attributes, and logically attach or detach devices.
 Devices may be physical ( e.g. disk drives ), or virtual / abstract ( e.g. files, partitions,
and RAM disks ).
 Some systems represent devices as special files in the file system, so that accessing
the "file" calls upon the appropriate device drivers in the OS. See for example the /dev
directory on any UNIX system.

Information Maintenance

 Information maintenance system calls include calls to get/set the time, date, system
data, and process, file, or device attributes.
 Systems may also provide the ability to dump memory at any time, single step
programs pausing execution after each instruction, and tracing the operation of
programs, all of which can help to debug programs.

Communication

 Communication system calls create/delete communication connection, send/receive


messages, transfer status information, and attach/detach remote devices.

 The message passing model must support calls to:
o Identify a remote process and/or host with which to communicate.
o Establish a connection between the two processes.
o Open and close the connection as needed.
o Transmit messages along the connection.
o Wait for incoming messages, in either a blocking or non-blocking state.
o Delete the connection when no longer needed.
 The shared memory model must support calls to:
o Create and access memory that is shared amongst processes ( and threads. )
o Provide locking mechanisms restricting simultaneous access.
o Free up shared memory and/or dynamically allocate it as needed.
 Message passing is simpler and easier, ( particularly for inter-computer
communications ), and is generally appropriate for small amounts of data.
 Shared memory is faster, and is generally the better approach where large amounts of
data are to be shared, ( particularly when most processes are reading the data rather
than writing it, or at least when only one or a small number of processes need to
change any given data item. )

Protection

 Protection provides mechanisms for controlling which users / processes have access
to which system resources.
 System calls allow the access mechanisms to be adjusted as needed, and for non-
privileged users to be granted elevated access permissions under carefully controlled
temporary circumstances.
 Once only of concern on multi-user systems, protection is now important on all
systems, in the age of ubiquitous network connectivity.

SYSTEM PROGRAMS

 System programs provide OS functionality through separate applications, which are


not part of the kernel or command interpreters. They are also known as system
utilities or system applications.
 Most systems also ship with useful applications such as calculators and simple
editors, ( e.g. Notepad ). Some debate arises as to the border between system and non-
system applications.
 System programs may be divided into these categories:
o File management - programs to create, delete, copy, rename, print, list, and
generally manipulate files and directories.
o Status information - Utilities to check on the date, time, number of users,
processes running, data logging, etc. System registries are used to store and
recall configuration information for particular applications.
o File modification - e.g. text editors and other tools which can change file
contents.
o Programming-language support - E.g. Compilers, linkers, debuggers,
profilers, assemblers, library archive management, interpreters for common
languages, and support for make.
o Program loading and execution - loaders, dynamic loaders, overlay loaders,
etc., as well as interactive debuggers.

o Communications - Programs for providing connectivity between processes
and users, including mail, web browsers, remote logins, file transfers, and
remote command execution.
o Background services - System daemons are commonly started when the
system is booted, and run for as long as the system is running, handling
necessary services. Examples include network daemons, print servers, process
schedulers, and system error monitoring services.
 Most operating systems today also come complete with a set of application
programs to provide additional services, such as copying files or checking the time
and date.
 Most users' view of the system is determined by their command interpreter and the
application programs. Most never make system calls, even through the API, ( with the
exception of simple ( file ) I/O in user-written programs. )

OPERATING-SYSTEM DESIGN AND IMPLEMENTATION

Design Goals

 Requirements define properties which the finished system must have, and are a
necessary first step in designing any large complex system.
o User requirements are features that users care about and understand, and are
written in commonly understood vernacular. They generally do not include
any implementation details, and are written similar to the product description
one might find on a sales brochure or the outside of a shrink-wrapped box.
o System requirements are written for the developers, and include more details
about implementation specifics, performance requirements, compatibility
constraints, standards compliance, etc. These requirements serve as a
"contract" between the customer and the developers, ( and between developers
and subcontractors ), and can get quite detailed.
 Requirements for operating systems can vary greatly depending on the planned scope
and usage of the system. ( Single user / multi-user, specialized system / general
purpose, high/low security, performance needs, operating environment, etc. )

Mechanisms and Policies

 Policies determine what is to be done. Mechanisms determine how it is to be


implemented.
 If properly separated and implemented, policy changes can be easily adjusted without
re-writing the code, just by adjusting parameters or possibly loading new data /
configuration files. For example the relative priority of background versus foreground
tasks.

Implementation

 Traditionally OSes were written in assembly language. This provided direct control
over hardware-related issues, but inextricably tied a particular OS to a particular HW
platform.
 Recent advances in compiler efficiencies mean that most modern OSes are written in
C, or more recently, C++. Critical sections of code are still written in assembly

language, ( or written in C, compiled to assembly, and then fine-tuned and optimized
by hand from there. )
 Operating systems may be developed using emulators of the target hardware,
particularly if the real hardware is unavailable ( e.g. not built yet ), or not a suitable
platform for development, ( e.g. smart phones, game consoles, or other similar
devices. )

OPERATING-SYSTEM STRUCTURE

For efficient performance and implementation an OS should be partitioned into separate


subsystems, each with carefully defined tasks, inputs, outputs, and performance
characteristics. These subsystems can then be arranged in various architectural
configurations:

Simple Structure

When DOS was originally written its developers had no idea how big and important it would
eventually become. It was written by a few programmers in a relatively short amount of time,
without the benefit of modern software engineering techniques, and then gradually grew over
time to exceed its original expectations. It does not break the system into subsystems, and has
no distinction between user and kernel modes, allowing all programs direct access to the
underlying hardware. ( Note that user versus kernel mode was not supported by the 8088 chip
set anyway, so that really wasn't an option back then. )

MS-DOS layer structure

The original UNIX OS used a simple layered approach, but almost all the OS was in one big
layer, not really breaking the OS down into layered subsystems:

Traditional UNIX system structure

Layered Approach

 Another approach is to break the OS into a number of smaller layers, each of which
rests on the layer below it, and relies solely on the services provided by the next lower
layer.
 This approach allows each layer to be developed and debugged independently, with
the assumption that all lower layers have already been debugged and are trusted to
deliver proper services.
 The problem is deciding what order in which to place the layers, as no layer can call
upon the services of any higher layer, and so many chicken-and-egg situations may
arise.
 Layered approaches can also be less efficient, as a request for service from a higher
layer has to filter through all lower layers before it reaches the HW, possibly with
significant processing at each step.

A layered operating system

Microkernels

 The basic idea behind micro kernels is to remove all non-essential services from the
kernel, and implement them as system applications instead, thereby making the kernel
as small and efficient as possible.
 Most microkernels provide basic process and memory management, and message
passing between other services, and not much more.
 Security and protection can be enhanced, as most services are performed in user
mode, not kernel mode.
 System expansion can also be easier, because it only involves adding more system
applications, not rebuilding a new kernel.
 Mach was the first and most widely known microkernel, and now forms a major
component of Mac OSX.
 Windows NT was originally a microkernel, but suffered from performance problems
relative to Windows 95. NT 4.0 improved performance by moving more services into
the kernel, and now XP is back to being more monolithic.
 Another microkernel example is QNX, a real-time OS for embedded systems.

Architecture of a typical microkernel

Modules

 Modern OS development is object-oriented, with a relatively small core kernel and a


set of modules which can be linked in dynamically. See for example the Solaris
structure, as shown in Figure below.
 Modules are similar to layers in that each subsystem has clearly defined tasks and
interfaces, but any module is free to contact any other module, eliminating the
problems of going through multiple intermediary layers, as well as the chicken-and-
egg problems.
 The kernel is relatively small in this architecture, similar to microkernels, but the
kernel does not have to implement message passing since modules are free to contact
each other directly.

Solaris loadable modules

Hybrid Systems

 Most OSes today do not strictly adhere to one architecture, but are hybrids of several.

Mac OS X

 The Mac OS X architecture relies on the Mach microkernel for basic system
management services, and the BSD kernel for additional services. Application
services and dynamically loadable modules ( kernel extensions ) provide the rest of
the OS functionality:

The Mac OS X structure

iOS

 The iOS operating system was developed by Apple for iPhones and iPads. It runs
with less memory and computing power than Mac OS X, and supports a
touchscreen interface and graphics for small screens:

Architecture of Apple's iOS.

Android

 The Android OS was developed for Android smartphones and tablets by the Open
Handset Alliance, primarily Google.
 Android is an open-source OS, as opposed to iOS, which has led to its popularity.
 Android includes versions of Linux and a Java virtual machine both optimized for
small platforms.
 Android apps are developed using a special Java-for-Android development
environment.

Architecture of Google's Android

Unit-2
Process Concept

 A process is an instance of a program in execution.


 Batch systems work in terms of "jobs". Many modern process concepts are still expressed
in terms of jobs, ( e.g. job scheduling ), and the two terms are often used interchangeably.

The Process

 Process memory is divided into four sections as shown in Figure below:


o The text section comprises the compiled program code, read in from non-
volatile storage when the program is launched.
o The data section stores global and static variables, allocated and initialized
prior to executing main.
o The heap is used for dynamic memory allocation, and is managed via calls
to new, delete, malloc, free, etc.
o The stack is used for local variables. Space on the stack is reserved for
local variables when they are declared ( at function entrance or elsewhere,
depending on the language ), and the space is freed up when the variables
go out of scope. Note that the stack is also used for function return values,
and the exact mechanisms of stack management may be language specific.
o Note that the stack and the heap start at opposite ends of the process's free
space and grow towards each other. If they should ever meet, then either a
stack overflow error will occur, or else a call to new or malloc will fail due
to insufficient memory available.
 When processes are swapped out of memory and later restored, additional
information must also be stored and restored. Key among them are the program
counter and the value of all program registers.
A process in memory

Process State

 Processes may be in one of 5 states, as shown in Figure below.


o New - The process is in the stage of being created.
o Ready - The process has all the resources available that it needs to run,
but the CPU is not currently working on this process's instructions.
o Running - The CPU is working on this process's instructions.
o Waiting - The process cannot run at the moment, because it is waiting for
some resource to become available or for some event to occur. For
example the process may be waiting for keyboard input, disk access
request, inter-process messages, a timer to go off, or a child process to
finish.
o Terminated - The process has completed.
 The load average reported by the "w" command indicates the average number of
processes in the "Ready" state over the last 1, 5, and 15 minutes, i.e. processes
who have everything they need to run but cannot because the CPU is busy doing
something else.
 Some systems may have other states besides the ones listed here.
Diagram of process state

Process Control Block

For each process there is a Process Control Block, PCB, which stores the following ( types of )
process-specific information, as illustrated in Figure ( Specific details may vary from system to
system. )

 Process State - Running, waiting, etc., as discussed above.


 Process ID, and parent process ID.
 CPU registers and Program Counter - These need to be saved and restored
when swapping processes in and out of the CPU.
 CPU-Scheduling information - Such as priority information and pointers to
scheduling queues.
 Memory-Management information - E.g. page tables or segment tables.
 Accounting information - user and kernel CPU time consumed, account
numbers, limits, etc.
 I/O Status information - Devices allocated, open file tables, etc.
Process control block ( PCB )
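
As a rough illustration, a PCB can be thought of as a C record holding this per-process bookkeeping. The sketch below is hypothetical (field names and sizes are illustrative, not from any particular kernel), but its categories match the list above.

/* Hypothetical sketch of a process control block.  Real kernels (e.g.
   Linux's task_struct) hold far more, but the categories are the same. */
#include <stdint.h>

enum proc_state { NEW, READY, RUNNING, WAITING, TERMINATED };

struct pcb {
    int              pid;              /* process identifier                 */
    int              ppid;             /* parent process identifier          */
    enum proc_state  state;            /* running, waiting, ...              */
    uint64_t         program_counter;  /* saved program counter              */
    uint64_t         registers[32];    /* saved CPU registers                */
    int              priority;         /* CPU-scheduling information         */
    void            *page_table;       /* memory-management information      */
    uint64_t         user_time;        /* accounting: user CPU time consumed */
    uint64_t         kernel_time;      /* accounting: kernel CPU time        */
    int              open_files[16];   /* I/O status: open file descriptors  */
    struct pcb      *next;             /* link for a scheduling queue        */
};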

Diagram showing CPU switch from process to process


Threads

 Modern systems allow a single process to have multiple threads of execution,


which execute concurrently. Threads are covered extensively in the next chapter.

Process Scheduling

 The two main objectives of the process scheduling system are to keep the CPU busy at all
times and to deliver "acceptable" response times for all programs, particularly for
interactive ones.
 The process scheduler must meet these objectives by implementing suitable policies for
swapping processes in and out of the CPU.
 ( Note that these objectives can be conflicting. In particular, every time the system steps
in to swap processes it takes up time on the CPU to do so, which is thereby "lost" from
doing any useful productive work. )

Scheduling Queues

 All processes are stored in the job queue.


 Processes in the Ready state are placed in the ready queue.
 Processes waiting for a device to become available or to deliver data are placed in
device queues. There is generally a separate device queue for each device.
 Other queues may also be created and used as needed.
The ready queue and various I/O device queues

Schedulers

 A long-term scheduler is typical of a batch system or a very heavily loaded


system. It runs infrequently, ( such as when one process ends, selecting another one
to be loaded in from disk in its place ), and can afford to take the time to
implement intelligent and advanced scheduling algorithms.
 The short-term scheduler, or CPU Scheduler, runs very frequently, on the order
of 100 milliseconds, and must very quickly swap one process out of the CPU and
swap in another one.
 Some systems also employ a medium-term scheduler. When system loads get
high, this scheduler will swap one or more processes out of the ready queue
system for a few seconds, in order to allow smaller faster jobs to finish up quickly
and clear the system. See the differences in Figures 3.7 and 3.8 below.
 An efficient scheduling system will select a good process mix of CPU-bound
processes and I/O bound processes.
Queueing-diagram representation of process scheduling

Addition of a medium-term scheduling to the queueing diagram

Context Switch

 Whenever an interrupt arrives, the CPU must do a state-save of the currently


running process, then switch into kernel mode to handle the interrupt, and then do
a state-restore of the interrupted process.
 Similarly, a context switch occurs when the time slice for one process has
expired and a new process is to be loaded from the ready queue. This will be
instigated by a timer interrupt, which will then cause the current process's state to
be saved and the new process's state to be restored.
 Saving and restoring states involves saving and restoring all of the registers and
program counter(s), as well as the process control blocks described above.
 Context switching happens VERY VERY frequently, and the overhead of doing
the switching is just lost CPU time, so context switches ( state saves & restores )
need to be as fast as possible. Some hardware has special provisions for speeding
this up, such as a single machine instruction for saving or restoring all registers at
once.
 Some Sun hardware actually has multiple sets of registers, so the context
switching can be speeded up by merely switching which set of registers are
currently in use. Obviously there is a limit as to how many processes can be
switched between in this manner, making it attractive to implement the medium-
term scheduler to swap some processes out as shown in Figure 3.8 above.

Operations on Processes
Process Creation
 Processes may create other processes through appropriate system calls, such
as fork or spawn. The process which does the creating is termed the parent of the other
process, which is termed its child.
 Each process is given an integer identifier, termed its process identifier, or PID. The
parent PID ( PPID ) is also stored for each process.
 On typical UNIX systems the process scheduler is termed sched, and is given PID 0. The
first thing it does at system startup time is to launch init, which gives that process PID 1.
Init then launches all system daemons and user logins, and becomes the ultimate parent
of all other processes. Figure shows a typical process tree for a Linux system, and other
systems will have similar though not identical trees:
 Depending on system implementation, a child process may receive some amount of
shared resources with its parent. Child processes may or may not be limited to a subset of
the resources originally allocated to the parent, preventing runaway children from
consuming all of a certain system resource.
 There are two options for the parent process after creating the child:
1. Wait for the child process to terminate before proceeding. The parent makes a
wait( ) system call, for either a specific child or for any child, which causes the
parent process to block until the wait( ) returns. UNIX shells normally wait for
their children to complete before issuing a new prompt.
2. Run concurrently with the child, continuing to process without waiting. This is the
operation seen when a UNIX shell runs a process as a background task. It is also
possible for the parent to run for a while, and then wait for the child later, which
might occur in a sort of a parallel processing operation. ( E.g. the parent may fork
off a number of children without waiting for any of them, then do a little work of
its own, and then wait for the children. )
 Two possibilities for the address space of the child relative to the parent:
1. The child may be an exact duplicate of the parent, sharing the same program and
data segments in memory. Each will have their own PCB, including program
counter, registers, and PID. This is the behavior of the fork system call in UNIX.
2. The child process may have a new program loaded into its address space, with all
new code and data segments. This is the behavior of the spawn system calls in
Windows. UNIX systems implement this as a second step, using the exec system
call.
 Figure 3.10 below shows the fork and exec process on a UNIX system. Note that
the fork system call returns in both processes: it returns a zero to the child process
and the non-zero child PID to the parent, so the return value
indicates which process is which. Process IDs can be looked up any time for the current
process or its direct parent using the getpid( ) and getppid( ) system calls respectively.

Figure 3.11 - Process creation using the fork( ) system call


Process Termination

 Processes may request their own termination by making the exit( ) system call, typically
returning an int. This int is passed along to the parent if it is doing a wait( ), and is
typically zero on successful completion and some non-zero code in the event of
problems.
o child code:

int exitCode;

exit( exitCode ); // return exitCode; has the same effect when executed from main( )

o parent code:

pid_t pid;
int status;

pid = wait( &status ); // pid indicates which child exited; the child's exitCode
                       // is encoded in status

// the wait( ) macros extract the exit code and the reason the child stopped
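
For example, the standard <sys/wait.h> macros decode the status word. The sketch below (assuming ordinary POSIX semantics) forks a child purely so that there is something to wait for:

/* Sketch: decoding the status word with the standard wait macros. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    if (fork() == 0)            /* child: terminate immediately with code 3 */
        exit(3);

    int status;
    pid_t pid = wait(&status);  /* parent: collect the child                */

    if (WIFEXITED(status))      /* did the child call exit()?               */
        printf("child %d exited with code %d\n", (int)pid, WEXITSTATUS(status));
    else if (WIFSIGNALED(status))
        printf("child %d was killed by signal %d\n", (int)pid, WTERMSIG(status));

    return 0;
}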

 Processes may also be terminated by the system for a variety of reasons, including:
o The inability of the system to deliver necessary system resources.
o In response to a KILL command, or other unhandled process interrupt.
o A parent may kill its children if the task assigned to them is no longer needed.
o If the parent exits, the system may or may not allow the child to continue without
a parent. ( On UNIX systems, orphaned processes are generally inherited by init,
which then proceeds to kill them. The UNIX nohup command allows a child to
continue executing after its parent has exited. )
 When a process terminates, all of its system resources are freed up, open files flushed and
closed, etc. The process termination status and execution times are returned to the parent
if the parent is waiting for the child to terminate, or eventually returned to init if the
process becomes an orphan. ( Processes which are trying to terminate but which cannot
because their parent is not waiting for them are termed zombies. These are eventually
inherited by init as orphans and killed off. Note that modern UNIX shells do not produce
as many orphans and zombies as older systems used to. )

Interprocess Communication

 Independent processes operating concurrently on a system are those that can neither
affect other processes nor be affected by other processes.
 Cooperating Processes are those that can affect or be affected by other processes. There
are several reasons why cooperating processes are allowed:
o Information Sharing - There may be several processes which need access to the
same file for example. ( e.g. pipelines. )
o Computation speedup - Often a solution to a problem can be solved faster if the
problem can be broken down into sub-tasks to be solved simultaneously (
particularly when multiple processors are involved. )
o Modularity - The most efficient architecture may be to break a system down into
cooperating modules. ( E.g. databases with a client-server architecture. )
o Convenience - Even a single user may be multi-tasking, such as editing,
compiling, printing, and running the same code in different windows.

 Cooperating processes require some type of inter-process communication, which is most


commonly one of two types: Shared Memory systems or Message Passing systems.
Figure illustrates the difference between the two systems:

Communications models: (a) Message passing. (b) Shared memory.

 Shared Memory is faster once it is set up, because no system calls are required and access
occurs at normal memory speeds. However it is more complicated to set up, and doesn't
work as well across multiple computers. Shared memory is generally preferable when
large amounts of information must be shared quickly on the same computer.
 Message Passing requires system calls for every message transfer, and is therefore
slower, but it is simpler to set up and works well across multiple computers. Message
passing is generally preferable when the amount and/or frequency of data transfers is
small, or when multiple computers are involved.

Shared-Memory Systems
 In general the memory to be shared in a shared-memory system is initially within
the address space of a particular process, which needs to make system calls in
order to make that memory publicly available to one or more other processes.
 Other processes which wish to use the shared memory must then make their own
system calls to attach the shared memory area onto their address space.
 Generally a few messages must be passed back and forth between the cooperating
processes first in order to set up and coordinate the shared memory access.
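
A sketch of those system calls using the POSIX shared-memory API (the region name /demo_region and its size are illustrative): the creating process calls shm_open() and mmap(); a second process would attach by calling shm_open() with the same name, without O_CREAT, and mmap()ing the region into its own address space.

/* Sketch of setting up a POSIX shared-memory region (creator side).
   On some systems, link with -lrt. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/demo_region"   /* illustrative object name */
#define SHM_SIZE 4096

int main(void)
{
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);    /* create the object */
    if (fd < 0) { perror("shm_open"); return 1; }

    if (ftruncate(fd, SHM_SIZE) < 0) {                       /* set its size      */
        perror("ftruncate"); return 1;
    }

    char *p = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,   /* map it into our   */
                   MAP_SHARED, fd, 0);                       /* address space     */
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "hello from the producer");                    /* now it is just memory */

    munmap(p, SHM_SIZE);
    close(fd);
    /* shm_unlink(SHM_NAME) would remove the region once it is no longer needed */
    return 0;
}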

Producer-Consumer Example Using Shared Memory

 This is a classic example, in which one process is producing data and another
process is consuming the data. ( In this example in the order in which it is
produced, although that could vary. )
 The data is passed via an intermediary buffer, which may be either unbounded or
bounded. With a bounded buffer the producer may have to wait until there is
space available in the buffer, but with an unbounded buffer the producer will
never need to wait. The consumer may need to wait in either case until there is
data available.
 This example uses shared memory and a circular queue. Note in the code below
that only the producer changes "in", and only the consumer changes "out", and
that they can never be accessing the same array location at the same time.
 First the following data is set up in the shared memory area:

#define BUFFER_SIZE 10

typedef struct {
...
} item;

item buffer[ BUFFER_SIZE ];


int in = 0;
int out = 0;

 Then the producer process. Note that the buffer is full when "in" is one less than
"out" in a circular sense:

item nextProduced;

while( true ) {

/* Produce an item and store it in nextProduced */


nextProduced = makeNewItem( . . . );

/* Wait for space to become available */


while( ( ( in + 1 ) % BUFFER_SIZE ) == out )
; /* Do nothing */
/* And then store the item and repeat the loop. */
buffer[ in ] = nextProduced;
in = ( in + 1 ) % BUFFER_SIZE;
}

 Then the consumer process. Note that the buffer is empty when "in" is equal to
"out":

item nextConsumed;

while( true ) {

/* Wait for an item to become available */


while( in == out )
; /* Do nothing */

/* Get the next available item */


nextConsumed = buffer[ out ];
out = ( out + 1 ) % BUFFER_SIZE;

/* Consume the item in nextConsumed


( Do something with it ) */
}

Message-Passing Systems

 Message passing systems must support at a minimum system calls for "send
message" and "receive message".
 A communication link must be established between the cooperating processes
before messages can be sent.
 There are three key issues to be resolved in message passing systems as further
explored in the next three subsections:
o Direct or indirect communication ( naming )
o Synchronous or asynchronous communication
o Automatic or explicit buffering.

Naming

 With direct communication the sender must know the name of the
receiver to which it wishes to send a message.
o There is a one-to-one link between every sender-receiver pair.
o For symmetric communication, the receiver must also know the
specific name of the sender from which it wishes to receive
messages.
For asymmetric communications, this is not necessary.
 Indirect communication uses shared mailboxes, or ports.
o Multiple processes can share the same mailbox or boxes.
o Only one process can read any given message in a mailbox.
Initially the process that creates the mailbox is the owner, and is
the only one allowed to read mail in the mailbox, although this
privilege may be transferred.
 ( Of course the process that reads the message can
immediately turn around and place an identical message
back in the box for someone else to read, but that may put it
at the back end of a queue of messages. )
o The OS must provide system calls to create and delete mailboxes,
and to send and receive messages to/from mailboxes.

Synchronization

 Either the sending or receiving of messages ( or neither or both ) may be either
blocking ( synchronous ) or non-blocking ( asynchronous ):
o Blocking send - the sender blocks until the message is received.
o Non-blocking send - the sender sends the message and continues.
o Blocking receive - the receiver blocks until a message is available.
o Non-blocking receive - the receiver retrieves either a valid message or a null.

Buffering

 Messages are passed via queues, which may have one of three capacity
configurations:
1. Zero capacity - Messages cannot be stored in the queue, so
senders must block until receivers accept the messages.
2. Bounded capacity - There is a certain pre-determined finite
capacity in the queue. Senders must block if the queue is full, until
space becomes available in the queue, but may be either blocking
or non-blocking otherwise.
3. Unbounded capacity - The queue has a theoretical infinite
capacity, so senders are never forced to block.
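 As a concrete illustration of bounded-capacity message passing, here is a minimal
sketch using POSIX message queues. ( Assumes <mqueue.h>; on Linux it may need to be
linked with -lrt; the queue name "/demo_queue" is hypothetical and most error handling
is omitted. )

#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>
#include <string.h>

int main( void ) {
    struct mq_attr attr = { 0 };
    attr.mq_maxmsg  = 10;   /* bounded capacity: at most 10 queued messages */
    attr.mq_msgsize = 64;   /* each message at most 64 bytes */

    mqd_t mq = mq_open( "/demo_queue", O_CREAT | O_RDWR, 0666, &attr );
    if( mq == (mqd_t) -1 ) { perror( "mq_open" ); return 1; }

    /* send blocks if the queue already holds mq_maxmsg messages */
    mq_send( mq, "hello", strlen( "hello" ) + 1, 0 );

    /* receive blocks until a message is available */
    char buf[ 64 ];
    mq_receive( mq, buf, sizeof( buf ), NULL );
    printf( "received: %s\n", buf );

    mq_close( mq );
    mq_unlink( "/demo_queue" );
    return 0;
}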

Pipes
 Pipes are one of the earliest and simplest channels of communications between ( UNIX )
processes.
 There are four key considerations in implementing pipes:
1. Unidirectional or Bidirectional communication?
2. Is bidirectional communication half-duplex or full-duplex?
3. Must a relationship such as parent-child exist between the processes?
4. Can pipes communicate over a network, or only on the same machine?
 The following sections examine these issues on UNIX and Windows

Ordinary Pipes
 Ordinary pipes are uni-directional, with a reading end and a writing end. ( If bidirectional
communications are needed, then a second pipe is required. )
 In UNIX ordinary pipes are created with the system call "int pipe( int fd [ ] )".
o The return value is 0 on success, -1 if an error occurs.
o The int array must be allocated before the call, and the values are filled in by the
pipe system call:
 fd[ 0 ] is filled in with a file descriptor for the reading end of the pipe
 fd[ 1 ] is filled in with a file descriptor for the writing end of the pipe
o UNIX pipes are accessible as files, using standard read( ) and write( ) system
calls.
o Ordinary pipes are only accessible within the process that created them.
 Typically a parent creates the pipe before forking off a child.
 When the child inherits open files from its parent, including the pipe
file(s), a channel of communication is established.
 Each process ( parent and child ) should first close the ends of the pipe
that they are not using. For example, if the parent is writing to the pipe and
the child is reading, then the parent should close the reading end of its pipe
after the fork and the child should close the writing end.
 Figure 3.22 shows an ordinary pipe in UNIX.

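 A minimal sketch of this pattern in C ( error handling mostly omitted ):

#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main( void ) {
    int fd[ 2 ];
    char buf[ 32 ];

    if( pipe( fd ) == -1 ) { perror( "pipe" ); return 1; }

    if( fork( ) == 0 ) {        /* child: reads from the pipe */
        close( fd[ 1 ] );       /* close the unused writing end */
        read( fd[ 0 ], buf, sizeof( buf ) );
        printf( "child read: %s\n", buf );
        close( fd[ 0 ] );
    } else {                    /* parent: writes to the pipe */
        close( fd[ 0 ] );       /* close the unused reading end */
        write( fd[ 1 ], "greetings", strlen( "greetings" ) + 1 );
        close( fd[ 1 ] );
        wait( NULL );
    }
    return 0;
}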

 Ordinary pipes in Windows are very similar:
o Windows terms them anonymous pipes.
o They are still limited to parent-child relationships.
o They are read from and written to as files.
o They are created with CreatePipe( ) function, which takes additional arguments.
o In Windows it is necessary to specify what resources a child inherits, such as
pipes.
Named Pipes

 Named pipes support bidirectional communication, communication between non parent-child
related processes, and persistence after the process which created them exits.
Multiple processes can also share a named pipe, typically one reader and multiple writers.
 In UNIX, named pipes are termed fifos, and appear as ordinary files in the file system.
o ( Recognizable by a "p" as the first character of a long listing, e.g. /dev/initctl )
o Created with mkfifo( ) and manipulated with read( ), write( ), open( ), close( ),
etc.
o UNIX named pipes are bidirectional, but half-duplex, so two pipes are still
typically used for bidirectional communications.
o UNIX named pipes still require that all processes be running on the same
machine. Otherwise sockets are used.
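o A minimal sketch of the writer side of a fifo ( the path is hypothetical; a separate,
unrelated process would open the same path with O_RDONLY and read( ) from it ):

#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main( void ) {
    mkfifo( "/tmp/demo_fifo", 0666 );             /* create the fifo ( once ) */

    int fd = open( "/tmp/demo_fifo", O_WRONLY );  /* blocks until a reader opens it */
    write( fd, "via fifo", strlen( "via fifo" ) + 1 );
    close( fd );
    return 0;
}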
 Windows named pipes provide richer communications.
 Full-duplex is supported.
 Processes may reside on the same or different machines
 Created and manipulated using CreateNamedPipe( ), ConnectNamedPipe( ),
ReadFile( ),and WriteFile( ).

Threads
Overview

 A thread is a basic unit of CPU utilization, consisting of a program counter, a stack, and
a set of registers, ( and a thread ID. )
 Traditional ( heavyweight ) processes have a single thread of control - There is one
program counter, and one sequence of instructions that can be carried out at any given
time.
 As shown in Figure, multi-threaded applications have multiple threads within a single
process, each having their own program counter, stack and set of registers, but sharing
common code, data, and certain structures such as open files.
Single-threaded and multithreaded processes

Motivation

 Threads are very useful in modern programming whenever a process has multiple
tasks to perform independently of the others.
 This is particularly true when one of the tasks may block, and it is desired to
allow the other tasks to proceed without blocking.
 For example in a word processor, a background thread may check spelling and
grammar while a foreground thread processes user input ( keystrokes ), while yet
a third thread loads images from the hard drive, and a fourth does periodic
automatic backups of the file being edited.
 Another example is a web server - Multiple threads allow for multiple requests to
be satisfied simultaneously, without having to service requests sequentially or to
fork off separate processes for every incoming request. ( The latter is how this
sort of thing was done before the concept of threads was developed. A daemon
would listen at a port, fork off a child for every incoming request to be processed,
and then go back to listening to the port. )
Multithreaded server architecture

Benefits

 There are four major categories of benefits to multi-threading:


1. Responsiveness - One thread may provide rapid response while other
threads are blocked or slowed down doing intensive calculations.
2. Resource sharing - By default threads share common code, data, and other
resources, which allows multiple tasks to be performed simultaneously in
a single address space.
3. Economy - Creating and managing threads ( and context switches between
them ) is much faster than performing the same tasks for processes.
4. Scalability, i.e. Utilization of multiprocessor architectures - A single
threaded process can only run on one CPU, no matter how many may be
available, whereas the execution of a multi-threaded application may be
split amongst available processors. ( Note that single threaded processes
can still benefit from multi-processor architectures when there are multiple
processes contending for the CPU, i.e. when the load average is above
some certain threshold. )
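 As a concrete illustration, here is a minimal Pthreads sketch ( the classic summation
example ): the main thread and the worker share the global variable sum, but each has
its own stack and program counter.

#include <pthread.h>
#include <stdio.h>

int sum = 0;                    /* data shared by all threads in the process */

void *runner( void *param ) {
    int upper = *(int *) param;
    for( int i = 1; i <= upper; i++ )
        sum += i;
    pthread_exit( NULL );
}

int main( void ) {
    pthread_t tid;
    int n = 10;

    pthread_create( &tid, NULL, runner, &n );   /* spawn the worker thread */
    pthread_join( tid, NULL );                  /* wait for it to finish   */
    printf( "sum = %d\n", sum );
    return 0;
}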

Multithreading Models

 There are two types of threads to be managed in a modern system: User threads and
kernel threads.
 User threads are supported above the kernel, without kernel support. These are the
threads that application programmers would put into their programs.
 Kernel threads are supported within the kernel of the OS itself. All modern OSes support
kernel level threads, allowing the kernel to perform multiple simultaneous tasks and/or to
service multiple kernel system calls simultaneously.
 In a specific implementation, the user threads must be mapped to kernel threads, using
one of the following strategies.
Many-To-One Model

 In the many-to-one model, many user-level threads are all mapped onto a single
kernel thread.
 Thread management is handled by the thread library in user space, which is very
efficient.
 However, if a blocking system call is made, then the entire process blocks, even if
the other user threads would otherwise be able to continue.
 Because a single kernel thread can operate only on a single CPU, the many-to-one
model does not allow individual processes to be split across multiple CPUs.
 Green threads ( a thread library available for early Solaris systems and adopted by
early versions of Java ) and GNU Portable Threads implemented the many-to-one model
in the past, but few systems continue to use it today.

Many-to-one model

One-To-One Model

 The one-to-one model creates a separate kernel thread to handle each user thread.
 The one-to-one model overcomes the problems listed above involving blocking
system calls and the splitting of processes across multiple CPUs.
 However, creating a kernel thread for every user thread involves more overhead,
which can slow down the system.
 Most implementations of this model place a limit on how many threads can be
created.
 Linux and Windows from 95 to XP implement the one-to-one model for threads.
One-to-one model

Many-To-Many Model

 The many-to-many model multiplexes any number of user threads onto an equal
or smaller number of kernel threads, combining the best features of the one-to-one
and many-to-one models.
 Users have no restrictions on the number of threads created.
 Blocking kernel system calls do not block the entire process.
 Processes can be split across multiple processors.
 Individual processes may be allocated variable numbers of kernel threads,
depending on the number of CPUs present and other factors.

Many-to-many model
 One popular variation of the many-to-many model is the two-tier model, which
allows either many-to-many or one-to-one operation.
 IRIX, HP-UX, and Tru64 UNIX use the two-tier model, as did Solaris prior to
Solaris 9.

Two-level model

CPU Scheduling
Basic Concepts

 Almost all programs have some alternating cycle of CPU number crunching and waiting
for I/O of some kind. ( Even a simple fetch from memory takes a long time relative to
CPU speeds. )
 In a simple system running a single process, the time spent waiting for I/O is wasted, and
those CPU cycles are lost forever.
 A scheduling system allows one process to use the CPU while another is waiting for I/O,
thereby making full use of otherwise lost CPU cycles.
 The challenge is to make the overall system as "efficient" and "fair" as possible, subject
to varying and often dynamic conditions, and where "efficient" and "fair" are somewhat
subjective terms, often subject to shifting priority policies.

CPU-I/O Burst Cycle

 Almost all processes alternate between two states in a continuing cycle, as shown
in Figure 6.1 below :
o A CPU burst of performing calculations, and
o An I/O burst, waiting for data transfer in or out of the system.

Alternating sequence of CPU and I/O bursts.


 CPU bursts vary from process to process, and from program to program.

CPU Scheduler

 Whenever the CPU becomes idle, it is the job of the CPU Scheduler ( a.k.a. the
short-term scheduler ) to select another process from the ready queue to run next.
 The storage structure for the ready queue and the algorithm used to select the next
process are not necessarily a FIFO queue. There are several alternatives to choose
from, as well as numerous adjustable parameters for each algorithm, which is the
basic subject of this entire chapter.

Preemptive Scheduling

 CPU scheduling decisions take place under one of four conditions:


1. When a process switches from the running state to the waiting state, such
as for an I/O request or invocation of the wait( ) system call.
2. When a process switches from the running state to the ready state, for
example in response to an interrupt.
3. When a process switches from the waiting state to the ready state, say at
completion of I/O or a return from wait( ).
4. When a process terminates.
 For conditions 1 and 4 there is no choice - A new process must be selected.
 For conditions 2 and 3 there is a choice - To either continue running the current
process, or select a different one.
 If scheduling takes place only under conditions 1 and 4, the system is said to be
non-preemptive, or cooperative. Under these conditions, once a process starts
running it keeps running, until it either voluntarily blocks or until it finishes.
Otherwise the system is said to be preemptive.
 Windows used non-preemptive scheduling up to Windows 3.x, and started using
pre-emptive scheduling with Win95. Macs used non-preemptive prior to OSX,
and pre-emptive since then. Note that pre-emptive scheduling is only possible on
hardware that supports a timer interrupt.
 Note that pre-emptive scheduling can cause problems when two processes share
data, because one process may get interrupted in the middle of updating shared
data structures. Chapter 5 examined this issue in greater detail.
 Preemption can also be a problem if the kernel is busy executing a system call
( e.g. updating critical kernel data structures ) when the preemption occurs. Most
modern UNIXes deal with this problem by making the process wait until the
system call has either completed or blocked before allowing the preemption.
Unfortunately, this solution is problematic for real-time systems, as real-time
response can no longer be guaranteed.
 Some critical sections of code protect themselves from concurrency problems by
disabling interrupts before entering the critical section and re-enabling interrupts
on exiting the section. Needless to say, this should only be done in rare situations,
and only on very short pieces of code that will finish quickly, ( usually just a few
machine instructions. )

Dispatcher

 The dispatcher is the module that gives control of the CPU to the process
selected by the scheduler. This function involves:
o Switching context.
o Switching to user mode.
o Jumping to the proper location in the newly loaded program.
 The dispatcher needs to be as fast as possible, as it is run on every context switch.
The time consumed by the dispatcher is known as dispatch latency.

Scheduling Criteria

 There are several different criteria to consider when trying to select the "best" scheduling
algorithm for a particular situation and environment, including:
o CPU utilization - Ideally the CPU would be busy 100% of the time, so as to
waste 0 CPU cycles. On a real system CPU usage should range from 40% ( lightly
loaded ) to 90% ( heavily loaded. )
o Throughput - Number of processes completed per unit time. May range from 10
/ second to 1 / hour depending on the specific processes.
o Turnaround time - Time required for a particular process to complete, from
submission time to completion. ( Wall clock time. )
o Waiting time - How much time processes spend in the ready queue waiting their
turn to get on the CPU.
 ( Load average - The average number of processes sitting in the ready
queue waiting their turn to get into the CPU. Reported in 1-minute, 5-
minute, and 15-minute averages by "uptime" and "who". )
o Response time - The time taken in an interactive program from the issuance of a
command to the beginning of a response to that command.
 In general one wants to optimize the average value of a criterion ( maximize CPU
utilization and throughput, and minimize all the others ). However, sometimes one wants
to do something different, such as minimize the maximum response time.
 Sometimes it is more desirable to minimize the variance of a criterion than its average
value, i.e. users are more accepting of a consistent, predictable system than an
inconsistent one, even if it is a little bit slower.

Scheduling Algorithms

The following subsections will explain several common scheduling strategies, looking at only a
single CPU burst each for a small number of processes. Obviously real systems have to deal with
a lot more simultaneous processes executing their CPU-I/O burst cycles.
First-Come First-Serve Scheduling, FCFS

 FCFS is very simple - Just a FIFO queue, like customers waiting in line at the
bank or the post office or at a copying machine.
 Unfortunately, however, FCFS can yield some very long average wait times,
particularly if the first process to get there takes a long time. For example,
consider the following three processes:

Process Burst Time


P1 24
P2 3
P3 3

 In the first Gantt chart below, process P1 arrives first. The average waiting time
for the three processes is ( 0 + 24 + 27 ) / 3 = 17.0 ms.
 In the second Gantt chart below, the same three processes have an average wait
time of ( 0 + 3 + 6 ) / 3 = 3.0 ms. The total run time for the three bursts is the
same, but in the second case two of the three finish much quicker, and the other
process is only delayed by a short amount.
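 A small sketch that recomputes these two averages ( the burst times and arrival
orders are hard-coded purely for illustration ):

#include <stdio.h>

double fcfs_avg_wait( const int burst[ ], int n ) {
    int wait = 0, total = 0;
    for( int i = 0; i < n; i++ ) {
        total += wait;      /* process i waits for all earlier bursts */
        wait  += burst[ i ];
    }
    return (double) total / n;
}

int main( void ) {
    int order1[ ] = { 24, 3, 3 };   /* P1, P2, P3 */
    int order2[ ] = { 3, 3, 24 };   /* P2, P3, P1 */
    printf( "%.1f ms  %.1f ms\n", fcfs_avg_wait( order1, 3 ),
                                  fcfs_avg_wait( order2, 3 ) );   /* 17.0 and 3.0 */
    return 0;
}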

 FCFS can also block the system in a busy dynamic system in another way, known
as the convoy effect. When one CPU intensive process blocks the CPU, a number
of I/O intensive processes can get backed up behind it, leaving the I/O devices
idle. When the CPU hog finally relinquishes the CPU, then the I/O processes pass
through the CPU quickly, leaving the CPU idle while everyone queues up for I/O,
and then the cycle repeats itself when the CPU intensive process gets back to the
ready queue.

Shortest-Job-First Scheduling, SJF

 The idea behind the SJF algorithm is to pick the quickest fastest little job that
needs to be done, get it out of the way first, and then pick the next smallest fastest
job to do next.
 ( Technically this algorithm picks a process based on the next shortest CPU
burst, not the overall process time. )
 For example, the Gantt chart below is based upon the following CPU burst times,
( and the assumption that all jobs arrive at the same time. )

Process Burst Time


P1 6
P2 8
P3 7
P4 3

 In the case above the average wait time is ( 0 + 3 + 9 + 16 ) / 4 = 7.0 ms, ( as
opposed to 10.25 ms for FCFS for the same processes. )
 SJF can be proven to be optimal, in that it gives the minimum average waiting time
for a given set of processes, but it suffers from one important problem: How do you
know how long the next CPU burst is going to be?
o For long-term batch jobs this can be done based upon the limits that users
set for their jobs when they submit them, which encourages them to set
low limits, but risks their having to re-submit the job if they set the limit
too low. However that does not work for short-term CPU scheduling on an
interactive system.
o Another option would be to statistically measure the run time
characteristics of jobs, particularly if the same tasks are run repeatedly and
predictably. But once again that really isn't a viable option for short term
CPU scheduling in the real world.
o A more practical approach is to predict the length of the next burst, based
on some historical measurement of recent burst times for this process. One
simple, fast, and relatively accurate method is the exponential average,
which can be defined as follows. ( The book uses tau and t for these
variables; estimate and burst are used here for readability. )

estimate[ i + 1 ] = alpha * burst[ i ] + ( 1.0 - alpha ) * estimate[ i ]

o In this scheme the previous estimate contains the history of all previous
times, and alpha serves as a weighting factor for the relative importance of
recent data versus past history. If alpha is 1.0, then past history is ignored,
and we assume the next burst will be the same length as the last burst. If
alpha is 0.0, then all measured burst times are ignored, and we just assume
a constant burst time.
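o A quick numeric sketch of this formula, assuming alpha = 0.5, an initial estimate
of 10 ms, and a made-up sequence of measured bursts:

#include <stdio.h>

int main( void ) {
    double alpha = 0.5, estimate = 10.0;
    double burst[ ] = { 6.0, 4.0, 6.0, 13.0 };   /* measured burst lengths, ms */

    for( int i = 0; i < 4; i++ ) {
        estimate = alpha * burst[ i ] + ( 1.0 - alpha ) * estimate;
        printf( "after burst %d: next estimate = %.2f ms\n", i + 1, estimate );
    }
    return 0;
}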

SJF can be either preemptive or non-preemptive. Preemption occurs when a new process arrives
in the ready queue that has a predicted burst time shorter than the time remaining in the process
whose burst is currently on the CPU. Preemptive SJF is sometimes referred to as shortest
remaining time first scheduling.

 For example, the following Gantt chart is based upon the following data:

Process Arrival Time Burst Time


P1 0 8
P2 1 4
P3 2 9
p4 3 5

 The average wait time in this case is ( ( 10 - 1 ) + ( 1 - 1 ) + ( 17 - 2 ) + ( 5 - 3 ) ) / 4
= 26 / 4 = 6.5 ms. ( As opposed to 7.75 ms for non-preemptive SJF or 8.75 ms for FCFS. )

Priority Scheduling

 Priority scheduling is a more general case of SJF, in which each job is assigned a
priority and the job with the highest priority gets scheduled first. ( SJF uses the
inverse of the next expected burst time as its priority - The smaller the expected
burst, the higher the priority. )
 Note that in practice, priorities are implemented using integers within a fixed
range, but there is no agreed-upon convention as to whether "high" priorities use
large numbers or small numbers. This book uses low numbers for high priorities,
with 0 being the highest possible priority.
 For example, the following Gantt chart is based upon these process burst times
and priorities, and yields an average waiting time of 8.2 ms:

Process Burst Time Priority


P1 10 3
P2 1 1
P3 2 4
P4 1 5
P5 5 2

 Priorities can be assigned either internally or externally. Internal priorities are
assigned by the OS using criteria such as average burst time, ratio of CPU to I/O
activity, system resource use, and other factors available to the kernel. External
priorities are assigned by users, based on the importance of the job, fees paid,
politics, etc.
 Priority scheduling can be either preemptive or non-preemptive.
 Priority scheduling can suffer from a major problem known as indefinite
blocking, or starvation, in which a low-priority task can wait forever because
there are always some other jobs around that have higher priority.
o If this problem is allowed to occur, then processes will either run
eventually when the system load lightens ( at say 2:00 a.m. ), or will
eventually get lost when the system is shut down or crashes. ( There are
rumors of jobs that have been stuck for years. )
o One common solution to this problem is aging, in which priorities of jobs
increase the longer they wait. Under this scheme a low-priority job will
eventually get its priority raised high enough that it gets run.

Round Robin Scheduling

 Round robin scheduling is similar to FCFS scheduling, except that each CPU burst is
limited to a maximum length of CPU time called the time quantum.
 When a process is given the CPU, a timer is set for whatever value has been set
for a time quantum.
o If the process finishes its burst before the time quantum timer expires, then
it is swapped out of the CPU just like the normal FCFS algorithm.
o If the timer goes off first, then the process is swapped out of the CPU and
moved to the back end of the ready queue.
 The ready queue is maintained as a circular queue, so when all processes have had
a turn, then the scheduler gives the first process another turn, and so on.
 RR scheduling can give the effect of all processes sharing the CPU equally,
although the average wait time can be longer than with other scheduling
algorithms. In the following example ( with a time quantum of 4 ms ) the average
wait time is 5.66 ms.
Process Burst Time
P1 24
P2 3
P3 3

 The performance of RR is sensitive to the time quantum selected. If the quantum
is large enough, then RR reduces to the FCFS algorithm; if it is very small, then
each process gets 1/nth of the processor time and they share the CPU equally.
 BUT, a real system invokes overhead for every context switch, and the smaller
the time quantum the more context switches there are. ( See Figure 6.4 below. )
Most modern systems use time quantum between 10 and 100 milliseconds, and
context switch times on the order of 10 microseconds, so the overhead is small
relative to the time quantum.

The way in which a smaller time quantum increases context switches.

 In general, turnaround time is minimized if most processes finish their next cpu
burst within one time quantum. For example, with three processes of 10 ms bursts
each, the average turnaround time for 1 ms quantum is 29, and for 10 ms quantum
it reduces to 20. However, if it is made too large, then RR just degenerates to
FCFS. A rule of thumb is that 80% of CPU bursts should be smaller than the time
quantum.
Multilevel Queue Scheduling

 When processes can be readily categorized, then multiple separate queues can be
established, each implementing whatever scheduling algorithm is most
appropriate for that type of job, and/or with different parametric adjustments.
 Scheduling must also be done between queues, that is scheduling one queue to get
time relative to other queues. Two common options are strict priority ( no job in a
lower priority queue runs until all higher priority queues are empty ) and round-
robin ( each queue gets a time slice in turn, possibly of different sizes. )
 Note that under this algorithm jobs cannot switch from queue to queue - Once
they are assigned a queue, that is their queue until they finish.

Multilevel queue scheduling

Multilevel Feedback-Queue Scheduling

 Multilevel feedback queue scheduling is similar to the ordinary multilevel queue
scheduling described above, except jobs may be moved from one queue to
another for a variety of reasons:
o If the characteristics of a job change between CPU-intensive and I/O
intensive, then it may be appropriate to switch a job from one queue to
another.
o Aging can also be incorporated, so that a job that has waited for a long
time can get bumped up into a higher priority queue for a while.
 Multilevel feedback queue scheduling is the most flexible, because it can be tuned
for any situation. But it is also the most complex to implement because of all the
adjustable parameters. Some of the parameters which define one of these systems
include:
o The number of queues.
o The scheduling algorithm for each queue.
o The methods used to upgrade or demote processes from one queue to
another. ( Which may be different. )
o The method used to determine which queue a process enters initially.

Multilevel feedback queues.

Thread Scheduling

 The process scheduler schedules only the kernel threads.


 User threads are mapped to kernel threads by the thread library - The OS ( and in
particular the scheduler ) is unaware of them.

Contention Scope

 Contention scope refers to the scope in which threads compete for the use of
physical CPUs.
 On systems implementing many-to-one and many-to-many threads, Process
Contention Scope, PCS, occurs, because competition occurs between threads that
are part of the same process. ( This is the management / scheduling of multiple
user threads on a single kernel thread, and is managed by the thread library. )
 System Contention Scope, SCS, involves the system scheduler scheduling kernel
threads to run on one or more CPUs. Systems implementing one-to-one threads (
XP, Solaris 9, Linux ), use only SCS.
 PCS scheduling is typically done with priority, where the programmer can set
and/or change the priority of threads created by his or her programs. Even time
slicing is not guaranteed among threads of equal priority.

Pthread Scheduling

 The Pthread library provides for specifying the contention scope:
o PTHREAD_SCOPE_PROCESS schedules threads using PCS, by
scheduling user threads onto available LWPs using the many-to-many
model.
o PTHREAD_SCOPE_SYSTEM schedules threads using SCS, by binding
user threads to particular LWPs, effectively implementing a one-to-one
model.
 The pthread_attr_getscope( ) and pthread_attr_setscope( ) functions determine and set
the contention scope respectively, as sketched below:
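A minimal sketch using the Pthreads attribute API ( note that some systems permit only
one of the two contention scopes ):

#include <pthread.h>
#include <stdio.h>

int main( void ) {
    pthread_attr_t attr;
    int scope;

    pthread_attr_init( &attr );

    if( pthread_attr_getscope( &attr, &scope ) != 0 )
        fprintf( stderr, "unable to get scheduling scope\n" );
    else if( scope == PTHREAD_SCOPE_PROCESS )
        printf( "default scope is PTHREAD_SCOPE_PROCESS\n" );
    else
        printf( "default scope is PTHREAD_SCOPE_SYSTEM\n" );

    /* request system contention scope for threads created with attr */
    pthread_attr_setscope( &attr, PTHREAD_SCOPE_SYSTEM );

    pthread_attr_destroy( &attr );
    return 0;
}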

Multiple-Processor Scheduling

 When multiple processors are available, then the scheduling gets more complicated,
because now there is more than one CPU which must be kept busy and in effective use at
all times.
 Load sharing revolves around balancing the load between multiple processors.
 Multi-processor systems may be heterogeneous, ( different kinds of CPUs ), or
homogenous, ( all the same kind of CPU ). Even in the latter case there may be special
scheduling constraints, such as devices which are connected via a private bus to only one
of the CPUs. This book will restrict its discussion to homogenous systems.

Approaches to Multiple-Processor Scheduling

 One approach to multi-processor scheduling is asymmetric multiprocessing, in
which one processor is the master, controlling all activities and running all kernel
code, while the others run only user code. This approach is relatively simple, as
there is no need to share critical system data.
 Another approach is symmetric multiprocessing, SMP, where each processor
schedules its own jobs, either from a common ready queue or from separate ready
queues for each processor.
 Virtually all modern OSes support SMP, including XP, Win 2000, Solaris, Linux,
and Mac OSX.

Processor Affinity

 Processors contain cache memory, which speeds up repeated accesses to the same
memory locations.
 If a process were to switch from one processor to another each time it got a time
slice, the data in the cache ( for that process ) would have to be invalidated and re-
loaded from main memory, thereby obviating the benefit of the cache.
 Therefore SMP systems attempt to keep processes on the same processor, via
processor affinity. Soft affinity occurs when the system attempts to keep
processes on the same processor but makes no guarantees. Linux and some other
OSes support hard affinity, in which a process specifies that it is not to be moved
between processors.
 Main memory architecture can also affect process affinity, if particular CPUs
have faster access to memory on the same chip or board than to other memory
loaded elsewhere. ( Non-Uniform Memory Access, NUMA. ) As shown below, if
a process has an affinity for a particular CPU, then it should preferentially be
assigned memory storage in "local" fast access areas.

NUMA and CPU scheduling.

Load Balancing

 Obviously an important goal in a multiprocessor system is to balance the load
between processors, so that one processor won't be sitting idle while another is
overloaded.
 Systems using a common ready queue are naturally self-balancing, and do not
need any special handling. Most systems, however, maintain separate ready
queues for each processor.
 Balancing can be achieved through either push migration or pull migration:
o Push migration involves a separate process that runs periodically, ( e.g.
every 200 milliseconds ), and moves processes from heavily loaded
processors onto less loaded ones.
o Pull migration involves idle processors taking processes from the ready
queues of other processors.
o Push and pull migration are not mutually exclusive.
 Note that moving processes from processor to processor to achieve load balancing
works against the principle of processor affinity, and if not carefully managed, the
savings gained by balancing the system can be lost in rebuilding caches. One
option is to only allow migration when imbalance surpasses a given threshold.

Multicore Processors

 Traditional SMP required multiple CPU chips to run multiple kernel threads
concurrently.
 Recent trends are to put multiple CPUs ( cores ) onto a single chip, which appear
to the system as multiple processors.
 Compute cycles can be blocked by the time needed to access memory, whenever
the needed data is not already present in the cache. ( Cache misses. ) In Figure, as
much as half of the CPU cycles are lost to memory stall.

Memory stall.

 By assigning multiple kernel threads to a single processor, memory stall can be
avoided ( or reduced ) by running one thread on the processor while the other
thread waits for memory.

Multithreaded multicore system.


 A dual-threaded dual-core system has four logical processors available to the
operating system. The UltraSPARC T1 CPU has 8 cores per chip and 4 hardware
threads per core, for a total of 32 logical processors per chip.
 There are two ways to multi-thread a processor:
1. Coarse-grained multithreading switches between threads only when one
thread blocks, say on a memory read. Context switching is similar to
process switching, with considerable overhead.
2. Fine-grained multithreading occurs on smaller regular intervals, say on
the boundary of instruction cycles. However the architecture is designed to
support thread switching, so the overhead is relatively minor.
 Note that for a multi-threaded multi-core system, there are two levels of
scheduling, at the kernel level:
o The OS schedules which kernel thread(s) to assign to which logical
processors, and when to make context switches using algorithms as
described above.
o On a lower level, the hardware schedules logical processors on each
physical core using some other algorithm.
 The UltraSPARC T1 uses a simple round-robin method to
schedule the 4 logical processors ( kernel threads ) on each
physical core.
 The Intel Itanium is a dual-core chip which uses a 7-level priority
scheme ( urgency ) to determine which thread to schedule when
one of 5 different events occurs.

Linux Scheduling

 Prior to version 2.5, Linux used a traditional UNIX scheduling algorithm.


 Version 2.6 used an algorithm known as O(1), that ran in constant time regardless of the
number of tasks, and provided better support for SMP systems. However it yielded poor
interactive performance.
 Starting in 2.6.23, the Completely Fair Scheduler, CFS, became the standard Linux
scheduling system. See sidebar below

CFS ( Completely Fair Scheduler ) Performance

The Linux CFS scheduler provides an efficient algorithm for selecting which
task to run next. Each runnable task is placed in a red-black tree—a balanced
binary search tree whose key is based on the value of vruntime. This tree is
shown below:
When a task becomes runnable, it is added to the tree. If a task on the
tree is not runnable ( for example, if it is blocked while waiting for I/O ), it is
removed. Generally speaking, tasks that have been given less processing time
( smaller values of vruntime ) are toward the left side of the tree, and tasks
that have been given more processing time are on the right side. According
to the properties of a binary search tree, the leftmost node has the smallest
key value, which for the sake of the CFS scheduler means that it is the task
with the highest priority . Because the red-black tree is balanced, navigating
it to discover the leftmost node will require O(lgN) operations (where N
is the number of nodes in the tree). However, for efficiency reasons, the
Linux scheduler caches this value in the variable rb_leftmost, and thus
determining which task to run next requires only retrieving the cached value.


 The Linux scheduler is a preemptive priority-based algorithm with two priority ranges -
Real time from 0 to 99 and a nice range from 100 to 140.
 Unlike Solaris or XP, Linux assigns longer time quanta to higher-priority tasks.
Figure 6.21 - Scheduling priorities on a Linux system.

Windows XP Scheduling ( was 5.6.2 )

 Windows XP uses a priority-based preemptive scheduling algorithm.


 The dispatcher uses a 32-level priority scheme to determine the order of thread execution,
divided into two classes - variable class from 1 to 15 and real-time class from 16 to 31, (
plus a thread at priority 0 managing memory. )
 There is also a special idle thread that is scheduled when no other threads are ready.
 Win XP identifies 6 priority classes and 7 relative priorities within each class
( the rows and columns of the table below. )
 Processes are also each given a base priority within their priority class. When variable
class processes consume their entire time quanta, then their priority gets lowered, but not
below their base priority.
 Processes in the foreground ( active window ) have their scheduling quanta multiplied by
3, to give better response to interactive processes in the foreground.

Figure 6.22 - Windows thread priorities.

Process Synchronization
Background

Producer code from chapter 3:

item nextProduced;

while( true ) {

/* Produce an item and store it in nextProduced */


nextProduced = makeNewItem( . . . );

/* Wait for space to become available */


while( ( ( in + 1 ) % BUFFER_SIZE ) == out )
; /* Do nothing */

/* And then store the item and repeat the loop. */


buffer[ in ] = nextProduced;
in = ( in + 1 ) % BUFFER_SIZE;
}

Consumer code from chapter 3:

item nextConsumed;

while( true ) {

/* Wait for an item to become available */


while( in == out )
; /* Do nothing */

/* Get the next available item */


nextConsumed = buffer[ out ];
out = ( out + 1 ) % BUFFER_SIZE;

/* Consume the item in nextConsumed


( Do something with it ) */
}

 The only problem with the above code is that the maximum number of items
which can be placed into the buffer is BUFFER_SIZE - 1. One slot is unavailable
because there always has to be a gap between the producer and the consumer.
 We could try to overcome this deficiency by introducing a counter variable, as
shown in the following code segments:
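 A sketch of those segments, following the standard textbook formulation; the shared
data gains an integer counter, initialized to 0:

int counter = 0;

/* producer */
while( true ) {
    /* Produce an item and store it in nextProduced */
    while( counter == BUFFER_SIZE )
        ; /* Do nothing - buffer is full */
    buffer[ in ] = nextProduced;
    in = ( in + 1 ) % BUFFER_SIZE;
    counter++;
}

/* consumer */
while( true ) {
    while( counter == 0 )
        ; /* Do nothing - buffer is empty */
    nextConsumed = buffer[ out ];
    out = ( out + 1 ) % BUFFER_SIZE;
    counter--;
    /* Consume the item in nextConsumed */
}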
 Unfortunately we have now introduced a new problem, because both the producer
and the consumer are adjusting the value of the variable counter, which can lead
to a condition known as a race condition. In this condition a piece of code may or
may not work correctly, depending on which of two simultaneous processes
executes first, and more importantly if one of the processes gets interrupted such
that the other process runs between important steps of the first process. ( Bank
balance example discussed in class. )
 The particular problem above comes from the producer executing "counter++" at
the same time the consumer is executing "counter--". If one process gets part way
through making the update and then the other process butts in, the value of
counter can get left in an incorrect state.
 But, you might say, "Each of those is a single instruction - How can they get
interrupted halfway through?" The answer is that although they are single
statements in C++, they are actually three steps each at the hardware level: (1)
Fetch counter from memory into a register, (2) increment or decrement the
register, and (3) Store the new value of counter back to memory. If the
instructions from the two processes get interleaved, there could be serious
problems, such as illustrated by the following:
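 One possible interleaving ( reconstructing the standard illustration, and assuming
counter is initially 5 ):

T0: producer   register1 = counter         { register1 = 5 }
T1: producer   register1 = register1 + 1   { register1 = 6 }
T2: consumer   register2 = counter         { register2 = 5 }
T3: consumer   register2 = register2 - 1   { register2 = 4 }
T4: producer   counter = register1         { counter = 6 }
T5: consumer   counter = register2         { counter = 4 }

After this interleaving counter is left at 4, even though one item was produced and one
was consumed, so the correct value would still be 5.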

 Exercise: What would be the resulting value of counter if the order of statements
T4 and T5 were reversed? ( What should the value of counter be after one
producer and one consumer, assuming the original value was 5? )
 Note that race conditions are notoriously difficult to identify and debug, because
by their very nature they only occur on rare occasions, and only when the timing
is just exactly right. ( or wrong! :-) ) Race conditions are also very difficult to
reproduce. :-(
 Obviously the solution is to only allow one process at a time to manipulate the
value "counter". This is a very common occurrence among cooperating processes,
so let's look at some ways in which this is done, as well as some classic problems
in this area.

The Critical-Section Problem

 The producer-consumer problem described above is a specific example of a more general
situation known as the critical section problem. The general idea is that in a number of
cooperating processes, each has a critical section of code, with the following conditions
and terminologies:
o Only one process in the group can be allowed to execute in their critical section at
any one time. If one process is already executing their critical section and another
process wishes to do so, then the second process must be made to wait until the
first process has completed their critical section work.
o The code preceding the critical section, and which controls access to the critical
section, is termed the entry section. It acts like a carefully controlled locking door.
o The code following the critical section is termed the exit section. It generally
releases the lock on someone else's door, or at least lets the world know that they
are no longer in their critical section.
o The rest of the code not included in either the critical section or the entry or exit
sections is termed the remainder section.

General structure of a typical process Pi

 A solution to the critical section problem must satisfy the following three conditions:
1. Mutual Exclusion - Only one process at a time can be executing in their critical
section.
2. Progress - If no process is currently executing in their critical section, and one or
more processes want to execute their critical section, then only the processes not
in their remainder sections can participate in the decision, and the decision cannot
be postponed indefinitely. ( I.e. processes cannot be blocked forever waiting to
get into their critical sections. )
3. Bounded Waiting - There exists a limit as to how many other processes can get
into their critical sections after a process requests entry into their critical section
and before that request is granted. ( I.e. a process requesting entry into their
critical section will get a turn eventually, and there is a limit as to how many other
processes get to go first. )
Peterson's Solution

 Peterson's Solution is a classic software-based solution to the critical section problem. It
is unfortunately not guaranteed to work on modern hardware, due to vagaries of load and
store operations, but it illustrates a number of important concepts.
 Peterson's solution is based on two processes, P0 and P1, which alternate between their
critical sections and remainder sections. For convenience of discussion, "this" process is
Pi, and the "other" process is Pj. ( I.e. j = 1 - i )
 Peterson's solution requires two shared data items:
o int turn - Indicates whose turn it is to enter into the critical section. If turn = = i,
then process i is allowed into their critical section.
o boolean flag[ 2 ] - Indicates when a process wants to enter into their critical
section. When process i wants to enter their critical section, it sets flag[ i ] to true.
 In the following diagram, the entry and exit sections are enclosed in boxes.
o In the entry section, process i first raises a flag indicating a desire to enter the
critical section.
o Then turn is set to j to allow the other process to enter their critical section if
process j so desires.
o The while loop is a busy loop ( notice the semicolon at the end ), which makes
process i wait as long as process j has the turn and wants to enter the critical
section.
o Process i lowers the flag[ i ] in the exit section, allowing process j to continue if it
has been waiting.

The structure of process Pi in Peterson's solution.
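A sketch of that structure in pseudo-C ( here i is this process and j = 1 - i is the other ):

do {
    flag[ i ] = true;     /* entry section: announce the desire to enter ...  */
    turn = j;             /* ... and give the turn away to the other process  */
    while( flag[ j ] && turn == j )
        ;   /* busy wait while the other process wants in and has the turn */

        /* critical section */

    flag[ i ] = false;    /* exit section: no longer interested */

        /* remainder section */

} while( true );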


 To prove that the solution is correct, we must examine the three conditions listed above:
1. Mutual exclusion - If one process is executing their critical section when the
other wishes to do so, the second process will become blocked by the flag of the
first process. If both processes attempt to enter at the same time, the last process
to execute "turn = j" will be blocked.
2. Progress - Each process can only be blocked at the while if the other process
wants to use the critical section ( flag[ j ] = = true ), AND it is the other process's
turn to use the critical section ( turn = = j ). If both of those conditions are true,
then the other process ( j ) will be allowed to enter the critical section, and upon
exiting the critical section, will set flag[ j ] to false, releasing process i. The shared
variable turn assures that only one process at a time can be blocked, and the flag
variable allows one process to release the other when exiting their critical section.
3. Bounded Waiting - As each process enters their entry section, they set the turn
variable to be the other process's turn. Since no process ever sets it back to their
own turn, this ensures that each process will have to let the other process go first
at most one time before it becomes their turn again.
 Note that the instruction "turn = j" is atomic, that is it is a single machine instruction
which cannot be interrupted.

Synchronization Hardware

 To generalize the solution(s) expressed above, each process when entering their critical
section must set some sort of lock, to prevent other processes from entering their critical
sections simultaneously, and must release the lock when exiting their critical section, to
allow other processes to proceed. Obviously it must be possible to attain the lock only
when no other process has already set a lock. Specific implementations of this general
procedure can get quite complicated, and may include hardware solutions as outlined in
this section.
 One simple solution to the critical section problem is to simply prevent a process from
being interrupted while in their critical section, which is the approach taken by non
preemptive kernels. Unfortunately this does not work well in multiprocessor
environments, due to the difficulties in disabling and the re-enabling interrupts on all
processors. There is also a question as to how this approach affects timing if the clock
interrupt is disabled.
 Another approach is for hardware to provide certain atomic operations. These operations
are guaranteed to operate as a single instruction, without interruption. One such operation
is the "Test and Set", which simultaneously sets a boolean lock variable and returns its
previous value, as shown in Figures 5.3 and 5.4:
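 A sketch of the idea in pseudo-C ( the hardware carries out the body of test_and_set( )
as one indivisible step; the C code only conveys its meaning ), together with a simple
mutual-exclusion loop built on it, where the shared boolean lock is initialized to false:

boolean test_and_set( boolean *target ) {
    boolean rv = *target;   /* remember the old value ...                   */
    *target = true;         /* ... and set the lock, all in one atomic step */
    return rv;
}

do {
    while( test_and_set( &lock ) )
        ;   /* busy wait until the returned ( old ) value is false */

        /* critical section */

    lock = false;

        /* remainder section */

} while( true );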
 Another variation on the test-and-set is an atomic swap of two booleans, as shown in
Figures 5.5 and 5.6:
 The above examples satisfy the mutual exclusion requirement, but unfortunately do not
guarantee bounded waiting. If there are multiple processes trying to get into their critical
sections, there is no guarantee of what order they will enter, and any one process could
have the bad luck to wait forever until they got their turn in the critical section. ( Since
there is no guarantee as to the relative rates of the processes, a very fast process could
theoretically release the lock, whip through their remainder section, and re-lock the lock
before a slower process got a chance. As more and more processes are involved vying for
the same resource, the odds of a slow process getting locked out completely increase. )
 Figure illustrates a solution using test-and-set that does satisfy this requirement, using
two shared data structures, boolean lock and boolean waiting[ N ], where N is the number
of processes in contention for critical sections:

Bounded-waiting mutual exclusion with TestAndSet( ).
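A sketch of that algorithm, reconstructed from the description that follows ( i is this
process's index, n is the number of processes, and j and key are per-process local
variables ):

do {
    waiting[ i ] = true;
    key = true;
    while( waiting[ i ] && key )
        key = test_and_set( &lock );
    waiting[ i ] = false;

        /* critical section */

    j = ( i + 1 ) % n;
    while( ( j != i ) && !waiting[ j ] )   /* scan for the next waiting process */
        j = ( j + 1 ) % n;

    if( j == i )
        lock = false;            /* nobody is waiting: release the lock       */
    else
        waiting[ j ] = false;    /* hand the critical section directly to Pj  */

        /* remainder section */

} while( true );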

 The key feature of the above algorithm is that a process blocks on the AND of the critical
section being locked and that this process is in the waiting state. When exiting a critical
section, the exiting process does not just unlock the critical section and let the other
processes have a free-for-all trying to get in. Rather it first looks in an orderly progression
( starting with the next process on the list ) for a process that has been waiting, and if it
finds one, then it releases that particular process from its waiting state, without unlocking
the critical section, thereby allowing a specific process into the critical section while
continuing to block all the others. Only if there are no other processes currently waiting is
the general lock removed, allowing the next process to come along access to the critical
section.
 Unfortunately, hardware level locks are especially difficult to implement in multi-
processor architectures. Discussion of such issues is left to books on advanced computer
architecture.

Mutex Locks

 The hardware solutions presented above are often difficult for ordinary programmers to
access, particularly on multi-processor machines, and particularly because they are often
platform-dependent.
 Therefore most systems offer a software API equivalent called mutex locks or simply
mutexes. ( For mutual exclusion )
 The terminology when using mutexes is to acquire a lock prior to entering a critical
section, and to release it when exiting, as shown in Figure :

Solution to the critical-section problem using mutex locks

 Just as with hardware locks, the acquire step will block the process if the lock is in use by
another process, and both the acquire and release operations are atomic.
 Acquire and release can be implemented as shown here, based on a boolean variable
"available":
 One problem with the implementation shown here, ( and in the hardware solutions
presented earlier ), is the busy loop used to block processes in the acquire phase. These
types of locks are referred to as spinlocks, because the CPU just sits and spins while
blocking the process.
 Spinlocks are wasteful of CPU cycles, and are a particularly bad idea on single-CPU
systems, because while a process spins no other process can run and release the lock.
( Until the scheduler kicks the spinning process off of the CPU. )
 On the other hand, spinlocks do not incur the overhead of a context switch, so they are
effectively used on multi-threaded machines when it is expected that the lock will be
released after a short time.

Semaphores

 A more robust alternative to simple mutexes is to use semaphores, which are integer
variables for which only two ( atomic ) operations are defined, the wait and signal
operations, as shown in the following figure.
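 The classic busy-waiting definitions, in pseudo-C ( both operations must execute
atomically ):

wait( S ) {
    while( S <= 0 )
        ;   /* busy wait */
    S--;
}

signal( S ) {
    S++;
}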
 Note that not only must the variable-changing steps ( S-- and S++ ) be indivisible, it is
also necessary that for the wait operation when the test proves false that there be no
interruptions before S gets decremented. It IS okay, however, for the busy loop to be
interrupted when the test is true, which prevents the system from hanging forever.
Semaphore Usage

 In practice, semaphores can take on one of two forms:


o Binary semaphores can take on one of two values, 0 or 1. They can be
used to solve the critical section problem as described above, and can be
used as mutexes on systems that do not provide a separate mutex
mechanism. The use of mutexes for this purpose is shown in Figure 6.9 (
from the 8th edition ) below.

Mutual-exclusion implementation with semaphores.


o Counting semaphores can take on any integer value, and are usually used
to count the number remaining of some limited resource. The counter is
initialized to the number of such resources available in the system, and
whenever the counting semaphore is greater than zero, then a process can
enter a critical section and use one of the resources. When the counter gets
to zero ( or negative in some implementations ), then the process blocks
until another process frees up a resource and increments the counting
semaphore with a signal call. ( The binary semaphore can be seen as just a
special case where the number of resources initially available is just one. )
o Semaphores can also be used to synchronize certain operations between
processes. For example, suppose it is important that process P1 execute
statement S1 before process P2 executes statement S2.
 First we create a semaphore named synch that is shared by the two
processes, and initialize it to zero.
 Then in process P1 we insert the code:

S1;
signal( synch );

 and in process P2 we insert the code:

wait( synch );
S2;

 Because synch was initialized to 0, process P2 will block on the
wait until after P1 executes the call to signal.

Semaphore Implementation

 The big problem with semaphores as described above is the busy loop in the wait
call, which consumes CPU cycles without doing any useful work. This type of
lock is known as a spinlock, because the lock just sits there and spins while it
waits. While this is generally a bad thing, it does have the advantage of not
invoking context switches, and so it is sometimes used in multi-processing
systems when the wait time is expected to be short - One thread spins on one
processor while another completes their critical section on another processor.
 An alternative approach is to block a process when it is forced to wait for an
available semaphore, and swap it out of the CPU. In this implementation each
semaphore needs to maintain a list of processes that are blocked waiting for it, so
that one of the processes can be woken up and swapped back in when the
semaphore becomes available. ( Whether it gets swapped back into the CPU
immediately or whether it needs to hang out in the ready queue for a while is a
scheduling problem. )
 The new definition of a semaphore and the corresponding wait and signal
operations are shown as follows:
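 A sketch of that definition, following the standard textbook formulation ( block( )
suspends the calling process, and wakeup( P ) moves process P back to the ready queue ):

typedef struct {
    int value;
    struct process *list;   /* queue of processes blocked on this semaphore */
} semaphore;

wait( semaphore *S ) {
    S->value--;
    if( S->value < 0 ) {
        /* add this process to S->list */
        block( );
    }
}

signal( semaphore *S ) {
    S->value++;
    if( S->value <= 0 ) {
        /* remove a process P from S->list */
        wakeup( P );
    }
}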
 Note that in this implementation the value of the semaphore can actually become
negative, in which case its magnitude is the number of processes waiting for that
semaphore. This is a result of decrementing the counter before checking its value.
 Key to the success of semaphores is that the wait and signal operations be atomic,
that is no other process can execute a wait or signal on the same semaphore at the
same time. ( Other processes could be allowed to do other things, including
working with other semaphores, they just can't have access to this semaphore. )
On single processors this can be implemented by disabling interrupts during the
execution of wait and signal; Multiprocessor systems have to use more complex
methods, including the use of spinlocking.
Deadlocks and Starvation

 One important problem that can arise when using semaphores to block processes
waiting for a limited resource is the problem of deadlocks, which occur when
multiple processes are blocked, each waiting for a resource that can only be freed
by one of the other ( blocked ) processes, as illustrated in the following example. (
Deadlocks are covered more completely in chapter 7. )
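 The usual illustration involves two processes, P0 and P1, and two semaphores, S and Q,
each initialized to 1:

        P0                      P1
    ------------            ------------
    wait( S );              wait( Q );
    wait( Q );              wait( S );
      . . .                   . . .
    signal( S );            signal( Q );
    signal( Q );            signal( S );

 Suppose P0 executes wait( S ) and then P1 executes wait( Q ). When P0 executes wait( Q )
it must wait until P1 signals Q, and when P1 executes wait( S ) it must wait until P0
signals S. Since neither signal can ever be executed, both processes are deadlocked.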

 Another problem to consider is that of starvation, in which one or more processes
get blocked forever, and never get a chance to take their turn in the critical
section. For example, in the semaphores above, we did not specify the algorithms
for adding processes to the waiting queue in the semaphore in the wait( ) call, or
selecting one to be removed from the queue in the signal( ) call. If the method
chosen is a FIFO queue, then every process will eventually get their turn, but if a
LIFO queue is implemented instead, then the first process to start waiting could
starve.

Priority Inversion

 A challenging scheduling problem arises when a high-priority process gets
blocked waiting for a resource that is currently held by a low-priority process.
 If the low-priority process gets pre-empted by one or more medium-priority
processes, then the high-priority process is essentially made to wait for the
medium priority processes to finish before the low-priority process can release the
needed resource, causing a priority inversion. If there are enough medium-
priority processes, then the high-priority process may be forced to wait for a very
long time.
 One solution is a priority-inheritance protocol, in which a low-priority process
holding a resource for which a high-priority process is waiting will temporarily
inherit the high priority from the waiting process. This prevents the medium-
priority processes from preempting the low-priority process until it releases the
resource, thereby avoiding the priority inversion problem.

Classic Problems of Synchronization

The following classic problems are used to test virtually every new proposed synchronization
algorithm.

The Bounded-Buffer Problem

 This is a generalization of the producer-consumer problem wherein access is
controlled to a shared group of buffers of a limited size.
 In this solution, the two counting semaphores "full" and "empty" keep track of the
current number of full and empty buffers respectively ( and initialized to 0 and N
respectively. ) The binary semaphore mutex controls access to the critical section.
The producer and consumer processes are nearly identical - One can think of the
producer as producing full buffers, and the consumer producing empty buffers.
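A sketch of this solution, assuming POSIX semaphores; the buffer layout and the value of N are illustrative:

/* Bounded buffer with counting semaphores full / empty and a mutex. */
#include <semaphore.h>

#define N 10                         /* number of buffer slots ( illustrative ) */

int   buffer[N];
int   in = 0, out = 0;               /* next slot to fill / to empty            */
sem_t empty;                         /* counts empty buffers, initialized to N  */
sem_t full;                          /* counts full buffers,  initialized to 0  */
sem_t mutex;                         /* binary semaphore guarding the buffer    */

void produce(int item) {
    sem_wait(&empty);                /* wait for an empty slot     */
    sem_wait(&mutex);                /* enter critical section     */
    buffer[in] = item;
    in = (in + 1) % N;
    sem_post(&mutex);
    sem_post(&full);                 /* one more full buffer       */
}

int consume(void) {
    int item;
    sem_wait(&full);                 /* wait for a full slot       */
    sem_wait(&mutex);
    item = buffer[out];
    out = (out + 1) % N;
    sem_post(&mutex);
    sem_post(&empty);                /* one more empty buffer      */
    return item;
}

/* Initialization, e.g. in main( ):
   sem_init(&empty, 0, N);  sem_init(&full, 0, 0);  sem_init(&mutex, 0, 1); */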
The Readers-Writers Problem

 In the readers-writers problem there are some processes ( termed readers ) who
only read the shared data, and never change it, and there are other processes (
termed writers ) who may change the data in addition to or instead of reading it.
There is no limit to how many readers can access the data simultaneously, but
when a writer accesses the data, it needs exclusive access.
 There are several variations to the readers-writers problem, most centered around
relative priorities of readers versus writers.
o The first readers-writers problem gives priority to readers. In this problem,
if a reader wants access to the data, and there is not already a writer
accessing it, then access is granted to the reader. A solution to this
problem can lead to starvation of the writers, as there could always be
more readers coming along to access the data. ( A steady stream of readers
will jump ahead of waiting writers as long as there is currently already
another reader accessing the data, because the writer is forced to wait until
the data is idle, which may never happen if there are enough readers. )
o The second readers-writers problem gives priority to the writers. In this
problem, when a writer wants access to the data it jumps to the head of the
queue - All waiting readers are blocked, and the writer gets access to the
data as soon as it becomes available. In this solution the readers may be
starved by a steady stream of writers.
 The following code is an example of the first readers-writers problem, and
involves an important counter and two binary semaphores:
o readcount is used by the reader processes, to count the number of readers
currently accessing the data.
o mutex is a semaphore used only by the readers for controlled access to
readcount.
o rw_mutex is a semaphore used to block and release the writers. The first
reader to access the data will set this lock and the last reader to exit will
release it; The remaining readers do not touch rw_mutex. ( Eighth edition
called this variable wrt. )
o Note that the first reader to come along will block on rw_mutex if there is
currently a writer accessing the data, and that all following readers will
only block on mutex for their turn to increment readcount.
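A sketch of that solution, assuming POSIX semaphores and using the variable names described above:

/* First readers-writers problem: readers have priority. */
#include <semaphore.h>

sem_t mutex;        /* protects readcount; initialized to 1 */
sem_t rw_mutex;     /* writer exclusion;   initialized to 1 */
int   readcount = 0;

void writer(void) {
    sem_wait(&rw_mutex);             /* exclusive access to the data  */
    /* ... write the shared data ... */
    sem_post(&rw_mutex);
}

void reader(void) {
    sem_wait(&mutex);
    readcount++;
    if (readcount == 1)              /* first reader locks out writers */
        sem_wait(&rw_mutex);
    sem_post(&mutex);

    /* ... read the shared data ... */

    sem_wait(&mutex);
    readcount--;
    if (readcount == 0)              /* last reader lets writers in    */
        sem_post(&rw_mutex);
    sem_post(&mutex);
}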
 Some hardware implementations provide specific reader-writer locks, which are
accessed using an argument specifying whether access is requested for reading or
writing. The use of reader-writer locks is beneficial for situations in which: (1)
processes can be easily identified as either readers or writers, and (2) there are
significantly more readers than writers, making the additional overhead of the
reader-writer lock pay off in terms of increased concurrency of the readers.

The Dining-Philosophers Problem

 The dining philosophers problem is a classic synchronization problem involving
the allocation of limited resources amongst a group of processes in a deadlock-
free and starvation-free manner:
o Consider five philosophers sitting around a table, in which there are five
chopsticks evenly distributed and an endless bowl of rice in the center, as
shown in the diagram below. ( There is exactly one chopstick between
each pair of dining philosophers. )
o These philosophers spend their lives alternating between two activities:
eating and thinking.
o When it is time for a philosopher to eat, it must first acquire two
chopsticks - one from their left and one from their right.
o When a philosopher thinks, it puts down both chopsticks in their original
locations.

Figure 5.13 - The situation of the dining philosophers

 One possible solution, as shown in the following code section, is to use a set of
five semaphores ( chopsticks[ 5 ] ), and to have each hungry philosopher first wait
on their left chopstick ( chopsticks[ i ] ), and then wait on their right chopstick (
chopsticks[ ( i + 1 ) % 5 ] )
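A sketch of that ( deadlock-prone ) structure, assuming POSIX semaphores:

/* Each chopstick is a semaphore initialized to 1. */
#include <semaphore.h>

sem_t chopsticks[5];

void philosopher(int i) {
    for (;;) {
        sem_wait(&chopsticks[i]);               /* pick up left chopstick  */
        sem_wait(&chopsticks[(i + 1) % 5]);     /* pick up right chopstick */
        /* ... eat ... */
        sem_post(&chopsticks[i]);               /* put both chopsticks down */
        sem_post(&chopsticks[(i + 1) % 5]);
        /* ... think ... */
    }
}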
 But suppose that all five philosophers get hungry at the same time, and each starts
by picking up their left chopstick. They then look for their right chopstick, but
because it is unavailable, they wait for it, forever, and eventually all the
philosophers starve due to the resulting deadlock.
The structure of philosopher i.

 Some potential solutions to the problem include:
o Only allow four philosophers to dine at the same time. ( Limited
simultaneous processes. )
o Allow philosophers to pick up chopsticks only when both are available, in
a critical section. ( All or nothing allocation of critical resources. )
o Use an asymmetric solution, in which odd philosophers pick up their left
chopstick first and even philosophers pick up their right chopstick first. (
Will this solution always work? What if there are an even number of
philosophers? )
 Note carefully that a deadlock-free solution to the dining philosophers problem
does not necessarily guarantee a starvation-free one. ( While some or even most
of the philosophers may be able to get on with their normal lives of eating and
thinking, there may be one unlucky soul who never seems to be able to get both
chopsticks at the same time. :-(

Monitors

 Semaphores can be very useful for solving concurrency problems, but only if
programmers use them properly. If even one process fails to abide by the proper use of
semaphores, either accidentally or deliberately, then the whole system breaks down. (
And since concurrency problems are by definition rare events, the problem code may
easily go unnoticed and/or be heinous to debug. )
 For this reason a higher-level language construct has been developed, called monitors.

Monitor Usage
 A monitor is essentially a class, in which all data is private, and with the special
restriction that only one method within any given monitor object may be active at
the same time. An additional restriction is that monitor methods may only access
the shared data within the monitor and any data passed to them as parameters. I.e.
they cannot access any data external to the monitor.

Syntax of a monitor.

 Figure shows a schematic of a monitor, with an entry queue of processes waiting
their turn to execute monitor operations ( methods. )
Schematic view of a monitor

 In order to fully realize the potential of monitors, we need to introduce one
additional new data type, known as a condition.
o A variable of type condition has only two legal operations, wait and
signal. I.e. if X was defined as type condition, then legal operations would
be X.wait( ) and X.signal( )
o The wait operation blocks a process until some other process calls signal,
and adds the blocked process onto a list associated with that condition.
o The signal process does nothing if there are no processes waiting on that
condition. Otherwise it wakes up exactly one process from the condition's
list of waiting processes. ( Contrast this with counting semaphores, which
always affect the semaphore on a signal call. )
 Figure below illustrates a monitor that includes condition variables within its data
space. Note that the condition variables, along with the list of processes currently
waiting for the conditions, are in the data space of the monitor - The processes on
these lists are not "in" the monitor, in the sense that they are not executing any
code in the monitor.
Monitor with condition variables

 But now there is a potential problem - If process P within the monitor issues a
signal that would wake up process Q also within the monitor, then there would be
two processes running simultaneously within the monitor, violating the exclusion
requirement. Accordingly there are two possible solutions to this dilemma:

Signal and wait - When process P issues the signal to wake up process Q, P then waits, either
for Q to leave the monitor or on some other condition.

Signal and continue - When P issues the signal, Q waits, either for P to exit the monitor or for
some other condition.

There are arguments for and against either choice. Concurrent Pascal offers a third alternative -
The signal call causes the signaling process to immediately exit the monitor, so that the waiting
process can then wake up and proceed.
Dining-Philosophers Solution Using Monitors

 This solution to the dining philosophers uses monitors, and the restriction that a
philosopher may only pick up chopsticks when both are available. There are also
two key data structures in use in this solution:
1. enum { THINKING, HUNGRY, EATING } state[ 5 ]; A philosopher may
only set their state to eating when neither of their adjacent neighbors is
eating. ( state[ ( i + 1 ) % 5 ] != EATING && state[ ( i + 4 ) % 5 ] !=
EATING ).
2. condition self[ 5 ]; This condition is used to delay a hungry philosopher
who is unable to acquire chopsticks.
 In the following solution philosophers share a monitor, DiningPhilosophers, and
eat using the following sequence of operations:
1. DiningPhilosophers.pickup( ) - Acquires chopsticks, which may block the
process.
2. eat
3. DiningPhilosophers.putdown( ) - Releases the chopsticks.
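The monitor itself is shown as pseudocode in the textbook; the following is a rough POSIX-threads analogue, in which a single mutex plays the role of the monitor lock and one condition variable per philosopher plays the role of self[ 5 ]:

/* Monitor-style dining philosophers: eat only when both neighbors are not eating. */
#include <pthread.h>

enum pstate { THINKING, HUNGRY, EATING };

static enum pstate     state[5];                 /* zero-initialized: THINKING  */
static pthread_mutex_t monitor_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  self[5] = {
    PTHREAD_COND_INITIALIZER, PTHREAD_COND_INITIALIZER, PTHREAD_COND_INITIALIZER,
    PTHREAD_COND_INITIALIZER, PTHREAD_COND_INITIALIZER };

static void test(int i) {                        /* call only with the lock held */
    if (state[(i + 4) % 5] != EATING &&
        state[i] == HUNGRY &&
        state[(i + 1) % 5] != EATING) {
        state[i] = EATING;
        pthread_cond_signal(&self[i]);           /* wake philosopher i if waiting */
    }
}

void pickup(int i) {
    pthread_mutex_lock(&monitor_lock);
    state[i] = HUNGRY;
    test(i);                                     /* try to take both chopsticks   */
    while (state[i] != EATING)                   /* otherwise wait for a neighbor  */
        pthread_cond_wait(&self[i], &monitor_lock);
    pthread_mutex_unlock(&monitor_lock);
}

void putdown(int i) {
    pthread_mutex_lock(&monitor_lock);
    state[i] = THINKING;
    test((i + 4) % 5);                           /* let the left neighbor eat  */
    test((i + 1) % 5);                           /* let the right neighbor eat */
    pthread_mutex_unlock(&monitor_lock);
}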
Implementing a Monitor Using Semaphores

 One possible implementation of a monitor uses a semaphore "mutex" to control
mutual exclusionary access to the monitor, and a counting semaphore "next" on
which processes can suspend themselves after they are already "inside" the
monitor ( in conjunction with condition variables, see below. ) The integer
next_count keeps track of how many processes are waiting in the next queue.
Externally accessible monitor processes are then implemented as:

 Condition variables can be implemented using semaphores as well. For a
condition x, a semaphore "x_sem" and an integer "x_count" are introduced, both
initialized to zero. The wait and signal methods are then implemented as follows.
( This approach to the condition implements the signal-and-wait option described
above for ensuring that only one process at a time is active inside the monitor. )
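A sketch of this scheme using POSIX semaphores, covering both the wrapper around an externally visible operation F and the condition operations; the names mutex, next, next_count, x_sem, and x_count follow the description above:

/* Monitor implemented with semaphores ( signal-and-wait discipline ). */
#include <semaphore.h>

sem_t mutex;            /* monitor entry lock, initialized to 1               */
sem_t next;             /* signaling processes suspend themselves here; 0     */
int   next_count = 0;   /* number of processes suspended on next              */

/* Each externally visible monitor operation F is wrapped as:
       sem_wait(&mutex);
           ... body of F ...
       if (next_count > 0) sem_post(&next);    // resume a suspended signaler
       else                sem_post(&mutex);   // or open the monitor          */

/* One condition variable x is represented by a semaphore and a counter. */
sem_t x_sem;            /* initialized to 0                    */
int   x_count = 0;      /* number of processes waiting on x    */

void x_wait(void) {
    x_count++;
    if (next_count > 0) sem_post(&next);       /* let a suspended signaler run */
    else                sem_post(&mutex);      /* or release the monitor       */
    sem_wait(&x_sem);                          /* block until x is signaled    */
    x_count--;
}

void x_signal(void) {
    if (x_count > 0) {
        next_count++;
        sem_post(&x_sem);                      /* wake one waiter on x               */
        sem_wait(&next);                       /* signal-and-wait: suspend ourselves */
        next_count--;
    }
}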
Resuming Processes Within a Monitor

 When there are multiple processes waiting on the same condition within a
monitor, how does one decide which one to wake up in response to a signal on
that condition? One obvious approach is FCFS, and this may be suitable in many
cases.
 Another alternative is to assign ( integer ) priorities, and to wake up the process
with the smallest ( best ) priority.
 Figure illustrates the use of such a condition within a monitor used for resource
allocation. Processes wishing to access this resource must specify the time they
expect to use it using the acquire( time ) method, and must call the release( )
method when they are done with the resource.
A monitor to allocate a single resource.

 Unfortunately the use of monitors to restrict access to resources still only works if
programmers make the requisite acquire and release calls properly. One option
would be to place the resource allocation code into the monitor, thereby
eliminating the option for programmers to bypass or ignore the monitor, but then
that would substitute the monitor's scheduling algorithms for whatever other
scheduling algorithms may have been chosen for that particular resource. Chapter
14 on Protection presents more advanced methods for enforcing "nice"
cooperation among processes contending for shared resources.
 Concurrent Pascal, Mesa, C#, and Java all implement monitors as described here.
Erlang provides concurrency support using a similar mechanism.
Synchronization Examples

This section looks at how synchronization is handled in a number of different systems.

Synchronization in Windows

Figure 5.20 - Mutex dispatcher object

Synchronization in Linux
Unit-3
Main Memory
Background

 Obviously memory accesses and memory management are a very important part of
modern computer operation. Every instruction has to be fetched from memory before it
can be executed, and most instructions involve retrieving data from memory or storing
data in memory or both.
 The advent of multi-tasking OSes compounds the complexity of memory management,
because as processes are swapped in and out of the CPU, so must their code and
data be swapped in and out of memory, all at high speeds and without interfering with
any other processes.
 Shared memory, virtual memory, the classification of memory as read-only versus read-
write, and concepts like copy-on-write forking all further complicate the issue.

Basic Hardware

 It should be noted that from the memory chip's point of view, all memory accesses
are equivalent. The memory hardware doesn't know what a particular part of
memory is being used for, nor does it care. This is almost true of the OS as well,
although not entirely.
 The CPU can only access its registers and main memory. It cannot, for example,
make direct access to the hard drive, so any data stored there must first be
transferred into the main memory chips before the CPU can work with it. ( Device
drivers communicate with their hardware via interrupts and "memory" accesses,
sending short instructions for example to transfer data from the hard drive to a
specified location in main memory. The disk controller monitors the bus for such
instructions, transfers the data, and then notifies the CPU that the data is there
with another interrupt, but the CPU never gets direct access to the disk. )
 Memory accesses to registers are very fast, generally one clock tick, and a CPU
may be able to execute more than one machine instruction per clock tick.
 Memory accesses to main memory are comparatively slow, and may take a
number of clock ticks to complete. This would require intolerable waiting by the
CPU if it were not for an intermediary fast memory cache built into most modern
CPUs. The basic idea of the cache is to transfer chunks of memory at a time from
the main memory to the cache, and then to access individual memory locations
one at a time from the cache.
 User processes must be restricted so that they only access memory locations that
"belong" to that particular process. This is usually implemented using a base
register and a limit register for each process, as shown in Figures 8.1 and 8.2
below. Every memory access made by a user process is checked against these two
registers, and if a memory access is attempted outside the valid range, then a fatal
error is generated. The OS obviously has access to all existing memory locations,
as this is necessary to swap users' code and data in and out of memory. It should
also be obvious that changing the contents of the base and limit registers is a
privileged activity, allowed only to the OS kernel.
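A tiny sketch ( illustrative types and names, not from the text ) of the check the hardware performs on every user-mode memory reference:

/* Legal addresses satisfy base <= addr < base + limit; anything else traps. */
#include <stdint.h>
#include <stdbool.h>

bool address_ok(uint32_t addr, uint32_t base, uint32_t limit) {
    return addr >= base && addr < base + limit;
}

/* if ( !address_ok( addr, base, limit ) )  trap to the OS: addressing error */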

A base and a limit register define a logical address space


Hardware address protection with base and limit registers

Address Binding

 User programs typically refer to memory addresses with symbolic names such as
"i", "count", and "averageTemperature". These symbolic names must be mapped
or bound to physical memory addresses, which typically occurs in several stages:
o Compile Time - If it is known at compile time where a program will
reside in physical memory, then absolute code can be generated by the
compiler, containing actual physical addresses. However if the load
address changes at some later time, then the program will have to be
recompiled. DOS .COM programs use compile time binding.
o Load Time - If the location at which a program will be loaded is not
known at compile time, then the compiler must generate relocatable code,
which references addresses relative to the start of the program. If that
starting address changes, then the program must be reloaded but not
recompiled.
o Execution Time - If a program can be moved around in memory during
the course of its execution, then binding must be delayed until execution
time. This requires special hardware, and is the method implemented by
most modern OSes.
 Figure shows the various stages of the binding processes and the units involved in
each stage:
Multistep processing of a user program
Logical Versus Physical Address Space

 The address generated by the CPU is a logical address, whereas the address
actually seen by the memory hardware is a physical address.
 Addresses bound at compile time or load time have identical logical and physical
addresses.
 Addresses created at execution time, however, have different logical and physical
addresses.
o In this case the logical address is also known as a virtual address, and the
two terms are used interchangeably by our text.
o The set of all logical addresses used by a program composes the logical
address space, and the set of all corresponding physical addresses
composes the physical address space.
 The run time mapping of logical to physical addresses is handled by the memory-
management unit, MMU.
o The MMU can take on many forms. One of the simplest is a modification
of the base-register scheme described earlier.
o The base register is now termed a relocation register, whose value is
added to every memory request at the hardware level.
 Note that user programs never see physical addresses. User programs work
entirely in logical address space, and any memory references or manipulations are
done using purely logical addresses. Only when the address gets sent to the
physical memory chips is the physical memory address generated.

Dynamic relocation using a relocation register


Dynamic Loading

 Rather than loading an entire program into memory at once, dynamic loading
loads up each routine as it is called. The advantage is that unused routines need
never be loaded, reducing total memory usage and generating faster program
startup times. The downside is the added complexity and overhead of checking to
see if a routine is loaded every time it is called and then loading it up if it is
not already loaded.

Dynamic Linking and Shared Libraries

 With static linking library modules get fully included in executable modules,
wasting both disk space and main memory usage, because every program that
included a certain routine from the library would have to have their own copy of
that routine linked into their executable code.
 With dynamic linking, however, only a stub is linked into the executable module,
containing references to the actual library module linked in at run time.
o This method saves disk space, because the library routines do not need to
be fully included in the executable modules, only the stubs.
o We will also learn that if the code section of the library routines is
reentrant, ( meaning it does not modify the code while it runs, making it
safe to re-enter it ), then main memory can be saved by loading only one
copy of dynamically linked routines into memory and sharing the code
amongst all processes that are concurrently using it. ( Each process would
have their own copy of the data section of the routines, but that may be
small relative to the code segments. ) Obviously the OS must manage
shared routines in memory.
o An added benefit of dynamically linked libraries ( DLLs, also known as
shared libraries or shared objects on UNIX systems ) involves easy
upgrades and updates. When a program uses a routine from a standard
library and the routine changes, then the program must be re-built ( re-
linked ) in order to incorporate the changes. However if DLLs are used,
then as long as the stub doesn't change, the program can be updated
merely by loading new versions of the DLLs onto the system. Version
information is maintained in both the program and the DLLs, so that a
program can specify a particular version of the DLL if necessary.
o In practice, the first time a program calls a DLL routine, the stub will
recognize the fact and will replace itself with the actual routine from the
DLL library. Further calls to the same routine will access the routine
directly and not incur the overhead of the stub access. ( Following the
UML Proxy Pattern. )
Swapping

 A process must be loaded into memory in order to execute.


 If there is not enough memory available to keep all running processes in memory at the
same time, then some processes who are not currently using the CPU may have their
memory swapped out to a fast local disk called the backing store.

Standard Swapping

 If compile-time or load-time address binding is used, then processes must be
swapped back into the same memory location from which they were swapped out.
If execution time binding is used, then the processes can be swapped back into
any available location.
 Swapping is a very slow process compared to other operations. For example, if a
user process occupied 10 MB and the transfer rate for the backing store were 40
MB per second, then it would take 1/4 second ( 250 milliseconds ) just to do the
data transfer. Adding in a latency lag of 8 milliseconds and ignoring head seek
time for the moment, and further recognizing that swapping involves moving old
data out as well as new data in, the overall transfer time required for this swap is
516 milliseconds, or over half a second. For efficient processor scheduling the
CPU time slice should be significantly longer than this lost transfer time.
 To reduce swapping transfer overhead, it is desired to transfer as little information
as possible, which requires that the system know how much memory a process is
using, as opposed to how much it might use. Programmers can help with this by
freeing up dynamic memory that they are no longer using.
 It is important to swap processes out of memory only when they are idle, or more
to the point, only when there are no pending I/O operations. ( Otherwise the
pending I/O operation could write into the wrong process's memory space. ) The
solution is to either swap only totally idle processes, or do I/O operations only
into and out of OS buffers, which are then transferred to or from process's main
memory as a second step.
 Most modern OSes no longer use swapping, because it is too slow and there are
faster alternatives available. ( e.g. Paging. ) However some UNIX systems will
still invoke swapping if the system gets extremely full, and then discontinue
swapping when the load reduces again. Windows 3.1 would use a modified
version of swapping that was somewhat controlled by the user, swapping
process's out if necessary and then only swapping them back in when the user
focused on that particular window.
Swapping of two processes using a disk as a backing store

Swapping on Mobile Systems

 Swapping is typically not supported on mobile platforms, for several reasons:


o Mobile devices typically use flash memory in place of more spacious hard
drives for persistent storage, so there is not as much space available.
o Flash memory can only be written to a limited number of times before it
becomes unreliable.
o The bandwidth to flash memory is also lower.
 Apple's IOS asks applications to voluntarily free up memory
o Read-only data, e.g. code, is simply removed, and reloaded later if needed.
o Modified data, e.g. the stack, is never removed, but . . .
o Apps that fail to free up sufficient memory can be removed by the OS
 Android follows a similar strategy.
o Prior to terminating a process, Android writes its application state to flash
memory for quick restarting.

Contiguous Memory Allocation

 One approach to memory management is to load each process into a contiguous space.
The operating system is allocated space first, usually at either low or high memory
locations, and then the remaining available memory is allocated to processes as needed.
( The OS is usually loaded low, because that is where the interrupt vectors are located,
but on older systems part of the OS was loaded high to make more room in low memory (
within the 640K barrier ) for user processes. )

Memory Protection ( was Memory Mapping and Protection )

 The system shown in Figure below allows protection against user programs
accessing areas that they should not, allows programs to be relocated to different
memory starting addresses as needed, and allows the memory space devoted to
the OS to grow or shrink dynamically as needs change.

Hardware support for relocation and limit registers

Memory Allocation

 One method of allocating contiguous memory is to divide all available memory
into equal sized partitions, and to assign each process to their own partition. This
restricts both the number of simultaneous processes and the maximum size of
each process, and is no longer used.
 An alternate approach is to keep a list of unused ( free ) memory blocks ( holes ),
and to find a hole of a suitable size whenever a process needs to be loaded into
memory. There are many different strategies for finding the "best" allocation of
memory to processes, including the three most commonly discussed:
1. First fit - Search the list of holes until one is found that is big enough to
satisfy the request, and assign a portion of that hole to that process.
Whatever fraction of the hole not needed by the request is left on the free
list as a smaller hole. Subsequent requests may start looking either from
the beginning of the list or from the point at which this search ended.
2. Best fit - Allocate the smallest hole that is big enough to satisfy the
request. This saves large holes for other process requests that may need
them later, but the resulting unused portions of holes may be too small to
be of any use, and will therefore be wasted. Keeping the free list sorted
can speed up the process of finding the right hole.
3. Worst fit - Allocate the largest hole available, thereby increasing the
likelihood that the remaining portion will be usable for satisfying future
requests.
 Simulations show that either first or best fit are better than worst fit in terms of
both time and storage utilization. First and best fits are about equal in terms of
storage utilization, but first fit is faster.
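An illustrative first-fit search over a linked free list; the structure and function names are assumptions, not from the text. ( Best fit would scan the entire list for the smallest adequate hole, and worst fit for the largest. )

/* First fit: take the first hole that is big enough; the leftover stays on the list. */
#include <stddef.h>

struct hole {
    size_t       start;                    /* starting address of the free block */
    size_t       size;                     /* size of the hole in bytes          */
    struct hole *next;
};

/* Returns the allocated address, or (size_t)-1 if no hole is large enough. */
size_t first_fit(struct hole *free_list, size_t request) {
    for (struct hole *h = free_list; h != NULL; h = h->next) {
        if (h->size >= request) {          /* first hole big enough wins          */
            size_t addr = h->start;
            h->start += request;           /* remainder stays as a smaller hole;  */
            h->size  -= request;           /* a real allocator unlinks empty ones */
            return addr;
        }
    }
    return (size_t)-1;                     /* no fit: external fragmentation      */
}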

Fragmentation

 All the memory allocation strategies suffer from external fragmentation, though
first and best fits experience the problems more so than worst fit. External
fragmentation means that the available memory is broken up into lots of little
pieces, none of which is big enough to satisfy the next memory requirement,
although the sum total could.
 The amount of memory lost to fragmentation may vary with algorithm, usage
patterns, and some design decisions such as which end of a hole to allocate and
which end to save on the free list.
 Statistical analysis of first fit, for example, shows that for N blocks of allocated
memory, another 0.5 N will be lost to fragmentation.
 Internal fragmentation also occurs, with all memory allocation strategies. This is
caused by the fact that memory is allocated in blocks of a fixed size, whereas the
actual memory needed will rarely be that exact size. For a random distribution of
memory requests, on the average 1/2 block will be wasted per memory request,
because on the average the last allocated block will be only half full.
o Note that the same effect happens with hard drives, and that modern
hardware gives us increasingly larger drives and memory at the expense of
ever larger block sizes, which translates to more memory lost to internal
fragmentation.
o Some systems use variable size blocks to minimize losses due to internal
fragmentation.
 If the programs in memory are relocatable, ( using execution-time address
binding ), then the external fragmentation problem can be reduced via
compaction, i.e. moving all processes down to one end of physical memory. This
only involves updating the relocation register for each process, as all internal
work is done using logical addresses.
 Another solution as we will see in upcoming sections is to allow processes to use
non-contiguous blocks of physical memory, with a separate relocation register for
each block.
Segmentation

Basic Method

 Most users ( programmers ) do not think of their programs as existing in one
continuous linear address space.
 Rather they tend to think of their memory in multiple segments, each dedicated to
a particular use, such as code, data, the stack, the heap, etc.
 Memory segmentation supports this view by providing addresses with a segment
number ( mapped to a segment base address ) and an offset from the beginning of
that segment.
 For example, a C compiler might generate 5 segments for the user code, library
code, global ( static ) variables, the stack, and the heap, as shown in Figure:

Programmer's view of a program.

Segmentation Hardware
 A segment table maps segment-offset addresses to physical addresses, and
simultaneously checks for invalid addresses, using a system similar to the page
tables and relocation base registers discussed previously. ( Note that at this point
in the discussion of segmentation, each segment is kept in contiguous memory
and may be of different sizes, but that segmentation can also be combined with
paging as we shall see shortly. )

Segmentation hardware
Example of segmentation

Paging

 Paging is a memory management scheme that allows a process's physical memory to be
discontinuous, and which eliminates problems with external fragmentation by allocating memory
in equal sized blocks known as pages.
 Paging eliminates most of the problems of the other methods discussed previously, and is
the predominant memory management technique used today.

Basic Method

 The basic idea behind paging is to divide physical memory into a number of equal sized
blocks called frames, and to divide a program's logical memory space into blocks of the
same size called pages.
 Any page ( from any process ) can be placed into any available frame.
 The page table is used to look up what frame a particular page is stored in at the moment.
In the following example, for instance, page 2 of the program's logical memory is
currently stored in frame 3 of physical memory:
Paging hardware

Paging model of logical and physical memory

 A logical address consists of two parts: A page number in which the address resides, and
an offset from the beginning of that page. ( The number of bits in the page number limits
how many pages a single process can address. The number of bits in the offset determines
the maximum size of each page, and should correspond to the system frame size. )
 The page table maps the page number to a frame number, to yield a physical address
which also has two parts: The frame number and the offset within that frame. The number
of bits in the frame number determines how many frames the system can address, and the
number of bits in the offset determines the size of each frame.
 Page numbers, frame numbers, and frame sizes are determined by the architecture, but
are typically powers of two, allowing addresses to be split at a certain number of bits. For
example, if the logical address size is 2^m and the page size is 2^n, then the high-order
m-n bits of a logical address designate the page number and the remaining n bits
represent the offset.
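In code, the split is just a shift and a mask; the following sketch assumes 4 KB pages ( n = 12 ) and 32-bit logical addresses:

/* Extracting the page number and offset from a logical address. */
#include <stdint.h>

#define PAGE_BITS 12                                   /* 4 KB pages, so n = 12 */

uint32_t page_number(uint32_t logical) { return logical >> PAGE_BITS; }
uint32_t page_offset(uint32_t logical) { return logical & ((1u << PAGE_BITS) - 1); }

/* Physical address = ( frame_number << PAGE_BITS ) | page_offset( logical ) */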
 Note also that the number of bits in the page number and the number of bits in the frame
number do not have to be identical. The former determines the address range of the
logical address space, and the latter relates to the physical address space.

 ( DOS used to use an addressing scheme with 16 bit frame numbers and 16-bit offsets, on
hardware that only supported 24-bit hardware addresses. The result was a resolution of
starting frame addresses finer than the size of a single frame, and multiple frame-offset
combinations that mapped to the same physical hardware address. )
 Consider the following micro example, in which a process has 16 bytes of logical
memory, mapped in 4 byte pages into 32 bytes of physical memory. ( Presumably some
other processes would be consuming the remaining 16 bytes of physical memory. )
Paging example for a 32-byte memory with 4-byte pages

 Note that paging is like having a table of relocation registers, one for each page of the
logical memory.
 There is no external fragmentation with paging. All blocks of physical memory are used,
and there are no gaps in between and no problems with finding the right sized hole for a
particular chunk of memory.
 There is, however, internal fragmentation. Memory is allocated in chunks the size of a
page, and on the average, the last page will only be half full, wasting on the average half
a page of memory per process. ( Possibly more, if processes keep their code and data in
separate pages. )
 Larger page sizes waste more memory, but are more efficient in terms of overhead.
Modern trends have been to increase page sizes, and some systems even have multiple
size pages to try and make the best of both worlds.
 Page table entries ( frame numbers ) are typically 32 bit numbers, allowing access to
2^32 physical page frames. If those frames are 4 KB in size each, that translates to 16 TB
of addressable physical memory. ( 32 + 12 = 44 bits of physical address space. )
 When a process requests memory ( e.g. when its code is loaded in from disk ), free
frames are allocated from a free-frame list, and inserted into that process's page table.
 Processes are blocked from accessing anyone else's memory because all of their memory
requests are mapped through their page table. There is no way for them to generate an
address that maps into any other process's memory space.
 The operating system must keep track of each individual process's page table, updating it
whenever the process's pages get moved in and out of memory, and applying the correct
page table when processing system calls for a particular process. This all increases the
overhead involved when swapping processes in and out of the CPU. ( The currently
active page table must be updated to reflect the process that is currently running. )
Free frames (a) before allocation and (b) after allocation

Hardware Support

 Page lookups must be done for every memory reference, and whenever a process
gets swapped in or out of the CPU, its page table must be swapped in and out too,
along with the instruction registers, etc. It is therefore appropriate to provide
hardware support for this operation, in order to make it as fast as possible and to
make process switches as fast as possible also.
 One option is to use a set of registers for the page table. For example, the DEC
PDP-11 uses 16-bit addressing and 8 KB pages, resulting in only 8 pages per
process. ( It takes 13 bits to address 8 KB of offset, leaving only 3 bits to define a
page number. )
 An alternate option is to store the page table in main memory, and to use a single
register ( called the page-table base register, PTBR ) to record where in memory
the page table is located.
o Process switching is fast, because only the single register needs to be
changed.
o However memory access just got half as fast, because every memory
access now requires two memory accesses - One to fetch the frame
number from memory and then another one to access the desired memory
location.
o The solution to this problem is to use a very special high-speed memory
device called the translation look-aside buffer, TLB.
 The benefit of the TLB is that it can search an entire table for a key
value in parallel, and if it is found anywhere in the table, then the
corresponding lookup value is returned.

Paging hardware with TLB

 The TLB is very expensive, however, and therefore very small.
( Not large enough to hold the entire page table. ) It is therefore
used as a cache device.
 Addresses are first checked against the TLB, and if the info
is not there ( a TLB miss ), then the frame is looked up
from main memory and the TLB is updated.
 If the TLB is full, then replacement strategies range from
least-recently used, LRU to random.
 Some TLBs allow some entries to be wired down, which
means that they cannot be removed from the TLB.
Typically these would be kernel frames.
 Some TLBs store address-space identifiers, ASIDs, to
keep track of which process "owns" a particular entry in the
TLB. This allows entries from multiple processes to be
stored simultaneously in the TLB without granting one
process access to some other process's memory location.
Without this feature the TLB has to be flushed clean with
every process switch.
 The percentage of time that the desired information is found in the
TLB is termed the hit ratio.
 ( Eighth Edition Version: ) For example, suppose that it takes 100
nanoseconds to access main memory, and only 20 nanoseconds to
search the TLB. So a TLB hit takes 120 nanoseconds total ( 20 to
find the frame number and then another 100 to go get the data ),
and a TLB miss takes 220 ( 20 to search the TLB, 100 to go get the
frame number, and then another 100 to go get the data. ) So with
an 80% TLB hit ratio, the average memory access time would be:

0.80 * 120 + 0.20 * 220 = 140 nanoseconds

for a 40% slowdown to get the frame number. A 98% hit rate
would yield 122 nanoseconds average access time ( you should
verify this ), for a 22% slowdown.

 ( Ninth Edition Version: ) The ninth edition ignores the 20
nanoseconds required to search the TLB, yielding

0.80 * 100 + 0.20 * 200 = 120 nanoseconds

for a 20% slowdown to get the frame number. A 99% hit rate
would yield 101 nanoseconds average access time ( you should
verify this ), for a 1% slowdown.
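The arithmetic can be checked with a few lines of C ( illustrative code, using the constants from the two versions quoted above ):

/* Effective access time for the TLB examples above. */
#include <stdio.h>

int main(void) {
    /* Eighth-edition model: 20 ns TLB search + 100 ns per memory access. */
    double hit = 20 + 100, miss = 20 + 100 + 100;
    printf("8th ed, 80%% hits: %.0f ns\n", 0.80 * hit + 0.20 * miss);   /* 140 */
    printf("8th ed, 98%% hits: %.0f ns\n", 0.98 * hit + 0.02 * miss);   /* 122 */

    /* Ninth-edition model: TLB search time ignored. */
    printf("9th ed, 80%% hits: %.0f ns\n", 0.80 * 100 + 0.20 * 200);    /* 120 */
    printf("9th ed, 99%% hits: %.0f ns\n", 0.99 * 100 + 0.01 * 200);    /* 101 */
    return 0;
}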

Protection

 The page table can also help to protect processes from accessing memory that
they shouldn't, or their own memory in ways that they shouldn't.
 A bit or bits can be added to the page table to classify a page as read-write, read-
only, read-write-execute, or some combination of these sorts of things. Then each
memory reference can be checked to ensure it is accessing the memory in the
appropriate mode.
 Valid / invalid bits can be added to "mask off" entries in the page table that are
not in use by the current process, as shown by example in Figure 8.12 below.
 Note that the valid / invalid bits described above cannot block all illegal memory
accesses, due to the internal fragmentation. ( Areas of memory in the last page
that are not entirely filled by the process, and may contain data left over by
whoever used that frame last. )
 Many processes do not use all of the page table available to them, particularly in
modern systems with very large potential page tables. Rather than waste memory
by creating a full-size page table for every process, some systems use a page-table
length register, PTLR, to specify the length of the page table.

Valid (v) or invalid (i) bit in page table

Shared Pages

 Paging systems can make it very easy to share blocks of memory, by simply
mapping pages in multiple page tables to the same physical frame. This may be done with either
code or data.
 If code is reentrant, that means that it does not write to or change the code in any
way ( it is non self-modifying ), and it is therefore safe to re-enter it. More
importantly, it means the code can be shared by multiple processes, so long as
each has their own copy of the data and registers, including the instruction
register.
 In the example given below, three different users are running the editor
simultaneously, but the code is only loaded into memory ( in the page frames )
one time.
 Some systems also implement shared memory in this fashion.

Sharing of code in a paging environment


Structure of the Page Table

Hierarchical Paging

 Most modern computer systems support logical address spaces of 2^32 to 2^64.
 With a 2^32 address space and 4K ( 2^12 ) page sizes, this leaves 2^20 entries in
the page table. At 4 bytes per entry, this amounts to a 4 MB page table, which is
too large to reasonably keep in contiguous memory. ( And to swap in and out of
memory with each process switch. ) Note that with 4K pages, this would take
1024 pages just to hold the page table!
 One option is to use a two-tier paging system, i.e. to page the page table.
 For example, the 20 bits described above could be broken down into two 10-bit
page numbers. The first identifies an entry in the outer page table, which
identifies where in memory to find one page of an inner page table. The second 10
bits finds a specific entry in that inner page table, which in turn identifies a
particular frame in physical memory. ( The remaining 12 bits of the 32 bit logical
address are the offset within the 4K frame. )
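A sketch of the corresponding bit manipulation for the 10 / 10 / 12 split described above ( illustrative function names ):

/* Two-level translation: outer index p1, inner index p2, offset d. */
#include <stdint.h>

uint32_t outer_index(uint32_t addr) { return addr >> 22; }             /* p1 */
uint32_t inner_index(uint32_t addr) { return (addr >> 12) & 0x3FF; }   /* p2 */
uint32_t page_offset(uint32_t addr) { return addr & 0xFFF; }           /* d  */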
A two-level page-table scheme
Address translation for a two-level 32-bit paging architecture

 VAX Architecture divides 32-bit addresses into 4 equal sized sections, and each
page is 512 bytes, yielding an address form of:

 With a 64-bit logical address space and 4K pages, there are 52 bits worth of page
numbers, which is still too many even for two-level paging. One could increase
the paging level, but with 10-bit page tables it would take 7 levels of indirection,
which would be prohibitively slow memory access. So some other approach must
be used.

A 64-bit two-tiered scheme leaves 42 bits in the outer table


Going to a fourth level still leaves 32 bits in the outer table.

Hashed Page Tables

 One common data structure for accessing data that is sparsely distributed over a
broad range of possible values is the hash table. Figure 8.16 below illustrates a
hashed page table using chain-and-bucket hashing:

Hashed page table

Inverted Page Tables

 Another approach is to use an inverted page table. Instead of a table listing all of
the pages for a particular process, an inverted page table lists all of the pages
currently loaded in memory, for all processes. ( I.e. there is one entry per frame
instead of one entry per page. )
 Access to an inverted page table can be slow, as it may be necessary to search the
entire table in order to find the desired page ( or to discover that it is not there. )
Hashing the table can help speed up the search process.
 Inverted page tables prohibit the normal method of implementing shared memory,
which is to map multiple logical pages to a common physical frame. ( Because
each frame is now mapped to one and only one process. )

Inverted page table

IA-32 Architecture
 The Pentium CPU provides both pure segmentation and segmentation with
paging. In the latter case, the CPU generates a logical address ( segment-offset
pair ), which the segmentation unit converts into a logical linear address, which in
turn is mapped to a physical frame by the paging unit, as shown in Figure 8.21:
Figure 8.21 - Logical to physical address translation in IA-32

IA-32 Segmentation

 The Pentium architecture allows segments to be as large as 4 GB, ( 32 bits
of offset ).
 Processes can have as many as 16K segments, divided into two 8K
groups:
o 8K private to that particular process, stored in the Local Descriptor
Table, LDT.
o 8K shared among all processes, stored in the Global Descriptor
Table, GDT.
 Logical addresses are ( selector, offset ) pairs, where the selector is made
up of 16 bits:
o A 13 bit segment number ( up to 8K )
o A 1 bit flag for LDT vs. GDT.
o 2 bits for protection codes.

o The descriptor tables contain 8-byte descriptions of each segment,
including base and limit registers.
o Logical linear addresses are generated by looking the selector up in the
descriptor table and adding the appropriate base address to the offset, as
shown in Figure 8.22:
IA-32 Paging
 Pentium paging normally uses a two-tier paging scheme, with the first 10
bits being a page number for an outer page table ( a.k.a. page directory ),
and the next 10 bits being a page number within one of the 1024 inner
page tables, leaving the remaining 12 bits as an offset into a 4K page.

 A special bit in the page directory can indicate that this page is a 4MB
page, in which case the remaining 22 bits are all used as offset and the
inner tier of page tables is not used.
 The CR3 register points to the page directory for the current process, as
shown in Figure 8.23 below.
 If the inner page table is currently swapped out to disk, then the page
directory will have an "invalid bit" set, and the remaining 31 bits provide
information on where to find the swapped out page table on the disk.
Figure 8.23 - Paging in the IA-32 architecture.
Figure 8.24 - Page address extensions.

Virtual Memory
Background

 Preceding sections talked about how to avoid memory fragmentation by breaking process
memory requirements down into smaller bites ( pages ), and storing the pages non-
contiguously in memory. However the entire process still had to be stored in memory
somewhere.
 In practice, most real processes do not need all their pages, or at least not all at once, for
several reasons:
1. Error handling code is not needed unless that specific error occurs, some of which
are quite rare.
2. Arrays are often over-sized for worst-case scenarios, and only a small fraction of
the arrays are actually used in practice.
3. Certain features of certain programs are rarely used, such as the routine to balance
the federal budget. :-)
 The ability to load only the portions of processes that were actually needed ( and only
when they were needed ) has several benefits:
o Programs could be written for a much larger address space ( virtual memory space
) than physically exists on the computer.
o Because each process is only using a fraction of their total address space, there is
more memory left for other programs, improving CPU utilization and system
throughput.
o Less I/O is needed for swapping processes in and out of RAM, speeding things
up.
Figure shows the general layout of virtual memory, which can be much larger than
physical memory:

Diagram showing virtual memory that is larger than physical memory

 Figure shows virtual address space, which is the programmers logical view of process
memory storage. The actual physical layout is controlled by the process's page table.
 Note that the address space shown in Figure 9.2 is sparse - A great hole in the middle of
the address space is never used, unless the stack and/or the heap grow to fill the hole.
Virtual address space

 Virtual memory also allows the sharing of files and memory by multiple processes, with
several benefits:
o System libraries can be shared by mapping them into the virtual address space of
more than one process.
o Processes can also share virtual memory by mapping the same block of memory
to more than one process.
o Process pages can be shared during a fork( ) system call, eliminating the need to
copy all of the pages of the original ( parent ) process.
Shared library using virtual memory

Demand Paging

 The basic idea behind demand paging is that when a process is swapped in, its pages are
not swapped in all at once. Rather they are swapped in only when the process needs them.
( on demand. ) This is termed a lazy swapper, although a pager is a more accurate term.
Transfer of a paged memory to contiguous disk space

Basic Concepts

 The basic idea behind paging is that when a process is swapped in, the pager only
loads into memory those pages that it expects the process to need ( right away. )
 Pages that are not loaded into memory are marked as invalid in the page table,
using the invalid bit. ( The rest of the page table entry may either be blank or
contain information about where to find the swapped-out page on the hard drive. )
 If the process only ever accesses pages that are loaded in memory ( memory
resident pages ), then the process runs exactly as if all the pages were loaded in to
memory.
Page table when some pages are not in main memory.

 On the other hand, if a page is needed that was not originally loaded up, then a
page fault trap is generated, which must be handled in a series of steps:
1. The memory address requested is first checked, to make sure it was a valid
memory request.
2. If the reference was invalid, the process is terminated. Otherwise, the page
must be paged in.
3. A free frame is located, possibly from a free-frame list.
4. A disk operation is scheduled to bring in the necessary page from disk. (
This will usually block the process on an I/O wait, allowing some other
process to use the CPU in the meantime. )
5. When the I/O operation is complete, the process's page table is updated
with the new frame number, and the invalid bit is changed to indicate that
this is now a valid page reference.
6. The instruction that caused the page fault must now be restarted from the
beginning, ( as soon as this process gets another turn on the CPU. )

Steps in handling a page fault

 In an extreme case, NO pages are swapped in for a process until they are
requested by page faults. This is known as pure demand paging.
 In theory each instruction could generate multiple page faults. In practice this is
very rare, due to locality of reference, covered in section 9.6.1.
 The hardware necessary to support virtual memory is the same as for paging and
swapping: A page table and secondary memory. ( Swap space, whose allocation is
discussed in chapter 12. )
 A crucial part of the process is that the instruction must be restarted from scratch
once the desired page has been made available in memory. For most simple
instructions this is not a major difficulty. However there are some architectures
that allow a single instruction to modify a fairly large block of data, ( which may
span a page boundary ), and if some of the data gets modified before the page
fault occurs, this could cause problems. One solution is to access both ends of the
block before executing the instruction, guaranteeing that the necessary pages get
paged in before the instruction begins.

Performance of Demand Paging

 Obviously there is some slowdown and performance hit whenever a page fault
occurs and the system has to go get it from memory, but just how big a hit is it
exactly?
 There are many steps that occur when servicing a page fault ( see book for full
details ), and some of the steps are optional or variable. But just for the sake of
discussion, suppose that a normal memory access requires 200 nanoseconds, and
that servicing a page fault takes 8 milliseconds. ( 8,000,000 nanoseconds, or
40,000 times a normal memory access. ) With a page fault rate of p, ( on a scale
from 0 to 1 ), the effective access time is now:

( 1 - p ) * ( 200 ) + p * 8000000

= 200 + 7,999,800 * p

which clearly depends heavily on p! Even if only one access in 1000 causes a page fault, the
effective access time rises from 200 nanoseconds to 8.2 microseconds, a slowdown of a factor
of 40 times. In order to keep the slowdown less than 10%, the page fault rate must be less than
0.0000025, or one in 399,990 accesses.
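A few lines of C, using the same constants, confirm these figures ( illustrative code, not from the text ):

/* Effective access time under demand paging for the numbers above. */
#include <stdio.h>

int main(void) {
    double mem = 200.0, fault = 8000000.0;     /* nanoseconds                      */
    double p = 1.0 / 1000;                     /* one fault per 1000 accesses      */
    double eat = (1 - p) * mem + p * fault;
    printf("EAT = %.1f ns\n", eat);            /* 8199.8 ns, about 8.2 microseconds */
    printf("slowdown factor = %.0f\n", eat / mem);   /* about 41, roughly 40 times  */

    /* For at most a 10 percent slowdown: 200 + 7,999,800 p < 220. */
    printf("max p = %.7f (one fault per %.0f accesses)\n",
           20.0 / 7999800.0, 7999800.0 / 20.0);      /* ~0.0000025, ~399,990        */
    return 0;
}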

 A subtlety is that swap space is faster to access than the regular file system,
because it does not have to go through the whole directory structure. For this
reason some systems will transfer an entire process from the file system to swap
space before starting up the process, so that future paging all occurs from the
( relatively ) faster swap space.
 Some systems use demand paging directly from the file system for binary code
( which never changes and hence never needs to be written back to disk when paged out ),
and to reserve the swap space for data segments that must be stored. This
approach is used by both Solaris and BSD Unix.

Copy-on-Write
 The idea behind a copy-on-write fork is that the pages for a parent process do not have to
be actually copied for the child until one or the other of the processes changes the page.
They can be simply shared between the two processes in the meantime, with a bit set that
the page needs to be copied if it ever gets written to. This is a reasonable approach, since
the child process usually issues an exec( ) system call immediately after the fork.

Before process 1 modifies page C.

After process 1 modifies page C.


 Obviously only pages that can be modified even need to be labeled as copy-on-write.
Code segments can simply be shared.
 Pages used to satisfy copy-on-write duplications are typically allocated using zero-fill-
on-demand, meaning that their previous contents are zeroed out before the copy
proceeds.
 Some systems provide an alternative to the fork( ) system call called a virtual memory
fork, vfork( ). In this case the parent is suspended, and the child uses the parent's memory
pages. This is very fast for process creation, but requires that the child not modify any of
the shared memory pages before performing the exec( ) system call. ( In essence this
addresses the question of which process executes first after a call to fork, the parent or the
child. With vfork, the parent is suspended, allowing the child to execute first until it calls
exec( ), sharing pages with the parent in the meantime. )

Page Replacement

 In order to make the most use of virtual memory, we load several processes into memory
at the same time. Since we only load the pages that are actually needed by each process at
any given time, there is room to load many more processes than if we had to load in the
entire process.
 However memory is also needed for other purposes ( such as I/O buffering ), and what
happens if some process suddenly decides it needs more pages and there aren't any free
frames available? There are several possible solutions to consider:
1. Adjust the memory used by I/O buffering, etc., to free up some frames for user
processes. The decision of how to allocate memory for I/O versus user processes
is a complex one, yielding different policies on different systems. ( Some allocate
a fixed amount for I/O, and others let the I/O system contend for memory along
with everything else. )
2. Put the process requesting more pages into a wait queue until some free frames
become available.
3. Swap some process out of memory completely, freeing up its page frames.
4. Find some page in memory that isn't being used right now, and swap that page
only out to disk, freeing up a frame that can be allocated to the process requesting
it. This is known as page replacement, and is the most common solution. There
are many different algorithms for page replacement, which is the subject of the
remainder of this section.
Need for page replacement.

Basic Page Replacement

 The previously discussed page-fault processing assumed that there would be free
frames available on the free-frame list. Now the page-fault handling must be
modified to free up a frame if necessary, as follows:
1. Find the location of the desired page on the disk, either in swap space or in
the file system.
2. Find a free frame:
a. If there is a free frame, use it.
b. If there is no free frame, use a page-replacement algorithm to
select an existing frame to be replaced, known as the victim frame.
c. Write the victim frame to disk. Change all related page tables to
indicate that this page is no longer in memory.
3. Read in the desired page and store it in the frame. Adjust all related page
and frame tables to indicate the change.
4. Restart the process that was waiting for this page.

Page replacement.

 Note that step 3c adds an extra disk write to the page-fault handling, effectively
doubling the time required to process a page fault. This can be alleviated
somewhat by assigning a modify bit, or dirty bit to each page, indicating whether
or not it has been changed since it was last loaded in from disk. If the dirty bit has
not been set, then the page is unchanged, and does not need to be written out to
disk. Otherwise the page write is required. It should come as no surprise that
many page replacement strategies specifically look for pages that do not have
their dirty bit set, and preferentially select clean pages as victim pages. It should
also be obvious that unmodifiable code pages never get their dirty bits set.
 There are two major requirements to implement a successful demand paging
system. We must develop a frame-allocation algorithm and a page-replacement
algorithm. The former centers around how many frames are allocated to each
process ( and to other needs ), and the latter deals with how to select a page for
replacement when there are no free frames available.
 The overall goal in selecting and tuning these algorithms is to generate the fewest
number of overall page faults. Because disk access is so slow relative to memory
access, even slight improvements to these algorithms can yield large
improvements in overall system performance.
 Algorithms are evaluated using a given string of memory accesses known as a
reference string, which can be generated in one of ( at least ) three common
ways:
1. Randomly generated, either evenly distributed or with some distribution
curve based on observed system behavior. This is the fastest and easiest
approach, but may not reflect real performance well, as it ignores locality
of reference.
2. Specifically designed sequences. These are useful for illustrating the
properties of comparative algorithms in published papers and textbooks,
( and also for homework and exam problems. :-) )
3. Recorded memory references from a live system. This may be the best
approach, but the amount of data collected can be enormous, on the order
of a million addresses per second. The volume of collected data can be
reduced by making two important observations:
1. Only the page number that was accessed is relevant. The offset
within that page does not affect paging operations.
2. Successive accesses within the same page can be treated as a single
page request, because all requests after the first are guaranteed to
be page hits. ( Since there are no intervening requests for other
pages that could remove this page from the page table. )
 So for example, if pages were of size 100 bytes, then the sequence
of address requests ( 0100, 0432, 0101, 0612, 0634, 0688, 0132,
0038, 0420 ) would reduce to page requests ( 1, 4, 1, 6, 1, 0, 4 )
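 A minimal Python sketch of this reduction ( the 100-byte page size is only to match the example above; real pages are powers of two ):

def to_reference_string(addresses, page_size=100):
    pages = []
    for addr in addresses:
        page = addr // page_size              # keep only the page number
        if not pages or pages[-1] != page:    # collapse successive accesses to the same page
            pages.append(page)
    return pages

print(to_reference_string([100, 432, 101, 612, 634, 688, 132, 38, 420]))
# -> [1, 4, 1, 6, 1, 0, 4]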
As the number of available frames increases, the number of page faults
should decrease, as shown in Figure:
Graph of page faults versus number of frames.

FIFO Page Replacement

 A simple and obvious page replacement strategy is FIFO, i.e. first-in- first-out.
 As new pages are brought in, they are added to the tail of a queue, and the page at
the head of the queue is the next victim. In the following example, 20 page
requests result in 15 page faults:

FIFO page-replacement algorithm.

 Although FIFO is simple and easy, it is not always optimal, or even efficient.
 An interesting effect that can occur with FIFO is Belady's anomaly, in which
increasing the number of frames available can actually increase the number of
page faults that occur! Consider, for example, the following chart based on the
page sequence ( 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 ) and a varying number of available
frames. Obviously the maximum number of faults is 12 ( every request generates
a fault ), and the minimum number is 5 ( each page loaded only once ), but in
between there are some interesting results:

Page-fault curve for FIFO replacement on a reference string.

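 A minimal FIFO simulator ( a Python sketch, assuming pure demand paging and an initially empty set of frames ) reproduces Belady's anomaly on the reference string above:

from collections import deque

def fifo_faults(refs, nframes):
    frames, queue, faults = set(), deque(), 0
    for page in refs:
        if page not in frames:
            faults += 1
            if len(frames) == nframes:          # no free frame: evict the oldest arrival
                frames.discard(queue.popleft())
            frames.add(page)
            queue.append(page)
    return faults

refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(fifo_faults(refs, 3))    # 9 page faults
print(fifo_faults(refs, 4))    # 10 page faults - more frames, yet more faults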
Optimal Page Replacement

 The discovery of Belady's anomaly led to the search for an optimal page-replacement algorithm - one that yields the lowest possible number of page faults and does not suffer from Belady's anomaly.
 Such an algorithm does exist, and is called OPT or MIN. This algorithm is simply
"Replace the page that will not be used for the longest time in the future."
 For example, Figure 9.14 shows that by applying OPT to the same reference
string used for the FIFO example, the minimum number of possible page faults is
9. Since 6 of the page-faults are unavoidable ( the first reference to each new
page ), FIFO can be shown to require 3 times as many ( extra ) page faults as the
optimal algorithm. ( Note: The book claims that only the first three page faults are
required by all algorithms, indicating that FIFO is only twice as bad as OPT. )
 Unfortunately OPT cannot be implemented in practice, because it requires
foretelling the future, but it makes a nice benchmark for the comparison and
evaluation of real proposed new algorithms.
 In practice most page-replacement algorithms try to approximate OPT by
predicting ( estimating ) in one fashion or another what page will not be used for
the longest period of time. The basis of FIFO is the prediction that the page that
was brought in the longest time ago is the one that will not be needed again for
the longest future time, but as we shall see, there are many other prediction
methods, all striving to match the performance of OPT.

Optimal page-replacement algorithm

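 OPT cannot be built into a real OS, but it is easy to simulate offline once the whole reference string is known. A minimal Python sketch, useful only as a benchmark; the sample string is the 20-reference string used in the figures above:

def opt_faults(refs, nframes):
    frames, faults = set(), 0
    for i, page in enumerate(refs):
        if page in frames:
            continue
        faults += 1
        if len(frames) == nframes:
            # Evict the resident page whose next use lies farthest in the future ( or never ).
            def next_use(p):
                later = [j for j in range(i + 1, len(refs)) if refs[j] == p]
                return later[0] if later else float('inf')
            frames.discard(max(frames, key=next_use))
        frames.add(page)
    return faults

refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
print(opt_faults(refs, 3))    # 9 page faults with 3 frames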
LRU Page Replacement

 The prediction behind LRU, the Least Recently Used, algorithm is that the page
that has not been used in the longest time is the one that will not be used again in
the near future. ( Note the distinction between FIFO and LRU: The former looks
at the oldest load time, and the latter looks at the oldest use time. )
 Some view LRU as analogous to OPT, except looking backwards in time instead
of forwards. ( OPT has the interesting property that for any reference string S and
its reverse R, OPT will generate the same number of page faults for S and for R. It
turns out that LRU has this same property. )
 Figure 9.15 illustrates LRU for our sample string, yielding 12 page faults, ( as
compared to 15 for FIFO and 9 for OPT. )
LRU page-replacement algorithm.

 LRU is considered a good replacement policy, and is often used. The problem is
how exactly to implement it. There are two simple approaches commonly used:
1. Counters. Every memory access increments a counter, and the current
value of this counter is stored in the page table entry for that page. Then
finding the LRU page involves simply searching the table for the page
with the smallest counter value. Note that overflowing of the counter must
be considered.
2. Stack. Another approach is to use a stack, and whenever a page is
accessed, pull that page from the middle of the stack and place it on the
top. The LRU page will always be at the bottom of the stack. Because this
requires removing objects from the middle of the stack, a doubly linked
list is the recommended data structure.
 Note that both implementations of LRU require hardware support, either for
incrementing the counter or for managing the stack, as these operations must be
performed for every memory access.
 Neither LRU nor OPT exhibits Belady's anomaly. Both belong to a class of page-
replacement algorithms called stack algorithms, which can never exhibit Belady's
anomaly. A stack algorithm is one in which the pages kept in memory for a frame
set of size N will always be a subset of the pages kept for a frame size of N + 1. In
the case of LRU, ( and particularly the stack implementation thereof ), the top N
pages of the stack will be the same for all frame set sizes of N or anything larger.
Use of a stack to record the most recent page references.

LRU-Approximation Page Replacement

 Unfortunately full implementation of LRU requires hardware support, and few systems provide the full hardware support necessary.
 However many systems offer some degree of HW support, enough to
approximate LRU fairly well. ( In the absence of ANY hardware support,
FIFO might be the best available choice. )
 In particular, many systems provide a reference bit for every entry in a
page table, which is set anytime that page is accessed. Initially all bits are
set to zero, and they can also all be cleared at any time. One bit of
precision is enough to distinguish pages that have been accessed since the
last clear from those that have not, but does not provide any finer grain of
detail.

Additional-Reference-Bits Algorithm

 Finer grain is possible by storing the most recent 8 reference bits for each
page in an 8-bit byte in the page table entry, which is interpreted as an
unsigned int.
o At periodic intervals ( clock interrupts ), the OS takes over, and
right-shifts each of the reference bytes by one bit.
o The high-order ( leftmost ) bit is then filled in with the current
value of the reference bit, and the reference bits are cleared.
o At any given time, the page with the smallest value for the
reference byte is the LRU page.
 Obviously the specific number of bits used and the frequency with which
the reference byte is updated are adjustable, and are tuned to give the
fastest performance on a given hardware platform.

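 A minimal sketch of the periodic aging step, assuming a hypothetical ref_byte table of 8-bit histories and a hardware-maintained ref_bit table ( both keyed by page number ):

def age_reference_bytes(ref_byte, ref_bit):
    # Run at each clock interrupt: shift each history right, shift in the current reference bit.
    for p in ref_byte:
        ref_byte[p] = (ref_byte[p] >> 1) | (ref_bit[p] << 7)
        ref_bit[p] = 0                       # clear the reference bit for the next interval
    # The page with the smallest history value is the LRU approximation.
    return min(ref_byte, key=ref_byte.get)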
Second-Chance Algorithm

 The second chance algorithm is essentially a FIFO, except the reference bit is used to give pages a second chance at staying in the page table.
o When a page must be replaced, the page table is scanned in a FIFO
( circular queue ) manner.
o If a page is found with its reference bit not set, then that page is
selected as the next victim.
o If, however, the next page in the FIFO does have its reference bit
set, then it is given a second chance:
 The reference bit is cleared, and the FIFO search continues.
 If some other page is found that did not have its reference
bit set, then that page will be selected as the victim, and this
page ( the one being given the second chance ) will be
allowed to stay in the page table.
 If , however, there are no other pages that do not have their
reference bit set, then this page will be selected as the
victim when the FIFO search circles back around to this
page on the second pass.
 If all reference bits in the table are set, then second chance degrades to
FIFO, but also requires a complete search of the table for every page-
replacement.
 As long as there are some pages whose reference bits are not set, then any
page referenced frequently enough gets to stay in the page table
indefinitely.
 This algorithm is also known as the clock algorithm, from the hands of the
clock moving around the circular queue.
Second-chance ( clock ) page-replacement algorithm.

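 A minimal sketch of the victim search, assuming frames is a circular list of resident pages, ref holds their reference bits, and hand is the current clock position:

def clock_select_victim(frames, ref, hand):
    while True:
        if ref[hand] == 0:                 # not referenced since the last pass: this is the victim
            return hand, (hand + 1) % len(frames)
        ref[hand] = 0                      # referenced: clear the bit ( second chance ) and move on
        hand = (hand + 1) % len(frames)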
Enhanced Second-Chance Algorithm

 The enhanced second chance algorithm looks at the reference bit and the
modify bit ( dirty bit ) as an ordered pair, and classifies pages into one of
four classes:
1. ( 0, 0 ) - Neither recently used nor modified.
2. ( 0, 1 ) - Not recently used, but modified.
3. ( 1, 0 ) - Recently used, but clean.
4. ( 1, 1 ) - Recently used and modified.
 This algorithm searches the page table in a circular fashion ( in as many as
four passes ), looking for the first page it can find in the lowest numbered
category. I.e. it first makes a pass looking for a ( 0, 0 ), and then if it can't
find one, it makes another pass looking for a ( 0, 1 ), etc.
 The main difference between this algorithm and the previous one is the
preference for replacing clean pages if possible.

Counting-Based Page Replacement

 There are several algorithms based on counting the number of references that
have been made to a given page, such as:
o Least Frequently Used, LFU: Replace the page with the lowest reference
count. A problem can occur if a page is used frequently initially and then
not used any more, as the reference count remains high. A solution to this
problem is to right-shift the counters periodically, yielding a time-
decaying average reference count.
o Most Frequently Used, MFU: Replace the page with the highest reference
count. The logic behind this idea is that pages that have already been
referenced a lot have been in the system a long time, and we are probably
done with them, whereas pages referenced only a few times have only
recently been loaded, and we still need them.
 In general counting-based algorithms are not commonly used, as their
implementation is expensive and they do not approximate OPT well.

Page-Buffering Algorithms

There are a number of page-buffering algorithms that can be used in conjunction with the afore-
mentioned algorithms, to improve overall performance and sometimes make up for inherent
weaknesses in the hardware and/or the underlying page-replacement algorithms:

 Maintain a certain minimum number of free frames at all times. When a page-
fault occurs, go ahead and allocate one of the free frames from the free list first, to
get the requesting process up and running again as quickly as possible, and then
select a victim page to write to disk and free up a frame as a second step.
 Keep a list of modified pages, and when the I/O system is otherwise idle, have it
write these pages out to disk, and then clear the modify bits, thereby increasing
the chance of finding a "clean" page for the next potential victim.
 Keep a pool of free frames, but remember what page was in it before it was made
free. Since the data in the page is not actually cleared out when the page is freed,
it can be made an active page again without having to load in any new data from
disk. This is useful when an algorithm mistakenly replaces a page that in fact is
needed again soon.

Applications and Page Replacement

 Some applications ( most notably database programs ) understand their data accessing and caching needs better than the general-purpose OS, and should therefore be given free rein to do their own memory management.
 Sometimes such programs are given a raw disk partition to work with, containing
raw data blocks and no file system structure. It is then up to the application to use
this disk partition as extended memory or for whatever other reasons it sees fit.

Allocation of Frames

We said earlier that there were two important tasks in virtual memory management: a page-
replacement strategy and a frame-allocation strategy. This section covers the second part of that
pair.

Minimum Number of Frames

 The absolute minimum number of frames that a process must be allocated is dependent on system architecture, and corresponds to the worst-case scenario of the number of pages that could be touched by a single ( machine ) instruction.
 If an instruction ( and its operands ) spans a page boundary, then multiple pages
could be needed just for the instruction fetch.
 Memory references in an instruction touch more pages, and if those memory
locations can span page boundaries, then multiple pages could be needed for
operand access also.
 The worst case involves indirect addressing, particularly where multiple levels of
indirect addressing are allowed. Left unchecked, a pointer to a pointer to a pointer
to a pointer to a . . . could theoretically touch every page in the virtual address
space in a single machine instruction, requiring every virtual page be loaded into
physical memory simultaneously. For this reason architectures place a limit ( say
16 ) on the number of levels of indirection allowed in an instruction, which is
enforced with a counter initialized to the limit and decremented with every level
of indirection in an instruction - If the counter reaches zero, then an "excessive
indirection" trap occurs. This example would still require a minimum frame
allocation of 17 per process.

Allocation Algorithms

 Equal Allocation - If there are m frames available and n processes to share them,
each process gets m / n frames, and the leftovers are kept in a free-frame buffer
pool.
 Proportional Allocation - Allocate the frames proportionally to the size of the
process, relative to the total size of all processes. So if the size of process i is S_i,
and S is the sum of all S_i, then the allocation for process P_i is a_i = m * S_i / S.
 Variations on proportional allocation could consider priority of process rather
than just their size.
 Obviously all allocations fluctuate over time as the number of available free
frames, m, fluctuates, and all are also subject to the constraints of minimum
allocation. ( If the minimum allocations cannot be met, then processes must either
be swapped out or not allowed to start until more free frames become available. )
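 A minimal sketch of both schemes ( the variable names follow the formula above; the 62-frame / 10-and-127-page figures are just example sizes ):

def equal_allocation(m, n):
    return [m // n] * n                        # leftover frames go to a free-frame pool

def proportional_allocation(m, sizes):
    S = sum(sizes)
    return [m * s_i // S for s_i in sizes]     # a_i = m * S_i / S, rounded down

print(equal_allocation(62, 2))                 # [31, 31]
print(proportional_allocation(62, [10, 127]))  # [4, 57]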
Global versus Local Allocation

 One big question is whether frame allocation ( page replacement ) occurs on a local or global level.
 With local replacement, the number of pages allocated to a process is fixed, and
page replacement occurs only amongst the pages allocated to this process.
 With global replacement, any page may be a potential victim, whether it currently
belongs to the process seeking a free frame or not.
 Local page replacement allows processes to better control their own page fault
rates, and leads to more consistent performance of a given process over different
system load levels.
 Global page replacement is overall more efficient, and is the more commonly
used approach.

Non-Uniform Memory Access

 The above arguments all assume that all memory is equivalent, or at least has
equivalent access times.
 This may not be the case in multiple-processor systems, especially where each
CPU is physically located on a separate circuit board which also holds some
portion of the overall system memory.
 In these latter systems, CPUs can access memory that is physically located on the
same board much faster than the memory on the other boards.
 The basic solution is akin to processor affinity - At the same time that we try to
schedule processes on the same CPU to minimize cache misses, we also try to
allocate memory for those processes on the same boards, to minimize access
times.
 The presence of threads complicates the picture, especially when the threads get
loaded onto different processors.
 Solaris uses an lgroup as a solution, in a hierarchical fashion based on relative
latency. For example, all processors and RAM on a single board would probably
be in the same lgroup. Memory assignments are made within the same lgroup if
possible, or to the next nearest lgroup otherwise. ( Where "nearest" is defined as
having the lowest access time. )

Thrashing

 If a process cannot maintain its minimum required number of frames, then it must be
swapped out, freeing up frames for other processes. This swapping out is handled by the intermediate ( medium-term ) level of CPU scheduling.
 But what about a process that can keep its minimum, but cannot keep all of the frames
that it is currently using on a regular basis? In this case it is forced to page out pages that
it will need again in the very near future, leading to large numbers of page faults.
 A process that is spending more time paging than executing is said to be thrashing.
Cause of Thrashing

 Early process scheduling schemes would control the level of multiprogramming allowed based on CPU utilization, adding in more processes when CPU utilization was low.
 The problem is that when memory filled up and processes started spending lots of
time waiting for their pages to page in, then CPU utilization would lower, causing
the scheduler to add in even more processes and exacerbating the problem!
Eventually the system would essentially grind to a halt.
 Local page replacement policies can prevent one thrashing process from taking
pages away from other processes, but it still tends to clog up the I/O queue,
thereby slowing down any other process that needs to do even a little bit of paging
( or any other I/O for that matter. )

Thrashing

 To prevent thrashing we must provide processes with as many frames as they really need "right now", but how do we know what that is?
 The locality model notes that processes typically access memory references in a
given locality, making lots of references to the same general area of memory
before moving periodically to a new locality, as shown in Figure 9.19 below. If
we could just keep as many frames as are involved in the current locality, then
page faulting would occur primarily on switches from one locality to another. (
E.g. when one function exits and another is called. )
Locality in a memory-reference pattern.
Working-Set Model

 The working set model is based on the concept of locality, and defines a working-set window of length delta. Whatever pages are included in the most recent delta page references are said to be in the process's working-set window, and comprise
its current working set, as illustrated in Figure 9.20:

Figure 9.20 - Working-set model.

 The selection of delta is critical to the success of the working set model - If it is
too small then it does not encompass all of the pages of the current locality, and if
it is too large, then it encompasses pages that are no longer being frequently
accessed.
 The total demand, D, is the sum of the sizes of the working sets for all processes.
If D exceeds the total number of available frames, then at least one process is
thrashing, because there are not enough frames available to satisfy its minimum
working set. If D is significantly less than the currently available frames, then
additional processes can be launched.
 The hard part of the working-set model is keeping track of what pages are in the
current working set, since every reference adds one to the set and removes one
older page. An approximation can be made using reference bits and a timer that
goes off after a set interval of memory references:
o For example, suppose that we set the timer to go off after every 5000
references ( by any process ), and we can store two additional historical
reference bits in addition to the current reference bit.
o Every time the timer goes off, the current reference bit is copied to one of
the two historical bits, and then cleared.
o If any of the three bits is set, then that page was referenced within the last
15,000 references, and is considered to be in that process's working set.
o Finer resolution can be achieved with more historical bits and a more
frequent timer, at the expense of greater overhead.

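 A minimal sketch of computing a working set exactly from a reference string ( the reference-bit approximation above trades this precision for lower overhead; the sample string is hypothetical ):

def working_set(refs, t, delta):
    # Pages touched by the last delta references, ending at time t.
    start = max(0, t - delta + 1)
    return set(refs[start:t + 1])

refs = [1, 2, 1, 5, 7, 7, 7, 5, 1, 3]
print(working_set(refs, 9, 4))    # pages {1, 3, 5, 7} for delta = 4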
Page-Fault Frequency

 A more direct approach is to recognize that what we really want to control is the
page-fault rate, and to allocate frames based on this directly measurable value. If
the page-fault rate exceeds a certain upper bound then that process needs more
frames, and if it is below a given lower bound, then it can afford to give up some
of its frames to other processes.
 ( I suppose a page-replacement strategy could be devised that would select victim
frames based on the process with the lowest current page-fault frequency. )

Page-fault frequency.

 Note that there is a direct relationship between the page-fault rate and the
working-set, as a process moves from one locality to another:
Unnumbered side bar in Ninth Edition

Virtual Memory in Windows


Windows uses demand paging with clustering, meaning they page in multiple pages whenever a
page fault occurs.

The working set minimum and maximum are normally set at 50 and 345 pages respectively. (
Maximums can be exceeded in rare circumstances. )

Free pages are maintained on a free list, with a minimum threshold indicating when there are
enough free frames available.

If a page fault occurs and the process is below their maximum, then additional pages are
allocated. Otherwise some pages from this process must be replaced, using a local page
replacement algorithm.

If the amount of free frames falls below the allowable threshold, then working set trimming
occurs, taking frames away from any processes which are above their minimum, until all are at
their minimums. Then additional frames can be allocated to processes that need them.

The algorithm for selecting victim frames depends on the type of processor:

On single processor 80x86 systems, a variation of the clock ( second chance ) algorithm is used.
On Alpha and multiprocessor systems, clearing the reference bits may require invalidating
entries in the TLB on other processors, which is an expensive operation. In this case Windows
uses a variation of FIFO.
Unit-4
File-System Interface
File Concept

File Attributes

 Different OSes keep track of different file attributes, including:


o Name - Some systems give special significance to names, and particularly
extensions ( .exe, .txt, etc. ), and some do not. Some extensions may be of
significance to the OS ( .exe ), and others only to certain applications (
.jpg )
o Identifier ( e.g. inode number )
o Type - Text, executable, other binary, etc.
o Location - on the hard drive.
o Size
o Protection
o Time & Date
o User ID

File Operations

 The file ADT supports many common operations:


o Creating a file
o Writing a file
o Reading a file
o Repositioning within a file
o Deleting a file
o Truncating a file.
 Most OSes require that files be opened before access and closed after all access is
complete. Normally the programmer must open and close files explicitly, but
some rare systems open the file automatically at first access. Information about
currently open files is stored in an open file table, containing for example:
o File pointer - records the current position in the file, for the next read or
write access.
o File-open count - How many times has the current file been opened
( simultaneously by different processes ) and not yet closed? When this
counter reaches zero the file can be removed from the table.
o Disk location of the file.
o Access rights
 Some systems provide support for file locking.

o A shared lock is for reading only.
o An exclusive lock is for writing as well as reading.
o An advisory lock is informational only, and not enforced. ( A "Keep Out"
sign, which may be ignored. )
o A mandatory lock is enforced. ( A truly locked door. )
o UNIX uses advisory locks, and Windows uses mandatory locks.

File Types

 Windows ( and some other systems ) use special file extensions to indicate
the type of each file:

Figure 11.3 - Common file types.

 Macintosh stores a creator attribute for each file, according to the program that
first created it with the create( ) system call.
 UNIX stores magic numbers at the beginning of certain files. ( Experiment with
the "file" command, especially in directories such as /bin and /dev )

File Structure

 Some files contain an internal structure, which may or may not be known to the
OS.
 For the OS to support particular file formats increases the size and complexity of
the OS.
 UNIX treats all files as sequences of bytes, with no further consideration of the
internal structure. ( With the exception of executable binary programs, which it
must know how to load and find the first executable statement, etc. )
 Macintosh files have two forks - a resource fork, and a data fork. The resource
fork contains information relating to the UI, such as icons and button images, and
can be modified independently of the data fork, which contains the code or data as
appropriate.

Internal File Structure

 Disk files are accessed in units of physical blocks, typically 512 bytes or some
power-of-two multiple thereof. ( Larger physical disks use larger block sizes, to
keep the range of block numbers within the range of a 32-bit integer. )
 Internally, files are organized in logical units ( records ), which may be as small as a
single byte, or may be a larger size corresponding to some data record or structure
size.
 The number of logical units which fit into one physical block determines its
packing, and has an impact on the amount of internal fragmentation ( wasted
space ) that occurs.
 As a general rule, half a physical block is wasted for each file, and the larger the
block sizes the more space is lost to internal fragmentation.

Access Methods

Sequential Access

 A sequential access file emulates magnetic tape operation, and generally supports
a few operations:
o read next - read a record and advance the tape to the next position.
o write next - write a record and advance the tape to the next position.
o rewind
o skip n records - May or may not be supported. N may be limited to
positive numbers, or may be limited to +/- 1.

Sequential-access file.

Direct Access

 Jump to any record and read that record. Operations supported include:
o read n - read record number n. ( Note an argument is now required. )
o write n - write record number n. ( Note an argument is now required. )
o jump to record n - could be 0 or the end of file.
o Query current record - used to return back to this record later.
o Sequential access can be easily emulated using direct access. The inverse
is complicated and inefficient.

Simulation of sequential access on a direct-access file.

Other Access Methods

 An indexed access scheme can be easily built on top of a direct access system.
Very large files may require a multi-tiered indexing scheme, i.e. indexes of
indexes.

Figure 11.6 - Example of index and relative files.

Directory Structure

Storage Structure

 A disk can be used in its entirety for a file system.


 Alternatively a physical disk can be broken up into multiple partitions, slices, or
mini-disks, each of which becomes a virtual disk and can have its own filesystem.
( or be used for raw storage, swap space, etc. )
 Or, multiple physical disks can be combined into one volume, i.e. a larger virtual
disk, with its own filesystem spanning the physical disks.

A typical file-system organization.

Directory Overview

 Directory operations to be supported include:


o Search for a file
o Create a file - add to the directory
o Delete a file - erase from the directory
o List a directory - possibly ordered in different ways.
o Rename a file - may change sorting order
o Traverse the file system.

Single-Level Directory

 Simple to implement, but each file must have a unique name.

Single-level directory.

Two-Level Directory

 Each user gets their own directory space.


 File names only need to be unique within a given user's directory.

 A master file directory is used to keep track of each user's directory, and must be
maintained when users are added to or removed from the system.
 A separate directory is generally needed for system ( executable ) files.
 Systems may or may not allow users to access other directories besides their own
o If access to other directories is allowed, then provision must be made to
specify the directory being accessed.
o If access is denied, then special consideration must be made for users to
run programs located in system directories. A search path is the list of
directories in which to search for executable programs, and can be set
uniquely for each user.

Two-level directory structure.

Tree-Structured Directories

 An obvious extension to the two-tiered directory structure, and the one with
which we are all most familiar.
 Each user / process has the concept of a current directory from which all
( relative ) searches take place.
 Files may be accessed using either absolute pathnames ( relative to the root of the
tree ) or relative pathnames ( relative to the current directory. )
 Directories are stored the same as any other file in the system, except there is a bit
that identifies them as directories, and they have some special structure that the
OS understands.
 One question for consideration is whether or not to allow the removal of
directories that are not empty - Windows requires that directories be emptied first,
and UNIX provides an option for deleting entire sub-trees.

Tree-structured directory structure.

Acyclic-Graph Directories

 When the same files need to be accessed in more than one place in the directory
structure ( e.g. because they are being shared by more than one user / process ), it
can be useful to provide an acyclic-graph structure. ( Note the directed arcs from
parent to child. )
 UNIX provides two types of links for implementing the acyclic-graph structure. (
See "man ln" for more details. )
o A hard link ( usually just called a link ) involves multiple directory entries
that both refer to the same file. Hard links are only valid for ordinary files
in the same filesystem.
o A symbolic link, that involves a special file, containing information about
where to find the linked file. Symbolic links may be used to link
directories and/or files in other filesystems, as well as ordinary files in the
current filesystem.
 Windows only supports symbolic links, termed shortcuts.

 Hard links require a reference count, or link count for each file, keeping track of
how many directory entries are currently referring to this file. Whenever one of
the references is removed the link count is reduced, and when it reaches zero, the
disk space can be reclaimed.
 For symbolic links there is some question as to what to do with the symbolic links
when the original file is moved or deleted:
o One option is to find all the symbolic links and adjust them also.
o Another is to leave the symbolic links dangling, and discover that they are
no longer valid the next time they are used.
o What if the original file is removed, and replaced with another file having
the same name before the symbolic link is next used?

Acyclic-graph directory structure.

General Graph Directory

 If cycles are allowed in the graphs, then several problems can arise:
o Search algorithms can go into infinite loops. One solution is to not follow
links in search algorithms. ( Or not to follow symbolic links, and to only
allow symbolic links to refer to directories. )
o Sub-trees can become disconnected from the rest of the tree and still not
have their reference counts reduced to zero. Periodic garbage collection is
required to detect and resolve this problem. ( chkdsk in DOS and fsck in
UNIX search for these problems, among others, even though cycles are
not supposed to be allowed in either system. Disconnected disk blocks that
are not marked as free are added back to the file systems with made-up file
names, and can usually be safely deleted. )

General graph directory.

File-System Mounting

 The basic idea behind mounting file systems is to combine multiple file systems into one
large tree structure.
 The mount command is given a filesystem to mount and a mount point ( directory ) on
which to attach it.
 Once a file system is mounted onto a mount point, any further references to that directory
actually refer to the root of the mounted file system.
 Any files ( or sub-directories ) that had been stored in the mount point directory prior to
mounting the new filesystem are now hidden by the mounted filesystem, and are no
longer available. For this reason some systems only allow mounting onto empty
directories.
 Filesystems can only be mounted by root, unless root has previously configured certain
filesystems to be mountable onto certain pre-determined mount points. ( E.g. root may
allow users to mount floppy filesystems to /mnt or something like it. ) Anyone can run
the mount command to see what filesystems are currently mounted.
 Filesystems may be mounted read-only, or have other restrictions imposed.

File system. (a) Existing system. (b) Unmounted volume.

Mount point.

 The traditional Windows OS runs an extended two-tier directory structure, where the first
tier of the structure separates volumes by drive letters, and a tree structure is implemented
below that level.
 Macintosh runs a similar system, where each new volume that is found is automatically
mounted and added to the desktop when it is found.

 More recent Windows systems allow filesystems to be mounted to any directory in the
filesystem, much like UNIX.

File Sharing

Multiple Users

 On a multi-user system, more information needs to be stored for each file:
o The owner ( user ) who owns the file, and who can control its access.
o The group of other user IDs that may have some special access to the file.
o What access rights are afforded to the owner ( User ), the Group, and to
the rest of the world ( the universe, a.k.a. Others. )
o Some systems have more complicated access control, allowing or denying
specific accesses to specifically named users or groups.

File Systems

 The advent of the Internet introduces issues for accessing files stored on remote
computers
o The original method was ftp, allowing individual files to be transported
across systems as needed. Ftp can be either account and password
controlled, or anonymous, not requiring any user name or password.
o Various forms of distributed file systems allow remote file systems to be
mounted onto a local directory structure, and accessed using normal file
access commands. ( The actual files are still transported across the
network as needed, possibly using ftp as the underlying transport
mechanism. )
o The WWW has made it easy once again to access files on remote systems
without mounting their filesystems, generally using ( anonymous ) ftp as
the underlying file transport mechanism.

The Client-Server Model

 When one computer system remotely mounts a filesystem that is physically located on another system, the system which physically owns the files acts as a server, and the system which mounts them is the client.
 User IDs and group IDs must be consistent across both systems for the
system to work properly. ( I.e. this is most applicable across multiple
computers managed by the same organization, shared by a common group
of users. )
 The same computer can be both a client and a server. ( E.g. cross-linked
file systems. )
 There are a number of security concerns involved in this model:
o Servers commonly restrict mount permission to certain trusted
systems only. Spoofing ( a computer pretending to be a different
computer ) is a potential security risk.

o Servers may restrict remote access to read-only.
o Servers restrict which filesystems may be remotely mounted.
Generally the information within those subsystems is limited,
relatively public, and protected by frequent backups.
 The NFS ( Network File System ) is a classic example of such a system.

Distributed Information Systems

 The Domain Name System, DNS, provides for a unique naming system
across all of the Internet.
 Domain names are maintained by the Network Information System, NIS,
which unfortunately has several security issues. NIS+ is a more secure
version, but has not yet gained the same widespread acceptance as NIS.
 Microsoft's Common Internet File System, CIFS, establishes a network
login for each user on a networked system with shared file access. Older
Windows systems used domains, and newer systems ( XP, 2000 ), use
active directories. User names must match across the network for this
system to be valid.
 A newer approach is the Lightweight Directory-Access Protocol, LDAP,
which provides a secure single sign-on for all users to access all resources
on a network. This is a secure system which is gaining in popularity, and
which has the maintenance advantage of combining authorization
information in one central location.

Failure Modes

 When a local disk file is unavailable, the result is generally known immediately, and is generally non-recoverable. The only reasonable response is for the request to fail.
 However when a remote file is unavailable, there are many possible
reasons, and whether or not it is unrecoverable is not readily apparent.
Hence most remote access systems allow for blocking or delayed
response, in the hopes that the remote system ( or the network ) will come
back up eventually.

Consistency Semantics

 Consistency Semantics deals with the consistency between the views of shared
files on a networked system. When one user changes the file, when do other users
see the changes?
 At first glance this appears to have all of the synchronization issues discussed in
Chapter 6. Unfortunately the long delays involved in network operations prohibit
the use of atomic operations as discussed in that chapter.

UNIX Semantics

 The UNIX file system uses the following semantics:


o Writes to an open file are immediately visible to any other user
who has the file open.
o One implementation uses a shared location pointer, which is
adjusted for all sharing users.
 The file is associated with a single exclusive physical resource, which may
delay some accesses.

Session Semantics

 The Andrew File System ( AFS ) uses the following semantics:


o Writes to an open file are not immediately visible to other users.
o When a file is closed, any changes made become available only to
users who open the file at a later time.
 According to these semantics, a file can be associated with multiple
( possibly different ) views. Almost no constraints are imposed on
scheduling accesses. No user is delayed in reading or writing their
personal copy of the file.
 AFS file systems may be accessible by systems around the world. Access
control is maintained through ( somewhat ) complicated access control
lists, which may grant access to the entire world ( literally ) or to
specifically named users accessing the files from specifically named
remote environments.

Immutable-Shared-Files Semantics

 Under this system, when a file is declared as shared by its creator, it becomes immutable and the name cannot be re-used for any other resource. Hence it becomes read-only, and shared access is simple.

Protection

 Files must be kept safe for reliability ( against accidental damage ), and protection
( against deliberate malicious access. ) The former is usually managed with backup
copies. This section discusses the latter.
 One simple protection scheme is to remove all access to a file. However this makes the
file unusable, so some sort of controlled access must be arranged.

Types of Access

 The following low-level operations are often controlled:


o Read - View the contents of the file
o Write - Change the contents of the file.

o Execute - Load the file onto the CPU and follow the instructions contained
therein.
o Append - Add to the end of an existing file.
o Delete - Remove a file from the system.
o List - View the name and other attributes of files on the system.
 Higher-level operations, such as copy, can generally be performed through
combinations of the above.

Access Control

 One approach is to have complicated Access Control Lists, ACL, which specify
exactly what access is allowed or denied for specific users or groups.
o The AFS uses this system for distributed access.
o Control is very finely adjustable, but may be complicated, particularly
when the specific users involved are unknown. ( AFS allows some wild
cards, so for example all users on a certain remote system may be trusted,
or a given username may be trusted when accessing from any remote
system. )
 UNIX uses a set of 9 access control bits, in three groups of three. These
correspond to R, W, and X permissions for each of the Owner, Group, and Others.
( See "man chmod" for full details. ) The RWX bits control the following
privileges for ordinary files and directories:

Bit | Files | Directories

R | Read ( view ) file contents. | Read directory contents. Required to get a listing of the directory.

W | Write ( change ) file contents. | Change directory contents. Required to create or delete files.

X | Execute file contents as a program. | Access detailed directory information. Required to get a long listing, or to access any specific file in the directory. Note that if a user has X but not R permissions on a directory, they can still access specific files, but only if they already know the name of the file they are trying to access.

 In addition there are some special bits that can also be applied:
o The set user ID ( SUID ) bit and/or the set group ID ( SGID ) bits applied
to executable files temporarily change the identity of whoever runs the
program to match that of the owner / group of the executable program.
This allows users running specific programs to have access to files ( while
running that program ) to which they would normally be unable to
access. Setting of these two bits is usually restricted to root, and must be
done with caution, as it introduces a potential security leak.

o The sticky bit on a directory modifies write permission, allowing users to
only delete files for which they are the owner. This allows everyone to
create files in /tmp, for example, but to only delete files which they have
created, and not anyone else's.
o The SUID, SGID, and sticky bits are indicated with an s, s, and t in the positions for execute permission for the user, group, and others, respectively. If the letter is lower case ( s, s, t ), then the corresponding execute permission is also set. If it is upper case ( S, S, T ), then the corresponding execute permission is NOT set.
o The numeric form of chmod is needed to set these advanced bits.

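 The advanced bits can also be set programmatically; a minimal Python sketch using the standard os and stat modules ( the directory path is hypothetical ):

import os, stat

path = "/tmp/shared_dir"              # hypothetical world-writable directory
os.makedirs(path, exist_ok=True)

# rwxrwxrwt: anyone may create files here, but the sticky bit limits deletion to each file's owner.
os.chmod(path, stat.S_ISVTX | 0o777)  # equivalent to the shell command: chmod 1777 /tmp/shared_dir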
Sample permissions in a UNIX system.

 Windows adjusts files access through a simple GUI:

Other Protection Approaches and Issues

 Some systems can apply passwords, either to individual files, or to specific sub-
directories, or to the entire system. There is a trade-off between the number of
passwords that must be maintained ( and remembered by the users ) and the
amount of information that is vulnerable to a lost or forgotten password.
 Older systems which did not originally have multi-user file access permissions (
DOS and older versions of Mac ) must now be retrofitted if they are to share files
on a network.

 Access to a file requires access to all the files along its path as well. In a cyclic
directory structure, users may have different access to the same file accessed
through different paths.
 Sometimes just the knowledge of the existence of a file of a certain name is a
security ( or privacy ) concern. Hence the distinction between the R and X bits on
UNIX directories.

File-System Implementation
File-System Implementation

Overview

 File systems store several important data structures on the disk:


o A boot-control block, ( per volume ) a.k.a. the boot block in UNIX or the
partition boot sector in Windows contains information about how to boot
the system off of this disk. This will generally be the first sector of the
volume if there is a bootable system loaded on that volume, or the block
will be left vacant otherwise.
o A volume control block ( per volume ), a.k.a. the superblock in UNIX or the master file table in Windows ( NTFS ), which contains information such as the number of blocks in the volume, the block size, and pointers to free blocks and free FCB blocks.
o A directory structure ( per file system ), containing file names and pointers
to corresponding FCBs. UNIX uses inode numbers, and NTFS uses a
master file table.
o The File Control Block, FCB, ( per file ) containing details about
ownership, size, permissions, dates, etc. UNIX stores this information in
inodes, and NTFS in the master file table as a relational database structure.

Figure 12.2 - A typical file-control block.

 There are also several key data structures stored in memory:


o An in-memory mount table.
o An in-memory directory cache of recently accessed directory information.
o A system-wide open file table, containing a copy of the FCB for every
currently open file in the system, as well as some other related
information.
o A per-process open file table, containing a pointer to the system open file
table as well as some other information. ( For example the current file
position pointer may be either here or in the system file table, depending
on the implementation and whether the file is being shared or not. )
 Figure 12.3 illustrates some of the interactions of file system components when
files are created and/or used:
o When a new file is created, a new FCB is allocated and filled out with
important information regarding the new file. The appropriate directory is
modified with the new file name and FCB information.
o When a file is accessed during a program, the open( ) system call reads in
the FCB information from disk, and stores it in the system-wide open file
table. An entry is added to the per-process open file table referencing the
system-wide table, and an index into the per-process table is returned by
the open( ) system call. UNIX refers to this index as a file descriptor, and
Windows refers to it as a file handle.
o If another process already has a file open when a new request comes in for
the same file, and it is sharable, then a counter in the system-wide table is
incremented and the per-process table is adjusted to point to the existing
entry in the system-wide table.
o When a file is closed, the per-process table entry is freed, and the counter
in the system-wide table is decremented. If that counter reaches zero, then
the system wide table is also freed. Any data currently stored in memory
cache for this file is written out to disk if necessary.

In-memory file-system structures. (a) File open. (b) File read.

Partitions and Mounting

 Physical disks are commonly divided into smaller units called partitions. They can
also be combined into larger units, but that is most commonly done for RAID
installations and is left for later chapters.
 Partitions can either be used as raw devices ( with no structure imposed upon
them ), or they can be formatted to hold a filesystem ( i.e. populated with FCBs
and initial directory structures as appropriate. ) Raw partitions are generally used
for swap space, and may also be used for certain programs such as databases that
choose to manage their own disk storage system. Partitions containing filesystems

can generally only be accessed using the file system structure by ordinary users,
but can often be accessed as a raw device also by root.
 The boot block is accessed as part of a raw partition, by the boot program prior to
any operating system being loaded. Modern boot programs understand multiple
OSes and filesystem formats, and can give the user a choice of which of several
available systems to boot.
 The root partition contains the OS kernel and at least the key portions of the OS
needed to complete the boot process. At boot time the root partition is mounted,
and control is transferred from the boot program to the kernel found there. ( Older
systems required that the root partition lie completely within the first 1024
cylinders of the disk, because that was as far as the boot program could reach.
Once the kernel had control, then it could access partitions beyond the 1024
cylinder boundary. )
 Continuing with the boot process, additional filesystems get mounted, adding
their information into the appropriate mount table structure. As a part of the
mounting process the file systems may be checked for errors or inconsistencies,
either because they are flagged as not having been closed properly the last time
they were used, or just on general principles. Filesystems may be mounted either
automatically or manually. In UNIX a mount point is indicated by setting a flag in
the in-memory copy of the inode, so all future references to that inode get re-
directed to the root directory of the mounted filesystem.

Virtual File Systems

 The Virtual File System, VFS, provides a common interface to multiple different filesystem types. In addition, it provides for a unique identifier ( vnode ) for files
across the entire space, including across all filesystems of different types. ( UNIX
inodes are unique only across a single filesystem, and certainly do not carry
across networked file systems. )
 The VFS in Linux is based upon four key object types:
o The inode object, representing an individual file
o The file object, representing an open file.
o The superblock object, representing a filesystem.
o The dentry object, representing a directory entry.
 Linux VFS provides a set of common functionalities for each filesystem, using
function pointers accessed through a table. The same functionality is accessed
through the same table position for all filesystem types, though the actual
functions pointed to by the pointers may be filesystem-specific. See
/usr/include/linux/fs.h for full details. Common operations provided include open(
), read( ), write( ), and mmap( ).

Figure 12.4 - Schematic view of a virtual file system.

Directory Implementation

 Directories need to be fast to search, insert, and delete, with a minimum of wasted disk
space.

Linear List

 A linear list is the simplest and easiest directory structure to set up, but it does
have some drawbacks.
 Finding a file ( or verifying one does not already exist upon creation ) requires a
linear search.
 Deletions can be done by moving all entries, flagging an entry as deleted, or by
moving the last entry into the newly vacant position.

 Sorting the list makes searches faster, at the expense of more complex insertions
and deletions.
 A linked list makes insertions and deletions into a sorted list easier, with overhead
for the links.
 More complex data structures, such as B-trees, could also be considered.

Hash Table

 A hash table can also be used to speed up searches.


 Hash tables are generally implemented in addition to a linear or other structure

Allocation Methods

 There are three major methods of storing files on disks: contiguous, linked, and indexed.

Contiguous Allocation

 Contiguous Allocation requires that all blocks of a file be kept together contiguously.
 Performance is very fast, because reading successive blocks of the same file
generally requires no movement of the disk heads, or at most one small step to the
next adjacent cylinder.
 Storage allocation involves the same issues discussed earlier for the allocation of
contiguous blocks of memory ( first fit, best fit, fragmentation problems, etc. )
The distinction is that the high time penalty required for moving the disk heads
from spot to spot may now justify the benefits of keeping files contiguously when
possible.
 ( Even file systems that do not by default store files contiguously can benefit from
certain utilities that compact the disk and make all files contiguous in the process.
)
 Problems can arise when files grow, or if the exact size of a file is unknown at
creation time:
o Over-estimation of the file's final size increases external fragmentation
and wastes disk space.
o Under-estimation may require that a file be moved or a process aborted if
the file grows beyond its originally allocated space.
o If a file grows slowly over a long time period and the total final space
must be allocated initially, then a lot of space becomes unusable before the
file fills the space.
 A variation is to allocate file space in large contiguous chunks, called extents.
When a file outgrows its original extent, then an additional one is allocated. ( For
example an extent may be the size of a complete track or even cylinder, aligned
on an appropriate track or cylinder boundary. ) The high-performance file system
Veritas uses extents to optimize performance.

Figure 12.5 - Contiguous allocation of disk space.

Linked Allocation

 Disk files can be stored as linked lists, with the expense of the storage space
consumed by each link. ( E.g. a block may be 508 bytes instead of 512. )
 Linked allocation involves no external fragmentation, does not require pre-known
file sizes, and allows files to grow dynamically at any time.
 Unfortunately linked allocation is only efficient for sequential access files, as
random access requires starting at the beginning of the list for each new location
access.
 Allocating clusters of blocks reduces the space wasted by pointers, at the cost of
internal fragmentation.
 Another big problem with linked allocation is reliability if a pointer is lost or
damaged. Doubly linked lists provide some protection, at the cost of additional
overhead and wasted space.

Figure 12.6 - Linked allocation of disk space.

 The File Allocation Table, FAT, used by DOS is a variation of linked allocation,
where all the links are stored in a separate table at the beginning of the disk. The
benefit of this approach is that the FAT table can be cached in memory, greatly
improving random access speeds.

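 A minimal sketch of following a FAT chain, assuming the table simply maps each block number to the next block of the file, with -1 marking end-of-file ( the block numbers are hypothetical ):

def file_blocks(fat, start, EOF=-1):
    blocks, block = [], start
    while block != EOF:
        blocks.append(block)
        block = fat[block]        # the whole table can be cached, so each hop is a memory lookup
    return blocks

fat = {217: 618, 618: 339, 339: -1}
print(file_blocks(fat, 217))      # [217, 618, 339]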
Figure 12.7 File-allocation table.

Indexed Allocation

 Indexed Allocation combines all of the indexes for accessing each file into a
common block ( for that file ), as opposed to spreading them all over the disk or
storing them in a FAT table.

Figure 12.8 - Indexed allocation of disk space.

 Some disk space is wasted ( relative to linked lists or FAT tables ) because an
entire index block must be allocated for each file, regardless of how many data
blocks the file contains. This leads to questions of how big the index block should
be, and how it should be implemented. There are several approaches:
o Linked Scheme - An index block is one disk block, which can be read
and written in a single disk operation. The first index block contains some
header information, the first N block addresses, and if necessary a pointer
to additional linked index blocks.
o Multi-Level Index - The first index block contains a set of pointers to
secondary index blocks, which in turn contain pointers to the actual data
blocks.
o Combined Scheme - This is the scheme used in UNIX inodes, in which
the first 12 or so data block pointers are stored directly in the inode, and
then singly, doubly, and triply indirect pointers provide access to more
data blocks as needed. ( See below. ) The advantage of this scheme is that
for small files ( which many are ), the data blocks are readily accessible (
up to 48K with 4K block sizes ); files up to about 4144K ( using 4K
blocks ) are accessible with only a single indirect block ( which can be
cached ), and huge files are still accessible using a relatively small number
of disk accesses ( larger in theory than can be addressed by a 32-bit
address, which is why some systems have moved to 64-bit file pointers. )

Figure 12.9 - The UNIX inode.
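The file-size limits quoted above can be reproduced with a short calculation. The sketch below assumes 4K blocks, 4-byte block pointers, and 12 direct pointers, matching the figures used in the text; real inode layouts vary.

```c
#include <stdio.h>

int main(void)
{
    /* Assumed parameters: 4 KB blocks, 4-byte pointers, 12 direct pointers. */
    const long long block  = 4096;
    const long long ptrs   = block / 4;                          /* 1024 pointers per index block */
    const long long direct = 12 * block;                         /*   48 KB reachable directly    */
    const long long single = direct + ptrs * block;              /* 4144 KB with single indirect  */
    const long long dbl    = single + ptrs * ptrs * block;       /*  ~4 GB with double indirect   */
    const long long triple = dbl + ptrs * ptrs * ptrs * block;   /*  ~4 TB with triple indirect   */

    printf("direct only       : %lld KB\n", direct / 1024);
    printf("+ single indirect : %lld KB\n", single / 1024);
    printf("+ double indirect : %lld MB\n", dbl    / (1024 * 1024));
    printf("+ triple indirect : %lld GB\n", triple / (1024 * 1024 * 1024));
    return 0;
}
```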

Performance

 The optimal allocation method is different for sequential access files than for
random access files, and is also different for small files than for large files.
 Some systems support more than one allocation method, which may require
specifying how the file is to be used ( sequential or random access ) at the time it
is allocated. Such systems also provide conversion utilities.
 Some systems have been known to use contiguous access for small files, and
automatically switch to an indexed scheme when file sizes surpass a certain
threshold.
 And of course some systems adjust their allocation schemes ( e.g. block sizes ) to
best match the characteristics of the hardware for optimum performance.

Free-Space Management

 Another important aspect of disk management is keeping track of and allocating
free space.

Bit Vector

 One simple approach is to use a bit vector, in which each bit represents a disk
block, set to 1 if free or 0 if allocated.
 Fast algorithms exist for quickly finding contiguous blocks of a given size
 The down side is that the bitmap itself consumes space - a 40GB disk with 1KB
blocks, for example, requires over 5MB just to store the bitmap.
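A minimal sketch of the bit-vector approach is shown below. The disk size, the word-at-a-time scan, and the convention that a 1 bit means "free" follow the description above, but the details are illustrative rather than any particular file system's implementation.

```c
#include <stdint.h>
#include <stdio.h>

#define NBLOCKS  (1 << 20)                       /* assumed disk size in blocks */
static uint32_t freemap[NBLOCKS / 32];           /* bit i set => block i free   */

/* Return the number of the first free block, or -1 if the disk is full.
 * Whole words equal to 0 ( all allocated ) are skipped quickly.          */
long first_free_block(void)
{
    for (long w = 0; w < NBLOCKS / 32; w++) {
        if (freemap[w] == 0)
            continue;                            /* no free block in this word */
        for (int b = 0; b < 32; b++)
            if (freemap[w] & (1u << b))
                return w * 32L + b;
    }
    return -1;
}

void mark_allocated(long blk) { freemap[blk / 32] &= ~(1u << (blk % 32)); }
void mark_free(long blk)      { freemap[blk / 32] |=  (1u << (blk % 32)); }

int main(void)
{
    mark_free(42);
    mark_free(17);
    printf("first free block: %ld\n", first_free_block());   /* prints 17 */
    return 0;
}
```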

Linked List

 A linked list can also be used to keep track of all free blocks.
 Traversing the list and/or finding a contiguous block of a given size are not easy,
but fortunately are not frequently needed operations. Generally the system just
adds and removes single blocks from the beginning of the list.
 The FAT table keeps track of the free list as just one more linked list on the table.

Linked free-space list on disk.

Grouping

 A variation on linked list free lists is to use links of blocks of indices of free
blocks. If a block holds up to N addresses, then the first block in the linked-list
contains up to N-1 addresses of free blocks and a pointer to the next block of free
addresses.

Counting

 When there are multiple contiguous blocks of free space then the system can keep
track of the starting address of the group and the number of contiguous free
blocks. As long as the average length of a contiguous group of free blocks is
greater than two this offers a savings in space needed for the free list. ( Similar to
compression techniques used for graphics images when a group of pixels all the
same color is encountered. )
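A minimal sketch of the counting representation follows: each entry records a starting block and a run length, so a handful of entries can describe a large amount of free space. The run values below are made up for illustration.

```c
#include <stdio.h>

/* Free space kept as ( first free block, run length ) pairs instead of
 * one entry per free block.                                             */
struct run { long start; long count; };

struct run free_runs[] = { { 10, 4 }, { 27, 1 }, { 100, 250 } };   /* made-up runs */

long total_free_blocks(const struct run *runs, int n)
{
    long total = 0;
    for (int i = 0; i < n; i++)
        total += runs[i].count;                  /* each run covers 'count' blocks */
    return total;
}

int main(void)
{
    int n = sizeof free_runs / sizeof free_runs[0];
    printf("%d runs describe %ld free blocks\n", n, total_free_blocks(free_runs, n));
    return 0;
}
```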

Space Maps

 Sun's ZFS file system was designed for HUGE numbers and sizes of files,
directories, and even file systems.
 The resulting data structures could be VERY inefficient if not implemented
carefully. For example, freeing up a 1 GB file on a 1 TB file system could involve
updating thousands of blocks of free list bit maps if the file was spread across the
disk.
 ZFS uses a combination of techniques, starting with dividing the disk up into (
hundreds of ) metaslabs of a manageable size, each having their own space map.
 Free blocks are managed using the counting technique, but rather than write the
information to a table, it is recorded in a log-structured transaction record.
Adjacent free blocks are also coalesced into a larger single free block.
 An in-memory space map is constructed using a balanced tree data structure,
constructed from the log data.
 The combination of the in-memory tree and the on-disk log provide for very fast
and efficient management of these very large files and free blocks.

Efficiency and Performance

Efficiency

 UNIX pre-allocates inodes, which occupies space even before any files are
created.
 UNIX also distributes inodes across the disk, and tries to store data files near their
inode, to reduce the distance of disk seeks between the inodes and the data.
 Some systems use variable size clusters depending on the file size.
 The more data that is stored in a directory ( e.g. last access time ), the more often
the directory blocks have to be re-written.
 As technology advances, addressing schemes have had to grow as well.

o Sun's ZFS file system uses 128-bit pointers, which should theoretically
never need to be expanded. ( The mass required to store 2^128 bytes with
atomic storage would be at least 272 trillion kilograms! )
 Kernel table sizes used to be fixed, and could only be changed by rebuilding the
kernels. Modern tables are dynamically allocated, but that requires more
complicated algorithms for accessing them.

Performance

 Disk controllers generally include on-board caching. When a seek is requested,
the heads are moved into place, and then an entire track is read, starting from
whatever sector is currently under the heads ( reducing latency. ) The requested
sector is returned and the unrequested portion of the track is cached in the disk's
electronics.
 Some OSes cache disk blocks they expect to need again in a buffer cache.
 A page cache connected to the virtual memory system is actually more efficient
as memory addresses do not need to be converted to disk block addresses and
back again.
 Some systems ( Solaris, Linux, Windows 2000, NT, XP ) use page caching for
both process pages and file data in a unified virtual memory.
 Figures 11.11 and 11.12 show the advantages of the unified buffer cache found
in some versions of UNIX and Linux - Data does not need to be stored twice, and
problems of inconsistent buffer information are avoided.

I/O without a unified buffer cache.

I/O using a unified buffer cache.

 Page replacement strategies can be complicated with a unified cache, as one needs
to decide whether to replace process or file pages, and how many pages to
guarantee to each category of pages. Solaris, for example, has gone through many
variations, resulting in priority paging giving process pages priority over file I/O
pages, and setting limits so that neither can knock the other completely out of
memory.
 Another issue affecting performance is the question of whether to implement
synchronous writes or asynchronous writes. Synchronous writes occur in the
order in which the disk subsystem receives them, without caching; Asynchronous
writes are cached, allowing the disk subsystem to schedule writes in a more
efficient order ( See Chapter 12. ) Metadata writes are often done synchronously.
Some systems support flags to the open call requiring that writes be synchronous,
for example for the benefit of database systems that require their writes be
performed in a required order.
 The type of file access can also have an impact on optimal page replacement
policies. For example, LRU is not necessarily a good policy for sequential access
files. For these types of files progression normally goes in a forward direction
only, and the most recently used page will not be needed again until after the file
has been rewound and re-read from the beginning, ( if it is ever needed at all. ) On
the other hand, we can expect to need the next page in the file fairly soon. For this
reason sequential access files often take advantage of two special policies:
o Free-behind frees up a page as soon as the next page in the file is
requested, with the assumption that we are now done with the old page
and won't need it again for a long time.
o Read-ahead reads the requested page and several subsequent pages at the
same time, with the assumption that those pages will be needed in the near
future. This is similar to the track caching that is already performed by the

disk controller, except it saves the future latency of transferring data from
the disk controller memory into motherboard main memory.
 The caching system and asynchronous writes speed up disk writes considerably,
because the disk subsystem can schedule physical writes to the disk to minimize
head movement and disk seek times. ( See Chapter 12. ) Reads, on the other hand,
must be done more synchronously in spite of the caching system, with the result
that disk writes can counter-intuitively be much faster on average than disk reads.

Mass-Storage Structure
Overview of Mass-Storage Structure

Magnetic Disks

 Traditional magnetic disks have the following basic structure:


o One or more platters in the form of disks covered with magnetic media.
Hard disk platters are made of rigid metal, while "floppy" disks are made
of more flexible plastic.
o Each platter has two working surfaces. Older hard disk drives would
sometimes not use the very top or bottom surface of a stack of platters, as
these surfaces were more susceptible to potential damage.
o Each working surface is divided into a number of concentric rings called
tracks. The collection of all tracks that are the same distance from the
edge of the platter, ( i.e. all tracks immediately above one another in the
following diagram ) is called a cylinder.
o Each track is further divided into sectors, traditionally containing 512
bytes of data each, although some modern disks occasionally use larger
sector sizes. ( Sectors also include a header and a trailer, including
checksum information among other things. Larger sector sizes reduce the
fraction of the disk consumed by headers and trailers, but increase internal
fragmentation and the amount of disk that must be marked bad in the case
of errors. )
o The data on a hard drive is read by read-write heads. The standard
configuration ( shown below ) uses one head per surface, each on a
separate arm, and controlled by a common arm assembly which moves all
heads simultaneously from one cylinder to another. ( Other configurations,
including independent read-write heads, may speed up disk access, but
involve serious technical difficulties. )
o The storage capacity of a traditional disk drive is equal to the number of
heads ( i.e. the number of working surfaces ), times the number of tracks
per surface, times the number of sectors per track, times the number of

bytes per sector. A particular physical block of data is specified by
providing the head-sector-cylinder number at which it is located.

Moving-head disk mechanism.

 In operation the disk rotates at high speed, such as 7200 rpm ( 120 revolutions per
second. ) The rate at which data can be transferred from the disk to the computer
is composed of several steps:
o The positioning time, a.k.a. the seek time or random access time is the
time required to move the heads from one cylinder to another, and for the
heads to settle down after the move. This is typically the slowest step in
the process and the predominant bottleneck to overall transfer rates.
o The rotational latency is the amount of time required for the desired
sector to rotate around and come under the read-write head. This can range
anywhere from zero to one full revolution, and on the average will equal
one-half revolution. This is another physical step and is usually the second
slowest step behind seek time. ( For a disk rotating at 7200 rpm, the
average rotational latency would be 1/2 revolution / 120 revolutions per
second, or just over 4 milliseconds, a long time by computer standards. )
o The transfer rate, which is the time required to move the data
electronically from the disk to the computer. ( Some authors may also use

the term transfer rate to refer to the overall transfer rate, including seek
time and rotational latency as well as the electronic data transfer rate. )
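The arithmetic above can be checked with a few lines of code. The seek time and transfer rate below are assumed, typical-looking values; only the rotational figures follow directly from 7200 rpm.

```c
#include <stdio.h>

int main(void)
{
    /* Assumed drive parameters, for illustration only. */
    double rpm         = 7200.0;
    double avg_seek_ms = 9.0;                    /* assumed average seek time     */
    double xfer_mb_s   = 150.0;                  /* assumed electronic data rate  */
    double request_kb  = 4.0;                    /* one 4 KB read                 */

    double rev_ms     = 60000.0 / rpm;           /* one revolution: about 8.33 ms */
    double avg_rot_ms = rev_ms / 2.0;            /* half a revolution: ~4.17 ms   */
    double xfer_ms    = request_kb / 1024.0 / xfer_mb_s * 1000.0;

    printf("average rotational latency: %.2f ms\n", avg_rot_ms);
    printf("service time for one 4 KB read: %.2f ms\n",
           avg_seek_ms + avg_rot_ms + xfer_ms);
    return 0;
}
```

The mechanical steps ( seek plus rotation ) dominate; the electronic transfer is a rounding error by comparison.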
 Disk heads "fly" over the surface on a very thin cushion of air. If they should
accidentally contact the disk, then a head crash occurs, which may or may not
permanently damage the disk or even destroy it completely. For this reason it is
normal to park the disk heads when turning a computer off, which means to move
the heads off the disk or to an area of the disk where there is no data stored.
 Floppy disks are normally removable. Hard drives can also be removable, and
some are even hot-swappable, meaning they can be removed while the computer
is running, and a new hard drive inserted in their place.
 Disk drives are connected to the computer via a cable known as the I/O Bus.
Some of the common interface formats include Enhanced Integrated Drive
Electronics, EIDE; Advanced Technology Attachment, ATA; Serial ATA, SATA;
Universal Serial Bus, USB; Fiber Channel, FC; and Small Computer Systems
Interface, SCSI.
 The host controller is at the computer end of the I/O bus, and the disk controller
is built into the disk itself. The CPU issues commands to the host controller via
I/O ports. Data is transferred between the magnetic surface and onboard cache by
the disk controller, and then the data is transferred from that cache to the host
controller and the motherboard memory at electronic speeds.

Magnetic Tapes

 Magnetic tapes were once used for common secondary storage before the days of
hard disk drives, but today are used primarily for backups.
 Accessing a particular spot on a magnetic tape can be slow, but once reading or
writing commences, access speeds are comparable to disk drives.
 Capacities of tape drives can range from 20 to 200 GB, and compression can
double that capacity.

Disk Structure

 The traditional head-sector-cylinder, HSC numbers are mapped to linear block addresses
by numbering the first sector on the first head on the outermost track as sector 0.
Numbering proceeds with the rest of the sectors on that same track, and then the rest of
the tracks on the same cylinder before proceeding through the rest of the cylinders to the
center of the disk. In modern practice these linear block addresses are used in place of the
HSC numbers for a variety of reasons:
1. The linear length of tracks near the outer edge of the disk is much longer than for
those tracks located near the center, and therefore it is possible to squeeze many
more sectors onto outer tracks than onto inner ones.
2. All disks have some bad sectors, and therefore disks maintain a few spare sectors
that can be used in place of the bad ones. The mapping of spare sectors to bad
sectors is managed internally by the disk controller.
3. Modern hard drives can have thousands of cylinders, and hundreds of sectors per
track on their outermost tracks. These numbers exceed the range of HSC numbers

for many ( older ) operating systems, and therefore disks can be configured for
any convenient combination of HSC values that falls within the total number of
sectors physically on the drive.
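For a hypothetical drive with fixed geometry, the conventional mapping from head-sector-cylinder numbers to a linear block address is a simple formula, sketched below. The head and sector counts are assumptions, and real zoned drives do not have a fixed number of sectors per track.

```c
#include <stdio.h>

/* Assumed, simplified geometry; real drives use zoned recording, so this is
 * only the logical convention exposed to ( older ) software.               */
#define HEADS              16
#define SECTORS_PER_TRACK  63

/* Classical mapping: number all sectors on a track, then all tracks ( heads )
 * in the same cylinder, then move inward to the next cylinder.              */
long chs_to_lba(long cylinder, long head, long sector /* 1-based */)
{
    return (cylinder * HEADS + head) * SECTORS_PER_TRACK + (sector - 1);
}

int main(void)
{
    printf("C=0,H=0,S=1  -> LBA %ld\n", chs_to_lba(0, 0, 1));   /* 0 */
    printf("C=2,H=3,S=10 -> LBA %ld\n", chs_to_lba(2, 3, 10));
    return 0;
}
```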
 There is a limit to how closely packed individual bits can be placed on a physical media,
but that limit is growing increasingly more packed as technological advances are made.
 Modern disks pack many more sectors into outer cylinders than inner ones, using one of
two approaches:
o With Constant Linear Velocity, CLV, the density of bits is uniform from cylinder
to cylinder. Because there are more sectors in outer cylinders, the disk spins
slower when reading those cylinders, causing the rate of bits passing under the
read-write head to remain constant. This is the approach used by modern CDs and
DVDs.
o With Constant Angular Velocity, CAV, the disk rotates at a constant angular
speed, with the bit density decreasing on outer cylinders. ( These disks would
have a constant number of sectors per track on all cylinders. )

Disk Attachment

Disk drives can be attached either directly to a particular host ( a local disk ) or to a network.

Host-Attached Storage

 Local disks are accessed through I/O Ports as described earlier.


 The most common interfaces are IDE or ATA, each of which allows up to two
drives per host controller.
 SATA is similar with simpler cabling.
 High end workstations or other systems in need of larger number of disks
typically use SCSI disks:
o The SCSI standard supports up to 16 targets on each SCSI bus, one of
which is generally the host adapter and the other 15 of which can be disk
or tape drives.
o A SCSI target is usually a single drive, but the standard also supports up to
8 units within each target. These would generally be used for accessing
individual disks within a RAID array. ( See below. )
o The SCSI standard also supports multiple host adapters in a single
computer, i.e. multiple SCSI busses.
o Modern advancements in SCSI include "fast" and "wide" versions, as well
as SCSI-2.
o SCSI cables may be either 50 or 68 conductors. SCSI devices may be
external as well as internal.
o See wikipedia for more information on the SCSI interface.
 FC is a high-speed serial architecture that can operate over optical fiber or four-
conductor copper wires, and has two variants:
o A large switched fabric having a 24-bit address space. This variant allows
for multiple devices and multiple hosts to interconnect, forming the basis
for the storage-area networks, SANs, to be discussed in a future section.

o The arbitrated loop, FC-AL, that can address up to 126 devices ( drives
and controllers. )

Network-Attached Storage

 Network attached storage connects storage devices to computers using a remote
procedure call, RPC, interface, typically with something like NFS filesystem
mounts. This is convenient for allowing several computers in a group common
access and naming conventions for shared storage.
 NAS can be implemented using SCSI cabling, or with iSCSI, which uses Internet
protocols and standard network connections, allowing long-distance remote access
to shared files.
 NAS allows computers to easily share data storage, but tends to be less efficient
than standard host-attached storage.

Figure 10.2 - Network-attached storage.

Storage-Area Network

 A Storage-Area Network, SAN, connects computers and storage devices in a
network, using storage protocols instead of network protocols.
 One advantage of this is that storage access does not tie up regular networking
bandwidth.
 SAN is very flexible and dynamic, allowing hosts and devices to attach and
detach on the fly.
 SAN is also controllable, allowing restricted access to certain hosts and devices.

Figure 10.3 - Storage-area network.

Disk Scheduling

 As mentioned earlier, disk transfer speeds are limited primarily by seek times and
rotational latency. When multiple requests are to be processed there is also some
inherent delay in waiting for other requests to be processed.
 Bandwidth is measured by the amount of data transferred divided by the total amount of
time from the first request being made to the last transfer being completed, ( for a series
of disk requests. )
 Both bandwidth and access time can be improved by processing requests in a good order.
 Disk requests include the disk address, memory address, number of sectors to transfer,
and whether the request is for reading or writing.

FCFS Scheduling

 First-Come First-Serve is simple and intrinsically fair, but not very efficient.
Consider in the following sequence the wild swing from cylinder 122 to 14 and
then back to 124:

FCFS disk scheduling.

SSTF Scheduling

 Shortest Seek Time First scheduling is more efficient, but may lead to starvation
if a constant stream of requests arrives for the same general area of the disk.
 SSTF reduces the total head movement to 236 cylinders, down from 640 required
for the same set of requests under FCFS. Note, however that the distance could be
reduced still further to 208 by starting with 37 and then 14 first before processing
the rest of the requests.

SSTF disk scheduling.
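A small sketch comparing FCFS and SSTF is shown below. It assumes the standard textbook request queue behind these figures ( head at cylinder 53, queue 98, 183, 37, 122, 14, 124, 65, 67 ), and reproduces the 640- and 236-cylinder totals quoted above.

```c
#include <stdio.h>
#include <stdlib.h>

#define NREQ 8

/* Total head movement when requests are served in queue order ( FCFS ). */
int fcfs_movement(int head, const int *q, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += abs(q[i] - head);
        head = q[i];
    }
    return total;
}

/* Total head movement when the closest pending request is always served
 * next ( SSTF ).  The queue is copied so the caller's array is untouched. */
int sstf_movement(int head, const int *q, int n)
{
    int pending[NREQ], total = 0;
    for (int i = 0; i < n; i++) pending[i] = q[i];

    for (int served = 0; served < n; served++) {
        int best = 0;
        for (int i = 1; i < n; i++)
            if (pending[i] >= 0 &&
                (pending[best] < 0 ||
                 abs(pending[i] - head) < abs(pending[best] - head)))
                best = i;
        total += abs(pending[best] - head);
        head = pending[best];
        pending[best] = -1;                      /* mark as served */
    }
    return total;
}

int main(void)
{
    int queue[NREQ] = { 98, 183, 37, 122, 14, 124, 65, 67 };
    printf("FCFS: %d cylinders\n", fcfs_movement(53, queue, NREQ));  /* 640 */
    printf("SSTF: %d cylinders\n", sstf_movement(53, queue, NREQ));  /* 236 */
    return 0;
}
```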

SCAN Scheduling

 The SCAN algorithm, a.k.a. the elevator algorithm moves back and forth from
one end of the disk to the other, similarly to an elevator processing requests in a
tall building.
SCAN disk scheduling.

 Under the SCAN algorithm, If a request arrives just ahead of the moving head
then it will be processed right away, but if it arrives just after the head has passed,
then it will have to wait for the head to pass going the other way on the return trip.
This leads to a fairly wide variation in access times which can be improved upon.
 Consider, for example, when the head reaches the high end of the disk: Requests
with high cylinder numbers just missed the passing head, which means they are
all fairly recent requests, whereas requests with low numbers may have been
waiting for a much longer time. Making the return scan from high to low then
ends up accessing recent requests first and making older requests wait that much
longer.

C-SCAN Scheduling

 The Circular-SCAN algorithm improves upon SCAN by treating all requests in a
circular queue fashion - Once the head reaches the end of the disk, it returns to the
other end without processing any requests, and then starts again from the
beginning of the disk:

C-SCAN disk scheduling.

LOOK Scheduling

 LOOK scheduling improves upon SCAN by looking ahead at the queue of
pending requests, and not moving the heads any farther towards the end of the
disk than is necessary. The following diagram illustrates the circular form of
LOOK:

C-LOOK disk scheduling.

Selection of a Disk-Scheduling Algorithm

 With very low loads all algorithms are equal, since there will normally only be
one request to process at a time.
 For slightly larger loads, SSTF offers better performance than FCFS, but may lead
to starvation when loads become heavy enough.
 For busier systems, SCAN and LOOK algorithms eliminate starvation problems.
 The actual optimal algorithm may be something even more complex than those
discussed here, but the incremental improvements are generally not worth the
additional overhead.
 Some improvement to overall filesystem access times can be made by intelligent
placement of directory and/or inode information. If those structures are placed in
the middle of the disk instead of at the beginning of the disk, then the maximum
distance from those structures to data blocks is reduced to only one-half of the
disk size. If those structures can be further distributed and furthermore have their
data blocks stored as close as possible to the corresponding directory structures,
then that reduces still further the overall time to find the disk block numbers and
then access the corresponding data blocks.
 On modern disks the rotational latency can be almost as significant as the seek
time; however, it is not within the OS's control to account for that, because
modern disks do not reveal their internal sector mapping schemes, ( particularly
when bad blocks have been remapped to spare sectors. )
o Some disk manufacturers provide for disk scheduling algorithms directly
on their disk controllers, ( which do know the actual geometry of the disk
as well as any remapping ), so that if a series of requests are sent from the
computer to the controller then those requests can be processed in an
optimal order.
o Unfortunately there are some considerations that the OS must take into
account that are beyond the abilities of the on-board disk-scheduling
algorithms, such as priorities of some requests over others, or the need to
process certain requests in a particular order. For this reason OSes may
elect to spoon-feed requests to the disk controller one at a time in certain
situations.

Disk Management

Disk Formatting

 Before a disk can be used, it has to be low-level formatted, which means laying
down all of the headers and trailers marking the beginning and ends of each
sector. Included in the header and trailer are the linear sector numbers, and error-
correcting codes, ECC, which allow damaged sectors to not only be detected, but
in many cases for the damaged data to be recovered ( depending on the extent of
the damage. ) Sector sizes are traditionally 512 bytes, but may be larger,
particularly in larger drives.

 ECC calculation is performed with every disk read or write, and if damage is
detected but the data is recoverable, then a soft error has occurred. Soft errors are
generally handled by the on-board disk controller, and never seen by the OS. ( See
below. )
 Once the disk is low-level formatted, the next step is to partition the drive into one
or more separate partitions. This step must be completed even if the disk is to be
used as a single large partition, so that the partition table can be written to the
beginning of the disk.
 After partitioning, then the filesystems must be logically formatted, which
involves laying down the master directory information ( FAT table or inode
structure ), initializing free lists, and creating at least the root directory of the
filesystem. ( Disk partitions which are to be used as raw devices are not logically
formatted. This saves the overhead and disk space of the filesystem structure, but
requires that the application program manage its own disk storage requirements. )

Boot Block

 Computer ROM contains a bootstrap program ( OS independent ) with just
enough code to find the first sector on the first hard drive on the first controller,
load that sector into memory, and transfer control over to it. ( The ROM bootstrap
program may look in floppy and/or CD drives before accessing the hard drive,
and is smart enough to recognize whether it has found valid boot code or not. )
 The first sector on the hard drive is known as the Master Boot Record, MBR, and
contains a very small amount of code in addition to the partition table. The
partition table documents how the disk is partitioned into logical disks, and
indicates specifically which partition is the active or boot partition.
 The boot program then looks to the active partition to find an operating system,
possibly loading up a slightly larger / more advanced boot program along the way.
 In a dual-boot ( or larger multi-boot ) system, the user may be given a choice of
which operating system to boot, with a default action to be taken in the event of
no response within some time frame.
 Once the kernel is found by the boot program, it is loaded into memory and then
control is transferred over to the OS. The kernel will normally continue the boot
process by initializing all important kernel data structures, launching important
system services ( e.g. network daemons, sched, init, etc. ), and finally providing
one or more login prompts. Boot options at this stage may include single-user
a.k.a. maintenance or safe modes, in which very few system services are started -
These modes are designed for system administrators to repair problems or
otherwise maintain the system.

Booting from disk in Windows 2000.

Bad Blocks

 No disk can be manufactured to 100% perfection, and all physical objects wear
out over time. For these reasons all disks are shipped with a few bad blocks, and
additional blocks can be expected to go bad slowly over time. If a large number of
blocks go bad then the entire disk will need to be replaced, but a few here and
there can be handled through other means.
 In the old days, bad blocks had to be checked for manually. Formatting of the disk
or running certain disk-analysis tools would identify bad blocks, and attempt to
read the data off of them one last time through repeated tries. Then the bad blocks
would be mapped out and taken out of future service. Sometimes the data could
be recovered, and sometimes it was lost forever. ( Disk analysis tools could be
either destructive or non-destructive. )
 Modern disk controllers make much better use of the error-correcting codes, so
that bad blocks can be detected earlier and the data usually recovered. ( Recall
that blocks are tested with every write as well as with every read, so often errors
can be detected before the write operation is complete, and the data simply written
to a different sector instead. )
 Note that re-mapping of sectors from their normal linear progression can throw
off the disk scheduling optimization of the OS, especially if the replacement
sector is physically far away from the sector it is replacing. For this reason most
disks normally keep a few spare sectors on each cylinder, as well as at least one
spare cylinder. Whenever possible a bad sector will be mapped to another sector
on the same cylinder, or at least a cylinder as close as possible. Sector slipping
may also be performed, in which all sectors between the bad sector and the
replacement sector are moved down by one, so that the linear progression of
sector numbers can be maintained.

 If the data on a bad block cannot be recovered, then a hard error has occurred,
which requires replacing the file(s) from backups, or rebuilding them from
scratch.

Swap-Space Management

 Modern systems typically swap out pages as needed, rather than swapping out entire
processes. Hence the swapping system is part of the virtual memory management system.
 Managing swap space is obviously an important task for modern OSes.

Swap-Space Use

 The amount of swap space needed by an OS varies greatly according to how it is
used. Some systems require an amount equal to physical RAM; some want a
multiple of that; some want an amount equal to the amount by which virtual
memory exceeds physical RAM, and some systems use little or none at all!
 Some systems support multiple swap spaces on separate disks in order to speed up
the virtual memory system.

Swap-Space Location

Swap space can be physically located in one of two locations:

 As a large file which is part of the regular filesystem. This is easy to
implement, but inefficient. Not only must the swap space be accessed
through the directory system, the file is also subject to fragmentation
issues. Caching the block location helps in finding the physical blocks, but
that is not a complete fix.
 As a raw partition, possibly on a separate or little-used disk. This allows
the OS more control over swap space management, which is usually faster
and more efficient. Fragmentation of swap space is generally not a big
issue, as the space is re-initialized every time the system is rebooted. The
downside of keeping swap space on a raw partition is that it can only be
grown by repartitioning the hard drive.

Swap-Space Management: An Example

 Historically OSes swapped out entire processes as needed. Modern systems swap
out only individual pages, and only as needed. ( For example process code blocks
and other blocks that have not been changed since they were originally loaded are
normally just freed from the virtual memory system rather than copying them to
swap space, because it is faster to go find them again in the filesystem and read
them back in from there than to write them out to swap space and then read them
back. )
 In the mapping system shown below for Linux systems, a map of swap space is
kept in memory, where each entry corresponds to a 4K block in the swap space.

Zeros indicate free slots and non-zeros refer to how many processes have a
mapping to that particular block ( >1 for shared pages only. )

The data structures for swapping on Linux systems.
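The swap map described above can be sketched as a simple array of counters, one per 4K swap slot; the slot count and the contents below are made up for illustration and do not reflect the actual Linux data structures.

```c
#include <stdio.h>

/* One counter per 4 KB swap slot: 0 = free, N > 0 = number of processes
 * that currently map the page stored in that slot.                       */
#define NSLOTS 8
unsigned char swap_map[NSLOTS] = { 1, 0, 0, 3, 0, 1, 1, 0 };   /* made-up contents */

int main(void)
{
    for (int i = 0; i < NSLOTS; i++) {
        if (swap_map[i] == 0)
            printf("slot %d: free\n", i);
        else
            printf("slot %d: used by %u mapping(s)%s\n", i, swap_map[i],
                   swap_map[i] > 1 ? " (shared page)" : "");
    }
    return 0;
}
```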

Deadlocks
System Model

 For the purposes of deadlock discussion, a system can be modeled as a
collection of limited resources, which can be partitioned into different
categories, to be allocated to a number of processes, each having different
needs.
 Resource categories may include memory, printers, CPUs, open files, tape
drives, CD-ROMS, etc.
 By definition, all the resources within a category are equivalent, and a request
of this category can be equally satisfied by any one of the resources in that
category. If this is not the case ( i.e. if there is some difference between the
resources within a category ), then that category needs to be further divided into
separate categories. For example, "printers" may need to be separated into
"laser printers" and "color inkjet printers".
 Some categories may have a single resource.
 In normal operation a process must request a resource before using it, and
release it when it is done, in the following sequence:
1. Request - If the request cannot be immediately granted, then the process
must wait until the resource(s) it needs become available. For example
the system calls open( ), malloc( ), new( ), and request( ).
2. Use - The process uses the resource, e.g. prints to the printer or reads
from the file.
3. Release - The process relinquishes the resource, so that it becomes
available for other processes. For example, close( ), free( ), delete( ), and
release( ).
 For all kernel-managed resources, the kernel keeps track of what resources are
free and which are allocated, to which process they are allocated, and a queue
of processes waiting for this resource to become available. Application-
managed resources can be controlled using mutexes or wait( ) and signal( )
calls, ( i.e. binary or counting semaphores. )
 A set of processes is deadlocked when every process in the set is waiting for a
resource that is currently allocated to another process in the set ( and which can
only be released when that other waiting process makes progress. )
Deadlock Characterization

Necessary Conditions

 There are four conditions that are necessary to achieve deadlock:


1. Mutual Exclusion - At least one resource must be held in a non-
sharable mode; If any other process requests this resource, then
that process must wait for the resource to be released.
2. Hold and Wait - A process must be simultaneously holding at
least one resource and waiting for at least one resource that is
currently being held by some other process.
3. No preemption - Once a process is holding a resource ( i.e. once
its request has been granted ), then that resource cannot be taken
away from that process until the process voluntarily releases it.
4. Circular Wait - A set of processes { P0, P1, P2, . . ., PN } must
exist such that every P[ i ] is waiting for P[ ( i + 1 ) % ( N + 1 ) ]. (
Note that this condition implies the hold-and-wait condition, but it
is easier to deal with the conditions if the four are considered
separately. )

Resource-Allocation Graph

 In some cases deadlocks can be understood more clearly through the use
of Resource-Allocation Graphs, having the following properties:
o A set of resource categories, { R1, R2, R3, . . ., RN }, which
appear as square nodes on the graph. Dots inside the resource
nodes indicate specific instances of the resource. ( E.g. two dots
might represent two laser printers. )
o A set of processes, { P1, P2, P3, . . ., PN }
o Request Edges - A set of directed arcs from Pi to Rj, indicating
that process Pi has requested Rj, and is currently waiting for that
resource to become available.
o Assignment Edges - A set of directed arcs from Rj to Pi
indicating that resource Rj has been allocated to process Pi, and
that Pi is currently holding resource Rj.
o Note that a request edge can be converted into an assignment
edge by reversing the direction of the arc when the request is
granted. ( However note also that request edges point to the
category box, whereas assignment edges emanate from a
particular instance dot within the box. )
o For example:
Figure 7.1 - Resource allocation graph

 If a resource-allocation graph contains no cycles, then the system is not
deadlocked. ( When looking for cycles, remember that these are directed
graphs. ) See the example in Figure 7.1 above.
 If a resource-allocation graph does contain cycles AND each resource
category contains only a single instance, then a deadlock exists.
 If a resource category contains more than one instance, then the presence
of a cycle in the resource-allocation graph indicates the possibility of a
deadlock, but does not guarantee one. Consider, for example, Figures 7.3
and 7.4 below:
Resource allocation graph with a deadlock

Resource allocation graph with a cycle but no deadlock


Methods for Handling Deadlocks

 Generally speaking there are three ways of handling deadlocks:


1. Deadlock prevention or avoidance - Do not allow the system to get into a
deadlocked state.
2. Deadlock detection and recovery - Abort a process or preempt some
resources when deadlocks are detected.
3. Ignore the problem altogether - If deadlocks only occur once a year or
so, it may be better to simply let them happen and reboot as necessary
than to incur the constant overhead and system performance penalties
associated with deadlock prevention or detection. This is the approach
that both Windows and UNIX take.
 In order to avoid deadlocks, the system must have additional information about
all processes. In particular, the system must know what resources a process will
or may request in the future. ( Ranging from a simple worst-case maximum to a
complete resource request and release plan for each process, depending on the
particular algorithm. )
 Deadlock detection is fairly straightforward, but deadlock recovery requires
either aborting processes or preempting resources, neither of which is an
attractive alternative.
 If deadlocks are neither prevented nor detected, then when a deadlock occurs
the system will gradually slow down, as more and more processes become
stuck waiting for resources currently held by the deadlock and by other waiting
processes. Unfortunately this slowdown can be indistinguishable from a general
system slowdown when a real-time process has heavy computing needs.

Deadlock Prevention

 Deadlocks can be prevented by preventing at least one of the four required
conditions:

Mutual Exclusion

 Shared resources such as read-only files do not lead to deadlocks.


 Unfortunately some resources, such as printers and tape drives, require
exclusive access by a single process.

Hold and Wait

 To prevent this condition processes must be prevented from holding one
or more resources while simultaneously waiting for one or more others.
There are several possibilities for this:
o Require that all processes request all resources at one time. This
can be wasteful of system resources if a process needs one
resource early in its execution and doesn't need some other
resource until much later.
o Require that processes holding resources must release them before
requesting new resources, and then re-acquire the released
resources along with the new ones in a single new request. This
can be a problem if a process has partially completed an operation
using a resource and then fails to get it re-allocated after releasing
it.
o Either of the methods described above can lead to starvation if a
process requires one or more popular resources.

No Preemption

 Preemption of process resource allocations can prevent this condition of
deadlocks, when it is possible.
o One approach is that if a process is forced to wait when requesting
a new resource, then all other resources previously held by this
process are implicitly released, ( preempted ), forcing this process
to re-acquire the old resources along with the new resources in a
single request, similar to the previous discussion.
o Another approach is that when a resource is requested and not
available, then the system looks to see what other processes
currently have those resources and are themselves blocked
waiting for some other resource. If such a process is found, then
some of their resources may get preempted and added to the list of
resources for which the process is waiting.
o Either of these approaches may be applicable for resources whose
states are easily saved and restored, such as registers and memory,
but are generally not applicable to other devices such as printers
and tape drives.

Circular Wait

 One way to avoid circular wait is to number all resources, and to require
that processes request resources only in strictly increasing ( or
decreasing ) order.
 In other words, in order to request resource Rj, a process must first
release all Ri such that i >= j.
 One big challenge in this scheme is determining the relative ordering of
the different resources
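A minimal sketch of resource ordering in practice is shown below: two pthreads mutexes stand in for two resources, and every thread acquires them in the same global order, so no circular wait can form. The lock names and the two-resource scenario are illustrative assumptions.

```c
#include <pthread.h>

/* Locks are "numbered" by declaration order; every thread must honor that
 * order, otherwise the deadlock risk returns.                             */
pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;   /* resource #1 */
pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;   /* resource #2 */

void *worker(void *arg)
{
    (void)arg;
    /* Every thread requests #1 before #2, so no cycle of waits can form. */
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);
    /* ... use both resources ... */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```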
Deadlock Avoidance

 The general idea behind deadlock avoidance is to never allow the system to enter
an unsafe state, using advance knowledge of what resources each process may
request to decide whether granting a particular request is safe.
 This requires more information about each process, AND tends to lead to low
device utilization. ( I.e. it is a conservative approach. )
 In some algorithms the scheduler only needs to know the maximum number of
each resource that a process might potentially use. In more complex algorithms
the scheduler can also take advantage of the schedule of exactly what resources
may be needed in what order.
 When a scheduler sees that starting a process or granting resource requests may
lead to future deadlocks, then that process is just not started or the request is not
granted.
 A resource allocation state is defined by the number of available and allocated
resources, and the maximum requirements of all processes in the system.

Safe State

 A state is safe if the system can allocate all resources requested by all
processes ( up to their stated maximums ) without entering a deadlock
state.
 More formally, a state is safe if there exists a safe sequence of processes
{ P0, P1, P2, ..., PN } such that all of the resource requests for Pi can be
granted using the resources currently allocated to Pi and all processes Pj
where j < i. ( I.e. if all the processes prior to Pi finish and free up their
resources, then Pi will be able to finish also, using the resources that they
have freed up. )
 If a safe sequence does not exist, then the system is in an unsafe state,
which MAY lead to deadlock. ( All safe states are deadlock free, but not
all unsafe states lead to deadlocks. )
Figure 7.6 - Safe, unsafe, and deadlocked state spaces.

 For example, consider a system with 12 tape drives, allocated as follows.
Is this a safe state? What is the safe sequence?

        Maximum Needs    Current Allocation
P0      10               5
P1      4                2
P2      9                2

 What happens to the above table if process P2 requests and is granted
one more tape drive?
 Key to the safe state approach is that when a request is made for
resources, the request is granted only if the resulting allocation state is a
safe one.

Resource-Allocation Graph Algorithm

 If resource categories have only single instances of their resources, then
deadlock states can be detected by cycles in the resource-allocation
graphs.
 In this case, unsafe states can be recognized and avoided by augmenting
the resource-allocation graph with claim edges, noted by dashed lines,
which point from a process to a resource that it may request in the future.
 In order for this technique to work, all claim edges must be added to the
graph for any particular process before that process is allowed to request
any resources. ( Alternatively, processes may only make requests for
resources for which they have already established claim edges, and claim
edges cannot be added to any process that is currently holding
resources. )
 When a process makes a request, the claim edge Pi->Rj is converted to a
request edge. Similarly when a resource is released, the assignment
reverts back to a claim edge.
 This approach works by denying requests that would produce cycles in
the resource-allocation graph, taking claim edges into effect.
 Consider for example what happens when process P2 requests resource
R2:

Resource allocation graph for deadlock avoidance

 The resulting resource-allocation graph would have a cycle in it, and so


the request cannot be granted.
- An unsafe state in a resource allocation graph

Banker's Algorithm

 For resource categories that contain more than one instance the resource-
allocation graph method does not work, and more complex ( and less
efficient ) methods must be chosen.
 The Banker's Algorithm gets its name because it is a method that
bankers could use to assure that when they lend out resources they will
still be able to satisfy all their clients. ( A banker won't loan out a little
money to start building a house unless they are assured that they will
later be able to loan out the rest of the money to finish the house. )
 When a process starts up, it must state in advance the maximum
allocation of resources it may request, up to the amount available on the
system.
 When a request is made, the scheduler determines whether granting the
request would leave the system in a safe state. If not, then the process
must wait until the request can be granted safely.
 The banker's algorithm relies on several key data structures: ( where n is
the number of processes and m is the number of resource categories. )
o Available[ m ] indicates how many resources are currently
available of each type.
o Max[ n ][ m ] indicates the maximum demand of each process of
each resource.
o Allocation[ n ][ m ] indicates the number of each resource
category allocated to each process.
o Need[ n ][ m ] indicates the remaining resources needed of each
type for each process. ( Note that Need[ i ][ j ] = Max[ i ][ j ] -
Allocation[ i ][ j ] for all i, j. )
 For simplification of discussions, we make the following notations /
observations:
o One row of the Need vector, Need[ i ], can be treated as a vector
corresponding to the needs of process i, and similarly for
Allocation and Max.
o A vector X is considered to be <= a vector Y if X[ i ] <= Y[ i ] for
all i.

Safety Algorithm

 In order to apply the Banker's algorithm, we first need an
algorithm for determining whether or not a particular state is safe.
 This algorithm determines if the current state of a system is safe,
according to the following steps:
1. Let Work and Finish be vectors of length m and n
respectively.
 Work is a working copy of the available resources,
which will be modified during the analysis.
 Finish is a vector of booleans indicating whether a
particular process can finish. ( or has finished so far
in the analysis. )
 Initialize Work to Available, and Finish to false for
all elements.
2. Find an i such that both (A) Finish[ i ] == false, and (B)
Need[ i ] <= Work. This process has not finished, but could
with the given available working set. If no such i exists, go
to step 4.
3. Set Work = Work + Allocation[ i ], and set Finish[ i ] to
true. This corresponds to process i finishing up and
releasing its resources back into the work pool. Then loop
back to step 2.
4. If finish[ i ] == true for all i, then the state is a safe state,
because a safe sequence has been found.
 ( JTB's Modification:
1. In step 1. instead of making Finish an array of booleans
initialized to false, make it an array of ints initialized to 0.
Also initialize an int s = 0 as a step counter.
2. In step 2, look for Finish[ i ] == 0.
3. In step 3, set Finish[ i ] to ++s. S is counting the number of
finished processes.
4. For step 4, the test can be either Finish[ i ] > 0 for all i, or s
>= n. The benefit of this method is that if a safe state exists,
then Finish[ ] indicates one safe sequence ( of possibly
many. ) )
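A minimal sketch of the safety algorithm described above follows, using a small made-up system of 3 processes and 2 resource types; the Available, Allocation, and Need values are illustrative only.

```c
#include <stdio.h>
#include <stdbool.h>

#define N 3            /* processes        */
#define M 2            /* resource types   */

/* Made-up system state; Need = Max - Allocation. */
int available[M]     = { 3, 2 };
int allocation[N][M] = { { 2, 1 }, { 1, 0 }, { 3, 1 } };
int need[N][M]       = { { 5, 2 }, { 2, 1 }, { 1, 1 } };

bool is_safe(void)
{
    int  work[M];
    bool finish[N] = { false };
    for (int j = 0; j < M; j++) work[j] = available[j];   /* Work = Available */

    for (int done = 0; done < N; ) {
        bool progressed = false;
        for (int i = 0; i < N; i++) {
            if (finish[i]) continue;
            bool can_run = true;
            for (int j = 0; j < M; j++)
                if (need[i][j] > work[j]) { can_run = false; break; }
            if (can_run) {              /* pretend Pi runs to completion   */
                for (int j = 0; j < M; j++) work[j] += allocation[i][j];
                finish[i] = true;
                printf("P%d can finish\n", i);
                progressed = true;
                done++;
            }
        }
        if (!progressed) return false;  /* no runnable process left: unsafe */
    }
    return true;
}

int main(void)
{
    printf("state is %s\n", is_safe() ? "safe" : "unsafe");
    return 0;
}
```

With these numbers the processes can finish in the order P1, P2, P0, so the state is safe.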

Resource-Request Algorithm ( The Bankers Algorithm )

 Now that we have a tool for determining if a particular state is
safe or not, we are now ready to look at the Banker's algorithm
itself.
 This algorithm determines if a new request is safe, and grants it
only if it is safe to do so.
 When a request is made ( that does not exceed currently available
resources ), pretend it has been granted, and then see if the
resulting state is a safe one. If so, grant the request, and if not,
deny the request, as follows:
1. Let Request[ n ][ m ] indicate the number of resources of
each type currently requested by processes. If Request[ i ] >
Need[ i ] for any process i, raise an error condition.
2. If Request[ i ] > Available for any process i, then that
process must wait for resources to become available.
Otherwise the process can continue to step 3.
3. Check to see if the request can be granted safely, by
pretending it has been granted and then seeing if the
resulting state is safe. If so, grant the request, and if not,
then the process must wait until its request can be granted
safely.The procedure for granting a request ( or pretending
to for testing purposes ) is:
 Available = Available - Request
 Allocation = Allocation + Request
 Need = Need - Request

An Illustrative Example

 Consider the following situation:


 And now consider what happens if process P1 requests 1 instance
of A and 2 instances of C. ( Request[ 1 ] = ( 1, 0, 2 ) )

 What about requests of ( 3, 3, 0 ) by P4? Or ( 0, 2, 0 ) by P0? Can
these be safely granted? Why or why not?

Deadlock Detection

 If deadlocks are not avoided, then another approach is to detect when they have
occurred and recover somehow.
 In addition to the performance hit of constantly checking for deadlocks, a
policy / algorithm must be in place for recovering from deadlocks, and there is
potential for lost work when processes must be aborted or have their resources
preempted.
Single Instance of Each Resource Type

 If each resource category has a single instance, then we can use a
variation of the resource-allocation graph known as a wait-for graph.
 A wait-for graph can be constructed from a resource-allocation graph by
eliminating the resources and collapsing the associated edges, as shown
in the figure below.
 An arc from Pi to Pj in a wait-for graph indicates that process Pi is
waiting for a resource that process Pj is currently holding.

(a) Resource allocation graph. (b) Corresponding wait-for graph

 As before, cycles in the wait-for graph indicate deadlocks.


 This algorithm must maintain the wait-for graph, and periodically search
it for cycles.
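Cycle detection in a wait-for graph can be sketched with a depth-first search; the adjacency matrix below ( P0 -> P1 -> P2 -> P0, plus P3 -> P1 ) is made up for illustration.

```c
#include <stdio.h>
#include <stdbool.h>

/* wait_for[i][j] is true when process Pi waits for a resource Pj holds. */
#define NPROC 4
bool wait_for[NPROC][NPROC] = {
    [0][1] = true, [1][2] = true, [2][0] = true, [3][1] = true,
};

static bool dfs(int p, bool *on_stack, bool *visited)
{
    visited[p] = on_stack[p] = true;
    for (int q = 0; q < NPROC; q++) {
        if (!wait_for[p][q]) continue;
        if (on_stack[q]) return true;            /* back edge => cycle found */
        if (!visited[q] && dfs(q, on_stack, visited)) return true;
    }
    on_stack[p] = false;
    return false;
}

int main(void)
{
    bool visited[NPROC] = { false }, on_stack[NPROC] = { false };
    for (int p = 0; p < NPROC; p++)
        if (!visited[p] && dfs(p, on_stack, visited)) {
            printf("deadlock: cycle reachable from P%d\n", p);
            return 0;
        }
    printf("no deadlock\n");
    return 0;
}
```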

Several Instances of a Resource Type

 The detection algorithm outlined here is essentially the same as the
Banker's algorithm, with two subtle differences:
o In step 1, the Banker's Algorithm sets Finish[ i ] to false for all i.
The algorithm presented here sets Finish[ i ] to false only if
Allocation[ i ] is not zero. If the currently allocated resources for
this process are zero, the algorithm sets Finish[ i ] to true. This is
essentially assuming that IF all of the other processes can finish,
then this process can finish also. Furthermore, this algorithm is
specifically looking for which processes are involved in a
deadlock situation, and a process that does not have any resources
allocated cannot be involved in a deadlock, and so can be
removed from any further consideration.
o Steps 2 and 3 are unchanged
o In step 4, the basic Banker's Algorithm says that if Finish[ i ] ==
true for all i, that there is no deadlock. This algorithm is more
specific, by stating that if Finish[ i ] == false for any process Pi,
then that process is specifically involved in the deadlock which
has been detected.
 ( Note: An alternative method was presented above, in which Finish held
integers instead of booleans. This vector would be initialized to all zeros,
and then filled with increasing integers as processes are detected which
can finish. If any processes are left at zero when the algorithm
completes, then there is a deadlock, and if not, then the integers in finish
describe a safe sequence. To modify this algorithm to match this section
of the text, processes with allocation = zero could be filled in with N, N -
1, N - 2, etc. in step 1, and any processes left with Finish = 0 in step 4
are the deadlocked processes. )
 Consider, for example, the following state, and determine if it is
currently deadlocked:
 Now suppose that process P2 makes a request for an additional instance
of type C, yielding the state shown below. Is the system now
deadlocked?

Detection-Algorithm Usage

 When should the deadlock detection be done? Frequently, or
infrequently?
 The answer may depend on how frequently deadlocks are expected to
occur, as well as the possible consequences of not catching them
immediately. ( If deadlocks are not removed immediately when they
occur, then more and more processes can "back up" behind the deadlock,
making the eventual task of unblocking the system more difficult and
possibly damaging to more processes. )
 There are two obvious approaches, each with trade-offs:
1. Do deadlock detection after every resource allocation which
cannot be immediately granted. This has the advantage of
detecting the deadlock right away, while the minimum number of
processes are involved in the deadlock. ( One might consider that
the process whose request triggered the deadlock condition is the
"cause" of the deadlock, but realistically all of the processes in the
cycle are equally responsible for the resulting deadlock. ) The
down side of this approach is the extensive overhead and
performance hit caused by checking for deadlocks so frequently.
2. Do deadlock detection only when there is some clue that a
deadlock may have occurred, such as when CPU utilization
drops below 40% or some other magic number. The advantage is
that deadlock detection is done much less frequently, but the
down side is that it becomes impossible to detect the processes
involved in the original deadlock, and so deadlock recovery can
be more complicated and damaging to more processes.
3. ( As I write this, a third alternative comes to mind: Keep a
historical log of resource allocations, since that last known time of
no deadlocks. Do deadlock checks periodically ( once an hour or
when CPU usage is low?), and then use the historical log to trace
through and determine when the deadlock occurred and what
processes caused the initial deadlock. Unfortunately I'm not
certain that breaking the original deadlock would then free up the
resulting log jam. )

Recovery From Deadlock

 There are three basic approaches to recovery from deadlock:


1. Inform the system operator, and allow him/her to take manual
intervention.
2. Terminate one or more processes involved in the deadlock
3. Preempt resources.

Process Termination

 Two basic approaches, both of which recover resources allocated to
terminated processes:
o Terminate all processes involved in the deadlock. This definitely
solves the deadlock, but at the expense of terminating more
processes than would be absolutely necessary.
o Terminate processes one by one until the deadlock is broken. This
is more conservative, but requires doing deadlock detection after
each step.
 In the latter case there are many factors that can go into deciding which
processes to terminate next:
1. Process priorities.
2. How long the process has been running, and how close it is
to finishing.
3. How many and what type of resources is the process
holding. ( Are they easy to preempt and restore? )
4. How many more resources does the process need to
complete.
5. How many processes will need to be terminated
6. Whether the process is interactive or batch.
7. ( Whether or not the process has made non-restorable
changes to any resource. )
Resource Preemption

 When preempting resources to relieve deadlock, there are three
important issues to be addressed:
1. Selecting a victim - Deciding which resources to preempt from
which processes involves many of the same decision criteria
outlined above.
2. Rollback - Ideally one would like to roll back a preempted
process to a safe state prior to the point at which that resource was
originally allocated to the process. Unfortunately it can be
difficult or impossible to determine what such a safe state is, and
so the only safe rollback is to roll back all the way back to the
beginning. ( I.e. abort the process and make it start over. )
3. Starvation - How do you guarantee that a process won't starve
because its resources are constantly being preempted? One option
would be to use a priority system, and increase the priority of a
process every time its resources get preempted. Eventually it
should get a high enough priority that it won't get preempted any
more.

Protection
Goals of Protection

 Obviously to prevent malicious misuse of the system by users or programs.


 To ensure that each shared resource is used only in accordance with
system policies, which may be set either by system designers or by system
administrators.
 To ensure that errant programs cause the minimal amount of damage possible.
 Note that protection systems only provide the mechanisms for enforcing
policies and ensuring reliable systems. It is up to administrators and users to
implement those mechanisms effectively.
Principles of Protection

 The principle of least privilege dictates that programs, users, and systems be
given just enough privileges to perform their tasks.
 This ensures that failures or misuse can do only the least possible amount of
harm.
 For example, if a program needs special privileges to perform a task, it is better
to make it a SGID program with group ownership of "network" or "backup" or
some other pseudo group, rather than SUID with root ownership. This limits
the amount of damage that can occur if something goes wrong.
 Typically each user is given their own account, and has only enough privilege
to modify their own files.
 The root account should not be used for normal day to day activities - The
System Administrator should also have an ordinary account, and reserve use of
the root account for only those tasks which need the root privileges

Domain of Protection

 A computer can be viewed as a collection of processes and objects ( both HW
& SW ).
 The need to know principle states that a process should only have access to
those objects it needs to accomplish its task, and furthermore only in the modes
for which it needs access and only during the time frame when it needs access.
 The modes available for a particular object may depend upon its type.

Domain Structure

 A protection domain specifies the resources that a process may access.


 Each domain defines a set of objects and the types of operations that
may be invoked on each object.
 An access right is the ability to execute an operation on an object.
 A domain is defined as a set of < object, { access right set } > pairs, as
shown below. Note that some domains may be disjoint while others
overlap.
System with three protection domains.
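To make the < object, { access right set } > structure concrete, here is a minimal sketch; the domain labels, object names, and rights are hypothetical, chosen only to show one overlapping and one disjoint domain.

```java
import java.util.*;

// A protection domain as a set of < object, { access rights } > pairs.
// Names and rights are hypothetical, for illustration only.
class Domains {
    enum Right { READ, WRITE, EXECUTE, PRINT }

    // domain name -> ( object name -> rights held on that object )
    static Map<String, Map<String, EnumSet<Right>>> domains = Map.of(
        "D1", Map.of("file1", EnumSet.of(Right.READ),
                     "printer1", EnumSet.of(Right.PRINT)),      // overlaps D2 on printer1
        "D2", Map.of("printer1", EnumSet.of(Right.PRINT)),
        "D3", Map.of("file2", EnumSet.of(Right.READ, Right.WRITE))); // disjoint from D1

    static boolean allowed(String domain, String object, Right r) {
        return domains.getOrDefault(domain, Map.of())
                      .getOrDefault(object, EnumSet.noneOf(Right.class))
                      .contains(r);
    }
}
```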

 The association between a process and a domain may
be static or dynamic.
o If the association is static, then the need-to-know principle
requires a way of changing the contents of the domain
dynamically.
o If the association is dynamic, then there needs to be a mechanism
for domain switching.
 Domains may be realized in different fashions - as users, or as processes,
or as procedures. E.g. if each user corresponds to a domain, then that
domain defines the access of that user, and changing domains involves
changing user ID.

14.3.2 An Example: UNIX

 UNIX associates domains with users.


 Certain programs operate with the SUID bit set, which effectively
changes the user ID, and therefore the access domain, while the program
is running. ( and similarly for the SGID bit. ) Unfortunately this has
some potential for abuse.
 An alternative used on some systems is to place privileged programs in
special directories, so that they attain the identity of the directory owner
when they run. This prevents crackers from placing SUID programs in
random directories around the system.
 Yet another alternative is to not allow the changing of ID at all. Instead,
special privileged daemons are launched at boot time, and user processes
send messages to these daemons when they need special tasks
performed.

14.3.3 An Example: MULTICS


 The MULTICS system uses a complex system of rings, each
corresponding to a different protection domain, as shown below:

MULTICS ring structure.

 Rings are numbered from 0 to 7, with outer rings having a subset of the
privileges of the inner rings.
 Each file is a memory segment, and each segment description includes
an entry that indicates the ring number associated with that segment, as
well as read, write, and execute privileges.
 Each process runs in a ring, according to the current-ring-number, a
counter associated with each process.
 A process operating in one ring can only access segments associated
with higher ( farther out ) rings, and then only according to the access
bits. Processes cannot access segments associated with lower rings.
 Domain switching is achieved by a process in one ring calling upon a
process operating in a lower ring, which is controlled by several factors
stored with each segment descriptor:
o An access bracket, defined by integers b1 <= b2.
o A limit b3 > b2
o A list of gates, identifying the entry points at which the segments
may be called.
 If a process operating in ring i calls a segment whose bracket is such that
b1 <= i <= b2, then the call succeeds and the process remains in ring i.
 Otherwise a trap to the OS occurs, and is handled as follows:
o If i < b1, then the call is allowed, because we are transferring to a
procedure with fewer privileges. However if any of the parameters
being passed are of segments below b1, then they must be copied
to an area accessible by the called procedure.
o If i > b2, then the call is allowed only if i <= b3 and the call is
directed to one of the entries on the list of gates.
 Overall this approach is more complex and less efficient than other
protection schemes.
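The call rules above boil down to a short check. The sketch below is only an illustration of that logic; the Segment fields and class names are hypothetical, not actual MULTICS structures.

```java
// Sketch of the MULTICS ring-call rules described above.
// b1 <= b2 form the access bracket, b3 > b2 is the limit,
// and `gates` lists the legal entry points.
class Segment {
    int b1, b2, b3;
    java.util.Set<Integer> gates = new java.util.HashSet<>();
}

class RingCheck {
    // Returns true if a process in ring i may call into the given segment
    // at the given entry point, following the rules in the notes above.
    static boolean callAllowed(int i, Segment seg, int entryPoint) {
        if (seg.b1 <= i && i <= seg.b2)
            return true;                       // within the bracket: stays in ring i
        if (i < seg.b1)
            return true;                       // call to fewer privileges
                                               // ( parameters below b1 must be copied )
        // i > b2: allowed only up to the limit and through a declared gate.
        return i <= seg.b3 && seg.gates.contains(entryPoint);
    }
}
```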

Access Matrix

 The model of protection that we have been discussing can be viewed as
an access matrix, in which columns represent different system resources and
rows represent different protection domains. Entries within the matrix indicate
what access that domain has to that resource.

Figure 14.3 - Access matrix.

 Domain switching can be easily supported under this model, simply by
providing "switch" access to other domains:
Access matrix of Figure 14.3 with domains as objects.

 The ability to copy rights is denoted by an asterisk, indicating that processes in
that domain have the right to copy that access within the same column, i.e. for
the same object. There are two important variations:
o If the asterisk is removed from the original access right, then the right
is transferred, rather than being copied. This may be termed
a transfer right as opposed to a copy right.
o If only the right and not the asterisk is copied, then the access right is
added to the new domain, but it may not be propagated further. That is,
the new domain does not also receive the right to copy the access. This
may be termed a limited copy right, as shown in Figure 14.5 below:
Access matrix with copy rights.

 The owner right adds the privilege of adding new rights or removing existing
ones:
Access matrix with owner rights.

 Copy and owner rights only allow the modification of rights within a column.
The addition of control rights, which apply only to domain objects, allows a
process operating in one domain to affect the rights available in other domains.
For example in the table below, a process operating in domain D2 has the right
to control any of the rights in domain D4.
Modified access matrix
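The operations described in this section - checking an access, switching domains, and copying a starred right within a column - can be sketched with a nested map standing in for the matrix. The representation and all names are hypothetical.

```java
import java.util.*;

// Sketch of an access matrix: row = domain, column = object, entry = rights.
// A trailing '*' on a right marks it as copyable within its column.
class AccessMatrix {
    // domain -> ( object -> rights such as "read", "write", "switch", "read*" )
    Map<String, Map<String, Set<String>>> m = new HashMap<>();

    boolean allows(String domain, String object, String right) {
        Set<String> rights = m.getOrDefault(domain, Map.of())
                              .getOrDefault(object, Set.of());
        return rights.contains(right) || rights.contains(right + "*");
    }

    // Domains are also objects, so switching is just the "switch" right.
    boolean canSwitch(String from, String to) {
        return allows(from, to, "switch");
    }

    // Copy a right within one column ( same object ) from one domain to another.
    // Only rights marked with '*' may be copied; the copy itself is not starred,
    // which corresponds to the "limited copy" variation described above.
    void copyRight(String fromDomain, String toDomain, String object, String right) {
        Set<String> src = m.getOrDefault(fromDomain, Map.of())
                           .getOrDefault(object, Set.of());
        if (!src.contains(right + "*"))
            throw new SecurityException("right is not copyable");
        m.computeIfAbsent(toDomain, d -> new HashMap<>())
         .computeIfAbsent(object, o -> new HashSet<>())
         .add(right);
    }
}
```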

Implementation of Access Matrix

Global Table

 The simplest approach is one big global table with < domain, object,
rights > entries.
 Unfortunately this table is very large ( even if sparse ) and so cannot be
kept in memory ( without invoking virtual memory techniques. )
 There is also no good way to specify groupings - If everyone has access
to some resource, then it still needs a separate entry for every domain.

Access Lists for Objects

 Each column of the table can be kept as a list of the access rights for that
particular object, discarding blank entries.
 For efficiency a separate list of default access rights can also be kept,
and checked first.

Capability Lists for Domains

 In a similar fashion, each row of the table can be kept as a list of the
capabilities of that domain.
 Capability lists are associated with each domain, but not directly
accessible by the domain or any user process.
 Capability lists are themselves protected resources, distinguished from
other data in one of two ways:
o A tag, possibly hardware implemented, distinguishing this special
type of data. ( other types may be floats, pointers, booleans, etc. )
o The address space for a program may be split into multiple
segments, at least one of which is inaccessible by the program
itself, and used by the operating system for maintaining the
process's access right capability list.
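The two list-based decompositions are duals of each other: an access list stores one column of the matrix with its object, while a capability list stores one row with its domain. A minimal sketch of both follows (names are hypothetical, and the tagging or segment-based protection of capability lists is not shown).

```java
import java.util.*;

// Column-wise storage: each object keeps a list of ( domain, rights ) entries.
class AccessList {
    Map<String, Set<String>> entries = new HashMap<>();    // domain -> rights
    Set<String> defaultRights = new HashSet<>();            // checked first

    boolean allows(String domain, String right) {
        return defaultRights.contains(right) ||
               entries.getOrDefault(domain, Set.of()).contains(right);
    }
}

// Row-wise storage: each domain holds capabilities naming an object and rights.
// In a real system this list lives in protected ( kernel-maintained ) storage,
// not in ordinary user-accessible memory.
class CapabilityList {
    record Capability(String object, Set<String> rights) { }
    List<Capability> caps = new ArrayList<>();

    boolean allows(String object, String right) {
        return caps.stream().anyMatch(
            c -> c.object().equals(object) && c.rights().contains(right));
    }
}
```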

A Lock-Key Mechanism

 Each resource has a list of unique bit patterns, termed locks.


 Each domain has its own list of unique bit patterns, termed keys.
 Access is granted if one of the domain's keys fits one of the resource's
locks.
 Again, a process is not allowed to modify its own keys.
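A sketch of the lock-key check: each resource carries locks and each domain carries keys, and access is granted when some key matches some lock. Representing the bit patterns as 64-bit values is an assumption made only for illustration.

```java
import java.util.Set;

// Lock-key sketch: access is granted if any of the domain's keys
// matches any of the resource's locks.
class LockKey {
    static boolean accessGranted(Set<Long> domainKeys, Set<Long> resourceLocks) {
        for (long key : domainKeys)
            if (resourceLocks.contains(key))   // the key "fits" this lock
                return true;
        return false;
    }
}
```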

Comparison

 Each of the methods here has certain advantages or disadvantages,
depending on the particular situation and task at hand.
 Many systems employ some combination of the listed methods.

Access Control

 Role-Based Access Control, RBAC, assigns privileges to users, programs, or
roles as appropriate, where "privileges" refer to the right to call certain system
calls, or to use certain parameters with those calls.
 RBAC supports the principle of least privilege, and reduces the susceptibility to
abuse as opposed to SUID or SGID programs.
Role-based access control in Solaris 10.
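The core idea can be sketched as two mappings - privileges attached to roles, and roles granted to users - with a check that succeeds only if some role held by the user carries the privilege. The role and privilege names in the comments are hypothetical and are not tied to Solaris.

```java
import java.util.*;

// Role-based access control sketch: user -> roles -> privileges.
class Rbac {
    Map<String, Set<String>> rolePrivileges = new HashMap<>(); // role -> privileges
    Map<String, Set<String>> userRoles = new HashMap<>();      // user -> roles

    boolean permitted(String user, String privilege) {
        for (String role : userRoles.getOrDefault(user, Set.of()))
            if (rolePrivileges.getOrDefault(role, Set.of()).contains(privilege))
                return true;
        return false;
    }
}

// Example ( hypothetical names ): an operator role may mount file systems
// but not assign passwords.
//   rbac.rolePrivileges.put("operator", Set.of("sys.mount"));
//   rbac.userRoles.put("alice", Set.of("operator"));
//   rbac.permitted("alice", "sys.mount")      -> true
//   rbac.permitted("alice", "passwd.assign")  -> false
```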

Revocation of Access Rights

 The need to revoke access rights dynamically raises several questions:


o Immediate versus delayed - If delayed, can we determine when the
revocation will take place?
o Selective versus general - Does revocation of an access right to an object
affect all users who have that right, or only some users?
o Partial versus total - Can a subset of rights for an object be revoked, or
are all rights revoked at once?
o Temporary versus permanent - If rights are revoked, is there a
mechanism for processes to re-acquire some or all of the revoked rights?
 With an access list scheme revocation is easy, immediate, and can be selective,
general, partial, total, temporary, or permanent, as desired.
 With capability lists the problem is more complicated, because access rights
are distributed throughout the system. A few schemes that have been developed
include:
o Reacquisition - Capabilities are periodically revoked from each domain,
which must then re-acquire them.
o Back-pointers - A list of pointers is maintained from each object to each
capability which is held for that object.
o Indirection - Capabilities point to an entry in a global table rather than to
the object. Access rights can be revoked by changing or invalidating the
table entry, which may affect multiple processes, which must then re-
acquire access rights to continue.
o Keys - A unique bit pattern is associated with each capability when
created, which can be neither inspected nor modified by the process.
 A master key is associated with each object.
 When a capability is created, its key is set to the object's master
key.
 As long as the capability's key matches the object's key, then the
capabilities remain valid.
 The object master key can be changed with the set-key command,
thereby invalidating all current capabilities.
 More flexibility can be added to this scheme by implementing
a list of keys for each object, possibly in a global table.
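The indirection scheme lends itself to a short sketch: a capability is just an index into a global table, so invalidating the table entry revokes every outstanding capability that references it. The structure and names below are hypothetical.

```java
import java.util.*;

// Revocation by indirection: capabilities point at a global table entry,
// so clearing the entry revokes all capabilities that reference it.
class GlobalTable {
    record Entry(String object, Set<String> rights, boolean valid) { }
    private final List<Entry> entries = new ArrayList<>();

    int grant(String object, Set<String> rights) {
        entries.add(new Entry(object, rights, true));
        return entries.size() - 1;                 // the "capability" is this index
    }

    boolean allows(int capability, String object, String right) {
        Entry e = entries.get(capability);
        return e.valid() && e.object().equals(object) && e.rights().contains(right);
    }

    // Revoke: invalidate the entry; every holder must later re-acquire access.
    void revoke(int capability) {
        Entry e = entries.get(capability);
        entries.set(capability, new Entry(e.object(), e.rights(), false));
    }
}
```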

Capability-Based Systems ( Optional )

An Example: Hydra

 Hydra is a capability-based system that includes both system-defined
rights and user-defined rights. The interpretation of user-defined rights
is up to the specific user programs, but the OS provides support for
protecting access to those rights, whatever they may be.
 Operations on objects are defined procedurally, and those procedures are
themselves protected objects, accessed indirectly through capabilities.
 The names of user-defined procedures must be identified to the
protection system if it is to deal with user-defined rights.
 When an object is created, the names of operations defined on that object
become auxiliary rights, described in a capability for an instance of the
type. For a process to act on an object, the capabilities it holds for that
object must contain the name of the operation being invoked. This
allows access to be controlled on an instance-by-instance and process-
by-process basis.
 Hydra also allows rights amplification, in which a process is deemed to
be trustworthy, and thereby allowed to act on any object corresponding
to its parameters.
 Programmers can make direct use of the Hydra protection system, using
suitable libraries which are documented in appropriate reference
manuals.
An Example: Cambridge CAP System

 The CAP system has two kinds of capabilities:


o Data capability, used to provide read, write, and execute access to
objects. These capabilities are interpreted by microcode in the
CAP machine.
o Software capability, protected but not interpreted by the CAP
microcode.
 Software capabilities are interpreted by protected
( privileged ) procedures, possibly written by application
programmers.
 When a process executes a protected procedure, it
temporarily gains the ability to read or write the contents of
a software capability.
 This leaves the interpretation of the software capabilities up
to the individual subsystems, and limits the potential
damage that could be caused by a faulty privileged
procedure.
 Note, however, that protected procedures only get access to
software capabilities for the subsystem of which they are a
part. Checks are made when passing software capabilities
to protected procedures that they are of the correct type.
 Unfortunately the CAP system does not provide libraries,
making it harder for an individual programmer to use than
the Hydra system.

Language-Based Protection ( Optional )

o As systems have developed, protection systems have become more
powerful, and also more specific and specialized.
o To refine protection even further requires putting protection capabilities
into the hands of individual programmers, so that protection policies can
be implemented on the application level, i.e. to protect resources in ways
that are known to the specific applications but not to the more general
operating system.

Compiler-Based Enforcement

 In a compiler-based approach to protection enforcement, programmers
directly specify the protection needed for different resources at the time
the resources are declared.
 This approach has several advantages:
1. Protection needs are simply declared, as opposed to a complex
series of procedure calls.
2. Protection requirements can be stated independently of the
support provided by a particular OS.
3. The means of enforcement need not be provided directly by the
developer.
4. Declarative notation is natural, because access privileges are
closely related to the concept of data types.
 Regardless of the means of implementation, compiler-based protection
relies upon the protection mechanisms provided by the underlying OS,
such as the Cambridge CAP or Hydra systems.
 Even if the underlying OS does not provide advanced protection
mechanisms, the compiler can still offer some protection, such as
treating memory accesses differently in code versus data segments. ( E.g.
code segments can't be modified, data segments can't be executed. )
 There are several areas in which compiler-based protection can be
compared to kernel-enforced protection:
o Security. Security provided by the kernel offers better protection
than that provided by a compiler. The security of the compiler-
based enforcement is dependent upon the integrity of the compiler
itself, as well as requiring that files not be modified after they are
compiled. The kernel is in a better position to protect itself from
modification, as well as protecting access to specific files. Where
hardware support of individual memory accesses is available, the
protection is stronger still.
o Flexibility. A kernel-based protection system is not flexible enough to
provide the specific protection needed by an individual
programmer, though it may provide support which the
programmer may make use of. Compilers are more easily changed
and updated when necessary to change the protection services
offered or their implementation.
o Efficiency. The most efficient protection mechanism is one
supported by hardware and microcode. Insofar as software based
protection is concerned, compiler-based systems have the
advantage that many checks can be made off-line, at compile
time, rather than during execution.
 The concept of incorporating protection mechanisms into programming
languages is in its infancy, and still remains to be fully developed.
However the general goal is to provide mechanisms for three functions:
1. Distributing capabilities safely and efficiently among customer
processes. In particular a user process should only be able to
access resources for which it was issued capabilities.
2. Specifying the type of operations a process may execute on a
resource, such as reading or writing.
3. Specifying the order in which operations are performed on the
resource, such as opening before reading.

Protection in Java

 Java was designed from the very beginning to operate in a distributed
environment, where code would be executed from a variety of trusted
and untrusted sources. As a result the Java Virtual Machine ( JVM )
incorporates many protection mechanisms.
 When a Java program runs, it loads classes dynamically, in response to
requests to instantiate objects of particular types. These classes may
come from a variety of different sources, some trusted and some not,
which requires that the protection mechanism be implemented at the
resolution of individual classes, something not supported by the basic
operating system.
 As each class is loaded, it is placed into a separate protection domain.
The capabilities of each domain depend upon whether the source URL is
trusted or not, the presence or absence of any digital signatures on the
class ( Chapter 15 ), and a configurable policy file indicating which
servers a particular user trusts, etc.
 When a request is made to access a restricted resource in Java, ( e.g.
open a local file ), some method on the current call stack must
specifically assert a privilege to perform the operation. In essence this
method assumes responsibility for the restricted access. Naturally the
method must be part of a class which resides in a protection domain that
includes the capability for the requested operation. This approach is
termed stack inspection, and works like this:
o When a caller may not be trusted, a method executes an access
request within a doPrivileged( ) block, which is noted on
the calling stack.
o When access to a protected resource is
requested, checkPermissions( ) inspects the call stack to
see if a method has asserted the privilege to access the protected
resource.
 If a suitable doPrivileged block is encountered on the stack
before a domain in which the privilege is disallowed, then
the request is granted.
 If a domain in which the request is disallowed is
encountered first, then the access is denied and an
AccessControlException is thrown.
 If neither is encountered, then the response is
implementation dependent.
 In the example below the untrusted applet's call to get( ) succeeds,
because the trusted URL loader asserts the privilege of opening the
specific URL lucent.com. However when the applet tries to make a
direct call to open( ) it fails, because it does not have privilege to
access any sockets.

Stack inspection.
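The doPrivileged( ) / checkPermission( ) interplay can be sketched roughly as follows. The class names and file path are hypothetical, the outcome depends on the installed security policy, and the SecurityManager / AccessController mechanism is deprecated in recent Java releases, so this is only an illustration of the flow described above.

```java
import java.io.FilePermission;
import java.security.AccessController;
import java.security.PrivilegedAction;

// Sketch of stack inspection: a trusted class asserts a privilege inside
// doPrivileged( ), so a later permission check succeeds even when the
// original caller's domain lacks the permission.
class TrustedLoader {
    String fetch() {
        return AccessController.doPrivileged((PrivilegedAction<String>) () -> {
            // checkPermission( ) walks the call stack; it stops at this
            // doPrivileged frame, and if this class's protection domain holds
            // the permission the request is granted.
            AccessController.checkPermission(
                    new FilePermission("/tmp/cache.dat", "read"));
            return "...cached data...";
        });
    }
}

class UntrustedCode {
    String get(TrustedLoader loader) {
        return loader.fetch();     // succeeds: the loader asserted the privilege
    }

    void open() {
        // Throws AccessControlException if this ( untrusted ) domain lacks the
        // permission, since no doPrivileged frame vouches for the request.
        AccessController.checkPermission(
                new FilePermission("/tmp/cache.dat", "read"));
    }
}
```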
