OS Notes
USER VIEW
SYSTEM VIEW
OS is a resource allocator
o Manages all resources
o Decides between conflicting requests for efficient and fair resource use
OS is a control program
o Controls execution of programs to prevent errors and improper use of the
computer
OPERATING-SYSTEM OBJECTIVES
• OS is a layer of software whose job is to manage all devices and provide user
programs with a simpler interface to the hardware
• Objectives of OS
– Convenience
– Efficiency
– Ability to evolve
Layered Approach
• End user views a computer system in terms of a set of applications
• Applications are written in a programming language by an application programmer
• To make the applications usable on the computer hardware, system programs or
utilities are provided
• The OS comprises a collection of system programs
• OS masks the details of the hardware from the programmer and provides the
programmer with a convenient interface for using the system.
OS as User/Computer Interface
• OS provides services in the following areas:
– Program Development – Editors and debuggers assist the programmer in creating
programs. These are provided as application program development tools.
– Program execution – load program-run program-execute program
– Access to I/O devices - OS provides a uniform interface for I/O devices
which are hidden from end users
– Controlled Access to files - OS needs to understand I/O, structure of file and
also provide protection to users in multiuser environment
– System Access: Provide access to system as whole and to specific system
resources. Resolve conflicts for resource contention.
– Error detection – OS needs to be constantly aware of possible errors
• May occur in the CPU and memory hardware, in I/O devices, in user
program
• For each type of error, OS should take the appropriate action to ensure
correct and consistent computing
• Debugging facilities can greatly enhance the user’s and programmer’s
abilities to efficiently use the system
– Accounting - To keep track of which users use how much and what kinds of
computer resources
OS as resource manager
• OS controls the basic functions of the computer like movement, storage, processing of
data
But the control mechanism is unusual
– OS functions as ordinary computer software
– OS relinquishes control of the processor and must depend on the processor to regain it
– The control mechanism is internal to the system rather than external
Ease of Evolution of an OS
• Hardware upgrades plus new types of hardware
• New services
Fixes: faults are found and fixed, but a fix may itself introduce new faults
The need to change OS regularly places certain requirements on its design.
System should be modular in construction, with clearly defined interfaces between
modules and should be well documented
The common functions of controlling and allocating resources are brought together
into one piece of software called the operating system.
No universally accepted definition
“Everything a vendor ships when you order an operating system” is a good
approximation
o But varies wildly
“The one program running at all times on the computer” is the kernel.
Everything else is either
o a system program (ships with the operating system) , or
o an application program.
o One or more CPUs, device controllers connect through common bus providing
access to shared memory
o Concurrent execution of CPUs and devices competing for memory cycles
Computer Startup
Computer-system operation
Interrupt transfers control to the interrupt service routine generally, through the
interrupt vector, which contains the addresses of all the service routines
Interrupt architecture must save the address of the interrupted instruction
A trap or exception is a software-generated interrupt caused either by an error or a
user request
An operating system is interrupt driven
Interrupt Handling
The operating system preserves the state of the CPU by storing registers and the
program counter
Determines which type of interrupt has occurred:
o polling
o vectored interrupt system
Separate segments of code determine what action should be taken for each type of
interrupt
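As a purely conceptual illustration, the interrupt vector can be pictured as a table of handler addresses indexed by interrupt number. The C sketch below is an assumption-laden model, not real kernel code: the vector numbers, handler names, and the explicit dispatch step are illustrative, since real hardware performs the lookup itself.

typedef void ( *isr_t )( void );                 /* an interrupt service routine */

#define NUM_VECTORS 256

static void timer_isr( void )    { /* save state, service the timer, restore state */ }
static void keyboard_isr( void ) { /* read the scan code and queue it for the OS */ }
static void default_isr( void )  { /* ignore or log an unexpected interrupt */ }

static isr_t interrupt_vector[ NUM_VECTORS ];    /* one handler address per interrupt type */

void init_vectors( void ) {
    for ( int i = 0; i < NUM_VECTORS; i++ )
        interrupt_vector[ i ] = default_isr;
    interrupt_vector[ 32 ] = timer_isr;          /* vector numbers are illustrative */
    interrupt_vector[ 33 ] = keyboard_isr;
}

void dispatch( int vector ) {                    /* on real hardware this lookup is automatic */
    interrupt_vector[ vector ]( );               /* transfer control to the service routine */
}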
Interrupt Timeline
Storage Structure
Main memory – the only large storage medium that the CPU can access directly
o Random access
o Typically volatile
Secondary storage – extension of main memory that provides large nonvolatile
storage capacity
Hard disks – rigid metal or glass platters covered with magnetic recording material
o Disk surface is logically divided into tracks, which are subdivided into
sectors
o The disk controller determines the logical interaction between the device and
the computer
Solid-state disks – faster than hard disks, nonvolatile
o Various technologies
o Becoming more popular
Storage systems organized in hierarchy
o Speed
o Cost
o Volatility
Caching – copying information into faster storage system; main memory can be
viewed as a cache for secondary storage
Device Driver for each device controller to manage I/O
o Provides uniform interface between controller and kernel
Storage-device hierarchy
I/O Structure
After I/O starts, control returns to user program only upon I/O completion
o Wait instruction idles the CPU until the next interrupt
o Wait loop (contention for memory access)
o At most one I/O request is outstanding at a time, no simultaneous I/O
processing
After I/O starts, control returns to user program without waiting for I/O completion
o System call – request to the OS to allow user to wait for I/O completion
o Device-status table contains entry for each I/O device indicating its type,
address, and state
o OS indexes into I/O device table to determine device status and to modify
table entry to include interrupt
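As a rough illustration, the device-status table can be pictured as an array of per-device records; the field names below are assumptions for the sketch, not taken from any real kernel.

typedef enum { DEV_IDLE, DEV_BUSY } dev_state_t;

typedef struct {
    int           type;            /* kind of device, e.g. disk, printer, terminal */
    unsigned long address;         /* device / controller address */
    dev_state_t   state;           /* idle or busy */
    void         *request_queue;   /* list of pending I/O requests for this device */
} device_entry;

device_entry device_table[ 32 ];   /* one entry per I/O device */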
Typical operation involves I/O requests, direct memory access ( DMA ), and
interrupt handling.
DMA is used for high-speed I/O devices able to transmit information at close to
memory speeds
Device controller transfers blocks of data from buffer storage directly to main
memory without CPU intervention
Only one interrupt is generated per block, rather than the one interrupt per byte
COMPUTER-SYSTEM ARCHITECTURE - Different Operating Systems for
Different Kinds of Computer Environments
Single-Processor Systems
One main CPU which manages the computer and runs user apps.
Other specialized processors ( disk controllers, GPUs, etc. ) do not run user
apps.
Multiprocessor systems (also known as parallel systems or tightly coupled systems) are
growing in importance. Such systems have two or more processors in close communication,
sharing the computer bus and sometimes the clock, memory, and peripheral devices.
Two types:
o Asymmetric multiprocessing – each processor is assigned a specific task
o Symmetric multiprocessing (SMP) – each processor performs all tasks
Multi-chip and multicore
Systems containing all chips
Chassis containing multiple separate systems
Clustered Systems
General structure of a clustered system
OPERATING-SYSTEM STRUCTURE
OPERATING-SYSTEM OPERATIONS
Interrupt-driven nature of modern OSes requires that erroneous processes not be able to
disturb anything else.
Dual-Mode Operation
Timer
PROCESS MANAGEMENT
A process is a program in execution. It is a unit of work within the system. Program is
a passive entity, process is an active entity.
Process needs resources to accomplish its task
o CPU, memory, I/O, files
o Initialization data
Process termination requires reclaim of any reusable resources
Single-threaded process has one program counter specifying location of next
instruction to execute
o Process executes instructions sequentially, one at a time, until completion
Multi-threaded process has one program counter per thread
Typically system has many processes, some user, some operating system running
concurrently on one or more CPUs
o Concurrency by multiplexing the CPUs among the processes / threads
The operating system is responsible for the following activities in connection with process
management:
Creating and deleting both user and system processes
Suspending and resuming processes
Providing mechanisms for process synchronization
Providing mechanisms for process communication
Providing mechanisms for deadlock handling
MEMORY MANAGEMENT
STORAGE MANAGEMENT
OS provides uniform, logical view of information storage
o Abstracts physical properties to logical storage unit - file
o Each medium is controlled by a device (e.g., disk drive, tape drive)
Varying properties include access speed, capacity, data-transfer rate,
access method (sequential or random)
File-System management
o Files usually organized into directories
o Access control on most systems to determine who can access what
OS activities include
Creating and deleting files and directories
Primitives to manipulate files and directories
Mapping files onto secondary storage
Backup files onto stable (non-volatile) storage media
Mass-Storage Management
Usually disks used to store data that does not fit in main memory or data that must be
kept for a “long” period of time
Proper management is of central importance
Entire speed of computer operation hinges on disk subsystem and its algorithms
OS activities
o Free-space management
o Storage allocation
o Disk scheduling
Some storage need not be fast
o Tertiary storage includes optical storage, magnetic tape
o Still must be managed – by OS or applications
o Varies between WORM (write-once, read-many-times) and RW (read-write)
Multitasking environments must be careful to use most recent value, no matter where
it is stored in the storage hierarchy
Multiprocessor environment must provide cache coherency in hardware such that all
CPUs have the most recent value in their cache
Distributed environment situation even more complex
o Several copies of a datum can exist
COMPUTING ENVIRONMENTS
Traditional Computing
Stand-alone general purpose machines
But blurred as most systems interconnect with others (i.e., the Internet)
Portals provide web access to internal systems
Network computers (thin clients) are like Web terminals
Mobile computers interconnect via wireless networks
Networking becoming ubiquitous – even home systems use firewalls to protect
home computers from Internet attacks
Mobile Computing
• Handheld smartphones, tablets, etc
• What is the functional difference between them and a “traditional” laptop?
• Extra hardware features (GPS, gyroscope) require additional OS support
• Allows new types of apps like augmented reality
• Use IEEE 802.11 wireless, or cellular data networks for connectivity
• Leaders are Apple iOS and Google Android
Distributed computing
Distributed computing
o Collection of separate, possibly heterogeneous, systems networked
together
Network is a communications path, TCP/IP most common
Local Area Network (LAN)
Wide Area Network (WAN)
Metropolitan Area Network (MAN)
Personal Area Network (PAN)
o Network Operating System provides features between systems across
network
Communication scheme allows systems to exchange messages
Illusion of a single system
Client-Server Computing
Dumb terminals supplanted by smart PCs
Many systems now servers, responding to requests generated by clients
Compute-server system provides an interface to client to request services (i.e.,
database)
File-server system provides interface for clients to store and retrieve files
Peer-to-Peer Computing
Another model of distributed system
P2P does not distinguish clients and servers
o Instead all nodes are considered peers
o May each act as client, server or both
o Node must join P2P network
Registers its service with central lookup service on network, or
Broadcast request for service and respond to requests for service via
discovery protocol
o Examples include Napster and Gnutella, Voice over IP (VoIP) such as Skype
Cloud Computing
Delivers computing, storage, even apps as a service across a network
Logical extension of virtualization, because it uses virtualization as the base for its
functionality.
o Amazon EC2 has thousands of servers, millions of virtual machines,
petabytes of storage available across the Internet, pay based on usage
Many types
o Public cloud – available via Internet to anyone willing to pay
o Private cloud – run by a company for the company’s own use
o Hybrid cloud – includes both public and private cloud components
o Software as a Service (SaaS) – one or more applications available via the
Internet (i.e., word processor)
o Platform as a Service (PaaS) – software stack ready for application use via the
Internet (i.e., a database server)
o Infrastructure as a Service (IaaS) – servers or storage available over Internet
(i.e., storage available for backup use)
Cloud computing environments composed of traditional OSes, plus VMMs, plus
cloud management tools
o Internet connectivity requires security like firewalls
o Load balancers spread traffic across multiple applications
OPERATING-SYSTEM SERVICES
OSes provide environments in which programs run, and services for the users of the system,
including:
User Interfaces - Means by which users can issue commands to the system.
Depending on the system these may be a command-line interface ( e.g. sh, csh, ksh,
tcsh, etc. ), a GUI interface ( e.g. Windows, X-Windows, KDE, Gnome, etc. ), or a
batch command system. The latter is generally found on older systems using punch cards
and job-control language, JCL, but may still be used today for specialty systems designed
for a single purpose.
Program Execution - The OS must be able to load a program into RAM, run the
program, and terminate the program, either normally or abnormally.
I/O Operations - The OS is responsible for transferring data to and from I/O devices,
including keyboards, terminals, printers, and storage devices.
File-System Manipulation - In addition to raw data storage, the OS is also
responsible for maintaining directory and subdirectory structures, mapping file names
to specific blocks of data storage, and providing tools for navigating and utilizing the
file system.
Communications - Inter-process communications, IPC, either between processes
running on the same processor, or between processes running on separate processors
or separate machines. May be implemented as either shared memory or message
passing, ( or some systems may offer both. )
Error Detection - Both hardware and software errors must be detected and handled
appropriately, with a minimum of harmful repercussions. Some systems may include
complex error avoidance or recovery systems, including backups, RAID drives, and
other redundant systems. Debugging and diagnostic tools aid users and administrators
in tracing down the cause of problems.
Resource Allocation - E.g. CPU cycles, main memory, storage space, and peripheral
devices. Some resources are managed with generic systems and others with very
carefully designed and specially tuned systems, customized for a particular resource
and operating environment.
Accounting - Keeping track of system activity and resource usage, either for billing
purposes or for statistical record keeping that can be used to optimize future
performance.
Protection and Security - Preventing harm to the system and to resources, either
through wayward internal processes or malicious outsiders. Authentication,
ownership, and restricted access are obvious parts of this system. Highly secure
systems may log all process activity down to excruciating detail, and security
regulations may dictate the storage of those records on permanent non-erasable media for
extended times in secure ( off-site ) facilities.
Command Interpreter
Gets and processes the next user request, and launches the requested programs.
In some systems the CI may be incorporated directly into the kernel.
More commonly the CI is a separate program that launches once the user logs in or
otherwise accesses the system.
UNIX, for example, provides the user with a choice of different shells, which may
either be configured to launch automatically at login, or which may be changed on the
fly. ( Each of these shells uses a different configuration file of initial settings and
commands that are executed upon startup. )
Different shells provide different functionality, in terms of certain commands that are
implemented directly by the shell without launching any external programs. Most
provide at least a rudimentary command interpretation structure for use in shell script
programming ( loops, decision constructs, variables, etc. )
An interesting distinction is the processing of wild card file naming and I/O re-
direction. On UNIX systems those details are handled by the shell, and the program
which is launched sees only a list of filenames generated by the shell from the wild
cards. On a DOS system, the wild cards are passed along to the programs, which can
interpret the wild cards as the program sees fit.
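For instance, a tiny C program that just prints its arguments makes the difference visible: run as ./args *.txt on UNIX, the shell expands the pattern first, so argv already contains the matching filenames. ( A minimal sketch; the program name args is arbitrary. )

#include <stdio.h>

int main( int argc, char *argv[ ] ) {
    /* print exactly the arguments the program receives */
    for ( int i = 0; i < argc; i++ )
        printf( "argv[ %d ] = %s\n", i, argv[ i ] );
    return 0;
}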
Graphical User Interface, GUI
Generally implemented as a desktop metaphor, with file folders, trash cans, and
resource icons.
Icons represent some item on the system, and respond accordingly when the icon is
activated.
First developed in the early 1970's at Xerox PARC research facility.
In some systems the GUI is just a front end for activating a traditional command line
interpreter running in the background. In others the GUI is a true graphical shell in its
own right.
Mac has traditionally provided ONLY the GUI interface. With the advent of OSX (
based partially on UNIX ), a command line interface has also become available.
Because mice and keyboards are impractical for small mobile devices, these normally
use a touch-screen interface today, that responds to various patterns of swipes or
"gestures". When these first came out they often had a physical keyboard and/or a
trackball of some kind built in, but today a virtual keyboard is more commonly
implemented on the touch screen.
Choice of interface
Most modern systems allow individual users to select their desired interface, and to
customize its operation, as well as the ability to switch between different interfaces as
needed. System administrators generally determine which interface a user starts with
when they first log in.
GUI interfaces usually provide an option for a terminal emulator window for entering
command-line commands.
Command-line commands can also be entered into shell scripts, which can then be
run like any other programs.
SYSTEM CALLS
System calls provide a means for user or application programs to call upon the
services of the operating system.
Generally written in C or C++, although some are written in assembly for optimal
performance.
Figure 2.4 illustrates the sequence of system calls required to copy a file:
Example of how system calls are used.
You can use "strace" to see more examples of the large number of system calls
invoked by a single simple command. Read the man page for strace, and try some
simple examples. ( strace mkdir temp, strace cd temp, strace date > t.t, strace cp t.t t.2,
etc. )
Most programmers do not use the low-level system calls directly, but instead use an
"Application Programming Interface", API. The following shows the read( ) call
available in the API on UNIX-based systems:
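( A minimal sketch; the prototype is the standard POSIX one declared in <unistd.h>, and reading from file descriptor 0, standard input, is just for illustration. )

#include <stdio.h>
#include <unistd.h>     /* ssize_t read( int fd, void *buf, size_t count ); */

int main( void ) {
    char buf[ 128 ];
    /* read up to 128 bytes from standard input ( file descriptor 0 ) */
    ssize_t n = read( 0, buf, sizeof( buf ) );
    if ( n == -1 )
        perror( "read" );                    /* -1 indicates an error */
    else
        printf( "read %zd bytes\n", n );     /* 0 would indicate end of file */
    return 0;
}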
The use of APIs instead of direct system calls provides for greater program portability
between different systems. The API then makes the appropriate system calls through the
system call interface, using a table lookup to access specific numbered system calls, as
shown in Figure 2.6:
Figure 2.6 - The handling of a user application invoking the open( ) system call
Parameters are generally passed to system calls via registers, or less commonly, by
values pushed onto the stack. Large blocks of data are generally accessed indirectly,
through a memory address passed in a register or on the stack, as shown in Figure 2.7:
Six major categories, as outlined in the figure and the following six subsections:
Standard library calls may also generate system calls, as shown here:
Process Control
Process control system calls include end, abort, load, execute, create process,
terminate process, get/set process attributes, wait for time or event, signal event, and
allocate and free memory.
Processes must be created, launched, monitored, paused, resumed, and eventually
stopped.
When one process pauses or stops, then another must be launched or resumed
When processes stop abnormally it may be necessary to provide core dumps and/or
other diagnostic or recovery tools.
Compare DOS ( a single-tasking system ) with UNIX ( a multi-tasking system ).
o When a process is launched in DOS, the command interpreter first unloads as
much of itself as it can to free up memory, then loads the process and transfers
control to it. The interpreter does not resume until the process has completed,
as shown in Figure 2.9:
FreeBSD running multiple programs
File Management
File management system calls include create file, delete file, open, close, read, write,
reposition, get file attributes, and set file attributes.
These operations may also be supported for directories as well as ordinary files.
Device Management
Device management system calls include request device, release device, read, write,
reposition, get/set device attributes, and logically attach or detach devices.
Devices may be physical ( e.g. disk drives ), or virtual / abstract ( e.g. files, partitions,
and RAM disks ).
Some systems represent devices as special files in the file system, so that accessing
the "file" calls upon the appropriate device drivers in the OS. See for example the /dev
directory on any UNIX system.
Information Maintenance
Information maintenance system calls include calls to get/set the time, date, system
data, and process, file, or device attributes.
Systems may also provide the ability to dump memory at any time, single step
programs pausing execution after each instruction, and tracing the operation of
programs, all of which can help to debug programs.
Communication
The message passing model must support calls to:
o Identify a remote process and/or host with which to communicate.
o Establish a connection between the two processes.
o Open and close the connection as needed.
o Transmit messages along the connection.
o Wait for incoming messages, in either a blocking or non-blocking state.
o Delete the connection when no longer needed.
The shared memory model must support calls to:
o Create and access memory that is shared amongst processes ( and threads. )
o Provide locking mechanisms restricting simultaneous access.
o Free up shared memory and/or dynamically allocate it as needed.
Message passing is simpler and easier, ( particularly for inter-computer
communications ), and is generally appropriate for small amounts of data.
Shared memory is faster, and is generally the better approach where large amounts of
data are to be shared, ( particularly when most processes are reading the data rather
than writing it, or at least when only one or a small number of processes need to
change any given data item. )
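As a concrete illustration of the shared-memory model, the following minimal POSIX sketch creates and writes a shared-memory object; the object name "/demo_shm" and the 4096-byte size are arbitrary, and a consumer process would shm_open( ) and mmap( ) the same name to read the data ( link with -lrt on some systems ):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main( void ) {
    const char *name = "/demo_shm";                      /* hypothetical object name */
    const size_t size = 4096;

    int fd = shm_open( name, O_CREAT | O_RDWR, 0666 );   /* create the shared object */
    if ( fd == -1 ) { perror( "shm_open" ); return 1; }

    ftruncate( fd, size );                               /* set its size */
    char *ptr = mmap( NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 );
    if ( ptr == MAP_FAILED ) { perror( "mmap" ); return 1; }

    strcpy( ptr, "hello from the producer" );            /* write into the shared region */
    /* a consumer process would map the same object and read this string */

    munmap( ptr, size );
    shm_unlink( name );                                  /* remove the object when done */
    return 0;
}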
Protection
Protection provides mechanisms for controlling which users / processes have access
to which system resources.
System calls allow the access mechanisms to be adjusted as needed, and for non-
privileged users to be granted elevated access permissions under carefully controlled
temporary circumstances.
Once only of concern on multi-user systems, protection is now important on all
systems, in the age of ubiquitous network connectivity.
SYSTEM PROGRAMS
o Communications - Programs for providing connectivity between processes
and users, including mail, web browsers, remote logins, file transfers, and
remote command execution.
o Background services - System daemons are commonly started when the
system is booted, and run for as long as the system is running, handling
necessary services. Examples include network daemons, print servers, process
schedulers, and system error monitoring services.
Most operating systems today also come complete with a set of application
programs to provide additional services, such as copying files or checking the time
and date.
Most users' view of the system is determined by their command interpreter and the
application programs. Most never make system calls, even through the API, ( with the
exception of simple ( file ) I/O in user-written programs. )
Design Goals
Requirements define properties which the finished system must have, and are a
necessary first step in designing any large complex system.
o User requirements are features that users care about and understand, and are
written in commonly understood vernacular. They generally do not include
any implementation details, and are written similar to the product description
one might find on a sales brochure or the outside of a shrink-wrapped box.
o System requirements are written for the developers, and include more details
about implementation specifics, performance requirements, compatibility
constraints, standards compliance, etc. These requirements serve as a
"contract" between the customer and the developers, ( and between developers
and subcontractors ), and can get quite detailed.
Requirements for operating systems can vary greatly depending on the planned scope
and usage of the system. ( Single user / multi-user, specialized system / general
purpose, high/low security, performance needs, operating environment, etc. )
Implementation
Traditionally OSes were written in assembly language. This provided direct control
over hardware-related issues, but inextricably tied a particular OS to a particular HW
platform.
Recent advances in compiler efficiencies mean that most modern OSes are written in
C, or more recently, C++. Critical sections of code are still written in assembly
language, ( or written in C, compiled to assembly, and then fine-tuned and optimized
by hand from there. )
Operating systems may be developed using emulators of the target hardware,
particularly if the real hardware is unavailable ( e.g. not built yet ), or not a suitable
platform for development, ( e.g. smart phones, game consoles, or other similar
devices. )
OPERATING-SYSTEM STRUCTURE
Simple Structure
When DOS was originally written its developers had no idea how big and important it would
eventually become. It was written by a few programmers in a relatively short amount of time,
without the benefit of modern software engineering techniques, and then gradually grew over
time to exceed its original expectations. It does not break the system into subsystems, and has
no distinction between user and kernel modes, allowing all programs direct access to the
underlying hardware. ( Note that user versus kernel mode was not supported by the 8088 chip
set anyway, so that really wasn't an option back then. )
The original UNIX OS used a simple layered approach, but almost all the OS was in one big
layer, not really breaking the OS down into layered subsystems:
Layered Approach
Another approach is to break the OS into a number of smaller layers, each of which
rests on the layer below it, and relies solely on the services provided by the next lower
layer.
This approach allows each layer to be developed and debugged independently, with
the assumption that all lower layers have already been debugged and are trusted to
deliver proper services.
The problem is deciding in what order to place the layers, as no layer can call
upon the services of any higher layer, and so many chicken-and-egg situations may
arise.
Layered approaches can also be less efficient, as a request for service from a higher
layer has to filter through all lower layers before it reaches the HW, possibly with
significant processing at each step.
A layered operating system
Microkernels
The basic idea behind micro kernels is to remove all non-essential services from the
kernel, and implement them as system applications instead, thereby making the kernel
as small and efficient as possible.
Most microkernels provide basic process and memory management, and message
passing between other services, and not much more.
Security and protection can be enhanced, as most services are performed in user
mode, not kernel mode.
System expansion can also be easier, because it only involves adding more system
applications, not rebuilding a new kernel.
Mach was the first and most widely known microkernel, and now forms a major
component of Mac OSX.
Windows NT was originally a microkernel, but suffered from performance problems
relative to Windows 95. NT 4.0 improved performance by moving more services into
the kernel, and XP is back to being more monolithic.
Another microkernel example is QNX, a real-time OS for embedded systems.
Architecture of a typical microkernel
Modules
Hybrid Systems
Most OSes today do not strictly adhere to one architecture, but are hybrids of several.
Mac OS X
The Mac OS X architecture relies on the Mach microkernel for basic system
management services, and the BSD kernel for additional services. Application
services and dynamically loadable modules ( kernel extensions ) provide the rest of
the OS functionality:
iOS
The iOS operating system was developed by Apple for iPhones and iPads. It runs
with lower memory and computing power needs than Mac OS X, and supports a
touchscreen interface and graphics for small screens:
Android
The Android OS was developed for Android smartphones and tablets by the Open
Handset Alliance, primarily Google.
Android is an open-source OS, as opposed to iOS, which has led to its popularity.
Android includes versions of Linux and a Java virtual machine both optimized for
small platforms.
Android apps are developed using a special Java-for-Android development
environment.
Unit-2
Process Concept
The Process
Process State
For each process there is a Process Control Block, PCB, which stores the following ( types of )
process-specific information, as illustrated in the figure. ( Specific details may vary from system to
system. )
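A rough C sketch of such a PCB follows; the field names and sizes are illustrative assumptions, not those of any particular kernel.

typedef enum { NEW, READY, RUNNING, WAITING, TERMINATED } proc_state_t;

typedef struct pcb {
    int            pid;               /* process identifier */
    proc_state_t   state;             /* current process state */
    unsigned long  program_counter;   /* saved program counter */
    unsigned long  registers[ 16 ];   /* saved CPU register contents */
    int            priority;          /* CPU-scheduling information */
    void          *page_table;        /* memory-management information */
    int            open_files[ 16 ];  /* I/O status: open file descriptors */
    unsigned long  cpu_time_used;     /* accounting information */
    struct pcb    *parent;            /* link to the parent process */
} pcb_t;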
Process Scheduling
The two main objectives of the process scheduling system are to keep the CPU busy at all
times and to deliver "acceptable" response times for all programs, particularly for
interactive ones.
The process scheduler must meet these objectives by implementing suitable policies for
swapping processes in and out of the CPU.
( Note that these objectives can be conflicting. In particular, every time the system steps
in to swap processes it takes up time on the CPU to do so, which is thereby "lost" from
doing any useful productive work. )
Scheduling Queues
Schedulers
Context Switch
Operations on Processes
Process Creation
Processes may create other processes through appropriate system calls, such
as fork or spawn. The process which does the creating is termed the parent of the other
process, which is termed its child.
Each process is given an integer identifier, termed its process identifier, or PID. The
parent PID ( PPID ) is also stored for each process.
On typical UNIX systems the process scheduler is termed sched, and is given PID 0. The
first thing it does at system startup time is to launch init, which gives that process PID 1.
Init then launches all system daemons and user logins, and becomes the ultimate parent
of all other processes. Figure shows a typical process tree for a Linux system, and other
systems will have similar though not identical trees:
Depending on system implementation, a child process may receive some amount of
shared resources with its parent. Child processes may or may not be limited to a subset of
the resources originally allocated to the parent, preventing runaway children from
consuming all of a certain system resource.
There are two options for the parent process after creating the child:
1. Wait for the child process to terminate before proceeding. The parent makes a
wait( ) system call, for either a specific child or for any child, which causes the
parent process to block until the wait( ) returns. UNIX shells normally wait for
their children to complete before issuing a new prompt.
2. Run concurrently with the child, continuing to process without waiting. This is the
operation seen when a UNIX shell runs a process as a background task. It is also
possible for the parent to run for a while, and then wait for the child later, which
might occur in a sort of a parallel processing operation. ( E.g. the parent may fork
off a number of children without waiting for any of them, then do a little work of
its own, and then wait for the children. )
Two possibilities for the address space of the child relative to the parent:
1. The child may be an exact duplicate of the parent, sharing the same program and
data segments in memory. Each will have their own PCB, including program
counter, registers, and PID. This is the behavior of the fork system call in UNIX.
2. The child process may have a new program loaded into its address space, with all
new code and data segments. This is the behavior of the spawn system calls in
Windows. UNIX systems implement this as a second step, using the exec system
call.
Figure 3.10 below shows the fork and exec process on a UNIX system. Note that
the fork system call returns in both the parent and the child - it returns a
zero to the child process and the non-zero child PID to the parent, so the return value
indicates which process is which. Process IDs can be looked up any time for the current
process or its direct parent using the getpid( ) and getppid( ) system calls respectively.
Processes may request their own termination by making the exit( ) system call, typically
returning an int. This int is passed along to the parent if it is doing a wait( ), and is
typically zero on successful completion and some non-zero code in the event of
problems.
o child code:
int exitCode;
exit( exitCode ); // return exitCode; has the same effect when executed from main( )
o parent code:
pid_t pid;
int status;
pid = wait( &status ); // pid indicates which child exited; the exitCode is in the low-order bits
of status
// macros can test the high-order bits of status for why it stopped
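Putting the pieces together, here is a minimal sketch of the standard UNIX fork( ) / exec( ) / wait( ) pattern; the program the child runs ( ls -l ) is arbitrary, and error handling is abbreviated.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main( void ) {
    pid_t pid = fork( );                              /* create a child process */
    if ( pid < 0 ) {                                  /* fork failed */
        perror( "fork" );
        exit( 1 );
    } else if ( pid == 0 ) {                          /* child: load a new program */
        execlp( "ls", "ls", "-l", ( char * ) NULL );
        perror( "execlp" );                           /* reached only if exec fails */
        exit( 1 );
    } else {                                          /* parent: wait for the child */
        int status;
        wait( &status );
        if ( WIFEXITED( status ) )
            printf( "child %d exited with code %d\n", ( int ) pid, WEXITSTATUS( status ) );
    }
    return 0;
}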
Processes may also be terminated by the system for a variety of reasons, including:
o The inability of the system to deliver necessary system resources.
o In response to a KILL command, or other unhandled process interrupt.
o A parent may kill its children if the task assigned to them is no longer needed.
o If the parent exits, the system may or may not allow the child to continue without
a parent. ( On UNIX systems, orphaned processes are generally inherited by init,
which then proceeds to kill them. The UNIX nohup command allows a child to
continue executing after its parent has exited. )
When a process terminates, all of its system resources are freed up, open files flushed and
closed, etc. The process termination status and execution times are returned to the parent
if the parent is waiting for the child to terminate, or eventually returned to init if the
process becomes an orphan. ( Processes which are trying to terminate but which cannot
because their parent is not waiting for them are termed zombies. These are eventually
inherited by init as orphans and killed off. Note that modern UNIX shells do not produce
as many orphans and zombies as older systems used to. )
Interprocess Communication
Independent processes operating concurrently on a system are those that can neither
affect other processes nor be affected by other processes.
Cooperating Processes are those that can affect or be affected by other processes. There
are several reasons why cooperating processes are allowed:
o Information Sharing - There may be several processes which need access to the
same file for example. ( e.g. pipelines. )
o Computation speedup - Often a solution to a problem can be solved faster if the
problem can be broken down into sub-tasks to be solved simultaneously (
particularly when multiple processors are involved. )
o Modularity - The most efficient architecture may be to break a system down into
cooperating modules. ( E.g. databases with a client-server architecture. )
o Convenience - Even a single user may be multi-tasking, such as editing,
compiling, printing, and running the same code in different windows.
Shared Memory is faster once it is set up, because no system calls are required and access
occurs at normal memory speeds. However it is more complicated to set up, and doesn't
work as well across multiple computers. Shared memory is generally preferable when
large amounts of information must be shared quickly on the same computer.
Message Passing requires system calls for every message transfer, and is therefore
slower, but it is simpler to set up and works well across multiple computers. Message
passing is generally preferable when the amount and/or frequency of data transfers is
small, or when multiple computers are involved.
Shared-Memory Systems
In general the memory to be shared in a shared-memory system is initially within
the address space of a particular process, which needs to make system calls in
order to make that memory publicly available to one or more other processes.
Other processes which wish to use the shared memory must then make their own
system calls to attach the shared memory area onto their address space.
Generally a few messages must be passed back and forth between the cooperating
processes first in order to set up and coordinate the shared memory access.
The producer-consumer problem is a classic example, in which one process produces
data and another process consumes it. ( In this example the data is consumed in the
order in which it is produced, although that could vary. )
The data is passed via an intermediary buffer, which may be either unbounded or
bounded. With a bounded buffer the producer may have to wait until there is
space available in the buffer, but with an unbounded buffer the producer will
never need to wait. The consumer may need to wait in either case until there is
data available.
This example uses shared memory and a circular queue. Note in the code below
that only the producer changes "in", and only the consumer changes "out", and
that they can never be accessing the same array location at the same time.
First the following data is set up in the shared memory area:
#define BUFFER_SIZE 10
typedef struct {
...
} item;
Then the producer process. Note that the buffer is full when "in" is one less than
"out" in a circular sense:
item nextProduced;
while( true ) {
Then the consumer process. Note that the buffer is empty when "in" is equal to
"out":
item nextConsumed;
while( true ) {
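The loop bodies are not shown above. A complete sketch of the bounded-buffer producer and consumer, assuming a shared buffer array and in / out indices alongside the declarations shown earlier, might look like this:

item buffer[ BUFFER_SIZE ];    /* shared circular buffer */
int in = 0;                    /* index of the next free slot */
int out = 0;                   /* index of the next full slot */

/* producer */
item nextProduced;
while( true ) {
    /* produce an item in nextProduced */
    while( ( ( in + 1 ) % BUFFER_SIZE ) == out )
        ;                      /* buffer is full -- do nothing */
    buffer[ in ] = nextProduced;
    in = ( in + 1 ) % BUFFER_SIZE;
}

/* consumer */
item nextConsumed;
while( true ) {
    while( in == out )
        ;                      /* buffer is empty -- do nothing */
    nextConsumed = buffer[ out ];
    out = ( out + 1 ) % BUFFER_SIZE;
    /* consume the item in nextConsumed */
}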
Message-Passing Systems
Message passing systems must support at a minimum system calls for "send
message" and "receive message".
A communication link must be established between the cooperating processes
before messages can be sent.
There are three key issues to be resolved in message passing systems as further
explored in the next three subsections:
o Direct or indirect communication ( naming )
o Synchronous or asynchronous communication
o Automatic or explicit buffering.
Naming
With direct communication the sender must know the name of the
receiver to which it wishes to send a message.
o There is a one-to-one link between every sender-receiver pair.
o For symmetric communication, the receiver must also know the
specific name of the sender from which it wishes to receive
messages.
For asymmetric communications, this is not necessary.
Indirect communication uses shared mailboxes, or ports.
o Multiple processes can share the same mailbox or boxes.
o Only one process can read any given message in a mailbox.
Initially the process that creates the mailbox is the owner, and is
the only one allowed to read mail in the mailbox, although this
privilege may be transferred.
( Of course the process that reads the message can
immediately turn around and place an identical message
back in the box for someone else to read, but that may put it
at the back end of a queue of messages. )
o The OS must provide system calls to create and delete mailboxes,
and to send and receive messages to/from mailboxes.
Synchronization
Buffering
Messages are passed via queues, which may have one of three capacity
configurations:
1. Zero capacity - Messages cannot be stored in the queue, so
senders must block until receivers accept the messages.
2. Bounded capacity- There is a certain pre-determined finite
capacity in the queue. Senders must block if the queue is full, until
space becomes available in the queue, but may be either blocking
or non-blocking otherwise.
3. Unbounded capacity - The queue has a theoretical infinite
capacity, so senders are never forced to block.
Pipes
Pipes are one of the earliest and simplest channels of communications between ( UNIX )
processes.
There are four key considerations in implementing pipes:
1. Unidirectional or Bidirectional communication?
2. Is bidirectional communication half-duplex or full-duplex?
3. Must a relationship such as parent-child exist between the processes?
4. Can pipes communicate over a network, or only on the same machine?
The following sections examine these issues on UNIX and Windows
Ordinary Pipes
Ordinary pipes are uni-directional, with a reading end and a writing end. ( If bidirectional
communications are needed, then a second pipe is required. )
In UNIX ordinary pipes are created with the system call "int pipe( int fd [ ] )".
o The return value is 0 on success, -1 if an error occurs.
o The int array must be allocated before the call, and the values are filled in by the
pipe system call:
fd[ 0 ] is filled in with a file descriptor for the reading end of the pipe
fd[ 1 ] is filled in with a file descriptor for the writing end of the pipe
o UNIX pipes are accessible as files, using standard read( ) and write( ) system
calls.
o Ordinary pipes are only accessible within the process that created them.
Typically a parent creates the pipe before forking off a child.
When the child inherits open files from its parent, including the pipe
file(s), a channel of communication is established.
Each process ( parent and child ) should first close the ends of the pipe
that they are not using. For example, if the parent is writing to the pipe and
the child is reading, then the parent should close the reading end of its pipe
after the fork and the child should close the writing end.
Figure 3.22 shows an ordinary pipe in UNIX.
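A minimal sketch of an ordinary pipe between a parent and a child ( the five-byte message is arbitrary ):

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main( void ) {
    int fd[ 2 ];                            /* fd[ 0 ] = read end, fd[ 1 ] = write end */
    char buf[ 64 ];

    if ( pipe( fd ) == -1 ) { perror( "pipe" ); return 1; }

    if ( fork( ) == 0 ) {                   /* child: reads from the pipe */
        close( fd[ 1 ] );                   /* close the unused write end */
        ssize_t n = read( fd[ 0 ], buf, sizeof( buf ) - 1 );
        if ( n > 0 ) {
            buf[ n ] = '\0';
            printf( "child read: %s\n", buf );
        }
        close( fd[ 0 ] );
    } else {                                /* parent: writes to the pipe */
        close( fd[ 0 ] );                   /* close the unused read end */
        write( fd[ 1 ], "hello", 5 );
        close( fd[ 1 ] );
        wait( NULL );
    }
    return 0;
}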
Threads
Overview
A thread is a basic unit of CPU utilization, consisting of a program counter, a stack, and
a set of registers, ( and a thread ID. )
Traditional ( heavyweight ) processes have a single thread of control - There is one
program counter, and one sequence of instructions that can be carried out at any given
time.
As shown in Figure, multi-threaded applications have multiple threads within a single
process, each having their own program counter, stack and set of registers, but sharing
common code, data, and certain structures such as open files.
Single-threaded and multithreaded processes
Motivation
Threads are very useful in modern programming whenever a process has multiple
tasks to perform independently of the others.
This is particularly true when one of the tasks may block, and it is desired to
allow the other tasks to proceed without blocking.
For example in a word processor, a background thread may check spelling and
grammar while a foreground thread processes user input ( keystrokes ), while yet
a third thread loads images from the hard drive, and a fourth does periodic
automatic backups of the file being edited.
Another example is a web server - Multiple threads allow for multiple requests to
be satisfied simultaneously, without having to service requests sequentially or to
fork off separate processes for every incoming request. ( The latter is how this
sort of thing was done before the concept of threads was developed. A daemon
would listen at a port, fork off a child for every incoming request to be processed,
and then go back to listening to the port. )
Multithreaded server architecture
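As a minimal illustration of multithreaded programming with Pthreads ( compile with -pthread; the worker function and the thread count of three are arbitrary ):

#include <pthread.h>
#include <stdio.h>

void *worker( void *arg ) {
    int id = *( int * ) arg;
    printf( "thread %d running\n", id );
    return NULL;
}

int main( void ) {
    pthread_t tids[ 3 ];
    int ids[ 3 ] = { 0, 1, 2 };

    for ( int i = 0; i < 3; i++ )
        pthread_create( &tids[ i ], NULL, worker, &ids[ i ] );   /* spawn the threads */
    for ( int i = 0; i < 3; i++ )
        pthread_join( tids[ i ], NULL );                         /* wait for each to finish */
    return 0;
}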
Benefits
Multithreading Models
There are two types of threads to be managed in a modern system: User threads and
kernel threads.
User threads are supported above the kernel, without kernel support. These are the
threads that application programmers would put into their programs.
Kernel threads are supported within the kernel of the OS itself. All modern OSes support
kernel level threads, allowing the kernel to perform multiple simultaneous tasks and/or to
service multiple kernel system calls simultaneously.
In a specific implementation, the user threads must be mapped to kernel threads, using
one of the following strategies.
Many-To-One Model
In the many-to-one model, many user-level threads are all mapped onto a single
kernel thread.
Thread management is handled by the thread library in user space, which is very
efficient.
However, if a blocking system call is made, then the entire process blocks, even if
the other user threads would otherwise be able to continue.
Because a single kernel thread can operate only on a single CPU, the many-to-one
model does not allow individual processes to be split across multiple CPUs.
Green threads ( an early Solaris thread library ) and GNU Portable Threads implemented
the many-to-one model in the past, but few systems continue to do so today.
Many-to-one model
One-To-One Model
The one-to-one model creates a separate kernel thread to handle each user thread.
One-to-one model overcomes the problems listed above involving blocking
system calls and the splitting of processes across multiple CPUs.
However the overhead of the one-to-one model is more significant: every user thread
requires a kernel thread, which slows down the system.
Most implementations of this model place a limit on how many threads can be
created.
Linux and Windows from 95 to XP implement the one-to-one model for threads.
One-to-one model
Many-To-Many Model
The many-to-many model multiplexes any number of user threads onto an equal
or smaller number of kernel threads, combining the best features of the one-to-one
and many-to-one models.
Users have no restrictions on the number of threads created.
Blocking kernel system calls do not block the entire process.
Processes can be split across multiple processors.
Individual processes may be allocated variable numbers of kernel threads,
depending on the number of CPUs present and other factors.
Many-to-many model
One popular variation of the many-to-many model is the two-tier model, which
allows either many-to-many or one-to-one operation.
IRIX, HP-UX, and Tru64 UNIX use the two-tier model, as did Solaris prior to
Solaris 9.
Two-level model
CPU Scheduling
Basic Concepts
Almost all programs have some alternating cycle of CPU number crunching and waiting
for I/O of some kind. ( Even a simple fetch from memory takes a long time relative to
CPU speeds. )
In a simple system running a single process, the time spent waiting for I/O is wasted, and
those CPU cycles are lost forever.
A scheduling system allows one process to use the CPU while another is waiting for I/O,
thereby making full use of otherwise lost CPU cycles.
The challenge is to make the overall system as "efficient" and "fair" as possible, subject
to varying and often dynamic conditions, and where "efficient" and "fair" are somewhat
subjective terms, often subject to shifting priority policies.
Almost all processes alternate between two states in a continuing cycle, as shown
in Figure 6.1 below :
o A CPU burst of performing calculations, and
o An I/O burst, waiting for data transfer in or out of the system.
CPU Scheduler
Whenever the CPU becomes idle, it is the job of the CPU Scheduler ( a.k.a. the
short-term scheduler ) to select another process from the ready queue to run next.
The storage structure for the ready queue and the algorithm used to select the next
process are not necessarily a FIFO queue. There are several alternatives to choose
from, as well as numerous adjustable parameters for each algorithm, which is the
basic subject of this entire chapter.
Preemptive Scheduling
Dispatcher
The dispatcher is the module that gives control of the CPU to the process
selected by the scheduler. This function involves:
o Switching context.
o Switching to user mode.
o Jumping to the proper location in the newly loaded program.
The dispatcher needs to be as fast as possible, as it is run on every context switch.
The time consumed by the dispatcher is known as dispatch latency.
Scheduling Criteria
There are several different criteria to consider when trying to select the "best" scheduling
algorithm for a particular situation and environment, including:
o CPU utilization - Ideally the CPU would be busy 100% of the time, so as to
waste 0 CPU cycles. On a real system CPU usage should range from 40% ( lightly
loaded ) to 90% ( heavily loaded. )
o Throughput - Number of processes completed per unit time. May range from 10
/ second to 1 / hour depending on the specific processes.
o Turnaround time - Time required for a particular process to complete, from
submission time to completion. ( Wall clock time. )
o Waiting time - How much time processes spend in the ready queue waiting their
turn to get on the CPU.
( Load average - The average number of processes sitting in the ready
queue waiting their turn to get into the CPU. Reported in 1-minute, 5-
minute, and 15-minute averages by "uptime" and "who". )
o Response time - The time taken in an interactive program from the issuance of a
command to the start of the response to that command.
In general one wants to optimize the average value of a criterion ( maximize CPU
utilization and throughput, and minimize all the others. ) However sometimes one wants
to do something different, such as to minimize the maximum response time.
Sometimes it is more desirable to minimize the variance of a criterion than its actual
value, i.e. users are more accepting of a consistent, predictable system than an
inconsistent one, even if it is a little bit slower.
Scheduling Algorithms
The following subsections will explain several common scheduling strategies, looking at only a
single CPU burst each for a small number of processes. Obviously real systems have to deal with
a lot more simultaneous processes executing their CPU-I/O burst cycles.
First-Come First-Serve Scheduling, FCFS
FCFS is very simple - Just a FIFO queue, like customers waiting in line at the
bank or the post office or at a copying machine.
Unfortunately, however, FCFS can yield some very long average wait times,
particularly if the first process to get there takes a long time. For example,
consider three processes P1, P2, and P3 with CPU burst times of 24, 3, and 3 ms respectively:
In the first Gantt chart below, process P1 arrives first. The average waiting time
for the three processes is ( 0 + 24 + 27 ) / 3 = 17.0 ms.
In the second Gantt chart below, the same three processes have an average wait
time of ( 0 + 3 + 6 ) / 3 = 3.0 ms. The total run time for the three bursts is the
same, but in the second case two of the three finish much quicker, and the other
process is only delayed by a short amount.
FCFS can also block the system in a busy dynamic system in another way, known
as the convoy effect. When one CPU intensive process blocks the CPU, a number
of I/O intensive processes can get backed up behind it, leaving the I/O devices
idle. When the CPU hog finally relinquishes the CPU, then the I/O processes pass
through the CPU quickly, leaving the CPU idle while everyone queues up for I/O,
and then the cycle repeats itself when the CPU intensive process gets back to the
ready queue.
Shortest-Job-First Scheduling, SJF
The idea behind the SJF algorithm is to pick the quickest fastest little job that
needs to be done, get it out of the way first, and then pick the next smallest fastest
job to do next.
( Technically this algorithm picks a process based on the next shortest CPU
burst, not the overall process time. )
For example, the Gantt chart below is based upon the following CPU burst times,
( and the assumption that all jobs arrive at the same time. )
o In this scheme the previous estimate contains the history of all previous
times, and alpha serves as a weighting factor for the relative importance of
recent data versus past history. If alpha is 1.0, then past history is ignored,
and we assume the next burst will be the same length as the last burst. If
alpha is 0.0, then all measured burst times are ignored, and we just assume
a constant burst time.
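The exponential average being described is usually written as follows ( a standard formulation; here t_n is the measured length of the nth CPU burst and tau_{n+1} is the prediction for the next burst ):

\tau_{n+1} = \alpha \, t_n + ( 1 - \alpha ) \, \tau_n , \qquad 0 \le \alpha \le 1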
SJF can be either preemptive or non-preemptive. Preemption occurs when a new process arrives
in the ready queue that has a predicted burst time shorter than the time remaining in the process
whose burst is currently on the CPU. Preemptive SJF is sometimes referred to as shortest
remaining time first scheduling.
For example, the following Gantt chart is based upon the following data:
Priority Scheduling
Priority scheduling is a more general case of SJF, in which each job is assigned a
priority and the job with the highest priority gets scheduled first. ( SJF uses the
inverse of the next expected burst time as its priority - The smaller the expected
burst, the higher the priority. )
Note that in practice, priorities are implemented using integers within a fixed
range, but there is no agreed-upon convention as to whether "high" priorities use
large numbers or small numbers. This book uses low number for high priorities,
with 0 being the highest possible priority.
For example, the following Gantt chart is based upon these process burst times
and priorities, and yields an average waiting time of 8.2 ms:
Round Robin Scheduling
Round robin scheduling is similar to FCFS scheduling, except that each CPU burst is
assigned a time limit called the time quantum.
When a process is given the CPU, a timer is set for whatever value has been set
for a time quantum.
o If the process finishes its burst before the time quantum timer expires, then
it is swapped out of the CPU just like the normal FCFS algorithm.
o If the timer goes off first, then the process is swapped out of the CPU and
moved to the back end of the ready queue.
The ready queue is maintained as a circular queue, so when all processes have had
a turn, then the scheduler gives the first process another turn, and so on.
RR scheduling can give the effect of all processes sharing the CPU equally,
although the average wait time can be longer than with other scheduling
algorithms. In the following example ( using a time quantum of 4 ms ) the average wait time is 5.66 ms.
Process Burst Time
P1 24
P2 3
P3 3
In general, turnaround time is minimized if most processes finish their next cpu
burst within one time quantum. For example, with three processes of 10 ms bursts
each, the average turnaround time for 1 ms quantum is 29, and for 10 ms quantum
it reduces to 20. However, if it is made too large, then RR just degenerates to
FCFS. A rule of thumb is that 80% of CPU bursts should be smaller than the time
quantum.
Multilevel Queue Scheduling
When processes can be readily categorized, then multiple separate queues can be
established, each implementing whatever scheduling algorithm is most
appropriate for that type of job, and/or with different parametric adjustments.
Scheduling must also be done between queues, that is scheduling one queue to get
time relative to other queues. Two common options are strict priority ( no job in a
lower priority queue runs until all higher priority queues are empty ) and round-
robin ( each queue gets a time slice in turn, possibly of different sizes. )
Note that under this algorithm jobs cannot switch from queue to queue - Once
they are assigned a queue, that is their queue until they finish.
Thread Scheduling
Contention Scope
Contention scope refers to the scope in which threads compete for the use of
physical CPUs.
On systems implementing many-to-one and many-to-many threads, Process
Contention Scope, PCS, occurs, because competition occurs between threads that
are part of the same process. ( This is the management / scheduling of multiple
user threads on a single kernel thread, and is managed by the thread library. )
System Contention Scope, SCS, involves the system scheduler scheduling kernel
threads to run on one or more CPUs. Systems implementing one-to-one threads (
XP, Solaris 9, Linux ), use only SCS.
PCS scheduling is typically done with priority, where the programmer can set
and/or change the priority of threads created by his or her programs. Even time
slicing is not guaranteed among threads of equal priority.
Pthread Scheduling
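The Pthread API exposes contention scope through thread attributes. A minimal sketch using the standard pthread_attr_getscope( ) / pthread_attr_setscope( ) calls follows ( note that Linux accepts only PTHREAD_SCOPE_SYSTEM; the empty runner function is arbitrary ):

#include <pthread.h>
#include <stdio.h>

void *runner( void *arg ) {
    return NULL;                                   /* the thread body is irrelevant here */
}

int main( void ) {
    pthread_t tid;
    pthread_attr_t attr;
    int scope;

    pthread_attr_init( &attr );
    if ( pthread_attr_getscope( &attr, &scope ) == 0 )
        printf( "default scope: %s\n",
                scope == PTHREAD_SCOPE_PROCESS ? "PROCESS" : "SYSTEM" );

    pthread_attr_setscope( &attr, PTHREAD_SCOPE_SYSTEM );   /* request SCS scheduling */
    pthread_create( &tid, &attr, runner, NULL );
    pthread_join( tid, NULL );
    return 0;
}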
Multiple-Processor Scheduling
When multiple processors are available, then the scheduling gets more complicated,
because now there is more than one CPU which must be kept busy and in effective use at
all times.
Load sharing revolves around balancing the load between multiple processors.
Multi-processor systems may be heterogeneous, ( different kinds of CPUs ), or
homogenous, ( all the same kind of CPU ). Even in the latter case there may be special
scheduling constraints, such as devices which are connected via a private bus to only one
of the CPUs. This book will restrict its discussion to homogenous systems.
Processor Affinity
Processors contain cache memory, which speeds up repeated accesses to the same
memory locations.
If a process were to switch from one processor to another each time it got a time
slice, the data in the cache ( for that process ) would have to be invalidated and re-
loaded from main memory, thereby obviating the benefit of the cache.
Therefore SMP systems attempt to keep processes on the same processor, via
processor affinity. Soft affinity occurs when the system attempts to keep
processes on the same processor but makes no guarantees. Linux and some other
OSes support hard affinity, in which a process specifies that it is not to be moved
between processors.
Main memory architecture can also affect process affinity, if particular CPUs
have faster access to memory on the same chip or board than to other memory
loaded elsewhere. ( Non-Uniform Memory Access, NUMA. ) As shown below, if
a process has an affinity for a particular CPU, then it should preferentially be
assigned memory storage in "local" fast access areas.
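On Linux, hard affinity can be requested with the sched_setaffinity( ) system call. A
minimal sketch that pins the calling process to CPU 0 ( Linux-specific; other OSes
expose different calls ):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main( void ) {
        cpu_set_t set;
        CPU_ZERO( &set );
        CPU_SET( 0, &set );                         /* allow only CPU 0 */

        if ( sched_setaffinity( 0, sizeof( set ), &set ) != 0 ) {
            perror( "sched_setaffinity" );
            return 1;
        }
        printf( "process now restricted to CPU 0\n" );
        return 0;
    }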
Load Balancing
Multicore Processors
Traditional SMP required multiple CPU chips to run multiple kernel threads
concurrently.
Recent trends are to put multiple CPUs ( cores ) onto a single chip, which appear
to the system as multiple processors.
Compute cycles can be blocked by the time needed to access memory, whenever
the needed data is not already present in the cache ( a cache miss ). As shown in the
figure, as much as half of the CPU cycles can be lost to memory stall.
Memory stall.
Linux Scheduling
The Linux CFS scheduler provides an efficient algorithm for selecting which
task to run next. Each runnable task is placed in a red-black tree—a balanced
binary search tree whose key is based on the value of vruntime. This tree is
shown below:
When a task becomes runnable, it is added to the tree. If a task on the
tree is not runnable ( for example, if it is blocked while waiting for I/O ), it is
removed. Generally speaking, tasks that have been given less processing time
( smaller values of vruntime ) are toward the left side of the tree, and tasks
that have been given more processing time are on the right side. According
to the properties of a binary search tree, the leftmost node has the smallest
key value, which for the sake of the CFS scheduler means that it is the task
with the highest priority . Because the red-black tree is balanced, navigating
it to discover the leftmost node will require O(lgN) operations (where N
is the number of nodes in the tree). However, for efficiency reasons, the
Linux scheduler caches this value in the variable rb_leftmost, and thus
determining which task to run next requires only retrieving the cached value.
The Linux scheduler is a preemptive priority-based algorithm with two priority ranges -
Real time from 0 to 99 and a nice range from 100 to 140.
Unlike Solaris or XP, Linux assigns longer time quanta to higher-priority tasks.
Figure 6.21 - Scheduling priorities on a Linux system.
Process Synchronization
Background
item nextProduced;
while( true ) {
    /* producer: produce an item in nextProduced and place it in the shared buffer */
}

item nextConsumed;
while( true ) {
    /* consumer: remove an item from the shared buffer into nextConsumed */
}
The only problem with the above code is that the maximum number of items
which can be placed into the buffer is BUFFER_SIZE - 1. One slot is unavailable
because there always has to be a gap between the producer and the consumer.
We could try to overcome this deficiency by introducing a counter variable, as
shown in the following code segments:
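A sketch of the counter-based version, following the usual textbook presentation and
assuming buffer[ BUFFER_SIZE ], in, out, and counter are all shared between the two
processes ( counter initialized to 0 ):

    /* producer */
    while( true ) {
        /* produce an item in nextProduced */
        while( counter == BUFFER_SIZE )
            ;                              /* buffer full - do nothing */
        buffer[ in ] = nextProduced;
        in = ( in + 1 ) % BUFFER_SIZE;
        counter++;
    }

    /* consumer */
    while( true ) {
        while( counter == 0 )
            ;                              /* buffer empty - do nothing */
        nextConsumed = buffer[ out ];
        out = ( out + 1 ) % BUFFER_SIZE;
        counter--;
        /* consume the item in nextConsumed */
    }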
Unfortunately we have now introduced a new problem, because both the producer
and the consumer are adjusting the value of the variable counter, which can lead
to a condition known as a race condition. In this condition a piece of code may or
may not work correctly, depending on which of two simultaneous processes
executes first, and more importantly if one of the processes gets interrupted such
that the other process runs between important steps of the first process. ( Bank
balance example discussed in class. )
The particular problem above comes from the producer executing "counter++" at
the same time the consumer is executing "counter--". If one process gets part way
through making the update and then the other process butts in, the value of
counter can get left in an incorrect state.
But, you might say, "Each of those are single instructions - How can they get
interrupted halfway through?" The answer is that although they are single
instructions in C++, they are actually three steps each at the hardware level: (1)
Fetch counter from memory into a register, (2) increment or decrement the
register, and (3) Store the new value of counter back to memory. If the
instructions from the two processes get interleaved, there could be serious
problems, such as illustrated by the following:
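One possible interleaving ( the usual textbook illustration, assuming counter starts at 5
and each process works in its own CPU register ):

    T0:  producer   register1 = counter          { register1 = 5 }
    T1:  producer   register1 = register1 + 1    { register1 = 6 }
    T2:  consumer   register2 = counter          { register2 = 5 }
    T3:  consumer   register2 = register2 - 1    { register2 = 4 }
    T4:  producer   counter = register1          { counter = 6 }
    T5:  consumer   counter = register2          { counter = 4 }

After one item produced and one consumed, counter should still be 5, yet it ends up as 4.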
Exercise: What would be the resulting value of counter if the order of statements
T4 and T5 were reversed? ( What should the value of counter be after one
producer and one consumer, assuming the original value was 5? )
Note that race conditions are notoriously difficult to identify and debug, because
by their very nature they only occur on rare occasions, and only when the timing
is just exactly right. ( or wrong! :-) ) Race conditions are also very difficult to
reproduce. :-(
Obviously the solution is to only allow one process at a time to manipulate the
value "counter". This is a very common occurrence among cooperating processes,
so let's look at some ways in which this is done, as well as some classic problems
in this area.
A solution to the critical section problem must satisfy the following three conditions:
1. Mutual Exclusion - Only one process at a time can be executing in their critical
section.
2. Progress - If no process is currently executing in their critical section, and one or
more processes want to execute their critical section, then only the processes not
in their remainder sections can participate in the decision, and the decision cannot
be postponed indefinitely. ( I.e. processes cannot be blocked forever waiting to
get into their critical sections. )
3. Bounded Waiting - There exists a limit as to how many other processes can get
into their critical sections after a process requests entry into their critical section
and before that request is granted. ( I.e. a process requesting entry into their
critical section will get a turn eventually, and there is a limit as to how many other
processes get to go first. )
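The general structure of a participating process Pi is usually sketched as a loop like the
following; the entry and exit sections are where any particular algorithm does its work:

    do {
        /* entry section - request permission to enter */

            /* critical section */

        /* exit section - release the critical section */

            /* remainder section */
    } while( true );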
Peterson's Solution
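Peterson's solution is a classic software-only algorithm for two processes, i and j, that
share int turn and boolean flag[ 2 ]. A sketch of process i ( process j is symmetric ):

    do {
        flag[ i ] = true;            /* i wants to enter */
        turn = j;                    /* but politely offers the turn to j */
        while( flag[ j ] && turn == j )
            ;                        /* busy wait */

            /* critical section */

        flag[ i ] = false;

            /* remainder section */
    } while( true );

It satisfies mutual exclusion, progress, and bounded waiting, though like any pure
software solution it is not guaranteed to work on modern hardware that reorders
memory operations.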
Synchronization Hardware
To generalize the solution(s) expressed above, each process when entering their critical
section must set some sort of lock, to prevent other processes from entering their critical
sections simultaneously, and must release the lock when exiting their critical section, to
allow other processes to proceed. Obviously it must be possible to attain the lock only
when no other process has already set a lock. Specific implementations of this general
procedure can get quite complicated, and may include hardware solutions as outlined in
this section.
One simple solution to the critical section problem is to simply prevent a process from
being interrupted while in their critical section, which is the approach taken by non
preemptive kernels. Unfortunately this does not work well in multiprocessor
environments, due to the difficulty of disabling and re-enabling interrupts on all
processors. There is also a question as to how this approach affects timing if the clock
interrupt is disabled.
Another approach is for hardware to provide certain atomic operations. These operations
are guaranteed to operate as a single instruction, without interruption. One such operation
is the "Test and Set", which simultaneously sets a boolean lock variable and returns its
previous value, as shown in Figures 5.3 and 5.4:
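A sketch of the test-and-set instruction ( the hardware executes the whole function
atomically ) and its use for mutual exclusion with a shared boolean lock initialized to
false:

    boolean test_and_set( boolean *target ) {
        boolean rv = *target;        /* remember the old value */
        *target = true;              /* and set the lock */
        return rv;
    }

    do {
        while( test_and_set( &lock ) )
            ;                        /* busy wait until lock was previously false */

            /* critical section */

        lock = false;

            /* remainder section */
    } while( true );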
Another variation on the test-and-set is an atomic swap of two booleans, as shown in
Figures 5.5 and 5.6:
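A sketch of the atomic-swap variant, using the shared boolean lock ( initialized to
false ) and a per-process boolean key:

    void swap( boolean *a, boolean *b ) {
        boolean temp = *a;
        *a = *b;
        *b = temp;
    }

    do {
        key = true;
        while( key == true )
            swap( &lock, &key );     /* spins until it observes lock == false */

            /* critical section */

        lock = false;

            /* remainder section */
    } while( true );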
The above examples satisfy the mutual exclusion requirement, but unfortunately do not
guarantee bounded waiting. If there are multiple processes trying to get into their critical
sections, there is no guarantee of what order they will enter, and any one process could
have the bad luck to wait forever until they got their turn in the critical section. ( Since
there is no guarantee as to the relative rates of the processes, a very fast process could
theoretically release the lock, whip through their remainder section, and re-lock the lock
before a slower process got a chance. As more and more processes are involved vying for
the same resource, the odds of a slow process getting locked out completely increase. )
Figure illustrates a solution using test-and-set that does satisfy this requirement, using
two shared data structures, boolean lock and boolean waiting[ N ], where N is the number
of processes in contention for critical sections:
The key feature of the above algorithm is that a process blocks on the AND of the critical
section being locked and that this process is in the waiting state. When exiting a critical
section, the exiting process does not just unlock the critical section and let the other
processes have a free-for-all trying to get in. Rather it first looks in an orderly progression
( starting with the next process on the list ) for a process that has been waiting, and if it
finds one, then it releases that particular process from its waiting state, without unlocking
the critical section, thereby allowing a specific process into the critical section while
continuing to block all the others. Only if there are no other processes currently waiting is
the general lock removed, allowing the next process to come along access to the critical
section.
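A sketch of that bounded-waiting algorithm for process i, following the description
above ( shared boolean lock and boolean waiting[ n ], all initialized to false ):

    do {
        waiting[ i ] = true;
        key = true;
        while( waiting[ i ] && key )
            key = test_and_set( &lock );
        waiting[ i ] = false;

            /* critical section */

        j = ( i + 1 ) % n;
        while( ( j != i ) && !waiting[ j ] )
            j = ( j + 1 ) % n;           /* look for the next waiting process */
        if( j == i )
            lock = false;                /* nobody is waiting - release the lock */
        else
            waiting[ j ] = false;        /* hand the critical section to process j */

            /* remainder section */
    } while( true );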
Unfortunately, hardware level locks are especially difficult to implement in multi-
processor architectures. Discussion of such issues is left to books on advanced computer
architecture.
Mutex Locks
The hardware solutions presented above are often difficult for ordinary programmers to
access, particularly on multi-processor machines, and particularly because they are often
platform-dependent.
Therefore most systems offer a software API equivalent called mutex locks or simply
mutexes. ( For mutual exclusion )
The terminology when using mutexes is to acquire a lock prior to entering a critical
section, and to release it when exiting, as shown in Figure :
Just as with hardware locks, the acquire step will block the process if the lock is in use by
another process, and both the acquire and release operations are atomic.
Acquire and release can be implemented as shown here, based on a boolean variable
"available":
One problem with the implementation shown here, ( and in the hardware solutions
presented earlier ), is the busy loop used to block processes in the acquire phase. These
types of locks are referred to as spinlocks, because the CPU just sits and spins while
blocking the process.
Spinlocks are wasteful of CPU cycles, and are a really bad idea on single-CPU single-
threaded machines, because the spinlock blocks the entire computer, and doesn't allow
any other process to release the lock. ( Until the scheduler kicks the spinning process off
of the CPU. )
On the other hand, spinlocks do not incur the overhead of a context switch, so they are
effectively used on multi-threaded machines when it is expected that the lock will be
released after a short time.
Semaphores
A more robust alternative to simple mutexes is to use semaphores, which are integer
variables for which only two ( atomic ) operations are defined, the wait and signal
operations, as shown in the following figure.
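A sketch of the classical busy-waiting definitions, where S is an integer semaphore and
both operations must execute atomically:

    wait( S ) {
        while( S <= 0 )
            ;            /* busy wait */
        S--;
    }

    signal( S ) {
        S++;
    }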
Note that not only must the variable-changing steps ( S-- and S++ ) be indivisible, it is
also necessary that for the wait operation when the test proves false that there be no
interruptions before S gets decremented. It IS okay, however, for the busy loop to be
interrupted when the test is true, which prevents the system from hanging forever.
Semaphore Usage
A semaphore synch, initialized to 0, can be used to force statement S1 in process P1 to
complete before statement S2 in process P2 begins:

    P1:    S1;
           signal( synch );

    P2:    wait( synch );
           S2;
Semaphore Implementation
The big problem with semaphores as described above is the busy loop in the wait
call, which consumes CPU cycles without doing any useful work. This type of
lock is known as a spinlock, because the lock just sits there and spins while it
waits. While this is generally a bad thing, it does have the advantage of not
invoking context switches, and so it is sometimes used in multi-processing
systems when the wait time is expected to be short - One thread spins on one
processor while another completes their critical section on another processor.
An alternative approach is to block a process when it is forced to wait for an
available semaphore, and swap it out of the CPU. In this implementation each
semaphore needs to maintain a list of processes that are blocked waiting for it, so
that one of the processes can be woken up and swapped back in when the
semaphore becomes available. ( Whether it gets swapped back into the CPU
immediately or whether it needs to hang out in the ready queue for a while is a
scheduling problem. )
The new definition of a semaphore and the corresponding wait and signal
operations are shown as follows:
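A sketch of the blocking implementation; block( ) suspends the calling process and
wakeup( P ) moves process P back to the ready queue ( both are assumed to be basic
system calls provided by the OS ):

    typedef struct {
        int value;
        struct process *list;        /* processes blocked on this semaphore */
    } semaphore;

    wait( semaphore *S ) {
        S->value--;
        if( S->value < 0 ) {
            /* add this process to S->list */
            block( );
        }
    }

    signal( semaphore *S ) {
        S->value++;
        if( S->value <= 0 ) {
            /* remove a process P from S->list */
            wakeup( P );
        }
    }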
Note that in this implementation the value of the semaphore can actually become
negative, in which case its magnitude is the number of processes waiting for that
semaphore. This is a result of decrementing the counter before checking its value.
Key to the success of semaphores is that the wait and signal operations be atomic,
that is no other process can execute a wait or signal on the same semaphore at the
same time. ( Other processes could be allowed to do other things, including
working with other semaphores, they just can't have access to this semaphore. )
On single processors this can be implemented by disabling interrupts during the
execution of wait and signal; Multiprocessor systems have to use more complex
methods, including the use of spinlocking.
Deadlocks and Starvation
One important problem that can arise when using semaphores to block processes
waiting for a limited resource is the problem of deadlocks, which occur when
multiple processes are blocked, each waiting for a resource that can only be freed
by one of the other ( blocked ) processes, as illustrated in the following example. (
Deadlocks are covered more completely in chapter 7. )
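For instance, suppose processes P0 and P1 each need two semaphores, S and Q, both
initialized to 1, but acquire them in opposite orders:

    P0:                      P1:
    wait( S );               wait( Q );
    wait( Q );               wait( S );
      ...                      ...
    signal( S );             signal( Q );
    signal( Q );             signal( S );

If P0 executes wait( S ) and then P1 executes wait( Q ), each process blocks forever on
its second wait( ) - a deadlock.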
Priority Inversion
The following classic problems are used to test virtually every new proposed synchronization
algorithm.
In the readers-writers problem there are some processes ( termed readers ) who
only read the shared data, and never change it, and there are other processes (
termed writers ) who may change the data in addition to or instead of reading it.
There is no limit to how many readers can access the data simultaneously, but
when a writer accesses the data, it needs exclusive access.
There are several variations to the readers-writers problem, most centered around
relative priorities of readers versus writers.
o The first readers-writers problem gives priority to readers. In this problem,
if a reader wants access to the data, and there is not already a writer
accessing it, then access is granted to the reader. A solution to this
problem can lead to starvation of the writers, as there could always be
more readers coming along to access the data. ( A steady stream of readers
will jump ahead of waiting writers as long as there is currently already
another reader accessing the data, because the writer is forced to wait until
the data is idle, which may never happen if there are enough readers. )
o The second readers-writers problem gives priority to the writers. In this
problem, when a writer wants access to the data it jumps to the head of the
queue - All waiting readers are blocked, and the writer gets access to the
data as soon as it becomes available. In this solution the readers may be
starved by a steady stream of writers.
The following code is an example of the first readers-writers problem, and
involves an important counter and two binary semaphores:
o readcount is used by the reader processes, to count the number of readers
currently accessing the data.
o mutex is a semaphore used only by the readers for controlled access to
readcount.
o rw_mutex is a semaphore used to block and release the writers. The first
reader to access the data will set this lock and the last reader to exit will
release it; The remaining readers do not touch rw_mutex. ( Eighth edition
called this variable wrt. )
o Note that the first reader to come along will block on rw_mutex if there is
currently a writer accessing the data, and that all following readers will
only block on mutex for their turn to increment readcount.
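A sketch of the writer and reader code for this first readers-writers scheme
( rw_mutex and mutex initialized to 1, readcount to 0 ):

    /* writer */
    do {
        wait( rw_mutex );

            /* writing is performed */

        signal( rw_mutex );
    } while( true );

    /* reader */
    do {
        wait( mutex );
        readcount++;
        if( readcount == 1 )
            wait( rw_mutex );        /* first reader locks out the writers */
        signal( mutex );

            /* reading is performed */

        wait( mutex );
        readcount--;
        if( readcount == 0 )
            signal( rw_mutex );      /* last reader lets the writers back in */
        signal( mutex );
    } while( true );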
Some hardware implementations provide specific reader-writer locks, which are
accessed using an argument specifying whether access is requested for reading or
writing. The use of reader-writer locks is beneficial for situations in which: (1)
processes can be easily identified as either readers or writers, and (2) there are
significantly more readers than writers, making the additional overhead of the
reader-writer lock pay off in terms of increased concurrency of the readers.
One possible solution, as shown in the following code section, is to use a set of
five semaphores ( chopsticks[ 5 ] ), and to have each hungry philosopher first wait
on their left chopstick ( chopsticks[ i ] ), and then wait on their right chopstick (
chopsticks[ ( i + 1 ) % 5 ] )
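A sketch of philosopher i's loop under this scheme, with every semaphore in
chopsticks[ 5 ] initialized to 1:

    do {
        wait( chopsticks[ i ] );                 /* pick up left chopstick  */
        wait( chopsticks[ ( i + 1 ) % 5 ] );     /* pick up right chopstick */

            /* eat for a while */

        signal( chopsticks[ i ] );
        signal( chopsticks[ ( i + 1 ) % 5 ] );

            /* think for a while */
    } while( true );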
But suppose that all five philosophers get hungry at the same time, and each starts
by picking up their left chopstick. They then look for their right chopstick, but
because it is unavailable, they wait for it, forever, and eventually all the
philosophers starve due to the resulting deadlock.
The structure of philosopher i.
Monitors
Semaphores can be very useful for solving concurrency problems, but only if
programmers use them properly. If even one process fails to abide by the proper use of
semaphores, either accidentally or deliberately, then the whole system breaks down. (
And since concurrency problems are by definition rare events, the problem code may
easily go unnoticed and/or be heinous to debug. )
For this reason a higher-level language construct has been developed, called monitors.
Monitor Usage
A monitor is essentially a class, in which all data is private, and with the special
restriction that only one method within any given monitor object may be active at
the same time. An additional restriction is that monitor methods may only access
the shared data within the monitor and any data passed to them as parameters. I.e.
they cannot access any data external to the monitor.
Syntax of a monitor.
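The general shape of a monitor, in book-style pseudocode:

    monitor monitor_name {
        /* shared variable declarations */

        function P1( ... ) { ... }
        function P2( ... ) { ... }
            ...
        function Pn( ... ) { ... }

        initialization_code( ... ) { ... }
    }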
But now there is a potential problem - If process P within the monitor issues a
signal that would wake up process Q also within the monitor, then there would be
two processes running simultaneously within the monitor, violating the exclusion
requirement. Accordingly there are two possible solutions to this dilemma:
Signal and wait - When process P issues the signal to wake up process Q, P then waits, either
for Q to leave the monitor or on some other condition.
Signal and continue - When P issues the signal, Q waits, either for P to exit the monitor or for
some other condition.
There are arguments for and against either choice. Concurrent Pascal offers a third alternative -
The signal call causes the signaling process to immediately exit the monitor, so that the waiting
process can then wake up and proceed.
Dining-Philosophers Solution Using Monitors
This solution to the dining philosophers uses monitors, and the restriction that a
philosopher may only pick up chopsticks when both are available. There are also
two key data structures in use in this solution:
1. enum { THINKING, HUNGRY, EATING } state[ 5 ]; A philosopher may
only set their state to eating when neither of their adjacent neighbors is
eating. ( state[ ( i + 1 ) % 5 ] != EATING && state[ ( i + 4 ) % 5 ] !=
EATING ).
2. condition self[ 5 ]; This condition is used to delay a hungry philosopher
who is unable to acquire chopsticks.
In the following solution philosophers share a monitor, DiningPhilosophers, and
eat using the following sequence of operations:
1. DiningPhilosophers.pickup( ) - Acquires chopsticks, which may block the
process.
2. eat
3. DiningPhilosophers.putdown( ) - Releases the chopsticks.
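A sketch of the DiningPhilosophers monitor itself, built from the two data structures
described above:

    monitor DiningPhilosophers {
        enum { THINKING, HUNGRY, EATING } state[ 5 ];
        condition self[ 5 ];

        void pickup( int i ) {
            state[ i ] = HUNGRY;
            test( i );
            if( state[ i ] != EATING )
                self[ i ].wait( );        /* wait until both chopsticks are free */
        }

        void putdown( int i ) {
            state[ i ] = THINKING;
            test( ( i + 4 ) % 5 );        /* can the left neighbor eat now?  */
            test( ( i + 1 ) % 5 );        /* can the right neighbor eat now? */
        }

        void test( int i ) {
            if( ( state[ ( i + 4 ) % 5 ] != EATING ) &&
                ( state[ i ] == HUNGRY ) &&
                ( state[ ( i + 1 ) % 5 ] != EATING ) ) {
                state[ i ] = EATING;
                self[ i ].signal( );
            }
        }

        initialization_code( ) {
            for( int i = 0; i < 5; i++ )
                state[ i ] = THINKING;
        }
    }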
Implementing a Monitor Using Semaphores
When there are multiple processes waiting on the same condition within a
monitor, how does one decide which one to wake up in response to a signal on
that condition? One obvious approach is FCFS, and this may be suitable in many
cases.
Another alternative is to assign ( integer ) priorities, and to wake up the process
with the smallest ( best ) priority.
Figure illustrates the use of such a condition within a monitor used for resource
allocation. Processes wishing to access this resource must specify the time they
expect to use it using the acquire( time ) method, and must call the release( )
method when they are done with the resource.
A monitor to allocate a single resource.
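A sketch of such a resource-allocation monitor; here x.wait( time ) queues the caller by
its declared usage time, and x.signal( ) wakes the waiter with the smallest declared time:

    monitor ResourceAllocator {
        boolean busy;
        condition x;

        void acquire( int time ) {
            if( busy )
                x.wait( time );      /* priority = expected usage time */
            busy = true;
        }

        void release( ) {
            busy = false;
            x.signal( );             /* wakes the shortest-time waiter */
        }

        initialization_code( ) {
            busy = false;
        }
    }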
Unfortunately the use of monitors to restrict access to resources still only works if
programmers make the requisite acquire and release calls properly. One option
would be to place the resource allocation code into the monitor, thereby
eliminating the option for programmers to bypass or ignore the monitor, but then
that would substitute the monitor's scheduling algorithms for whatever other
scheduling algorithms may have been chosen for that particular resource. Chapter
14 on Protection presents more advanced methods for enforcing "nice"
cooperation among processes contending for shared resources.
Concurrent Pascal, Mesa, C#, and Java all implement monitors as described here.
Erlang provides concurrency support using a similar mechanism.
Synchronization Examples
Synchronization in Windows
Synchronization in Linux
Unit-3
Main Memory
Background
Obviously memory accesses and memory management are a very important part of
modern computer operation. Every instruction has to be fetched from memory before it
can be executed, and most instructions involve retrieving data from memory or storing
data in memory or both.
The advent of multi-tasking OSes compounds the complexity of memory management,
because as processes are swapped in and out of the CPU, so must their code and
data be swapped in and out of memory, all at high speeds and without interfering with
any other processes.
Shared memory, virtual memory, the classification of memory as read-only versus read-
write, and concepts like copy-on-write forking all further complicate the issue.
Basic Hardware
It should be noted that from the memory chip's point of view, all memory accesses
are equivalent. The memory hardware doesn't know what a particular part of
memory is being used for, nor does it care. This is almost true of the OS as well,
although not entirely.
The CPU can only access its registers and main memory. It cannot, for example,
make direct access to the hard drive, so any data stored there must first be
transferred into the main memory chips before the CPU can work with it. ( Device
drivers communicate with their hardware via interrupts and "memory" accesses,
sending short instructions for example to transfer data from the hard drive to a
specified location in main memory. The disk controller monitors the bus for such
instructions, transfers the data, and then notifies the CPU that the data is there
with another interrupt, but the CPU never gets direct access to the disk. )
Memory accesses to registers are very fast, generally one clock tick, and a CPU
may be able to execute more than one machine instruction per clock tick.
Memory accesses to main memory are comparatively slow, and may take a
number of clock ticks to complete. This would require intolerable waiting by the
CPU if it were not for an intermediary fast memory cache built into most modern
CPUs. The basic idea of the cache is to transfer chunks of memory at a time from
the main memory to the cache, and then to access individual memory locations
one at a time from the cache.
User processes must be restricted so that they only access memory locations that
"belong" to that particular process. This is usually implemented using a base
register and a limit register for each process, as shown in Figures 8.1 and 8.2
below. Every memory access made by a user process is checked against these two
registers, and if a memory access is attempted outside the valid range, then a fatal
error is generated. The OS obviously has access to all existing memory locations,
as this is necessary to swap users' code and data in and out of memory. It should
also be obvious that changing the contents of the base and limit registers is a
privileged activity, allowed only to the OS kernel.
Address Binding
User programs typically refer to memory addresses with symbolic names such as
"i", "count", and "averageTemperature". These symbolic names must be mapped
or bound to physical memory addresses, which typically occurs in several stages:
o Compile Time - If it is known at compile time where a program will
reside in physical memory, then absolute code can be generated by the
compiler, containing actual physical addresses. However if the load
address changes at some later time, then the program will have to be
recompiled. DOS .COM programs use compile time binding.
o Load Time - If the location at which a program will be loaded is not
known at compile time, then the compiler must generate relocatable code,
which references addresses relative to the start of the program. If that
starting address changes, then the program must be reloaded but not
recompiled.
o Execution Time - If a program can be moved around in memory during
the course of its execution, then binding must be delayed until execution
time. This requires special hardware, and is the method implemented by
most modern OSes.
Figure shows the various stages of the binding processes and the units involved in
each stage:
Multistep processing of a user program
Logical Versus Physical Address Space
The address generated by the CPU is a logical address, whereas the address
actually seen by the memory hardware is a physical address.
Addresses bound at compile time or load time have identical logical and physical
addresses.
Addresses created at execution time, however, have different logical and physical
addresses.
o In this case the logical address is also known as a virtual address, and the
two terms are used interchangeably by our text.
o The set of all logical addresses used by a program composes the logical
address space, and the set of all corresponding physical addresses
composes the physical address space.
The run time mapping of logical to physical addresses is handled by the memory-
management unit, MMU.
o The MMU can take on many forms. One of the simplest is a modification
of the base-register scheme described earlier.
o The base register is now termed a relocation register, whose value is
added to every memory request at the hardware level.
Note that user programs never see physical addresses. User programs work
entirely in logical address space, and any memory references or manipulations are
done using purely logical addresses. Only when the address gets sent to the
physical memory chips is the physical memory address generated.
Rather than loading an entire program into memory at once, dynamic loading
loads up each routine as it is called. The advantage is that unused routines need
never be loaded, reducing total memory usage and generating faster program
startup times. The downside is the added complexity and overhead of checking to
see if a routine is loaded every time it is called, and then loading it if it is
not already loaded.
With static linking library modules get fully included in executable modules,
wasting both disk space and main memory usage, because every program that
included a certain routine from the library would have to have their own copy of
that routine linked into their executable code.
With dynamic linking, however, only a stub is linked into the executable module,
containing references to the actual library module linked in at run time.
o This method saves disk space, because the library routines do not need to
be fully included in the executable modules, only the stubs.
o We will also learn that if the code section of the library routines is
reentrant, ( meaning it does not modify the code while it runs, making it
safe to re-enter it ), then main memory can be saved by loading only one
copy of dynamically linked routines into memory and sharing the code
amongst all processes that are concurrently using it. ( Each process would
have their own copy of the data section of the routines, but that may be
small relative to the code segments. ) Obviously the OS must manage
shared routines in memory.
o An added benefit of dynamically linked libraries ( DLLs, also known as
shared libraries or shared objects on UNIX systems ) involves easy
upgrades and updates. When a program uses a routine from a standard
library and the routine changes, then the program must be re-built ( re-
linked ) in order to incorporate the changes. However if DLLs are used,
then as long as the stub doesn't change, the program can be updated
merely by loading new versions of the DLLs onto the system. Version
information is maintained in both the program and the DLLs, so that a
program can specify a particular version of the DLL if necessary.
o In practice, the first time a program calls a DLL routine, the stub will
recognize the fact and will replace itself with the actual routine from the
DLL library. Further calls to the same routine will access the routine
directly and not incur the overhead of the stub access. ( Following the
UML Proxy Pattern. )
Swapping
Standard Swapping
One approach to memory management is to load each process into a contiguous space.
The operating system is allocated space first, usually at either low or high memory
locations, and then the remaining available memory is allocated to processes as needed.
( The OS is usually loaded low, because that is where the interrupt vectors are located,
but on older systems part of the OS was loaded high to make more room in low memory (
within the 640K barrier ) for user processes. )
The system shown in Figure below allows protection against user programs
accessing areas that they should not, allows programs to be relocated to different
memory starting addresses as needed, and allows the memory space devoted to
the OS to grow or shrink dynamically as needs change.
Memory Allocation
Fragmentation
All the memory allocation strategies suffer from external fragmentation, though
first and best fits experience the problems more so than worst fit. External
fragmentation means that the available memory is broken up into lots of little
pieces, none of which is big enough to satisfy the next memory requirement,
although the sum total could.
The amount of memory lost to fragmentation may vary with algorithm, usage
patterns, and some design decisions such as which end of a hole to allocate and
which end to save on the free list.
Statistical analysis of first fit, for example, shows that for N blocks of allocated
memory, another 0.5 N will be lost to fragmentation.
Internal fragmentation also occurs, with all memory allocation strategies. This is
caused by the fact that memory is allocated in blocks of a fixed size, whereas the
actual memory needed will rarely be that exact size. For a random distribution of
memory requests, on the average 1/2 block will be wasted per memory request,
because on the average the last allocated block will be only half full.
o Note that the same effect happens with hard drives, and that modern
hardware gives us increasingly larger drives and memory at the expense of
ever larger block sizes, which translates to more memory lost to internal
fragmentation.
o Some systems use variable size blocks to minimize losses due to internal
fragmentation.
If the programs in memory are relocatable, ( using execution-time address
binding ), then the external fragmentation problem can be reduced via
compaction, i.e. moving all processes down to one end of physical memory. This
only involves updating the relocation register for each process, as all internal
work is done using logical addresses.
Another solution as we will see in upcoming sections is to allow processes to use
non-contiguous blocks of physical memory, with a separate relocation register for
each block.
Segmentation
Basic Method
Segmentation Hardware
A segment table maps segment-offset addresses to physical addresses, and
simultaneously checks for invalid addresses, using a system similar to the page
tables and relocation base registers discussed previously. ( Note that at this point
in the discussion of segmentation, each segment is kept in contiguous memory
and may be of different sizes, but that segmentation can also be combined with
paging as we shall see shortly. )
Segmentation hardware
Example of segmentation
Paging
Basic Method
The basic idea behind paging is to divide physical memory into a number of equal sized
blocks called frames, and to divide a program's logical memory space into blocks of the
same size called pages.
Any page ( from any process ) can be placed into any available frame.
The page table is used to look up what frame a particular page is stored in at the moment.
In the following example, for instance, page 2 of the program's logical memory is
currently stored in frame 3 of physical memory:
Paging hardware
A logical address consists of two parts: A page number in which the address resides, and
an offset from the beginning of that page. ( The number of bits in the page number limits
how many pages a single process can address. The number of bits in the offset determines
the maximum size of each page, and should correspond to the system frame size. )
The page table maps the page number to a frame number, to yield a physical address
which also has two parts: The frame number and the offset within that frame. The number
of bits in the frame number determines how many frames the system can address, and the
number of bits in the offset determines the size of each frame.
Page numbers, frame numbers, and frame sizes are determined by the architecture, but
are typically powers of two, allowing addresses to be split at a certain number of bits. For
example, if the logical address size is 2^m and the page size is 2^n, then the high-order
m-n bits of a logical address designate the page number and the remaining n bits
represent the offset.
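As a small illustration ( our own example values, with m = 32 and n = 12, i.e. 4 KB
pages ), the split can be computed with simple shifts and masks:

    unsigned int logical = 0x00403A7C;        /* example logical address */
    unsigned int page    = logical >> 12;     /* high-order m - n bits   */
    unsigned int offset  = logical & 0xFFF;   /* low-order n bits        */
    /* physical address = ( page_table[ page ] << 12 ) | offset */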
Note also that the number of bits in the page number and the number of bits in the frame
number do not have to be identical. The former determines the address range of the
logical address space, and the latter relates to the physical address space.
( DOS used to use an addressing scheme with 16-bit segment numbers and 16-bit offsets, on
hardware that only supported 20-bit physical addresses. The result was a resolution of
starting frame addresses finer than the size of a single frame, and multiple frame-offset
combinations that mapped to the same physical hardware address. )
Consider the following micro example, in which a process has 16 bytes of logical
memory, mapped in 4 byte pages into 32 bytes of physical memory. ( Presumably some
other processes would be consuming the remaining 16 bytes of physical memory. )
Paging example for a 32-byte memory with 4-byte pages
Note that paging is like having a table of relocation registers, one for each page of the
logical memory.
There is no external fragmentation with paging. All blocks of physical memory are used,
and there are no gaps in between and no problems with finding the right sized hole for a
particular chunk of memory.
There is, however, internal fragmentation. Memory is allocated in chunks the size of a
page, and on the average, the last page will only be half full, wasting on the average half
a page of memory per process. ( Possibly more, if processes keep their code and data in
separate pages. )
Larger page sizes waste more memory, but are more efficient in terms of overhead.
Modern trends have been to increase page sizes, and some systems even have multiple
size pages to try and make the best of both worlds.
Page table entries ( frame numbers ) are typically 32 bit numbers, allowing access to
2^32 physical page frames. If those frames are 4 KB in size each, that translates to 16 TB
of addressable physical memory. ( 32 + 12 = 44 bits of physical address space. )
When a process requests memory ( e.g. when its code is loaded in from disk ), free
frames are allocated from a free-frame list, and inserted into that process's page table.
Processes are blocked from accessing anyone else's memory because all of their memory
requests are mapped through their page table. There is no way for them to generate an
address that maps into any other process's memory space.
The operating system must keep track of each individual process's page table, updating it
whenever the process's pages get moved in and out of memory, and applying the correct
page table when processing system calls for a particular process. This all increases the
overhead involved when swapping processes in and out of the CPU. ( The currently
active page table must be updated to reflect the process that is currently running. )
Free frames (a) before allocation and (b) after allocation
Hardware Support
Page lookups must be done for every memory reference, and whenever a process
gets swapped in or out of the CPU, its page table must be swapped in and out too,
along with the instruction registers, etc. It is therefore appropriate to provide
hardware support for this operation, in order to make it as fast as possible and to
make process switches as fast as possible also.
One option is to use a set of registers for the page table. For example, the DEC
PDP-11 uses 16-bit addressing and 8 KB pages, resulting in only 8 pages per
process. ( It takes 13 bits to address 8 KB of offset, leaving only 3 bits to define a
page number. )
An alternate option is to store the page table in main memory, and to use a single
register ( called the page-table base register, PTBR ) to record where in memory
the page table is located.
o Process switching is fast, because only the single register needs to be
changed.
o However, memory access is now twice as slow, because every memory
access requires two physical memory accesses - one to fetch the frame
number from the page table in memory and then another to access the
desired memory location.
o The solution to this problem is to use a very special high-speed memory
device called the translation look-aside buffer, TLB.
The benefit of the TLB is that it can search an entire table for a key
value in parallel, and if it is found anywhere in the table, then the
corresponding lookup value is returned.
For example, suppose it takes 100 nanoseconds to access main memory and
20 nanoseconds to search the TLB. A TLB hit then costs 120 nanoseconds
( 20 to find the frame number plus 100 to fetch the data ), while a TLB
miss costs 220 nanoseconds ( 20 to search the TLB, 100 to fetch the frame
number from the page table in memory, and another 100 to fetch the data ).
With an 80% TLB hit ratio the effective access time is
0.80 * 120 + 0.20 * 220 = 140 nanoseconds,
for a 40% slowdown to get the frame number. A 98% hit rate
would yield 122 nanoseconds average access time ( you should
verify this ), for a 22% slowdown.
If the TLB search time is ignored, the same 80% hit ratio gives
0.80 * 100 + 0.20 * 200 = 120 nanoseconds,
for a 20% slowdown to get the frame number. A 99% hit rate
would yield 101 nanoseconds average access time ( you should
verify this ), for a 1% slowdown.
Protection
The page table can also help to protect processes from accessing memory that
they shouldn't, or their own memory in ways that they shouldn't.
A bit or bits can be added to the page table to classify a page as read-write, read-
only, read-write-execute, or some combination of these sorts of things. Then each
memory reference can be checked to ensure it is accessing the memory in the
appropriate mode.
Valid / invalid bits can be added to "mask off" entries in the page table that are
not in use by the current process, as shown by example in Figure 8.12 below.
Note that the valid / invalid bits described above cannot block all illegal memory
accesses, due to the internal fragmentation. ( Areas of memory in the last page
that are not entirely filled by the process, and may contain data left over by
whoever used that frame last. )
Many processes do not use all of the page table available to them, particularly in
modern systems with very large potential page tables. Rather than waste memory
by creating a full-size page table for every process, some systems use a page-table
length register, PTLR, to specify the length of the page table.
Shared Pages
Paging systems can make it very easy to share blocks of memory, by simply
mapping entries in several processes' page tables to the same physical frame.
This may be done with either code or data.
If code is reentrant, that means that it does not write to or change the code in any
way ( it is non self-modifying ), and it is therefore safe to re-enter it. More
importantly, it means the code can be shared by multiple processes, so long as
each has their own copy of the data and registers, including the instruction
register.
In the example given below, three different users are running the editor
simultaneously, but the code is only loaded into memory ( in the page frames )
one time.
Some systems also implement shared memory in this fashion.
Hierarchical Paging
Most modern computer systems support logical address spaces of 2^32 to 2^64.
With a 2^32 address space and 4K ( 2^12 ) pages, this leaves 2^20 entries in
the page table. At 4 bytes per entry, this amounts to a 4 MB page table, which is
too large to reasonably keep in contiguous memory. ( And to swap in and out of
memory with each process switch. ) Note that with 4K pages, this would take
1024 pages just to hold the page table!
One option is to use a two-tier paging system, i.e. to page the page table.
For example, the 20 bits described above could be broken down into two 10-bit
page numbers. The first identifies an entry in the outer page table, which
identifies where in memory to find one page of an inner page table. The second 10
bits finds a specific entry in that inner page table, which in turn identifies a
particular frame in physical memory. ( The remaining 12 bits of the 32 bit logical
address are the offset within the 4K frame. )
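A minimal sketch of extracting the three fields of such a 32-bit logical address
( 10-bit outer index p1, 10-bit inner index p2, 12-bit offset d ):

    unsigned int p1 = ( addr >> 22 ) & 0x3FF;   /* index into the outer page table */
    unsigned int p2 = ( addr >> 12 ) & 0x3FF;   /* index into the inner page table */
    unsigned int d  = addr & 0xFFF;             /* offset within the 4 KB frame    */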
A two-level page-table scheme
Address translation for a two-level 32-bit paging architecture
The VAX architecture divides 32-bit addresses into 4 equal-sized sections, and each
page is 512 bytes, yielding an address form with a 2-bit section number, a 21-bit page
number, and a 9-bit offset.
With a 64-bit logical address space and 4K pages, there are 52 bits worth of page
numbers, which is still too many even for two-level paging. One could increase
the paging level, but with 10-bit page tables it would take 7 levels of indirection,
which would be prohibitively slow memory access. So some other approach must
be used.
One common data structure for accessing data that is sparsely distributed over a
broad range of possible values is with hash tables. Figure 8.16 below illustrates a
hashed page table using chain-and-bucket hashing:
Another approach is to use an inverted page table. Instead of a table listing all of
the pages for a particular process, an inverted page table lists all of the pages
currently loaded in memory, for all processes. ( I.e. there is one entry per frame
instead of one entry per page. )
Access to an inverted page table can be slow, as it may be necessary to search the
entire table in order to find the desired page ( or to discover that it is not there. )
Hashing the table can help speedup the search process.
Inverted page tables prohibit the normal method of implementing shared memory,
which is to map multiple logical pages to a common physical frame. ( Because
each frame is now mapped to one and only one process. )
IA-32 Segmentation
The Pentium CPU provides both pure segmentation and segmentation with
paging. In the latter case, the CPU generates a logical address ( segment-offset
pair ), which the segmentation unit converts into a logical linear address, which in
turn is mapped to a physical frame by the paging unit, as shown in Figure 8.21:
Figure 8.21 - Logical to physical address translation in IA-32
IA-32 Paging
A special bit in the page directory can indicate that this page is a 4MB
page, in which case the remaining 22 bits are all used as offset and the
inner tier of page tables is not used.
The CR3 register points to the page directory for the current process, as
shown in Figure 8.23 below.
If the inner page table is currently swapped out to disk, then the page
directory will have an "invalid bit" set, and the remaining 31 bits provide
information on where to find the swapped out page table on the disk.
Figure 8.23 - Paging in the IA-32 architecture.
Figure 8.24 - Page address extensions.
Virtual Memory
Background
Preceding sections talked about how to avoid memory fragmentation by breaking process
memory requirements down into smaller bites ( pages ), and storing the pages non-
contiguously in memory. However the entire process still had to be stored in memory
somewhere.
In practice, most real processes do not need all their pages, or at least not all at once, for
several reasons:
1. Error handling code is not needed unless that specific error occurs, some of which
are quite rare.
2. Arrays are often over-sized for worst-case scenarios, and only a small fraction of
the arrays are actually used in practice.
3. Certain features of certain programs are rarely used, such as the routine to balance
the federal budget. :-)
The ability to load only the portions of processes that were actually needed ( and only
when they were needed ) has several benefits:
o Programs could be written for a much larger address space ( virtual memory space
) than physically exists on the computer.
o Because each process is only using a fraction of their total address space, there is
more memory left for other programs, improving CPU utilization and system
throughput.
o Less I/O is needed for swapping processes in and out of RAM, speeding things
up.
Figure shows the general layout of virtual memory, which can be much larger than
physical memory:
Figure shows virtual address space, which is the programmer's logical view of process
memory storage. The actual physical layout is controlled by the process's page table.
Note that the address space shown in Figure 9.2 is sparse - A great hole in the middle of
the address space is never used, unless the stack and/or the heap grow to fill the hole.
Virtual address space
Virtual memory also allows the sharing of files and memory by multiple processes, with
several benefits:
o System libraries can be shared by mapping them into the virtual address space of
more than one process.
o Processes can also share virtual memory by mapping the same block of memory
to more than one process.
o Process pages can be shared during a fork( ) system call, eliminating the need to
copy all of the pages of the original ( parent ) process.
Shared library using virtual memory
Demand Paging
The basic idea behind demand paging is that when a process is swapped in, its pages are
not swapped in all at once. Rather they are swapped in only when the process needs them.
( on demand. ) This is termed a lazy swapper, although a pager is a more accurate term.
Transfer of a paged memory to contiguous disk space
Basic Concepts
The basic idea behind paging is that when a process is swapped in, the pager only
loads into memory those pages that it expects the process to need ( right away. )
Pages that are not loaded into memory are marked as invalid in the page table,
using the invalid bit. ( The rest of the page table entry may either be blank or
contain information about where to find the swapped-out page on the hard drive. )
If the process only ever accesses pages that are loaded in memory ( memory
resident pages ), then the process runs exactly as if all the pages were loaded into
memory.
Page table when some pages are not in main memory.
On the other hand, if a page is needed that was not originally loaded up, then a
page fault trap is generated, which must be handled in a series of steps:
1. The memory address requested is first checked, to make sure it was a valid
memory request.
2. If the reference was invalid, the process is terminated. Otherwise, the page
must be paged in.
3. A free frame is located, possibly from a free-frame list.
4. A disk operation is scheduled to bring in the necessary page from disk. (
This will usually block the process on an I/O wait, allowing some other
process to use the CPU in the meantime. )
5. When the I/O operation is complete, the process's page table is updated
with the new frame number, and the invalid bit is changed to indicate that
this is now a valid page reference.
6. The instruction that caused the page fault must now be restarted from the
beginning, ( as soon as this process gets another turn on the CPU. )
In an extreme case, NO pages are swapped in for a process until they are
requested by page faults. This is known as pure demand paging.
In theory each instruction could generate multiple page faults. In practice this is
very rare, due to locality of reference, covered in section 9.6.1.
The hardware necessary to support virtual memory is the same as for paging and
swapping: A page table and secondary memory. ( Swap space, whose allocation is
discussed in chapter 12. )
A crucial part of the process is that the instruction must be restarted from scratch
once the desired page has been made available in memory. For most simple
instructions this is not a major difficulty. However there are some architectures
that allow a single instruction to modify a fairly large block of data, ( which may
span a page boundary ), and if some of the data gets modified before the page
fault occurs, this could cause problems. One solution is to access both ends of the
block before executing the instruction, guaranteeing that the necessary pages get
paged in before the instruction begins.
Obviously there is some slowdown and performance hit whenever a page fault
occurs and the system has to go retrieve the missing page from disk, but just how
big a hit is it exactly?
There are many steps that occur when servicing a page fault ( see book for full
details ), and some of the steps are optional or variable. But just for the sake of
discussion, suppose that a normal memory access requires 200 nanoseconds, and
that servicing a page fault takes 8 milliseconds. ( 8,000,000 nanoseconds, or
40,000 times a normal memory access. ) With a page fault rate of p, ( on a scale
from 0 to 1 ), the effective access time is now:
( 1 - p ) * ( 200 ) + p * 8000000
= 200 + 7,999,800 * p
which clearly depends heavily on p! Even if only one access in 1,000 causes a page fault, the
effective access time jumps from 200 nanoseconds to 8.2 microseconds, a slowdown by a factor
of roughly 40. In order to keep the slowdown below 10%, the page fault rate must be less than
0.0000025, or about one in 400,000 accesses.
A subtlety is that swap space is faster to access than the regular file system,
because it does not have to go through the whole directory structure. For this
reason some systems will transfer an entire process from the file system to swap
space before starting up the process, so that future paging all occurs from the
( relatively ) faster swap space.
Some systems use demand paging directly from the file system for binary code
( which never changes, and hence never needs to be written back out when its frame
is reclaimed ), and reserve the swap space for data segments that must be stored.
This approach is used by both Solaris and BSD Unix.
Copy-on-Write
The idea behind a copy-on-write fork is that the pages for a parent process do not have to
be actually copied for the child until one or the other of the processes changes the page.
They can be simply shared between the two processes in the meantime, with a bit set that
the page needs to be copied if it ever gets written to. This is a reasonable approach, since
the child process usually issues an exec( ) system call immediately after the fork.
Page Replacement
In order to make the most use of virtual memory, we load several processes into memory
at the same time. Since we only load the pages that are actually needed by each process at
any given time, there is room to load many more processes than if we had to load in the
entire process.
However memory is also needed for other purposes ( such as I/O buffering ), and what
happens if some process suddenly decides it needs more pages and there aren't any free
frames available? There are several possible solutions to consider:
1. Adjust the memory used by I/O buffering, etc., to free up some frames for user
processes. The decision of how to allocate memory for I/O versus user processes
is a complex one, yielding different policies on different systems. ( Some allocate
a fixed amount for I/O, and others let the I/O system contend for memory along
with everything else. )
2. Put the process requesting more pages into a wait queue until some free frames
become available.
3. Swap some process out of memory completely, freeing up its page frames.
4. Find some page in memory that isn't being used right now, and swap that page
only out to disk, freeing up a frame that can be allocated to the process requesting
it. This is known as page replacement, and is the most common solution. There
are many different algorithms for page replacement, which is the subject of the
remainder of this section.
Need for page replacement.
The previously discussed page-fault processing assumed that there would be free
frames available on the free-frame list. Now the page-fault handling must be
modified to free up a frame if necessary, as follows:
1. Find the location of the desired page on the disk, either in swap space or in
the file system.
2. Find a free frame:
a. If there is a free frame, use it.
b. If there is no free frame, use a page-replacement algorithm to
select an existing frame to be replaced, known as the victim frame.
c. Write the victim frame to disk. Change all related page tables to
indicate that this page is no longer in memory.
3. Read in the desired page and store it in the frame. Adjust all related page
and frame tables to indicate the change.
4. Restart the process that was waiting for this page.
Page replacement.
Note that step 3c adds an extra disk write to the page-fault handling, effectively
doubling the time required to process a page fault. This can be alleviated
somewhat by assigning a modify bit, or dirty bit to each page, indicating whether
or not it has been changed since it was last loaded in from disk. If the dirty bit has
not been set, then the page is unchanged, and does not need to be written out to
disk. Otherwise the page write is required. It should come as no surprise that
many page replacement strategies specifically look for pages that do not have
their dirty bit set, and preferentially select clean pages as victim pages. It should
also be obvious that unmodifiable code pages never get their dirty bits set.
There are two major requirements to implement a successful demand paging
system. We must develop a frame-allocation algorithm and a page-replacement
algorithm. The former centers around how many frames are allocated to each
process ( and to other needs ), and the latter deals with how to select a page for
replacement when there are no free frames available.
The overall goal in selecting and tuning these algorithms is to generate the fewest
number of overall page faults. Because disk access is so slow relative to memory
access, even slight improvements to these algorithms can yield large
improvements in overall system performance.
Algorithms are evaluated using a given string of memory accesses known as a
reference string, which can be generated in one of ( at least ) three common
ways:
1. Randomly generated, either evenly distributed or with some distribution
curve based on observed system behavior. This is the fastest and easiest
approach, but may not reflect real performance well, as it ignores locality
of reference.
2. Specifically designed sequences. These are useful for illustrating the
properties of comparative algorithms in published papers and textbooks,
( and also for homework and exam problems. :-) )
3. Recorded memory references from a live system. This may be the best
approach, but the amount of data collected can be enormous, on the order
of a million addresses per second. The volume of collected data can be
reduced by making two important observations:
1. Only the page number that was accessed is relevant. The offset
within that page does not affect paging operations.
2. Successive accesses within the same page can be treated as a single
page request, because all requests after the first are guaranteed to
be page hits. ( Since there are no intervening requests for other
pages that could remove this page from the page table. )
So for example, if pages were of size 100 bytes, then the sequence
of address requests ( 0100, 0432, 0101, 0612, 0634, 0688, 0132,
0038, 0420 ) would reduce to page requests ( 1, 4, 1, 6, 1, 0, 4 )
As the number of available frames increases, the number of page faults
should decrease, as shown in Figure:
Graph of page faults versus number of frames.
A simple and obvious page replacement strategy is FIFO, i.e. first-in- first-out.
As new pages are brought in, they are added to the tail of a queue, and the page at
the head of the queue is the next victim. In the following example, 20 page
requests result in 15 page faults:
Although FIFO is simple and easy, it is not always optimal, or even efficient.
An interesting effect that can occur with FIFO is Belady's anomaly, in which
increasing the number of frames available can actually increase the number of
page faults that occur! Consider, for example, the following chart based on the
page sequence ( 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 ) and a varying number of available
frames. Obviously the maximum number of faults is 12 ( every request generates
a fault ), and the minimum number is 5 ( each page loaded only once ), but in
between there are some interesting results:
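The chart itself is a figure in the original notes, but the effect is easy to reproduce with a short FIFO simulation ( a sketch, not taken from the text ):

from collections import deque

def fifo_faults(refs, nframes):
    frames, queue, faults = set(), deque(), 0
    for page in refs:
        if page not in frames:
            faults += 1
            if len(frames) == nframes:           # no free frame: evict the oldest page
                frames.discard(queue.popleft())
            frames.add(page)
            queue.append(page)
    return faults

refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(fifo_faults(refs, 3))   # 9 faults with 3 frames
print(fifo_faults(refs, 4))   # 10 faults with 4 frames - more frames, yet more faults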
The discovery of Belady's anomaly led to the search for an optimal page-
replacement algorithm - one that yields the lowest possible number of page
faults and does not suffer from Belady's anomaly.
Such an algorithm does exist, and is called OPT or MIN. This algorithm is simply
"Replace the page that will not be used for the longest time in the future."
For example, Figure 9.14 shows that by applying OPT to the same reference
string used for the FIFO example, the minimum number of possible page faults is
9. Since 6 of the page-faults are unavoidable ( the first reference to each new
page ), FIFO can be shown to require 3 times as many ( extra ) page faults as the
optimal algorithm. ( Note: The book claims that only the first three page faults are
required by all algorithms, indicating that FIFO is only twice as bad as OPT. )
Unfortunately OPT cannot be implemented in practice, because it requires
foretelling the future, but it makes a nice benchmark for the comparison and
evaluation of real proposed new algorithms.
In practice most page-replacement algorithms try to approximate OPT by
predicting ( estimating ) in one fashion or another what page will not be used for
the longest period of time. The basis of FIFO is the prediction that the page that
was brought in the longest time ago is the one that will not be needed again for
the longest future time, but as we shall see, there are many other prediction
methods, all striving to match the performance of OPT.
The prediction behind LRU, the Least Recently Used, algorithm is that the page
that has not been used in the longest time is the one that will not be used again in
the near future. ( Note the distinction between FIFO and LRU: The former looks
at the oldest load time, and the latter looks at the oldest use time. )
Some view LRU as analogous to OPT, except looking backwards in time instead
of forwards. ( OPT has the interesting property that for any reference string S and
its reverse R, OPT will generate the same number of page faults for S and for R. It
turns out that LRU has this same property. )
Figure 9.15 illustrates LRU for our sample string, yielding 12 page faults, ( as
compared to 15 for FIFO and 9 for OPT. )
LRU page-replacement algorithm.
LRU is considered a good replacement policy, and is often used. The problem is
how exactly to implement it. There are two simple approaches commonly used:
1. Counters. Every memory access increments a counter, and the current
value of this counter is stored in the page table entry for that page. Then
finding the LRU page involves simply searching the table for the page
with the smallest counter value. Note that overflowing of the counter must
be considered.
2. Stack. Another approach is to use a stack, and whenever a page is
accessed, pull that page from the middle of the stack and place it on the
top. The LRU page will always be at the bottom of the stack. Because this
requires removing objects from the middle of the stack, a doubly linked
list is the recommended data structure.
Note that both implementations of LRU require hardware support, either for
incrementing the counter or for managing the stack, as these operations must be
performed for every memory access.
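For illustration only, here is a software sketch of the stack approach, using Python's OrderedDict in place of the doubly linked list; the hardware-assist issue noted above is ignored, and the reference string in the usage line is assumed to be the textbook's sample string mentioned earlier.

from collections import OrderedDict

def lru_faults(refs, nframes):
    frames, faults = OrderedDict(), 0          # most recently used page kept at the end
    for page in refs:
        if page in frames:
            frames.move_to_end(page)           # "pull from the middle and place on top"
        else:
            faults += 1
            if len(frames) == nframes:
                frames.popitem(last=False)     # evict the page at the bottom of the stack
            frames[page] = True
    return faults

sample = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
print(lru_faults(sample, 3))                   # 12 faults, as quoted above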
Neither LRU nor OPT exhibits Belady's anomaly. Both belong to a class of page-
replacement algorithms called stack algorithms, which can never exhibit Belady's
anomaly. A stack algorithm is one in which the pages kept in memory for a frame
set of size N will always be a subset of the pages kept for a frame size of N + 1. In
the case of LRU, ( and particularly the stack implementation thereof ), the top N
pages of the stack will be the same for all frame set sizes of N or anything larger.
Use of a stack to record the most recent page references.
Additional-Reference-Bits Algorithm
Finer grain is possible by storing the most recent 8 reference bits for each
page in an 8-bit byte in the page table entry, which is interpreted as an
unsigned int.
o At periodic intervals ( clock interrupts ), the OS takes over, and
right-shifts each of the reference bytes by one bit.
o The high-order ( leftmost ) bit is then filled in with the current
value of the reference bit, and the reference bits are cleared.
o At any given time, the page with the smallest value for the
reference byte is the LRU page.
Obviously the specific number of bits used and the frequency with which
the reference byte is updated are adjustable, and are tuned to give the
fastest performance on a given hardware platform.
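A rough sketch of this aging scheme ( the data structures and names are illustrative; in reality the reference bit lives in hardware ):

def age_pages(history, ref_bits):
    # Called at each clock interrupt: shift each history byte right one bit and
    # insert the current reference bit as the new high-order bit, then clear it.
    for page in history:
        history[page] = (history[page] >> 1) | (0x80 if ref_bits[page] else 0x00)
        ref_bits[page] = 0
    return history

def approximate_lru_page(history):
    # The page with the smallest history byte is the approximate LRU page.
    return min(history, key=history.get)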
Enhanced Second-Chance Algorithm
The enhanced second-chance algorithm looks at the reference bit and the
modify bit ( dirty bit ) as an ordered pair, and classifies pages into one of
four classes:
1. ( 0, 0 ) - Neither recently used nor modified.
2. ( 0, 1 ) - Not recently used, but modified.
3. ( 1, 0 ) - Recently used, but clean.
4. ( 1, 1 ) - Recently used and modified.
This algorithm searches the page table in a circular fashion ( in as many as
four passes ), looking for the first page it can find in the lowest numbered
category. I.e. it first makes a pass looking for a ( 0, 0 ), and then if it can't
find one, it makes another pass looking for a ( 0, 1 ), etc.
The main difference between this algorithm and the basic second-chance ( clock )
algorithm is the preference for replacing clean pages if possible.
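A minimal sketch of the classification and scan order ( the page objects, their field names, and the omission of reference-bit clearing during the passes are simplifications for illustration ):

def page_class(page):
    # ( reference bit, modify bit ) mapped to class 0..3; lower classes go first.
    return (2 if page.referenced else 0) + (1 if page.modified else 0)

def choose_victim(pages):
    # Scan the pages circularly, in as many as four passes, and take the first
    # page found in the lowest-numbered non-empty class.
    for wanted in range(4):
        for page in pages:
            if page_class(page) == wanted:
                return page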
There are several algorithms based on counting the number of references that
have been made to a given page, such as:
o Least Frequently Used, LFU: Replace the page with the lowest reference
count. A problem can occur if a page is used frequently initially and then
not used any more, as the reference count remains high. A solution to this
problem is to right-shift the counters periodically, yielding a time-
decaying average reference count.
o Most Frequently Used, MFU: Replace the page with the highest reference
count. The logic behind this idea is that pages that have already been
referenced a lot have been in the system a long time, and we are probably
done with them, whereas pages referenced only a few times have only
recently been loaded, and we still need them.
In general counting-based algorithms are not commonly used, as their
implementation is expensive and they do not approximate OPT well.
Page-Buffering Algorithms
There are a number of page-buffering algorithms that can be used in conjunction with the afore-
mentioned algorithms, to improve overall performance and sometimes make up for inherent
weaknesses in the hardware and/or the underlying page-replacement algorithms:
Maintain a certain minimum number of free frames at all times. When a page-
fault occurs, go ahead and allocate one of the free frames from the free list first, to
get the requesting process up and running again as quickly as possible, and then
select a victim page to write to disk and free up a frame as a second step.
Keep a list of modified pages, and when the I/O system is otherwise idle, have it
write these pages out to disk, and then clear the modify bits, thereby increasing
the chance of finding a "clean" page for the next potential victim.
Keep a pool of free frames, but remember what page was in it before it was made
free. Since the data in the page is not actually cleared out when the page is freed,
it can be made an active page again without having to load in any new data from
disk. This is useful when an algorithm mistakenly replaces a page that in fact is
needed again soon.
Allocation of Frames
We said earlier that there were two important tasks in virtual memory management: a page-
replacement strategy and a frame-allocation strategy. This section covers the second part of that
pair.
Allocation Algorithms
Equal Allocation - If there are m frames available and n processes to share them,
each process gets m / n frames, and the leftovers are kept in a free-frame buffer
pool.
Proportional Allocation - Allocate the frames proportionally to the size of the
process, relative to the total size of all processes. So if the size of process i is S_i,
and S is the sum of all S_i, then the allocation for process P_i is a_i = m * S_i / S.
Variations on proportional allocation could consider priority of process rather
than just their size.
Obviously all allocations fluctuate over time as the number of available free
frames, m, fluctuates, and all are also subject to the constraints of minimum
allocation. ( If the minimum allocations cannot be met, then processes must either
be swapped out or not allowed to start until more free frames become available. )
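A rough sketch of proportional allocation, with integer rounding leftovers returned to a free-frame pool ( the 10-page and 127-page process sizes and the 62 free frames are just an assumed example ):

def proportional_allocation(sizes, m):
    S = sum(sizes)
    alloc = [m * s_i // S for s_i in sizes]    # a_i = m * S_i / S, rounded down
    leftover = m - sum(alloc)                  # frames left for the free-frame pool
    return alloc, leftover

print(proportional_allocation([10, 127], 62)) # ( [4, 57], 1 )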
Global versus Local Allocation
The above arguments all assume that all memory is equivalent, or at least has
equivalent access times.
This may not be the case in multiple-processor systems, especially where each
CPU is physically located on a separate circuit board which also holds some
portion of the overall system memory.
In these latter systems, CPUs can access memory that is physically located on the
same board much faster than the memory on the other boards.
The basic solution is akin to processor affinity - At the same time that we try to
schedule processes on the same CPU to minimize cache misses, we also try to
allocate memory for those processes on the same boards, to minimize access
times.
The presence of threads complicates the picture, especially when the threads get
loaded onto different processors.
Solaris uses an lgroup as a solution, in a hierarchical fashion based on relative
latency. For example, all processors and RAM on a single board would probably
be in the same lgroup. Memory assignments are made within the same lgroup if
possible, or to the next nearest lgroup otherwise. ( Where "nearest" is defined as
having the lowest access time. )
Thrashing
If a process cannot maintain its minimum required number of frames, then it must be
swapped out, freeing up frames for other processes. This is an intermediate level of CPU
scheduling.
But what about a process that can keep its minimum, but cannot keep all of the frames
that it is currently using on a regular basis? In this case it is forced to page out pages that
it will need again in the very near future, leading to large numbers of page faults.
A process that is spending more time paging than executing is said to be thrashing.
Cause of Thrashing
Thrashing
The working set model is based on the concept of locality, and defines a working
set window, of length delta. Whatever pages are included in the most recent delta
page references are said to be in the process's working set window, and comprise
its current working set, as illustrated in Figure 9.20:
The selection of delta is critical to the success of the working set model - If it is
too small then it does not encompass all of the pages of the current locality, and if
it is too large, then it encompasses pages that are no longer being frequently
accessed.
The total demand, D, is the sum of the sizes of the working sets for all processes.
If D exceeds the total number of available frames, then at least one process is
thrashing, because there are not enough frames available to satisfy its minimum
working set. If D is significantly less than the currently available frames, then
additional processes can be launched.
The hard part of the working-set model is keeping track of what pages are in the
current working set, since every reference adds one to the set and removes one
older page. An approximation can be made using reference bits and a timer that
goes off after a set interval of memory references:
o For example, suppose that we set the timer to go off after every 5000
references ( by any process ), and we can store two additional historical
reference bits in addition to the current reference bit.
o Every time the timer goes off, the current reference bit is copied to one of
the two historical bits, and then cleared.
o If any of the three bits is set, then that page was referenced within the last
15,000 references, and is considered to be in that process's working set.
o Finer resolution can be achieved with more historical bits and a more
frequent timer, at the expense of greater overhead.
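As a sketch, exact working-set bookkeeping ( as opposed to the reference-bit approximation above ) amounts to taking the set of pages touched by the last delta references; the reference string below is made up for illustration:

def working_set(refs, delta):
    # Pages appearing in the most recent delta page references.
    return set(refs[-delta:])

refs = [1, 2, 5, 6, 2, 1, 2, 3, 7, 6, 7, 7, 1]
print(working_set(refs, 10))   # {1, 2, 3, 6, 7}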
Page-Fault Frequency
A more direct approach is to recognize that what we really want to control is the
page-fault rate, and to allocate frames based on this directly measurable value. If
the page-fault rate exceeds a certain upper bound then that process needs more
frames, and if it is below a given lower bound, then it can afford to give up some
of its frames to other processes.
( I suppose a page-replacement strategy could be devised that would select victim
frames based on the process with the lowest current page-fault frequency. )
Page-fault frequency.
Note that there is a direct relationship between the page-fault rate and the
working-set, as a process moves from one locality to another:
Unnumbered side bar in Ninth Edition
The working set minimum and maximum are normally set at 50 and 345 pages respectively. (
Maximums can be exceeded in rare circumstances. )
Free pages are maintained on a free list, with a minimum threshold indicating when there are
enough free frames available.
If a page fault occurs and the process is below its maximum, then additional pages are
allocated. Otherwise some pages from this process must be replaced, using a local page-
replacement algorithm.
If the number of free frames falls below the allowable threshold, then working-set trimming
occurs, taking frames away from any processes which are above their minimum, until all are at
their minimums. Then additional frames can be allocated to processes that need them.
The algorithm for selecting victim frames depends on the type of processor:
On single processor 80x86 systems, a variation of the clock ( second chance ) algorithm is used.
On Alpha and multiprocessor systems, clearing the reference bits may require invalidating
entries in the TLB on other processors, which is an expensive operation. In this case Windows
uses a variation of FIFO.
Unit-4
File-System Interface
File Concept
File Attributes
File Operations
o A shared lock is for reading only.
o An exclusive lock is for writing as well as reading.
o An advisory lock is informational only, and not enforced. ( A "Keep Out"
sign, which may be ignored. )
o A mandatory lock is enforced. ( A truly locked door. )
o UNIX uses advisory locks, and Windows uses mandatory locks.
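On UNIX-like systems, advisory locks can be requested through the standard fcntl module, as in this sketch ( the file name is illustrative; other processes are only excluded if they also ask for the lock ):

import fcntl

with open("shared.dat", "a") as f:
    fcntl.flock(f, fcntl.LOCK_EX)     # exclusive ( writer ) lock; use LOCK_SH for a shared lock
    f.write("one record\n")
    fcntl.flock(f, fcntl.LOCK_UN)     # release the lock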
Windows ( and some other systems ) use special file extensions to indicate
the type of each file:
Macintosh stores a creator attribute for each file, according to the program that
first created it with the create( ) system call.
UNIX stores magic numbers at the beginning of certain files. ( Experiment with
the "file" command, especially in directories such as /bin and /dev )
File Structure
Some files contain an internal structure, which may or may not be known to the
OS.
For the OS to support particular file formats increases the size and complexity of
the OS.
UNIX treats all files as sequences of bytes, with no further consideration of the
internal structure. ( With the exception of executable binary programs, which it
must know how to load and find the first executable statement, etc. )
Macintosh files have two forks - a resource fork, and a data fork. The resource
fork contains information relating to the UI, such as icons and button images, and
can be modified independently of the data fork, which contains the code or data as
appropriate.
Disk files are accessed in units of physical blocks, typically 512 bytes or some
power-of-two multiple thereof. ( Larger physical disks use larger block sizes, to
keep the range of block numbers within the range of a 32-bit integer. )
Internally, files are organized in logical units ( records ), which may be as small as a
single byte, or may be a larger size corresponding to some data record or structure
size.
The number of logical units which fit into one physical block determines its
packing, and has an impact on the amount of internal fragmentation ( wasted
space ) that occurs.
As a general rule, half a physical block is wasted for each file, and the larger the
block sizes the more space is lost to internal fragmentation.
Access Methods
Sequential Access
A sequential access file emulates magnetic tape operation, and generally supports
a few operations:
o read next - read a record and advance the tape to the next position.
o write next - write a record and advance the tape to the next position.
o rewind
o skip n records - May or may not be supported. N may be limited to
positive numbers, or may be limited to +/- 1.
Sequential-access file.
Direct Access
Jump to any record and read that record. Operations supported include:
o read n - read record number n. ( Note an argument is now required. )
o write n - write record number n. ( Note an argument is now required. )
o jump to record n - could be 0 or the end of file.
o Query current record - used to return back to this record later.
o Sequential access can be easily emulated using direct access. The inverse
is complicated and inefficient.
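A sketch of that emulation, assuming fixed-size records and a zero-based current position cp ( none of the names below come from the text itself ):

RECORD_SIZE = 128                      # assumed logical record size

def read_direct(f, n):
    # read n - jump directly to record number n and read it.
    f.seek(n * RECORD_SIZE)
    return f.read(RECORD_SIZE)

def make_sequential_reader(f):
    cp = 0                             # current position
    def read_next():                   # read next  ==  read cp; cp = cp + 1
        nonlocal cp
        data = read_direct(f, cp)
        cp += 1
        return data
    return read_next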
An indexed access scheme can be easily built on top of a direct access system.
Very large files may require a multi-tiered indexing scheme, i.e. indexes of
indexes.
Figure 11.6 - Example of index and relative files.
Directory Structure
Storage Structure
A typical file-system organization.
Single-level directory.
A master file directory is used to keep track of each user's directory, and must be
maintained when users are added to or removed from the system.
A separate directory is generally needed for system ( executable ) files.
Systems may or may not allow users to access other directories besides their own
o If access to other directories is allowed, then provision must be made to
specify the directory being accessed.
o If access is denied, then special consideration must be made for users to
run programs located in system directories. A search path is the list of
directories in which to search for executable programs, and can be set
uniquely for each user.
Tree-Structured Directories
An obvious extension to the two-tiered directory structure, and the one with
which we are all most familiar.
Each user / process has the concept of a current directory from which all
( relative ) searches take place.
Files may be accessed using either absolute pathnames ( relative to the root of the
tree ) or relative pathnames ( relative to the current directory. )
Directories are stored the same as any other file in the system, except there is a bit
that identifies them as directories, and they have some special structure that the
OS understands.
One question for consideration is whether or not to allow the removal of
directories that are not empty - Windows requires that directories be emptied first,
and UNIX provides an option for deleting entire sub-trees.
Tree-structured directory structure.
Acyclic-Graph Directories
When the same files need to be accessed in more than one place in the directory
structure ( e.g. because they are being shared by more than one user / process ), it
can be useful to provide an acyclic-graph structure. ( Note the directed arcs from
parent to child. )
UNIX provides two types of links for implementing the acyclic-graph structure. (
See "man ln" for more details. )
o A hard link ( usually just called a link ) involves multiple directory entries
that all refer to the same file. Hard links are only valid for ordinary files
in the same filesystem.
o A symbolic link involves a special file, containing information about
where to find the linked file. Symbolic links may be used to link
directories and/or files in other filesystems, as well as ordinary files in the
current filesystem.
Windows only supports symbolic links, termed shortcuts.
Hard links require a reference count, or link count for each file, keeping track of
how many directory entries are currently referring to this file. Whenever one of
the references is removed the link count is reduced, and when it reaches zero, the
disk space can be reclaimed.
For symbolic links there is some question as to what to do with the symbolic links
when the original file is moved or deleted:
o One option is to find all the symbolic links and adjust them also.
o Another is to leave the symbolic links dangling, and discover that they are
no longer valid the next time they are used.
o What if the original file is removed, and replaced with another file having
the same name before the symbolic link is next used?
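The two link types can be demonstrated with the standard os module ( the file names here are purely illustrative ):

import os

os.link("report.txt", "report_hard.txt")      # hard link: a second directory entry, same inode
os.symlink("report.txt", "report_soft.txt")   # symbolic link: a small file holding a path

print(os.stat("report.txt").st_nlink)         # link ( reference ) count is now 2
os.remove("report.txt")                       # the hard link still reaches the data...
print(os.path.exists("report_soft.txt"))      # ...but the symbolic link now dangles: False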
If cycles are allowed in the graphs, then several problems can arise:
o Search algorithms can go into infinite loops. One solution is to not follow
links in search algorithms. ( Or not to follow symbolic links, and to only
allow symbolic links to refer to directories. )
o Sub-trees can become disconnected from the rest of the tree and still not
have their reference counts reduced to zero. Periodic garbage collection is
required to detect and resolve this problem. ( chkdsk in DOS and fsck in
UNIX search for these problems, among others, even though cycles are
not supposed to be allowed in either system. Disconnected disk blocks that
are not marked as free are added back to the file systems with made-up file
names, and can usually be safely deleted. )
File-System Mounting
The basic idea behind mounting file systems is to combine multiple file systems into one
large tree structure.
The mount command is given a filesystem to mount and a mount point ( directory ) on
which to attach it.
Once a file system is mounted onto a mount point, any further references to that directory
actually refer to the root of the mounted file system.
Any files ( or sub-directories ) that had been stored in the mount point directory prior to
mounting the new filesystem are now hidden by the mounted filesystem, and are no
longer available. For this reason some systems only allow mounting onto empty
directories.
Filesystems can only be mounted by root, unless root has previously configured certain
filesystems to be mountable onto certain pre-determined mount points. ( E.g. root may
allow users to mount floppy filesystems to /mnt or something like it. ) Anyone can run
the mount command to see what filesystems are currently mounted.
Filesystems may be mounted read-only, or have other restrictions imposed.
File system. (a) Existing system. (b) Unmounted volume.
Mount point.
The traditional Windows OS runs an extended two-tier directory structure, where the first
tier of the structure separates volumes by drive letters, and a tree structure is implemented
below that level.
Macintosh runs a similar system, where each new volume that is found is automatically
mounted and added to the desktop when it is found.
More recent Windows systems allow filesystems to be mounted to any directory in the
filesystem, much like UNIX.
File Sharing
Multiple Users
On a multi-user system, more information needs to be stored for each file:
o The owner ( user ) who owns the file, and who can control its access.
o The group of other user IDs that may have some special access to the file.
o What access rights are afforded to the owner ( User ), the Group, and to
the rest of the world ( the universe, a.k.a. Others. )
o Some systems have more complicated access control, allowing or denying
specific accesses to specifically named users or groups.
File Systems
The advent of the Internet introduces issues for accessing files stored on remote
computers
o The original method was ftp, allowing individual files to be transported
across systems as needed. Ftp can be either account and password
controlled, or anonymous, not requiring any user name or password.
o Various forms of distributed file systems allow remote file systems to be
mounted onto a local directory structure, and accessed using normal file
access commands. ( The actual files are still transported across the
network as needed, possibly using ftp as the underlying transport
mechanism. )
o The WWW has made it easy once again to access files on remote systems
without mounting their filesystems, generally using ( anonymous ) ftp as
the underlying file transport mechanism.
o Servers may restrict remote access to read-only.
o Servers restrict which filesystems may be remotely mounted.
Generally the information within those subsystems is limited,
relatively public, and protected by frequent backups.
The NFS ( Network File System ) is a classic example of such a system.
The Domain Name System, DNS, provides for a unique naming system
across all of the Internet.
Domain names are maintained by the Network Information System, NIS,
which unfortunately has several security issues. NIS+ is a more secure
version, but has not yet gained the same widespread acceptance as NIS.
Microsoft's Common Internet File System, CIFS, establishes a network
login for each user on a networked system with shared file access. Older
Windows systems used domains, and newer systems ( XP, 2000 ), use
active directories. User names must match across the network for this
system to be valid.
A newer approach is the Lightweight Directory-Access Protocol, LDAP,
which provides a secure single sign-on for all users to access all resources
on a network. This is a secure system which is gaining in popularity, and
which has the maintenance advantage of combining authorization
information in one central location.
Failure Modes
Consistency Semantics
Consistency Semantics deals with the consistency between the views of shared
files on a networked system. When one user changes the file, when do other users
see the changes?
At first glance this appears to have all of the synchronization issues discussed in
Chapter 6. Unfortunately the long delays involved in network operations prohibit
the use of atomic operations as discussed in that chapter.
UNIX Semantics
Session Semantics
Immutable-Shared-Files Semantics
Protection
Files must be kept safe for reliability ( against accidental damage ), and protection
( against deliberate malicious access. ) The former is usually managed with backup
copies. This section discusses the latter.
One simple protection scheme is to remove all access to a file. However this makes the
file unusable, so some sort of controlled access must be arranged.
Types of Access
o Execute - Load the file onto the CPU and follow the instructions contained
therein.
o Append - Add to the end of an existing file.
o Delete - Remove a file from the system.
o List -View the name and other attributes of files on the system.
Higher-level operations, such as copy, can generally be performed through
combinations of the above.
Access Control
One approach is to have complicated Access Control Lists, ACL, which specify
exactly what access is allowed or denied for specific users or groups.
o The AFS uses this system for distributed access.
o Control is very finely adjustable, but may be complicated, particularly
when the specific users involved are unknown. ( AFS allows some wild
cards, so for example all users on a certain remote system may be trusted,
or a given username may be trusted when accessing from any remote
system. )
UNIX uses a set of 9 access control bits, in three groups of three. These
correspond to R, W, and X permissions for each of the Owner, Group, and Others.
( See "man chmod" for full details. ) For ordinary files the R, W, and X bits grant
permission to read, write, and execute the file; for directories they grant permission
to list the directory's contents, to create or delete entries within it, and to access
( traverse into ) the directory, respectively.
In addition there are some special bits that can also be applied:
o The set user ID ( SUID ) bit and/or the set group ID ( SGID ) bits applied
to executable files temporarily change the identity of whoever runs the
program to match that of the owner / group of the executable program.
This allows users running specific programs to have access to files ( while
running that program ) to which they would not normally have
access. Setting of these two bits is usually restricted to root, and must be
done with caution, as it introduces a potential security leak.
o The sticky bit on a directory modifies write permission, allowing users to
only delete files for which they are the owner. This allows everyone to
create files in /tmp, for example, but to only delete files which they have
created, and not anyone else's.
o The SUID, SGID, and sticky bits are indicated with an s, s, and t in the
positions for execute permission for the user, group, and others,
respectively. If the letter is lower case, ( s, s, t ), then the corresponding
execute permission is also given. If it is upper case, ( S, S, T ), then the
corresponding execute permission is NOT given.
o The numeric form of chmod is needed to set these advanced bits.
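For example, the advanced bits can be set numerically through the standard os and stat modules, as in this sketch ( the paths are illustrative, and setting SUID/SGID normally requires root ):

import os, stat

# Sticky bit plus rwx for owner, group, and others ( mode 1777 ), the classic /tmp arrangement.
os.chmod("/some/shared/dir", stat.S_ISVTX | 0o777)

# SUID plus rwxr-xr-x ( mode 4755 ) on an executable owned by root.
os.chmod("/some/program", stat.S_ISUID | 0o755)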
Some systems can apply passwords, either to individual files, or to specific sub-
directories, or to the entire system. There is a trade-off between the number of
passwords that must be maintained ( and remembered by the users ) and the
amount of information that is vulnerable to a lost or forgotten password.
Older systems which did not originally have multi-user file access permissions (
DOS and older versions of Mac ) must now be retrofitted if they are to share files
on a network.
Access to a file requires access to all the directories along its path as well. In a cyclic
directory structure, users may have different access to the same file accessed
through different paths.
Sometimes just the knowledge of the existence of a file of a certain name is a
security ( or privacy ) concern. Hence the distinction between the R and X bits on
UNIX directories.
File-System Implementation
File-System Implementation
Overview
Figure 12.2 - A typical file-control block.
In-memory file-system structures. (a) File open. (b) File read.
Physical disks are commonly divided into smaller units called partitions. They can
also be combined into larger units, but that is most commonly done for RAID
installations and is left for later chapters.
Partitions can either be used as raw devices ( with no structure imposed upon
them ), or they can be formatted to hold a filesystem ( i.e. populated with FCBs
and initial directory structures as appropriate. ) Raw partitions are generally used
for swap space, and may also be used for certain programs such as databases that
choose to manage their own disk storage system. Partitions containing filesystems
can generally only be accessed using the file system structure by ordinary users,
but can often be accessed as a raw device also by root.
The boot block is accessed as part of a raw partition, by the boot program prior to
any operating system being loaded. Modern boot programs understand multiple
OSes and filesystem formats, and can give the user a choice of which of several
available systems to boot.
The root partition contains the OS kernel and at least the key portions of the OS
needed to complete the boot process. At boot time the root partition is mounted,
and control is transferred from the boot program to the kernel found there. ( Older
systems required that the root partition lie completely within the first 1024
cylinders of the disk, because that was as far as the boot program could reach.
Once the kernel had control, then it could access partitions beyond the 1024
cylinder boundary. )
Continuing with the boot process, additional filesystems get mounted, adding
their information into the appropriate mount table structure. As a part of the
mounting process the file systems may be checked for errors or inconsistencies,
either because they are flagged as not having been closed properly the last time
they were used, or just on general principles. Filesystems may be mounted either
automatically or manually. In UNIX a mount point is indicated by setting a flag in
the in-memory copy of the inode, so all future references to that inode get re-
directed to the root directory of the mounted filesystem.
Figure 12.4 - Schematic view of a virtual file system.
Directory Implementation
Directories need to be fast to search, insert, and delete, with a minimum of wasted disk
space.
Linear List
A linear list is the simplest and easiest directory structure to set up, but it does
have some drawbacks.
Finding a file ( or verifying one does not already exist upon creation ) requires a
linear search.
Deletions can be done by moving all entries, flagging an entry as deleted, or by
moving the last entry into the newly vacant position.
Sorting the list makes searches faster, at the expense of more complex insertions
and deletions.
A linked list makes insertions and deletions into a sorted list easier, with overhead
for the links.
More complex data structures, such as B-trees, could also be considered.
Hash Table
Allocation Methods
There are three major methods of storing files on disks: contiguous, linked, and indexed.
Contiguous Allocation
Figure 12.5 - Contiguous allocation of disk space.
Linked Allocation
Disk files can be stored as linked lists, with the expense of the storage space
consumed by each link. ( E.g. a block may be 508 bytes instead of 512. )
Linked allocation involves no external fragmentation, does not require pre-known
file sizes, and allows files to grow dynamically at any time.
Unfortunately linked allocation is only efficient for sequential access files, as
random access requires starting at the beginning of the list for each new location
access.
Allocating clusters of blocks reduces the space wasted by pointers, at the cost of
internal fragmentation.
Another big problem with linked allocation is reliability if a pointer is lost or
damaged. Doubly linked lists provide some protection, at the cost of additional
overhead and wasted space.
Figure 12.6 - Linked allocation of disk space.
The File Allocation Table, FAT, used by DOS is a variation of linked allocation,
where all the links are stored in a separate table at the beginning of the disk. The
benefit of this approach is that the FAT table can be cached in memory, greatly
improving random access speeds.
Figure 12.7 File-allocation table.
Indexed Allocation
Indexed Allocation combines all of the indexes for accessing each file into a
common block ( for that file ), as opposed to spreading them all over the disk or
storing them in a FAT table.
Figure 12.8 - Indexed allocation of disk space.
Some disk space is wasted ( relative to linked lists or FAT tables ) because an
entire index block must be allocated for each file, regardless of how many data
blocks the file contains. This leads to questions of how big the index block should
be, and how it should be implemented. There are several approaches:
o Linked Scheme - An index block is one disk block, which can be read
and written in a single disk operation. The first index block contains some
header information, the first N block addresses, and if necessary a pointer
to additional linked index blocks.
o Multi-Level Index - The first index block contains a set of pointers to
secondary index blocks, which in turn contain pointers to the actual data
blocks.
o Combined Scheme - This is the scheme used in UNIX inodes, in which
the first 12 or so data block pointers are stored directly in the inode, and
then singly, doubly, and triply indirect pointers provide access to more
data blocks as needed. ( See below. ) The advantage of this scheme is that
for small files ( which many are ), the data blocks are readily accessible (
up to 48K with 4K block sizes ); files up to about 4144K ( using 4K
blocks ) are accessible with only a single indirect block ( which can be
cached ), and huge files are still accessible using a relatively small number
of disk accesses ( larger in theory than can be addressed by a 32-bit
address, which is why some systems have moved to 64-bit file pointers. )
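The sizes quoted above follow from simple arithmetic, assuming 4K blocks, 4-byte block pointers, and 12 direct pointers:

BLOCK = 4 * 1024
PTRS_PER_BLOCK = BLOCK // 4                  # 1024 pointers fit in one index block

direct = 12 * BLOCK                          # 48K reachable directly from the inode
single = PTRS_PER_BLOCK * BLOCK              # + 4096K through the single indirect block
double = PTRS_PER_BLOCK ** 2 * BLOCK         # + 4G  through the double indirect block
triple = PTRS_PER_BLOCK ** 3 * BLOCK         # + 4T  through the triple indirect block

print((direct + single) // 1024)             # 4144 ( K ), as quoted above
print(direct + single + double + triple)     # over 4 TB, more than a 32-bit offset can address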
Performance
The optimal allocation method is different for sequential access files than for
random access files, and is also different for small files than for large files.
Some systems support more than one allocation method, which may require
specifying how the file is to be used ( sequential or random access ) at the time it
is allocated. Such systems also provide conversion utilities.
Some systems have been known to use contiguous access for small files, and
automatically switch to an indexed scheme when file sizes surpass a certain
threshold.
And of course some systems adjust their allocation schemes ( e.g. block sizes ) to
best match the characteristics of the hardware for optimum performance.
Free-Space Management
Bit Vector
One simple approach is to use a bit vector, in which each bit represents a disk
block, set to 1 if free or 0 if allocated.
Fast algorithms exist for quickly finding contiguous blocks of a given size.
The down side is the memory required for the map - a 40GB disk, for example,
requires over 5MB just to store the bitmap.
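A sketch of the bit-vector bookkeeping; the 1 KB block size is an assumption ( it is what makes the 40GB / 5MB figure above work out ), and the convention here is that a 1 bit means the block is free:

DISK_BYTES  = 40 * 2**30
BLOCK_BYTES = 1024

nblocks = DISK_BYTES // BLOCK_BYTES
bitmap  = bytearray(nblocks // 8)            # about 5.24 MB; initially every block marked allocated

def first_free_block():
    # Skip whole bytes that are all zero, then find the first set bit.
    for i, byte in enumerate(bitmap):
        if byte:
            for bit in range(8):
                if byte & (1 << bit):
                    return i * 8 + bit
    return None                              # no free blocks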
Linked List
A linked list can also be used to keep track of all free blocks.
Traversing the list and/or finding a contiguous block of a given size are not easy,
but fortunately are not frequently needed operations. Generally the system just
adds and removes single blocks from the beginning of the list.
The FAT table keeps track of the free list as just one more linked list on the table.
Grouping
A variation on linked list free lists is to use links of blocks of indices of free
blocks. If a block holds up to N addresses, then the first block in the linked-list
contains up to N-1 addresses of free blocks and a pointer to the next block of free
addresses.
Counting
When there are multiple contiguous blocks of free space then the system can keep
track of the starting address of the group and the number of contiguous free
blocks. As long as the average length of a contiguous group of free blocks is
greater than two this offers a savings in space needed for the free list. ( Similar to
compression techniques used for graphics images when a group of pixels all the
same color is encountered. )
Space Maps
Sun's ZFS file system was designed for HUGE numbers and sizes of files,
directories, and even file systems.
The resulting data structures could be VERY inefficient if not implemented
carefully. For example, freeing up a 1 GB file on a 1 TB file system could involve
updating thousands of blocks of free list bit maps if the file was spread across the
disk.
ZFS uses a combination of techniques, starting with dividing the disk up into (
hundreds of ) metaslabs of a manageable size, each having their own space map.
Free blocks are managed using the counting technique, but rather than write the
information to a table, it is recorded in a log-structured transaction record.
Adjacent free blocks are also coalesced into a larger single free block.
An in-memory space map is constructed using a balanced tree data structure,
constructed from the log data.
The combination of the in-memory tree and the on-disk log provide for very fast
and efficient management of these very large files and free blocks.
Efficiency
UNIX pre-allocates inodes, which occupies space even before any files are
created.
UNIX also distributes inodes across the disk, and tries to store data files near their
inode, to reduce the distance of disk seeks between the inodes and the data.
Some systems use variable size clusters depending on the file size.
The more data that is stored in a directory ( e.g. last access time ), the more often
the directory blocks have to be re-written.
As technology advances, addressing schemes have had to grow as well.
o Sun's ZFS file system uses 128-bit pointers, which should theoretically
never need to be expanded. ( The mass required to store 2^128 bytes with
atomic storage would be at least 272 trillion kilograms! )
Kernel table sizes used to be fixed, and could only be changed by rebuilding the
kernels. Modern tables are dynamically allocated, but that requires more
complicated algorithms for accessing them.
Performance
I/O using a unified buffer cache.
Page replacement strategies can be complicated with a unified cache, as one needs
to decide whether to replace process or file pages, and how many pages to
guarantee to each category of pages. Solaris, for example, has gone through many
variations, resulting in priority paging giving process pages priority over file I/O
pages, and setting limits so that neither can knock the other completely out of
memory.
Another issue affecting performance is the question of whether to implement
synchronous writes or asynchronous writes. Synchronous writes occur in the
order in which the disk subsystem receives them, without caching; Asynchronous
writes are cached, allowing the disk subsystem to schedule writes in a more
efficient order ( See Chapter 12. ) Metadata writes are often done synchronously.
Some systems support flags to the open call requiring that writes be synchronous,
for example for the benefit of database systems that require their writes be
performed in a required order.
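On POSIX systems one such flag is O_SYNC, as in this sketch ( the path and data are illustrative ):

import os

fd = os.open("journal.log", os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
os.write(fd, b"committed record\n")   # does not return until the data has reached the disk
os.close(fd)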
The type of file access can also have an impact on optimal page replacement
policies. For example, LRU is not necessarily a good policy for sequential access
files. For these types of files progression normally goes in a forward direction
only, and the most recently used page will not be needed again until after the file
has been rewound and re-read from the beginning, ( if it is ever needed at all. ) On
the other hand, we can expect to need the next page in the file fairly soon. For this
reason sequential access files often take advantage of two special policies:
o Free-behind frees up a page as soon as the next page in the file is
requested, with the assumption that we are now done with the old page
and won't need it again for a long time.
o Read-ahead reads the requested page and several subsequent pages at the
same time, with the assumption that those pages will be needed in the near
future. This is similar to the track caching that is already performed by the
disk controller, except it saves the future latency of transferring data from
the disk controller memory into motherboard main memory.
The caching system and asynchronous writes speed up disk writes considerably,
because the disk subsystem can schedule physical writes to the disk to minimize
head movement and disk seek times. ( See Chapter 12. ) Reads, on the other hand,
must be done more synchronously in spite of the caching system, with the result
that disk writes can counter-intuitively be much faster on average than disk reads.
Mass-Storage Structure
Overview of Mass-Storage Structure
Magnetic Disks
The storage capacity of a traditional disk drive equals the number of heads (
working surfaces ) times the number of cylinders per surface times the number of
sectors per track times the number of bytes per sector. A particular physical block
of data is specified by providing the head-sector-cylinder number at which it is
located.
In operation the disk rotates at high speed, such as 7200 rpm ( 120 revolutions per
second. ) The rate at which data can be transferred from the disk to the computer
is composed of several steps:
o The positioning time, a.k.a. the seek time or random access time is the
time required to move the heads from one cylinder to another, and for the
heads to settle down after the move. This is typically the slowest step in
the process and the predominant bottleneck to overall transfer rates.
o The rotational latency is the amount of time required for the desired
sector to rotate around and come under the read-write head. This can range
anywhere from zero to one full revolution, and on the average will equal
one-half revolution. This is another physical step and is usually the second
slowest step behind seek time. ( For a disk rotating at 7200 rpm, the
average rotational latency would be 1/2 revolution / 120 revolutions per
second, or just over 4 milliseconds, a long time by computer standards. )
o The transfer rate is the rate at which data can be moved
electronically from the disk to the computer. ( Some authors may also use
the term transfer rate to refer to the overall transfer rate, including seek
time and rotational latency as well as the electronic data transfer rate. )
Disk heads "fly" over the surface on a very thin cushion of air. If they should
accidentally contact the disk, then a head crash occurs, which may or may not
permanently damage the disk or even destroy it completely. For this reason it is
normal to park the disk heads when turning a computer off, which means to move
the heads off the disk or to an area of the disk where there is no data stored.
Floppy disks are normally removable. Hard drives can also be removable, and
some are even hot-swappable, meaning they can be removed while the computer
is running, and a new hard drive inserted in their place.
Disk drives are connected to the computer via a cable known as the I/O Bus.
Some of the common interface formats include Enhanced Integrated Drive
Electronics, EIDE; Advanced Technology Attachment, ATA; Serial ATA, SATA;
Universal Serial Bus, USB; Fibre Channel, FC; and Small Computer Systems
Interface, SCSI.
The host controller is at the computer end of the I/O bus, and the disk controller
is built into the disk itself. The CPU issues commands to the host controller via
I/O ports. Data is transferred between the magnetic surface and onboard cache by
the disk controller, and then the data is transferred from that cache to the host
controller and the motherboard memory at electronic speeds.
Magnetic Tapes
Magnetic tapes were once used for common secondary storage before the days of
hard disk drives, but today are used primarily for backups.
Accessing a particular spot on a magnetic tape can be slow, but once reading or
writing commences, access speeds are comparable to disk drives.
Capacities of tape drives can range from 20 to 200 GB, and compression can
double that capacity.
Disk Structure
The traditional head-sector-cylinder, HSC numbers are mapped to linear block addresses
by numbering the first sector on the first head on the outermost track as sector 0.
Numbering proceeds with the rest of the sectors on that same track, and then the rest of
the tracks on the same cylinder before proceeding through the rest of the cylinders to the
center of the disk. In modern practice these linear block addresses are used in place of the
HSC numbers for a variety of reasons:
1. The linear length of tracks near the outer edge of the disk is much longer than for
those tracks located near the center, and therefore it is possible to squeeze many
more sectors onto outer tracks than onto inner ones.
2. All disks have some bad sectors, and therefore disks maintain a few spare sectors
that can be used in place of the bad ones. The mapping of spare sectors to bad
sectors is managed internally by the disk controller.
3. Modern hard drives can have thousands of cylinders, and hundreds of sectors per
track on their outermost tracks. These numbers exceed the range of HSC numbers
for many ( older ) operating systems, and therefore disks can be configured for
any convenient combination of HSC values that falls within the total number of
sectors physically on the drive.
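As a concrete sketch of the numbering scheme described above ( the geometry constants are illustrative, and cylinder, head, and sector are taken as zero-based ):

HEADS_PER_CYLINDER = 16
SECTORS_PER_TRACK  = 63

def chs_to_lba(cylinder, head, sector):
    # Sectors of one track first, then the other tracks ( heads ) of the same
    # cylinder, then the remaining cylinders toward the center of the disk.
    return (cylinder * HEADS_PER_CYLINDER + head) * SECTORS_PER_TRACK + sector

print(chs_to_lba(0, 0, 0))   # 0  - first sector, first head, outermost cylinder
print(chs_to_lba(0, 1, 0))   # 63 - first sector of the next track in the same cylinder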
There is a limit to how closely packed individual bits can be placed on physical media,
but that limit keeps increasing as technological advances are made.
Modern disks pack many more sectors into outer cylinders than inner ones, using one of
two approaches:
o With Constant Linear Velocity, CLV, the density of bits is uniform from cylinder
to cylinder. Because there are more sectors in outer cylinders, the disk spins
slower when reading those cylinders, causing the rate of bits passing under the
read-write head to remain constant. This is the approach used by modern CDs and
DVDs.
o With Constant Angular Velocity, CAV, the disk rotates at a constant angular
speed, with the bit density decreasing on outer cylinders. ( These disks would
have a constant number of sectors per track on all cylinders. )
Disk Attachment
Disk drives can be attached either directly to a particular host ( a local disk ) or to a network.
Host-Attached Storage
o The arbitrated loop, FC-AL, that can address up to 126 devices ( drives
and controllers. )
Network-Attached Storage
Storage-Area Network
Figure 10.3 - Storage-area network.
Disk Scheduling
As mentioned earlier, disk transfer speeds are limited primarily by seek times and
rotational latency. When multiple requests are to be processed there is also some
inherent delay in waiting for other requests to be processed.
Bandwidth is measured by the amount of data transferred divided by the total amount of
time from the first request being made to the last transfer being completed, ( for a series
of disk requests. )
Both bandwidth and access time can be improved by processing requests in a good order.
Disk requests include the disk address, memory address, number of sectors to transfer,
and whether the request is for reading or writing.
FCFS Scheduling
First-Come First-Serve is simple and intrinsically fair, but not very efficient.
Consider in the following sequence the wild swing from cylinder 122 to 14 and
then back to 124:
FCFS disk scheduling.
SSTF Scheduling
Shortest Seek Time First scheduling is more efficient, but may lead to starvation
if a constant stream of requests arrives for the same general area of the disk.
SSTF reduces the total head movement to 236 cylinders, down from 640 required
for the same set of requests under FCFS. Note, however that the distance could be
reduced still further to 208 by starting with 37 and then 14 first before processing
the rest of the requests.
SSTF disk scheduling.
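The 640 and 236 cylinder totals can be reproduced with a short simulation, assuming the classic request queue behind these figures ( 98, 183, 37, 122, 14, 124, 65, 67, with the head starting at cylinder 53 ):

def fcfs(start, queue):
    pos, total = start, 0
    for cyl in queue:                  # service requests strictly in arrival order
        total += abs(cyl - pos)
        pos = cyl
    return total

def sstf(start, queue):
    pos, total, pending = start, 0, list(queue)
    while pending:
        nxt = min(pending, key=lambda c: abs(c - pos))   # always take the closest request
        total += abs(nxt - pos)
        pos = nxt
        pending.remove(nxt)
    return total

queue = [98, 183, 37, 122, 14, 124, 65, 67]
print(fcfs(53, queue))   # 640 cylinders of head movement
print(sstf(53, queue))   # 236 cylinders of head movement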
SCAN Scheduling
The SCAN algorithm, a.k.a. the elevator algorithm, moves back and forth from
one end of the disk to the other, similarly to an elevator processing requests in a
tall building.
SCAN disk scheduling.
Under the SCAN algorithm, if a request arrives just ahead of the moving head
then it will be processed right away, but if it arrives just after the head has passed,
then it will have to wait for the head to pass going the other way on the return trip.
This leads to a fairly wide variation in access times which can be improved upon.
Consider, for example, when the head reaches the high end of the disk: Requests
with high cylinder numbers just missed the passing head, which means they are
all fairly recent requests, whereas requests with low numbers may have been
waiting for a much longer time. Making the return scan from high to low then
ends up accessing recent requests first and making older requests wait that much
longer.
C-SCAN disk scheduling.
Selection of a Disk-Scheduling Algorithm
With very low loads all algorithms are equal, since there will normally only be
one request to process at a time.
For slightly larger loads, SSTF offers better performance than FCFS, but may lead
to starvation when loads become heavy enough.
For busier systems, SCAN and LOOK algorithms eliminate starvation problems.
The actual optimal algorithm may be something even more complex than those
discussed here, but the incremental improvements are generally not worth the
additional overhead.
Some improvement to overall filesystem access times can be made by intelligent
placement of directory and/or inode information. If those structures are placed in
the middle of the disk instead of at the beginning of the disk, then the maximum
distance from those structures to data blocks is reduced to only one-half of the
disk size. If those structures can be further distributed and furthermore have their
data blocks stored as close as possible to the corresponding directory structures,
then that reduces still further the overall time to find the disk block numbers and
then access the corresponding data blocks.
On modern disks the rotational latency can be almost as significant as the seek
time; however, it is not within the OS's control to account for that, because
modern disks do not reveal their internal sector mapping schemes, ( particularly
when bad blocks have been remapped to spare sectors. )
o Some disk manufacturers provide for disk scheduling algorithms directly
on their disk controllers, ( which do know the actual geometry of the disk
as well as any remapping ), so that if a series of requests are sent from the
computer to the controller then those requests can be processed in an
optimal order.
o Unfortunately there are some considerations that the OS must take into
account that are beyond the abilities of the on-board disk-scheduling
algorithms, such as priorities of some requests over others, or the need to
process certain requests in a particular order. For this reason OSes may
elect to spoon-feed requests to the disk controller one at a time in certain
situations.
Disk Management
Disk Formatting
Before a disk can be used, it has to be low-level formatted, which means laying
down all of the headers and trailers marking the beginning and ends of each
sector. Included in the header and trailer are the linear sector numbers, and error-
correcting codes, ECC, which allow damaged sectors to not only be detected, but
in many cases for the damaged data to be recovered ( depending on the extent of
the damage. ) Sector sizes are traditionally 512 bytes, but may be larger,
particularly in larger drives.
ECC calculation is performed with every disk read or write, and if damage is
detected but the data is recoverable, then a soft error has occurred. Soft errors are
generally handled by the on-board disk controller, and never seen by the OS. ( See
below. )
Once the disk is low-level formatted, the next step is to partition the drive into one
or more separate partitions. This step must be completed even if the disk is to be
used as a single large partition, so that the partition table can be written to the
beginning of the disk.
After partitioning, then the filesystems must be logically formatted, which
involves laying down the master directory information ( FAT table or inode
structure ), initializing free lists, and creating at least the root directory of the
filesystem. ( Disk partitions which are to be used as raw devices are not logically
formatted. This saves the overhead and disk space of the filesystem structure, but
requires that the application program manage its own disk storage requirements. )
Boot Block
Booting from disk in Windows 2000.
Bad Blocks
No disk can be manufactured to 100% perfection, and all physical objects wear
out over time. For these reasons all disks are shipped with a few bad blocks, and
additional blocks can be expected to go bad slowly over time. If a large number of
blocks go bad then the entire disk will need to be replaced, but a few here and
there can be handled through other means.
In the old days, bad blocks had to be checked for manually. Formatting of the disk
or running certain disk-analysis tools would identify bad blocks, and attempt to
read the data off of them one last time through repeated tries. Then the bad blocks
would be mapped out and taken out of future service. Sometimes the data could
be recovered, and sometimes it was lost forever. ( Disk analysis tools could be
either destructive or non-destructive. )
Modern disk controllers make much better use of the error-correcting codes, so
that bad blocks can be detected earlier and the data usually recovered. ( Recall
that blocks are tested with every write as well as with every read, so often errors
can be detected before the write operation is complete, and the data simply written
to a different sector instead. )
Note that re-mapping of sectors from their normal linear progression can throw
off the disk scheduling optimization of the OS, especially if the replacement
sector is physically far away from the sector it is replacing. For this reason most
disks normally keep a few spare sectors on each cylinder, as well as at least one
spare cylinder. Whenever possible a bad sector will be mapped to another sector
on the same cylinder, or at least a cylinder as close as possible. Sector slipping
may also be performed, in which all sectors between the bad sector and the
replacement sector are moved down by one, so that the linear progression of
sector numbers can be maintained.
If the data on a bad block cannot be recovered, then a hard error has occurred,
which requires replacing the file(s) from backups, or rebuilding them from
scratch.
Swap-Space Management
Modern systems typically swap out pages as needed, rather than swapping out entire
processes. Hence the swapping system is part of the virtual memory management system.
Managing swap space is obviously an important task for modern OSes.
Swap-Space Use
Swap-Space Location
Historically OSes swapped out entire processes as needed. Modern systems swap
out only individual pages, and only as needed. ( For example, code pages and
other pages that have not been modified since they were originally loaded are
normally just freed by the virtual memory system rather than copied to swap
space, because it is faster to read them back in from the filesystem than to
write them out to swap space and then read them back. )
In the swap-map scheme used by Linux, a map of swap space is kept in memory,
where each entry corresponds to a 4K block in the swap space. A zero indicates
a free slot, and a non-zero value records how many processes have a mapping to
that particular block ( greater than 1 for shared pages only. )
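The following is a minimal C sketch of such a swap map, assuming a fixed-size
swap area divided into 4K slots; the identifiers are illustrative and are not
the actual Linux kernel names. A count of zero marks a free slot, and a
non-zero count records how many processes map the page stored there.

/* Minimal sketch of a Linux-style swap map: one counter per 4 KB swap slot.
 * A count of 0 means the slot is free; a non-zero count is the number of
 * processes that currently map the page stored in that slot.
 * All names here are illustrative, not actual kernel identifiers. */
#include <stdio.h>

#define SWAP_SLOTS 1024                     /* swap area size in 4 KB slots */

static unsigned int swap_map[SWAP_SLOTS];   /* zero-initialised: all free   */

/* Reserve a free slot for a page being swapped out; returns -1 if full. */
int swap_alloc(void)
{
    for (int i = 0; i < SWAP_SLOTS; i++) {
        if (swap_map[i] == 0) {
            swap_map[i] = 1;                /* one process maps this slot   */
            return i;
        }
    }
    return -1;                              /* swap space exhausted         */
}

/* Another process starts sharing the page stored in this slot. */
void swap_dup(int slot)  { swap_map[slot]++; }

/* A process drops its reference; the slot is free once the count hits 0. */
void swap_free(int slot) { if (swap_map[slot] > 0) swap_map[slot]--; }

int main(void)
{
    int s = swap_alloc();
    swap_dup(s);                            /* shared page: count is now 2  */
    printf("slot %d has %u mappers\n", s, swap_map[s]);
    swap_free(s);
    swap_free(s);                           /* count back to 0: reusable    */
    return 0;
}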
Deadlocks
System Model
Necessary Conditions
Resource-Allocation Graph
In some cases deadlocks can be understood more clearly through the use
of Resource-Allocation Graphs, having the following properties:
o A set of resource categories, { R1, R2, R3, . . ., RN }, which
appear as square nodes on the graph. Dots inside the resource
nodes indicate specific instances of the resource. ( E.g. two dots
might represent two laser printers. )
o A set of processes, { P1, P2, P3, . . ., PN }
o Request Edges - A set of directed arcs from Pi to Rj, indicating
that process Pi has requested Rj, and is currently waiting for that
resource to become available.
o Assignment Edges - A set of directed arcs from Rj to Pi
indicating that resource Rj has been allocated to process Pi, and
that Pi is currently holding resource Rj.
o Note that a request edge can be converted into an assignment
edge by reversing the direction of the arc when the request is
granted. ( However note also that request edges point to the
category box, whereas assignment edges emanate from a
particular instance dot within the box. )
o For example:
Figure 7.1 - Resource allocation graph
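A resource-allocation graph can be represented quite directly in code. The toy
C sketch below keeps the request and assignment edges as small matrices and
shows how granting a request converts a request edge into an assignment edge;
all names and sizes are illustrative, not taken from any particular system.

/* Toy resource-allocation graph: resource categories with instance counts,
 * plus the two edge types described above.  Illustrative only. */
#include <stdbool.h>

#define NPROC 4                         /* processes P1 .. P4               */
#define NRES  3                         /* resource categories R1 .. R3     */

typedef struct {
    int  instances[NRES];               /* dots inside each resource box    */
    bool request[NPROC][NRES];          /* request edge   Pi -> Rj          */
    int  assigned[NPROC][NRES];         /* assignment edges Rj -> Pi        */
} rag_t;

/* Granting a request reverses the arc: the request edge becomes an
 * assignment edge and one instance of the resource is consumed. */
static void grant(rag_t *g, int p, int r)
{
    if (g->request[p][r] && g->instances[r] > 0) {
        g->request[p][r] = false;
        g->assigned[p][r]++;
        g->instances[r]--;              /* one instance now held by Pi      */
    }
}

int main(void)
{
    rag_t g = { .instances = { 2, 1, 1 } };  /* e.g. two laser printers in R1 */
    g.request[0][0] = true;                  /* request edge P1 -> R1         */
    grant(&g, 0, 0);                         /* now assignment edge R1 -> P1  */
    return g.assigned[0][0] == 1 ? 0 : 1;
}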
Deadlock Prevention
Mutual Exclusion
No Preemption
Circular Wait
One way to avoid circular wait is to number all resources, and to require
that processes request resources only in strictly increasing ( or
decreasing ) order.
In other words, in order to request resource Rj, a process must first
release all Ri such that i >= j.
One big challenge in this scheme is determining the relative ordering of
the different resources.
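As a concrete illustration of resource ordering, the C sketch below numbers
four resources and always acquires the lower-numbered lock first, no matter
which order the caller names them in. It is illustrative only and assumes the
two requested resources are distinct.

/* Resource ordering to break circular wait: every resource has a fixed
 * number, and locks are always taken in increasing order of that number. */
#include <pthread.h>

static pthread_mutex_t resource[4] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

/* Acquire two distinct resources deadlock-free by honouring the ordering. */
void lock_pair(int a, int b)
{
    int lo = (a < b) ? a : b;
    int hi = (a < b) ? b : a;
    pthread_mutex_lock(&resource[lo]);   /* lower-numbered resource first    */
    pthread_mutex_lock(&resource[hi]);
}

void unlock_pair(int a, int b)
{
    pthread_mutex_unlock(&resource[a]);  /* release order does not matter    */
    pthread_mutex_unlock(&resource[b]);
}

int main(void)
{
    lock_pair(3, 1);                     /* always acquires R1 before R3     */
    unlock_pair(3, 1);
    return 0;
}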
Deadlock Avoidance
The general idea behind deadlock avoidance is to never let the system enter a
state from which a deadlock could develop: using advance knowledge of what
processes may request, the scheduler refuses any allocation that could lead to
deadlock.
This requires more information about each process, AND tends to lead to low
device utilization. ( I.e. it is a conservative approach. )
In some algorithms the scheduler only needs to know the maximum number of
each resource that a process might potentially use. In more complex algorithms
the scheduler can also take advantage of the schedule of exactly what resources
may be needed in what order.
When a scheduler sees that starting a process or granting resource requests may
lead to future deadlocks, then that process is just not started or the request is not
granted.
A resource allocation state is defined by the number of available and allocated
resources, and the maximum requirements of all processes in the system.
Safe State
A state is safe if the system can allocate all resources requested by all
processes ( up to their stated maximums ) without entering a deadlock
state.
More formally, a state is safe if there exists a safe sequence of processes
{ P0, P1, P2, ..., PN } such that, for each Pi, the resources that Pi may still
request can be satisfied by the currently available resources plus the
resources held by all processes Pj where j < i. ( I.e. if all the processes
prior to Pi finish and free up their resources, then Pi will be able to finish
also, using the resources that they have freed up. )
If a safe sequence does not exist, then the system is in an unsafe state,
which MAY lead to deadlock. ( All safe states are deadlock free, but not
all unsafe states lead to deadlocks. )
Figure 7.6 - Safe, unsafe, and deadlocked state spaces.
Banker's Algorithm
For resource categories that contain more than one instance the resource-
allocation graph method does not work, and more complex ( and less
efficient ) methods must be chosen.
The Banker's Algorithm gets its name because it is a method that
bankers could use to assure that when they lend out resources they will
still be able to satisfy all their clients. ( A banker won't loan out a little
money to start building a house unless they are assured that they will
later be able to loan out the rest of the money to finish the house. )
When a process starts up, it must state in advance the maximum
allocation of resources it may request, up to the amount available on the
system.
When a request is made, the scheduler determines whether granting the
request would leave the system in a safe state. If not, then the process
must wait until the request can be granted safely.
The banker's algorithm relies on several key data structures: ( where n is
the number of processes and m is the number of resource categories. )
o Available[ m ] indicates how many resources are currently
available of each type.
o Max[ n ][ m ] indicates the maximum demand of each process of
each resource.
o Allocation[ n ][ m ] indicates the number of each resource
category allocated to each process.
o Need[ n ][ m ] indicates the remaining resources needed of each
type for each process. ( Note that Need[ i ][ j ] = Max[ i ][ j ] -
Allocation[ i ][ j ] for all i, j. )
For simplification of discussions, we make the following notations /
observations:
o One row of the Need matrix, Need[ i ], can be treated as a vector
corresponding to the needs of process i, and similarly for
Allocation and Max.
o A vector X is considered to be <= a vector Y if X[ i ] <= Y[ i ] for
all i.
Safety Algorithm
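A sketch of the safety check in C follows, using the Available, Allocation,
and Need structures described above. The sample values in main() are
illustrative, chosen to match a common textbook example; the function simply
reports whether a safe sequence exists.

/* Banker's algorithm safety check over the structures described above. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define N 5   /* number of processes           */
#define M 3   /* number of resource categories */

bool is_safe(int available[M], int allocation[N][M], int need[N][M])
{
    int  work[M];
    bool finish[N] = { false };

    memcpy(work, available, sizeof(work));   /* Work = Available            */

    for (int done = 0; done < N; ) {
        bool progress = false;
        for (int i = 0; i < N; i++) {
            if (finish[i]) continue;

            /* Can Pi's remaining need be met with what is free right now? */
            bool can_run = true;
            for (int j = 0; j < M; j++)
                if (need[i][j] > work[j]) { can_run = false; break; }

            if (can_run) {
                /* Pretend Pi runs to completion and releases everything.  */
                for (int j = 0; j < M; j++)
                    work[j] += allocation[i][j];
                finish[i] = true;
                progress  = true;
                done++;
            }
        }
        if (!progress)
            return false;        /* no safe sequence exists: unsafe state   */
    }
    return true;                 /* every process can finish: safe state    */
}

int main(void)
{
    /* Illustrative values only. */
    int available[M]     = { 3, 3, 2 };
    int allocation[N][M] = { {0,1,0}, {2,0,0}, {3,0,2}, {2,1,1}, {0,0,2} };
    int max[N][M]        = { {7,5,3}, {3,2,2}, {9,0,2}, {2,2,2}, {4,3,3} };
    int need[N][M];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            need[i][j] = max[i][j] - allocation[i][j];   /* Need = Max - Alloc */

    printf("state is %s\n",
           is_safe(available, allocation, need) ? "safe" : "unsafe");
    return 0;
}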
An Illustrative Example
Deadlock Detection
If deadlocks are not avoided, then another approach is to detect when they have
occurred and recover somehow.
In addition to the performance hit of constantly checking for deadlocks, a
policy / algorithm must be in place for recovering from deadlocks, and there is
potential for lost work when processes must be aborted or have their resources
preempted.
Single Instance of Each Resource Type
Detection-Algorithm Usage
Process Termination
Protection
Goals of Protection
The principle of least privilege dictates that programs, users, and systems be
given just enough privileges to perform their tasks.
This ensures that accidental failures cause the least possible harm, and also
limits the harm that can be done deliberately.
For example, if a program needs special privileges to perform a task, it is better
to make it a SGID program with group ownership of "network" or "backup" or
some other pseudo group, rather than SUID with root ownership. This limits
the amount of damage that can occur if something goes wrong.
Typically each user is given their own account, and has only enough privilege
to modify their own files.
The root account should not be used for normal day-to-day activities - the
System Administrator should also have an ordinary account, and reserve the
root account for only those tasks which actually need root privileges.
Domain of Protection
Domain Structure
In the MULTICS ring-protection scheme, rings are numbered from 0 to 7, with
outer rings having a subset of the privileges of the inner rings.
Each file is a memory segment, and each segment descriptor includes
an entry that indicates the ring number associated with that segment, as
well as read, write, and execute privileges.
Each process runs in a ring, according to the current-ring-number, a
counter associated with each process.
A process operating in one ring can only access segments associated
with higher ( farther out ) rings, and then only according to the access
bits. Processes cannot access segments associated with lower rings.
Domain switching is achieved by a process in one ring calling a procedure
whose segment is associated with a different ring, which is controlled by
several factors stored with each segment descriptor:
o An access bracket, defined by integers b1 <= b2.
o A limit b3 > b2
o A list of gates, identifying the entry points at which the segments
may be called.
If a process operating in ring i calls a segment whose bracket is such that
b1 <= i <= b2, then the call succeeds and the process remains in ring i.
Otherwise a trap to the OS occurs, and is handled as follows:
o If i < b1, then the call is allowed, because we are transferring to a
procedure with fewer privileges. However if any of the parameters
being passed are of segments below b1, then they must be copied
to an area accessible by the called procedure.
o If i > b2, then the call is allowed only if i <= b3 and the call is
directed to one of the entries on the list of gates.
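The ring-crossing rules above can be condensed into a small decision routine.
The C sketch below is illustrative only: the descriptor fields ( b1, b2, b3,
and the gate list ) mirror the description above, and the return values merely
name the three possible outcomes.

/* Ring-crossing check: a process in ring i calls a segment whose descriptor
 * carries an access bracket [b1, b2], a limit b3, and a gate list. */
#include <stdbool.h>

struct segment_desc {
    int        b1, b2;      /* access bracket, b1 <= b2                      */
    int        b3;          /* limit, b3 > b2                                */
    const int *gates;       /* permitted entry points                        */
    int        ngates;
};

enum call_result { CALL_OK, CALL_OK_COPY_PARAMS, CALL_TRAP };

enum call_result ring_call(int i, const struct segment_desc *s, int entry)
{
    if (s->b1 <= i && i <= s->b2)
        return CALL_OK;              /* within the bracket: stay in ring i   */

    if (i < s->b1)
        return CALL_OK_COPY_PARAMS;  /* calling outward is allowed, but any
                                        parameters from segments below b1
                                        must be copied for the callee        */

    /* i > b2: calling inward is allowed only within the limit and only
       through one of the published gates. */
    if (i <= s->b3)
        for (int g = 0; g < s->ngates; g++)
            if (s->gates[g] == entry)
                return CALL_OK;

    return CALL_TRAP;                /* protection violation                 */
}

int main(void)
{
    static const int gates[] = { 0x100, 0x200 };
    struct segment_desc seg = { .b1 = 2, .b2 = 4, .b3 = 6,
                                .gates = gates, .ngates = 2 };
    return ring_call(5, &seg, 0x100) == CALL_OK ? 0 : 1;
}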
Overall this approach is more complex and less efficient than other
protection schemes.
Access Matrix
The owner right adds the privilege of adding new rights or removing existing
ones:
Access matrix with owner rights.
Copy and owner rights only allow the modification of rights within a column.
The addition of control rights, which only apply to domain objects, allows a
process operating in one domain to affect the rights available in other domains.
For example in the table below, a process operating in domain D2 has the right
to control any of the rights in domain D4.
Modified access matrix
Global Table
The simplest approach is one big global table with < domain, object,
rights > entries.
Unfortunately this table is very large ( even if sparse ) and so cannot be
kept in memory ( without invoking virtual memory techniques. )
There is also no good way to specify groupings - If everyone has access
to some resource, then it still needs a separate entry for every domain.
Access Lists for Objects
Each column of the table can be kept as a list of the access rights for that
particular object, discarding blank entries.
For efficiency a separate list of default access rights can also be kept,
and checked first.
Capability Lists for Domains
In a similar fashion, each row of the table can be kept as a list of the
capabilities of that domain.
Capability lists are associated with each domain, but not directly
accessible by the domain or any user process.
Capability lists are themselves protected resources, distinguished from
other data in one of two ways:
o A tag, possibly hardware implemented, distinguishing this special
type of data. ( other types may be floats, pointers, booleans, etc. )
o The address space for a program may be split into multiple
segments, at least one of which is inaccessible by the program
itself, and used by the operating system for maintaining the
process's access right capability list.
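To make the contrast between the per-object and per-domain representations
concrete, the C sketch below models an access list stored with each object
( one column of the matrix ) and a capability list stored with each domain
( one row ). The types and names are illustrative, not those of any real system.

/* Illustrative access-matrix representations. */
#include <stddef.h>

enum right { R_READ = 1, R_WRITE = 2, R_EXEC = 4, R_OWNER = 8 };

/* Access list: stored with the OBJECT, one entry per domain with rights. */
struct acl_entry {
    int      domain_id;
    unsigned rights;            /* bitwise OR of enum right values          */
};
struct object {
    const char       *name;
    struct acl_entry *acl;      /* one column of the access matrix          */
    size_t            acl_len;
};

/* Capability: stored with the DOMAIN, one entry per object it may use. */
struct capability {
    struct object *obj;         /* protected reference to the object        */
    unsigned       rights;
};
struct domain {
    struct capability *caps;    /* one row of the access matrix             */
    size_t             ncaps;   /* maintained by the OS, not by user code   */
};

/* Checking a request under the access-list scheme. */
int acl_allows(const struct object *o, int domain_id, unsigned want)
{
    for (size_t i = 0; i < o->acl_len; i++)
        if (o->acl[i].domain_id == domain_id)
            return (o->acl[i].rights & want) == want;
    return 0;                   /* no entry: fall back to defaults, if any  */
}

int main(void)
{
    struct acl_entry entries[] = { { 1, R_READ | R_WRITE }, { 2, R_READ } };
    struct object file = { "F1", entries, 2 };
    return acl_allows(&file, 2, R_WRITE) ? 1 : 0;   /* D2 may not write F1  */
}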
A Lock-Key Mechanism
Comparison
Access Control
An Example: Hydra
Compiler-Based Enforcement
Protection in Java
Stack inspection.