CS350: Operating Systems
Lecture 8: Virtual Memory OS
Ali Mashtizadeh
University of Waterloo
Outline
1 Paging
2 Eviction policies
3 Thrashing
4 User-level API
5 Case study: 4.4 BSD
Paging
• Use disk to simulate a larger virtual memory than physical memory
Working set model
[Figure: histogram of # of accesses vs. virtual address; a few hot regions receive most of the accesses]
• Disk much, much slower than memory
▶ Goal: run at memory speed, not disk speed
• 80/20 rule: 20% of memory gets 80% of memory accesses
▶ Keep the hot 20% in memory
▶ Keep the cold 80% on disk
Paging challenges
• How to resume a process after a fault?
▶ Need to save state and resume
▶ Process might have been in the middle of an instruction!
• What to fetch from disk?
▶ Just the needed page, or more?
• What to eject?
▶ How to allocate physical pages amongst processes?
▶ Which of a particular process’s pages to keep in memory?
Re-starting instructions
• Hardware provides kernel with information about page fault
▶ Faulting virtual address (in %c0_vaddr reg on MIPS)
▶ Address of instruction that caused fault (%c0_epc reg)
▶ Was the access a read or write? Was it an instruction fetch?
▶ Was it caused by user access to kernel-only memory?
• Hardware must allow resuming after a fault
• Idempotent instructions are easy
▶ E.g., simple load or store instruction can be restarted
▶ Just re-execute any instruction that only accesses one address
What to fetch
• Bring in page that caused page fault
• Pre-fetch surrounding pages?
▶ Reading two disk blocks approximately as fast as reading one
▶ As long as no track/head switch, seek time dominates
▶ If application exhibits spatial locality, then big win to store and read multiple contiguous pages
• Also pre-zero unused pages in idle loop
▶ Need 0-filled pages for stack, heap, anonymously mmapped memory
▶ Zeroing them only on demand is slower
▶ Hence, many OSes zero freed pages while CPU is idle
Selecting physical pages
• May need to eject some pages
▶ More on eviction policy shortly
• May also have a choice of physical pages
• Direct-mapped physical caches
▶ Virtual → Physical mapping can affect performance
▶ In old days: Physical address A conflicts with kC + A (where k is any integer, C is cache size)
▶ Applications can conflict with each other or themselves
▶ Scientific applications benefit if consecutive virtual pages do not conflict in the cache
▶ Many other applications do better with random mapping
▶ These days: CPUs more sophisticated than kC + A
Superpages
• How should OS make use of “large” mappings?
▶ x86 has 2/4 MB pages that might be useful
▶ Alpha has even more choices: 8 KB, 64 KB, 512 KB, 4 MB
• Sometimes more pages in L2 cache than TLB entries
▶ Don’t want costly TLB misses going to main memory
• Or have two-level TLBs
▶ Want to maximize hit rate in faster L1 TLB
• OS can transparently support superpages [Navarro]
▶ “Reserve” appropriate physical pages if possible
▶ Promote contiguous pages to superpages
▶ Does complicate eviction (esp. dirty pages) – demote
Outline
1 Paging
2 Eviction policies
3 Thrashing
4 User-level API
5 Case study: 4.4 BSD
Straw man: FIFO eviction
• Evict oldest fetched page in system
• Example—reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
• 3 physical pages: 9 page faults
• 4 physical pages: 10 page faults
Belady’s Anomaly
• More physical memory doesn’t always mean fewer faults
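The anomaly is easy to reproduce. Below is a minimal FIFO simulator sketch (fifo_faults and the frame bound are illustrative, not from the slides); on the reference string above it prints 9 faults for 3 frames and 10 for 4:

#include <stdio.h>

/* Count page faults under FIFO replacement. Frames fill in arrival
 * order; once full, "hand" always points at the oldest resident page. */
static int fifo_faults(const int *refs, int n, int nframes) {
    int frames[16];                      /* assumes nframes <= 16 */
    int used = 0, hand = 0, faults = 0;
    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int j = 0; j < used; j++)
            if (frames[j] == refs[i]) { hit = 1; break; }
        if (hit) continue;
        faults++;
        if (used < nframes) {
            frames[used++] = refs[i];    /* free frame still available */
        } else {
            frames[hand] = refs[i];      /* evict oldest fetched page */
            hand = (hand + 1) % nframes;
        }
    }
    return faults;
}

int main(void) {
    int refs[] = {1,2,3,4,1,2,5,1,2,3,4,5};
    int n = sizeof(refs) / sizeof(refs[0]);
    printf("3 frames: %d faults\n", fifo_faults(refs, n, 3));  /* 9 */
    printf("4 frames: %d faults\n", fifo_faults(refs, n, 4));  /* 10 */
    return 0;
}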
Optimal page replacement
• What is optimal (if you knew the future)?
▶ Replace page that will not be used for longest period of time
• Example—reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
• With 4 physical pages: 6 page faults
LRU page replacement
• Approximate optimal with least recently used
▶ Because past often predicts the future
• Example—reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
• With 4 physical pages: 8 page faults
• Problem 1: Can be pessimal – example?
▶ Looping over memory (then want MRU eviction)
• Problem 2: How to implement?
Straw man LRU implementations
• Stamp PTEs with timer value
▶ E.g., CPU has cycle counter
▶ Automatically writes value to PTE on each page access
▶ Scan page table to find oldest counter value = LRU page
▶ Problem: would double memory traffic!
• Keep doubly-linked list of pages
▶ On access, remove page, place at tail of list
▶ Problem: again, very expensive
• What to do?
▶ Just approximate LRU, don’t try to do it exactly
Clock algorithm
• Use accessed bit supported by most hardware
▶ E.g., Pentium will write 1 to A bit in PTE on first access
▶ Software-managed TLBs like MIPS can do the same
• Do FIFO, but skip accessed pages
• Keep pages in circular FIFO list
• Scan:
▶ If page’s A bit = 1, set to 0 & skip
▶ Else if A = 0, evict
• A.k.a. second-chance replacement
[Figure: circular list of pages with A bits; the clock hand clears each A = 1 bit it passes and evicts the first A = 0 page]
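A minimal sketch of the scan, assuming a circular array of frames with a software-visible accessed bit (struct frame and clock_evict are illustrative names, not OS/161 or BSD code):

struct frame { int page; int accessed; };

/* Sweep from *hand, clearing A bits, until an unreferenced frame is
 * found; return its index and leave *hand just past the victim.
 * Terminates within two sweeps, since the first sweep clears all A bits. */
static int clock_evict(struct frame *frames, int nframes, int *hand) {
    for (;;) {
        struct frame *f = &frames[*hand];
        int idx = *hand;
        *hand = (*hand + 1) % nframes;
        if (f->accessed)
            f->accessed = 0;        /* second chance: clear A and skip */
        else
            return idx;             /* A == 0: evict this frame */
    }
}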
Clock algorithm (continued)
• Large memory may be a problem
▶ Most pages referenced in a long interval
• Add a second clock hand
▶ Two hands move in lockstep
▶ Leading hand clears A bits
▶ Trailing hand evicts pages with A = 0
[Figure: circular list of pages with two hands; the leading hand clears A bits and the trailing hand evicts still-unreferenced pages]
• Can also take advantage of hardware Dirty bit
▶ Each page can be (Unaccessed, Clean), (Unaccessed, Dirty), (Accessed, Clean), or (Accessed, Dirty)
▶ Consider clean pages for eviction before dirty
• Or use n-bit accessed count instead of just A bit (sketched below)
▶ On sweep: count = (A << (n − 1)) | (count >> 1)
▶ Evict page with lowest count
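The n-bit count is the classic “aging” counter; a sketch of one sweep step for n = 8 (age_page is an illustrative name):

#include <stdint.h>

/* Shift the access history right and fold the current A bit into the
 * top bit. Recently used pages get large counts; the coldest page is
 * the one with the smallest count. */
static uint8_t age_page(uint8_t count, int accessed) {
    return (uint8_t)(((accessed ? 1 : 0) << 7) | (count >> 1));
}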
Other replacement algorithms
• Random eviction
▶ Simple to implement
▶ Not overly horrible (avoids Belady’s anomaly & pathological cases)
▶ Used in hypervisors to avoid double swapping [Waldspurger]
• LFU (least frequently used) eviction
• MFU (most frequently used) eviction
• Neither LFU nor MFU used very commonly
• Workload-specific policies: databases
Naïve paging
• Naïve page replacement: 2 disk I/Os per page fault
Page buffering
• Idea: reduce # of I/Os on the critical path
• Keep pool of free page frames
▶ On fault, still select victim page to evict
▶ But read fetched page into an already free page
▶ Can resume execution while writing out victim page
▶ Then add victim page to free pool
• Can also yank pages back from free pool
▶ Contains only clean pages, but they may still have data
▶ If page fault on page still in free pool, recycle it
Page allocation
• Allocation can be global or local
• Global allocation doesn’t consider page ownership
▶ E.g., with LRU, evict least recently used page of any process
▶ Works well if P1 needs 20% of memory and P2 needs 70%
▶ Doesn’t protect you from memory pigs (imagine P2 keeps looping through array that is size of mem)
• Local allocation isolates processes (or users)
▶ Separately determine how much memory each process should have
▶ Then use LRU/clock/etc. to determine which pages to evict within each process
Outline
1 Paging
2 Eviction policies
3 Thrashing
4 User-level API
5 Case study: 4.4 BSD
Thrashing
Thrashing is when an application is constantly swapping pages in and out, preventing it from making forward progress at any reasonable rate.
• Processes require more memory than system has
▶ Each time one page is brought in, another page, whose contents will soon be referenced, is thrown out
▶ Processes will spend all of their time blocked, waiting for pages to be fetched from disk
▶ I/O devices at 100% utilization but system not getting much useful work done
• What we wanted: virtual memory the size of disk with access time the speed of physical memory
• What we got: memory with access time of disk
Reasons for thrashing
• Access pattern has no temporal locality (past ≠ future)
(80/20 rule has broken down)
• Hot memory does not fit in physical memory
[Figure: P1’s hot pages alone overflow physical memory]
• Each process fits individually, but too many for system
[Figure: P1–P16 each fit alone, but together they exceed physical memory]
▶ At least this case is possible to address
Dealing with thrashing
• Approach 1: working set
▶ Thrashing viewed from a caching perspective: given locality of reference, how big a cache does the process need?
▶ Or: how much memory does the process need in order to make reasonable progress (its working set)?
▶ Only run processes whose memory requirements can be satisfied
• Approach 2: page fault frequency (see the sketch below)
▶ Thrashing viewed as poor ratio of fetching to work
▶ PFF = page faults / instructions executed
▶ If PFF rises above threshold, process needs more memory. Not enough memory on the system? Swap out.
▶ If PFF sinks below threshold, memory can be taken away
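A sketch of the PFF check, run periodically per process (the names and thresholds below are assumptions for illustration, not from the slides):

enum pff_action { PFF_GROW, PFF_SHRINK, PFF_OK };

/* Compare faults per instruction over the last sample interval
 * against two thresholds. */
static enum pff_action pff_check(unsigned long faults, unsigned long insns) {
    const double high = 1e-4, low = 1e-6;   /* illustrative thresholds */
    double pff = (double)faults / (double)insns;
    if (pff > high) return PFF_GROW;        /* needs more frames, or swap out */
    if (pff < low)  return PFF_SHRINK;      /* frames can be taken away */
    return PFF_OK;
}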
Working sets
[Figure: working set size over time, roughly flat within a phase and spiking at phase transitions]
• Working set changes across phases
▶ Balloons during phase transitions
Outline
1 Paging
2 Eviction policies
3 Thrashing
4 User-level API
5 Case study: 4.4 BSD
Recall typical virtual address space
kernel
stack
breakpoint
heap
uninitialized data (bss)
initialized data
read-only data
code (text)
• Dynamically allocated memory goes in heap
• Top of heap called breakpoint
▶ Addresses between breakpoint and stack all invalid
Early VM system calls
• OS keeps “breakpoint” – top of heap
▶ Memory regions between breakpoint & stack fault on access
• char *brk(const char *addr);
▶ Set and return new value of breakpoint
• char *sbrk(int incr);
▶ Increment value of the breakpoint & return old value
• Can implement malloc in terms of sbrk (see the sketch below)
▶ But hard to “give back” physical memory to system
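A toy bump allocator on top of sbrk, as a sketch only (toy_malloc is a made-up name; a real malloc tracks free lists):

#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Allocate by pushing the breakpoint up. There is no free(): memory
 * only grows, which is exactly the "hard to give back" problem above. */
static void *toy_malloc(size_t n) {
    n = (n + 15) & ~(size_t)15;      /* round up to 16-byte alignment */
    void *p = sbrk((intptr_t)n);     /* old breakpoint, or -1 on failure */
    return (p == (void *)-1) ? NULL : p;
}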
Memory mapped files
kernel
stack
mmapped
regions
heap
uninitialized data (bss)
initialized data
read-only data
code (text)
• Other memory objects between heap and stack
mmap system call
• void *mmap(void *addr, size_t len, int prot, int flags, int fd, off_t offset)
▶ Map file specified by fd at virtual address addr
▶ If addr is NULL, let kernel choose the address
• prot – protection of region
▶ OR of PROT_EXEC, PROT_READ, PROT_WRITE, PROT_NONE
• flags
▶ MAP_ANON – anonymous memory (fd should be -1)
▶ MAP_PRIVATE – modifications are private
▶ MAP_SHARED – modifications seen by everyone
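A usage sketch covering both cases ("data.bin" and the 4096-byte size are placeholders):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    /* Map an existing file read-only, letting the kernel pick addr. */
    int fd = open("data.bin", O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) return 1;
    char *file = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

    /* Anonymous, zero-filled private memory: MAP_ANON, fd = -1. */
    char *anon = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_ANON | MAP_PRIVATE, -1, 0);
    if (file == MAP_FAILED || anon == MAP_FAILED) return 1;

    anon[0] = file[0];               /* both now behave like memory */
    munmap(anon, 4096);
    munmap(file, st.st_size);
    close(fd);
    return 0;
}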
More VM system calls
• int munmap(void *addr, size_t len)
▶ Removes memory-mapped object
• int mprotect(void *addr, size_t len, int prot)
▶ Changes protection on pages to the OR of PROT_… values
• int msync(void *addr, size_t len, int flags);
▶ Flush changes of mmapped file to backing store
• int mincore(void *addr, size_t len, char *vec)
▶ Returns in vec which pages are present (see the example below)
• int madvise(void *addr, size_t len, int behav)
▶ Advise the OS on memory use
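For example, mincore makes demand paging visible (a sketch; the mapping size is illustrative, and vec is char * as in the prototype above, while Linux spells it unsigned char *):

#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 4 * 4096;
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_ANON | MAP_PRIVATE, -1, 0);
    if (buf == MAP_FAILED) return 1;
    buf[0] = 1;                      /* touch only the first page */
    char vec[4];
    mincore(buf, len, vec);          /* low bit of vec[i]: page i resident? */
    for (int i = 0; i < 4; i++)
        printf("page %d: %s\n", i, (vec[i] & 1) ? "resident" : "not resident");
    return 0;
}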
Exposing page faults
struct sigaction {
union { /* signal handler */
void (*sa_handler)(int);
void (*sa_sigaction)(int, siginfo_t *, void *);
};
sigset_t sa_mask; /* signal mask to apply */
int sa_flags;
};
int sigaction (int sig, const struct sigaction *act,
struct sigaction *oact)
• Can specify function to run on SIGSEGV
(Unix signal raised on invalid memory access)
Example: OpenBSD/i386 siginfo
struct sigcontext {
int sc_gs; int sc_fs; int sc_es; int sc_ds;
int sc_edi; int sc_esi; int sc_ebp; int sc_ebx;
int sc_edx; int sc_ecx; int sc_eax;
int sc_eip; int sc_cs; /* instruction pointer */
int sc_eflags; /* condition codes, etc. */
int sc_esp; int sc_ss; /* stack pointer */
int sc_onstack; /* sigstack state to restore */
int sc_mask; /* signal mask to restore */
int sc_trapno;
int sc_err;
};
• Linux uses ucontext_t – same idea, just uses nested structures that won’t all
fit on one slide
VM tricks at user level
• Combination of mprotect/sigaction very powerful
▶ Can use OS VM tricks in user-level programs [Appel]
▶ E.g., fault, unprotect page, return from signal handler (sketched below)
• Technique used in object-oriented databases
▶ Bring in objects on demand
▶ Keep track of which objects may be dirty
▶ Manage memory as a cache for much larger object DB
• Other interesting applications
▶ Useful for some garbage collection algorithms
▶ Snapshotting processes (copy-on-write)
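A minimal sketch of the fault–unprotect–resume trick (the hard-coded 4096 page size and names are assumptions; a robust version would check si_addr and use sysconf(_SC_PAGESIZE)):

#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>

static char *page;                  /* one page, starts inaccessible */

/* On SIGSEGV: make the page writable and return. The hardware then
 * restarts the faulting store, which now succeeds. */
static void on_segv(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)si; (void)ctx;
    mprotect(page, 4096, PROT_READ | PROT_WRITE);
}

int main(void) {
    page = mmap(NULL, 4096, PROT_NONE, MAP_ANON | MAP_PRIVATE, -1, 0);
    struct sigaction sa = {0};
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    page[0] = 42;                   /* faults once, handler unprotects */
    printf("%d\n", page[0]);
    return 0;
}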
Outline
1 Paging
2 Eviction policies
3 Thrashing
4 User-level API
5 Case study: 4.4 BSD
Overview
• Windows and most UNIX systems separate the VM system into two parts
▶ VM PMap: manages the hardware interface (e.g., TLB on MIPS)
▶ VM Map: machine-independent representation of memory
• 4.4 BSD VM is based on [Mach VM]
• VM Map consists of one or more objects (or segments)
• Each object covers a contiguous region of virtual memory, like a single mmap()
• Objects can be backed by files and/or shared between processes
• VM PMap manages the hardware (often caches mappings)
Operation
• Calls to mmap(), munmap(), mprotect()
▶ Update the VM Map
▶ VM Map routines call into the VM PMap to invalidate and update the TLB
• Page faults
▶ Exception handler calls into the VM PMap to load the TLB
▶ If the page isn’t in the PMap, we call the VM Map code
• Low memory conditions
▶ The PMap is a cache and can be discarded during a low memory condition
4.4 BSD VM system [McKusick]
• Each process has a vmspace structure containing
▶ vm_map – machine-independent virtual address space
▶ vm_pmap – machine-dependent data structures
▶ statistics – e.g., for syscalls like getrusage()
• vm_map is a linked list of vm_map_entry structs
▶ vm_map_entry covers contiguous virtual memory
▶ points to a vm_object struct
• vm_object is the source of data
▶ e.g., vnode object for a memory-mapped file
▶ points to a list of vm_page structs (one per mapped page)
▶ shadow objects point to other objects for copy-on-write
4.4 BSD VM data structures
[Figure: a vmspace holds a vm_map, vm_pmap, and stats; the vm_map’s vm_map_entry list points to shadow and vnode objects, each with its own list of vm_page structs, and shadow objects chain to the vnode objects they shadow]
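A simplified C rendering of the structures in the figure (field subset only; the real definitions live in the 4.4 BSD vm headers, and the types below are abbreviations):

#include <stdint.h>

struct vm_object;                   /* vnode, anonymous, or shadow */

/* One contiguous range of virtual addresses and where its data lives. */
struct vm_map_entry {
    uintptr_t start, end;           /* covered virtual range */
    int protection;
    struct vm_object *object;       /* shadow chain ends at data source */
    struct vm_map_entry *next;      /* list sorted by address */
};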
Pmap (machine-dependent) layer
• Pmap layer holds architecture-specific VM code
• VM layer invokes pmap layer
▶ On page faults, to install mappings
▶ To protect or unmap pages
▶ To ask for dirty/accessed bits
• Pmap layer is lazy and can discard mappings
▶ No need to notify VM layer
▶ Process will fault and VM layer must reinstall mapping
• Pmap handles restrictions imposed by cache
Example uses
• vm_map_entry structs for a process
▶ r/o text segment → file object
▶ r/w data segment → shadow object → file object
▶ r/w stack → anonymous object
• New vm_map_entry objects after a fork:
▶ Share text segment directly (read-only)
▶ Share data through two new shadow objects (must share pre-fork but not post-fork changes)
▶ Share stack through two new shadow objects
• Must discard/collapse superfluous shadows
▶ E.g., when child process exits
What happens on a fault?
• Traverse vm_map_entry list to get appropriate entry
▶ No entry? Protection violation? Send process a SIGSEGV
• Traverse list of [shadow] objects
• For each object, traverse vm_page structs
• Found a vm_page for this object?
▶ If first vm_object in chain, map page
▶ If read fault, install page read-only
▶ Else if write fault, install copy of page
• Else get page from object
▶ Page in from file, zero-fill new page, etc.
Paging in day-to-day use
• Demand paging
▶ Read pages from vm_object of executable file
• Copy-on-write (fork, mmap, etc.)
▶ Use shadow objects
• Growing the stack, BSS page allocation
▶ A bit like copy-on-write for /dev/zero
▶ Can have a single read-only zero page for reading
▶ Special-case write handling with pre-zeroed pages
• Shared text, shared libraries
▶ Share vm_object (shadow will be empty where read-only)
• Shared memory
▶ Two processes mmap same file, have same vm_object (no shadow)