Content

Paging Overview
1. Introduction to Virtual Memory
2. Paging MMU
3. Page Tables
Translation Lookaside Buffer (TLB)
1. TLB Basics
2. Memory Protection
Demand Paging
1. Demand Paging Overview
2. Virtual Memory Hierarchy
3. Managing Virtual Address Space
4. Managing Physical Memory
5. Managing Swap Area
Processes & Virtual Memory
1. Interaction of Processes with VM System
2. Page Sharing & Memory-Mapped Files
3. Kernal Address Space
Page Replacement Algorithms
1. Page Replacement Algorithms Overview
2. Paging Issues

Paging Overview

Introduction to Virtual Memory

Problems with Contiguous Memory

Growing program requires copying entire program
Wasted memory due to internal & external fragmentation
Running program requires loading entire program
Max program size limited by memory size

Paging MMU

Pages & Frames

Page: contiguous, fixed size chunks in virtual address space
Frame: physical memory mapped from a page

Virtual & Physical Addresses

# 32-bit machine, virtual address space 2^32 B
# page size 2^12, # of pages 2^20
|...20 bits...|.12 bits.|
    page num     offset

# e.g. physical address space 2^30 B
# frame size 2^12, # of frames 2^18
|...18 bits...|.12 bits.|
    page num     offset

               VA                PA
          ----------->       ----------->
Processor                MMU                 Bus
          <----------------------------
                            data

Benefits of Paging

Growing program requires allocating a page at a time
No external fragmentation (page granularity); internal fragmentation 1/2 page per region
Pages can be loaded in memory as program runs
Max program size limited by disk size

Page Tables

Mapping information maintained by MMU.

Page Table Entries (PTE)

|...18 bits...|.7 bits.|C|D|R|W|V|
   frame num    unused

Storing Page Table

Stored in memory
- Each memory access requires 2 memory accesses: PTE & PA
- Solution: Cache PTE in TLB
Page table register (PTR)
- For MMU to store the location of start of page table
- OS associates a seperate page table for each process
- Address space switch switches PTR

Page Table Size

#pages * PTE size (typically word size)
= (va_size / page_size) * PTE size

Smaller page size: lower internal fragmentation
Larger page size: fewer PTE & memory overhead

Multi-Level Page Table

|.10 bits.|.10 bits.|.12 bits.|
    PT1       PT2      offset

Slower than single-level page table because more memory accesses
Saves space

Inverted Page Table

Maps frame num -> (thread id, page num)

|....|...52 bits...|.12 bits.|
 TID    page num     offset
  |            |
  -----------
         |
   hash func ----> index into page table

Exhaustive search is slow
- Use hash table indexed by page number
Hashing function needs to be good
Poor cache locality
- Adjacent pages hashed to scattered locations
Sharing memory is complicated

Translation Lookaside Buffer (TLB)

TLB Basics

H/W cache of PTE
- Small # of entries
- Exploits locality (program uses small # of pages at a time)

|virtual page number (VPN)|...18 bits...|.7 bits.|C|D|R|W|V|
                            frame num     unused
|    key: page number     |            data: PTE           |

V: valid, W: writable, R: referenced (H/W), D: dirty (H/W), C: cacheable

V: is the entry valid or not?
W: is the page writable?
C: is the page cacheable? when not cacheable, processor should bypass the cache when accessing the page, e.g., to access memory-mapped device registers
D: has a writable page been written to (is it dirty with respect to the data on disk)?
R: has a page been referenced?
unused: unused by hardware, can be used by OS

H/W may or may not need execution bit, but instead use read as execute

TLB Operations

TLB Lookup

Fully associative TLB example

TLB cache miss handling

TLB lookup fails -> page table lookup -> TLB cache replaced (TLB replacement policy)
Handled by H/W or OS
- H/W managed TLB
  - H/W defines page table format & replacement policy
  - PTR to locate page table in physical memory
  - Fast miss handling
- S/W managed TLB
  - H/W generates trap called TLB miss fault
    - Read fault
    - Write fault
  - OS handles TLB miss similar to exception handling
  - OS figures out correct PTE, add in TLB (CPU has instructions to modify TLB)
  - Page tables become entirely a OS data structure (no PTR in H/W, H/W doesn't know about page table)
  - TLB replacement policy managed in S/W
  - Slower by more flexible miss handling

TLB cache invalidate

Adds cost to context switching
- Changing PTR + invalidating TLB + TLB misses afterwards
Invalidate options
- Clear TLB: clearing valid bit of all entries
- Tagged TLB: H/W maintains id tag on each entry
  - Compare current thread id (stored in register) to the tag
  - No invalidation; enables space multiplexing of entries

Memory Protection

Generate protection fault if memory access inconsistent with protection bits
- Read-only fault
- No-execute fault

Demand Paging

Demand Paging Overview

Motivation: program size/# of running programs limited by physical memory
- Make memory a cache for data on disk
Basic Mechanism
- PTE unset for invalid page
- Cache miss when program accesses invalid page
- Generates page fault
- OS page fault handler performs miss handling: allocate frame, update PTE, set valid bit, restart instruction
Benefits
- Allows programs run to exceed available memory (limited by VA space (processor architecture) or disk)
- Faster startup programs (no need to load entire program)

Virtual Memory Hierarchy

Swap Handler

Chooses a page to evict using page replacement algorithm
If page modified, write to a free location on swap
Find pages that map to the evicted frame
- Need frame-to-page mapping (coremap)
- Change all PTEs to invalid
- Keep track of the evicted frame location in swap
Return newly freed frame to page fault handler

Other Notes

If TLB miss handling is performed by H/W, no TLB miss fault generated; if by S/W, H/W doesn't access PT
Page fault handler:
- checks segmentation fault, protection fault
- allocates frame, PTE
- loads data from disk into frame
- maps page to frame by updating PTE
Page contents may:
- be all 0
- in swap
- in executable
OS accesses PT when:
- on TLB-miss fault (for h/w managed TLB), it reads the page table
- on page fault, it allocates frame and updates page table
- whenever address space of current process is modified, (frame is allocated/deallocated, context switch), it updates page table

Managing Virtual Address Space

Address Space

Broken into regions or segments
OS must track the regions within an address space & PT holding translations
API
- id = as_create() create empty address space, allocate new PT with no pages mapped to frames
- as_destroy(id) destroy as, free PT
- as_define_region(id,...) add new region
- as_modify_region(id,...) change size (e.g. heap, stack)
- as_find_region(id,va) find valid region or give seg fault
- as_load_page(id,...) load page from file into memory
- new_id = as_copy(id) create copy of as & PT (actually copy data to other frames)
- as_switch(id) mappings in TLB must be removed

Managing Physical Memory

Coremap

Array of structures, one per frame
Tracks:
- whether frame allocated
- address spaces & pages map to this frame (list of (pid, page_id))
- kernel or user
API
- frame = allocate_frame(n) alloc n free contiguous frames (user needs 1 frame; kernel needs more)
- free_frame(frame)
- map(id,page_nr,frame)
- unamp(id,page_nr)
- evict(frame) for swap handler to unmap pages associated with frame

Managing Swap Area

Dirty pages need to be placed in swap
Bitmap or linked list of swap frames on disk; keep location in invalid PTE

If loading shared frame from swap and needs to write to it:

allocate a frame and keep the swap frame

next time allocate another frame and free the swap frame

# PTE
|frame number|unused|R|D|C|W|1| virtual page in memory
|000.......................0|0| invalid PTE
|index in swap area.........|0| virtual page in swap

Processes & Virtual Memory

Interaction of Processes with VM System

OS implements VM system using 3 abstractions: address space, physical memory, & swap management
Rest of OS invokes the VM systems when address space of current process changes
- Process creation fork
  - Create new PT & copy content from parent's regions
- Process execution execv
  - Destroy old address space structure & create new one
- Process termination exit
- Context switch
  - H/W managed TLB: change PTR, flush TLB
  - S/W managed TLB: flush TLB
- Memory allocation/deallocation sbrk, stack
  - Heap: system call (sbrk) to grow; allows OS to detect errors
  - Stack: when faulting address close to stack, extend stack region running standard page fault handler

Benefits
- Fast communication between processes
- Good isolation (only share when needed)
Applications
- Sharing text regions
  - e.g. dynamically loaded libraries
  - Must update all pages when frame evicted
- Copy-on-write pages
  - Child shares pages with parent until pages modified
  - PTE marks COW & read-only
  - On protection fault: allocate new frame, copy, remap, make writable, update TLB, resume
  - Must update all pages when frame evicted
- Memory-mapped files
  - Threads r/w files using load/store instead of system calls
  - mmap(addr,length,prot_flags,fd,offset) maps a file at given offset contiguously within address space
    - Storing offset to provide flexibility since different processes may have different addresses of shared memory
  - File data loaded on page fault
  - Instead of swap area, use file itself as backing store
  - More efficient than r/w system calls because system calls need buffer in kernel to transfer data
Shared memory
- Threads r/w to mapped memory region
- PTE can have different protections for different threads
- Memory can be mapped at same or different VA in each process

Kernal Address Space

OS typically lies in low physical memory
Options
- Paging turned off in kernel mode, OS access physical memory directly
  - OS needs to simulate paging, i.e. do address translation in software, to copy in or out user parameters on a system call, deal with dirty bit, etc.
- OS uses separate address space
  - Pros
    - Clean isolation
  - Cons
    - System call requires switching MMU -> TLB flush
    - Copy in/out of system call parameters requires traversing PT in S/W
- OS maps to address space of each thread
  - Typically mapped to high addresses
  - Pros
    - No system call
    - Copy in/out of system call parameters can reuse paging H/W
  - Cons
    - Address space of processes reduces
    - PT needs to be setup to access OS code
      - H/W may allow OS to bypass PT

Page Replacement Algorithms

Page Replacement Algorithms Overview

When will paging work?
- If occurs rarely
- Spatial locality & temporal locality
Algorithms
- Optimal algorithm: page that will not be needed for longest time
  - Need to know the future
  - Can be used to compare performance
    - Generate a log of VM accesses (reference string); use log to simulate various replacement policies
- FIFO: oldest page
  - Oldest page may be needed soon
  - Faults may increase if given more memory (Belady's Anomaly)
- Clock: FIFO giving second chance to referenced pages
  - Referenced bit: set by processor when page read/written; cleared by OS/software
  - Dirty bit: set by processor when page written; cleared by OS/software
  - OS synchronizes R/D bits in TLB with PTE (H/W: write-through or write-back; S/W: needs to implement synchronization)
  - OS simulates the bits if H/W doesn't maintain
    - When TLB read fault:
      - Set referenced bit; make page read-only
    - When TLB read-only protection fault/write fault:
      - Set referenced & dirty bits; make page writeable
    - Algorithm
```
  Choose page starting from clock hand:
      if referenced bit set:
          unset, goto next
      else:
          if page dirty:
              schedule page write, goto next
          else:
              select for replacement
```
- LRU: page used least recently
  - Updating LRU on each memory access is expensive
  - If MMU maintains a counter incremented at each clock cycle:
    - When page accessed:
      - Writes counter value to PTE
    - On page fault:
      - S/W looks through PT, identifies entry with oldest timestamp
  - If MMU doesn't provide such counter, OS maintains it in S/W (LRU aprroximation):
    - Periodically (timer interrupt) increment counter; granularity depends on timer interrupt
      - If referenced bit set:
        
        Write counter value
        
        Clear referenced bit
    - On page fault:
      - S/W looks through PT & identify the oldest timestamp
- Working set clock: keep working set in memory
  - Working set: set of pages a program needs currently
    - Working set interval T: WS(T) = {pages accessed in interval (now, now-T)}
    - Max WS: how much memory a program needs
  - Algorithm
```
  Choose page starting from clock hand:
      if referenced bit set:
          update time-of-last-use to current virtual time
          unset, goto next
      else:
          if (current time - time-of-last-use) < T:
              continue
          else:
              if page dirty:
                  schedule page write, goto next
              else:
                  select for replacement
```
- Page fault frequency: estimate of working set needs of a program
  - Measurements
```
  For each thread:
      On fault:
          f = f + 1
      Every second, update faults/sec (fe) via aging:
          fe = (1-a) * fe + a * f, f = 0
          # 0 < a < 1, weighting factor; a -> 1 means history ignored
```
  - Allocate frames s.t. PFF is equal for programs
    - Global replacement: victim process has lowest PFF
    - Local replacement: use algorithms above (clock, LRU, etc.) to evict a page within the victim process
Comparison

|Algorithm|Comment| |:--------|:------| |Optimal|Not implementable; useful as benchmark| |FIFO|Might throw out important pages| |Clock|Realistic| |LRU|Excellent; difficult to implement efficiently| |WSC|Efficient working set algorithm| |PFF|Fairness in working set allocation|

Paging Issues

Paging & I/O Interaction

Problem: a frame waiting for I/O may be selected for eviction
Solution: each frame has do not evict flag called pinned page; un-pin after I/O completes

Paging Performance

Paging works best if many free frames
If pages full of dirty pages, 2 disk operations (swap in/out) needed on each page fault
Paging daemon: swap out in advance
- OS maintains a pool of free frames using paging thread/daemon
- Daemon runs replacement algorithm periodically or when pool reaches low watermark:
  - Writes out dirty pages
  - Frees enough pages until pool reaches high watermark
- Frames can be rescued if page referenced before reallocation, because previous content still holds
Prefetching: swap in in advance
- Predict future page usage at current faults
- Works well when pages read sequentially

Thrashing

Livelock: OS spends time paging data from disk, programs not making progress
Causes
1. Pages replacement algorithm not working
2. Not enough memory to hold working set of all running programs
Solutions
1. Swapping: suspend some programs for a while
2. Buy more memory