Content
- Overview of File Systems
- File System Design
- Sharing Files
- Unix File System
- Consistency & Crash Recovery
- Journaling File Systems
- Log-Structured File Systems (LFS)
Overview of File Systems
- Abstraction for storing, organizing, & accessing persistent data
- Data organized as objects called files
- Files accessed via system calls
Basic File-Related Calls
- Open
- Program can keep the file description returned to imporve file access efficiency
- Read/Write: r/w bytes from/to current position; update position
- Seek: move to new position
- Close
- Create, rename, delete, get/set attributes
Basic Directory-Related Calls
- Open
- Readdir: read entries in a directory
- No writedir to avoid dir metadata corruption
- Seekdir
- Close
- Create, rename, delete, get/set attributes
- Link/Unlink: add/remove a name for an existing file
- Hard link: different directory entries have same inode number
- Inodes maintains reference count
- Points to files not directories to avoid cycles (ensure directory deletion ok)
- Becomes DAG
- Symbolic link (short cut): a file containing data of another file's name (redirect)
- File type: shortcut (l)
- Hard link: different directory entries have same inode number
File System Design
- OS needs to:
- maintain information about files & directories
- store files durably
- handle machine crashes e.g. recover data
Disk Blocks
- Disks are accessed at the granularity of sectors; a file system allocates data in blocks
- Reduces overhead managing individual blocks
- Increase internal fragmentation
File System Tasks
- Free block management
- Bitmap: separate area on disk
- Allows allocating contiguous blocks to file easily
- Only need 1 bitmap block in memory at a time
- Extra space for bitmap
- Bitmap: separate area on disk
Block allocation & replacement: maps (potentially non-contiguous) blocks to file
- Contiguous allocation
- Good performance for sequential reading
- File growth requires copying
- Fragmentation; need periodic compaction
- Good for CD-ROMS: file sizes known, files never deleted
- Linked list allocation: first word in block points to next block
- Random accesses slow
- File allocation table (FAT): keep linked list information in memory
- Random accesses faster
- Entire table in memory, poor scalibility of file system
- More efficient than i-node on small files
I-node based allocation: tree to store index information
- Allows growth of index information without spreading this information too much
- Optimized for small files
Root of tree: inode
- 12 direct block pointers
- 1 indirect pointer
- 1 double indirect pointer
1 triple indirect pointer
disk partitions |MBR|||| | | | | / \ |Super|Inode |Block |Inode |File & dir| |block|bitmap|bitmap|blocks|blocks |
Block placement
- Problems
- Data blocks scatter across disk in aging file systems -> long seeks
- Inodes at beginning of disk, far from data -> going back and forth -> long seeks
- Problems
- BSD Fast File System
- Disk partitioned into groups of cylinders
- Superblocks placed in every group & in offset manner to recover from damage
- Place these in same cylinder group:
- Inode, data blocks in a file
- Files in a directory
- Place in nearby group if group full
- Contiguous allocation
Directory management: map file names to location of starting block of file
|file name|file attributes|starting block #| ... # Unix |file name|i-node #| # i-node contains file attributes ...
File names
- Short, fixed length names
Variable length names
- Size of directory entry variable
Options
Entries allocated contiguously:
|len|name|inode|len|name|inode|...
-> Inefficient for search; fragmentation
Allocate pointers to file names in the beginning of directory; heap at end to store names
|p1|hash of name1| # fixed size |p2|hash of name2| ... p1-> |name1|inode| # variable size p2-> |name2|inode|name3|inode|
- File deletion
- Directory entry removed from directory
- All blocks in file returned to free list
- Path lookup: e.g.
/D1/D2/F
- Super block (locatin of inode blocks area; typically already cached in memory)
- Inode of
/
-> data blocks of/
- Inode of
D1
-> data blocks ofD1
- Inode of
D2
-> data blocks ofD2
- Inode of
F
(returned inopen
) -> data blocks ofF
Buffer cache management: cache disk block in memory
Operations
Block lookup
# hash table to lookup blocks in memory key |device|block #| -> |disk block in memory| -> ... ...
Block miss
- Block flush
- Issues
- Limited size: need replacement algorithms
- Competes with VM system about frames
- Separate cache: inefficient use of cache if not doing file I/O
- Unified cache: large file I/O affects performance of other programs
- Read ahead: prefetch next block from disk
Sharing Files
- Issues
- Concurrent access
- Sequential consistency
- A
read
call sees data from most recently finishedwrite
call - All processors see same order of writes
- A
- Sequential consistency
- Protection
- Subject: who can access a file
- Object: the file
- Action: how can they access a file
read
,write
,execute
,append
,change protection
,delete
, etc.
- Mechanisms
- Access control lists (ACL): object -> subjects & actions
- Easy to grant, revoke
- Becomes large when heavily shared -> use groups
- Capabilities: subject -> objects & actions
- Easy to transfer
- Hard to revoke
- Access control lists (ACL): object -> subjects & actions
- Concurrent access
Unix File System
# In memory
Parent's fd table Open file table
----------- |File position|
----------- |R/W |
----------- ------> |Ptr to i-node|
... ---------------
| |
Child's fd table / | |
----------- / ...
----------- -----
-----------
...
# Child can copy parent's fd table when forked
File-Related System Calls
fd = open(name,mode)
- Path lookup: find inode of file
- Cache inode in buffer cache
- Check permissions
- Set up entry in open file table
- Set up entry in fd table
- Return fd
byte_count = read(fd, buffer, buffer_size)
- Figure out data/indirect blocks to read
- Read from disk into buffer cache
- Copy data to user buffer
- Update file position
- Return # of bytes read
byte_count = write(fd, buffer, num_bytes)
- Figure out data/indirect blocks to write
- Read from disk into buffer cache
- Copy data from user buffer to buffer cache
- Update i-node
- Mark modified buffers dirty (inode, free maps, indirect, data blocks, etc.)
- Schedule writing dirty buffers to disk
- Update file position
- Return # of bytes written
close(fd)
Mounting File Systems
- File system namespace: set of names for all files in a file system
- File system mounting glues a file system namespace into the namespace of another file system
Consistency & Crash Recovery
Deleting a Unix File
- Remove file's directory entry in directory data block
- Mark inode of file as free in inode bitmap; mark file blocks as free in block bitmap; update metadata in inode of directory
Crash in between leads to storage leak.
Switch steps. Crash in between leads to dangling pointer.
Reducing File System Inconsistency
- Write-through metadata blocks
- Avoid dangling pointers
- Still inconsistent for the last file system operation before a crash
- Write-back data blocks
- Most blocks are data blocks, improves file system performance
- Data blocks can be lost without affecting file system consistency
- Crash recovery: restore file system consistency
- Full scan of file system to recover inconsistent states
- Long time since disk capacities increase faster than disk throughput
- Avoid crash recovery:
- Use battery-backed RAM
- Ensure enough power to write all dirty blocks to disk
- Failure atomicity: a file system operation either doesn't happen at all or happens completely
- Undo recovery
- Copy old block on disk to spare block on disk (mark done)
- Copy new block in memory to block on disk
- Remove spare block
- Redo recovery
- Copy new block in memory to spare block on disk (mark done)
- Copy new block in memory to old block on disk
- Remove spare block
- Undo recovery
- Use battery-backed RAM
Journaling File Systems
- Write-ahead logging
- Write new versions of blocks in a journal (circular log)
- Commit when done
- File system then updates asynchronously
- Copy journal blocks to file system on crash where commit is present
Steps
- Write blocks (e.g. B,I,D) to journal
Write commit block to journal
|...|TransactionHeaderBlock|B'|I'|D'|Commit| # TransactionHeaderBlock contains block # of the following blocks
- Install: copy updated B,I,D to file system
- Free transaction in journal
- Observation based on FFS
- As memory gets larger, need to optimize for writes
- Reads: read from buffer cache, less of performance problem
- Writes: becomes heavy for synchronous operations & data integrity
- Also, writes are not well clustered i.e. directory, inodes, data are scattered
Log-Structured File Systems (LFS)
Write all file system data & metadata in a contiguous log.
- LFS reads
- When inodes updated, written in log (scattered in log)
- Inode-map: an array in memory to locate inodes
- Inode # -> inode location in log
LFS writes
- When inodes updated, inode-map has to be updated & stored on disk (inode-map itself stored in the log)
Uses checkpoint region in a fixed area on disk
- Locates inode-map blocks in log
- Serves as a commit point
Region updated when inode-map blocks written
|superblock|checkpoint region| segment |segment|...| / \ |segment|inode|inode|...| |header | | map |...| # Checkpoint region -> inode map -> inode -> data
Keep 2 checkpoint regions to ensure failure atomicity on checkpoint region itself
- Log space reclamation
- Reclaim segments when:
- blocks overwritten
- blocks deleted
- live blocks have to be copied out of segments
- Reclaim segments when:
- Wear leveling: writing to SSD in log fashion
- Since each block in SSD has limited endurance, log fashion ensures writes to be spread evenly on disk
Comparison on Journaling File Systems & Log-Structured File Systems
Journaling
- Motivation: speed up crash recovery
- Benefit: speeds up crash recovery
- Drawback: every block write becomes two block writes
Log-Structured
- Motivation: reads will be absorbed by buffer cache, optimize for writes by issuing all writes sequentially
- Drawback:
- Reads not sequential if file created sequentially
- Need garbage collection -> affect normal I/O performance
- Writes: 1 or more; 1 for writing to log, more for cleaning process