Content

  1. Overview of File Systems
  2. File System Design
  3. Sharing Files
  4. Unix File System
  5. Consistency & Crash Recovery
  6. Journaling File Systems
  7. Log-Structured File Systems (LFS)

Overview of File Systems

  • Abstraction for storing, organizing, & accessing persistent data
    • Data organized as objects called files
    • Files accessed via system calls
  • Open
    • Program can keep the file description returned to imporve file access efficiency
  • Read/Write: r/w bytes from/to current position; update position
  • Seek: move to new position
  • Close
  • Create, rename, delete, get/set attributes
  • Open
  • Readdir: read entries in a directory
    • No writedir to avoid dir metadata corruption
  • Seekdir
  • Close
  • Create, rename, delete, get/set attributes
  • Link/Unlink: add/remove a name for an existing file
    • Hard link: different directory entries have same inode number
      • Inodes maintains reference count
      • Points to files not directories to avoid cycles (ensure directory deletion ok)
      • Becomes DAG
    • Symbolic link (short cut): a file containing data of another file's name (redirect)
      • File type: shortcut (l)

File System Design

  • OS needs to:
    • maintain information about files & directories
    • store files durably
    • handle machine crashes e.g. recover data

Disk Blocks

  • Disks are accessed at the granularity of sectors; a file system allocates data in blocks
    • Reduces overhead managing individual blocks
    • Increase internal fragmentation

File System Tasks

  • Free block management
    • Bitmap: separate area on disk
      • Allows allocating contiguous blocks to file easily
      • Only need 1 bitmap block in memory at a time
      • Extra space for bitmap
  • Block allocation & replacement: maps (potentially non-contiguous) blocks to file

    • Contiguous allocation
      • Good performance for sequential reading
      • File growth requires copying
      • Fragmentation; need periodic compaction
      • Good for CD-ROMS: file sizes known, files never deleted
    • Linked list allocation: first word in block points to next block
      • Random accesses slow
    • File allocation table (FAT): keep linked list information in memory
      • Random accesses faster
      • Entire table in memory, poor scalibility of file system
      • More efficient than i-node on small files
    • I-node based allocation: tree to store index information

      • Allows growth of index information without spreading this information too much
      • Optimized for small files
      • Root of tree: inode

        • 12 direct block pointers
        • 1 indirect pointer
        • 1 double indirect pointer
        • 1 triple indirect pointer

                            disk partitions
          |MBR||||  |                                   |    |    |
                    /                                    \
                   |Super|Inode |Block |Inode |File & dir|
                   |block|bitmap|bitmap|blocks|blocks    |
          
      • Block placement

        • Problems
          • Data blocks scatter across disk in aging file systems -> long seeks
          • Inodes at beginning of disk, far from data -> going back and forth -> long seeks
      • BSD Fast File System
        • Disk partitioned into groups of cylinders
        • Superblocks placed in every group & in offset manner to recover from damage
        • Place these in same cylinder group:
          • Inode, data blocks in a file
          • Files in a directory
          • Place in nearby group if group full
  • Directory management: map file names to location of starting block of file

      |file name|file attributes|starting block #|
      ...
    
      # Unix
      |file name|i-node #| # i-node contains file attributes
      ...
    
    • File names

      • Short, fixed length names
      • Variable length names

        • Size of directory entry variable
        • Options

          • Entries allocated contiguously:

              |len|name|inode|len|name|inode|...
            

            -> Inefficient for search; fragmentation

          • Allocate pointers to file names in the beginning of directory; heap at end to store names

              |p1|hash of name1| # fixed size
              |p2|hash of name2|
              ...
              p1-> |name1|inode| # variable size
              p2-> |name2|inode|name3|inode|
            
    • File deletion
      • Directory entry removed from directory
      • All blocks in file returned to free list
    • Path lookup: e.g. /D1/D2/F
      • Super block (locatin of inode blocks area; typically already cached in memory)
      • Inode of / -> data blocks of /
      • Inode of D1 -> data blocks of D1
      • Inode of D2 -> data blocks of D2
      • Inode of F (returned in open) -> data blocks of F
  • Buffer cache management: cache disk block in memory

    • Operations

      • Block lookup

          # hash table to lookup blocks in memory
        
                key
          |device|block #| -> |disk block in memory| -> ...
          ...
        
      • Block miss

      • Block flush
    • Issues
      • Limited size: need replacement algorithms
      • Competes with VM system about frames
        • Separate cache: inefficient use of cache if not doing file I/O
        • Unified cache: large file I/O affects performance of other programs
    • Read ahead: prefetch next block from disk

Sharing Files

  • Issues
    • Concurrent access
      • Sequential consistency
        • A read call sees data from most recently finished write call
        • All processors see same order of writes
    • Protection
      • Subject: who can access a file
      • Object: the file
      • Action: how can they access a file
        • read, write, execute, append, change protection, delete, etc.
      • Mechanisms
        • Access control lists (ACL): object -> subjects & actions
          • Easy to grant, revoke
          • Becomes large when heavily shared -> use groups
        • Capabilities: subject -> objects & actions
          • Easy to transfer
          • Hard to revoke

Unix File System

# In memory

Parent's fd table   Open file table
-----------         |File position|
-----------         |R/W          |
----------- ------> |Ptr to i-node|
...                 ---------------
                    |             |
Child's fd table  / |             |
-----------      /        ...
----------- -----
-----------
...

# Child can copy parent's fd table when forked
  • fd = open(name,mode)
    • Path lookup: find inode of file
    • Cache inode in buffer cache
    • Check permissions
    • Set up entry in open file table
    • Set up entry in fd table
    • Return fd
  • byte_count = read(fd, buffer, buffer_size)
    • Figure out data/indirect blocks to read
    • Read from disk into buffer cache
    • Copy data to user buffer
    • Update file position
    • Return # of bytes read
  • byte_count = write(fd, buffer, num_bytes)
    • Figure out data/indirect blocks to write
    • Read from disk into buffer cache
    • Copy data from user buffer to buffer cache
    • Update i-node
    • Mark modified buffers dirty (inode, free maps, indirect, data blocks, etc.)
    • Schedule writing dirty buffers to disk
    • Update file position
    • Return # of bytes written
  • close(fd)

Mounting File Systems

  • File system namespace: set of names for all files in a file system
  • File system mounting glues a file system namespace into the namespace of another file system

Consistency & Crash Recovery

Deleting a Unix File

  1. Remove file's directory entry in directory data block
  2. Mark inode of file as free in inode bitmap; mark file blocks as free in block bitmap; update metadata in inode of directory

Crash in between leads to storage leak.

Switch steps. Crash in between leads to dangling pointer.

Reducing File System Inconsistency

  • Write-through metadata blocks
    • Avoid dangling pointers
    • Still inconsistent for the last file system operation before a crash
  • Write-back data blocks
    • Most blocks are data blocks, improves file system performance
    • Data blocks can be lost without affecting file system consistency
  • Crash recovery: restore file system consistency
    • Full scan of file system to recover inconsistent states
    • Long time since disk capacities increase faster than disk throughput
  • Avoid crash recovery:
    • Use battery-backed RAM
      • Ensure enough power to write all dirty blocks to disk
    • Failure atomicity: a file system operation either doesn't happen at all or happens completely
      • Undo recovery
        1. Copy old block on disk to spare block on disk (mark done)
        2. Copy new block in memory to block on disk
        3. Remove spare block
      • Redo recovery
        1. Copy new block in memory to spare block on disk (mark done)
        2. Copy new block in memory to old block on disk
        3. Remove spare block

Journaling File Systems

  • Write-ahead logging
    • Write new versions of blocks in a journal (circular log)
    • Commit when done
    • File system then updates asynchronously
    • Copy journal blocks to file system on crash where commit is present
  • Steps

    1. Write blocks (e.g. B,I,D) to journal
    2. Write commit block to journal

       |...|TransactionHeaderBlock|B'|I'|D'|Commit|
      
       # TransactionHeaderBlock contains block # of the following blocks
      
    3. Install: copy updated B,I,D to file system
    4. Free transaction in journal
  • Observation based on FFS
    • As memory gets larger, need to optimize for writes
    • Reads: read from buffer cache, less of performance problem
    • Writes: becomes heavy for synchronous operations & data integrity
      • Also, writes are not well clustered i.e. directory, inodes, data are scattered

Log-Structured File Systems (LFS)

Write all file system data & metadata in a contiguous log.

  • LFS reads
    • When inodes updated, written in log (scattered in log)
    • Inode-map: an array in memory to locate inodes
      • Inode # -> inode location in log
  • LFS writes

    • When inodes updated, inode-map has to be updated & stored on disk (inode-map itself stored in the log)
    • Uses checkpoint region in a fixed area on disk

      • Locates inode-map blocks in log
      • Serves as a commit point
      • Region updated when inode-map blocks written

        |superblock|checkpoint region|      segment      |segment|...|
                                    /                     \
                                   |segment|inode|inode|...|
                                   |header |     | map |...|
        
        # Checkpoint region -> inode map -> inode -> data
        
      • Keep 2 checkpoint regions to ensure failure atomicity on checkpoint region itself

  • Log space reclamation
    • Reclaim segments when:
      • blocks overwritten
      • blocks deleted
      • live blocks have to be copied out of segments
  • Wear leveling: writing to SSD in log fashion
    • Since each block in SSD has limited endurance, log fashion ensures writes to be spread evenly on disk

Comparison on Journaling File Systems & Log-Structured File Systems

Journaling

  • Motivation: speed up crash recovery
  • Benefit: speeds up crash recovery
  • Drawback: every block write becomes two block writes

Log-Structured

  • Motivation: reads will be absorbed by buffer cache, optimize for writes by issuing all writes sequentially
  • Drawback:
    1. Reads not sequential if file created sequentially
    2. Need garbage collection -> affect normal I/O performance
  • Writes: 1 or more; 1 for writing to log, more for cleaning process

Resources

results matching ""

    No results matching ""