Relational Database

  • Relation
    • $$A_1, A_2, ..., A_n$$: attributes
    • $$R = (A_1, A_2, ..., A_n)$$: relation schema
    • $$r(R)$$: relation on relation schema $$R$$
      • A set of n-tuples $$(a_1, a_2, ..., a_n)$$
      • Each $$a_i \in D_i$$
    • Relations are sets
      • Each entry unique (opposed to multisets)

Query Languages

Categories

  • Procedural
  • Non-procedural: specifies the task
    • e.g. SQL

Pure languages

Relational Algebra
  • Procedural
  • 6 basic operators
    • Select $$\sigma_p(r)$$
    • Project $$\pi_{A_1, A_2, ..., A_k}(r)$$
    • Union $$r \cup s$$ (same arity, compatible relations)
    • Intersection $$r \cap s$$ (same arity, compatible relations)
    • Difference $$r - s$$ (same arity, compatible relations)
    • Cartesian product $$r \times s$$
    • Rename
  • Additional operators
    • Set intersection
    • Natural join $$r \bowtie s$$
    • θ-join
    • Division
    • Generalized projection
    • Assignment
    • Aggregation $${G_1, ..., G_n} g{F_1(A_1), ..., F_n(A_n)} (E)$$
      • avg
      • min
      • max
      • sum
      • count
    • Outer join
Relational Calculus

SQL

  • Result is multiset (allows duplicates)
SELECT A1, A2, ..., An
FROM r1, r2, ..., rm
WHERE P

$$ \pi_{A_1, A_2, ..., A_n}(\sigma_P(r_1 \times r_2 \times ... \times r_m))

$$

Storage of Databases

Physical Storage Media

  • Cache
  • Main memory
  • Magnetic-disk
    • Platter
    • Circular track
      • Cylinder
    • Sector
      • Smallest unit to be read/written
      • Block: contiguous sequence of sectors

Storage Hierarchy

  • Primary storage
    • Volatile
    • Fast
  • Secondary storage / on-line storage
    • Non-volatile
    • Moderately fast access
  • Tertiary storage / off-line storage
    • Non-volatile
    • Slow access

Performance Measures of Disk

  • Access time
    • Read/write request issued to data transfer begins
    • Seek time + rotational latency
  • Data-transfer rate
Disk-Arm-Scheduling

Algorithm ordering pending accesses to tracks s.t. disk arm movement is minimized.

Storage Access

Storage transferred in blocks.

  • Buffer
    • Portion of main memory available for storing blocks
  • Buffer manager
    • Allocating buffer space
    • Goal: minimize block transfers
  • Buffer-replacement policy
    • Least recently used (LRU)

File Organization

  • Database
    • Files: used for one particular record type/relation (except for clustered file organization)
      • Records: record pointer/record id
        • Fields

Records of each relation may be stored in a separate file. In clustered file organization, records of different relations can be stored in the same file.

Records
  • Heap: anywhere in the file if space
  • Sequential: ordered by search key
  • Hashing

Indexing

  • Linear scan
    • $$O(B_I)$$
  • Binary search
    • $$O(log_2 B_I)$$
  • f-way search
    • $$O(log_f B_I)$$

Basics

  • Search key
    • Attributes used to look up records in a file
  • Index file
    • Much smaller than relation file
    • Can be ordered
    • Need to be updated when file modified
  • Index entries
    • Records in index files
    • search-key | pointer
  • Indices
    • Ordered indices: search keys sorted
    • Hash indices: search keys distributed across buckets using hash functions
Index Quality
  • Access types supported
    • Equality query
    • Range query
  • Access time
  • Insertion time
  • Deletion time
  • Space overhead
Classification of Indices
  • Primary/clustered index v.s. secondary/non-clustered index
    • Primary index
      • Search key order = sequential order of file
      • Search key usually primary key
    • Secondary index
      • Search key order != sequential order of file
  • Dense index v.s. sparse index
    • Dense index
      • All search-key values in file
      • Record locating efficiency
    • Sparse index
Multilevel index

If index not fit in memory, construct multilevel index files. Construct sequential first-level index file, and second-level sparse index on it.

  • Outer index
    • Sparse index on first-level index file
  • Inner index
    • First-level index file

B+-Tree Index Files

Dynamic, multi-level index.

  • Advantage
    • Automatically reorganized itself with small local changes
  • Disadvantage
    • Extra insertion & deletion overhead, space overhead
Basics
  • Disk-based tree structure
    • Node: block with block-id
  • Multi-way tree
    • Each node with multiple children (ceil(n/2) to n)
      • At least 50% of a node occupied
    • Order/degree of tree
      • ceil(n/2)
  • Balanced tree
    • All paths from root to leaf have the same length
  • Disjoint partition of attribute domain into ranges
    • Each sub-tree indexes a range in the attribute domain
    • Leaf node
      • Index entries, with pointers to relation file
Non-Leaf Nodes
  • Up to n-1 search keys
  • Up to n pointers
  • At least ceil(n/2) pointers (fan-out or degree)
  • Forms a multi-level sparse index on leaf nodes
| P1 | K1 | P2 | K2 | ... | Pn-1 | Kn-1 | Pn |
  • All search keys in the subtree P1 points to < K1
  • Ki-1 <= all search keys in the subtree Pi points to < Ki
Leaf Nodes
  • Up to n-1 entries
  • At least ceil((n-1)/2) entries
  • Entry
    • search key | record id
  • Sorted
  • 1 sibling pointer
Observations
  • Nodes dynamically created/deleted
    • Cannot guarantee physical closeness for nodes of tree
  • Non-leaf levels form a hierarchy of sparse indices
  • For n/2 = m, need $$log_m r$$ levels to handle a file of r records
    • Path no longer than $$ceil(log_m r)$$
  • Insertion/deletions handled in logarithmic time
Queries

Find search key k.

  1. Root node
    1. Find smallest search key value Ki > k
      1. Follow Pi to child node if exists
      2. Else, follow Pn
  2. Non-leaf node
    1. Repeat above
  3. Leaf node
    1. Find Ki = k, follow Pi to record
    2. Else, no record exists
Comparison with Binary Tree
  • Problems of binary tree
    • Small fan-out
      • Tall tree, many node accesses
    • Bad space utilization
      • Each node is a disk page of only 2 entries
Insertion
  • If search key found, update record
  • If search key not found
    • If room in leaf node, insert search key | record id
    • Otherwise, split the node
      • First ceil(n/2) in the original node, the rest in new node
      • Update parent node, propagate the effect up if needed
Deletion
  • If too few entries after deletion
    • If can fit a single node with sibling, merge with sibling and delete the other node
    • Otherwise, redistribute the pointers (rebalancing)
  • Update parent node, propagate the effect up if needed

Static Hashing

  • Bucket
    • A unit of storage containing one or more records
    • Typically a disk block
  • Hash function h
    • h(search key) = bucket address B
    • Ideal hash function
      • Uniform: each bucket assigned the same number of search key values from the set of all possible values
      • Random: each bucket assigned the same number of search key values, irrespective of the actual distribution of them
    • Bucket overflow
      • Overflow chaining/closed hashing
    • Database grows with time
      • Periodic reorganize the file with a new hash function
        • Expensive
      • Dynamic hashing

results matching ""

    No results matching ""