Relational Database
- Relation
- $$A_1, A_2, ..., A_n$$: attributes
- $$R = (A_1, A_2, ..., A_n)$$: relation schema
- $$r(R)$$: relation on relation schema $$R$$
- A set of n-tuples $$(a_1, a_2, ..., a_n)$$
- Each $$a_i \in D_i$$
- Relations are sets
- Each entry unique (opposed to multisets)
Query Languages
Categories
- Procedural
- Non-procedural: specifies the task
- e.g. SQL
Pure languages
Relational Algebra
- Procedural
- 6 basic operators
- Select $$\sigma_p(r)$$
- Project $$\pi_{A_1, A_2, ..., A_k}(r)$$
- Union $$r \cup s$$ (same arity, compatible relations)
- Intersection $$r \cap s$$ (same arity, compatible relations)
- Difference $$r - s$$ (same arity, compatible relations)
- Cartesian product $$r \times s$$
- Rename
- Additional operators
- Set intersection
- Natural join $$r \bowtie s$$
- θ-join
- Division
- Generalized projection
- Assignment
- Aggregation $${G_1, ..., G_n} g{F_1(A_1), ..., F_n(A_n)} (E)$$
avg
min
max
sum
count
- Outer join
Relational Calculus
SQL
- Result is multiset (allows duplicates)
SELECT A1, A2, ..., An
FROM r1, r2, ..., rm
WHERE P
$$ \pi_{A_1, A_2, ..., A_n}(\sigma_P(r_1 \times r_2 \times ... \times r_m))
$$
Storage of Databases
Physical Storage Media
- Cache
- Main memory
- Magnetic-disk
- Platter
- Circular track
- Cylinder
- Sector
- Smallest unit to be read/written
- Block: contiguous sequence of sectors
Storage Hierarchy
- Primary storage
- Volatile
- Fast
- Secondary storage / on-line storage
- Non-volatile
- Moderately fast access
- Tertiary storage / off-line storage
- Non-volatile
- Slow access
Performance Measures of Disk
- Access time
- Read/write request issued to data transfer begins
- Seek time + rotational latency
- Data-transfer rate
Disk-Arm-Scheduling
Algorithm ordering pending accesses to tracks s.t. disk arm movement is minimized.
Storage Access
Storage transferred in blocks.
- Buffer
- Portion of main memory available for storing blocks
- Buffer manager
- Allocating buffer space
- Goal: minimize block transfers
- Buffer-replacement policy
- Least recently used (LRU)
File Organization
- Database
- Files: used for one particular record type/relation (except for clustered file organization)
- Records: record pointer/record id
- Fields
- Records: record pointer/record id
- Files: used for one particular record type/relation (except for clustered file organization)
Records of each relation may be stored in a separate file. In clustered file organization, records of different relations can be stored in the same file.
Records
- Heap: anywhere in the file if space
- Sequential: ordered by search key
- Hashing
Indexing
- Linear scan
- $$O(B_I)$$
- Binary search
- $$O(log_2 B_I)$$
- f-way search
- $$O(log_f B_I)$$
Basics
- Search key
- Attributes used to look up records in a file
- Index file
- Much smaller than relation file
- Can be ordered
- Need to be updated when file modified
- Index entries
- Records in index files
search-key | pointer
- Indices
- Ordered indices: search keys sorted
- Hash indices: search keys distributed across buckets using hash functions
Index Quality
- Access types supported
- Equality query
- Range query
- Access time
- Insertion time
- Deletion time
- Space overhead
Classification of Indices
- Primary/clustered index v.s. secondary/non-clustered index
- Primary index
- Search key order = sequential order of file
- Search key usually primary key
- Secondary index
- Search key order != sequential order of file
- Primary index
- Dense index v.s. sparse index
- Dense index
- All search-key values in file
- Record locating efficiency
- Sparse index
- Some search-key values in file
- Cannot be secondary index
- Space & maintenance efficiency
- Dense index
Multilevel index
If index not fit in memory, construct multilevel index files. Construct sequential first-level index file, and second-level sparse index on it.
- Outer index
- Sparse index on first-level index file
- Inner index
- First-level index file
B+-Tree Index Files
Dynamic, multi-level index.
- Advantage
- Automatically reorganized itself with small local changes
- Disadvantage
- Extra insertion & deletion overhead, space overhead
Basics
- Disk-based tree structure
- Node: block with
block-id
- Node: block with
- Multi-way tree
- Each node with multiple children (
ceil(n/2)
ton
)- At least 50% of a node occupied
- Order/degree of tree
ceil(n/2)
- Each node with multiple children (
- Balanced tree
- All paths from root to leaf have the same length
- Disjoint partition of attribute domain into ranges
- Each sub-tree indexes a range in the attribute domain
- Leaf node
- Index entries, with pointers to relation file
Non-Leaf Nodes
- Up to
n-1
search keys - Up to
n
pointers - At least
ceil(n/2)
pointers (fan-out or degree) - Forms a multi-level sparse index on leaf nodes
| P1 | K1 | P2 | K2 | ... | Pn-1 | Kn-1 | Pn |
- All search keys in the subtree
P1
points to <K1
Ki-1
<= all search keys in the subtreePi
points to <Ki
Leaf Nodes
- Up to
n-1
entries - At least
ceil((n-1)/2)
entries - Entry
search key | record id
- Sorted
- 1 sibling pointer
Observations
- Nodes dynamically created/deleted
- Cannot guarantee physical closeness for nodes of tree
- Non-leaf levels form a hierarchy of sparse indices
- For
n/2 = m
, need $$log_m r$$ levels to handle a file ofr
records- Path no longer than $$ceil(log_m r)$$
- Insertion/deletions handled in logarithmic time
Queries
Find search key k
.
- Root node
- Find smallest search key value
Ki
>k
- Follow
Pi
to child node if exists - Else, follow
Pn
- Follow
- Find smallest search key value
- Non-leaf node
- Repeat above
- Leaf node
- Find
Ki = k
, followPi
to record - Else, no record exists
- Find
Comparison with Binary Tree
- Problems of binary tree
- Small fan-out
- Tall tree, many node accesses
- Bad space utilization
- Each node is a disk page of only 2 entries
- Small fan-out
Insertion
- If search key found, update record
- If search key not found
- If room in leaf node, insert
search key | record id
- Otherwise, split the node
- First
ceil(n/2)
in the original node, the rest in new node - Update parent node, propagate the effect up if needed
- First
- If room in leaf node, insert
Deletion
- If too few entries after deletion
- If can fit a single node with sibling, merge with sibling and delete the other node
- Otherwise, redistribute the pointers (rebalancing)
- Update parent node, propagate the effect up if needed
Static Hashing
- Bucket
- A unit of storage containing one or more records
- Typically a disk block
- Hash function
h
h(search key) = bucket address B
- Ideal hash function
- Uniform: each bucket assigned the same number of search key values from the set of all possible values
- Random: each bucket assigned the same number of search key values, irrespective of the actual distribution of them
- Bucket overflow
- Overflow chaining/closed hashing
- Database grows with time
- Periodic reorganize the file with a new hash function
- Expensive
- Dynamic hashing
- Periodic reorganize the file with a new hash function