Relational Database
- Relation- $$A_1, A_2, ..., A_n$$: attributes
- $$R = (A_1, A_2, ..., A_n)$$: relation schema
- $$r(R)$$: relation on relation schema $$R$$- A set of n-tuples $$(a_1, a_2, ..., a_n)$$
- Each $$a_i \in D_i$$
 
- Relations are sets- Each entry unique (opposed to multisets)
 
 
Query Languages
Categories
- Procedural
- Non-procedural: specifies the task- e.g. SQL
 
Pure languages
Relational Algebra
- Procedural
- 6 basic operators- Select $$\sigma_p(r)$$
- Project $$\pi_{A_1, A_2, ..., A_k}(r)$$
- Union $$r \cup s$$ (same arity, compatible relations)
- Intersection $$r \cap s$$ (same arity, compatible relations)
- Difference $$r - s$$ (same arity, compatible relations)
- Cartesian product $$r \times s$$
- Rename
 
- Additional operators- Set intersection
- Natural join $$r \bowtie s$$
- θ-join
- Division
- Generalized projection
- Assignment
- Aggregation $${G_1, ..., G_n} g{F_1(A_1), ..., F_n(A_n)} (E)$$- avg
- min
- max
- sum
- count
 
- Outer join
 
Relational Calculus
SQL
- Result is multiset (allows duplicates)
SELECT A1, A2, ..., An
FROM r1, r2, ..., rm
WHERE P
$$ \pi_{A_1, A_2, ..., A_n}(\sigma_P(r_1 \times r_2 \times ... \times r_m))
$$
Storage of Databases
Physical Storage Media
- Cache
- Main memory
- Magnetic-disk- Platter
- Circular track- Cylinder
 
- Sector- Smallest unit to be read/written
- Block: contiguous sequence of sectors
 
 
Storage Hierarchy
- Primary storage- Volatile
- Fast
 
- Secondary storage / on-line storage- Non-volatile
- Moderately fast access
 
- Tertiary storage / off-line storage- Non-volatile
- Slow access
 
Performance Measures of Disk
- Access time- Read/write request issued to data transfer begins
- Seek time + rotational latency
 
- Data-transfer rate
Disk-Arm-Scheduling
Algorithm ordering pending accesses to tracks s.t. disk arm movement is minimized.
Storage Access
Storage transferred in blocks.
- Buffer- Portion of main memory available for storing blocks
 
- Buffer manager- Allocating buffer space
- Goal: minimize block transfers
 
- Buffer-replacement policy- Least recently used (LRU)
 
File Organization
- Database- Files: used for one particular record type/relation (except for clustered file organization)- Records: record pointer/record id- Fields
 
 
- Records: record pointer/record id
 
- Files: used for one particular record type/relation (except for clustered file organization)
Records of each relation may be stored in a separate file. In clustered file organization, records of different relations can be stored in the same file.
Records
- Heap: anywhere in the file if space
- Sequential: ordered by search key
- Hashing
Indexing

- Linear scan- $$O(B_I)$$
 
- Binary search- $$O(log_2 B_I)$$
 
- f-way search- $$O(log_f B_I)$$
 
Basics
- Search key- Attributes used to look up records in a file
 
- Index file- Much smaller than relation file
- Can be ordered
- Need to be updated when file modified
 
- Index entries- Records in index files
- search-key | pointer
 
- Indices- Ordered indices: search keys sorted
- Hash indices: search keys distributed across buckets using hash functions
 
Index Quality
- Access types supported- Equality query
- Range query
 
- Access time
- Insertion time
- Deletion time
- Space overhead
Classification of Indices
- Primary/clustered index v.s. secondary/non-clustered index- Primary index- Search key order = sequential order of file
- Search key usually primary key
 
- Secondary index- Search key order != sequential order of file
 
 
- Primary index
- Dense index v.s. sparse index- Dense index- All search-key values in file
- Record locating efficiency
 
- Sparse index- Some search-key values in file
- Cannot be secondary index
- Space & maintenance efficiency
 
 
- Dense index
Multilevel index

If index not fit in memory, construct multilevel index files. Construct sequential first-level index file, and second-level sparse index on it.
- Outer index- Sparse index on first-level index file
 
- Inner index- First-level index file
 
B+-Tree Index Files
Dynamic, multi-level index.
- Advantage- Automatically reorganized itself with small local changes
 
- Disadvantage- Extra insertion & deletion overhead, space overhead
 
Basics
- Disk-based tree structure- Node: block with block-id
 
- Node: block with 
- Multi-way tree- Each node with multiple children (ceil(n/2)ton)- At least 50% of a node occupied
 
- Order/degree of tree- ceil(n/2)
 
 
- Each node with multiple children (
- Balanced tree- All paths from root to leaf have the same length
 
- Disjoint partition of attribute domain into ranges- Each sub-tree indexes a range in the attribute domain
- Leaf node- Index entries, with pointers to relation file
 
 
Non-Leaf Nodes
- Up to n-1search keys
- Up to npointers
- At least ceil(n/2)pointers (fan-out or degree)
- Forms a multi-level sparse index on leaf nodes
| P1 | K1 | P2 | K2 | ... | Pn-1 | Kn-1 | Pn |
- All search keys in the subtree P1points to <K1
- Ki-1<= all search keys in the subtree- Pipoints to <- Ki
Leaf Nodes
- Up to n-1entries
- At least ceil((n-1)/2)entries
- Entry- search key | record id
 
- Sorted
- 1 sibling pointer
Observations
- Nodes dynamically created/deleted- Cannot guarantee physical closeness for nodes of tree
 
- Non-leaf levels form a hierarchy of sparse indices
- For n/2 = m, need $$log_m r$$ levels to handle a file ofrrecords- Path no longer than $$ceil(log_m r)$$
 
- Insertion/deletions handled in logarithmic time
Queries
Find search key k.
- Root node- Find smallest search key value Ki>k- Follow Pito child node if exists
- Else, follow Pn
 
- Follow 
 
- Find smallest search key value 
- Non-leaf node- Repeat above
 
- Leaf node- Find Ki = k, followPito record
- Else, no record exists
 
- Find 
Comparison with Binary Tree
- Problems of binary tree- Small fan-out- Tall tree, many node accesses
 
- Bad space utilization- Each node is a disk page of only 2 entries
 
 
- Small fan-out
Insertion
- If search key found, update record
- If search key not found- If room in leaf node, insert search key | record id
- Otherwise, split the node- First ceil(n/2)in the original node, the rest in new node
- Update parent node, propagate the effect up if needed
 
- First 
 
- If room in leaf node, insert 
Deletion
- If too few entries after deletion- If can fit a single node with sibling, merge with sibling and delete the other node
- Otherwise, redistribute the pointers (rebalancing)
 
- Update parent node, propagate the effect up if needed
Static Hashing
- Bucket- A unit of storage containing one or more records
- Typically a disk block
 
- Hash function h- h(search key) = bucket address B
- Ideal hash function- Uniform: each bucket assigned the same number of search key values from the set of all possible values
- Random: each bucket assigned the same number of search key values, irrespective of the actual distribution of them
 
- Bucket overflow- Overflow chaining/closed hashing
 
- Database grows with time- Periodic reorganize the file with a new hash function- Expensive
 
- Dynamic hashing
 
- Periodic reorganize the file with a new hash function