Relational Database

Relation
- $$A_1, A_2, ..., A_n$$: attributes
- $$R = (A_1, A_2, ..., A_n)$$: relation schema
- $$r(R)$$: relation on relation schema $$R$$
  - A set of n-tuples $$(a_1, a_2, ..., a_n)$$
  - Each $$a_i \in D_i$$
- Relations are sets
  - Each entry unique (opposed to multisets)

Query Languages

Pure languages

Relational Algebra

Procedural
6 basic operators
- Select $$\sigma_p(r)$$
- Project $$\pi_{A_1, A_2, ..., A_k}(r)$$
- Union $$r \cup s$$ (same arity, compatible relations)
- Intersection $$r \cap s$$ (same arity, compatible relations)
- Difference $$r - s$$ (same arity, compatible relations)
- Cartesian product $$r \times s$$
- Rename
Additional operators
- Set intersection
- Natural join $$r \bowtie s$$
- θ-join
- Division
- Generalized projection
- Assignment
- Aggregation $${G_1, ..., G_n} g{F_1(A_1), ..., F_n(A_n)} (E)$$
  - avg
  - min
  - max
  - sum
  - count
- Outer join

Relational Calculus

SQL

Result is multiset (allows duplicates)

SELECT A1, A2, ..., An
FROM r1, r2, ..., rm
WHERE P

$$ \pi_{A_1, A_2, ..., A_n}(\sigma_P(r_1 \times r_2 \times ... \times r_m))

Storage of Databases

Physical Storage Media

Cache
Main memory
Magnetic-disk
- Platter
- Circular track
  - Cylinder
- Sector
  - Smallest unit to be read/written
  - Block: contiguous sequence of sectors

Storage Hierarchy

Primary storage
- Volatile
- Fast
Secondary storage / on-line storage
- Non-volatile
- Moderately fast access
Tertiary storage / off-line storage
- Non-volatile
- Slow access

Performance Measures of Disk

Access time
- Read/write request issued to data transfer begins
- Seek time + rotational latency
Data-transfer rate

Disk-Arm-Scheduling

Algorithm ordering pending accesses to tracks s.t. disk arm movement is minimized.

Storage Access

Storage transferred in blocks.

Buffer
- Portion of main memory available for storing blocks
Buffer manager
- Allocating buffer space
- Goal: minimize block transfers
Buffer-replacement policy
- Least recently used (LRU)

File Organization

Database
- Files: used for one particular record type/relation (except for clustered file organization)
  - Records: record pointer/record id
    - Fields

Records of each relation may be stored in a separate file. In clustered file organization, records of different relations can be stored in the same file.

Records

Heap: anywhere in the file if space
Sequential: ordered by search key
Hashing

Indexing

Linear scan
- $$O(B_I)$$
Binary search
- $$O(log_2 B_I)$$
f-way search
- $$O(log_f B_I)$$

Basics

Search key
- Attributes used to look up records in a file
Index file
- Much smaller than relation file
- Can be ordered
- Need to be updated when file modified
Index entries
- Records in index files
- search-key | pointer
Indices
- Ordered indices: search keys sorted
- Hash indices: search keys distributed across buckets using hash functions

Index Quality

Access types supported
- Equality query
- Range query
Access time
Insertion time
Deletion time
Space overhead

Classification of Indices

Primary/clustered index v.s. secondary/non-clustered index
- Primary index
  - Search key order = sequential order of file
  - Search key usually primary key
- Secondary index
  - Search key order != sequential order of file
Dense index v.s. sparse index
- Dense index
  - All search-key values in file
  - Record locating efficiency
- Sparse index
  - Some search-key values in file
  - Cannot be secondary index
  - Space & maintenance efficiency

Multilevel index

If index not fit in memory, construct multilevel index files. Construct sequential first-level index file, and second-level sparse index on it.

Outer index
- Sparse index on first-level index file
Inner index
- First-level index file

B+-Tree Index Files

Dynamic, multi-level index.

Advantage
- Automatically reorganized itself with small local changes
Disadvantage
- Extra insertion & deletion overhead, space overhead

Basics

Disk-based tree structure
- Node: block with block-id
Multi-way tree
- Each node with multiple children (ceil(n/2) to n)
  - At least 50% of a node occupied
- Order/degree of tree
  - ceil(n/2)
Balanced tree
- All paths from root to leaf have the same length
Disjoint partition of attribute domain into ranges
- Each sub-tree indexes a range in the attribute domain
- Leaf node
  - Index entries, with pointers to relation file

Non-Leaf Nodes

Up to n-1 search keys
Up to n pointers
At least ceil(n/2) pointers (fan-out or degree)
Forms a multi-level sparse index on leaf nodes

| P1 | K1 | P2 | K2 | ... | Pn-1 | Kn-1 | Pn |

All search keys in the subtree P1 points to < K1
Ki-1 <= all search keys in the subtree Pi points to < Ki

Leaf Nodes

Up to n-1 entries
At least ceil((n-1)/2) entries
Entry
- search key | record id
Sorted
1 sibling pointer

Observations

Nodes dynamically created/deleted
- Cannot guarantee physical closeness for nodes of tree
Non-leaf levels form a hierarchy of sparse indices
For n/2 = m, need $$log_m r$$ levels to handle a file of r records
- Path no longer than $$ceil(log_m r)$$
Insertion/deletions handled in logarithmic time

Queries

Find search key k.

Root node
1. Find smallest search key value Ki > k
  1. Follow Pi to child node if exists
  2. Else, follow Pn
Non-leaf node
1. Repeat above
Leaf node
1. Find Ki = k, follow Pi to record
2. Else, no record exists

Comparison with Binary Tree

Problems of binary tree
- Small fan-out
  - Tall tree, many node accesses
- Bad space utilization
  - Each node is a disk page of only 2 entries

Insertion

If search key found, update record
If search key not found
- If room in leaf node, insert search key | record id
- Otherwise, split the node
  - First ceil(n/2) in the original node, the rest in new node
  - Update parent node, propagate the effect up if needed

Deletion

If too few entries after deletion
- If can fit a single node with sibling, merge with sibling and delete the other node
- Otherwise, redistribute the pointers (rebalancing)
Update parent node, propagate the effect up if needed

Static Hashing

Bucket
- A unit of storage containing one or more records
- Typically a disk block
Hash function h
- h(search key) = bucket address B
- Ideal hash function
  - Uniform: each bucket assigned the same number of search key values from the set of all possible values
  - Random: each bucket assigned the same number of search key values, irrespective of the actual distribution of them
- Bucket overflow
  - Overflow chaining/closed hashing
- Database grows with time
  - Periodic reorganize the file with a new hash function
    - Expensive
  - Dynamic hashing

Background

Relational Database

Query Languages

Categories

Pure languages

Relational Algebra

Relational Calculus

SQL

Storage of Databases

Physical Storage Media

Storage Hierarchy

Performance Measures of Disk

Disk-Arm-Scheduling

Storage Access

File Organization

Records

Indexing

Basics

Index Quality

Classification of Indices

Multilevel index

B+-Tree Index Files

Basics

Non-Leaf Nodes

Leaf Nodes

Observations

Queries

Comparison with Binary Tree

Insertion

Deletion

Static Hashing

results matching ""

No results matching ""