Powered by GitBook

Query Evaluation

Parsing & translation into relational algebra expression
1. RA is procedural
Optimization
Evaluation

Cost of Query Evaluation

Factors for Evaluating Each Operation

Size of input
Technique used to access the input
Technique used to evaluate the operation

Measures of Query Cost

Total elapsed time for answering query.

Disk access
- Number of seeks * avg seek cost
- Number of block transfers * avg block transfer time
  - Number of blocks read * avg block read cost
  - Number of blocks written * avg block write cost
Buffer size
CPU
Network

Evaluation of Expressions

Evaluation of an entire expression tree
- Materialization
  - Evaluate one operation at a time, starting from the lowest level
    - Intermediate results materialized into temporary relations and passed on the next-level operations
- Pipelining
  - Evaluate several operations simultaneously, passing results of one operation on to the next
  - May not always be possible

Pipelining v.s. Blocking Operators

Operators: algorithms for evaluating a basic operation.

Blocking operator
- Operators require the whole input to be available before they start producing results
Pipelining operator
- Operators may produce results before one or both inputs are fully seen

Interface of Pipelining Operators

Open(): init state by allocating buffers for I/O
GetNext(): get next output of operator
Close(): terminate operator by closing I/O

Operations

Selection Operation

Linear search
Binary search
Index-based search

B+-Tree

Primary Index

Cost = height of tree + number of blocks matching the index

Secondary Index

Cost = height of tree + 1 if candidate key
Cost = height of tree + number of records otherwise

Sorting

Logical sorting: sort by secondary index
Physical sorting: sort the relation
- Internal sorting: if relations fit in memory
- External sorting: sort-merge algorithm

External Sort-Merge Algorithm

Let M be number of blocks in memory, N be number of runs created.

Create sorted runs
1. Read M blocks of relation into memory
2. Sort
3. Write to run Ri
If N < M
1. Merge runs (N-say merge)
  1. Use N blocks of memory to buffer input runs, 1 block to buffer output
  2. Read first block of each run to its buffer page
  3. Select first record in sorted order among buffer pages, write to output buffer, delete from input buffer
    1. If output full, write to disk
    2. If input empty, load next block
  4. Repeat until input buffer pages empty
If N >= M
1. Merge every M-1 runs, each pass reduces the number of runs by a factor of M-1
2. Repeat until runs merged into 1
Cost
- $$br (2 * ceil(log{M-1}(\frac{b_r}{M})) + 1)$$
- Each block in relation
  - 2 reads (internal sort + merge)
  - 1 write (output to disk)

Join Operation

Merge join
- Sorted output
Hash join
- Cheap
Nested-loop join
- Can pipeline

Nested-Loop Join

r: outer relation; s: inner relation

Steps
- Compare each tuple in r with tuples in s
Cost
- $$b_r + n_r * b_s$$
Advantage
- Simple
- Requires no indices
- General join
Disadvantage
- Expensive

Block Nested-Loop Join

Steps
- Compare each tuple in each block in r with tuples in each block in s
Cost
- $$b_r + b_r * b_s$$
- Can be reduced to $$b_r + ceil(\frac{b_r}{M-2}) * b_s$$
  - M-2 memory blocks used for loading blocks of r

Indexed Nested-Loop Join

Steps
- For each tuple in r, use the index to look up tuples in s
Cost
- $$b_r + n_r * \text{index search cost}$$
Disadvantage
- Applicable only to equi-join or natural join
- Index needs to be available on inner relation's join attribute

Merge Join

Steps
- Sort r and s on their join attribute
- Merge (similar to sort-merge algorithm)
  - Need to handle duplicate values in join attribute
Cost
- $$sort(r) + sort(s) + b_r + b_s$$
Disadvantage
- Applicable only to equi-join or natural join

Hash Join

r: probe input; s: build input

Steps
- Use hash function h to partition tuples of r and s
  - h maps common attributes of r and s to 0..n buckets
  - Tuples in bucket ri need only be compared with tuples in bucket si
- For each i
  - Load si into memory
  - Build an in-memory hash table using the join attribute
    - Different hash function than the earlier h
  - Read tuples in ri one by one, locate matching tuple in si using the in-memory hash table
  - Output concatenation of their attributes
Choice
- Number of buckets n
  - si should fit in memory, ri need not
  - Typically $$n = ceil(\frac{b_s}{M}) * f$$
    - f: fudge factor, typically around 1.2
Cost
- $$3 * (b_r + b_s) + \text{a little}$$
Disadvantage
- Applicable only to equi-join or natural join

Projection Operation

Drop some columns & eliminate duplicate tuples.

Duplicate Elimination

Index exists for projection attributes
- Cost: index scan (cheaper than file scan)
Otherwise
- Partitioning (sorting or hashing) to bring duplicates together, then eliminate
- Cost: sorting or hashing
  - External sorting: $$O(N log_{M-1} N)$$
    - N: number of pages of input
    - M: number of pages fit in memory
  - Hashing: same as above

Aggregation Operation

Similar to duplicate elimination
Optimization
- Compute partial aggregate values during run generation & intermediate merges

results matching ""

No results matching ""