Query Optimization

Cost of operations depends on
- Statistical information about relations
- Statistical information about intermediate results

Cost Based Optimization

Generate logically equivalent expressions
1. Use equivalence rules
Annotate resultant expressions to get alternative query plans
Choose the cheapest plan based on estimated cost
1. Estimate cost of operations & techniques
  1. Sizes of input
  2. Statistics about attribute value distribution
2. Estimate output size or operations
  1. Statistics about attribute value distribution

Basic Statistics for Cost Estimation

$$n_r$$: number of tuples in relation r
$$b_r$$: number of blocks in relation r
- $$b_r = ceil(\frac{n_r}{f_r})$$ if tuples of r are stored together physically in a file
$$s_r$$: size of a tuple of r
$$f_r$$: blocking factor of r (number of tuples in a block)
$$V(A, r)$$: number of distinct values in r for attribute A
$$SC(A, r)$$: selection cardinality of attribute A in r (avg. number of records that satisfy equality on A)

Catalog Information about Indices

$$f_i$$: avg. fan-out of internal nodes of index i
$$HT_i$$: height of index i
- $$HTi = ceil(log{f_i} (V(A, r)))$$ for a balanced tree
- $$HT_i = 1$$ for a hash index
$$LB_i$$: number of blocks at the leaf level

Measure of Query Cost

The number of block transfers from disk.

Selection Size Estimation

Equality selection $$\sigma_{A=v}(r)$$
- $$SC(A, r)$$
- $$ceil(\frac{SC(A, r)}{f_r})$$: number of blocks these records will occupy
- 1 if A is candidate key
Comparison selection $$\sigma_{A \le V}(r)$$
- Let C be the estimated number of tuples satisfying the condition
  - $$C = n_r * \frac{v - min(A, r)}{max(A, r) - min(A, r)}$$
- Histograms can be used to improve accuracy

Implementation of Complex Selections

Selectivity of condition $$\theta_i$$: probability that a tuple in relation r satisfies $$\theta_i$$
- $$\frac{s_i}{n_r}$$
Conjunction $$\sigma_{\theta_1 \wedge \theta_2 \wedge \dots \wedge \theta_n} (r)$$
- $$n_r \frac{s_1 s_2 \dots s_n}{n_r^n}$$
Disjunction $$\sigma_{\theta_1 \vee \theta_2 \vee \dots \vee \theta_n} (r)$$
- $$n_r * (1 - (1 - \frac{s_1}{n_r})(1 - \frac{s_2}{n_r}) \dots (1 - \frac{s_n}{n_r}))$$
Negation $$\sigma_{\neg \theta} (r)$$
- $$nr - size(\sigma{\theta}(r))$$

Join Size Estimation

Relation r and s, Cartesian product $$r \times s$$ contains $$n_r * n_s$$ tuples.

If $$R \cap S = \varnothing$$
- $$r \Join s = r \times s$$
If $$R \cap S$$ is a key for $$R$$
- Keys in $$S$$ may be absent in $$R$$
- $$n_{r \Join s} \le n_s$$
If $$R \cap S$$ in $$S$$ is a foreign key referencing $$R$$
- Keys in $$S$$ must be in $$R$$
- $$n_{r \Join s} = n_s$$
If $$R \cap S = {A}$$ is not a key
- Assume every tuple in r produces tuples in $$r \Join s$$
  - $$\frac{n_r n_s}{V(A, s)} = n_r \frac{n_s}{V(A, s)}$$
  - $$\frac{n_s}{V(A, s)}$$: average number of tuples for each distinct value in s
- Assume every tuple in s produces tuples in $$r \Join s$$
  - $$\frac{n_r n_s}{V(A, r)} = n_s \frac{n_r}{V(A, r)}$$
  - $$\frac{n_r}{V(A, r)}$$: average number of tuples for each distinct value in r
- Take the lowest of the two estimates

Projection Size Estimation

$$\Pi_A(r) = V(A, r)$$

Aggregation Size Estimation

$$_A g_F (r) = V(A, r)$$

Set Operation Size Estimation

Unions/intersections of selections on the same relation
- Rewrite & use size estimate for selections
Operations on different relations
- $$r \cup s = size(r) + size(s)$$
- $$r \cap s = min(size(r), size(s))$$
- $$r - s = size(r)$$

Transformation of Relational Expressions

Equivalence Rule

Expressions of two forms are equivalent.

Conjunctive selection = sequence of individual selections
1. $$\sigma{\theta_1 \wedge \theta_2} (E) = \sigma{\theta1}(\sigma{\theta_2} (E))$$
Commutativity of selection
1. $$\sigma{\theta_1}(\sigma{\theta2} (E)) = \sigma{\theta2}(\sigma{\theta_1} (E))$$
Only the last in a sequence of projection is needed
1. $$\pi{t_1}(\pi{t2}(\dots (\pi{tn}(E)))) = \pi{t_1} (E)$$
Selections can be combined with Cartesian products & theta joins
1. $$\sigma{\theta}(E_1 \times E_2) = E_1 \Join{\theta} E_2$$
2. $$\sigma{\theta_1}(E_1 \Join{\theta2} E_2) = E_1 \Join{\theta_1 \wedge \theta_2} E2$$
Commutativity of theta-joins
1. $$E1 \Join{\theta} E2 = E_2 \Join{\theta} E_1$$
Associativity of natural join
1. $$(E_1 \Join E_2) \Join E_3 = E_1 \Join (E_2 \Join E_3)$$
Associativity of theta-join
1. $$(E1 \Join{\theta1} E_2) \Join{\theta2 \wedge \theta_3} E_3 = E_1 \Join{\theta1 \wedge \theta_3} (E_2 \Join{\theta_2} E_3)$$
Distribution of selection over theta join
1. If $$\theta_0$$ involves only the attributes of $$E_1$$
  1. $$\sigma{\theta_0} (E_1 \Join{\theta} E2) = (\sigma{\theta0} (E_1)) \Join{\theta} E_2$$
2. If $$\theta_1$$ involves only the attributes of $$E_1$$, $$\theta_2$$ involves only the attributes of $$E_2$$
  1. $$\sigma{\theta_1 \wedge \theta_2} (E_1 \Join{\theta} E2) = (\sigma{\theta1} (E_1)) \Join{\theta} (\sigma_{\theta_2} (E_2))$$
Distribution of projection over theta join
1. If $$\theta$$ involves only attributes from $$L_1 \cup L_2$$
  1. $$\pi{L_1 \cup L_2} (E_1 \Join{\theta} E2) = (\pi{L1} (E_1)) \Join{\theta} (\pi_{L_2} (E_2))$$
2. Consider $$E1 \Join{\theta} E_2$$
  1. $$\pi{L_1 \cup L_2} (E_1 \Join{\theta} E2) = \pi{L1 \cup L_2} ((\pi{L1 \cup L_3} (E_1)) \Join{\theta} (\pi_{L_2 \cup L_4} (E_2)))$$
    1. $$L_i$$: sets of attributes from $$E_i$$
    2. $$L_3$$: attributes of $$E_1$$ not in $$L_1 \cup L_2$$, but are involved in $$\theta$$
    3. $$L_4$$: attributes of $$E_2$$ not in $$L_1 \cup L_2$$, but are involved in $$\theta$$
Join Ordering
1. $$(r_1 \Join r_2) \Join r_3 = r_1 \Join (r_2 \Join r_3)$$

Evaluation Plan

What algorithm used for each operation
How the execution of operations is coordinated

Choice of Evaluation Plan

Cost-Based Optimization

$$r_1 \Join r_2 \Join \dots \Join r_n$$ has $$\frac{(2(n-1))!}{(n-1)!}$$ different join orders. Ref

Dynamic Programming

When computing the cost for join order, store for future use. Total $$2^n - 1$$ possible costs to choose from.

find_best_plan(S):
    if bestplan[S].cost != inf):
        return bestplan[S]
    for non-empty subset S1 of S s.t. S1 != S:
        P1 = find_best_plan(S1)
        P2 = find_best_plan(S - S1)
        A = best algo for joining results of P1 and P2
        cost = P1.cost + P2.cost + A.cost

        if cost < bestplan[S].cost:
            bestplan[S].cost = cost
            bestplan[S].plan = "P1.plan + P2.plan + A"

    return bestplan[S]

Time complexity
- Bushy tree: $$O(3^n)$$
- Left-deep tree: $$O(n 2^n)$$
Space complexity
- $$O(2^n)$$

Heuristic Optimization

Perform selection early
Perform projection early
Perform most restrictive selection & join operations first
Combine with partial cost-based optimization

Typical Steps

Deconstruct conjunctive selections & move down
Execute first the selections & joins that will produce the smallest relations
Cartesian product + selection = join
Deconstruct projections & move down (may create new projections)
Execute the subtrees whose operations can be pipelined

Query Optimization

Query Optimization

Cost Based Optimization

Basic Statistics for Cost Estimation

Catalog Information about Indices

Measure of Query Cost

Selection Size Estimation

Implementation of Complex Selections

Join Size Estimation

Projection Size Estimation

Aggregation Size Estimation

Set Operation Size Estimation

Transformation of Relational Expressions

Equivalence Rule

Evaluation Plan

Choice of Evaluation Plan

Cost-Based Optimization

Dynamic Programming

Heuristic Optimization

Typical Steps

results matching ""

No results matching ""