Query Optimization

  • Cost of operations depends on
    • Statistical information about relations
    • Statistical information about intermediate results

Cost Based Optimization

  1. Generate logically equivalent expressions
    1. Use equivalence rules
  2. Annotate resultant expressions to get alternative query plans
  3. Choose the cheapest plan based on estimated cost
    1. Estimate cost of operations & techniques
      1. Sizes of input
      2. Statistics about attribute value distribution
    2. Estimate output size of operations
      1. Statistics about attribute value distribution
Basic Statistics for Cost Estimation
  • $$n_r$$: number of tuples in relation r
  • $$b_r$$: number of blocks in relation r
    • $$b_r = ceil(\frac{n_r}{f_r})$$ if tuples of r are stored together physically in a file
  • $$s_r$$: size of a tuple of r
  • $$f_r$$: blocking factor of r (number of tuples in a block)
  • $$V(A, r)$$: number of distinct values in r for attribute A
  • $$SC(A, r)$$: selection cardinality of attribute A in r (avg. number of records that satisfy equality on A)
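These statistics can be sketched in a few lines of Python. The relation r, its blocking factor, and the attribute below are hypothetical, for illustration only:

```python
from math import ceil

# Hypothetical relation r stored as a list of (id, dept) tuples.
r = [(i, f"dept{i % 5}") for i in range(1000)]

n_r = len(r)                      # number of tuples in r
f_r = 20                          # assumed blocking factor: tuples per block
b_r = ceil(n_r / f_r)             # blocks needed if tuples are stored together

dept_values = [t[1] for t in r]
V_dept = len(set(dept_values))    # V(dept, r): distinct values of dept
SC_dept = n_r / V_dept            # SC(dept, r): avg. tuples per dept value

print(n_r, b_r, V_dept, SC_dept)  # → 1000 50 5 200.0
```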
Catalog Information about Indices
  • $$f_i$$: avg. fan-out of internal nodes of index i
  • $$HT_i$$: height of index i
    • $$HT_i = ceil(\log_{f_i}(V(A, r)))$$ for a balanced tree
    • $$HT_i = 1$$ for a hash index
  • $$LB_i$$: number of blocks at the leaf level
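The index-height formula is easy to evaluate directly; the fan-out and distinct-value count below are assumed figures, not taken from any real index:

```python
from math import ceil, log

# Sketch: estimated height of a balanced (B+-tree-style) index.
f_i = 100         # assumed avg. fan-out of internal nodes
V_A_r = 500_000   # assumed V(A, r): distinct values of the indexed attribute

HT_i = ceil(log(V_A_r, f_i))  # ceil(log_{f_i} V(A, r)) for a balanced tree
print(HT_i)  # → 3
```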
Measure of Query Cost

The number of block transfers from disk.

Selection Size Estimation
  • Equality selection $$\sigma_{A=v}(r)$$
    • $$SC(A, r)$$
    • $$ceil(\frac{SC(A, r)}{f_r})$$: number of blocks these records will occupy
    • 1 if A is candidate key
  • Comparison selection $$\sigma_{A \le v}(r)$$
    • Let C be the estimated number of tuples satisfying the condition
      • $$C = n_r * \frac{v - min(A, r)}{max(A, r) - min(A, r)}$$
    • Histograms can be used to improve accuracy
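Both selection estimates can be sketched with assumed statistics (the cardinalities, blocking factor, and attribute range below are hypothetical):

```python
from math import ceil

n_r = 10_000   # tuples in r (assumption)
f_r = 25       # blocking factor (assumption)
V_A = 50       # V(A, r): distinct values of A (assumption)

# Equality selection sigma_{A=v}(r): SC(A, r) matching tuples on average.
SC = n_r / V_A            # expected matching tuples
blocks = ceil(SC / f_r)   # blocks those records occupy

# Comparison selection sigma_{A<=v}(r), assuming values are uniformly
# distributed between min(A, r) and max(A, r).
min_A, max_A, v = 0, 1000, 250
c = n_r * (v - min_A) / (max_A - min_A)

print(SC, blocks, c)  # → 200.0 8 2500.0
```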
Implementation of Complex Selections
  • Selectivity of condition $$\theta_i$$: probability that a tuple in relation r satisfies $$\theta_i$$
    • $$\frac{s_i}{n_r}$$, where $$s_i$$ is the estimated number of tuples satisfying $$\theta_i$$ alone
  • Conjunction $$\sigma_{\theta_1 \wedge \theta_2 \wedge \dots \wedge \theta_n} (r)$$
    • $$n_r \frac{s_1 s_2 \dots s_n}{n_r^n}$$
  • Disjunction $$\sigma_{\theta_1 \vee \theta_2 \vee \dots \vee \theta_n} (r)$$
    • $$n_r * (1 - (1 - \frac{s_1}{n_r})(1 - \frac{s_2}{n_r}) \dots (1 - \frac{s_n}{n_r}))$$
  • Negation $$\sigma_{\neg \theta} (r)$$
    • $$n_r - size(\sigma_{\theta}(r))$$
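The three formulas above, evaluated on assumed per-condition estimates (the conditions are treated as independent, as the formulas require):

```python
n_r = 10_000
s = [2_000, 500, 100]   # hypothetical s_i: tuples satisfying each theta_i

# Conjunction: n_r * (s_1/n_r) * (s_2/n_r) * ... * (s_n/n_r)
conj = n_r
for s_i in s:
    conj *= s_i / n_r

# Disjunction: n_r * (1 - prod(1 - s_i/n_r))
p_none = 1.0
for s_i in s:
    p_none *= 1 - s_i / n_r
disj = n_r * (1 - p_none)

# Negation of the first condition: n_r - size(sigma_theta(r))
neg = n_r - s[0]

print(conj, disj, neg)
```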
Join Size Estimation

For relations r and s, the Cartesian product $$r \times s$$ contains $$n_r * n_s$$ tuples.

  • If $$R \cap S = \varnothing$$
    • $$r \Join s = r \times s$$
  • If $$R \cap S$$ is a key for $$R$$
    • Each tuple of s joins with at most one tuple of r; some join-attribute values in s may be absent from r
    • $$n_{r \Join s} \le n_s$$
  • If $$R \cap S$$ is a foreign key in $$S$$ referencing $$R$$
    • Every join-attribute value in s appears in r
    • $$n_{r \Join s} = n_s$$
  • If $$R \cap S = \{A\}$$ is not a key
    • Assume every tuple in r produces tuples in $$r \Join s$$
      • $$\frac{n_r n_s}{V(A, s)} = n_r \frac{n_s}{V(A, s)}$$
      • $$\frac{n_s}{V(A, s)}$$: average number of tuples for each distinct value in s
    • Assume every tuple in s produces tuples in $$r \Join s$$
      • $$\frac{n_r n_s}{V(A, r)} = n_s \frac{n_r}{V(A, r)}$$
      • $$\frac{n_r}{V(A, r)}$$: average number of tuples for each distinct value in r
    • Take the lowest of the two estimates
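A sketch of the non-key case with assumed cardinalities and distinct-value counts:

```python
# Join size estimate when the common attribute A is not a key:
# compute both per-side estimates and take the lower one.
n_r, n_s = 5_000, 10_000
V_A_r, V_A_s = 100, 400   # assumed distinct A-values in r and in s

est_from_r = n_r * n_s / V_A_s   # every r-tuple finds matches in s
est_from_s = n_s * n_r / V_A_r   # every s-tuple finds matches in r
estimate = min(est_from_r, est_from_s)

print(est_from_r, est_from_s, estimate)  # → 125000.0 500000.0 125000.0
```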
Projection Size Estimation

$$size(\Pi_A(r)) = V(A, r)$$, since projection eliminates duplicates.

Aggregation Size Estimation

$$size({}_A g_F(r)) = V(A, r)$$: one output tuple per distinct value of the grouping attribute A.

Set Operation Size Estimation
  • Unions/intersections of selections on the same relation
    • Rewrite & use size estimate for selections
  • Operations on different relations (upper bounds)
    • $$size(r \cup s) = size(r) + size(s)$$
    • $$size(r \cap s) = min(size(r), size(s))$$
    • $$size(r - s) = size(r)$$
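These bounds are trivial to compute; the input sizes below are assumptions for illustration:

```python
# Upper-bound size estimates for set operations on different relations.
size_r, size_s = 1_000, 600

union_est = size_r + size_s          # union is at most both inputs combined
intersect_est = min(size_r, size_s)  # intersection fits in the smaller input
diff_est = size_r                    # r - s is at most all of r

print(union_est, intersect_est, diff_est)  # → 1600 600 1000
```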

Transformation of Relational Expressions

Equivalence Rule

Two expressions are equivalent if they generate the same set of tuples on every legal database instance.

  1. Conjunctive selection = sequence of individual selections
    1. $$\sigma_{\theta_1 \wedge \theta_2}(E) = \sigma_{\theta_1}(\sigma_{\theta_2}(E))$$
  2. Commutativity of selection
    1. $$\sigma_{\theta_1}(\sigma_{\theta_2}(E)) = \sigma_{\theta_2}(\sigma_{\theta_1}(E))$$
  3. Only the last in a sequence of projections is needed (assuming $$t_1 \subseteq t_2 \subseteq \dots \subseteq t_n$$)
    1. $$\pi_{t_1}(\pi_{t_2}(\dots (\pi_{t_n}(E)))) = \pi_{t_1}(E)$$
  4. Selections can be combined with Cartesian products & theta joins
    1. $$\sigma_{\theta}(E_1 \times E_2) = E_1 \Join_{\theta} E_2$$
    2. $$\sigma_{\theta_1}(E_1 \Join_{\theta_2} E_2) = E_1 \Join_{\theta_1 \wedge \theta_2} E_2$$
  5. Commutativity of theta-joins
    1. $$E_1 \Join_{\theta} E_2 = E_2 \Join_{\theta} E_1$$
  6. Associativity of natural join
    1. $$(E_1 \Join E_2) \Join E_3 = E_1 \Join (E_2 \Join E_3)$$
  7. Associativity of theta-join (where $$\theta_2$$ involves attributes from only $$E_2$$ and $$E_3$$)
    1. $$(E_1 \Join_{\theta_1} E_2) \Join_{\theta_2 \wedge \theta_3} E_3 = E_1 \Join_{\theta_1 \wedge \theta_3} (E_2 \Join_{\theta_2} E_3)$$
  8. Distribution of selection over theta join
    1. If $$\theta_0$$ involves only the attributes of $$E_1$$
      1. $$\sigma_{\theta_0}(E_1 \Join_{\theta} E_2) = (\sigma_{\theta_0}(E_1)) \Join_{\theta} E_2$$
    2. If $$\theta_1$$ involves only the attributes of $$E_1$$, $$\theta_2$$ involves only the attributes of $$E_2$$
      1. $$\sigma_{\theta_1 \wedge \theta_2}(E_1 \Join_{\theta} E_2) = (\sigma_{\theta_1}(E_1)) \Join_{\theta} (\sigma_{\theta_2}(E_2))$$
  9. Distribution of projection over theta join
    1. If $$\theta$$ involves only attributes from $$L_1 \cup L_2$$
      1. $$\pi_{L_1 \cup L_2}(E_1 \Join_{\theta} E_2) = (\pi_{L_1}(E_1)) \Join_{\theta} (\pi_{L_2}(E_2))$$
    2. Consider $$E_1 \Join_{\theta} E_2$$
      1. $$\pi_{L_1 \cup L_2}(E_1 \Join_{\theta} E_2) = \pi_{L_1 \cup L_2}((\pi_{L_1 \cup L_3}(E_1)) \Join_{\theta} (\pi_{L_2 \cup L_4}(E_2)))$$
        1. $$L_i$$: sets of attributes from $$E_i$$
        2. $$L_3$$: attributes of $$E_1$$ not in $$L_1 \cup L_2$$, but are involved in $$\theta$$
        3. $$L_4$$: attributes of $$E_2$$ not in $$L_1 \cup L_2$$, but are involved in $$\theta$$
  10. Join ordering
    1. $$(r_1 \Join r_2) \Join r_3 = r_1 \Join (r_2 \Join r_3)$$, so choose the order whose intermediate results are smallest
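As a worked instance of rule 8, take the hypothetical relations instructor(ID, name, dept_name) and teaches(ID, course_id). Since the predicate mentions only instructor attributes, the selection can be pushed below the join, shrinking the join's input:

```latex
\sigma_{dept\_name = \text{'Music'}}(instructor \bowtie teaches)
  = (\sigma_{dept\_name = \text{'Music'}}(instructor)) \bowtie teaches
```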

Evaluation Plan

  • What algorithm used for each operation
  • How the execution of operations is coordinated

Choice of Evaluation Plan

Cost-Based Optimization

$$r_1 \Join r_2 \Join \dots \Join r_n$$ has $$\frac{(2(n-1))!}{(n-1)!}$$ different join orders.
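The count grows very quickly with n, as a direct evaluation of the formula shows:

```python
from math import factorial

# Number of different join orders for r1 JOIN r2 JOIN ... JOIN rn:
# (2(n-1))! / (n-1)!
def join_orders(n):
    return factorial(2 * (n - 1)) // factorial(n - 1)

print([join_orders(n) for n in range(2, 6)])  # → [2, 12, 120, 1680]
```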

Dynamic Programming

When the best plan for a subset of relations is computed, store it for future use. With n relations there are $$2^n - 1$$ nonempty subsets, so at most $$2^n - 1$$ best plans are computed and stored.

find_best_plan(S):
    if bestplan[S].cost != inf:
        return bestplan[S]
    for non-empty subset S1 of S s.t. S1 != S:
        P1 = find_best_plan(S1)
        P2 = find_best_plan(S - S1)
        A = best algo for joining results of P1 and P2
        cost = P1.cost + P2.cost + A.cost

        if cost < bestplan[S].cost:
            bestplan[S].cost = cost
            bestplan[S].plan = "P1.plan + P2.plan + A"

    return bestplan[S]
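The pseudocode above can be made runnable with memoization. The relation sizes, the join-size estimate, and the join-cost model below are toy assumptions; a real optimizer would also choose among join algorithms and track properties such as sort order:

```python
from itertools import combinations

sizes = {"r1": 1000, "r2": 200, "r3": 50}  # hypothetical cardinalities

bestplan = {}  # frozenset of relations -> (cost, plan string, est. output size)

def find_best_plan(S):
    S = frozenset(S)
    if S in bestplan:                 # already computed: reuse it
        return bestplan[S]
    if len(S) == 1:
        (rel,) = S
        bestplan[S] = (0, rel, sizes[rel])  # base relation: no join cost
        return bestplan[S]
    best = (float("inf"), "", 0)
    rels = sorted(S)
    # Try every split of S into a non-empty proper subset S1 and S - S1.
    for k in range(1, len(rels)):
        for S1 in combinations(rels, k):
            c1, p1, sz1 = find_best_plan(S1)
            c2, p2, sz2 = find_best_plan(S - set(S1))
            out = min(sz1, sz2)          # toy join-size estimate
            cost = c1 + c2 + sz1 + sz2   # toy join-cost model
            if cost < best[0]:
                best = (cost, f"({p1} JOIN {p2})", out)
    bestplan[S] = best
    return best

cost, plan, _ = find_best_plan({"r1", "r2", "r3"})
print(cost, plan)
```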
  • Time complexity
    • Bushy tree: $$O(3^n)$$
    • Left-deep tree: $$O(n 2^n)$$
  • Space complexity
    • $$O(2^n)$$
Heuristic Optimization
  • Perform selection early
  • Perform projection early
  • Perform most restrictive selection & join operations first
  • Combine with partial cost-based optimization
Typical Steps
  1. Deconstruct conjunctive selections & move down
  2. Execute first the selections & joins that will produce the smallest relations
  3. Cartesian product + selection = join
  4. Deconstruct projections & move down (may create new projections)
  5. Execute the subtrees whose operations can be pipelined
