Query Optimization
- Cost of operations depends on
- Statistical information about relations
- Statistical information about intermediate results
Cost Based Optimization
- Generate logically equivalent expressions
- Use equivalence rules
- Annotate resultant expressions to get alternative query plans
- Choose the cheapest plan based on estimated cost
- Estimate cost of operations & techniques
- Sizes of input
- Statistics about attribute value distribution
- Estimate output size or operations
- Statistics about attribute value distribution
- Estimate cost of operations & techniques
Basic Statistics for Cost Estimation
- $$n_r$$: number of tuples in relation
r
- $$b_r$$: number of blocks in relation
r
- $$b_r = ceil(\frac{n_r}{f_r})$$ if tuples of
r
are stored together physically in a file
- $$b_r = ceil(\frac{n_r}{f_r})$$ if tuples of
- $$s_r$$: size of a tuple of
r
- $$f_r$$: blocking factor of
r
(number of tuples in a block) - $$V(A, r)$$: number of distinct values in
r
for attributeA
- $$SC(A, r)$$: selection cardinality of attribute
A
inr
(avg. number of records that satisfy equality onA
)
Catalog Information about Indices
- $$f_i$$: avg. fan-out of internal nodes of index
i
- $$HT_i$$: height of index
i
- $$HTi = ceil(log{f_i} (V(A, r)))$$ for a balanced tree
- $$HT_i = 1$$ for a hash index
- $$LB_i$$: number of blocks at the leaf level
Measure of Query Cost
The number of block transfers from disk.
Selection Size Estimation
- Equality selection $$\sigma_{A=v}(r)$$
- $$SC(A, r)$$
- $$ceil(\frac{SC(A, r)}{f_r})$$: number of blocks these records will occupy
1
ifA
is candidate key
- Comparison selection $$\sigma_{A \le V}(r)$$
- Let
C
be the estimated number of tuples satisfying the condition- $$C = n_r * \frac{v - min(A, r)}{max(A, r) - min(A, r)}$$
- Histograms can be used to improve accuracy
- Let
Implementation of Complex Selections
- Selectivity of condition $$\theta_i$$: probability that a tuple in relation
r
satisfies $$\theta_i$$- $$\frac{s_i}{n_r}$$
- Conjunction $$\sigma_{\theta_1 \wedge \theta_2 \wedge \dots \wedge \theta_n} (r)$$
- $$n_r \frac{s_1 s_2 \dots s_n}{n_r^n}$$
- Disjunction $$\sigma_{\theta_1 \vee \theta_2 \vee \dots \vee \theta_n} (r)$$
- $$n_r * (1 - (1 - \frac{s_1}{n_r})(1 - \frac{s_2}{n_r}) \dots (1 - \frac{s_n}{n_r}))$$
- Negation $$\sigma_{\neg \theta} (r)$$
- $$nr - size(\sigma{\theta}(r))$$
Join Size Estimation
Relation r
and s
, Cartesian product $$r \times s$$ contains $$n_r * n_s$$ tuples.
- If $$R \cap S = \varnothing$$
- $$r \Join s = r \times s$$
- If $$R \cap S$$ is a key for $$R$$
- Keys in $$S$$ may be absent in $$R$$
- $$n_{r \Join s} \le n_s$$
- If $$R \cap S$$ in $$S$$ is a foreign key referencing $$R$$
- Keys in $$S$$ must be in $$R$$
- $$n_{r \Join s} = n_s$$
- If $$R \cap S = {A}$$ is not a key
- Assume every tuple in
r
produces tuples in $$r \Join s$$- $$\frac{n_r n_s}{V(A, s)} = n_r \frac{n_s}{V(A, s)}$$
- $$\frac{n_s}{V(A, s)}$$: average number of tuples for each distinct value in
s
- Assume every tuple in
s
produces tuples in $$r \Join s$$- $$\frac{n_r n_s}{V(A, r)} = n_s \frac{n_r}{V(A, r)}$$
- $$\frac{n_r}{V(A, r)}$$: average number of tuples for each distinct value in
r
- Take the lowest of the two estimates
- Assume every tuple in
Projection Size Estimation
$$\Pi_A(r) = V(A, r)$$
Aggregation Size Estimation
$$_A g_F (r) = V(A, r)$$
Set Operation Size Estimation
- Unions/intersections of selections on the same relation
- Rewrite & use size estimate for selections
- Operations on different relations
- $$r \cup s = size(r) + size(s)$$
- $$r \cap s = min(size(r), size(s))$$
- $$r - s = size(r)$$
Transformation of Relational Expressions
Equivalence Rule
Expressions of two forms are equivalent.
- Conjunctive selection = sequence of individual selections
- $$\sigma{\theta_1 \wedge \theta_2} (E) = \sigma{\theta1}(\sigma{\theta_2} (E))$$
- Commutativity of selection
- $$\sigma{\theta_1}(\sigma{\theta2} (E)) = \sigma{\theta2}(\sigma{\theta_1} (E))$$
- Only the last in a sequence of projection is needed
- $$\pi{t_1}(\pi{t2}(\dots (\pi{tn}(E)))) = \pi{t_1} (E)$$
- Selections can be combined with Cartesian products & theta joins
- $$\sigma{\theta}(E_1 \times E_2) = E_1 \Join{\theta} E_2$$
- $$\sigma{\theta_1}(E_1 \Join{\theta2} E_2) = E_1 \Join{\theta_1 \wedge \theta_2} E2$$
- Commutativity of theta-joins
- $$E1 \Join{\theta} E2 = E_2 \Join{\theta} E_1$$
- Associativity of natural join
- $$(E_1 \Join E_2) \Join E_3 = E_1 \Join (E_2 \Join E_3)$$
- Associativity of theta-join
- $$(E1 \Join{\theta1} E_2) \Join{\theta2 \wedge \theta_3} E_3 = E_1 \Join{\theta1 \wedge \theta_3} (E_2 \Join{\theta_2} E_3)$$
- Distribution of selection over theta join
- If $$\theta_0$$ involves only the attributes of $$E_1$$
- $$\sigma{\theta_0} (E_1 \Join{\theta} E2) = (\sigma{\theta0} (E_1)) \Join{\theta} E_2$$
- If $$\theta_1$$ involves only the attributes of $$E_1$$, $$\theta_2$$ involves only the attributes of $$E_2$$
- $$\sigma{\theta_1 \wedge \theta_2} (E_1 \Join{\theta} E2) = (\sigma{\theta1} (E_1)) \Join{\theta} (\sigma_{\theta_2} (E_2))$$
- If $$\theta_0$$ involves only the attributes of $$E_1$$
- Distribution of projection over theta join
- If $$\theta$$ involves only attributes from $$L_1 \cup L_2$$
- $$\pi{L_1 \cup L_2} (E_1 \Join{\theta} E2) = (\pi{L1} (E_1)) \Join{\theta} (\pi_{L_2} (E_2))$$
- Consider $$E1 \Join{\theta} E_2$$
- $$\pi{L_1 \cup L_2} (E_1 \Join{\theta} E2) = \pi{L1 \cup L_2} ((\pi{L1 \cup L_3} (E_1)) \Join{\theta} (\pi_{L_2 \cup L_4} (E_2)))$$
- $$L_i$$: sets of attributes from $$E_i$$
- $$L_3$$: attributes of $$E_1$$ not in $$L_1 \cup L_2$$, but are involved in $$\theta$$
- $$L_4$$: attributes of $$E_2$$ not in $$L_1 \cup L_2$$, but are involved in $$\theta$$
- $$\pi{L_1 \cup L_2} (E_1 \Join{\theta} E2) = \pi{L1 \cup L_2} ((\pi{L1 \cup L_3} (E_1)) \Join{\theta} (\pi_{L_2 \cup L_4} (E_2)))$$
- If $$\theta$$ involves only attributes from $$L_1 \cup L_2$$
- Join Ordering
- $$(r_1 \Join r_2) \Join r_3 = r_1 \Join (r_2 \Join r_3)$$
Evaluation Plan
- What algorithm used for each operation
- How the execution of operations is coordinated
Choice of Evaluation Plan
Cost-Based Optimization
$$r_1 \Join r_2 \Join \dots \Join r_n$$ has $$\frac{(2(n-1))!}{(n-1)!}$$ different join orders. Ref
Dynamic Programming
When computing the cost for join order, store for future use. Total $$2^n - 1$$ possible costs to choose from.
find_best_plan(S):
if bestplan[S].cost != inf):
return bestplan[S]
for non-empty subset S1 of S s.t. S1 != S:
P1 = find_best_plan(S1)
P2 = find_best_plan(S - S1)
A = best algo for joining results of P1 and P2
cost = P1.cost + P2.cost + A.cost
if cost < bestplan[S].cost:
bestplan[S].cost = cost
bestplan[S].plan = "P1.plan + P2.plan + A"
return bestplan[S]
- Time complexity
- Bushy tree: $$O(3^n)$$
- Left-deep tree: $$O(n 2^n)$$
- Space complexity
- $$O(2^n)$$
Heuristic Optimization
- Perform selection early
- Perform projection early
- Perform most restrictive selection & join operations first
- Combine with partial cost-based optimization
Typical Steps
- Deconstruct conjunctive selections & move down
- Execute first the selections & joins that will produce the smallest relations
- Cartesian product + selection = join
- Deconstruct projections & move down (may create new projections)
- Execute the subtrees whose operations can be pipelined