Traditional Queries
- Single value query
- Range query
- AVG/SUM query
- VMin/VMax query
- EMin/EMax query
- Nearest neighbor query
- Join
Probabilistic Queries
In uncertain databases, data uncertainty is treated as first-class citizen.
- Data value: closed region + pdf
- Probabilistic query: answers with probabilities reflecting confidence
- Imprecise but correct
pdf v.s. cdf
Classifications & Evaluations
- Value-based v.s. entity-based
- Dependent v.s. independent on other objects
Value-Based Independent Class
VSingleQ (Probabilistic Single Value Query)
- Input: object $$T_k$$
- Output: $${p(x) | x \text{ in } [l, u]}$$
- $$p(x)$$: pdf of $$T_k.a$$, $$p(x) = 0$$ when $$x < l$$ or $$x > u$$
- Evaluation
- Simply retrieve the uncertain information of $$T_k$$
- Scalable execution
- Locate $$T_k$$ faster with disk-based hash index
- Dynamic hashing
- Locate $$T_k$$ faster with disk-based hash index
Value-Based Dependent Class
VAvgQ (Value-Based Average Query) / VSumQ (Value-Based Sum Query) / VMinQ (Probabilistic Minimum Value Query) / VMaxQ (Probabilistic Maximum Value Query)
- Input: object collection $$T$$
- Output: $${p(x) | x \text{ in } [l, u]}$$
- $$X$$: random variable for average/sum/min/max of $$T.a$$
- Evaluation
- VSumQ
- Range
- $$[\sum^{|T|}{i=1}l_i, \sum^{|T|}{i=1}u_i]$$
- pdf
- $$p(x) = \int^{min(T1.u, x - T2.l)}_{max(T1.l, x - T2.u)} f_1(y)f_2(x-y)dy$$
- $$X = T_1.a + T_2.a$$, $$x$$ in $$[X.l, X.u]$$
- $$y$$ in $$[T_2.l, T_2.u]$$
- $$T1.l < x-y < T1.u$$, hence $$x - T_2.u < y < x - T_2.l$$
- $$p(x) = \int^{min(T1.u, x - T2.l)}_{max(T1.l, x - T2.u)} f_1(y)f_2(x-y)dy$$
- For all objects
- Pick 2 objects
- Sum their attribute values up
- Add the result to another object's attribute
- Repeat until all objects consumed
- Range
- VAvgQ
- Scale from VSumQ (divide by $$|T|$$)
- VMinQ / VMaxQ
- Adapt from EMinQ
- VSumQ
Overhead of VSumQ / VAvgQ
Integration.
- Evaluate by hand
- Simple pdf types
- Change the integration function to mathematical series
- Algebraic functions
- Numerical methods
- Arbitrary pdf
Entity-Based Independent Class
ERQ (Probabilistic Range Query)
- Input: object collection $$T$$, range $$[l, u]$$
- Output: $${(T_i, p_i)}$$
- $$p_i$$: non-zero probability s.t. $$T_i.a$$ is in $$[l, u]$$
- Evaluation
- 1D
- $$pi = \int^{min(T_i.u, u)}{max(T_i.l, l)} f_i(z)dz$$
- 1D
- 2D
- $$pi = \int{A_i} f_i(x, y) dxdy$$
- 2D
- Scalable execution
- Locate overlaps faster with interval index
- B-tree, R-tree
- Locate overlaps faster with interval index
- Efficient version: PTRQ (Probabilistic Threshold ERQ) with PTI (Probabilistic Threshold Indexing)
- Return object only if the probability exceeds some threshold $$T$$
- MBR not retrieved if:
- There exists a p-bound s.t. $$p < T$$
- Queried region is to the right of the right p-bound or left of the left p-bound (1D case)
- MBR not retrieved if:
- Advantages of PTI
- Can index any form of uncertainty pdf
- Simple implementation
- Support other probabilistic queries
- Facilitate query evaluation over multi-dimensional data
- Return object only if the probability exceeds some threshold $$T$$
ERQ with R-Tree
- Filtering
- MBR tested against the query range, retrieve candidate objects
- Refinement
- Compute probability values of candidate objects
p-Bounds
PTI
M.PT
: left-p-bound, right-p-boundT_G
: p values
Entity-Based Dependent Class
EMinQ (Probabilistic Minimum Query) / EMaxQ (Probabilistic Maximum Query)
- Input: object collection $$T$$
- Output: $${(T_i, p_i)}$$
- $$p_i$$: non-zero probability s.t. $$T_i.a$$ is the min/max in $$T$$
- Sum to 1
- $$p_i$$: non-zero probability s.t. $$T_i.a$$ is the min/max in $$T$$
- Evaluation: 3-phase
- Interval elimination
- Can adapt from nearest-neighbor search
- Interval bounding
- Cut off, sort, rename objects
- Probability computation
- $$f$$: pdf
- $$F$$: cdf
- $$Tp.a$$ is min with probability $$\sum{(li, l_j)} \int^{l_j}{li} f_p(x) * \prod{k \in S_{ij}}(1 - F_k(x))dx$$
- $$(li, l_j) \in {(l_p, l{p+1}), (l{p+1}, l{p+2}), \dots}$$
- $$S_{ij} = {k \, |\, T_k \text{can be smaller than } T_p}$$
- Interval elimination
Numerical Method
Difficult to choose the stripe width $$\Delta$$.
- Solution
- Make $$\Delta$$ adaptive to width
- Define $$\epsilon$$ as the inverse of number of stripes used, which controls precision
ENNQ (Probabilistic Nearest Neighbor Query)
- Input: object collection $$T$$, query point $$q$$
- Output: $${(T_i, p_i)}$$
- $$p_i$$: non-zero probability s.t. $$|T_i.a - q|$$ is min in $$T$$
- Sum to 1
- $$p_i$$: non-zero probability s.t. $$|T_i.a - q|$$ is min in $$T$$
- Evaluation
- Compute $$qi(r) = d_i(r) \prod{k \in S_i} (1 - D_i(r))$$: the probability s.t. at distance $$r$$, $$T_i$$ is the NN of $$q$$
- $$d_i(r)$$: distance pdf from $$T_i$$ to $$q$$
- $$D_i(r)$$: distance cdf from $$T_i$$ to $$q$$
- $$S_i = {k \, |\, T_k \text{can be closer to q than } T_i}$$
- Compute $$pi = \int^{f_i}{n_i} q_i(r)dr$$
- $$f_i$$: furthest distance from $$T_i$$ to $$q$$
- $$n_i$$: nearest distance from $$T_i$$ to $$q$$
- Compute $$qi(r) = d_i(r) \prod{k \in S_i} (1 - D_i(r))$$: the probability s.t. at distance $$r$$, $$T_i$$ is the NN of $$q$$
Join
- Input: object collection $$T$$ & $$U$$, joining condition
- Output: $${(Ti, U_j, p{i,j})}$$
- $$p_{i,j}$$: non-zero probability s.t. $$(T_i, U_j)$$ satisfies the joining condition
- Evaluation
- Equality
- Resolution $$c$$:$$a = b$$ if they are within $$c$$ of each other
- $$P(a =_c b) = \int f_a(x)(F_b(x+c) - F_b(x-c))dx$$
- Efficient version
- Multi-step join processing (if R-tree exists)
- threshold
- PTI
- Equality
Probabilistic Top-K Queries
- Possible world semantics (PWS)
- A query algorithm should behave as if the query is evaluated on each possible world
- Number of worlds is exponential!
- A query algorithm should behave as if the query is evaluated on each possible world
- Top-k query
- Depends on:
- Function $$f$$
- Existential probability
- Semantics to combine scores & probabilities
- U-Topk
- U-kRanks
- PT-k
- Global-topk
- Depends on:
U-Topk Query
Return a size-k set of tuples with max probability of being top-k across all possible worlds.