Traditional Queries

Single value query
Range query
AVG/SUM query
VMin/VMax query
EMin/EMax query
Nearest neighbor query
Join

Probabilistic Queries

In uncertain databases, data uncertainty is treated as first-class citizen.

Data value: closed region + pdf
Probabilistic query: answers with probabilities reflecting confidence
- Imprecise but correct

pdf v.s. cdf

Classifications & Evaluations

Value-based v.s. entity-based
Dependent v.s. independent on other objects

Value-Based Independent Class

VSingleQ (Probabilistic Single Value Query)

Input: object $$T_k$$
Output: $${p(x) | x \text{ in } [l, u]}$$
- $$p(x)$$: pdf of $$T_k.a$$, $$p(x) = 0$$ when $$x < l$$ or $$x > u$$
Evaluation
- Simply retrieve the uncertain information of $$T_k$$
Scalable execution
- Locate $$T_k$$ faster with disk-based hash index
  - Dynamic hashing

Value-Based Dependent Class

VAvgQ (Value-Based Average Query) / VSumQ (Value-Based Sum Query) / VMinQ (Probabilistic Minimum Value Query) / VMaxQ (Probabilistic Maximum Value Query)

Input: object collection $$T$$
Output: $${p(x) | x \text{ in } [l, u]}$$
- $$X$$: random variable for average/sum/min/max of $$T.a$$
Evaluation
- VSumQ
  - Range
    - $$[\sum^{|T|}{i=1}l_i, \sum^{|T|}{i=1}u_i]$$
  - pdf
    - $$p(x) = \int^{min(T1.u, x - T2.l)}_{max(T1.l, x - T2.u)} f_1(y)f_2(x-y)dy$$
      - $$X = T_1.a + T_2.a$$, $$x$$ in $$[X.l, X.u]$$
      - $$y$$ in $$[T_2.l, T_2.u]$$
      - $$T1.l < x-y < T1.u$$, hence $$x - T_2.u < y < x - T_2.l$$
  - For all objects
    - Pick 2 objects
    - Sum their attribute values up
    - Add the result to another object's attribute
    - Repeat until all objects consumed
- VAvgQ
  - Scale from VSumQ (divide by $$|T|$$)
- VMinQ / VMaxQ
  - Adapt from EMinQ

Overhead of VSumQ / VAvgQ

Integration.

Evaluate by hand

Simple pdf types

Change the integration function to mathematical series

Algebraic functions

Numerical methods

Arbitrary pdf

Entity-Based Independent Class

ERQ (Probabilistic Range Query)

Input: object collection $$T$$, range $$[l, u]$$
Output: $${(T_i, p_i)}$$
- $$p_i$$: non-zero probability s.t. $$T_i.a$$ is in $$[l, u]$$
Evaluation
- 1D
  - $$pi = \int^{min(T_i.u, u)}{max(T_i.l, l)} f_i(z)dz$$
- 2D
  - $$pi = \int{A_i} f_i(x, y) dxdy$$
Scalable execution
- Locate overlaps faster with interval index
  - B-tree, R-tree
Efficient version: PTRQ (Probabilistic Threshold ERQ) with PTI (Probabilistic Threshold Indexing)
- Return object only if the probability exceeds some threshold $$T$$
  - MBR not retrieved if:
    - There exists a p-bound s.t. $$p < T$$
    - Queried region is to the right of the right p-bound or left of the left p-bound (1D case)
- Advantages of PTI
  - Can index any form of uncertainty pdf
  - Simple implementation
  - Support other probabilistic queries
  - Facilitate query evaluation over multi-dimensional data

ERQ with R-Tree

Filtering

MBR tested against the query range, retrieve candidate objects

Refinement

Compute probability values of candidate objects

p-Bounds

PTI

M.PT: left-p-bound, right-p-bound

T_G: p values

Entity-Based Dependent Class

EMinQ (Probabilistic Minimum Query) / EMaxQ (Probabilistic Maximum Query)

Input: object collection $$T$$
Output: $${(T_i, p_i)}$$
- $$p_i$$: non-zero probability s.t. $$T_i.a$$ is the min/max in $$T$$
  - Sum to 1
Evaluation: 3-phase
- Interval elimination
  - Can adapt from nearest-neighbor search
- Interval bounding
  - Cut off, sort, rename objects
- Probability computation
  - - $$f$$: pdf
    - $$F$$: cdf
  - $$Tp.a$$ is min with probability $$\sum{(li, l_j)} \int^{l_j}{li} f_p(x) * \prod{k \in S_{ij}}(1 - F_k(x))dx$$
    - $$(li, l_j) \in {(l_p, l{p+1}), (l{p+1}, l{p+2}), \dots}$$
    - $$S_{ij} = {k \, |\, T_k \text{can be smaller than } T_p}$$

Numerical Method

Difficult to choose the stripe width $$\Delta$$.

Solution

Make $$\Delta$$ adaptive to width

Define $$\epsilon$$ as the inverse of number of stripes used, which controls precision

ENNQ (Probabilistic Nearest Neighbor Query)

Input: object collection $$T$$, query point $$q$$
Output: $${(T_i, p_i)}$$
- $$p_i$$: non-zero probability s.t. $$|T_i.a - q|$$ is min in $$T$$
  - Sum to 1
Evaluation
- Compute $$qi(r) = d_i(r) \prod{k \in S_i} (1 - D_i(r))$$: the probability s.t. at distance $$r$$, $$T_i$$ is the NN of $$q$$
  - $$d_i(r)$$: distance pdf from $$T_i$$ to $$q$$
  - $$D_i(r)$$: distance cdf from $$T_i$$ to $$q$$
  - $$S_i = {k \, |\, T_k \text{can be closer to q than } T_i}$$
- Compute $$pi = \int^{f_i}{n_i} q_i(r)dr$$
  - $$f_i$$: furthest distance from $$T_i$$ to $$q$$
  - $$n_i$$: nearest distance from $$T_i$$ to $$q$$

Join

Input: object collection $$T$$ & $$U$$, joining condition
Output: $${(Ti, U_j, p{i,j})}$$
- $$p_{i,j}$$: non-zero probability s.t. $$(T_i, U_j)$$ satisfies the joining condition
Evaluation
- Equality
  - Resolution $$c$$:$$a = b$$ if they are within $$c$$ of each other
  - $$P(a =_c b) = \int f_a(x)(F_b(x+c) - F_b(x-c))dx$$
- Efficient version
  - Multi-step join processing (if R-tree exists)
  - - threshold
  - - PTI

Probabilistic Top-K Queries

Possible world semantics (PWS)
- A query algorithm should behave as if the query is evaluated on each possible world
  - Number of worlds is exponential!
Top-k query
- Depends on:
  - Function $$f$$
  - Existential probability
- Semantics to combine scores & probabilities
  - U-Topk
  - U-kRanks
  - PT-k
  - Global-topk

U-Topk Query

Return a size-k set of tuples with max probability of being top-k across all possible worlds.

Probabilistic Queries

Traditional Queries

Probabilistic Queries

pdf v.s. cdf

Classifications & Evaluations

Value-Based Independent Class

VSingleQ (Probabilistic Single Value Query)

Value-Based Dependent Class

VAvgQ (Value-Based Average Query) / VSumQ (Value-Based Sum Query) / VMinQ (Probabilistic Minimum Value Query) / VMaxQ (Probabilistic Maximum Value Query)

Overhead of VSumQ / VAvgQ

Entity-Based Independent Class

ERQ (Probabilistic Range Query)

ERQ with R-Tree

p-Bounds

PTI

Entity-Based Dependent Class

EMinQ (Probabilistic Minimum Query) / EMaxQ (Probabilistic Maximum Query)

Numerical Method

ENNQ (Probabilistic Nearest Neighbor Query)

Join

Probabilistic Top-K Queries

U-Topk Query

ORION

Queries

Architecture

results matching ""

No results matching ""