Traditional Queries

  • Single value query
  • Range query
  • AVG/SUM query
  • VMin/VMax query
  • EMin/EMax query
  • Nearest neighbor query
  • Join

Probabilistic Queries

In uncertain databases, data uncertainty is treated as first-class citizen.

  • Data value: closed region + pdf
  • Probabilistic query: answers with probabilities reflecting confidence
    • Imprecise but correct

pdf v.s. cdf

Classifications & Evaluations

  • Value-based v.s. entity-based
  • Dependent v.s. independent on other objects

Value-Based Independent Class

VSingleQ (Probabilistic Single Value Query)
  • Input: object $$T_k$$
  • Output: $${p(x) | x \text{ in } [l, u]}$$
    • $$p(x)$$: pdf of $$T_k.a$$, $$p(x) = 0$$ when $$x < l$$ or $$x > u$$
  • Evaluation
    • Simply retrieve the uncertain information of $$T_k$$
  • Scalable execution
    • Locate $$T_k$$ faster with disk-based hash index
      • Dynamic hashing

Value-Based Dependent Class

VAvgQ (Value-Based Average Query) / VSumQ (Value-Based Sum Query) / VMinQ (Probabilistic Minimum Value Query) / VMaxQ (Probabilistic Maximum Value Query)
  • Input: object collection $$T$$
  • Output: $${p(x) | x \text{ in } [l, u]}$$
    • $$X$$: random variable for average/sum/min/max of $$T.a$$
  • Evaluation
    • VSumQ
      • Range
        • $$[\sum^{|T|}{i=1}l_i, \sum^{|T|}{i=1}u_i]$$
      • pdf
        • $$p(x) = \int^{min(T1.u, x - T2.l)}_{max(T1.l, x - T2.u)} f_1(y)f_2(x-y)dy$$
          • $$X = T_1.a + T_2.a$$, $$x$$ in $$[X.l, X.u]$$
          • $$y$$ in $$[T_2.l, T_2.u]$$
          • $$T1.l < x-y < T1.u$$, hence $$x - T_2.u < y < x - T_2.l$$
      • For all objects
        • Pick 2 objects
        • Sum their attribute values up
        • Add the result to another object's attribute
        • Repeat until all objects consumed
    • VAvgQ
      • Scale from VSumQ (divide by $$|T|$$)
    • VMinQ / VMaxQ
      • Adapt from EMinQ
Overhead of VSumQ / VAvgQ

Integration.

  1. Evaluate by hand
    1. Simple pdf types
  2. Change the integration function to mathematical series
    1. Algebraic functions
  3. Numerical methods
    1. Arbitrary pdf

Entity-Based Independent Class

ERQ (Probabilistic Range Query)
  • Input: object collection $$T$$, range $$[l, u]$$
  • Output: $${(T_i, p_i)}$$
    • $$p_i$$: non-zero probability s.t. $$T_i.a$$ is in $$[l, u]$$
  • Evaluation
    • 1D
      • $$pi = \int^{min(T_i.u, u)}{max(T_i.l, l)} f_i(z)dz$$
    • 2D
      • $$pi = \int{A_i} f_i(x, y) dxdy$$
  • Scalable execution
    • Locate overlaps faster with interval index
      • B-tree, R-tree
  • Efficient version: PTRQ (Probabilistic Threshold ERQ) with PTI (Probabilistic Threshold Indexing)
    • Return object only if the probability exceeds some threshold $$T$$
      • MBR not retrieved if:
        • There exists a p-bound s.t. $$p < T$$
        • Queried region is to the right of the right p-bound or left of the left p-bound (1D case)
    • Advantages of PTI
      • Can index any form of uncertainty pdf
      • Simple implementation
      • Support other probabilistic queries
      • Facilitate query evaluation over multi-dimensional data
ERQ with R-Tree
  1. Filtering
    1. MBR tested against the query range, retrieve candidate objects
  2. Refinement
    1. Compute probability values of candidate objects
p-Bounds

PTI
  • M.PT: left-p-bound, right-p-bound
  • T_G: p values

Entity-Based Dependent Class

EMinQ (Probabilistic Minimum Query) / EMaxQ (Probabilistic Maximum Query)
  • Input: object collection $$T$$
  • Output: $${(T_i, p_i)}$$
    • $$p_i$$: non-zero probability s.t. $$T_i.a$$ is the min/max in $$T$$
      • Sum to 1
  • Evaluation: 3-phase
    • Interval elimination
      • Can adapt from nearest-neighbor search
    • Interval bounding
      • Cut off, sort, rename objects
    • Probability computation
        • $$f$$: pdf
        • $$F$$: cdf
      • $$Tp.a$$ is min with probability $$\sum{(li, l_j)} \int^{l_j}{li} f_p(x) * \prod{k \in S_{ij}}(1 - F_k(x))dx$$
        • $$(li, l_j) \in {(l_p, l{p+1}), (l{p+1}, l{p+2}), \dots}$$
        • $$S_{ij} = {k \, |\, T_k \text{can be smaller than } T_p}$$
Numerical Method

Difficult to choose the stripe width $$\Delta$$.

  • Solution
    • Make $$\Delta$$ adaptive to width
      • Define $$\epsilon$$ as the inverse of number of stripes used, which controls precision
ENNQ (Probabilistic Nearest Neighbor Query)
  • Input: object collection $$T$$, query point $$q$$
  • Output: $${(T_i, p_i)}$$
    • $$p_i$$: non-zero probability s.t. $$|T_i.a - q|$$ is min in $$T$$
      • Sum to 1
  • Evaluation
    • Compute $$qi(r) = d_i(r) \prod{k \in S_i} (1 - D_i(r))$$: the probability s.t. at distance $$r$$, $$T_i$$ is the NN of $$q$$
      • $$d_i(r)$$: distance pdf from $$T_i$$ to $$q$$
      • $$D_i(r)$$: distance cdf from $$T_i$$ to $$q$$
      • $$S_i = {k \, |\, T_k \text{can be closer to q than } T_i}$$
    • Compute $$pi = \int^{f_i}{n_i} q_i(r)dr$$
      • $$f_i$$: furthest distance from $$T_i$$ to $$q$$
      • $$n_i$$: nearest distance from $$T_i$$ to $$q$$

Join
  • Input: object collection $$T$$ & $$U$$, joining condition
  • Output: $${(Ti, U_j, p{i,j})}$$
    • $$p_{i,j}$$: non-zero probability s.t. $$(T_i, U_j)$$ satisfies the joining condition
  • Evaluation
    • Equality
      • Resolution $$c$$:$$a = b$$ if they are within $$c$$ of each other
      • $$P(a =_c b) = \int f_a(x)(F_b(x+c) - F_b(x-c))dx$$
    • Efficient version
      • Multi-step join processing (if R-tree exists)
        • threshold
        • PTI

Probabilistic Top-K Queries

  • Possible world semantics (PWS)
    • A query algorithm should behave as if the query is evaluated on each possible world
      • Number of worlds is exponential!
  • Top-k query
    • Depends on:
      • Function $$f$$
      • Existential probability
    • Semantics to combine scores & probabilities
      • U-Topk
      • U-kRanks
      • PT-k
      • Global-topk

U-Topk Query

Return a size-k set of tuples with max probability of being top-k across all possible worlds.

ORION

Queries

Architecture

results matching ""

    No results matching ""