Background

Multidimensional Data

  • Flat relational table
  • Multimedia feature vectors
  • Data warehouse data
  • Spatial data
  • Text documents
  • ...

Attribute Types

  • Ordinal
  • Nominal categorical
  • Binary

Basic Queries

  • (Range) selection query
  • Distance (similarity) query
  • Nearest neighbor (similarity) query

Top-K Search Methods

Returns k objects with the highest combined score based on aggregate function f. Attribute value ranges usually normalized.

  • Probabilistic/approximate top-k retrieval
  • Random/sorted access at ranked inputs
  • Ordered/indexed inputs
  • Centralized/distributed inputs

Top-K Query

SELECT name
FROM Employee
ORDER BY 0.5 * salary + 0.5 * age
LIMIT 10

Top-K Joins

SELECT h.id, s.id
FROM House h, School s
WHERE h.location = s.location
ORDER BY h.price + 10 * s.tuition
LIMIT 5

Variants

  • Apply on single table || ranked list of tuples ordered by individual attributes
    • Ranked inputs may be in different servers

Evaluations

  • Aggregation functions assumed to be:
    • Distributive
      • f(x, y, z, w) = f(f(x, y), f(z, w))
    • Monotone
      • If x < y and z < w then f(x, z) < f(y, w)

Rank Aggregation

Based on 1-D ordering & merging sorted lists.

  • Benefits
    • Can be applied on any subset of inputs (arbitrary subspaces)
    • Appropriate for distributed data
    • Appropriate for top-k joins
    • Simple
  • Drawbacks
    • Slower than index-based methods
    • Require sorted inputs

Threshold Algorithm (TA)

  1. Iteratively retrieve objects & their atomic scores from the ranked inputs in a round-robin fashion
  2. For each encountered object o, random access to inputs where o has not been seen
  3. Maintain top-k objects
  4. Define T = f(l1, ..., lm) on last atomic scores seen; terminate if score of the k-th object >= T
Properties
  • Standard module for merging ranked lists
  • Finds results quickly
  • Random access may be expensive || impossible

Variant: Quick-Combine

Access source with largest decrease rate in score s.t. T is reduced faster.

No Random Access (NRA)

  1. Iteratively retrieve objects & their atomic scores from the ranked inputs in a round-robin fashion
  2. For each object o seen so far, maintain
    1. fxub: upper bound for aggregate score fx
    2. fxlb: lower bound for aggregate score f
  3. Maintain top-k objects Wk with the largest flb
  4. Terminate if the smallest flb in top-k >= largest fub of any object not in top-k
Properties
  • More generic
    • Not dependent on random accesses
  • Can be cheaper if random accesses expensive
  • Update upper bounds for all objects seen so far unconditionally
    • No input pruned!
    • O(n) per access

Lattice-Based Rank Aggregation (LARA)

t: k-th highest score in Wk

T = f(l1, ..., lm): score derived from last atomic scores

  1. Growing phase
    1. While t < T
      1. Iteratively retrieve objects & their atomic scores from the ranked inputs in a round-robin fashion
      2. Update lower bounds for each newly accessed object
      3. Update top-k objects Wk, compute t
        1. O(log k) for heap implementations
      4. Update T from the last atomic scores seen at each input
        1. O(1)
  2. Shrinking phase
    1. While t >= T, objects not seen so far are pruned, objects seen so far are candidates for top-k
      1. Distribute candidates using a lattice

Observations

  1. While t < T, any objects never seen so far can end up in the top-k
  2. While t < T, any objects seen so far can end up in the top-k
  3. If t >= T, no object never seen so far can end up in the top-k

Hence,

  1. While t < T, should not apply expensive updates & comparisons using upper bounds
  2. Divide into growing phase & shrinking phase

Cost Analysis

  • Growing phase
    • O(log k) per access
  • Shrinking phase
    • O(2^m + log k) per access

Index-Based Methods

Based on multidimensional indexing.

  • Benefits
    • More efficient than rank aggregation methods
    • Apply directly on data in record format
    • Can make use of general-purpose multi-dimensional indexes
  • Drawbacks
    • May not be applicable to top-k queries in arbitrary subspaces
    • Hard to apply on distributed data
    • Cannot be used on top of other query operations
    • Some methods have high preprocessing & storage requirements
  • Aggregate function is a geometric shape sweeping the space from origin
  • First k points hit are the top-k
  • Branch-and-bound search applicable

    • Bound: low-most corners
  • Benefits

    • I/O optimal given R-tree indexes
    • Incremental
  • Drawbacks
    • Relies on existence of index for the set of attributes used in f
  • Objective: k hotels with min(f(hi))
  • Heap Q: prioritized based on {min(f(p)): p in M}

results matching ""

    No results matching ""