Background

Multidimensional Data

Flat relational table
Multimedia feature vectors
Data warehouse data
Spatial data
Text documents
...

Attribute Types

Ordinal
Nominal categorical
Binary

Basic Queries

(Range) selection query
Distance (similarity) query
Nearest neighbor (similarity) query

Top-K Search Methods

Returns k objects with the highest combined score based on aggregate function f. Attribute value ranges usually normalized.

Probabilistic/approximate top-k retrieval
Random/sorted access at ranked inputs
Ordered/indexed inputs
Centralized/distributed inputs

Top-K Query

SELECT name
FROM Employee
ORDER BY 0.5 * salary + 0.5 * age
LIMIT 10

Top-K Joins

SELECT h.id, s.id
FROM House h, School s
WHERE h.location = s.location
ORDER BY h.price + 10 * s.tuition
LIMIT 5

Variants

Apply on single table || ranked list of tuples ordered by individual attributes
- Ranked inputs may be in different servers

Evaluations

Aggregation functions assumed to be:
- Distributive
  - f(x, y, z, w) = f(f(x, y), f(z, w))
- Monotone
  - If x < y and z < w then f(x, z) < f(y, w)

Rank Aggregation

Based on 1-D ordering & merging sorted lists.

Benefits
- Can be applied on any subset of inputs (arbitrary subspaces)
- Appropriate for distributed data
- Appropriate for top-k joins
- Simple
Drawbacks
- Slower than index-based methods
- Require sorted inputs

Threshold Algorithm (TA)

Iteratively retrieve objects & their atomic scores from the ranked inputs in a round-robin fashion
For each encountered object o, random access to inputs where o has not been seen
Maintain top-k objects
Define T = f(l1, ..., lm) on last atomic scores seen; terminate if score of the k-th object >= T

Properties

Standard module for merging ranked lists
Finds results quickly
Random access may be expensive || impossible

Variant: Quick-Combine

Access source with largest decrease rate in score s.t. T is reduced faster.

No Random Access (NRA)

Iteratively retrieve objects & their atomic scores from the ranked inputs in a round-robin fashion
For each object o seen so far, maintain
1. fxub: upper bound for aggregate score fx
2. fxlb: lower bound for aggregate score f
Maintain top-k objects Wk with the largest flb
Terminate if the smallest flb in top-k >= largest fub of any object not in top-k

Properties

More generic
- Not dependent on random accesses
Can be cheaper if random accesses expensive
Update upper bounds for all objects seen so far unconditionally
- No input pruned!
- O(n) per access

Lattice-Based Rank Aggregation (LARA)

t: k-th highest score in Wk

T = f(l1, ..., lm): score derived from last atomic scores

Growing phase
1. While t < T
  1. Iteratively retrieve objects & their atomic scores from the ranked inputs in a round-robin fashion
  2. Update lower bounds for each newly accessed object
  3. Update top-k objects Wk, compute t
    1. O(log k) for heap implementations
  4. Update T from the last atomic scores seen at each input
    1. O(1)
Shrinking phase
1. While t >= T, objects not seen so far are pruned, objects seen so far are candidates for top-k
  1. Distribute candidates using a lattice

Observations

While t < T, any objects never seen so far can end up in the top-k
While t < T, any objects seen so far can end up in the top-k
If t >= T, no object never seen so far can end up in the top-k

Hence,

While t < T, should not apply expensive updates & comparisons using upper bounds
Divide into growing phase & shrinking phase

Cost Analysis

Growing phase
- O(log k) per access
Shrinking phase
- O(2^m + log k) per access

Index-Based Methods

Based on multidimensional indexing.

Benefits
- More efficient than rank aggregation methods
- Apply directly on data in record format
- Can make use of general-purpose multi-dimensional indexes
Drawbacks
- May not be applicable to top-k queries in arbitrary subspaces
- Hard to apply on distributed data
- Cannot be used on top of other query operations
- Some methods have high preprocessing & storage requirements

Multi-Dimensional Index Search

Aggregate function is a geometric shape sweeping the space from origin
First k points hit are the top-k
Branch-and-bound search applicable
- Bound: low-most corners
Benefits
- I/O optimal given R-tree indexes
- Incremental
Drawbacks
- Relies on existence of index for the set of attributes used in f

Best-First Search

Objective: k hotels with min(f(hi))
Heap Q: prioritized based on {min(f(p)): p in M}

Top-K Queries

Background

Multidimensional Data

Attribute Types

Basic Queries

Top-K Search Methods

Top-K Query

Top-K Joins

Variants

Evaluations

Rank Aggregation

Threshold Algorithm (TA)

Properties

Variant: Quick-Combine

No Random Access (NRA)

Properties

Lattice-Based Rank Aggregation (LARA)

Observations

Cost Analysis

Index-Based Methods

Multi-Dimensional Index Search

Best-First Search

results matching ""

No results matching ""