Background
Multidimensional Data
- Flat relational table
- Multimedia feature vectors
- Data warehouse data
- Spatial data
- Text documents
- ...
Attribute Types
- Ordinal
- Nominal categorical
- Binary
Basic Queries
- (Range) selection query
- Distance (similarity) query
- Nearest neighbor (similarity) query
Top-K Search Methods
Returns k objects with the highest combined score based on aggregate function f. Attribute value ranges usually normalized.
- Probabilistic/approximate top-k retrieval
- Random/sorted access at ranked inputs
- Ordered/indexed inputs
- Centralized/distributed inputs
Top-K Query
SELECT name
FROM Employee
ORDER BY 0.5 * salary + 0.5 * age
LIMIT 10
Top-K Joins
SELECT h.id, s.id
FROM House h, School s
WHERE h.location = s.location
ORDER BY h.price + 10 * s.tuition
LIMIT 5
Variants
- Apply on single table || ranked list of tuples ordered by individual attributes
- Ranked inputs may be in different servers

Evaluations
- Aggregation functions assumed to be:
- Distributive
f(x, y, z, w) = f(f(x, y), f(z, w))
- Monotone
- If
x < yandz < wthenf(x, z) < f(y, w)
- If
- Distributive
Rank Aggregation
Based on 1-D ordering & merging sorted lists.
- Benefits
- Can be applied on any subset of inputs (arbitrary subspaces)
- Appropriate for distributed data
- Appropriate for top-k joins
- Simple
- Drawbacks
- Slower than index-based methods
- Require sorted inputs
Threshold Algorithm (TA)
- Iteratively retrieve objects & their atomic scores from the ranked inputs in a round-robin fashion
- For each encountered object
o, random access to inputs whereohas not been seen - Maintain top-k objects
- Define
T = f(l1, ..., lm)on last atomic scores seen; terminate if score of thek-th object >=T
Properties
- Standard module for merging ranked lists
- Finds results quickly
- Random access may be expensive || impossible
Variant: Quick-Combine
Access source with largest decrease rate in score s.t. T is reduced faster.
No Random Access (NRA)
- Iteratively retrieve objects & their atomic scores from the ranked inputs in a round-robin fashion
- For each object
oseen so far, maintainfxub: upper bound for aggregate scorefxfxlb: lower bound for aggregate scoref
- Maintain top-k objects
Wkwith the largestflb - Terminate if the smallest
flbin top-k >= largestfubof any object not in top-k
Properties
- More generic
- Not dependent on random accesses
- Can be cheaper if random accesses expensive
- Update upper bounds for all objects seen so far unconditionally
- No input pruned!
O(n)per access
Lattice-Based Rank Aggregation (LARA)
t: k-th highest score in Wk
T = f(l1, ..., lm): score derived from last atomic scores
- Growing phase
- While
t<T- Iteratively retrieve objects & their atomic scores from the ranked inputs in a round-robin fashion
- Update lower bounds for each newly accessed object
- Update top-k objects
Wk, computetO(log k)for heap implementations
- Update
Tfrom the last atomic scores seen at each inputO(1)
- While
- Shrinking phase
- While
t>=T, objects not seen so far are pruned, objects seen so far are candidates for top-k- Distribute candidates using a lattice
- While

Observations
- While
t<T, any objects never seen so far can end up in the top-k - While
t<T, any objects seen so far can end up in the top-k - If
t>=T, no object never seen so far can end up in the top-k
Hence,
- While
t<T, should not apply expensive updates & comparisons using upper bounds - Divide into growing phase & shrinking phase
Cost Analysis
- Growing phase
O(log k)per access
- Shrinking phase
O(2^m + log k)per access
Index-Based Methods
Based on multidimensional indexing.
- Benefits
- More efficient than rank aggregation methods
- Apply directly on data in record format
- Can make use of general-purpose multi-dimensional indexes
- Drawbacks
- May not be applicable to top-k queries in arbitrary subspaces
- Hard to apply on distributed data
- Cannot be used on top of other query operations
- Some methods have high preprocessing & storage requirements
Multi-Dimensional Index Search
- Aggregate function is a geometric shape sweeping the space from origin
- First
kpoints hit are the top-k Branch-and-bound search applicable
- Bound: low-most corners
Benefits
- I/O optimal given R-tree indexes
- Incremental
- Drawbacks
- Relies on existence of index for the set of attributes used in
f
- Relies on existence of index for the set of attributes used in
Best-First Search
- Objective:
khotels withmin(f(hi)) - Heap
Q: prioritized based on{min(f(p)): p in M}
