Background
Multidimensional Data
- Flat relational table
- Multimedia feature vectors
- Data warehouse data
- Spatial data
- Text documents
- ...
Attribute Types
- Ordinal
- Nominal categorical
- Binary
Basic Queries
- (Range) selection query
- Distance (similarity) query
- Nearest neighbor (similarity) query
Top-K Search Methods
Returns k
objects with the highest combined score based on aggregate function f
. Attribute value ranges usually normalized.
- Probabilistic/approximate top-k retrieval
- Random/sorted access at ranked inputs
- Ordered/indexed inputs
- Centralized/distributed inputs
Top-K Query
SELECT name
FROM Employee
ORDER BY 0.5 * salary + 0.5 * age
LIMIT 10
Top-K Joins
SELECT h.id, s.id
FROM House h, School s
WHERE h.location = s.location
ORDER BY h.price + 10 * s.tuition
LIMIT 5
Variants
- Apply on single table || ranked list of tuples ordered by individual attributes
- Ranked inputs may be in different servers
Evaluations
- Aggregation functions assumed to be:
- Distributive
f(x, y, z, w) = f(f(x, y), f(z, w))
- Monotone
- If
x < y
andz < w
thenf(x, z) < f(y, w)
- If
- Distributive
Rank Aggregation
Based on 1-D ordering & merging sorted lists.
- Benefits
- Can be applied on any subset of inputs (arbitrary subspaces)
- Appropriate for distributed data
- Appropriate for top-k joins
- Simple
- Drawbacks
- Slower than index-based methods
- Require sorted inputs
Threshold Algorithm (TA)
- Iteratively retrieve objects & their atomic scores from the ranked inputs in a round-robin fashion
- For each encountered object
o
, random access to inputs whereo
has not been seen - Maintain top-k objects
- Define
T = f(l1, ..., lm)
on last atomic scores seen; terminate if score of thek
-th object >=T
Properties
- Standard module for merging ranked lists
- Finds results quickly
- Random access may be expensive || impossible
Variant: Quick-Combine
Access source with largest decrease rate in score s.t. T
is reduced faster.
No Random Access (NRA)
- Iteratively retrieve objects & their atomic scores from the ranked inputs in a round-robin fashion
- For each object
o
seen so far, maintainfxub
: upper bound for aggregate scorefx
fxlb
: lower bound for aggregate scoref
- Maintain top-k objects
Wk
with the largestflb
- Terminate if the smallest
flb
in top-k >= largestfub
of any object not in top-k
Properties
- More generic
- Not dependent on random accesses
- Can be cheaper if random accesses expensive
- Update upper bounds for all objects seen so far unconditionally
- No input pruned!
O(n)
per access
Lattice-Based Rank Aggregation (LARA)
t
: k-th highest score in Wk
T = f(l1, ..., lm)
: score derived from last atomic scores
- Growing phase
- While
t
<T
- Iteratively retrieve objects & their atomic scores from the ranked inputs in a round-robin fashion
- Update lower bounds for each newly accessed object
- Update top-k objects
Wk
, computet
O(log k)
for heap implementations
- Update
T
from the last atomic scores seen at each inputO(1)
- While
- Shrinking phase
- While
t
>=T
, objects not seen so far are pruned, objects seen so far are candidates for top-k- Distribute candidates using a lattice
- While
Observations
- While
t
<T
, any objects never seen so far can end up in the top-k - While
t
<T
, any objects seen so far can end up in the top-k - If
t
>=T
, no object never seen so far can end up in the top-k
Hence,
- While
t
<T
, should not apply expensive updates & comparisons using upper bounds - Divide into growing phase & shrinking phase
Cost Analysis
- Growing phase
O(log k)
per access
- Shrinking phase
O(2^m + log k)
per access
Index-Based Methods
Based on multidimensional indexing.
- Benefits
- More efficient than rank aggregation methods
- Apply directly on data in record format
- Can make use of general-purpose multi-dimensional indexes
- Drawbacks
- May not be applicable to top-k queries in arbitrary subspaces
- Hard to apply on distributed data
- Cannot be used on top of other query operations
- Some methods have high preprocessing & storage requirements
Multi-Dimensional Index Search
- Aggregate function is a geometric shape sweeping the space from origin
- First
k
points hit are the top-k Branch-and-bound search applicable
- Bound: low-most corners
Benefits
- I/O optimal given R-tree indexes
- Incremental
- Drawbacks
- Relies on existence of index for the set of attributes used in
f
- Relies on existence of index for the set of attributes used in
Best-First Search
- Objective:
k
hotels withmin(f(hi))
- Heap
Q
: prioritized based on{min(f(p)): p in M}