Spatial Data
- Spatial object
- With spatial extent
- Vector representation
- Raster representation
- Point
- With spatial extent
Spatial Relationships
Associates two objects according to their relative location and extent in space. Used as predicates in spatial queries.
- Object
- Boundary
- Interior (optional)
Topological Relationships
Defined by a set of relationships between the objects' boundaries and interiors.
disjoint(o1, o2)
(interior(o1) && interior(o2) == null) && (boundary(o1) && boundary(o2) == null)
intersects/overlaps(o1, o2)
(interior(o1) && interior(o2) != null) || (boundary(o1) && boundary(o2) != null)
equals(o1, o2)
(interior(o1) == interior(o2)) && boundary(o1) == boundary(o2)
inside(o1, o2)
interior(o1) in interior(o2)
contains(o1, o2)
interior(o2) in interior(o1)
adjacent(o1, o2)
(interior(o1) && interior(o2) == null) && (boundary(o1) && boundary(o2) != null)
Distance Relationships
Associates objects based on their geometric distance (Euclidean distance).
- Explicit
- Abstract/discrete
Directional Relationships
Associates objects based on their relative orientation according to a global reference system.
Spatial Queries
- Things spacial about spatial
- Dimensionality: no ordering
- Complex spatial extent
- No standard definitions of spatial operations and algebra
- Spatial access methods (SAM)
- Index spatial objects and facilitate efficient processing of simple spatial query types
Two-Step Spatial Query Processing
Evaluating spatial relationships on geometric data is slow; a spatial object is approximated by its minimum bounding rectangle (MBR).
- Filter step
- Query on MBRs
- Refinement step
- Query on exact geometry of objects
Spatial Access Methods
No good theoretical worst-case guarantees for range queries; therefore aims at minimization of the expected cost.
Point Access Methods
Divide space into disjoint partitions & group points according to the regions.
- Problems
- Objects may need to be clipped into several parts.
- Data redundancy
- Performance issue
- Overlapping allowed?
- Optimization of minimum overlap -> hard
- Need to seek fast, suboptimal option
- Objects may need to be clipped into several parts.
The R-Tree
- Leaf node entreis
<MBR, object-id>
- All leaves same level (balanced tree)
- Internal node entries
<MBR, ptr>
- MBR = MBR of all entries pointed by it
- Root
- At least 2 children
- Parameters
M
: max entries per nodem
: min entries per nodem <= M/2
- Usually
m = 0.4M
- One node = one disk block
Range Searching
range_query(query W, node n):
results = []
if n is not leaf:
for e in n:
if predicate(e.MBR, W): # e.g. intersect
results.concat(range_query(W, e.ptr))
if n is leaf:
for e in n:
if predicate(e.MBR, W) and predicate(e.obj_id, W):
results.add(e.obj_id)
return results
Construction of R-Tree
- Insertion
- Interleave with search operation
- Similar to B+-tree
- Special optimization needed
- Choose the path the new MBR is inserted
- Split overflown nodes
- Special optimization needed
- Deletion
- Interleave with search operation
- Underflow
- Delete the underflown leaf node
- Re-insert the remaining entries
R*-Tree
Differ only in insertion algorithm; aims at constructing a tree of high quality.
- Good tree has:
- Nodes with small MBRs
- Nodes with small overlap
- Nodes that look like squares
- Nodes as full as possible
- Optimization criteria
- Minimize area covered by index rectangle
- = minimize dead space
- Minimize overlaps between node MBRs
- = minimize traversed paths
- Minimize margins of node MBRs
- = minimize intersections for a random query
- Minimize tree height
- = minimize storage utilization
- Minimize area covered by index rectangle
Sometimes may not be possible to optimize all criteria.
Insertion Heuristics
How to choose the path to insert a new entry?
- Follow subtree pointed by entry whose MBR:
- Enlarged the least
- Causes least overlap
- Break ties by choosing MBR with least area
Node Splitting
Distribute quickly a set of rectangles into 2 nodes s.t. the areas, overlap, and margins are minimized.
- Give weight on some optimization criteria
- Need to be fast
- Distribution may not be even
node_splitting(n, m, M):
# Determine the split axis (min margin)
sum['x'] = 0, sum['y'] = 0
for axis in ['x', 'y']:
sum[axis] = 0
sl = sort_entries_by_lower_value(n)
su = sort_entries_by_upper_value(n)
for k = m to M+1-m:
A, B = group(sl, k) # First k in A, others in B
sum[axis] += margin(A) + margin(B)
A, B = group(su, k)
sum[axis] += margin(A) + margin(B)
split_axis = argmin(sum)
# Distribute entries along the axis (min overlap)
dist = min_overlap(n, m, M, split_axis)
if len(dist) > 1: # min area if tie
dist = min_area(n, m, M, dist)
Forced Reinsert
Try to see if some entries could fit better in another node when overflow.
- Steps
- Find 30% furthest entries from center
- Reinsert (if another overflow, don't repeat)
- Advantages
- Less overlap
- More space utilized
Building R-Trees
- Iterative
- Iteratively insert rectangles into an initially empty tree
- Drawbacks
- Tree reorganization slow
- Tree nodes not as full as possible
- Bulk-load
- Bulk-load the rectangles into the tree with some fast process
- Benefits
- Fast
- Good space utilization
X-Sorting
- Steps
- Sort in x axis
- Pack
M
consecutive rectangles in leaf nodes - Build tree bottom-up
- Drawbacks
- Results in long strips
Hilbert Sorting
- Steps
- Similar to x-sorting, but with space-filling curve to order the rectangles
- Benefits
- Better structure
- Drawbacks
- Large overlap
Sort-Tile Recursive
- Steps
- Sort in one axis first
- Group
sqrt(n)
rectangles in another axis
- Benefits
- Best structure
Spatial Query Processing
Spatial Selections
Range (window W
) intersection query.
Nearest Neighbor Queries
For a spatial relation R
and a query object q
, find the nearest (k
)neighbor of q
in R
.
$$ NN(q, R) = {o \in R: dist(q, o) \le dist(q, o') \forall o' \in R}
$$
$$ KNN(q, k, R) = S \subset R: |S| = k, dist(q, o) \le dist(q, o') \forall o \in S \forall o' \in (R-S)
$$
Distance Measures
Distance between MSRs lower-bounds distance between objects.
$$ dist(MSR(o_i), MSR(o_j)) \le dist(o_i, o_j)
$$
Distance between R-tree node MSRs lower-bounds distance between the entries.
$$ dist(MSR(n_i), MSR(n_j)) \le dist(MSR(e_i), MSR(e_j)) \forall e_i \in n_i, e_j \in n_j
$$
Distance between query object q
and an R-tree node MSR lower-bounds distance between q
and the entries.
$$ dist(q, MSR(n)) \le dist(q, MSR(e)) \forall e \in n
$$
Can be used to guide/prune search in an R-tree.
Depth-First NN Search
DF_NN_search(q, n, o_nn):
if n is leaf:
for e in n:
if dist(q, e.MBR) < dist(q, o_nn):
if dist(q, e.ptr) < dist(q, o_nn):
o_nn = e.ptr
else:
for e in n:
if dist(q, e.MBR) < dist(q, o_nn):
o = e.ptr
DF_NN_search(q, o)
- Benefits
- Large space pruned
- Should order entries of node s.t. potential to find an NN earlier is maximized
- Easily adaptable to k-NN
- At most one path in memory -> small
- Drawbacks
- Does not visit the least possible number of nodes
- Not incremental
Best-First NN Search
BF_NN_search(q, R):
o_nn = null
priority_queue q
for e in R.root:
q.add(e)
while (not q.empty()) and (dist(q, q.top().MBR) < dist(q, o_nn)):
e = q.top()
q.pop()
if e is not leaf:
for e' in e.ptr:
if dist(q, e'.MBR) < dist(q, o_nn):
q.add(e')
else:
if dist(e.ptr, q) < dist(q, o_nn):
o_nn = e.ptr
return o_nn
- Benefits
- More efficient given large enough memory
- Optimal in number of nodes visited
- Only nodes whose MBR intersects the disk centered at
q
with radius <= the real NN distance are visited
- Only nodes whose MBR intersects the disk centered at
- Adaptable for k-NN search
- Second heap to organize the NN found so far
- Or incremental search version
- Adaptable for incremental search
- i.e. after finding the NN, we can incrementally find the next NN without starting from the beginning
- Put objects on heap
- Never prune, wait for objects to come out
- i.e. after finding the NN, we can incrementally find the next NN without starting from the beginning
- Drawbacks
- The heap can grow very large
Spatial Joins
- Input
- 2 spatial relations
R
&S
- 1 spatial relationship
θ
- 2 spatial relations
- Output
- $${(r, s): r \in R, s \in S, r \theta s = true}$$
- Types
- Semi-join
- Similarity join
- Closest pairs
- All-NN
- Iceberg distance join
- Issues
- More expensive than selections
- How to process with indexes?
- Both inputs indexed (e.g. synchronized tree traversal)
- R-tree join the best
- One input indexed (e.g. indexed nested loops)
- Slot-index spatial join & sort and match the best
- No input indexed (e.g. spatial hash join)
- Generally worst than methods with index
- Both inputs indexed (e.g. synchronized tree traversal)
- How to apply multi-step process?
- Mostly focus on efficient processing of filter step
- Most spatial predicates on actual objects reduce to intersection of MBRs in filter step -> mainly consider the intersect predicate
Filter Step
R-Tree Join
If node nR
does not intersect node nS
, then no object under nR
intersect objects under node nS
. Pruning can be checked.
# Assume tress have same height
r_tree_join(nr, ns):
for ei in nr:
for ej in ns:
if intersect(ei.MBR, ej.MBR):
if nr is leaf: # ns also a leaf
output(ei.ptr, ej.ptr)
else:
r_tree_join(ei.ptr, ej.ptr)
Cost: O(N * M)
Optimization 1: Space Restriction
Check entry in nR
with MBR of nS
.
Cost: O(N + M)
Optimization 2: Plane Sweep
Sort entries of both nodes on their lower x value, then sweep a line to find all entry pairs that qualify x-intersection. Then check y-intersection.
Cost: O((N+M) * log(max(N, M)) + k)
Worst case suboptimal; effective on average.
Single Index Methods
Indexed Nested Loops
- Window query for objects of non-indexed dataset
Seeded Tree Join & Bulk-Load and Match
- Build on-the-fly R-tree
- R-join
Better if number of objects large.
Sort and Match
- Sort non-indexed dataset
- Pack leaf nodes (without building R-tree)
- Match leaf nodes with existing tree
Slot-Index Spatial Join
- Hash-join using entries of R-tree
Non-Indexed Methods
Spatial Hash Join
Partition Based Spatial Merge Join
Refinement Step
More detailed approximations to make conclusions.
- Conservative approximations
- Convex hull
- Progressive approximations
- Maximum enclosed rectangle
If still inconclusive, perform expensive refinement step, i.e. computational geometry algorithms.