Spatial Data
- Spatial object- With spatial extent- Vector representation
- Raster representation
 
- Point
 
- With spatial extent
Spatial Relationships
Associates two objects according to their relative location and extent in space. Used as predicates in spatial queries.
- Object- Boundary
- Interior (optional)
 
Topological Relationships
Defined by a set of relationships between the objects' boundaries and interiors.
- disjoint(o1, o2)- (interior(o1) && interior(o2) == null) && (boundary(o1) && boundary(o2) == null)
 
- intersects/overlaps(o1, o2)- (interior(o1) && interior(o2) != null) || (boundary(o1) && boundary(o2) != null)
 
- equals(o1, o2)- (interior(o1) == interior(o2)) && boundary(o1) == boundary(o2)
 
- inside(o1, o2)- interior(o1) in interior(o2)
 
- contains(o1, o2)- interior(o2) in interior(o1)
 
- adjacent(o1, o2)- (interior(o1) && interior(o2) == null) && (boundary(o1) && boundary(o2) != null)
 

Distance Relationships
Associates objects based on their geometric distance (Euclidean distance).
- Explicit
- Abstract/discrete
Directional Relationships
Associates objects based on their relative orientation according to a global reference system.
Spatial Queries
- Things spacial about spatial- Dimensionality: no ordering
- Complex spatial extent
- No standard definitions of spatial operations and algebra
 
- Spatial access methods (SAM)- Index spatial objects and facilitate efficient processing of simple spatial query types
 
Two-Step Spatial Query Processing
Evaluating spatial relationships on geometric data is slow; a spatial object is approximated by its minimum bounding rectangle (MBR).
- Filter step- Query on MBRs
 
- Refinement step- Query on exact geometry of objects
 
Spatial Access Methods
No good theoretical worst-case guarantees for range queries; therefore aims at minimization of the expected cost.
Point Access Methods
Divide space into disjoint partitions & group points according to the regions.
- Problems- Objects may need to be clipped into several parts.- Data redundancy
- Performance issue
 
- Overlapping allowed?- Optimization of minimum overlap -> hard
- Need to seek fast, suboptimal option
 
 
- Objects may need to be clipped into several parts.
The R-Tree

- Leaf node entreis- <MBR, object-id>
- All leaves same level (balanced tree)
 
- Internal node entries- <MBR, ptr>
- MBR = MBR of all entries pointed by it
 
- Root- At least 2 children
 
- Parameters- M: max entries per node
- m: min entries per node- m <= M/2
- Usually m = 0.4M
 
 
- One node = one disk block
Range Searching
range_query(query W, node n):
    results = []
    if n is not leaf:
        for e in n:
            if predicate(e.MBR, W): # e.g. intersect
                results.concat(range_query(W, e.ptr))
    if n is leaf:
        for e in n:
            if predicate(e.MBR, W) and predicate(e.obj_id, W):
                results.add(e.obj_id)
    return results
Construction of R-Tree
- Insertion- Interleave with search operation
- Similar to B+-tree- Special optimization needed- Choose the path the new MBR is inserted
- Split overflown nodes
 
 
- Special optimization needed
 
- Deletion- Interleave with search operation
- Underflow- Delete the underflown leaf node
- Re-insert the remaining entries
 
 
R*-Tree
Differ only in insertion algorithm; aims at constructing a tree of high quality.
- Good tree has:- Nodes with small MBRs
- Nodes with small overlap
- Nodes that look like squares
- Nodes as full as possible
 
- Optimization criteria- Minimize area covered by index rectangle- = minimize dead space
 
- Minimize overlaps between node MBRs- = minimize traversed paths
 
- Minimize margins of node MBRs- = minimize intersections for a random query
 
- Minimize tree height- = minimize storage utilization
 
 
- Minimize area covered by index rectangle
Sometimes may not be possible to optimize all criteria.
Insertion Heuristics
How to choose the path to insert a new entry?
- Follow subtree pointed by entry whose MBR:- Enlarged the least
- Causes least overlap
 
- Break ties by choosing MBR with least area
Node Splitting
Distribute quickly a set of rectangles into 2 nodes s.t. the areas, overlap, and margins are minimized.
- Give weight on some optimization criteria
- Need to be fast
- Distribution may not be even
node_splitting(n, m, M):
    # Determine the split axis (min margin)
    sum['x'] = 0, sum['y'] = 0
    for axis in ['x', 'y']:
        sum[axis] = 0
        sl = sort_entries_by_lower_value(n)
        su = sort_entries_by_upper_value(n)
        for k = m to M+1-m:
            A, B = group(sl, k) # First k in A, others in B
            sum[axis] += margin(A) + margin(B)
            A, B = group(su, k)
            sum[axis] += margin(A) + margin(B)
    split_axis = argmin(sum)
    # Distribute entries along the axis (min overlap)
    dist = min_overlap(n, m, M, split_axis)
    if len(dist) > 1: # min area if tie
        dist = min_area(n, m, M, dist)
Forced Reinsert
Try to see if some entries could fit better in another node when overflow.
- Steps- Find 30% furthest entries from center
- Reinsert (if another overflow, don't repeat)
 
- Advantages- Less overlap
- More space utilized
 
Building R-Trees
- Iterative- Iteratively insert rectangles into an initially empty tree
- Drawbacks- Tree reorganization slow
- Tree nodes not as full as possible
 
 
- Bulk-load- Bulk-load the rectangles into the tree with some fast process
- Benefits- Fast
- Good space utilization
 
 
X-Sorting
- Steps- Sort in x axis
- Pack Mconsecutive rectangles in leaf nodes
- Build tree bottom-up
 
- Drawbacks- Results in long strips
 
Hilbert Sorting
- Steps- Similar to x-sorting, but with space-filling curve to order the rectangles
 
- Benefits- Better structure
 
- Drawbacks- Large overlap
 
Sort-Tile Recursive
- Steps- Sort in one axis first
- Group sqrt(n)rectangles in another axis
 
- Benefits- Best structure
 

Spatial Query Processing
Spatial Selections
Range (window W) intersection query.
Nearest Neighbor Queries
For a spatial relation R and a query object q, find the nearest (k)neighbor of q in R.
$$ NN(q, R) = {o \in R: dist(q, o) \le dist(q, o') \forall o' \in R}
$$
$$ KNN(q, k, R) = S \subset R: |S| = k, dist(q, o) \le dist(q, o') \forall o \in S \forall o' \in (R-S)
$$
Distance Measures
Distance between MSRs lower-bounds distance between objects.
$$ dist(MSR(o_i), MSR(o_j)) \le dist(o_i, o_j)
$$
Distance between R-tree node MSRs lower-bounds distance between the entries.
$$ dist(MSR(n_i), MSR(n_j)) \le dist(MSR(e_i), MSR(e_j)) \forall e_i \in n_i, e_j \in n_j
$$
Distance between query object q and an R-tree node MSR lower-bounds distance between q and the entries.
$$ dist(q, MSR(n)) \le dist(q, MSR(e)) \forall e \in n
$$
Can be used to guide/prune search in an R-tree.
Depth-First NN Search
DF_NN_search(q, n, o_nn):
    if n is leaf:
        for e in n:
            if dist(q, e.MBR) < dist(q, o_nn):
                if dist(q, e.ptr) < dist(q, o_nn):
                    o_nn = e.ptr
    else:
        for e in n:
            if dist(q, e.MBR) < dist(q, o_nn):
                o = e.ptr
                DF_NN_search(q, o)
- Benefits- Large space pruned
- Should order entries of node s.t. potential to find an NN earlier is maximized
- Easily adaptable to k-NN
- At most one path in memory -> small
 
- Drawbacks- Does not visit the least possible number of nodes
- Not incremental
 
Best-First NN Search
BF_NN_search(q, R):
    o_nn = null
    priority_queue q
    for e in R.root:
        q.add(e)
    while (not q.empty()) and (dist(q, q.top().MBR) < dist(q, o_nn)):
        e = q.top()
        q.pop()
        if e is not leaf:
            for e' in e.ptr:
                if dist(q, e'.MBR) < dist(q, o_nn):
                    q.add(e')
        else:
            if dist(e.ptr, q) < dist(q, o_nn):
                o_nn = e.ptr
    return o_nn
- Benefits- More efficient given large enough memory
- Optimal in number of nodes visited- Only nodes whose MBR intersects the disk centered at qwith radius <= the real NN distance are visited
 
- Only nodes whose MBR intersects the disk centered at 
- Adaptable for k-NN search- Second heap to organize the NN found so far
- Or incremental search version
 
- Adaptable for incremental search- i.e. after finding the NN, we can incrementally find the next NN without starting from the beginning- Put objects on heap
- Never prune, wait for objects to come out
 
 
- i.e. after finding the NN, we can incrementally find the next NN without starting from the beginning
 
- Drawbacks- The heap can grow very large
 
Spatial Joins
- Input- 2 spatial relations R&S
- 1 spatial relationship θ
 
- 2 spatial relations 
- Output- $${(r, s): r \in R, s \in S, r \theta s = true}$$
 
- Types- Semi-join
- Similarity join
- Closest pairs
- All-NN
- Iceberg distance join
 
- Issues- More expensive than selections
- How to process with indexes?- Both inputs indexed (e.g. synchronized tree traversal)- R-tree join the best
 
- One input indexed (e.g. indexed nested loops)- Slot-index spatial join & sort and match the best
 
- No input indexed (e.g. spatial hash join)- Generally worst than methods with index
 
 
- Both inputs indexed (e.g. synchronized tree traversal)
- How to apply multi-step process?- Mostly focus on efficient processing of filter step
- Most spatial predicates on actual objects reduce to intersection of MBRs in filter step -> mainly consider the intersect predicate
 
 


Filter Step
R-Tree Join
If node nR does not intersect node nS, then no object under nR intersect objects under node nS. Pruning can be checked.
# Assume tress have same height
r_tree_join(nr, ns):
    for ei in nr:
        for ej in ns:
            if intersect(ei.MBR, ej.MBR):
                if nr is leaf: # ns also a leaf
                    output(ei.ptr, ej.ptr)
                else:
                    r_tree_join(ei.ptr, ej.ptr)
Cost: O(N * M)
Optimization 1: Space Restriction
Check entry in nR with MBR of nS.

Cost: O(N + M)
Optimization 2: Plane Sweep
Sort entries of both nodes on their lower x value, then sweep a line to find all entry pairs that qualify x-intersection. Then check y-intersection.
 Cost:
Cost: O((N+M) * log(max(N, M)) + k)
Worst case suboptimal; effective on average.
Single Index Methods
Indexed Nested Loops
- Window query for objects of non-indexed dataset

Seeded Tree Join & Bulk-Load and Match
- Build on-the-fly R-tree
- R-join
Better if number of objects large.

Sort and Match
- Sort non-indexed dataset
- Pack leaf nodes (without building R-tree)
- Match leaf nodes with existing tree

Slot-Index Spatial Join
- Hash-join using entries of R-tree

Non-Indexed Methods
Spatial Hash Join

Partition Based Spatial Merge Join

Refinement Step
More detailed approximations to make conclusions.
- Conservative approximations- Convex hull
 
- Progressive approximations- Maximum enclosed rectangle
 

If still inconclusive, perform expensive refinement step, i.e. computational geometry algorithms.