Spatial Data

Spatial object
- With spatial extent
  - Vector representation
  - Raster representation
- Point

Spatial Relationships

Associates two objects according to their relative location and extent in space. Used as predicates in spatial queries.

Object
- Boundary
- Interior (optional)

Topological Relationships

Defined by a set of relationships between the objects' boundaries and interiors.

disjoint(o1, o2)
- (interior(o1) && interior(o2) == null) && (boundary(o1) && boundary(o2) == null)
intersects/overlaps(o1, o2)
- (interior(o1) && interior(o2) != null) || (boundary(o1) && boundary(o2) != null)
equals(o1, o2)
- (interior(o1) == interior(o2)) && boundary(o1) == boundary(o2)
inside(o1, o2)
- interior(o1) in interior(o2)
contains(o1, o2)
- interior(o2) in interior(o1)
adjacent(o1, o2)
- (interior(o1) && interior(o2) == null) && (boundary(o1) && boundary(o2) != null)

Distance Relationships

Associates objects based on their geometric distance (Euclidean distance).

Explicit
Abstract/discrete

Directional Relationships

Associates objects based on their relative orientation according to a global reference system.

Spatial Queries

Things spacial about spatial
- Dimensionality: no ordering
- Complex spatial extent
- No standard definitions of spatial operations and algebra
Spatial access methods (SAM)
- Index spatial objects and facilitate efficient processing of simple spatial query types

Two-Step Spatial Query Processing

Evaluating spatial relationships on geometric data is slow; a spatial object is approximated by its minimum bounding rectangle (MBR).

Filter step
- Query on MBRs
Refinement step
- Query on exact geometry of objects

Spatial Access Methods

No good theoretical worst-case guarantees for range queries; therefore aims at minimization of the expected cost.

Point Access Methods

Divide space into disjoint partitions & group points according to the regions.

Problems
- Objects may need to be clipped into several parts.
  - Data redundancy
  - Performance issue
- Overlapping allowed?
  - Optimization of minimum overlap -> hard
  - Need to seek fast, suboptimal option

The R-Tree

Leaf node entreis
- <MBR, object-id>
- All leaves same level (balanced tree)
Internal node entries
- <MBR, ptr>
- MBR = MBR of all entries pointed by it
Root
- At least 2 children
Parameters
- M: max entries per node
- m: min entries per node
  - m <= M/2
  - Usually m = 0.4M
One node = one disk block

Range Searching

range_query(query W, node n):
    results = []
    if n is not leaf:
        for e in n:
            if predicate(e.MBR, W): # e.g. intersect
                results.concat(range_query(W, e.ptr))
    if n is leaf:
        for e in n:
            if predicate(e.MBR, W) and predicate(e.obj_id, W):
                results.add(e.obj_id)
    return results

Construction of R-Tree

Insertion
- Interleave with search operation
- Similar to B+-tree
  - Special optimization needed
    - Choose the path the new MBR is inserted
    - Split overflown nodes
Deletion
- Interleave with search operation
- Underflow
  - Delete the underflown leaf node
  - Re-insert the remaining entries

R*-Tree

Differ only in insertion algorithm; aims at constructing a tree of high quality.

Good tree has:
- Nodes with small MBRs
- Nodes with small overlap
- Nodes that look like squares
- Nodes as full as possible
Optimization criteria
- Minimize area covered by index rectangle
  - = minimize dead space
- Minimize overlaps between node MBRs
  - = minimize traversed paths
- Minimize margins of node MBRs
  - = minimize intersections for a random query
- Minimize tree height
  - = minimize storage utilization

Sometimes may not be possible to optimize all criteria.

Insertion Heuristics

How to choose the path to insert a new entry?

Follow subtree pointed by entry whose MBR:
- Enlarged the least
- Causes least overlap
Break ties by choosing MBR with least area

Node Splitting

Distribute quickly a set of rectangles into 2 nodes s.t. the areas, overlap, and margins are minimized.

Give weight on some optimization criteria
Need to be fast
Distribution may not be even

node_splitting(n, m, M):
    # Determine the split axis (min margin)
    sum['x'] = 0, sum['y'] = 0
    for axis in ['x', 'y']:
        sum[axis] = 0
        sl = sort_entries_by_lower_value(n)
        su = sort_entries_by_upper_value(n)

        for k = m to M+1-m:
            A, B = group(sl, k) # First k in A, others in B
            sum[axis] += margin(A) + margin(B)
            A, B = group(su, k)
            sum[axis] += margin(A) + margin(B)
    split_axis = argmin(sum)

    # Distribute entries along the axis (min overlap)
    dist = min_overlap(n, m, M, split_axis)
    if len(dist) > 1: # min area if tie
        dist = min_area(n, m, M, dist)

Forced Reinsert

Try to see if some entries could fit better in another node when overflow.

Steps
- Find 30% furthest entries from center
- Reinsert (if another overflow, don't repeat)
Advantages
- Less overlap
- More space utilized

Building R-Trees

Iterative
1. Iteratively insert rectangles into an initially empty tree
2. Drawbacks
  1. Tree reorganization slow
  2. Tree nodes not as full as possible
Bulk-load
1. Bulk-load the rectangles into the tree with some fast process
2. Benefits
  1. Fast
  2. Good space utilization

X-Sorting

Steps
- Sort in x axis
- Pack M consecutive rectangles in leaf nodes
- Build tree bottom-up
Drawbacks
- Results in long strips

Hilbert Sorting

Steps
- Similar to x-sorting, but with space-filling curve to order the rectangles
Benefits
- Better structure
Drawbacks
- Large overlap

Sort-Tile Recursive

Steps
- Sort in one axis first
- Group sqrt(n) rectangles in another axis
Benefits
- Best structure

Spatial Query Processing

Spatial Selections

Range (window W) intersection query.

Nearest Neighbor Queries

For a spatial relation R and a query object q, find the nearest (k)neighbor of q in R.

$$ NN(q, R) = {o \in R: dist(q, o) \le dist(q, o') \forall o' \in R}

$$ KNN(q, k, R) = S \subset R: |S| = k, dist(q, o) \le dist(q, o') \forall o \in S \forall o' \in (R-S)

Distance Measures

Distance between MSRs lower-bounds distance between objects.

$$ dist(MSR(o_i), MSR(o_j)) \le dist(o_i, o_j)

Distance between R-tree node MSRs lower-bounds distance between the entries.

$$ dist(MSR(n_i), MSR(n_j)) \le dist(MSR(e_i), MSR(e_j)) \forall e_i \in n_i, e_j \in n_j

Distance between query object q and an R-tree node MSR lower-bounds distance between q and the entries.

$$ dist(q, MSR(n)) \le dist(q, MSR(e)) \forall e \in n

Can be used to guide/prune search in an R-tree.

Depth-First NN Search

DF_NN_search(q, n, o_nn):
    if n is leaf:
        for e in n:
            if dist(q, e.MBR) < dist(q, o_nn):
                if dist(q, e.ptr) < dist(q, o_nn):
                    o_nn = e.ptr
    else:
        for e in n:
            if dist(q, e.MBR) < dist(q, o_nn):
                o = e.ptr
                DF_NN_search(q, o)

Benefits
- Large space pruned
- Should order entries of node s.t. potential to find an NN earlier is maximized
- Easily adaptable to k-NN
- At most one path in memory -> small
Drawbacks
- Does not visit the least possible number of nodes
- Not incremental

Best-First NN Search

BF_NN_search(q, R):
    o_nn = null
    priority_queue q

    for e in R.root:
        q.add(e)

    while (not q.empty()) and (dist(q, q.top().MBR) < dist(q, o_nn)):
        e = q.top()
        q.pop()

        if e is not leaf:
            for e' in e.ptr:
                if dist(q, e'.MBR) < dist(q, o_nn):
                    q.add(e')
        else:
            if dist(e.ptr, q) < dist(q, o_nn):
                o_nn = e.ptr
    return o_nn

Benefits
- More efficient given large enough memory
- Optimal in number of nodes visited
  - Only nodes whose MBR intersects the disk centered at q with radius <= the real NN distance are visited
- Adaptable for k-NN search
  - Second heap to organize the NN found so far
  - Or incremental search version
- Adaptable for incremental search
  - i.e. after finding the NN, we can incrementally find the next NN without starting from the beginning
    - Put objects on heap
    - Never prune, wait for objects to come out
Drawbacks
- The heap can grow very large

Spatial Joins

Input
- 2 spatial relations R & S
- 1 spatial relationship θ
Output
- $${(r, s): r \in R, s \in S, r \theta s = true}$$
Types
- Semi-join
- Similarity join
- Closest pairs
- All-NN
- Iceberg distance join
Issues
- More expensive than selections
- How to process with indexes?
  - Both inputs indexed (e.g. synchronized tree traversal)
    - R-tree join the best
  - One input indexed (e.g. indexed nested loops)
    - Slot-index spatial join & sort and match the best
  - No input indexed (e.g. spatial hash join)
    - Generally worst than methods with index
- How to apply multi-step process?
  - Mostly focus on efficient processing of filter step
  - Most spatial predicates on actual objects reduce to intersection of MBRs in filter step -> mainly consider the intersect predicate

Filter Step

R-Tree Join

If node nR does not intersect node nS, then no object under nR intersect objects under node nS. Pruning can be checked.

# Assume tress have same height
r_tree_join(nr, ns):
    for ei in nr:
        for ej in ns:
            if intersect(ei.MBR, ej.MBR):
                if nr is leaf: # ns also a leaf
                    output(ei.ptr, ej.ptr)
                else:
                    r_tree_join(ei.ptr, ej.ptr)

Cost: O(N * M)

Optimization 1: Space Restriction

Check entry in nR with MBR of nS.

Cost: O(N + M)

Optimization 2: Plane Sweep

Sort entries of both nodes on their lower x value, then sweep a line to find all entry pairs that qualify x-intersection. Then check y-intersection.

Cost: O((N+M) * log(max(N, M)) + k)

Worst case suboptimal; effective on average.

Single Index Methods

Indexed Nested Loops

Window query for objects of non-indexed dataset

Seeded Tree Join & Bulk-Load and Match

Build on-the-fly R-tree
R-join

Better if number of objects large.

Sort and Match

Sort non-indexed dataset
Pack leaf nodes (without building R-tree)
Match leaf nodes with existing tree

Slot-Index Spatial Join

Hash-join using entries of R-tree

Non-Indexed Methods

Spatial Hash Join

Partition Based Spatial Merge Join

More detailed approximations to make conclusions.

Conservative approximations
- Convex hull
Progressive approximations
- Maximum enclosed rectangle

If still inconclusive, perform expensive refinement step, i.e. computational geometry algorithms.

Spatial Databases