Spatial Data

  • Spatial object
    • With spatial extent
      • Vector representation
      • Raster representation
    • Point

Spatial Relationships

Associates two objects according to their relative location and extent in space. Used as predicates in spatial queries.

  • Object
    • Boundary
    • Interior (optional)

Topological Relationships

Defined by a set of relationships between the objects' boundaries and interiors.

  • disjoint(o1, o2)
    • (interior(o1) && interior(o2) == null) && (boundary(o1) && boundary(o2) == null)
  • intersects/overlaps(o1, o2)
    • (interior(o1) && interior(o2) != null) || (boundary(o1) && boundary(o2) != null)
  • equals(o1, o2)
    • (interior(o1) == interior(o2)) && boundary(o1) == boundary(o2)
  • inside(o1, o2)
    • interior(o1) in interior(o2)
  • contains(o1, o2)
    • interior(o2) in interior(o1)
  • adjacent(o1, o2)
    • (interior(o1) && interior(o2) == null) && (boundary(o1) && boundary(o2) != null)

Distance Relationships

Associates objects based on their geometric distance (Euclidean distance).

  • Explicit
  • Abstract/discrete

Directional Relationships

Associates objects based on their relative orientation according to a global reference system.

Spatial Queries

  • Things spacial about spatial
    • Dimensionality: no ordering
    • Complex spatial extent
    • No standard definitions of spatial operations and algebra
  • Spatial access methods (SAM)
    • Index spatial objects and facilitate efficient processing of simple spatial query types

Two-Step Spatial Query Processing

Evaluating spatial relationships on geometric data is slow; a spatial object is approximated by its minimum bounding rectangle (MBR).

  • Filter step
    • Query on MBRs
  • Refinement step
    • Query on exact geometry of objects

Spatial Access Methods

No good theoretical worst-case guarantees for range queries; therefore aims at minimization of the expected cost.

Point Access Methods

Divide space into disjoint partitions & group points according to the regions.

  • Problems
    • Objects may need to be clipped into several parts.
      • Data redundancy
      • Performance issue
    • Overlapping allowed?
      • Optimization of minimum overlap -> hard
      • Need to seek fast, suboptimal option

The R-Tree

  • Leaf node entreis
    • <MBR, object-id>
    • All leaves same level (balanced tree)
  • Internal node entries
    • <MBR, ptr>
    • MBR = MBR of all entries pointed by it
  • Root
    • At least 2 children
  • Parameters
    • M: max entries per node
    • m: min entries per node
      • m <= M/2
      • Usually m = 0.4M
  • One node = one disk block

Range Searching

range_query(query W, node n):
    results = []
    if n is not leaf:
        for e in n:
            if predicate(e.MBR, W): # e.g. intersect
                results.concat(range_query(W, e.ptr))
    if n is leaf:
        for e in n:
            if predicate(e.MBR, W) and predicate(e.obj_id, W):
                results.add(e.obj_id)
    return results

Construction of R-Tree

  • Insertion
    • Interleave with search operation
    • Similar to B+-tree
      • Special optimization needed
        • Choose the path the new MBR is inserted
        • Split overflown nodes
  • Deletion
    • Interleave with search operation
    • Underflow
      • Delete the underflown leaf node
      • Re-insert the remaining entries

R*-Tree

Differ only in insertion algorithm; aims at constructing a tree of high quality.

  • Good tree has:
    • Nodes with small MBRs
    • Nodes with small overlap
    • Nodes that look like squares
    • Nodes as full as possible
  • Optimization criteria
    • Minimize area covered by index rectangle
      • = minimize dead space
    • Minimize overlaps between node MBRs
      • = minimize traversed paths
    • Minimize margins of node MBRs
      • = minimize intersections for a random query
    • Minimize tree height
      • = minimize storage utilization

Sometimes may not be possible to optimize all criteria.

Insertion Heuristics

How to choose the path to insert a new entry?

  • Follow subtree pointed by entry whose MBR:
    • Enlarged the least
    • Causes least overlap
  • Break ties by choosing MBR with least area
Node Splitting

Distribute quickly a set of rectangles into 2 nodes s.t. the areas, overlap, and margins are minimized.

  • Give weight on some optimization criteria
  • Need to be fast
  • Distribution may not be even
node_splitting(n, m, M):
    # Determine the split axis (min margin)
    sum['x'] = 0, sum['y'] = 0
    for axis in ['x', 'y']:
        sum[axis] = 0
        sl = sort_entries_by_lower_value(n)
        su = sort_entries_by_upper_value(n)

        for k = m to M+1-m:
            A, B = group(sl, k) # First k in A, others in B
            sum[axis] += margin(A) + margin(B)
            A, B = group(su, k)
            sum[axis] += margin(A) + margin(B)
    split_axis = argmin(sum)

    # Distribute entries along the axis (min overlap)
    dist = min_overlap(n, m, M, split_axis)
    if len(dist) > 1: # min area if tie
        dist = min_area(n, m, M, dist)
Forced Reinsert

Try to see if some entries could fit better in another node when overflow.

  • Steps
    • Find 30% furthest entries from center
    • Reinsert (if another overflow, don't repeat)
  • Advantages
    • Less overlap
    • More space utilized

Building R-Trees

  1. Iterative
    1. Iteratively insert rectangles into an initially empty tree
    2. Drawbacks
      1. Tree reorganization slow
      2. Tree nodes not as full as possible
  2. Bulk-load
    1. Bulk-load the rectangles into the tree with some fast process
    2. Benefits
      1. Fast
      2. Good space utilization
X-Sorting
  • Steps
    • Sort in x axis
    • Pack M consecutive rectangles in leaf nodes
    • Build tree bottom-up
  • Drawbacks
    • Results in long strips
Hilbert Sorting
  • Steps
    • Similar to x-sorting, but with space-filling curve to order the rectangles
  • Benefits
    • Better structure
  • Drawbacks
    • Large overlap
Sort-Tile Recursive
  • Steps
    • Sort in one axis first
    • Group sqrt(n) rectangles in another axis
  • Benefits
    • Best structure

Spatial Query Processing

Spatial Selections

Range (window W) intersection query.

Nearest Neighbor Queries

For a spatial relation R and a query object q, find the nearest (k)neighbor of q in R.

$$ NN(q, R) = {o \in R: dist(q, o) \le dist(q, o') \forall o' \in R}

$$

$$ KNN(q, k, R) = S \subset R: |S| = k, dist(q, o) \le dist(q, o') \forall o \in S \forall o' \in (R-S)

$$

Distance Measures

Distance between MSRs lower-bounds distance between objects.

$$ dist(MSR(o_i), MSR(o_j)) \le dist(o_i, o_j)

$$

Distance between R-tree node MSRs lower-bounds distance between the entries.

$$ dist(MSR(n_i), MSR(n_j)) \le dist(MSR(e_i), MSR(e_j)) \forall e_i \in n_i, e_j \in n_j

$$

Distance between query object q and an R-tree node MSR lower-bounds distance between q and the entries.

$$ dist(q, MSR(n)) \le dist(q, MSR(e)) \forall e \in n

$$

Can be used to guide/prune search in an R-tree.

DF_NN_search(q, n, o_nn):
    if n is leaf:
        for e in n:
            if dist(q, e.MBR) < dist(q, o_nn):
                if dist(q, e.ptr) < dist(q, o_nn):
                    o_nn = e.ptr
    else:
        for e in n:
            if dist(q, e.MBR) < dist(q, o_nn):
                o = e.ptr
                DF_NN_search(q, o)
  • Benefits
    • Large space pruned
    • Should order entries of node s.t. potential to find an NN earlier is maximized
    • Easily adaptable to k-NN
    • At most one path in memory -> small
  • Drawbacks
    • Does not visit the least possible number of nodes
    • Not incremental
BF_NN_search(q, R):
    o_nn = null
    priority_queue q

    for e in R.root:
        q.add(e)

    while (not q.empty()) and (dist(q, q.top().MBR) < dist(q, o_nn)):
        e = q.top()
        q.pop()

        if e is not leaf:
            for e' in e.ptr:
                if dist(q, e'.MBR) < dist(q, o_nn):
                    q.add(e')
        else:
            if dist(e.ptr, q) < dist(q, o_nn):
                o_nn = e.ptr
    return o_nn
  • Benefits
    • More efficient given large enough memory
    • Optimal in number of nodes visited
      • Only nodes whose MBR intersects the disk centered at q with radius <= the real NN distance are visited
    • Adaptable for k-NN search
      • Second heap to organize the NN found so far
      • Or incremental search version
    • Adaptable for incremental search
      • i.e. after finding the NN, we can incrementally find the next NN without starting from the beginning
        • Put objects on heap
        • Never prune, wait for objects to come out
  • Drawbacks
    • The heap can grow very large

Spatial Joins

  • Input
    • 2 spatial relations R & S
    • 1 spatial relationship θ
  • Output
    • $${(r, s): r \in R, s \in S, r \theta s = true}$$
  • Types
    • Semi-join
    • Similarity join
    • Closest pairs
    • All-NN
    • Iceberg distance join
  • Issues
    • More expensive than selections
    • How to process with indexes?
      • Both inputs indexed (e.g. synchronized tree traversal)
        • R-tree join the best
      • One input indexed (e.g. indexed nested loops)
        • Slot-index spatial join & sort and match the best
      • No input indexed (e.g. spatial hash join)
        • Generally worst than methods with index
    • How to apply multi-step process?
      • Mostly focus on efficient processing of filter step
      • Most spatial predicates on actual objects reduce to intersection of MBRs in filter step -> mainly consider the intersect predicate

Filter Step
R-Tree Join

If node nR does not intersect node nS, then no object under nR intersect objects under node nS. Pruning can be checked.

# Assume tress have same height
r_tree_join(nr, ns):
    for ei in nr:
        for ej in ns:
            if intersect(ei.MBR, ej.MBR):
                if nr is leaf: # ns also a leaf
                    output(ei.ptr, ej.ptr)
                else:
                    r_tree_join(ei.ptr, ej.ptr)

Cost: O(N * M)

Optimization 1: Space Restriction

Check entry in nR with MBR of nS.

Cost: O(N + M)

Optimization 2: Plane Sweep

Sort entries of both nodes on their lower x value, then sweep a line to find all entry pairs that qualify x-intersection. Then check y-intersection.

Cost: O((N+M) * log(max(N, M)) + k)

Worst case suboptimal; effective on average.

Single Index Methods

Indexed Nested Loops

  • Window query for objects of non-indexed dataset

Seeded Tree Join & Bulk-Load and Match

  • Build on-the-fly R-tree
  • R-join

Better if number of objects large.

Sort and Match

  • Sort non-indexed dataset
  • Pack leaf nodes (without building R-tree)
  • Match leaf nodes with existing tree

Slot-Index Spatial Join

  • Hash-join using entries of R-tree

Non-Indexed Methods

Spatial Hash Join

Partition Based Spatial Merge Join

Refinement Step

More detailed approximations to make conclusions.

  • Conservative approximations
    • Convex hull
  • Progressive approximations
    • Maximum enclosed rectangle

If still inconclusive, perform expensive refinement step, i.e. computational geometry algorithms.

References

results matching ""

    No results matching ""