Text Database

A collection of documents, each as a list of words.

  • Queries
    • Containment query
      • Containment of keywords
    • Ranking query
      • Similarity of list of keywords
  • Indexing
    • Signature
    • Inverted file

Containment Query

  • Expressed with AND, OR, NOT
  • Stemming used to unify different endings of the same word

Evaluation

Naive approach

Scan through every document in the database. Expensive.

Sorted Set Comparison

Check if the set of keywords in the query $$s$$ is a subset of the keywords in each document $$t$$.

$$ s \subseteq t

$$

  1. Sort $$s$$ & $$t$$
  2. Scan the words of the two sets concurrently

Expensive, cost = sorting + 1 pass scanning

Indexing approach

Preprocess each document & extract words from them. Index the documents, then use the index to evaluate queries.

  • Signature
  • Inverted file
Signature

Approximate the set of keywords by signatures, which are bitmaps generated by some hash function.

Signature: a bitmap of length $$b$$ formed by setting bit $$H(e)$$ for each $$e \in s$$.

  • $$W$$: domain of the set elements/vocabulary
    • Superset of any document/query sets
  • $$b$$: signature length
  • $$H$$: maps each word in $$W$$ to a number in $$[0, b-1]$$

Signatures are lossy approximations, i.e. different sets may have the same signature. They are used as fast filters for set operations.

Filtering:

$$ \begin{aligned} x \subseteq y &\rightarrow sig(x) \subseteq sig(y) \Leftrightarrow sig(x) \land \neg sig(y) = 0\ x = y &\rightarrow sig(x) = sig(y)\ x \cap y \neq \emptyset &\rightarrow sig(x) \land sig(y) \neq 0 \end{aligned}

$$

Signature file: stores sequentially the signatures of all documents in a text database $$D$$.

  • Updating signature files
    • Similar to updating relations

Query processing:

  1. Find drops
    1. Compute $$sig(q)$$
    2. Scan through signature file to select qualified signatures
  2. Filter out false drops
    1. Refinement step comparing the documents with $$q$$

Signature-based indexes:

  • Bit-sliced signature
    • Only the columns where $$q$$'s signature has 1 are accessed
  • Signature-tree
    • Similar to R-tree
      • Leaf nodes: signatures
      • Internal nodes: logical OR of all signatures in the sub-tree
    • Good for dynamic data

Supported Selection Queries

  • Set containment selection (subset query)
    • Inverse superset query also useful
  • Set equality selection
  • Set overlap selection
Inverted File

Each word $$e$$ has an inverted list of document-ids that contain $$e$$.

  • Shown to be superior to signature-based techniques
  • Best for static data
    • But updates expensive, thus performed in batches
  • Can be smaller in size, since can be easily compressed
    • Run-length encoding: gaps between consecutive ids
      • E.g. {11, 24, 57, ...} -> {11, 13, 33, ...}
      • Then use variable-length encoding for the gaps
        • E.g. Golomb coding; a few bits suffice to encode the ids

Query processing:

Join the lists corresponding to the elements of the query.

  • Cheap, since ids stored sorted in the lists

Ranking Query

Instead considering a document as a $$|W|$$-length bit-vector indicating which terms it contains, the document can be represented by a $$|W|$$-length bit-vector $$w$$ indicating the relevance, e.g. frequency, of each term in the document.

Similarity:

$$ sim(d, q) = \frac{\vec{wd} \cdot \vec{w_q}}{|\vec{w_d}| |\vec{w_q}|} = \frac{\sum^N{i=1} w{d, i} \times w{q, i}}{\sqrt{\sum^N{i=1} w{d, i}^2} \sqrt{\sum^N{i=1} w{q, i}^2}}

$$

Define ranking of documents based on their similarity to the query.

Inverted File

Query processing:

Accumulate weights of documents for each term in query by accessing & merging the lists.

Also maintain a set of $$k$$ most similar documents to the query while measuring the similarity of documents.

String Matching

  • Indexing
    • Suffix tree
    • Suffix array
  • Variations
    • Approximate string matching: allows errors in matches
      • Applications
        • Information retrieval, computational biology, signal processing, etc.
      • Similarity metric
        • Edit distance: number of primitive operations (insert, delete, replace) necessary to transform one string to the other
          • String alignment
        • LCS distance
        • Hamming distance
        • Episode distance
        • Reversals
        • Block distance
        • q-gram distance

Suffix Tree

A suffix trie where unary paths are compressed.

  • Internal nodes: unique substring
  • Leaf nodes: position of the suffix compressed in path
  • Suffix link: used for fast construction

Query processing:

  1. Search for substring in query along some path, get a starting position
  2. Go to that position in the string, scan to check the occurrence of the substring

Suffix Array

To maintain the large suffix tree in memory, use suffix array instead of trees. It is a lexicographically sorted array of all (suffix, starting position) pairs in the string.

results matching ""

    No results matching ""