Text Database

A collection of documents, each as a list of words.

Keyword-Based Search

Queries
- Containment query
  - Containment of keywords
- Ranking query
  - Similarity of list of keywords
Indexing
- Signature
- Inverted file

Containment Query

Expressed with AND, OR, NOT
Stemming used to unify different endings of the same word

Evaluation

Naive approach

Scan through every document in the database. Expensive.

Sorted Set Comparison

Check if the set of keywords in the query $$s$$ is a subset of the keywords in each document $$t$$.

$$ s \subseteq t

Sort $$s$$ & $$t$$
Scan the words of the two sets concurrently

Expensive, cost = sorting + 1 pass scanning

Indexing approach

Preprocess each document & extract words from them. Index the documents, then use the index to evaluate queries.

Signature
Inverted file

Signature

Approximate the set of keywords by signatures, which are bitmaps generated by some hash function.

Signature: a bitmap of length $$b$$ formed by setting bit $$H(e)$$ for each $$e \in s$$.

$$W$$: domain of the set elements/vocabulary
- Superset of any document/query sets
$$b$$: signature length
$$H$$: maps each word in $$W$$ to a number in $$[0, b-1]$$

Signatures are lossy approximations, i.e. different sets may have the same signature. They are used as fast filters for set operations.

Filtering:

$$ \begin{aligned} x \subseteq y &\rightarrow sig(x) \subseteq sig(y) \Leftrightarrow sig(x) \land \neg sig(y) = 0\ x = y &\rightarrow sig(x) = sig(y)\ x \cap y \neq \emptyset &\rightarrow sig(x) \land sig(y) \neq 0 \end{aligned}

Signature file: stores sequentially the signatures of all documents in a text database $$D$$.

Updating signature files
- Similar to updating relations

Query processing:

Find drops
1. Compute $$sig(q)$$
2. Scan through signature file to select qualified signatures
Filter out false drops
1. Refinement step comparing the documents with $$q$$

Signature-based indexes:

Bit-sliced signature
- Only the columns where $$q$$'s signature has 1 are accessed
Signature-tree
- Similar to R-tree
  - Leaf nodes: signatures
  - Internal nodes: logical OR of all signatures in the sub-tree
- Good for dynamic data

Supported Selection Queries

Set containment selection (subset query)

Inverse superset query also useful

Set equality selection

Set overlap selection

Inverted File

Each word $$e$$ has an inverted list of document-ids that contain $$e$$.

Shown to be superior to signature-based techniques
Best for static data
- But updates expensive, thus performed in batches
Can be smaller in size, since can be easily compressed
- Run-length encoding: gaps between consecutive ids
  - E.g. {11, 24, 57, ...} -> {11, 13, 33, ...}
  - Then use variable-length encoding for the gaps
    - E.g. Golomb coding; a few bits suffice to encode the ids

Query processing:

Join the lists corresponding to the elements of the query.

Cheap, since ids stored sorted in the lists

Ranking Query

Instead considering a document as a $$|W|$$-length bit-vector indicating which terms it contains, the document can be represented by a $$|W|$$-length bit-vector $$w$$ indicating the relevance, e.g. frequency, of each term in the document.

Similarity:

$$ sim(d, q) = \frac{\vec{wd} \cdot \vec{w_q}}{|\vec{w_d}| |\vec{w_q}|} = \frac{\sum^N{i=1} w{d, i} \times w{q, i}}{\sqrt{\sum^N{i=1} w{d, i}^2} \sqrt{\sum^N{i=1} w{q, i}^2}}

Define ranking of documents based on their similarity to the query.

Inverted File

Query processing:

Accumulate weights of documents for each term in query by accessing & merging the lists.

Also maintain a set of $$k$$ most similar documents to the query while measuring the similarity of documents.

String Matching

Indexing
- Suffix tree
- Suffix array
Variations
- Approximate string matching: allows errors in matches
  - Applications
    - Information retrieval, computational biology, signal processing, etc.
  - Similarity metric
    - Edit distance: number of primitive operations (insert, delete, replace) necessary to transform one string to the other
      - String alignment
    - LCS distance
    - Hamming distance
    - Episode distance
    - Reversals
    - Block distance
    - q-gram distance

Suffix Tree

A suffix trie where unary paths are compressed.

Internal nodes: unique substring
Leaf nodes: position of the suffix compressed in path
Suffix link: used for fast construction

Query processing:

Search for substring in query along some path, get a starting position
Go to that position in the string, scan to check the occurrence of the substring

Suffix Array

To maintain the large suffix tree in memory, use suffix array instead of trees. It is a lexicographically sorted array of all (suffix, starting position) pairs in the string.

Text Database

Text Database

Keyword-Based Search

Containment Query

Evaluation

Naive approach

Sorted Set Comparison

Indexing approach

Signature

Inverted File

Ranking Query

Inverted File

String Matching

Suffix Tree

Suffix Array

results matching ""

No results matching ""