Text Database
A collection of documents, each as a list of words.
Keyword-Based Search
- Queries
- Containment query
- Containment of keywords
- Ranking query
- Similarity of list of keywords
- Containment query
- Indexing
- Signature
- Inverted file
Containment Query
- Expressed with
AND
,OR
,NOT
- Stemming used to unify different endings of the same word
Evaluation
Naive approach
Scan through every document in the database. Expensive.
Sorted Set Comparison
Check if the set of keywords in the query $$s$$ is a subset of the keywords in each document $$t$$.
$$ s \subseteq t
$$
- Sort $$s$$ & $$t$$
- Scan the words of the two sets concurrently
Expensive, cost = sorting + 1 pass scanning
Indexing approach
Preprocess each document & extract words from them. Index the documents, then use the index to evaluate queries.
- Signature
- Inverted file
Signature
Approximate the set of keywords by signatures, which are bitmaps generated by some hash function.
Signature: a bitmap of length $$b$$ formed by setting bit $$H(e)$$ for each $$e \in s$$.
- $$W$$: domain of the set elements/vocabulary
- Superset of any document/query sets
- $$b$$: signature length
- $$H$$: maps each word in $$W$$ to a number in $$[0, b-1]$$
Signatures are lossy approximations, i.e. different sets may have the same signature. They are used as fast filters for set operations.
Filtering:
$$ \begin{aligned} x \subseteq y &\rightarrow sig(x) \subseteq sig(y) \Leftrightarrow sig(x) \land \neg sig(y) = 0\ x = y &\rightarrow sig(x) = sig(y)\ x \cap y \neq \emptyset &\rightarrow sig(x) \land sig(y) \neq 0 \end{aligned}
$$
Signature file: stores sequentially the signatures of all documents in a text database $$D$$.
- Updating signature files
- Similar to updating relations
Query processing:
- Find drops
- Compute $$sig(q)$$
- Scan through signature file to select qualified signatures
- Filter out false drops
- Refinement step comparing the documents with $$q$$
Signature-based indexes:
- Bit-sliced signature
- Only the columns where $$q$$'s signature has 1 are accessed
- Signature-tree
- Similar to R-tree
- Leaf nodes: signatures
- Internal nodes: logical OR of all signatures in the sub-tree
- Good for dynamic data
- Similar to R-tree
Supported Selection Queries
- Set containment selection (subset query)
- Inverse superset query also useful
- Set equality selection
- Set overlap selection
Inverted File
Each word $$e$$ has an inverted list of document-ids that contain $$e$$.
- Shown to be superior to signature-based techniques
- Best for static data
- But updates expensive, thus performed in batches
- Can be smaller in size, since can be easily compressed
- Run-length encoding: gaps between consecutive ids
- E.g.
{11, 24, 57, ...}
->{11, 13, 33, ...}
- Then use variable-length encoding for the gaps
- E.g. Golomb coding; a few bits suffice to encode the ids
- E.g.
- Run-length encoding: gaps between consecutive ids
Query processing:
Join the lists corresponding to the elements of the query.
- Cheap, since ids stored sorted in the lists
Ranking Query
Instead considering a document as a $$|W|$$-length bit-vector indicating which terms it contains, the document can be represented by a $$|W|$$-length bit-vector $$w$$ indicating the relevance, e.g. frequency, of each term in the document.
Similarity:
$$ sim(d, q) = \frac{\vec{wd} \cdot \vec{w_q}}{|\vec{w_d}| |\vec{w_q}|} = \frac{\sum^N{i=1} w{d, i} \times w{q, i}}{\sqrt{\sum^N{i=1} w{d, i}^2} \sqrt{\sum^N{i=1} w{q, i}^2}}
$$
Define ranking of documents based on their similarity to the query.
Inverted File
Query processing:
Accumulate weights of documents for each term in query by accessing & merging the lists.
Also maintain a set of $$k$$ most similar documents to the query while measuring the similarity of documents.
String Matching
- Indexing
- Suffix tree
- Suffix array
- Variations
- Approximate string matching: allows errors in matches
- Applications
- Information retrieval, computational biology, signal processing, etc.
- Similarity metric
- Edit distance: number of primitive operations (insert, delete, replace) necessary to transform one string to the other
- String alignment
- LCS distance
- Hamming distance
- Episode distance
- Reversals
- Block distance
- q-gram distance
- Edit distance: number of primitive operations (insert, delete, replace) necessary to transform one string to the other
- Applications
- Approximate string matching: allows errors in matches
Suffix Tree
A suffix trie where unary paths are compressed.
- Internal nodes: unique substring
- Leaf nodes: position of the suffix compressed in path
- Suffix link: used for fast construction
Query processing:
- Search for substring in query along some path, get a starting position
- Go to that position in the string, scan to check the occurrence of the substring
Suffix Array
To maintain the large suffix tree in memory, use suffix array instead of trees. It is a lexicographically sorted array of all (suffix, starting position)
pairs in the string.