Word2vec
Main Idea
- Algorithms
    - Skip-Grams (SG): predict context words given target (position independent)
    - Continuous Bag-of-Words (CBOW): predict target word from bag-of-words context
- Training methods
    - Hierarchical Softmax
    - Negative Sampling
Algorithms
Skip-Grams
Softmax
Objective function
$$ Maximize \ J'(θ) = \prod_{t=1}^{T} \prod_{-m \leq j \leq m, j \neq 0} p(w_{t+j}|w_{t}; θ) \ \rightarrow \ Minimize \ J(θ) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \leq j \leq m, j \neq 0} log~p(w_{t+j}|w_{t}; θ) $$
$$ p(o|c) = \frac{e^{u_{o}^{T}v_{c}}}{\sum_{w=1}^{V}e^{u_{w}^{T}v_{c}}} $$
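A minimal numpy sketch of the softmax probability above; the vocabulary size, dimensionality, and vectors are made-up toy values, not part of the original notes.

```python
import numpy as np

def softmax_prob(U, v_c, o):
    """p(o | c): U is the V x d matrix of outside ("output") vectors u_w,
    v_c is the d-dim center vector, o is the index of the outside word."""
    scores = U @ v_c                 # u_w^T v_c for every word w in the vocabulary
    scores -= scores.max()           # stabilize the exponentials
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()

# Toy example: vocabulary of 5 words, 3-dimensional embeddings.
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))          # outside-word vectors
v_c = rng.normal(size=3)             # center-word vector
print(softmax_prob(U, v_c, o=2))
```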
Gradient descent
- Updates for each element of θ with step size α
$$ θ_{j}^{new} = θ_{j}^{old} - \alpha \frac{\partial}{\partial θ_{j}^{old}}J(θ) $$
- Matrix notation
$$ θ^{new} = θ^{old} - \alpha \frac{\partial}{\partial θ^{old}}J(θ) \ θ^{new} = θ^{old} - \alpha \nabla_{θ}J(θ) $$
- Stochastic gradient descent (SGD): update parameters after each window t
$$ θ^{new} = θ^{old} - \alpha \nabla_{θ}J_{t}(θ) $$
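A toy numpy sketch of the SGD update above; `grad_fn` here is a hypothetical stand-in, not the real word2vec gradient (which is nonzero only for the vectors of words in the current window, so the update is sparse).

```python
import numpy as np

def sgd_step(theta, grad_fn, window, alpha=0.025):
    """One stochastic update: theta_new = theta_old - alpha * grad J_t(theta),
    where J_t is the objective for a single window t."""
    return theta - alpha * grad_fn(theta, window)

# Illustration only: a made-up gradient function and parameter vector.
theta = np.zeros(10)
fake_grad = lambda th, win: th - win      # hypothetical gradient for demo purposes
theta = sgd_step(theta, fake_grad, window=np.ones(10))
```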
Negative Sampling
The full softmax is too computationally expensive (its normalization term sums over the entire vocabulary), and the gradient updates are sparse: only the vectors of words that appear in a window change.
* Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al. 2013)
Objective function with k negative samples, U(w) being the unigram distribution: maximize the probability that a real outside word appears (first log), minimize the probability that a random word appears around the center word (second log)
$$ J(θ) = \frac{1}{T} \sum_{t=1}^{T}J_{t}(θ) \ J_{t}(θ) = log~\sigma(u_{o}^{T}v_{c}) + \sum_{i=1}^{k}\mathbb{E}_{j \sim P(w)} [log~\sigma(-u_{j}^{T}v_{c})] $$
$$ \sigma(x) = \frac{1}{1 + e^{-x}} \ \sigma(-x) = 1 - \sigma(x) \ \sigma'(x) = \sigma(x)(1 - \sigma(x)) $$
$$ P(w) = \frac{U(w)^{\frac{3}{4}}}{Z} $$
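A minimal numpy sketch of the negative-sampling objective and the P(w) ∝ U(w)^{3/4} sampler; the vector dimensions, counts, and k are made-up toy values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(u_o, v_c, U_neg):
    """Negated J_t(theta) for one (center, outside) pair with k sampled negatives:
    log sigma(u_o^T v_c) + sum_j log sigma(-u_j^T v_c), sign-flipped for minimization."""
    pos = np.log(sigmoid(u_o @ v_c))
    neg = np.log(sigmoid(-U_neg @ v_c)).sum()
    return -(pos + neg)

def unigram_sampler(counts, k, rng):
    """Draw k negative word indices from P(w) proportional to U(w)^(3/4)."""
    probs = counts ** 0.75
    probs /= probs.sum()
    return rng.choice(len(counts), size=k, p=probs)

# Toy usage with made-up counts and 3-dimensional vectors.
rng = np.random.default_rng(0)
counts = np.array([50.0, 10.0, 5.0, 1.0])
negatives = unigram_sampler(counts, k=2, rng=rng)
u_o, v_c = rng.normal(size=3), rng.normal(size=3)
U_neg = rng.normal(size=(2, 3))
print(neg_sampling_loss(u_o, v_c, U_neg))
```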
Co-occurrence Matrix
- Problems
    - Increases in size with the vocabulary
    - Very high dimensional: requires a lot of storage
    - Sparsity issues
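A small sketch of how such a word-word co-occurrence matrix can be built; the toy corpus and window size are made up, and a real matrix would be |V| × |V| and mostly zeros, which is exactly the problem listed above.

```python
import numpy as np

def cooccurrence_matrix(tokens, vocab, window=2):
    """Symmetric word-word co-occurrence counts X from a token list."""
    idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for t, w in enumerate(tokens):
        for j in range(max(0, t - window), min(len(tokens), t + window + 1)):
            if j != t:
                X[idx[w], idx[tokens[j]]] += 1
    return X

# Tiny toy corpus.
tokens = "i like deep learning i like nlp".split()
vocab = sorted(set(tokens))
X = cooccurrence_matrix(tokens, vocab)
```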
Low Dimensional Vectors
Store most of the important information in a fixed, small number of dimensions: a dense vector.
* An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence (Rohde et al. 2005)
- Dimension reduction
    - Singular value decomposition (SVD); see the sketch at the end of this section
        - Problems
            - Computational cost: O(mn^2)
            - Hard to incorporate new words
            - Different learning regime from other DL models
- Hacks
    - Function words are too frequent
        - Cap the counts: min(X, t) with t ≈ 100
        - Ignore them
    - Ramped windows that count closer words more
    - Use Pearson correlation instead of counts, then set negative values to 0
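A minimal numpy sketch of the SVD route above, including the min(X, t) count cap; the matrix values, k, and the cap default are made-up toy choices.

```python
import numpy as np

def svd_word_vectors(X, k=2, cap=100):
    """Reduce a co-occurrence count matrix to k-dim dense word vectors via
    truncated SVD; the cap implements the min(X, t) hack for frequent words."""
    X = np.minimum(X, cap)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # cost scales as O(mn^2)
    return U[:, :k] * s[:k]                            # one dense row per word

# Toy 4-word count matrix (made-up numbers).
X = np.array([[0, 3, 1, 0],
              [3, 0, 2, 1],
              [1, 2, 0, 5],
              [0, 1, 5, 0]], dtype=float)
vectors = svd_word_vectors(X, k=2)
```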
GloVe
* Global Vectors for Word Representation (Pennington et al. 2014)
- Comparison: count-based vs. direct prediction methods
GloVe is the best of both worlds.
Objective function, with P_ij being the count in the co-occurrence matrix:
$$ J(θ) = \frac{1}{2} \sum_{i,j=1}^{W}f(P_{ij})(u_{i}^{T}v_{j} - log~P_{ij})^{2} $$
The final word vectors are the sum of the two sets of vectors:
$$ X_{final} = U + V $$
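A toy numpy sketch of the GloVe objective as written above (bias terms omitted, following the formula here); since f is not spelled out in these notes, the sketch assumes the weighting function from the GloVe paper with x_max = 100 and α = 3/4. All vectors and counts are made-up toy values.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(P_ij): down-weights rare pairs, caps very frequent ones."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(U, V, P):
    """J = 1/2 * sum_ij f(P_ij) (u_i^T v_j - log P_ij)^2 over nonzero entries of P."""
    i, j = np.nonzero(P)
    diff = np.einsum("nd,nd->n", U[i], V[j]) - np.log(P[i, j])
    return 0.5 * np.sum(glove_weight(P[i, j]) * diff ** 2)

# Toy usage: 4 words, 2-dimensional vectors, made-up co-occurrence counts.
rng = np.random.default_rng(0)
U, V = rng.normal(size=(4, 2)), rng.normal(size=(4, 2))
P = np.array([[0, 2, 1, 0], [2, 0, 3, 1], [1, 3, 0, 4], [0, 1, 4, 0]], float)
print(glove_loss(U, V, P))
```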
Advantages
- Fast training
- Scalable to huge corpora
- Good performance even with a small corpus and small vectors