Word2vec
Main Idea
- Algorithms
    - Skip-Grams (SG): predict context words given target (position independent)
    - Continuous Bag-of-Words (CBOW): predict target word from bag-of-words context
- Training methods
    - Hierarchical Softmax
    - Negative Sampling
Algorithms
Skip-Grams
Softmax
Objective function
$$ Maximize \ J'(θ) = \prod_{t=1}^{T} \prod_{-m \leq j \leq m, j \neq 0} p(w_{t+j}|w_{t}; θ) \ \rightarrow \ Minimize \ J(θ) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \leq j \leq m, j \neq 0} log~p(w_{t+j}|w_{t}; θ) $$
$$ p(o|c) = \frac{e^{u_{o}^{T}v_{c}}}{\sum_{w=1}^{V}e^{u_{w}^{T}v_{c}}} $$
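A minimal numpy sketch of the softmax probability above; the vocabulary size, dimensionality, and vectors are made-up toy values, not part of the original notes.

```python
import numpy as np

def softmax_prob(U, v_c, o):
    """p(o | c): U is the V x d matrix of outside ("output") vectors u_w,
    v_c is the d-dim center vector, o is the index of the outside word."""
    scores = U @ v_c                 # u_w^T v_c for every word w in the vocabulary
    scores -= scores.max()           # stabilize the exponentials
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()

# Toy example: vocabulary of 5 words, 3-dimensional embeddings.
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))          # outside-word vectors
v_c = rng.normal(size=3)             # center-word vector
print(softmax_prob(U, v_c, o=2))
```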
Gradient descent
- Updates for each element of θ with step size α
$$ θ_{j}^{new} = θ_{j}^{old} - \alpha \frac{\partial}{\partial θ_{j}^{old}}J(θ) $$
- Matrix notation
$$ θ^{new} = θ^{old} - \alpha \frac{\partial}{\partial θ^{old}}J(θ) \ θ^{new} = θ^{old} - \alpha \nabla_{θ}J(θ) $$
- Stochastic gradient descent (SGD): update parameters after each window t
$$ θ^{new} = θ^{old} - \alpha \nabla_{θ}J_{t}(θ) $$
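A toy numpy sketch of the SGD update above; `grad_fn` here is a hypothetical stand-in, not the real word2vec gradient (which is nonzero only for the vectors of words in the current window, so the update is sparse).

```python
import numpy as np

def sgd_step(theta, grad_fn, window, alpha=0.025):
    """One stochastic update: theta_new = theta_old - alpha * grad J_t(theta),
    where J_t is the objective for a single window t."""
    return theta - alpha * grad_fn(theta, window)

# Illustration only: a made-up gradient function and parameter vector.
theta = np.zeros(10)
fake_grad = lambda th, win: th - win      # hypothetical gradient for demo purposes
theta = sgd_step(theta, fake_grad, window=np.ones(10))
```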
Negative Sampling
The full softmax is too computationally expensive (its normalization term sums over the entire vocabulary), and the gradient updates are sparse: only the vectors of words that appear in a window change.
* Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al. 2013)
Objective function with k negative samples, U(w) being the unigram distribution: maximize the probability that a real outside word appears (first log), minimize the probability that a random word appears around the center word (second log)
$$ J(θ) = \frac{1}{T} \sum_{t=1}^{T}J_{t}(θ) \ J_{t}(θ) = log~\sigma(u_{o}^{T}v_{c}) + \sum_{i=1}^{k}\mathbb{E}_{j \sim P(w)} [log~\sigma(-u_{j}^{T}v_{c})] $$
$$ \sigma(x) = \frac{1}{1 + e^{-x}} \ \sigma(-x) = 1 - \sigma(x) \ \sigma'(x) = \sigma(x)(1 - \sigma(x)) $$
$$ P(w) = \frac{U(w)^{\frac{3}{4}}}{Z} $$
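A minimal numpy sketch of the negative-sampling objective and the P(w) ∝ U(w)^{3/4} sampler; the vector dimensions, counts, and k are made-up toy values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(u_o, v_c, U_neg):
    """Negated J_t(theta) for one (center, outside) pair with k sampled negatives:
    log sigma(u_o^T v_c) + sum_j log sigma(-u_j^T v_c), sign-flipped for minimization."""
    pos = np.log(sigmoid(u_o @ v_c))
    neg = np.log(sigmoid(-U_neg @ v_c)).sum()
    return -(pos + neg)

def unigram_sampler(counts, k, rng):
    """Draw k negative word indices from P(w) proportional to U(w)^(3/4)."""
    probs = counts ** 0.75
    probs /= probs.sum()
    return rng.choice(len(counts), size=k, p=probs)

# Toy usage with made-up counts and 3-dimensional vectors.
rng = np.random.default_rng(0)
counts = np.array([50.0, 10.0, 5.0, 1.0])
negatives = unigram_sampler(counts, k=2, rng=rng)
u_o, v_c = rng.normal(size=3), rng.normal(size=3)
U_neg = rng.normal(size=(2, 3))
print(neg_sampling_loss(u_o, v_c, U_neg))
```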
Co-occurrence Matrix
- Problems
    - Increases in size with the vocabulary
    - Very high dimensional: requires a lot of storage
    - Sparsity issues
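A small sketch of how such a word-word co-occurrence matrix can be built; the toy corpus and window size are made up, and a real matrix would be |V| × |V| and mostly zeros, which is exactly the problem listed above.

```python
import numpy as np

def cooccurrence_matrix(tokens, vocab, window=2):
    """Symmetric word-word co-occurrence counts X from a token list."""
    idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for t, w in enumerate(tokens):
        for j in range(max(0, t - window), min(len(tokens), t + window + 1)):
            if j != t:
                X[idx[w], idx[tokens[j]]] += 1
    return X

# Tiny toy corpus.
tokens = "i like deep learning i like nlp".split()
vocab = sorted(set(tokens))
X = cooccurrence_matrix(tokens, vocab)
```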
Low Dimensional Vectors
Store most of the important information in a fixed, small number of dimensions: a dense vector.
* An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence (Rohde et al. 2005)
- Dimension reduction
    - Singular value decomposition (SVD); see the sketch at the end of this section
        - Problems
            - Computational cost: O(mn^2)
            - Hard to incorporate new words
            - Different learning regime from other DL models
- Hacks
    - Function words are too frequent
        - Cap the counts: min(X, t) with t ≈ 100
        - Ignore them
    - Ramped windows that count closer words more
    - Use Pearson correlation instead of counts, then set negative values to 0
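A minimal numpy sketch of the SVD route above, including the min(X, t) count cap; the matrix values, k, and the cap default are made-up toy choices.

```python
import numpy as np

def svd_word_vectors(X, k=2, cap=100):
    """Reduce a co-occurrence count matrix to k-dim dense word vectors via
    truncated SVD; the cap implements the min(X, t) hack for frequent words."""
    X = np.minimum(X, cap)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # cost scales as O(mn^2)
    return U[:, :k] * s[:k]                            # one dense row per word

# Toy 4-word count matrix (made-up numbers).
X = np.array([[0, 3, 1, 0],
              [3, 0, 2, 1],
              [1, 2, 0, 5],
              [0, 1, 5, 0]], dtype=float)
vectors = svd_word_vectors(X, k=2)
```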
GloVe
* Global Vectors for Word Representation (Pennington et al. 2014)
- Comparison: count-based vs. direct prediction methods
GloVe is the best of both worlds.
Objective function, with P_ij being the count in the co-occurrence matrix:
$$ J(θ) = \frac{1}{2} \sum_{i,j=1}^{W}f(P_{ij})(u_{i}^{T}v_{j} - log~P_{ij})^{2} $$
The final word vectors are the sum of the two sets of vectors:
$$ X_{final} = U + V $$
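A toy numpy sketch of the GloVe objective as written above (bias terms omitted, following the formula here); since f is not spelled out in these notes, the sketch assumes the weighting function from the GloVe paper with x_max = 100 and α = 3/4. All vectors and counts are made-up toy values.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(P_ij): down-weights rare pairs, caps very frequent ones."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(U, V, P):
    """J = 1/2 * sum_ij f(P_ij) (u_i^T v_j - log P_ij)^2 over nonzero entries of P."""
    i, j = np.nonzero(P)
    diff = np.einsum("nd,nd->n", U[i], V[j]) - np.log(P[i, j])
    return 0.5 * np.sum(glove_weight(P[i, j]) * diff ** 2)

# Toy usage: 4 words, 2-dimensional vectors, made-up co-occurrence counts.
rng = np.random.default_rng(0)
U, V = rng.normal(size=(4, 2)), rng.normal(size=(4, 2))
P = np.array([[0, 2, 1, 0], [2, 0, 3, 1], [1, 3, 0, 4], [0, 1, 4, 0]], float)
print(glove_loss(U, V, P))
```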
Advantages
- Fast training
- Scalable to huge corpora
- Good performance even with a small corpus and small vectors