Word2vec

Main Idea

  • Algorithms
    • Skip-Grams (SG): predict context words given target (position independent)
    • Continuous Bag-of-Words (CBOW): predict target word from bag-of-words context
  • Training methods
    • Hierarchical Softmax
    • Negative Sampling

Algorithms

Skip-Grams

Softmax
  • Objective function

    $$ \text{Maximize} \ J'(θ) = \prod_{t=1}^{T} \prod_{-m \leq j \leq m,\, j \neq 0} p(w_{t+j}|w_{t}; θ) \ \rightarrow \ \text{Minimize} \ J(θ) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \leq j \leq m,\, j \neq 0} \log p(w_{t+j}|w_{t}; θ) $$

    $$ p(o|c) = \frac{e^{u_{o}^{T}v_{c}}}{\sum_{w=1}^{V}e^{u_{w}^{T}v_{c}}} $$

  • Gradient descent

    • Update each element of θ with step size α $$ θ_{j}^{new} = θ_{j}^{old} - \alpha \frac{\partial}{\partial θ_{j}^{old}}J(θ) $$
    • Matrix notation $$ θ^{new} = θ^{old} - \alpha \frac{\partial}{\partial θ^{old}}J(θ) \quad θ^{new} = θ^{old} - \alpha \nabla_{θ}J(θ) $$
    • Stochastic gradient descent (SGD): update parameters after each window t (see the sketch below) $$ θ^{new} = θ^{old} - \alpha \nabla_{θ}J_{t}(θ) $$
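
The update rules above can be made concrete with a small numpy sketch of the softmax probability p(o|c) and one SGD step for a single (center, outside) pair. The variable names (`U`, `V`, `center_id`, `outside_id`, `alpha`) and the toy vocabulary are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def softmax_prob(U, v_c, o):
    """p(o|c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = U @ v_c                     # dot product of every outside vector with v_c
    scores -= scores.max()               # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()

def sgd_step(U, V, center_id, outside_id, alpha=0.025):
    """One SGD step on -log p(o|c) for a single (center, outside) pair."""
    v_c = V[center_id]
    scores = U @ v_c
    scores -= scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

    # Gradients of the negative log-likelihood for this pair.
    grad_vc = U.T @ probs - U[outside_id]           # d(-log p)/d v_c
    grad_U = np.outer(probs, v_c)                   # d(-log p)/d u_w for every w
    grad_U[outside_id] -= v_c

    V[center_id] -= alpha * grad_vc                 # θ_new = θ_old - α * gradient
    U -= alpha * grad_U
    return -np.log(probs[outside_id])               # loss term for monitoring

# Toy usage: vocabulary of 10 words, 4-dimensional vectors.
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(10, 4))   # outside-word ("output") vectors
V = rng.normal(scale=0.1, size=(10, 4))   # center-word ("input") vectors
p_o_given_c = softmax_prob(U, V[3], 7)
loss = sgd_step(U, V, center_id=3, outside_id=7)
```
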
Negative Sampling

The full softmax is too computationally expensive, since its normalization sums over the entire vocabulary, and each window involves only a few word vectors, so the gradient updates are very sparse.

* Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al. 2013)

  • Objective function with k negative samples, U(w) being the unigram distribution: maximize the probability that the real outside word appears near the center word (first log term) and minimize the probability that randomly sampled words appear around the center word (second log term); a sketch follows the formulas below

    $$ J(θ) = \frac{1}{T} \sum_{t=1}^{T} J_{t}(θ) \quad J_{t}(θ) = \log\sigma(u_{o}^{T}v_{c}) + \sum_{i=1}^{k}\mathbb{E}_{j \sim P(w)} [\log\sigma(-u_{j}^{T}v_{c})] $$

    $$ \sigma(x) = \frac{1}{1 + e^{-x}} \quad \sigma(-x) = 1 - \sigma(x) \quad \sigma'(x) = \sigma(x)(1 - \sigma(x)) $$

    $$ P(w) = \frac{U(w)^{\frac{3}{4}}}{Z} $$
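
A minimal numpy sketch of one J_t(θ) term under negative sampling, with the k negatives drawn from the smoothed unigram distribution P(w) ∝ U(w)^(3/4). The toy unigram counts, vocabulary size, and variable names are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_term(u_o, v_c, U_neg):
    """J_t = log sigma(u_o . v_c) + sum_j log sigma(-u_j . v_c)."""
    positive = np.log(sigmoid(u_o @ v_c))              # real outside word should score high
    negative = np.log(sigmoid(-(U_neg @ v_c))).sum()   # sampled words should score low
    return positive + negative

# Smoothed unigram distribution P(w) = U(w)^(3/4) / Z used to draw negatives.
unigram_counts = np.array([50, 20, 10, 5, 5, 3, 2, 2, 2, 1], dtype=float)
P = unigram_counts ** 0.75
P /= P.sum()

rng = np.random.default_rng(0)
k = 5
neg_ids = rng.choice(len(P), size=k, p=P)   # k samples j ~ P(w); real code also avoids the true outside word

dim = 4
U = rng.normal(scale=0.1, size=(len(P), dim))   # outside-word vectors
V = rng.normal(scale=0.1, size=(len(P), dim))   # center-word vectors
J_t = negative_sampling_term(U[7], V[3], U[neg_ids])   # objective term to maximize
```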

Co-occurrence Matrix

  • Problems (a construction sketch follows this list)
    • Grows in size with the vocabulary
    • Very high dimensional: requires a lot of storage
    • Sparsity issues
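
To make the size and sparsity problems concrete, here is a minimal sketch that builds a window-based co-occurrence matrix from a toy corpus; the corpus and window size are made up for illustration.

```python
import numpy as np

def cooccurrence_matrix(sentences, window=2):
    """Count how often each word pair appears within `window` positions of each other."""
    vocab = sorted({w for sent in sentences for w in sent})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for sent in sentences:
        for t, word in enumerate(sent):
            lo, hi = max(0, t - window), min(len(sent), t + window + 1)
            for j in range(lo, hi):
                if j != t:
                    X[index[word], index[sent[j]]] += 1
    return X, vocab

# Toy corpus: X is |V| x |V|, grows with the vocabulary, and most entries stay zero.
corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
X, vocab = cooccurrence_matrix(corpus)
```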

Low Dimensional Vectors

Store most of the important information in a fixed, small number of dimensions: a dense vector.

* An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence (Rohde et al. 2005)

  • Dimension reduction
    • Singular value decomposition (SVD); see the sketch after this list
      • Problems
        • Computational cost: O(mn^2) for an n × m matrix
        • Hard to incorporate new words
        • Different learning regime from other deep learning models
  • Hacks
    • Function words are too frequent, so their counts dominate
      • Cap counts: min(X, t) with t ≈ 100
      • Or ignore them entirely
    • Use ramped windows that count closer words more
    • Use Pearson correlations instead of counts, then set negative values to 0
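
A sketch of the SVD-based dimension reduction discussed above: truncate the decomposition of a co-occurrence matrix and keep the top k singular directions as dense word vectors. The matrix here is a random stand-in (it could be the output of the earlier co-occurrence sketch), and the cap t and the choice of k are arbitrary.

```python
import numpy as np

# Stand-in for a |V| x |V| co-occurrence matrix (e.g. the output of the earlier sketch).
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(8, 8)).astype(float)

# Hack from the notes: cap counts so over-frequent function words do not dominate.
t = 100
X = np.minimum(X, t)

# Truncated SVD: keep the top-k singular directions as dense word vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vectors = U[:, :k] * s[:k]   # each row is a k-dimensional dense vector
```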

GloVe

* Global Vectors for Word Representation (Pennington et al. 2014)

  • Comparison: count-based vs. direct prediction methods

GloVe is the best of both worlds.

  • Objective function, with P_ij being the count in the co-occurrence matrix (a sketch follows the advantages list below)

    $$ J(θ) = \frac{1}{2} \sum_{i,j=1}^{W} f(P_{ij})(u_{i}^{T}v_{j} - \log P_{ij})^{2} $$

    $$ X_{final} = U + V $$

  • Advantages

    • Fast training
    • Scalable to huge corpora
    • Good performance even with small corpus, small vectors
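
A minimal sketch of the GloVe loss as written above (the full model of Pennington et al. also includes bias terms, omitted here), plus the weighting function f with the x_max = 100 and 3/4 exponent choices from the paper. The co-occurrence counts and vector dimensions are random stand-ins.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x): down-weights rare pairs and caps the influence of very frequent ones."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(U, V, P):
    """J = 1/2 * sum_{ij} f(P_ij) (u_i . v_j - log P_ij)^2 over nonzero P_ij."""
    i, j = np.nonzero(P)
    diff = np.einsum("nd,nd->n", U[i], V[j]) - np.log(P[i, j])
    return 0.5 * np.sum(glove_weight(P[i, j]) * diff ** 2)

rng = np.random.default_rng(0)
W, dim = 8, 4
P = rng.poisson(2.0, size=(W, W)).astype(float)   # stand-in co-occurrence counts
U = rng.normal(scale=0.1, size=(W, dim))
V = rng.normal(scale=0.1, size=(W, dim))
loss = glove_loss(U, V, P)
X_final = U + V   # final word vectors: sum of the two vector sets
```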
