Skip-gram negative sampling as (unshifted) PMI matrix factorization

In the previous post, we arrived at two formulas showing the equivalence between SGNS and shifted PMI:

$p(D|w,c) = \sigma(w \cdot c) = \frac{1}{1 + e^{-w \cdot c}}$    (1)

$p(D|w,c) = \frac{1}{1 + ke^{-\mathrm{PMI}(w,c)}}$    (2)

Apparently, the reason for the “shift” is that (2) contains the number of negative samples $k$ while (1) does not. The “shift” is not just an ugly patch in the formula: it may also hurt the quality of the learned embeddings.
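To make the shift explicit, equate (1) and (2) (a short derivation restating the result of the previous post):

$\frac{1}{1 + e^{-w \cdot c}} = \frac{1}{1 + ke^{-\mathrm{PMI}(w,c)}} \implies e^{-w \cdot c} = ke^{-\mathrm{PMI}(w,c)} \implies w \cdot c = \mathrm{PMI}(w,c) - \log k$

So at the optimum, SGNS factorizes the PMI matrix shifted down by $\log k$; the modification below removes that shift, so that $w \cdot c = \mathrm{PMI}(w,c)$.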

The solution is easy: replace $\sigma(x)$ with $\sigma_k(x)=\frac{1}{1+ke^{-x}}$, where $k$ is the number of negative samples. This translates into modifying line #698 of word2vec.c in Google’s original source code to:

    expTable[i] = expTable[i] / (expTable[i] + negative);

This small change gives a small but consistent improvement on similarity benchmarks such as SimLex-999 (figure). Not too bad, right? The source code is available in a Bitbucket repository.