Skip-gram negative sampling as (unshifted) PMI matrix factorization

In the previous post, we arrived at two formulas showing the equivalence between SGNS and shifted PMI:

p(D|w,c) = \sigma(w \cdot c) = \frac{1}{1 + e^{-w \cdot c}}    (1)

p(D|w,c) = \frac{1}{1 + ke^{-\mathrm{PMI}(w,c)}}    (2)
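To see where the shift comes from, one can equate the right-hand sides of (1) and (2) and solve for the dot product. This is only a sketch of the algebra, assuming that k denotes the number of negative samples per positive pair, as is usual for SGNS:

\frac{1}{1 + e^{-w \cdot c}} = \frac{1}{1 + ke^{-\mathrm{PMI}(w,c)}}

e^{-w \cdot c} = ke^{-\mathrm{PMI}(w,c)} = e^{-(\mathrm{PMI}(w,c) - \log k)}

w \cdot c = \mathrm{PMI}(w,c) - \log k

So, under that assumption, the dot product matches PMI only up to the constant offset \log k, which vanishes when k = 1.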

Apparently, the reason for the “shift” is that in (1) there is no k, while in (2) there is. The “shift” is not just an ugly patch in the formula; it might also have a negative effect on the quality of the learned embeddings.