A new proof of the equivalence of word2vec’s SGNS and Shifted PPMI

[removed section]

At the heart of the argument was Levy and Goldberg’s proof that minimizing the loss of Skip-gram negative sampling (SGNS) is effectively approximating a shifted PPMI matrix. Starting with the log-likelihood, they worked their way to local objective for each word-context pair and compare its derivative to zero to arrive at a function of PPMI. One might rightly question if the loss function is essential in this proof or there is a deeper link between the two formalizations?

My answer: Yes, there is.

Recall that negative sampling is all about relative probability between two distribution: positive D and negative N. After sampling a word-context pair (w, c) from either distribution with the ratio D:N=1:k, it is the job of a model to guess which distribution the pair came from. (UPDATE: It is perhaps more prevalent to think of D and N as conditional distributions, see alternative proof for this case.) SGNS models the probability that a pair came from D simply by the inner product of their embeddings, transformed by sigmoid function:

p(D|w,c) = \sigma(w \cdot c)    (1)

What is the real probability that (w, c) comes from D? Bayes rule is your friend:

p(D|w,c) = \frac{p(w,c|D)p(D)}{p(w,c|D)p(D) + p(w,c|N)p(N)}

word2vec, followed by later work, assumed the negative distribution N to be the product of empirical distribution of words and contexts: p(w,c|N) = p(w|D)p(c|D) , therefore:

p(D|w,c) = \frac{p(w,c|D)\frac{1}{k+1}}{p(w,c|D)\frac{1}{k+1} + p(w|D)p(c|D)\frac{k}{k+1}}

\Rightarrow p(D|w,c) = \frac{1}{1 + k\frac{p(w|D)p(c|D)}{p(w,c|D)}}

Magically, PMI appears before our eyes. Substitute \mathrm{PMI}(w,c) = \log \frac{p(w,c|D)}{p(w|D)p(c|D)} into the above formula, we have:

p(D|w,c) = \frac{1}{1 + ke^{-\mathrm{PMI}(w,c)}}    (2)

Compare (1) and (2), we arrive at  w \cdot c = \mathrm{PMI}(w,c) - \log k . So the use of negative sampling in word2vec specified the matrix it factorizes, regardless of the loss function.

With this understanding, the comparison between once-popular DSMs and SGNS makes sense because the only difference is matrix factorization. While SVD, NMF and NNSE treat all words equally, SGNS is akin to weighted matrix factorization where frequent words are weighted more than infrequent ones (see Levy & Goldberg, 2014). Take into account Zipf law, this difference is substantial and can explain the superior performance of word2vec.

Alternative proof (conditional distribution)

p(w|c, N) = p(w|D)

p(D|w,c) = \sigma(w,c)

p(D|w,c) = \frac{p(w|c,D)p(D)}{p(w|c,D)p(D)+p(w|c,N)p(N)} =\frac{p(w|c,D)\frac{1}{k+1}}{p(w|c,D)\frac{1}{k+1}+p(w|D)\frac{k}{k+1}} = \frac{1}{1+k\frac{p(w|D)}{p(w|c,D)}} = \frac{1}{1+k\frac{p(w|D)p(c|D)}{p(w,c|D)}}

Compare two formulas above, we come to the same conclusion.

Note on sampling

It has never been made explicit what noise distribution word2vec samples from. (Is anybody like me in finding it strange to sampling without saying a word about the distribution?)

However, we know that there is one. If one uses word frequency, she can’t say that the noise distribution is p(w|c,N)=\frac{1}{|V|} any more. By adopting a sampling scheme you commit to a specific form of noise distribution. It is not just any distribution, it is supposed to be a certain distribution.

While I can’t say what Mikolov and colleagues had in mind when they created word2vec, and p(w|c,N) = p(w|D) is not the only possible distribution, it is certainly one that licenses the sampling scheme in word2vec.

Mikolov et al. did propose concrete form of N: U(w) (which is equivalent to p(w|D) in this blog post) and U(w)^{\frac{3}{4}} .


One thought on “A new proof of the equivalence of word2vec’s SGNS and Shifted PPMI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s