A new proof of the equivalence of word2vec’s SGNS and Shifted PMI

[removed section]

At the heart of the argument was Levy and Goldberg’s proof that minimizing the loss of Skip-gram negative sampling (SGNS) effectively approximates a shifted PMI matrix. Starting from the log-likelihood, they worked their way to a local objective for each word-context pair and set its derivative to zero to arrive at a function of PMI. One might rightly ask whether the loss function is essential to this proof, or whether there is a deeper link between the two formulations.

My answer: Yes, there is.

Recall that negative sampling is all about the relative probability between two distributions: the positive D and the negative N. After sampling a word-context pair (w, c) from either distribution with the ratio D:N = 1:k, it is the model’s job to guess which distribution the pair came from. (UPDATE: it is perhaps more common to think of D and N as conditional distributions; see the alternative proof below for that case.) SGNS models the probability that a pair came from D simply as the inner product of their embeddings, passed through the sigmoid function:

p(D|w,c) = \sigma(w \cdot c)    (1)
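
For concreteness, here is a minimal sketch in Python (not word2vec’s actual code) of how SGNS scores a pair, together with the usual per-pair negative-sampling loss:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_positive(w_vec, c_vec):
    """Model estimate of p(D | w, c): the sigmoid of the dot product, equation (1)."""
    return sigmoid(np.dot(w_vec, c_vec))

def sgns_pair_loss(w_vec, c_vec, neg_vecs):
    """Negative log-likelihood for one observed pair plus k sampled negative contexts."""
    loss = -np.log(p_positive(w_vec, c_vec))
    for n_vec in neg_vecs:
        # each sampled negative should be judged as coming from N, i.e. 1 - p(D | w, c_n)
        loss -= np.log(1.0 - p_positive(w_vec, n_vec))
    return loss
```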

What is the real probability that (w, c) comes from D? Bayes’ rule is your friend:

p(D|w,c) = \frac{p(w,c|D)p(D)}{p(w,c|D)p(D) + p(w,c|N)p(N)}

word2vec, followed by later work, assumed the negative distribution N to be the product of the empirical distributions of words and contexts: p(w,c|N) = p(w|D)p(c|D). Since pairs are drawn from D and N in the ratio 1:k, we also have p(D) = \frac{1}{k+1} and p(N) = \frac{k}{k+1}, therefore:

p(D|w,c) = \frac{p(w,c|D)\frac{1}{k+1}}{p(w,c|D)\frac{1}{k+1} + p(w|D)p(c|D)\frac{k}{k+1}}

\Rightarrow p(D|w,c) = \frac{1}{1 + k\frac{p(w|D)p(c|D)}{p(w,c|D)}}

Magically, PMI appears before our eyes. Substituting \mathrm{PMI}(w,c) = \log \frac{p(w,c|D)}{p(w|D)p(c|D)} into the formula above, we have:

p(D|w,c) = \frac{1}{1 + ke^{-\mathrm{PMI}(w,c)}}    (2)

Comparing (1) and (2), and recalling that \sigma(x) = \frac{1}{1 + e^{-x}}, we arrive at w \cdot c = \mathrm{PMI}(w,c) - \log k. So the use of negative sampling in word2vec already specifies the matrix being factorized, regardless of the loss function.
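
Here is a toy numerical check of that identity (illustrative only; the co-occurrence counts below are made up): setting w \cdot c = \mathrm{PMI}(w,c) - \log k and plugging it into (1) reproduces exactly the posterior in (2).

```python
import numpy as np

k = 5                                    # negatives per positive pair
counts = np.array([[10., 2., 0.5],       # hypothetical co-occurrence counts,
                   [ 2., 8., 1.0],       # rows = words, columns = contexts
                   [ 0.5, 1., 6.0]])

p_wc = counts / counts.sum()             # joint p(w, c | D)
p_w = p_wc.sum(axis=1, keepdims=True)    # marginal p(w | D)
p_c = p_wc.sum(axis=0, keepdims=True)    # marginal p(c | D)

pmi = np.log(p_wc / (p_w * p_c))         # PMI(w, c)
dot = pmi - np.log(k)                    # the dot products SGNS should converge to

model = 1.0 / (1.0 + np.exp(-dot))       # equation (1) with the optimal dot products
bayes = 1.0 / (1.0 + k * np.exp(-pmi))   # equation (2), the true posterior
assert np.allclose(model, bayes)         # the two coincide for every (w, c)
```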

With this understanding, the comparison between once-popular distributional semantic models (DSMs) and SGNS makes sense, because the only difference is how the matrix is factorized. While SVD, NMF and NNSE treat all words equally, SGNS is akin to weighted matrix factorization in which frequent words are weighted more than infrequent ones (see Levy & Goldberg, 2014). Taking Zipf’s law into account, this difference is substantial and can explain the superior performance of word2vec.
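
For reference, here is a rough sketch (with made-up inputs, loosely following Levy & Goldberg’s SVD baseline rather than any official implementation) of the explicit, unweighted alternative: build the shifted positive PMI matrix and factorize it with a plain truncated SVD.

```python
import numpy as np

def sppmi_svd_embeddings(counts, k=5, dim=2):
    """Shifted positive PMI from co-occurrence counts, factorized by unweighted SVD."""
    p_wc = counts / counts.sum()
    p_w = p_wc.sum(axis=1, keepdims=True)
    p_c = p_wc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore"):        # zero counts give PMI = -inf
        pmi = np.log(p_wc / (p_w * p_c))
    sppmi = np.maximum(pmi - np.log(k), 0.0)  # shift by log k and clip at zero
    u, s, _ = np.linalg.svd(sppmi)
    return u[:, :dim] * np.sqrt(s[:dim])      # one common choice of word vectors
```

Note that this SVD treats every cell of the matrix equally, whereas SGNS visits frequent pairs more often during training, which is exactly the weighting difference discussed above.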

Alternative proof (conditional distribution)

Treat D and N as conditional distributions and assume the noise distribution simply ignores the context:

p(w|c, N) = p(w|D)

As before, the model is

p(D|w,c) = \sigma(w \cdot c)

while Bayes’ rule gives:

p(D|w,c) = \frac{p(w|c,D)p(D)}{p(w|c,D)p(D)+p(w|c,N)p(N)} =\frac{p(w|c,D)\frac{1}{k+1}}{p(w|c,D)\frac{1}{k+1}+p(w|D)\frac{k}{k+1}} = \frac{1}{1+k\frac{p(w|D)}{p(w|c,D)}} = \frac{1}{1+k\frac{p(w|D)p(c|D)}{p(w,c|D)}}

Comparing the two formulas above, we come to the same conclusion.

Note on sampling

It has never been made explicit what noise distribution word2vec samples from. (Am I the only one who finds it strange to sample without saying a word about the distribution?)

However, we know that there is one. If you sample negatives according to word frequency, you can no longer say that the noise distribution is p(w|c,N) = \frac{1}{|V|}. By adopting a sampling scheme, you commit to a specific form of noise distribution; it is not just any distribution, it is supposed to be one particular distribution.

While I can’t say what Mikolov and colleagues had in mind when they created word2vec, and p(w|c,N) = p(w|D) is not the only possible choice, it is certainly one that licenses the sampling scheme used in word2vec.

Mikolov et al. did propose concrete forms of N: the unigram distribution U(w) (which is equivalent to p(w|D) in this blog post) and U(w)^{\frac{3}{4}}, renormalized.
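
To illustrate (with made-up word frequencies), raising the unigram distribution to the power 3/4 and renormalizing flattens it, so rare words are sampled as negatives somewhat more often than their raw frequency would suggest:

```python
import numpy as np

counts = np.array([1000., 100., 10., 1.])   # hypothetical word frequencies

unigram = counts / counts.sum()             # U(w), i.e. the empirical p(w|D)
smoothed = counts ** 0.75
smoothed /= smoothed.sum()                  # U(w)^(3/4), renormalized

print(unigram)    # ~[0.900, 0.090, 0.009, 0.0009]
print(smoothed)   # rare words get relatively more probability mass
```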


7 thoughts on “A new proof of the equivalence of word2vec’s SGNS and Shifted PMI”

  1. How are you claiming pmi(w,c) = log p(w,c|D=1)/p(w|D=1)p(c|D=1)? The definition I found was that pmi(w,c) = log p(w,c)/p(w)p(c). I tried to verify the equivalence between the definition I found and the claim in this post and I could not verify it quickly. Am I missing something?


    • Hi Harrison, I think you’re making it more complicated than it really is. pmi is defined on true observed data which must come from a distribution. That distribution we call D. In a context where there’s no other distribution, we don’t need to specify it — that’s why pmi is usually written as pmi(w,c) = log p(w,c)/p(w)p(c). In our case, there are D and N, therefore we need “|D” which simply means “I’m talking about the real-world distribution, not the noise distribution”.

