At the heart of the argument was Levy and Goldberg’s proof that minimizing the loss of Skip-gram negative sampling (SGNS) is effectively approximating a shifted PMI matrix. Starting with the log-likelihood, they worked their way to local objective for each word-context pair and compare its derivative to zero to arrive at a function of PMI. One might rightly question if the loss function is essential in this proof or there is a deeper link between the two formalizations?
My answer: Yes, there is.
Recall that negative sampling is all about relative probability between two distribution: positive D and negative N. After sampling a word-context pair (w, c) from either distribution with the ratio D:N=1:k, it is the job of a model to guess which distribution the pair came from. (UPDATE: It is perhaps more prevalent to think of D and N as conditional distributions, see alternative proof for this case.) SGNS models the probability that a pair came from D simply by the inner product of their embeddings, transformed by sigmoid function:
What is the real probability that (w, c) comes from D? Bayes rule is your friend:
word2vec, followed by later work, assumed the negative distribution N to be the product of empirical distribution of words and contexts: , therefore:
Magically, PMI appears before our eyes. Substitute into the above formula, we have:
Compare (1) and (2), we arrive at . So the use of negative sampling in word2vec specified the matrix it factorizes, regardless of the loss function.
With this understanding, the comparison between once-popular DSMs and SGNS makes sense because the only difference is matrix factorization. While SVD, NMF and NNSE treat all words equally, SGNS is akin to weighted matrix factorization where frequent words are weighted more than infrequent ones (see Levy & Goldberg, 2014). Take into account Zipf law, this difference is substantial and can explain the superior performance of word2vec.
Compare two formulas above, we come to the same conclusion.
It has never been made explicit what noise distribution word2vec samples from. (Is anybody like me in finding it strange to sampling without saying a word about the distribution?) However, we know that there is one. If one uses word frequency, she can’t say that the noise distribution is any more. By adopting a sampling scheme you commit to a specific form of noise distribution. It is not just any distribution, it is supposed to be a certain distribution. While I can’t say what Mikolov and colleagues had in mind when they created word2vec, and is not the only possible distribution, it is certainly one that licenses the sampling scheme in word2vec.
Mikolov et al. did propose concrete form of N: (which is equivalent to in this blog post) and .