At the heart of the argument was Levy and Goldberg’s proof that minimizing the loss of Skip-gram negative sampling (SGNS) is effectively approximating a shifted PMI matrix. Starting with the log-likelihood, they worked their way to local objective for each word-context pair and compare its derivative to zero to arrive at a function of PMI. One might rightly question if the loss function is essential in this proof or there is a deeper link between the two formalizations?
My answer: Yes, there is. Continue reading