In the previous post, we arrived at two formulas showing the equivalence between SGNS and shifted PMI: the sigmoid that SGNS applies to each dot product,

σ(x) = e^x / (e^x + 1),   (1)

and the value that the dot products reach at the optimum of the SGNS objective,

w · c = PMI(w, c) − log k.   (2)
Apparently, the reason for the “shift” is that there is no k in (1), while there is one in (2). The shift is not just an ugly patch on the formula: it may also hurt the quality of the learned embeddings.
The solution is easy: replace σ(x) = e^x / (e^x + 1) with σ(x − log k) = e^x / (e^x + k), where k is the number of negative examples. This translates into changing line #698 of word2vec.c in Google’s original source code to:
expTable[i] = expTable[i] / (expTable[i] + negative);
This small change gives a small but consistent improvement on similarity benchmarks such as SimLex-999 (see the figure). Not too bad, right? The source code is available in the Bitbucket repository.