# A paper is the tip of an iceberg

I was reading Clark and Manning (2016) and studying their code. The contrast is just amazing.

This is what the paper has to say:

This is what I found after 1 hour of reading a JSON file and writing down all layers of the neural net (the file is data/models/all_pairs/architecture.json, created when you run the experiment):

Without the source code, this would be a replication nightmare for sure.
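For anyone facing a similar file, the listing can be partly automated. The sketch below is only an illustration: it assumes a Keras-style layout in which every layer is a JSON object with a `class_name` key and an optional `config` object holding a `name`. The real `architecture.json` may be organized differently, and the `list_layers` helper and the miniature `sample` are hypothetical.

```python
import json


def list_layers(node, path=""):
    """Recursively collect (path, class_name, name) for every layer-like object.

    Assumption: a layer is any JSON object that has a "class_name" key.
    """
    found = []
    if isinstance(node, dict):
        if "class_name" in node:
            config = node.get("config")
            name = config.get("name") if isinstance(config, dict) else None
            found.append((path or "/", node["class_name"], name))
        for key, value in node.items():
            found.extend(list_layers(value, f"{path}/{key}"))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            found.extend(list_layers(value, f"{path}/{index}"))
    return found


# Hypothetical miniature architecture, used only to show the output shape.
sample = json.loads("""
{
  "class_name": "Sequential",
  "config": [
    {"class_name": "Dense", "config": {"name": "dense_1", "output_dim": 10}},
    {"class_name": "Activation", "config": {"name": "relu_1"}}
  ]
}
""")

layers = list_layers(sample)
for path, class_name, name in layers:
    print(path, class_name, name)

# For the real file, replace `sample` with the parsed file contents:
# with open("data/models/all_pairs/architecture.json") as f:
#     layers = list_layers(json.load(f))
```

Walking the whole tree rather than one fixed key keeps the sketch usable when layers are nested inside containers, at the cost of also reporting the containers themselves.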

References

Clark, K., & Manning, C. D. (2016). Improving Coreference Resolution by Learning Entity-Level Distributed Representations. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 643–653. http://doi.org/10.18653/v1/P16-1061

Lately, I have been hearing a lot about un-/semi-supervised learning. Ivan Titov’s workshop at the beginning of this month was all about it, and now Facebook has published a blog post about unsupervised learning for images and videos. They use adversarial networks: one neural network generates images and another tries to distinguish authentic images from generated ones. The first network is trained to fool the second, and the second is trained not to be fooled.
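In the standard formulation of adversarial training (Goodfellow et al., 2014), which may differ in detail from the variant used in the Facebook post, this game is written as a single minimax objective. The symbols are not from the text above and are introduced only for illustration: $G$ is the generating network, $D$ is the discriminating network, $x$ is an authentic image and $z$ is the random noise from which $G$ generates an image.

$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$

$D$ maximizes the expression by assigning high probability to authentic images and low probability to generated ones, while $G$ minimizes it by producing images that $D$ rates as authentic.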

# Hyperparameter tuning in SURFsara HPC Cloud

Hyperparameter tuning is difficult, not because it’s terribly complicated but because obtaining enough resources is often not easy. I’m lucky enough to work at Vrije Universiteit and can therefore access the SURFsara HPC Cloud without too much effort. Compared to Amazon EC2 (the only other cloud solution I have tried), the functionality is rather basic, but I think it suits the needs of many researchers. Using the web interface or the OpenNebula API, you can easily customize an image, attach a hard drive, launch 10 instances and access any of them using a public key. What else do you need to run your experiments?

# Reproducing Chen & Manning (2014)

Neural dependency parsing is attractive for several reasons: first, distributed representations generalize better; second, fast parsing unlocks new applications; and third, fast training means parsers can be co-trained with other NLP modules and integrated into a bigger system.

Chen & Manning (2014) from Stanford were the first to show that neural dependency parsing works and Google folks were quick to adopt this paradigm to improve the state-of-the-art (e.g. Weiss et al., 2015).

Though Stanford open-sourced their parser as part of CoreNLP, they didn’t release the code of their experiments. As anybody in academia probably knows, reproducing experiments is non-trivial, even extremely difficult at times. Since I have painstakingly gone through the process, I think it’s a good idea to share it with you.

# Skip-gram negative sampling as (unshifted) PMI matrix factorization

In the previous post, we arrived at two formulas showing the equivalence between SGNS and shifted PMI:

$p(D|w,c) = \sigma(w \cdot c) = \frac{1}{1 + e^{-w \cdot c}}$    (1)

$p(D|w,c) = \frac{1}{1 + ke^{-\mathrm{PMI}(w,c)}}$    (2)
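Setting the right-hand sides of (1) and (2) equal makes the relation between the dot product and PMI explicit. This is a short worked step that uses only the symbols already defined above:

$e^{-w \cdot c} = k e^{-\mathrm{PMI}(w,c)} \;\Longleftrightarrow\; w \cdot c = \mathrm{PMI}(w,c) - \log k$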

Apparently, the reason for the “shift” is that in (1) there’s no $k$ while in (2) there is. The “shift” is not just an ugly patch in the formula; it might also have a negative effect on the quality of the learned embeddings.

# A new proof of the equivalence of word2vec’s SGNS and Shifted PPMI

[removed section]

At the heart of the argument was Levy and Goldberg’s proof that minimizing the loss of Skip-gram negative sampling (SGNS) effectively approximates a shifted PPMI matrix. Starting with the log-likelihood, they worked their way to a local objective for each word-context pair and set its derivative to zero to arrive at a function of PPMI. One might rightly ask whether the loss function is essential to this proof, or whether there is a deeper link between the two formalizations.
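A sketch of that derivative step may help. The notation is borrowed from Levy and Goldberg (2014) rather than from this post: $\#(w,c)$ is the number of times the pair occurs, $\#(w)$ and $\#(c)$ are the marginal counts, $|D|$ is the total number of pairs, $k$ is the number of negative samples, and $x = w \cdot c$. The local objective for one word-context pair is

$\ell(w,c) = \#(w,c) \log \sigma(x) + k \, \#(w) \frac{\#(c)}{|D|} \log \sigma(-x)$

Setting its derivative with respect to $x$ to zero gives

$\#(w,c) \, \sigma(-x) - k \, \#(w) \frac{\#(c)}{|D|} \, \sigma(x) = 0$

and, since $\sigma(x) / \sigma(-x) = e^{x}$,

$x = \log \frac{\#(w,c) \, |D|}{\#(w) \, \#(c)} - \log k = \mathrm{PMI}(w,c) - \log k$

The result is stated here in terms of PMI; the positive variant (PPMI) comes from clipping negative values to zero.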