What’s up with McNemar’s test?

A quick note from EACL: some papers related to the LSDSem workshop (Bugert et al. 2017; Zhou et al. 2015) use McNemar’s test to establish statistical significance, and I find it very odd.

McNemar’s test examines “marginal (probability) homogeneity”, which in our case means asking whether two systems yield (statistically) the same performance. According to the source code I found on GitHub, the way it works is:

  1. Obtain predictions of System 1 and System 2
  2. Compare them to gold labels to fill this table:
                    Sys1 correct   Sys1 wrong
    Sys2 correct    a              b
    Sys2 wrong      c              d
  3. Compute the test statistic \chi^2 = {(b-c)^2 \over b+c} and its p-value
  4. If p-value is less than a certain level (e.g. the magical 0.05), we reject the null hypothesis which is p(Sys1 correct) == p(Sys2 correct)
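
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the papers or from the GitHub source mentioned earlier: the function name `mcnemar_test` and its arguments are hypothetical, and it assumes the uncorrected statistic with one degree of freedom, for which the chi-square tail probability equals erfc(sqrt(\chi^2 / 2)).

```python
import math


def mcnemar_test(gold, pred1, pred2):
    """Return (chi2, p_value) for McNemar's test without continuity correction.

    gold, pred1, pred2 are equal-length sequences of labels
    (hypothetical argument names, chosen for this sketch).
    """
    b = 0  # Sys2 correct, Sys1 wrong
    c = 0  # Sys2 wrong, Sys1 correct
    for g, p1, p2 in zip(gold, pred1, pred2):
        sys1_ok = p1 == g
        sys2_ok = p2 == g
        if sys2_ok and not sys1_ok:
            b += 1
        elif sys1_ok and not sys2_ok:
            c += 1
    if b + c == 0:
        # The systems never disagree on correctness: no evidence of a difference.
        return 0.0, 1.0
    chi2 = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Note that only the discordant cells b and c enter the statistic; the cases where both systems are right (a) or both are wrong (d) play no role. With b = 8 and c = 2, for example, the statistic is 3.6 and the p-value is about 0.058.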

As it happens in the papers, the difference is statistically significant and therefore results are meaningful. Happy?

Not so fast. Continue reading

A paper is the tip of an iceberg

I was reading Clark and Manning (2016) and studying their code. The contrast is just amazing.

This is what the paper has to say:


This is what I found after 1 hour of reading a JSON file and writing down all layers of the neural net (the file is data/models/all_pairs/architecture.json, created when you run the experiment):

Without the source code, this would be a replication nightmare for sure.


Clark, K., & Manning, C. D. (2016). Improving Coreference Resolution by Learning Entity-Level Distributed Representations. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 643–653. http://doi.org/10.18653/v1/P16-1061

Long paper accepted for EACL 2017

Title: Tackling Error Propagation through Reinforcement Learning: A Case of Greedy Dependency Parsing

Conference: EACL 2017 (European Chapter of the Association for Computational Linguistics), at Valencia, 3-7 April 2017.

Error propagation is a common problem in NLP. Reinforcement learning explores erroneous states during training and can therefore be more robust when mistakes are made early in a process. In this paper, we apply reinforcement learning to greedy dependency parsing which is known to suffer from error propagation. Reinforcement learning improves accuracy of both labeled and unlabeled dependencies of the Stanford Neural Dependency Parser, a high performance greedy parser, while maintaining its efficiency. We investigate the portion of errors which are the result of error propagation and confirm that reinforcement learning reduces the occurrence of error propagation.

Full article: arXiv:1702.06794

Slides: view online

Hyperparameter tuning in SURFsara HPC Cloud

Hyperparameter tuning is difficult, not because it’s terribly complicated but because obtaining enough resources is often not easy. I’m lucky enough to work at Vrije Universiteit and can therefore access the SURFsara HPC Cloud without too much effort. Compared to Amazon EC2 (the only other cloud solution I have tried), the functionality is rather basic, but I think it suits the needs of many researchers. Using the web interface or the OpenNebula API, you can easily customize an image, attach a hard drive, launch 10 instances and access any of them using a public key. What else do you need to run your experiments? Continue reading

Reproducing Chen & Manning (2014)

Neural dependency parsing is attractive for several reasons: first, distributed representations generalize better; second, fast parsing unlocks new applications; and third, fast training means parsers can be co-trained with other NLP modules and integrated into a bigger system.

Chen & Manning (2014) from Stanford were the first to show that neural dependency parsing works and Google folks were quick to adopt this paradigm to improve the state-of-the-art (e.g. Weiss et al., 2015).

Though Stanford open-sourced their parser as part of CoreNLP, they didn’t release the code of their experiments. As anybody in academia probably knows, reproducing experiments is non-trivial, even extremely difficult at times. Since I have painstakingly gone through the process, I think it’s a good idea to share it with you.

Continue reading

A new proof of the equivalence of word2vec’s SGNS and Shifted PPMI

[removed section]

At the heart of the argument was Levy and Goldberg’s proof that minimizing the loss of Skip-gram negative sampling (SGNS) effectively approximates a shifted PPMI matrix. Starting with the log-likelihood, they worked their way to a local objective for each word-context pair and set its derivative to zero to arrive at a function of PPMI. One might rightly ask whether the loss function is essential to this proof or whether there is a deeper link between the two formalizations.

My answer: Yes, there is. Continue reading
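
For reference, the derivative step in Levy and Goldberg’s argument can be written out as below. This is a sketch from memory of their 2014 derivation, not part of the new proof; the notation is assumed here rather than taken from this post: x = \vec{w} \cdot \vec{c}, #(w,c), #(w) and #(c) are counts, |D| is the number of word-context pairs, and k is the number of negative samples. The closed form that comes out is the shifted PMI; the PPMI variant clips negative values at zero.

```latex
% Local objective for one word-context pair, with x = \vec{w} \cdot \vec{c}
\ell(w,c) = \#(w,c)\,\log\sigma(x) + k\,\frac{\#(w)\,\#(c)}{|D|}\,\log\sigma(-x)

% Set the derivative with respect to x to zero
\frac{\partial \ell}{\partial x}
  = \#(w,c)\,\sigma(-x) - k\,\frac{\#(w)\,\#(c)}{|D|}\,\sigma(x) = 0

% Since \sigma(-x) / \sigma(x) = e^{-x}, solving for x gives
x = \log\frac{\#(w,c)\,|D|}{\#(w)\,\#(c)} - \log k = \mathrm{PMI}(w,c) - \log k
```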

Similarity, co-occurrence, functional relation, part-whole relation, subcategorization, what else?

In word sense disambiguation and named-entity disambiguation, an important assumption is that a document consists of related concepts and entities.

There are millions of concepts and entities; what makes some related but not others? This question is difficult and I don’t have a definitive answer. But listing some classes of relatedness is a good start. Continue reading

Knowledge base completion 101

Knowledge base completion (KBC) is not a standard task in natural language processing or in machine learning. A search on Google Scholar returns only a little over 100 articles containing this phrase. Although it is similar to link prediction, “a long-standing challenge in modern information science” (Lü & Zhou, 2011), it has received much less attention.

However, KBC is potentially an important step towards natural language understanding, and recent advances in representation learning have enabled researchers to learn from larger datasets with improved precision. In fact, half of the KBC articles were published in or after 2010. Continue reading