What’s wrong with McNemar’s test?

A quick note from EACL: some papers from the LSDSem workshop (Bugert et al. 2017; Zhou et al. 2015) use McNemar’s test to establish statistical significance, and I find that very odd.

McNemar’s test examines “marginal (probability) homogeneity”, which in our case means whether two systems yield (statistically) the same performance. According to the source code I found on GitHub, it works like this:

  1. Obtain predictions of System 1 and System 2
  2. Compare them to gold labels to fill this table:
                    Sys1 correct   Sys1 wrong
    Sys2 correct         a             b
    Sys2 wrong           c             d
  3. Compute the test statistic \chi^2 = {(b-c)^2 \over b+c} and its p-value
  4. If the p-value is below a chosen level (e.g. the magical 0.05), we reject the null hypothesis, which is p(Sys1 correct) == p(Sys2 correct)
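
The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used in the papers: the function name `mcnemar` and the toy labels are made up for the example, no continuity correction is applied, and the p-value relies on the fact that the chi-square survival function with one degree of freedom equals erfc(sqrt(x/2)).

```python
import math

def mcnemar(sys1_pred, sys2_pred, gold):
    """McNemar's chi-square test (no continuity correction) on paired predictions."""
    b = c = 0
    for p1, p2, g in zip(sys1_pred, sys2_pred, gold):
        ok1, ok2 = (p1 == g), (p2 == g)
        if ok2 and not ok1:
            b += 1  # Sys2 correct, Sys1 wrong
        elif ok1 and not ok2:
            c += 1  # Sys2 wrong, Sys1 correct
    if b + c == 0:
        return 0.0, 1.0  # the systems never disagree on correctness
    chi2 = (b - c) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Toy data (hypothetical): 100 items, all with gold label 1.
gold = [1] * 100
sys1 = [0] * 30 + [1] * 70             # Sys1 is wrong on items 0..29
sys2 = [1] * 30 + [0] * 20 + [1] * 50  # Sys2 is wrong on items 30..49
chi2, p = mcnemar(sys1, sys2, gold)    # b = 30, c = 20, so chi2 = 100 / 50 = 2.0
print(chi2, round(p, 3))               # 2.0 0.157
```

Note that cells a and d (both right, both wrong) never enter the statistic; only the items on which the two systems disagree do.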

In the papers, the difference turns out to be statistically significant, and therefore the results are taken to be meaningful. Happy?

Not so fast.

All this test tells you is that a certain run of System 1 yields different performance from a certain run of System 2. And the test only matters when their figures are close (e.g. 0.70 and 0.71). When there’s a big difference, it will almost certainly come out significant.

However, in my experience with NLP, the problem is more complicated. Systems, especially neural networks, which have a lot of randomly initialized, stochastically trained parameters, behave differently every time we retrain them (with a new random seed). In extreme cases, the variance can be so big that two systems that differ by 10 percentage points of F1 aren’t statistically different (because the standard deviation is even greater).

McNemar’s test tells us nothing about this problem. Actually no test that involves only one run of each system can.

What we need is to train and evaluate each system many times; collect a bunch of accuracy, F1, BLEU, or other relevant scores; and then run a t-test on these numbers. This hasn’t been standard practice in NLP so far, but I strongly believe that we should do it more, especially because more and more work is done using neural networks.
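
As a rough sketch of what such a comparison could look like, here is Welch’s t-test (the unequal-variance variant, which is my choice for the example, not something prescribed above) over per-seed scores, using only the standard library. The function names and the F1 scores are hypothetical, and the p-value is obtained by numerically integrating the Student’s t density, which is adequate for illustration but not a substitute for a statistics package. It assumes each system has at least two runs and that the scores are not all identical.

```python
import math
import statistics

def t_two_sided_p(t_abs, df, steps=2000):
    """Two-sided p-value: 1 - 2 * integral of the Student's t pdf over [0, t_abs] (Simpson's rule)."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    h = t_abs / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(t_abs)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)

def welch_t_test(scores_a, scores_b):
    """Welch's t-test on per-run scores; returns (t, degrees of freedom, two-sided p)."""
    na, nb = len(scores_a), len(scores_b)
    va = statistics.variance(scores_a) / na  # squared standard error of mean A
    vb = statistics.variance(scores_b) / nb  # squared standard error of mean B
    t = (statistics.mean(scores_a) - statistics.mean(scores_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df, t_two_sided_p(abs(t), df)

# Hypothetical F1 scores from five seeds per system (made up for illustration).
f1_sys1 = [0.55, 0.80, 0.62, 0.90, 0.68]  # mean 0.71
f1_sys2 = [0.45, 0.75, 0.50, 0.85, 0.50]  # mean 0.61
t, df, p = welch_t_test(f1_sys1, f1_sys2)
print(f"t = {t:.2f}, df = {df:.1f}, p = {p:.2f}")
```

With these made-up scores the means differ by 10 points, yet the p-value stays well above 0.05 because the spread across seeds is larger than the gap.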


Bugert, M., Puzikov, Y., Andreas, R., Eckle-Kohler, J., Martin, T., & Mart, E. (2017). LSDSem 2017: Exploring Data Generation Methods for the Story Cloze Test. In The 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-Level Semantics (LSDSem 2017) (pp. 56–61).

Zhou, M., Frank, A., Friedrich, A., & Palmer, A. (2015). Semantically Enriched Models for Modal Sense Classification. In Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics (LSDSem) (p. 44).


2 thoughts on “What’s wrong with McNemar’s test?”

    • Exactly! I think people are moving in the same direction too.

      Reimers, N., & Gurevych, I. (2017). Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging. In EMNLP (pp. 338–348).

