A quick note from EACL: some papers related to the LSDSem workshop (Bugert et al. 2017; Zhou et al. 2015) use McNemar’s test to establish statistical significance, and I find this very odd.

McNemar’s test examines “marginal (probability) homogeneity”, which in our case means whether two systems yield (statistically) the same performance. According to the source code I found on GitHub, it works as follows:

- Obtain predictions of System 1 and System 2
- Compare them to gold labels to fill this table:

|              | Sys1 correct | Sys1 wrong |
|--------------|--------------|------------|
| Sys2 correct | a            | b          |
| Sys2 wrong   | c            | d          |

- Compute the test statistic χ² = (b - c)² / (b + c) and the corresponding *p*-value
- If the *p*-value is less than a certain level (e.g. the magical 0.05), we reject the null hypothesis, which is *p*(Sys1 correct) = *p*(Sys2 correct)
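The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the papers’ actual code; the function and variable names are mine, and the *p*-value uses the fact that for a chi-squared distribution with one degree of freedom the survival function equals erfc(√(x/2)), which avoids pulling in scipy:

```python
from math import erfc, sqrt


def mcnemar(sys1_preds, sys2_preds, gold):
    """McNemar's test on one run of two systems against gold labels.

    Returns the chi-squared statistic (df = 1) and its p-value.
    """
    # b: Sys1 wrong but Sys2 correct; c: Sys1 correct but Sys2 wrong
    # (matching the cells of the contingency table above)
    b = sum(p1 != g and p2 == g
            for p1, p2, g in zip(sys1_preds, sys2_preds, gold))
    c = sum(p1 == g and p2 != g
            for p1, p2, g in zip(sys1_preds, sys2_preds, gold))
    stat = (b - c) ** 2 / (b + c)
    # For chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value
```

Note that only the off-diagonal cells b and c (the items the two systems disagree on) enter the statistic; the cells a and d where both systems are right or both wrong are irrelevant to the test.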

As it happens in these papers, the difference is statistically significant, and therefore the results are meaningful. Happy?

Not so fast.

All this test tells you is that a certain run of System 1 yields different performance from a certain run of System 2. And the test only matters when the two scores are close (e.g. 0.70 vs. 0.71); when there is a big gap, it will almost certainly come out significant.

However, in my experience with NLP, the problem is more complicated. Systems, especially neural networks with their many randomly initialized, stochastically trained parameters, behave differently every time we retrain them (with a new random seed). In extreme cases, the variance can be so large that two systems that differ by 10 percentage points of F1 aren’t statistically different (because the standard deviation is even greater).

McNemar’s test tells us nothing about this problem. Actually no test that involves only one run of each system can.

What we need is to train and evaluate systems many times; collect a set of accuracy, F1, BLEU, or any other relevant scores; and then run a t-test over these numbers. This hasn’t been standard practice in NLP so far, but I strongly believe we should do it more, especially because more and more work is done using neural networks.
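As a sketch of that last step (names and numbers are illustrative): since two systems retrained with different seeds give two independent samples of scores with possibly unequal variances, Welch’s t statistic is a reasonable choice. In practice `scipy.stats.ttest_ind(scores1, scores2, equal_var=False)` gives the statistic and *p*-value directly; the dependency-free version of the statistic looks like this:

```python
from statistics import mean, variance


def welch_t(scores1, scores2):
    """Welch's t statistic for two independent sets of per-run scores.

    Each list holds one evaluation score (e.g. F1) per retraining run.
    """
    n1, n2 = len(scores1), len(scores2)
    m1, m2 = mean(scores1), mean(scores2)
    # Squared standard error of the difference, without assuming
    # the two systems have equal variance across seeds
    se2 = variance(scores1) / n1 + variance(scores2) / n2
    return (m1 - m2) / se2 ** 0.5
```

The more seeds you run, the tighter the estimate of each system’s mean score, which is exactly the information a single-run test like McNemar’s cannot give you.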

### References

Bugert, M., Puzikov, Y., Rücklé, A., Eckle-Kohler, J., Martin, T., & Martínez-Cámara, E. (2017). LSDSem 2017: Exploring Data Generation Methods for the Story Cloze Test. In *Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-Level Semantics (LSDSem 2017)* (pp. 56–61).

Zhou, M., Frank, A., Friedrich, A., & Palmer, A. (2015). Semantically Enriched Models for Modal Sense Classification. In *Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics (LSDSem)* (p. 44).

### Comments

Truly so!

I think it would make even more sense to run the experiments multiple times and report the score distribution, showing what the min/max and variance are.


Exactly! I think people are moving in the same direction too.

Reimers, N., & Gurevych, I. (2017). Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging. In *EMNLP* (pp. 338–348).
