A quick note from EACL: some papers related to the LSDSem workshop (Bugert et al. 2017; Zhou et al. 2015) use McNemar’s test to establish statistical significance, and I find it very odd.

McNemar’s test examines “marginal (probability) homogeneity”, which in our case means whether two systems yield (statistically) the same performance. According to the source code I found on GitHub, the way it works is:

- Obtain predictions of System 1 and System 2
- Compare them to gold labels to fill this table:

- Compare them to gold labels to fill this contingency table:

|              | Sys1 correct | Sys1 wrong |
|--------------|--------------|------------|
| Sys2 correct | a            | b          |
| Sys2 wrong   | c            | d          |

- Compute the test statistic χ² = (b − c)² / (b + c) and the corresponding *p*-value from the chi-squared distribution with one degree of freedom
- If the *p*-value is less than a certain level (e.g. the magical 0.05), we reject the null hypothesis, which is *p*(Sys1 correct) == *p*(Sys2 correct)
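The steps above can be sketched in a few lines of Python. This is a minimal illustration of the procedure as described, not the code from the papers; the function name `mcnemar` and the toy data are my own, and I use the plain (uncorrected) statistic rather than the continuity-corrected variant some implementations apply.

```python
from scipy.stats import chi2


def mcnemar(gold, pred1, pred2):
    """McNemar's test on two systems' predictions against gold labels.

    Only the discordant cells matter: b = Sys1 wrong but Sys2 correct,
    c = Sys1 correct but Sys2 wrong.
    """
    b = sum(1 for g, p1, p2 in zip(gold, pred1, pred2) if p1 != g and p2 == g)
    c = sum(1 for g, p1, p2 in zip(gold, pred1, pred2) if p1 == g and p2 != g)
    stat = (b - c) ** 2 / (b + c)      # chi-squared statistic, 1 degree of freedom
    p_value = chi2.sf(stat, df=1)       # survival function = 1 - CDF
    return stat, p_value


# Toy example: 30 items, System 1 gets 10 right, System 2 gets 20 right
gold = [1] * 30
pred1 = [1] * 10 + [0] * 20
pred2 = [1] * 20 + [0] * 10
stat, p = mcnemar(gold, pred1, pred2)
```

Note that the concordant cells a and d drop out entirely: the test only asks whether the two systems disagree with each other in a lopsided way.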

As it happens in these papers, the difference is statistically significant, and therefore the results are meaningful. Happy?

Not so fast. Continue reading.