Neural dependency parsing is attractive for several reasons: first, distributed representations generalize better; second, fast parsing unlocks new applications; and third, fast training means parsers can be co-trained with other NLP modules and integrated into larger systems.
Chen & Manning (2014) at Stanford were the first to show that neural dependency parsing works, and researchers at Google were quick to adopt the paradigm to improve the state of the art (e.g. Weiss et al., 2015).
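For context, Chen & Manning's parser is a greedy transition-based parser: a classifier repeatedly chooses one of the arc-standard actions (shift, left-arc, right-arc) given the current stack and buffer, and the neural network simply replaces the traditional sparse-feature classifier. Here is a minimal sketch of the arc-standard transition system; the data layout and action names are my own, and the paper's feature extraction and network are deliberately left abstract behind `choose_action`:

```python
# Minimal arc-standard transition system (illustrative sketch, not the
# paper's code). A configuration is (stack, buffer, arcs); the classifier's
# job -- abstracted as `choose_action` -- is to pick the next transition.

def shift(stack, buffer, arcs):
    stack.append(buffer.pop(0))

def left_arc(stack, buffer, arcs, label):
    # second-from-top of the stack becomes a dependent of the top
    dep = stack.pop(-2)
    arcs.append((stack[-1], label, dep))

def right_arc(stack, buffer, arcs, label):
    # top of the stack becomes a dependent of the second-from-top
    dep = stack.pop()
    arcs.append((stack[-1], label, dep))

def parse(words, choose_action):
    """Greedy parse; `choose_action` stands in for the neural classifier."""
    stack, buffer, arcs = ["ROOT"], list(words), []
    while buffer or len(stack) > 1:
        action, label = choose_action(stack, buffer)
        if action == "SHIFT":
            shift(stack, buffer, arcs)
        elif action == "LEFT":
            left_arc(stack, buffer, arcs, label)
        else:
            right_arc(stack, buffer, arcs, label)
    return arcs

# A hand-scripted "classifier" parsing ["He", "works"]:
script = iter([("SHIFT", None), ("SHIFT", None),
               ("LEFT", "nsubj"), ("RIGHT", "root")])
print(parse(["He", "works"], lambda stack, buffer: next(script)))
# -> [('works', 'nsubj', 'He'), ('ROOT', 'root', 'works')]
```

In the real parser the sequence of actions is predicted one at a time from embeddings of words, POS tags, and arc labels around the stack and buffer, rather than scripted as above.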
Although Stanford open-sourced their parser as part of CoreNLP, they didn't release the code behind their experiments. As anybody in academia knows, reproducing experiments is non-trivial, and at times extremely difficult. Having painstakingly gone through the process myself, I want to share what I learned.
(For the impatient, the source code is available on Bitbucket.)
First and foremost, the paper doesn't cover every detail of the implementation. This holds true for most papers in NLP: there are always more nitty-gritty details than can be conveyed in 8 pages. For example, in jackknifing, whether one splits the dataset by sentences or by documents can shrink or enlarge the vocabulary shared between the training and test sets, and therefore affects accuracy.
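To make the jackknifing point concrete: a sentence-level split scatters each document's vocabulary across folds, while a document-level split keeps it together, so the train/test vocabulary overlap differs. A toy illustration (the documents and the two-way split are invented for the example):

```python
# Toy illustration: sentence-level vs document-level splits give different
# train/test vocabulary overlap. The documents and split are invented.
docs = [
    [["stocks", "rose"], ["stocks", "fell"]],           # finance document
    [["the", "team", "won"], ["the", "team", "lost"]],  # sports document
]

def vocab(sentences):
    return {w for sent in sentences for w in sent}

# Document-level split: one whole document per side.
train_d, test_d = docs[0], docs[1]
overlap_doc = vocab(train_d) & vocab(test_d)

# Sentence-level split: first sentence of each document goes to train.
train_s = [doc[0] for doc in docs]
test_s = [doc[1] for doc in docs]
overlap_sent = vocab(train_s) & vocab(test_s)

print(len(overlap_doc), len(overlap_sent))  # prints: 0 3
```

The sentence-level split shares three word types between train and test while the document-level split shares none, so a tagger jackknifed one way will look artificially better or worse than one jackknifed the other way.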
Other details are left out, probably because veterans take them for granted. They can, however, be surprising for newcomers like me. For example, nowhere in Chen & Manning (2014) is the Wall Street Journal mentioned; nevertheless, the dataset should be understood as the WSJ portion of the Penn Treebank, not the whole thing. Not knowing this, I spent three months thinking that my inferior results were due to having less training data. Only when I got my hands on the full Penn Treebank did I realize that my implementation was to blame.
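For reference, the split in question is the standard WSJ convention that Chen & Manning follow: sections 02–21 for training, 22 for development, and 23 for testing. A sketch of routing treebank files by section number, assuming the usual `wsj/NN/wsj_NNNN.mrg` naming (the helper itself is mine):

```python
# Route Penn Treebank WSJ files into the standard split:
# sections 02-21 train, 22 dev, 23 test.
# Assumes the conventional wsj/NN/wsj_NNNN.mrg file layout.
import re

def split_of(path):
    m = re.search(r"wsj_(\d\d)\d\d\.mrg$", path)
    if not m:
        return None
    section = int(m.group(1))
    if 2 <= section <= 21:
        return "train"
    if section == 22:
        return "dev"
    if section == 23:
        return "test"
    return None  # sections 00, 01, 24 are conventionally left unused

print(split_of("wsj/02/wsj_0201.mrg"))  # train
print(split_of("wsj/23/wsj_2300.mrg"))  # test
```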
So, I reimplemented the Stanford neural dependency parser using Torch7 and got these results:
|                                  | Stanford dep. |      | CoNLL dep. |      |     |
|----------------------------------|:-------------:|:----:|:----------:|:----:|:---:|
|                                  | UAS           | LAS  | UAS        | LAS  |     |
| Stanford impl. + Published model | 91.4          | 89.4 | 65.6       | 56.1 | (2) |
| My impl. + Published model       | 91.6          | 89.7 | 92.3       | 91.0 | (4) |
Rows (1) and (2) tell us something about how my datasets compare to what Chen & Manning used. The "Stanford dependency" dataset, i.e. WSJ constituency trees converted into dependency trees with Stanford's software, seems fine, but the "CoNLL dependency" dataset, i.e. trees converted with the LTH conversion tool, doesn't seem to match: my CoNLL dataset contains about 2,000 more words than reported in the paper. I tried different conversion parameters but couldn't reproduce the reported number, so I reverted to `pennconverter.jar -raw`. Since more recent papers mostly evaluate on Stanford dependencies anyway, I decided to move on.
As noted on the Stanford website, the published models were trained with Matlab code (not available), and you're likely to get lower results with the public Java code. This explains the difference between rows (1) and (3). For now I only try to match the performance of the Java implementation (compare rows (3) and (5), and rows (4) and (6)), and hope that better hyperparameter tuning will get us to the published results.
This work provides researchers with fast and reproducible experiments. With the help of a GPU, my code takes about 1.5 hours to train, compared with 8 hours for Stanford's code. Parsing is slower: about 400 sentences/s versus 1,000 sentences/s. However, given that WSJ contains about 40k sentences, it still takes less than 2 minutes to parse the whole corpus. It's possible to speed things up with threads (as Stanford's implementation does), but for now I'll keep things simple.
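The "less than 2 minutes" figure is just arithmetic on the throughput numbers above:

```python
# Back-of-the-envelope parsing time for the whole WSJ corpus.
sentences = 40_000      # approximate number of WSJ sentences
my_speed = 400          # sentences/s, my implementation
stanford_speed = 1_000  # sentences/s, Stanford's implementation

print(sentences / my_speed)        # 100.0 seconds, i.e. under 2 minutes
print(sentences / stanford_speed)  # 40.0 seconds
```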
Chen, D., & Manning, C. (2014). A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 740–750). Doha, Qatar: Association for Computational Linguistics.
Weiss, D., Alberti, C., Collins, M., & Petrov, S. (2015). Structured Training for Neural Network Transition-Based Parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 323–333). Association for Computational Linguistics.