Reproducing Chen & Manning (2014)

Neural dependency parsing is attractive for several reasons: first, distributed representations generalize better; second, fast parsing unlocks new applications; and third, fast training means parsers can be co-trained with other NLP modules and integrated into a bigger system.

Chen & Manning (2014) from Stanford were the first to show that neural dependency parsing works, and Google folks were quick to adopt this paradigm to improve the state of the art (e.g. Weiss et al., 2015).

Though Stanford open-sourced their parser as part of CoreNLP, they didn’t release the code of their experiments. As anybody in academia probably knows, reproducing experiments is non-trivial, even extremely difficult at times. Having painstakingly gone through the process, I think it’s worth sharing what I learned.

(For the impatient, the source code is available on Bitbucket.)

First and foremost, the paper didn’t cover all details of the implementation. I think this holds true for most papers in NLP, as there are always more nitty-gritty details than can be conveyed in 8 pages. For example, in jackknifing, whether one divides the dataset by sentences or by documents can shrink or enlarge the vocabulary shared between the training and testing sets, and therefore affects accuracy.
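To make the jackknifing point concrete, here is a minimal Python sketch on toy data. It is not the code used in my experiments: the function names, the round-robin fold assignment, and the four toy sentences are all assumptions made for illustration. It only shows that the same corpus can yield a different vocabulary overlap depending on whether folds are cut by sentence or by document.

```python
def folds_by_sentence(sentences, k):
    """Round-robin split: sentence i goes to fold i % k."""
    folds = [[] for _ in range(k)]
    for i, sent in enumerate(sentences):
        folds[i % k].append(sent)
    return folds


def folds_by_document(sentences, doc_ids, k):
    """All sentences of one document land in the same fold."""
    fold_of_doc = {}
    folds = [[] for _ in range(k)]
    for sent, doc in zip(sentences, doc_ids):
        # A new document gets the next fold in round-robin order.
        fold = fold_of_doc.setdefault(doc, len(fold_of_doc) % k)
        folds[fold].append(sent)
    return folds


def seen_vocab_ratio(folds, held_out):
    """Fraction of held-out word types that also occur in the other folds."""
    train_vocab = {w for i, fold in enumerate(folds) if i != held_out
                   for sent in fold for w in sent}
    test_vocab = {w for sent in folds[held_out] for w in sent}
    return len(test_vocab & train_vocab) / len(test_vocab)


# Toy corpus (hypothetical): two documents of two sentences each.
sentences = [["stocks", "fell"], ["stocks", "rose"],
             ["rain", "fell"], ["rain", "stopped"]]
doc_ids = ["d1", "d1", "d2", "d2"]

by_sentence = seen_vocab_ratio(folds_by_sentence(sentences, 2), held_out=1)
by_document = seen_vocab_ratio(folds_by_document(sentences, doc_ids, 2), held_out=1)
print(by_sentence, by_document)
```

On this toy corpus the sentence-level split shares half of the held-out vocabulary with the training folds, while the document-level split shares only a third, because words that are specific to one document never leak across folds.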

Some details are probably left out because veterans take them for granted. They might, however, surprise newcomers like me. For example, nowhere in Chen & Manning (2014) do they mention the Wall Street Journal. Nevertheless, the dataset should be understood as the WSJ part of the Penn Treebank rather than the whole thing. Not knowing this, I spent 3 months thinking that the inferior results I got were due to having less training data. I only realized that my implementation was to blame when I got my hands on the full Penn Treebank.

So, I reimplemented the Stanford neural dependency parser using Torch7 and got these results:

                                   Stanford dep.    CoNLL dep.
                                   UAS    LAS       UAS    LAS
Published results                  91.8   89.6      92.0   90.7   (1)
Stanford impl. + Published model   91.4   89.4      65.6   56.1   (2)
Stanford impl.                     90.4   89.0      86.4   84.8   (3)
My impl. + Published model         91.6   89.7      92.3   91.0   (4)
My impl.                           90.2   88.7      90.8   89.7   (5)

Rows (1) and (2) tell us something about the current dataset and what Chen & Manning used. The “Stanford dependency” dataset, i.e. WSJ constituent trees converted into dependency trees using Stanford software, seems fine, but the “CoNLL dependency” dataset, i.e. the one converted using the LTH conversion tool, doesn’t seem to match. When I compare the statistics, my CoNLL dataset contains about 2000 more words than reported in the paper. I tried different parameters but couldn’t get the right number, so I reverted to pennconverter.jar -raw. Since more recent papers mostly evaluate on Stanford dependencies anyway, I decided to move on.
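The kind of statistics check mentioned above can be sketched in a few lines of Python. This is not the script I used; it assumes the converted data is in a CoNLL-style layout with one token per line and a blank line between sentences, and the function name and the sample sentences are made up for illustration.

```python
def conll_stats(lines):
    """Return (sentence_count, token_count) for CoNLL-style lines.

    Assumes one token per non-blank line and blank lines between sentences.
    """
    sentences = tokens = 0
    in_sentence = False
    for line in lines:
        if line.strip():
            tokens += 1
            if not in_sentence:
                # First token after a blank line (or at the start) opens a sentence.
                sentences += 1
                in_sentence = True
        else:
            in_sentence = False
    return sentences, tokens


# Hypothetical two-sentence sample in a 10-column, tab-separated layout.
sample = (
    "1\tStocks\t_\tNNS\tNNS\t_\t2\tnsubj\t_\t_\n"
    "2\tfell\t_\tVBD\tVBD\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tRain\t_\tNN\tNN\t_\t2\tnsubj\t_\t_\n"
    "2\tstopped\t_\tVBD\tVBD\t_\t0\troot\t_\t_\n"
    "3\t.\t_\t.\t.\t_\t2\tpunct\t_\t_\n"
)
stats = conll_stats(sample.splitlines())
print(stats)  # (2, 5)
```

For a real file, the same function can be fed an open file handle instead of the sample lines; comparing its token count against the figure reported in the paper is a quick way to spot a conversion mismatch before training anything.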

As noted on the Stanford website, the published models were trained with Matlab code (not available) and you’re likely to get lower results using the public Java code. This explains the difference between rows (1) and (3). Here I only try to match the performance of the Java implementation (compare rows (3) and (5), (2) and (4)) and hope that better hyperparameter tuning will get us to the published results.

This work provides researchers with fast and reproducible experiments. With the help of a GPU, my code took about 1.5 hours to train, compared to 8 hours for Stanford’s code. Parsing is slower: about 400 sentences/s compared to 1000 sentences/s. However, given that WSJ contains about 40k sentences, parsing the whole corpus still takes less than 2 minutes (40,000 / 400 = 100 seconds). It’s possible to speed this up using threads (which Stanford’s implementation also does), but for now I’ll keep things simple.

For more details, my notes are available on Wikia and source code on Bitbucket.


Chen, D., & Manning, C. (2014). A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 740–750). Doha, Qatar: Association for Computational Linguistics.

Weiss, D., Alberti, C., Collins, M., & Petrov, S. (2015). Structured Training for Neural Network Transition-Based Parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 323–333). Association for Computational Linguistics.


3 thoughts on “Reproducing Chen & Manning (2014)”

  1. Thanks for the overview. This is really insightful. I have just a quick question that I’m really hoping you can answer. I’m probably misunderstanding something but I have similar frustrations due to the lack of details in implementation.

    I am still trying to reproduce the results of this paper (and replicate SyntaxNet) in Python, but I suspect one of the problems I’m coming across is error propagation. When I measure tokens independently (with gold DEPREL labeling for already-parsed siblings and children) the accuracy is in the ballpark of Parsey’s Cousins, but when I measure using entire sentences at once, the accuracy quickly heads downhill. I suspect this is because one wrong transition or one missing SHIFT basically lands the entire remaining part of the sentence in the trash bin.

    Is what I’m saying making sense, or perhaps I’m fundamentally misunderstanding something? Is this a problem you ran into? Ironically I noticed on your front page you had a paper about error propagation, but what I’d like to know is how your implementation here handles this sort of problem or whether you even ran into it in the first place.


    • Hi Andrew. I don’t know the details of your setup, but in regular cases the evaluation starts *after* the parser has produced its output. So whatever difference there is, it isn’t in the parsing but in the evaluation itself. I guess it’s simply because sentence-based accuracy is harder. For example, a sentence with 1 incorrect link might be judged 90% correct in token-based evaluation but 0% correct in sentence-based evaluation (i.e. a sentence is correct only if all links are correct).
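The gap between the two metrics can be shown with a minimal Python sketch. It assumes each sentence is represented as a list of head indices, one per token; the function names and the ten-token example are hypothetical and only mirror the 90% versus 0% case described above.

```python
def token_accuracy(gold, pred):
    """Share of tokens whose predicted head matches the gold head."""
    correct = sum(g == p
                  for gold_heads, pred_heads in zip(gold, pred)
                  for g, p in zip(gold_heads, pred_heads))
    total = sum(len(gold_heads) for gold_heads in gold)
    return correct / total


def sentence_accuracy(gold, pred):
    """Share of sentences in which every head is correct."""
    exact = sum(gold_heads == pred_heads
                for gold_heads, pred_heads in zip(gold, pred))
    return exact / len(gold)


# One hypothetical sentence of 10 tokens; the prediction gets one head wrong.
gold = [[2, 0, 2, 5, 3, 7, 5, 7, 10, 8]]
pred = [[2, 0, 2, 5, 3, 7, 5, 7, 10, 2]]

print(token_accuracy(gold, pred))     # 0.9
print(sentence_accuracy(gold, pred))  # 0.0
```

A single wrong link costs one token out of ten in the first metric but the whole sentence in the second, so sentence-level scores fall much faster as sentences get longer.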

