While I was still active in NLP, one thing that struck me was how hard it was to interpret errors. When people make errors, for example while reading garden-path sentences, it’s easy to see why: maybe the verb is used in an uncommon way, or we are too eager to connect words together. But the errors of a statistical parser don’t make any sense. They appear totally random, at least to me.
The multi-layered approach to language analysis has a rich tradition but little real-world application. These days, people in industry use neural end-to-end models. One reason for that is that performance on many tasks is still lower than it needs to be. A syntactic parser with 92% per-token accuracy sounds great but, assuming each sentence has 20 tokens, it gets a sentence completely correct only 0.92^20 ≈ 19% of the time. Errors throw off further processing and we don’t know what to do with them; so far, we can’t even detect them.
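The arithmetic above is easy to check, assuming token-level errors are independent:

```python
# Probability that a parser with 92% per-token accuracy parses an
# entire 20-token sentence correctly, assuming independent errors.
per_token_accuracy = 0.92
tokens_per_sentence = 20

sentence_accuracy = per_token_accuracy ** tokens_per_sentence
print(f"{sentence_accuracy:.0%}")  # prints "19%"
```

The independence assumption is optimistic if anything: parser errors tend to cluster around hard constructions, but a single wrong attachment is already enough to ruin the tree.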
Error control seems to be the most pressing neglected problem in NLP. If we could detect and ignore obvious errors, we could have much more confidence in the rest of the analysis (i.e. trading recall for precision), and more applications would open up. Thresholding, where we remove predictions with scores under a certain bar, is an obvious choice. Calibrating the scores or probabilities of models to match the empirical chance of success is therefore a worthy research direction (which is, sadly, also under-researched). I’d like to add two more ideas for error control, borrowed from telecommunications.