Notes on machine learning and exceptions

Statistical machine learning has been the de facto standard in NLP research and practice. However, its very success might be hiding its problems. One such problem is exceptions.

Natural language is full of exceptions: idiomatic phrases that defy compositionality, irregular verbs and other departures from grammatical rules, and unexpected events that, though not linguistic phenomena themselves, happen to be communicated via language. So far, statistical NLP has treated them as inconvenient oddities and, in most cases, swept them under the rug, hoping that they wouldn’t reduce F-score.

But a system doesn’t really understand language unless it handles exceptions, and I will argue that (not) handling exceptions has important consequences for machine learning.

What is an exception?

According to the Macmillan Dictionary, an exception is

“someone or something that is different in some way from other people or things and so cannot be included in a general statement”.

If you have a rule-based system, you can read the rules and tell whether it can handle a given exception. But a machine learning system, e.g. the Stanford neural network parser, cannot be inspected so easily. We can’t say what the general statement is, so we can’t talk about exceptions. At an abstract level, it contains only a single hypothesis for everything, encoded inside the neural network. But does that mean it can’t handle any exception?

Exceptions in machine learning

We know a case is an exception when the rules don’t work.

A class of cases that reliably breaks machine learning systems is long-tail cases. These are expressions that rarely occur in training data, such as technical terms, rare events, or little-known people. Given the ubiquitous Zipfian distribution, a large number of distinct expressions fall into the long tail, and their total mass can have a noticeable effect on performance (see the sketch below). However, training on them is hard because there simply isn’t much data about each of them.
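As a rough illustration of why the tail matters, here is a minimal sketch of the arithmetic under a toy Zipfian model; the vocabulary size, exponent, and tail cutoff are all made-up assumptions, not figures from any real corpus:

```python
import numpy as np

# Toy Zipfian vocabulary: the frequency of the rank-r word is
# proportional to 1/r. All constants below are arbitrary assumptions.
V = 100_000                      # assumed vocabulary size
ranks = np.arange(1, V + 1)
freqs = 1.0 / ranks              # unnormalized Zipf frequencies (exponent 1)
probs = freqs / freqs.sum()

tail_start = 1_000               # call everything past rank 1,000 the long tail
tail_mass = probs[tail_start:].sum()
print(f"Long-tail types: {V - tail_start}, "
      f"covering {tail_mass:.1%} of all tokens")
# Each tail word is individually rare, but together they account for
# a substantial share of the text a system must handle.
```

With these assumed numbers, the tail past rank 1,000 still covers well over a third of all tokens, which is why sweeping it under the rug shows up in aggregate performance.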

Interestingly, what seem to be exceptions to machine learning systems are not exceptions to humans. For example, we don’t read about volcanic eruptions every day, but we can understand news about such events just fine. We can even understand such news the first time we hear about the event, provided a dictionary definition is accessible.

On the other hand, there are clear exceptions that fail to break machine learning systems. One example is the most frequent word in English: the. To any speaker of the language, the is very exceptional. It is literally one of a kind: it is the only definite article in English (while some languages have more), coming from the already small class of articles, which includes only two other words (a/an). Any rule-based parser would contain some rules specifically designed for the. But a statistical one just treats the like any other word, and this poses a risk of over-generalization.

I quickly checked this possibility by looking at the word embeddings created by nndep (the Stanford neural network parser) and word2vec. The results reinforce caution: compared to a list of 5 common nouns (dog, river, man, space, truth), the is no further from its nearest neighbors than the nouns are from theirs. That means if one substituted the with one of its nearest neighbors (which include inpart, S.p., and G.m.b), the system might not update its decisions, just as girl can be substituted with boy. In the context of syntactic parsing, the latter substitution is likely harmless, while the former is, well, undesirable (it would be better if the system broke visibly, since carrying on will likely lead to further, stranger errors).
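For concreteness, here is a minimal sketch of the kind of comparison described above. It is my reconstruction, not the script actually used; it assumes pre-trained vectors in the plain-text word2vec format, and the file name embeddings.txt is hypothetical:

```python
import numpy as np

def load_vectors(path):
    """Load word vectors from a plain-text word2vec-format file:
    one word per line, followed by its vector components."""
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 3:        # skip the optional "vocab dim" header
                continue
            vecs[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vecs

def nearest_neighbor_sim(word, vecs):
    """Cosine similarity between `word` and its single nearest neighbor."""
    v = vecs[word]
    v = v / np.linalg.norm(v)
    best = -1.0
    for other, u in vecs.items():
        if other == word:
            continue
        sim = float(v @ (u / np.linalg.norm(u)))
        best = max(best, sim)
    return best

vecs = load_vectors("embeddings.txt")    # hypothetical file name
for w in ["the", "dog", "river", "man", "space", "truth"]:
    print(f"{w}: {nearest_neighbor_sim(w, vecs):.3f}")
# If "the" were modeled as exceptional, we would expect it to sit
# noticeably further from its nearest neighbor than ordinary nouns do.
```

The design choice here is deliberately crude: a single nearest-neighbor similarity per word is enough to show whether the sits in an unusually isolated region of the embedding space.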

One might argue that the neural network could define a tighter region around the within its decision boundaries, but this is highly unlikely. After all, there are many more exceptions in English than hidden nodes in the neural network. The ultimate test would use ungrammatical sentences, but that will have to wait.

Put together, the two opposite problems above raise a curious issue: we might be using parsers that fail to model grammars. That is, parsers that can parse correct sentences but cannot tell grammatical and ungrammatical sentences apart. I’m strongly against Sampson’s “grammar without grammaticality” idea, for practical reasons: without knowing that a sentence is ungrammatical, a system won’t be able to give it the correct treatment, let alone recover from the error.

Handling (non-)exceptions

Correctly handling exceptions and rare-but-non-exceptional cases is an interesting research topic. It is a sad fact of modern NLP that automatic large-scale evaluation leads to the blind optimization of F-score instead of answering interesting research questions (or even raising them).

But hopefully, when the fruits of algorithmic and hardware improvements have all been harvested, people will feel the need to challenge the core assumptions of machine learning. And hopefully that will lead to a breakthrough.
