Two ideas of error control in natural language processing

While I was still active in NLP, one thing that struck me was how hard it was to interpret errors. When people make errors, for example while reading garden-path sentences, it’s easy to see why. Maybe the verb is used in an uncommon way or we are too eager to connect words together. But the errors of a statistical parser don’t make any sense. They appear totally random, at least to me.

The multi-layered approach to language analysis has a rich tradition but little real-world application. These days, people in industry use neural end-to-end models. One reason for that is that the performance on many tasks is still lower than it needs to be. A syntactic parser with 92% accuracy sounds great but, assuming each sentence has 20 tokens and token errors are independent, it gets a sentence completely correct only 0.92^20 ≈ 19% of the time. Errors throw off further processing and we don’t know what to do with them. We can’t even detect them so far.
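The 19% figure can be checked with a few lines of Python. This is a minimal sketch: the function name `sentence_accuracy` is made up for illustration, and it assumes token errors are independent, which real parsers do not strictly satisfy.

```python
def sentence_accuracy(token_accuracy: float, sentence_length: int) -> float:
    """Chance that every token in a sentence is analysed correctly.

    Assumption: token-level errors are independent, so the per-token
    success probabilities simply multiply.
    """
    return token_accuracy ** sentence_length


if __name__ == "__main__":
    # Longer sentences are much less likely to be entirely correct.
    for length in (5, 10, 20, 40):
        print(length, round(sentence_accuracy(0.92, length), 3))
```

For a 20-token sentence the result is roughly 0.19, and it keeps shrinking as sentences get longer.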

Error control seems to be the most pressing neglected problem in NLP. If we could detect and ignore stupid errors, we could have much more confidence in the rest of the analysis (i.e. trading recall for accuracy), so more applications would open up. Thresholding, where we remove predictions with scores under a certain bar, is an obvious choice. So calibrating the scores or probabilities of models to match the empirical chance of success is a worthy research direction (which is, sadly, also under-researched). I’d like to add to that two more ideas for error control borrowed from telecommunications.
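To make thresholding and calibration concrete, here is a small sketch on invented data. The helper names (`threshold_report`, `calibration_gap`) and the toy predictions are assumptions for illustration only; the calibration measure is the usual binned gap between average confidence and empirical accuracy.

```python
from typing import List, Tuple

Prediction = Tuple[float, bool]  # (model score in [0, 1], was the prediction correct?)


def threshold_report(preds: List[Prediction], threshold: float) -> Tuple[float, float]:
    """Return (coverage, accuracy on kept predictions) after dropping low scores."""
    kept = [correct for score, correct in preds if score >= threshold]
    coverage = len(kept) / len(preds) if preds else 0.0
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return coverage, accuracy


def calibration_gap(preds: List[Prediction], n_bins: int = 5) -> float:
    """Weighted mean |confidence - accuracy| over equal-width score bins."""
    bins: List[List[Prediction]] = [[] for _ in range(n_bins)]
    for score, correct in preds:
        index = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[index].append((score, correct))
    gap = 0.0
    for bucket in bins:
        if not bucket:
            continue
        confidence = sum(score for score, _ in bucket) / len(bucket)
        accuracy = sum(correct for _, correct in bucket) / len(bucket)
        gap += len(bucket) / len(preds) * abs(confidence - accuracy)
    return gap


# Invented toy predictions, not taken from any real parser.
toy_preds = [(0.95, True), (0.9, True), (0.8, True), (0.7, False),
             (0.6, True), (0.4, False), (0.3, False), (0.2, True)]

if __name__ == "__main__":
    print(threshold_report(toy_preds, 0.0))  # keep everything
    print(threshold_report(toy_preds, 0.5))  # keep only confident predictions
    print(calibration_gap(toy_preds))
```

On the toy data, raising the bar to 0.5 keeps five of the eight predictions and four of those five are correct. Thresholding only helps this way if the scores are reasonably calibrated, which is what the second function is meant to check.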

Continue reading

Machine learning madness and causal fitting

As of May 2021, MLPs are making waves on machine learning Twitter. In a span of roughly a month, multiple reports came out announcing that multi-layer feedforward neural networks can achieve state-of-the-art results in computer vision and natural language processing, and that attention is not necessary after all. So is BERT going out and are MLPs cool again? Before we jump to conclusions, let’s take a step back.

Continue reading

Notes on k-winners take all and shattered loss landscapes

Following ICLR 2020, I’m delighted to find a paper that uses k-winners take all, an old favorite of mine from my master’s studies:

Xiao, C., Zhong, P., & Zheng, C. (2020). Enhancing Adversarial Defense by k-Winners-Take-All. In International Conference on Learning Representations.

The paper reveals a super interesting insight: kWTA networks, given enough width, are highly discontinuous w.r.t. the input but relatively continuous w.r.t. the weights. This is bad news for attackers who use gradients naïvely because they’re going to face local minima and steep gradient walls all the time. At the same time, we can still train the network by following the gradient w.r.t. the weights.
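For readers who haven’t met the activation before, here is a minimal pure-Python sketch of k-winners take all as I understand it: keep the k largest activations in a layer and zero out the rest. The function name `kwta` and the tie-breaking rule (earlier units win ties) are my own assumptions, not details taken from the paper.

```python
from typing import List


def kwta(activations: List[float], k: int) -> List[float]:
    """Keep the k largest activations and set all others to zero.

    Ties are broken in favour of earlier positions (an arbitrary choice).
    """
    if k <= 0:
        return [0.0] * len(activations)
    # Indices of the k largest values; Python's sort is stable, so equal
    # values keep their original order even with reverse=True.
    order = sorted(range(len(activations)),
                   key=lambda i: activations[i], reverse=True)
    winners = set(order[:k])
    return [a if i in winners else 0.0 for i, a in enumerate(activations)]


if __name__ == "__main__":
    # A tiny change in the pre-activations flips which unit wins,
    # so the layer output jumps instead of changing smoothly.
    print(kwta([1.0, 0.999, 0.2], k=1))
    print(kwta([1.0, 1.001, 0.2], k=1))
```

The two calls differ by a perturbation of 0.002 in one unit, yet the winner switches from the first unit to the second. That switch is the kind of discontinuity w.r.t. the input that the paper exploits.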

The visualization of the loss landscape is particularly striking, having some evil-castle-esque appearance to it:

[Figure: screenshot of the loss landscape visualization]

Some obvious questions come to my mind:

  • Is it desirable to have such a loss landscape? Assuming it is a cross-entropy loss calculated on output probabilities, the figure implies that the probabilities fluctuate wildly with tiny perturbation of the input. Are we sacrificing usability for (a narrow definition of) robustness?
  • In this active arms race, how long does this defense last? For example, an attacker might smooth out the gradient by taking the average loss of a small region around any point. This might help them find an effective direction to attack.
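The smoothing idea in the second question can be sketched on a toy problem. Everything here is hypothetical: `jagged_loss` is an invented one-dimensional, piecewise-constant function standing in for a shattered loss landscape, not the loss of a real kWTA network, and the window sizes are chosen by hand.

```python
import math


def jagged_loss(x: float) -> float:
    """Toy piecewise-constant loss: flat steps with an upward trend plus zig-zag jumps."""
    step = math.floor(50 * x)
    return 0.02 * step + 0.3 * (step % 2)


def naive_gradient(loss, x: float, eps: float = 1e-6) -> float:
    """Central finite difference with a tiny step, as a naive attacker would use."""
    return (loss(x + eps) - loss(x - eps)) / (2 * eps)


def smoothed_loss(loss, x: float, radius: float = 0.1, n: int = 1000) -> float:
    """Average loss over an evenly spaced grid in [x - radius, x + radius]."""
    points = (x - radius + 2 * radius * (i + 0.5) / n for i in range(n))
    return sum(loss(p) for p in points) / n


def smoothed_gradient(loss, x: float, radius: float = 0.1, step: float = 0.1) -> float:
    """Central difference of the smoothed loss, using a coarse step."""
    return (smoothed_loss(loss, x + step, radius)
            - smoothed_loss(loss, x - step, radius)) / (2 * step)


if __name__ == "__main__":
    x = 0.505
    print("naive gradient:   ", naive_gradient(jagged_loss, x))
    print("smoothed gradient:", smoothed_gradient(jagged_loss, x))
```

At a point inside a flat step the naive gradient is zero and tells the attacker nothing, while the smoothed estimate recovers the overall upward trend of the toy loss. Whether the same trick works against a real kWTA network is exactly the open question.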

Quick and easy expert annotation

My PhD research had me dig into Amazon Mechanical Turk more than I wished to. Together with CrowdFlower/Figure8, those are the default names that come up whenever one thinks of annotation but, after much research, I realized that the right tool is much simpler.

Crowd-sourcing tools are for, ahem, crowds — a mass of unknown people of unknown characteristics. But not every task can be done by a crowd. Continue reading

Replicability as a moral responsibility

In the newly developed Oxford-Munich code of conduct for data scientists, one can find the following articles that obligate data scientists to ensure replicability:

3b(iv). Artificial data handling

The Data Scientist is responsible for communicating all the procedures employed to make the original data more adequate for the specific problem, especially techniques intended to correct gaps in the data, to balance classification problems, e.g.  Interpolation, extrapolation, oversampling and under-sampling. As far as possible, these procedures should be peer-reviewed.

4l. AI Reproducibility

Most of the models created by data scientists have stochastic components, meaning there is no guarantee that the same model will be produced given the same training data. Moreover, it’s a known issue, that fixing a seed to force reproducibility compromises the parallelization of the models.
The Data Scientist shall be responsible to ensure reproducibility in situations where understanding the overall behavior of the system is critical.

Hopefully, this initiative will make industrial and academic researchers more aware of replicability issues and improve their procedures.
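As a small illustration of the seed-fixing mentioned in article 4l, here is a minimal sketch with a made-up “training” routine. The function `train_toy_model` is hypothetical and only stands in for any stochastic procedure; real frameworks have their own seeding mechanisms and additional sources of nondeterminism.

```python
import random
from typing import List


def train_toy_model(seed: int, n_steps: int = 5) -> List[float]:
    """Hypothetical stochastic 'training': random init plus random updates."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    weights = [rng.gauss(0, 1) for _ in range(3)]
    for _ in range(n_steps):
        index = rng.randrange(3)
        weights[index] += rng.uniform(-0.1, 0.1)
    return weights


if __name__ == "__main__":
    # Same seed, same model; different seed, different model.
    print(train_toy_model(42) == train_toy_model(42))
    print(train_toy_model(42) == train_toy_model(43))
```

Passing the generator around explicitly, instead of relying on a global seed, makes it easier to see which parts of an experiment are actually controlled.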


Replicable by design

At EACL last year, I had a lengthy chat with a guy next to his poster about the (ir)replicability of some high-profile papers in information retrieval [1]. During my five or so years of research, I also often ran into reproducibility problems. Many PhD students out there have probably had similar experiences.

Obviously, researchers should take full responsibility for producing replicable research. But we should also recognize the underlying systemic issue. Researchers are not rewarded for making their work repeatable. Once a paper is accepted, you are already in the middle of a new one so there’s no time to make your old code re-runnable (if that’s possible at all). Added to that, the likelihood (or threat) of your work being reproduced is terribly small. There are not many reports of reproducibility problems in NLP and retracted papers are non-existent. While big conferences are starting to address this problem (COLING 2018 has a track for reproduction and LREC 2018 also mentions “replicability and reproducibility issues”), I suspect it will take years for the effect to be felt.

In the meantime, what we could do is to align the effort to the incentive. Ideally, it should take no extra work to make your research replicable. The solution, I think, is to make experiments replicable by design. Continue reading

9th RecSys Amsterdam meetup: glimpses and creative nuggets

It’s been a month and a half since I started working at Elsevier. Five years ago, when I was in Italy working on a Vietnamese news recommendation service, it didn’t cross my mind that the work would earn me a good job in yet another country. I’m grateful for what life brings my way. However, five years is a long stretch of time for human memory and the state of the art has surely changed a lot. For one, these shiny toys called deep neural networks keep popping up everywhere. So a local event in the field seemed good to attend.

RecSys Amsterdam meetup is an informal meeting of recommender researchers and practitioners in Amsterdam. It used to be held once a year but, because of a surge in interest, it now takes place every four months or so. The 9th installment that I attended was a small gathering (there were 140 attendees on the web but it felt more like 60) with companies (or academics who interned at a company) showcasing their cool stuff and, of course, approaching potential hires. I think we had a good mix of Microsoft, FD Mediagroep, and Continue reading

A Critique of a Critique of Word Similarity Datasets: Sanity Check or Unnecessary Confusion?

Batchkarov et al. (2016) is one of the evaluation/methodology papers much needed in NLP and I hope we’ll have more of them. But I think that, w.r.t. statistical methodology, the paper is troublesome or at least not good enough for ACL. In this short report, I explain why.

Critical evaluation of word similarity datasets is very important for computational lexical semantics. This short report concerns the sanity check proposed in Batchkarov et al. (2016) to evaluate several popular datasets such as MC, RG and MEN — the first two reportedly failed. I argue that this test is unstable, offers no added insight, and needs major revision in order to fulfill its purported goal.

Phrase Detectives caught me by surprise

[Figure: screenshot of my Phrase Detectives score]

I got 52% playing Phrase Detectives on Facebook. How could I get a PhD in Natural Language Processing?

Just kidding, I’m not worrying at all about graduation but just a bit surprised by some features of the game. I’m studying the possibility of running a crowd-sourcing task on coreference resolution so I’m very much interested in how to do crowd-sourcing properly. So these are the things that I found surprising: Continue reading