Notes on k-winners take all and shattered loss landscapes

Following ICLR 2020, I’m delighted to find a paper that uses k-winners-take-all, an old favorite of mine from my master’s studies:

Xiao, C., Zhong, P., & Zheng, C. (2020). Enhancing Adversarial Defense by k-Winners-Take-All. In International Conference on Learning Representations. Retrieved from

The paper reveals a super interesting insight: kWTA networks, given enough width, are highly discontinuous w.r.t. the input but relatively continuous w.r.t. the weights. This is bad news for attackers who use gradients naïvely, because they’re going to face local minima and steep gradient walls all the time. At the same time, we can still train the network by following the gradient w.r.t. the weights.
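For readers who haven’t met kWTA before, here is a minimal pure-Python sketch of the activation as I understand it: keep the k largest values in a layer and zero out the rest. The function name kwta and the tie-breaking rule are my own assumptions for illustration, not the paper’s implementation.

```python
def kwta(activations, k):
    """Keep the k largest activations and set all others to zero.

    Hypothetical helper for illustration; ties are broken in favour of
    the earlier position, which is an arbitrary choice.
    """
    if not 0 < k <= len(activations):
        raise ValueError("k must be between 1 and the layer width")
    # Rank positions by activation value, largest first (stable sort).
    ranked = sorted(range(len(activations)),
                    key=lambda i: activations[i], reverse=True)
    winners = set(ranked[:k])
    return [a if i in winners else 0.0 for i, a in enumerate(activations)]


# A tiny change in one pre-activation flips which unit wins, so the
# layer output jumps instead of changing smoothly.
before = kwta([1.00, 0.99, 0.20], k=1)
after = kwta([1.00, 1.01, 0.20], k=1)
print(before)
print(after)
```

This toy layer has only three units; it merely shows where the jumps w.r.t. the input come from, not how they add up in a wide network.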

The visualization of the loss landscape is particularly striking, having some evil-castle-esque appearance to it:

[Figure: screenshot of the loss landscape visualization]

Some obvious questions come to my mind:

  • Is it desirable to have such a loss landscape? Assuming it is a cross-entropy loss calculated on output probabilities, the figure implies that the probabilities fluctuate wildly with tiny perturbations of the input. Are we sacrificing usability for (a narrow definition of) robustness?
  • In this active arms race, how long will this defense last? For example, an attacker might smooth out the gradient by taking the average loss over a small region around any point. This might help them find an effective direction to attack.
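The smoothing idea in the second question can be sketched in a few lines. This is only a toy under my own assumptions: jagged_loss is a made-up one-dimensional stand-in for a shattered landscape (an upward trend plus high-frequency steps), and smoothed_loss and smoothed_direction are hypothetical helper names. Whether such an attack actually works against the paper’s defense is not something this sketch shows.

```python
import math
import random


def jagged_loss(x):
    """Toy loss: upward trend in x[0] plus high-frequency steps."""
    return 3.0 * x[0] + 0.5 * (math.floor(x[0] * 50) % 2)


def smoothed_loss(loss_fn, x, radius, n_samples=200, seed=0):
    """Average loss_fn over random points in a small box around x."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        noisy = [xi + rng.uniform(-radius, radius) for xi in x]
        total += loss_fn(noisy)
    return total / n_samples


def smoothed_direction(loss_fn, x, radius, step, n_samples=200, seed=0):
    """Central finite differences on the smoothed loss.

    The same seed is used on both sides so that the random offsets
    cancel and only the shift along coordinate i matters.
    """
    grad = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += step
        minus[i] -= step
        diff = (smoothed_loss(loss_fn, plus, radius, n_samples, seed)
                - smoothed_loss(loss_fn, minus, radius, n_samples, seed))
        grad.append(diff / (2 * step))
    return grad


grad = smoothed_direction(jagged_loss, [0.3, 0.7], radius=0.05, step=0.1)
print(grad)
```

In this toy, the averaged estimate points uphill along the first coordinate and is flat along the second, which the loss ignores.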

Quick and easy expert annotation

My PhD research had me dig into Amazon Mechanical Turk more than I wished to. Together with CrowdFlower/Figure8, those are the default names that come up whenever one thinks of annotation but, after much research, I realized that the right tool is much simpler.

Crowd-sourcing tools are for, ahem, crowds — a mass of unknown people of unknown characteristics. But not every task can be done by a crowd. Continue reading

Replicability as a moral responsibility

In the newly developed Oxford-Munich code of conduct for data scientists, one can find the following articles that obligate data scientists to ensure replicability:

3b(iv). Artificial data handling

The Data Scientist is responsible for communicating all the procedures employed to make the original data more adequate for the specific problem, especially techniques intended to correct gaps in the data, to balance classification problems, e.g.  Interpolation, extrapolation, oversampling and under-sampling. As far as possible, these procedures should be peer-reviewed.

4l. AI Reproducibility

Most of the models created by data scientists have stochastic components, meaning there is no guarantee that the same model will be produced given the same training data. Moreover, it’s a known issue, that fixing a seed to force reproducibility compromises the parallelization of the models.
The Data Scientist shall be responsible to ensure reproducibility in situations where understanding the overall behavior of the system is critical.

Hopefully, this initiative will make industrial and academic researchers more aware of replicability issues and improve their procedures.
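As a small illustration of the seed point in article 4l, here is a sketch using only Python’s standard random module. The function train_toy_model is a hypothetical stand-in for a stochastic training run, not a real training procedure; the only point is that a private, seeded generator makes the run repeatable.

```python
import random


def train_toy_model(seed):
    """Hypothetical stand-in for a stochastic training run."""
    rng = random.Random(seed)  # private generator, no global state
    weights = [rng.gauss(0.0, 1.0) for _ in range(3)]  # random init
    data = list(range(10))
    rng.shuffle(data)  # random order of the training examples
    for x in data:
        # Dummy "update" that also consumes random numbers.
        weights = [w + 0.01 * x * rng.random() for w in weights]
    return weights


run_a = train_toy_model(seed=42)
run_b = train_toy_model(seed=42)
run_c = train_toy_model(seed=43)
print(run_a == run_b)
print(run_a == run_c)
```

This single-threaded toy says nothing about the parallelization cost that the article mentions.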


Replicable by design

At EACL last year, I had a lengthy chat with a guy next to his poster about the (ir)replicability of some high-profile papers in information retrieval [1]. In my roughly five years of research, I have also often run into reproducibility problems. Many PhD students out there probably have similar experiences.

Obviously, researchers should take full responsibility for producing replicable research. But we should also recognize the underlying systemic issue: researchers are not rewarded for making their work repeatable. Once a paper is accepted, you are already in the middle of a new one, so there’s no time to make your old code re-runnable (if that’s possible at all). On top of that, the likelihood (or threat) of your work being reproduced is terribly small. There are not many reports of reproducibility problems in NLP, and retracted papers are non-existent. While big conferences are starting to address this problem (COLING 2018 has a track for reproduction and LREC 2018 also mentions “replicability and reproducibility issues”), I suspect it will take years for the effect to be felt.

In the meantime, what we can do is align the effort with the incentive. Ideally, it should take no extra work to make your research replicable. The solution, I think, is to make experiments replicable by design. Continue reading

9th RecSys Amsterdam meetup: glimpses and creative nuggets

It’s been a month and a half since I started working at Elsevier. Five years ago, when I was in Italy working on a Vietnamese news recommendation service, it didn’t occur to me that the work would earn me a good job in yet another country. I’m grateful for what life brings my way. However, five years is a long stretch of time for human memory, and the state of the art has surely changed a lot. For one, these shiny toys called deep neural networks keep popping up everywhere. So a local event in the field seemed worth attending.

RecSys Amsterdam meetup is an informal meeting of recommender-system researchers and practitioners in Amsterdam. It used to be held once a year but, because of a surge in interest, it now takes place every four months or so. The 9th installment that I attended was a small gathering (140 attendees were listed on the web but it felt more like 60) with companies (or academics who interned at a company) showcasing their cool stuff and, of course, approaching potential hires. I think we had a good mix of Microsoft, FD Mediagroep, and Continue reading

A Critique of a Critique of Word Similarity Datasets: Sanity Check or Unnecessary Confusion?

Batchkarov et al. (2016) is one of the evaluation/methodology papers that NLP badly needs, and I hope we’ll have more of them. But I think that, w.r.t. statistical methodology, the paper is troublesome or at least not good enough for ACL. In this short report, I explain why.

Critical evaluation of word similarity datasets is very important for computational lexical semantics. This short report concerns the sanity check proposed in Batchkarov et al. (2016) to evaluate several popular datasets such as MC, RG and MEN — the first two reportedly failed. I argue that this test is unstable, offers no added insight, and needs major revision in order to fulfill its purported goal.

Phrase Detectives caught me by surprise

[Image: screenshot taken 2017-07-13]

I got 52% playing Phrase Detectives on Facebook. How could I get a PhD in Natural Language Processing?

Just kidding, I’m not worried about graduating at all, just a bit surprised by some features of the game. I’m studying the possibility of running a crowd-sourcing task on coreference resolution, so I’m very much interested in how to do crowd-sourcing properly. So these are the things that I found surprising: Continue reading

Notes on machine learning and exceptions

Statistical machine learning has been the de facto standard in NLP research and practice. However, its very success might be hiding its problems. One such problem is exceptions.

Natural language is full of exceptions: idiomatic phrases that defy compositionality, irregular verbs and exceptions to grammatical rules, or unexpected events that, though not linguistic phenomena themselves, happen to be communicated via language. So far, statistical NLP has treated them as inconvenient oddities and, in most cases, swept them under the rug, hoping that they wouldn’t reduce the F-score.

But a system doesn’t really understand language without handling exceptions, and I will argue that (not) handling exceptions has important consequences for machine learning. Continue reading

A paper is the tip of an iceberg

I was reading Clark and Manning (2016) and studying their code. The contrast is just amazing.

This is what the paper has to say:


This is what I found after 1 hour of reading a JSON file and writing down all layers of the neural net (the file is data/models/all_pairs/architecture.json, created when you run the experiment):

Without the source code, this would be a replication nightmare for sure.


Clark, K., & Manning, C. D. (2016). Improving Coreference Resolution by Learning Entity-Level Distributed Representations. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 643–653.