It’s been a month and a half since I found myself working at Elsevier. Five years ago, when I was in Italy working on a Vietnamese news recommendation service, it didn’t cross my mind that the work would earn me a good job in yet another country. I’m grateful for what life brings my way. However, five years is a long stretch of time for human memory, and the state of the art has surely changed a lot. For one, these shiny toys called deep neural networks keep popping up everywhere. So a local event in the field seemed worth attending.
The RecSys Amsterdam meetup is an informal meeting of recommender-systems researchers and practitioners in Amsterdam. It used to be held once a year but, because of a surge in interest, now takes place every four months or so. The 9th installment, which I attended, was a small gathering (140 attendees were listed on the web but it felt more like 60) with companies (or academics who interned at a company) showcasing their cool stuff and, of course, approaching potential hires. I think we had a good mix of Microsoft, FD Mediagroep, and Booking.com.
To me, the presentations were very helpful for putting my work at Elsevier in perspective. The work at three big and/or well-funded companies is not as complicated as one might expect. The Microsoft folks are constrained by their existing infrastructure: they have to use an internal search engine, so even vanilla collaborative filtering is out of the question. FD Mediagroep (a financial news service) used article cosine similarity and popularity. Booking.com didn’t name a single algorithm, so perhaps that’s not their strength?
In 2012, our team of fresh Vietnamese college graduates attempted to reproduce Google News’ distributed recommender system. It was challenging and cool but, in hindsight, overkill. The lesson I later learned, and which was echoed in this meeting, is to start simple, integrate early, A/B test, and listen to customers. After all, technology is not just cool stuff but a set of tools to serve people.
Apart from contextualizing RecSys work in Amsterdam, the meeting also offered some interesting ideas:
- Microsoft’s system learns to formulate search queries instead of running conventional RecSys algorithms. They used a CNN to score keywords (it seems that a query is a list of keywords) and to predict when to stop adding them (a toy illustration follows this list). Their paper can be found here.
- The guy from FD Mediagroep seems very fond of PySpark, and more specifically of the computeSimilarities() function. Maybe I’ll look it up (the closest standard call I know of is sketched after this list).
- Booking.com faces a problem called “continuous cold start”: interactions with customers are sporadic and their tastes change in between.
- It seems that Booking.com has the most established machine learning workflow that enables everyone in a product development team to add new features.
- They use machine learning to personalize the user interface as well (e.g. the number of items to display).
- Booking.com doesn’t use “precision, recall, etc.” for reasons I don’t find so persuasive. Instead they employ “label-free evaluation” by plotting “Response Distribution Charts” (simply histograms of model predictions) and identifying four types of “potential pathologies” (photo below). This could be a good complementary method for debugging a model (a minimal sketch follows this list).
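On the query-formulation point: here is a toy PyTorch sketch of the kind of keyword-scoring CNN I imagine from the talk. This is purely illustrative and not Microsoft’s architecture; the module, its dimensions, and the stopping signal are all my assumptions.

```python
import torch
import torch.nn as nn

class KeywordScorer(nn.Module):
    """Toy CNN that scores each token as a query keyword and emits a
    'stop adding keywords' signal. Illustrative only, not the paper's model."""
    def __init__(self, vocab_size, emb_dim=50, n_filters=32, window=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, n_filters,
                              kernel_size=window, padding=window // 2)
        self.keyword = nn.Linear(n_filters, 1)  # per-token keyword score
        self.stop = nn.Linear(n_filters, 1)     # global stopping signal

    def forward(self, token_ids):                 # (batch, seq)
        x = self.emb(token_ids).transpose(1, 2)   # (batch, emb, seq)
        h = torch.relu(self.conv(x))              # (batch, filters, seq)
        kw = self.keyword(h.transpose(1, 2)).squeeze(-1)  # (batch, seq)
        halt = self.stop(h.max(dim=2).values)             # (batch, 1)
        return kw, halt
```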
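On the similarity function: I haven’t found a PySpark call named computeSimilarities(); the closest standard one I know is RowMatrix.columnSimilarities(), which computes pairwise cosine similarities between the columns of a distributed matrix. A minimal sketch, with a placeholder matrix whose columns stand for articles:

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.appName("article-similarity").getOrCreate()

# Placeholder term-by-article matrix: rows are terms, columns are articles,
# so columnSimilarities() yields article-article cosine similarity.
rows = spark.sparkContext.parallelize([
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 4.0],
    [5.0, 6.0, 0.0],
])
sims = RowMatrix(rows).columnSimilarities()  # upper-triangular CoordinateMatrix

for entry in sims.entries.collect():
    print("articles %d and %d: cosine = %.3f" % (entry.i, entry.j, entry.value))
```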
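On the Response Distribution Charts: as I understand it, such a chart is nothing more than a histogram of a model’s predictions on live traffic, which makes pathologies (e.g. scores piling up at one end, or a spike at the decision boundary) visible without any labels. A minimal sketch with made-up scores:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder for real model predictions on unlabeled traffic, in [0, 1].
scores = np.random.beta(2, 5, size=10_000)

plt.hist(scores, bins=50)
plt.xlabel("predicted score")
plt.ylabel("count")
plt.title("Response Distribution Chart")
plt.show()
```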
It was nice attending the 9th RecSys Amsterdam meetup. I’m looking forward to the next gathering in a few months.
Batchkarov et al. (2016) is one of the evaluation/methodology papers that are much needed in NLP, and I hope we’ll have more of them. But with respect to statistical methodology, I think the paper is troublesome, or at least not good enough for ACL. In this short report, I explain why.
Critical evaluation of word similarity datasets is very important for computational lexical semantics. This short report concerns the sanity check proposed in Batchkarov et al. (2016) to evaluate several popular datasets such as MC, RG, and MEN (the first two reportedly failed). I argue that this test is unstable, offers no added insight, and needs major revision in order to fulfill its purported goal.
I got 52% playing Phrase Detectives on Facebook. How could I ever get a PhD in Natural Language Processing?
Just kidding. I’m not worried at all about graduation, just a bit surprised by some features of the game. I’m studying the possibility of running a crowd-sourcing task on coreference resolution, so I’m very much interested in how to do crowd-sourcing properly. Please tell me what you think in the comment section!
Statistical machine learning has been the de facto standard in NLP research and practice. However, its very success might be hiding its problems. One such problem is exceptions.
Natural language is full of exceptions: idiomatic phrases that defy compositionality, irregular verbs and exceptions to grammatical rules, or unexpected events that, though not linguistic phenomena themselves, happen to be communicated via language. So far, statistical NLP has treated them as inconvenient oddities and, in most cases, swept them under the rug, hoping they wouldn’t reduce F-score.
But a system doesn’t really understand language without handling exceptions, and I will argue that (not) handling exceptions has important consequences for machine learning.
A quick note from EACL: some papers related to the LSDSem workshop (Bugert et al. 2017; Zhou et al. 2015) use McNemar’s test to establish statistical significance, and I find it very odd.
McNemar’s test examines “marginal (probability) homogeneity”, which in our case means whether two systems yield (statistically) the same performance. According to source code I found on GitHub, the way it works is:
- Obtain predictions of System 1 and System 2
- Compare them to gold labels to fill a 2x2 contingency table: a = both systems correct, b = only System 1 correct, c = only System 2 correct, d = both systems wrong
- Compute the test statistic χ² = (b - c)² / (b + c), which follows a chi-squared distribution with one degree of freedom, and the corresponding p-value
- If the p-value is less than a certain level (e.g. the magical 0.05), we reject the null hypothesis, which is p(Sys1 correct) == p(Sys2 correct) (a minimal implementation sketch follows)
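Here is a minimal sketch of that procedure in Python. This is my own implementation, not the GitHub code the papers used, and it includes the common continuity correction:

```python
import numpy as np
from scipy.stats import chi2

def mcnemar(sys1_correct, sys2_correct):
    """McNemar's test from per-example correctness of two systems.
    Only the discordant counts matter: b = Sys1 right, Sys2 wrong;
    c = Sys1 wrong, Sys2 right."""
    sys1_correct = np.asarray(sys1_correct, dtype=bool)
    sys2_correct = np.asarray(sys2_correct, dtype=bool)
    b = int(np.sum(sys1_correct & ~sys2_correct))
    c = int(np.sum(~sys1_correct & sys2_correct))
    if b + c == 0:
        return 0.0, 1.0  # the systems never disagree
    stat = (abs(b - c) - 1) ** 2 / (b + c)  # with continuity correction
    return stat, chi2.sf(stat, df=1)

# Example: two systems scored against gold labels on ten items.
stat, p = mcnemar([1, 1, 1, 0, 1, 0, 1, 1, 0, 1],
                  [1, 0, 1, 0, 0, 0, 1, 1, 0, 0])
print("chi2 = %.3f, p = %.3f" % (stat, p))
```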
As it happens in the papers, the difference is statistically significant and therefore the results are meaningful. Happy?
Not so fast.
I was reading Clark and Manning (2016) and studying their code. The contrast is just amazing.
This is what the paper has to say:
This is what I found after 1 hour of reading a JSON file and writing down all layers of the neural net (the file is data/models/all_pairs/architecture.json, created when you run the experiment):
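If you want to poke at such a file yourself, here is a minimal sketch. I’m assuming it is a Keras-style model JSON, which is my guess; the exact schema varies by library and version, so adjust the key path as needed:

```python
import json

with open("data/models/all_pairs/architecture.json") as f:
    arch = json.load(f)

# Walk the top level and print whatever layer records we can find.
# Older Keras files keep a flat "layers" list; newer ones nest it
# under "config".
layers = arch.get("layers") or arch.get("config", {}).get("layers", [])
for layer in layers:
    print(layer.get("class_name", layer.get("name", "?")))
```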
Without the source code, this would be a replication nightmare for sure.
Clark, K., & Manning, C. D. (2016). Improving Coreference Resolution by Learning Entity-Level Distributed Representations. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 643–653. http://doi.org/10.18653/v1/P16-1061
Title: Tackling Error Propagation through Reinforcement Learning: A Case of Greedy Dependency Parsing
Conference: EACL 2017 (European Chapter of the Association for Computational Linguistics), in Valencia, 3-7 April 2017.
Error propagation is a common problem in NLP. Reinforcement learning explores erroneous states during training and can therefore be more robust when mistakes are made early in a process. In this paper, we apply reinforcement learning to greedy dependency parsing which is known to suffer from error propagation. Reinforcement learning improves accuracy of both labeled and unlabeled dependencies of the Stanford Neural Dependency Parser, a high performance greedy parser, while maintaining its efficiency. We investigate the portion of errors which are the result of error propagation and confirm that reinforcement learning reduces the occurrence of error propagation.
Full article: arXiv:1702.06794
Slides: view online
Last week, I had a good time at CLIN 27. The city was pretty, with a cute morning market and tasty croissants. The snow was kind to me (sometimes I do like snow, if it is gentle). The poem presentation by Tim van de Cruys was funny and I met some old friends. I brought to CLIN my own side project, in which I explore, explain and (slightly) improve word2vec:
Hyperparameter tuning is difficult, not because it’s terribly complicated but because obtaining enough resources is often not easy. I’m lucky enough to work at Vrije Universiteit and can therefore access the SURFsara HPC Cloud without too much effort. Compared to Amazon EC2 (the only other cloud solution I have tried), the functionality is rather basic but, I think, suits the needs of many researchers. Using the web interface or the OpenNebula API, you can easily customize an image, attach a hard drive, launch 10 instances, and access any of them using a public key (a sketch follows below). What else do you need to run your experiments?
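For illustration, here is a minimal sketch using the pyone bindings for the OpenNebula XML-RPC API. The endpoint, credentials, and template ID are placeholders, and I haven’t verified this against the SURFsara setup specifically:

```python
import pyone

# Placeholders: point these at your own OpenNebula front-end and account.
one = pyone.OneServer("http://<frontend>:2633/RPC2",
                      session="username:password")

TEMPLATE_ID = 42  # ID of the image/template you customized beforehand

# Launch ten instances from the same template.
vm_ids = [one.template.instantiate(TEMPLATE_ID, "worker-%d" % i)
          for i in range(10)]

# Check their names and states; SSH access then works via the public
# key baked into the template's context.
for vm_id in vm_ids:
    vm = one.vm.info(vm_id)
    print(vm.NAME, vm.STATE)
```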
Neural dependency parsing is attractive for several reasons: first, distributed representations generalize better; second, fast parsing unlocks new applications; and third, fast training means parsers can be co-trained with other NLP modules and integrated into a bigger system.
Chen & Manning (2014) from Stanford were the first to show that neural dependency parsing works and Google folks were quick to adopt this paradigm to improve the state-of-the-art (e.g. Weiss et al., 2015).
Though Stanford open-sourced their parser as part of CoreNLP, they didn’t release the code of their experiments. As anybody in academia probably knows, reproducing experiments is non-trivial, at times extremely difficult. Since I have painstakingly gone through the process, I think it’s a good idea to share it with you.