Phrase Detectives caught me by surprise

I scored 52% playing Phrase Detectives on Facebook. How could I ever get a PhD in Natural Language Processing?

Just kidding, I’m not worrying at all about graduation but just a bit surprised by some features of the game. I’m studying the possibility of running a crowd-sourcing task on coreference resolution so I’m very much interested in how to do crowd-sourcing properly. Please tell me what you think in the comment section!

So these are the things that I found surprising:

1. Some cases are super hard

Nearest mention questions require a player to determine not only whether two mentions are coreferent but also whether they are closer to each other than any other coreferent mentions in the text.

This innocent-looking question can become very hard when the two mentions in question are far apart (which happened to me in the first training session). To answer it correctly, you need to read everything between the two mentions and check whether any mention in between is coreferent with them.

The “name the culprit” question is similarly challenging. My guess is that the designers assume a positive linear relationship between proximity and “easiness”. In practice, I remembered salient excerpts from the beginning of the text (e.g. when a person is first introduced) but was forced to give up those “anchors” and annotate mentions that happened to be closer.

And by the way, why is that information even needed? Once you have worked out every mention in a cluster, finding which one is closest to any given mention is a piece of cake.
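To make that last point concrete, here is a minimal sketch of how the nearest antecedent could be derived automatically from complete clusters. The names (`Mention`, `nearest_antecedents`) and the representation of mentions as token spans are my own assumptions for illustration; this is not how Phrase Detectives stores its data.

```python
from typing import Dict, List, Optional, Tuple

# Assumption: a mention is a (start_token, end_token) span in the document.
Mention = Tuple[int, int]


def nearest_antecedents(
    clusters: List[List[Mention]],
) -> Dict[Mention, Optional[Mention]]:
    """Map each mention to the closest preceding mention in its own cluster.

    The first mention of a cluster has no antecedent, so it maps to None.
    """
    nearest: Dict[Mention, Optional[Mention]] = {}
    for cluster in clusters:
        previous: Optional[Mention] = None
        # Sorting by span puts the mentions in document order.
        for mention in sorted(cluster):
            nearest[mention] = previous
            previous = mention
    return nearest


# Two hypothetical clusters, e.g. one for a person and one for a place.
clusters = [
    [(40, 41), (0, 2), (15, 16)],
    [(5, 7), (22, 23)],
]
links = nearest_antecedents(clusters)
print(links[(40, 41)])  # (15, 16)
print(links[(0, 2)])    # None
```

If annotators only had to say which cluster a mention belongs to, this kind of post-processing would recover the “nearest mention” links for free.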

2. The task isn’t streamlined

The most important thing my swimming teacher taught me was probably streamlining. By positioning your body in a certain way, you can reduce water resistance and maintain momentum.

Analogously, in annotation, you’re slowed down by new information. Human working memory can only hold a certain number of facts (some say 5-7), so if you keep encountering new people or events, you’ll constantly need to purge your working memory, extract new information about them and store it as new memories. Doing so slows you down considerably.

If you see consecutive questions about the same person or event instead, you’ll only need to maintain the facts about that person (gender, name, job, what he/she has done recently, etc.) or that event (participants, location, time, etc.) without any extra reading.

I was surprised that Phrase Detectives wasn’t designed in this way. Instead, it shows increasingly large parts of a document starting from the first sentence. I guess the intention was to encourage sort-of streamlining but things just don’t work that way. From sentence #3 or #4 onwards, I couldn’t remember all relevant information anymore and found myself looking up and down the paragraphs and reading each piece of information several times.
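Here is a rough sketch of the kind of ordering I have in mind. It assumes each question already carries a provisional entity label (for instance from an automatic coreference system, which is my assumption and not something the game provides); `Question`, `streamline` and `count_switches` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Question(NamedTuple):
    entity_id: str  # provisional entity label (assumed to be available)
    position: int   # token offset of the mention being asked about


def streamline(questions: List[Question]) -> List[Question]:
    """Group questions by entity, keeping document order within each group."""
    by_entity: Dict[str, List[Question]] = defaultdict(list)
    for question in questions:
        by_entity[question.entity_id].append(question)
    # Present entities in the order in which they first appear in the text.
    entity_order = sorted(
        by_entity, key=lambda e: min(q.position for q in by_entity[e])
    )
    ordered: List[Question] = []
    for entity_id in entity_order:
        ordered.extend(sorted(by_entity[entity_id], key=lambda q: q.position))
    return ordered


def count_switches(questions: List[Question]) -> int:
    """Count how often the annotator has to change to a different entity."""
    return sum(
        1 for a, b in zip(questions, questions[1:]) if a.entity_id != b.entity_id
    )


# Questions in plain document order, alternating between two entities.
questions = [
    Question("mary", 3),
    Question("john", 10),
    Question("mary", 25),
    Question("john", 40),
    Question("mary", 60),
]
print(count_switches(questions))              # 4
print(count_switches(streamline(questions)))  # 1
```

The number of entity switches is only a crude proxy for how often working memory has to be refreshed, but it shows the difference between the two orderings.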

3. The concept of “property” is hard to grasp

In most examples, it looks like properties are the left-hand side of “to be”, i.e. they define an expression in one way or another. But then the example “Mary is a bit cold” shattered that intuition. While doing the tasks, I found conflicting examples (i.e. things that “experts” say, revealed when you review your work) where, for instance, height is sometimes considered a property and sometimes not.

I know that in a crowd-sourcing task, you can’t assume much about the linguistic knowledge of annotators (in this case, players) but if a few examples can’t do the job, maybe you should consider giving more instructions.

4. It is drudgery

I found this problem in most “games with a purpose” that I tried. In them, the purpose is clear but the game is nowhere to be found.

I don’t consider myself particularly picky about games. I enjoy very different things: chess, Super Mario, musical instruments, rope skipping, basketball, etc. With regard to non-physical games, I think there’s a “golden ratio” of cognitive demand vs. “smartness”. For example, chess is very demanding in that it requires long uninterrupted attention, keeping many items in memory and long chains of reasoning. But it is considered a “smart” activity, so a good player can feel good about him-/herself and receive social encouragement. In contrast, Super Mario is an example of a low-demand, low-smartness game. Both types of games are very popular.

The problem with many “games with a purpose” is that they demand hard labor without smartness. Finding coreference is easy because people always maintain a model of the world in their mind, and the moment they read a mention, they immediately and effortlessly link it to an entity in that model. Annotating coreference is conceptually easy because it just requires aligning the mental model with the text, but it becomes hard in practice because people can’t remember the full text, so they have to keep scanning and rereading parts of it. And that work is not smart at all, so the player doesn’t feel good about him-/herself.


The above are just my personal observations and subjective judgments based on a small number of games. So far the reports I have found about Phrase Detectives sound very positive (I haven’t read the papers yet) and it is still one of the best game-with-a-purpose projects in NLP. I hope that by sharing my thoughts, we can make Phrase Detectives and future projects even better.


