
The Illustrated BERT: How NLP Cracked Transfer Learning (jalammar.github.io)
98 points by ghosthamlet on Dec 24, 2018 | 19 comments



I think the work done on ELMo, BERT, and others is great and useful. Unfortunately, there are many grandiose claims circulating around these papers, such as the title of this blog post.

For example:

If we’re using this GloVe representation, then the word “stick” would be represented by this vector no matter what the context was. “Wait a minute,” said a number of NLP researchers (Peters et al., 2017, McCann et al., 2017, and yet again Peters et al., 2018 in the ELMo paper), “stick has multiple meanings depending on where it’s used. Why not give it an embedding based on the context it’s used in – to both capture the word meaning in that context as well as other contextual information?” And so, contextualized word embeddings were born.

This is blatantly false. Contextualized word representations have been around for a very long time. For example, the neural probabilistic language model proposed by Bengio et al. (2003) produces contextual word representations, and there have been many papers about neural language models since. The idea is even older: Schütze's 1993 paper ("Word Space") produces context-dependent word representations from subword units (n-grams).
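
To make the point concrete: any neural language model already yields context-dependent word representations, because its hidden state at each position depends on the surrounding words. A toy sketch (my own illustration with a modern LSTM, not Bengio's exact architecture):

    import torch
    import torch.nn as nn

    vocab_size, emb_dim, hidden_dim = 1000, 32, 64
    embed = nn.Embedding(vocab_size, emb_dim)
    lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    tokens = torch.tensor([[5, 42, 7]])  # toy token ids
    states, _ = lstm(embed(tokens))      # shape: (1, 3, hidden_dim)
    # states[0, 1] represents the middle token *given its context*:
    # change a neighboring token and this vector changes too,
    # unlike a static word2vec/GloVe lookup.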

Researchers have been well aware for decades that one would ideally want context-sensitive representations, and that representations such as those produced by word2vec or GloVe have this shortcoming. However, one of the reasons word2vec became so popular is that it is damn cheap to train [1], and the possibility of pretraining on much larger corpora gave these simpler models an edge.

ELMo, BERT, and the others (even though they differ quite a bit) are spiritual successors of earlier neural language models that rely on newer techniques (bidirectional LSTMs, convolutions over characters, Transformers, etc.), larger amounts of data, and the availability of much faster hardware than we had one or two decades ago (e.g. BERT was trained on 64 TPU chips, or, as Ed Grefenstette called it, "blowing through a forest's worth of GPU-time").

Disclaimer: I have nothing against this work. I very much enjoyed the ELMo paper. I am just objecting to all the hype/marketing out there.

[1] The skip-gram model with negative sampling is very similar to logistic regression, except that one optimizes the parameters of two vectors rather than just one weight vector.
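
Concretely, for a single (word, context) pair the SGNS objective is just the logistic-regression loss on a dot product, except that both vectors are learned. A toy sketch (variable names are mine):

    import numpy as np

    def sgns_loss(w_vec, c_vec, label):
        """label = 1 for an observed pair, 0 for a negative sample."""
        p = 1.0 / (1.0 + np.exp(-np.dot(w_vec, c_vec)))  # sigmoid
        return -(label * np.log(p) + (1 - label) * np.log(1 - p))

    rng = np.random.default_rng(0)
    w, c = rng.normal(size=50), rng.normal(size=50)
    print(sgns_loss(w, c, 1))  # gradient steps update *both* vectors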


This blog post is certainly pretty flawed on attribution of ideas: it credits word2vec as the first to introduce word vectors (which is... very wrong).

"Word2Vec showed that we can use a vector (a list of numbers) to properly represent words in a way that captures semantic or meaning-related relationships (e.g. the ability to tell if words are similar, or opposites, or that a pair of words like “Stockholm” and “Sweden” have the same relationship between them as “Cairo” and “Egypt” have between them) as well as syntactic, or grammar-based, relationships (e.g. the relationship between “had” and “has” is the same as that between “was” and “is”)."


True. We built NLP/NLU vector representations from the ground up at Lawrence Berkeley National Lab from 2002 to 2008, to tackle hypothesis generation and hidden-relationship detection connected to genes, genomic pathways, and therapeutics related to extending human lifespan, DNA repair, and LET-radiation chromosomal damage repair (https://www.google.com/patents/US7987191). That work was followed by Tomas Mikolov's/Google's, and preceded by countless others, of course.


Has anyone developed commercial applications based on word embeddings?

It's clear that people are putting up better and better numbers on certain tasks that are widely shared, but for all I know these will always be a bridesmaid and never a bride when it comes to being useful for something.

Back in the 1970s it was clear that it wasn't going to be easy to make rule-based parsers that were "good enough" but it seems that now the task has been defined down so that if you can do better than chance that's a miracle. Thus people can kid themselves into thinking they are practicing what Thomas Kuhn called "normal science" since they are in the same shared reality even if it is a delusion.


I train NER and text classification models using spaCy. You can give spaCy word vectors, and the accuracy usually increases 1-5% in my experience. I use NER to classify entities within search keywords, text classification to determine whether a new court docket is relevant to a research group I work for, and I have started working on a tool for ML-assisted structured data extraction from free text. All of these things are part of commercial products or will be.


What is the before and after accuracy? How big a training set?


Probably easiest to talk about the NER for search keywords. I have found that 1,000-2,000 examples per label is enough to get >80% accuracy. I don't remember the exact before/after for adding pre-trained word vectors, but it was a 2-3% improvement, and the current F1 score is ~86%, I believe.

I have found it best to apply this to search tail queries (uncommon queries) because the relevance already sucks so any improvement is helpful. Often they are misspellings or unique wordings of common queries. If you have a product search engine, and you can recognize the product category then you have a much better chance of giving the user a relevant result even if the words are misspelled. If you can improve the overall relevance of tail queries, then revenue for the product search engine should increase as well.

I should also add that spaCy turns all words into vectors under the hood. Providing pre-trained word vectors accelerates the training process, so you don't need as many examples (hopefully).
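
For example, the md/lg English models ship with pretrained static vectors (a rough illustration; exact gains vary by task):

    import spacy

    nlp = spacy.load("en_core_web_md")  # model with pretrained vectors
    doc = nlp("cheap flights to stockholm")
    print(doc[3].vector.shape)          # static vector for "stockholm"
    print(doc[1].similarity(nlp("airfare")[0]))  # token similarity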


Most modern NLP methods are built on top of word embeddings: things like neural machine translation, text classification, etc. all convert input words into word embeddings, then stack a neural network on top.
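
In code, that pattern looks roughly like this (a toy PyTorch sketch, not any particular paper's model):

    import torch
    import torch.nn as nn

    class TextClassifier(nn.Module):
        def __init__(self, vocab_size=10000, emb_dim=100, n_classes=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)  # words -> vectors
            self.fc = nn.Linear(emb_dim, n_classes)         # network on top

        def forward(self, token_ids):
            vecs = self.embed(token_ids)      # (batch, seq, emb_dim)
            return self.fc(vecs.mean(dim=1))  # average pool, then classify

    logits = TextClassifier()(torch.randint(0, 10000, (4, 12)))
    print(logits.shape)  # torch.Size([4, 2])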


People write papers about these things.

"Do they make commercial applications?" is the question.

The paradigm here is "Does method X get better answers than method Y?" as opposed to "Will we embarrass ourselves if we put method Z into production?"


All of those things are used daily by billions of people around the world. Google sells their translation system through Google Cloud, as do AWS and a bunch of other NLP startups.


Can you point me to a text analysis API that is not embarrassingly bad?

I have tried products from Amazon, IBM, Google and other companies and the one thing they have in common is they've never passed an acceptance test for me.

Language translation is a particularly bad example of something where a "Clever Hans" effect can happen, where the reader's desire and capability for closure will fill in for the system's mistakes and make it look like it performs better than it really does.


I don’t use off the shelf APIs, but I’d be surprised if Google’s named entity recogniser isn’t pretty good in the news or financial domain.

OpenCalais is pretty good in the financial domain. http://www.opencalais.com/opencalais-demo/

IBM isn’t great. Don’t know about Amazon.


That's true, but specifically for translation I think all SOTA methods use task-specific embeddings (i.e., they learn their own instead of using BERT or ELMo, etc.). See https://arxiv.org/pdf/1808.09381.pdf and note that the BERT paper doesn't mention translation as an evaluation task.


I build commercial applications on top of the FastAI language model (similar to this) in the legal domain.

It’s a per sentence matching task, and it increased the accuracy ~5% (!!) over a tuned FastText classifier, which in turn was much better than a traditional n-gram based classifier.

The accuracy is now good enough (and human like enough) that trained humans generally disagree with each other when trying to correct its “errors”.

Training set size was ~1000 labeled examples.
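
For reference, the FastText supervised baseline mentioned above is trained roughly like this (file name and hyperparameters are placeholders):

    import fasttext

    # each line of train.txt: "__label__match some sentence text"
    model = fasttext.train_supervised(input="train.txt",
                                      epoch=25, wordNgrams=2)
    print(model.predict("sentence to classify"))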


I work at a company that uses word embeddings to generate predictions that customers actually pay for. There are lots of other things going on under the hood (e.g. neural networks in which the embeddings are the first layer, non-ML software architecture to make the UX something worth paying for) but to answer your question, embeddings are definitely generating real value.


See also "NLP's ImageNet moment has arrived" (https://thegradient.pub/nlp-imagenet/) by one of the researchers involved in the papers surveyed in this post.


Thanks for sharing this link -- found it quite a good read.


It’s ironic that the study of language leads to the creation of a new one


How is this related to fastText?



