Since the advent of Word2Vec (along with other word-vector models) and its rich word representations, natural language processing models have excelled at tasks such as sentiment analysis, language modeling, and machine translation. This post covers a related concept, sentence embeddings, which has led to a significant performance boost on several NLP tasks; the idea boils down to encoding whole sentences rather than just words. First, let's take a quick and dirty look at word embeddings.
Word embeddings is a collective term for models that learn to map the words of a vocabulary to vectors of numerical values. The core idea is that every word used in a language can be represented by a set of real numbers (a vector). Word embeddings are N-dimensional vectors that try to capture word meaning and context in their values. Any set of numbers is a valid word vector, but to be useful, the word vectors for a vocabulary should capture the meaning of words, the relationships between words, and the context in which different words are naturally used.
The by-product of doing so is quite interesting. Once all the words in the vocabulary are mapped to vectors, each embedding captures the "meaning" of its word: similar words get similar embedding values and, as a result, end up close to each other in the latent space.
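This closeness is usually measured with cosine similarity. A minimal sketch with NumPy, using made-up 4-dimensional vectors (real embeddings like GloVe have 300 dimensions learned from a corpus):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy "embeddings", hand-picked so that related words point in
# similar directions; illustrative only.
vec = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.8, 0.9, 0.2, 0.1]),
    "apple": np.array([0.1, 0.2, 0.9, 0.8]),
}

print(cosine(vec["king"], vec["queen"]))  # high: related words
print(cosine(vec["king"], vec["apple"]))  # low: unrelated words
```

In a trained embedding space, this similarity score is what makes "king" land near "queen" and far from "apple".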
Sentence embeddings are a similar concept: a full sentence is embedded into a vector space. These sentence embeddings retain some nice properties, since they inherit features from their underlying word embeddings.
There has been a huge trend lately in the quest for Universal Embeddings: embeddings pre-trained on a large corpus that can be plugged into a variety of downstream task models (sentiment analysis, classification, translation, ...) to automatically improve their performance by incorporating general word/sentence representations learned on the larger dataset.
While unsupervised representation learning of sentences had been the norm for quite some time, the last year has seen a shift toward supervised and multi-task learning schemes, with a number of very interesting proposals in late 2017/early 2018.
While there are many competing approaches to learning sentence embeddings, in this post we primarily focus on InferSent, a sentence-embedding method designed by Facebook that provides semantic sentence representations.
Folks at Facebook Research released a paper in 2017, "Supervised Learning of Universal Sentence Representations from Natural Language Inference Data", which employs a supervised method to learn sentence embeddings.
They showed that models trained on the NLI task can perform better than models trained in unsupervised conditions or on other supervised tasks. By exploring various architectures, they found that a BiLSTM network with max-pooling yields the best universal sentence encoder, outperforming existing approaches like SkipThought vectors. Let's discuss in detail how InferSent works.
A dig into the NLI task
Natural language inference, also known as recognizing textual entailment, aims at finding a directional relationship between text fragments. In this framework, the entailing and entailed texts are termed the text (t) and the hypothesis (h), respectively, and "t entails h" (t => h) if, typically, a human reading t would infer that h is most likely true.
For this task, the SNLI (Stanford Natural Language Inference) dataset is used. It consists of 570k human-written English sentence pairs, manually labeled with one of three categories: entailment, contradiction, or neutral.
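To make the three labels concrete, here are illustrative SNLI-style pairs (hand-written for this post, not taken from the actual dataset):

```python
# Each SNLI example pairs a text (premise) with a hypothesis and a label.
snli_examples = [
    {"text": "A man is playing a guitar on stage.",
     "hypothesis": "A man is performing music.",
     "label": "entailment"},      # the text implies the hypothesis
    {"text": "A man is playing a guitar on stage.",
     "hypothesis": "A man is sleeping at home.",
     "label": "contradiction"},   # the two cannot both be true
    {"text": "A man is playing a guitar on stage.",
     "hypothesis": "The man is a famous musician.",
     "label": "neutral"},         # plausible, but not implied
]
```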
How InferSent works:
The architecture consists of 2 parts:
1. A sentence encoder that takes word vectors and encodes sentences into vectors.
2. An NLI classifier that takes the encoded vectors in and outputs a class among entailment, contradiction, and neutral.
The paper discusses multiple architectures for sentence encoding. Let's look at them, starting with the best one stated in the paper.
BiLSTM with max/mean pooling
It's a bidirectional LSTM network that computes n vectors for n words; each vector is the concatenation of the output of a forward LSTM and a backward LSTM that read the sentence in opposite directions. A max (or mean) pooling is then applied over the concatenated vectors to form the fixed-length final vector.
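The pooling step can be sketched with NumPy. Here the per-word BiLSTM outputs are stubbed with random values; a real encoder would compute them with LSTM cells:

```python
import numpy as np
rng = np.random.default_rng(0)

n_words, hidden = 5, 8                      # sentence length, per-direction LSTM size
h_fwd = rng.normal(size=(n_words, hidden))  # forward LSTM outputs, one per word (stubbed)
h_bwd = rng.normal(size=(n_words, hidden))  # backward LSTM outputs (stubbed)

# Each word's vector is the concatenation of both directions: (n_words, 2*hidden)
h = np.concatenate([h_fwd, h_bwd], axis=1)

sent_max = h.max(axis=0)    # max-pooling over words -> fixed-length (2*hidden,)
sent_mean = h.mean(axis=0)  # mean-pooling alternative
```

Note how the sentence vector's length depends only on the hidden size, not on the number of words, which is what makes it usable as a fixed-size input to a classifier.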
LSTM and GRU
These are the vanilla versions, where the network outputs n hidden vectors for a sequence of n words, and the last hidden vector is taken as the final fixed-length vector.
Another variant is BiGRU-last, where two GRUs are used, one reading forward and the other backward, and the last hidden states of both GRUs are concatenated to form the final vector.
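A quick sketch of BiGRU-last, again with stubbed hidden states in place of real GRU computations:

```python
import numpy as np
rng = np.random.default_rng(1)

n_words, hidden = 5, 8
h_fwd = rng.normal(size=(n_words, hidden))  # forward GRU hidden states (stubbed)
h_bwd = rng.normal(size=(n_words, hidden))  # backward GRU hidden states (stubbed)

# Indexing by word position: the forward GRU finishes on the last word,
# the backward GRU finishes on the first word.
sentence_vec = np.concatenate([h_fwd[-1], h_bwd[0]])  # shape: (2*hidden,)
```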
Self-attentive network
This uses the attention mechanism over the hidden states of a BiLSTM to generate a representation u of an input sentence.
The essence of attention mechanisms is that they imitate human sight. When the human visual system examines a scene, it typically does not scan it end to end; rather, it focuses on a specific portion according to the person's needs. When a person notices that the object they want to attend to typically appears in a particular part of a scene, they learn that in the future the object is likely to appear in that portion, and tend to focus their attention on that area.
In the paper, they use a self-attentive network with multiple views of the input sentence, so that the model can learn which parts of the sentence are important for the given task. Concretely, there are 4 context vectors u1, u2, u3, u4, each generating a representation; the 4 representations are then concatenated to obtain the single sentence representation.
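A simplified NumPy sketch of the multi-view attention idea. The scoring here is a plain dot product between each context vector and the hidden states (the paper applies a learned transformation with a tanh before scoring, omitted for brevity), and all values are stubbed:

```python
import numpy as np
rng = np.random.default_rng(2)

n_words, hidden = 5, 8
H = rng.normal(size=(n_words, hidden))   # BiLSTM hidden states (stubbed)
contexts = rng.normal(size=(4, hidden))  # 4 learnable context vectors u1..u4 (stubbed)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

views = []
for u in contexts:
    alpha = softmax(H @ u)   # attention weights over the words for this view
    views.append(alpha @ H)  # weighted sum of hidden states: (hidden,)

sentence_vec = np.concatenate(views)  # 4 views concatenated: (4*hidden,)
```

Each view can learn to weight different words highly, so the concatenated vector captures several complementary "readings" of the sentence.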
Hierarchical ConvNet
One of the currently best-performing models on classification tasks is a convolutional architecture termed AdaSent, which concatenates different representations of the sentence at different levels of abstraction. The hierarchical convolutional network introduced in the paper is inspired by it and comprises 4 convolutional layers. At every layer, max-pooling of the feature maps is done to obtain a representation. The final sentence embedding is the concatenation of the 4 max-pooled representations.
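A toy NumPy sketch of this hierarchical pooling, with random (untrained) convolution weights and a hand-rolled 1-D convolution:

```python
import numpy as np
rng = np.random.default_rng(0)

T, d = 6, 4             # sentence length, feature dimension
x = rng.normal(size=(T, d))

def conv1d(x, W):
    # 'same'-padded 1-D convolution over time: (T, d_in) x (k, d_in, d_out) -> (T, d_out)
    k, d_in, d_out = W.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([np.tensordot(xp[t:t + k], W, axes=([0, 1], [0, 1]))
                     for t in range(x.shape[0])])

layers = [rng.normal(size=(3, d, d)) for _ in range(4)]  # 4 random conv kernels
pooled = []
h = x
for W in layers:
    h = np.maximum(conv1d(h, W), 0)   # convolution + ReLU
    pooled.append(h.max(axis=0))      # max-pool the feature maps at this depth

sentence_vec = np.concatenate(pooled)  # concat of 4 pooled representations: (4*d,)
```

Shallow layers see local n-gram features while deeper layers see wider context, so concatenating all four pooled vectors mixes levels of abstraction, which is the AdaSent-inspired idea.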
Natural Language Inference Classifier
This section discusses the inference classifier network which takes these sentence embeddings and predicts the output label.
After the sentence vectors are fed as input to this model, 3 matching methods are applied to extract relations between the text u and the hypothesis v:
1. concatenation of the two representations (u, v)
2. element-wise product u * v
3. absolute element-wise difference |u − v|
The resulting vector captures information from both the text u and the hypothesis v, and is fed into a 3-class classifier (entailment, contradiction, and neutral) consisting of multiple fully connected layers followed by a softmax layer.
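Building the classifier's input from the two sentence embeddings is a one-liner; here u and v are stubbed with random values:

```python
import numpy as np
rng = np.random.default_rng(3)

dim = 8
u = rng.normal(size=dim)   # text (premise) embedding (stubbed)
v = rng.normal(size=dim)   # hypothesis embedding (stubbed)

# The three matching methods, concatenated into one feature vector
# that is fed to the fully connected classifier layers.
features = np.concatenate([u, v, u * v, np.abs(u - v)])  # shape: (4*dim,)
```

The product term highlights dimensions where the two sentences agree, while the absolute difference highlights where they diverge; together they give the classifier explicit interaction features rather than leaving it to discover them.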
Hope this gives a clear understanding of how InferSent works. I have curated a single Jupyter notebook implementation of InferSent and how to use it.
1. First, download pre-trained word embeddings (GloVe and/or fastText):
#### To download GloVe
curl -Lo glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip
#### To download fastText
curl -Lo crawl-300d-2M.vec.zip https://s3-us-west-1.amazonaws.com/fasttext-vectors/crawl-300d-2M.vec.zip
2. To get the InferSent model and reproduce the paper's results, download the pre-trained models:
#### GloVe-based model
curl -Lo examples/infersent1.pkl https://dl.fbaipublicfiles.com/senteval/infersent/infersent1.pkl
#### fastText-based model
curl -Lo examples/infersent2.pkl https://dl.fbaipublicfiles.com/senteval/infersent/infersent2.pkl
3. Make sure you have NLTK installed along with the Punkt tokenizer (`nltk.download('punkt')`).
4. You can use the sample.txt file to input sentences, or replace it with your own sentences.
5. InferSent can also show the word importances in a sentence, which can be leveraged to build an intent extractor. See the notebook for how to use it.