Building a Cutting-Edge Model for Aspect-Based Sentiment Analysis

--

ReviewTrackers’ Insights engine pulls data from millions of reviews for our customers. One major component of this is predicting the sentiment of individual keywords. In this post, I will go over how to build a keyword-level sentiment analysis engine using state-of-the-art technology.

Part 1: Defining the Task

What is Sentiment Analysis?

Sentiment Analysis is a task in natural language processing (NLP) in which a piece of text is assigned a sentiment prediction. At its simplest, this could mean classifying an entire sentence as positive or negative. Let x be the input the model receives (text that has been segmented into individual tokens) and y be the desired output (our sentiment label):

x = ['I', 'loved', 'the', 'food', '!']
y = 'positive'

However, natural language is wonderfully diverse and messy, so for many applications a more fine-grained approach is more informative. Consider the example below:

x = ['I', 'loved', 'how', 'flavorful', 'the', 'burger',
     'was', ',', 'but', 'wish', 'the', 'customer',
     'service', 'was', 'faster', '.']
y = 'mixed'

This chunk of text contains multiple sentiments within a single sentence, so a one-sentiment-fits-all model is very limited here. One compromise is to predict the frustratingly uninformative label “mixed” sentiment. Ideally, we would want our model to predict that within that sentence, “burger” has positive sentiment and “customer service” has negative sentiment.

What is ABSA?

Aspect-Based Sentiment Analysis (ABSA) is one way to address the problem above. In this version of the task, sentiment predictions are made for every “aspect” in the text, as shown below:

x = (['burger'],
     ['I', 'loved', 'how', 'flavorful', 'the', 'burger',
      'was', ',', 'but', 'wish', 'the', 'customer',
      'service', 'was', 'faster', '.'])
y = 'positive'

x = (['customer', 'service'],
     ['I', 'loved', 'how', 'flavorful', 'the', 'burger',
      'was', ',', 'but', 'wish', 'the', 'customer',
      'service', 'was', 'faster', '.'])
y = 'negative'

Note that in practice, this is often part of a larger pipeline in which a step called “aspect extraction” is applied before ABSA to determine which words in the text should be treated as aspects. Aspect extraction can be conceptualized as a sequence labeling task. Here, each token is predicted to be ‘B’ for the beginning token of an aspect, ‘I’ for any intermediate token in a multi-token aspect, or ‘O’ for other:

x = ['I', 'loved', 'how', 'flavorful', 'the', 'burger',
     'was', ',', 'but', 'wish', 'the', 'customer',
     'service', 'was', 'faster', '.']
y = ['O', 'O', 'O', 'O', 'O', 'B', 'O', 'O', 'O', 'O',
     'O', 'B', 'I', 'O', 'O', 'O']
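In a full pipeline, those BIO tags then need to be converted back into aspect spans before they can be paired with the text for ABSA. Here is a minimal sketch of that conversion step (my own illustration, not part of any particular published pipeline):

# Convert BIO tags into aspect spans that can be fed to the ABSA model.
def bio_to_aspects(tokens, tags):
    aspects, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == 'B':                    # a new aspect starts here
            if current:
                aspects.append(current)
            current = [token]
        elif tag == 'I' and current:      # continue the current aspect
            current.append(token)
        else:                             # 'O' closes any open aspect
            if current:
                aspects.append(current)
            current = []
    if current:
        aspects.append(current)
    return aspects

x = ['I', 'loved', 'how', 'flavorful', 'the', 'burger',
     'was', ',', 'but', 'wish', 'the', 'customer',
     'service', 'was', 'faster', '.']
y = ['O', 'O', 'O', 'O', 'O', 'B', 'O', 'O', 'O', 'O',
     'O', 'B', 'I', 'O', 'O', 'O']

print(bio_to_aspects(x, y))  # [['burger'], ['customer', 'service']]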

Part 2: Defining the Model

BERT

Assuming we have our text and its aspects extracted, how do we train the model for ABSA? Here, we will start by using BERT, a versatile, pre-trained language model.

A language model (LM) assigns a probability to a sequence of tokens. For example, a reasonable LM should assign a higher likelihood to [‘turn’, ‘right’] than to [‘turn’, ‘write’]. This deceptively simple framework is what allows voice recognition software to use context to distinguish similar-sounding words. LMs also allow you to make predictions: out of all the words in the model vocabulary, which is the most likely to come next in the sequence [‘take’, ‘it’, ‘or’, ‘leave’]? A good LM would probably select ‘it’ as the next token.
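To make “assigning a probability to a sequence” concrete, here is a toy sketch with made-up bigram counts. This is purely illustrative and is not how BERT models language internally:

# A tiny "language model" built from made-up bigram counts, to show what it
# means to score continuations and predict the next token.
from collections import Counter

bigram_counts = Counter({('turn', 'right'): 90, ('turn', 'write'): 1,
                         ('leave', 'it'): 80, ('leave', 'now'): 15})

def next_token_probs(prev_token):
    """Probability of each possible next token, given the previous one."""
    options = {nxt: n for (prev, nxt), n in bigram_counts.items() if prev == prev_token}
    total = sum(options.values())
    return {tok: n / total for tok, n in options.items()}

print(next_token_probs('turn'))    # 'right' is far more likely than 'write'
print(next_token_probs('leave'))   # 'it' comes out as the most likely next token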

If we were starting from scratch with randomly initialized parameters, we would in effect have to teach the LM the basics of the language just to get started. Luckily, BERT is already pre-trained on massive amounts of data (all of English Wikipedia, plus a large corpus of books).

For many machine learning tasks, labeled training data is needed. Training using manually labeled data is referred to as supervised learning. However, data annotation is time-consuming and unrealistic for the scale of data that BERT is trained on. To get around this, BERT is trained using a clever approach called self-supervised learning. One self-supervised training task is called masked language modeling. In this task, a certain token is masked out and the model tries to predict what this missing word is. Let’s say that the sentence “The roast beef was tasty.” occurs in the dataset. From that, a training pair can be automatically generated without any manual annotations:

x = ['The', 'roast', '[MASK]', 'was', 'tasty', '.']
y = ['beef']

You can play around with the BERT masked language model yourself.
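For example, here is a minimal sketch using Hugging Face’s fill-mask pipeline (this assumes the transformers library is installed; the choice of the standard bert-base-uncased checkpoint is mine):

# Ask BERT to fill in the blank from the example above.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The roast [MASK] was tasty."):
    print(prediction["token_str"], round(prediction["score"], 3))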

Post-Training

This step is optional. If you skip directly to the fine-tuning step, you can still have a high-functioning model. However, post-training has been shown to help. Xu et al. (2019) demonstrate improved performance of BERT on ABSA for online reviews by post-training it on data from that domain. Post-training simply means initializing the model with BERT’s pre-trained weights and continuing the same self-supervised training regime on a more specific dataset. For example, if your downstream task covers healthcare and you have access to a large amount of unannotated healthcare data, this may be a good step to take.
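In code, post-training is just more masked language modeling on your own corpus. Here is a rough sketch, assuming the transformers and datasets libraries and a plain-text file of unannotated reviews (reviews.txt is a placeholder name):

# Continue masked language modeling on in-domain text, one review per line.
from datasets import load_dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Tokenize each line of the corpus into fixed-length inputs.
corpus = load_dataset("text", data_files="reviews.txt")
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# The collator masks 15% of tokens on the fly, just like in pre-training.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-post-trained", num_train_epochs=1),
    train_dataset=corpus["train"],
    data_collator=collator,
)
trainer.train()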

Xu et al.’s model post-trained on online reviews is available on the Hugging Face model hub as activebus/BERT_Review, the same checkpoint used in one of the fine-tuning examples below.

Fine-Tuning

So far, we have only used BERT as a language model, but ultimately we need it to predict sentiment polarity, not word likelihood. Adapting the architecture for sequence classification and running fine-tuning allows us to do this. First, we add a classification layer on top of the language model. Mathematically, this functions as follows:

P = softmax(CWᵀ),

where C ∈ ℝᴴ and W ∈ ℝᴷˣᴴ.

Let’s unpack that. C is the output of the final hidden layer of the LM from the special classification token (more on that later). The notation above indicates that C is a vector of real numbers (ℝ) of length H, the hidden size of the model. In the case of the base model of BERT, H=768. W refers to the weights of the new classification layer, which will be updated during fine-tuning. It is a tensor of dimensionality K by H, where K is the number of classes we are predicting across. Since we are using a three-way distinction of positive/neutral/negative, K=3.

Taking the dot product of C and the transpose of W results in an array of size K. These values are sometimes referred to as “logits”. They need to be normalized into a probability distribution that sums to 1, and the softmax function does exactly that. To get the predicted label, just return the label corresponding to the highest probability:

y_hat = labels[argmax(P)]
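To make the shapes concrete, here is a toy NumPy sketch of that computation, with random numbers standing in for C and W (the label ordering is arbitrary and my own):

import numpy as np

# Toy numbers only: this just traces the math of the classification layer.
H, K = 768, 3                                  # hidden size of BERT-base, number of labels
labels = ['negative', 'neutral', 'positive']   # any fixed ordering works

C = np.random.randn(H)                         # stand-in for the [CLS] hidden state
W = np.random.randn(K, H)                      # weights of the new classification layer

logits = C @ W.T                               # shape (K,): one unnormalized score per label
exp = np.exp(logits - logits.max())            # subtract the max for numerical stability
P = exp / exp.sum()                            # softmax: probabilities that sum to 1

y_hat = labels[np.argmax(P)]
print(P, y_hat)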

🤗 Hugging Face’s transformers library in PyTorch makes it easy to import BERT with this classification layer already on top:

from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=len(labels))

This framework also makes it quick and easy to try out different architectures and flavors of BERT. Think RoBERTa might boost performance?

from transformers import RobertaForSequenceClassification

model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=len(labels))

Want to use the version of BERT that is already post-trained on online reviews?

from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("activebus/BERT_Review", num_labels=len(labels))

Want a more light-weight model that can run inference more quickly?

from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=len(labels))

etc. How convenient!

Our model is now structured to assign a probability distribution over our label set. Because of the nature of the task, annotated training data is required, making this a supervised learning problem. There are plenty of guides with sample code available that show how to fine-tune BERT for sentiment analysis. Again, using PyTorch and Hugging Face makes this a relatively easy process.
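As a rough illustration only, here is a compressed, self-contained fine-tuning sketch with a toy two-example dataset (the example texts and hyperparameters are placeholders of mine, not a recommended setup):

# Fine-tune the classification head (and the rest of BERT) on toy labeled data.
import torch
from torch.optim import AdamW
from transformers import BertForSequenceClassification, BertTokenizerFast

labels = ['negative', 'neutral', 'positive']
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))

texts = ["I loved the food!", "The wait was far too long."]
y = torch.tensor([labels.index("positive"), labels.index("negative")])
enc = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                      # a few toy passes over the toy data
    optimizer.zero_grad()
    out = model(**enc, labels=y)        # the loss is standard cross-entropy
    out.loss.backward()
    optimizer.step()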

Structuring the Input

As for how to include the aspect in the input, several leading models use concatenation, such as prefixing the aspect to the full text with a separator token in between. For example, you could structure the input as follows:

x = ['[CLS]', 'customer', 'service', '[SEP]', 'I', 'loved',
     'how', 'flavorful', 'the', 'burger', 'was', ',', 'but',
     'wish', 'the', 'customer', 'service', 'was', 'faster',
     '.', '[SEP]']

Here, ‘[CLS]’ and ‘[SEP]’ are special tokens that BERT uses to denote sentence classification and sentence separation, respectively. The classification token is the token whose final hidden state is fed into the classification layer of the network. The separator token is often used for completely different NLP tasks where the input consists of two distinct parts (next sentence prediction, question answering, etc.), but BERT is adaptable enough to learn from the fine-tuning data that this structure is used here to indicate the aspect to focus on for sentiment prediction.

Illustration of BERT input for ABSA from Xu et al.
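In code, you can get exactly this structure by passing the aspect and the review to the tokenizer as a sentence pair. A small sketch (note that bert-base-uncased lowercases its input, and the variable names are my own):

# Encode an (aspect, review) pair; the tokenizer adds [CLS] and both [SEP] tokens.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

aspect = "customer service"
text = ("I loved how flavorful the burger was, "
        "but wish the customer service was faster.")

encoded = tokenizer(aspect, text)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'customer', 'service', '[SEP]', 'i', 'loved', ..., '[SEP]']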

With this (and some training data and processing power), you should have all the parts necessary to make a strong ABSA model. To recap the training stages: pre-training has already been done for us, post-training is optional, and fine-tuning is where the model learns the ABSA task itself.

If you want to compare your model to others, SemEval 2014 Task 4 Subtask 2 is a common evaluation benchmark. As of early 2021, leading papers with more complex architectures (Rietzler et al., Tian et al., etc.) have pushed the bar up to just a few percentage points shy of 90% accuracy on that dataset. Best of luck building your own model and pushing the bar even higher.

To learn more about some of the other NLP tasks that ReviewTrackers handles, check out our white paper.
