The goal
The goal for this project is to create a model that can accurately classify a piece of text as toxic or not: toxicity is either 1 or 0.
In principle, this is a simple problem to solve: all you need is a dataset of texts labeled as toxic or not, and then you can train your model on it.
The dataset
The competition specifies that the model must be able to predict texts written in Brazilian Portuguese, so the dataset is in Portuguese as well.
The dataset is based on ToLD-Br, a huge dataset of tweets (or is it Xeets now?) that contains additional labels indicating whether the text contains homophobia, obscenity, insults, racism, misogyny or xenophobia. The competition dataset, however, boils this down to a single toxicity column.
In the training data, the 'Text' column contains the tweet in question, and the 'Toxicity' column indicates whether the text is toxic or not (1 or 0).
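If you want to poke at the data yourself, loading it with pandas is enough. A minimal sketch (the file and column names here are assumptions based on the description above; adjust them to whatever the competition actually provides):

import pandas as pd

# Hypothetical file/column names based on the description above
df = pd.read_csv('train.csv')
print(df.head())                      # 'Text' and 'Toxicity' columns
print(df['Toxicity'].value_counts())  # class balance: how many 1s vs 0s

# Lists used later when building the training dataset
texts = df['Text'].tolist()
labels = df['Toxicity'].tolist()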
Classification problem
Whenever you think about classification, your first guess would be that you need some kind of neural network.
As you may guess from the title of the article, BERT was chosen: it is a more recent neural network architecture built on transformers, which is a great fit for Natural Language Processing (NLP).
How does BERT work?
Recurrent and convolutional neural networks use sequential computation to generate predictions: once trained on huge datasets, they can predict which word will follow a given sequence of words. Because they only look at the words that come before, this behavior is called unidirectional.
BERT, however, has a mechanism called self-attention, which makes this prediction based not only on the words that precede a token but also on the words that follow it - in other words, a bidirectional approach.
Source: Javier Canales Luna @ DataCamp
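A quick way to see this bidirectional behaviour in practice is masked-word prediction: hide a word in the middle of a sentence and let BERT guess it from the context on both sides. A small sketch using the Hugging Face fill-mask pipeline with the BERTimbau model introduced below (the exact predictions will vary):

from transformers import pipeline

# Masked-word prediction: BERT uses the words before AND after [MASK]
fill = pipeline('fill-mask', model='neuralmind/bert-base-portuguese-cased')

for pred in fill('Tinha uma [MASK] no meio do caminho.')[:3]:
    print(pred['token_str'], round(pred['score'], 3))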
The training
First of all, the training data must be cleaned up so that fewer characters need to be processed by our model. There's some theory on which characters matter and which don't, but I settled on this final format_text function:
import re
from nltk.corpus import stopwords

def format_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove words that begin with @ such as tagging @user
    text = re.sub(r'@\w+', '', text)
    # Remove words that begin with # such as #happy
    text = re.sub(r'#\w+', '', text)
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove punctuation and emojis
    text = re.sub(r'[^\w\s]', '', text)
    # Remove stop words
    pt_stp_words = stopwords.words('portuguese')
    text = ' '.join([word for word in text.split() if word not in pt_stp_words])
    # Collapse repeated whitespace into single spaces
    text = re.sub(r'\s+', ' ', text)
    return text
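Just to illustrate what the function does to a typical tweet (the input below is made up for this example):

# Hypothetical tweet just to show the cleanup steps in action
print(format_text('@user Esse filme é MUITO bom!! 😀 #cinema http://t.co/abc'))
# -> roughly 'filme bom': mention, hashtag, URL, emoji, punctuation and stopwords are gone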
The comments are all self-explanatory. All but one: stopwords.
What are stopwords?
Stopwords are words that appear very frequently in text but don't add much meaning on their own.
Examples in English are "i", "my", "myself", "you" and "your". More words can be found here.
For this project, however, I've used the Portuguese stopword list available in the nltk.corpus package.
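If you've never used it, the Portuguese list ships with NLTK and only needs to be downloaded once:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stopword lists

pt_stop_words = stopwords.words('portuguese')
print(len(pt_stop_words))  # a couple hundred entries
print(pt_stop_words[:5])   # e.g. ['de', 'a', 'o', 'que', 'e']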
The model
Now, to feed the data into our model, we'll create a TextClassificationDataset class that handles storing and encoding our texts.
import torch
from torch.utils.data import Dataset

class TextClassificationDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        # Tokenize and pad/truncate to max_length
        encoding = self.tokenizer(text, return_tensors='pt', max_length=self.max_length, padding='max_length', truncation=True)
        return {'input_ids': encoding['input_ids'].flatten(),
                'attention_mask': encoding['attention_mask'].flatten(),
                'label': torch.tensor(label)}
- We begin by defining that this class is a PyTorch Dataset.
- The __init__ method takes the arguments texts and labels, which are the values from the training data in the form of lists. So, for example, row #3 would have the content of the tweet at texts[2] and its classification at labels[2].
- The argument tokenizer is used to convert the texts into a format that the model can understand, since it cannot work with raw text.
- The argument max_length is used to limit the length of the tokenized sequences.
- The method __len__ returns the number of samples.
- The method __getitem__ retrieves a specific item given an index idx. It fetches the text and label from the texts and labels lists and encodes the text using the tokenizer from __init__.
- The encoding is split into two parts: input_ids and attention_mask. input_ids is the tokenized text, and attention_mask is a binary mask that indicates which tokens are actual words versus padding.
- Everything is returned as PyTorch tensors (see the short usage sketch below).
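To make this concrete, here's a small sketch of how the class is used. I'm assuming the BertTokenizer class here, and texts and labels are the cleaned lists from the training data:

from transformers import BertTokenizer

# Tokenizer for the same BERTimbau model used by the classifier below
tokenizer = BertTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')

dataset = TextClassificationDataset(texts, labels, tokenizer, max_length=128)

sample = dataset[0]
print(sample['input_ids'].shape)       # torch.Size([128]) - padded/truncated to max_length
print(sample['attention_mask'].shape)  # torch.Size([128]) - 1 for real tokens, 0 for padding
print(sample['label'])                 # tensor(0) or tensor(1)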
With the data cleaned up, it was time to create the BERT classifier. For this project, I used BERTimbau Base, a pretrained BERT model for Brazilian Portuguese that achieves state-of-the-art performance on three downstream NLP tasks: Named Entity Recognition, Sentence Textual Similarity and Recognizing Textual Entailment.
These people are so creative.
In the end, this is what our BERTClassifier looked like:
import torch
import torch.nn as nn
from transformers import BertModel

class BERTClassifier(nn.Module):
    def __init__(self, bert_model_name, num_classes):
        super(BERTClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        self.dropout = nn.Dropout(0.1)
        self.fc = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        x = self.dropout(pooled_output)
        logits = self.fc(x)
        return logits

# Example of initialization
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BERTClassifier('neuralmind/bert-base-portuguese-cased', 2).to(device)

# If there's a .pth file to load
model.load_state_dict(torch.load('bert_classifier.pth', map_location=device))
Breaking this stuff into parts:
- The __init__ function acts as a constructor. It loads the pretrained BertModel given by bert_model_name, adds a dropout layer to keep things in check, and adds a linear layer that classifies the text into num_classes - in our case, 2 polar opposites.
- The forward function defines how the input flows through BERT and then through the additional layers we've set up, returning the logits.
Please note that I didn't tinker a lot with these, since they were kind of default from the sources that I was studying.
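Before training anything, a quick sanity check of the forward pass helps: the logits should come out with shape (batch_size, num_classes), and a softmax turns them into probabilities. A minimal sketch, reusing the tokenizer, model and device from above:

import torch

# Encode one throwaway sentence and push it through the classifier
enc = tokenizer('um exemplo qualquer', return_tensors='pt', max_length=128, padding='max_length', truncation=True)

with torch.no_grad():
    logits = model(enc['input_ids'].to(device), enc['attention_mask'].to(device))

print(logits.shape)                  # torch.Size([1, 2])
print(torch.softmax(logits, dim=1))  # probabilities for [not toxic, toxic]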
Given all of that, now we need our train function. We'll need a lot of things, though:
from sklearn.model_selection import train_test_split
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import get_linear_schedule_with_warmup

# Set up parameters
bert_model_name = 'neuralmind/bert-base-portuguese-cased'
num_classes = 2
max_length = 128
batch_size = 16
num_epochs = 2
learning_rate = 2e-5

def train(model, data_loader, optimizer, scheduler, device):
    model.train()
    for batch in data_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = nn.CrossEntropyLoss()(outputs, labels)
        loss.backward()
        optimizer.step()
        scheduler.step()

## Begin training
# Split into train and validation datasets (texts and labels are the cleaned lists from the training data)
train_texts, val_texts, train_labels, val_labels = train_test_split(texts, labels, test_size=0.2, random_state=42)

# Wrap the splits with the tokenizer defined earlier
train_dataset = TextClassificationDataset(train_texts, train_labels, tokenizer, max_length)
val_dataset = TextClassificationDataset(val_texts, val_labels, tokenizer, max_length)

# Create DataLoaders for batch processing
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size)

# Additional steps
optimizer = AdamW(model.parameters(), lr=learning_rate)
total_steps = len(train_dataloader) * num_epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)
A lot to unpack here:
- First, we define some parameters that are going to be used in the model.
- num_classes is simple: either toxic or not.
- max_length, as already described, is the maximum length of the encoded text.
- batch_size is the number of samples to work through before the model's internal parameters are updated. This value is a balance between reasonable memory requirements and not losing too much performance.
- learning_rate is 2e-5, which is 0.00002. If the learning rate is too high, the model might overshoot the minimum of the loss function and fail to converge. If the rate is too low, the model might get stuck in a suboptimal solution. The value of 2e-5 is commonly used for fine-tuning BERT since it is small enough to let the model make gradual progress without overshooting or converging too slowly.
Let's skip the train method for now and explain the items below it:
- The optimizer is used to adjust the parameters of our model to minimize the error, or loss function. It changes the weights and biases of the neurons in response to the error the model produced in its predictions during training. AdamW is a variation of the Adam optimizer.
- total_steps is the total number of optimization steps that will be run: each epoch goes through the entire dataset once, so it is the number of epochs times the number of batches per epoch.
- The learning rate scheduler, scheduler, adjusts the learning rate during training. This has been shown to help the model converge faster, avoid overfitting and escape saddle points (see the sketch after this list).
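To make the scheduler's effect concrete, here's a tiny standalone sketch (toy numbers and a dummy parameter, just for illustration) showing the learning rate decaying linearly from 2e-5 towards zero over the training steps:

import torch
from transformers import get_linear_schedule_with_warmup

# Dummy setup: one parameter, 10 training steps, no warmup
param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.AdamW([param], lr=2e-5)
sched = get_linear_schedule_with_warmup(opt, num_warmup_steps=0, num_training_steps=10)

for step in range(10):
    opt.step()
    sched.step()
    print(step, sched.get_last_lr()[0])  # shrinks linearly from 2e-5 down to 0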
Given everything that was said (and I know that it's too much!), now let's break down the train method:
- First, it sets the model to training mode.
- Then, it enters a loop over each batch of the data loader.
- In this loop, it clears the gradients, since they are accumulated in PyTorch and need to be reset for each batch.
- It moves the batch to the device being used for training, such as the CPU or GPU.
- Then, it retrieves the input IDs, attention masks and labels, which are used as input to the model.
- Then, with the model's outputs, the loss is calculated with the CrossEntropyLoss function.
- It performs backpropagation by calling loss.backward().
- optimizer.step() applies the gradients computed in the previous step to update the model's parameters.
- Finally, the learning rate is adjusted with scheduler.step().
Phew! A lot of things to uncover.
In the end, we can just call the train function for each epoch, and then save the model as a .pth file.
for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")
    train(model, train_dataloader, optimizer, scheduler, device)
    accuracy, report = evaluate(model, val_dataloader, device)
    print(f"Validation Accuracy: {accuracy:.4f}")
    print(report)

torch.save(model.state_dict(), "bert_classifier.pth")
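The evaluate helper isn't shown in this post (it lives in the repository), but judging by how it's called above, a minimal sketch could look like this, using scikit-learn's metrics:

from sklearn.metrics import accuracy_score, classification_report

def evaluate(model, data_loader, device):
    model.eval()
    predictions, actual_labels = [], []
    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            _, preds = torch.max(outputs, dim=1)
            predictions.extend(preds.cpu().tolist())
            actual_labels.extend(labels.cpu().tolist())
    # Same (accuracy, report) pair the training loop expects
    return accuracy_score(actual_labels, predictions), classification_report(actual_labels, predictions)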
The saved model can then be loaded from that path and used to predict the toxicity of new texts! Here's one example:
def predict_sentiment(text, model, tokenizer, device, max_length=128):
    model.eval()
    encoding = tokenizer(text, return_tensors='pt', max_length=max_length, padding='max_length', truncation=True)
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        _, preds = torch.max(outputs, dim=1)
    return 1 if preds.item() == 1 else 0

# Load the model from the .pth file
model = BERTClassifier('neuralmind/bert-base-portuguese-cased', 2)
model.load_state_dict(torch.load('bert_pt_classifier.pth', map_location=device))
model.to(device)

print(predict_sentiment('Hello world!', model, tokenizer, device))  # Returns 0
Conclusion
And that's it! If you want to check it out and train/test this model yourself, feel free to check the code in my GitHub repository!
This post was born out of my first Kaggle competition!
Despite not winning the competition, I finished very close to the top, with only 0.00952 separating me from first place, so I hope my experience can also teach other beginners something useful!
I'm already a software engineer at work, but artificial intelligence has always been a source of curiosity for me. When I was in college, I had a brief exposure to computer vision and even ended up publishing some scientific articles. Now, I'm trying to make up for lost time studying and learning AI again. Follow me to join me on my journey!
Special thanks
First of all, special thanks to Pedro Gengo and the folks over at Tensorflow User Group São Paulo for creating the Kaggle Competition and inspiring this project!
Also, huge thanks to Kang Pham for writing this tutorial where I got most of this code!
And finally, thanks to Pedro Henrique Vieira de Lima, whose work on Detecção de Comentários Tóxicos em Chats e Redes Sociais com Deep Learning (Detection of Toxic Comments in Chats and Social Networks with Deep Learning) was crucial for hitting a higher score on the leaderboard.