Online conversations are ubiquitous in modern life, spanning industries from video games to telecommunications. This has led to an exponential growth in the amount of online conversation data, which has helped develop state-of-the-art natural language processing (NLP) systems like chatbots and natural language generation (NLG) models. Over time, various NLP techniques for text analysis have also evolved. This necessitates a fully managed service that can be integrated into applications using API calls without the need for extensive machine learning (ML) expertise. AWS offers pre-trained AWS AI services like Amazon Comprehend, which can effectively handle NLP use cases involving classification, text summarization, entity recognition, and more to gather insights from text.
Additionally, online conversations have led to a widespread phenomenon of non-traditional usage of language. Traditional NLP techniques often perform poorly on this text data due to the constantly evolving and domain-specific vocabularies within different platforms, as well as the significant lexical deviations of words from proper English, either by accident or intentionally as a form of adversarial attack.
In this post, we describe multiple ML approaches for text classification of online conversations with tools and services available on AWS.
Prerequisites
Before diving deep into this use case, please complete the following prerequisites:
Set up an AWS account and create an IAM user.
(Optional) Set up your AWS Cloud9 IDE environment.
Dataset
For this post, we use the Jigsaw Unintended Bias in Toxicity Classification dataset, a benchmark for the specific problem of classification of toxicity in online conversations. The dataset provides toxicity labels as well as several subgroup attributes such as obscene, identity attack, insult, threat, and sexually explicit. Labels are provided as fractional values that represent the proportion of human annotators who believed the attribute applied to a given piece of text; annotators are rarely unanimous. To generate binary labels (for example, toxic or non-toxic), a threshold of 0.5 is applied to the fractional values, and comments with values greater than the threshold are treated as the positive class for that label.
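For example, a minimal sketch of that thresholding with pandas might look like the following (the column names follow the Kaggle train.csv schema; adjust them if your copy of the dataset differs):

```python
import pandas as pd

# Label columns as they appear in the Kaggle train.csv; adjust if your copy differs.
LABEL_COLUMNS = ["target", "obscene", "identity_attack", "insult", "threat", "sexual_explicit"]
THRESHOLD = 0.5

df = pd.read_csv("train.csv")

# Fractional annotator agreement -> binary labels:
# values above the threshold become the positive class (1), everything else 0.
for col in LABEL_COLUMNS:
    df[f"{col}_binary"] = (df[col] > THRESHOLD).astype(int)

# Positive-class prevalence per label
print(df[[f"{c}_binary" for c in LABEL_COLUMNS]].mean())
```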
Subword embedding and RNNs
For our first modeling approach, we use a combination of subword embedding and recurrent neural networks (RNNs) to train text classification models. Subword embeddings were introduced by Bojanowski et al. in 2017 as an improvement upon previous word-level embedding methods. Traditional Word2Vec skip-gram models are trained to learn a static vector representation of a target word that optimally predicts that word's context. Subword models, on the other hand, represent each target word as a bag of the character n-grams that make up the word, where an n-gram is composed of n consecutive characters. This allows the embedding model to better represent the underlying morphology of related words in the corpus, and also makes it possible to compute embeddings for novel, out-of-vocabulary (OOV) words. This is particularly important in the context of online conversations, a problem space in which users often misspell words (sometimes intentionally to evade detection) and also use a unique, constantly evolving vocabulary that might not be captured by a general training corpus.
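To make the n-gram decomposition concrete, here is a small illustrative snippet (the boundary markers < and > follow the fastText convention; the function itself is purely for demonstration):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, with fastText-style boundary markers."""
    token = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams

# A misspelled or novel word shares most of its n-grams with the canonical form,
# so its embedding (built from n-gram vectors) stays close to the original word's.
print(sorted(char_ngrams("toxic", 3, 4)))
```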
Amazon SageMaker makes it easy to train and optimize an unsupervised subword embedding model on your own corpus of domain-specific text data with the built-in BlazingText algorithm. We can also download existing general-purpose models trained on large datasets of online text, such as the following English language models available directly from fastText. From your SageMaker notebook instance, simply run the following to download a pre-trained fastText model:
!wget -O vectors.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
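If you would rather train embeddings on your own corpus with BlazingText, a minimal sketch might look like the following (the S3 paths, instance type, and hyperparameter values are placeholders to adapt to your data):

```python
import sagemaker
from sagemaker import image_uris

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Placeholder S3 locations -- replace with your own bucket and prefixes.
train_s3 = "s3://<your-bucket>/blazingtext/train"
output_s3 = "s3://<your-bucket>/blazingtext/output"

bt_estimator = sagemaker.estimator.Estimator(
    image_uri=image_uris.retrieve("blazingtext", session.boto_region_name),
    role=role,
    instance_count=1,
    instance_type="ml.c5.4xlarge",
    output_path=output_s3,
    sagemaker_session=session,
)

# Skip-gram training with subword (character n-gram) embeddings enabled
bt_estimator.set_hyperparameters(
    mode="skipgram",
    subwords=True,
    vector_dim=300,
    min_char=3,
    max_char=6,
    epochs=5,
)

bt_estimator.fit({"train": train_s3})
```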
Whether you've trained your own embeddings with BlazingText or downloaded a pre-trained model, the result is a zipped model binary that you can use with the gensim library to embed a given target word as a vector based on its constituent subwords:
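A minimal sketch of that lookup, assuming the unzipped archive contains crawl-300d-2M-subword.bin (the binary shipped in the zip above), might look like this:

```python
from gensim.models.fasttext import load_facebook_vectors

# Unzip the downloaded archive first, for example: !unzip vectors.zip
# load_facebook_vectors reads the native fastText .bin format, including subword n-gram vectors.
word_vectors = load_facebook_vectors("crawl-300d-2M-subword.bin")

# In-vocabulary word
print(word_vectors["toxic"].shape)       # (300,)

# Misspelled / out-of-vocabulary word: the vector is assembled from its character n-grams
print(word_vectors["t0xicity"].shape)    # (300,)
```

Note that loading the full 2M-word model requires several gigabytes of memory, so choose a notebook instance size accordingly.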
After we preprocess a given segment of text, we can use this approach to generate a vector representation for each of the constituent words (as separated by spaces). We then use SageMaker and a deep learning framework such as PyTorch to train a customized RNN with a binary or multilabel classification objective, predicting whether the text is toxic and, if so, the specific sub-type of toxicity, based on labeled training examples.
To upload your preprocessed text to Amazon Simple Storage Service (Amazon S3), use the following code:
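For example (assuming your preprocessing step wrote data/train.csv and data/val.csv locally; the toxicity/ key prefix is arbitrary):

```python
import sagemaker

session = sagemaker.Session()
bucket = session.default_bucket()

# Hypothetical local files produced by your preprocessing step -- adjust paths as needed.
train_s3_uri = session.upload_data("data/train.csv", bucket=bucket, key_prefix="toxicity/train")
val_s3_uri = session.upload_data("data/val.csv", bucket=bucket, key_prefix="toxicity/val")

print(train_s3_uri, val_s3_uri)
```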
To initiate scalable, multi-GPU model training with SageMaker, enter the following code:
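A sketch of such a training job with the SageMaker PyTorch estimator might look like the following, reusing the S3 URIs from the upload step above. The framework version, instance type, and hyperparameter names are illustrative, and train.py is expected to handle multi-GPU training itself (for example with DistributedDataParallel):

```python
import sagemaker
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    source_dir="src",                      # hypothetical directory holding train.py and its helpers
    role=sagemaker.get_execution_role(),
    framework_version="1.13",
    py_version="py39",
    instance_count=1,
    instance_type="ml.p3.8xlarge",         # 4 GPUs on a single instance
    hyperparameters={
        "epochs": 10,
        "batch-size": 256,
        "learning-rate": 1e-3,
    },
)

estimator.fit({"train": train_s3_uri, "validation": val_s3_uri})
```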
Within , we define a PyTorch Dataset that is used by train.py to prepare the text data for training and evaluation of the model:
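A rough sketch of such a Dataset (names like ToxicityDataset, max_len, and dim are illustrative) could tokenize each comment on whitespace, look up a fastText vector per token, and pad or truncate to a fixed length:

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class ToxicityDataset(Dataset):
    """Turns raw comments into padded sequences of subword-embedding vectors."""

    def __init__(self, texts, labels, word_vectors, max_len=128, dim=300):
        self.texts = texts                 # list of preprocessed comment strings
        self.labels = labels               # list/array of binary or multilabel targets
        self.word_vectors = word_vectors   # e.g., gensim FastTextKeyedVectors
        self.max_len = max_len
        self.dim = dim

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        tokens = self.texts[idx].split()[: self.max_len]
        features = np.zeros((self.max_len, self.dim), dtype=np.float32)
        for i, token in enumerate(tokens):
            features[i] = self.word_vectors[token]   # OOV tokens are handled via subwords
        return torch.from_numpy(features), torch.tensor(self.labels[idx], dtype=torch.float32)
```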
Note that this code anticipates that the vectors.zip file containing your fastText or BlazingText embeddings will be stored in .
Additionally, you can easily deploy pretrained fastText models on their own to live SageMaker endpoints to compute embedding vectors on the fly for use in relevant word-level tasks. See the following GitHub example for more details.
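A rough sketch of that deployment, assuming you package the fastText binary into a model.tar.gz for the BlazingText hosting container (the exact archive layout and file name the container expects are shown in the linked GitHub example), might look like this:

```python
import tarfile
import sagemaker
from sagemaker import image_uris
from sagemaker.model import Model

session = sagemaker.Session()

# Assumption: the hosting container reads the fastText binary from model.bin inside the tarball.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("crawl-300d-2M-subword.bin", arcname="model.bin")

model_s3_uri = session.upload_data("model.tar.gz", key_prefix="fasttext-model")

model = Model(
    image_uri=image_uris.retrieve("blazingtext", session.boto_region_name),
    model_data=model_s3_uri,
    role=sagemaker.get_execution_role(),
)

predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
```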
In the next part, I'll explain how to use transformer models with Hugging Face on AWS for text classification. The link to Part 2 will be added soon.