Intro
Machine Learning is a popular buzzword these days, and today we are going to focus on one little corner of the behemoth we know as ML: Natural Language Processing. Even that corner is still too big, so we will narrow down further to an area of NLP known as Sentiment Analysis. The goal of sentiment analysis is to determine the emotion behind a natural language sentence (i.e. the way you and I speak, versus the languages and grammars a machine can actually make sense of). In this article we will explore the process of creating our very own sentiment analyzer, as well as see how it can be incorporated into an existing application.
Getting Started
As previously mentioned, we will be doing sentiment analysis, but more mysteriously we will be adding the functionality to an existing application. I have made a very simple GUI using Python and tkinter that presents a text field which responds when the user presses enter. You can clone the repo as follows:
git clone https://github.com/alexei-dulub/sentiment_model.git
This repo only contains one file, but you can run it with the following:
python3 app.py
This should open the GUI window, barring any weird issues your OS has with tkinter and tcl/tk (like mine did on macOS). You can type in the text field, and once you press enter you will see the :l change to a :(. However, this is the only time it will happen. Looking at the code, you can see that this is what we bound the enter action to do:
def change_emote(self, event):
    self.emote['text'] = self.negative

# elsewhere in the class, the handler is bound to the Return key:
self.master.bind('<Return>', self.change_emote)
Our focus today will be on giving this change_emote() function more life by building a sentiment analyzer and using it to "read" what we put in the text box. We will pass our input to the analyzer, get a result back (i.e. "Negative" or "Positive"), and change the emote accordingly. Let's dive in!
It's All About the Datasets
Probably the most pythonic thing we will do in this article is stand on the shoulders of giants, because our analyzer is built almost exclusively with the Natural Language Toolkit, or nltk, module. The biggest help it offers us is its collection of data, so that we do not have to collect and sort through all of it ourselves. We will, however, have to clean it up as we see fit and train the model, so let's get started, shall we?
We will begin by getting nltk if you do not already have it installed:
python3 -m pip install nltk
Then we will need to download some additional things. Since nltk offers so many datasets, grammars, and models, bundling them all would make for far too large a download for a single module, and probably more than the average user needs. So, the people behind nltk have kindly made it modular. This means we will need to run the following in an interactive python session:
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
>>> nltk.download('stopwords')
>>> nltk.download('twitter_samples')
>>> nltk.download('averaged_perceptron_tagger')
>>> quit()
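If you would rather not type these one by one in an interactive session, the same downloads can also be run as a small script (just a convenience; the resource names are the same ones listed above):
import nltk

# grab everything the rest of the article relies on in one go
for resource in ['punkt', 'wordnet', 'stopwords',
                 'twitter_samples', 'averaged_perceptron_tagger']:
    nltk.download(resource)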
This should be enough to get us started on making our sentiment analyzer. Now, I know that imports don't come to you right away when you sit down to write something, but today they do. Reference the following when we start to introduce new functions:
import random
from nltk import classify, pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import twitter_samples
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.classify.naivebayes import NaiveBayesClassifier
Now that we don't have to worry about where things are coming from (this is my least favorite part of looking at others' code snippets) we can begin collecting the data:
positive = twitter_samples.strings('positive_tweets.json')
negative = twitter_samples.strings('negative_tweets.json')
stop_words = list(set(stopwords.words('english')))
We start off by making sure we didn't waste our time in that interactive python session, putting the twitter samples and the stop words we downloaded to use right away. Stop words, you might be wondering, are basically anything we want to filter out; in our case that means words that do not really carry any sentiment, such as "the", "a", "an", "this", and countless others along those lines. Fortunately for us we did not have to compile this list ourselves, and we can move on.
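To get a feel for what ends up filtered, here is a quick, throwaway check (the exact output depends on your NLTK version, but the common filler words should disappear):
sample = "this is a tweet about the weather".split()
print([word for word in sample if word not in stop_words])
# expect something like: ['tweet', 'weather']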
Cleaning the Data
If you were to take a peek at the data we have so far by doing something like...
print(positive[0])
...it would definitely look like a tweet, but I am sure that there are some things in there that we could get rid of as well as some things that we should add to this data. First we will tokenize our data as follows:
positive_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tokens = twitter_samples.tokenized('negative_tweets.json')
This can replace the previous lines about twitter_samples.
Feel free to peek at this one (and anything else moving forward), but what we have done here is take the strings from the tweets and tokenize them, meaning we now have a list of lists of strings, each string being an individual word or symbol, so this...
'#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)'
...turns into this...
['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)']
This allows for us to actually get started on cleaning up the data, and we love functions here so let's make one!
def clean(tokens):
    # drop any stop words before tagging and lemmatizing
    tokens = [x for x in tokens if x not in stop_words]

    l = WordNetLemmatizer()
    lemmatized = []

    # tag each token with its part of speech so the lemmatizer knows
    # whether it is dealing with a noun, a verb, or something else
    for word, tag in pos_tag(tokens):
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized.append(l.lemmatize(word, pos))

    return lemmatized
There is a bit going on here, but there are three main things to point out. The first is:
tokens = [x for x in tokens if x not in stop_words]
Here you can see we are finally making use of our stop words: this Python comprehension removes any of them that appear in the tokens passed to our clean() function. Next is the use of the WordNetLemmatizer, which cleans up individual words by lemmatizing them. Lemmatization lets us work with just the roots of words instead of trying to handle every possible variation of a word; for example "birds" becomes "bird" and "cries" becomes "cry". This way, whether we see "cries" or "crying" or "cried", it is all treated as the same root, "cry." Lastly, we are determining the part of speech of each token in its sentence, which is used for several things (not necessarily directly by us), such as by the lemmatizer. Having the part of speech is just all-around good to have when doing NLP. Now we can make use of our new function:
positive_clean = []
negative_clean = []

for token in positive_tokens:
    positive_clean.append(clean(token))

for token in negative_tokens:
    negative_clean.append(clean(token))
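If you want to see what the cleaning actually did, compare one raw tokenized tweet with its cleaned counterpart (a quick sanity check; the exact tokens depend on the dataset and on NLTK's tagger):
print(positive_tokens[0])   # the raw tokens, stop words and all
print(positive_clean[0])    # stop words removed, remaining words lemmatized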
After scrubbing our tokens clean of any unwanted noise we can begin to pack up our data into a format that will be conducive to handing over to an algorithm for training.
def final_token_generator(cleaned_tweets):
    # turn each list of tokens into the {token: True} feature dict
    # that nltk's classifiers expect
    for tokens in cleaned_tweets:
        yield dict([token, True] for token in tokens)

positive_model_tokens = final_token_generator(positive_clean)
negative_model_tokens = final_token_generator(negative_clean)
Curious as to what this function is even doing? Read more here.
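In short, each cleaned tweet becomes a dictionary mapping every one of its tokens to True, which is the feature format nltk's classifiers expect. A quick illustration (the tokens here are made up for the example):
example_tokens = ['top', 'engage', 'member', 'community', ':)']
print(dict([token, True] for token in example_tokens))
# {'top': True, 'engage': True, 'member': True, 'community': True, ':)': True}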
Our Model's Genesis
Before we can hand our data over to the algorithm, we must give it a little more labeling:
positive_dataset = [(token, "Positive") for token in positive_model_tokens]
negative_dataset = [(token, "Negative") for token in negative_model_tokens]
These labels let the algorithm know, during training and testing, how well it is doing (and don't we all love feedback?). Speaking of training and testing, let's do just that by splitting our data into two different datasets:
dataset = positive_dataset + negative_dataset
random.shuffle(dataset)
random.shuffle(dataset)
random.shuffle(dataset)
training = dataset[:7000]
testing = dataset[7000:]
All we did here was concatenate the two datasets, shuffle them a few times to make sure positive and negative examples are randomly distributed throughout, and then split them so that 7,000 of the cleaned, labeled entries go to training and the remaining 3,000 to testing. Now comes the machine learning:
classifier = NaiveBayesClassifier.train(training)
Yep, isn't python wild? We can see how our model looks by using the following lines:
print("Accuracy:", classify.accuracy(classifier, testing))
print(classifier.show_most_informative_features(10))
This will give you a little insight into our model. Try changing the number of times you shuffle, or whether or not you remove stop words, and see how this output is affected. Like I mentioned, this is just a very simple example of how to make a model for sentiment analysis, and there are a myriad of ways to change and improve the model we made today. But what is the use of a model if you can't do things with it?
def analyze(text):
    # tokenize and clean the raw sentence, then classify it
    custom_tokens = clean(word_tokenize(text))
    return classifier.classify(dict([token, True] for token in custom_tokens))
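Before wiring it into the GUI, you can try analyze() directly. The exact answers depend on how your particular training run went, but they will always be one of the two labels:
print(analyze("I love sunny days :)"))    # most likely 'Positive'
print(analyze("This is the worst."))      # most likely 'Negative'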
This is what we will be using in our app.py to determine the sentiment of a sentence. First we clean the input, and then we run it through the classifier, which returns a string containing either "Positive" or "Negative". Going back to app.py, the change_emote() function now looks like this:
def change_emote(self, event):
    result = self.analyzer.analyze(self.text_field.get())
    if result == 'Positive':
        self.emote['text'] = self.positive
    elif result == 'Negative':
        self.emote['text'] = self.negative
    else:
        self.emote['text'] = self.neutral
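For completeness, here is one possible way the trained model could be exposed to app.py as the self.analyzer object used above. This is a hypothetical sketch, not the repo's actual layout: the Analyzer class name and constructor are assumptions, and it reuses the clean() function and trained classifier from earlier in this article.
from nltk.tokenize import word_tokenize

class Analyzer:
    """Thin wrapper so the GUI only needs one object with an analyze() method."""

    def __init__(self, classifier):
        # classifier is the NaiveBayesClassifier we trained above
        self.classifier = classifier

    def analyze(self, text):
        # same steps as the standalone analyze() function
        tokens = clean(word_tokenize(text))
        return self.classifier.classify(dict([token, True] for token in tokens))

# in the GUI's constructor, something along the lines of:
#     self.analyzer = Analyzer(classifier)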
You should now be able to run the app file and (eventually) get the GUI, where the emote should change according to the sentiment of the sentence you entered.
Conclusion
Today we looked at how to use nltk to gather data, clean it, and then feed the formatted data to a Naive Bayes Classifier. We did not cover how that classifier actually works, since it is out of the scope of this article and you can probably find far better resources on it than me. Nonetheless, we were able to make use of it, and then turn around and apply the model we made to add functionality to a GUI. Isn't that exciting? This is still a very simple model, but I hope you are feeling inspired to play around with it and see how you can improve it (e.g. there is no classification for neutral sentiment). Maybe see what other classifiers are out there, see if you can improve load times, or even try different datasets to work from. There are so many possibilities, and I thank you for your time viewing this article.
The final code can be found here. Also, feel free to read our chatbot architecture article.