Disclaimer: this is an article of a project that uses the Google Language Sentiment Analysis API, it doesn't train any machine learning model.
As a side project, I decided to develop a project to do sentiment analysis of headlines of some of the most important Brazilian news agencies. On the one hand I would like to test Google's API and on the other hand I would like to check if I could see significant differences on sentiments of the headlines of each news agency.
The decisions on the architecture of this project were taken based on two decision criteria:
- Lowest Prices
- Less work
For a database I decided to use Google's Firestore (non relational database) - no special reason for that other than "I'm already using GCP (Google Cloud Platform) for the sentiment analysis".
The database has three collections:
The documents in the collections have the following fields:
- name: the website's name
- regex: regex used for scraping the website's headlines
- url: the websites's url
keywords (that we want to scrape):
- value: the string that we are looking for on the news agencies websites
- headline: the original headline analyzed
- headlineEnglish: headline translated to English (we'll talk about that later)
- isOnline: boolean that indicates if the headline is still being displayed on the website
- keywords: array with the keywords found in the headine
- onlineStartDate: timestamp of the first time the headline has been seen on the website
- onlineEndDate: timestamp of the last time the headline has been seen on the website
- onlineTotalTimeMS: the difference between the end and start dates (in milliseconds)
- sentimentScore: score of the sentiment analyzed (-1 to -0.25 means a negative sentiment, -0.25 to 0.25 a neutral sentiment and 0.25 to 1 a positive sentiment)
- sentimentMagnitude: the magnitude of the sentiment analyzed
- website: the website's name (from where the headline has been scraped)
The responsible for actually doing all the work is a Node.js script (https://github.com/Brudhu/politicians_analysis). The script does the following:
- Get all the info it needs (like websites info, keywords etc.) from Firestore
- Scrape the websites to get the headlines (using puppeteer and the regex stored on Firestore)
- Pick headlines that have at least one of the keywords
- Check which of the scraped headlines have not been analyzed yet
- Translate headlines to English (using an API from Azure) - there we go: the reason for that is that in a quick test of the sentiment analysis API I realized it works a lot better with English sentences than Portuguese sentiments
- Analyze the sentiment of the headline translated to English (GCP Language API)
- Insert new sentiments in the "sentiments" collection
- Update sentiments that are not online anymore
I decided to run this job periodically every 30 minutes (not faster because I don't want to spend to much on Cloud resources).
I had two options to host the job: GCP (again) and Heroku - I know there are thousands of options but these are the ones I've had more experience
with. I decided to go with Heroku and Heroku Scheduler Addon (the scheduler is the responsible for running the script periodically). It's free for now.
While the job on Heroku is free, the project on GCP is costing me 0.01 BRL per day.
To get the data from Firestore and analyze it, I wrote a Python script (will release it later).
For the first tests I set up two news agencies:
The keywords are:
- Bolsonaro (Brazilian president)
- Moro (Former Brazilian minister of justice - removed from the ministry a in April)
- Lula (Former Brazilian president)
- Dória (Governor of São Paulo state in Brazil)
In less than 14 days I got 571 headlines analyzed: 366 from UOL (the first one I started collecting data from) and 205 from G1.
The only keyword that has enough data for some analysis is "Bolsonaro", which makes sense since he is the current president.
Opinião: Com a PF, Bolsonaro cumpre a profecia de Jucá
Opinion: With PF, Bolsonaro fulfills the prophecy of Jucá
Bolsonaro amplia lista de atividades consideradas essenciais na pandemia
Bolsonaro expands list of activities considered essential in the pandemic
Bolsonaro culpa governadores: 'Essa conta não é minha'
Bolsonaro blames governors: 'This account is not mine'
In this case we can see an error on the translation. I would say the best translation would be "Bolsonaro blames governors: 'This bill is not mine'"
Procuradora diz que Bolsonaro violou a Constituição ao determinar revogação de portarias sobre armas
Prosecutor says Bolsonaro violated the Constitution by determining repeal of ordinances on weapons
- The word clouds are displaying only words with 3 or more occurrences. The only keyword analyzed so far is "Bolsonaro".
They look very similar: both are largely concentrated around the neutral area and both medians are pretty close - around 0 a little shifted to negative sentiments, but they are not exactly the same. UOL's box plot's minimum and maximum tails are longer then the ones from G1. Let's take a closer look.
- Negative: 26.8%
- Neutral: 57.4%
- Positive: 15.8%
- Negative: 25.3%
- Neutral: 58.6%
- Positive: 16.1%
- Negative: 29.9%
- Neutral: 55.2%
- Positive: 14.9%
While they are still similar, we can see that G1 has more negative sentiment headlines than UOL, while UOL has more neutral and positive sentiment headlines.
In the histogram we can confirm what we saw before: we have more negative than positive sentiments, but neutral sentiments are way more common.
It looks like while G1 has proportionally more negative sentiments than UOL (like we saw on the percentages before), UOL tends to be a little more "extremist", with more very negative and very positive sentiment headlines.
Now let's break the histograms even more: by positive and negative sentiments for each website.
UOL has more headlines with sentiments >= 0.7 (very positive sentiments).
Even though we now that G1 has more headlines with negative sentiments, these histograms shows that UOL has more headlines with sentiments <= -0.6 (very negative sentiments).
While it was a lot fun to work on this project and having learned new stuff, I have to point out some of the flaws here:
- The translation from Portuguese to English (Azure) is very good, but not perfect for some cases
- Headlines related to Brazilian politics sometimes have a specific context that would be useful for the translation and Azure doesn't get it
- Some of the headlines were written by columnists and may be too informal to make sense after being translated (e.g. "Batata assou no fogo do parquinho dos Bolsonaro" which was translated to "Potato baked in the fire of bolsonaro playground" this sentence contains a Brazilian expression and means, in a very simplistic translation, something like "The Bolsonaros are in a bad situation")
- Getting way more negative than positive sentiments may not reflect a partial position of the news agencies. Many headlines are about problems related to Covid-19 and may be inherently negative (some are not).
Both agencies have similar results - not exactly the same, but very similar.
Recently I added a new news agency (R7) and will try to update the data and analysis once I have more relevant data - maybe with new news agencies and new keywords.