In the previous post, part one, I showed how to automate part of the data-gathering process when you know most of the constraints and want to cut down the manual work of data entry. Namely, my challenge was that I wanted to get tweets from English-speaking stand-up comedians (most resources for data analysis are in English - sorry, other languages 😩) within a predefined period.
I managed to produce a .csv file with the name of each stand-up comedian and their respective Twitter handle - everything I need to gather (scrape) their tweets.
There are a few ways to achieve this:
Use the Twitter API with your preferred Python library (there are a couple of good ones).
PRO: you're using the API, so you know what data you'll get regardless of UI/structural changes.
CON(?): the free tier offers a limited set of options, and usage has to be approved - there's a process where you apply and, once your request is processed successfully, receive an API key.
Use GetOldTweets3 (or one of its variations).
PRO: easy to use for small amounts of data.
CON: google "Too Many Requests"
Use the nasty library (NASTY Advanced Search Tweet Yielder).
PRO: Easy to use, flexible
CON: I haven't come across one.
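For a sense of what the API route looks like, here is a minimal sketch of option 1 using tweepy (one of the good ones). The credentials are placeholders you'd get from the approval process, the 1000-tweet cap is arbitrary, and the date-filter helper is a hypothetical name of mine:

```python
from datetime import datetime

def in_period(created_at, since, until):
    """Keep only tweets whose timestamp falls in the predefined period."""
    return since <= created_at < until

def fetch_timeline(handle, since, until):
    """Pull a user's recent tweets via the Twitter API and filter by date."""
    import tweepy  # pip install tweepy

    # Placeholder credentials - replace with the keys Twitter approves for you.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    api = tweepy.API(auth, wait_on_rate_limit=True)

    # Cursor pages through the timeline; tweet_mode="extended" avoids truncation.
    tweets = tweepy.Cursor(
        api.user_timeline, screen_name=handle, tweet_mode="extended"
    ).items(1000)
    return [t.full_text for t in tweets if in_period(t.created_at, since, until)]
```

The main limitation (beyond approval) is that the standard API only reaches a limited slice of a user's timeline, which matters if your predefined period is far in the past.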
Since I've tried all three approaches, I'll present you with the Colab for the one I think works best: nasty.
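The core of the Colab boils down to something like the sketch below. I'm assuming nasty's `Search` API as shown in its README (a query string plus `since`/`until` dates, with `.request()` yielding tweet objects) - check the README for the exact signature, since it may have changed. The `from:<handle>` syntax mirrors Twitter's own advanced search:

```python
import csv
from datetime import date

def user_query(handle):
    """Build the advanced-search query for one comedian's handle."""
    return "from:" + handle.lstrip("@")

def scrape(handles_csv, out_csv, since=date(2020, 1, 1), until=date(2020, 2, 1)):
    """Read (name, handle) rows and write one row per scraped tweet."""
    import nasty  # pip install nasty

    with open(handles_csv) as f_in, open(out_csv, "w", newline="") as f_out:
        writer = csv.writer(f_out)
        writer.writerow(["name", "created_at", "text"])
        for name, handle in csv.reader(f_in):
            # Assumed Search signature - verify against the nasty README.
            for tweet in nasty.Search(user_query(handle),
                                      since=since, until=until).request():
                writer.writerow([name, tweet.created_at, tweet.text])
```

Because nasty mimics the advanced-search page rather than the API, it isn't bound by the timeline depth limits of option 1 - but it is more exposed to breakage when Twitter changes its markup.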
The result is a .csv file in the following format (printed row by row):
After this, the data needs to be cleaned, normalized, and purified before it can be used for various purposes like sentiment analysis, topic modeling, labeling, and so on.
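As a taste of that cleaning step, here is a minimal normalizer of my own (not part of the Colab): it strips URLs and @-mentions, collapses whitespace, and lowercases - roughly the floor of what sentiment analysis or topic modeling pipelines expect:

```python
import re

URL = re.compile(r"https?://\S+")      # t.co links and other URLs
MENTION = re.compile(r"@\w+")          # @-mentions

def normalize(text):
    """Minimal tweet cleanup before sentiment analysis / topic modeling."""
    text = URL.sub("", text)
    text = MENTION.sub("", text)
    text = re.sub(r"\s+", " ", text)   # collapse runs of whitespace
    return text.strip().lower()
```

Depending on the downstream task you'd go further (hashtag handling, emoji, stop words, lemmatization), but even this much makes the rows comparable.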