Jessica Garson for XDevelopers

Posted on Mar 13, 2020 • Edited on Mar 14, 2020

Running the Python Package for Search Tweets in R

#twitter #r #python #tutorial

This tutorial was originally posted to the Twitter developer blog.

Reticulate is a package for R that allows you to run Python code inside of R. Since both Python and R are very popular for common data science tasks, it makes sense that you would want to use them together. A few times, I’ve been asked if we had an R version of the Python package, search-tweets-python. This package allows you to connect to the premium and enterprise search endpoints easily. With reticulate, you can call this Python package inside of R. This tutorial will walk you through how to use the search-tweets-python package inside of R to get a data frame that contains the date, ID and text of Tweets from a Twitter handle of your choosing from the past 30 days.

What you need to get started with the Twitter API

First, you will need an approved developer account. You can apply for a developer account here. Once your account is approved, you will need to create a Twitter app. Then, set up that app for the Search Tweets: 30-Day environment. You will also need to set a dev environment name; for the purposes of this tutorial, mine is named Testaroo. After you have completed the initial setup, you’ll need to locate your keys and tokens.

Installing the packages you need

To follow this tutorial, you will need to install the packages reticulate and dplyr. Reticulate is the package that will allow you to run Python code inside of R and dplyr allows you to shape your data to get exactly the results you are looking for.

install.packages('reticulate')
install.packages('dplyr')

You will also need to call the packages to start working with them:

library('reticulate')
library('dplyr')

Getting your Python environment set up inside of R

To work with reticulate, you will need to make sure that your Python environment is set up in R. You will need to create a variable for your Python path and use the use_python function to set this path.

path_to_python <- "/usr/bin/python3.7"
use_python(path_to_python)
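If you’re not sure where Python lives on your machine, one option is to look the path up rather than hard-code it. The following is a minimal sketch using base R’s Sys.which. It assumes the interpreter you want is on your PATH under the name python3, which may not be true on every system.

```r
# Look up a python3 executable on the PATH instead of hard-coding the path.
# Assumption: the interpreter you want is named "python3" on this machine.
path_to_python <- unname(Sys.which("python3"))

if (nzchar(path_to_python)) {
  message("Found Python at: ", path_to_python)
  # With reticulate loaded, you could then call: use_python(path_to_python)
} else {
  message("No python3 found on the PATH; set path_to_python by hand.")
}
```

Sys.which returns an empty string when nothing is found, so the check with nzchar covers both cases.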

You will need to install the Python packages searchtweets to help you connect to the Twitter API, and pandas to create a data frame. You’ll create a virtual environment that provides an isolated Python environment to install packages into. You also have the option of using conda if you prefer.

The following code creates a virtual environment called tweetenv and installs the two packages into it, one line per package. After installing them, you can tell R to use the tweetenv virtual environment with the function use_virtualenv.

virtualenv_install("tweetenv", "searchtweets", ignore_installed = TRUE)
virtualenv_install("tweetenv", "pandas", ignore_installed = TRUE)
use_virtualenv("tweetenv", required = TRUE)

You’ll need to import two packages you just installed into your code. You’ll create a variable called st that imports the package searchtweets and one called pd for pandas.

st <- import("searchtweets")
pd <- import("pandas")

Storing and accessing your credentials

To store your credentials and endpoint information you can create a file in the same directory you are using called secret.yaml that contains the following:

search_tweets_api:
  account_type: premium
  endpoint: https://api.twitter.com/1.1/tweets/search/30day/Testaroo.json
  consumer_key: xxxxxxxxxxxxxxxxxxx
  consumer_secret: xxxxxxxxxxxxxxxxxxx

The last part of the endpoint URL (Testaroo in this example) is the name of the dev environment you created on developer.twitter.com. You will need to change this to the name of your own dev environment.
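To see how the dev environment name fits into the endpoint, here is a small sketch that builds the URL with sprintf. The variable names env_name and endpoint are only for illustration. The value you paste into secret.yaml is the resulting string.

```r
# Build the 30-day search endpoint from a dev environment name.
# "Testaroo" is the environment name used in this tutorial; swap in your own.
env_name <- "Testaroo"
endpoint <- sprintf(
  "https://api.twitter.com/1.1/tweets/search/30day/%s.json",
  env_name
)
print(endpoint)
# [1] "https://api.twitter.com/1.1/tweets/search/30day/Testaroo.json"
```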

Make sure you add this file to your .gitignore before you push your project to GitHub, so that your keys are not published along with your code. For more information about working with a .gitignore file check out this page. You also might want to review our guide on keeping tokens secure.
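If you’d like to handle the .gitignore entry from R as well, something like the sketch below could work. The helper add_to_gitignore is hypothetical (it is not part of any package) and simply appends a line when it is not already listed.

```r
# Hypothetical helper: list a file in .gitignore unless it is already there.
add_to_gitignore <- function(entry, gitignore = ".gitignore") {
  existing <- if (file.exists(gitignore)) {
    readLines(gitignore, warn = FALSE)
  } else {
    character(0)
  }
  if (!(entry %in% existing)) {
    writeLines(c(existing, entry), gitignore)
  }
  invisible(gitignore)
}

# Demonstrated on a temporary file so the example has no side effects.
demo_gitignore <- tempfile()
add_to_gitignore("secret.yaml", demo_gitignore)
add_to_gitignore("secret.yaml", demo_gitignore)  # second call changes nothing
print(readLines(demo_gitignore))
# [1] "secret.yaml"
```

In your own project you would call add_to_gitignore("secret.yaml") from the project directory so the default .gitignore is used.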

In your R file, you will want to create a variable that uses the load_credentials function of the search-tweets-python library to load in your secret.yaml file. You may need to type the full path for the file if you run into any issues.

cred <-
  st$load_credentials(filename = "secret.yaml",
                      yaml_key = "search_tweets_api",
                      env_overwrite = FALSE)

Getting key information

You’ll first need to set up a variable that lets you type in the Twitter handle you’re looking for. For the purposes of this tutorial, the example being used is @TwitterDev, but you can type in any Twitter handle you wish after being prompted to do so by the code.

input <- readline('What handle do you want to get Tweets from? ')
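The from: operator takes the username without the @ character, so it can help to tidy up whatever was typed at the prompt. The helper below, clean_handle, is a hypothetical name used only for this sketch. Also keep in mind that readline returns an empty string right away when R is not running interactively.

```r
# Hypothetical helper: normalise a handle typed at the prompt.
clean_handle <- function(handle) {
  handle <- trimws(handle)  # drop stray spaces around the handle
  sub("^@", "", handle)     # drop one leading "@", if present
}

cleaned <- clean_handle(" @TwitterDev ")
print(cleaned)
# [1] "TwitterDev"
```

You could then pass clean_handle(input) on to the next steps in place of input.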

You’ll need to set up two variables: one for today that pulls in the current date, and one that holds the date from 30 days back.

today <- toString(Sys.Date())
thirty_days <- toString(as.Date(today) - 30)
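To make the date arithmetic concrete, here is the same calculation with a fixed date in place of Sys.Date(). The date 2020-02-07 is chosen only so the example is reproducible.

```r
# A fixed date, used here only so the result is reproducible.
today <- as.Date("2020-02-07")

# Subtracting a number from a Date moves it back by that many days.
thirty_days <- today - 30

format(today)        # "2020-02-07"
format(thirty_days)  # "2020-01-08"
```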

In order for your text to be formatted correctly so that you can construct a valid search query, you’ll need to do some string formatting to create a variable called pt_format.

pt_format <- sprintf("from:%s", input)

Run a print statement to make sure everything looks good.

print(pt_format)

What you get back should look like this:

[1] "from:TwitterDev"

Now you can build a rule by passing in the pt_format you just created, thirty_days, and today, and by setting the number of results per call.

rule <-
  st$gen_rule_payload(
    pt_rule = pt_format,
    from_date = thirty_days,
    to_date = today,
    results_per_call = 500
  )

Simply print out this rule to see what you have:

print(rule)

You should get back something that looks like this with the handle you typed in:

[1] "{\"query\": \"from:TwitterDev\", \"toDate\": \"202002070000\", \"fromDate\": \"202001080000\"}"
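You may notice that the printed rule has no entry for the 500 results per call. One possible explanation, offered here as an assumption rather than something confirmed in this tutorial, is type conversion: reticulate passes a plain R number such as 500 to Python as a float and only passes 500L as an int, and a Python function that expects an int may skip a float. The base R sketch below shows the difference between the two literals.

```r
# In R, a bare number is a double; the L suffix makes it an integer.
per_call_double  <- 500
per_call_integer <- 500L

is.integer(per_call_double)   # FALSE
is.integer(per_call_integer)  # TRUE
typeof(per_call_double)       # "double"
typeof(per_call_integer)      # "integer"
```

If you want the value to arrive in Python as an int, you could try results_per_call = 500L and print the rule again to compare.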

Getting the Tweets

Next, you’ll need to create a variable called rs which lets you connect to the endpoint. You’ll pass in the rule, the max results, and the endpoint, bearer token and extra headers from the credentials you loaded earlier.

rs <-
  st$ResultStream(
    rule_payload = rule,
    max_results = 500,
    extra_headers_dict = cred$extra_headers_dict,
    endpoint = cred$endpoint,
    bearer_token = cred$bearer_token
  )
print(rs)

If you print out this variable you’ll get a body of JSON that looks like this:

ResultStream: 
    {
    "username": null,
    "endpoint": "https://api.twitter.com/1.1/tweets/search/30day/Testaroo.json",
    "rule_payload": {
        "query": "from:TwitterDev",
        "toDate": "202002070000",
        "fromDate": "202001080000"
    },
    "tweetify": true,
    "max_results": 1000000000000000
}

You can start the stream of results by calling the stream() method and storing what it returns in a variable called tweets.

tweets <- rs$stream()
print(tweets)

When you print out the tweets variable, you will see a Python generator object that looks like this:

<generator object ResultStream.stream at 0x11f4a1af0>

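To be able to access the information in a Python generator object, you can use the iterate function from reticulate, which runs through the generator and collects its results in R. A generator can only be consumed once, so store the result in a variable.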

it_tweets <- iterate(tweets)

Creating a data frame

A data frame is a table-like data structure which can be particularly useful for working with datasets. To get a data frame of Tweets you can use the DataFrame constructor from pandas. Reticulate converts a pandas data frame into an R data frame, so from this point you can use your favorite R packages without doing extra conversions.

tweets_df <- pd$DataFrame(data=it_tweets)

If you want to look through the data frame, it’s a bit easier to use View than print, since this is a large data frame with many columns.

View(tweets_df)

If you want to work with a smaller data frame that contains only a few columns, you can select the specific columns using dplyr, a package for R that is helpful for reshaping data. The following code will return a data frame that has the date, Tweet ID and the text for all Tweets that fit the query you ran earlier.

smaller_df <- select(tweets_df, created_at, id, text)
View(smaller_df)
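One thing to watch for with the id column: Tweet IDs are 64-bit integers, while R stores ordinary numbers as doubles, which represent whole numbers exactly only up to 2^53. Large IDs may therefore be rounded once they reach R. Tweet objects also carry the ID as text in a field called id_str. The sketch below assumes that field shows up as a column in your data frame, and it uses a toy data frame with made-up values and base R column selection.

```r
# R doubles cannot tell these two whole numbers apart:
(2^53 + 1) == 2^53  # TRUE

# Toy stand-in for tweets_df; every value here is made up for illustration.
toy_df <- data.frame(
  created_at = c("Fri Feb 07 00:00:00 +0000 2020",
                 "Thu Feb 06 12:30:00 +0000 2020"),
  id_str = c("1000000000000000001", "1000000000000000002"),
  text = c("First example Tweet", "Second example Tweet"),
  lang = c("en", "en"),
  stringsAsFactors = FALSE
)

# Base R equivalent of select(): keep three columns, with the ID as text.
smaller_toy <- toy_df[, c("created_at", "id_str", "text")]
print(names(smaller_toy))
# [1] "created_at" "id_str"     "text"
```

With dplyr, the matching call on your own data would be select(tweets_df, created_at, id_str, text), provided the id_str column is present.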

Next steps

It was exciting to explore working with Twitter data using elements of two different languages. Other things you can do to explore this further include running R inside of a Jupyter Notebook and using a package called rpy2 to run R code inside of Python. The full version of the code can be found here. Let us know on the forums if you run into any troubles along the way or Tweet us at @TwitterDev if this inspires you to create anything.

I used several libraries and tools beyond the Twitter API to make this tutorial, but you may have different needs and requirements and should evaluate whether those tools are right for you.

Top comments (10)

Jessica Garson • Mar 13 '20 • Edited

Yeah it's great! I wrote about rtweet in the past but I found the search-tweets-python package handles the pagination of working with the premium endpoints in a more robust way.

Doc Scott • Feb 28 '21

I am using the academic researcher api and so I have to use searchtweets-v2 which has some of the same commands but something goes different at the iterate(tweets) step for me. I like the r dataframe but cannot seem to get to that.

Hamza • Mar 13 '20

I've always wanted to play around more with R. 🤩

Jessica Garson • Mar 15 '20

Let me know if you ever want to pair up together on an R related project.

Alexandro Disla • Mar 15 '20

Are you planning to do a series?

Jessica Garson • Mar 15 '20

I've written about similar topics in the past but I don't know if it's an official series. Is there anything you'd like to see written about?

Alexandro Disla • Mar 15 '20

Well, mostly in the same spirit as this article: types of analytics on social media with R and/or Python using the Twitter API.

Jessica Garson • Mar 15 '20

Cool, glad you enjoyed this! Thanks for the feedback!

Sp!ral • Mar 27 '21

I have a question about using Python packages in R. I want to use academic API in R but only see sample codes in Github for Python, not R. Does this mean I can copy those sample Python codes after setting up the R environment following this article?