DEV Community

Cover image for Joining multiple datasets on the same column in R using dplyr and purrr
Chris Greening
Chris Greening

Posted on

Joining multiple datasets on the same column in R using dplyr and purrr

Joining multiple datasets on the same column is a common pattern in data preparation

So let's jump in and explore how we can leverage R and the tidyverse to join an arbitrary number of datasets on a shared column with elegant, readable code!

Table of Contents

Chris Greening - Software Developer

Hey! My name's Chris Greening and I'm a software developer from the New York metro area with a diverse range of engineering experience - beam me a message and let's build something great!

favicon christophergreening.com

Installing prerequisite packages

In this tutorial we'll be using dplyr and purrr from the popular tidyverse collection of packages

The following line of code will install them on your machine if they aren't already:



install.packages(c("dplyr", "purrr"))


Enter fullscreen mode Exit fullscreen mode

Examining our sample datasets

For the following examples, we'll be using real-world agricultural data sourced via Eurostat containing the number of specific livestock animals (swine, bovine, sheep, and goats) in a country during a given year

For example, here is the goats dataset



> goats
# A tibble: 1,322 Γ— 3
   country  year goats
   <chr>   <dbl> <dbl>
 1 Albania  2014 904
 2 Albania  2015 932
 3 Albania  2016 941.
 4 Albania  2017 933.
 5 Albania  2018 917.
 6 Albania  2019 863.
 7 Albania  2020 774.
 8 Austria  1993  47
 9 Austria  1994  49.8
10 Austria  1995  54.2
# … with 1,312 more rows


Enter fullscreen mode Exit fullscreen mode

Our goal is to join these datasets by country and year into a single livestock.data variable containing all the animals like so:



> livestock.data
# A tibble: 1,322 Γ— 6
   country  year bovine goats swine sheep
   <chr>   <dbl>  <dbl> <dbl> <dbl> <dbl>
 1 Albania  2014   500. 904    172. 1896
 2 Albania  2015   506. 932    177. 1918
 3 Albania  2016   497. 941.   184. 1972.
 4 Albania  2017   475. 933.   180. 1926.
 5 Albania  2018   467. 917.   184. 1864.
 6 Albania  2019   416. 863.   184. 1758.
 7 Albania  2020   363. 774.   158. 1558.
 8 Austria  1993  2334.  47     NA   334
 9 Austria  1994  2329.  49.8 3729.  342.
10 Austria  1995  2326.  54.2 3706.  365.
# … with 1,312 more rows


Enter fullscreen mode Exit fullscreen mode

Using dplyr::full_join to manually join two datasets at a time

Let's start with a naive approach and manually join our datasets one-by-one on the country and year columns



by = c("country", "year")
livestock.data <- dplyr::full_join(bovine, goats, by=by)
livestock.data <- dplyr::full_join(livestock.data, swine, by=by)
livestock.data <- dplyr::full_join(livestock.data, sheep, by=by)


Enter fullscreen mode Exit fullscreen mode

The above code accomplishes the exercise by:

  1. Manually stepping through each animal variable
  2. applying a function that takes two arguments (in this case dplyr::full_join)
  3. and chaining the output of one step (livestock.data) as the input for the next step

While this might work for four datasets, what if we had 100 datasets? 1000 datasets? n datasets?! Suddenly not a great solution!

Let's investigate how we can improve, automate, and scale this


Understanding the reduce operation

The reduce operation is a technique that combines all the elements of an array (i.e. an array containing our individual livestock datasets) into a single value (i.e. the final joined table).

The reduce operation accomplishes this by:

  1. Looping over an array
  2. applying a function that takes two arguments (such as dplyr::full_join)
  3. and chaining the output of one step as the input for the next step

Sound familiar? This is exactly what we just performed manually in the previous section except this time we'll be leveraging R to do it for us!

So let's see in practice how we can apply the reduce operation to elegantly join our livestock.data


Leveraging purrr::reduce to join multiple datasets

purrr is a package that enhances R's functional programming toolkit for working with functions and vectors (i.e. reducing, mapping, filtering, etc.)

In this case, we're going to use purrr::reduce in conjunction with dplyr::full_join to join all of our datasets in one line of concise, readable code



livestock.data <- purrr::reduce(
    list(bovine, goats, swine, sheep),
    function(left, right) {
      dplyr::full_join(left, right, by=c("country", "year"))
    }
)


Enter fullscreen mode Exit fullscreen mode

And that's it! We accomplished this by:

  • 1. Looping over a list of our livestock ```R

list(bovine, goats, swine, sheep)

- 2. **applying** `dplyr::full_join` which takes two arguments 

```R


function(left, right) {
  dplyr::full_join(left, right, by=c("country", "year"))
}


Enter fullscreen mode Exit fullscreen mode
  • 3. and chaining the output of one step as the input for the next step

Image showing the different datasets joining together in a hierarchical chain that starts with bovine and goats joining into livestock.data, livestock.data joining with swine, and livestock.data finally joining with sheep


Conclusion

Thus the elegance of the reduce operation comes from its flexibility to handle n datasets (assuming your machine has enough memory 😜)

If you want to take this a step further and practice with sample code and data, I've pulled together a full working example for you to explore on GitHub!

Thanks so much for reading and if you liked my content, be sure to check out some of my other work or connect with me on social media or my personal website πŸ˜„

Chris Greening - Software Developer

Hey! My name's Chris Greening and I'm a software developer from the New York metro area with a diverse range of engineering experience - beam me a message and let's build something great!

favicon christophergreening.com

Cheers!



Additional resources

Top comments (0)