DEV Community

Erik Marsja
Erik Marsja

Posted on

Mastering Data Manipulation: Merge Datasets in R

Combining datasets is an essential task in data science that empowers analysts to open valuable understandings and drive data-driven decision-making. The process of merging datasets allows analysts to consolidate information from numerous sources into a unified and comprehensive view, providing a deeper understanding of complex relationships and patterns within the data. In this R tutorial, we will get into merging datasets in R, providing you with essential skills to harness the full potential of your data.

Requirements

Before starting, ensure you have the necessary knowledge and packages. Basic familiarity with R programming is assumed, including understanding data frames and common data manipulation functions in R.

This tutorial will use R, a powerful open-source programming language and environment for statistical computing and data analysis. Additionally, we will primarily rely on two essential R packages: dplyr and readxl. Dplyr offers an intuitive grammar for data manipulation, enabling smooth and efficient data transformation. Meanwhile, readxl focuses on reading Excel files.

Reading Excel Files

Here we read two .xlsx files in R:

# Load the required library
library(readxl)

# Read the student data from the .xlsx file
student_data <- read_excel("student_data.xlsx")

# Read the course data from the .xlsx file
course_data <- read_excel("course_data.xlsx")
Enter fullscreen mode Exit fullscreen mode

After executing the above code, the "student_data" and "course_data" data frames will contain the data from the respective .xlsx files. You can merge these datasets in R using various methods, explore relationships between students and courses, and conduct insightful data analyses.

Merge Datasets in R Example 1: Inner Join

In this example, we will perform an inner join between the "student_data" and "course_data" datasets based on the common CourseID. The inner join will only merge data in R with students taking courses and exclude those not enrolled in any course.

# Load dplyr
library(dplyr)

# Perform an inner join
inner_merged_data <- inner_join(student_data, 
                                course_data, by = "CourseID")

# Display the merged dataset
print(inner_merged_data)
Enter fullscreen mode Exit fullscreen mode

After executing the above code and merging the two datasets in R, the "student_data" and "course_data" data frames will contain the data from the respective .xlsx files. You can now explore various methods to merge these datasets, identify relationships between students and courses, and conduct insightful data analyses.

Merge Datasets in R Example 2: Left Join

In this example, we will perform a left join between the "student_data" and "course_data" datasets based on the common CourseID. The left join will merge data in R from the "student_data" DataFrame with all available courses from the "course_data" DataFrame, including students not enrolled in any course.

# Perform a left join
left_merged_data <- left_join(student_data,
                              course_data, by = "CourseID")

# Display the merged dataset
print(left_merged_data)
Enter fullscreen mode Exit fullscreen mode

After running the code, we utilize the left_join() function from the dplyr package to merge the "student_data" and "course_data" datasets based on the common "CourseID" column. The resulting "left_merged_data" data frame includes all rows from the "student_data" DataFrame, irrespective of course enrollment. The corresponding columns from the "course_data" DataFrame will have NA values for students not enrolled in any course.

Merged datasets.

There is also a function for merging datasets called right_join() in the dplyr package. The right_join() function works similarly to left_join(), but it retains all rows from the right dataset and includes matching rows from the left dataset based on the common column. This provides a perspective where the focus is on the right dataset, and it incorporates relevant information from the left dataset, even for cases where no matches exist in the left dataset.

Example 3: Filtered Merge using %in%

To merge the datasets using %in% in R, we will perform a filtered merge that includes only the students taking the course with CourseID 201 and the corresponding Instructor teaching that course.

# Filter students taking course 201
students_taking_course_201 <- student_data %>%
  filter(CourseID %in% 201)

# Filter the course with CourseID 201 and its Instructor
course_201_info <- course_data %>%
  filter(CourseID %in% 201)

# Merge the filtered datasets
merged_data <- inner_join(students_taking_course_201, 
                          course_201_info, by = "CourseID")

# Display the merged dataset
print(merged_data)
Enter fullscreen mode Exit fullscreen mode

After executing the code, we filter the students taking the course with CourseID 201 from the "student_data" DataFrame using the %in% operator. Simultaneously, we filter the course with CourseID 201 and its Instructor from the "course_data" DataFrame. Next, we perform an inner join on the filtered datasets based on the common "CourseID" column. This results in the merged dataset containing only the students taking Course 201 and the corresponding Instructor teaching that course.

Filtered and merged datasets.

In this post, we explored how to merge datasets in R using various functions from the dplyr package, such as left_join() and right_join(). We learned how to perform inner joins and utilize %in% to filter and merge specific data based on common columns.

Top comments (0)