Merge Function In R

In this article, we will discuss the merge() function and how it works in the R programming language.

The merge() function in R combines two data frames based on common columns or keys. It performs database-style merges, similar to SQL joins, so that data from different sources ends up in a single data frame. The sections below cover its syntax, its main parameters, and examples of its usage.

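The syntax of the merge() function for data frames is as follows:

merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
      all = FALSE, all.x = all, all.y = all, sort = TRUE,
      suffixes = c(".x", ".y"), no.dups = TRUE, incomparables = NULL, ...)
  • x, y: The data frames to be merged.
  • by: A character vector specifying the columns to merge by. By default, the function merges on the column names that x and y have in common. If by is NULL or has length zero, the result is the Cartesian product of x and y.
  • by.x, by.y: The key columns in x and y, respectively. Use these when the key columns have different names in the two data frames. Both default to by.
  • all: Logical; if TRUE, it performs a full outer join, retaining all rows from both x and y. If FALSE (the default), it performs an inner join, retaining only the rows with matching keys.
  • all.x, all.y: Logical; all.x = TRUE keeps every row of x (left join) and all.y = TRUE keeps every row of y (right join). Both default to the value of all.
  • sort: Logical; if TRUE (the default), the result is sorted on the key columns.
  • suffixes: Suffixes appended to non-key column names that occur in both x and y, so that they stay unique in the result.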
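The by.x and by.y parameters are easiest to see with key columns that are named differently. The following is a minimal sketch; the data frames employees and departments and the column emp_id are made up for illustration and do not appear elsewhere in this article.

```r
# Hypothetical data: the key is called "emp_id" in one frame and "ID" in the other
employees <- data.frame(emp_id = c(1, 2, 3),
                        Name = c("Asha", "Ben", "Caro"))
departments <- data.frame(ID = c(2, 3, 4),
                          Department = c("IT", "Finance", "HR"))

# Match employees$emp_id against departments$ID (inner join by default)
joined <- merge(employees, departments, by.x = "emp_id", by.y = "ID")
print(joined)
```

The key column in the result takes its name from x, so here it is called emp_id, and only the keys 2 and 3 survive because the join is an inner join.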

Let’s illustrate the usage of the merge() function with an example:

R
# Create two sample data frames
df1 <- data.frame(ID = c(1, 2, 3), Name = c("Johny", "Ali", "Boby"), 
                  Score = c(80, 85, 90))
df2 <- data.frame(ID = c(2, 3, 4), Department = c("IT", "Finance", "HR"))
# Print the original data frames
print("Original Data Frame 1:")
print(df1)
print("Original Data Frame 2:")
print(df2)

Output:

[1] "Original Data Frame 1:"

  ID  Name Score
1  1 Johny    80
2  2   Ali    85
3  3  Boby    90

[1] "Original Data Frame 2:"

  ID Department
1  2         IT
2  3    Finance
3  4         HR

Now, let’s merge these data frames based on the common column “ID”:

R
# Merge data frames based on the common column "ID"
merged_df <- merge(df1, df2, by = "ID", all = TRUE)
# Print the merged data frame
print("Merged Data Frame:")
print(merged_df)

Output:

[1] "Merged Data Frame:"
  ID  Name Score Department
1  1 Johny    80       <NA>
2  2   Ali    85         IT
3  3  Boby    90    Finance
4  4  <NA>    NA         HR

With all = TRUE, merge() performs a full outer join, so every ID that appears in either data frame gets a row in the result. IDs 2 and 3 occur in both inputs, so their rows are complete. ID 1 exists only in df1, so its Department is NA, and ID 4 exists only in df2, so its Name and Score are NA.
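For comparison, the sketch below runs the same merge with the default all = FALSE and lists the IDs that occur on only one side. It recreates the two sample data frames so that it can be run on its own; the variable names inner_df and only_one_side are chosen here for illustration.

```r
# Same sample data as in the example above
df1 <- data.frame(ID = c(1, 2, 3), Name = c("Johny", "Ali", "Boby"),
                  Score = c(80, 85, 90))
df2 <- data.frame(ID = c(2, 3, 4), Department = c("IT", "Finance", "HR"))

# Default (all = FALSE): keep only IDs present in both data frames
inner_df <- merge(df1, df2, by = "ID")
print(inner_df)

# IDs found in only one data frame, i.e. the rows that all = TRUE adds back
only_one_side <- setdiff(union(df1$ID, df2$ID), intersect(df1$ID, df2$ID))
print(only_one_side)
```

The inner join returns only IDs 2 and 3, while IDs 1 and 4 are the ones that need NA padding in the outer join.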

Next, we demonstrate the merge() function with several data frames and different join types.

R
# Sample data frames
df1 <- data.frame(ID = 1:5, Name = c("Anurag", "Shivang", "Vipul", "Jayesh", "Pratham"))
df2 <- data.frame(ID = c(2, 4, 6), Score = c(85, 92, 78))
df3 <- data.frame(ID = c(1, 2, 3), Age = c(25, 30, 35))

# Print original data frames
print("Original data frames:")
print(df1)
print(df2)
print(df3)

# Perform inner join on 'ID' column
merged_inner <- merge(x = df1, y = df2, by = "ID", all = FALSE)
print("Inner join:")
print(merged_inner)

# Perform left join on 'ID' column
merged_left <- merge(x = df1, y = df2, by = "ID", all.x = TRUE)
print("Left join:")
print(merged_left)

# Perform outer join on 'ID' column
merged_outer <- merge(x = df1, y = df2, by = "ID", all = TRUE)
print("Outer join:")
print(merged_outer)

# Outer join of df1 and df3 on the single key column 'ID'
merged_df3 <- merge(x = df1, y = df3, by = "ID", all = TRUE)
print("Outer join of df1 and df3:")
print(merged_df3)

Output:

[1] "Original data frames:"
  ID    Name
1  1  Anurag
2  2 Shivang
3  3   Vipul
4  4  Jayesh
5  5 Pratham

  ID Score
1  2    85
2  4    92
3  6    78

  ID Age
1  1  25
2  2  30
3  3  35

[1] "Inner join:"
  ID    Name Score
1  2 Shivang    85
2  4  Jayesh    92

[1] "Left join:"
  ID    Name Score
1  1  Anurag    NA
2  2 Shivang    85
3  3   Vipul    NA
4  4  Jayesh    92
5  5 Pratham    NA

[1] "Outer join:"
  ID    Name Score
1  1  Anurag    NA
2  2 Shivang    85
3  3   Vipul    NA
4  4  Jayesh    92
5  5 Pratham    NA
6  6    <NA>    78

[1] "Outer join of df1 and df3:"
  ID    Name Age
1  1  Anurag  25
2  2 Shivang  30
3  3   Vipul  35
4  4  Jayesh  NA
5  5 Pratham  NA

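df1, df2, and df3 are three sample data frames. The calls above demonstrate inner, left, and outer joins with merge(): the inner join keeps only IDs 2 and 4, the left join keeps all five rows of df1, and the outer join additionally keeps ID 6 from df2.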
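A right join is the mirror image of the left join and uses all.y instead of all.x. The sketch below recreates df1 and df2 so that it runs on its own; the name merged_right is chosen here for illustration.

```r
# Same df1 and df2 as in the example above
df1 <- data.frame(ID = 1:5,
                  Name = c("Anurag", "Shivang", "Vipul", "Jayesh", "Pratham"))
df2 <- data.frame(ID = c(2, 4, 6), Score = c(85, 92, 78))

# Right join: keep every row of df2, fill Name with NA where df1 has no match
merged_right <- merge(x = df1, y = df2, by = "ID", all.y = TRUE)
print(merged_right)
```

The result has one row per row of df2 (IDs 2, 4, and 6), and the Name for ID 6 is NA because df1 has no such ID.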

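The final call merges df1 with df3 on ID. Note that by = c("ID") names only a single key column. To merge on several columns, pass a longer character vector to the by parameter, listing key columns that exist in both data frames.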
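Since df1, df2, and df3 share only the ID column, merging on more than one column needs different data. The following is a minimal sketch with two made-up data frames, scores and grades, keyed by the hypothetical columns ID and Year.

```r
# Hypothetical data keyed by two columns: ID and Year
scores <- data.frame(ID = c(1, 1, 2),
                     Year = c(2022, 2023, 2023),
                     Score = c(70, 75, 88))
grades <- data.frame(ID = c(1, 2, 2),
                     Year = c(2023, 2022, 2023),
                     Grade = c("B", "C", "A"))

# A row matches only when both ID and Year agree
merged_keys <- merge(scores, grades, by = c("ID", "Year"))
print(merged_keys)
```

Only the combinations (1, 2023) and (2, 2023) occur in both data frames, so the inner join returns two rows. A row with the same ID but a different Year is not treated as a match.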

Conclusion

The merge() function in R provides a flexible way to combine data frames based on common columns or keys. Inner joins are the default, the all, all.x, and all.y parameters select outer, left, and right joins, and by, by.x, and by.y control which columns are used as keys.


