What is Duplicate Data?
A dataset sometimes contains identical or near-identical rows; such records are known as duplicate data. Duplicates cause the same observation to be counted two or more times, which distorts counts, totals, and averages, so they need to be handled before analysis. To illustrate, we will create a fictional dataset containing the ID, name, age, and salary of employees, in which some records appear more than once.
R
# Create a sample dataset with duplicate data
example_data <- data.frame(
  ID = c(1, 2, 3, 4, 5, 1, 6, 2),
  Name = c("Alice", "Bob", "Charlie", "David", "Eve", "Alice", "Frank", "Bob"),
  Age = c(25, 30, 35, 22, 28, 25, 40, 30),
  Salary = c(50000, 60000, 70000, 45000, 55000, 50000, 80000, 60000)
)

# Display the dataset with duplicate data
print("Dataset with Duplicate Data:")
print(example_data)
Output:
[1] "Dataset with Duplicate Data:"
  ID    Name Age Salary
1  1   Alice  25  50000
2  2     Bob  30  60000
3  3 Charlie  35  70000
4  4   David  22  45000
5  5     Eve  28  55000
6  1   Alice  25  50000
7  6   Frank  40  80000
8  2     Bob  30  60000
Here row 6 is a duplicate of row 1 and row 8 is a duplicate of row 2, so Alice and Bob each appear twice in the dataset.
Identify Duplicate Data
The example dataset is small, so the duplicates are easy to spot by eye. With a large dataset, checking every row manually is impractical and time-consuming. The duplicated() function does this for us: it returns TRUE for every row that repeats an earlier row, and it can be applied to all columns or to a selected subset of them:
R
# Identify duplicate rows based on all columns
duplicates_all <- example_data[duplicated(example_data), ]

# Identify duplicate rows based on selected columns (e.g., ID and Name)
duplicates_selected <- example_data[duplicated(example_data[c("ID", "Name")]), ]

# Display duplicate rows
print("Duplicate Rows (All Columns):")
print(duplicates_all)
print("Duplicate Rows (Selected Columns):")
print(duplicates_selected)
Output:
[1] "Duplicate Rows (All Columns):"
  ID  Name Age Salary
6  1 Alice  25  50000
8  2   Bob  30  60000
[1] "Duplicate Rows (Selected Columns):"
  ID  Name Age Salary
6  1 Alice  25  50000
8  2   Bob  30  60000
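The code returns rows 6 and 8, the second occurrences of Alice and Bob. Note that duplicated() flags only the repeats, not the first occurrence of each row.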
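Before filtering, it can help to check whether a data frame contains duplicates at all and to see every copy of a repeated row, not only the later ones. The following is a minimal sketch in base R; the variable names `dup_count`, `first_dup`, `is_repeated`, and `all_copies` are illustrative and do not come from the example above.

```r
# Recreate the sample dataset so this block runs on its own
example_data <- data.frame(
  ID = c(1, 2, 3, 4, 5, 1, 6, 2),
  Name = c("Alice", "Bob", "Charlie", "David", "Eve", "Alice", "Frank", "Bob"),
  Age = c(25, 30, 35, 22, 28, 25, 40, 30),
  Salary = c(50000, 60000, 70000, 45000, 55000, 50000, 80000, 60000)
)

# Number of rows that repeat an earlier row
dup_count <- sum(duplicated(example_data))

# Index of the first duplicated row, or 0 if there is none
first_dup <- anyDuplicated(example_data)

# Flag every copy of a repeated row, including its first occurrence,
# by combining a forward scan with a backward scan (fromLast = TRUE)
is_repeated <- duplicated(example_data) | duplicated(example_data, fromLast = TRUE)
all_copies <- example_data[is_repeated, ]

print(dup_count)
print(first_dup)
print(all_copies)
```

Viewing all copies side by side is useful when the repeated rows might differ in columns that were not part of the comparison.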
Dealing with Duplicate Data
There are several ways to deal with duplicate data, such as deleting the duplicated rows or aggregating them. We will demonstrate each approach using the employee dataset from the example above.
Deleting duplicate values
We can delete rows that appear more than once so that only one copy of each remains. The unique() function does this for a data frame.
R
# Remove duplicate rows and create a new dataset
no_duplicates_data <- unique(example_data)

# Display the dataset after removing duplicates
print("Dataset after Removing Duplicates:")
print(no_duplicates_data)
Output:
[1] "Dataset after Removing Duplicates:"
  ID    Name Age Salary
1  1   Alice  25  50000
2  2     Bob  30  60000
3  3 Charlie  35  70000
4  4   David  22  45000
5  5     Eve  28  55000
7  6   Frank  40  80000
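Notice that unique() keeps the original row names, which is why Frank is still labelled 7 in the output. A short sketch, assuming we want consecutive row names and a quick sanity check after the removal, could look like this; `rows_removed` is an illustrative name, not part of the original example.

```r
# Recreate the sample dataset so this block runs on its own
example_data <- data.frame(
  ID = c(1, 2, 3, 4, 5, 1, 6, 2),
  Name = c("Alice", "Bob", "Charlie", "David", "Eve", "Alice", "Frank", "Bob"),
  Age = c(25, 30, 35, 22, 28, 25, 40, 30),
  Salary = c(50000, 60000, 70000, 45000, 55000, 50000, 80000, 60000)
)

# Drop rows that are exact copies of an earlier row
no_duplicates_data <- unique(example_data)

# Reset the row names so they run consecutively from 1
rownames(no_duplicates_data) <- NULL

# Check how many rows were dropped and that no duplicates remain
rows_removed <- nrow(example_data) - nrow(no_duplicates_data)
print(rows_removed)
print(anyDuplicated(no_duplicates_data))
print(no_duplicates_data)
```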
Aggregating Duplicate Data
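If the repeated rows represent separate records, for example values recorded over different periods, we can merge them instead of deleting them. The code below groups the rows and sums the salary within each group. In our example the repeated rows are exact copies, so the sum doubles the salaries of Alice and Bob; summing is only appropriate when each row is a genuinely separate record.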
R
# Aggregate data by summing Salary for each unique combination of ID, Name, and Age
aggregated_data <- aggregate(Salary ~ ID + Name + Age, data = example_data, sum)

# Display the aggregated dataset
print("Aggregated Dataset:")
print(aggregated_data)
Output:
[1] "Aggregated Dataset:"
  ID    Name Age Salary
1  4   David  22  45000
2  1   Alice  25 100000
3  5     Eve  28  55000
4  2     Bob  30 120000
5  3 Charlie  35  70000
6  6   Frank  40  80000
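When the repeated rows are copies of the same record rather than separate records, a different summary function avoids inflating the totals. The sketch below is one possible variation, not the only way to do it: it uses mean to collapse the copies and length to count how many copies each group had. The names `collapsed_data`, `copy_counts`, and the `Copies` column are illustrative.

```r
# Recreate the sample dataset so this block runs on its own
example_data <- data.frame(
  ID = c(1, 2, 3, 4, 5, 1, 6, 2),
  Name = c("Alice", "Bob", "Charlie", "David", "Eve", "Alice", "Frank", "Bob"),
  Age = c(25, 30, 35, 22, 28, 25, 40, 30),
  Salary = c(50000, 60000, 70000, 45000, 55000, 50000, 80000, 60000)
)

# Collapse each group to a single row using the mean salary,
# which leaves the salary of exact copies unchanged
collapsed_data <- aggregate(Salary ~ ID + Name + Age, data = example_data, FUN = mean)

# Count how many rows each group contained
copy_counts <- aggregate(Salary ~ ID + Name + Age, data = example_data, FUN = length)
names(copy_counts)[names(copy_counts) == "Salary"] <- "Copies"

print(collapsed_data)
print(copy_counts)
```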
Data Matching
Data matching is used when we want to keep just one of the duplicated records, such as the earliest one. The !duplicated() condition keeps only the first occurrence of each unique combination of the selected columns, so the order of the rows determines which record survives.
R
# Keep only the first occurrence of each unique combination of ID and Name
matched_data <- example_data[!duplicated(example_data[c("ID", "Name")]), ]

# Display the dataset after matching duplicates
print("Dataset after Matching Duplicates:")
print(matched_data)
Output:
[1] "Dataset after Matching Duplicates:"
  ID    Name Age Salary
1  1   Alice  25  50000
2  2     Bob  30  60000
3  3 Charlie  35  70000
4  4   David  22  45000
5  5     Eve  28  55000
7  6   Frank  40  80000
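If the latest record is the one worth keeping, the same approach can scan from the end of the data frame instead. The following sketch assumes that later rows are more recent, which the example dataset does not actually state; `last_match` is an illustrative name.

```r
# Recreate the sample dataset so this block runs on its own
example_data <- data.frame(
  ID = c(1, 2, 3, 4, 5, 1, 6, 2),
  Name = c("Alice", "Bob", "Charlie", "David", "Eve", "Alice", "Frank", "Bob"),
  Age = c(25, 30, 35, 22, 28, 25, 40, 30),
  Salary = c(50000, 60000, 70000, 45000, 55000, 50000, 80000, 60000)
)

# Keep only the last occurrence of each ID and Name combination;
# fromLast = TRUE marks the earlier copies as duplicates instead of the later ones
last_match <- example_data[!duplicated(example_data[c("ID", "Name")], fromLast = TRUE), ]

print(last_match)
```

Here the copies are identical, so both variants keep the same values and only the retained row names differ. The choice matters when the copies disagree in columns outside the comparison, such as an updated salary.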
Conclusion
In this article, we learned how to deal with missing, invalid, and duplicate data in the R programming language with the help of different examples. We also visualized the original and cleaned datasets to understand the difference between them.
Coping with Missing, Invalid and Duplicate Data in R
Data is the foundation of statistical analysis and machine learning. Freely available data is often raw and has many issues, such as invalid entries and missing or duplicate values, that can significantly affect model fitting and estimation.
We use past data to train a model and predict new values from it. Invalid data and missing values lower the accuracy of prediction models, so handling these problems is an important part of data processing. In this article, we will learn how to cope with missing, invalid, and duplicate data in the R programming language.