Identify and Remove Duplicate Data in R
A dataset can contain duplicate values, and keeping it accurate and free of redundancy means identifying and removing duplicate rows. In this article, we will see how to identify and remove duplicate data in R. First we check whether duplicates are present, and if they are, we remove them.
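Identifying Duplicate Data in a vector
We can use the duplicated() function to flag every element of a vector that repeats an earlier element. It returns a logical vector, so passing the result to sum() gives the number of duplicates.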
R
# Create a sample vector with duplicate elements
vector_data <- c(1, 2, 3, 4, 4, 5)

# Identify duplicate elements
duplicated(vector_data)

# Count of duplicated data
sum(duplicated(vector_data))
Output:
[1] FALSE FALSE FALSE FALSE  TRUE FALSE
[1] 1
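Because duplicated() marks only the second and later occurrences, it can help to also look at where the duplicates sit and which values are repeated. The following is a minimal sketch using the same vector; the variable names dup_positions, dup_values and all_copies are chosen here for illustration only.

```r
# Sample vector with a repeated value (same data as above)
vector_data <- c(1, 2, 3, 4, 4, 5)

# Positions of elements that repeat an earlier element
dup_positions <- which(duplicated(vector_data))
dup_positions

# The repeated values themselves, each listed once
dup_values <- unique(vector_data[duplicated(vector_data)])
dup_values

# Flag every copy of a repeated value, including its first occurrence,
# by combining a forward scan with a backward scan (fromLast = TRUE)
all_copies <- duplicated(vector_data) | duplicated(vector_data, fromLast = TRUE)
all_copies
```

Here the fifth element is the only repeat, so dup_positions is 5, dup_values is 4, and all_copies is TRUE at positions 4 and 5.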
Removing Duplicate Data in a vector
We can remove duplicate data from a vector with the unique() function, which returns each value only once, in order of first appearance.
R
# Create a sample vector with duplicate elements
vector_data <- c(1, 2, 3, 4, 4, 5)

# Remove duplicate elements
unique(vector_data)
Output:
[1] 1 2 3 4 5
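For vectors, unique() gives the same result as dropping the elements that duplicated() flags. The sketch below checks this on a hypothetical unsorted vector (not from the examples above) to show that the first occurrence of each value is kept and the original order is preserved.

```r
# Hypothetical vector with repeats in unsorted order
vector_data <- c(3, 1, 3, 2, 1)

# Remove duplicates with unique()
kept_unique <- unique(vector_data)
kept_unique

# Remove duplicates by filtering out the flagged elements
kept_filter <- vector_data[!duplicated(vector_data)]
kept_filter

# Both approaches return the same vector
identical(kept_unique, kept_filter)
```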
Identifying Duplicate Data in a data frame
For identification, we will use the duplicated() function, which returns a logical vector marking each row that repeats an earlier row. Summing that vector gives the count of duplicate rows.
Syntax:
duplicated(dataframe)
Approach:
- Create data frame
- Pass it to duplicated() function
- The function returns a logical vector in which TRUE marks a row that duplicates an earlier row
- Apply sum() to the result to get the number of duplicate rows
Data in use:
    name maths science history
1    Ram     7       5       7
2  Geeta     8       7       7
3   John     8       6       7
4   Paul     9       8       7
5 Cassie    10       9       7
6  Geeta     8       7       7
7   Paul     9       8       7
Example:
R
# Creating a sample data frame of students
# and their marks in respective subjects.
student_result <- data.frame(
  name = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
  maths = c(7, 8, 8, 9, 10, 8, 9),
  science = c(5, 7, 6, 8, 9, 7, 8),
  history = c(7, 7, 7, 7, 7, 7, 7)
)

# Printing data
student_result

# Flag duplicate rows, then count them
duplicated(student_result)
sum(duplicated(student_result))
Output:
duplicated(student_result)
[1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
sum(duplicated(student_result))
[1] 2
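The logical vector can also be used as a row index to inspect the duplicated rows before removing anything. This is a small sketch along those lines; the names repeated_rows, is_copy and all_copies are illustrative and not part of the original example.

```r
# Same sample data as above
student_result <- data.frame(
  name = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
  maths = c(7, 8, 8, 9, 10, 8, 9),
  science = c(5, 7, 6, 8, 9, 7, 8),
  history = c(7, 7, 7, 7, 7, 7, 7)
)

# Rows that repeat an earlier row
repeated_rows <- student_result[duplicated(student_result), ]
repeated_rows

# Every copy of a repeated row, including the first occurrence
is_copy <- duplicated(student_result) |
  duplicated(student_result, fromLast = TRUE)
all_copies <- student_result[is_copy, ]
all_copies
```

With this data, repeated_rows holds rows 6 and 7, while all_copies also includes their first occurrences in rows 2 and 4.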
Removing Duplicate Data in a data frame
Approach
- Create data frame
- Select rows which are unique
- Retrieve those rows
- Display result
Method 1: Using unique()
We use unique() to keep one copy of each row in our data; rows that repeat an earlier row are dropped.
Syntax:
unique(dataframe)
Example:
R
# Creating a sample data frame of students
# and their marks in respective subjects.
student_result <- data.frame(
  name = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
  maths = c(7, 8, 8, 9, 10, 8, 9),
  science = c(5, 7, 6, 8, 9, 7, 8),
  history = c(7, 7, 7, 7, 7, 7, 7)
)

# Printing data
student_result

# Keep only the unique rows
unique(student_result)
Output:
    name maths science history
1    Ram     7       5       7
2  Geeta     8       7       7
3   John     8       6       7
4   Paul     9       8       7
5 Cassie    10       9       7
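unique() compares whole rows. In base R, the same result can be reached by filtering with !duplicated(), and that form also allows the comparison to be limited to selected columns. The sketch below shows both; the names no_dups and first_per_maths are illustrative.

```r
# Same sample data as above
student_result <- data.frame(
  name = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
  maths = c(7, 8, 8, 9, 10, 8, 9),
  science = c(5, 7, 6, 8, 9, 7, 8),
  history = c(7, 7, 7, 7, 7, 7, 7)
)

# Keep rows that are not flagged as repeats of an earlier row
no_dups <- student_result[!duplicated(student_result), ]
no_dups

# Compare on one column only: keep the first row for each maths score
first_per_maths <- student_result[!duplicated(student_result$maths), ]
first_per_maths
```

Note that base R subsetting keeps the original row names, so first_per_maths is labelled 1, 2, 4 and 5 rather than being renumbered.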
Method 2: Using distinct()
The distinct() function comes from the “dplyr” package, which is part of the “tidyverse”. Install the package and load either “dplyr” or “tidyverse” before calling it. We use distinct() to get rows having distinct values in our data.
Syntax:
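distinct(dataframe, ..., .keep_all = FALSE)
Parameter:
- dataframe: data in use
- ...: optional columns used to decide whether rows are distinct; if omitted, all columns are compared
- .keep_all: if TRUE, all columns of the data frame are kept; if FALSE (the default), only the columns used for the comparison are kept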
Example:
R
# load library
library(tidyverse)

# Creating a sample data frame of students and
# their marks in respective subjects.
student_result <- data.frame(
  name = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
  maths = c(7, 8, 8, 9, 10, 8, 9),
  science = c(5, 7, 6, 8, 9, 7, 8),
  history = c(7, 7, 7, 7, 7, 7, 7)
)

# Printing data
student_result

# Keep only the distinct rows
distinct(student_result)
Output:
    name maths science history
1    Ram     7       5       7
2  Geeta     8       7       7
3   John     8       6       7
4   Paul     9       8       7
5 Cassie    10       9       7
Example 2: Printing rows that are unique in terms of the maths column
R
# load library (distinct() comes from dplyr)
library(dplyr)

# Creating a sample data frame of students and
# their marks in respective subjects.
student_result <- data.frame(
  name = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
  maths = c(7, 8, 8, 9, 10, 8, 9),
  science = c(5, 7, 6, 8, 9, 7, 8),
  history = c(7, 7, 7, 7, 7, 7, 7)
)

# Printing data
student_result

# Keep the first row for each maths score, with all columns
distinct(student_result, maths, .keep_all = TRUE)
Output:
    name maths science history
1    Ram     7       5       7
2  Geeta     8       7       7
3   Paul     9       8       7
4 Cassie    10       9       7
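To see what .keep_all changes, it may help to run distinct() without it. The sketch below assumes dplyr is installed; the names maths_only and name_maths are illustrative. With the default .keep_all = FALSE, only the columns named in the call are returned, and several columns can be combined to define what counts as a duplicate.

```r
library(dplyr)

# Same sample data as above
student_result <- data.frame(
  name = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
  maths = c(7, 8, 8, 9, 10, 8, 9),
  science = c(5, 7, 6, 8, 9, 7, 8),
  history = c(7, 7, 7, 7, 7, 7, 7)
)

# Default (.keep_all = FALSE): only the maths column is returned
maths_only <- distinct(student_result, maths)
maths_only

# Distinct combinations of name and maths; other columns are dropped
name_maths <- distinct(student_result, name, maths)
name_maths
```

In this data, John and Geeta share a maths score of 8, so John disappears when only maths is compared but is kept when name and maths are compared together.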