Data Manipulation in R with Dplyr Package
This article discusses how to manipulate data in the R programming language.
The dplyr package offers many built-in functions for data manipulation. It is an add-on package, so it must be installed once with install.packages("dplyr") and then loaded in each session with library(dplyr) before its functions can be used. Below is a list of a few of the data manipulation functions in the dplyr package.
Function Name | Description |
---|---|
filter() | Produces a subset of the rows of a data frame. |
distinct() | Removes duplicate rows in a data frame. |
arrange() | Reorders the rows of a data frame. |
select() | Keeps only the required columns of a data frame. |
rename() | Renames columns (variables). |
mutate() | Creates new variables without dropping the old ones. |
transmute() | Creates new variables and drops the old ones. |
summarize() | Gives summarized data such as average, sum, etc. |
filter() method
The filter() function returns the subset of rows that satisfy the condition passed to it. The condition can use comparison operators, logical operators, NA checks, range checks, and so on. The syntax of filter() is given below:
filter(dataframeName, condition)
Example:
In the code below, we use filter() to fetch the players who scored more than 100 runs from the “stats” data frame.
R
# import dplyr package
library(dplyr)

# create a data frame
stats <- data.frame(player = c('A', 'B', 'C', 'D'),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, NA, 5))

# fetch players who scored more than 100 runs
filter(stats, runs > 100)
Output
  player runs wickets
1      B  200      20
2      C  408      NA
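The description of filter() mentions logical operators, NA values, and ranges, but the example uses a single comparison. The following is a minimal sketch of those other condition types, assuming the same stats data frame; the result names (high_both, either, missing_wickets, mid_runs) are only illustrative.

```r
library(dplyr)

stats <- data.frame(player = c('A', 'B', 'C', 'D'),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, NA, 5))

# conditions separated by commas are combined with AND;
# rows where a condition evaluates to NA are dropped
high_both <- filter(stats, runs > 100, wickets >= 20)

# logical OR between two conditions
either <- filter(stats, runs > 300 | wickets < 10)

# keep only rows where wickets is missing
missing_wickets <- filter(stats, is.na(wickets))

# range check: between() includes both endpoints
mid_runs <- filter(stats, between(runs, 50, 250))
```

Note that player C is excluded from high_both because the comparison wickets >= 20 yields NA for that row, and filter() keeps only rows where the condition is TRUE.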
distinct() method
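The distinct() function removes duplicate rows from a data frame, comparing either entire rows or only the specified columns. Setting .keep_all = TRUE keeps all the other columns of the first row found for each distinct combination. The syntax of distinct() is given below: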
distinct(dataframeName, col1, col2,.., .keep_all=TRUE)
Example:
In this example, we use distinct() first to remove fully duplicated rows from the data frame and then to remove duplicates based on the player column.
R
# import dplyr package
library(dplyr)

# create a data frame
stats <- data.frame(player = c('A', 'B', 'C', 'D', 'A', 'A'),
                    runs = c(100, 200, 408, 19, 56, 100),
                    wickets = c(17, 20, NA, 5, 2, 17))

# remove duplicate rows
distinct(stats)

# remove duplicates based on a column
distinct(stats, player, .keep_all = TRUE)
Output
  player runs wickets
1      A  100      17
2      B  200      20
3      C  408      NA
4      D   19       5
5      A   56       2

  player runs wickets
1      A  100      17
2      B  200      20
3      C  408      NA
4      D   19       5
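To see what the .keep_all argument changes, it can help to call distinct() without it. The sketch below assumes the same six-row stats data frame; the names players_only and player_runs are illustrative.

```r
library(dplyr)

stats <- data.frame(player = c('A', 'B', 'C', 'D', 'A', 'A'),
                    runs = c(100, 200, 408, 19, 56, 100),
                    wickets = c(17, 20, NA, 5, 2, 17))

# without .keep_all, only the listed column is returned
players_only <- distinct(stats, player)

# distinct combinations of two columns; wickets is dropped
player_runs <- distinct(stats, player, runs)
```

Here players_only has one column and four rows, while player_runs has five rows because player A appears with two different run totals.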
arrange() method
In R, the arrange() function orders the rows of a data frame by the values of a specified column, in ascending order by default. The syntax of arrange() is given below:
arrange(dataframeName, columnName)
Example:
In the code below, we order the data by runs from low to high using arrange().
R
# import dplyr package
library(dplyr)

# create a data frame
stats <- data.frame(player = c('A', 'B', 'C', 'D'),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, NA, 5))

# order the data based on runs
arrange(stats, runs)
Output
  player runs wickets
1      D   19       5
2      A  100      17
3      B  200      20
4      C  408      NA
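The example above sorts from low to high. As a small sketch of the reverse order, dplyr's desc() helper can wrap the column name; the code also shows where a missing value ends up. It assumes the same stats data frame, and the result names are illustrative.

```r
library(dplyr)

stats <- data.frame(player = c('A', 'B', 'C', 'D'),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, NA, 5))

# sort from high to low with desc()
by_runs_desc <- arrange(stats, desc(runs))

# missing values are placed at the end of the result
by_wickets <- arrange(stats, wickets)

# NA stays at the end even when sorting in descending order
by_wickets_desc <- arrange(stats, desc(wickets))
```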
select() method
The select() function extracts the required columns as a new data frame; the columns are chosen by listing their names. The syntax of select() is given below:
select(dataframeName, col1,col2,…)
Example:
In the code below, we fetch only the player and wickets columns using select().
R
# import dplyr package
library(dplyr)

# create a data frame
stats <- data.frame(player = c('A', 'B', 'C', 'D'),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, NA, 5))

# fetch the required columns
select(stats, player, wickets)
Output
  player wickets
1      A      17
2      B      20
3      C      NA
4      D       5
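Besides listing column names one by one, select() accepts a few other ways of describing columns. The following sketch, assuming the same stats data frame, shows three of them; the result names are illustrative.

```r
library(dplyr)

stats <- data.frame(player = c('A', 'B', 'C', 'D'),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, NA, 5))

# drop a column by prefixing its name with a minus sign
no_runs <- select(stats, -runs)

# select a range of adjacent columns with the colon operator
first_two <- select(stats, player:runs)

# select columns whose names start with a given prefix
w_cols <- select(stats, starts_with("w"))
```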
rename() method
The rename() function changes column names. The new name goes on the left of the equals sign and the existing name on the right, as in the syntax below:
rename(dataframeName, newName=oldName)
Example:
In this example, we change the column name “runs” to “runs_scored” in the stats data frame.
R
# import dplyr package
library(dplyr)

# create a data frame
stats <- data.frame(player = c('A', 'B', 'C', 'D'),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, NA, 5))

# rename the column
rename(stats, runs_scored = runs)
Output
  player runs_scored wickets
1      A         100      17
2      B         200      20
3      C         408      NA
4      D          19       5
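Two details are easy to miss here: rename() can change several columns in one call, and it returns a new data frame rather than modifying the original. A minimal sketch, assuming the same stats data frame; the names renamed and wickets_taken are illustrative.

```r
library(dplyr)

stats <- data.frame(player = c('A', 'B', 'C', 'D'),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, NA, 5))

# rename two columns at once; assign the result to keep it
renamed <- rename(stats, runs_scored = runs, wickets_taken = wickets)

# the original data frame still has its old column names
names(stats)
names(renamed)
```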
mutate() & transmute() methods
Both functions create new variables. mutate() adds the new variables while keeping the existing ones, whereas transmute() keeps only the new variables and drops the rest. The syntax of both functions is given below:
mutate(dataframeName, newVariable=formula)
transmute(dataframeName, newVariable=formula)
Example:
In this example, we create a new column avg using mutate() and then transmute().
R
# import dplyr package
library(dplyr)

# create a data frame
stats <- data.frame(player = c('A', 'B', 'C', 'D'),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, 7, 5))

# add a new column avg
mutate(stats, avg = runs / 4)

# drop all existing columns and create a new column
transmute(stats, avg = runs / 4)
Output
  player runs wickets    avg
1      A  100      17  25.00
2      B  200      20  50.00
3      C  408       7 102.00
4      D   19       5   4.75

     avg
1  25.00
2  50.00
3 102.00
4   4.75
Here, mutate() adds a new column to the existing data frame without dropping the old ones, whereas transmute() creates the new variable but drops all the old columns.
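As a further sketch under the same assumptions, mutate() can create several columns in one call, and a later column may refer to one created earlier in that call. With transmute(), an existing column can be kept by naming it explicitly. The column above_50 and the result names are illustrative.

```r
library(dplyr)

stats <- data.frame(player = c('A', 'B', 'C', 'D'),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, 7, 5))

# above_50 is computed from avg, which is created in the same call
with_flag <- mutate(stats, avg = runs / 4, above_50 = avg > 50)

# naming player keeps it alongside the new avg column
player_avg <- transmute(stats, player, avg = runs / 4)
```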
summarize() method
The summarize() function condenses the data in a data frame into summary values by applying aggregate functions such as sum() and mean(). The syntax of summarize() is given below:
summarize(dataframeName, aggregate_function(columnName))
Example:
In the code below, we compute the sum and the mean of the runs column using summarize().
R
# import dplyr package
library(dplyr)

# create a data frame
stats <- data.frame(player = c('A', 'B', 'C', 'D'),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, 7, 5))

# summarize the runs column
summarize(stats, sum(runs), mean(runs))
Output
  sum(runs) mean(runs)
1       727     181.75
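The output columns above take their names from the expressions themselves. As a minimal sketch, the summaries can be given explicit names, and aggregate functions such as mean() accept na.rm = TRUE to skip missing values. This version assumes a stats data frame with an NA in wickets, as in the earlier sections; the summary names are illustrative.

```r
library(dplyr)

stats <- data.frame(player = c('A', 'B', 'C', 'D'),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, NA, 5))

named <- summarize(stats,
                   total_runs = sum(runs),
                   mean_runs = mean(runs),
                   # na.rm = TRUE ignores the missing value
                   mean_wickets = mean(wickets, na.rm = TRUE),
                   # without na.rm the result is NA
                   wickets_with_na = mean(wickets))
```

Naming the summaries makes the result easier to use in later steps, since a column called total_runs is simpler to refer to than one called sum(runs).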