Data Manipulation in R with the dplyr Package

This article discusses how to manipulate data in the R programming language.

The dplyr package provides a set of functions for manipulating data frames. dplyr is an add-on package rather than part of base R, so it must be installed once with install.packages("dplyr") and then loaded in each session with library(dplyr) before its functions can be used. The table below lists a few of the data manipulation functions in the dplyr package.

Function Name	Description
filter()	Keeps the rows of a data frame that satisfy a condition.
distinct()	Removes duplicate rows from a data frame.
arrange()	Reorders the rows of a data frame.
select()	Keeps only the specified columns of a data frame.
rename()	Renames columns.
mutate()	Creates new variables and keeps the existing ones.
transmute()	Creates new variables and drops the existing ones.
summarize()	Reduces the data to summary values such as the mean or sum.

The filter() function returns the subset of rows that satisfy the condition passed to it. The condition can use comparison operators (>, ==), logical operators (&, |, !), missing-value tests such as is.na(), and range checks such as between(). Rows for which the condition evaluates to NA are dropped. The syntax of filter() is given below:

filter(dataframeName, condition)

Example:

The code below uses filter() to fetch the players who scored more than 100 runs from the “stats” data frame. Player A, with exactly 100 runs, does not satisfy runs > 100 and is excluded.

R

# import dplyr package
library(dplyr)
 
# create a data frame 
stats <- data.frame(player=c('A', 'B', 'C', 'D'),
                runs=c(100, 200, 408, 19),
                wickets=c(17, 20, NA, 5))
 
# fetch players who scored more 
# than 100 runs
filter(stats, runs>100)

Output

  player runs wickets
1      B  200      20
2      C  408      NA
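
The kinds of condition mentioned above can be sketched as follows. This is a minimal illustration that reuses the same “stats” data frame; the result names (both, missing_wickets, mid_range) are made up for the example.

```r
library(dplyr)

stats <- data.frame(player = c("A", "B", "C", "D"),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, NA, 5))

# comma-separated conditions are combined with AND;
# player C is dropped because NA > 18 evaluates to NA
both <- filter(stats, runs > 50, wickets > 18)

# is.na() selects the rows with a missing value
missing_wickets <- filter(stats, is.na(wickets))

# between() is inclusive at both ends
mid_range <- filter(stats, between(runs, 100, 200))
```

Here both contains only player B, missing_wickets contains only player C, and mid_range contains players A and B.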

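The distinct() function removes duplicate rows from a data frame. With no columns specified it compares entire rows; when columns are specified it keeps the first row for each unique combination of those columns. By default only the specified columns are returned; set .keep_all = TRUE to keep the other columns as well. The syntax of distinct() is given below: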

distinct(dataframeName, col1, col2, ..., .keep_all = TRUE)

Example:

This example uses distinct() first to remove fully duplicated rows from the data frame, and then to remove duplicates based on the player column only.

R

# import dplyr package
library(dplyr)
 
# create a data frame 
stats <- data.frame(player=c('A', 'B', 'C', 'D', 'A', 'A'),
                runs=c(100, 200, 408, 19, 56, 100),
                wickets=c(17, 20, NA, 5, 2, 17))
 
# removes duplicate rows
distinct(stats)
 
# remove duplicates based on the player column,
# keeping the first row found for each player
distinct(stats, player, .keep_all = TRUE)

Output

  player runs wickets
1      A  100      17
2      B  200      20
3      C  408      NA
4      D   19       5
5      A   56       2
  player runs wickets
1      A  100      17
2      B  200      20
3      C  408      NA
4      D   19       5
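
To see what .keep_all changes, the sketch below calls distinct() without it, on the same six-row data frame. The result names (players_only, pairs) are made up for the example.

```r
library(dplyr)

stats <- data.frame(player = c("A", "B", "C", "D", "A", "A"),
                    runs = c(100, 200, 408, 19, 56, 100),
                    wickets = c(17, 20, NA, 5, 2, 17))

# without .keep_all only the named column is returned
players_only <- distinct(stats, player)

# uniqueness is judged on the combination of player and runs
pairs <- distinct(stats, player, runs)
```

players_only has four rows and a single player column, while pairs has five rows because (A, 100) and (A, 56) are different combinations.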

The arrange() function orders the rows of a data frame by the values of one or more columns. The order is ascending by default; wrap a column in desc() to sort it in descending order. The syntax of arrange() is given below:

arrange(dataframeName, columnName)

Example:

The code below uses arrange() to order the data by runs, from low to high.

R

# import dplyr package
library(dplyr)
 
# create a data frame 
stats <- data.frame(player=c('A', 'B', 'C', 'D'),
                runs=c(100, 200, 408, 19),
                wickets=c(17, 20, NA, 5))
 
# ordered data based on runs
arrange(stats, runs)

Output

  player runs wickets
1      D   19       5
2      A  100      17
3      B  200      20
4      C  408      NA
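
Descending order and the handling of missing values can be sketched in the same way. The result names below are made up for the example.

```r
library(dplyr)

stats <- data.frame(player = c("A", "B", "C", "D"),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, NA, 5))

# highest run scorer first
by_runs_desc <- arrange(stats, desc(runs))

# missing values are placed at the end ...
by_wickets <- arrange(stats, wickets)

# ... and stay at the end in descending order too
by_wickets_desc <- arrange(stats, desc(wickets))
```

Player C, whose wickets value is NA, is the last row of both by_wickets and by_wickets_desc.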

The select() function extracts the required columns from a data frame; the columns are specified by name. The syntax of select() is given below:

select(dataframeName, col1, col2, …)

Example:

The code below uses select() to fetch only the player and wickets columns.

R

# import dplyr package
library(dplyr)
 
# create a data frame 
stats <- data.frame(player=c('A', 'B', 'C', 'D'),
                runs=c(100, 200, 408, 19),
                wickets=c(17, 20, NA, 5))
 
# fetch required column data
select(stats, player,wickets)

Output

  player wickets
1      A      17
2      B      20
3      C      NA
4      D       5
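
Besides listing names one by one, select() also accepts a few shorthand forms. The sketch below shows three of them on the same data frame; the result names are made up for the example.

```r
library(dplyr)

stats <- data.frame(player = c("A", "B", "C", "D"),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, NA, 5))

# a minus sign drops a column and keeps the rest
no_runs <- select(stats, -runs)

# a colon selects a range of adjacent columns
first_two <- select(stats, player:runs)

# starts_with() matches columns by name prefix
w_cols <- select(stats, starts_with("w"))
```

no_runs holds player and wickets, first_two holds player and runs, and w_cols holds only wickets.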

The rename() function changes column names. The new name is written on the left of the equals sign and the existing name on the right, as in the syntax below:

rename(dataframeName, newName=oldName)

Example:

This example changes the column name “runs” to “runs_scored” in the stats data frame.

R

# import dplyr package
library(dplyr)
 
# create a data frame 
stats <- data.frame(player=c('A', 'B', 'C', 'D'),
                runs=c(100, 200, 408, 19),
                wickets=c(17, 20, NA, 5))
 
# renaming the column
rename(stats, runs_scored=runs)

Output

  player runs_scored wickets
1      A         100      17
2      B         200      20
3      C         408      NA
4      D          19       5
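
Like the other dplyr functions shown here, rename() returns a new data frame and leaves its input untouched, so the result has to be assigned if it is needed later. A short sketch, in which the names renamed and wickets_taken are made up for the example:

```r
library(dplyr)

stats <- data.frame(player = c("A", "B", "C", "D"),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, NA, 5))

# several columns can be renamed in one call
renamed <- rename(stats, runs_scored = runs, wickets_taken = wickets)

names(stats)    # the original data frame still has its old names
names(renamed)  # the new names live in the assigned result
```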

The mutate() and transmute() functions create new variables. mutate() adds the new variables and keeps all existing ones, whereas transmute() returns only the new variables and drops the rest. The syntax of both functions is given below:

mutate(dataframeName, newVariable=formula)

transmute(dataframeName, newVariable=formula)

Example:

This example creates a new column avg, computed as runs divided by 4, first with mutate() and then with transmute().

R

# import dplyr package
library(dplyr)
 
# create a data frame 
stats <- data.frame(player=c('A', 'B', 'C', 'D'),
                runs=c(100, 200, 408, 19),
                wickets=c(17, 20, 7, 5))
 
# add new column avg
mutate(stats, avg=runs/4)
 
# create avg and drop all other columns
transmute(stats, avg=runs/4)

Output

  player runs wickets    avg
1      A  100      17  25.00
2      B  200      20  50.00
3      C  408       7 102.00
4      D   19       5   4.75
     avg
1  25.00
2  50.00
3 102.00
4   4.75

Here mutate() adds the new column to the existing data frame without dropping the old ones, whereas transmute() creates the new variable but drops all the old columns.
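
Two further points can be sketched briefly: a column created in a mutate() call can be used later in the same call, and mutate() can itself drop the old columns through its .keep argument. The sketch assumes dplyr 1.0.0 or later, where .keep is available; the column name double_avg and the result names are made up for the example.

```r
library(dplyr)

stats <- data.frame(player = c("A", "B", "C", "D"),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, 7, 5))

# double_avg refers to avg, which is created earlier in the same call
with_rates <- mutate(stats, avg = runs / 4, double_avg = avg * 2)

# .keep = "none" keeps only the newly created column
# (assumes dplyr >= 1.0.0)
only_avg <- mutate(stats, avg = runs / 4, .keep = "none")
```

On this ungrouped data frame, only_avg has the same single avg column that transmute() produced above.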

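The summarize() function collapses a data frame into a single row of summary values computed with aggregate functions such as sum() and mean(). If a summary is not given a name, the column is named after the expression itself, as in the output below. The syntax of summarize() is given below: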

summarize(dataframeName, aggregate_function(columnName))

Example:

The code below uses summarize() to compute the total and the mean of the runs column.

R

# import dplyr package
library(dplyr)
 
# create a data frame 
stats <- data.frame(player=c('A', 'B', 'C', 'D'),
                runs=c(100, 200, 408, 19),
                wickets=c(17, 20, 7, 5))
 
# summarize method
summarize(stats, sum(runs), mean(runs))

Output

  sum(runs) mean(runs)
1       727     181.75
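
Naming the summaries gives tidier column names than sum(runs) and mean(runs). The sketch below also shows na.rm = TRUE, which is needed when a column contains NA, using the earlier version of the data frame in which player C has no wickets value. The names total_runs, mean_wickets and players are made up for the example.

```r
library(dplyr)

stats <- data.frame(player = c("A", "B", "C", "D"),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, NA, 5))

totals <- summarize(stats,
                    total_runs = sum(runs),
                    # without na.rm = TRUE the mean would be NA
                    mean_wickets = mean(wickets, na.rm = TRUE),
                    # n() counts the rows
                    players = n())
```

totals is a one-row data frame with total_runs equal to 727, mean_wickets equal to 14 (the mean of 17, 20 and 5) and players equal to 4.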
