What is missing data?

Missing data are values that were not recorded for some observations in a dataset; in R they are represented by NA. Many statistical and machine learning models cannot handle such values, so they must be dealt with before modelling. The first step is to identify them.

Identifying Missing Data

Before dealing with missing data, we must locate it in the dataset. The is.na() function returns a logical object of the same shape as its input, with TRUE wherever a value is missing.

R




# Check for missing values in a data frame
missing_values <- is.na(my_data)
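
The logical matrix returned by is.na() is hard to read on anything larger than a tiny data frame, so it is usually summarised. The following is a minimal sketch using base R only; it recreates the sample data frame used later in this article so that it runs on its own, and the variable names are chosen purely for illustration.

```r
# Sample data frame (same values as the example used later in this article)
my_data <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Age = c(25, NA, 30, 22, 35),
  Salary = c(50000, 60000, NA, 45000, 70000)
)

# TRUE if the data frame contains at least one NA
has_missing <- anyNA(my_data)

# Number of missing values in each column
na_per_column <- colSums(is.na(my_data))

# Row numbers that contain at least one missing value
rows_with_na <- which(!complete.cases(my_data))

print(has_missing)
print(na_per_column)
print(rows_with_na)
```

Counting per column shows where the gaps are concentrated, which helps in choosing between deletion and imputation.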


Once the missing data is identified we need to handle it.

How to Deal with Missing Data?

There are several ways to deal with missing values, such as deleting the rows or columns that contain them, or imputing replacement values from the rest of the dataset. These approaches are discussed below with examples.

Deleting missing rows and columns

We can delete the rows or columns that have missing values. For rows, we can use the na.omit() function or index the data frame with complete.cases(); both drop every row that contains at least one NA.

R




# Remove rows with missing values
clean_data <- my_data[complete.cases(my_data), ]


We can also understand this with the help of an example. In this example, we will create a fictional dataset of ID, Age, and Salary and deal with the missing values in it.

R




# Creating a sample data frame with missing values
my_data <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Age = c(25, NA, 30, 22, 35),
  Salary = c(50000, 60000, NA, 45000, 70000)
)
 
# Display the original data frame
print("Original Data Frame:")
print(my_data)
 
# Applying na.omit() to remove rows with missing values
clean_data <- na.omit(my_data)
 
# Display the data frame after omitting missing values
print("Data Frame after using na.omit():")
print(clean_data)


Output:

[1] "Original Data Frame:"
  ID Age Salary
1  1  25  50000
2  2  NA  60000
3  3  30     NA
4  4  22  45000
5  5  35  70000

[1] "Data Frame after using na.omit():"
  ID Age Salary
1  1  25  50000
4  4  22  45000
5  5  35  70000

Rows 2 and 3, which each contained an NA, are removed from the output, so the missing values cannot distort later predictions. The columns are unchanged, and the remaining rows keep their original row names (1, 4, 5).
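
na.omit() and complete.cases() work row by row. Deleting columns, or deleting rows based on a single column only, needs ordinary indexing. The sketch below shows one way to do both in base R, assuming the same sample data frame; the names complete_columns and age_known are illustrative.

```r
# Sample data frame with missing values
my_data <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Age = c(25, NA, 30, 22, 35),
  Salary = c(50000, 60000, NA, 45000, 70000)
)

# Keep only the columns that contain no missing values
# (drop = FALSE keeps the result a data frame even if one column remains)
complete_columns <- my_data[, colSums(is.na(my_data)) == 0, drop = FALSE]

# Drop a row only when Age is missing; a missing Salary is tolerated
age_known <- my_data[!is.na(my_data$Age), ]

print(complete_columns)
print(age_known)
```

In this small example, deleting columns leaves only ID, which shows how much information column-wise deletion can discard when several columns each have a few gaps.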

Imputing the missing values

We can also replace the missing values with substitute values. In statistics, imputation is the technique of filling in missing values with substitutes, commonly the mean, median, or mode of the observed values in the same column. The na.aggregate() function from the zoo package can be used for this:

R




# Impute missing values with the column mean
# (na.aggregate() comes from the zoo package)
library(zoo)
my_data_imputed <- na.aggregate(my_data, FUN = mean)


We can understand this with the help of the same example mentioned above.

R




# Creating a sample data frame with missing values
my_data <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Age = c(25, NA, 30, 22, 35),
  Salary = c(50000, 60000, NA, 45000, 70000)
)
 
# Display the original data frame
print("Original Data Frame:")
print(my_data)
 
# Imputing missing values with mean
imputed_data <- my_data
 
# Impute Age column with mean
imputed_data$Age <- ifelse(is.na(imputed_data$Age),
                           mean(imputed_data$Age, na.rm = TRUE),
                           imputed_data$Age)
 
# Impute Salary column with mean
imputed_data$Salary <- ifelse(is.na(imputed_data$Salary),
                              mean(imputed_data$Salary, na.rm = TRUE),
                              imputed_data$Salary)
 
# Display the data frame after imputation
print("Data Frame after Imputation:")
print(imputed_data)


Output:

[1] "Original Data Frame:"
  ID Age Salary
1  1  25  50000
2  2  NA  60000
3  3  30     NA
4  4  22  45000
5  5  35  70000

[1] "Data Frame after Imputation:"
  ID Age Salary
1  1  25  50000
2  2  28  60000
3  3  30  56250
4  4  22  45000
5  5  35  70000

Here, each NA is replaced with the mean of the observed values in its own column: 28 for Age and 56250 for Salary.
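
The same pattern works for other summary statistics. As a sketch, the hypothetical helper impute_with() below (not part of base R or any package) fills the NAs in a numeric vector with a statistic of the observed values, using the median by default. The median is less sensitive to extreme values than the mean, which can matter for skewed columns such as salaries.

```r
# Hypothetical helper: replace the NAs in a numeric vector with a summary
# statistic computed from the observed values (median by default)
impute_with <- function(x, FUN = median) {
  x[is.na(x)] <- FUN(x, na.rm = TRUE)
  x
}

# Sample data frame with missing values
my_data <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Age = c(25, NA, 30, 22, 35),
  Salary = c(50000, 60000, NA, 45000, 70000)
)

# Impute each column with its own median
median_imputed <- my_data
median_imputed$Age <- impute_with(my_data$Age)
median_imputed$Salary <- impute_with(my_data$Salary)

print(median_imputed)
```

With this data the missing Age becomes 27.5 and the missing Salary becomes 55000. Passing FUN = mean to the helper reproduces the mean imputation shown above.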

Interpolation of missing data

We can also estimate a missing value from the neighbouring points that are available. This is known as interpolation. The na.approx() function in the zoo package performs linear interpolation, so we need to install that package first.

Syntax:

R




# Install the package (only needed once)
install.packages("zoo")
# Load the package
library(zoo)
 
# Interpolate missing values
my_data_interp <- zoo::na.approx(my_data)


Applying this to the above-mentioned example we will get:

R




# Interpolate missing values using na.approx from zoo
interpolated_data <- my_data
 
# Interpolate Age column
interpolated_data$Age <- zoo::na.approx(interpolated_data$Age)
 
# Interpolate Salary column
interpolated_data$Salary <- zoo::na.approx(interpolated_data$Salary)
 
# Display the data frame after interpolation
print("Data Frame after Interpolation:")
print(interpolated_data)


Output:

[1] "Data Frame after Interpolation:"
  ID  Age Salary
1  1 25.0  50000
2  2 27.5  60000
3  3 30.0  52500
4  4 22.0  45000
5  5 35.0  70000

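The missing values in the Age and Salary columns are now replaced with estimates that lie on a straight line between the neighbouring observed values: 27.5 is halfway between 25 and 30, and 52500 is halfway between 60000 and 45000. Note that na.approx() drops leading and trailing NAs by default, which would make the result shorter than the column; passing na.rm = FALSE keeps them as NA.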
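
To make the calculation explicit, the sketch below performs the same kind of linear interpolation with base R's approx() instead of zoo. The helper interpolate_linear() is hypothetical and written only for illustration; it treats the row position as the x-axis and needs at least two observed values.

```r
# Hypothetical helper: linear interpolation over row position with base R
interpolate_linear <- function(x) {
  position <- seq_along(x)
  observed <- !is.na(x)
  # Estimate a value at every position from the observed points only
  approx(x = position[observed], y = x[observed], xout = position)$y
}

age <- c(25, NA, 30, 22, 35)
salary <- c(50000, 60000, NA, 45000, 70000)

print(interpolate_linear(age))
print(interpolate_linear(salary))

# A leading NA has no earlier neighbour to interpolate from, so it stays NA
print(interpolate_linear(c(NA, 10, NA, 20)))
```

Interpolation assumes that the row order is meaningful, for example in a time series. For unordered records such as the employee data here, the result depends on how the rows happen to be sorted.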

Coping with Missing, Invalid and Duplicate Data in R

Data is the foundation of statistical analysis and machine learning. Freely available data is often raw and has many issues, such as invalid entries and missing or duplicate values, that can significantly affect model training and estimation.

We train a model on past data and use it to predict new values. Invalid data or missing values can lower the accuracy of prediction models; therefore, handling these problems is an important part of data processing. In this article, we will learn how to cope with missing, invalid, and duplicate data in the R programming language.
