Data Cleaning Operation

The summary of the dataset shows that two columns, Ozone and Solar.R, contain NA values (37 and 7 respectively).

R

summary(airquality)

Output:

     Ozone           Solar.R           Wind             Temp           Month      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000  
 NA's   :37       NA's   :7                                                       
      Day      
 Min.   : 1.0  
 1st Qu.: 8.0  
 Median :16.0  
 Mean   :15.8  
 3rd Qu.:23.0  
 Max.   :31.0
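The NA counts can also be read off directly instead of scanning the summary table. The following is a minimal sketch that assumes the built-in airquality dataset; na_counts is a name chosen here for illustration.

```r
# Count the missing values in each column of the built-in airquality dataset
na_counts <- colSums(is.na(airquality))
na_counts

# Keep only the columns that actually contain NA values
na_counts[na_counts > 0]
```

This form is convenient when a data frame has many columns and the summary output becomes hard to scan.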

A boxplot gives a quick visual overview of the spread of each column and makes unusual values easy to spot.

R

boxplot(airquality)

Output:

Boxplot of Airquality Dataset
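The points that a boxplot draws beyond the whiskers can also be listed numerically. The sketch below uses boxplot.stats() from base R with its default settings; ozone_stats is a hypothetical variable name.

```r
# Compute the statistics that boxplot() uses for the Ozone column
# (NA values are ignored by boxplot.stats)
ozone_stats <- boxplot.stats(airquality$Ozone)

# Five-number summary used for the box and whiskers
ozone_stats$stats

# Values that lie beyond the whiskers and are drawn as individual points
ozone_stats$out

# The same check for the Wind column
boxplot.stats(airquality$Wind)$out
```

Listing the values this way makes it easier to decide whether a point is a recording error or a genuine extreme observation.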

Replace the missing values in the Ozone column with the column median, using is.na() to locate them.

R

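# Work on a copy so the original dataset stays untouched
New_df <- airquality

# Replace each NA in Ozone with the median of the observed values
New_df$Ozone <- ifelse(is.na(New_df$Ozone),
                       median(New_df$Ozone,
                              na.rm = TRUE),
                       New_df$Ozone)

summary(New_df)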

Output:

     Ozone           Solar.R           Wind             Temp           Month      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000  
 1st Qu.: 21.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000  
 Mean   : 39.56   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993  
 3rd Qu.: 46.00   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000  
                  NA's   :7                                                       
      Day      
 Min.   : 1.0  
 1st Qu.: 8.0  
 Median :16.0  
 Mean   :15.8  
 3rd Qu.:23.0  
 Max.   :31.0
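The same replacement can be written with logical indexing instead of ifelse(). This is a sketch of an equivalent approach, not the method used above; alt_df and missing_ozone are hypothetical names.

```r
# Start from a fresh copy of the built-in dataset
alt_df <- airquality

# Remember which positions were missing before the replacement
missing_ozone <- is.na(alt_df$Ozone)

# Assign the median of the observed values to the missing positions only
alt_df$Ozone[missing_ozone] <- median(alt_df$Ozone, na.rm = TRUE)

# No NA values remain in the column
sum(is.na(alt_df$Ozone))

# Every filled position now holds the same value, the median
unique(alt_df$Ozone[missing_ozone])
```

Indexing touches only the missing positions, which some readers find easier to follow than a vectorised ifelse() over the whole column.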

Apply the same median imputation to the Solar.R column.

R

New_df$Solar.R <- ifelse(is.na(New_df$Solar.R),
                         median(New_df$Solar.R,
                                na.rm = TRUE),
                         New_df$Solar.R)
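When several columns need the same treatment, the repeated code can be wrapped in a small function. The sketch below is one possible way to do that; impute_median() is a hypothetical helper defined here, not a base R function, and clean_df is an illustrative name.

```r
# Hypothetical helper: replace the NA values of a numeric vector with its median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

clean_df <- airquality

# Find the columns that contain at least one NA
cols_with_na <- names(clean_df)[colSums(is.na(clean_df)) > 0]
cols_with_na

# Apply the helper to those columns only
clean_df[cols_with_na] <- lapply(clean_df[cols_with_na], impute_median)

# TRUE would mean that some NA values are still present
anyNA(clean_df)
```

Restricting the loop to cols_with_na leaves complete columns such as Wind and Temp untouched.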

The summary now shows no NA counts for any column.

R

summary(New_df)

Output:

     Ozone           Solar.R           Wind             Temp           Month      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000  
 1st Qu.: 21.00   1st Qu.:120.0   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000  
 Mean   : 39.56   Mean   :186.8   Mean   : 9.958   Mean   :77.88   Mean   :6.993  
 3rd Qu.: 46.00   3rd Qu.:256.0   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000  
      Day      
 Min.   : 1.0  
 1st Qu.: 8.0  
 Median :16.0  
 Mean   :15.8  
 3rd Qu.:23.0  
 Max.   :31.0
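The summaries before and after imputation differ in more than the NA row: for Ozone the mean drops from 42.13 to 39.56 and the quartiles move closer to the median. The sketch below compares a few statistics side by side; ozone_compare is a hypothetical name, and the data frame is rebuilt so the snippet runs on its own.

```r
# Rebuild the imputed Ozone column so this snippet is self-contained
New_df <- airquality
New_df$Ozone[is.na(New_df$Ozone)] <- median(New_df$Ozone, na.rm = TRUE)

# Compare location and spread before and after median imputation
ozone_compare <- data.frame(
  version = c("original", "imputed"),
  mean    = c(mean(airquality$Ozone, na.rm = TRUE), mean(New_df$Ozone)),
  median  = c(median(airquality$Ozone, na.rm = TRUE), median(New_df$Ozone)),
  sd      = c(sd(airquality$Ozone, na.rm = TRUE), sd(New_df$Ozone))
)
ozone_compare
```

Median imputation keeps the median unchanged but reduces the spread of the column, which is worth keeping in mind when many values are missing.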

The first rows of the data frame confirm that the missing values have been filled in: row 5, for example, now holds the medians 31.5 (Ozone) and 205 (Solar.R).

R

head(New_df)

Output:

  Ozone Solar.R Wind Temp Month Day
1  41.0     190  7.4   67     5   1
2  36.0     118  8.0   72     5   2
3  12.0     149 12.6   74     5   3
4  18.0     313 11.5   62     5   4
5  31.5     205 14.3   56     5   5
6  28.0     205 14.9   66     5   6
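To inspect only the rows that were changed, the original dataset can be used as a mask. This is a sketch under the assumption that both columns were imputed with their medians as above; was_incomplete and filled_rows are hypothetical names.

```r
# Rebuild the imputed data frame so this snippet is self-contained
New_df <- airquality
New_df$Ozone[is.na(New_df$Ozone)] <- median(New_df$Ozone, na.rm = TRUE)
New_df$Solar.R[is.na(New_df$Solar.R)] <- median(New_df$Solar.R, na.rm = TRUE)

# Rows of the original data that had at least one NA
was_incomplete <- !complete.cases(airquality)

# The same rows after imputation
filled_rows <- New_df[was_incomplete, ]
head(filled_rows)
```

Because the row names are preserved, each filled row can be traced back to its position in airquality.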

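We can now redraw the boxplot for the cleaned data frame. Note that imputation fills in missing values but does not remove outliers, so points beyond the whiskers may still appear.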

R

boxplot(New_df)

Boxplot of Airquality Dataset
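Plotting a single column before and after imputation makes the effect easier to see than the full-dataset plot, where columns have very different scales. The following sketch does this for Ozone; the plot title and variable names are illustrative choices.

```r
# Rebuild the imputed Ozone column so this snippet is self-contained
New_df <- airquality
New_df$Ozone[is.na(New_df$Ozone)] <- median(New_df$Ozone, na.rm = TRUE)

# Draw the original and the imputed Ozone values side by side
bp <- boxplot(list(Original = airquality$Ozone, Imputed = New_df$Ozone),
              main = "Ozone before and after median imputation",
              ylab = "Ozone")

# Number of points drawn beyond the whiskers in each version
n_out_before <- length(boxplot.stats(airquality$Ozone)$out)
n_out_after  <- length(boxplot.stats(New_df$Ozone)$out)
c(before = n_out_before, after = n_out_after)
```

Filling many values with the median narrows the box, so more of the existing observations can end up beyond the whiskers after imputation.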

Depending on the nature of the dataset and the cleaning requirements, many techniques and functions may be employed to clean the data. Before moving on to further in-depth research, exploratory data analysis and rigorous study of the data are essential in spotting and resolving data quality issues.
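As one example of an alternative technique, rows with missing values can be dropped instead of imputed. This is a minimal sketch using na.omit() from base R; complete_df is a hypothetical name, and whether dropping rows is acceptable depends on how much data can be lost.

```r
# Alternative to imputation: drop every row that contains at least one NA
complete_df <- na.omit(airquality)

# Compare the number of rows before and after
nrow(airquality)
nrow(complete_df)

# No NA values remain in the reduced data frame
anyNA(complete_df)
```

Dropping rows keeps only observed values but shrinks the dataset, whereas median imputation keeps every row at the cost of altering the distribution.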
