Data Cleaning Operation
The summary of the dataset shows missing values (NA) in two columns, Ozone and Solar.R.
R
summary(airquality)
Output:
     Ozone           Solar.R           Wind             Temp           Month
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000
 NA's   :37       NA's   :7
      Day
 Min.   : 1.0
 1st Qu.: 8.0
 Median :16.0
 Mean   :15.8
 3rd Qu.:23.0
 Max.   :31.0
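The NA counts can also be read directly instead of from the summary table. The following is a minimal sketch on the built-in airquality data set; the name na_counts is chosen here for illustration and is not part of the original walkthrough.

```r
# Count missing values per column of the built-in airquality data set.
# `na_counts` is an illustrative name.
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Keep only the columns that actually contain missing values.
print(na_counts[na_counts > 0])
```

This makes it easy to decide which columns need cleaning before touching the data.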
A boxplot gives a quick visual check of each column's distribution and its outliers.
R
boxplot(airquality)
Output:
Replace the missing values in the Ozone column with the column median, using is.na() to locate them.
R
New_df <- airquality
New_df$Ozone <- ifelse(is.na(New_df$Ozone), median(New_df$Ozone, na.rm = TRUE), New_df$Ozone)
summary(New_df)
Output:
     Ozone           Solar.R           Wind             Temp           Month
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000
 1st Qu.: 21.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000
 Mean   : 39.56   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993
 3rd Qu.: 46.00   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000
                  NA's   :7
      Day
 Min.   : 1.0
 1st Qu.: 8.0
 Median :16.0
 Mean   :15.8
 3rd Qu.:23.0
 Max.   :31.0
Perform the same operation on the Solar.R column.
R
New_df$Solar.R <- ifelse(is.na(New_df$Solar.R), median(New_df$Solar.R, na.rm = TRUE), New_df$Solar.R)
The summary now reports no NA counts in any column.
R
summary(New_df)
Output:
     Ozone           Solar.R           Wind             Temp           Month
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000
 1st Qu.: 21.00   1st Qu.:120.0   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000
 Mean   : 39.56   Mean   :186.8   Mean   : 9.958   Mean   :77.88   Mean   :6.993
 3rd Qu.: 46.00   3rd Qu.:256.0   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000
      Day
 Min.   : 1.0
 1st Qu.: 8.0
 Median :16.0
 Mean   :15.8
 3rd Qu.:23.0
 Max.   :31.0
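The two per-column steps follow the same pattern, so they could be wrapped in a small helper. The sketch below is one possible way to do that; impute_median and cleaned are hypothetical names introduced here, and the helper assumes that median imputation is appropriate for every numeric column that has NA values.

```r
# Hypothetical helper: replace NA in every numeric column with that column's median.
impute_median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]]) && anyNA(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}

# Apply the helper to the built-in data set and inspect the result.
cleaned <- impute_median(airquality)
summary(cleaned)
```

Columns without missing values are left untouched, so the helper can be applied to the whole data frame in one call.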
The first rows of the data frame confirm that the missing entries have been filled: row 5 now holds the median values 31.5 (Ozone) and 205 (Solar.R).
R
head(New_df)
Output:
  Ozone Solar.R Wind Temp Month Day
1  41.0     190  7.4   67     5   1
2  36.0     118  8.0   72     5   2
3  12.0     149 12.6   74     5   3
4  18.0     313 11.5   62     5   4
5  31.5     205 14.3   56     5   5
6  28.0     205 14.9   66     5   6
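Looking at the first six rows only covers part of the data. A programmatic check over the whole data frame might look like the sketch below, which rebuilds New_df with index assignment (an equivalent alternative to the ifelse() form used above) so that it runs on its own.

```r
# Rebuild the imputed data frame so this snippet is self-contained.
New_df <- airquality
New_df$Ozone[is.na(New_df$Ozone)] <- median(airquality$Ozone, na.rm = TRUE)
New_df$Solar.R[is.na(New_df$Solar.R)] <- median(airquality$Solar.R, na.rm = TRUE)

# TRUE if any value anywhere in the data frame is still missing.
print(anyNA(New_df))

# Rows that had at least one NA in the original data, shown after imputation.
print(head(New_df[!complete.cases(airquality), ]))
```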
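Finally, redraw the boxplot on the cleaned data frame to check the distributions after imputation. Median imputation fills the missing values but does not remove outliers.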
R
boxplot(New_df)
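Imputation fills gaps but leaves extreme values in place, so the boxplot may still draw outlier points. As a rough sketch, boxplot.stats() can list those points for the Ozone column before and after imputation; the variable names below are illustrative only.

```r
# Median-impute Ozone, as in the steps above.
ozone_raw <- airquality$Ozone
ozone_imputed <- ifelse(is.na(ozone_raw), median(ozone_raw, na.rm = TRUE), ozone_raw)

# boxplot.stats() returns the points a boxplot would draw as outliers in `$out`.
out_raw <- boxplot.stats(ozone_raw)$out
out_imputed <- boxplot.stats(ozone_imputed)$out

print(out_raw)
print(out_imputed)
```

Filling many gaps with a single value narrows the interquartile range, so the set of flagged points can change even though no observed value was altered.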
Depending on the nature of the dataset and the cleaning requirements, many techniques and functions may be employed to clean the data. Before moving on to further in-depth research, exploratory data analysis and rigorous study of the data are essential in spotting and resolving data quality issues.
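As one example of an alternative technique, rows with missing values can be dropped instead of imputed. This is a minimal sketch using na.omit() from base R; complete_rows is an illustrative name, and whether dropping rows is acceptable depends on how much data can be lost.

```r
# Alternative to imputation: drop every row that has at least one NA.
complete_rows <- na.omit(airquality)

# Compare row counts before and after dropping incomplete rows.
print(nrow(airquality))
print(nrow(complete_rows))
```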