Data Preprocessing in R

Installing and loading the tidyverse package.

The Tidyverse Metapackage: Our adventure begins with the meta-package called “tidyverse.” With a simple incantation, “library(tidyverse)”, we load its core packages and unlock powerful tools for data manipulation and visualization.

R

# Loading the tidyverse core packages, e.g. ggplot2, dplyr, tidyr 
  
library(tidyverse) 

Listing files in the “../input” directory.

As we explore further, we stumble upon a mystical directory known as “../input/”. With a flick of our code wand, we unleash its secrets and reveal a list of hidden files.

R

# Listing the files in the input directory, e.g. data.csv, images.zip, config.txt 
  
list.files(path = "../input")

Output of library(tidyverse) (its startup message):

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
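
The directory listing with list.files() can also be narrowed to one file type. The sketch below is a minimal illustration that builds a throwaway directory instead of relying on "../input"; the directory and file names are made up for the example.

```r
# Build a temporary directory so the example does not depend on "../input".
input_dir <- file.path(tempdir(), "input_demo")
dir.create(input_dir, showWarnings = FALSE)
invisible(file.create(file.path(input_dir, c("data.csv", "config.txt"))))

# pattern is a regular expression; "\\.csv$" keeps names ending in ".csv".
# full.names = TRUE returns paths that can be passed straight to read.csv().
csv_files <- list.files(path = input_dir, pattern = "\\.csv$",
                        full.names = TRUE)
basename(csv_files)
```

Filtering at listing time keeps later reading code from having to skip archives or config files by hand.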

The Magical Powers of “psych”: Next, we stumble upon a rare gem known as “psych.” With “install.packages("psych")” followed by “library(psych)”, we unlock its potent abilities.

R

install.packages("psych") 
library(psych)
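
Calling install.packages() on every run reinstalls the package each time. One possible guard, sketched here with a hypothetical helper named missing_packages (not part of any package), is to check first which packages are absent.

```r
# Hypothetical helper: return the names in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "psych"))
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to actually install them
}
```

The install call is left commented out so that the sketch only reports what is missing.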

Reading the Dataset

We stumble upon a precious artifact: a CSV file named “Expanded_data_with_more_features.csv”. With a wave of our wand and the invocation of “read.csv()”, we summon the data into our realm as a data frame.

R

# Reading data set 
# (file.path() joins the directory and the file name into one path) 
data <- read.csv(file.path("/kaggle/input/students-exam-scores", 
                           "Expanded_data_with_more_features.csv")) 

Data Exploration & Analysis

Our adventure begins with a simple task: discovering what lies within our dataset. We load the data and take a sneak peek. The “head()” function unveils the first six rows by default, and “dim()” reveals the dimensions of the dataset (rows first, then columns).

R

# Exploring the dimensions of our dataset  
dim(data) 

Output:

30641 * 15
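
A small sketch of these first-look functions on a toy data frame (the columns id and score are invented for illustration):

```r
# Toy data frame with 10 rows and 2 columns.
toy <- data.frame(id = 1:10, score = seq(50, 95, by = 5))

dim(toy)                # number of rows, then number of columns
nrow(head(toy))         # head() returns six rows unless n is given
nrow(head(toy, n = 3))  # ask for the first three rows instead
```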

You have summoned the names() function to reveal the secret names of the columns in your dataset. By capturing the output in the variable variable_names and invoking the print() spell, you have successfully unveiled the hidden names of the columns, allowing you to comprehend the structure of your dataset.

R

# displaying the column names  
variable_names <- names(data) 
print(variable_names) 

Output:

 [1] "X"                   "Gender"              "EthnicGroup"        
 [4] "ParentEduc"          "LunchType"           "TestPrep"           
 [7] "ParentMaritalStatus" "PracticeSport"       "IsFirstChild"       
[10] "NrSiblings"          "TransportMeans"      "WklyStudyHours"     
[13] "MathScore"           "ReadingScore"        "WritingScore"

Next, you cast the str() spell upon your dataset. This spell reveals the data type of each variable (column), granting you insight into its mystical nature. By deciphering the output of this spell, you can understand the types of variables present in your dataset, such as character (chr), integer (int), numeric (num), or factor, among others.

R

# See the data types of the variables (columns) 
str(data) 
# From here we can see that WklyStudyHours is stored as chr; 
# its values are ranges such as "< 5" and "5 - 10", so it needs 
# recoding before it can be treated as a number

Output:

'data.frame':    30641 obs. of  15 variables:
 $ X                  : int  0 1 2 3 4 5 6 7 8 9 ...
 $ Gender             : chr  "female" "female" "female" "male" ...
 $ EthnicGroup        : chr  "" "group C" "group B" "group A" ...
 $ ParentEduc         : chr  "bachelor's degree" "some college" "master's degree" "associate's degree" ...
 $ LunchType          : chr  "standard" "standard" "standard" "free/reduced" ...
 $ TestPrep           : chr  "none" "" "none" "none" ...
 $ ParentMaritalStatus: chr  "married" "married" "single" "married" ...
 $ PracticeSport      : chr  "regularly" "sometimes" "sometimes" "never" ...
 $ IsFirstChild       : chr  "yes" "yes" "yes" "no" ...
 $ NrSiblings         : int  3 0 4 1 0 1 1 1 3 NA ...
 $ TransportMeans     : chr  "school_bus" "" "school_bus" "" ...
 $ WklyStudyHours     : chr  "< 5" "5 - 10" "< 5" "5 - 10" ...
 $ MathScore          : int  71 69 87 45 76 73 85 41 65 37 ...
 $ ReadingScore       : int  71 90 93 56 78 84 93 43 64 59 ...
 $ WritingScore       : int  74 88 91 42 75 79 89 39 68 50 ...

Ah, but wait! Your keen eyes have spotted an anomaly. The variable “WklyStudyHours” is stored as character (chr) even though it describes an amount of time. Its values are ranges such as "< 5" and "5 - 10", so a plain numeric conversion would not work; the column has to be recoded first. That spell, which starts with data$WklyStudy…, is left for the next part of the adventure.
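
Because the values of WklyStudyHours are ranges rather than numbers, one possible treatment is an ordered factor. The sketch below uses a toy vector; the level "> 10" is an assumption, since only "< 5" and "5 - 10" are visible in the str() output, and this is not necessarily the conversion the original author had in mind.

```r
# Toy values; "> 10" is an assumed third category.
study_hours <- c("< 5", "5 - 10", "< 5", "> 10")

# An ordered factor keeps the labels and records their ranking.
study_hours_f <- factor(study_hours,
                        levels = c("< 5", "5 - 10", "> 10"),
                        ordered = TRUE)

as.integer(study_hours_f)            # rank of each value: 1, 2, 1, 3
study_hours_f[1] < study_hours_f[2]  # comparisons respect the level order
```

The integer codes are ranks, not hours, so they suit ordering and grouping rather than arithmetic.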

Data Cleaning & Formatting

The Magical Transformation: Ah, but our dataset has a few quirks, like missing values and an unwanted index column. Fear not! With a dash of code, we bid farewell to the extra index column and wave our wands to convert blank strings into NA values. Step by step, we breathe life into our dataset, ensuring each column shines brightly with the correct data type.

R

#  Removing column X as it is an extra index column  
data$X <- NULL
  
# Converting blank (empty or whitespace-only) strings into NA 
columns <- colnames(data) 
  
for (column in columns) { 
  data[[column]] <- ifelse(trimws(data[[column]]) == "",  
                           NA, data[[column]]) 
} 
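
An alternative to converting blanks after loading is to tell read.csv() which strings mean "missing" at import time through its na.strings argument. The sketch below writes a tiny made-up CSV to a temporary file to show the effect; the rows are invented, not taken from the real dataset.

```r
# Write a small CSV with one blank TestPrep and one blank MathScore.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Gender,TestPrep,MathScore",
             "female,none,71",
             "male,,69",
             "female,completed,"), csv_path)

# Treat empty fields and the literal text "NA" as missing values.
demo <- read.csv(csv_path, na.strings = c("", "NA"))
colSums(is.na(demo))
```

Handling blanks at import avoids a second pass over every column, although whitespace-only cells would still need the trimws() approach.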

Handling Missing Values

A World of Missing Values: Ahoy! We’ve stumbled upon missing values! But fret not, we shall fill these gaps and restore balance to our dataset. One by one, we rescue the categorical columns from the clutches of emptiness, replacing the void with the most common category. The missing values tremble as we conquer them, transforming our dataset into a complete and harmonious entity.

R

# Counting the missing values in each column 
colSums(is.na(data)) 

Output:

Gender:0
EthnicGroup:1840
ParentEduc:1845
LunchType:0
TestPrep:1830
ParentMaritalStatus:1190
PracticeSport:631
IsFirstChild:904
NrSiblings:1572
TransportMeans:3134
WklyStudyHours:0
MathScore:0
ReadingScore:0
WritingScore:0
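
Raw counts are easier to judge as a share of all rows. A minimal sketch on a toy data frame (columns a, b and c are invented): colMeans() over the logical matrix from is.na() gives the fraction of missing values per column.

```r
toy <- data.frame(a = c(1, NA, 3, NA),
                  b = c("x", "y", NA, "z"),
                  c = 1:4)

# The mean of a logical vector is the proportion of TRUE values.
missing_share <- colMeans(is.na(toy))

# Percentage missing per column, largest first.
sort(round(100 * missing_share, 1), decreasing = TRUE)
```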

Handling Categorical Data

Unveiling the Secrets of Categorical Columns. Let’s start with the EthnicGroup column first as it has 1840 missing values.

R

# unique_values in EthnicGroup 
unique_values <- unique(data$EthnicGroup) 
unique_values 

Output:

NA 'group C' 'group B' 'group A' 'group D' 'group E'

R

# Create a bar plot of EthnicGroup 
ggplot(data, aes(x = EthnicGroup)) + 
  geom_bar(fill = "steelblue") + 
  labs(title = "Distribution of EthnicGroup", 
       x = "EthnicGroup", y = "Count") 

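Output:

[Bar plot “Distribution of EthnicGroup”: EthnicGroup on the x-axis, Count on the y-axis, before filling the missing values]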

R

# Calculate the frequency count for each category 
frequency <- table(data$EthnicGroup) 
# Find the category with the highest count (mode) 
mode <- names(frequency)[which.max(frequency)] 
# Print the mode 
print(mode) 

Output:

[1] "group C"

R

# filling missing values in Ethnic group 
data$EthnicGroup[is.na(data$EthnicGroup)] <- mode 
  
# Create a bar plot of EthnicGroup after filling the missing values  
ggplot(data, aes(x = EthnicGroup)) + 
  geom_bar(fill = "steelblue") + 
  labs(title = "Distribution of EthnicGroup", 
       x = "EthnicGroup", y = "Count")

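Output:

[Bar plot “Distribution of EthnicGroup”: EthnicGroup on the x-axis, Count on the y-axis, after filling the missing values]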

First, we look at the unique values of “EthnicGroup” to discover the distinct ethnic groups present in our dataset. Then we create a bar plot. After this, we calculate the frequency count for every category, which lets us understand the distribution of ethnic groups more quantitatively. We then identify the category with the highest count (the mode) and fill the missing values in the “EthnicGroup” column with it. Finally, we create another bar plot of the “EthnicGroup” column, this time after filling the missing values, as a contrast to the previous plot.
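
The mode-filling steps described above can be wrapped in a small function. This is a sketch with a hypothetical helper name, most_common_value, applied to a toy vector rather than to the real column.

```r
# Hypothetical helper: most frequent non-missing value, as a string.
most_common_value <- function(x) {
  counts <- table(x)  # table() leaves out NA by default
  # which.max() returns the first maximum, so ties go to the
  # category that comes first in sorted order.
  names(counts)[which.max(counts)]
}

groups <- c("group C", NA, "group B", "group C", NA)
groups[is.na(groups)] <- most_common_value(groups)
groups
```

Because names() always yields text, the helper fits character columns such as EthnicGroup; a numeric column would need its result converted back.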

By executing the code block below, we drop the rows that have missing values in ParentEduc or WklyStudyHours and impute the missing values in NrSiblings with the column median.

R

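# Drop rows where ParentEduc or WklyStudyHours is missing. 
# (na.omit() on a data frame ignores a cols argument and 
# would drop every row that contains any NA.) 
data <- data[complete.cases(data[, c("ParentEduc", 
                                     "WklyStudyHours")]), ] 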
  
# Calculate the median of the NrSiblings column 
median_value <- median(data$NrSiblings,  
                       na.rm = TRUE) 
  
# filling missing values 
data$NrSiblings <- ifelse(is.na(data$NrSiblings), 
                          median_value, data$NrSiblings)
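
The difference between dropping rows on selected columns and dropping rows on any column is easy to see on a toy data frame. The sketch below uses invented rows; it relies on complete.cases() for the selective drop and then fills NrSiblings with the median.

```r
toy <- data.frame(
  ParentEduc = c("some college", NA, "high school", "some college"),
  TestPrep   = c(NA, "none", "none", "none"),
  NrSiblings = c(3, 1, NA, 2)
)

# Keep rows where ParentEduc is present; NAs elsewhere survive.
keep <- complete.cases(toy[, "ParentEduc", drop = FALSE])
toy_subset <- toy[keep, ]
nrow(toy_subset)    # 3 rows remain
nrow(na.omit(toy))  # only 1 row has no NA in any column

# Median imputation for the numeric column.
median_value <- median(toy_subset$NrSiblings, na.rm = TRUE)
toy_subset$NrSiblings[is.na(toy_subset$NrSiblings)] <- median_value
```

Dropping on selected columns preserves rows that can still be repaired by imputation later.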

Now we fill in the missing values in the remaining categorical columns by looping over them and replacing each column's missing values with its mode.

R

columns <- c("TransportMeans", "IsFirstChild", "TestPrep", 
             "ParentMaritalStatus", "PracticeSport") 
  
for (column in columns) { 
    # Calculate the frequency of each category 
    # (table() ignores NA by default) 
    frequency <- table(data[[column]]) 
  
    # Finding the most common category 
    most_common <- names(frequency)[which.max(frequency)] 
  
    # Filling the missing values with the 
    # most common category (or Mode) 
    data[[column]][is.na(data[[column]])] <- most_common 
}
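
When a column name is held in a variable, as in a loop like this, the choice between `$` and `[[` matters. A minimal sketch on a toy data frame:

```r
toy <- data.frame(IsFirstChild = c("yes", NA, "no", "yes"))
column <- "IsFirstChild"

# `$` looks for a column literally named "column", which does not exist.
is.null(toy$column)

# `[[` evaluates the variable and uses the name stored in it.
identical(toy[[column]], toy$IsFirstChild)
```

Both expressions return TRUE, which is why code that loops over column names needs `data[[column]]`.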

Now let’s check whether there are any more missing values left in the dataset or not.

R

# Total number of missing values left in the dataset 
sum(colSums(is.na(data)))

Output:

Descriptive statistical measures of a dataset help us better visualize and explore the dataset at hand without going through each observation of the dataset. We can get all the descriptive statistical measures of the dataset using the describe() function.

R

describe(data)
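
To see what kind of numbers such a summary contains, a few of the measures can be computed by hand in base R. This is only a sketch on invented scores, not a replacement for describe(), which reports more statistics.

```r
toy <- data.frame(MathScore    = c(71, 69, 87, 45),
                  ReadingScore = c(71, 90, 93, 56),
                  Gender       = c("female", "female", "female", "male"))

# Keep the numeric columns only.
numeric_cols <- toy[vapply(toy, is.numeric, logical(1))]

# One column of statistics per variable.
stats <- sapply(numeric_cols, function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x),
    min = min(x), max = max(x))
})
stats
```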
