Data Preprocessing in R

Installing and loading the tidyverse package.

The Tidyverse Metapackage – Our adventure begins with the mysterious meta-package called "tidyverse". With a simple incantation, "library(tidyverse)", we unlock the powerful tools and unleash the magic of data manipulation and visualization.

R




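# Loading the tidyverse, which bundles packages such as ggplot2, dplyr and tidyr
# (run install.packages("tidyverse") once beforehand if it is not installed)
library(tidyverse)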
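The heading promises installing as well as loading, while the snippet above only loads the package. A minimal sketch of an install-if-missing pattern could look like this. The helper name ensure_package is hypothetical (it is not part of R or the tidyverse), and install.packages() assumes network access to a CRAN mirror.

```r
# Hypothetical helper: install a package only when it is not already available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from the configured CRAN mirror
  }
  # Report whether the package can now be loaded.
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this call returns TRUE without installing anything.
ensure_package("stats")

# Typical use in this walkthrough (commented out to avoid a download here):
# ensure_package("tidyverse")
# library(tidyverse)
```

Checking with requireNamespace() first avoids reinstalling the package every time the script runs.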


Listing files in the "../input" directory.

As we explore further, we stumble upon a mystical directory known as "../input/". With a flick of our code wand, we unleash its secrets and reveal a list of hidden files.

R




# Listing the files in the input directory, e.g. data.csv, images.zip, config.txt
  
list.files(path = "../input")
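When the input directory contains subfolders, list.files() can also search them and filter by name. The sketch below uses a throwaway directory under tempdir() so that it runs anywhere; the folder and file names are made up for illustration.

```r
# Build a throwaway directory so the example does not depend on "../input".
demo_dir <- file.path(tempdir(), "input_demo")
dir.create(file.path(demo_dir, "scores"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, "scores", "example.csv")))
invisible(file.create(file.path(demo_dir, "notes.txt")))

# recursive = TRUE descends into subfolders; full.names = TRUE returns paths
# that can be passed straight to read.csv().
all_files <- list.files(path = demo_dir, recursive = TRUE, full.names = TRUE)

# pattern keeps only names matching a regular expression (here: CSV files).
csv_files <- list.files(path = demo_dir, pattern = "\\.csv$",
                        recursive = TRUE, full.names = TRUE)
print(basename(csv_files))
```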


Loading the tidyverse prints the following startup message:

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The Magical Powers of "psych": next, we stumble upon a rare gem known as "psych". With install.packages("psych") followed by library(psych), we unlock its potent abilities.

R




install.packages("psych")
library(psych)


Reading the Dataset

We stumble upon a precious artifact, a CSV file named "Expanded_data_with_more_features.csv". With a wave of our wand and the invocation of "read.csv()", we summon the data into our realm.

R




# Reading the dataset; file.path() joins the directory and the file name
data <- read.csv(file.path("/kaggle/input/students-exam-scores",
                           "Expanded_data_with_more_features.csv"))

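read.csv() can also mark blank fields as missing at import time, which is one alternative to cleaning them up afterwards. The sketch below writes a tiny made-up CSV to a temporary file instead of using the Kaggle path; only the column names are borrowed from the dataset.

```r
# Write a tiny CSV to a temporary file; the rows are made-up illustration data.
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("Gender,EthnicGroup,MathScore",
             "female,,71",
             "male,group C,69"), tmp_csv)

# na.strings lists the raw values to treat as missing;
# strip.white trims surrounding spaces from unquoted fields.
toy <- read.csv(tmp_csv, na.strings = c("", "NA"), strip.white = TRUE)

print(is.na(toy$EthnicGroup))
```

With this option the empty EthnicGroup field of the first row arrives as NA rather than as an empty string.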

Data Exploration & Analysis

Our adventure begins with a simple task: discovering what lies within our dataset. We load the data and take a sneak peek. With the "head()" function, we can unveil the first six rows (its default), and we revel in the joy of exploring the dimensions of our dataset using "dim()".

R




# Exploring the dimensions of our dataset 
dim(data)


Output:

30641 * 15
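The sneak peek with head() is not shown above, so here is a small sketch of it. The toy data frame and its values are made up so that the snippet runs without the Kaggle file; on the real dataset the call would simply be head(data).

```r
# Toy data frame with made-up values; column names follow the dataset.
toy <- data.frame(
  Gender = c("female", "male", "female", "male", "female", "male", "female"),
  MathScore = c(71, 69, 87, 45, 76, 73, 85)
)

first_rows <- head(toy)          # default: the first six rows
first_three <- head(toy, n = 3)  # explicit row count

print(dim(first_rows))
print(first_three)
```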

You have summoned the names() function to reveal the secret names of the columns in your dataset. By capturing the output in the variable variable_names and invoking the print() spell, you have successfully unveiled the hidden names of the columns, allowing you to comprehend the structure of your dataset.

R




# displaying the column names 
variable_names <- names(data)
print(variable_names)


Output:

 [1] "X"                   "Gender"              "EthnicGroup"
 [4] "ParentEduc"          "LunchType"           "TestPrep"
 [7] "ParentMaritalStatus" "PracticeSport"       "IsFirstChild"
[10] "NrSiblings"          "TransportMeans"      "WklyStudyHours"
[13] "MathScore"           "ReadingScore"        "WritingScore"

Next, you cast the str() spell upon your dataset. This spell reveals the data type of each variable (column), granting you insight into their mystical nature. By deciphering the output of this spell, you can understand the types of variables present in your dataset, such as character (chr), numeric (num), integer (int), or factor, among others.

R




# See the data types of the variables (columns)
str(data)
# From here we can see that WklyStudyHours is stored as chr rather than
# a numeric type: its values are range labels such as "< 5" and "5 - 10"


Output:

'data.frame':    30641 obs. of  15 variables:
 $ X                  : int  0 1 2 3 4 5 6 7 8 9 ...
 $ Gender             : chr  "female" "female" "female" "male" ...
 $ EthnicGroup        : chr  "" "group C" "group B" "group A" ...
 $ ParentEduc         : chr  "bachelor's degree" "some college" "master's degree" "associate's degree" ...
 $ LunchType          : chr  "standard" "standard" "standard" "free/reduced" ...
 $ TestPrep           : chr  "none" "" "none" "none" ...
 $ ParentMaritalStatus: chr  "married" "married" "single" "married" ...
 $ PracticeSport      : chr  "regularly" "sometimes" "sometimes" "never" ...
 $ IsFirstChild       : chr  "yes" "yes" "yes" "no" ...
 $ NrSiblings         : int  3 0 4 1 0 1 1 1 3 NA ...
 $ TransportMeans     : chr  "school_bus" "" "school_bus" "" ...
 $ WklyStudyHours     : chr  "< 5" "5 - 10" "< 5" "5 - 10" ...
 $ MathScore          : int  71 69 87 45 76 73 85 41 65 37 ...
 $ ReadingScore       : int  71 90 93 56 78 84 93 43 64 59 ...
 $ WritingScore       : int  74 88 91 42 75 79 89 39 68 50 ...

Ah, but wait! Your keen eyes have spotted an anomaly. The variable "WklyStudyHours" is labeled as character (chr) instead of numeric (num). Its values are range labels such as "< 5" and "5 - 10", so a plain as.numeric() cast would not work on it. Fear not, for you possess the power to correct this. To rectify this discrepancy, you can use data$WklyStudy… but that spell is saved for the next part of the adventure, so keep exploring!
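The article leaves its own fix for a later part, so the following is only one possible treatment, not the author's method. It sketches why a direct numeric cast fails on range labels and how an ordered factor could keep their ranking. The labels "< 5" and "5 - 10" come from the str() output above; the "> 10" label and the toy vector are assumptions made for illustration.

```r
# Toy vector of range labels; "> 10" is an assumed label, not taken from the output above.
study_hours <- c("< 5", "5 - 10", "< 5", "> 10", "5 - 10")

# A direct numeric cast fails: every label becomes NA (R also warns about it).
as_number <- suppressWarnings(as.numeric(study_hours))

# One alternative: an ordered factor that preserves the ranking of the ranges.
study_levels <- c("< 5", "5 - 10", "> 10")
study_factor <- factor(study_hours, levels = study_levels, ordered = TRUE)

# Integer codes (1, 2, 3) can stand in for the ranges where a number is needed.
study_codes <- as.integer(study_factor)
print(study_codes)
```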

Data Cleaning & Formatting

The Magical Transformation: ah, but our dataset has a few quirks, like missing values and an unwanted index column. Fear not! With a dash of code, we bid farewell to the extra index column and wave our wands to convert blank strings into the mystical realm of NA values. Step by step, we breathe life into our dataset, ensuring each column shines brightly with the correct data type.

R




# Removing column X, as it is an extra index column
data$X <- NULL
  
# Converting blank strings into NA
columns <- colnames(data)
  
for (column in columns) {
  data[[column]] <- ifelse(trimws(data[[column]]) == "",
                           NA, data[[column]])
}
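The same blank-to-NA conversion can be written without an explicit loop. The sketch below is one base R variant that only touches character columns and leaves numeric ones alone. The helper name blank_to_na and the toy values are hypothetical.

```r
# Toy data frame with made-up values, including blank and whitespace-only cells.
toy <- data.frame(
  EthnicGroup = c("", "group C", "  ", "group A"),
  NrSiblings = c(3L, 0L, NA, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace blank strings with NA in character vectors only.
blank_to_na <- function(col) {
  if (is.character(col)) {
    col[!is.na(col) & trimws(col) == ""] <- NA
  }
  col
}

# toy[] keeps the data frame structure while every column is replaced.
toy[] <- lapply(toy, blank_to_na)

print(colSums(is.na(toy)))
```

Restricting the replacement to character columns means integer columns such as NrSiblings are never passed through trimws() at all.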


Handling Missing Values

A World of Missing Values: ahoy, we've stumbled upon missing values! But fret not, we shall fill these gaps and restore balance to our dataset. One by one, we rescue columns from the clutches of emptiness, replacing the void with the most common category. The missing values tremble as we conquer them, transforming our dataset into a complete and harmonious entity.

R




# Seeing missing values 
colSums(is.na(data))


Output:

Gender:0
EthnicGroup:1840
ParentEduc:1845
LunchType:0
TestPrep:1830
ParentMaritalStatus:1190
PracticeSport:631
IsFirstChild:904
NrSiblings:1572
TransportMeans:3134
WklyStudyHours:0
MathScore:0
ReadingScore:0
WritingScore:0
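Raw counts are easier to judge as a share of all rows. For instance, the 1840 missing EthnicGroup values out of 30641 rows amount to roughly 6%. A small sketch of that calculation on a made-up toy data frame:

```r
# Toy data frame with made-up values; half of EthnicGroup is missing.
toy <- data.frame(
  EthnicGroup = c(NA, "group C", "group B", NA),
  MathScore = c(71, 69, 87, 45)
)

# Count of missing values per column.
missing_count <- colSums(is.na(toy))

# colMeans() of a logical matrix gives the fraction of TRUE values per column.
missing_pct <- round(100 * colMeans(is.na(toy)), 1)

# Columns sorted from most to least missing.
print(sort(missing_pct, decreasing = TRUE))
```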

Handling Categorical Data

Unveiling the Secrets of Categorical Columns. Let's start with the EthnicGroup column, as it has 1840 missing values.

R




# unique_values in EthnicGroup
unique_values <- unique(data$EthnicGroup)
unique_values


Output:

NA 'group C' 'group B' 'group A' 'group D' 'group E'

R




# Create a bar plot of EthnicGroup
ggplot(data, aes(x = EthnicGroup)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Distribution of EthnicGroup",
       x = "EthnicGroup", y = "Count")


Output: [bar plot "Distribution of EthnicGroup", before filling the missing values]

R




# Calculate the frequency count for each category
frequency <- table(data$EthnicGroup)
# Find the category with the highest count (mode)
mode <- names(frequency)[which.max(frequency)]
# Print the mode
print(mode)


Output:

[1] "group C"
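Since the same "most common category" lookup is needed for several columns, it can be wrapped in a small function. The name get_mode is hypothetical (base R's own mode() does something different: it reports an object's storage mode), and the toy vector is made up.

```r
# Hypothetical helper: most frequent non-missing value of a vector.
get_mode <- function(x) {
  counts <- table(x)                # table() ignores NA by default
  names(counts)[which.max(counts)]  # first category with the highest count
}

groups <- c(NA, "group C", "group B", "group C", "group A", NA)
print(get_mode(groups))
```

Because the result comes from names(), it is always a character string, so a numeric input would need converting back with as.numeric().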

R




# filling missing values in Ethnic group
data$EthnicGroup[is.na(data$EthnicGroup)] <- mode
  
# Create a bar plot of EthnicGroup after filling the missing values
ggplot(data, aes(x = EthnicGroup)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Distribution of EthnicGroup",
       x = "EthnicGroup", y = "Count")


Output: [bar plot "Distribution of EthnicGroup", after filling the missing values]

First, we look up the unique values in the "EthnicGroup" column to discover the ethnic groups present in our dataset. Then we create a bar plot. After this, we calculate the frequency count for each category. This step allows us to understand the distribution of ethnic groups more quantitatively. We then identify the category with the highest count (the mode). To handle the missing values in the "EthnicGroup" column, we fill those gaps with the mode. Finally, we create another bar plot of the "EthnicGroup" column, this time after filling the missing values. The visualization serves as a comparison with the preceding plot.

The code block below handles the missing values in ParentEduc, WklyStudyHours, and NrSiblings: rows with a missing ParentEduc or WklyStudyHours are dropped, and missing NrSiblings values are filled with the column median.

R




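# Drop rows where ParentEduc or WklyStudyHours is missing.
# (Base na.omit() has no cols argument for data frames and would
# drop rows with an NA in any column.)
data <- data[complete.cases(data[, c("ParentEduc",
                                     "WklyStudyHours")]), ]
  
# Calculate the median of the NrSiblings column
median_value <- median(data$NrSiblings, 
                       na.rm = TRUE)
  
# Filling missing values with the median
data$NrSiblings <- ifelse(is.na(data$NrSiblings),
                          median_value, data$NrSiblings)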
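One detail matters when dropping rows: on a plain data frame, base R's na.omit() looks at every column, while complete.cases() can be limited to chosen columns. The sketch below contrasts the two on a made-up toy data frame and then applies the median fill.

```r
# Toy data frame with made-up values: one gap in each column.
toy <- data.frame(
  ParentEduc = c("some college", NA, "high school", "some college"),
  NrSiblings = c(3, 1, NA, 0),
  stringsAsFactors = FALSE
)

# na.omit() checks every column, so rows 2 and 3 are both dropped.
dropped_any <- na.omit(toy)

# complete.cases() on a column subset drops only the row missing ParentEduc.
keep <- complete.cases(toy[, "ParentEduc", drop = FALSE])
dropped_subset <- toy[keep, ]

# Median imputation for the remaining numeric gap.
median_value <- median(dropped_subset$NrSiblings, na.rm = TRUE)
dropped_subset$NrSiblings[is.na(dropped_subset$NrSiblings)] <- median_value

print(nrow(dropped_any))
print(dropped_subset)
```

Dropping on a subset keeps rows whose only gaps are in columns that will be imputed afterwards.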


Now we will use a slightly more systematic method, a loop, to fill in the missing values in the remaining columns of the data with each column's most common category.

R




columns <- c("TransportMeans", "IsFirstChild", "TestPrep",
             "ParentMaritalStatus", "PracticeSport")
  
for (column in columns) {
    # Calculate the frequency of each category
    # in the current column
    frequency <- table(data[[column]])
  
    # Finding the most common category
    most_common <- names(frequency)[which.max(frequency)]
  
    # Filling the missing values with the
    # most common category (or Mode)
    data[[column]][is.na(data[[column]])] <- most_common
}
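Inside a loop like this, the column name lives in a variable, and that changes which accessor works. A short sketch on a made-up toy data frame shows the difference between `$` and `[[`:

```r
# Toy data frame with made-up values.
toy <- data.frame(
  TestPrep = c("none", NA, "completed", "none"),
  stringsAsFactors = FALSE
)
column <- "TestPrep"

# `$` looks for a column literally named "column", which does not exist.
literal_lookup <- toy$column

# `[[` evaluates the variable and returns the TestPrep column.
variable_lookup <- toy[[column]]

print(is.null(literal_lookup))
print(variable_lookup)
```

This is why looping over column names calls for data[[column]] rather than data$column.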


Now let's check whether any missing values are left in the dataset.

R




# Seeing missing values 
sum(colSums(is.na(data)))


Output:

0

Descriptive statistical measures help us summarize and explore the dataset at hand without going through each observation. We can get all of them at once using the describe() function from the psych package loaded earlier.

R




describe(data)


Output: [table of descriptive statistics for each column]
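For readers without the psych package, a comparable (much smaller) summary can be assembled in base R. This is a sketch, not a replacement for describe(). The score values are the first five shown in the str() output earlier, and the Gender values are made up.

```r
# Toy data frame: scores taken from the first rows of the str() output,
# Gender values made up for illustration.
toy <- data.frame(
  MathScore = c(71, 69, 87, 45, 76),
  ReadingScore = c(71, 90, 93, 56, 78),
  Gender = c("female", "female", "female", "male", "male")
)

# Keep only the numeric columns, then compute one statistic per column.
numeric_cols <- toy[sapply(toy, is.numeric)]
stats <- data.frame(
  n = sapply(numeric_cols, length),
  mean = sapply(numeric_cols, mean),
  sd = sapply(numeric_cols, sd),
  median = sapply(numeric_cols, median),
  min = sapply(numeric_cols, min),
  max = sapply(numeric_cols, max)
)

print(round(stats, 2))
```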

Welcome, adventurous data enthusiasts! Today we set out on an exciting journey filled with twists, turns, and fun, as we dive into the world of data cleaning and visualization with the R programming language. Grab your virtual backpacks, put on your data detective hats, and get ready to unravel the secrets of a dataset filled with test results and interesting features.
