Data Cleaning & Transformation with Dplyr in R
In R, data formatting typically involves preparing and structuring your data in a way that is suitable for analysis or visualization. The exact steps for data formatting may vary depending on your specific dataset and the analysis you want to perform. Here are some common data formatting tasks in the R Programming Language.
Formatting data in R is an essential part of data preprocessing and analysis. Depending on our specific needs, we want to manipulate data types, change the structure of our data frame, or format variables. Here are some common tasks related to data formatting in R:
Import Data
Use functions like read.csv(), read.table(), read.xlsx() from packages like readr, readxl, or others to import our data into R. Ensure that our data is in a supported file format such as CSV, Excel, or a database.
Checking Data Structure
Use str() to check the structure of your data frame. This function displays the data type and structure of each column in your data frame.
str(your_data_frame)
Changing Data Types
Convert columns to appropriate data types using functions like as.numeric(), as.character(), as.Date(), etc. For example, to convert a column to a numeric type.
# Convert a numeric variable to character
df$numeric_var <- as.character(df$numeric_var)
# Convert a character variable to numeric
df$character_var <- as.numeric(df$character_var)
# Convert a variable to a factor (categorical)
df$factor_var <- as.factor(df$factor_var)
Handling Missing Values
Use functions like is.na(), complete.cases(), or na.omit() to identify and handle missing values (NA) appropriately. You can choose to remove rows with missing values or impute missing values with mean, median, or other strategies.
# Remove rows with missing values
your_data_frame <- your_data_frame[complete.cases(your_data_frame), ]
# Impute missing values with mean
your_data_frame$numeric_column[is.na(your_data_frame$numeric_column)] <- mean
(your_data_frame$numeric_column, na.rm = TRUE)
Renaming Columns
Rename columns using the colnames() function or by directly assigning new column names to the names() attribute of your data frame.
# Rename a column
colnames(your_data_frame)[colnames(your_data_frame) == "old_column_name"] <- "new_column_name"
Merging and Joining Data
Use functions like merge(), join(), or dplyr functions like left_join(), inner_join(), etc., to combine data from multiple sources based on common keys.
# Merge two data frames by a common key
merged_data <- merge(data_frame1, data_frame2, by = "common_key")
Formatting Dates and Times
Use functions like as.Date(), as.POSIXct(), or format() to format date and time columns as needed.
your_data_frame$date_column <- as.Date(your_data_frame$date_column, format = "%Y-%m-%d")
Aggregating Data
Use functions like aggregate() or dplyr functions like group_by() and summarize() to aggregate data based on specific variables.
library(dplyr)
aggregated_data <- your_data_frame %>%
group_by(category_column) %>%
summarize(mean_value = mean(numeric_column))
Remember that data formatting is often a crucial step in data analysis and can significantly impact the accuracy and success of your analyses and visualizations. The specific formatting steps you need to take depend on your dataset and analysis objectives.
Let’s go through some common data formatting tasks
Selecting Columns
We can select specific columns of the dataset using the $ operator or the subset() function.
R
# load the required package library (dplyr) # Select specific columns using the $ operator selected_data <- mtcars$mpg # Select specific columns using subset() selected_data <- subset (mtcars, select = c (mpg, hp)) head (selected_data) |
Output:
mpg hp
Mazda RX4 21.0 110
Mazda RX4 Wag 21.0 110
Datsun 710 22.8 93
Hornet 4 Drive 21.4 110
Hornet Sportabout 18.7 175
Valiant 18.1 105
Filtering Rows
We can filter rows based on certain conditions using the subset() or [ ] indexing.
R
# Filter rows where mpg is greater than 20 filtered_data <- subset (mtcars, mpg > 20) # Alternatively, you can use [ ] indexing filtered_data <- mtcars[mtcars$mpg > 20, ] filtered_data |
Output:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Renaming Columns
We can rename columns using the colnames() function.
R
# Rename the "mpg" column to "MilesPerGallon" colnames (mtcars)[ colnames (mtcars) == "mpg" ] <- "MilesPerGallon" names (mtcars) |
Output:
[1] "MilesPerGallon" "cyl" "disp" "hp" "drat"
[6] "wt" "qsec" "vs" "am" "gear"
[11] "carb"
Sorting Data
You can sort the data by one or more columns using the order() function.
R
# Sort the data by mpg in descending order sorted_data <- mtcars[ order (-mtcars$MilesPerGallon), ] head (sorted_data) |
Output:
MilesPerGallon cyl disp hp drat wt qsec vs am gear carb
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Creating New Variables
We can create new variables based on existing ones.
R
# Create a new variable "MilesPerGallonCategory" based on "mpg" mtcars$MilesPerGallonCategory <- ifelse (mtcars$mpg > 20, "High" , "Low" ) mtcars$MilesPerGallonCategory |
Output:
[1] "High" "High" "High" "High" "Low" "Low" "Low" "High" "High" "Low" "Low" "Low" "Low"
[14] "Low" "Low" "Low" "Low" "High" "High" "High" "High" "Low" "Low" "Low" "Low" "High"
[27] "High" "High" "Low" "Low" "Low" "High"
Aggregating Data
We can summarize data using aggregation functions like mean(), sum(), etc.
R
# Calculate the mean mpg for each number of cylinders mean_mpg_by_cyl <- aggregate (mtcars$MilesPerGallon, by = list (mtcars$cyl), FUN = mean) colnames (mean_mpg_by_cyl) <- c ( "Cylinders" , "MeanMPG" ) mean_mpg_by_cyl |
Output:
Cylinders MeanMPG
1 4 26.66364
2 6 19.74286
3 8 15.10000
Merging Data
We can merge datasets using functions like merge().
R
# Create a second dataset for merging df2 <- data.frame (Car = rownames (mtcars), Price = runif ( nrow (mtcars), min = 10000, max = 50000)) # Merge the two datasets by the "Car" column merged_data <- merge (mtcars, df2, by.x = "row.names" , by.y = "Car" ) head (merged_data) |
Output:
Row.names mpg cyl disp hp drat wt qsec vs am gear carb
1 AMC Javelin 15.2 8 304 150 3.15 3.435 17.30 0 0 3 2
2 Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
3 Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
4 Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
5 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
6 Dodge Challenger 15.5 8 318 150 2.76 3.520 16.87 0 0 3 2
MilesPerGallonCategory Price
1 Low 49797.46
2 Low 35230.94
3 Low 14315.48
4 Low 36384.85
5 High 27144.52
6 Low 10147.37
These are some basic data formatting and manipulation tasks we can perform in R using built-in functions and datasets like mtcars. Depending on your specific data analysis needs, you may also want to explore packages like dplyr and tidyr for more advanced data manipulation tasks.
Contact Us