Loading and Cleaning Data with R and the tidyverse
The tidyverse is a collection of packages that work well together due to shared data representations and API design. The tidyverse package is intended to make it simple to install and load core tidyverse packages with a single command.
To install tidyverse, put the following code in RStudio:
R
# Install from CRAN install.packages ( "tidyverse" ) # to check your installation library (tidyverse) |
Output:
ββ Attaching packages βββββββββββββββββββββββββββββββββββββββββββββββββββββββ tidyverse 1.3.2 ββ β ggplot2 3.3.6 β purrr 0.3.5 β tibble 3.1.8 β dplyr 1.0.10 β tidyr 1.2.1 β stringr 1.4.1 β readr 2.1.3 β forcats 0.5.2 ββ Conflicts ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ tidyverse_conflicts() ββ β dplyr::filter() masks stats::filter() β dplyr::lag() masks stats::lag()
The tidyr package will be used for data cleaning, and the readr package will be used for data loading.
Data loading using readr
Dear Friends, In this tutorial, we will read and parse a CSV file using the readr packageβs read CSV function. CSV (Comma-Separated Values) files contain data separated by commas. The following CSV file will be used in the following example. To begin, pass the path to the file to be read to the read_csv function. The read CSV function generates tibbles that can be attached to variables.
R
# load the tidyverse by running this code: library (tidyverse) # create a tibble named rand rand <- read_csv ( "Example.csv" ) |
Output:
chr (2): ββ Column specification ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ Delimiter: "," chr (2): Code, Age_single_years dbl (2): Census_night_population_count, Census_usually_resident_population_count βΉ Use `spec()` to retrieve the full column specification for this data. βΉ Specify the column types or set `show_col_types = FALSE` to quiet this message
Inline CSV input is very useful, and these options can also help you in normal file parsing too.
R
# give inline csv input read_csv ("a,b,c 1,2,3 4,5,6") |
Output:
a b c <dbl> <dbl> <dbl> 1 1 2 3 2 4 5 6
The first line of a CSV file is the name of the columns. However, there are other options for dealing with exceptions.
R
read_csv ("first line of metadata second line of metadata a,b,c 1,2,3", skip = 2) |
Output:
x y z <dbl> <dbl> <dbl> 1 1 2 3
R
# when we need to ignore comments in csv file read_csv (" #ignore it is a comment #ignore this is another comment x,y,z 1,2,3 4,5,6 ", comment = " #") |
Output:
x y z <dbl> <dbl> <dbl> 1 1 2 3 2 4 5 6
If the first line is not the name of the columns, then we can do this
R
# If you do not set column names then R does it for you. # The false flag tells the computer that the # first line is not column names. read_csv ( "1,2,3\n4,5,6" , col_names= FALSE ) |
Output:
X1 X2 X3 <dbl> <dbl> <dbl> 1 1 2 3 2 4 5 6
R
# You can set custom column names read_csv ( "1,2,3\n4,5,6" , col_names = c ( "COLUMN1" , "COLUMN2" , "COLUMN3" )) |
Output:
COLUMN1 COLUMN2 COLUMN3 <dbl> <dbl> <dbl> 1 1 2 3 2 4 5 6
R
# you can use na to represent missing data read_csv ( "a,b,c\n1,2,." , na = "." ) |
Output:
a b c <dbl> <dbl> <lgl> 1 1 2 NA
Cleaning Data with tidyverse (What even is tidy data?)
There are three rules of tidy data.
- Every variable is a column.
- Every observation is a row.
- And every value is a cell.
First, see examples of tidy and untidy data.
R
# tidy data table1 |
Output:
country year cases population <chr> <int> <int> <int> 1 Afghanistan 1999 745 19987071 2 Afghanistan 2000 2666 20595360 3 Brazil 1999 37737 172006362 4 Brazil 2000 80488 174504898 5 China 1999 212258 1272915272 6 China 2000 213766 1280428583
Examples of untidy data and how to deal with it.
pivot_wider()
In table2, a single observation is scattered across several rows, this can be fixed by using the pivot_wider() option
- The column to take variable names from. Here, itβs typed.
- The column to take values from. Here it counts.
R
table2 |
Output:
country year type count <chr> <int> <chr> <int> 1 Afghanistan 1999 cases 745 2 Afghanistan 1999 population 19987071 3 Afghanistan 2000 cases 2666 4 Afghanistan 2000 population 20595360 5 Brazil 1999 cases 37737 6 Brazil 1999 population 172006362 7 Brazil 2000 cases 80488 8 Brazil 2000 population 174504898 9 China 1999 cases 212258 10 China 1999 population 1272915272 11 China 2000 cases 213766 12 China 2000 population 1280428583
R
table2 %>% pivot_wider (names_from=type, values_from=count) |
Output:
country year cases population <chr> <int> <int> <int> 1 Afghanistan 1999 745 19987071 2 Afghanistan 2000 2666 20595360 3 Brazil 1999 37737 172006362 4 Brazil 2000 80488 174504898 5 China 1999 212258 1272915272 6 China 2000 213766 1280428583
separate()
In table three, we have to separate two values in a column.
R
table3 |
Output:
country year rate * <chr> <int> <chr> 1 Afghanistan 1999 745/19987071 2 Afghanistan 2000 2666/20595360 3 Brazil 1999 37737/172006362 4 Brazil 2000 80488/174504898 5 China 1999 212258/1272915272 6 China 2000 213766/1280428583
R
table3 %>% separate (rate, into = c ( "cases" , "population" ), sep = "/" ) |
Output:
country year cases population <chr> <int> <chr> <chr> 1 Afghanistan 1999 745 19987071 2 Afghanistan 2000 2666 20595360 3 Brazil 1999 37737 172006362 4 Brazil 2000 80488 174504898 5 China 1999 212258 1272915272 6 China 2000 213766 1280428583
βcasesβ and βpopulationβ are character columns, which is the default behavior in separate(). It leaves the type of the column as it is, we can convert to better types using βconvert = TRUEβ.
R
table3 %>% separate (rate, into= c ( "cases" , "population" ), convert= TRUE ) |
Output:
country year cases population <chr> <int> <int> <int> 1 Afghanistan 1999 745 19987071 2 Afghanistan 2000 2666 20595360 3 Brazil 1999 37737 172006362 4 Brazil 2000 80488 174504898 5 China 1999 212258 1272915272 6 China 2000 213766 1280428583
You can also pass an integer to βsepβ which will interpret the integers as positions to split at. Indexing from 1 from the left and -1 from the right.
R
table3 %>% separate (year, into= c ( "century" , "year" ), sep=2) |
Output:
country century year rate <chr> <chr> <chr> <chr> 1 Afghanistan 19 99 745/19987071 2 Afghanistan 20 00 2666/20595360 3 Brazil 19 99 37737/172006362 4 Brazil 20 00 80488/174504898 5 China 19 99 212258/1272915272 6 China 20 00 213766/1280428583
pivot_longer()
When some column names are not names of variables, but values of a variable.
- The set of columns whose names are values, not variables. In this example, those are the columns β1999β and β2000β.
- The name of the variable to move the column names to here is βyearβ.
- The name of the variable to move the column values to here itβs βcasesβ.
R
table4a |
Output:
country `1999` `2000` * <chr> <int> <int> 1 Afghanistan 745 2666 2 Brazil 37737 80488 3 China 212258 213766
R
table4a %>% pivot_longer ( c (`1999`, `2000`), names_to= "year" , values_to= "cases" ) |
Output:
country year cases <chr> <chr> <int> 1 Afghanistan 1999 745 2 Afghanistan 2000 2666 3 Brazil 1999 37737 4 Brazil 2000 80488 5 China 1999 212258 6 China 2000 213766
unite()
Use βunite()β to rejoin the century and year columns that we created in the last example. βunite()β takes a tibble and the name of the new variable to create, and a column to combine.
R
table5 %>% unite (new, century, year) |
Output:
country new rate <chr> <chr> <chr> 1 Afghanistan 19_99 745/19987071 2 Afghanistan 20_00 2666/20595360 3 Brazil 19_99 37737/172006362 4 Brazil 20_00 80488/174504898 5 China 19_99 212258/1272915272 6 China 20_00 213766/1280428583
We also need to use the sep argument, because by default R will place an underscore (_) between the values from different columns. Here we donβt want any separator, so we use ββ,
R
table5 %>% unite (new, century, year, sep = "" ) |
Output:
country new rate <chr> <chr> <chr> 1 Afghanistan 1999 745/19987071 2 Afghanistan 2000 2666/20595360 3 Brazil 1999 37737/172006362 4 Brazil 2000 80488/174504898 5 China 1999 212258/1272915272 6 China 2000 213766/1280428583
Missing values
A value can be missing in two ways,
- Explicitly β NA.
- Implicitly β not present in the data.
R
table98 <- tibble ( country = c ( "Afghanistan" , "Afghanistan" , "Brazil" , "China" , "China" ), year = c (1999, 2000, 1999, 1999, 2000), cases = c ( 745, 2666, 37737, 80488, 212258), population = c (19987071, 20595360, 172006362, NA , 1280428583) ) |
There are two missing values here,
- The population of 1999 βChinaβ is explicitly missing because its cell has NA.
- The population of 2000 βBrazilβ is explicitly missing because it doesnβt appear in the data.
We can make the implicit missing value explicit by putting years in the columns:
R
table98 %>% pivot_wider (names_from=year, values_from=population) |
Output:
country cases `1999` `2000` <chr> <dbl> <dbl> <dbl> 1 Afghanistan 745 19987071 NA 2 Afghanistan 2666 NA 20595360 3 Brazil 37737 172006362 NA 4 China 80488 NA NA 5 China 212258 NA 1280428583
You can set βvalues_drop_na = TRUEβ in βpivot_longer()β to turn explicit missing values implicit,
R
table98 %>% pivot_wider (names_from = year, values_from = population) %>% pivot_longer ( cols = c (`1999`, `2000`), names_to = "year" , values_to = "population" , values_drop_na = TRUE ) |
Output:
country cases year population <chr> <dbl> <chr> <dbl> 1 Afghanistan 745 1999 19987071 2 Afghanistan 2666 2000 20595360 3 Brazil 37737 1999 172006362 4 China 212258 2000 1280428583
complete()
complete() takes a set of columns, and finds all unique combinations, filling in explicit NAs where necessary.
R
table98 %>% complete (year, cases) |
Output:
year cases country population <dbl> <dbl> <chr> <dbl> 1 1999 745 Afghanistan 19987071 2 1999 2666 NA NA 3 1999 37737 Brazil 172006362 4 1999 80488 China NA 5 1999 212258 NA NA 6 2000 745 NA NA 7 2000 2666 Afghanistan 20595360 8 2000 37737 NA NA 9 2000 80488 NA NA 10 2000 212258 China 1280428583
fill()
fill in those missing values with fill(). Replaces missing values with the most recent non-missing value (sometimes called the last observation carried forward).
R
treatment <- tribble ( ~ person, ~ treatment, ~response, "Gautam" , 1, 7, NA , 2, 10, NA , 3, 9, "heema" , 1, 4 ) treatment %>% fill (person) |
Output:
person treatment response <chr> <dbl> <dbl> 1 Gautam 1 7 2 Gautam 2 10 3 Gautam 3 9 4 heema 1 4
Contact Us