Chapter 7 Cleaning Data
Unlike any textbook data, real-world data is messy and hardly ready for analysis. Some common problems with real-world data are incorrect data types and space-padded text. On top of that, R
prefers data in a certain way. For example, unknown values are noted as NA
s, and for character or factor data types, the plotting order of axis labels depends on the order of underlying data. We’ll take a look at some of these cleaning operations, but first let’s load the data using the library readr
:
library(readr)
donor_data <- read_csv("data/DonorSampleData.csv")
We will use the dplyr
library’s glimpse
function to peek into the data. You can also see the summary of the data by using the summary(donor_data)
command.
library(dplyr)
glimpse(donor_data)
#> Observations: 34,508
#> Variables: 22
#> $ ID <int> 1, 2, 3, 4, 5, 6,...
#> $ ZIPCODE <chr> "23187", "77643",...
#> $ AGE <int> NA, 33, NA, 31, 6...
#> $ MARITAL_STATUS <chr> "Married", "Unkno...
#> $ GENDER <chr> "Female", "Female...
#> $ MEMBERSHIP_IND <chr> "N", "N", "N", "N...
#> $ ALUMNUS_IND <chr> "N", "Y", "N", "Y...
#> $ PARENT_IND <chr> "N", "N", "N", "N...
#> $ HAS_INVOLVEMENT_IND <chr> "N", "Y", "N", "Y...
#> $ WEALTH_RATING <chr> NA, NA, NA, NA, N...
#> $ DEGREE_LEVEL <chr> NA, "UB", NA, NA,...
#> $ PREF_ADDRESS_TYPE <chr> "HOME", NA, "HOME...
#> $ EMAIL_PRESENT_IND <chr> "N", "Y", "N", "Y...
#> $ CON_YEARS <int> 1, 0, 1, 0, 0, 0,...
#> $ PrevFYGiving <chr> "$0", "$0", "$0",...
#> $ PrevFY1Giving <chr> "$0", "$0", "$0",...
#> $ PrevFY2Giving <chr> "$0", "$0", "$0",...
#> $ PrevFY3Giving <chr> "$0", "$0", "$0",...
#> $ PrevFY4Giving <chr> "$0", "$0", "$0",...
#> $ CurrFYGiving <chr> "$0", "$0", "$200...
#> $ TotalGiving <chr> "$10", "$2,100", ...
#> $ DONOR_IND <chr> "Y", "Y", "Y", "N...
As you see from the glimpse of the data, all columns, except for numeric columns, are character or string type. This is the default setting for the readr
package, compared to the default setting of the factor using read.csv
from base R
. Reading data as character type helps to clean it easily because factors are stored by their indices compared to the literal string values of the characters.
7.1 Remove Extra Spaces
Often, data-entry errors cause extra spaces around text. For example, “John Smith
” might be entered as “John Smith
”. The library stringr
offers a convenient way to trim all the spaces around text.
library(stringr)
str_trim('John Smith ')
#> [1] "John Smith"
str_trim(' John Smith ')
#> [1] "John Smith"
Let’s remove spaces around the PREF_ADDRESS_TYPE
column in our data:
donor_data$PREF_ADDRESS_TYPE <- str_trim(
donor_data$PREF_ADDRESS_TYPE)
What if we wanted to trim extra spaces from all character columns? We can use the mutate_if
function from the dplyr
package. This function will apply any arbitrary function to the selected columns, matched by a condition. For our purposes, we can select all the columns that are of character type using the is.character
function.
is.character(donor_data$PREF_ADDRESS_TYPE)
#> [1] TRUE
is.character(donor_data$MARITAL_STATUS)
#> [1] TRUE
is.character(donor_data$GENDER)
#> [1] TRUE
donor_data <- mutate_if(donor_data,
.predicate = is.character,
.funs = str_trim)
7.2 Change Data Types
As we saw from the glimpse above, there are a few columns that are indicator variables (with yes or no values). For various types of analyses, it is useful to convert them to factor type variables. We will use the ends_with
function from the dplyr
library to select all columns whose names end with “_IND“. Then, we will convert these selected columns to factors.
donor_data <- mutate_at(donor_data,
.vars = vars(ends_with("_IND")),
.funs = as.factor)
7.3 Replace Values with NA
If your data has text values that denotes missing values, we need to explicitly convert those to NA
for R
to treat them as missing values. The MARITAL_STATUS
column has the value of Unknown
for a majority of the rows. Let’s convert those to NA
using the ifelse
function. The ifelse
function will test for a condition and return some other values based on the result of the test. In this case, we will test whether the MARITAL_STATUS
column has the value of Unknown
; if the result is true, then we return NA
, else we return the original value.
donor_data$MARITAL_STATUS <- with(
donor_data,
ifelse(MARITAL_STATUS == 'Unknown',
NA,
MARITAL_STATUS))
7.4 Change Order of Values
For a factor column, the order of the elements in that column matters for various labeling operations. If the order is important to you, you should let R
know that order. The following is how to do so.
donor_data$WEALTH_RATING <- with(
donor_data,
factor(WEALTH_RATING,
levels = c('$1-$24,999', '$25,000-$49,999',
'$50,000-$99,999', '$100,000-$249,999',
'$250,000-$499,999', '$500,000-$999,999',
'$1,000,000-$2,499,999',
'$2,500,000-$4,999,999',
'$5,000,000-$9,999,999',
'$10,000,000-$24,999,999'),
ordered = TRUE))
7.5 Clean ZIP Codes
ZIP Codes are often problematic in addresses because of data-entry errors or the presence of international ZIP Codes. We will use the zipcode
library and the clean.zipcodes
function to clean and limit the ZIP Codes to only US ZIP Codes.
The following are some erroneous ZIP Codes.
#> # A tibble: 6 x 1
#> ZIPCODE
#> <chr>
#> 1 NA-1175
#> 2 NA-2179
#> 3 NA-2245
#> 4 NA-4919
#> 5 NA-5419
#> 6 NA-6653
And some ZIP Codes in the zip + 4 format.
#> # A tibble: 6 x 1
#> ZIPCODE
#> <chr>
#> 1 92555-6454
#> 2 90265-5223
#> 3 90265-5785
#> 4 14845-5217
#> 5 33433-3803
#> 6 90265-4485
Let’s clean them.
library(zipcode)
head(donor_data$ZIPCODE)
#> [1] "23187" "77643" NA
#> [4] "47141" "92555-6454" "95191"
donor_data$ZIPCODE <- clean.zipcodes(donor_data$ZIPCODE)
head(donor_data$ZIPCODE)
#> [1] "23187" "77643" NA "47141" "92555"
#> [6] "95191"
7.6 Manipulate Dates
Another common problem with the messy data is incorrect or missing date formats. The lubridate
library makes it easy to perform various date-based operations.
Let’s say we want to add a column for birth dates. We don’t have the actual birth date, but only the age in our sample data. Let’s look at the wrong way of doing this first.
library(lubridate)
# Wrong Way. Only Subtracted Days
donor_data %>% mutate(BIRTH_DATE = as_date(today() - AGE)) %>%
select(BIRTH_DATE)
#> # A tibble: 34,508 x 1
#> BIRTH_DATE
#> <date>
#> 1 NA
#> 2 2017-12-26
#> 3 NA
#> 4 2017-12-28
#> 5 2017-11-21
#> 6 2017-12-02
#> # ... with 3.45e+04 more rows
The correct way is to subtract the age from the current year and then create a date field. We will use random months and days.
# Correct Way
donor_data$BIRTH_DATE <- make_date(
year = year(today()) - donor_data$AGE,
month = sample(1:12, replace = TRUE),
day = sample(1:26, replace = TRUE))
head(donor_data$BIRTH_DATE)
#> [1] NA "1985-06-16" NA
#> [4] "1987-12-03" "1950-09-11" "1961-01-23"
7.7 Remove Non-Numeric Characters
Often, numeric data columns have non-numeric characters, which create problems in further analysis. A common example is seeing the dollar sign ($) and/or commas in currency columns. The stringr
library offers the str_replace_all
function to replace such offending characters with something else.
Let’s remove the dollar sign from the TotalGiving
column.
head(donor_data$TotalGiving)
#> [1] "$10" "$2,100" "$200" "$0" "$505"
#> [6] "$0"
donor_data$TotalGiving <- str_replace_all(
string = donor_data$TotalGiving,
pattern = "\\$",
replacement = "")
head(donor_data$TotalGiving)
#> [1] "10" "2,100" "200" "0" "505"
#> [6] "0"
R
that we’re looking for a literal symbol by using a backward slash. Unfortunately, a backward slash is a reserved symbol for R
but we escaped that by using another backward slash.
Let’s remove any commas in the TotalGiving
column.
donor_data$TotalGiving <- str_replace_all(
string = donor_data$TotalGiving,
pattern = ",",
replacement = "")
7.8 Convert from Character to Numeric Data Type
Now that the TotalGiving
column is clean, let’s save it as a numeric column using the as.numeric
function.
donor_data$TotalGiving <- as.numeric(donor_data$TotalGiving)
7.9 Export Cleaned Data File
In the following chapters, we will use this cleaned version. Let’s save this data frame using the write_csv
function from the readr
library:
write_csv(donor_data, "data/DonorSampleDataCleaned.csv")