Missing data functions

The problem

Real-world data is rarely clean and ready to use. You usually have to clean it first, and both the problems and their fixes depend on the specific data set and its context.

Varying solutions

  • Understand the scope of the problem, both row by row and column by column.
  • Make sure R can detect the missing data easily and consistently. Address this first.
  • Decide on a policy for handling missing data for each column and for each row before you continue.
  • Apply that policy to each column.
  • Only after addressing every column, convert the survey data from wide to long (if you need to).
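
The first two steps above can be sketched as follows. This is a minimal illustration rather than a prescribed method: the responses data frame, its column names, and its values are all hypothetical, and it assumes that blank strings are the only disguised form of missing data.

```r
# Hypothetical survey-like data; names and values are illustrative only.
responses <- data.frame(
  ID     = c("r1", "r2", "r3", "r4"),
  NPS    = c(8, NA, 10, 7),
  Grades = c("Agree", "", "Neutral", NA),
  stringsAsFactors = FALSE
)

# Make blanks detectable: recode empty strings as NA.
# The !is.na() guard keeps the logical index itself free of NA values.
blank <- !is.na(responses) & responses == ""
responses[blank] <- NA

# Scope of the problem on a column basis: number of NA values per column.
na_per_column <- colSums(is.na(responses))
na_per_column   # ID 0, NPS 1, Grades 2
```

If the blanks come from a CSV file, the na.strings argument of read.csv() can do the same recoding at import time.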

Functions to help deal with missing values

When doing your analyses, you will want to be clear about the following:

  • How prevalent NA and NaN values are, and
  • How you want to handle NA and NaN values in your analysis—do you want to include them or exclude them from your calculations?
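
Both questions can be answered with base R. The sketch below uses a hypothetical scores vector. It measures prevalence as the share of missing entries, then contrasts a calculation that keeps the missing values with one that drops them through na.rm = TRUE.

```r
# Hypothetical numeric responses containing both NA and NaN.
scores <- c(8, NA, 10, NaN, 7)

# Prevalence: is.na() is TRUE for NA and for NaN, so the mean of the
# logical vector is the share of missing entries.
share_missing <- mean(is.na(scores))        # 0.4

# Including the missing values makes the result itself missing.
mean_incl <- mean(scores)

# Excluding them: na.rm = TRUE drops NA and NaN before averaging.
mean_excl <- mean(scores, na.rm = TRUE)     # (8 + 10 + 7) / 3
```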

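Sometimes you want to examine a list to see if there are missing values. Let’s quickly define a list and test it with is.na() and is.nan(), then do the same for a data frame and summarize it with skim() from the skimr package. Note that sapply() simplifies the mixed-type list to a character vector, so the NaN values come back as the string "NaN":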

lst <- list("a", 3, TRUE, FALSE, NA, NaN, 0/0)
is.na(lst)
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
sapply(lst, function(x) ifelse(is.nan(x), NaN, x))
[1] "a"     "3"     "TRUE"  "FALSE" NA      "NaN"   "NaN"  
df <- data.frame(a=c(1, 2, 3), b=c(5, NA, NaN))
is.na(df)
         a     b
[1,] FALSE FALSE
[2,] FALSE  TRUE
[3,] FALSE  TRUE
sapply(df, function(x) ifelse(is.nan(x), NaN, x))
     a   b
[1,] 1   5
[2,] 2  NA
[3,] 3 NaN
library(skimr)  # provides skim()
skim(df)
Data summary
Name                     df
Number of rows           3
Number of columns        2
_______________________
Column type frequency:
  numeric                2
________________________
Group variables          None

Variable type: numeric

skim_variable  n_missing  complete_rate  mean  sd  p0  p25  p50  p75  p100  hist
a                      0           1.00     2   1   1  1.5    2  2.5     3  ▇▁▇▁▇
b                      2           0.33     5  NA   5  5.0    5  5.0     5  ▁▁▇▁▁
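
The outputs above treat NA and NaN alike, because is.na() is TRUE for both. When the distinction matters, is.nan() can separate them. The sketch below is one way to do that. It redefines the same df as above so that it runs on its own, and the variable names are illustrative.

```r
values <- c(5, NA, NaN)

is.na(values)    # FALSE  TRUE  TRUE  (NaN counts as missing)
is.nan(values)   # FALSE FALSE  TRUE  (NA is not NaN)

# Entries that are NA but not NaN.
only_na <- is.na(values) & !is.nan(values)   # FALSE  TRUE FALSE

# is.nan() has no data frame method, so apply it column by column.
df <- data.frame(a = c(1, 2, 3), b = c(5, NA, NaN))
nan_by_col <- sapply(df, is.nan)     # logical matrix, one column per variable
nan_per_column <- colSums(nan_by_col)  # a 0, b 1
```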

This is how you retrieve a specific observation (row) from a data frame:

survey[6,]
  Year           ID NPS  Field ClassLevel    Status Gender BirthYear FinPL
6 2012 mdoqvaalcscx   8 Undecl         Sr Part-time Female      1988   Yes
  FinSch FinGov FinSelf FinPar FinOther TooDifficult NotRelevant
6    Yes     No     Yes     No       No                         
       PoorTeaching UnsuppFac Grades          Sched ClassTooBig BadAdvising
6 Strongly Disagree   Neutral  Agree Strongly Agree        <NA>    Disagree
  FinAid   OverallValue
6   <NA> Strongly Agree

The anyNA(x) function reports whether x contains any missing values. Like is.na(), it treats NaN as missing:

anyNA(lst)
[1] TRUE

Here we use it on the 6th row of survey:

anyNA(survey[6,])
[1] TRUE

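The following call returns the positions of the NA values. Because survey[6,] is a single row, these positions are column numbers: column 21 is ClassTooBig and column 23 is FinAid.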

which(is.na(survey[6,]))
[1] 21 23
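
Column positions are hard to read on their own, so it can help to translate them into column names. The sketch below uses a hypothetical one-row data frame, one_row, as a stand-in for survey[6,]. It relies on the input having exactly one row, since only then do the positions from which() line up with the columns.

```r
# Hypothetical stand-in for a single survey row.
one_row <- data.frame(
  ID          = "r6",
  ClassTooBig = NA,
  BadAdvising = "Disagree",
  FinAid      = NA
)

# Positions of the NA values, then the matching column names.
na_positions <- which(is.na(one_row))    # 2 4
na_columns   <- names(one_row)[na_positions]
na_columns   # "ClassTooBig" "FinAid"
```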

The following counts how many values in the row are NA. The blank TooDifficult and NotRelevant entries are not NA, so they are not included in the count:

sum(is.na(survey[6,]))
[1] 2
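
The same count can be taken for every row at once instead of one row at a time. This is a sketch on a small hypothetical data frame (small_survey, with made-up values) and is not the survey data used above.

```r
# Hypothetical data; names and values are illustrative only.
small_survey <- data.frame(
  NPS    = c(8, NA, 10),
  Grades = c("Agree", NA, "Neutral"),
  FinAid = c(NA, NA, "Agree")
)

# Number of NA values in each row.
na_per_row <- rowSums(is.na(small_survey))    # 1 3 0

# Row numbers that contain at least one NA.
rows_with_na <- which(na_per_row > 0)         # 1 2
```

Rows with a high count are candidates for a row-level policy, such as dropping the observation entirely.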