Manipulate survey data

Author

Scott Moore

Published

October 4, 2024

This file imports, transforms, and then exports the data that I created for the Assessment Institute meeting in Indianapolis in October 2024.

1 Setup & import libraries

Standard library import steps. Enables all that is to come.

library(tidyverse)
library(skimr)

2 Import data

Data in CSV format with headers.

survey <- read_csv("data/retention_survey_history.csv")

3 Examine data

Show us a bit of information about the imported data.

glimpse(survey)
Rows: 33,524
Columns: 24
$ Year         <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 201…
$ ID           <chr> "xuojqdfdozvu", "vvwkinqvnibo", "ibyjcmiopiqk", "lsqamawy…
$ NPS          <dbl> 8, 8, 8, 8, 8, 8, 8, 4, 4, 7, 6, 6, 6, 4, 7, 4, 6, 7, 4, …
$ Field        <chr> "Undecl", "SocSci", "CompSci", "Other", "SocSci", "Undecl…
$ ClassLevel   <chr> "Sr", "Sr", "Sr", "Sr", "Sr", "Sr", "Sr", "Fresh", "Fresh…
$ Status       <chr> "Full-time", "Full-time", "Full-time", "Full-time", "Full…
$ Gender       <chr> "Male", "Female", "Male", "Female", "Other", "Female", "M…
$ BirthYear    <dbl> 1990, 1999, 1999, 1989, 1993, 1988, 1996, 1991, 1990, 198…
$ FinPL        <chr> "No", "No", "No", "No", "Yes", "Yes", "No", "No", "No", "…
$ FinSch       <chr> "No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No", "No",…
$ FinGov       <chr> "No", "Yes", "Yes", NA, "No", "No", "No", "No", "No", "No…
$ FinSelf      <chr> "No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No",…
$ FinPar       <chr> "Yes", "No", NA, "Yes", "No", "No", "No", "No", "No", "Ye…
$ FinOther     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No…
$ TooDifficult <chr> "Disagree", "Strongly Disagree", "Disagree", "Agree", NA,…
$ NotRelevant  <chr> "Strongly Disagree", "Disagree", NA, "Strongly Disagree",…
$ PoorTeaching <chr> "Agree", "Agree", "Agree", "Disagree", "Agree", "Strongly…
$ UnsuppFac    <chr> NA, "Neutral", "Neutral", "Neutral", NA, "Neutral", "Stro…
$ Grades       <chr> NA, "Disagree", "Strongly Disagree", NA, NA, "Agree", NA,…
$ Sched        <chr> "Strongly Agree", "Strongly Disagree", "Strongly Disagree…
$ ClassTooBig  <chr> "Neutral", NA, "Strongly Disagree", "Strongly Disagree", …
$ BadAdvising  <chr> "Disagree", "Disagree", NA, "Disagree", "Strongly Disagre…
$ FinAid       <chr> "Strongly Agree", "Strongly Agree", "Strongly Agree", "Ag…
$ OverallValue <chr> "Strongly Agree", "Strongly Agree", "Neutral", "Strongly …
names(survey)
 [1] "Year"         "ID"           "NPS"          "Field"        "ClassLevel"  
 [6] "Status"       "Gender"       "BirthYear"    "FinPL"        "FinSch"      
[11] "FinGov"       "FinSelf"      "FinPar"       "FinOther"     "TooDifficult"
[16] "NotRelevant"  "PoorTeaching" "UnsuppFac"    "Grades"       "Sched"       
[21] "ClassTooBig"  "BadAdvising"  "FinAid"       "OverallValue"
skim(survey)
Data summary
Name survey
Number of rows 33524
Number of columns 24
_______________________
Column type frequency:
character 21
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ID 0 1.00 12 12 0 33524 0
Field 12 1.00 2 9 0 15 0
ClassLevel 0 1.00 2 5 0 5 0
Status 627 0.98 5 9 0 3 0
Gender 875 0.97 4 6 0 3 0
FinPL 1674 0.95 2 3 0 2 0
FinSch 1615 0.95 2 3 0 2 0
FinGov 1586 0.95 2 3 0 2 0
FinSelf 1622 0.95 2 3 0 2 0
FinPar 1699 0.95 2 3 0 2 0
FinOther 1699 0.95 2 3 0 2 0
TooDifficult 6645 0.80 5 17 0 5 0
NotRelevant 6622 0.80 5 17 0 5 0
PoorTeaching 6819 0.80 5 17 0 5 0
UnsuppFac 6676 0.80 5 17 0 5 0
Grades 6637 0.80 5 17 0 5 0
Sched 6684 0.80 5 17 0 5 0
ClassTooBig 6634 0.80 5 17 0 5 0
BadAdvising 6735 0.80 5 17 0 5 0
FinAid 6676 0.80 5 17 0 5 0
OverallValue 6631 0.80 5 17 0 5 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1 2014.50 1.71 2012 2013 2014 2016 2017 ▇▅▅▃▅
NPS 0 1 5.95 1.52 4 4 6 7 8 ▇▁▆▅▅
BirthYear 0 1 1994.01 3.76 1988 1991 1994 1997 2000 ▇▅▇▅▇

4 Fix NA values

Just in case any columns represent NA values in a variety of ways, we can standardize the representation before we continue to simplify any logic later in the process.

4.1 Create standardized representation of NA

Actually do the transformation.

survey <-
  survey |> 
    mutate(across(c(Field:Gender,
                    FinPL:OverallValue),
                  ~ case_when(is.na(.x) ~ NA,
                              .x == "" ~ NA,
                              .x == "NA" ~ NA,
                              .x == "--" ~ NA,
                              .default = .x)))

4.2 Ensure that it works

Just what it says — ensure that it works. This counts the number of appearances in every single column of the non-standard ways in which NA values might have been represented. We should see nothing but zero (0) values in the table below.

survey |> 
  summarize(across(everything(),
                   ~sum(.x %in% c("NA", "--", "")))) |> 
  print(width = Inf)
# A tibble: 1 × 24
   Year    ID   NPS Field ClassLevel Status Gender BirthYear FinPL FinSch FinGov
  <int> <int> <int> <int>      <int>  <int>  <int>     <int> <int>  <int>  <int>
1     0     0     0     0          0      0      0         0     0      0      0
  FinSelf FinPar FinOther TooDifficult NotRelevant PoorTeaching UnsuppFac Grades
    <int>  <int>    <int>        <int>       <int>        <int>     <int>  <int>
1       0      0        0            0           0            0         0      0
  Sched ClassTooBig BadAdvising FinAid OverallValue
  <int>       <int>       <int>  <int>        <int>
1     0           0           0      0            0

5 Factors

5.1 Define factors

Define factors for as many fields as we can. Specify its order when we can so that later analyses are better organized.

fin_cols <- c("FinPL", "FinSch", "FinGov", 
              "FinSelf", "FinPar", "FinOther")
exp_cols <- c("TooDifficult", "NotRelevant", "PoorTeaching",
              "UnsuppFac", "Grades", "Sched", "ClassTooBig",
              "BadAdvising", "FinAid", "OverallValue")
fin_levels <- c("Yes", "No")
exp_levels <- c("Strongly Disagree", "Disagree", "Neutral", "Agree",
                "Strongly Agree")
class_levels <- c("Fresh", "Soph", "Jr", "Sr")

survey <-
  survey |> 
    mutate(across(all_of(exp_cols), 
                  ~factor(.x, 
                          levels = exp_levels, 
                          ordered = TRUE)))
survey <-
  survey |> 
    mutate(across(all_of(fin_cols),
                  ~factor(.x, 
                          levels = fin_levels, 
                          ordered = TRUE)))
survey <-
  survey |> 
    mutate(ClassLevel = factor(ClassLevel,
                               levels = class_levels,
                               ordered = TRUE))
survey <-
  survey |> 
    mutate(Status = factor(Status,
                           levels = c("Full-time", "Part-time",
                           "Other")))
survey <-
  survey |> 
    mutate(Gender = factor(Gender,
                           levels = c("Female", "Male", "Other")))

We want to handle Field of study somewhat differently. It has NA values in the column, but we already have Undecl and Other as values. We would rather include the NA values in the analysis, but we want to be able to see if the results for that value differ from the results of these other two values. Let’s make Unknown an accepted value for Field.

poss_fields <- c("LifeSci", "PhysSci", "PubHealth", "Nurs", "OthHealth",
                 "PubAdm", "SocSci", "ArtsHum",
                 "CompSci", "Eng", "Bus", "Ed", "ArchUP",
                 "Other", "Undecl", "Unknown")
survey$Field <- 
  survey$Field |> 
  replace_na("Unknown")
survey <-
  survey |> 
    mutate(Field = factor(Field,
                          levels = poss_fields))

5.2 Validate the factor creation process

Let’s show the structure of the table. This will allow us to validate that all of our factor declarations above worked.

str(survey)
tibble [33,524 × 24] (S3: tbl_df/tbl/data.frame)
 $ Year        : num [1:33524] 2012 2012 2012 2012 2012 ...
 $ ID          : chr [1:33524] "xuojqdfdozvu" "vvwkinqvnibo" "ibyjcmiopiqk" "lsqamawyancj" ...
 $ NPS         : num [1:33524] 8 8 8 8 8 8 8 4 4 7 ...
 $ Field       : Factor w/ 16 levels "LifeSci","PhysSci",..: 15 7 9 14 7 15 15 10 7 7 ...
 $ ClassLevel  : Ord.factor w/ 4 levels "Fresh"<"Soph"<..: 4 4 4 4 4 4 4 1 1 3 ...
 $ Status      : Factor w/ 3 levels "Full-time","Part-time",..: 1 1 1 1 1 2 2 1 1 1 ...
 $ Gender      : Factor w/ 3 levels "Female","Male",..: 2 1 2 1 3 1 2 1 1 2 ...
 $ BirthYear   : num [1:33524] 1990 1999 1999 1989 1993 ...
 $ FinPL       : Ord.factor w/ 2 levels "Yes"<"No": 2 2 2 2 1 1 2 2 2 2 ...
 $ FinSch      : Ord.factor w/ 2 levels "Yes"<"No": 2 2 2 1 1 1 1 2 2 1 ...
 $ FinGov      : Ord.factor w/ 2 levels "Yes"<"No": 2 1 1 NA 2 2 2 2 2 2 ...
 $ FinSelf     : Ord.factor w/ 2 levels "Yes"<"No": 2 1 2 2 1 1 2 1 2 1 ...
 $ FinPar      : Ord.factor w/ 2 levels "Yes"<"No": 1 2 NA 1 2 2 2 2 2 1 ...
 $ FinOther    : Ord.factor w/ 2 levels "Yes"<"No": 2 2 2 2 2 2 2 2 2 2 ...
 $ TooDifficult: Ord.factor w/ 5 levels "Strongly Disagree"<..: 2 1 2 4 NA NA 3 1 5 1 ...
 $ NotRelevant : Ord.factor w/ 5 levels "Strongly Disagree"<..: 1 2 NA 1 3 NA 5 NA NA 4 ...
 $ PoorTeaching: Ord.factor w/ 5 levels "Strongly Disagree"<..: 4 4 4 2 4 1 4 1 2 5 ...
 $ UnsuppFac   : Ord.factor w/ 5 levels "Strongly Disagree"<..: NA 3 3 3 NA 3 1 5 3 NA ...
 $ Grades      : Ord.factor w/ 5 levels "Strongly Disagree"<..: NA 2 1 NA NA 4 NA 2 NA 2 ...
 $ Sched       : Ord.factor w/ 5 levels "Strongly Disagree"<..: 5 1 1 4 1 5 4 4 4 5 ...
 $ ClassTooBig : Ord.factor w/ 5 levels "Strongly Disagree"<..: 3 NA 1 1 5 NA 5 4 NA 2 ...
 $ BadAdvising : Ord.factor w/ 5 levels "Strongly Disagree"<..: 2 2 NA 2 1 2 1 NA 2 1 ...
 $ FinAid      : Ord.factor w/ 5 levels "Strongly Disagree"<..: 5 5 5 4 NA NA 5 5 1 NA ...
 $ OverallValue: Ord.factor w/ 5 levels "Strongly Disagree"<..: 5 5 3 5 2 5 5 4 NA 2 ...

This is another way to validate that it worked. This will show 6 distinct values for each column in the table. It will pad with NA values if it has less than six. The ordered values should be listed in order; the others are essentially random and nothing should be read into it.

get_distinct_and_pad <- function(column, numvals = 6) {
  distinct_values <- tibble(value = column) |>
                     distinct(value) |>
                     arrange(value) |> 
                     slice_head(n = numvals) |>
                     pull(value)

  # Pad with NA if there are less than n values
  length(distinct_values) <- numvals
  return(distinct_values)
}

survey |> 
  map(get_distinct_and_pad) |> 
  as.data.frame()
  Year           ID NPS     Field ClassLevel    Status Gender BirthYear FinPL
1 2012 aaapjamvmsgd   4   LifeSci      Fresh Full-time Female      1988   Yes
2 2013 aacgcpuqbhue   6   PhysSci       Soph Part-time   Male      1989    No
3 2014 aacgctmukgti   7 PubHealth         Jr     Other  Other      1990  <NA>
4 2015 aacukkmcqpdr   8      Nurs         Sr      <NA>   <NA>      1991  <NA>
5 2016 aacuutwvloft  NA OthHealth       <NA>      <NA>   <NA>      1992  <NA>
6 2017 aadtesmehzsd  NA    PubAdm       <NA>      <NA>   <NA>      1993  <NA>
  FinSch FinGov FinSelf FinPar FinOther      TooDifficult       NotRelevant
1    Yes    Yes     Yes    Yes      Yes Strongly Disagree Strongly Disagree
2     No     No      No     No       No          Disagree          Disagree
3   <NA>   <NA>    <NA>   <NA>     <NA>           Neutral           Neutral
4   <NA>   <NA>    <NA>   <NA>     <NA>             Agree             Agree
5   <NA>   <NA>    <NA>   <NA>     <NA>    Strongly Agree    Strongly Agree
6   <NA>   <NA>    <NA>   <NA>     <NA>              <NA>              <NA>
       PoorTeaching         UnsuppFac            Grades             Sched
1 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Disagree
2          Disagree          Disagree          Disagree          Disagree
3           Neutral           Neutral           Neutral           Neutral
4             Agree             Agree             Agree             Agree
5    Strongly Agree    Strongly Agree    Strongly Agree    Strongly Agree
6              <NA>              <NA>              <NA>              <NA>
        ClassTooBig       BadAdvising            FinAid      OverallValue
1 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Disagree
2          Disagree          Disagree          Disagree          Disagree
3           Neutral           Neutral           Neutral           Neutral
4             Agree             Agree             Agree             Agree
5    Strongly Agree    Strongly Agree    Strongly Agree    Strongly Agree
6              <NA>              <NA>              <NA>              <NA>

6 Convert “wide” to “long” data

R has to have data in “long” format for analysis. The pivot_longer function is the tool that allows us to make the transformation from “wide” to “long”.

survey <-
  survey |> 
    pivot_longer(
      names_to = "Question",
      cols = c(TooDifficult:OverallValue), 
      values_to = "Response"
    ) |>
    arrange(Year, ID, Question)

7 Prepare numerical data

When analyzing survey data, it can be helpful to have both a string reprentation and an integer representation of response data. This stsatement creates the new NumResp column that will hold the numeric response data while retaining the string Response column.

survey <-
  survey |> 
    mutate(NumResp = case_when(
      Response == "Strongly Disagree" ~ 1,
      Response == "Disagree" ~ 2,
      Response == "Neutral" ~ 3,
      Response == "Agree" ~ 4,
      Response == "Strongly Agree" ~ 5,
      .default = NA))

8 Look at data again

Now that we have made all of these transformations, let’s simply print out information about the survey table in its new form for documentation.

glimpse(survey)
Rows: 335,240
Columns: 17
$ Year       <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012,…
$ ID         <chr> "aacuutwvloft", "aacuutwvloft", "aacuutwvloft", "aacuutwvlo…
$ NPS        <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,…
$ Field      <fct> Undecl, Undecl, Undecl, Undecl, Undecl, Undecl, Undecl, Und…
$ ClassLevel <ord> Jr, Jr, Jr, Jr, Jr, Jr, Jr, Jr, Jr, Jr, Jr, Jr, Jr, Jr, Jr,…
$ Status     <fct> Full-time, Full-time, Full-time, Full-time, Full-time, Full…
$ Gender     <fct> Female, Female, Female, Female, Female, Female, Female, Fem…
$ BirthYear  <dbl> 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995,…
$ FinPL      <ord> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No,…
$ FinSch     <ord> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…
$ FinGov     <ord> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No,…
$ FinSelf    <ord> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No, N…
$ FinPar     <ord> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, No, No, No, No, No,…
$ FinOther   <ord> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No,…
$ Question   <chr> "BadAdvising", "ClassTooBig", "FinAid", "Grades", "NotRelev…
$ Response   <ord> NA, Neutral, Agree, NA, NA, Strongly Agree, Strongly Agree,…
$ NumResp    <dbl> NA, 3, 4, NA, NA, 5, 5, 2, 2, 2, 1, 3, 4, 1, NA, 5, NA, 3, …
skim(survey)
Data summary
Name survey
Number of rows 335240
Number of columns 17
_______________________
Column type frequency:
character 2
factor 11
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ID 0 1 12 12 0 33524 0
Question 0 1 5 12 0 10 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Field 0 1.00 FALSE 16 Soc: 71580, Lif: 63180, Und: 41150, Eng: 36790
ClassLevel 5510 0.98 TRUE 4 Fre: 109270, Sop: 83110, Jr: 71340, Sr: 66010
Status 6270 0.98 FALSE 3 Ful: 251240, Par: 61730, Oth: 16000
Gender 8750 0.97 FALSE 3 Fem: 159880, Mal: 137130, Oth: 29480
FinPL 16740 0.95 TRUE 2 No: 222110, Yes: 96390
FinSch 16150 0.95 TRUE 2 No: 160240, Yes: 158850
FinGov 15860 0.95 TRUE 2 No: 286960, Yes: 32420
FinSelf 16220 0.95 TRUE 2 No: 190670, Yes: 128350
FinPar 16990 0.95 TRUE 2 No: 255370, Yes: 62880
FinOther 16990 0.95 TRUE 2 No: 301970, Yes: 16280
Response 66759 0.80 TRUE 5 Str: 62765, Agr: 61927, Neu: 55456, Dis: 48919

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1.0 2014.50 1.71 2012 2013 2014 2016 2017 ▇▅▅▃▅
NPS 0 1.0 5.95 1.52 4 4 6 7 8 ▇▁▆▅▅
BirthYear 0 1.0 1994.01 3.76 1988 1991 1994 1997 2000 ▇▅▇▅▇
NumResp 66759 0.8 3.22 1.37 1 2 3 4 5 ▅▆▇▇▇

9 Summarize by question

Now that we have the new NumResp column, we can calculate numerical data on the survey responses.

survey |> 
  group_by(Question) |> 
  summarize(Median = median(NumResp, na.rm = TRUE),
            Avg = mean(NumResp, na.rm = TRUE))
# A tibble: 10 × 3
   Question     Median   Avg
   <chr>         <dbl> <dbl>
 1 BadAdvising       2  2.33
 2 ClassTooBig       2  2.52
 3 FinAid            4  3.83
 4 Grades            3  3.02
 5 NotRelevant       3  2.73
 6 OverallValue      4  4.11
 7 PoorTeaching      4  3.42
 8 Sched             4  4.00
 9 TooDifficult      3  2.99
10 UnsuppFac         3  3.28

10 Remove columns that we do not need

This table is quite large. We can get rid of the ID field if nothing else since we will not be doing any analysis related to the ID of the survey.

survey <-
  survey |>
    select(Year, NPS, Field, ClassLevel, Status, Gender, BirthYear,
           FinPL, FinSch, FinGov, FinSelf, FinPar, FinOther,
           Question, Response, NumResp)

11 Export data

Export the data but don’t include the row number.

write.csv(survey,"data/survey-output.csv", 
          row.names=FALSE)