On this page, we walk through getting a data frame ready for analysis with R and the tidyverse.
This involves getting acquainted with the data frame, identifying any issues with its structure, converting it from wide to long where appropriate, and then finishing with a few other changes: creating factors, creating a new column, and doing a quick validity check.
The survey data frame
We have worked with the survey table throughout this site.
summary(survey)
Year ID NPS Field
Min. :2012 Length:33524 Min. :4.000 Length:33524
1st Qu.:2013 Class :character 1st Qu.:4.000 Class :character
Median :2014 Mode :character Median :6.000 Mode :character
Mean :2014 Mean :5.955
3rd Qu.:2016 3rd Qu.:7.000
Max. :2017 Max. :8.000
ClassLevel Status Gender BirthYear
Length:33524 Length:33524 Length:33524 Min. :1988
Class :character Class :character Class :character 1st Qu.:1991
Mode :character Mode :character Mode :character Median :1994
Mean :1994
3rd Qu.:1997
Max. :2000
FinPL FinSch FinGov FinSelf
Length:33524 Length:33524 Length:33524 Length:33524
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
FinPar FinOther TooDifficult NotRelevant
Length:33524 Length:33524 Length:33524 Length:33524
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
PoorTeaching UnsuppFac Grades Sched
Length:33524 Length:33524 Length:33524 Length:33524
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
ClassTooBig BadAdvising FinAid OverallValue
Length:33524 Length:33524 Length:33524 Length:33524
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
We can tell right away that three columns near the beginning (ClassLevel, Status, and Gender) hold categorical data and should be factors. Let’s handle those first.
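One way to handle these columns is sketched below on a small stand-in table. The table `toy` and its rows are made up for illustration; the level names and their order are taken from the summary output shown later on this page, and the explicit `levels =` ordering is an assumption about how the factors were defined.

```r
library(dplyr)

# Toy stand-in for the survey table (rows are illustrative only).
toy <- tibble(
  ClassLevel = c("Fresh", "Sr", "Jr", "Soph"),
  Status     = c("Full-time", "Part-time", "Other", "Full-time"),
  Gender     = c("Female", "Male", "Other", "Female")
)

# Convert the three categorical columns to factors with an explicit level
# order. Any value not listed in `levels` would become NA.
toy <- toy %>%
  mutate(
    ClassLevel = factor(ClassLevel, levels = c("Fresh", "Soph", "Jr", "Sr")),
    Status     = factor(Status, levels = c("Full-time", "Part-time", "Other")),
    Gender     = factor(Gender, levels = c("Female", "Male", "Other"))
  )

summary(toy)
```

Once the columns are factors, summary() reports a count per level instead of just Length, Class, and Mode.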
Yep, that’s a lot. For this analysis, we want all of the missing data encoded as NA. We use the command below to make these changes.
It operates on the 4 columns from Field to Gender and the 16 columns from FinPL to OverallValue. In each of those columns, it replaces every "", "NA", or "--" with NA.
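A minimal sketch of a command of this kind, run on a cut-down toy table, might look like the following. The table `toy`, its values, and the helper vector `missing_codes` are hypothetical; only the column ranges and the three missing-value codes come from the description above.

```r
library(dplyr)

# Toy table with a few of the survey columns (values are illustrative only).
toy <- tibble(
  Field        = c("Math", "", "Art"),
  ClassLevel   = c("Fresh", "NA", "Sr"),
  Status       = c("--", "Full-time", "Part-time"),
  Gender       = c("Female", "Male", ""),
  BirthYear    = c(1990, 1995, 1999),
  FinPL        = c("Yes", "--", "No"),
  OverallValue = c("", "4", "NA")
)

# Strings that should be treated as missing (hypothetical helper name).
missing_codes <- c("", "NA", "--")

# Replace the missing-value codes with NA in the two column ranges;
# BirthYear lies between the ranges and is left untouched.
toy <- toy %>%
  mutate(across(
    c(Field:Gender, FinPL:OverallValue),
    ~ replace(.x, .x %in% missing_codes, NA)
  ))

colSums(is.na(toy))
```

The base function replace() is used here rather than dplyr's na_if() because three different codes need to be matched in a single pass.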
The output now shows many hundreds more NA values in most of the columns. Again, this will make computations easier as we move forward.
Converting wide to long
Now it’s time for the main event: converting a wide table to a long table. In this case, we want to put all of the responses to survey questions in one column.
Using the format given in this page, we know we need the following information:
data frame: survey
names_to: the new column that holds the former column headers will be named Question
cols: we will combine all of the survey response fields from TooDifficult to OverallValue into one column
values_to: we will name the column with the responses Response
Combining all of this information, we get the following command:
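As a sketch, here is a call with those arguments applied to a tiny stand-in table. The table `toy`, its values, and the name `toy_long` are made up for illustration, and only three of the ten response columns are included; the `cols`, `names_to`, and `values_to` arguments follow the list above.

```r
library(dplyr)
library(tidyr)

# Toy wide table: one row per respondent, one column per survey question.
toy <- tibble(
  ID           = c("a1", "a2"),
  TooDifficult = c("1", "2"),
  NotRelevant  = c("3", NA),
  OverallValue = c("5", "4")
)

# Stack the question columns into a Question/Response pair of columns.
# On the full table the same arguments would be applied to `survey`.
toy_long <- toy %>%
  pivot_longer(
    cols      = TooDifficult:OverallValue,
    names_to  = "Question",
    values_to = "Response"
  )

dim(toy_long)
```

Each original row becomes one row per question column. On the full table, 33,524 rows times 10 question columns gives 335,240 rows, which matches the Length values in the summary that follows.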
Year ID NPS Field
Min. :2012 Length:335240 Min. :4.000 Length:335240
1st Qu.:2013 Class :character 1st Qu.:4.000 Class :character
Median :2014 Mode :character Median :6.000 Mode :character
Mean :2014 Mean :5.955
3rd Qu.:2016 3rd Qu.:7.000
Max. :2017 Max. :8.000
ClassLevel Status Gender BirthYear
Fresh:109270 Full-time:251240 Female:159880 Min. :1988
Soph : 83110 Part-time: 61730 Male :137130 1st Qu.:1991
Jr : 71340 Other : 16000 Other : 29480 Median :1994
Sr : 66010 NA's : 6270 NA's : 8750 Mean :1994
NA's : 5510 3rd Qu.:1997
Max. :2000
FinPL FinSch FinGov FinSelf
Length:335240 Length:335240 Length:335240 Length:335240
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
FinPar FinOther Question Response
Length:335240 Length:335240 Length:335240 Length:335240
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Creating appropriate factors
And now let’s take a look at the contents of the last eight columns as preparation for defining some more factors:
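One hedged way to do this kind of inspection is to list the distinct values in each column. The table `toy`, its values, and the name `distinct_values` below are hypothetical, and only two of the six Fin columns are included.

```r
library(dplyr)

# Toy long table (values are illustrative only).
toy <- tibble(
  ID       = c("a1", "a1", "a2", "a2"),
  FinPL    = c("Yes", "Yes", "No", "No"),
  FinOther = c("No", "No", NA, NA),
  Question = c("Grades", "Sched", "Grades", "Sched"),
  Response = c("4", NA, "2", "5")
)

# List the distinct values found in each column from FinPL to Response.
distinct_values <- toy %>%
  select(FinPL:Response) %>%
  lapply(unique)

distinct_values
```

A column with only a handful of distinct values is a natural candidate for a factor.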
Since we aren’t going to do any analysis on the rows containing an NA as a response, we can just remove those rows permanently from the survey table with this:
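A minimal sketch of dropping those rows, shown on a made-up table named `toy`:

```r
library(dplyr)

# Toy long table with some missing responses (values are illustrative only).
toy <- tibble(
  Question = c("Grades", "Grades", "Sched", "Sched"),
  Response = c("4", NA, "2", NA)
)

# Keep only the rows that have a response. Assigning the result back to
# the same name makes the removal permanent.
toy <- toy %>%
  filter(!is.na(Response))

nrow(toy)
```

tidyr's drop_na(Response) would be an equivalent alternative.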
Well, this is supposed to calculate the median and mean responses to each question. It doesn’t work since the values in Response are character strings. Let’s fix that.
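The failure can be reproduced on a toy table. The sketch below assumes, purely for illustration, that the responses are stored as digit strings such as "4"; the table `toy` and the name `attempt` are hypothetical, and only the mean is shown.

```r
library(dplyr)

# Toy long table with responses stored as character strings.
toy <- tibble(
  Question = c("Grades", "Grades", "Sched", "Sched"),
  Response = c("4", "5", "2", "3")
)

# mean() on a character vector returns NA and warns that the argument is
# not numeric or logical; suppressWarnings() keeps the sketch quiet.
attempt <- toy %>%
  group_by(Question) %>%
  summarize(MeanResp = suppressWarnings(mean(Response)))

attempt
```

Every group ends up with NA as its mean, which is why a numeric version of the column is needed.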
Readying for analysis
We can fix the above problem by creating a new numeric column called NumResp with this command:
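A sketch of one such command follows, again assuming the responses are digit strings. If they are text labels instead, they would first need to be mapped to numbers, for example through an ordered factor. The table `toy` and the name `result` are hypothetical.

```r
library(dplyr)

# Toy long table with responses stored as digit strings (an assumption).
toy <- tibble(
  Question = c("Grades", "Grades", "Sched", "Sched"),
  Response = c("4", "5", "2", "3")
)

# Add a numeric copy of Response.
toy <- toy %>%
  mutate(NumResp = as.numeric(Response))

# With a numeric column, the per-question median and mean can be computed.
result <- toy %>%
  group_by(Question) %>%
  summarize(
    MedianResp = median(NumResp),
    MeanResp   = mean(NumResp)
  )

result
```

Keeping the original Response column alongside NumResp makes it easy to check that the conversion did what was intended.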