Week 1 Homework
Using quarto-live
documents
Within this document are blocks of R
code.
- Edit the code that is shown in the box. Click on the
Run Code
button. - Make further edits and re-run that code. You can do this as often as you’d like.
- Click the
Start Over
button to bring back the original code if you’d like. - If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.
Instructions for this homework
Run the lines of code from top to bottom. When you see a TODO
marker, read the comment there to see what you need to do to complete the script. Then type in your contribution and run it. Check the result. If you get an error, this is normal (we get them all the time!)—it’s part of the learning process. Read the error message to see what went wrong, and try to fix it. If you can’t, ask for help.
- TODO
-
Before going any further, review the
pipe
(|>
) operator and theselect()
andfilter()
functions by working through the scripts pipe.R, select.R, data types.R, and filter.R in the /tidyverse folder. You might also refer to the appropriate pages on the RforIR site as well.
Preliminaries
Setup
We have loaded both the tidyverse
and tidylog
packages in the background.
We are now going to load the raw data from the CSV
files:
<- read_csv("data/course_section.csv") courses
Data Dictionary
CourseSectionID
: unique identifier for each course sectionCourseID
: unique identifier for each course, specifying subject and course number.SectionID
: the section number of a course type, e.g., 001, 002, etc.TermID
: unique identifier for each termCredits
: the number of credits for the courseProfID
: unique identifier for each professor
Questions
Practice with select()
Question 1
Test your knowledge
Create a new data frame by using the select()
function to keep only the columns CourseSectionID
and Credits.
Call the new data frame course_credits
.
You need to assign the results of the query to course_credits
.
<- ______ |>
course_credits select(______)
You are selecting data from the courses
data table (tibble).
<- courses |>
course_credits select(______)
You need to select two separate columns and separate them by a comma.
<- courses |>
course_credits select(______, ______)
Question 2
Test your knowledge
If you are using RStudio, then using the data browser at top right, inspect the result to make sure it looks right.
You should see a spreadsheet-like display with the two columns you selected.
Question 3
Test your knowledge
What if we wanted all the columns except for Credits? Write the select()
to do that below, starting with the data frame and piping it to a select()
. This will print the first few rows of the result in the console. Hint: you don’t have to list all the columns.
The only thing you have to change is within the select()
operator. Don’t change anything else.
|>
courses select(______)
The easiest way to get rid of one column for a select
operator is to use the minus
operator.
|>
courses select(-NAMEOFONECOLUMN)
- 1
-
You do not need to assign the results to another tibble. Simply start with the
courses
tibble as the data source for the query. - 2
-
The easiest way to select all of the columns except for the
Credits
column is to use theminus
operator. - 3
-
You can, but don’t have to, use the
head()
function to limit the number of rows that are shown.
The minus deselects only that column. What other way(s) might you do this?
Question 4
In naming columns, the convention of using uppercase letters to separate words, as in ThisIsAColumn
, is called camel case, because of the “humps”. The courses data frame uses this convention. A different convention is snake case, where lowercase words are separated by underscores, as in this_is_a_column.
Generally, snake case is easier to read. It is also the standard throughout the tidyverse
.
Test your knowledge
Use the select()
function to select the first three columns of the courses
data frame and rename the columns to snake case. For example, CourseSectionID
would become course_section_id
. Print the result to the console — i.e., do not save the information to another data frame.
Again, you only need to apply the select()
function to the courses
data table.
|>
courses select(______)
You need to use snake_case
for the new names. The code below shows how to rename the first column. Continue with the second and third columns.
|>
courses select(course_section_id = CourseSectionID,
______)
To rename three columns, you need to have three phrases separated by commas within the select()
statement.
|>
courses select(course_section_id = CourseSectionID,
course_id = ______,
subject_id = ______)
1|>
courses 2select(course_section_id = CourseSectionID,
3course_id = CourseID,
4subject_id = SubjectID)
- 1
-
The
courses
tibble is the data source. - 2
-
The
select()
statement can be used to rename columns. This first line says “renameCourseSectionID
tocourse_section_id
.” - 3
-
This second line says “rename
CourseID
tocourse_id
.” Notice that this is separated from the previous column by a comma. - 4
-
Finally, the third line says “rename
SubjectID
tosubject_id
.”
Since we did not assign the results of this whole command to a new data table, these results are temporary.
How would you make them permanent?
Question 5
Test your knowledge
There is also a rename()
function that can be used to rename columns if you want to keep all the columns (i.e. you don’t need to select()
them).
Look up “tidyverse rename” in a search engine (or simply look at this page in rforir) and see if you can rename all the columns in the data frame to snake case.
We are using the courses
data table as the source. Instead of select()
as we did in the previous question, we are going to use rename()
in this answer.
|>
courses rename(______)
The format of this rename()
command is similar to the one for the select()
command in the previous question. This is a more focused command with a simple purpose. Neither one is right or wrong.
This is how you rename the CourseSectionID
column. Now you need to rename the rest of the columns.
|>
courses rename(course_section_id = CourseSectionID)
Just as with the select()
command, you separate the renaming of the fields by a comma. Continue until you have completed all of them.
|>
courses rename(course_section_id = CourseSectionID,
______)
1|>
courses 2rename(course_section_id = CourseSectionID,
course_id = CourseID,
subject_id = SubjectID,
section_id = SectionID,
term_id = TermID,
credits = Credits,
prof_id = ProfID)
- 1
-
The
courses
data table is the source. - 2
-
The formatting of this approach to renaming is the same as for the
select()
statement above.
Practice with filter()
- TODO
-
Review the
filter()
function by working through the scriptfilter.R
in the/tidyverse
folder.
Our task is to filter the courses data frame to keep only the most recent term, which is associated with the largest TermID
. We can do a quick pivot table to count cases by TermID
like this:
The table()
function counts the number of occurrences of each unique value.
You may recall from the select.R
script that the $
operator is used to select a column from a data frame as a vector.
The result in the console shows that there are 684 courses in each of the terms, and the largest term number is 128.
What’s the most recent term?
Question 6
Test your knowledge
We could do the following if we wanted to save the courses for the most recent term to a new data frame:
Why is this not the best way to do this? Think about what happens next term when you want the same report.
As time goes on, the maximum term will increase, and this filter will be out of date. This kind of thing is a danger in automated reports, because it means manually inspecting parameters to make sure they are up to date. It’s better to find a method to automatically get the most recent term.
An alternative is to use the filter()
function with the max()
function to get the most recent term automatically.
Try this and compare
Test that these are the same with the handy identical()
function
An alternative is to use the filter()
function with the max()
function to get the most recent term automatically.
Try this and compare
Test that these are the same with the handy identical()
function
The identical()
function returns TRUE
if the two data frames are the same. *WHY DOES THIS NOT WORK?!?!?!?!?**
Question 7
Test your knowledge
Copy the filter code from Question 6 and modify it to keep only the courses with the smallest (minimum) TermID
. Do not use an integer in your query! Assign the results of the query to the table early_courses
. After creating the new table, print out the top of the new table.
We are selecting from the data source courses
and assigning the results to the early_courses
data table. We are going to print out the top of the new table, so that means that we should use head()
.
Do not use head()
in the first command because it will put just the top few rows in the new table. We want all the appropriately filtered data to go into this new table and only then print out the top of the table.
All that is left is for you to figure out how to create the appropriate filter.
<- courses |>
early_courses filter(______)
head(early_courses)
<- courses |>
early_courses filter(TermID == ______)
head(early_courses)
Question 8
Test your knowledge
This one’s a little harder, but see if you can figure it out. Use the filter()
function to keep only the course section that were taught in the most recent HALF of the data. In other words, the term number should be greater than the median term number.
Consider using the filter()
function from dplyr
.
<- courses |>
recent_half_courses filter(______)
head(recent_half_courses)
You should filter the dataset using the species
column.
<- courses |>
recent_half_courses filter(TermID > ______)
head(recent_half_courses)
Practice with count()
The count()
function does something we have to do all the time in IR–count cases within groups. It’s a shortcut for a combination of group_by()
and summarize()
, which together can do more complicated kinds of group statistics. But counting is pretty useful all by itself.
OR
Both do the same thing–they count the number of rows in the data frame. Compare to the number of observations (again, rows) in the data browser.
The count()
function is most useful when we want to count cases within groups. For example, to count course sections by term, we can do this:
How many rows are there that have the same TermID
?
Question 9
Test your knowledge
How many course sections were taught in the most recent term?
Consider using the filter()
function from dplyr
.
|>
courses filter(______) |>
count(______)
You should filter the dataset using the species
column.
|>
courses filter(______ == ______) |>
count(______)
This is the third hint.
|>
courses filter(TermID == max(TermID)) |>
count(______)
Question 10
Test your knowledge
The chair of the history (ID == "HISTO"
) department wants to know how many course sections were taught in the most recent term. Can you help her out?
Consider using the filter()
function from dplyr
.
|>
courses filter(______) |>
count(______)
You should filter the dataset using the species
column.
|>
courses filter(TermID == ______,
== ______) |>
SubjectID count(______)
This is the third hint.
|>
courses filter(TermID == max(TermID),
== "HISTO") |>
SubjectID count(______, ______)
- 1
-
comma means AND, or you can use
&
for same thing infilter()
- 2
- asdf
- 3
- you can leave out these column names, but this way you get more information than just the count–useful for making sure it’s correct.
Question 11
Test your knowledge
A professor with ID == "8749910288-DOUG"
wants to know how many course sections he has taught in each of the most recent three terms. Can you help him out?
Consider using the filter()
function from dplyr
.
|>
courses filter(______) |>
count(______)
You should filter the dataset using the species
column.
|>
courses filter(ProfID == "8749910288-DOUG",
> ______) |>
TermID count(______)
This is the third hint.
|>
courses filter(ProfID == "8749910288-DOUG",
> max(TermID) - 3) |>
TermID count(______)
Practice with the pipe
The power of the tidyverse functions comes from flexible application of the function to compose useful data processes by chaining the operations in an assembly line.
If we chain filter()
and select()
it’s usually best to filter BEFORE selecting, because we might accidentally deselect the column we want to filter on, and that makes for confusing error messages.
Here’s an example.
Get the course Ids from the most recent term. Note that this will produce an error! Run it, and then see if you can fix it before continuing below.
Compare to this version, which works