Week 1 Homework

Using quarto-live documents

Within this document are blocks of R code.

  • Edit the code that is shown in the box. Click on the Run Code button.
  • Make further edits and re-run that code. You can do this as often as you’d like.
  • Click the Start Over button to bring back the original code if you’d like.
  • If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.

Instructions for this homework

Run the lines of code from top to bottom. When you see a TODO marker, read the comment there to see what you need to do to complete the script. Then type in your contribution and run it. Check the result. If you get an error, this is normal (we get them all the time!)—it’s part of the learning process. Read the error message to see what went wrong, and try to fix it. If you can’t, ask for help.

TODO
Before going any further, review the pipe (|>) operator and the select() and filter() functions by working through the scripts pipe.R, select.R, data types.R, and filter.R in the /tidyverse folder. You might also refer to the appropriate pages on the RforIR site as well.

Preliminaries

Setup

We have loaded both the tidyverse and tidylog packages in the background.

We are now going to load the raw data from the CSV files:

courses <- read_csv("data/course_section.csv")

Data Dictionary

  • CourseSectionID: unique identifier for each course section
  • CourseID: unique identifier for each course, specifying subject and course number.
  • SectionID: the section number of a course type, e.g., 001, 002, etc.
  • TermID: unique identifier for each term
  • Credits: the number of credits for the course
  • ProfID: unique identifier for each professor

Questions

Practice with select()

Question 1

Test your knowledge

Create a new data frame by using the select() function to keep only the columns CourseSectionID and Credits. Call the new data frame course_credits.

Hint 1

You need to assign the results of the query to course_credits.

course_credits <- ______ |>
  select(______)
Hint 2

You are selecting data from the courses data table (tibble).

course_credits <- courses |>
  select(______)
Hint 3

You need to select two separate columns and separate them by a comma.

course_credits <- courses |>
  select(______, ______)
Fully worked solution:
1course_credits <- courses |>
2    select(CourseSectionID, Credits)
1
Assign the results of the query to course_credits. The source data is in the courses tibble.
2
Select (choose) two columns from the courses tibble.

Question 2

Test your knowledge

If you are using RStudio, then using the data browser at top right, inspect the result to make sure it looks right.

Fully worked solution:

You should see a spreadsheet-like display with the two columns you selected.

Question 3

Test your knowledge

What if we wanted all the columns except for Credits? Write the select() to do that below, starting with the data frame and piping it to a select(). This will print the first few rows of the result in the console. Hint: you don’t have to list all the columns.

Hint 1

The only thing you have to change is within the select() operator. Don’t change anything else.

courses |>
  select(______)
Hint 2

The easiest way to get rid of one column for a select operator is to use the minus operator.

courses |>              
  select(-NAMEOFONECOLUMN)
Fully worked solution:
1courses |>
2  select(-Credits) |>
3  head(10)
1
You do not need to assign the results to another tibble. Simply start with the courses tibble as the data source for the query.
2
The easiest way to select all of the columns except for the Credits column is to use the minus operator.
3
You can, but don’t have to, use the head() function to limit the number of rows that are shown.

The minus deselects only that column. What other way(s) might you do this?

Question 4

In naming columns, the convention of using uppercase letters to separate words, as in ThisIsAColumn, is called camel case, because of the “humps”. The courses data frame uses this convention. A different convention is snake case, where lowercase words are separated by underscores, as in this_is_a_column. Generally, snake case is easier to read. It is also the standard throughout the tidyverse.

Test your knowledge

Use the select() function to select the first three columns of the courses data frame and rename the columns to snake case. For example, CourseSectionID would become course_section_id. Print the result to the console — i.e., do not save the information to another data frame.

Hint 1

Again, you only need to apply the select() function to the courses data table.

courses |>                                     
  select(______)
Hint 2

You need to use snake_case for the new names. The code below shows how to rename the first column. Continue with the second and third columns.

courses |>                                     
  select(course_section_id = CourseSectionID,
         ______)
Hint 3

To rename three columns, you need to have three phrases separated by commas within the select() statement.

courses |>                                     
  select(course_section_id = CourseSectionID,
         course_id = ______,  
         subject_id = ______)
Fully worked solution:
1courses |>
2  select(course_section_id = CourseSectionID,
3         course_id = CourseID,
4         subject_id = SubjectID)
1
The courses tibble is the data source.
2
The select() statement can be used to rename columns. This first line says “rename CourseSectionID to course_section_id.”
3
This second line says “rename CourseID to course_id.” Notice that this is separated from the previous column by a comma.
4
Finally, the third line says “rename SubjectID to subject_id.”

Since we did not assign the results of this whole command to a new data table, these results are temporary.

How would you make them permanent?

Question 5

Test your knowledge

There is also a rename() function that can be used to rename columns if you want to keep all the columns (i.e. you don’t need to select() them).

Look up “tidyverse rename” in a search engine (or simply look at this page in rforir) and see if you can rename all the columns in the data frame to snake case.

Hint 1

We are using the courses data table as the source. Instead of select() as we did in the previous question, we are going to use rename() in this answer.

courses |> 
  rename(______)
Hint 2

The format of this rename() command is similar to the one for the select() command in the previous question. This is a more focused command with a simple purpose. Neither one is right or wrong.

This is how you rename the CourseSectionID column. Now you need to rename the rest of the columns.

courses |> 
  rename(course_section_id = CourseSectionID)
Hint 3

Just as with the select() command, you separate the renaming of the fields by a comma. Continue until you have completed all of them.

courses |> 
  rename(course_section_id = CourseSectionID,
         ______)
Fully worked solution:
1courses |>
2  rename(course_section_id = CourseSectionID,
         course_id = CourseID,
         subject_id = SubjectID,
         section_id = SectionID,
         term_id = TermID,
         credits = Credits,
         prof_id = ProfID)
1
The courses data table is the source.
2
The formatting of this approach to renaming is the same as for the select() statement above.

Practice with filter()

TODO
Review the filter() function by working through the script filter.R in the /tidyverse folder.

Our task is to filter the courses data frame to keep only the most recent term, which is associated with the largest TermID. We can do a quick pivot table to count cases by TermID like this:

The table() function counts the number of occurrences of each unique value.

You may recall from the select.R script that the $ operator is used to select a column from a data frame as a vector.

The result in the console shows that there are 684 courses in each of the terms, and the largest term number is 128.

What’s the most recent term?

Question 6

Test your knowledge

We could do the following if we wanted to save the courses for the most recent term to a new data frame:

Why is this not the best way to do this? Think about what happens next term when you want the same report.

Fully worked solution:

As time goes on, the maximum term will increase, and this filter will be out of date. This kind of thing is a danger in automated reports, because it means manually inspecting parameters to make sure they are up to date. It’s better to find a method to automatically get the most recent term.

An alternative is to use the filter() function with the max() function to get the most recent term automatically.

Try this and compare

Test that these are the same with the handy identical() function

An alternative is to use the filter() function with the max() function to get the most recent term automatically.

Try this and compare

Test that these are the same with the handy identical() function

The identical() function returns TRUE if the two data frames are the same. *WHY DOES THIS NOT WORK?!?!?!?!?**

Question 7

Test your knowledge

Copy the filter code from Question 6 and modify it to keep only the courses with the smallest (minimum) TermID. Do not use an integer in your query! Assign the results of the query to the table early_courses. After creating the new table, print out the top of the new table.

Hint 1

We are selecting from the data source courses and assigning the results to the early_courses data table. We are going to print out the top of the new table, so that means that we should use head().

Do not use head() in the first command because it will put just the top few rows in the new table. We want all the appropriately filtered data to go into this new table and only then print out the top of the table.

All that is left is for you to figure out how to create the appropriate filter.

early_courses <- courses |>
  filter(______)
head(early_courses)
Hint 2
early_courses <- courses |>
  filter(TermID == ______)
head(early_courses)
Fully worked solution:
1early_courses <- courses |>
2  filter(TermID == min(TermID))
3head(early_courses)
1
asdf
2
asdf
3
asdf

Question 8

Test your knowledge

This one’s a little harder, but see if you can figure it out. Use the filter() function to keep only the course section that were taught in the most recent HALF of the data. In other words, the term number should be greater than the median term number.

Hint 1

Consider using the filter() function from dplyr.

recent_half_courses <- courses |> 
  filter(______)
head(recent_half_courses)
Hint 2

You should filter the dataset using the species column.

recent_half_courses <- courses |> 
  filter(TermID > ______)
head(recent_half_courses)
Fully worked solution:
1recent_half_courses <- courses |>
2  filter(TermID > median(TermID))
3head(recent_half_courses)
1
asdf
2
asdf
3
asdf

Practice with count()

The count() function does something we have to do all the time in IR–count cases within groups. It’s a shortcut for a combination of group_by() and summarize(), which together can do more complicated kinds of group statistics. But counting is pretty useful all by itself.

OR

Both do the same thing–they count the number of rows in the data frame. Compare to the number of observations (again, rows) in the data browser.

The count() function is most useful when we want to count cases within groups. For example, to count course sections by term, we can do this:

How many rows are there that have the same TermID?

Question 9

Test your knowledge

How many course sections were taught in the most recent term?

Hint 1

Consider using the filter() function from dplyr.

courses |> 
  filter(______) |> 
  count(______)
Hint 2

You should filter the dataset using the species column.

courses |> 
  filter(______ == ______) |> 
  count(______)
Hint 3

This is the third hint.

courses |> 
  filter(TermID == max(TermID)) |> 
  count(______)
Fully worked solution:
1courses |>
2  filter(TermID == max(TermID)) |>
3  count(TermID)
1
asdf
2
asdf
3
asdf

Question 10

Test your knowledge

The chair of the history (ID == "HISTO") department wants to know how many course sections were taught in the most recent term. Can you help her out?

Hint 1

Consider using the filter() function from dplyr.

courses |> 
  filter(______) |>
  count(______)
Hint 2

You should filter the dataset using the species column.

courses |> 
  filter(TermID == ______, 
         SubjectID == ______) |>
  count(______)
Hint 3

This is the third hint.

courses |> 
  filter(TermID == max(TermID), 
         SubjectID == "HISTO") |>
  count(______, ______)
Fully worked solution:
courses |> 
1  filter(TermID == max(TermID),
2         SubjectID == "HISTO") |>
3  count(TermID, SubjectID)
                           ## 
1
comma means AND, or you can use & for same thing in filter()
2
asdf
3
you can leave out these column names, but this way you get more information than just the count–useful for making sure it’s correct.

Question 11

Test your knowledge

A professor with ID == "8749910288-DOUG" wants to know how many course sections he has taught in each of the most recent three terms. Can you help him out?

Hint 1

Consider using the filter() function from dplyr.

courses |> 
  filter(______) |>
  count(______) 
Hint 2

You should filter the dataset using the species column.

courses |> 
  filter(ProfID == "8749910288-DOUG",
         TermID > ______) |>
  count(______) 
Hint 3

This is the third hint.

courses |> 
  filter(ProfID == "8749910288-DOUG",
         TermID > max(TermID) - 3) |>
  count(______) 
Fully worked solution:
courses |> 
1  filter(TermID > max(TermID) - 3,
2           ProfID == "8749910288-DOUG") |>
3  count(TermID, ProfID)
1
the & means AND. In filter() you can also use a comma instead.
2
asdf
3
asdf

Practice with the pipe

The power of the tidyverse functions comes from flexible application of the function to compose useful data processes by chaining the operations in an assembly line.

If we chain filter() and select() it’s usually best to filter BEFORE selecting, because we might accidentally deselect the column we want to filter on, and that makes for confusing error messages.

Here’s an example.

Get the course Ids from the most recent term. Note that this will produce an error! Run it, and then see if you can fix it before continuing below.

Compare to this version, which works