The following command prints out the number of rows (observations) and columns (variables).
dim(st_info)
[1] 2000   10
This shows us that there are 2000 rows and 10 columns.
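If you need only one of the two numbers, base R's nrow() and ncol() return them individually. The following is a minimal sketch that uses a small made-up data frame named demo (a hypothetical stand-in for st_info) so that it runs on its own:

```r
# Hypothetical stand-in for st_info: 4 rows, 3 columns
demo <- data.frame(
  Given = c("Ana", "Ben", "Cy", "Dee"),
  St    = c("GA", "SC", "GA", "SC"),
  SAT   = c(1200, 1350, 1100, 1480)
)

dim(demo)   # both counts at once: rows first, then columns
nrow(demo)  # number of rows (observations) only
ncol(demo)  # number of columns (variables) only
```

The same calls should work unchanged on st_info, since a tibble is also a data frame.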
If you want to see the dataframe itself, use this straightforward command (i.e., type the name of the dataframe):
st_info
# A tibble: 2,000 × 10
`Application ID` Given Family Birthdate Email St County Sex Race SAT
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 4563269562-RODR-… Teno… Rodri… 05/26/20… teno… GA COWET… M W 1436
2 9221751846-ROEH-… Trav… Roe 03/31/20… trav… SC GREEN… M A 1398
3 4290276249-ALLE-… Axel Allen 06/23/20… axel… GA BULLO… M W 1090
4 3398780452-HILT-… Just… Hilton 06/29/20… just… GA DEKAL… M W 1516
5 7691897840-SMIT-… Mehr… Smith 11/03/20… mehr… GA WHITF… M O 1440
6 1862245592-SHIP-… Cart… Shipl… 10/31/20… cart… GA DEKAL… M H 1438
7 2085584835-CHAM-… Benj… Chamb… 02/08/20… benj… SC DARLI… M W 1452
8 3918571924-STAH-… Robe… Stahl 06/16/20… robe… SC CHARL… M W 1536
9 3023361776-ROGE-… Zaid… Rogers 09/11/20… zaid… SC LANCA… M B 1487
10 7386692838-SCOT-… Jakob Scott 05/25/20… jako… SC PICKE… M H 1373
# ℹ 1,990 more rows
As you can see, this shows the size of the dataframe, the variable/column names, the data types, and the first 10 rows of data.
The View() command (note the capital V!) shows the data table in spreadsheet form in a window in the top left of RStudio.
View(st_info)
If the dataset has many rows, you probably won't want to print it out in its entirety. That is what head() and tail() are for. These print out a few of the rows just so that you can get an idea of what the dataset contains.
The following command shows the top rows of st_info (six by default). The printout abbreviates the data so that each row fits on one printed line:
head(st_info)
# A tibble: 6 × 10
`Application ID` Given Family Birthdate Email St County Sex Race SAT
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 4563269562-RODR-2… Teno… Rodri… 05/26/20… teno… GA COWET… M W 1436
2 9221751846-ROEH-2… Trav… Roe 03/31/20… trav… SC GREEN… M A 1398
3 4290276249-ALLE-2… Axel Allen 06/23/20… axel… GA BULLO… M W 1090
4 3398780452-HILT-2… Just… Hilton 06/29/20… just… GA DEKAL… M W 1516
5 7691897840-SMIT-2… Mehr… Smith 11/03/20… mehr… GA WHITF… M O 1440
6 1862245592-SHIP-2… Cart… Shipl… 10/31/20… cart… GA DEKAL… M H 1438
You can see that the printout truncated the values in some of the variables (e.g., Application ID and, in some cases, Given and Family); the trailing … marks where a value was cut off. Also, every variable contains character data except for SAT, which contains dbl (that is, numeric) data.
Notice the line at the top of the above text:
# A tibble: 6 x 10
This is a comment, as indicated by the leading # character. This means that the text that follows is not executed, but is merely explanatory text.
Now, what is this about a tibble? Well, this is a bit of a play on words used by the tidyverse to refer to a table of data (i.e., a dataframe) that has some special features. We aren’t going to go into those features here but just know that these features make it easier to work with the data — which is the purpose of the tidyverse.
This text says that this tibble is 6 x 10; that is, there are 6 observations (rows) and 10 variables (columns). Now, understand that this does not mean that st_info has 6 rows. It means that the dataframe returned by head() has 6 rows; st_info remains unchanged with 2,000 rows.
The following command shows the bottom rows of st_info. Just as with head(), it tries to format the data so that it fits on one printed line:
tail(st_info)
# A tibble: 6 × 10
`Application ID` Given Family Birthdate Email St County Sex Race SAT
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 3611829360-ROTH-2… Penny Roth 09/02/20… penn… SC BERKE… F W 1105
2 8814528045-MORI-2… Eris Morin 04/05/20… eris… GA CHERO… F H 1266
3 3855136351-RODR-2… Ellis Rodri… 09/05/20… elli… SC BEAUF… F W 1519
4 8669152198-EWIN-2… Emma Ewing 08/02/20… emma… SC LEXIN… F W 1507
5 5442220352-NAVA-2… Iris Navar… 10/09/20… iris… SC LANCA… F B 1335
6 8438756330-FELI-2… Lill… Felix 11/09/20… lill… GA DOUGH… F W 1392
This also results in a 6 x 10 tibble.
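Both head() and tail() accept an n argument for requesting a different number of rows, and both return a new object rather than modifying the original. The following is a minimal sketch of that behavior; the data frame demo and the variables first3 and last3 are made-up names for illustration, not part of the st_info example:

```r
# Hypothetical stand-in for st_info: 20 rows, 2 columns
demo <- data.frame(id = 1:20, SAT = seq(1000, 1570, by = 30))

first3 <- head(demo, n = 3)  # first three rows
last3  <- tail(demo, n = 3)  # last three rows

nrow(first3)  # rows in the subset returned by head()
nrow(last3)   # rows in the subset returned by tail()
nrow(demo)    # the original data frame still has all of its rows
```

Comparing nrow(first3) with nrow(demo) makes the point from above concrete: the subset is a separate, smaller dataframe, and the original is left untouched.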
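If you want to know the column/variable types for a dataframe, use the following command. Note that spec() comes from the readr package and reports the column specification recorded when the data was read in, so it applies to dataframes created by readr functions: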
spec(st_info)
cols(
`Application ID` = col_character(),
Given = col_character(),
Family = col_character(),
Birthdate = col_character(),
Email = col_character(),
St = col_character(),
County = col_character(),
Sex = col_character(),
Race = col_character(),
SAT = col_double()
)
The glimpse() command is another way to get information about the dataframe. It displays each column's name and type along with some sample data:
glimpse(st_info)
Rows: 2,000
Columns: 10
$ `Application ID` <chr> "4563269562-RODR-2021", "9221751846-ROEH-2021", "4290…
$ Given <chr> "Tenoch", "Travis", "Axel", "Justice", "Mehran", "Car…
$ Family <chr> "Rodriguez", "Roe", "Allen", "Hilton", "Smith", "Ship…
$ Birthdate <chr> "05/26/2003", "03/31/2003", "06/23/2003", "06/29/2003…
$ Email <chr> "tenorodr2178@yahoo.com", "travroet1415@gmail.com", "…
$ St <chr> "GA", "SC", "GA", "GA", "GA", "GA", "SC", "SC", "SC",…
$ County <chr> "COWETAGA", "GREENVSC", "BULLOCGA", "DEKALBGA", "WHIT…
$ Sex <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M"…
$ Race <chr> "W", "A", "W", "W", "O", "H", "W", "W", "B", "H", "W"…
$ SAT <dbl> 1436, 1398, 1090, 1516, 1440, 1438, 1452, 1536, 1487,…
If you want to know the names of the columns (variables) in a dataframe, use the following:
names(st_info)
[1] "Application ID" "Given" "Family" "Birthdate"
[5] "Email" "St" "County" "Sex"
[9] "Race" "SAT"
A command that gives something of a combination of the spec() command and the glimpse() command is the following:
str(st_info)
spc_tbl_ [2,000 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Application ID: chr [1:2000] "4563269562-RODR-2021" "9221751846-ROEH-2021" "4290276249-ALLE-2021" "3398780452-HILT-2021" ...
$ Given : chr [1:2000] "Tenoch" "Travis" "Axel" "Justice" ...
$ Family : chr [1:2000] "Rodriguez" "Roe" "Allen" "Hilton" ...
$ Birthdate : chr [1:2000] "05/26/2003" "03/31/2003" "06/23/2003" "06/29/2003" ...
$ Email : chr [1:2000] "tenorodr2178@yahoo.com" "travroet1415@gmail.com" "axelallew7455@yahoo.com" "justhilt6515@msn.com" ...
$ St : chr [1:2000] "GA" "SC" "GA" "GA" ...
$ County : chr [1:2000] "COWETAGA" "GREENVSC" "BULLOCGA" "DEKALBGA" ...
$ Sex : chr [1:2000] "M" "M" "M" "M" ...
$ Race : chr [1:2000] "W" "A" "W" "W" ...
$ SAT : num [1:2000] 1436 1398 1090 1516 1440 ...
- attr(*, "spec")=
.. cols(
.. `Application ID` = col_character(),
.. Given = col_character(),
.. Family = col_character(),
.. Birthdate = col_character(),
.. Email = col_character(),
.. St = col_character(),
.. County = col_character(),
.. Sex = col_character(),
.. Race = col_character(),
.. SAT = col_double()
.. )
- attr(*, "problems")=<externalptr>
The summary() command gives a general overview of a dataframe in a more display-focused layout, and it also provides some simple statistics:
summary(st_info)
Application ID Given Family Birthdate
Length:2000 Length:2000 Length:2000 Length:2000
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Email St County Sex
Length:2000 Length:2000 Length:2000 Length:2000
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Race SAT
Length:2000 Min. : 978
Class :character 1st Qu.:1206
Mode :character Median :1308
Mean :1315
3rd Qu.:1425
Max. :1600
You can see above that it calculates summary statistics only for the numeric column, SAT; for the character columns it reports just the length, class, and mode.
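The statistics that summary() reports for a numeric column can also be computed one at a time with base R functions. The following is a minimal sketch using a made-up vector of scores named sat (hypothetical values, not taken from st_info):

```r
# Made-up scores for illustration only
sat <- c(1000, 1100, 1200, 1300, 1400)

s <- summary(sat)  # minimum, quartiles, median, mean, and maximum together
s

# The same statistics, computed individually
min(sat)
quantile(sat, probs = c(0.25, 0.75))  # 1st and 3rd quartiles
median(sat)
mean(sat)
max(sat)
```

With the real data, replacing sat with st_info$SAT would be the natural way to apply the same calls to the SAT column.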