DATA 220G — Week 1 - 05

Putting it all together

For the most part, we’ll use tidyverse-style operations in this course, BUT you also need to know base R operations, since there are millions of lines of existing R code out there that you will need to read. I won’t use base R often, and there’s plenty of help available, so expect to work through it each time you meet it until it becomes second nature.

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.3.4     ✔ dplyr   0.7.4
## ✔ tidyr   0.7.2     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0
## ── Conflicts ───────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
nh_wx <- read_csv("2017-2008-nx-wx.csv")
## Parsed with column specification:
## cols(
##   date = col_date(format = ""),
##   value = col_double(),
##   anomaly = col_double()
## )
nh_wx
## # A tibble: 120 x 3
##          date value anomaly
##        <date> <dbl>   <dbl>
##  1 2008-01-01  23.2     1.1
##  2 2008-02-01  24.4     0.0
##  3 2008-03-01  30.5    -2.7
##  4 2008-04-01  46.0    -0.3
##  5 2008-05-01  54.2    -3.3
##  6 2008-06-01  66.3     1.1
##  7 2008-07-01  71.1    -0.1
##  8 2008-08-01  66.4    -2.5
##  9 2008-09-01  61.0    -1.0
## 10 2008-10-01  46.8    -3.0
## # ... with 110 more rows
glimpse(nh_wx)
## Observations: 120
## Variables: 3
## $ date    <date> 2008-01-01, 2008-02-01, 2008-03-01, 2008-04-01, 2008-...
## $ value   <dbl> 23.2, 24.4, 30.5, 46.0, 54.2, 66.3, 71.1, 66.4, 61.0, ...
## $ anomaly <dbl> 1.1, 0.0, -2.7, -0.3, -3.3, 1.1, -0.1, -2.5, -1.0, -3....

IMPORTANT

Folks coming from Excel tend to think in “Rows” or “Cells”

Let’s pull up an “Excel” view of this data and I’ll talk more there.

# View(nh_wx)

This is a “new” data set to us, so I like to perform a few exploratory tasks first. We’ll cover many more of these in week 2; this is just to get us started.

range(nh_wx$date) # what is the date range
## [1] "2008-01-01" "2017-12-01"
length(unique(nh_wx$date)) # how many unique dates?
## [1] 120
range(nh_wx$value)
## [1] 12.2 73.8
length(unique(nh_wx$value))
## [1] 106
range(nh_wx$anomaly)
## [1] -12.2   9.4
length(nh_wx$anomaly)
## [1] 120
head(nh_wx$date) # to not crowd out the console
## [1] "2008-01-01" "2008-02-01" "2008-03-01" "2008-04-01" "2008-05-01"
## [6] "2008-06-01"
tail(nh_wx$date) # to not crowd out the console
## [1] "2017-07-01" "2017-08-01" "2017-09-01" "2017-10-01" "2017-11-01"
## [6] "2017-12-01"
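
A few other base R helpers give a similar first look in one call. The sketch below does not use the course CSV: `toy_wx` is a hypothetical stand-in built from the first six rows printed above, so the numbers it reports describe those six rows only.

```r
# Hypothetical stand-in for nh_wx: the first six rows printed earlier
toy_wx <- data.frame(
  date    = seq(as.Date("2008-01-01"), by = "month", length.out = 6),
  value   = c(23.2, 24.4, 30.5, 46.0, 54.2, 66.3),
  anomaly = c(1.1, 0.0, -2.7, -0.3, -3.3, 1.1)
)

dim(toy_wx)            # number of rows, then number of columns
str(toy_wx)            # base R cousin of glimpse()
summary(toy_wx$value)  # min, quartiles, mean and max in one call
anyNA(toy_wx)          # TRUE if any value anywhere is missing
```

Checking for missing values early is worthwhile because functions like range() return NA when the vector contains one, unless na.rm = TRUE is passed.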

Selecting individual columns

You saw the $ above, but there are a couple of other ways:

nh_wx$date
##   [1] "2008-01-01" "2008-02-01" "2008-03-01" "2008-04-01" "2008-05-01"
##   [6] "2008-06-01" "2008-07-01" "2008-08-01" "2008-09-01" "2008-10-01"
##  [11] "2008-11-01" "2008-12-01" "2009-01-01" "2009-02-01" "2009-03-01"
##  [16] "2009-04-01" "2009-05-01" "2009-06-01" "2009-07-01" "2009-08-01"
##  [21] "2009-09-01" "2009-10-01" "2009-11-01" "2009-12-01" "2010-01-01"
##  [26] "2010-02-01" "2010-03-01" "2010-04-01" "2010-05-01" "2010-06-01"
##  [31] "2010-07-01" "2010-08-01" "2010-09-01" "2010-10-01" "2010-11-01"
##  [36] "2010-12-01" "2011-01-01" "2011-02-01" "2011-03-01" "2011-04-01"
##  [41] "2011-05-01" "2011-06-01" "2011-07-01" "2011-08-01" "2011-09-01"
##  [46] "2011-10-01" "2011-11-01" "2011-12-01" "2012-01-01" "2012-02-01"
##  [51] "2012-03-01" "2012-04-01" "2012-05-01" "2012-06-01" "2012-07-01"
##  [56] "2012-08-01" "2012-09-01" "2012-10-01" "2012-11-01" "2012-12-01"
##  [61] "2013-01-01" "2013-02-01" "2013-03-01" "2013-04-01" "2013-05-01"
##  [66] "2013-06-01" "2013-07-01" "2013-08-01" "2013-09-01" "2013-10-01"
##  [71] "2013-11-01" "2013-12-01" "2014-01-01" "2014-02-01" "2014-03-01"
##  [76] "2014-04-01" "2014-05-01" "2014-06-01" "2014-07-01" "2014-08-01"
##  [81] "2014-09-01" "2014-10-01" "2014-11-01" "2014-12-01" "2015-01-01"
##  [86] "2015-02-01" "2015-03-01" "2015-04-01" "2015-05-01" "2015-06-01"
##  [91] "2015-07-01" "2015-08-01" "2015-09-01" "2015-10-01" "2015-11-01"
##  [96] "2015-12-01" "2016-01-01" "2016-02-01" "2016-03-01" "2016-04-01"
## [101] "2016-05-01" "2016-06-01" "2016-07-01" "2016-08-01" "2016-09-01"
## [106] "2016-10-01" "2016-11-01" "2016-12-01" "2017-01-01" "2017-02-01"
## [111] "2017-03-01" "2017-04-01" "2017-05-01" "2017-06-01" "2017-07-01"
## [116] "2017-08-01" "2017-09-01" "2017-10-01" "2017-11-01" "2017-12-01"
nh_wx[, "date"]
## # A tibble: 120 x 1
##          date
##        <date>
##  1 2008-01-01
##  2 2008-02-01
##  3 2008-03-01
##  4 2008-04-01
##  5 2008-05-01
##  6 2008-06-01
##  7 2008-07-01
##  8 2008-08-01
##  9 2008-09-01
## 10 2008-10-01
## # ... with 110 more rows
select(nh_wx, date) # read the book to really understand this more
## # A tibble: 120 x 1
##          date
##        <date>
##  1 2008-01-01
##  2 2008-02-01
##  3 2008-03-01
##  4 2008-04-01
##  5 2008-05-01
##  6 2008-06-01
##  7 2008-07-01
##  8 2008-08-01
##  9 2008-09-01
## 10 2008-10-01
## # ... with 110 more rows

We can select more than one column (but not with $), and in any order

nh_wx[, c("anomaly", "date")]
## # A tibble: 120 x 2
##    anomaly       date
##      <dbl>     <date>
##  1     1.1 2008-01-01
##  2     0.0 2008-02-01
##  3    -2.7 2008-03-01
##  4    -0.3 2008-04-01
##  5    -3.3 2008-05-01
##  6     1.1 2008-06-01
##  7    -0.1 2008-07-01
##  8    -2.5 2008-08-01
##  9    -1.0 2008-09-01
## 10    -3.0 2008-10-01
## # ... with 110 more rows
select(nh_wx, date, anomaly)
## # A tibble: 120 x 2
##          date anomaly
##        <date>   <dbl>
##  1 2008-01-01     1.1
##  2 2008-02-01     0.0
##  3 2008-03-01    -2.7
##  4 2008-04-01    -0.3
##  5 2008-05-01    -3.3
##  6 2008-06-01     1.1
##  7 2008-07-01    -0.1
##  8 2008-08-01    -2.5
##  9 2008-09-01    -1.0
## 10 2008-10-01    -3.0
## # ... with 110 more rows
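
One more base R form worth recognizing when reading other people’s code is `[[`. A minimal sketch on a hypothetical toy data frame (the first six rows printed earlier, not the course CSV): `$` and `[[` hand back the bare vector, while `[` with a name keeps a data frame.

```r
# Hypothetical stand-in for nh_wx: the first six rows printed earlier
toy_wx <- data.frame(
  date    = seq(as.Date("2008-01-01"), by = "month", length.out = 6),
  value   = c(23.2, 24.4, 30.5, 46.0, 54.2, 66.3),
  anomaly = c(1.1, 0.0, -2.7, -0.3, -3.3, 1.1)
)

by_dollar <- toy_wx$value         # bare numeric vector
by_double <- toy_wx[["value"]]    # same vector, selected by a string

col_name <- "anomaly"
by_var   <- toy_wx[[col_name]]    # the name can live in a variable; `$` cannot do this

one_col   <- toy_wx["value"]                # still a data frame, with one column
reordered <- toy_wx[c("anomaly", "date")]   # several columns, in the order asked for
```

The `[[` form is the one to reach for inside functions, where the column name usually arrives as a string.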

Remember, a data frame is just a list of vectors, so we can do vector operations on the “columns”

(nh_wx$value - 32) * 0.5556 # celsius
##   [1]  -4.88928  -4.22256  -0.83340   7.77840  12.33432  19.05708  21.72396
##   [8]  19.11264  16.11240   8.22288   2.88912  -2.77800  -9.83412  -4.22256
##  [15]   0.50004   8.44512  13.16772  17.16804  19.16820  20.66832  14.22336
##  [22]   7.44504   5.38932  -3.77808  -5.05596  -1.77792   4.11144   9.27852
##  [29]  15.33456  18.89040  23.22408  20.89056  17.27916   8.94516   3.55584
##  [36]  -3.11136  -7.66728  -6.00048   0.11112   7.83396  14.50116  18.00144
##  [43]  22.16844  20.61276  17.72364   9.66744   5.77824  -0.33336  -3.55584
##  [50]  -1.05564   5.55600   8.50068  15.33456  18.11256  22.00176  21.61284
##  [57]  15.27900  10.94532   2.22240  -0.66672  -5.11152  -3.83364   0.50004
##  [64]   7.38948  13.72332  18.89040  22.94628  19.39044  15.33456   9.44520
##  [71]   1.77792  -4.66704  -6.88944  -6.94500  -3.50028   6.77832  13.44552
##  [78]  18.50148  21.33504  19.05708  15.61236  10.94532   2.05572  -0.33336
##  [85]  -7.55616 -11.00088  -2.27796   6.94500  16.55688  17.66808  21.55728
##  [92]  21.55728  18.44592   8.88960   5.77824   3.22248  -2.77800  -2.05572
##  [99]   4.27812   6.61164  14.27892  18.94596  22.50180  22.66848  18.00144
## [106]  10.38972   5.16708  -2.38908  -1.88904  -1.00008  -1.55568   9.66744
## [113]  12.77880  18.94596  21.11280  19.61268  18.55704  13.83444   3.38916
## [120]  -5.38932
mutate(nh_wx, value_c = (value - 32) * 0.556)
## # A tibble: 120 x 4
##          date value anomaly value_c
##        <date> <dbl>   <dbl>   <dbl>
##  1 2008-01-01  23.2     1.1 -4.8928
##  2 2008-02-01  24.4     0.0 -4.2256
##  3 2008-03-01  30.5    -2.7 -0.8340
##  4 2008-04-01  46.0    -0.3  7.7840
##  5 2008-05-01  54.2    -3.3 12.3432
##  6 2008-06-01  66.3     1.1 19.0708
##  7 2008-07-01  71.1    -0.1 21.7396
##  8 2008-08-01  66.4    -2.5 19.1264
##  9 2008-09-01  61.0    -1.0 16.1240
## 10 2008-10-01  46.8    -3.0  8.2288
## # ... with 110 more rows

Let me sneak user-defined functions in here as a refresher; we’ll come back to them often during the semester.

to_celsius <- function(temp_in_f) {
  (temp_in_f - 32) * 0.556
}

mutate(nh_wx, value_c = to_celsius(value)) # way more readable
## # A tibble: 120 x 4
##          date value anomaly value_c
##        <date> <dbl>   <dbl>   <dbl>
##  1 2008-01-01  23.2     1.1 -4.8928
##  2 2008-02-01  24.4     0.0 -4.2256
##  3 2008-03-01  30.5    -2.7 -0.8340
##  4 2008-04-01  46.0    -0.3  7.7840
##  5 2008-05-01  54.2    -3.3 12.3432
##  6 2008-06-01  66.3     1.1 19.0708
##  7 2008-07-01  71.1    -0.1 21.7396
##  8 2008-08-01  66.4    -2.5 19.1264
##  9 2008-09-01  61.0    -1.0 16.1240
## 10 2008-10-01  46.8    -3.0  8.2288
## # ... with 110 more rows
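
The constants 0.5556 and 0.556 used above are both rounded forms of 5/9, which is why the two sets of output differ slightly. A sketch of the same function with the exact fraction follows; the name `to_celsius_exact` is made up for this illustration.

```r
# Fahrenheit to Celsius using the exact 5/9 factor instead of a rounded constant
to_celsius_exact <- function(temp_in_f) {
  (temp_in_f - 32) * 5 / 9
}

# Freezing point, boiling point, and the January 2008 value from the data
to_celsius_exact(c(32, 212, 23.2))
```

The first two results are 0 and 100, which makes a handy sanity check for any temperature conversion.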

Indexing data frames

nh_wx[1:10,]
## # A tibble: 10 x 3
##          date value anomaly
##        <date> <dbl>   <dbl>
##  1 2008-01-01  23.2     1.1
##  2 2008-02-01  24.4     0.0
##  3 2008-03-01  30.5    -2.7
##  4 2008-04-01  46.0    -0.3
##  5 2008-05-01  54.2    -3.3
##  6 2008-06-01  66.3     1.1
##  7 2008-07-01  71.1    -0.1
##  8 2008-08-01  66.4    -2.5
##  9 2008-09-01  61.0    -1.0
## 10 2008-10-01  46.8    -3.0
nh_wx[1:10,]$anomaly
##  [1]  1.1  0.0 -2.7 -0.3 -3.3  1.1 -0.1 -2.5 -1.0 -3.0
nh_wx[1:10, "anomaly"]
## # A tibble: 10 x 1
##    anomaly
##      <dbl>
##  1     1.1
##  2     0.0
##  3    -2.7
##  4    -0.3
##  5    -3.3
##  6     1.1
##  7    -0.1
##  8    -2.5
##  9    -1.0
## 10    -3.0
slice(nh_wx, 1:10)
## # A tibble: 10 x 3
##          date value anomaly
##        <date> <dbl>   <dbl>
##  1 2008-01-01  23.2     1.1
##  2 2008-02-01  24.4     0.0
##  3 2008-03-01  30.5    -2.7
##  4 2008-04-01  46.0    -0.3
##  5 2008-05-01  54.2    -3.3
##  6 2008-06-01  66.3     1.1
##  7 2008-07-01  71.1    -0.1
##  8 2008-08-01  66.4    -2.5
##  9 2008-09-01  61.0    -1.0
## 10 2008-10-01  46.8    -3.0
select(slice(nh_wx, 1:10), anomaly)
## # A tibble: 10 x 1
##    anomaly
##      <dbl>
##  1     1.1
##  2     0.0
##  3    -2.7
##  4    -0.3
##  5    -3.3
##  6     1.1
##  7    -0.1
##  8    -2.5
##  9    -1.0
## 10    -3.0
slice(nh_wx, 1:10) %>%
  select(anomaly)
## # A tibble: 10 x 1
##    anomaly
##      <dbl>
##  1     1.1
##  2     0.0
##  3    -2.7
##  4    -0.3
##  5    -3.3
##  6     1.1
##  7    -0.1
##  8    -2.5
##  9    -1.0
## 10    -3.0
pull(slice(nh_wx, 1:10), anomaly)
##  [1]  1.1  0.0 -2.7 -0.3 -3.3  1.1 -0.1 -2.5 -1.0 -3.0
slice(nh_wx, 1:10) %>%
  pull(anomaly)
##  [1]  1.1  0.0 -2.7 -0.3 -3.3  1.1 -0.1 -2.5 -1.0 -3.0
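
One gotcha when reading older code: the output above shows that a tibble returns a tibble from nh_wx[1:10, "anomaly"], but a plain base R data.frame simplifies a single column to a vector unless told otherwise. A minimal sketch, using a hypothetical toy data.frame (the first six rows printed earlier) rather than the course data:

```r
# Hypothetical plain data.frame (not a tibble) with the first six rows printed earlier
toy_df <- data.frame(
  date    = seq(as.Date("2008-01-01"), by = "month", length.out = 6),
  value   = c(23.2, 24.4, 30.5, 46.0, 54.2, 66.3),
  anomaly = c(1.1, 0.0, -2.7, -0.3, -3.3, 1.1)
)

as_vector <- toy_df[1:3, "anomaly"]                # simplified to a numeric vector
as_frame  <- toy_df[1:3, "anomaly", drop = FALSE]  # drop = FALSE keeps the data.frame shape
one_cell  <- toy_df[3, "value"]                    # the Excel-style "cell": row 3, column value
```

If code needs to behave the same on both kinds of object, spelling out drop = FALSE (or using `[[` when a vector is wanted) removes the ambiguity.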

Finding things in data frames

The first index (when using []) is either a numeric vector of indices or a logical vector that lets us choose which “rows” we want. Refresh your memory on vectors in the second lesson from this week.

For example, this means we can use boolean logic to find things, like “What year+months had an average temperature above freezing?”

nh_wx[nh_wx$value > 32,]
## # A tibble: 87 x 3
##          date value anomaly
##        <date> <dbl>   <dbl>
##  1 2008-04-01  46.0    -0.3
##  2 2008-05-01  54.2    -3.3
##  3 2008-06-01  66.3     1.1
##  4 2008-07-01  71.1    -0.1
##  5 2008-08-01  66.4    -2.5
##  6 2008-09-01  61.0    -1.0
##  7 2008-10-01  46.8    -3.0
##  8 2008-11-01  37.2    -1.6
##  9 2009-03-01  32.9    -0.3
## 10 2009-04-01  47.2     0.9
## # ... with 77 more rows
filter(nh_wx, value > 32)
## # A tibble: 87 x 3
##          date value anomaly
##        <date> <dbl>   <dbl>
##  1 2008-04-01  46.0    -0.3
##  2 2008-05-01  54.2    -3.3
##  3 2008-06-01  66.3     1.1
##  4 2008-07-01  71.1    -0.1
##  5 2008-08-01  66.4    -2.5
##  6 2008-09-01  61.0    -1.0
##  7 2008-10-01  46.8    -3.0
##  8 2008-11-01  37.2    -1.6
##  9 2009-03-01  32.9    -0.3
## 10 2009-04-01  47.2     0.9
## # ... with 77 more rows

“What year+months had an average temperature below 0°F?”

filter(nh_wx, value < 0)
## # A tibble: 0 x 3
## # ... with 3 variables: date <date>, value <dbl>, anomaly <dbl>
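
It can help to look at the logical vector by itself before using it as an index. A sketch on hypothetical toy data (the first six rows printed earlier, not the full data set):

```r
# Hypothetical stand-in for nh_wx: the first six rows printed earlier
toy_wx <- data.frame(
  date    = seq(as.Date("2008-01-01"), by = "month", length.out = 6),
  value   = c(23.2, 24.4, 30.5, 46.0, 54.2, 66.3),
  anomaly = c(1.1, 0.0, -2.7, -0.3, -3.3, 1.1)
)

above_freezing <- toy_wx$value > 32  # one TRUE/FALSE per row
above_freezing
which(above_freezing)                # the row numbers where it is TRUE
sum(above_freezing)                  # TRUE counts as 1, so this counts the matches

# Conditions combine element by element with & (and) or | (or)
warm_but_below_normal <- toy_wx[toy_wx$value > 32 & toy_wx$anomaly < 0, ]
warm_but_below_normal
```

Note the single & here: the doubled && form is meant for single TRUE/FALSE values in if() statements, not for whole columns.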

“What did the average temperature in February look like over the years?”

OK, for this one, we’ll need some help. There are built-in date operations in R (help links below), but we’ll use the lubridate package since it really simplifies this for us. It will be important to know how to do it without “crutches” at some point, but not in this introduction.

library(lubridate) # note: in "real" scripts we have all the `library()` calls at the top of the script
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date

To answer this question, we need to know which date is a “February”. There’s a month() function in lubridate to help us with this:

month(nh_wx$date)
##   [1]  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11
##  [24] 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9 10
##  [47] 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8  9
##  [70] 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7  8
##  [93]  9 10 11 12  1  2  3  4  5  6  7  8  9 10 11 12  1  2  3  4  5  6  7
## [116]  8  9 10 11 12

Well, that’s useful, but I like names

month(nh_wx$date, label=TRUE) # we'll worry about the "Levels" in another class, but this returns a "factor" (labels stored as integer codes), not a plain character vector
##   [1] Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May
##  [18] Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct
##  [35] Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar
##  [52] Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug
##  [69] Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan
##  [86] Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun
## [103] Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
## [120] Dec
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec

Let’s add that info to the data frame

nh_wx <- mutate(nh_wx, month = month(date, label=TRUE))

nh_wx
## # A tibble: 120 x 4
##          date value anomaly month
##        <date> <dbl>   <dbl> <ord>
##  1 2008-01-01  23.2     1.1   Jan
##  2 2008-02-01  24.4     0.0   Feb
##  3 2008-03-01  30.5    -2.7   Mar
##  4 2008-04-01  46.0    -0.3   Apr
##  5 2008-05-01  54.2    -3.3   May
##  6 2008-06-01  66.3     1.1   Jun
##  7 2008-07-01  71.1    -0.1   Jul
##  8 2008-08-01  66.4    -2.5   Aug
##  9 2008-09-01  61.0    -1.0   Sep
## 10 2008-10-01  46.8    -3.0   Oct
## # ... with 110 more rows
filter(nh_wx, month == "Feb")
## # A tibble: 10 x 4
##          date value anomaly month
##        <date> <dbl>   <dbl> <ord>
##  1 2008-02-01  24.4     0.0   Feb
##  2 2009-02-01  24.4     0.0   Feb
##  3 2010-02-01  28.8     4.4   Feb
##  4 2011-02-01  21.2    -3.2   Feb
##  5 2012-02-01  30.1     5.7   Feb
##  6 2013-02-01  25.1     0.7   Feb
##  7 2014-02-01  19.5    -4.9   Feb
##  8 2015-02-01  12.2   -12.2   Feb
##  9 2016-02-01  28.3     3.9   Feb
## 10 2017-02-01  30.2     5.8   Feb

NOTE: We could have done that without modifying the data frame if we wanted to:

nh_wx <- read_csv("2017-2008-nx-wx.csv")
## Parsed with column specification:
## cols(
##   date = col_date(format = ""),
##   value = col_double(),
##   anomaly = col_double()
## )
nh_wx
## # A tibble: 120 x 3
##          date value anomaly
##        <date> <dbl>   <dbl>
##  1 2008-01-01  23.2     1.1
##  2 2008-02-01  24.4     0.0
##  3 2008-03-01  30.5    -2.7
##  4 2008-04-01  46.0    -0.3
##  5 2008-05-01  54.2    -3.3
##  6 2008-06-01  66.3     1.1
##  7 2008-07-01  71.1    -0.1
##  8 2008-08-01  66.4    -2.5
##  9 2008-09-01  61.0    -1.0
## 10 2008-10-01  46.8    -3.0
## # ... with 110 more rows
filter(nh_wx, month(date, label=TRUE) == "Feb")
## # A tibble: 10 x 3
##          date value anomaly
##        <date> <dbl>   <dbl>
##  1 2008-02-01  24.4     0.0
##  2 2009-02-01  24.4     0.0
##  3 2010-02-01  28.8     4.4
##  4 2011-02-01  21.2    -3.2
##  5 2012-02-01  30.1     5.7
##  6 2013-02-01  25.1     0.7
##  7 2014-02-01  19.5    -4.9
##  8 2015-02-01  12.2   -12.2
##  9 2016-02-01  28.3     3.9
## 10 2017-02-01  30.2     5.8
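
For reference, the same February filter can be sketched with base R alone, without lubridate. This assumes the column is already a Date, as read_csv() produced above, and it uses a hypothetical four-row toy data frame with values copied from the output earlier in this lesson:

```r
# Hypothetical toy data: four rows copied from the output shown earlier
toy_wx <- data.frame(
  date  = as.Date(c("2008-01-01", "2008-02-01", "2009-02-01", "2009-03-01")),
  value = c(23.2, 24.4, 24.4, 32.9)
)

# format() turns each Date into text such as "02"; as.integer() makes it a number
month_num <- as.integer(format(toy_wx$date, "%m"))
month_num

# month.abb is a built-in vector of English month abbreviations
month.abb[month_num]

feb <- toy_wx[month_num == 2, ]
feb
```

Comparing on the month number rather than a formatted month name keeps the filter independent of the computer’s language settings.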

REALLY quick intro to plotting

The book(s) & resource(s) cover ggplot2. There are many ways to plot in R, but we’ll primarily be using ggplot2. Since ggplot2 is a whole class in and of itself, let’s just walk through plotting the average monthly temperature data as a line chart and worry about the details in that class. I’ll sneak ggplot2 in quite a bit all semester.

NOTE: ggplot2 “comes along for the ride” with tidyverse, so no extra library() call is needed.

ggplot(nh_wx) +
  geom_line(aes(date, value))
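
For comparison, base R can draw a similar line chart with plot(). This is a rough sketch on hypothetical toy data (the first six months printed earlier), and the axis labels are my own choice rather than anything from the data file:

```r
# Hypothetical stand-in for nh_wx: the first six rows printed earlier
toy_wx <- data.frame(
  date  = seq(as.Date("2008-01-01"), by = "month", length.out = 6),
  value = c(23.2, 24.4, 30.5, 46.0, 54.2, 66.3)
)

# type = "l" asks for a line instead of the default points
plot(toy_wx$date, toy_wx$value, type = "l",
     xlab = "Month", ylab = "Average temperature (deg F)")
```

You will see plot() calls like this throughout older R code, so it is worth being able to read them even though we will mostly write ggplot2.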

You’ve got enough groundwork now for the first project!

More info

Read these. A lot. You’ll get spot quizzes occasionally about some esoteric edge cases.

help("read_csv")
help("read.csv") # note the "." vs "_"
help("Date")
help("as.Date")
help("min") # might be good to try this and the other 3 below on some vectors you create
help("max")
help("mean")
help("median")
help("lubridate") # VERY helpful package for working with dates
help("factor")