For the most part, we’ll use tidyverse-style operations in this course. BUT you also need to know base R operations, because there are millions of lines of existing R code out there that you will need to read. I won’t use base R often, and there’s plenty of help available, so expect to work through it afresh each time until it becomes second nature.
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.3.4 ✔ dplyr 0.7.4
## ✔ tidyr 0.7.2 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ───────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
nh_wx <- read_csv("2017-2008-nx-wx.csv")
## Parsed with column specification:
## cols(
## date = col_date(format = ""),
## value = col_double(),
## anomaly = col_double()
## )
nh_wx
## # A tibble: 120 x 3
## date value anomaly
## <date> <dbl> <dbl>
## 1 2008-01-01 23.2 1.1
## 2 2008-02-01 24.4 0.0
## 3 2008-03-01 30.5 -2.7
## 4 2008-04-01 46.0 -0.3
## 5 2008-05-01 54.2 -3.3
## 6 2008-06-01 66.3 1.1
## 7 2008-07-01 71.1 -0.1
## 8 2008-08-01 66.4 -2.5
## 9 2008-09-01 61.0 -1.0
## 10 2008-10-01 46.8 -3.0
## # ... with 110 more rows
glimpse(nh_wx)
## Observations: 120
## Variables: 3
## $ date <date> 2008-01-01, 2008-02-01, 2008-03-01, 2008-04-01, 2008-...
## $ value <dbl> 23.2, 24.4, 30.5, 46.0, 54.2, 66.3, 71.1, 66.4, 61.0, ...
## $ anomaly <dbl> 1.1, 0.0, -2.7, -0.3, -3.3, 1.1, -0.1, -2.5, -1.0, -3....
IMPORTANT
Folks coming from Excel tend to think in “Rows” or “Cells”
Let’s pull up an “Excel” view of this data and I’ll talk more there.
# View(nh_wx)
This is a “new” data set to us, so I like to run a few exploratory checks first. We’ll cover many more of these in week 2; this is just to get us started.
range(nh_wx$date) # what is the date range
## [1] "2008-01-01" "2017-12-01"
length(unique(nh_wx$date)) # how many unique dates?
## [1] 120
range(nh_wx$value)
## [1] 12.2 73.8
length(unique(nh_wx$value))
## [1] 106
range(nh_wx$anomaly)
## [1] -12.2 9.4
length(nh_wx$anomaly)
## [1] 120
head(nh_wx$date) # to not crowd out the console
## [1] "2008-01-01" "2008-02-01" "2008-03-01" "2008-04-01" "2008-05-01"
## [6] "2008-06-01"
tail(nh_wx$date) # to not crowd out the console
## [1] "2017-07-01" "2017-08-01" "2017-09-01" "2017-10-01" "2017-11-01"
## [6] "2017-12-01"
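Base R also has a couple of one-call summaries that cover much of the same ground. This is a minimal sketch, assuming a tiny stand-in data frame (`toy_wx` is a hypothetical name; its values are the first three rows printed above, typed in by hand):

```r
# Hypothetical stand-in for nh_wx: the first three rows shown above
toy_wx <- data.frame(
  date = as.Date(c("2008-01-01", "2008-02-01", "2008-03-01")),
  value = c(23.2, 24.4, 30.5),
  anomaly = c(1.1, 0.0, -2.7)
)

str(toy_wx)       # compact structure: column types plus the first few values
summary(toy_wx)   # min / quartiles / mean / max for every column at once

# The same checks used above, applied to the toy data
value_range <- range(toy_wx$value)        # smallest and largest value
n_dates <- length(unique(toy_wx$date))    # count of distinct dates
```

`str()` and `summary()` are handy as a first look; the individual `range()` and `unique()` calls are still the better choice when you need the numbers in later code.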
You saw the `$` above, but there are a couple of other ways
nh_wx$date
## [1] "2008-01-01" "2008-02-01" "2008-03-01" "2008-04-01" "2008-05-01"
## [6] "2008-06-01" "2008-07-01" "2008-08-01" "2008-09-01" "2008-10-01"
## [11] "2008-11-01" "2008-12-01" "2009-01-01" "2009-02-01" "2009-03-01"
## [16] "2009-04-01" "2009-05-01" "2009-06-01" "2009-07-01" "2009-08-01"
## [21] "2009-09-01" "2009-10-01" "2009-11-01" "2009-12-01" "2010-01-01"
## [26] "2010-02-01" "2010-03-01" "2010-04-01" "2010-05-01" "2010-06-01"
## [31] "2010-07-01" "2010-08-01" "2010-09-01" "2010-10-01" "2010-11-01"
## [36] "2010-12-01" "2011-01-01" "2011-02-01" "2011-03-01" "2011-04-01"
## [41] "2011-05-01" "2011-06-01" "2011-07-01" "2011-08-01" "2011-09-01"
## [46] "2011-10-01" "2011-11-01" "2011-12-01" "2012-01-01" "2012-02-01"
## [51] "2012-03-01" "2012-04-01" "2012-05-01" "2012-06-01" "2012-07-01"
## [56] "2012-08-01" "2012-09-01" "2012-10-01" "2012-11-01" "2012-12-01"
## [61] "2013-01-01" "2013-02-01" "2013-03-01" "2013-04-01" "2013-05-01"
## [66] "2013-06-01" "2013-07-01" "2013-08-01" "2013-09-01" "2013-10-01"
## [71] "2013-11-01" "2013-12-01" "2014-01-01" "2014-02-01" "2014-03-01"
## [76] "2014-04-01" "2014-05-01" "2014-06-01" "2014-07-01" "2014-08-01"
## [81] "2014-09-01" "2014-10-01" "2014-11-01" "2014-12-01" "2015-01-01"
## [86] "2015-02-01" "2015-03-01" "2015-04-01" "2015-05-01" "2015-06-01"
## [91] "2015-07-01" "2015-08-01" "2015-09-01" "2015-10-01" "2015-11-01"
## [96] "2015-12-01" "2016-01-01" "2016-02-01" "2016-03-01" "2016-04-01"
## [101] "2016-05-01" "2016-06-01" "2016-07-01" "2016-08-01" "2016-09-01"
## [106] "2016-10-01" "2016-11-01" "2016-12-01" "2017-01-01" "2017-02-01"
## [111] "2017-03-01" "2017-04-01" "2017-05-01" "2017-06-01" "2017-07-01"
## [116] "2017-08-01" "2017-09-01" "2017-10-01" "2017-11-01" "2017-12-01"
nh_wx[, "date"]
## # A tibble: 120 x 1
## date
## <date>
## 1 2008-01-01
## 2 2008-02-01
## 3 2008-03-01
## 4 2008-04-01
## 5 2008-05-01
## 6 2008-06-01
## 7 2008-07-01
## 8 2008-08-01
## 9 2008-09-01
## 10 2008-10-01
## # ... with 110 more rows
select(nh_wx, date) # read the book to really understand this more
## # A tibble: 120 x 1
## date
## <date>
## 1 2008-01-01
## 2 2008-02-01
## 3 2008-03-01
## 4 2008-04-01
## 5 2008-05-01
## 6 2008-06-01
## 7 2008-07-01
## 8 2008-08-01
## 9 2008-09-01
## 10 2008-10-01
## # ... with 110 more rows
We can select more than 1 (but not with `$`) and in any order
nh_wx[, c("anomaly", "date")]
## # A tibble: 120 x 2
## anomaly date
## <dbl> <date>
## 1 1.1 2008-01-01
## 2 0.0 2008-02-01
## 3 -2.7 2008-03-01
## 4 -0.3 2008-04-01
## 5 -3.3 2008-05-01
## 6 1.1 2008-06-01
## 7 -0.1 2008-07-01
## 8 -2.5 2008-08-01
## 9 -1.0 2008-09-01
## 10 -3.0 2008-10-01
## # ... with 110 more rows
select(nh_wx, date, anomaly)
## # A tibble: 120 x 2
## date anomaly
## <date> <dbl>
## 1 2008-01-01 1.1
## 2 2008-02-01 0.0
## 3 2008-03-01 -2.7
## 4 2008-04-01 -0.3
## 5 2008-05-01 -3.3
## 6 2008-06-01 1.1
## 7 2008-07-01 -0.1
## 8 2008-08-01 -2.5
## 9 2008-09-01 -1.0
## 10 2008-10-01 -3.0
## # ... with 110 more rows
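One more base R accessor is worth knowing when reading older code: `[[`. The sketch below assumes a plain base `data.frame` named `toy_wx` (a hypothetical stand-in, not the course tibble) and shows which accessors hand back a bare vector and which keep the data frame wrapper:

```r
# Hypothetical stand-in for nh_wx: the first three rows shown earlier
toy_wx <- data.frame(
  date = as.Date(c("2008-01-01", "2008-02-01", "2008-03-01")),
  value = c(23.2, 24.4, 30.5),
  anomaly = c(1.1, 0.0, -2.7)
)

col_vec  <- toy_wx$date        # `$` returns the column itself (a vector)
col_vec2 <- toy_wx[["date"]]   # `[[` does the same, but takes a string
col_df   <- toy_wx["date"]     # single `[` keeps the data frame wrapper

is.data.frame(col_vec)         # FALSE
is.data.frame(col_df)          # TRUE
identical(col_vec, col_vec2)   # TRUE

# `[[` accepts a variable holding the column name, which `$` cannot do
wanted <- "anomaly"
picked <- toy_wx[[wanted]]

# Note: on a base data.frame, toy_wx[, "date"] drops to a plain vector,
# whereas a tibble (like nh_wx above) keeps returning a tibble.
```

That last difference between base data frames and tibbles is a common source of surprises when moving between tidyverse and older base R code.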
Remember, a data frame is just a list of vectors, so we can do vector operations on the “columns”
(nh_wx$value - 32) * 0.5556 # Celsius; 0.5556 is 5/9 rounded to four decimals
## [1] -4.88928 -4.22256 -0.83340 7.77840 12.33432 19.05708 21.72396
## [8] 19.11264 16.11240 8.22288 2.88912 -2.77800 -9.83412 -4.22256
## [15] 0.50004 8.44512 13.16772 17.16804 19.16820 20.66832 14.22336
## [22] 7.44504 5.38932 -3.77808 -5.05596 -1.77792 4.11144 9.27852
## [29] 15.33456 18.89040 23.22408 20.89056 17.27916 8.94516 3.55584
## [36] -3.11136 -7.66728 -6.00048 0.11112 7.83396 14.50116 18.00144
## [43] 22.16844 20.61276 17.72364 9.66744 5.77824 -0.33336 -3.55584
## [50] -1.05564 5.55600 8.50068 15.33456 18.11256 22.00176 21.61284
## [57] 15.27900 10.94532 2.22240 -0.66672 -5.11152 -3.83364 0.50004
## [64] 7.38948 13.72332 18.89040 22.94628 19.39044 15.33456 9.44520
## [71] 1.77792 -4.66704 -6.88944 -6.94500 -3.50028 6.77832 13.44552
## [78] 18.50148 21.33504 19.05708 15.61236 10.94532 2.05572 -0.33336
## [85] -7.55616 -11.00088 -2.27796 6.94500 16.55688 17.66808 21.55728
## [92] 21.55728 18.44592 8.88960 5.77824 3.22248 -2.77800 -2.05572
## [99] 4.27812 6.61164 14.27892 18.94596 22.50180 22.66848 18.00144
## [106] 10.38972 5.16708 -2.38908 -1.88904 -1.00008 -1.55568 9.66744
## [113] 12.77880 18.94596 21.11280 19.61268 18.55704 13.83444 3.38916
## [120] -5.38932
mutate(nh_wx, value_c = (value - 32) * 0.556) # 0.556 is 5/9 rounded to three decimals
## # A tibble: 120 x 4
## date value anomaly value_c
## <date> <dbl> <dbl> <dbl>
## 1 2008-01-01 23.2 1.1 -4.8928
## 2 2008-02-01 24.4 0.0 -4.2256
## 3 2008-03-01 30.5 -2.7 -0.8340
## 4 2008-04-01 46.0 -0.3 7.7840
## 5 2008-05-01 54.2 -3.3 12.3432
## 6 2008-06-01 66.3 1.1 19.0708
## 7 2008-07-01 71.1 -0.1 21.7396
## 8 2008-08-01 66.4 -2.5 19.1264
## 9 2008-09-01 61.0 -1.0 16.1240
## 10 2008-10-01 46.8 -3.0 8.2288
## # ... with 110 more rows
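To see the “list of vectors” idea directly, here is a small base R sketch. It assumes a hypothetical one-column data frame `toy_wx` holding the first three temperatures from above, and it uses the exact factor `5 / 9` rather than the rounded constants in the two conversions above:

```r
# Hypothetical stand-in: the first three temperature values from above
toy_wx <- data.frame(value = c(23.2, 24.4, 30.5))

is.list(toy_wx)    # TRUE: a data frame is a list under the hood
length(toy_wx)     # 1: the number of columns (list elements)

# Arithmetic is applied element by element, with no loop needed
value_c <- (toy_wx$value - 32) * 5 / 9   # 5 / 9 is the exact conversion factor

# Base R way to add the result as a new column (what mutate() does for us)
toy_wx$value_c <- value_c
```

Unlike `mutate()`, the `$<-` assignment changes `toy_wx` in place, so there is no need to reassign the result.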
Let me sneak user-defined functions in here as a refresher; we’ll come back to them often during the semester.
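# Convert a temperature (or a whole vector of them) from Fahrenheit to Celsius
to_celsius <- function(temp_in_f) {
  (temp_in_f - 32) * 0.556 # 0.556 is 5/9 rounded to three decimals
}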
mutate(nh_wx, value_c = to_celsius(value)) # way more readable
## # A tibble: 120 x 4
## date value anomaly value_c
## <date> <dbl> <dbl> <dbl>
## 1 2008-01-01 23.2 1.1 -4.8928
## 2 2008-02-01 24.4 0.0 -4.2256
## 3 2008-03-01 30.5 -2.7 -0.8340
## 4 2008-04-01 46.0 -0.3 7.7840
## 5 2008-05-01 54.2 -3.3 12.3432
## 6 2008-06-01 66.3 1.1 19.0708
## 7 2008-07-01 71.1 -0.1 21.7396
## 8 2008-08-01 66.4 -2.5 19.1264
## 9 2008-09-01 61.0 -1.0 16.1240
## 10 2008-10-01 46.8 -3.0 8.2288
## # ... with 110 more rows
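A quick way to gain confidence in a helper like `to_celsius()` is to test it on values whose answers you already know. The sketch below is an illustration only: `to_celsius_exact()` and `to_fahrenheit()` are hypothetical names introduced here, and the exact factor `5 / 9` replaces the rounded `0.556`:

```r
# Same idea as to_celsius() above, but with the exact factor 5 / 9
to_celsius_exact <- function(temp_in_f) {
  (temp_in_f - 32) * 5 / 9
}

# Hypothetical inverse helper, added only to check the conversion
to_fahrenheit <- function(temp_in_c) {
  temp_in_c * 9 / 5 + 32
}

temps_f <- c(32, 212, 23.2)            # freezing, boiling, and one value from the data
temps_c <- to_celsius_exact(temps_f)   # works on whole vectors, not just single numbers
round_trip <- to_fahrenheit(temps_c)   # should land back on temps_f

all.equal(round_trip, temps_f)         # TRUE, up to floating-point tolerance
```

Because the function body is ordinary vector arithmetic, the same function works inside `mutate()` and on a bare vector without any changes.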
nh_wx[1:10,]
## # A tibble: 10 x 3
## date value anomaly
## <date> <dbl> <dbl>
## 1 2008-01-01 23.2 1.1
## 2 2008-02-01 24.4 0.0
## 3 2008-03-01 30.5 -2.7
## 4 2008-04-01 46.0 -0.3
## 5 2008-05-01 54.2 -3.3
## 6 2008-06-01 66.3 1.1
## 7 2008-07-01 71.1 -0.1
## 8 2008-08-01 66.4 -2.5
## 9 2008-09-01 61.0 -1.0
## 10 2008-10-01 46.8 -3.0
nh_wx[1:10,]$anomaly
## [1] 1.1 0.0 -2.7 -0.3 -3.3 1.1 -0.1 -2.5 -1.0 -3.0
nh_wx[1:10, "anomaly"]
## # A tibble: 10 x 1
## anomaly
## <dbl>
## 1 1.1
## 2 0.0
## 3 -2.7
## 4 -0.3
## 5 -3.3
## 6 1.1
## 7 -0.1
## 8 -2.5
## 9 -1.0
## 10 -3.0
slice(nh_wx, 1:10)
## # A tibble: 10 x 3
## date value anomaly
## <date> <dbl> <dbl>
## 1 2008-01-01 23.2 1.1
## 2 2008-02-01 24.4 0.0
## 3 2008-03-01 30.5 -2.7
## 4 2008-04-01 46.0 -0.3
## 5 2008-05-01 54.2 -3.3
## 6 2008-06-01 66.3 1.1
## 7 2008-07-01 71.1 -0.1
## 8 2008-08-01 66.4 -2.5
## 9 2008-09-01 61.0 -1.0
## 10 2008-10-01 46.8 -3.0
select(slice(nh_wx, 1:10), anomaly)
## # A tibble: 10 x 1
## anomaly
## <dbl>
## 1 1.1
## 2 0.0
## 3 -2.7
## 4 -0.3
## 5 -3.3
## 6 1.1
## 7 -0.1
## 8 -2.5
## 9 -1.0
## 10 -3.0
slice(nh_wx, 1:10) %>%
  select(anomaly)
## # A tibble: 10 x 1
## anomaly
## <dbl>
## 1 1.1
## 2 0.0
## 3 -2.7
## 4 -0.3
## 5 -3.3
## 6 1.1
## 7 -0.1
## 8 -2.5
## 9 -1.0
## 10 -3.0
pull(slice(nh_wx, 1:10), anomaly)
## [1] 1.1 0.0 -2.7 -0.3 -3.3 1.1 -0.1 -2.5 -1.0 -3.0
slice(nh_wx, 1:10) %>%
  pull(anomaly)
## [1] 1.1 0.0 -2.7 -0.3 -3.3 1.1 -0.1 -2.5 -1.0 -3.0
The first parameter (when using `[]`) is either a numeric vector of indices or a logical vector that lets us choose which “rows” we want. Refresh your memory on vectors in the second lesson from this week.
For example, this means we can use boolean logic to find things. Like “What year+months had an average temperature above freezing?”
nh_wx[nh_wx$value > 32,]
## # A tibble: 87 x 3
## date value anomaly
## <date> <dbl> <dbl>
## 1 2008-04-01 46.0 -0.3
## 2 2008-05-01 54.2 -3.3
## 3 2008-06-01 66.3 1.1
## 4 2008-07-01 71.1 -0.1
## 5 2008-08-01 66.4 -2.5
## 6 2008-09-01 61.0 -1.0
## 7 2008-10-01 46.8 -3.0
## 8 2008-11-01 37.2 -1.6
## 9 2009-03-01 32.9 -0.3
## 10 2009-04-01 47.2 0.9
## # ... with 77 more rows
filter(nh_wx, value > 32)
## # A tibble: 87 x 3
## date value anomaly
## <date> <dbl> <dbl>
## 1 2008-04-01 46.0 -0.3
## 2 2008-05-01 54.2 -3.3
## 3 2008-06-01 66.3 1.1
## 4 2008-07-01 71.1 -0.1
## 5 2008-08-01 66.4 -2.5
## 6 2008-09-01 61.0 -1.0
## 7 2008-10-01 46.8 -3.0
## 8 2008-11-01 37.2 -1.6
## 9 2009-03-01 32.9 -0.3
## 10 2009-04-01 47.2 0.9
## # ... with 77 more rows
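To make the numeric-versus-logical index idea concrete, here is a small base R sketch. It assumes a hypothetical four-row data frame `toy_wx` built from the first four rows of the data:

```r
# Hypothetical stand-in for nh_wx: the first four rows shown earlier
toy_wx <- data.frame(
  date = as.Date(c("2008-01-01", "2008-02-01", "2008-03-01", "2008-04-01")),
  value = c(23.2, 24.4, 30.5, 46.0)
)

by_position <- toy_wx[c(1, 4), ]           # numeric index: rows 1 and 4

above_freezing <- toy_wx$value > 32        # logical vector, one TRUE/FALSE per row
by_condition <- toy_wx[above_freezing, ]   # keeps only the rows marked TRUE

# Conditions combine with & (and) and | (or)
mild <- toy_wx[toy_wx$value > 24 & toy_wx$value < 40, ]

# which() converts a logical vector back into row numbers
freezing_rows <- which(above_freezing)
```

Printing `above_freezing` on its own is a useful debugging habit: if the logical vector looks wrong, the filtered rows will be wrong too.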
“What year+months had an average temperature below 0°F?”
filter(nh_wx, value < 0)
## # A tibble: 0 x 3
## # ... with 3 variables: date <date>, value <dbl>, anomaly <dbl>
“What did the average temperature in February look like over the years?”
OK, for this one, we’ll need some help. There are built-in date operations in R (help links below), but we’ll use the `lubridate` package since it really simplifies this for us. It will be important to know how to do it without “crutches” at some point, but not in this introduction.
library(lubridate) # note: in "real" scripts we have all the `library()` calls at the top of the script
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
To answer this question, we need to know which `date` is a “February”. There’s a `month()` function in `lubridate` to help us with this:
month(nh_wx$date)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11
## [24] 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10
## [47] 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9
## [70] 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8
## [93] 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7
## [116] 8 9 10 11 12
Well, that’s useful, but I like names
month(nh_wx$date, label=TRUE) # returns a "factor" (here an ordered one): integer codes with text labels; we'll worry about the "Levels" in another class
## [1] Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May
## [18] Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## [35] Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar
## [52] Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug
## [69] Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan
## [86] Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun
## [103] Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
## [120] Dec
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
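If the “Levels” line looks mysterious, the following base R sketch may help. It builds a small ordered factor by hand (the vector `m` is a made-up example, not taken from the data) and peeks at what is stored inside:

```r
# month.abb is a built-in constant: "Jan", "Feb", ..., "Dec"
m <- factor(c("Jan", "Feb", "Dec", "Feb"), levels = month.abb, ordered = TRUE)

levels(m)                # the 12 allowed labels, in calendar order
codes <- as.integer(m)   # the underlying integer codes: 1 2 12 2
as.character(m)          # back to plain text

is_feb <- m == "Feb"     # comparing against a label works as expected
before_mar <- m < "Mar"  # ordered factors also support < and >, following level order
```

The ordering is why the printed levels use `<` between month names: the factor knows that `Jan` comes before `Feb`, which plain text does not.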
Let’s add that info to the data frame
nh_wx <- mutate(nh_wx, month = month(date, label=TRUE))
nh_wx
## # A tibble: 120 x 4
## date value anomaly month
## <date> <dbl> <dbl> <ord>
## 1 2008-01-01 23.2 1.1 Jan
## 2 2008-02-01 24.4 0.0 Feb
## 3 2008-03-01 30.5 -2.7 Mar
## 4 2008-04-01 46.0 -0.3 Apr
## 5 2008-05-01 54.2 -3.3 May
## 6 2008-06-01 66.3 1.1 Jun
## 7 2008-07-01 71.1 -0.1 Jul
## 8 2008-08-01 66.4 -2.5 Aug
## 9 2008-09-01 61.0 -1.0 Sep
## 10 2008-10-01 46.8 -3.0 Oct
## # ... with 110 more rows
filter(nh_wx, month == "Feb")
## # A tibble: 10 x 4
## date value anomaly month
## <date> <dbl> <dbl> <ord>
## 1 2008-02-01 24.4 0.0 Feb
## 2 2009-02-01 24.4 0.0 Feb
## 3 2010-02-01 28.8 4.4 Feb
## 4 2011-02-01 21.2 -3.2 Feb
## 5 2012-02-01 30.1 5.7 Feb
## 6 2013-02-01 25.1 0.7 Feb
## 7 2014-02-01 19.5 -4.9 Feb
## 8 2015-02-01 12.2 -12.2 Feb
## 9 2016-02-01 28.3 3.9 Feb
## 10 2017-02-01 30.2 5.8 Feb
NOTE: We could have done that without modifying the data frame if we wanted to:
nh_wx <- read_csv("2017-2008-nx-wx.csv")
## Parsed with column specification:
## cols(
## date = col_date(format = ""),
## value = col_double(),
## anomaly = col_double()
## )
nh_wx
## # A tibble: 120 x 3
## date value anomaly
## <date> <dbl> <dbl>
## 1 2008-01-01 23.2 1.1
## 2 2008-02-01 24.4 0.0
## 3 2008-03-01 30.5 -2.7
## 4 2008-04-01 46.0 -0.3
## 5 2008-05-01 54.2 -3.3
## 6 2008-06-01 66.3 1.1
## 7 2008-07-01 71.1 -0.1
## 8 2008-08-01 66.4 -2.5
## 9 2008-09-01 61.0 -1.0
## 10 2008-10-01 46.8 -3.0
## # ... with 110 more rows
filter(nh_wx, month(date, label=TRUE) == "Feb")
## # A tibble: 10 x 3
## date value anomaly
## <date> <dbl> <dbl>
## 1 2008-02-01 24.4 0.0
## 2 2009-02-01 24.4 0.0
## 3 2010-02-01 28.8 4.4
## 4 2011-02-01 21.2 -3.2
## 5 2012-02-01 30.1 5.7
## 6 2013-02-01 25.1 0.7
## 7 2014-02-01 19.5 -4.9
## 8 2015-02-01 12.2 -12.2
## 9 2016-02-01 28.3 3.9
## 10 2017-02-01 30.2 5.8
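For reference, the same February filter can be written without `lubridate`, using only base R. This is a sketch under stated assumptions: `toy_wx` is a hypothetical four-row stand-in whose values are copied from rows printed earlier, and `format()` with `"%m"` is one of several base R ways to get the month:

```r
# Hypothetical stand-in for nh_wx, using four rows that appear in the output above
toy_wx <- data.frame(
  date = as.Date(c("2008-01-01", "2008-02-01", "2009-02-01", "2009-03-01")),
  value = c(23.2, 24.4, 24.4, 32.9)
)

# format() turns a Date into text; "%m" asks for the two-digit month
month_num <- as.integer(format(toy_wx$date, "%m"))

# Logical index on the rows, just like the value > 32 example earlier
feb_rows <- toy_wx[month_num == 2, ]
```

Comparing month numbers avoids depending on month names, which can change with the computer's language settings.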
REALLY quick intro to plotting
The book(s) & resource(s) cover ggplot2. There are many ways to plot in R, but we’ll primarily be using ggplot2. Since ggplot2 is a whole class in and of itself, let’s just walk through plotting the average monthly temperature data as a line chart and worry about the details in that class. I’ll sneak ggplot2 in quite a bit all semester.
NOTE: ggplot2 “comes along for the ride” with `tidyverse`, so no extra `library()` call is needed.
ggplot(nh_wx) +
  geom_line(aes(date, value))
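Since base R code is something you will run into, here is roughly the same line chart drawn with base R's `plot()` instead of ggplot2. This is only a sketch: `toy_wx` is a hypothetical four-row stand-in for the full data, and the axis labels are my own wording:

```r
# Hypothetical stand-in for nh_wx: the first four rows shown earlier
toy_wx <- data.frame(
  date = as.Date(c("2008-01-01", "2008-02-01", "2008-03-01", "2008-04-01")),
  value = c(23.2, 24.4, 30.5, 46.0)
)

# type = "l" draws a line instead of points
plot(
  toy_wx$date, toy_wx$value,
  type = "l",
  xlab = "Date",
  ylab = "Average temperature (deg F)"
)
```

Base plotting is fine for a quick look; ggplot2 pays off once you start layering, faceting, or styling charts.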
You’ve got enough groundwork now for the first project!
### More info
Read these. A lot. You’ll get spot quizzes occasionally about some esoteric edge cases.
help("read_csv")
help("read.csv") # note the "." vs "_"
help("Date")
help("as.Date")
help("min") # might be good to try this and the other 3 below on some vectors you create
help("max")
help("mean")
help("median")
help("lubridate") # VERY helpful package for working with dates
help("factor")
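As suggested in the comment next to `help("min")`, it is worth trying the four summary functions on a vector you create yourself. A minimal sketch, assuming a made-up vector `temps` (its values are the first four temperatures from the data) and one deliberately missing value:

```r
temps <- c(23.2, 24.4, 30.5, 46.0)   # first four values from the data above

min(temps)      # smallest value
max(temps)      # largest value
mean(temps)     # arithmetic average
median(temps)   # middle value; with an even count, the mean of the two middle values

# A missing value (NA) makes the result NA unless na.rm = TRUE is set
with_gap <- c(temps, NA)
mean(with_gap)                 # NA
mean(with_gap, na.rm = TRUE)   # same as mean(temps)
```

The `na.rm` argument is exactly the kind of edge case the help pages document, and `min()`, `max()`, and `median()` accept it too.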