I needed to present some of last semester's work to our lab group, so I created a Slidy presentation in RStudio. It gives a brief introduction to some of the data manipulation and analysis techniques I have learned recently and applied to precipitation data we have collected across one of our sudden oak death study areas. The data still need some work. The presentation can be exported from RStudio as HTML or PDF, so it is easily viewable on almost any machine. Some things render a little differently depending on the format, so here are the PDF and HTML versions for comparison. Also, here is the R Markdown (.rmd) file to download if you want to examine the code syntax. I provide a few highlights below. Each header marks a new slide in the presentation.
Data Manipulation with R
Whalen Dillon
December 9, 2014
R Markdown
This is a Slidy presentation generated using R Markdown in RStudio.
Things to keep in mind about R
It is more a scripting language than a programming language
R is optimized for vectorization (what the heck does that mean?)
Generally avoid looping operations:
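# NOTE: the assignments were lost from the original slide; the definition of
# `data` and the loop body below are reconstructed assumptions
data <- 1:100000
data_squared <- numeric(length(data))
system.time(
  for(i in data){ data_squared[i] <- data[i]^2 }
)
##    user  system elapsed
##   0.151   0.009   0.160

# Vectorization is faster
system.time(data_squared <- data^2)
##    user  system elapsed
##       0       0       0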
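To make "vectorization" concrete, here is a minimal sketch that is not taken from the slides. A vectorized operation applies to every element of a vector in a single call, with the looping done in compiled code rather than in interpreted R. The vector `x` and the helper `square_loop` are invented for illustration.

```r
# Hypothetical input vector; any numeric vector works.
x <- c(1.5, 2, 3, 4)

# Explicit loop: R interprets each iteration separately.
square_loop <- function(v) {
  out <- numeric(length(v))   # preallocate the result
  for (i in seq_along(v)) {
    out[i] <- v[i]^2
  }
  out
}

# Vectorized: one call, applied element-wise.
squared_vec <- x^2

# Check that both approaches agree.
identical(square_loop(x), squared_vec)
```

The loop and the vectorized call compute the same values; the difference is only in where the iteration happens, which is why the timing gap grows with the length of the vector.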
Getting data into R: multiple files
I have a directory with annual data files over 10 years
files <- list.files("Rain_Gauge/2_RG_EXPORTS", pattern="*.csv", full.names=TRUE)
length(files)
## [1] 112

head(files, 3)
## [1] "Rain_Gauge/2_RG_EXPORTS/annadel_day_hr_2004.csv"
## [2] "Rain_Gauge/2_RG_EXPORTS/annadel_day_hr_2005.csv"
## [3] "Rain_Gauge/2_RG_EXPORTS/annadel_day_hr_2006.csv"
Getting data into R: multiple files
Read all the files in the vector “files” into a single data frame
library(plyr)       # `ldply()` function reads a list, returns a data frame
library(data.table) # `fread()` function
rg_data <- ldply(files, function(i){fread(i)})
class(rg_data)
## [1] "data.frame"

head(rg_data, 3)
##        id       date     time events daily_events hourly_events
## 1 annadel 11/12/2003 13:00:00     NA           NA             0
## 2 annadel 11/12/2003 14:00:00     NA           NA             0
## 3 annadel 11/12/2003 15:00:00     NA           NA             0
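For readers without plyr installed, the same read-and-stack pattern can be sketched in base R. This is an alternative, not the method used in the presentation. The temporary files, their contents, and the names `demo_dir`, `demo_files`, and `demo_data` are invented so that the example runs anywhere.

```r
# Write two small CSV files to a temporary directory (made-up data).
demo_dir <- file.path(tempdir(), "rg_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = "a", hourly_events = c(0, 1)),
          file.path(demo_dir, "a_2004.csv"), row.names = FALSE)
write.csv(data.frame(id = "b", hourly_events = c(2, 0, 3)),
          file.path(demo_dir, "b_2004.csv"), row.names = FALSE)

# List the files; `pattern` is a regular expression, so the dot is escaped.
demo_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file into a data frame, then stack the pieces row-wise.
demo_list <- lapply(demo_files, read.csv, stringsAsFactors = FALSE)
demo_data <- do.call(rbind, demo_list)

nrow(demo_data)   # total rows across both files
```

`do.call(rbind, ...)` requires every file to have the same columns, which is also what the `ldply()` call relies on.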
Dealing with dates and time
I want to be able to group and sort by dates and times
Join the date and time columns into a new variable, date_time
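# Right-hand side reconstructed; the original expression was lost
rg_data$date_time <- paste(rg_data$date, rg_data$time)
class(rg_data$date_time)
## [1] "character"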
Dealing with dates and time
Convert date_time into a format interpretable by the computer (POSIX)
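# Call reconstructed from the surviving `tz="UTC")` fragment; strptime() and
# the format string are assumptions based on the dates and class shown here
rg_data$date_time <- strptime(rg_data$date_time, format="%m/%d/%Y %H:%M:%S",
                              tz="UTC")
class(rg_data$date_time)
## [1] "POSIXlt" "POSIXt"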
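As an illustration of the conversion step, the sketch below parses a few strings in the month/day/year layout seen in the data. The sample strings are invented. It also shows `as.POSIXct()`, which stores each timestamp as a single number and tends to be easier to keep in a data frame column than the list-based POSIXlt class. That is a suggestion, not what the slides did.

```r
# Made-up timestamps in the same layout as the date and time columns.
stamps <- c("11/12/2003 13:00:00", "11/12/2003 14:00:00")

# strptime() returns POSIXlt (a list of date-time components).
lt <- strptime(stamps, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
class(lt)

# as.POSIXct() returns seconds since 1970-01-01 as a numeric vector.
ct <- as.POSIXct(stamps, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
class(ct)

# Time arithmetic works directly on the parsed values.
difftime(ct[2], ct[1], units = "hours")
```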
Dealing with dates and time
Create year, month, and day variables for grouping
- Many functions can’t handle POSIX-formatted dates and times
The year(), month(), and mday() functions used here come from the data.table package
rg_data$year <- year(rg_data$date_time)   # extracts year
rg_data$month <- month(rg_data$date_time) # extracts month
rg_data$day <- mday(rg_data$date_time)    # extracts day of month
head(rg_data, 3)
##        id       date     time events daily_events hourly_events
## 1 annadel 11/12/2003 13:00:00     NA           NA             0
## 2 annadel 11/12/2003 14:00:00     NA           NA             0
## 3 annadel 11/12/2003 15:00:00     NA           NA             0
##             date_time year month day
## 1 2003-11-12 13:00:00 2003    11  12
## 2 2003-11-12 14:00:00 2003    11  12
## 3 2003-11-12 15:00:00 2003    11  12
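If data.table is not loaded, the same components can be pulled out with base R. A minimal sketch with an invented timestamp: POSIXlt stores the year as years since 1900 and the month as 0 to 11, so both need an offset. The names `when`, `yr`, `mon`, and `dom` are hypothetical.

```r
# Made-up timestamp for illustration.
when <- as.POSIXlt("2003-11-12 13:00:00", tz = "UTC")

yr  <- when$year + 1900   # POSIXlt years count from 1900
mon <- when$mon + 1       # POSIXlt months run 0-11
dom <- when$mday          # day of month, 1-31

c(year = yr, month = mon, day = dom)
```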
Subset and summarize data
Create dataset of daily precipitation in inches
library(dplyr)
# `<- rg_data` reconstructed; the assignment was lost from the original
dy_rg_data <- rg_data %>%
  select(id, date, year, month, day, events) %>%
  group_by(id, year, month, day) %>%
  summarize(daily_events=length(events), daily_ppt=length(events)*0.01)
str(dy_rg_data)
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 34807 obs. of 6 variables:
##  $ id          : chr "annadel" "annadel" "annadel" "annadel" ...
##  $ year        : int 2003 2003 2003 2003 2003 2003 2003 2003 2003 2003 ...
##  $ month       : int 11 11 11 11 11 11 11 11 11 11 ...
##  $ day         : int 12 13 14 15 16 17 18 19 20 21 ...
##  $ daily_events: int 11 24 37 26 33 24 24 24 24 24 ...
##  $ daily_ppt   : num 0.11 0.24 0.37 0.26 0.33 0.24 0.24 0.24 0.24 0.24 ...
##  - attr(*, "vars")=List of 3
##   ..$ : symbol id
##   ..$ : symbol year
##   ..$ : symbol month
##  - attr(*, "drop")= logi TRUE
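One thing worth checking in a summary like this is whether counting rows or summing values is intended: `length(events)` counts the rows in each group, while `sum()` adds up the recorded values. The sketch below shows both on invented toy data. The column names `n_rows` and `total_events` are hypothetical, and it assumes a dplyr version that supports the `.groups` argument. Which of the two is right for the rain gauge data depends on how the `events` column is recorded.

```r
library(dplyr)

# Made-up records: two gauges, a few rows per day.
toy <- data.frame(
  id     = c("a", "a", "a", "b", "b"),
  day    = c(1, 1, 2, 1, 1),
  events = c(2, 0, 1, NA, 3)
)

toy_daily <- toy %>%
  group_by(id, day) %>%
  summarise(
    n_rows       = n(),                        # number of rows in the group
    total_events = sum(events, na.rm = TRUE),  # sum of the recorded values
    .groups = "drop"                           # return an ungrouped result
  )

toy_daily
```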
Subset and summarize data
Add a date interpretable by the computer
dy_rg_data$date <- as.Date(
  with(dy_rg_data, paste(as.character(year), as.character(month),
                         as.character(day), sep="/")),
  format = "%Y/%m/%d")
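An alternative sketch for the same step builds the date from its numeric parts with base R's `ISOdate()`, which avoids pasting strings together. The values and the names `yr`, `mon`, `dom`, and `dates` are invented for illustration.

```r
# Made-up year, month, and day columns.
yr  <- c(2003, 2003, 2004)
mon <- c(11, 12, 1)
dom <- c(12, 31, 1)

# ISOdate() returns POSIXct at 12:00 GMT; as.Date() drops the time part.
dates <- as.Date(ISOdate(yr, mon, dom))
class(dates)

# Dates sort and subtract like numbers.
diff(dates)
```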
Plot rainfall data
library(ggplot2)
qplot(date, daily_ppt, data = dy_rg_data, geom = c("point","line"),
      ylab = "Daily rainfall (inches)", color = daily_ppt > 6)
Re-plot rainfall data without outliers
qplot(date, daily_ppt, data = dy_rg_data %>% filter(daily_ppt < 6),
      geom = c("point","line"), ylab = "Daily rainfall (inches)",
      color = year) + theme_bw()
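`qplot()` was deprecated in ggplot2 3.4.0, so on a current installation the same kind of plot would be written with `ggplot()`. The sketch below uses invented data in place of `dy_rg_data` and assumes ggplot2 and dplyr are installed. The names `toy_rain`, `toy_clean`, and `p` are hypothetical.

```r
library(ggplot2)
library(dplyr)

# Made-up daily rainfall, including one implausible spike.
toy_rain <- data.frame(
  date      = as.Date("2004-01-01") + 0:5,
  daily_ppt = c(0.10, 0.45, 7.20, 0.00, 1.30, 0.25),
  year      = 2004
)

# Drop the spike, using the same 6-inch threshold as the slides.
toy_clean <- toy_rain %>% filter(daily_ppt < 6)

# Points connected by a line, colored by year.
p <- ggplot(toy_clean, aes(x = date, y = daily_ppt, color = year)) +
  geom_point() +
  geom_line() +
  labs(y = "Daily rainfall (inches)") +
  theme_bw()

print(p)
```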