Presenting Data Manipulations in R

I needed to put together a presentation on some of my work last semester for our lab group, so I created a Slidy presentation in RStudio. It gives a brief introduction to some of the data manipulation and analysis techniques I have learned recently, applied to precipitation data we have collected across one of our sudden oak death study areas. These data still need some work. The presentation can be exported from RStudio as HTML or PDF, so it is easily viewable on almost any machine. Some things show up a little differently depending on the selected format, so here are the PDF and HTML versions for comparison. Also, here is the R Markdown (.rmd) file to download if you want to examine the code syntax. I provide a few highlights below. Each header indicates a new slide in the presentation.

Data Manipulation with R

Whalen Dillon
December 9, 2014

R Markdown

This is a slidy presentation generated using R Markdown in RStudio

Things to keep in mind about R

It is more a scripting language than a programming language

R is optimized for vectorization (what the heck does that mean?)

Generally avoid looping operations:

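data <- 1:100000                        # example input; the original values were lost, so this size is assumed
data_squared <- numeric(length(data))   # preallocate the result vector
system.time(
for(i in data){
data_squared[i] <- i^2 })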

## user system elapsed
## 0.151 0.009 0.160
# Vectorization is faster
system.time(data_squared <- data^2)

## user system elapsed
## 0 0 0
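
To make "vectorization" a bit more concrete: a vectorized operation is applied to every element of a vector in a single call, so the looping happens in compiled code rather than in R itself. Here is a minimal sketch using a made-up vector `x` (not the data from the slides) showing that the loop and the vectorized call give the same answer:

```r
# Hypothetical example vector (not from the original slides)
x <- c(1.5, 2, 3.25, 4)

# Loop version: preallocate the result, then fill it one element at a time
loop_result <- numeric(length(x))
for (i in seq_along(x)) {
  loop_result[i] <- x[i]^2
}

# Vectorized version: `^` operates on the whole vector in a single call
vec_result <- x^2

# Check that both approaches produce the same values
all.equal(loop_result, vec_result)
```

The two versions differ only in where the loop runs, which is why the vectorized form tends to be both shorter and faster.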

Getting data into R – multiple files

I have a directory with annual data files over 10 years

files <- list.files("Rain_Gauge/2_RG_EXPORTS", pattern="\\.csv$", full.names=TRUE) # pattern is a regular expression, not a glob

length(files)

## [1] 112
head(files, 3)

## [1] "Rain_Gauge/2_RG_EXPORTS/annadel_day_hr_2004.csv"
## [2] "Rain_Gauge/2_RG_EXPORTS/annadel_day_hr_2005.csv"
## [3] "Rain_Gauge/2_RG_EXPORTS/annadel_day_hr_2006.csv"

Getting data into R – multiple files

Read all the files in the vector “files” into a single data frame

library(plyr) # `ldply()` function reads a list, returns a data frame
library(data.table) # `fread()` function
rg_data <- ldply(files, function(i){fread(i)})
class(rg_data)

## [1] "data.frame"
head(rg_data, 3)

##        id       date     time events daily_events hourly_events
## 1 annadel 11/12/2003 13:00:00     NA           NA             0
## 2 annadel 11/12/2003 14:00:00     NA           NA             0
## 3 annadel 11/12/2003 15:00:00     NA           NA             0
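
If plyr and data.table are not available, the same "read every file, then stack them" idea can be sketched in base R. This is only an illustration under assumptions: the two small CSV files, their columns, and the names `tmp_dir`, `demo_files`, and `demo_data` are made up here so the example runs on its own.

```r
# Write two small hypothetical CSV files to a temporary directory
tmp_dir <- file.path(tempdir(), "rg_demo")
dir.create(tmp_dir, showWarnings = FALSE)
write.csv(data.frame(id = "site_a", hourly_events = c(0, 2)),
          file.path(tmp_dir, "site_a_2004.csv"), row.names = FALSE)
write.csv(data.frame(id = "site_b", hourly_events = c(1, 0)),
          file.path(tmp_dir, "site_b_2004.csv"), row.names = FALSE)

# List the files; `pattern` is a regular expression, so the dot is escaped
demo_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file into a list of data frames, then stack them row-wise
demo_list <- lapply(demo_files, read.csv, stringsAsFactors = FALSE)
demo_data <- do.call(rbind, demo_list)

nrow(demo_data)  # 2 rows from each of the 2 files
```

With many large files, `fread()` is likely to be noticeably faster than `read.csv()`, which is one reason to prefer the data.table version.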

Dealing with dates and time

I want to be able to group and sort by dates and times

Join date and time columns into new variable date_time

rg_data$date_time <- paste(rg_data$date, rg_data$time)
class(rg_data$date_time)

## [1] "character"

Dealing with dates and time

Convert date_time into format interpretable by the computer (POSIX)

rg_data$date_time <- strptime(rg_data$date_time, format="%m/%d/%Y %H:%M:%S", tz="UTC") # strptime() returns POSIXlt
class(rg_data$date_time)

## [1] "POSIXlt" "POSIXt"
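
R has two POSIX date-time classes, and the difference matters later on. POSIXlt (the class shown above) stores each date-time as a list of components, while POSIXct stores a single number of seconds since 1970-01-01 UTC. A small sketch with made-up timestamps in the same month/day/year layout, just to show the contrast:

```r
# Hypothetical timestamps in the same layout as the rain gauge files
stamps <- c("11/12/2003 13:00:00", "11/12/2003 14:00:00")

# POSIXlt: list-like, with separate fields (year, mon, mday, hour, ...)
lt <- as.POSIXlt(stamps, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# POSIXct: one numeric value per timestamp (seconds since 1970-01-01 UTC)
ct <- as.POSIXct(stamps, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

class(lt)                                # "POSIXlt" "POSIXt"
class(ct)                                # "POSIXct" "POSIXt"
lt$hour                                  # hour field of each timestamp
as.numeric(ct[2]) - as.numeric(ct[1])    # difference in seconds
```

POSIXct is usually the easier of the two to keep as a data frame column, since it is a plain numeric vector underneath.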

Dealing with dates and time

Create year, month, and day variables for grouping

  • Many functions can’t handle POSIX formatted date/time

These functions come from the data.table package in this case

rg_data$year <- year(rg_data$date_time) # extracts year
rg_data$month <- month(rg_data$date_time) # extracts month
rg_data$day <- mday(rg_data$date_time) # extracts day of month

head(rg_data, 3)

##        id       date     time events daily_events hourly_events
## 1 annadel 11/12/2003 13:00:00     NA           NA             0
## 2 annadel 11/12/2003 14:00:00     NA           NA             0
## 3 annadel 11/12/2003 15:00:00     NA           NA             0
##             date_time year month day
## 1 2003-11-12 13:00:00 2003    11  12
## 2 2003-11-12 14:00:00 2003    11  12
## 3 2003-11-12 15:00:00 2003    11  12
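
For comparison, the same components can be pulled straight out of a POSIXlt object in base R. This is a sketch with a single made-up timestamp; the variable names are hypothetical. The catch is that POSIXlt counts years from 1900 and months from zero, which is part of why helper functions are convenient.

```r
# Hypothetical timestamp, parsed as POSIXlt in UTC
stamp <- as.POSIXlt("2003-11-12 13:00:00", tz = "UTC")

# POSIXlt stores years since 1900 and zero-based months, so adjust both
yr  <- stamp$year + 1900
mon <- stamp$mon + 1
dy  <- stamp$mday   # day of month needs no adjustment

c(year = yr, month = mon, day = dy)
```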

Subset and summarize data

Create dataset of daily precipitation in inches

library(dplyr)
dy_rg_data <- rg_data %>%
select(id, date, year, month, day, events) %>%
group_by(id, year, month, day) %>%
summarize(daily_events=length(events), daily_ppt=length(events)*0.01)
str(dy_rg_data)

## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 34807 obs. of 6 variables:
## $ id : chr "annadel" "annadel" "annadel" "annadel" ...
## $ year : int 2003 2003 2003 2003 2003 2003 2003 2003 2003 2003 ...
## $ month : int 11 11 11 11 11 11 11 11 11 11 ...
## $ day : int 12 13 14 15 16 17 18 19 20 21 ...
## $ daily_events: int 11 24 37 26 33 24 24 24 24 24 ...
## $ daily_ppt : num 0.11 0.24 0.37 0.26 0.33 0.24 0.24 0.24 0.24 0.24 ...
## - attr(*, "vars")=List of 3
## ..$ : symbol id
## ..$ : symbol year
## ..$ : symbol month
## - attr(*, "drop")= logi TRUE
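
The grouping step counts the rows recorded for each gauge on each day and multiplies by 0.01 to get inches. Here is a rough base R equivalent using `aggregate()` instead of dplyr, on a tiny made-up table (the `tips` data frame and its values are hypothetical):

```r
# Hypothetical event log: one row per recorded event
tips <- data.frame(
  id    = c("site_a", "site_a", "site_a", "site_b"),
  year  = c(2003, 2003, 2003, 2003),
  month = c(11, 11, 11, 11),
  day   = c(12, 12, 13, 12),
  stringsAsFactors = FALSE
)
tips$events <- 1  # placeholder column to count

# Count rows per gauge and day (the same idea as group_by() + summarize())
daily <- aggregate(events ~ id + year + month + day, data = tips, FUN = length)
names(daily)[names(daily) == "events"] <- "daily_events"

# Convert counts to inches, assuming 0.01 inch per event as in the slide
daily$daily_ppt <- daily$daily_events * 0.01
daily
```

Here site_a has two events on day 12, so its daily total works out to 2 × 0.01 = 0.02 inches.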

Subset and summarize data

Add a date interpretable by the computer

dy_rg_data$date <- as.Date(
with(dy_rg_data, paste(as.character(year), as.character(month),
as.character(day), sep="/")),
format = "%Y/%m/%d")
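
As an aside, base R's `ISOdate()` can build the date from the numeric parts directly, without pasting strings together. A small sketch comparing the two approaches on made-up values (the `parts` data frame is hypothetical):

```r
# Hypothetical year/month/day columns
parts <- data.frame(year = c(2003, 2004), month = c(11, 2), day = c(12, 29))

# Approach from the slide: paste into a string, then parse with a format
d1 <- as.Date(with(parts, paste(year, month, day, sep = "/")),
              format = "%Y/%m/%d")

# Alternative: ISOdate() builds a date-time (at noon GMT) from numeric parts
d2 <- as.Date(with(parts, ISOdate(year, month, day)))

all(d1 == d2)  # both approaches give the same dates
format(d1)     # dates printed in ISO year-month-day form
```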

Plot rainfall data

library(ggplot2)
qplot(date, daily_ppt, data = dy_rg_data, geom = c("point","line"),
ylab = "Daily rainfall (inches)", color = daily_ppt > 6)

[Figure: daily rainfall (inches) by date, with days above 6 inches highlighted]
Maybe a few outliers…

Re-plot rainfall data without outliers

qplot(date, daily_ppt,
data = dy_rg_data %>% filter(daily_ppt < 6),
geom = c("point","line"), ylab = "Daily rainfall (inches)",
color = year) +
theme_bw()

[Figure: daily rainfall (inches) by date with outliers removed, colored by year]