Intro to R as GIS

For the past two years I have served as one of two student representatives on the US-IALE executive committee. In addition to providing a student voice on the ExComm, one of the major things we do is organize a students-only half-day workshop at our annual meeting. It is offered at no cost to students attending the conference, and we try to cover topics relevant to the field. In 2015 our chapter hosted the IALE World Congress in Portland, Oregon, and with an eye on software that many students are learning to use we (ambitiously) put together an introductory workshop on manipulating and analyzing spatial data using R. We were able to recruit three other people to help develop and deliver the workshop, and managed to cram the whole thing into 4 hours.

[Photo: IALE 2015 student workshop]

Karl providing some guidance during the workshop – equally possible that he’s saying, “I haven’t seen that error before…”

Given that we unleashed a barely controlled firehose of R on the attendees, I think that overall it went okay. Given the material I think it would work better as a 6-8 hour workshop with the option for attendees to bring/use their own data. Maybe this is the way it should be set up from the start, i.e. here is a dataset that I know it works with, now try and do it with your own. I haven’t organized or been a part of delivering many workshops, but I learned a lot and really enjoyed the experience.

If you want to check it out, the workshop materials are freely available here on GitHub.

Presenting Data Manipulations in R

I needed to make a presentation on some of my work last semester to give to our lab group, so I created a Slidy presentation in RStudio. I give a brief introduction to some of the data manipulation and analysis techniques I have learned recently, and applied to some precipitation data we have collected across one of our sudden oak death study areas. This data still needs some work. The presentation can be exported from RStudio as an HTML or PDF so it is easily viewable on almost any machine. Some things show up a little bit differently depending on the selected format, so here are the PDF and HTML versions for comparison. Also, here is the RMarkdown (.rmd) file to download if you want to examine the code syntax. I provide a few highlights below. Each of the headers would indicate a new slide in the presentation.

Data Manipulation with R

Whalen Dillon
December 9, 2014

R Markdown

This is a Slidy presentation generated using R Markdown in RStudio.

Things to keep in mind about R

It is more a scripting language than a programming language

R is optimized for vectorization (what the heck does that mean?)

Generally avoid looping operations:

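# The original definitions of `data` and `data_squared` were lost;
# the two assignments below are assumed stand-ins
data <- 1:100000
data_squared <- numeric(length(data))
# Looping over every element is slow
system.time(
  for(i in data){
    data_squared[i] <- data[i]^2
  })

## user system elapsed
## 0.151 0.009 0.160
# Vectorization is faster
system.time(data_squared <- data^2)

## user system elapsed
## 0 0 0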

Getting data into R – multiple files

I have a directory with annual data files over 10 years

files <- list.files("Rain_Gauge/2_RG_EXPORTS", pattern="\\.csv$", full.names=TRUE) # `pattern` is a regular expression, not a glob

length(files)

## [1] 112
head(files, 3)

## [1] "Rain_Gauge/2_RG_EXPORTS/annadel_day_hr_2004.csv"
## [2] "Rain_Gauge/2_RG_EXPORTS/annadel_day_hr_2005.csv"
## [3] "Rain_Gauge/2_RG_EXPORTS/annadel_day_hr_2006.csv"

Getting data into R – multiple files

Read all the files in the vector “files” into a single data frame

library(plyr) # `ldply()` function reads a list, returns a data frame
library(data.table) # `fread()` function
rg_data <- ldply(files, function(i){fread(i)})
class(rg_data)

## [1] "data.frame"
head(rg_data, 3)

## id date time events daily_events hourly_events
## 1 annadel 11/12/2003 13:00:00 NA NA 0
## 2 annadel 11/12/2003 14:00:00 NA NA 0
## 3 annadel 11/12/2003 15:00:00 NA NA 0
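
If plyr and data.table are not installed, the same read-and-stack step can be sketched in base R. This is a minimal sketch rather than the code used for the presentation: the two CSV files are written to a temporary directory purely for illustration, and the names `tmp_dir`, `demo_files`, and `demo_data` are hypothetical.

```r
# Write two small example files to a temporary directory (made-up data)
tmp_dir <- file.path(tempdir(), "rg_demo")
dir.create(tmp_dir, showWarnings = FALSE)
write.csv(data.frame(id = "annadel", hourly_events = c(0, 2)),
          file.path(tmp_dir, "annadel_2004.csv"), row.names = FALSE)
write.csv(data.frame(id = "annadel", hourly_events = c(1, 0)),
          file.path(tmp_dir, "annadel_2005.csv"), row.names = FALSE)

# List the files, read each one into a data frame, then stack them by row
demo_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
demo_data <- do.call(rbind, lapply(demo_files, read.csv))

nrow(demo_data)   # 4 rows: 2 from each file
class(demo_data)  # "data.frame"
```

With many large files, `fread()` as used above is likely to be noticeably faster than `read.csv()`.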

Dealing with dates and time

I want to be able to group and sort by dates and times

Join date and time columns into new variable date_time

rg_data$date_time <- paste(rg_data$date, rg_data$time) # reconstructed; the right-hand side was lost
class(rg_data$date_time)

## [1] "character"

Dealing with dates and time

Convert date_time into format interpretable by the computer (POSIX)

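# Reconstructed call: the format string is assumed from the month/day/year values in the data
rg_data$date_time <- strptime(rg_data$date_time, format="%m/%d/%Y %H:%M:%S", tz="UTC")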
class(rg_data$date_time)

## [1] "POSIXlt" "POSIXt"
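
The two date steps can be tried on a couple of made-up values. This is a sketch that assumes the dates are month/day/year, as the rain gauge rows suggest; the vectors `d` and `tm` are hypothetical. It also shows `as.POSIXct()`, which stores one number per timestamp and tends to be easier to keep in a data frame than the list-based POSIXlt class.

```r
d  <- c("11/12/2003", "11/12/2003")
tm <- c("13:00:00", "14:00:00")

# Step 1: join date and time into a single character string
dt_chr <- paste(d, tm)
class(dt_chr)  # "character"

# Step 2: parse the string; strptime() returns POSIXlt
dt_lt <- strptime(dt_chr, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
class(dt_lt)   # "POSIXlt" "POSIXt"

# Optional: convert to the more compact POSIXct class
dt_ct <- as.POSIXct(dt_lt)
class(dt_ct)   # "POSIXct" "POSIXt"
format(dt_ct[1], "%Y-%m-%d %H:%M:%S")  # "2003-11-12 13:00:00"
```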

Dealing with dates and time

Create year, month, and day variables for grouping

  • Many functions can’t handle POSIX formatted date/time

These functions come from the data.table package in this case

rg_data$year <- year(rg_data$date_time) # extracts year
rg_data$month <- month(rg_data$date_time) # extracts month
rg_data$day <- mday(rg_data$date_time) # extracts day of month

head(rg_data, 3)

## id date time events daily_events hourly_events
## 1 annadel 11/12/2003 13:00:00 NA NA 0
## 2 annadel 11/12/2003 14:00:00 NA NA 0
## 3 annadel 11/12/2003 15:00:00 NA NA 0
## date_time year month day
## 1 2003-11-12 13:00:00 2003 11 12
## 2 2003-11-12 14:00:00 2003 11 12
## 3 2003-11-12 15:00:00 2003 11 12
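
For comparison, the same components can be pulled out with base R alone. This is a sketch on a single made-up timestamp `x`, not a replacement for the data.table helpers used above.

```r
x <- as.POSIXct("2003-11-12 13:00:00", tz = "UTC")

# format() returns text, so convert to integer for grouping and sorting
yr <- as.integer(format(x, "%Y"))  # 2003
mo <- as.integer(format(x, "%m"))  # 11
dy <- as.integer(format(x, "%d"))  # 12

# POSIXlt stores the components directly,
# but years count from 1900 and months from 0
lt <- as.POSIXlt(x)
lt$year + 1900  # 2003
lt$mon + 1      # 11
lt$mday         # 12
```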

Subset and summarize data

Create dataset of daily precipitation in inches

library(dplyr)
dy_rg_data <- rg_data %>% # reconstructed: the source data frame is assumed to be rg_data
  select(id, date, year, month, day, events) %>%
  group_by(id, year, month, day) %>%
  summarize(daily_events=length(events), daily_ppt=length(events)*0.01)
str(dy_rg_data)

## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 34807 obs. of 6 variables:
## $ id : chr "annadel" "annadel" "annadel" "annadel" ...
## $ year : int 2003 2003 2003 2003 2003 2003 2003 2003 2003 2003 ...
## $ month : int 11 11 11 11 11 11 11 11 11 11 ...
## $ day : int 12 13 14 15 16 17 18 19 20 21 ...
## $ daily_events: int 11 24 37 26 33 24 24 24 24 24 ...
## $ daily_ppt : num 0.11 0.24 0.37 0.26 0.33 0.24 0.24 0.24 0.24 0.24 ...
## - attr(*, "vars")=List of 3
## ..$ : symbol id
## ..$ : symbol year
## ..$ : symbol month
## - attr(*, "drop")= logi TRUE
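
The summary counts the rows in each group and multiplies the count by 0.01, which appears to assume that every row is one recorded event worth 0.01 inches of rain (an inference from the code, not something the slides state). A base R sketch of the same group-and-count idea on made-up data, using the hypothetical names `toy` and `daily`:

```r
toy <- data.frame(
  id  = c("annadel", "annadel", "annadel", "annadel", "annadel"),
  day = c(12, 12, 12, 13, 13)
)
toy$events <- 1  # one row per recorded event

# Count events per gauge and day, then convert the counts to inches
daily <- aggregate(events ~ id + day, data = toy, FUN = length)
names(daily)[names(daily) == "events"] <- "daily_events"
daily$daily_ppt <- daily$daily_events * 0.01
daily  # day 12 has 3 events, day 13 has 2
```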

Subset and summarize data

Add a date interpretable by the computer

dy_rg_data$date <- as.Date(
with(dy_rg_data, paste(as.character(year), as.character(month),
as.character(day), sep="/")),
format = "%Y/%m/%d")

Plot rainfall data

library(ggplot2)
qplot(date, daily_ppt, data = dy_rg_data, geom = c("point","line"),
ylab = "Daily rainfall (inches)", color = daily_ppt > 6)

[Plot: daily precipitation]
Maybe a few outliers…

Re-plot rainfall data without outliers

qplot(date, daily_ppt,
data = dy_rg_data %>% filter(daily_ppt < 6),
geom = c("point","line"), ylab = "Daily rainfall (inches)",
color = year) +
theme_bw()

[Plot: daily precipitation without outliers]

Temporary files pile-up while using the `raster` package in R

Update: I’m not sure that this method has ever actually worked for me. I would love to hear about success or failure from others. Restarting the computer seems to always free things up.

The raster package in R is an incredibly useful and powerful free and open-source tool for geospatial analysis, especially if you are familiar with R but don’t work regularly with other GIS software. It is also very useful even if you do; after all, you may not always be working somewhere that can afford licenses for commercial desktop GIS software (ahem, ArcGIS). Though my fellow students here at NC State and I have ready access to commercial software at no cost to ourselves, we really like learning to use and integrate R into our work, because the work can then be reproduced and we can collaborate more easily. The quoted information in this post can be found here on Inside-R, a super-helpful reference site. The raster package was developed for:

“Reading, writing, manipulating, analyzing and modeling of gridded spatial data. The package implements basic and high-level functions. Processing of very large files is supported.”

It’s that last part that leads to the pile-up of temporary files.

Another student and I are working on a project (it’s his thesis, so really he is doing the work) that compares outcomes of species distribution models using predictor variables at different resolutions across the entire extent of Oregon and California. This means processing a lot of mapped surfaces during the fitting and prediction phases. The raster package is the only way to handle this, because:

“Functions in the raster package create temporary files if the values of an output RasterLayer cannot be stored in memory (RAM). This can happen when no filename is provided to a function and in functions where you cannot provide a filename (e.g. when using ‘raster algebra’).”

Now, in a “normal” R session (using the command line or the GUI that comes with R installs) these temporary files are automatically removed at the start of each session. However, it seems that if you are using RStudio under certain settings (maybe the defaults, I’m not positive on that part), the temporary files may be retained even when you start a new session. So, if you find your hard drive filling up and are wondering “Where, why, and how do I fix it?”, the solution is right there in the raster package: the removeTmpFiles function. Its argument h sets the minimum age, in hours, of the temporary files to remove.

library(raster)
# Remove all temporary files that are more than 24 hours old:
removeTmpFiles(h=24)
# Remove all temporary files currently in existence:
removeTmpFiles(h=0)
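
To see which files would be affected before deleting anything, the age-based selection can be sketched in base R. This only illustrates the idea of filtering files by age in hours; it is not how the raster package implements removeTmpFiles, and the function name `files_older_than` and the demo directory are hypothetical.

```r
# Return the files in `dir` that were last modified more than `h` hours ago
files_older_than <- function(dir, h) {
  f <- list.files(dir, full.names = TRUE)
  age_hours <- as.numeric(difftime(Sys.time(), file.info(f)$mtime, units = "hours"))
  f[age_hours > h]
}

# Demo in a scratch directory: one fresh file and one back-dated by 48 hours
demo_dir <- file.path(tempdir(), "tmp_demo")
dir.create(demo_dir, showWarnings = FALSE)
new_file <- file.path(demo_dir, "new.grd")
old_file <- file.path(demo_dir, "old.grd")
file.create(new_file, old_file)
Sys.setFileTime(old_file, Sys.time() - 48 * 3600)

basename(files_older_than(demo_dir, h = 24))  # only "old.grd"
# unlink(files_older_than(demo_dir, h = 24))  # uncomment to actually delete them
```

Listing first and deleting second makes it easier to check that nothing from a still-running session is about to be removed.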