Intro to R as GIS

For the past two years I have served as one of two student representatives on the US-IALE executive committee. One of the major things we do, in addition to providing a student voice on the ExComm, is organize a students-only half-day workshop at our annual meeting. This is offered at no cost to students attending the conference, and we try to do things relevant to the field. In 2015 our chapter hosted the IALE World Congress in Portland, Oregon, and with an eye on software that many students are learning to use, we (ambitiously) put together an introductory workshop on manipulating and analyzing spatial data using R. We were able to recruit three other people to help develop and deliver the workshop, and managed to cram the whole thing into four hours.


Karl providing some guidance during the workshop – equally possible that he’s saying, “I haven’t seen that error before…”

We unleashed a barely controlled firehose of R on the attendees, but I think that overall it went okay. Given the material, I think it would work better as a 6-8 hour workshop with the option for attendees to bring and use their own data. Maybe that is how it should be set up from the start, i.e. here is a dataset that I know works, now try to do the same with your own. I haven’t organized or helped deliver many workshops, but I learned a lot and really enjoyed the experience.

If you want to check out the workshop materials, they are freely available here on GitHub.

Edge Effects and Connectivity in Landscape Ecology

The way the landscape is seen from your perspective or mine is likely similar, yet not quite the same; still, our interactions with this landscape are completely different from those of a wolf or a bird or a plant or a microbe. This is infinitely fascinating to me.

This semester we have been having paper discussions during our lab meetings, each led by a different member (grad students and postdocs). The first few tilted toward the human-dimensions side of our lab, so I was excited to mix things up and lead a discussion about some traditional landscape ecology research. Thinking about the incredible variety of landscapes, how they are connected and divided, and how those patterns of connection and division change depending on your perspective, is my version of “going back to the bench.” It is one of my major inspirations as a scientist. So this week we talked about some ideas at the foundation of landscape ecology, particularly edge effects and connectivity.

What are “edge effects” and “connectivity” anyway? The people in our lab group come from a variety of backgrounds, personally and academically. I asked people to define “edge effects” from their perspective, which produced two responses. Everyone has at least a little experience with GIS, so one type of “edge effect” brought up was technological: when you run a calculation over a gridded surface, the values at the edges of the map end up biased because fewer input cells are available to compute them. The other definition was ecological: an “edge effect” arises from an abrupt transition between environments or landscape characteristics that creates relatively distinct habitat boundaries. This type of edge effect influences the local climate and the species likely to occur or occupy the space on either side.
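A tiny base-R sketch of that technological edge effect (the grid and window size here are made up for illustration): a 3×3 moving window centered on an interior cell sees nine cells, while a corner cell sees only four, so naive window statistics are biased at the map edges.

```r
# A 5x5 grid of constant value 1; a 3x3 window mean "should" be 1 everywhere
g <- matrix(1, nrow = 5, ncol = 5)

# Count how many in-bounds cells a 3x3 window centered at (r, c) can use
neighbors_in_window <- function(m, r, c) {
  rows <- max(1, r - 1):min(nrow(m), r + 1)
  cols <- max(1, c - 1):min(ncol(m), c + 1)
  length(m[rows, cols])
}

neighbors_in_window(g, 3, 3)  # interior cell: full 9-cell window
neighbors_in_window(g, 1, 1)  # corner cell: only 4 cells available

# If out-of-bounds cells were instead padded with zeros, the corner "mean"
# would be biased low: 4 ones averaged over the full 9-cell window
sum(g[1:2, 1:2]) / 9  # about 0.44 rather than the true value of 1
```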

Connectivity typically falls into one of two categories, structural or functional, though these are not necessarily mutually exclusive. Structural connectivity is probably the more familiar of the two. One example is wildlife corridors, which provide a pathway for animals to travel but are not exactly the type of habitat where they would linger. For me, functional connectivity is more easily characterized by thinking about passively dispersed organisms such as wind-dispersed pathogens (I study one of these, so I might be a little biased). In this example, the pathogen depends on hosts occurring in sufficient frequency and density for it to traverse the landscape, then establish and reproduce in a new location. So a corridor connecting two larger areas may be structural or functional or both in terms of connectivity.

In the paper we discussed, the authors designed a landscape-scale experiment to test the effects of connectivity, fragmentation, and edges on the development and spread of a plant disease. The landscape-scale experiment itself is admirable, because replication at a scale larger than a laboratory or greenhouse is challenging. It is just so big.

The pathogen they were investigating was southern corn leaf blight on sweet corn. They tested whether a structural corridor affected the spread and development of this wind-dispersed pathogen across the landscape. In addition they tested whether there were edge effects on disease development by placing infected plants at varying distances from the edge of the “habitat” patch. The habitat in this case was “regenerating longleaf pine forest” that had been cut into patches with various configurations (I believe for other purposes, but useful for this experiment). They found that connectivity did not have a detectable effect on disease spread or development, but did detect edge effects that were dependent on the configuration of the patch.

While this landscape was supremely useful for doing experiments with this disease system, a substantial drawback was the realism. The immediate question that came to my mind: if there had been functional connectivity in addition to the structural connectivity, would they have detected an effect, especially since this is a passively dispersing pathogen? This is an additional experiment that I and others thought would have really improved the study, but that does not take away from the insights they did gain. And I think this is how science works: in bits and pieces, fits and starts, until eventually we can hopefully say at least one thing about a system or process with substantial confidence.

Presenting Data Manipulations in R

Last semester I needed to make a presentation on some of my work to give to our lab group, so I created a Slidy presentation in RStudio. It gives a brief introduction to some of the data manipulation and analysis techniques I have learned recently, applied to precipitation data we have collected across one of our sudden oak death study areas (this data still needs some work). The presentation can be exported from RStudio as HTML or PDF, so it is easily viewable on almost any machine. Some things show up a little differently depending on the selected format, so here are the PDF and HTML versions for comparison. Also, here is the RMarkdown (.rmd) file to download if you want to examine the code syntax. I provide a few highlights below; each of the headers would indicate a new slide in the presentation.

Data Manipulation with R

Whalen Dillon
December 9, 2014

R Markdown

This is a Slidy presentation generated using R Markdown in RStudio.

Things to keep in mind about R

It is more a scripting language than a programming language

R is optimized for vectorization (what the heck does that mean?)

Generally avoid looping operations:

data <- 1:100000  # example vector; the original definition was lost in extraction
data_squared <- vector("numeric", length(data))
system.time(
for(i in data){
data_squared[i] <- data[i]^2 })

## user system elapsed
## 0.151 0.009 0.160

# Vectorization is faster
system.time(data_squared <- data^2)

## user system elapsed
## 0 0 0

Getting data into R – multiple files

I have a directory with annual data files over 10 years

files <- list.files("Rain_Gauge/2_RG_EXPORTS", pattern="*.csv", full.names=TRUE)


length(files)

## [1] 112
head(files, 3)

## [1] "Rain_Gauge/2_RG_EXPORTS/annadel_day_hr_2004.csv"
## [2] "Rain_Gauge/2_RG_EXPORTS/annadel_day_hr_2005.csv"
## [3] "Rain_Gauge/2_RG_EXPORTS/annadel_day_hr_2006.csv"

Getting data into R – multiple files

Read all the files in the vector “files” into a single data frame

library(plyr) # `ldply()` function reads a list, returns a data frame
library(data.table) # `fread()` function
rg_data <- ldply(files, function(i){fread(i)})

class(rg_data)

## [1] "data.frame"
head(rg_data, 3)

## id date time events daily_events hourly_events
## 1 annadel 11/12/2003 13:00:00 NA NA 0
## 2 annadel 11/12/2003 14:00:00 NA NA 0
## 3 annadel 11/12/2003 15:00:00 NA NA 0

Dealing with dates and time

I want to be able to group and sort by dates and times

Join date and time columns into new variable date_time

rg_data$date_time <- paste(rg_data$date, rg_data$time)  # reconstructed join
class(rg_data$date_time)

## [1] "character"

Dealing with dates and time

Convert date_time into format interpretable by the computer (POSIX)

rg_data$date_time <- strptime(rg_data$date_time,
format="%m/%d/%Y %H:%M:%S", tz="UTC")  # format reconstructed from the data shown above
class(rg_data$date_time)

## [1] "POSIXlt" "POSIXt"

Dealing with dates and time

Create year, month, and day variables for grouping

  • Many functions can’t handle POSIX formatted date/time

These functions come from the data.table package in this case

rg_data$year <- year(rg_data$date_time) # extracts year
rg_data$month <- month(rg_data$date_time) # extracts month
rg_data$day <- mday(rg_data$date_time) # extracts day of month

head(rg_data, 3)

## id date time events daily_events hourly_events
## 1 annadel 11/12/2003 13:00:00 NA NA 0
## 2 annadel 11/12/2003 14:00:00 NA NA 0
## 3 annadel 11/12/2003 15:00:00 NA NA 0
## date_time year month day
## 1 2003-11-12 13:00:00 2003 11 12
## 2 2003-11-12 14:00:00 2003 11 12
## 3 2003-11-12 15:00:00 2003 11 12

Subset and summarize data

Create dataset of daily precipitation in inches

library(dplyr) # for `select()`, `group_by()`, `summarize()`, and `%>%`
dy_rg_data <- rg_data %>%
select(id, date, year, month, day, events) %>%
group_by(id, year, month, day) %>%
summarize(daily_events=length(events), daily_ppt=length(events)*0.01)
str(dy_rg_data)
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 34807 obs. of 6 variables:
## $ id : chr "annadel" "annadel" "annadel" "annadel" ...
## $ year : int 2003 2003 2003 2003 2003 2003 2003 2003 2003 2003 ...
## $ month : int 11 11 11 11 11 11 11 11 11 11 ...
## $ day : int 12 13 14 15 16 17 18 19 20 21 ...
## $ daily_events: int 11 24 37 26 33 24 24 24 24 24 ...
## $ daily_ppt : num 0.11 0.24 0.37 0.26 0.33 0.24 0.24 0.24 0.24 0.24 ...
## - attr(*, "vars")=List of 3
## ..$ : symbol id
## ..$ : symbol year
## ..$ : symbol month
## - attr(*, "drop")= logi TRUE

Subset and summarize data

Add a date interpretable by the computer

dy_rg_data$date <- as.Date(
with(dy_rg_data, paste(as.character(year), as.character(month),
as.character(day), sep="/")),
format = "%Y/%m/%d")

Plot rainfall data

qplot(date, daily_ppt, data = dy_rg_data, geom = c("point","line"),
ylab = "Daily rainfall (inches)", color = daily_ppt > 6)

[Plot: daily rainfall by date, points colored by daily_ppt > 6]
Maybe a few outliers…

Re-plot rainfall data without outliers

qplot(date, daily_ppt,
data = dy_rg_data %>% filter(daily_ppt < 6),
geom = c("point","line"), ylab = "Daily rainfall (inches)",
color = year)

[Plot: daily rainfall by date with outliers removed, points colored by year]

Temporary files pile-up while using the `raster` package in R

Update: I’m not sure that this method has ever actually worked for me. I would love to hear success/failure reports from others. Restarting the computer seems to always free things up.

The raster package in R is an incredibly useful and powerful free and open-source solution for doing geospatial analysis, especially if you are familiar with R but don’t work regularly with other GIS software. It is very useful even if you do; after all, you may not always work somewhere that can afford licenses for commercial desktop GIS software (ahem, ArcGIS). Though I and my fellow students here at NC State have ready access to commercial software at no cost to ourselves, we really like learning to use and integrate R into our work, because it can then be reproduced and we can collaborate more easily. The quoted information in this post can be found here on Inside-R, a super-helpful reference site. The raster package was developed for:

“Reading, writing, manipulating, analyzing and modeling of gridded spatial data. The package implements basic and high-level functions. Processing of very large files is supported.”

It’s that last part that leads to the pile-up of temporary files.

Another student and I are working on a project (it’s his thesis, so really he is doing the work) that is comparing outcomes of species distribution models using predictor variables at different resolutions across the entire extent of Oregon and California. This means processing through a lot of mapped surfaces during the fitting and prediction phases. The raster package is the only way to handle this, because the

“Functions in the raster package create temporary files if the values of an output RasterLayer cannot be stored in memory (RAM). This can happen when no filename is provided to a function and in functions where you cannot provide a filename (e.g. when using ‘raster algebra’).”

Now, in a “normal” R session (using the command line or the GUI that comes with R installs) these temporary files are automatically removed at the start of each session. However, it seems that if you are using RStudio under certain settings (maybe the defaults, I am not positive on that part), the temporary files may be retained even when you start a new session. So, if you find that your hard drive is filling up and are wondering “Where, why, and how do I fix it?”, the solution is right there in the raster package: the removeTmpFiles function, which removes temporary files older than a minimum age given by the argument h, measured in hours.

#Remove all temporary files that are more than 24 hours old:
removeTmpFiles(h=24)

#Remove all temporary files currently in existence:
removeTmpFiles(h=0)
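Two related raster helpers are worth knowing here (a sketch; the "raster_tmp" path below is just a hypothetical example): tmpDir() and showTmpFiles() reveal where the temporary files live and what has piled up, and rasterOptions(tmpdir = ...) redirects future temp files to a directory you can monitor and clean yourself.

```r
library(raster)  # assumes the raster package is installed

tmpDir()        # the directory where raster writes its temporary files
showTmpFiles()  # lists temporary raster files from the current session

# Redirect temp files to a directory you can keep an eye on;
# "raster_tmp" is a hypothetical example path
rasterOptions(tmpdir = "raster_tmp")
```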