30 September 2015

U of M - EEB seminars

Today we had an interesting seminar with Julie Lockwood from Rutgers University. Her talk was titled "Killing the cuddly: Tactics for managing exotic predators to protect their native prey." The talk was divided into two sections: the first dealt with the importance of model choice in providing guidance to managers, and the second dealt with attempts to understand population growth in invasive species.

Importance of modeling
She motivated the first part of her talk with an empirical example from Rayner et al. 2007. This manuscript looks at the breeding success of Cook's petrels under three conditions: first, prior to any predator removal, when both cats and rats were present (success = 0.25-0.50); then after cats had been removed but rats were still present (success = 0.00-0.25); and finally once both had been removed (success = 0.50-0.75). Here is the graph that she showed from that paper:

She then explained that most model-informed management decisions currently rely on models that examine only the predator population and look for the life stage that can be targeted to most quickly reduce that population. She suggests we instead need a more nuanced approach with a more complex model, one whose objective is keeping the prey alive rather than reducing the predator population as quickly as possible.

She then described work that developed a model that allows the predator to affect the prey population size. This model can accommodate biological realities such as the fact that juvenile predators often can't eat adult prey, or that only certain stages of prey are targeted at all. Once this model was developed, they ran a fully crossed set of analyses with short-, medium-, or long-lived predators and prey, for a total of nine pairings. The general result was that, particularly in the case of long-lived predators, the traditional approach and her recommended approach identified different optimal solutions.

- my biggest question from this part of the talk was: Can we wrap in two predators and get the "right" solution to the motivating example of Cook's petrels?

Population Lag
The next part of her talk focused on population lag. This is a well-documented characteristic, particularly in invasive plants. It is characterized by a species being present at low density for an extended period of time and then, at some point, suddenly exhibiting a great increase in population size. In plants, population lags appear to range from as little as a decade to as long as a century or more. The data available for vertebrates, though, are less clear, primarily because there are not many vertebrate introductions with abundant data on population size over time. However, Hawaiian birds offer the opportunity to study this phenomenon. Her team used the Audubon Christmas Bird Count data to look at the population dynamics of 54 invasive species in Hawaii.

I'll be honest: this part of the talk got a little fuzzy for me. The data are very noisy, so apparently some method was used to define a maximum population size, and data after this point were discarded (I believe). The remaining data were fit with three models: 1) simple linear, 2) log, and 3) two-piece linear. For the two-piece linear model, essentially every point between the 6th year and the 5th year prior to the peak population was tried as the break, and the break that resulted in the highest likelihood was kept. If the single-break model was better than the alternatives, that was taken as evidence for a population lag in that species. Her results indicated that most species did exhibit a population lag.
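To make that procedure concrete, here is a minimal sketch of a breakpoint scan of the kind described; this is my own reconstruction, not the authors' code, and the exact form of the "log" model and the AIC-based comparison are my assumptions.

 # sketch of a single-break piecewise fit compared against simpler models
 # year:  years since introduction (series truncated at peak population, year starts at 1)
 # count: index of population size (e.g., Christmas Bird Count totals)
 fit_lag_models <- function(year, count){
  lin  <- lm(count ~ year)        # 1) simple linear
  logm <- lm(count ~ log(year))   # 2) log (exact form is a guess)
  # candidate breaks: every year from the 6th year to 5 years before the peak
  # (the series must be long enough for this range to be non-empty)
  breaks <- year[year >= 6 & year <= max(year) - 5]
  best.ll <- -Inf
  two.piece <- NULL
  for(b in breaks){
   seg <- pmax(year - b, 0)       # extra slope that kicks in after the break
   m <- lm(count ~ year + seg)
   if(logLik(m) > best.ll){
    best.ll <- as.numeric(logLik(m))
    two.piece <- m
   }
  }
  # lowest AIC indicates the preferred model; support for the two-piece
  # model is consistent with a population lag
  AIC(lin, logm, two.piece)
 }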

- my biggest question/comment from this part of the talk was: this seems like one of those places where we need to develop or assess model adequacy; I could picture the best model here being too poor to reveal biologically important characteristics.

All in all, this was a really interesting talk. My final take-home was that most invasive species provide us with a long period of time when the population grows slowly, and during this window careful modeling gives us the best chance of successfully controlling the invasion. In addition, even if we miss this low-population window, populations are stochastic, so perhaps we should be prepared to attack an invasive species when it experiences a natural fluctuation.

Here are links to the Lockwood lab and the pages of her two former students who did much of this work. My explanations above are no doubt oversimplifications, and any errors are purely my own. If this stuff really interests you, check out the publications below, which contain the research I described.

Kevin Aagaard

Orin Robinson


25 September 2015

Brownian Motion

If you want to skip all the code and Shiny info you can jump straight to the shiny app: Brownian motion simulator.

In comparative phylogenetics, Brownian motion is often either a null model that we compare observations to or a de facto assumption that our methods depend on. For this reason, I think it is useful to have interactive ways for students to explore this model and develop an intuition for how it behaves. The R package Shiny offers a great environment for developing something like this. Below you can see just how easy it is to create a slick and responsive web app.

The first step is creating a file called "ui.R"; this is the user interface. It consists of two primary code chunks. The first is the sidebarLayout, which describes the part of the page that the user will interact with. Shiny has many widgets built in, and in this example I use two of them: sliderInput and actionButton. sliderInput creates a sliding widget and lets you set the minimum and maximum values for a variable as well as an initial starting value. actionButton creates a variable that starts at 0 and is incremented by 1 each time the button is pressed; I use this value as the seed for the stochastic portion of my code. The second part of this simple ui.R file specifies the output to place next to the sidebar. In this example I simply call the object "distPlot", which is created by the server code, and I also specify the vertical size to be larger than it would be by default.
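Here is a minimal ui.R along the lines described above. The input IDs (seed.val, reps, gens, mon) match the ones referenced below; the labels, ranges, and default values are my own guesses, since the original file isn't reproduced here.

 library(shiny)

 shinyUI(fluidPage(
  titlePanel("Brownian motion simulator"),
  sidebarLayout(
   sidebarPanel(
    sliderInput("gens", "Generations:", min = 10, max = 1000, value = 100),
    sliderInput("reps", "Iterations:", min = 1, max = 100, value = 20),
    sliderInput("mon", "Generation to summarize:", min = 1, max = 1000, value = 50),
    actionButton("seed.val", "New simulation")
   ),
   mainPanel(
    # taller than the default so both plots fit comfortably
    plotOutput("distPlot", height = "700px")
   )
  )
 ))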

The next component of the Shiny app is the "server.R" file. This is the code that is run to generate the output portion of the web app, and in this example it consists of two components. The first is the reactive component that holds the result of the Brownian motion simulation. By using replicate we create a matrix where the rows represent generations and the columns represent iterations. By placing this inside a reactive call we ensure that it is only rerun when one of the underlying input variables changes; if you look at the code you can see that these are input$seed.val, input$reps, and input$gens. In contrast, if another variable such as input$mon changes, "x" will not be recalculated.

The second element of the code is the actual plot I want to output. Because x is a matrix, we can use matplot to plot the results quickly and easily; I chose to use a new color palette, viridis. We also want to make this interactive by adding a histogram of any time point on the graph. We do this by using input$mon to select a specific row of x to plot; this determines what data are used for the second graph, and a vertical line is added to the trajectory plot indicating which time point is being shown.
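And here is a rough server.R matching that description; the simulation details (a simple cumulative sum of normal deviates) and the plot settings are my approximations of what the app does rather than the original code.

 library(shiny)
 library(viridis)

 shinyServer(function(input, output){
  # rerun the simulation only when seed.val, reps, or gens changes
  x <- reactive({
   set.seed(input$seed.val)
   # rows are generations, columns are iterations
   replicate(input$reps, cumsum(rnorm(input$gens)))
  })
  output$distPlot <- renderPlot({
   gen <- min(input$mon, input$gens)   # guard against mon > gens
   par(mfrow = c(2, 1))
   # all trajectories, colored with the viridis palette
   matplot(x(), type = "l", lty = 1, col = viridis(input$reps),
       xlab = "generation", ylab = "trait value")
   abline(v = gen, lwd = 2)
   # histogram of trait values at the selected generation
   hist(x()[gen, ], col = "gray", main = paste("Generation", gen),
      xlab = "trait value")
  })
 })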

Here is what the finished product looks like:

05 September 2015

Ternary Diagram - Working Example

A couple of years ago I posted a bit of code that lets you plot data in a ternary diagram.  This is a useful way to display proportion data when proportions are split between three possible states.  Unfortunately, my original post included a link to the underlying data in my dropbox account.  This has led to me accidentally deleting it a couple of times. 

Well I did it again.  I decided that the best solution would be to just post some new code that included simulation of the data.

The original data were generated under a phylogenetic model, and the data represented the number of species in each of three possible states. If you would like to regenerate data for that approach, you will need an R package that allows discrete data simulation on a phylogenetic tree. A fairly easy-to-install package that can do this is geiger. Below is code for generating an appropriate dataset.
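Something along these lines should work, assuming ape and geiger are installed; the tree size, number of simulations, and transition rates here are arbitrary choices of mine rather than values from the original post.

 library(ape)
 library(geiger)

 nsims <- 100
 # symmetric transition matrix for a 3-state discrete character
 q <- 0.1
 Q <- matrix(c(-2 * q,      q,      q,
                    q, -2 * q,      q,
                    q,      q, -2 * q), 3, 3, byrow = TRUE)
 props <- matrix(NA, nsims, 3)
 for(i in 1:nsims){
  # simulate a 50-taxon birth-death tree
  tree <- sim.bdtree(b = 1, d = 0, stop = "taxa", n = 50)
  # simulate the discrete character and count species in each state
  states <- sim.char(tree, par = Q, model = "discrete", nsim = 1)[, 1, 1]
  props[i, ] <- tabulate(states, nbins = 3)
 }
 colnames(props) <- c("state1", "state2", "state3")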

If you are not interested in phylogenetics you can generate an appropriate dataset in a non-phylogenetic framework like this:
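Here is one way to do that; drawing species counts from a multinomial with random proportions is my choice for the sketch, not necessarily what the original code did.

 # non-phylogenetic version: counts of 50 species drawn from a
 # multinomial with random state proportions for each dataset
 nsims <- 100
 props <- t(sapply(1:nsims, function(i){
  p <- runif(3)
  p <- p / sum(p)
  rmultinom(1, size = 50, prob = p)[, 1]
 }))
 colnames(props) <- c("state1", "state2", "state3")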

Finally, with either of these datasets, you can plot using the vcd package.
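A minimal call would look something like this; ternaryplot rescales each row to proportions by default, so raw counts are fine.

 library(vcd)

 # each row of props is one simulated dataset
 ternaryplot(props,
       cex = 0.5,
       main = "Simulated 3-state data")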

This script will then produce:


27 August 2015

Chromosome Evolution Conference


I had the good fortune of speaking at the American Genetics Association president's symposium last week. There were many things to enjoy at this conference. First, the location was Bainbridge Island, just across the bay from downtown Seattle. Second, despite a striking diversity of subfields, all the speakers shared a broad interest in chromosome evolution and were eager to share ideas and approaches and to find ways to move the field forward. There was also relative equality in the distribution of sexes, with 10 female and 12 male speakers; likewise, the poster session featured 14 female and 16 male presenters. Finally, if the talks and posters weren't sufficient, the opportunity to socialize and discuss science certainly was. All meals were catered and there were no concurrent sessions, which meant that you got to spend two and a half days sharing ideas and learning from each other without ever missing a talk or poster. So many thanks to Katie Piechel for putting together such an awesome conference! Below are a few highlights that I found particularly exciting.

Jeremy Searle from Cornell gave a great talk covering hybrid zones and speciation in mice and shrews. He spoke quite a bit about the amazing variation in chromosome number that he has found on the island of Madeira. This hit home for me because, in my analyses of chromosome number, some of the fastest rates I have observed are driven largely by the amazing diversity seen in species from the Canary Islands, just south of Madeira. It's always reassuring when you see the same kind of patterns in such distant parts of the tree of life.

Karen Miga from the University of California, Santa Cruz discussed her work trying to shed light on the dark side of the genome. To be honest, I'm not yet sure how to use the work she has done, but I think it represents a huge step forward in giving us a usable characterization of the centromeric regions of chromosomes. For a lay introduction to her work you can check out her blog post at Scientific American, or, if you're feeling a bit braver, here is a nice article on a preprint server.

Beatriz Vicoso, who recently took a position at the Institute of Science and Technology Austria, talked about some of the amazing work she did while she was a postdoc with Doris Bachtrog: specifically, the surprising discovery that heteromorphic sex chromosomes are not always a trap. Indeed, many of the mysterious attributes of the dot chromosome in Drosophila now make sense once we realize that it spent millions of years as a sex chromosome. To read about this, check out their article in PLOS Biology.

I could, and should, add another 10 or 12 talks and posters, but I probably need to focus on my own papers instead. Keep your eyes out for next year's AGA president's symposium in Asilomar; it will focus on local adaptation and will be organized by Lynda Delph.

Hadley and Winston talk about the future of interactive graphics

I went to my first R users group meeting in Minnesota and was treated to talks by two R heavyweights. First, Hadley Wickham gave a talk about what he views as the most exciting and dynamic area of R development (interactive graphics), and this was followed by a short talk by Winston Chang showing what he has been doing to make interactive graphics part of the Shiny environment. Below are some take-home messages and a few links that I thought were particularly interesting.

Hadley Wickham: First, the slides for Hadley's talk are on his GitHub page. Hadley started his talk with a short demo showing how interactive graphics can help you learn about your data. He has a clever dataset of 3D points that show no apparent pattern when viewed along any two axes; however, when viewed from the correct angle, a very distinct spiral pattern emerges, which provides a nice aha moment in his talk. Next, Hadley discussed the history of interactive graphics in R, splitting the packages into three categories: 1) those written in lower-level languages, 2) those hacking existing R plotting functions, and 3) browser-based solutions. Many of the packages that he showed are not very realistic for everyday use, or for investing a lot of time in, since they are no longer being actively developed.

Perhaps the most interesting part of Hadley's talk was his discussion and demonstration of ggvis, the package that he envisions eventually replacing ggplot2. His goal is to continue working on it and, perhaps sometime in 2016, have it to the point that anything you can do in ggplot2 you can do in ggvis. The upside is that if you are already comfortable with ggplot2 you will have no trouble transitioning to ggvis; Hadley is using the same kind of grammar-based approach to building up more complex graphs from simpler elements.

Winston Chang gave a much shorter talk but illustrated some of the new interactive developments in Shiny.  Despite my use of Shiny I was actually completely unaware of this development.  Winston has added some nice examples on the Shiny articles page (bottom of right column).

Other interesting links from the evening:

HTML widgets for R: allows you to use a variety of JavaScript visualization libraries within R

Hadley’s new book Advanced R

21 April 2015

Beetle behavior - do they avoid, aggregate, or simply wander

I have a student this semester who was interested in determining whether a group of beetles in a confined space (a petri dish) avoided each other, aggregated in groups, or simply wandered randomly. I don't know of an existing statistical test for this question. However, it seemed that the mean pairwise distance between all beetles might be a good measure of where an organism falls along this continuum. For example, if we have a square grid that is 10 units on a side and we put one beetle roughly in each corner:

The pairwise measures are:
The average of these measures is 10.24

In contrast if we have the same grid but the beetles have aggregated into two clumps:

The pairwise measures would now be:
The average of these measures is 8.48

Spread-out individuals give us a larger number, while clumping/aggregation leads to a smaller number; this is what we are after. Once she collected her data, we needed to create a null distribution of expected pairwise distances assuming that the beetles were randomly distributed through the available space. To do this we simply need to create the same number of random X and Y coordinates (making sure they fall inside the circle of our petri dish) as she had beetles in her experiment. Once we have a set of random coordinates we can measure all the pairwise distances. We repeat this process thousands of times to generate a null distribution of what we expect to see. Here is what the pseudocode for this looks like:

Radius of circle = 
Number of beetles =
Iterations =
Observed pairwise mean =

Loop for iterations
-Loop for number of beetles
--While coordinate not found
---Generate a coordinate
---Test coordinate for falling inside circle if true save
-Calculate mean of all pairwise distances and record
Plot null distribution and observed value

Below is the actual R code that I used to accomplish this. I decided to tweak it a bit and turn it into a Shiny app that allows her to run it through the web, thanks to the free Shiny apps server. The fully implemented version is here.

 # This code produces pairwise measurements for
 # randomly placed points in a circle of arbitrary size
 # Heath Blackmon & Elizabeth Mallory
 # 20 April 2015
 RandomDist <- function(radius = 10,
                        individuals = 10,
                        iterations = 1000,
                        observed = 10){
   # this internal function just calculates a pairwise distance
   pw.dist <- function(x1, x2, y1, y2){
     sqrt(((x1 - x2) ^ 2) + ((y1 - y2) ^ 2))
   }
   # this function tests whether a point falls inside (or on)
   # a circle of given radius centered at (radius, radius)
   in.circle <- function(x, y, radius){
     (x - radius)^2 + (y - radius)^2 <= radius^2
   }
   # we will store our final null dist here
   final.results <- vector()
   # this loop will repeat till iterations is reached
   for(k in 1:iterations){
     # this print statement allows us to monitor progress
     if(k/10 == round(k/10)) print(paste("iteration", k))
     # we will store our coordinates in this dataframe
     coords <- data.frame(x = 0, y = 0)
     # this will loop through for the number of individuals
     for(m in 1:individuals){
       # this flag will be false until we generate a
       # valid coordinate
       inside <- F
       # we generate the coordinates in this while statement
       while(inside == F){
         # pick coordinates from uniform distribution
         t.coords <- runif(n = 2, min = 0, max = radius * 2)
         # now test if they are in the circle
         if(in.circle(x = t.coords[1], y = t.coords[2], radius)){
           inside <- T
         }
       }
       coords[m, 1:2] <- t.coords
     }
     colnames(coords) <- c("X", "Y")
     # this calculates the cumulative sum of the number of
     # individuals - 1 (this is equivalent to the number of
     # pairwise measures)
     pairnum <- tail(cumsum((individuals - 1):1), n = 1)
     pairmeas <- vector()
     counter <- 2
     # this loop is going to do the pairwise measures
     for(i in 1:(individuals - 1)){
       for(j in counter:individuals){
         # here we do an actual pairwise calculation
         x <- pw.dist(coords[i, 1], coords[j, 1],
                      coords[i, 2], coords[j, 2])
         pairmeas <- c(pairmeas, x)
       }
       counter <- counter + 1
     }
     # store result from each iteration
     final.results[k] <- mean(pairmeas)
   }
   # calculate empirical p-value (use the smaller tail)
   lower <- sum(final.results > observed)
   higher <- sum(final.results < observed)
   if(lower < higher){
     pval <- lower / iterations
   } else {
     pval <- higher / iterations
   }
   # plot the results
   hist(final.results, freq = FALSE,
        main = "Distribution of Mean Pairwise Measures",
        xlab = "mean distance",
        ylab = "density")
   abline(v = observed, col = "red", lwd = 2)
   # pull in the plot limits so we can place the p-value label
   mylims <- par("usr")
   text(x = mylims[1] + 0.2 * (mylims[2] - mylims[1]),
        y = 0.9 * mylims[4],
        labels = paste("pvalue =", pval))
   invisible(final.results)
 }

17 April 2015

R: concatenate alignments into a supermatrix

Today a friend asked me the easiest way to concatenate several hundred fasta-format alignments, which differ somewhat in the species present in each alignment, into a single supermatrix. For example, if we had two alignments like this:

>species A
>species C

>species A
>species B

we would want to produce a final alignment that looks like this:

>species A
atcgatcgtagctcgtacgtcgtacg---act cgt---acgtcgtactgtc
>species B
-------------------------------- cgtcctacgtcgtactgtc
>species C

We would also want to keep track of the data partitions that have been combined so we would want a report like this as well:

Locus 1 1:33
Locus 2 34:52

I assumed that this would be a simple task in R – surely somebody has a function for this. However, after checking some of the big packages like ape, I found that there wasn't an obvious and easy way to do it. For instance, while ape has a c method for DNA objects, it expects them to contain the same taxa and drops any taxa not present in all alignments. I am in the process of writing a manuscript for my package evobiR, so I decided I should write a function for this. Here is the code that I came up with.

 # concatenate all fasta alignments in the working directory into a
 # single supermatrix, recording the partition boundaries
 SuperMatrix <- function(missing = "-",
             prefix = "concatenated",
             save = T){
  # read.dna and write.dna come from the ape package
  library(ape)
  # get file names
  file.names <- list.files()
  # read DNA
  DNA <- list()
  for(i in 1:length(file.names)){
   print(paste("Reading alignment", i))
   DNA[[i]] <- read.dna(file = file.names[i],
              format = "f",
              as.character = TRUE)
  }
  # calculate total alignment length
  total.length <- 0
  for(i in 1:length(DNA)){
   total.length <- total.length + ncol(DNA[[i]])
  }
  # get all taxa names
  taxa <- vector()
  for(i in 1:length(DNA)){
   taxa <- c(taxa,
        row.names(DNA[[i]])[!row.names(DNA[[i]]) %in% taxa])
  }
  # create an empty alignment filled with the missing-data character
  seqmatrix <- matrix(as.character(missing),
            nrow = length(taxa),
            ncol = total.length)
  rownames(seqmatrix) <- taxa
  # create partition record holder
  partitions <- as.data.frame(matrix(, length(DNA), 3))
  colnames(partitions) <- c("part", "start", "stop")
  # build the supermatrix
  c.col <- 0
  print("Creating supermatrix")
  for(i in 1:length(DNA)){
   print(paste("Processing alignment", i))
   gene <- DNA[[i]]
   for(j in 1:nrow(gene)){
    c.row <- which(rownames(seqmatrix) == rownames(gene)[j])
    seqmatrix[c.row, (c.col + 1):(c.col + ncol(gene))] <- gene[j, ]
   }
   partitions[i, 1:3] <- c(file.names[i], c.col + 1, c.col + ncol(gene))
   c.col <- c.col + ncol(gene)
  }
  results <- list()
  results[[1]] <- partitions
  results[[2]] <- seqmatrix
  if(save == T){
   print("saving files")
   write.dna(seqmatrix,
        file = paste(prefix, ".fasta", sep = ""),
        format = "f")
   write.csv(partitions,
        row.names = F,
        file = paste(prefix, ".partitions.csv", sep = ""))
  }
  return(results)
 }
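A quick usage example (the directory path is made up): the function reads every file in the working directory, so point R at a folder that contains only fasta alignments before calling it.

 library(ape)
 setwd("~/alignments")   # hypothetical folder containing only fasta files
 res <- SuperMatrix(missing = "-", prefix = "concatenated", save = TRUE)
 res[[1]]        # partition table (file, start, stop)
 dim(res[[2]])   # taxa by total alignment length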

While I haven't done anything to try to make this fast, when testing it on my friend's RADseq data it constructed a supermatrix for 18 taxa and 200,000 bp in just a couple of minutes, so it is probably already fast enough for most purposes. This function has been added to the current version of evobiR, available from GitHub. If you find this useful or see a potential problem, please let me know.