21 April 2015

Beetle behavior - do they avoid, aggregate, or simply wander?

I have a student this semester who was interested in determining whether a group of beetles in a confined space (a petri dish) avoid each other, aggregate in groups, or simply wander randomly.  I don't know of an existing statistical test to answer this question.  However, it seemed that the mean pairwise distance between all beetles might be a good measure of where an organism falls along this continuum.  For example, if we have a square grid that is 10 units on a side and we put one beetle in each corner:

[figure: a square grid, 10 units on a side, with one beetle in each corner]

The pairwise distances are 10, 10, 10, 10, 14.14, and 14.14.  The average of these measures is 11.38.

In contrast, if we have the same grid but the beetles have aggregated into two clumps:

[figure: the same grid with two beetles clumped together in each of two opposite corners]

The pairwise distances would now be 0, 0, 14.14, 14.14, 14.14, and 14.14.  The average of these measures is 9.43.
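
These means are easy to check in R with the built-in dist() function, which returns all pairwise Euclidean distances for a matrix of coordinates.  A quick sketch for the two examples above (the coordinates are just the corner positions described in the figures):

 # one beetle in each corner of a 10 x 10 grid
 corners <- matrix(c(0, 0,  0, 10,  10, 0,  10, 10),
          ncol = 2, byrow = TRUE)
 mean(as.vector(dist(corners)))   # ~11.38

 # two beetles in each of two opposite corners
 clumps <- matrix(c(0, 0,  0, 0,  10, 10,  10, 10),
          ncol = 2, byrow = TRUE)
 mean(as.vector(dist(clumps)))    # ~9.43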

Spread-out individuals give us a larger number, while clumping/aggregation leads to a smaller number - this is what we are after.  Once she had collected her data, we needed to create a null distribution of expected pairwise distances under the assumption that the beetles were randomly distributed through the available space.  To do this we simply create the same number of random X and Y coordinates as there were beetles in her experiment (making sure they fall inside the circle of our petri dish).  Once we have a set of random coordinates, we measure all the pairwise distances.  We repeat this process thousands of times to generate a null distribution of what we expect to see.  Here is what the pseudocode for this looks like:

Radius of circle = 
Number of beetles =
Iterations =
Observed pairwise mean =

Loop for iterations
-Loop for number of beetles
--While coordinate not found
---Generate a coordinate
---If the coordinate falls inside the circle, save it
-Calculate the mean of all pairwise distances and record it
Plot null distribution and observed value



Below is the actual R code that I used to accomplish this.  I decided to tweak it a bit and turn it into a shiny app that allows her to run it through the web, thanks to the free shiny apps server.  The fully implemented version is here.


 # This code produces pairwise measurement for  
 # randomly placed points in a circle of arbitrary size  
 # Heath Blackmon & Elizabeth Mallory  
 # 20 April 2015  
 RandomDist <- function(radius=10,   
             individuals=5,   
             iterations=100,   
             observed=34){  
  # this internal function just calculates a pairwise distance  
  pw.dist <- function(x1, x2, y1, y2){  
   sqrt(((x1 - x2) ^ 2) + ((y1 - y2) ^ 2))  
  }  
  # this function tests whether a point falls inside a circle
  # of the given radius centered at (radius, radius)
  in.circle <- function(x, y, radius){  
   (x - radius)^2 + (y - radius)^2 < radius^2  
  }  
  # we will store our final null dist here  
  final.results <- vector()  
  # this loop will repeat till iterations is reached  
  for(k in 1:iterations){  
   # this print statement allows us to monitor progress  
   if(k %% 10 == 0) print(paste("iteration", k))
   # we will store our coordinates in this dataframe  
   coords <- data.frame(x=0,y=0)  
   # this will loop through for the number of individuals  
   for(m in 1:individuals){  
    # this flag will be false until we generate a   
    # valid coordinate  
    inside <- F  
    # we generate the coordinates in this while statement  
    while(inside == F){  
     # pick coordinates from uniform distribution  
     t.coords <- runif(n=2, min = 0, max = radius*2)  
     # now test if they are in circle  
     if(in.circle(x=t.coords[1], y=t.coords[2], radius)){  
      inside <- T  
     }  
    }  
    coords[m, 1:2] <- t.coords  
   }  
   colnames(coords) <- c("X", "Y")  
   # vector to hold all of the pairwise distances for this iteration
   pairmeas <- vector()
   # this loop is going to do the pairwise measures
   for(i in 1:(individuals - 1)){
    for(j in (i + 1):individuals){
     # here we do an actual pairwise calculation
     x <- pw.dist(coords[i, 1],
            coords[j, 1],
            coords[i, 2],
            coords[j, 2])
     pairmeas <- c(pairmeas, x)
    }
   }
   # store result from each iteration  
   final.results[k] <- mean(pairmeas)  
  }  
  # calculate an empirical p-value from the smaller tail
  # of the null distribution
  above <- sum(final.results > observed)
  below <- sum(final.results < observed)
  pval <- min(above, below) / iterations
  # plot the results  
  plot(density(final.results),   
     main = "Distribution of Mean Pairwise Measures",   
     xlab = "mean distance",  
     ylab = "density",  
     cex.main=.7)  
  abline(v=observed, col="red", lwd=2)  
  # pull in the plot limits  
  mylims <- par("usr")  
  text(x=mylims[1],   
     y=mean(mylims[3:4]),   
     paste("pvalue =", pval),  
     col="red",  
     pos=4,  
     cex=.6)  
  return(final.results)  
 }  
 RandomDist(radius = 44.5,   
       individuals = 10,   
       iterations = 1000,  
       observed = 30)  
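
For anyone curious about the shiny wrapper, here is a minimal sketch of one way the function above could be hooked up to a web interface.  The input names and layout below are just placeholders rather than the code actually running on the shiny server:

 library(shiny)

 ui <- fluidPage(
  titlePanel("Beetle spacing null distribution"),
  sidebarLayout(
   sidebarPanel(
    numericInput("radius", "Radius of dish", value = 44.5),
    numericInput("individuals", "Number of beetles", value = 10),
    numericInput("iterations", "Iterations", value = 1000),
    numericInput("observed", "Observed pairwise mean", value = 30),
    actionButton("go", "Run")
   ),
   mainPanel(plotOutput("nullplot"))
  )
 )

 server <- function(input, output){
  output$nullplot <- renderPlot({
   input$go   # re-run when the button is pressed
   isolate(RandomDist(radius = input$radius,
             individuals = input$individuals,
             iterations = input$iterations,
             observed = input$observed))
  })
 }

 shinyApp(ui = ui, server = server)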


17 April 2015

R: concatenate alignments into a supermatrix

Today a friend asked me the easiest way to concatenate several hundred fasta-format alignments, which differ somewhat in the species present in each one, into a single supermatrix. For example, if we had two alignments like this:

LOCUS 1
>species A
atcgatcgtagctcgtacgtcgtacg---act
>species C
atcgatcgtagctcgtacgtcgtacggtaact

LOCUS 2
>species A
cgt---acgtcgtactgtc
>species B
cgtcctacgtcgtactgtc

we would want to produce a final alignment that looks like this:

>species A
atcgatcgtagctcgtacgtcgtacg---act cgt---acgtcgtactgtc
>species B
-------------------------------- cgtcctacgtcgtactgtc
>species C
atcgatcgtagctcgtacgtcgtacggtaact -------------------

We would also want to keep track of the data partitions that have been combined, so we would want a report like this as well:

Locus 1 1:32
Locus 2 33:51

I assumed that this would be a simple task in R - surely somebody has a function for this.  However, after checking some of the big packages like ape I found that there wasn't an obvious and easy way to do this.  For instance, while ape has a c method for DNA objects, it expects them to contain the same taxa and drops any taxa not present in all alignments.  I am in the process of writing a manuscript for my package evobiR, so I decided I should write a function for this. Here is the code that I came up with.

 # read.dna and write.dna below come from the ape package
 library(ape)
 SuperMatrix <- function(missing = "-",
             prefix = "concatenated",  
             save = T){  
  # get file names  
  file.names <- list.files()  
  # read DNA  
  DNA <- list()  
  for(i in 1:length(file.names)){  
   print(paste("Reading alignment", i))  
   DNA[[i]] <- read.dna(file=file.names[i],   
              format = "f",   
              as.character=T)  
  }  
  # calculate total alignment length  
  total.length <- 0  
  for(i in 1:length(DNA)){  
   total.length <- total.length + ncol(DNA[[i]])  
  }  
  # get all taxa names (each name added only once)
  taxa <- vector()
  for(i in 1:length(DNA)){
   taxa <- c(taxa,
        row.names(DNA[[i]])[!row.names(DNA[[i]]) %in% taxa])
  }
  # create empty alignment  
  seqmatrix <- matrix(as.character(missing),  
                   length(taxa),  
                   total.length)  
  rownames(seqmatrix) <- taxa  
  # create partition record holder  
  partitions <- as.data.frame(matrix(NA, length(DNA), 3))
  colnames(partitions) <- c("part", "start", "stop")  
  #build the supermatrix  
  c.col <- 0  
  print("Creating supermatrix")  
  for(i in 1:length(DNA)){  
   print(paste("Processing alignment", i))  
    gene <- DNA[[i]]
   for(j in 1:nrow(gene)){  
    c.row <- which(rownames(seqmatrix) == rownames(gene)[j])  
    seqmatrix[c.row, (c.col + 1):(c.col + ncol(gene))] <- gene[j, ]  
   }  
   partitions[i,1:3] <- c(file.names[i], c.col + 1, c.col + ncol(gene))  
   c.col <- c.col + ncol(gene)  
  }  
  results <- list()  
  results[[1]] <- partitions  
  results[[2]] <- seqmatrix  
  if(save == T){  
   print("saving files")  
   write.dna(seqmatrix,   
        file = paste(prefix, ".fasta", sep = ""),   
        format = "f")  
   write.csv(partitions,   
        row.names = F,   
        file = paste(prefix, ".partitions.csv", sep = ""))  
  }  
  return(results)  
 }  
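
A quick usage example.  Note that the function reads every file in the working directory, so it assumes that directory contains nothing but fasta alignments (the path below is a placeholder):

 library(ape)
 # directory containing only the fasta alignments to combine
 setwd("path/to/alignments")
 result <- SuperMatrix(missing = "-", prefix = "concatenated", save = TRUE)
 result[[1]]       # partition table: file, start, stop
 dim(result[[2]])  # number of taxa by total alignment length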


While I haven't done anything to try to make this very fast, when testing it on my friend's RADseq data it was able to construct a supermatrix for 18 taxa and 200,000 bp in just a couple of minutes, so it is probably already fast enough for most purposes.  This function has been added to the current version of evobiR, available from GitHub.  If you find this useful or see a potential problem please let me know.


cheers

28 August 2014

Recent Publications: Beetle and Otherwise


Haven't had a lot of time for updates to the blog lately.  However, I have had a few beetle-related publications come out recently.

1) An article that contains a lot of my dissertation research came out this summer in Genetics.  In this paper I use the karyotypes, specifically sex chromosome morphology, from over 1,000 beetles to make inferences about the mode and tempo of sex-limited chromosome turnover across Coleoptera.  The results of this study inspired the "fragile Y hypothesis".  This hypothesis predicts that the characteristics of meiosis in a clade have a large impact on the rate of Y chromosome turnover, and that sexually antagonistic selection reduces the size of the PAR (the region of the sex chromosomes that forms chiasmata and recombines, ensuring proper segregation) to such an extent that faithful segregation becomes difficult.  Beetles are an interesting group in which to test these ideas because some clades have X and Y chromosomes that lack a PAR and do not come together to form chiasmata; instead they are held together at a distance by proteins.  Many other species have a more traditional meiosis in which a PAR allows chiasmata to form, ensuring proper segregation.  When we compare the rate of Y chromosome turnover in these two groups we find that those with distance-pairing sex chromosomes are less likely to lose their Y chromosome.  When we look to mammals we see the same pattern, suggesting that this is an important aspect of sex chromosome evolution in many clades with heterogametic sex determination.


2) I spent last fall semester at the National Evolutionary Synthesis Center in Durham, North Carolina.  During my time there I worked as part of the Tree of Sex Working Group to collect and curate karyotype and sexual system data for all arthropods.  This work led to over 11,000 records for arthropods, which are now part of a database that is approaching 30,000 records across the tree of life.  I also had the chance to make some really cool figures for this paper to help illustrate the distribution of data in our database.


3) Finally, I also had the opportunity to write an article for the Ontario Entomological Society reviewing sex chromosome and chromosome number evolution across Coleoptera.  This was an enjoyable article to write; I was able to talk a bit about some of the early pioneers who used beetles to discover important aspects of basic biology in the early 1900s.



07 April 2014

Sliding window analysis


Sliding window analyses are a common approach when examining very large datasets such as genomic data.  A friend is working to find regions of the genome that may show signs of involvement in a complex trait, and part of this analysis required that he calculate the mean of a per-site measure in windows along the genome.  With a window size of 4 and a step size of 2, for example, the first window covers sites 1-4, the next covers sites 3-6, and so on, with each window summarized by its mean.
There may be a ready-made way to do this in R, but I was unable to quickly find it, so I just went ahead and wrote one up.  First I made some data to work with:
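
Something along these lines works as example data (the trend and noise values here are arbitrary):

 # a noisy per-site measure along 1,000 sites: a slow trend plus noise
 set.seed(1)
 sites <- 1:1000
 measure <- sin(sites / 100) + rnorm(1000, sd = 0.5)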


then I wrote a simple function:
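
The idea is simply to step a window of fixed width along the data and record the mean of each window; a minimal version looks something like this (the name slide.mean is just illustrative):

 # mean of 'data' in windows of size 'win', moving 'step' sites at a time
 slide.mean <- function(data, win = 4, step = 2){
  starts <- seq(from = 1, to = length(data) - win + 1, by = step)
  means <- sapply(starts, function(s) mean(data[s:(s + win - 1)]))
  # return the window start positions along with their means
  data.frame(start = starts, mean = means)
 }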

and here is the result of running this function and plotting the output with window sizes of 2, 20, 200, and 400.
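
Using the sketch above, that comparison can be reproduced with something like the following, plotting the window means against the window start positions:

 # compare the smoothing effect of different window sizes
 par(mfrow = c(2, 2))
 for(win in c(2, 20, 200, 400)){
  res <- slide.mean(measure, win = win, step = 2)
  plot(res$start, res$mean, type = "l",
     main = paste("window size", win),
     xlab = "position", ylab = "mean")
 }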

Hopefully this will be helpful to someone.

cheers