16 February 2013

Austin - Phylogenomics and Metagenomics


I am in Austin this weekend for the Phylogenomics and metagenomics symposium and workshop organized by Tandy Warnow.  Today was the symposium tomorrow is the workshop.  Lots of really interesting and exciting methods were presented today.  Prior to getting here I felt that the thing that I was most interested in was simultaneous estimation of trees and alignments – SATe type tools - these were really cool and interesting.  Although two speakers really grabbed my attention with ideas involving “importance sampling”. 

Fair warning: Anything that doesn’t make sense below is certainly due to my own misunderstanding.

The first to do this was Mark Holder.  He was discussing stepping stone sampling to estimate the marginal likelihood of a model.  Part of this process involved getting trees that are similar to those in the posterior but were not actually sampled.  The challenge here is to do this in an informed way were you use the info in the posterior to inform your choice of trees.  So the way this works is that you can look at a consensus tree and you will have probabilities of each edge in your consensus.  To produce new trees just work your way through your consensus tree retaining edges in proportion to their posterior probability.  This will leave you with a bunch of multifurcations that need to be resolved in any way other than the way they are found in the consensus tree.  This process allows you to produce a new sample of trees which will be centered on the posterior distribution but spread out.

The second was Bret Larget.  So Bret’s idea is that because subtrees are “approximately” independent of one another you can use a sample of trees to estimate the probability of trees that were never sampled in the tree search process.  Furthermore this process can really be used to estimate a true probability of a tree rather than simply considering its probability to be equal to its frequency in the posterior sample produced by the mcmc sampler.  I cant explain the way that you calculate the probability of such a tree very well but here is a link to Bret’s talk.  It is fairly straight forward but it has been a long day!  The effect of this is that a “relatively” small sample of trees may actually contain just as much information (when examine in this way) as a much larger traditional posterior distribution when examined in a traditional approach 

12 February 2013

Darwin Day!

Happy Darwin Day!  Super busy this week getting ready for a workshop and a talk but here is a Darwin slide that I just put together last weekend:


It's amazing how many ideas that we still study today are acknowledged in Darwin's work.

Cheers

05 January 2013

Tribolium Photos

In our lab we use the red flour beetle (Tribolium castaneum) as our model system.  However, we also use several other closely related species as well.  A while back I snapped some shots of some of these related species.  Just for good measure I also included T. castaneum.  Once I was cleaning the photos up I realized the T. castaneum I picked was actually a mutant called  pearl.  This mutation causes the eye to lack pigmentation.  You can see this best in the lateral shot.  I went back and used a wild type individual for the ventral shot so you can also see what the eyes normally look like.  If you need a high res picture of one of these species just let me know these are sized down to work on blogger.











I have included T. brevicornis in this grouping even though it should be noted that this is actually only distantly related to tribolium.  Molecular analysis places it confidently outside of the rest of the tribolium species.  Eventually it will probably be moved out of the genus tribolium.

Cheers and Happy New Year

05 December 2012

Art + Science = Super Cool


Today it seems to me to be pretty rare to find a successful scientist incorporating their knowledge or love of their particular slice of biology into more artistic endeavors.  One scientist who you can’t say this about is David Maddison.  The advance access of Systematic Biology has a poem that he wrote several years ago Tree of Life.  It’s open access and I encourage people to go check it out.  Super cool!  If you go to David’s website you’ll find that he brings his artistic and scientific interests together in the form of some amazing beetle illustrations as well.
As long as we are doing a post on the arts, in case you haven’t already listened to him let me recommend the evolution record by Baba Brinkman (its on spotify).   The record is the product of a NESCent program that brings artists together with scientists and allows for cross-fertilization that might never happen otherwise.   The record is well worth a listen and its just super cool to hear some rap from Origins and sample Richard Dawkins.

Cheers

15 November 2012

Variance in the outcomes of birth death processes

I don’t think a lot of people appreciate the degree of variance possible under a simple constant rate speciation extinction model.  I think the natural reaction if we see the third and fifth tree above as sister lineages is that the third must have a significantly higher net speciation rate.  However each of the above trees were generated under the same model.  When I went to the Bodega phylogenetics workshop last year Brian Moore did a nice presentation where he walked us through illustrating this point in R.  I need to do the same thing for a seminar that I am giving soon and I decided that it would be good to generate it on my own and make a few small aesthetic changes.  So here it is:

here is the script

I will try to post the code here in the post for this later but something about it seems to be conflicting with bloggers parser and it comes out all messed up.  For now you can just get the script from dropbox

cheers

08 November 2012

Quantitative Use of Tree Space…..or not. :-(

I am not sure when I first heard the term "tree space" but I think that it is such a fascinating concept it distracts me on frequent occasions.  If you aren’t familiar with it let me describe it for you.  Think of a phylogenetic tree with four tips.  Ignoring for a moment differences in branch length you can imagine that there are a small number of trees that you can reach by making a single branch exchange to the topology of the tree.  These can be defined as a single “step” away.  We can get more nuanced by taking into consideration the differences in the branch lengths.  When we do this we can then think that there are basically an infinite number of trees that share varying degrees of similarity with one another.

This high dimensional infinite landscape of trees is what we attempt to search when we are inferring the phylogeny of a group of organisms.  The most commonly used methods today don’t attempt to find a single tree but rather produce a posterior distribution which should be representative of the probability of the different groupings and branch lengths in the “true” tree.  In other words if taxa a, b, and c are a monophyletic group in 95% of the trees in the posterior distribution then there is a 95% chance that they form a monophyletic group in the “true” tree.

This idea of trees occurring in a multideminsional space is old and might well have been influenced by Sewall Wright’s conception of a fitness landscape which shares many of the same principles.  However both the idea of a fitness landscape and tree space seem to have been largely left to the realm of didactic tools.  This really frustrates me because I feel like if I think about it hard enough or play with the concept enough. That I should be able to leverage an understanding of tree space to shed light on one of the many challenges in phylogenetics.  One problem that seem particularly apropos is the  sampling of posterior distributions of trees.

Why do we sample phylogenies?  Well as I said modern inference methods (MrBayes, Beast, etc) produce posterior samples of trees that can be enormously large.  In fact we often like them to be very large because this helps to makes sure that we really are capturing whatever variability their might be in our estimate of the phylogeny.  However this creates a problem when we move to attempting to use these phylogenies for things like modeling, especially as phylogenies get larger.  In a recent project that I was working on a simulation that I wanted to run across my trees took several days per a tree.  Because of this it would be nice if we could quantitatively choose an appropriate number of trees to sample from our posterior distribution that represents the uncertainty that we have in our posterior but that isn’t simply excessive.
I really thought that I had a solution to this problem.  I recently read a great paper by Susan HolmesComputational Tools for Evaluating Phylogenetic and Hierarchical Clustering Trees”  in this paper she discusses the possibility of representing a distribution of trees as a tree itself.  This is done by simply calculating a distance matrix for the collection of trees and then using your tree building method of choice to make a tree where each phylogeny effectively becomes a leaf. 
So my thought was that if we do this then the branch length of this tree is basically an indirect representation of “tree space” aka longer distances between two tips (trees) means the trees are more different.  So I thought that perhaps we could just look at how quickly we accumulate branch length as we begin to sample the tips of such a tree.  I thought that if we had sampled enough trees we would get the majority of the branch length on the tree and know that we had enough trees to represent what was going on.
So here was my hypothesis for what I would find:




So my first step was to create different artificial posterior distributions of trees.  I did this by downloading 10,000 Perissodactyla trees from the 10K tree project website.  This is a nice small tree to play with just 17 species.  I pulled a sample of 200 trees out of this and used this as my posterior to play with.  There is not a lot of variability in these trees so I kept this original sample as my “tight” posterior data set.  I then created two perturbed data sets by scaling groups of trees by different factors.  This effectively meant that my posterior had some short trees and some long trees. I calculated distance matrices for each of these sets of trees and did simple upgma to construct my “tree of trees”.  This seemed to work the way that I expected.





Now I can begin sampling taxa from these trees and see how branch length accumulates… Remember my hypothesis was that I believed you would see branch length accumulate very quickly in a low uncertainty situation but more slowly in a high uncertainty situation.  And the results were…….


Ok so I wasn’t really surprised I could tell (probably just like you) before I ran it that the opposite was going to happen.  The important thing is understanding why it is working this way, and I believe the answer to this is that tree space doesn’t have any inherent measure apart from the trees that we use to define it.  Therefore the fact that all of the trees in the low uncertainty sample are more similar than any in the high uncertainty sample is not represented in the way that I creating tree space.
So how might we solve this problem?  One way might be to identify the potential size of tree space and attempt to “nest” our posterior distribution into this potential tree space.  I experimented with this a bit…  I decided that we might do this by creating basically random phylogenies that have the same number of tips and the same range of depths as our realized phylogenies and then including them in the calculation of the distance matrix.  This actually worked pretty well and I find that the “realized” phylogenies (red below) nest tightly within potential tree space (blue below) when there is little uncertainty in the posterior distribution and are more spread out across potential tree space when there is more uncertainty in the posterior distribution.






So that graphic right there shows me the difference in the datasets and shows me that I probably need to sample more from the posterior pictured on the right but... Unfortunately I am not at all sure how to turn all of this in to a quantitative measurement of required sample size so I thought that I would just write this up and let it sit on the back of my mind for a while.  Who knows maybe someone else will read this and point out the painfully obvious answer that I am missing.  If you have any thoughts about this and would like to discuss it feel free to email me at coleoguy at gmail dot com

As you know I love to share my R code and I will this time but if you look at it remember that most of this was done in the course of two nights at a starbucks and was just for fun so its not the prettiest or best organized code ever! my ugly (but kinda cool) R code
cheers

26 October 2012

I love R

It really is amazing what you can do with a little practice in R.  The kinds of graphs that you can generate are really only limited by your own imagination.  I have an MCMC log file from BayesTraits that contains a sample of the posterior distribution of 7 transition rates.  For some time I have just been viewing these with the program Tracer that is distributed with beast.  These graphs aren't bad at all and they are super easy to generate.

 Graph from Tracer:
The problem with this graph is that everything has to be added piece by piece after the fact.  The axis doesn't really work the way it is produced and neither does the legend (i added the lines and text in another program).

So what can we do with R to display this data in a clear and convincing way?  If we like this kind of line density plot we can take advantage of the density function in R a simple command along the lines of:

>plot(density(data))

This will produce a simple line graph of the density type shown above. All of the normal par variables are available to make it as pretty or customized as desired.

I was interested in some different ways of representing this data instead of using this line density plot and for these I turned to box, violin, and scatter plots.

The easiest of these is probably the ready made function boxplot.  With box plot we have a call and graph like this:

>boxplot(data1,data2,data3,data4,data5,data6)
 


The length of the whiskers as well as just about everything else you see here can be customized in anyway you would like.  Personally though I don't feel like boxplot graphs give me a good feel for the variability in a dataset.  My goal was to get something that spoke to me visually more clearly.  So the next thing that I tried was the package vioplot.
The vioplot argument line is more or less the same as that for boxplot, I have added a couple of extra lines to place means and 95% HPD on top of the violin plots
>vioplot(data[,1],data[,2],data[,3],data[,4],data[,5])
> arrows(...)
>points(....)

 I feel like I am really starting to get somewhere with this.  With this I feel we get a much better idea about the real variability in our estimates.  Seeing this though really made me want to do better and so I decided to start from scratch and see what I could come up with.

My thought was to plot the actual points and introduce a jitter to the x value allowing the data to be spread out enough that the density of data points would be obvious when you look at the graph.  I am still working on the code but as it sits right now we get something along these lines.
 
This isn't bad but I feel like it could be improved by making the jitter value being given to the X proportional to the distance from the mean value of the estimate.  If we do this we then get something along these lines.



Once I figure out what the best way to do this is I will post the code.  Have a great weekend

Cheers