I am not sure when I first heard the term "tree space", but I find it such a fascinating concept that it frequently distracts me.  If you aren’t familiar with it, let me describe it for you.  Think of a phylogenetic tree with four tips.  Ignoring differences in branch length for a moment, you can imagine that there are a small number of trees that you can reach by making a single branch exchange in the topology of the tree.  These can be defined as a single “step” away.  We can get more nuanced by taking the differences in branch lengths into consideration.  When we do this, there are effectively an infinite number of trees that share varying degrees of similarity with one another.
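To put numbers on the four-tip example, here is a minimal Python sketch. It assumes unrooted, fully bifurcating trees and treats a "branch exchange" as a nearest-neighbor interchange (NNI); both are assumptions, and the function names are made up for illustration.

```python
def count_unrooted_topologies(n_tips: int) -> int:
    """Number of unrooted, fully bifurcating topologies: (2n - 5)!!"""
    count = 1
    for k in range(3, 2 * n_tips - 4, 2):  # odd numbers 3, 5, ..., 2n - 5
        count *= k
    return count


def count_nni_neighbors(n_tips: int) -> int:
    """Topologies one NNI away: two rearrangements per internal branch."""
    internal_branches = max(n_tips - 3, 0)
    return 2 * internal_branches


print(count_unrooted_topologies(4))  # 3
print(count_nni_neighbors(4))        # 2
```

With four tips there are only three topologies, so each one is a single step from the other two. The count grows very quickly as tips are added, which is why tree space becomes so large even before branch lengths are considered.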

This high-dimensional, infinite landscape of trees is what we attempt to search when we infer the phylogeny of a group of organisms.  The most commonly used methods today don’t attempt to find a single tree but rather produce a posterior distribution that should be representative of the probability of the different groupings and branch lengths in the “true” tree.  In other words, if taxa a, b, and c form a monophyletic group in 95% of the trees in the posterior distribution, then (given the model and the data) there is a 95% posterior probability that they form a monophyletic group in the “true” tree.
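The 95% figure is just a proportion of sampled trees, which a short Python sketch can show. The representation here is an assumption made for illustration: each tree is stored as the set of its clades, each clade is a frozenset of tip names, and the toy posterior is made-up data.

```python
def clade_support(trees, clade):
    """Fraction of sampled trees that contain `clade` as a monophyletic group."""
    target = frozenset(clade)
    return sum(1 for tree in trees if target in tree) / len(trees)


# Toy posterior sample (made-up data): each tree is the set of its clades.
tree_with_abc = {frozenset("ab"), frozenset("abc"), frozenset("de")}
tree_without_abc = {frozenset("bc"), frozenset("bcd"), frozenset("ae")}
posterior = [tree_with_abc] * 19 + [tree_without_abc]

print(clade_support(posterior, "abc"))  # 0.95
```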

This idea of trees occurring in a multidimensional space is old and might well have been influenced by Sewall Wright’s conception of a fitness landscape, which shares many of the same principles.  However, both the fitness landscape and tree space seem to have been largely left to the realm of didactic tools.  This really frustrates me, because I feel that if I think about it hard enough or play with the concept enough, I should be able to leverage an understanding of tree space to shed light on one of the many challenges in phylogenetics.  One problem that seems particularly apropos is the sampling of posterior distributions of trees.

Why do we sample phylogenies?  Well, as I said, modern inference methods (MrBayes, BEAST, etc.) produce posterior samples of trees that can be enormously large.  In fact, we often like them to be very large because this helps to make sure that we really are capturing whatever variability there might be in our estimate of the phylogeny.  However, this creates a problem when we try to use these phylogenies for things like modeling, especially as phylogenies get larger.  In a recent project, a simulation that I wanted to run across my trees took several days per tree.  Because of this, it would be nice if we could quantitatively choose a number of trees to sample from our posterior distribution that represents the uncertainty in the posterior without being excessive.
I really thought that I had a solution to this problem.  I recently read a great paper by Susan Holmes, “Computational Tools for Evaluating Phylogenetic and Hierarchical Clustering Trees”.  In this paper she discusses the possibility of representing a distribution of trees as a tree itself.  This is done by simply calculating a distance matrix for the collection of trees and then using your tree-building method of choice to make a tree where each phylogeny effectively becomes a leaf.
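As a sketch of the first step, the Python below builds a pairwise distance matrix for a small set of trees. The choice of distance is an assumption: it uses a branch score distance (the square root of the summed squared differences in branch length over all clades), which is sensitive to branch lengths, and the post does not say which distance was actually used. Trees are represented as dictionaries mapping each clade to the length of the branch above it, which is also a convenience chosen for illustration.

```python
import math
from itertools import combinations


def branch_score_distance(tree_a, tree_b):
    """Branch score distance; a clade missing from a tree counts as length 0."""
    clades = set(tree_a) | set(tree_b)
    return math.sqrt(
        sum((tree_a.get(c, 0.0) - tree_b.get(c, 0.0)) ** 2 for c in clades)
    )


def distance_matrix(trees):
    """Symmetric matrix of pairwise distances between trees."""
    n = len(trees)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = branch_score_distance(trees[i], trees[j])
    return matrix


def scale(tree, factor):
    """Multiply every branch length by `factor`."""
    return {clade: length * factor for clade, length in tree.items()}


a, b, c = frozenset("a"), frozenset("b"), frozenset("c")
t1 = {a: 1.0, b: 1.0, c: 2.0, a | b: 1.0}  # ((a:1,b:1):1,c:2)
t2 = scale(t1, 2.0)                        # same topology, twice as long
t3 = {b: 1.0, c: 1.0, a: 2.0, b | c: 1.0}  # ((b:1,c:1):1,a:2)

for row in distance_matrix([t1, t2, t3]):
    print(row)
```

The resulting matrix can be handed to any distance-based tree-building method, so that each phylogeny becomes a leaf.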
So my thought was that if we do this, then the branch length of this tree is basically an indirect representation of “tree space”; that is, a longer distance between two tips (trees) means the trees are more different.  So I thought that perhaps we could just look at how quickly we accumulate branch length as we begin to sample the tips of such a tree.  If we had sampled enough trees, we would capture the majority of the branch length on the tree and know that we had enough trees to represent what was going on.
So here was my hypothesis for what I would find:




So my first step was to create different artificial posterior distributions of trees.  I did this by downloading 10,000 Perissodactyla trees from the 10kTrees project website.  This is a nice small tree to play with, just 17 species.  I pulled a sample of 200 trees out of this and used it as my posterior to play with.  There is not a lot of variability in these trees, so I kept this original sample as my “tight” posterior data set.  I then created two perturbed data sets by scaling groups of trees by different factors.  This effectively meant that my posterior had some short trees and some long trees.  I calculated distance matrices for each of these sets of trees and used simple UPGMA clustering to construct my “tree of trees”.  This seemed to work the way that I expected.
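For readers who have not seen UPGMA written out, here is a minimal pure-Python sketch of the clustering step. It is not the R code used for this post; the function name `upgma`, the nested-tuple output, and the toy matrix are all assumptions made for illustration.

```python
def upgma(dist, labels):
    """Cluster a distance matrix with UPGMA.

    Returns (tree, height): a nested tuple of labels and the root height.
    """
    n = len(labels)
    # cluster id -> (subtree, number of leaves, node height)
    clusters = {i: (labels[i], 1, 0.0) for i in range(n)}
    # pairwise distances, keyed by (smaller id, larger id)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        (i, j), dij = min(d.items(), key=lambda item: item[1])
        (ti, ni, _), (tj, nj, _) = clusters.pop(i), clusters.pop(j)
        # distance from the merged cluster to each remaining cluster is the
        # size-weighted average of the two old distances
        merged = {}
        for k in clusters:
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            merged[(k, next_id)] = (ni * dik + nj * djk) / (ni + nj)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(merged)
        clusters[next_id] = ((ti, tj), ni + nj, dij / 2.0)
        next_id += 1
    (tree, _, height), = clusters.values()
    return tree, height


# Toy matrix for three "trees" A, B, C (made-up distances).
toy = [[0.0, 2.0, 6.0],
       [2.0, 0.0, 6.0],
       [6.0, 6.0, 0.0]]
print(upgma(toy, ["A", "B", "C"]))
```

Here A and B are joined first at height 1, and C joins them at height 3, so each leaf of the result stands for one phylogeny from the sample.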





Now I can begin sampling taxa from these trees and see how branch length accumulates.  Remember, my hypothesis was that you would see branch length accumulate very quickly in a low-uncertainty situation but more slowly in a high-uncertainty situation.  And the results were...
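The sampling step can be sketched in Python as follows. The details are assumptions, since the post does not spell them out: the tree of trees is stored as a child-to-parent map, the "branch length" of a sample is taken to be the total length of the smallest subtree connecting the sampled tips, and the curve is expressed as a fraction of the full tree's length.

```python
import random


def subtree_length(parent, sampled_tips):
    """Total branch length of the smallest subtree connecting the sampled tips.

    `parent` maps each node to (parent_node, branch_length); the root maps to None.
    """
    visits = {}
    for tip in sampled_tips:
        node = tip
        while parent[node] is not None:
            visits[node] = visits.get(node, 0) + 1
            node = parent[node][0]
    # A branch used by every sampled tip lies above their common ancestor,
    # so it is not part of the connecting subtree.
    return sum(
        parent[node][1] for node, n in visits.items() if n < len(sampled_tips)
    )


def accumulation_curve(parent, tips, rng):
    """Fraction of total branch length captured as tips are added in random order."""
    order = list(tips)
    rng.shuffle(order)
    total = subtree_length(parent, tips)
    return [subtree_length(parent, order[:k]) / total
            for k in range(1, len(order) + 1)]


# Toy tree of trees ((A:1,B:1):2,C:3), made-up values.
toy_parent = {
    "A": ("AB", 1.0), "B": ("AB", 1.0),
    "AB": ("root", 2.0), "C": ("root", 3.0),
    "root": None,
}
print(accumulation_curve(toy_parent, ["A", "B", "C"], random.Random(1)))
```

Note that dividing by the total length makes the curve insensitive to the overall scale of the distances, which is worth keeping in mind when comparing data sets.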


OK, so I wasn’t really surprised.  I could tell (probably just like you) before I ran it that the opposite was going to happen.  The important thing is understanding why it works this way, and I believe the answer is that tree space doesn’t have any inherent measure apart from the trees that we use to define it.  Therefore, the fact that all of the trees in the low-uncertainty sample are more similar to one another than any in the high-uncertainty sample is not represented in the way that I am creating tree space.
So how might we solve this problem?  One way might be to identify the potential size of tree space and attempt to “nest” our posterior distribution in this potential tree space.  I experimented with this a bit.  I decided that we might do this by creating essentially random phylogenies that have the same number of tips and the same range of depths as our realized phylogenies and then including them in the calculation of the distance matrix.  This actually worked pretty well: I find that the “realized” phylogenies (red below) nest tightly within potential tree space (blue below) when there is little uncertainty in the posterior distribution and are more spread out across potential tree space when there is more uncertainty.
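One way to generate such background trees is sketched below in Python. The generating process is an assumption, since the post only says the trees were "basically random": lineages are joined in random pairs at random heights, and each tree's depth is drawn uniformly from the observed range. The function names and the clade-to-branch-length dictionary format are hypothetical.

```python
import random


def random_tree(tips, depth, rng):
    """Random ultrametric tree as {clade: branch_length} with root height `depth`."""
    n = len(tips)
    # n - 1 internal node heights in increasing order; the last is the root
    heights = sorted(rng.uniform(0.0, depth) for _ in range(n - 2)) + [depth]
    lineages = [(frozenset([t]), 0.0) for t in tips]  # (clade, node height)
    tree = {}
    for h in heights:
        a, b = rng.sample(range(len(lineages)), 2)  # pick two lineages to join
        (clade_a, height_a), (clade_b, height_b) = lineages[a], lineages[b]
        tree[clade_a] = h - height_a
        tree[clade_b] = h - height_b
        lineages = [x for i, x in enumerate(lineages) if i not in (a, b)]
        lineages.append((clade_a | clade_b, h))
    return tree


def random_background(tips, observed_depths, n_random, rng):
    """Random trees whose depths span the range seen in the realized trees."""
    low, high = min(observed_depths), max(observed_depths)
    return [random_tree(tips, rng.uniform(low, high), rng) for _ in range(n_random)]


rng = random.Random(42)
tips = ["sp%d" % i for i in range(1, 18)]  # 17 placeholder tip names
background = random_background(tips, [0.8, 1.0, 1.3], 5, rng)  # made-up depths
print(len(background), len(background[0]))
```

These background trees would then be pooled with the realized trees before the distance matrix is calculated, so that the realized trees can be viewed against the wider space.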






So that graphic right there shows me the difference between the data sets and shows me that I probably need to sample more from the posterior pictured on the right, but...  Unfortunately, I am not at all sure how to turn all of this into a quantitative measurement of required sample size, so I thought that I would just write this up and let it sit in the back of my mind for a while.  Who knows, maybe someone else will read this and point out the painfully obvious answer that I am missing.  If you have any thoughts about this and would like to discuss it, feel free to email me at coleoguy at gmail dot com.

As you know, I love to share my R code, and I will this time.  But if you look at it, remember that most of this was done over the course of two nights at a Starbucks and was just for fun, so it’s not the prettiest or best organized code ever!  my ugly (but kinda cool) R code
cheers
