07 April 2014

Sliding window analysis

Sliding window analyses are a common approach when examining very large datasets such as genomic data.  A friend is working to find regions of the genome that may show signs of involvement in a complex trait, and part of this analysis required that he calculate the mean value of a measure for each site in the genome.  The figure below shows the goal of such a function.  Using a window size of 4 and a step size of 2:
There may be a ready made way to do this in R, but I was unable to quickly find it so I just went ahead and wrote one up.  First I made some data to work with:

then I wrote a simple function:

and here is the result of running this function and plotting the result with window sizes of 2, 20, 200, and 400.  

Hopefully this will be helpful to someone.


25 October 2013

Population Genetics Simulator

There are lots of population genetics simulators available as both standalone programs and online offerings.  However, I decided that I wanted to make my own.  So here is my Fisher-Wright model basic population genetics simulator.  I thought it might be a fun way to try out a new package called “shiny”.  Shiny is a package from the developers of RStudio.  It provides an easy way to create interactive web pages that draw on R for generating the displayed content.  These can be run locally by anyone that has the Shiny package installed or you can set up a shiny server and host them as webpages that will use an instance of R on the server to generate the displayed content.  So here is what it looks like in action:

This isn't on the CRAN version of evobiR yet but if you want to play with it you can get it by installing through GitHub.

install_github("evobir", username='coleoguy')



Here are links to some other useful simulation software:

15 October 2013


Over the course of the last couple of years I have written quite a few scripts in R.  These scripts are more than just my own research, a few I have written for friends, others I wrote tying to understand a topic.  So they run the gamut: teaching simulations, posterior predictive simulations, preparing data for other comparative tools, and analyzing sequences.  I've decided that I should collect all of them and create an R package that I can continue to add to as my research progresses.  So the first and rather rough limited version of this is now on github.  If you want to install it from github its pretty straight forward:

install_github("evobir", username='coleoguy')

for a description of the functions included in the package you can check out my webpage.


02 September 2013

SPAM: tree ≠ tree

I dont get a lot of spam on my blog and what I do get seems to be quite poorly targeted.  However this morning I laughed out loud when I got this comment on my entry about using R to removes taxa from a phylogenetic "tree"

Action Tree Service has left a new comment on your post "Pruning Taxa from Trees": 

Tree pruning is very important process of tree care service.Tree Pruning removes specific branches or stems to benefit the whole tree.Tree pruning

Happy Labor Day!

25 August 2013

Green Peas and Hamming Distances

The first problem for today is drawn from Mendel's first rule.

Given: Three positive integers k, m, and n, representing a population containing k+m+n organisms: k individuals are homozygous dominant for a factor, m are heterozygous,  and n are homozygous recessive.

Return: The probability that two randomly selected mating organisms will produce an individual possessing a dominant allele (and thus displaying the dominant phenotype).  Assume that any two organisms can mate.

The second problem just wants us to calculate the Hamming distance between two aligned DNA sequences.  This is a really easy problem my answer is embarrassingly long but I didn't really feel like optimizing it since it worked right off.

Given: Two DNA strings s and t of equal length (not exceeding 1 kbp).

Return: The Hamming distance dH(s,t).

24 August 2013

NESCent Fellowship

I've just started a grad fellowship at NESCent.  My main goal for my time here is to produce papers that utilize data from the tree of sex data that I have been helping to collect.  However, I do have a few ancillary goals for my time here one of which is to become a better programmer.  I've asked a few people (way smarter than me) over the last couple of years how to become really good in a particular language.  The most frequent response that I have received is to just do "something" each and every day in whatever language it is that you are working on.  To that end I am supplementing my normal programming with the challenges at Rosalind.  This is a great website that has all kinds of programming challenges that you can attempt to solve.  The site seems to have a python focus, buts lots of members solve problems with every language you can think of.  I’ve decided that I am going to start at the beginning of the bioinformatics challenges and try and solve about a problem a day in R and in Python.  The first ones are really simple so I did three in R today.

Challenge 1: A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.  An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG."
Given: A DNA string s of length at most 1000 nt.
Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.

Challenge 2: An RNA string is a string formed from the alphabet containing 'A', 'C', 'G', and 'U'. Given a DNA string t corresponding to a coding strand, its transcribed RNA string u is formed by replacing all occurrences of 'T' in t with 'U' in u.
Given: A DNA string t having length at most 1000 nt.
Return: The transcribed RNA string of t.

Challenge 3: In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.  The reverse complement of a DNA string s is the string sc formed by reversing the symbols of s, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").
Given: A DNA string s of length at most 1000 bp.
Return: The reverse complement sc of s.

cheers from NESCent

09 July 2013

I said that I would post an updated version of my ABBA/BABA code at some point and now seems to be a good time.  I have two basic functions one is the original ABBA/BABA reported in:
Durand, Eric Y., et al. "Testing for ancient admixture between closely related populations." Molecular biology and evolution 28.8 (2011): 2239-2252.

The second is a really cool extension allowing us to use this test to identify directionality and timing of introgression.  It was developed and reported in:Eaton, Deren AR, and Richard H. Ree. "Inferring Phylogeny and Introgression using RADseq Data: An Example from Flowering Plants (Pedicularis: Orobanchaceae)." Systematic biology (2013).