First, if you need a quick refresher on what the ABBA/BABA test is you can check out this post:
ABBA/BABA
Perhaps one of the simplest tests for introgression is the ABBA/BABA test. I have a couple of versions of this test implemented in my R package EvobiR. The most frequent email that I have gotten concerning ABBA/BABA tests is from people that want support for ambiguity codes. Ambiguity codes have two very different sources. The first can be due simply to sequencing error where we accidently read some copies of a DNA sequence incorrectly. The second, and with the current methods hopefully more common source, is the result of the way we sequence and assemble our data. Because we often use short sequences that are assembled by their overlapping portions we often don’t have haplotype data (basically knowing which SNPs belong with each other). Instead, we get a single sequence for each individual and it contains information from both the paternal and maternal copy of the sequence in question.
As an example lets imagine sequencing this locus that contains three SNPs.
TRUTH
Paternal copy CGTCAAGATACATGCCGCTCCTGTCATAACTGCG
Maternal copy CGTCATGATACATGCCGCTGCTGTCATCACTGCG
SEUENCE DATA
CGTCATGA
AAGATACAT
TACATGCCGCTCCT
TACTGTCAT
GCTGCTGTCATC
TCATAACTGCG
OUTPUT SEQUENCE
CGTCAWGATACATGCCGCTVCTGTCATMACTGCG
So how should we handle these ambiguities when we start performing the ABBA/BABA test. First, I would argue that when sequencing individuals only ambiguity codes that represent 2 nucleotides are valid. It seems that codes that represent 3 or 4 nucleotides (highlighted in red above) must be either sequencing error or represent different loci. So the first thing we should do is filter out those 3 and 4 nucleotide loci. For those loci that are biallelic we have two choices: 1) we could randomly resolve them to either state or 2) we could count biallelic sites as weighted possibilities. Option 2 isn’t bad if we have a single sequence with an ambiguity:
ARAG this effectively means we would count ½ locus as AAAG and ½ a locus as AGAG. However, we can imagine this gets complicated quickly for instance if we have ARRR we now have 8 possibilities AAAA, AGAA, AAGA, AAAG, AGGA, AGAG, AAGG, AGGG. This would be quite a pain to code. Additionally, if we think about it if we randomly resolve sites at the limit (many sites or the average of many replicate analyses) we would converge on the same scores. For this reason, at least for now I'm just going to implement the drop everything ambiguous or the random resolution solutions.
So the extension to the code include:
1) first filtering out any loci that contain these codes:
B which is the ambiguity code for C or G or T
D which is the ambiguity code for A or G or T
H which is the ambiguity code for A or C or T
V which is the ambiguity code for A or C or G
N which is the ambiguity code for any base
|
|
2) randomly resolving any occurrences of these codes:
R to A or G
Y to C or T
S to G or C
W to A or T
K to G or T
M to A or C
So I now have a flag ambig that the user can set to "D" to drop all sites with any ambiguity, or "R" to resolve all biallelic sites to one of the possible nucleotides, or "I" to ignore all of this and work the same as it did in the past (see the note at bottom for the possible use of "I").
To make sure this is working correctly I’m using a very contrived example. When we drop all ambiguities there should be no signal but when we randomly resolve there should be at least some signal:
>Seq1
CCCCCCCCCYYYYSSSSCCCCCCCCCCCCCCCCMMMMMAAAAACCCCC
>Seq2
SSSSSSKKKGGGGGGGGRRRRRRRRRRWWWWWWTTTTTCCCCCAAAAA
>Seq3
GGGGGGGGGGGGGGGGGGGGGGGGGGGTTTTTTTTTTTAAAAAAAAAA
>Seq4
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
Below is an example of using both ABBA/BABA tests with this new flag:
CalcD(alignment = "example.fasta", sig.test = "N", ambig="D")
Sites in alignment = 10
Number of sites with ABBA pattern = 5
Number of sites with BABA pattern = 5
D raw statistic = 0
[1] 0
CalcD(alignment = "example.fasta", sig.test = "N", ambig="R")
Sites in alignment = 48
Number of sites with ABBA pattern = 24
Number of sites with BABA pattern = 5
D raw statistic = 0.6551724
[1] 0.6551724
CalcD(alignment = "example.fasta", sig.test = "N", ambig="I")
Sites in alignment = 48
Number of sites with ABBA pattern = 5
Number of sites with BABA pattern = 5
D raw statistic = 0
[1] 0
That looks like what we expect. However, it is important to remember that with the ambig flag set to "R" this analysis becomes stochastic. For instance, if we run this code to repeat the analysis 500 times we can see what the distribution of scores looks like:
D = vector()
for(i in 1:500){
D[i] = CalcD(alignment = "example.fasta", sig.test = "N", ambig="R")
}
hist(D, xlim=c(.4,.8),xlab="Value of D statistic", main="Variation across 500 runs")
I mentioned above that the "I" flag for ambig allows you to use these functions without any changes. This is really left in for one user who is trying to apply the ABBA/BABA test to microsatellite data. So this user never has A, C, G, or T and the new code would have thrown out all of their data. For most applications, though I think users should be using the drop option unless they are going to carefully explore how these ambiguities impact their results. For this reason, the default for both functions is drop ambiguities.
To install the latest version of evobiR that has all of these changes just run this code in R:
install.packages("devtools")
library(devtools)
install_github('coleoguy/evobir', build_vignettes = T)
library(evobiR)
Let me know if you have any problems with this.
cheers
Add a comment