Today a friend asked me the easiest way to concatenate
several hundred fasta format alignments that include some differences in the
species present in each alignment into a single supermatrix. For example if we
had two alignments like this:
LOCUS 1
>species A
atcgatcgtagctcgtacgtcgtacg---act
>species C
atcgatcgtagctcgtacgtcgtacggtaact
LOCUS 2
>species A
cgt---acgtcgtactgtc
>species B
cgtcctacgtcgtactgtc
we would want to produce a final alignment that looks like
this:
>species A
atcgatcgtagctcgtacgtcgtacg---act cgt---acgtcgtactgtc
>species B
-------------------------------- cgtcctacgtcgtactgtc
>species C
atcgatcgtagctcgtacgtcgtacggtaact-------------------
We would also want to keep track of the data partitions that
have been combined so we would want a report like this as well:
Locus 1 1:33
Locus 2 34:52
I assumed that this would be a simple task in R – surely somebody
has a function for this.
However, after
checking some of the big packages like APE I found that there wasn’t an obvious
and easy way to do this.
For instance in ape while there is c method for dna object it expects them to have the same taxa and it drops any taxa not in all alignments. I am in the
process of writing a manuscript for my package
evobiR so I decided I should write
a function for this. Here is the code that I came up with.
SuperMatrix <- function(missing = "-",
prefix = "concatenated",
save = T){
# get file names
file.names <- list.files()
# read DNA
DNA <- list()
for(i in 1:length(file.names)){
print(paste("Reading alignment", i))
DNA[[i]] <- read.dna(file=file.names[i],
format = "f",
as.character=T)
}
# calculate total alignment length
total.length <- 0
for(i in 1:length(DNA)){
total.length <- total.length + ncol(DNA[[i]])
}
# get all taxa names
taxa <- vector()
for(i in 1:length(DNA)){
counter <- length(taxa) + 1
for(j in 1:nrow(DNA[[i]])){
taxa <- c(taxa,
row.names(DNA[[i]])[!row.names(DNA[[i]]) %in% taxa])
}
}
# create empty alignment
seqmatrix <- matrix(as.character(missing),
length(taxa),
total.length)
rownames(seqmatrix) <- taxa
# create partition record holder
partitions <- as.data.frame(matrix(,length(DNA),3))
colnames(partitions) <- c("part", "start", "stop")
#build the supermatrix
c.col <- 0
print("Creating supermatrix")
for(i in 1:length(DNA)){
print(paste("Processing alignment", i))
gene <- DNA[[i]]
print(i)
for(j in 1:nrow(gene)){
c.row <- which(rownames(seqmatrix) == rownames(gene)[j])
seqmatrix[c.row, (c.col + 1):(c.col + ncol(gene))] <- gene[j, ]
}
partitions[i,1:3] <- c(file.names[i], c.col + 1, c.col + ncol(gene))
c.col <- c.col + ncol(gene)
}
results <- list()
results[[1]] <- partitions
results[[2]] <- seqmatrix
if(save == T){
print("saving files")
write.dna(seqmatrix,
file = paste(prefix, ".fasta", sep = ""),
format = "f")
write.csv(partitions,
row.names = F,
file = paste(prefix, ".partitions.csv", sep = ""))
}
return(results)
}
While I haven’t done anything to try and make this very fast
testing it on my friends radseq data it was able to construct a supermatrix for
18 taxa and 200,000 bp in just a couple of minutes so probably is already fast enough
for most purposes. This function has already been loaded into the current version of
evobiR available from github If you find this
useful or see a potential problem please let me know.
cheers
Add a comment