Today a friend asked me the easiest way to concatenate several hundred fasta format alignments that include some differences in the species present in each alignment into a single supermatrix. For example if we had two alignments like this:

LOCUS 1
>species A
atcgatcgtagctcgtacgtcgtacg---act
>species C
atcgatcgtagctcgtacgtcgtacggtaact

LOCUS 2
>species A
cgt---acgtcgtactgtc
>species B
cgtcctacgtcgtactgtc

we would want to produce a final alignment that looks like this:

>species A
atcgatcgtagctcgtacgtcgtacg---act cgt---acgtcgtactgtc
>species B
-------------------------------- cgtcctacgtcgtactgtc
>species C
atcgatcgtagctcgtacgtcgtacggtaact-------------------

We would also want to keep track of the data partitions that have been combined so we would want a report like this as well:

Locus 1 1:33
Locus 2 34:52

I assumed that this would be a simple task in R – surely somebody has a function for this.  However, after checking some of the big packages like APE I found that there wasn’t an obvious and easy way to do this.  For instance in ape while there is c method for dna object it expects them to have the same taxa and it drops any taxa not in all alignments.  I am in the process of writing a manuscript for my package evobiR so I decided I should write a function for this. Here is the code that I came up with.

 SuperMatrix <- function(missing = "-",  
             prefix = "concatenated",  
             save = T){  
  # get file names  
  file.names <- list.files()  
  # read DNA  
  DNA <- list()  
  for(i in 1:length(file.names)){  
   print(paste("Reading alignment", i))  
   DNA[[i]] <- read.dna(file=file.names[i],   
              format = "f",   
              as.character=T)  
  }  
  # calculate total alignment length  
  total.length <- 0  
  for(i in 1:length(DNA)){  
   total.length <- total.length + ncol(DNA[[i]])  
  }  
  # get all taxa names  
  taxa <- vector()  
  for(i in 1:length(DNA)){  
   counter <- length(taxa) + 1  
   for(j in 1:nrow(DNA[[i]])){  
    taxa <- c(taxa,   
         row.names(DNA[[i]])[!row.names(DNA[[i]]) %in% taxa])  
   }  
  }  
  # create empty alignment  
  seqmatrix <- matrix(as.character(missing),  
                   length(taxa),  
                   total.length)  
  rownames(seqmatrix) <- taxa  
  # create partition record holder  
  partitions <- as.data.frame(matrix(,length(DNA),3))  
  colnames(partitions) <- c("part", "start", "stop")  
  #build the supermatrix  
  c.col <- 0  
  print("Creating supermatrix")  
  for(i in 1:length(DNA)){  
   print(paste("Processing alignment", i))  
   gene <- DNA[[i]]  
   print(i)  
   for(j in 1:nrow(gene)){  
    c.row <- which(rownames(seqmatrix) == rownames(gene)[j])  
    seqmatrix[c.row, (c.col + 1):(c.col + ncol(gene))] <- gene[j, ]  
   }  
   partitions[i,1:3] <- c(file.names[i], c.col + 1, c.col + ncol(gene))  
   c.col <- c.col + ncol(gene)  
  }  
  results <- list()  
  results[[1]] <- partitions  
  results[[2]] <- seqmatrix  
  if(save == T){  
   print("saving files")  
   write.dna(seqmatrix,   
        file = paste(prefix, ".fasta", sep = ""),   
        format = "f")  
   write.csv(partitions,   
        row.names = F,   
        file = paste(prefix, ".partitions.csv", sep = ""))  
  }  
  return(results)  
 }  


While I haven’t done anything to try and make this very fast testing it on my friends radseq data it was able to construct a supermatrix for 18 taxa and 200,000 bp in just a couple of minutes so probably is already fast enough for most purposes.  This function has already been loaded into the current version of evobiR available from github If you find this useful or see a potential problem please let me know.


cheers

0

Add a comment

Great Blogs
Great Blogs
About Me
About Me
My Photo
I am broadly interested in the application and development of comparative methods to better understand genome evolution at all scales from nucleotides to chromosomes.
Subscribe
Subscribe
Loading
Dynamic Views theme. Powered by Blogger. Report Abuse.