I am working on a side project that involves downloading lots of sequence clusters from the phylota website.  This is a handy tool and makes it quick and easy to get lots of useful sequence data.  However, it also gives you all sequences in genbank for each species.  This means that you might have 500 sequences for a particular gene but only sequences from 10 or 20 different species.  The population sampling that is present in genbank is useful for many projects, but if we are just after a species level phylogeny then what we would really prefer is to just have the best (longest.. maybe) sequence from the species.  I've been using a snippet of pyhton code a friend gave me a long time ago to process these files and keep only the longest sequence for each taxa.  However, I figured it might be a good idea just to have my own R function that does this.  Plus I can add it to my R package so why not.  So here is what I came up with.  This uses only a couple of other function from APE read.dna and wrtie.dna.  I have written it so you can either process a single file or a whole folder of files.  Here is the code by itself but it is also available in the latest version of EvobiR on the github repository.

 # Simple function takes fasta and returns  
 # fitlered fasta with only the longest sequence   
 # for each species.  
 FastaFilter <- function(f.in, folder = F, prefix="pruned"){  
  # f.in if we are processing just a single file this should be  
  # a text string holding the name or name and path to the file  
  # folder if T then all files in the folder will be   
  # processed by the function  
  # prefix as a default we save new copies with the prefix  
  # pruned but we can put whatever we want here  
  # first we have a function for a single file  
  single.file <- function(f.in, prefix){  
   # this reads in our file  
   seqs <- read.dna(f.in, as.character = T, format="fasta", as.matrix = F)  
   # gets a lits of the unique taxa  
   taxa.present <- unique(names(seqs))  
   # this will hold our final set of sequences  
   final.list <- list()  
   # here we find the longest sequence for each species and store it  
   for(i in 1:length(taxa.present)){  
    hits <- which(names(seqs) == taxa.present[[i]])  
    possible <- seqs[hits]  
    z <- which.max(unlist(lapply(possible,length)))  
    final.list[i] <- possible[z]  
    names(final.list)[i] <- names(possible)[z]  
   }  
   # save the result as a fasta file  
   write.dna(x=final.list, file=paste(prefix, ".", f.in, sep = ""),   
        format = "fasta")  
  }  
  # here we call our function for a single file  
  if(folder == F){  
   single.file(f.in, prefix)  
  }  
  # here we get a list of all files and call our  
  # function itterativly to process each file  
  if(folder == T){  
   files <- list.files()  
   for(i in 1:length(files)){  
    single.file(f.in = files[i], prefix = prefix)  
   }  
  }  
 }  

cheers


Great Blogs
Great Blogs
About Me
About Me
My Photo
I am broadly interested in the application and development of comparative methods to better understand genome evolution at all scales from nucleotides to chromosomes.
Subscribe
Subscribe
Loading
Dynamic Views theme. Powered by Blogger. Report Abuse.