I went to my first R users group meeting in Minnesota and was treated to talks by two R heavyweights. First, Hadley Wickham gave a talk about what he views as the most exciting and dynamic area of R development (interactive graphics), and this was followed by a short talk by Winston Chang showing what he has been doing to develop interactive graphics as part of the Shiny environment.  Below are some take-home messages and a few links that I thought were particularly interesting.

Hadley Wickham: First, the slides for Hadley's talk are on his GitHub page.  Hadley started his talk with a short demo showing how interactive graphics can let you learn about your data.  He has a clever dataset of 3D points that shows no apparent pattern when viewed along any two axes; however, when viewed from the correct angle, a very distinct spiral pattern emerges, providing a nice aha moment in his talk.  Next, Hadley discussed the history of interactive graphics in R, splitting the packages into 3 categories: 1) those written in lower-level languages, 2) those hacking existing R plotting functions, and 3) browser-based solutions.  Many of the packages that he showed are not very realistic for everyday use, or for investing a lot of time in, since they are no longer being actively developed.

Perhaps the most interesting part of Hadley's talk was his discussion and demonstration of ggvis.  This is Hadley's package that he envisions eventually replacing ggplot2.  His goal is to continue to work on it and perhaps sometime in 2016 have it to the point that anything you can do in ggplot2 you can do in ggvis.  The upside of this is that if you are already comfortable with ggplot2, you will have no trouble transitioning to ggvis.  Hadley is using the same type of grammar-based approach to building up more complex graphs from simpler elements.

Winston Chang gave a much shorter talk illustrating some of the new interactive developments in Shiny.  Despite my use of Shiny, I was completely unaware of these developments.  Winston has added some nice examples on the Shiny articles page (bottom of the right column).


Other interesting links from the evening:

HTML widgets for R: allows you to use a variety of JavaScript visualization libraries within R

Hadley’s new book Advanced R

  1. This semester I am teaching an experimental design course.  However, three of my learning goals for my students are more general than experimental design:

    • munge a big dataset into different formats
    • use R to visualize their data and explore possible relationships
    • create scripts in R that process data, perform statistical tests, and output final plots for manuscripts

    I wanted to justify the importance of learning good coding, scripting, and computational skills to my students, so I decided to mine the Evoldir mailing list for the last month or so.  I first downloaded all of the ads for postdoc positions that were posted from December 1, 2017 to January 15, 2018.  There were a total of 86 ads.  For each of these ads, I first determined whether any computational skills were listed as desired/required for applicants.  Next, I counted the occurrence of requests for several more specific skills, like knowing specific languages.


    What I found was that 74% of the ads listed some form of computational skill as desired or necessary in the applicant.  The bioinformatics category included all ads with vague statements like "competitive applicants will have experience running bioinformatic analyses of..."  The misc. languages category included versions of C, awk, and Java.

    I should note that in some ways my approach underestimates the importance of computational skills.  For instance, several of the ads that listed no computational skills are for departmentally funded independent postdocs.  These advertisements usually list no skill requirements despite the fact that many departments and selection committees will nonetheless rate these skills as important when reviewing an applicant's plans or previous contributions.  The asterisks do not indicate anything about the ads.  Rather, they mark the skills that my students should begin developing in my class this semester if they work hard.

    So the answer is yes: having computational skills will almost always increase your productivity, and this is becoming more and more widely appreciated.

    * On a sad note, more ads listed knowing Perl as a desired skill than Unix or modeling - ugh!





    A blog reader recently asked if the sliding window function could be extended to a matrix.  This seemed like something that might be useful, so I have provided this extension below.  Furthermore, once I looked at the code, I realized that I needed to add validation testing and helpful error messages.  The version below has been uploaded to the GitHub version of the EvobiR package and can be loaded using devtools as described here.
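
    For reference, installing the GitHub version looks something like this (using the coleoguy/evobir repository referenced elsewhere on this blog):

    install.packages("devtools")
    library(devtools)
    install_github("coleoguy/evobir")
    library(evobiR)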

    SlidingWindow <- function(FUN, data, window, step, strict){
      # Validation testing
      if(strict) {
        
        if(!is.numeric(data)) stop("Please supply numeric data")
    
        if(sum(is.vector(data), is.matrix(data)) == 0) {
          stop("You must supply data as a vector or matrix")
        }
        
        if(is.vector(data)){
          if(window > length(data)) stop("Window size is too large")
          if((step + window) > length(data)) stop("Window and step size should be small enough that multiple windows can be examined")
        }
        
        if(is.matrix(data)){
          if(window > nrow(data)) stop("Window size is too large for number of rows")
          if(window > ncol(data)) stop("Window size is too large for number of cols")
          if((step + window) > nrow(data)) stop("Window and step size should be small enough that multiple windows can be examined")
        }
      }
      
      # code for vectors
      if(is.vector(data)) {
        total <- length(data)
        spots <- seq(from = 1, to = (total - window), by = step)
        result <- vector(length = length(spots))
        for(i in 1:length(spots)){
          result[i] <- match.fun(FUN)(data[spots[i]:(spots[i] + window - 1)])
        }
      }
      
      # code for matrices
      if(is.matrix(data)){
        total.x <- ncol(data)
        spots.x <- seq(from = 1, to = (total.x - window), by = step)
        total.y <- nrow(data)
        spots.y <- seq(from = 1, to = (total.y - window), by = step)
        result <- matrix(NA, length(spots.y), length(spots.x))
        for(i in 1:length(spots.y)){
          for(j in 1:length(spots.x)){
            result[i, j] <- match.fun(FUN)(data[spots.y[i]:(spots.y[i] + window - 1),
                                                spots.x[j]:(spots.x[j] + window - 1)])
          }
        }
      }
      
      # complete failure message
      if(!exists("result")) stop("Hmmm unknown error... Sorry")
      
      # return the result to the user
      return(result)
    }
    

    Here is an example of this in action.  First, let us simulate a bit of interesting data:

    x <- seq(-pi, pi, len = 600)
    y <- seq(-pi, pi, len = 600)
    g <- expand.grid(x = x, y = y)
    g$z <- sin(sqrt(g$x^2 + g$y^2)) + rnorm(600^2, sd=2)
    ex.dat <- matrix(g$z,600,600)
    image(ex.dat)
    

    Plotted as a heat map, the data looks like this:



    That looks interesting, but we have far too much data, and there is quite a bit of very small-scale noise.  Let's try our window function and see if it makes things a bit clearer:



    reduced.ex.dat <- SlidingWindow(FUN="mean", data=ex.dat, window=10, step=5, strict=T)
    image(reduced.ex.dat)
    


    And now we can produce the same plot on our data after applying a sliding window:



    Hopefully, that is helpful!

















    The Publons blog recently posted an interesting graphic showing the connections among editors and peer reviewers for 28,146 reviews.  As I started looking at the figures (which are great), I wanted to dig into the data.  The people at Publons were nice enough to share the data and appended it to their post.

    The thing that I was particularly interested in was looking at differences among countries.  For instance, do editors in some countries choose reviewers from a broader geographic region than others?  Do editors choose reviewers from their hemisphere?

    To start digging into this, we first need to do a bit of data munging.  The data as provided by Publons is in an XML-type format, and since I am going to work in R, the first thing I wanted to do was convert it to a matrix.  The matrix should be set up such that there is a row for every country with at least one handling editor and a column for every country that has performed at least one review.  I should note that I have made a couple of changes from the original dataset: 1) the countries of the United Kingdom have been combined, 2) likewise, data associated with Hong Kong or Macau have been combined with that for China, and 3) I've dropped a couple of countries that had very little data and are not widely recognized as independent states.  Here is a link to the processed data: publons.csv.

    Although we could look at this just by country, I wanted a few more dimensions to explore.  Specifically, I was interested in the distance between pairs.  To get this, I constructed a 123x123 matrix that represents the distance between all countries.  The diagonal of this matrix is the average distance within a country.  Briefly, this value is estimated as the expected distance between two randomly drawn points in a circle with the same area as the country in question (128r/(45π), where r is the radius of that circle).  This assumption may bias us towards smaller within-country distances, since any non-circular country would have a larger expected distance.  On the other hand, reviewers and editors are concentrated in cities, so perhaps these will balance out a bit.  Here is a link to the country distance matrix: distances.csv.
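
    For the curious, that within-country value is easy to sketch in R (the area used here is just an illustration, not a value from the dataset):

    within.country.dist <- function(area){
      # treat the country as a circle with the same area
      r <- sqrt(area / pi)
      # expected distance between two random points in a circle of radius r
      128 * r / (45 * pi)
    }
    within.country.dist(9.8e6)  # ~1600 km for a country of about 9.8 million km^2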

    In addition to the distance matrix, I also wanted some measure of the number of articles being produced by scientists in each country.  I found this data on the natureindex.com website.  Specifically, I will be using the weighted fractional count (WFC).  This metric takes into account the number of authors and the availability of data from different STEM fields.  Although I would prefer to have the actual number of potential reviewers in each country, the WFC should be highly correlated with it and was easy to find.  Here is a link to the WFC data formatted to match the Publons data: wfc.csv.  Next, we might simply check whether we are right that, in general, editors get more reviews from countries that are close by and have lots of scientists producing articles.  To do this, let's just divide the distance between each editor-reviewer pair by the WFC of the reviewing country and then plot this against the number of reviews produced.




    Naively, we might imagine that the number of editors in a country should simply predict the number of reviews coming from that country.  Graphically, we can get at this by plotting the row sums vs. column sums from our Publons data matrix, as in the sketch below:
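
    Something along these lines should do it, assuming the processed publons.csv is a square matrix with matching row and column order; the +1 offset is my choice so that countries with zero editors survive the log scale:

    dat <- as.matrix(read.csv("publons.csv", row.names = 1))
    editors <- rowSums(dat)  # reviews handled by each country's editors
    reviews <- colSums(dat)  # reviews produced by each country's reviewers
    plot(editors + 1, reviews + 1, log = "xy",
         xlab = "reviews handled by editors (+1)",
         ylab = "reviews produced (+1)")
    abline(0, 1)  # on a log-log plot this traces the y = x diagonal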



    Countries that fall below the diagonal have more reviews handled by their editors, while those above have produced more reviews.  It is important to note that the data has been plotted on a log scale, so deviations from this line are arguably more striking as we move to the upper right quadrant of the plot.  One interesting thing that we can note right away is the cluster of countries on the far left of the graph.  These are countries that had zero handling editors but are still producing reviews.  In case you're curious, those countries are: UAE, Qatar, Georgia, Yemen, Tanzania, Slovakia, Nepal, Cameroon, Vietnam, Venezuela, Botswana, Zimbabwe, Philippines, Latvia, Bosnia & Herzegovina, Senegal, Papua New Guinea, Bahrain, Cambodia, Belarus, Zambia, Malawi, Kyrgyzstan, Armenia, and Sudan.  The highest of these is Nepal, which produced 48 reviews despite having no handling editors.

    So my initial interest in this dataset was really about those countries that are producing lots of science but fall relatively far above or below this line.  To look at this, the first thing we will do is slim our data to focus on those countries that had at least 100 handling editor counts.  Next, I've simply plotted the difference: reviews minus editors.





    So who are the countries forming the ends of this distribution?  At the negative end we find the U.S., followed by the U.K., Canada, and China.  Meanwhile, the positive end of this distribution is formed by Germany, Portugal, Italy, and Spain.  These residuals might be a bit misleading, though.  For instance, a 1% imbalance between reviewers and editors will be much larger for the UK than for New Zealand, simply due to the difference in the size of the research communities.  One way we might control for this is by standardizing these residuals relative to Nature's WFC.





    So this certainly causes changes in our ranking of countries, but interestingly the changes aren't dramatic.  For instance, the U.S., Canada, and the UK are still in the bottom 5.  Likewise, Portugal, Italy, and Spain are in the top 6 based on this ranking.  These two plots suggest to me that there might be some real differences in the balance between handling editors and review production across countries.

    If the differences are significant, then we should be able to recover something with an ANCOVA.  My approach here is to fit a linear model:

    reviews = distance + WFC

    The ANCOVA lets us ask whether the slope associated with this model differs among editor countries.  If editors from a country are strongly biased towards choosing reviewers close to them, that country will have a more negative slope.  On the other hand, if editors are biased towards picking reviewers from some distant country for some reason, it could even have a positive slope (the very first plot suggests this is unlikely).  Here are the results from running the ANCOVA in R:

                   Sum Sq       Df      F value        Pr(>F)  
    dist           41645        1       4.2957         0.03837
    WFC            783958       1       80.8652        2e-16
    dist:editor    723619       85      0.8781         0.77678  
    Residuals      15026658     1550  
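
    For reference, a model along these lines can be fit as sketched below; the column names (reviews, dist, WFC, editor) are my guesses rather than necessarily those used in the publons.R script linked at the end:

    library(car)  # Anova() for an ANCOVA-style table
    fit <- lm(reviews ~ dist + WFC + dist:editor, data = dat)
    # the dist:editor term tests whether the distance slope differs among editor countries
    Anova(fit)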

    So this reveals that the overall slope doesn't appear to differ significantly among editor countries.  The one thing that might remain is that some countries could have a large deviation from our model due to some odd historical contingency.  The one example I could think of to look at is Commonwealth states.  A good comparison here is the US and Canada, and what proportion of reviews editors in these countries get from the UK and Australia.  As the plot below shows, we don't see any indication that this historical contingency has created a bias in editors' choices of reviewers.  In fact, the two are strikingly similar.





    Obviously, everything that I did here could be affected by the uptake of the Publons peer review tracking site.  If it has a systematic bias with regard to native language or geographic region, it may have led me astray.  Regardless, thanks to the kind people at Publons for allowing me to play with their data.

    The code used to produce all of these plots and tests is here: publons.R.




    As part of a project with an undergrad this summer, I am working on a theoretical study that involves simulations of many different tree topologies that include varying amounts and types of gene flow among lineages.  For a number of reasons that I won't go into here, we will be generating many hundreds of backbone trees in the R package diversitree, to which we will then add hybridization events and simulate possible gene trees in ms.  A potential stumbling block in this process is the lack of a link between R and ms.  These two environments store tree information in very different formats.  In R, a tree is held as a table (for a 4-tip tree, a 6x2 table) where each row of the table is a branch and the nodes it connects are listed as its elements.  R then stores the lengths of these branches in a vector with the same length as the number of rows in the table.  In contrast, ms views a tree as a series of joining events that link together demes (populations) that are separate in the present.  It therefore lists a tree in a much simpler format, as joining events (coalescences) that occur further and further back in time.
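
    For example, here is what the R side looks like (a quick illustration using ape, not part of the original function):

     library(ape)  
     tree <- rcoal(4)    # a random 4-tip coalescent tree  
     tree$edge           # the 6x2 table: each row is a branch listing the two nodes it connects  
     tree$edge.length    # branch lengths, one per row of the edge table  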



    So for this project, I needed a function that could take a random tree from R and produce an ms string.  For my purposes, I want the ability to have ms report gene trees or segregating sites.  Here is the code I came up with.  It is also available from the GitHub version of my R package evobiR:


     getMS <- function(tree, samps, report, n.site=NULL){  
      # add a demes tracking column to the edge matrix  
      tree$edge <- cbind(tree$edge, rep(NA, nrow(tree$edge)))  
      output <- vector()  
      used.demes <- 0  
      check <- T  
      while(check == T){  
       # find tips  
       tips <- tree$edge[!tree$edge[, 2] %in% tree$edge[, 1], 2]  
       # determine which rows of $edge and elements of $edge.length these are  
       branches <- which(tree$edge[,2] %in% tips)  
       # find the shortest of the branches  
       shortb <- branches[which.min(tree$edge.length[branches])]  
       # find the parent node  
       par.node <- tree$edge[shortb, 1]  
       # find branches with this parent  
       hits <- tree$edge[, 1] == par.node  
       # remove one already stored  
       hits[shortb] <- F  
       # store the branch we will join  
       shortb[2] <- which(hits)  
       # determine if demes are named branch 1  
       if(is.na(tree$edge[shortb[1], 3])){  
        demes1 <- max(used.demes) + 1  
        used.demes <- c(used.demes, (max(used.demes) + 1))  
       }else{  
        demes1 <- tree$edge[shortb[1], 3]  
       }  
       # determine if demes are named branch 2  
       if(is.na(tree$edge[shortb[2], 3])){  
        demes2 <- max(used.demes) + 1  
        used.demes <- c(used.demes, (max(used.demes) + 1))  
       }else{  
        demes2 <- tree$edge[shortb[2], 3]  
       }  
       # store deme name in tree$edge  
       tree$edge[tree$edge[, 2] == par.node, 3] <- demes2  
       # record ej statement  
       output <- paste(output, "-ej", tree$edge.length[shortb[1]], demes1, demes2)  
       # check and see if we still have more work to do  
       if(length(tree$edge.length)>2){  
        #add edge length that will be lost when we remove branches  
        tree$edge.length[tree$edge[, 2] == par.node] <-   
         tree$edge.length[tree$edge[, 2] == par.node] +   
         tree$edge.length[shortb[1]]  
        #update tree$edge.length  
        tree$edge.length <- tree$edge.length[-shortb]  
        #update tree$edge  
        tree$edge <- tree$edge[-shortb, ]  
       }else{  
        # when we are down to 2 edges we are done  
        check <- F  
       }  
      }  
      z <- paste(rep("1", length(tree$tip)), collapse=" ")  
      outz <- paste("-", report, sep="")  
      output <- paste(length(tree$tip),   
              samps, "-I",  
              length(tree$tip),   
              z,   
              output,  
              outz,  
              n.site)  
      return(output)  
     }  
    

    And here is an example of it running:

     > library(ape)  
     > tree <- rcoal(4)  
     > plot(tree)  
     > getMS(tree, samps=2, report="T")  
     [1] "4 2 -I 4 1 1 1 1 -ej 0.0178240316454321 1 2 -ej 0.65862003858505 2 3 -ej 0.922894212466909 4 3 -T "  
     > getMS(tree, samps=20, report="s", n.site=50)  
     [1] "4 20 -I 4 1 1 1 1 -ej 0.0178240316454321 1 2 -ej 0.65862003858505 2 3 -ej 0.922894212466909 4 3 -s 50"  
    

    And now we can run these strings through ms:
     $ ./ms 4 2 -I 4 1 1 1 1 -ej 0.0178240316454321 1 2 -ej 0.65862003858505 2 3 -ej 0.922894212466909 4 3 -T  
     ./ms 4 2 -I 4 1 1 1 1 -ej 0.0178240316454321 1 2 -ej 0.65862003858505 2 3 -ej 0.922894212466909 4 3 -T   
     32028 54298 17741  
     //  
     (3:1.096,(4:0.950,(1:0.502,2:0.502):0.448):0.146);  
     //  
     (3:1.340,(4:1.221,(1:0.314,2:0.314):0.907):0.119);  
    

    I am working on a side project that involves downloading lots of sequence clusters from the phylota website.  This is a handy tool and makes it quick and easy to get lots of useful sequence data.  However, it also gives you all the sequences in GenBank for each species.  This means that you might have 500 sequences for a particular gene but only sequences from 10 or 20 different species.  The population sampling that is present in GenBank is useful for many projects, but if we are just after a species-level phylogeny, then what we would really prefer is to have just the best (longest... maybe) sequence for each species.  I've been using a snippet of Python code a friend gave me a long time ago to process these files and keep only the longest sequence for each taxon.  However, I figured it might be a good idea to have my own R function that does this.  Plus, I can add it to my R package, so why not.  So here is what I came up with.  It uses only a couple of functions from ape: read.dna and write.dna.  I have written it so you can either process a single file or a whole folder of files.  Here is the code by itself, but it is also available in the latest version of EvobiR on the GitHub repository.

     # Simple function takes a fasta file and returns  
     # a filtered fasta with only the longest sequence  
     # for each species.  
     FastaFilter <- function(f.in, folder = F, prefix="pruned"){  
      # f.in if we are processing just a single file this should be  
      # a text string holding the name or name and path to the file  
      # folder if T then all files in the folder will be   
      # processed by the function  
      # prefix as a default we save new copies with the prefix  
      # pruned but we can put whatever we want here  
      # first we have a function for a single file  
      single.file <- function(f.in, prefix){  
       # this reads in our file  
       seqs <- read.dna(f.in, as.character = T, format="fasta", as.matrix = F)  
        # get a list of the unique taxa  
       taxa.present <- unique(names(seqs))  
       # this will hold our final set of sequences  
       final.list <- list()  
       # here we find the longest sequence for each species and store it  
       for(i in 1:length(taxa.present)){  
        hits <- which(names(seqs) == taxa.present[[i]])  
        possible <- seqs[hits]  
        z <- which.max(unlist(lapply(possible,length)))  
        final.list[i] <- possible[z]  
        names(final.list)[i] <- names(possible)[z]  
       }  
       # save the result as a fasta file  
       write.dna(x=final.list, file=paste(prefix, ".", f.in, sep = ""),   
            format = "fasta")  
      }  
      # here we call our function for a single file  
      if(folder == F){  
       single.file(f.in, prefix)  
      }  
      # here we get a list of all files and call our  
      # function iteratively to process each file  
      if(folder == T){  
       files <- list.files()  
       for(i in 1:length(files)){  
        single.file(f.in = files[i], prefix = prefix)  
       }  
      }  
     }  
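
     A quick usage sketch (the file names here are hypothetical):  

      library(ape)  
      FastaFilter("cluster1.fasta")                # writes pruned.cluster1.fasta  
      FastaFilter(folder = TRUE, prefix = "best")  # processes every file in the working directory  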
    

    cheers







    For the past couple of years, there has been a small but growing group of people who want to track their reading for a year.  For most of us, the idea is that this will push us to read more broadly and more consistently.  I tried to do this last year after seeing all the #365paper tweets at the beginning of 2016.  My plan was to write a paragraph summarizing each paper and post these all under a single blog post.  Unfortunately, I just don't have the time to do that.  Writing a paragraph summary of each paper and then adding it to an existing blog post was just too much of a hassle.

    The reality is that for the vast majority (90%+) of papers that I read, I am reading to extract one small piece of information.  For example: did this research test X or Y, what protocol was used in the assay, is A greater than B?  In large part, I'm often uninterested in much of the discussion and conjecture of a paper.  Rather than summarizing all of this, I would rather move on and read the next paper.

    So this year I wanted an easier, quicker method to track my reading.  I think I've come up with a solution using my favorite environment, R.  First, I created a directory named "365.2017" in my GitHub website repository.  This directory has two files: a CSV file and an Rmd file (R Markdown).  Now when I read a paper, all I have to do is add it to the end of the CSV file and then periodically knit the Rmd file to update the web page.


    The CSV file is super simple; here is an extract of the first two lines, with the title and notes truncated:

    Count,Paper,Notes
    1, Meiotic Consequences of Genetic...., Crosses where strains


    The Rmd file is likewise quite simple and contains only 12 lines of code:

     ---  
     title: 'Reading tracker: Heath Blackmon'  
     date: "last updated: `r Sys.Date()`"  
     output: html_document  
     ---  
     This is just a quick little website to keep track of my reading:  
     ```{r echo=FALSE}  
     library(knitr)  
     dat = read.csv("reads.csv", as.is=T, header=F, fileEncoding = "latin1")  
     dat = cbind(1:nrow(dat),dat)  
     kable(dat, col.names=c("Count", "Paper", "Notes"))  
     ```  
    

    One important note here is the argument fileEncoding = "latin1".  This argument will allow the R code to handle all of the typical things you run into with citations (special characters, accents, non-breaking spaces, etc.).

    With this setup, I can add a paper and update the website in a matter of seconds, and I don't feel the pressure to write the well-formed summaries that I did last year.  Here is a link to the final web page.


    I've actually had to continue to tweak the argument that I am giving to fileEncoding.  From my reading, the right value may well be a bit specific to your operating system and where you are getting the text that is going into your CSV file.  In my case, I'm on a Mac and primarily using Google Scholar to grab citation information.





    I have always intended this blog to serve as a place to chat about fun or interesting things related to my interests in Coleoptera, evolution, coding, and academia.  That being said, I am deeply concerned by the discrimination and the rejection of science and logic that are being normalized by our government.  Several high-profile scientists have recently set up a mechanism to ensure that any scientists employed by the federal government have a way to disseminate scientific findings they are forbidden from sharing through normal channels.  I'm not 100% sure how much of a problem this is and how much of it is typical of a transition between parties.  That being said, better safe than sorry.  So below is the full text of their announcement:

    Governmental scientists employed at a subset of agencies have been forbidden from presenting their findings to the public. We have drafted the following response for distribution, and encourage other scientists to post it to their websites, when feasible.

    Graham Coop
    Professor of Evolution and Ecology
    UC Davis

    Michael B. Eisen
    Professor of Molecular and Cell Biology
    UC Berkeley

    Molly Przeworski
    Professor of Biological Sciences
    Columbia University

    -----

    In Defense of Science

    We are deeply concerned by the Trump administration’s move to gag scientists working at various governmental agencies. The US government employs scientists working on medicine, public health, agriculture, energy, space, clean water and air, weather, the climate and many other important areas. Their job is to produce data to inform decisions by policymakers, businesses and individuals. We are all best served by allowing these scientists to discuss their findings openly and without the intrusion of politics. Any attack on their ability to do so is an attack on our ability to make informed decisions as individuals, as communities and as a nation.

    If you are a government scientist who is blocked from discussing their work, we will share it on your behalf, publicly or with the appropriate recipients. You can email us at USScienceFacts@gmail.com.

    If you use this address please use PGP encryption using this PGP public key: http://pgp.mit.edu/pks/lookup?op=get&search=0x52C7139DE0A3D350


    I'll be giving a job talk in a month or so, and I have been meaning to go through my standard slides and make sure that I am using color-blind-friendly palettes.  For the last year or so, everything that I do in R I tend to do using the viridis package, which I know is supposed to be fairly good.  However, for items that I make in PowerPoint I often use the website ColorBrewer, and in the past I may not have paid that much attention to this issue.

    I have some pretty intricate animated slides, and my first question was whether it was even worth messing with some of these.  Perhaps they were already good enough.  To make this decision I needed to know two things: 1) What is the incidence rate of color blindness, by type, in the US?  2) What will these slides look like to a person who is color-blind?

    So first, the incidence rates.  These vary a lot by type and population, but a good rule of thumb is that around eight percent of men and 0.5 percent of women have some form of dichromatic vision.  The other types of color blindness are rare enough that I'm going to focus on dichromats for now.  With normal trichromatic vision, you receive information on the intensity of light from your rods as well as from three types of cones that are most responsive to short, medium, or long wavelength light.  Commonly, people refer to these as blue, green, and red cones, respectively, despite the fact that they don't really match up too well to those colors.  The information from these cones is what allows us to determine the wavelength (color) of the objects we see.  In contrast, dichromats have only two of these three types of cones, leading to greater difficulty in distinguishing various shades.  I would never consciously give a talk that I knew was confusing to 4-8% of my audience^1.  So the incidence rate is definitely high enough that I can't use it as an excuse.  My slides need to be color-blind friendly!

    Next question: how bad are my slides as-is?  It turns out there are some great resources on the web that let you get an idea of what your slides will look like to people with different types of color blindness.  I particularly like the site CoBliS.  With this site, you upload a PNG or JPG file and then choose the type of color blindness to simulate.  I do a lot of work with phylogenies, and I have often used the phytools contMap function to illustrate phenotypic evolution across a phylogeny.  The default for this function uses the rainbow palette to produce a range of colors for the trait being analyzed:


    In general, people really like these.  Personally, I love them; they are quite striking.  But what does this look like to someone who is color-blind?  I checked out several different types of color blindness that can be simulated with websites like CoBliS.  The two most common types in the population are protanopia and deuteranopia.  I found that in several of the images I checked, the specific colors that were challenging differed a bit, but in general a bad palette was bad and a good palette was good.  For that reason, the examples that I show below are all simulated as visualized by someone with protanopia color vision.


    Obviously, this simply isn't going to work.  In particular, we see that the important distinction between green and yellow is simply missing for viewers with one of the most common types of color blindness.  We need to find a palette that is at least marginally better.  Let's try it again with the viridis palette.

    This palette still works fairly well for a trichromatic person like myself.  Let's look at the version we would see with protanopia color blindness.





    This isn't great, but we can pick up a bit of the variation that we had lost toward the bottom of the tree.  If you are using contMap and want a version that uses viridis, it is easy enough to hack the function.  However, here is a link to mine: CBcontMap.
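
    Alternatively, phytools provides setMap for swapping the color ramp of an existing contMap object.  Here is a minimal sketch with simulated data (my example, not the code behind the figures above):

    library(phytools)
    library(viridis)
    tree <- pbtree(n = 50)                 # simulate a tree
    x <- fastBM(tree)                      # simulate a continuous trait on it
    obj <- contMap(tree, x, plot = FALSE)  # default rainbow mapping
    obj <- setMap(obj, colors = viridis(20))
    plot(obj)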

    cheers



    ^1 If we had gender balance, we would only need to worry about 4.25% of our audience.  Not an insignificant number, but sadly most academic audiences are skewed towards men, pushing that value closer to 8%.

    ^2 Here is a short article in Nature Methods that also looks at this issue: Wong, Bang. "Points of view: Color blindness." Nature Methods 8.6 (2011): 441.

    I am taking the opportunity to sit in on Ruth Shaw's quantitative genetics class this semester.  I hope to teach this class at some point in the future, and I thought it would be great to see how a pro like Ruth teaches it.  Listening to her lecture today inspired me to make a little app that I would like my future students to have.  We are using Falconer and Mackay, and in chapter 6 they attempt to explain the effect of the number of loci on the expected distribution of observed phenotypes.  The students were shown this graph:



    Students made the connection to why you get a good approximation of a normal distribution when you have a large number of loci.  However, I thought it would be pretty easy to make a web page with the R package Shiny that recapitulates this graph and lets students see how things change as a function of the different parameters involved.  The solution that I wrote up includes parameters for:

    a: genotypic value - this is just the phenotypic measure of the A2A2 homozygote; in contrast, the A1A1 homozygote will have a phenotypic measure of -a.

    d: degree of dominance - the measure of how the heterozygous individual deviates from the expected value of 0 (midway between +a and -a).

    mean p: mean allele frequency - unlike in the book's example, I want to allow the allele frequency at each locus to be a random draw from a truncated normal distribution with a minimum value of 0 and a maximum value of 1.

    var p: variance in allele frequency - just describes the distribution that we are drawing allele frequencies from.

    loci: the number of loci affecting the trait of interest.

    samp: sample size - rather than calculating the probabilities, I want to take samples from a population to reinforce the importance and impact of sample size on our ability to understand a system.

    Finally, I have a resample button that lets us go take another sample from the population.
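
    Here is a rough sketch of the sampling scheme these parameters imply (my reconstruction, not the app's actual code; the truncnorm package supplies the truncated normal draws):

    library(truncnorm)
    sim.phenotypes <- function(a, d, mean.p, var.p, loci, samp){
      # allele frequency for each locus, truncated to [0, 1]
      p <- rtruncnorm(loci, a = 0, b = 1, mean = mean.p, sd = sqrt(var.p))
      replicate(samp, {
        # number of A1 alleles (0, 1, or 2) at each locus for one individual
        n.A1 <- rbinom(loci, size = 2, prob = p)
        # genotypic values: A2A2 = a, heterozygote = d, A1A1 = -a
        sum(c(a, d, -a)[n.A1 + 1])
      })
    }
    hist(sim.phenotypes(a = 1, d = 0, mean.p = 0.5, var.p = 0.01, loci = 10, samp = 500))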


    Here is a link to the web app: https://popgensim.shinyapps.io/loci_num/


    I should probably get back to writing papers and applying for faculty positions!

    cheers

    First, if you need a quick refresher on what the ABBA/BABA test is, you can check out this post: ABBA/BABA

    Perhaps one of the simplest tests for introgression is the ABBA/BABA test.  I have a couple of versions of this test implemented in my R package EvobiR.  The most frequent email that I get concerning the ABBA/BABA tests is from people who want support for ambiguity codes.  Ambiguity codes have two very different sources.  The first is simply sequencing error, where we accidentally read some copies of a DNA sequence incorrectly.  The second, and with current methods hopefully the more common source, is a result of the way we sequence and assemble our data.  Because we often use short sequences that are assembled by their overlapping portions, we often don't have haplotype data (basically, knowing which SNPs belong with each other).  Instead, we get a single sequence for each individual, and it contains information from both the paternal and maternal copies of the sequence in question.

    As an example, let's imagine sequencing this locus, which contains three SNPs.

    TRUTH
    Paternal copy  CGTCAAGATACATGCCGCTCCTGTCATAACTGCG
    Maternal copy  CGTCATGATACATGCCGCTGCTGTCATCACTGCG

    SEQUENCE DATA
    CGTCATGA
        AAGATACAT
            TACATGCCGCTCCT
                      TACTGTCAT
                    GCTGCTGTCATC
                           TCATAACTGCG

    OUTPUT SEQUENCE
    CGTCAWGATACATGCCGCTVCTGTCATMACTGCG

    So how should we handle these ambiguities when we perform the ABBA/BABA test?  First, I would argue that when sequencing individuals, only ambiguity codes that represent 2 nucleotides are valid.  Codes that represent 3 or 4 nucleotides (like the V in the output sequence above, originally highlighted in red) must be either sequencing error or represent different loci.  So the first thing we should do is filter out those 3- and 4-nucleotide loci.  For those loci that are biallelic, we have two choices: 1) we could randomly resolve them to either state, or 2) we could count biallelic sites as weighted possibilities.  Option 2 isn't bad if we have a single sequence with an ambiguity: ARAG effectively means we would count ½ a locus as AAAG and ½ a locus as AGAG.  However, this gets complicated quickly; for instance, with ARRR we have 8 possibilities: AAAA, AGAA, AAGA, AAAG, AGGA, AGAG, AAGG, AGGG.  This would be quite a pain to code.  Additionally, if we randomly resolve sites, then at the limit (many sites, or the average of many replicate analyses) we would converge on the same scores.  For this reason, at least for now, I'm just going to implement the drop-everything-ambiguous and random-resolution solutions.

    So the extensions to the code include:

    1) first, filtering out any loci that contain these codes:
    B, which is the ambiguity code for C or G or T
    D, which is the ambiguity code for A or G or T
    H, which is the ambiguity code for A or C or T
    V, which is the ambiguity code for A or C or G
    N, which is the ambiguity code for any base


    2) randomly resolving any occurrences of these codes:
    R to A or G
    Y to C or T
    S to G or C
    W to A or T
    K to G or T
    M to A or C

    So I now have a flag, ambig, that the user can set to "D" to drop all sites with any ambiguity, "R" to resolve all biallelic sites to one of the possible nucleotides, or "I" to ignore all of this and work the same as in the past (see the note at the bottom for the possible use of "I").
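
    The random resolution step amounts to something like the following (a sketch of the idea, not necessarily the exact internals of CalcD):

    resolutions <- c(R = "AG", Y = "CT", S = "GC", W = "AT", K = "GT", M = "AC")
    resolve.base <- function(base){
      if(base %in% names(resolutions)){
        # pick one of the two possible nucleotides at random
        sample(strsplit(resolutions[[base]], "")[[1]], 1)
      } else {
        base
      }
    }
    sapply(c("A", "R", "W", "C"), resolve.base)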

    To make sure this is working correctly, I'm using a very contrived example.  When we drop all ambiguities there should be no signal, but when we randomly resolve there should be at least some signal:

    >Seq1
    CCCCCCCCCYYYYSSSSCCCCCCCCCCCCCCCCMMMMMAAAAACCCCC
    >Seq2
    SSSSSSKKKGGGGGGGGRRRRRRRRRRWWWWWWTTTTTCCCCCAAAAA
    >Seq3
    GGGGGGGGGGGGGGGGGGGGGGGGGGGTTTTTTTTTTTAAAAAAAAAA
    >Seq4
    CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

     Below is an example of using both ABBA/BABA tests with this new flag:  
     CalcD(alignment = "example.fasta", sig.test = "N", ambig="D")  
     Sites in alignment = 10  
     Number of sites with ABBA pattern = 5  
     Number of sites with BABA pattern = 5  
     D raw statistic = 0   
     [1] 0  
     CalcD(alignment = "example.fasta", sig.test = "N", ambig="R")  
     Sites in alignment = 48  
     Number of sites with ABBA pattern = 24  
     Number of sites with BABA pattern = 5  
     D raw statistic = 0.6551724   
     [1] 0.6551724  
     CalcD(alignment = "example.fasta", sig.test = "N", ambig="I")  
     Sites in alignment = 48  
     Number of sites with ABBA pattern = 5  
     Number of sites with BABA pattern = 5  
     D raw statistic = 0   
     [1] 0  
    


    That looks like what we expect.  However, it is important to remember that with the ambig flag set to "R", this analysis becomes stochastic.  For instance, if we run this code to repeat the analysis 500 times, we can see what the distribution of scores looks like:


    D = vector()
    for(i in 1:500){
      D[i] = CalcD(alignment = "example.fasta", sig.test = "N", ambig="R")
    }
    hist(D, xlim=c(.4,.8),xlab="Value of D statistic", main="Variation across 500 runs")



    I mentioned above that the "I" flag for ambig allows you to use these functions without any changes.  This is really left in for one user who is trying to apply the ABBA/BABA test to microsatellite data.  That user never has A, C, G, or T, and the new code would have thrown out all of their data.  For most applications, though, I think users should use the drop option unless they are going to carefully explore how these ambiguities impact their results.  For this reason, the default for both functions is to drop ambiguities.

    To install the latest version of evobiR that has all of these changes just run this code in R:

    install.packages("devtools")
    library(devtools)
    install_github('coleoguy/evobir', build_vignettes = T)
    library(evobiR)


    Let me know if you have any problems with this.

    cheers





