


The basic logic of parsimony is to find the tree topology that minimizes the number of evolutionary changes that must be inferred for the character matrix. Ideally, we would obtain this solution by drawing every possible unrooted tree that relates the taxa. We would then establish the minimum number of evolutionary changes (parsimony steps) that must have occurred for every character on every tree. The tree topology that minimizes the sum of parsimony steps across all characters is the most parsimonious tree. Sometimes, multiple tree topologies are equally most parsimonious. The summed number of steps required by a tree is called the tree length.
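To make the counting of steps concrete, here is a minimal sketch in base R of Fitch's method for unordered characters, applied to the three possible unrooted topologies for four taxa. The taxa, the character matrix 'chars', and the function names 'fitch_steps' and 'tree_length' are invented for illustration; they are not part of 'phangorn' or 'ape'. Each unrooted tree is written as a nested list rooted on an arbitrary branch, which does not change the count.

```r
# Hypothetical character matrix: rows are taxa, columns are characters.
chars <- rbind(
  A = c("A", "G", "T"),
  B = c("A", "G", "C"),
  C = c("G", "T", "T"),
  D = c("G", "T", "C")
)

# Fitch's method for one character. A tree is a nested list of two
# subtrees; a leaf is a taxon name. Returns the state set at the node
# and the number of steps required below it.
fitch_steps <- function(node, states) {
  if (is.character(node)) {
    return(list(set = states[[node]], steps = 0))
  }
  left   <- fitch_steps(node[[1]], states)
  right  <- fitch_steps(node[[2]], states)
  shared <- intersect(left$set, right$set)
  if (length(shared) > 0) {
    # Descendants can agree on a state: no extra step at this node.
    list(set = shared, steps = left$steps + right$steps)
  } else {
    # Descendants disagree: one additional step is required.
    list(set = union(left$set, right$set),
         steps = left$steps + right$steps + 1)
  }
}

# Tree length: the sum of steps across all characters.
tree_length <- function(tree, chars) {
  sum(vapply(seq_len(ncol(chars)),
             function(j) fitch_steps(tree, chars[, j])$steps,
             numeric(1)))
}

# The three possible unrooted topologies for four taxa.
trees <- list(
  "((A,B),(C,D))" = list(list("A", "B"), list("C", "D")),
  "((A,C),(B,D))" = list(list("A", "C"), list("B", "D")),
  "((A,D),(B,C))" = list(list("A", "D"), list("B", "C"))
)

tree_lengths <- sapply(trees, tree_length, chars = chars)
print(tree_lengths)
```

With this made-up matrix, the topology that groups A with B and C with D has the shortest tree length, so it is the most parsimonious of the three.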


Unrooted trees do not tell us about recency of ancestry; that is, they have no time directionality to them. This makes them less useful than rooted trees for testing comparative hypotheses. We typically root trees with an outgroup, which is a taxon or set of taxa that diverged prior to the rest of the group of interest. The group of species whose relationships we are really interested in is called the ingroup or study group. All of the ingroup species should share a more recent common ancestor with each other than with the outgroup. For instance, if we have five taxa labeled A, B, C, D, and E, and A is identified as the outgroup, we are asserting, prior to any phylogenetic analysis, that B, C, D, and E are all more closely related to each other than any is to A. Hence, the members of the ingroup will on average share more synapomorphies with one another than with A.
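The five-taxon example can be sketched with the 'ape' functions 'read.tree', 'root', 'is.rooted', and 'is.monophyletic'. The Newick string below is a hypothetical unrooted topology chosen only for illustration, and A is assumed to be the outgroup.

```r
library(ape)

# Hypothetical unrooted tree for taxa A-E (three branches meet at the base).
unrooted <- read.tree(text = "(A,B,(C,(D,E)));")
print(is.rooted(unrooted))

# Root on the assumed outgroup A; resolve.root = TRUE gives a bifurcating root.
rooted <- root(unrooted, outgroup = "A", resolve.root = TRUE)
print(is.rooted(rooted))

# Check whether the ingroup now forms a single clade.
ingroup_is_clade <- is.monophyletic(rooted, tips = c("B", "C", "D", "E"))
print(ingroup_is_clade)
```

Rooting on A places the root on the branch leading to A, so the four ingroup taxa descend from a common ancestor that A does not share.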


We can almost always root trees after inferring the most parsimonious trees in their unrooted forms. This is because, under most commonly used parsimony algorithms, the rooting itself does not affect our inference about the number of changes required. The number of required changes is how we identify the best tree topology. The rooting tells us about the order of branching events in evolution and thereby informs our understanding of the order of changes in character states.
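As a quick check of this point, the sketch below scores the same topology before and after rooting, using 'phyDat' and 'parsimony' from 'phangorn' and 'root' from 'ape'. The alignment 'aln' and the tree are invented toy data, not the dataset used in the example that follows.

```r
library(ape)
library(phangorn)

# Hypothetical toy alignment: five taxa (rows), six sites (columns).
aln <- rbind(
  A = c("a", "c", "g", "t", "a", "c"),
  B = c("a", "c", "g", "t", "g", "c"),
  C = c("g", "c", "a", "t", "g", "t"),
  D = c("g", "t", "a", "c", "g", "t"),
  E = c("g", "t", "a", "c", "a", "t")
)
toy <- phyDat(aln, type = "DNA")

# The same topology, unrooted and then rooted on the assumed outgroup A.
unrooted <- read.tree(text = "(A,B,(C,(D,E)));")
rooted   <- root(unrooted, outgroup = "A", resolve.root = TRUE)

score_unrooted <- parsimony(unrooted, toy)
score_rooted   <- parsimony(rooted, toy)
print(c(unrooted = score_unrooted, rooted = score_rooted))
```

The two scores should be identical, because rooting only chooses a branch on which to place the root and leaves the topology unchanged.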


You can imagine how time consuming phylogenetic inference would be with many taxa and characters. Obviously, it would help to automate this process, and there are many programs available to achieve this. Below, we provide an example in R using functions from two R packages: 'phangorn' and 'ape'.
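To put a number on the problem: for n labeled taxa there are 3 x 5 x ... x (2n - 5) distinct unrooted bifurcating trees. The short base R sketch below computes this count; the function name 'n_unrooted_trees' is our own and does not come from either package.

```r
# Number of unrooted, fully bifurcating trees for n labeled taxa:
# the product of the odd numbers from 3 to (2n - 5), or 1 when n = 3.
n_unrooted_trees <- function(n) {
  stopifnot(n >= 3)
  if (n == 3) return(1)
  prod(seq(3, 2 * n - 5, by = 2))
}

taxa   <- c(4, 5, 6, 10, 20)
counts <- sapply(taxa, n_unrooted_trees)
print(data.frame(taxa = taxa, unrooted_trees = counts))
```

Four taxa give 3 trees and five give 15, but ten taxa already give more than two million, which is why exhaustive evaluation quickly becomes impractical.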


Parsimony Analysis in R


Start by opening R (or downloading it, if you haven't done so already). Load the packages 'phangorn' and 'ape' into R (see Section 1.1.3 for instructions on installing packages):

library(ape)
library(phangorn)

To demonstrate the process, we will use a subset of the same molecular dataset provided above (Maddison et al. 1997). Download the file ('chars2.txt', the name used in the command below) and place it in a folder of your choice.


Now, read the file into the variable 'primates' (first set R's working directory to that folder with 'setwd()', or give the full path to the file):

primates = read.phyDat("chars2.txt", format="phylip", type="DNA")

The next step is to provide the package with a starter tree to begin the optimization process. To do this, you might use a distance-based approach. First, convert the data to the 'DNAbin' format with 'as.DNAbin' and create a distance matrix using the 'ape' function 'dist.dna':

dm = dist.dna(as.DNAbin(primates))

Next, create two trees, one using UPGMA and another using Neighbor Joining, both of which are available as functions in 'phangorn':

treeUPGMA = upgma(dm)                    

treeNJ = NJ(dm)                         


We can now view our trees. In the following block of code, the last two lines (beginning with 'plot') draw the trees, but on their own the second plot would replace the first in the plot window. To view both trees at once, one above the other, the first two lines set up a 2-frame plot window to hold both of our plots:

layout(matrix(c(1,2)), heights=c(1,1.25)) # two stacked frames; the lower one is slightly taller
par(mar = c(.1,.1,1.5,.1))  # narrow margins, leaving room at the top for the titles
plot(treeUPGMA, main="UPGMA", cex = 0.8)  # rooted tree on top; cex adjusts text size
plot(treeNJ, "unrooted", main="NJ", cex = 0.5) # unrooted tree on bottom


We can now compute the parsimony score (i.e., the number of steps) for each of the two trees:

parsimony(treeUPGMA, primates)

parsimony(treeNJ, primates)


Of these two trees, the more parsimonious is the one with the lower score. In this case, it is the neighbor joining tree, with a score of 302.


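These are only two candidate topologies, however; what we really want is the most parsimonious tree overall. For this, we can use the 'phangorn' function 'optim.parsimony()', which rearranges a starting tree in search of shorter ones. Because this is a heuristic search, the result is not guaranteed to be the shortest tree possible. We begin with our rooted tree: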

optParsUPGMA = optim.parsimony(treeUPGMA, primates)


You will see that it found an even shorter tree (parsimony score of 300). Now, do the same with your unrooted neighbor joining tree:

optParsNJ = optim.parsimony(treeNJ, primates)
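If 'chars2.txt' is not to hand, the same sequence of steps can be tried on the 'Laurasiatherian' alignment that ships with 'phangorn'. This is a sketch under that substitution: the dataset, the use of 'dist.hamming' for the starting distances, and the variable names are our own choices, and the scores will differ from those reported for the primate data.

```r
library(ape)
library(phangorn)

data(Laurasiatherian)                      # example alignment bundled with phangorn

dm_demo  <- dist.hamming(Laurasiatherian)  # simple distances for a starter tree
start_nj <- NJ(dm_demo)                    # neighbor joining starter tree

# Parsimony score before and after the heuristic search (trace = 0 silences progress output).
score_before <- parsimony(start_nj, Laurasiatherian)
opt_nj       <- optim.parsimony(start_nj, Laurasiatherian, trace = 0)
score_after  <- parsimony(opt_nj, Laurasiatherian)

print(c(before = score_before, after = score_after))
```

The search only accepts rearrangements that shorten the tree, so the score after optimization should be no larger than the score of the starter tree.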


Let's take a look at our new trees ('layout' and 'par' keep their settings until you close the plot window, quit R, or reset them manually):

plot(optParsUPGMA, main="UPGMA", cex = 0.8) # rooted tree on top

plot(optParsNJ, "unrooted", main="NJ", cex = 0.5) # unrooted tree on bottom


Leaving aside the quality of our tree with this small dataset (what's the deal with LemurPongo and Tarsius?), we can export our tree in Newick format using the 'ape' function 'write.tree':

write.tree(optParsUPGMA, file="optParsUPGMA.tre")
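To confirm that an exported file can be read back, the sketch below writes a small tree to a temporary file and reloads it with the 'ape' function 'read.tree'. The tree and the temporary file are hypothetical stand-ins for 'optParsUPGMA' and its output file.

```r
library(ape)

# Hypothetical small tree standing in for the optimized tree.
tree_out <- read.tree(text = "((A,B),(C,(D,E)));")

# Write it in Newick format to a temporary file, then inspect the file.
tmp <- tempfile(fileext = ".tre")
write.tree(tree_out, file = tmp)
print(readLines(tmp))

# Read the file back and compare with the original tree.
tree_in   <- read.tree(tmp)
same_tree <- all.equal(tree_out, tree_in)
print(same_tree)
```

A Newick file is plain text, so it can also be opened in a text editor or imported into other phylogenetics programs.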


There is much more to learn about building trees than presented here. Hopefully, however, this will give you a taste of what can be done in R for building and viewing trees.



Hodgson, J. A., K. N. Sterner, L. J. Matthews, A. S. Burrell, R. L. Raaum, C. B. Stewart, and T. R. Disotell. 2009. Successive radiations, not stasis, in the South American primate fauna. Proceedings of the National Academy of Sciences, USA 106:5534-5539.
Maddison, D. R., D. L. Swofford, and W. P. Maddison. 1997. Nexus: An extensible file format for systematic information. Systematic Biology 46:590-621.
Paradis, E., J. Claude, and K. Strimmer. 2004. APE: Analyses of phylogenetics and evolution in R language. Bioinformatics 20:289-290.
R Development Core Team. 2010. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. URL http://www.R-project.org.
Schliep, K. 2010. phangorn: Phylogenetic analysis in R. R package version 1.2-0.



Contributed by Charlie Nunn and Luke Matthews
