A Simple Bayesian MCMC Analysis in MrBayes
In this example, you will infer a phylogeny using Bayesian methods. We will use the program MrBayes (Huelsenbeck and Ronquist 2001), which is one of the most widely used programs for Bayesian tree inference. In addition, you will gain experience with Tracer (Rambaut and Drummond 2007)- a program for evaluating output from an MCMC search - and FigTree - a program for viewing phylogenetic trees.
Start by downloading MrBayes and the dataset:
. Put both the MrBayes executable and data set into the same directory in order to run the analysis (alternatively, enter the full or relative path to the 'anthrotree26.txt' dataset).
After opening MrBayes, bring the data from the file into the program by typing:
Next, an outgroup must be specified, as this determines the rooting of the tree. To set Lemur catta as the outgroup, type:
Next, we need to set the evolutionary model for our DNA data. Based on the previous section of this website, the general time reversible model with a proportion of invariable sites and gamma-distributed rate variation across sites (GTR+I+G) was identified as the best suited model in this dataset. Select this model by typing:
lset nst=6 rates=invgamma
You also have to set up the parameters for the Markov chain. We will keep things simple by modifying only a few parameters. Specifically, we will set the number of generations the Markov chain should run (300000), how often a sample should be drawn from the distribution (every 500 generations), and how frequently MrBayes should output a sample on the screen (every 500 generations). The command for making all these changes is as follows.
mcmcp ngen=300000 samplefreq=500 printfreq=500
This will ensure that you get about 600 samples from the posterior probability distribution (601, to be precise, as the first generation is always sampled). By default, two runs are performed simultaneously, thus doubling the number of samples to 1202. You can now run the actual MCMC analysis by typing:
You will see some details on the analysis parameters, including many parameters that we kept at default values, and then the program will start sampling trees and parameter values. Be prepared to wait some time for the analysis to run. The remaining time to completion of the analysis is displayed in the last column (my computer estimated a run time of 39 minutes for this dataset).
As the analysis runs, pay attention to the "average deviation of split frequencies", which should approach 0 over time. This reflects that the tree samples from the two simultaneous runs become increasingly similar, as both converge onto the stationary distribution. Sufficiently low values for this measure are approximately 0.01. You should obtain values below 0.01 at the end of your simulation (e.g., my computer reached 0.0084). If not, run the analysis for longer (the program will ask if you would like to continue after the 300,000th generation).
Determining Burn-In: Tracer
After the analysis finished, you have to summarize the samples from the posterior probability distribution. This involves determining the so-called "burn-in," which is the phase of an MCMC search when stationarity has not yet been reached. This pre-stationary portion of the distribution is usually discarded from the analysis. The concept of the burn-in is controversial in the literature; see here for a fruitful discussion. A burn-in is typically indicated by a rapid increase in likelihood at the beginning of the MCMC tree search, as seen in the following output.
There are several ways of determining the burn-in; an elegant and fast way is to use specialized programs such as Tracer. After downloading Tracer, start it, and open the files that contain the likelihoods of the samples. They have the ending '.p' and are located in the same folder where the NEXUS file for this example is. You should have two different .p files in this directory, one for each run. In Tracer, import both by clicking on the “+” button at the left beneath the menu. You have to do that again after the importing of the first file is completed. The table at the top ("Trace Files") should then contain three entries: two for the names of the files you just imported, and one entry that is named “Combined”. Click on “Combined”, as you want to include the information of both files (which represents the two runs) in the determination of the burn-in. Then, click on “Trace” at the right of the window to obtain a plot that shows the likelihood of the samples over time. This is the plot that we use to determine the burn-in, and ideally it should look like white noise as explained earlier.
If you notice an increase in the likelihood over time, you can increase the burn-in by modifying the value in the column “Burn-In” in the “Trace Files” table in the upper left (click on the number and type the new value). For illustration purposes, change the burn-in to 0. You will now see a drastic change of the curve: it increases very fast at the beginning, and reaches a stationary plateau after some generations. This indicates that the burn-in value must be set higher. Repeat that procedure until you find a value that produces white noise (a flat slope with noise around it), but don’t exclude too many generations, as this decreases the number of trees that are used in the following analyses. A reasonable burn-in in this dataset would be to exclude the first 100,000 generations.
After you determined the burn-in, return to MrBayes and type:
This will summarize the parameter values of the MCMC search from after the burn-in of 100,000 generations. The number 200 refers to the number of sampled generations that are discarded, not the actual generation number. Recall that we sampled every 500 generations. Thus, 200 samples X 500 generations per sample = 100,000 generations, which we determined as burn-in sing Tracer. A detailed description of the output of sump is beyond the scope of this tutorial, so we refer you to the MrBayes manual for more details.
If you excluded the samples where stationarity has been reached, then the plot that is generated by MrBayes should look like "white noise”; no tendency of increase or decrease over time should be detectable. In Figure 1, for example, a screenshot from the program Tracer is shown. The first 100.000 generations have been excluded, and samples from generations 100,500 – 300,000 generations show no constant increase or decrease over time. The corresponding MrBayes plot shows “white noise”.
Summarizing Trees and Viewing the Consensus Tree in FigTree
The last step in MrBayes is to summarize the trees. We can do this by typing
In your folder, you will now find a new file with the ending .con that contains two consensus trees: a clade credibility tree and a phylogram. The former contains the probability of each partition or clade in the tree, the latter the branch lengths measured in expected substitutions per site. It is important to note that both trees are consensus trees from all samples (excluding the burn-in, of course), and a 50% majority rule to create those trees is used by default. This means that a polytomy is introduced if a particular split occurs in less than 50% of all trees.
The consensus file can be read by tree drawing programs, such as FigTree. To do so, download FigTree and launch the program. Use the menus to open the file ending in .con, which is located in the folder with the MrBayes program and data. After opening the file, you will be asked how to label nodes and branches, as FigTree recognizes that the nodes in the tree file are named (the posterior probabilities, in fact). Enter an arbitrary name, or just accept the default labeling of "label" and click OK.
You should now see a graphical representation of one of the two phylogenetic trees from the consensus file. FigTree provides plenty of options to modify the appearance of the tree; explore it for yourself, it is straight forward. The tree is yet not correctly rooted; as Lemur catta should be the outgroup. In the “Selection Mode” field at the top, click on “Clade” and select the branch that leads to Lemur catta. The branch should now be visually highlighted. Go to “Tree ->Root on Branch” to root the tree.
You can display posterior probabilites for each node by clicking on “Node Labels” in the middle block of options at the left. In the section “Display,” select the label that represents the posterior values (FigTree asked for the name of the labeling before). You should now see node labels, i.e. posterior probabilities. They are important, because they indicate the support for each particular clade. Ideally, all values are close to 1. More often, howerver, there will be some uncertainty. Posterior probabilities can be much lower than 1, and if they are below 0.5, the node will be represented as a polytomy. In your analysis, it is most probable (although not guaranteed, as Bayesian tree inference is a somewhat stochastic process) that the lineage with the species Aotus, Saguinus, Callicebus, Ateles, Cebus and Saimiri contains a polytomy, indicating that the program was unable resolve this lineage. The rest of the tree should be relatively well resolved with high posterior probability values.
It is worth nothing that an alternative and more elegant way of starting an analysis is to add an “MrBayes” block at the end of your nexus file (i.e., the text file that you downloaded above). The MrBayes block should contain all necessary commands for running the analysis, and would here look as follows:
BEGIN MRBAYES; lset applyto=(all) nst=6 rates=invgamma; outgroup Lemur_catta; mcmcp ngen=300000 samplefreq=500 printfreq=500; mcmc; END;
If the data set contains the MrBayes block above, then typing "execute anthrotree26.txt" on the MrBayes command line will start the analysis. When MrBayes reads in a file from an "execute" command, it will perform all the commands that are coded in the MrBayes block.
Congratulations, you just finished your first Bayesian tree inference! Be aware that this was a very simplified analysis, and the background for Bayesian tree inference has been described only briefly. As you become more proficient with this methodology, you will notice that things are often more complicated than presented here. This is especially true with bigger and more complicated datasets.
Huelsenbeck, J. P., and F. Ronquist. 2001. MRBAYES: Bayesian inference of phylogeny. Bioinformatics 17:754-755.
Rambaut, A., and A.J. Drummond. 2007. Tracer v1.4, Available from http://beast.bio.ed.ac.uk/Tracer