Variation in the molecular substitution rate between lineages is a known factor when analyzing sequence data. This variation may be caused by the species intrinsic characteristics. The various ways species’ characteristics have been shown to influence the rate of molecular evolution include generation time, metabolic rate, reproductive mode, mating preference, sex of carrier, habitat, or any other life history trait, or morphological feature.
traitRate is a program for detecting trait-dependent shifts in the rate of
molecular evolution. Currently, traitRate allows only characters with two
possible states (binary) to be analyzed. traitRate implements a likelihood
model that combines models of sequence and character (morphological) evolution.
The program outputs the relative rate of sequence evolution being at character
state 1 relative to character state 0, along with the log-likelihood of the
model and the log-likelihood of the null model that assumes that the state of
the binary character is not associated with the rate of evolution.
Methodology
To run the program type traitRate followed by the path to the control
file. The control file specifies the paths to the input/output files and the
various options to be used. The obligatory inputs to traitRate are an
ultrametric tree file in Newick format, a sequence alignment file in a number
of common formats, and a character data file in a FASTA format. The user is
responsible for the correspondence between the different files so that all
extant taxa in the tree have both sequence and character data, and that all
taxa with sequence and character data appear in the tree.
Significance tests
The first phase of the analysis is the estimation of the parameters of the
underlying model. This is done with standard maximum likelihood techniques. The
program outputs the maximum-likelihood parameter estimation and the
log-likelihood of the null and alternative models. The likelihood ratio test,
approximated with a chi-square distribution with one degree of freedom, can be
used to compare between these two nested models. A log-likelihood difference of
1.92 between the two models is significant at the 0.05 level. Importantly,
significant result using the likelihood ratio test between these two models is
only the first condition and does not necessarily imply that the analyzed trait
influences the rate of sequence evolution. A "parametric trait bootstrap"
procedure (see below) should follow to verify whether the observed rate
variation is caused by the analyzed trait compared to other, uncorrelated,
traits that evolve in a similar manner. The last step is done using
simulations, which may be computationally intensive. See below
for directions regarding this procedure.
IMPORTANT NOTES
1. Input tree
The input species tree should be a rooted ultrametric tree in which the
distances from the root to the tips are all equal. The program accepts either a
single tree or multiple plausible trees as representing the phylogeny of the
analyzed species.
To use a single tree, the option _treeFile in the control file should be used.
For multiple trees, the option _treesFile should be used. In this case, a
multi-newick file should be used. See here for an
example.
Note that computation time is much longer using multiple trees.
2. Number of stochastic mappings
The program relies on a sampling technique, called stochastic mapping, to
approximate the likelihood of the combined sequence and character data. The
approximation is more accurate when more mappings are used. However, the
computation time increases linearly with the number of mappings. Although,
practically, we find this parameter to have little influence on the results, we
recommend trying several different numbers of mappings. The number of mappings
is specified using the _stochasicMappingIterations option in the control file.
3. Running the parametric trait bootstrap procedure
Step one - simulate random character data.
This procedure generates random character data by simulating character evolution along the input
phylogeny using the parameters of the character model that were inferred for
the original dataset.
In order to run this step, the parameters of the character model
(_charModelParam1 and _charModelParam2) should be set to be equal to their
inferred value. In addition set the following parameters in the control file:
_mainType simulateFalsePositive
_outDir <simulate directory>
_startSimulationsIter 0
_endSimulationsIter 199
The _endSimulationsIter parameter specifies the number of iteration to perform.
As always, the more the better but 200 (or 100) should be enough. In addition,
the same character data file and tree file as in the original run should be
provided.
Step two - run traitRate for each simulated data. This step should,
preferentially run in parallel for the different simulated character datasets.
The input for each run should be the original tree file, the original sequence
file, and the simulated character file. For example, for the 10th simulated
data the following parameters should be specified in the control file:
_mainType Optimize_Model
_outDir <simulate directory>/sim_10/
_characterFile <simulate directory>/sim_10/simRandomChars.fasta
_seqFile <path to original sequence alignment file>
_treeFile <path to original species tree file>
Step three - assign significance level. Collect the difference in the log-likelihood score between the null and alternative models from all simulated runs. The log-likelihood difference observed for the original data should then be compared to the simulated distribution to produce a p-value. A perl script is provided here for this step.
Parameter | Description | Default |
_mainType | The type of analysis to run.
Possible options: Optimize_Model = Optimize the model parameters and infer relative rate Run_Fix_Param = Run analysis with the specified parameters values simulateFalsePositive = simulate random character data based on a given set of model parameters. See Running the parametric trait bootstrap procedure | Optimize_Model |
_treeFile or _treesFile | A path to the species phylogeny specifying one tree (_treeFile) or multiple trees (_treesFile) in Newick format | Either _treeFile or _treesFile is Obligatory |
_characterFile | A path to a FASTA-format file with the character state assignments | Obligatory |
_seqFile |
A path to a sequence alignment file.
Supported formats: Phylip, Clustal, Fasta, Mase, Molphy. | Obligatory |
_outDir | A path to the location of the output directory. | RESULTS |
_outFile | Out file name | log.txt |
_logFile | Log file name. Will be output in the _outDir directory | log.txt |
_scaledTreeFile | Name of the scaled tree file. Will be output in the _outDir directory | scaled.tree |
_sequenceType | nc (nucleotide sequence) prot (amino-acid sequence) | nc |
_seqModelType |
Nucleotide models: JC, K2, HKY, GTR Protein models: JTT, MT_REV, CP_REV, LG, WAG, AA_JC See Sequence models | JC |
_bScaleTree |
If 1, then the total length of the tree is fixed to a certain value. This value is optimized during the model optimization procedure. 0 = total tree length will not be optimized. | 1 |
_referenceTree |
A path to a reference tree file. The total branch lengths of this tree can be
used to fix the total branch lengths of the species tree.
When using this option, the _bScaleTree parameter can be set to 0. | "" |
_stochasicMappingIterations | Number of stochastic mapping iterations | 100 |
_startSimulationsIter |
An index specifying the first character simulation iteration See Running the parametric trait bootstrap procedure | 0 |
_endSimulationsIter |
An index specifying the last character simulation iteration See Running the parametric trait bootstrap procedure | 0 |
Model parameters
The program combines models of sequence and character-state evolution such that
rates of sequence change depend on the character state of a lineage at each
point in time. The model parameters thus depend on the specified character and
sequence models. Currently, only one character model is implemented that has
two parameters and is applicable to binary data only. The model parameters are:
_relRate = the rate of sequence evolution being at character state 1 relative to state 0.
_charModelParam1 = the rate of change from state 1 to state 0.
_charModelParam2 = A scaling parameter that transforms the units of branch
lengths from average number of nucleotide substitutions per site to average
number of character transitions. The parameters of the sequence model are:
_gammaParam: specifies the single shape parameter of the gamma distribution
that models among-site-rate-variation.
_gammaCategories: the number of discrete gamma categories.
_seqModelParam1-6: specify the rate matrix parameters. For example, when using
the HKY model, _seqModelParam1 specifies the transition versus transversion
rate bias.
_treeLength: the total tree length after the scaling procedure.
This parameter allows the adjusted tree to be stretched or shrunk, as different choices for the relative rate parameter are explored.