chromEvol (VERSION 1.3)
chromEvol is a program for analyzing changes in chromosome number along a
phylogeny.
Download the program:
The current version, from January 2012, includes some minor bug fixes.
For support and questions please email me: itaymay 'at' gmail.com
chromEvol.exe (Windows)
Examples of input files:
params.txt (chromEvol control file).
Arist.counts (chromosome counts file format).
Arist.tree (Newick tree file format).
You can try the program by typing
>chromEvol.exe params.txt
Source code and copyrights:
Source code (C++) is available for download here:
[chromEvol_source-1.3.tar.gz].
The makefile within can be used to compile the executable (using the make or
gmake commands). Alternatively, type:
g++ -o chromEvol.exe -O3 *.cpp -DDOUBLEREP
If compilation fails (as occasionally happens with some versions of g++),
please email me and I'll try to help. Permission is required to modify the code
or to use parts of it for other purposes. Please contact
Itay Mayrose at itay 'at' gmail.com
When citing the chromEvol program, please refer to:
Mayrose I, Barker MS, Otto SP. 2010. Probabilistic models of chromosome number
evolution and the inference of polyploidy. Systematic Biology. 59(2):132-144
Overview:
Chromosome number is a remarkably dynamic feature of eukaryotic evolution.
Chromosome numbers can change by a duplication of the whole genome (a process
termed polyploidy), or by gaining or losing single chromosomes. Of the various
mechanisms of chromosome number change, polyploidy has received significant
attention because of the impact such an event may have on the organism.
Polyploids often differ markedly from their progenitors in
morphological, physiological, or life history characteristics, and these
differences may contribute to the establishment and success of a polyploid
species in novel ecological settings.
ChromEvol implements a series of likelihood models for the evolution of
chromosome numbers. By comparing the fit of the different models to biological
data, it may be possible to gain insight regarding the pathways by which the
evolution of chromosome number proceeds. For each model, the program infers the
set of ancestral chromosome numbers and estimates the locations along the tree
at which polyploidy events (and other chromosome number changes) occurred.
Methodology:
To run the program, type chromEvol followed by the path to the control file. The control file specifies the paths to the input/output files and the various options to be used.
The obligatory inputs to chromEvol are a tree file in Newick format and a chromosome counts file in the correct format. The counts should represent the haploid number. The user is responsible for the correspondence between the two files: all extant taxa in the tree must have chromosome count data, and all taxa with count data must appear in the tree.
The analysis proceeds in three phases. First, the parameters of the underlying model are estimated using standard maximum likelihood techniques. Then, the most likely ancestral chromosome numbers are inferred, as well as the probability of each chromosome number at each internal node. Finally, the program estimates the expected number of polyploidizations and single-chromosome transitions that have occurred. This last step is done using simulations, which may be computationally intensive; the trade-off between accuracy and running time can be controlled using the control file.
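The correspondence between the tree and the counts file can be checked before running the program. The following is a minimal sketch, not part of chromEvol: the function names are hypothetical, and the tip-label extraction assumes unquoted Newick names without spaces.

```python
import re

def newick_leaf_names(newick):
    """Return the tip labels of a Newick string (unquoted names only)."""
    # A tip label starts right after '(' or ',' and ends before ':', ',', ')' or ';'.
    # Internal node labels follow ')' and are therefore not matched.
    return set(re.findall(r"[(,]\s*([^\s(),:;]+)", newick))

def counts_taxon_names(counts_text):
    """Return the taxon names from the '>' header lines of a counts file."""
    return {line[1:].strip() for line in counts_text.splitlines()
            if line.startswith(">")}

def check_correspondence(newick, counts_text):
    """Return (tips without count data, counted taxa absent from the tree)."""
    tips = newick_leaf_names(newick)
    taxa = counts_taxon_names(counts_text)
    return tips - taxa, taxa - tips

# Hypothetical inputs, for illustration only.
tree = "((TAXA_A:0.1,TAXA_B:0.1):0.2,TAXA_C:0.3);"
counts = ">TAXA_A\n24\n>TAXA_B\n26\n>TAXA_D\nX\n"
missing_counts, missing_tips = check_correspondence(tree, counts)
print("Tips without counts:", missing_counts)
print("Counted taxa not in tree:", missing_tips)
```

Both returned sets should be empty before the two files are passed to the program.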
IMPORTANT NOTES:
1. The input phylogenetic tree is assumed to be rooted.
The root name is printed in the chromEvol.res output file.
In order to verify that this is the correct root node,
the user should view the file allNodes.tree.
In case the default rooting is not correct, it is possible to change the root using the _rootAt option in the control file.
2. For efficient multi-dimensional optimization, the program sets an upper bound of 100.0 for the rate parameters.
However, if one of the optimized model parameters is close to this upper bound, this indicates that the model parameters may not have reached their global optimum.
The solution is to multiply all branch lengths by a certain factor (e.g., 10) and run the program again.
Multiplying all branch lengths by a scalar can be done using the _branchMul option in the control file.
3. The branch lengths of the tree represent the expected number of chromosome-number transitions along the branch.
When the branch lengths are exceptionally large (or small) the program will infer unrealistic ancestral states and will overestimate the number of transitions.
By default the program scales the input tree (multiplying all branches by a constant) so that the total tree length represents a realistic number of chromosome-number transitions across the whole tree.
Use the _branchMul parameter in the control file (see below) to scale the tree by a specified scalar, or to keep the branch lengths identical to the input tree (_branchMul = 1.0).
4. The program also accepts multiple chromosome counts for a certain species.
For example, when 60% of the individuals in the population have a haploid chromosome count of 24 and 40% have a haploid chromosome count of 26, these should be written in the data file as follows:
>TAXA_B
24=0.6_26=0.4
If two chromosome numbers are valid for specific taxa, for example, if both 24 and 25 are valid for TAXA_A then each of these can be given a 0.5 probability:
>TAXA_A
24=0.5_25=0.5
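Notes 2 and 3 above both rely on multiplying all branch lengths by a scalar. The sketch below only illustrates what that operation does to a Newick string; inside chromEvol the scaling is done through the _branchMul option. The function names and the example tree are hypothetical, and the regular expression assumes plain numeric branch lengths after each ':'.

```python
import re

# A branch length is the number that follows ':' in a Newick string.
_BRANCH = re.compile(r":\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")

def total_tree_length(newick):
    """Sum of all branch lengths in the tree."""
    return sum(float(x) for x in _BRANCH.findall(newick))

def scale_branches(newick, factor):
    """Return the Newick string with every branch length multiplied by factor."""
    return _BRANCH.sub(
        lambda m: ":" + format(float(m.group(1)) * factor, ".10g"), newick)

# Hypothetical tree, for illustration only.
tree = "((TAXA_A:0.5,TAXA_B:0.5):0.25,TAXA_C:0.75);"
scaled = scale_branches(tree, 10)
print(total_tree_length(tree), "->", total_tree_length(scaled))
print(scaled)
```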
Control file options:
An example of a control option file can be found here. This file specifies a model with 3 free parameters (_lossConstR, _gainConstR, _duplConstR) with the demi-polyploidy rate equal to the polyploidy rate.
A description of each parameter is given below.
Parameter | Description | Default
_dataFile | A path to a file with the chromosome count data | Obligatory
_treeFile | A path to a tree file in Newick format | Obligatory
_outDir | A path to the output directory. The directory must already exist (the program will not create a new directory). | RESULTS
_mainType | Possible options: Run_Fix_Param = run the analysis with the specified parameter values; Optimize_Model = optimize the specified model parameters and then run the analysis; All_Models = run the analysis for each of the eight models (see Models comparison) | Optimize_Model
_maxChrNum | The maximal chromosome number allowed. Negative values (-X): set the maximal chromosome number allowed to be X units larger than the maximal chromosome number observed in the data file | -10 (10 units larger than the maximal chromosome number observed in the data file)
_minChrNum | The minimal chromosome number allowed. Negative values (-X): set the minimal chromosome number allowed to be X units smaller than the minimal chromosome number observed in the data file | 1
_simulationsNum | The number of simulations for computing the expected number of changes of each transition type along each branch. Note: this step is computationally expensive. Lower values result in faster computations with decreased accuracy | 10000
_rootAt | The internal node assumed to represent the root of the tree | N1
_branchMul | If different from 1.0, all branch lengths of the tree are multiplied by this scalar. Should be used if one of the model parameters is close to its boundary value (100), or to scale the tree when the branch lengths are exceptionally large or small | 999 (= scale the tree so that the total tree length equals the number of different character types)
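The negative-value convention for _maxChrNum and _minChrNum can be read as follows. This is a sketch of the rule as described in the table above, not code taken from chromEvol; the function names and the observed counts are hypothetical, and whether the program applies any further bounds to the result is not covered here.

```python
def resolve_max_chr_num(setting, observed_max):
    """A negative setting -X means: X units above the largest observed count."""
    if setting < 0:
        return observed_max + abs(setting)
    return setting

def resolve_min_chr_num(setting, observed_min):
    """A negative setting -X means: X units below the smallest observed count."""
    if setting < 0:
        return observed_min - abs(setting)
    return setting

# Hypothetical data set whose observed haploid counts range from 7 to 14.
print(resolve_max_chr_num(-10, 14))  # default _maxChrNum
print(resolve_min_chr_num(1, 7))     # default _minChrNum
print(resolve_min_chr_num(-2, 7))
```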
Model parameters
Currently the program supports 6 types of transitions between different chromosome numbers.
The user may include all parameters in the model or choose to ignore some of them.
By specifying different sets of parameters the user may compare different hypotheses regarding the evolution of chromosome number along a given phylogeny.
The model parameters should be specified in the control file. To include a parameter, its name should be followed by a space and a positive number.
If the Optimize_Model option is specified, this number is the initial parameter value for optimization; if Run_Fix_Param is specified, the parameter is fixed to that value.
To exclude a parameter, the parameter name should be followed by the number -999.
Parameter | Description
_gainConstR | An increase by a single chromosome
_gainLinearR | Rates for single chromosome increases are dependent on the current chromosome number
_lossConstR | A decrease by a single chromosome
_lossLinearR | Rates for single chromosome decreases are dependent on the current chromosome number
_duplConstR | A duplication of the whole genome (polyploidy)
_demiPloidyR | A demi-polyploidization. This parameter allows for transitions from a genome with n haploid chromosomes to 1.5n (e.g., 4x to 6x). If -2 is specified then the rate of demi-polyploidy is equal to that of polyploidy. Thus, the number of model parameters does not increase
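The include/exclude convention for the rate parameters can be illustrated by generating control-file lines programmatically. This is a sketch under stated assumptions: the helper names are hypothetical, the starting values of 1.0 are arbitrary, and every option is assumed to be written as a name followed by a space and a value, as in the "_mainType All_Models" line shown in this manual. The file names are the example inputs listed at the top of this page.

```python
EXCLUDED = -999         # marks a rate parameter as excluded from the model
DEMI_EQUALS_DUPL = -2   # _demiPloidyR only: tie its rate to the polyploidy rate

def rate_line(name, value):
    """Format one rate parameter as: name, a space, then a number."""
    if value is None:
        return f"{name} {EXCLUDED}"
    if name == "_demiPloidyR" and value == DEMI_EQUALS_DUPL:
        return f"{name} {DEMI_EQUALS_DUPL}"
    if value <= 0:
        raise ValueError(f"{name}: expected a positive number, got {value}")
    return f"{name} {value}"

def render_control_file(options, rates):
    """Return the control-file text, one option per line."""
    lines = [f"{name} {value}" for name, value in options.items()]
    lines += [rate_line(name, value) for name, value in rates.items()]
    return "\n".join(lines) + "\n"

options = {
    "_mainType": "Optimize_Model",
    "_dataFile": "Arist.counts",
    "_treeFile": "Arist.tree",
    "_outDir": "RESULTS",
}
# Three free parameters, demi-polyploidy rate tied to the polyploidy rate.
rates = {
    "_gainConstR": 1.0,     # arbitrary starting value (assumption)
    "_lossConstR": 1.0,     # arbitrary starting value (assumption)
    "_duplConstR": 1.0,     # arbitrary starting value (assumption)
    "_demiPloidyR": DEMI_EQUALS_DUPL,
    "_gainLinearR": None,   # excluded from the model
    "_lossLinearR": None,   # excluded from the model
}
control_text = render_control_file(options, rates)
print(control_text)
```

Passing None for a rate writes -999, so the same dictionary can be switched between models by changing only the values.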
Models comparison
A set of 8 models, each with a different set of parameters, can be optimized. The maximal log-likelihood values and AIC scores of each model are printed to the file modelsSummary.txt.
In order to run this option the following line should be included in the control file:
_mainType All_Models
The following models are run:
Model | Model parameters
CONST_RATE | _gainConstR, _lossConstR, _duplConstR
CONST_RATE_DEMI | _gainConstR, _lossConstR, _duplConstR = _demiPloidyR
CONST_RATE_DEMI_EST | _gainConstR, _lossConstR, _duplConstR, _demiPloidyR
CONST_RATE_NO_DUPL | _gainConstR, _lossConstR
LINEAR_RATE | _gainConstR, _gainLinearR, _lossConstR, _lossLinearR, _duplConstR
LINEAR_RATE_DEMI | _gainConstR, _gainLinearR, _lossConstR, _lossLinearR, _duplConstR = _demiPloidyR
LINEAR_RATE_DEMI_EST | _gainConstR, _gainLinearR, _lossConstR, _lossLinearR, _duplConstR, _demiPloidyR
LINEAR_RATE_NO_DUPL | _gainConstR, _gainLinearR, _lossConstR, _lossLinearR
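As a reminder of how the models are compared, the sketch below applies the standard AIC definition, AIC = 2k - 2 ln L, where k is the number of free parameters. The parameter counts are read off the table above (a tied _demiPloidyR adds no parameter). The log-likelihood values are invented purely for illustration and are not chromEvol results, and the manual does not state that the program computes its scores exactly this way.

```python
# Number of free parameters per model, read off the models table.
FREE_PARAMETERS = {
    "CONST_RATE": 3,
    "CONST_RATE_DEMI": 3,
    "CONST_RATE_DEMI_EST": 4,
    "CONST_RATE_NO_DUPL": 2,
    "LINEAR_RATE": 5,
    "LINEAR_RATE_DEMI": 5,
    "LINEAR_RATE_DEMI_EST": 6,
    "LINEAR_RATE_NO_DUPL": 4,
}

def aic(log_likelihood, n_params):
    """Standard Akaike information criterion; lower is better."""
    return 2 * n_params - 2 * log_likelihood

def rank_models(log_likelihoods):
    """Return (model, AIC) pairs sorted from best to worst."""
    scores = {model: aic(lnl, FREE_PARAMETERS[model])
              for model, lnl in log_likelihoods.items()}
    return sorted(scores.items(), key=lambda item: item[1])

# Invented log-likelihoods, for illustration only.
example = {"CONST_RATE": -120.0, "CONST_RATE_DEMI": -115.0, "LINEAR_RATE": -114.5}
ranking = rank_models(example)
for model, score in ranking:
    print(model, score)
```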
Chromosome counts file format
Chromosome counts data should be supplied in a format similar to a FASTA file, with a few extensions.
For each species in the input tree, two lines should be specified.
The first line lists the species name, preceded by the symbol '>'.
The species name must be identical to the name as it appears in the input tree file. The second line specifies the chromosome count for that species.
Extensions:
1. If the count for a certain species is unknown, the symbol 'X' can be used.
The program then treats this species as missing data (similar to a gap in molecular sequence data).
2. The program also accepts multiple chromosome counts for a certain species (NOTE: THIS OPTION IS NOT FULLY TESTED).
There are two possible scenarios for this option.
(A) When two counts are sampled from a population.
For example, 60% of individuals having a haploid chromosome count of 24 and 40% having a haploid chromosome count of 26 should be written in the data file as follows:
>TAXA_B
24=0.6_26=0.4
(B) When two chromosome numbers are valid for a specific taxon. For example, if both 24 and 25 are valid for TAXA_A, then these two counts should be separated by '_' and each given a probability of 0.5:
>TAXA_A
24=0.5_25=0.5
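The format described above can be read with a few lines of code. The following is a minimal sketch of a reader, not the parser used by chromEvol: the function names are hypothetical, and the example text simply combines the cases listed above (a single count, an unknown count 'X', and multiple counts with probabilities).

```python
def parse_count_entry(entry):
    """Parse the second line of a record into {count: probability}, or None for 'X'."""
    if entry == "X":
        return None                      # unknown count, treated as missing data
    if "=" not in entry:
        return {int(entry): 1.0}         # a single count
    distribution = {}
    for part in entry.split("_"):        # e.g. "24=0.6_26=0.4"
        number, probability = part.split("=")
        distribution[int(number)] = float(probability)
    return distribution

def parse_counts(text):
    """Parse FASTA-like counts text into {taxon name: parsed entry}."""
    counts = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()      # header line with the species name
        elif name is not None:
            counts[name] = parse_count_entry(line)
            name = None
    return counts

example = ">TAXA_A\n24=0.5_25=0.5\n>TAXA_B\n24=0.6_26=0.4\n>TAXA_C\nX\n>TAXA_D\n12\n"
parsed = parse_counts(example)
for taxon, entry in parsed.items():
    print(taxon, entry)
```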
Outputs:
chromEvol.res:
This file includes various run statistics as well as the inferred model parameters, frequencies of the chromosome numbers at the root of the tree, and the log-likelihood value of the optimized parameter set.
log.txt
All outputs that are printed during an execution of the program
allNodes.tree:
A tree file in Newick format that specifies the names of the nodes (internal and external) of the input tree.
Internal node names are given in place of the bootstrap values and can be viewed in tree viewing programs such as njplot or FigTree.
mlAncestors.tree
A Newick tree file with the maximum-likelihood ancestor reconstruction.
The reconstruction of ancestral states is printed as the bootstrap values and can be viewed using a tree viewing program.
posteriorAncestors.tree
A Newick tree file with the posterior probabilities of the two most probable chromosome numbers at each internal node.
These probabilities are printed as the bootstrap values and can be viewed using a tree viewing program.
Exp.tree
Similar to posteriorAncestors.tree above -
lists in a quite ugly way the expected number of gains//losses//polyploidy//demi-polyploidy events that are inferred to have occurred along each branch.
These expectations are printed instead of the bootstrap values for the terminal node below the branch (further from the root) and can be viewed using a tree viewing program.
ancestorsProbs.txt
A table with the probability of each chromosome number to occur at each internal node.
expectations.txt
Lists branches with an expectation above 0.5 to have experienced a gain, a loss, a polyploidization, or a demi-polyploidization event.
A table at the end of the file lists the expected number of gain/loss/polyploidy/demi-polyploidy events for all branches of the phylogeny.
The name of the branch is given as the name of the node below it (further from the root).
modelsSummary.txt (only when the _mainType option All_Models is specified)
Lists the log-likelihood and AIC scores of all models. This should be used to choose the model that best fits a particular dataset.