Reconstructing the evolutionary history of biological sequences will provide a better understanding of the mechanisms of sequence divergence and functional evolution. Long-term sequence evolution includes not only residue substitutions but also more dynamic changes such as insertions, deletions, and long-range rearrangements. Such dynamic changes make the evolutionary history of sequences difficult to reconstruct and affect the accuracy of molecular evolutionary methods, such as multiple sequence alignment (MSA) and phylogenetic methods. Testing the accuracy of these methods requires benchmark datasets. However, currently available benchmark datasets are limited in size, and the evolutionary histories of the sequences they include are unknown; both are serious drawbacks for a benchmark. These problems can be solved by simulating sequences to create benchmark datasets with known evolutionary histories. However, currently available simulation methods do not allow biologically realistic dynamic sequence evolution.
We introduced indel-Seq-Gen version 1.0 (iSGv1.0), a program that simulates realistic evolutionary processes of protein sequences with insertions and deletions (indels). iSGv1.0 allows the user to simulate multiple subsequences under different evolutionary parameters, tracks all evolutionary events including indels, and outputs the "true" MSA of the simulated sequences. With indel-Seq-Gen version 2.0 (iSGv2.0), we aimed to simulate the evolution of highly divergent DNA sequences and protein superfamilies. iSGv2.0 adds lineage-specific evolution, motif conservation, indel tracking, and subsequence length constraints, and it incorporates coding and non-coding DNA evolution. We uncovered a flaw in the modeling of indels used in current state-of-the-art methods and fixed it with a novel discrete stepping procedure.
Finally, we developed a new MSA scoring metric, the gap profile score, which uses insertion and deletion placements to evaluate MSA accuracy. Using a series of benchmark alignments created with iSGv2.0, we compared the performance of our scoring method with that of currently used character-based scoring metrics, including the sum-of-pairs score. We also examined how well the scoring metrics' output correlates with the accuracy of phylogenetic reconstruction. We show that the gap profile score offers a novel way to gauge the efficacy of MSA reconstructions, potentially opening the door to research on incorporating better models of indel placement into MSA reconstruction methods.
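For readers unfamiliar with character-based scoring, the following is a minimal sketch of a reference-based sum-of-pairs score, assuming the common definition: the fraction of residue pairs aligned in the true (reference) MSA that are also aligned in the reconstructed MSA. It is an illustration only, not the implementation used in this work; the function names `residue_pairs` and `sum_of_pairs_score` and the toy alignments are hypothetical. The gap profile score is not sketched because its definition is not given here.

```python
from itertools import combinations


def residue_pairs(alignment, gap="-"):
    """Return the set of residue pairs placed in the same column.

    Each residue is identified by (row, index of the residue within its
    ungapped sequence), so the pairs are comparable across alignments
    of the same sequences. Assumes all rows have equal length.
    """
    counters = [0] * len(alignment)  # residues seen so far in each row
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for row, seq in enumerate(alignment):
            if seq[col] != gap:
                present.append((row, counters[row]))
                counters[row] += 1
        # Every pair of non-gap residues in this column counts as aligned.
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(reference, test):
    """Fraction of reference residue pairs recovered by the test alignment."""
    ref_pairs = residue_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & residue_pairs(test)) / len(ref_pairs)


# Toy example: the same three sequences aligned two different ways.
reference = ["AC-GT", "ACAGT", "A--GT"]  # stands in for a simulated "true" MSA
test = ["ACG-T", "ACAGT", "A-G-T"]       # stands in for a reconstructed MSA
print(sum_of_pairs_score(reference, test))  # 0.8
```

Because this score counts only aligned residue pairs, it does not directly measure where gaps are placed, which is the information an indel-based metric draws on.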