WITCH: Improved Multiple Sequence Alignment Through Weighted Consensus Hidden Markov Model Alignment
- PMID: 35575747
- DOI: 10.1089/cmb.2021.0585
Abstract
Accurate multiple sequence alignment is challenging on many data sets, including those that are large, evolve at high rates, or have sequence length heterogeneity. While substantial progress has been made over the last decade in addressing the first two challenges, sequence length heterogeneity remains a significant issue for many data sets. Sequence length heterogeneity occurs for biological and technological reasons, including large insertions or deletions (indels) that occurred in the evolutionary history relating the sequences, or the inclusion of sequences that are not fully assembled. Ultra-large alignments using Phylogeny-Aware Profiles (UPP) (Nguyen et al. 2015) is one of the most accurate approaches for aligning data sets that exhibit sequence length heterogeneity: it constructs an alignment on the subset of sequences it considers "full-length," represents this "backbone alignment" using an ensemble of hidden Markov models (HMMs), and then adds each remaining sequence into the backbone alignment based on an HMM selected for that sequence from the ensemble. Our new method, WeIghTed Consensus Hmm alignment (WITCH), improves on UPP in three important ways: first, it uses a statistically principled technique to weight and rank the HMMs; second, it uses multiple HMMs from the ensemble rather than a single HMM; and third, it combines the alignments for each of the selected HMMs using a consensus algorithm that takes the weights into account. We show that this approach provides improved alignment accuracy compared with UPP and other leading alignment methods, as well as improved accuracy for maximum likelihood trees based on these alignments.
Keywords: divide and conquer; hidden Markov model; multiple sequence alignment.
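The three steps named in the abstract (weight and rank the HMMs, keep several of them, combine their alignments by weight) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the base-2 normalization of bit scores, and the per-character vote are not the published WITCH formulas or its consensus algorithm, and the scores and column indices are made up.

```python
from collections import defaultdict


def normalize_weights(bit_scores):
    """Map per-HMM bit scores to weights that sum to 1.

    Assumption: a base-2 softmax, chosen only because bit scores are
    log2 quantities. This is not the weighting defined in the paper.
    """
    top = max(bit_scores.values())
    raw = {hmm: 2.0 ** (score - top) for hmm, score in bit_scores.items()}
    total = sum(raw.values())
    return {hmm: value / total for hmm, value in raw.items()}


def top_k(weights, k):
    """Return the k HMM names with the largest weights, best first."""
    return sorted(weights, key=weights.get, reverse=True)[:k]


def weighted_vote(placements, weights):
    """Pick, for each query character, the backbone column with most weight.

    placements maps an HMM name to a list giving the backbone column
    proposed for each character of the query sequence.
    """
    length = len(next(iter(placements.values())))
    chosen = []
    for pos in range(length):
        votes = defaultdict(float)
        for hmm, columns in placements.items():
            votes[columns[pos]] += weights[hmm]
        chosen.append(max(votes, key=votes.get))
    return chosen


# Hypothetical inputs: three HMMs scored against one query sequence.
bit_scores = {"hmm_a": 10.0, "hmm_b": 9.0, "hmm_c": 2.0}
weights = normalize_weights(bit_scores)
selected = top_k(weights, 2)

# Hypothetical column proposals from the two selected HMMs.
placements = {"hmm_a": [0, 1, 3], "hmm_b": [0, 2, 3]}
consensus = weighted_vote({h: placements[h] for h in selected}, weights)
```

A per-character vote of this kind does not guarantee that the chosen columns stay in increasing order along the query, and it ignores characters that an HMM treats as insertions; a real consensus step has to handle both.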
Similar articles
- HMMerge: an ensemble method for multiple sequence alignment. Bioinform Adv. 2023 Apr 17;3(1):vbad052. doi: 10.1093/bioadv/vbad052. eCollection 2023. PMID: 37128578 Free PMC article.
- UPP2: fast and accurate alignment of datasets with fragmentary sequences. Bioinformatics. 2023 Jan 1;39(1):btad007. doi: 10.1093/bioinformatics/btad007. PMID: 36625535 Free PMC article.
- Ultra-large alignments using phylogeny-aware profiles. Genome Biol. 2015 Jun 16;16(1):124. doi: 10.1186/s13059-015-0688-z. PMID: 26076734 Free PMC article.
- Profile hidden Markov models. Bioinformatics. 1998;14(9):755-63. doi: 10.1093/bioinformatics/14.9.755. PMID: 9918945 Review.
- Recent progress on methods for estimating and updating large phylogenies. Philos Trans R Soc Lond B Biol Sci. 2022 Oct 10;377(1861):20210244. doi: 10.1098/rstb.2021.0244. Epub 2022 Aug 22. PMID: 35989607 Free PMC article. Review.
Cited by
- EMMA: a new method for computing multiple sequence alignments given a constraint subset alignment. Algorithms Mol Biol. 2023 Dec 7;18(1):21. doi: 10.1186/s13015-023-00247-x. PMID: 38062452 Free PMC article.
- HMMerge: an ensemble method for multiple sequence alignment. Bioinform Adv. 2023 Apr 17;3(1):vbad052. doi: 10.1093/bioadv/vbad052. eCollection 2023. PMID: 37128578 Free PMC article.
- UPP2: fast and accurate alignment of datasets with fragmentary sequences. Bioinformatics. 2023 Jan 1;39(1):btad007. doi: 10.1093/bioinformatics/btad007. PMID: 36625535 Free PMC article.
- WITCH-NG: efficient and accurate alignment of datasets with sequence length heterogeneity. Bioinform Adv. 2023 Mar 6;3(1):vbad024. doi: 10.1093/bioadv/vbad024. eCollection 2023. PMID: 36970502 Free PMC article.