Muscle5: High-accuracy alignment ensembles enable unbiased assessments of sequence homology and phylogeny

Nat Commun. 2022 Nov 15;13(1):6968.

doi: 10.1038/s41467-022-34630-w.

Robert C Edgar¹

PMID: 36379955
PMCID: PMC9664440
DOI: 10.1038/s41467-022-34630-w

Robert C Edgar. Nat Commun. 2022.

Affiliation

¹ Independent Researcher. robert@drive5.com.

Abstract

Multiple sequence alignments are widely used to infer evolutionary relationships, enabling inferences of structure, function, and phylogeny. Standard practice is to construct one alignment by some preferred method and use it in further analysis; however, undetected alignment bias can be problematic. I describe Muscle5, a novel algorithm that constructs an ensemble of high-accuracy alignments with diverse biases by perturbing a hidden Markov model and permuting its guide tree. Confidence in an inference is assessed as the fraction of the ensemble that supports it. Applied to phylogenetic tree estimation, I show that ensembles can confidently resolve topologies with low bootstrap according to standard methods, and conversely that some topologies with high bootstraps are incorrect. Applied to the phylogeny of RNA viruses, ensemble analysis shows that recently adopted taxonomic phyla are probably polyphyletic. Ensemble analysis can improve confidence assessment in any inference from an alignment.
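
The confidence rule stated in the abstract (the fraction of the ensemble that supports an inference) can be illustrated with a minimal sketch. The function name `ensemble_confidence` and the example topologies are hypothetical, and the sketch assumes each replicate's inference has already been reduced to a canonical, hashable form such as a normalized topology string.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def ensemble_confidence(inferences: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Map each distinct inference to the fraction of replicates supporting it."""
    counts = Counter(inferences)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("ensemble is empty")
    return {inference: n / total for inference, n in counts.items()}


# Hypothetical ensemble: one canonical topology string per replicate tree.
replicate_topologies = [
    "(((A,B),G),D)",
    "(((A,B),G),D)",
    "(((A,B),G),D)",
    "(((A,G),B),D)",
]
confidence = ensemble_confidence(replicate_topologies)
print(confidence["(((A,B),G),D)"])  # 0.75
```

Comparing raw strings only works if equivalent trees are always written the same way, so a real pipeline would need a topology comparison that ignores child order and branch lengths.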

Conflict of interest statement

The author declares no competing interests.

Figures

**Fig. 1. Typical ensemble workflow for alignment and phylogeny assessment.**
An ensemble of MSAs is generated and assessed for accuracy using Muscle5. Gray rectangles are processing steps performed by an algorithm or software package. First, Muscle5 (step 1) generates an ensemble of MSAs (step 2); each alignment is generated by a different combination of a perturbed HMM and a permuted guide tree. The accuracy of the MSAs can be assessed by Muscle5 (step 3) using accuracy metrics such as Column Confidence (CC). A phylogeny algorithm (step 4), e.g. maximum likelihood (ML), is used to predict a tree from each MSA (step 5). Finally, accuracy metrics, e.g. Ensemble Confidence (EC), are calculated from the resulting ensemble of trees (step 6). The Newick package (https://github.com/rcedgar/newick) was used to calculate the novel metrics described in this paper.
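
Step 3 names Column Confidence (CC) without defining it here. The sketch below assumes that the CC of a column is the fraction of replicate MSAs that reproduce exactly that column; this definition is an assumption rather than something this record states, and the helper names are hypothetical. Each column is encoded by the residue index it contains for every sequence (`None` for a gap), and all MSAs are assumed to list the same sequences in the same row order.

```python
from typing import List, Optional, Tuple

Column = Tuple[Optional[int], ...]


def encode_columns(msa: List[str], gap: str = "-") -> List[Column]:
    """Encode each column as per-sequence residue indices (None for a gap)."""
    if not msa:
        return []
    width = len(msa[0])
    if any(len(row) != width for row in msa):
        raise ValueError("all rows of an MSA must have the same length")
    next_residue = [0] * len(msa)
    encoded: List[Column] = []
    for col in range(width):
        column: List[Optional[int]] = []
        for seq, row in enumerate(msa):
            if row[col] == gap:
                column.append(None)
            else:
                column.append(next_residue[seq])
                next_residue[seq] += 1
        encoded.append(tuple(column))
    return encoded


def column_confidence(reference: List[str], ensemble: List[List[str]]) -> List[float]:
    """Fraction of replicates reproducing each column of the reference MSA."""
    if not ensemble:
        raise ValueError("ensemble is empty")
    replicate_columns = [set(encode_columns(msa)) for msa in ensemble]
    return [
        sum(column in columns for columns in replicate_columns) / len(ensemble)
        for column in encode_columns(reference)
    ]


# Hypothetical two-sequence example with two replicates.
reference = ["ACGT", "A-GT"]
ensemble = [["ACGT", "A-GT"], ["ACGT", "AG-T"]]
print(column_confidence(reference, ensemble))  # [1.0, 0.5, 0.5, 1.0]
```

The first and last columns are reproduced by both replicates, while the two middle columns appear in only one of them.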

**Fig. 2. Accuracy of Muscle5 on structure-based benchmarks.**
**a** Average accuracy of Muscle5 ensemble replicates compared to Clustal-Omega, ProbCons and MAFFT on benchmarks of protein (top) and RNA (bottom) alignments; the default variant *none.0* is the wider bar. **b** Correlation between AC and accuracy (fraction of correct columns) for protein (top) and RNA (bottom); Pearson's r = 0.80 and 0.84, respectively. **c** Probability that a column is correct after binning into CC percentage intervals: 0+ is 0% ≤ CC < 10%, 10+ is 10% ≤ CC < 20%, etc.; the last bar is CC = 100%. Thus CC is predictive of correctness and AC is predictive of accuracy.
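
The binning in panel c can be sketched as follows. The function names and the input data are hypothetical; the only details taken from the caption are the half-open 10% intervals and the separate final bin for CC = 100%.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def cc_bin(cc_percent: float) -> str:
    """Return the bin label: '0+', '10+', ..., '90+', or '100' for CC = 100%."""
    if not 0.0 <= cc_percent <= 100.0:
        raise ValueError("CC must be a percentage between 0 and 100")
    if cc_percent == 100.0:
        return "100"
    return f"{int(cc_percent // 10) * 10}+"


def correctness_by_bin(columns: Iterable[Tuple[float, bool]]) -> Dict[str, float]:
    """Fraction of correct columns in each CC bin.

    Each input item is (CC in percent, whether the column is correct).
    """
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for cc_percent, is_correct in columns:
        label = cc_bin(cc_percent)
        totals[label] += 1
        correct[label] += int(is_correct)
    return {label: correct[label] / totals[label] for label in totals}


# Hypothetical columns: (CC percentage, correct according to a benchmark).
observed = [
    (5.0, False), (9.9, False),
    (10.0, True), (15.0, False),
    (95.0, True),
    (100.0, True), (100.0, True),
]
print(correctness_by_bin(observed))
# {'0+': 0.0, '10+': 0.5, '90+': 1.0, '100': 1.0}
```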

**Fig. 3. Replicate alignments of BBS11008.**
Two replicate alignments of a segment in Balibase set BBS11008 are shown together with ribbon diagrams of two of its four structures (2pna and 1uur). This region comprises a well-conserved anti-parallel beta-sheet (green) which transitions into a variable exposed loop (magenta, outlined by rectangles). Sequence homology in the beta-sheet is unambiguous except for one gap (grey background), while both sequence and structure similarity are unclear in the loop, which is reflected by lower CC values (top histogram; CC was calculated from a diversified ensemble of 100 replicates).

**Fig. 4. Ensemble confidence of coronavirus genus topologies and monophyly.**
**a** Relative frequencies of tree topologies for coronavirus genera from a diversified ensemble using six different tree estimation methods. The rightmost bar shows the combined ensemble average with MEGA-NJ and FastTree excluded. A = *Alphacoronavirus*, B = *Betacoronavirus*, G = *Gammacoronavirus* and D = *Deltacoronavirus*. **b** Ensemble Monophyly (EM) of coronavirus genera. All six tree estimation methods give confidence > 96% to monophyly of all genera, except for EM = 82.0% for *Deltacoronavirus* by RAxML.
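
A strict reading of Ensemble Monophyly in panel b is the fraction of trees in the ensemble in which a group forms a clade. The sketch below takes that reading as an assumption, not as the paper's exact definition. It further assumes rooted trees written as nested Python tuples with string leaf labels; the function names and the example trees are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Set, Union

Tree = Union[str, tuple]  # a leaf label, or a tuple of child subtrees


def clades(tree: Tree) -> Set[FrozenSet[str]]:
    """Return the leaf set of every clade (subtree) of a rooted tree."""
    found: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    visit(tree)
    return found


def ensemble_monophyly(trees: Iterable[Tree], group: Iterable[str]) -> float:
    """Fraction of trees in which `group` is exactly the leaf set of a clade."""
    tree_list: List[Tree] = list(trees)
    if not tree_list:
        raise ValueError("ensemble is empty")
    target = frozenset(group)
    return sum(target in clades(tree) for tree in tree_list) / len(tree_list)


# Hypothetical ensemble of four rooted trees; a1 and a2 belong to one genus.
trees = [
    ((("a1", "a2"), "b1"), "d1"),
    ((("a1", "b1"), "a2"), "d1"),
    (("a1", "a2"), ("b1", "d1")),
    ((("a1", "a2"), "b1"), "d1"),
]
print(ensemble_monophyly(trees, ["a1", "a2"]))  # 0.75
```

Unrooted trees would need a bipartition test instead of a clade test, and a graded measure (such as the TP, FP and FN percentages mentioned for Fig. 10) would give partial credit that this yes-or-no version does not.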

**Fig. 5. Bootstraps for coronavirus consensus genus topology.**
The coronavirus genus topology is (((A,B),G),D) with high ensemble confidence (Fig. 4). Using the default Muscle5 MSA (none.0), this topology was reported by four of the six tree estimation methods with the bootstrap values shown in the figure, most of which are low.

**Fig. 6. Ensemble frequencies of phylum topologies.**
Relative frequencies of tree topologies of Ribovirus phyla from a diversified ensemble using six different tree estimation methods. The rightmost bar shows the combined ensemble average. D = *Duplornaviricota*, K = *Kitrinoviricota*, L = *Lenarviricota*, N = *Negarnaviricota*, P = *Pisuviricota*.

**Fig. 7. Phylum topologies estimated from replicates abc.2 and bca.2.**
HMM parameters are held fixed for making the MSAs (both have random seed 2) while the guide tree topology is permuted. All six tree estimation methods agree with each other on the topology for a given MSA, but the two MSAs give different topologies, so one or both topologies must be wrong. Because the MSAs are otherwise generated identically, the reproducible wrong tree must be induced by guide tree bias. Most methods give high bootstraps (shown in the table below the trees) for most or all of the edges.

**Fig. 8. Conflicting coronavirus genus topologies in the literature.**
Trees from four published papers reporting high bootstraps for conflicting genus topologies: Degroot2013, Wang2014, Woo2010 and Wang2014. Wang2014 estimated trees from three different alignments: (A) whole genomes, (B) spike protein and (C) nucleocapsid protein.

**Fig. 9. Misaligned catalytic residues in RdRp MSAs.**
Misalignments of essential catalytic residues were identified using Palmscan. **a** Ten representative sequences from the Wolf2018 alignment are shown. The top seven sequences place the catalytic glycine (G) in motif B in a different column than the bottom three. Sequence logos for the relevant Palmscan PSSMs are shown above and below the alignment. **b** Percentages of sequences with at least one catalytic residue misalignment and the total number of residue misalignments for Wolf2018 alignment S3, and the maximum and mean values on the corresponding Muscle5 ensemble. S3 is a subset alignment used by Wolf2018 to estimate the top-level (near-root) branching order of their tree; it contains 238 sequences. The equivalent Muscle5 ensemble has 249 sequences per MSA, selecting different subsets in addition to different alignment parameters to construct replicates. Note that all Muscle5 replicates have substantially fewer errors than S3.
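
One simple way to flag the kind of misalignment shown in panel a is to find the column in which each sequence's catalytic residue lands and report sequences that disagree with the majority column. This is only an illustrative sketch: the majority rule, the function names, and the toy data are assumptions, and the per-sequence residue positions are taken as given (in the paper they come from Palmscan, whose interface is not modeled here).

```python
from collections import Counter
from typing import Dict, List


def residue_column(row: str, residue_index: int, gap: str = "-") -> int:
    """Column holding the residue_index-th (0-based) non-gap character."""
    seen = -1
    for col, char in enumerate(row):
        if char != gap:
            seen += 1
            if seen == residue_index:
                return col
    raise ValueError("residue index is beyond the end of the sequence")


def misaligned_sequences(msa: Dict[str, str], site: Dict[str, int]) -> List[str]:
    """Names of sequences whose site is not in the majority column."""
    columns = {name: residue_column(row, site[name]) for name, row in msa.items()}
    majority_column, _ = Counter(columns.values()).most_common(1)[0]
    return [name for name, col in columns.items() if col != majority_column]


# Hypothetical alignment; the catalytic G is residue 1 in every sequence.
msa = {"s1": "AG-DD", "s2": "AGSDD", "s3": "A-GDD"}
site = {"s1": 1, "s2": 1, "s3": 1}
print(misaligned_sequences(msa, site))  # ['s3']
```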

**Fig. 10. Monophylicity of RNA virus phyla.**
**a** Wolf2018 tree showing high bootstrap values as reported in their paper. **b** RAxML tree estimated from M^* (Muscle5 replicate with highest AC) showing high bootstrap values for a conflicting topology. **c** Mean false positive frequencies as a percentage of the best-fit subtree, averaged over a diverse ensemble. **d** Monophyly of the tree in panel of the best-fit subtree as percentage TP, FP and FN, respectively, averaged over a diverse ensemble. Note that TP is low, ranging from 43% for *Pisuviricota* to 65% for *Negarnaviricota*. Compare with Fig. 4b, which shows high monophyly of coronavirus genera.

Cited by

Genome assembly of Stephania longa provides insight into cepharanthine biosynthesis.
Shang H, Lu Y, Xun L, Wang K, Li B, Liu Y, Ma T. Front Plant Sci. 2024 Sep 5;15:1414636. doi: 10.3389/fpls.2024.1414636. eCollection 2024. PMID: 39301160.
Chromosome-scale assembly of the wild cereal relative Elymus sibiricus.
Shen W, Liu B, Guo J, Yang Y, Li X, Chen J, Dou Q. Sci Data. 2024 Jul 26;11(1):823. doi: 10.1038/s41597-024-03622-4. PMID: 39060306.
Phylogenetic reconciliation: making the most of genomes to understand microbial ecology and evolution.
Williams TA, Davin AA, Szánthó LL, Stamatakis A, Wahl NA, Woodcroft BJ, Soo RM, Eme L, Sheridan PO, Gubry-Rangin C, Spang A, Hugenholtz P, Szöllősi GJ. ISME J. 2024 Jan 8;18(1):wrae129. doi: 10.1093/ismejo/wrae129. PMID: 39001714. Review.
Denitrification genotypes of endospore-forming Bacillota.
Bell E, Chen J, Richardson WDL, Fustic M, Hubert CRJ. ISME Commun. 2024 Sep 4;4(1):ycae107. doi: 10.1093/ismeco/ycae107. eCollection 2024 Jan. PMID: 39263550.
Sequence-Based Antigenic Analyses of H1 Swine Influenza A Viruses from Colombia (2008-2021) Reveals Temporal and Geographical Antigenic Variations.
Ospina-Jimenez AF, Gomez AP, Rincon-Monroy MA, Ortiz L, Perez DR, Peña M, Ramirez-Nieto G. Viruses. 2023 Sep 30;15(10):2030. doi: 10.3390/v15102030. PMID: 37896808.

References

1. Sievers F, Higgins DG. Clustal omega. Curr. Protoc. Bioinforma. 2014;48:3–13. doi: 10.1002/0471250953.bi0313s48.
2. Katoh K, Standley DM. Mafft multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evolution. 2013;30:772–780. doi: 10.1093/molbev/mst010.
3. Edgar RC. Muscle: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–1797. doi: 10.1093/nar/gkh340.
4. Thompson JD, Plewniak F, Poch O. Balibase: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinforma. (Oxf., Engl.) 1999;15:87–88. doi: 10.1093/bioinformatics/15.1.87.
5. Gardner PP, Wilm A, Washietl S. A benchmark of multiple sequence alignment programs upon structural rnas. Nucleic Acids Res. 2005;33:2433–2439. doi: 10.1093/nar/gki541.
