Nat Commun. 2022 Nov 15;13(1):6968.
doi: 10.1038/s41467-022-34630-w.

Muscle5: High-accuracy alignment ensembles enable unbiased assessments of sequence homology and phylogeny

Robert C Edgar. Nat Commun. 2022.

Abstract

Multiple sequence alignments are widely used to infer evolutionary relationships, enabling inferences of structure, function, and phylogeny. Standard practice is to construct one alignment by some preferred method and use it in further analysis; however, undetected alignment bias can be problematic. I describe Muscle5, a novel algorithm that constructs an ensemble of high-accuracy alignments with diverse biases by perturbing a hidden Markov model and permuting its guide tree. Confidence in an inference is assessed as the fraction of the ensemble that supports it. Applied to phylogenetic tree estimation, I show that ensembles can confidently resolve topologies that have low bootstrap support according to standard methods, and conversely that some topologies with high bootstrap support are incorrect. Applied to the phylogeny of RNA viruses, ensemble analysis shows that recently adopted taxonomic phyla are probably polyphyletic. Ensemble analysis can improve confidence assessment in any inference from an alignment.
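
The confidence measure stated in the abstract (the fraction of the ensemble that supports an inference) can be illustrated with a minimal sketch. The function name `ensemble_confidence`, the predicate argument, and the toy list of topology strings are all hypothetical and chosen for illustration; this is not the Muscle5 or Newick implementation.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def ensemble_confidence(replicates: Iterable[T], supports: Callable[[T], bool]) -> float:
    """Return the fraction of replicates for which `supports` is True.

    Hypothetical helper: `replicates` may be any per-replicate result
    (an MSA, a tree, a topology string) and `supports` is any yes/no test.
    """
    items = list(replicates)
    if not items:
        raise ValueError("ensemble is empty")
    return sum(1 for r in items if supports(r)) / len(items)


# Toy ensemble (made-up data): one genus topology per replicate tree.
topologies = ["(((A,B),G),D)"] * 7 + ["(((A,G),B),D)"] * 2 + ["((A,B),(G,D))"]
conf = ensemble_confidence(topologies, lambda t: t == "(((A,B),G),D)")
print(conf)  # 0.7
```

Comparing topologies as raw strings only works if every tree is written in the same canonical form; a real analysis would compare tree structures instead.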

Conflict of interest statement

The author declares no competing interests.

Figures

Fig. 1. Typical ensemble workflow for alignment and phylogeny assessment.
An ensemble of MSAs is generated and assessed for accuracy using Muscle5. Gray rectangles are processing steps performed by an algorithm or software package. First, Muscle5 (step 1) generates an ensemble of MSAs (step 2); each alignment is generated by a different combination of a perturbed HMM and a permuted guide tree. The accuracy of the MSAs can be assessed by Muscle5 (step 3) using accuracy metrics such as Column Confidence (CC). A phylogeny algorithm (step 4), e.g. maximum likelihood (ML), is used to predict a tree from each MSA (step 5). Finally, accuracy metrics, e.g. Ensemble Confidence (EC), are calculated from the resulting ensemble of trees (step 6). The Newick package (https://github.com/rcedgar/newick) was used to calculate the novel metrics described in this paper.
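
As a rough illustration of step 3, the sketch below computes a per-column confidence under an assumption that this page does not state: that the CC of a column is the fraction of ensemble MSAs in which exactly the same column (the same residues aligned together) appears. The function names and the toy alignments are hypothetical; this is not the Muscle5 code.

```python
def column_keys(msa):
    """List each column as a tuple of per-sequence residue indices (None = gap).

    `msa` is a list of equal-length aligned strings, with rows in the same
    sequence order in every replicate (an assumption of this sketch).
    """
    counters = [0] * len(msa)  # next ungapped residue index for each row
    keys = []
    for j in range(len(msa[0])):
        col = []
        for i, row in enumerate(msa):
            if row[j] == "-":
                col.append(None)
            else:
                col.append(counters[i])
                counters[i] += 1
        keys.append(tuple(col))
    return keys


def column_confidence(reference, ensemble):
    """For each column of `reference`, the fraction of ensemble MSAs containing it."""
    replicate_columns = [set(column_keys(m)) for m in ensemble]
    return [
        sum(key in cols for cols in replicate_columns) / len(replicate_columns)
        for key in column_keys(reference)
    ]


# Toy data (made up): two replicates that disagree on where the gap goes.
rep_a = ["AC-GT", "ACAGT"]
rep_b = ["ACG-T", "ACAGT"]
cc = column_confidence(rep_a, [rep_a, rep_b])
print(cc)  # [1.0, 1.0, 0.5, 0.5, 1.0]
```

Identifying a column by residue indices rather than by letters means two columns match only when the same positions of the same sequences are aligned, regardless of gap placement elsewhere.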
Fig. 2. Accuracy of Muscle5 on structure-based benchmarks.
a Average accuracy of Muscle5 ensemble replicates compared to Clustal-Omega, ProbCons and MAFFT on benchmarks of protein (top) and RNA (bottom) alignments; the default variant none.0 is the wider bar. b Correlation between AC and accuracy (fraction of correct columns) for protein (top) and RNA (bottom); Pearson's r = 0.80 and 0.84, respectively. c Probability that a column is correct after binning into CC percentage intervals: 0+ is 0% ≤ CC < 10%, 10+ is 10% ≤ CC < 20%, etc.; the last bar is CC = 100%. Thus CC is predictive of correctness and AC is predictive of accuracy.
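
The binning described for panel c can be sketched as follows. The interval labels follow the caption (0+, 10+, ..., then a separate bar for CC = 100%); the function names and the sample data are hypothetical and only illustrate the bookkeeping, not the benchmark results.

```python
from collections import defaultdict


def cc_bin_label(cc_percent: float) -> str:
    """Map a CC percentage to its bin label: '0+' is 0 <= CC < 10, ..., '100' is CC == 100."""
    if not 0.0 <= cc_percent <= 100.0:
        raise ValueError("CC must be a percentage between 0 and 100")
    if cc_percent == 100.0:
        return "100"
    return f"{int(cc_percent // 10) * 10}+"


def correctness_by_bin(columns):
    """Given (cc_percent, is_correct) pairs, return the fraction correct per CC bin."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for cc_percent, is_correct in columns:
        label = cc_bin_label(cc_percent)
        totals[label] += 1
        correct[label] += int(is_correct)
    return {label: correct[label] / totals[label] for label in totals}


# Made-up sample: (CC percentage, whether the column matches the reference).
sample = [(5.0, False), (7.0, True), (95.0, True), (100.0, True)]
by_bin = correctness_by_bin(sample)
print(by_bin)  # {'0+': 0.5, '90+': 1.0, '100': 1.0}
```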
Fig. 3. Replicate alignments of BBS11008.
Two replicate alignments of a segment in Balibase set BBS11008 are shown together with ribbon diagrams of two of its four structures (2pna and 1uur). This region comprises a well-conserved anti-parallel beta-sheet (green) which transitions into a variable exposed loop (magenta, outlined by rectangles). Sequence homology in the beta-sheet is unambiguous except for one gap (grey background), while both sequence and structure similarity are unclear in the loop, which is reflected by lower CC values (top histogram; CC was calculated from a diversified ensemble of 100 replicates).
Fig. 4. Ensemble confidence of coronavirus genus topologies and monophyly.
a Relative frequencies of tree topologies for coronavirus genera from a diversified ensemble using six different tree estimation methods. The rightmost bar shows the combined ensemble average with MEGA-NJ and FastTree excluded. A = Alphacoronavirus, B = Betacoronavirus, G = Gammacoronavirus and D = Deltacoronavirus. b Ensemble Monophyly (EM) of coronavirus genera. All six tree estimation methods give confidence > 96% to monophyly of all genera except for EM = 82.0% for Deltacoronavirus by RAxML.
Fig. 5. Bootstraps for coronavirus consensus genus topology.
The coronavirus genus topology is (((A,B),G),D) with high ensemble confidence (Fig. 4). Using the default Muscle5 MSA (none.0), this topology was reported by four of the six tree estimation methods with bootstraps as shown in the figure, where bootstrap values are mostly low.
Fig. 6. Ensemble frequencies of phylum topologies.
Relative frequencies of tree topologies of Ribovirus phyla from a diversified ensemble using six different tree estimation methods. The rightmost bar shows the combined ensemble average. D = Duplornaviricota, K = Kitrinoviricota, L = Lenarviricota, N = Negarnaviricota, P = Pisuviricota.
Fig. 7. Phylum topologies estimated from replicates abc.2 and bca.2.
HMM parameters are held fixed for making the MSAs (both have random seed 2) while the guide tree topology is permuted. All six tree estimation methods agree with each other on the topology for a given MSA, but the two topologies differ, so one or both must be wrong. Because the MSA construction is otherwise unchanged, the reproducible wrong tree must be induced by guide tree bias. Most methods give high bootstraps (shown in the table below the trees) for most or all of the edges.
Fig. 8. Conflicting coronavirus genus topologies in the literature.
Trees from four published papers reporting high bootstraps for conflicting genus topologies: Degroot2013, Wang2014, Woo2010 and Wang2014. Wang2014 estimated trees from three different alignments: (A) whole genomes, (B) spike protein and (C) nucleocapsid protein.
Fig. 9. Misaligned catalytic residues in RdRp MSAs.
Misalignments of essential catalytic residues were identified using Palmscan. a Ten representative sequences from the Wolf2018 alignment are shown. The top seven sequences place the catalytic glycine (G) in motif B in a different column than the bottom three. Sequence logos for the relevant Palmscan PSSMs are shown above and below the alignment. b Percentages of sequences with at least one catalytic residue misalignment and the total number of residue misalignments for Wolf2018 alignment S3, and the maximum and mean values on the corresponding Muscle5 ensemble. S3 is a subset alignment used by Wolf2018 to estimate the top-level (near-root) branching order of their tree; it contains 238 sequences. The equivalent Muscle5 ensemble has 249 sequences per MSA, with replicates constructed by selecting different subsets in addition to different alignment parameters. Note that all Muscle5 replicates have substantially fewer errors than S3.
Fig. 10. Monophylicity of RNA virus phyla.
a Wolf2018 tree showing high bootstrap values as reported in their paper. b RAxML tree estimated from M* (Muscle5 replicate with highest AC) showing high bootstrap values for a conflicting topology. c Mean false positive frequencies as a percentage of the best-fit subtree, averaged over a diverse ensemble. d Monophyly of the tree in panel of the best-fit subtree as percentage TP, FP and FN, respectively, averaged over a diverse ensemble. Note that TP is low, ranging from 43% for Pisuviricota to 65% for Negarnaviricota. Compare with Fig. 4b, which shows high monophyly of coronavirus genera.
