Automatic generation of primary sequence patterns from sets of related protein sequences
- PMID: 2296575
- PMCID: PMC53211
- DOI: 10.1073/pnas.87.1.118
Abstract
We have developed a computer algorithm that can extract the pattern of conserved primary sequence elements common to all members of a homologous protein family. The method involves clustering the pairwise similarity scores among a set of related sequences to generate a binary dendrogram (tree). The tree is then reduced in a stepwise manner by progressively replacing the node connecting the two most similar termini with a single common pattern until only one common "root" pattern remains. A pattern is generated at a node by (i) performing a local optimal alignment on the sequence/pattern pair connected by the node with the use of an extended dynamic programming algorithm and then (ii) constructing a single common pattern from this alignment, using a nested hierarchy of amino acid classes to identify the minimal inclusive amino acid class covering each paired set of elements in the alignment. Gaps within an alignment are created and/or extended using a "pay once" gap penalty rule, and gapped positions are converted into gap characters that function as 0 or 1 amino acid of any type during subsequent alignment. This method has been used to generate a library of covering patterns for homologous families in the National Biomedical Research Foundation/Protein Identification Resource protein sequence database. We show that a covering pattern can be more diagnostic for sequence family membership than any of the individual sequences used to construct the pattern.
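Step (ii) and the gap-character rule can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class hierarchy, the class symbols, the "." gap character, and the function names (`cover`, `merge_aligned`, `to_regex`) are all assumptions made for illustration, since the abstract does not list the classes or notation actually used. The sketch takes an already aligned pair as input, so the local alignment and "pay once" gap penalty of step (i) are not shown, and it uses exact regular-expression matching as a simplified stand-in for the "0 or 1 amino acid of any type" behavior during alignment.

```python
import re

ALL = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical nested hierarchy of amino acid classes (illustrative only).
CLASSES = {
    "a": frozenset("ILVM"),     # aliphatic (assumed grouping)
    "r": frozenset("FWY"),      # aromatic (assumed grouping)
    "h": frozenset("ILVMFWY"),  # hydrophobic = a + r
    "n": frozenset("DE"),       # negatively charged
    "p": frozenset("KRH"),      # positively charged
    "c": frozenset("DEKRH"),    # charged = n + p
    "X": frozenset(ALL),        # any amino acid (root of the hierarchy)
}
ALIGN_GAP = "-"    # gap introduced by the alignment step
PATTERN_GAP = "."  # assumed symbol for a pattern gap character


def members(sym):
    """Residues covered by a pattern element (class symbol or single residue)."""
    return CLASSES.get(sym, frozenset(sym))


def cover(x, y):
    """Smallest class in the hierarchy that includes both aligned elements."""
    if x in (ALIGN_GAP, PATTERN_GAP) or y in (ALIGN_GAP, PATTERN_GAP):
        return PATTERN_GAP  # gapped positions become gap characters
    if x == y:
        return x
    needed = members(x) | members(y)
    # "X" covers everything, so at least one candidate always exists.
    return min((len(s), name) for name, s in CLASSES.items() if needed <= s)[1]


def merge_aligned(a, b):
    """Build one common pattern from two equal-length aligned rows."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    return "".join(cover(x, y) for x, y in zip(a, b))


def to_regex(pattern):
    """Translate a pattern to a regex; a gap character matches 0 or 1 residue."""
    parts = []
    for sym in pattern:
        if sym == PATTERN_GAP:
            parts.append("[" + ALL + "]?")
        else:
            parts.append("[" + "".join(sorted(members(sym))) + "]")
    return "".join(parts)


pattern = merge_aligned("DIKF-A", "ELRWGA")
print(pattern)  # napr.A
print(bool(re.fullmatch(to_regex(pattern), "DIKFA")))   # True
print(bool(re.fullmatch(to_regex(pattern), "ELRWGA")))  # True
```

Because `cover` accepts class symbols as well as residues, the output of `merge_aligned` can be merged again with another aligned sequence or pattern, which mirrors the stepwise reduction of the tree toward a single root pattern described above.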