Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Motifs can display an abstract of the current literature on each of the motifs it finds.
Motifs looks for protein motifs by searching protein sequences for regular-expression patterns described in the PROSITE Dictionary. Motifs can recognize the patterns with some of the symbols mismatched, but not with gaps. Motifs can only be used to search for patterns in protein sequences.
There is a very informative abstract on every motif in the PROSITE Dictionary. These abstracts are included in the output if any motif is found in your sequence.
The PROSITE Dictionary was compiled and is maintained by Dr. Amos Bairoch of the University of Geneva.
Here is some of the output file:
MOTIFS from: PIR:Kihua

Mismatches: 0                              September 25, 1998 11:39  ..

KIHUA  Check: 1665  Length: 194  ! adenylate kinase (EC 2.7.4.3) 1 - human
______________________________________________________________________________

Adenylate_Kinase    (L,I,V,M,F,Y,W)3DG(F,Y,I)PRx3(N,Q)
                    (L,I,F){3}DG(Y)PRx{3}(Q)
              90: NTSKG    FLIDGYPREVQQ    GEEFE

    ******************************
    * Adenylate kinase signature *
    ******************************

Adenylate kinase (EC 2.7.4.3) (AK) [1] is a small monomeric enzyme that
catalyzes the reversible transfer of MgATP to AMP (MgATP + AMP = MgADP + ADP).

In mammals there are three different isozymes:

 - AK1 (or myokinase), which is cytosolic.
 - AK2, which is located in the outer compartment of mitochondria.
 - AK3 (or GTP:AMP phosphotransferase), which is located in the mitochondrial
   matrix and which uses MgGTP instead of MgATP.

The sequence of AK has also been obtained from different bacterial species and
from plants and fungi.

Two other enzymes have been found to be evolutionary related to AK. These are:

 - Yeast uridylate kinase (EC 2.7.4.-) (UK) (gene URA6) [2] which catalyzes
   the transfer of a phosphate group from ATP to UMP to form UDP and ADP.
 - Slime mold UMP-CMP kinase (EC 2.7.4.14) [3] which catalyzes the transfer
   of a phosphate group from ATP to either CMP or UMP to form CDP or UDP and
   ADP.

Several regions of AK family enzymes are well conserved, including the
ATP-binding domains. We have selected the most conserved of all regions as a
signature for this type of enzyme. This region includes an aspartic acid
residue that is part of the catalytic cleft of the enzyme and that is involved
in a salt bridge. It also includes an arginine residue whose modification leads
to inactivation of the enzyme.

-Consensus pattern: [LIVMFYW](3)-D-G-[FYI]-P-R-x(3)-[NQ]
-Sequences known to belong to this class detected by the pattern: ALL, except
 for Schistosoma mansoni (blood fluke) and Yersinia enterocolitica AK.
-Other sequence(s) detected in SWISS-PROT: NONE.
-Note: archaebacterial AK do not belong to this family [4].
-Last update: November 1997 / Pattern and text revised.

[ 1] Schulz G.E.
     Cold Spring Harbor Symp. Quant. Biol. 52:429-439(1987).
[ 2] Liljelund P., Sanni A., Friesen J.D., Lacroute F.
     Biochem. Biophys. Res. Commun. 165:464-473(1989).
[ 3] Wiesmueller L., Noegel A.A., Barzu O., Gerisch G., Schleicher M.
     J. Biol. Chem. 265:6339-6345(1990).
[ 4] Kath T.H., Schmid R., Schaefer G.
     Arch. Biochem. Biophys. 307:405-410(1993).

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Above each find, Motifs displays the regular expression that was matched ((L,I,V,M,F,Y,W)3DG(F,Y,I)PRx3(N,Q)). Below this is a simplification of the expression showing only the amino acids and repeat counts that were actually found ((L,I,F){3}DG(Y)PRx{3}(Q)). The find is displayed between the five residues that flank it on the N-terminal and C-terminal sides. The number to the left of the find is the coordinate of the first symbol of the motif (not of the flanking symbols). In the example above, 90 is the coordinate of the first F in FLIDGYPREVQQ, not of the first N in NTSKG.
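The coordinate and flanking conventions described above can be sketched in Python. This is only an illustration, not how Motifs itself is implemented: the regular expression is a hand-written Python equivalent of the Adenylate_Kinase pattern, the function name show_finds is hypothetical, and the test string is just the find plus its flanks from the sample output, not the full KIHUA sequence.

```python
import re

# Hand-written Python equivalent of (L,I,V,M,F,Y,W)3DG(F,Y,I)PRx3(N,Q).
ADENYLATE_KINASE = re.compile(r"[LIVMFYW]{3}DG[FYI]PR.{3}[NQ]")


def show_finds(sequence, regex=ADENYLATE_KINASE, flank=5):
    """Return one display line per find: coordinate, left flank, find, right flank.

    The coordinate is 1-based and refers to the first symbol of the find,
    not to the first flanking symbol.
    """
    lines = []
    for match in regex.finditer(sequence):
        begin, end = match.span()                 # 0-based, end exclusive
        left = sequence[max(0, begin - flank):begin]
        right = sequence[end:end + flank]
        lines.append(f"{begin + 1}: {left} {match.group()} {right}")
    return lines


# Fragment taken from the sample output (flanks plus find only).
fragment = "NTSKGFLIDGYPREVQQGEEFE"
for line in show_finds(fragment):
    print(line)
```

Because the fragment starts at the first flanking residue, the coordinate reported here is 6 rather than the 90 shown for the full-length sequence.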
The PROSITE Dictionary contains an extensive abstract summarizing current information for a motif. Motifs displays the abstract below each pattern that is found. If the same pattern is found in more than one sequence, the abstract is only shown below the pattern in the first sequence in which the pattern is found. Several different patterns may share the same abstract. If you want to reduce the size of your output you can suppress these abstracts with -NOREFerence. When abstracts are being suppressed there will be a filename, such as 0179.pdoc, that appears in parentheses below each pattern found. You can use the Fetch program to make a copy of this file in order to look at the abstract.
Motifs takes as input one or more protein sequence files. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*. If Motifs rejects your protein sequence, turn to Appendix VI to see how to change or set the type of a sequence.
FindPatterns and all of the Wisconsin Package(TM) mapping programs use the same search algorithm and pattern file format as Motifs. ProfileScan uses a database of profiles to find structural and sequence motifs in protein sequences.
Motif patterns may not be more than 350 characters long.
Motifs will not introduce gaps, but it can tolerate up to n mismatches when Number of allowed mismatches is set to n. Mismatched finds are shown in the output in lowercase. Mismatches cannot occur within NOT expressions (see the DEFINING PATTERNS topic below).
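As a rough illustration of matching with mismatches but without gaps, the following Python sketch slides a fixed-length pattern along a sequence and counts the positions that disagree. It is not the Motifs algorithm: the function name, the representation of a pattern as a list of allowed-residue sets, and the choice to lowercase only the mismatched symbols are all assumptions, and repeat ranges and NOT expressions are left out.

```python
def find_with_mismatches(sequence, positions, max_mismatches=0):
    """Find ungapped matches of a fixed-length pattern.

    positions is a list with one entry per pattern position: a set of
    allowed residues, or None for x (any residue).
    Returns (coordinate, displayed_find) tuples with 1-based coordinates.
    """
    finds = []
    width = len(positions)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        shown = []
        mismatches = 0
        for residue, allowed in zip(window, positions):
            if allowed is None or residue in allowed:
                shown.append(residue)
            else:
                mismatches += 1
                shown.append(residue.lower())  # mark the mismatch (assumed display)
        if mismatches <= max_mismatches:
            finds.append((start + 1, "".join(shown)))
    return finds


# The Rgd pattern (RGD) searched with zero and with one allowed mismatch.
rgd = [{"R"}, {"G"}, {"D"}]
print(find_with_mismatches("AARGEAA", rgd, max_mismatches=0))
print(find_with_mismatches("AARGEAA", rgd, max_mismatches=1))
```

No window is ever lengthened or shortened, which is why this kind of search can report mismatches but never gaps.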
In addition to your input protein sequence files, Motifs reads a local data file like the one below to find the search patterns. This file is modeled on the enzyme data files for the mapping programs described in Appendix VII. The offset field is not used by Motifs, but the field must have a number in it to make the file compatible with the mapping files.
The exact column used for each field does not matter, only the order of the fields in the line. You may give several patterns the same name, but put all of the entries for that name on adjacent lines of this file. The patterns may not be more than 350 characters long. Blank lines and lines that start with an exclamation point (!) are ignored.
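The field rules above can be sketched as a small Python reader. This is a minimal sketch under stated assumptions: the names PatternEntry and parse_pattern_lines are hypothetical, the four fields are taken to be name, offset, pattern, and PDoc name in that order, and the lines passed in are assumed to be data lines only (any heading text at the top of the file is not handled here).

```python
from typing import NamedTuple


class PatternEntry(NamedTuple):
    name: str
    offset: int      # not used by Motifs, but must be a number
    pattern: str
    pdoc_name: str


def parse_pattern_lines(lines):
    """Parse data lines; only the order of the fields matters, not the columns."""
    entries = []
    for line in lines:
        stripped = line.strip()
        # Blank lines and lines starting with "!" are ignored.
        if not stripped or stripped.startswith("!"):
            continue
        fields = stripped.split()
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
        name, offset, pattern, pdoc_name = fields
        if len(pattern) > 350:
            raise ValueError(f"pattern longer than 350 characters: {name}")
        entries.append(PatternEntry(name, int(offset), pattern, pdoc_name))
    return entries


sample = [
    "! a comment line",
    "",
    "1433_1    1   RNL(L,I)SV(G,A)YKN(I,V)   0633.pdoc",
    "Rgd  1  RGD  0016.pdoc",
]
for entry in parse_pattern_lines(sample):
    print(entry.name, entry.pattern, entry.pdoc_name)
```

Splitting on whitespace works here because patterns in this format contain no spaces.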
Here is part of the default data file used by Motifs:
PROSITETOGCG of: prosite.doc and prosite.dat  August 20, 1998 15:57

Release 15.0  (7/1998)

Name                Offset  Pattern ..                                     PDoc_Name

11s_Seed_Storage    1       NGx(D,E)2x(L,I,V,M,F)C(S,T)x{11,12}(P,A,G)D    0284.pdoc
1433_1              1       RNL(L,I)SV(G,A)YKN(I,V)                        0633.pdoc
1433_2              1       YK(D,E)STLIMQLL(R,H)DNLTLW(T,A)(S,A)           0633.pdoc
25a_Synth_1         1       GGSx(A,G)(K,R)xTxL(K,R)(G,S,T)xSD(A,G)         0653.pdoc
25a_Synth_2         1       RPVILDPx(D,E)PT                                0653.pdoc

////////////////////////////////////////////////////////////////////////////

Zinc_Finger_C2h2    1       Cx{2,4}Cx3(L,I,V,M,F,Y,W,C)x8Hx{3,5}H          0028.pdoc
Zinc_Finger_C3hc4   1       CxHx(L,I,V,M,F,Y)Cx2C(L,I,V,M,Y,A)             0449.pdoc
Zinc_Protease       1       (G,S,T,A,L,I,V,N)x2HE(L,I,V,M,F,Y,W)~(D,E,H,R,K,P ...
Zn2_Cy6_Fungal      1       (G,A,S,T,P,V)Cx2C(R,K,H,S,T,A,C,W)x2(R,K,H)x2Cx{5 ...
Zp_Domain           1       (L,I,V,M,F,Y,W)x7(S,T,A,P,D,N)x3(L,I,V,M,F,Y,W)x( ...
The PROSITE Dictionary contains a number of short sequence patterns that occur frequently in protein sequences. Most of these frequently found patterns are post-translational modifications, but more specific patterns such as leucine zippers also fall into this category. Such frequently found patterns are not normally shown by Motifs, but you can display them with
Here are some of the patterns that the PROSITE Dictionary classifies as frequently occurring:
;Amidation           1   xG(R,K)(R,K)                                0009.pdoc
;Asn_Glycosylation   1   N~(P)(S,T)~(P)                              0001.pdoc
;Camp_Phospho_Site   1   (R,K)2x(S,T)                                0004.pdoc
;Ck2_Phospho_Site    1   (S,T)x2(D,E)                                0006.pdoc
;Glycosaminoglycan   1   SGxG                                        0002.pdoc
;Leucine_Zipper      1   Lx6Lx6Lx6L                                  0029.pdoc
;Microbodies_Cter    1   (S,A,G,C,N)(R,K,H)(L,I,V,M,A,F)>            0299.pdoc
;Myristyl            1   G~(E,D,R,K,H,P,F,Y,W)x2(S,T,A,G,C,N)~(P)    0008.pdoc
;Pkc_Phospho_Site    1   (S,T)x(R,K)                                 0005.pdoc
;Rgd                 1   RGD                                         0016.pdoc
;Tyr_Phospho_Site    1   (R,K)x{2,3}(D,E)x{2,3}Y                     0007.pdoc
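To see why such short patterns are found so often, it can help to try one on a sequence. The Python sketch below uses a hand-written regular expression equivalent to the Asn_Glycosylation entry, N~(P)(S,T)~(P). The function name and the test sequence are made up for illustration, and the lookahead is a design choice of this sketch so that overlapping sites are not skipped; how Motifs treats overlapping finds is not stated here.

```python
import re

# N, then any residue except P, then S or T, then any residue except P.
# Wrapping the pattern in a lookahead lets finditer report overlapping sites.
ASN_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def glycosylation_sites(sequence):
    """Return (1-based coordinate, matched residues) for each candidate site."""
    return [(m.start() + 1, m.group(1)) for m in ASN_GLYCOSYLATION.finditer(sequence)]


# Invented test sequence: the first N qualifies, the second is followed by P.
print(glycosylation_sites("ANGSAANPSA"))
```

A four-residue pattern with two nearly unconstrained positions will match by chance in most proteins of ordinary length, which is the reason these entries are set apart from the more specific signatures.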
The PDoc_Name field in the pattern file prosite.patterns has the name of a PDoc (PROSITE Document) file containing the abstract for each pattern. You can use Fetch to look at any abstracts of interest. If you run Motifs with -NOREFerence, the name of the corresponding PDoc file is shown below each pattern found.
If you specify more than one sequence, Motifs displays each one's name on the screen as it is searched. However, unless you use -SHOw, the output file shows only those sequences in which a motif was actually found.
If you run Motifs with -NAMes, the output file is a list file. (See "Using List Files" in Chapter 2, Using Sequence Files and Databases of the User's Guide for more information about list files.)
With the publication of the PROSITE Dictionary, Amos Bairoch has shown that regular expressions can reliably recognize known protein pattern motifs. When new examples of a known motif are discovered, these expressions can usually be modified to recognize the new example. The process of modifying a regular expression so that it covers all of the members of a newly expanded family of similar sequence patterns could be referred to as "ambiguation."
The problem with regular expressions is that they often fail to recognize sequences that are not yet known to be members of the sequence family. You should consider using Profile technology if your aim is to bring together similar sequences whose association has not yet been recognized.
There are a few patterns in PROSITE that are defined with rules rather than regular expressions. Motifs does not look for these patterns.
FindPatterns, Map, MapSort, MapPlot, and Motifs all let you search with ambiguous expressions that match many different sequences. The expressions can include any legal GCG sequence character (see Appendix III). The expressions can also include several non-sequence characters, which are used to specify OR matching, NOT matching, begin and end constraints, and repeat counts. For instance, the expression TAATA(N){20,30}ATG means TAATA, followed by 20 to 30 of any base, followed by ATG. Following is an explanation of the syntax for pattern specification.
Parentheses () enclose one or more symbols that can be repeated some number of times. Braces {} enclose numbers that tell how many times the symbols within the preceding parentheses must be found.
Sometimes, you can leave out part of an expression. If braces appear without preceding parentheses, the numbers in the braces define the number of repeats for the immediately preceding symbol. One or both of the numbers within the braces may be missing. For instance, both the pattern GATG{2,}A and the pattern GATG{2}A mean GAT, followed by G repeated from 2 to 350,000 times, followed by A; the pattern GATG{}A means GAT, followed by G repeated from 0 to 350,000 times, followed by A; the pattern GAT(TG){,2}A means GAT, followed by TG repeated from 0 to 2 times, followed by A; the pattern GAT(TG){2,2}A means GAT, followed by TG repeated exactly 2 times, followed by A. (If the pattern in the parentheses is an OR expression (see below), it cannot be repeated more than 2,000 times.)
If you are searching nucleic acids, the ambiguity symbols defined in Appendix III let you define any combination of G, A, T, or C. If you are searching proteins, you can specify any of several symbol choices by enclosing the different choices in parentheses and separating the choices with commas. For instance, RGF(Q,A)S means RGF followed by either Q or A followed by S. The length of each choice need not be the same, and there can be up to 31 different choices within each set of parentheses. The pattern GAT(TG,T,G){1,4}A means GAT followed by any combination of TG, T, or G from 1 to 4 times followed by A. The sequence GATTGGA matches this pattern. There can be several parentheses in a pattern, but parentheses cannot be nested.
The pattern GC~CAT means GC, followed by any symbol except C, followed by AT. The pattern GC~(A,T)CC means GC, followed by any symbol except A or T, followed by CC.
The pattern <GACCAT can only be found if it occurs at the beginning of the sequence range being searched. Likewise, the pattern GACCAT> would only be found if it occurs at the end of the sequence range.
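The syntax rules above map fairly directly onto Python regular expressions. The translator below is a sketch only, not the search code used by Motifs or FindPatterns, and the function name gcg_to_regex is hypothetical. It makes several assumptions: a lowercase x means any symbol, a bare number after a symbol or parenthesis (as in x3 in the pattern file) is an exact repeat count, a single number in braces is treated as a lower bound as described in the text above, NOT groups contain single symbols only, and nucleotide ambiguity symbols are not expanded.

```python
import re


def gcg_to_regex(pattern):
    """Translate a GCG-style pattern into a Python regular expression string."""
    out = []
    i, n = 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "<":                      # must begin the searched range
            out.append("^")
            i += 1
        elif ch == ">":                    # must end the searched range
            out.append("$")
            i += 1
        elif ch == "~":                    # NOT: next symbol or parenthesized set
            i += 1
            if pattern[i] == "(":
                close = pattern.index(")", i)
                out.append("[^" + "".join(pattern[i + 1:close].split(",")) + "]")
                i = close + 1
            else:
                out.append("[^" + pattern[i] + "]")
                i += 1
        elif ch == "(":                    # OR choices (parentheses never nest)
            close = pattern.index(")", i)
            choices = pattern[i + 1:close].split(",")
            out.append("(?:" + "|".join(re.escape(c) for c in choices) + ")")
            i = close + 1
        elif ch == "{":                    # repeat range; missing numbers allowed
            close = pattern.index("}", i)
            low, comma, high = pattern[i + 1:close].partition(",")
            out.append("{%s,%s}" % (low or "0", high if comma else ""))
            i = close + 1
        elif ch.isdigit():                 # bare count, e.g. x3 (assumed exact)
            j = i
            while j < n and pattern[j].isdigit():
                j += 1
            out.append("{" + pattern[i:j] + "}")
            i = j
        elif ch == "x":                    # any symbol (assumption)
            out.append(".")
            i += 1
        else:
            out.append(re.escape(ch))
            i += 1
    return "".join(out)


regex = gcg_to_regex("GAT(TG,T,G){1,4}A")
print(regex, bool(re.fullmatch(regex, "GATTGGA")))
```

Translating to an existing regular expression engine keeps the sketch short, but it does not reproduce limits such as the 350-character pattern length or the maximum repeat counts mentioned above.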
The publication of the PROSITE Dictionary of Protein Sites and Patterns by Dr. Amos Bairoch of the University of Geneva is one of the great achievements of sequence analysis. Dr. Bairoch's prodigious efforts can be seen in every abstract of this extraordinary collection. His generosity in distributing it, and his patience in compiling it so carefully, put all of us in his debt.
You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.