MOTIFS

Table of Contents
FUNCTION
DESCRIPTION
OUTPUT
PROSITE ABSTRACTS
INPUT FILES
RELATED PROGRAMS
RESTRICTIONS
MISMATCHES
PATTERN FILE
FREQUENT MOTIFS
SUGGESTIONS
CONSIDERATIONS
DEFINING PATTERNS
ACKNOWLEDGMENTS
PARAMETER REFERENCE

FUNCTION

Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Motifs can display an abstract of the current literature on each of the motifs it finds.

DESCRIPTION

Motifs looks for protein motifs by searching protein sequences for regular-expression patterns described in the PROSITE Dictionary. Motifs can recognize the patterns with some of the symbols mismatched, but not with gaps. Motifs can only be used to search for patterns in protein sequences.

There is a very informative abstract on every motif in the PROSITE Dictionary. These abstracts are included in the output if any motif is found in your sequence.

The PROSITE Dictionary was compiled and is maintained by Dr. Amos Bairoch of the University of Geneva.

OUTPUT

Here is some of the output file:


 MOTIFS from: PIR:Kihua

 Mismatches: 0                September 25, 1998 11:39  ..

               KIHUA  Check: 1665  Length: 194   ! adenylate kinase (EC 2.7.4.3)
 1 - human

______________________________________________________________________________

Adenylate_Kinase      (L,I,V,M,F,Y,W)3DG(F,Y,I)PRx3(N,Q)
                           (L,I,F){3}DG(Y)PRx{3}(Q)
            90: NTSKG            FLIDGYPREVQQ            GEEFE

******************************
* Adenylate kinase signature *
******************************

Adenylate kinase  (EC 2.7.4.3) (AK) [1]  is  a  small  monomeric  enzyme  that
catalyzes the reversible transfer of MgATP to AMP (MgATP + AMP = MgADP + ADP).
In mammals there are three different isozymes:

 - AK1 (or myokinase), which is cytosolic.
 - AK2, which is located in the outer compartment of mitochondria.
 - AK3 (or GTP:AMP phosphotransferase),  which is located in the mitochondrial
   matrix and which uses MgGTP instead of MgATP.

The sequence of  AK has also  been  obtained from different  bacterial species
and from plants and fungi.

Two other enzymes have been found to be evolutionary related to AK. These are:

 - Yeast uridylate kinase  (EC 2.7.4.-) (UK)  (gene URA6) [2]  which catalyzes
   the transfer of a phosphate group from ATP to UMP to form UDP and ADP.
 - Slime mold UMP-CMP kinase (EC 2.7.4.14) [3] which catalyzes the transfer of
   a phosphate group from ATP to either CMP or UMP to form CDP or UDP and ADP.

Several regions of  AK  family enzymes  are well conserved, including the ATP-
binding domains.  We have  selected the  most conserved  of  all  regions as a
signature for this type  of  enzyme.   This  region includes  an aspartic acid
residue that is  part of the  catalytic  cleft  of  the  enzyme  and  that  is
involved in  a salt  bridge.    It  also  includes an  arginine  residue whose
modification leads to inactivation of the enzyme.

-Consensus pattern: [LIVMFYW](3)-D-G-[FYI]-P-R-x(3)-[NQ]
-Sequences known to belong to this class detected by the pattern: ALL,  except
 for Schistosoma mansoni (blood fluke) and Yersinia enterocolitica AK.
-Other sequence(s) detected in SWISS-PROT: NONE.

-Note: archaebacterial AK do not belong to this family [4].

-Last update: November 1997 / Pattern and text revised.

[ 1] Schulz G.E.
     Cold Spring Harbor Symp. Quant. Biol. 52:429-439(1987).
[ 2] Liljelund P., Sanni A., Friesen J.D., Lacroute F.
     Biochem. Biophys. Res. Commun. 165:464-473(1989).
[ 3] Wiesmueller L., Noegel A.A., Barzu O., Gerisch G., Schleicher M.
     J. Biol. Chem. 265:6339-6345(1990).
[ 4] Kath T.H., Schmid R., Schaefer G.
     Arch. Biochem. Biophys. 307:405-410(1993).
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Above each find, the regular expression found by the program is displayed ((L,I,V,M,F,Y,W)3DG(F,Y,I)PRx3(N,Q)). Below this is a simplification of the expression showing only the amino acids and ranges that were actually selected ((L,I,F){3}DG(Y)PRx{3}(Q)), so that you can better see what was found. The find is displayed between the five flanking residues on its N-terminal side and the five on its C-terminal side. The number to the left of the find is the coordinate of the first symbol of the motif (not of the flanking symbols). In the example above, 90 is the coordinate of the first F in FLIDGYPREVQQ, not of the first N in NTSKG.
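
The display convention described above can be sketched in a few lines of Python. This is only an illustration, not the code Motifs uses: the function name format_find and the toy sequence (84 placeholder residues followed by the region shown in the example) are assumptions made so that the first F falls at coordinate 90.

```python
def format_find(sequence: str, start: int, length: int, flank: int = 5) -> str:
    """Format a find as 'coord: left  MATCH  right'.

    start is the 1-based coordinate of the first symbol of the motif,
    not of the flanking symbols.
    """
    i = start - 1                                   # convert to a 0-based index
    left = sequence[max(0, i - flank):i]            # up to `flank` residues before
    match = sequence[i:i + length]                  # the motif itself
    right = sequence[i + length:i + length + flank] # up to `flank` residues after
    return f"{start}: {left}  {match}  {right}"


# Hypothetical toy sequence: 84 placeholder residues, then the example region.
toy = "A" * 84 + "NTSKGFLIDGYPREVQQGEEFE"
print(format_find(toy, 90, 12))   # prints: 90: NTSKG  FLIDGYPREVQQ  GEEFE
```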

PROSITE ABSTRACTS

The PROSITE Dictionary contains an extensive abstract summarizing current information for each motif. Motifs displays the abstract below each pattern that is found. If the same pattern is found in more than one sequence, the abstract is shown only below the pattern in the first sequence in which the pattern is found. Several different patterns may share the same abstract. If you want to reduce the size of your output, you can suppress these abstracts with -NOREFerence. When abstracts are suppressed, a filename, such as 0179.pdoc, appears in parentheses below each pattern found. You can use the Fetch program to make a copy of this file in order to look at the abstract.

INPUT FILES

Motifs takes as input one or more protein sequence files. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*. If Motifs rejects your protein sequence, turn to Appendix VI to see how to change or set the type of a sequence.

RELATED PROGRAMS

FindPatterns and all of the Wisconsin Package(TM) mapping programs use the same search algorithm and pattern file format as Motifs. ProfileScan uses a database of profiles to find structural and sequence motifs in protein sequences.

RESTRICTIONS

The pattern motifs may not be more than 350 characters long.

MISMATCHES

Motifs will not introduce gaps, but it can tolerate up to n mismatches when Number of allowed mismatches is set to n. Mismatched symbols in a find are shown in the output in lowercase. Mismatches cannot occur within NOT expressions (see the DEFINING PATTERNS topic below).
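
As a rough illustration of mismatch-tolerant, gap-free matching, the sketch below slides a fixed-length pattern along a sequence and counts the positions that disagree. It is not the algorithm Motifs itself uses. The function name find_with_mismatches, the representation of a pattern as a list of allowed-residue sets, and the test sequence are all assumptions; NOT expressions and variable repeat counts are left out.

```python
def find_with_mismatches(sequence, positions, max_mismatches):
    """Return (1-based coordinate, displayed find) for every window of the
    sequence that fits the pattern with at most max_mismatches mismatches.

    positions holds one set of allowed residues per pattern position.
    Mismatched residues are shown in lowercase; no gaps are introduced.
    """
    hits = []
    width = len(positions)
    for i in range(len(sequence) - width + 1):
        shown = []
        mismatches = 0
        for residue, allowed in zip(sequence[i:i + width], positions):
            if residue in allowed:
                shown.append(residue)
            else:
                mismatches += 1
                shown.append(residue.lower())
        if mismatches <= max_mismatches:
            hits.append((i + 1, "".join(shown)))
    return hits


# The pattern RGF(Q,A)S written out as one set per position.
pattern = [set("R"), set("G"), set("F"), set("QA"), set("S")]
sequence = "MKRGFQSLLRGYASK"   # hypothetical test sequence, uppercase

print(find_with_mismatches(sequence, pattern, 0))  # [(3, 'RGFQS')]
print(find_with_mismatches(sequence, pattern, 1))  # [(3, 'RGFQS'), (10, 'RGyAS')]
```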

PATTERN FILE

In addition to your input protein sequence files, Motifs reads a local data file like the one below to find the search patterns. This file is modeled on the enzyme data files for the mapping programs described in Appendix VII. The offset field is not used by Motifs, but the field must have a number in it to make the file compatible with the mapping files.

The exact column used for each field does not matter, only the order of the fields in the line. You may give several patterns the same name, but put all of the entries for that name on adjacent lines of this file. The patterns may not be more than 350 characters long. Blank lines and lines that start with an exclamation point (!) are ignored.

Here is part of the default data file used by Motifs:


PROSITETOGCG of: prosite.doc and prosite.dat  August 20, 1998 15:57

Release 15.0  (7/1998)

Name            Offset Pattern                  ..                 PDoc_Name

11s_Seed_Storage     1 NGx(D,E)2x(L,I,V,M,F)C(S,T)x{11,12}(P,A,G)D 0284.pdoc
1433_1               1 RNL(L,I)SV(G,A)YKN(I,V)                     0633.pdoc
1433_2               1 YK(D,E)STLIMQLL(R,H)DNLTLW(T,A)(S,A)        0633.pdoc
25a_Synth_1          1 GGSx(A,G)(K,R)xTxL(K,R)(G,S,T)xSD(A,G)      0653.pdoc
25a_Synth_2          1 RPVILDPx(D,E)PT                             0653.pdoc

////////////////////////////////////////////////////////////////////////////

Zinc_Finger_C2h2     1 Cx{2,4}Cx3(L,I,V,M,F,Y,W,C)x8Hx{3,5}H       0028.pdoc
Zinc_Finger_C3hc4    1 CxHx(L,I,V,M,F,Y)Cx2C(L,I,V,M,Y,A)          0449.pdoc
Zinc_Protease        1 (G,S,T,A,L,I,V,N)x2HE(L,I,V,M,F,Y,W)~(D,E,H,R,K,P ...
Zn2_Cy6_Fungal       1 (G,A,S,T,P,V)Cx2C(R,K,H,S,T,A,C,W)x2(R,K,H)x2Cx{5 ...
Zp_Domain            1 (L,I,V,M,F,Y,W)x7(S,T,A,P,D,N)x3(L,I,V,M,F,Y,W)x( ...
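
If you want to read entries from such a file in your own scripts, a minimal Python sketch follows. It relies only on the rules stated above: fields are identified by their order, and blank lines and lines starting with an exclamation point are ignored. The names PatternEntry and parse_pattern_lines are hypothetical, and the sketch assumes that the heading lines have already been skipped and that no pattern contains embedded blanks.

```python
from typing import List, NamedTuple


class PatternEntry(NamedTuple):
    name: str
    offset: int      # not used by Motifs, but must be a number
    pattern: str
    pdoc_name: str


def parse_pattern_lines(lines) -> List[PatternEntry]:
    """Parse data lines of a pattern file (heading lines already removed)."""
    entries = []
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue                      # blank lines and ! comments are ignored
        fields = line.split()             # column positions do not matter
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
        name, offset, pattern, pdoc_name = fields
        entries.append(PatternEntry(name, int(offset), pattern, pdoc_name))
    return entries


sample = [
    "! a comment line",
    "",
    "1433_1               1 RNL(L,I)SV(G,A)YKN(I,V)                     0633.pdoc",
    "25a_Synth_2          1 RPVILDPx(D,E)PT                             0653.pdoc",
]
for entry in parse_pattern_lines(sample):
    print(entry.name, entry.pattern, entry.pdoc_name)
```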

FREQUENT MOTIFS

The PROSITE Dictionary contains a number of short sequence patterns that occur frequently in protein sequences. Most of these frequently found patterns are post-translational modifications, but more specific patterns such as leucine zippers also fall into this category. Such frequently found patterns are not normally shown by Motifs, but you can display them by setting the search to include frequently found patterns (see the PARAMETER REFERENCE topic). More so than with other patterns in the PROSITE Dictionary, the presence of these frequently occurring patterns does not assure you that the protein actually has the corresponding function.

Here are some of the patterns that the PROSITE Dictionary classifies as frequently occurring:


;Amidation           1 xG(R,K)(R,K)                             0009.pdoc
;Asn_Glycosylation   1 N~(P)(S,T)~(P)                           0001.pdoc
;Camp_Phospho_Site   1 (R,K)2x(S,T)                             0004.pdoc
;Ck2_Phospho_Site    1 (S,T)x2(D,E)                             0006.pdoc
;Glycosaminoglycan   1 SGxG                                     0002.pdoc
;Leucine_Zipper      1 Lx6Lx6Lx6L                               0029.pdoc
;Microbodies_Cter    1 (S,A,G,C,N)(R,K,H)(L,I,V,M,A,F)>         0299.pdoc
;Myristyl            1 G~(E,D,R,K,H,P,F,Y,W)x2(S,T,A,G,C,N)~(P) 0008.pdoc
;Pkc_Phospho_Site    1 (S,T)x(R,K)                              0005.pdoc
;Rgd                 1 RGD                                      0016.pdoc
;Tyr_Phospho_Site    1 (R,K)x{2,3}(D,E)x{2,3}Y                  0007.pdoc
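
A back-of-the-envelope calculation shows why such short patterns turn up so often by chance. The Python sketch below estimates the expected number of chance finds for a fixed-length pattern, under the simplifying assumption (not made by Motifs or PROSITE) that all 20 amino acids are equally likely and independent at every position. The function name expected_chance_matches is hypothetical.

```python
def expected_chance_matches(set_sizes, sequence_length, alphabet_size=20):
    """Expected number of chance finds of a fixed-length pattern.

    set_sizes gives, for each pattern position, how many residues are
    allowed there (alphabet_size for an x position). Assumes uniform,
    independent residues, which real proteins only approximate.
    """
    probability = 1.0
    for size in set_sizes:
        probability *= size / alphabet_size
    windows = max(0, sequence_length - len(set_sizes) + 1)
    return probability * windows


# Pkc_Phospho_Site, (S,T)x(R,K), in a protein of 194 residues:
# roughly two chance finds are expected.
print(expected_chance_matches([2, 20, 2], 194))

# Adenylate_Kinase, (L,I,V,M,F,Y,W)3DG(F,Y,I)PRx3(N,Q), in the same protein:
# the expectation is smaller by many orders of magnitude.
print(expected_chance_matches([7, 7, 7, 1, 1, 3, 1, 1, 20, 20, 20, 2], 194))
```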

SUGGESTIONS

The PDoc_Name field in the pattern file prosite.patterns has the name of a PDoc (PROSITE Document) file containing the abstract for each pattern. You can use Fetch to look at any abstracts of interest. If you run Motifs with -NOREFerence, the name of the corresponding PDoc file is shown below each pattern found.

If you specify more than one sequence, Motifs displays each one's name on the screen as it is searched. However, unless you use -SHOw, the output file shows only those sequences in which a motif was actually found.

If you run Motifs with -NAMes, the output file is a list file. (See "Using List Files" in Chapter 2, Using Sequence Files and Databases of the User's Guide for more information about list files.)

CONSIDERATIONS

With the publication of the PROSITE Dictionary, Amos Bairoch has shown that regular expressions can reliably recognize known protein pattern motifs. When new examples of a known motif are discovered, these expressions can usually be modified to recognize the new example. The process of modifying a regular expression so that it covers all of the members of a newly expanded family of similar sequence patterns could be referred to as "ambiguation."

The problem with regular expressions is that they often fail to recognize sequences that are not yet known to be members of the sequence family. You should consider using Profile technology if your aim is to bring together similar sequences whose association has not yet been recognized.

There are a few patterns in PROSITE that are defined with rules rather than regular expressions. Motifs does not look for these patterns.

DEFINING PATTERNS

FindPatterns, Map, MapSort, MapPlot, and Motifs all let you search with ambiguous expressions that match many different sequences. The expressions can include any legal GCG sequence character (see Appendix III). The expressions can also include several non-sequence characters, which are used to specify OR matching, NOT matching, begin and end constraints, and repeat counts. For instance, the expression TAATA(N){20,30}ATG means TAATA, followed by 20 to 30 of any base, followed by ATG. Following is an explanation of the syntax for pattern specification.

Implied Sets and Repeat Counts

Parentheses () enclose one or more symbols that can be repeated some number of times. Braces {} enclose numbers that tell how many times the symbols within the preceding parentheses must be found.

Sometimes, you can leave out part of an expression. If braces appear without preceding parentheses, the numbers in the braces define the number of repeats for the immediately preceding symbol. One or both of the numbers within the braces may be missing. For instance, both the pattern GATG{2,}A and the pattern GATG{2}A mean GAT, followed by G repeated from 2 to 350,000 times, followed by A; the pattern GATG{}A means GAT, followed by G repeated from 0 to 350,000 times, followed by A; the pattern GAT(TG){,2}A means GAT, followed by TG repeated from 0 to 2 times, followed by A; the pattern GAT(TG){2,2}A means GAT, followed by TG repeated exactly 2 times, followed by A. (If the pattern in the parentheses is an OR expression (see below), it cannot be repeated more than 2,000 times.)
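
For readers who know Python regular expressions, the repeat counts above can be compared with hand-written equivalents for Python's re module. The translations below are illustrative assumptions, not output of any GCG program. They do not model the ceilings of 350,000 and 2,000 repeats. Note that a lone number in braces is a minimum in the syntax described here, whereas in Python {2} means exactly 2.

```python
import re

# GCG pattern -> hand-written Python equivalent (repeat ceilings not modeled)
EQUIVALENTS = {
    "GATG{2,}A":     r"GATG{2,}A",
    "GATG{2}A":      r"GATG{2,}A",        # lone number: a minimum, not an exact count
    "GATG{}A":       r"GATG*A",           # empty braces: zero or more
    "GAT(TG){,2}A":  r"GAT(?:TG){0,2}A",  # missing lower bound: zero
    "GAT(TG){2,2}A": r"GAT(?:TG){2}A",    # both bounds equal: exact count
}


def matches(gcg_pattern: str, sequence: str) -> bool:
    """True if the whole sequence fits the translated pattern."""
    return re.fullmatch(EQUIVALENTS[gcg_pattern], sequence) is not None


print(matches("GATG{2}A", "GATGGGGA"))        # True
print(matches("GATG{2}A", "GATGA"))           # False
print(matches("GATG{}A", "GATA"))             # True
print(matches("GAT(TG){,2}A", "GATTGTGTGA"))  # False
print(matches("GAT(TG){2,2}A", "GATTGTGA"))   # True
```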

OR Matching

If you are searching nucleic acids, the ambiguity symbols defined in Appendix III let you define any combination of G, A, T, or C. If you are searching proteins, you can specify any of several symbol choices by enclosing the different choices in parentheses and separating the choices with commas. For instance, RGF(Q,A)S means RGF followed by either Q or A followed by S. The length of each choice need not be the same, and there can be up to 31 different choices within each set of parentheses. The pattern GAT(TG,T,G){1,4}A means GAT followed by any combination of TG, T, or G from 1 to 4 times followed by A. The sequence GATTGGA matches this pattern. There can be several parentheses in a pattern, but parentheses cannot be nested.
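
The two OR examples above can be checked against equivalent Python regular expressions. The helper or_group is a hypothetical name introduced for this sketch. It enforces the limit of 31 choices mentioned above and builds a non-capturing alternation, which is one way (not necessarily the way GCG programs work internally) to express choices of unequal length.

```python
import re


def or_group(choices):
    """Build a Python alternation for a parenthesized, comma-separated set."""
    if not 1 <= len(choices) <= 31:
        raise ValueError("a set may hold from 1 to 31 choices")
    return "(?:" + "|".join(re.escape(choice) for choice in choices) + ")"


# RGF(Q,A)S
protein = re.compile("RGF" + or_group(["Q", "A"]) + "S")
# GAT(TG,T,G){1,4}A
nucleic = re.compile("GAT" + or_group(["TG", "T", "G"]) + "{1,4}A")

print(bool(protein.fullmatch("RGFAS")))    # True
print(bool(nucleic.fullmatch("GATTGGA")))  # True
print(bool(nucleic.fullmatch("GATA")))     # False: at least one choice is required
```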

NOT Matching

The pattern GC~CAT means GC, followed by any symbol except C, followed by AT. The pattern GC~(A,T)CC means GC, followed by any symbol except A or T, followed by CC.
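
In Python regular expressions, the closest counterpart of a NOT expression is a negated character class. The hand translations below are assumptions made for illustration. They cover only the two patterns above.

```python
import re

not_c = re.compile(r"GC[^C]AT")        # GC~CAT: any symbol except C
not_a_or_t = re.compile(r"GC[^AT]CC")  # GC~(A,T)CC: any symbol except A or T

print(bool(not_c.fullmatch("GCGAT")))       # True
print(bool(not_c.fullmatch("GCCAT")))       # False
print(bool(not_a_or_t.fullmatch("GCGCC")))  # True
print(bool(not_a_or_t.fullmatch("GCACC")))  # False
```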

Begin and End Constraints

The pattern <GACCAT can only be found if it occurs at the beginning of the sequence range being searched. Likewise, the pattern GACCAT> would only be found if it occurs at the end of the sequence range.
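
Putting the pieces of this syntax together, the Python sketch below translates a pattern into a regular expression for Python's re module. It is a simplified illustration, not the search algorithm used by Motifs: the function name gcg_to_regex is hypothetical, mismatches are not supported, and the repeat ceilings are not modeled. It also assumes that a lowercase x stands for any residue and that a bare number after a symbol or set (as in x3) is an exact count, as the patterns in the data file suggest.

```python
import re

_TOKEN = re.compile(
    r"(?P<neg>~)?"                                    # optional NOT
    r"(?:\((?P<group>[^()]*)\)|(?P<sym>[A-Za-z]))"    # (a,b,c) or one symbol
    r"(?:(?P<count>\d+)"                              # bare count, e.g. x3
    r"|\{(?P<lo>\d*)(?P<comma>,?)(?P<hi>\d*)\})?"     # braces, e.g. {2,4}
)


def gcg_to_regex(pattern: str) -> str:
    """Translate a pattern in the syntax described above to a Python regex."""
    out, pos, end = [], 0, len(pattern)
    if pattern.startswith("<"):          # begin constraint
        out.append("^")
        pos = 1
    anchored_end = pattern.endswith(">")  # end constraint
    if anchored_end:
        end -= 1
    while pos < end:
        m = _TOKEN.match(pattern, pos, end)
        if m is None:
            raise ValueError(f"cannot parse {pattern!r} at position {pos}")
        if m.group("group") is not None:
            choices = m.group("group").split(",")
            single = all(len(c) == 1 for c in choices)
            if m.group("neg"):
                if not single:
                    raise ValueError("NOT applies to single symbols only")
                unit = "[^" + "".join(map(re.escape, choices)) + "]"
            elif single:
                unit = "[" + "".join(map(re.escape, choices)) + "]"
            else:
                unit = "(?:" + "|".join(map(re.escape, choices)) + ")"
        else:
            sym = m.group("sym")
            if m.group("neg"):
                unit = "[^" + sym + "]"
            else:
                unit = "." if sym == "x" else sym   # assumption: x is any residue
        if m.group("count"):
            unit += "{" + m.group("count") + "}"
        elif m.group("comma") is not None:          # braces were present
            lo = m.group("lo") or "0"
            hi = m.group("hi") if m.group("comma") else ""  # lone number: minimum
            unit += "{" + lo + "," + hi + "}"
        out.append(unit)
        pos = m.end()
    if anchored_end:
        out.append("$")
    return "".join(out)


regex = gcg_to_regex("(L,I,V,M,F,Y,W)3DG(F,Y,I)PRx3(N,Q)")
print(regex)                                   # [LIVMFYW]{3}DG[FYI]PR.{3}[NQ]
hit = re.search(regex, "NTSKGFLIDGYPREVQQGEEFE")
print(hit.start() + 1, hit.group())            # 6 FLIDGYPREVQQ
print(gcg_to_regex("<GACCAT"), gcg_to_regex("GACCAT>"))   # ^GACCAT GACCAT$
```

With re.search, the ^ and $ anchors apply to the beginning and end of whatever string you pass in, which plays the role of the sequence range being searched.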

ACKNOWLEDGMENTS

The publication of the PROSITE Dictionary of Protein Sites and Patterns by Dr. Amos Bairoch of the University of Geneva is one of the great achievements of sequence analysis. Dr. Bairoch's prodigious efforts can be seen in every abstract of this extraordinary collection. His generosity in distributing it, and his patience in compiling it so carefully, put all of us in his debt.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

    search includes patterns that are frequently found in many proteins
    search excludes patterns that are frequently found in many proteins

The first setting displays frequently found patterns, such as post-translational modifications. The second setting omits them, which is what Motifs normally does.

Number of allowed mismatches

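When set to 1, causes Motifs to recognize places where patterns are found with one or fewer mismatches; in general, a value of n allows up to n mismatches. The display uses case to distinguish between matches and mismatches: mismatched symbols are shown in lowercase.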

Printed: January 13, 1999 6:27 (1162)