FINDPATTERNS

Table of Contents
FUNCTION
DESCRIPTION
OUTPUT
INPUT FILES
RELATED PROGRAMS
RESTRICTIONS
DEFINING PATTERNS
CONSIDERATIONS
SPECIFYING SEQUENCES
LARGE DATA SETS
PARAMETER REFERENCE

FUNCTION

FindPatterns identifies sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal.

DESCRIPTION

FindPatterns locates short sequence patterns. If you are trying to find a pattern in a sequence or if you know of a sequence that you think occurs somewhere within a larger one, you can find your place with FindPatterns. FindPatterns can look through large data sets for any short sequence patterns you specify. FindPatterns can recognize patterns with some symbols mismatched but not with gaps. It supports the IUPAC-IUB nucleotide ambiguity codes (see Appendix III) for searching through nucleotide sequences.

FindPatterns searches both strands of a nucleotide sequence if the patterns you specify are not identical on both strands. If your sequence is a protein, FindPatterns searches for a simple symbol match between your pattern and the protein sequence.

FindPatterns names each sequence on the screen as it is searched. The output file shows only sequences where a pattern was found unless you use -SHOw. Five residues from the original sequence are shown on either side of each "find." The word /Rev occurs if the reverse of the pattern is found. If you run FindPatterns with -NAMes, the output file is written as a list file, which you can use as input to other Wisconsin Package(TM) programs that support indirect file specifications.

FindPatterns writes all of its results in the same output file.

OUTPUT

Here is some of the output file:


! FINDPATTERNS on genbank:humig* allowing 0 mismatches

! Using patterns from: pattern.dat  October 8, 1998 09:37 ..

            HUMIG10H  ck: 1075  len: 408   ! L38425 Homo sapiens Ig rearra

BamHI                 GGATCC
           170: GGGCT GGATCC GCCAG

             HUMIG1L  ck: 7509  len: 318   ! L38432 Homo sapiens Ig rearra

BamHI                 GGATCC
           167: CTCAG GGATCC CTGAG

             HUMIG9H  ck: 8709  len: 426   ! L38435 Homo sapiens Ig rearra

BamHI                 GGATCC
            17: CTGGA GGATCC TTTTC

           HUMIGAMKB  ck: 9203  len: 406   ! L28050 Human Ig rearranged al

BamHI                 GGATCC
           144: GAGCT GGATCC GTCAG

//////////////////////////////////////////////////////////////////////////

          HUMIGHZBV  ck: 4060  len: 234   ! L23278 Human rearranged IgH chain

BamHI                 GGATCC
           229: CCAAG GGATCC

           HUMIGKVAC  ck: 9098  len: 1,331 ! M23090 Human germline IgK chain

Promotor              TAATA(N){20,30}ATG
                        TAATAN{24}ATG
         1,177: CAGTA TAATAACTGGCCTCCCACAGTGATTCAACATG AAACA

            HUMIGL1A  ck: 6825  len: 4,523 ! M77640 Homo sapiens L1 cell

BamHI                 GGATCC
         1,129: CAACG GGATCC CTGTG
         3,833: CCTTG GGATCC AGGCC
         3,905: GCCTC GGATCC CCTTC

           HUMIGLVAV  ck: 7923  len: 336   ! L33443 Homo sapiens (clone Lc4)

Promotor /Rev         CAT(N){20,30}TATTA
                        CATN{21}TATTA
            60: GTCAC CATCACTTGTCGGGCGAGTCAGAGTATTA GCAGC

           HUMIGLYM1  ck: 9847  len: 881   ! D01059 Human immunoglobulin

BamHI                 GGATCC
           287: TCTCT GGATCC AAAGA
           787: CTGCA GGATCC CAGGG

EcoRI                 GAATTC
             2:     G GAATTC CGGGT

            HUMIGVK1  ck: 8546  len: 324   ! D38039 Human mRNA for immunoglobul

EcoRI                 GAATTC
           208: GGACA GAATTC ACTCT

 Databases searched:
        GenBank, Release 108.0, Released on 16Aug1998, Formatted on 17Aug1998

     Total finds:        695
    Total length:    738,064
 Total sequences:      1,596
        CPU time:      07.25

If the pattern is a complex expression, it is written above each find along with a simplification of the ambiguous parts of the pattern so that you can see what was actually found. In the example above, the Promotor /Rev pattern CAT(N){20,30}TATTA is the pattern being searched for, and CATN{21}TATTA is the pattern actually found in HUMIGLVAV. Five residues from the original sequence are shown on either side of the find. For that find, 60 is the coordinate of the first C in CATCACTT ..., not of the G in the flanking residues GTCAC.
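The layout of a find line (coordinate of the first matched symbol, then up to five flanking residues on each side) can be sketched in Python. This is only an illustration of the convention described above; the function name format_find and the column width are assumptions, not part of FindPatterns.

```python
def format_find(seq, start, length, flank=5):
    """Format one find; start is the 0-based index of the find in seq.

    The printed coordinate is 1-based and refers to the first symbol of
    the find itself, not to the flanking residues.
    """
    left = seq[max(0, start - flank):start]
    hit = seq[start:start + length]
    right = seq[start + length:start + length + flank]
    # Right-justify the left flank so finds near the start still line up.
    return f"{start + 1:>15,}: {left:>{flank}} {hit} {right}".rstrip()


print(format_find("GGAATTCCGGGT", 1, 6))
```

A find at the very start or end of a sequence simply shows fewer flanking residues, as in the EcoRI find at coordinate 2 in the sample output.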

INPUT FILES

FindPatterns takes single or multiple sequences as input. The function of FindPatterns depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

The Wisconsin Package mapping programs Map, MapPlot and MapSort can be used to mark finds in the context of a DNA restriction map. Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. These programs all use the same search algorithm and input data file format as FindPatterns.

RESTRICTIONS

Patterns typed in from the terminal may not be longer than 132 characters. Patterns from a data file may not be longer than 350 characters.

FindPatterns can search for a maximum of 10,000 patterns in a nucleotide search. If your pattern.dat file contains more than 10,000 patterns, only the first 10,000 are used.

The restrictions specified using Minimum number of occurrences and Maximum number of occurrences refer to the number of times a pattern is found in a sequence and must be fulfilled on a single strand of a nucleotide sequence for the find to be reported. For instance, if you set Minimum number of occurrences to 2 and Patterns to be found to CCCC with the sequence CCCCGGGG, no finds are reported: although there are two finds in total, there is only one instance of the pattern on each strand.
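The per-strand rule can be illustrated with a short Python sketch. It assumes exact matching of unambiguous A, C, G, T symbols and counts overlapping occurrences; the helper names are hypothetical and the real program's counting details may differ.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def count_occurrences(pattern, seq):
    """Count exact occurrences of pattern in seq, overlaps included."""
    count, start = 0, seq.find(pattern)
    while start != -1:
        count += 1
        start = seq.find(pattern, start + 1)
    return count


def strand_counts(pattern, seq):
    """Return (top-strand count, bottom-strand count)."""
    return (count_occurrences(pattern, seq),
            count_occurrences(pattern, reverse_complement(seq)))


def passes_minimum(pattern, seq, minimum):
    """The minimum must be met on a single strand, not summed over both."""
    return any(n >= minimum for n in strand_counts(pattern, seq))


print(strand_counts("CCCC", "CCCCGGGG"))      # one find on each strand
print(passes_minimum("CCCC", "CCCCGGGG", 2))  # minimum of 2 is not met
```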

DEFINING PATTERNS

FindPatterns, Map, MapSort, MapPlot, and Motifs all let you search with ambiguous expressions that match many different sequences. The expressions can include any legal GCG sequence character (see Appendix III). The expressions can also include several non-sequence characters, which are used to specify OR matching, NOT matching, begin and end constraints, and repeat counts. For instance, the expression TAATA(N){20,30}ATG means TAATA, followed by 20 to 30 of any base, followed by ATG. Following is an explanation of the syntax for pattern specification.
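As a rough analogy, the expression TAATA(N){20,30}ATG behaves much like a regular expression. The Python sketch below hand-translates it (N becomes the character class [ACGT]) and reports a simplified form in the style of the sample output. The function name simplify and the choice of regular-expression matching are assumptions for illustration; FindPatterns uses its own search algorithm.

```python
import re

# Hand translation of TAATA(N){20,30}ATG; N is taken to mean any of A, C, G, T.
PROMOTOR = re.compile(r"(TAATA)([ACGT]{20,30})(ATG)")


def simplify(seq):
    """Return (1-based coordinate, simplified pattern) for the first find."""
    match = PROMOTOR.search(seq)
    if match is None:
        return None
    head, gap, tail = match.groups()
    # Replace the ambiguous run with its actual length, e.g. N{24}.
    return match.start() + 1, f"{head}N{{{len(gap)}}}{tail}"


# Find plus flanking residues, copied from the HUMIGKVAC entry in OUTPUT.
fragment = "CAGTA" + "TAATAACTGGCCTCCCACAGTGATTCAACATG" + "AAACA"
print(simplify(fragment))
```

The coordinate here is relative to the short fragment, not to the full database entry.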

Implied Sets and Repeat Counts

Parentheses () enclose one or more symbols that can be repeated some number of times. Braces {} enclose numbers that tell how many times the symbols within the preceding parentheses must be found.

Sometimes, you can leave out part of an expression. If braces appear without preceding parentheses, the numbers in the braces define the number of repeats for the immediately preceding symbol. One or both of the numbers within the braces may be missing. For instance, both the pattern GATG{2,}A and the pattern GATG{2}A mean GAT, followed by G repeated from 2 to 350,000 times, followed by A; the pattern GATG{}A means GAT, followed by G repeated from 0 to 350,000 times, followed by A; the pattern GAT(TG){,2}A means GAT, followed by TG repeated from 0 to 2 times, followed by A; the pattern GAT(TG){2,2}A means GAT, followed by TG repeated exactly 2 times, followed by A. (If the pattern in the parentheses is an OR expression (see below), it cannot be repeated more than 2,000 times.)
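Readers who know regular expressions may find it helpful to compare the repeat syntax with Python's re module. The translations below are written by hand for the examples in this section and are an analogy only; the dictionary and function names are hypothetical, and the 350,000 upper limit is not modeled.

```python
import re

# GCG pattern -> hand-written Python regex (an analogy, not FindPatterns itself).
REPEAT_EXAMPLES = {
    "GATG{2,}A": r"GATG{2,}A",           # G repeated 2 or more times
    "GATG{}A": r"GATG*A",                # G repeated 0 or more times
    "GAT(TG){,2}A": r"GAT(?:TG){0,2}A",  # TG repeated 0 to 2 times
    "GAT(TG){2,2}A": r"GAT(?:TG){2}A",   # TG repeated exactly 2 times
}


def matches(gcg_pattern, seq):
    """True if seq, taken as a whole, fits the hand-translated pattern."""
    return re.fullmatch(REPEAT_EXAMPLES[gcg_pattern], seq) is not None


print(matches("GATG{2,}A", "GATGGGA"))     # three Gs satisfy "2 or more"
print(matches("GAT(TG){,2}A", "GATTGTGTGA"))  # three TG repeats exceed the limit
```

One difference to keep in mind: in Python, {2} means exactly two repeats, whereas this section defines GATG{2}A as two or more.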

OR Matching

If you are searching nucleic acids, the ambiguity symbols defined in Appendix III let you define any combination of G, A, T, or C. If you are searching proteins, you can specify any of several symbol choices by enclosing the different choices in parentheses and separating the choices with commas. For instance, RGF(Q,A)S means RGF followed by either Q or A followed by S. The length of each choice need not be the same, and there can be up to 31 different choices within each set of parentheses. The pattern GAT(TG,T,G){1,4}A means GAT followed by any combination of TG, T, or G from 1 to 4 times followed by A. The sequence GATTGGA matches this pattern. There can be several parentheses in a pattern, but parentheses cannot be nested.

NOT Matching

The pattern GC~CAT means GC, followed by any symbol except C, followed by AT. The pattern GC~(A,T)CC means GC, followed by any symbol except A or T, followed by CC.

Begin and End Constraints

The pattern <GACCAT can only be found if it occurs at the beginning of the sequence range being searched. Likewise, the pattern GACCAT> would only be found if it occurs at the end of the sequence range.
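The OR, NOT, and begin/end examples above also have close regular-expression counterparts. The Python sketch below pairs each example with a hand-written translation; it is an analogy under the assumption of unambiguous, uppercase sequences, and the names SYNTAX_EXAMPLES and found are hypothetical.

```python
import re

# GCG pattern -> hand-written Python regex (illustrative only).
SYNTAX_EXAMPLES = {
    "RGF(Q,A)S": r"RGF(?:Q|A)S",                  # OR: Q or A
    "GAT(TG,T,G){1,4}A": r"GAT(?:TG|T|G){1,4}A",  # OR with a repeat count
    "GC~CAT": r"GC[^C]AT",                        # NOT: any symbol except C
    "GC~(A,T)CC": r"GC[^AT]CC",                   # NOT: any symbol except A or T
    "<GACCAT": r"^GACCAT",                        # must start the searched range
    "GACCAT>": r"GACCAT$",                        # must end the searched range
}


def found(gcg_pattern, seq):
    """True if the hand-translated pattern occurs anywhere in seq."""
    return re.search(SYNTAX_EXAMPLES[gcg_pattern], seq) is not None


print(found("GAT(TG,T,G){1,4}A", "GATTGGA"))  # the example sequence from OR Matching
print(found("<GACCAT", "TTGACCAT"))           # present, but not at the beginning
```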

CONSIDERATIONS

FindPatterns will not introduce gaps, but it can tolerate mismatches when you set Number of allowed mismatches in the pattern match to a value greater than zero. Mismatched finds are shown in the output in lowercase.
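Matching with mismatches but without gaps amounts to comparing the pattern against every window of the same length. The Python sketch below shows the idea for a simple, unambiguous pattern. The function name is hypothetical, and lowercasing only the mismatched symbols is an assumption about the display, which the text above does not specify in detail.

```python
def find_with_mismatches(pattern, seq, max_mismatches):
    """Return (1-based coordinate, displayed find) for each acceptable window.

    No gaps are introduced: every window has exactly len(pattern) symbols.
    """
    finds = []
    for start in range(len(seq) - len(pattern) + 1):
        window = seq[start:start + len(pattern)]
        mismatches = sum(1 for s, p in zip(window, pattern) if s != p)
        if mismatches <= max_mismatches:
            # Assumption: show mismatched symbols in lowercase.
            shown = "".join(s if s == p else s.lower()
                            for s, p in zip(window, pattern))
            finds.append((start + 1, shown))
    return finds


print(find_with_mismatches("GAATTC", "TTGACTTCAA", 1))
```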

SPECIFYING SEQUENCES

There is information on specifying sets of sequences in Chapter 2, Using Sequence Files and Databases of the User's Guide.

LARGE DATA SETS

FindPatterns is one of the few programs in the Wisconsin Package that can take more than a few minutes to run. Large searches should probably be run in the batch queue.

Patterns that start with complicated OR or NOT expressions take longer to search than simple expressions like GATTC.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

Patterns to be found

specifies the patterns to be found. Use commas to separate multiple patterns. To specify a comma as part of a pattern, enclose the whole pattern in quotes. For more information see the DEFINING PATTERNS topic.

Number of allowed mismatches in the pattern match

causes the program to recognize sites that are like the recognition site but with one (or more) mismatches. If you allow too many mismatches, you may get ridiculous results. The output from most mapping programs distinguishes between real sites and sites with one or more mismatches.

Search only the top strand of nucleotide sequences
Search both strands of nucleotide sequences

select whether FindPatterns searches only the top strand or both strands of nucleotide sequences.

Search all sequences as if they were circular

searches past the end of the sequence into the beginning of the sequence as if the molecule were continuous. Patterns that span the origin can only be found if the search is Search all sequences as if they were circular.
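One common way to find patterns that span the origin is to append the first few symbols of the sequence to its end before searching. The Python sketch below uses that approach for a simple exact pattern; it is an illustration, not a description of how FindPatterns is implemented, and it assumes the pattern is no longer than the sequence.

```python
def find_circular(pattern, seq):
    """Return 1-based coordinates of pattern in seq, treating seq as circular."""
    # Appending len(pattern) - 1 leading symbols lets a window wrap the origin
    # without reporting any start position twice.
    extended = seq + seq[:len(pattern) - 1]
    finds = []
    start = extended.find(pattern)
    while start != -1:
        finds.append(start + 1)
        start = extended.find(pattern, start + 1)
    return finds


print(find_circular("GAATTC", "TTCAAAAGAA"))  # GAA at the end joins TTC at the start
```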

Do an "overlapping-set" search if nucleotide sequences

makes an overlap set map instead of the usual subset map. If your sequence is very ambiguous (as for instance a back-translated sequence would be) and you want to see where restriction sites could be, then you should create an overlap-set map. Overlap-set and subset pattern recognition are discussed in more detail in the Program Manual entry for Window.

Accept only perfect (nonambiguous) matches

sets the program to look for a perfect alphabetic match between the site and the sequence. Ambiguity codes are normally expanded so that the site RXY would find sequences like ACT or GAC. With this parameter the ambiguity codes are not expanded so the site RXY would only match the sequence RXY.
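The difference between expanded and perfect matching can be sketched in Python by turning each site symbol into a regular-expression character class. The table uses the standard IUPAC nucleotide codes; treating X as any base follows the RXY example above. The function name site_to_regex and its perfect argument are hypothetical.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
    "X": "ACGT",  # assumption: X means any base, as in the RXY example
}


def site_to_regex(site, perfect=False):
    """Translate a site into a regex, expanding ambiguity codes unless perfect."""
    if perfect:
        return re.escape(site)  # alphabetic match only: R matches R
    return "".join(f"[{IUPAC[symbol]}]" for symbol in site)


print(site_to_regex("RXY"))
print(bool(re.fullmatch(site_to_regex("RXY"), "ACT")))
print(bool(re.fullmatch(site_to_regex("RXY", perfect=True), "ACT")))
```

This sketch covers only an ambiguous site searched against an unambiguous sequence; it does not model ambiguity in the sequence itself.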

Minimum number of occurrences

excludes patterns that are not found at least the specified number of times.

Maximum number of occurrences

excludes patterns found more than the specified number of times.

Show patterns that occur only once

excludes patterns found in your sequence more than once.

Exclude patterns found between positions base1,base2

excludes patterns found anywhere within one or more ranges of the sequence. If a pattern is found within an excluded range, then the pattern is not displayed. The ranges are defined with sets of two numbers. The numbers are separated by commas. Spaces between numbers are not allowed. The numbers must be integers that fall within the sequence beginning and ending points you have chosen. The range may be circular if the sequence being analyzed is circular. Exclusion is not done if there are any non-numeric characters in the numbers or numbers out of range or if there is an odd number of integers following the parameter.
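The validation rules for the range list can be sketched in Python as follows. The function name parse_exclusions and its begin and end arguments are hypothetical; the sketch only checks and pairs the numbers and does not handle circular ranges.

```python
def parse_exclusions(text, begin, end):
    """Return (base1, base2) pairs, or [] if the specification is rejected.

    Rejected cases mirror the rules above: non-numeric characters (including
    spaces), numbers outside begin..end, or an odd count of integers.
    """
    parts = text.split(",")
    if len(parts) % 2 != 0:
        return []
    if not all(p.isascii() and p.isdigit() for p in parts):
        return []
    numbers = [int(p) for p in parts]
    if any(n < begin or n > end for n in numbers):
        return []
    # Pair the numbers up: 1st with 2nd, 3rd with 4th, and so on.
    return list(zip(numbers[0::2], numbers[1::2]))


print(parse_exclusions("10,20,50,60", 1, 100))
print(parse_exclusions("10,20,50", 1, 100))
```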

Enter earliest date numerically as m.yy

limits the search to sequences that have been entered into the database or modified since the date you specify. As this is being written, only the EMBL, GenBank, and SWISS-PROT databases support this parameter.

Printed: January 13, 1999 6:27 (1162)