COMPARE

Table of Contents
FUNCTION
DESCRIPTION
OUTPUT
INPUT FILES
RELATED PROGRAMS
RESTRICTIONS
ALGORITHM
CONSIDERATIONS
SUGGESTIONS
PARAMETER REFERENCE

FUNCTION

Compare compares two protein or nucleic acid sequences and creates a file of the points of similarity between them for plotting with DotPlot. Compare finds the points using either a window/stringency or a word match criterion. The word comparison is about 1,000 times faster than the window/stringency comparison, but somewhat less sensitive.

DESCRIPTION

Compare is the first program of a two-program set that produces dot-plots. Compare compares two sequences and writes a file of the points where matches of a certain quality are found. The points in the output file can be plotted with the DotPlot program. Dot-plotting is the best method in the Wisconsin Package(TM) for comparing two sequences when you suspect that there could be more than one segment of similarity between the two.

Compare makes a file with the coordinates of each point where two sequences are similar. The sequences are compared in every possible register and a point is added to the file wherever some match criterion for similarity is met. The match criterion can be met in two different ways:

The standard way compares two sequences in every register, searching for all the places where a given number of matches (stringency) occur within a given range (window). See Maizel and Lenk (1981) "Enhanced Graphic Matrix Analysis of Nucleic Acid and Protein Sequences" Proc. Natl. Acad. Sci. USA 78; 7665-7669 for a description of the matrix analysis of biological sequences.

The other way to find points of similarity is to search for short perfect matches of some set length. Short perfect matches are referred to as words. The word comparison between two sequences is about 1,000 times faster than the window/stringency match described above, but it requires that the sequences contain short perfect matches for any similarity to be found. Word comparison is discussed in detail by Wilbur and Lipman (1983) "Rapid Similarity Searches of Nucleic Acid and Protein Data Banks" Proc. Natl. Acad. Sci. USA 80; 726-730. The authors refer to a word as a k-tuple. Compare does a word comparison if it is run with -WORdsize.

You may limit the number of points that Compare finds with -LIMit.

OUTPUT

The output file from this session can be read by the DotPlot program to produce a dot-plot. The plots generated by DotPlot from this session and from another session with -WORdsize=8 are shown in the figures in the Program Manual below. The example session with DotPlot uses the file from this session with Compare. Here is part of the output file:


 COMPARE of: hpr.seq  check: 8102  from: 1  to: 2966

Haptoglobin related sequence
HindIII fragment sequenced 12/27/83
  (partially from hpf sequence)

 *** To: hpf.seq  check: 2624  from: 1  to: 2740

Haptoglobin alpha2
HindIII fragment , region equivalent to hp1f

 Window: 21  Stringency: 14  Points: 4986  September 27, 1998 12:15  ..

    131   2639    187   2624    276   2670    277   2671    278   2672
     94   2454     95   2455     96   2456    128   2389    132   2389
     32   2281    146   2389    164   2389     47   2098    656   2662

    //////////////////////////////////////////////////////////////////

   2861    123   2864    126   2865    127   2866    128   2867    129
   2911     56      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
       hpr.seq   8102      1   2966  F
       hpf.seq   2624      1   2740  F
            21     14      0  COMPARE
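
If you want to read the points with your own software, the coordinate lines in the listing above can be picked apart as in the following minimal Python sketch. It rests on assumptions read off the sample rather than on a documented file specification: each coordinate line is taken to hold whitespace-separated integer pairs, the first number of each pair is taken to belong to the first sequence, and pairs of zeros are taken to be padding. The function name parse_point_line is hypothetical.

```python
def parse_point_line(line):
    """Return the coordinate pairs found on one line of a point file.

    Assumptions (inferred from the sample listing, not from a format spec):
    a coordinate line holds only whitespace-separated integers, in pairs,
    and a pair of zeros is padding rather than a real point.
    """
    fields = line.split()
    # Header, separator, and trailer lines contain non-numeric text: skip them.
    if not fields or len(fields) % 2 or not all(f.isdigit() for f in fields):
        return []
    numbers = [int(f) for f in fields]
    pairs = list(zip(numbers[0::2], numbers[1::2]))
    return [pair for pair in pairs if pair != (0, 0)]
```

Lines that contain anything other than integers, such as the heading and the file names at the end, yield an empty list.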

INPUT FILES

Compare accepts two individual nucleotide sequences or protein sequences as input. The function of Compare depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

DotPlot makes a dot-plot with the output file from Compare or StemLoop.

StemLoop finds stems (inverted repeats) within a sequence. You specify the minimum stem length, minimum and maximum loop sizes, and the minimum number of bonds per stem. All stems or only the best stems can be displayed on your screen or written into a file.

BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal alignments are found by inserting gaps to maximize the number of matches using the local homology algorithm of Smith and Waterman.

Repeat finds direct repeats in sequences. You must set the size, stringency, and range within which the repeat must occur; all the repeats of that size or greater are displayed as short alignments.

RESTRICTIONS

No more than 200,000 points may be produced in the plot file. The point files can be quite large and should be deleted as soon as they have been examined. For a window/stringency comparison, the window must be between 1 and 100; for a word comparison, the word size must be between 1 and 25.

ALGORITHM

Compare makes a file of every point where two sequences are similar according to a set match criterion. The points are the Cartesian coordinates of each point of similarity in units of the original sequence coordinates. If the window is greater than 1, the point recorded by Compare is in the middle of the window.

Window/Stringency Comparisons

For window/stringency comparisons, Compare reads a scoring matrix (see Chapter 4, Using Data Files in the User's Guide) that defines a match value for every possible GCG symbol comparison. Compare then slides the vertical sequence along the horizontal in order to generate every possible register of comparison. For each register, Compare slides a window along the pair of sequences. The match values for each pair of symbols within the window are summed to determine a score for the window at each window position. When the score is greater than or equal to the stringency, then the match criterion has been met and a point is added to the file at the position of the middle of the window on both axes. When the window has no integral center (windows of even length), then Compare rounds the coordinates up. If you have used -ALL, then points are added to the file at all of the positions within the window that have match values greater than or equal to the average positive non-identical comparison value in the scoring matrix (see -ALL in the PARAMETER REFERENCE topic).
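
The window/stringency procedure described above can be illustrated with a brute-force Python sketch. It is not the program's code. The function name, the parameter names, and the default identity scoring (1 for identical symbols, 0 otherwise) are assumptions made for illustration; a real run reads match values from a scoring matrix. The nested loops visit every window position in every register, and the -ALL behavior is not modeled.

```python
def window_stringency_points(seq_a, seq_b, window=21, stringency=14, score=None):
    """Return 1-based (x, y) points where a window's summed score meets the stringency.

    seq_a is treated as the horizontal sequence and seq_b as the vertical one
    (an arbitrary choice for this sketch).
    """
    if score is None:
        # Assumed stand-in for a scoring matrix: 1 for a match, 0 otherwise.
        score = lambda a, b: 1 if a == b else 0
    # Offset from the window start to its center; for even windows the
    # center is rounded up, as the text describes.
    half = window // 2
    points = []
    for i in range(len(seq_a) - window + 1):        # window start in seq_a (0-based)
        for j in range(len(seq_b) - window + 1):    # window start in seq_b (0-based)
            total = sum(score(seq_a[i + k], seq_b[j + k]) for k in range(window))
            if total >= stringency:
                points.append((i + half + 1, j + half + 1))
    return points
```

For example, window_stringency_points("AAA", "AAT", window=3, stringency=2) finds one window with a score of 2 and records its middle position, the point (2, 2). Recomputing every window from scratch keeps the sketch short, but it is far slower than updating a running sum as the window slides.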

Word Comparisons

For word comparisons, you set a word length. Compare then slides the vertical sequence along the horizontal in order to generate every possible register of comparison. For each register, Compare slides a window whose size is equal to the word length along the pair of sequences. If all of the symbols in the two sequences within the window are identical, Compare puts a point in the file at the middle of the word's position in the two sequences. If the word has no integral center (words of even length), then Compare rounds the coordinates up.
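
The same set of word matches can be found without sliding through every register by first indexing the words of one sequence, in the spirit of the k-tuple lookup of Wilbur and Lipman. The Python sketch below is an illustration under that assumption and is not a description of how Compare is implemented; the function and parameter names are hypothetical. Points are placed at the middle of each word, rounded up for words of even length.

```python
from collections import defaultdict

def word_points(seq_a, seq_b, wordsize=6):
    """Return 1-based (x, y) points at the middle of every perfect word match."""
    # Index every word of seq_b by its 0-based start position.
    index = defaultdict(list)
    for j in range(len(seq_b) - wordsize + 1):
        index[seq_b[j:j + wordsize]].append(j)

    half = wordsize // 2  # offset to the word's center, rounded up for even sizes
    points = []
    for i in range(len(seq_a) - wordsize + 1):
        # Look up the word starting at i; unmatched words contribute nothing.
        for j in index.get(seq_a[i:i + wordsize], ()):
            points.append((i + half + 1, j + half + 1))
    return points
```

Because each word of the first sequence is looked up directly instead of being compared against every position of the second, the work grows roughly with the sequence lengths plus the number of matches found.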

Alphabet

The parameter alphabet that appears in the output is the number of symbols in the alphabet that could make up each word. The alphabet contains four symbols for nucleic acids and up to 31 for peptide sequences.

CONSIDERATIONS

Dot-plotting helps recognize large regions of similarity. It is not really sensitive enough, in most uses, to see small structures like promoters. In general, you should not try to look for structures that are smaller than the stringency. The window/stringency comparison is usually more sensitive than the word comparison for regions that are only weakly related.

For window/stringency comparisons, Compare chooses a default stringency that is appropriate for the scoring matrix that it reads. If you select a different scoring matrix with Scoring Matrix, the program will adjust the default stringency accordingly.

SUGGESTIONS

Try a Word Comparison First

Word comparisons are very fast, so run Compare with -WORdsize first. Usually, this pilot run gives you a rough idea what the dot-plot for the more sensitive window/stringency comparison is going to look like. See the two plots in the Program Manual entry for DotPlot for examples of each type of comparison.

Setting Window and Stringency

A window 21 symbols wide with a stringency of 14 is a good place to start when comparing nucleic acid sequences that have very few ambiguity codes in them. The number of points you get should be of the same order of magnitude as the number of symbols in your sequences. We have had good results with a window of 30 and a stringency of 11 for peptide sequence comparison. You can use -LIMit to stop the program before the number of points gets unreasonable.

Setting Word Size

You might try a word size of 6 for nucleic acid sequences of 1,000 bases and perhaps 8 for 10,000 bases. You can start with a word size of 2 or 3 for peptide-sequence comparisons.
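
A rough way to see why these word sizes are reasonable is to estimate how many word matches would occur by chance alone. The Python sketch below assumes random sequences in which every symbol is equally likely and independent of its neighbors; that assumption is ours, real sequences depart from it, and the function name is hypothetical.

```python
def expected_chance_words(len_a, len_b, wordsize, alphabet=4):
    """Estimate the number of chance word matches between two random sequences.

    Assumes equally frequent, independent symbols, so any one pair of word
    positions matches with probability 1 / alphabet ** wordsize.
    """
    positions_a = max(0, len_a - wordsize + 1)
    positions_b = max(0, len_b - wordsize + 1)
    return positions_a * positions_b / alphabet ** wordsize
```

Under this assumption, two 1,000-base sequences compared with a word size of 6 give roughly 240 chance matches, and two 10,000-base sequences with a word size of 8 give roughly 1,500, so the background stays below the number of symbols in the sequences.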

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

Comparison window

sets the size of the window within which the comparison score is calculated (when doing a window/stringency comparison).

Set stringency for match in comparison window

sets the minimum comparison score that defines a match (when doing a window/stringency comparison). The comparison score is the sum of the individual match values for each pair of symbols within the window.

Scoring Matrix

allows you to specify a scoring matrix other than the program default. In creating alignments or finding sequence similarity, matching residues are scored according to values found in a scoring matrix. The matrix you choose depends on the expected similarity of the sequences to be compared. For example, you might use blosum90 to compare sequences that are expected to be very similar and blosum35 if you are expecting the sequences to be much less similar.

Printed: January 13, 1999 6:26 (1162)