BESTFIT

Table of Contents
FUNCTION
DESCRIPTION
SEARCHING FOR SIMILARITY
OUTPUT
INPUT FILES
RELATED PROGRAMS
ALGORITHM
CONSIDERATIONS
ALIGNING LONG SEQUENCES
EVALUATING ALIGNMENT SIGNIFICANCE
ALIGNMENT METRICS
PEPTIDE SEQUENCES
RESTRICTIONS
ACKNOWLEDGEMENTS
PARAMETER REFERENCE

FUNCTION

BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal alignments are found by inserting gaps to maximize the number of matches using the local homology algorithm of Smith and Waterman.

DESCRIPTION

BestFit inserts gaps to obtain the optimal alignment of the best region of similarity between two sequences, and then displays the alignment in a format similar to the output from Gap. The sequences can be of very different lengths and have only a small segment of similarity between them. You could take a short RNA sequence, for example, and run it against a whole mitochondrial genome.

SEARCHING FOR SIMILARITY

BestFit is the most powerful method in the Wisconsin Package(TM) for identifying the best region of similarity between two sequences whose relationship is unknown.

OUTPUT

Here is the output file. Notice how BestFit finds and displays only the best segments of similarity:


 BESTFIT of: gamma.seq  check: 6474  from: 1  to: 500

Human fetal beta globins G and A gamma
from Shen, Slightom and Smithies,  Cell 26; 191-203.
Analyzed by Smithies et al. Cell 26; 345-353.

 to: alu.seq  check: 4238  from: 1  to: 207

HSREP2 from the EMBL data library
Human Alu repetitive sequence located near the insulin gene
Dhruva D.R., Shenk T., Subramanian K.N.; "Integration in vivo into
Simian virus 40 DNA of a sequence that resembles a certain family of
genomic interspersed repeated sequences"; Proc. Natl. Acad. Sci. USA
77:4514-4518(1980).  . . .

 CompCheck: 2335

         Gap Weight:     50      Average Match: 10.000
      Length Weight:      3   Average Mismatch: -9.000

            Quality:   1293             Length:    209
              Ratio:  6.246               Gaps:      3
 Percent Similarity: 84.466   Percent Identity: 84.466

        Match display thresholds for the alignment(s):
                    | = IDENTITY
                    : =   5
                    . =   1

 gamma.seq x alu.seq       September 24, 1998 17:06  ..-+-+

                  .         .         .         .         .
     137 AGACCAACCTGGCCAACATGGTGAAATCCCATCTCTAC.AAAAATACAAA 185
         |||||| |||||||||||||||||||  ||||||||||  ||||||||||
       1 AGACCAGCCTGGCCAACATGGTGAAACTCCATCTCTACTGAAAATACAAA 50
                  .         .         .         .         .
     186 AATTAGACAGGCATGATGGCAAGTGCCTGTAATCCCAGCTACTTGGGAGG 235
         |||||| |||||||| ||    ||||||| |||||||||||||| |||||
      51 AATTAGCCAGGCATGGTGATGCGTGCCTGGAATCCCAGCTACTTAGGAGG 100
                  .         .         .         .         .
     236 CTGAGGAAGGAGAATTGCTTGAACCTGGAAGGCAGGAGTTGCAGTGAGCC 285
         |||||  || |||||  ||| ||||  | |||  |  |||||||||||||
     101 CTGAGACAGAAGAATCCCTTAAACCAAG.AGGTGGAGGTTGCAGTGAGCC 149
                  .         .         .         .         .
     286 GAGATCATACCACTGCACTCCAGCCTGGGTGACAGAACAAGACTCTGTCT 335
         ||||||  ||  |||||||||||||| ||||||||| | ||||||  |||
     150 GAGATCGCACGGCTGCACTCCAGCCT.GGTGACAGAGCGAGACTCCATCT 198

     336 CAAAAAAAA 344
         |||||||||
     199 CAAAAAAAA 207

INPUT FILES

BestFit accepts two individual nucleotide sequences or protein sequences as input. The function of BestFit depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

When you want an alignment that covers the whole length of both sequences, use Gap. When you are trying to find only the best segment of similarity between two sequences, use BestFit. PileUp creates a multiple sequence alignment of a group of related sequences, aligning the whole length of all sequences. DotPlot displays the entire surface of comparison for two sequences. GapShow displays the pattern of differences between two aligned sequences. PlotSimilarity plots the average similarity of two or more aligned sequences at each position in the alignment. Pretty displays alignments of several sequences. LineUp is an editor for multiple sequence alignments. CompTable helps generate scoring matrices for peptide comparison.

ALGORITHM

BestFit uses the local homology algorithm of Smith and Waterman (Advances in Applied Mathematics 2; 482-489 (1981)) to find the best segment of similarity between two sequences. BestFit reads a scoring matrix that contains values for every possible GCG symbol match. The program uses these values to construct a path matrix that represents the entire surface of comparison with a score at every position for the best possible alignment to that point. The quality score for the best alignment to any point is equal to the sum of the scoring matrix values of the matches in that alignment, less the gap creation penalty times the number of gaps in that alignment, less the gap extension penalty times the total length of all gaps in that alignment. You set the gap creation and gap extension penalties. If the best path to any point has a negative value, a zero is put in that position.

After the path matrix is complete, the highest value on the surface of comparison represents the end of the best region of similarity between the sequences. The best path from this highest value backwards to the point where the values revert to zero is the alignment shown by BestFit. This alignment is the best segment of similarity between the two sequences.
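
As a rough illustration of the path matrix described above, the following Python sketch fills the matrix row by row and reports the highest value together with the position where the best segment ends. It is not the Wisconsin Package implementation. The function name is hypothetical, a single match value and a single mismatch value stand in for a full scoring matrix, the default penalties are simply the Gap Weight and Length Weight shown in the example output, and the traceback that recovers the alignment itself is omitted.

```python
def local_alignment_score(seq1, seq2, match=10, mismatch=-9,
                          gap_creation=50, gap_extension=3):
    """Return (best_quality, end1, end2) for the best local segment.

    end1 and end2 are the 1-based positions in seq1 and seq2 where the
    best segment ends; (0, 0, 0) means no positive-scoring segment exists.
    """
    minus_inf = float("-inf")
    cols = len(seq2) + 1
    prev_best = [0] * cols           # previous row of the path matrix
    prev_down = [minus_inf] * cols   # previous row, paths ending in a gap in seq2
    best = (0, 0, 0)
    for i, a in enumerate(seq1, start=1):
        cur_best = [0] * cols
        cur_down = [minus_inf] * cols
        across = minus_inf           # paths ending in a gap in seq1
        for j, b in enumerate(seq2, start=1):
            # Either open a new gap (creation plus one extension)
            # or lengthen a gap that is already open.
            across = max(cur_best[j - 1] - gap_creation - gap_extension,
                         across - gap_extension)
            cur_down[j] = max(prev_best[j] - gap_creation - gap_extension,
                              prev_down[j] - gap_extension)
            diagonal = prev_best[j - 1] + (match if a == b else mismatch)
            # A best path with a negative value is replaced by zero.
            cur_best[j] = max(0, diagonal, across, cur_down[j])
            if cur_best[j] > best[0]:
                best = (cur_best[j], i, j)
        prev_best, prev_down = cur_best, cur_down
    return best


# Four identical bases score 4 x 10 and the segment ends at base 4 of each.
print(local_alignment_score("ACGT", "ACGT"))
```

With the penalties lowered for the sake of a short example (gap_creation=2, gap_extension=1), aligning "ACGTTGCA" against "ACGTGCA" bridges the missing T with a single gap: seven matches less one gap of length one gives 70 - 2 - 1 = 67.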

For nucleic acids, the default scoring matrix has a match value of 10 for each identical symbol comparison and -9 for each non-identical comparison (not considering nucleotide ambiguity symbols for this example). The quality score for a nucleic acid alignment can, therefore, be determined using the following equation:


       Quality = 10 x TotalMatches + -9 x TotalMismatches
                    - (GapCreationPenalty x GapNumber)
                    - (GapExtensionPenalty x TotalLengthOfGaps)
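
The equation can be checked against the example in the OUTPUT section. The counts used below were tallied from the match line of the gamma.seq x alu.seq alignment (174 identities, 32 mismatches, and 3 gaps of one symbol each); the function and argument names are only illustrative.

```python
def nucleic_quality(total_matches, total_mismatches, gap_number,
                    total_length_of_gaps, gap_creation=50, gap_extension=3,
                    match_value=10, mismatch_value=-9):
    """Quality of a nucleic acid alignment under a match/mismatch matrix."""
    return (match_value * total_matches
            + mismatch_value * total_mismatches
            - gap_creation * gap_number
            - gap_extension * total_length_of_gaps)


# 1740 - 288 - 150 - 9, the Quality reported in the example output
print(nucleic_quality(174, 32, 3, 3))  # prints 1293
```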

The quality score for a protein alignment is calculated in a similar manner. However, while the default nucleic acid scoring matrix has a single value for all non-identical comparisons, the default protein scoring matrix has different values for the various non-identical amino acid comparisons. The quality score for a protein alignment can therefore be determined using the following equation (where Total(AA) is the total number of A-A (Ala-Ala) matches in the alignment, CmpVal(AA) is the value for an A-A comparison in the scoring matrix, Total(AB) is the total number of A-B (Ala-Asx) matches in the alignment, CmpVal(AB) is the value for an A-B comparison in the scoring matrix, and so on):


       Quality =  CmpVal(AA) x Total(AA)
                + CmpVal(AB) x Total(AB)
                + CmpVal(AC) x Total(AC)
                           .
                           .
                           .
                + CmpVal(ZZ) x Total(ZZ)
                - (GapCreationPenalty x GapNumber)
                - (GapExtensionPenalty x TotalLengthOfGaps)
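
The same sum can be sketched in Python for two already-aligned strings. Everything named here is an assumption made for illustration: the function name, the use of '.' as the gap symbol (as in the display above), the penalties of 8 and 2, and in particular the four comparison values, which are placeholders and were not read from blosum62.cmp.

```python
def alignment_quality(top, bottom, cmp_val, gap_creation, gap_extension,
                      gap_char="."):
    """Sum comparison values over aligned pairs, less the gap penalties."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    gap_number = 0
    total_length_of_gaps = 0
    open_side = None  # which strand holds the gap currently being extended
    for x, y in zip(top, bottom):
        if x == gap_char or y == gap_char:
            side = "top" if x == gap_char else "bottom"
            if side != open_side:
                gap_number += 1          # a new gap starts here
            total_length_of_gaps += 1
            open_side = side
        else:
            # The matrix is symmetric, so look pairs up in sorted order.
            total += cmp_val[tuple(sorted((x, y)))]
            open_side = None
    return (total
            - gap_creation * gap_number
            - gap_extension * total_length_of_gaps)


# Placeholder comparison values, for illustration only.
toy_matrix = {("A", "A"): 4, ("A", "S"): 1, ("K", "K"): 5, ("L", "L"): 4}

# (4 + 4 + 5 + 1 + 4) - 8 x 1 - 2 x 1
print(alignment_quality("ALK.AA", "ALKRSA", toy_matrix, 8, 2))  # prints 8
```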

For a more complete discussion of scoring matrices, see Appendix VII.

CONSIDERATIONS

BestFit Always Finds Something

BestFit always finds an alignment for any two sequences you compare -- even if there is no significant similarity between them! You must evaluate the results critically to decide whether the segment shown is more than a random region of relative similarity.

The Segments Shown Obscure Alternative Segments

BestFit shows only one segment of similarity, so if there are several, all but one are obscured. You can approach this problem with graphic matrix analysis (see the Compare and DotPlot programs). Alternatively, you can run BestFit on ranges outside the ranges of similarity found in earlier runs to bring other segments out of the shadow of the best segment.

The Best Fit is Only One Member of a Family

As with all fast gapping algorithms, the alignment displayed is one member of the family of best alignments. This family may have other members of equal quality, but will not have any member with a higher quality. The family is usually significantly different for different choices of gap creation and gap extension penalties. See the CONSIDERATIONS topic in the entry for the Gap program in the Program Manual to learn more about how to assign gap creation and gap extension penalties.

Default Gap Penalties are Specific to Each Scoring Matrix

BestFit chooses default gap creation and extension penalties that are appropriate for the scoring matrix it reads. If you select a different scoring matrix with Scoring Matrix, the program will adjust the default gap penalties accordingly. (See Appendix VII for information about how to set the default gap penalties for any scoring matrix.) You can use Set gap creation penalty and Set gap extension penalty to specify alternative gap penalties if you don't want to accept the default values.

The Public Scoring Matrix for Nucleic Acid Comparisons is Very Stringent

The scoring matrix swgapdna.cmp awards matches +10 and penalizes mismatches -9, so the segments found may be very brief. This penalty means that the alignment cannot be extended across even two mismatched bases to pick up one extra match. The scoring matrix used by Smith and Waterman, when local alignments were first described, used -0.33 for the mismatch penalty and +1 for the match score. You can use Fetch to copy randomdna.cmp and rename it swgapdna.cmp to use a value of -3 for each mismatch and +10 for each match, or use nwsgapdna.cmp, which has no mismatch penalty at all.
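
The arithmetic behind this stringency can be sketched as follows, reading the statement above as an extension that must cross mismatched bases before it reaches one more match. The match and mismatch values are the ones quoted in this section; the function name is hypothetical.

```python
def extension_gain(match_value, mismatch_value, mismatches_crossed):
    """Net change in quality from crossing mismatches to reach one match."""
    return match_value + mismatch_value * mismatches_crossed


matrices = {
    "swgapdna.cmp": (10, -9),
    "randomdna.cmp": (10, -3),
    "nwsgapdna.cmp": (10, 0),
}
for name, (match_value, mismatch_value) in matrices.items():
    gains = [extension_gain(match_value, mismatch_value, k) for k in (1, 2)]
    print(name, gains)
```

A negative gain means the extension lowers the quality, so the local alignment stops short of it. Under swgapdna.cmp the gain for crossing two mismatches is 10 - 18 = -8, whereas under randomdna.cmp it is 10 - 6 = 4.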

Rapid Alignment

When possible, BestFit tries to find the optimal alignment very quickly. If this rapid alignment is not unambiguously optimal, BestFit automatically realigns the sequences to calculate the optimal alignment. When this occurs, the monitor of alignment progress on your terminal screen (Aligning...) is displayed twice for a single alignment.

ALIGNING LONG SEQUENCES

The program attempts to allocate enough computer memory to align the input sequences. In the worst case, where the two sequences being aligned are unrelated, the allocation is proportional to the product of the lengths of the two input sequences. However, in many cases where the sequences being aligned are more closely related, the computer can determine an optimal alignment using less memory. When memory on your computer is limited and the program cannot allocate all of the memory it needs to align long sequences, it completes the alignment in whatever memory it can allocate and displays the message *** Alignment is not guaranteed to be optimal ***. Because the criteria used in the calculation for guaranteeing an optimal alignment are very stringent, the alignment may often be optimal even if this message is displayed.

If you know roughly where the alignment of interest for long sequences begins, you can run the program with -LIMit. Then set the starting coordinates for each sequence near the point where the alignment of interest begins and set gap shift limits on each sequence. The program then aligns the sequences from your starting point such that the sequences do not get out of phase by more than the gap shift limits you have set. If you started both sequences at base number one and set the gap shift limit for sequence one to 100 and for sequence two to 50, then base 350 in sequence one could not be gapped to any base outside of the range from 300 to 450 on sequence two. These limited alignments often require less computer memory than unlimited alignments.
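
The example above can be written as a small calculation. This is only one reading that reproduces the numbers in the example (the limit for sequence one bounds how far sequence two may run ahead, and the limit for sequence two bounds how far it may fall behind); the function and its argument names are hypothetical and are not program parameters.

```python
def reachable_range(pos1, start1=1, start2=1,
                    shift_limit1=100, shift_limit2=50):
    """Range of sequence-two positions that pos1 in sequence one can reach."""
    # Position in sequence two that lies on the ungapped diagonal.
    diagonal = start2 + (pos1 - start1)
    return diagonal - shift_limit2, diagonal + shift_limit1


print(reachable_range(350))  # prints (300, 450)
```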

EVALUATING ALIGNMENT SIGNIFICANCE

This program can help you evaluate the significance of the alignment, using a simple statistical method, with Generate statistics from 10 randomized alignments. The second sequence is repeatedly shuffled, maintaining its length and composition, and then realigned to the first sequence. The average alignment score, plus or minus the standard deviation, of all randomized alignments is reported in the output file. You can compare this average quality score to the quality score of the actual alignment to help evaluate the significance of the alignment. The number of randomizations can be specified by adding an optional value to Generate statistics from 10 randomized alignments; the default is 10. You can preserve the dinucleotide or dipeptide composition of the input sequence in the shuffled sequence by using Randomize alignment preserving: set to dinucleotide or dipeptide composition. Use Randomize alignment preserving: set to trinucleotide or tripeptide composition to preserve the trinucleotide or tripeptide composition of the input sequence.

By ignoring the statistical properties of biological sequences, this simple Monte Carlo statistical method may give misleading results. Please see Lipman, D.J., Wilbur, W.J., Smith, T.F., and Waterman, M.S. (Nucl. Acids Res. 12; 215-226 (1984)) for a discussion of the statistical significance of nucleic acid similarities.
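
The shuffling procedure described in this section can be sketched in Python as shown below. This is a minimal illustration under stated assumptions, not the program's own code: the function names are hypothetical, the scorer is a deliberately simple stand-in (the length of the longest exact common run) rather than a BestFit quality score, only single-symbol composition is preserved, and the sample standard deviation is used because this entry does not say which form the program reports.

```python
import random
import statistics


def longest_common_run(seq1, seq2):
    """Stand-in scorer: length of the longest exact common substring."""
    best = 0
    prev = [0] * (len(seq2) + 1)
    for x in seq1:
        cur = [0] * (len(seq2) + 1)
        for j, y in enumerate(seq2, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def randomized_statistics(seq1, seq2, score_fn, randomizations=10, seed=None):
    """Mean and standard deviation of scores against shuffled copies of seq2.

    At least two randomizations are needed for a standard deviation.
    """
    rng = random.Random(seed)
    symbols = list(seq2)
    scores = []
    for _ in range(randomizations):
        rng.shuffle(symbols)  # keeps the length and composition of seq2
        scores.append(score_fn(seq1, "".join(symbols)))
    return statistics.mean(scores), statistics.stdev(scores)


actual = longest_common_run("ACGTACGTAC", "TTACGTACGG")
mean, deviation = randomized_statistics("ACGTACGTAC", "TTACGTACGG",
                                        longest_common_run, seed=1)
print(actual, mean, deviation)
```

Passing any other scoring function with the same two-argument form would let you compare its actual score with the randomized average in the same way.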

ALIGNMENT METRICS

BestFit and Gap display four figures of merit for alignments: Quality, Ratio, Identity, and Similarity.

The Quality (described above) is the metric maximized in order to align the sequences. Ratio is the quality divided by the number of bases in the shorter segment. Percent Identity is the percent of the symbols that actually match. Percent Similarity is the percent of the symbols that are similar. Symbols that are across from gaps are ignored in the calculation of Percent Identity and Percent Similarity. A similarity is scored when the scoring matrix value for a pair of symbols is greater than or equal to the average positive non-identical comparison value in the matrix, the similarity threshold. This threshold is also used by the display procedure to decide when to put a ':' (colon) between two aligned symbols. You can change this threshold by specifying optional values to -PAIr. For instance, -PAIr=10,5 would set the similarity threshold to 5.

The similarity and identity metrics are not optimized by alignment programs so they should not be used to compare alignments.
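
The Ratio and Percent Identity definitions can be checked against the example in the OUTPUT section. The shorter segment there is alu.seq positions 1 to 207, and the match line shows 174 identities among the 206 aligned pairs that are not across from a gap (Length 209 less 3 gapped positions). The function names are only illustrative.

```python
def ratio(quality, shorter_segment_length):
    """Quality divided by the number of symbols in the shorter segment."""
    return quality / shorter_segment_length


def percent_identity(identities, ungapped_pairs):
    """Percent of aligned pairs, ignoring gapped positions, that match."""
    return 100.0 * identities / ungapped_pairs


print(round(ratio(1293, 207), 3))             # prints 6.246
print(round(percent_identity(174, 206), 3))   # prints 84.466
```

Both values agree with the Ratio and Percent Identity lines of the example output.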

PEPTIDE SEQUENCES

If your input sequences are peptide sequences, this program uses a scoring matrix, blosum62.cmp, with comparison values derived from a study of substitutions between amino acid pairs in ungapped blocks of aligned protein segments as measured by Henikoff and Henikoff (Proc. Natl. Acad. Sci. USA 89; 10915-10919 (1992)).

RESTRICTIONS

Input sequences may not be more than 32,000 symbols long.

ACKNOWLEDGEMENTS

Gap and BestFit were originally written for Version 1.0 by Paul Haeberli from a careful reading of the Needleman and Wunsch (J. Mol. Biol. 48; 443-453 (1970)) and the Smith and Waterman (Adv. Appl. Math. 2; 482-489 (1981)) papers.

Limited alignments were designed by Paul Haeberli and added to the Package for Version 3.0. They were united into a single program by Philip Delaquess for Version 4.0.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

Set gap creation penalty

sets the gap creation penalty that is subtracted from the alignment score whenever a gap is created.

Set gap extension penalty

sets the gap extension penalty that is subtracted from the alignment score for each gapped symbol.

Scoring Matrix

allows you to specify a scoring matrix other than the program default. In creating alignments or finding sequence similarity, matching residues are scored according to values found in a scoring matrix. The matrix you choose depends on the expected similarity of the sequences to be compared. For example, you might use blosum90 to compare sequences that are expected to be very similar and blosum35 if you are expecting the sequences to be much less similar.

Don't penalize gap extensions longer than

lets you set the maximum penalty for any gap in the alignment. For instance, if you specify Don't penalize gap extensions longer than set to 12, then any gap longer than 12 characters is penalized the same as a gap of length 12. Using this parameter, alignments can contain large gaps without incurring large gap extension penalties. This may be useful, for instance, if you are aligning a cDNA sequence with the corresponding genomic DNA sequence containing large introns.
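
The effect of this parameter on a single gap can be sketched as follows, combining the cap described above with the gap creation and gap extension penalties defined earlier in this list. The function and argument names are hypothetical, and the values 50 and 3 are simply the Gap Weight and Length Weight from the example output.

```python
def gap_penalty(gap_length, gap_creation, gap_extension,
                max_penalized_length=None):
    """Penalty for one gap, optionally capping the length that is charged."""
    if max_penalized_length is not None:
        gap_length = min(gap_length, max_penalized_length)
    return gap_creation + gap_extension * gap_length


# With the cap at 12, a 5000-symbol intron costs the same as a 12-symbol gap.
print(gap_penalty(12, 50, 3, max_penalized_length=12))    # prints 86
print(gap_penalty(5000, 50, 3, max_penalized_length=12))  # prints 86
print(gap_penalty(5000, 50, 3))                           # prints 15050
```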

Gap shift limit for sequence 1
Gap shift limit for sequence 2

let you set gap shift limits for each sequence. When you already know of a long similarity between two sequences you can "zip" them together using this mode. The beginning coordinates for each sequence must be near the beginning of the alignment you want to see. The alignment continues so that gaps inserted do not require the sequences to get out of step by more than the gap shift limits. You can align very long sequences rapidly. When you set gap shift limits for one or both input sequences, the maximum surface of comparison available to your alignment is 3.5 million. The size of the surface of comparison that your alignment actually requires can be predicted by multiplying the average length of the two sequences by the sum of the two shift limits.
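
The size prediction described above can be sketched in Python. The function name, the constant name, and the sequence lengths in the example are made up for illustration; only the formula and the 3.5 million ceiling come from the description above.

```python
MAX_LIMITED_SURFACE = 3_500_000  # ceiling quoted for limited alignments


def predicted_surface(length1, length2, shift_limit1, shift_limit2):
    """Average sequence length multiplied by the sum of the shift limits."""
    return (length1 + length2) / 2 * (shift_limit1 + shift_limit2)


# Hypothetical sequences of 20,000 and 18,000 symbols, limits of 100 and 50.
surface = predicted_surface(20000, 18000, 100, 50)
print(surface, surface <= MAX_LIMITED_SURFACE)  # prints 2850000.0 True
```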

Generate statistics from 10 randomized alignments

reports the average alignment score and standard deviation from 10 randomized alignments in which the second sequence is repeatedly shuffled, maintaining the length and composition of the original sequence, and then aligned to the first sequence. You can use the optional parameter to set the number of randomized alignments to some number other than 10.

Randomize alignment preserving:

    nucleotide or amino acid composition
    dinucleotide or dipeptide composition
    trinucleotide or tripeptide composition

Use Randomize alignment preserving: to preserve composition of the original sequence in the shuffled (randomized) sequences. By default the program preserves single base or residue composition. Randomize alignment preserving: allows you to preserve dinucleotide/dipeptide composition or trinucleotide/tripeptide composition.

Printed: January 13, 1999 6:26 (1162)