GAP

Table of Contents

FUNCTION

DESCRIPTION

OUTPUT

INPUT FILES

RELATED PROGRAMS

RESTRICTIONS

ALIGNING LONG SEQUENCES

EVALUATING ALIGNMENT SIGNIFICANCE

CONSIDERATIONS

ALGORITHM

ALIGNMENT METRICS

PEPTIDE SEQUENCES

ACKNOWLEDGEMENTS

PARAMETER REFERENCE

FUNCTION

Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps.

DESCRIPTION

Gap considers all possible alignments and gap positions between two sequences and creates a global alignment that maximizes the number of matched residues and minimizes the number and size of gaps. A scoring matrix is used to assign values for symbol matches. In addition, a gap creation penalty and a gap extension penalty are required to limit the insertion of gaps into the alignment. Gap uses the alignment method of Needleman and Wunsch (J. Mol. Biol. 48; 443-453 (1970)), which has been shown to be equivalent to the method of Sellers (SIAM J. of Applied Math 26; 787-793 (1974); see the CONSIDERATIONS topic below).

OUTPUT

Here is sample output from a Gap session aligning hpr.seq with hpf.seq:


 GAP of: hpr.seq  check: 8102  from: 1  to: 2966

Haptoglobin related sequence
HindIII fragment sequenced 12/27/83
  (partially from hpf sequence)

 to: hpf.seq  check: 2624  from: 1  to: 2740

Haptoglobin alpha2
HindIII fragment , region equivalent to hp1f

 Symbol comparison table: /package/share/9.0/gcgcore/data/rundata/nwsgapdna.cmp
 CompCheck: 8760

         Gap Weight:     50      Average Match: 10.000
      Length Weight:      3   Average Mismatch:  0.000

            Quality:  24426             Length:   2982
              Ratio:  8.915               Gaps:     13
 Percent Similarity: 94.897   Percent Identity: 94.897

        Match display thresholds for the alignment(s):
                    | = IDENTITY
                    : =   5
                    . =   1

 hpr.seq x hpf.seq         September 29, 1998 10:32  ..

                  .         .         .         .         .
       1 AAGCTTGGTATGCTCAGAAGCAGCTAAAGCGTGTATGTGGGGCGGAGGGT 50
         ||||||||||||||||||||| ||||||| ||||||| |  |    | ||
       1 AAGCTTGGTATGCTCAGAAGCTGCTAAAGTGTGTATGGGCAG....GTGT 46

    ////////////////////////////////////////////////////////////
                  .         .         .         .         .
    1749 TTCCTCTTTCTTCAGAGATGATGAATTATTGTAGCTCCTAGCCCTTTCTT 1798
         ||| |||||||| |||||  |||||||||||||
    1678 TTCATCTTTCTTTAGAGAGAATGAATTATTGTA................. 1710
                                  .
                                  .
                                  .
                  .         .         .         .         .
    1949 TGGCCCCTAGCCCTTTCAATGAATTTCAGGGAATTGTGAAAATTCCTTTG 1998
           |||||||||||||||||||||||||||||||||||| ||||||||||
    1711 ..GCCCCTAGCCCTTTCAATGAATTTCAGGGAATTGTGGAAATTCCTTTA 1758

    ////////////////////////////////////////////////////////////
                  .         .         .
    2935 GAGGACACCTGGTACGCGGCTGGGATCTTAAG 2966
         |||||||||||||| ||| |||||||||||||
    2709 GAGGACACCTGGTATGCGACTGGGATCTTAAG 2740

INPUT FILES

Gap accepts two individual nucleotide sequences or two individual protein sequences as input. The function of Gap depends on whether your input sequences are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequences are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

When you want an alignment that covers the whole length of both sequences, use Gap. When you are trying to find only the best segment of similarity between two sequences, use BestFit. PileUp creates a multiple sequence alignment of a group of related sequences, aligning the whole length of all sequences. DotPlot displays the entire surface of comparison for a comparison of two sequences. GapShow displays the pattern of differences between two aligned sequences. PlotSimilarity plots the average similarity of two or more aligned sequences at each position in the alignment. Pretty displays alignments of several sequences. LineUp is an editor for multiple sequence alignments. CompTable helps generate scoring matrices for peptide comparison.

RESTRICTIONS

Input sequences may not be more than 32,000 symbols long.

ALIGNING LONG SEQUENCES

The program attempts to allocate enough computer memory to align the input sequences. In the worst case, where the two sequences being aligned are unrelated, the allocation is proportional to the product of the lengths of the two input sequences. However, in many cases where the sequences being aligned are more closely related, the computer can determine an optimal alignment using less memory. When memory on your computer is limited and the program cannot allocate all of the memory it needs to align long sequences, it completes the alignment in whatever memory it can allocate and displays the message *** Alignment is not guaranteed to be optimal ***. Because the criteria used in the calculation for guaranteeing an optimal alignment are very stringent, the alignment is often optimal even when this message is displayed.

If you know roughly where the alignment of interest for long sequences begins, you can run the program with -LIMit. Then set the starting coordinates for each sequence near the point where the alignment of interest begins and set gap shift limits on each sequence. The program then aligns the sequences from your starting point such that the sequences do not get out of phase by more than the gap shift limits you have set. If you started both sequences at base number one and set the gap shift limit for sequence one to 100 and for sequence two to 50, then base 350 in sequence one could not be gapped to any base outside of the range from 300 to 450 on sequence two. These limited alignments often require less computer memory than unlimited alignments.
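The worked example above can be restated as a small calculation. The Python sketch below is not part of the Package. The function name allowed_window is hypothetical, and the way the two limits are assigned to the lower and upper bounds is an assumption inferred from the example in the text (limit for sequence one widens the upper bound, limit for sequence two widens the lower bound).

```python
def allowed_window(pos1, limit1, limit2, start1=1, start2=1):
    """Return (low, high): positions on sequence two that position pos1
    on sequence one may be aligned with under the given gap shift limits.

    Assumption: the bounds follow the manual's example, where limits of
    100 (sequence one) and 50 (sequence two) confine base 350 of
    sequence one to bases 300..450 of sequence two.
    """
    # Position on sequence two that is exactly "in phase" with pos1.
    in_phase = start2 + (pos1 - start1)
    low = max(start2, in_phase - limit2)   # never before the starting point
    high = in_phase + limit1
    return low, high


if __name__ == "__main__":
    print(allowed_window(350, limit1=100, limit2=50))
```

With the values from the text (both sequences started at base one, limits of 100 and 50), the function returns the range 300 to 450 for base 350.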

EVALUATING ALIGNMENT SIGNIFICANCE

This program can help you evaluate the significance of the alignment by a simple statistical method. When you select Generate statistics from 10 randomized alignments, the second sequence is repeatedly shuffled, maintaining its length and composition, and then realigned to the first sequence. The average alignment score, plus or minus the standard deviation, of all randomized alignments is reported in the output file. You can compare this average quality score to the quality score of the actual alignment to help evaluate the significance of the alignment. The number of randomizations can be specified by adding an optional value to Generate statistics from 10 randomized alignments; the default is 10. You can preserve the dinucleotide or dipeptide composition of the input sequence in the shuffled sequence by setting Randomize alignment preserving: to dinucleotide or dipeptide composition. Set Randomize alignment preserving: to trinucleotide or tripeptide composition to preserve the trinucleotide or tripeptide composition of the input sequence.

By ignoring the statistical properties of biological sequences, this simple Monte Carlo statistical method may give misleading results. Please see Lipman, D.J., Wilbur, W.J., Smith, T.F., and Waterman, M.S. (Nucl. Acids Res. 12; 215-226 (1984)) for a discussion of the statistical significance of nucleic acid similarities.
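The shuffling procedure described in this topic can be sketched in a few lines of Python. This is an illustration only, not the Package's code: global_score is a toy linear-gap global aligner standing in for Gap, its scoring values are arbitrary, and both function names are hypothetical. Only single-symbol composition is preserved; shuffles that preserve dinucleotide or trinucleotide composition are not shown.

```python
import random
import statistics


def global_score(a, b, match=10, mismatch=0, gap=5):
    """Toy global alignment score with a linear gap penalty.
    A stand-in for Gap's quality, not Gap's scoring scheme."""
    prev = [-gap * j for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [-gap * i]
        for j, y in enumerate(b, 1):
            pair = prev[j - 1] + (match if x == y else mismatch)
            cur.append(max(pair, prev[j] - gap, cur[j - 1] - gap))
        prev = cur
    return prev[-1]


def shuffle_statistics(a, b, n=10, seed=None):
    """Score the real alignment, then n alignments of a against shuffled
    copies of b. Returns (actual, mean, standard deviation); n must be
    at least 2 for the standard deviation to be defined."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n):
        symbols = list(b)
        rng.shuffle(symbols)          # keeps length and composition
        scores.append(global_score(a, "".join(symbols)))
    return global_score(a, b), statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    actual, mean, sd = shuffle_statistics("AAGCTTGGTATGCTCAGAAG",
                                          "AAGCTTGGTATGCTCAGAAG", seed=1)
    print(f"actual {actual}, randomized {mean:.1f} +/- {sd:.1f}")
```

An actual score that lies many standard deviations above the randomized average suggests, subject to the caveat above, that the alignment is unlikely to arise from composition alone.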

CONSIDERATIONS

Other Tools May Be Better Than Gap

Gap is capable of ignoring a region of excellent similarity between two sequences if it can produce an alignment with equal or better quality in some other way. BestFit is a better tool to search for weak or unknown similarity, or for similarity that you suspect is not coextensive along the sequences. It is extremely important that you think formally about what Gap does. Using Gap rather than BestFit implies that you want an alignment where neither sequence is truncated.

Gap presents you with one member of the family of best alignments. There may be (and usually are) many members of this family, but no other member has a better quality. When two sequences are closely related, Gap is a good way to see the relationship between them; however, a gapped alignment obscures, or can even be confounded by, internal repeats. Graphic matrix analysis is more powerful for seeing internally repeated structures and approximating the frame of best alignment between two sequences that have never been previously compared. (See the Compare and DotPlot programs.)

Scoring Matrices

The modification of scoring matrices is discussed in Appendix VII.

There is considerable evidence that more sensitive nucleic acid alignments may be possible by scoring transitions slightly positive and transversions slightly negative.

Gap chooses default gap creation and extension penalties that are appropriate for the scoring matrix it reads. If you select a different scoring matrix with Scoring Matrix, the program will adjust the default gap penalties accordingly. (See Appendix VII for information about how to set the default gap penalties for any scoring matrix.) You can use Set gap creation penalty and Set gap extension penalty to specify alternative gap penalties if you don't want to accept the default values.

CompTable helps you create scoring matrices based on a simplification scheme for amino acid differences. There is also a short C program that can be modified to help you write a new scoring matrix quickly. The program is called cmpvals.c, and it is located in the public database. You may Fetch and modify cmpvals.c if you are comfortable working with the C programming language.

Forced Pairing

You can get a position in sequence one to pair with some other position in sequence two by choosing a special symbol not used in the rest of the sequences and giving it a very high match value in the scoring matrix. The alphabet of legitimate GCG sequence symbols is defined in Appendix III.

Needleman-Wunsch Versus Sellers

Gap makes an alignment to find the maximum similarity between two sequences by the method of Needleman and Wunsch (J. Mol. Biol. 48; 443-453 (1970)), which is similar to finding the minimum difference according to the method of Sellers (SIAM J. of Applied Math 26; 787-793 (1974)). Smith, Waterman, and Fitch (J. Mol. Evol. 18; 38-46 (1981)) showed that the methods are precisely equivalent when the Needleman and Wunsch gap creation penalty is equal to the Sellers gap creation penalty - 0.5 and when the end gaps for Needleman and Wunsch are penalized in the same way as all the other gaps. Set gap extension penalty allows you to penalize the end gaps introduced by Gap.

Rapid Alignment

When possible, Gap tries to find the optimal alignment very quickly. If this rapid alignment is not unambiguously optimal, Gap automatically realigns the sequences to calculate the optimal alignment. When this occurs, the monitor of alignment progress on your terminal screen (Aligning...) is displayed twice for a single alignment.

ALGORITHM

Gap reads a scoring matrix that contains values for every possible GCG symbol match. Gap finds an alignment with the maximum possible quality where the quality of an alignment is equal to the sum of the values of the matches (each match scored with the scoring matrix) less the gap creation penalty times the number of internal gaps and less the gap extension penalty times the total length of the internal gaps. The alignment found by Gap is, therefore, sensitive to the scoring matrix values and the gap penalties. There is no penalty if either sequence is shifted to the place where the alignment begins unless end gaps are penalized by using Set gap extension penalty.
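The definition of quality above can be made concrete with a short dynamic programming sketch in Python. It is an illustration under stated assumptions, not the Package's implementation. A simple match/mismatch rule replaces the scoring matrix, and the default values (10, 0, 50, 3) are copied from the header of the sample output. A gap of length n costs the creation penalty plus n times the extension penalty, end gaps are free, and the three-table affine-gap recurrence is the one usually attributed to Gotoh. The function name gap_score is hypothetical, and only the score is returned, not the alignment itself.

```python
NEG = float("-inf")


def gap_score(a, b, match=10, mismatch=0, gap_create=50, gap_extend=3):
    """Maximum quality of a global alignment of a and b, end gaps free."""
    n, m = len(a), len(b)
    # M[i][j]: best score with a[i-1] paired with b[j-1]
    # X[i][j]: best score with a[i-1] opposite a gap in b
    # Y[i][j]: best score with b[j-1] opposite a gap in a
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = 0                      # leading end gap in b is free
    for j in range(1, m + 1):
        Y[0][j] = 0                      # leading end gap in a is free
    first = gap_create + gap_extend      # cost of the first gapped symbol
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - first, Y[i - 1][j] - first,
                          X[i - 1][j] - gap_extend)
            Y[i][j] = max(M[i][j - 1] - first, X[i][j - 1] - first,
                          Y[i][j - 1] - gap_extend)

    def best(i, j):
        return max(M[i][j], X[i][j], Y[i][j])

    # Trailing end gaps are free: the alignment may stop anywhere on the
    # last row or last column of the tables.
    return max([best(i, m) for i in range(n + 1)] +
               [best(n, j) for j in range(m + 1)])


if __name__ == "__main__":
    print(gap_score("AAAACCCC", "AAAATTCCCC"))                            # no gap pays off
    print(gap_score("AAAACCCC", "AAAATTCCCC", gap_create=5, gap_extend=1))  # gap pays off
```

The two calls show how sensitive the result is to the penalties: with the default penalties the best alignment is ungapped, while with a creation penalty of 5 and an extension penalty of 1 a two-symbol internal gap gives a higher quality. The sketch stores three full tables, so its memory use is proportional to the product of the sequence lengths.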

ALIGNMENT METRICS

BestFit and Gap display four figures of merit for alignments: Quality, Ratio, Identity, and Similarity.

The Quality (described above) is the metric maximized in order to align the sequences. Ratio is the quality divided by the number of bases in the shorter segment. Percent Identity is the percent of the symbols that actually match. Percent Similarity is the percent of the symbols that are similar. Symbols that are across from gaps are ignored in the calculation of Percent Identity and Percent Similarity. A similarity is scored when the scoring matrix value for a pair of symbols is greater than or equal to the average positive non-identical comparison value in the matrix, the similarity threshold. This threshold is also used by the display procedure to decide when to put a ':' (colon) between two aligned symbols. You can change this threshold by specifying optional values to -PAIr. For instance, -PAIr=10,5 would set the similarity threshold to 5.

The similarity and identity metrics are not optimized by alignment programs so they should not be used to compare alignments.
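The following Python sketch restates the Ratio, Percent Identity, and Percent Similarity definitions from this topic. It is a reading of the text above rather than the Package's code. The function names, the '.' gap character (taken from the sample output), and the toy match/mismatch scorer used in the example are assumptions.

```python
def quality_ratio(quality, length1, length2):
    """Ratio: quality divided by the length of the shorter sequence."""
    return quality / min(length1, length2)


def identity_and_similarity(top, bottom, score, threshold, gap_char="."):
    """Percent Identity and Percent Similarity of two aligned rows.

    Columns containing a gap are ignored. A pair counts as similar when
    score(x, y) is at least the similarity threshold.
    """
    pairs = [(x, y) for x, y in zip(top, bottom)
             if x != gap_char and y != gap_char]
    identical = sum(1 for x, y in pairs if x == y)
    similar = sum(1 for x, y in pairs if score(x, y) >= threshold)
    return 100.0 * identical / len(pairs), 100.0 * similar / len(pairs)


if __name__ == "__main__":
    # Toy scorer: 10 for a match, 0 for a mismatch (the averages shown
    # in the sample output header).
    def dna_score(x, y):
        return 10 if x == y else 0

    print(identity_and_similarity("ACGTACGT", "AC..ACTT", dna_score, threshold=5))
    print(round(quality_ratio(24426, 2966, 2740), 3))
```

The last line reproduces the Ratio in the sample output: a Quality of 24426 divided by 2740, the length of the shorter sequence, rounds to 8.915.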

PEPTIDE SEQUENCES

If your input sequences are peptide sequences, this program uses a scoring matrix, blosum62.cmp, with comparison values derived from a study of substitutions between amino acid pairs in ungapped blocks of aligned protein segments as measured by Henikoff and Henikoff (Proc. Natl. Acad. Sci. USA 89; 10915-10919 (1992)).

ACKNOWLEDGEMENTS

Gap and BestFit were originally written for Version 1.0 by Paul Haeberli from a careful reading of the Needleman and Wunsch (J. Mol. Biol. 48; 443-453 (1970)) and the Smith and Waterman (Adv. Appl. Math. 2; 482-489 (1981)) papers.

Limited alignments were designed by Paul Haeberli and added to the Package for Version 3.0. They were united into a single program by Philip Delaquess for Version 4.0.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

Set gap creation penalty

sets the gap creation penalty that is subtracted from the alignment score whenever a gap is created.

Set gap extension penalty

sets the gap extension penalty that is subtracted from the alignment score for each gapped symbol.

Scoring Matrix

allows you to specify a scoring matrix other than the program default. In creating alignments or finding sequence similarity, matching residues are scored according to values found in a scoring matrix. The matrix you choose depends on the expected similarity of the sequences to be compared. For example, you might use blosum90 to compare sequences that are expected to be very similar and blosum35 if you are expecting the sequences to be much less similar.

Don't penalize gap extensions longer than

lets you set the maximum penalty for any gap in the alignment. For instance, if you specify Don't penalize gap extensions longer than set to 12, then any gap longer than 12 characters is penalized the same as a gap of length 12. Using this parameter, alignments can contain large gaps without incurring large gap extension penalties. This may be useful, for instance, if you are aligning a cDNA sequence with the corresponding genomic DNA sequence containing large introns.
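The effect of this parameter on a single gap can be sketched as follows. The Python function below is hypothetical and only restates the rule in the paragraph above; the default penalties of 50 and 3 are the Gap Weight and Length Weight shown in the sample output, and the cost of a gap is assumed to be the creation penalty plus the extension penalty for each gapped symbol.

```python
def gap_penalty(length, gap_create=50, gap_extend=3, max_penalized_length=None):
    """Penalty for one gap of the given length.

    When max_penalized_length is set, any longer gap is penalized the
    same as a gap of exactly that length.
    """
    if max_penalized_length is not None:
        length = min(length, max_penalized_length)
    return gap_create + gap_extend * length


if __name__ == "__main__":
    print(gap_penalty(500))                           # uncapped intron-sized gap
    print(gap_penalty(500, max_penalized_length=12))  # capped at length 12
```

With the cap set to 12, a 500-symbol gap costs the same as a 12-symbol gap (86 instead of 1550 with these penalties), which is why a cDNA can be aligned across large introns without the gaps dominating the quality.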

Gap shift limit for sequence 1
Gap shift limit for sequence 2

let you set gap shift limits for each sequence. When you already know of a long similarity between two sequences you can "zip" them together using this mode. The beginning coordinates for each sequence must be near the beginning of the alignment you want to see. The alignment continues so that gaps inserted do not require the sequences to get out of step by more than the gap shift limits. You can align very long sequences rapidly. When you set gap shift limits for one or both input sequences, the maximum surface of comparison available to your alignment is 3.5 million. The size of the surface of comparison that your alignment actually requires can be predicted by multiplying the average length of the two sequences by the sum of the two shift limits.
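The size prediction in the last two sentences is simple arithmetic, sketched here in Python. The function and constant names are hypothetical; the formula and the 3.5 million limit are taken from the paragraph above.

```python
MAX_LIMITED_SURFACE = 3_500_000  # maximum surface of comparison with shift limits


def limited_surface(length1, length2, limit1, limit2):
    """Predicted surface of comparison for a limited alignment:
    average sequence length times the sum of the two shift limits."""
    return (length1 + length2) / 2 * (limit1 + limit2)


def fits(length1, length2, limit1, limit2):
    """True when the predicted surface is within the documented maximum."""
    return limited_surface(length1, length2, limit1, limit2) <= MAX_LIMITED_SURFACE


if __name__ == "__main__":
    # Sequence lengths from the sample output, with shift limits of 100 and 50.
    print(limited_surface(2966, 2740, 100, 50), fits(2966, 2740, 100, 50))
```

For the two sequences in the sample output (2966 and 2740 symbols) with shift limits of 100 and 50, the predicted surface is 2853 x 150 = 427,950, well under the limit.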

Generate statistics from 10 randomized alignments

reports the average alignment score and standard deviation from 10 randomized alignments in which the second sequence is repeatedly shuffled, maintaining the length and composition of the original sequence, and then aligned to the first sequence. You can use the optional parameter to set the number of randomized alignments to some number other than 10.

Randomize alignment preserving:

Use Randomize alignment preserving: to preserve composition of the original sequence in the shuffled (randomized) sequences. By default the program preserves single base or residue composition. Randomize alignment preserving: allows you to preserve dinucleotide/dipeptide composition or trinucleotide/tripeptide composition.

Set gap extension penalty

causes the end gaps to be penalized in the same way as all other gaps.

Printed: January 13, 1999 6:27 (1162)