Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps.
Gap considers all possible alignments and gap positions between two sequences and creates a global alignment that maximizes the number of matched residues and minimizes the number and size of gaps. A scoring matrix is used to assign values for symbol matches. In addition, a gap creation penalty and a gap extension penalty are required to limit the insertion of gaps into the alignment. Gap uses the alignment method of Needleman and Wunsch (J. Mol. Biol. 48; 443-453 (1970)) that has been shown to be equivalent to Sellers (SIAM J. of Applied Math 26; 787-793 (1974), see the CONSIDERATIONS topic below).
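The kind of computation described above can be sketched as a small dynamic program. This is only a minimal illustration, not Gap's actual implementation: the function name is hypothetical, a flat match/mismatch score stands in for a full scoring matrix, and end gaps are penalized like any other gap (Gap does not penalize them by default). The default values are borrowed from the sample output in this document (Average Match 10, Average Mismatch 0, Gap Weight 50, Length Weight 3).

```python
NEG_INF = float("-inf")

def global_alignment_score(a, b, match=10, mismatch=0, gap_open=50, gap_extend=3):
    """Best global alignment score of a and b.

    Assumption: a gap of length L costs gap_open + gap_extend * L,
    and end gaps are charged like internal gaps.
    """
    n, m = len(a), len(b)
    # M: a[i-1] paired with b[j-1]
    # X: a[i-1] opposite a gap in b
    # Y: b[j-1] opposite a gap in a
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + gap_extend * i)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + gap_extend * j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            # Opening a gap costs creation plus one extension; continuing costs extension only.
            X[i][j] = max(M[i - 1][j] - gap_open - gap_extend,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - gap_open - gap_extend)
            Y[i][j] = max(M[i][j - 1] - gap_open - gap_extend,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open - gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])
```

The sketch fills three full tables, so its memory use grows with the product of the two sequence lengths; it returns only the score and does not trace back the alignment itself.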
Here is the output from this session:
 GAP of: hpr.seq  check: 8102  from: 1  to: 2966

 Haptoglobin related sequence HindIII fragment sequenced 12/27/83 (partially from hpf sequence)

 to: hpf.seq  check: 2624  from: 1  to: 2740

 Haptoglobin alpha2 HindIII fragment , region equivalent to hp1f

 Symbol comparison table: /package/share/9.0/gcgcore/data/rundata/nwsgapdna.cmp
 CompCheck: 8760

         Gap Weight:     50      Average Match: 10.000
      Length Weight:      3   Average Mismatch:  0.000

            Quality:  24426             Length:   2982
              Ratio:  8.915               Gaps:     13
 Percent Similarity: 94.897   Percent Identity: 94.897

 Match display thresholds for the alignment(s):
             | = IDENTITY
             : =   5
             . =   1

 hpr.seq x hpf.seq   September 29, 1998 10:32  ..

      . . . . .
    1 AAGCTTGGTATGCTCAGAAGCAGCTAAAGCGTGTATGTGGGGCGGAGGGT 50
      ||||||||||||||||||||| ||||||| ||||||| |  |    | ||
    1 AAGCTTGGTATGCTCAGAAGCTGCTAAAGTGTGTATGGGCAG....GTGT 46

 ////////////////////////////////////////////////////////////

      . . . . .
 1749 TTCCTCTTTCTTCAGAGATGATGAATTATTGTAGCTCCTAGCCCTTTCTT 1798
      ||| |||||||| |||||  |||||||||||||
 1678 TTCATCTTTCTTTAGAGAGAATGAATTATTGTA................. 1710

      . . . . . . . .
 1949 TGGCCCCTAGCCCTTTCAATGAATTTCAGGGAATTGTGAAAATTCCTTTG 1998
        |||||||||||||||||||||||||||||||||||| ||||||||||
 1711 ..GCCCCTAGCCCTTTCAATGAATTTCAGGGAATTGTGGAAATTCCTTTA 1758

 ////////////////////////////////////////////////////////////

      . . .
 2935 GAGGACACCTGGTACGCGGCTGGGATCTTAAG 2966
      |||||||||||||| ||| |||||||||||||
 2709 GAGGACACCTGGTATGCGACTGGGATCTTAAG 2740
Gap accepts two individual nucleotide sequences or protein sequences as input. The function of Gap depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.
When you want an alignment that covers the whole length of both sequences, use Gap. When you are trying to find only the best segment of similarity between two sequences, use BestFit. PileUp creates a multiple sequence alignment of a group of related sequences, aligning the whole length of all sequences. DotPlot displays the entire surface of comparison for a comparison of two sequences. GapShow displays the pattern of differences between two aligned sequences. PlotSimilarity plots the average similarity of two or more aligned sequences at each position in the alignment. Pretty displays alignments of several sequences. LineUp is an editor for editing multiple sequence alignments. CompTable helps generate scoring matrices for peptide comparison.
Input sequences may not be more than 32,000 symbols long.
The program attempts to allocate enough computer memory to align the input sequences. In the worst case, where the two sequences being aligned are unrelated, the allocation is proportional to the product of the lengths of the two input sequences. However, in many cases where the sequences being aligned are more closely related, the computer can determine an optimal alignment using less memory. When memory on your computer is limiting and the program cannot allocate all of the memory it needs to align long sequences, it completes the alignment in whatever memory it can allocate and displays the message *** Alignment is not guaranteed to be optimal ***. Because the criteria used in the calculation for guaranteeing an optimal alignment are very stringent, the alignment often may be optimal even if this message is displayed.
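To give a feel for the worst case described above, the following sketch counts the cells in a full comparison surface. The function name is hypothetical, and the number of bytes Gap stores per cell is not stated in this document, so only cell counts are shown.

```python
def worst_case_cells(len1, len2):
    """Number of cells in a full len1 x len2 comparison surface."""
    return len1 * len2

# The two haptoglobin sequences from the sample session.
print(worst_case_cells(2966, 2740))    # 8126840
# Two sequences at the 32,000-symbol input limit.
print(worst_case_cells(32000, 32000))  # 1024000000
```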
If you know roughly where the alignment of interest for long sequences begins, you can run the program with -LIMit. Then set the starting coordinates for each sequence near the point where the alignment of interest begins and set gap shift limits on each sequence. The program then aligns the sequences from your starting point such that the sequences do not get out of phase by more than the gap shift limits you have set. If you started both sequences at base number one and set the gap shift limit for sequence one to 100 and for sequence two to 50, then base 350 in sequence one could not be gapped to any base outside of the range from 300 to 450 on sequence two. These limited alignments often require less computer memory than unlimited alignments.
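The worked example above can be reproduced with a short sketch. The function and parameter names are hypothetical, and the mapping of each limit to one side of the window is an assumption inferred from the example (300 to 450 for base 350), not taken from the program itself.

```python
def allowed_partner_range(pos1, start1=1, start2=1, limit1=100, limit2=50):
    """Range of positions in sequence two that pos1 in sequence one may pair with.

    Assumption: limit2 bounds the window below the unshifted partner and
    limit1 bounds it above, which matches the example in the text.
    """
    unshifted = start2 + (pos1 - start1)  # partner when no gaps have shifted the phase
    return unshifted - limit2, unshifted + limit1

print(allowed_partner_range(350))  # (300, 450)
```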
This program can help you evaluate the significance of the alignment, using a simple statistical method, with Generate statistics from 10 randomized alignments. The second sequence is repeatedly shuffled, maintaining its length and composition, and then realigned to the first sequence. The average alignment score, plus or minus the standard deviation, of all randomized alignments is reported in the output file. You can compare this average quality score to the quality score of the actual alignment to help evaluate the significance of the alignment. The number of randomizations can be specified by adding an optional value to Generate statistics from 10 randomized alignments; the default is 10. You can preserve the dinucleotide or dipeptide composition of the input sequence in the shuffled sequence by using Randomize alignment preserving: set to dinucleotide or dipeptide composition. Use Randomize alignment preserving: set to trinucleotide or tripeptide composition to preserve the trinucleotide or tripeptide composition of the input sequence.
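The randomization procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the shuffle preserves only single-symbol composition (not dinucleotide or trinucleotide composition), the sample standard deviation is used because this document does not say which form Gap reports, and a simple count of identical positions stands in for a real alignment score.

```python
import random
import statistics

def ungapped_identities(a, b):
    """Stand-in scorer (an assumption): identical symbols at the same positions."""
    return sum(x == y for x, y in zip(a, b))

def randomization_stats(seq1, seq2, score_fn, n=10, seed=None):
    """Mean and sample standard deviation of scores against shuffled copies of seq2.

    Requires n >= 2 so that a standard deviation is defined.
    """
    rng = random.Random(seed)
    symbols = list(seq2)
    scores = []
    for _ in range(n):
        rng.shuffle(symbols)  # same length and composition, new order
        scores.append(score_fn(seq1, "".join(symbols)))
    return statistics.mean(scores), statistics.stdev(scores)

mean, sd = randomization_stats("AAGCTTGGTATG", "AAGCTTGGTATC", ungapped_identities, seed=0)
print(f"randomized score: {mean:.1f} +/- {sd:.1f}")
```

A real score much higher than the randomized mean, measured in standard deviations, suggests the alignment is unlikely to arise from composition alone, subject to the caveats in the next paragraph.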
By ignoring the statistical properties of biological sequences, this simple Monte Carlo statistical method may give misleading results. Please see Lipman, D.J., Wilbur, W.J., Smith, T.F., and Waterman, M.S. (Nucl. Acids Res. 12; 215-226 (1984)) for a discussion of the statistical significance of nucleic acid similarities.
Gap can ignore a region of excellent similarity between two sequences if it can produce an alignment of equal or better quality in some other way. BestFit is a better tool for finding weak or unknown similarity, or similarity that you suspect is not coextensive along the sequences. It is extremely important that you think formally about what Gap does. Using Gap rather than BestFit implies that you want an alignment where neither sequence is truncated.
Gap presents you with one member of the family of best alignments. There may be (and usually are) many members of this family, but no other member has a better quality. When two sequences are closely related, Gap is a good way to see the relationship between them; however, a gapped alignment obscures, or can even be confounded by, internal repeats. Graphic matrix analysis is more powerful for seeing internally repeated structures and approximating the frame of best alignment between two sequences that have never been previously compared. (See the Compare and DotPlot programs.)
The modification of scoring matrices is discussed in Appendix VII.
There is considerable evidence that more sensitive nucleic acid alignments may be possible by scoring transitions slightly positive and transversions slightly negative.
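A scoring function along these lines might look like the sketch below. The function name and the specific values (10 for a match, +1 for a transition, -1 for a transversion) are illustrative assumptions, not values from any scoring matrix supplied with the Package, and only the four unambiguous bases are handled.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def dna_score(x, y, match=10, transition=1, transversion=-1):
    """Score a pair of bases, favoring transitions over transversions."""
    if x == y:
        return match
    pair = {x, y}
    # A transition stays within a chemical class: A<->G or C<->T.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return transition
    return transversion
```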
Gap chooses default gap creation and extension penalties that are appropriate for the scoring matrix it reads. If you select a different scoring matrix with Scoring Matrix, the program will adjust the default gap penalties accordingly. (See Appendix VII for information about how to set the default gap penalties for any scoring matrix.) You can use Set gap creation penalty and Set gap extension penalty to specify alternative gap penalties if you don't want to accept the default values.
CompTable helps you create scoring matrices based on a simplification scheme for amino acid differences. There is also a short C program that can be modified to help you write a new scoring matrix quickly. The program is called cmpvals.c, and it is located in the public database. You may Fetch and modify cmpvals.c if you are comfortable working with the C programming language.
You can get a position in sequence one to pair with some other position in sequence two by choosing a special symbol not used in the rest of the sequences and giving it a very high match value in the scoring matrix. The alphabet of legitimate GCG sequence symbols is defined in Appendix III.
Gap makes an alignment to find the maximum similarity between two sequences by the method of Needleman and Wunsch (J. Mol. Biol. 48; 443-453 (1970)) that is similar to finding the minimum difference according to the method of Sellers (SIAM J. of Applied Math 26; 787-793 (1974)). Smith, Waterman, and Fitch (J. Mol. Evol. 18; 38-46 (1981)) showed that the methods were precisely equivalent when the Needleman and Wunsch gap creation penalty is equal to the Sellers gap creation penalty minus 0.5 and when the end gaps for Needleman and Wunsch are penalized in the same way as all the other gaps. Set gap extension penalty allows you to penalize the end gaps introduced by Gap.
When possible, Gap tries to find the optimal alignment very quickly. If this rapid alignment is not unambiguously optimal, Gap automatically realigns the sequences to calculate the optimal alignment. When this occurs, the monitor of alignment progress on your terminal screen (Aligning...) is displayed twice for a single alignment.
Gap reads a scoring matrix that contains values for every possible GCG symbol match. Gap finds an alignment with the maximum possible quality where the quality of an alignment is equal to the sum of the values of the matches (each match scored with the scoring matrix) less the gap creation penalty times the number of internal gaps and less the gap extension penalty times the total length of the internal gaps. The alignment found by Gap is, therefore, sensitive to the scoring matrix values and the gap penalties. There is no penalty if either sequence is shifted to the place where the alignment begins unless end gaps are penalized by using Set gap extension penalty.
BestFit and Gap display four figures of merit for alignments: Quality, Ratio, Identity, and Similarity.
The Quality (described above) is the metric maximized in order to align the sequences. Ratio is the quality divided by the number of bases in the shorter segment. Percent Identity is the percent of the symbols that actually match. Percent Similarity is the percent of the symbols that are similar. Symbols that are across from gaps are ignored in the calculation of Percent Identity and Percent Similarity. A similarity is scored when the scoring matrix value for a pair of symbols is greater than or equal to the average positive non-identical comparison value in the matrix, the similarity threshold. This threshold is also used by the display procedure to decide when to put a ':' (colon) between two aligned symbols. You can change this threshold by specifying optional values to -PAIr. For instance, -PAIr=10,5 would set the similarity threshold to 5.
The similarity and identity metrics are not optimized by alignment programs so they should not be used to compare alignments.
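The Ratio and Percent Identity definitions above can be checked with a short sketch. The function names are hypothetical, and the choice of denominator for Percent Identity (the columns in which neither sequence has a gap) is an assumption based on the statement that symbols across from gaps are ignored. The Ratio example uses the figures from the sample session: Quality 24426 and segment lengths 2966 and 2740.

```python
def ratio(quality, len1, len2):
    """Quality divided by the number of symbols in the shorter segment."""
    return quality / min(len1, len2)

def percent_identity(top, bottom, gap_char="."):
    """Percent of gap-free columns in which the two symbols are identical."""
    pairs = [(x, y) for x, y in zip(top, bottom)
             if x != gap_char and y != gap_char]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

print(round(ratio(24426, 2966, 2740), 3))  # 8.915, as in the sample output
```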
If your input sequences are peptide sequences, this program uses a scoring matrix, blosum62.cmp, with comparison values derived from a study of substitutions between amino acid pairs in ungapped blocks of aligned protein segments as measured by Henikoff and Henikoff (Proc. Natl. Acad. Sci. USA 89; 10915-10919 (1992)).
Gap and BestFit were originally written for Version 1.0 by Paul Haeberli from a careful reading of the Needleman and Wunsch (J. Mol. Biol. 48; 443-453 (1970)) and the Smith and Waterman (Adv. Appl. Math. 2; 482-489 (1981)) papers.
Limited alignments were designed by Paul Haeberli and added to the Package for Version 3.0. They were united into a single program by Philip Delaquess for Version 4.0.
You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.