FRAMEALIGN

Table of Contents
FUNCTION
DESCRIPTION
OUTPUT
INPUT FILES
RELATED PROGRAMS
ALGORITHM
ALIGNMENT METRICS
CONSIDERATIONS
SUGGESTIONS
ACKNOWLEDGEMENTS
PARAMETER REFERENCE

FUNCTION

FrameAlign creates an optimal alignment of the best segment of similarity (local alignment) between a protein sequence and the codons in all possible reading frames on a single strand of a nucleotide sequence. Optimal alignments may include reading frame shifts.

DESCRIPTION

FrameAlign inserts gaps to obtain the optimal local alignment of the best region of similarity between a protein sequence and the codons in a nucleotide sequence. Because FrameAlign can align the protein to codons in different reading frames of the nucleotide sequence, it can identify sequence similarity even when the nucleotide sequence contains reading frame shifts.

In standard sequence alignment programs, you routinely specify gap creation and extension penalties. In addition to these penalties, FrameAlign also allows you to specify a separate frameshift penalty for the creation of gaps that result in reading frame shifts in the nucleotide sequence. (See the ALGORITHM topic for a more detailed explanation of how gaps are penalized.)

By default, FrameAlign creates a local alignment between the nucleotide and protein sequences. If you select global alignment under Alignment Options, FrameAlign instead inserts gaps to optimize the alignment between the entire nucleotide sequence and the entire protein sequence.

OUTPUT

Here is the output file:


Local alignment of: atts0012  check: 2422  from: 1  to: 286

LOCUS       ATTS0012      286 bp    RNA             EST       31-OCT-1992
DEFINITION  A. thaliana transcribed sequence; clone TAT1B11, 5' end; similar to
            GLYCERALDEHYDE 3-PHOSPHATE DEHYDROGENASE, mRNA sequence.
ACCESSION   Z17438
NID         g16580
KEYWORDS    EST; expressed sequence tag; partial cDNA sequence. . . .

 to: jq1287  check: 7459  from: 1  to: 338

P1;JQ1287 - glyceraldehyde-3-phosphate dehydrogenase (EC 1.2.1.12), cytosolic -
 Arabidopsis thaliana
C;Species: Arabidopsis thaliana (mouse-ear cress)
C;Date: 31-Mar-1992 #sequence_revision 31-Mar-1992 #text_change 08-Sep-1997
C;Accession: JQ1287; JS0614
R;Shih, M.C.; Heinrich, P.; Goodman, H.M.
Gene 104, 133-138, 1991 . . .

 Scoring matrix: /package/share/10.0/gcgcore/data/rundata/blosum62.cmp
  CompCheck: 6430
 Translation table: /package/share/10.0/gcgcore/data/rundata/translate.txt

         Gap Weight:      8      Average Match:  2.912
      Length Weight:      2   Average Mismatch: -2.003
  Frameshift Weight:      0

            Quality:    363             Length:    240
              Ratio:  4.654               Gaps:      2
 Percent Similarity: 98.718   Percent Identity: 97.436

        Match display thresholds for the alignment(s):
                    | = IDENTITY
                    : =   2
                    . =   1

 atts0012 x jq1287          October 8, 1998 17:20  ..

                  .         .         .         .         .
       3 GAAATCAAGAAGGCCATCAAGGAGGAATCTGAAGGCAAAATGAAGGGAAT 52
         |||||||||||||||||||||||||||||||||||||||:::||||||||
     261 GluIleLysLysAlaIleLysGluGluSerGluGlyLysLeuLysGlyIl 277
                  .         .         .         .         .
      53 TTTGGGATACTCTGAGGATGATGTTGTGTCTACCGACTTTGTTGGTGACA 102
         ||||||||||...|||||||||||||||||||||||||||||||||||||
     278 eLeuGlyTyrThrGluAspAspValValSerThrAspPheValGlyAspA 294
                  .         .         .         .         .
     103 ACAGGTCAAGCATTTTCGATGCCAAGGCTGGATTGCATTGCATTGAGCGA 152
         ||||||||||||||||||||||||||||||||    ||||||||||||||
     295 snArgSerSerIlePheAspAlaLysAlaGly....IleAlaLeuSerAs 309
                  .         .         .         .         .
     153 CAAGTTTGTGAAGTTGGTGTCATGGTACGACAACGAATGGGGTTACACAG 202
         ||||||||||||||||||||||||||||||||||||||||||||||  ||
     310 pLysPheValLysLeuValSerTrpTyrAspAsnGluTrpGlyTyr..Se 325
                  .         .         .         .
     203 TTCTCGTGTCGTTGACCTTATCGTTCACATGTCAAAGGCC 242
         ||||||||||||||||||||||||||||||||||||||||
     326 rSerArgValValAspLeuIleValHisMetSerLysAla 338

The alignment output displays sequence similarity by printing one of three characters between a codon and an amino acid: a pipe character (|), a colon (:), or a period (.). Normally, a pipe character is put between a codon and an amino acid when the translated codon is identical to the amino acid. A colon is put between a codon and an amino acid when the comparison value between the translated codon and the amino acid is greater than or equal to the average positive non-identical comparison value in the amino acid substitution matrix. A period is put between a codon and an amino acid when the comparison value between the translated codon and the amino acid is greater than or equal to 1. You can change these match display thresholds by specifying -PAIr. (See Appendix VII for more information about comparison values in scoring matrices.)
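
The display rule can be sketched in a few lines of Python. This is only an illustration, not FrameAlign's code: the function name match_symbol and its arguments are hypothetical, and the default thresholds (2 for a colon, 1 for a period) are simply the ones shown in the sample output above.

```python
def match_symbol(translated, amino, value, colon_threshold=2, period_threshold=1):
    """Pick the character printed between a codon and an amino acid.

    translated: amino acid that the codon translates to
    amino:      amino acid in the protein sequence
    value:      comparison value for the pair from the scoring matrix
    Thresholds default to those in the sample output (an assumption
    for any other scoring matrix).
    """
    if translated == amino:
        return "|"          # identity
    if value >= colon_threshold:
        return ":"          # at or above the similarity threshold
    if value >= period_threshold:
        return "."          # weakly positive comparison value
    return " "              # no symbol printed

# Values 2 (Met/Leu) and 1 (Ser/Thr) are assumed, typical BLOSUM62 entries.
for pair in [("K", "K", 5), ("M", "L", 2), ("S", "T", 1), ("A", "W", -3)]:
    print(pair[0], pair[1], repr(match_symbol(*pair)))
```

In the sample alignment, the ATG codon aligned with Leu is marked with colons and the TCT codon aligned with Thr is marked with periods, which is consistent with this rule.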

INPUT FILES

The input to FrameAlign is a nucleotide sequence and a protein sequence. You can specify the sequences in any order as input to the program.

RELATED PROGRAMS

Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps. BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal alignments are found by inserting gaps to maximize the number of matches using the local homology algorithm of Smith and Waterman. Both Gap and BestFit align two sequences of the same type (i.e. both nucleotide sequences or both protein sequences).

FrameSearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Optimal alignments may include reading frame shifts.

ALGORITHM

FrameAlign aligns a nucleotide sequence with a protein sequence. The alignment procedure is an extension of the local alignment algorithm of Smith and Waterman (Advances in Applied Mathematics 2; 482-489 (1981)) that is modified to determine the score of the best segment of similarity between a protein sequence and the codons in a nucleotide sequence.

Scoring Matrix

To create the alignments, FrameAlign requires a scoring matrix that contains values for matches between all possible amino acids and codons. FrameAlign derives this amino acid - codon scoring matrix on the fly from a translation table and an amino acid substitution matrix. The translation table contains a list of all possible codons for each amino acid. The amino acid substitution matrix contains match values for the comparison of all possible amino acids.

In the derived amino acid - codon scoring matrix, the value of a match between any amino acid and any codon is the value of the match between the amino acid and the translated codon in the amino acid substitution matrix. If a codon contains IUB nucleotide ambiguity symbols (described in Appendix III), and all possible unambiguous representations of the codon translate to the same amino acid (e.g. MGR always translates to arginine in the standard genetic code), then the value of a match between that codon and any amino acid can be similarly determined. If all possible unambiguous representations of the codon do not translate to the same amino acid, then that codon is assumed to translate to an 'X'.
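
As a rough illustration of this derivation, the following Python sketch expands a codon containing IUB ambiguity symbols, translates every unambiguous form with the standard genetic code, and falls back to 'X' when the forms disagree. The names translate_codon and codon_score, and the dictionary layout of the substitution matrix, are assumptions made for this example and do not describe FrameAlign's internal data structures.

```python
from itertools import product

# IUB nucleotide codes mapped to the bases they stand for (U treated as T).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate_codon(codon):
    """Translate a codon; return 'X' if its unambiguous forms disagree."""
    expansions = product(*(IUB[base] for base in codon.upper()))
    amino_acids = {CODON_TABLE["".join(e)] for e in expansions}
    return amino_acids.pop() if len(amino_acids) == 1 else "X"

def codon_score(codon, amino, substitution):
    """Look up the match value for the translated codon and an amino acid.

    substitution is assumed to be a dict keyed by (amino, amino) pairs.
    """
    return substitution[(translate_codon(codon), amino)]

print(translate_codon("MGR"))  # all four expansions code for arginine
print(translate_codon("NNN"))  # expansions disagree
```

Here MGR expands to AGA, AGG, CGA, and CGG, all of which translate to arginine, so it is scored as R; NNN is scored as X.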

FrameAlign chooses default gap creation and extension penalties that are appropriate for the scoring matrix it reads. If you select a different scoring matrix with Scoring Matrix, the program will adjust the default gap penalties accordingly. (See Appendix VII for information about how to set the default gap penalties for any scoring matrix.) You can use Set gap creation penalty and Set gap extension penalty or respond to the program prompts to specify alternative gap penalties if you don't want to accept the default values.

Protein-Nucleotide Alignment

FrameAlign uses the values in the amino acid - codon scoring matrix to determine the score of the best alignment between the protein and nucleotide sequences. If you consider a graph, or path matrix, with the nucleotide sequence placed on the X axis and the protein sequence placed on the Y axis, then every point on the path matrix represents the best alignment between the sequences that ends at that point. For any point on the path matrix, the X coordinate is the first nucleotide of the final codon in the alignment, and the Y coordinate is the final amino acid in the alignment. Each possible alignment end point is associated with a path, which is a series of steps (insertions, deletions, matches) through the path matrix required to create the alignment.

Each step has its own score, and the scores for all the steps in an alignment path determine the quality score for the alignment. The quality score for an alignment is equal to the sum of the scoring matrix values of the matches in the alignment, minus the gap creation penalty multiplied by the number of gaps in the alignment, minus the frameshift penalty multiplied by the number of gaps in the alignment that change the reading frame, minus the gap extension penalty multiplied by the total length of all gaps in the alignment. (You can set the value for each of the penalties.)


quality = SUM(scoring matrix values of the matches in the alignment) -
          gap creation penalty  x  number of gaps in the alignment -
          frameshift penalty    x  number of gaps in the alignment
                                   that change the reading frame -
          gap extension penalty x  total length of all gaps
                                   in the alignment

For example, the following protein-nucleotide alignment consists of six steps:


       1 UGUUGUAUUCG....UGGUGG 17
         ||||||:::      ||||||
       1 CysCysValGlnIleTrpTrp 7

The first two steps are UGU-Cys matches. The third step is an AUU-Val match. The fourth step is a four nucleotide deletion. The last two steps are UGG-Trp matches. The quality score for this alignment is the sum of the scoring matrix values for two UGU-Cys matches, one AUU-Val match, and two UGG-Trp matches, minus one gap creation penalty, minus four gap extension penalties, minus one frameshift penalty.

Matches between an amino acid and a partial codon, like


                  CG.
                  Gln

in the above example, do not add any match value to the alignment score. By convention, all gap characters in partial codons are placed at the end of the codon. For example, the partial codon CG. in the above example will never be written as C.G.
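
The quality score of the six-step example can be worked through with a short Python sketch. The function name and argument layout are hypothetical. The match values (9 for Cys/Cys, 3 for Ile/Val, 11 for Trp/Trp) are assumed from the commonly distributed BLOSUM62 table and may differ from the values in the scoring matrix file you use; the penalties 8, 2, and 0 are the gap, length, and frameshift weights shown in the sample output.

```python
def quality(match_values, gap_lengths, frameshift_gaps,
            gap_creation, gap_extension, frameshift):
    """Quality = matches - creation*gaps - frameshift*shifting gaps
                 - extension*total gap length."""
    return (sum(match_values)
            - gap_creation * len(gap_lengths)
            - frameshift * frameshift_gaps
            - gap_extension * sum(gap_lengths))

# Two UGU-Cys, one AUU-Val, two UGG-Trp matches (assumed BLOSUM62 values).
# The partial codon CG. aligned with Gln contributes no match value.
matches = [9, 9, 3, 11, 11]
gaps = [4]            # one gap of four nucleotides
shifting_gaps = 1     # that gap changes the reading frame

print(quality(matches, gaps, shifting_gaps,
              gap_creation=8, gap_extension=2, frameshift=0))
```

With these assumed numbers the sum of matches is 43, the gap creation penalty is 8, and the four gap extension penalties total 8, giving a quality of 27. A nonzero frameshift penalty would lower the score by that amount once, because only one gap shifts the reading frame.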

If the best alignment ending at any point has a negative value, a zero is put at that position of the path matrix; otherwise, the quality score for the alignment is put at that position. After the path matrix is completely filled, the highest value in the matrix represents the score of the best region of similarity between the sequences (optimal local alignment). This highest value is reported as the comparison score between the nucleotide and protein sequences. The alignment itself can be reconstructed for display by following the best path from this point of highest value backward to the point where the path matrix has a value of zero.
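
The zero floor and the search for the highest value can be illustrated with a deliberately simplified Python sketch. It is not FrameAlign's algorithm: it compares two amino acid strings in a single reading frame, uses one linear gap penalty instead of separate creation, extension, and frameshift penalties, and returns only the score and end point rather than the alignment. All names and the toy scoring function are assumptions for this example.

```python
def local_score(a, b, score, gap):
    """Fill a Smith-Waterman style path matrix and return (best, end point).

    a, b:  sequences to compare (here, two amino acid strings)
    score: function returning the match value for a pair of symbols
    gap:   linear penalty per gap symbol (a simplification)
    """
    rows, cols = len(a) + 1, len(b) + 1
    matrix = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = matrix[i - 1][j - 1] + score(a[i - 1], b[j - 1])
            up = matrix[i - 1][j] - gap
            left = matrix[i][j - 1] - gap
            # A negative best alignment is replaced by zero.
            matrix[i][j] = max(0, diagonal, up, left)
            if matrix[i][j] > best:
                best, best_pos = matrix[i][j], (i, j)
    return best, best_pos

def toy(x, y):
    """Toy scoring: +2 for identity, -1 otherwise (not a real matrix)."""
    return 2 if x == y else -1

print(local_score("CCVWW", "AACCVWWAA", toy, gap=2))
```

In this toy case the best segment is the five identical residues, so the highest matrix value is 10, located at the end of that segment. Tracing back from that point until a zero is reached would recover the aligned region.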

ALIGNMENT METRICS

Four figures of merit are displayed along with the optimal alignment between the protein and nucleotide sequences: Quality, Ratio, Identity, and Similarity.

The Quality score (described above in the ALGORITHM topic) is the measure that is maximized in order to align the sequences. Ratio is the Quality divided by the smaller of one-third the number of bases in the alignment and the number of amino acids in the alignment. Gap symbols are ignored in the calculation of Ratio. Identity is the percent of identical matches between amino acids and codons in the alignment (i.e. the amino acid is identical to the translated codon). Similarity is the percent of matches between amino acids and codons in the alignment whose comparison values exceed the similarity threshold. By default, this threshold is the average positive non-identical comparison value in the scoring matrix. FrameAlign uses this same threshold to decide when to put a colon (:) between an aligned codon and amino acid in the alignment display. You can reset this threshold with -PAIr.
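
These definitions can be checked against the sample output in the OUTPUT topic with a small Python sketch. The function names are hypothetical. The counts used here are read off the sample alignment by hand and are therefore assumptions: 240 aligned bases (positions 3 to 242), 78 aligned amino acids (261 to 338), 76 identical codon-amino acid pairs, and 77 pairs at or above the similarity threshold.

```python
def ratio(quality, aligned_bases, aligned_amino_acids):
    """Quality divided by the smaller of bases/3 and amino acids."""
    return quality / min(aligned_bases / 3, aligned_amino_acids)

def percent(count, total):
    """Percentage of aligned pairs that meet a criterion."""
    return 100.0 * count / total

print(round(ratio(363, 240, 78), 3))   # compare with Ratio in the sample
print(round(percent(76, 78), 3))       # compare with Percent Identity
print(round(percent(77, 78), 3))       # compare with Percent Similarity
```

One-third of 240 bases is 80, which is larger than 78 amino acids, so Quality is divided by 78. The three values printed agree with the Ratio (4.654), Percent Identity (97.436), and Percent Similarity (98.718) reported in the sample output.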

CONSIDERATIONS

FrameAlign Always Finds Something

FrameAlign always finds an alignment for any protein and nucleotide sequences you compare, even if there is no significant similarity between them. You must evaluate the results critically to decide whether the segment shown is anything more than a random region of relative similarity.

FrameAlign Shows Only a Single Segment of Similarity

FrameAlign shows only one optimal alignment between a protein sequence and a nucleotide sequence. There are two reasons why you might want to evaluate several optimal and suboptimal alignments:

- If there are several disjoint segments of similarity, the selection of only a single segment for display does not provide a comprehensive view of the relationship between the nucleotide and protein sequences.

- The alignments displayed by FrameAlign are sensitive to your choices for the scoring matrix and gap penalties. If you vary these choices even slightly, FrameAlign may calculate different optimal alignments for the same segment of similarity between the sequences. If FrameAlign were able to display multiple and suboptimal alignments of the same region, you would be able to use the variation among the different alignments to determine which portions of the alignments were reliably determined.

SUGGESTIONS

Nucleotide Sequences Using Nonstandard Genetic Codes

If the nucleotide sequence is from an organism or organelle that uses a nonstandard genetic code, then you should specify an appropriate translation table using Translation Table. Different translation tables are discussed in Appendix VII.

Aligning a Protein Sequence with a Genomic Sequence Containing Introns

If you align a genomic sequence containing long introns to its corresponding protein sequence, FrameAlign will often display the local alignment of only one of the exons to its corresponding portion of the protein. To align the entire protein sequence to the entire genomic sequence, select global alignment under Alignment Options and reduce the gap extension penalty in response to the program prompt.

ACKNOWLEDGEMENTS

FrameAlign was written by Irv Edelman.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

Set gap creation penalty

sets the gap creation penalty that is subtracted from the alignment score whenever a gap is created.

Set gap extension penalty

sets the gap extension penalty that is subtracted from the alignment score for each gapped symbol.

Set frameshift penalty

sets the gap penalty that is subtracted from the alignment score whenever a gap is inserted that shifts the reading frame of the nucleotide sequence.

Scoring Matrix

allows you to specify a scoring matrix other than the program default. In creating alignments or finding sequence similarity, matching residues are scored according to values found in a scoring matrix. The matrix you choose depends on the expected similarity of the sequences to be compared. For example, you might use blosum90 to compare sequences that are expected to be very similar and blosum35 if you are expecting the sequences to be much less similar.

Translation Table

Usually, the Standard translation table is the basis for all translations. You can choose translation tables for various non-standard genomes such as yeast mitochondrial.

Alignment Options

    local alignment
    global alignment

aligns the entire lengths of the nucleotide and protein sequences (global alignment). By default, FrameAlign determines a local alignment of the best region of similarity between the protein sequence and the codons in the nucleotide sequence.

End Gap Options

    don't penalize gaps at the ends of the alignment
    penalize end gaps like other gaps

penalizes gaps placed before the beginning of a sequence and after the end of a sequence the same as gaps inserted within a sequence. By default, gaps placed at the very ends of sequences in global alignments are not penalized at all.

Don't penalize gap extensions longer than

lets you set the maximum penalty for any gap in the alignment. For instance, if you set Don't penalize gap extensions longer than to 12, then any gap longer than 12 characters is penalized the same as a gap of length 12. Using this parameter, alignments can contain large gaps without incurring large gap extension penalties. This may be useful, for instance, if you are aligning a protein sequence with the corresponding genomic DNA sequence containing large introns.
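
The effect of this limit can be sketched in Python as follows. The function name and arguments are hypothetical, and the penalties 8 and 2 are just the gap and length weights from the sample output; the sketch assumes the gap creation penalty is charged once per gap, as in the quality formula in the ALGORITHM topic.

```python
def gap_penalty(length, creation, extension, max_penalized_length=None):
    """Penalty for one gap, with an optional cap on the penalized length."""
    if max_penalized_length is None:
        penalized = length
    else:
        penalized = min(length, max_penalized_length)
    return creation + extension * penalized

# With a limit of 12, a 500-symbol gap costs the same as a 12-symbol gap.
print(gap_penalty(12, creation=8, extension=2, max_penalized_length=12))
print(gap_penalty(500, creation=8, extension=2, max_penalized_length=12))
print(gap_penalty(500, creation=8, extension=2))
```

With these assumed weights, both capped gaps cost 32, while the uncapped 500-symbol gap costs 1008, which shows why the limit helps when an intron must be spanned.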

Printed: January 13, 1999 6:27 (1162)