TFastA does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences. TFastA translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"
TFastA uses the method of Pearson and Lipman (Proc. Natl. Acad. Sci. USA 85; 2444-2448 (1988)) to search for similarities between a query protein sequence and any group of nucleotide sequences. TFastA translates the nucleotide sequences in all six frames before performing the comparison. Each translated reading frame is treated as a separate sequence to be searched.
In the first step of this search, the comparison can be viewed as a set of dot plots, with the query as the vertical sequence and the group of sequences to which the query is being compared as the different horizontal sequences. This first step finds the registers of comparison (diagonals) having the largest number of short perfect matches (words) for each comparison. In the second step, these "best" regions are rescored using a scoring matrix that allows conservative replacements, ambiguity symbols, and runs of identities shorter than the size of a word. In the third step, the program checks to see if some of these initial highest-scoring diagonals can be joined together. Finally, the search set sequences with the highest scores are aligned to the query sequence for display.
A word is any short sequence (n-mer or k-tuple) where you have set n to some small integer less than or equal to six. The word GGATGG is one of the 4,096 possible words of length six that can be created from an alphabet consisting of the four letters G, A, T, and C. The word QL is one of the 400 possible words of length two that you can make with the 20 letters of the amino acid alphabet.
The output from TFastA is a list file, and is suitable for input to any GCG program that allows indirect file specifications. (For information about indirect file specification, see Chapter 2, Using Sequence Files and Databases of the User's Guide.)
Here is some of the output file:
!!SEQUENCE_LIST 1.0 (Peptide) TFASTA of: ggamma.pep from: 1 to: 147 September 23, 1998 15:27 TRANSLATE of: gamma.seq check: 6474 from: 2179 to: 2270 and of: gamma.seq check: 6474 from: 2393 to: 2615 and of: gamma.seq check: 6474 from: 3502 to: 3630 generated symbols 1 to: 148. Human fetal beta globins G and A gamma from Shen, Slightom and Smithies, Cell 26; 191-203. . . . TO: GENEMBL:* Sequences: 552,297 Symbols: 1,036,534,764 Word Size: 2 Sequences too short to analyze: 26 (118 symbols) Databases searched: GenBank, Release 108.0, Released on 16Aug1998, Formatted on 16Aug1998 EMBL, Release 55.0, Released on 16Jun1998, Formatted on 18Aug1998 Searching all six frames. Scoring matrix: GenRunData:Blosum50.Cmp Variable pamfactor used Gap creation penalty: 16 Gap extension penalty: 2 Histogram Key: Each histogram symbol represents 1010 search set sequences Each inset symbol represents 15 search set sequences z-scores computed from opt scores z-score obs exp (=) (*) < 20 762 0:= 22 287 0:= 24 354 1:* 26 580 12:* 28 196 125:* 30 307 760:* 32 983 2937:= * 34 2854 7965:=== * 36 8036 16358:======== * 38 16819 27033:================= * 40 30999 37709:=============================== * 42 45853 46095:=============================================* 44 55899 50847:==================================================*===== 46 60552 51789:===================================================*======== 48 59619 49582:=================================================*========== 50 52834 45243:============================================*======== 52 44792 39776:=======================================*===== 54 37576 33976:=================================*==== 56 29109 28380:============================* 58 22227 23300:=======================* 60 16395 18874:================= * 62 12883 15132:============= * 64 9798 12034:========== * 66 8069 9511:======== * 68 7098 7481:=======* 70 5521 5863:=====* 72 4551 4581:====* 74 3626 3572:===* 76 2725 2780:==* 78 2267 2161:==* 80 1763 1678:=* 82 1463 1284:=* 84 1119 1017:=* 86 801 787:* 88 630 609:* 90 487 471:* 92 423 364:* :========================*==== 94 330 282:* :==================*=== 96 228 218:* :==============*= 98 141 169:* :========== * 100 154 131:* :========*== 102 122 101:* :======*== 104 80 78:* :=====* 106 64 61:* :====* 108 61 47:* :===*= 110 78 36:* :==*=== 112 29 28:* :=* 114 20 22:* :=* 116 15 17:* :=* 118 11 13:* :* >120 707 10:* :*======================================= Joining threshold: 36, opt. threshold: 36, opt. width: 16, reg.-scaled The best scores are: frame init1 initn opt z-sc E(...).. GB_PR2:HUMHBGG ! M15386 Human hemoglobin gamma-G (HB...(3) 971 971 971 1547.8 1.2e-78 GB_PR1:HSGGGPHG ! X55656 H.sapiens mRNA for gamma-G g...(2) 843 843 843 1343.8 2.7e-67 GB_PAT:I42109 ! I42109 Sequence 4 from patent US 56...(3) 765 765 776 1238.0 2.1e-61 ////////////////////////////////////////////////////////////////////////////// \\End of List ggamma.pep GB_PR2:HUMHBGG LOCUS HUMHBGG 545 bp mRNA PRI 24-MAR-1997 DEFINITION Human hemoglobin gamma-G (HBG2) mRNA, partial cds. ACCESSION M15386 NID g183884 KEYWORDS . SOURCE human. . . . SCORES Frame: (3) Init1: 971 Initn: 971 Opt: 971 z-score: 1547.8 E(): 1.2e-78 100.0% identity in 147 aa overlap 10 20 30 40 50 ggamma.pep MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAI ||||||||||||||||||||||||||||||||||||||||||||||||||||||| HUMHBGG PSPDAMGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAI 10 20 30 40 50 60 60 70 80 90 100 110 ggamma.pep MGNPKVKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVL |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| HUMHBGG MGNPKVKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVL 70 80 90 100 110 120 120 130 140 ggamma.pep AIHFGKEFTPEVQASWQKMVTGVASALSSRYH |||||||||||||||||||||||||||||||| HUMHBGG AIHFGKEFTPEVQASWQKMVTGVASALSSRYHXARCPXCRASRIGFILQAIQIINLFCXE 130 140 150 160 170 180 HUMHBGG I ///////////////////////////////////////////////////////////////////////// ! Distributed over 1 thread. ! Start time: Wed Sep 23 14:15:13 1998 ! Completion time: Wed Sep 23 15:29:29 1998 ! CPU time used: ! Database scan: 0:30:17.2 ! Post-scan processing: 0:00:07.0 ! Total CPU time: 0:30:24.2 ! Output File: ggamma.tfasta
The first part of the output file contains a histogram showing the distribution of the z-scores between the query and search set sequences. (See the ALGORITHM topic for an explanation of z-score.) The histogram is composed of bins of size 2 that are labeled according to the higher score for that bin (the leftmost column of the histogram). For example, the bin labeled 24 stores the number of sequence pairs that had scores of 23 or 24.
The next two columns of the histogram list the number of z-scores that fell within each bin. The second column lists the number of z-scores observed in the search and the third column lists the number of z-scores that were expected.
The body of the histogram displays a graphical representation of the score distributions. Equal signs (=) indicate the number of scores of that magnitude that were observed during the search, while asterisks (*) plot the number of scores of that magnitude that were expected.
At the bottom of the histogram is a list of some of the parameters pertaining to the search.
Below the histogram, TFastA displays a listing of the best scores. This listing includes the reading frame in the original nucleotide sequence from which the reported translated sequence is derived.
Following the list of best scores, TFastA displays the alignments of the regions of best overlap between the query and search sequences. In these alignments, stop codons are represented by the letter X.
This program displays only the region of overlap between the two aligned sequences (plus some residues on either side of the region to provide context for the alignment) unless you use -SHOWall. The display of identities and conservative replacements between the aligned sequences depends on the value of -MARKx. By default ( -MARKx=3), the pipe character (|) is used to denote identities and the colon (:) to denote conservative replacements.
TFastA accepts a single protein sequence as the query sequence. The search set is either a single nucleic acid sequence or multiple nucleic acid sequences. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*. If TFastA rejects your protein sequence, turn to Appendix VI to see how to change or set the type of a sequence.
FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST.
BLAST searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST can produce gapped alignments for the matches it finds. NetBLAST searches for sequences similar to a query sequence. The query and the database searched can be either peptide or nucleic acid in any combination. NetBLAST can search only databases maintained at the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA.
SSearch does a rigorous Smith-Waterman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). This may be the most sensitive method available for similarity searches. Compared to BLAST and FastA, it can be very slow.
TFastX does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences, taking frameshifts into account. It is designed to be a replacement for TFastA, and like TFastA, it is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"
FastX does a Pearson and Lipman search for similarity between a nucleotide query sequence and a group of protein sequences, taking frameshifts into account. FastX translates both strands of the nucleic sequence before performing the comparison. It is designed to answer the question, "What implied protein sequences in my nucleic acid sequence are similar to sequences in a protein database?"
FrameSearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Optimal alignments may include reading frame shifts.
WordSearch identifies sequences in the database that share large numbers of common words in the same register of comparison with your query sequence. The output of WordSearch can be displayed with Segments.
ProfileSearch and MotifSearch use a profile (derived from a set of aligned sequences) instead of a query sequence to search a collection of sequences. FindPatterns uses a pattern described by a regular expression to search a collection of sequences.
StringSearch, LookUp, and Names identify sequences by searching the annotation (non-sequence) portions of seqence files or sequence databases.
The query sequence may not be longer than 32,000 symbols. You cannot select a list size of more than 1,000 best scores nor view more than 1,000 alignments. The word size must be either 1 or 2.
For the estimates of statistical significance to be valid, the search set must contain a large sample of unrelated sequences. The statistical estimates will not be calculated at all if there are fewer than 60 frames searched (equivalent to 10 sequences when all six frames are searched, or 20 sequences when only the top three frames are searched).
With -NOOPTall, the estimates of statistical significance will not be accurate.
For a description of the algorithm, see the FastA program documentation.
TFastA treats each reading frame as a different sequence. If a nucleotide sequence contains a gene coding for a protein similar to your query, but with an intervening sequence that changes the reading frame, the program will find and display two matches, one for each reading frame. If the individual matches each have fairly low scores, they may not make the list of best scores. If you suspect that the gene for your query sequence contains intervening sequences, or if you are searching a nucleotide database known to contain sequencing errors that may cause a frameshift (such as the EST division of GenBank), use TFastX instead of TFastA.
TFastA translates stop codons in search set sequences to the sequence symbol X.
The E() scores are affected by similarities in sequence composition between the query sequence and the search set sequence. Unrelated sequences may have "significant" scores because of composition bias.
If there is a database entry that overlaps your query in several places, but there are large gaps between the matching regions, only the best overlap appears in the alignment display.
There are two ways to control the size of the list of best scores. By default, scores are listed until a specific E() value is reached. You may set the value in response to the program prompt or by using List scores until E() reaches; otherwise the program uses 10.0 for protein searches, 2.0 for nucleic acid searches.
If you use Number of scores to list (regardless of E() value), the E() value is ignored, and the program will list the number of scores you requested.
By default, TFastA uses a word size of 2. If it finds few or no matches, especially if your query sequence is short, rerun the search using Word size set to 1 to increase the sensitivity. Note that this will dramatically increase the amount of CPU time required to run the program.
Unlike other GCG programs, TFastA does not read the default gap creation and gap extension penalties from the scoring matrix file. It uses default gap creation and extension penalties that were empirically determined to be appropriate for the BLOSUM50 scoring matrix. If you select a different scoring matrix with Scoring Matrix, you may need to change the gap penalties. The histogram display gives a qualitative view of the quality of fit between the actual distribution of scores and the expected distribution of scores. This information may indicate whether or not suitable gap creation and extension penalties were used for the search. When the histogram shows poor agreement between the actual distribution and the theoretical distribution, you might consider using Set gap creation penalty and/or Set gap extension penalty to specify higher gap creation and extension penalties, respectively. For example, you might increase the gap creation penalty from 16 to 20 and the gap extension penalty from 4 to 6.
There are two different philosophies on how to penalize gaps in an alignment. One way is to penalize a gap by the gap creation penalty plus the extension penalty times the length of the gap (gapweight + (lengthweight x gap length)). The other way is to use the gap creation penalty plus the extension penalty times the gap length excluding the first residue in the gap (gapweight + (lengthweight x (gap length - 1)).
"Native" GCG programs, such as Framesearch and Bestfit, handle gap extension penalties the first way, while the FastA-family programs use the second way. Therefore a value for Set gap extension penalty that gives good results with one of the FastA-family programs may not give equivalent results with a native GCG program, and vice versa.
This program is multithreaded. It has the potential to run faster on a machine equipped with multiple processors because different parts of the analysis can be run in parallel on different processors. By default, the program assumes you have one processor, so the analysis is performed using one thread. You can use Number of processors to use to increase the number of threads up to the number of physical processors on the computer.
Under ideal conditions, the increase in speed is roughly linear with the number of processors used. But conditions are rarely ideal. If your computer is heavily used, competition for the processors can reduce the program's performance. In such an environment, try to run multithreaded programs during times when the load on the system is light.
As the number of threads increases, the amount of memory required increases substantially. You may need to ask your system administrator to increase the memory quota for your account if you want to use more than two threads.
Never use Number of processors to use to set the number of threads higher than the number of physical processors that the machine has -- it does not increase program performance, but instead uses up a lot of memory needlessly and makes it harder for other users on the system to get processor time. Ask your system administrator how many processors your computer has if you aren't sure.
If you want to search a single database division instead of an entire database, see the "Using Database Sequences" topic of Chapter 2, Using Sequence Files and Databases of the User's Guide for a list of the logical names used for the databases and the divisions of each database. The search set can also consist of a group of sequence files that are not in a database. Use a multiple sequence specification to name these. For information about naming groups of sequences for the search set, see the topics "Specifying Files" and "Using Wildcards" in Chapter 1, Getting Started, and "Using Database Sequences," "Using Multiple Sequence Format (MSF) Files", "Using Rich Sequence Format (RSF) Files", and "Using List Files" in Chapter 2, Using Sequence Files and Databases of the User's Guide.
The FASTA program family (FastA, TFastA, FastX, TFastX, and SSearch) was written by Professor William Pearson of the University of Virginia Department of Biochemistry (Pearson and Lipman, Proc. Natl. Acad. Sci., USA 85; 2444-2448 (1988)). In collaboration with Dr. Pearson, the programs were modified and documented for distribution with GCG Version 6.1 by Mary Schultz and Irv Edelman, and for Versions 8 through 10 by Sue Olson.
You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.