BLAST tutorial
This BLAST tutorial is designed to help both the novice and experienced BLAST user to set up and perform a BLAST search, decipher the output and analyze the results. The tutorial illustrates the potential for BLAST and PSI-BLAST searches to identify even weak (subtle) homologies to annotated entries in the database. It demonstrates that BLAST and PSI-BLAST (see separate PSI-BLAST tutorial) are important tools for predicting both biochemical activities and function from sequence relationships. In addition to the tutorial, the BLAST guide may be useful in becoming acquainted with the ins and outs of BLAST searching.

Introduction to a BLAST Query
Open a new browser window so that the BLAST program can be compared to the tutorial. Notice that the tutorial page resembles the Query form for an ADVANCED BLAST search; however, the elements of the Query form have been reorganized on the tutorial page to make them easier to describe. Explanatory notes have been added in light grey boxes. Additional details about BLAST are available through the BLAST details buttons.

The BLAST browser window may be left open and used in parallel, or it may be closed while browsing through this tutorial. Scroll down the tutorial page to learn how to submit a BLAST search, step by step. When you are ready, the search button in Step 5 will take you to the BLAST output page, where the results of this search can be examined.

Step 1.  Choose the program to use and the database to search.
     Program: blastp     Database: nr

As an example, consider the uncharacterized archaebacterial protein, MJ0577, from Methanococcus jannaschii. The amino acid sequence derived from the MJ0577 open reading frame will be used as a query in a search for sequence relatives in the amino acid database, nr (non-redundant). blastp is the appropriate search routine for all searches in which an amino acid query is to be compared to an amino acid database. The nr database is a good choice for a comprehensive search.

Step 2.  Input the data.

Query data is formatted as: FASTA
Adjust the pull-down menu (above) so that the selection (FASTA format vs. Accession or GI) matches the format of the sequence in the input window (below). In this case, FASTA format is chosen to correspond with the FASTA-formatted sequence in the input box. The GI (GenBank Identifier) is the number (2501594, in this case) located between the two vertical bars that follow "gi" in the Entrez entry. With the menu set to Accession or GI, entering "2501594" or the accession number "Q57997" in the sequence box accomplishes the same thing as entering the sequence in FASTA format.

Paste the query sequence or its (GI or accession) number into the query window.
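
As a rough illustration of where the GI and the accession sit in a bar-separated definition line, the sketch below splits such a line on the vertical bars. The GI (2501594) and accession (Q57997) come from this tutorial; the database tag "sp", the locus placeholder, the description text, and the helper name parse_defline are assumptions made for illustration only.

```python
def parse_defline(defline):
    """Return (gi, accession) from a 'gi|<number>|<db>|<accession>|...' line.

    Assumes the older bar-separated convention; other header styles
    are not handled in this sketch.
    """
    fields = defline.lstrip(">").split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError("not a gi-style definition line: " + defline)
    gi = int(fields[1])      # the number between the first two bars
    accession = fields[3]    # the accession follows the database tag
    return gi, accession


# Hypothetical definition line: only the GI and accession are taken
# from the tutorial text; the remaining fields are placeholders.
example = ">gi|2501594|sp|Q57997|EXAMPLE_LOCUS hypothetical protein MJ0577"
print(parse_defline(example))
```

Either value returned by this helper could then be pasted into the query window in place of the full sequence.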


Step 3.   Set the program options or choose defaults.
Note: certain options (*) are available only when using the Advanced BLAST Web site.

Perform ungapped alignment

Leaving this box unchecked allows gaps to be introduced into sequence alignments. This default option ensures that any similarities, even those confined to a single domain within the coding region, will be identified if the extent of local similarity is high enough.

* Choose an Organism from the list to limit your search:                               

* or enter your Organism Name or Taxonomic Class here:
This search is not limited to a particular organism. Relationships to proteins in any kingdom may provide clues about the functional classification of the hypothetical ORF in question.

* Expect
The E value threshold for the MJ0577 search has been changed from the default value of 10 to a setting of 1. Although hits with E values much higher than 0.1 are unlikely to reflect true sequence relatives, it is useful to examine hits with lower significance (E values between 0.1 and 10) for short regions of similarity. In the absence of longer similarities, these short regions may allow the tentative assignment of biochemical activities to the ORF in question. The significance of any such regions must be assessed on a case by case basis.

In trying to find a function for the unannotated open reading frame, MJ0577, look first for homologous proteins in other organisms that may already be annotated. Secondarily, note any short regions that bear significant similarity to portions of one or more proteins in the database that have been biochemically characterized. In this example, we will restrict our interest to BLAST hits with E values less than or equal to 1.0.
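
The two-tier reading of E values described above can be sketched as a simple filter. This is only an illustration: the hit names and E values are invented placeholders, not results from the MJ0577 search, and the function name classify_hits is hypothetical. Only the cutoffs (0.1 and 1.0) come from the text.

```python
def classify_hits(hits, strong_cutoff=0.1, report_cutoff=1.0):
    """Split (name, e_value) pairs into stronger candidates and borderline hits.

    Hits above report_cutoff are dropped, mirroring an Expect setting of 1.
    """
    strong, borderline = [], []
    for name, e_value in sorted(hits, key=lambda hit: hit[1]):
        if e_value <= strong_cutoff:
            strong.append(name)
        elif e_value <= report_cutoff:
            borderline.append(name)  # inspect by hand for short regions
    return strong, borderline


# Placeholder hits for illustration only.
hits = [("hit_A", 2e-30), ("hit_B", 0.04), ("hit_C", 0.6), ("hit_D", 7.5)]
strong, borderline = classify_hits(hits)
print("stronger candidates:", strong)
print("borderline, assess case by case:", borderline)
```

The borderline list is where short regions of similarity would be examined individually, as the text recommends.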

Filter Low complexity * Human repeats

It is appropriate to filter most queries for low complexity sequences. By taking an advance peek at the first alignment in the BLAST output, it can be seen that MJ0577 has no low complexity regions that are detected by the SEG filtering algorithm. Low complexity regions would appear as X's in the alignment of MJ0577 with itself.

Since this is not a human sequence, the human repeat check box is left unchecked.

Some types of low complexity sequences may not be detected by the filtering option in BLAST. For example, coiled-coil and transmembrane regions need to be detected using the appropriate programs outside of BLAST. As an example, the COILS algorithm was used to analyze the MJ0577 open reading frame for the presence of coiled-coil regions. It is apparent from the analysis that MJ0577 does, in fact, have a coiled-coil region. Since sequence encoding a coiled coil can lead to matches with other coiled-coil proteins and thus obscure more meaningful hits, the user might consider manually masking the region to optimize the sensitivity of the search. To do this, the amino acids between aa 71 (SLLL) and aa 120 (IIVV) would be replaced with X's. A query window in which this has already been done can be viewed here.
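
Manual masking of this kind is easy to script. The sketch below replaces a residue range with X's; it treats both endpoints as inclusive and 1-based, which is an assumption about how the range in the text is meant. The toy sequence is a stand-in and is not the real MJ0577 sequence, and the function name mask_region is hypothetical.

```python
def mask_region(sequence, start, end, mask_char="X"):
    """Replace residues start..end (1-based, inclusive) with mask_char."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("region lies outside the sequence")
    masked_length = end - start + 1
    return sequence[:start - 1] + mask_char * masked_length + sequence[end:]


# Toy 130-residue sequence standing in for MJ0577 (not the real protein).
toy = "ACDEFGHIKL" * 13
masked = mask_region(toy, 71, 120)
print(masked)
```

The masked string keeps its original length, so residue numbering in the BLAST output still corresponds to the unmasked protein.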

* Query Genetic Codes (blastx only)
When employing the blastx program (in which a translated nucleotide sequence is used as a query against a protein database), the genetic code to be used in the translation can be specified here. The standard genetic code is used by default. Since this tutorial employs blastp and not blastx, this option is not pertinent.
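
For readers unfamiliar with what the genetic code setting controls, here is a minimal sketch of translating a nucleotide string with the standard genetic code. blastx translates the query in all six reading frames; this sketch covers only the three forward frames. The function name translate and the example sequence are made up for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate dna starting at offset frame (0, 1 or 2); '*' marks a stop."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    # Codons containing characters other than T, C, A, G become 'X'.
    return "".join(STANDARD_CODE.get(codon, "X") for codon in codons)


example = "ATGGCCTAAGG"  # made-up sequence
for frame in range(3):
    print(frame, translate(example, frame))
```

Choosing a different genetic code would amount to swapping in a different amino acid string for the same codon order.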

   * Matrix    Gap existence cost    Per residue gap cost     Lambda ratio
BLOSUM62 is a general purpose matrix and the default choice in BLAST 2.0. A BLOSUM matrix assigns each aligned pair of residues a log-odds score based on the frequency with which that substitution is observed within conserved blocks of related proteins. BLOSUM62 is among the best of the available matrices for detecting weak protein similarities. Other supported options include PAM30, PAM70, BLOSUM80, and BLOSUM45.

* Other advanced options:
In the "Advanced Options" field it is possible to specify gap costs, word size, and other parameters not otherwise selectable on the query form. Output formatting options may also be adjusted here in case the formatting choices available through the form (see Step 4 below) are not adequate. For example, the user might type "-v 150" to display 150 descriptions (rather than the 100 or 250 available through the pull-down menu). Find out how to specify these options using the details button.
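
To show how a flag such as "-v 150" fits alongside the other settings in this tutorial, the sketch below assembles an argument list using the option letters of the standalone NCBI blastall program (-p, -d, -i, -e, -v, -b). Whether the web form's Advanced Options field accepts every one of these letters is an assumption not confirmed by this tutorial; the file name mj0577.fasta and the function name build_blastall_args are hypothetical. Nothing is executed.

```python
def build_blastall_args(query_file, program="blastp", database="nr",
                        expect=1.0, descriptions=150, alignments=50):
    """Assemble a blastall-style argument list (nothing is executed)."""
    return [
        "blastall",
        "-p", program,            # search program
        "-d", database,           # database to search
        "-i", query_file,         # FASTA query file (hypothetical name)
        "-e", str(expect),        # Expect threshold
        "-v", str(descriptions),  # number of one-line descriptions
        "-b", str(alignments),    # number of alignments shown
    ]


print(" ".join(build_blastall_args("mj0577.fasta")))
```

Keeping the options in a list rather than a single string makes it simple to change one setting, such as the number of descriptions, without retyping the rest.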


Step 4.  Set the output formatting options.
NCBI-gi Graphical overview
Alignment view Descriptions Alignments

These items are needed only for formatting. Note, however, that for queries with numerous significant hits in the selected database, the choice of a low number of descriptions or alignments may override the chosen E value threshold. For instance, a list of the 100 most significant hits (descriptions = 100) may (depending on the query) only contain sequences with E values less than 1. Though the E value threshold may have been set at 10, hits with E values between 1 and 10 will not be listed.

In the current example, the number of descriptions to be displayed has been left at the default value of 100, and the number of alignments has been set to 50 to save space.
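
The interaction between the description limit and the E value threshold can be sketched as follows. The E values are synthetic and chosen only to reproduce the situation described above; the function name displayed_descriptions is hypothetical, and the logic is a simplified model rather than BLAST's actual reporting code.

```python
def displayed_descriptions(e_values, expect=10.0, max_descriptions=100):
    """Return the E values that would appear in the description list."""
    passing = sorted(e for e in e_values if e <= expect)
    return passing[:max_descriptions]  # the list is cut off at the limit


# Synthetic data: 150 strong hits plus 20 weaker hits between 1 and 10.
strong = [1e-5] * 150
weak = [1.5 + 0.4 * i for i in range(20)]
shown = displayed_descriptions(strong + weak)
print(len(shown), "descriptions shown; largest E value:", max(shown))
```

Although all 20 weaker hits pass the threshold of 10, none of them is listed, because the 100 available slots are filled by hits with smaller E values.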



Step 5.  Perform the search.
   Send reply to the Email address:          In HTML format
     

Click on the search button now to initiate the search. In a short time, the query sequence will be compared to all of the entries in the specified database. Each comparison is scored, and the top scores are listed in rank order. You will be taken automatically to an intermediate formatting page, from which you can change several of the formatting options. If no changes are desired, simply click on the "Format Results" button to see the results of your search.

Revised June 12, 2000