Similarity searching

Introduction to the Similarity Page

This page summarizes the basic concepts and vocabulary of sequence similarity searching. It is intended for those new to the field who may not appreciate the importance of this technique in biology, who lack the vocabulary to understand the BLAST guide and tutorial, or who require a basic rather than a sophisticated understanding of the methods involved.

Premise

A sequence by itself is not informative; it must be analyzed by comparative methods against existing databases to develop hypotheses concerning relatives and function. For example, an abundant message in a cancer cell line may bear similarity to protein phosphatase genes. This relationship would prompt experimental scientists to investigate the role of phosphorylation and dephosphorylation in the regulation of cellular transformation.

Terms

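The terms similarity, identity, and homology each have a distinct meaning. Similarity and identity are quantitative measures of how closely two aligned sequences resemble or exactly match each other, whereas homology is the conclusion that two sequences share a common ancestor. Orthology and paralogy are important concepts describing the relationships among members of a protein family: orthologs are homologs in different organisms that arose through speciation, and paralogs are homologs that arose through gene duplication (Reeck, G. R., de Haen, C. et al. (1987)).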

General approach

The general approach uses a set of algorithms, such as the BLAST programs, to compare a query sequence to all the sequences in a specified database. Comparisons are made in a pairwise fashion. Each comparison is given a score reflecting the degree of similarity between the query and the sequence being compared. The higher the score, the greater the degree of similarity. The similarity is measured and shown by aligning the two sequences. Alignments can be global or local, depending on the algorithm. A global alignment is an optimal alignment that includes all characters from each sequence, whereas a local alignment is an optimal alignment that includes only the most similar local region or regions. Real matches are discriminated from artifactual ones using an estimate of the probability that the match might occur by chance. Similarity by itself, however, cannot be considered a sufficient indicator of function.
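The difference between global and local alignment can be sketched with two small dynamic-programming tables. This is only an illustration, not how the BLAST programs are implemented: the function name and the match, mismatch, and gap values are assumptions chosen for the example, and a simple per-residue gap cost is used.

```python
def alignment_scores(a, b, match=1, mismatch=-1, gap=-2):
    """Return (global_score, local_score) for sequences a and b.

    The scoring values are hypothetical defaults for illustration only.
    """
    rows, cols = len(a) + 1, len(b) + 1

    # Global table: every character must be aligned, so leading gaps cost.
    glob = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        glob[i][0] = i * gap
    for j in range(1, cols):
        glob[0][j] = j * gap

    # Local table: a cell never drops below zero, so an alignment may
    # start and end anywhere inside the two sequences.
    loc = [[0] * cols for _ in range(rows)]
    best_local = 0

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            glob[i][j] = max(glob[i - 1][j - 1] + s,
                             glob[i - 1][j] + gap,
                             glob[i][j - 1] + gap)
            loc[i][j] = max(0,
                            loc[i - 1][j - 1] + s,
                            loc[i - 1][j] + gap,
                            loc[i][j - 1] + gap)
            best_local = max(best_local, loc[i][j])

    # The global score is read from the final corner; the local score is
    # the best cell anywhere in the table.
    return glob[-1][-1], best_local
```

For a short sequence embedded in a longer one, such as "CGT" inside "AAAACGTAAAA", the local score rewards the shared region while the global score is dragged down by the unaligned flanks.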

The BLAST algorithm

The BLAST programs (Basic Local Alignment Search Tool) are a set of sequence comparison algorithms introduced in 1990 that are used to search sequence databases for optimal local alignments to a query. The BLAST programs improved the overall speed of searches, which matters as databases continue to grow, while retaining good sensitivity. They do this by breaking the query and database sequences into fragments ("words") and initially seeking matches between fragments. The initial search is done for a word of length "W" that scores at least "T" when compared to the query using a given substitution matrix. Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold "S". The "T" parameter dictates the speed and sensitivity of the search.
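The seed-and-extend idea described above can be sketched as follows. This is a deliberately naive illustration under several assumptions: words are compared all against all rather than through the lookup structures a real implementation would use, a hypothetical match/mismatch score stands in for the substitution matrix, extension is ungapped, and the `x_drop` cutoff that stops an extension is a parameter introduced here for the example. All function names are hypothetical.

```python
def pair_score(x, y, match=1, mismatch=-1):
    # Hypothetical match/mismatch scores standing in for a substitution matrix.
    return match if x == y else mismatch


def find_word_hits(query, subject, w=4, t=4):
    """Return (query_pos, subject_pos) pairs whose length-w words score >= t."""
    hits = []
    for i in range(len(query) - w + 1):
        for j in range(len(subject) - w + 1):
            word_score = sum(pair_score(query[i + k], subject[j + k])
                             for k in range(w))
            if word_score >= t:
                hits.append((i, j))
    return hits


def extend_one_way(query, subject, qi, sj, step, start_score, x_drop):
    """Extend without gaps in one direction; return (best_score, residues_added)."""
    best = running = start_score
    added = best_added = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        running += pair_score(query[qi], subject[sj])
        added += 1
        if running > best:
            best, best_added = running, added
        elif best - running > x_drop:
            break  # score has fallen too far below the best seen so far
        qi += step
        sj += step
    return best, best_added


def extend_hit(query, subject, i, j, w=4, x_drop=2):
    """Extend a word hit in both directions; return (score, q_start, q_end)."""
    seed = sum(pair_score(query[i + k], subject[j + k]) for k in range(w))
    score, right = extend_one_way(query, subject, i + w, j + w, +1, seed, x_drop)
    score, left = extend_one_way(query, subject, i - 1, j - 1, -1, score, x_drop)
    return score, i - left, i + w + right
```

In this sketch, an extended hit would be reported only if its score reaches the threshold S. Lowering `t` admits more word hits, which raises sensitivity at the cost of more extensions to perform.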

Quantification

The quality of each pairwise alignment is represented as a score, and the scores are ranked. Scoring matrices are used to calculate the score of the alignment base by base (DNA) or amino acid by amino acid (protein). A unitary matrix is used for DNA pairs because each position can be given a score of +1 if it matches and a score of zero if it does not. Substitution matrices are used for amino acid alignments. These are matrices in which each possible residue substitution is given a score reflecting the probability that it is related to the corresponding residue in the query. The alignment score is the sum of the scores for each position. Various scoring systems (e.g., PAM, BLOSUM, and PSSM) for quantifying the relationships between residues have been used.
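As a minimal sketch of position-by-position scoring, the functions below sum per-position scores over two already-aligned, gap-free sequences using the unitary matrix described above. The function names are hypothetical, and the `score` argument is where a substitution matrix lookup could be supplied for proteins.

```python
def unitary_score(x, y):
    # Unitary matrix: +1 for a match, 0 for a mismatch.
    return 1 if x == y else 0


def alignment_score(aligned_a, aligned_b, score=unitary_score):
    """Sum the per-position scores of two aligned sequences of equal length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(score(x, y) for x, y in zip(aligned_a, aligned_b))
```

For example, aligning "ACGTAC" with "ACCTAG" gives four matching positions and therefore a score of 4.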

Gaps

Positions at which a letter is paired with a null are called gaps. Gap scores are negative. Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is frequently ascribed more significance than the length of the gap. Hence the opening of a gap is penalized heavily, whereas a lesser penalty is assigned to each subsequent residue in the gap. There is no widely accepted theory for selecting gap costs. It is rarely necessary to change gap values from the default.
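The two-part penalty described above is often called an affine gap cost. The sketch below assumes one common convention, in which a gap of a given length costs an opening penalty plus an extension penalty for every residue in the gap. The function name and the values 11 and 1 are illustrative assumptions, not a recommendation.

```python
def affine_gap_score(length, gap_open=11, gap_extend=1):
    """Return the (negative) score of a single gap spanning `length` residues.

    gap_open and gap_extend are illustrative values; the convention assumed
    here charges gap_open once and gap_extend for each residue in the gap.
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0
    return -(gap_open + gap_extend * length)
```

Under these values, one gap of five residues scores -16, whereas five separate single-residue gaps score -60 in total, which reflects the idea that the existence of a gap matters more than its length.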

Significance

The significance of each alignment is computed as a P value or an E value. Each alignment must be viewed by a critical human eye before being accepted as meaningful. For example, high-scoring pairs whose similarity is based on repeated amino acid stretches (e.g., polyglutamine) are unlikely to reflect meaningful similarity between the query and the match. Filters (e.g., SEG) that mask low-complexity regions can be applied to partially alleviate this problem.
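For local alignment scores, the E value is commonly estimated with the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths, and the corresponding P value is 1 - exp(-E). The sketch below assumes that relation; the function names are hypothetical, and `lam` and `k` must be supplied by the caller because they depend on the scoring system in use.

```python
import math


def e_value(score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least `score`.

    lam and k are statistical parameters of the scoring system; no
    particular values are assumed here.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one chance alignment, given an E value."""
    # -expm1(-e) equals 1 - exp(-e) but stays accurate for very small e.
    return -math.expm1(-e)
```

Because E falls exponentially with the score, a modest increase in score makes a chance match far less likely, and for small E the P value and E value are nearly equal.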

Databases

A variety of DNA and protein databases are available. A protein database is appropriate for searches with an amino acid sequence as query. A nucleic acid database is generally appropriate for searches with a DNA query sequence. The exception to this occurs when using programs such as BLASTX and TBLASTN, which perform cross-comparisons between different types of query and database sequences.
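The pairing of query and database types can be summarized in a small lookup table. The sketch below is an illustration with hypothetical names; it records the sequence type the user supplies for the query and the database, and it includes the other programs of the BLAST family alongside the two named above. The translating programs convert nucleotide sequence to protein internally before comparing.

```python
# (query type supplied, database type searched) for each program.
PROGRAMS = {
    "blastn": ("nucleotide", "nucleotide"),
    "blastp": ("protein", "protein"),
    "blastx": ("nucleotide", "protein"),      # query is translated
    "tblastn": ("protein", "nucleotide"),     # database is translated
    "tblastx": ("nucleotide", "nucleotide"),  # both are translated
}


def programs_for(query_type, database_type):
    """Return the program names that accept this query/database pairing."""
    return sorted(name for name, types in PROGRAMS.items()
                  if types == (query_type, database_type))
```

For instance, a DNA query searched against a protein database maps to BLASTX, and a protein query searched against a nucleic acid database maps to TBLASTN.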

Revised May 2, 2000