PROFILESEARCH

Table of Contents
FUNCTION
DESCRIPTION
OUTPUT
INPUT FILES
RELATED PROGRAMS
RESTRICTIONS
ALGORITHM
NORMALIZATION OF SCORES
CONSIDERATIONS
PROFILE FILE FORMAT
PARAMETER REFERENCE

FUNCTION

[ Top | Next ]

ProfileSearch uses a profile (representing a group of aligned sequences) as a query to search the database for new sequences with similarity to the group. The profile is created with the program ProfileMake.

DESCRIPTION

[ Previous | Top | Next ]

See the Profile Analysis Essay for an introduction to associating distantly related proteins and finding structural motifs.

Using the method of Gribskov, et al. (Methods in Enzymology, 183; 146-159 (1989)), ProfileSearch accepts a profile from ProfileMake and uses it to search a database (or any set of sequences you specify) for sequences that are similar to the aligned probe sequences used to create the profile. The algorithm calculates the score (quality) of the optimal alignment between the profile and each sequence in the database and creates a list of all of the sequences in the database with an alignment score above some threshold. The results of ProfileSearch are corrected for systematic effects of the sequence length on the score. The output list can be displayed as optimal alignments with ProfileSegments.

The gap creation and gap extension penalties specified for ProfileSearch are maximum values. The actual position-specific gap penalties at any position are determined by multiplying the gap creation penalty by the percent value in the second to the last column of the profile, and the gap extension penalty by the percent value in the last column of the profile.

ProfileSearch does a lot of computing so you will probably want to run it in the batch queue (see the CONSIDERATIONS topic below).

OUTPUT

[ Previous | Top | Next ]

Here is some of the output file:


!!SEQUENCE_LIST 1.0
(Peptide) PROFILESEARCH of: hsp70.prf Length: 718 to: PIR:*

	 Scores are not corrected for composition effects

	         Gap Weight: 24.00
	  Gap Length Weight: 0.27
	 Sequences Examined: 99165
	 CPU time (seconds): 3040
*    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *
Profile information:
(Peptide) PROFILEMAKE v4.50 of: Hsp70.MSF{*}  Length: 718
  Sequences: 28  MaxScore: 2172.36  October 11, 1996 11:41
                          Gap: 1.00              Len: 1.00
                     GapRatio: 0.33         LenRatio: 0.10
         Hsp70.MSF{Hs70_Plafa}  From: 1         To: 718       Weight: 1.00
         Hsp70.MSF{Hs70_Thean}  From: 1         To: 718       Weight: 1.00 . . .
*    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *
Normalization: 					October 16, 1998 11:26

	 Curve fit using 48 length pools
	 2 of 50 pools were rejected
	 Normalization equation:

		 Calc_Score = 138.82 * ( 1.0 - exp(-0.0020*SeqLen - 0.3503) )

	 Correlation for curve fit: 0.911

	 Z score calculation:
	 Average and standard deviation calculated using 98922 scores
	 243 of 99165 scores were rejected

		 Z_Score = ( Score/Calc_Score - 0.992 ) / 0.107

          Sequence  Strd ZScore   Orig Length ! Documentation  ..
PIR2:JC4853           +  136.40 1740.39    646 ! dnaK-type molecular chape ...
PIR2:S07197           +  136.40 1740.39    646 ! dnaK-type molecular chape ...
PIR2:A27077           +  136.39 1740.20    646 ! dnaK-type molecular chape ...

/////////////////////////////////////////////////////////////////////////// ...

PIR2:B70338           +    2.52  80.21    133 ! general secretion pathway  ...
PIR2:I40149           +    2.51 100.70    257 ! outer surface protein D -  ...
PIR2:JC6163           +    2.51  83.98    154 ! ubiquitin-conjugating enzy ...

The output file from ProfileSearch is an ordered list of the sequences with the highest alignment scores when compared to the profile. Unless the normalization procedure is disabled or fails, the file is ordered according to the Z scores (see the NORMALIZATION OF SCORES topic below).

The documentation section of the file (before the sequence listing) is divided by rows of asterisks into three parts. The first section describes the conditions used in running ProfileSearch, the number of sequences examined, and the amount of CPU time used. The second section reports the documentary information from the profile. The third section of the documentation records the process of the normalization (see the NORMALIZATION OF SCORES topic below).

INPUT FILES

[ Previous | Top | Next ]

ProfileSearch requires a profile as one of its input files. You can create profiles from aligned sequences by means of the ProfileMake program. In the ProfileDir directory, GCG provides a large number of amino acid profiles derived from the PROSITE database.

ProfileSearch accepts multiple (two or more) sequences of the same type as the search set. (They should be of the same type as the sequences that were used to create the profile.) You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*. The function of ProfileSearch depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

[ Previous | Top | Next ]

PileUp creates a multiple sequence alignment from a group of related sequences. LineUp is a multiple sequence editor used to create multiple sequence alignments. Pretty displays multiple sequence alignments.

ProfileMake makes a profile from a multiple sequence alignment. ProfileSearch uses the profile to search a database for sequences with similarity to the group of aligned sequences. ProfileSegments displays optimal alignments between each sequence in the ProfileSearch output list and the group of aligned sequences (represented by the profile consensus). ProfileGap makes optimal alignments between one or more sequences and a group of aligned sequences represented as a profile. ProfileScan finds structural and sequence motifs in protein sequences, using predetermined parameters to determine significance.

RESTRICTIONS

[ Previous | Top | Next ]

We have little experience using nucleotide sequences with profile analysis.

Because of memory constraints, ProfileSearch will only use the first 100,000 sequences for performing normalization (see the NORMALIZATION OF SCORES topic, below).

ALGORITHM

[ Previous | Top | Next ]

See the Profile Analysis Essay for an introduction to associating distantly related proteins and finding structural motifs.

NORMALIZATION OF SCORES

[ Previous | Top | Next ]

The scores for comparison of sequences and a profile are systematically correlated with the lengths of the sequences. ProfileSearch corrects the observed alignment scores for these systematic affects of sequence length to give normalized scores.

The relationship between the length of a sequence (SeqLen) and the observed score for comparison of the profile and sequence (Score) is usually very close to


Score = C * (1 - e(A*SeqLen+B))

where A, B, and C are some constants. These constants can be empirically determined by fitting the alignment scores of the profile and the search set sequences to the above equation.

The scores for each sequence are sorted by the sequence length and pooled together in groups. Each pool contains at least 50 sequences (20 sequences if there are fewer than 1,500 total scores to normalize), but if the sequences are very similar in length, the program continues to add sequences to the pool until a sequence is encountered that is more than 50 residues longer than the shortest sequence in the pool. For each pool, the mean length, score, and the standard deviation of the scores are determined. If a pool contains more than the minimum number of sequences (50 or 20), these calculations are derived from an evenly-spaced sample of 50 (or 20) sequences from the pool rather than from all sequences in the pool.

The best fit of these values to the above equation is determined by non-linear fitting by the Levenberg-Marquardt method. The fit is weighted by the standard deviations of the pools; pools that contain values that represent true similarity between the sequence and profile have large standard deviations and, therefore, make relatively small contributions to the fit.

The assumption of the normalization is that the alignment scores being normalized represent sequences unrelated to the profile. Because there are usually some sequences that are related to the profile, a second pass of normalization is made. In the second pass, the average score of each length pool is compared to the calculated score for a sequence of the same length. Pools whose average scores differ from their expected scores by more than 5.0 times the average standard deviation of all pools are eliminated from consideration. This effectively eliminates most pools that contain many sequences related to the profile, since these pools have average scores with large deviations from the predicted scores. The curve fitting procedure is now repeated and new values for A, B, and C calculated.

Each normalized score (Score(N)) is then calculated


Score(N) = Score(orig) / (C * (1 - e(A*SeqLen+B)))

The average (Score(N)(avg)) and standard deviation (SD(N)) of all normalized scores are then calculated. Normalized scores whose original scores differ from the calculated score for a sequence of the same length by more than 5.0 times the average standard deviation of all length pools are omitted from this calculation. This ensures that sequences similar to the profile will not affect the calculation of the mean and standard deviation.

The Z scores are the differences in standard deviation units between each normalized comparison score and the mean normalized comparison score for sequences unrelated to the profile. Therefore, a Z score of 5.0 means that a comparison score is significant at the 5.0 sigma level. The Z score is calculated as


Z_Score = (Score(N) - Score(N)(avg)) / SD(N)

The Z score has a mean of 0.0 and a standard deviation of 1.0.

The third section of the documentation in the ProfileSearch output file records the process of the normalization described above. First, the number of length pools used in the normalization, and the number of pools rejected because of high standard deviation are reported. The empirically derived equation of the curve is then presented, along with the correlation coefficient for the agreement between the calculated curve and the observed results. This value is typically about 0.95. If it is much lower (e.g., less than 0.90), the reported Z scores are not accurate. Finally, the equation used for the calculation of Z scores is reported, along with the number of scores omitted from the calculation of the normalized mean and standard deviation.

Optionally, a file recording the observed and calculated scores for the length pools used in the normalization procedure may be produced with -FITfile. For each pool used in the calculation, the file contains the average and standard deviation of the lengths and observed scores, and the calculated score for a sequence of the average length of the pool.

Failure of Normalization

The normalization procedure described above appears to be robust, but it must be remembered that it is based on an empirically derived description of the distribution of scores. In particular, this normalization may not be appropriate for very long profiles with large numbers of rows with low gap creation penalties. It also fails if the sequences in the search set are all very similar in length.

If the first pass of the curve fitting procedure is unsuccessful, the Z scores are all reported as zero. If the first pass is successful, but the second pass fails, Z scores are calculated and the entries in the output file are sorted based on the Z scores; however, a warning message is produced in the documentation of the output. If fewer than 400 entries are saved in the output or if fewer than three length pools are present, normalization will not be attempted and all Z scores will be reported as zero.

CONSIDERATIONS

[ Previous | Top | Next ]

ProfileSearch attempts to choose default maximum gap creation and extension penalties that are appropriate for the query profile it reads. You can use Set maximum gap creation penalty and Set maximum gap extension penalty or respond to the program prompts to specify alternative maximum gap penalties if you don't want to accept the default values.

ProfileSearch requires the computer to make a calculation that is proportional to the product of the profile length times the length of the database. On slower computers, large profiles searched against large databases (e.g. all of PIR) may require hours of CPU time. Searches of the nucleotide sequence database with profiles prepared from nucleotide sequence alignments may take substantially longer because of the larger size of the database and the necessity to search both forward and reverse strands of each database sequence.

PROFILE FILE FORMAT

[ Previous | Top | Next ]

Look at the entry for ProfileMake for a description and an example of what profile files look like.

PARAMETER REFERENCE

[ Previous | Top | Next ]

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

Set maximum gap creation penalty

sets the maximum position-specific gap creation penalty that is subtracted from the alignment score whenever a gap is created.

Set maximum gap extension penalty

sets the maximum position-specific gap extension penalty that is subtracted from the alignment score for each gapped symbol.

Lowest Z score to report in output list

sets the lowest Z score for an entry to be reported in the output list. If Z scores aren't calculated, then all entries (up to the maximum number of 100,000) are reported. The default reports an entry if its score is 2.5 standard deviations above the expected score for a sequence of the same length.

Minimum length of sequence to examine in the search

sets the minimum length for a sequence to be searched.

Maximum time for search(in seconds)

sets a limit in seconds beyond which the search is aborted. The limit defaults to 86,400 seconds (24 hours). ProfileSearch reports the results for the sequences searched before the time limit is exceeded.

Printed: January 13, 1999 6:28 (1162)