PROFILESCAN

Table of Contents

FUNCTION

DESCRIPTION

OUTPUT

FUNCTION

ProfileScan uses a database of profiles to find structural and sequence motifs in protein sequences.

DESCRIPTION

See the Profile Analysis Essay for an introduction to associating distantly related proteins and finding structural motifs.

ProfileScan uses the method of Gribskov et al. (CABIOS 4(1); 61-66 (1988)) to find structural and sequence motifs in protein sequences. These motifs are represented as profiles in a library. ProfileScan aligns each profile motif to the sequence, and displays all alignments between the profile and sequence that have a normalized score above a set threshold. Because more than one alignment between a sequence and a particular motif can be found, each repeat of a duplicated structure (such as the zinc finger motif) can be presented.

OUTPUT

Here is some of the ygbyad.scan output file:


 PROFILESCAN of : ygbyad  check: 5237  from: 1  to: 1392

P1;YGBYAD - L-aminoadipate-semialdehyde dehydrogenase (EC 1.2.1.31) - yeast
 (Saccharomyces cerevisiae)
N;Alternate names: alpha-aminoadipate reductase; protein YBR0910; protein
 YBR115c
C;Species: Saccharomyces cerevisiae
C;Date: 31-Dec-1991 #sequence_revision 31-Dec-1991 #text_change 12-Dec-1997
C;Accession: JU0448; S48279; S45983; A25815; S37810; S25367; S34171; S44694
R;Morris, M.E.; Jinks-Robertson, S. . . .

 Compare to profile library: GenRunData:profilescan.fil

 ..
--------------------------------------------------------------------------------
 Profile: profiledir:amp_binding.prf
   Gap weight:  4.50     Gap Length weight:   0.05
   Ave match:   0.12     Ave mismatch     :  -0.10
(Peptide) PROFILEMAKE v4.40 of: 0455.Msf2{*}  Length: 59
  Sequences: 28  MaxScore: 15.35  December 2, 1992  01:06
This profile is derived from PROSITE release 10.0 and has been tested
by a database search against SWISS-PROT release 26.0.  A comparison
of the SWISS-PROT annotation and the results of the database search
follows.  For further information about this motif, consult the . . .

Profile: profiledir:amp_binding.prf     alignment: 1

 Quality:  10.69       Gaps: 0
   Ratio:   0.21     Length: 51
 Normalized quality:  2.34
                  .         .         .         .         .
S    399 DHYKDTRTGVVVGPDSNPTLSFTSGSEGIPKGVLGRHFSLAYYFNWMSKR 448
         :. .:: :.....::. : | |||||:| |||||  | ::.   . ::::
P      7 EQSEDTETTQPDDPEDLAFIIFTSGTTGKPKGVMLTHKGVVNSVSSLSDR 56

S    449 F 449
         |
P     57 F 57

*****************************************
* Putative AMP-binding domain signature *
*****************************************

It has been shown [1 to 5] that a number of prokaryotic and eukaryotic enzymes
which all probably act via  an ATP-dependent  covalent binding of AMP to their
substrate, share a region of sequence similarity. These enzymes are:

//////////////////////////////////////////////////////////////////////////////

-Consensus pattern: [LIVMFY]-x(2)-[STG]-[STAG]-G-[ST]-[STEI]-[SG]-x-[PASLIVM]-
                    [KR]
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: 13.

-Note: in a majority of cases the residue that  follows  the Lys at the end of
 the pattern is a Gly.

-Last update: November 1997 / Pattern and text revised.

[ 1] Toh H.
     Protein Seq. Data Anal. 4:111-117(1991).
[ 2] Smith D.J., Earl A.J., Turner G.
     EMBO J. 9:2743-2750(1990).
[ 3] Schroeder J.
     Nucleic Acids Res. 17:460-460(1989).
[ 4] Mallonee D.H., Adams J.L., Hylemon P.B.
     J. Bacteriol. 174:2065-2071(1992).
[ 5] Turgay K., Krause M., Marahiel M.A.
     Mol. Microbiol. 6:529-546(1992).

//////////////////////////////////////////////////////////////////////////////

The file ygbyad.sum lists the number of occurrences of each motif in the sequence of interest, the score for each occurrence, and the threshold score for that motif.
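
The header fields printed above each alignment in the .scan file can be extracted with a short script. The sketch below is only an illustration and is not part of the package: the function name parse_alignment_header is hypothetical, and the relationship Ratio = Quality / Length is an observation that fits the sample above (10.69 / 51 rounds to 0.21), not a documented definition.

```python
import re

# Header lines copied from the sample ygbyad.scan alignment shown above.
SAMPLE_HEADER = """\
 Quality:  10.69       Gaps: 0
   Ratio:   0.21     Length: 51
 Normalized quality:  2.34
"""

# Matches "Label: number" pairs. Labels may contain spaces
# (for example "Normalized quality") but never span lines.
FIELD = re.compile(r"([A-Za-z][A-Za-z ]*?):\s*(-?\d+(?:\.\d+)?)")


def parse_alignment_header(text):
    """Return a dict mapping each header label to its numeric value.

    Hypothetical helper: the layout is inferred from the sample output only.
    """
    return {label.strip(): float(value) for label, value in FIELD.findall(text)}


fields = parse_alignment_header(SAMPLE_HEADER)

# Assumption checked against this one sample: Ratio is Quality / Length.
derived_ratio = round(fields["Quality"] / fields["Length"], 2)
print(fields["Normalized quality"], derived_ratio)
```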

INPUT FILES

ProfileScan takes as input one or more protein sequences. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*. If ProfileScan rejects your protein sequence, turn to Appendix VI to see how to change or set the type of a sequence.

RELATED PROGRAMS

PileUp creates a multiple sequence alignment from a group of related sequences. LineUp is a multiple sequence editor used to create multiple sequence alignments. Pretty displays multiple sequence alignments.

ProfileMake makes a profile from a multiple sequence alignment. ProfileSearch uses the profile to search a database for sequences with similarity to the group of aligned sequences. ProfileSegments displays optimal alignments between each sequence in the ProfileSearch output list and the group of aligned sequences (represented by the profile consensus). ProfileGap makes optimal alignments between one or more sequences and a group of aligned sequences represented as a profile. ProfileScan finds structural and sequence motifs in protein sequences, using predetermined parameters to determine significance.

Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns.
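
To make the difference between a PROSITE pattern and a profile concrete, the sketch below translates the consensus pattern quoted in the sample output into a regular expression and applies it to the sequence segment from the sample alignment. It is a minimal illustration that handles only a common subset of the PROSITE syntax (residues, x, [..] sets, {..} exclusions, and repeat counts); the function name prosite_to_regex is hypothetical, and nothing here is claimed to reflect how Motifs is actually implemented.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern (subset of the syntax) into a regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.strip()
        # Split off an optional repeat count such as x(2) or x(2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            regex = "."                          # any residue
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"      # any residue except these
        else:
            regex = core                         # single residue or [..] set
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


# Consensus pattern from the sample output, joined onto one line.
PATTERN = "[LIVMFY]-x(2)-[STG]-[STAG]-G-[ST]-[STEI]-[SG]-x-[PASLIVM]-[KR]"

# Sequence segment (residues 399 to 449) from the sample alignment.
SEGMENT = "DHYKDTRTGVVVGPDSNPTLSFTSGSEGIPKGVLGRHFSLAYYFNWMSKRF"
SEGMENT_START = 399

match = re.search(prosite_to_regex(PATTERN), SEGMENT)
if match:
    # Convert the 0-based offset into the numbering used in the alignment.
    print(match.group(), SEGMENT_START + match.start())
```

On this segment the pattern matches LSFTSGSEGIPK beginning at residue 418, inside the region that the profile alignment covers.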

RESTRICTIONS

Unknown.

ALGORITHM

See the Profile Analysis Essay for an introduction to associating distantly related proteins and finding structural motifs.

ProfileScan aligns the motif profile to a sequence in much the same way as ProfileGap. Unlike ProfileGap, however, ProfileScan displays every alignment whose score is above a set threshold. The scores are normalized for systematic effects of sequence length on the score. Since the average normalized score for sequences unrelated to the profile is expected to be 1.0, the threshold can be viewed as the factor by which an alignment score must exceed the expected alignment score for unrelated sequences to be reported. For instance, if the threshold is set at 2.0, an alignment is reported if its normalized score is at least 2.0 times the expected score for sequences unrelated to the profile.
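
The reporting rule can be written as a one-line filter. The sketch below assumes that the normalized scores are already available and that a score equal to the threshold is reported, following the "at least 2.0 times" wording; the function name and all scores other than 2.34 (taken from the sample output) are made up for illustration.

```python
def reported_alignments(normalized_scores, threshold=2.0):
    """Keep the normalized scores that meet the threshold.

    Assumption: a score exactly equal to the threshold is reported.
    """
    return [score for score in normalized_scores if score >= threshold]


# 2.34 is the normalized quality from the sample output; the rest are invented.
scores = [0.9, 1.4, 2.34, 3.1]
print(reported_alignments(scores))
```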

In practice, two possible thresholds, high and interesting, can be selected. The threshold values for each motif are present in the motif library file, profilescan.fil. The interesting level is usually set at 3.0 standard deviations above the mean score for sequences in the database unrelated to the profile, and the high level is usually set at the 5.0 to 6.0 standard deviation level. The default high threshold can be overridden with the "Report scores higher than:" option. (See the entry for ProfileSearch in the Program Manual for a complete description of normalized scores.)
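
As a rough illustration of how such thresholds relate to the score distribution, the sketch below derives an interesting and a high cutoff from a set of scores for unrelated sequences and then classifies an alignment score. The input scores are invented, the function names are hypothetical, and the use of the sample standard deviation and of 5.0 for the high level are assumptions; the values actually stored in profilescan.fil were set by the library's authors and are not recomputed by ProfileScan.

```python
from statistics import mean, stdev


def thresholds_from_unrelated(scores, interesting_sd=3.0, high_sd=5.0):
    """Return (interesting, high) cutoffs as mean + k * standard deviation."""
    mu = mean(scores)
    sigma = stdev(scores)  # sample standard deviation (an assumption)
    return mu + interesting_sd * sigma, mu + high_sd * sigma


def classify(score, interesting, high):
    """Label a normalized score relative to the two cutoffs."""
    if score >= high:
        return "high"
    if score >= interesting:
        return "interesting"
    return "below threshold"


# Invented normalized scores for sequences unrelated to the profile.
unrelated = [0.8, 0.9, 1.0, 1.1, 1.2]
interesting, high = thresholds_from_unrelated(unrelated)
print(classify(2.34, interesting, high))
```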

Validated Profiles

The motif library consists of validated profiles derived from aligned sequences known to contain each structural motif. A validated profile has the following properties: 1) all of the sequences used to create the profile correctly align to the profile; and 2) all sequences known to contain the motif score above the high threshold. The scores for these sequences are higher in every case than the scores for sequences known to lack the motif. Operationally, the process of creating a validated profile is as follows:

1) Each sequence known to contain the motif is aligned to the profile using ProfileGap. The alignment generated should correspond to the original alignment. If the alignments differ significantly, they are repeated with different gap creation and gap extension penalties until they agree.

2) Each motif profile is compared to all the sequences in the database using ProfileSearch. All sequences known to contain the motif represented by the profile should have higher scores than any sequences that lack the motif.

3) If the profile does not adequately discriminate between sequences with the motif and those without, and if changing the gap creation and gap extension penalties does not improve the discrimination, the alignments are examined by eye to determine why the sequences without the motif are giving high scores. The profile can then be edited by hand to reduce the scores in the profile at the positions that are contributing to the high scores of the sequences lacking the motif.
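
The discrimination requirement in the validation process, that every sequence with the motif outscores every sequence without it, is easy to test once the search scores are tabulated. The sketch below is a hypothetical helper, not part of the package; the sequence names and scores are invented, and both dictionaries are assumed to be non-empty.

```python
def check_discrimination(with_motif, without_motif):
    """Check that all motif-containing sequences outscore all others.

    Both arguments map a sequence name to its search score.
    Returns (ok, missed, intruders):
      missed    - motif sequences scoring no higher than the best non-motif one
      intruders - non-motif sequences scoring at least as high as the
                  weakest motif sequence
    """
    best_negative = max(without_motif.values())
    worst_positive = min(with_motif.values())
    missed = sorted(n for n, s in with_motif.items() if s <= best_negative)
    intruders = sorted(n for n, s in without_motif.items() if s >= worst_positive)
    return (not missed and not intruders), missed, intruders


# Invented example: seqY lacks the motif but outscores seqB.
positives = {"seqA": 4.1, "seqB": 3.2}
negatives = {"seqX": 1.1, "seqY": 3.5}
print(check_discrimination(positives, negatives))
```

The intruders list names the sequences whose alignments would be examined by eye before the profile is edited.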

CONSIDERATIONS

ProfileScan may report multiple occurrences of a motif profile in a protein sequence. The alignments may represent repeats of a duplicated structure, or they may represent distinct alignments between the motif profile and the same region of the protein sequence. These alternatives can be distinguished by looking at the alignments in the .scan file.
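
One way to sort the reported alignments into these two cases is to group them by the sequence region they cover. The sketch below assumes that any overlap in sequence coordinates means "the same region", which is a simplification; the function name is hypothetical, and only the range 399 to 449 comes from the sample output, the other ranges being invented.

```python
def group_overlapping(alignments):
    """Group (start, end) sequence ranges that overlap one another.

    Coordinates are 1-based and inclusive, as in the .scan alignments.
    A group with several members suggests alternative alignments of one
    region; separate groups suggest repeats of the motif.
    """
    groups = []
    for start, end in sorted(alignments):
        if groups and start <= groups[-1]["end"]:
            # Overlaps the current group: add it and extend the group's span.
            groups[-1]["members"].append((start, end))
            groups[-1]["end"] = max(groups[-1]["end"], end)
        else:
            groups.append({"end": end, "members": [(start, end)]})
    return [group["members"] for group in groups]


# (399, 449) is from the sample output; the other two ranges are invented.
ranges = [(900, 950), (399, 449), (405, 452)]
print(group_overlapping(ranges))
```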

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.