PROFILEMAKE

Table of Contents
FUNCTION
DESCRIPTION
OUTPUT
INPUT FILES
RELATED PROGRAMS
RESTRICTIONS
CALCULATING THE PROFILE
CONSIDERATIONS
ACKNOWLEDGMENT
PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-MATRix=mymatrix.cmp
-BEGin=1
-END=100
-WEIGHT=1.0
-GAPCoefficient=100
-LENGTHCoefficient=100
-GAPRatio=0.33
-LENGTHRatio=0.1
-NOLOGwgt
-STRINgent
-SEQout=hsp70.pep

FUNCTION

[ Top | Next ]

ProfileMake creates a position-specific scoring table, called a profile, that quantitatively represents the information from a group of aligned sequences. The profile can then be used for database searching (ProfileSearch) or sequence alignment (ProfileGap).

DESCRIPTION

See the Profile Analysis Essay for an introduction to associating distantly related proteins and finding structural motifs.

ProfileMake uses the method of Gribskov et al. (Proc. Natl. Acad. Sci. USA 84; 4355-4358 (1987)) to create a profile from a group of aligned sequences. A profile is a table that contains all of the comparison information of a group of aligned sequences. These sequences must be aligned (see the RELATED PROGRAMS topic below) before you run ProfileMake. The profile contains as many rows as there are positions in the aligned sequences. Each row contains a score for the alignment of the corresponding position of the aligned sequences with each possible base or residue.

The profile is the input data for ProfileSearch, which can find sequences in the database similar to your group of aligned sequences, and ProfileGap, which can make an optimal alignment between the aligned sequences and another sequence.

The aligned sequences may be specified to ProfileMake with an ambiguous file expression or in a list file similar to the input for Pretty or LineUp. (See Chapter 2, Using Sequence Files and Databases in the User's Guide for more information.)

OUTPUT

Here is some of the output file:


!!AA_PROFILE 1.0
(Peptide) PROFILEMAKE v4.50 of: hsp70.msf{*}  Length: 743
  Sequences: 25  MaxScore: 2172.36  October 7, 1998 11:41

                          Gap: 1.00              Len: 1.00
                     GapRatio: 0.33         LenRatio: 0.10

         hsp70.msf{S11448}  From: 1         To: 743       Weight: 1.00
         hsp70.msf{S06443}  From: 1         To: 743       Weight: 1.00

         /////////////////////////////////////////////////////////////////

         hsp70.msf{S29261}  From: 1         To: 743       Weight: 1.00

Symbol comparison table: GenRunData:blosum62.cmp  FileCheck: 6430

     Relaxed treatment of non-observed characters
     Exponential weighting of characters
Cons A    B    C    D    E    F    G    H    I    K    L  ... Gap  Len  ..
 M   -1   -4   -1   -4   -2    0   -4   -2    1   -1    2 ...   9    9
 L   -1   -5   -1   -5   -4    0   -5   -4    2   -2    4 ...   9    9

 /////////////////////////////////////////////////////////////////////

 E   -2    5  -10    5   12   -7   -5    0   -7    2   -7 ...   2    2
 V    0   -7   -3   -7   -5   -2   -7   -7    7   -5    2 ...   2    2
 B   -5   15   -7   15    5   -7   -3   -2   -7   -2  -10 ...   2    2
 * 1390    0  114 1140 1219  600 1333  167 1011 1254 1183 ...

INPUT FILES

ProfileMake accepts multiple sequences (two or more) all of the same type. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*. The function of ProfileMake depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.
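
As a rough illustration of the type check described above, the Python sketch below looks for Type: N or Type: P in a heading line. The function name and the sample heading line (including its Check value) are hypothetical; the exact layout of a real sequence file heading may differ.

```python
import re


def sequence_type(heading_line):
    """Return 'nucleotide', 'protein', or None if no Type: field is found."""
    match = re.search(r"\bType:\s*([NP])\b", heading_line)
    if match is None:
        return None
    return {"N": "nucleotide", "P": "protein"}[match.group(1)]


# Hypothetical heading line, made up for illustration only.
line = " hsp70.pep  Length: 743  Type: P  Check: 1234  .."
print(sequence_type(line))
```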

RELATED PROGRAMS

PileUp creates a multiple sequence alignment from a group of related sequences. LineUp is a multiple sequence editor used to create multiple sequence alignments. Pretty displays multiple sequence alignments.

ProfileMake makes a profile from a multiple sequence alignment. ProfileSearch uses the profile to search a database for sequences with similarity to the group of aligned sequences. ProfileSegments displays optimal alignments between each sequence in the ProfileSearch output list and the group of aligned sequences (represented by the profile consensus). ProfileGap makes optimal alignments between one or more sequences and a group of aligned sequences represented as a profile. ProfileScan finds structural and sequence motifs in protein sequences, using predetermined parameters to determine significance.

RESTRICTIONS

We have little experience using nucleotide sequences with profile analysis.

Profiles must be no more than 1000 residues long. ProfileMake cannot accept more than 5000 aligned sequences for the profile. It is your responsibility to ensure that the sequences input to ProfileMake are in alignment.

CALCULATING THE PROFILE

Similarity Scores

In a scoring matrix, a score can be found for the comparison of any two sequence symbols. (See Appendix VII for more information.) Given a group of aligned sequences, a score can be calculated for the comparison of a symbol to each position of the aligned sequences. This comparison score differs from position to position in the aligned sequences, because each position contains a different spectrum of sequence symbols. The overall score is, in a sense, the average of the comparison scores for the sequence symbols found at a particular aligned sequence position.

Each row of a profile contains the scores for a comparison of the corresponding position of a multiple sequence alignment to each possible sequence symbol. For example, if a profile is made from a group of aligned protein sequences, the 10th row of the profile has values for the comparison of the 10th position in the alignment to each possible amino acid. The profile has as many rows as there are positions in the alignment, and each row has as many comparison scores as there are amino acid symbols. Thus, the profile is a position-specific scoring matrix for every position in a multiple sequence alignment.

The consensus sequence character is the symbol with the largest value in each row of the profile. It is used solely for the display of alignments and not for the calculation of the optimal alignment between a profile and a sequence.
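
The consensus rule can be sketched in a few lines of Python. The row below is a hand-built dictionary holding only the columns A through L that are visible in the "E" row of the example output; the function name is hypothetical, and how ProfileMake breaks ties is not stated, so this sketch simply keeps the first maximum it meets.

```python
def consensus_symbol(row):
    """Return the symbol with the largest score in one profile row.

    `row` maps each sequence symbol to its comparison score.
    Ties go to the first symbol encountered (an assumption).
    """
    return max(row, key=row.get)


# Scores copied from the "E" row of the example output (columns A..L only).
row_e = {"A": -2, "B": 5, "C": -10, "D": 5, "E": 12, "F": -7,
         "G": -5, "H": 0, "I": -7, "K": 2, "L": -7}
print(consensus_symbol(row_e))
```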

The last row of the profile contains the composition for the whole profile. In the A column, for instance, the total number of A's in the multiple sequence alignment is shown.
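
A minimal sketch of the composition count, assuming a plain unweighted tally of every non-gap symbol in the alignment. Whether ProfileMake applies sequence weights to this row is not stated here, and the toy alignment and function name are invented for illustration.

```python
from collections import Counter

GAP_CHARS = {".", "~"}  # gap characters used in aligned sequences


def composition(aligned):
    """Count every non-gap symbol across all aligned sequences."""
    counts = Counter()
    for seq in aligned:
        counts.update(ch for ch in seq.upper() if ch not in GAP_CHARS)
    return counts


# Toy alignment, not taken from hsp70.msf.
toy = ["MAL.E", "MALVE", "MGL~E"]
comp = composition(toy)
print(comp["A"], comp["M"])
```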

Sequence Symbol Weights

As stated above, the comparison score of an alignment position and a given sequence symbol is an average of the comparison scores for the different sequence symbols at that position. This average is weighted so that a symbol's weight in the calculation of the average score increases along with its fraction of the symbols at that position. Two types of weighting are currently used. Linear weighting (chosen with -NOLOGwgt) gives a weight to each symbol that is directly proportional to the number of occurrences of that symbol at a given position. The default logarithmic weighting gives a symbol that predominates at a given position a disproportionately higher weight than a symbol that occurs only once. This causes positions in the aligned sequences that have many identical residues to bias the profile more strongly towards the identical residues than when linear weighting is used.

Using either kind of weighting, the weight for a residue is 0 when that residue does not occur at a given position; the weight is 1 when only that residue is found at a given position.

If the number of aligned sequences is fairly small, the sequence symbols observed at each position of the alignment may not represent the whole spectrum of symbols that would be observed if more sequences were available. In these cases, even residues that are not observed at a given position in the alignment should perhaps be given a small weight. For nucleic acids, non-observed bases are given a weight of 0 by default. The default for proteins is to give non-observed amino acids a weight equal to 0.025 divided by the sum of the sequence weights. -STRINgent gives non-observed sequence symbols a weight of 0.
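
The weighting rules above can be sketched for one alignment column as follows. This covers linear weighting only, because the exact formula for the default logarithmic weighting is not given here. The row score is taken to be the weighted sum of scoring-matrix entries, which is one reading of "weighted average"; the three-symbol matrix is a made-up toy, not a real .cmp file, the function names are hypothetical, and the column is assumed to contain no gap characters.

```python
def linear_weights(column, alphabet, seq_weights=None, stringent=False):
    """Linear symbol weights for one gap-free alignment column.

    Observed symbols get their (sequence-weighted) fraction of the column.
    Non-observed symbols get 0 when stringent is True, otherwise
    0.025 / sum(seq_weights), the protein default described above.
    """
    if seq_weights is None:
        seq_weights = [1.0] * len(column)
    total = sum(seq_weights)
    floor = 0.0 if stringent else 0.025 / total
    weights = {sym: 0.0 for sym in alphabet}
    for sym, w in zip(column, seq_weights):
        weights[sym] += w / total
    return {sym: (w if w > 0 else floor) for sym, w in weights.items()}


def row_scores(column, matrix, **kwargs):
    """Score the column against every symbol as a weighted sum (assumption)."""
    alphabet = sorted(matrix)
    w = linear_weights(column, alphabet, **kwargs)
    return {b: sum(w[a] * matrix[a][b] for a in alphabet) for b in alphabet}


# Toy symmetric scoring matrix, for illustration only.
TOY = {"A": {"A": 4, "G": 0, "W": -3},
       "G": {"A": 0, "G": 6, "W": -2},
       "W": {"A": -3, "G": -2, "W": 11}}

column = ["A", "A", "A", "G"]
strict = row_scores(column, TOY, stringent=True)
relaxed = row_scores(column, TOY)
print(strict)
print(relaxed)
```

With the stringent setting, W contributes nothing to any score; with the relaxed setting it carries a weight of 0.025 / 4, which nudges each score slightly toward the W row of the matrix.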

Gap Coefficients

The profile also includes position-specific gap coefficients, expressed as percentages. The gap coefficient determines the penalty that an alignment must pay in order to create a gap, and the gap length coefficient determines the penalty that must be paid in order to extend a gap. The actual gap penalties are calculated by multiplying the position-specific gap coefficients by the gap penalties specified when running the other Profile programs.
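
As a small sketch of that multiplication, the Python below treats each coefficient as a percentage and scales the run-time penalties accordingly. Dividing by 100 is an assumption about how the percentage is applied, the function name is hypothetical, and the penalty values 4.5 and 0.05 are arbitrary illustrative numbers, not program defaults.

```python
def effective_gap_penalties(gap_coeff_pct, len_coeff_pct,
                            gap_penalty, ext_penalty):
    """Scale run-time penalties by position-specific percentage coefficients."""
    return (gap_coeff_pct / 100.0 * gap_penalty,
            len_coeff_pct / 100.0 * ext_penalty)


# A position whose coefficients were reduced to 30 percent.
creation, extension = effective_gap_penalties(30, 30, 4.5, 0.05)
print(creation, extension)
```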

All gaps in the aligned sequences that overlap are treated as a single gap for purposes of calculating gap coefficients. The gap is considered to begin at the position of the leftmost gap character (. or ~) in any of the sequences, and to end at the rightmost gap character. The position-specific gap coefficients are reduced from 100 percent as a function of the longest gap through the position of interest in the aligned sequences. The gap coefficient G and gap length coefficient L are calculated as
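
The merging of overlapping gaps can be sketched as an interval merge. The function below is hypothetical: it collects each run of gap characters as a column interval, merges intervals that share at least one column, and reports for every column the length of the merged gap passing through it. Treating adjacent but non-overlapping gaps as separate is an assumption.

```python
import re

GAP_RUN = re.compile(r"[.~]+")


def merged_gap_lengths(aligned):
    """For each column, the length of the merged gap covering it (0 if none)."""
    # Collect every gap run as a half-open (start, end) column interval.
    runs = sorted(m.span() for seq in aligned for m in GAP_RUN.finditer(seq))
    merged = []
    for start, end in runs:
        if merged and start < merged[-1][1]:  # shares a column with previous
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    lengths = [0] * len(aligned[0])
    for start, end in merged:
        for col in range(start, end):
            lengths[col] = end - start
    return lengths


# Toy alignment: two overlapping gaps of length 2 merge into one of length 3.
toy = ["MK..LVE",
       "MKA..VE",
       "MKAILVE"]
print(merged_gap_lengths(toy))
```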


G = C(G) x R(G) / ( 1 + GapLength x R(L) )
L = C(L) x R(G) / ( 1 + GapLength x R(L) )

where GapLength is the length of the gap as defined above. GapCoefficient (C(G)), LengthCoefficient (C(L)), GapRatio (R(G)), and GapLengthRatio (R(L)) have default values of 100, 100, 0.33, and 0.1 respectively, but can be changed with -GAPCoefficient, -LENGTHCoefficient, -GAPRatio, and -LENGTHRatio.
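
A short Python sketch of this calculation follows. It takes the gap length coefficient to scale with LENGTHCoefficient, as the -LENGTHCoefficient and -GAPRatio parameter descriptions indicate, and it leaves positions with no gap at the maximum values, as described for -GAPCoefficient. The function name is hypothetical.

```python
def gap_coefficients(gap_length, gap_coeff=100.0, len_coeff=100.0,
                     gap_ratio=0.33, len_ratio=0.1):
    """Return (G, L) for a position crossed by a merged gap of gap_length.

    A position with no gap (gap_length == 0) keeps the maximum values.
    """
    if gap_length == 0:
        return gap_coeff, len_coeff
    factor = gap_ratio / (1.0 + gap_length * len_ratio)
    return gap_coeff * factor, len_coeff * factor


for n in (0, 1, 5, 10):
    g, l = gap_coefficients(n)
    print(n, round(g, 2), round(l, 2))
```

With the default settings, a gap of length 1 gives coefficients of about 30, close to GAPRatio multiplied by the maximum coefficient (33), and longer gaps lower the coefficients further.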

You can edit the profile with a text editor and change the gap coefficients to any values you wish.

CONSIDERATIONS

If you edit a profile, the "Length:" entry in the profile header must agree with the actual length of the profile (number of rows).
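
One way to check an edited profile is sketched below. The sketch makes several assumptions about the file layout based only on the example output: the first "Length:" field is the profile length, the table starts on the line after the one beginning with "Cons", and it ends at the composition row beginning with "*". The function name and the toy profile text are hypothetical.

```python
import re


def check_profile_length(text):
    """Return (declared length, counted rows) for a profile given as text."""
    lines = text.splitlines()
    declared = None
    for line in lines:
        match = re.search(r"Length:\s*(\d+)", line)
        if match:
            declared = int(match.group(1))
            break
    # Index of the column-heading line that precedes the profile rows.
    start = next(i for i, line in enumerate(lines) if line.startswith("Cons"))
    rows = 0
    for line in lines[start + 1:]:
        if line.lstrip().startswith("*"):  # composition row ends the table
            break
        if line.strip():
            rows += 1
    return declared, rows


# Toy three-row profile, invented for illustration.
toy = """!!AA_PROFILE 1.0
(Peptide) PROFILEMAKE v4.50 of: toy.msf{*}  Length: 3
Cons A    B    C  Gap  Len  ..
 M   -1   -4   -1   9    9
 L   -1   -5   -1   9    9
 E   -2    5  -10   2    2
 *    1    0    0
"""
declared, rows = check_profile_length(toy)
print(declared == rows)
```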

If you create a profile from a single peptide sequence, you should use -STRINgent to give a weight of 0 to all symbols not occurring at each position in the sequence.

ACKNOWLEDGMENT

Profile analysis was first described in 1987 by Michael Gribskov, Andrew McLachlan and David Eisenberg (Proc. Natl. Acad. Sci. USA 84; 4355-4358). Other recent publications describing profile technology are referenced at the end of the Profile Analysis Essay above. The profile programs in the Wisconsin Package were developed and communicated to us by Dr. Gribskov.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-MATRix=mymatrix.cmp

allows you to specify a scoring matrix other than the program default. In creating alignments or finding sequence similarity, matching residues are scored according to values found in a scoring matrix. The matrix you choose depends on the expected similarity of the sequences to be compared. For example, you might use blosum90 to compare sequences that are expected to be very similar and blosum35 if you are expecting the sequences to be much less similar.

-BEGin=1

sets the beginning position for all input sequences. When the beginning position is set from the command line, ProfileMake ignores beginning positions specified for individual sequences in a list file.

-END=100

sets the ending position for all input sequences. When the ending position is set from the command line, ProfileMake ignores ending positions specified for sequences in a list file.

-WEIGHT=1.0

sets the sequence weight for all input sequences. When the weight is set with this parameter, ProfileMake ignores weights specified for individual sequences in a list file, MSF file, or RSF file.

-GAPCoefficient=100

sets the maximum gap coefficient for the profile. This coefficient is expressed as a percentage and has a default maximum value of 100 percent. This value is found in each row of the profile where the corresponding alignment has no gaps at all. The gap coefficient is reduced from 100 percent at positions in the alignment that have gaps. In the other profile programs, the gap coefficient in each row of the profile is multiplied by an interactively specified gap creation penalty to calculate the penalty for creating a gap at that position.

-LENGTHCoefficient=100

sets the maximum gap length coefficient for the profile. This coefficient is expressed as a percentage and has a default maximum value of 100 percent. This value is found in each row of the profile where the corresponding alignment has no gaps at all. The gap length coefficient is reduced from 100 percent at positions in the alignment that have gaps. In the other profile programs, the gap length coefficient in each row of the profile is multiplied by an interactively specified gap extension penalty to calculate the penalty for extending a gap at that position.

-GAPRatio=0.33

is used to calculate the gap and gap length coefficients for a row of the profile where the multiple sequence alignment has gaps. GAPRatio multiplied by GAPCoefficient is approximately equal to the maximum gap coefficient in a region with gaps. Similarly, GAPRatio multiplied by LENGTHCoefficient is approximately equal to the maximum gap length coefficient in a region with gaps.

-LENGTHRatio=0.1

determines how rapidly the gap coefficient and gap length coefficient decrease with increasing gap size. With a gap of length GapLength, both of these coefficients decrease from their maximum values by a factor of


GAPRatio / ( 1 + (LENGTHRatio x GapLength) )

-NOLOGwgt

uses linear weighting of the residues at each position in the aligned sequences. The weight of each residue is directly proportional to the number of times the residue occurs at a given position in the aligned sequences. The default is logarithmic weighting (labeled "Exponential weighting of characters" in the profile header), which causes positions in the aligned sequences with many identical residues to bias the profile more strongly towards the identical residues than does linear weighting.

-STRINgent

gives a weight of 0 to all symbols not occurring at a given position in the aligned sequences. This is the default for nucleic acids. For proteins, residues not occurring at a position in the aligned sequences are given a small weight by default.

-SEQout=hsp70.pep

writes the consensus from the profile into a new sequence file. This sequence output file is written in addition to the file with the profile. You can name the sequence file yourself; otherwise, ProfileMake gives it the same name as the profile, but with the extension .seq for DNA or .pep for protein.

Printed: January 13, 1999 6:28 (1162)