PILEUP(+)

Table of Contents
FUNCTION
DESCRIPTION
OUTPUT
DENDROGRAM
INPUT FILES
RELATED PROGRAMS
RESTRICTIONS
ALGORITHM
CONSIDERATIONS
SUGGESTIONS
ACKNOWLEDGEMENT
PARAMETER REFERENCE

FUNCTION

PileUp creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment.

DESCRIPTION

PileUp creates a multiple sequence alignment using a simplification of the progressive alignment method of Feng and Doolittle (Journal of Molecular Evolution 25; 351-360 (1987)). The method used is similar to the method described by Higgins and Sharp (CABIOS 5; 151-153 (1989)).

The multiple alignment procedure begins with the pairwise alignment of the two most similar sequences, producing a cluster of two aligned sequences. This cluster can then be aligned to the next most related sequence or cluster of aligned sequences. Two clusters of sequences can be aligned by a simple extension of the pairwise alignment of two individual sequences. The final alignment is achieved by a series of progressive, pairwise alignments that include increasingly dissimilar sequences and clusters, until all sequences have been included in the final pairwise alignment.

Before alignment, the sequences are first clustered by similarity to produce a dendrogram, or tree representation of clustering relationships. It is this dendrogram that directs the order of the subsequent pairwise alignments. PileUp can plot this dendrogram so that you can see the order of the pairwise alignments that created the final alignment.

As a general rule, PileUp can align up to 500 sequences, with any single sequence in the final alignment restricted to a maximum length of 7,000 characters (including gap characters inserted into the sequence by PileUp to create the alignment). However, if you include long sequences in the alignment, the number of sequences PileUp can align decreases. See the RESTRICTIONS topic, below, for a more complete discussion of sequence number and size limitations.

OUTPUT

Below is part of the output file containing the multiple sequence alignment. By default, similar sequences are positioned close to each other in the output file, but if you use -NOSORt, the aligned sequences are listed in the same order in which they were presented to the program.


!!AA_MULTIPLE_ALIGNMENT 1.0
PileUp of: @hsp70.list

 Symbol comparison table: GenRunData:blosum62.cmp  CompCheck: 6430

                   GapWeight: 8
             GapLengthWeight: 2

 hsp70.msf  MSF: 743  Type: P  October 6, 1998 18:23  Check: 7784 ..

 Name: S11448           Len:   743  Check: 3635  Weight:  1.00
 Name: S06443           Len:   743  Check: 5861  Weight:  1.00

 /////////////////////////////////////////////////////////////

 Name: S29261           Len:   743  Check: 7748  Weight:  1.00

//

        1                                                   50
S11448  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTFD GAIGIDLGTT YSCVGVWQNE
S06443  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTFD GAIGIDLGTT YSCVGVWQNE
A25398  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTYE GAIGIDLGTT YSCVGVWQNE
S06158  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTYE GAIGIDLGTT YSCVGVWQNE
S42164  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~MS KAVGIDLGTT YSCVAHFAND
S20139  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~MS KAVGIDLGTT YSCVAHFSND
B36590  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~MS KAVGIDLGTT YSCVAHFAND
A25089  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~MAKSEG PAIGIDLGTT YSCVGLWQHD
S03250  ~~~~~~~~~~ ~~~~~~~~~~ ~~~MAGKGEG PAIGIDLGTT YSCVGVWQHD
A27077  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MSKG PAVGIDLGTT YSCVGVFQHG
S07197  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MSKG PAVGIDLGTT YSCVGVFQHG
A25646  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~MSGKG PAIGIDLGTT YSCVGVFQHG
S10859  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~MSARG PAIGIDLGTT YSCVGVFQHG
A29160  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MAKA AAVGIDLGTT YSCVGVFQHG
JH0095  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MAKN TAIGIDLGTT YSCVGVFQHG
A03310  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~MATKG VAVGIDLGTT YSCVGVFQHG
JT0285  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MSKH NAVGIDLGTT YSCVGVFMHG
S09036  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~MQAPRE LAVGIDLGTT YSCVGVFQQG
JU0062  ~~~~~~~~~~ ~~~~~~~~~~ ~~~MAQSVSG YSVGIDLGTT YSCVGVWQND
JU0164  ~~~~~~~~~~ ~~~~~MASAK GSKPNLPESN IAIGIDLGTT YSCVGVWRNE
A34041  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~MAANKG MAIGIDLGTT YSCVGVFQHG
S05776  ~~~~~~~~~~ ~~~~~~~~~~ ~~ADDVENYG TVIGIDLGTT YSCVAVMKNG
S20149  ~~~~~~~~~~ ~~~~~~~~~~ ~~~MAEGVFQ GAIGIDLGTT YSCVATYESS
A32493  MLAAKNILNR SSLSSSFRIA TRLQSTKVQG SVIGIDLGTT NSAVAIMEGK
S29261  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~MG KIIGIDLGTT NSCVAIMDGT

Gaps at the ends of each sequence are written as tildes (~) because they may reflect differences in input sequence lengths rather than missing characters or significant differences in the alignment. Internal gaps in each sequence are written as periods (.). When you create an end-weighted alignment in PileUp by using -ENDWeight, gaps at the ends of each sequence are also written as periods, since those gaps may represent missing characters or significant differences in the alignment. See Appendix III for more information about the two different gap characters.
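The Name: lines in the header shown above follow a regular layout, so they can be read back with a few lines of code. The Python sketch below is an illustration only and is not part of the Wisconsin Package; the function name parse_msf_names and the regular expression are assumptions inferred from the sample output, not from a formal description of the MSF format.

```python
import re

# Pattern inferred from the sample header lines above (an assumption, not a
# formal MSF specification).
NAME_LINE = re.compile(
    r"^\s*Name:\s+(\S+)\s+Len:\s+(\d+)\s+Check:\s+(\d+)\s+Weight:\s+([\d.]+)"
)

def parse_msf_names(text):
    """Collect the per-sequence header entries that precede the '//' divider."""
    entries = []
    for line in text.splitlines():
        if line.strip() == "//":  # a line holding only '//' ends the header
            break
        match = NAME_LINE.match(line)
        if match:
            name, length, check, weight = match.groups()
            entries.append(
                {"name": name, "len": int(length),
                 "check": int(check), "weight": float(weight)}
            )
    return entries

sample = """\
 Name: S11448           Len:   743  Check: 3635  Weight:  1.00
 Name: S06443           Len:   743  Check: 5861  Weight:  1.00
//
S11448  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTFD GAIGIDLGTT YSCVGVWQNE
"""
for entry in parse_msf_names(sample):
    print(entry["name"], entry["len"], entry["check"], entry["weight"])
```

Stopping at the divider keeps the aligned sequence rows, which begin with the same sequence names, out of the header list.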

DENDROGRAM

PileUp can plot a dendrogram that shows the clustering relationships used to determine the order of the pairwise alignments that together create the final multiple sequence alignment. Distance along the vertical axis is proportional to the difference between sequences; distance along the horizontal axis has no significance at all. The interpretation of the dendrogram is discussed in the ALGORITHM topic below.

INPUT FILES

PileUp accepts multiple (two or more) nucleotide sequences or multiple (two or more) protein sequences as input. The function of PileUp depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

LineUp is a screen editor for editing multiple sequence alignments. You can edit up to 30 sequences simultaneously. New sequences can be typed in by hand or added from existing sequence files. A consensus sequence identifies places where the sequences are in conflict.

Pretty displays multiple sequence alignments and calculates a consensus sequence. It does not create the alignment; it simply displays it.

PlotSimilarity plots the running average of the similarity among the sequences in a multiple sequence alignment.

ProfileMake makes a profile from a multiple sequence alignment. ProfileSearch uses the profile to search a database for new sequences with similarity to the group of aligned sequences. ProfileSegments displays optimal alignments between each sequence in the ProfileSearch output list and the group of aligned sequences (represented by the profile consensus). ProfileGap makes optimal alignments between a sequence and a group of aligned sequences represented as a profile.

The Wisconsin Package includes several programs for evolutionary analysis of multiple sequence alignments. Distances creates a matrix of pairwise distances between the sequences in a multiple sequence alignment. Diverge measures the number of synonymous and nonsynonymous substitutions per site of two or more aligned protein coding regions and can output matrices of these values. GrowTree reconstructs a tree from a distance matrix or a matrix of synonymous or nonsynonymous substitutions. PAUPSearch reconstructs phylogenetic trees from a multiple sequence alignment using parsimony, distance, or maximum likelihood criteria; PAUPDisplay can manipulate and display the trees output by PAUPSearch and can also plot the trees output by GrowTree.

Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps.

RESTRICTIONS

As shipped, PileUp restricts each sequence in the final alignment to a maximum length of 7,000 characters. This maximum length includes the input sequence length plus the total length of all gap characters inserted into the sequence to create the final alignment. By default, each input sequence is restricted to a maximum length of 5,000. Also by default, PileUp can add a maximum of 2,000 gap characters for each sequence in the final alignment.

If you wish to align longer sequences, you can specify a maximum sequence length of up to 7,000 with the Maximum input sequence range parameter (for example, by setting it to 6000). If you increase the maximum sequence length in this way, the maximum amount of allowed gapping is automatically reduced so that the final aligned sequence length cannot exceed 7,000 for any sequence.

If you wish to allow for more gapping in the final alignment, you can specify a maximum number of gap characters for each sequence with the Maximum number of gap characters ('.' and '~') added to any sequence parameter (for example, by setting it to 3000). If you increase the maximum amount of gapping permitted for each sequence in this way, the maximum sequence length is automatically decreased so that the final aligned sequence length cannot exceed 7,000 for any sequence.

As shipped, the total length of all of the sequences read into PileUp (including the gap allowance for each sequence) cannot be greater than 2,000,000. By reducing the gap allowance for each sequence with the Maximum number of gap characters ('.' and '~') added to any sequence parameter, you can increase the number of sequences that can be read into the program, up to the maximum of 500 sequences.
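The arithmetic behind these limits can be sketched in a few lines of Python. This is an illustration of the shipped defaults described above, not PileUp's own code; the constant and function names are hypothetical.

```python
MAX_ALIGNED_LENGTH = 7_000     # longest sequence allowed in the final alignment
MAX_TOTAL_LENGTH = 2_000_000   # all sequence lengths plus their gap allowances
MAX_SEQUENCES = 500

def gap_allowance(max_input_length=5_000):
    """Gap characters left per sequence once the input length limit is chosen."""
    if not 0 < max_input_length <= MAX_ALIGNED_LENGTH:
        raise ValueError("maximum input length must be between 1 and 7,000")
    return MAX_ALIGNED_LENGTH - max_input_length

def fits_default_limits(lengths, max_gaps=2_000):
    """True if the sequences, each padded by its gap allowance, fit the limits."""
    total = sum(length + max_gaps for length in lengths)
    return len(lengths) <= MAX_SEQUENCES and total <= MAX_TOTAL_LENGTH

print(gap_allowance(6_000))                               # 1000
print(fits_default_limits([5_000] * 285))                 # True  (1,995,000)
print(fits_default_limits([5_000] * 286))                 # False (2,002,000)
print(fits_default_limits([3_500] * 500, max_gaps=500))   # True  (2,000,000)
```

With the default gap allowance of 2,000, about 285 sequences of the maximum default length can be read; lowering the allowance to 500 lets the full 500 sequences of length 3,500 fit.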

The surface of comparison (see the CONSIDERATIONS topic for an explanation) is limited to 2,250,000.

All of these limits are adjustable (see the CONSIDERATIONS topic below).

ALGORITHM

A rigorously optimal alignment of even a small number of short sequences would be intractable, both in terms of memory and time. Therefore, PileUp does a series of progressive, pairwise alignments between sequences and clusters of sequences to generate the final alignment. A cluster consists of two or more already-aligned sequences.

PileUp begins by doing pairwise alignments that score the similarity between every possible pair of sequences. These similarity scores are used to create a clustering order that can be represented as a dendrogram. The clustering strategy represented by the dendrogram is called UPGMA, which stands for unweighted pair-group method using arithmetic averages (Sneath, P.H.A. and Sokal, R.R. (1973) in Numerical Taxonomy (pp. 230-234), W.H. Freeman and Company, San Francisco, California, USA).

The dendrogram shows the order of the pairwise alignments of sequences and clusters of sequences that together generate the final alignment. For example:

[Figure: dendrogram of five sequences, Seq1 through Seq5]

PileUp uses this clustering order and first aligns the two most-related sequences to each other in order to produce the first cluster. It then aligns the next most related sequence to this cluster or the next two most-related sequences to each other in order to produce another cluster. A series of such pairwise alignments that includes increasingly dissimilar sequences and clusters of sequences at each iteration produces the final alignment.

In the above example, Seq1 and Seq2 are aligned first. Next, Seq3 and Seq4 are aligned. The cluster of Seq1-aligned-to-Seq2 is then aligned to the cluster of Seq3-aligned-to-Seq4. Finally, Seq5 is aligned to the cluster that now contains Seq1 through Seq4 to generate the final alignment of Seq1 through Seq5.
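The clustering order in this example can be reproduced with a small UPGMA sketch in Python. It is an illustration under stated assumptions, not PileUp's implementation: the distance values are invented for the five example sequences, the function name upgma_order is hypothetical, and it works on distances, whereas PileUp derives its clustering from pairwise similarity scores.

```python
from itertools import combinations

def upgma_order(names, dist):
    """Return the cluster pairs in the order UPGMA joins them.

    dist maps frozenset({a, b}) to the distance between sequences a and b.
    """
    clusters = [(name,) for name in names]

    def average_distance(c1, c2):
        # UPGMA: mean of all pairwise distances between members of two clusters
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Hypothetical distances chosen to match the example described above.
names = ["Seq1", "Seq2", "Seq3", "Seq4", "Seq5"]
raw = {
    ("Seq1", "Seq2"): 1, ("Seq3", "Seq4"): 2,
    ("Seq1", "Seq3"): 4, ("Seq1", "Seq4"): 4,
    ("Seq2", "Seq3"): 4, ("Seq2", "Seq4"): 4,
    ("Seq1", "Seq5"): 8, ("Seq2", "Seq5"): 8,
    ("Seq3", "Seq5"): 8, ("Seq4", "Seq5"): 8,
}
dist = {frozenset(pair): d for pair, d in raw.items()}
for left, right in upgma_order(names, dist):
    print("+".join(left), "aligned to", "+".join(right))
```

Each pair returned corresponds to one pairwise alignment step, so the list of merges is also the order in which the progressive alignment would proceed.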

Each pairwise alignment in PileUp uses the method of Needleman and Wunsch (Journal of Molecular Biology 48; 443-453 (1970)), which is extended for use with clusters of aligned sequences rather than only individual sequences. For a pairwise alignment of individual sequences, the comparison score between any two sequence symbols is found in a scoring matrix. For a pairwise alignment of clusters of sequences, the comparison score between any two positions in those clusters is simply the arithmetic average of the scores for all possible symbol comparisons at those positions. When gaps are inserted into a cluster to produce an alignment, they are inserted at the same position in all of the sequences of the cluster.
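For two individual sequences, the Needleman and Wunsch recurrence can be sketched as follows. This Python example is a simplified illustration, not PileUp's code: it charges one fixed cost per gap symbol (PileUp uses separate creation and extension penalties and, by default, does not penalize end gaps), and it uses a simple match/mismatch score instead of a scoring matrix. The function name and default values are assumptions.

```python
def global_alignment_score(a, b, match=10, mismatch=0, gap=10):
    """Best global alignment score of a and b (Needleman-Wunsch recurrence).

    Simplification: every gap symbol costs `gap`, including gaps at the ends.
    """
    # Row 0: aligning the empty prefix of a against b[:j] needs j gap symbols.
    previous = [-gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [-gap * i]  # a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + pair,   # align a[i-1] with b[j-1]
                previous[j] - gap,        # a[i-1] against a gap in b
                current[j - 1] - gap,     # b[j-1] against a gap in a
            ))
        previous = current
    return previous[-1]

print(global_alignment_score("GACCAT", "GAGAT"))  # 30
```

Only two rows of the scoring table are kept at a time, which is enough to compute the score; recovering the alignment itself would require keeping the full table or traceback pointers.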

CONSIDERATIONS

Because a rigorously optimal alignment of even a small number of short sequences would be intractable, PileUp uses an approach that may not produce the optimal multiple sequence alignment. (See the ALGORITHM topic above for a description of this approach.)

Clustering

The approach used by PileUp is sensitive to the order in which sequences are aligned. A clustering algorithm determines this order from the pairwise similarities calculated before the final alignments are done. The goal of the clustering is to see that very similar sequences are aligned to each other before they are aligned to more distantly related sequences. There is, at present, no way for you to modify the order of these alignments.

Although PileUp calculates the similarity between each pair of sequences, the program does not use this information to weight the sequences. As a result, if there are several very similar sequences, the final alignment may be constrained to minimize the disruption of these sequences.

The dendrogram is not a phylogenetic reconstruction, although the vertical branch lengths are proportional to the distance between the sequences. Its purpose is to represent the clustering order used to create the final alignment. This order is the only information from the dendrogram used by PileUp. See the RELATED PROGRAMS topic for a description of programs in the Wisconsin Package that you can use to create phylogenetic reconstructions from multiple sequence alignments.

Global Alignment

If you know the difference between Gap and BestFit, consider PileUp an extension of the Gap program for more than two sequences, rather than an extension of the BestFit program. PileUp, like Gap, tries to find a global optimal alignment, while BestFit finds a local optimal alignment.

Because PileUp aligns sequences along their entire lengths, it is not ideally suited to finding the best local region of similarity (such as a shared motif) among all of the sequences. However, PileUp has been used successfully for this purpose.

By default, PileUp does not penalize gaps occurring at the ends of sequences. Therefore, related sequences that differ in the extent of their sequencing can be reasonably aligned by PileUp. You can override this default with -ENDWeight, in which case length differences among the sequences become significant.

Piling Up Unrelated Sequences

PileUp always aligns all of the sequences you specify, even if they are not related. The alignment can be degraded if some of the sequences are not similar to one another.

Arbitrary Gap Placement

In any pairwise alignment, the position of the inserted gaps may be arbitrary; equally optimal alignments can be generated by inserting the gaps differently. PileUp can exaggerate these arbitrary differences if you use either -LOWroad or -HIGhroad. This selection usually affects the final alignment. For the most part, however, the difference between the high road and low road alignments should not be very significant, although you may want to check.

Here is an example showing the difference between high and low road for the alignment of three short sequences. The first pairwise alignment creates an aligned cluster of the two most closely related sequences; the second alignment aligns this cluster to the third sequence creating the final multiple sequence alignment. Although the qualities after the first round alignments are the same, the quality of the final low-road alignment is higher than the high-road one.

             For:       Match = 10       Gap weight = 10
                     Mismatch =  0    Length weight =  0

                HighRoad                          LowRoad

                GACCAT                            GACCAT
Alignment  1    GAG.AT    Quality = 30            GA.GAT    Quality = 30

                GACC.AT                           GAC.CAT
Alignment  2    GAG..AT   Quality = 25            GA..GAT   Quality = 30
                AACGGAT                           AACGGAT

High road alignments shift all of the arbitrary gaps in the second sequence or cluster of aligned sequences to the right and all of the arbitrary gaps in the first sequence or cluster of aligned sequences to the left. Low road alignments do the opposite. When neither high road nor low road is selected, the program tries not to insert a gap whenever that is possible and uses the high road when that is not possible.
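The quality values in the example above can be checked with a short Python sketch. It is an illustration, not PileUp's scoring code, and it rests on two labeled assumptions: a gap symbol compared with a residue scores the same as a mismatch, and a gap costs the gap weight once plus the length weight for each gapped column. The function names are hypothetical.

```python
GAP = "."

def column_score(col_a, col_b, match=10, mismatch=0):
    """Average score over all symbol pairs between two cluster columns."""
    total = 0
    for x in col_a:
        for y in col_b:
            if x != GAP and y != GAP and x == y:
                total += match
            else:
                total += mismatch  # assumption: a gap symbol scores as a mismatch
    return total / (len(col_a) * len(col_b))

def quality(cluster_a, cluster_b, gap_weight=10, length_weight=0):
    """Quality of two already-aligned clusters (lists of equal-length strings)."""
    score = 0.0
    open_side = None  # which cluster, if any, has a gap run in progress
    for col_a, col_b in zip(zip(*cluster_a), zip(*cluster_b)):
        if all(s == GAP for s in col_a):
            side = "a"        # whole column of cluster a is a gap
        elif all(s == GAP for s in col_b):
            side = "b"        # whole column of cluster b is a gap
        else:
            side = None
            score += column_score(col_a, col_b)
        if side is not None:
            if side != open_side:
                score -= gap_weight    # a new gap is created
            score -= length_weight     # each gapped column extends it
        open_side = side
    return score

print(quality(["GACCAT"], ["GAG.AT"]))                 # 30.0 (high road, alignment 1)
print(quality(["GACCAT"], ["GA.GAT"]))                 # 30.0 (low road, alignment 1)
print(quality(["GACC.AT", "GAG..AT"], ["AACGGAT"]))    # 25.0 (high road, alignment 2)
print(quality(["GAC.CAT", "GA..GAT"], ["AACGGAT"]))    # 30.0 (low road, alignment 2)
```

The difference in the second round comes from the averaged columns: in the low road alignment the third sequence's second G lines up with the G of the second sequence, contributing 5, while in the high road alignment that column scores 0.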

Scoring Matrices

The default scoring matrices are not necessarily appropriate for all alignments. (See Chapter 4, Using Data Files in the User's Guide for more information.) We provide several alternative scoring matrices suitable for multiple sequence alignments. These matrices are listed in Appendix VII. PileUp chooses default gap creation and extension penalties that are appropriate for the scoring matrix it reads. If you select a different scoring matrix with the Scoring Matrix parameter, the program will adjust the default gap penalties accordingly. (See Appendix VII for information about how to set the default gap penalties for any scoring matrix.) If you don't want to accept the default values, you can specify alternative gap penalties with the Set gap creation penalty and Set gap extension penalty parameters or by responding to the program prompts.

Surface of Comparison

PileUp performs a series of pairwise alignments between clusters of sequences to create the final multiple sequence alignment. Each pairwise alignment requires enough computer memory for a surface of comparison proportional to the product of the lengths of the two clusters being aligned. Since all sequences in an aligned cluster have the same length, the length of a cluster is simply the length of any sequence within that cluster.

PileUp allows you to align sequences whose lengths have a product greater than the surface of comparison. In this case, the program limits the total length of gaps that can be inserted into each sequence and calculates the best alignment within this incomplete, or limited, surface of comparison. The program then performs a calculation to determine whether the alignment could possibly be improved if there were no restriction on the total length of gaps in each sequence. If the program cannot rule out this possibility, it displays the message *** Alignment is not guaranteed to be optimal ***. Because the criteria used in the calculation for guaranteeing an optimal alignment are very stringent, a limited alignment may often be optimal even if this message is displayed. In any event, the program continues to completion.
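As a rough check of when the limited surface applies, the comparison can be sketched in Python. This assumes the surface is counted as the plain product of the two cluster lengths and uses the shipped limit given in the RESTRICTIONS topic; the names are hypothetical and the sketch is not PileUp's code.

```python
SURFACE_LIMIT = 2_250_000  # shipped surface of comparison limit

def exceeds_surface(length_a, length_b, limit=SURFACE_LIMIT):
    """True when the full comparison surface for two clusters is over the limit."""
    return length_a * length_b > limit

print(exceeds_surface(1_500, 1_500))  # False: 2,250,000 just fits
print(exceeds_surface(2_000, 1_500))  # True: 3,000,000 exceeds the limit
```

Under this assumption, two clusters of length 1,500 are the largest equal-length pair that can be aligned on a complete surface with the shipped limit.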

SUGGESTIONS

Editing Multiple Sequence Alignments

PileUp writes the alignment into a multiple sequence format (MSF) file that interleaves the sequences to show their alignment. You can also edit the alignment created by PileUp with a regular text editor.

The Pretty program can calculate a consensus for the multiple sequence alignment and can display the alignment several different ways.

ACKNOWLEDGEMENT

PileUp was written by Irv Edelman.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

Set gap creation penalty

sets the gap creation penalty that is subtracted from the alignment score whenever a gap is created.

Set gap extension penalty

sets the gap extension penalty that is subtracted from the alignment score for each gapped symbol.

Scoring Matrix

allows you to specify a scoring matrix other than the program default. In creating alignments or finding sequence similarity, matching residues are scored according to values found in a scoring matrix. The matrix you choose depends on the expected similarity of the sequences to be compared. For example, you might use blosum90 to compare sequences that are expected to be very similar and blosum35 if you are expecting the sequences to be much less similar.

Maximum input sequence range

sets the maximum length for each individual input sequence. Setting a higher limit (up to a maximum of 7,000) allows you to align longer sequences while setting a lower limit allows you to add more and longer gaps to each sequence. (See the RESTRICTIONS topic for a more detailed description.)

Maximum number of gap characters ('.' and '~') added to any sequence

sets the maximum combined length of all gaps that can be added to each sequence. Setting a higher limit allows you to add more and longer gaps to each sequence while setting a lower limit allows you to align a greater number of sequences. (See the RESTRICTIONS topic for a more detailed description.)

Printed: January 13, 1999 6:28 (1162)