HTHSCAN

Table of Contents

FUNCTION

DESCRIPTION

EXAMPLE

FUNCTION

HTHScan scans protein sequences for the presence of helix-turn-helix motifs, indicative of sequence-specific DNA-binding structures often associated with gene regulation.

DESCRIPTION

HTHScan predicts helix-turn-helix (H-T-H) motifs in protein sequences. For each sequence, HTHScan prints a list of possible H-T-H motifs sorted in descending order according to score. Associated with each score is the probability of achieving that score in the target sequence by chance using the given family-specific weight matrix. HTHScan has weight matrices for the araC and lysR families of H-T-H motifs and one for homeobox domains.

EXAMPLE

Here is a session with HTHScan that was used to find H-T-Hs in the arabinose operon regulatory protein araC sequence from E. coli:


% hthscan

  HTHScan of what sequence(s)? PIR:Rgeca

                  Begin (* 1 *) ?
                End (*   292 *) ?

  Search using weight matrix for which H-T-H family:

      A.  AraC
      B.  LysR
      C.  Homeobox

     Please choose one: (* A *):

  Only display H-T-Hs whose score exceeds (* 4.0 *) ?

  What should I call the output file (* rgeca.hthscan *) ?

                  Input sequences processed: 1
  Number of sequences with predicted H-T-Hs: 1
                                Output file: rgeca.hthscan
  CPU time (sec): 1.22

%

OUTPUT

Here is the output file:


HTHScan of PIR1:Rgeca  September 29, 1998 10:18

  Weight matrix: GenRunData:htharac.dat
  Minimum score for H-T-Hs (threshold): 4.0

> sequence: pir1:rgeca
      name: rgeca  check: 4061  from: 1  to: 292

   1. 197 IASVAQHVCLSPSRLSHLFR 216
      Score: 39.8
      Probability: 4.031E-12

  Databases searched:
        NBRF, Release 57.0, Released on 30Jun1998, Formatted on 18Aug1998
  Input sequences searched: 1
  Number of sequences with predicted H-T-Hs: 1
  CPU time (sec): 0.68

The N-terminus->C-terminus direction of the predicted H-T-H is from left to right. The position of the first residue in the H-T-H is shown to the left. The position of the last residue in the H-T-H is shown to the right.

Below the H-T-H display is the score computed for the predicted H-T-H and the probability of random occurrence of that score or better, given a sequence with the same residue distribution as the target sequence and whose positions are independent of one another. (With -EVEn, an even residue distribution is assumed instead.)
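If you want to post-process the report, the hit records can be pulled out with a few regular expressions. The following is a minimal sketch that assumes the layout is exactly as in the sample output above (a numbered hit line followed by Score and Probability lines); the function name parse_hits and the dictionary keys are illustrative, not part of HTHScan.

```python
import re

# Hit line such as "   1. 197 IASVAQHVCLSPSRLSHLFR 216":
# rank, first residue position, motif residues, last residue position.
HIT = re.compile(r"^\s*(\d+)\.\s+(\d+)\s+([A-Za-z]+)\s+(\d+)\s*$")
SCORE = re.compile(r"^\s*Score:\s*(\S+)")
PROB = re.compile(r"^\s*Probability:\s*(\S+)")


def parse_hits(text):
    """Return a list of dicts, one per predicted H-T-H in the report text."""
    hits = []
    for line in text.splitlines():
        m = HIT.match(line)
        if m:
            hits.append({
                "rank": int(m.group(1)),
                "start": int(m.group(2)),
                "motif": m.group(3),
                "end": int(m.group(4)),
            })
            continue
        m = SCORE.match(line)
        if m and hits:
            hits[-1]["score"] = float(m.group(1))
            continue
        m = PROB.match(line)
        if m and hits:
            hits[-1]["probability"] = float(m.group(1))
    return hits


# The single hit from the sample output file, copied verbatim.
SAMPLE = """\
   1. 197 IASVAQHVCLSPSRLSHLFR 216
      Score: 39.8
      Probability: 4.031E-12
"""

if __name__ == "__main__":
    for hit in parse_hits(SAMPLE):
        # Sanity check: motif length should equal end - start + 1.
        print(hit, len(hit["motif"]) == hit["end"] - hit["start"] + 1)
```

The length check reflects the description above: the numbers to the left and right of the motif are the positions of its first and last residues.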

INPUT FILES

The input to HTHScan is one or more protein sequences. If HTHScan rejects your protein sequence, turn to Appendix VI to see how to change or set the type of a sequence. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*.

RELATED PROGRAMS

Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Motifs can display an abstract of the current literature on each of the motifs it finds. FindPatterns identifies sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal. SPScan scans protein sequences for the presence of secretory signal peptides (SPs). CoilScan locates coiled-coil segments in protein sequences.

CONSIDERATIONS

Because of the way HTHScan sorts and stores predicted H-T-H motifs during scanning, no particular ordering is guaranteed among H-T-H motifs that have exactly the same score.
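If a downstream script needs a reproducible order, it can impose its own tie-break after reading the results. The sketch below assumes each hit is held as a dictionary with "score" and "start" keys (a hypothetical representation, not something HTHScan produces directly) and breaks ties by start position.

```python
def stable_order(hits):
    """Sort by descending score, breaking ties by ascending start position."""
    return sorted(hits, key=lambda h: (-h["score"], h["start"]))


if __name__ == "__main__":
    # Made-up hits: two share the same score.
    example = [
        {"start": 120, "score": 12.5},
        {"start": 30, "score": 12.5},
        {"start": 75, "score": 20.0},
    ]
    for hit in stable_order(example):
        print(hit)
```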

Ambiguity codes (such as B or Z) in protein sequences contribute exactly 0 to the score of the sequence window within which they are found. Therefore, the scores and probabilities associated with any predicted motifs from such a sequence window are likely to differ to varying extents from what they would be otherwise. You shouldn't routinely encounter this problem because ambiguity codes are extremely rare in protein sequences.

ALGORITHM

HTHScan uses a log-odds position-weight matrix ("weight matrix") to detect the presence of H-T-H motifs in protein sequences. The weight matrix encodes the H-T-H motif as a set of weights representing the likelihood of each amino acid residue appearing at each position of the motif. The score reported by HTHScan for each prediction is a measure of the local goodness of fit between the target sequence and the H-T-H signal represented by the weight matrix. This score is the sum of the weights corresponding to the amino acid residues found in the target sequence at each weight matrix position.
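The scoring step described above can be sketched as a sliding-window sum. This is an illustration only: the three-position matrix TOY uses made-up weights, the function names are hypothetical, and the real weight matrix files are not read here. Residues with no entry in a column are given a weight of 0, which is how ambiguity codes such as B and Z are said to behave.

```python
def score_windows(sequence, matrix):
    """Yield (start, end, score) for every window; positions are 1-based.

    matrix is a list with one dict per motif position, mapping a residue
    letter to its log-odds weight. A residue missing from a column
    contributes 0 to the window score.
    """
    width = len(matrix)
    seq = sequence.upper()
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        score = sum(col.get(res, 0.0) for col, res in zip(matrix, window))
        yield i + 1, i + width, score


def top_hits(sequence, matrix, threshold, max_hits=3):
    """Keep windows whose score exceeds threshold, best first."""
    hits = [h for h in score_windows(sequence, matrix) if h[2] > threshold]
    hits.sort(key=lambda h: h[2], reverse=True)
    return hits[:max_hits]


# Toy matrix with invented weights (NOT a real HTHScan matrix).
TOY = [{"A": 1.0, "G": -1.0}, {"L": 2.0, "A": -0.5}, {"K": 1.5}]

if __name__ == "__main__":
    # The trailing B is an ambiguity code and adds nothing to its window.
    for start, end, score in top_hits("MALKALB", TOY, threshold=0.0):
        print(start, end, score)
```

The max_hits default of 3 mirrors the -NUMTOPscores=3 default in the command-line summary; a real matrix would carry a weight for all 20 residues in every column.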

The statistical significance of each score is computed as the probability of random occurrence of that score or better in a sequence with the same amino acid residue distribution as the target sequence and whose positions are all independent of each other (Claverie, J.-M. and Audic, S. CABIOS 12(5); 431-439 (1996)).
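One common way to obtain such a tail probability when positions are independent is to build the exact score distribution one matrix position at a time. The sketch below does this for a single window. It is not necessarily the procedure of the cited paper or of HTHScan itself, and it does not address how sequence length is taken into account. The matrix and background values are invented for illustration.

```python
from collections import defaultdict


def score_tail_probability(matrix, background, observed, digits=3):
    """P(random window score >= observed) under independent positions.

    matrix: list of dicts (residue -> weight), one per motif position.
    background: dict (residue -> probability), summing to 1.
    Scores are rounded to `digits` decimals so that equal sums merge.
    """
    dist = {0.0: 1.0}  # score -> probability, before any position is added
    for col in matrix:
        nxt = defaultdict(float)
        for total, p in dist.items():
            for res, q in background.items():
                key = round(total + col.get(res, 0.0), digits)
                nxt[key] += p * q
        dist = dict(nxt)
    cutoff = round(observed, digits)
    return sum(p for total, p in dist.items() if total >= cutoff)


if __name__ == "__main__":
    # Two-letter toy alphabet with made-up weights and frequencies.
    toy_matrix = [{"A": 1.0, "G": -1.0}, {"A": 1.0, "G": -1.0}]
    toy_background = {"A": 0.25, "G": 0.75}
    print(score_tail_probability(toy_matrix, toy_background, observed=2.0))
```

Passing residue frequencies counted from the target sequence as background corresponds to the default behavior described above; passing equal frequencies corresponds to an even distribution.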

The weight matrices used by HTHScan were prepared using sequence sets taken from Pfam Release 2.0 (Sonnhammer, E.L. et al. Proteins 28; 405-420 (1997)). The Pfam families used were HTH_1 (bacterial regulatory helix-turn-helix proteins, lysR family), HTH_2 (bacterial regulatory helix-turn-helix proteins, araC family), and homeobox (homeobox domain). The log-odds weight matrices were constructed from these sequences with MEME version 2.1 (Bailey, T.L. and Elkan, C. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, 28-36 (1994)).

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % hthscan [-INfile=]PIR:Rgeca -Default

Prompted Parameters:

-BEGin=1 -END=292         sets the range of interest
-FAMily=arac              specifies weight matrix by H-T-H family:
                            "arac", "lysr", or "homeobox"
-THRESHold=4.0            sets minimum score for H-T-H detection
[-OUTfile=]rgeca.hthscan  names the output file

Local Data Files:

-DATa=htharac.dat      assigns weight matrix for the araC family H-T-Hs
-DATa=hthlysr.dat      assigns weight matrix for the lysR family H-T-Hs
-DATa=hthhomeobox.dat  assigns weight matrix for the homeobox family H-T-Hs

Optional Parameters:

-NUMTOPscores=3        specifies maximum number of H-T-Hs to report
-EVEn                  assumes even target residue distribution
-NOPROBabilities       doesn't compute score probabilities
-VERbose               uses verbose output
-RSF[=hthscan.rsf]     saves features in the RSF file
-MONitor               displays screen trace of progress
-NOSUMmary             suppresses screen summary at end of the program
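
For scripted runs, the command line can be assembled programmatically. The following sketch builds an argument list using only parameters from the summary above; the helper name build_hthscan_command is hypothetical, and -Default is appended as in the Minimal Syntax line on the assumption that it accepts default values for anything not supplied.

```python
import shlex

FAMILIES = ("arac", "lysr", "homeobox")


def build_hthscan_command(infile, family="arac", threshold=None,
                          outfile=None, begin=None, end=None):
    """Return an argv list for an HTHScan run with the given parameters."""
    if family not in FAMILIES:
        raise ValueError(f"unknown H-T-H family: {family!r}")
    argv = ["hthscan", f"-INfile={infile}", f"-FAMily={family}"]
    if begin is not None:
        argv.append(f"-BEGin={begin}")
    if end is not None:
        argv.append(f"-END={end}")
    if threshold is not None:
        argv.append(f"-THRESHold={threshold}")
    if outfile is not None:
        argv.append(f"-OUTfile={outfile}")
    argv.append("-Default")  # as in the Minimal Syntax line
    return argv


if __name__ == "__main__":
    cmd = build_hthscan_command("PIR:Rgeca", family="lysr", threshold=10.0)
    print(shlex.join(cmd))
    # On a system where the package is installed, the list could be run with:
    #   import subprocess; subprocess.run(cmd, check=True)
```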

ACKNOWLEDGEMENT

We thank Tim Bailey, Charles Elkan, and Bill Grundy for MEME (http://www.sdsc.edu/MEME), which was used to create the log-odds weight matrices. We thank Erik Sonnhammer, Sean Eddy, and Richard Durbin for the Pfam protein domain family database (http://www.sanger.ac.uk/Software/Pfam/), which was used to create input sequence sets for MEME.

HTHScan was written by Ted Slater.

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

If you choose to search for the araC family of H-T-H motifs (the default), HTHScan will use the weight matrix file htharac.dat. If you choose to search for the lysR family of H-T-H motifs, HTHScan will use the weight matrix file hthlysr.dat. If you choose to search for the homeobox family of H-T-H motifs, HTHScan will use the weight matrix file hthhomeobox.dat.
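
The family-to-file mapping and the lookup rules above can be summarized in a short sketch. The function name, the public_dir argument, and the precedence between a command-line file and a same-named file in the working directory are assumptions made for illustration; the text above does not say which of the two wins when both are present.

```python
from pathlib import Path

# Family name -> weight matrix file, as listed above.
MATRIX_FILES = {
    "arac": "htharac.dat",
    "lysr": "hthlysr.dat",
    "homeobox": "hthhomeobox.dat",
}


def resolve_matrix_file(family, public_dir, override=None, cwd=None):
    """Pick the weight matrix file for a family.

    Assumed precedence: a file named explicitly (override), then a file
    of the same name in the working directory, then the public data
    directory.
    """
    if override is not None:
        return Path(override)
    name = MATRIX_FILES[family]
    local = Path(cwd or ".") / name
    if local.is_file():
        return local
    return Path(public_dir) / name


if __name__ == "__main__":
    # "/path/to/public/data" is a placeholder, not a real install location.
    print(resolve_matrix_file("lysr", "/path/to/public/data"))
```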

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

H-T-H family

allows you to specify the weight matrix used by choosing the H-T-H motif family by name. You may specify arac for the araC family of bacterial regulatory proteins (represented by the weight matrix file htharac.dat), lysr for the lysR family of bacterial regulatory proteins (represented by the weight matrix file hthlysr.dat), or homeobox for the homeobox domain (represented by the weight matrix file hthhomeobox.dat).

Minimum acceptable score for H-T-H motifs

allows you to specify the minimum acceptable score for an H-T-H motif prediction. If you don't specify a value, HTHScan will use a default value of 4.0 for the araC family of bacterial regulatory proteins and a default of 10.0 for the other families.

Assume even residue distribution in input sequence

tells HTHScan to assume that amino acid residues are distributed evenly throughout the length of the target sequence for the purpose of calculating score probabilities. This makes HTHScan perform a little faster, because it does not have to compute the actual distribution of residues in each input sequence. However, reliability of the score probability calculations may be adversely affected.
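
The difference between the two modes can be illustrated by how the background residue probabilities are obtained. The sketch below reads "even distribution" as equal probability for each of the 20 standard amino acids, which is an interpretation rather than something stated above; the function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def even_background():
    """Every standard residue gets the same probability, 1/20."""
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def observed_background(sequence):
    """Residue frequencies counted from the target sequence itself."""
    counts = Counter(r for r in sequence.upper() if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


if __name__ == "__main__":
    motif = "IASVAQHVCLSPSRLSHLFR"  # the araC hit from the example output
    observed = observed_background(motif)
    even = even_background()
    # Leucine and serine are over-represented relative to 1/20 here.
    print(observed["L"], observed["S"], even["L"])
```

Counting the residues costs one extra pass over each input sequence, which is the small amount of work the even-distribution option avoids.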

Printed: January 13, 1999 6:27 (1162)