Network Science Corporation

Terms and Definitions in Bioinformatics



Editor's note: Since the topic of bioinformatics incorporates many of the terms used in molecular biology and genetics, a number of these terms are included in this paper.

==========================================
http://www.netsci.org/Science/Bioinform/definitions.html

========================================

Allele: Different forms of a gene which occupy the same position on the chromosome.

Amino acid: An α-amino carboxylic acid of the general form H3N+-CHR-COO-. There are 20 common amino acids, defined by the R group on the alpha-carbon, that are used to build proteins and peptides.

Amplification: The process of repeatedly making copies of the same piece of DNA.

Annotation: Text fields of information about a biosequence which are added to a sequence database. Annotation (the elucidation and description of biologically relevant features in the sequence) consists of the description of the following items:

Assembly: The process of placing fragments of DNA that have been sequenced into their correct position within the chromosome.

Autoradiography: The method of detecting molecules or molecular fragments which uses a radioactive label within the molecule of interest. The location of the radiolabel or "tag" is detected with X-ray film.

Autosomal: A position on any chromosome other than a sex determining chromosome.

Bacterial artificial chromosome (BAC): A long sequencing vector which is created from a bacterial chromosome by splicing a DNA fragment of 100kb (or more) from another species. Once the foreign DNA has been cloned into the host bacteria, many copies of the new chromosome can be made.

Base: One of five molecules which are assembled, along with a ribose and a phosphate, to form nucleotides (Figure 1). Adenine (A), guanine (G), cytosine (C), and thymine (T) are found in DNA while RNA is made from adenine (A), guanine (G), cytosine (C), and uracil (U).

Figure 1

Nucleotide Bases



Base pair (BP): The complementary bases on opposite strands of DNA which are held together by hydrogen bonding. The atomic structure of these bases determines the pairing of adenine with thymine (or with uracil in RNA) and the pairing of guanine with cytosine.
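
The pairing rules can be applied mechanically to derive one strand from the other. The following is a minimal Python sketch, assuming sequences are plain strings of the letters A, C, G, T (or U for RNA); the function and variable names are illustrative and do not come from any particular package.

```python
# Watson-Crick pairing rules: A pairs with T (DNA) or U (RNA), G pairs with C.
DNA_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_PAIRS = {"A": "U", "U": "A", "G": "C", "C": "G"}


def complement(seq, pairs=DNA_PAIRS):
    """Return the base-by-base complement of seq, in the same left-to-right order."""
    return "".join(pairs[base] for base in seq.upper())


def reverse_complement(seq, pairs=DNA_PAIRS):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return complement(seq, pairs)[::-1]


if __name__ == "__main__":
    print(complement("ATGC"))          # TACG
    print(reverse_complement("ATGC"))  # GCAT
```

The reverse complement is usually the more useful form, because the two strands of a double helix run in opposite directions.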

BEAUTY (BLAST Enhanced Alignment Utility): A tool developed at Baylor College of Medicine (Worley et al. 1995) which uses BLAST to search several custom databases and incorporates sequence family information, location of conserved domains, and information about any annotated sites or domains directly into the BLAST query results.

Bioinformatics: An absolute definition of bioinformatics has not been agreed upon. The first level, however, can be defined as the design and application of methods for the collection, organization, indexing, storage, and analysis of biological sequences (both nucleic acids [DNA and RNA] and proteins). The next stage of bioinformatics is the derivation of knowledge concerning the pathways, functions, and interactions of these genes (functional genomics) and proteins (proteomics). Bioinformatics is also referred to as computational biology.

BLAST: Basic Local Alignment Search Tool. A program for searching biosequence databases which was developed and is maintained by a group at the National Center for Biotechnology Information (NCBI). There are several versions of BLAST: BLASTP which searches a protein database, BLASTN to search a nucleotide database, TBLASTN which searches for a protein sequence in a nucleotide database by translating nucleotide sequences in all 6 reading frames, BLASTX which can search for a nucleotide sequence against a protein database by translating the query via all 6 reading frames, gapped-BLAST, and psi-BLAST. BLAST locates patches of regional similarity instead of calculating the best overall alignment using gaps. The program then uses a scoring matrix to rank these matches as positive, negative or zero. If the initial match is scored highly, the search is expanded in both directions until the ranking score falls off.

BLITZ: EBI's ultra-fast protein database search which uses the MPsearch algorithm.

BLOCKS: A database of ungapped multiple alignments for protein/peptide families in PROSITE.

Blotting (Blots): The process of transferring DNA, RNA, or proteins to a solid support (usually a sheet of nitrocellulose paper) for hybridization after it has been separated by electrophoresis. Blots are named according to the material that is analyzed. A Southern Blot examines DNA which has been cut with restriction enzymes and probed with radioactive DNA. The Northern Blot analyzes RNA which is probed with radioactive DNA or RNA. Western Blots examine proteins which are probed with radioactive or enzymatically-tagged antibodies.

Cell: The smallest functional structural unit of living matter. Cells are classed as either prokaryotic or eukaryotic.

CentiMorgan (cM): The unit of measurement for distance and recombination frequency on a genetic map. Formally, the distance between two loci that have a 1% probability of being separated by recombination in a single generation. For humans, the average length of a cM is one million base pairs (or 1 megabase, Mb).

cDNA (complementary DNA): An artificial piece of DNA that is synthesized from an mRNA (messenger RNA) template and is created using reverse transcriptase. The single stranded form of cDNA is frequently used as a probe in the preparation of a physical map of a genome. cDNA is preferred for sequence analysis because the introns found in genomic DNA are spliced out of the mRNA and so are absent at the end of the path DNA ----> mRNA ----> cDNA.

Chromosome: A collection of DNA and protein which organizes the human genome. Each human cell contains 23 pairs of chromosomes: 22 pairs of autosomes (non sex determining chromosomes) and one pair of sex determining chromosomes. The human genome within the 23 pairs of chromosomes is made of approximately 30,000 to 100,000 genes which are built from over 3 billion base pairs. While eukaryotic chromosomes are complex sets of proteins and DNA, prokaryotic chromosomal DNA is circular with the entire genome on a single chromosome.

Cloning: The technique used to produce copies of a piece of DNA. A DNA fragment that contains a gene of interest is inserted into the genome of a virus or plasmid which is then allowed to replicate.

Cloning vector: A DNA molecule, capable of replicating itself in a host cell, into which a piece of foreign DNA is inserted. Vectors are used to introduce foreign DNA into host cells for the purpose of manufacturing large quantities of the new DNA or the protein that the DNA expresses.

CLUSTAL W: A general purpose program for multiple alignments of DNA and protein sequences developed by Thompson et al. in 1994.

Coding region: The portion of a genome that is transcribed to RNA which in turn codes for protein (also see exon).

Codon: The set of three nucleotides along a strand of mRNA that determine (or code) the amino acid placement during protein synthesis. The number of possible arrangements of these three nucleotides (or triplet codes) available for protein synthesis is (4 bases)^3 = 64. Because there are more triplet codes than amino acids, each amino acid can be coded by up to 6 different triplet codes. Three triplet codes (UAA, UAG, UGA) specify the end of the protein. In the example below, three codons are shown.

--- UCA     CGU     CAU ---
Ser ------ Arg ------- His

Figure 2
Codons
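
The codon arithmetic and the example above can be checked with a short Python sketch. It assumes the standard genetic code, written as one letter per codon with the bases taken in the order U, C, A, G and "*" marking a stop codon; the names BASES, AMINO_ACIDS, CODON_TABLE and translate are illustrative only.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
# product() yields UUU, UUC, UUA, UUG, UCU, ... which matches the string above.
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA string codon by codon; trailing 1-2 bases are ignored."""
    codons = (mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


if __name__ == "__main__":
    print(len(CODON_TABLE))        # 64 possible triplets
    print(translate("UCACGUCAU"))  # SRH, i.e. Ser-Arg-His
```

In this table leucine, serine and arginine each have six codons, which is the maximum referred to in the definition.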

Complementarity: The sequence-specific or shape-specific recognition that occurs when two or more molecules bind together. DNA forms double stranded helices because the complementary orientation of the bases in each strand facilitates the formation of the hydrogen bonds which hold the strands together.

Computational biology: See bioinformatics

Consensus sequence: The most commonly occurring amino acid or nucleotide at each position of an aligned series of proteins or polynucleotides.
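
As an illustration, a consensus can be computed column by column from sequences that are already aligned. This is a minimal Python sketch, assuming the aligned sequences are equal-length strings; the function name consensus is hypothetical, and ties are broken arbitrarily in favor of the symbol seen first in the column.

```python
from collections import Counter


def consensus(aligned):
    """Return the most common symbol in each column of equal-length aligned sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    # zip(*aligned) walks the alignment one column at a time.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*aligned))


if __name__ == "__main__":
    print(consensus(["ACGT", "ACGA", "TCGT"]))  # ACGT
```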

Consensus map: The location of all consensus sequences in a series of multiply aligned proteins or polynucleotides.

Conserved sequence: A sequence within DNA or protein that is consistent across species or has remained unchanged within the species over its evolutionary period.

Contig maps: The representation of the structure of contiguous regions of the genome (contigs) by specifying overlap relationships among a set of clones.

Contigs: A series of cloning vectors which are ordered in such a way as to have each sequence overlap that of its neighbors. The result is that the assembly of the series provides a contiguous part of a genome.

CORBA: Common Object Request Broker Architecture. A technology specification (sometimes referred to as a wrapper) that uses an interface definition language (IDL, code which defines the properties of data modules or objects) and software (the Object Request Broker or ORB) to define how objects (self-contained modules of data or instructions) can share the characteristics needed to form a unified application. The CORBA specification was defined by the Object Management Group (OMG, http://www.omg.org/) in 1991.

Cosmid: An artificial cloning vector (40-50kb of DNA) that can be replicated inside E. coli bacteria.

Crossing over: The interchange of two pieces of homologous chromosomes.

Deoxyribose: A five carbon sugar lacking a hydroxyl group on position 2 (β-D-2-deoxyribose) which is used in the construction of DNA.

Figure 3
Deoxyribose



Diploid: A cell containing two sets of chromosomes.

Distance matrix: The method used to present the results of the calculation of an optimal pairwise alignment score. The matrix field (i,j) is the score assigned to the optimal alignment between two residues (up to a total of i by j residues) from the input sequences. Each entry is calculated from the top-left neighboring entries by way of a recursive equation.
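
The recursive fill described above can be sketched in Python. The costs used here (0 for a match, 1 for a mismatch, 1 for a gap) are an assumption chosen for illustration, which makes the final entry the familiar edit distance; real alignment programs use other scoring schemes.

```python
def alignment_matrix(a, b, mismatch=1, gap=1):
    """Fill the matrix whose entry (i, j) is the best cost of aligning a[:i] with b[:j]."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        d[i][0] = i * gap
    for j in range(1, cols):
        d[0][j] = j * gap
    # Every other entry depends only on its top, left and top-left neighbors.
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if a[i - 1] == b[j - 1] else mismatch
            d[i][j] = min(
                d[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                d[i - 1][j] + gap,               # gap in b
                d[i][j - 1] + gap,               # gap in a
            )
    return d


if __name__ == "__main__":
    matrix = alignment_matrix("ACGT", "AGT")
    print(matrix[-1][-1])  # 1: a single gap aligns the two sequences
```

The bottom-right entry is the cost of the optimal alignment of the two complete sequences.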

Distance measure: A function that associates a non-negative numeric value with a pair of biosequences. The shorter the distance between the sequences (i.e., the lower the number), the greater the similarity. For example, the distance between leu and ala is small while the distance between leu and arg is large.

DNA (deoxyribonucleic acid): A double stranded molecule made of a linear assembly of nucleotides (see Figure 4). DNA holds the genetic code for an organism in the arrangement of the bases. The double strand of DNA results from the hydrogen bonds formed between bases when two complementary polynucleotide chains, running in opposite directions, associate.

Figure 4
Double stranded segment of DNA



DNA polymerase: The enzyme which assembles DNA into a double helix by adding complementary nucleotides to a single strand of DNA. Linkages are formed by joining the 5' phosphate group of each incoming nucleotide to the 3' hydroxyl group at the end of the growing strand.

EBI: The European Bioinformatics Institute (http://www.ebi.ac.uk/) is a part of the EMBL.

Electrophoresis: The primary method used to separate the mixture of nucleotide or peptide fragments generated from DNA or protein cleavage experiments. The apparatus consists of a plate, coated with either agarose or acrylamide gel, which is placed in an electric field. The field drives the charged fragments through the gel, which acts as a sieve, so that the components of the mixture are separated according to their size and net charge.

EMBL: The European Molecular Biology Laboratory (http://www.embl-heidelberg.de/) which is located in Heidelberg, Germany.

EMBL Nucleotide Sequence Database: Europe's primary nucleotide sequence resource. Main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications. The database is produced in collaboration with GenBank and the DNA Database of Japan (DDBJ). Each of the three groups collects a portion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis.

Endonuclease: An enzyme that cleaves at internal locations within a nucleotide sequence. The enzyme's site of action is generally a sequence of 8 bases. For E. coli, treatment with a restriction endonuclease will lead to around 70 fragments. Cleavage of human DNA leads to around 50,000 fragments.
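
The fragment counts quoted above follow from simple arithmetic if all four bases are assumed to be equally likely and independent, so that a given 8 base site occurs on average once every 4^8 = 65,536 bases. The Python sketch below makes that assumption explicit; the E. coli genome size of about 4.6 million base pairs is an assumption not stated in this glossary, and the human figure uses the 3 billion base pairs given under Chromosome.

```python
def expected_fragments(genome_bp, site_length):
    """Expected number of sites, assuming equally likely, independent bases."""
    return genome_bp / 4 ** site_length


E_COLI_BP = 4_600_000      # assumed genome size, not from this glossary
HUMAN_BP = 3_000_000_000   # "over 3 billion base pairs"

if __name__ == "__main__":
    print(4 ** 8)                                   # 65536
    print(round(expected_fragments(E_COLI_BP, 8)))  # 70
    print(round(expected_fragments(HUMAN_BP, 8)))   # 45776, i.e. around 50,000
```

Real genomes are not random sequences, so actual fragment counts differ from these estimates.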

Enzyme: A protein which catalyzes (or speeds the rate of reaction for) biochemical processes, but which does not alter the nature or direction of the reaction.

Entrez: A WWW-based database retrieval program created by the National Center for Biotechnology Information (NCBI), a division of the NIH.

EST (Expressed Sequence Tag): A partial sequence of a cDNA clone that can be used to identify sites in a gene.

Eukaryote: An organism whose genomic DNA is organized as multiple chromosomes within a separate organelle -- the cell nucleus.

Exon: The region of DNA which encodes proteins. These regions are usually found scattered throughout a given strand of DNA. After transcription of DNA to RNA, the intervening introns are spliced out and the separate exons are joined to form a continuous coding region.

Exonuclease: An enzyme which cleaves nucleotides sequentially starting at the free end of the linear chain of DNA.

FASTA: An alignment program for protein sequences created by Pearson and Lipman in 1988. The program is one of the many heuristic algorithms proposed to speed up sequence comparison. The basic idea is to add a fast prescreen step to locate the highly matching segments between two sequences, and then extend these matching segments to local alignments using more rigorous algorithms such as Smith-Waterman.
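
The prescreen idea can be illustrated with a word lookup table. The Python sketch below is only a rough illustration of the general approach and is not Pearson and Lipman's implementation; the function name shared_words and the word length k are assumptions. It counts, for each diagonal (the offset between target and query positions), how many short words the two sequences share; diagonals with many hits are the candidates that a more rigorous algorithm would then extend.

```python
from collections import Counter, defaultdict


def shared_words(query, target, k=3):
    """Count shared k-letter words per diagonal (target position minus query position)."""
    # Index every k-letter word of the query by its starting position.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    # Look up every k-letter word of the target and record the diagonal of each hit.
    diagonals = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            diagonals[j - i] += 1
    return diagonals


if __name__ == "__main__":
    hits = shared_words("ACGTAC", "TTACGTAC")
    print(hits.most_common(1))  # [(2, 4)]: four shared words on diagonal 2
```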

Fingerprinting: The process of identifying overlapping regions at the ends of DNA fragments.

FISH: Fluorescence in situ hybridization. A method used to pinpoint the location of a DNA sequence on a chromosome.

Frameshift: Genetic mutation which shifts the reading frame used to translate mRNA (see reading frame).

Functional genomics: The development and application of experimental approaches to assess gene function by making use of the information and reagents provided by structural genomics.

Gamete: The specialized cell (either an egg or a sperm) that is used for sexual reproduction.

Gene: A section of DNA at a specific position on a particular chromosome that specifies the amino acid sequence for a protein.

GenBank: The NIH genetic sequence database. An annotated collection of all publicly available DNA sequences which is located at http://www.ncbi.nlm.nih.gov/. There are approximately 2,162,000,000 bases in 3,044,000 sequence records as of December 1998. GenBank is part of the International Nucleotide Sequence Database Collaboration, which is comprised of the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations exchange data on a daily basis.

Gene expression: The conversion of the information encoded in a gene to messenger RNA which is in turn converted to protein.

Genetic map (Linkage Map): The linear order of genes on a chromosome of a species. Genetic maps are created by observing the recombination of tagged genetic segments (STSs) during meiosis. The map shows the position of known genes and markers relative to each other, but does not show the specific physical points on the chromosomes.

Genetic mutation: An inheritable alteration in DNA or RNA which results in a change in the structure, sequence, or function of a gene.

Genetic polymorphism: The occurrence of two or more different alleles at the same locus, each in one percent or more of a specific population.

Genome: The total genetic material of a given organism.

Genomics: The mapping, sequencing, and analysis of an organism's genome.

Genotyping: The use of markers to organize the genetic information found in individual DNA samples and to measure the variation between such samples.

Haploid: A cell containing only one set of chromosomes.

HGSI (The Human Genome Sequencing Index): A service provided by the NCBI to members of the international consortium to support coordination and tracking of the Human Genome Project (HGP). Sequence and mapping target data from centers participating in the international consortium are submitted via the HGSI web site. This web site also presents an overview of HGP progress to the research community in tabular and graphic displays of the target data.

Hidden Markov models (HMM): A computer algorithm which locates the essential, unique features which can distinguish a protein or gene family by analyzing a range of known sequences from the family. These features then are used to locate similar characteristics in unknown sequences.

Homology modeling: The use of 3-dimensional (3-D) geometry and sequence information from proteins of known 3-D structure to develop models for proteins whose 3-D structure is unknown. In the first step of homology modeling, search and alignment algorithms are used to find the best sequence overlap of the 'unknown' protein with the sequences of related proteins which have 3-D data. In the second step, the geometry of the 3-D structures is used as a template for generating a 3-D structural model for the regions of high sequence homology in the unknown protein (the conserved regions). Finally, the sections with low homology to known proteins (the variable regions) are modeled using a variety of computational techniques.

HUGO: The Human Genome Organization.

Hybrids (or hybrid molecular complexes): The formation of a complementary complex between a probe molecule and a target molecule. This complex is generally tagged with a radioactive label on the probe molecule so that the complex can be located and isolated for further study. Hybrid molecular complexes of the type DNA-DNA, DNA-RNA, and Protein-Protein are frequently used in genetic analysis. Since hybridization reactions are specific, they can be used to locate one DNA, RNA, or protein molecule within complex mixtures of similar molecules.

Hybridization: The formation of a double stranded DNA, RNA, or DNA/RNA from two complementary oligonucleotide strands.

Hydrogen bond: A dipole-dipole attraction in which a hydrogen atom bridges two electronegative atoms. One half of the hydrogen bond is a covalent bond and the other is an electrostatic bond. The example below shows the hydrogen bonds formed between cytosine and guanine.

Figure 5
Hydrogen Bonding



Induction: The switching of cells between pathways under the influence of an adjacent group of cells. It is possible to generate several different cells through a series of inductions between a limited number of cell types.

Intron: The portion of a DNA sequence which interrupts the protein coding sequences of the gene. Most introns begin with the nucleotides GT and end with the nucleotides AG.

Figure 6
Translation/Transcription



In vitro: Outside a living organism, usually in a test tube.

In vivo: Inside a living organism.

Kilobase (kb): A length of DNA equal to 1,000 nucleotides.

Linkage analysis: The process used to study genotype variations between affected and healthy individuals wherein specific regions of the genome that may be inherited with, or "linked" to, disease are determined.

Linkage map: A map which displays the relative positions of genetic loci on a chromosome.

Locus (plural: loci): The location of a gene or other marker on a chromosome. The use of locus is sometimes restricted to mean regions of DNA that are expressed.

MAP: Program developed by Huang which computes a multiple global alignment using an iterative pairwise method.

Mapping: The process of determining the positions of genes and the distances between them on a chromosome. This is accomplished by identifying unique genome markers (ESTs, STSs, etc.) and localizing these to specific sites on the chromosome. There are three types of DNA maps: physical maps, genetic maps, and cytogenetic maps. The types of markers identified differentiate the map produced.

Marker: A physical location on a chromosome which can be reliably monitored during replication and inheritance. Markers on the Human Transcript Map are all STSs.

Microarray: DNA which has been anchored to a chip as an array of microscopic dots, each one of which represents a gene. Messenger RNA which encodes for known proteins is added and will hybridize with its complementary DNA on the chip. The result will be a fluorescent signal indicating that the specific gene has been activated.

Mitosis: Cell division - The process that produces daughter cells that are genetically identical to each other and to the parent cell.

Motifs: A pattern of DNA sequence that is similar for genes of similar function. Also a pattern for protein primary structure (sequence motifs) and tertiary structure that is the same across proteins of similar families.

mRNA (messenger RNA): RNA that is used as the template for protein synthesis. The first translated codon in a messenger RNA sequence is almost always AUG.

Multiple Alignment: A set of biosequences arranged in a table such that each row of the table consists of one sequence padded by gaps. The columns of the table highlight similarity (or residue conservation) between positions of each biosequence. An Optimal Multiple Alignment is one that has the highest degree of similarity, or the lowest cost.

NCBI: The National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/), a division of the NIH, is the home of the BLAST and Entrez servers.

NCGR: The National Center for Genome Resources (http://www.ncgr.org/).

NHGRI: The National Human Genome Research Institute of the NIH (http://www.nhgri.nih.gov/)

Northern Blot: An electrophoresis-based technique which is used to find mRNA sequences that are complementary to a piece of DNA called a probe.

Nucleotide (nt): A molecule which contains three components: a sugar (deoxyribose in DNA, ribose in RNA), a phosphate group, and a heterocyclic base.

Figure 7
Nucleotides



Oligos (Oligonucleotides): A chain of nucleotides.

Oncogene: A mutant gene that promotes uncontrolled cell growth once activated.

Operon: The group of contiguous genes in a bacterial chromosome that are transcribed into an mRNA molecule.

Pairwise alignment: An arrangement of two sequences in which each is padded by gaps so that both are the same length and display the maximum similarity on a residue to residue basis. An optimal Pairwise Alignment is an alignment which has the maximum amount of similarity with the minimum number of residue 'substitutions'.

PCR (polymerase chain reaction; in vitro DNA amplification): The laboratory technique for duplicating (or replicating) DNA using the DNA polymerase of Thermus aquaticus, a heat stable bacterium from the hot springs of Yellowstone. As with the polymerase reaction that occurs in cells, there are three stages of a PCR process: separation of the DNA double helix, addition of the primer to the section of the DNA strand which is to be copied, and synthesis of the new DNA. Since PCR is run in a single reaction vessel, the reactor contains all of the components necessary for replication: the target DNA, nucleotides, the primer, and the bacterial DNA polymerase. PCR is initiated by heating the reaction vessel to 90°C, which causes the DNA chains to separate. The temperature is lowered to 55°C to allow the primers to bind to the section of the DNA that they were designed to recognize. Replication is then initiated by heating the vessel to 75°C. The process is repeated until the desired quantity of new DNA is obtained. Thirty cycles of PCR can produce over 1 million copies of a target DNA.
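
The yield of repeated cycles can be estimated with a short calculation. The Python sketch below assumes an idealized reaction in which every cycle copies a fixed fraction of the molecules present; the function name and the efficiency parameter are illustrative assumptions, and real reactions fall short of perfect doubling.

```python
def pcr_copies(cycles, start_copies=1, efficiency=1.0):
    """Copies after a number of cycles; efficiency=1.0 means perfect doubling."""
    return start_copies * (1 + efficiency) ** cycles


if __name__ == "__main__":
    print(pcr_copies(20))  # 1048576.0, just over one million
    print(pcr_copies(30))  # 1073741824.0, just over one billion
```

With perfect doubling the one million mark is passed after 20 cycles, so thirty cycles exceed it comfortably even when each cycle is well below full efficiency.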

PDB (Protein Data Bank): An international repository for the results of macromolecular studies using NMR, X-ray crystallography, or homology methods. The results of structural studies of proteins, RNA, DNA, viruses, and polysaccharides are presently available. The term PDB also defines a standard file format for publishing protein and nucleotide structures for use in computer programs.

Peptide: A small chain of amino acids (see protein).

Physical map: The physical locations (and order) on chromosomes of identifiable areas of DNA sequences such as restriction sites, genes, coding regions, etc. Physical maps are used when searching for disease genes by positional cloning strategies and for DNA sequencing.

PIMA: An alignment algorithm developed by Smith and Smith which performs multiple alignments using a covering pattern construction algorithm.

Plasmid: DNA and/or RNA that is not a part of the chromosome, but is replicated and inherited at cell division. Plasmids are generally found in yeast and bacteria.

Polymerase reaction: The process of copying the DNA in each chromosome during cell division. In the first step the two DNA chains of the double helix unwind and separate into single strands. Each strand then serves as a template for the DNA polymerase, which reads the template starting at its 3' end to make a complementary copy.

Polymorphic marker: A length of DNA that displays population-based variability so that its inheritance can be followed.

Polymorphism: Individual differences in DNA. Single nucleotide polymorphism (the difference of one nucleotide in a DNA strand) is currently of interest to a number of companies.

Polypeptide: A linear chain of amino acids joined head to tail via a peptide bond between the carboxylic acid group of one amino acid and the amino group of the next amino acid.

Post translational modification: Changes which occur to a protein after translation from mRNA. Modifications can include cleavage of a small number of residues, the addition of carbohydrates, phosphorylation of hydroxyl groups, acetylation, etc.

Primer: The short sequence of nucleotides (usually eight) which serves to prime the DNA polymerase process during cell division. Primers are produced by the enzyme primase. Primers also can be customized to 'isolate' specific sections of DNA for replication using PCR.

PRINTS: The protein motif fingerprint database created by Attwood and Beck.

Probe: A radiolabeled or fluorescent oligonucleotide used to locate complementary sequences in a hybridization experiment.

Prokaryote: An organism whose DNA is not enclosed in a separate organelle.

Promoter: The short sequence of nucleotides on DNA that starts the transcription of RNA by RNA polymerase.

PROSITE: A database of protein families and domains which is maintained at the EMBL. It is based on the observation that, while there is a huge number of different proteins, most of them can be grouped, on the basis of similarities in their sequences, into a limited number of families. Proteins or protein domains belonging to a particular family generally share functional attributes and are derived from a common ancestor.

Protein: A linear chain of variable length which is constructed from the 20 basic amino acids (also referred to as a peptide or as a polypeptide). The linear arrangement of the amino acids is known as the protein's primary structure. The local three dimensional arrangement (or folding pattern) of the main portion of the chain (the polypeptide backbone) is known as the protein's secondary structure. The overall three dimensional arrangement of all atoms in a single chain in the protein is termed the protein's tertiary structure. The three dimensional shape, in conjunction with the chemical properties of the amino acids contained in the protein, determines the protein's function.

Proteome: The full complement of proteins produced by a particular genome.

Proteomics: The study of protein expression, structure, and function, and the interactions of all proteins of a specific organism.

Radiation hybrid: A population of cloned cells derived by fusion of X-irradiated donor cells with rodent cells.

Radiation hybrid (RH) mapping: The approach to physically mapping DNA that makes use of the frequency of X-ray induced breakage to infer distances between markers.

Radiation hybrid (RH) panel: A set of DNA samples prepared from a collection of radiation hybrids.

Reading frame (also open reading frame): The stretch of triplet sequence of DNA that encodes a protein. The reading frame is designated by the initiation or start codon and is terminated by a stop codon. As an example, the sequence CAGAUGAGGUCAGGCAUA potentially can be translated as follows:

Position 1     CAG   AUG   AGG   UCA   GGC   AUA
               gln   met   arg   ser   gly   ile

Position 2   C AGA   UGA   GGU   CAG   GCA   UA
               arg   stop  gly   gln   ala

Position 3  CA GAU   GAG   GUC   AGG   CAU   A
               asp   glu   val   arg   his

Figure 8
Open Reading Frames



DNA (through RNA) uses a triplet code to specify the amino acid for a given protein. As can be seen above, a given strand of DNA has three possible starting points (position [or reading frame] one, two, or three). Since both strands of DNA can be transcribed into RNA and then translated into protein, a sequence of double helical DNA can specify six different reading frames.
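
The six reading frames can be enumerated programmatically. The Python sketch below assumes the standard genetic code (one letter per codon in U, C, A, G order, with "*" for a stop codon) and works directly on RNA letters; the frame labels "+1" to "-3" and all function names are illustrative choices, not an established convention of any particular program.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
PAIRS = {"A": "U", "U": "A", "G": "C", "C": "G"}


def translate(rna):
    """Translate codon by codon from the first base; leftover bases are ignored."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def reverse_complement(rna):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return "".join(PAIRS[base] for base in reversed(rna))


def six_frames(rna):
    """Return the translation of all three offsets on both strands."""
    frames = {}
    for sign, strand in (("+", rna), ("-", reverse_complement(rna))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("CAGAUGAGGUCAGGCAUA").items():
        print(name, protein)
```

For the example sequence the three forward frames come out as QMRSGI, R*GQA and DEVRH in one-letter code, where the "*" in the second frame is the UGA stop codon.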

Recombinant DNA: Partial strands of DNA from different sources which are joined outside of a cell.

Recombination: The exchange of regions of DNA on chromosomes via cross over during meiosis (see crossing over).

Regulatory region: The segment of DNA that controls whether, and to what degree, a gene will be expressed.

Resolution: The amount of information (or molecular detail) that is available on a physical map.

Restriction enzyme: A protein which recognizes specific short sequences in DNA and hydrolyzes the DNA at or near these sites.

Restriction map: A physical map which shows the order and distances between cleavage sites of site-specific restriction endonucleases.

Restriction site: The location on a DNA chain at which a specific restriction enzyme will act.

Retrotransposons: Short sequences of DNA that make new copies of themselves via reverse transcription of an RNA intermediate.

RH Mapping: A statistical method used to determine the distance between DNA markers, as well as their order on the chromosome. The technique depends on using X-rays to break the chromosome.

Ribonucleic acid (RNA): A polymer of nucleotides, each made from a ribose, a base [adenine (A), guanine (G), cytosine (C), or uracil (U)], and a phosphate group. RNA is generally found in the cell nucleus or cytoplasm.

Ribose: A five carbon sugar (β-D-ribose) which is used in the construction of RNA.

Ribosome: Cellular components made of ribosomal RNA and proteins which are the site of protein synthesis (translation).

Scaffold: A series of contigs that are in the correct order, but are not connected in one continuous length.

Scoring Function (also cost function or weight function): The methods used to evaluate the quality of the overlap between sequences. A variety of scoring functions are used to evaluate single replacement operations, multiple alignments (either whole or columns), and pairwise alignments. The score of an alignment of two sequences (a and b) is the sum of the score of all the replacement operations that lead from a to b.
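
A simple scoring function for a pairwise alignment can be written as a sum over alignment columns. The Python sketch below assumes two already aligned strings of equal length with "-" as the gap symbol; the score values (+1 match, -1 mismatch, -2 gap) are arbitrary illustrative choices, whereas practical programs take substitution scores from a matrix.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Sum the per-column scores of two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap        # insertion or deletion
        elif x == y:
            score += match      # identical residues
        else:
            score += mismatch   # replacement of one residue by another
    return score


if __name__ == "__main__":
    print(alignment_score("ACGT", "AGGT"))      # 2
    print(alignment_score("ACGT-A", "AC-TTA"))  # 0
```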

Sequencing: Determining the order of nucleotides in a gene or the order of amino acids in a protein.

Sequence tagged sites (STS): The unique occurrence of a short, specific length of DNA within a genome whose location and sequence are known and that can be detected by a specific PCR. An STS is used to orient and identify mapping data for the construction of physical genome maps.

Shotgun method: A method that uses enzymes to cut DNA into hundreds (or thousands) of random fragments which are sequenced and then reassembled by computer to reconstruct the original sequence. In the Human Genome Project, the shotgun approach is applied to cloned DNA fragments that have already been mapped, so that their location on the genome is known, making assembly easier and much less prone to error.

Similarity measure (also similarity function or similarity score): A scoring function that is used to rank the degree of similarity of a pair of sequences. Larger values indicate greater similarity.

Single nucleotide polymorphism (SNP): The most common type of DNA sequence variation. An SNP is a change in a single base pair at a particular position along the DNA strand. When an SNP occurs, the gene's function may change, as seen in the development of bacterial resistance to antibiotics or of cancer in humans.

Southern Blot: An electrophoresis-based technique used to find DNA sequences which are complementary to a DNA probe.

Splicing: The process of cutting, excising, and recombining an RNA or DNA. In RNA, splicing is used to remove introns from the coding sequence.

Structural genomics: The prediction of the 3-D structure of proteins encoded by genes using both experimental and computational techniques.

The Swiss Institute of Bioinformatics (SIB): An academic institution established on March 30, 1998 as a non-profit foundation. The goals of this institute are to promote the development of software tools and databases in the field of bioinformatics, to sustain a high-quality research program in bioinformatics, to provide, in collaboration with academic partners, a curriculum of courses and seminars for the formation of research scientists in the field of bioinformatics, and to offer services to the Swiss scientific user community through the Swiss-EMBnet node (which is currently maintained jointly by Swiss Institute for Cancer Research and the University of Geneva).

SWISS-PROT: An annotated protein sequence database established in 1986 and maintained collaboratively, since 1987, by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library (now the EMBL Outstation - The European Bioinformatics Institute (EBI)). The SWISS-PROT protein sequence data bank consists of sequence entries. Sequence entries are composed of different line-types, each with their own format. For standardization purposes the format of SWISS-PROT follows as closely as possible that of the EMBL Nucleotide Sequence Database.

Telomere: The free end of a chromosome.

Threading: Computational algorithms which use the known 3-D structure of proteins as a template for positioning (or 'threading') an unknown sequence. The overlap of the sequences is based on how the sequences match with respect to physical properties at specific residues rather than matching based on the similarity of amino acid sequences.

TIGR: The Institute for Genomic Research, located in Rockville, Maryland (http://www.tigr.org/)

TrEMBL: The supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT.

Transcription: The process of copying a strand of DNA to yield a complementary strand of RNA.

Transcription Factors: The class of proteins which bind to DNA and promote or inhibit the initiation of transcription.

Transfection: Introduction of a foreign DNA molecule into a eukaryotic cell and subsequent expression of the genes of the new DNA.

Transfer RNA (tRNA): Specialized RNA which transfers single amino acids to a growing protein chain. Each tRNA carries an anticodon which is complementary to the codon on the mRNA.

Translation: The process of sequentially converting the codons on mRNA into amino acids which are then linked to form a protein.

Western Blot: An electrophoresis-based technique used to find proteins based on their ability to bind to specific antibodies.

Yeast artificial chromosome (YAC): An artificial chromosome containing a yeast centromere, two telomeric sequences, and a marker. The YAC is constructed by cloning very large genomic fragments (up to one million bases) from another species into yeast vectors that can replicate in yeast.

Zygote: A diploid cell created by the union of a male and female gamete.
