BASIC SCIENCES

The spectacular recent progress in genetics is the result of accumulated knowledge in several branches of biology and biochemistry that are the fruit of long years of basic research.

I. FROM THE ORGANISM TO THE GENE
II. GENES
III. PROTEINS
IV. GENETIC INFORMATION
V. MUTATIONS

I. FROM THE ORGANISM TO THE GENE

Related Current Themes

Related Techniques

GENETIC ENGINEERING

CLONING
DETERMINATION OF KARYOTYPES


What is the relationship between organism and gene?

The cell is the organizing principle that links organisms and genes. All living organisms are composed of cells. For a complex organism such as a human being, each cell of an adult is specialized for a particular function related to the organ in which it occurs, as well as its role in the organ (heart cells, neurons, stomach lining, skin cells, liver cells, etc.). Nevertheless, virtually all cells share many common properties, including a large nucleus several microns in diameter (that generally occupies about 10% of the cell volume), containing the chromosomes upon which are located the genes.

The specialized role of each particular type of cell, as well as various "housekeeping" functions common to all cells, are determined principally by the genes in the nucleus of the cell. In humans, the genes are organized in 22 pairs of chromosomes (one member of each pair obtained from each parent), plus the chromosomes X and X (in a female) or X and Y (in a male). In both cases, one X chromosome is received from the mother, and the father contributes either an X or a Y chromosome (the sex is determined by the sperm that fertilizes the egg of the future individual, either an X-carrying sperm or a Y-carrying sperm).


How do cells become specialized during development?

As the fertilized egg undergoes successive cell divisions, producing greater and greater numbers of cells, these cells become specialized because certain genes are activated and other genes are quiescent. The fertilized egg is totipotent -- it can give rise to all cells of the body, as well as the cells of the placenta. Other cells isolated from early embryos, the embryonic stem (ES) cells, are pluripotent, that is, they have the capacity to differentiate into any of the different forms of specialized adult cells. Such cells can be isolated from embryos, grown in the laboratory, and in the presence of specific growth factors can be stimulated to differentiate into a specialized cell type (muscle, nerve, etc.).

In conclusion, the relationship between gene and organism is determined by the specialization of cells during development. As the organism advances to its final stages prior to birth, most cells have become highly differentiated. However, small numbers of stem cells can still be found in many tissues; they are not as versatile as the pluripotent ES cells, but these multipotent cells can be triggered by specific growth factors to change into other cell types and are of great interest to medical researchers. In general, the specialization of the cells results from the fact that each gene that is activated leads to the synthesis of a specific protein. Therefore, each cell contains thousands of different proteins. Some are the "housekeeping" proteins that are found in nearly all cells and, for example, catalyze the basic reactions needed to metabolize sugars and fats that keep the cells alive, as well as to produce the various chemical building blocks that the cell requires to renew its structures. Other specific proteins perform specialized functions, such as insulin synthesized only in certain cells of the pancreas, hemoglobin that is present only in the red blood cells, and various receptors of neurotransmitters that occur only in neurons.


Do all organisms operate along the same principles?

The union of egg and sperm to initiate development occurs in complex, multi-cellular animals. In order to provide a more complete explanation of the relation between gene and organism, two additional aspects must be considered:

1. Plants and animals are composed of eucaryotic cells. The term "eucaryotic" means "true nucleus" and these cells contain multiple chromosomes organized in a distinct nucleus. Other simpler unicellular organisms, such as bacteria, are much smaller and contain a single chromosome that is not isolated within a nucleus. Such cells without a distinct nucleus are called procaryotic cells. Unicellular organisms consisting of a single eucaryotic cell also exist, yeast being one example. Bacteria can be responsible for diseases, such as tuberculosis, but can also play a positive role, for example in the production of yogurt. Moreover, the intestinal bacteria of human beings are a necessary feature of the digestive system. Indeed, there are more procaryotic bacterial cells in the gut than eucaryotic cells in the body!

2. Eucaryotic cells contain specialized structures (organelles), in addition to the nucleus, that perform specialized functions. For example, eucaryotic cells contain mitochondria that carry out many of the reactions essential for energy production. Mitochondria are roughly the size of bacteria and indeed may represent descendants of bacteria that were "captured" at an early stage in the evolution of eucaryotic cells. Mitochondria also contain a simple chromosome with genes that code for a small number of proteins necessary for mitochondrial function; many other mitochondrial proteins are synthesized under control of genes in the nucleus and then imported into the mitochondria. All mitochondria are derived from the egg and thus are entirely of maternal origin. During cell division, mitochondria also divide, so their numbers per cell remain relatively constant. In plant cells, other specialized organelles called chloroplasts (which also contain a small chromosome with genes that code for specific proteins) carry out reactions of photosynthesis.


What is a chromosome?

The chromosome is a long complex composed of one molecule of double-stranded DNA surrounded by thousands of proteins (the histones). It can be visualized in a dense, compacted form when cells divide (mitosis). The 46 chromosomes of humans can be aligned according to size to establish the gallery of forms that is known as the karyotype. Certain genetic diseases have been correlated with changes in the number or appearance of chromosomes, as visualized under a microscope, as in the case of trisomy 21, which can be detected prenatally by examining the karyotype of cells removed by amniocentesis. For a genetic disease that leads to mental retardation, the fragile-X syndrome, the X chromosome appears broken near one end when a karyotype is established.

A mid-sized chromosome (number 9) is about 2.5 microns long when compacted and contains a molecule of DNA with about 150 million base pairs (which corresponds to a length of about 50 mm or 50,000 microns). Hence, in the dense chromosome, the DNA molecule is compacted about 20,000 times by being coiled into smaller and smaller units: first, beads of DNA wrapped around a histone core (to give the "nucleosome"); then, tightly packed ribbons of nucleosomes; and finally, loops of these nucleosome ribbons aligned in the dense pattern that produces the intact chromosome. The stages of compacting are illustrated in the figure below, with the histone core of the nucleosomes represented by the green cylinders.
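The compaction figure quoted above can be checked with a few lines of arithmetic. The sketch below assumes the usual spacing of about 0.34 nm of helix length per base pair, a value that is not given in the text; the variable names are illustrative only.

```python
# Rough check of the compaction ratio for chromosome 9.
# Assumption (not from the text): ~0.34 nm of helix length per base pair.
NM_PER_BASE_PAIR = 0.34

base_pairs = 150_000_000      # DNA content quoted for chromosome 9
compacted_length_um = 2.5     # length of the condensed chromosome, in microns

extended_length_um = base_pairs * NM_PER_BASE_PAIR / 1000   # nm -> microns
compaction_ratio = extended_length_um / compacted_length_um

print(f"extended length: {extended_length_um / 1000:.0f} mm")
print(f"compaction: about {compaction_ratio:,.0f}-fold")
```

Under this assumption the extended molecule comes out close to the 50 mm quoted in the text, and the ratio close to 20,000.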

CHROMATIN



What is heredity?

In the early part of the 20th century, biologists established the connection between genetic traits and chromosomes, in large measure from studies on the fruit fly, Drosophila. Individual traits such as eye color were found to "map" along chromosomes in a linear relationship. These studies led to the basic concept that there are definite structures encoded in the genes on chromosomes (the genotype) that are responsible for various observable properties of the organism (the phenotype). Nevertheless, the phenotype may also be influenced by factors in the environment. For example, if obesity is considered as a phenotype, it may occur in individuals who have inherited a genetic defect, or it may be provoked in genetically "normal" individuals by a cultural environment that favors a sedentary life style and inappropriate nutrition.

The underlying chemical basis of how the genotype can alter the phenotype remained inaccessible until the second half of the 20th century. Chromosomes were known to contain the nucleic acid DNA, but a key insight was missing to link such molecules to genetics, because initially their complexity, based on only four classes of building blocks, was underestimated. It was necessary to await the results of experiments using bacteria and their viruses that furnished direct evidence concerning the predominant role of DNA. Even so, the participation of DNA in the genetic mechanism remained obscure until Watson and Crick published their model for DNA in 1953, the famous double helix.


What is DNA?

The structure of DNA with two helical strands resolved two enigmas at the same time: how the information is stored in the chromosomes and how this information is duplicated during each division of the cell. These molecules are chains of millions of building blocks made of three small pieces: a base, a sugar (deoxyribose), and a phosphate. The phosphate links the sugars into long polymers and on each sugar is attached one of the bases. The sugar and the phosphate are identical at each position, but the base can be any one of four varieties (with their one-letter abbreviation in parentheses): adenine (A), thymine (T), guanine (G), or cytosine (C). The specific sequence of bases is what constitutes the information stored in the chromosomes, just as the sequence of letters determines the information in this line of text. The relationships among base, sugar, and phosphate are summarized below on the left, with the A:T and G:C base pairs linked by hydrogen bonds:


DNA occurs as a long helical structure with two complementary strands that constitute the double helix, as shown above on the right. The two strands fit together like pieces of a puzzle with perfectly complementary surfaces. Moreover, the strands are held together by the combined effect of many weak attractive interactions called hydrogen bonds that can only occur when A is facing T (making two hydrogen bonds) and G is facing C (making three hydrogen bonds). All features of living organisms determined by their genetics stem from the interactions of strands of DNA and the particular sequence of bases in the strands.
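The pairing rules can be written down in a few lines. The following is a minimal sketch, with function names chosen here for illustration, that derives the bases facing a given strand and counts the hydrogen bonds holding the two strands together (two per A:T pair, three per G:C pair, as stated above).

```python
# Pairing rules and hydrogen bonds per base pair, as stated in the text.
PARTNER = {"A": "T", "T": "A", "G": "C", "C": "G"}
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}

def partner_strand(strand):
    """Return the bases that face each position of `strand` (same left-to-right order)."""
    return "".join(PARTNER[base] for base in strand)

def hydrogen_bonds(strand):
    """Total number of hydrogen bonds holding `strand` to its partner."""
    return sum(H_BONDS[base] for base in strand)

strand = "ATGGTGCAC"   # a short illustrative stretch of bases
print(partner_strand(strand))
print(hydrogen_bonds(strand))
```

Because G:C pairs carry three bonds rather than two, stretches rich in G and C are held together by more hydrogen bonds than stretches of the same length rich in A and T.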



How is DNA duplicated during cell division?

The A:T and G:C base pairs are at the heart of genetics, since they ensure that the DNA molecules can be exactly replicated at each cell division. When the DNA strands separate, each strand provides the template that specifies the sequence of its missing partner. Hence, after each round of cell division, each molecule of DNA in a daughter cell contains one intact strand from the DNA molecule of the mother cell, plus a new strand synthesized entirely on the basis of using the "old" strand as a template. This mode of replication leading to a molecule with one old strand and one new strand is called "semi-conservative replication" and is summarized schematically below.
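Semi-conservative replication can be mimicked with a toy model. In this sketch a double helix is represented as a pair of strings written so that facing bases line up position by position; that representation and the function names are simplifying assumptions made for illustration.

```python
PARTNER = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand):
    """Build the partner strand dictated by the A:T and G:C pairing rules."""
    return "".join(PARTNER[base] for base in strand)

def replicate(duplex):
    """Separate the two strands and use each as the template for a new partner.

    Returns two daughter duplexes, each holding one old and one new strand.
    """
    old_top, old_bottom = duplex
    daughter_1 = (old_top, complement(old_top))        # old top + new bottom
    daughter_2 = (complement(old_bottom), old_bottom)  # new top + old bottom
    return daughter_1, daughter_2

parent = ("ATGGTG", "TACCAC")
daughter_1, daughter_2 = replicate(parent)
print(daughter_1, daughter_2)
```

Both daughters carry the same sequence as the parent, and each retains exactly one of the two parental strands.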

How does the organism develop from successive cell divisions?

For a complex organism to develop from a fertilized egg by successive steps of cell division, two conditions must be satisfied. First, with each cell division the information in the chromosomes (i.e., the sequence of bases in the DNA molecule) must be transmitted to the daughter cells. The replication of DNA is achieved by the use of one strand as the template for the other (semi-conservative replication). Second, as the divisions progress, some variations must occur concerning which genes are expressed, leading to specialized cells that form different tissues and organs of the body.

Cell division is a complex and delicate process, since each chromosome must be duplicated precisely and the cell must then ensure that each daughter cell receives a copy of each chromosome. This process is carried out in steps known as the cell cycle, which includes the key phase of chromosome separation known as mitosis. However, prior to mitosis there is a phase of synthesis (known as S-phase) that permits the chromosomes to be duplicated. By the time mitosis begins, each chromosome has been duplicated and the two identical "twins" are joined near their centers, like Siamese twins, with the connection made at a specialized structure (containing a specific DNA sequence) called the centromere. The chromosomes joined in this way are referred to as "sister chromatids".

Mitosis itself begins as the paired chromosomes condense into a highly compact form and the membrane that defines the nucleus begins to disperse. At the same time a spindle begins to form with two poles that represent the two extremes toward which one chromosome of each pair will be directed. The spindle is composed of microtubules, one class of the molecular skeleton of the cell (the "cytoskeleton"). The microtubules guide the paired chromosomes to a plane at the center of the cell (the "metaphase plate"). The two chromosomes of each pair are attached to microtubules connected to opposite poles and at a precise moment, the connection between the paired chromosomes is broken and the separated chromosomes begin to be pulled in opposite directions. The process continues until each extremity of the cell has received a full complement of chromosomes; then the nuclear membrane begins to reform and the physical separation of the two daughter cells begins through the action of a cleavage ring that pinches the middle of the cell to a narrow constriction that finally is broken, leaving the two separate daughter cells. The various steps are represented schematically below on the left for a single pair of chromosomes.

CELL DIVISION



What process produces egg and sperm?

The above pattern of mitosis is universal for eucaryotic cells, with the exception of the steps that form gametes. Most cells of the body contain pairs of chromosome homologues, with one member received from the father and the other from the mother. Since mitosis permits each daughter cell to receive a copy of each chromosome, the continuity of each cell containing the chromosome homologues from each parent is maintained, with each cell containing a full diploid amount of DNA. (The term diploid refers to the fact that these cells contain a double set of DNA molecules, because they arise from the fusion of egg and sperm, each containing only one chromosome of each pair or a "haploid" amount, or single set, of DNA molecules.) However, for production of egg and sperm cells it is necessary to achieve a more subtle distribution, and this process is called meiosis. It resembles mitosis, but involves two cycles of cell division. In the first cycle, the chromosomes duplicate as in mitosis, but then unite with their homologues to form a complex with four chromatids. Following alignment on the metaphase plate, the duplicated homologues move in opposite directions. Therefore, rather than separating as in mitosis, the sister chromatids remain together and each daughter cell now inherits two copies of the same chromosome from one of the homologues. In a second cycle of cell division without any further DNA replication, these paired chromosomes align at the metaphase plate, separate, and one chromosome of each set migrates to each pole to produce two haploid cells, for a total of four haploid cells from each cell that enters meiosis. The differences between mitosis and meiosis are summarized in the figure above.


What accounts for the diversity of the various forms of life?

All forms of life are based on similar molecular processes, beginning with DNA and leading to the expression of proteins specified in the genes of DNA according to the genetic code (presented below in part III. Proteins). The diversity of living organisms on the Earth presumably emerged by evolution of species from a small number of original forms. The evolution involved processes at several levels:

1. Mutations in genes that led to proteins with new or altered functions.

2. Duplication of genes permitting divergence of proteins with expanded functional properties.

3. Duplication and rearrangement of large portions of chromosomes to increase the gene pool further and permit the evolution of genes to endow proteins with additional functional variations.


At each stage, the evolutionary principles first proclaimed by Darwin permitted emergence of new species. The first principle concerns the variations introduced by spontaneous changes in the DNA. The second principle concerns the selection exerted by a particular environment, such that certain forms are more successful at reproducing and thus flourish. Species considered to be extremely different, for example humans and mice, may nevertheless have a surprising degree of similarity at the cellular and molecular level. Both have nearly the same amount of DNA in their cells. A mouse has only 19 pairs of chromosomes (plus X and Y) compared to 22 pairs (plus X and Y) for a human. Nevertheless, large regions of the two sets of chromosomes possess similar organizations of genes, for example in the hemoglobin beta chain complex on chromosome 11 in humans and on chromosome 7 in mice. While the sizes of the two organisms are very different, they both contain a similar repertory of cell types. Even in the brain, which is a great deal larger and more complex in humans, virtually all of the genes coding for key receptors of neurotransmitters that have been identified in humans are also present in the mouse. The obvious differences that exist are therefore due to subtle variations in the timing of gene expression and the number of cell divisions that occur in various tissues and organs during differentiation and development prior to and following birth. Key regulatory genes are thought to be responsible for these differences, but few such genes have been identified so far.

II. GENES


Related Current Themes

Related Techniques

ADVANCEMENTS IN HEALTH CARE

DNA SEQUENCING
GENE TRANSFER


What is a gene?

The basic elements of heredity are genes aligned along chromosomes. For humans, the genes are distributed on the 23 pairs of chromosomes present in each cell. One chromosome from each pair is obtained from the mother, the other from the father. Therefore, for most genes, very similar forms (alleles) are present at corresponding positions on each chromosome. A mutation may be present on one of the alleles and the mutations can be either dominant or recessive. If the mutation on one chromosome is recessive and the other allele is normal, no ill effects are produced. However, many genetic diseases are provoked by mutations that are dominant and cause a disease even when present on only one allele.

What is the structure of genes?

Genes are specific sequences of bases that occur at irregular intervals along the sequence of DNA of each chromosome. In general, there is nothing special about the structure of a gene; it is only the particular sequence of bases that constitutes information that triggers the machinery of the cell to treat it as a gene. For example, in the sequence below, which uses the 26 letters of our alphabet, there is no particular information apparent at first glance.

owotiepowgkgybdsasdbreutzhaveanicedayqac


However, close examination reveals interpretable information beginning at the 15th letter from the end. In a similar way the machinery of the cell can recognize the beginning of a gene, initiate "decoding" of the base sequence to extract the information it contains (the orders for constructing a protein with a particular sequence of amino acids), and recognize the point at which the information to be decoded terminates.
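The letter analogy can be made concrete with a short sketch: given the readable stretch that a human reader spots, a program can report where it begins. The choice of "haveaniceday" as the search pattern is the reading assumed here.

```python
# Toy analogy: locate a meaningful "message" inside a run of letters.
letters = "owotiepowgkgybdsasdbreutzhaveanicedayqac"
message = "haveaniceday"          # the readable stretch a human reader spots

start = letters.find(message)     # 0-based index, or -1 if absent
position_from_end = len(letters) - start

print(start + 1)            # position counted from the left, 1-based
print(position_from_end)    # position counted from the right, 1-based
```

The cell has no dictionary of messages to search for, of course; it relies on start and stop signals in the base sequence itself.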


How do genes work?

The long sequences of bases of DNA that constitute a gene operate in most cases by specifying the structure of a particular protein. As progress was made in defining the genetic mechanisms at the molecular level, it was observed that in living cells the starting points of genes are recognized as specific sequences of bases that mark the beginning of information that is to be transcribed into RNA. Three major classes of RNA exist:

1. Messenger RNA (mRNA) molecules that can be translated into a protein according to the genetic code;

2. Transfer RNA (tRNA) molecules that participate in translation by attaching a specific amino acid and recognizing a specific codon on the mRNA;

3. Ribosomal RNA (rRNA) molecules that are incorporated into ribosomes (the complexes upon which translation occurs).


The portions of the DNA that encode the information for any of these forms of RNA are considered to be genes. However, the vast majority of genes encode information for the sequence of proteins and their product is thus mRNA. In effect, when speaking of genes, reference is usually made to genes for proteins. Therefore, how genes work concerns how they permit base sequences of DNA to be expressed as amino acid sequences of a protein. In addition, a full understanding of how a gene works requires information on where and when it is expressed.

The fact that a particular gene in a cell is active involves many factors that distinguish it from other genes that are quiescent. In general, the factors operate at several levels and can include:

  • The specific structure in which the DNA is packaged in that region of its chromosome. Genes in some regions of a chromosome are more readily turned on than in other regions.
  • The specific sequence of bases that lies just outside the beginning of the gene. Such sequences (called promoters) may bind specific proteins that facilitate turning on the near-by gene. Such proteins (called transcription factors) may be specific for a certain class of cells. For example, in muscle cells a muscle-specific transcription factor may contribute to turning on many of the genes that produce the proteins required for muscle contraction.

  • The bases at the beginning and end of the gene. These bases provide signals that the cell machinery can read to determine where the information of the gene starts and where it is completed.



Are the genes in all living organisms organized along the same principles?

A complicating factor exists for certain genes in most animals and plants (but to a much lesser extent in bacteria): the presence of non-coding sequences called introns. The information in many genes is not continuous, but is interrupted by stretches of hundreds of bases. As a result, when mRNA corresponding to a gene is synthesized, both the sequences (called exons) that code for protein and the non-coding introns are present. However, there are specialized splicing factors in cells that recognize and remove introns, while joining together the exons, so that the mature mRNA is made up of continuous exons. From the point of view of cell structure, the scarcity of introns in bacteria correlates with the fact that bacteria are small, simple cells (called procaryotes, with a diameter of ~1 micron) that contain a single circular chromosome and neither a separate nucleus nor distinct organelles, such as mitochondria. In contrast, the eucaryotic cells of plants and animals are hundreds of times more voluminous and contain a defined nucleus with many linear chromosomes, as well as many organelles.
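The effect of splicing can be illustrated with a toy transcript. In the sketch below the exon positions are simply supplied as a list, whereas a real cell must recognize the intron boundaries from signals in the sequence; the transcript, its intron, and the function name are all invented for illustration.

```python
def splice(pre_mrna, exons):
    """Join the exon segments of `pre_mrna`, dropping everything between them.

    `exons` is a list of (start, end) pairs, 0-based with `end` exclusive.
    """
    return "".join(pre_mrna[start:end] for start, end in exons)

# Invented transcript: two exons (upper case) separated by one intron (lower case).
pre_mrna = "AUGGUGCAC" + "guaaguuucag" + "CUGACUCCU"
exons = [(0, 9), (20, 29)]

mature_mrna = splice(pre_mrna, exons)
print(mature_mrna)
```

The mature mRNA contains only the exon bases, joined end to end in their original order.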


How are genes recognized by biologists?

The major challenge in the current projects to sequence the entire genetic repository of humans and other species is to identify genes, once the sequence is completed. In effect, the problem is to recognize the sequence of triplets (three adjacent bases) that correspond to the codons specifying particular amino acids that will be translated into protein on the basis of the genetic code.

A critical factor for any given region is to establish the correct reading frame (i.e., which base is the first of a triplet codon for any sequence) in order for the information to be interpreted correctly. For example, the gene for the beta subunit of hemoglobin includes the sequence:

ATGGTGCACCTGACTCCTGAGGAGAAG


If read in the frame that starts with the first "A," separation into codons (with the amino acid sequence encoded presented below using the three-letter abbreviations for the amino acids that are specified by the genetic code) yields:

ATG GTG CAC CTG ACT CCT GAG GAG AAG

Met Val His Leu Thr Pro Glu Glu Lys


This sequence corresponds to the hemoglobin beta subunit (Val - His - Leu - Thr - Pro - Glu - Glu - Lys...), with the initiator methionine (Met) removed during synthesis of the protein. The genetic disease sickle cell anemia arises when the second base, "A", in the first "GAG" triplet is replaced by "T" to give Val in place of Glu.
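The decoding step and the sickle cell substitution can be reproduced in a short sketch. The codon table is the standard genetic code written in a compact form; the one-letter amino acid codes (M = Met, V = Val, H = His, L = Leu, T = Thr, P = Pro, E = Glu, K = Lys) and the function name are conventions adopted here, not part of the original text.

```python
# Standard genetic code, built compactly: codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate complete codons from the first base; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

normal = "ATGGTGCACCTGACTCCTGAGGAGAAG"
# Sickle cell change: A -> T in the middle of the first GAG codon (0-based index 19).
sickle = normal[:19] + "T" + normal[20:]

print(translate(normal))   # one-letter amino acid codes
print(translate(sickle))
```

A single changed base alters exactly one amino acid: the first Glu (E) becomes Val (V), while the rest of the sequence is unchanged.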

If the triplet reading frame were to begin with the second or the third base, an entirely different amino acid sequence would be produced. For example, starting with the second base yields:

TGG TGC ACC TGA CTC CTG AGG AGA

Trp Cys Thr -*- Leu Leu Arg Arg


Assuming that the protein-coding sequence of the gene had started at an earlier "ATG" codon, the protein sequence would end at the "TGA" stop codon (indicated by "-*-") and the sequence would not correspond to hemoglobin. (It should be noted that codons can also be indicated in their messenger RNA form with U in place of T.)
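The dependence on reading frame can be explored by translating the same bases from each of the three possible starting positions. This is a minimal sketch using the standard genetic code and one-letter amino acid codes; the function name `translate_frame` is illustrative.

```python
# Standard genetic code, built compactly: codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate_frame(dna, offset):
    """Translate `dna` starting `offset` bases in, stopping at the first stop codon."""
    protein = []
    for i in range(offset, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":      # stop codon: the chain ends here
            break
        protein.append(amino_acid)
    return "".join(protein)

sequence = "ATGGTGCACCTGACTCCTGAGGAGAAG"
for offset in range(3):
    print(f"frame {offset + 1}: {translate_frame(sequence, offset)}")
```

Only the first frame reads through all nine codons; the second and third frames each run into a TGA stop codon after a few amino acids.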

Thus, an important indication for where a gene starts is the presence of a start codon (AUG in the mRNA or ATG in the DNA) that signals the initiation of the polypeptide chain, with a termination codon at an appropriate distance to specify the end of the sequence. For many genes, appropriately positioned signals for introns (the non-coding sequences that interrupt genes) can also be recognized. As the number of known genes increases, the candidate genes in any new sequences can be checked for proteins with similar sequences that have already been identified for genes in other organisms. Since proteins responsible for the same function in different species often retain a similar structure throughout evolution, the fact that a candidate gene codes for a protein that is similar to one previously identified in a sequenced gene provides an additional verification that the new gene has been correctly identified.


How many genes are present in a living organism?

Estimates can be made in various ways, but the ultimate test is to identify all the genes when the entire DNA sequence is established. For humans, the number of 3 billion bases per chromosome set inherited from each parent was readily established, but the number of genes has been more difficult to ascertain. Estimates of about 100,000 were usually made, but the most recent data from the sequencing of small human chromosomes suggest that the number is only about 33,000. For the fruit fly, which has recently been sequenced, fewer genes were found than expected. The fruit fly has yielded numerous insights for genetics and development (see "LA DROSOPHILE AUX YEUX ROUGES" by Walter Gehring, Editions Odile Jacob) and its sequence was recently completed using the shotgun approach pioneered by Craig Venter and his colleagues at Celera Genomics that has greatly accelerated progress on sequencing the human genome. The shotgun approach involves sequencing small pieces of the genome at random and then submitting the results to computer programs that match sequences of overlapping fragments and establish the order of the fragments. A slower approach followed by the older, cooperative effort involving scientists from several countries was to systematically divide the DNA into large fragments whose position is known and then to sequence them individually. Ultimately a combination of both approaches has proven to be most successful, as seen in the announcement of completion of the first stage of the project on June 26, 2000, obtaining 95% of the sequence information. The next phase will involve annotating the information to localize and identify the genes within these sequences.
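The idea of ordering fragments by their overlaps can be shown on a very small scale. The sketch below is a toy greedy merge, not the software used by either sequencing effort; the fragments, the minimum overlap of three bases, and the function names are assumptions made for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:            # no remaining overlaps: stop merging
            break
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for n, f in enumerate(frags) if n not in (best_i, best_j)]
        frags.append(merged)
    return frags

# Three overlapping reads, given in scrambled order.
reads = ["ACTCCTGAGGAGAAG", "ATGGTGCACCTG", "CACCTGACTCCT"]
contigs = greedy_assemble(reads)
print(contigs)
```

Real genomes contain repeated sequences and sequencing errors that defeat such a simple scheme, which is one reason the two approaches described above were ultimately combined.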


How similar are the genomes of different species?

Comparison of the complete DNA sequences determined so far indicates a trend toward less compact chromosomes (more space along the DNA between genes) for more complex organisms, as noted in the following summary, with only about 3% of the DNA actually used for genes in humans (the numbers are approximate because the exact number of genes in humans has not yet been determined).

Organism     Base pairs (millions)   Length    Genes     % DNA as genes
E. coli      4.6                     1.6 mm    4,288     >90%
Yeast        12                      4.1 mm    6,241     70%
Nematode     97                      3.3 cm    19,000    27%
Fruit fly    160                     5.4 cm    13,600    12%
Mouse        3,500                   1.2 m     33,000    3%
Human        3,500                   1.2 m     33,000    3%

The sequence of the fruit fly revealed 13,601 genes, even fewer than were found for the nematode worm (C. elegans), a much simpler animal that was fully sequenced in 1998. The complexity of an organism is thus not necessarily reflected by the number of genes. The much simpler yeast contains 6,241 genes, suggesting that this number of genes is the minimum for a complex eucaryotic cell that must carry out the various intracellular processes; the development of a more complex multi-cellular organism such as the fly would then be the result of various additional proteins that control cell differentiation and signaling between cells. Furthermore, the lower percentage of the DNA occupied by genes in humans or mice suggests that the number of proteins is only one factor in explaining the complexity of these organisms. The subtle timing and coordination of the expression of the genes that are responsible for their development and plasticity may be achieved through relationships between the long non-coding portions of the DNA that scientists are just beginning to understand.
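The trend toward more space between genes can be expressed as an average number of base pairs per gene. The sketch below uses the approximate figures from the summary table; the dictionary layout and variable names are illustrative.

```python
# (millions of base pairs, number of genes), approximate values from the table
genomes = {
    "E. coli": (4.6, 4_288),
    "Yeast": (12, 6_241),
    "Nematode": (97, 19_000),
    "Fruit fly": (160, 13_600),
    "Mouse": (3_500, 33_000),
    "Human": (3_500, 33_000),
}

# Average stretch of DNA available per gene, in base pairs.
spacing = {
    organism: mbp * 1_000_000 / genes
    for organism, (mbp, genes) in genomes.items()
}

for organism, bp_per_gene in spacing.items():
    print(f"{organism:10s} {bp_per_gene:>10,.0f} base pairs per gene")
```

On these figures, a human gene has on average roughly a hundred times more DNA around it than a gene in E. coli.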


What can be learned about fly genes that is useful for human genetics?


A striking result obtained for the fly genome is the large number of proteins that are in common with mammals, including humans. A count of such common proteins, i.e. proteins in different organisms that share considerable sequence identity (also known as "orthologs"), indicates that half of the protein sequences found in the fly show similarity to known mammalian proteins. Indeed, for a list of 289 human proteins within which mutations are known to be responsible for a particular disease, in 177 cases (61%) an ortholog has been found in Drosophila. This situation provides the opportunity of analyzing these genes in the fly more deeply than would be readily possible in humans.

 

III. PROTEINS


Related Current Themes

Related Techniques

ADVANCEMENTS IN HEALTH CARE

CLONING


What is a protein?

Proteins represent an extremely large and diverse class of molecules that, in spite of their enormous variety, are all constructed along the same principles, using an array of 20 different, but related "building blocks", the amino acids. Their diversity is reflected by the facts that

  • some are long rods, others are compact balls
  • some are free in solution, others are assembled into large complex structures
  • some are very small (<100 amino acid building blocks), others are very large (>1000 amino acid building blocks)
  • some are in the watery environment of the cytoplasm of the cell, others are imbedded in oily membranous structures


What roles are fulfilled by proteins?


Proteins carry out virtually all of the specific chemical tasks needed to sustain a living cell. Historically, one of the first major classes of proteins to be recognized was enzymes and in fact, a large fraction of the proteins in a cell are enzymes. Without enzymes life would be impossible, for two main reasons. First, enzymes are catalysts that greatly speed up reactions. For example, the simple reaction to hydrolyze the bond between two amino acids in a protein (an essential step for the digestion of proteins in the diet) occurs in about one-hundredth of a second (10 milliseconds) in the presence of an enzyme, but in the absence of enzymes would occur 10^12 times (a million million times) more slowly, requiring several hundred years! Second, enzymes are responsible for the exquisite specificity of biochemical reactions. For example, the genetic code is operational because enzymes link transfer-RNA molecules with the appropriate amino acid and they can distinguish between minor differences in both partners to be sure that the fidelity of the genetic code is maintained. For these and numerous other reasons, life as we know it would be unthinkable without enzymes.
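The orders of magnitude in this comparison can be checked with a few lines of arithmetic. The sketch below is only an illustration; the variable names are made up for the example and the length of a year (365.25 days) is an assumption.

```python
# Rough check of the enzyme rate comparison quoted in the text.
catalyzed_seconds = 0.01        # about 10 milliseconds with an enzyme
slowdown_factor = 10 ** 12      # "a million million times" slower without one

uncatalyzed_seconds = catalyzed_seconds * slowdown_factor

# Assumption: a year of 365.25 days.
seconds_per_year = 365.25 * 24 * 3600
uncatalyzed_years = uncatalyzed_seconds / seconds_per_year

print(f"about {uncatalyzed_years:.0f} years")   # about 317 years
```

The result, a little over three centuries, is consistent with the "several hundred years" given above.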

In addition to the thousands of functions catalyzed by enzymes, proteins exert a large number of other functional roles. A short list (by no means exhaustive) includes

  • antibodies, with specific antigen-binding sites that can recognize a large repertoire of different chemical structures.
  • transport proteins, such as hemoglobin, that bind specific gases, metals or nutrients in order to facilitate their transport to appropriate targets in the body.
  • receptor proteins, including those in cell membranes, that recognize the arrival of chemical signals and transduce this information into specific changes within cells.
  • structural proteins, including collagen in connective tissue and keratin in hair and skin, that exist as long fibers and provide structural rigidity.
  • hormones, such as insulin and growth hormone, that act as signals captured by specific receptor proteins.
  • intracellular messenger proteins, such as the G proteins, that are modified when certain molecules bind to receptors at the cell surface and that then trigger responses within the cell.
  • force-generating proteins, such as myosin, that in combination with another muscle protein, actin, provide the essential reactions of muscle contraction.
  • transcription factors that bind to specific sequences of DNA to determine which genes are active and which are quiescent.



How is the particular function of a protein determined by its structure?

The myriad of different chemical processes of a cell involves thousands of distinct proteins. Depending on the function of the protein, its overall structure must coincide with its role. Enzymes generally have a compact globular structure, since a precise three-dimensional form is needed to create an active site that catalyzes a specific reaction. Antibodies, the molecules that recognize foreign chemical entities, also possess a compact globular structure. Many hormones and regulatory proteins occur as small globular structures. Transcription factors have specialized shapes that permit interactions within the grooves in the double helix of DNA that regulate transcription of DNA. The fibrous proteins often possess a repeating structure that assembles in long helical chains. Some proteins exist in a globular form but with the propensity for many such individual molecules to assemble into long fibers, as in the case of actin. The actin fibers are the rails along which myosin molecules move during muscle contraction. Myosin is a protein with a globular head and a long fibrous tail. There are also many proteins with an oily surface that permits insertion into membranes, e.g., receptors and transport proteins.

What is the basic structure of proteins?

The system of genetic coding allows the sequence of bases in the DNA to determine the sequence of amino acids in the corresponding protein. All proteins are composed of a long linear chain of amino acids, typically hundreds of them. There are 20 types of amino acids in proteins, each based on a common chemical structure, but with a different R group.

The distinct R groups possess different chemical properties (acidic, basic, polar, non-polar, etc.) that provide for a wide range of structures capable of fulfilling the various requirements of proteins. A table of the 20 amino acids is presented with their names, three- and one-letter abbreviations, chemical structure of their R group, and specific triplet codons in the genetic code.


How does the genetic code convert DNA sequence into protein sequence?

Each amino acid is specified by one or more codons, as summarized in the Table of the Genetic Code:

                               Second Position

First Position   U             C             A             G             Third Position

      U          UUU = Phe     UCU = Ser     UAU = Tyr     UGU = Cys           U
                 UUC = Phe     UCC = Ser     UAC = Tyr     UGC = Cys           C
                 UUA = Leu     UCA = Ser     UAA = stop    UGA = stop          A
                 UUG = Leu     UCG = Ser     UAG = stop    UGG = Trp           G

      C          CUU = Leu     CCU = Pro     CAU = His     CGU = Arg           U
                 CUC = Leu     CCC = Pro     CAC = His     CGC = Arg           C
                 CUA = Leu     CCA = Pro     CAA = Gln     CGA = Arg           A
                 CUG = Leu     CCG = Pro     CAG = Gln     CGG = Arg           G

      A          AUU = Ile     ACU = Thr     AAU = Asn     AGU = Ser           U
                 AUC = Ile     ACC = Thr     AAC = Asn     AGC = Ser           C
                 AUA = Ile     ACA = Thr     AAA = Lys     AGA = Arg           A
                 AUG = Met     ACG = Thr     AAG = Lys     AGG = Arg           G

      G          GUU = Val     GCU = Ala     GAU = Asp     GGU = Gly           U
                 GUC = Val     GCC = Ala     GAC = Asp     GGC = Gly           C
                 GUA = Val     GCA = Ala     GAA = Glu     GGA = Gly           A
                 GUG = Val     GCG = Ala     GAG = Glu     GGG = Gly           G


There are 64 codons, the maximum number for triplets with four letters U, C, A, and G. These bases are the forms found in mRNA, the molecules formed in the transcription process that resemble one strand of DNA and participate directly in the translation into the amino acid sequence of proteins. As in the case of DNA, the mRNA molecules contain the bases cytosine (C), adenine (A), and guanine (G), but each thymine (T) in DNA is replaced by a uracil (U) in RNA, where the letters in parentheses are the standard abbreviations. Several amino acids have many codons (6 each for arginine, leucine and serine), while others have only one (tryptophan and methionine). Moreover, the codon for methionine, AUG, also signals the beginning of a coding sequence.
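The codon counts mentioned here can be verified by building the table programmatically. The following is a minimal sketch, assuming the standard genetic code written with one-letter amino acid symbols ("*" marks a stop codon); the names BASES, AMINO and CODON_TABLE are chosen for this example only.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# One-letter amino acid symbols in table order: first, second and third base
# each run through U, C, A, G, with the third base changing fastest.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Count how many codons specify each amino acid (or stop).
codons_per_residue = Counter(CODON_TABLE.values())

print(len(CODON_TABLE))                                   # 64
print(codons_per_residue["R"], codons_per_residue["L"])   # 6 6 (Arg, Leu)
print(codons_per_residue["W"], codons_per_residue["M"])   # 1 1 (Trp, Met)
print(CODON_TABLE["AUG"])                                 # M
```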

As the mRNA is read on a ribosome codon by codon, the corresponding amino acid is delivered by a transfer RNA that binds to the codon (because it possesses a complementary structure, the anti-codon). The synthesis of proteins progresses by joining successive amino acids together via a peptide bond (indicated by the arrow below), according to the sequence of codons in the mRNA, to produce a long string of amino acids that is called a polypeptide chain.

In this way, a sequence of amino acids is generated that is co-linear with the sequence of bases in the DNA. However, while the DNA sequence is functional in its linear form, the protein only becomes functional when the amino acid sequence spontaneously folds into a specific three-dimensional structure. It is only the final three-dimensional form that is capable of binding its substrate (if it is an enzyme) or a hormone (if it is a receptor) because of the specific binding pockets that are created when distant regions of the polypeptide come into proximity during the folding process. How the sequence of amino acids achieves the proper folding into its three-dimensional form remains poorly understood. Indeed, reading the linear DNA sequence is relatively simple compared to the possibility of using the amino acid sequence to deduce the final three-dimensional form that the protein will take. In fact, the known three-dimensional structures of proteins were established by experimental methods using X-rays on protein crystals, or nuclear magnetic resonance (NMR) on concentrated solutions of purified proteins, and cannot be "predicted" merely on the basis of the amino acid sequence.


What is meant by calling the genetic code "universal"?


The fact that the same basic coding rules apply in all living organisms that have been examined has led scientists to characterize the genetic code as "universal." However, this conclusion may be premature, since other forms of life elsewhere in the universe could conceivably exist based on biochemical principles similar to those observed on Earth, but with a genetic code using different codon assignments for the amino acids. Therefore, a more sustainable conclusion would be that the genetic code is "global", but even this conclusion is not absolute, since several small differences have been observed for certain micro-organisms and DNA-containing organelles.


What occurs that makes proteins unable to fulfill their roles when specified by a gene with a mutation?

No simple answer to this question is possible because many different changes can alter the proper functioning of a protein. Just as removing any one of a large number of key parts can stop an automobile, mutations can produce many changes in proteins that interfere with their function. A gene that contains a deletion may eliminate large portions of the structure or a single amino acid may be changed as the result of the change in a single base in the DNA. Moreover, in rare cases, a novel structure may be created, as in the case of the fibers of hemoglobin S in individuals with sickle cell disease, or in the case of mutations leading to a constitutive receptor kinase that causes cancer. However, in many cases the change is merely the replacement of one amino acid by another, a difference barely perceivable, but nevertheless sufficient to destroy the function of the protein.


How many different proteins are needed by a complex organism?

Every organism contains genes that code for various families of proteins required to maintain the life of that organism. One of the interests in determining the full sequence of DNA for an organism is to provide information on the number of different proteins required for life by that species. However, merely counting the number of genes does not provide a clear estimate of the number of different proteins. The reason is that many genes code for proteins in the same "family" that possess only minor differences in structure. Therefore, a more interesting question for characterizing a species is how many different protein families are represented in its genome. The entire repertoire of proteins that occur in an organism is called its "proteome," in analogy with the total content in genes being called the "genome." The number of distinct families that occur among all of the proteins is called the core proteome. For example, in yeast, among the 6241 protein-coding genes, 1858 are variants within closely related families of structures. Therefore, when the numbers are adjusted for these multigene families, the yeast core proteome contains only 4383 distinct protein families. When the same analysis is applied to the fly, the 13,601 genes yield a core proteome of 8065 distinct protein families, nearly double the number in yeast. The core proteome of the human genome may not be much larger than the fly's; there could simply be many more members in each protein family. No function is known for roughly half of the proteins identified so far in yeast, the nematode and the fly, so a great deal of work remains in order to give meaning to this information.
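The adjustment for multigene families is a simple subtraction, and the comparison between yeast and the fly is a ratio. A short sketch using only the figures quoted above (the variable names are illustrative):

```python
# Figures quoted in the text for yeast and the fly.
yeast_genes = 6241
yeast_family_variants = 1858     # genes that are variants within multigene families
yeast_core = yeast_genes - yeast_family_variants

fly_core = 8065                  # core proteome of the fly, quoted directly

print(yeast_core)                         # 4383
print(round(fly_core / yeast_core, 2))    # 1.84
```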

Yeast, nematodes, flies, and humans are all examples of organisms based on a complex cell structure with multiple chromosomes in a nucleus and various other organelles, such as mitochondria and chloroplasts (in plants). These organisms are classified as eucaryotes; in contrast, much simpler organisms, such as bacteria, with a single circular chromosome and no nucleus or other organelles, are classified as procaryotes. The procaryotes contain fewer genes than yeast, and the simplest bacterium so far sequenced, Mycoplasma genitalium, has only 517 genes. Moreover, these bacteria continue to live when about 200 of these genes are deleted individually and therefore only 300 or so genes are essential for life. Hence, for simple procaryotic organisms scientists are closing in on the minimal requirements for life, but for one-third of the essential genes in Mycoplasma genitalium, the protein encoded by the gene is of unknown function.


IV. GENETIC INFORMATION


Related Current Themes

Related Techniques

ADVANCEMENTS IN HEALTH CARE
GENETIC ENGINEERING

GENE TRANSFER


How is the genetic information stored and transmitted?

The discovery that DNA is composed of long sequences of four bases -- adenine (A), thymine (T), guanine (G), and cytosine (C) -- led to the idea that the precise order of the bases could correspond in some way to the sequence of amino acids in the corresponding protein. However, this notion also raised two major questions:

1. How many bases in the DNA sequence correspond to one amino acid in a protein?

2. What is the direct physical interaction by which an order of bases can determine an order of amino acids?


The first question is related to the number of elements involved. Since each position of the DNA can be occupied by one of 4 bases, but each position in a protein can be occupied by one of 20 amino acids, a one-to-one correspondence could not occur. Pairs of bases would provide only 4 x 4 = 16 combinations, still fewer than 20. A series of three bases is therefore the minimum number that provides a sufficient number of different combinations (4 x 4 x 4 = 64) to specify twenty different amino acids, and various experimental observations indicated that base "triplets" are indeed the units (codons) that correspond to individual amino acids.
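The counting argument can be written out directly. In this sketch the function name min_codon_length is hypothetical, introduced only for illustration; it searches for the shortest "word" of bases that offers at least as many combinations as there are amino acids.

```python
def min_codon_length(n_bases: int, n_amino_acids: int) -> int:
    """Smallest word length whose number of combinations covers every amino acid."""
    length = 1
    while n_bases ** length < n_amino_acids:
        length += 1
    return length

# Combinations available with words of one, two and three bases.
for length in (1, 2, 3):
    print(length, 4 ** length)     # 1 4, then 2 16, then 3 64

print(min_codon_length(4, 20))     # 3
```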

The second question is related to the precise chemical structures of bases and amino acids. Since amino acids are very varied in their structures, it was difficult to imagine how an order of bases in the DNA could lead to a specific alignment of amino acids. In fact, a network of adapter molecules is necessary and involves another form of nucleic acid called RNA. In conjunction with enzymes that can recognize amino acids and specific transfer RNA molecules, a bridge is made between the distinct structural forms of bases in nucleic acids and amino acids in proteins. The steps involving this adapter network are complex, and divided into two phases, transcription (copying and amplifying the DNA sequence by production of molecules of messenger RNA) and translation (decoding the messenger RNA into an amino acid sequence by utilizing transfer RNA molecules and ribosomes that include ribosomal RNA molecules). In this way the broad pattern of expression of the genetic information (DNA --> RNA --> protein) is achieved by the transcription from DNA --> RNA (same language), followed by translation from RNA --> protein (change of language utilizing the genetic code).

What is the genetic code?

The genetic code gives the precise sequence of bases in genes that determines each of the 20 amino acids in proteins. Hence, if the base sequence of a gene is known, the amino acid sequence of the corresponding protein can be deduced. For example, the base sequence:

GTGCACCTGACTCCTGAGGAGAAG

codes for the amino acid sequence:

Val-His-Leu-Thr-Pro-Glu-Glu-Lys


which is the beginning of the hemoglobin beta chain. The reverse operation -- determining the base sequence from the amino acid sequence -- is not possible, because for many amino acids there is redundancy, i.e., several different base sequences code for the same amino acid. For example, Val can be coded by four different codons, His by two, Leu by six, and Thr by four. Hence, for just the four initial amino acids of the hemoglobin beta chain, 4 x 2 x 6 x 4 = 192 different base sequences are compatible possibilities.
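The number of compatible base sequences is the product of the number of codons available for each amino acid. A minimal sketch follows; the dictionary holds only the residues of this example, with counts read from the standard genetic code, and the function name count_encodings is hypothetical.

```python
from math import prod

# Number of codons for each residue in the standard genetic code.
CODONS_PER_RESIDUE = {"Val": 4, "His": 2, "Leu": 6, "Thr": 4,
                      "Pro": 4, "Glu": 2, "Lys": 2}

def count_encodings(peptide: str) -> int:
    """Count the distinct base sequences that encode a hyphen-separated peptide."""
    return prod(CODONS_PER_RESIDUE[residue] for residue in peptide.split("-"))

print(count_encodings("Val-His-Leu-Thr"))                    # 192
print(count_encodings("Val-His-Leu-Thr-Pro-Glu-Glu-Lys"))    # 6144
```

Extending the count to all eight residues shows how quickly the ambiguity grows with the length of the peptide.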

How is transcription achieved?

The fact that each strand of the double-stranded DNA can provide a template for synthesis of the other strand lies at the heart of the copying of the genetic information during each cell division. The actual synthesis of one strand using the other as a template involves several enzymes, notably DNA polymerases. The same principle of template-based synthesis is used for gene expression by making single-stranded copies of the gene in a form called messenger RNA. In this case, only one strand of the DNA is copied and the enzymes involved are called RNA polymerases. Molecules of RNA resemble one of the strands of DNA, with two important differences:

  1. Each base in RNA is linked to the sugar ribose, whereas in DNA the sugar is deoxyribose.
  2. The bases A, G and C are used in RNA, as in DNA, but for RNA the base that pairs with A is U (uracil), not T (thymine) found in DNA.


The organization of DNA and RNA involves a long polymer of alternating sugar and phosphate, with the base attached to the sugar.

The gene for the beta subunit of hemoglobin begins with the structure:

...GTGCACCTGACTCCTGAGGAGAAG...
...¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦...
...CACGTGGACTGAGGACTCCTCTTC...

Part of the identification of where genes begin and end on DNA involves identifying which strand corresponds to the coding information and which is the complementary template. In the above example for hemoglobin, the upper strand carries the information that corresponds to the amino acid sequence of the hemoglobin subunit and this information will be transcribed into RNA by using the lower strand as the template, producing a messenger RNA molecule with the structure:

...GUGCACCUGACUCCUGAGGAGAAG...

This sequence would produce the corresponding protein according to the assignments of the Table of the Genetic Code (which presents the 64 possible mRNA triplets and the corresponding amino acid encoded by each triplet):

GUG CAC CUG ACU CCU GAG GAG AAG
Val His Leu Thr Pro Glu Glu Lys
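The two steps of this example, transcription from the template strand followed by codon-by-codon translation, can be sketched in a few lines. This is an illustration under simple assumptions, not a model of the cellular machinery: the template strand is taken exactly as displayed above (aligned base for base under the coding strand), the standard genetic code is assumed, and all names are chosen for the example.

```python
from itertools import product

# Strands as displayed in the text; the template is aligned under the coding strand.
coding_strand = "GTGCACCTGACTCCTGAGGAGAAG"
template_strand = "CACGTGGACTGAGGACTCCTCTTC"

# Transcription: each template base is paired with its RNA partner (A pairs with U).
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}
mrna = "".join(DNA_TO_RNA_PAIR[base] for base in template_strand)

# Standard genetic code in table order (U, C, A, G at each position); "*" is stop.
BASES = "UCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
THREE_LETTER = dict(zip(
    "ACDEFGHIKLMNPQRSTVWY*",
    ("Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln "
     "Arg Ser Thr Val Trp Tyr stop").split()))

def translate(rna: str) -> str:
    """Read the RNA three bases at a time until the end or a stop codon."""
    residues = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        residues.append(THREE_LETTER[amino_acid])
    return "-".join(residues)

protein = translate(mrna)
print(mrna)      # GUGCACCUGACUCCUGAGGAGAAG
print(protein)   # Val-His-Leu-Thr-Pro-Glu-Glu-Lys
```

The mRNA obtained from the template is identical to the coding strand with each T replaced by U, which is why the coding strand is the one usually quoted for a gene.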

Many genes, particularly in eucaryotic organisms, are interrupted by non-coding sequences (called introns) that separate the portions of the gene (called exons) that actually code for the sequence of amino acids in the protein. When the DNA is transcribed to yield messenger RNA (mRNA), the introns are also copied. However, in a subsequent step the mRNA is processed (in a procedure called "splicing") to eliminate the introns, so that the mature mRNA no longer contains the non-coding intron sequences.
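The bookkeeping of splicing can be illustrated with a toy transcript. Everything in this sketch is hypothetical: the intron sequence and the exon coordinates are invented for the example and do not correspond to the real introns of the hemoglobin gene, and the function name splice is illustrative only.

```python
# Hypothetical primary transcript: two exons (upper case) separated by a
# made-up intron (lower case). The sequence is for illustration only.
pre_mrna = "GUGCACCUG" + "guaaguuuucag" + "ACUCCUGAG"

# Exon positions as half-open (start, end) index pairs; illustrative values.
exons = [(0, 9), (21, 30)]

def splice(transcript: str, exon_spans) -> str:
    """Join the exon segments in order, discarding everything in between."""
    return "".join(transcript[start:end] for start, end in exon_spans)

mature_mrna = splice(pre_mrna, exons)
print(mature_mrna)   # GUGCACCUGACUCCUGAG
```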


How is translation achieved?

The transcription phase of protein synthesis leads to incorporation of the genetic information in a transcript called the messenger RNA (mRNA), which must then be translated into the amino acid sequence of a protein. The link between bases and amino acids is provided by the transfer RNA molecules. For each codon of the genetic code, there are transfer RNA (tRNA) molecules that possess an anti-codon that forms base pairs with the codon of the mRNA, by mimicking the hydrogen bonds between the strands in the corresponding position of DNA. The pairing occurs at specific sites on ribosomes that orient the mRNA and tRNA in order to favor the specific interactions between codon and anti-codon. For example, when GUG occurs in the mRNA exposed at the appropriate site on the ribosome, a tRNA bearing the anti-codon CAC will bind. Moreover, via the action of specific enzymes that recognize both a specific tRNA and one of the amino acids, the tRNA molecules are charged with the amino acid that corresponds to the codon recognized by the anti-codon. Hence in the example of the GUG codon, the tRNA with the CAC anti-codon will be charged with valine.
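The pairing between codon and anti-codon follows the same base-pairing rules as the two strands of a nucleic acid helix. The sketch below assumes strict pairing (A with U, G with C) and ignores the relaxed "wobble" pairing that real tRNA molecules allow at the third codon position; the function name anticodon is hypothetical. Because the two paired strands run in opposite directions, the anti-codon is returned in the conventional 5' to 3' orientation, i.e., as the reverse complement of the codon.

```python
RNA_PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}

def anticodon(codon: str) -> str:
    """Return the anti-codon (5' to 3') that pairs with an mRNA codon (5' to 3')."""
    return "".join(RNA_PAIR[base] for base in reversed(codon))

print(anticodon("GUG"))   # CAC, the tRNA that will carry valine
print(anticodon("AUG"))   # CAU, the tRNA that will carry methionine
```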

As the mRNA moves along the ribosome, successive codons are exposed and align with their appropriate anti-codon bearing tRNA molecules. When two adjacent tRNA molecules are in position on the ribosome, a peptide bond is formed, with the ribosomal RNA (rRNA) catalyzing the peptide bond formation. In this way, the series of codons on the mRNA is translated into a series of amino acids. One linear message is translated into another, with the conversion provided by the "bilingual" tRNA molecules that recognize one language (codons) via their anti-codon and the other due to the amino acid attached. The tRNA molecules are therefore at the heart of the translation mechanism, but ultimately the key to the process is held by the many specific enzymes that recognize one class of tRNA molecules and link it with its appropriate amino acid. Moreover, the ribosomal RNA also participates in the catalysis of peptide bond formation, illustrating another important principle, that RNA molecules can also possess enzymatic activity, with such RNA molecules called "ribozymes".

The various steps of transcription and translation are summarized schematically in the diagram below:



As the amino acid chain grows, it spontaneously wraps around itself to fold into the compact and highly ordered three-dimensional structure that produces a functional protein. The folding process is the final step in gene expression.

 

V. MUTATIONS


Related Current Themes

Related Techniques

GENETIC DISEASES

DNA SEQUENCING




What is a mutation?

A mutation is any change that occurs in the DNA. It can be of the simplest form, a point mutation, that replaces a single base (A, T, G, or C) by a different base, or it can involve more complex changes, including insertions of one or more bases, deletions of many bases, or major rearrangements of a substantial portion of a chromosome. Changes of this type account for the variations that led to the evolution of the multitude of species found among the varied forms of life on our planet. Mutations can alter the specific structure of protein and RNA molecules, when the mutations lie within one of the regions of the chromosome (known as genes) that specify these structures. However, for a typical human chromosome, genes make up only about 3% of the total DNA sequence, but mutations in the intervening regions can greatly influence timing and expression of the genes.
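The effect of a point mutation can be illustrated with the beginning of the hemoglobin beta chain gene. The sketch below applies the well-known sickle cell substitution, an A replaced by a T in the middle of the sixth codon (GAG becomes GTG), and translates both versions. It assumes the standard genetic code written with T in place of U so that the DNA coding strand can be read directly; the function names are illustrative only.

```python
from itertools import product

# Standard genetic code written for the DNA coding strand (T in place of U).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
THREE_LETTER = dict(zip(
    "ACDEFGHIKLMNPQRSTVWY*",
    ("Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln "
     "Arg Ser Thr Val Trp Tyr stop").split()))

def translate(dna: str) -> str:
    """Translate a coding-strand sequence three bases at a time."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "-".join(THREE_LETTER[CODON_TABLE[codon]] for codon in codons)

def point_mutation(dna: str, position: int, new_base: str) -> str:
    """Replace the single base at a 0-based position."""
    return dna[:position] + new_base + dna[position + 1:]

normal = "GTGCACCTGACTCCTGAGGAGAAG"
mutant = point_mutation(normal, 16, "T")   # middle base of the sixth codon

print(translate(normal))   # Val-His-Leu-Thr-Pro-Glu-Glu-Lys
print(translate(mutant))   # Val-His-Leu-Thr-Pro-Val-Glu-Lys
```

A single base change thus replaces one glutamic acid by a valine, the substitution that characterizes hemoglobin S.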

What are the effects of mutations?


Mutations can occur at low frequency whenever DNA replicates during cell division. If they occur in the cells producing eggs or sperm, they will be transmitted to every cell of the next generation, whereas in other cells, "somatic" mutations will only affect tissues derived from the mutated cell (as occurs for certain forms of cancer) and will not be transmitted to the next generation. In general, every individual possesses many mutations compared to the "average" human being; most have no significant consequences and are considered "neutral". Others may lead to certain characteristics or susceptibilities that are considered to be part of the unique characteristics of a normal individual. Some are recessive mutations that would cause pathological conditions in a homozygous state (if they were present on both members of a pair of chromosomes), but are harmless in the heterozygous state (present on just one member of a pair of chromosomes). However, such latent recessive mutations are a reason that consanguinity within a family or within a small population can have dangerous consequences for future generations. Some mutations are dominant and can provoke a genetic disease even when present on one chromosome. Recent studies have identified a novel type of mutation that involves repetitions of certain 3-base sequences (triplets) up to thousands of times. A small fraction of individuals in the human population are born with such a mutation that can cause a serious disease, for example "fragile X syndrome" or "severe myotonic dystrophy". Thousands of genetic diseases have been identified, but many occur in just a few families in the world.


What is special about mutations on the X or Y chromosomes?

Among the 46 chromosomes, humans possess 22 pairs plus two sex chromosomes: females are XX and males are XY. As a result, a male inherits only one X chromosome (from the mother) and if that chromosome harbors a mutation that is recessive in the mother, the mutation will appear "dominant" in the son, as in the case of hemophilia. Hence, each son of such a mother has a 50% chance of receiving the X chromosome with the mutation. Genetic diseases caused by mutations on the Y chromosome are passed only from father to son.
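The 50% figure follows from listing the equally likely combinations of parental sex chromosomes. In this sketch the label "X*" is a notation invented for the example to mark the X chromosome carrying the recessive mutation; the mother is assumed to be a carrier and the father unaffected.

```python
from fractions import Fraction
from itertools import product

mother = ("X*", "X")   # carrier: one mutated X (hypothetical label) and one normal X
father = ("X", "Y")    # unaffected father

# Each child receives one chromosome from each parent, all pairs equally likely.
children = list(product(mother, father))

sons = [child for child in children if "Y" in child]
affected_sons = [child for child in sons if "X*" in child]
daughters = [child for child in children if "Y" not in child]
carrier_daughters = [child for child in daughters if "X*" in child]

print(Fraction(len(affected_sons), len(sons)))            # 1/2
print(Fraction(len(carrier_daughters), len(daughters)))   # 1/2
```

Under the same assumptions, half of the daughters are expected to be carriers like their mother, without being affected themselves.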


What causes mutations?

Mutations occur spontaneously at a low, but finite rate. Whenever DNA is duplicated there is a tiny probability that an error will occur. In addition, the mutation rate may be increased by several factors, including radiation (one of the consequences of Hiroshima and Chernobyl). Certain chemical pollutants in the environment possess mutagenic activity. UV light is also mutagenic and is the principal cause of skin cancers, largely through mutations that inactivate the protein p53 that normally participates in cell growth control. Many other forms of cancer are caused by mutations in genes that produce abnormally active (or constitutive) proteins.