Lecture 7

Biology of Transcription Factors


        The generation of a complex organism from a single cell requires a remarkably intricate system in which gene expression is tightly regulated, both temporally and spatially.  This process is mediated by a limited number of divergent transcription factors that bind specifically to DNA enhancer and promoter sequences of many genes.  As every transcriptional event relies on the formation of a protein-DNA complex, it is vital to learn the basic facts about these transcription factors and their importance in gene regulation.  In discussing transcription factors, I am going to take the approach that there are some common problems TFs have to solve, and some common solutions.

    This section of the course will be divided into two parts.  In the first part, we will discuss some of the more common structural motifs of transcription factors and then deal with the general problems that they must overcome.  The second part deals with the solutions that have evolved to overcome these problems, chiefly dimerization.

1.  Transcription Factors

Protein Structure

    If we are to understand proteins, we must think of them as a hierarchy of functional parts:

Motif - A short conserved region in a protein sequence.  Motifs are frequently highly conserved parts of domains and are identified as highly similar regions in alignments of protein segments.

Domain - An independent structural unit of a protein, assumed to fold independently of the rest of the protein and to possess autonomous function.  Domains of the same type are evolutionarily related.

    In the light of structural and biochemical evidence that has accumulated over recent years, it has become clear that the traditional view that 'polypeptide = protein' is inadequate to describe some naturally occurring polypeptides.  In particular, it can be shown that different regions along a single polypeptide chain can act as independent units, to the extent that they can be excised from the chain and still fold correctly, and often still exhibit biological activity.  These independent regions are termed domains.

    Domains sometimes act completely independently of each other, as in the case of a catalytic domain and a binding domain, where the two domains don't interact with each other, but their association is synergistic because the linker between them keeps the catalytic domain in close proximity to its substrate.  In other cases structural interactions between domains do occur (indeed, structures have been solved for multi-domain complexes).  In this case, the interaction between the domains should be considered as something akin to quaternary structure, rather than treating the whole complex as a single protein.

   

DNA-binding domains:

    Most transcription factors can bind to specific DNA sequences, and these trans-regulatory proteins can be grouped together in families based on similarities in structure.  Within such a family, proteins share a common framework structure in their respective DNA-binding sites, and slight differences in the amino acids at the binding site can alter the DNA sequence to which a given protein binds.  Not surprisingly, there is a limited number of ways that proteins interact specifically with DNA.  Depending on their mode of interaction with DNA, transcription factors have been categorized into a limited number of gene families, each sharing a common DNA-binding domain.  Here, I will briefly discuss the more common DNA-binding domains.

Review Article: Warren AJ.  Eukaryotic transcription factors.  Curr Opin Struct Biol 2002 Feb;12(1):107-14.

a.  Helix-turn-helix (HTH) and variants

    The canonical HTH domain consists of two antiparallel helices linked by a turn of 3 to 4 amino acids, so that the helices lie at a 120° angle.  One of the two helices lies in the major groove and the other lies across the DNA.  The HTH domain you will see most often in this course is the "homeodomain".  Homeodomains were first discovered in the proteins encoded by the homeotic genes, but were subsequently found in many other transcription factors.  Homeotic genes are known to play a critical role in animal development.  Note that homeotic proteins have homeodomains, but not all proteins with homeodomains are homeotic.  There are many variants of the basic motif: proteins with a third helix, a tether to connect the two helices, amino- and carboxy-terminal extensions to contact the DNA, or a flexible loop or wing to contact the DNA.

 

Oct-1 POU Domain

    This is a tethered HTH variant.  The octamer transcription factor 1 (Oct-1) POU binding region consists of two independent DNA-binding domains, a POU-specific domain and a POU homeodomain.  Both of these domains are HTH domains that contact the major groove, and they are joined by a flexible linker region.  This linker varies widely in both sequence and length (15 to 56 residues) among the members of the POU family.  It is thought that the two domains bind cooperatively.

 

The domains of POU family transcription factors.

Winged HTH

    "Winged" Helix-Turn-Helix domain is a modified DNA binding Helix-Turn-Helix domain with a small beta sheet and an extended appendage that protrudes out from the central core like a wing.

Winged-helix, or forkhead, domain from HNF3, a hepatocyte-specific transcription factor.

b.  Helix-loop-helix (HLH) 

    HLH proteins have two helices connected by a flexible (looped) linker.  One helix is basic and contacts the DNA, whereas the other is required for dimerization.  HLH proteins work as dimers, so there are 4 helices in the DNA-binding molecule.  The two basic helices (one from each monomer) contact the major groove of the DNA like a pair of scissors.  In a related class of factors, the interacting helices have heptad repeats of leucine residues along one face of the helix.  It is thought that hydrophobic interactions between the leucines stabilize dimer formation.  These helices are called leucine zippers, and transcription factors that pair a basic DNA-contacting region with a leucine zipper are called bZIP proteins, where b stands for basic.
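The heptad spacing of leucines described above can be searched for in a protein sequence with a regular expression.  The following is only a minimal sketch: the requirement of four leucines in a row, the function name find_leucine_heptads, and the toy sequence are all assumptions made for illustration, and a real leucine zipper prediction would also need to check that the region is helical.

```python
import re

# A leucine, then three more leucines each exactly 7 residues further on.
# Requiring four in a row is an assumed threshold, not a rule from the lecture.
# The lookahead lets overlapping runs be reported as separate hits.
HEPTAD_LEUCINES = re.compile(r"(?=(L(?:.{6}L){3}))")


def find_leucine_heptads(protein: str) -> list[int]:
    """Return 0-based start positions of four heptad-spaced leucines."""
    return [match.start() for match in HEPTAD_LEUCINES.finditer(protein.upper())]


# Hypothetical sequence invented for this example (not a real protein).
toy_sequence = "MKRA" + "LEEKVAA" * 4 + "GSG"
print(find_leucine_heptads(toy_sequence))
```

A hit from this pattern is only a hint that a zipper may be present, since leucine is a common residue and heptad spacing can occur by chance.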

The transcription factor MyoD.  The structure of the helix-loop-helix motif.

The structure of the AP-1/DNA complex.  AP-1 is a dimer formed by Jun and its homologous protein Fos.  It contains a leucine zipper motif in which two α helices look like a zipper, with leucine residues (green) lining the inside of the zipper.

c.  Zinc fingers and zinc clusters

    Many versions of DNA-binding proteins use zinc to stabilize their DNA-binding domains.  The canonical Zn finger (C2H2 zinc finger) is characterized by the sequence CX2-4C....HX2-4H, where C = cysteine, H = histidine, X = any amino acid.  In the 3D structure, two cysteine residues and two histidine residues interact with a zinc ion, which stabilizes the loop or finger of amino acids that interact with the DNA.  Most transcription factors of this class have at least 3 and up to 14 fingers, thus increasing the contact with the DNA.  It is becoming clear that many types of zinc fingers exist, including cysteine clusters in which 6 or more cysteines contribute to Zn binding.
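The C2H2 signature given above translates naturally into a regular expression.  The sketch below is illustrative only: the text leaves the spacer between the cysteine pair and the histidine pair unspecified, so the range of 10 to 14 residues used here is an assumption, as are the function names and the toy sequence.

```python
import re


def c2h2_pattern(spacer_min: int = 10, spacer_max: int = 14) -> re.Pattern:
    """Build a regex for C-X(2-4)-C ... H-X(2-4)-H.

    The C-X(2-4)-C and H-X(2-4)-H parts follow the lecture text.
    The spacer length between them is an ASSUMED parameter.
    """
    return re.compile(rf"C.{{2,4}}C.{{{spacer_min},{spacer_max}}}H.{{2,4}}H")


def find_c2h2_fingers(protein: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched text) for each non-overlapping candidate."""
    pattern = c2h2_pattern()
    return [(m.start(), m.group()) for m in pattern.finditer(protein.upper())]


# Hypothetical sequence invented for this example (not a real protein).
toy_sequence = "MSCPECGKSFSQKSNLQKHQRTHTG"
for start, finger in find_c2h2_fingers(toy_sequence):
    print(start, finger)
```

Counting the hits in a full-length protein gives a rough estimate of how many fingers it carries, although a sequence match alone does not show that the finger folds or binds zinc.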

The archetypal transcriptional activator Sp1. The structure of a zinc finger region. 

The complex of the estrogen receptor (ER) zinc finger domain and DNA.  In this figure, two ERs form a dimer.  Each ER binds two zinc ions (represented by orange balls).  Most steroid hormone receptors contain such a motif.


Most transcription factors fall into one of the three classes or variants listed above.  You do not need to memorize the material above, but it will help you when reading the literature and provide a way to keep track of the different kinds of transcription factors.


2.  General Biological Problems

    A simplistic view of gene regulation may be that different transcription factors activate/silence different genes by binding unique sequences within the regulatory regions of their target genes.  There are a considerable number of problems with this model.  We will discuss some of these problems in order to provide a context for the lectures to follow on how dimerization provides a solution to many of these problems.

a.  The first problem, which has been revealed by genome sequencing efforts, is that there are surprisingly few genes in most organisms.  There are approximately 16,000 genes in Drosophila and, by current estimates, roughly 20,000 protein-coding genes in humans (early estimates ran as high as 60,000).  Given the number of developmental decisions that must take place, there are simply not enough transcription factors available to have one transcription factor dedicated to each regulatory event.  Therefore, there must be some way to extract more regulatory information from a set of transcription factors that is smaller than the number of events to regulate.

b.  A second problem is that many transcription factors bind to many DNA sequences with the same affinity.  Given that the transcription factor has to find the correct sequence in a genome filled with many similar sequences to which it binds with equal affinity, how does the transcription factor distinguish the binding sites?

c.  A related problem is that many transcription factors are members of gene families, all with similar binding affinities.  So, for example, there are 7 homeotic proteins in Drosophila, all of which have homeodomains and all of which show identical properties in vitro.  Surprisingly, each of these 7 homeotic proteins recognizes different targets in vivo.  How is this accomplished?

d.  Alternatively, a given DNA sequence can be recognized by different transcription factors.  How is it decided which transcription factor will occupy the site?  Is it by competition, decided by the relative abundance of the factors or by differences in their binding affinity?

e.  Another problem, which is much more subtle but very important, is that the binding (dissociation) constants for most eukaryotic transcription factors are in the range of 10^-8 to 10^-9 M, compared to prokaryotic binding constants of about 10^-11 M.  The consequence of this is that eukaryotic transcription factors have relatively low affinity for their targets.  The benefit is that they fall off their targets readily; the cost is that finding and binding the targets is less likely.

f.  The number of transcription factors in a cell is quite low (usually ranging from 1 to 100 molecules), yet the amount of DNA to be sampled to find the target is very high.  How can it be arranged that transcription factors find their targets rapidly in the genome, especially in light of the problems stated above?

g.  Numerous studies have shown that some transcription factors repress gene expression in one cell type and activate it in another cell type.  How is it that one factor can act both as a repressor and an activator?
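Problems b, e and f can be made concrete with some rough arithmetic.  The sketch below is a back-of-the-envelope illustration, not a model of real nuclei: it assumes random DNA with equal base frequencies, a genome of about 3 x 10^9 bp, a spherical nucleus with a hypothetical radius of 5 micrometres, a single binding site at equilibrium, and that every transcription factor molecule is free in solution.  All function names are invented for this example.

```python
import math

AVOGADRO = 6.02214076e23  # molecules per mole


def expected_chance_sites(site_length_bp: int, genome_size_bp: float) -> float:
    """Expected chance matches to one specific sequence on one strand.

    Assumes random DNA with all four bases equally frequent.
    """
    return genome_size_bp * 0.25 ** site_length_bp


def molar_concentration(molecules: float, nucleus_radius_um: float) -> float:
    """Convert a copy number to mol/L inside a spherical nucleus.

    One cubic micrometre equals 1e-15 litres.
    """
    volume_litres = (4.0 / 3.0) * math.pi * nucleus_radius_um ** 3 * 1e-15
    return molecules / (AVOGADRO * volume_litres)


def fractional_occupancy(free_tf_molar: float, kd_molar: float) -> float:
    """Equilibrium occupancy of a single site: [TF] / ([TF] + Kd)."""
    return free_tf_molar / (free_tf_molar + kd_molar)


GENOME_BP = 3e9            # ASSUMED approximate genome size
NUCLEUS_RADIUS_UM = 5.0    # ASSUMED nuclear radius (hypothetical)

# Problem b: short sites occur many times by chance.
for length in (6, 8, 10):
    sites = expected_chance_sites(length, GENOME_BP)
    print(f"{length} bp site: about {sites:,.0f} chance matches")

# Problems e and f: 100 molecules against the binding constants in the text.
tf_molar = molar_concentration(100, NUCLEUS_RADIUS_UM)
print(f"100 molecules correspond to about {tf_molar:.2e} M")
for kd in (1e-8, 1e-9, 1e-11):
    theta = fractional_occupancy(tf_molar, kd)
    print(f"Kd = {kd:.0e} M: site occupied {theta:.1%} of the time")
```

Under these assumptions, 100 molecules give a concentration below the eukaryotic binding constants quoted in problem e, so a single site would be occupied only a small fraction of the time, whereas a prokaryotic-strength site would be occupied nearly all the time.  The numbers shift with the assumed nuclear volume, and nonspecific binding to the rest of the genome would lower the free concentration further.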