Relationship of the DNA of a eukaryotic gene to the 5'-UTR of its mRNA

In eukaryote pre-mRNA I am having a little trouble grasping exactly what the 5 prime untranslated region is defined as.

It seems that it could be defined as the difference in pre-mRNA between the transcription and translation, is that right?

If so, does the transcription begin immediately after the end of the promoter sequence? Immediately after the promoter region on the DNA ends, does the next nucleotide become part of the 5'UTR?

Concise Answer

The 5'-UTR region of a eukaryotic mRNA is derived from the RNA transcript of the region of a gene between the transcription start site and the DNA corresponding to the translational initiation codon. It differs from that region of the initial transcript in most cases by having a modified guanosine nucleotide added at the 5'-end in a 'cap' structure, and in some cases by lacking an intron present in the former.
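As a rough illustration of this definition, the Python sketch below extracts the 5'-UTR from a mature mRNA once the position of the initiation codon is known. The function name, the toy sequence and the 0-based coordinate convention are assumptions made for the example. The cap is not represented, since it is added enzymatically rather than copied from the DNA.

```python
def five_prime_utr(mrna: str, start_codon_index: int) -> str:
    """Return the part of the mRNA that lies 5' of the initiation codon.

    mrna              -- mature mRNA written 5'->3'; its first base is the TSS (+1)
    start_codon_index -- 0-based index of the A of the initiating AUG
    """
    if mrna[start_codon_index:start_codon_index + 3] != "AUG":
        raise ValueError("no AUG at the given position")
    return mrna[:start_codon_index]


# Hypothetical toy transcript: 10-nt 5'-UTR, a tiny ORF (AUG GCC UAA), then a 3'-UTR.
toy_mrna = "ACUUGCCGCC" + "AUGGCCUAA" + "GGAUCC"
utr = five_prime_utr(toy_mrna, 10)
print(utr, len(utr))
```

Note that the function needs the start codon position as an input: as discussed below, neither end of the 5'-UTR can be read reliably from the DNA sequence alone.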

The Transcriptional Start Site (TSS) and the Promoter

This answer is in terms of the TSS rather than the 'promoter region' or 'promoter sequence' favoured by the questioner, because in eukaryotes, at least, there is no simple and easily definable promoter sequence. A 2012 review of metazoan promoters begins with the definition: “Gene promoters are the loci overlapping transcription start sites (TSSs), at which the total regulatory input of a gene is integrated into the rate of transcriptional initiation. The immediate role of the promoter is to bind and correctly position the transcription initiation complex… ”.

A simplified diagram of some components of human promoters (adapted from an earlier review by Hahn) is shown below. The position of 'Inr' - the Initiator box which contains the TSS - varies with respect to the TATA box (usually from -25 to -30 in humans), as does its sequence. Hence, even ignoring the alternative TSSs mentioned by @Luke it is not possible to identify the TSS unequivocally by inspection of the DNA sequence of a gene in the way the questioner would apparently wish.

The Translational Initiation Codon

It should be mentioned that it is also not possible to identify the mRNA translational initiation codon unequivocally by inspection of the DNA sequence of a gene. Although 90% or more of cases follow Kozak's rule, so that it corresponds to the first ATG, in some cases it may correspond to the second ATG, and in a small minority - where internal initiation occurs - to some distal ATG. There are also a very few examples of initiation of translation at codons other than AUG.
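To make the 'first AUG' idea concrete, here is a minimal Python sketch that locates the first AUG in an mRNA and grades its sequence context. The grading uses the commonly quoted simplification of Kozak's consensus (a purine at -3 and a G at +4, with the A of AUG as +1). The function names, the three labels and the toy sequences are assumptions for illustration, and a prediction of this kind cannot replace experimental evidence.

```python
def first_aug(mrna: str) -> int:
    """Index of the first AUG (0-based), or -1 if there is none."""
    return mrna.find("AUG")


def kozak_context(mrna: str, aug_index: int) -> str:
    """Grade the context of an AUG: purine at -3 and G at +4.

    Both present -> 'strong', one -> 'adequate', neither -> 'weak'.
    """
    minus3 = mrna[aug_index - 3] if aug_index >= 3 else ""
    plus4 = mrna[aug_index + 3] if aug_index + 3 < len(mrna) else ""
    hits = (minus3 in ("A", "G")) + (plus4 == "G")
    return ("weak", "adequate", "strong")[hits]


# Hypothetical toy sequences
good = "GCCACCAUGGCG"   # A at -3, G at +4
poor = "UUUUUUAUGCCC"   # U at -3, C at +4
for seq in (good, poor):
    i = first_aug(seq)
    print(seq, i, kozak_context(seq, i))
```

A weak context around the first AUG is one situation in which scanning ribosomes may pass it and initiate at a later one.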

Introns in 5'-UTR regions of pre-mRNA

A common rationalization of the intron/exon nature of eukaryotic pre-mRNA is in terms of alternative splicing to produce different protein sequences. Perhaps because of this, the presence of introns in the primary transcript containing the 5'- and 3'-UTRs is often ignored (see Bicknell et al. for a review). In fact, approximately 35% of 5'-UTR precursors contain introns, not present in the mature mRNA.
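The effect of a 5'-UTR intron can be sketched in a few lines of Python. The example removes an intron (given by coordinates) from a toy pre-mRNA and compares the length of the region upstream of the AUG before and after splicing. The sequences, the coordinates and the function name are invented for illustration.

```python
def splice(pre_mrna: str, introns: list[tuple[int, int]]) -> str:
    """Remove introns given as 0-based, end-exclusive (start, end) intervals."""
    exons = []
    cursor = 0
    for start, end in sorted(introns):
        if start < cursor or end > len(pre_mrna):
            raise ValueError("overlapping or out-of-range intron")
        exons.append(pre_mrna[cursor:start])
        cursor = end
    exons.append(pre_mrna[cursor:])
    return "".join(exons)


# Hypothetical pre-mRNA: 5-nt exon 1 (all UTR), a 12-nt intron (GU...AG),
# then exon 2 carrying the rest of the UTR and a tiny ORF.
pre = "ACUUG" + "GUAAGUUUUCAG" + "CCGCC" + "AUGGCCUAA"
mature = splice(pre, [(5, 17)])

print(pre.index("AUG"))     # length of the 5'-UTR precursor in the pre-mRNA
print(mature.index("AUG"))  # length of the 5'-UTR in the mature mRNA
```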

5'-capping of eukaryotic mRNA

The vast majority of eukaryotic mRNAs have a cap at their 5'-end, produced by the addition of a guanosine nucleotide in a 5'-5' triphosphate linkage to the 5'-residue (usually A) of the pre-mRNA. This guanosine is subsequently methylated at the N-7 position of the base; 2'-O-methylation of the ribose of the subsequent one or two nucleotides may also occur.


Although “DNA makes RNA makes Protein”, I do think it important to describe features of information molecules in terms of their actual composition. Thus, the 5'-UTR has to be defined in terms of mRNA, not promoters or exons. If one makes reference to the information content of DNA or RNA, then I think a form of words should be used to make that clear. Muddled forms of words can easily lead to muddled thinking.

Short answer

The 5' UTR on the mRNA runs from the Transcriptional Start Site (TSS) up to, but not including, the translation start codon. Promoters are usually associated with a corresponding TSS.

Longer answer

In defining a UTR we must consider where transcription begins. Strictly speaking, transcription begins at the Transcriptional Start Site (TSS). A useful website for exact definitions of terms such as this is the Sequence Ontology - see here for the TSS definition (link) - although admittedly in this case it does not provide more information than that (i.e. the TSS is where transcription begins… which we might have gathered from the name). A TSS will have an associated promoter region, which is where the RNA polymerase (and various regulators and transcription factors) bind.

To expand: a gene may have multiple TSSs, which may be used under different circumstances (e.g. with different factors present). We therefore have the concept of a TSS region: "The region of a gene from the 5' most TSS to the 3' TSS" (link). TSSs have to be determined empirically, and seem to vary by tissue (a database of all known TSSs has been accumulating evidence since 2002: DBTSS). I have just performed a search for the gene APOE (chr 19) and this is a small excerpt of the data presented, just to give an idea:


So the TSS used for a gene depends on when and where the transcript is being produced.
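The 'TSS region' definition quoted above can be sketched in Python. The one subtlety is strand: on the minus strand the 5'-most TSS has the largest genomic coordinate. The function name and the positions are made up for the example and are not real APOE or DBTSS data.

```python
def tss_region(tss_positions: list[int], strand: str) -> tuple[int, int]:
    """Return (5'-most TSS, 3'-most TSS) in genomic coordinates.

    strand -- '+' if the gene is on the forward strand, '-' if on the reverse
    """
    if not tss_positions:
        raise ValueError("at least one TSS is required")
    low, high = min(tss_positions), max(tss_positions)
    if strand == "+":
        return low, high
    if strand == "-":
        return high, low
    raise ValueError("strand must be '+' or '-'")


# Hypothetical TSS positions observed for one gene in different tissues
observed = [120, 95, 150]
print(tss_region(observed, "+"))
print(tss_region(observed, "-"))
```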

In searching for further information I found a paper that describes the 5' UTR on the mRNA, which is not necessarily an exact copy of the genomic DNA (e.g. by addition of the cap). Here is the corresponding paragraph and diagram, which explains it better than I can paraphrase (highlighting by me):

Transcriptional control is mediated by transcription factors, RNA polymerase and a series of cis-acting elements located in the DNA, such as promoters, enhancers, silencers and locus-control elements, organized in a modular structure and regulates the production of pre-mRNA molecules, which undergo several steps of processing before they become functional mRNAs. Introns are removed, a 7-methyl-guanylate (m7G) cap structure is added at the 5' end of the first exon, and a stretch of 100-250 adenine residues (the poly(A) tail) is added at the 3' end of the last exon, which is itself generated by endonucleolytic cleavage of the primary transcript. Sometimes the sequence of the mRNA is also altered in a process called mRNA editing, and the resulting coding sequence of the mature RNA differs from the corresponding sequence in the genome. The resultant mature mRNA, in eukaryotes, has a tripartite structure consisting of a 5' untranslated region (5' UTR), a coding region made up of triplet codons that each encode an amino acid and a 3' untranslated region (3' UTR). Figure 1 shows these and other features of mRNAs.

Figure 1: The generic structure of a eukaryotic mRNA, illustrating some post-transcriptional regulatory elements that affect gene expression. Abbreviations (from 5' to 3'): UTR, untranslated region; m7G, 7-methyl-guanosine cap; hairpin, hairpin-like secondary structures; uORF, upstream open reading frame; IRES, internal ribosome entry site; CPE, cytoplasmic polyadenylation element; AAUAAA, polyadenylation signal.

Reference: Mignone, F., Gissi, C., Liuni, S., & Pesole, G. (2002). Untranslated regions of mRNAs. Genome Biology, 3(3).


To conclude my answer (apologies for its lack of brevity) there are different types of promoter, not all of which directly include a TSS, but the majority of promoters are called "Major promoters" and do include the TSS (in addition to the RNA polymerase binding site, and other transcription factor binding sites).

  • Eukaryotic Promoter Database
  • Dreos R, et al. EPD and EPDnew, high-quality promoter resources in the next-generation sequencing era Nucleic Acids Res. 2013 Jan;41.

The 5' UTR is the region of the transcript upstream of the starting methionine codon. The promoter is not itself transcribed.

Eukaryotic Transcription

Prokaryotes and eukaryotes perform fundamentally the same process of transcription, with a few key differences. The most important difference between prokaryotes and eukaryotes is the latter’s membrane-bound nucleus and organelles. With the genes enclosed in a nucleus, the eukaryotic cell must be able to transport its mRNA to the cytoplasm and must protect its mRNA from degrading before it is translated. Eukaryotes also employ three different polymerases that each transcribe a different subset of genes.

Gene Regulation

Gene Regulation documents the proceedings of the CETUS-UCLA Symposium "Gene Regulation," held in Keystone, Colorado in March/April 1982. The symposium related gene structure and regulatory sequences to overall genomic organization and genetic evolution. It was the first meeting to focus on regulation of eukaryotic gene expression since the maturation of recombinant DNA technology. The book is organized into four parts. Part I presents studies on the structure of eukaryotic genes, including the organization and molecular basis for differential expression of the mouse λ light chain genes; globin gene transcription and RNA processing; and the cloning of the human chromosomal α1-antitrypsin gene and its structural comparison with the chicken gene coding for ovalbumin. Part II on chromatin structure includes papers on nuclease sensitivity of the ovalbumin gene and its flanking DNA sequences and the relationship of chromatin structure to DNA sequence. Part III on gene expression includes papers on the role of poly(A) in eukaryotic mRNA metabolism and the in vitro transcription of Drosophila tRNA genes. Part IV on cellular biology includes studies such as the importance of calmodulin to eukaryotic cells.

What is Eukaryotic Gene Structure

Eukaryotic gene structure is the organization of eukaryotic genes in the genome. Its most significant feature is the presence of introns within the open reading frame, which break it into pieces called exons. Only exons remain in the mature mRNA, because introns are removed during post-transcriptional modification in a process called RNA splicing. Generally, intron sequences are longer than exon sequences. Once the introns are spliced out, the eukaryotic mRNA contains a single, continuous protein-coding region. Apart from that, a 5’ cap and a 3’ poly A tail are added to the eukaryotic mRNA to increase its stability.

Figure 2: Eukaryotic Gene Structure

However, unlike prokaryotic genes, eukaryotic genes generally do not form clusters regulated by shared regulatory elements. Instead, each eukaryotic gene has its own promoter and other regulatory elements. Therefore, a eukaryotic mRNA typically carries a single protein-coding open reading frame.

Single-molecule approaches to RNA modifications

Single-molecule sequencing has the potential to provide base-level resolution of m6A sites, without the need for motif-based inference. The most commonly found platform for this method of sequencing currently on the market is the single-molecule, real-time (SMRT) technology (Pacific Biosciences). SMRT sequencing uses thousands of zero-mode waveguides (ZMWs) to capture an enzyme in real time, traditionally a DNA polymerase, as it incorporates fluorescent nucleotides into a polymer [45]. This method of molecular monitoring has the advantage of detecting both genetic and epigenetic information simultaneously, since the patterns of base incorporation by the polymerase are contingent upon the steric and sequence contexts of the bases present in the template [46]. Specifically, if a modified base is present on the template, the biophysical dynamics of DNA polymerase movement and base incorporation are affected, creating a unique kinetic signature before, during and after base incorporation, and thus enabling identification of specific DNA modifications [47].

Here, we report a novel application of this technology, which can be used to detect modified bases within RNA, including m6A sites. To characterize m6A sites in RNA at single-nucleotide resolution, we used a reverse transcriptase as the enzyme within a ZMW, instead of a DNA polymerase, and this substitution allowed the direct observation of cDNA synthesis in real time. While base incorporations during reverse transcription typically occur at standard speeds, the incorporation of synthetically designed m6A sites showed that there is a significant increase in the inter-pulse duration (IPD) when a methylated adenosine is present in the RNA template, relative to the IPD for a standard adenosine (Figure 4). To our knowledge, this represents the first demonstration of a reverse transcriptase-based kinetic signature that can directly detect modified RNA. However, current single-molecule technology is not without its own challenges. First and foremost, reverse transcriptases stutter when incorporating bases, complicating the accurate reading of homonucleotide stretches and the base resolution of m6A therein. Second, the current throughput is too low for transcriptome-wide approaches. Notwithstanding these caveats, the SMRT technology has the clear potential to detect an underlying epitranscriptomic change in a native RNA template.

Single-molecule sequencing of RNA to detect epitranscriptomic changes. SMRT sequencing with the Pacific Biosciences RS shows longer times (inter-pulse distances) to incorporate m6A versus standard adenosines. (a) Experimental design for using a DNA primer in a reverse transcription reaction. Sequencing of the unmodified template shows, in a single-molecule sequencing trace, base incorporation via a reverse transcriptase-mediated cDNA synthesis reaction. (b) Shows sequencing as with (a), but using an RNA template with m6A instead of normal adenosines. Incorporation of thymines (T) shows significant delay (longer inter-pulse distances). A.U. stands for normalized arbitrary units in fluorescence measurement. (c) Exponential fit of experimentally observed inter-pulse distances (IPDs). (d) Shows the difference between the average IPDs for native As and m6As. The average IPD in each case is the reciprocal of the exponential decay rate. The error bars indicate the range around each average IPD that includes 83% of the observed IPDs (that is, ±½ of standard deviation of the exponential fit). We used an Ansari-Bradley test in Matlab to confirm that the distribution functions were different (P = 0.0043).
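The relationship between the exponential fit and the average IPD can be illustrated with a short Python sketch. For exponentially distributed waiting times the maximum-likelihood estimate of the decay rate is simply one over the sample mean, so a slower incorporation (longer average IPD) appears as a smaller rate. The data here are simulated with arbitrary rates chosen for the example; they are not the measurements from the study, and the function name is an assumption.

```python
import random
import statistics


def fit_exponential_rate(ipds: list[float]) -> float:
    """Maximum-likelihood decay rate of an exponential distribution: 1 / mean."""
    return 1.0 / statistics.fmean(ipds)


random.seed(0)  # fixed seed so the simulation is reproducible
# Simulated inter-pulse durations; the rates 2.0 and 0.5 are arbitrary assumptions.
unmodified = [random.expovariate(2.0) for _ in range(5000)]
modified = [random.expovariate(0.5) for _ in range(5000)]

for label, ipds in (("A", unmodified), ("m6A", modified)):
    rate = fit_exponential_rate(ipds)
    print(label, "rate:", round(rate, 3), "mean IPD:", round(1.0 / rate, 3))
```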

Similarly, Oxford Nanopore Technologies (ONT) and other companies are developing nanopore-based sequencing technologies, which use nanopore-forming proteins to sequence DNA by attaching an application-specific integrated circuit to the membrane upon which the nanopore rests. In principle, observations of any modified DNA or RNA base could be made during transit of the molecule through the nanopore, and some observations have already been made with nanopores that allow detection of 5hmC [48]. While all of these technologies are still under development, we note that all direct-observation methods, in principle, have the potential to detect m6A and other epitranscriptomic modifications.

Summary – CDS vs cDNA

Coding sequence (CDS) and cDNA are two types of nucleotide sequences. A coding sequence lies within a gene, while cDNA is synthesized artificially from an mRNA template. A coding sequence consists of exon sequence bounded by a start codon and a stop codon, while a cDNA corresponds to the mRNA sequence, including the two UTRs. Neither CDS nor cDNA contains introns. Unlike a CDS, cDNA is produced in a process that requires the enzyme reverse transcriptase. Also, cDNAs can be assembled into cDNA libraries. Thus, this summarizes the difference between CDS and cDNA.



Campbell Biology Chapter 17 (powell_h)

1) Which of the following variations on translation would be most disadvantageous for a cell?
A) translating polypeptides directly from DNA
B) using fewer kinds of tRNA
C) having only one stop codon
D) lengthening the half-life of mRNA
E) having a second codon (besides AUG) as a start codon

2) Garrod hypothesized that "inborn errors of metabolism" such as alkaptonuria occur because
A) metabolic enzymes require vitamin cofactors, and affected individuals have significant nutritional deficiencies.
B) enzymes are made of DNA, and affected individuals lack DNA polymerase.
C) many metabolic enzymes use DNA as a cofactor, and affected individuals have mutations that prevent their enzymes from interacting efficiently with DNA.
D) certain metabolic reactions are carried out by ribozymes, and affected individuals lack key splicing factors.
E) genes dictate the production of specific enzymes, and affected individuals have genetic defects that cause them to lack certain enzymes.

3) Garrod's information about the enzyme alteration resulting in alkaptonuria led to further elucidation of the same pathway in humans. Phenylketonuria (PKU) occurs when another enzyme in the pathway is altered or missing, resulting in a failure of phenylalanine (phe) to be metabolized to another amino acid: tyrosine. Tyrosine is an earlier substrate in the pathway altered in alkaptonuria. How might PKU affect the presence or absence of alkaptonuria?
A) It would have no effect, because PKU occurs several steps away in the pathway.
B) It would have no effect, because tyrosine is also available from the diet.
C) Anyone with PKU must also have alkaptonuria.
D) Anyone with PKU is born with a predisposition to later alkaptonuria.
E) Anyone with PKU has mild symptoms of alkaptonuria.

4) The nitrogenous base adenine is found in all members of which group?
A) proteins, triglycerides, and testosterone
B) proteins, ATP, and DNA
C) ATP, RNA, and DNA
D) α glucose, ATP, and DNA
E) proteins, carbohydrates, and ATP

5) A particular triplet of bases in the template strand of DNA is 5' AGT 3'. The corresponding codon for the mRNA transcribed is
A) 3' UCA 5'.
B) 3' UGA 5'.
C) 5' TCA 3'.
D) 3' ACU 5'.
E) either UCA or TCA, depending on wobble in the first base.

6) The genetic code is essentially the same for all organisms. From this, one can logically assume which of the following?
A) A gene from an organism can theoretically be expressed by any other organism.
B) All organisms have experienced convergent evolution.
C) DNA was the first genetic material.
D) The same codons in different organisms translate into the different amino acids.
E) Different organisms have different numbers of different types of amino acids.

7) The "universal" genetic code is now known to have exceptions. Evidence for this can be found if which of the following is true?
A) If UGA, usually a stop codon, is found to code for an amino acid such as tryptophan (usually coded for by UGG only).
B) If one stop codon, such as UGA, is found to have a different effect on translation than another stop codon, such as UAA.
C) If prokaryotic organisms are able to translate a eukaryotic mRNA and produce the same polypeptide.
D) If several codons are found to translate to the same amino acid, such as serine.
E) If a single mRNA molecule is found to translate to more than one polypeptide when there are two or more AUG sites.

8) Which of the following nucleotide triplets best represents a codon?
A) a triplet separated spatially from other triplets
B) a triplet that has no corresponding amino acid
C) a triplet at the opposite end of tRNA from the attachment site of the amino acid
D) a triplet in the same reading frame as an upstream AUG
E) a sequence in tRNA at the 3' end

9) Which of the following provides some evidence that RNA probably evolved before DNA?
A) RNA polymerase uses DNA as a template.
B) RNA polymerase makes a single-stranded molecule.
C) RNA polymerase does not require localized unwinding of the DNA.
D) DNA polymerase uses primer, usually made of RNA.
E) DNA polymerase has proofreading function.

10) Which of the following statements best describes the termination of transcription in prokaryotes?
A) RNA polymerase transcribes through the polyadenylation signal, causing proteins to associate with the transcript and cut it free from the polymerase.
B) RNA polymerase transcribes through the terminator sequence, causing the polymerase to separate from the DNA and release the transcript.
C) RNA polymerase transcribes through an intron, and the snRNPs cause the polymerase to let go of the transcript.
D) Once transcription has initiated, RNA polymerase transcribes until it reaches the end of the chromosome.
E) RNA polymerase transcribes through a stop codon, causing the polymerase to stop advancing through the gene and release the mRNA.

11) Which of the following does not occur in prokaryotic gene expression, but does in eukaryotic gene expression?
A) mRNA, tRNA, and rRNA are transcribed.
B) RNA polymerase binds to the promoter.
C) A poly-A tail is added to the 3' end of an mRNA and a cap is added to the 5' end.
D) Transcription can begin as soon as translation has begun even a little.
E) RNA polymerase requires a primer to elongate the molecule.

12) RNA polymerase in a prokaryote is composed of several subunits. Most of these subunits are the same for the transcription of any gene, but one, known as sigma, varies considerably. Which of the following is the most probable advantage for the organism of such sigma switching?
A) It might allow the transcription process to vary from one cell to another.
B) It might allow the polymerase to recognize different promoters under certain environmental conditions.
C) It could allow the polymerase to react differently to each stop codon.
D) It could allow ribosomal subunits to assemble at faster rates.
E) It could alter the rate of translation and of exon splicing.

13) Which of the following is a function of a poly-A signal sequence?
A) It adds the poly-A tail to the 3' end of the mRNA.
B) It codes for a sequence in eukaryotic transcripts that signals enzymatic cleavage ~10-35 nucleotides away.
C) It allows the 3' end of the mRNA to attach to the ribosome.
D) It is a sequence that codes for the hydrolysis of the RNA polymerase.
E) It adds a 7-methylguanosine cap to the 3' end of the mRNA.

14) In eukaryotes there are several different types of RNA polymerase. Which type is involved in transcription of mRNA for a globin protein?
A) ligase
B) RNA polymerase I
C) RNA polymerase II
D) RNA polymerase III
E) primase

15) Transcription in eukaryotes requires which of the following in addition to RNA polymerase?
A) the protein product of the promoter
B) start and stop codons
C) ribosomes and tRNA
D) several transcription factors (TFs)
E) aminoacyl synthetase

16) A part of the promoter, called the TATA box, is said to be highly conserved in evolution. Which of the following might this illustrate?
A) The sequence evolves very rapidly.
B) The sequence does not mutate.
C) Any mutation in the sequence is selected against.
D) The sequence is found in many but not all promoters.
E) The sequence is transcribed at the start of every gene.

17) The TATA sequence is found only several nucleotides away from the start site of transcription. This most probably relates to which of the following?
A) the number of hydrogen bonds between A and T in DNA
B) the triplet nature of the codon
C) the ability of this sequence to bind to the start site
D) the supercoiling of the DNA near the start site
E) the 3-D shape of a DNA molecule

18) What is a ribozyme?
A) an enzyme that uses RNA as a substrate
B) an RNA with enzymatic activity
C) an enzyme that catalyzes the association between the large and small ribosomal subunits
D) an enzyme that synthesizes RNA as part of the transcription process
E) an enzyme that synthesizes RNA primers during DNA replication

19) A transcription unit that is 8,000 nucleotides long may use 1,200 nucleotides to make a protein consisting of approximately 400 amino acids. This is best explained by the fact that
A) many noncoding stretches of nucleotides are present in mRNA.
B) there is redundancy and ambiguity in the genetic code.
C) many nucleotides are needed to code for each amino acid.
D) nucleotides break off and are lost during the transcription process.
E) there are termination exons near the beginning of mRNA.
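Question 5 above turns on the antiparallel pairing between the DNA template strand and the mRNA. A minimal Python sketch of that relationship follows; the function name is an assumption, and the input is the triplet given in the question.

```python
# DNA template base -> RNA base that pairs with it
COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe_template(template_5_to_3: str) -> str:
    """Return the mRNA (written 5'->3') copied from a DNA template strand
    that is itself written 5'->3'. The strands are antiparallel, so the
    template is read in reverse while each base is complemented."""
    return "".join(COMPLEMENT[base] for base in reversed(template_5_to_3.upper()))


codon = transcribe_template("AGT")
print("5'", codon, "3'")        # conventional orientation
print("3'", codon[::-1], "5'")  # the same codon written 3'->5'
```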

Before It Gets Started: Regulating Translation at the 5′ UTR

Translation regulation plays important roles in both normal physiological conditions and disease states. This regulation requires cis-regulatory elements located mostly in 5′ and 3′ UTRs and trans-regulatory factors (e.g., RNA binding proteins (RBPs)) which recognize specific RNA features and interact with the translation machinery to modulate its activity. In this paper, we discuss important aspects of 5′ UTR-mediated regulation by providing an overview of the characteristics and the function of the main elements present in this region, like uORFs (upstream open reading frames), secondary structures, and RBP binding motifs, as well as the different mechanisms of translation regulation and the impact they have on gene expression and human health when deregulated.

1. Translation Regulation

Gene expression can be modulated at multiple levels from chromatin modification to mRNA translation. Despite the importance of transcriptional regulation, it is clear at this point that mRNA levels cannot be used as a sole parameter to justify the protein content of a cell. In fact, in a recent study from our lab, we determined that a direct correlation between mRNA and protein exists for less than a third of analyzed genes in a human cell line. Moreover, our analysis suggested that translation regulation contributes considerably to the protein variation, as several parameters related to translation, like 5′ UTR, 3′ UTR, and coding sequence length, presence of uORFs, and amino acid composition, showed good correlations with the obtained mRNA/protein ratios [1]. Translation regulation functions as an important switch when rapid changes in gene expression are required in response to internal and external stimuli (PDGF2, VEGF, and TGFβ are examples of genes controlled in such a way). Translation regulation also plays a significant role during development and cell differentiation by altering the levels of expression of specific mRNA subsets during a particular time window while the majority of transcripts remain unchanged (reviewed in [2–4]).

In this paper, we will focus on the importance of 5′ UTR mediated regulation and the different functional elements present in this region with the exception of IRES which is discussed in a different article of this issue. The main regulatory elements in 5′ UTR are secondary structures (including IRES), binding sites for RNA binding proteins, uAUGs and uORFs (Figure 1).

2. 5′ UTR

The average length of 5′ UTRs is ~220 nucleotides across species [5]. In vertebrates, 5′ UTRs tend to be longer in transcripts encoding transcription factors, protooncogenes, growth factors and their receptors, and proteins that are poorly translated under normal conditions [6]. High GC content is also a conserved feature, with values surpassing 60% in the case of warm-blooded vertebrates. In the context of hairpin structures, GC content can affect protein translation efficiency independent of hairpin thermal stability and hairpin position [7]. UTRs of eukaryotic mRNAs also display a variety of repeats that include short and long interspersed elements (SINEs and LINEs, resp.), simple sequence repeats (SSRs), minisatellites, and macrosatellites [5].

Translation initiation in eukaryotes requires the recruitment of ribosomal subunits at either the 5′ m7G cap structure or an internal ribosome entry site (IRES). The initiation codon is generally located far downstream of the cap, requiring ribosomal movement to this site. This movement appears to be nonlinear for some mRNAs (i.e., ribosomal subunits appear to bypass (shunt) segments of the 5′ UTR as they move in the direction of the AUG). Shunting could allow mRNAs containing uAUGs or hairpin structures to be translated efficiently. Important examples are provided by the cauliflower mosaic virus [8] and adenovirus [9] mRNAs. The mechanism of ribosomal shunting is rather complex, requiring mRNA-rRNA base pairing [10].

Genes presenting differences in the 5′ UTR of their transcripts are relatively common. 10–18% of genes express alternative 5′ UTRs by using multiple promoters [11, 12] while alternative splicing within UTRs is estimated to affect 13% of genes in the mammalian transcriptome [13]. These variations in 5′ UTR can function as important switches to regulate gene expression. Two important examples are provided by the cancer-related genes BRCA1 (breast cancer 1) and TGF-β (transforming growth factor β). BRCA1 is a tumor suppressor, frequently mutated in breast cancer, with functions in cell cycle, apoptosis, and DNA damage repair. BRCA1 produces two different transcripts that derive from two different promoters and therefore display differences in their 5′ UTR. A shorter transcript is expressed in cancerous as well as noncancerous breast tissue and efficiently translated, while a longer transcript is predominantly expressed in breast cancers. The presence of several uAUGs and a more complex structure dramatically affect the translation of this longer transcript. This causes an overall decrease in BRCA1 levels in tumor cells, leading to a relief in growth inhibition [14]. TGF-β is implicated in a large number of processes that include cell proliferation, migration, wound repair, development, tumorigenesis and immunosuppression. There are three known isoforms: β1, β2, and β3. TGF-β3 produces two alternative transcripts: a 3.5 kb transcript with a very long 5′ UTR (1.1 kb) and a 2.6 kb transcript with a shorter 5′ UTR (0.23 kb). The presence of 11 uORFs in the longer transcript dramatically inhibits its translation while the shorter transcript is efficiently translated [15, 16].
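Counting uAUGs and uORFs of the kind mentioned above is straightforward to sketch in Python. In this example a uORF is taken to be any AUG lying wholly within the 5′ UTR that is followed by an in-frame stop codon somewhere in the mRNA; that working definition, the function names and the toy sequence are assumptions for illustration, and published analyses may use stricter criteria.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def upstream_augs(utr: str) -> list[int]:
    """0-based positions of every AUG lying wholly within the 5' UTR."""
    return [i for i in range(len(utr) - 2) if utr[i:i + 3] == "AUG"]


def upstream_orfs(mrna: str, main_start: int) -> list[tuple[int, int]]:
    """(start, end) of each uORF: a uAUG plus its first in-frame stop codon.

    'end' is exclusive and may lie beyond main_start (an overlapping uORF).
    """
    orfs = []
    for i in upstream_augs(mrna[:main_start]):
        for j in range(i + 3, len(mrna) - 2, 3):
            if mrna[j:j + 3] in STOP_CODONS:
                orfs.append((i, j + 3))
                break
    return orfs


# Hypothetical mRNA: a 14-nt 5' UTR containing one short uORF (AUG CCC UGA),
# followed by the main ORF (AUG GCC UAA).
toy = "GGAUGCCCUGAGCC" + "AUGGCCUAA"
print(upstream_augs(toy[:14]))
print(upstream_orfs(toy, 14))
```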

3. Regulation by Secondary Structure

Secondary structures can function as major regulatory tools in 5′ UTRs. A correlation with gene function has been suggested: secondary structures have been determined to be particularly prevalent among mRNAs encoding transcription factors, protooncogenes, growth factors and their receptors, and proteins poorly translated under normal conditions. More than 90% of transcripts in these classes have 5′ UTRs containing stable secondary structures with average free energies less than −50 kcal/mol. Of these stable secondary structures, 60% are positioned very close to the cap structure [6]. These structures are very effective in inhibiting translation. In fact, a hairpin situated close to the cap with a free energy of −30 kcal/mol would be sufficient to block the access of the preinitiation complex to the mRNA. When located further away in the 5′ UTR, hairpins require a free energy more negative than −50 kcal/mol to be able to block translation [17, 18]. Stable secondary structure can resist the unwinding activity of the helicase eIF4A. This effect can be overcome partially by the overexpression of eIF4A in partnership with eIF4B [19]. mRNAs with a highly structured 5′ UTR, such as those encoding proto-oncogenes and growth factors, use cap-dependent translation initiation. Not surprisingly, the overexpression of components of the translation initiation machinery, including eIF4E, has been linked to tumorigenesis (reviewed in [18, 20]).
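The two free-energy thresholds quoted above can be written as a simple rule of thumb in Python. This is only a restatement of the numbers in the text: the function name is an assumption, 'close to the cap' is passed in as a yes/no flag because no distance cut-off is given here, and the free energy itself would have to come from a separate structure-prediction tool.

```python
def hairpin_likely_blocks_translation(delta_g_kcal_mol: float, near_cap: bool) -> bool:
    """Rule of thumb from the thresholds quoted in the text.

    near_cap=True  -> a hairpin of -30 kcal/mol or more stable is taken as blocking
    near_cap=False -> the hairpin must be more stable than -50 kcal/mol
    """
    if near_cap:
        return delta_g_kcal_mol <= -30.0
    return delta_g_kcal_mol < -50.0


# The same hypothetical -40 kcal/mol hairpin in two positions
print(hairpin_likely_blocks_translation(-40.0, near_cap=True))
print(hairpin_likely_blocks_translation(-40.0, near_cap=False))
```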

The gene TGF-β1 provides a good example of translation inhibition mediated by secondary structure [21, 22]. An evolutionarily conserved motif in the 5′ UTR forms a stable stem loop. However, this structure by itself is not sufficient to block translation. Translation repression of TGF-β1 depends on increased binding of the RNA binding protein YB-1 to the TGF-β1 transcript [23]. It was then proposed that YB-1 binds the 5′ UTR of TGF-β1 with high affinity thanks to its GC content and cooperates with the stem loop to inhibit TGF-β1 translation by facilitating duplex formation [24].

4. Regulation by RNA Binding Proteins

The human genome is predicted to encode circa 1,000 RNA binding proteins (RBPs), with a large percentage of them implicated in translation. They can be categorized into two main groups: RBPs that are part of the basic translation machinery and required for the translation of all expressed mRNAs (examples: PABPI, eIF4E) and RBPs that function in a more selective way by controlling, either positively or negatively, the levels of translation of specific target mRNAs (examples: HuR, Musashi1). Regarding this latter group, it has been observed that RBPs can use distinct mechanisms to increase or inhibit translation. Although several exceptions are known, it can be said that RBPs often recognize specific motifs in UTRs and interact with the translation machinery to control expression. Interference with translation normally takes place during the initiation step (reviewed in [25]).

The best characterized example of RBP-mediated regulation involving 5′ UTRs is provided by the iron regulatory proteins (IRP 1 and 2). These proteins recognize a highly conserved stem loop structure of circa 30 nucleotides, known as the iron response element (IRE). The most important features include a hexanucleotide loop with the sequence CAGYCX (Y = U or A; X = U, C, or A) and a 5 bp upper stem that is separated from a lower stem of variable length by an unpaired cytosine. This regulation is crucial in maintaining cellular iron homeostasis, as a large number of mRNAs connected to iron storage and metabolism, including ferritin, mitochondrial aconitase, succinate dehydrogenase-iron protein, erythroid 5-aminolevulinate synthetase (eALAS), and an iron-exportin molecule named ferroportin (FPN1), have their expression modulated by this system. When cellular iron levels are low, IRP1 and IRP2 bind the IRE and block translation of the downstream ORF. When intracellular iron levels are high, the RNA binding activity of both IRPs is reduced (Figure 2(a)). IREs tend to be positioned close to the cap, which causes a steric inhibition of the binding of 40S ribosomal subunits to the transcript. When located distant from the cap, rather than affecting 40S recruitment, the IRE-IRP complex blocks ribosomal scanning (reviewed in [26]). An interesting bypass of the IRE/IRP mechanism can be observed in iron-starved duodenal and erythroid precursor cells. An upstream promoter is used to generate FPN1 pre-mRNAs containing one more exon, which is connected by alternative splicing to a splice acceptor 3′ of the IRE. A mature FPN1 transcript containing the same open reading frame is generated; however, the 5′ UTR does not contain the IRE [27]. Therefore, these cells express the alternative FPN1 isoform in an iron-independent manner [27, 28]. Mutations affecting IREs can lead to diseases. This is the case in hereditary hyperferritinemia-cataract syndrome (HHCS), a genetic autosomal dominant disorder in which aggregation and crystallization of ferritin in the lens leads to bilateral cataracts [29].

Figure 2: Translational regulation by RNA binding proteins. (a) In iron-deficient cells, IRPs bind to the IRE localized in the 5′ UTR of ferritin mRNA, blocking its translation. Once cellular iron levels increase, a complex containing Fe binds to IRPs. Thus, these proteins are allosterically modified, which reduces IRP-IRE binding and allows the translation of ferritin mRNAs. (b) msl-2 gene regulation in female flies. After transcription in the nucleus, SXL specifically binds to intronic U-rich regions of msl-2 pre-mRNA and inhibits intron removal (1). In the cytoplasm, SXL binds to the same elements, now localized in the 5′ UTR of mature msl-2 mRNA, enhances translation initiation at an upstream ORF (2), and prevents translation of the main ORF (3). The regulatory elements in the 3′ UTR of msl-2 mRNA are not represented.

RBP-mediated regulation can be very elaborate and involve multiple steps. One good example showing the crosstalk between factors and distinct regulatory processes is the male-specific-lethal 2 (msl-2) gene in Drosophila, a main player in dosage compensation. The female-specific RNA binding protein sex lethal (SXL) participates in multiple aspects of msl-2 regulation in females, where msl-2 expression must be prevented (Figure 2(b)). Regulation starts at the splicing level: SXL binds to two polyU stretches located in an intron that is part of the 5′ UTR. This process causes intron retention and preserves critical sequences that later will be used in translation regulation [30, 31]. In the cytoplasm, the same SXL protein functions as a translation repressor of msl-2 through two distinct mechanisms taking place at the 3′ and 5′ UTRs [32]. SXL binds U-rich sequences in the 3′ UTR and recruits the corepressor protein UNR (upstream of N-ras) and PABP, blocking the recruitment of the pre-initiation complex to the 5′ end of the mRNA [33–35]. To ensure that msl-2 is fully repressed, a second regulatory step, also mediated by SXL, takes place at the 5′ UTR. This repression involves a novel regulatory mechanism in which crosstalk between SXL and a uORF efficiently represses translation [36]. The 5′ UTR of msl-2 contains 3 uORFs, but only the third one is involved in the repression. Interestingly, this repression is very weak in the absence of SXL (2-fold), but when present, SXL binds a polyU stretch a few nucleotides away from the uAUG and increases this repression to more than 14-fold. SXL acts by boosting translation initiation at the uAUG and not by acting as a simple steric block to scanning ribosomes. This effect may take place via an interaction between SXL and translation initiation factors, possibly components of eIF3, as indicated by a two-hybrid screen. This mechanism potentially affects a large number of mRNAs: 268 transcripts in Drosophila were determined to contain SXL binding motifs associated with a uAUG spaced at an appropriate distance. For instance, a reporter construct containing the 5′UTR of the gene Irr47 was repressed

RBPs can have antagonistic functions when regulating translation. An interesting example is the regulation of p21 in the context of replicative senescence, a cellular state where cells enter an irreversible growth arrest. Induction of p21 is required to initiate the process and to inhibit cdk2-cyclin E complexes. The 5′ UTR of p21 contains a GC-rich sequence that forms a stem loop. This element is recognized by two RBPs with distinct properties: CUGBP1 and calreticulin (CRT). Competition between the two proteins determines the final levels of p21 expression and establishes whether cells will proliferate or undergo growth arrest and senescence. Binding of CUGBP1 to p21 mRNA is dramatically increased in senescent cells compared to young fibroblast cells. Protein levels do not change during the process, and this increase in activity is due to phosphorylation. On the other hand, CRT IPs showed a four-to-fivefold reduction of activity in senescent cells due to a decrease in expression. Both proteins were shown to affect p21 translation. However, while CUGBP1 functions as an activator, CRT acts as a repressor. Since the two proteins have opposing activity in senescent cells, they were examined to see if they compete for interaction with p21 mRNA and for control of its translation. Increasing amounts of one protein were able to reverse the binding of the other protein to p21 mRNA and its effect on translation. Their affinities for the binding site are rather different, however, as CUGBP1 had to be present in the binding reactions at a four-to-eightfold molar excess over CRT to antagonize its binding to p21 mRNA and impact its translation [37].

5. Regulation by uORFs and Upstream AUGs

uORFs and uAUGs are major regulatory elements in 5′ UTRs. As their names suggest, uORFs are sequences defined by a start and a stop codon upstream of the main coding region, while uAUGs are start codons, located upstream of the main coding region, without an in-frame downstream stop codon. A large percentage of the human transcriptome contains uORFs and/or uAUGs, with values ranging between 44 and 49% [38, 39]. Similar numbers are found in the mouse transcriptome. Although these numbers might sound high, both uORFs and uAUGs are less frequent than expected by chance, suggesting that they are under selective pressure. uORFs and uAUGs are overrepresented in particular subgroups like transcription factors, growth factors and their receptors, and proto-oncogenes [6]. Both uORFs and uAUGs are extremely diverse, varying in position in relation to the cap and main AUG, in number per transcript, and in length (in the case of uORFs) [38]. Supplementary Table 1 (in the Supplementary Material available online) provides a comprehensive list of uORFs and uAUGs present in the human transcriptome. uORFs and uAUGs have not been extensively analyzed in terms of conservation. A pilot study done with a subset of human, mouse, and rat transcripts indicated that both elements are moderately conserved, as 38% of uORFs and 24% of uAUGs were determined to be conserved among the three species [39]. The modest conservation of uORFs, combined with the facts that their average length (20 nucleotides) is what would be expected by chance and that uAUGs provide stronger suppression than uORFs, suggests that many uAUGs have been neutralized in the process of evolution by the acquisition of a downstream stop codon. It has been proposed then that only a few uORFs, very likely the conserved ones, have been recruited for expression regulation [39]. In yeast, it has been shown that uORFs are statistically underrepresented in 5′ UTRs and were removed by selective pressure, indicating similarly that the remaining uORFs may be implicated in translation regulation [40].
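The distinction between a uORF and a uAUG drawn above can be made concrete with a short scan of a 5′ UTR sequence. This is a minimal sketch under stated assumptions: the function name and the toy sequence are hypothetical, the input is taken to be the 5′ UTR only (ending just before the main start codon), and an AUG is called a uORF only if an in-frame stop codon occurs within that UTR.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def classify_upstream_starts(utr5):
    """Classify every AUG in a 5' UTR as a uORF start or a bare uAUG.

    Returns a list of (kind, start_index, stop_index_or_None), 0-based.
    A uORF has an in-frame stop codon inside the UTR; a uAUG does not.
    """
    rna = utr5.upper().replace("T", "U")  # accept DNA or RNA input
    results = []
    start = rna.find("AUG")
    while start != -1:
        stop = None
        # Walk codon by codon, in frame with this AUG, to the end of the UTR.
        for i in range(start + 3, len(rna) - 2, 3):
            if rna[i:i + 3] in STOP_CODONS:
                stop = i
                break
        kind = "uORF" if stop is not None else "uAUG"
        results.append((kind, start, stop))
        start = rna.find("AUG", start + 1)
    return results


# Toy sequence (not a real UTR): one AUG with an in-frame UAG, one without a stop.
print(classify_upstream_starts("GGAUGAAAUAGCCAUGCC"))
```

A real analysis would also need the position of the main AUG and its reading frame to decide whether a uAUG-initiated frame overlaps the main ORF, which this sketch does not attempt.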

Although it has been suggested that, overall, uORFs are negatively correlated with protein production [1, 38, 41], functional activity has so far been demonstrated for only a limited number of uORFs and uAUGs. In Figure 3, we show examples of the impact uAUGs can have on translation efficiency. Among the most relevant features that can contribute to functionality are long 5′ cap-to-uORF distance, sequence conservation, context in which the AUG is located, strength of the initiation site for the ORF, length of the uORF, and number of AUGs in the 5′ UTR [38, 42]. Different outcomes have been observed when a ribosome encounters a uAUG or uORF [43]. Since the number of characterized events is still small, it is hard to define general mechanisms; we therefore describe a few well-characterized and relevant events. Leaky scanning occurs when a proportion of the scanning complexes bypass the uAUG or uORF and continue scanning for the next AUG. In this case, the upstream AUG acts as a “decoy” from the ORF AUG, functioning as a negative regulator of translation at least for some fraction of ribosomes. The production of cis-acting peptides by uORFs can reduce the initiation of translation of the downstream ORF by stalling the ribosome at the end of the uORF [44]. A classical example is provided by the evolutionarily conserved eukaryotic arginine attenuator peptide (AAP), which negatively controls the translation of proteins involved in de novo fungal arginine biosynthesis at high arginine concentrations [45]. In this scenario, arginine changes AAP conformation and/or the P site environment, causing ribosomal stalling at the termination codon of the AAP uORF [46, 47]. AAP also reduces translation elongation by ribosome stalling when the uORF is inserted within a coding sequence [48]. Another classical example of uORF-mediated regulation comes from yeast. Four uORFs are present in the 5′ UTR of the transcription factor GCN4. The first of the four uORFs is always efficiently translated regardless of the nutritional conditions. In unperturbed cells, rapid reloading of ribosomes and initiation cofactors allows translation of uORFs 2–4 while inhibiting the translation of the main ORF. In situations of amino acid starvation, initiation factors are scarce, resulting in a decelerated reloading of ribosomes and scanning across the sequences containing the uORFs. A functional initiation complex is reassembled only at the main coding sequence, and GCN4 is expressed. This mechanism allows a fast response to nutritional stress [49, 50]. Another similar example of regulated expression via a uORF is the Carnitine Palmitoyltransferase 1C (CPT1C) gene. CPT1C regulates metabolism in the brain in situations of energy surplus. The presence of a uORF in the 5′ UTR represses the expression of the ORF. However, this repression is relieved in response to specific stress stimuli like glucose deprivation and palmitate-BSA treatment [51]. It has been suggested that uORFs can also induce mRNA degradation. A series of 5′ UTR constructs containing as a reporter the cat gene from the bacterial transposon Tn9 was tested in yeast. A single nucleotide substitution was used to create a 7-codon ORF upstream of the cat gene. The uORF was translated efficiently and caused translation inhibition of the cat ORF and destabilization of the cat mRNA [52]. A connection between uORFs and mRNA decay was also suggested based on a comparison between average levels of expression of uORF-containing and non-uORF-containing transcripts [41].


3′-UTR of mRNA and associated diseases

The cis-regulatory elements in the 3′-UTRs of mRNAs which influence translation and cause diseases are briefly discussed here. miRNAs (microRNAs) also regulate the translation of many mRNAs and their dysregulation often leads to various diseases. However, a detailed discussion on miRNAs is beyond the scope of this review. The readers may refer to the following reviews for further details: Esquela-Kerscher and Slack (2006), De Silanes et al. (2007) and Soifer et al. (2007).

The 3′-UTR plays an important role in the translation, localization and stability of the mRNA. Research on the pathophysiology of diseases and mutations affecting the functionality of the 3′-UTR is still sparse. However, the available data do suggest that this mRNA region, which is often neglected during the genetic screening of disease-associated candidate genes, plays an important role in various diseases and disease progression (Conne et al., 2000). Mutations affecting the termination codon, polyadenylation signal and secondary structure of the 3′-UTR of an mRNA can cause translation de-regulation and disease. The available information does suggest an association between mutations in the 3′-UTR and diseases; however, this association needs to be confirmed independently. Here, we briefly describe some characteristic mutations at the 3′-UTR associated with various diseases.

Mutations that affect the termination codon and cause diseases

Alterations in the length of the 3′-UTR due to perturbations in the position of the physiological termination codon influence the translation of mRNA. PTCs (premature termination codons) increase the length of the 3′-UTR, whereas delayed stop codons reduce its length, in both cases changing the length of the translated polypeptide. An in-frame nonsense mutation leads to premature translation termination and formation of truncated polypeptides. Such variants are often involved in metabolic dysfunctions and pathogenesis of diseases.
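The effect of stop-codon position on 3′-UTR length can be illustrated with a few lines of code. This is a sketch only: the function name and the three toy sequences are invented for illustration and are not taken from any of the genes discussed here.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def three_prime_utr_length(mrna, cds_start):
    """Number of nucleotides downstream of the first in-frame stop codon.

    cds_start is the 0-based index of the A of the start codon.
    Returns None if no in-frame stop codon is found.
    """
    rna = mrna.upper().replace("T", "U")  # accept DNA or RNA input
    for i in range(cds_start, len(rna) - 2, 3):
        if rna[i:i + 3] in STOP_CODONS:
            return len(rna) - (i + 3)
    return None


wild_type = "AUGGCUCAAGGAUAACCCUGAGGG"  # AUG GCU CAA GGA UAA | CCCUGAGGG
nonsense = "AUGGCUUAAGGAUAACCCUGAGGG"   # CAA -> UAA creates a premature stop
readthrough = "AUGGCUCAAGGAUUACCCUGAGGG"  # UAA -> UUA delays the stop to UGA

for name, seq in [("wild type", wild_type), ("PTC", nonsense), ("delayed stop", readthrough)]:
    print(name, three_prime_utr_length(seq, 0))
```

In this toy case the premature stop lengthens the region downstream of the stop codon, and the mutated physiological stop shortens it because translation runs on to the next in-frame stop.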

EBS (epidermolysis bullosa simplex) is an autosomal dominant inherited blistering skin disease, which is commonly known as keratin disease. Keratin is the main structural protein in the epidermis, and point mutations in the keratin gene KRT14 or KRT5 cause this disease in most cases. Mutations in KRT5 or KRT14 that produce PTCs have also been reported in several cases (Muller et al., 1999; Gu et al., 2002). A novel type of EBS has been reported; this type of EBS is caused by a deletion mutation in the keratin gene, which leads to a frameshift and delayed stop codon. This disease is characterized by migratory circinate erythema and multiple vesicles on the circular belt-like areas affected by erythema. A mutation, 1649delG, produces mutant K5 (keratin 5) protein with a frameshift of its terminal 41 amino acids, and this protein is 35 amino acids longer than the wild-type K5 protein due to a delayed termination codon. This elongated mutant K5 protein might interfere with the functions of the wild-type K5 protein or its associated proteins (Gu et al., 2003).

Aniridia is a genetic disease characterized by the complete or partial loss of the iris and other anomalies. It is a developmental disorder characterized by malfunction of PAX6 (paired box 6), a gene involved in the development of eyes. A wide range of mutations are associated with this disease. Here, we discuss only the mutations that affect the termination codon. Nonsense mutations that give rise to PTCs have been reported in aniridia patients (Chao et al., 2000, 2003). Further, a mutation in the physiological stop codon (TAA→TTA), which leads to a run-on into the 3′-UTR, has also been reported to be associated with this disease. This mutation introduces a missense leucine and a short stretch of bulky lysine residues in the highly conserved C-terminal domain. This mutant protein probably interferes with normal PAX6 function (Chao et al., 2003). As discussed above, PTCs lead to several inherited diseases, including cystic fibrosis and DMD. Drugs such as PTC124 may be helpful in treating diseases caused by PTCs.

Mutations that affect polyadenylation signal and cause diseases

The polyadenylation signal is a sequence motif (AAUAAA) recognized by RNA-binding factors. This motif is highly conserved and is essential for transcriptional termination and efficient polyadenylation of mRNAs. Human diseases resulting from mutations in polyadenylation signals are rare. Here, we describe two diseases associated with mutations in the polyadenylation signal.

The abolition of the canonical polyadenylation signal, which causes a read-through and reduced accumulation of α-haemoglobin, is related to HbH (haemoglobin H) disease. The clinical phenotype of HbH disease is heterogeneous, with mild-to-moderate microcytic, hypochromic anaemia. Some mutations in the polyadenylation signal, which are associated with this disease, are as follows: AATAAA→AATAAG (Higgs et al., 1983), AATAAA→AATGAA (Losekoot et al., 1991) and AUAAA→AAUA (Harteveld et al., 1994). Combination mutations (mutations affecting the polyadenylation signal together with wide deletion mutations) are also associated with this disease (Prior et al., 2007).

The FOXP3 (forkhead box P3) gene encodes a transcription factor that is characterized by the forkhead domain. FOXP3 appears to function as the master regulator in the development and function of regulatory T-cells (Zhang and Zhou, 2007). Mutations in the FOXP3 gene were reported in patients with a fatal disease called IPEX (immune dysfunction, polyendocrinopathy and enteropathy, X-linked), which is characterized by dysfunction of regulatory T-cells and subsequent autoimmunity. Most of these reported mutations are located in the forkhead domain and are predicted to disrupt critical DNA-protein interactions (Wildin et al., 2001). However, Bennett et al. (2001) reported a rare A→G transition within the first polyadenylation signal (AATAAA→AATGAA) after the stop codon, which is associated with this disease. This mutation is predicted to cause non-specific degradation of the FOXP3 protein, which results in IPEX. Thus mutations in the polyadenylation signal are indeed associated with diseases.

Mutations that alter the secondary structure of 3′-UTR and cause diseases

The 3′-UTR is also characterized by secondary structures which play an important role in the interaction of mRNAs with the associated proteins. A perturbation in these structures due to an inherent change in the sequence alters its interaction with proteins. A study by Chen et al. (2006) on 83 disease-associated variants in the 3′-UTR of various human mRNAs revealed a correlation between the functionality of these variants and alterations in the predicted secondary structure (reviewed in Chen et al., 2006). Such studies emphasize the importance of this region in translational regulation. Here, we briefly describe two diseases associated with alterations in the secondary structure of the 3′-UTR of mRNAs.

GATA4 is a member of the family of proteins that bind to the GATA motif and is characterized by the presence of one or two zinc fingers. This transcription factor is associated with the septation defects of the human heart. The term CHD (congenital heart disease) refers to defects in the structure and function of the heart due to abnormal heart development during gestation. Many mutations in the GATA4 gene have been reported in families with CHD. Some of these mutations occur in the 3′-UTR of the GATA4 mRNA. These mutations are predicted to alter the secondary structure of the mRNA and thus its localization and translation (Reamon-Buettner et al., 2007). In the example discussed above, a 1723 C→T transition in the 3′-UTR of TGF-β3 is also invariably associated with ARVC (Beffagna et al., 2005).

Supporting Information

S1 Text. S1–S3 Tables and supplemental experimental procedures.

S1 Table. Primers used for qRT-PCR. Related to Figs. 4, 5 and 7.

S2 Table. Select average mRNA half-life calculations using either actinomycin D transcriptional shut-offs (72 hpi) or metabolic labeling with 4-thio uridine (4sU) (120 hpi). Related to Figs. 4 and 5. The average of three independent experiments is reported for actinomycin D shut-offs; the average of two independent experiments is reported for metabolic labeling with 4sU. The fold change in mRNA half-life in mock vs HCV infected cells is shown in the ‘Fold change’ column.

S3 Table. Fold increases in select mRNA abundances in HCV infected cells compared to mock infected cells as calculated by multiple methods. Related to Figs. 5 and 6. The average fold-change in mRNA abundances from three independent untreated RNA experiments using qRT-PCR analysis and two independent experiments using RNAseq is shown.

S1 Fig. XRN1-mediated decay intermediates are not commonly observed during RNA degradation and XRN1 protein levels do not appreciably change in HCV or BVDV infections.

Related to Figs. 1 and 4. Panel A. A series of radiolabeled RNAs containing a 5’ monophosphate was incubated with recombinant XRN1 for 0, 30 or 60 seconds. Reaction products were analyzed on a 5% denaturing acrylamide gel. Panel B. XRN1 levels in Huh7.5 cells (left panel) or MDBK cells (right panel) were analyzed by western blotting during mock or either HCV (left) or BVDV (right) infection. Average XRN1 levels from three HCV infections +/− standard deviation relative to RPL19 control are shown. Tubulin (TUBA1A) was used as a loading control in the gel on the right (representative of blots from three independent infections).

S2 Fig. Validation of global RNAseq dataset.

Related to Fig. 5. Panel A. HCV and Mock infected Huh7.5 cells were subjected to 4sU labeling for one hour prior to harvest, then RNA was separated into 4sU labeled, unlabeled and total populations and subject to Illumina sequencing. Tophat2 mapped reads were categorized as intronic, coding sequence, intergenic, UTR or HCV using PicardTools CollectRNASeqMetrics function. Reads mapping to the HCV genome in infected samples accounted for ∼1% of all mapped reads. Error bars represent standard deviation. Panel B. Normalized log2 transformed sequencing data correlated highly and samples clustered together. Panel C. Changes in mRNA abundance correlate well with previously published HCV infections. A total of 612 differentially expressed mRNAs from Walters et al. [43] at time points from 24–120 hours were compared to mRNA abundance changes observed in Mock versus HCV infected samples at 120 hours.

S3 Fig. Comparison of the different HCV-derived 5’ UTRs used in the study.

Related to Figs. 1–6. Clustal Omega alignment [S9, S10] of the sequences of the 5’ UTRs of the three strains of HCV used in the study. JFH-1 is the wild-type HCV infectious virus used in Figs. 2D, 3A, 4, and 5. The H77 5’ UTR was cloned into pGEM-4 and peGFP-N1 vectors for Figs. 1, 2, and 3 and the Replicon was used for Fig. 2D. Note that the various 5’ UTRs are over 92.5% similar to each other (determined by the clustal 2.1 percent identity matrix).

S4 Fig. GSEA analysis of stabilized transcription factors.

Related to Figs. 5–7. MSigDB datasets defining mRNAs with binding sites matching a set of predicted binding motifs for the FOS and JUN heterodimer AP-1 (left panels) and MYC (right panels) were evaluated for correlation with rank ordered changes in mRNA abundance (top panels) and mRNA transcription rate (bottom panels).

S4 Table. Excel file with all half-lives and abundance data from global analysis of gene expression in HCV infections.
