Improving RNA-Sequencing by Stochastic Labels

Project Summary

We propose and test a method for the absolute quantitation of RNA molecules in RNA-Seq gene expression studies. Prior to any PCR amplification, cDNA fragments are generated and randomly labeled with a diverse set of “stochastic labels” consisting of nucleic acid barcodes. After amplification and sequencing, the number of distinct stochastic labels observed for a given transcript is counted to reveal the absolute concentration of the original molecules present in the sample. Because each DNA fragment can be uniquely barcoded at the molecular level, multiple copies of DNA with identical sequences in the original sample can be distinguished from amplified clones. Representation distortion caused by PCR amplification is greatly reduced, and rare events such as mutations can be distinguished from sequencing errors. Finally, the true number of original RNA molecules captured in the sequencing libraries can be determined unambiguously, and uninformative over-sampling by ever deeper sequencing can be avoided.
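The counting step described above can be illustrated with a minimal sketch. It assumes that reads have already been aligned and reduced to (transcript, label) pairs; the function name, the gene names, and the label sequences are hypothetical and serve only as toy data, not as part of the actual analysis pipeline.

```python
from collections import defaultdict

def count_labels(reads):
    """Return {transcript: number of distinct labels} from (transcript, label) pairs."""
    labels_seen = defaultdict(set)
    for transcript, label in reads:
        # PCR duplicates share the same label, so the set absorbs them.
        labels_seen[transcript].add(label)
    return {t: len(labels) for t, labels in labels_seen.items()}

# Hypothetical toy data: six reads, several of which are PCR duplicates.
reads = [
    ("geneA", "ACGT"), ("geneA", "ACGT"), ("geneA", "TTAG"),
    ("geneB", "CCGA"), ("geneB", "CCGA"), ("geneB", "CCGA"),
]
counts = count_labels(reads)
print(counts)  # {'geneA': 2, 'geneB': 1}
```

Both toy transcripts have three reads, yet the label counts differ (2 versus 1), which is the distinction between original molecules and amplified clones that read counting alone cannot make.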

Significance

RNA-Seq is a powerful tool for gene expression analysis. However, despite the high sampling depth achievable by this method, it cannot measure the absolute concentration of each transcript. Counting the number of reads in the sequence data that represent one specific gene transcript provides only a relative abundance ratio compared to another transcript or a reference control. Another weakness of RNA-Seq experiments is the inefficiency of the sample preparation procedures. Creating a sequencing library involves many steps, and precious sample is lost at every step. These losses become evident upon analysis of data generated from stochastically labeled libraries. On average, only about 0.02% of the RNA in the original sample is represented in the library, which makes the quantitative detection of low-abundance transcripts very difficult. Using our method, researchers will be able to differentiate between true abundance information and inaccuracies created by amplification or other biases, which will significantly improve the quality of conclusions drawn from such experiments.

Approach

Conceptual

(A) Each copy of a molecule randomly captures a label by choosing from a large, non-depleting reservoir of diverse labels. The resulting diversity of the labeled molecules is governed by the statistics of random choice, and depends on the number of copies of identical molecules in the collection compared to the number of kinds of labels. Once the molecules are labeled, they can be amplified so that simple present/absent threshold detection methods can be used for each label. Counting the number of distinctly labeled targets reveals the original number of molecules of each species. (B) An example showing the number of stochastically captured labels for a given number of target molecules, calculated using a non-depleting reservoir of 960 diverse labels.
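The relationship shown in (B) can be sketched numerically. The code below is a minimal illustration and rests on one assumption that the text does not spell out: each molecule picks one of the m labels uniformly and independently. Under that assumption, the expected number of distinct labels captured by n molecules is m(1 - (1 - 1/m)^n), and the expression can be inverted to estimate n from an observed label count k. The function names are hypothetical.

```python
import math

def expected_distinct_labels(n, m=960):
    """Expected number of distinct labels when n molecules each pick one of
    m labels uniformly at random from a non-depleting reservoir."""
    return m * (1.0 - (1.0 - 1.0 / m) ** n)

def estimate_molecules(k, m=960):
    """Invert the expectation: estimate n from k observed distinct labels.
    Only defined for 0 <= k < m, since k = m means the labels are saturated."""
    if not 0 <= k < m:
        raise ValueError("k must satisfy 0 <= k < m")
    return math.log(1.0 - k / m) / math.log(1.0 - 1.0 / m)

for n in (10, 100, 1000, 5000):
    k = expected_distinct_labels(n)
    # Columns: true n, expected distinct labels, n recovered from that count.
    print(n, round(k, 1), round(estimate_molecules(k), 1))
```

When n is small relative to m, nearly every molecule receives its own label and the label count is close to n. As n approaches and exceeds m the curve saturates, so the correction in `estimate_molecules` becomes large and increasingly sensitive to sampling noise.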

Experimental

The process begins with mRNA selection from total RNA using oligo(dT) magnetic beads. The mRNAs are subsequently fragmented, and cDNA is generated by reverse transcription followed by second-strand synthesis. After purification, the cDNA ends are repaired and an A overhang is added to enable adaptor ligation. Stochastic label adaptors are synthesized and ligated to the cDNA fragments, which are then size-selected and enriched by PCR.

Accomplishments

ERCC spike-in controls, consisting of a set of 92 calibration transcripts of known concentrations, were mixed with 500 ng of human lymphocyte RNA. Stochastic labeling technology was introduced into the standard RNA-Seq workflow to generate a library of cDNA fragments. Paired-end sequencing was performed on the MiSeq instrument. (A) A plot of the ERCC controls shows that the number of stochastic labels counted on the sequence reads correlates well with the concentration of each control RNA. Furthermore, a set of ERCC controls was enriched from this library using sequence capture and was sampled by a second MiSeq run. (B) Deep sequencing of these ERCC controls demonstrates that only a very small fraction (~0.02%) of the original RNA in the sample was present in the sequencing library. The absolute quantitation obtained with stochastic labels demonstrates that, even when starting with a large sample size of 500 ng of total RNA (representing about 50,000 cells), only about 10 cell equivalents of mRNA are actually represented in the final cDNA library. These findings indicate that RNA-Seq may not be suitable for studying small sample sizes or rare transcripts, and that more efficient cDNA library preparation methods are greatly needed.
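The cell-equivalent figure follows directly from the numbers quoted above, as this short worked calculation shows. All inputs are the approximate values stated in the text; the variable names are illustrative, and the per-cell RNA mass is only the value implied by 500 ng per 50,000 cells, not a separate measurement.

```python
total_rna_ng = 500.0             # input total RNA, from the text
cells_in_input = 50_000          # approximate number of cells that input represents
fraction_in_library = 0.02 / 100 # ~0.02% of the original RNA reaches the library

# Cell equivalents of RNA that survive library preparation.
cell_equivalents = cells_in_input * fraction_in_library

# Implied total RNA per cell, in picograms (1 ng = 1000 pg).
rna_per_cell_pg = total_rna_ng * 1000.0 / cells_in_input

print(round(cell_equivalents, 2), "cell equivalents in the library")
print(round(rna_per_cell_pg, 2), "pg total RNA per cell (implied)")
```

The same arithmetic makes the small-sample problem concrete: at this efficiency, an input of fewer than about 5,000 cells would leave less than one cell equivalent of mRNA in the library.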

Future Objectives

Using stochastic labels, we will develop methods for high efficiency sequencing library preparation so as to enable the quantitative sampling of mRNA transcripts down to a single cell.

A sequence enrichment step will be used for the capture and directed sequencing of target genes of interest. Post-enrichment, counting the molecular stochastic labels reveals the concentration of the original transcripts.

A direct mRNA sampling approach based on RT-PCR, using stochastic labels on the PCR primers, will also be developed.

We will introduce stochastic labels into sequencing experiments for microRNAs and genomic DNA in addition to mRNA gene expression.

Other sequence sampling assays requiring high precision and accurate quantitative (and qualitative) measurements may also benefit from this technology. Examples include clinical applications such as infectious disease, cancer and various developmental disorders.

References

Shiroguchi K, Jia TZ, Sims PA, Xie XS. Digital RNA sequencing minimizes sequence-dependent bias and amplification noise with optimized single-molecule barcodes. Proc Natl Acad Sci U S A. 2012 Jan 24; 109(4):1347-52.

Kivioja T, Vähärautio A, Karlsson K, Bonke M, Enge M, Linnarsson S, Taipale J. Counting absolute numbers of molecules using unique molecular identifiers. Nat Methods. 2011 Nov 20; 9(1):72-4.

Casbon JA, Osborne RJ, Brenner S, Lichtenstein CP. A method for counting PCR template molecules with application to next-generation sequencing. Nucleic Acids Res. 2011 Jul; 39(12):e81.

Kinde I, Wu J, Papadopoulos N, Kinzler KW, Vogelstein B. Detection and quantification of rare mutations with massively parallel sequencing. Proc Natl Acad Sci U S A. 2011 Jun 7; 108(23):9530-5.

Fu GK, Hu J, Wang PH, Fodor SP. Counting individual DNA molecules by the stochastic attachment of diverse labels. Proc Natl Acad Sci U S A. 2011 May 31; 108(22):9026-31.