Development of Next-Gen Human Transcriptome Array

Introduction

The Glue Grant Human Transcriptome Array (GG-H) is the result of a collaboration between the Stanford Genome Technology Center, Wing Wong’s lab at Stanford, Affymetrix Inc. and the Inflammation and Host Response to Injury program (“Glue Grant”). The array has been comprehensively designed to interrogate various aspects of the transcriptome, including gene expression, alternative splicing, coding SNPs and non-coding transcription. With a protocol tailored to work efficiently with small amounts of total RNA, the array provides a high-throughput, low-cost platform for clinical genomic studies.

Affymetrix is expected to make the GG-H array commercially available in January 2013. The commercial version of the GG-H array is named the Human Transcriptome Array (HTA).

Array Components and Probe Design

Various components of the array and their probe design strategies are summarized in the following table and illustrated in the figure.

Array Component | Number of Targets | Number of Probes | Design
Gene exons | 315,123 | 3,292,929 | On average ten probes per exon (~119 probes per gene), selected for high thermodynamic scores, uniqueness and even spread across the target
Exon-exon junctions | 260,488 | 1,060,703 | Four probes per junction at positions (-3, -1, +1, +3) relative to the splice site
Coding SNPs and DMET variations | 89,782 | 982,941 | Six probes per allele at positions -4, 0 and +4 relative to the SNP, on each of the two strands
Non-coding functional RNA (f-ncRNA) | 730 | 5,869 | Ten probes per ncRNA, selected for high thermodynamic scores, uniqueness and even spread across the target
Non-coding antisense expression (as-ncRNA) | 50,783 | 563,097 | Probes selected at a density of one probe per 50 bp of UTR, with a minimum of six probes per region
Un-annotated transcribed units (UTU) | 49,957 | 488,581 | Ten probes per UTU, selected for high thermodynamic scores, uniqueness and even spread across the target
Other probes including controls | | 498,840 | Designed for quality control of the assay, background modeling, estimation of cross-hybridization and monitoring of ribosomal RNA
Total | | 6,892,960 |

[Figure: array scheme]
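As a quick consistency check on the table above, the per-component probe counts can be summed and compared with the stated total. The sketch below only re-enters the numbers from the table; the dictionary names are illustrative and not part of any GG-H tool.

```python
# Probe and target counts copied from the array component table above.
probe_counts = {
    "Gene exons": 3_292_929,
    "Exon-exon junctions": 1_060_703,
    "Coding SNPs and DMET variations": 982_941,
    "Non-coding functional RNA (f-ncRNA)": 5_869,
    "Non-coding antisense expression (as-ncRNA)": 563_097,
    "Un-annotated transcribed units (UTU)": 488_581,
    "Other probes including controls": 498_840,
}
# The control row has no target count in the table, so it is omitted here.
target_counts = {
    "Gene exons": 315_123,
    "Exon-exon junctions": 260_488,
    "Coding SNPs and DMET variations": 89_782,
    "Non-coding functional RNA (f-ncRNA)": 730,
    "Non-coding antisense expression (as-ncRNA)": 50_783,
    "Un-annotated transcribed units (UTU)": 49_957,
}

total_probes = sum(probe_counts.values())
print(total_probes)  # 6892960, matching the "Total" row

# Average number of probes per target for each component.
for name, n_targets in target_counts.items():
    print(f"{name}: {probe_counts[name] / n_targets:.1f} probes per target")
```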

Library Files, Annotation and Database

To support different kinds of analyses using the GG-H array, we have developed a set of library and annotation files. The most important ones are summarized in the following table. In addition, a comprehensive database (http://gluegrant1.stanford.edu/~DIC/db) is available for querying array design and annotation information. Users can also use the database to generate customized library and annotation files.

File Name | File Type | Description | Download
hGlue2_0.r1.clf | CEL Layout File (CLF) | The CLF and PGF together make up the core chip layout information for the array. The CLF contains the mapping of probe IDs to x/y positions in the CEL file. | hGlue2_0.r1.core.tar.gz
hGlue2_0.r1.pgf | Probe Grouping File (PGF) | The PGF and CLF together make up the core chip layout information for the array. The PGF groups specific probes (by probe ID) into probesets. |
hGlue2_0.r1.antigenomic.bgp | BackGround Probes (BGP) | The BGP file lists the probes (by probe ID) to be used in various background correction methods (e.g. the GCBG method). |
hGlue2_0.r1.qcc | Quality Control Content (QCC) | The QCC file lists probes serving various quality control purposes. |
hGlue2_0.r1.pgf.tbl | Tab-delimited | Used by the GlueQC package to produce the quality control summary. |
hGlue2_0.r1.PSR.ps | Probeset List (PS) | The PS file lists probeset IDs for Probe Selection Regions (PSRs). |
hGlue2_0.r1.TC.mps | Meta Probeset List (MPS) | The MPS file groups individual PSR (exon) level probesets into Transcript Cluster (gene) level meta probesets. |
hGlue2_0.r1.TC_Annot.csv | Gene Annotation File | The annotation file links each transcript cluster (gene) to chromosomal position, gene information, functional annotation (gene ontology and pathway) and other information in public databases. |
hGlue2_0.r1.ASS | Alternative Splicing Structure (ASS) | The ASS file provides the alternative splicing structure based on design-time knowledge. It describes how exons and junctions are connected in a transcript cluster. |
hGlue2_0.r1.Probe.BED | BED File | Genome coordinate file for probes on hg18 |
hGlue2_0.r1.PSR.BED | BED File | Genome coordinate file for Probe Selection Regions (PSRs) on hg18 |
hGlue2_0.r1.TC.BED | BED File | Genome coordinate file for Transcript Clusters (TCs) on hg18 |
hGlue2_0.r1.gene info Gene Ontology.xls | dChip Library File | Gene ontology frequency summary for GG-H genes | hGlue2_0.r1.dChip.tar.gz
hGlue2_0.r1.gene info.xls | dChip Library File | Gene annotation information for GG-H genes |
hGlue2_0.r1.genome info.xls | dChip Library File | Genome coordinate information for GG-H genes |
component.ontology; function.ontology; process.ontology | dChip Library File | Cellular component, molecular function and biological process ontology mapping for GG-H genes |
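To illustrate how the MPS file ties exon-level probesets to genes, the following minimal sketch reads an MPS-style table into a dictionary. It assumes the common Affymetrix layout (optional `#` header lines, then tab-delimited columns `probeset_id`, `transcript_cluster_id`, `probeset_list`, `probe_count`, with `probeset_list` space-separated). That layout is an assumption and should be checked against hGlue2_0.r1.TC.mps itself. The function name `read_mps` and all IDs in the example are hypothetical.

```python
import csv
import io


def read_mps(handle):
    """Map each meta probeset (transcript cluster) ID to its PSR probeset IDs."""
    # Drop blank lines and '#' header/comment lines before the column header row.
    rows = (line for line in handle if line.strip() and not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    clusters = {}
    for row in reader:
        # Assumed layout: probeset_list holds space-separated probeset IDs.
        clusters[row["probeset_id"]] = row["probeset_list"].split()
    return clusters


# Tiny in-memory example with made-up IDs; a real run would open the MPS file instead.
example = io.StringIO(
    "#%example_header=hypothetical\n"
    "probeset_id\ttranscript_cluster_id\tprobeset_list\tprobe_count\n"
    "TC0001\tTC0001\tPSR0001 PSR0002 PSR0003\t30\n"
    "TC0002\tTC0002\tPSR0004\t10\n"
)
clusters = read_mps(example)
for cluster_id, psr_ids in clusters.items():
    print(cluster_id, len(psr_ids), "PSR probesets")
```

With a real file, the same function could be called as `read_mps(open("hGlue2_0.r1.TC.mps"))`, provided the layout assumption holds.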

Analysis Pipeline and Software

To support routine analyses of the GG-H array, we have established a basic pipeline for quality control, calculation of expression indices and detection of alternative splicing. For other components of the array, the analysis methods are still exploratory and highly customized.

Analysis | Software | Description | Download
Quality control | GlueQC (requires APT and R Bioconductor) | Assess array quality through exploratory plots and summary statistics | GlueQC website
Expression indices calculation | Affymetrix Power Tools (APT); JETTA | Background correction, normalization and calculation of exon or gene expression matrices | APT website
Detection of alternative splicing | Junction and Exon array Toolkits for Transcriptome Analysis (JETTA) | Detection of alternatively spliced exons with or without supporting junctions | JETTA website
High-level exploratory analysis | dChip | Clustering of gene expression and enrichment analysis of ontologies, pathways and genome locations | dChip website
Visualization | UCSC genome browser | Visualize probes, exons and genes on the genome browser | UCSC genome browser

1. Quality control

Ensuring high data quality is crucial to genomic studies. GlueQC starts with CEL files and checks several quality scores to filter out outlier arrays. The quality statistics include probe-level foreground and background signal, area under the curve (AUC) using Norm Exons and Norm Introns as positive and negative controls respectively, probeset presence calls, and between-array correlation at both the exon and gene level.

To run the script:

Rscript GlueQC.R celpath=CEL_PATH outpath=OUTPUT_PATH libpath=LIB_PATH
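To make the positive-versus-negative control statistic more concrete, here is a minimal sketch of a rank-based AUC: the probability that a randomly chosen positive-control signal exceeds a randomly chosen negative-control signal, with ties counted as one half. This is a generic illustration and not GlueQC's actual implementation, which is not shown on this page. The function name and the signal values are hypothetical.

```python
def control_auc(positive, negative):
    """Fraction of (positive, negative) pairs where the positive signal is higher.

    Ties count as half a win. This equals the area under the ROC curve.
    """
    wins = 0.0
    for p in positive:
        for n in negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive) * len(negative))


# Made-up log2 signals for one array: exon controls (positive), intron controls (negative).
norm_exon_signal = [9.1, 8.4, 7.9, 10.2]
norm_intron_signal = [4.2, 5.0, 8.0, 3.7]

auc = control_auc(norm_exon_signal, norm_intron_signal)
print(auc)  # 0.9375: 15 of the 16 pairs rank the exon control higher
```

An AUC near 1 indicates good separation between the two control sets, while a value near 0.5 indicates that the array does not distinguish them.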

2. Expression indices calculation

Low-level analysis of microarray data includes background correction, normalization and calculation of exon- or gene-level expression indices. Here we show examples of low-level analyses using APT.

To calculate gene-level expression using APT rma-sketch:

apt-probeset-summarize -a rma-sketch -c hGlue2_0.r1.clf -p hGlue2_0.r1.pgf -b hGlue2_0.r1.antigenomic.bgp -m hGlue2_0.r1.TC.mps -o gene_expr *.CEL

To calculate exon-level expression using APT rma-sketch:

apt-probeset-summarize -a rma-sketch -c hGlue2_0.r1.clf -p hGlue2_0.r1.pgf -b hGlue2_0.r1.antigenomic.bgp -s hGlue2_0.r1.PSR.ps -o exon_expr *.CEL

JETTA is also capable of performing low-level analyses. Please refer to its dedicated website for instructions (JETTA website).
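The apt-probeset-summarize commands above write their results to the directory given by -o. As a sketch of a possible next step, the code below loads an expression summary table into Python. It assumes the usual APT output layout: a tab-delimited file (for example gene_expr/rma-sketch.summary.txt) with `#%` header lines, a `probeset_id` column and one column per CEL file. The file name and layout should be verified against the installed APT version. The function name and the example values are hypothetical.

```python
import csv
import io


def read_summary(handle):
    """Return (sample_names, {probeset_id: {sample_name: expression_value}})."""
    # Skip blank lines and the '#%' metadata lines that precede the table.
    rows = (line for line in handle if line.strip() and not line.startswith("#"))
    reader = csv.reader(rows, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # the first column is assumed to be probeset_id
    expression = {}
    for fields in reader:
        values = [float(v) for v in fields[1:]]
        expression[fields[0]] = dict(zip(samples, values))
    return samples, expression


# In-memory stand-in for a summary file; IDs, sample names and values are made up.
example = io.StringIO(
    "#%example_header=hypothetical\n"
    "probeset_id\tsample1.CEL\tsample2.CEL\n"
    "TC0001\t7.25\t7.5\n"
    "TC0002\t3.0\t9.75\n"
)
samples, expression = read_summary(example)
print(samples)
print(expression["TC0002"]["sample2.CEL"])  # 9.75
```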

3. Alternative splicing using JETTA

With the addition of junction probes, GG-H can improve the accuracy of alternative splicing detection. To include junctions in alternative splicing analysis, we have developed the Junction and Exon array Toolkits for Transcriptome Analysis (JETTA), an integrated software tool for calculating expression indices and analyzing alternative splicing. Please refer to its dedicated website for instructions (JETTA website).
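For readers new to exon-level splicing analysis, the sketch below shows the generic "splicing index" idea: an exon's log2 expression is normalized by the log2 expression of its gene, and the normalized values are compared between two sample groups. This is a simplified textbook illustration and not JETTA's algorithm. The function names and numbers are hypothetical.

```python
from statistics import mean


def splicing_index(exon_log2, gene_log2):
    """Per-sample exon signal relative to its gene, on the log2 scale."""
    return [e - g for e, g in zip(exon_log2, gene_log2)]


def splicing_index_difference(exon_a, gene_a, exon_b, gene_b):
    """Difference in mean splicing index between group A and group B."""
    return mean(splicing_index(exon_a, gene_a)) - mean(splicing_index(exon_b, gene_b))


# Made-up log2 values for one exon and its gene, two samples per group.
exon_a, gene_a = [6.0, 6.5], [8.0, 8.5]  # exon well below gene level in group A
exon_b, gene_b = [8.0, 8.5], [8.0, 8.5]  # exon tracks gene level in group B

diff = splicing_index_difference(exon_a, gene_a, exon_b, gene_b)
print(diff)  # -2.0: the exon is relatively under-represented in group A
```

A difference far from zero suggests differential exon usage, although a real analysis would add a statistical test and, on GG-H, supporting evidence from junction probes.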

4. High-level exploratory analysis using dChip

Biologists are often interested in clustering and functional enrichment analysis at the gene level. For this purpose, we provide a set of library files to support these kinds of analyses using dChip. Please refer to the dChip website for more instructions on how to run dChip (dChip website).

Protocol

The GG-H protocol is based on a kit from Ambion Inc./Applied Biosystems (cat# 4411974) and has been specially modified to work efficiently with small amounts of starting material. It uses two rounds of single-strand cDNA synthesis to amplify mRNA and Affymetrix GeneChip WT terminal labeling technology to label fragmented cDNA for hybridization. The detailed protocol can be found here (GG-H protocol).

Availability

The array platform has been deposited in NCBI GEO under GPL11319. An example data set is accessible at GSE26072 (and GSE26109 for the RNA-Seq data used in the paper).

The GG-H array can be ordered from Affymetrix as a custom array. For more information, please contact dbowe@stanford.edu.

Reference 

Xu W, Seok J, Mindrinos MN, Schweitzer AC, Jiang H, Wilhelmy J, Clark TA, Kapur K, Xing Y, Faham M, Storey JD, Moldawer LL, Maier RV, Tompkins RG, Wong WH, Davis RW, Xiao W; Inflammation and Host Response to Injury Large-Scale Collaborative Research Program. Human transcriptome array for high-throughput clinical studies. Proc Natl Acad Sci U S A. 2011 Mar 1;108(9):3707-12. doi: 10.1073/pnas.1019753108. Epub 2011 Feb 11.

Questions and Comments

For questions and comments, please join our discussion group at http://groups.google.com/group/GGHarray.

Last modified 12/22/2012. Webmaster:  weihongx@stanford.edu
