Development of Next-Gen Human Transcriptome Array

Introduction

The Glue Grant Human Transcriptome Array (GG-H) is the result of a collaboration between the Stanford Genome Technology Center, Wing Wong’s lab at Stanford, Affymetrix Inc., and the Inflammation and Host Response to Injury program (“Glue Grant”). The array has been comprehensively designed to interrogate various aspects of the transcriptome, including gene expression, alternative splicing, coding SNPs, and non-coding transcription. With a protocol tailored to work efficiently with small amounts of total RNA, the array provides a high-throughput, low-cost platform for clinical genomic studies.

Affymetrix is expected to make the GG-H array commercially available in January 2013. The commercial version of the GG-H array is named the Human Transcriptome Array (HTA).

Array Components and Probe Design

Various components of the array and their probe design strategies are summarized in the following table and illustrated in the figure.

Array Components	Number of targets	Number of Probes	Design
Gene exons	315,123	3,292,929	On average ten probes per exon (~119 probes per gene) were selected based on high thermodynamic scores, uniqueness, and even spacing along the target
Exon-exon junctions	260,488	1,060,703	Four probes per junction at positions (-3, -1, +1, +3) relative to the splice site
Coding SNPs and DMET variations	89,782	982,941	Six probes per allele at the -4, 0, and +4 positions on each of the two strands relative to the SNP
Non-coding functional RNA (f-ncRNA)	730	5,869	Ten probes per ncRNA were selected based on high thermodynamic scores, uniqueness, and even spacing along the target
Non-coding antisense expression (as-ncRNA)	50,783	563,097	Probes were selected at a density of one probe per 50 bp of UTR, with a minimum of six probes per region
Un-annotated transcribed units (UTU)	49,957	488,581	Ten probes per UTU were selected based on high thermodynamic scores, uniqueness, and even spacing along the target
Other probes including controls		498,840	Designed for quality control of the assay, background modeling, estimation of cross-hybridization, and monitoring of ribosomal RNA
Total		6,892,960
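
As a quick consistency check on the table above, the following minimal Python sketch sums the per-component probe counts and computes the average number of probes per target. The counts are copied from the table; the dictionary layout, the shortened component labels, and the function names are illustrative assumptions and are not part of any GG-H library file.

```python
# Target and probe counts copied from the table above.
# The target count is None where the table lists no target number.
components = {
    "Gene exons": (315_123, 3_292_929),
    "Exon-exon junctions": (260_488, 1_060_703),
    "Coding SNPs and DMET variations": (89_782, 982_941),
    "f-ncRNA": (730, 5_869),
    "as-ncRNA": (50_783, 563_097),
    "UTU": (49_957, 488_581),
    "Other probes including controls": (None, 498_840),
}


def total_probes(table):
    """Sum the probe counts over all array components."""
    return sum(probes for _targets, probes in table.values())


def probes_per_target(table):
    """Average probes per target, skipping components without a target count."""
    return {
        name: probes / targets
        for name, (targets, probes) in table.items()
        if targets
    }


if __name__ == "__main__":
    print("Total probes:", total_probes(components))
    for name, ratio in probes_per_target(components).items():
        print(f"{name}: {ratio:.1f} probes per target")
```

The computed sum agrees with the Total row, and the per-target averages can be compared against the design descriptions (for example, roughly ten probes per exon).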

Library Files, Annotation and Database

To support different kinds of analyses using the GG-H array, we have developed a set of library and annotation files. The most important ones are summarized in the following table. In addition, a comprehensive database (http://gluegrant1.stanford.edu/~DIC/db) is available for querying array design and annotation information. Users can use the database to generate customized library and annotation files.

File Name	File Type	Description	Download
hGlue2_0.r1.clf	CEL Layout File (CLF)	CLF along with PGF make up the core chip layout information for our array. The CLF contains the mapping of probe IDs to x/y positions in the CEL file.	hGlue2_0.r1.core.tar.gz
hGlue2_0.r1.pgf	Probe Grouping File (PGF)	PGF along with CLF make up the core chip layout information for our array. The PGF groups specific probes (by probe ID) into probesets.
hGlue2_0.r1.antigenomic.bgp	BackGround Probes (BGP)	The BGP file lists what probes (by probe ID) are to be used in various background correction methods (e.g. GCBG method).
hGlue2_0.r1.qcc	Quality Control Content (QCC)	The QCC file lists probes serving various quality control purposes.
hGlue2_0.r1.pgf.tbl	Tab-delimited	This file is used by the GlueQC package for the quality control summary.
hGlue2_0.r1.PSR.ps	Probeset List (PS)	The PS file lists probeset IDs for Probe Selection Regions (PSRs).
hGlue2_0.r1.TC.mps	Meta Probeset List (MPS)	The MPS file is used to group individual PSR (exon) level probesets into Transcript Cluster (gene) level meta probesets.
hGlue2_0.r1.TC_Annot.csv	Gene Annotation File	The annotation file links each transcript cluster (gene) to chromosomal position information, gene information, functional annotation (gene ontology and pathway) and other information in public databases.
hGlue2_0.r1.ASS	Alternative Splicing Structure (ASS)	The ASS file provides the alternative splicing structure based on design-time knowledge. It describes how exons and junctions are connected in a transcript cluster.
hGlue2_0.r1.Probe.BED	BED File	Genome coordinate file for probes on hg18
hGlue2_0.r1.PSR.BED	BED File	Genome coordinate file for Probe Selection Regions (PSRs) on hg18
hGlue2_0.r1.TC.BED	BED File	Genome coordinate file for Transcript Clusters (TCs) on hg18
hGlue2_0.r1.gene info Gene Ontology.xls	dChip Library File	Gene ontology frequency summary for GG-H genes	hGlue2_0.r1.dChip.tar.gz
hGlue2_0.r1.gene info.xls	dChip Library File	Gene annotation information for GG-H genes
hGlue2_0.r1.genome info.xls	dChip Library File	Genome coordinate information for GG-H genes
component.ontology; function.ontology; process.ontology	dChip Library File	Cellular component, molecular function and biological process ontology mapping for GG-H genes
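
The BED files listed above can be read with a few lines of Python. The sketch below assumes only the standard BED conventions (tab-separated chromosome, 0-based start, exclusive end, and an optional name in the fourth column). The exact number of columns in the GG-H BED files is not stated here, and the example records are hypothetical, not taken from the real files.

```python
from collections import namedtuple

BedRecord = namedtuple("BedRecord", "chrom start end name")


def parse_bed(lines):
    """Parse BED lines into records, skipping track/browser/comment lines."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else ""
        records.append(BedRecord(chrom, start, end, name))
    return records


def region_length(record):
    """Length in bp; BED starts are 0-based and ends are exclusive."""
    return record.end - record.start


# Hypothetical records for illustration only (not real GG-H coordinates).
example_lines = [
    "track name=example\n",
    "chr1\t1000\t1025\tprobe_A\n",
    "chr2\t500\t650\tPSR_B\n",
]

if __name__ == "__main__":
    for rec in parse_bed(example_lines):
        print(rec.name, rec.chrom, region_length(rec))
```

Because the coordinates are on hg18, they would need to be lifted over before being compared with annotation on a newer genome assembly.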

Analysis Pipeline and Software

To support routine analyses of the GG-H array, we have established a basic pipeline for quality control, calculation of expression indices and detection of alternative splicing. For the other components of the array, the analysis methods are still exploratory and highly customized.

Analysis	Software	Description	Download
Quality control	GlueQC (requires APT and R bioconductor)	Assess array quality through exploratory plots and summary statistics	GlueQC website
Expression indices calculation	Affymetrix Power Tools (APT); JETTA	Background correction, normalization and calculation of exon or gene expression matrices	APT website
Detection of alternative splicing	Junction and Exon array Toolkits for Transcriptome Analysis (JETTA)	Detection of alternatively spliced exons with or without supporting junctions	JETTA website
High-level exploratory analysis	dChip	Clustering of gene expression and enrichment analysis of ontologies, pathways and genome locations	dChip website
Visualization	UCSC genome browser	Visualize probes, exons and genes on the genome browser	UCSC genome browser

1. Quality control

Ensuring high data quality is crucial to genomic studies. GlueQC starts with CEL files and checks several quality scores to filter out outliers. The quality statistics include probe-level foreground and background signal, area under the curve (AUC) using Norm Exons and Norm Introns as positive and negative controls respectively, probeset presence calls, and between-array correlation at both the exon and gene level.

To run the script:

Rscript GlueQC.R celpath=CEL_PATH outpath=OUTPUT_PATH libpath=LIB_PATH
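
To illustrate the area-under-curve statistic mentioned above, here is a minimal Python sketch that computes an AUC from positive-control and negative-control signals. It is a generic rank-based AUC, not GlueQC's implementation; the function name and the example intensities are hypothetical.

```python
def auc(positive, negative):
    """Probability that a random positive control scores above a random
    negative control, counting ties as one half (rank-based AUC)."""
    if not positive or not negative:
        raise ValueError("both control sets must be non-empty")
    wins = 0.0
    for p in positive:
        for n in negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive) * len(negative))


# Hypothetical log-scale signals for one array (illustration only).
norm_exon_signal = [8.1, 7.4, 9.0, 6.2]    # positive controls
norm_intron_signal = [5.0, 6.5, 4.8, 5.5]  # negative controls

if __name__ == "__main__":
    print(auc(norm_exon_signal, norm_intron_signal))
```

An AUC near 1 means the positive controls are well separated from the negative controls, while a value near 0.5 means the array cannot distinguish them. The pairwise loop is quadratic and meant only for small illustrations.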

2. Expression indices calculation

Low-level analysis of microarray data includes background correction, normalization and calculation of exon- or gene-level expression indices. Here we show examples of low-level analyses using APT.

To calculate gene-level expression using APT rma-sketch:

apt-probeset-summarize -a rma-sketch -c hGlue2_0.r1.clf -p hGlue2_0.r1.pgf -b hGlue2_0.r1.antigenomic.bgp -m hGlue2_0.r1.TC.mps -o gene_expr *.CEL

To calculate exon-level expression using APT rma-sketch:

apt-probeset-summarize -a rma-sketch -c hGlue2_0.r1.clf -p hGlue2_0.r1.pgf -b hGlue2_0.r1.antigenomic.bgp -s hGlue2_0.r1.PSR.ps -o exon_expr *.CEL
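
The expression matrices produced by the commands above can be loaded for downstream analysis. The Python sketch below assumes that the summary is a tab-delimited text file (for example rma-sketch.summary.txt inside the output directory) with comment lines starting with "#" and a first column named probeset_id followed by one column per CEL file. This layout is an assumption to verify against your own APT output; the probeset IDs and values shown are hypothetical.

```python
import csv
import io


def read_summary(handle):
    """Read a tab-delimited summary table, skipping '#' comment lines.

    Returns (sample_names, {probeset_id: [values per sample]}).
    """
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.reader(rows, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # first column is assumed to hold the probeset ID
    matrix = {}
    for fields in reader:
        if not fields:
            continue
        matrix[fields[0]] = [float(value) for value in fields[1:]]
    return samples, matrix


# Hypothetical file content for illustration (assumed layout, not real data).
EXAMPLE_TEXT = (
    "#%hypothetical_header=example\n"
    "probeset_id\tsample1.CEL\tsample2.CEL\n"
    "TC0001\t7.25\t7.50\n"
    "TC0002\t3.10\t2.90\n"
)

if __name__ == "__main__":
    samples, matrix = read_summary(io.StringIO(EXAMPLE_TEXT))
    print(samples)
    print(matrix)
```

With a real file, replace the io.StringIO object with an open file handle, for example open("gene_expr/rma-sketch.summary.txt"), where the path is again an assumption about the output layout.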

JETTA is also capable of performing low-level analyses. Please refer to its dedicated website for instructions (JETTA website).

3. Alternative splicing using JETTA

With the addition of junction probes, GG-H can improve the accuracy of alternative splicing detection. To incorporate junctions into alternative splicing analysis, we have developed the Junction and Exon array Toolkits for Transcriptome Analysis (JETTA), an integrated software tool for expression index calculation and alternative splicing analysis. Please refer to its dedicated website for instructions (JETTA website).

4. High-level exploratory analysis using dChip

Biologists are often interested in clustering and functional enrichment analysis at the gene level. For this purpose, we provide a set of library files to support these kinds of analyses using dChip. Please refer to the dChip website for more instructions on how to run dChip (dChip website).

Protocol

The GG-H protocol is based on that of Ambion Inc./Applied Biosystems (cat# 4411974) and has been specially modified to work efficiently with small amounts of starting material. It uses two rounds of single-strand cDNA synthesis to amplify mRNA and Affymetrix GeneChip WT terminal labeling technology to label fragmented cDNA for hybridization. The detailed protocol can be found here (GG-H protocol).

Availability

The array platform has been deposited in NCBI GEO under GPL11319. An example data set is accessible at GSE26072 (and GSE26109 for the RNA-Seq data used in the paper).

The GG-H array can be ordered from Affymetrix as a custom array. For more information, please contact dbowe@stanford.edu.

Reference

Xu W, Seok J, Mindrinos MN, Schweitzer AC, Jiang H, Wilhelmy J, Clark TA, Kapur K, Xing Y, Faham M, Storey JD, Moldawer LL, Maier RV, Tompkins RG, Wong WH, Davis RW, Xiao W; Inflammation and Host Response to Injury Large-Scale Collaborative Research Program. Human transcriptome array for high-throughput clinical studies. Proc Natl Acad Sci U S A. 2011 Mar 1;108(9):3707-12. doi: 10.1073/pnas.1019753108. Epub 2011 Feb 11.

Questions and Comments

For questions and comments, please join our discussion group at http://groups.google.com/group/GGHarray.

Last modified 12/22/2012. Webmaster: weihongx@stanford.edu