--- title: 'Weighted gene co-expression network analysis with TCGA RNAseq data' author: | | Andreas Mock | Cancer Research UK Cambridge Institute, University of Cambridge date: '`r Sys.Date()`' output: BiocStyle::html_document fig_caption: yes vignette: > %\VignetteIndexEntry{Weighted gene co-expression network analysis with TCGA RNAseq data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{utf8} --- The following tutorial describes the generation of a weighted co-expression network from TCGA (The Cancer Genome Atlas) RNAseq data using the `WGCNA` *R* package by Langfelder and Horvarth[^1]. In addition, individual genes and modules will be related to sample traits. Exemplarly, a co-expression network for skin cutaneous melanomas (SKCM) will be generated. However, the following weighted gene co-expression analysis (WGCNA) framework is applicable to any TCGA tumour entity. The code of this vignette is a proof of principial example that can't be run as listed without assembling the RNAseq data as described in the following beforehand. #Assembly and preprocessing of TCGA RNAseq data Melanoma RNAseq data for the CVE extension were downloaded as expression estimates per gene (RNAseq2 level 3 data) from the [TCGA data portal](https://tcga-data.nci.nih.gov/tcga/). Please note that the TCGA Data portal is no longer operational and all TCGA data now resides at the [Genomic Data Commons](https://gdc.nci.nih.gov). For WGCNA, the individual TCGA RNAseq2 level 3 files were concatenated to a matrix `RNAseq` with gene symbols as row and TCGA patient barcodes as column names. Further preprocessing included the removal of control samples (for more information see the [TCGA Wiki](https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode)) and expression estimates with counts in less than 20% of cases. ```{r, eval=FALSE} RNAseq = RNAseq[apply(RNAseq,1,function(x) sum(x==0))0] = 1 #calculate gene significance measure for lymphocyte score (lscore) - Welch's t-Test GS_lscore = t(sapply(1:ncol(WGCNA_matrix2),function(x)c(t.test(WGCNA_matrix2[,x]~lscore,var.equal=F)$p.value, t.test(WGCNA_matrix2[,x]~lscore,var.equal=F)$estimate[1], t.test(WGCNA_matrix2[,x]~lscore,var.equal=F)$estimate[2]))) GS_lscore = cbind(GS.lscore, abs(GS_lscore[,2] - GS_lscore[,3])) colnames(GS_lscore) = c('p_value','mean_high_lscore','mean_low_lscore', 'effect_size(high-low score)'); rownames(GS_lscore) = colnames(WGCNA_matrix2) ``` To enable a high-level interpretation of the dendrogram of module eigengenes, gene ontology (GO) enrichment analysis was performed for the module genes using the `GOstats` *R* package [^8]. Modules were named according to the most significant GO einrichment given a cutoff for the ontology size. The smaller the ontology size, the more specific the term. In this analysis a cutoff of 100 terms per ontology was chosen. ```{r, eval=FALSE} #reference genes = all 5000 top mad genes ref_genes = colnames(WGCNA_matrix2) #create data frame for GO analysis library(org.Hs.eg.db) GO = toTable(org.Hs.egGO); SYMBOL = toTable(org.Hs.egSYMBOL) GO_data_frame = data.frame(GO$go_id, GO$Evidence,SYMBOL$symbol[match(GO$gene_id,SYMBOL$gene_id)]) #create GOAllFrame object library(AnnotationDbi) GO_ALLFrame = GOAllFrame(GOFrame(GO_data_frame, organism = 'Homo sapiens')) #create gene set library(GSEABase) gsc <- GeneSetCollection(GO_ALLFrame, setType = GOCollection()) #perform GO enrichment analysis and save results to list - this make take several minutes library(GEOstats) GSEAGO = vector('list',length(unique(modules))) for(i in 0:(length(unique(modules))-1)){ GSEAGO[[i+1]] = summary(hyperGTest(GSEAGOHyperGParams(name = 'Homo sapiens GO', geneSetCollection = gsc, geneIds = colnames(RNAseq)[modules==i], universeGeneIds = ref.genes, ontology = 'BP', pvalueCutoff = 0.05, conditional = FALSE, testDirection = 'over'))) print(i) } cutoff_size = 100 GO_module_name = rep(NA,length(unique(modules))) for (i in 1:length(unique(modules))){ GO.module.name[i] = GSEAGO[[i]][GSEAGO[[i]]$Size