Recent initiatives to re-process and standardize publicly available RNA-Sequencing data have opened the door to mass-scale gene expression analysis. Utilizing the ARCHS4 data repository, we obtained standardized gene counts from over 200,000 sequencing samples, parsed them into tissue and disease groups, and derived genome-wide co-expression correlations. Correlation AnalyzeR allows users to explore these correlations and extract the wealth of biological insights they can provide. It can be accessed through this website or by downloading the corresponding R package.
Correlation AnalyzeR provides four primary analysis modes:
To learn more about GSEA, check out this article.
For more information, see Topology analysis explained in the Topology mode tab.
High throughput experiments can generate a massive amount of unstructured data, making it challenging to uncover biologically meaningful results. For example, RNA sequencing analysis may yield 2000+ differentially expressed genes between two conditions but will not provide any information about how those genes relate to each other or known biological pathways. Without a computational method to parse this list, exploratory analysis of these data is inefficient and new hypotheses are not generated reliably.
Topology analysis here refers to computational methods for parsing large gene lists. These analyses aim to answer two questions:
Every gene in a gene list has ~26k correlation values associated with it -- these represent the correlation of that gene with all other genes in the genome. Any two genes are likely to display some similarities in their correlation value distributions (e.g. both ATM and BRCA1 correlate highly with BRCA2), and also some differences (e.g. BRCA1 correlates with CCNB1 but ATM does not). In a gene list, gene groups can be identified as the members of the list who share correlations in common that are not shared with other members. In Correlation AnalyzeR, multiple approaches are used to find these groups (explained below).
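The idea of comparing correlation profiles can be sketched in base R. This is only an illustration on simulated data: the gene names, sample count, and noise levels are made up, and nothing here is taken from the Correlation AnalyzeR source code.

```r
# Toy expression matrix: rows are samples, columns are genes (simulated values).
set.seed(1)
n_samples <- 50
shared <- rnorm(n_samples)  # hypothetical signal shared by geneA, geneB and geneC
expr <- cbind(
  geneA = shared + rnorm(n_samples, sd = 0.3),
  geneB = shared + rnorm(n_samples, sd = 0.3),
  geneC = shared + rnorm(n_samples, sd = 0.3),
  geneD = rnorm(n_samples),
  geneE = rnorm(n_samples)
)

# Gene-by-gene correlation matrix: each row is one gene's correlation profile.
cor_mat <- cor(expr)

# Compare the profiles of two genes across the remaining genes.
others <- setdiff(colnames(cor_mat), c("geneA", "geneB"))
profile_a <- cor_mat["geneA", others]
profile_b <- cor_mat["geneB", others]
profile_similarity <- cor(profile_a, profile_b)
```

In a real analysis each profile would hold ~26k values rather than three, so the similarity between two profiles would be far more stable than in this toy case.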
Principal component analysis (PCA) is used here to reduce the gene correlation matrix to principal components, so that this multidimensional dataset can be displayed as a 2-dimensional scatter plot. Similar genes will be closer together in PCA-space and will be considered part of the same cluster. NOTE: If 100+ genes are specified for analysis, t-SNE will be used instead of PCA for visualizing the data. Clusters are chosen by hierarchical clustering, independent of the PCA calculations.
To learn more about PCA, check out this article.
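As a rough sketch of the PCA approach, the base R functions `prcomp`, `dist`, `hclust`, and `cutree` can be combined as follows. The data are simulated, the gene labels (`g1` to `g6`) and the choice of two clusters are assumptions for the example, and the tool's actual settings may differ.

```r
set.seed(2)
# Toy correlation-like profiles: 6 listed genes (rows) x 200 genome genes (columns).
# Two simulated groups of listed genes follow different patterns.
pattern1 <- runif(200, -1, 1)
pattern2 <- runif(200, -1, 1)
profiles <- rbind(
  g1 = pattern1, g2 = pattern1, g3 = pattern1,
  g4 = pattern2, g5 = pattern2, g6 = pattern2
) + matrix(rnorm(6 * 200, sd = 0.1), nrow = 6)

# PCA on the profiles; the first two components give scatter-plot coordinates.
pca <- prcomp(profiles, center = TRUE, scale. = FALSE)
coords <- pca$x[, 1:2]

# Clusters come from hierarchical clustering of the profiles, not from the PCA.
hc <- hclust(dist(profiles), method = "complete")
clusters <- cutree(hc, k = 2)  # k = 2 is an assumption for this toy example
```

Plotting `coords` colored by `clusters` (for example with `plot(coords, col = clusters)`) gives the kind of scatter plot described above.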
While PCA derives principal components to explain variation in the data set, the Variant genes method first finds the most variable genes and then uses them to calculate the Euclidean distance between samples (this blog post by Dave Tang walks through distance calculations in R). In Correlation AnalyzeR, the output of hierarchical clustering on this distance matrix is a heatmap showing the top 1000 most divisive (variant) genes and the clusters they reveal in the input gene list.
To learn more about hierarchical clustering, check out this article.
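The Variant genes method can be approximated in a few lines of base R. This is a minimal sketch on simulated data: the matrix layout (listed genes in rows, genome genes in columns), the `top_n` value, and the two-cluster cut are assumptions chosen to keep the example small.

```r
set.seed(3)
n_genome <- 300
# Toy correlation-like profiles: 8 listed genes (rows) x 300 genome genes (columns).
profiles <- matrix(
  rnorm(8 * n_genome, sd = 0.05), nrow = 8,
  dimnames = list(paste0("g", 1:8), paste0("genome", seq_len(n_genome)))
)
# Only the first 20 genome genes separate the two simulated groups.
profiles[1:4, 1:20] <- profiles[1:4, 1:20] + 0.8
profiles[5:8, 1:20] <- profiles[5:8, 1:20] - 0.8

# Pick the most variable columns (the text describes the top 1000;
# 20 is used here because the toy matrix is small).
top_n <- 20
col_var <- apply(profiles, 2, var)
top_genes <- names(sort(col_var, decreasing = TRUE))[seq_len(top_n)]

# Euclidean distance on the variable genes, then hierarchical clustering.
d <- dist(profiles[, top_genes], method = "euclidean")
hc <- hclust(d)
clusters <- cutree(hc, k = 2)
```

Calling `heatmap(profiles[, top_genes])` on the same submatrix would draw a clustered heatmap comparable in spirit to the one described above.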
Pathway analysis answers the question "is this list of genes biologically meaningful?" It does this by comparing the gene list to known genesets/pathways (e.g. "Hallmark Oxidative Phosphorylation") to determine whether more genes from a geneset are present in the gene list than would be expected by random chance. The output of this analysis is a list of genesets ranked by the likelihood that they are enriched within the input gene list.
NOTE: "GeneRatio" represents the proportion of input list genes which belong to a pathway.
To learn more about pathway analysis, check out this article.
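One common way to test for this kind of over-representation is a hypergeometric test, shown below in base R. The text does not state which test the tool uses, so treat this as an illustrative assumption; the gene names, geneset, and universe size are all hypothetical.

```r
# Hypothetical sizes: a 20,000-gene universe, a 100-gene input list,
# and a 200-gene geneset that shares 30 genes with the input list.
universe_size <- 20000
input_list <- paste0("gene", 1:100)
geneset <- paste0("gene", c(1:30, 1001:1170))

overlap <- length(intersect(input_list, geneset))

# GeneRatio: proportion of input list genes that belong to the geneset.
gene_ratio <- overlap / length(input_list)

# P(overlap >= observed) if the input list were drawn at random from the universe.
p_value <- phyper(
  overlap - 1,                      # observed overlap minus one (upper tail)
  length(geneset),                  # genes in the geneset
  universe_size - length(geneset),  # genes outside the geneset
  length(input_list),               # genes drawn
  lower.tail = FALSE
)
```

By chance alone about one overlapping gene would be expected here (100 x 200 / 20000), so an overlap of 30 gives a very small p value.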
The R package for correlationAnalyzeR is available for download from GitHub:
## install.packages("devtools")
devtools::install_github("Bishop-Laboratory/correlationAnalyzeR")
Sample metadata was parsed using a manually curated, tissue-specific dictionary of regular expressions, which was sanity-checked against a randomized list of sample tissue metadata entries. This dictionary was supplemented by the BRENDA Tissue Ontology to provide robust sample categorization. The curated dictionary, accompanying pre-processing scripts, and final sample group assignments (as .RData objects) can be downloaded from the Correlation AnalyzeR GitHub repository.
The expression data were generated by ARCHS4, whose team generously provides public access to their expression data matrices and many other useful resources through their downloads page. Human expression data were downloaded from ARCHS4, and sample metadata were used to categorize samples as described in the section above. Count data were filtered to remove samples with low total gene counts (less than 5 million). Furthermore, genes with 0 counts in > 10% of samples were removed. Count data were then normalized using DESeq2 normalization and a variance stabilizing transform. Only tissue-disease categories with 30+ samples from 4+ different studies were considered, to ensure robust correlations. Correlations were calculated using the WGCNA cor function. The scripts used for pre-processing are also available in the Correlation AnalyzeR GitHub repository.
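The filtering steps above can be sketched in base R on simulated counts. Two stand-ins are used so the example runs without extra packages, and both are assumptions rather than the published pipeline: log2 counts-per-million replaces the DESeq2 variance stabilizing transform, and `stats::cor` replaces the WGCNA cor function. The actual pre-processing scripts are in the repository mentioned above.

```r
set.seed(4)
n_genes <- 200
n_samples <- 40
# Simulated count matrix: genes in rows, samples in columns.
counts <- matrix(
  rpois(n_genes * n_samples, lambda = 30000),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("sample", seq_len(n_samples)))
)
counts[, 1:3] <- rpois(n_genes * 3, lambda = 1000)  # three low-depth samples
counts[1:10, seq(1, n_samples, by = 2)] <- 0L       # ten genes with many zeros

# 1. Remove samples with fewer than 5 million total counts.
keep_samples <- colSums(counts) >= 5e6
filtered <- counts[, keep_samples, drop = FALSE]

# 2. Remove genes with 0 counts in more than 10% of the remaining samples.
zero_frac <- rowMeans(filtered == 0)
filtered <- filtered[zero_frac <= 0.10, , drop = FALSE]

# 3. Stand-in normalization (NOT DESeq2): log2 counts-per-million.
cpm <- t(t(filtered) / colSums(filtered)) * 1e6
log_cpm <- log2(cpm + 1)

# 4. Gene-by-gene correlations (stats::cor as a stand-in for WGCNA's cor).
gene_cor <- cor(t(log_cpm))
```

Whether genes are filtered before or after samples is not stated in the text; the order used here is one reasonable reading.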
Gene Set Enrichment Analysis (GSEA) is a versatile technique that calculates pathway enrichment for any genome-wide ranked gene list (specifically, the fGSEA method is used here). An overview of this approach is provided in an excellent blog post by Dave Tang. Usually, gene lists are ranked by some output of differential gene expression analysis -- but any valid ranking metric can be used to sort a gene list prior to GSEA. In this case, a gene is chosen (e.g. BRCA1) which has ~26k correlation values, one for each other gene in the genome. Ranking those genes by their correlation values produces a valid pre-ranked gene list for GSEA. The top enriched pathways produced by fGSEA analysis can be considered co-expressed/correlated with the gene.
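Building the pre-ranked list is the part that is easy to show in base R. The sketch below uses simulated expression values and hypothetical gene names (`gene1` to `gene5`); only the ranking step is shown, and the enrichment call itself is left as a comment because it needs the fgsea package and a set of pathways.

```r
set.seed(5)
n <- 60
# Simulated expression: rows are samples, columns are genes.
expr <- matrix(
  rnorm(n * 6), nrow = n,
  dimnames = list(NULL, c("BRCA1", paste0("gene", 1:5)))
)
expr[, "gene1"] <- expr[, "BRCA1"] + rnorm(n, sd = 0.2)   # strongly positive
expr[, "gene2"] <- -expr[, "BRCA1"] + rnorm(n, sd = 0.2)  # strongly negative

# Correlation of the chosen gene with every other gene, as a named vector.
other_genes <- colnames(expr) != "BRCA1"
cors <- cor(expr[, "BRCA1"], expr[, other_genes])[1, ]

# Pre-ranked list: genes sorted from most positively to most negatively correlated.
ranked <- sort(cors, decreasing = TRUE)

# A named, sorted vector like `ranked` is the kind of input a pre-ranked GSEA
# implementation expects, e.g. fgsea::fgsea(pathways = <list of genesets>, stats = ranked).
```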
Paired-mode involves a user choosing one primary gene and a list of secondary genes to determine if the secondary gene list is correlated with the primary gene. The correlation of these genes is compared to random chance by using a permutation approach. In simple terms, the analysis asks "is the degree of correlation for the selected gene list greater than if someone were to randomly select the same number of genes?"
For each permutation, a list of random genes with the same length as the secondary gene list is chosen. Then, a two-sided t test is performed to determine whether the absolute correlation values of the randomly-chosen genes are significantly different from those of the secondary gene list. By performing 2000 permutations, a distribution of t test p values is generated. If this distribution forms a peak below p=0.05, then it is determined that the secondary gene list is significantly correlated with the primary gene compared to random chance. Absolute correlations were used so that strong correlations would be treated equally regardless of sign.
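The permutation procedure can be sketched in base R as follows. The correlation values are simulated, the gene names are hypothetical, and summarizing the p value distribution by the fraction below 0.05 is an assumption made for the example; the text only says the distribution should form a peak below p=0.05.

```r
set.seed(6)
n_genome <- 5000
# Simulated correlations of every gene with the primary gene.
cors <- rnorm(n_genome, mean = 0, sd = 0.15)
names(cors) <- paste0("gene", seq_len(n_genome))

# Hypothetical secondary gene list, simulated to correlate strongly with the primary gene.
secondary <- paste0("gene", 1:40)
cors[secondary] <- rnorm(length(secondary), mean = 0.6, sd = 0.1)
abs_secondary <- abs(cors[secondary])

# For each permutation, draw a random gene list of the same length and compare
# its absolute correlations with those of the secondary list (two-sided t test).
n_perm <- 2000
p_values <- replicate(n_perm, {
  random_genes <- sample(names(cors), length(secondary))
  t.test(abs(cors[random_genes]), abs_secondary)$p.value
})

# One simple summary of the resulting distribution of p values.
frac_below <- mean(p_values < 0.05)
```

A histogram of the result (`hist(p_values)`) shows whether the distribution peaks below 0.05.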
The Molecular Signatures Database (MSigDB) provides thousands of gene sets across several collections. The table below shows how the corGSEA annotations from "Single gene" and "Gene vs gene" mode relate to the categories in MSigDB, which are found here.
For website support and additional assistance, please contact Henry Miller.
For bug reports, please open an issue and someone will address it shortly.
Visit our webpage to learn more about the Bishop Laboratory, part of the Greehey Children's Cancer Research Institute.