Correlation AnalyzeR
Biological insights from gene correlation data

Please acknowledge correlationAnalyzeR in your publications by citing the following reference:
H.E. Miller, correlationAnalyzeR, (2021), GitHub repository,

Recent initiatives to re-process and standardize publicly available RNA-Sequencing data have opened the door to mass-scale gene expression analysis. Utilizing the ARCHS4 data repository, we obtained standardized gene counts from over 200,000 sequencing samples, parsed them into tissue and disease groups, and derived genome-wide co-expression correlations. Correlation AnalyzeR allows users to explore these correlations and extract the wealth of biological insights they can provide. It can be accessed through this website or by downloading the corresponding R package.

Correlation AnalyzeR provides four primary analysis modes:

  1. Single gene
  2. Gene vs gene
  3. Gene vs gene list
  4. Topology

Single gene

Use single gene mode to reveal the genome-wide co-expression correlation values for any gene of interest in multiple tissues and disease conditions:

Uncover co-expressed pathways with correlation-based geneset enrichment analysis (corGSEA):

To learn more about GSEA, check out this article.

Gene vs gene

Use Gene vs gene mode to reveal tissue- and disease-specific differences between two genes of interest:

Uncover divergent pathway correlations between two genes of interest with corGSEA:

Gene vs gene list

Use Gene vs gene list mode to reveal the location of secondary genes within a primary gene's correlation distribution:

Run significance tests to determine whether the secondary genes are more strongly correlated with the primary gene than expected by random chance:

Topology mode

Use principal component analysis (PCA) to explore the subgroups within an input gene list (Dimension reduction method):

Explore clusters and variant genes with an interactive heatmap (Variant genes method):

Enrich for pathways within a list of interesting genes (Pathway enrichment method):

For more information, see Topology analysis explained in the Topology mode tab.

Topology analysis explained

High throughput experiments can generate a massive amount of unstructured data, making it challenging to uncover biologically meaningful results. For example, RNA sequencing analysis may yield 2000+ differentially expressed genes between two conditions but will not provide any information about how those genes relate to each other or known biological pathways. Without a computational method to parse this list, exploratory analysis of these data is inefficient and new hypotheses are not generated reliably.

Topology analysis here refers to computational methods for parsing large gene lists. These analyses aim to answer two questions:

  1. "Are there any important gene groups within this list?"
  2. "Is this particular list of genes biologically significant in some way?"

Clustering identifies important groups within a gene list

Every gene in a gene list has ~26k correlation values associated with it -- these represent the correlation of that gene with all other genes in the genome. Any two genes are likely to display some similarities in their correlation value distributions (e.g. both ATM and BRCA1 correlate highly with BRCA2), and also some differences (e.g. BRCA1 correlates with CCNB1 but ATM does not). In a gene list, a gene group is a set of members that share correlations with each other which they do not share with the rest of the list. In Correlation AnalyzeR, multiple approaches are used to find these groups (explained below).

Dimensionality reduction plot (PCA)

Principal component analysis (PCA) is used here to reduce the gene correlation matrix to its principal components, so that this multidimensional dataset can be displayed as a 2-dimensional scatter plot. Similar genes will be closer together in PCA space and will be considered part of the same cluster. NOTE: If 100+ genes are specified for analysis, t-SNE will be used instead of PCA for visualizing the data. Clusters are chosen by hierarchical clustering, independent of the PCA calculations.

Example output from Correlation AnalyzeR

PCA analysis in Correlation AnalyzeR

To learn more about PCA, check out this article.

Variant genes (Interactive heatmap)

While PCA derives principal components to explain variation in the data set, the Variant genes method first finds the top variable genes and then uses them to calculate the Euclidean distance between samples (this blog post by Dave Tang walks through distance calculations in R). In Correlation AnalyzeR, hierarchical clustering is run on this distance matrix, and the output is a heatmap showing the top 1000 most divisive (variant) genes and the clusters they reveal in the input gene list.
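The two steps described above (pick the most variable genes, then measure Euclidean distance on them) can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the package's own code: the function names, the `profiles` layout, and the correlation values are all made up for the example, and it assumes that "variable" means highest variance across the input genes.

```python
import math
from itertools import combinations
from statistics import pvariance


def top_variable_genes(profiles, n_top=1000):
    """Return the genome genes whose correlations vary most across input genes.

    `profiles` maps each input gene to {genome gene: correlation value}.
    (Hypothetical layout chosen for this sketch.)
    """
    input_genes = list(profiles)
    genome_genes = list(profiles[input_genes[0]])
    spread = {
        g: pvariance([profiles[i][g] for i in input_genes]) for g in genome_genes
    }
    return sorted(genome_genes, key=lambda g: spread[g], reverse=True)[:n_top]


def euclidean(a, b, genes):
    """Euclidean distance between two correlation profiles over `genes`."""
    return math.sqrt(sum((a[g] - b[g]) ** 2 for g in genes))


# Toy correlation values, invented for illustration only.
profiles = {
    "ATM": {"BRCA2": 0.8, "CCNB1": 0.1, "ACTB": 0.2},
    "BRCA1": {"BRCA2": 0.8, "CCNB1": 0.7, "ACTB": 0.2},
    "CDK1": {"BRCA2": 0.3, "CCNB1": 0.9, "ACTB": 0.2},
}

variant = top_variable_genes(profiles, n_top=2)
distances = {
    (a, b): euclidean(profiles[a], profiles[b], variant)
    for a, b in combinations(profiles, 2)
}
for pair, d in distances.items():
    print(pair, round(d, 3))
```

The resulting pairwise distances are what a hierarchical clustering routine would take as input; genes separated by small distances end up in the same branch of the heatmap dendrogram.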

Example output from Correlation AnalyzeR

Hierarchical clustering in Correlation AnalyzeR

To learn more about hierarchical clustering, check out this article.

Pathway analysis explains the functional significance of a gene list

Pathway analysis answers the question "Is this list of genes biologically meaningful?" It does this by comparing the gene list to known genesets/pathways (e.g. "Hallmark Oxidative Phosphorylation") to determine whether more genes from a geneset are present in the gene list than would be expected by random chance. The output of this analysis is a list of genesets ranked by the likelihood that they are enriched within the input gene list.
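One common way to ask "more than expected by chance?" is a hypergeometric (over-representation) test. The sketch below shows that idea in Python with the standard library; whether Correlation AnalyzeR uses exactly this test is an assumption, and the function name, gene names, and universe size are hypothetical. It also assumes every gene in the list and in the geneset belongs to the background universe.

```python
from math import comb


def enrichment_test(gene_list, geneset, universe_size):
    """Over-representation test for one geneset.

    Returns (overlap, gene_ratio, p_value), where p_value is the
    hypergeometric probability of seeing at least `overlap` geneset
    members when drawing len(gene_list) genes at random.
    """
    gene_list, geneset = set(gene_list), set(geneset)
    n = len(gene_list)              # genes in the input list
    K = len(geneset)                # genes in the pathway
    k = len(gene_list & geneset)    # observed overlap
    total = comb(universe_size, n)
    p_value = sum(
        comb(K, i) * comb(universe_size - K, n - i)
        for i in range(k, min(n, K) + 1)
    ) / total
    gene_ratio = k / n              # proportion of input genes in the pathway
    return k, gene_ratio, p_value


# Toy example: 20 genes in the universe, a 5-gene pathway, a 4-gene input list.
pathway = {"G1", "G2", "G3", "G4", "G5"}
input_list = ["G1", "G2", "G3", "G9"]
overlap, gene_ratio, p_value = enrichment_test(input_list, pathway, universe_size=20)
print(overlap, gene_ratio, round(p_value, 4))
```

Running the same test for every geneset and sorting by p value gives a ranked list of the kind described above. In practice the p values would also be adjusted for multiple testing.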

Example output from Correlation AnalyzeR

Pathway enrichment in Correlation AnalyzeR

NOTE: "GeneRatio" represents the proportion of input list genes which belong to a pathway.

To learn more about pathway analysis, check out this article.


FAQs and additional information


How can I download the R package for Correlation AnalyzeR?

The R package for correlationAnalyzeR is available for download from GitHub.

## install.packages("devtools")

How were sample metadata parsed to generate tissue- and disease-specific expression data?

Sample metadata was parsed using a manually curated dictionary of tissue-specific regular expressions, which was sanity-checked against a randomized list of sample tissue metadata entries. This dictionary was supplemented by the BRENDA Tissue Ontology to provide robust sample categorization. The curated dictionary, accompanying pre-processing scripts, and final sample group assignments (as .RData objects) can be downloaded from the Correlation AnalyzeR GitHub repository.

How was the expression data processed into correlations?

The expression data was generated by ARCHS4. They generously provide public access to their expression data matrices and many other useful resources through their downloads page. Human expression data was downloaded from ARCHS4, and sample metadata was used to categorize samples as described in the section above. Count data was filtered to remove samples with low total gene counts (less than 5 million). Furthermore, genes with 0 counts in > 10% of samples were removed. Count data was then normalized using DESeq2 normalization and a variance stabilizing transform (info). Only tissue-disease categories with 30+ samples from 4+ different studies were considered, to ensure robust correlations. Correlations were calculated using the WGCNA cor function. The scripts used for pre-processing are also available in the Correlation AnalyzeR GitHub repository.
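The filtering thresholds above can be made concrete with a small sketch. This is an illustration in Python with the standard library, not the pre-processing scripts themselves (which are written in R): the function names and data layout are hypothetical, the gene filter is assumed to run on the samples that survive the sample filter, and the DESeq2 normalization and variance stabilizing transform steps are deliberately left out.

```python
MIN_TOTAL_COUNTS = 5_000_000   # samples below this total are dropped
MAX_ZERO_FRACTION = 0.10       # genes with zeros in more samples are dropped


def filter_counts(counts):
    """Apply the sample filter, then the gene filter.

    `counts` maps gene -> list of raw counts, one per sample, same order.
    """
    n_samples = len(next(iter(counts.values())))
    totals = [sum(row[s] for row in counts.values()) for s in range(n_samples)]
    keep = [s for s in range(n_samples) if totals[s] >= MIN_TOTAL_COUNTS]
    if not keep:
        return {}
    filtered = {}
    for gene, row in counts.items():
        kept = [row[s] for s in keep]
        zero_fraction = sum(1 for c in kept if c == 0) / len(kept)
        if zero_fraction <= MAX_ZERO_FRACTION:
            filtered[gene] = kept
    return filtered


def category_is_eligible(study_ids, min_samples=30, min_studies=4):
    """`study_ids` holds the study of origin for each sample in a category."""
    return len(study_ids) >= min_samples and len(set(study_ids)) >= min_studies


# Toy counts, invented for illustration: the third sample has too few reads.
counts = {
    "GENE_A": [3_000_000, 2_500_000, 1_000, 4_000_000],
    "GENE_B": [2_500_000, 3_000_000, 2_000, 3_500_000],
    "GENE_C": [0, 0, 500, 1_000_000],
}
filtered = filter_counts(counts)
print(sorted(filtered))
print(category_is_eligible(["S1", "S2", "S3", "S4"] * 8))
```

In this toy input, the third sample is removed for low depth and GENE_C is removed because it has zero counts in two of the three remaining samples.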

How is GSEA implemented on gene correlations?

Geneset Enrichment Analysis (GSEA) is a versatile technique that calculates pathway enrichment for any genome-wide ranked gene list (specifically, the fGSEA method is used here). An overview of this approach is provided in an excellent blog post by Dave Tang. Usually, gene lists are ranked by some output of differential gene expression analysis -- but any valid ranking metric can be used to sort a gene list prior to GSEA. In this case, a gene is chosen (e.g. BRCA1) which has ~26k correlation values, one for every other gene in the genome. Ranking these genes by their correlation values produces a valid pre-ranked gene list for GSEA. The top enriched pathways produced by fGSEA analysis can be considered co-expressed/correlated with the gene.
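To make the ranking step concrete, here is a minimal sketch in Python with the standard library. It builds the pre-ranked list from one gene's correlation values and then computes an unweighted running-sum enrichment score, which is a simplified stand-in for what fGSEA calculates (fGSEA uses a weighted score and estimates significance as well). The function names and all correlation values are invented for illustration.

```python
def rank_by_correlation(correlations, gene_of_interest):
    """Return genes sorted from most positively to most negatively correlated.

    `correlations` maps gene -> correlation with `gene_of_interest`.
    """
    others = [g for g in correlations if g != gene_of_interest]
    return sorted(others, key=lambda g: correlations[g], reverse=True)


def enrichment_score(ranked_genes, geneset):
    """Unweighted running-sum score: step up on a hit, down on a miss.

    Returns the running-sum value with the largest absolute deviation
    from zero. Positive means the geneset sits near the top of the ranking.
    """
    hits = [g in geneset for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy correlations with BRCA1, invented for illustration only.
correlations = {
    "BRCA1": 1.0, "BRCA2": 0.9, "RAD51": 0.8, "CCNB1": 0.6,
    "ACTB": 0.1, "ALB": -0.3, "INS": -0.5,
}
ranked = rank_by_correlation(correlations, "BRCA1")
top_score = enrichment_score(ranked, {"BRCA2", "RAD51"})
bottom_score = enrichment_score(ranked, {"ALB", "INS"})
print(ranked)
print(top_score, bottom_score)
```

A geneset concentrated at the top of the ranking gets a positive score (co-expressed with the gene), while one concentrated at the bottom gets a negative score (anti-correlated).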

How is significance testing implemented in paired-mode?

Paired-mode involves a user choosing one primary gene and a list of secondary genes to determine if the secondary gene list is correlated with the primary gene. The correlation of these genes is compared to random chance by using a permutation approach. In simple terms, the analysis asks "is the degree of correlation for the selected gene list greater than if someone were to randomly select the same number of genes?"
For each permutation, a list of random genes with the same length as the secondary gene list is chosen. Then, a two-sided t test is performed to determine whether the absolute correlation values of the randomly chosen genes are significantly different from those of the secondary gene list. By performing 2000 permutations, a distribution of t test p values is generated. If this distribution forms a peak below p=0.05, the secondary gene list is considered significantly correlated with the primary gene compared to random chance. Absolute correlations are used so that strong correlations are treated equally regardless of sign.
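The permutation loop described above can be sketched as follows, in Python with the standard library only. Several parts are assumptions rather than the package's implementation: the p value comes from a Welch statistic with a normal approximation (a stand-in for a proper two-sided t test, which needs a t distribution), random genes are drawn from all genes including the secondary list, and the median is used as one possible summary of where the p value distribution peaks. All names and correlation values are hypothetical.

```python
import random
from statistics import NormalDist, mean, median, variance


def approx_two_sided_p(x, y):
    """Welch statistic with a normal approximation to the two-sided p value."""
    se = (variance(x) / len(x) + variance(y) / len(y)) ** 0.5
    if se == 0:
        return 1.0 if mean(x) == mean(y) else 0.0
    z = (mean(x) - mean(y)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def permutation_p_values(correlations, secondary_genes, n_perm=2000, seed=0):
    """Compare the secondary list against random lists of the same size.

    `correlations` maps gene -> correlation with the primary gene.
    Needs at least two secondary genes so a variance can be computed.
    """
    rng = random.Random(seed)
    pool = sorted(correlations)  # sorted so a given seed is reproducible
    observed = [abs(correlations[g]) for g in secondary_genes]
    p_values = []
    for _ in range(n_perm):
        drawn = rng.sample(pool, len(secondary_genes))
        random_abs = [abs(correlations[g]) for g in drawn]
        p_values.append(approx_two_sided_p(observed, random_abs))
    return p_values


# Toy data, invented for illustration: weak background correlations plus
# ten secondary genes that are strongly correlated with the primary gene.
correlations = {f"GENE{i}": ((i % 21) - 10) / 100 for i in range(500)}
secondary = [f"SEC{i}" for i in range(10)]
correlations.update({g: 0.80 + 0.01 * i for i, g in enumerate(secondary)})

p_values = permutation_p_values(correlations, secondary, n_perm=200)
print(median(p_values))
```

The demo uses 200 permutations to keep it quick; passing `n_perm=2000` matches the number described above.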

What are the categories of MSigDB annotations?

The Molecular Signatures Database (MSigDB) provides thousands of gene sets across several collections. Please see the table below to see how the corGSEA annotations from "Single gene" and "Gene vs gene" mode relate to the categories in MSigDB, which are found here.

Contact information

For website support and additional assistance, please contact Henry Miller.

For bug reports, please open an issue and someone will address it shortly.

Visit our webpage to learn more about the Bishop Laboratory, part of the Greehey Children's Cancer Research Institute.