Sunday, March 12, 2017

Correlations In Random Genomic Data: A Simple Biology Pitfall

Wow it has been a long time since we have had a post on here! As always, that means other projects are in the works and the blog has taken a bit of a back seat, but we are back and ready to talk science. This week I wanted to get to a topic I have been meaning to get to for a while. If you are a frequent reader, you know that every now and then I like to go over some basic statistics topics that cause confusion among biologists, as well as scientists in other fields. This week I want to cover a common statistical pitfall, with the hopes that it will prevent readers from making simple mistakes. The topic for this post will be obtaining statistically significant correlations from random gene expression data.

To kick things off, let's imagine I have a gene expression dataset. More specifically, I have expression data for gene 1 and gene 2, as well as a housekeeping gene (these genes are usually used for experimental controls). Ultimately I want to compare expression of gene 1 and gene 2 in 50 different people, with my hypothesis being that the expression of these genes are positively correlated with each other (when gene 1 is highly expressed, gene 2 is also highly expressed).

We could answer this type of question with real data, but what if we use random data? In R (an awesome statistical programming language), I can generate a random set of 50 numbers, representing the gene expression of gene 1 and gene 2 among 50 different people. Please note that for consistency, I set the random seed so that the code always returns the same result. Also note that the visualization is done using log values to make it clearer, but the correlations are the raw values, and not log transformed.

# Load the ggplot2 library
# Set the random seed
# Create two sets of 50 random numbers
x <- sample(1:10000, 50)
y <- sample(1:10000, 50)
# Put them together in a data frame
df <- data.frame(gene1 = x, gene2 = y)
# Plot the results
qplot(log10(gene1), log10(gene2), data = df) + theme_classic()
cor.test(df$gene1, df$gene2)

The resulting correlation had a p-value = 0.59 and a correlation (r) of 0.077 (using Pearson correlation coefficient). We therefore generated a set of random expression values for genes 1 and 2, and when we plotted them against each other, we got a random distribution of points. This set of gene expressions was not correlated.

But if this was actually an experimental result, we might think "of course there is no correlation, we forgot to correct our results with our housekeeping gene control". For readers unfamiliar with this practice, gene expression data can sometimes be skewed by differences in loading the machine, preparing samples, etc. This can mean some samples look more abundant due to experimental variability instead of biological variability. To correct for this, we can use "housekeeping genes", which are genes that we expect to be expressed about the same amount in each person.

To correct our data for experimental variability, we may divide each of our gene 1 and gene 2 expression values by the housekeeping gene value from that sample. Therefore, if sample 1 had twice as much material loaded as sample 2, we would divide it by twice as much housekeeping gene, and the result would be values over approximately the same housekeeping gene expression. In our example, we can make this correction using a third set of randomly generated numbers.

# Building off of the previous code
# Create random housekeeping gene expression data
z <- sample(1:10000, 50)
df2 <- data.frame(gene1 = x/z, gene2 = y/z)
qplot(log10(gene1), log10(gene2), data = df2) +
theme_classic() +
geom_smooth(method=lm, se = FALSE)
cor.test(df2$gene1, df2$gene2)

With this correction using more random data, we would expect to see another lack of correlation. On the contrary, the resulting correlation had a p-value = 3.4e-9 and a correlation (r) of 0.72. Even though we were using entirely random data, we obtained a much higher and statistically significant correlation between expression of gene 1 and 2. Here we can see that we did something terribly wrong to get this result, but if we had done this on a biological dataset, we might think that we had found a great result and may even push to publish it.

So what did we do wrong? Ultimately it was that "correction" that hurt us. When we divided each person's gene 1 and gene 2 expression value by the same "housekeeping" value for that person, we were introducing a common transformation within each sample that made gene 1 and 2 expression more similar to each other, within each person. This is why it is problematic when we apply functions (such as division) to each sample individually and then perform a correlation of those samples. This is also why we have to be careful in our correlation analyses and think carefully about how we are dealing with our data, and what correlations we might be introducing by mistake.

Of course this is just a simple example and the principle could apply to many scenarios. The main point here is to outline a potential way we might skew our results without knowing it. Overall I hope this summary was informative and will help in thinking about analyses in future experiments.

Any questions, comments, or concerns? Always feel welcome to reach out in the comment section below, or reach out to me on Twitter (my Twitter link is on the right).

1 comment:

  1. This is such an informative post for people who have no idea about this advanced tech knowledge. Thank you so much for making me understand.