Making the most of statistics in biology
Written by Pierre Grognet and Gaëlle Lelandais
Using statistics is an ambivalent task for biologists: both necessary and imposed. It is “necessary” because biologists know that their experimental observations can be affected by measurement errors, so repeating experiments is essential. They also understand that external validation of their results and interpretations is required. However, using statistics is also an “imposed”, unpleasant task. Even when biologists are very confident that the observed effect exists (based on experiments done by others or on their in-depth knowledge of their biological system), statistical validation is still required. Statistics is therefore used with apprehension and a sense of futility (“I'm sure the biological effect exists, why isn't that enough?”).
We believe that these mixed and opposing feelings towards a discipline (statistics) that is essential in all fields of biology stem from a flawed way of learning it. Indeed, statistics is often taught as a set of "cooking recipes": depending on the situation, one recipe or another must be applied. All you then have to do is choose the right one; and if the result isn't what was originally expected, why not just change the recipe? The problem is that, while thinking of statistics as "cooking recipes" allows experimental results to be analysed quickly, it does not help with interpreting the data in depth, strengthening the understanding of observations, or developing a critical eye on the results. This deeper understanding is nevertheless the key to the proper use of statistics.
In this article, our aim is first to explain five specific points of vigilance that will help one make the most of statistics in one’s own research projects. In the second part, we illustrate our remarks with an example.
Five points of vigilance to get more from statistical results
1) Be careful not to confuse “scientific hypotheses” and “statistical hypotheses”
Scientific hypotheses are the reason why experiments are performed. They are essential for the direction of the study. A scientific hypothesis can be that a drug successfully affects its target, or that a given protein is involved in a biological process. Statistical hypotheses, on the other hand, are very different. They are often summarised as “H0” (the null hypothesis) and “H1” (the alternative hypothesis), and we tend to forget that they are based on a “model”. This model is a mathematical representation of a phenomenon, i.e. a mathematical equation whose parameters are estimated from observations (for example, that of the normal distribution). The great benefit of a model is that it can be used to make “predictions”. Sometimes models are relevant and represent reality very well, but sometimes they do not. Thus, deciding between H0 and H1 does not validate or invalidate the scientific hypothesis; rather, it tells you whether or not your data are compatible with the model. We are not saying that statistics are not useful! But this distinction must be considered carefully, especially in the broader context of a deep reflection on the initial scientific hypothesis.
2) Remember that choosing a hypothesis (H0 or H1) does not mean that it is true
Formally, applying a statistical hypothesis test consists of "rejecting" or "not rejecting" the null hypothesis (H0), for a predefined type I error rate (the rate of so-called “false positives”, usually 5%). A common over-interpretation of this result is to conclude that if H0 is not rejected, H0 is "true", or that if H0 is rejected, H1 is "true". Explaining this point in more detail is beyond the scope of this article, but we encourage readers to rephrase their conclusions, for instance as: "rejecting H0 means that, if the statistical modelling of the H0 hypothesis is correct, the observations I have collected, i.e. my experimental measurements, are rather unlikely to be observed". No more, no less. This is a bit disappointing, but it must be put into perspective. Again, statistical hypotheses are not scientific hypotheses! Finally, it is important to keep in mind that the H1 hypothesis is not modelled during statistical hypothesis testing, and thus it is (unfortunately) not possible to conclude whether it is relevant or not.
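To make this rephrasing concrete, here is a minimal sketch of what a p-value measures. The experiment is hypothetical (16 of 20 individuals showing a phenotype), and the H0 model is an assumption chosen for illustration: each individual shows the phenotype with probability 0.5. The function names are ours. The p-value is simply the probability, under that model, of outcomes at most as probable as the one observed.

```python
from math import comb

def binomial_pmf(k, n, p0):
    """Probability of exactly k successes in n trials under the H0 model."""
    return comb(n, k) * p0**k * (1 - p0) ** (n - k)

def two_sided_binomial_pvalue(observed, n, p0=0.5):
    """Exact two-sided p-value: sum the probabilities, under H0, of every
    outcome that is at most as probable as the observed one."""
    p_obs = binomial_pmf(observed, n, p0)
    return sum(
        binomial_pmf(k, n, p0)
        for k in range(n + 1)
        # small relative tolerance so that floating-point ties are included
        if binomial_pmf(k, n, p0) <= p_obs * (1 + 1e-9)
    )

# Hypothetical experiment: 16 of 20 individuals show the phenotype.
p_value = two_sided_binomial_pvalue(16, 20, p0=0.5)
print(f"p-value under the H0 model: {p_value:.4f}")
```

A small p-value here only says that the data are unlikely if the 50% model is correct. It says nothing about which alternative explanation holds, since H1 is never modelled.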
3) Don't overinterpret the p-value: it depends on several sample parameters
The p-value is usually the only piece of information that matters to someone who uses statistics as a "cooking recipe". If the p-value is less than 0.05, the results are declared "significant" (H0 is rejected). However, several parameters influence the p-value. For example, in a test comparing means (t-test), the "effect size", i.e. the difference between the means observed in the two samples, is logically linked to the p-value: the larger the difference between the means, the lower the p-value. This is generally well understood. What is less well known is that the dispersion of the observations within the samples, as well as the number of measurements in each sample, also have a great impact on the calculated p-value, independently of the effect size. It is thus impossible to understand, by examining only the final p-value, why the results are or are not declared “significant”. We therefore advise you not to overinterpret the p-value alone, to consider other statistical parameters in your final decision and, more importantly, to look at your data using graphical representations.
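The influence of dispersion and sample size can be sketched with a few lines of Python. This is an illustration under simplifying assumptions, not a replacement for a proper t-test: both groups are assumed to share the same standard deviation and size, and a normal approximation (z-test) is used instead of the t distribution so that only the standard library is needed. All numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_pvalue(mean_diff, sd, n):
    """Two-sided p-value for a difference between two group means.

    Assumes both groups have the same standard deviation `sd` and the same
    size `n`, and uses a normal approximation (z-test) rather than the
    exact t distribution.
    """
    standard_error = sd * sqrt(2 / n)
    z = abs(mean_diff) / standard_error
    return 2 * (1 - NormalDist().cdf(z))

# The observed difference between means (effect size) is 1.0 in every scenario.
scenarios = {
    "reference (sd=1.0, n=10)": two_sample_pvalue(1.0, sd=1.0, n=10),
    "more dispersion (sd=2.0, n=10)": two_sample_pvalue(1.0, sd=2.0, n=10),
    "more measurements (sd=2.0, n=40)": two_sample_pvalue(1.0, sd=2.0, n=40),
}
for label, p in scenarios.items():
    print(f"{label}: p = {p:.4f}")
```

With the same effect size, doubling the dispersion moves the result from below to above the 0.05 threshold, and quadrupling the number of measurements brings it back. The p-value alone does not reveal which of these factors is responsible.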
4) Understand that adjusting a p-value is important for multiple testing, but it does not change the “significance order” in results
It is common to read in statistics tutorials: “Warning! You must use the adjusted p-value”. This is indeed a prudent recommendation, especially in the context of omics data analysis. In a few words, the difference between the p-value and the adjusted p-value is that the latter takes into account the fact that multiple tests were used to analyse the data. The p-value arising from a single test is thus corrected (or “adjusted”) to control the type I error across all tests. As a result, adjusted p-values are never lower than the original p-values, and if a threshold (for instance 0.01) is chosen to select significant results, fewer results are obtained using adjusted p-values than using the original p-values. It is important, in this context, to understand that adjusting the p-value does not change the order of the results: the results with the smallest p-values are still the ones with the smallest adjusted p-values. Thus, the choice between p-value and adjusted p-value does not matter when your aim is to sort the results (as we often do) from the most significant to the least significant.
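The following sketch illustrates this with two widely used adjustment procedures, Bonferroni and Benjamini-Hochberg, written from their textbook definitions. The five raw p-values are hypothetical and the helper names are ours. Both adjustments are monotonic, so the ranking is never reversed (although ties can appear among adjusted values).

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjustment (step-up procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def ranking(values):
    """Indices of the tests, from most to least significant."""
    return sorted(range(len(values)), key=lambda i: values[i])

raw = [0.0001, 0.03, 0.004, 0.2, 0.012]  # hypothetical raw p-values
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)

print("ranking (raw):       ", ranking(raw))
print("ranking (Bonferroni):", ranking(bonf))
print("ranking (BH):        ", ranking(bh))
print("significant at 0.05: ",
      sum(p < 0.05 for p in raw), "raw,",
      sum(p < 0.05 for p in bh), "BH,",
      sum(p < 0.05 for p in bonf), "Bonferroni")
```

The three rankings are identical; only the number of results passing a given threshold changes.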
5) Understand that the False Discovery Rate (FDR) does not give the same information as an adjusted p-value
The concept of FDR is particularly used in the context of omics data analyses, where multiple tests are performed to, for instance, identify differentially expressed genes between several conditions (RNAseq data analysis) or identify over-represented functions in a list of candidate genes (functional enrichment analysis). Unlike an adjusted p-value, the FDR relates to a set of significant results. This means that, before it can be calculated, a threshold on p-values must first have been chosen to define the positive results. Many strategies for calculating the FDR exist, but they all share the objective of giving a false positive rate that concerns only the list of positive results (e.g. differentially expressed genes or enriched functions), and not all of the tests. Another way to understand the difference is the following: a threshold of 0.05 on a p-value adjusted to control the type I error across all tests (for instance with the Bonferroni correction) limits to 5% the probability of obtaining even one false positive among all the tests, whereas a threshold of 0.05 on the FDR means that about 5% of the significant tests are expected to be false positives. Using the FDR therefore makes explicit the risk of false discoveries in further data exploration.
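As an illustration of how the FDR is attached to a list of positives rather than to a single test, here is a minimal sketch of one simple estimation strategy among the many that exist. It assumes, conservatively, that every test is a true null, so that about m × threshold false positives are expected among m tests, and divides this expectation by the number of results actually declared positive. The function name and the p-values are hypothetical.

```python
def estimated_fdr(pvalues, threshold):
    """Plug-in FDR estimate for the list of results with p <= threshold.

    Conservative assumption: all m tests are true nulls, so that
    m * threshold false positives are expected.
    """
    m = len(pvalues)
    n_positive = sum(p <= threshold for p in pvalues)
    if n_positive == 0:
        return 0.0  # no positives, hence no discoveries to be false
    expected_false_positives = m * threshold
    return min(1.0, expected_false_positives / n_positive)

# Hypothetical p-values from 10 tests.
pvalues = [0.001, 0.002, 0.003, 0.004, 0.2, 0.4, 0.6, 0.8, 0.9, 0.95]

for threshold in (0.01, 0.5):
    n_pos = sum(p <= threshold for p in pvalues)
    fdr = estimated_fdr(pvalues, threshold)
    print(f"threshold {threshold}: {n_pos} positives, estimated FDR = {fdr:.3f}")
```

The estimate changes with the threshold because the list of positives changes: a strict threshold gives a short list with few expected false discoveries, a permissive one gives a longer list in which most results may be false.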
Use case - what are the overrepresented functions in my list of differentially expressed genes?
To illustrate these points of vigilance, we decided to take the examples of differential expression and functional enrichment analyses. We used some of the RNAseq results published in the following paper:
Glock, Caspar, et al. “The Translatome of Neuronal Cell Bodies, Dendrites, and Axons”. Proceedings of the National Academy of Sciences, vol. 118, no. 43, October 2021, p. e2113929118. https://doi.org/10.1073/pnas.2113929118.
In this work, the authors performed transcriptome and translatome experiments on microdissected rodent brain slices. Briefly, to identify which transcripts come from which cell type, they also performed RNAseq experiments on cell cultures. The associated results used to identify glia-specific transcripts are provided online (dataset S1).
We used these data to plot the differential expression between glia- and neuron-enriched cultures (Figure 1A). On this plot, each gene is represented by a dot positioned according to its logFC value on the X-axis (i.e. a score of the difference in expression between the two experimental conditions) and its adjusted p-value on the Y-axis. Notably, logFC and p-values are two statistical parameters that result from the application of a statistical testing procedure. They can be used to create lists of genes that are “good candidates” for being differentially expressed in one condition or the other. Indeed, in line with points of vigilance #1 and #2 described in the first part of this article, the “reality” of the differential expression is not formally demonstrated here, but only strongly suspected (and this is already very powerful!). In accordance with point of vigilance #3, we did not select our genes based on a p-value threshold alone, but added a threshold on logFC (this is the “effect size”), and thus defined three lists of genes (S1, S2 and S3) corresponding to three levels of significance (S1 being included in S2, and S2 being included in S3). Note that using the original or the adjusted p-value would have had no impact on the relative contents of the three lists. This illustrates point of vigilance #4: the significance order of the results remains unchanged.
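The selection logic described above can be sketched as follows. The gene records, the logFC cut-off of 1.0, the sign convention (positive logFC meaning enrichment in the condition of interest) and the assignment of the three adjusted p-value thresholds to S1, S2 and S3 are all assumptions made for illustration; only the threshold values themselves come from the legend of Figure 1.

```python
# Hypothetical differential expression results (not taken from dataset S1).
genes = [
    {"name": "geneA", "logFC": 6.2, "padj": 1e-120},
    {"name": "geneB", "logFC": 4.8, "padj": 1e-60},
    {"name": "geneC", "logFC": 3.1, "padj": 1e-5},
    {"name": "geneD", "logFC": 0.4, "padj": 1e-80},    # tiny p-value, small effect
    {"name": "geneE", "logFC": 5.0, "padj": 0.3},      # large effect, not significant
    {"name": "geneF", "logFC": -4.0, "padj": 1e-110},  # enriched in the other condition
]

MIN_LOGFC = 1.0  # assumed effect-size cut-off, for illustration only

def select_genes(genes, max_padj, min_logfc=MIN_LOGFC):
    """Keep genes that pass both the adjusted p-value and the logFC thresholds."""
    return {
        g["name"]
        for g in genes
        if g["padj"] <= max_padj and g["logFC"] >= min_logfc
    }

# Assumed mapping: S1 is the strictest list, S3 the most permissive.
s1 = select_genes(genes, max_padj=1e-100)
s2 = select_genes(genes, max_padj=1e-50)
s3 = select_genes(genes, max_padj=1e-2)

print("S1:", sorted(s1))
print("S2:", sorted(s2))
print("S3:", sorted(s3))
print("nested:", s1 <= s2 <= s3)
```

Because the three lists differ only in the p-value threshold, they are nested by construction. A gene such as the hypothetical geneD, with a very small p-value but a small logFC, is excluded from all three.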
To go a step further, the gene names from the three gene lists (S1, S2 and S3) were used to perform a Gene Ontology (GO) term analysis with g:Profiler (https://biit.cs.ut.ee/gprofiler/gost, default parameters). Such an analysis consists of testing whether genes with a particular function, as described in the Gene Ontology database, are over-represented in a gene set. Of course, the results depend on the number of genes initially present in the list, and the findings obtained with the lists S1, S2 and S3 are summarised in Figure 1B. Interestingly, we can observe that the most enriched functions, i.e. the functions with the lowest adjusted p-values, are the same in each analysis (see the dark green cells, Figure 1B). This illustrates the idea that strong biological signals emerge whatever statistical thresholds are chosen (isn't that good news?). In this context, the advantage of having more or less restrictive lists of candidate genes, in terms of associated FDR values (remember point of vigilance #5), lies in the possibility of exploring new avenues of research, perhaps more uncertain but certainly also more innovative.
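For readers who wish to see what an over-representation test computes, here is a generic sketch based on the hypergeometric distribution, a standard choice for this kind of analysis. It is not a reimplementation of the g:Profiler pipeline, which also applies its own multiple testing correction, and all gene counts below are hypothetical.

```python
from math import comb

def enrichment_pvalue(n_background, n_annotated, n_selected, n_overlap):
    """Hypergeometric upper tail: probability of drawing at least `n_overlap`
    annotated genes when `n_selected` genes are picked at random from a
    background of `n_background` genes, `n_annotated` of which carry the
    function of interest."""
    total = comb(n_background, n_selected)
    upper = min(n_annotated, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical background: 20,000 genes, 500 annotated with a given GO term.
# Both candidate lists contain 10% of annotated genes, but differ in size.
p_small_list = enrichment_pvalue(20000, 500, n_selected=40, n_overlap=4)
p_large_list = enrichment_pvalue(20000, 500, n_selected=400, n_overlap=40)

print(f"list of 40 genes, 4 annotated:   p = {p_small_list:.3g}")
print(f"list of 400 genes, 40 annotated: p = {p_large_list:.3g}")
```

The same proportion of annotated genes gives a much smaller p-value in the longer list, which is one reason why the enrichment results depend on the number of genes in the input list.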
Figure 1: (A) Volcano plot comparing the differential expression of transcripts in glia- vs. neuron-enriched cultures. The shapes indicate tissue-specific enrichment. The points have been colored to match the different adjusted p-value thresholds used for the GO term analysis (i.e. 10^-2, 10^-50 and 10^-100). (B) GO terms enriched for glia-specific transcripts. For each set (S1, S2, S3), the list of the 15 most enriched GO terms is shown and color-coded depending on whether the GO terms are shared between 2 sets (light green), shared between 3 sets (dark green) or unique (yellow).
Edited by John (JJ) Fung. Featured Image: NGC/Design.
Gaëlle Lelandais is a Professor at the Institute for Integrative Biology of the Cell (Paris-Saclay University, CNRS, CEA) where she teaches bioinformatics and biostatistics (gaelle.lelandais[at]universite-paris-saclay.fr).
Pierre Grognet is an Assistant Professor at the Institute for Integrative Biology of the Cell (Paris-Saclay University, CNRS, CEA) where he teaches genetics, molecular biology, and microbiology (pierre.grognet[at]universite-paris-saclay.fr).
Both are working on epigenetic regulation of sexual development in a filamentous fungus.