Prof. Dr. Jörg Rahnenführer
Statistical analysis of modern sequencing data – quality control, modelling and interpretation
Genome-wide measurement of gene expression is now a widely applied approach. Because the data are high-dimensional and biological in nature, their analysis requires tailored statistical methods and rigorous quality control. Next-generation sequencing (NGS) is a general term covering several modern sequencing technologies. Here we present an analysis pipeline for RNA-seq data. The measurements are count data and therefore require statistical models that differ from those used for earlier microarray-based methods. We describe examples for the path “from the expression value to the p-value”, covering group comparisons and pathway enrichment analysis.
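The pathway enrichment step mentioned above can be illustrated with a one-sided hypergeometric over-representation test. This is a minimal sketch of one common approach, not necessarily the method used in the pipeline presented here; the function name and all numbers are hypothetical.

```python
from math import comb


def enrichment_pvalue(n_genes, n_pathway, n_selected, n_overlap):
    """One-sided hypergeometric p-value P(X >= n_overlap).

    n_genes    : number of genes measured in total (hypothetical input)
    n_pathway  : number of those genes annotated to the pathway
    n_selected : number of genes called differentially expressed
    n_overlap  : number of selected genes that belong to the pathway
    """
    upper = min(n_pathway, n_selected)
    # Count all ways to draw at least n_overlap pathway genes.
    favourable = sum(
        comb(n_pathway, i) * comb(n_genes - n_pathway, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_genes, n_selected)


# Toy example: 20000 genes, 100 in the pathway, 500 selected, 10 in both.
p_value = enrichment_pvalue(20000, 100, 500, 10)
print(f"enrichment p-value: {p_value:.3g}")
```

A small p-value indicates that the pathway contains more selected genes than expected by chance; in practice the p-values of all tested pathways would be adjusted for multiple testing.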
Public repositories now often provide raw expression data together with the corresponding clinicopathological parameters. This offers the opportunity to apply meta-analysis approaches. The results and conclusions, however, depend on the reliability of the available information, so quality control is an important step in the analysis pipeline. As an example, we present a likelihood-based male-female classifier that identifies sample misannotations in public omics datasets. Such errors are caused by sample mix-ups during biobanking or during subsequent laboratory and data handling procedures, and they occur in 18 out of 45 analyzed public gene expression datasets. Affected data should be included in further investigations only with extreme caution.
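The idea of a likelihood-based sex check can be sketched as follows. This is an illustrative simplification under stated assumptions, not the classifier presented in the talk: it assumes one sex-informative score per sample (for example, the log expression of a Y-linked gene minus that of a female-specific gene), models the score as Gaussian within each annotated group, and flags samples whose log-likelihood ratio contradicts the annotation. All sample IDs and values are invented.

```python
import math
from statistics import mean, pstdev


def gaussian_loglik(x, mu, sigma):
    """Log-density of a normal distribution N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def fit_class(values):
    """Estimate mean and standard deviation; floor the sd to avoid zero."""
    return mean(values), max(pstdev(values), 0.1)


def flag_misannotations(samples):
    """Return IDs of samples whose annotated sex is the less likely one.

    samples: list of (sample_id, annotated_sex, score), sex in {"F", "M"}.
    """
    params = {
        sex: fit_class([score for _, label, score in samples if label == sex])
        for sex in ("F", "M")
    }
    flagged = []
    for sample_id, annotated, score in samples:
        # Positive log-likelihood ratio favours "M", negative favours "F".
        llr = gaussian_loglik(score, *params["M"]) - gaussian_loglik(score, *params["F"])
        predicted = "M" if llr > 0 else "F"
        if predicted != annotated:
            flagged.append(sample_id)
    return flagged


# Hypothetical toy data: S5 is annotated female but has a male-like score.
toy_samples = [
    ("S1", "F", -5.1), ("S2", "F", -4.8), ("S3", "F", -5.3), ("S4", "F", -4.9),
    ("S5", "F", 4.7),
    ("S6", "M", 5.0), ("S7", "M", 4.6), ("S8", "M", 5.2), ("S9", "M", 4.9),
]
flagged_ids = flag_misannotations(toy_samples)
print(flagged_ids)
```

Because the group parameters are estimated from the annotated labels themselves, heavy contamination would distort them; a more careful version would use robust estimates or an unsupervised mixture model.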