Using statistical modelling to understand cell heterogeneity


Dr Christiane Fuchs is the Group Leader of the Biostatistics Group at the Institute of Computational Biology at the Helmholtz Zentrum München, German Research Center for Environmental Health. Her research applies statistical modelling to the analysis of biological data, including her latest work investigating the evolution of acute myeloid leukaemia from collaborative data.
Acute myeloid leukaemia (AML) can result from myelodysplastic syndrome (MDS), a disorder that affects the number of healthy blood cells a person has. Rather than generating fully functional blood cells, the bone marrow of people with MDS produces underdeveloped or malformed blood cells, which cannot transport oxygen around the body as effectively. This disruption of the maturation of stem cells into specific, ‘differentiated’ cells leads to high levels of immature white blood cells known as blast cells. When the level of these cells in the blood rises to a certain point, it starts to impair the function of other blood cells, causing AML.
Using statistical analysis and models, Dr Fuchs and her collaborators are examining the variation and heterogeneities in whole populations of cells that can become cancerous and cause AML, determining how these evolve over time. Dr Fuchs’ work is made possible through the Collaborative Research Centre 1243, a collaboration of clinicians, molecular biologists, population geneticists, and computational and evolutionary biologists funded by the German Research Foundation (DFG). The centre takes an interdisciplinary approach to investigating the varied aspects of cancer evolution. By drawing together scientists from disparate fields, the centre aims to better understand the evolution of tumours in order to improve diagnostics, prognostics and treatments for cancer patients.
Understanding heterogeneity
To understand how cells develop into cancer cells, we must first understand the differences, or heterogeneities, between how cells (of the same or different types) express our underlying genetic code. In healthy individuals, the DNA is the same throughout the whole body, but cells behave differently depending on which genes are active and how active they are: cells are heterogeneous. Subtle differences in gene expression can cause proteins that are vital to cell function to be produced differently. Sometimes these differences within otherwise identical cell types originate from cancerous development, such as with AML, and this is the area that Dr Fuchs and her team are interested in.

In order to examine gene expression, scientists measure RNA, the molecule into which a gene’s DNA is transcribed when that gene is expressed. Heterogeneity in gene expression can be identified from these measurements using statistical methods. Some techniques for examining cell properties rely on profiling large numbers of cells in a sample and calculating an average gene expression profile. But in the case of diseases like AML, where rare cell types and the differences between individual cells are important, this type of bulk technique is less useful. It is more helpful to look at single cells, because identifying these rare cell types is essential to understanding the problems they cause.
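To make the limitation of bulk averaging concrete, here is a minimal Python sketch. All numbers are invented for illustration and are not data from the study: a hypothetical population in which 5% of cells express a gene at a high level is compared with a uniform population in which every cell sits at the same average level. The bulk averages are indistinguishable, while the single-cell values show two distinct groups in the mixed population.

```python
import random
import statistics

def make_population(n_cells, rare_fraction, low_level, high_level, seed=0):
    """Return hypothetical per-cell expression values (arbitrary units).

    Each cell is 'high' with probability rare_fraction, otherwise 'low'.
    """
    rng = random.Random(seed)
    return [high_level if rng.random() < rare_fraction else low_level
            for _ in range(n_cells)]

# Illustrative, made-up values: 5% rare cells with strongly elevated expression.
mixed = make_population(10_000, rare_fraction=0.05,
                        low_level=1.0, high_level=21.0)
bulk_average = statistics.mean(mixed)

# A homogeneous population in which every cell sits at that same average.
uniform = [bulk_average] * len(mixed)

print(f"bulk average, mixed population:   {bulk_average:.2f}")
print(f"bulk average, uniform population: {statistics.mean(uniform):.2f}")
print(f"distinct single-cell levels, mixed:   {sorted(set(mixed))}")
print(f"distinct single-cell levels, uniform: {len(set(uniform))}")
```

A bulk measurement corresponds to the first two printed values, which agree; only the per-cell values reveal that the mixed population contains a rare, highly expressing group.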

Using statistical analysis and models, Dr Fuchs and her collaborators are trying to discern the hierarchy of processes that lead to the development of abnormal sub-clones that can cause acute myeloid leukaemia

Data from single cells, however, are affected by the way the information is extracted from the cells (i.e., reverse transcription and RNA sequencing): the data contain ‘noise’ that depends on the extremely complex methods used. As a consequence, it is unknown whether the observed variation reflects real variation in the body or is just an artefact of isolating and handling the cells. Hence, small, subtle effects can be hidden in single-cell data.
Statistics to the rescue
Statistical methods can provide the solution to this problem. Dr Fuchs and her team have developed statistical techniques to analyse the data generated by a method called stochastic profiling to measure the way that RNA molecules are expressed from a patient’s genes.
Stochastic profiling, which sits between population-scale and single-cell sampling, was originally developed as an experimental technique by her collaborator Dr Kevin Janes at the University of Virginia. The method works by randomly collecting many ten-cell subsamples rather than individual cells, which reduces the effect of background noise because each measurement draws on ten times as much cell material. Information about single-cell properties is then extracted with a mathematical model.
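One consequence of random ten-cell sampling can be worked out with elementary probability. The sketch below assumes that cells are drawn independently and that a hypothetical 5% of cells belong to a rare population; both the function names and the 5% figure are illustrative assumptions rather than values from the study. Under these assumptions, the number of rare cells in a sample follows a binomial distribution.

```python
from math import comb

def prob_k_rare_cells(k, sample_size, rare_fraction):
    """Binomial probability that exactly k of the sampled cells are rare."""
    return (comb(sample_size, k)
            * rare_fraction ** k
            * (1 - rare_fraction) ** (sample_size - k))

def prob_at_least_one_rare(sample_size, rare_fraction):
    """Probability that a random sample contains one or more rare cells."""
    return 1 - prob_k_rare_cells(0, sample_size, rare_fraction)

rare_fraction = 0.05  # hypothetical share of rare cells in the tissue

for size in (1, 10):
    p = prob_at_least_one_rare(size, rare_fraction)
    print(f"sample of {size:2d} cell(s): P(at least one rare cell) = {p:.3f}")
```

With these assumed numbers, a single cell is rare 5% of the time, whereas about 40% of ten-cell samples contain at least one rare cell. The sample-to-sample fluctuations in a ten-cell measurement therefore still carry information about the rare population, which is what a model can then try to extract.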
Dr Fuchs built her model on basic mechanisms of gene expression and combined various types of statistical models (e.g., log-normal and exponential distribution models) to reflect various characteristics of gene expression. By fitting the model to data, Dr Fuchs and her team were finally able to parameterise heterogeneities in gene expression.
Validating the model
The team validated the model using theoretical and experimental data – first by simulating the expression and regulation of ten-cell samples with known distributions, then by comparing the statistical results to experimental data.
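The simulation half of this validation can be illustrated with a small Python sketch. It is not the team's stochastic profiling analysis, which fits a full probabilistic model to the data. As a deliberately simplified stand-in, it simulates ten-cell samples from a two-population log-normal mixture with known parameters and then recovers the fraction of high-expressing cells by matching the sample mean. It also assumes that the log-normal parameters are known, which would not hold for real data. All parameter values and function names are hypothetical.

```python
import math
import random

def lognormal_mean(mu, sigma):
    """Mean of a log-normal distribution with log-scale parameters mu, sigma."""
    return math.exp(mu + sigma ** 2 / 2)

def simulate_pooled_samples(n_samples, cells_per_sample, p_high,
                            mu_low, mu_high, sigma, seed=1):
    """Simulate pooled measurements: each is the sum over its cells.

    Every cell independently belongs to the high-expression population
    with probability p_high, otherwise to the low-expression population.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        total = 0.0
        for _ in range(cells_per_sample):
            mu = mu_high if rng.random() < p_high else mu_low
            total += rng.lognormvariate(mu, sigma)
        samples.append(total)
    return samples

def estimate_p_high(samples, cells_per_sample, mu_low, mu_high, sigma):
    """Moment-matching estimate of p_high (a simplified stand-in estimator).

    Uses E[per-cell expression] = p * m_high + (1 - p) * m_low and solves for p.
    """
    per_cell_mean = sum(samples) / (len(samples) * cells_per_sample)
    m_low = lognormal_mean(mu_low, sigma)
    m_high = lognormal_mean(mu_high, sigma)
    p = (per_cell_mean - m_low) / (m_high - m_low)
    return min(1.0, max(0.0, p))  # clamp to a valid probability

# Hypothetical ground truth: 10% of cells are high expressers.
true_p = 0.1
samples = simulate_pooled_samples(2000, 10, true_p,
                                  mu_low=0.0, mu_high=2.0, sigma=0.3)
p_hat = estimate_p_high(samples, 10, mu_low=0.0, mu_high=2.0, sigma=0.3)
print(f"true fraction: {true_p:.3f}, estimated fraction: {p_hat:.3f}")
```

Because the true fraction is set by hand, the estimate can be checked directly against it. This is the logic of validating on simulated data before moving to experimental measurements.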

Dr Fuchs and her team have developed a statistical method called stochastic profiling analysis, which can help understand the way that RNA molecules are expressed from a patient’s genes

In order to fully validate Dr Fuchs’ findings, the method needed to be applied to real gene expression data. This was provided by Dr Kevin Janes and his PhD student (now postdoctoral fellow) Sameer Bajikar. Using fluorescent labels, they were able to create images in which they could count the number of cells in a tissue (because the nuclei were visible) and identify regions where the gene of interest was highly expressed. From these images, they could experimentally estimate the frequency of two cell populations: those with high expression rates and those with low expression rates of the gene. The agreement between this experimental estimate and Dr Fuchs’ statistical estimate was surprisingly good, providing concrete confirmation that Dr Fuchs’ method works. However, while the experimental measurement took weeks or months to confirm, a computer can complete the complex calculations required by Dr Fuchs’ method in a matter of minutes, or even seconds.
Dr Fuchs and her team compared their method with the analysis of comparable amounts of single-cell data. They found that, for the rare or very rare clusters of cells that are typical in individuals with cancers and other diseases, the estimates of expression frequency obtained using their model were considerably more accurate than those obtained with other models.
Improving on existing techniques
The method developed by Dr Fuchs and her team represents an improvement on single-cell profiling, which creates considerable technical noise because of the sampling methods used, and on bulk methods, which cannot identify specific heterogeneities that might be responsible for causing disease. It is a reproducible, quantitative and computationally efficient way of profiling cells that generates more reliable information than single-cell methods, without the associated complexities and issues of single-cell profiling.
It is as yet unclear how powerful the method is with a small sample of ten-cell averages when the dataset cannot be enlarged by merging co-expressed genes. What is clear, though, is that for clusters of such genes, which are implicated in certain diseases like breast cancer or leukaemia, the technique performs very well.
The next step will be for Dr Fuchs and her team to apply the model to data supplied from collaborators at the CRC and to identify cell-to-cell variation in gene expression in the context of cancers like AML.

How did your work on stochastic profiling lead on to your current work?
Transcriptional heterogeneity is an essential factor which must be considered in various contexts, not only in disease progression like cancer, but also, for example, in developmental processes. In our initial work, we developed statistical methods and validated them on experimental data. In the current project, we are driven by biomedical questions to dissect cell-to-cell heterogeneities in AML patients.
How do you find the most suitable models to fit the data?
As a statistician, I start by looking at the data-generating process, i.e., what is known about the underlying biological processes, and from these derive appropriate probability distributions. Models should at the same time be designed as simply as possible and as complex as necessary. Moreover, as the biology of life is not exactly predictable, I incorporate random components in my models.


How novel is the field?
Measuring transcriptional expression from single cells or small numbers of cells became possible only a few years ago. Experimental methods as well as computational resources develop rapidly, enabling new routes to understand genetic mechanisms. In its current form, stochastic profiling analysis would not have been possible 20 years ago.
What other applications have you found for your method?
Our initial investigations focused on human cells, but in the meantime we realised that stochastic profiling can also become powerful in microbiology. Here, understanding cellular heterogeneity might help gain insights, for example, into the composition of the microbiome or the formation of antibiotic resistance, but single-cell analysis is experimentally even more difficult because the genetic material is much sparser than in human cells.
What further developments do you plan to improve the method?
Every dataset is different, and every application entails new challenges. Stochastic profiling analysis is a general concept, but it needs to be adapted to emerging technologies like novel sequencing methods and to other molecular subsets beyond the transcriptome, e.g., the epigenome. It also needs to be tailored towards different organisms, such as bacteria or viruses, which have different genetic material from mammalian cells. With these extensions, we want to make stochastic profiling analysis an even more powerful tool for understanding cell-to-cell heterogeneity and ultimately gain more insight into developmental processes and cellular changes within disease progression.

Research Objectives
Dr Fuchs’ research looks at statistical modelling and inference with applications in genetics and molecular biology. Within this field, she is particularly interested in the theory and application of stochastic differential equations.
Funding
German Research Foundation (Deutsche Forschungsgemeinschaft, DFG): Collaborative Research Centre 1243 “Cancer Evolution”, Subproject A17
Collaborators

Prof Fabian Theis
Dr Carsten Marr
Prof Kevin Janes
Dr Sameer Bajikar

Bio
Dr Fuchs is a mathematician with a doctorate in Statistics, having received an MSc degree in Computational Mathematics and a Diploma in Mathematics beforehand. Following her PhD, she joined the Helmholtz Zentrum München as a postdoctoral researcher. Since 2013, she has been leading the Biostatistics Group at the Institute of Computational Biology.
Contact
Dr Christiane Fuchs
Helmholtz Zentrum München
German Research Center for Environmental Health (GmbH)
Institute of Computational Biology
Ingolstaedter Landstr. 1
85764 Neuherberg
Germany
T: +49 89 3187 3385
E: christiane.fuchs@helmholtz-muenchen.de