A simple formula demystifies Simpson’s paradox

Government departments and organisations of all types use surveys to understand trends and behaviours and develop policy, but what happens if the results of those surveys are misleading?
Deducing properties and trends from what’s known as categorical data is not always straightforward.
Sometimes a trend appears in individual groups of data but disappears when the groups are combined – a phenomenon known as Simpson’s paradox.
Dr Friedrich Teuscher from the Research Institute for Farm Animal Biology, Germany (now retired) has derived a simple formula that quantifies Simpson’s paradox.

Experiments and surveys often produce what’s known as categorical data – data without numerical values, such as country of origin, profession, gender or colour. One tool used to make sense of categorical data is contingency table theory. A contingency table displays how often each combination of categories occurs across two or more categorical variables.
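
As a minimal sketch of the idea, the snippet below counts a handful of records into a two-way contingency table and adds the row and column totals. The records and the names `records` and `contingency_table` are hypothetical and chosen purely for illustration.

```python
from collections import Counter

# Hypothetical survey records: (profession, preferred colour).
records = [
    ("farmer", "red"), ("farmer", "blue"), ("teacher", "red"),
    ("teacher", "red"), ("farmer", "red"), ("teacher", "blue"),
]


def contingency_table(pairs):
    """Count how often each (row category, column category) pair occurs."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that never occur.
    table = {r: {c: counts[(r, c)] for c in cols} for r in rows}
    return table, rows, cols


table, rows, cols = contingency_table(records)

# Marginal totals: the sums along each row and each column.
row_totals = {r: sum(table[r].values()) for r in rows}
col_totals = {c: sum(table[r][c] for r in rows) for c in cols}

for r in rows:
    print(r, table[r], "total:", row_totals[r])
print("column totals:", col_totals)
```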

A wide range of contingency table techniques is available to help us determine the interactions and interrelations between variables, measure their association, and infer trends. These are used in everything from cancer research to working out how many primary school places we’ll need next year.

Sometimes, however, a trend that appears in individual groups of data becomes stronger or weaker, or disappears altogether, when these groups are combined. Combining the groups can even reverse the trend. This phenomenon is called Simpson’s paradox after Edward H Simpson, who published his findings in 1951. He presented an example where the benefits of a drug were apparent when information from males and females was examined separately. No such effect appeared, however, when the groups were merged.

The Berkeley gender bias

A well-known example of Simpson’s paradox shows how we can be misled if an important variable is ignored. In 1973, the overall autumn admission figures for the University of California, Berkeley, showed that male applicants were more likely to be admitted than female applicants, and the university was sued for gender bias. When the data was later analysed at department level, bias against female applicants was not apparent. Moreover, when the department-level results were averaged across all departments, a moderate preference for female applicants was observed.

An association is a relationship between two variables. A two-way association is where properties of one variable relate to the properties of the other and vice versa. This association between two variables, however, may be due to other variables. With continuous data (numerical data measured on an infinite or an infinitely dense scale, eg, length or temperature), the effect of the other variables can be controlled using statistical techniques, and what remains is the partial association. Accordingly, a partial two-way association is the association that remains between the two original variables when statistical techniques control the effect of a third variable. This doesn’t carry over directly to categorical data, because the partial association is not necessarily the same in every category of the third variable.
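
For continuous data, one standard way of controlling for a third variable is the partial correlation coefficient. The sketch below uses the textbook formula on hypothetical measurements in which `z` drives both `x` and `y`; the data and function names are illustrative assumptions, not taken from Teuscher’s work.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Hypothetical data: x and y are both z plus unrelated disturbances.
z = [1, 2, 3, 4, 5, 6]
x = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5]
y = [1.5, 2.5, 2.5, 3.5, 5.5, 6.5]

raw = pearson(x, y)
partial = partial_correlation(x, y, z)
print(f"raw correlation: {raw:.3f}, partial correlation: {partial:.3f}")
```

In this constructed example the raw correlation between `x` and `y` is strong, yet almost nothing remains once `z` is controlled. A single coefficient like this summarises the whole data set, which is exactly what is not guaranteed for categorical data.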

The initial Berkeley analysis involved only two variables: gender (male, female) and admittance (denied, admitted). The subsequent analysis by Bickel and colleagues added a third variable, department (one through to 85), and examined 85 partial two-way interactions in the three-way contingency table. They found four out of the 85 departments showed significant bias against female applicants, while six departments showed significant bias against male applicants.

Simpson’s paradox for quantitative data: a positive trend appears for two separate groups (red and blue lines), whereas a negative trend (black dashed line) appears when the groups are combined.

Many studies have focused on Simpson’s paradox, interpreting it and finding ways to avoid it, but until recently no measure of interaction had been developed to quantify it. Dr Friedrich Teuscher from the Research Institute for Farm Animal Biology, Germany (now retired) has derived a simple equation that demystifies Simpson’s paradox. Teuscher demonstrates how the paradox relates to the inner structure of the table, enhancing the theory that underpins the statistical application of contingency tables.

Calling on quantitative genetics

The association between categorical variables can be measured in a variety of ways, but Teuscher drew on his knowledge of quantitative genetics, and of linkage disequilibrium specifically. Linkage disequilibrium describes the non-random association of alleles (alternative DNA sequences at the same point on a DNA molecule) at different loci, the locations where certain genes or genetic markers are positioned. The linkage disequilibrium estimator can be used to measure two-way associations and partial associations, and to estimate the distances between loci or genes. Applying it to three-way tables enabled the researcher to derive a formula to quantify Simpson’s paradox.
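
In its simplest two-locus form, the linkage disequilibrium coefficient is D = p_AB - p_A * p_B: the observed frequency of a combination minus the frequency expected if the two variables were independent. The sketch below computes it from a 2x2 table of counts; the counts are hypothetical and the function name is illustrative.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return D = p_AB - p_A * p_B for a 2x2 table of counts."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n            # observed frequency of the A-B combination
    p_A = (n_AB + n_Ab) / n    # marginal frequency of A
    p_B = (n_AB + n_aB) / n    # marginal frequency of B
    return p_AB - p_A * p_B


# Hypothetical counts: A and B occur together more often than chance predicts.
D = linkage_disequilibrium(40, 10, 10, 40)

# A perfectly balanced table shows no association at all.
D_independent = linkage_disequilibrium(25, 25, 25, 25)

print(f"D = {D:.3f}, D (independent table) = {D_independent:.3f}")
```

The same quantity can be read as the covariance of two yes/no indicators, which is why it carries over from alleles to categorical variables such as gender and admittance.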

Fundamentals for contingency table theory

Fundamental to the evolution of contingency table theory is the development of the log-linear model. These models make a variety of procedures available to determine the interactions essential to explain the data.

The concept of entropy is also employed. In 1963, Good handled contingency tables as multinomial distributions – these are a generalisation of the binomial distribution for more than two possible outcomes, such as modelling the probability of the six possible outcomes from rolling a die. He derived the distribution using maximum entropy (ie, minimum information) under given constraints, eg, one-way and two-way marginal totals (the row and column totals forming the bottom row and right-hand ‘margins’ in a table). Results using the maximum entropy principle have been shown to agree with those of the maximum-likelihood estimators of the log-linear model. Teuscher suggests that the numerical advantages and the flexibility of the maximum entropy principle are not fully exploited within contingency table theory.
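
To illustrate the link between maximum entropy and marginal totals, the sketch below uses iterative proportional fitting, a standard algorithm for fitting log-linear models. It starts from a uniform table (the least informative one) and rescales it until it matches the given row and column totals. With only one-way totals as constraints, the result is the familiar independence table. The totals and the function name `ipf` are assumptions for illustration; this is not Teuscher’s own procedure.

```python
def ipf(row_totals, col_totals, iterations=20):
    """Fit a two-way table to given row and column totals, starting uniform."""
    n_rows, n_cols = len(row_totals), len(col_totals)
    table = [[1.0] * n_cols for _ in range(n_rows)]
    for _ in range(iterations):
        # Rescale each row so that it matches its target total.
        for i in range(n_rows):
            current = sum(table[i])
            table[i] = [v * row_totals[i] / current for v in table[i]]
        # Rescale each column so that it matches its target total.
        for j in range(n_cols):
            current = sum(table[i][j] for i in range(n_rows))
            for i in range(n_rows):
                table[i][j] *= col_totals[j] / current
    return table


# Hypothetical marginal totals for a 2x2 table of 100 observations.
rows, cols = [30, 70], [40, 60]
fitted = ipf(rows, cols)

# Independence table: row total * column total / grand total.
expected = [[r * c / 100 for c in cols] for r in rows]

print("fitted:  ", fitted)
print("expected:", expected)
```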

Vector interpretation of Simpson’s paradox.

A test for single variables

Teuscher also developed a test of whether the partial interactions (correlations) within the categories of a variable agree or disagree, for example whether gender has no effect or some effect. He describes two versions of the test, along with further refinements: one tests whether all the associations are zero, the other whether they all share the same value (which is estimated within the test procedure). This fills a gap that has been ignored to date, probably because such a test doesn’t suit the hierarchical log-linear model.

Multiplicative and additive measures

Quantitative genetics’ theory of linkage disequilibrium has been generalised to include three-locus (three variables) and four-locus (four variables) linkage disequilibrium. The three-locus linkage disequilibrium is an additive measure but has been shown to be inconsistent with Bartlett’s contingency table criterion. Teuscher demonstrates how Bartlett’s multiplicative measure for a three-way association has more impact, and shows that the additive measure approximates it, being a simplified form of the first-order Taylor expansion of the multiplicative measure.
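
The inconsistency between the two measures can be seen on a small example. The sketch below assumes the standard textbook forms: Bartlett’s criterion for a 2x2x2 table (no three-way interaction when the odds ratio of A and B is the same in both layers of C) and the three-locus disequilibrium D_ABC = p_ABC - p_A*D_BC - p_B*D_AC - p_C*D_AB - p_A*p_B*p_C. The counts are hypothetical, and the code does not reproduce Teuscher’s Taylor-expansion argument.

```python
import math

# Hypothetical 2x2x2 table of counts, indexed counts[a][b][c].
counts = [
    [[1, 4], [1, 2]],  # a = 0
    [[1, 2], [1, 1]],  # a = 1
]
idx = (0, 1)
n = sum(counts[a][b][c] for a in idx for b in idx for c in idx)
p = [[[counts[a][b][c] / n for c in idx] for b in idx] for a in idx]


def odds_ratio(layer):
    """Odds ratio between A and B within one layer of C."""
    return (p[0][0][layer] * p[1][1][layer]) / (p[0][1][layer] * p[1][0][layer])


# Multiplicative measure: log ratio of the two layer odds ratios.
log_bartlett = math.log(odds_ratio(1) / odds_ratio(0))

# One-way and two-way marginal frequencies of the "1" categories.
p_A = sum(p[1][b][c] for b in idx for c in idx)
p_B = sum(p[a][1][c] for a in idx for c in idx)
p_C = sum(p[a][b][1] for a in idx for b in idx)
D_AB = sum(p[1][1][c] for c in idx) - p_A * p_B
D_AC = sum(p[1][b][1] for b in idx) - p_A * p_C
D_BC = sum(p[a][1][1] for a in idx) - p_B * p_C

# Additive measure: three-locus linkage disequilibrium.
D_ABC = p[1][1][1] - p_A * D_BC - p_B * D_AC - p_C * D_AB - p_A * p_B * p_C

print(f"multiplicative (log scale): {log_bartlett:.6f}")
print(f"additive D_ABC:             {D_ABC:.6f}")
```

In this table the odds ratio equals 1 in both layers, so the multiplicative measure reports no three-way interaction, while the additive measure is small but not zero.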

Application of linear programming

Since fixing the n-way marginals imposes linear equations on the cell counts, linear programming techniques can be applied. Doing so, Teuscher achieved considerable improvements in determining fixed cells within a contingency table, and in simulating tables with given association parameters.
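
For the simplest case, a two-way table with known row and column totals, the linear programme that minimises or maximises a single cell has a closed-form answer known as the Fréchet bounds. The sketch below uses that shortcut instead of a general solver, so it only hints at what is needed for higher-way tables, where a real linear programming routine would be required. The totals and function names are hypothetical.

```python
def cell_bounds(row_totals, col_totals):
    """Smallest and largest value each cell can take given the margins."""
    n = sum(row_totals)
    if n != sum(col_totals):
        raise ValueError("row and column totals must have the same sum")
    return [
        [(max(0, r + c - n), min(r, c)) for c in col_totals]
        for r in row_totals
    ]


def fixed_cells(row_totals, col_totals):
    """Cells whose value is fully determined by the marginal totals."""
    bounds = cell_bounds(row_totals, col_totals)
    return {
        (i, j): low
        for i, row in enumerate(bounds)
        for j, (low, high) in enumerate(row)
        if low == high
    }


# Hypothetical margins: the third column is empty, so its cells are fixed at 0.
rows, cols = [6, 4], [7, 3, 0]
print(cell_bounds(rows, cols))
print(fixed_cells(rows, cols))
```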

Application to UC Berkeley data

Taking the autumn 1973 admission figures for the six largest departments at UC Berkeley, Teuscher applied his novel methodology. His analysis indicates that the apparent discrimination against female applicants was due to a property of the departments: those with higher admittance rates (eg, Engineering) had more male applicants, while those with lower admittance rates (eg, English) had more female applicants. The new method revealed this trend straight away, whereas it took a lot of detective work from the researchers back in 1975.
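
The pattern can be checked directly. The sketch below assumes the counts for the six largest departments as they are widely distributed (for instance in R’s `UCBAdmissions` dataset), where departments are anonymised as A to F. It compares admission rates overall and within each department; the variable names are illustrative.

```python
# (admitted, rejected) by department and gender, six largest departments.
berkeley = {
    "A": {"male": (512, 313), "female": (89, 19)},
    "B": {"male": (353, 207), "female": (17, 8)},
    "C": {"male": (120, 205), "female": (202, 391)},
    "D": {"male": (138, 279), "female": (131, 244)},
    "E": {"male": (53, 138), "female": (94, 299)},
    "F": {"male": (22, 351), "female": (24, 317)},
}


def rate(admitted, rejected):
    """Share of applicants who were admitted."""
    return admitted / (admitted + rejected)


def overall_rate(gender):
    """Admission rate for one gender with all departments pooled."""
    admitted = sum(d[gender][0] for d in berkeley.values())
    rejected = sum(d[gender][1] for d in berkeley.values())
    return rate(admitted, rejected)


by_department = {
    dept: {gender: rate(*counts) for gender, counts in d.items()}
    for dept, d in berkeley.items()
}
female_share = {
    dept: sum(d["female"]) / (sum(d["male"]) + sum(d["female"]))
    for dept, d in berkeley.items()
}
departments_favouring_women = [
    dept for dept, r in by_department.items() if r["female"] > r["male"]
]

print(f"overall: male {overall_rate('male'):.3f}, female {overall_rate('female'):.3f}")
for dept, r in by_department.items():
    print(f"{dept}: male {r['male']:.3f}, female {r['female']:.3f}, "
          f"female share of applicants {female_share[dept]:.3f}")
print("departments with a higher female admission rate:", departments_favouring_women)
```

Pooled, men are admitted at a clearly higher rate than women, yet four of the six departments admit women at a higher rate than men. The departments with the highest admission rates, A and B, are also those with the smallest share of female applicants.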

Teuscher’s simple formula revealed that the University of California, Berkeley did not discriminate against female applicants.

Applying his simple formula to the Berkeley data, Teuscher shows that what appeared to be an abnormally high number of rejected female applicants was not due to discrimination. One aspect of employing the formula is of particular interest: if we are given the numbers of male and female applicants for each department, together with the numbers of admitted and rejected applicants for each department (not broken down by gender), we can deduce the difference between the overall (two-way) association between gender and admittance and the averaged partial associations between gender and admittance within the departments. In the Berkeley case, this difference is large, and the overall association is not in keeping with the averaged associations within the departments. Consequently, the overall association cannot be taken as a measure of discrimination against female applicants.
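
A decomposition with exactly this property is the law of total covariance applied to yes/no indicators: the overall association D equals the weighted average of the within-department associations plus a term that depends only on each department’s share of female applicants and its admission rate. The sketch below is an illustration under that assumption; whether it matches the published form of Teuscher’s formula is not confirmed here. The counts are those widely distributed for the six largest departments (as in R’s `UCBAdmissions` dataset).

```python
# dept -> (female admitted, female rejected, male admitted, male rejected)
berkeley = {
    "A": (89, 19, 512, 313),
    "B": (17, 8, 353, 207),
    "C": (202, 391, 120, 205),
    "D": (131, 244, 138, 279),
    "E": (94, 299, 53, 138),
    "F": (24, 317, 22, 351),
}
n = sum(sum(c) for c in berkeley.values())


def association(fa, fr, ma, mr):
    """D = p(female, admitted) - p(female) * p(admitted) for one 2x2 table."""
    total = fa + fr + ma + mr
    return fa / total - ((fa + fr) / total) * ((fa + ma) / total)


# Overall two-way association, ignoring departments.
pooled = [sum(c[i] for c in berkeley.values()) for i in range(4)]
overall = association(*pooled)

# Partial associations within departments, averaged with weights n_k / n.
within = sum(sum(c) / n * association(*c) for c in berkeley.values())

# Difference term: needs only gender-by-department and
# admittance-by-department totals, not the full three-way table.
p_female = (pooled[0] + pooled[1]) / n
p_admit = (pooled[0] + pooled[2]) / n
between = sum(
    sum(c) / n
    * ((c[0] + c[1]) / sum(c) - p_female)
    * ((c[0] + c[2]) / sum(c) - p_admit)
    for c in berkeley.values()
)

print(f"overall association:         {overall:+.4f}")
print(f"averaged within departments: {within:+.4f}")
print(f"difference term:             {between:+.4f}")
```

The overall association is negative, the averaged within-department association is slightly positive, and the difference term accounts for the whole gap between them.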

The higher art of contingency table theory is the presentation of a parsimonious model (a model with only a few assumptions or hypotheses) that fits the data. Bickel and colleagues did not provide such a model for the Berkeley data. Teuscher first showed that no model fits when one assumes a single ‘discrimination rate’ shared by all departments. Allowing one department to have a different ‘discrimination rate’ from the others, he found a model that fits. This model reflects the essential finding: five departments showed no preference for either sex, while one department clearly preferred to admit female applicants.
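
A similar conclusion can be reached with a classical tool. The sketch below applies Woolf’s test for homogeneity of odds ratios, which is a stand-in and not Teuscher’s own test, to the widely distributed counts for the six largest departments (as in R’s `UCBAdmissions` dataset). The statistic is compared with the 5% critical values of the chi-squared distribution.

```python
import math

# dept -> (female admitted, female rejected, male admitted, male rejected)
berkeley = {
    "A": (89, 19, 512, 313),
    "B": (17, 8, 353, 207),
    "C": (202, 391, 120, 205),
    "D": (131, 244, 138, 279),
    "E": (94, 299, 53, 138),
    "F": (24, 317, 22, 351),
}


def log_odds_ratio_and_weight(fa, fr, ma, mr):
    """Log odds ratio of admission (female vs male) and its inverse variance."""
    log_or = math.log((fa * mr) / (fr * ma))
    weight = 1 / (1 / fa + 1 / fr + 1 / ma + 1 / mr)
    return log_or, weight


def woolf_statistic(tables):
    """Weighted squared deviations of the log odds ratios from their pooled mean."""
    stats = [log_odds_ratio_and_weight(*t) for t in tables]
    total_weight = sum(w for _, w in stats)
    pooled = sum(w * l for l, w in stats) / total_weight
    return sum(w * (l - pooled) ** 2 for l, w in stats)


all_six = woolf_statistic(berkeley.values())
without_a = woolf_statistic(t for dept, t in berkeley.items() if dept != "A")

# 5% critical values of chi-squared: 11.07 (5 df) and 9.49 (4 df).
print(f"all six departments:  {all_six:.2f} (critical value 11.07)")
print(f"without department A: {without_a:.2f} (critical value 9.49)")
```

With all six departments the statistic exceeds the critical value, so a single common association does not fit. Once department A is set aside, the statistic falls well below the critical value, in line with a model in which one department differs from the other five.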

In Teuscher’s own words, his research provides ways ‘to improve the process of deducing properties of a collection of categorical data’.

What initially sparked your interest in Simpson’s paradox?
I started to think about Simpson’s paradox when my formula was already ready. In 2012/2013 – when I didn’t even know the term ‘Simpson’s paradox’ – I wanted to understand a partial three-locus disequilibrium parameter suggested in a paper from Seattle. This led to half the formula. The other half was simple. Only two months later did I become aware that the formula had to do with Simpson’s paradox.

What real-world applications do you think will benefit most from your work?
There are many real-world examples, but I know only a few of them. The important point is: when a third variable interacts with the first and the second variable, Simpson’s paradox is necessarily present in the interaction of the first and second variable. This is a consequence of my formula. Another important point is that misleading effects can be avoided when two or more explanatory variables – which differ from response variables – are involved. In that case, designing the experiment carefully is helpful.