Never in our memories have clinical trials received so many column inches, and the representativeness of sample data is front of mind for many.
“A representative sample is a subset of a population that seeks to accurately reflect the characteristics of the larger group.” —Investopedia
In many applications of data science, it is advantageous to have sample data that is representative of the population you are modelling. However, this is not always possible, and techniques have been developed to derive information from different data sets. The requirements of the data will depend on the specific questions to be answered. For example, the data needed to identify rare side effects of a drug may be very different to the data sample needed to assess the average efficacy of a vaccine.
“The selection of people from whom to collect input depends upon the specific questions and issues to be addressed… Thus, the selection process starts by considering the research question: what are the specific objectives to be addressed by collecting patient input?” —US Food and Drug Administration (FDA) guidance note, June 2020
Given the choice, researchers would generally collect perfectly representative data samples to study. Sometimes, however, such samples are not available:
Other times, the very process of selecting a data sample will result in biases in the resulting sample:
Accepting these limitations on sampling, techniques have been developed to help control for unrepresentative data: namely, studying subgroups that are likely to have different results in a study. In a medical trial, these would be groups that exhibit physiological differences that could result in different responses to a treatment or drug – in the case of COVID, the initial differences in effects of the disease might suggest grouping on gender, age and ethnicity.
To ensure that medical studies looking at subgroups of a general population are comparable, the FDA has developed standard terminology for age, gender, and ethnicity groups. The UK’s National Institute for Health and Care Excellence (NICE) has set out guidance for using subgroups in analyzing patient data. In particular, NICE advises that subgroups should be defined before any study takes place to increase the credibility of the subgroup analysis.
When subgrouping techniques are used, the data sample does not need to be representative of the ultimate population. However, the subgroups must provide the building blocks for understanding that population, and each subgroup in the sample must contain enough data to give credible results.
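To illustrate how subgroup results can be recombined for a population with a different mix, here is a minimal Python sketch of reweighting subgroup averages by the target population's subgroup shares. The subgroup labels, outcomes and shares are invented for illustration and are not taken from any study mentioned here.

```python
from collections import defaultdict


def subgroup_means(sample):
    """Average outcome per subgroup from (subgroup, outcome) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, outcome in sample:
        totals[group] += outcome
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


def reweighted_estimate(sample, population_shares):
    """Combine subgroup means using the target population's subgroup shares."""
    means = subgroup_means(sample)
    missing = set(population_shares) - set(means)
    if missing:
        # A subgroup with no sample data cannot act as a building block.
        raise ValueError(f"no sample data for subgroups: {sorted(missing)}")
    return sum(share * means[group] for group, share in population_shares.items())


# Hypothetical sample that over-represents the "under_65" subgroup
# (outcome 1.0 = responded to treatment, 0.0 = did not).
sample = (
    [("under_65", 1.0)] * 6
    + [("under_65", 0.0)] * 2
    + [("65_plus", 1.0)]
    + [("65_plus", 0.0)]
)

# Hypothetical target population: the two subgroups are equally common.
population_shares = {"under_65": 0.5, "65_plus": 0.5}

naive = sum(outcome for _, outcome in sample) / len(sample)
adjusted = reweighted_estimate(sample, population_shares)
print(f"naive sample average: {naive:.3f}")
print(f"reweighted estimate:  {adjusted:.3f}")
```

In this toy sample the "65_plus" subgroup has only two observations, so its average would not be credible in practice. This is why each subgroup needs enough data of its own.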
As computing power has increased, so too have the tools for understanding longevity (or, on the other side of the coin, mortality) patterns in populations.
The traditional approach to understanding current longevity for a given population is to create a life table, which combines data from a large sample to give probabilities of survival at different ages (and usually separately for each gender). One of the earliest published life tables was developed by Edmond Halley in 1693. The life table probabilities can be pieced together to calculate life expectancy for people at different ages. However, using a single life table to predict life expectancy requires the underlying data to be representative of your population.
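As a rough illustration of how life table probabilities are pieced together, the Python sketch below computes curtate life expectancy (the expected number of whole future years lived) from one-year death probabilities. The function name and the four-row toy table are assumptions made for this example, not figures from any published table.

```python
def curtate_life_expectancy(q, age, start_age=0):
    """Expected number of whole future years lived by someone aged `age`.

    q[i] is the probability that a person aged start_age + i dies within
    the following year. The table is assumed to end with q = 1.0.
    """
    survival = 1.0    # probability of still being alive so far
    expectancy = 0.0
    for qx in q[age - start_age:]:
        survival *= 1.0 - qx    # survive one more year
        expectancy += survival  # each year survived adds one year, weighted by its probability
    return expectancy


# Hypothetical toy table covering ages 0 to 3; nobody survives beyond age 4.
q = [0.1, 0.2, 0.5, 1.0]

for age in range(len(q)):
    print(age, round(curtate_life_expectancy(q, age), 4))
```

The same calculation applied to a table built from an unrepresentative sample would give a precise-looking but misleading answer, because the result is only as good as the death probabilities fed in.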
The first step in addressing this representation issue is to group the data in the underlying sample. It is now common for standard mortality tables to be developed for different groups, such as blue- or white-collar workers, those with high or low affluence, and specific professions. This gives some extra precision for pension plans and insurers trying to predict longevity for a specific population. However, the number of groups is limited by the size of the original data set, and this approach still relies on the grouped sample data being representative of the population. For example, a blue-collar table is only relevant for a blue-collar pension plan if the blue-collar data sample is representative of the plan’s population across other dimensions that impact longevity, such as income and lifestyle.
To deal with this problem, Club Vita uses multivariate analysis to look at multiple longevity characteristics (such as affluence, address and type of work) at once and develop a suite of different base tables that capture longevity for a large number of combinations of underlying characteristics. Rather than splitting the data into ever smaller groups, we analyze the incremental effects of varying specific characteristics within the same model.
This results in the ability to capture a longevity assumption appropriate for each individual within a population based on their characteristics. An appropriate assumption for any population can then be created by piecing together the assumptions for the individuals that make up that population, removing the need for representativeness in the original data sample. This is the approach we use to generate our VitaCurves model.
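The general idea of building a population assumption from individual-level assumptions can be sketched in Python as follows. This is an illustration only and not the VitaCurves methodology: the additive log-odds form, the baseline, the characteristic names and every coefficient below are hypothetical, and age is left out for brevity.

```python
import math

# Hypothetical coefficients on the log-odds scale; not fitted to any real data.
BASELINE_LOG_ODDS = -4.0
EFFECTS = {
    "affluence": {"low": 0.4, "high": -0.3},
    "work": {"manual": 0.2, "non_manual": 0.0},
}


def death_probability(person):
    """One-year death probability from additive effects on the log-odds scale."""
    log_odds = BASELINE_LOG_ODDS + sum(
        EFFECTS[feature][person[feature]] for feature in EFFECTS
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


def population_rate(people):
    """Average the individual probabilities over the members of a population."""
    return sum(death_probability(person) for person in people) / len(people)


# Two hypothetical pension plans with different membership mixes.
low_manual = {"affluence": "low", "work": "manual"}
high_non_manual = {"affluence": "high", "work": "non_manual"}
plan_a = [low_manual] * 3 + [high_non_manual]
plan_b = [high_non_manual] * 3 + [low_manual]

print(f"plan A expected one-year death rate: {population_rate(plan_a):.4f}")
print(f"plan B expected one-year death rate: {population_rate(plan_b):.4f}")
```

Because each member is scored on their own characteristics, the two plans receive different assumptions from the same model, and neither plan needs to resemble the data set used to estimate the effects.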