Methods
Before receiving the data, we published an analytical protocol to increase the robustness of our approach and the transparency of the model development and validation processes.22 The protocol describes in detail the prespecified model predictors, statistical analysis plan, sample size calculation, coding and cleaning of data, approach to missing data, model estimation, model specification, model validation and assessment of model performance.22 These steps are also described in brief below. Overall, we adhered to the analytical protocol, with the following modifications. First, we additionally excluded individuals residing in the Yukon, Northwest Territories and Nunavut because area-based information and household income are unavailable for those regions. Second, we considered two additional predictor variables that were not initially specified: food insecurity and area-level income. Third, in addition to the primary model specified in the protocol, a minimal model and a full model were developed and validated. The minimal model was fit with the fewest variables to allow for user flexibility when not all model variables are available. The full model, in which more variables were fitted, was generated for comparison purposes. Fourth, internal validation used a split-sample approach instead of the bootstrap approach, as a more robust form of internal validation and to eliminate issues with model stability encountered with the bootstrap approach. We adhered to the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guidelines, which are the best practice standards for the development and validation of prediction models.23
Data sources
A retrospective cohort study was used to develop a prediction model for premature mortality using population-based survey data from the Canadian Community Health Survey (CCHS). The CCHS is a cross-sectional survey that covers 98% of the Canadian population aged 12 years and older and collects self-reported data on personal health status, healthcare utilisation and health determinants.24 Populations excluded from the CCHS sampling frame include individuals living in Aboriginal settlements, on Canadian Forces Bases and in some remote regions. A detailed description of the CCHS survey methodology is available elsewhere.24 All respondents were linked to the Canadian Vital Statistics Database (CVSD) to ascertain deaths during the follow-up period. The data were held at the Statistics Canada Research Data Centre. Due to Research Data Centre vetting requirements, all study output is rounded deterministically to the nearest 10.
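As an illustration of the output rounding described above, the following minimal sketch rounds a count to the nearest multiple of 10. The tie-breaking rule (values ending in 5 round up) is an assumption; the exact rule applied by the Research Data Centre is not stated here, and the function name is hypothetical.

```python
def round_to_nearest_10(count: int) -> int:
    """Round a non-negative count to the nearest multiple of 10.

    Assumption: counts ending in 5 round up. The rule is deterministic,
    so the same input always yields the same released value.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    return ((count + 5) // 10) * 10
```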
Participants
Respondents from the first six cycles of the CCHS (1.1 (2000/2001), 2.1 (2003/2004), 3.1 (2005/2006), 2007/2008, 2009/2010 and 2011/2012) who consented to link their responses to administrative data were used to create the study cohort. Respondents were excluded if they were pregnant, resided in the territories (ie, Yukon, Northwest Territories or Nunavut) or were younger than 18 years or older than 74 years at the date of the CCHS interview. For further details on our cohort creation, please see online supplemental figure S1.
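The eligibility criteria above can be sketched as a simple filter. The record layout and field names (`consented_to_linkage`, `pregnant`, `region`, `age_at_interview`) are hypothetical and do not correspond to actual CCHS variable names.

```python
# Territories excluded from the study cohort.
TERRITORIES = {"Yukon", "Northwest Territories", "Nunavut"}


def is_eligible(respondent: dict) -> bool:
    """Return True if a respondent meets the cohort inclusion criteria.

    Field names are illustrative assumptions, not CCHS variable names.
    """
    return (
        respondent["consented_to_linkage"]
        and not respondent["pregnant"]
        and respondent["region"] not in TERRITORIES
        and 18 <= respondent["age_at_interview"] <= 74
    )


def build_cohort(respondents: list[dict]) -> list[dict]:
    """Keep only eligible respondents."""
    return [r for r in respondents if is_eligible(r)]
```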
Outcome: premature mortality
Respondents were followed longitudinally for the incidence of premature mortality (binary yes/no variable), defined as death from any cause between the ages of 18 and 74 as recorded in the CVSD. This definition is based on the cut-off adopted by the Canadian Institute for Health Information,25 which is consistent with the definition used in reporting premature mortality in other industrialised nations.26–29
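A minimal sketch of how the outcome and follow-up time could be derived for one respondent is given below. It combines the age-based definition above with the follow-up rule given under Statistical analysis (premature death, 5 years of follow-up or study end). Censoring follow-up at the 75th birthday is an assumption made for illustration, as are the function and argument names; times are expressed in years from the interview date.

```python
from typing import Optional

MAX_FOLLOW_UP_YEARS = 5.0    # follow-up horizon
PREMATURE_AGE_LIMIT = 75.0   # deaths before this age count as premature


def follow_up_outcome(
    age_at_interview: float,
    years_to_death: Optional[float],
    years_to_study_end: float,
) -> tuple[float, bool]:
    """Return (follow-up time in years, premature death indicator).

    Assumption: a respondent who reaches age 75 is censored at that point,
    because a later death no longer meets the outcome definition.
    """
    admin_censor = min(MAX_FOLLOW_UP_YEARS, years_to_study_end)
    years_to_75 = PREMATURE_AGE_LIMIT - age_at_interview
    died_prematurely = (
        years_to_death is not None
        and years_to_death <= admin_censor
        and years_to_death < years_to_75
    )
    if died_prematurely:
        return years_to_death, True
    return min(admin_censor, years_to_75), False
```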
Predictors
As stated in our protocol, 38 candidate predictors were identified from the survey data in accordance with established associations with premature mortality,20 30–34 subject matter knowledge, user input, our team’s experience with the development and validation of population-level risk algorithms19 35–41 and predictor availability across survey cycles. Predictors included sociodemographic characteristics, self-perceived measures, health behaviours, chronic conditions and area-based measures.
Missing data
Multiple imputation was used to handle missing data. Specifically, fully conditional specification (FCS) was used to generate five imputed datasets, a number that previous research has established as sufficient.42 43 Total missingness was low, ranging from <1% to 10% across the six combined cycles.22 All imputation was done using the multivariate imputation by chained equations (mice) algorithm in R.42 Each cycle was first separated into sex-stratified groups, and imputation was run on each sex-stratified cycle separately to avoid potential between-cycle variations and differences between males and females. A three-step approach was then used to impute the missing values. FCS imputation was unable to converge for several chronic conditions that had a low prevalence and low missing data (ie, less than 1% missing). As such, the first step was to assume that a respondent did not have a given chronic condition if that condition had less than 1% missing across all sex-stratified cycles. These chronic conditions included Alzheimer’s disease, arthritis, asthma, back problems, bowel disease, cancer, chronic obstructive pulmonary disease (COPD)/emphysema, diabetes, high blood pressure, heart disease, intestinal ulcers, migraines, stroke and urinary incontinence. In the second step, five imputed datasets were generated using FCS on all sociodemographics, health behaviours and chronic conditions except for anxiety and mood disorders. Questions on anxiety and mood disorders were not asked in CCHS cycle 1.1, so a third and final step was needed. Each imputed dataset was divided by sex and by cohort group (ie, derivation cohort and validation cohort). Once separated, FCS was run to impute the missing values for anxiety and mood disorders.
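The first of the three imputation steps can be sketched as follows. This is an illustration only (the imputation itself was run with mice in R); the data layout, in which each sex-stratified cycle is a list of records with `None` marking a missing value, and the function names are hypothetical.

```python
def missing_fraction(records: list[dict], variable: str) -> float:
    """Share of records in which `variable` is missing (None)."""
    return sum(r[variable] is None for r in records) / len(records)


def fill_low_missing_conditions(
    strata: dict[tuple[str, str], list[dict]],
    conditions: list[str],
    threshold: float = 0.01,
) -> list[str]:
    """Set missing values to False (condition absent) for every chronic
    condition with less than `threshold` missing in all (cycle, sex) strata.

    Records are modified in place; the names of the filled conditions are
    returned. Conditions above the threshold are left for FCS imputation.
    """
    filled = []
    for condition in conditions:
        low_everywhere = all(
            missing_fraction(records, condition) < threshold
            for records in strata.values()
        )
        if low_everywhere:
            for records in strata.values():
                for record in records:
                    if record[condition] is None:
                        record[condition] = False
            filled.append(condition)
    return filled
```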
A sensitivity analysis was conducted to examine the variation in model performance measures across the imputed datasets, which would help inform whether we averaged the measures or reported performance measures on the first imputed dataset only. After finding negligible differences in the model performance measures (online supplemental table S1), we opted to present model development and validation measures on the first imputed dataset.
Study design
Models were developed and validated in the Canadian adult provincial population.44 45 Development and validation involved five steps. (1) Models were derived in the first three CCHS cycles (1.1 (2000/2001), 2.1 (2003/2004) and 3.1 (2005/2006)). Two models, one for females and one for males, were developed because of the important biological sex differences in premature mortality34 46 (further outlined in the ‘Model specification’ section). (2) Internal validation used a split-sample approach in which the model developed in 70% of the sample was applied to the remaining 30%. (3) The model was externally validated in the last three CCHS cycles (2007/2008, 2009/2010 and 2011/2012). (4) The derivation and validation data were combined to estimate the final PreMPoRT model for application (ie, in the final Canadian provincial cohort) using the same model specification as in the original model derivation. (5) The predictive performance of the final primary model was assessed among more than 20 programme and policy-relevant subgroups, with a difference of <20% between observed and predicted risk taken to represent good calibration.15 19 In the main results, we report observed and predicted risk for three of these subgroups (ie, education, income and immigration status).
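Two of the steps above lend themselves to a short sketch: the 70/30 split used for internal validation and the <20% criterion used for subgroup calibration. Both functions are illustrative. In particular, expressing the difference relative to the observed risk is an assumption, since the denominator is not specified here, and the random seed and function names are hypothetical.

```python
import random


def split_sample(ids: list, development_fraction: float = 0.7, seed: int = 0):
    """Randomly split respondent identifiers into development and
    internal validation sets (70/30 by default)."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * development_fraction)
    return shuffled[:cut], shuffled[cut:]


def relative_difference(observed_risk: float, predicted_risk: float) -> float:
    """Absolute difference between observed and predicted risk,
    expressed relative to the observed risk (assumed denominator)."""
    return abs(observed_risk - predicted_risk) / observed_risk


def is_well_calibrated(observed_risk: float, predicted_risk: float,
                       threshold: float = 0.20) -> bool:
    """Apply the <20% observed-versus-predicted criterion to a subgroup."""
    return relative_difference(observed_risk, predicted_risk) < threshold
```

For example, a subgroup with an observed risk of 5.0% and a predicted risk of 5.5% differs by 10% under this definition and would meet the criterion.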
Model specification
We began with the prespecified forms of predictors as described in the study protocol (online supplemental table S2). Candidate predictors that did not improve predictive performance were removed. Next, the primary model was built up from the minimal model by sequentially adding predictors until its performance resembled that of the full model. Finally, we verified the sequential addition of predictors in each of the three models using the least absolute shrinkage and selection operator, as described in our study protocol.22 Following sequential predictor selection, alternative specifications were examined. For example, both BMI and age were respecified from continuous with restricted cubic splines to categories for both sexes. The results focus on the primary model; results for the minimal and full models are in online supplemental file 1.
Statistical analysis
Predictive performance was assessed using measures of overall predictive accuracy (ie, Nagelkerke R², Brier score), discrimination (ie, Harrell’s c-statistic, time-specific discrimination slope) and calibration (ie, time-specific calibration curve, calibration intercept and calibration slope). The definitions of each measure are found in online supplemental table S3. From January 2000 to December 2017, respondents from the CCHS were followed until premature mortality, 5 years of follow-up or the study end date, whichever came first. The models were estimated using a Weibull accelerated failure time model, with the proportional hazards assumption assessed using the scaled Schoenfeld residuals. The CCHS survey weights were applied to ensure all estimates were representative of the population.47 48 Data cleaning and predictor coding were conducted in SAS V.9.4, with all model development and validation executed in R V.3.6.2.
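To show how a fitted Weibull accelerated failure time model translates into a predicted risk, the sketch below assumes the common log-linear parameterisation, log T = linear predictor + scale × error, under which the survival function is S(t) = exp(−(t / exp(linear predictor))^(1/scale)). This is the form reported by, for example, the `survreg` function in the R survival package, although the specific fitting function used is not stated here. The function name and any input values are hypothetical and are not estimates from the study.

```python
import math


def weibull_aft_risk(linear_predictor: float, scale: float,
                     horizon_years: float = 5.0) -> float:
    """Predicted probability of the event by `horizon_years`.

    Assumes the log-linear Weibull AFT form:
        S(t) = exp(-(t / exp(linear_predictor)) ** (1 / scale))
    where `linear_predictor` is the intercept plus the sum of
    coefficient-by-predictor products, and `scale` > 0.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    survival = math.exp(
        -((horizon_years / math.exp(linear_predictor)) ** (1.0 / scale))
    )
    return 1.0 - survival
```

In this parameterisation a larger linear predictor corresponds to a longer expected survival time and therefore a lower predicted 5-year risk.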
Patient and public involvement
This study and the published analytical protocol were developed in consultation with decision-makers at three local public health departments in urban and rural Ontario to ensure that the design of the prediction model is optimised for public health applications. These decision-makers were engaged at the beginning of this work to provide feedback on the published study protocol, which detailed the design, analysis and reporting of the present study.