Handling missing data in cardiovascular prediction models in real life
In this project, we investigated traditional statistical and modern machine learning (ML) methods for handling missing predictor data when applying cardiovascular disease (CVD) prediction models in real-time medical settings, and we evaluated how well ML-based prediction model studies follow the recommendations of existing reporting guidelines on missing data. We show that a majority of clinical prediction model studies that use ML techniques do not report sufficient information on the presence and handling of missing data, even though missing values are highly common in the routine healthcare data that often form the basis of ML prediction model studies. We also show that ML-based prediction model studies adhered poorly to the current guideline, Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD).
We further developed novel real-time imputation methods for missing predictor values, using either conditional modelling imputation (CMI, where a multivariable imputation model is derived for each predictor from a population) or joint modelling imputation (JMI, where we use a multivariate normal approximation to generate patient-specific imputations). JMI, especially with auxiliary variables (i.e., variables that are not part of the prediction model), was found to be the preferred approach for real-time imputation of missing predictor values. We also compared various ML modelling techniques that deal with missing predictor values. Surrogate splits were found to perform poorly, whilst pattern submodels showed good performance only when paired with a specific modelling technique. Overall, JMI remained the preferred approach, provided multiple imputations are used. We also describe why internal-external cross-validation (IECV) is preferred for assessing the generalizability of prediction models during their development and for identifying whether complex modelling strategies offer any advantages. Briefly, IECV allows model performance to be evaluated in non-random hold-out samples of individuals from different settings or populations.
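To illustrate the idea behind JMI, the following is a minimal sketch in Python, not the implementation used in the project. It assumes that a mean vector and covariance matrix for the predictors (optionally including auxiliary variables) have already been estimated from a reference population, and it draws multiple imputations for one patient's missing values from the conditional multivariate normal distribution given that patient's observed values. The function name jmi_impute, the variable order and all numeric values are hypothetical.

```python
import numpy as np


def jmi_impute(x, mean, cov, n_draws=5, rng=None):
    """Draw imputations for the NaN entries of one patient's vector x.

    mean and cov describe an assumed multivariate normal distribution of
    all variables (predictors plus any auxiliary variables), estimated
    beforehand from a reference population.
    Returns an array of shape (n_draws, len(x)).
    """
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    rng = np.random.default_rng() if rng is None else rng

    miss = np.isnan(x)
    obs = ~miss
    draws = np.tile(x, (n_draws, 1))
    if not miss.any():
        return draws  # nothing to impute
    if not obs.any():
        # No observed values: fall back to the population distribution.
        draws[:, miss] = rng.multivariate_normal(mean, cov, size=n_draws)
        return draws

    s_oo = cov[np.ix_(obs, obs)]    # covariance among observed variables
    s_mo = cov[np.ix_(miss, obs)]   # covariance between missing and observed
    s_mm = cov[np.ix_(miss, miss)]  # covariance among missing variables

    # w = s_mo @ inv(s_oo), computed with a linear solve for stability.
    w = np.linalg.solve(s_oo, s_mo.T).T
    cond_mean = mean[miss] + w @ (x[obs] - mean[obs])
    cond_cov = s_mm - w @ s_mo.T
    cond_cov = (cond_cov + cond_cov.T) / 2  # remove rounding asymmetry

    draws[:, miss] = rng.multivariate_normal(cond_mean, cond_cov, size=n_draws)
    return draws


# Hypothetical population parameters for age (years), systolic blood
# pressure (mmHg) and total cholesterol (mmol/L); values are made up.
pop_mean = [60.0, 135.0, 5.5]
pop_cov = [[100.0, 40.0, 1.0],
           [40.0, 400.0, 2.0],
           [1.0, 2.0, 1.0]]

# A patient whose systolic blood pressure has not been measured.
patient = [67.0, np.nan, 6.1]
completed = jmi_impute(patient, pop_mean, pop_cov, n_draws=10,
                       rng=np.random.default_rng(2022))
```

In such a setup, the prediction model would be applied to each completed row and the resulting predictions averaged, which is one way to use multiple imputations at the point of care.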
Lastly, we present a pilot study showing that potential users of CVD prediction models found real-time imputation of missing predictor values acceptable. The findings are reported in the PhD thesis of Steven Nijman, to be defended in 2022.