
Considerations for critical appraisal

Overall quality rating

We rate the overall quality of evidence supporting a prediction guide as low, moderate, or high. We only include guides that have demonstrated at least fair prediction accuracy (see explanation below). The highest degree of risk to quality in any one of the domains 'risk to reproducibility', 'risk of data reporting bias', and 'risk to transportability' explained below determines the overall quality rating. We may include a prediction guide supported by low quality evidence due to high risk to transportability if risk to reproducibility and risk of data reporting bias are judged to be low and prediction accuracy is at least fair.
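The rule above can be sketched in a few lines of Python. This is only an illustration: the function name and the string labels are hypothetical, and the mapping of moderate risk to moderate quality and low risk to high quality is an assumption by symmetry (the text only states explicitly that high risk in a domain gives low quality evidence).

```python
# Risk levels ordered from best (0) to worst (2).
RISK_ORDER = {"low": 0, "moderate": 1, "high": 2}

# Assumed mapping: the worst domain risk sets the overall quality.
QUALITY_FOR_WORST_RISK = {"low": "high", "moderate": "moderate", "high": "low"}


def overall_quality(reproducibility, reporting_bias, transportability):
    """Return the overall quality rating implied by the three domain risks."""
    risks = [reproducibility, reporting_bias, transportability]
    for risk in risks:
        if risk not in RISK_ORDER:
            raise ValueError(f"unknown risk level: {risk!r}")
    # The highest degree of risk in any one domain determines the rating.
    worst = max(risks, key=RISK_ORDER.get)
    return QUALITY_FOR_WORST_RISK[worst]
```

For example, `overall_quality("low", "low", "high")` gives `"low"`, which corresponds to the case described above of a guide with high risk to transportability but low risk in the other two domains.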

Prediction accuracy

We assess prediction accuracy by discrimination and calibration or by a combined measure (e.g. Brier score). Discrimination refers to a model's ability to differentiate between patients who do and do not develop the outcome of interest; calibration is the agreement between predicted and observed risks. A prediction model is well-calibrated if, for every 100 patients given a risk of x%, close to x go on to develop the outcome of interest.
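To make these definitions concrete, here is a minimal Python sketch of the Brier score and of a grouped comparison of predicted and observed risk. The function names, the equal-sized grouping, and the default of four groups are assumptions made for illustration; they are not a description of how any particular study assessed calibration.

```python
def brier_score(predicted, observed):
    """Mean squared difference between predicted risk (0-1) and outcome (0 or 1)."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("predicted and observed must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(predicted, observed)) / len(predicted)


def calibration_table(predicted, observed, n_groups=4):
    """Sort patients by predicted risk, split them into roughly equal groups,
    and return (mean predicted risk, observed event proportion, group size)
    for each non-empty group."""
    pairs = sorted(zip(predicted, observed))
    size = len(pairs) / n_groups
    rows = []
    for g in range(n_groups):
        chunk = pairs[round(g * size):round((g + 1) * size)]
        if not chunk:
            continue
        mean_predicted = sum(p for p, _ in chunk) / len(chunk)
        observed_rate = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_predicted, observed_rate, len(chunk)))
    return rows
```

In a well-calibrated model the first two values in each row are close to each other. A lower Brier score is better, with 0 meaning every prediction was exactly 0 or 1 and correct.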

Most studies quantify discrimination using the c-statistic (or the area under the receiver operating characteristic curve [AUC], its equivalent for binary outcomes). For these measures we judge <0.70 as poor; 0.70-0.79, fair; 0.80-0.89, good; and ≥0.90 as excellent discrimination. The assessment of calibration is more subjective. We consider any measure of how closely the predicted probabilities agree with the true event proportions, including calibration plots, goodness-of-fit tests (e.g. Hosmer-Lemeshow test), or tables of predicted versus observed risks. When possible, we comment on the range of risks in which prediction may be poorly calibrated (e.g. if predictions >20% overestimate risk).
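As an illustration, the c-statistic for a binary outcome can be computed by comparing every patient who had the outcome with every patient who did not, and the result graded with the thresholds above. The sketch below is a plain pairwise implementation with hypothetical function names; counting tied predictions as half-concordant is the usual convention, but it is an assumption here.

```python
def c_statistic(predicted, observed):
    """Proportion of (event, non-event) pairs in which the patient with the
    event was given the higher predicted risk; ties count as 0.5."""
    events = [p for p, y in zip(predicted, observed) if y == 1]
    non_events = [p for p, y in zip(predicted, observed) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


def grade_discrimination(c):
    """Apply the thresholds stated in the text to a c-statistic or AUC."""
    if c >= 0.90:
        return "excellent"
    if c >= 0.80:
        return "good"
    if c >= 0.70:
        return "fair"
    return "poor"
```

The pairwise loop is quadratic in the number of patients, which is fine for a worked example but slow for large datasets.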

Risk to reproducibility

A prediction model is not likely to perform well in new patients (i.e. patients outside of the study in which it was developed) if there is risk of statistical overfitting during the data analysis. Statistical overfitting refers to the extent to which a prediction model represents random variation in the data instead of true predictor-outcome relationships. It occurs when a guide is developed in too small a sample and then evaluated in the same sample, and results in inflated estimates of prediction performance. In the absence of external validation of sufficient size, we consider performance estimates from models developed in samples with <5 events per variable to be at high risk of bias from statistical overfitting, moderate risk for 5-9 events per variable, and low risk for ≥10 events per variable or 5-9 events per variable with statistical correction techniques (i.e. ‘shrinkage’) to limit overfitting. We count all variables evaluated for their relationship with the outcome and not just those included in the final model. If attempts were made to dichotomize a continuous variable at different thresholds, each dichotomous version counts as a variable. When the candidate thresholds are not reported, we add three variables to the total variable count and use this in the events-per-variable calculation. We categorize risk to reproducibility (from statistical overfitting) as follows:

High: Model developed in sample with < 5 events per variable evaluated or evidence of poor accuracy in rigorous internal validation.
Moderate: Model developed in sample with 5-9 events per variable evaluated and no evidence of poor accuracy in rigorous internal validation (if performed), or showed at least fair accuracy in rigorous internal validation.
Low: Model developed in sample with ≥10 events per variable evaluated and no evidence of poor accuracy in rigorous internal validation (if performed), or 5-9 events per variable with methods to limit statistical overfitting, or showed at least fair accuracy in rigorous internal validation.
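The events-per-variable arithmetic behind these categories can be sketched as follows. The function and parameter names are hypothetical. Two simplifications are assumed: the three extra variables for unreported thresholds are added once to the total, and non-integer values between 9 and 10 are treated as part of the 5-9 band. The sketch covers only the events-per-variable criterion and does not encode the internal validation conditions.

```python
def events_per_variable(n_events, n_candidate_variables, thresholds_unreported=False):
    """Events divided by all candidate variables evaluated, not just those
    kept in the final model. Adds three variables when candidate thresholds
    for a dichotomized variable were not reported (assumed: added once)."""
    total = n_candidate_variables + (3 if thresholds_unreported else 0)
    if total <= 0:
        raise ValueError("need at least one candidate variable")
    return n_events / total


def overfitting_risk(epv, shrinkage_used=False):
    """Risk category from events per variable alone (internal validation
    results, which can change the category, are not considered here)."""
    if epv < 5:
        return "high"
    if epv < 10:
        # 5-9 events per variable: low only with methods to limit overfitting.
        return "low" if shrinkage_used else "moderate"
    return "low"
```

For instance, a model developed with 60 events and 12 candidate variables has 5 events per variable (moderate risk); if candidate thresholds were not reported, the count becomes 15 variables and 4 events per variable (high risk).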

Risk of data reporting bias

Data reporting bias refers to bias from missing predictor or outcome data that leads investigators to identify risk factors that in truth predict the ascertainment of the outcome rather than the true risk of the outcome. This can happen when the outcome is more likely to be detected in some patients than in others, for example in patients with more comorbidities, those having higher-risk surgery, or those with a complicated postoperative course. We categorize risk of data reporting bias as follows:

High: Outcome being predicted is ascertained using routinely collected data and is likely to be missed in a substantial proportion of patients unless systematic surveillance is employed, or substantial amount of missing predictor data, and no reassurance from statistical imputation.
Moderate: Outcome being predicted is ascertained using routinely collected data and is likely to be missed in some patients unless systematic surveillance is employed, or substantial amount of missing predictor data, but with reassurance from statistical imputation.
Low: Outcome being predicted is not likely to be missed in routine practice, or systematic monitoring for outcome; no or trivial amount of missing predictor data.

Risk to transportability

A transportable prediction guide remains accurate in patients drawn from a different but related population or in data collected by using methods that differ from those used in its development. We categorize risk to transportability as follows:

High: Only single-center retrospective validation or evidence of poor accuracy in prospective validation or multicenter (≥2 centers) retrospective validation.
Moderate: At least fair accuracy in single-center prospective validation or multicenter (≥2 centers) retrospective validation.
Low: At least fair accuracy in multicenter (≥2 centers) prospective validation.
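These three categories amount to a small decision rule, sketched below. The function name and arguments are hypothetical, and the sketch assumes it is applied to the single strongest validation study available for a guide.

```python
def transportability_risk(n_centers, prospective, at_least_fair_accuracy):
    """Risk to transportability for one validation study.

    n_centers: number of centers contributing validation data
    prospective: True if data were collected prospectively
    at_least_fair_accuracy: True if accuracy in that validation was at least fair
    """
    multicenter = n_centers >= 2
    if not at_least_fair_accuracy:
        return "high"       # evidence of poor accuracy in validation
    if multicenter and prospective:
        return "low"        # multicenter prospective validation
    if multicenter or prospective:
        return "moderate"   # single-center prospective or multicenter retrospective
    return "high"           # only single-center retrospective validation
```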

What we mean by validation

We consider three general types of assessment methodology: external validation, internal validation, and apparent performance.

In external validation, the validation data differ from the development data in systematic ways and typically provide the most reliable assessment of prediction performance. We distinguish external validation as temporal (different points in time) or geographic (different centers), and retrospective (e.g. using existing data from medical records) or prospective (data collected with the predictors and outcome in mind). Prospective validation provides an unbiased assessment of prediction performance and generalizability to different patients, and a real-world test in the hands of clinicians.

In internal validation, performance estimates are derived from data that differ from the development sample by chance alone. This includes random splitting of a dataset (e.g. 50% of patients randomly allocated to contribute data only to the development of the prediction guide and the other 50% only to validation) or resampling from the development data (i.e. bootstrap, jack-knife, cross-validation, or sub-sampling procedures). Resampling techniques (especially bootstrapping) are preferred to splitting the dataset because they avoid ignoring data that would be valuable for model development and leaving insufficient data for model testing. Although internal validation offers some protection against bias, there are many difficult-to-assess nuances to correct execution. Even the best internal validation does not assess how well the prediction model would perform when it is put into practice or is tested in patients that differ systematically from the development sample.
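One widely used resampling approach is the bootstrap optimism correction, sketched below in generic form. This is an illustration under stated assumptions, not a description of the procedure used in any study reviewed here: `fit` and `metric` are hypothetical callables supplied by the user, and the default of 200 bootstrap samples is arbitrary.

```python
import random


def optimism_corrected(data, fit, metric, n_boot=200, seed=0):
    """Bootstrap optimism-corrected performance estimate.

    data:   list of patient records
    fit:    function(records) -> model, repeating the full modelling process
    metric: function(model, records) -> performance (higher is better)
    """
    rng = random.Random(seed)
    # Apparent performance: model developed and tested on the same data.
    apparent = metric(fit(data), data)
    total_optimism = 0.0
    for _ in range(n_boot):
        # Draw a sample of the same size with replacement.
        sample = [rng.choice(data) for _ in data]
        model = fit(sample)
        # Optimism: performance in the bootstrap sample minus performance
        # of the same model in the original data.
        total_optimism += metric(model, sample) - metric(model, data)
    return apparent - total_optimism / n_boot
```

Unlike split-sample validation, every patient contributes to both model development and testing. The correction is only meaningful if `fit` repeats every modelling step, including any variable selection, in each bootstrap sample.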

Apparent performance estimates are derived from the development sample. This is the weakest form of assessment because the prediction model reflects these data by design. Thus, performance in the development sample is overestimated compared to what it would be in new patients. Apparent assessments of discrimination are still of some value but calibration is not meaningful. While these limitations are generally true, if a model is developed in a very large representative sample, its performance in that sample may well reflect performance in a large independent sample. In that scenario, internal validation procedures offer little advantage over simply reporting the apparent performance of the model. External validation in various geographic areas, however, may still offer information about generalizability, provided that the validation samples are large and representative.

