COX calibration analysis

Theoretical Background

Cox calibration analysis is both a logistic recalibration technique and a method to examine the current calibration of a model. To perform the analysis, we first to fit a logistic regression model using logit (log odds, aka \(\log\frac{\hat{p}}{1-\hat{p}}\)) as the predictor variable and the outcome as the target variable.

\[p_{new} = \frac{1}{1+e^{-(a + b \cdot \log\frac{\hat{p}}{1-\hat{p}})}}\]

In the case of perfect calibration, \(P(Y=1|p=\hat{p}) = \hat{p}\) and the new probability \(p_{new}\) is equal to the original probability \(\hat{p}\). That means \(a=0\) and \(b=1\). If \(b>1\), the model is under-confident at high probabilities and over-confident at low probabilities for the class-of-interest. If \(b<1\), the model is over-confident at high probabilities and under-confident at low probabilities for the class-of-interest. If \(a>0\), the model is over-confident at all probabilities for the class-of-interest. If \(a<0\), the model is under-confident at all probabilities for the class-of-interest. The confidence interval of \(a\) and \(b\) can be used to guide the calibration of the model. The user can also choose to fix \(a=0\) and fit for \(b\) only and vice versa, then there will be no interaction between \(a\) and \(b\) and the confidence interval can be used as a statistical test to test for perfect calibration.

Pros of Cox calibration analysis

Cox calibration analysis doesn’t depend on binning of data, which is a big advantage since common metrics such as ECE/MCE and HL test all depend on binning and we have shown that changing binning can lead to different results. We can also use it to perform statistical tests by fixing \(a\) to 0 and test whether \(b=1\) and the other way around to test for perfect calibration. Also, the fitted values of \(a\) and \(b\) can tell us how the model is miscalibrated, whether it is an overall under- or over-confidence or if it is over-confident in some ranges and under-confident in others. For example, if \(a\) is not close to 0 while \(b\) is close to 1, it likely indicates a prevalence shift. See more details in the prevalence adjustment notebook.

Cons of Cox calibration analysis

Cox Calibration analysis can only assess weak calibration , meaning whether \(P(Y=1|\hat{p}=p) = p\) for all \(p\). It only captures certain types of miscalibration (general over/under-confidence). A model can have \(a=0\) and \(b=1\) and still be miscalibrated. For example, a model is over-confident at low and high probabilities and under-confident at intermediate probabilities could has \(a=0\) and \(b=1\) but still be miscalibrated.

Calculating Cox slope and intercept with calzone

There are two ways to calculate the Cox slope and intercept. Calling the Cox function gives you more control over the calculation, including fixing \(a=0\) or \(b=1\).

[1]:
from calzone.utils import reliability_diagram,data_loader
from calzone.metrics import cox_regression_analysis
import numpy as np

### loading the data
wellcal_dataloader = data_loader(data_path="../../../example_data/simulated_welldata.csv")

### calculating cox slope and intercept
cox_slope, cox_intercept,cox_slope_ci,cox_intercept_ci = cox_regression_analysis(wellcal_dataloader.labels, wellcal_dataloader.probs,class_to_calculate=1,print_results=True)
                           Logit Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 5000
Model:                          Logit   Df Residuals:                     4998
Method:                           MLE   Df Model:                            1
Date:                Wed, 02 Oct 2024   Pseudo R-squ.:                  0.4438
Time:                        15:53:27   Log-Likelihood:                -1927.5
converged:                       True   LL-Null:                       -3465.6
Covariance Type:            nonrobust   LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0450      0.040     -1.123      0.262      -0.123       0.034
x1             0.9942      0.029     34.212      0.000       0.937       1.051
==============================================================================

You can also fix the intercept \(a=0\) by using the fix_intercept=True option. Similarly, you can fix the slope \(b=1\) by using the fix_slope=True option.

[2]:
### fixing intercept and calculating cox slope
cox_slope, cox_intercept,cox_slope_ci,cox_intercept_ci = cox_regression_analysis(wellcal_dataloader.labels, wellcal_dataloader.probs,class_to_calculate=1, fix_intercept=True,print_results=True)
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.385628
         Iterations: 0
                           Logit Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 5000
Model:                          Logit   Df Residuals:                     4999
Method:                           MLE   Df Model:                            0
Date:                Wed, 02 Oct 2024   Pseudo R-squ.:                  0.4436
Time:                        15:53:27   Log-Likelihood:                -1928.1
converged:                      False   LL-Null:                       -3465.6
Covariance Type:            nonrobust   LLR p-value:                       nan
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const               0          0        nan        nan           0           0
x1             0.9939      0.029     34.210      0.000       0.937       1.051
==============================================================================

Model has been estimated subject to linear equality constraints.

Alternatively, we can use the CalibrationMetrics class to compute the COX slope and intercept.

[3]:
from calzone.metrics import CalibrationMetrics
calmetrics = CalibrationMetrics()
calmetrics.calculate_metrics(wellcal_dataloader.labels, wellcal_dataloader.probs, metrics=['COX'],print_results=True)
                           Logit Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 5000
Model:                          Logit   Df Residuals:                     4998
Method:                           MLE   Df Model:                            1
Date:                Wed, 02 Oct 2024   Pseudo R-squ.:                  0.4438
Time:                        15:53:27   Log-Likelihood:                -1927.5
converged:                       True   LL-Null:                       -3465.6
Covariance Type:            nonrobust   LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0450      0.040     -1.123      0.262      -0.123       0.034
x1             0.9942      0.029     34.212      0.000       0.937       1.051
==============================================================================
[3]:
{'COX coef': 0.9942499557748269,
 'COX intercept': -0.04497652296600376,
 'COX coef lowerci': 0.9372902801721911,
 'COX coef upperci': 1.0512096313774626,
 'COX intercept lowerci': -0.12348577118577644,
 'COX intercept upperci': 0.03353272525376893,
 'COX ICI': 0.005610391483826338}

The resulting COX slope and intercept can be used to calibrate the model but it is beyond the scope of this package.

P-value of COX slope and intecept test

Although Cox calibration analysis is usually only used to estimate the overall calibration trend, the resulting estimates of the slope and intercept can also be used to test whether the model is well calibrated (moderate calibration). We will do a demostrate on the statistical size of the slope and intercept test below

[ ]:
### The size of slope test
from calzone.utils import fake_binary_data_generator
np.random.seed(123)
fakedata_generator = fake_binary_data_generator(alpha_val=0.5, beta_val=0.5)
cal_metrics = CalibrationMetrics()
sample_size = 1000
simulation_size = 1000
results = []
# generate data
for i in range(simulation_size):
    X, y = fakedata_generator.generate_data(sample_size)
    if i == 0:
        tempresult = cal_metrics.calculate_metrics(y, X, ['COX'],return_numpy=False,fix_intercept=True) #we need to fix the intercept to be 0
        keys = list(tempresult.keys())
        results.append(np.array(list(tempresult.values())))
    else:
        tempresult = cal_metrics.calculate_metrics(y, X, ['COX'],return_numpy=True,fix_intercept=True) #we need to fix the intercept to be 0
        results.append(tempresult)
results = np.array(results)

[5]:
Cox_slope = results[:,0]
Cox_slope_lowerci = results[:,2]
Cox_slope_upperci = results[:,3]
chance = np.logical_and(Cox_slope_lowerci<=1, Cox_slope_upperci>=1)
print('The size of the Cox slope test is: ', 1-np.mean(chance))
The size of the Cox slope test is:  0.039000000000000035

We can also do the intercept test:

[ ]:
### The size of intercept test
from calzone.utils import fake_binary_data_generator
np.random.seed(123)
fakedata_generator = fake_binary_data_generator(alpha_val=0.5, beta_val=0.5)
cal_metrics = CalibrationMetrics()
sample_size = 1000
simulation_size = 1000
results = []
# generate data
for i in range(simulation_size):
    X, y = fakedata_generator.generate_data(sample_size)
    if i == 0:
        tempresult = cal_metrics.calculate_metrics(y, X, ['COX'],return_numpy=False,fix_slope=True) #we need to fix the slope to be 1
        keys = list(tempresult.keys())
        results.append(np.array(list(tempresult.values())))
    else:
        tempresult = cal_metrics.calculate_metrics(y, X, ['COX'],return_numpy=True,fix_slope=True) #we need to fix the slope to be 1
        results.append(tempresult)
results = np.array(results)
[7]:
Cox_intercept = results[:,1]
Cox_intercept_lowerci = results[:,4]
Cox_intercept_upperci = results[:,5]
chance = np.logical_and(Cox_intercept_lowerci<=0, Cox_intercept_upperci>=0)
print('The size of the Cox intercept test is: ', 1-np.mean(chance))
The size of the Cox intercept test is:  0.05600000000000005

The test results in different size values but they are closer than with the HL test.

References

Cox, D. R. (1958). Two Further Applications of a Model for Binary Regression.

Calster, B. V., & Steyerberg, E. W. (2018). Calibration of Prognostic Risk Scores. In R. S. Kenett, N. T. Longford, W. W. Piegorsch, & F. Ruggeri (Eds.), Wiley StatsRef: Statistics Reference Online (1st ed., pp. 1–10). Wiley. https://doi.org/10.1002/9781118445112.stat08078

Huang, Y., Li, W., Macheret, F., Gabriel, R. A., & Ohno-Machado, L. (2020). A tutorial on calibration measurements and calibration models for clinical prediction models. Journal of the American Medical Informatics Association, 27(4), 621–633. https://doi.org/10.1093/jamia/ocz228