Validating metrics with external packages

In this notebook, we want to validate the metrics we have implemented in the calzone package using other packages or programs.

Reliability diagram

We will use scikit-learn’s calibration_curve function to calculate the reliability diagram

[2]:

from calzone.utils import reliability_diagram,data_loader
from calzone.metrics import calculate_ece_mce,hosmer_lemeshow_test
import numpy as np
from sklearn.calibration import calibration_curve
### loading the data
wellcal_dataloader = data_loader(data_path="../../../example_data/simulated_welldata.csv")

###scikit-learn implementation
scikit_reliability_H,scikit_confidence_H = calibration_curve(wellcal_dataloader.labels,wellcal_dataloader.probs[:,1],n_bins=15,strategy='uniform',pos_label=1)
scikit_reliability_C,scikit_confidence_C = calibration_curve(wellcal_dataloader.labels,wellcal_dataloader.probs[:,1],n_bins=15,strategy='quantile',pos_label=1)

### calzone implementation
calzone_reliability_H,calzone_confindence_H,bin_edge_H,bin_count_H = reliability_diagram(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins=15, class_to_plot=1, is_equal_freq=False)
calzone_reliability_C,calzone_confindence_C,bin_edge_C,bin_count_C = reliability_diagram(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins=15, class_to_plot=1, is_equal_freq=True)

###showing the difference between the two implementations
print("Difference for equal-width binning:")
print("Reliability difference:", np.round(np.abs(scikit_reliability_H - calzone_reliability_H), 4))
print("Confidence difference:", np.round(np.abs(scikit_confidence_H - calzone_confindence_H), 4))
print("\nDifference for equal-count binning:")
print("Reliability difference:", np.round(np.abs(scikit_reliability_C - calzone_reliability_C), 4))
print("Confidence difference:", np.round(np.abs(scikit_confidence_C - calzone_confindence_C), 4))

Difference for equal-width binning:
Reliability difference: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Confidence difference: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

Difference for equal-count binning:
Reliability difference: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Confidence difference: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

We can see that the calzone package return the same Reliability diagram as the sckit-learn package. We will move to test the expected calibration error.

Expected calibration error and Z test

We will use mapie package to validate some of the metrics in calzone. Description of mapie can be found here.

[3]:

from mapie.metrics import spiegelhalter_p_value,top_label_ece,spiegelhalter_statistic
from calzone.metrics import spiegelhalter_z_test
calzone_reliability_topclass_H,calzone_confindence_topclass_H,bin_edge_topclass_H,bin_count_topclass_H = reliability_diagram(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins=15, class_to_plot=None, is_equal_freq=False)
calzone_reliability_topclass_C,calzone_confindence_topclass_C,bin_edge_topclass_C,bin_count_topclass_C = reliability_diagram(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins=15, class_to_plot=None, is_equal_freq=True)

### compare MAPIE and calzone equal-width binning
print("MAPIE topclass ECE-H:",top_label_ece(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins = 15,split_strategy='uniform'))
print("calzone topclass ECE-H:",calculate_ece_mce(calzone_reliability_topclass_H,calzone_confindence_topclass_H,bin_count_topclass_H)[0])

MAPIE topclass ECE-H: 0.009092291386189822
calzone topclass ECE-H: 0.009098499582108283

[4]:

### compare MAPIE and calzone equal-count binning
print("MAPIE topclass ECE-C:",top_label_ece(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins = 15,split_strategy='quantile'))
print("calzone topclass ECE-C:",calculate_ece_mce(calzone_reliability_topclass_C,calzone_confindence_topclass_C,bin_count_topclass_C)[0])

MAPIE topclass ECE-C: 0.016227424850494457
calzone topclass ECE-C: 0.016263864264387196

We can see that both package return a very similar result for ECE. We will also move to validate spiegelhalter’s z test. We found out that the mapie package incorrectly calculates the p-value by using a one-sided test. Therefore, we will only compare the test statistic but not the p-value.

[5]:

### compare the Z statistics
print("MAPIE Z statistic", spiegelhalter_statistic(wellcal_dataloader.labels,wellcal_dataloader.probs[:,1]))
print("calzone Z statistic", spiegelhalter_z_test(wellcal_dataloader.labels,wellcal_dataloader.probs)[0])

MAPIE Z statistic 0.3763269161877356
calzone Z statistic 0.3763269161877356

For other metrics, we could not find any external packages that could be used to cross validate our implementation. However, we perform some simulations to test the type 1 error rate for cox slope and intercept and both return sensible values. For the ICI, we compare it with empirical ECE and found they are very similar.

HL test

Lastly, we will use a quick r code to calculate the HL test test statistic to make sure it is correct.

[51]:

hl_result = hosmer_lemeshow_test(calzone_reliability_C,calzone_confindence_C,bin_count_C,df=len(bin_count_C))
print("calzone HL-C TS=",hl_result[0],"p-value=",hl_result[1],'df=',hl_result[2])

calzone HL-C TS= 6.298957873650591 p-value= 0.9742765471951074 df= 15

[1]:

### This section is r code
library(ResourceSelection)

# Read the CSV file
data <- read.csv("../../../example_data/simulated_welldata.csv")
predicted_prob <- data[,2]  # First column with predicted probabilities
labels <- data[,3]         # Third column with actual labels


# Perform Hosmer-Lemeshow test
hltest <- function(observed, predicted) {
  hl_test <- hoslem.test(observed, predicted,g=15)

  cat("Hosmer-Lemeshow Test Results:\n")
  cat("Chi-square statistic:", round(hl_test$statistic, 4), "\n")
  cat("Degrees of freedom:", hl_test$parameter, "\n")
  cat("p-value:", round(hl_test$p.value, 4), "\n")

  return(hl_test)
}

result <- hltest(labels, predicted_prob)

ResourceSelection 0.3-6          2023-06-27

Hosmer-Lemeshow Test Results:
Chi-square statistic: 6.299
Degrees of freedom: 13
p-value: 0.9346

We see that the test statistics are the same. The R package doesn’t allow user input degree of freedom so the p-value is different as expected.

reference

Taquet, V., Blot, V., Morzadec, T., Lacombe, L., & Brunel, N. (2022). MAPIE: an open-source library for distribution-free uncertainty quantification. arXiv preprint arXiv:2207.12274.

Pedregosa, F., Varoquaux, Ga”el, Gramfort, A., Michel, V., Thirion, B., Grisel, O., … others. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.

Lele, S. R., Keim, J. L., & Solymos, P. (2017). Resource selection (probability) functions for use-availability data. Package ‘ResourceSelection’, Version 0.3-2.