Validating metrics with external packages

In this notebook, we want to validate the metrics we have implemented in the calzone package using other packages or programs.

Reliability diagram

We will use scikit-learn’s calibration_curve function to calculate the reliability diagram

[2]:
from calzone.utils import reliability_diagram,data_loader
from calzone.metrics import calculate_ece_mce,hosmer_lemeshow_test
import numpy as np
from sklearn.calibration import calibration_curve
### loading the data
wellcal_dataloader = data_loader(data_path="../../../example_data/simulated_welldata.csv")

###scikit-learn implementation
scikit_reliability_H,scikit_confidence_H = calibration_curve(wellcal_dataloader.labels,wellcal_dataloader.probs[:,1],n_bins=15,strategy='uniform',pos_label=1)
scikit_reliability_C,scikit_confidence_C = calibration_curve(wellcal_dataloader.labels,wellcal_dataloader.probs[:,1],n_bins=15,strategy='quantile',pos_label=1)

### calzone implementation
calzone_reliability_H,calzone_confindence_H,bin_edge_H,bin_count_H = reliability_diagram(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins=15, class_to_plot=1, is_equal_freq=False)
calzone_reliability_C,calzone_confindence_C,bin_edge_C,bin_count_C = reliability_diagram(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins=15, class_to_plot=1, is_equal_freq=True)

###showing the difference between the two implementations
print("Difference for equal-width binning:")
print("Reliability difference:", np.round(np.abs(scikit_reliability_H - calzone_reliability_H), 4))
print("Confidence difference:", np.round(np.abs(scikit_confidence_H - calzone_confindence_H), 4))
print("\nDifference for equal-count binning:")
print("Reliability difference:", np.round(np.abs(scikit_reliability_C - calzone_reliability_C), 4))
print("Confidence difference:", np.round(np.abs(scikit_confidence_C - calzone_confindence_C), 4))
Difference for equal-width binning:
Reliability difference: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Confidence difference: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

Difference for equal-count binning:
Reliability difference: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Confidence difference: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

We can see that the calzone package return the same Reliability diagram as the sckit-learn package. We will move to test the expected calibration error.

Expected calibration error and Z test

We will use mapie package to validate some of the metrics in calzone. Description of mapie can be found here.

[3]:
from mapie.metrics import spiegelhalter_p_value,top_label_ece,spiegelhalter_statistic
from calzone.metrics import spiegelhalter_z_test
calzone_reliability_topclass_H,calzone_confindence_topclass_H,bin_edge_topclass_H,bin_count_topclass_H = reliability_diagram(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins=15, class_to_plot=None, is_equal_freq=False)
calzone_reliability_topclass_C,calzone_confindence_topclass_C,bin_edge_topclass_C,bin_count_topclass_C = reliability_diagram(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins=15, class_to_plot=None, is_equal_freq=True)

### compare MAPIE and calzone equal-width binning
print("MAPIE topclass ECE-H:",top_label_ece(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins = 15,split_strategy='uniform'))
print("calzone topclass ECE-H:",calculate_ece_mce(calzone_reliability_topclass_H,calzone_confindence_topclass_H,bin_count_topclass_H)[0])

MAPIE topclass ECE-H: 0.009092291386189822
calzone topclass ECE-H: 0.009098499582108283
[4]:
### compare MAPIE and calzone equal-count binning
print("MAPIE topclass ECE-C:",top_label_ece(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins = 15,split_strategy='quantile'))
print("calzone topclass ECE-C:",calculate_ece_mce(calzone_reliability_topclass_C,calzone_confindence_topclass_C,bin_count_topclass_C)[0])

MAPIE topclass ECE-C: 0.016227424850494457
calzone topclass ECE-C: 0.016263864264387196

We can see that both package return a very similar result for ECE. We will also move to validate spiegelhalter’s z test. We found out that the mapie package incorrectly calculates the p-value by using a one-sided test. Therefore, we will only compare the test statistic but not the p-value.

[5]:
### compare the Z statistics
print("MAPIE Z statistic", spiegelhalter_statistic(wellcal_dataloader.labels,wellcal_dataloader.probs[:,1]))
print("calzone Z statistic", spiegelhalter_z_test(wellcal_dataloader.labels,wellcal_dataloader.probs)[0])
MAPIE Z statistic 0.3763269161877356
calzone Z statistic 0.3763269161877356

For other metrics, we could not find any external packages that could be used to cross validate our implementation. However, we perform some simulations to test the type 1 error rate for cox slope and intercept and both return sensible values. For the ICI, we compare it with empirical ECE and found they are very similar.

HL test

Lastly, we will use a quick r code to calculate the HL test test statistic to make sure it is correct.

[51]:
hl_result = hosmer_lemeshow_test(calzone_reliability_C,calzone_confindence_C,bin_count_C,df=len(bin_count_C))
print("calzone HL-C TS=",hl_result[0],"p-value=",hl_result[1],'df=',hl_result[2])
calzone HL-C TS= 6.298957873650591 p-value= 0.9742765471951074 df= 15
[1]:
### This section is r code
library(ResourceSelection)

# Read the CSV file
data <- read.csv("../../../example_data/simulated_welldata.csv")
predicted_prob <- data[,2]  # First column with predicted probabilities
labels <- data[,3]         # Third column with actual labels


# Perform Hosmer-Lemeshow test
hltest <- function(observed, predicted) {
  hl_test <- hoslem.test(observed, predicted,g=15)

  cat("Hosmer-Lemeshow Test Results:\n")
  cat("Chi-square statistic:", round(hl_test$statistic, 4), "\n")
  cat("Degrees of freedom:", hl_test$parameter, "\n")
  cat("p-value:", round(hl_test$p.value, 4), "\n")

  return(hl_test)
}

result <- hltest(labels, predicted_prob)
ResourceSelection 0.3-6          2023-06-27

Hosmer-Lemeshow Test Results:
Chi-square statistic: 6.299
Degrees of freedom: 13
p-value: 0.9346

We see that the test statistics are the same. The R package doesn’t allow user input degree of freedom so the p-value is different as expected.

reference

Taquet, V., Blot, V., Morzadec, T., Lacombe, L., & Brunel, N. (2022). MAPIE: an open-source library for distribution-free uncertainty quantification. arXiv preprint arXiv:2207.12274.

Pedregosa, F., Varoquaux, Ga”el, Gramfort, A., Michel, V., Thirion, B., Grisel, O., … others. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.

Lele, S. R., Keim, J. L., & Solymos, P. (2017). Resource selection (probability) functions for use-availability data. Package ‘ResourceSelection’, Version 0.3-2.