{ "cells": [ { "cell_type": "markdown", "metadata": { "vscode": { "languageId": "plaintext" } }, "source": [ "# Validating metrics with external packages\n", "\n", "In this notebook, we want to validate the metrics we have implemented in the `calzone` package using other packages or programs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reliability diagram\n", "\n", "We will use scikit-learn's `calibration_curve` function to calculate the reliability diagram" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Difference for equal-width binning:\n", "Reliability difference: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n", "Confidence difference: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n", "\n", "Difference for equal-count binning:\n", "Reliability difference: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n", "Confidence difference: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n" ] } ], "source": [ "from calzone.utils import reliability_diagram,data_loader\n", "from calzone.metrics import calculate_ece_mce,hosmer_lemeshow_test\n", "import numpy as np\n", "from sklearn.calibration import calibration_curve\n", "### loading the data\n", "wellcal_dataloader = data_loader(data_path=\"../../../example_data/simulated_welldata.csv\")\n", "\n", "###scikit-learn implementation\n", "scikit_reliability_H,scikit_confidence_H = calibration_curve(wellcal_dataloader.labels,wellcal_dataloader.probs[:,1],n_bins=15,strategy='uniform',pos_label=1)\n", "scikit_reliability_C,scikit_confidence_C = calibration_curve(wellcal_dataloader.labels,wellcal_dataloader.probs[:,1],n_bins=15,strategy='quantile',pos_label=1)\n", "\n", "### calzone implementation\n", "calzone_reliability_H,calzone_confindence_H,bin_edge_H,bin_count_H = reliability_diagram(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins=15, class_to_plot=1, is_equal_freq=False)\n", "calzone_reliability_C,calzone_confindence_C,bin_edge_C,bin_count_C = reliability_diagram(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins=15, class_to_plot=1, is_equal_freq=True)\n", "\n", "###showing the difference between the two implementations\n", "print(\"Difference for equal-width binning:\")\n", "print(\"Reliability difference:\", np.round(np.abs(scikit_reliability_H - calzone_reliability_H), 4))\n", "print(\"Confidence difference:\", np.round(np.abs(scikit_confidence_H - calzone_confindence_H), 4))\n", "print(\"\\nDifference for equal-count binning:\")\n", "print(\"Reliability difference:\", np.round(np.abs(scikit_reliability_C - calzone_reliability_C), 4))\n", "print(\"Confidence difference:\", np.round(np.abs(scikit_confidence_C - calzone_confindence_C), 4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the `calzone` package return the same Reliability diagram as the `sckit-learn` package. We will move to test the expected calibration error." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Expected calibration error and Z test \n", "\n", "We will use `mapie` package to validate some of the metrics in `calzone`. Description of `mapie` can be found [here](https://github.com/scikit-learn-contrib/MAPIE/tree/master)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MAPIE topclass ECE-H: 0.009092291386189822\n", "calzone topclass ECE-H: 0.009098499582108283\n" ] } ], "source": [ "from mapie.metrics import spiegelhalter_p_value,top_label_ece,spiegelhalter_statistic\n", "from calzone.metrics import spiegelhalter_z_test\n", "calzone_reliability_topclass_H,calzone_confindence_topclass_H,bin_edge_topclass_H,bin_count_topclass_H = reliability_diagram(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins=15, class_to_plot=None, is_equal_freq=False)\n", "calzone_reliability_topclass_C,calzone_confindence_topclass_C,bin_edge_topclass_C,bin_count_topclass_C = reliability_diagram(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins=15, class_to_plot=None, is_equal_freq=True)\n", "\n", "### compare MAPIE and calzone equal-width binning\n", "print(\"MAPIE topclass ECE-H:\",top_label_ece(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins = 15,split_strategy='uniform'))\n", "print(\"calzone topclass ECE-H:\",calculate_ece_mce(calzone_reliability_topclass_H,calzone_confindence_topclass_H,bin_count_topclass_H)[0])\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MAPIE topclass ECE-C: 0.016227424850494457\n", "calzone topclass ECE-C: 0.016263864264387196\n" ] } ], "source": [ "### compare MAPIE and calzone equal-count binning\n", "print(\"MAPIE topclass ECE-C:\",top_label_ece(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins = 15,split_strategy='quantile'))\n", "print(\"calzone topclass ECE-C:\",calculate_ece_mce(calzone_reliability_topclass_C,calzone_confindence_topclass_C,bin_count_topclass_C)[0])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that both package return a very similar result for ECE. We will also move to validate spiegelhalter's z test. We found out that the `mapie` package incorrectly calculates the p-value by using a one-sided test. Therefore, we will only compare the test statistic but not the p-value." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MAPIE Z statistic 0.3763269161877356\n", "calzone Z statistic 0.3763269161877356\n" ] } ], "source": [ "### compare the Z statistics\n", "print(\"MAPIE Z statistic\", spiegelhalter_statistic(wellcal_dataloader.labels,wellcal_dataloader.probs[:,1]))\n", "print(\"calzone Z statistic\", spiegelhalter_z_test(wellcal_dataloader.labels,wellcal_dataloader.probs)[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For other metrics, we could not find any external packages that could be used to cross validate our implementation. However, we perform some simulations to test the type 1 error rate for cox slope and intercept and both return sensible values. For the ICI, we compare it with empirical ECE and found they are very similar. \n", "\n", "## HL test\n", "\n", "Lastly, we will use a quick r code to calculate the HL test test statistic to make sure it is correct." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "calzone HL-C TS= 6.298957873650591 p-value= 0.9742765471951074 df= 15\n" ] } ], "source": [ "hl_result = hosmer_lemeshow_test(calzone_reliability_C,calzone_confindence_C,bin_count_C,df=len(bin_count_C))\n", "print(\"calzone HL-C TS=\",hl_result[0],\"p-value=\",hl_result[1],'df=',hl_result[2])" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "ResourceSelection 0.3-6 \t 2023-06-27\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Hosmer-Lemeshow Test Results:\n", "Chi-square statistic: 6.299 \n", "Degrees of freedom: 13 \n", "p-value: 0.9346 \n" ] } ], "source": [ "### This section is r code\n", "library(ResourceSelection)\n", "\n", "# Read the CSV file\n", "data <- read.csv(\"../../../example_data/simulated_welldata.csv\")\n", "predicted_prob <- data[,2] # First column with predicted probabilities\n", "labels <- data[,3] # Third column with actual labels\n", "\n", "\n", "# Perform Hosmer-Lemeshow test\n", "hltest <- function(observed, predicted) {\n", " hl_test <- hoslem.test(observed, predicted,g=15)\n", " \n", " cat(\"Hosmer-Lemeshow Test Results:\\n\")\n", " cat(\"Chi-square statistic:\", round(hl_test$statistic, 4), \"\\n\")\n", " cat(\"Degrees of freedom:\", hl_test$parameter, \"\\n\")\n", " cat(\"p-value:\", round(hl_test$p.value, 4), \"\\n\")\n", " \n", " return(hl_test)\n", "}\n", "\n", "result <- hltest(labels, predicted_prob)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the test statistics are the same. The R package doesn't allow user input degree of freedom so the p-value is different as expected." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## reference\n", "\n", "Taquet, V., Blot, V., Morzadec, T., Lacombe, L., & Brunel, N. (2022). MAPIE: an open-source library for distribution-free uncertainty quantification. arXiv preprint arXiv:2207.12274.\n", "\n", "Pedregosa, F., Varoquaux, Ga\"el, Gramfort, A., Michel, V., Thirion, B., Grisel, O., … others. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.\n", "\n", "Lele, S. R., Keim, J. L., & Solymos, P. (2017). Resource selection (probability) functions for use-availability data. Package ‘ResourceSelection’, Version 0.3-2." ] } ], "metadata": { "kernelspec": { "display_name": "uq", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.0" } }, "nbformat": 4, "nbformat_minor": 2 }