{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "source": [
    "# Validating metrics with external packages\n",
    "\n",
    "In this notebook, we want to validate the metrics we have implemented in the `calzone` package using other packages or programs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reliability diagram\n",
    "\n",
    "We will use scikit-learn's `calibration_curve` function to calculate the reliability diagram"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Difference for equal-width binning:\n",
      "Reliability difference: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n",
      "Confidence difference: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n",
      "\n",
      "Difference for equal-count binning:\n",
      "Reliability difference: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n",
      "Confidence difference: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n"
     ]
    }
   ],
   "source": [
    "from calzone.utils import reliability_diagram,data_loader\n",
    "from calzone.metrics import calculate_ece_mce,hosmer_lemeshow_test\n",
    "import numpy as np\n",
    "from sklearn.calibration import calibration_curve\n",
    "### loading the data\n",
    "wellcal_dataloader = data_loader(data_path=\"../../../example_data/simulated_welldata.csv\")\n",
    "\n",
    "###scikit-learn implementation\n",
    "scikit_reliability_H,scikit_confidence_H = calibration_curve(wellcal_dataloader.labels,wellcal_dataloader.probs[:,1],n_bins=15,strategy='uniform',pos_label=1)\n",
    "scikit_reliability_C,scikit_confidence_C = calibration_curve(wellcal_dataloader.labels,wellcal_dataloader.probs[:,1],n_bins=15,strategy='quantile',pos_label=1)\n",
    "\n",
    "### calzone implementation\n",
    "calzone_reliability_H,calzone_confindence_H,bin_edge_H,bin_count_H = reliability_diagram(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins=15, class_to_plot=1, is_equal_freq=False)\n",
    "calzone_reliability_C,calzone_confindence_C,bin_edge_C,bin_count_C = reliability_diagram(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins=15, class_to_plot=1, is_equal_freq=True)\n",
    "\n",
    "###showing the difference between the two implementations\n",
    "print(\"Difference for equal-width binning:\")\n",
    "print(\"Reliability difference:\", np.round(np.abs(scikit_reliability_H - calzone_reliability_H), 4))\n",
    "print(\"Confidence difference:\", np.round(np.abs(scikit_confidence_H - calzone_confindence_H), 4))\n",
    "print(\"\\nDifference for equal-count binning:\")\n",
    "print(\"Reliability difference:\", np.round(np.abs(scikit_reliability_C - calzone_reliability_C), 4))\n",
    "print(\"Confidence difference:\", np.round(np.abs(scikit_confidence_C - calzone_confindence_C), 4))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the `calzone` package return the same Reliability diagram as the `sckit-learn` package. We will move to test the expected calibration error."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Expected calibration error and Z test \n",
    "\n",
    "We will use `mapie` package to validate some of the metrics in `calzone`. Description of `mapie` can be found [here](https://github.com/scikit-learn-contrib/MAPIE/tree/master)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MAPIE topclass ECE-H: 0.009092291386189822\n",
      "calzone topclass ECE-H: 0.009098499582108283\n"
     ]
    }
   ],
   "source": [
    "from mapie.metrics import spiegelhalter_p_value,top_label_ece,spiegelhalter_statistic\n",
    "from calzone.metrics import spiegelhalter_z_test\n",
    "calzone_reliability_topclass_H,calzone_confindence_topclass_H,bin_edge_topclass_H,bin_count_topclass_H = reliability_diagram(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins=15, class_to_plot=None, is_equal_freq=False)\n",
    "calzone_reliability_topclass_C,calzone_confindence_topclass_C,bin_edge_topclass_C,bin_count_topclass_C = reliability_diagram(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins=15, class_to_plot=None, is_equal_freq=True)\n",
    "\n",
    "### compare MAPIE and calzone equal-width binning\n",
    "print(\"MAPIE topclass ECE-H:\",top_label_ece(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins = 15,split_strategy='uniform'))\n",
    "print(\"calzone topclass ECE-H:\",calculate_ece_mce(calzone_reliability_topclass_H,calzone_confindence_topclass_H,bin_count_topclass_H)[0])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MAPIE topclass ECE-C: 0.016227424850494457\n",
      "calzone topclass ECE-C: 0.016263864264387196\n"
     ]
    }
   ],
   "source": [
    "### compare MAPIE and calzone equal-count binning\n",
    "print(\"MAPIE topclass ECE-C:\",top_label_ece(wellcal_dataloader.labels,wellcal_dataloader.probs,num_bins = 15,split_strategy='quantile'))\n",
    "print(\"calzone topclass ECE-C:\",calculate_ece_mce(calzone_reliability_topclass_C,calzone_confindence_topclass_C,bin_count_topclass_C)[0])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that both package return a very similar result for ECE. We will also move to validate spiegelhalter's z test. We found out that the `mapie` package incorrectly calculates the p-value by using a one-sided test. Therefore, we will only compare the test statistic but not the p-value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MAPIE Z statistic 0.3763269161877356\n",
      "calzone Z statistic 0.3763269161877356\n"
     ]
    }
   ],
   "source": [
    "### compare the Z statistics\n",
    "print(\"MAPIE Z statistic\", spiegelhalter_statistic(wellcal_dataloader.labels,wellcal_dataloader.probs[:,1]))\n",
    "print(\"calzone Z statistic\", spiegelhalter_z_test(wellcal_dataloader.labels,wellcal_dataloader.probs)[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For other metrics, we could not find any external packages that could be used to cross validate our implementation. However, we perform some simulations to test the type 1 error rate for cox slope and intercept and both return sensible values. For the ICI, we compare it with empirical ECE and found they are very similar. \n",
    "\n",
    "## HL test\n",
    "\n",
    "Lastly, we will use a quick r code to calculate the HL test test statistic to make sure it is correct."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "calzone HL-C TS= 6.298957873650591 p-value= 0.9742765471951074 df= 15\n"
     ]
    }
   ],
   "source": [
    "hl_result = hosmer_lemeshow_test(calzone_reliability_C,calzone_confindence_C,bin_count_C,df=len(bin_count_C))\n",
    "print(\"calzone HL-C TS=\",hl_result[0],\"p-value=\",hl_result[1],'df=',hl_result[2])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "ResourceSelection 0.3-6 \t 2023-06-27\n",
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hosmer-Lemeshow Test Results:\n",
      "Chi-square statistic: 6.299 \n",
      "Degrees of freedom: 13 \n",
      "p-value: 0.9346 \n"
     ]
    }
   ],
   "source": [
    "### This section is r code\n",
    "library(ResourceSelection)\n",
    "\n",
    "# Read the CSV file\n",
    "data <- read.csv(\"../../../example_data/simulated_welldata.csv\")\n",
    "predicted_prob <- data[,2]  # First column with predicted probabilities\n",
    "labels <- data[,3]         # Third column with actual labels\n",
    "\n",
    "\n",
    "# Perform Hosmer-Lemeshow test\n",
    "hltest <- function(observed, predicted) {\n",
    "  hl_test <- hoslem.test(observed, predicted,g=15)\n",
    "  \n",
    "  cat(\"Hosmer-Lemeshow Test Results:\\n\")\n",
    "  cat(\"Chi-square statistic:\", round(hl_test$statistic, 4), \"\\n\")\n",
    "  cat(\"Degrees of freedom:\", hl_test$parameter, \"\\n\")\n",
    "  cat(\"p-value:\", round(hl_test$p.value, 4), \"\\n\")\n",
    "  \n",
    "  return(hl_test)\n",
    "}\n",
    "\n",
    "result <- hltest(labels, predicted_prob)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the test statistics are the same. The R package doesn't allow user input degree of freedom so the p-value is different as expected."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## reference\n",
    "\n",
    "Taquet, V., Blot, V., Morzadec, T., Lacombe, L., & Brunel, N. (2022). MAPIE: an open-source library for distribution-free uncertainty quantification. arXiv preprint arXiv:2207.12274.\n",
    "\n",
    "Pedregosa, F., Varoquaux, Ga\"el, Gramfort, A., Michel, V., Thirion, B., Grisel, O., … others. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.\n",
    "\n",
    "Lele, S. R., Keim, J. L., & Solymos, P. (2017). Resource selection (probability) functions for use-availability data. Package ‘ResourceSelection’, Version 0.3-2."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "uq",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}