{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Summary and guide for calzone\n",
    "\n",
    "We provide a summary of the calibration metrics provides by calzone, including the pros and cons of each metrics. For a more detailed explanation of each metrics and how to calculate them using calzone, please refer to the specific notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from IPython.display import display, HTML\n",
    "data = {\n",
    "    'Metrics': ['Expected calibration error<br>(ECE)', 'Maximum calibration error<br>(MCE)', 'Hosmer-Lemeshow test', \"Spiegelhalter's z test\", \"Cox's analysis\", 'Integrated calibration index<br> (ICI)'],\n",
    "    'Description': [\n",
    "        '<div>Using binned reliability diagram<br>(equal-width or equal-count binning),<br>sum of absolute difference, weighted by bin count.</div>',\n",
    "        '<div>Using binned reliability diagram<br>(equal-width or equal-count binning),<br>Maximum absolute difference.</div>',\n",
    "        '<div>Using binned reliability diagram<br>(equal-width or equal-count binning),<br>Chi-squared based test using expected and observed.</div>',\n",
    "        '<div>Decomposition of brier score.<br>Normal distributed<br> </div>',\n",
    "        '<div>Logistic regression of the logits<br> <br> </div>',\n",
    "        '<div>Similar to ECE, using smooth fit (usually losse)<br>instead of binning to get<br>the calibration curve</div>'\n",
    "    ],\n",
    "    'Pros': [\n",
    "        '<div>• Intuitive<br>• Easy to calculate</div>',\n",
    "        '<div>• Intuitive<br>• Easy to calculate</div>',\n",
    "        '<div>• Intuitive<br>• Statistical meaning</div>',\n",
    "        '<div>• Doesn\\'t rely on binning<br>• Statistical meaning</div>',\n",
    "        '<div>• Doesn\\'t rely on binning<br>• Hints at miscalibration type</div>',\n",
    "        '<div>• Doesn\\'t rely on binning<br>• Capture all kind of miscalibration</div>'\n",
    "    ],\n",
    "    'Cons': [\n",
    "        '<div>• Depend on binning <br>• Depend on class-by-class/top-class</div>',\n",
    "        '<div>• Depend on binning <br>• Depend on class-by-class/top-class</div>',\n",
    "        '<div>• Depend on binning <br>• Low power<br>• Wrong coverage</div>',\n",
    "        '<div>• Doesn\\'t detect prevalence shift</div>',\n",
    "        '<div>• Failed to capture some miscalibration</div>',\n",
    "        '<div>• Depend on the choice of curve fitting<br>• Depend on fitting parameters</div>'\n",
    "    ],\n",
    "    'Meaning': [\n",
    "        '<div>Average deviation from<br>true probability</div>',\n",
    "        '<div>Maximum deviation from<br>true probability</div>',\n",
    "        '<div>Test of<br>calibration</div>',\n",
    "        '<div>Test of<br>calibration</div>',\n",
    "        '<div>A logit fit to the<br>calibration curve</div>',\n",
    "        '<div>Average deviation from<br>true probability</div>'\n",
    "    ]\n",
    "}\n",
    "df = pd.DataFrame(data)\n",
    "\n",
    "# Apply custom styling\n",
    "styled_df = df.style.set_properties(**{'text-align': 'left', 'white-space': 'pre-wrap'})\n",
    "styled_df = styled_df.set_table_styles([dict(selector='th', props=[('text-align', 'center')])])\n",
    "\n",
    "styled_df = styled_df.hide(axis=\"index\")\n",
    "\n",
    "# Display the styled dataframe\n",
    "# ### Export PNG format of the table\n",
    "# import dataframe_image as dfi\n",
    "\n",
    "# dfi.export(styled_df,\"mytable.png\",table_conversion = 'matplotlib',dpi=300)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style type=\"text/css\">\n",
       "#T_f1c3f th {\n",
       "  text-align: center;\n",
       "}\n",
       "#T_f1c3f_row0_col0, #T_f1c3f_row0_col1, #T_f1c3f_row0_col2, #T_f1c3f_row0_col3, #T_f1c3f_row0_col4, #T_f1c3f_row1_col0, #T_f1c3f_row1_col1, #T_f1c3f_row1_col2, #T_f1c3f_row1_col3, #T_f1c3f_row1_col4, #T_f1c3f_row2_col0, #T_f1c3f_row2_col1, #T_f1c3f_row2_col2, #T_f1c3f_row2_col3, #T_f1c3f_row2_col4, #T_f1c3f_row3_col0, #T_f1c3f_row3_col1, #T_f1c3f_row3_col2, #T_f1c3f_row3_col3, #T_f1c3f_row3_col4, #T_f1c3f_row4_col0, #T_f1c3f_row4_col1, #T_f1c3f_row4_col2, #T_f1c3f_row4_col3, #T_f1c3f_row4_col4, #T_f1c3f_row5_col0, #T_f1c3f_row5_col1, #T_f1c3f_row5_col2, #T_f1c3f_row5_col3, #T_f1c3f_row5_col4 {\n",
       "  text-align: left;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<table id=\"T_f1c3f\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th id=\"T_f1c3f_level0_col0\" class=\"col_heading level0 col0\" >Metrics</th>\n",
       "      <th id=\"T_f1c3f_level0_col1\" class=\"col_heading level0 col1\" >Description</th>\n",
       "      <th id=\"T_f1c3f_level0_col2\" class=\"col_heading level0 col2\" >Pros</th>\n",
       "      <th id=\"T_f1c3f_level0_col3\" class=\"col_heading level0 col3\" >Cons</th>\n",
       "      <th id=\"T_f1c3f_level0_col4\" class=\"col_heading level0 col4\" >Meaning</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td id=\"T_f1c3f_row0_col0\" class=\"data row0 col0\" >Expected calibration error<br>(ECE)</td>\n",
       "      <td id=\"T_f1c3f_row0_col1\" class=\"data row0 col1\" ><div>Using binned reliability diagram<br>(equal-width or equal-count binning),<br>sum of absolute difference, weighted by bin count.</div></td>\n",
       "      <td id=\"T_f1c3f_row0_col2\" class=\"data row0 col2\" ><div>• Intuitive<br>• Easy to calculate</div></td>\n",
       "      <td id=\"T_f1c3f_row0_col3\" class=\"data row0 col3\" ><div>• Depend on binning <br>• Depend on class-by-class/top-class</div></td>\n",
       "      <td id=\"T_f1c3f_row0_col4\" class=\"data row0 col4\" ><div>Average deviation from<br>true probability</div></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_f1c3f_row1_col0\" class=\"data row1 col0\" >Maximum calibration error<br>(MCE)</td>\n",
       "      <td id=\"T_f1c3f_row1_col1\" class=\"data row1 col1\" ><div>Using binned reliability diagram<br>(equal-width or equal-count binning),<br>Maximum absolute difference.</div></td>\n",
       "      <td id=\"T_f1c3f_row1_col2\" class=\"data row1 col2\" ><div>• Intuitive<br>• Easy to calculate</div></td>\n",
       "      <td id=\"T_f1c3f_row1_col3\" class=\"data row1 col3\" ><div>• Depend on binning <br>• Depend on class-by-class/top-class</div></td>\n",
       "      <td id=\"T_f1c3f_row1_col4\" class=\"data row1 col4\" ><div>Maximum deviation from<br>true probability</div></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_f1c3f_row2_col0\" class=\"data row2 col0\" >Hosmer-Lemeshow test</td>\n",
       "      <td id=\"T_f1c3f_row2_col1\" class=\"data row2 col1\" ><div>Using binned reliability diagram<br>(equal-width or equal-count binning),<br>Chi-squared based test using expected and observed.</div></td>\n",
       "      <td id=\"T_f1c3f_row2_col2\" class=\"data row2 col2\" ><div>• Intuitive<br>• Statistical meaning</div></td>\n",
       "      <td id=\"T_f1c3f_row2_col3\" class=\"data row2 col3\" ><div>• Depend on binning <br>• Low power<br>• Wrong coverage</div></td>\n",
       "      <td id=\"T_f1c3f_row2_col4\" class=\"data row2 col4\" ><div>Test of<br>calibration</div></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_f1c3f_row3_col0\" class=\"data row3 col0\" >Spiegelhalter's z test</td>\n",
       "      <td id=\"T_f1c3f_row3_col1\" class=\"data row3 col1\" ><div>Decomposition of brier score.<br>Normal distributed<br> </div></td>\n",
       "      <td id=\"T_f1c3f_row3_col2\" class=\"data row3 col2\" ><div>• Doesn't rely on binning<br>• Statistical meaning</div></td>\n",
       "      <td id=\"T_f1c3f_row3_col3\" class=\"data row3 col3\" ><div>• Doesn't detect prevalence shift</div></td>\n",
       "      <td id=\"T_f1c3f_row3_col4\" class=\"data row3 col4\" ><div>Test of<br>calibration</div></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_f1c3f_row4_col0\" class=\"data row4 col0\" >Cox's analysis</td>\n",
       "      <td id=\"T_f1c3f_row4_col1\" class=\"data row4 col1\" ><div>Logistic regression of the logits<br> <br> </div></td>\n",
       "      <td id=\"T_f1c3f_row4_col2\" class=\"data row4 col2\" ><div>• Doesn't rely on binning<br>• Hints at miscalibration type</div></td>\n",
       "      <td id=\"T_f1c3f_row4_col3\" class=\"data row4 col3\" ><div>• Failed to capture some miscalibration</div></td>\n",
       "      <td id=\"T_f1c3f_row4_col4\" class=\"data row4 col4\" ><div>A logit fit to the<br>calibration curve</div></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td id=\"T_f1c3f_row5_col0\" class=\"data row5 col0\" >Integrated calibration index<br> (ICI)</td>\n",
       "      <td id=\"T_f1c3f_row5_col1\" class=\"data row5 col1\" ><div>Similar to ECE, using smooth fit (usually losse)<br>instead of binning to get<br>the calibration curve</div></td>\n",
       "      <td id=\"T_f1c3f_row5_col2\" class=\"data row5 col2\" ><div>• Doesn't rely on binning<br>• Capture all kind of miscalibration</div></td>\n",
       "      <td id=\"T_f1c3f_row5_col3\" class=\"data row5 col3\" ><div>• Depend on the choice of curve fitting<br>• Depend on fitting parameters</div></td>\n",
       "      <td id=\"T_f1c3f_row5_col4\" class=\"data row5 col4\" ><div>Average deviation from<br>true probability</div></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(HTML(styled_df.to_html(escape=False)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Guide to calzone and calibration metrics\n",
    "\n",
    "calzone aims to access whether a model achieves moderate calibration, meaning whether $\\mathbb{P}(D=1|\\hat{P}=p)=p$ for all $p\\in[0,1]$.\n",
    "\n",
    "To accurately assess the calibration of a machine learning model, it's important to use a dataset that's both comprehensive and representative of the intended population. Calibration metrics such as reliability diagrams or expected calibration error aren't meaningful if the dataset doesn't reflect the real-world data the model is meant to operate on. For example, if certain prediction ranges or subgroups are underrepresented, the model might appear well-calibrated overall but actually perform poorly in those regions. In other words, without sufficient coverage of the prediction space, especially across relevant clinical or demographic groups, calibration results can be misleading. Ensuring good coverage helps make sure that the evaluation actually reflects how the model will behave in practice.\n",
    "\n",
    "calzone takes in a csv dataset which contains the probability of each class and the true label. Most metrics in calzone only work with binary classification and which transforms the problem into 1-vs-rest when calcualte the metrics. Therefore, you need to specify the class-of-interest when using the metrics. The only exception is the Top-class Expected calibration error ($ECE_{top}$) and Top-class Maximum calibration error ($MCE_{top}$) metrics which only measure the calibration of the class with highest predicted probability hence works for multi-class problems. See the corresponding documentation for more details.\n",
    "\n",
    "\n",
    "We recommend using reliability diagrams to visualize calibration. If you notice that the model consistently over- or underestimates probabilities for a certain class, it’s worth checking whether that’s just due to a prevalence shift. Prevalence shift happens when the proportion of positive and negative cases in your evaluation data is different from what the model was trained on. This can cause the predicted probabilities to be systematically off, even if the model itself hasn’t changed. You can check for this by applying a prevalence adjustment and then re-plotting the reliability diagrams. After the adjustment, take another look at the calibration metrics to see if the issue persists. This helps separate problems caused by distribution shift from those that are due to poor calibration.\n",
    "\n",
    "For a general sense of average probability deviation, we recommend using the Cox and Loess integrated calibration index (ICI) as they don't depend on binning. Alternatively, ECE can be used to measure the same but the result will depend on the binning scheme you used. If the probabilities distribution is highly skewed toward 0 and 1, use equal-count binning for ECE. \n",
    "\n",
    "Please refer to the notebooks for detailed descriptions of each metric."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "general",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}