{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Summary and guide for calzone\n",
"\n",
"We provide a summary of the calibration metrics provided by calzone, including the pros and cons of each metric. For a more detailed explanation of each metric and how to calculate it using calzone, please refer to the corresponding notebook."
]
},
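{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the binned metrics in the table below, the following cell sketches ECE and MCE for binary labels. It uses equal-width bins on the predicted probability of class 1. This is a minimal NumPy sketch for intuition only, not the calzone implementation. The helper `binned_ece_mce` and the toy data are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical helper (not part of the calzone API): equal-width binning\n",
"# of the predicted probability of class 1, then ECE and MCE from the bins.\n",
"def binned_ece_mce(y_true, p_pred, n_bins=10):\n",
"    y_true = np.asarray(y_true, dtype=float)\n",
"    p_pred = np.asarray(p_pred, dtype=float)\n",
"    # Map each probability to a bin index in [0, n_bins - 1]; p = 1.0 goes to the last bin\n",
"    bin_ids = np.minimum((p_pred * n_bins).astype(int), n_bins - 1)\n",
"    ece, mce = 0.0, 0.0\n",
"    for b in range(n_bins):\n",
"        in_bin = bin_ids == b\n",
"        if not in_bin.any():\n",
"            continue  # empty bins contribute nothing\n",
"        # Gap between observed frequency and mean predicted probability in this bin\n",
"        gap = abs(y_true[in_bin].mean() - p_pred[in_bin].mean())\n",
"        ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin\n",
"        mce = max(mce, gap)  # MCE keeps only the worst bin\n",
"    return ece, mce\n",
"\n",
"# Toy data, made up for illustration\n",
"y = np.array([0, 1, 1, 0, 1, 1, 0, 1])\n",
"p = np.array([0.1, 0.8, 0.7, 0.3, 0.9, 0.6, 0.2, 0.4])\n",
"ece, mce = binned_ece_mce(y, p, n_bins=5)\n",
"print(f'ECE = {ece:.3f}, MCE = {mce:.3f}')"
]
},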
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"nbsphinx": "hidden"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from IPython.display import display, HTML\n",
"data = {\n",
"    'Metrics': ['Expected calibration error<br>(ECE)', 'Maximum calibration error<br>(MCE)', 'Hosmer-Lemeshow test', \"Spiegelhalter's z test\", \"Cox's analysis\", 'Integrated calibration index<br>(ICI)'],\n",
"    'Description': [\n",
"        'Uses a binned reliability diagram (equal-width or equal-count binning); the sum of absolute differences between observed frequency and mean predicted probability, weighted by bin count.',\n",
"        'Uses a binned reliability diagram (equal-width or equal-count binning); the maximum absolute difference across bins.',\n",
"        'Uses a binned reliability diagram (equal-width or equal-count binning); a chi-squared-based test comparing expected and observed event counts.',\n",
"        'Based on a decomposition of the Brier score; the test statistic is approximately standard normal under the null hypothesis of calibration.',\n",
"        'Logistic regression of the outcome on the logits of the predicted probabilities.',\n",
"        'Similar to ECE, but uses a smooth fit (usually LOESS) instead of binning to estimate the calibration curve.',\n",
"    ],\n",
"    'Pros': [\n",
"        '• Intuitive<br>• Easy to calculate',\n",
"        '• Intuitive<br>• Easy to calculate',\n",
"        '• Intuitive<br>• Statistical meaning',\n",
"        '• Does not rely on binning<br>• Statistical meaning',\n",
"        '• Does not rely on binning<br>• Hints at the type of miscalibration',\n",
"        '• Does not rely on binning<br>• Captures all kinds of miscalibration',\n",
"    ],\n",
"    'Cons': [\n",
"        '• Depends on binning<br>• Depends on class-by-class vs. top-class evaluation',\n",
"        '• Depends on binning<br>• Depends on class-by-class vs. top-class evaluation',\n",
"        '• Depends on binning<br>• Low power<br>• Wrong coverage',\n",
"        '• Does not detect prevalence shift',\n",
"        '• Fails to capture some types of miscalibration',\n",
"        '• Depends on the choice of curve-fitting method<br>• Depends on fitting parameters',\n",
"    ],\n",
"    'Meaning': [\n",
"        'Average deviation from the true probability',\n",
"        'Maximum deviation from the true probability',\n",
"        'Test of calibration',\n",
"        'Test of calibration',\n",
"        'A logit fit to the calibration curve',\n",
"        'Average deviation from the true probability',\n",
"    ],\n",
"}\n",
"