On Calibration of Modern Neural Networks
Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags:
Annotations
we find that a single-parameter variant of Platt scaling (Platt et al., 1999) – which we refer to as temperature scaling – is often the most effective method at obtaining calibrated probabilities. * show annotation
straightforward to implement with existing deep learning frameworks, it can be easily adopted in practical settings. * show annotation
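Since temperature scaling only divides the logits by a single scalar T (learned on a held-out validation set) before the softmax, it really is a few lines in any framework. Below is a minimal PyTorch sketch, not the paper's reference code: the function name fit_temperature, the choice to optimize log T so that T stays positive, and the LBFGS settings are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, max_iter=50):
    """Learn a single scalar T that minimizes the NLL of softmax(logits / T)
    on held-out validation data. The model's weights (and hence its
    accuracy) are untouched; only the confidences are rescaled."""
    log_T = torch.zeros(1, requires_grad=True)  # optimize log T to keep T > 0
    optimizer = torch.optim.LBFGS([log_T], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_T.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_T.exp().item()

# Usage at test time:
# T = fit_temperature(val_logits, val_labels)
# calibrated_probs = F.softmax(test_logits / T, dim=1)
```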
would like the confidence estimate P̂ to be calibrated, which intuitively means that P̂ represents a true probability * show annotation
For example, given 100 predictions, each with confidence of 0.8, we expect that 80 should be correctly classified. * show annotation
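The paper states this formally: with label Y, predicted class Ŷ, and confidence P̂, perfect calibration means

$$\mathbb{P}\left(\hat{Y} = Y \mid \hat{P} = p\right) = p, \qquad \forall\, p \in [0, 1].$$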
Confidence histograms (top) and reliability diagrams (bottom) for a 5-layer LeNet (left) and a 110-layer ResNet (right) on CIFAR-100. * show annotation
These diagrams plot expected sample accuracy as a function of confidence * show annotation
Any deviation from a perfect diagonal represents miscalibration. * show annotation
==a perfectly calibrated model will have acc(B_m) = conf(B_m) for all m ∈ {1, …, M}.== * show annotation
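Here predictions are grouped into M equal-width confidence bins $B_1, \dots, B_M$, with per-bin accuracy and confidence defined as

$$\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}\left(\hat{y}_i = y_i\right), \qquad \mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i.$$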
The average confidence of LeNet closely matches its accuracy, while the average confidence of the ResNet is substantially higher than its accuracy. * show annotation
LeNet is well-calibrated, as confidence closely approximates the expected accuracy (i.e. the bars align roughly along the diagonal). On the other hand, the ResNet’s accuracy is better, but does not match its confidence. * show annotation
- The difference between acc and conf for a given bin represents the calibration gap (red bars in reliability diagrams – e.g. Figure 1). We use ECE as the primary empirical metric to measure calibration. See Section S1 for more analysis of this metric. * show annotation
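ECE is the bin-size-weighted average of these gaps over all n samples:

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|.$$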
A perfectly calibrated model would have an ECE of 0. The larger the ECE, the more miscalibrated the model.
ECE is a helpful first measure and is widely used for assessing model calibration. However, ECE has drawbacks that one should be aware of when using it to measure calibration (see: Measuring Calibration in Deep Learning).
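To make the binning concrete, here is a small numpy sketch of ECE; the function name, the default of 15 equal-width bins, and the half-open bin boundaries are illustrative choices, not the paper's reference implementation.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE as the weighted average of per-bin |accuracy - confidence| gaps.

    confidences : (n,) array of max softmax probabilities
    correct     : (n,) boolean array, True where the prediction was right
    """
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += (in_bin.sum() / n) * gap
    return ece
```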
Maximum Calibration Error (MCE). In high-risk applications where reliable confidence measures are absolutely necessary, we may wish to minimize the worst-case deviation between confidence and accuracy * show annotation
For perfectly calibrated classifiers, MCE and ECE both equal 0 * show annotation
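Concretely, MCE swaps the bin-weighted average in ECE for the worst-case bin gap:

$$\mathrm{MCE} = \max_{m \in \{1, \dots, M\}} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|.$$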