On Calibration of Modern Neural Networks


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags:

Annotations


  • we find that a single-parameter variant of Platt scaling (Platt et al., 1999) – which we refer to as temperature scaling – is often the most effective method at obtaining calibrated probabilities. * show annotation

  • straightforward to implement with existing deep learning frameworks, it can be easily adopted in practical settings. * show annotation
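
Since the paper stresses that temperature scaling is a single-parameter method that is easy to bolt onto an existing model, here is a minimal sketch of the idea in PyTorch. It assumes `val_logits` and `val_labels` from a held-out validation set; the function name and optimizer settings are illustrative, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, max_iter=50):
    """Learn a single scalar T by minimizing NLL on validation logits.
    At test time, divide logits by T before the softmax."""
    temperature = torch.nn.Parameter(torch.ones(1))
    optimizer = torch.optim.LBFGS([temperature], lr=0.01, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / temperature, val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return temperature.item()

# Scaling logits by 1/T with T > 1 softens overconfident softmax outputs
# without changing the argmax, so accuracy is unaffected:
# calibrated_probs = F.softmax(test_logits / T, dim=1)
```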

  • would like the confidence estimate P̂ to be calibrated, which intuitively means that P̂ represents a true probability * show annotation

  • For example, given 100 predictions, each with confidence of 0.8, we expect that 80 should be correctly classified. * show annotation

  • Confidence histograms (top) and reliability diagrams (bottom) for a 5-layer LeNet (left) and a 110-layer ResNet (right) on CIFAR-100. * show annotation

  • These diagrams plot expected sample accuracy as a function of confidence * show annotation

  • Any deviation from a perfect diagonal represents miscalibration. * show annotation

  • ==a perfectly calibrated model will have acc(B_m) = conf(B_m) for all m ∈ {1, …, M}.== * show annotation
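
For reference, the per-bin quantities in this highlight are defined in the paper as the average correctness and the average confidence of the samples whose confidence falls into bin $B_m$:

$$
\mathrm{acc}(B_m) = \frac{1}{|B_m|}\sum_{i \in B_m} \mathbf{1}(\hat{y}_i = y_i),
\qquad
\mathrm{conf}(B_m) = \frac{1}{|B_m|}\sum_{i \in B_m} \hat{p}_i
$$

where $\hat{y}_i$ and $\hat{p}_i$ are the predicted label and confidence for sample $i$.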

  • The average confidence of LeNet closely matches its accuracy, while the average confidence of the ResNet is substantially higher than its accuracy. * show annotation

  • LeNet is well-calibrated, as confidence closely approximates the expected accuracy (i.e. the bars align roughly along the diagonal). On the other hand, the ResNet’s accuracy is better, but does not match its confidence. * show annotation

  • The difference between acc and conf for a given bin represents the calibration gap (red bars in reliability diagrams – e.g. Figure 1). We use ECE as the primary empirical metric to measure calibration. See Section S1 for more analysis of this metric. * show annotation

A perfectly calibrated model would have an ECE of 0. The larger the ECE, the more miscalibrated the model.
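
Concretely, ECE is the bin-weighted average of the per-bin calibration gaps described above:

$$
\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n}\,\bigl|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\bigr|
$$

where $n$ is the total number of samples and $M$ is the number of equal-width confidence bins.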

ECE is a helpful first measure and is widely used for assessing model calibration. However, ECE has drawbacks that one should be aware of when using it to measure calibration (see: Measuring Calibration in Deep Learning).

https://towardsdatascience.com/expected-calibration-error-ece-a-step-by-step-visual-explanation-with-python-code-c3e9aa12937d
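
A small NumPy sketch of the binned ECE estimator described above (similar in spirit to the linked article; the function and variable names are illustrative). `confidences` holds each sample's max softmax probability and `correct` marks whether the prediction was right.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: weighted average of |acc(B_m) - conf(B_m)| over
    equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)

    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()        # acc(B_m)
            conf = confidences[in_bin].mean()   # conf(B_m)
            ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece
```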

  • Maximum Calibration Error (MCE). In high-risk applications where reliable confidence measures are absolutely necessary, we may wish to minimize the worst-case deviation between confidence and accuracy * show annotation

  • For perfectly calibrated classifiers, MCE and ECE both equal 0 * show annotation
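
Under the same binning, MCE simply replaces the weighted average with the worst-case gap over bins; a short variation on the ECE sketch above (again illustrative, not the authors' code):

```python
import numpy as np

def maximum_calibration_error(confidences, correct, n_bins=15):
    """Binned MCE: largest |acc(B_m) - conf(B_m)| across the bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)

    worst_gap = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            worst_gap = max(worst_gap, gap)
    return worst_gap
```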