Calibration applies when a classifier outputs probabilities. Some classifiers have characteristic quirks: boosted trees and SVMs, for example, tend to predict probabilities conservatively, pushing them closer to the mid-range than to the extremes. If your metric cares about exact probabilities, as logarithmic loss does, you can calibrate the classifier, that is, post-process its predictions to get better estimates.
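A toy illustration with made-up numbers: two sets of predictions that rank the cases identically and get the same accuracy at a 0.5 threshold, but log loss punishes the conservative, mid-range scores.

```
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([ 0, 0, 1, 1 ])
p_sharp = np.array([ 0.1, 0.2, 0.8, 0.9 ])            # confident, well calibrated
p_conservative = np.array([ 0.4, 0.45, 0.55, 0.6 ])   # same ranking, squeezed to mid-range

# same predicted classes at 0.5, but very different log loss
print( log_loss( y_true, p_sharp ))          # about 0.16
print( log_loss( y_true, p_conservative ))   # about 0.55
```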

This article was inspired by Andrew Tulloch’s post on Speeding up isotonic regression in scikit-learn by 5,000x.

## Visualizing calibration with reliability diagrams

Before you attempt calibration, check how well calibrated the classifier is to begin with. The paper we’re going to refer to is *Predicting good probabilities with supervised learning* [PDF] by Niculescu-Mizil and Caruana.

On real problems where the true conditional probabilities are not known, model calibration can be visualized with reliability diagrams (DeGroot & Fienberg, 1982). First, the prediction space is discretized into ten bins. Cases with predicted value between 0 and 0.1 fall in the first bin, between 0.1 and 0.2 in the second bin, etc.

For each bin, the mean predicted value is plotted against the true fraction of positive cases. If the model is well calibrated the points will fall near the diagonal line.
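A minimal sketch of how those bins can be computed (scikit-learn also provides `sklearn.calibration.calibration_curve`, which does essentially this):

```
import numpy as np

def reliability_bins( y_true, y_prob, n_bins = 10 ):
	# assign each prediction to one of n_bins equal-width bins on [0, 1]
	bins = np.minimum(( y_prob * n_bins ).astype( int ), n_bins - 1 )
	mean_pred, frac_pos = [], []
	for b in range( n_bins ):
		mask = bins == b
		if mask.any():
			mean_pred.append( y_prob[mask].mean())   # x: mean predicted value
			frac_pos.append( y_true[mask].mean())    # y: fraction of positives
	return np.array( mean_pred ), np.array( frac_pos )
```

Plotting `mean_pred` against `frac_pos` and comparing with the diagonal gives the reliability diagram.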

Here’s a reliability diagram for an almost perfectly calibrated classifier. It’s Vowpal Wabbit with data from the Criteo competition, if you’re curious.

x: mean predicted value for each bin, y: fraction of true positive cases.

And now a classifier that could use some calibration (Vowpal Wabbit / Avito competition):

Finally, let’s see a random forest trained on the Adult data:

It doesn’t look sigmoidal like the plots in the paper; more like a sigmoid mirrored around the central line.

## Platt’s scaling

There are two popular calibration methods: Platt’s scaling and isotonic regression. Platt’s scaling amounts to training a logistic regression model on the classifier outputs. As Edward Raff writes:

You essentially create a new data set that has the same labels, but with one dimension (the output of the SVM). You then train on this new data set, and feed the output of the SVM as the input to this calibration method, which returns a probability. In Platt’s case, we are essentially just performing logistic regression on the output of the SVM with respect to the true class labels.

We use an additional validation set for calibration: take the classifier’s predictions and the true labels, split them in two, then use the first part as a training set for calibration and the second part to evaluate the results.
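A sketch of that split, with `p` standing in for the classifier’s predictions on a held-out set and `y` for the matching labels (dummy data here, the names are assumptions):

```
import numpy as np
from sklearn.model_selection import train_test_split

# dummy stand-ins: p - classifier outputs, y - matching true labels
p = np.random.rand( 1000 )
y = ( np.random.rand( 1000 ) < p ).astype( int )

# first half trains the calibrator, second half evaluates it
p_train, p_test, y_train, y_test = train_test_split( p, y, test_size = 0.5, random_state = 0 )
```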

The code might look like the snippet below. *p_train* and *p_test* are vectors of classifier outputs and *y_train* holds the true labels.

```
from sklearn.linear_model import LogisticRegression as LR
lr = LR()
lr.fit( p_train.reshape( -1, 1 ), y_train ) # LR needs X to be 2-dimensional
p_calibrated = lr.predict_proba( p_test.reshape( -1, 1 ))[:,1]
```

And now the Adult random forest calibrated with Platt’s scaling. The blue line shows “before” and the green line “after”. The plot looks smoother because we used fewer bins than in the diagram above.

Blue: before, green: after.

The numbers look good: accuracy barely moves, AUC is unchanged, and the log loss reduction is dramatic.

```
accuracy - before/after: 0.847788697789 / 0.846805896806
AUC - before/after: 0.878139845077 / 0.878139845077
log loss - before/after: 0.630525772871 / 0.364873617584
```

## Isotonic regression

The second popular method of calibrating is isotonic regression. The idea is to fit a piecewise-constant non-decreasing function instead of logistic regression. *Piecewise-constant non-decreasing* means stair-step shaped:

The stairs. Notice that this plot doesn’t deal with calibration. Credit: scikit-learn
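Not the code behind that plot, just a toy sketch of the stair-step property: fit isotonic regression to noisy increasing data and the fitted values never decrease.

```
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.RandomState( 0 )
x = np.arange( 20 ).astype( float )
y = x + rng.normal( scale = 5, size = 20 )   # noisy, but with an increasing trend

ir = IsotonicRegression()
y_fit = ir.fit_transform( x, y )

# the fit is piecewise-constant and non-decreasing: the "stairs"
assert np.all( np.diff( y_fit ) >= 0 )
```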

The scikit-learn docs on isotonic regression look a bit confusing to us, but using it is just as simple as logistic regression. The docs talk about ordering, but you don’t need to sort either *x* or *y* yourself; the fit method takes care of that.

```
from sklearn.isotonic import IsotonicRegression as IR
ir = IR( out_of_bounds = 'clip' )
ir.fit( p_train, y_train )
p_calibrated = ir.transform( p_test ) # or ir.predict( p_test ), that's the same thing
```

The Adult data again:

After calibration accuracy and AUC suffer a tiny bit, but log loss gets smaller, although nowhere near the result from Platt’s scaling:

```
accuracy - before/after: 0.847788697789 / 0.845945945946
AUC - before/after: 0.878139845077 / 0.877184085166
log loss - before/after: 0.630525772871 / 0.592161024832
```

## More examples

Remember the reliability diagrams for Vowpal Wabbit? Now the green line shows results after calibration by isotonic regression:

Let’s compare log loss scores:

```
In [91]: ll( y_test, p_test )
Out[91]: 0.45670528472608907
In [92]: ll( y_test, p_test_calibrated )
Out[92]: 0.45688394167069607
```

No improvement. And the second one:

It won’t come as a surprise that the score improved - the log loss dropped by 5.4%. (The first line replaces NaNs, which isotonic regression produces for test points outside the training range unless `out_of_bounds = 'clip'` is set.)

```
In [66]: p_cal[np.isnan( p_cal )] = 1e-15
In [67]: log_loss( y_test, p_test )
Out[67]: 0.040977954263511369
In [68]: log_loss( y_test, p_test_calibrated )
Out[68]: 0.038757356232921675
```

## Summary

There’s no point in calibrating if the classifier is already good in this respect. First make a reliability diagram and if it looks like it could be improved, then calibrate. That is, if your metric justifies it.

The code is available at GitHub. You’ll need to modify `load_data.py` to suit your needs.

**UPDATE**: *scikit-learn* now has a good doc section on probability calibration.