Researchers at MIT and the MIT-IBM Watson AI Lab have developed a new calibration method called Thermometer that aims to prevent large language models (LLMs) from becoming overconfident about incorrect answers. The method calibrates by building a smaller auxiliary model on top of the large language model.
This approach is more efficient than traditional methods while maintaining the accuracy of the model, enabling it to generate better calibrated responses on previously unseen tasks.
- Background: LLMs are widely used in a variety of tasks, from translating articles to identifying financial fraud. However, despite the incredible power and versatility of these models, they sometimes generate inaccurate responses.
- Challenge: LLMs not only give wrong answers, but they are also sometimes overconfident about those wrong answers, or underconfident about the correct answers. This makes it difficult for users to judge whether the model’s responses are reliable.
Limitations of traditional calibration methods
Single-task calibration
Traditional machine learning models are usually designed to perform a single task, and their calibration methods are also targeted at a single task.
Multi-task application
Since LLMs can be applied to a variety of different tasks, using traditional methods to calibrate them for a single task may affect the performance of the model on other tasks.
Computational overhead
Calibrating LLMs usually requires sampling the model multiple times to obtain different predictions and then aggregating these predictions to obtain better calibration confidence. However, since LLMs have billions of parameters, this approach is computationally expensive.
Thermometer Method:
Temperature Scaling
The researchers leveraged a classic calibration method called temperature scaling to adjust the model’s confidence to be consistent with its prediction accuracy. In this context, “temperature” is a scaling parameter used to adjust the model’s confidence.
Auxiliary Model
Thermometer predicts the temperature parameter required for calibration by running an auxiliary model on top of the LLM. The auxiliary model is trained on datasets from a set of representative tasks, but once trained, it generalizes to new tasks in similar categories without needing additional labeled data.
Representative datasets
For example, a Thermometer model could be trained on a dataset of multiple-choice questions (e.g., a set containing algebra questions and medical questions), and then used to calibrate an LLM that answers geometry or biology questions.
How it works
Thermometer calibrates the output of an LLM through an auxiliary model that learns how to adjust the LLM's output probabilities so they more accurately reflect reality. That way, when the model says it is 80% sure of an answer, it is actually right about 80% of the time.
Detailed Methods of Thermometer
Temperature Scaling
- When the model makes a prediction, it gives a confidence score, such as “I’m 90% sure this answer is correct.” However, the model’s confidence score may not always be accurate.
- Temperature scaling introduces a "temperature" parameter (e.g., 1.2) that divides the model's logits before the softmax, adjusting its confidence so it lies closer to the truth.
- For example, with a temperature greater than 1, a 90% confidence might be softened to something like 75%, better matching how often the model is actually correct.
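The adjustment described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the logits and the temperature value of 2.0 are invented for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature.

    temperature > 1 softens the distribution (lower confidence);
    temperature < 1 sharpens it (higher confidence).
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-option multiple-choice question.
logits = [4.0, 1.5, 0.5, 0.0]

p_raw = softmax(logits)                   # uncalibrated probabilities
p_cal = softmax(logits, temperature=2.0)  # calibrated with T = 2.0

print(f"top confidence before scaling: {max(p_raw):.2f}")
print(f"top confidence after scaling:  {max(p_cal):.2f}")
```

Note that dividing by a temperature never changes which option has the highest probability; only the confidence attached to it shrinks, which is why temperature scaling preserves the model's accuracy.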
Learning Temperature
- Thermometer learns this "temperature" from training data. Given data from many different tasks, the model can learn the temperature each task requires.
- The process is similar to a smart thermometer that can adjust the temperature according to different environments.
Recognition network
- The recognition network is a small network that computes the "temperature": given input data, it outputs a suitable temperature value, much like a thermometer reporting the temperature of wherever you are.
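A recognition network of this kind might be sketched as follows. Everything here is an assumption for illustration: the feature dimension, the random weights, and the single linear layer are stand-ins for whatever architecture the authors actually use. The structural point is that a softplus keeps the predicted temperature strictly positive.

```python
import math
import random

random.seed(0)

FEATURE_DIM = 8  # assumed size of the LLM-derived feature vector (illustrative)

# A minimal "recognition network": one linear layer whose scalar output is
# passed through softplus so the predicted temperature is always positive.
weights = [random.gauss(0, 0.1) for _ in range(FEATURE_DIM)]
bias = 0.0

def softplus(x):
    return math.log1p(math.exp(x))

def predict_temperature(features):
    """Map an LLM-derived feature vector to a positive temperature."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return softplus(score) + 0.5  # offset keeps the temperature away from zero

features = [random.gauss(0, 1) for _ in range(FEATURE_DIM)]
t = predict_temperature(features)
print(f"predicted temperature: {t:.2f}")
```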
Training process
- We divide the data of different tasks into two parts, one for training and the other for validation.
- During training, the recognition network continuously adjusts its parameters until it finds the most suitable temperatures.
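The training idea can be illustrated by fitting a single temperature that minimizes negative log-likelihood on held-out validation data. This is a simplification: the actual method trains a network to predict this value per task, and the synthetic logits below are invented for the example.

```python
import math

def softmax(logits, t):
    m = max(z / t for z in logits)
    exps = [math.exp(z / t - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def nll(dataset, t):
    """Average negative log-likelihood of the correct labels at temperature t."""
    total = 0.0
    for logits, label in dataset:
        total -= math.log(softmax(logits, t)[label])
    return total / len(dataset)

# Tiny synthetic validation set: (logits, index of correct answer).
# The model is right 4 times out of 5 but is ~97% confident every time,
# so it is overconfident and the fitted temperature should exceed 1.
val_set = [
    ([5.0, 1.0, 0.0], 0),
    ([5.0, 1.0, 0.0], 0),
    ([5.0, 1.0, 0.0], 0),
    ([5.0, 1.0, 0.0], 0),
    ([4.5, 0.5, 0.0], 1),  # confidently wrong
]

# Simple grid search for the temperature that minimizes validation NLL.
candidates = [0.5 + 0.1 * i for i in range(50)]  # 0.5 .. 5.4
best_t = min(candidates, key=lambda t: nll(val_set, t))
print(f"fitted temperature: {best_t:.1f}")
```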
Testing Process
- At test time, we feed the new task's data into the recognition network, which outputs a temperature for that task.
- We then use this temperature to adjust the LLM's output so that its confidence is more accurate.
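The test-time flow might look like this sketch, where the trained recognition network is replaced by a hard-coded stand-in returning a hypothetical temperature of 1.8; the feature vector and logits are likewise invented.

```python
import math

def softmax(logits, t=1.0):
    m = max(z / t for z in logits)
    exps = [math.exp(z / t - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Stand-in for the trained recognition network: in the real method it maps
# features of the new task's inputs to a temperature; here we hard-code one.
def recognition_network(task_features):
    return 1.8  # hypothetical temperature predicted for this task

task_features = [0.2, -0.4, 1.1]   # illustrative feature vector
logits = [3.2, 0.7, 0.1, -0.5]     # LLM logits for one test question

t = recognition_network(task_features)
uncalibrated = softmax(logits)
calibrated = softmax(logits, t)

print(f"uncalibrated confidence: {max(uncalibrated):.2f}")
print(f"calibrated confidence:   {max(calibrated):.2f}")
```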
Specific steps
- Build a recognition network: design a small network that takes data as input and outputs a temperature value.
- Train the recognition network: use data from many tasks to train this network so it learns to output accurate temperatures.
- Calibrate model outputs: In the new task, adjust the model's output confidence using the temperature given by the recognition network.
- Evaluate calibration results: Use some standard methods to evaluate the effect of calibration to ensure that the confidence of model predictions is more accurate.
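One standard way to evaluate the last step is expected calibration error (ECE), which bins predictions by confidence and measures the gap between average confidence and accuracy in each bin. The toy data below is invented to show an overconfident model scoring worse than a calibrated one; the source does not specify which metric the authors report.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Synthetic example: the model answers 10 questions and gets 7 right.
correct    = [True, True, True, False, True, False, True, True, False, True]
overconf   = [0.99] * 10  # always claims 99%, but is right only 70% of the time
calibrated = [0.70] * 10  # claims 70%, matching its actual accuracy

ece_over = expected_calibration_error(overconf, correct)
ece_cal = expected_calibration_error(calibrated, correct)
print(f"ECE overconfident: {ece_over:.2f}")
print(f"ECE calibrated:    {ece_cal:.2f}")
```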
Through these steps, the Thermometer method makes the large language model's prediction confidence more reliable, in effect equipping the model with a smart thermometer. That way, when the model says "I am 90% sure," that 90% is more likely to really mean 90%.
Key benefits of Thermometer:
Efficiency
The Thermometer method does not require multiple training runs, only slightly slows down the LLM, and maintains the accuracy of the model's predictions.
Accurate Calibration
Test results across multiple tasks show that Thermometer produces better-calibrated uncertainty measures while requiring much less computation.
Generalizability
The researchers found that if they trained a Thermometer model for smaller LLMs, it could be directly applied to calibrate larger LLMs within the same family.
Source: https://news.mit.edu/2024/thermometer-prevents-ai-model-overconfidence-about-wrong-answers-0731
- Author: KCGOD
- URL: https://kcgod.com/a-method-which-calibrates-large-models-to-improve-the-accuracy-by-mit
- Copyright: All articles in this blog, except where otherwise stated, are licensed under a BY-NC-SA agreement. Please credit the source!