When leveraging LLMs for data labeling, it is important to accurately estimate the model’s confidence in its own knowledge and reasoning. Doing so enables us to automatically reject low-confidence labels and ensemble LLMs optimally. We examine different techniques for estimating the confidence of LLM-generated labels, in the context of data labeling for NLP tasks. Key takeaways:
We used Autolabel, our recently open-sourced library, to run all the experiments in this report.
Next-generation data labeling tools like Autolabel leverage LLMs to create large, diverse labeled datasets. In an application domain like labeling, where correctness is critical, it is imperative to accurately estimate the model’s confidence in its own knowledge and reasoning. Doing so enables us to automatically reject low-confidence labels, ensemble LLMs optimally, and learn more about the strengths and weaknesses of any given LLM.
As a motivating example, consider the Toxic comments classification dataset from our labeling benchmark. If we can estimate the LLM’s confidence level alongside the labels it generates, we can calibrate the model’s label quality (% agreement with ground truth labels) at any given confidence score. We can then choose an operating point (confidence threshold) for the LLM and reject all labels below this threshold, as sketched below.
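To make this concrete, here is a minimal sketch of choosing such an operating point from a calibration set. This is not part of Autolabel’s API; `pick_threshold` and `target_quality` are hypothetical names for illustration.

```python
# Minimal sketch: find the lowest confidence threshold at which the
# surviving labels meet a target quality (% agreement with ground truth).
import numpy as np

def pick_threshold(confidences, is_correct, target_quality=0.95):
    order = np.argsort(confidences)                 # sort ascending by confidence
    conf = np.asarray(confidences, dtype=float)[order]
    correct = np.asarray(is_correct, dtype=float)[order]
    for i in range(len(conf)):
        quality = correct[i:].mean()                # agreement among labels kept
        if quality >= target_quality:
            return conf[i]                          # reject labels below this score
    return None                                     # target unreachable on this set
```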
This blog post summarizes our findings on estimating the confidence of LLM-generated labels. We demonstrate the efficacy of four different confidence estimation techniques across a wide range of text labeling tasks and datasets.
We use GPT-4 as the labeling LLM in this evaluation. Ideally, we would use the same LLM for verification as well, but token-level generation probabilities (logprobs) are currently not available for OpenAI chat engine models (including GPT-4). Hence, we use FLAN-T5-XXL as the verifier LLM. For techniques that don’t rely on logprobs, we also report AUROC numbers with GPT-4 as the verifier LLM.
The AUROC is a scalar value that summarizes the overall performance of a classifier by measuring the area under its ROC curve. It captures the classifier’s ability to distinguish between the positive and negative classes across all possible classification thresholds. Confidence estimation can be framed as a binary classifier whose target is whether the model got a specific label correct (positive) or incorrect (negative), and whose score is the estimated confidence. Using the plot above as a reference, we can see that a random classifier would have an AUROC of 0.5.
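Concretely, the AUROC can be computed by treating correctness as the binary target and the confidence score as the classifier’s output. A small illustration with scikit-learn, using toy values:

```python
# Frame confidence estimation as a binary classifier: the target is whether
# the LLM's label matched ground truth, and the score is the confidence.
from sklearn.metrics import roc_auc_score

is_correct = [1, 0, 1, 1, 0, 1]               # 1 = LLM label agreed with ground truth
confidence = [0.9, 0.3, 0.8, 0.7, 0.4, 0.6]   # estimated confidence per label

print(roc_auc_score(is_correct, confidence))  # 0.5 would be a random classifier
```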
We included the following datasets for this evaluation. These datasets are available for download here.
We benchmark the following four methods for confidence estimation:
The labeling LLM (GPT-4) generates a label (llm_label) given an input prompt. Using this, we prompt the verifier LLM to complete the following sentence. The token generation probability of “Yes” is used as the confidence score.
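A hedged sketch of this “P(true)”-style check follows; the verification sentence below is illustrative, not Autolabel’s exact prompt.

```python
# Score P("Yes") under the verifier LLM (FLAN-T5-XXL; swap in
# "google/flan-t5-base" to try this on a laptop).
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")

def p_true(prompt: str, llm_label: str) -> float:
    # Hypothetical verification sentence for the verifier to complete.
    query = (f"{prompt}\nProposed label: {llm_label}\n"
             "Is the proposed label correct? Answer Yes or No:")
    inputs = tokenizer(query, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=1,
                             output_scores=True, return_dict_in_generate=True)
    probs = torch.softmax(out.scores[0][0], dim=-1)   # distribution over vocab
    yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
    return probs[yes_id].item()                       # P("Yes") as confidence
```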
The labeling LLM (GPT-4) generates a label (llm_label) given an input prompt. Using this, we prompt the verifier LLM to complete the following sentence. The value output by the verifier LLM is parsed as a float and used as the confidence score. If parsing is unsuccessful, we assign a default confidence score of 0.0.
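A minimal sketch of the parsing step (the clamp to [0, 1] is our own assumption, not stated above):

```python
def parse_verbalized_confidence(completion: str) -> float:
    """Parse the verifier's stated confidence; default to 0.0 on failure."""
    try:
        score = float(completion.strip())
    except ValueError:
        return 0.0                        # unparseable output -> zero confidence
    return min(max(score, 0.0), 1.0)      # clamp to [0, 1] (our assumption)
```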
For classification-like and QA tasks, this is simply the probability of the first token in the generated label. For NER, probabilities are first generated for all tokens, and the probabilities of the tokens within each entity are averaged to compute a per-entity confidence score.
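In code, assuming per-token logprobs are already available (function names are illustrative):

```python
import math

def classification_confidence(token_logprobs: list[float]) -> float:
    # Classification-like and QA tasks: probability of the first output token.
    return math.exp(token_logprobs[0])

def entity_confidence(entity_token_logprobs: list[float]) -> float:
    # NER: average the probabilities of the tokens within one entity span.
    probs = [math.exp(lp) for lp in entity_token_logprobs]
    return sum(probs) / len(probs)
```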
We use the verifier LLM to estimate the token probabilities of a prediction: we first generate prediction logits over the entire vocabulary at each position of the output sequence. We then compute a softmax over each position’s distribution and use the probabilities of the corresponding tokens in the prediction as the token probabilities.
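A sketch of this scoring pass with Hugging Face transformers, assuming a seq2seq verifier like FLAN-T5; a single teacher-forced forward pass yields logits over the vocabulary at every output position:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")

def prediction_token_probs(prompt: str, prediction: str) -> list[float]:
    inputs = tokenizer(prompt, return_tensors="pt")
    target = tokenizer(prediction, return_tensors="pt").input_ids
    with torch.no_grad():
        # Teacher-forced pass: logits over the vocab at each output index.
        logits = model(**inputs, labels=target).logits  # (1, out_len, vocab)
    probs = torch.softmax(logits, dim=-1)               # softmax per index
    # Probability the verifier assigns to each token of the prediction.
    return [probs[0, i, tok].item() for i, tok in enumerate(target[0])]
```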
As explained in the previous section, we evaluated the performance of four different confidence estimation techniques across different NLP tasks and datasets.
Here’s a breakdown of the performance by dataset:
To build a more intuitive understanding of which kinds of inputs LLMs are more, or less, confident about, here’s what we did:
We illustrate this with the Banking complaints classification dataset below. As you can see, clusters of low-confidence inputs correspond quite well to incorrect LLM labels:
Further, let’s examine a few rows with low-confidence LLM labels (<= 0.2). Many of these inputs are ambiguous, and it is hard even for a human annotator to tell which of the labels (the ground truth provided in the dataset, or the LLM-generated one) is “correct”.
The Autolabel library relies on token-level generation probabilities to estimate LLM label confidence. Generating confidence scores alongside labels is a simple config change: setting the key compute_confidence = True initiates confidence score computation:
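For example (the config keys beyond compute_confidence follow the library’s documented schema at the time of writing; check the Autolabel docs for the current version):

```python
from autolabel import LabelingAgent

config = {
    "task_name": "ToxicCommentClassification",
    "task_type": "classification",
    "model": {
        "provider": "openai",
        "name": "gpt-4",
        "compute_confidence": True,   # attach a confidence score to each label
    },
    "prompt": {
        "task_guidelines": "Classify each comment as 'toxic' or 'not toxic'.",
        "labels": ["toxic", "not toxic"],
        "example_template": "Input: {example}\nOutput: {label}",
    },
}

agent = LabelingAgent(config)
```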
However, very few LLM providers today support extracting token-level generation probabilities alongside the completion. For all other models, Refuel provides access to a hosted verifier LLM (currently a FLAN-T5-XXL model) via an API, which estimates logprobs as a post-processing step after the label is generated, regardless of which LLM originally generated the label.