We introduce a benchmark for evaluating the performance of LLMs at labeling text datasets. Key takeaways and learnings:
State of the art LLMs can label text datasets at the same or better quality compared to skilled human annotators, but ~20x faster and ~7x cheaper.
For achieving the highest quality labels, GPT-4 is the best choice among out of the box LLMs (88.4% agreement with ground truth, compared to 86% for skilled human annotators). For achieving the best tradeoff between label quality and cost, GPT-3.5-turbo, PaLM-2 and open source models like FLAN-T5-XXL are compelling.
Confidence-based thresholding can be a very effective way to mitigate the impact of hallucinations and ensure high label quality.
This report was made using Autolabel, our recently open-sourced library to clean, label and enrich datasets using LLMs.
With the rise in popularity of Foundation Models, new models and tools are released almost every week, and yet we believe we are only at the beginning of the journey to understand and leverage these models.
While it is well known that LLMs can write poems and pass the bar and SAT exams, there is increasing evidence suggesting that when leveraged with supervision from domain experts, LLMs can label datasets with comparable quality to skilled human annotators, even for specialized domains such as biomedical literature.
There are various ongoing efforts in the community to understand and compare LLM performance. Most of them are focused on comparing these models across a broad range of reasoning, generation and prediction tasks:
HELM (Holistic Evaluation of Language Models) by Stanford is a research benchmark to evaluate performance across a variety of prediction and generation scenarios.
Chatbot Arena by LMSys is a popular benchmark that employs an Elo-based ranking, aggregated over pairwise battles.
However, there is no widely used, live benchmark for evaluating performance of LLMs as data annotators.
Goal of this effort
With this effort, we have the following goals:
Evaluate the performance of LLMs vs human annotators for labeling text datasets across a range of tasks on three axes: quality, turnaround time and cost.
Understand the label quality and completion rate tradeoff by estimating confidence of LLM outputs. As far as we know, this is one of the first examples of introducing confidence estimation into benchmarking for data annotation.
Propose a live benchmark for text data annotation tasks, to which the community can contribute over time.
In this benchmark, we included the following datasets. Links to all the datasets are available in the References section at the end.
And we considered the following LLMs:
*FLAN T5-XXL is an open-source model originally developed by Google. For this benchmark, we hosted the model on a single A100 GPU (40 GB) on Lambda Labs, with 8-bit quantization and no batching. We used this checkpoint.
Evaluation & Human annotator setup
Our goal is to evaluate LLMs and human annotators on three axes:
Label quality: Agreement between generated label and the ground truth label
Turnaround time: Time taken (in seconds) per label
Cost: Cost (in ¢) per label
For each dataset, we generate the following splits:
Seed set (200 examples) - constructed by sampling at random from the training partition, and used for confidence calibration, and few-shot prompting.
Test set (2000 examples) - constructed by sampling at random from the test partition, and used for running the evaluation and reporting all benchmark results.
We hired annotators via a 3rd party platform that is commonly used for data labeling. Multiple annotators were hired for each dataset separately. The task was performed in stages:
We asked annotators to label the seed dataset (200 examples), providing them with labeling guidelines.
We assessed performance on the seed set, provided them with the ground truth labels for this dataset as a reference and asked them to review their mistakes.
We clarified any questions they had on the labeling guidelines after this review.
Finally, we asked them to label the test set (2000 examples).
All metrics reported here for human annotators are from this final step.
Results and learnings
Label quality measures how well the generated labels (by human or LLM annotator) agree with the ground truth label provided in the dataset.
For datasets other than SQuAD, we measure agreement with an exact match between the generated and ground truth label.
For the SQuAD dataset, we measure agreement with the F1 score between generated and ground truth answer, a commonly used metric for question answering.
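For reference, the token-overlap F1 used for SQuAD-style answers can be computed roughly as follows. This is a simplified sketch: the official SQuAD evaluation script additionally strips articles and punctuation during normalization.

```python
from collections import Counter

def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a predicted and a ground-truth answer,
    as commonly used for SQuAD-style question answering."""
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    # Count tokens that appear in both, respecting multiplicity
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```

Unlike exact match, this metric gives partial credit when the generated answer overlaps the ground truth but is phrased with extra or missing words.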
The table below summarizes label quality results across datasets. More details about the labeling configs used for this benchmarking are provided in the Artifacts section.
State of the art LLMs can label text datasets at the same or better quality compared to skilled human annotators. Out of the box, GPT-4's label quality across a range of datasets (88.4% agreement with ground truth labels) is better than that of human annotators (86.2%).
Multiple other models also achieve strong performance (80+%) while priced at <1/10th the GPT-4 API cost.
With any effort to evaluate LLMs, there is always the possibility of data leakage since proprietary LLMs are trained on an incredibly large number of datasets. We recognize and acknowledge this issue, and share an additional data point from LLM labeling on a proprietary text dataset: the best performing LLMs (GPT-4, PaLM-2) achieved an 89% agreement with ground truth labels. By applying a few additional improvement techniques such as ensembling, we were able to improve this to 95+%.
One of the biggest criticisms of LLMs is hallucinations - large language models can seem very confident in their language even when they are completely incorrect. It is imperative to estimate the quality of labels in a way that is correlated with the likelihood of the label being correct.
In order to estimate label confidence, we average the token-level log probabilities output by the LLM. This approach to self-evaluation has been shown to work well on a diverse range of prediction tasks.
For LLMs that provide log probabilities (e.g., text-davinci-003), we use these directly to estimate confidence.
For other LLMs, we use a FLAN T5 XXL model for confidence estimation. After the label is generated, we query the FLAN T5 XXL model for a probability distribution over the generated output tokens, given the same prompt input that was used for labeling.
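As a concrete sketch, the confidence score can be derived from token-level log probabilities as below. Exponentiating the average back to a probability scale is our choice for readability; the exact aggregation used in Autolabel may differ.

```python
import math

def label_confidence(token_logprobs: list[float]) -> float:
    """Estimate label confidence by averaging the token-level log
    probabilities of the generated label and mapping the average
    back to a [0, 1] probability scale."""
    if not token_logprobs:
        return 0.0
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)
```

Averaging in log space (rather than multiplying raw probabilities) keeps the score comparable across labels of different token lengths.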
We use estimated confidence to understand the tradeoff between label quality and completion rate during a calibration step. We decide an operating point for the LLM and reject all labels below this operating point threshold. For example, the above chart indicates that at a 95% quality threshold, we can label ~77% of the dataset using GPT-4.
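A minimal sketch of this calibration step, assuming we have a calibration set with a confidence score and a correctness flag per example (the function name and interface are illustrative, not Autolabel's actual API):

```python
def choose_threshold(confidences, correct, target_quality=0.95):
    """Pick the lowest confidence threshold at which the accepted
    labels still meet a target quality, and report the resulting
    completion rate.

    confidences: per-example confidence scores on a calibration set
    correct:     whether each label matched ground truth (bools)
    Returns (threshold, completion_rate), or None if the target
    quality is unreachable.
    """
    pairs = sorted(zip(confidences, correct), reverse=True)
    best = None
    num_correct = 0
    for i, (conf, ok) in enumerate(pairs, start=1):
        num_correct += ok
        quality = num_correct / i
        # Keep the largest accepted prefix that meets the quality bar
        if quality >= target_quality:
            best = (conf, i / len(pairs))
    return best
```

At inference time, labels with confidence below the chosen threshold are rejected and routed to human review.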
The reason we need this step at all is because token-level log probabilities might not always be well calibrated, as highlighted in the GPT-4 Technical Report.
Using the above confidence estimation approach, and selecting a confidence threshold at 95% label quality (by comparison, human annotator label quality is 86%), we get the following completion rates across datasets and LLMs:
GPT-4 achieves the best average completion rate across datasets, and manages to label 100% of the data above this quality threshold for 3 of the 8 datasets.
Multiple other models (text-bison@001, gpt-3.5-turbo, claude-v1 and flan-t5-xxl) achieve strong performance, on average successfully auto-labeling at least 50% of the data while priced at <1/10th the GPT-4 API cost.
The above observations on label quality and confidence estimation point us in the direction of ensembling LLMs: use faster/cheaper models to label the easier examples, and the most powerful LLMs for labeling harder examples. We have observed very good results from this in practice, and hope to share our learnings in a future post.
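The cascade idea can be sketched as follows. The model callables here are hypothetical placeholders returning a (label, confidence) pair; this is our illustration of the approach, not an API from the Autolabel library.

```python
def cascade_label(example, cheap_llm, strong_llm, conf_threshold=0.9):
    """Two-stage labeling cascade: try the cheaper model first, and
    escalate to the stronger (more expensive) model only when the
    cheap model's confidence falls below a threshold.

    cheap_llm / strong_llm: callables returning (label, confidence).
    Returns (label, which_model_produced_it).
    """
    label, confidence = cheap_llm(example)
    if confidence >= conf_threshold:
        return label, "cheap"
    # Low confidence: pay for the stronger model on this example only
    label, _ = strong_llm(example)
    return label, "strong"
```

Since easy examples typically dominate real datasets, most labels come from the cheap model, keeping average cost well below that of the strong model alone.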
Time and cost
We measure the average time taken per label as follows:
LLMs: Average roundtrip time to generate the label. Slack time from retrying the API (in case we hit rate limits) is included in this calculation.
Human annotators: Total time to complete the labeling task for the test dataset, divided by the dataset size (2000 examples).
We measure the average cost per label as follows:
3rd-party hosted LLMs: We use the providers’ API pricing (typically based on the length of the prompt and the generated output) to determine cost.
Self-hosted LLM (Flan-T5 XXL): As mentioned above, for this benchmarking, we hosted the model on a 1xA100 GPU (40 GB) Lambda Labs instance. Average cost per label was estimated as: (instance price per hour / 3600) * time per label * 1.5. The 1.5 factor accounts for less-than-100% GPU utilization.
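The cost formula above, transcribed directly into code (the function name is ours; inputs are the instance price and measured labeling time):

```python
def self_hosted_cost_per_label(instance_price_per_hour: float,
                               seconds_per_label: float,
                               utilization_factor: float = 1.5) -> float:
    """Estimated cost (in $) per label for a self-hosted model.

    The default 1.5 factor mirrors the report's adjustment for
    less-than-100% GPU utilization.
    """
    return (instance_price_per_hour / 3600) * seconds_per_label * utilization_factor
```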
LLMs are much faster and cheaper than human annotators: GPT-4 is ~20x faster and ~7x cheaper, and can label datasets at the same quality as human annotators.
In order to get the best tradeoff between label quality and cost/time, there are a few LLM choices: text-bison@001(Google) and gpt-3.5-turbo (OpenAI) are ~1/10th the cost of GPT-4, but they do as well as GPT-4 on many tasks (e.g. entity matching and question answering). With the added flexibility of fine-tuning and self-hosting, Flan-T5 XXL is a powerful open source alternative for labeling NLP datasets at <1% of human labeling cost.
In our experiments, we observed that few-shot prompting boosts label quality on average by +7 percentage points (0.67 → 0.744) over zero-shot prompting.
Among the few-shot prompting strategies available, selecting in-context examples that are semantically similar (k-nearest neighbors in an embedding space) to the test example is especially helpful in guiding the LLM to pick the correct answer among a few close candidates. For example, here’s a customer query in the Banking77 dataset: “My card hasn't arrived yet.” A few answers could possibly be correct here: [card_acceptance, card_arrival, card_delivery_estimate]. Providing similar examples from the seed set with their labels helps the LLM, as seen in the prompt here.
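The nearest-neighbor example selection described above can be sketched as follows. This is a minimal illustration under the assumption that seed examples have already been embedded; function names are ours, and Autolabel's actual implementation may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def select_few_shot_examples(test_embedding, seed_embeddings, seed_examples, k=4):
    """Pick the k seed examples most semantically similar to the test
    example, to include as in-context examples in the prompt."""
    sims = [cosine(test_embedding, e) for e in seed_embeddings]
    # Rank seed examples by similarity, highest first
    ranked = sorted(range(len(sims)), key=lambda i: -sims[i])
    return [seed_examples[i] for i in ranked[:k]]
```

In practice an approximate nearest-neighbor index would replace the linear scan for larger seed sets, but a 200-example seed set makes the brute-force version perfectly adequate.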
Chain of Thought
Chain-of-thought prompting has been shown to improve LLM’s reasoning abilities. In our experiments, we observed it helps improve label quality for reasoning tasks (+5 percentage point improvement in label quality across three question answering datasets). We did not see these gains from chain of thought prompting on other task types.
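As an illustration, a chain-of-thought labeling setup pairs a prompt that asks for step-by-step reasoning with a parser that keeps only the final answer. The template below is hypothetical, not the actual config used in this benchmark.

```python
# Hypothetical chain-of-thought prompt template for question answering
COT_PROMPT = """You are an expert at answering questions from a given context.

Context: {context}
Question: {question}

Think step by step, then give the final answer on the last line as:
Answer: <answer>
"""

def parse_answer(completion: str) -> str:
    """Keep only the text after the final 'Answer:' marker, discarding
    the intermediate reasoning so the label can be compared against
    ground truth."""
    return completion.rsplit("Answer:", 1)[-1].strip()
```

The parsing step matters: with chain-of-thought, the raw completion contains reasoning text that would break exact-match comparison if not stripped.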
We used our recently open-sourced Autolabel library to run all the experiments shared in this report.
All the datasets and labeling configs used for these experiments are available in our GitHub repo for further analysis, under the examples folder. We invite you to explore them!
If you discover any issues or potential improvements to the shared results, we would love your contribution! You can open an issue on GitHub or join our Discord.
The results shared in this report are a snapshot of the ability of LLMs today to label text datasets. Advances in LLM research and productionization are occurring swiftly, and we’re learning new things almost every day!
We expect this to be a live benchmark updated periodically, and expanding to cover more tasks and problem domains. Here’s a quick overview of additions in the works:
Datasets & Tasks: Support for more datasets across problem domains and tasks such as summarization, attribute enrichment and text generation/writing tasks.
LLMs: Over the last few months, we’ve seen multiple strong open-source LLMs released such as Falcon by TII, MPT by Mosaic and Dolly by Databricks. We will be adding results from these models to this benchmark.
Run it on your data: Ability to run the benchmark on your own dataset/labeling tasks in your own environment. This is already on the roadmap for Autolabel!
The Refuel team would like to thank Arvind Neelakantan, Jayesh Gupta, Neil Dhruva, Shubham Gupta and Viswajith Venugopal for their feedback on the Autolabel library and this report.