We evaluated the performance of the newly released OpenAI models for the task of labeling text datasets across a range of NLP tasks, using the benchmarking setup and datasets introduced in this report. Key takeaways:
gpt-3.5-turbo-0613: Compared to gpt-3.5-turbo-0301, the new model is faster (~40% lower turnaround time), but its label quality is worse on 6 of the 8 datasets in this benchmark (between 1% and 6% lower).
gpt-4-0613: Compared to gpt-4-0314, the new model is also faster (~20% lower turnaround time), and its label quality is on par with the older model.
For running all experiments shared in this report, we used Autolabel, our recently open-sourced library to clean, label and enrich datasets using LLMs.
gpt-4-0613 and gpt-3.5-turbo-0613: Updated and improved models to replace the current gpt-4 and gpt-3.5-turbo models respectively
gpt-3.5-turbo-16k: gpt-3.5-turbo version with 16k context length support
API calls to the stable model names (gpt-3.5-turbo, gpt-4, and gpt-4-32k) will automatically be upgraded to the new models on June 27th. Older models will be deprecated on Sept 13th.
This is great news for developers - newer models unlock new functionality such as function calling support and longer context lengths. However, when switching to a new model, a natural question to ask is: will the new models work as well as before for my existing use case and data?
This is especially important since we heard from OpenAI users that the newer gpt-3.5-turbo model might not be as performant as the previous version.
What we did
As a followup to our recent evaluation of LLM performance for data labeling, we decided to compare the new OpenAI models against the current version on the same benchmark.
We evaluate the new OpenAI models for data labeling across multiple NLP tasks on the following metrics:
Label quality: Agreement between generated label and the ground truth label
Turnaround time: Time taken (in seconds) per label
Note that the new pricing update affects all model versions (including the older ones), so there is no difference in pricing between the current and new model versions.
As a reminder, we included the following datasets and tasks in the benchmark:
The datasets are the same as those in the original report, and complete prompts and configs can be found here.
Label quality measures how well the generated labels (by human or LLM annotator) agree with the ground truth label provided in the dataset.
For datasets other than SQuAD, we measure agreement with an exact match between the generated and ground truth label.
For SQuAD, we measure agreement with the F1 score between generated and ground truth answer, a commonly used metric for question answering.
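The two agreement measures described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the benchmark: the function names are hypothetical, and the token-level F1 with SQuAD-style answer normalization (lowercasing, stripping punctuation and articles) is an assumption about how the F1 score is computed.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the benchmark may normalize differently.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(predicted: str, truth: str) -> float:
    """Return 1.0 if the generated label equals the ground truth label, else 0.0."""
    return float(predicted.strip() == truth.strip())


def token_f1(predicted: str, truth: str) -> float:
    """Token-overlap F1 between a generated answer and the ground truth answer."""
    pred_tokens = normalize(predicted).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as agreement; exactly one empty counts as a miss.
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


def label_quality(predictions, truths, metric) -> float:
    """Average a per-example agreement metric over a dataset."""
    scores = [metric(p, t) for p, t in zip(predictions, truths)]
    return sum(scores) / len(scores)
```

With these helpers, `label_quality(preds, truths, exact_match)` would apply to the classification-style datasets and `label_quality(preds, truths, token_f1)` to SQuAD.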
We also benchmarked the gpt-3.5-turbo-16k model in this experiment and observed its label quality to be the same as gpt-3.5-turbo-0613. Hence, results for gpt-3.5-turbo-16k are excluded from the table above.
Compared to gpt-3.5-turbo-0301 (the current model), gpt-3.5-turbo-0613 (the new model) performed worse in terms of label quality on 6 of the 8 datasets in this benchmark, although the magnitude of the decrease is small (~1% to 5% range).
The new gpt-4 model (gpt-4-0613) seems to be on par with the older model (gpt-4-0314) in terms of quality, and still beats the label quality of human annotators.
Across both models, we see a big improvement in performance on CONLL (a named entity recognition task). We believe this is quite likely because the dataset is in domain for the new models (part of either the pre-training or the instruction tuning corpus).
Turnaround time measures the average time taken per label:
LLMs: Average roundtrip time to generate the label. This includes time spent on retries (for example, when we hit maximum usage limits).
Human annotators: Total time to complete the labeling task for the test dataset, divided by the dataset size (2000 examples).
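As a rough sketch of how these two measurements could be taken, the snippet below times each labeling call end to end, including any time spent waiting on retries, and averages over the dataset. Everything here is hypothetical and for illustration only: `label_fn` stands in for whatever function calls the model, `RateLimitError` is a placeholder exception defined in the snippet, and the retry policy is an assumption rather than the one used in the benchmark.

```python
import time
from typing import Callable, Sequence, Tuple


class RateLimitError(Exception):
    """Placeholder for the error a client raises when a usage limit is hit."""


def timed_label(
    label_fn: Callable[[str], str],
    text: str,
    max_retries: int = 3,
    backoff_seconds: float = 1.0,
) -> Tuple[str, float]:
    """Return (label, elapsed seconds); elapsed time includes retries and waits."""
    start = time.perf_counter()
    for attempt in range(max_retries + 1):
        try:
            label = label_fn(text)
            return label, time.perf_counter() - start
        except RateLimitError:
            if attempt == max_retries:
                raise
            # Exponential backoff before retrying; this wait counts toward the total.
            time.sleep(backoff_seconds * (2 ** attempt))
    raise ValueError("max_retries must be >= 0")


def average_turnaround(
    label_fn: Callable[[str], str], texts: Sequence[str], **kwargs
) -> float:
    """Average roundtrip time per label, in seconds, for an LLM labeler."""
    total = 0.0
    for text in texts:
        _, elapsed = timed_label(label_fn, text, **kwargs)
        total += elapsed
    return total / len(texts)


def human_turnaround(total_seconds: float, dataset_size: int = 2000) -> float:
    """Total labeling time divided by the dataset size (2000 examples here)."""
    return total_seconds / dataset_size
```

Measuring wall-clock time around the whole retry loop, rather than around a single request, is what makes rate-limit delays show up in the per-label figure.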
Compared to gpt-3.5-turbo-0301 (the current model), gpt-3.5-turbo-0613 (the new model) is significantly faster. On average, we observed a 40% decrease in turnaround time. Paired with the reduction in gpt-3.5-turbo API pricing, this might indicate the new model is smaller and optimized for faster inference.
The new gpt-4 model (gpt-4-0613) is also faster compared to gpt-4-0314. In the benchmarking, we observed a ~20% decrease in turnaround time.
Unsurprisingly, the new models continue to be 20-50x faster than human annotators.
Caveat: it’s possible that part of the speed is due to the new models seeing less traffic.
We evaluated the new OpenAI models on a benchmark for LLM-data labeling capabilities. We found that the new gpt-3.5-turbo model had poorer labeling quality on six out of eight datasets. However, the new model is significantly faster (~40% lower turnaround time). The labeling performance for the gpt-4 model was more or less the same.
While this was an evaluation for data labeling capabilities only, we encourage all developers using these models to evaluate models on their own data and workloads before switching to any new model.
If you have any questions on this report, please join the conversation in our Discord here, and if you’re interested in data labeling using LLMs, check out Autolabel here.