Labeling with Confidence

In this post, we examine different techniques for estimating the confidence of LLM-generated labels and demonstrate how to use these estimates to automatically reject low-confidence labels and ensemble LLMs optimally.

Is the new gpt-3.5-turbo model worse?

In this report, we compare the latest models from OpenAI against their previous versions on a data labeling benchmark and find that gpt-3.5-turbo is worse on 6 of 8 datasets, while gpt-4 performance remains the same.