Announcing Refuel LLM-2

Refuel Team

Key takeaways

We’re thrilled to introduce RefuelLLM-2 and RefuelLLM-2-small, the next version of our large language models, purpose-built for data labeling, enrichment and cleaning.

  • RefuelLLM-2 (83.82%) outperforms all state-of-the-art LLMs, including GPT-4-Turbo (80.88%), Claude-3-Opus (79.19%) and Gemini-1.5-Pro (74.59%), across a benchmark of ~30 data labeling tasks.
  • RefuelLLM-2 is built on the Mixtral-8x7B base model and trained on a corpus of 2750+ datasets spanning tasks such as classification, reading comprehension, structured attribute extraction and entity resolution.
  • RefuelLLM-2-small (79.67%), aka Llama-3-Refueled, outperforms all comparable LLMs, including Claude-3-Sonnet (70.99%), Claude-3-Haiku (69.23%) and GPT-3.5-Turbo (68.13%). It was trained with the same recipe, but on top of the Llama3-8B base LLM. We’re open sourcing this model, and it is available on Hugging Face.

You can try out the models in our LLM playground, or access them (with fine tuning support) in Refuel Cloud starting today.


While data is the fuel for modern enterprises, much of its value remains locked up in messy, unclean and unstructured data. Unlocking this value requires far too much manual effort: 80% of the time spent by data teams goes to "unsexy" data work (cleaning, normalization, labeling, etc.) before anything meaningful can be built on top. State-of-the-art LLMs are good at everything, but rarely great at the handful of tasks you might care about. What’s more, they’re too expensive and slow for the scale of data that’s collected or generated today. We need something better.


Benchmark Datasets

Compared to the previous launch of Refuel LLM, we’ve added 10 more datasets to the benchmark:

  1. Long context datasets: We added datasets such as QuALITY and NaturalQuestions to specifically evaluate quality on tasks with long input context.
  2. Non-public evaluation datasets: Many researchers and practitioners have recently highlighted the limitations of evaluating LLMs solely on public datasets (read [1], [2], [3]) due to (valid) concerns about data contamination. To test how well LLMs generalize and perform on real-world data labeling and enrichment tasks, we have also added non-public datasets to the benchmark.

We used Autolabel, our open-source library for LLM-powered data labeling, to run all the experiments that are a part of this report.


Output quality measures how well the LLM-generated output agrees with the ground truth label provided.

  • RefuelLLM-2 (83.82%) outperforms all current state-of-the-art LLMs for data labeling and enrichment, including GPT-4-Turbo (80.88%), Claude-3-Opus (79.19%) and Gemini-1.5-Pro (74.59%)
  • RefuelLLM-2-small (79.67%) outperforms LLMs with similar size/inference cost, including Claude-3-Sonnet (70.99%), Haiku (69.23%) and GPT-3.5-Turbo (68.13%)
  • Compared to the base LLMs we start with for each of the above models (Mixtral-8x7B, Llama3-8B respectively), we see a significant improvement in quality.
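
As a rough illustration of how a headline number like the percentages above could be produced, the sketch below scores each dataset by agreement between the LLM output and the ground truth label, then averages across datasets. The post does not state the exact metric per task type, so the normalized exact-match rule, the unweighted average and the function names (`normalize`, `dataset_agreement`, `benchmark_score`) are all assumptions made for this example.

```python
from statistics import mean


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def dataset_agreement(predictions, labels):
    """Fraction of outputs that match the ground truth label after normalization.

    Assumption: exact match is the agreement rule. Tasks such as reading
    comprehension may need a softer, task-specific metric instead.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot score an empty dataset")
    matches = [normalize(p) == normalize(y) for p, y in zip(predictions, labels)]
    return sum(matches) / len(matches)


def benchmark_score(results):
    """Unweighted average of per-dataset agreement.

    `results` maps a dataset name to a (predictions, labels) pair.
    """
    return mean(dataset_agreement(p, y) for p, y in results.values())


if __name__ == "__main__":
    toy_results = {
        "toy_classification": (["Positive", "negative "], ["positive", "negative"]),
        "toy_extraction": (["Acme Corp", "unknown"], ["Acme Corp", "Globex"]),
    }
    print(f"benchmark score: {benchmark_score(toy_results):.2%}")
```

Averaging per dataset rather than per example keeps a single large dataset from dominating the aggregate score. Whether the benchmark weights datasets this way is not stated in the post.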

Long context datasets

As mentioned in the benchmark section, we have included a few datasets specifically to evaluate LLM performance on long input context.

  • RefuelLLM-2 is built on the Mixtral-8x7B base model and natively supports a 32K max input context length. RefuelLLM-2-small is built on the Llama3-8B base model and supports an 8K max input context length.
  • On both categories of inputs (<4K and >=4K input context), RefuelLLM-2 outperforms all other LLMs. As expected, all LLMs show a significant drop in performance on long context inputs.

Non-public datasets

As mentioned in the benchmark section, we evaluated all LLMs on a collection of non-public datasets, spanning domains such as recruiting, financial services, STEM and e-commerce. These datasets were not used as part of any training or validation splits for the RefuelLLM-2 model family. While including these in the benchmark hurts reproducibility, we believe that it is critical to evaluate LLMs on non-public, task-specific datasets in order to understand their reliability and quality in real-world settings.

RefuelLLM-2’s superior quality is reinforced in the performance comparison shown above. Moreover, for both models the quality improvement on held out datasets compared to their respective base LLMs is a good indication of their capability to generalize.

Domain specific datasets

Going one step further to understand the models’ reliability and quality in real-world settings, we also report LLM quality on datasets from specific industries and problem domains.

We observe that across verticals, RefuelLLM-2 is competitive or superior in terms of output quality, compared to current state-of-the-art LLMs such as GPT-4-Turbo and Claude-3-Opus, at less than 1/10th the model size.

Quality of confidence scores

Building on learnings from our research on Labeling with Confidence, we use average token generation probabilities as a heuristic for estimating the confidence of LLM outputs. To benchmark the quality of these confidence scores, we use AUROC. AUROC is an aggregate score that measures a classifier's ability to distinguish between the positive (“LLM output is correct”) and negative (“LLM output is incorrect”) classes across all score thresholds.

We observe that RefuelLLM-2 and RefuelLLM-2-small output much better calibrated confidence scores compared to those from GPT-4 and Llama-3-70B. Previous work in this domain has shown that RLHF-based post-training of LLMs can hurt logprob calibration significantly. The RLHF training process can cause large spikes in the KL divergence between the model's output distribution and the original pre-trained distribution. This could cause the model to deviate significantly from its original "world prior", thereby hurting its ability to accurately estimate probabilities. Note that the models provided by Anthropic (Claude) and Google (Gemini) don’t support returning token-level log probabilities and hence don’t have a score assigned to them.
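
To make the confidence heuristic and its evaluation concrete, here is a minimal sketch. It turns per-token log probabilities into an average token probability, then computes AUROC as the chance that a randomly chosen correct output receives a higher confidence than a randomly chosen incorrect one (ties count as half). The function names and toy inputs are illustrative assumptions, not the code used to produce the reported scores.

```python
import math


def confidence_from_logprobs(token_logprobs):
    """Average per-token probability of a generated output.

    `token_logprobs` holds the natural-log probability of each generated token,
    in the form typically returned by APIs that expose logprobs.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)


def auroc(scores, is_correct):
    """AUROC via pairwise comparison of correct vs. incorrect outputs.

    Equivalent to the area under the ROC curve swept over all score thresholds.
    Quadratic in the number of examples, which is fine for a small illustration.
    """
    positives = [s for s, ok in zip(scores, is_correct) if ok]
    negatives = [s for s, ok in zip(scores, is_correct) if not ok]
    if not positives or not negatives:
        raise ValueError("need at least one correct and one incorrect output")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy data: token logprobs for four outputs and whether each was correct.
    outputs = [
        ([-0.05, -0.10], True),
        ([-0.20, -0.40], True),
        ([-1.20, -0.90], False),
        ([-0.30, -0.20], False),
    ]
    scores = [confidence_from_logprobs(lps) for lps, _ in outputs]
    labels = [ok for _, ok in outputs]
    print(f"AUROC: {auroc(scores, labels):.3f}")
```

An AUROC of 0.5 means the confidence score is no better than random at separating correct from incorrect outputs, while 1.0 means every correct output is ranked above every incorrect one.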


Training and Hyperparameters

We train the models in two phases. The first phase is responsible for making the models expert at data labeling and enrichment tasks, while the second phase helps improve performance on longer context examples. Training for both the phases was done on a cluster of 8xH100 80GB GPUs.

  • Phase 1 - This is the phase where the majority of instruction tuning of the model takes place. The maximum length of a row used for training is 4096 tokens. We train the model for 21k steps, with a batch size of 32. We use a cosine learning rate scheduler with an initial learning rate of 1e-5 decaying to 10% of its value.
  • Phase 2 - In this phase, we further train the model with longer context inputs added to the training set. We train the model for an additional 5k steps, with a batch size of 16 and 2 gradient accumulation steps. We find the model to be more sensitive to learning rate in this phase and use a cosine learning rate scheduler with an initial learning rate of 2e-6 decaying to 10% of its value.
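
The learning rate schedules described in the two phases can be sketched as follows. This is an illustration under stated assumptions: no warmup, decay spread over the full number of steps in each phase, and the listed batch sizes treated as global rather than per-device. None of these details are given above, and the gradient accumulation value of 1 for Phase 1 is likewise assumed. The name `cosine_lr` and the `PHASES` dictionary are made up for this example.

```python
import math


def cosine_lr(step, total_steps, initial_lr, final_ratio=0.1):
    """Cosine decay from `initial_lr` down to `final_ratio * initial_lr`.

    Assumes no warmup and that the decay spans exactly `total_steps`.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step, 0), total_steps) / total_steps
    min_lr = initial_lr * final_ratio
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (initial_lr - min_lr) * cosine


# Step counts, batch sizes and learning rates come from the phase descriptions.
# grad_accum=1 for Phase 1 is an assumption (it is not stated).
PHASES = {
    "phase_1": {"steps": 21_000, "initial_lr": 1e-5, "batch_size": 32, "grad_accum": 1},
    "phase_2": {"steps": 5_000, "initial_lr": 2e-6, "batch_size": 16, "grad_accum": 2},
}

if __name__ == "__main__":
    for name, cfg in PHASES.items():
        effective_batch = cfg["batch_size"] * cfg["grad_accum"]
        print(f"{name}: effective batch size {effective_batch}")
        for step in (0, cfg["steps"] // 2, cfg["steps"]):
            lr = cosine_lr(step, cfg["steps"], cfg["initial_lr"])
            print(f"  step {step:>6}: lr = {lr:.2e}")
```

Under these assumptions, a batch size of 16 with 2 gradient accumulation steps in Phase 2 yields the same number of sequences per optimizer step (32) as Phase 1, while the peak learning rate is five times smaller.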


While the distribution of examples used for the two phases was different, they were sampled from the same collection of 2750+ unique tasks. Our training collection consists mainly of:

  • Human annotated datasets like Flan, Task Source, and the Aya collection 
  • Synthetic datasets like OpenOrca, OpenHermes and WizardLM
  • Proprietary datasets developed or licensed by Refuel

The final instruction tuning dataset (after deduplication, sampling and cleanups) consisted of ~4B tokens across both the phases. We also leverage multipacking, which packs multiple sequences into a single batch to increase training throughput.
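
The idea behind packing can be illustrated with a simple greedy bin-packing sketch: sequences are grouped so that each group fills a fixed token budget as fully as possible, which reduces the compute wasted on padding. This is a simplified first-fit-decreasing heuristic written for this post, not the multipacking implementation used in training. The function name `pack_sequences` and the 4096-token budget (borrowed from the Phase 1 maximum row length) are assumptions.

```python
def pack_sequences(lengths, max_tokens=4096):
    """Group sequences into packs whose total length fits within `max_tokens`.

    Uses first-fit decreasing: visit sequences from longest to shortest and
    place each one in the first pack that still has room.
    Returns a list of packs, each a list of indices into `lengths`.
    """
    packs = []  # each entry is [remaining_capacity, [sequence indices]]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > max_tokens:
            raise ValueError(f"sequence {i} has {n} tokens, above the {max_tokens} budget")
        for pack in packs:
            if pack[0] >= n:
                pack[0] -= n
                pack[1].append(i)
                break
        else:
            # No existing pack has room, so start a new one.
            packs.append([max_tokens - n, [i]])
    return [indices for _, indices in packs]


if __name__ == "__main__":
    token_lengths = [3000, 1000, 2000, 2000, 96, 512, 4096]
    for pack in pack_sequences(token_lengths):
        used = sum(token_lengths[i] for i in pack)
        print(f"pack {pack}: {used}/4096 tokens used")
```

A real training pipeline also has to keep packed sequences from attending to one another, for example through per-sequence attention masks or position resets. That part is omitted here.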


Access to the RefuelLLM-2 models is available starting today:

  1. Check out our LLM playground, an interactive environment for testing the models against other LLMs.
  2. Sign up for Refuel Cloud to get access to the models, along with fine tuning support.
  3. We’re open sourcing RefuelLLM-2-small (aka Llama-3-Refueled) under a CC BY-NC 4.0 license. The model weights are available on Hugging Face.