Since the dawn of computing, human beings have marveled at the idea of machines that are capable of understanding and reasoning about the world. Artificial intelligence has been the holy grail of computing, and with the rise of foundation models (e.g., GPT-4, Whisper and Vision Transformers), we are entering a new chapter in this grand project. Their broad ability to understand language and vision is a huge unlock for building AI applications across every vertical, from healthcare to education to supply chains.
We are entering the era of AI abundance.
In this new era, every company will be an “AI company”. Products like ChatGPT offer a glimpse of the future, but what is the path to getting there?
With new foundation models available almost every day (MPT-7B, Falcon, StableLM, RedPajama, Dolly 2.0), which one should I pick? How do I fine-tune these models to perform well on my domain? How do I build a moat for my product and company if everyone is using the same LLM, such as GPT-4?
We envision a world where every company can bend foundation models to their will. To serve these teams, every step in the AI development lifecycle will be reimagined to leverage the capabilities of foundation models, starting from the very beginning: the availability of clean, labeled data.
Building any new AI use case requires data collection and annotation, a process that often takes weeks before you can even train your first model. Every machine learning team has felt the pain of being in a perpetual waiting cycle for “enough” labeled data.
In one of my previous roles, we had to train dozens of NLP classifiers, and everyone at the company (from engineers to Senior Directors) spent weeks just labeling data! It was slow and painful, and sadly, the final labels weren’t very high quality either, because it is difficult to share context and write accurate labeling guidelines.
Nihit, my co-founder, led ML teams for content integrity at Meta. Imagine the billions of pieces of content that are uploaded to Meta every day. Labeling even 0.01% of this data requires an army of hundreds of people! But this labeling work is critical to building performant and adaptive AI models for flagging and taking down harmful content (watch Nihit’s talk at the Databricks Data + AI Summit).
With the rise of LLMs, the need for labeling has gone up dramatically – whether you look at the large number of contractors hired by OpenAI or the months-long effort from Databricks employees to label data for Dolly 2.0. We need better solutions to make AI work for everyone.
With the immense costs and time constraints involved in data labeling, it’s no wonder that for every AI project that is successful, there are hundreds that don’t even get off the ground. This is the problem that we will tackle first.
We’re excited to announce Refuel, a platform to automate dataset creation and labeling for every business and every use case. By leveraging LLMs, our users are able to create large, clean and diverse datasets in less than an hour instead of weeks.
Our team likes to say: “Clean, labeled datasets at the speed of thought”. You should be able to imagine the dataset you want, describe it in natural language to Refuel, and let us do the rest.
In addition to the platform we’re building, we’re also releasing Autolabel, an open-source library to label your data using LLMs. Our early users and internal benchmarks show a 25-100x speedup for dataset creation and labeling, with accuracy on par with or better than human labelers.
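To make the idea of LLM-powered labeling concrete, here is a minimal sketch in Python. It is not Autolabel's API: the function names (`build_prompt`, `label_dataset`, `keyword_stub`) and the prompt layout are assumptions made up for illustration, and the "model" is a keyword stub so the example runs offline. In practice you would pass in a function that calls a real LLM.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def build_prompt(task: str, labels: Sequence[str],
                 examples: Sequence[Tuple[str, str]], text: str) -> str:
    """Assemble a few-shot labeling prompt from a natural-language task description."""
    lines = [task, "Answer with exactly one of: " + ", ".join(labels), ""]
    for example_text, example_label in examples:
        lines += ["Text: " + example_text, "Label: " + example_label, ""]
    lines += ["Text: " + text, "Label:"]
    return "\n".join(lines)


def label_dataset(texts: Sequence[str], task: str, labels: Sequence[str],
                  examples: Sequence[Tuple[str, str]],
                  complete: Callable[[str], str]) -> List[Optional[str]]:
    """Label each text with `complete` (hypothetical: any prompt -> completion function).

    Completions that do not match an allowed label become None so they can be
    routed to a human reviewer instead of silently polluting the dataset.
    """
    allowed = {label.lower(): label for label in labels}
    results: List[Optional[str]] = []
    for text in texts:
        raw = complete(build_prompt(task, labels, examples, text))
        results.append(allowed.get(raw.strip().lower()))
    return results


def keyword_stub(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption) so the sketch runs offline."""
    last_text = prompt.rsplit("Text: ", 1)[1]
    return " Positive\n" if "love" in last_text.lower() else "negative"


if __name__ == "__main__":
    predicted = label_dataset(
        texts=["I love this product", "Terrible support"],
        task="Classify the sentiment of each customer review.",
        labels=["positive", "negative"],
        examples=[("Great value for the price", "positive")],
        complete=keyword_stub,
    )
    print(predicted)
```

The design choice worth noting is the `None` fallback: anything the model returns outside the allowed label set is flagged rather than guessed at, which keeps the resulting dataset clean.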
Once clean data is no longer a bottleneck, every company will be able to train their own models and LLMs. For us, this is the first step towards our mission of ushering in the era of AI abundance.
Before Refuel, Nihit and I spent more than a decade building large-scale AI and data systems at Meta, Primer.ai, LinkedIn, Cloudera, and Stanford. Our early engineering team has previously worked at companies like Amazon, Apple, Google DeepMind, Lyft and Uber.
On this journey, we are thrilled to partner with General Catalyst and XYZ Ventures, who are leading our seed round, and to bring on incredible angel investors who have been executives and leaders at OpenAI, Slack, Meta, Datadog, Upstart, Red Hat and more.
If you want to learn more about LLM-powered data annotation or want to stay up-to-date on Refuel, join our community or send us a note at firstname.lastname@example.org!