Last November, ChatGPT, a chatbot developed by the research lab OpenAI, reached the million-user mark just a week after its launch. I could not resist the allure and signed up. When I asked the bot to write a blog post for a company executive on the future of crypto markets, I was impressed with the results. The response recognized current speculation around the issue, listed arguments for and against optimism in the market, and reminded me of the uncertainty around any such predictions. When I tweaked my request to target a 5th grader, the bot sprinkled in some cheeky suggestions: the 5th grader might in the future buy their favorite snacks with cryptocurrencies.
The GPT in ChatGPT stands for Generative Pre-trained Transformer. A transformer is a type of neural architecture for modeling language. Models built this way are called language models, and they generate text when presented with a starting point. Over the past two years, language models have grown exponentially in size. State-of-the-art models boast hundreds of billions of parameters. At that scale, the moniker Large Language Models, or LLMs, captures an essential distinction from these models’ predecessors. At the time of writing, OpenAI had not divulged the model that powers ChatGPT. It is likely an LLM; more speculatively, a variant of GPT-3, a 175-billion-parameter mammoth.
LLMs are powerful; they can produce an answer to almost any question put to them. The result: dozens of text-based applications and browser extensions have been developed within the past two months. How performant are they? That’s an open question, but such development speed is unprecedented. Assuming unlimited access to LLMs, is human-added data still a prerequisite for developing enterprise-level solutions?
Does ChatGPT still need a helping hand? We believe yes.
Our experience shows that while LLMs can boost development speed, proactively added data is still crucial to developing a stable solution, especially for complicated tasks. While the chatbot is making waves as a possible tool for writing school papers, it still needs a lot of help to generate coherent content for enterprise-level applications.
A little background
Research into language models can be traced as far back as the late 1980s. Back then, modeling was based on the frequency of phrases, and logic dictated that the next word generated in a sentence was the one that most frequently followed the preceding text. This approach left little room for differentiation and usually generated incoherent text.
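To make the frequency idea concrete, here is a minimal sketch, assuming the simplest possible case: the model looks only at the previous word and always picks the word that followed it most often in the training text. The toy corpus and the function names are illustrative assumptions, not a reconstruction of any historical system.

```python
from collections import Counter, defaultdict


def train_bigram_counts(corpus):
    """Count, for each word, how often every other word follows it."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts


def generate(counts, start, max_words=8):
    """Greedily append the most frequent follower of the last word."""
    words = [start]
    for _ in range(max_words - 1):
        followers = counts.get(words[-1])
        if not followers:
            break  # the last word was never followed by anything
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)


# Hypothetical toy corpus, for illustration only.
toy_corpus = [
    "the market is up",
    "the market is down",
    "the market is up again",
]
counts = train_bigram_counts(toy_corpus)
print(generate(counts, "the"))  # prints "the market is up again"
```

Because the most frequent follower always wins, the same starting word always yields the same continuation, which illustrates how little room for differentiation such a model leaves.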
With more available computing resources in the early 2000s, neural networks that took a “fixed window of text” approach grew in popularity and replaced phrase-based models. Coherence improved, but the fixed window requirement limited further progress. A network could be trained to target text that would appeal to 5th graders, but the strictly sequential approach left it unable to switch gears to target a more sophisticated audience.
Google introduced the Transformer architecture in 2017, which significantly sped up training times for language models. The next year, it released Bidirectional Encoder Representations from Transformers (BERT), a language model with 110 million parameters. In 2019, OpenAI released GPT-2, a 1.5-billion-parameter model. These language models were powerful, but the quality of the text they generated deteriorated after a few paragraphs.
At 175 billion parameters, GPT-3 certainly qualifies as an LLM. Its size, in addition to the larger dataset on which it was trained, enabled it to generate longer text at high quality. It is now possible for one model to generate text appropriate for both 5th graders and tech CEOs.
What chatbots can and cannot do today
Most companies that build applications on text need that text classified into categories. Generally, classification models are trained on examples of text labeled according to classification rules. This way, the model learns from supervision by a human annotator. The more high-quality labeled data you have, the better the model. Labeled data, nourished and supervised by human interaction, has been the bedrock of most text applications.
LLMs promise more than text generation: a classification problem can be reframed as a text generation problem, in which a prompt asks the model to generate the category label itself.
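A rough sketch of that reframing follows. It builds a prompt that lists the allowed categories, hands it to a text generator, and maps the generated text back to a label. Everything here is an assumption made for illustration: the category names, the prompt wording, and above all fake_llm, which is a keyword-based stand-in so the example runs without any model or API.

```python
LABELS = ["billing", "technical", "other"]  # hypothetical categories


def build_prompt(text, labels=LABELS):
    """Phrase a classification task as a text-completion prompt."""
    options = ", ".join(labels)
    return (
        f"Classify the following customer message as one of: {options}.\n"
        f"Message: {text}\n"
        "Category:"
    )


def parse_label(generated, labels=LABELS):
    """Map free-form generated text to a known label, or None if none matches."""
    cleaned = generated.strip().lower()
    for label in labels:
        if cleaned.startswith(label):
            return label
    return None


def fake_llm(prompt):
    """Stand-in for a real model call. NOT a real API; it only matches keywords."""
    message = prompt.split("Message: ", 1)[1].lower()
    if "invoice" in message or "charge" in message:
        return " Billing."
    return " Other"


def classify(text, generate=fake_llm):
    """Classify text by generating a completion and parsing the label from it."""
    return parse_label(generate(build_prompt(text)))


print(classify("I was charged twice on my last invoice."))  # prints "billing"
```

The parsing step matters because a generative model returns free text rather than a fixed label, so anything it produces outside the allowed categories has to be handled explicitly (here, by returning None).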
Responses are very sensitive to the wording used in the prompt. Even a slight change in phrasing can unintentionally change the classification decision. Given how brittle prompt engineering can be, how can one be sure of having found a stable prompt? The usual first step is to manually review results from different prompts.
Because classification tasks are not iterative and have to yield accurate results on the first pass, it is crucial to test the stability of a prompt on a large, diverse, and labeled dataset. Some of the most promising prompts during spot-checking sessions show disappointing results when tested on a larger labeled dataset.
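The gap between a spot check and a full evaluation can be sketched as follows. The two classifier functions are toy stand-ins for an LLM driven by two different prompts, and the four labeled examples are invented for illustration; a real test set would be far larger and more diverse.

```python
def accuracy(classify, labeled_examples):
    """Fraction of examples whose predicted label matches the human label."""
    correct = sum(1 for text, label in labeled_examples if classify(text) == label)
    return correct / len(labeled_examples)


# Toy stand-ins for an LLM driven by two hypothetical prompt variants.
def prompt_a_classifier(text):
    return "positive" if "good" in text.lower() else "negative"


def prompt_b_classifier(text):
    lowered = text.lower()
    if "not good" in lowered:
        return "negative"
    return "positive" if ("good" in lowered or "great" in lowered) else "negative"


# Hypothetical human-labeled examples.
labeled = [
    ("The service was good", "positive"),
    ("Not good at all", "negative"),
    ("Great support team", "positive"),
    ("Terrible experience", "negative"),
]

spot_check = labeled[:1]  # a quick manual check sees only a sliver of the data
for name, clf in [("prompt A", prompt_a_classifier), ("prompt B", prompt_b_classifier)]:
    print(name, accuracy(clf, spot_check), accuracy(clf, labeled))
# prints:
# prompt A 1.0 0.5
# prompt B 1.0 1.0
```

Both variants look perfect on the spot check, and only the labeled set reveals that the first one fails on negation and on wording it was never checked against.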
LLMs are powerful, and the right prompt can extend that power beyond text generation to text classification. Finding the right prompt is usually the challenge, however, and classification tasks usually demand high accuracy on the first ask. Because responses still vary widely, it is important to test a prospective prompt on a large and diverse labeled dataset. LLMs are not going to replace experts and their data, at least not in the next year.
These language models, the building blocks of ChatGPT and its eventual successors, still need human hands to make sure they begin their lives on the right track.