Before the advent of large language models (LLMs), unassuming AI models worked as behind-the-scenes orchestrators, quietly shaping our digital experiences. Whether suggesting the next movie on a lazy Friday night or predicting the next word in our text messages, these models subtly influenced our decisions. With the arrival of LLMs, these covert architects have stepped into the spotlight, opening the door to direct engagement. Now, as companies begin embedding LLMs into their products, they must grapple with a significant challenge: choosing the right LLM to serve as a foundation. Success hinges on striking a delicate balance between superior performance and robust data privacy, a balance made harder by the ever-evolving landscape of these models.
Companies must confront one primary decision: choosing between open-source and closed-source models. Open-source models champion community collaboration and transparency. In stark contrast, closed-source or proprietary models boast reliability through their exclusive nature and tightly controlled environments. Each choice presents its own set of opportunities and challenges. Understanding the ripple effects of this pivotal decision is therefore a critical step for organizations keen on tapping into the immense potential of LLMs.
At first glance, a company with an unlimited budget might default to proprietary LLMs. This is not surprising: proprietary LLMs tend to outperform open-source LLMs. One clear demonstration of this superiority comes from OpenAI's GPT-4, which outperformed the highest-ranking open-source model to date, LLaMA 2 from Meta, by an impressive ten-point margin on the MMLU benchmark [1], a standard that evaluates knowledge [2] spanning a wide range of topics from STEM to the humanities and social sciences.
GPT-4 also demonstrated its mettle in avoiding fabrications, or 'hallucinations', with a ten percent improvement over LLaMA during assessments on the TruthfulQA benchmark. This standard measures a model's ability to distinguish factual information from deliberately incorrect statements. Similar patterns of superiority emerge from evaluations of how well models curb toxicity and bias, some of the most salient vulnerabilities of older generations of LLMs.
This distinction in the performance and reliability of closed-source models like GPT-4 can be traced back to their exclusivity. Developers of such models often release only refined versions to a limited pool of select testers over extended periods, avoiding public exposure of less mature iterations. As of this writing, GPT-4's superior capabilities are available only through a waitlist managed by OpenAI. Similarly, Anthropic's proprietary model Claude is offered only in the US and the UK, and Google's PaLM 2 lacks a direct interface altogether. This measured approach of phased exclusivity allows developers to gather feedback from initial testers and fine-tune the models over an elongated timeline. This iterative feedback loop ultimately heightens the models' overall robustness.
However, even a company with an unlimited budget might want to reconsider choosing proprietary models as the foundation for its applications. There are drawbacks.
First, the lack of transparency in these models presents a potential hazard. The numerous model parameter weights, the equivalent of billions of finely balanced control points that shape the model's performance, are kept under wraps by the developers. You could be using these sophisticated LLMs, crafting perfect prompts to extract precise responses for your product. However, if the model owners decide to adjust those parameters even slightly, your carefully constructed prompts might lose their accuracy or cease to function entirely. Without access to the underlying parameters, you are left in the dark, clueless as to what triggered the changes.
Second, adopting proprietary models could escalate corporate data leakage risks. Using closed-model APIs by default involves sending your input to the model developers. Despite assurances that these inputs are not used for subsequent model development, there's an undeniable risk of unintentional exposure of sensitive information. In response to these concerns, some developers have begun to offer 'enterprise' versions of their APIs. These versions operate within the client’s network, effectively ensuring that sensitive data never leaves the premises. Though a step in the right direction, such offerings remain out of reach for some businesses due to their associated costs, potentially leaving them more exposed to data security risks when using proprietary models.
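To make that data flow concrete, here is a minimal sketch, assuming a purely hypothetical vendor endpoint and API key rather than any specific provider's real interface, of what a default closed-model API call does: the raw prompt, including whatever sensitive details it contains, is serialized and sent to servers outside your network.

```python
import json
import requests  # generic HTTP client; a vendor SDK would behave similarly under the hood

# Hypothetical endpoint and key, stand-ins for whichever proprietary API is in use.
API_URL = "https://api.example-llm-vendor.com/v1/chat/completions"
API_KEY = "sk-..."

def ask_closed_model(prompt: str) -> str:
    """Send a prompt to a hosted proprietary model.

    Note: the full prompt text leaves the corporate network in this request body,
    which is the data-leakage surface discussed above.
    """
    payload = {
        "model": "vendor-large-model",
        "messages": [{"role": "user", "content": prompt}],
    }
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        data=json.dumps(payload),
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

# Anything placed in `prompt` -- customer names, account details, internal plans --
# becomes part of an outbound request to a third party.
print(ask_closed_model("Summarize this internal memo: ..."))
```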
Considering these constraints, the search is on for safer and more efficient alternatives. Many organizations are aware of these potential pitfalls and are actively exploring paths that balance the robust performance of proprietary models with the user-level controls required by practical applications and risk management.
Viewed through a data privacy lens, open-source models have an inherent advantage. It stems from a company's ability to download the model's parameter weights and deploy the model on a private network, eliminating the need for data to leave the building. Yet this is no mean feat, especially given the sheer magnitude of high-performing models. If top-tier proprietary models such as GPT-4, reported to have over a trillion parameters, were made open source, their configuration and maintenance could prove an overwhelming venture. Even LLaMA 2, the largest and most performant open-source model, contains a hefty 70 billion parameters. This raises the question: could we shrink LLMs from trillions to billions of parameters to ease their private deployment without undermining their performance?
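As a rough illustration of what keeping the data in the building looks like in practice, the sketch below loads an open-source checkpoint from local disk and runs generation entirely on private infrastructure. It is a minimal sketch assuming the Hugging Face transformers library and that the LLaMA 2 weights have already been downloaded to a local path; the path and model size are illustrative, and the 70B variant in particular requires multiple high-memory GPUs.

```python
# Minimal sketch of on-premises inference with an open-source model.
# The weights are read from local storage, so no prompt or completion
# crosses the network boundary at inference time.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LOCAL_WEIGHTS = "/models/llama-2-70b-chat"  # illustrative local path to downloaded weights

tokenizer = AutoTokenizer.from_pretrained(LOCAL_WEIGHTS)
model = AutoModelForCausalLM.from_pretrained(
    LOCAL_WEIGHTS,
    torch_dtype=torch.float16,   # half precision to reduce the memory footprint
    device_map="auto",           # shard the parameters across available GPUs
)

prompt = "Draft a short status update for the migration project."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```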
The open-source community witnessed a significant turning point with February's release of LLaMA's initial version under a non-commercial license, a moment amplified by the subsequent leak of its parameter weights. Harnessing "knowledge distillation", a methodology for transferring knowledge from larger to smaller models through a teacher-student framework, developers within the open-source sphere hurried to unveil imitative models of GPT-4 and Claude at the size of LLaMA. Within a few weeks, the open-source community had produced Alpaca and Vicuna, adding a touch of animal-farm charm to the mix.
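For readers unfamiliar with the teacher-student framing, the sketch below shows the classic distillation loss in PyTorch: the student is trained to match the teacher's softened output distribution in addition to the ground-truth labels. This is a generic illustration of the technique, not the exact recipe behind Alpaca or Vicuna, and the temperature and weighting values are placeholders.

```python
# Generic knowledge-distillation loss: the student mimics the teacher's
# softened probability distribution while still fitting the true labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Soft targets: teacher and student distributions smoothed by the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    ce_term = F.cross_entropy(student_logits, labels)

    # Blend the two; alpha controls how heavily the student leans on the teacher.
    return alpha * kd_term + (1 - alpha) * ce_term
```

In the open-source imitation setting, the teacher's logits are not available, since proprietary APIs return only generated text, so in practice the "teacher signal" is the text itself and the student is simply fine-tuned on it.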
The appeal of imitation models is obvious: they are smaller and seem to perform just as well as the larger models. However, a deeper analysis suggests that while such models can mirror the stylistic expressions of proprietary LLMs, they fall short in acquiring the depth of knowledge, reasoning ability, and comprehensive understanding that the bigger closed-source models offer. One hypothesis is that the data gleaned from the larger proprietary models lacked the necessary depth and breadth of information. Alternatively, it's possible that a sufficient number of parameters is needed to encapsulate a wide scope of knowledge, making these mimic prototypes simply too small to meet the requirement. These two suppositions point to two avenues of investigation.
First, it is crucial that the data harvested from proprietary models is multifaceted, able to capture the intricate subtleties that lie beneath the rudimentary surface. To achieve this, multiple queries to the proprietary model may be needed per data point. For instance, posing the question "can you teach an old dog new tricks?" to GPT-4 might yield a solitary but assertive "yes". Although correct, this response sheds no light on the idiom's history or its nuanced implications. Further in-depth, probing questions are therefore needed to extract this deeper level of data. Yet the process of gathering such comprehensive data can be time-consuming and, consequently, costly. It is also complicated by the sheer breadth of knowledge present in larger models, which makes it challenging to capture the full expanse of subjects they encapsulate.
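Here is a hedged sketch of what "multiple probing queries per data point" might look like in practice: for each seed prompt, a scripted set of follow-up questions is sent to the proprietary model and the whole exchange is stored as one training example. The `query_proprietary_model` function is a placeholder for whichever vendor API is being used, and the follow-up templates are purely illustrative.

```python
# Illustrative data-collection loop: each seed prompt is expanded into a
# small set of probing follow-ups so the captured answers carry more depth
# than a single terse reply.
import json

FOLLOW_UPS = [
    "Explain the reasoning behind your answer step by step.",
    "Where does this idea or expression come from historically?",
    "Give a concrete example and one common misconception.",
]

def query_proprietary_model(conversation):
    """Placeholder for a call to the teacher model's API; swap in the real vendor SDK here."""
    return "(model reply)"  # stub response so the sketch runs end to end

def collect_example(seed_prompt: str) -> dict:
    conversation = [{"role": "user", "content": seed_prompt}]
    for follow_up in [None] + FOLLOW_UPS:
        if follow_up is not None:
            conversation.append({"role": "user", "content": follow_up})
        reply = query_proprietary_model(conversation)
        conversation.append({"role": "assistant", "content": reply})
    return {"seed": seed_prompt, "dialogue": conversation}

# Each seed costs len(FOLLOW_UPS) + 1 API calls, which is where the time
# and expense discussed above come from.
seeds = ["Can you teach an old dog new tricks?"]
dataset = [collect_example(s) for s in seeds]
print(json.dumps(dataset[0], indent=2)[:500])
```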
Second, if capturing the world's knowledge in LLMs requires more than an imposing 100 billion parameters, imitation models must in turn raise their parameter count; essentially, they need to grow. This growth is not unjustified: a team from Berkeley, in their work The False Promise of Imitating Proprietary LLMs [3], noted that the quality of imitation models markedly improved with their size, even when fed the same amount of imitation data. Conversely, keeping model size fixed and adding superficially collected imitation data did little to improve quality. Regrettably, this approach dampens the innate appeal of accessible, less resource-demanding open-source models.
Competing with closed-source models through imitation would therefore require either expanding model size or investing significantly in data acquisition, a challenging scenario for decision makers constrained by limited funds and privacy concerns. However, they could still consider a compromise. One option is to limit an imitation model's scope by focusing on a particular application. For example, an LLM designed to assist in writing email templates does not need proficiency in drafting mathematical proofs. Limiting the model's range makes data collection for imitation more feasible. An alternate compromise is to accept that model size does matter: using a larger open-source model as the base for an imitation model, along with higher-quality imitation data, could yield better outcomes. The recently introduced LLaMA 2 from Meta, boasting 70 billion parameters, is contending closely with many closed-source models, a promising starting point for the world of imitation.
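One way to act on both compromises at once is to take a strong open-source base such as LLaMA 2 and fine-tune it on narrowly scoped imitation data, for instance email-drafting dialogues only. The sketch below assumes the Hugging Face transformers, datasets, and peft libraries and uses a parameter-efficient LoRA adapter so that even a large base model can be adapted without touching all of its weights; the model path, dataset file, and hyperparameters are placeholders rather than a recommended recipe.

```python
# Sketch: parameter-efficient fine-tuning of an open-source base model on a
# narrow, task-specific imitation dataset (e.g., email-template dialogues).
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE_MODEL = "/models/llama-2-70b-chat"     # illustrative local path to open weights
DATA_PATH = "email_imitation_data.jsonl"    # hypothetical file of {"text": ...} records

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
)

# Train only small low-rank adapter matrices instead of the full parameter set.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

dataset = load_dataset("json", data_files=DATA_PATH, split="train")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="email-imitation-lora",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=2,
                           learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```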
I look forward to seeing how this plays out.
Sources:
1. https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu
2. https://arxiv.org/pdf/2009.03300.pdf
3. https://arxiv.org/pdf/2305.15717.pdf
The opinions provided are those of the author and not necessarily those of Fidelity Investments or its affiliates. Fidelity does not assume any duty to update any of the information. Fidelity and any other third parties are independent entities and not affiliated. Mentioning them does not suggest a recommendation or endorsement by Fidelity.
The information regarding large language models is for informational purposes only and is not intended to constitute a recommendation, or development or security assessment advice of any kind. Consider your own use case carefully and understand the risks before utilizing a generative AI tool.
1100630.1.0