Beating the mean: introducing this series
Most likely: baked into the DNA of today's AI systems is the idea of maximum likelihood. The transformer model, the "T" in the now-ubiquitous GPT acronym, works probabilistically at its core. So we get the word most likely to come next in a sequence, the pixel most likely to come next in an image, the most likely answer to a question (see this FT article for a visual, in-depth explanation). As a result, we effectively get the mean answer, the average answer, from AI systems built to act like an average person.
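To make that concrete, here is a minimal, purely illustrative sketch of maximum-likelihood ("greedy") decoding. The tiny vocabulary and scores are invented for illustration; a real transformer produces a distribution over tens of thousands of tokens.

```python
# A toy sketch of greedy (maximum-likelihood) next-token selection.
# Vocabulary and logits are hypothetical, not from any real model.
import math

def softmax(logits):
    """Convert raw model scores into a probability distribution."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Pretend the model scored these candidate next words after "The sky is".
vocab = ["blue", "falling", "limited", "green"]
logits = [4.2, 1.1, 0.3, 0.9]  # hypothetical scores

probs = softmax(logits)
# Greedy decoding: always emit the single most likely next token --
# the "average answer" tendency discussed above.
best = max(range(len(vocab)), key=lambda i: probs[i])
print(f"Next word: {vocab[best]} (p = {probs[best]:.2f})")
```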
But the average person can get confused, be wrong about lots of stuff, even lie. And now the same’s true of our most capable AI systems, which were designed by humans to learn like humans, then trained by humans. So they act like humans. Average humans.
Don’t get me wrong: for a machine to exhibit average human performance — without fatigue and across countless domains — is objectively amazing, even awe-inspiring. It’s enough to potentially transform work, education, maybe society.
After all, the average can be pretty okay for lots of things. 2+2? Wide agreement there. The best route from here to the office? Probably the average is fine. If you’re a below-average writer, a tool that brings your writing up to average is great. And indeed, an emerging body of literature demonstrates how helpful AI assistance can be for raising what would otherwise be below-average performance (e.g., this paper and this paper).
But what if average is not what you’re after? Can AI assistance help us to achieve above-average performance? Even exceptional performance?
And for those individuals and organizations already operating with excellence in a particular domain, will the widespread adoption of AI assistance erode rather than elevate their performance? On a mass scale, could it lead to a global convergence in work quality, where top-performing individuals and organizations are pulled down to the global average while lower performers are pulled up? Early research suggests that AI assistance can indeed degrade performance (e.g., this paper).
It was this concern — and this fear of a potentially dystopian future of AI-powered mediocrity — that motivated my colleagues and me to form a new public benefit corporation, Higher Bar AI, to explore how to beat the mean for high-stakes AI applications.
Our foundational hypothesis: generic => curated => enterprise models
Our foundational hypothesis was that "generic AI" — the big foundation models from OpenAI and others — will always be vulnerable to this "convergence to the mean" effect. They're trained on massive bodies of content, including lots of Reddit, blog posts, and websites of varying quality, accuracy, and bias. To give a sense of what we mean by "massive": GPT-3, the 2020 predecessor to the current GPT-4, was already trained on some 500 billion words. That is 50,000 times more words than a typical American child encounters by age 10 (source). A lot of noise makes its way into those 500 billion words, so the models will necessarily reflect average quality in any particular domain.
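A quick back-of-the-envelope check of that comparison, using the figures as quoted above (the child-exposure estimate is itself approximate):

```python
# Sanity-check the scale comparison: 500 billion training words vs.
# a 50,000x multiple implies roughly 10 million words by age 10.
gpt3_training_words = 500_000_000_000          # ~500 billion, as quoted
child_words_by_age_10 = gpt3_training_words / 50_000
print(f"{child_words_by_age_10:,.0f} words")   # -> 10,000,000
```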
But if somebody were to thoughtfully curate a body of high-quality content and anchor a model in that content, you could imagine a model that exhibits above-average performance. And if you were to further train that model on an above-average organization's own internal knowledge base, its employee onboarding materials, and so on, you could imagine bringing an AI model's performance up to a significantly higher bar.
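One common way to "anchor" a model in curated content is retrieval-augmented generation (RAG). The sketch below is a minimal, hypothetical illustration: the corpus, the naive keyword scoring, and the call_llm() stub are all stand-ins for a production embedding search, vector store, and hosted model API.

```python
# A minimal sketch of anchoring a model in curated content via
# retrieval-augmented generation (RAG). Everything here is a toy
# simplification for illustration purposes.

curated_corpus = [
    "Survey questions should be pre-tested with a small pilot group.",
    "Leading questions bias respondents toward a particular answer.",
]

def retrieve(query, corpus, k=2):
    """Rank curated passages by naive word overlap with the query."""
    def score(passage):
        return len(set(query.lower().split()) & set(passage.lower().split()))
    return sorted(corpus, key=score, reverse=True)[:k]

def call_llm(prompt):
    """Placeholder for a real model call (e.g., a hosted LLM API)."""
    return f"[model response grounded in: {prompt[:60]}...]"

def answer(query):
    # Anchor the model: prepend curated, high-quality context so the
    # response reflects the curated bar rather than the internet average.
    context = "\n".join(retrieve(query, curated_corpus))
    prompt = f"Use only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)

print(answer("How should I test my survey questions?"))
```

The same pattern extends naturally to the enterprise case: swap the curated public corpus for an organization's own knowledge base and onboarding materials.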
Of course, quality is often in the eye of the beholder. Organizations frequently differentiate themselves by, and pride themselves on, their particular definition of quality. Curated and enterprise-trained AI creates plenty of room for disagreement about what's correct or helpful in a local context. For organizations to maintain their own perspective and differentiation, we believe AI assistance will need to reflect those differences.
Thus, publicly curated content might only get you so far; ultimately, you'll need to allow for a degree of enterprise-specific (and possibly individual-specific) personalization. That's what might rescue us from the global regression-to-the-mean problem.
This blog series
In this blog series, we'll begin by chronicling our journey over the past six months, sharing what we've done and what we've learned in building AI systems that can help beat the mean. Once we're caught up, we'll continue the series by sharing our ongoing work with our partners and collaborators.
Our next post dives into our first efforts to put theory into practice by piloting AI assistance for researchers who use surveys to answer some of the most important and complex questions in the world. Continue reading the next post in this series.
Final thoughts
This series documents our admittedly limited lived experience implementing LLMs and GenAI in enterprise settings. Our hypothesis is that curating custom content and adding enterprise-specific training will make AI much more trustworthy and powerful at work. If you have comments on this hypothesis, or your own observations about AI in enterprises, please get in touch!
Additional readings & resources
- For those who'd like to go deeper into how the technology works, two great LLM explainers are this one from Ars Technica (Timothy B. Lee and Sean Trott) and this one from the FT (Madhumita Murgia).
- The idea that LLMs are average across tasks and domains, while conceptually compelling, carries a lot of nuance and caveats. This paper (HBS, Wharton, BCG) presents the idea of a "jagged frontier" of LLM capability in knowledge work that captures some of that nuance.
Next article in the series: "Beating the mean: piloting for research"