This post continues our “beating the mean” series, where we explore building AI that helps individuals and organizations achieve domain-specific excellence in addition to efficiency.
<-- First article in series: "Beating the mean: introducing the series"
--> Next article in series: "Beating the mean: beyond chatbots"
Back in March of 2023, an early paper on the expected labor market impacts of generative AI predicted that survey researchers would be among the occupations most exposed to AI-driven disruption. At the time, I’d been steeped in the survey research and evaluation fields for over a decade, focused on how technology could help researchers and evaluation professionals collect data more efficiently and accurately to inform high-stakes policy and programs. So I was curious how newer AI systems might affect that work.
For context, survey instruments are the tools with which we answer some of the world’s most complex and important questions. The Federal Reserve uses household surveys to gauge consumer expectations as a primary indicator of economic health and inflation. Management consulting firms use topical surveys to keep their many clients abreast of important trends like AI and ESG. Closer to home for me, international development organizations use surveys to measure the impact of interventions and to determine the allocation of billions of dollars in assistance each year.
Within just a few minutes of conversing with GPT-4, I was convinced of the paper’s general prediction: AI can clearly be transformative in the survey research field. That’s largely because survey work is inherently multidimensional and multidisciplinary in a way that demands a completely unreasonable breadth of expertise. That breadth is frankly out of reach for almost all individuals and even most well-resourced teams: you ideally need not only deep subject matter expertise in whatever you’re trying to measure, but also deep methodological expertise to measure it well, plus a deep understanding of the relevant context and population of interest.
To properly measure the effects of a particular social program in rural Ghana, for example, you need to understand both the program and empirical methods well enough to design an evaluation that identifies causal effects; you need to know how to measure health outcomes, financial outcomes, and a lot else besides; and you need to be able to do it all within the local language and cultural context. The range of expertise needed to do this work well borders on the impossible, and it turns out that AI is capable of stepping in to fill a wide range of the potential gaps.
Recognizing both the extraordinary needs and the extraordinary potential, my colleagues at Dobility (makers of SurveyCTO) agreed to back our new venture, and a range of SurveyCTO users doing some of the world’s leading research and evaluation work agreed to collaborate on early pilots of AI assistance. Many more generously agreed to be interviewed, to consult and advise, and to contribute in other ways to our work. (We are incredibly grateful to all of them, and limit our explicit acknowledgements here only out of a default tendency toward privacy.)
We built our first pilot platform atop the world’s most capable AI model at the time, GPT-4, to help us do a few things at once: understand what kinds of assistance users would want across diverse tasks and domains, and test our theory that AI could fill important skills and knowledge gaps.
One of the earliest decision points in building our initial pilot platform was whether to use a chat-based interface. Knowledge work happens in specific applications (RStudio or Excel for data analysis, for example), and the unit of that work is often not chat-friendly natural language but numbers, code, graphical representations, tables, etc. Nevertheless, we chose a chat interface because we believed it offered the most flexible and comprehensive way to learn what users wanted help with across diverse tasks and domains: as long as users can articulate a request in words, the AI can attempt to help.
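For the curious, here is a minimal sketch of the chat-loop pattern this kind of pilot can be built around, assuming the OpenAI Python SDK (v1). The model name and system prompt are illustrative placeholders, not our actual pilot configuration.

```python
# Minimal chat-assistant loop: keep the full message history so the
# model retains context across turns. Illustrative sketch only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [{
    "role": "system",
    "content": (
        "You are an assistant for survey research and evaluation "
        "professionals. Help with questionnaire design, coding, "
        "data analysis, and research methods."  # hypothetical prompt
    ),
}]

while True:
    user_input = input("You: ").strip()
    if not user_input:
        break
    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    print(f"AI: {reply}")
```

The appeal of this pattern for a pilot is exactly its generality: every request, from questionnaire wording to code syntax, arrives through the same narrow interface, which makes it easy to observe what users actually ask for.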
Colleagues at Innovations for Poverty Action (IPA), Poverty Action Lab (J-PAL), Laterite, and Population Council, as well as some academic colleagues at Harvard, stepped up to participate in the early piloting, ultimately generating over 4,500 human-AI interactions.
The subjects of discussion, listed below, ranged widely across conversations. Note that the percentages add up to more than 100% because 68% of conversations included multiple subjects; some spanned six or seven!
Folks who don’t work in a research-related field might be surprised that coding came in as the #1 subject for AI assistance, by a significant margin. Survey and evaluation researchers use many technologies across the research phases, and those tools require non-trivial technical knowledge. In addition, with data driving more and more decisions in modern organizations, data-related subjects account for four of the top nine. Finally, note that aside from survey design, all eight of the other “non-other” subjects are relevant to any research organization, or indeed to any enterprise engaging in research activities.
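To make the multiple-subjects arithmetic concrete, here is a toy illustration, using made-up conversations and labels rather than our pilot data, of how tagging each conversation with every subject it touches pushes the per-subject percentages past 100%.

```python
# Each conversation carries a set of subject tags; a conversation with
# three tags counts once toward each of three subjects, so the
# per-subject percentages can sum well beyond 100%.
from collections import Counter

conversations = [  # made-up example data
    {"coding", "data analysis"},
    {"survey design"},
    {"coding", "research methods", "data cleaning"},
    {"data analysis"},
]

counts = Counter(tag for tags in conversations for tag in tags)
for subject, n in counts.most_common():
    print(f"{subject}: {n / len(conversations):.0%} of conversations")
# Here the printed shares sum to 175%, because two of the four
# conversations include multiple subjects.
```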
Overall, the conversations were consistent with our theory that this work requires an unreasonable range of expertise, and with our theory that AI assistance can help to fill important skills and knowledge gaps. Pilot users exhibited both “cyborg” behavior (collaborating closely with the AI on a range of different tasks) and “centaur” behavior (choosing specific tasks to outsource more fully to the AI), and in both cases they seemed mostly to benefit.
In terms of explicit feedback, users clicked the thumbs-up or thumbs-down buttons on only 2% of AI responses, but when they did, the feedback was more positive (85%) than negative (15%). Generally, people seemed to appreciate the AI assistance, and many users have continued to use the pilot software in the months since their introduction to it.
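As a quick back-of-the-envelope check on those figures, here is the arithmetic with illustrative counts; the real tallies came from our pilot logs, and the totals below are invented to match the reported percentages.

```python
# Illustrative counts only; chosen so the math reproduces the
# percentages reported above.
responses = 5000                 # total AI responses (hypothetical)
thumbs = {"up": 85, "down": 15}  # explicit ratings (hypothetical)

rated = sum(thumbs.values())
print(f"feedback rate: {rated / responses:.0%}")        # 2%
print(f"positive share: {thumbs['up'] / rated:.0%}")    # 85%
print(f"negative share: {thumbs['down'] / rated:.0%}")  # 15%
```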
Coming up next in this series, we will follow the thread of AI experimentation in the survey research field. We’ll discuss a range of topics, including our second pilot (which pivots away from the chat interface), more sophisticated technical approaches to accuracy, safety considerations, and more. Stay tuned!
<-- First article in series: "Beating the mean: introducing the series"
--> Next article in series: "Beating the mean: beyond chatbots"