Higher Bar AI Blog

Beating the mean: piloting for research

Written by Chris Robert | Nov 21, 2023 10:10:40 PM

This post continues our “beating the mean” series, where we explore building AI that helps individuals and organizations achieve domain-specific excellence in addition to efficiency. 

<-- First article in series: "Beating the mean: introducing the series"
--> Next article in series: "Beating the mean: beyond chatbots"

Back in March of 2023, an early paper on the expected labor market impacts of generative AI predicted that survey researchers would be among the occupations with the greatest exposure to AI-driven disruption. At the time, I’d been steeped in the survey research and evaluation fields for over a decade, focused on how technology could help researchers and evaluation professionals more efficiently and accurately collect data to inform high-stakes policy and programs. So I was curious how newer AI systems might affect that work.

For context, survey instruments are the tools with which we answer some of the world’s most complex and important questions. The Federal Reserve uses household surveys to gauge consumer expectations as a primary indicator of economic health and inflation. Management consulting firms use topical surveys to keep their many clients abreast of important trends like AI and ESG. Closer to home for me, international development organizations use surveys to measure the impact of interventions and determine the allocation of billions of dollars in assistance each year.

Within just a few minutes of conversing with GPT-4, I was totally convinced of the paper’s general prediction: AI can clearly be transformative in the survey research field. That’s largely because survey work is inherently multidimensional and multidisciplinary in a way that seems to require a completely unreasonable breadth of expertise. This breadth is frankly out of reach for almost all individuals and even most well-resourced teams: you ideally need not only deep subject matter expertise in whatever you’re trying to measure, but also deep methodological expertise to measure it well, plus a deep understanding of the relevant context and population of interest.


To properly measure the effects of a particular social program in rural Ghana, for example, you need to understand both the program and empirical methods well enough to design an evaluation that identifies causal effects; you need to know how to measure health outcomes, financial outcomes, and a lot else besides; and you need to be able to do it all within the local language and cultural context. The range of expertise needed to do this work well borders on the impossible, and it turns out that AI is capable of stepping in to seamlessly fill a wide range of potential gaps.

Recognizing both the extraordinary needs and the extraordinary potential, my colleagues at Dobility (makers of SurveyCTO) agreed to back our new venture, and a range of SurveyCTO users doing some of the world’s leading research and evaluation work agreed to collaborate on early pilots of AI assistance. Many more generously agreed to be interviewed, to consult and advise, and to contribute in other ways to our work. (We are incredibly grateful to all of them, and limit our explicit acknowledgements here only out of a default tendency toward privacy.)

Pilot 1: Proprietary and flexible AI assistance 

We built our first pilot platform atop the world’s most capable AI model, GPT-4, to help us do a few things at once:

  1. Learn about the kinds of AI assistance that a segment of high-skilled knowledge workers (survey researchers and evaluation professionals) seek out at the point of use
  2. Learn where GPT-4’s capabilities are inherently better vs. worse (and how to tell the difference!)
  3. Learn about the levers (like prompt engineering, retrieval augmentation, and fine-tuned models) that we can use to improve accuracy, trust, and quality (a retrieval-augmentation sketch follows this list)
  4. And do all of that while protecting the security of proprietary data, respecting user privacy, and providing real value to pilot users
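
To make the retrieval-augmentation lever concrete, here is a minimal sketch: retrieve a few relevant reference snippets and place them in the prompt before asking the model. The keyword-overlap retriever and the snippets are illustrative stand-ins (a real system would typically use vector search over curated documentation), and none of this reflects our pilot’s actual implementation; it assumes the OpenAI Python client with an API key in the environment.

```python
# A toy retrieval-augmentation sketch: ground the model's answer in a few
# reference snippets chosen at question time. Snippets and retriever are
# illustrative stand-ins, not the pilot's implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REFERENCE_SNIPPETS = [
    "SurveyCTO relevance expressions reference fields like ${age} to control question flow.",
    "Stratified sampling divides the population into strata before drawing units from each.",
    "Attrition bias arises when dropout is correlated with the outcomes of interest.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Rank snippets by naive keyword overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        REFERENCE_SNIPPETS,
        key=lambda s: len(q_words & set(s.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def answer(question: str) -> str:
    """Stuff retrieved context into the system prompt, then ask GPT-4."""
    context = "\n".join(retrieve(question))
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You assist survey researchers. Ground your answers in "
                        "this reference material when it is relevant:\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(answer("How do I show a question only to respondents over 18 in SurveyCTO?"))
```

The point of the lever is that the model sees grounding text selected at question time, which can improve accuracy and trust without retraining anything.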

One of the earliest decision points in building our initial pilot platform was whether to use a chat-based interface. Knowledge work happens in specific applications (for example, RStudio or Excel for data analysis), and the unit of that work is often not chat-friendly natural language but rather numbers, code, graphical representations, tabular constructions, and so on. Nevertheless, we chose a chat interface because we believed it offered the most flexibility and breadth for understanding what kinds of assistance users want across diverse tasks and domains. As long as users can articulate their request in words, the AI can attempt to help.
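
To illustrate what that flexibility looks like in practice, here is a minimal chat-loop sketch in which the conversation state is just a growing list of messages, so anything a user can put into words can be handed to the model. The system prompt is hypothetical and this is not the pilot’s actual code; it again assumes the OpenAI Python client with an API key in the environment.

```python
# A minimal chat loop: any request the user can articulate in words becomes
# another message in the conversation history. Hypothetical system prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

messages = [
    {"role": "system",
     "content": "You are an assistant for survey research and evaluation work."},
]

while True:
    user_input = input("You: ").strip()
    if not user_input:  # an empty line ends the session
        break
    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    print("AI:", reply)
```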

Colleagues at Innovations for Poverty Action (IPA), Poverty Action Lab (J-PAL), Laterite, and Population Council, as well as some academic colleagues at Harvard, stepped up to participate in the early piloting, ultimately generating over 4,500 human-AI interactions. 

Initial analysis & insights

The subjects of discussion, listed below, ranged widely across conversations. Note that the percentages add up to more than 100% because 68% of conversations included multiple subjects; some ranged across six or seven! (A toy tallying sketch follows the list.)

  1. Coding: 45.36% of conversations (writing code and scripts for surveys, data collection, and analysis using various platforms and languages like Excel, Python, R, Stata, ODK, SurveyCTO, etc.)
  2. Survey design: 31.22% of conversations (designing, testing, and refining survey instruments, questions, sampling methods, and logic flows; improving question wording, identifying biases, maximizing respondent understanding)
  3. Research design: 21.67% of conversations (structuring and planning research methodology, including formulating hypotheses, establishing procedures, determining samples etc.)
  4. Data collection: 14.80% of conversations (gathering data through surveys, interviews, focus groups, experiments, observations or other field methods)
  5. Data analysis: 14.54% of conversations (descriptive statistics, regression analysis, statistical testing, predictive modeling, time series analysis, data visualization, and other analytical techniques applied to data)
  6. Data engineering: 13.99% of conversations (reshaping, cleaning, merging, appending, randomizing, subsetting, aggregating, and general wrangling of data)
  7. Data management: 7.81% of conversations (storing, organizing, documenting, sharing, securing and governing data to maintain integrity, accessibility, reliability and transparency)
  8. Sampling: 2.69% of conversations (methods and considerations for drawing a sample from a population for surveys and research)
  9. Attrition: 2.42% of conversations (understanding, measuring, and minimizing loss of survey respondents or research subjects over time)
  10. Other subjects: 48.86% of conversations (other subjects of discussion)
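
As a toy illustration of that tallying (with made-up tags, not our pilot data), the sketch below counts each subject tag against the total number of conversations; because a single conversation can carry several tags, the percentages can sum to well over 100%.

```python
# Toy multi-subject tally: each conversation can have several subject tags,
# so each tag is counted against the total number of conversations.
from collections import Counter

conversations = [  # made-up tags, not pilot data
    {"coding", "data analysis"},
    {"survey design"},
    {"coding", "survey design", "sampling"},
    {"data engineering"},
]

counts = Counter(tag for tags in conversations for tag in tags)
total = len(conversations)

for subject, n in counts.most_common():
    print(f"{subject}: {n / total:.0%} of conversations")
# The printed percentages sum to more than 100% whenever conversations
# carry multiple subjects.
```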

Folks who don’t work in a research-related field might be surprised that coding came in as the #1 subject for AI assistance, and by a significant margin. Survey and evaluation researchers use many technologies across the phases of research, and those technologies require non-trivial technical knowledge. In addition, with data driving more and more decisions in modern organizations, data-related subjects account for four of the top nine. Finally, note that aside from survey design, all eight of the other “non-other” subjects are relevant to any research organization or enterprise engaging in research activities.

Overall, the conversations were consistent with our theory that this work requires an unreasonable range of expertise — and with our theory that AI assistance can help to fill important skills and knowledge gaps. Pilot users exhibited both “cyborg” behavior (collaborating closely with the AI on a range of different tasks) and “centaur” behavior (choosing specific tasks to more fully outsource to the AI), and in both cases they seemed to mostly benefit.

In terms of explicit feedback, users clicked the thumbs-up or thumbs-down buttons on only 2% of AI responses, but when they did, the feedback was more positive (85%) than negative (15%). Generally, people seemed to appreciate the AI assistance, and many users have continued to use the pilot software in the months since they were introduced to it.

Coming up next in this series, we’ll continue following the thread of AI experimentation in the survey research field. We’ll discuss a range of topics, including our second pilot (which pivots away from the chat interface), more sophisticated technical approaches to accuracy, safety considerations, and more. Stay tuned!

Additional readings & resources

  • Ethan Mollick’s piece on “centaur” vs. “cyborg” AI is a detailed read on the promising but uneven productivity effects across the full array of tasks within a complex domain when using off-the-shelf ChatGPT. In his words, “On some tasks AI is immensely powerful, and on others it fails completely or subtly. And, unless you use AI a lot, you won’t know which is which.” This suggests a massive opportunity for purpose-specific enterprise AI that can smooth this curve and build trust.
  • Mollick’s paper also suggests a skills-leveling effect, whereby consultants with worse pre-study performance received more benefit from AI during the study. Outside of the management consulting context, this AI-driven leveling effect has also been described in multiple other business domains, including call centers (the Brynjolfsson, Li, and Raymond paper) and coding (the Peng et al. paper).

<-- First article in series: "Beating the mean: introducing the series"
--> Next article in series: "Beating the mean: beyond chatbots"