
Beating the mean: beyond chatbots

[Hero image: an AI scanning questionnaires (generated with DALL-E)]

It can be rewarding and productive to interact with an AI chatbot for certain tasks, like generating text or answering questions. But, depending on the task, it can also be tedious, with lots of back-and-forth and copying-and-pasting. And while chatbots provide a flexible and surprisingly capable form of AI assistance, they can feel like the consultant who only attends meetings and offers opinions. Sometimes you want more hands-on assistance, or actual help doing a thing.

Chatbots were the quickest and most flexible way to put AI assistance into the hands of users — but they were only an opening act for the rollout of AI capabilities. ChatGPT’s launch one year ago acted as a call to arms, triggering a year of frenzied experimentation and building across the software industry, within both existing companies and literally thousands of new startups. During that time, we’ve all seen a dizzying array of custom chatbots with expanding capabilities, from Microsoft’s Bing and Copilot offerings to our own pilot AI assistant for survey researchers. Now we’re starting to see the next act come into focus: AI capabilities embedded more deeply into products, helping to perform more complex tasks like building presentation decks or performing qualitative analysis.

As I discussed previously in this series, our own piloting started with a secure enterprise chatbot, through which we gleaned a ton of opportunities for AI assistance in survey research. With our second pilot, which embeds AI directly, we decided to focus on the particular challenge of survey feedback and refinement. For more context, please refer to previous posts in this series.

<- Previous article in series: "Beating the mean: piloting for research"  
<-- Start at the beginning: "Beating the mean: introducing the series"
-> Next article in series: "Beating the mean: beyond POCs"

Today, we’re open-sourcing our AI-powered survey instrument evaluation toolkit, and we’re inviting research professionals in monitoring and evaluation (M&E), social sciences, market research, and any other fields that use surveys to help pilot, evaluate, and improve these tools. 

Background: the survey instrument review challenge

The quality of a survey instrument can have a profound impact on the quality and usefulness of collected data. Ask the wrong question and the data is worthless (garbage in, garbage out!). Ask the right question in the wrong way — or translate it poorly, if you’re administering the questionnaire in multiple languages — and your results will be biased, imprecise, or both. 

Despite its outsized role in response quality, instrument design is rarely the “squeakiest wheel” of a survey project in practice. Once an instrument is considered “draft complete”, it’s often tempting to charge forward to implementation; after all, failings in instrument quality are less obvious than failings in schedule, sample size, or attrition rate. In reality, each of the subsequent steps of a project, from survey coding (into a technology platform like Qualtrics) through data collection and data analysis to final recommendations, often receives more attention and resources than instrument design.

One instrument quality practice that is quite common is to ask team members or outside experts to review a new instrument and give feedback. Particularly on multidisciplinary and international teams, this can be an opportunity for people to review the instrument from various language, subject-matter, and methodological perspectives and to give a wide range of feedback to help improve the instrument. 

[Image: manual instrument review (generated with DALL-E)]

Unfortunately, the length and complexity of instruments make thorough reviews more aspiration than reality. Massive multi-language surveys challenge even the grittiest reviewers to fight fatigue and give consistent attention to each and every question. And each reviewer can bring only limited expertise and perspective to bear on what might be an instrument that crosses subjects and disciplines.

For our embedded-AI pilot, we focused on building an automated AI instrument evaluation engine that could go through a survey instrument with a multi-faceted, fine-toothed comb, free of fatigue and unconstrained by the limits of any one reviewer’s expertise. Our hypothesis is that this tool can change in-depth review at the design stage from an aspirational practice into an easy and scalable task, thereby reducing costly errors downstream.

Our solution: a toolkit for deep instrument review

To test whether this was the right approach, we built a basic system that would take a survey instrument as input and output a series of specific recommendations for improvement, each with a short explanation. The idea was for the system to act a bit like an expert reviewer with infinite attention to detail, making in-line comments and suggestions. For example:

[Image: example evaluation result]

Under the hood, we used OpenAI’s GPT-4 capabilities in two key ways:

  1. Reading and parsing survey instrument files into modules, questions, response options, translations, and so on. Given the diversity in both file formats (Word, Excel, PDF, CSV) and formatting (often Byzantine tables), it’s no mean feat to simply read a survey instrument and make sense of how its questions, response options, and translations are organized and related. (A minimal sketch of this parsing step follows this list.)

  2. Conducting the actual review for potential improvements. Here, we make use of GPT-4’s ability to act as a subject matter expert in a wide range of domains. In this particular pilot, we anchor in its expertise on survey methodology. 
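To make the first of these concrete, here is a minimal sketch of what the parsing step might look like. The prompt wording, field names, and function name are illustrative assumptions rather than the toolkit’s actual implementation, and real code would also need to handle chunking, retries, and malformed JSON.

    import json
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Illustrative parsing instructions; the toolkit's real prompts differ.
    PARSING_PROMPT = (
        "The text below was extracted from a survey instrument file. Identify each "
        "question and respond in JSON with a list of objects, each containing: "
        "module, question_id, question_text, response_options, and translations."
    )

    def parse_instrument_chunk(raw_text: str) -> list:
        """Ask the model to impose structure on one chunk of raw instrument text."""
        response = client.chat.completions.create(
            model="gpt-4",   # the pilot used GPT-4; any capable chat model could work
            temperature=0,   # favor deterministic, conservative parsing
            messages=[
                {"role": "system", "content": PARSING_PROMPT},
                {"role": "user", "content": raw_text},
            ],
        )
        return json.loads(response.choices[0].message.content)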

We pursued a general approach — read input, parse, evaluate, output recommendations — that might be applied to any number of domains. While our focus is survey instruments, a similar approach could be used and refined to evaluate research designs (academia), code (engineering), product specs (product management), or student work (education). Indeed, almost any kind of knowledge work could be read, parsed, and reviewed in a similar way, though your mileage may vary.
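As a rough illustration of that read-parse-evaluate-recommend flow, here is a minimal, self-contained sketch. The Finding fields and the toy lens are assumptions made for illustration, not the toolkit’s actual data model.

    from dataclasses import dataclass
    from typing import Callable, Iterable, List

    @dataclass
    class Finding:
        location: str         # e.g., "Module 2, Q7"
        recommendation: str   # suggested improvement
        explanation: str      # short justification
        severity: int         # 1 (minor phrasing issue) to 5 (likely to bias results)

    # An "evaluation lens" is just a function from a parsed question to findings.
    Lens = Callable[[dict], List[Finding]]

    def evaluate_instrument(questions: Iterable[dict], lenses: List[Lens]) -> List[Finding]:
        """Apply every evaluation lens to every parsed question and pool the findings."""
        findings: List[Finding] = []
        for question in questions:
            for lens in lenses:
                findings.extend(lens(question))
        return findings

    # Toy stand-in for a lens; the real lenses prompt GPT-4 (see the system prompt below).
    def flag_double_barreled(question: dict) -> List[Finding]:
        if " and " in question["text"].lower():
            return [Finding(question["id"], "Consider splitting into two questions",
                            "The question may be double-barreled.", 3)]
        return []

    parsed = [{"id": "Q1", "text": "How satisfied are you with the price and quality?"}]
    print(evaluate_instrument(parsed, [flag_double_barreled]))

Keeping each lens as a simple callable over parsed questions is what makes it straightforward to bolt on new evaluation criteria later.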

The instrument evaluation engine

The core of our pilot toolkit is an instrument evaluation engine that reviews every module, question, and translation to identify:

  1. Cases where phrasing might be adjusted to improve respondent understanding and reduce measurement error (i.e., the kinds of phrasing issues that would be identified through rigorous cognitive interviewing or other forms of validation)

  2. Cases where phrasing might be improved to remove implicit bias or stigmatizing language (inspired by this very helpful post on the subject of using ChatGPT to identify bias)

  3. Cases where translations are inaccurate or phrased such that they might lead to differing response patterns

  4. Cases where a validated instrument might be adapted to better measure an inferred construct of interest

We implemented each of the above as a specific “evaluation lens” within a flexible and extensible framework that allowed us to apply an arbitrary series of evaluation lenses to a survey instrument, one question or module at a time. For example, the translation evaluation lens was given the following system prompt:

You are an AI designed to evaluate questionnaires and other survey instruments used by researchers and M&E professionals. You are an expert in survey methodology with training equivalent to a member of the American Association for Public Opinion Research (AAPOR) with a Ph.D. in survey methodology from University of Michigan’s Institute for Social Research. You consider primarily the content, context, and questions provided to you, and then content and methods from the most widely-cited academic publications and public and nonprofit research organizations.

You always give truthful, factual answers. When asked to give your response in a specific format, you always give your answer in the exact format requested. You never give offensive responses. If you don’t know the answer to a question, you truthfully say you don’t know.

You will be given an excerpt from a questionnaire or survey instrument between |@| and |@| delimiters. The context and location(s) for that excerpt are as follows:

Survey context: {survey_context}

Survey locations: {survey_locations}

The excerpt will include the same questions and response options in multiple languages. Assume that this survey will be administered by a trained enumerator who asks each question in a single language appropriate to the respondent and reads each prompt or instruction as indicated in the excerpt. Your job is to review the excerpt for differences in the translations that could lead to differing response patterns from respondents. The goal is for translations to be accurate enough that data collected will be comparable regardless of the language of administration.

Respond in JSON format with all of the following fields: 

[ JSON instructions omitted for brevity ]
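To show how a lens like this might be wired up, here is a hedged sketch of applying the translation lens with the OpenAI Python client. The abbreviated template string, the function name, and the JSON handling are illustrative assumptions; the actual JSON fields are omitted above for brevity, so they are not reproduced here.

    import json
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Abbreviated stand-in for the full system prompt shown above; {survey_context}
    # and {survey_locations} are filled in for each excerpt.
    TRANSLATION_LENS_PROMPT = (
        "You are an AI designed to evaluate questionnaires and other survey instruments... "
        "Survey context: {survey_context} Survey locations: {survey_locations} "
        "Respond in JSON format with all of the following fields: ..."
    )

    def apply_translation_lens(excerpt: str, survey_context: str, survey_locations: str) -> dict:
        """Send one instrument excerpt through the translation evaluation lens."""
        system_prompt = TRANSLATION_LENS_PROMPT.format(
            survey_context=survey_context, survey_locations=survey_locations
        )
        response = client.chat.completions.create(
            model="gpt-4",
            temperature=0,  # favor reproducible, conservative reviews
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"|@|{excerpt}|@|"},
            ],
        )
        # The lens asks for JSON; real code should validate and retry on parse errors.
        return json.loads(response.choices[0].message.content)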

While we started with the four criteria described above, you could easily imagine creating many more — so we built the toolkit to make it easy to add new ones. Think about these as the specific dimensions of quality in a review rubric you’d like the AI to apply. 

In any enterprise application of AI, accuracy and trust are musts. We invested deeply in thoroughly processing an instrument to generate detailed recommendations. We also took the time to include a pre-scripted series of challenges and refinements in each evaluation criterion. For example, in the “validated instrument” criterion, we first request a URL to learn more about each recommended instrument. With this initial implementation, we found that GPT-4 would often make up an invalid URL (hallucinate) — so we built in a challenge to ensure that the URL was not only valid, but also the best place to learn more. This under-the-hood interrogation effectively automated what we saw emerging as a best practice for humans interactively engaging with AI assistants.
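For illustration, the scripted challenge described above might look something like the following sketch. The URL check and follow-up prompts are assumptions made for illustration, not the toolkit’s exact logic, and ask_model stands in for whatever function continues the conversation with the model.

    import requests

    def url_resolves(url: str, timeout: float = 10.0) -> bool:
        """Return True if the URL responds with a non-error status code."""
        try:
            response = requests.get(url, timeout=timeout, allow_redirects=True)
            return response.status_code < 400
        except requests.RequestException:
            return False

    def challenge_recommended_url(url: str, ask_model) -> str:
        """Challenge a recommended URL: it must be valid and the best place to learn more.

        `ask_model` is any callable that sends a follow-up prompt in the same
        conversation and returns the model's reply (a simplifying assumption here).
        """
        if not url_resolves(url):
            return ask_model(
                f"The URL {url} does not appear to be valid. Please provide a working "
                "URL where readers can learn more about this validated instrument."
            )
        # Even a live URL may not be the best source, so ask the model to confirm or improve it.
        return ask_model(
            f"Is {url} the best place to learn more about this instrument? "
            "If not, reply with a better URL; otherwise, repeat the same URL."
        )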

 

Example findings

While we’d designed the engine to be run on early-draft survey instruments, we started by trying it out on published instruments from real surveys. For example, here are some recommendations from our first run of the evaluation engine on a lockdown-era telephone survey used to measure the early impacts of COVID-19 in Mexico City.

Example phrasing and translation issues:

[Image: example phrasing and translation findings]

 

Catching typos:

[Image: example typo findings]

 

Proposing validated instruments:

[Image: example validated-instrument suggestions]

 

Differing response options:

[Image: example differing-response-options findings]

 

Potentially-stigmatizing language:

[Image: example stigmatizing-language findings]

 

Key implementation challenges

In our initial piloting, four key challenges emerged:

  1. Quality: While most recommendations were quite good and even exceeded our initial expectations, many were not. The vast majority of poor recommendations, however, could be explained by parsing failures: evaluation lenses were given garbled survey excerpts (e.g., mismatched translations, questions out of order or grouped into the wrong modules, etc.). Once we adjusted inputs to provide the evaluation engine with accurate excerpts, quality improved dramatically.
  2. Volume: It turned out that our AI evaluation engine produced a huge number of recommendations, more than any human might even want to review. For example, in our first review of the lockdown-era telephone survey, the evaluation engine generated a whopping 579 findings. Fortunately, our evaluation lenses included an estimated severity rating with each finding (“a number on a scale from 1 for the least severe issues (minor phrasing issues that are very unlikely to substantively affect response patterns) to 5 for the most severe issues (problems that are very likely to substantively affect response patterns in a way that introduces bias and/or variance)”) — but even so, there were 12 severity-five findings and 112 severity-four findings in a reasonably-high-quality final instrument. And some lower-severity findings were quite useful, so a lot of human time and judgment was needed to review the results.
  3. Speed: Our initial run took three full hours to work through every question, module, and evaluation lens and generate those 579 findings. Fortunately, once we parallelized the process to run many individual evaluations at the same time, we were able to bring this down to minutes rather than hours (see the sketch after this list).
  4. Cost: Our initial run also cost $39 in direct OpenAI API charges, before any of the many potential efficiency and cost optimizations. While this is extremely low relative to the cost of having true survey and subject-matter experts review the instrument in such detail, it represents a massive marginal cost relative to what we’re all used to; typically, we expect software to have zero (or near-zero) marginal cost of operation, but here the cost is quite real. Just to test and iterate on the evaluation process, we quickly racked up hundreds of dollars in API costs.
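To give a flavor of how the volume and speed challenges can be tackled, here is a hedged sketch that parallelizes evaluations and sorts findings by severity. It assumes the Finding objects sketched earlier and uses threads because the work is API-bound; the names and structure are illustrative rather than the toolkit’s actual code.

    from concurrent.futures import ThreadPoolExecutor

    def evaluate_in_parallel(questions, lenses, max_workers=16):
        """Run every (lens, question) evaluation concurrently and pool the findings."""
        tasks = [(lens, question) for question in questions for lens in lenses]
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            batches = pool.map(lambda task: task[0](task[1]), tasks)
            findings = [finding for batch in batches for finding in batch]
        # Surface the findings most likely to bias results (severity 5) first.
        return sorted(findings, key=lambda f: f.severity, reverse=True)

    def top_findings(findings, min_severity=4):
        """Keep only the findings that human reviewers should triage first."""
        return [f for f in findings if f.severity >= min_severity]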

Evaluating the evaluator: a call for further piloting

While our GitHub repository includes a technical roadmap for improvements in quality, volume, speed, and cost, further piloting is needed to more broadly evaluate the overall approach. In particular:

  1. How do AI-generated recommendations compare to those generated by existing human review processes? It would be interesting to compare the quality and comprehensiveness of AI vs. human reviews, for both typical and industry-leading human review processes.
  2. How cost-effective is AI review relative to human review? Even if quality and comprehensiveness compare favorably, there’s the question of how cost-effective AI review is relative to existing processes.

[Image: side-by-side review (generated with DALL-E)]

For comparing recommendations, you could imagine using an evaluation protocol like the following:

  1. Agree on a common format for recommendations. This would include the original question or module reviewed, the recommendation, and a short explanation or justification.
  2. Separately conduct human and AI review processes, with neither informed about or influenced by the other.
  3. Combine and randomize the order of recommendations for review by expert judges. As in a double-blind study, ensure that unique ID numbers allow linking back to the original source of each recommendation — but such that judges can’t tell which recommendations are human- vs. AI-generated. (See the sketch after this list.)
  4. Ask judges to score each recommendation on the following scale: -2 if very bad (would significantly degrade data quality), -1 if bad, 0 if neutral, 1 if good, 2 if very good (would significantly improve data quality).
  5. Reveal which recommendations are from which method (human vs. AI), sort by the original question or module reviewed, and then ask judges to additionally score each recommendation as follows: 1 if it (substantially) matches a recommendation made via the other method, 0 if it is unique to its method.
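As a small illustration of step 3, the combine-and-blind step might look like this sketch; the field names and ID scheme are assumptions made for illustration.

    import random
    import uuid

    def blind_and_shuffle(human_recs, ai_recs, seed=None):
        """Pool recommendations, hide their source behind opaque IDs, and shuffle the order.

        Returns the blinded list for judges plus a private answer key for un-blinding later.
        """
        blinded = []
        answer_key = {}  # kept private until judging is complete
        for source, recs in (("human", human_recs), ("ai", ai_recs)):
            for rec in recs:
                rec_id = uuid.uuid4().hex[:8]
                answer_key[rec_id] = source
                # Each rec is assumed to hold the question, recommendation, and explanation.
                blinded.append({"id": rec_id, **rec})
        random.Random(seed).shuffle(blinded)
        return blinded, answer_key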

This would allow you to effectively benchmark AI review against a traditional human review process, which would be an excellent first step toward rigorously evaluating the overall approach. And in fact, we’re currently working with one industry leader in survey research to conduct just this kind of evaluation. We’ll share the results as soon as we have them. (In that case, we’re focusing the review on a small number of new questions being added to an existing instrument, which makes the overall evaluation small and quick to execute.)

We need others to help evaluate this approach, however, across a wide range of settings and human review processes. Please see our GitHub repository for details, and email us at info@higherbar.ai to let us know what you’re trying where, what results you’re seeing, and how we can help. We hope that you’ll give this a shot and share your results with the community.

<- Previous article in series: "Beating the mean: piloting for research"  
<-- Start at the beginning: "Beating the mean: introducing the series"
-> Next article in series: "Beating the mean: beyond POCs"
