Stanford introduces cost-effective evaluation method for AI language model progress


Assessing the progress of new AI language models has been a challenging and costly endeavor. However, Stanford researchers have introduced a more effective and efficient method for evaluating these models. As new versions of AI language models are released, developers often claim improved performance. Proving these claims typically involves subjecting the models to numerous benchmark questions stored in question banks, with answers reviewed by humans.

Sanmi Koyejo, an assistant professor of computer science at Stanford's School of Engineering, explained that it is essential to consider the difficulty of questions when evaluating model performance. "Some models may do better or worse just by luck of the draw. We're trying to anticipate that and adjust for it to make fairer comparisons," said Koyejo.

Sang Truong, a doctoral candidate at Stanford Artificial Intelligence Lab (SAIL), highlighted the cost implications of this evaluation process: “This evaluation process can often cost as much or more than the training itself.” To address this issue, Koyejo, Truong, and colleagues have applied Item Response Theory from education to AI evaluations. This approach takes question difficulty into account when scoring test-takers.

By using language models to score questions on difficulty, the researchers have cut evaluation costs substantially: by half, and in some cases by more than 80 percent. The system can then compare two models' relative performance while accounting for the difficulty of the questions each one answered.
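The core idea from Item Response Theory can be illustrated with the one-parameter (Rasch) model, where the chance a test-taker answers a question correctly depends on the gap between its ability and the question's difficulty. The sketch below is a minimal illustration, not the researchers' actual system; the difficulty scores and right/wrong answers are hypothetical, and the paper's method is considerably more sophisticated.

```python
import math

def rasch_prob(ability, difficulty):
    """Probability of a correct answer under the 1-parameter (Rasch)
    Item Response Theory model: a logistic function of the gap
    between the test-taker's ability and the question's difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def estimate_ability(responses, difficulties, steps=200, lr=0.1):
    """Maximum-likelihood estimate of a model's ability from its
    right (1) / wrong (0) answers on questions whose difficulty
    scores are already known. Uses simple gradient ascent on the
    (concave) Bernoulli log-likelihood."""
    ability = 0.0
    for _ in range(steps):
        grad = sum(r - rasch_prob(ability, d)
                   for r, d in zip(responses, difficulties))
        ability += lr * grad
    return ability

# Hypothetical example: two models answer the same five questions,
# scored in advance for difficulty (higher = harder).
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.5]
model_a = [1, 1, 1, 0, 0]   # misses only the two hardest questions
model_b = [1, 1, 0, 0, 0]   # also misses a medium-difficulty one

ability_a = estimate_ability(model_a, difficulties)
ability_b = estimate_ability(model_b, difficulties)
```

Because difficulty enters the scoring, a model that succeeds on harder questions earns a higher ability estimate than one with the same raw accuracy on easier ones, which is what makes comparisons across different question subsets fair.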

The team has also developed an automated question generator using AI’s generative powers to replenish question banks efficiently while eliminating “contaminated” questions from databases.

Their approach has been tested across various knowledge domains, including medicine, mathematics, and law, and adapted easily to new models and questions. In tests conducted in 2023, it detected subtle shifts in GPT-3.5's safety over time.

Koyejo emphasized that the new method puts rigorous evaluations within reach for developers while giving users fairer assessments: "And for everyone else," he added, "it will mean more rapid progress and greater trust in the quickly evolving tools of artificial intelligence."

Percy Liang of Stanford co-authored the paper, along with additional authors from UC Berkeley and UIUC, including Bo Li (UIUC). Both Li and Koyejo are affiliated with Virtue AI. The research was funded by the MacArthur Foundation and Google Inc., among others, with support from Stanford HAI.

Media contact Jill Wu can be reached via email at jillwu@stanford.edu


