Stanford introduces cost-effective evaluation method for AI language model progress

Sanmi Koyejo, Assistant Professor of Computer Science, Stanford University

Assessing the progress of new AI language models has been a challenging and costly endeavor. However, Stanford researchers have introduced a more effective and efficient method for evaluating these models. As new versions of AI language models are released, developers often claim improved performance. Proving these claims typically involves subjecting the models to numerous benchmark questions stored in question banks, with answers reviewed by humans.

Sanmi Koyejo, an assistant professor of computer science at Stanford’s School of Engineering, explained that it is essential to consider the difficulty of questions when evaluating model performance. “Some models may do better or worse just by luck of the draw. We’re trying to anticipate that and adjust for it to make fairer comparisons,” said Koyejo.

Sang Truong, a doctoral candidate at the Stanford Artificial Intelligence Lab (SAIL), highlighted the cost implications of this evaluation process: “This evaluation process can often cost as much or more than the training itself.” To address this issue, Koyejo, Truong, and colleagues have applied Item Response Theory, a method from educational testing, to AI evaluations. This approach takes question difficulty into account when scoring test-takers.
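For readers who want to see the idea concretely, the sketch below shows a standard two-parameter logistic (2PL) Item Response Theory model in Python. This is a textbook formulation for illustration only, not the Stanford team’s exact implementation, and the ability and difficulty values are hypothetical.

```python
import numpy as np

def p_correct(ability, difficulty, discrimination=1.0):
    """Two-parameter logistic (2PL) IRT model: the probability that a
    test-taker with the given ability answers an item of the given
    difficulty correctly. Textbook formulation, for illustration only."""
    return 1.0 / (1.0 + np.exp(-discrimination * (ability - difficulty)))

# A strong model facing an easy question is very likely to be correct...
print(p_correct(ability=1.5, difficulty=-1.0))   # ~0.92
# ...while the same model facing a much harder question is a coin flip.
print(p_correct(ability=1.5, difficulty=1.5))    # 0.50
```

Because getting a hard question right says more about a model than getting an easy one right, weighting answers by difficulty in this way yields fairer comparisons from fewer questions.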

By using language models to analyze questions and score them on difficulty, the researchers have reduced evaluation costs significantly, cutting them by half and, in some cases, by more than 80%. The system then compares two models’ relative performance based on those question difficulty scores.
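Assuming per-question difficulty scores are already available, one way such a comparison could work is sketched below: grade each model on the same small sample of questions, estimate each model’s ability under a simple Rasch-style model, and compare the estimates. The grid-search estimator and the example responses are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def estimate_ability(correct, difficulties):
    """Maximum-likelihood ability estimate under a one-parameter (Rasch)
    IRT model, found by a simple grid search. `correct` is a 0/1 array of
    graded answers and `difficulties` holds the known difficulty of each
    question. Illustrative only; the actual estimator may differ."""
    grid = np.linspace(-4, 4, 801)
    best_theta, best_ll = 0.0, -np.inf
    for theta in grid:
        p = 1.0 / (1.0 + np.exp(-(theta - difficulties)))
        ll = np.sum(correct * np.log(p) + (1 - correct) * np.log(1 - p))
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta

# Hypothetical graded answers from two models on the same 5-question sample.
difficulties = np.array([-1.0, -0.5, 0.0, 1.0, 2.0])
model_a = np.array([1, 1, 1, 1, 0])
model_b = np.array([1, 1, 0, 0, 0])
print(estimate_ability(model_a, difficulties) >
      estimate_ability(model_b, difficulties))  # True: model A rates higher
```

Because the ability estimates account for which questions each model actually faced, far fewer questions are needed than in a brute-force benchmark run, which is where the cost savings come from.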

The team has also developed an automated, AI-powered question generator that replenishes question banks efficiently while eliminating “contaminated” questions from the databases.

Their approach has been tested across various knowledge domains, including medicine, mathematics, and law, and adapted easily to new models and questions. During tests conducted in 2023, it revealed subtle shifts in GPT-3.5’s safety over time.

Koyejo emphasized that the new method puts rigorous evaluations within reach for developers while giving users fairer assessments: “And for everyone else,” he added, “it will mean more rapid progress and greater trust in the quickly evolving tools of artificial intelligence.”

Percy Liang of Stanford co-authored the paper, along with additional authors from UC Berkeley and the University of Illinois Urbana-Champaign (UIUC). Co-author Bo Li of UIUC and Koyejo are both affiliated with Virtue AI. Funding came from the MacArthur Foundation, Google Inc., and Stanford HAI, among others.

Media contact: Jill Wu, jillwu@stanford.edu


