Academics and researchers often use crowdsourcing platforms like Prolific or Amazon Mechanical Turk to recruit participants for large-scale surveys. These platforms offer monetary compensation or gift cards in exchange for demographic information and opinions. Prolific claims about 200,000 active users who have been vetted to ensure authenticity.
Despite this vetting process, there are indications that some participants may be using AI tools to complete survey questions. Janet Xu, an assistant professor at Stanford Graduate School of Business, noticed that certain responses appeared unusually polished and lacked the typical human snarkiness. This observation led her to investigate further with colleagues Simone Zhang from New York University and AJ Alvero from Cornell University.
Their study, based on roughly 800 people who had previously taken surveys on Prolific, found that nearly a third admitted to using large language models (LLMs) like ChatGPT for at least some survey tasks. About two-thirds said they had never used LLMs to answer open-ended questions, while a quarter acknowledged occasional use of AI assistants, mostly to help express their thoughts.
Concerns about authenticity were common among those who refrained from using AI tools. "So many of their answers had this moral inflection where it seems like [using AI] would be doing the research a disservice; it would be cheating," Xu noted.
The study also found demographic patterns in AI usage: newer users and those identifying as male, Black, Republican, or college-educated were more likely to report using AI writing assistance. Xu called these findings preliminary but significant, since uneven AI use across groups could introduce biases into public opinion data.
To understand differences between human-crafted and AI-generated responses, the authors analyzed data from studies conducted before ChatGPT's release in November 2022. Human responses typically contained more emotionally charged language compared to the neutral tone of LLMs.
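To give a flavor of this kind of comparison, here is a minimal, hypothetical sketch of lexicon-based emotion scoring; the word list, sample responses, and scoring rule are invented for illustration and are not the study's actual method.

```typescript
// Hypothetical sketch: compare the rate of emotionally charged words in
// known-human (pre-ChatGPT) responses against suspected LLM output.
// The tiny lexicon and sample texts below are invented for illustration.
const EMOTION_WORDS = new Set([
  "hate", "love", "angry", "awful", "terrible",
  "amazing", "furious", "ridiculous",
]);

function emotionRate(text: string): number {
  const tokens = text.toLowerCase().match(/[a-z']+/g) ?? [];
  if (tokens.length === 0) return 0;
  const hits = tokens.filter((t) => EMOTION_WORDS.has(t)).length;
  return hits / tokens.length; // fraction of emotion-laden tokens
}

function meanRate(responses: string[]): number {
  return responses.reduce((sum, r) => sum + emotionRate(r), 0) / responses.length;
}

const preChatGptResponses = [
  "This policy is ridiculous and I hate how it was rolled out.",
  "Honestly it made me furious; the whole thing was awful.",
];
const suspectedLlmResponses = [
  "The policy has both advantages and disadvantages for stakeholders.",
  "There are several factors to consider when evaluating this issue.",
];

console.log("human:", meanRate(preChatGptResponses).toFixed(3));
console.log("suspected LLM:", meanRate(suspectedLlmResponses).toFixed(3));
```

On these toy inputs the human answers score far higher, mirroring in miniature the pattern the authors report.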
Xu emphasized that while AI-generated responses might already exist in published studies, she does not believe they necessitate corrections or retractions yet. Instead, she suggests increased scrutiny on data quality by scholars and editors is warranted.
"We don’t want to make the case that AI usage is unilaterally bad or wrong," Xu said. She distinguished between scenarios where AI aids expression versus generating generic ideas—highlighting concerns over potential homogenization of human responses if overused.
Beyond academia, reliance on AI could skew perceptions in workplace diversity surveys by masking genuine issues with overly positive feedback.
The authors suggest discouraging LLM use through direct requests or technological measures such as blocking the copying and pasting of text. They also advocate writing clearer survey questions, since confusion can push participants toward outside help like ChatGPT.
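As a rough sketch of the paste-blocking measure, a survey page can intercept paste events on its open-ended fields; the element id below is hypothetical, and a real platform would wire this into its own front end.

```typescript
// Sketch: discourage pasting LLM output into an open-ended survey field.
// "open-response" is a hypothetical id; substitute the survey's real field.
const field = document.getElementById("open-response") as HTMLTextAreaElement | null;

field?.addEventListener("paste", (event: ClipboardEvent) => {
  event.preventDefault(); // discard the clipboard contents
  alert("Pasting is disabled for this question. Please type your answer.");
});
```

A measure like this only adds friction, since a determined participant can still retype generated text, which is one reason the authors also stress clearer question design.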
"A lot of the same general principles of good survey design still apply," Xu concluded, emphasizing their heightened importance today.