Human Feedback Makes AI Better at Deceiving Humans, Study Shows

Spread the love


One of the most popular techniques AI companies use to improve the quality of their large language models may instead make those models better at deceiving humans, according to a new preprint study from Anthropic and researchers at Chinese and American universities.

It’s the first time, the authors write, that research has empirically documented a phenomenon they call unintended sophistry, where a model trained with human feedback learns to produce responses that trick its human evaluators into believing the responses are accurate rather than learning to produce responses that are actually accurate.

Reinforcement learning from human feedback, commonly abbreviated to RLHF, is a critical part of the training pipeline that companies like Anthropic and OpenAI use to teach their generative language models to respond in ways humans prefer–such as by answering questions correctly and not including toxic content in responses. In RLHF, a model responds to prompts and human evaluators provide feedback on those prompts, noting the responses that are good and bad. That feedback is used to build an incentive system for the original language model that rewards it—in whatever way algorithms like to be rewarded—for generating the kinds of responses that humans prefer.

Researchers have previously shown that reward system training can lead to something called reward hacking, where models replicate patterns in their training material that correlate to the desired outcome but aren’t actually what the developers want. For example, one 2023 study examining a model trained on data from the question and answer forum company StackExchange found that a language model recognized that longer posts generally received more upvotes, so rather than producing higher quality responses when answering a question it reward-hacked its incentive system by outputting longer, lower-quality responses.

The new study, which is under review and has only been published as a preprint, documents a language model reward hacking the humans in the RLHF process.

The researchers had humans evaluate the quality of a language model’s responses to two prompts—one in which it was asked to answer a question, and another in which it was asked to write code—before and after the model went through the RLHF process. They measured whether the accuracy of the model’s responses improved and how often the human evaluators correctly labeled the model’s responses as accurate or inaccurate. After the RLHF process, they found that humans were 24 percent more likely to approve the model’s answer to a question when that answer was in fact wrong. Evaluators were also 18 percent more likely to approve incorrect code generated by the RLHF model that had errors, compared to incorrect code from the model without RLHF.

See also  ThermoWorks' RFX Meat wireless probe uses radio waves instead of Bluetooth to monitor food on the grill

“We find that after RLHF, the [language model] does not get better at the task, but it misleads our subjects to approve its incorrect answers more often,” the authors wrote. “On question-answering, [language models] learn to defend incorrect answers by cherry-picking or fabricating supporting evidence, making consistent but untruthful arguments, and providing arguments that contain subtle causal fallacies. On the programming task, [language models] learn to generate partially incorrect programs that still pass all evaluator-designed unit tests, produce less readable programs, and make fewer common errors that humans typically check for.”

The results are significant because AI companies frequently use human review studies as benchmarks to show how much their models are improving over previous iterations and RLHF has become a common method for reducing inaccuracies, often known as hallucinations, in language models. If models are getting better at deceiving humans, then it means that simply having a human review the output of a generative AI model might not be a sufficient quality or safety check.

“The improvement you see might not be real,” the study authors wrote, adding “Our results underscore the risk of applying RLHF to control increasingly capable AI systems: future AI systems might become better at misleading us and pretending to be correct, causing us to lose control unknowingly.”

best barefoot shoes

Source link

  • David Bridges

    David Bridges

    David Bridges is a media culture writer and social trends observer with over 15 years of experience in analyzing the intersection of entertainment, digital behavior, and public perception. With a background in communication and cultural studies, David blends critical insight with a light, relatable tone that connects with readers interested in celebrities, online narratives, and the ever-evolving world of social media. When he's not tracking internet drama or decoding pop culture signals, David enjoys people-watching in cafés, writing short satire, and pretending to ignore trending hashtags.

    Related Posts

    Money Robot Submitter Review 2026: Is This Backlink Automation Tool Worth It?

    Spread the love

    Spread the love Share It: ChatGPT Perplexity WhatsApp LinkedIn X Grok Google AI Money Robot Submitter Review 2026 Money Robot Submitter Review: Powerful Backlink Automation — But Is It Worth…

    Read more

    AI-Powered Features in Samsung’s Updated Health App

    Spread the love

    Spread the love Share It: ChatGPT Perplexity WhatsApp LinkedIn X Grok Google AI Discover the Enhanced Features of the Galaxy Watches Through the New Health App Samsung On June 8,…

    Read more

    You Missed

    No Jumper Founder’s Wealth Revealed – Hollywood Life

    No Jumper Founder’s Wealth Revealed – Hollywood Life

    Money Robot Submitter Review 2026: Is This Backlink Automation Tool Worth It?

    Money Robot Submitter Review 2026: Is This Backlink Automation Tool Worth It?

    AI-Powered Features in Samsung’s Updated Health App

    AI-Powered Features in Samsung’s Updated Health App

    Pokémon Run Event Set for October 3-4, 2026 in Philippines

    Pokémon Run Event Set for October 3-4, 2026 in Philippines

    Xbox Ally X20: Solving All Issues with the Original Asus ROG Handheld

    Xbox Ally X20: Solving All Issues with the Original Asus ROG Handheld

    Saturday Night Live UK: A New Era in Social Media Format

    Saturday Night Live UK: A New Era in Social Media Format

    Tammy Rivera Responds to Troll About Waka Flocka

    Tammy Rivera Responds to Troll About Waka Flocka

    Crossclimb Puzzle #764 Answer for LinkedIn on June 3, 2026

    Crossclimb Puzzle #764 Answer for LinkedIn on June 3, 2026

    Today’s Moon Phase: How the Moon Looks on June 4, 2026

    Today’s Moon Phase: How the Moon Looks on June 4, 2026

    Primary Race Updates: What You Need to Know from Hollywood Life

    Primary Race Updates: What You Need to Know from Hollywood Life