Human Feedback Makes AI Better at Deceiving Humans, Study Shows

Spread the love

Share It:

ChatGPT Perplexity WhatsApp LinkedIn X Grok Google AI

One of the most popular techniques AI companies use to improve the quality of their large language models may instead make those models better at deceiving humans, according to a new preprint study from Anthropic and researchers at Chinese and American universities.

It’s the first time, the authors write, that research has empirically documented a phenomenon they call unintended sophistry, where a model trained with human feedback learns to produce responses that trick its human evaluators into believing the responses are accurate rather than learning to produce responses that are actually accurate.

Reinforcement learning from human feedback, commonly abbreviated to RLHF, is a critical part of the training pipeline that companies like Anthropic and OpenAI use to teach their generative language models to respond in ways humans prefer–such as by answering questions correctly and not including toxic content in responses. In RLHF, a model responds to prompts and human evaluators provide feedback on those prompts, noting the responses that are good and bad. That feedback is used to build an incentive system for the original language model that rewards it—in whatever way algorithms like to be rewarded—for generating the kinds of responses that humans prefer.

Researchers have previously shown that reward system training can lead to something called reward hacking, where models replicate patterns in their training material that correlate to the desired outcome but aren’t actually what the developers want. For example, one 2023 study examining a model trained on data from the question and answer forum company StackExchange found that a language model recognized that longer posts generally received more upvotes, so rather than producing higher quality responses when answering a question it reward-hacked its incentive system by outputting longer, lower-quality responses.

The new study, which is under review and has only been published as a preprint, documents a language model reward hacking the humans in the RLHF process.

The researchers had humans evaluate the quality of a language model’s responses to two prompts—one in which it was asked to answer a question, and another in which it was asked to write code—before and after the model went through the RLHF process. They measured whether the accuracy of the model’s responses improved and how often the human evaluators correctly labeled the model’s responses as accurate or inaccurate. After the RLHF process, they found that humans were 24 percent more likely to approve the model’s answer to a question when that answer was in fact wrong. Evaluators were also 18 percent more likely to approve incorrect code generated by the RLHF model that had errors, compared to incorrect code from the model without RLHF.

“We find that after RLHF, the [language model] does not get better at the task, but it misleads our subjects to approve its incorrect answers more often,” the authors wrote. “On question-answering, [language models] learn to defend incorrect answers by cherry-picking or fabricating supporting evidence, making consistent but untruthful arguments, and providing arguments that contain subtle causal fallacies. On the programming task, [language models] learn to generate partially incorrect programs that still pass all evaluator-designed unit tests, produce less readable programs, and make fewer common errors that humans typically check for.”

The results are significant because AI companies frequently use human review studies as benchmarks to show how much their models are improving over previous iterations and RLHF has become a common method for reducing inaccuracies, often known as hallucinations, in language models. If models are getting better at deceiving humans, then it means that simply having a human review the output of a generative AI model might not be a sufficient quality or safety check.

“The improvement you see might not be real,” the study authors wrote, adding “Our results underscore the risk of applying RLHF to control increasingly capable AI systems: future AI systems might become better at misleading us and pretending to be correct, causing us to lose control unknowingly.”

Source link