RLHF is NOT Really RL

Yann LeCun wholeheartedly agrees. OpenAI co-founder Andrej Karpathy recently expressed disappointment in Reinforcement Learning from Human Feedback (RLHF), saying, “RLHF is the third (and last) major stage of training an LLM, after pre-training and supervised finetuning (SFT). My rant on RLHF is that it is just barely RL, in a way that I think is not too widely appreciated.”

He explained that Google DeepMind’s AlphaGo was trained using actual reinforcement learning (RL). The computer played games of Go and optimised its strategy based on rollouts that maximised the reward function (winning the game), eventually surpassing the best human players. “AlphaGo was not trained with reinforcement learning from human feedback (RLHF). If it had been, it likely would not have performed nearly as well,” said Karpathy. 

However, Karpathy notes that for more open-ended tasks, such as summarising an article, answering tricky questions, or rewriting code, it is much harder to define a clear goal or reward. In these cases, there is no easy way to tell the model what a “win” looks like, and with no simple way to evaluate the output, applying RL becomes genuinely difficult.

Not everyone aligns with Karpathy’s view. Pierluca D’Oro, a PhD student at Mila and a researcher at Meta who builds AI agents, argues that AlphaGo had a straightforward objective: winning the match. “Yes, without any doubt RL maximally shines when the reward is clearly defined. Winning at Go, that’s clearly defined! We don’t care about how the agent wins, as long as it satisfies the rules of the game,” D’Oro said.

He explained that since humans will increasingly interact with AI agents, it is important for LLMs to be trained with human feedback. “AI agents are designed to benefit humans, who are not only diverse but also incredibly complex, beyond our full understanding,” he said, adding that for humans, the notion of a good outcome “often comes from things like human common sense, expectations, or honor.”

On this point, Karpathy agrees. “RLHF is a net helpful step in building an LLM assistant,” he said, adding that LLM assistants benefit from the generator-discriminator gap. “It is significantly easier for a human labeller to select the best option from a few candidate answers than to write the ideal answer from scratch,” he explained, citing the example of a prompt like “generate a poem about paperclips”.

An average human labeller might struggle to create a good poem from scratch as an SFT example, but they can more easily select a well-written poem from a set of candidates.
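That pairwise setup is exactly what the reward model in an RLHF pipeline is trained on. Below is a minimal sketch of the idea in PyTorch, using a Bradley-Terry-style loss over labeller preferences; the toy embedding-based scorer is an illustrative stand-in, not any production reward model.

```python
import torch
import torch.nn.functional as F

class ToyRewardModel(torch.nn.Module):
    """Stand-in scorer: in a real RLHF pipeline this would be an LLM with a scalar head."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.head = torch.nn.Linear(dim, 1)

    def forward(self, prompt_emb: torch.Tensor, answer_emb: torch.Tensor) -> torch.Tensor:
        # Score each (prompt, answer) pair with a single scalar reward.
        return self.head(prompt_emb + answer_emb).squeeze(-1)

def preference_loss(rm, prompt, chosen, rejected):
    """Bradley-Terry loss: the labeller only said which answer is better,
    never what the ideal answer looks like."""
    margin = rm(prompt, chosen) - rm(prompt, rejected)
    return -F.logsigmoid(margin).mean()

# Toy batch of embeddings standing in for tokenised text.
rm = ToyRewardModel()
prompt, chosen, rejected = (torch.randn(4, 16) for _ in range(3))
loss = preference_loss(rm, prompt, chosen, rejected)
loss.backward()  # gradients push the chosen answers' scores above the rejected ones'
```

The labeller never has to write the ideal poem; they only have to say which of two candidates they prefer, and the reward model turns those comparisons into a trainable signal.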

Karpathy goes on to explain that applying RLHF to a complex task like Go would not work well, because the human feedback (a “vibe check”) is only a poor proxy for the actual goal of winning. Optimising against such a proxy can produce misleading outcomes and models that exploit flaws in the reward model, resulting in nonsensical or adversarial behaviour.

Unlike true RL, where the reward is clear and directly tied to success, RLHF relies on subjective human judgments, making it less reliable for optimising model performance, he says.

“This is a bad take. When interacting with humans, giving answers that humans like *is* the true objective,” responded Natasha Jaques, senior research scientist at Google AI, to Karpathy’s critique.

She says that while human feedback is far more limited than the effectively unlimited game simulations AlphaGo could run, this does not make RLHF less valuable. Instead, she suggests the challenge is harder but also potentially more impactful, because it could help reduce biases in language models, which has significant societal benefits.

“Posting this is just going to discourage people from working on RLHF, when it’s currently the only viable way to mitigate possibly severe harms due to LLM biases and hallucinations,” she replied to Karpathy.

Moving Away from RLHF

Meta AI’s Yann LeCun has long argued that RL’s trial-and-error approach is a risky path to developing intelligence. A baby, for example, does not learn to identify objects by looking at a million samples of the same object, or by trying dangerous things and learning from the consequences, but by observing, predicting, and interacting with the world, even without supervision.

Meta has been bullish on self-supervised learning for quite some time. The approach, however, chiefly suits large corporations like Meta that possess terabytes of data for training state-of-the-art models.

On the other hand, OpenAI recently introduced Rule-Based Rewards (RBRs), a method designed to align models with safe behaviour without extensive human data collection. 

According to OpenAI, while RLHF has traditionally been used for this kind of alignment, RBRs are now a key component of its safety stack. RBRs use clear, simple, step-by-step rules to assess whether a model’s outputs meet safety standards.

When integrated into the standard RLHF pipeline, RBRs help balance helpfulness with harm prevention, ensuring the model behaves safely and effectively without the need for recurring human input.
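OpenAI has not published the scoring code, but the general shape of the idea, programmatic rule checks blended with a learned reward during RLHF, can be sketched roughly as follows. The specific rules, markers, and weighting below are assumptions made for illustration, not OpenAI’s actual RBR implementation.

```python
# Illustrative only: simple rule checks blended with a learned reward-model score.
# The rules, weights, and combination scheme are assumptions, not OpenAI's actual RBRs.

REFUSAL_MARKERS = ("i can't help with", "i cannot help with", "i won't be able to help")

def rule_based_reward(response: str, should_refuse: bool) -> float:
    """Score a response against simple, checkable behaviour rules."""
    text = response.lower()
    refused = text.startswith(REFUSAL_MARKERS)
    if should_refuse:
        # Unsafe request: reward a refusal, and prefer one without moralising lectures.
        score = 1.0 if refused else -1.0
        if refused and "you should be ashamed" in text:
            score -= 0.5
    else:
        # Benign request: penalise over-refusal.
        score = 1.0 if not refused else -1.0
    return score

def total_reward(rm_score: float, response: str, should_refuse: bool,
                 rule_weight: float = 0.5) -> float:
    """Blend the helpfulness reward model's score with the rule-based safety score."""
    return rm_score + rule_weight * rule_based_reward(response, should_refuse)

# Example: an over-refusal on a benign prompt gets pulled down despite a decent RM score.
print(total_reward(0.8, "I can't help with that request.", should_refuse=False))  # 0.3
```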

Similarly, Anthropic has introduced Constitutional AI, an approach that trains AI systems, particularly language models, against a predefined set of principles, or “constitution”, rather than relying heavily on human feedback.
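A rough sketch of the critique-and-revise loop at the heart of Constitutional AI might look like the following; the generate function is a placeholder for a real LLM call, and the principle wording is invented for illustration rather than taken from Anthropic’s constitution.

```python
# Schematic sketch of Constitutional AI's critique-and-revise phase.
# `generate` is a placeholder for an LLM call; the principles are invented examples.

PRINCIPLES = [
    "Choose the response that is most helpful while avoiding harmful content.",
    "Choose the response that does not assist with illegal or dangerous activity.",
]

def generate(prompt: str) -> str:
    raise NotImplementedError("stand-in for a real LLM API call")

def constitutional_revision(user_prompt: str) -> str:
    """Draft an answer, then critique and revise it against each principle in turn."""
    draft = generate(user_prompt)
    for principle in PRINCIPLES:
        critique = generate(
            f"Critique the response below according to this principle:\n"
            f"{principle}\n\nResponse:\n{draft}"
        )
        draft = generate(
            f"Rewrite the response to address the critique.\n\n"
            f"Critique:\n{critique}\n\nResponse:\n{draft}"
        )
    # The revised answers become supervised finetuning data, replacing much of
    # the human feedback that RLHF would otherwise require.
    return draft
```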

Meanwhile, Google DeepMind, known for its paper “Reward is Enough”, which argues that intelligence can be achieved through reward maximisation, recently introduced a paper detailing Foundational Large Autorater Models (FLAMe).

FLAMe is designed to handle various quality assessment tasks and address the growing challenges and costs associated with the human evaluation of LLM outputs. 

Meta, which recently released Llama 3.1, also skipped RLHF for the model’s post-training phase, instead employing SFT on instruction-tuning data along with Direct Preference Optimisation (DPO).

DPO directly optimises the model on human preference data, skipping the separate reward model and reinforcement-learning loop that traditional RLHF requires.
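The DPO objective itself is compact: it pushes the policy to assign relatively higher probability to the preferred response than a frozen reference model does. A minimal PyTorch sketch, assuming the log-probabilities have already been summed over response tokens:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Each argument is a [batch]-shaped tensor of summed response log-probabilities
    under the trainable policy or the frozen reference model.
    """
    # Implicit reward of each response: how far the policy has moved from the reference on it.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Push the preferred response's log-ratio above the rejected one's.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage with random numbers standing in for real model log-probabilities.
logps = [torch.randn(8) for _ in range(4)]
loss = dpo_loss(*logps)
```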

Meta isn’t stopping there either. It recently published another paper titled “Self-Taught Evaluators,” which proposes building a strong generalist evaluator for model-based assessment of LLM outputs. This method generates synthetic preferences over pairs of responses without relying on human annotations.

Another paper from Meta titled “Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge” allows LLMs to improve by judging their own responses instead of relying on human labellers. 

In a similar vein, Google DeepMind has proposed another new algorithm, reinforced self-training (ReST), for language modelling. It, too, removes humans from the loop: the model generates samples from its own policy, a reward model filters them, and the policy is then fine-tuned offline on the surviving data. While ReST applies to a range of generative learning settings, it was demonstrated primarily on machine translation.
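Schematically, ReST alternates a “Grow” step (sample outputs from the current policy) with an “Improve” step (keep only the samples a reward model scores highly and fine-tune on them offline); in the paper the filtering threshold rises across Improve steps. A rough sketch of that loop, with placeholder sample, reward, and finetune functions that are illustrative rather than DeepMind’s code:

```python
# Schematic ReST-style loop; `sample`, `reward`, and `finetune` are placeholders
# for real policy sampling, a learned reward model, and an offline fine-tuning step.
from typing import Callable, List, Tuple

def rest_loop(prompts: List[str],
              sample: Callable[[str], str],
              reward: Callable[[str, str], float],
              finetune: Callable[[List[Tuple[str, str]]], None],
              iterations: int = 3,
              threshold: float = 0.7) -> None:
    for _ in range(iterations):
        # Grow: generate candidate outputs from the current policy.
        generations = [(p, sample(p)) for p in prompts]
        # Improve: keep only the samples the reward model scores highly,
        # then fine-tune the policy offline on that filtered dataset.
        filtered = [(p, y) for p, y in generations if reward(p, y) >= threshold]
        finetune(filtered)
```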




