Researchers Propose Using Synthetic Critiquing to Train Models

Researchers have proposed a novel way to improve reward models (RMs) for reinforcement learning from human feedback (RLHF) by making use of synthetic critiques.

In the paper titled ‘Improving Reward Models with Synthetic Critiques’, researchers from Cohere and the University of Oxford proposed using large language models (LLMs) to help align other language models, reducing the cost and time required for human annotation.

“RMs are trained to predict a score reflecting human preference, which requires significant time and cost for human annotation. Additionally, RMs tend to quickly overfit on superficial features in the training set, hindering their generalisation performance on unseen distributions,” the researchers said.

To address this, the researchers propose using LLMs to generate critiques that assess the relationship between a prompt and the generated output before a scalar reward is predicted. In their experiments, they found that reward models trained with these synthetic critiques scored higher on existing RM benchmarks.
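
As a rough illustration of the critique-generation step (not the authors’ code or prompts), the sketch below asks a generic LLM to critique a prompt/response pair along the lines described above; the prompt template and the `llm_generate` callable are hypothetical placeholders for whichever model and API are used.

```python
# Minimal sketch of generating a synthetic critique with an LLM.
# `llm_generate` is a hypothetical stand-in for any chat-capable LLM call;
# the prompt wording is illustrative, not the paper's exact template.

CRITIQUE_TEMPLATE = """You are reviewing a model response.

Prompt:
{prompt}

Response:
{response}

Critique the response for instruction following, correctness, and style.
Be specific and concise."""


def generate_synthetic_critique(prompt: str, response: str, llm_generate) -> str:
    """Ask an LLM to critique a prompt/response pair."""
    return llm_generate(CRITIQUE_TEMPLATE.format(prompt=prompt, response=response))
```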

These synthetic critiques provided additional feedback on aspects such as instruction following, correctness, and style. By taking these into account, the reward models received richer signals and features, allowing them to better assess and score language model outputs.
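
To make the idea concrete, here is a minimal sketch of how a critique-enhanced input could be scored: the prompt, response, and synthetic critique are concatenated and passed to a sequence-classification head that predicts a single scalar. The base model and the input formatting below are illustrative assumptions, not the paper’s exact setup.

```python
# Sketch of scoring a critique-enhanced example with a scalar reward head.
# The backbone model and the way the fields are concatenated are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilroberta-base"  # placeholder backbone, not the paper's RM
tokenizer = AutoTokenizer.from_pretrained(model_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=1  # single output logit used as the scalar reward
)


def score(prompt: str, response: str, critique: str) -> float:
    """Concatenate prompt, response, and critique; return a scalar reward."""
    text = f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        reward = reward_model(**inputs).logits.squeeze()
    return reward.item()
```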

Interestingly, critiques generated by GPT-4o enabled reward models to match, and on certain benchmarks even surpass, RMs trained without critiques. The researchers also warned that critiques from weaker models could be detrimental to the overall training process, as seen in the case of LLaMa-2 7B.

They also highlighted that using synthetic critiques led to an increase in data efficiency. “In these settings one high-quality critique-enhanced preference pair is worth 40 non-enhanced preference pairs. As these critiques are generated without human labour, this approach could make it much more cost-effective to obtain competitive reward models,” they said.

There have been several discussions on improving how reward models work and making alignment less reliant on human feedback. Currently, all major AI players, including Google, OpenAI and Meta, rely on reward models to align their LLMs.

Research on replacing RLHF with reinforcement learning from AI feedback (RLAIF) has also been ongoing at Google Research, which found that RLHF still held an edge over AI feedback. With this paper, however, the tide could be turning in favour of AI-based critiquing.




