Researchers Propose Using Synthetic Critiquing to Train Models

Researchers have proposed a novel way to improve reward models (RMs) for reinforcement learning from human feedback (RLHF) by making use of synthetic critiques.

In the paper titled ‘Improving Reward Models with Synthetic Critiques’, researchers from Cohere and the University of Oxford proposed using large language models (LLMs) to help align other language models, reducing the cost and time required for human annotation.

“RMs are trained to predict a score reflecting human preference, which requires significant time and cost for human annotation. Additionally, RMs tend to quickly overfit on superficial features in the training set, hindering their generalisation performance on unseen distributions,” the researchers said.

To address this, the researchers propose using LLMs to generate critiques that assess the relationship between a prompt and the generated output before a scalar reward is predicted. In their experiments, they found that reward models trained with these synthetic critiques scored higher on existing RM benchmarks.
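
As a rough illustration of the critique-generation step (not the authors’ code or prompts), the sketch below asks a generic LLM to critique a prompt/response pair along the lines described above; the prompt template and the `llm_generate` callable are hypothetical placeholders for whichever model and API are used.

```python
# Minimal sketch of generating a synthetic critique with an LLM.
# `llm_generate` is a hypothetical stand-in for any chat-capable LLM call;
# the prompt wording is illustrative, not the paper's exact template.

CRITIQUE_TEMPLATE = """You are reviewing a model response.

Prompt:
{prompt}

Response:
{response}

Critique the response for instruction following, correctness, and style.
Be specific and concise."""


def generate_synthetic_critique(prompt: str, response: str, llm_generate) -> str:
    """Ask an LLM to critique a prompt/response pair."""
    return llm_generate(CRITIQUE_TEMPLATE.format(prompt=prompt, response=response))
```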

These synthetic critiques provided additional feedback on aspects such as instruction following, correctness, and style. By taking these into account, the reward models received richer signals and features, allowing them to better assess and score language model outputs.
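
To make the idea concrete, here is a minimal sketch of how a critique-enhanced input could be scored: the prompt, response, and synthetic critique are concatenated and passed to a sequence-classification head that predicts a single scalar. The base model and the input formatting below are illustrative assumptions, not the paper’s exact setup.

```python
# Sketch of scoring a critique-enhanced example with a scalar reward head.
# The backbone model and the way the fields are concatenated are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilroberta-base"  # placeholder backbone, not the paper's RM
tokenizer = AutoTokenizer.from_pretrained(model_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=1  # single output logit used as the scalar reward
)


def score(prompt: str, response: str, critique: str) -> float:
    """Concatenate prompt, response, and critique; return a scalar reward."""
    text = f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        reward = reward_model(**inputs).logits.squeeze()
    return reward.item()
```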

Interestingly, critiques generated by GPT-4o enabled reward models to match, and on certain benchmarks even surpass, RMs trained without critiques. The researchers also warned that critiques from weaker models could be detrimental to the overall training process, as seen in the case of LLaMa-2 7B.

They also highlighted that using synthetic critiques led to an increase in data efficiency. “In these settings one high-quality critique-enhanced preference pair is worth 40 non-enhanced preference pairs. As these critiques are generated without human labour, this approach could make it much more cost-effective to obtain competitive reward models,” they said.

There have been several discussions on improving how reward models work and making alignment less reliant on human feedback. Currently, all major AI players, including Google, OpenAI and Meta, rely on reward models to align their LLMs.

Research on replacing RLHF with reinforcement learning from AI feedback (RLAIF) has also been ongoing at Google Research, which found that RLHF still held an edge over AI feedback. With this paper, however, the tide could be turning in favour of AI-based critiquing.




