Computer Science > Computation and Language

arXiv:2408.09385 (cs)

[Submitted on 18 Aug 2024 (v1), last revised 30 Oct 2024 (this version, v2)]

Title: Reward Difference Optimization For Sample Reweighting In Offline RLHF

Authors: Shiqi Wang, Zhengze Zhang, Rui Zhao, Fei Tan, Cam Tu Nguyen

Abstract: With the rapid advances in Large Language Models (LLMs), aligning LLMs with human preferences has become increasingly important. Although Reinforcement Learning with Human Feedback (RLHF) proves effective, it is complicated and highly resource-intensive. As such, offline RLHF has been introduced as an alternative solution, which directly optimizes LLMs with ranking losses on a fixed preference dataset. Current offline RLHF captures only the "ordinal relationship" between responses, overlooking the crucial aspect of how much one is preferred over the others. To address this issue, we propose a simple yet effective solution called Reward Difference Optimization, abbreviated as RDO. Specifically, we introduce reward difference coefficients to reweight sample pairs in offline RLHF. We then develop a difference model which captures rich interactions between a pair of responses for predicting these difference coefficients. Experiments with 7B LLMs on the HH and TL;DR datasets substantiate the effectiveness of our method in both automatic metrics and human evaluation, thereby highlighting its potential for aligning LLMs with human intent and values.
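
The reweighting idea in the abstract can be illustrated with a minimal sketch. This is not the paper's formulation: it assumes a generic pairwise logistic ranking loss, and the function names and inputs (`margins`, `coefficients`) are hypothetical. Each pair's loss is scaled by a non-negative coefficient, so pairs where one response is preferred by a wider margin contribute more to the objective.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_pairwise_loss(margins, coefficients):
    """Mean of coefficient-weighted logistic ranking losses.

    margins[i]: score(preferred) - score(rejected) for pair i
                (hypothetical input, e.g. from the model being trained).
    coefficients[i]: non-negative weight for pair i, standing in for a
                     reward difference coefficient (assumed, not the
                     paper's exact definition).
    """
    if len(margins) != len(coefficients):
        raise ValueError("margins and coefficients must have equal length")
    if not margins:
        raise ValueError("at least one pair is required")
    total = 0.0
    for margin, coeff in zip(margins, coefficients):
        # -log(sigmoid(margin)) equals softplus(-margin)
        total += coeff * softplus(-margin)
    return total / len(margins)
```

With all coefficients set to 1 this reduces to an unweighted ranking loss that sees only which response is preferred; a coefficient of 0 removes a pair from the objective entirely.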

Comments:	EMNLP 2024 findings
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2408.09385 [cs.CL]
	(or arXiv:2408.09385v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2408.09385

Submission history

From: Shiqi Wang
[v1] Sun, 18 Aug 2024 07:04:16 UTC (9,096 KB)
[v2] Wed, 30 Oct 2024 04:47:00 UTC (9,097 KB)
