DDangchani's DataLog

Aspect	PPO-based RLHF	DPO
Training data	prompt, generated response, reward model score	preference pair $(x,y_w,y_l)$
Reward model	Required	Not explicitly required
Value model	Required	Not required
Sampling	Requires on-policy rollouts	Works with offline preference data
Objective	clipped policy gradient + KL penalty	pairwise logistic loss
Implementation difficulty	High	Relatively low
Main risks	reward hacking, unstable RL	Sensitive to preference data quality and the choice of reference model
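
To make the "Objective" row of the table concrete, here is a minimal per-sample sketch of both losses in plain Python. It is an illustration under stated assumptions, not a reference implementation: the function names, the default values of `beta`, `eps`, and `kl_coef`, and the choice to add the KL term directly to the PPO loss are all assumptions made here (RLHF implementations often fold the KL penalty into the per-token reward instead). Inputs are assumed to be summed token log-probabilities of a response.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Pairwise logistic loss for one preference pair (x, y_w, y_l).

    logp_* are log pi_theta(y | x); ref_logp_* are the same quantities
    under the frozen reference policy. beta=0.1 is an assumed default.
    """
    # Implicit reward margin between the chosen and rejected responses.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


def ppo_clipped_loss(logp_new: float, logp_old: float, advantage: float,
                     ref_logp: float, eps: float = 0.2,
                     kl_coef: float = 0.05) -> float:
    """Clipped surrogate loss plus a KL penalty for one sample.

    Assumption: the KL term is a single-sample estimate
    (logp_new - ref_logp) added to the loss for simplicity.
    """
    ratio = math.exp(logp_new - logp_old)            # pi_theta / pi_old
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)  # clip to [1-eps, 1+eps]
    surrogate = min(ratio * advantage, clipped * advantage)
    kl_estimate = logp_new - ref_logp
    # The objective is maximized, so the loss is its negation.
    return -(surrogate - kl_coef * kl_estimate)
```

Note what each function needs as input: `dpo_loss` only needs log-probabilities from the policy and the reference model on a fixed preference pair, while `ppo_clipped_loss` needs an `advantage`, which in practice comes from a reward model and a value model evaluated on fresh rollouts. That difference is the source of most rows in the table.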

Ouyang, Long, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” arXiv:2203.02155. Preprint, arXiv, March 4. https://doi.org/10.48550/arXiv.2203.02155.

Rafailov, Rafael, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. 2023. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” Advances in Neural Information Processing Systems 36. https://arxiv.org/abs/2305.18290.

Christiano, Paul F., Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. “Deep Reinforcement Learning from Human Preferences.” Advances in Neural Information Processing Systems 30. https://arxiv.org/abs/1706.03741.

Policy Optimization

Setting

Policy

Reward

Value function

Optimization Objective

PPO (Proximal Policy Optimization)

Vanilla Policy Gradient

Trust Region Policy Optimization

Clipped Surrogate Objective

Policy Optimization in LLMs

RLHF Pipeline: SFT $\rightarrow$ Reward Model $\rightarrow$ PPO

Why Is PPO Hard for LLMs?

DPO (Direct Preference Optimization)

DPO vs PPO

One-Line Intuition

Wrap-Up