Blog posts

2025

Why it works: why can GRPO remove the value function? Because of its one-step MDP nature.

3 minute read

Published:

GRPO (Group Relative Policy Optimization) [1] is an efficient reinforcement learning (RL) algorithm developed by DeepSeek to enhance reasoning capabilities in large language models (LLMs). Unlike traditional RL methods such as Proximal Policy Optimization (PPO) [2], GRPO simplifies training by removing the need for a separate “value model”, significantly cutting computational costs while improving output quality. But why can GRPO remove the “value model”? What component in GRPO plays the role of the “value model”? In this blog, we will compare GRPO with traditional RL algorithms (especially PPO) and try to figure out why GRPO works.
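As a quick preview of what takes over the value model's job, here is a minimal sketch of the group-relative advantage described in [1], where $r_i$ is the reward of the $i$-th sampled response in a group of $G$ responses to the same prompt (notation assumed here for illustration):

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}
$$

Each response is scored against its own group, so the group statistics act as the baseline that a learned value model would otherwise provide.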

Why it works: why use KL divergence as a policy constraint? An information theory perspective.

7 minute read

Published:

The Kullback-Leibler (KL) divergence has long been used as a policy constraint in the field of reinforcement learning (RL). For example, in online RL, where an agent interacts with the environment to update its policy, a KL divergence term is adopted to limit how far each policy update can move away from the previous policy. In fact, KL divergence is used so widely in RL that it has become the gold standard. However, it still sounds magical to me: why do we adopt KL divergence as the constraint on policies?
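For concreteness, here is one standard form the constraint takes, the trust-region formulation popularized by TRPO and echoed in PPO-style penalties (a sketch only; $\pi_\theta$ is the new policy, $\pi_{\theta_\text{old}}$ the previous one, $\hat{A}$ an advantage estimate, and $\delta$ a step-size budget):

$$
\max_{\theta} \; \mathbb{E}_{s,\,a \sim \pi_{\theta_\text{old}}}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)} \, \hat{A}(s, a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s}\!\left[ D_{\mathrm{KL}}\!\big( \pi_{\theta_\text{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \big) \right] \le \delta
$$

The expected advantage is maximized while the new policy is kept close, in KL, to the old one; the blog post asks why KL in particular is the right notion of "close".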