Why it works: why can GRPO remove the value function? Because of its one-step MDP nature.


GRPO (Group Relative Policy Optimization) [1] is an efficient reinforcement learning (RL) algorithm developed by DeepSeek to enhance reasoning capabilities in large language models (LLMs). Unlike traditional RL methods such as Proximal Policy Optimization (PPO) [2], GRPO simplifies training by removing the need for a separate “value model”, significantly cutting computational costs while improving output quality. But why can GRPO remove the “value model”? Which component in GRPO plays its role? In this blog, we will compare GRPO with traditional RL algorithms (especially PPO) and try to figure out why GRPO works.

TL;DR

In short, the group-wise reward normalization plays the role of the advantage function in the GRPO algorithm, due to the one-step Markov decision process (MDP) nature of LLM fine-tuning. To illustrate this, we will first recall how the one-step MDP is modeled in LLM post-training, then quickly go through what GRPO is, and finally compare GRPO with the advantage function in PPO.

How is one-step MDP modeled in LLM post-training?

Using an LLM is essentially a “question-answer” problem: when a question is prompted, the LLM seeks an answer for it. Unlike in traditional RL (e.g., game-playing agents), an LLM’s “action” is the entire generated response (a sequence of tokens). The episode terminates immediately after generation, making it a single-step decision problem.

Hence, the one-step MDP for LLM can be formulated as follows:

  • State \(q\) (short for question): The input prompt + previously generated tokens (if any).
  • Action \(o\) (short for output): The full response of the LLM.
  • Reward \(r\): Given only at the end of generation (no intermediate rewards).
  • Transition function: Not needed, since only one step is considered.

In this case, the \(Q\)-function originally defined in RL reduces to \(Q(q,o) = r(q,o)\), and the value function becomes \(V(q) = \mathbb{E}_{o}[Q(q,o)]\) (for the definitions of the \(Q\)-function and \(V\), please refer to [3]).
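To make this one-step view concrete, here is a minimal sketch of the generate-and-score loop; `policy.generate` and `reward_fn` are hypothetical placeholders of my own, not part of any specific library:

```python
# Minimal sketch of the one-step MDP view of LLM post-training.
# `policy`, `reward_fn`, and `questions` are hypothetical placeholders.

def rollout_one_step(policy, reward_fn, questions):
    """One episode per question: state = prompt, action = full response."""
    episodes = []
    for q in questions:
        o = policy.generate(q)      # action: the entire generated response
        r = reward_fn(q, o)         # reward arrives only at the end
        episodes.append((q, o, r))  # episode terminates here, so Q(q, o) = r(q, o)
    return episodes
```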

What is GRPO?

Group Relative Policy Optimization (GRPO) is an efficient online RL algorithm designed for LLMs that:

  • Removes the value model, cutting computational costs.
  • Uses group-wise normalized rewards instead of per-token advantages.
  • Maintains training stability through constrained policy updates (similar to PPO).

This makes GRPO faster and more memory-efficient than PPO while preserving or improving output quality.

The details of GRPO are as follows. For each question \(q\), GRPO samples a group of outputs \(\{o_1,\dots, o_G\}\) from the old policy \(\pi_{\text{old}}\), and then optimizes the policy model by maximizing the following objective:

\[J(\pi) = \mathbb{E}_{q\sim P(Q), \{o_i\}_{i=1}^G\sim \pi(O\vert q)}\left[\frac{1}{G}\sum_{i=1}^GL^{\text{CLIP}}(\pi) - \beta D_{\text{KL}}(\pi\Vert \pi_{\text{ref}})\right]\]

where \(\beta\) is the weight hyperparameter of the Kullback-Leibler (KL) divergence term (for how the KL divergence works, please refer to the previous blog), and \(L^{\text{CLIP}}\) is

\[L^{\text{CLIP}} = \min\left(\bigg[\frac{\pi(o_{i}\vert q)}{\pi_{\text{old}}(o_{i}\vert q)}\bigg]\hat{A}_{i}, \text{clip}\bigg(\frac{\pi(o_{i}\vert q)}{\pi_{\text{old}}(o_{i}\vert q)},1-\epsilon,1+\epsilon\bigg)\hat{A}_{i}\right),\]

and the normalized advantage \(\hat{A}_{i}\) is

\[\hat{A}_{i} = \frac{r_i - \text{mean}(r_1,\dots,r_G)}{\text{std}(r_1,\dots,r_G)}.\]

Please note that the normalization over output length is ignored here, as it is not important for this analysis.
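To make the two formulas above concrete, here is a minimal NumPy sketch of the group-relative advantage \(\hat{A}_i\) and the clipped term \(L^{\text{CLIP}}\). It assumes per-output log-probabilities are already available, and it omits the KL term and the length normalization, as in the analysis above; the variable names are my own, not from the GRPO paper.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: normalize rewards within the group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)   # hat{A}_i

def clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean of the clipped term over the group (KL term omitted)."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # pi / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()

# Example: a group of G = 4 sampled outputs for one question.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = grpo_advantages(rewards)             # approx. [ 1., -1., -1.,  1.]
logp_old = [-12.3, -15.1, -14.8, -11.9]    # log pi_old(o_i | q), made-up numbers
logp_new = [-12.0, -15.4, -14.9, -11.5]    # log pi(o_i | q), made-up numbers
print(clipped_objective(logp_new, logp_old, adv))
```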

Comparison with PPO

The advantage function at time step \(t\) is defined as

\[A(q_t,o_t) = Q(q_t,o_t) - V(q_t).\]

In PPO, it is estimated by Generalized Advantage Estimation (GAE):

\[A(q_t,o_t) = \sum_{k=0}^\infty (\gamma\lambda)^{k}\left(r(q_{t+k},o_{t+k})+ \gamma V(q_{t+k+1}) - V(q_{t+k})\right),\]

where \(\gamma\) is the discount factor and \(\lambda\) is the GAE parameter. In the one-step MDP, the episode terminates after a single step, so \(V(q_{t+1}) = 0\) and only the \(k=0\) term remains; the GAE then reduces to

\[A(q,o) = r(q,o) - V(q) = r(q,o) - \mathbb{E}_{o}\left[r(q,o)\right],\]

which is very similar to the GRPO formulation: GRPO estimates the expectation \(\mathbb{E}_{o}[r(q,o)]\) with the group mean \(\text{mean}(r_1,\dots,r_G)\), and additionally divides by the group standard deviation. Therefore, I guess the division by the standard deviation works as a numerical trick to stabilize training, while the subtraction of the mean (i.e., the value baseline) is the key component.
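As a quick numerical illustration of this claim (with made-up rewards), the one-step advantage \(r(q,o) - V(q)\) and the GRPO advantage differ only by the division by the group standard deviation:

```python
import numpy as np

rewards = np.array([1.0, 0.0, 0.0, 1.0])  # hypothetical group rewards for one question

baseline = rewards.mean()                 # empirical estimate of V(q) = E_o[r(q, o)]
ppo_one_step = rewards - baseline         # A(q, o) = r(q, o) - V(q): [ 0.5, -0.5, -0.5,  0.5]
grpo = ppo_one_step / rewards.std()       # GRPO additionally divides by std: [ 1., -1., -1.,  1.]

# Same sign and same ranking for every output; only the scale differs.
print(ppo_one_step, grpo)
```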

References

  1. [1] Shao, Zhihong, et al. "DeepSeekMath: Pushing the limits of mathematical reasoning in open language models." arXiv preprint arXiv:2402.03300 (2024).
  2. [2] Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
  3. [3] Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. Vol. 1. No. 1. Cambridge: MIT press, 1998.