Sitemap

A list of all the posts and pages found on the site. For you robots out there, an XML version is available for digesting as well.

Pages

Posts

Why it works: why can GRPO remove the value function? Because of its one-step MDP nature.

3 minute read

Published:

GRPO (Group Relative Policy Optimization) [1] is an efficient reinforcement learning (RL) algorithm developed by DeepSeek to enhance reasoning capabilities in large language models (LLMs). Unlike traditional RL methods such as Proximal Policy Optimization (PPO) [2], GRPO simplifies training by removing the need for a separate “value model”, significantly cutting computational costs while improving output quality. But why can GRPO remove the “value model”? What component in GRPO plays the role of the “value model”? In this blog, we will compare GRPO with traditional RL algorithms (especially PPO) and try to figure out why GRPO works.
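The key idea the post discusses can be sketched in a few lines: instead of a learned value model, GRPO samples a group of responses per prompt and normalizes their rewards within the group, so the group statistics serve as the baseline. A minimal sketch (hypothetical helper, not the authors' implementation):

```python
def group_relative_advantages(rewards):
    """Normalize a group of scalar rewards to zero mean and unit std.

    The normalized reward acts as the advantage signal that a separate
    value model would otherwise provide in PPO-style training.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    eps = 1e-8  # avoid division by zero when all rewards in the group are equal
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four sampled responses to the same prompt, scored by a reward model.
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses scoring above the group mean get positive advantages and are reinforced; those below get negative ones, with no value network in the loop.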

Why it works: why use KL divergence as a policy constraint? An information theory perspective.

7 minute read

Published:

The Kullback-Leibler (KL) divergence has long been used as a policy constraint in reinforcement learning (RL). For example, in online RL, where an agent interacts with the environment to update its policy, KL divergence is adopted to limit the size of each update step. In fact, KL divergence is used so widely in RL that it has become the gold standard. However, it sounds magical to me: why do we adopt KL divergence as the constraint on policies?
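For concreteness, the constraint the post examines is the quantity KL(p || q) = Σ_a p(a) log(p(a)/q(a)) between two policies over the same action set. A minimal sketch for discrete policies (illustrative helper, not from the post):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support.

    A small value means the new policy q stays close to the reference
    policy p, which is exactly what the constraint enforces in RL updates.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

old_policy = [0.7, 0.2, 0.1]  # reference policy over three actions
new_policy = [0.6, 0.3, 0.1]  # candidate updated policy
kl = kl_divergence(old_policy, new_policy)
```

Note that KL is asymmetric: KL(p || q) and KL(q || p) generally differ, and which direction is constrained varies between algorithms.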

portfolio

publications

Autonomous Swarm Robot Coordination via Mean-Field Control Embedding Multi-Agent Reinforcement Learning

Published in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023

This paper is about the application of mean-field reinforcement learning in swarm robotics.

Recommended citation: Tang, Huaze, et al. "Autonomous Swarm Robot Coordination via Mean-Field Control Embedding Multi-Agent Reinforcement Learning." 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2023.

M^3ARL: Moment-Embedded Mean-Field Multi-Agent Reinforcement Learning for Continuous Action Space

Published in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024

This paper is about the representation of the mean field in continuous action spaces.

Recommended citation: Tang, Huaze, et al. "M3ARL: Moment-Embedded Mean-Field Multi-Agent Reinforcement Learning for Continuous Action Space." ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024.

Mean-Field Aided QMIX: A Scalable and Flexible Q-Learning Approach for Large-Scale Agent Groups

Published in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025

This paper is about the representation of the mean field for the QMIX algorithm.

Recommended citation: Zhang, Enze, Tang, Huaze, et al. "Mean-Field Aided QMIX: A Scalable and Flexible Q-Learning Approach for Large-Scale Agent Groups." ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025.

Residual Kernel Policy Network: Enhancing Stability and Robustness in RKHS-Based Reinforcement Learning

Published in The Thirteenth International Conference on Learning Representations (ICLR), 2025

This paper is about a residual method with advantage functions to stabilize RKHS-based RL methods.

Recommended citation: Zhang, Y., Tang, H., Lin, H., et al. "Residual Kernel Policy Network: Enhancing Stability and Robustness in RKHS-Based Reinforcement Learning." The Thirteenth International Conference on Learning Representations (ICLR), 2025.

talks

teaching
