January 18, 2025

The Categories of RL Algorithms

Reinforcement learning algorithms come in many varieties, and choosing the right approach is a critical task. To make an informed decision, it is essential to first understand the two major categories: value-based and policy-based methods. This distinction largely determines which problem domains each approach can handle and what characteristics it exhibits.

Value-based methods, such as Q-Learning and Deep Q-Networks (DQN), aim to learn the value function for state-action pairs and use this to select the optimal action. These approaches are particularly effective when dealing with discrete action spaces. On the other hand, policy-based methods, such as REINFORCE, Proximal Policy Optimization (PPO), and Trust Region Policy Optimization (TRPO), directly learn a policy that maximizes future cumulative rewards. These methods are characterized by their ability to output a probability distribution over actions, making them naturally suited for handling continuous action spaces.

Understanding these differences allows for selecting the most appropriate algorithm depending on the specific problem and environment at hand.

How to choose the action

Value-Based

a^* = \arg\max_a Q(s, a)
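
Concretely, a value-based agent selects the greedy action by taking an argmax over its Q-value estimates. A minimal sketch, assuming a small discrete action space and already-learned values (the numbers are illustrative):

```python
import numpy as np

# Hypothetical Q-value estimates for one state over 4 discrete actions.
q_values = np.array([0.2, 1.5, -0.3, 0.7])

# Greedy (value-based) action selection: a* = argmax_a Q(s, a).
best_action = int(np.argmax(q_values))
print(best_action)  # -> 1, the action with the highest estimated value
```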

Policy-Based

\pi(a|s) = P(a|s) \quad \text{(discrete actions)}
\pi(a|s) = \mathcal{N}(\mu(s), \sigma^2(s)) \quad \text{(continuous actions)}
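
For a continuous action space, the policy outputs the parameters of a distribution (here a Gaussian) and the action is sampled from it. A minimal sketch; the mean and standard deviation below stand in for the outputs of a policy network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outputs of a policy network for one state:
mu = np.array([0.5, -1.2])     # mean of the action distribution
sigma = np.array([0.3, 0.1])   # standard deviation (kept positive in practice, e.g. via softplus)

# Policy-based action selection: sample a ~ N(mu(s), sigma^2(s)).
action = rng.normal(mu, sigma)
print(action)
```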

Exploitation and Exploration

I will write this later.

Training

Anyone with experience in reinforcement learning would likely understand that training in this field is highly complex, regardless of the method chosen.

In particular, value-based methods often require more complex learning techniques than policy-based methods for the following reasons:

  1. Maximum value computation (especially in continuous action spaces).
  2. Instability caused by bootstrapping.
  3. The need for additional techniques to address issues such as overestimation of Q-values (e.g., Double DQN, target networks).
  4. The implementation of mechanisms like experience replay and prioritized replay (see the sketch after this list).
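
To make points 3 and 4 concrete, here is a minimal sketch of a replay buffer and a periodically synchronized target network. A plain array stands in for the network weights, and all sizes and names are illustrative; this is not a complete DQN.

```python
import random
from collections import deque

import numpy as np

# Minimal replay buffer: store transitions and sample random minibatches
# so that strongly correlated consecutive experiences are not replayed in order.
replay_buffer = deque(maxlen=10_000)

def store(transition):
    replay_buffer.append(transition)  # transition = (s, a, r, s_next, done)

def sample_batch(batch_size=32):
    return random.sample(replay_buffer, batch_size)

# "Target network": a frozen copy of the Q-function, refreshed every N steps so the
# bootstrapped target r + gamma * max_a' Q_target(s', a') does not chase a moving estimate.
n_states, n_actions = 16, 4
q_online = np.zeros((n_states, n_actions))
q_target = q_online.copy()

def maybe_sync_target(step, sync_every=500):
    global q_target
    if step % sync_every == 0:
        q_target = q_online.copy()
```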

These methods require various implementation strategies, which I believe stem from the need to propagate cumulative rewards back to the initial states.

Q(s, a) = \mathbb{E} \left[ \sum_{t=0}^\infty \gamma^t r_{t+1} \mid s_t = s, a_t = a \right]
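
For intuition, the discounted return inside that expectation can be computed directly for a concrete reward sequence; the values below are illustrative:

```python
# Discounted return for a finite reward sequence (illustrative values).
gamma = 0.9
rewards = [1.0, 0.0, 2.0]

discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)  # 1 + 0.9*0 + 0.81*2 = 2.62
```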

To propagate value estimates toward this expectation, Q-learning applies the following update rule:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
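
This update can be written almost verbatim in code. A minimal tabular Q-learning sketch; the table size, step size, and example transition are illustrative:

```python
import numpy as np

n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def q_learning_update(s, a, r, s_next):
    # TD target bootstraps from the best action in the next state.
    td_target = r + gamma * np.max(Q[s_next])
    # Move Q(s, a) a small step toward the target.
    Q[s, a] += alpha * (td_target - Q[s, a])

# Example transition: from state 0, action 1 yields reward 1.0 and lands in state 2.
q_learning_update(s=0, a=1, r=1.0, s_next=2)
```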

Policy-based learning, on the other hand, directly optimizes the cumulative reward (R below) obtained from the environment and does not require this value-propagation process.

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot R \right]
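
In practice this gradient is estimated from sampled actions and their returns. A minimal REINFORCE-style sketch for a linear softmax policy over discrete actions; all dimensions and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_actions = 4, 3
theta = np.zeros((n_features, n_actions))   # parameters of a linear softmax policy
learning_rate = 0.01

def policy(phi):
    logits = phi @ theta
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()

def reinforce_update(phi, action, ret):
    """One REINFORCE step: theta <- theta + lr * grad log pi(a|s) * R."""
    global theta
    probs = policy(phi)
    onehot = np.eye(n_actions)[action]
    grad_log_pi = np.outer(phi, onehot - probs)  # gradient of log softmax policy w.r.t. theta
    theta += learning_rate * grad_log_pi * ret

# Illustrative sample: a random feature vector, a sampled action, and a return of 1.5.
phi = rng.normal(size=n_features)
action = rng.choice(n_actions, p=policy(phi))
reinforce_update(phi, action, ret=1.5)
```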

Considering these factors, it becomes clear why policy-based methods tend to have simpler learning processes, particularly for tasks involving continuous action spaces. However, in discrete action spaces, value-based methods (e.g., DQN) remain highly effective.

Comparison

| Feature | Value-Based | Policy-Based |
| --- | --- | --- |
| Input | State s | State s |
| Output | Action values Q(s, a) | Probability distribution over actions \pi(a \mid s) |
| Action Selection | Choose the action with the highest value | Sample an action from the distribution |
| Advantages | Balances exploration and exploitation easily | Directly handles continuous action spaces |
| Disadvantages | Difficult to apply to continuous action spaces | May have high variance and unstable learning |