January 18, 2025

The Categories of RL Algorithms

Reinforcement learning algorithms come in many varieties, and choosing the right approach is a critical task. To make an informed decision, it is essential to first understand the two major categories: value-based and policy-based methods. This distinction largely determines which problem domains each approach can handle and what characteristics it exhibits.

Value-based methods, such as Q-Learning and Deep Q-Networks (DQN), aim to learn the value function for state-action pairs and use this to select the optimal action. These approaches are particularly effective when dealing with discrete action spaces. On the other hand, policy-based methods, such as REINFORCE, Proximal Policy Optimization (PPO), and Trust Region Policy Optimization (TRPO), directly learn a policy that maximizes future cumulative rewards. These methods are characterized by their ability to output a probability distribution over actions, making them naturally suited for handling continuous action spaces.

Understanding these differences allows for selecting the most appropriate algorithm depending on the specific problem and environment at hand.

How to choose the action

Value-Based

a^* = \arg\max_a Q(s, a)
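
Concretely, a value-based agent selects the greedy action by taking an argmax over its Q-value estimates. A minimal sketch, assuming a small discrete action space and already-learned values (the numbers are illustrative):

```python
import numpy as np

# Hypothetical Q-value estimates for one state over 4 discrete actions.
q_values = np.array([0.2, 1.5, -0.3, 0.7])

# Greedy (value-based) action selection: a* = argmax_a Q(s, a).
best_action = int(np.argmax(q_values))
print(best_action)  # -> 1, the action with the highest estimated value
```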

Policy-Based

\pi(a|s) = P(a|s) \quad \text{(discrete actions)}
\pi(a|s) = \mathcal{N}(\mu(s), \sigma^2(s)) \quad \text{(continuous actions)}
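
For a continuous action space, the policy outputs the parameters of a distribution (here a Gaussian) and the action is sampled from it. A minimal sketch; the mean and standard deviation below stand in for the outputs of a policy network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outputs of a policy network for one state:
mu = np.array([0.5, -1.2])     # mean of the action distribution
sigma = np.array([0.3, 0.1])   # standard deviation (kept positive in practice, e.g. via softplus)

# Policy-based action selection: sample a ~ N(mu(s), sigma^2(s)).
action = rng.normal(mu, sigma)
print(action)
```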

Exploitation and Exploration

I will write this later.

Training

Anyone with experience in reinforcement learning would likely understand that training in this field is highly complex, regardless of the method chosen.

In particular, value-based methods often require more complex learning techniques than policy-based methods for the following reasons:

  1. Maximum value computation (especially in continuous action spaces).
  2. Instability caused by bootstrapping.
  3. The need for additional techniques to address issues such as overestimation of Q-values (e.g., Double DQN, target networks).
  4. The implementation of mechanisms like experience replay and prioritized replay (see the sketch after this list).
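
To make points 3 and 4 concrete, here is a minimal sketch of a replay buffer and a periodically synchronized target network. A plain array stands in for the network weights, and all sizes and names are illustrative; this is not a complete DQN.

```python
import random
from collections import deque

import numpy as np

# Minimal replay buffer: store transitions and sample random minibatches
# so that strongly correlated consecutive experiences are not replayed in order.
replay_buffer = deque(maxlen=10_000)

def store(transition):
    replay_buffer.append(transition)  # transition = (s, a, r, s_next, done)

def sample_batch(batch_size=32):
    return random.sample(replay_buffer, batch_size)

# "Target network": a frozen copy of the Q-function, refreshed every N steps so the
# bootstrapped target r + gamma * max_a' Q_target(s', a') does not chase a moving estimate.
n_states, n_actions = 16, 4
q_online = np.zeros((n_states, n_actions))
q_target = q_online.copy()

def maybe_sync_target(step, sync_every=500):
    global q_target
    if step % sync_every == 0:
        q_target = q_online.copy()
```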

These methods require various implementation strategies, which I believe stem from the need to propagate cumulative rewards back to the initial states.

Q(s, a) = \mathbb{E} \left[ \sum_{t=0}^\infty \gamma^t r_{t+1} \mid s_t = s, a_t = a \right]
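
For intuition, the discounted return inside that expectation can be computed directly for a concrete reward sequence; the values below are illustrative:

```python
# Discounted return for a finite reward sequence (illustrative values).
gamma = 0.9
rewards = [1.0, 0.0, 2.0]

discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)  # 1 + 0.9*0 + 0.81*2 = 2.62
```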

To propagate value estimates toward this expectation, Q-learning applies the following update rule:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
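
This update can be written almost verbatim in code. A minimal tabular Q-learning sketch; the table size, step size, and example transition are illustrative:

```python
import numpy as np

n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def q_learning_update(s, a, r, s_next):
    # TD target bootstraps from the best action in the next state.
    td_target = r + gamma * np.max(Q[s_next])
    # Move Q(s, a) a small step toward the target.
    Q[s, a] += alpha * (td_target - Q[s, a])

# Example transition: from state 0, action 1 yields reward 1.0 and lands in state 2.
q_learning_update(s=0, a=1, r=1.0, s_next=2)
```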

Policy-based learning, on the other hand, directly optimizes the cumulative reward (R below) obtained from the environment and does not require this value-propagation process.

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot R \right]
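
In practice this gradient is estimated from sampled actions and their returns. A minimal REINFORCE-style sketch for a linear softmax policy over discrete actions; all dimensions and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_actions = 4, 3
theta = np.zeros((n_features, n_actions))   # parameters of a linear softmax policy
learning_rate = 0.01

def policy(phi):
    logits = phi @ theta
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()

def reinforce_update(phi, action, ret):
    """One REINFORCE step: theta <- theta + lr * grad log pi(a|s) * R."""
    global theta
    probs = policy(phi)
    onehot = np.eye(n_actions)[action]
    grad_log_pi = np.outer(phi, onehot - probs)  # gradient of log softmax policy w.r.t. theta
    theta += learning_rate * grad_log_pi * ret

# Illustrative sample: a random feature vector, a sampled action, and a return of 1.5.
phi = rng.normal(size=n_features)
action = rng.choice(n_actions, p=policy(phi))
reinforce_update(phi, action, ret=1.5)
```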

Considering these factors, it becomes clear why policy-based methods tend to have simpler learning processes, particularly for tasks involving continuous action spaces. However, in discrete action spaces, value-based methods (e.g., DQN) remain highly effective.

Comparison

| Feature | Value-Based | Policy-Based |
| --- | --- | --- |
| Input | State s | State s |
| Output | Action values Q(s, a) | Probability distribution over actions \pi(a \mid s) |
| Action Selection | Choose the action with the highest value | Sample an action from the distribution |
| Advantages | Balances exploration and exploitation easily | Directly handles continuous action spaces |
| Disadvantages | Difficult to apply to continuous action spaces | May have high variance and unstable learning |