
About
As I mentioned in the article below, there are many techniques used for Q-learning. To summarize them, I will gradually fill in this article when I have time.
Q-Learning
Discounted Cumulative Expected Reward
The discounted cumulative expected reward is defined as follows, and the goal is to make the agent take actions that maximize it. The discount factor is introduced to prevent the sum from diverging and to weight near-term rewards more heavily than distant ones. The problem, however, is that learning the Q-values that satisfy this definition is not straightforward.
\begin{align} Q(s, a) = \mathbb{E} \left[ \sum_{t=0}^\infty \gamma^t r_{t+1} \mid s_t = s, a_t = a \right], \quad \text{where } 0 < \gamma < 1 \end{align}
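For intuition, here is a minimal Python sketch that computes this discounted sum for a finite sequence of observed rewards (the reward values and gamma below are illustrative assumptions, not from this article):

```python
# Minimal sketch: discounted cumulative reward for a finite reward sequence.
# The reward list and gamma are illustrative values only.
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_{t+1} over the observed rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 0.5, 1.0]))  # 1.0 + 0.99**2 * 0.5 + 0.99**3 * 1.0
```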
Bellman Equation
So, the Bellman equation is introduced to describe how the Q-value should be estimated. It expresses the expected Q-value of a given state and action in terms of the immediate reward and the best Q-value of the next state.
Q(s, a) = \mathbb{E} \left[ R + \gamma \max_{a'} Q(s', a') \mid s, a \right]
Recurrence Formula for Q value update
\begin{align} Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \end{align}
- Q(s, a) : the current Q-value
- \alpha : learning rate, how much of the new information is incorporated
- r + \gamma \max_{a'} Q(s', a') : the target Q-value
- r + \gamma \max_{a'} Q(s', a') - Q(s, a) : the Temporal Difference (TD) error, the information used for the update
This update formula looks like a simple weighted moving average, but things are actually a bit more complicated: we never have the real target Q-value. What Q-learning does is propagate information (the cumulative reward) gradually backward through successive updates across episodes, and expect the accuracy of the Q-values to improve over time.
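As a concrete illustration, here is a minimal sketch of the tabular update above in Python. The state/action counts, hyperparameters, and the epsilon-greedy policy are illustrative assumptions, not something specified in this article:

```python
import numpy as np

# Illustrative sizes and hyperparameters for a small discrete problem.
n_states, n_actions = 16, 4
alpha, gamma, epsilon = 0.1, 0.99, 0.1
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next, done):
    # Target: r + gamma * max_a' Q(s', a'); no bootstrap on terminal states.
    target = r + (0.0 if done else gamma * Q[s_next].max())
    # The TD error scaled by the learning rate is what gets added to Q(s, a).
    Q[s, a] += alpha * (target - Q[s, a])

def epsilon_greedy(s):
    # Explore with probability epsilon, otherwise act greedily w.r.t. Q.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(Q[s].argmax())
```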
Deep Q Learning
In Deep Q-Learning, we cannot directly apply formula (2) above because the Q-values are no longer stored in a table. What we need instead is to optimize a Q-function, represented by a neural network, to approximate the Q-values.
So, the loss function used to optimize the neural network is as follows:
L(\theta) = \mathbb{E}_{(s, a, r, s') \sim D} \left[ \left( t - Q(s, a; \theta) \right)^2 \right]
where
- (s, a, r, s') \sim D : sample data from the experience replay buffer
- Q(s, a; \theta) : the output of the current neural network
- t : the target Q-value, defined below
t = r + \gamma \max_{a'} Q(s', a'; \theta^-)
Note that WE HAVE TO FIX THE PARAMETERS \theta^{-}, because this term is simply the target: if gradients were allowed to flow through it, the target would move with every update and training would become unstable.
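Here is a minimal PyTorch sketch of this loss (the network sizes, batch data, and variable names are illustrative assumptions, not from the article). The target is computed under torch.no_grad(), so no gradient flows into \theta^{-}:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative: 4-dimensional states, 2 actions. `q_net` holds theta,
# `target_net` is a frozen copy holding theta^-.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())  # sync theta^- <- theta

def dqn_loss(s, a, r, s_next, done, gamma=0.99):
    # Q(s, a; theta): pick the value of the action actually taken.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # t = r + gamma * max_a' Q(s', a'; theta^-); no gradient flows through
    # the target network, so theta^- stays fixed during this update.
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values
        t = r + gamma * (1.0 - done) * q_next
    return F.mse_loss(q_sa, t)

# Illustrative batch "sampled from the replay buffer D" (random data here).
s = torch.randn(32, 4); a = torch.randint(0, 2, (32,))
r = torch.randn(32); s_next = torch.randn(32, 4); done = torch.zeros(32)
loss = dqn_loss(s, a, r, s_next, done)
loss.backward()
```

In practice, \theta^{-} is refreshed only occasionally (for example by copying \theta every fixed number of steps), which is what keeps the target stable between updates.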
Techniques
Before getting into the main content... I got tired, haha. I'll leave it for now.