
About
As I mentioned in the article below, there are many techniques used for Q-learning. To summarize them, I will gradually fill in this article when I have time.
Q-Learning
Discounted Cumulative Expected Reward
The discounted cumulative expected reward is defined as follows, and the goal is to make the agent take actions that maximize it. The discount factor is introduced to prevent the sum from diverging and to weight near-term rewards more heavily than distant ones. The problem, however, is that learning the Q-values that satisfy this definition is not straightforward.
\begin{align} Q(s, a) = \mathbb{E} \left[ \sum_{t=0}^\infty \gamma^t r_{t+1} \mid s_t = s, a_t = a \right], \quad \text{where } 0 < \gamma < 1 \end{align}
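For intuition, here is a minimal Python sketch that computes this discounted sum for a finite sequence of observed rewards (the reward values and gamma below are illustrative assumptions, not from this article):

```python
# Minimal sketch: discounted cumulative reward for a finite reward sequence.
# The reward list and gamma are illustrative values only.
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_{t+1} over the observed rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 0.5, 1.0]))  # 1.0 + 0.99**2 * 0.5 + 0.99**3 * 1.0
```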
Bellman Equation
So, the Bellman equation is introduced to describe how the Q-value should be estimated. It expresses the expected Q-value of a given state and action in terms of the immediate reward and the best Q-value of the next state.
Q(s, a) = \mathbb{E} \left[ R + \gamma \max_{a'} Q(s', a') \mid s, a \right]
Recurrence Formula for Q value update
\begin{align} Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \end{align}
- Q(s, a) : the current Q-value
- \alpha : learning rate, how much of the new information is incorporated
- r + \gamma \max_{a'} Q(s', a') : the target Q-value
- r + \gamma \max_{a'} Q(s', a') - Q(s, a) : the Temporal Difference (TD) error, the information used for the update
This update formula looks like a simple weighted moving average, but things are actually a bit more complicated: we never have the real target Q-value. What Q-learning does is propagate information (the cumulative reward) gradually backward through successive updates across episodes, and expect the accuracy of the Q-values to improve over time.
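As a concrete illustration, here is a minimal sketch of the tabular update above in Python. The state/action counts, hyperparameters, and the epsilon-greedy policy are illustrative assumptions, not something specified in this article:

```python
import numpy as np

# Illustrative sizes and hyperparameters for a small discrete problem.
n_states, n_actions = 16, 4
alpha, gamma, epsilon = 0.1, 0.99, 0.1
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next, done):
    # Target: r + gamma * max_a' Q(s', a'); no bootstrap on terminal states.
    target = r + (0.0 if done else gamma * Q[s_next].max())
    # The TD error scaled by the learning rate is what gets added to Q(s, a).
    Q[s, a] += alpha * (target - Q[s, a])

def epsilon_greedy(s):
    # Explore with probability epsilon, otherwise act greedily w.r.t. Q.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(Q[s].argmax())
```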
Deep Q Learning
In Deep Q-Learning, we cannot directly apply formula (2) above because the Q-values are no longer stored in a table. What we need instead is to optimize a Q-function, represented by a neural network, to approximate the Q-values.
So, the loss function used to optimize the neural network is as follows:
L(\theta) = \mathbb{E}_{(s, a, r, s') \sim D} \left[ \left( t - Q(s, a; \theta) \right)^2 \right]
where
- (s, a, r, s') \sim D : sample data from the experience replay buffer
- Q(s, a; \theta) : the output of the current neural network
- t : the target Q-value, defined below
t = r + \gamma \max_{a'} Q(s', a'; \theta^-)
Note that WE HAVE TO FIX THE PARAMETERS \theta^{-}, because this term is simply the target: if gradients were allowed to flow through it, the target would move with every update and training would become unstable.
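Here is a minimal PyTorch sketch of this loss (the network sizes, batch data, and variable names are illustrative assumptions, not from the article). The target is computed under torch.no_grad(), so no gradient flows into \theta^{-}:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative: 4-dimensional states, 2 actions. `q_net` holds theta,
# `target_net` is a frozen copy holding theta^-.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())  # sync theta^- <- theta

def dqn_loss(s, a, r, s_next, done, gamma=0.99):
    # Q(s, a; theta): pick the value of the action actually taken.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # t = r + gamma * max_a' Q(s', a'; theta^-); no gradient flows through
    # the target network, so theta^- stays fixed during this update.
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values
        t = r + gamma * (1.0 - done) * q_next
    return F.mse_loss(q_sa, t)

# Illustrative batch "sampled from the replay buffer D" (random data here).
s = torch.randn(32, 4); a = torch.randint(0, 2, (32,))
r = torch.randn(32); s_next = torch.randn(32, 4); done = torch.zeros(32)
loss = dqn_loss(s, a, r, s_next, done)
loss.backward()
```

In practice, \theta^{-} is refreshed only occasionally (for example by copying \theta every fixed number of steps), which is what keeps the target stable between updates.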
Techniques
Before getting into the main content... I got tired, haha. I'll leave it for now.