October 19, 2024

Asynchronous Advantage Actor-Critic (A3C)

A3C is a reinforcement learning method with the following characteristics:

  • It is a policy-gradient-based method that can handle continuous-valued inputs.
  • Multiple agents learn in parallel in different environments, which decorrelates the training data and speeds up learning (a structural sketch follows this list).
  • As agents gain diverse experiences, the policy becomes more generalized and robust.
  • The inclusion of an entropy term encourages exploratory actions.
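
The parallel-worker structure is the easiest part of this list to miss, so here is a rough structural sketch in PyTorch. It is not a full A3C implementation: plain Python threads stand in for the separate processes a real implementation would use, random tensors stand in for an environment, Adam stands in for the shared RMSProp optimizer used in the paper, and all sizes and hyperparameters are placeholders.

    # A rough sketch of A3C's parallel-worker structure only (not the full algorithm).
    # Threads stand in for processes, and random tensors stand in for an environment.
    import threading
    import torch
    import torch.nn as nn

    STATE_DIM, N_ACTIONS, N_WORKERS, STEPS = 4, 2, 4, 100  # placeholder sizes

    # One shared actor-critic model: a policy head and a value head on a common body.
    class ActorCritic(nn.Module):
        def __init__(self):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU())
            self.policy_head = nn.Linear(32, N_ACTIONS)  # logits of pi(a|s)
            self.value_head = nn.Linear(32, 1)           # V(s)

        def forward(self, s):
            h = self.body(s)
            return self.policy_head(h), self.value_head(h)

    shared_model = ActorCritic()
    optimizer = torch.optim.Adam(shared_model.parameters(), lr=1e-3)

    def worker(worker_id: int):
        # Each worker acts in its own (fake) environment and pushes gradient
        # updates to the shared model asynchronously, Hogwild-style.
        for _ in range(STEPS):
            state = torch.randn(STATE_DIM)             # fake state
            logits, value = shared_model(state)
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            reward = torch.randn(())                   # fake reward
            advantage = (reward - value.squeeze()).detach()
            policy_loss = -dist.log_prob(action) * advantage
            value_loss = (reward - value.squeeze()).pow(2)
            (policy_loss + 0.5 * value_loss).backward()
            optimizer.step()                           # asynchronous shared update
            optimizer.zero_grad()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

A production implementation would use torch.multiprocessing with model.share_memory() and a real environment, but the overall shape is the same: one shared model, many independent actors updating it.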

What kinds of networks and loss functions are used?

There are basically two neural network models: a policy network and a value network.

The policy network decides which action to take based on the state vector, and the value network evaluates the policy network: it lets us know how good it is.
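
As a concrete (and deliberately minimal) sketch, here is how the two networks might look in PyTorch. The state dimension, number of actions, and hidden width are made-up placeholders, not values from this post.

    # A minimal sketch of the two networks, using PyTorch.
    # state_dim, n_actions and the hidden width are placeholder sizes.
    import torch
    import torch.nn as nn

    state_dim, n_actions = 4, 2   # hypothetical environment sizes

    # Policy network: maps a state vector to a probability distribution over actions.
    policy_net = nn.Sequential(
        nn.Linear(state_dim, 64),
        nn.ReLU(),
        nn.Linear(64, n_actions),
        nn.Softmax(dim=-1),
    )

    # Value network: maps a state vector to a single scalar estimate V(s).
    value_net = nn.Sequential(
        nn.Linear(state_dim, 64),
        nn.ReLU(),
        nn.Linear(64, 1),
    )

    state = torch.randn(state_dim)    # a fake state, just for illustration
    action_probs = policy_net(state)  # pi(. | s; theta)
    action = torch.distributions.Categorical(probs=action_probs).sample()
    state_value = value_net(state)    # V(s; theta_v)

In practice the two networks often share their lower layers, with separate policy and value heads on top; the worker sketch after the first list is written that way.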

Policy Function

\pi(a | s; \theta)

Here, π is the policy function, which determines the probability of taking action a in state s. θ represents the parameters of the policy function.

Value Function

V(s; \theta_v)

Here, V is a function that evaluates the value of state s, and θ_v are the parameters of the value function.

Loss Function

L = \log \pi(a | s; \theta) A(s, a; \theta_v) + \beta H(\pi(s; \theta))

Here, A(s, a; θ_v) is the advantage function, representing the difference between the expected return of an action and the value function. H is the entropy of the policy, used to encourage exploration, and β is the weight of the entropy regularization term. Strictly speaking, this expression is an objective to be maximized with respect to θ; implementations usually minimize its negative and train the value function with a separate squared-error loss.
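
As a small sketch of how this term might be computed for a single step in PyTorch (all concrete numbers below are placeholders, and the advantage is treated as a constant with respect to θ):

    # A sketch of the per-step policy objective with the entropy bonus, assuming the
    # action probabilities, the sampled action, and the advantage are already known.
    import torch

    action_probs = torch.tensor([0.7, 0.3], requires_grad=True)  # pi(. | s; theta)
    action = torch.tensor(0)          # the action that was actually taken
    advantage = torch.tensor(0.48)    # A(s, a; theta_v), treated as a constant here
    beta = 0.01                       # weight of the entropy bonus

    dist = torch.distributions.Categorical(probs=action_probs)
    log_prob = dist.log_prob(action)  # log pi(a | s; theta)
    entropy = dist.entropy()          # H(pi(. | s; theta))

    # The expression above is maximized; with a gradient-descent optimizer the
    # negative is minimized instead.
    policy_loss = -(log_prob * advantage + beta * entropy)
    policy_loss.backward()
    print(action_probs.grad)          # gradient w.r.t. the (placeholder) probabilities

Detaching the advantage matters: it only scales the policy gradient, while the value network is trained with its own loss.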

The advantage function A(s, a; θ_v) in reinforcement learning is a measure of how much better (or worse) it is to take a particular action a in a given state s, compared to the average. It is defined as the difference between the value of taking a certain action (the action-value function Q(s, a)) and the expected value of being in that state (the value function V(s)). The formula is as follows:

A(s, a; \theta_v) = Q(s, a; \theta_v) - V(s; \theta_v)

Here, θ_v represents the parameters of the value function V, and it is common to use the same parameters for the action-value function Q as well. This function assesses the relative advantage of an agent taking a specific action.

In practical reinforcement learning algorithms, instead of directly computing the action-value function Q, the advantage is often approximated using rewards and estimated values of the value function. For instance, an approximation using the Temporal Difference (TD) error would be:

A(s, a) \approx r + \gamma V(s'; \theta_v) - V(s; \theta_v)

In this approximation, r is the reward, γ is the discount factor, and s′ is the next state after taking action a. The approximation compares the immediate reward plus the discounted value estimate of the next state against the value of the current state, which measures how much better a particular action is than average.
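
As a tiny worked example with made-up numbers (r = 1, γ = 0.9, V(s′) = 2.0, V(s) = 2.5):

    # A tiny numeric illustration of the TD-error form of the advantage.
    # All numbers are made up.
    r = 1.0        # immediate reward
    gamma = 0.9    # discount factor
    v_next = 2.0   # V(s'; theta_v), value estimate of the next state
    v_curr = 2.5   # V(s; theta_v), value estimate of the current state

    advantage = r + gamma * v_next - v_curr
    print(advantage)  # roughly 0.3: the action looks slightly better than average

A positive advantage increases the probability of that action under the policy-gradient update; a negative one decreases it.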

Acknowledgement

Thanks, ChatGPT