
Policy Optimization Algorithm: the maths behind REINFORCE and its gradient calculation


Hello everyone! I've seen courses that begin by explaining Tabular RL first and then move into Value-based RL, but I think starting from Policy Gradients makes things easier to understand. Why? Because they're (at least theoretically) the simplest of them all: they try to directly optimize the RL objective!

But let's go in order: what is the RL objective?


The objective of RL is optimizing the parameters (θ, Theta) of some function (a Neural Network, for example) so that the expected sum of rewards we collect is as high as possible. It's basically just trying to get the biggest sum of rewards possible!
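
In symbols, with θ* denoting the best parameters we're looking for:

\theta^* = \arg\max_\theta \; \mathbb{E}_{\tau \sim p_\theta(\tau)} \Big[ \sum_{t=1}^{T} r(s_t, a_t) \Big]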


This objective can also be expressed as trying to find the parameters that maximize J(θ), which can be approximated as the sum over N rollouts of all the rewards we get at each timestep, divided by the number of rollouts itself (basically the mean total reward we get in a single playthrough).
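
In formula form, with the N rollouts indexed by i:

J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \Big[ \sum_{t=1}^{T} r(s_t, a_t) \Big] \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} r(s_{i,t}, a_{i,t})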


But there's an element I haven't mentioned yet: what's that τ~pθ(τ)? It's called the Trajectory Distribution and it's defined as follows:




τ is called the Trajectory: (s1, a1, s2, a2, s3, a3, ..., sT, aT)

The Trajectory Distribution seems complex, but it just expresses our policy in probabilistic terms: it's the probability we find ourselves in that initial state, p(s1), multiplied by the product over all time steps of the probability of taking action a(t) in state s(t) according to our policy π(θ), multiplied by the probability of ending up in the following state s(t+1) once we've taken the action a(t) that our policy chose in state s(t).
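
In symbols:

p_\theta(\tau) = p_\theta(s_1, a_1, \dots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)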


So, the probability we start in a certain state times the probability we choose a certain action in that state times the probability we end up in a second state having taken that first action in that first state times the probability we take a second action in that second state times the probability we end up in a certain third state having taken that second action in that second state and so on until the end of time... Well, at least the end of our timesteps :)



Great! Now that the RL objective is clear, Policy Gradient takes a very direct approach to optimizing it: just compute its gradient and use it to climb towards a local maximum! But there's a problem: we can't assume the transition dynamics of the environment are known, so we'll have to compute that gradient without the first and third elements of the Trajectory Distribution (the probability of starting in a certain state s1 and the probabilities of going to a state s(t+1) when we take action a(t) in state s(t)). No worries though, it can be done!



Let's first define the sum of rewards over all timesteps as r(τ), just to make our calculations a bit easier to write. J(θ) is an expectation over the trajectory distribution, so it can be written as the integral of that probability distribution multiplied by r(τ).
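
That is, writing r(\tau) = \sum_{t=1}^{T} r(s_t, a_t):

J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \big[ r(\tau) \big] = \int p_\theta(\tau) \, r(\tau) \, d\tau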


Let's apply the Gradient operator to both sides: we're calculating the derivative to find our local maximum, remember? So that's what we want to compute. The gradient being a Linear Operator, it can be brought inside the integral without changing the result.
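
In symbols, using that linearity:

\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau) \, r(\tau) \, d\tau = \int \nabla_\theta p_\theta(\tau) \, r(\tau) \, d\tau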




This is a convenient identity we'll use in the next step (sort of in reverse, substituting the right side with the left-side logarithm). It's just the gradient of a logarithm being equal to the gradient of its argument divided by the argument itself, with both sides multiplied by the argument to remove that denominator.
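
That is:

p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau) = p_\theta(\tau) \, \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} = \nabla_\theta p_\theta(\tau)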




...and here we go! We've first just used the identity mentioned above and then gone back to an Expectation instead of the integral of a probability distribution. But... does that substitution even help? We still have a gradient to calculate: the gradient of the log of the probabilities, times the sum of rewards...
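
So far we have:

\nabla_\theta J(\theta) = \int p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau) \, r(\tau) \, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)} \big[ \nabla_\theta \log p_\theta(\tau) \, r(\tau) \big]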



...well, remember when we defined the Trajectory Distribution a few lines above? A great property of logarithms is that the logarithm of a product is the sum of the logarithms of the factors. Also, the first and third elements in that sum don't depend on our parameters θ, so their gradients are zero and they drop out of the sum! This way, we can calculate the gradient even when we don't know the Transition Probabilities :D
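
Expanding the logarithm of the Trajectory Distribution and dropping the two terms that don't depend on θ:

\nabla_\theta \log p_\theta(\tau) = \nabla_\theta \Big[ \log p(s_1) + \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t) + \sum_{t=1}^{T} \log p(s_{t+1} \mid s_t, a_t) \Big] = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)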



...and that's basically it, giving us the final result:
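
Here it is, with the transition probabilities nowhere to be seen:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \Big[ \Big( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big) \Big( \sum_{t=1}^{T} r(s_t, a_t) \Big) \Big]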


Now that we have a formula for the gradient of the RL Objective we want to maximize, we can also approximate it using a certain number of rollouts, the same way we've done above for J(θ):
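
Each rollout i gives us its own states s_{i,t} and actions a_{i,t}:

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \Big( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \Big) \Big( \sum_{t=1}^{T} r(s_{i,t}, a_{i,t}) \Big)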


This can be used to compute a gradient update:
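
With α being the learning rate (the step size of each update, a hyperparameter we choose), that's just gradient ascent:

\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)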


and we can finally understand the pseudocode for Vanilla Policy Gradient, also known as the REINFORCE algorithm, a name given by its inventor, Williams, in 1988:
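
The algorithm just repeats three steps: (1) sample trajectories τ_i by running the current policy π_θ in the environment, (2) estimate ∇_θ J(θ) with the sample approximation above, and (3) take a gradient ascent step θ ← θ + α ∇_θ J(θ). Here's a minimal, self-contained sketch of those three steps in Python/NumPy; the tiny corridor environment, the tabular softmax policy and the hyperparameters are just illustrative choices of mine to keep it runnable, not something taken from the lecture:

import numpy as np

# Tiny corridor environment: states 0..4, start in state 0, reward +1 for reaching state 4.
N_STATES, N_ACTIONS, MAX_STEPS = 5, 2, 10   # actions: 0 = step left, 1 = step right


def env_step(state, action):
    """Move along the corridor; the episode ends (with reward 1) when the goal state is reached."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reached_goal = next_state == N_STATES - 1
    return next_state, (1.0 if reached_goal else 0.0), reached_goal


def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()


def sample_trajectory(theta, rng):
    """Step 1: run the policy pi_theta(a|s) = softmax(theta[s]) for one episode."""
    states, actions, rewards = [], [], []
    state = 0
    for _ in range(MAX_STEPS):
        action = rng.choice(N_ACTIONS, p=softmax(theta[state]))
        next_state, reward, done = env_step(state, action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
        if done:
            break
    return states, actions, rewards


def reinforce(n_iterations=200, n_rollouts=20, learning_rate=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, N_ACTIONS))           # tabular policy parameters
    for iteration in range(n_iterations):
        grad = np.zeros_like(theta)
        returns = []
        for _ in range(n_rollouts):
            states, actions, rewards = sample_trajectory(theta, rng)
            r_tau = sum(rewards)                      # total reward of this trajectory, r(tau)
            for s, a in zip(states, actions):
                # grad of log softmax(theta[s])[a] = one_hot(a) - pi_theta(.|s), only on row s
                g = -softmax(theta[s])
                g[a] += 1.0
                grad[s] += g * r_tau                  # step 2: grad-log-prob times r(tau)
            returns.append(r_tau)
        theta += learning_rate * grad / n_rollouts    # step 3: gradient ascent on J(theta)
        if (iteration + 1) % 50 == 0:
            print(f"iteration {iteration + 1}: average return = {np.mean(returns):.2f}")
    return theta


if __name__ == "__main__":
    reinforce()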




What we've covered today (and the images included) can be found in Lecture 5, Part 1 of the Berkeley CS 285 course, available on YouTube.


Thank you for reading and see you soon!


-Fuma
