
A Map of Reinforcement Learning Algorithms


Despite the usefulness of this wonderful subject, accurate maps of RL algorithms are surprisingly scarce on the web. The best ones I've been able to find come from OpenAI's Spinning Up and from Louis Kirsch's site; I'll link them below:

- OpenAI Spinning Up taxonomy (in the "Kinds of RL Algorithms" page): https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html

- Louis Kirsch RL Algorithms Map (under the "Methods" paragraph): http://louiskirsch.com/maps/reinforcement-learning


They were posted, respectively, in 2018 and in January 2019, so they don't include the latest developments in the field. I hope one day to be able to provide a more up-to-date version.

They present some differences (Kirsch's map identifies Imitation Learning as a fundamentally different approach), but they mostly share the same structure, the same one followed by UC Berkeley's lectures, identifying three (plus one) kinds of algorithms:


- Policy Optimization Methods

- Value-based Methods (under Q-learning in the OpenAI map)

- Model-based Methods

- Actor-Critic methods, halfway between Policy Gradients and Value-based Methods (they sit between the two classes in the OpenAI map and are split across the two in Kirsch's map)




All of the algorithms presented above have one goal in common:

θ* = argmax_θ  E_{τ~pθ(τ)} [ Σ_t r( s(t), a(t) ) ]

Now, there's quite a bit of mathematical notation up there, so let's try to understand what it means:


- θ is the notation used to denote the parameters of our function

- θ*: that little star on the top right denotes the optimal parameters of our function

- argmax (with θ written under it) means we're looking for the parameters θ that maximize what comes next; θ* is defined as exactly those maximizing parameters

- E stands for an expectation

- τ stands for a Trajectory: the sequence of states we visit and actions we take, with each action leading to a new state and then to another action

- τ~pθ(τ) means our trajectory is sampled from the probability distribution pθ, which depends on the parameters θ

- Σ is the summation sign: we sum over every timestep t

- r( s(t), a(t) ) indicates the reward obtained by taking action a(t) in state s(t)


...quite a load! We'll look at most of those components more closely in the next blog entry. For now, suffice it to say... it's just trying to maximize the sum of rewards! (in expectation: unfortunately we aren't dealing with certainties, but with what our algorithm expects to get)
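
To make that formula a little more concrete, here is a minimal sketch (my own toy example, not anything taken from the maps or from CS285) of how the objective can be estimated numerically: the policy, the environment and the parameter value below are all made-up stand-ins, and the expectation E is approximated by averaging the summed rewards of many sampled trajectories.

```python
import numpy as np

# A toy Monte Carlo estimate of the RL objective
#   J(theta) = E_{tau ~ p_theta(tau)} [ sum_t r(s(t), a(t)) ]
# Everything here (policy, environment, theta) is a made-up stand-in.

rng = np.random.default_rng(0)

def toy_policy(state, theta):
    """Pick one of two actions with softmax probabilities that depend on theta."""
    logits = theta * np.array([state, -state])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(2, p=probs)

def toy_env_step(state, action):
    """Hypothetical dynamics: reward 1 when the action 'matches' the sign of the state."""
    reward = 1.0 if action == int(state > 0) else 0.0
    next_state = rng.normal()            # next state drawn at random (toy example)
    return next_state, reward

def sample_return(theta, horizon=10):
    """Roll out one trajectory tau and return its total reward  sum_t r(s(t), a(t))."""
    state, total = rng.normal(), 0.0
    for _ in range(horizon):
        action = toy_policy(state, theta)
        state, reward = toy_env_step(state, action)
        total += reward
    return total

# The expectation E over trajectories is approximated by averaging many rollouts.
returns = [sample_return(theta=2.0) for _ in range(1000)]
print("Estimated J(theta):", np.mean(returns))
```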


Despite sharing the same goal, each kind of algorithm approaches the problem differently:


POLICY GRADIENTS, the most basic form of policy optimization, directly optimize the objective above: they take the derivative of the expected sum of rewards with respect to θ and follow it towards an optimum.
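
As a rough illustration of the idea (again a toy sketch of mine, on a 2-armed bandit rather than a full MDP), this is REINFORCE, the simplest policy gradient: we nudge the parameters θ in the direction of the gradient of log π(action), weighted by the reward the sampled action obtained.

```python
import numpy as np

# Toy REINFORCE on a made-up 2-armed bandit: move theta in the direction of
# grad log pi(action), weighted by the reward the sampled action obtained.

rng = np.random.default_rng(0)
true_means = np.array([0.0, 1.0])    # arm 1 pays more on average; the agent must find out
theta = np.zeros(2)                  # policy parameters: one logit per arm
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)
    reward = rng.normal(true_means[action], 1.0)
    # gradient of log softmax(theta)[action] w.r.t. theta is onehot(action) - probs
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta += lr * reward * grad_log_pi          # gradient ascent on the expected reward

print("Learned action probabilities:", softmax(theta))   # should favour arm 1
```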

VALUE-BASED METHODS instead try to estimate the Value function or Q-function of the optimal policy, keeping the policy implicit. The idea is to evaluate how much reward each action is expected to collect from each state, and then always greedily take the best action (a small sketch follows right after the Q&A below).

[Question: is it better to estimate the value of a state, V(s), or the Q-value of a state-action pair, Q(s, a)?

Answer: V(s), since Q(s, a) equals r(s, a) plus the (expected) value of the next state we land in, so fitting V lets us work with a function of a single argument (s) and still recover Q indirectly!]
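
Here is a minimal sketch of the value-based recipe: tabular Q-learning on a tiny chain environment I made up for illustration (for simplicity it learns Q(s, a) directly, as classic Q-learning does). The table of Q-values is the only thing being learned, and the policy is implicit: just take the argmax.

```python
import numpy as np

# Toy tabular Q-learning on a made-up 5-state chain: moving right from the last
# state gives reward 1 and resets to the start, everything else gives 0.
# The policy is implicit: act (epsilon-)greedily with respect to Q.

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2                 # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
gamma, alpha, eps = 0.9, 0.5, 0.1

def step(s, a):
    if s == n_states - 1 and a == 1:
        return 0, 1.0                      # goal reached: reward 1, back to the start
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, 0.0

s = 0
for _ in range(5000):
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next, r = step(s, a)
    # Q-learning target: r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print("Greedy action per state:", np.argmax(Q, axis=1))   # should be all 1s (go right)
```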


ACTOR-CRITIC methods (AC) try to take advantage of both previous approaches, learning both a policy (the Actor) and the value of states (the Critic), and using that knowledge about the value of states to improve the policy (just like an actor who gets better by learning from the critics'... well, criticism).
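
To show that interplay concretely, here is a toy actor-critic sketch (my own example, reusing the made-up chain from the Q-learning sketch above): the Critic fits V(s) with a simple TD update, and the Actor's policy-gradient step is weighted by the advantage r + γ·V(s') - V(s) instead of the raw return.

```python
import numpy as np

# Toy actor-critic on the same made-up chain: the Critic learns V(s) with a TD
# update, the Actor does a policy-gradient step weighted by the advantage
#   A = r + gamma * V(s') - V(s)

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
V = np.zeros(n_states)                        # Critic: one value per state
theta = np.zeros((n_states, n_actions))       # Actor: one logit per (state, action)
gamma, lr_actor, lr_critic = 0.9, 0.1, 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(s, a):                               # same toy chain as in the Q-learning sketch
    if s == n_states - 1 and a == 1:
        return 0, 1.0
    return (min(s + 1, n_states - 1), 0.0) if a == 1 else (max(s - 1, 0), 0.0)

s = 0
for _ in range(20000):
    probs = softmax(theta[s])
    a = rng.choice(n_actions, p=probs)
    s_next, r = step(s, a)
    advantage = r + gamma * V[s_next] - V[s]          # how much better than expected?
    V[s] += lr_critic * advantage                     # Critic: TD(0) update
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta[s] += lr_actor * advantage * grad_log_pi    # Actor: advantage-weighted step
    s = s_next

print("Preferred action per state:", np.argmax(theta, axis=1))   # should learn to go right
```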


MODEL-BASED METHODS take a different approach: they try to learn the dynamics of the world, that is, the transitions from a state s to the next state s'.
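
As a last toy sketch (again my own illustrative example, not a specific published algorithm): a model-based method first collects transitions (s, a, s'), then fits a model of the dynamics, here a simple linear least-squares fit, which can later be used to predict, plan, or simulate rollouts.

```python
import numpy as np

# Toy model-based sketch: collect transitions (s, a, s') from an unknown linear
# system, fit the dynamics by least squares, then use the learned model to predict.

rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
B_true = np.array([[0.0], [1.0]])

def true_dynamics(s, a):
    """The 'real world' the agent does not know: s' = A s + B a + noise."""
    return A_true @ s + B_true @ a + rng.normal(0.0, 0.01, size=2)

# 1) Collect a dataset of transitions under random actions
X, Y = [], []
s = np.zeros(2)
for _ in range(500):
    a = rng.normal(0.0, 1.0, size=1)
    s_next = true_dynamics(s, a)
    X.append(np.concatenate([s, a]))       # input: (state, action)
    Y.append(s_next)                       # target: next state
    s = s_next

# 2) Fit a model of the dynamics  s' ≈ W^T [s; a]  by least squares
W, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)
print("Learned [A | B] (transposed):\n", W.round(2))

# 3) The learned model can now predict next states for planning / simulated rollouts
s_query, a_query = np.ones(2), np.array([0.5])
print("Predicted next state:", (W.T @ np.concatenate([s_query, a_query])).round(3))
```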


I'd say that's enough for one post. I hope I've been able to clear up some of your doubts; if I haven't, feel free to write to me in the chat!

These topics can be found in lecture 4 of UC Berkeley's CS285 class. See you next time and enjoy RL! :)

-Fuma
