Reinforcement learning

Resources

Algorithms

Policy gradients (REINFORCE)

https://csc413-2020.github.io/assets/slides/lec10.pdf

https://www.davidsilver.uk/wp-content/uploads/2020/03/pg.pdf

Repeat forever:
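A common textbook statement of the loop body is given below. The symbols are assumptions rather than notation defined in these notes: a policy $\pi_\theta(a \mid s)$ with parameters $\theta$, a learning rate $\alpha$, and a rollout return $r(\tau)$.

```latex
\begin{aligned}
&\text{1. Sample a rollout } \tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T) \text{ using } \pi_\theta \\
&\text{2. } r(\tau) \leftarrow \sum_{k=0}^{T} r(s_k, a_k) \\
&\text{3. For } t = 0, \ldots, T: \quad
  \theta \leftarrow \theta + \alpha \, r(\tau) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
\end{aligned}
```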

Notes:

  • On-policy (roughly: can only use data generated by the current policy).
  • Works with large and continuous action spaces.
  • Works with stochastic policies.
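As a minimal sketch of the REINFORCE loop, the NumPy code below trains a tabular softmax policy on a made-up two-state task. The environment (a random state at each step, reward +1 when the action index matches the state and -1 otherwise), the function name `run_reinforce`, and all hyperparameters are assumptions chosen for illustration; they do not come from the linked slides.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def run_reinforce(num_episodes=3000, horizon=5, alpha=0.01, seed=0):
    """REINFORCE with a tabular softmax policy on a hypothetical toy task."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = 2, 2
    theta = np.zeros((n_states, n_actions))  # policy logits, one row per state

    for _ in range(num_episodes):
        # 1. Sample a rollout with the current (stochastic) policy.
        total_return = 0.0
        grad = np.zeros_like(theta)
        for _ in range(horizon):
            s = rng.integers(n_states)  # toy dynamics: next state is random
            probs = softmax(theta[s])
            a = rng.choice(n_actions, p=probs)
            total_return += 1.0 if a == s else -1.0  # toy reward
            # Gradient of log softmax w.r.t. theta[s]: onehot(a) - pi(.|s)
            grad[s] -= probs
            grad[s, a] += 1.0

        # 2. Scale the summed log-probability gradients by the rollout return.
        theta += alpha * total_return * grad
    return theta

if __name__ == "__main__":
    theta = run_reinforce()
    for s in range(theta.shape[0]):
        print(s, softmax(theta[s]))
```

Because the whole rollout return multiplies every step's gradient, each action in a good rollout is reinforced, whether or not it contributed to the reward. Each rollout is used for a single update and then discarded, which is what makes the method on-policy.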

Q-learning

https://csc413-2020.github.io/assets/slides/lec11.pdf

https://www.davidsilver.uk/wp-content/uploads/2020/03/FA.pdf

Train a neural network representing your action-value function $Q(s, a)$.

In the case of deep Q-learning, update the parameters of your model.
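The standard forms of the update for a single transition $(s, a, r, s')$ are sketched below. The symbols are assumptions rather than notation defined in these notes: learning rate $\alpha$, discount factor $\gamma$, and network parameters $\theta$.

```latex
% Tabular Q-learning
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

% Deep Q-learning (semi-gradient step on the network parameters)
\theta \leftarrow \theta + \alpha \left[ r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) \right] \nabla_\theta Q_\theta(s, a)
```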

Notes:

  • Off-policy (roughly: can reuse data generated in previous runs under different policies, which makes it more sample-efficient).
  • Does credit assignment (raises the estimated value of a specific good action rather than reinforcing the whole policy).
  • Doesn't always take the best action during training, so it keeps exploring (e.g. an $\epsilon$-greedy policy).
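To make these points concrete, here is a minimal tabular Q-learning sketch with an $\epsilon$-greedy behaviour policy. The corridor environment (`step`), the function names, and the hyperparameters are hypothetical choices for illustration. A deep variant would replace the table `Q` with a network and the table update with a gradient step.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Random action with probability epsilon, else a greedy one (ties broken at random)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    best = np.flatnonzero(q_row == q_row.max())
    return int(rng.choice(best))

def step(state, action, n_states):
    """Hypothetical corridor: action 0 moves left, 1 moves right; reward 1 at the right end."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    done = next_state == n_states - 1
    return next_state, (1.0 if done else 0.0), done

def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, 2))  # one row per state, one column per action
    for _ in range(episodes):
        s = 0
        for _ in range(100):  # cap on episode length
            a = epsilon_greedy(Q[s], epsilon, rng)  # behaviour policy explores
            s_next, r, done = step(s, a, n_states)
            # The target uses the greedy value of the next state, regardless of
            # which action the behaviour policy takes next (off-policy).
            target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
            if done:
                break
    return Q

if __name__ == "__main__":
    print(np.round(q_learning(), 3))
```

Only the entry `Q[s, a]` for the action actually taken is updated, which is the per-action credit assignment described above. Ties are broken at random in `epsilon_greedy` so that the all-zero initial table does not lock the agent into a single action.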