Policy gradients (REINFORCE)
- On-policy (can only learn from data generated by the current policy).
- Works with large and continuous action spaces.
- Works with stochastic policies.
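As a rough illustration of the points above, here is a minimal REINFORCE sketch on a toy two-armed bandit. The bandit, its fixed payoffs, the softmax parameterisation, and the names `softmax`, `reinforce_bandit`, `arm_rewards`, and `lr` are all assumptions made for this example, not part of the original notes. It only aims to show the on-policy loop: sample an action from the current stochastic policy, observe the reward, and nudge the policy parameters along reward times the gradient of the log-probability.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_rewards, steps=2000, lr=0.1, seed=0):
    """REINFORCE on a hypothetical bandit with one fixed payoff per arm."""
    rng = random.Random(seed)
    theta = [0.0] * len(arm_rewards)  # policy parameters (one preference per arm)
    for _ in range(steps):
        probs = softmax(theta)
        # On-policy: the action is sampled from the *current* stochastic policy.
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = arm_rewards[action]
        # For a softmax policy, d/d theta_k of log pi(action) is
        # 1[k == action] - probs[k].
        for k in range(len(theta)):
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * reward * grad_log
    return softmax(theta)


# Toy assumption: arm 1 pays 1.0, arm 0 pays 0.0.
final_probs = reinforce_bandit([0.0, 1.0])
print(final_probs)
```

Note that no baseline is subtracted from the reward here; practical implementations usually add one to reduce the variance of the gradient estimate.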
Train a neural network to represent the function you want to learn.
In the case of deep Q-learning, training updates the parameters of the model that approximates the Q-function.
- Off-policy (can reuse data generated by previous runs under different policies, which makes it more sample-efficient).
- Does credit assignment (increases the estimated value of a specific good action rather than reinforcing the entire policy).
- Don't always take the best action during training (e.g. $