Reinforcement learning

Algorithms

Repeat forever:

Notes:

• On-policy (roughly: can only learn from data generated by the current policy).
• Works with large and continuous action spaces.
• Works with stochastic policies.
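The steps of the loop above are not recoverable here, but the notes match a policy-gradient method. As a hedged illustration, assuming a REINFORCE-style update, the sketch below trains a Gaussian (stochastic) policy over a continuous action on a hypothetical one-step task. Each iteration samples a fresh batch from the current policy, uses it for one gradient step, and then discards it, which is what makes the method on-policy. The reward function, `reinforce_gaussian`, and all hyperparameters are illustrative assumptions, and a single scalar mean `mu` stands in for a neural network.

```python
import random


def reward(action: float) -> float:
    # Hypothetical reward: highest when the continuous action is near 3.0.
    return -(action - 3.0) ** 2


def reinforce_gaussian(iterations=2000, batch_size=32, sigma=0.5, lr=0.01, seed=0):
    """REINFORCE-style sketch for the policy pi(a) = N(mu, sigma^2) with fixed sigma."""
    rng = random.Random(seed)
    mu = 0.0  # the only learned parameter (stands in for a policy network)
    for _ in range(iterations):  # "repeat forever", truncated to a fixed budget
        # 1. Collect fresh data with the *current* policy (on-policy).
        actions = [rng.gauss(mu, sigma) for _ in range(batch_size)]
        rewards = [reward(a) for a in actions]
        # Mean batch reward as a simple variance-reduction baseline.
        baseline = sum(rewards) / batch_size
        # 2. Score-function gradient: d/dmu log pi(a) = (a - mu) / sigma^2.
        grad = sum(
            (r - baseline) * (a - mu) / sigma ** 2
            for a, r in zip(actions, rewards)
        ) / batch_size
        # 3. Gradient ascent on expected reward; this batch is not reused.
        mu += lr * grad
    return mu


print(round(reinforce_gaussian(), 2))
```

Because the policy outputs a distribution rather than an argmax, the same update works unchanged for continuous actions; with a neural network, `mu` (and possibly `sigma`) would be network outputs and the gradient would be obtained by backpropagating through the log-probability.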

Q-learning

Train a neural network representing your $Q(s_t, a_t)$ function.
• Don't always take the best action during training (e.g. $\epsilon$-greedy policy).
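A minimal sketch of the Q-learning update with an $\epsilon$-greedy behaviour policy follows. To keep it self-contained it uses a lookup table in place of the neural network, and it runs on a hypothetical five-state corridor where only reaching the right end gives reward. The environment, `epsilon_greedy`, `q_learning`, and the hyperparameters are assumptions for illustration.

```python
import random

N_STATES = 5       # hypothetical corridor: states 0..4, state 4 is the goal
ACTIONS = (0, 1)   # 0 = move left, 1 = move right


def step(state, action):
    """Hypothetical environment: move along the corridor; reward 1 on reaching the goal."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def epsilon_greedy(q_row, epsilon, rng):
    # With probability epsilon explore; otherwise take the current best action.
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q_row)
    # Break ties randomly so the all-zero initial table does not bias one action.
    return rng.choice([a for a in ACTIONS if q_row[a] == best])


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # table standing in for the Q network
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = epsilon_greedy(q[state], epsilon, rng)
            next_state, reward, done = step(state, action)
            # Target bootstraps from the greedy value of next_state (off-policy).
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q = q_learning()
print([max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)])
```

The exploration rate only affects which transitions are collected; the target always uses the greedy value, so the table still learns the values of the best policy (here, always moving right) even though the agent sometimes acts randomly.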