Multi-armed bandit problems (often shortened to ‘bandit problems’) are a well-studied area of reinforcement learning.

Dipendra et al. introduce the concept of a “contextual bandit” in their first instruction-following paper, where they use it to train a reinforcement learning agent.


  • briefly summarize/remind readers of the definition of a multi-armed bandit
  • introduce the concept of a contextual bandit
    • talk about the limitations (i.e. reward shaping function has to have a “potential function” property to prove convergence in training)
  • briefly discuss some new approaches to immediate rewards - “curiosity” paper
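
As a reminder of the basic multi-armed bandit setting from the first bullet, here is a minimal sketch in Python. It assumes Bernoulli arms (each pull pays 1 with a fixed hidden probability, else 0) and uses an epsilon-greedy strategy, which is a common textbook baseline chosen here for illustration and not the method from the paper discussed above. The names `run_epsilon_greedy` and `arm_probs` and the default parameter values are hypothetical.

```python
import random


def run_epsilon_greedy(arm_probs, steps=5000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit with epsilon-greedy.

    arm_probs: hidden success probability of each arm (assumed for the demo).
    Returns (estimated mean reward per arm, number of pulls per arm).
    """
    rng = random.Random(seed)
    n_arms = len(arm_probs)
    counts = [0] * n_arms        # how often each arm was pulled
    estimates = [0.0] * n_arms   # running mean reward per arm

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest current estimate.
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # The agent only observes the reward of the arm it pulled.
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0

        # Incremental update of the running mean for that arm.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return estimates, counts


estimates, counts = run_epsilon_greedy([0.2, 0.5, 0.8])
print("estimated means:", [round(e, 2) for e in estimates])
print("pull counts:", counts)
```

A contextual bandit differs from this sketch in that the agent also observes a context before each pull and the reward depends on both the context and the chosen arm, so the per-arm running means would be replaced by a policy or value estimate conditioned on the context.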