Hierarchical value functions in deep reinforcement learning

We just released a paper on deep reinforcement learning with hierarchical value functions. One of the major problems in RL is to deal with sparse reward channels. Without observing a non-zero reward, it is hard for any agent to learn a reasonable value function. There is a direct relationship between the amount of exploration and observed rewards. Due to high branching factor in the action space, it can be difficult for the agent to efficiently explore the environment.

This got me thinking about intrinsic notions of reward. My earlier post talks about some of the prior work on intrinsic motivations. There has also been plenty of work on options in the context of reinforcement learning. Our work is inspired by this and many other papers cited in our arxiv paper.

Our model called hierarchical-DQN or h-DQN integrates hierarchical value functions, operating at different temporal scales, with intrinsically motivated deep reinforcement learning. A top-level value function learns a policy over intrinsic goals, and a lower-level function learns a policy over atomic actions to satisfy the given goals. h-DQN allows for flexible goal specifications, such as functions over entities and relations. This provides an efficient space for exploration in complicated environments.

Current limitations and open questions

  • What is the right parametrization / representation over higher level intrinsic goals for a wide range of tasks?
  • How do we parse raw states into more structured representations, so that goals could be discovered in an automatic way, especially symbolic goals? Ideally this should work within an end-to-end learning framework for efficient credit-assignment. For pixel input, it might be a good idea to learn deep learning models that explicitly reason about objects and other structured latents in the data. There has been a push towards this direction in recent literature – a. Ali Eslami et al.’s work on disentangling objects, b. Philip Isola’s work on using self-supervised convnets to disentangle entities from images, c. Katerina Fragkiadaki’s work on segmenting objects from videos.
  • A flexible short-term memory to handle longer range semi-Markovian settings.
  • Generic notions of intrinsic motivation defined over predictive models that learn to predict what happens next.
  • Hierarchical utilities or values with growing depth
  • Better exploration schemes such as the one proposed in the bootstrapped-DQN paper.
  • Imagination based planning
  • Using prioritized experience replays would likely enable faster learning
  • An ability to handle partial-options, where the model could decide to terminate behavior towards the given goal and choose some other goal

Tejas Kulkarni, Cambridge, MA

Updated: