Reference
- Paper: Hester, Todd, et al. "Deep Q-Learning from Demonstrations." Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
- Slides by Yu-Wei Shih
- https://zhuanlan.zhihu.com/p/272249076
Outline
- Reference
- Outline
- Introduction
- Methods
- Simple Summary
- Baseline: PDD DQN
- DQfD: Four Losses
- DQfD: Two Phases
- How does it use demonstrations with self-generated data?
- Algorithm
- DQfD differs from PDD DQN in 6 key ways
- Experimental Setup
- Results
Introduction
- AAAI 2018
- DeepMind
- Challenges of reinforcement learning:
- Requires a large amount of data
- Samples can be expensive!
- No perfect simulator ⇒ might need to learn in the real world ⇒ expensive
- The agent needs to perform well from the very start of learning; how can it begin with a good initial policy?
- Propose a method that utilizes prior demonstrations to accelerate training.
Methods
Simple Summary
- Baseline: PDD DQN
- 4 losses
- 2 phases
Baseline: PDD DQN
- Combination of Prioritized Experience Replay, Dueling Network, and DDQN
- Building upon this structure, DQfD introduces several improvements to utilize demonstration data.
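A minimal NumPy sketch of the 1-step double Q-learning target that this baseline (and DQfD's 1-step loss) is built on; `q_online` and `q_target` are placeholders for the online and target networks, not the paper's code:

```python
import numpy as np

def double_q_target(reward, next_state, done, q_online, q_target, gamma=0.99):
    """1-step double Q-learning target.

    The online network selects the next action; the target network
    evaluates it, which reduces the over-estimation bias of vanilla DQN.
    """
    if done:
        return reward
    best_action = int(np.argmax(q_online(next_state)))          # action selection: online net
    return reward + gamma * q_target(next_state)[best_action]   # evaluation: target net
```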
DQfD: Four Losses
- 1-step double Q-learning loss
- n-step double Q-learning loss
- Helps propagate the values of the expert's trajectory to all the earlier states.
- supervised large margin classification loss
- Purposes
- The supervised loss is critical for the pre-training to have any effect. Since the demonstration data necessarily covers a narrow part of the state space and does not take all possible actions, many state-action pairs have never been taken and have no data to ground them to realistic values.
- If we were to pre-train the network with only Q-learning updates towards the max value of the next state, then the network would update towards the highest of these ungrounded variables. Additionally, the network would propagate these values throughout the Q function. This loss forces the values of the other actions to be at least a margin lower than the value of the demonstrator's action.
- Adding this loss grounds the values of the unseen actions to reasonable values and makes the greedy policy induced by the value function imitate the demonstrator (see the sketch after this list).
- If the algorithm is pre-trained with only this supervised loss:
- There would be nothing constraining the values between consecutive states, and the Q-network would not satisfy the Bellman equation. This equation is required to improve the policy on-line with TD learning.
- L2 regularization loss
- Helps prevent the network from overfitting on the relatively small demonstration dataset.
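A minimal NumPy sketch of the large margin loss for a single demonstration sample, J_E(Q) = max_a [Q(s, a) + l(a_E, a)] − Q(s, a_E); the 0.8 margin follows the paper, but the function name and per-sample form are mine:

```python
import numpy as np

def large_margin_loss(q_values, expert_action, margin=0.8):
    """Large margin classification loss for one demonstration transition.

    l(a_E, a) is 0 for the expert action and `margin` for every other action,
    so the loss is 0 only when the expert action's Q-value exceeds all other
    actions' Q-values by at least the margin.
    """
    margins = np.full(q_values.shape, margin)
    margins[expert_action] = 0.0
    return np.max(q_values + margins) - q_values[expert_action]

# Example: the expert action (index 1) is not yet a clear margin above action 0.
print(large_margin_loss(np.array([1.0, 1.2, -0.5]), expert_action=1))  # 0.6
```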
Overall loss:
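J(Q) = J_DQ(Q) + λ1 · J_n(Q) + λ2 · J_E(Q) + λ3 · J_L2(Q), where J_DQ is the 1-step double Q-learning loss, J_n the n-step loss, J_E the supervised large margin loss, and J_L2 the L2 regularization loss; the λ weights trade off the terms (this is the combined objective as given in the paper).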
The n-step TD loss helps propagate the values of the expert's trajectory to all the earlier states, leading to better pre-training. The 1-step and n-step TD losses ensure that the value function estimates learned during pre-training satisfy the Bellman equation, so they can serve as a starting point for TD learning once the agent starts interacting with the environment.
DQfD: Two Phases
- Phases
- Pretraining phase
- ⇒ Improve the initial performance.
- Interacting with the environment
- All four losses are applied to the demonstration data in both phases, while the supervised loss is not applied to self-generated data (λ2 = 0 for those samples)
How does it use demonstrations with self-generated data?
- After the pre-training phase, the agent starts interacting with the environment.
- New self-generated data is added to the replay buffer.
- When the buffer is full, the oldest self-generated data is overwritten.
- The demonstration data is never overwritten.
- How does it determine the proportion of demonstration data and self-generated data to sample in a batch? 10%? 50%? It uses PER!
- Prioritized Experience Replay (PER)
- Small positive constants ε_a and ε_d are added to the priorities of the agent and demonstration transitions, respectively, to control the relative sampling of demonstration versus self-generated data; a larger ε_d gives the demonstration data a priority bonus.
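A rough sketch of how such priority bonuses could steer the demonstration/self-generated mix (the ε and α values here are illustrative, and the importance-sampling weights of full prioritized replay are omitted):

```python
import numpy as np

def sampling_probabilities(td_errors, is_demo, eps_agent=0.001, eps_demo=1.0, alpha=0.4):
    """Proportional prioritized replay with a bonus for demonstration data.

    Priority p_i = |delta_i| + eps, with eps_demo > eps_agent so that
    demonstration transitions keep being sampled even when their TD error is small.
    """
    eps = np.where(is_demo, eps_demo, eps_agent)
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()

# Example: three self-generated transitions and one demonstration transition.
probs = sampling_probabilities(
    td_errors=np.array([0.5, 0.1, 0.0, 0.0]),
    is_demo=np.array([False, False, False, True]),
)
print(probs)  # the demonstration transition gets a noticeably larger share
```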
Algorithm
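- Roughly, the paper's Algorithm 1 consists of two loops (my paraphrase):
- Pre-training: for a fixed number of steps before any interaction, sample a prioritized mini-batch from the replay buffer (which at this point holds only demonstration data), apply all four losses, take a gradient step on the network parameters θ, and periodically copy θ to the target network θ'.
- Interaction: at each environment step, act ε-greedily with the current Q-network, add the transition to the replay buffer (overwriting only self-generated data), sample a prioritized mini-batch, apply the losses (supervised loss only on demonstration samples), take a gradient step, periodically update the target network, and update the priorities of the sampled transitions.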
DQfD differs from PDD DQN in 6 key ways
- Demonstration data
- Pretraining
- Supervised losses
- L2 regularization losses
- N-step TD losses
- Demonstration priority bonus
Experimental Setup
- Evaluated Algorithms
- NN architecture
- Dueling state-advantage convolutional network architecture
- Parameter tuning
- Parameters were tuned for all the algorithms on six games, and the same parameters were then used for the entire set of games
- ALE environment
- State
- Images: downsampled to 84 × 84
- Converted to grayscale
- Stack 4 frames as a state
- Action
- 18 possible actions for each game
- Actions are repeated for 4 frames
- Reward
- Adapted reward: We found that in many of the games where the human player is better than DQN, it was due to DQN being trained with all rewards clipped to 1. For example, in PrivateEye, DQN has no reason to select actions that reward 25,000 versus actions that reward 10. To make the reward function used by the human demonstrator and the agent more consistent, we used unclipped rewards and converted the rewards using a logarithmic scale: r_agent = sign(r) · log(1 + |r|). This transformation keeps the rewards over a reasonable scale for the neural network to learn while conveying important information about the relative scale of individual rewards. (A small numeric sketch of this transformation follows the comparison table at the end of this section.)
- Others
- Random starting position: each episode is initialized with up to 30 no-op actions
- Discount factor 0.99
- A random selection of 42 Atari games (environments) was used
- Demonstration data
- Had a human player play each game 3 to 12 times
- Each episode: played until termination or for 20 minutes
- Logged: states, actions, rewards, and terminations
- Each game has 5,574 to 75,472 transitions
- The dataset size is “small” compared with other works
- DQN (2015): learns from 200 million frames
- AlphaGo (2016): 30 million human transitions
- Demonstration score
- In some games the human demonstrations score higher than PDD DQN, while in others they score lower.
- Let's look at the results together (see the experimental results table below).
| Algorithm | With demonstration | With online interaction | Notes |
| --- | --- | --- | --- |
| DQfD | Yes | Yes | |
| PDD DQN | No | Yes | Includes the n-step return |
| Supervised imitation | Yes | No | Supervised classification of the demonstrator's actions using a cross-entropy loss, with the same network architecture and L2 regularization as DQfD |
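A small numeric sketch of the log reward transformation r_agent = sign(r) · log(1 + |r|) from the reward preprocessing above (the example values are mine):

```python
import math

def transform_reward(r):
    """DQfD reward preprocessing: sign(r) * log(1 + |r|)."""
    return math.copysign(math.log1p(abs(r)), r)

# A PrivateEye-sized reward is compressed but still clearly larger than a
# small one, unlike with clipping to [-1, 1].
print(transform_reward(10))     # ~2.40
print(transform_reward(25000))  # ~10.13
```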
Results
- Outperform
- Scores for the 11 games where DQfD achieves higher scores than any previously published deep RL result using random no-op starts. Previous results take the best agent at its best iteration and evaluate it for 100 episodes. DQfD scores are the best 3 million step window averaged over four seeds, which is an average of 508 episodes.
- Demonstration scores vs DQfD vs PDD DQN
- The best and worst human demonstration scores, the number of trials and transitions collected, and the average score for each algorithm over the best three million-step window, averaged over four seeds.
- It is reasonable that in some games, PDD DQN has a similar score to DQfD since the score is the mean score over the best 3 million steps. (The best scores may be similar, but the time taken may not necessarily be the same).
- In comparison, pure imitation learning performs worse than the demonstrator's performance in every game.
- I observed that pure imitation scores poorly even though the human demonstrations it learns from are not bad.
- DQfD outperforms the worst demonstration episode in 29 out of 42 games.
- It outperforms the best demonstration episode in only 14 of the games.
- Learning curves
- Results on the 42 Atari games, averaged over 4 trials.
- Observations (the authors' and my own):
- DQfD provides an early-stage boost in some games.
- E.g., Road Runner is the game with the smallest set of human demonstrations, consisting of only 5,574 transitions. Despite this, DQfD still achieves a higher score than PDD DQN for the first 36 million steps and matches PDD DQN's performance after that.
- In certain games, PDD DQN performs better, while in other games, DQfD is superior.
- In some hard exploration games, none of the three algorithms outperforms the human demonstrations
- E.g., Montezuma's Revenge
- Initial performance (In real-world tasks, the agent must perform well from its very first action and must learn quickly.)
- DQfD performed better than PDD DQN on the first million steps in 41 of 42 games.
- On 31 games, DQfD starts out with higher performance than pure imitation learning, as the addition of the TD loss helps the agent generalize the demonstration data better.
- On average, PDD DQN does not surpass the performance of DQfD until 83 million steps into the task
- Ablation: loss
- Comparisons of DQfD with λ1 and λ2 set to 0 on two games where DQfD achieved state-of-the-art results: Montezuma's Revenge and Q-Bert.
- Vs other works
- Compare DQfD with three related algorithms for leveraging demonstration data in DQN on two games where DQfD achieved state-of-the-art results: Montezuma's Revenge and Q-Bert.
- Replay Buffer Spiking (RBS) (Lipton et al., 2016)
- RBS is simply PDD DQN with the replay buffer initially full of demonstration data.
- Human Experience Replay (HER) (Hosu and Rebedea, 2016)
- HER keeps the demonstration data and mixes demonstration and agent data in each mini-batch.
- Accelerated DQN with Expert Trajectories (ADET) (Lakshminarayanan, Ozair, and Bengio, 2016)
- ADET is essentially DQfD with the large margin supervised loss replaced with a cross-entropy loss.
- Note: the previous bullet mentions a cross-entropy loss; how exactly is that combined with a Q-network?
- An explanation of the ADET cross-entropy loss (from ChatGPT) follows below:
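One plausible reading (my understanding, not verified against the ADET paper): treat the Q-values for a state as logits, take a softmax over actions, and apply cross-entropy against the demonstrator's action, e.g.:

```python
import numpy as np

def cross_entropy_on_q(q_values, expert_action):
    """Cross-entropy loss with the Q-values treated as logits (sketch)."""
    logits = q_values - np.max(q_values)                 # stabilize the softmax
    log_probs = logits - np.log(np.sum(np.exp(logits)))  # log softmax over actions
    return -log_probs[expert_action]

print(cross_entropy_on_q(np.array([1.0, 2.0, 0.5]), expert_action=1))  # ~0.46
```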
- Demonstration Data Ratio in a Mini-batch
- The ratio of how often the demonstration data was actually sampled versus how often it would be sampled under uniform sampling.
- For the most difficult games like Pitfall and Montezuma's Revenge, the demonstration data is sampled more frequently over time. For most other games, the ratio converges to a near-constant level, which differs for each game.
The Deep Q-Learning from Demonstrations (DQfD) algorithm uses demonstration data to accelerate training in deep reinforcement learning. It combines four losses so that the value function learned during pre-training both satisfies the Bellman equation and imitates the demonstrator's actions. With random no-op starts, DQfD achieves higher scores than any previously published deep RL result on 11 of the 42 Atari games, and it outperforms PDD DQN during the first million steps in 41 of 42 games. The initial performance, learning curves, and ablations show the potential of using prior demonstrations to improve both the early and the final performance of deep RL agents.