Reference
- Paper: Hester, Todd, et al. "Deep Q-Learning from Demonstrations." Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
- Slides by Yu-Wei Shih
- https://zhuanlan.zhihu.com/p/272249076
Outline
- Reference
- Outline
- Introduction
- Methods
- Simple Summary
- Baseline: PDD DQN
- DQfD: Four Losses
- DQfD: Two Phases
- How does it use demonstrations with self-generated data?
- Algorithm
- DQfD differs from PDD DQN in 6 key ways
- Experimental Setup
- Results
Introduction
- AAAI 2018
- DeepMind
- Challenges of reinforcement learning:
- Requires a large amount of data
- Samples can be expensive!
- No perfect simulator ⇒ might need to learn in the real world ⇒ expensive
- The agent needs to perform well from the very start of learning; how can it begin with a good initial policy?
- Propose a method that utilizes prior demonstrations to accelerate training.
Methods
Simple Summary
- Baseline: PDD DQN
- 4 losses
- 2 phases
Baseline: PDD DQN
- Combination of Prioritized Experience Replay, Dueling Network, and DDQN
- Building upon this structure, DQfD introduces several improvements to utilize demonstration data.
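A minimal NumPy sketch of the 1-step double Q-learning target that this baseline (and DQfD's 1-step loss) is built on; `q_online` and `q_target` are placeholders for the online and target networks, not the paper's code:

```python
import numpy as np

def double_q_target(reward, next_state, done, q_online, q_target, gamma=0.99):
    """1-step double Q-learning target.

    The online network selects the next action; the target network
    evaluates it, which reduces the over-estimation bias of vanilla DQN.
    """
    if done:
        return reward
    best_action = int(np.argmax(q_online(next_state)))          # action selection: online net
    return reward + gamma * q_target(next_state)[best_action]   # evaluation: target net
```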
DQfD: Four Losses
- 1-step double Q-learning loss
- n-step double Q-learning loss
- Helps propagate the values of the expert's trajectory to all the earlier states.
- supervised large margin classification loss
- Purposes
- The supervised loss is critical for the pre-training to have any effect. Since the demonstration data necessarily covers a narrow part of the state space and does not take all possible actions, many state-action pairs have never been taken and have no data to ground them to realistic values.
- If we were to pre-train the network with only Q-learning updates towards the max value of the next state, then the network would update towards the highest of these ungrounded variables. Additionally, the network would propagate these values throughout the Q function. This loss forces the values of the other actions to be at least a margin lower than the value of the demonstrator's action.
- Adding this loss grounds the values of the unseen actions to reasonable values and makes the greedy policy induced by the value function imitate the demonstrator (see the sketch after this list).
- If the algorithm is pre-trained with only this supervised loss:
- There would be nothing constraining the values between consecutive states, and the Q-network would not satisfy the Bellman equation. This equation is required to improve the policy on-line with TD learning.
- L2 regularization loss
- Helps prevent the network from overfitting on the relatively small demonstration dataset.
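A minimal NumPy sketch of the large margin loss for a single demonstration sample, J_E(Q) = max_a [Q(s, a) + l(a_E, a)] − Q(s, a_E); the 0.8 margin follows the paper, but the function name and per-sample form are mine:

```python
import numpy as np

def large_margin_loss(q_values, expert_action, margin=0.8):
    """Large margin classification loss for one demonstration transition.

    l(a_E, a) is 0 for the expert action and `margin` for every other action,
    so the loss is 0 only when the expert action's Q-value exceeds all other
    actions' Q-values by at least the margin.
    """
    margins = np.full(q_values.shape, margin)
    margins[expert_action] = 0.0
    return np.max(q_values + margins) - q_values[expert_action]

# Example: the expert action (index 1) is not yet a clear margin above action 0.
print(large_margin_loss(np.array([1.0, 1.2, -0.5]), expert_action=1))  # 0.6
```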
Overall loss:
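J(Q) = J_DQ(Q) + λ1 · J_n(Q) + λ2 · J_E(Q) + λ3 · J_L2(Q), where J_DQ is the 1-step double Q-learning loss, J_n the n-step loss, J_E the supervised large margin loss, and J_L2 the L2 regularization loss; the λ weights trade off the terms (this is the combined objective as given in the paper).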
The n-step TD loss helps propagate the values of the expert's trajectory to all the earlier states, leading to better pre-training. The 1-step and n-step TD losses ensure that the value function estimates learned during pre-training satisfy the Bellman equation, so they can serve as a starting point for TD learning once the agent starts interacting with the environment.
DQfD: Two Phases
- Phases
- Pretraining phase
- ⇒ Improve the initial performance.
- Interacting with the environment
- All four losses are applied to the demonstration data in both phases, while the supervised loss is not applied to self-generated data (λ2 = 0 for those samples)
How does it use demonstrations with self-generated data?
- After the pre-training phase, the agent starts interacting with the environment.
- New self-generated data is added to the replay buffer.
- When the buffer is full, the oldest self-generated data is overwritten.
- The demonstration data is never overwritten.
- How does it determine the proportion of demonstration data and self-generated data to sample in a batch? 10%? 50%? It uses PER!
- Prioritized Experience Replay (PER)
- Small positive constants ε_a and ε_d are added to the priorities of the agent and demonstration transitions, respectively, to control the relative sampling of demonstration versus self-generated data; a larger ε_d gives the demonstration data a priority bonus.
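A rough sketch of how such priority bonuses could steer the demonstration/self-generated mix (the ε and α values here are illustrative, and the importance-sampling weights of full prioritized replay are omitted):

```python
import numpy as np

def sampling_probabilities(td_errors, is_demo, eps_agent=0.001, eps_demo=1.0, alpha=0.4):
    """Proportional prioritized replay with a bonus for demonstration data.

    Priority p_i = |delta_i| + eps, with eps_demo > eps_agent so that
    demonstration transitions keep being sampled even when their TD error is small.
    """
    eps = np.where(is_demo, eps_demo, eps_agent)
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()

# Example: three self-generated transitions and one demonstration transition.
probs = sampling_probabilities(
    td_errors=np.array([0.5, 0.1, 0.0, 0.0]),
    is_demo=np.array([False, False, False, True]),
)
print(probs)  # the demonstration transition gets a noticeably larger share
```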
Algorithm
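- Roughly, the paper's Algorithm 1 consists of two loops (my paraphrase):
- Pre-training: for a fixed number of steps before any interaction, sample a prioritized mini-batch from the replay buffer (which at this point holds only demonstration data), apply all four losses, take a gradient step on the network parameters θ, and periodically copy θ to the target network θ'.
- Interaction: at each environment step, act ε-greedily with the current Q-network, add the transition to the replay buffer (overwriting only self-generated data), sample a prioritized mini-batch, apply the losses (supervised loss only on demonstration samples), take a gradient step, periodically update the target network, and update the priorities of the sampled transitions.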
DQfD differs from PDD DQN in 6 key ways
- Demonstration data
- Pretraining
- Supervised losses
- L2 regularization losses
- N-step TD losses
- Demonstration priority bonus
Experimental Setup
- Evaluated Algorithms
- NN architecture
- Dueling state-advantage convolutional network architecture
- Parameter tuning
- Parameters were tuned for all the algorithms on six games, and the same parameters were then used for the entire set of games
- ALE environment
- State
- Images: downsampled to 84 × 84
- Converted to grayscale
- Stack 4 frames as a state
- Action
- 18 possible actions for each game
- Actions are repeated for 4 frames
- Reward
- Adapted reward: We found that in many of the games where the human player is better than DQN, it was due to DQN being trained with all rewards clipped to 1. For example, in PrivateEye, DQN has no reason to select actions that reward 25,000 versus actions that reward 10. To make the reward function used by the human demonstrator and the agent more consistent, we used unclipped rewards and converted the rewards using a logarithmic scale: r_agent = sign(r) · log(1 + |r|). This transformation keeps the rewards over a reasonable scale for the neural network to learn while conveying important information about the relative scale of individual rewards. (A small numeric sketch of this transformation follows the comparison table at the end of this section.)
- Others
- Random starting position: each episode is initialized with up to 30 no-op actions
- Discount factor 0.99
- A random selection of 42 Atari games (environments) was used
- Demonstration data
- Had a human player play each game 3 to 12 times
- Each episode: played until termination or for 20 minutes
- Logged: states, actions, rewards, and terminations
- Each game has 5,574 to 75,472 transitions
- The dataset size is “small” compared with other works
- DQN (2015): learns from 200 million frames
- AlphaGo (2016): 30 million human transitions
- Demonstration score
- In some games the human demonstrations score higher than PDD DQN, while in others they score lower.
- Let's look at the results together (see the experimental results table below).
| Algorithm | With demonstration | With online interaction | Notes |
| --- | --- | --- | --- |
| DQfD | Yes | Yes | |
| PDD DQN | No | Yes | Includes the n-step return |
| Supervised imitation | Yes | No | Supervised classification of the demonstrator's actions using a cross-entropy loss, with the same network architecture and L2 regularization as DQfD |
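A small numeric sketch of the log reward transformation r_agent = sign(r) · log(1 + |r|) from the reward preprocessing above (the example values are mine):

```python
import math

def transform_reward(r):
    """DQfD reward preprocessing: sign(r) * log(1 + |r|)."""
    return math.copysign(math.log1p(abs(r)), r)

# A PrivateEye-sized reward is compressed but still clearly larger than a
# small one, unlike with clipping to [-1, 1].
print(transform_reward(10))     # ~2.40
print(transform_reward(25000))  # ~10.13
```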
Results
- Outperform
- Scores for the 11 games where DQfD achieves higher scores than any previously published deep RL result using random no-op starts. Previous results take the best agent at its best iteration and evaluate it for 100 episodes. DQfD scores are the best 3 million step window averaged over four seeds, which is an average of 508 episodes.
- Demonstration scores vs DQfD vs PDD DQN
- The best and worst human demonstration scores, the number of trials and transitions collected, and the average score for each algorithm over the best three million-step window, averaged over four seeds.
- It is reasonable that in some games, PDD DQN has a similar score to DQfD since the score is the mean score over the best 3 million steps. (The best scores may be similar, but the time taken may not necessarily be the same).
- In comparison, pure imitation learning performs worse than the demonstrator's performance in every game.
- I observed that pure imitation scores poorly even though the human demonstrations it learns from are not bad.
- DQfD outperforms the worst demonstration episode in 29 out of 42 games.
- It outperforms the best demonstration episode in only 14 of the games.
- Learning curves
- Results on the 42 Atari games, averaged over 4 trials.
- Observations (the authors' and my own):
- DQfD provides an early-stage boost in some games.
- E.g., Road Runner is the game with the smallest set of human demonstrations, consisting of only 5,574 transitions. Despite this, DQfD still achieves a higher score than PDD DQN for the first 36 million steps and matches PDD DQN's performance after that.
- In certain games, PDD DQN performs better, while in other games, DQfD is superior.
- In some hard exploration games, none of the three algorithms outperforms the human demonstrations
- E.g., Montezuma's Revenge
- Initial performance (In real-world tasks, the agent must perform well from its very first action and must learn quickly.)
- DQfD performed better than PDD DQN on the first million steps in 41 of 42 games.
- On 31 games, DQfD starts out with higher performance than pure imitation learning, as the addition of the TD loss helps the agent generalize the demonstration data better.
- On average, PDD DQN does not surpass the performance of DQfD until 83 million steps into the task
- Ablation: loss
- Comparisons of DQfD with λ1 and λ2 set to 0 on two games where DQfD achieved state-of-the-art results: Montezuma's Revenge and Q-Bert.
- Vs other works
- Compare DQfD with three related algorithms for leveraging demonstration data in DQN on two games where DQfD achieved state-of-the-art results: Montezuma's Revenge and Q-Bert.
- Replay Buffer Spiking (RBS) (Lipton et al., 2016)
- RBS is simply PDD DQN with the replay buffer initially full of demonstration data.
- Human Experience Replay (HER) (Hosu and Rebedea, 2016)
- HER keeps the demonstration data and mixes demonstration and agent data in each mini-batch.
- Accelerated DQN with Expert Trajectories (ADET) (Lakshminarayanan, Ozair, and Bengio, 2016)
- ADET is essentially DQfD with the large margin supervised loss replaced with a cross-entropy loss.
- Note: the previous bullet mentions a cross-entropy loss; how exactly is that combined with a Q-network?
- An explanation of the ADET cross-entropy loss (from ChatGPT) follows below:
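One plausible reading (my understanding, not verified against the ADET paper): treat the Q-values for a state as logits, take a softmax over actions, and apply cross-entropy against the demonstrator's action, e.g.:

```python
import numpy as np

def cross_entropy_on_q(q_values, expert_action):
    """Cross-entropy loss with the Q-values treated as logits (sketch)."""
    logits = q_values - np.max(q_values)                 # stabilize the softmax
    log_probs = logits - np.log(np.sum(np.exp(logits)))  # log softmax over actions
    return -log_probs[expert_action]

print(cross_entropy_on_q(np.array([1.0, 2.0, 0.5]), expert_action=1))  # ~0.46
```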
- Demonstration Data Ratio in a Mini-batch
- The ratio of how often the demonstration data was actually sampled versus how often it would be sampled under uniform sampling.
- For the most difficult games like Pitfall and Montezuma's Revenge, the demonstration data is sampled more frequently over time. For most other games, the ratio converges to a near-constant level, which differs for each game.
The Deep Q-Learning from Demonstrations (DQfD) algorithm uses demonstration data to accelerate training in deep reinforcement learning. It combines four losses so that the value function learned during pre-training both satisfies the Bellman equation and imitates the demonstrator's actions. With random no-op starts, DQfD achieves higher scores than any previously published deep RL result on 11 of the 42 Atari games, and it outperforms PDD DQN during the first million steps in 41 of 42 games. The initial performance, learning curves, and ablations show the potential of using prior demonstrations to improve both the early and the final performance of deep RL agents.