Tags
Prior, Distillation, Multi-Task
Date
November 15, 2023
Schmitt, Simon, et al. "Kickstarting deep reinforcement learning." arXiv preprint arXiv:1803.03835 (2018). (DeepMind)
- Introduction
- Methods
- Knowledge Transfer
- Population-Based Training (PBT)
- Experiment Setup
- Experiment Results
- Kickstarting With a Single Teacher
- Kickstarting With a Single Teacher (vs Other Distillation Weighting Approaches)
- Kickstarting With Multiple Teachers
- Special Findings
Introduction
Use previously-trained teacher agents to kickstart the training of a new student agent!
- Leverage ideas from:
- Policy distillation
- Population-based training (PBT)
- No constraints on the architecture of the teacher or student agents
- Matching the performance of an agent trained from scratch in fewer steps
- Faster training
- Allow the students to surpass their teachers in performance
- Different from pure imitation learning
- Automatically adjusts the influence of the teachers on the student agents
- Multi-task knowledge transferring
Methods
Main points: knowledge transfer and population-based training
Knowledge Transfer
- Use the idea of “policy distillation”
- Distillation loss: $l_{\text{distill}}(\omega, x_t) = H\big(\pi_T(a \mid x_t) \,\|\, \pi_S(a \mid x_t, \omega)\big)$
- $H(\cdot \,\|\, \cdot)$: cross-entropy
- $\pi_T$: teacher policy
- $\pi_S$: student policy (parameterized by $\omega$)
- Trajectories are generated by the teacher (containing a sequence of observations $x_t$)
- The teacher's role is to assist the student. The total loss combines the distillation term with the Reinforcement Learning (RL) loss: $l_{\text{kick}}^{k} = l_{\text{RL}}(\omega, x_t, a_t, r_t) + \lambda_k H\big(\pi_T(a \mid x_t) \,\|\, \pi_S(a \mid x_t, \omega)\big)$
- $\lambda_k \ge 0$: weight of the distillation term at optimisation step $k$
- The knowledge transfer process changes over time (the weight $\lambda_k$ is adjusted during training)
- Trajectories are generated by the student (containing a sequence of observations, actions, and rewards $(x_t, a_t, r_t)$)
- The student is solely responsible for generating trajectories, and can explore parts of the state space that the teacher does not visit
- Combine with A3C
- A3C loss: $l_{\text{A3C}}(\omega, x_t, a_t, \hat{r}_t) = -(\hat{r}_t - V(x_t)) \log \pi_S(a_t \mid x_t, \omega) - \beta H\big(\pi_S(a \mid x_t, \omega)\big)$, where $\hat{r}_t$ is the return estimate, $V$ the value function, and $\beta$ the entropy cost (the value-function loss is omitted here)
- A3C kickstarting loss: $l_{\text{kick}}^{k} = l_{\text{A3C}}(\omega, x_t, a_t, \hat{r}_t) + \lambda_k H\big(\pi_T(a \mid x_t) \,\|\, \pi_S(a \mid x_t, \omega)\big)$ (a code sketch follows at the end of this section)
- Combine with IMPALA
- IMPALA: a large-scale setting with distributed actor (worker) machines, so the student learns from off-policy data
- Employ a correction based on importance sampling (V-trace); a rough sketch follows at the end of this section
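Below is a minimal sketch of the kickstarting loss described above in the A3C-style case: the usual policy-gradient-plus-entropy actor loss with the teacher cross-entropy term added on top. It assumes discrete actions and per-step logits; all function and argument names (e.g. `teacher_logits`, `advantages`) are illustrative rather than taken from the paper's code, and the value-function loss is omitted.

```python
import torch
import torch.nn.functional as F

def a3c_kickstarting_loss(student_logits, teacher_logits, actions,
                          advantages, lambda_k, entropy_cost=0.01):
    """l_kick = l_A3C + lambda_k * H(pi_T || pi_S), on the student's trajectory.

    student_logits: [T, A] student policy logits on the student's observations
    teacher_logits: [T, A] teacher policy logits on the same observations
    actions:        [T]    actions actually taken by the student
    advantages:     [T]    advantage estimates (e.g. n-step return minus value)
    lambda_k:       distillation weight at this point in training
    """
    log_probs = F.log_softmax(student_logits, dim=-1)              # log pi_S(. | x_t)
    probs = log_probs.exp()
    taken_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    policy_loss = -(advantages.detach() * taken_log_probs).mean()  # policy-gradient term
    entropy = -(probs * log_probs).sum(dim=-1).mean()              # H(pi_S)
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    distill = -(teacher_probs * log_probs).sum(dim=-1).mean()      # H(pi_T || pi_S)

    return policy_loss - entropy_cost * entropy + lambda_k * distill
```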
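In the IMPALA setting the actors' data is slightly stale, so the policy-gradient term is corrected with truncated importance weights (V-trace). The sketch below shows only that correction and how the distillation term is simply added on the student's own trajectories; the V-trace value targets are assumed to be computed elsewhere, and all names are illustrative rather than the paper's.

```python
import torch
import torch.nn.functional as F

def impala_kickstarting_policy_loss(student_logits, behaviour_log_probs, actions,
                                    vtrace_advantages, teacher_logits, lambda_k,
                                    rho_clip=1.0):
    """Off-policy-corrected policy-gradient term plus the kickstarting term."""
    log_probs = F.log_softmax(student_logits, dim=-1)
    taken_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    # Truncated importance weights rho_t = min(rho_clip, pi_S(a_t|x_t) / mu(a_t|x_t)).
    rho = torch.clamp((taken_log_probs - behaviour_log_probs).exp(),
                      max=rho_clip).detach()
    policy_loss = -(rho * vtrace_advantages.detach() * taken_log_probs).mean()

    # Distillation term, evaluated directly on the actor-generated trajectory
    # (no importance correction is applied to it in this sketch).
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    distill = -(teacher_probs * log_probs).sum(dim=-1).mean()

    return policy_loss + lambda_k * distill
```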
Population-Based Training (PBT)
- How to tune hyper-parameters such as learning rates?
- Hand-crafted: may need expert knowledge
- Sequential optimization: may need lots of iterations
- Parallel search: requires the use of more computational resources to train many models in parallel
- PBT: trains a population of agents in parallel to jointly optimize network weights and hyperparameters that affect learning dynamics (such as the learning rate or entropy cost)
- This paper’s instantiation of PBT:
- It allows the schedule for $\lambda_k$ to be adjusted automatically and separately for each teacher in a multi-teacher scenario, while simultaneously adjusting the learning rate and the entropy-regularisation strength.
Each agent periodically selects another member of the population at random and checks whether its performance is significantly better than its own. If this is true, weights and hyperparameters of the better agent are adopted.
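A highly simplified sketch of the exploit/explore step described above, assuming each population member tracks its recent evaluation score, network weights, and a dictionary of mutable hyperparameters (learning rate, entropy cost, distillation weight). The attribute names, the significance margin, and the perturbation factor are all illustrative rather than the paper's exact procedure.

```python
import copy
import random

def pbt_step(member, population, margin=0.0, perturb=1.2):
    """One exploit/explore step for a single population member.

    member.hypers is e.g. {"learning_rate": 3e-4, "entropy_cost": 1e-3, "lambda": 1.0};
    member.score is its recent evaluation score.
    """
    other = random.choice([m for m in population if m is not member])
    if other.score > member.score + margin:
        # Exploit: copy the better agent's weights and hyperparameters.
        member.weights = copy.deepcopy(other.weights)
        member.hypers = dict(other.hypers)
        # Explore: randomly perturb each hyperparameter up or down.
        for name in member.hypers:
            member.hypers[name] *= perturb if random.random() < 0.5 else 1.0 / perturb
    return member
```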
Experiment Setup
- Base algorithm: IMPALA
- Architectures:
- Conv + LSTM + Policy/Value-Head
- Two sizes: small agent (2 conv layers), large agent (15 conv layers)
- Environment: DMLab-30 Suite
- Image input
- 30 tasks
- Performance is human-normalized (Please see the details in the paper if needed.)
- The score for the full suite is obtained by averaging the human-normalized scores across all 30 tasks
- IMPALA + PBT + Kickstarting settings
- Computation resources
- 1 P100 GPU
- 150 actor workers
- Distributed equally, 5 per task, between the 30 tasks of the suite
- For the multi-teacher setup, a separate distillation weight $\lambda$ is used for each teacher, and PBT is allowed to choose a schedule for each one
- The authors expect the distillation weights to be correlated, hence they implement the weights in a factorised way (see the sketch after this list)
- So that the evolution algorithm can make all weights stronger or weaker simultaneously
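One possible factorisation consistent with the description above is a shared global scale multiplied by a per-teacher factor, so that PBT can strengthen or weaken every teacher's influence in one move while still tuning teachers individually. This is a sketch of that idea under those assumptions; the paper's exact parameterisation may differ, and all names are illustrative.

```python
def multi_teacher_distill_loss(student_log_probs, teacher_probs_per_task,
                               shared_scale, per_teacher_scales):
    """Sum of per-teacher cross-entropies with factorised weights
    lambda_i = shared_scale * per_teacher_scales[i].

    student_log_probs[i]:      [T, A] student log-probs on teacher i's task
    teacher_probs_per_task[i]: [T, A] teacher i's action probabilities there
    """
    total = 0.0
    for i, teacher_probs in enumerate(teacher_probs_per_task):
        lambda_i = shared_scale * per_teacher_scales[i]
        # Cross-entropy H(pi_Ti || pi_S) on the student's trajectories for task i.
        cross_entropy = -(teacher_probs * student_log_probs[i]).sum(-1).mean()
        total = total + lambda_i * cross_entropy
    return total
```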
Experiment Results
Kickstarting With a Single Teacher
- Only about 1 billion frames are needed for the kickstarted student to reach the small teacher's final performance (which took 10 billion frames)
- So it seems the teacher is the model trained for 10B frames
- The teacher is "small from scratch" (the small agent trained without a teacher)
Kickstarting With a Single Teacher (vs Other Distillation Weighting Approaches)
- Note that even for the manually specified schedules, PBT still controls other hyperparameters (learning rate and entropy cost)
- Vs constant schedules
- Vs large from scratch (Note that the teacher is small from scratch)
- Vs linear schedules
- The best weighting schedule may not be apparent a priori, which is why letting PBT adjust it automatically helps (a rough sketch of such manual schedules follows below)
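For reference, a tiny sketch of what the manually specified baselines might look like: a constant distillation weight, or one decayed linearly to zero over training. The exact constants and horizons used in the paper are not reproduced here; this is only to make the comparison concrete.

```python
def distillation_weight(step, total_steps, mode="linear", initial=1.0):
    """Manually specified schedules for the distillation weight lambda."""
    if mode == "constant":
        return initial
    # Linear decay from `initial` down to 0 over the course of training.
    return initial * max(0.0, 1.0 - step / total_steps)
```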
Kickstarting With Multiple Teachers
Special Findings
- Laser tag
- The task suite contains a set of similar ‘laser tag’ tasks in which agents must navigate a procedurally-generated maze and tag other opponent ‘bots’ with a laser.
- In the 1-bot variant, encounters with the single opponent (and thus rewards) are very sparse: thus the from-scratch agent and even the single-task expert do not learn.
- The multi-teacher kickstarted agent, however, learns quickly.
- The reason is that a single-task expert learned strong performance on the 3-bot task (thanks to denser rewards), and its knowledge transfers to the 1-bot variant.
- The authors find the student ‘ignores’ the (incompetent) 1-bot expert (its distillation weight $\lambda$ quickly goes to 0).
- explore_goal_locations
- The expert learns to use its short-term memory and scores considerably higher; thanks to this expert, the kickstarted agent also masters the task.
- This is perhaps surprising, because the kickstarting mechanism only guides the student agent in which action to take: it puts no constraint on how the student structures its internal memory state.
- However, the student can only predict the teacher's behaviour by remembering information from before the respawn, which seems to be enough supervisory signal to drive short-term memory formation.
- The authors find this a wonderful parallel with how the best human educators teach: not telling the student what to think, but simply putting the student in a fruitful position to learn for themselves.