Schmitt, Simon, et al. "Kickstarting deep reinforcement learning." arXiv preprint arXiv:1803.03835 (2018). (DeepMind)

Use previously-trained teacher agents to kickstart the training of a new student agent!
- Leverages ideas from:
  - Policy distillation
  - Population-based training (PBT)
- No constraints on the architecture of the teacher or student agents
- Matches the performance of an agent trained from scratch in fewer steps
- Faster training
- Allows the students to surpass their teachers in performance
- Different from pure imitation learning
- Automatically adjusts the influence of the teachers on the student agents
- Multi-task knowledge transfer
Main points: knowledge transfer and population-based training
Use the idea of “policy distillation”

The teacher's role is to assist the student: the total loss combines the standard Reinforcement Learning (RL) loss with a kickstarting (distillation) term:
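Roughly as given in the paper (notation approximate): the student, with weights $\omega$ and policy $\pi_S$, minimises the usual RL loss plus a per-teacher cross-entropy term that pulls it towards the teacher policy $\pi_T$, weighted by $\lambda_k \geq 0$:

$$\ell_{\text{kick}} = \ell_{\text{RL}}(\omega, x_t, a_t) + \lambda_k\, H\big(\pi_T(\cdot \mid x_t)\,\big\|\,\pi_S(\cdot \mid x_t, \omega)\big), \qquad H(p \,\|\, q) = -\sum_a p(a)\log q(a)$$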

Combine with A3C
A3C loss
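For reference, a standard form of the A3C objective (the coefficient names $\beta_v$ and $\beta_e$ are mine, not necessarily the paper's): a policy-gradient term using an $n$-step return $R_t$ against the value baseline $V(x_t)$, a value-regression term, and an entropy bonus:

$$\ell_{\text{A3C}} \approx -\log \pi_S(a_t \mid x_t, \omega)\,\big(R_t - V(x_t)\big) \;+\; \beta_v \big(R_t - V(x_t)\big)^2 \;-\; \beta_e\, \mathcal{H}\big(\pi_S(\cdot \mid x_t, \omega)\big)$$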

A3C kickstarting loss:
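A minimal PyTorch sketch of how the combined loss could be computed (the function name, tensor shapes, and coefficients are my own illustrative choices, not the authors' code):

```python
# Hypothetical sketch of a kickstarted A3C loss (not the authors' implementation).
# Shapes assumed: logits [T, B, A]; values, returns [T, B]; actions [T, B] (long).
import torch.nn.functional as F

def a3c_kickstarting_loss(student_logits, values, returns, actions,
                          teacher_logits, lambda_k,
                          value_coef=0.5, entropy_coef=0.01):
    """A3C loss plus a cross-entropy term distilling the teacher policy."""
    advantages = (returns - values).detach()
    log_pi = F.log_softmax(student_logits, dim=-1)
    pi = log_pi.exp()

    # Policy-gradient term: -log pi_S(a_t | x_t) * advantage
    chosen_log_pi = log_pi.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    policy_loss = -(chosen_log_pi * advantages).mean()

    # Standard A3C auxiliaries: value regression and entropy bonus.
    value_loss = F.mse_loss(values, returns)
    entropy = -(pi * log_pi).sum(-1).mean()

    # Kickstarting term: cross-entropy H(pi_T || pi_S), weighted by lambda_k.
    teacher_probs = F.softmax(teacher_logits, dim=-1).detach()
    distill_loss = -(teacher_probs * log_pi).sum(-1).mean()

    return (policy_loss + value_coef * value_loss
            - entropy_coef * entropy + lambda_k * distill_loss)
```

As $\lambda_k \to 0$ this reduces to plain A3C, which is why annealing $\lambda_k$ (below, via PBT) lets the student eventually outgrow the teacher.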

Combine with IMPALA
IMPALA: a distributed actor-learner architecture, so the behaviour policy that generates the trajectories lags behind the learner's current policy
Employs an off-policy correction based on importance sampling (V-trace)
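For reference, a sketch of the V-trace target from the IMPALA paper (Espeholt et al., 2018), written approximately: data is generated by a behaviour policy $\mu$ that lags behind the learner policy $\pi$, and the importance ratios are truncated at $\bar{\rho}$ and $\bar{c}$:

$$v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{\,t-s}\Big(\prod_{i=s}^{t-1} c_i\Big)\,\delta_t V, \qquad \delta_t V = \rho_t\big(r_t + \gamma V(x_{t+1}) - V(x_t)\big)$$

$$\rho_t = \min\!\Big(\bar{\rho},\ \tfrac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\Big), \qquad c_i = \min\!\Big(\bar{c},\ \tfrac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\Big)$$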

How to tune hyperparameters such as the learning rate and the distillation weight $\lambda_k$?

Jaderberg, Max, et al. "Population based training of neural networks." arXiv preprint arXiv:1711.09846 (2017).
PBT: trains a population of agents in parallel to jointly optimize network weights and hyperparameters that affect learning dynamics (such as the learning rate or entropy cost)
This paper’s instantiation of PBT:
Each agent periodically selects another member of the population at random and checks whether that member's performance is significantly better than its own. If so, the better agent's weights and hyperparameters are adopted.
This allows the schedule for $\lambda_k$ to be adjusted automatically and separately for each teacher in the multi-teacher scenario, while simultaneously adjusting the learning rate and the entropy-regularisation strength.
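A minimal sketch of this exploit-and-explore step in Python (the agent attributes, the 5% "significantly better" margin, and the perturbation factors are illustrative assumptions, not the paper's exact settings):

```python
import copy
import random

def pbt_exploit_and_explore(agent, population, margin=0.05, perturb=(0.8, 1.25)):
    """One PBT step for `agent`: exploit a better peer, then explore.

    Assumes each member has `weights`, a `hypers` dict (e.g. 'lr',
    'entropy_cost', and one 'lambda_k' per teacher), and a recent `score`.
    """
    peer = random.choice([p for p in population if p is not agent])
    if peer.score > agent.score * (1.0 + margin):       # "significantly better" test
        agent.weights = copy.deepcopy(peer.weights)      # adopt network weights
        agent.hypers = copy.deepcopy(peer.hypers)        # adopt hyperparameters
        for name in agent.hypers:                        # explore: random perturbation
            agent.hypers[name] *= random.choice(perturb)
    return agent
```

Because each $\lambda_k$ is just another entry in `hypers`, PBT can anneal the influence of each teacher independently while it also tunes the learning rate and entropy cost.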

Standard Multi-Task RL vs Single-Teacher Kickstarting vs Multi-Teacher Kickstarting
Small agent: 2 conv layers; large agent: 15 conv layers
Environment: DMLab-30 suite
https://deepmind.google/discover/blog/scalable-agent-architecture-for-distributed-training/
Only about 1 billion frames are needed for the kickstarted student to reach the small teacher's final performance (reached by the teacher after 10 billion frames).
So it seems the teacher is the model trained for 10 billion frames.

Rolling average of agent’s score (mean of top-3 PBT population members each)

PBT evolution of the kickstarting distillation weight over the course of training, for the best population member




Laser tag

explore_goal_locations
The expert learns to use its short-term memory and scores considerably higher; thanks to this expert, the kickstarted agent also masters the task. This is perhaps surprising because the kickstarting mechanism only guides the student agent in which action to take: it puts no constraint on how the student structures its internal memory state. However, the student can only predict the teacher's behavior by remembering information from before the respawn, which seems to be enough supervisory signal to drive short-term memory formation. We find this a wonderful parallel with how the best human educators teach: not telling the student what to think, but simply putting the student in a fruitful position to learn for themselves.
