Tags
Prior, Distillation, Multi-Task
Date
November 15, 2023
Schmitt, Simon, et al. "Kickstarting deep reinforcement learning." arXiv preprint arXiv:1803.03835 (2018). (DeepMind)
- Introduction
- Methods
- Knowledge Transfer
- Population-Based Training (PBT)
- Experiment Setup
- Experiment Results
- Kickstarting With a Single Teacher
- Kickstarting With a Single Teacher (vs Other Distillation Weighting Approaches)
- Kickstarting With Multiple Teachers
- Special Findings
Introduction
Use previously-trained teacher agents to kickstart the training of a new student agent!
- Leverage ideas from:
- Policy distillation
- Population-based training (PBT)
- No constraints on the architecture of the teacher or student agents
- Matching the performance of an agent trained from scratch in fewer steps
- Faster training
- Allow the students to surpass their teachers in performance
- Different from pure imitation learning
- Automatically adjusts the influence of the teachers on the student agents
- Multi-task knowledge transferring
Methods
Main points: knowledge transfer and population-based training
Knowledge Transfer
- Use the idea of “policy distillation”
- Distillation loss: $l_{\text{distill}}(\omega, x_t) = H\big(\pi_T(a \mid x_t) \,\|\, \pi_S(a \mid x_t, \omega)\big)$
- $H(\cdot \,\|\, \cdot)$: cross-entropy
- $\pi_T$: teacher policy
- $\pi_S$: student policy (parameterized by $\omega$)
- Trajectories are generated by the teacher (containing a sequence of observations $x_t$)
- The teacher's role is to assist the student. The total loss combines the distillation term with the Reinforcement Learning (RL) loss: $l_{\text{kick}}^{k} = l_{\text{RL}}(\omega, x_t, a_t, r_t) + \lambda_k H\big(\pi_T(a \mid x_t) \,\|\, \pi_S(a \mid x_t, \omega)\big)$
- $\lambda_k \ge 0$: weight of the distillation term at optimisation step $k$
- The knowledge transfer process changes over time (the weight $\lambda_k$ is adjusted during training)
- Trajectories are generated by the student (containing a sequence of observations, actions, and rewards $(x_t, a_t, r_t)$)
- The student is solely responsible for generating trajectories, and can explore parts of the state space that the teacher does not visit
- Combine with A3C
- A3C loss: $l_{\text{A3C}}(\omega, x_t, a_t, \hat{r}_t) = -(\hat{r}_t - V(x_t)) \log \pi_S(a_t \mid x_t, \omega) - \beta H\big(\pi_S(a \mid x_t, \omega)\big)$, where $\hat{r}_t$ is the return estimate, $V$ the value function, and $\beta$ the entropy cost (the value-function loss is omitted here)
- A3C kickstarting loss: $l_{\text{kick}}^{k} = l_{\text{A3C}}(\omega, x_t, a_t, \hat{r}_t) + \lambda_k H\big(\pi_T(a \mid x_t) \,\|\, \pi_S(a \mid x_t, \omega)\big)$ (a code sketch follows at the end of this section)
- Combine with IMPALA
- IMPALA: a large-scale setting with distributed actor (worker) machines, so the student learns from off-policy data
- Employ a correction based on importance sampling (V-trace); a rough sketch follows at the end of this section
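Below is a minimal sketch of the kickstarting loss described above in the A3C-style case: the usual policy-gradient-plus-entropy actor loss with the teacher cross-entropy term added on top. It assumes discrete actions and per-step logits; all function and argument names (e.g. `teacher_logits`, `advantages`) are illustrative rather than taken from the paper's code, and the value-function loss is omitted.

```python
import torch
import torch.nn.functional as F

def a3c_kickstarting_loss(student_logits, teacher_logits, actions,
                          advantages, lambda_k, entropy_cost=0.01):
    """l_kick = l_A3C + lambda_k * H(pi_T || pi_S), on the student's trajectory.

    student_logits: [T, A] student policy logits on the student's observations
    teacher_logits: [T, A] teacher policy logits on the same observations
    actions:        [T]    actions actually taken by the student
    advantages:     [T]    advantage estimates (e.g. n-step return minus value)
    lambda_k:       distillation weight at this point in training
    """
    log_probs = F.log_softmax(student_logits, dim=-1)              # log pi_S(. | x_t)
    probs = log_probs.exp()
    taken_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    policy_loss = -(advantages.detach() * taken_log_probs).mean()  # policy-gradient term
    entropy = -(probs * log_probs).sum(dim=-1).mean()              # H(pi_S)
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    distill = -(teacher_probs * log_probs).sum(dim=-1).mean()      # H(pi_T || pi_S)

    return policy_loss - entropy_cost * entropy + lambda_k * distill
```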
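In the IMPALA setting the actors' data is slightly stale, so the policy-gradient term is corrected with truncated importance weights (V-trace). The sketch below shows only that correction and how the distillation term is simply added on the student's own trajectories; the V-trace value targets are assumed to be computed elsewhere, and all names are illustrative rather than the paper's.

```python
import torch
import torch.nn.functional as F

def impala_kickstarting_policy_loss(student_logits, behaviour_log_probs, actions,
                                    vtrace_advantages, teacher_logits, lambda_k,
                                    rho_clip=1.0):
    """Off-policy-corrected policy-gradient term plus the kickstarting term."""
    log_probs = F.log_softmax(student_logits, dim=-1)
    taken_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    # Truncated importance weights rho_t = min(rho_clip, pi_S(a_t|x_t) / mu(a_t|x_t)).
    rho = torch.clamp((taken_log_probs - behaviour_log_probs).exp(),
                      max=rho_clip).detach()
    policy_loss = -(rho * vtrace_advantages.detach() * taken_log_probs).mean()

    # Distillation term, evaluated directly on the actor-generated trajectory
    # (no importance correction is applied to it in this sketch).
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    distill = -(teacher_probs * log_probs).sum(dim=-1).mean()

    return policy_loss + lambda_k * distill
```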
Population-Based Training (PBT)
- How to tune hyper-parameters such as learning rates?
- Hand-crafted: may need expert knowledge
- Sequential optimization: may need lots of iterations
- Parallel search: requires the use of more computational resources to train many models in parallel
- PBT: trains a population of agents in parallel to jointly optimize network weights and hyperparameters that affect learning dynamics (such as the learning rate or entropy cost)
- This paper’s instantiation of PBT:
- It allows the schedule for $\lambda_k$ to be adjusted automatically and separately for each teacher in a multi-teacher scenario, while simultaneously adjusting the learning rate and the entropy-regularisation strength.
Each agent periodically selects another member of the population at random and checks whether its performance is significantly better than its own. If this is true, weights and hyperparameters of the better agent are adopted.
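A highly simplified sketch of the exploit/explore step described above, assuming each population member tracks its recent evaluation score, network weights, and a dictionary of mutable hyperparameters (learning rate, entropy cost, distillation weight). The attribute names, the significance margin, and the perturbation factor are all illustrative rather than the paper's exact procedure.

```python
import copy
import random

def pbt_step(member, population, margin=0.0, perturb=1.2):
    """One exploit/explore step for a single population member.

    member.hypers is e.g. {"learning_rate": 3e-4, "entropy_cost": 1e-3, "lambda": 1.0};
    member.score is its recent evaluation score.
    """
    other = random.choice([m for m in population if m is not member])
    if other.score > member.score + margin:
        # Exploit: copy the better agent's weights and hyperparameters.
        member.weights = copy.deepcopy(other.weights)
        member.hypers = dict(other.hypers)
        # Explore: randomly perturb each hyperparameter up or down.
        for name in member.hypers:
            member.hypers[name] *= perturb if random.random() < 0.5 else 1.0 / perturb
    return member
```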
Experiment Setup
- Base algorithm: IMPALA
- Architectures:
- Conv + LSTM + Policy/Value-Head
- Two sizes: small agent (2 conv layers), large agent (15 conv layers)
- Environment: DMLab-30 Suite
- Image input
- 30 tasks
- Performance is human-normalized (Please see the details in the paper if needed.)
- The score for the full suite is obtained by averaging the human-normalized scores across all 30 tasks
- IMPALA + PBT + Kickstarting settings
- Computation resources
- 1 P100 GPU
- 150 actor workers
- Distributed equally, 5 per task, between the 30 tasks of the suite
- For the multi-teacher setup, a separate distillation weight $\lambda$ is used for each teacher, and PBT is allowed to choose a schedule for each one
- The authors expect the distillation weights to be correlated, hence they implement the weights in a factorised way (see the sketch after this list)
- So that the evolution algorithm can make all weights stronger or weaker simultaneously
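One possible factorisation consistent with the description above is a shared global scale multiplied by a per-teacher factor, so that PBT can strengthen or weaken every teacher's influence in one move while still tuning teachers individually. This is a sketch of that idea under those assumptions; the paper's exact parameterisation may differ, and all names are illustrative.

```python
def multi_teacher_distill_loss(student_log_probs, teacher_probs_per_task,
                               shared_scale, per_teacher_scales):
    """Sum of per-teacher cross-entropies with factorised weights
    lambda_i = shared_scale * per_teacher_scales[i].

    student_log_probs[i]:      [T, A] student log-probs on teacher i's task
    teacher_probs_per_task[i]: [T, A] teacher i's action probabilities there
    """
    total = 0.0
    for i, teacher_probs in enumerate(teacher_probs_per_task):
        lambda_i = shared_scale * per_teacher_scales[i]
        # Cross-entropy H(pi_Ti || pi_S) on the student's trajectories for task i.
        cross_entropy = -(teacher_probs * student_log_probs[i]).sum(-1).mean()
        total = total + lambda_i * cross_entropy
    return total
```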
Experiment Results
Kickstarting With a Single Teacher
- Only about 1 billion frames are needed for the kickstarted student to reach the small teacher's final performance (which took 10 billion frames)
- So it seems the teacher is the model trained for 10B frames
- The teacher is "small from scratch" (the small agent trained without a teacher)
Kickstarting With a Single Teacher (vs Other Distillation Weighting Approaches)
- Note that even for the manually specified schedules, PBT still controls other hyperparameters (learning rate and entropy cost)
- Vs constant schedules
- Vs large from scratch (Note that the teacher is small from scratch)
- Vs linear schedules
- The best weighting schedule may not be apparent a priori, which is why letting PBT adjust it automatically helps (a rough sketch of such manual schedules follows below)
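For reference, a tiny sketch of what the manually specified baselines might look like: a constant distillation weight, or one decayed linearly to zero over training. The exact constants and horizons used in the paper are not reproduced here; this is only to make the comparison concrete.

```python
def distillation_weight(step, total_steps, mode="linear", initial=1.0):
    """Manually specified schedules for the distillation weight lambda."""
    if mode == "constant":
        return initial
    # Linear decay from `initial` down to 0 over the course of training.
    return initial * max(0.0, 1.0 - step / total_steps)
```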
Kickstarting With Multiple Teachers
Special Findings
- Laser tag
- The task suite contains a set of similar ‘laser tag’ tasks in which agents must navigate a procedurally-generated maze and tag other opponent ‘bots’ with a laser.
- In the 1-bot variant, encounters with the single opponent (and thus rewards) are very sparse: thus the from-scratch agent and even the single-task expert do not learn.
- The multi-teacher kickstarted agent, however, learns quickly.
- The reason is that a single-task expert learned strong performance on the 3-bot task (thanks to denser rewards), and its knowledge transfers to the 1-bot variant.
- The authors find the student ‘ignores’ the (incompetent) 1-bot expert (its distillation weight $\lambda$ quickly goes to 0).
- explore_goal_locations
- The expert learns to use its short-term memory and scores considerably higher; thanks to this expert, the kickstarted agent also masters the task.
- This is perhaps surprising, because the kickstarting mechanism only guides the student agent in which action to take: it puts no constraint on how the student structures its internal memory state.
- However, the student can only predict the teacher's behaviour by remembering information from before the respawn, which seems to be enough supervisory signal to drive short-term memory formation.
- The authors find this a wonderful parallel with how the best human educators teach: not telling the student what to think, but simply putting the student in a fruitful position to learn for themselves.