Schmitt, Simon, et al. "Kickstarting deep reinforcement learning." arXiv preprint arXiv:1803.03835 (2018). (DeepMind)

Use previously-trained teacher agents to kickstart the training of a new student agent!
- Leverages ideas from:
  - Policy distillation
  - Population-based training (PBT)
- No constraints on the architecture of the teacher or student agents
- Matches the performance of an agent trained from scratch in fewer steps
- Faster training
- Allows the students to surpass their teachers in performance
- Different from pure imitation learning
- Automatically adjusts the influence of the teachers on the student agents
- Multi-task knowledge transfer
Main points: knowledge transfer and population-based training
Use the idea of “policy distillation”

The teacher's role is to assist the student: the total loss combines the standard Reinforcement Learning (RL) loss with a kickstarting (distillation) term:
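Roughly as given in the paper (notation approximate): the student, with weights $\omega$ and policy $\pi_S$, minimises the usual RL loss plus a per-teacher cross-entropy term that pulls it towards the teacher policy $\pi_T$, weighted by $\lambda_k \geq 0$:

$$\ell_{\text{kick}} = \ell_{\text{RL}}(\omega, x_t, a_t) + \lambda_k\, H\big(\pi_T(\cdot \mid x_t)\,\big\|\,\pi_S(\cdot \mid x_t, \omega)\big), \qquad H(p \,\|\, q) = -\sum_a p(a)\log q(a)$$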

Combine with A3C
A3C loss
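For reference, a standard form of the A3C objective (the coefficient names $\beta_v$ and $\beta_e$ are mine, not necessarily the paper's): a policy-gradient term using an $n$-step return $R_t$ against the value baseline $V(x_t)$, a value-regression term, and an entropy bonus:

$$\ell_{\text{A3C}} \approx -\log \pi_S(a_t \mid x_t, \omega)\,\big(R_t - V(x_t)\big) \;+\; \beta_v \big(R_t - V(x_t)\big)^2 \;-\; \beta_e\, \mathcal{H}\big(\pi_S(\cdot \mid x_t, \omega)\big)$$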

A3C kickstarting loss:
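A minimal PyTorch sketch of how the combined loss could be computed (the function name, tensor shapes, and coefficients are my own illustrative choices, not the authors' code):

```python
# Hypothetical sketch of a kickstarted A3C loss (not the authors' implementation).
# Shapes assumed: logits [T, B, A]; values, returns [T, B]; actions [T, B] (long).
import torch.nn.functional as F

def a3c_kickstarting_loss(student_logits, values, returns, actions,
                          teacher_logits, lambda_k,
                          value_coef=0.5, entropy_coef=0.01):
    """A3C loss plus a cross-entropy term distilling the teacher policy."""
    advantages = (returns - values).detach()
    log_pi = F.log_softmax(student_logits, dim=-1)
    pi = log_pi.exp()

    # Policy-gradient term: -log pi_S(a_t | x_t) * advantage
    chosen_log_pi = log_pi.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    policy_loss = -(chosen_log_pi * advantages).mean()

    # Standard A3C auxiliaries: value regression and entropy bonus.
    value_loss = F.mse_loss(values, returns)
    entropy = -(pi * log_pi).sum(-1).mean()

    # Kickstarting term: cross-entropy H(pi_T || pi_S), weighted by lambda_k.
    teacher_probs = F.softmax(teacher_logits, dim=-1).detach()
    distill_loss = -(teacher_probs * log_pi).sum(-1).mean()

    return (policy_loss + value_coef * value_loss
            - entropy_coef * entropy + lambda_k * distill_loss)
```

As $\lambda_k \to 0$ this reduces to plain A3C, which is why annealing $\lambda_k$ (below, via PBT) lets the student eventually outgrow the teacher.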

Combine with IMPALA
IMPALA: a distributed actor-learner architecture, so the behaviour policy that generates the trajectories lags behind the learner's current policy
Employs an off-policy correction based on importance sampling (V-trace)
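For reference, a sketch of the V-trace target from the IMPALA paper (Espeholt et al., 2018), written approximately: data is generated by a behaviour policy $\mu$ that lags behind the learner policy $\pi$, and the importance ratios are truncated at $\bar{\rho}$ and $\bar{c}$:

$$v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{\,t-s}\Big(\prod_{i=s}^{t-1} c_i\Big)\,\delta_t V, \qquad \delta_t V = \rho_t\big(r_t + \gamma V(x_{t+1}) - V(x_t)\big)$$

$$\rho_t = \min\!\Big(\bar{\rho},\ \tfrac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\Big), \qquad c_i = \min\!\Big(\bar{c},\ \tfrac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\Big)$$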

How to tune hyperparameters such as the learning rate and the distillation weight $\lambda_k$?

Jaderberg, Max, et al. "Population based training of neural networks." arXiv preprint arXiv:1711.09846 (2017).
PBT: trains a population of agents in parallel to jointly optimize network weights and hyperparameters that affect learning dynamics (such as the learning rate or entropy cost)
This paper’s instantiation of PBT:
Each agent periodically selects another member of the population at random and checks whether that member's performance is significantly better than its own. If so, the better agent's weights and hyperparameters are adopted.
This allows the schedule for $\lambda_k$ to be adjusted automatically and separately for each teacher in the multi-teacher scenario, while simultaneously adjusting the learning rate and the entropy-regularisation strength.
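A minimal sketch of this exploit-and-explore step in Python (the agent attributes, the 5% "significantly better" margin, and the perturbation factors are illustrative assumptions, not the paper's exact settings):

```python
import copy
import random

def pbt_exploit_and_explore(agent, population, margin=0.05, perturb=(0.8, 1.25)):
    """One PBT step for `agent`: exploit a better peer, then explore.

    Assumes each member has `weights`, a `hypers` dict (e.g. 'lr',
    'entropy_cost', and one 'lambda_k' per teacher), and a recent `score`.
    """
    peer = random.choice([p for p in population if p is not agent])
    if peer.score > agent.score * (1.0 + margin):       # "significantly better" test
        agent.weights = copy.deepcopy(peer.weights)      # adopt network weights
        agent.hypers = copy.deepcopy(peer.hypers)        # adopt hyperparameters
        for name in agent.hypers:                        # explore: random perturbation
            agent.hypers[name] *= random.choice(perturb)
    return agent
```

Because each $\lambda_k$ is just another entry in `hypers`, PBT can anneal the influence of each teacher independently while it also tunes the learning rate and entropy cost.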

Standard Multi-Task RL vs Single-Teacher Kickstarting vs Multi-Teacher Kickstarting
Small agent: 2 conv layers; large agent: 15 conv layers
Environment: DMLab-30 suite
https://deepmind.google/discover/blog/scalable-agent-architecture-for-distributed-training/
Only about 1 billion frames are needed for the kickstarted student to reach the small teacher's final performance (reached by the teacher after 10 billion frames).
So it seems the teacher is the model trained for 10 billion frames.

Rolling average of agent’s score (mean of top-3 PBT population members each)

PBT evolution of the kickstarting distillation weight over the course of training, for the best population member




Laser tag

explore_goal_locations
The expert learns to use its short-term memory and scores considerably higher; thanks to this expert, the kickstarted agent also masters the task. This is perhaps surprising because the kickstarting mechanism only guides the student agent in which action to take: it puts no constraint on how the student structures its internal memory state. However, the student can only predict the teacher's behavior by remembering information from before the respawn, which seems to be enough supervisory signal to drive short-term memory formation. We find this a wonderful parallel with how the best human educators teach: not telling the student what to think, but simply putting the student in a fruitful position to learn for themselves.
