R2D3
Chinese
RL
RL from Prior
BC
written by
LiaoWC
on 2023-12-19
Making Efficient Use of Demonstrations to Solve Hard Exploration Problems (R2D3)
This paper uses demonstrations to solve hard environments with the following properties:
- Sparse rewards
- Partial observability
- Highly variable initial conditions
The main contributions of this paper are:
- Proposes a suite of game tasks exhibiting the three properties above
- Designs a method that makes use of demonstrations
- Provides detailed analysis of the method and the experiments
Experimental results:
- R2D3, which uses demonstrations, performs better overall than the other methods
- On some of these hard tasks it even exceeds human performance
Background
- R2D2: a DQN-family method that combines an RNN with the Ape-X distributed design. To handle the recurrent network, it introduces a burn-in phase that addresses the problem of initializing the RNN hidden state (see the sketch after this list).
- DQfD: a DQN method that learns from demonstrations.
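To make the burn-in idea concrete, here is a minimal PyTorch sketch (names such as `lstm`, `q_head`, and `burn_in` are illustrative, not from the paper): the first `burn_in` steps of a stored sequence are used only to refresh the LSTM state, without gradients, and Q-values are produced for the remaining steps.

```python
import torch
import torch.nn as nn

# Illustrative modules; the actual R2D2/R2D3 network is a ConvNet + LSTM + dueling head.
lstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)
q_head = nn.Linear(256, 4)  # 4 = number of actions (example value)

def q_values_with_burn_in(obs_features, stored_state, burn_in=40):
    """obs_features: (batch, seq_len, 128) sequence sampled from replay.
    stored_state: (h, c) LSTM state recorded when the sequence was generated."""
    # Burn-in phase: unroll the LSTM to warm up the hidden state, no gradients.
    with torch.no_grad():
        _, state = lstm(obs_features[:, :burn_in], stored_state)
    # Training phase: the remaining steps are unrolled normally and used in the loss.
    out, _ = lstm(obs_features[:, burn_in:], state)
    return q_head(out)  # (batch, seq_len - burn_in, num_actions)
```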
Method
R2D3 is essentially R2D2 with the use of demonstrations added on top.
The way it uses demonstrations differs from DQfD: the loss is n-step double Q-learning (with n = 5) combined with a dueling network architecture, and unlike DQfD it does not add a supervised loss.
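For reference, the standard form of the n-step double Q-learning target (not copied from the paper; $\theta$ are the online parameters, $\bar\theta$ the target-network parameters, and R2D2's invertible value rescaling is omitted here) is

$$y_t = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n\, Q_{\bar\theta}\!\Big(s_{t+n}, \operatorname*{arg\,max}_{a} Q_\theta(s_{t+n}, a)\Big), \qquad n = 5.$$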
R2D3 stores the demonstrations in a separate demo replay buffer. During training, a fixed fraction ($\rho$) of the sampled data comes from the demo replay buffer, and the remaining $1-\rho$ comes from the agent's online data. Both buffers are sampled with Prioritized Experience Replay.
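A minimal sketch of how the demo ratio could be applied when assembling a training batch (illustrative Python; `demo_buffer` and `agent_buffer` stand for the two prioritized replay buffers and are assumed to expose a `sample(n)` method):

```python
import random

def sample_training_batch(demo_buffer, agent_buffer, batch_size=64, rho=1/256):
    """Each element of the batch comes from the demo replay buffer with
    probability rho, otherwise from the agent's (online) replay buffer.
    Both buffers are assumed to do prioritized sampling internally."""
    batch = []
    for _ in range(batch_size):
        source = demo_buffer if random.random() < rho else agent_buffer
        batch.append(source.sample(1))
    return batch
```

The paper sweeps this ratio; see the $\rho$ comparison in the results section below.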



Experimental Environment - Hard-Eight Task Suite
![Hard-Eight Task Suite](unnamed.gif)
https://deepmind.google/discover/blog/making-efficient-use-of-demonstrations-to-solve-hard-exploration-problems/
The Hard-Eight Task Suite is the experimental environment of this paper. It is designed to exhibit the three properties stated at the beginning:
- Sparse rewards
- In all but one task the only positive instantaneous reward obtained also ends the episode.
- Partial observability
- Highly variable initial conditions
- Colors, shapes, configurations, etc. vary between episodes
- As a result, simply copying the actions of a demonstration may not work
Baselines
The following are the baselines compared against in the experiments.
- Behavior Cloning (BC)
- Fits a parameterized policy mapping states to actions, trained with a cross-entropy loss (see the sketch after this list)
- R2D2 (No Demonstration)
- Removes the demonstrations from R2D3, i.e., sets the demo ratio to $\rho=0$.
- DQfD (No Recurrence)
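A minimal sketch of the BC objective mentioned above (generic PyTorch with illustrative shapes; `policy_net`, the feature dimension, and the action count are assumptions, not from the paper):

```python
import torch
import torch.nn as nn

# Illustrative policy: maps observation features to action logits.
policy_net = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 4))
ce_loss = nn.CrossEntropyLoss()

def bc_loss(obs_features, demo_actions):
    """Cross-entropy between the policy's action distribution and the
    demonstrator's actions, i.e. supervised learning on (state, action) pairs."""
    logits = policy_net(obs_features)     # (batch, num_actions)
    return ce_loss(logits, demo_actions)  # demo_actions: (batch,) int64 labels
```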
Experimental Setup
- Ape-X-style distributed training: 256 $\epsilon$-greedy CPU-based actors + 1 GPU-based learner
- Also following Ape-X, each actor uses a different epsilon, $\epsilon_i\in[0.4^8,0.4]$, which adds diversity to exploration (see the sketch after this list).
- At least 10B actor steps for all tasks
- General hyperparameters are essentially the same across methods; the R2D2 vs. R2D3 comparison additionally varies $\rho$
- Optimizer:
- R2D3, R2D2, DQfD: Adam optimizer; learning rate $=2\times10^{-4}$
- BC: Adam; learning rate swept over $\{10^{-5},10^{-4},10^{-3}\}$
- Demonstration
- Action repeat = 2; the paper notes that action repeat = 4 is quite hard for the human demonstrators.
- 100 demonstrations for each task spread across three different experts
- Each expert contributed roughly one-third of the demonstrations for each task
- Demonstrations for the tasks were collected using keyboard and mouse controls mapped to the agent’s exact action space
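A small sketch reproducing the per-actor epsilon range stated above, assuming the standard Ape-X schedule $\epsilon_i = \epsilon^{1 + \alpha i/(N-1)}$ with $\epsilon = 0.4$ and $\alpha = 7$ (the text only gives the range $[0.4^8, 0.4]$, so the exact formula is an assumption):

```python
def actor_epsilons(num_actors=256, base=0.4, alpha=7):
    """Geometrically spaced exploration rates: from base (most exploratory actor)
    down to base**(1 + alpha) for the last actor."""
    return [base ** (1 + alpha * i / (num_actors - 1)) for i in range(num_actors)]

eps = actor_epsilons()
print(eps[0], eps[-1])  # 0.4 and 0.4**8 ≈ 6.6e-4, matching the stated range
```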

Network Architecture

Experimental Results
Overall Performance
R2D3 learns to solve 6 of the 8 environments; on 2 of them no method learns at all. The methods other than R2D3 mostly fail.

Comparison of the ratio $\rho$
Averaged over tasks, $\rho = 1/256$ works best.


Videos
Human demonstrating
R2D3 on Hard Eight Tasks
Trajectory Visualization
The Push Blocks task is described as follows:
The agent spawns in a medium sized room with a recessed sensor in the floor. There are several objects in the room that can be pushed but not lifted. The agent must push a block whose color matches the sensor into the recess in order to open a door to an adjoining room which contains a large apple which ends the episode. Pushing a wrong object into the recess makes the level impossible to complete.
For an animation, see the top-middle panel of unnamed.gif. The task is essentially to push the block of the matching color into the recess to open the door and collect the apple.


The right-hand figure is taken at the point marked by the red arrow in the training curve above.
In it, R2D2 still explores largely at random, while R2D3 advances toward a particular region.

Spatial pattern of exploration behavior at ∼5B actor steps (reward-driven learning kicks off for R2D3 only after ∼20B steps). Overlay of agent’s trajectories over 200 episodes. Blocks and sensors are not shown for clarity. R2D2 appears to follow a random walk. R2D3 concentrates on a particular spatial region.
Further detail of guided exploration behavior in the Push Blocks task:

