R2D3
Chinese
RL
RL from Prior
BC
written by
LiaoWC
on 2023-12-19
Making Efficient Use of Demonstrations to Solve Hard Exploration Problems (R2D3)
This paper uses demonstrations to solve hard environments with the following properties:
- Sparse rewards
- Partial observability
- Highly variable initial conditions
The main contributions of this paper are:
- Proposes a suite of game tasks exhibiting the three properties above
- Designs a method that makes use of demonstrations
- Provides detailed analysis of the method and the experiments
Experimental results:
- R2D3, which uses demonstrations, performs better overall than the other methods
- On some of these hard tasks it even exceeds human performance
Background
- R2D2: a DQN-family method that combines an RNN with the Ape-X distributed design. To handle the recurrent network, it introduces a burn-in phase that addresses the problem of initializing the RNN hidden state (see the sketch after this list).
- DQfD: a DQN method that learns from demonstrations.
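To make the burn-in idea concrete, here is a minimal PyTorch sketch (names such as `lstm`, `q_head`, and `burn_in` are illustrative, not from the paper): the first `burn_in` steps of a stored sequence are used only to refresh the LSTM state, without gradients, and Q-values are produced for the remaining steps.

```python
import torch
import torch.nn as nn

# Illustrative modules; the actual R2D2/R2D3 network is a ConvNet + LSTM + dueling head.
lstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)
q_head = nn.Linear(256, 4)  # 4 = number of actions (example value)

def q_values_with_burn_in(obs_features, stored_state, burn_in=40):
    """obs_features: (batch, seq_len, 128) sequence sampled from replay.
    stored_state: (h, c) LSTM state recorded when the sequence was generated."""
    # Burn-in phase: unroll the LSTM to warm up the hidden state, no gradients.
    with torch.no_grad():
        _, state = lstm(obs_features[:, :burn_in], stored_state)
    # Training phase: the remaining steps are unrolled normally and used in the loss.
    out, _ = lstm(obs_features[:, burn_in:], state)
    return q_head(out)  # (batch, seq_len - burn_in, num_actions)
```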
Method
R2D3 is essentially R2D2 with the use of demonstrations added on top.
The way it uses demonstrations differs from DQfD: the loss is n-step double Q-learning (with n = 5) combined with a dueling network architecture, and unlike DQfD it does not add a supervised loss.
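For reference, the standard form of the n-step double Q-learning target (not copied from the paper; $\theta$ are the online parameters, $\bar\theta$ the target-network parameters, and R2D2's invertible value rescaling is omitted here) is

$$y_t = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n\, Q_{\bar\theta}\!\Big(s_{t+n}, \operatorname*{arg\,max}_{a} Q_\theta(s_{t+n}, a)\Big), \qquad n = 5.$$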
R2D3 stores the demonstrations in a separate demo replay buffer. During training, a fixed fraction ($\rho$) of the sampled data comes from the demo replay buffer, and the remaining $1-\rho$ comes from the agent's online data. Both buffers are sampled with Prioritized Experience Replay.
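A minimal sketch of how the demo ratio could be applied when assembling a training batch (illustrative Python; `demo_buffer` and `agent_buffer` stand for the two prioritized replay buffers and are assumed to expose a `sample(n)` method):

```python
import random

def sample_training_batch(demo_buffer, agent_buffer, batch_size=64, rho=1/256):
    """Each element of the batch comes from the demo replay buffer with
    probability rho, otherwise from the agent's (online) replay buffer.
    Both buffers are assumed to do prioritized sampling internally."""
    batch = []
    for _ in range(batch_size):
        source = demo_buffer if random.random() < rho else agent_buffer
        batch.append(source.sample(1))
    return batch
```

The paper sweeps this ratio; see the $\rho$ comparison in the results section below.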



Experimental Environment - Hard-Eight Task Suite
![Hard-Eight Task Suite](unnamed.gif)
https://deepmind.google/discover/blog/making-efficient-use-of-demonstrations-to-solve-hard-exploration-problems/
The Hard-Eight Task Suite is the experimental environment of this paper. It is designed to exhibit the three properties stated at the beginning:
- Sparse rewards
- In all but one task the only positive instantaneous reward obtained also ends the episode.
- Partial observability
- Highly variable initial conditions
- Colors, shapes, configurations, etc. vary between episodes
- As a result, simply copying the actions of a demonstration may not work
Baselines
The following are the baselines compared against in the experiments.
- Behavior Cloning (BC)
- Fits a parameterized policy mapping states to actions, trained with a cross-entropy loss (see the sketch after this list)
- R2D2 (No Demonstration)
- Removes the demonstrations from R2D3, i.e., sets the demo ratio to $\rho=0$.
- DQfD (No Recurrence)
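A minimal sketch of the BC objective mentioned above (generic PyTorch with illustrative shapes; `policy_net`, the feature dimension, and the action count are assumptions, not from the paper):

```python
import torch
import torch.nn as nn

# Illustrative policy: maps observation features to action logits.
policy_net = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 4))
ce_loss = nn.CrossEntropyLoss()

def bc_loss(obs_features, demo_actions):
    """Cross-entropy between the policy's action distribution and the
    demonstrator's actions, i.e. supervised learning on (state, action) pairs."""
    logits = policy_net(obs_features)     # (batch, num_actions)
    return ce_loss(logits, demo_actions)  # demo_actions: (batch,) int64 labels
```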
Experimental Setup
- Ape-X-style distributed training: 256 $\epsilon$-greedy CPU-based actors + 1 GPU-based learner
- Also following Ape-X, each actor uses a different epsilon, $\epsilon_i\in[0.4^8,0.4]$, which adds diversity to exploration (see the sketch after this list).
- At least 10B actor steps for all tasks
- General hyperparameters are essentially the same across methods; the R2D2 vs. R2D3 comparison additionally varies $\rho$
- Optimizer:
- R2D3, R2D2, DQfD: Adam optimizer; learning rate $=2\times10^{-4}$
- BC: Adam; learning rate swept over $\{10^{-5},10^{-4},10^{-3}\}$
- Demonstration
- Action repeat = 2; the paper notes that action repeat = 4 is quite hard for the human demonstrators.
- 100 demonstrations for each task spread across three different experts
- Each expert contributed roughly one-third of the demonstrations for each task
- Demonstrations for the tasks were collected using keyboard and mouse controls mapped to the agent’s exact action space
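A small sketch reproducing the per-actor epsilon range stated above, assuming the standard Ape-X schedule $\epsilon_i = \epsilon^{1 + \alpha i/(N-1)}$ with $\epsilon = 0.4$ and $\alpha = 7$ (the text only gives the range $[0.4^8, 0.4]$, so the exact formula is an assumption):

```python
def actor_epsilons(num_actors=256, base=0.4, alpha=7):
    """Geometrically spaced exploration rates: from base (most exploratory actor)
    down to base**(1 + alpha) for the last actor."""
    return [base ** (1 + alpha * i / (num_actors - 1)) for i in range(num_actors)]

eps = actor_epsilons()
print(eps[0], eps[-1])  # 0.4 and 0.4**8 ≈ 6.6e-4, matching the stated range
```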

Network Architecture

Experimental Results
Overall Performance
R2D3 learns to solve 6 of the 8 environments; on 2 of them no method learns at all. The methods other than R2D3 mostly fail.

Comparison of the ratio $\rho$
Averaged over tasks, $\rho = 1/256$ works best.


Videos
Human demonstrating
R2D3 on Hard Eight Tasks
Trajectory Visualization
The Push Blocks task is described as follows:
The agent spawns in a medium sized room with a recessed sensor in the floor. There are several objects in the room that can be pushed but not lifted. The agent must push a block whose color matches the sensor into the recess in order to open a door to an adjoining room which contains a large apple which ends the episode. Pushing a wrong object into the recess makes the level impossible to complete.
For an animation, see the top-middle panel of unnamed.gif. The task is essentially to push the block of the matching color into the recess to open the door and collect the apple.


The right-hand figure is taken at the point marked by the red arrow in the training curve above.
In it, R2D2 still explores largely at random, while R2D3 advances toward a particular region.

Spatial pattern of exploration behavior at ∼5B actor steps (reward-driven learning kicks off for R2D3 only after ∼20B steps). Overlay of agent’s trajectories over 200 episodes. Blocks and sensors are not shown for clarity. R2D2 appears to follow a random walk. R2D3 concentrates on a particular spatial region.
Further detail of guided exploration behavior in the Push Blocks task:

