R2D3

Tags: Chinese, RL, RL from Prior, BC

written by LiaoWC on 2023-12-19


Making Efficient Use of Demonstrations to Solve Hard Exploration Problems (R2D3)

This paper uses demonstrations to solve hard environments with the following properties:

  1. Sparse rewards
  2. Partial observability
  3. Highly variable initial conditions

The main contributions of this paper are:

  • Proposes a suite of task environments with the three properties above
  • Designs a method for making use of demonstrations
  • Provides detailed analysis of the method and the experiments

Experimental results:

  • R2D3, which uses demonstrations, performs better overall than the other methods
  • On some of these hard tasks, it even plays better than humans


Background


  • R2D2: a DQN-family method that combines an RNN with the Ape-X distributed design. To handle recurrence, it introduces a burn-in phase to deal with initializing the RNN hidden state (a minimal sketch follows this list).
  • DQfD: a DQN method that learns from demonstrations.
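
A minimal sketch of the burn-in idea (my own PyTorch illustration with made-up sizes, not R2D2's actual code): the actor stores the recurrent state alongside each replayed sequence, the learner unrolls the LSTM over the first `burn_in` steps without gradients just to warm up the hidden state, and only the remaining steps contribute to the loss.

```python
import torch
import torch.nn as nn

def burn_in_unroll(lstm: nn.LSTM, obs_seq, stored_state, burn_in: int):
    """obs_seq: [T, B, input_size]; stored_state: the (h, c) pair saved by the actor
    when this sequence was generated. Returns LSTM outputs for the training part only."""
    with torch.no_grad():
        # Warm up the hidden state on the first `burn_in` steps; no gradients flow here.
        _, state = lstm(obs_seq[:burn_in], stored_state)
    # Train on the remaining steps, starting from the warmed-up state.
    outputs, _ = lstm(obs_seq[burn_in:], state)
    return outputs

# Usage with made-up sizes (a real agent would put a Q-network on top of the outputs):
lstm = nn.LSTM(input_size=32, hidden_size=64)
obs_seq = torch.randn(80, 16, 32)            # sequence length 80, batch 16
h0 = c0 = torch.zeros(1, 16, 64)             # stored (possibly stale) recurrent state
q_features = burn_in_unroll(lstm, obs_seq, (h0, c0), burn_in=40)
```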

Method


R2D3 is essentially R2D2 with the use of demonstrations added on top.

The way demonstrations are used differs from DQfD: the loss is n-step double Q-learning (with n = 5) with a dueling architecture, and, unlike DQfD, no supervised loss is added.
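
For reference, the standard n-step double Q-learning target with online parameters $\theta$, target parameters $\theta^-$, and discount $\gamma$ (textbook form; R2D2 additionally applies an invertible value-function rescaling, omitted here):

$$
y_t = \sum_{k=0}^{n-1} \gamma^{k}\, r_{t+k} \;+\; \gamma^{n}\, Q_{\theta^{-}}\!\Big(s_{t+n},\ \arg\max_{a} Q_{\theta}(s_{t+n}, a)\Big), \qquad n = 5
$$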

R2D3 keeps the demonstrations in a separate demo replay buffer. At training time, a fixed fraction $\rho$ of the data in each batch is sampled from the demo replay buffer, and the remaining $1-\rho$ fraction is sampled from the online data. Both buffers are sampled with Prioritized Experience Replay.
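
A minimal sketch of this batching scheme (hypothetical `PrioritizedReplay` and `sample_batch` names, not the authors' implementation): each batch element comes from the demo buffer with probability $\rho$ and from the agent buffer otherwise, and both buffers sample by priority.

```python
import numpy as np

class PrioritizedReplay:
    """Toy prioritized buffer: samples items with probability proportional to priority**alpha."""
    def __init__(self, alpha=0.6):
        self.items, self.priorities, self.alpha = [], [], alpha

    def add(self, item, priority=1.0):
        self.items.append(item)
        self.priorities.append(priority)

    def sample(self, k):
        p = np.asarray(self.priorities, dtype=np.float64) ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.items), size=k, p=p)
        return [self.items[i] for i in idx]

def sample_batch(demo_buffer, agent_buffer, batch_size=256, rho=1 / 256):
    """Each batch element is drawn from the demo buffer with probability rho,
    otherwise from the agent (online) buffer; both are prioritized."""
    n_demo = np.random.binomial(batch_size, rho)
    batch = demo_buffer.sample(n_demo) + agent_buffer.sample(batch_size - n_demo)
    np.random.shuffle(batch)
    return batch
```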


Experimental Environment - Hard-Eight Task Suite


[https://deepmind.google/discover/blog/making-efficient-use-of-demonstrations-to-solve-hard-exploration-problems/](https://deepmind.google/discover/blog/making-efficient-use-of-demonstrations-to-solve-hard-exploration-problems/)


The Hard-Eight Task Suite is the experimental environment of this paper. It is designed to have the three properties listed at the start:

  1. Sparse rewards
    • In all but one task the only positive instantaneous reward obtained also ends the episode.
  2. Partial observability
    • The tasks are played from a first-person view
  3. Highly variable initial conditions
    • Colors, shapes, configurations, etc. vary across episodes
      • As a result, naively replaying the demonstration's action sequence is unlikely to work
  • Eight environments

    • Completing a task requires a meaningful sequence of steps; for example, the Baseball task:

      (Figure: the Baseball task)

  • Observation space: 96x72 RGB pixels

  • Action space: 46 discrete actions
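
To make the interface concrete, here is a hypothetical Gym-style declaration of these spaces (the real suite is DeepMind-internal, and the height/width ordering here is my assumption):

```python
import numpy as np
from gym import spaces  # assumption: a Gym-style interface, not the suite's actual API

# 96x72 RGB observations (height-width-channel ordering assumed) and 46 discrete actions.
observation_space = spaces.Box(low=0, high=255, shape=(72, 96, 3), dtype=np.uint8)
action_space = spaces.Discrete(46)
```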

Baselines


The following are the baselines that R2D3 is compared against in the experiments.

  • Behavior Cloning (BC)
    • Fit a parameterized policy mapping states to actions using the cross-entropy loss (a minimal sketch follows this list)
  • R2D2 (No Demonstration)
    • R2D3 with the demonstrations removed, i.e., with the demo ratio set to $\rho=0$.
  • DQfD (No Recurrence)
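
A minimal sketch of the BC objective (my own PyTorch illustration with a made-up network and dimensions): the policy is a classifier over the 46 discrete actions, trained with cross-entropy against the demonstrator's actions.

```python
import torch
import torch.nn as nn

# Hypothetical policy network: observation embedding -> logits over the 46 discrete actions.
policy = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 46))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
ce_loss = nn.CrossEntropyLoss()

def bc_step(states, demo_actions):
    """One behavior-cloning update on a batch of (state, expert action) pairs."""
    logits = policy(states)               # [B, 46]
    loss = ce_loss(logits, demo_actions)  # demo_actions: [B] integer action ids
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with random stand-in data:
loss = bc_step(torch.randn(32, 512), torch.randint(0, 46, (32,)))
```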

Experimental Setup


  • Ape-X-style distributed training: 256 $\epsilon$-greedy CPU-based actors + 1 GPU-based learner
    • As in Ape-X, each actor $i$ uses a different epsilon, $\epsilon_i\in[0.4^8,0.4]$, to diversify exploration (a sketch follows this list).
    • At least 10B actor steps for all tasks
  • General hyper-parameters are basically the same across methods; for R2D2 vs. R2D3, the demo ratio $\rho$ is additionally compared
  • Optimizer:
    • R2D3, R2D2, DQfD: Adam optimizer; learning rate $=2\times10^{-4}$
    • BC: Adam; learning rate: sweep over $\{10^{-5},10^{-4},10^{-3}\}$
  • Demonstration
    • Action-repeat = 2. The paper notes that action repeat = 4 is quite difficult for the human demonstrators.
    • 100 demonstrations for each task spread across three different experts
      • Each expert contributed roughly one-third of the demonstrations for each task
    • Demonstrations for the tasks were collected using keyboard and mouse controls mapped to the agent’s exact action space


  • Network architecture

    (Figure: network architecture)
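
A small sketch of the Ape-X-style per-actor epsilon schedule implied by the quoted range, using the usual Ape-X constants $\epsilon=0.4$, $\alpha=7$ (an assumption on my part; the setup above only states the interval $[0.4^8, 0.4]$):

```python
def actor_epsilons(num_actors=256, base_eps=0.4, alpha=7.0):
    """Ape-X style schedule: actor i uses epsilon = base_eps ** (1 + alpha * i / (N - 1)),
    spreading from base_eps (i = 0) down to base_eps ** (1 + alpha) = 0.4 ** 8 (i = N - 1)."""
    return [base_eps ** (1 + alpha * i / (num_actors - 1)) for i in range(num_actors)]

eps = actor_epsilons()
assert abs(eps[0] - 0.4) < 1e-12 and abs(eps[-1] - 0.4 ** 8) < 1e-12
```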

Experimental Results


Overall performance

R2D3 learns to play 6 of the 8 environments. On 2 of the environments, no method can be trained successfully. The methods other than R2D3 basically do not work.


Comparison of the demo ratio $\rho$

Averaged across tasks, $\rho = 1/256$ works best.


Videos

Human demonstrating

R2D3 on Hard Eight Tasks

Trajectory visualization

The Push Blocks task is described as follows:

The agent spawns in a medium sized room with a recessed sensor in the floor. There are several objects in the room that can be pushed but not lifted. The agent must push a block whose color matches the sensor into the recess in order to open a door to an adjoining room which contains a large apple which ends the episode. Pushing a wrong object into the recess makes the level impossible to complete.

For an animation, see the top middle of unnamed.gif. The gist is to push the block of the matching color to open the door and collect the apple.


The trajectory overlay was taken at the point of the training curve marked by the red arrow.

In it, R2D2 still looks close to random exploration, whereas R2D3 heads toward a particular region.

Spatial pattern of exploration behavior at ∼5B actor steps (reward-driven learning kicks off for R2D3 only after ∼20B steps). Overlay of agent’s trajectories over 200 episodes. Blocks and sensors are not shown for clarity. R2D2 appears to follow a random walk. R2D3 concentrates on a particular spatial region.

Further detail of the guided exploration behavior in the Push Blocks task is shown in the paper.