<aside>
💡 Focus on the case where someone has already collected the data
</aside>
The Problem
- If you apply an existing method, do you have confidence that it will work?
What properties should a safe batch reinforcement learning algorithm have?
- Given past experience from current policies, produce a new policy that
- is better than the current policy with probability $1-\delta$
- lets you choose $\delta$
- does not depend on hyperparameters
1. Notation
- Policy $\pi$ : $\pi(a \mid s) = P(a_t=a \mid s_t=s)$
- Trajectory : $T=(s_1,a_1,r_1,s_2,a_2,r_2,\dots,s_L,a_L,r_L)$
- Historical data : $D = \{T_1, T_2, \dots, T_n\}$
- The historical data is generated by the behavior policy, $\pi_b$
- Objective:
$$
V^\pi = \mathbb{E}[\sum^L_{t=1}\gamma^t R_t|\pi]
$$
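As a minimal sketch of the notation above, the snippet below computes the discounted return of one trajectory and averages it over $D$. Because $D$ was generated by $\pi_b$, that average is a plain Monte Carlo estimate of $V^{\pi_b}$. The type aliases, function names, and toy trajectories are illustrative assumptions, not part of the original notes.

```python
from typing import List, Tuple

# A trajectory is a list of (state, action, reward) steps for t = 1..L.
Step = Tuple[int, int, float]
Trajectory = List[Step]


def discounted_return(trajectory: Trajectory, gamma: float) -> float:
    """Sum of gamma^t * r_t for t = 1..L, using the same indexing as the objective."""
    return sum(gamma ** t * r for t, (_s, _a, r) in enumerate(trajectory, start=1))


def estimate_behavior_value(data: List[Trajectory], gamma: float) -> float:
    """Average return over D: an on-policy Monte Carlo estimate of V^{pi_b}."""
    return sum(discounted_return(traj, gamma) for traj in data) / len(data)


# Hypothetical toy data set D with two short trajectories.
D = [
    [(0, 1, 1.0), (1, 0, 0.0), (2, 1, 2.0)],
    [(0, 0, 0.0), (1, 1, 1.0)],
]
v_b_hat = estimate_behavior_value(D, gamma=0.5)
print(v_b_hat)
```

This estimate only covers the policy that produced the data. Evaluating a different policy from the same $D$ needs an off-policy estimator, which this sketch does not attempt.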
Safe batch reinforcement learning algorithm
- Reinforcement learning algorithm, $A$
- Historical data $D$ (random variable)
- Policy produced by the algorithm $A(D)$ (random variable)
- Safe batch reinforcement learning algorithm $A$ satisfies :
$$
\Pr\left(V^{A(D)}\geq V^{\pi_b}\right) \geq 1-\delta
$$
$V^{\pi_b}$ : value of the policy used to generate the data
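The probability in this definition is taken over the random data set $D$, so in a simulator where the true values are known it can be estimated by drawing many data sets and counting how often $A(D)$ is at least as good as $\pi_b$. The sketch below does this for a hypothetical two-action, single-step problem with made-up reward means, and for a deliberately naive algorithm that picks the action with the highest sample mean and runs no safety check. Every name and number here is an assumption for illustration. In a real batch setting the true values are unknown, so this check is not available.

```python
import random

TRUE_MEANS = [0.5, 0.45]     # hypothetical Bernoulli reward mean of each action
BEHAVIOR_PROBS = [0.8, 0.2]  # hypothetical behavior policy pi_b


def policy_value(probs):
    """True value of a policy, given as action probabilities (simulator only)."""
    return sum(p * m for p, m in zip(probs, TRUE_MEANS))


def collect_data(n, rng):
    """Draw n (action, reward) samples by following pi_b."""
    data = []
    for _ in range(n):
        a = 0 if rng.random() < BEHAVIOR_PROBS[0] else 1
        r = 1.0 if rng.random() < TRUE_MEANS[a] else 0.0
        data.append((a, r))
    return data


def naive_algorithm(data):
    """A(D): deterministic policy on the action with the highest sample mean."""
    means = []
    for a in (0, 1):
        rewards = [r for act, r in data if act == a]
        means.append(sum(rewards) / len(rewards) if rewards else 0.0)
    best = 0 if means[0] >= means[1] else 1
    return [1.0 if a == best else 0.0 for a in (0, 1)]


def estimate_safety_probability(algorithm, n, trials, seed=0):
    """Fraction of sampled data sets D for which V^{A(D)} >= V^{pi_b}."""
    rng = random.Random(seed)
    v_b = policy_value(BEHAVIOR_PROBS)
    safe = sum(
        policy_value(algorithm(collect_data(n, rng))) >= v_b for _ in range(trials)
    )
    return safe / trials


p_safe = estimate_safety_probability(naive_algorithm, n=20, trials=2000)
print(f"Estimated Pr(V^A(D) >= V^pi_b): {p_safe:.3f}")
```

If the estimated probability falls below $1-\delta$ for the chosen $\delta$, this algorithm is not safe in the sense defined above for that data set size.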