<aside> 💡 Focus on the case where someone has already collected the data

</aside>

The Problem

If you apply an existing method, do you have confidence that it will work?

What properties should a safe batch reinforcement learning algorithm have?

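Given past experience collected under the current policy, produce a new policy that
- performs at least as well as the current policy with probability at least $1-\delta$
- lets the user choose $\delta$
- keeps this guarantee regardless of hyperparameter settings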

1. Notation

Policy $\pi$ : $\pi(a \mid s) = P(a_t=a \mid s_t=s)$
Trajectory : $T=(s_1,a_1,r_1,s_2,a_2,r_2,\dots,s_L,a_L,r_L)$
Historical data : $D = \{T_1, T_2, \dots, T_n\}$
The historical data are generated by the behavior policy, $\pi_b$
Objective:

$$ V^\pi = \mathbb{E}[\sum^L_{t=1}\gamma^t R_t|\pi] $$
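
As a rough illustration of the objective, the sketch below computes the discounted return of one trajectory and averages it over $D$. Because $D$ was generated by $\pi_b$, this average is an on-policy Monte Carlo estimate of $V^{\pi_b}$. The names `discounted_return` and `estimate_behavior_value`, and the choice to store each step as a `(state, action, reward)` tuple with integer ids, are assumptions made for this example and are not part of the notes.

```python
from typing import List, Tuple

# One step of a trajectory: (state, action, reward).
# Integer state/action ids are an assumption for this sketch.
Step = Tuple[int, int, float]
Trajectory = List[Step]


def discounted_return(trajectory: Trajectory, gamma: float) -> float:
    """Sum of gamma^t * r_t for t = 1..L, with the exponent starting at 1
    to match the objective as written in these notes."""
    return sum(gamma ** t * r for t, (_, _, r) in enumerate(trajectory, start=1))


def estimate_behavior_value(data: List[Trajectory], gamma: float) -> float:
    """Average discounted return over D.

    Since D is sampled from pi_b, this is a Monte Carlo estimate of V^{pi_b}.
    """
    if not data:
        raise ValueError("need at least one trajectory")
    return sum(discounted_return(T, gamma) for T in data) / len(data)


if __name__ == "__main__":
    # Two hand-made trajectories, purely illustrative.
    D = [[(0, 0, 1.0), (1, 1, 2.0)], [(0, 1, 4.0)]]
    print(estimate_behavior_value(D, gamma=0.5))
```

Estimating the value of a policy other than $\pi_b$ from the same data would need an off-policy correction, which this sketch does not attempt.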

Safe batch reinforcement learning algorithm

Reinforcement learning algorithm, $A$
Historical data $D$ (random variable)
Policy produced by the algorithm $A(D)$ (random variable)
A safe batch reinforcement learning algorithm $A$ satisfies:

$$ \Pr(V^{A(D)}\geq V^{\pi_b}) \geq 1-\delta $$

$V^{\pi_b}$ → value of the policy used to generate data
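
The probability in this definition is taken over the random dataset $D$, so one way to build intuition is to resample $D$ many times in a toy problem where $V^\pi$ is known exactly, and count how often $A(D)$ is at least as good as $\pi_b$. The sketch below does this for a made-up two-armed bandit (one step per trajectory, $\gamma = 1$). Everything in it is an assumption for illustration: the arm means, the behavior policy, and `greedy_algorithm`, which is a naive baseline and not the safe algorithm these notes are building toward.

```python
import random
from typing import Callable, List, Tuple

ARM_MEANS = (0.4, 0.6)  # assumed true mean rewards of the two arms
P_BEHAVIOR = 0.5        # pi_b picks arm 1 with this probability

Sample = Tuple[int, float]  # one-step trajectory: (action, reward)


def true_value(p_arm1: float) -> float:
    """Exact V^pi for a policy that picks arm 1 with probability p_arm1."""
    return (1 - p_arm1) * ARM_MEANS[0] + p_arm1 * ARM_MEANS[1]


def sample_dataset(n: int, rng: random.Random) -> List[Sample]:
    """Draw n one-step trajectories from the behavior policy."""
    data = []
    for _ in range(n):
        a = 1 if rng.random() < P_BEHAVIOR else 0
        r = 1.0 if rng.random() < ARM_MEANS[a] else 0.0
        data.append((a, r))
    return data


def greedy_algorithm(data: List[Sample]) -> float:
    """Naive A: commit to the arm with the higher empirical mean reward."""
    means = []
    for arm in (0, 1):
        rewards = [r for a, r in data if a == arm]
        if not rewards:           # an arm was never tried: keep pi_b
            return P_BEHAVIOR
        means.append(sum(rewards) / len(rewards))
    if means[0] == means[1]:      # tie: keep pi_b
        return P_BEHAVIOR
    return 1.0 if means[1] > means[0] else 0.0


def empirical_safety_rate(
    algorithm: Callable[[List[Sample]], float],
    n_datasets: int = 2000,
    n_traj: int = 20,
    seed: int = 0,
) -> float:
    """Fraction of resampled datasets D with V^{A(D)} >= V^{pi_b}."""
    rng = random.Random(seed)
    baseline = true_value(P_BEHAVIOR)
    safe = sum(
        true_value(algorithm(sample_dataset(n_traj, rng))) >= baseline
        for _ in range(n_datasets)
    )
    return safe / n_datasets


if __name__ == "__main__":
    print(empirical_safety_rate(greedy_algorithm))
```

Comparing the printed rate against a chosen $1-\delta$ shows whether this naive algorithm would meet the definition in this particular toy setting. An algorithm that always returns $\pi_b$ trivially achieves a rate of 1, which is why the definition is only interesting when paired with actual improvement.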