<aside> ๐Ÿ’ก Focus on the case where someone has already collected the data

</aside>

The Problem

What properties should a safe batch reinforcement learning algorithm have?

1. Notation

$$ V^\pi = \mathbb{E}[\sum^L_{t=1}\gamma^t R_t|\pi] $$

Safe batch reinforcement learning algorithm

$$ P_r(V^{A(D)}\geq V^{\pi_b}) \geq 1-\delta $$

$V^{\pi_b}$ โ†’ value of the policy used to generate data