1. Introduction to Policy Search

So far, in order to learn a policy, we have focused on value-based approaches, where we approximate the state-action value function with parameters $\theta$,

$$ Q_{\theta}(s,a) \approx Q^{\pi}(s,a) $$

and then extract a policy from $Q_{\theta}$, e.g. by acting $\epsilon$-greedily with respect to it. However, we can also take a policy-based approach and directly parameterize the policy:

$$ \pi_{\theta}(a|s) = \mathbb{P}[a|s;{\theta}] $$

The goal is to find the parameters $\theta$ whose policy $\pi_\theta$ has the highest value function $V^{\pi_\theta}$.

[Figure: Policy-Based RL]
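
As a concrete (and hypothetical) illustration of a parameterized policy, the sketch below implements a softmax policy with linear action preferences, $\pi_\theta(a|s) \propto \exp(\theta^{\top}\phi(s,a))$. The feature function `phi`, its dimensionality, and the toy state are assumptions made here, not part of the original notes.

```python
import numpy as np

def softmax_policy(theta, phi, state, actions):
    """pi_theta(a|s) proportional to exp(theta . phi(s, a)):
    a softmax (Gibbs) policy with linear action preferences."""
    prefs = np.array([theta @ phi(state, a) for a in actions])
    prefs -= prefs.max()                       # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return probs

def sample_action(theta, phi, state, actions, rng=np.random.default_rng()):
    """Draw an action from pi_theta(.|s)."""
    probs = softmax_policy(theta, phi, state, actions)
    return actions[rng.choice(len(actions), p=probs)]

# Hypothetical example: 2 actions, 4-dimensional state-action features.
phi = lambda s, a: np.concatenate([s, [1.0 if a == 1 else 0.0]])
theta = np.zeros(4)                            # policy parameters
state = np.array([0.5, -0.2, 1.0])
print(softmax_policy(theta, phi, state, actions=[0, 1]))   # [0.5, 0.5]
```

Because the action probabilities change smoothly with $\theta$, they can be differentiated, which is exactly what the policy optimization methods below rely on.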

2. Stochastic Policies

[Figure: Aliased Gridworld]

In the aliased gridworld the agent observes only features of its immediate surroundings, so two different squares produce exactly the same observation (they are aliased). A deterministic policy must take the same action in both aliased squares, and as a result it gets stuck from one side of the grid and never reaches the goal. A stochastic policy that randomizes between the two directions in the aliased squares reaches the goal in a few steps with probability 1. Under partial observability (state aliasing), an optimal stochastic policy can therefore be strictly better than any deterministic policy.

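To make this concrete, here is a minimal sketch (my own 1-D abstraction of the gridworld, not code from the notes): cells 1 and 3 emit the same observation, the goal is cell 2, and episodes start in either corner. Any fixed choice of action for the aliased observation traps the agent from one of the two starting corners, while randomizing between left and right always reaches the goal.

```python
import random

GOAL, ALIASED, N_CELLS, MAX_STEPS = 2, "aliased", 5, 200

def observe(cell):
    # Cells 1 and 3 are indistinguishable; the corners are unique.
    return ALIASED if cell in (1, 3) else cell

def step(cell, action):
    # action: -1 = move left, +1 = move right; walls at the ends.
    return min(max(cell + action, 0), N_CELLS - 1)

def run_episode(policy, start):
    cell = start
    for t in range(MAX_STEPS):
        if cell == GOAL:
            return t                         # steps taken to reach the goal
        cell = step(cell, policy(observe(cell)))
    return None                              # never reached the goal

def deterministic(obs):
    # A deterministic policy must commit to one action per observation.
    return +1 if obs in (0, ALIASED) else -1   # always "right" when aliased

def stochastic(obs):
    if obs == ALIASED:
        return random.choice([-1, +1])       # randomize in the aliased cells
    return +1 if obs == 0 else -1

for name, policy in [("deterministic", deterministic), ("stochastic", stochastic)]:
    results = [run_episode(policy, start) for start in (0, 4) for _ in range(1000)]
    fails = results.count(None)
    mean = sum(r for r in results if r is not None) / max(len(results) - fails, 1)
    print(f"{name:13s}  failures: {fails:5d}/2000   mean steps: {mean:.1f}")
```

Running it shows the deterministic policy fails on half of the starts, while the stochastic policy reaches the goal every time in a handful of steps on average.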

3. Policy Optimization

3-1. Policy Objective Functions

In episodic environments we can use the start value, the value of the start state $s_1$:

$$ J_1(\theta) = V^{\pi_\theta}(s_1) = \mathbb{E}_{\pi_\theta}[v_1] $$

In continuing environments we can instead use the average value, weighting each state's value by $d^{\pi_\theta}(s)$, the stationary distribution of the Markov chain induced by $\pi_\theta$:

$$ J_{avV}(\theta) = \sum_{s} d^{\pi_\theta}(s)\, V^{\pi_\theta}(s) $$
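
To see how these objectives are evaluated, here is a minimal sketch for a small MDP with known dynamics; the three-state MDP, the fixed policy, and the discount factor are made up purely for illustration. It solves the Bellman expectation equation for $V^{\pi}$, finds the stationary distribution $d^{\pi}$ of the policy-induced Markov chain, and computes both objectives.

```python
import numpy as np

# A hypothetical 3-state, 2-action MDP (numbers made up for illustration).
gamma = 0.9
P = np.array([                        # P[a, s, s'] transition probabilities
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
    [[0.2, 0.6, 0.2], [0.0, 0.3, 0.7], [0.3, 0.3, 0.4]],
])
R = np.array([[0.0, 1.0], [0.5, 0.0], [1.0, 2.0]])   # R[s, a]
pi = np.array([[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]])  # pi[s, a]

# Markov chain and expected reward induced by the policy.
P_pi = np.einsum("sa,asp->sp", pi, P)   # P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
r_pi = (pi * R).sum(axis=1)

# V^pi from the Bellman expectation equation: V = r_pi + gamma * P_pi V.
V = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Stationary distribution d^pi of the induced chain: d = d P_pi, sum(d) = 1.
evals, evecs = np.linalg.eig(P_pi.T)
d = np.real(evecs[:, np.argmax(np.real(evals))])
d /= d.sum()

J_1 = V[0]        # start value: value of the start state s_1
J_avV = d @ V     # average value: state values weighted by d^pi
print(f"V^pi = {V}, J_1 = {J_1:.3f}, J_avV = {J_avV:.3f}")
```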