1. Introduction to Policy Search
So far, in order to learn a policy, we have focused on the value-based approach, where we approximate the optimal state-action value function with parameters $\theta$,
$$
Q_{\theta}(s,a) \approx Q^{\pi}(s,a)
$$
and then use $Q_{\theta}$ to extract a policy, e.g. via $\epsilon$-greedy. However, we can also use a policy-based approach to directly parameterize the policy:
$$
\pi_{\theta}(a|s) = \mathbb{P}[a|s;{\theta}]
$$
The goal is to find the policy $\pi_{\theta}$ with the highest value function $V^{\pi_\theta}$.
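As a concrete (if minimal) example of such a direct parameterization, the sketch below implements a softmax policy over linear action preferences. The feature sizes, random weights, and function names are made up for illustration and do not come from a specific library.

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """pi_theta(a|s): a softmax over the linear action preferences theta @ phi(s)."""
    prefs = theta @ phi_s            # one preference score per action
    prefs = prefs - prefs.max()      # subtract max for numerical stability
    expd = np.exp(prefs)
    return expd / expd.sum()         # probabilities over actions

# Hypothetical sizes and random parameters, purely for illustration.
rng = np.random.default_rng(0)
n_actions, n_features = 4, 8
theta = rng.normal(size=(n_actions, n_features))
phi_s = rng.normal(size=n_features)           # feature vector for some state s

probs = softmax_policy(theta, phi_s)          # a valid distribution over the 4 actions
action = rng.choice(n_actions, p=probs)       # sample a ~ pi_theta(.|s)
```

Because $\pi_\theta$ is an explicit distribution over actions, we can sample (possibly stochastic) behaviour directly from it, rather than acting greedily with respect to $Q_\theta$.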
Policy-Based RL
- Advantages
    - Better convergence properties
    - Effective in high-dimensional or continuous action spaces (see the Gaussian-policy sketch after this list)
    - Can learn stochastic policies
- Disadvantages
    - Typically converges to a local rather than global optimum
    - Evaluating a policy is typically inefficient and has high variance
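To illustrate the continuous-action advantage, a policy can be parameterized directly as a distribution over real-valued actions, e.g. a Gaussian whose mean is linear in the state features, so acting never requires a maximization over $Q(s,a)$. This is a minimal sketch with made-up sizes and a fixed standard deviation.

```python
import numpy as np

def gaussian_policy_sample(theta, phi_s, sigma=0.5, rng=None):
    """Sample a continuous action a ~ N(mu_theta(s), sigma^2),
    where the mean mu_theta(s) = theta @ phi(s) is linear in the features."""
    rng = rng if rng is not None else np.random.default_rng()
    mu = theta @ phi_s                          # mean action for this state
    return rng.normal(loc=mu, scale=sigma)

# Hypothetical 1-D action and 6 state features, for illustration only.
rng = np.random.default_rng(1)
theta = rng.normal(size=6)
phi_s = rng.normal(size=6)
a = gaussian_policy_sample(theta, phi_s, rng=rng)   # real-valued action, no argmax over Q
```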
2. Stochastic Policies
Aliased Gridworld
The two grey states of the gridworld are aliased: the agent's features cannot distinguish them, so any policy must choose its action in the same way in both.
- Deterministic Policy
    - Under aliasing, an optimal deterministic policy must pick the same action in both grey states: it will either move W or move E in both, and so can get stuck and never reach the goal.
    - Value-based RL learns a near-deterministic policy, so it will traverse the corridor for a long time.
- Stochastic Policy
    - An optimal stochastic policy will randomly move E or W in the grey states, and will reach the goal in a few steps with high probability.
    - Policy-based RL can learn this optimal stochastic policy (see the sketch below).
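To make this concrete, here is a minimal sketch that abstracts the aliased gridworld as a 1-D corridor: positions 0-4, goal at position 2, with positions 1 and 3 aliased. The layout, observation names, and step cap are assumptions chosen for illustration, not the exact gridworld from the figure, but the mechanism is the same: any policy is forced to act identically in the aliased states.

```python
import random

def observe(pos):
    """Positions 1 and 3 return the same observation: they are aliased."""
    if pos == 0:
        return "wall_west"   # far-west cell
    if pos == 4:
        return "wall_east"   # far-east cell
    return "aliased"         # positions 1 and 3 are indistinguishable

def step(pos, action):
    """Move one cell east (+1) or west (-1), staying inside the corridor."""
    return min(4, max(0, pos + (1 if action == "E" else -1)))

def steps_to_goal(policy, start, max_steps=100):
    """Number of steps the policy takes to reach the goal at position 2."""
    pos = start
    for t in range(1, max_steps + 1):
        pos = step(pos, policy(observe(pos)))
        if pos == 2:
            return t
    return max_steps  # never reached the goal within the cap

def deterministic_policy(obs):
    # Forced to pick one fixed action for the (shared) aliased observation.
    return {"wall_west": "E", "wall_east": "W", "aliased": "W"}[obs]

def stochastic_policy(obs):
    # Moves E or W with probability 0.5 in the aliased states.
    if obs == "aliased":
        return random.choice(["E", "W"])
    return {"wall_west": "E", "wall_east": "W"}[obs]

random.seed(0)
for name, pi in [("deterministic", deterministic_policy),
                 ("stochastic", stochastic_policy)]:
    trials = [steps_to_goal(pi, s) for s in (0, 1, 3, 4) for _ in range(200)]
    print(name, sum(trials) / len(trials))
# The deterministic policy oscillates forever between positions 0 and 1
# when it starts on the west side, while the stochastic policy reaches
# the goal from every start state in a few steps on average.
```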
3. Policy Optimization
3-1. Policy Objective Functions
- We need to be able to measure how well the policy ${\pi}_{\theta}(a|s)$ is performing in order to optimize it; two common objective functions are listed below (and computed for a toy MDP after this list)
- Start Value: expected value of the start state (episodic environment)
$$
J_1(\theta) = V^{\pi_\theta}(s_1) = \mathbb{E}_{\pi_\theta}[v_1]
$$
- Average Value (continuing environment)
$$
J_{avV}(\theta) = \sum_{s} d^{\pi_\theta}(s) V^{\pi_\theta}(s)
$$
where $d^{\pi_\theta}(s)$ is the stationary distribution of the Markov chain induced by $\pi_\theta$.
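To make both objectives concrete, the sketch below evaluates them on a tiny, fully known 3-state MDP. The policy-induced transition matrix, rewards, and discount factor are invented for illustration: $V^{\pi_\theta}$ is obtained by solving the Bellman equation, $J_1$ reads off the start state's value, and $J_{avV}$ weights the values by the stationary distribution $d^{\pi_\theta}$.

```python
import numpy as np

# Hypothetical 3-state continuing MDP, already "flattened" under a fixed
# pi_theta: P[s, s'] is the state-transition matrix induced by the policy,
# r[s] is the expected per-step reward in state s, gamma is a discount factor.
P = np.array([[0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6],
              [0.5, 0.3, 0.2]])
r = np.array([0.0, 1.0, 2.0])
gamma = 0.9

# V^{pi_theta} solves the Bellman equation V = r + gamma * P @ V.
V = np.linalg.solve(np.eye(3) - gamma * P, r)

# Start value J_1(theta): the value of the start state s_1 (index 0 here).
J_start = float(V[0])

# Average value J_avV(theta): values weighted by the stationary distribution
# d^{pi_theta}, the left eigenvector of P associated with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
d = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
d = d / d.sum()                      # normalize to a probability distribution
J_avg = float(d @ V)

print("J_1 =", J_start, " J_avV =", J_avg)
```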