1. Introduction to Policy Search
So far, in order to learn a policy, we have focused on the value-based approach, where we approximate the optimal state-action value function with parameters $\theta$,
$$
Q_{\theta}(s,a) \approx Q^{\pi}(s,a)
$$
and then use $Q_{\theta}$ to extract a policy, e.g. via $\epsilon$-greedy. However, we can also use a policy-based approach to directly parameterize the policy:
$$
\pi_{\theta}(a|s) = \mathbb{P}[a|s;{\theta}]
$$
The goal is to find the policy $\pi_{\theta}$ with the highest value function $V^{\pi_\theta}$.
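As a concrete (if minimal) example of such a direct parameterization, the sketch below implements a softmax policy over linear action preferences. The feature sizes, random weights, and function names are made up for illustration and do not come from a specific library.

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """pi_theta(a|s): a softmax over the linear action preferences theta @ phi(s)."""
    prefs = theta @ phi_s            # one preference score per action
    prefs = prefs - prefs.max()      # subtract max for numerical stability
    expd = np.exp(prefs)
    return expd / expd.sum()         # probabilities over actions

# Hypothetical sizes and random parameters, purely for illustration.
rng = np.random.default_rng(0)
n_actions, n_features = 4, 8
theta = rng.normal(size=(n_actions, n_features))
phi_s = rng.normal(size=n_features)           # feature vector for some state s

probs = softmax_policy(theta, phi_s)          # a valid distribution over the 4 actions
action = rng.choice(n_actions, p=probs)       # sample a ~ pi_theta(.|s)
```

Because $\pi_\theta$ is an explicit distribution over actions, we can sample (possibly stochastic) behaviour directly from it, rather than acting greedily with respect to $Q_\theta$.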
Policy-Based RL
- Advantages
    - Better convergence properties
    - Effective in high-dimensional or continuous action spaces (see the Gaussian-policy sketch after this list)
    - Can learn stochastic policies
- Disadvantages
    - Typically converges to a local rather than global optimum
    - Evaluating a policy is typically inefficient and has high variance
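To illustrate the continuous-action advantage, a policy can be parameterized directly as a distribution over real-valued actions, e.g. a Gaussian whose mean is linear in the state features, so acting never requires a maximization over $Q(s,a)$. This is a minimal sketch with made-up sizes and a fixed standard deviation.

```python
import numpy as np

def gaussian_policy_sample(theta, phi_s, sigma=0.5, rng=None):
    """Sample a continuous action a ~ N(mu_theta(s), sigma^2),
    where the mean mu_theta(s) = theta @ phi(s) is linear in the features."""
    rng = rng if rng is not None else np.random.default_rng()
    mu = theta @ phi_s                          # mean action for this state
    return rng.normal(loc=mu, scale=sigma)

# Hypothetical 1-D action and 6 state features, for illustration only.
rng = np.random.default_rng(1)
theta = rng.normal(size=6)
phi_s = rng.normal(size=6)
a = gaussian_policy_sample(theta, phi_s, rng=rng)   # real-valued action, no argmax over Q
```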
2. Stochastic Policies
Aliased Gridworld
The two grey states of the gridworld are aliased: the agent's features cannot distinguish them, so any policy must choose its action in the same way in both.
- Deterministic Policy
    - Under aliasing, an optimal deterministic policy must pick the same action in both grey states: it will either move W or move E in both, and so can get stuck and never reach the goal.
    - Value-based RL learns a near-deterministic policy, so it will traverse the corridor for a long time.
- Stochastic Policy
    - An optimal stochastic policy will randomly move E or W in the grey states, and will reach the goal in a few steps with high probability.
    - Policy-based RL can learn this optimal stochastic policy (see the sketch below).
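To make this concrete, here is a minimal sketch that abstracts the aliased gridworld as a 1-D corridor: positions 0-4, goal at position 2, with positions 1 and 3 aliased. The layout, observation names, and step cap are assumptions chosen for illustration, not the exact gridworld from the figure, but the mechanism is the same: any policy is forced to act identically in the aliased states.

```python
import random

def observe(pos):
    """Positions 1 and 3 return the same observation: they are aliased."""
    if pos == 0:
        return "wall_west"   # far-west cell
    if pos == 4:
        return "wall_east"   # far-east cell
    return "aliased"         # positions 1 and 3 are indistinguishable

def step(pos, action):
    """Move one cell east (+1) or west (-1), staying inside the corridor."""
    return min(4, max(0, pos + (1 if action == "E" else -1)))

def steps_to_goal(policy, start, max_steps=100):
    """Number of steps the policy takes to reach the goal at position 2."""
    pos = start
    for t in range(1, max_steps + 1):
        pos = step(pos, policy(observe(pos)))
        if pos == 2:
            return t
    return max_steps  # never reached the goal within the cap

def deterministic_policy(obs):
    # Forced to pick one fixed action for the (shared) aliased observation.
    return {"wall_west": "E", "wall_east": "W", "aliased": "W"}[obs]

def stochastic_policy(obs):
    # Moves E or W with probability 0.5 in the aliased states.
    if obs == "aliased":
        return random.choice(["E", "W"])
    return {"wall_west": "E", "wall_east": "W"}[obs]

random.seed(0)
for name, pi in [("deterministic", deterministic_policy),
                 ("stochastic", stochastic_policy)]:
    trials = [steps_to_goal(pi, s) for s in (0, 1, 3, 4) for _ in range(200)]
    print(name, sum(trials) / len(trials))
# The deterministic policy oscillates forever between positions 0 and 1
# when it starts on the west side, while the stochastic policy reaches
# the goal from every start state in a few steps on average.
```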
3. Policy Optimization
3-1. Policy Objective Functions
- We need to be able to measure how well the policy ${\pi}_{\theta}(a|s)$ is performing in order to optimize it; two common objective functions are listed below (and computed for a toy MDP after this list)
- Start Value: expected value of the start state (episodic environment)
$$
J_1(\theta) = V^{\pi_\theta}(s_1) = \mathbb{E}_{\pi_\theta}[v_1]
$$
- Average Value (continuing environment)
$$
J_{avV}(\theta) = \sum_{s} d^{\pi_\theta}(s) V^{\pi_\theta}(s)
$$
where $d^{\pi_\theta}(s)$ is the stationary distribution of the Markov chain induced by $\pi_\theta$.
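To make both objectives concrete, the sketch below evaluates them on a tiny, fully known 3-state MDP. The policy-induced transition matrix, rewards, and discount factor are invented for illustration: $V^{\pi_\theta}$ is obtained by solving the Bellman equation, $J_1$ reads off the start state's value, and $J_{avV}$ weights the values by the stationary distribution $d^{\pi_\theta}$.

```python
import numpy as np

# Hypothetical 3-state continuing MDP, already "flattened" under a fixed
# pi_theta: P[s, s'] is the state-transition matrix induced by the policy,
# r[s] is the expected per-step reward in state s, gamma is a discount factor.
P = np.array([[0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6],
              [0.5, 0.3, 0.2]])
r = np.array([0.0, 1.0, 2.0])
gamma = 0.9

# V^{pi_theta} solves the Bellman equation V = r + gamma * P @ V.
V = np.linalg.solve(np.eye(3) - gamma * P, r)

# Start value J_1(theta): the value of the start state s_1 (index 0 here).
J_start = float(V[0])

# Average value J_avV(theta): values weighted by the stationary distribution
# d^{pi_theta}, the left eigenvector of P associated with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
d = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
d = d / d.sum()                      # normalize to a probability distribution
J_avg = float(d @ V)

print("J_1 =", J_start, " J_avV =", J_avg)
```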