1. Dynamic programming
- Known model $P(s' \mid s, a)$: rewards and the expectation over next states are computed exactly

- Dynamic programming update: $V_k^{\pi}(s) \approx \mathbb{E}_{\pi}\left[ r_t + \gamma V_{k-1}(s_{t+1}) \mid s_t = s \right]$ (see the sketch after this list)
- Requires a model of the MDP $M$
- Bootstraps future return using value estimates
- Requires the Markov assumption: bootstraps regardless of history
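
A minimal sketch of iterative policy evaluation with a known model. The array shapes (`P[a]` as an $|S| \times |S|$ transition matrix, `R[s, a]` as expected reward, `policy[s, a]` as action probabilities) are assumptions for illustration, not from the notes.

```python
import numpy as np

def dp_policy_evaluation(P, R, policy, gamma=0.9, tol=1e-6):
    """Iterative policy evaluation with a known model (hypothetical shapes:
    P[a][s, s'] transition probs, R[s, a] expected reward, policy[s, a] probs)."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a in range(n_actions):
                # Bellman backup: bootstrap on the previous estimate V_{k-1}
                V_new[s] += policy[s, a] * (R[s, a] + gamma * P[a][s] @ V)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```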
2. Monte Carlo policy evaluation
- Does not require knowledge of MDP dynamics/rewards
- No bootstrapping
- Does not assume state is Markov
- Can only be applied to episodic MDPs
2-1. First-visit Monte Carlo
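A minimal sketch, assuming each episode is given as a list of `(state, reward)` pairs sampled under the policy (a hypothetical format, not from the notes). Only the return from the first visit to a state in each episode is averaged.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """First-visit MC: average the return G_t from the first visit to each state."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Compute returns backwards from the end of the episode
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        seen = set()
        for state, G in returns:
            if state not in seen:          # first visit only
                seen.add(state)
                returns_sum[state] += G
                returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```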

2-2. Every-visit Monte Carlo
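Same setup as the first-visit sketch above (hypothetical `(state, reward)` episode format); the only change is that every occurrence of a state in an episode contributes its return to the average.

```python
from collections import defaultdict

def every_visit_mc(episodes, gamma=0.9):
    """Every-visit MC: every occurrence of a state contributes its return."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G        # no first-visit check
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```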

2-3. Incremental Monte Carlo
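A sketch of the incremental form (shown here as every-visit, an assumption): instead of storing all returns, keep a running mean with the update $V(s) \leftarrow V(s) + \frac{1}{N(s)}\big(G - V(s)\big)$.

```python
from collections import defaultdict

def incremental_mc(episodes, gamma=0.9):
    """Incremental MC: running-mean update instead of storing all returns."""
    V = defaultdict(float)
    N = defaultdict(int)
    for episode in episodes:
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1
            V[state] += (G - V[state]) / N[state]   # V <- V + (1/N)(G - V)
    return dict(V)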

<aside>
👩🏼🏫 Skew the running average so it weights recent data more heavily, because the real domain is non-stationary (the MDP changes over time).
</aside>
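
One way to do this (a sketch, with `alpha` as a hypothetical constant step size): replace $\frac{1}{N(s)}$ with a fixed $\alpha$, which gives an exponentially weighted average that tracks recent returns.

```python
def incremental_mc_update(V, state, G, alpha=0.1):
    """Constant step size alpha (instead of 1/N): exponentially weights
    recent returns, useful when the MDP is non-stationary."""
    V[state] += alpha * (G - V[state])
    return V
```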