1. Dynamic programming
- Known model $P(s' \mid s, a)$: rewards and the expectation over next states are computed exactly

- Dynamic programming update: $V_k^{\pi}(s) \approx \mathbb{E}_{\pi}\left[ r_t + \gamma V_{k-1}(s_{t+1}) \mid s_t = s \right]$ (see the sketch after this list)
- Requires a model of the MDP $M$
- Bootstraps future return using value estimates
- Requires the Markov assumption: bootstraps regardless of history
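
A minimal sketch of iterative policy evaluation with a known model. The array shapes (`P[a]` as an $|S| \times |S|$ transition matrix, `R[s, a]` as expected reward, `policy[s, a]` as action probabilities) are assumptions for illustration, not from the notes.

```python
import numpy as np

def dp_policy_evaluation(P, R, policy, gamma=0.9, tol=1e-6):
    """Iterative policy evaluation with a known model (hypothetical shapes:
    P[a][s, s'] transition probs, R[s, a] expected reward, policy[s, a] probs)."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a in range(n_actions):
                # Bellman backup: bootstrap on the previous estimate V_{k-1}
                V_new[s] += policy[s, a] * (R[s, a] + gamma * P[a][s] @ V)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```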
2. Monte Carlo policy evaluation
- Does not require knowledge of MDP dynamics/rewards
- No bootstrapping
- Does not assume state is Markov
- Can only be applied to episodic MDPs
2-1. First-visit Monte Carlo
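A minimal sketch, assuming each episode is given as a list of `(state, reward)` pairs sampled under the policy (a hypothetical format, not from the notes). Only the return from the first visit to a state in each episode is averaged.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """First-visit MC: average the return G_t from the first visit to each state."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Compute returns backwards from the end of the episode
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        seen = set()
        for state, G in returns:
            if state not in seen:          # first visit only
                seen.add(state)
                returns_sum[state] += G
                returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```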

2-2. Every-visit Monte Carlo
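Same setup as the first-visit sketch above (hypothetical `(state, reward)` episode format); the only change is that every occurrence of a state in an episode contributes its return to the average.

```python
from collections import defaultdict

def every_visit_mc(episodes, gamma=0.9):
    """Every-visit MC: every occurrence of a state contributes its return."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G        # no first-visit check
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```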

2-3. Incremental Monte Carlo
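A sketch of the incremental form (shown here as every-visit, an assumption): instead of storing all returns, keep a running mean with the update $V(s) \leftarrow V(s) + \frac{1}{N(s)}\big(G - V(s)\big)$.

```python
from collections import defaultdict

def incremental_mc(episodes, gamma=0.9):
    """Incremental MC: running-mean update instead of storing all returns."""
    V = defaultdict(float)
    N = defaultdict(int)
    for episode in episodes:
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1
            V[state] += (G - V[state]) / N[state]   # V <- V + (1/N)(G - V)
    return dict(V)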

<aside>
👩🏼🏫 Skew the running average so it weights recent data more heavily, because the real domain is non-stationary (the MDP changes over time).
</aside>
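
One way to do this (a sketch, with `alpha` as a hypothetical constant step size): replace $\frac{1}{N(s)}$ with a fixed $\alpha$, which gives an exponentially weighted average that tracks recent returns.

```python
def incremental_mc_update(V, state, G, alpha=0.1):
    """Constant step size alpha (instead of 1/N): exponentially weights
    recent returns, useful when the MDP is non-stationary."""
    V[state] += alpha * (G - V[state])
    return V
```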