Finite MDPs
Finite Markov Decision Processes
Basic Settings
- State
- Action
- Reward (indexed with subscript $t+1$, since it follows action $A_t$)
The resulting trajectory looks like:
$$S_{0}, A_{0}, R_{1}, S_{1}, A_{1}, R_{2}, S_{2}, A_{2}, R_{3}, \ldots$$
A single interaction cycle is: state -> action -> reward -> next state.
Dynamics of MDPs
- state $s$: already observed
- action $a$: determined by ourselves
- *reward $r$: we cannot control
- *new state $s'$: we cannot control
The last two are random; they carry the dynamics of the MDP, conditional on the first two:
$$p\left(s^{\prime}, r \mid s, a\right) \doteq \operatorname{Pr}\left\{S_{t}=s^{\prime}, R_{t}=r \mid S_{t-1}=s, A_{t-1}=a\right\}$$
We do not know exactly which distribution $p$ is, but we know that $p$ defines the dynamics of the MDP. Here $s'$ and $r$ are two random variables, and $p$ is their joint distribution, conditional on $s$ and $a$.
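To make the notation concrete, below is a minimal sketch (in Python) of the dynamics $p(s', r \mid s, a)$ stored as a lookup table. The two states, the actions, the rewards, and all probabilities are invented purely for illustration.

```python
# Toy dynamics table p(s', r | s, a); all numbers here are made up.
# p[(s, a)] is a list of (next_state, reward, probability) outcomes.
p = {
    ("A", "stay"): [("A", 0.0, 0.9), ("B", 1.0, 0.1)],
    ("A", "move"): [("B", 1.0, 0.8), ("A", 0.0, 0.2)],
    ("B", "stay"): [("B", 2.0, 1.0)],
    ("B", "move"): [("A", 0.0, 0.7), ("B", 2.0, 0.3)],
}

# Sanity check: for each (s, a), the outcome probabilities sum to 1.
for (s, a), outcomes in p.items():
    assert abs(sum(prob for _, _, prob in outcomes) - 1.0) < 1e-9
```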
Properties of $p$
First, since $p$ is a probability distribution, it sums to one:
$$\sum_{s^{\prime} \in S} \sum_{r \in \mathcal{R}} p\left(s^{\prime}, r \mid s, a\right)=1$$
Summing over $r$ gives the marginal distribution of $s'$:
$$p\left(s^{\prime} \mid s, a\right) \doteq \operatorname{Pr}\left\{S_{t}=s^{\prime} \mid S_{t-1}=s, A_{t-1}=a\right\}=\sum_{r \in \mathcal{R}} p\left(s^{\prime}, r \mid s, a\right)$$
Similarly, summing over $s'$ gives the marginal distribution of $r$, from which the expected reward follows:
$$r(s, a) \doteq \mathbb{E}\left[R_{t} \mid S_{t-1}=s, A_{t-1}=a\right]=\sum_{r \in \mathcal{R}} r \sum_{s^{\prime} \in S} p\left(s^{\prime}, r \mid s, a\right)$$
Keep in mind that the expected reward is not a single number; it is a function of the pair $(s, a)$.
We can also condition on $s'$ instead of summing it out, which gives a function of the triple $(s, a, s')$:
$$r\left(s, a, s^{\prime}\right) \doteq \mathbb{E}\left[R_{t} \mid S_{t-1}=s, A_{t-1}=a, S_{t}=s^{\prime}\right]=\sum_{r \in \mathcal{R}} r \frac{p\left(s^{\prime}, r \mid s, a\right)}{p\left(s^{\prime} \mid s, a\right)}=\sum_{r \in \mathcal{R}} r \times p(r|s', s, a)$$
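Continuing the toy table `p` above, here is a sketch of how the marginal $p(s' \mid s, a)$, the expected reward $r(s, a)$, and $r(s, a, s')$ could be computed from the joint dynamics; the function names are just illustrative.

```python
def p_next_state(p, s, a, s_next):
    """Marginal p(s' | s, a): sum the joint over all rewards."""
    return sum(prob for sn, _, prob in p[(s, a)] if sn == s_next)

def expected_reward(p, s, a):
    """r(s, a): expected reward as a function of the pair (s, a)."""
    return sum(r * prob for _, r, prob in p[(s, a)])

def expected_reward_given_next(p, s, a, s_next):
    """r(s, a, s'): expected reward given the next state as well."""
    joint = sum(r * prob for sn, r, prob in p[(s, a)] if sn == s_next)
    return joint / p_next_state(p, s, a, s_next)

print(p_next_state(p, "A", "move", "B"))                # 0.8
print(expected_reward(p, "A", "move"))                  # 0.8
print(expected_reward_given_next(p, "A", "move", "B"))  # 1.0
```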
Return
The objective is to maximize the (discounted) total reward, called the return $G_t$:
$$G_{t} \doteq R_{t+1}+\gamma R_{t+2}+\gamma^{2} R_{t+3}+\cdots=\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$$
An important recursive relation is:
$$G_{t}=R_{t+1}+\gamma G_{t+1}$$
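A small sketch that computes a truncated discounted return and checks the recursion $G_t = R_{t+1} + \gamma G_{t+1}$; the reward sequence and $\gamma = 0.9$ are arbitrary choices.

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 3.0]   # R_{t+1}, R_{t+2}, ...
gamma = 0.9

g_t = discounted_return(rewards, gamma)
g_t_plus_1 = discounted_return(rewards[1:], gamma)

# Recursive relation: G_t = R_{t+1} + gamma * G_{t+1}
assert abs(g_t - (rewards[0] + gamma * g_t_plus_1)) < 1e-9
```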
Policies and Value Func
- Policy: given a state, a distribution over actions: $\pi(a \mid s)$
- Value function: estimates "how good it is" to be in a state (or to take an action in a state), in terms of expected return.
Value function of state (state value func)
$v_{\pi}(s)$: value function of a state $s$ under a policy $\pi$
- the expected return when starting in s and following $\pi$ thereafter
$$v_{\pi}(s) \doteq \mathbb{E}_{\pi}\left[G_{t} \mid S_{t}=s\right]=\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \mid S_{t}=s\right]$$
This is a function of $s$. Since the policy is a distribution $\pi(a \mid s)$, once $s$ is given the expectation $\mathbb{E}_{\pi}$ is taken over the actions drawn from $\pi$ (and over the dynamics that follow).
Value function of action (action value func)
the value of taking action $a$ in state $s$ under a policy $\pi$
$$q_{\pi}(s, a) \doteq \mathbb{E}_{\pi}\left[G_{t} \mid S_{t}=s, A_{t}=a\right]=\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \mid S_{t}=s, A_{t}=a\right]$$
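Both value functions are expectations of the return, so they can be estimated by averaging sampled returns. Below is a rough Monte Carlo sketch that reuses the toy dynamics table `p` from above; the uniform-random policy, the truncation horizon, and the number of rollouts are arbitrary assumptions made only for illustration.

```python
import random

gamma = 0.9
actions = ["stay", "move"]

def policy(s):
    # pi(a | s): uniform over the two toy actions (ignores s).
    return random.choice(actions)

def sample_step(p, s, a):
    # Draw (s', r) from the joint dynamics p(s', r | s, a).
    outcomes = p[(s, a)]
    s_next, r, _ = random.choices(outcomes, weights=[w for _, _, w in outcomes], k=1)[0]
    return s_next, r

def rollout_return(p, s, first_action=None, horizon=200):
    """Sample a truncated return starting from s, optionally fixing the first action."""
    g, discount = 0.0, 1.0
    for t in range(horizon):
        a = first_action if (t == 0 and first_action is not None) else policy(s)
        s, r = sample_step(p, s, a)
        g += discount * r
        discount *= gamma
    return g

# v_pi("A") and q_pi("A", "move"), estimated by averaging sampled returns.
v_A = sum(rollout_return(p, "A") for _ in range(2000)) / 2000
q_A_move = sum(rollout_return(p, "A", "move") for _ in range(2000)) / 2000
print(v_A, q_A_move)
```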
Bellman Equation
Let's derive it from the definitions above:
First, we have:
$$ \begin{aligned} v_{\pi}(s) &= \mathbb{E}_{\pi}\left[G_{t} \mid S_{t}=s\right] \\ &= \mathbb{E}_{\pi}\left[R_{t+1}+\gamma G_{t+1} \mid S_{t}=s\right] \end{aligned} $$
1. What is $\mathbb{E}_{\pi}$?
Consider a full cycle $(s, a, r, s')$; we want the expected value of $r$:
- $s$ is given
- $a$ is a random variable, determined by $\pi(a \mid s)$
- $r$ and $s'$ are random variables, determined by $p(s', r \mid s, a)$
2. What is the remaining term, $\mathbb{E}_{\pi}\left[\gamma G_{t+1} \mid S_{t}=s\right]$?
Notice that, by the Markov property, $G_{t+1}$ depends only on the next state $s'$. So the remaining term is:
$$ \begin{aligned} \mathbb{E}_{\pi}\left[\gamma G_{t+1} \mid S_{t}=s\right] &=\gamma \sum_{a} \pi(a \mid s) \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) \mathbb{E}_{\pi}\left[G_{t+1} \mid S_{t+1}=s^{\prime}\right] \\ &=\gamma \sum_{a} \pi(a \mid s) \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_{\pi}\left(s^{\prime}\right) \end{aligned} $$
3. overall
$$ \text{Since} \quad p\left(s^{\prime} \mid s, a\right)=\sum_{r} p\left(s^{\prime}, r \mid s, a\right), $$
the two terms can be combined, and $v_{\pi}(s)$ becomes:
$$\begin{aligned} v_{\pi}(s) &= \mathbb{E}_{\pi}\left[R_{t+1}+\gamma G_{t+1} \mid S_{t}=s\right] \\ &= \sum_{a} \pi(a \mid s) \sum_{s^{\prime}} \sum_{r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma v_{\pi}\left(s^{\prime}\right)\right] \end{aligned}$$
This is known as the Bellman equation for $v_{\pi}$ (for a given MDP and policy $\pi$); a small numerical sketch is given at the end of this section.
4. meaning
- The Bellman equation shows a relationship between the value of a state and the values of its successor states.
- It states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way.
A graphical example of the Bellman equation
- 3 branches from the root: 3 possible actions under the policy $\pi(a \mid s)$
- solid circle: a state-action pair $(s, a)$
- then, from each action there are 2 possible $(r, s')$ pairs, given by the dynamics $p$
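The Bellman equation can also be used directly as an update rule (iterative policy evaluation): start from arbitrary values and repeatedly replace $v(s)$ by the right-hand side until it stops changing. A sketch, again reusing the toy dynamics `p` and a uniform policy from the earlier snippets:

```python
states = ["A", "B"]
actions = ["stay", "move"]
gamma = 0.9
pi = {(s, a): 0.5 for s in states for a in actions}  # uniform policy pi(a | s)

v = {s: 0.0 for s in states}
for _ in range(1000):
    # One Bellman backup for every state:
    # v(s) = sum_a pi(a|s) sum_{s', r} p(s', r | s, a) [r + gamma * v(s')]
    v_new = {
        s: sum(
            pi[(s, a)] * sum(prob * (r + gamma * v[s_next])
                             for s_next, r, prob in p[(s, a)])
            for a in actions
        )
        for s in states
    }
    delta = max(abs(v_new[s] - v[s]) for s in states)
    v = v_new
    if delta < 1e-10:
        break

print(v)  # fixed point: the state values v_pi for this toy MDP
```

Because the update is a $\gamma$-contraction for $\gamma < 1$, the iteration converges to a unique fixed point, which is exactly $v_{\pi}$.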