# Bellman Equation

**4 min read** • Published: **November 22, 2018**

Before we begin, let me just define a few terms:

- $S_t$ is the state at time $t$.
- $A_t$ is the action performed at time $t$.
- $R_t$ is the reward received at time $t$.
- $G_t$ is the **return**, that is, the sum of discounted rewards received from time $t$ onwards, defined as $G_t = \sum_{i=0}^\infty \gamma^i R_{t+i+1}$.
- $V^\pi(s)$ is the value of a state $s$ when following a policy $\pi$, that is, the expected return when starting in state $s$ and following $\pi$, defined as $V^\pi(s) = \mathrm{E}[G_t | S_t = s]$.
- $Q^\pi(s, a)$ is the value of a state $s$ when performing an action $a$ and then following the policy $\pi$, that is, $Q^\pi(s, a) = \mathrm{E}[G_t | S_t = s, A_t = a]$.
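To make the return concrete, here is a small Python sketch. The reward sequence and discount factor are made up for illustration, and the infinite sum is approximated by truncating after the rewards we have:

```python
# Hypothetical example: approximating the return G_t from a finite
# reward sequence R_{t+1}, R_{t+2}, ... with discount factor gamma.
def discounted_return(rewards, gamma):
    """G_t = sum_i gamma^i * R_{t+i+1}, truncated to len(rewards) terms."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0]  # R_{t+1}, R_{t+2}, R_{t+3}
g = discounted_return(rewards, gamma=0.9)  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```

With $\gamma < 1$, later rewards count for less, which is also what makes the infinite sum finite for bounded rewards.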

Before moving further, note a small algebraic trick for rewriting $G_t$ in terms of itself:

$\begin{aligned} G_t &= \sum_{i=0}^\infty \gamma^i R_{t+i+1} \\ &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \\ &= R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \cdots) \\ &= R_{t+1} + \gamma G_{t+1}. \end{aligned}$

We can use this in the definition of $V^\pi(s)$ and get

$\begin{aligned} V^\pi(s) &= \mathrm{E}[G_t | S_t = s] \\ &= \mathrm{E}[R_{t+1} + \gamma G_{t+1} | S_t = s] \\ &= \sum_a \pi(a|s) \sum_{s'}\sum_r p(s', r | s, a)[ r + \gamma \mathrm{E}[G_{t+1} | S_{t+1} = s']] \\ &= \sum_a \pi(a|s) \sum_{s'}\sum_r p(s', r | s, a)[ r + \gamma V^\pi(s')]. \end{aligned}$

The last equation is called the Bellman equation for $V^\pi(s)$, and it shows a recursive relationship between the value of the current state and the values of the possible next states. This in and of itself is not that interesting, but we'll use it to derive a way of finding the optimal policy.
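The recursive relationship suggests a simple fixed-point procedure, often called iterative policy evaluation: repeatedly replace $V(s)$ with the right-hand side of the Bellman equation until it stops changing. A minimal sketch, on a made-up two-state MDP with deterministic transitions (so `p[s][a]` is just a `(next_state, reward)` pair rather than a full distribution):

```python
# Hypothetical 2-state, 2-action MDP for illustration only.
# p[s][a] = (next_state, reward); transitions are deterministic here.
p = {0: {0: (0, 0.0), 1: (1, 1.0)},
     1: {0: (0, 2.0), 1: (1, 0.0)}}
pi = {0: [0.5, 0.5], 1: [0.5, 0.5]}  # uniform random policy pi(a|s)
gamma = 0.9

V = [0.0, 0.0]
for _ in range(1000):
    V_new = [0.0, 0.0]
    for s in p:
        for a, prob in enumerate(pi[s]):
            s_next, r = p[s][a]
            # Bellman backup: sum_a pi(a|s) * [r + gamma * V(s')]
            V_new[s] += prob * (r + gamma * V[s_next])
    delta = max(abs(x - y) for x, y in zip(V, V_new))
    V = V_new
    if delta < 1e-12:
        break
```

After convergence, `V` satisfies the Bellman equation for this policy to within the tolerance; this works because the backup is a $\gamma$-contraction, so repeated application converges to the unique fixed point $V^\pi$.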

Let us now define the optimal value function, that is, the value function of the optimal policy (denoted $\pi^*$).

$V^*(s) = \max_\pi V^\pi(s),$

that is, the optimal value of a state is the maximum over all possible policies. Going one step further, we also define the optimal action-value function as

$Q^*(s, a) = \max_\pi Q^\pi(s, a).$

It's easy to see now that $V^*(s) = \max_a Q^*(s, a)$, that is, the maximum value of a state is obtained by performing the best possible action. We can use this to arrive at a simplified Bellman equation as follows.

$\begin{aligned} V^*(s) &= \max_a Q^*(s, a) \\ &= \max_a \mathrm{E}_{\pi^*} [G_t | S_t = s, A_t = a ] \\ &= \max_a \mathrm{E}_{\pi^*} [R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a ] \\ &\stackrel{*}{=} \max_a \mathrm{E} [R_{t+1} + \gamma V^*(S_{t+1}) | S_t = s, A_t = a ] \\ &= \max_a \sum_{s'}\sum_r p(s', r | s, a) [r + \gamma V^*(s')]. \end{aligned}$

Here we managed to do a small but important trick in the second-to-last equation (marked with $\stackrel{*}{=}$). But let us first unpack what $\mathrm{E}_{\pi^*}$ actually means. Because $G_t$ is the discounted sum of rewards, it is only defined in terms of a policy. And since we assume the policy to be stochastic, we need to take an expectation over all possible actions chosen by the policy, as well as over the possible rewards.

This changes at the marked equation, because we are no longer referring to the policy $\pi^*$, but rather to the value function $V^*$, which is not stochastic. As a result, the expectation in the second-to-last equation is only over $R_{t+1}$ and the next state $S_{t+1}$, because we still assume stochastic rewards and transitions.
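The optimality equation yields an analogous fixed-point procedure, value iteration: sweep over the states, replacing $V(s)$ with $\max_a \sum_{s'}\sum_r p(s', r | s, a)[r + \gamma V(s')]$. A sketch on the same style of made-up two-state MDP as above (again with deterministic transitions, so `p[s][a] = (next_state, reward)`):

```python
# Hypothetical 2-state, 2-action MDP for illustration only.
# p[s][a] = (next_state, reward); transitions are deterministic here.
p = {0: {0: (0, 0.0), 1: (1, 1.0)},
     1: {0: (0, 2.0), 1: (1, 0.0)}}
gamma = 0.9

V = [0.0, 0.0]
for _ in range(1000):
    # Bellman optimality backup: V(s) <- max_a [r + gamma * V(s')]
    V_new = [max(r + gamma * V[s_next] for s_next, r in p[s].values())
             for s in sorted(p)]
    delta = max(abs(x - y) for x, y in zip(V, V_new))
    V = V_new
    if delta < 1e-12:
        break

# Acting greedily with respect to V* recovers an optimal policy.
policy = {s: max(p[s], key=lambda a: p[s][a][1] + gamma * V[p[s][a][0]])
          for s in p}
```

Once `V` has converged to $V^*$, the greedy policy at the end is optimal, which is exactly why the recursive Bellman relationship is useful: it turns the search over all policies into a local, per-state maximization.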

### References

- Equations for the value function and notation borrowed from *Reinforcement Learning: An Introduction* by Richard S. Sutton and Andrew G. Barto.


### Discussion of "Bellman Equation"

If you have any questions, feedback, or suggestions, please do share them in the comments! I'll try to answer each and every one. If something in the article wasn't clear, don't be afraid to mention it. The goal of these articles is to be as informative as possible.
