Definition 1 — Control Function
A control function
\(\mathbf{\gamma}(t)\) is a map from time \(t\in [0, \infty)\) to the set of admissible controls \(\Gamma\).
The control is also called a policy.
Definition 2 — Controlled Dynamical System
A system whose dynamics is described by the equation of motion
\begin{equation}
\dot{\mathbf{x}}(t)=\mathbf{f}(\mathbf{x}(t), \mathbf{\gamma}(t), t),
\end{equation}
is called a \(\mathbf{\gamma}\)-Controlled Dynamical System.
A system which does not explicitly depend on time,
\begin{equation}
\dot{\mathbf{x}}(t)=\mathbf{f}(\mathbf{x}(t), \mathbf{\gamma}(t)),
\end{equation}
is called Autonomous.
The state of the system \(\mathbf{x}(t)\in \mathbb{R}^n\) is also called the trajectory.
Definition 3 — Payoff Functional
Given a controlled dynamical system over the time horizon \([0, T]\), the functional
\begin{equation}
\mathbf{P}[\mathbf{\gamma}] = \int_0^T L(\mathbf{x}(t), \mathbf{\gamma}(t), t)\, dt + \phi(\mathbf{x}(T)),
\end{equation}
is called a Payoff Functional,
where the function \(L\) is called running payoff or Lagrangian
and \(\phi\) is called terminal payoff or boundary cost (sometimes also referred to as Mayer's Term).
Definition 4 — Optimal Control Problem
Given a system with controls \(\mathbf{\gamma}\)
and payoff functional \(\mathbf{P}[\mathbf{\gamma}]\),
the associated Optimal Control Problem amounts to finding
an optimal control \(\mathbf{\gamma}^*\) maximising the payoff:
\begin{equation}
\mathbf{P}[\mathbf{\gamma}^*] = \sup_{\mathbf{\gamma}\in\Gamma} \mathbf{P}[\mathbf{\gamma}],
\end{equation}
that is
\begin{equation}
\mathbf{P}[\mathbf{\gamma}^*] \ge \mathbf{P}[\mathbf{\gamma}], \quad \forall \mathbf{\gamma}\in\Gamma.
\end{equation}
The problem is called a Lagrange problem when \(\phi=0\), and a Mayer problem when \(L=0\).
When both \(L\ne0\) and \(\phi\ne0\) it is also called a Bolza Problem [1].
Exercise 1
Show that a problem in the Lagrange form can be turned into a Mayer problem with some appropriate transformation.
Consider a Lagrange problem with running cost \(L(x, \gamma, t)\). Introduce an extra state variable \(\tilde{x}\) such that
\begin{equation}
\begin{split}
\dot{\tilde{x}}(t) &= L(x(t), \gamma(t), t),\\
\tilde{x}(t=0) &= 0.
\end{split}
\end{equation}
Clearly this is such that
\begin{equation}
\tilde{x}(t=T) = \int_0^TL(x(t), \gamma(t), t) dt .
\end{equation}
Therefore, the problem expressed in terms of \(\tilde{x}\) is a Mayer problem.
◻
Exercise 2
Show that a problem in the Mayer form can be turned into a Lagrange problem with some appropriate transformation.
Consider a Mayer problem with terminal payoff \(\phi(x, t)\). Notice that
\begin{equation}
\begin{split}
\phi(x(T), T) &= \phi(x(0), 0) + \int_0^T\frac{d}{dt}\phi(x(t), t) dt \\
&= \phi(x(0), 0) + \int_0^T\left[\partial_t \phi(x(t), t) + \partial_x\phi(x(t), t)\cdot f(x(t), \gamma(t), t)\right] dt.
\end{split}
\end{equation}
Therefore, the problem expressed in these terms is a Lagrange problem, with running payoff \(\partial_t\phi + \partial_x\phi\cdot f\); the constant \(\phi(x(0), 0)\) plays no role in the optimisation.
◻
Real-world applications are abundant, from minimising fuel consumption when designing space missions, to optimising the production of chemicals,
from maximising the throughput of information transmission over a communication channel, to determining optimal policy paths for monetary policy setting.
Even more so, optimality can be considered to a large extent a universal principle of nature, in the sense that
most processes seem to be optimising some specific quantity.
This is true for nature's fundamentals, where one can often describe physical systems via principles of least action,
for biological systems, e.g. explaining animals' foraging behaviour, up to societies whose individuals maximise their respective expected utilities.
Furthermore, optimisation clearly also plays a crucial role in many applied disciplines.
The Maximum Principle
Definition 5 — Hamiltonian
Given a system with controls \(\mathbf{\gamma}\),
payoff functional \(\mathbf{P}[\mathbf{\gamma}]\),
and dynamics of the state variable \(\mathbf{x}\)
governed by the law of motion
\(\dot{\mathbf{x}}(t)=\mathbf{f}(\mathbf{x}(t), \mathbf{\gamma}(t), t)\)
the control Hamiltonian is
\begin{equation}
H(\mathbf{x}, \mathbf{\gamma}, \mathbf{\lambda}, t) = L(\mathbf{x}, \mathbf{\gamma}, t) + \mathbf{\lambda}\cdot\mathbf{f}(\mathbf{x}, \mathbf{\gamma}, t),
\end{equation}
where \(L\) is the Lagrangian term in \(\mathbf{P}\) (running payoff),
and \(\lambda\) is called the co-state variable.
Notice that the multiplier (or co-state) \(\lambda(t)\),
unlike the Lagrangian multiplier in static optimisation, depends on time [3].
Theorem 6 — Pontryagin Maximum Principle
Assume the optimal control \(\mathbf{\gamma}^*\) is known and admissible (\(\mathbf{\gamma}^*\in\Gamma\)),
which generates the associated optimal trajectory \(\mathbf{x}^*\).
Then, there exists an adjoint (co-state) trajectory \(\mathbf{\lambda}^*\) conjugate to \(\mathbf{x}^*\)
such that
\begin{equation}
\begin{split}
\dot{\mathbf{x}}^*(t) &= \partial_{\mathbf{\lambda}} H(\mathbf{x}^*(t), \mathbf{\gamma}^*(t), \mathbf{\lambda}^*(t), t), \\
\dot{\mathbf{\lambda}}^*(t) &= -\partial_{\mathbf{x}} H(\mathbf{x}^*(t), \mathbf{\gamma}^*(t), \mathbf{\lambda}^*(t), t), \\
H(\mathbf{x}^*(t), \mathbf{\gamma}^*(t), \mathbf{\lambda}^*(t), t) &= \max_{\mathbf{\gamma}\in\Gamma} H(\mathbf{x}^*(t), \mathbf{\gamma}, \mathbf{\lambda}^*(t), t).
\end{split}
\end{equation}
In other words, Pontryagin's Maximum Principle provides a necessary condition for optimality:
an optimal control must globally optimise the Hamiltonian.
The state and co-state equations are also known as canonical equations and describe the system's dynamics.
Example 1 — Minimal time for bringing a particle to rest at origin
Consider the one-dimensional problem of bringing a particle to rest at \(x=0\) in minimal time.
The initial position and velocity of the particle are \(x_0\) and \(v_0\) respectively.
The control is the force \(f(t)\) applied to the particle, such that \(|f(t)|\le 1\).
Therefore, the problem is to minimise the total time
\begin{equation}
\mathcal{T} = \int_0^T 1 dt,
\end{equation}
where \(T\) is the time when the particle reaches the origin, subject to the law of motion which,
recalling that \(v=\dot{x}\) and \(f=\dot{v}\), reads
\begin{equation}
\frac{d}{dt} \begin{bmatrix} x \\ v \end{bmatrix} =
\left(\begin{matrix}
0 & 1 \\
0 & 0
\end{matrix}\right) \begin{bmatrix} x \\ v \end{bmatrix}
+ \begin{bmatrix} 0 \\ 1 \end{bmatrix} f.
\end{equation}
To start off, one finds the Hamiltonian (recalling that minimising \(\mathcal{T}\) amounts to maximising a payoff with running payoff \(L=-1\)) to be
\begin{equation}
H = \mathbf{\lambda}^T\left(\begin{matrix}
0 & 1 \\
0 & 0
\end{matrix}\right) \begin{bmatrix} x \\ v \end{bmatrix}
+ \mathbf{\lambda}^T \begin{bmatrix} 0 \\ 1 \end{bmatrix} f - 1
= \lambda_1 v + \lambda_2 f - 1.
\label{double_integrator_hamiltonian}
\end{equation}
Because the term \(\lambda_2 f\) in the Hamiltonian is linear in \(f\),
the maximum with respect to the control is achieved at one of the endpoints of the interval \([-1, 1]\), depending on the sign of \(\lambda_2\).
That is, the Hamiltonian is maximised by
\begin{equation}
f^* = \text{sgn}(\lambda_2).
\end{equation}
The costate variables satisfy the adjoint equations
\begin{equation}
\begin{split}
\dot{\lambda}_1 &= -\partial_x H = 0, \\
\dot{\lambda}_2 &= -\partial_v H = -\lambda_1.
\end{split}
\end{equation}
Denoting the values of the costate variables at termination by \(\bar{\lambda}_i = \lambda_i(t=T)\),
we can therefore solve for the costate variables:
\begin{equation}
\begin{split}
\lambda_1 &= \bar{\lambda}_1, \\
\lambda_2 &= \bar{\lambda}_2 + \bar{\lambda}_1 (T-t).
\end{split}
\end{equation}
Here we observe that on the optimal trajectory, \(\lambda_2\) will switch sign at most once
before termination, at \(t_s=T+\bar{\lambda}_2/\bar{\lambda}_1\),
which happens when \(\text{sgn}\left(\bar{\lambda}_1\right)\ne\text{sgn}\left(\bar{\lambda}_2\right)\).
This means that the optimal control strategy consists of switching from maximal acceleration in one direction
to maximal acceleration in the opposite direction. This is therefore a bang-bang control.
Taking this result back to the equations of motion we find
\begin{equation}
\begin{split}
\dot{v} &= f^* = \text{sgn}\left(\bar{\lambda}_2 + \bar{\lambda}_1 (T-t)\right) \\
\dot{x} &= v
\end{split}\quad,
\end{equation}
thus, on any interval where \(f^*\) keeps a constant sign,
\begin{equation}
\begin{split}
v &= \text{sgn}\left(\bar{\lambda}_2 + \bar{\lambda}_1 (T-t)\right) t + v_0 \\
x &= \frac{t^2}{2}\text{sgn}\left(\bar{\lambda}_2 + \bar{\lambda}_1 (T-t)\right) + v_0t + x_0
\end{split}\quad.
\end{equation}
The terminal time \(T\) is free, therefore the transversality condition tells us \(H(t=T)=0\).
Since at termination \(v(T)=0\) and \(f^*=\text{sgn}(\lambda_2)\), this implies \(|\lambda_2(T)| = |\bar{\lambda}_2| = 1\).
This allows one to find explicit solutions for \(x\) and \(v\).
However, notice that one can also get rid of the parametrisation in \(t\) to find the associated orbits in the \(x\)-\(v\) phase plane.
For \(f=\pm 1\)
\begin{equation}
\begin{split}
x &= \pm \frac{t^2}{2} + v_0t + x_0,\\
v &= \pm t + v_0,
\end{split}
\end{equation}
or
\begin{equation}
x = \pm \frac{v^2}{2} +\left(x_0\mp\frac{v_0^2}{2}\right).
\end{equation}
This family of parabolae passes through the origin only when \(x_0 = \pm\frac{v_0^2}{2}\),
which determines the switching curve \(x = -\frac{1}{2}|v|v\) (blue curve in Figure 1). Therefore, the optimal strategy is to apply \(f=1\) (\(f=-1\))
if the initial condition lies below (above) the switching curve, follow the orbit up to the switching curve,
then switch the control value to \(f=-1\) (\(f=1\)) and follow the switching curve to the origin.
Fig.1: Optimal trajectories of the double integrator with Hamiltonian \eqref{double_integrator_hamiltonian}, for different initial conditions (move your mouse to adjust them).
The interested reader may find more details on this problem, also referred to as double-integrator problem, in [2].
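To see the bang-bang strategy in action, here is a minimal Python sketch (not part of the original text) that simulates the switching-curve feedback rule above with a simple Euler discretisation; the function names, step size and tolerances are illustrative choices.
import numpy as np

def bang_bang_control(x, v):
    # Apply f=+1 below the switching curve x = -0.5*|v|*v, f=-1 above it,
    # and the braking value -sgn(v) when (numerically) on the curve.
    s = x + 0.5 * abs(v) * v
    if abs(s) < 1e-9:
        return -np.sign(v)
    return 1.0 if s < 0 else -1.0

def time_to_rest(x0, v0, dt=1e-3, t_max=20.0):
    # Euler integration of dx/dt = v, dv/dt = f until the particle is (approximately) at rest at the origin.
    x, v, t = x0, v0, 0.0
    while (abs(x) > 1e-2 or abs(v) > 1e-2) and t < t_max:
        f = bang_bang_control(x, v)
        v += f * dt
        x += v * dt
        t += dt
    return t

print(time_to_rest(1.0, 0.0))   # close to the analytic minimum time of 2 for x0=1, v0=0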
Dynamic Programming
While the maximum principle was being developed in the Soviet Union during the Cold War,
a similar approach was being developed in the same period in the United States.
However, while Pontryagin's maximum principle only yielded necessary conditions for optimality,
the dynamic programming approach pioneered by Bellman at RAND would yield both necessary and sufficient conditions for optimality.
Dynamic programming is a general approach: let us start by considering
a discrete deterministic example.
We shall then move on to consider the continuous and stochastic extensions.
Consider a discrete system of the form
\begin{equation}
x_{t+1} = f(x_t, \gamma_t), \quad t = 0, ..., T-1;
\end{equation}
here \(x_t\in X\subset\mathbb{N}\) and \(\gamma_t\in \Gamma \subset \mathbb{N}\).
Furthermore, to each discrete step forward in time \(t\) is associated
the running payoff \(\ell(x_t, x_{t+1})\). Starting from \(x_0=0\), the objective is to maximise
the total payoff.
The naive approach to solve this problem consists of enumerating each possible trajectory
starting from the initial state \(x_0\) up to time \(T\).
It is straightforward to see that this approach requires evaluating \(|X|^T\) trajectories.
The evaluation of the payoff for each trajectory requires \(T\) operations,
so the computational complexity of this approach is \(\mathcal{O}(T|X|^T)\).
This brute-force search is depicted in Fig. 2 (left), where paths explored multiple times appear darker.
However, a better approach is possible. Let us first introduce an observation which will be crucial in the following.
Definition 8 — Bellman's Principle of Optimality
For any point \(\tau>0\) on an optimal trajectory \(\mathbf{x}(t)\),
the remaining trajectory \(\mathbf{x}(t>\tau)\) is optimal
for the corresponding problem initiated at \(\tau\).
Or, in Bellman's own words [3]:
"An optimal policy has the property that whatever
the initial state and initial decision are,
the remaining decisions must constitute an optimal policy
with regard to the state resulting from the first decision."
Bellman's optimality principle therefore guarantees that, working backwards from any point \(x_t\),
one can discard continuations that are not optimal for the corresponding tail problem, being certain that they are not portions of optimal trajectories.
In order to leverage this intuition, let us thus define
at each time \(t\) the value function
\begin{equation*}
V(x_t) = \max\left(\sum_{s=t}^{T-1} \ell(x_s, x_{s+1})\right).
\end{equation*}
Clearly this represents the largest payoff attainable starting from \(x_t\).
The introduction of \(V(x_t)\) is useful as it allows us to start at \(x_T\) and proceed backwards in time.
At \(t=T\) the terminal payoffs are known for all possible choices of \(x_T \in X\).
At \(t=T-1\), for each possible value \(x_{T-1} \in X\) we can compute the optimal payoff
as the sum of the one-step running payoff plus the terminal payoff, and record it.
Then, one can repeat this procedure for each \( t < T - 1 \) up to \(t=0\).
Should one step involve two equivalent paths, either can be chosen at random.
This process is depicted in Fig. 2 (right).
Fig.2: Brute-force forward scheme (left) versus dynamic programming (backward scheme) approach (right).
Darker paths are traversed more often.
Here \(|X| = 3\) and \(T=5\).
At this point, every \(x_t\) has an associated value, which implies two major achievements.
First, \(V(x_0)\) represents the optimal payoff.
Second, starting from \(x_0\) and proceeding greedily with respect to the value function, we are guaranteed
to find the optimal path.
This dynamic programming [4] approach requires computing the payoff of a transition
and adding it to the value (cost-to-go) previously computed, for each time \(t\), for each possible state in \(X\),
and for each control.
Hence, the computational complexity of the dynamic programming approach is \(\mathcal{O}(T|X||\Gamma|)\).
This is a major improvement compared to the naive forward scheme considered earlier!
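To make the comparison concrete, the following minimal Python sketch (an illustration, not taken from the text) contrasts the two schemes on a toy problem in which the control directly selects the next state, so that \(|\Gamma|=|X|\); the random payoffs and problem size are arbitrary.
import itertools
import random

random.seed(0)
T = 5                                                   # time horizon
X = [0, 1, 2]                                           # state space
ell = {(x, y): random.random() for x in X for y in X}   # running payoff l(x_t, x_{t+1})

def brute_force(x0=0):
    # Enumerate all |X|^T trajectories starting from x0 and keep the best payoff.
    best = float("-inf")
    for path in itertools.product(X, repeat=T):
        payoff, prev = 0.0, x0
        for nxt in path:
            payoff += ell[(prev, nxt)]
            prev = nxt
        best = max(best, payoff)
    return best

def dynamic_programming(x0=0):
    # Backward recursion on the value function, starting from V_T = 0.
    V = {x: 0.0 for x in X}
    for _ in range(T):
        V = {x: max(ell[(x, y)] + V[y] for y in X) for x in X}
    return V[x0]

print(brute_force(), dynamic_programming())             # the two optimal payoffs coincide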
The notion of dynamic programming can be formalised with the following theorem.
Theorem 9 — Dynamic Programming
Given a Bolza problem over the time interval \([t, T]\), with payoff functional \(P[\gamma]\)
and subject to the law of motion \(\dot{x}(t)=f(x(t), \gamma(t), t)\),
define the Value Function
\begin{equation}
V(x, t) := \sup_{\gamma_{[t, T]}} P[\gamma].
\end{equation}
Then, one has
\begin{equation}
\label{eq:dynprogr}
V(x, t) = \sup_{\gamma_{[t, T]}} \left\{\int_t^{\tau}L(x(t'), \gamma(t'), t') dt' + V(x(\tau), \tau)\right\}.
\end{equation}
for any \(\tau\in[t, T]\), and where \(x\) on the right hand side is the state trajectory corresponding to the control \(\gamma_{[t, T]}\).
Exercise 3
Prove Theorem 9 on Dynamic Programming.
Start by calling \(P_\tau\) the right hand side of \eqref{eq:dynprogr}:
\begin{equation}
P_\tau = \sup_{\gamma_{[t, T]}} \left\{\int_t^{\tau}L(x(t'), \gamma(t'), t') dt' + V(x(\tau), \tau)\right\}.
\end{equation}
For some \(\epsilon>0\), pick a policy \(\tilde{\gamma}\) such that
\begin{equation}
P[\tilde{\gamma}] \ge V(x, t) - \epsilon.
\end{equation}
Under this control, the definition of the value function gives
\begin{equation}
V(x(\tau), \tau) \ge \int_\tau^T L(x(t'), \tilde{\gamma}(t'), t') dt' + \phi(x(T)).
\end{equation}
Thus:
\begin{equation}
\begin{split}
V(x, t) - \epsilon \le P[\tilde{\gamma}] &= \int_t^\tau L(x(t'), \tilde{\gamma}(t'), t') dt' + \int_\tau^T L(x(t'), \tilde{\gamma}(t'), t') dt' + \phi(x(T))\\
&\le \int_t^\tau L(x(t'), \tilde{\gamma}(t'), t') dt' + V(x(\tau), \tau) \le P_\tau,
\end{split}
\end{equation}
so that \(V(x, t) \le P_\tau + \epsilon\).
Now, consider instead a policy \(\hat{\gamma}\) such that
\begin{equation}
P_\tau - \epsilon \le \int_t^\tau L(x(t'), \hat{\gamma}(t'), t')dt' + V(x(\tau), \tau),
\end{equation}
and a policy \(\check{\gamma}\) such that
\begin{equation}
V(x(\tau), \tau) - \epsilon \le \int_\tau^T L(x(t'), \check{\gamma}(t'), t')dt' + \phi(x(T)).
\end{equation}
Concatenating \(\hat{\gamma}\) on \([t, \tau]\) with \(\check{\gamma}\) on \([\tau, T]\) gives an admissible policy, whose payoff cannot exceed \(V(x, t)\); hence
\begin{equation}
P_\tau - 2\epsilon \le \int_t^\tau L(x(t'), \hat{\gamma}(t'), t')dt' + \int_\tau^T L(x(t'), \check{\gamma}(t'), t')dt' + \phi(x(T)) \le V(x, t).
\end{equation}
Since \(\epsilon>0\) is arbitrary, the two bounds together give \(V(x, t) = P_\tau\).
◻
Exercise 4
A number triangle is a sequence of numbers starting from a 1-digit number, followed by a 2-digit number, etc. For example:
7
3 8
8 1 0
2 7 4 4
4 5 2 6 5
Write a function Max_Path(T) taking as input the number triangle
T as a list of integers representing each level.
The function should compute the highest sum of digits on a specific path starting at the top and ending somewhere at the base, following the rule that each step can only go directly-down or down-right.
In the example the maximising path is 7 → 3 → 8 → 7 → 5, summing to 30.
Test it on T=[7,38,810,2744,45265]: you should find Max_Path(T)=30.
Then, use the following function to generate a triangle of length L:
def gen_T(L):
    from random import randint, seed
    seed(100)
    return [randint(10**n, 10**(n+1)-1) for n in range(L)]
Test it for L=50: you should find Max_Path(gen_T(50)) = 333.
# Sum lines from the bottom to the top and maximise sums
def Max_Path(A):
    A = [[int(d) for d in str(n)] for n in A]           # List becomes list of lists of digits
    while len(A) > 1:
        e1 = A[-1]                                       # Last level
        e2 = A[-2]                                       # Penultimate level
        s1 = [e1[n]   + e2[n] for n in range(len(e2))]   # Sum below
        s2 = [e1[n+1] + e2[n] for n in range(len(e2))]   # Sum below right
        MS = [max(a, b) for a, b in zip(s1, s2)]         # Max of possible sums
        A[-2] = MS                                       # MS replaces penultimate line
        A.pop()                                          # Remove last line
    return A[0][0]
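For instance, one can check the function against the two test cases given above (assuming the definitions of Max_Path and gen_T from this exercise):
print(Max_Path([7, 38, 810, 2744, 45265]))  # 30
print(Max_Path(gen_T(50)))                  # 333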
◻
In the field of economics it is often useful and convenient
to consider time-discretised models
where the objective function takes the form \(\beta^t u_t\),
for some instantaneous utility function \(u_t\) and discount factor \(\beta\in(0,1)\).
Let us therefore summarise two specialised results
for the cases of finite and infinite time horizon problems.
Consider the discrete-time finite-horizon problem
\begin{equation}
\begin{split}
\max_{c_t}&{\sum_{t=0}^T\beta^t u(x_t, c_t)}\\
&\text{subject to}\\
x_{t+1} &= f(x_t, c_t)
\end{split}
\end{equation}
where the instantaneous utility \(u(x_t, c_t)\)
is a function of the state variable \(x_t\) and control variable \(c_t\).
Let the optimal value function be defined by
\begin{equation}
V_s(x) = \max_{c_{t\ge s}}\left\{\sum_{t=s}^T\beta^t u(x_t, c_t)\right\}.
\end{equation}
Then
\begin{equation}
\begin{split}
V_s(x) &= \max_{c}\left\{\beta^s u(x, c) + V_{s+1}(f(x, c))\right\},\\
V_T(x) &= \max_{c}\left\{\beta^T u(x, c)\right\}.
\end{split}
\end{equation}
Consider the discrete-time infinite-horizon problem
\begin{equation}
\begin{split}
\max_{c_t}&{\sum_{t=0}^\infty\beta^t u(x_t, c_t)}\\
&\text{subject to}\\
x_{t+1} &= f(x_t, c_t)
\end{split}
\end{equation}
Assume further that \(u\) is bounded in absolute value: this ensures convergence of the sum above.
Then, the value function \(V(x)\) obeys the so-called Bellman equation:
\begin{equation}
V(x) = \max_{c}\left\{u(x, c) + \beta V(f(x, c))\right\}.
\end{equation}
The Bellman equation is a functional equation
which may not be trivial to solve.
There are three main ways to approach its solution:
guessing the form of \(V(x)\) and verifying it,
guessing a trivial form (e.g. \(V(x)=0\))
and iterating the Bellman equation analytically,
or using numerical methods.
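The numerical route can be sketched in a few lines of value function iteration on a discretised state space; the quadratic utility, linear dynamics, grid bounds and stopping tolerance below are illustrative assumptions, and clipping the state to the grid introduces a small boundary error.
import numpy as np

beta = 0.9
xs = np.linspace(-2.0, 2.0, 101)   # state grid
cs = np.linspace(-1.0, 1.0, 101)   # control grid

def u(x, c):                       # instantaneous utility (illustrative choice)
    return -(2.0 / 3.0) * x**2 - c**2

def f(x, c):                       # law of motion, clipped to the grid
    return np.clip(x + c, xs[0], xs[-1])

V = np.zeros_like(xs)              # trivial initial guess V(x) = 0
for _ in range(1000):              # iterate the Bellman operator until convergence
    V_new = np.array([np.max(u(x, cs) + beta * np.interp(f(x, cs), xs, V)) for x in xs])
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

# V now approximates the fixed point of the Bellman equation on the grid.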
Exercise 5
Solve the following discrete time finite-horizon optimisation problem
\begin{equation}
\begin{split}
&\max{\sum_{t=0}^3}\left(1+x_t-c^2_t\right)\\
&\phantom{xxx}\text{subject to}\\
&x_{t+1} = x_t+c_t
\end{split}
\end{equation}
with \(x_0=0\) and \(c_t\in \mathbb{R}\).
Let's start from \(t=T=3\):
\begin{equation}
V_3(x) = \max_{c}\left\{1+x-c^2\right\},
\end{equation}
that is \(c_3^*=0\) and \(V_3(x) = 1+x\).
Then
\begin{equation}
\begin{split}
V_2(x) &= \max_{c}\left\{1+x-c^2+V_3(x+c)\right\}\\
&= \max_{c}\left\{1+x-c^2+ 1 + x + c \right\}\\
&= \max_{c}\left\{2(1+x)+c(1-c) \right\},
\end{split}
\end{equation}
so that \(c_2^* = \frac{1}{2}\) and \(V_2(x) = 2x + \frac{9}{4}\).
Moving on,
\begin{equation}
\begin{split}
V_1(x) &= \max_{c}\left\{1+x-c^2+V_2(x+c)\right\}\\
&= \max_{c}\left\{1+x-c^2+ 2(x+c) + \frac{9}{4} \right\}\\
&= \max_{c}\left\{\frac{13}{4}+3x + c(2-c) \right\},
\end{split}
\end{equation}
so that \(c_1^* = 1\) and \(V_1(x) = 3x + \frac{17}{4}\).
Finally,
\begin{equation}
\begin{split}
V_0(x) &= \max_{c}\left\{1+x-c^2+V_1(x+c)\right\}\\
&= \max_{c}\left\{1+x-c^2+ 3(x+c) + \frac{17}{4} \right\}\\
&= \max_{c}\left\{\frac{21}{4} + 4x + c(3-c) \right\},
\end{split}
\end{equation}
so that \(c_0^* = \frac{3}{2}\) and \(V_0(x) = 4x + \frac{15}{2}\).
In summary, the sequence of optimal controls is
\begin{equation}
\left\{c^*_t\right\}_{t=0}^3 = \left\{\frac{3}{2}, 1, \frac{1}{2}, 0 \right\}.
\end{equation}
Consequently, the sequence of optimal states is
\begin{equation}
\left\{x^*_t\right\}_{t=0}^3 = \left\{x^*_{t-1}+c^*_{t-1}\right\}_{t=0}^3
= \left\{0, \frac{3}{2}, \frac{5}{2}, 3 \right\}.
\end{equation}
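As a quick numerical cross-check (not part of the original solution), one can maximise the total payoff directly over the four controls; the use of scipy.optimize here is an assumption of convenience.
import numpy as np
from scipy.optimize import minimize

def neg_payoff(c):
    # Total payoff of Exercise 5 for a control sequence c = (c_0, ..., c_3), with x_0 = 0.
    x, total = 0.0, 0.0
    for ct in c:
        total += 1 + x - ct**2
        x += ct
    return -total

res = minimize(neg_payoff, np.zeros(4))
print(np.round(res.x, 3), -res.fun)   # controls close to (3/2, 1, 1/2, 0) and payoff 15/2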
◻
Exercise 6
Solve the following discrete time infinite-horizon optimisation problem
\begin{equation}
\begin{split}
&\max{\sum_{t=0}^\infty}\beta^t\left(-\frac{2}{3}x_t^2-c_t^2\right)\\
&\phantom{xxx}\text{subject to}\\
&x_{t+1} = x_t+c_t
\end{split}
\end{equation}
with \(x_0\) given and \(c_t\in \mathbb{R}\).
The Bellman equation of the problem reads
\begin{equation}
\label{eq:exercise5_BE}
V(x) = \max_{c}\left\{-\frac{2}{3}x^2-c^2+\beta V(x+c)\right\}.
\end{equation}
Let us start by iterating from the trivial guess \(V(x) = 0\): the right hand side of \eqref{eq:exercise5_BE} is then maximised by \(c=0\), giving \(-\frac{2}{3}x^2\), a quadratic function of \(x\).
Let us therefore guess the following functional form for the optimal value function: \(V(x) = -\alpha x^2\).
Then, \eqref{eq:exercise5_BE} becomes
\begin{equation}
\label{eq:exercise5_BE2}
-\alpha x^2 = \max_{c}\left\{-\frac{2}{3}x^2-c^2-\beta \alpha (x+c)^2\right\}.
\end{equation}
The right hand side of \eqref{eq:exercise5_BE2} is maximised by
\begin{equation}
c^*(x) = -\frac{\alpha\beta}{1+\alpha\beta}x.
\end{equation}
Therefore, \eqref{eq:exercise5_BE2} turns into
\begin{equation}
-\alpha = -\frac{2}{3}-\frac{\alpha^2\beta^2}{(1+\alpha\beta)^2}-\alpha\beta\left(1-\frac{\alpha\beta}{1+\alpha\beta}\right)^2.
\end{equation}
Solving for \(\alpha\) yields:
\begin{equation}
\alpha = \frac{5\beta-3\pm\sqrt{25\beta^2-6\beta+9}}{6\beta}.
\end{equation}
Since \(\beta\in(0,1)\), the positive root is the one to retain: it gives \(\alpha>0\), so that \(V(x)=-\alpha x^2\) is concave and the maximisation above is well defined.
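With the quadratic guess, a single Bellman iteration maps \(-\alpha_k x^2\) into \(-\alpha_{k+1}x^2\) with \(\alpha_{k+1} = \frac{2}{3} + \frac{\beta\alpha_k}{1+\beta\alpha_k}\), so the coefficient can also be found by fixed-point iteration. The short sketch below (with an arbitrary illustrative value of \(\beta\)) confirms that this iteration converges to the positive root of the closed-form expression above.
import math

beta = 0.9                                  # illustrative discount factor

alpha = 0.0                                 # start from the trivial guess V(x) = 0
for _ in range(200):                        # iterate the Bellman map on the coefficient
    alpha = 2.0 / 3.0 + beta * alpha / (1.0 + beta * alpha)

closed_form = (5 * beta - 3 + math.sqrt(25 * beta**2 - 6 * beta + 9)) / (6 * beta)
print(alpha, closed_form)                   # the two values agree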
◻
The Hamilton-Jacobi-Bellman equation
Let us now consider an infinitesimal version of Theorem 9.
Theorem 12 — Hamilton-Jacobi-Bellman equation
Consider a value function defined by
\begin{equation*}
V(x, t) := \sup_{\gamma_{[t_0, T]}} P[\gamma].
\end{equation*}
Then, \(V\) is the unique viscosity solution [6, 7] of the Hamilton-Jacobi-Bellman equation
\begin{equation}
\label{eq:HJB}
-\partial_t V(x, t) = \max_{\gamma} \left\{ L(x, \gamma, t)
+ \nabla_x V(x, t) \cdot f(x, \gamma, t)
\right\},
\end{equation}
with terminal condition \(V(x, T) = \phi(x)\).
Moreover, this provides a necessary and sufficient condition for optimality.
The Hamilton-Jacobi-Bellman (HJB) equation \eqref{eq:HJB}
is the nonlinear partial differential equation providing necessary and sufficient conditions
for optimality of the control \(\gamma\)
with respect to the payoff function.
Notice that the terminal payoff enters only through the boundary condition
and does not explicitly appear in the HJB equation itself.
Exercise 7
Derive the Hamilton-Jacobi-Bellman equation from the dynamic programming principle \eqref{eq:dynprogr} of Theorem 9.
To start, consider an infinitesimal dynamic programming principle
\begin{equation}
\label{eq:dpp_eq}
V(x, t) = \sup_{\gamma_{[t, t+\delta t]}} \left\{\int_t^{t+\delta t}L(x(t'), \gamma(t'), t') dt' + V(x(t+\delta t), t+\delta t)\right\}.
\end{equation}
A Taylor expansion of the equation of motion \(\dot{x}(t) = f(x(t), \gamma(t), t)\) gives
\begin{equation}
\begin{split}
x(t+\delta t) &= x(t) + \int_t^{t+\delta t} f(x(t'), \gamma(t'), t') dt'\\
&= x(t) + f(x(t), \gamma(t), t) \delta t +\mathcal{O}(\delta t^2).
\end{split}
\end{equation}
Expanding further \eqref{eq:dpp_eq} to first order in \(\delta t\),
and assuming sufficient regularity of \(V\), gives
\begin{equation}
\begin{split}
V(x, t) &= \sup_{\gamma_{[t, t+\delta t]}} \left\{ L(x(t), \gamma(t), t)\delta t
+ V(x, t) +\partial_t V(x, t) \delta t
+ \nabla_x V(x, t) \cdot f(x(t), \gamma(t), t) \delta t
\right\}\\
0 &= \sup_{\gamma_{[t, t+\delta t]}} \left\{ L(x(t), \gamma(t), t)\delta t
+\partial_t V(x, t) \delta t
+ \nabla_x V(x, t) \cdot f(x(t), \gamma(t), t) \delta t
\right\},
\end{split}
\end{equation}
which after rearranging and taking the limit \(\delta t\rightarrow 0\) finally yields
\begin{equation}
-\partial_t V(x, t) = \sup_{\gamma} \left\{ L(x, \gamma, t)
+ \nabla_x V(x, t) \cdot f(x, \gamma, t)
\right\}.
\end{equation}
◻
Linear-Quadratic Regulator
Finite-horizon LQR
Let us now turn our attention to the specialised case
of a finite-horizon linear system with quadratic payoff:
\begin{equation}
\begin{split}
f(x(t), \gamma(t), t) &= A(t)x(t) + B(t)\gamma(t),\\
L(x(t), \gamma(t)) &= -\left(Q(t)x^2(t)+R(t)\gamma^2(t)\right),\\
\phi(x(T)) &= -Mx^2(T),
\end{split}
\end{equation}
with \(A\), \(B\), \(Q\), \(R\), and \(M\) all positive.
That is to say, the system with equation of motion
\begin{equation}
\dot{x} = A(t)x + B(t)\gamma,
\end{equation}
and payoff function
\begin{equation}
P[\gamma] = -\int_0^T \left[Q(t)x^2(t)+R(t)\gamma^2(t)\right]dt - Mx^2(T).
\end{equation}
As we shall see, this linear-quadratic regulator (LQR)
is particularly useful as it is vastly more tractable than the general optimal control problem.
The Hamiltonian for the system can then be found to be
\begin{equation}
\label{eq:lqr_hamiltonian}
H(x, \gamma, \lambda, t) = A(t)\lambda x+B(t)\lambda\gamma-Q(t)x^2-R(t)\gamma^2,
\end{equation}
Theorem 13 — LQR Linear State Feedback Law
For the Linear Quadratic Regulator defined above,
one has the following linear state feedback law
\begin{equation}
\label{eq:lqr_lsfl}
\gamma^{*}(t) = -R^{-1}(t)B(t)\chi(t) x^{*}(t),
\end{equation}
where the function \(\chi(t)\) satisfies the Riccati Differential Equation
\begin{equation}
\label{eq:lqr_rde}
\dot{\chi} = R^{-1}B^2\chi^2-2A\chi-Q,
\end{equation}
with boundary condition \(\chi(T)=M\).
Furthermore
\begin{equation}
\label{eq:lqr_state_costate_linear_rel}
\lambda^*(t) = -2\chi(t) x^*(t).
\end{equation}
The gradient of the Hamiltonian \eqref{eq:lqr_hamiltonian} with respect to the control is
\begin{equation}
\partial_\gamma H = B(t)\lambda-2R(t)\gamma,
\end{equation}
and \(\partial_\gamma^2H=-2R(t)<0\).
Therefore, an optimal control \(\gamma^*\) must satisfy
\begin{equation}
\label{eq:lqr_costate_optimal_condition}
\gamma^*= \frac{R^{-1}(t)}{2}B(t)\lambda^*,
\end{equation}
while the co-state satisfies the adjoint equation
\begin{equation}
\begin{split}
\dot{\lambda}^* &= -\partial_x H = 2Q(t)x^*-A(t)\lambda^* ,\\
\lambda^*(T) &= -2Mx^*(T).
\end{split}
\end{equation}
For the whole system we can therefore write the canonical equations
\begin{equation}
\label{eq:lqr_canonical_equations}
\begin{pmatrix}
\dot{x}^*\\
\dot{\lambda}^*
\end{pmatrix} =
\mathcal{H}(t)
\begin{pmatrix}
x^*\\
\lambda^*
\end{pmatrix}
\end{equation}
where the so called Hamiltonian matrix \(\mathcal{H}(t)\)
is given by
\begin{equation}
\mathcal{H}(t) =
\begin{pmatrix}
A(t) & \frac{R^{-1}(t)}{2}B^2(t)\\
2Q(t) & -A(t)
\end{pmatrix}.
\end{equation}
Consider now instead the transition matrix, or propagator \(P(t, t_0) = P^{-1}(t_0, t)\)
such that
\begin{equation}
\label{eq:propagated}
\begin{pmatrix}
x^*(t)\\
\lambda^*(t)
\end{pmatrix} = P(t, t_0)\begin{pmatrix}
x^*(t_0)\\
\lambda^*(t_0)
\end{pmatrix}.
\end{equation}
The transversality condition \(\lambda^*(T) = -2Mx^*(T)\) then implies that \eqref{eq:propagated}, taken with \(t_0 = T\),
can be rewritten as
\begin{equation}
\label{eq:propagated2}
\begin{pmatrix}
x^*(t)\\
\lambda^*(t)
\end{pmatrix} = \bar{P}(t, T)\begin{pmatrix}
x^*(T)\\
x^*(T)
\end{pmatrix},
\end{equation}
so that we can write
\begin{equation}
\lambda^* = -2\chi x^*
\end{equation}
with
\begin{equation}
-2\chi = \frac{\bar{P}_{21}+\bar{P}_{22}}{\bar{P}_{11}+\bar{P}_{12}},
\end{equation}
where \(\bar{P}_{ij}\) denotes the element \(i,j\) of \(\bar{P}\).
This proves \eqref{eq:lqr_state_costate_linear_rel}.
Combining \eqref{eq:lqr_costate_optimal_condition} with \eqref{eq:lqr_state_costate_linear_rel}
further yields \eqref{eq:lqr_lsfl}.
We are left to prove that \(\chi\) satisfies the Riccati equation \eqref{eq:lqr_rde}.
Differentiating \eqref{eq:lqr_state_costate_linear_rel} yields
\begin{equation}
\dot{\lambda}^* = -2\dot{\chi} x^* -2\chi \dot{x}^*.
\end{equation}
Expanding \(\dot{x}^*\) and \(\dot{\lambda}^*\) via the canonical equations \eqref{eq:lqr_canonical_equations} gives
\begin{equation}
2Q x^* - A\lambda^* = -2\dot{\chi} x^* -2\chi Ax^* -2\chi \frac{R^{-1}}{2}B^2\lambda^*.
\end{equation}
Further expanding \(\lambda^*\) via \eqref{eq:lqr_state_costate_linear_rel} yields
the desired result.
Finally, the boundary condition for the Riccati equation
can be found by combining \eqref{eq:lqr_state_costate_linear_rel}
with the boundary condition \(\lambda^*(T) = -2Mx^*(T)\) for the co-state variable.
◻
Theorem 14 — Global optimality of the Linear State Feedback Law
The linear state feedback law \eqref{eq:lqr_lsfl}
obtained by the maximum principle is globally optimal.
Theorem 14 can be proved by considering the HJB equation for the LQR problem,
and realising that the value function \(V(x, t)=-\chi(t)x^2\) satisfies it,
with \(\chi\) the solution to the differential Riccati equation.
Theorems 13 and 14 are remarkable results.
They make the LQR problem substantially more tractable
than the general optimal control problem presented in Definition 4.
Crucially, the task of solving the HJB partial differential equation
is reduced to solving the Riccati differential equation, which is an ODE.
Exercise 8
Consider the integrator \(\dot{x}=\gamma\) with payoff \(P[\gamma] = -\int_0^T\left[x^2(t)+\gamma^2(t)\right] dt\).
Find the optimal feedback law for the control.
To start, notice that the problem corresponds to the following identification
\begin{equation*}
A = M = 0,\quad B=Q=R=1.
\end{equation*}
The Riccati equation is therefore
\begin{equation*}
\dot{\chi}=\chi^2-1,
\end{equation*}
with boundary condition \(\chi(T)=0\). Thus,
\begin{equation*}
\begin{split}
\int_{\chi(t)}^0\frac{d\chi}{\chi^2-1}&=\int_t^T ds,\\
\tanh^{-1}(\chi(t))&=T-t,\\
\chi(t)&=\tanh(T-t).
\end{split}
\end{equation*}
Therefore the optimal control is given by the feedback law
\begin{equation*}
\gamma(t)=-\tanh(T-t)x(t).
\end{equation*}
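As a numerical sanity check (not in the original text), one can integrate the Riccati equation backwards from the terminal condition with an explicit Euler scheme and compare the result with \(\tanh(T-t)\); the horizon and step size below are arbitrary.
import numpy as np

T, dt = 2.0, 1e-4
chi = 0.0                          # terminal condition chi(T) = 0
for _ in range(int(T / dt)):
    chi -= dt * (chi**2 - 1.0)     # explicit Euler step of d(chi)/dt = chi^2 - 1, marching backwards in time

print(chi, np.tanh(T))             # chi(0) should be close to tanh(T - 0)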
◻
Infinite-horizon LQR
Let us now consider the special case of an infinite-horizon LQR problem.
Starting from the finite-horizon LQR problem discussed above,
assume that controls and payoff functionals are time-invariant,
that is
\begin{equation}
\label{eq:IH_static_conditions}
\dot{A}=\dot{B}=\dot{Q}=\dot{R}=0
\end{equation}
and that there is no terminal cost \(M=0\).
Then consider the limit \(T\rightarrow\infty\).
The main implication of the time-invariance conditions \eqref{eq:IH_static_conditions}
is that the Riccati differential equation becomes time-invariant as well,
so that its solution \(\chi\) depends only on the time to go \(T-t\).
Passing to the limit \(T\rightarrow\infty\), \(\chi\) turns into a constant,
so that \(\dot{\chi}=0\). Therefore, in the infinite-horizon problem,
the Riccati differential equation is turned into the following
algebraic Riccati equation
\begin{equation}
R^{-1}B^2\chi^2-2A\chi-Q = 0.
\end{equation}
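For scalar, time-invariant coefficients the algebraic Riccati equation is simply a quadratic in \(\chi\); taking its positive (stabilising) root and inserting it in the feedback law \eqref{eq:lqr_lsfl} gives a constant gain. The sketch below, with arbitrary illustrative values of \(A\), \(B\), \(Q\), \(R\), solves it directly.
import math

A, B, Q, R = 1.0, 1.0, 1.0, 1.0              # illustrative time-invariant coefficients

# Scalar algebraic Riccati equation: (B^2/R) chi^2 - 2 A chi - Q = 0.
a, b, c = B**2 / R, -2 * A, -Q
chi = (-b + math.sqrt(b**2 - 4 * a * c)) / (2 * a)   # positive (stabilising) root

gain = B * chi / R                           # constant feedback law: gamma = -gain * x
print(chi, gain, A - B * gain)               # closed-loop coefficient A - B*gain is negative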
Stochastic Control
While so far we have discussed deterministic control problems,
many real-world applications require the control of noisy processes,
ranging from the execution of optimal trading strategies and the allocation of resources
to the control of spacecraft and robots.
Let us therefore introduce the stochastic extensions of the notions learnt above.
Definition 15 — Stochastic Optimal Control Problem
Given a stochastic process defined by the SDE
\begin{equation}
\label{eq:canonical_sde}
dX(t) = f(X(t), \gamma(t), t)dt+\sigma(X(t), \gamma(t), t)dW(t),
\end{equation}
and the associated payoff
\begin{equation}
\mathbf{P}[\gamma] = \mathbb{E}\left[\int_0^T L(X(t), \gamma(t), t)dt +\phi(X(T))\right],
\end{equation}
the stochastic Bolza problem is
\begin{equation}
\sup_{\gamma\in\Gamma}\mathbf{P}[\gamma],
\end{equation}
subject to \eqref{eq:canonical_sde}, with the set of admissible controls \(\Gamma\)
being adapted to \(W\).
Clearly this reduces to the deterministic Bolza problem for \(\sigma=0\).
Theorem 16 — Stochastic Dynamic Programming
Given a stochastic Bolza problem over the time interval \([t, T]\), with payoff functional \(P[\gamma]\)
define the Value Function
\begin{equation}
V(x, t) := \sup_{\gamma_{[t, T]}} \mathbb{E}\left[\int_t^T L(X(t'), \gamma(t'), t')dt' +\phi(X(T))\right],
\end{equation}
with \(X(t)=x\).
Then, one has
\begin{equation}
\label{eq:stochdynprogr}
V(x, t) = \sup_{\gamma_{[t, T]}} \mathbb{E}\left\{\int_t^{\tau}L(X(t'), \gamma(t'), t') dt' + V(X(\tau), \tau)\right\},
\end{equation}
for any stopping time \(\tau\) with \(t\le\tau\le T\).
The value function is the unique viscosity solution to the stochastic Hamilton-Jacobi-Bellman equation
\begin{equation}
-\partial_t V(x, t) = \sup_{\gamma} \left\{ L(x, \gamma, t)
+ \nabla_x V(x, t) \cdot f(x, \gamma, t)
+ \frac{1}{2}\text{Tr}\left[\sigma^T(x, \gamma, t)\nabla_x^2V(x, t)\sigma(x, \gamma, t)\right]
\right\},
\end{equation}
with terminal condition \(V(x, T) = \phi(x)\).
Exercise 9
Derive the stochastic Hamilton-Jacobi-Bellman equation from the dynamic programming principle \eqref{eq:stochdynprogr} of Theorem 16.
◻
Exercise 10
Show that a p
Consider
◻
Example 2 — Optimal Consumption-Investment
Consider the stochastic optimal control problem of an agent who adjusts consumption \(C_t\) and allocates an initial amount of wealth \(W_0\) between a savings account with interest rate \(r\)
and a risky portfolio with average rate of return \(\mu\) and return variance \(\sigma^2\):
\begin{equation}
\frac{d S_t}{S_t} = r dt,\quad\quad \frac{d P_t}{P_t} = \mu dt + \sigma dZ_t,
\end{equation}
with \(Z_t\) a standard Wiener process.
The utility of consumption is \(U(c) = \log(c)\) and the rate of discount applied to consumption utility is \(\rho\).
Denoting by \(Q_t\) the fraction of wealth invested in the risky portfolio at time \(t\),
the time-dynamics of wealth is therefore described by [9]
\begin{equation}
dW_t = Q_tW_t(\mu dt+\sigma dZ_t) + (1-Q_t)W_t r dt - C_tdt .
\end{equation}
That is, the change in wealth is driven by the return on the risky portfolio,
plus the return on the saving account, minus the amount of wealth consumed.
Therefore, the optimal control problem is
\begin{equation}
\begin{split}
\max_{C_t, Q_t}&\left\{\mathbb{E}\int_0^T e^{-\rho t}U(C_t) dt\right\} \\
&\text{subject to}\\
dW_t &= Q_tW_t(\mu dt+\sigma dZ_t) + (1-Q_t)W_t r dt - C_tdt\\
C_t &\ge 0
\end{split}
\end{equation}
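To get a feel for the wealth dynamics entering the constraint, the following sketch (purely illustrative, not a solution of the control problem) simulates one path of the wealth SDE under a fixed allocation \(Q\) and a consumption rate proportional to wealth, using an Euler-Maruyama discretisation; all parameter values are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
mu, r, sigma = 0.08, 0.02, 0.2               # market parameters (illustrative)
Q, kappa = 0.5, 0.1                          # fixed allocation and consumption rate C_t = kappa * W_t
T, dt = 10.0, 1e-3
W = 1.0                                      # initial wealth W_0

for _ in range(int(T / dt)):
    dZ = rng.normal(0.0, np.sqrt(dt))        # Wiener increment
    C = kappa * W                            # consumption
    dW = Q * W * (mu * dt + sigma * dZ) + (1 - Q) * W * r * dt - C * dt
    W += dW

print(W)                                     # terminal wealth of one simulated path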
[4] Notice that, while related, the Control Hamiltonian of optimal control theory is distinct from the Hamiltonian used in mechanics.
In particular, the latter is used to derive the equations of motion of a dynamical system.
The Hamiltonian of optimal control theory, instead, is a concept developed by Lev Pontryagin
with the purpose of providing conditions for extremising a functional with respect to a control variable.
[5] Why is this approach named Dynamic Programming?
The term was coined in the early 50s by Richard Bellman while working on his research on multistage decision processes
at RAND Corporation, at that time fully funded by the US government.
The justification for the label has a rather interesting story.
From Bellman's autobiography Eye of the Hurricane: An Autobiography (1984, page 159):
«I spent the Fall quarter (of 1950) at RAND.
My first task was to find a name for multistage decision processes.
An interesting question is,
"Where did the name, dynamic programming, come from?"
The 1950s were not good years for mathematical research.
We had a very interesting gentleman in Washington named Wilson.
He was Secretary of Defense, and he actually had a pathological fear and hatred of the word "research".
I'm not using the term lightly; I'm using it precisely.
His face would suffuse, he would turn red,
and he would get violent if people used the term research in his presence.
You can imagine how he felt, then, about the term mathematical.
The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss,
essentially. Hence, I felt I had to do something to shield Wilson and the Air Force
from the fact that I was really doing mathematics inside the RAND Corporation.
What title, what name, could I choose?
In the first place I was interested in planning, in decision making, in thinking.
But planning, is not a good word for various reasons.
I decided therefore to use the word "programming".
I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying.
I thought, let's kill two birds with one stone.
Let's take a word that has an absolutely precise meaning, namely dynamic,
in the classical physical sense.
It also has a very interesting property as an adjective,
and that is it's impossible to use the word dynamic in a pejorative sense.
Try thinking of some combination that will possibly give it a pejorative meaning.
It's impossible. Thus, I thought dynamic programming was a good name.
It was something not even a Congressman could object to.
So I used it as an umbrella for my activities.»
[6] Viscosity solutions are a class of weak solutions to non-linear PDEs,
which can be seen as the limit of solutions of the original PDE
regularised with a diffusive term (from which the term "viscosity").
A viscosity solution does not necessarily satisfy the PDE at every point like a classical solution.
Instead, it satisfies certain inequalities that capture the essence of the PDE.
See e.g. [7].