What Is an Optimal Decision?
"Optimal" in finance rarely means perfect. It means the best action an
agent can take given its information, constraints, and objectives. The
job of this chapter is to build that idea up from a single-period
allocation problem to dynamic programming and reinforcement learning,
keeping the same notation and the same conceptual line: forecasts feed a
utility, the utility lives inside a constraint set, and time-consistent
behaviour falls out of a recursion.
Ingredients of an optimal decision
Every optimal-decision problem has four ingredients:
- Objective. The utility or cost function U to maximise (or
minimise). The shape of U — linear, quadratic, CRRA, recursive
Epstein–Zin — is what encodes risk attitude and intertemporal
preference.
- Information set. $\mathcal{F}_t$, the history available at the
decision moment. Policies may only depend on $\mathcal{F}_t$; a policy
that uses future prices is not a policy.
- Action space. A(s), possibly state-dependent. Box
constraints, simplex projections, leverage caps, turnover budgets all
live here.
- Dynamics. How actions and shocks evolve the state. Wealth update,
inventory dynamics, regime transitions — see Chapter 6 for the
state-space view of these.
If any ingredient is missing, "optimal" is undefined. A trading signal
without execution constraints is not optimal once slippage is real; a
utility without a discount factor is not optimal across horizons.
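As a compact summary, the four ingredients can be written as one generic problem. The horizon $T$, the policy $\pi_t$, the transition map $f$, and the shock $\varepsilon_{t+1}$ are notation assumed here for illustration; they are not symbols the chapter has fixed so far.

```latex
\max_{\pi_0, \dots, \pi_{T-1}} \; \mathbb{E}\big[\, U(s_0, a_0, \dots, s_T) \,\big]
\quad \text{s.t.} \quad
a_t = \pi_t(\mathcal{F}_t) \in A(s_t), \qquad
s_{t+1} = f(s_t, a_t, \varepsilon_{t+1}).
```

Each ingredient appears once: $U$ is the objective, $\mathcal{F}_t$ the information set, $A(s_t)$ the action space, and $f$ the dynamics.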
Static vs. dynamic decisions
- Static. Choose an action $a$ once, given the state $s$: for example, Markowitz weights
given $(\mu, \Sigma)$. The optimisation is a single convex
problem (Section 02-03) and the answer has closed-form structure.
- Dynamic. Choose $a_t$ repeatedly; future opportunities depend on
today's choice. Requires dynamic programming (Section 5-03) or
policy search (Sections 5-04 / 5-05).
Real trading is dynamic. Risk budgets reset daily, funding costs change,
counterparties default, and information arrives. The static problem is
the right starting point — it is the benchmark every dynamic policy
should be measured against — but it is rarely the answer.
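To make the static benchmark concrete, here is a minimal sketch of the unconstrained mean-variance solution $w = \gamma^{-1} \Sigma^{-1} \mu$, which maximises $w^\top \mu - \tfrac{\gamma}{2} w^\top \Sigma w$. The risk-aversion parameter `gamma`, the two-asset restriction, and all numbers are illustrative assumptions; `mu` is read as expected excess returns, with the remaining weight held in cash.

```python
def mean_variance_weights(mu, sigma, gamma):
    """Unconstrained mean-variance weights w = (1 / gamma) * inverse(sigma) * mu.

    Written for two assets only, so the matrix inverse can be spelled out.
    """
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # Explicit 2x2 inverse applied to mu.
    inv_mu = [(d * mu[0] - b * mu[1]) / det,
              (-c * mu[0] + a * mu[1]) / det]
    return [x / gamma for x in inv_mu]


# Hypothetical inputs: expected excess returns and their covariance matrix.
mu = [0.05, 0.03]
sigma = [[0.04, 0.01],
         [0.01, 0.02]]
weights = mean_variance_weights(mu, sigma, gamma=5.0)
print(weights)
```

A dynamic policy evaluated on the same inputs can be compared against these weights period by period.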
Utility and risk preferences
Utility shapes the optimal portfolio. Three forms recur:
- Quadratic / mean–variance. Penalises variance linearly; equivalent
to expected utility under joint Gaussianity.
- CRRA / CARA. Captures diminishing marginal utility of wealth.
In the standard frictionless setting, CRRA keeps the fraction of wealth
at risk constant, so positions scale with wealth (the right fit for compounding
investors); CARA gives positions that are wealth-independent in currency terms (the right fit for
fixed-capacity desks).
- Prospect theory. Asymmetric value function with loss aversion and
probability weighting. Important for understanding investor flows;
rarely used inside an optimiser because it is non-convex.
Choosing the wrong utility is equivalent to optimising someone else's
problem. Stakeholder interviews and governance guidelines should inform
the selected objective before any code is written.
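The difference between CRRA and CARA can be checked numerically. The sketch below estimates the absolute risk aversion $-u''(W)/u'(W)$ and the relative risk aversion $-W u''(W)/u'(W)$ by finite differences. The parameter values (`gamma=3.0`, `alpha=0.5`) and the step size are arbitrary choices made for illustration.

```python
import math


def crra_utility(wealth, gamma=3.0):
    """CRRA utility; gamma is the coefficient of relative risk aversion."""
    if gamma == 1.0:
        return math.log(wealth)
    return wealth ** (1.0 - gamma) / (1.0 - gamma)


def cara_utility(wealth, alpha=0.5):
    """CARA utility; alpha is the coefficient of absolute risk aversion."""
    return -math.exp(-alpha * wealth) / alpha


def absolute_risk_aversion(utility, wealth, step=1e-3):
    """Estimate -u''(W) / u'(W) with central finite differences."""
    up, mid, down = utility(wealth + step), utility(wealth), utility(wealth - step)
    first = (up - down) / (2.0 * step)
    second = (up - 2.0 * mid + down) / step ** 2
    return -second / first


def relative_risk_aversion(utility, wealth, step=1e-3):
    """Estimate -W * u''(W) / u'(W)."""
    return wealth * absolute_risk_aversion(utility, wealth, step)


for wealth in (2.0, 4.0):
    print(wealth,
          round(relative_risk_aversion(crra_utility, wealth), 3),
          round(absolute_risk_aversion(cara_utility, wealth), 3))
```

Relative risk aversion stays at `gamma` for the CRRA function at both wealth levels, while absolute risk aversion stays at `alpha` for the CARA function.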
Information sets and avoiding leakage
Optimality is conditional on what you know. A policy can depend on
$\mathcal{F}_t$ and nothing more. This is also the constraint that
prevents look-ahead bugs in machine-learning pipelines: feature
engineering must avoid future data, and validation must split along
time. The forecasting evaluation discipline of Section 02-02 is what
keeps this enforceable.
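Splitting along time can be sketched as an expanding-window (walk-forward) scheme in which every test index comes strictly after every training index. The function name `walk_forward_splits` and its parameters are hypothetical, chosen for this illustration rather than taken from any library.

```python
def walk_forward_splits(n_obs, n_folds, min_train):
    """Yield (train, test) index ranges for an expanding-window evaluation.

    Training data always ends where the test block begins, so no test
    observation can inform the model that is evaluated on it.
    """
    test_size = (n_obs - min_train) // n_folds
    if test_size < 1:
        raise ValueError("not enough observations for the requested folds")
    for fold in range(n_folds):
        train_end = min_train + fold * test_size
        yield range(0, train_end), range(train_end, train_end + test_size)


splits = list(walk_forward_splits(n_obs=10, n_folds=3, min_train=4))
for train, test in splits:
    print(list(train), list(test))
```

The same ordering rule applies to features: anything computed for date t should use observations up to t only.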
From forecasts to decisions
Forecasting models deliver expected returns, conditional volatilities,
and predictive distributions. Optimal-decision modules convert them
into trades by solving optimisation problems or evaluating learned
policies. The glue between the two is the loss / utility function. A
probabilistic return forecast feeds a mean–CVaR optimisation; a
scenario tree feeds a reinforcement-learning environment; a regime
probability feeds a switching policy.
Without the principle of optimality, we would have to solve the
entire multi-period decision problem in one impossible step. With it,
we can work backward — solve the last-period problem for any reachable
state, plug that solution into the second-to-last period, and so on.
That is the recursion the rest of the chapter formalises.
Investing for three days
To see the principle of optimality in practice, consider a
three-day investment problem. We begin at day 0 with wealth $W_0$. At
each decision date (day 0 and day 1) we choose how to allocate wealth
between a risky asset with random gross return $R_{t+1}$ and a
risk-free asset with known gross return $R_f$. For now, the risk-free rate
is constant and there is only one risky asset; think of the choice as
"S&P 500 vs. cash." We assume all wealth stays invested, and the only
thing that matters is final wealth on day 2, $W_2$.
Even in this minimal setting, sequential optimal decision-making
appears clearly.
Wealth dynamics
Let $w_t$ be the fraction of wealth invested in the risky asset at day
$t$, so $1 - w_t$ goes into the risk-free asset. The portfolio gross
return from day $t$ to $t+1$ is
$$R_{P,t+1} = w_t R_{t+1} + (1 - w_t) R_f.$$
Wealth then evolves as $W_{t+1} = W_t R_{P,t+1}$. Over three days,
$$W_1 = W_0 \big(w_0 R_1 + (1 - w_0) R_f\big), \qquad W_2 = W_1 \big(w_1 R_2 + (1 - w_1) R_f\big).$$
By substitution, final wealth depends on both decisions:
$$W_2 = W_0 \big(w_0 R_1 + (1 - w_0) R_f\big)\big(w_1 R_2 + (1 - w_1) R_f\big).$$
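The wealth recursion above translates directly into a loop. This is a minimal sketch; the function name `terminal_wealth` and the numbers in the example are illustrative assumptions.

```python
def terminal_wealth(initial_wealth, fractions, risky_returns, rf):
    """Compound wealth through each period's portfolio gross return.

    fractions[t] is the risky fraction chosen at day t and
    risky_returns[t] is the risky gross return realised from day t to t + 1.
    """
    wealth = initial_wealth
    for w, risky in zip(fractions, risky_returns):
        portfolio_return = w * risky + (1.0 - w) * rf
        wealth *= portfolio_return
    return wealth


# Hypothetical path: the risky asset gains 10% on the first day, loses 10% on the second.
w2 = terminal_wealth(100.0, fractions=[0.5, 0.5], risky_returns=[1.10, 0.90], rf=1.01)
print(round(w2, 4))
```

With these inputs the two portfolio returns are 1.055 and 0.955, so wealth ends at 100.7525.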
Objective
We model investor preferences with a utility function $u(W)$ satisfying
$u'(W) > 0$ (more wealth is better) and $u''(W) < 0$ (concavity, hence
risk aversion). Concavity is what produces diversification in the
static portfolio theory of Section 02-03; here it produces dynamic
caution. With random returns, $W_2$ is random too, so we maximise
expected utility:
$$\max_{w_0, w_1} \mathbb{E}[u(W_2)].$$
This generalises mean–variance optimisation. When utility is quadratic,
expected utility reduces to mean–variance; for general utility, it
handles non-normal return distributions and richer preferences.
Solving all at once
A brute-force approach treats $w_0$ and $w_1$ as two unknowns and solves
the joint first-order conditions:
$$\max_{w_0, w_1} \mathbb{E}\Big[u\Big(W_0 \big(w_0 R_1 + (1 - w_0) R_f\big)\big(w_1 R_2 + (1 - w_1) R_f\big)\Big)\Big].$$
For two periods this is feasible. It does not scale: the number of
candidate plans grows exponentially with the horizon, and the joint
formulation makes no use of the order in which decisions are taken. We
need to exploit the sequential nature of the problem.
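The cost of the joint approach is easy to see in a brute-force sketch. It enumerates every plan on a grid of fractions and every return path, so the work is (grid size × number of outcomes) raised to the horizon. The two-point return distribution, log utility, and grid are assumptions made for illustration, and the plan is open-loop: each fraction is fixed in advance rather than chosen after observing wealth.

```python
import itertools
import math


def brute_force_plan(horizon, grid, outcomes, rf, utility, initial_wealth=1.0):
    """Search all plans (w_0, ..., w_{T-1}) on a grid, all at once.

    outcomes is a list of (risky_gross_return, probability) pairs, assumed
    independent and identically distributed across periods.
    """
    best_plan, best_value = None, -math.inf
    for plan in itertools.product(grid, repeat=horizon):
        value = 0.0
        for path in itertools.product(outcomes, repeat=horizon):
            wealth, prob = initial_wealth, 1.0
            for w, (risky, p) in zip(plan, path):
                wealth *= w * risky + (1.0 - w) * rf
                prob *= p
            value += prob * utility(wealth)
        if value > best_value:
            best_plan, best_value = plan, value
    return best_plan, best_value


grid = [i / 10 for i in range(11)]      # candidate fractions 0.0, 0.1, ..., 1.0
outcomes = [(1.4, 0.4), (0.8, 0.6)]     # hypothetical up and down states
plan, value = brute_force_plan(2, grid, outcomes, rf=1.0, utility=math.log)
print(plan, value)
```

Here the search evaluates 11² plans over 2² paths. At a 20-period horizon the same loop would need 11²⁰ plans, which is why the sequential structure matters.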
The principle of optimality
Imagine we are standing at day 1 with wealth $W_1$ already known. Only
one decision remains:
$$\max_{w_1} \mathbb{E}_1[u(W_2)], \qquad W_2 = W_1 \big(w_1 R_2 + (1 - w_1) R_f\big).$$
This subproblem does not depend on how we arrived at $W_1$. Whatever
happened on day 0 is summarised in the state $W_1$, and the optimal
$w_1^\star = w_1^\star(W_1)$ is defined entirely by the day-1
information set.
Now step back to day 0. When we choose $w_0$ we know two things:
- $w_0$ determines the distribution of $W_1$.
- From day 1 onwards we will follow $w_1^\star(W_1)$.
The day-0 problem becomes
$$\max_{w_0} \mathbb{E}_0[u(W_2^\star)] \quad \text{where} \quad W_2^\star = W_1 \big(w_1^\star(W_1) R_2 + (1 - w_1^\star(W_1)) R_f\big).$$
Day 0 no longer has to "guess" what we will do in the future. The
future is summarised by $w_1^\star$. This is the principle of
optimality:
If the decision for the final period is optimal, then the optimal
decision for the previous period is the one that leads into that
optimal future.
The recursion is solved backward and runs in time linear in the
horizon. Section 5-03 turns this informal statement into the Bellman
equation.
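A backward pass over a wealth grid illustrates the recursion. This is a sketch under stated assumptions, not the formal treatment of Section 5-03: returns follow a hypothetical two-point distribution, the value function is stored on a log-spaced wealth grid, and values between grid points are interpolated linearly in log-wealth and clamped at the grid edges (which distorts the policy near the edges).

```python
import bisect
import math


def backward_induction(horizon, wealth_grid, action_grid, outcomes, rf, utility):
    """Return (policy, value): policy[t][i] is the best fraction at date t, wealth_grid[i]."""
    log_grid = [math.log(w) for w in wealth_grid]

    def interpolate(values, wealth):
        # Linear interpolation in log-wealth, clamped at the grid edges.
        x = min(max(math.log(wealth), log_grid[0]), log_grid[-1])
        j = min(max(bisect.bisect_right(log_grid, x), 1), len(log_grid) - 1)
        x0, x1 = log_grid[j - 1], log_grid[j]
        weight = (x - x0) / (x1 - x0)
        return (1.0 - weight) * values[j - 1] + weight * values[j]

    value = [utility(w) for w in wealth_grid]   # terminal condition: V_T = u
    policy = []
    for _ in range(horizon):                    # one backward pass per decision date
        new_value, decisions = [], []
        for wealth in wealth_grid:
            best_w, best_v = None, -math.inf
            for w in action_grid:
                v = sum(p * interpolate(value, wealth * (w * risky + (1.0 - w) * rf))
                        for risky, p in outcomes)
                if v > best_v:
                    best_w, best_v = w, v
            new_value.append(best_v)
            decisions.append(best_w)
        value = new_value
        policy.insert(0, decisions)             # earliest date ends up first
    return policy, value


wealth_grid = [0.1 * 100 ** (i / 80) for i in range(81)]   # 0.1 to 10, log-spaced
action_grid = [i / 10 for i in range(11)]
outcomes = [(1.4, 0.4), (0.8, 0.6)]                        # hypothetical up and down states
policy, value = backward_induction(2, wealth_grid, action_grid, outcomes, 1.0, math.log)
print(policy[0][40], policy[1][40])                        # decisions at wealth close to 1.0
```

The work is one sweep over the wealth and action grids per date, so it grows linearly with the horizon for a fixed grid.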
Case study: a risk-neutral agent
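To see an explicit solution, take linear utility $u(W) = W$. Since
$u''(W) = 0$, there is no curvature and no risk aversion. At day 1,
$$\max_{w_1} \mathbb{E}_1[W_2] = \max_{w_1} W_1 \big(w_1 \mathbb{E}_1[R_2] + (1 - w_1) R_f\big).$$
$W_1$ is positive and known, so the choice maximises
$w_1 \mathbb{E}_1[R_2] + (1 - w_1) R_f$, which is linear in $w_1$ with
slope $\mathbb{E}_1[R_2] - R_f$. With $w_1$ restricted to $[0, 1]$ (no
leverage or short sales), the maximum sits at a corner:
- $w_1^\star = 1$ if $\mathbb{E}_1[R_2] > R_f$,
- $w_1^\star = 0$ if $\mathbb{E}_1[R_2] < R_f$,
- any $w_1 \in [0, 1]$ if equal.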
The same argument at day 0 gives the same kind of bang-bang rule. The
optimal three-day strategy under linear utility: in each period, invest
all wealth in whichever asset has the higher expected gross return at
that time.
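The bang-bang rule is short enough to write down and check against a grid search of the same linear objective. The sketch assumes the fraction is restricted to [0, 1]; the function names are hypothetical.

```python
def risk_neutral_fraction(expected_risky_return, rf):
    """Corner solution for linear utility with the fraction restricted to [0, 1]."""
    if expected_risky_return > rf:
        return 1.0
    if expected_risky_return < rf:
        return 0.0
    return None  # indifferent: any fraction in [0, 1] is optimal


def grid_check(expected_risky_return, rf, steps=100):
    """Brute-force the same linear objective on a grid as a sanity check."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda w: w * expected_risky_return + (1.0 - w) * rf)


for expected in (1.05, 0.99):
    print(expected, risk_neutral_fraction(expected, rf=1.01), grid_check(expected, rf=1.01))
```

Both functions agree whenever the expected risky return differs from the risk-free return. The rule ignores the variance of the risky asset entirely.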
This explicit solution is not realistic. Real investors are
risk-averse, not risk-neutral, and a risk-neutral policy has a
catastrophic flaw: if the risky asset carries any per-period
probability of a wipe-out loss, betting all wealth on it every period
compounds that probability. A 5% chance of ruin per period gives a
$1 - 0.95^{20} \approx 64\%$ chance of ruin over 20 periods, and the
figure passes 99% after about 90 periods. The empirical record in
financial markets is consistent with this: strategies that ignore
downside risk eventually collapse. Risk-neutral preferences are useful
as a teaching device for the recursion; they are the wrong utility for
any policy that has to live through stress.
Adding survival instinct
The risk-neutral case shows the recursion clearly. To get realistic
behaviour we need a utility that encodes both risk aversion and
survival concern. The minimum requirements:
- $u'(W) > 0$: more wealth is always better.
- $u''(W) < 0$: concavity, hence risk aversion (Section 02-03).
- $\lim_{W \to 0^+} u(W) = -\infty$: ruin is catastrophic.
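The optimisation problem is unchanged in form,
$\max_{w_0, w_1} \mathbb{E}[u(W_2)]$, but low-wealth trajectories are
heavily penalised. Differentiating $\mathbb{E}_1[u(W_2)]$ with respect
to $w_1$ and dividing by $W_1 > 0$, the day-1 first-order condition at
an interior optimum reads
$$\mathbb{E}_1\big[u'(W_2)\,(R_2 - R_f)\big] = 0,$$
with $W_2 = W_1 \big(w_1 R_2 + (1 - w_1) R_f\big)$. Because $u'$ is decreasing,
the marginal utility $u'(W_2)$ weights bad outcomes more heavily, and it
depends on $W_1$: lower initial wealth makes the agent more sensitive
to downside risk. The optimal policy is in general wealth-dependent,
$w_1^\star = w_1^\star(W_1)$, and the form of that dependence is set by
the utility. Under CRRA the optimal fraction is constant and only the
amount at risk, $w_1^\star W_1$, shrinks with wealth. Utilities whose
relative risk aversion rises as wealth falls (for example, one with a
subsistence floor) cut the fraction itself as wealth approaches the
danger zone. The dynamic programming structure is unchanged (we still
solve backward), but the utility shape has produced cautious behaviour
without imposing external constraints.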
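A small numerical sketch shows how utility shape alone can produce wealth-dependent caution. It assumes a hypothetical utility $u(W) = \log(W - F)$ with subsistence floor $F$, which satisfies the three requirements with ruin located at $F$ rather than at zero, together with a two-point return distribution and fractions restricted to [0, 1]. All parameter values are illustrative.

```python
import math


def optimal_fraction(wealth, floor=50.0, rf=1.0, up=1.2, down=0.9, p_up=0.5, steps=1000):
    """Grid-search the day-1 risky fraction for u(W) = log(W - floor)."""
    best_w, best_value = None, -math.inf
    for i in range(steps + 1):
        w = i / steps
        surplus_up = wealth * (w * up + (1.0 - w) * rf) - floor
        surplus_down = wealth * (w * down + (1.0 - w) * rf) - floor
        if surplus_up <= 0.0 or surplus_down <= 0.0:
            continue  # terminal wealth at or below the floor: utility is minus infinity
        value = p_up * math.log(surplus_up) + (1.0 - p_up) * math.log(surplus_down)
        if value > best_value:
            best_w, best_value = w, value
    return best_w


for wealth in (55.0, 60.0, 80.0, 100.0):
    print(wealth, optimal_fraction(wealth))
```

For these parameters the first-order condition solves to $w_1^\star = 2.5\,(1 - 50 / W_1)$, capped at 1, so the grid search should return fractions that fall as wealth approaches the floor. No position limit was imposed to get that behaviour.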
What this chapter builds toward
Section 5-02 fills in the explicit utility forms (CRRA, log, quadratic,
Epstein–Zin) and turns the recursion into the static and intertemporal
optimisation problems they correspond to. Section 5-03 formalises the
recursion as the Bellman equation and walks through value iteration,
policy iteration, and the HJB equation. Sections 5-04 and 5-05 carry
the recursion into reinforcement learning, where the value function and
the policy are approximated by neural networks because the state space
is too large to enumerate.
The principle of optimality is the load-bearing idea. The rest of the
chapter is about what to do when its assumptions hold imperfectly —
when the state is high-dimensional, the dynamics are unknown, or the
utility includes risk attitudes that closed-form solutions do not
support.