7  Optimal Decision Making

Author

Adrien Osakwe

7.1 Decision-Making

Many statistical procedures involve some form of decision-making where actions are taken given the observed data. Examples include

  • parameter estimation

  • hypothesis testing

  • prediction/classification

  • model selection

We can denote a function \(T\) which is used as an estimator of a statistic of interest from the random variable \(Y\). Consider, the mean, order statistics, or the empirical cdf. We can also consider the model space \(\mathbb{F}\). We can then examine a loss function \(L(..)\) which will determine the accuracy with which we report \(T\) when the ground truth comes from \(F \in \mathbb{F}\).

For example, with the cdf \(F_y\) we can have that

\[ \mu = \int yF_y(dy) \]

And define the loss function as \(L(T,F_y) = (T - \mu)^2\) to represent the loss in reporting the estimator \(T\) when the value of interest is \(\mu\).

The optimal decision is the decision which minimizes the expected loss with respect to the distribution \(F_y\). In a parametric analysis defined by \(\theta\). We have that

  1. \(\theta\) is considered a fixed constant and the data as random in a frequentist setting
  2. The data are fixed and \(\theta\) is random in a Bayesian setting

7.1.1 Frequentist calculation

Back to our previous example, in a frequentist setting the optimal decision for \(T\) would be

\[ argmin_{\text{T}}\mathbb{E}_{F_y}[(T- \mu)^2] = argmin_{\text{T}}\{\mathbb{E}_{F_y}[(T -\mathbb{E}_{F_y}[T])^2] + (\mathbb{E_{F_y}[T] - \mu)^2}) \} \]

\[ = argmin_{\text{T}}\{Var_{F_y}[T] + (\mathbb{E_{F_y}[T] - \mu)^2})\} \]

This tells us that we need to take into account the variance of T and the squared bias to define the optimal T.

7.1.2 Kullback-Leibler Divergence/Loss

In Bayesian settings, we usually look at the Kullback-Leibler Loss which is used to measure the difference between two distributions \(F_0\) and \(F_1\).

\[ KL(F_0,F_1) = \int \log\{\frac{dF_0(y)}{dF_1(y)}\}dF_0(y) \]

This is defined when \(F_1\) is absolutely continuous with respect to \(F_0\). That is, the probability mass at a given point for \(F_1\) is zero whenever it is zero for \(F_0\). In essence, we can express the KL as an expectation:

\[ KL(f_0,f_1) = \mathbb{E}_{f_0}[log\{\frac{f_0(Y)}{f_1(Y)}\}] \]

It is important to note that:

  1. KL is always non-negative
  2. \(KL(F_0,F_1) \neq KL(F_1,F_0)\)
  3. KL can only be zero if both distributions are identical.

This is applicable to both discrete and continuous distributions. In a parametric setting, we can expect to have pdf \(f(y;\theta)\) where \(\theta_0\) represents the optimal, data-generating model. We can then write KL as

\[ KL(\theta_0,\theta) = \int\{\frac{f(y;\theta_0)}{f(y;\theta)}\}f(y;\theta_0)dy \]

and aim to report an estimator \(\hat{\theta} = T(Y)\) for the true value \(\theta_0\)

7.1.3 Decision Theory Concepts

The key parts of a decision problem are:

  • a decision d, selected from a set of \(D\) decisions must be made.

  • there is a true state of nature, \(v(\theta)\), which lies in the set \(\gamma\) , that is defined bby the data generating model, \(F_Y(y;\theta)\) .

  • there is a loss function \(L(d,v)\) for decision \(d\) and state \(v\) which records the loss when for a given state \(v\) and decision \(d\).

With these components, we aim to minimize the expected loss.

In the case of estimation, the decision required is the estimation of the parameter \(\theta\) and the true state of nature is the parameter’s value, that is, \(v(\theta) \equiv \theta\). If data is available, the optimal decision is usually a function of the observed data. If the decision takes the form of a statistic, we have \(d(y) \equiv T(y) = t_n\) with associated loss \(L(t_n,\theta)\). The corresponding random variable will be \(T_n \equiv T(Y)\).

Frequentist Risk (loss): The expected loss associated with \(d(Y)\) over the distribution of \(Y|\theta\).

\[ R_n(d,\theta) = \mathbb{E}_{F_Y}[L(T_n,\theta)] = \int_yL(T(y),\theta)f_Y(y;\theta)dy \]

Bayes Risk (average): Is the expected risk over the prior distribution:

\[ R_n(d) = \mathbb{E}_{\pi_0}[R_n(d,\theta)] = \mathbb{E}_{\pi_0}[\mathbb{E}_{F_Y}[L(T_n,\theta)]] \]

\[ = \int_\Theta\{ \int_yL(t_n,\theta)f_Y(y;\theta)dy\}\pi_0(\theta)d\theta \]

where \(t_n \equiv t_n(y)\).

\[ = \int_\Theta\int_yL(t_n,\theta)f_Y(y)\pi_n(\theta)dyd\theta \]

\[ = \int_y\{\int_\Theta L(t_n,\theta)\pi_n(\theta)d\theta\}f_Y(y)dy \]

Following Bayes Theorem.

With the prior and fixed observable data, the optimal Bayesian decision, the Bayes rule is

\[ \hat{d}_B = argmin_{d \in D} R_n(d(y)) \]

Interpretation:

  • Bayes risk is minimized when the inner integral is minimized for fixed observations, regardless of their value

  • The double integral is minimized

  • or that the optimal decision should be chosen conditionally on the observed data and not averaged over all possible values.

To estimate, we need to select a statistic of interest. The minimization can be reduced to the following:

\[ argmin_{t \in \Theta}\int_\Theta L(t_n,\theta)\pi_n(\theta)d\theta \]

Given the interpretation mentioned above. In other words, the optimal decision minimizes the Bayes risk and the posterior expected loss.

7.1.3.1 Squared-error loss

  • Returns the posterior expectation (see proof)

7.1.3.2 Absolute error loss

  • Returns the posterior median

7.1.3.3 Zero-one loss

  • Returns the posterior mode