9 Bayesian Modelling
In this section we will explore extensions of the Bayesian framework to other classes of model:
regression
latent variable models
hierarchical models
9.1 Regression Models
We can consider an infinite exchangeable sequence of pairs $(X_1, Y_1), (X_2, Y_2), \dots$ whose joint distribution can be factorized as
$$p(x_{1:n}, y_{1:n}) = p(x_{1:n}) \, p(y_{1:n} \mid x_{1:n}),$$
where each term has a de Finetti representation.
Given the above structure, inference splits into two parts: inference for the parameters of the X distribution is done through the marginal model for the X variables, while inference for the regression parameters is done through the conditional model for Y given that X is observed. For the latter, the fact that X is random is irrelevant, as we have conditioned the model on the observed values of X.
When considering the statistical behaviour of Bayesian (and frequentist) procedures, we need to remember that X and Y have a joint structure.
Prediction
9.2 Linear Regression
We can start with the following linear regression model:
$$Y_i = x_i \beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \text{ independently},$$
where, for $i = 1, \dots, n$, $Y_i$ is a scalar, $x_i$ is $(1 \times d)$, and $\beta$ is $(d \times 1)$.
With this structure, we can describe the model for the partially exchangeable random variables: the error terms $\varepsilon_1, \varepsilon_2, \dots$ form an exchangeable sequence, i.i.d. $N(0, \sigma^2)$ given $\sigma^2$.
We can look at the vector form of the linear regression model,
$$y = X\beta + \varepsilon,$$
where the response $y$ and the error vector $\varepsilon$ are $(n \times 1)$ and the design matrix $X$ is $(n \times d)$.
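As a quick sanity check of the dimensions above, the vector-form model can be simulated directly. A minimal sketch; the values of $n$, $d$, $\beta$ and $\sigma^2$ are illustrative, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3                                     # illustrative sizes
X = rng.normal(size=(n, d))                      # (n x d) design matrix
beta_true = np.array([1.0, -2.0, 0.5])           # (d x 1) coefficient vector
sigma2 = 0.25                                    # error variance
eps = rng.normal(scale=np.sqrt(sigma2), size=n)  # (n x 1) iid N(0, sigma^2) errors
y = X @ beta_true + eps                          # (n x 1) response, y = X beta + eps
```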
We can then have a conditional model
$$y \mid \beta, \sigma^2 \sim N_n(X\beta, \sigma^2 I_n),$$
where $I_n$ is the $(n \times n)$ identity matrix. With this structure, we know the likelihood to be
$$p(y \mid \beta, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}(y - X\beta)^T (y - X\beta)\right).$$
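The likelihood can be evaluated numerically. A sketch assuming the Gaussian form written above; the function name and demo values are my own:

```python
import numpy as np

def log_likelihood(y, X, beta, sigma2):
    """Log of p(y | beta, sigma^2) for y = X beta + eps, eps ~ N(0, sigma^2 I)."""
    n = len(y)
    resid = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - (resid @ resid) / (2 * sigma2)

# With a perfect fit the residual term vanishes, leaving only the normalizing constant.
y_demo = np.array([1.0, 2.0])
X_demo = np.array([[1.0], [2.0]])
ll = log_likelihood(y_demo, X_demo, np.array([1.0]), 1.0)
```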
We can derive a joint conjugate prior of normal-inverse-gamma form,
$$p(\beta, \sigma^2) = p(\beta \mid \sigma^2)\, p(\sigma^2),$$
where
$$\beta \mid \sigma^2 \sim N_d(m_0, \sigma^2 M_0), \qquad \sigma^2 \sim \text{Inv-Gamma}(a_0, b_0),$$
with hyperparameters $m_0$ $(d \times 1)$, $M_0$ $(d \times d)$ positive definite, and $a_0, b_0 > 0$.
We can explore the exponent of the resulting posterior as a quadratic form in $\beta$. The expression
$$(y - X\beta)^T (y - X\beta) + (\beta - m_0)^T M_0^{-1} (\beta - m_0)$$
equates to
$$\beta^T (X^T X + M_0^{-1}) \beta - 2\beta^T (X^T y + M_0^{-1} m_0) + y^T y + m_0^T M_0^{-1} m_0.$$
Quadratic term:
$$\beta^T M_n^{-1} \beta = \beta^T (X^T X + M_0^{-1}) \beta,$$
and therefore
$$M_n = (X^T X + M_0^{-1})^{-1}.$$
Linear term:
$$\beta^T M_n^{-1} m_n = \beta^T X^T y + \beta^T M_0^{-1} m_0,$$
and therefore
$$m_n = M_n (X^T y + M_0^{-1} m_0).$$
Constant term:
$$y^T y + m_0^T M_0^{-1} m_0 - m_n^T M_n^{-1} m_n,$$
and therefore
$$a_n = a_0 + \frac{n}{2}, \qquad b_n = b_0 + \frac{1}{2}\left(y^T y + m_0^T M_0^{-1} m_0 - m_n^T M_n^{-1} m_n\right).$$
This gives us the joint posterior (under proportionality) as
$$p(\beta, \sigma^2 \mid y) \propto (\sigma^2)^{-(a_n + d/2 + 1)} \exp\left(-\frac{1}{\sigma^2}\left[b_n + \frac{1}{2}(\beta - m_n)^T M_n^{-1} (\beta - m_n)\right]\right),$$
which tells us that the conditional posterior is
$$\beta \mid \sigma^2, y \sim N_d(m_n, \sigma^2 M_n),$$
and marginalizing the joint posterior over $\beta$ gives
$$\sigma^2 \mid y \sim \text{Inv-Gamma}(a_n, b_n),$$
where $m_n$, $M_n$, $a_n$, $b_n$ are as derived above.
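The update formulas derived above can be collected into a single function. A sketch: the function name and the simulated data are my own, but the algebra follows the derivation:

```python
import numpy as np

def nig_posterior(y, X, m0, M0, a0, b0):
    """Conjugate normal-inverse-gamma update for Bayesian linear regression.

    Returns (m_n, M_n, a_n, b_n) following the derivation above:
      M_n = (X^T X + M_0^{-1})^{-1}
      m_n = M_n (X^T y + M_0^{-1} m_0)
      a_n = a_0 + n/2
      b_n = b_0 + (y^T y + m_0^T M_0^{-1} m_0 - m_n^T M_n^{-1} m_n) / 2
    """
    n = len(y)
    M0_inv = np.linalg.inv(M0)
    Mn_inv = X.T @ X + M0_inv
    Mn = np.linalg.inv(Mn_inv)
    mn = Mn @ (X.T @ y + M0_inv @ m0)
    an = a0 + n / 2
    bn = b0 + 0.5 * (y @ y + m0 @ M0_inv @ m0 - mn @ Mn_inv @ mn)
    return mn, Mn, an, bn

# Illustrative data: with a vague prior the posterior mean should sit near the truth.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
beta_true = np.array([1.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.1, size=200)
mn, Mn, an, bn = nig_posterior(y, X, np.zeros(2), 100.0 * np.eye(2), 1.0, 1.0)
```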
Note: see slides 218-221 for the marginal
Assigning prior ignorance to $(\beta, \sigma^2)$ corresponds to the limits $M_0^{-1} \to 0$ and $a_0, b_0 \to 0$, in which case $m_n \to (X^T X)^{-1} X^T y$, the least-squares estimate.
We can alternatively use a g-prior: $m_0 = 0$ and $M_0 = g\,(X^T X)^{-1}$ with hyperparameter $g > 0$, and hence
$$M_n = \frac{g}{1+g} (X^T X)^{-1}, \qquad m_n = \frac{g}{1+g} (X^T X)^{-1} X^T y,$$
so the posterior mean shrinks the least-squares estimate towards zero by the factor $g/(1+g)$.
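The shrinkage identity for the g-prior is easy to verify numerically. A sketch; the data and the value of $g$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=100)
g = 10.0

XtX = X.T @ X
beta_ls = np.linalg.solve(XtX, X.T @ y)  # least-squares estimate (X^T X)^{-1} X^T y
Mn = np.linalg.inv(XtX + XtX / g)        # M_n = (X^T X + M_0^{-1})^{-1}, M_0 = g (X^T X)^{-1}
mn = Mn @ (X.T @ y)                      # m_0 = 0, so m_n = M_n X^T y
shrunk = g / (1.0 + g) * beta_ls         # the claimed closed form
```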
If, instead of the g-prior, we have $m_0 = 0$ and $M_0^{-1} = \lambda I_d$, then we will have
$$m_n = (X^T X + \lambda I_d)^{-1} X^T y,$$
which gives us the procedure for ridge regression.
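The ridge connection can be checked by confirming that this $m_n$ zeroes the gradient of the penalized least-squares objective $\|y - X\beta\|^2 + \lambda\|\beta\|^2$. A sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 4))
y = X @ np.array([1.0, 0.5, 0.0, -1.5]) + rng.normal(size=80)
lam = 2.0

# Posterior mean with m_0 = 0, M_0^{-1} = lam * I -- the ridge estimator.
mn = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
# Gradient of the ridge objective ||y - X b||^2 + lam ||b||^2, evaluated at b = m_n.
grad = -2.0 * X.T @ (y - X @ mn) + 2.0 * lam * mn
```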
Jeffreys prior for linear regression (see slide 228 for the derivation):