9 Bayesian Modelling
In this section we will explore extensions of the Bayesian framework to other classes of model:
regression
latent variable models
hierarchical models
9.1 Regression Models
We can consider an infinite sequence \(\{ (X_n,Y_n), n = 1,2,…\}\) such that for any \(n \geq 1\)
\[ f_{X_1,...,X_n,Y_1,...,Y_n}(x_1,...,x_n,y_1,...,y_n) \]
can be factorized as
\[ f_{X_1,...,X_n}(x_1,...,x_n)f_{Y_1,...,Y_n|X_1,...,X_n}(y_1,...,y_n|x_1,...,x_n) \]
where each term has a de Finetti representation.
\[ f_{X_1,...,X_n}(x_1,...,x_n) = \int \{\prod_{i=1}^nf_X(x_i;\phi)\}\pi_0(d\phi) \\ f_{Y_1,...,Y_n|X_1,...,X_n}(y_1,...,y_n|x_1,...,x_n) = \int \{\prod_{i=1}^nf_{Y|X}(y_i|x_i;\theta)\}\pi_0(d\theta) \]
Given the above structure, inference for \((\phi,\theta)\) is required:
inference for \(\phi\) is done through the marginal model for the X variables
inference for \(\theta\) is done through the conditional model for Y given that X is observed.
For the latter, the fact that X is random is irrelevant as we have conditioned the model on observed values of X.
When considering the statistical behaviour of Bayesian (and frequentist) procedures, we need to remember that X and Y have a joint structure.
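As a concrete illustration of this joint structure, the sketch below simulates pairs from the two de Finetti representations, using hypothetical Gaussian choices for the mixands and the priors (these specific distributions are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_exchangeable_pairs(n):
    # Draw the parameters once from their priors (the de Finetti mixing measures) ...
    phi = rng.normal(0.0, 1.0)
    theta = rng.normal(0.0, 1.0)
    # ... then, given (phi, theta), the pairs are conditionally i.i.d.
    x = rng.normal(phi, 1.0, size=n)   # marginal model for X
    y = rng.normal(theta * x, 1.0)     # conditional model for Y given X
    return x, y

x, y = simulate_exchangeable_pairs(5)
print(x, y)
```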
Prediction
\[ f_{Y_{n+1}|X_{1:n},Y_{1:n}}(y_{n+1}|x_{1:n},y_{1:n}) \\ = \int f_{X_{n+1},Y_{n+1}|X_{1:n},Y_{1:n}}(x_{n+1},y_{n+1}|x_{1:n},y_{1:n})dx_{n+1} \\ = \int f_{Y_{n+1}|X_{1:n},X_{n+1},Y_{1:n}}(y_{n+1}|x_{1:n},x_{n+1},y_{1:n})f_{X_{n+1}|X_{1:n},Y_{1:n}}(x_{n+1}|x_{1:n},y_{1:n})dx_{n+1} \]
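The integral over \(x_{n+1}\) can be approximated by Monte Carlo: draw \(x_{n+1}\) from its predictive distribution and average the conditional density of \(y_{n+1}\). The sketch below continues the hypothetical Gaussian example above, with placeholder arrays standing in for posterior draws of \(\phi\) and \(\theta\):

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder posterior draws of phi and theta given (x_{1:n}, y_{1:n}); in practice
# these would come from the actual posterior under the Gaussian example above.
phi_draws = rng.normal(1.0, 0.1, size=5000)
theta_draws = rng.normal(0.5, 0.1, size=5000)

def predictive_density(y_new):
    """Monte Carlo estimate of f(y_{n+1} | x_{1:n}, y_{1:n}):
    sample x_{n+1} given phi, then average the Normal(theta * x_{n+1}, 1)
    density evaluated at y_new over the posterior draws."""
    x_new = rng.normal(phi_draws, 1.0)                 # x_{n+1} | phi
    dens = np.exp(-0.5 * (y_new - theta_draws * x_new) ** 2) / np.sqrt(2.0 * np.pi)
    return dens.mean()

print(predictive_density(0.7))
```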
9.2 Linear Regression
We can start with the following linear regression model
\[ Y_i = x_i\beta + \epsilon_i \]
where for \(i = 1,…,n\)
\(Y_i\) is a scalar
\(x_i\) is (1 x d)
\(\beta\) is (d x 1)
\(\epsilon_i \sim Normal(0,\sigma^2)\), independently.
With this structure, we can describe the model for the partially exchangeable random variables (error terms) \(\epsilon_i = Y_i - x_i\beta\), conditional on \(X_i = x_i\). In this scenario there may or may not be a need to model the distribution of \(X_i\): if the parameters of the marginal model for \(X\) are distinct from, and a priori independent of, \((\beta,\sigma^2)\), then inference for the regression parameters can proceed from the conditional model alone.
We can look at the vector form of the linear regression model
\[ Y = X\beta + \epsilon \]
where the response variable and the error terms are (n x 1) vectors and the predictors form an (n x d) matrix.
We can then have a conditional model
\[ f_{Y_1,...,Y_n|X_1,...,X_n}(y_1,...,y_n|x_1,...,x_n;\beta,\sigma^2) \equiv Normal_n(X\beta,\sigma^2I_n) \]
where \(I_n\) is the (n x n) identity matrix.
With this structure, we know the likelihood to be
\[ \mathcal{L}_n(\beta,\sigma^2) = (\frac{1}{2\pi\sigma^2})^{n/2}\exp\{-\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta)\} \]
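For reference, a minimal sketch of this likelihood on the log scale, assuming X is an (n x d) numpy array and y a length-n vector:

```python
import numpy as np

def log_likelihood(beta, sigma2, X, y):
    """Log of L_n(beta, sigma2) for the Gaussian linear model above."""
    n = len(y)
    resid = y - X @ beta
    return -0.5 * n * np.log(2.0 * np.pi * sigma2) - 0.5 * (resid @ resid) / sigma2
```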
We can derive a joint conjugate prior
\[ \pi_0(\beta,\sigma^2) = \pi_0(\sigma^2)\pi_0(\beta|\sigma^2) \]
where
\[ \pi_0(\sigma^2) \equiv InvGamma(a_0/2,b_0/2)\\ \pi_0(\beta|\sigma^2) \equiv Normal_d(m_0,\sigma^2M_0) \]
where \(a_0,b_0,m_0,M_0\) are user-defined constant hyperparameters. The joint posterior is then given, up to proportionality, by
\[ \pi_n(\beta,\sigma^2) \propto \mathcal{L}_n(\beta,\sigma^2)\pi_0(\beta,\sigma^2) \\ \pi_0(\sigma^2) = \frac{(b_0/2)^{a_0/2}}{\Gamma(a_0/2)}(\frac{1}{\sigma^2})^{a_0/2 +1}\exp\{-\frac{b_0}{2\sigma^2}\}\\ \pi_0(\beta|\sigma^2) = (\frac{1}{2\pi\sigma^2})^{d/2}\frac{1}{|M_0|^{1/2}}\exp\{-\frac{1}{2\sigma^2}(\beta- m_0)^TM_0^{-1}(\beta-m_0)\} \]
We can combine the exponents of the above posterior into a single quadratic form in \(\beta\). The expression
\[ (y - X\beta)^T(y-X\beta) + (\beta - m_0)^TM_0^{-1}(\beta-m_0) \]
can be rewritten as \((\beta - m_n)^TM_n^{-1}(\beta-m_n) + c_n\), where the expressions for \(m_n, M_n, c_n\) are found by matching terms.
Quadratic term:
\(\beta^TM_n^{-1}\beta = \beta^TX^TX\beta + \beta^T M_0^{-1}\beta\)
and therefore
\(M_n^{-1} = X^TX + M_0^{-1}\), so \(M_n = (X^TX + M_0^{-1})^{-1}\)
Linear term:
\(\beta^TM_n^{-1}m_n = \beta^TX^Ty + \beta^TM_0^{-1}m_0\)
and therefore
\(m_n = M_n(X^Ty + M_0^{-1}m_0)\)
\(= (X^TX + M_0^{-1})^{-1}(X^Ty +M_0^{-1}m_0)\)
Constant term:
\(m_n^TM_n^{-1}m_n + c_n = y^Ty + m_0^TM_0^{-1}m_0\)
and therefore
\(c_n = y^Ty + m_0^TM_0^{-1}m_0 - m_n^TM_n^{-1}m_n\)
This gives us the joint posterior (under proportionality) as
\[ \pi_n(\beta,\sigma^2) \propto (\frac{1}{\sigma^2})^{\frac{(n+a_0+d)}{2}+1}\exp\{-\frac{(c_n + b_0)}{2\sigma^2}\}\exp\{-\frac{1}{2\sigma^2}(\beta- m_n)^TM_n^{-1}(\beta-m_n)\} \]
This tells us that the conditional posterior for \(\beta\), and the marginal posterior for \(\sigma^2\) obtained by integrating the joint posterior over \(\beta\), are given by
\[ \pi_n(\beta|\sigma^2) \equiv Normal_d(m_n,\sigma^2M_n) \\ \pi_n(\sigma^2) \equiv InvGamma(a_n/2,b_n/2) \]
where \(a_n = a_0 + n\) and \(b_n = b_0 + c_n\).
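Putting the updates together, here is a minimal sketch (assuming numpy arrays shaped as above) of the map from prior to posterior hyperparameters:

```python
import numpy as np

def conjugate_update(X, y, m0, M0, a0, b0):
    """Posterior hyperparameters (m_n, M_n, a_n, b_n) for the
    Normal-Inverse-Gamma conjugate prior of the linear model."""
    M0_inv = np.linalg.inv(M0)
    Mn_inv = X.T @ X + M0_inv
    Mn = np.linalg.inv(Mn_inv)
    mn = Mn @ (X.T @ y + M0_inv @ m0)
    cn = y @ y + m0 @ M0_inv @ m0 - mn @ Mn_inv @ mn
    an = a0 + len(y)
    bn = b0 + cn
    return mn, Mn, an, bn
```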
Note: see slides 218-221 for the marginal posterior of \(\beta\); integrating \(\sigma^2\) out against the Inverse Gamma density shows that \(\pi_n(\beta)\) is a multivariate Student-t distribution.
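Rather than working with the Student-t density explicitly, we can also sample from the joint posterior by composition; a minimal sketch, assuming the hyperparameters \(m_n, M_n, a_n, b_n\) computed as above:

```python
import numpy as np

def sample_posterior(mn, Mn, an, bn, size, seed=0):
    """Draw (beta, sigma2) from the joint posterior by composition:
    sigma2 ~ InvGamma(an/2, bn/2), then beta | sigma2 ~ Normal_d(mn, sigma2 * Mn).
    Marginally, the beta draws follow the multivariate Student-t posterior."""
    rng = np.random.default_rng(seed)
    # InvGamma(a, b) draws via 1 / Gamma(shape=a, scale=1/b)
    sigma2 = 1.0 / rng.gamma(shape=an / 2.0, scale=2.0 / bn, size=size)
    L = np.linalg.cholesky(Mn)
    z = rng.standard_normal((size, len(mn)))
    beta = mn + np.sqrt(sigma2)[:, None] * (z @ L.T)
    return beta, sigma2
```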
Assigning prior ignorance to \(\beta\) (letting \(M_0^{-1} \rightarrow 0\)) leads to results that coincide with the maximum likelihood approach:
\[ m_n \rightarrow (X^TX)^{-1}X^Ty \\ M_n \rightarrow (X^TX)^{-1} \]
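A quick numerical check of this limit on simulated data, with \(M_0^{-1}\) taken negligibly small:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(50)

# Near-ignorance prior: M0^{-1} close to zero pulls m_n towards the OLS estimate.
m0 = np.zeros(3)
M0_inv = 1e-8 * np.eye(3)
mn = np.linalg.solve(X.T @ X + M0_inv, X.T @ y + M0_inv @ m0)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(mn, beta_ols))   # expected: True
```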
We can alternatively use a g-prior with hyperparameter \(\lambda > 0\):
\[ M_0 = \lambda^{-1}(X^TX)^{-1} \]
and hence, since \(M_0^{-1} = \lambda X^TX\),
\[ M_n = (1 + \lambda)^{-1}(X^TX)^{-1} \]
If instead we take
\[ m_0 = 0_d, \qquad M_0 = \lambda^{-1} I_d \]
then we will have
\[ m_n = (X^TX + \lambda I_d)^{-1}X^Ty \\ M_n = (X^TX + \lambda I_d)^{-1} \]
which gives us the procedure for ridge regression.
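As a sanity check, the sketch below compares this posterior mean with a standard ridge fit minimising \(\lVert y - X\beta\rVert^2 + \lambda\lVert\beta\rVert^2\) (this assumes scikit-learn is available; any ridge implementation would do):

```python
import numpy as np
from sklearn.linear_model import Ridge  # assumes scikit-learn is installed

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(50)
lam = 2.0

# Posterior mean with m0 = 0 and M0^{-1} = lam * I ...
m_n = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# ... matches the ridge solution without an intercept.
ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print(np.allclose(m_n, ridge.coef_))   # expected: True
```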
Jeffreys prior for linear regression (see slide 228 for the derivation):
\[ \pi_0(\beta,\sigma^2) \propto (\frac{1}{\sigma^2})^{d/2 +1} \]