9  Bayesian Modelling

Author

Adrien Osakwe

In this section we explore extensions of the Bayesian framework to other modelling settings, beginning with regression.

9.1 Regression Models

We can consider an infinite sequence $\{(X_n, Y_n),\, n = 1, 2, \ldots\}$ such that for any $n \geq 1$

$$f_{X_1,\ldots,X_n,Y_1,\ldots,Y_n}(x_1,\ldots,x_n,y_1,\ldots,y_n)$$

can be factorized as

$$f_{X_1,\ldots,X_n}(x_1,\ldots,x_n)\, f_{Y_1,\ldots,Y_n|X_1,\ldots,X_n}(y_1,\ldots,y_n|x_1,\ldots,x_n)$$

where each term has a de Finetti representation:

$$f_{X_1,\ldots,X_n}(x_1,\ldots,x_n) = \int \left\{\prod_{i=1}^n f_X(x_i;\phi)\right\} \pi_0(d\phi)$$

$$f_{Y_1,\ldots,Y_n|X_1,\ldots,X_n}(y_1,\ldots,y_n|x_1,\ldots,x_n) = \int \left\{\prod_{i=1}^n f_{Y|X}(y_i|x_i;\theta)\right\} \pi_0(d\theta)$$

Given the above structure, inference for $(\phi, \theta)$ is required:

  • inference for $\phi$ is done through the marginal model for the $X$ variables

  • inference for $\theta$ is done through the conditional model for $Y$ given that $X$ is observed.

For the latter, the fact that $X$ is random is irrelevant, as we have conditioned the model on the observed values of $X$.

When considering the statistical behaviour of Bayesian (and frequentist) procedures, we need to remember that $X$ and $Y$ have a joint structure.

Prediction

$$f_{Y_{n+1}|X_{1:n},Y_{1:n}}(y_{n+1}|x_{1:n},y_{1:n}) = \int f_{X_{n+1},Y_{n+1}|X_{1:n},Y_{1:n}}(x_{n+1},y_{n+1}|x_{1:n},y_{1:n})\, dx_{n+1}$$

$$= \int f_{Y_{n+1}|X_{1:n},X_{n+1},Y_{1:n}}(y_{n+1}|x_{1:n},x_{n+1},y_{1:n})\, f_{X_{n+1}|X_{1:n},Y_{1:n}}(x_{n+1}|x_{1:n},y_{1:n})\, dx_{n+1}$$
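A minimal Monte Carlo sketch of this marginalization (not from the notes; `sample_x_next` and `cond_density_y` are hypothetical placeholders for the two predictive factors appearing in the integral):

```python
import numpy as np

def predictive_density_y(y_next, sample_x_next, cond_density_y, n_draws=10_000):
    # Draw x_{n+1} from f_{X_{n+1}|X_{1:n},Y_{1:n}} and average the
    # conditional density f_{Y_{n+1}|X_{1:n+1},Y_{1:n}} over those draws.
    draws = [cond_density_y(y_next, sample_x_next()) for _ in range(n_draws)]
    return np.mean(draws)
```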

9.2 Linear Regression

We can start with the following linear regression model

$$Y_i = x_i\beta + \epsilon_i$$

where, for $i = 1, \ldots, n$,

  • $Y_i$ is a scalar,

  • $x_i$ is a $(1 \times d)$ row vector,

  • $\beta$ is a $(d \times 1)$ column vector,

  • $\epsilon_i \sim \text{Normal}(0, \sigma^2)$, independently.

With this structure, we can describe the model for the partially exchangeable random variables (error terms) $\epsilon_i = Y_i - x_i\beta$, conditional on $X_i = x_i$. In this scenario, there may or may not be a need to model the distribution of $X_i$: as noted above, inference for the regression parameters conditions on the observed $x_i$, so a model for $X_i$ is only needed if we also wish to make inferences about the $X$ distribution or to predict a future $X_{n+1}$.

We can also write the linear regression model in vector form

$$Y = X\beta + \epsilon$$

where the response $Y$ and the error vector $\epsilon$ are $(n \times 1)$ vectors and the design matrix $X$ is $(n \times d)$.

We can then have a conditional model

$$f_{Y_1,\ldots,Y_n|X_1,\ldots,X_n}(y_1,\ldots,y_n|x_1,\ldots,x_n;\beta,\sigma^2) \equiv \text{Normal}_n(X\beta, \sigma^2 I_n)$$

where $I_n$ is the $(n \times n)$ identity matrix.

With this structure, we know the likelihood to be

$$L_n(\beta,\sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-\frac{1}{2\sigma^2}(y - X\beta)^T(y - X\beta)\right\}$$
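As an illustration (a sketch, not part of the notes), the corresponding log-likelihood can be evaluated directly:

```python
import numpy as np

def log_likelihood(beta, sigma2, X, y):
    """Gaussian log-likelihood log L_n(beta, sigma^2) for data (X, y)."""
    n = len(y)
    resid = y - X @ beta                      # y - X beta
    rss = resid @ resid                       # (y - X beta)^T (y - X beta)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - rss / (2 * sigma2)
```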

We can derive a joint conjugate prior

$$\pi_0(\beta,\sigma^2) = \pi_0(\sigma^2)\,\pi_0(\beta|\sigma^2)$$

where

$$\pi_0(\sigma^2) \equiv \text{InvGamma}(a_0/2, b_0/2) \qquad \pi_0(\beta|\sigma^2) \equiv \text{Normal}_d(m_0, \sigma^2 M_0)$$

where $a_0, b_0, m_0, M_0$ are user-defined constant hyperparameters. The joint posterior then satisfies

$$\pi_n(\beta,\sigma^2) \propto L_n(\beta,\sigma^2)\,\pi_0(\beta,\sigma^2)$$

where

$$\pi_0(\sigma^2) = \frac{(b_0/2)^{a_0/2}}{\Gamma(a_0/2)}\left(\frac{1}{\sigma^2}\right)^{a_0/2+1}\exp\left\{-\frac{b_0}{2\sigma^2}\right\}$$

$$\pi_0(\beta|\sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{d/2}\frac{1}{|M_0|^{1/2}}\exp\left\{-\frac{1}{2\sigma^2}(\beta - m_0)^T M_0^{-1}(\beta - m_0)\right\}$$

We can combine the exponents of the above posterior into a single quadratic form in $\beta$ by completing the square.

The expression

$$(y - X\beta)^T(y - X\beta) + (\beta - m_0)^T M_0^{-1}(\beta - m_0)$$

can be rewritten as $(\beta - m_n)^T M_n^{-1}(\beta - m_n) + c_n$, where we need to determine the expressions for $m_n, M_n, c_n$ by matching terms.

  • Quadratic term:

    $$\beta^T M_n^{-1}\beta = \beta^T X^T X\beta + \beta^T M_0^{-1}\beta$$

    and therefore

    $$M_n^{-1} = X^T X + M_0^{-1} \quad\Longrightarrow\quad M_n = (X^T X + M_0^{-1})^{-1}$$

  • Linear term:

    $$\beta^T M_n^{-1} m_n = \beta^T X^T y + \beta^T M_0^{-1} m_0$$

    and therefore

    $$m_n = M_n(X^T y + M_0^{-1}m_0) = (X^T X + M_0^{-1})^{-1}(X^T y + M_0^{-1}m_0)$$

  • Constant term:

    $$m_n^T M_n^{-1}m_n + c_n = y^T y + m_0^T M_0^{-1}m_0$$

    and therefore

    $$c_n = y^T y + m_0^T M_0^{-1}m_0 - m_n^T M_n^{-1}m_n$$

This gives us the joint posterior as

$$\pi_n(\beta,\sigma^2) \propto \left(\frac{1}{\sigma^2}\right)^{(n+a_0+d)/2+1}\exp\left\{-\frac{c_n+b_0}{2\sigma^2}\right\}\exp\left\{-\frac{1}{2\sigma^2}(\beta - m_n)^T M_n^{-1}(\beta - m_n)\right\}$$

This tells us that the conditional posterior for $\beta$ and the marginal posterior for $\sigma^2$ (obtained by integrating the joint posterior over $\beta$) are

$$\pi_n(\beta|\sigma^2) \equiv \text{Normal}_d(m_n, \sigma^2 M_n) \qquad \pi_n(\sigma^2) \equiv \text{InvGamma}(a_n/2, b_n/2)$$

where $a_n = a_0 + n$ and $b_n = b_0 + c_n$.
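A minimal numpy sketch of this update (illustrative, not from the slides; it is parameterized by the prior precision $M_0^{-1}$ so that the special cases below are easy to obtain):

```python
import numpy as np

def conjugate_posterior(X, y, m0, M0_inv, a0, b0):
    """Posterior hyperparameters (m_n, M_n, a_n, b_n) for the conjugate prior."""
    n, d = X.shape
    Mn_inv = X.T @ X + M0_inv                          # M_n^{-1} = X^T X + M_0^{-1}
    Mn = np.linalg.inv(Mn_inv)
    mn = Mn @ (X.T @ y + M0_inv @ m0)                  # m_n
    cn = y @ y + m0 @ M0_inv @ m0 - mn @ Mn_inv @ mn   # c_n
    return mn, Mn, a0 + n, b0 + cn                     # a_n = a_0 + n, b_n = b_0 + c_n
```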

Note: see slides 218-221 for the marginal posterior of $\beta$; integrating out $\sigma^2$ using the Inverse Gamma density shows that $\pi_n(\beta)$ is a multivariate Student-$t$ distribution.
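One way to see this marginal in practice is to sample from the joint posterior by composition: draw $\sigma^2$ from its Inverse Gamma posterior and then $\beta$ from its conditional normal; the resulting $\beta$ draws follow the multivariate Student-$t$ marginal. A sketch, continuing the function above:

```python
def sample_posterior(mn, Mn, an, bn, n_draws=5000, rng=np.random.default_rng(0)):
    # sigma^2 ~ InvGamma(a_n/2, b_n/2), i.e. 1 / Gamma(shape=a_n/2, scale=2/b_n)
    sigma2 = 1.0 / rng.gamma(shape=an / 2, scale=2.0 / bn, size=n_draws)
    # beta | sigma^2 ~ Normal_d(m_n, sigma^2 * M_n)
    betas = np.array([rng.multivariate_normal(mn, s2 * Mn) for s2 in sigma2])
    return betas, sigma2
```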

Assigning prior ignorance to $\beta$ (setting $M_0^{-1}$ to the zero matrix) leads to results that coincide with the maximum likelihood estimates:

$$m_n \longrightarrow (X^T X)^{-1}X^T y \qquad M_n \longrightarrow (X^T X)^{-1}$$
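Continuing the numerical sketch from above (synthetic data, purely illustrative), setting $M_0^{-1} = 0$ reproduces the OLS estimate:

```python
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)

mn, Mn, an, bn = conjugate_posterior(X, y, m0=np.zeros(3),
                                     M0_inv=np.zeros((3, 3)), a0=1.0, b0=1.0)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(mn, beta_ols))   # True: a flat prior recovers least squares
```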

We can alternatively use a $g$-prior with hyperparameter $\lambda > 0$:

$$M_0 = \lambda^{-1}(X^T X)^{-1}$$

and hence

$$M_n = (1+\lambda)^{-1}(X^T X)^{-1}$$

If instead we set

$$m_0 = 0_d \qquad M_0 = \lambda^{-1} I_d$$

then we will have

$$m_n = (X^T X + \lambda I_d)^{-1}X^T y \qquad M_n = (X^T X + \lambda I_d)^{-1}$$

which gives us the procedure for ridge regression.
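Continuing the same sketch, taking $M_0^{-1} = \lambda I_d$ with $m_0 = 0_d$ reproduces the ridge estimator numerically:

```python
lam = 2.0
mn_ridge, *_ = conjugate_posterior(X, y, m0=np.zeros(3),
                                   M0_inv=lam * np.eye(3), a0=1.0, b0=1.0)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(np.allclose(mn_ridge, beta_ridge))   # True: posterior mean = ridge solution
```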

Jeffreys' prior for linear regression (see slide 228 for the derivation):

$$\pi_0(\beta,\sigma^2) \propto \left(\frac{1}{\sigma^2}\right)^{d/2+1}$$

9.3 Non-linear Regression

9.3.1 Generalized Linear Models