May 27, 2014

On the issue of HMC step size, I wondered if anyone had published on the relationship between ideal step size and the log-posterior's second derivative. I found an answer in section 4.2 of Neal 2011[pdf], which poses the question in the context of a Gaussian-shaped posterior. Using eigenvalue analysis, he shows that a step size larger than 2 standard deviations results in unstable dynamcis, and the state will diverge to infinity. From Neal (2011):

For low-dimensional problems, using a value for ε that is just a bit below the stability limit is suﬃcient to produce a good acceptance rate. For high-dimensional problems, however, the stepsize may need to be reduced further than this to keep the error in H to a level that produces a good acceptance probability.

If we can estimate the Hessian of the log-posterior (perhaps diagonally), we can use this to choose the step-size as some fraction of that (user settable). Thus, our tuning run will perform Laplace approxization:

- perform local minimizations of negative log posterior.
- estimate diagonal Hessian at miminimum.

I've devised the strategy below for sampling the clustering model. I have determined that all variables can be Gibbs sampled except the latent piecewise linear model.

- Gibbs sample new membership values
- update each cluster's piecewise linear model using HMC
- Gibbs sample observation model using known latent values

Step 3 needs some explanation.

Let \(x\) be the column vector of latent immune activity values at each time, (t). These are provided by the (fixed) piecewise linear model. Recall that the observation model is:

\[
y = A x + B + \epsilon
\]

where \(\epsilon\) is a normally distributed random variable with variance \(\sigma^2\).

where \(\epsilon\) is a normally distributed random variable with variance \(\sigma^2\).

Thus, the error is given by

\[
e = \left ( (A B)\begin{array}{c} x^\top \\ 1 \end{array} - y^\top \right ) / \sigma
\]

Let \(\beta = (vec(A)^\top vec(B)^\top)^\top \) be the concatenated vectorization of A and B, and \(X = (x 1)^\top\). Following the [derivation provided by wikipedia](http://en.wikipedia.org/wiki/Bayesian_multivariate_linear_regression), the resulting log-likelihood can be written as a guassian w.r.t. \(\beta\): \[ -(\beta - \hat \beta)^\top (\Sigma_\epsilon^{-1} \ocross X X^\top) (\beta - \hat \beta) \]

Let \(\beta = (vec(A)^\top vec(B)^\top)^\top \) be the concatenated vectorization of A and B, and \(X = (x 1)^\top\). Following the [derivation provided by wikipedia](http://en.wikipedia.org/wiki/Bayesian_multivariate_linear_regression), the resulting log-likelihood can be written as a guassian w.r.t. \(\beta\): \[ -(\beta - \hat \beta)^\top (\Sigma_\epsilon^{-1} \ocross X X^\top) (\beta - \hat \beta) \]

where \(\ocross\) is the kronecker product in our case, the observation noise variance \(\Sigma_\epsilon\) is simply \(I \sigma^{2} \).

If we assume a uniform prior over \(A\) and \(B\), then this Gaussian becomes our conditional posterior distribution, which we can easilly sample from with Gibbs.

Posted by Kyle Simek