November 10, 2013

To optimize indices, we'll need to compute the derivative of the marginal log-likelihood w.r.t. the indices.

I first tried to derive this using the generalization of the chain rule to matrix expressions (see the Matrix Cookbook, section 2.8.1), but the computation exploded. Since the derivative is ultimately a simple single-input, single-output function, we can use differentials to derive the solution.

Let the marginal log-likelihood as a function of the indices be \(g(x)\):

\[
\frac{\partial g(x)}{\partial x_i} = \frac{\partial}{\partial x_i}
\frac{1}{2}(-y^\top S^\top ( I + S K(x) S^\top)^{-1} S y)
\]
Let \(U = I + S K(x) S^\top\) and \(V = U^{-1}\). Working from the inside out, let's find \(\frac{\partial U}{\partial x_i}\).
\[
\begin{align}
U + dU &= I + S (K + dK) S ^\top \\
&= I + S K S^\top + S dK S ^\top \\
dU &= S \, dK\, S^\top \\
U' &= S K' S^\top
\end{align}
\]
Where \(M'\) is the derivative of the elements of \(M\) w.r.t. \(x_i\). Next, \(\frac{\partial V}{\partial x_i}\), which comes from the Matrix Cookbook, equation (36).
\[
dV = -U^{-1} \, dU \, U^{-1} \\
V' = -U^{-1} U' U^{-1}
\]
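As a quick numerical sanity check of this identity (a minimal sketch with a random test matrix; not from the codebase):

```
% Verify dV = -inv(U)*dU*inv(U) to first order on a random example
n = 4;
A = randn(n);
U = A*A' + n*eye(n);           % a well-conditioned test matrix
dU = 1e-6 * randn(n);          % small perturbation
lhs = inv(U + dU) - inv(U);    % actual change in the inverse
rhs = -(U \ dU / U);           % predicted first-order change
norm(lhs - rhs)                % tiny: the error is second order in dU
```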
Finally, \(\frac{\partial g(x)}{\partial x_i}\):
\[
\begin{align}
g + dg &= -\frac{1}{2}y^\top S^\top (V + dV) S y \\
g + dg &= -\frac{1}{2}y^\top S^\top \, V \, S y - \frac{1}{2} y^\top S^\top \,dV \,S y \\
dg &= -\frac{1}{2}y^\top S^\top \,dV \,S y \\
g' &= -\frac{1}{2}y^\top S^\top \, V' \,S y \\
\end{align}
\]
Expanding \(V'\) gives the final formula:
\[
\begin{align}
g' &= \frac{1}{2}y^\top S^\top U^{-1} S K' S^\top U^{-1} S y \\
g' &= \frac{1}{2}y^\top M K' M y \\
g' &= \frac{1}{2}z^\top K' z \tag{1}\\
\end{align}
\]

Here, \(M = S^\top U^{-1} S\) (which is symmetric) and \(z = M y\).

This equation gives us a single element of the gradient, namely \(\partial g(x) / \partial x_i\). However, once \(z\) is computed, we can reuse it when evaluating (1) for every other \(x_j\). The cost of each subsequent gradient element is then \(O(n^2)\), making the total gradient \(O(n^3)\), which is pretty good. (This assumes the \(K'\) matrices can be computed efficiently, which is true; see below.) But \(K'\) is also sparse, with only \(O(n)\) nonzero entries, so we can use sparse multiplication to reduce each element to linear time, and **the full gradient takes \(O(n^2)\)**, assuming \(z\) is precomputed. Cool!
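For concreteness, here's a minimal MATLAB sketch of the precomputation (assuming `S` is the selection matrix, `K` the kernel matrix evaluated at `x`, and `y` the observation vector; these names are illustrative):

```
% Precompute z = M*y once; it is reused for every gradient element.
m = size(S, 1);
U = eye(m) + S * K * S';
z = S' * (U \ (S * y));   % z = M*y, with M = S'*inv(U)*S
% a single gradient element, eq. (1), given a sparse K' (built below):
% g_prime_i = 0.5 * z' * (K_prime_i * z);
```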

First, we'll lay out the general form of \(K'\), whose elements are the total derivatives of the kernel w.r.t. \(x_k\).

\[
\frac{\partial K_{ij}}{\partial x_k} = \frac{\partial k(x_i, x_j)}{\partial x_i} \frac{d x_i}{d x_k} + \frac{\partial k(x_i, x_j)}{\partial x_j} \frac{d x_j}{d x_k}\\
\]

The first term is nonzero only on the k-th row of \(K'\), and the second term only on the k-th column. This suggests the following convenient sparse representation for \(K'\).

Let \(\delta_i\) be the vector whose j-th element is \(\frac{\partial k(x_i, x_j)}{\partial x_i}\). Using this notation, we can rewrite \(K'\) as

\[
\frac{\partial K}{\partial x_i} = K' = C + C^\top \tag{3}
\]

where \(C = \left(0 \, \dots \, \delta_i \, \dots \, 0 \right)\) is the matrix whose i-th column is \(\delta_i\) and whose remaining columns are zero.

Below we derive the derivative \(\frac{\partial k(x_i, x_j)}{\partial x_i}\) for each of the three covariance expressions.

*Cubic Covariance*

Recall the cubic covariance expression:

\[
k(x_i, x_j) = (x_a - x_b) x_b^2 / 2 + x_b^3/3
\]
where \(x_b = \min(x_i, x_j)\) and \(x_a = \max(x_i, x_j)\).

Taking the derivative w.r.t. \(x_i\) gives:

\[
\begin{align}
\frac{\partial k(x_i, x_j)}{\partial x_i} &=
\begin{cases}
x_j^2 / 2 & \text{if } x_i \geq x_j \\
x_i x_j - x_i^2/2 & \text{if } x_i < x_j
\end{cases} \\
&=
\begin{cases}
x_b^2 / 2 & \text{if } x_i \geq x_j \\
x_a x_b - x_b^2/2 & \text{if } x_i < x_j
\end{cases}
\end{align}
\]

Or equivalently

\[
\frac{\partial k(x_i, x_j)}{\partial x_i} =
x_b \left ( x_j - x_b/2 \right ) \tag{2}
\]
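In MATLAB, eq. (2) vectorizes to a one-liner. A sketch of what `kernel/cubic_kernel_derivative.m` presumably computes (the function name here is illustrative):

```
function d = cubic_cov_derivative(xi, xj)
% eq. (2): x_b * (x_j - x_b/2), where x_b = min(x_i, x_j)
xb = min(xi, xj);
d = xb .* (xj - xb / 2);
end
```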

*Linear Covariance*

Recall the linear covariance expression:

\[
k(x_i, x_j) = x_i x_j
\]
The derivative w.r.t. \(x_i\) is simply \(x_j\).

*Offset Covariance*

Recall the offset covariance expression:

\[
k(x_i, x_j) = c
\]
where \(c\) is a constant. The derivative w.r.t. \(x_i\) is zero.

*Implementation*

An end-to-end version is implemented in `kernel/get_model_kernel_derivative.m`; see also the components in `kernel/get_spacial_kernel_derivative.m` and `kernel/cubic_kernel_derivative.m`.

These functions return all of the partial derivatives of the kernel matrix with respect to its first input. The i-th row of the result makes up the nonzero values in \(\frac{\partial K}{\partial x_i}\). Below is example code that computes all of the partial derivative matrices.

```
N = 100;
% construct indices
x = linspace(0, 10, N);
% construct derivative rows; row i of d_K is delta_i
d_kernel = get_model_kernel_derivative(...);
d_K = eval_kernel(d_kernel, x, x);
% construct dK/dx_i, for each i = 1..N
d_K_d_x = cell(1, N);
for i = 1:N
    tmp = sparse(N, N);
    tmp(i,:) = d_K(i,:);   % C^T: its i-th row is delta_i^T
    tmp = tmp + tmp';      % K' = C + C^T; note the diagonal entry is doubled
    d_K_d_x{i} = tmp;
end
```
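Each gradient element then follows from eq. (1) (a sketch, with `z` precomputed as above):

```
% Naive gradient assembly via eq. (1); each sparse product is O(n),
% so the full loop is O(n^2).
g_prime = zeros(N, 1);
for i = 1:N
    g_prime(i) = 0.5 * z' * (d_K_d_x{i} * z);
end
```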

*Directional Derivatives*

I think we can get directional derivatives of \(K\) by taking a weighted sum of the partial derivatives, where the weights are the components of the direction vector. I have yet to confirm this beyond a hand-wavy hunch, and in practice it might not even be needed, since computing the full gradient is so efficient.
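If the hunch holds, the sketch would be (untested; `v` is the direction vector):

```
% Hypothetical directional derivative of K along v:
% the v-weighted sum of the sparse partials.
dK_dir = sparse(N, N);
for i = 1:N
    dK_dir = dK_dir + v(i) * d_K_d_x{i};
end
```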

As we saw earlier, \(\frac{\partial K}{\partial x_i}\) is sparse, with the form in equation (3). We can use this sparsity to compute the entire gradient with a single matrix-vector multiplication.

First, we'll rewrite \(g'\) in terms of \(\delta_i\):

\[
\begin{align}
g' &= \frac{1}{2} z^\top K' z \\
&= \frac{1}{2} \left( z^\top C z + z^\top C^\top z \right) \\
&= \frac{1}{2} \left \{ (0 \, \dots \, z^\top \delta_i \, \dots \, 0) z + z^\top (0 \, \dots \, \delta_i^\top z \, \dots \, 0)^\top \right \} \\
&= z_i (\delta_i \cdot z)
\end{align}
\]

We can generalize this to the entire gradient using matrix operations:

\[
\begin{align}
\nabla g = z \odot (\Delta z) \tag{4}
\end{align}
\]

where \(\Delta\) is the matrix whose i-th row is \(\delta_i\), and \(\odot\) denotes element-wise multiplication.
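Since the i-th row of `d_K` from the code above is exactly \(\delta_i\), `d_K` plays the role of \(\Delta\), and the whole gradient is a one-liner (sketch):

```
% Full gradient via eq. (4): one matrix-vector product plus an
% element-wise product, O(n^2) total.
grad = z .* (d_K * z);
```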

To handle multiple dimensions, simply apply this to each dimension independently and sum the results.

Posted by Kyle Simek