Finished compiling inference test with synthetic data.

The gradient is taking incredibly long to evaluate, especially for a 7-dimensional model. Perhaps 100 data points is too many, but I'm guessing the cost of allocating vectors for each function evaluation is the bottleneck (GDB seems to agree).

| Configuration | Wall time (m:ss) | Change vs. previous row |
|---|---|---|
| Baseline (debugging mode, allocation checking enabled) | 1:28.94 | |
| Heap checking disabled | 0:21.06 | -1:07.88 (4.22x) |
| Heap & initialization checking disabled | 0:11.19 | -9.87s (1.88x) |
| PRODUCTION=1 (with -O2) | 0:07.24 | -3.95s (1.55x) |
| -O3 | 0:07.86 | +0.62s (0.92x) |

Used gprof with gprof2dot.py to get the following diagram: gprof.pdf.

iso_mvn_lpdf is getting hit hard:

```cpp
double iso_mvn_lpdf(const double* mu, const double* y, double epsilon, size_t D)
{
    double accum = 0;
    double d;
    for(size_t i = 0; i < D; ++i)
    {
        d = (*mu++ - *y++) / epsilon;
        accum += d*d;
    }
    return -0.5 * (accum + D*log(2*M_PI*epsilon*epsilon));
}
```

It's already pretty lean (no allocation, all C-style), but we can move the divide-by-epsilon out of the loop for an easy 1.8x speedup.

```cpp
double iso_mvn_lpdf(const double* mu, const double* y, double epsilon, size_t D)
{
    double accum = 0;
    double d;
    for(size_t i = 0; i < D; ++i)
    {
        d = *mu++ - *y++;
        accum += d*d;
    }
    accum /= epsilon*epsilon;
    return -0.5 * (accum + D*log(2*M_PI*epsilon*epsilon));
}
```
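
For reference, the hoisting is valid because epsilon is constant across the sum. Writing out what the function computes (same symbols as the code, with $D$ the dimension and $\epsilon$ the shared standard deviation):

$$
\log p(y \mid \mu, \epsilon)
= -\frac{1}{2}\left[\sum_{i=1}^{D}\left(\frac{\mu_i - y_i}{\epsilon}\right)^{2} + D\log\left(2\pi\epsilon^{2}\right)\right]
= -\frac{1}{2}\left[\frac{1}{\epsilon^{2}}\sum_{i=1}^{D}\left(\mu_i - y_i\right)^{2} + D\log\left(2\pi\epsilon^{2}\right)\right]
$$

The two forms agree up to floating-point rounding; the second replaces $D$ divisions with one.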

Now the bottleneck is all the allocation, copying and freeing of kjb::Vector temporaries.

I tweaked the code for evaluating a piecewise linear function to avoid creating kjb::Vector temporaries, and running time dropped dramatically, **from 11.4s to 1.26s.** This is in production mode, so it's surprising that more temporaries aren't optimized out.
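
To illustrate the kind of change involved, here is a minimal sketch of a temporary-free evaluation. It is not the actual code: `eval_piecewise_linear` is a hypothetical name, `std::vector` stands in for kjb::Vector, and the knot layout (strictly increasing `xs`, linear extrapolation past the ends) is an assumption. The point is only that results go into a caller-owned buffer that can be reused across evaluations, so nothing is allocated per call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: evaluate a piecewise linear function with knots
// (xs[k], ys[k]) at the n query points in t, writing results into the
// caller-owned buffer out. No vectors are created or copied in here.
// Assumes xs is strictly increasing; queries outside the knot range are
// linearly extrapolated from the first or last segment.
void eval_piecewise_linear(const std::vector<double>& xs,
                           const std::vector<double>& ys,
                           const double* t, std::size_t n, double* out)
{
    assert(xs.size() == ys.size() && xs.size() >= 2);
    for (std::size_t j = 0; j < n; ++j)
    {
        // Find the segment [xs[k], xs[k+1]) containing t[j], clamped to the last segment.
        std::size_t k = 0;
        while (k + 2 < xs.size() && t[j] >= xs[k + 1]) ++k;

        // Interpolate in place; only scalars live on the stack.
        const double w = (t[j] - xs[k]) / (xs[k + 1] - xs[k]);
        out[j] = ys[k] + w * (ys[k + 1] - ys[k]);
    }
}
```

The caller allocates `out` once, outside the optimization loop, and passes the same buffer on every function evaluation.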

GProf after eliminating temporaries: gprof.pdf.

Since one iteration takes about 1.2s, I'm bumping up to 10 iterations.

Remaining speed-up opportunities: exploit the independence of the gradient's components and evaluate the gradient in parallel.
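
A rough sketch of what a parallel gradient could look like, under assumptions that are not taken from the actual implementation: the gradient is computed by central finite differences, OpenMP is the threading mechanism, and `parallel_gradient` and `f` are hypothetical names. Each component depends only on its own pair of function evaluations, so the loop iterations are independent.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: central-difference gradient of f at x with step h.
// f is any callable taking const std::vector<double>& and returning double.
// Built without OpenMP, the pragmas are ignored and the loop runs serially.
template <class F>
void parallel_gradient(const F& f, const std::vector<double>& x,
                       double h, std::vector<double>& grad)
{
    assert(h > 0);
    const long D = static_cast<long>(x.size());
    grad.resize(x.size());

    #pragma omp parallel
    {
        // One scratch copy per thread, allocated once, so threads never
        // share a mutable parameter vector or allocate inside the loop.
        std::vector<double> xi(x);

        #pragma omp for
        for (long i = 0; i < D; ++i)
        {
            const double saved = xi[i];
            xi[i] = saved + h;
            const double f_plus = f(xi);
            xi[i] = saved - h;
            const double f_minus = f(xi);
            xi[i] = saved;  // restore before the next component
            grad[i] = (f_plus - f_minus) / (2 * h);
        }
    }
}
```

With only a handful of components per gradient, thread start-up cost and any allocation inside `f` itself may outweigh the gain, so the benefit has to be measured rather than assumed.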

Enabled 8-way parallel gradient evaluation, and got **worse** performance!

```
single threaded: 0:09.63
multi threaded: 0:15.48
```

This is despite `top` displaying 550% CPU utilization.

Maybe gprof is affecting performance.

step size gradient size

Posted by Kyle Simek