March 04, 2014

The fact that we have a 500-dimensional dynamical system offers opportunities to learn high-dimensional relationships. I spent some time looking into structure learning.

**Variational Bayes**

Zoubin Ghahramani slides on variational bayes

Variational EM could learn LDS with unknown structure or switching state space model.

**Dirichlet Process**

Video: Inferring structure from data, Tom Griffiths, Cognitive Science and Machine Learning Summer School, 2010

Maybe could be used to learn LDS structure. Griffiths shows a noisy-or model with unknown dimension, using a Dirichlet process prior. At its center is a rank-defficient matrix constructed from product of NxM and MxN matrices, where M < N. Could use something similar to learn sparse dynamics matrix, but we're dealing with continuous-valued data rather than binary-valued. If this can be done, there must be existing literature on it, will dig further.

**Parameter Learning**

This tech report from Ghahramani and Hinton derives the EM for learning LDS parameters and latent states jointly. For the M step, they derive expressions for the full transition matrix, plus the system and observation noise covariance, and initial state. For the E step, they use Kalman for forward estimation followed by a backward recursion. We're more interested in a sparse representation of dynamics.

Parameter Estimation for Linear Dynamical Systems[pdf]

**MARRS**

Multiple Auto-regressive State-Space models for Analyzing Time-series Data by Elizabeth E. Holmes, Eric J. Ward, Kellie Wills

This R package handles learning and inference in general state-space models, with unspecified or semi-specified model structure. It uses EM to infer the model and latent state jointly. On the plus side, it handles missing data, and the model is very flexible. On the down side, it doesn't handle i.i.d. data in an obvious way (although could handle a small number by repeating matrix blocks). Could probably implement some form of the TIES model, sans prior.

**Other R Packages**

- dlm - An R Package for Dynamic Linear Models
- KFAS - Kalman Filter and Smoother for Exponential Family State Space Models
- FKF: Fast Kalman Filter

Kobus pointed out that the short time-frame means second-order dynamics likely won't emerge.

Matlab only handles CSV files with numeric-valued fields. We have dates and strings (and possibly others), so we need to write our own CSV parser.

Creating a *.meta file for each *.csv file, which describes column names and datatypes. Currently only three ways to parse a column: "numeric", "mm/dd/yy/ date", and "ignore". Possibly more in the future.

Wrote code for parsing meta files and csv files (all code is in projects/fire/trunk/src):

`io/fire_read_meta.m`

- Reads a .meta file into a 1xN structure array.`io/fire_read_csv.m`

- Reads a .csv file, using the corresponding .meta file to determine names and datatypes.

- Most datasets seem to be uniquely indexed on the pair (Subject_ID, Visit).
- There are 136 subjects and 9 visits
- Most datasets have 998 entries, except
**fcsfq5.csv - missing first visit**poms4.csv - missing 12 subjects**sri3.csv - 8 missing records (misc.)**fcsfq5.csv - missing 3 subjects**demograph.csv - Extra records (investigating...)*** Several missing Visit ID's - Total of 1009 (Subject_Id, Visit) pairs (some non-overlapping occurrances)
- Subject_ID scheme: 575 + a 2 or 3 digit number

The following applies only to parsable fields only

- 161 (numeric or date) fields w/ full coverage (ignoring the index fields) ** Unclear if all time-points are covered by all subjects
- Median coverage per record is
**163/547**fields - Median coverage per fields is 97.0%.
- 303 / 547 fields have over 95% coverage.
**273 numeric fields***193 int***10 double*** 70 bool

Coverage map is below (rows are fields, columns are records)

- 32 float fields
- 160 bool fields
- 323 int fields
- 30 date fields
- The rest are text, inconsistent, or unknown format

How many fully covered fields have more than two values?

Next pass:

- merge all columns into a large cell array
- for each column, store pointer back to column metadata ** i.e. dataset name, column name & type
- infer numeric type (boolean, int, float)
- Example query: get all int/float columns, excluding ID and visit
**Try clustering them (how to visualize?)**Try PCA on them - do visible clusters emerge? can we name them?) - Example Query: get all numerical columns vs. time
**Dynamic factor analysis?**split into subjects and visit indices - consider boolean fields

Should probably read the unreadable fields as text.

Posted by Kyle Simek