[Work Log] FIRE - improving fitting

May 26, 2014

Project	FIRE
Subproject	Piecewise Linear Clustering
Working path	projects/fire/trunk/src/piecewise_linear
SVN Revision	unknown (see text)

Unless otherwise noted, all filesystem paths are relative to the "Working path" named above.

Run: Refactoring kmeans, fixed bug

Description: Refactored k-means to make replicates easier to run. Also fixed a bug in how collapsed clusters are handled. Results:

BEFORE REPOPULATION FIX

    10 itns
    ----------
    training error: 3.61348
    test error: 4.60147

    15 itns
    --------
    training error: 3.60094
    test error: 4.58815

    20 itns
    ----------
    training error: 3.60094
    test error: 4.58815

AFTER REPOPULATION FIX

    10 itns
    ---------
    training error: 3.5833
    test error: 4.57327

    15 itns
    ---------
    training error: 3.58389
    test error: 4.57315

    20 itns
    ---------
    training error: 3.58411
    test error: 4.57321

Discussion:

Neither of these results matches what I was getting on Friday. Test error in particular is worse. Did I break something?

Found it: error in evaluation code arising from bad copy/paste.

Another issue: should be using centered_data.txt, not data.txt.

Run: multiple repetitions

Description: Run k-means with 10 repetitions.
Issues:

Found an assert failure: fixing empty clusters sometimes fails.
Empty clusters were only being fixed after the first iteration. Fixed.
Repopulation fails when the same cluster is picked twice. Fixed.
Cluster weights were computed incorrectly.
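
For reference, a minimal sketch of one way to repopulate empty clusters so that it cannot hit the failure modes listed above. This is an illustration under assumptions, not the project's code: the function name `repopulate_empty_clusters`, the list-based data layout, and the "steal the point farthest from its own center" strategy are all hypothetical. The donor counts are updated after every move, so a cluster that has already given up a point cannot be drained to empty by a second pick.

```python
def repopulate_empty_clusters(points, labels, centers):
    """Give each empty cluster one point taken from a cluster with >= 2 points.

    points  : list of coordinate tuples
    labels  : list of cluster indices, one per point (modified in place)
    centers : list of coordinate lists, one per cluster (modified in place)
    """
    k = len(centers)
    counts = [0] * k
    for lab in labels:
        counts[lab] += 1

    def sq_dist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))

    for cluster in range(k):
        if counts[cluster] > 0:
            continue
        # Pick the point farthest from its current center, skipping any
        # point whose cluster would become empty if we took it.
        best, best_d = None, -1.0
        for i, p in enumerate(points):
            if counts[labels[i]] <= 1:
                continue
            d = sq_dist(p, centers[labels[i]])
            if d > best_d:
                best, best_d = i, d
        if best is None:
            raise ValueError("no donor cluster with more than one point")
        # Update counts immediately so later picks see the new sizes.
        counts[labels[best]] -= 1
        labels[best] = cluster
        counts[cluster] = 1
        centers[cluster] = list(points[best])
    return labels
```

In this sketch the check would run after the assignment step of every iteration, not just the first; donor centers are left stale and get recomputed by the usual update step.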

Results:

trivial model error: 3.55057

10 itns
-------
training error: 3.54682
test error: 4.51326

20 itns
------------
training error: 3.53998
test error: 4.52958

Training and test error improve over the single-initialization version. Test error is slightly worse for the 20-iteration run than for the 10-iteration run (4.52958 vs. 4.51326), possibly due to overfitting.
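
A sketch of the replicate scheme, assuming the best replicate is chosen by lowest training error (the log does not say how replicates are combined, so that selection rule is an assumption). `best_of_replicates` and `fit_once` are hypothetical names; `fit_once` stands in for a single randomly initialized k-means run that returns a model and its training error.

```python
import random


def best_of_replicates(fit_once, n_replicates=10, seed=0):
    """Run fit_once several times and keep the lowest-training-error model.

    fit_once(rng) must return a (model, training_error) pair and should draw
    its random initialization from the random.Random instance it is given.
    """
    rng = random.Random(seed)
    best_model, best_error = None, float("inf")
    for _ in range(n_replicates):
        model, error = fit_once(rng)
        if error < best_error:
            best_model, best_error = model, error
    return best_model, best_error
```

Selecting on training error keeps the test set out of model selection, so the reported test error is still an honest estimate.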

Run: New baseline - use centered data

Description: Perform 10 replicates of k-means using centered log-transformed data.

Trivial model error: 1.40502

10 itns
-----------
training error: 1.10196
test error: 1.11679

15 itns
----------
training error: 1.09979
test error: 1.09903

20 itns
--------------
training error: 1.09756
test error: 1.09137

Run: compare against null model

Description: Do we do better or worse with a constant model (slope zero, intercept zero)?

Results: see previous runs; "trivial model" results have been added.

Discussion:
It is interesting that the trivial model performs better on raw data than the cluster model. With rescaled data, the cluster model performs better.
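
A small sketch of the trivial-model baseline, assuming the error metric is root-mean-square error (the log does not name the metric, so RMSE is an assumption, as are the function names). With slope zero and intercept zero the model predicts 0 everywhere, so its error depends only on the observations.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(observed):
        raise ValueError("length mismatch")
    total = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(total / len(observed))


def trivial_model_error(observed):
    """Error of the constant model y = 0 (slope zero, intercept zero)."""
    return rmse([0.0] * len(observed), observed)
```

On data centered to mean zero, predicting 0 is the same as predicting the mean, which makes the trivial model a natural baseline there.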

Run: continuous model (aborted)

Description: Re-run using the continuous model. Details:

Found a bug in a line-fitting corner case: the fit fails if all observations occur at the same time.
Found a huge bug in preprocessing: all values are identical! It was an indexing error introduced when we added per-plate centering.
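
The line-fitting corner case can be illustrated with ordinary least squares: when every observation shares the same time, the denominator of the slope estimate is zero. The sketch below assumes a fallback to a flat line through the mean; that fallback, and the name `fit_line`, are illustrative choices rather than what the project code does.

```python
def fit_line(times, values):
    """Least-squares fit of values = slope * time + intercept.

    Returns (slope, intercept). If all times are identical the slope is
    undefined, so fall back to slope 0 and intercept = mean(values).
    """
    n = len(times)
    if n == 0 or n != len(values):
        raise ValueError("need equal-length, non-empty inputs")
    mean_y = sum(values) / n
    # Compare min and max directly rather than testing a computed sum of
    # squares against zero, which can be off by rounding error.
    if min(times) == max(times):
        return 0.0, mean_y
    mean_t = sum(times) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_t
```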

Will need to re-run all experiments.
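
A cheap guard against the preprocessing bug noted above is to check the centered output before fitting. The sketch below is a hypothetical version of per-plate centering with such a check; the names `center_per_plate` and `check_not_constant` and the flat-list data layout are assumptions, not the project's preprocessing code.

```python
from collections import defaultdict


def center_per_plate(values, plate_ids):
    """Subtract each plate's mean from the values measured on that plate."""
    if len(values) != len(plate_ids):
        raise ValueError("length mismatch")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, p in zip(values, plate_ids):
        sums[p] += v
        counts[p] += 1
    means = {p: sums[p] / counts[p] for p in sums}
    # Index the mean by each value's own plate id, not by position.
    return [v - means[p] for v, p in zip(values, plate_ids)]


def check_not_constant(values):
    """Raise if every value is identical, a sign of an indexing error."""
    if len(values) > 1 and min(values) == max(values):
        raise ValueError("all values identical; check preprocessing indexing")
```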

Run: baseline (rerun)

Description: Re-run baseline fitting of centered data using the discontinuous model, with 10 repetitions.

Results:

Trivial model error: 1.58913
Single cluster error: 1.58365
training error: 1.61625
test error: 1.64005

Discussion: The trivial model outperforms the clustered model. This shouldn't be happening; need to investigate.

Posted by Kyle Simek
