What “Reproducible” Means Here
HHM analyses are reproducible when any independent person can re-run the steps and obtain the same decisions (pass/fail) and statistically consistent numbers (within confidence intervals).
- Single source of truth: `HHM_bundle.json` pins operators, thresholds, nulls/CI, and the result-card schema.
- Sealed inputs: dataset manifest + hashes; provenance graph for every derived file.
- Deterministic runs: seeded randomness; fixed windowing; declared preprocessing.
- Schema-checked outputs: machine-validated result cards, not free text.
Pipeline at a Glance
| Stage | What it does | Artifacts |
|---|---|---|
| 1) Intake | Load dataset + manifest; verify hashes and licenses; record consent/PII status. | `dataset.json`, `manifest.json`, `provenance.json` |
| 2) Prep | Declared preprocessing (filters, resampling, windowing) with parameters pinned. | `prep_config.yaml`, `prep_log.jsonl` |
| 3) Measure | Apply HHM operators (OP001, OP003, OP002, OP006, …) from the bundle. | `ops_metrics.parquet` |
| 4) Nulls & CI | Generate nulls, run bootstraps, compute CIs; compare with preregistered thresholds. | `null_runs.parquet`, `ci.json` |
| 5) Result Card | Emit schema-valid JSON with metrics, decisions, seeds, versions, hashes. | `result_card.json` |
| 6) Audit | Automated checks (schema, determinism, thresholds locked); export run report. | `audit_report.html`, `replay_cmd.txt` |
Minimal Runbook
Use this when you just want to get a clean, reproducible “Hello, World” pass/fail with CIs.
Runbook R1 — Minimal Pipeline
```
# R1_minimal_pipeline (pseudocode / steps)
inputs:
  - data: dataset.zip (hash: SHA256:...)
  - bundle: HHM_bundle.json (version: 2.2, hash: ...)
params:
  seed: 20250111
  windows: "2s @ 50% overlap"
  basis: "Fourier 1–40 Hz, Hamming"
steps:
  - verify_manifest(dataset.manifest)
  - preprocess(data, basis, windows, filters=["bandpass 1–40Hz"])
  - op(OP001_CollapsePattern)
  - op(OP003_Echo)
  - op(OP002_Rec)   # if comparison target provided
  - op(OP006_UnifiedEntropy)
  - nulls: ["time_shuffle", "phase_randomize"]
  - bootstrap: n=1000
  - compare_to_thresholds(bundle.thresholds)
  - emit_result_card(schema=bundle.result_schema)
  - validate_json(result_card.json, bundle.result_schema)
outputs:
  - result_card.json
  - ci.json
  - prep_log.jsonl
  - ops_metrics.parquet
```
Runbook R2 — Benchmark & Transfer
```
# R2_benchmark_transfer (outline)
- load library of reference result_cards/*
- compute Rec/Entropy/Echo similarity surfaces
- report top-k matches + T014 status
- emit transfer_report.json (+ plots)
```
Seeds, Determinism & Environment Capture
- Seed once, everywhere: one `global_seed` for Python/NumPy/random/backends; record it in the result card (a seeding sketch follows the excerpt below).
- Pin numeric libs: BLAS/LAPACK/GPU versions can change floating-point noise; capture them in an `env_lock`.
- Containerize: run inside a pinned image; record the image digest (immutable SHA).
```yaml
# pipeline.yaml (excerpt)
env:
  container: ghcr.io/hhm/hhm-pipeline:2.2.0@sha256:
seeds:
  global: 20250111
capture:
  python: "3.11.7"
  numpy: "1.26.4"
  scipy: "1.11.4"
  mkl: "2024.1"
  cuda: ""
```
Manifests & Provenance
Every file is content-addressed and every derivative is linked back to its sources.
- Manifest: paths, sizes, BLAKE3/SHA-256 hashes, access level.
- Provenance graph: nodes (files) + edges (transforms) with tool versions, params, seeds, and logs.
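A minimal content-addressing sketch using only the standard library (SHA-256 here; BLAKE3 needs the third-party `blake3` package). The entry fields are illustrative, not the pinned HHM manifest schema:

```python
import hashlib
import json
import os

def manifest_entry(path: str, access: str = "public") -> dict:
    """Hash a file in streaming chunks and return an illustrative manifest entry."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return {
        "path": path,
        "size": os.path.getsize(path),
        "sha256": h.hexdigest(),
        "access": access,
    }

manifest = {"files": [manifest_entry("dataset.zip")]}
print(json.dumps(manifest, indent=2))
```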
Nulls, CI, and Threshold Decisions
Nulls and CIs are part of the pipeline, not an afterthought.
- Null generators (e.g., `time_shuffle`, `phase_randomize`, `block_bootstrap`) are defined in the bundle.
- CI: bootstrap (default n=1000) at the bundle's confidence level (e.g., 95%).
- Decisions: pass/fail only when metrics clear preregistered thresholds versus the null.
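The bundle pins the actual generators; for intuition only, here is a sketch of a phase-randomized surrogate and a percentile bootstrap CI. It assumes NumPy and 1-D signals and is not the HHM implementation:

```python
import numpy as np

rng = np.random.default_rng(20250111)  # reuse the pipeline's global seed

def phase_randomize(x: np.ndarray) -> np.ndarray:
    """Null surrogate: preserve the amplitude spectrum, scramble the phases."""
    spec = np.fft.rfft(x)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=spec.shape)
    phases[0] = 0.0            # keep the DC bin real
    if len(x) % 2 == 0:
        phases[-1] = 0.0       # keep the Nyquist bin real for even-length signals
    return np.fft.irfft(np.abs(spec) * np.exp(1j * phases), n=len(x))

def bootstrap_ci(samples, stat=np.mean, n=1000, level=0.95):
    """Percentile bootstrap CI for an arbitrary statistic over 1-D samples."""
    samples = np.asarray(samples)
    stats = np.array([
        stat(rng.choice(samples, size=samples.size, replace=True))
        for _ in range(n)
    ])
    alpha = (1.0 - level) / 2.0
    lo, hi = np.quantile(stats, [alpha, 1.0 - alpha])
    return float(lo), float(hi)
```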
```json
{
  "Echo.min": 0.90,
  "Rec.min": 0.85,
  "Entropy.delta.max": 0.15,
  "CollapseOrbit.delta.max": 1
}
```
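Given that excerpt, a decision can be computed mechanically. In this sketch the rule (the CI lower bound must clear the minimum) is an assumption; the bundle fixes the real convention:

```python
def echo_passes(ci_echo: tuple, thresholds: dict) -> bool:
    """Pass only if the Echo CI lower bound clears the preregistered minimum
    (assumed decision rule; see the bundle for the actual one)."""
    return ci_echo[0] >= thresholds["Echo.min"]

print(echo_passes((0.914, 0.946), {"Echo.min": 0.90}))  # True: 0.914 >= 0.90
```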
Read: Test Thresholds, Nulls & CI
Result Card — Schema & Example
Every run produces a machine-checkable JSON that locks versions, seeds, inputs, and outcomes.
```json
{
  "schema": "hhm.result_card.v2",
  "bundle_version": "2.2.0",
  "env_lock": "env-lock:sha256:...",
  "dataset_id": "hhm-bio-whale-001",
  "manifest_hash": "blake3:...",
  "operators": ["OP001", "OP003", "OP006"],
  "params": { "basis": "Fourier 1-40Hz", "window": "2s/50%" },
  "seed": 20250111,
  "metrics": { "Echo": 0.931, "Entropy": 0.821 },
  "nulls": { "method": ["time_shuffle", "phase_randomize"], "n": 1000 },
  "ci": { "Echo": [0.914, 0.946], "Entropy": [0.79, 0.85] },
  "thresholds_version": "2.2.0",
  "decisions": { "Echo.pass": true, "T014.pass": false },
  "artifacts": {
    "ops_metrics": "ipfs://.../ops.parquet",
    "audit_report": "ipfs://.../audit.html"
  },
  "created_at": "2025-08-11T12:00:00Z",
  "run_id": "rc-2025-08-11-1200-abc123"
}
```
CLI: validate a result card
```bash
hhm-validate \
  --schema /downloads/schemas/hhm.result_card.v2.json \
  --input result_card.json
```
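If you prefer validating in Python instead of the CLI, a sketch using the `jsonschema` package (assuming the schema file is standard JSON Schema):

```python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

schema = json.load(open("hhm.result_card.v2.json"))
card = json.load(open("result_card.json"))
try:
    validate(instance=card, schema=schema)
    print("result card is schema-valid")
except ValidationError as err:
    print(f"schema violation at {list(err.absolute_path)}: {err.message}")
```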
CI/CD Checks (Recommended)
- Schema check: result card must validate.
- Determinism check: re-run same seed → identical decisions; metric deltas within numeric tolerance.
- Threshold lock: thresholds version must match bundle; no local overrides.
- Provenance closure: no orphaned artifacts; all derived files traced to raw inputs.
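The determinism check above can be a short script. A sketch that compares two result cards produced from the same seed (`TOL` is an assumed tolerance; take the real value from your audit config):

```python
import json
import math

TOL = 1e-9  # assumed numeric tolerance for metric drift

def determinism_check(card_a: str, card_b: str) -> bool:
    """Two runs with the same seed must agree: identical decisions,
    metrics equal within TOL."""
    a = json.load(open(card_a))
    b = json.load(open(card_b))
    if a["seed"] != b["seed"] or a["decisions"] != b["decisions"]:
        return False
    if a["metrics"].keys() != b["metrics"].keys():
        return False
    return all(
        math.isclose(a["metrics"][k], b["metrics"][k], abs_tol=TOL)
        for k in a["metrics"]
    )
```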
GitHub Actions — sample job
```yaml
name: hhm-ci
on: [push, pull_request]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Pull container
        run: docker pull ghcr.io/hhm/hhm-pipeline:2.2.0@sha256:<digest>
      - name: Run pipeline (dry)
        run: |
          docker run --rm -v "$PWD:/work" ghcr.io/hhm/hhm-pipeline:2.2.0 \
            hhm-run --pipeline pipeline.yaml --dry-run
      - name: Execute
        run: |
          docker run --rm -v "$PWD:/work" ghcr.io/hhm/hhm-pipeline:2.2.0 \
            hhm-run --pipeline pipeline.yaml
      - name: Validate result card
        run: |
          docker run --rm -v "$PWD:/work" ghcr.io/hhm/hhm-pipeline:2.2.0 \
            hhm-validate --schema /schemas/hhm.result_card.v2.json --input result_card.json
```
Local vs Cloud — Same Pipeline
Run the same `pipeline.yaml` locally or in the cloud. Only the `storage:` stanza changes.
```yaml
# pipeline.yaml (storage profiles)
storage:
  local:
    root: "./data"
  s3:
    root: "s3://hhm-datasets/proj-x/"
    profile: "default"
  profile: "local"   # active profile: "local" or "s3"
```
Common Failure Modes (and Fixes)
“My numbers moved slightly between runs.”
- Check seeds across all libs; confirm same container digest.
- Ensure BLAS/GPU kernels are pinned; disable nondeterministic ops if present.
- Increase sample sizes/bootstraps or tighten numeric tolerances in the audit.
“Schema validation failed.”
- Diff your `result_card.json` against the schema; look for missing required fields.
- Ensure `thresholds_version` and `bundle_version` match the bundle you ran.
“Pass/Fail changed after I tweaked preprocessing.”
- Preprocessing is part of the registered pipeline. If you change it, bump `pipeline_version` and re-run the full nulls/CI.
Ethics, Consent & Privacy
- Respect dataset licenses and the consent levels recorded in the dataset sheet; enforce PII policies at intake.
- Never upload sensitive data to third-party clouds without a DPA and documented consent.
- Prefer de-identified, aggregated outputs in result cards and reports.
Downloads & Starters
Everything you need to produce a clean, reproducible HHM run.
FAQ
Do I need Python execution to run HHM?
No. For planning and interpretation, file-aware AIs are enough. For measurement, use the provided container or your own pinned environment.
What counts as a “reproducible” change?
Any change that alters preprocessing, operators, thresholds, or nulls requires a new `pipeline_version` and fresh result cards.
Can I add custom operators?
Yes — follow the operator template, register in your pipeline, and validate on a known dataset before claiming results.