Dataset Sheets & Provenance

Everything you need to trust, reuse, and reproduce HHM datasets

Why Dataset Sheets?

HHM results depend on clean, well-described data. Dataset Sheets are compact JSON/Markdown records that say what the dataset is, where it came from, how it’s been processed, and how to verify integrity. They pair with a Provenance Chain that tracks every transformation from raw → analysis-ready.

TL;DR: Every dataset linked to an experiment should ship with:
  • A dataset sheet (metadata + methods + access/consent)
  • A file manifest (paths, sizes, checksums)
  • A provenance log (transforms, code, environment)
  • Hashes for the sheet and manifest themselves (content-addressable)
Run this with an AI
"Create a Dataset Sheet for the attached files following HHM schema v1.
Infer missing fields if possible, compute SHA-256/BLAKE3 for all files,
generate a provenance chain from the provided preprocessing script,
and output: dataset_sheet.json, manifest.json, provenance.json, and a README.md."

Dataset Sheet — Minimal Schema

Use this as a guide. The canonical version lives inside HHM_bundle.json (Schemas → dataset_sheet).

  • dataset_id (string): Stable identifier (e.g., hhm-eeg-αβ-003)
  • title (string): Human-readable name
  • version (string): SemVer or date tag (e.g., 1.0.0 or 2025-08-08)
  • domain (string): EEG | CMB | audio | symbols | biology | quantum | other
  • modality (string): Specific instrument or encoding (e.g., 64ch-EEG, Planck-LFI)
  • description (string): Short abstract of content and purpose
  • license (string): SPDX id or custom URL
  • consent (object): { level: public|restricted|private, notes, contact }
  • pii_status (string): none | de-identified | sensitive (explain mitigations)
  • collection (object): { date_range, sites, devices, protocols }
  • preprocessing (array): Key steps (also logged in provenance)
  • operators_expected (array): OP ids commonly run (OP002, OP003, OP006, OP014, ...)
  • threshold_profile (string|object): Reference to thresholds used or overrides
  • manifest_ref (string): Relative path/URL to manifest.json
  • provenance_ref (string): Relative path/URL to provenance.json
  • sheet_hash (object): { algo: sha256|blake3, value }
  • created_at (string): ISO 8601 timestamp
  • created_by (object): { name, org, email }
Example: dataset_sheet.json
{
  "dataset_id": "hhm-eeg-alphabeta-003",
  "title": "EEG α/β meditation sessions (rest→focus)",
  "version": "1.1.0",
  "domain": "EEG",
  "modality": "64ch-EEG",
  "description": "Rest and meditation EEG; used for Echo, Rec, Entropy, Orbit tests.",
  "license": "CC-BY-4.0",
  "consent": {"level":"public","notes":"IRB #2025-04, de-identified"},
  "pii_status": "de-identified",
  "collection": {
    "date_range": "2025-03-05/2025-03-21",
    "sites": ["Lab A"],
    "devices": ["BioSemi ActiveTwo"],
    "protocols": ["eyes-closed rest", "guided focus"]
  },
  "preprocessing": [
    "notch 50/60Hz", "bandpass 1-45Hz", "ICA ocular removal", "re-ref Cz"
  ],
  "operators_expected": ["OP002","OP003","OP006","OP011","OP014"],
  "threshold_profile": "profiles/eeg_default_v2.json",
  "manifest_ref": "manifest.json",
  "provenance_ref": "provenance.json",
  "sheet_hash": {"algo":"sha256","value":"2b7c..."},
  "created_at": "2025-08-11T10:24:00Z",
  "created_by": {"name":"HHM Initiative","org":"harmonia.to","email":"data@harmonia.to"}
}
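The sheet_hash field makes the sheet itself content-addressable. A minimal verification sketch in Python, assuming (the page does not pin this down) that the hash covers the sheet serialized with the sheet_hash field removed and keys sorted:

```python
# Verify a dataset sheet's self-referential sheet_hash.
# Assumption: hash is computed over the canonical JSON of the sheet
# with the "sheet_hash" field removed (sorted keys, compact separators).
import hashlib
import json

def compute_sheet_hash(sheet: dict) -> str:
    body = {k: v for k, v in sheet.items() if k != "sheet_hash"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(canonical).hexdigest()

def verify_sheet(sheet: dict) -> bool:
    declared = sheet.get("sheet_hash", {})
    return (declared.get("algo") == "sha256"
            and declared.get("value") == compute_sheet_hash(sheet))
```

Whatever canonicalization you choose, document it in the sheet so others can reproduce the hash byte-for-byte.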

File Manifest — Integrity & Access

The manifest lists every file with size, checksum, and a content address. This is what lets others verify they have the exact bits you used.

  • root (string): Base path or bucket (s3://…, ipfs://…)
  • files[] (array): List of file entries
  • files[].path (string): Relative path from root
  • files[].bytes (number): File size in bytes
  • files[].hash (object): { algo: sha256|blake3, value }
  • files[].content_addr (string): URI with hash (e.g., ipfs CID)
  • files[].media_type (string): MIME type hint
  • files[].access (string): public | restricted | private
Example: manifest.json
{
  "root": "s3://harmonia-datasets/hhm-eeg-alphabeta-003/",
  "files": [
    {"path":"raw/sub01_run1.fif","bytes":182334590,"hash":{"algo":"blake3","value":"7f91..."}, "content_addr":"ipfs://bafy...","media_type":"application/octet-stream","access":"restricted"},
    {"path":"derivatives/sub01_run1_clean.fif","bytes":146220144,"hash":{"algo":"sha256","value":"1a3d..."}, "content_addr":"ipfs://bafy...","media_type":"application/octet-stream","access":"restricted"},
    {"path":"events/sub01_run1.tsv","bytes":8421,"hash":{"algo":"sha256","value":"9cbe..."}, "content_addr":"ipfs://bafy...","media_type":"text/tab-separated-values","access":"public"}
  ]
}
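Building the file list is mechanical. A minimal sketch using the standard library (BLAKE3 requires the third-party blake3 package, so this falls back to SHA-256; content_addr and per-file access policy are left to your pipeline):

```python
# Build a manifest-style file list for a local dataset directory.
# SHA-256 only here; add BLAKE3 via the third-party `blake3` package.
import hashlib
import json
import os

def hash_file(path: str, algo: str = "sha256", chunk: int = 1 << 20) -> str:
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def build_manifest(root: str) -> dict:
    files = []
    for dirpath, _, names in os.walk(root):
        for name in sorted(names):
            full = os.path.join(dirpath, name)
            files.append({
                "path": os.path.relpath(full, root).replace(os.sep, "/"),
                "bytes": os.path.getsize(full),
                "hash": {"algo": "sha256", "value": hash_file(full)},
                "access": "public",  # set per-file policy as needed
            })
    return {"root": root, "files": files}
```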
Integrity: Prefer BLAKE3 for speed with strong integrity, and publish SHA-256 alongside it for compatibility. Store both when possible.
Run this with an AI
"Compute BLAKE3 and SHA-256 for all files under ./dataset.
Generate manifest.json with content-addresses (ipfs:// if local IPFS available),
and verify the manifest against the current directory." 

Provenance — Lineage & Transformations

Provenance documents the how: scripts, parameters, environments, and sequence of steps from raw → analysis-ready. This makes results auditable and repeatable.

Minimum:
  • Lineage graph (nodes = files/states, edges = transforms)
  • Transform records (tool, params, inputs→outputs)
  • Environment (OS, package versions, GPU/CPU)
  • Random seeds and determinism flags
Nice to have:
  • Container digest (e.g., ghcr.io/...@sha256:…)
  • CI logs or notebook hashes
  • Human notes (“ICA rejected 2 comps due to blink”)
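Capturing the environment and seed at run time is cheap and prevents after-the-fact guesswork. A minimal sketch; the field names follow the provenance example on this page, while tool and params are placeholders for your own pipeline:

```python
# Capture a minimal transform record (environment + seed) at run time.
# "tool" and "params" are illustrative placeholders, not canonical values.
import platform
from datetime import datetime, timezone

def transform_record(transform_id: str, tool: str, params: dict, seed: int) -> dict:
    return {
        "id": transform_id,
        "tool": tool,
        "params": params,
        "env": {
            "os": platform.platform(),
            "python": platform.python_version(),
        },
        "seed": seed,
        "started": datetime.now(timezone.utc).isoformat(),
    }
```

Append one such record per transform, then fill in "finished" and any log lines when the step completes.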
Example: provenance.json
{
  "dataset_id": "hhm-eeg-alphabeta-003",
  "graph": {
    "nodes": [
      {"id":"raw/sub01_run1.fif","hash":"blake3:7f91..."},
      {"id":"derivatives/sub01_run1_clean.fif","hash":"sha256:1a3d..."}
    ],
    "edges": [
      {"from":"raw/sub01_run1.fif","to":"derivatives/sub01_run1_clean.fif",
       "transform_id":"t-pre-ica-01"}
    ]
  },
  "transforms": [
    {
      "id":"t-pre-ica-01",
      "tool":"mne-preproc.py@e2f4c1a",
      "container":"ghcr.io/hhm/mne:1.6.1@sha256:abcd...",
      "params":{"bandpass":[1,45],"notch":[50,60],"ica_method":"fastica","rej":{"eog":2}},
      "env":{"os":"Ubuntu 22.04","python":"3.11.6","mne":"1.6.1","numpy":"1.26.4"},
      "seed": 424242,
      "started":"2025-03-12T09:01:05Z","finished":"2025-03-12T09:07:18Z",
      "logs":["Removed comps: IC1, IC3 (blink)"]
    }
  ]
}
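Because the lineage graph and the manifest both carry hashes, they can be cross-checked automatically. A sketch, assuming node ids are manifest paths and node hashes use the "algo:value" form shown in the examples above:

```python
# Cross-check provenance graph node hashes against a manifest.
# Assumes node ids match manifest paths and hashes are "algo:value" strings.
def check_lineage(provenance: dict, manifest: dict) -> list:
    by_path = {f["path"]: f["hash"] for f in manifest["files"]}
    mismatches = []
    for node in provenance["graph"]["nodes"]:
        algo, _, value = node["hash"].partition(":")
        entry = by_path.get(node["id"])
        if entry is None or entry["algo"] != algo or entry["value"] != value:
            mismatches.append(node["id"])
    return mismatches
```

An empty return means every node in the lineage graph points at the exact bits the manifest describes.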
Run this with an AI
"Parse stdout/stderr from the attached preprocessing run, extract parameters,
normalize into provenance.json using the HHM schema, and emit a GraphML/JSON
lineage graph with node hashes."

Ready for Operators (OP002/OP003/OP006/OP014)

To make a dataset “operator-ready,” confirm its sampling/basis, normalization, windowing, and linkage to a threshold profile.

Run this with an AI
"Validate operator-readiness: check sampling/basis, normalization, windowing,
and thresholds linkage. Emit a readiness report and update dataset_sheet.json."
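Part of this check is purely structural and can be automated against the sheet itself. A hedged sketch; the criteria below are illustrative stand-ins, not the canonical HHM readiness checklist:

```python
# A sketch of a structural operator-readiness check on a dataset sheet.
# These checks are illustrative, not the canonical HHM criteria.
def readiness_report(sheet: dict) -> dict:
    checks = {
        "has_threshold_profile": bool(sheet.get("threshold_profile")),
        "has_manifest_ref": bool(sheet.get("manifest_ref")),
        "has_provenance_ref": bool(sheet.get("provenance_ref")),
        "operators_declared": bool(sheet.get("operators_expected")),
    }
    return {"checks": checks, "ready": all(checks.values())}
```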

Packaging & Releases

Publish datasets as self-verifying bundles: the dataset sheet, manifest, provenance log, and a CHECKSUMS.txt covering them all.

CHECKSUMS.txt example
blake3  dataset_sheet.json  2b7c9d... 
sha256  dataset_sheet.json  8d4a6f...
blake3  manifest.json       7f9112...
sha256  manifest.json       1a3d55...
...
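Generating those lines is a one-liner per file. A minimal sketch producing the "algo  filename  digest" layout shown above (SHA-256 only; add BLAKE3 lines the same way with the third-party blake3 package):

```python
# Emit CHECKSUMS.txt lines in the "algo  filename  digest" layout.
import hashlib

def checksums_lines(files: dict) -> list:
    # files: {filename: raw bytes of that file}
    lines = []
    for name, data in sorted(files.items()):
        digest = hashlib.sha256(data).hexdigest()
        lines.append(f"sha256  {name}  {digest}")
    return lines
```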

FAQ — Quick Answers

Do I need IPFS?

No. It’s optional. Content addresses help reproducibility; S3/HTTPS is fine if you keep checksums immutable.

Which hash should I use?

Use BLAKE3 (fast) + SHA-256 (compat). Publish both when you can.

Where do these schemas live?

In HHM_bundle.json under Schemas. This page mirrors the core fields for convenience.

How does this connect to thresholds & CIs?

Your dataset sheet links to a threshold profile. Provenance documents any preprocessing that would affect operator statistics. Together, they make your results comparable and auditable.

Safety & Claims

Reminder: HHM datasets are for research. Do not use for diagnosis, individual forecasting, or high-stakes decisions without independent review and approvals.

Downloads

  • Dataset Sheet Template (.json)
  • Manifest Template (.json)
  • Provenance Template (.json)