Dataset Sheets & Provenance

Everything you need to trust, reuse, and reproduce HHM datasets

Why Dataset Sheets?

HHM results depend on clean, well-described data. Dataset Sheets are compact JSON/Markdown records that say what the dataset is, where it came from, how it’s been processed, and how to verify integrity. They pair with a Provenance Chain that tracks every transformation from raw → analysis-ready.

TL;DR: Every dataset linked to an experiment should ship with:
  • A dataset sheet (metadata + methods + access/consent)
  • A file manifest (paths, sizes, checksums)
  • A provenance log (transforms, code, environment)
  • Hashes for the sheet and manifest themselves (content-addressable)
Run this with an AI
"Create a Dataset Sheet for the attached files following HHM schema v1.
Infer missing fields if possible, compute SHA-256/BLAKE3 for all files,
generate a provenance chain from the provided preprocessing script,
and output: dataset_sheet.json, manifest.json, provenance.json, and a README.md."

Dataset Sheet — Minimal Schema

Use this as a guide. The canonical version lives inside HHM_bundle.json (Schemas → dataset_sheet).

  • dataset_id (string): Stable identifier (e.g., hhm-eeg-αβ-003)
  • title (string): Human-readable name
  • version (string): SemVer or date tag (e.g., 1.0.0 or 2025-08-08)
  • domain (string): EEG | CMB | audio | symbols | biology | quantum | other
  • modality (string): Specific instrument or encoding (e.g., 64ch-EEG, Planck-LFI)
  • description (string): Short abstract of content and purpose
  • license (string): SPDX id or custom URL
  • consent (object): { level: public|restricted|private, notes, contact }
  • pii_status (string): none | de-identified | sensitive (explain mitigations)
  • collection (object): { date_range, sites, devices, protocols }
  • preprocessing (array): Key steps (also logged in provenance)
  • operators_expected (array): OP ids commonly run (OP002, OP003, OP006, OP014, ...)
  • threshold_profile (string|object): Reference to thresholds used or overrides
  • manifest_ref (string): Relative path/URL to manifest.json
  • provenance_ref (string): Relative path/URL to provenance.json
  • sheet_hash (object): { algo: sha256|blake3, value }
  • created_at (string): ISO 8601 timestamp
  • created_by (object): { name, org, email }
Example: dataset_sheet.json
{
  "dataset_id": "hhm-eeg-alphabeta-003",
  "title": "EEG α/β meditation sessions (rest→focus)",
  "version": "1.1.0",
  "domain": "EEG",
  "modality": "64ch-EEG",
  "description": "Rest and meditation EEG; used for Echo, Rec, Entropy, Orbit tests.",
  "license": "CC-BY-4.0",
  "consent": {"level":"public","notes":"IRB #2025-04, de-identified"},
  "pii_status": "de-identified",
  "collection": {
    "date_range": "2025-03-05/2025-03-21",
    "sites": ["Lab A"],
    "devices": ["BioSemi ActiveTwo"],
    "protocols": ["eyes-closed rest", "guided focus"]
  },
  "preprocessing": [
    "notch 50/60Hz", "bandpass 1-45Hz", "ICA ocular removal", "re-ref Cz"
  ],
  "operators_expected": ["OP002","OP003","OP006","OP011","OP014"],
  "threshold_profile": "profiles/eeg_default_v2.json",
  "manifest_ref": "manifest.json",
  "provenance_ref": "provenance.json",
  "sheet_hash": {"algo":"sha256","value":"2b7c..."},
  "created_at": "2025-08-11T10:24:00Z",
  "created_by": {"name":"HHM Initiative","org":"harmonia.to","email":"data@harmonia.to"}
}
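The sheet_hash field makes the sheet itself content-addressable. A minimal verification sketch in Python, assuming (the page does not pin this down) that the hash covers the sheet serialized with the sheet_hash field removed and keys sorted:

```python
# Verify a dataset sheet's self-referential sheet_hash.
# Assumption: hash is computed over the canonical JSON of the sheet
# with the "sheet_hash" field removed (sorted keys, compact separators).
import hashlib
import json

def compute_sheet_hash(sheet: dict) -> str:
    body = {k: v for k, v in sheet.items() if k != "sheet_hash"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(canonical).hexdigest()

def verify_sheet(sheet: dict) -> bool:
    declared = sheet.get("sheet_hash", {})
    return (declared.get("algo") == "sha256"
            and declared.get("value") == compute_sheet_hash(sheet))
```

Whatever canonicalization you choose, document it in the sheet so others can reproduce the hash byte-for-byte.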

File Manifest — Integrity & Access

The manifest lists every file with size, checksum, and a content address. This is what lets others verify they have the exact bits you used.

  • root (string): Base path or bucket (s3://…, ipfs://…)
  • files[] (array): List of file entries
  • files[].path (string): Relative path from root
  • files[].bytes (number): File size in bytes
  • files[].hash (object): { algo: sha256|blake3, value }
  • files[].content_addr (string): URI with hash (e.g., ipfs CID)
  • files[].media_type (string): MIME type hint
  • files[].access (string): public | restricted | private
Example: manifest.json
{
  "root": "s3://harmonia-datasets/hhm-eeg-alphabeta-003/",
  "files": [
    {"path":"raw/sub01_run1.fif","bytes":182334590,"hash":{"algo":"blake3","value":"7f91..."}, "content_addr":"ipfs://bafy...","media_type":"application/octet-stream","access":"restricted"},
    {"path":"derivatives/sub01_run1_clean.fif","bytes":146220144,"hash":{"algo":"sha256","value":"1a3d..."}, "content_addr":"ipfs://bafy...","media_type":"application/octet-stream","access":"restricted"},
    {"path":"events/sub01_run1.tsv","bytes":8421,"hash":{"algo":"sha256","value":"9cbe..."}, "content_addr":"ipfs://bafy...","media_type":"text/tab-separated-values","access":"public"}
  ]
}
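Building the file list is mechanical. A minimal sketch using the standard library (BLAKE3 requires the third-party blake3 package, so this falls back to SHA-256; content_addr and per-file access policy are left to your pipeline):

```python
# Build a manifest-style file list for a local dataset directory.
# SHA-256 only here; add BLAKE3 via the third-party `blake3` package.
import hashlib
import json
import os

def hash_file(path: str, algo: str = "sha256", chunk: int = 1 << 20) -> str:
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def build_manifest(root: str) -> dict:
    files = []
    for dirpath, _, names in os.walk(root):
        for name in sorted(names):
            full = os.path.join(dirpath, name)
            files.append({
                "path": os.path.relpath(full, root).replace(os.sep, "/"),
                "bytes": os.path.getsize(full),
                "hash": {"algo": "sha256", "value": hash_file(full)},
                "access": "public",  # set per-file policy as needed
            })
    return {"root": root, "files": files}
```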
Integrity: Prefer BLAKE3 for speed with strong integrity, and publish SHA-256 alongside it for compatibility. Store both when possible.
Run this with an AI
"Compute BLAKE3 and SHA-256 for all files under ./dataset.
Generate manifest.json with content-addresses (ipfs:// if local IPFS available),
and verify the manifest against the current directory." 

Provenance — Lineage & Transformations

Provenance documents the how: scripts, parameters, environments, and sequence of steps from raw → analysis-ready. This makes results auditable and repeatable.

Minimum:
  • Lineage graph (nodes = files/states, edges = transforms)
  • Transform records (tool, params, inputs→outputs)
  • Environment (OS, package versions, GPU/CPU)
  • Random seeds and determinism flags
Nice to have:
  • Container digest (e.g., ghcr.io/...@sha256:…)
  • CI logs or notebook hashes
  • Human notes (“ICA rejected 2 comps due to blink”)
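Capturing the environment and seed at run time is cheap and prevents after-the-fact guesswork. A minimal sketch; the field names follow the provenance example on this page, while tool and params are placeholders for your own pipeline:

```python
# Capture a minimal transform record (environment + seed) at run time.
# "tool" and "params" are illustrative placeholders, not canonical values.
import platform
from datetime import datetime, timezone

def transform_record(transform_id: str, tool: str, params: dict, seed: int) -> dict:
    return {
        "id": transform_id,
        "tool": tool,
        "params": params,
        "env": {
            "os": platform.platform(),
            "python": platform.python_version(),
        },
        "seed": seed,
        "started": datetime.now(timezone.utc).isoformat(),
    }
```

Append one such record per transform, then fill in "finished" and any log lines when the step completes.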
Example: provenance.json
{
  "dataset_id": "hhm-eeg-alphabeta-003",
  "graph": {
    "nodes": [
      {"id":"raw/sub01_run1.fif","hash":"blake3:7f91..."},
      {"id":"derivatives/sub01_run1_clean.fif","hash":"sha256:1a3d..."}
    ],
    "edges": [
      {"from":"raw/sub01_run1.fif","to":"derivatives/sub01_run1_clean.fif",
       "transform_id":"t-pre-ica-01"}
    ]
  },
  "transforms": [
    {
      "id":"t-pre-ica-01",
      "tool":"mne-preproc.py@e2f4c1a",
      "container":"ghcr.io/hhm/mne:1.6.1@sha256:abcd...",
      "params":{"bandpass":[1,45],"notch":[50,60],"ica_method":"fastica","rej":{"eog":2}},
      "env":{"os":"Ubuntu 22.04","python":"3.11.6","mne":"1.6.1","numpy":"1.26.4"},
      "seed": 424242,
      "started":"2025-03-12T09:01:05Z","finished":"2025-03-12T09:07:18Z",
      "logs":["Removed comps: IC1, IC3 (blink)"]
    }
  ]
}
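Because the lineage graph and the manifest both carry hashes, they can be cross-checked automatically. A sketch, assuming node ids are manifest paths and node hashes use the "algo:value" form shown in the examples above:

```python
# Cross-check provenance graph node hashes against a manifest.
# Assumes node ids match manifest paths and hashes are "algo:value" strings.
def check_lineage(provenance: dict, manifest: dict) -> list:
    by_path = {f["path"]: f["hash"] for f in manifest["files"]}
    mismatches = []
    for node in provenance["graph"]["nodes"]:
        algo, _, value = node["hash"].partition(":")
        entry = by_path.get(node["id"])
        if entry is None or entry["algo"] != algo or entry["value"] != value:
            mismatches.append(node["id"])
    return mismatches
```

An empty return means every node in the lineage graph points at the exact bits the manifest describes.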
Run this with an AI
"Parse stdout/stderr from the attached preprocessing run, extract parameters,
normalize into provenance.json using the HHM schema, and emit a GraphML/JSON
lineage graph with node hashes."

Ready for Operators (OP002/OP003/OP006/OP014)

To make a dataset “operator-ready,” confirm its sampling/basis, normalization, windowing, and linkage to a threshold profile.

Run this with an AI
"Validate operator-readiness: check sampling/basis, normalization, windowing,
and thresholds linkage. Emit a readiness report and update dataset_sheet.json."
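Part of this check is purely structural and can be automated against the sheet itself. A hedged sketch; the criteria below are illustrative stand-ins, not the canonical HHM readiness checklist:

```python
# A sketch of a structural operator-readiness check on a dataset sheet.
# These checks are illustrative, not the canonical HHM criteria.
def readiness_report(sheet: dict) -> dict:
    checks = {
        "has_threshold_profile": bool(sheet.get("threshold_profile")),
        "has_manifest_ref": bool(sheet.get("manifest_ref")),
        "has_provenance_ref": bool(sheet.get("provenance_ref")),
        "operators_declared": bool(sheet.get("operators_expected")),
    }
    return {"checks": checks, "ready": all(checks.values())}
```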

Packaging & Releases

Publish datasets as self-verifying bundles: the dataset sheet, manifest, provenance log, and a CHECKSUMS.txt covering them all.

CHECKSUMS.txt example
blake3  dataset_sheet.json  2b7c9d... 
sha256  dataset_sheet.json  8d4a6f...
blake3  manifest.json       7f9112...
sha256  manifest.json       1a3d55...
...
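Generating those lines is a one-liner per file. A minimal sketch producing the "algo  filename  digest" layout shown above (SHA-256 only; add BLAKE3 lines the same way with the third-party blake3 package):

```python
# Emit CHECKSUMS.txt lines in the "algo  filename  digest" layout.
import hashlib

def checksums_lines(files: dict) -> list:
    # files: {filename: raw bytes of that file}
    lines = []
    for name, data in sorted(files.items()):
        digest = hashlib.sha256(data).hexdigest()
        lines.append(f"sha256  {name}  {digest}")
    return lines
```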

FAQ — Quick Answers

Do I need IPFS?

No. It’s optional. Content addresses help reproducibility; S3/HTTPS is fine if you keep checksums immutable.

Which hash should I use?

Use BLAKE3 (fast) + SHA-256 (compat). Publish both when you can.

Where do these schemas live?

In HHM_bundle.json under Schemas. This page mirrors the core fields for convenience.

How does this connect to thresholds & CIs?

Your dataset sheet links to a threshold profile. Provenance documents any preprocessing that would affect operator statistics. Together, they make your results comparable and auditable.

Safety & Claims

Reminder: HHM datasets are for research. Do not use for diagnosis, individual forecasting, or high-stakes decisions without independent review and approvals.

Downloads

  • Dataset Sheet Template (.json)
  • Manifest Template (.json)
  • Provenance Template (.json)