Why Dataset Sheets?
HHM results depend on clean, well-described data. Dataset Sheets are compact JSON/Markdown records that say what the dataset is, where it came from, how it’s been processed, and how to verify integrity. They pair with a Provenance Chain that tracks every transformation from raw → analysis-ready.
A complete release bundles four artifacts:
- A dataset sheet (metadata + methods + access/consent)
- A file manifest (paths, sizes, checksums)
- A provenance log (transforms, code, environment)
- Hashes for the sheet and manifest themselves (content-addressable)
Run this with an AI
"Create a Dataset Sheet for the attached files following HHM schema v1.
Infer missing fields if possible, compute SHA-256/BLAKE3 for all files,
generate a provenance chain from the provided preprocessing script,
and output: dataset_sheet.json, manifest.json, provenance.json, and a README.md."
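If you would rather script the hashing step than prompt for it, here is a minimal Python sketch. It assumes the files live under ./dataset and that the third-party blake3 package is installed; SHA-256 comes from the standard library:

```python
import hashlib
from pathlib import Path

from blake3 import blake3  # third-party: pip install blake3

def digests(path: Path) -> dict:
    """Stream the file once, feeding both hashers in parallel."""
    sha, b3 = hashlib.sha256(), blake3()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
            b3.update(chunk)
    return {"sha256": sha.hexdigest(), "blake3": b3.hexdigest()}

for p in sorted(Path("./dataset").rglob("*")):
    if p.is_file():
        h = digests(p)
        print(p, h["sha256"][:12], h["blake3"][:12])
```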
Dataset Sheet — Minimal Schema
Use this as a guide. The canonical version lives inside HHM_bundle.json (Schemas → dataset_sheet).
| Field | Type | Description |
|---|---|---|
| dataset_id | string | Stable identifier (e.g., hhm-eeg-αβ-003) |
| title | string | Human-readable name |
| version | string | SemVer or date tag (e.g., 1.0.0 or 2025-08-08) |
| domain | string | EEG \| CMB \| audio \| symbols \| biology \| quantum \| other |
| modality | string | Specific instrument or encoding (e.g., 64ch-EEG, Planck-LFI) |
| description | string | Short abstract of content and purpose |
| license | string | SPDX id or custom URL |
| consent | object | { level: public \| restricted \| private, notes, contact } |
| pii_status | string | none \| de-identified \| sensitive (explain mitigations) |
| collection | object | { date_range, sites, devices, protocols } |
| preprocessing | array | Key steps (also logged in provenance) |
| operators_expected | array | OP ids commonly run (OP002, OP003, OP006, OP014, ...) |
| threshold_profile | string \| object | Reference to thresholds used or overrides |
| manifest_ref | string | Relative path/URL to manifest.json |
| provenance_ref | string | Relative path/URL to provenance.json |
| sheet_hash | object | { algo: sha256 \| blake3, value } |
| created_at | string | ISO 8601 timestamp |
| created_by | object | { name, org, email } |
Example: dataset_sheet.json
```json
{
  "dataset_id": "hhm-eeg-alphabeta-003",
  "title": "EEG α/β meditation sessions (rest→focus)",
  "version": "1.1.0",
  "domain": "EEG",
  "modality": "64ch-EEG",
  "description": "Rest and meditation EEG; used for Echo, Rec, Entropy, Orbit tests.",
  "license": "CC-BY-4.0",
  "consent": {"level":"public","notes":"IRB #2025-04, de-identified"},
  "pii_status": "de-identified",
  "collection": {
    "date_range": "2025-03-05/2025-03-21",
    "sites": ["Lab A"],
    "devices": ["BioSemi ActiveTwo"],
    "protocols": ["eyes-closed rest", "guided focus"]
  },
  "preprocessing": [
    "notch 50/60Hz", "bandpass 1-45Hz", "ICA ocular removal", "re-ref Cz"
  ],
  "operators_expected": ["OP002","OP003","OP006","OP011","OP014"],
  "threshold_profile": "profiles/eeg_default_v2.json",
  "manifest_ref": "manifest.json",
  "provenance_ref": "provenance.json",
  "sheet_hash": {"algo":"sha256","value":"2b7c..."},
  "created_at": "2025-08-11T10:24:00Z",
  "created_by": {"name":"HHM Initiative","org":"harmonia.to","email":"data@harmonia.to"}
}
```
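One subtlety in the example is sheet_hash, which lives inside the object it certifies. A common convention, assumed here since the schema does not pin it down, is to hash the canonical JSON with the field excluded, then embed the digest:

```python
import hashlib
import json

with open("dataset_sheet.json") as f:
    sheet = json.load(f)

# Serialize deterministically with sheet_hash excluded, hash that, then
# embed the digest; verifiers drop the field and repeat the same steps.
unsigned = {k: v for k, v in sheet.items() if k != "sheet_hash"}
canonical = json.dumps(unsigned, sort_keys=True, separators=(",", ":"))
sheet["sheet_hash"] = {
    "algo": "sha256",
    "value": hashlib.sha256(canonical.encode()).hexdigest(),
}

with open("dataset_sheet.json", "w") as f:
    json.dump(sheet, f, indent=2, ensure_ascii=False)
```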
File Manifest — Integrity & Access
The manifest lists every file with size, checksum, and a content address. This is what lets others verify they have the exact bits you used.
| Field | Type | Description |
|---|---|---|
| root | string | Base path or bucket (s3://…, ipfs://…) |
| files[] | array | List of file entries |
| files[].path | string | Relative path from root |
| files[].bytes | number | File size in bytes |
| files[].hash | object | { algo: sha256 \| blake3, value } |
| files[].content_addr | string | URI with hash (e.g., ipfs CID) |
| files[].media_type | string | MIME-type hint |
| files[].access | string | public \| restricted \| private |
Example: manifest.json
```json
{
  "root": "s3://harmonia-datasets/hhm-eeg-alphabeta-003/",
  "files": [
    {
      "path": "raw/sub01_run1.fif",
      "bytes": 182334590,
      "hash": {"algo": "blake3", "value": "7f91..."},
      "content_addr": "ipfs://bafy...",
      "media_type": "application/octet-stream",
      "access": "restricted"
    },
    {
      "path": "derivatives/sub01_run1_clean.fif",
      "bytes": 146220144,
      "hash": {"algo": "sha256", "value": "1a3d..."},
      "content_addr": "ipfs://bafy...",
      "media_type": "application/octet-stream",
      "access": "restricted"
    },
    {
      "path": "events/sub01_run1.tsv",
      "bytes": 8421,
      "hash": {"algo": "sha256", "value": "9cbe..."},
      "content_addr": "ipfs://bafy...",
      "media_type": "text/tab-separated-values",
      "access": "public"
    }
  ]
}
```
Tip: use BLAKE3 for speed and strong integrity, and also publish SHA-256 digests for compatibility. Store both if possible.
Run this with an AI
"Compute BLAKE3 and SHA-256 for all files under ./dataset.
Generate manifest.json with content-addresses (ipfs:// if local IPFS available),
and verify the manifest against the current directory." 
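A scripted equivalent, sketched with SHA-256 only for brevity (add BLAKE3 digests and content addresses per your setup); the ./dataset root is an assumption:

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks to keep memory flat."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

root = Path("./dataset")

# Build the manifest: one entry per file with relative path, size, checksum.
manifest = {
    "root": str(root),
    "files": [
        {
            "path": str(p.relative_to(root)),
            "bytes": p.stat().st_size,
            "hash": {"algo": "sha256", "value": sha256_of(p)},
        }
        for p in sorted(root.rglob("*")) if p.is_file()
    ],
}
Path("manifest.json").write_text(json.dumps(manifest, indent=2))

# Verify: recompute each checksum and flag any mismatch or missing file.
for entry in manifest["files"]:
    p = root / entry["path"]
    ok = p.exists() and sha256_of(p) == entry["hash"]["value"]
    print("OK  " if ok else "FAIL", entry["path"])
```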
Provenance — Lineage & Transformations
Provenance documents the how: scripts, parameters, environments, and sequence of steps from raw → analysis-ready. This makes results auditable and repeatable.
A complete record captures:
- Lineage graph (nodes = files/states, edges = transforms)
- Transform records (tool, params, inputs → outputs)
- Environment (OS, package versions, GPU/CPU)
- Random seeds and determinism flags
- Container digest (e.g., ghcr.io/...@sha256:…)
- CI logs or notebook hashes
- Human notes (“ICA rejected 2 comps due to blink”)
Example: provenance.json
```json
{
  "dataset_id": "hhm-eeg-alphabeta-003",
  "graph": {
    "nodes": [
      {"id":"raw/sub01_run1.fif","hash":"blake3:7f91..."},
      {"id":"derivatives/sub01_run1_clean.fif","hash":"sha256:1a3d..."}
    ],
    "edges": [
      {"from":"raw/sub01_run1.fif","to":"derivatives/sub01_run1_clean.fif",
       "transform_id":"t-pre-ica-01"}
    ]
  },
  "transforms": [
    {
      "id":"t-pre-ica-01",
      "tool":"mne-preproc.py@e2f4c1a",
      "container":"ghcr.io/hhm/mne:1.6.1@sha256:abcd...",
      "params":{"bandpass":[1,45],"notch":[50,60],"ica_method":"fastica","rej":{"eog":2}},
      "env":{"os":"Ubuntu 22.04","python":"3.11.6","mne":"1.6.1","numpy":"1.26.4"},
      "seed": 424242,
      "started":"2025-03-12T09:01:05Z","finished":"2025-03-12T09:07:18Z",
      "logs":["Removed comps: IC1, IC3 (blink)"]
    }
  ]
}
```
Run this with an AI
"Parse stdout/stderr from the attached preprocessing run, extract parameters,
normalize into provenance.json using the HHM schema, and emit a GraphML/JSON
lineage graph with node hashes."
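For the environment half of a transform record, a small Python sketch like this captures OS, interpreter, and package versions at run time; the params and seed shown are placeholders echoing the example above, not values the script can discover on its own:

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata

def pkg_version(name: str):
    """Installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

# Snapshot the fields the transform record's "env" object expects.
env = {
    "os": f"{platform.system()} {platform.release()}",
    "python": platform.python_version(),
    "mne": pkg_version("mne"),
    "numpy": pkg_version("numpy"),
}

# Hypothetical transform record mirroring the example above; params and
# seed should echo whatever your preprocessing script actually used.
transform = {
    "id": "t-pre-ica-01",
    "tool": " ".join(sys.argv),  # command line of the current run
    "params": {"bandpass": [1, 45], "notch": [50, 60]},
    "env": env,
    "seed": 424242,
    "started": datetime.now(timezone.utc).isoformat(),
}
print(json.dumps(transform, indent=2))
```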
Consent, Access Levels & Ethics
Some HHM datasets involve humans or sensitive recordings. Be precise about consent, de-identification, and who can access what.
| Level | Who | Notes | 
|---|---|---|
| public | Anyone | Redacted, de-identified; safe to mirror | 
| restricted | Approved researchers | Requires DUA/IRB reference; access logs | 
| private | PI team only | Contains sensitive or unredacted material | 
Run this with an AI
"Audit the dataset directory for potential PII (filenames/metadata).
Suggest redactions and update dataset_sheet.json → consent & pii_status fields."
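A crude first pass you can script before the AI audit: flag any filename that does not match an opaque naming convention. The subNN_runN pattern below is a hypothetical convention borrowed from the example manifest; adapt it to yours:

```python
import re
from pathlib import Path

# Hypothetical convention: data files look like sub01_run1.fif; anything
# free-form gets flagged, since hand-typed names are where stray PII
# (participant names, clinic dates) tends to hide.
SAFE_NAME = re.compile(r"^sub\d+_run\d+\.(fif|tsv)$")
KNOWN = {"dataset_sheet.json", "manifest.json", "provenance.json", "README.md"}

for p in sorted(Path("./dataset").rglob("*")):
    if p.is_file() and p.name not in KNOWN and not SAFE_NAME.match(p.name):
        print("review for PII:", p)
```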
Ready for Operators (OP002/OP003/OP006/OP014)
To make a dataset “operator-ready,” confirm:
- Sampling rates and basis choices are recorded in the sheet
- Normalization pipeline is frozen (z-score/unit-norm where applicable)
- Windows/segments defined (length, step, overlap)
- Expected thresholds profile is linked (from your thresholds page)
Run this with an AI
"Validate operator-readiness: check sampling/basis, normalization, windowing,
and thresholds linkage. Emit a readiness report and update dataset_sheet.json."
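A lightweight readiness gate can be scripted directly against the sheet. The keys checked below follow the minimal schema above, but the pass/fail rules are an assumed interpretation of the checklist, not an official validator; sampling and windowing checks depend on where your profile records them:

```python
import json
from pathlib import Path

sheet = json.loads(Path("dataset_sheet.json").read_text())

# Assumed readiness rules mirroring the checklist above.
checks = {
    "preprocessing steps recorded": bool(sheet.get("preprocessing")),
    "threshold profile linked": bool(sheet.get("threshold_profile")),
    "manifest referenced": bool(sheet.get("manifest_ref")),
    "expected operators listed": bool(sheet.get("operators_expected")),
}
for name, ok in checks.items():
    print("PASS" if ok else "FAIL", name)
if not all(checks.values()):
    raise SystemExit("dataset is not operator-ready")
```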
Packaging & Releases
Publish datasets as self-verifying bundles:
- dataset_sheet.json, manifest.json, provenance.json, README.md
- Top-level CHECKSUMS.txt (BLAKE3 + SHA-256 for all of the above)
- Tag the release (e.g., v1.1.0) and sign with a maintainer key (optional)
CHECKSUMS.txt example
```
blake3  dataset_sheet.json  2b7c9d...
sha256  dataset_sheet.json  8d4a6f...
blake3  manifest.json       7f9112...
sha256  manifest.json       1a3d55...
...
```
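Verifying a bundle then reduces to re-hashing each listed file. This sketch checks only the sha256 lines, assuming the three-column layout shown above; blake3 lines would need the third-party blake3 package:

```python
import hashlib
from pathlib import Path

# Each CHECKSUMS.txt line is: <algo>  <filename>  <hex digest>.
for line in Path("CHECKSUMS.txt").read_text().splitlines():
    parts = line.split()
    if len(parts) != 3 or parts[0] != "sha256":
        continue  # skip blank lines, blake3 entries, and trailing "..."
    _, name, expected = parts
    actual = hashlib.sha256(Path(name).read_bytes()).hexdigest()
    print("OK  " if actual == expected else "FAIL", name)
```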
FAQ — Quick Answers
Do I need IPFS?
No. It’s optional. Content addresses help reproducibility; S3/HTTPS is fine if you keep checksums immutable.
Which hash should I use?
Use BLAKE3 (fast) + SHA-256 (compat). Publish both when you can.
Where do these schemas live?
In HHM_bundle.json → Schemas. This page mirrors the core fields for convenience.
How does this connect to thresholds & CIs?
Your dataset sheet links to a threshold profile. Provenance documents any preprocessing that would affect operator statistics. Together, they make your results comparable and auditable.