Why Dataset Sheets?
HHM results depend on clean, well-described data. Dataset Sheets are compact JSON/Markdown records that state what a dataset is, where it came from, how it was processed, and how to verify its integrity. They pair with a Provenance Chain that tracks every transformation from raw → analysis-ready. Each published dataset bundles:
- A dataset sheet (metadata + methods + access/consent)
- A file manifest (paths, sizes, checksums)
- A provenance log (transforms, code, environment)
- Hashes for the sheet and manifest themselves (content-addressable)
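As a rough sketch, a published bundle might sit on disk like this (file names taken from the examples below; the exact layout is a convention, not a requirement):

```
hhm-eeg-alphabeta-003/
├── dataset_sheet.json
├── manifest.json
├── provenance.json
├── README.md
├── CHECKSUMS.txt
├── raw/
├── derivatives/
└── events/
```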
Run this with an AI
"Create a Dataset Sheet for the attached files following HHM schema v1.
Infer missing fields if possible, compute SHA-256/BLAKE3 for all files,
generate a provenance chain from the provided preprocessing script,
and output: dataset_sheet.json, manifest.json, provenance.json, and a README.md."
Dataset Sheet — Minimal Schema
Use this as a guide. The canonical version lives inside HHM_bundle.json (Schemas → dataset_sheet).
Field | Type | Description |
---|---|---|
dataset_id | string | Stable identifier (e.g., hhm-eeg-αβ-003) |
title | string | Human-readable name |
version | string | SemVer or date tag (e.g., 1.0.0 or 2025-08-08) |
domain | string | EEG \| CMB \| audio \| symbols \| biology \| quantum \| other |
modality | string | Specific instrument or encoding (e.g., 64ch-EEG, Planck-LFI) |
description | string | Short abstract of content and purpose |
license | string | SPDX id or custom URL |
consent | object | { level: public\|restricted\|private, notes, contact } |
pii_status | string | none \| de-identified \| sensitive (explain mitigations) |
collection | object | { date_range, sites, devices, protocols } |
preprocessing | array | Key steps (also logged in provenance) |
operators_expected | array | OP ids commonly run (OP002, OP003, OP006, OP014, ...) |
threshold_profile | string\|object | Reference to thresholds used or overrides |
manifest_ref | string | Relative path/URL to manifest.json |
provenance_ref | string | Relative path/URL to provenance.json |
sheet_hash | object | { algo: sha256\|blake3, value } |
created_at | string | ISO 8601 timestamp |
created_by | object | { name, org, email } |
Example: dataset_sheet.json
{
"dataset_id": "hhm-eeg-alphabeta-003",
"title": "EEG α/β meditation sessions (rest→focus)",
"version": "1.1.0",
"domain": "EEG",
"modality": "64ch-EEG",
"description": "Rest and meditation EEG; used for Echo, Rec, Entropy, Orbit tests.",
"license": "CC-BY-4.0",
"consent": {"level":"public","notes":"IRB #2025-04, de-identified"},
"pii_status": "de-identified",
"collection": {
"date_range": "2025-03-05/2025-03-21",
"sites": ["Lab A"],
"devices": ["BioSemi ActiveTwo"],
"protocols": ["eyes-closed rest", "guided focus"]
},
"preprocessing": [
"notch 50/60Hz", "bandpass 1-45Hz", "ICA ocular removal", "re-ref Cz"
],
"operators_expected": ["OP002","OP003","OP006","OP011","OP014"],
"threshold_profile": "profiles/eeg_default_v2.json",
"manifest_ref": "manifest.json",
"provenance_ref": "provenance.json",
"sheet_hash": {"algo":"sha256","value":"2b7c..."},
"created_at": "2025-08-11T10:24:00Z",
"created_by": {"name":"HHM Initiative","org":"harmonia.to","email":"data@harmonia.to"}
}
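Note that sheet_hash cannot cover itself. A minimal sketch of one workable convention, assumed here rather than mandated by the schema: hash the canonical JSON of the sheet with the sheet_hash field removed.

```python
import hashlib
import json

def compute_sheet_hash(sheet: dict) -> dict:
    # ASSUMED convention: hash the sheet minus its own sheet_hash field,
    # serialized as canonical JSON (sorted keys, compact separators).
    # Check HHM_bundle.json (Schemas -> dataset_sheet) for the real rule.
    body = {k: v for k, v in sheet.items() if k != "sheet_hash"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False).encode("utf-8")
    return {"algo": "sha256", "value": hashlib.sha256(canonical).hexdigest()}

with open("dataset_sheet.json", encoding="utf-8") as f:
    sheet = json.load(f)
sheet["sheet_hash"] = compute_sheet_hash(sheet)
with open("dataset_sheet.json", "w", encoding="utf-8") as f:
    json.dump(sheet, f, indent=2, ensure_ascii=False)
```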
File Manifest — Integrity & Access
The manifest lists every file with size, checksum, and a content address. This is what lets others verify they have the exact bits you used.
Field | Type | Description |
---|---|---|
root | string | Base path or bucket (s3://…, ipfs://…) |
files[] | array | List of file entries |
files[].path | string | Relative path from root |
files[].bytes | number | File size in bytes |
files[].hash | object | { algo: sha256\|blake3, value } |
files[].content_addr | string | URI with hash (e.g., ipfs CID) |
files[].media_type | string | MIME/type hint |
files[].access | string | public \| restricted \| private |
Example: manifest.json
{
"root": "s3://harmonia-datasets/hhm-eeg-alphabeta-003/",
"files": [
{"path":"raw/sub01_run1.fif","bytes":182334590,"hash":{"algo":"blake3","value":"7f91..."}, "content_addr":"ipfs://bafy...","media_type":"application/octet-stream","access":"restricted"},
{"path":"derivatives/sub01_run1_clean.fif","bytes":146220144,"hash":{"algo":"sha256","value":"1a3d..."}, "content_addr":"ipfs://bafy...","media_type":"application/octet-stream","access":"restricted"},
{"path":"events/sub01_run1.tsv","bytes":8421,"hash":{"algo":"sha256","value":"9cbe..."}, "content_addr":"ipfs://bafy...","media_type":"text/tab-separated-values","access":"public"}
]
}
Use BLAKE3 for speed and strong integrity, and also publish a SHA-256 for compatibility. Store both if possible.
Run this with an AI
"Compute BLAKE3 and SHA-256 for all files under ./dataset.
Generate manifest.json with content-addresses (ipfs:// if local IPFS available),
and verify the manifest against the current directory."
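A minimal sketch of what that prompt boils down to, assuming the third-party blake3 Python package and the ./dataset directory from the prompt. The side-by-side sha256 field is one illustrative way to publish both digests; it is not part of the schema table above.

```python
import hashlib
import json
from pathlib import Path

from blake3 import blake3  # third-party: pip install blake3

def hash_file(path: Path) -> dict:
    # Stream the file once, feeding both hashers chunk by chunk.
    sha, b3 = hashlib.sha256(), blake3()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
            b3.update(chunk)
    return {"sha256": sha.hexdigest(), "blake3": b3.hexdigest()}

def build_manifest(root: Path) -> dict:
    files = []
    for p in sorted(root.rglob("*")):
        if not p.is_file():
            continue
        digests = hash_file(p)
        files.append({
            "path": p.relative_to(root).as_posix(),
            "bytes": p.stat().st_size,
            "hash": {"algo": "blake3", "value": digests["blake3"]},
            "sha256": digests["sha256"],  # illustrative compat field
        })
    # root here is the local path; swap in your s3:// or ipfs:// base
    # when publishing.
    return {"root": str(root), "files": files}

manifest = build_manifest(Path("./dataset"))
Path("manifest.json").write_text(json.dumps(manifest, indent=2))
```

Verification is the same walk in reverse: recompute hash_file for each manifest entry and compare digests.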
Provenance — Lineage & Transformations
Provenance documents the how: scripts, parameters, environments, and sequence of steps from raw → analysis-ready. This makes results auditable and repeatable.
- Lineage graph (nodes = files/states, edges = transforms)
- Transform records (tool, params, inputs→outputs)
- Environment (OS, package versions, GPU/CPU)
- Random seeds and determinism flags
- Container digest (e.g., ghcr.io/...@sha256:…)
- CI logs or notebook hashes
- Human notes (“ICA rejected 2 comps due to blink”)
Example: provenance.json
{
"dataset_id": "hhm-eeg-alphabeta-003",
"graph": {
"nodes": [
{"id":"raw/sub01_run1.fif","hash":"blake3:7f91..."},
{"id":"derivatives/sub01_run1_clean.fif","hash":"sha256:1a3d..."}
],
"edges": [
{"from":"raw/sub01_run1.fif","to":"derivatives/sub01_run1_clean.fif",
"transform_id":"t-pre-ica-01"}
]
},
"transforms": [
{
"id":"t-pre-ica-01",
"tool":"mne-preproc.py@e2f4c1a",
"container":"ghcr.io/hhm/mne:1.6.1@sha256:abcd...",
"params":{"bandpass":[1,45],"notch":[50,60],"ica_method":"fastica","rej":{"eog":2}},
"env":{"os":"Ubuntu 22.04","python":"3.11.6","mne":"1.6.1","numpy":"1.26.4"},
"seed": 424242,
"started":"2025-03-12T09:01:05Z","finished":"2025-03-12T09:07:18Z",
"logs":["Removed comps: IC1, IC3 (blink)"]
}
]
}
Run this with an AI
"Parse stdout/stderr from the attached preprocessing run, extract parameters,
normalize into provenance.json using the HHM schema, and emit a GraphML/JSON
lineage graph with node hashes."
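After generation, a quick cross-check of the lineage graph against the manifest is cheap insurance. A sketch assuming the example files above, where node hashes are algo:value strings and manifest hashes are {algo, value} objects:

```python
import json

with open("manifest.json") as f:
    manifest = json.load(f)
with open("provenance.json") as f:
    provenance = json.load(f)

# Index manifest digests as "algo:value" strings, keyed by relative path.
by_path = {entry["path"]: f'{entry["hash"]["algo"]}:{entry["hash"]["value"]}'
           for entry in manifest["files"]}

for node in provenance["graph"]["nodes"]:
    expected = by_path.get(node["id"])
    if expected is None:
        print(f"missing from manifest: {node['id']}")
    elif expected != node["hash"]:
        print(f"hash mismatch: {node['id']}")
    else:
        print(f"ok: {node['id']}")
```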
Consent, Access Levels & Ethics
Some HHM datasets involve humans or sensitive recordings. Be precise about consent, de-identification, and who can access what.
Level | Who | Notes |
---|---|---|
public | Anyone | Redacted, de-identified; safe to mirror |
restricted | Approved researchers | Requires DUA/IRB reference; access logs |
private | PI team only | Contains sensitive or unredacted material |
Run this with an AI
"Audit the dataset directory for potential PII (filenames/metadata).
Suggest redactions and update dataset_sheet.json → consent & pii_status fields."
Ready for Operators (OP002/OP003/OP006/OP014)
To make a dataset “operator-ready,” confirm:
- Sampling rates and basis choices are recorded in the sheet
- Normalization pipeline is frozen (z-score/unit-norm where applicable)
- Windows/segments defined (length, step, overlap)
- Expected thresholds profile is linked (from your thresholds page)
Run this with an AI
"Validate operator-readiness: check sampling/basis, normalization, windowing,
and thresholds linkage. Emit a readiness report and update dataset_sheet.json."
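A sketch of a minimal deterministic pre-check along the same lines; the required fields below are assumptions drawn from the checklist above (windowing details are assumed to live under preprocessing), not a schema-mandated list:

```python
import json

# ASSUMED mapping from the readiness checklist to sheet fields.
REQUIRED = {
    "collection": "sampling/device details recorded",
    "preprocessing": "normalization + windowing steps documented",
    "operators_expected": "expected operators declared",
    "threshold_profile": "thresholds profile linked",
}

with open("dataset_sheet.json") as f:
    sheet = json.load(f)

ready = True
for field, why in REQUIRED.items():
    ok = bool(sheet.get(field))
    ready &= ok
    print(f"[{'OK' if ok else 'MISSING'}] {field}: {why}")

if not ready:
    raise SystemExit("dataset is not operator-ready")
```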
Packaging & Releases
Publish datasets as self-verifying bundles:
- dataset_sheet.json, manifest.json, provenance.json, README.md
- Top-level CHECKSUMS.txt (BLAKE3 + SHA-256 for all of the above)
- Tag the release (e.g., v1.1.0) and sign it with a maintainer key (optional)
CHECKSUMS.txt example
blake3 dataset_sheet.json 2b7c9d...
sha256 dataset_sheet.json 8d4a6f...
blake3 manifest.json 7f9112...
sha256 manifest.json 1a3d55...
...
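Consumers can replay the checksum file offline. A short sketch assuming the format shown above (algo, path, digest, whitespace-separated) and the third-party blake3 package:

```python
import hashlib

from blake3 import blake3  # third-party: pip install blake3

ALGOS = {"sha256": hashlib.sha256, "blake3": blake3}

with open("CHECKSUMS.txt") as f:
    for line in f:
        if not line.strip():
            continue
        algo, path, digest = line.split()
        h = ALGOS[algo]()
        with open(path, "rb") as data:
            for chunk in iter(lambda: data.read(1 << 20), b""):
                h.update(chunk)
        status = "ok" if h.hexdigest() == digest else "FAILED"
        print(f"{status:>6}  {algo:<7} {path}")
```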
FAQ — Quick Answers
Do I need IPFS?
No. It’s optional. Content addresses help reproducibility; S3/HTTPS is fine if you keep checksums immutable.
Which hash should I use?
Use BLAKE3 (fast) + SHA-256 (compat). Publish both when you can.
Where do these schemas live?
In HHM_bundle.json → Schemas. This page mirrors the core fields for convenience.
How does this connect to thresholds & CIs?
Your dataset sheet links to a threshold profile. Provenance documents any preprocessing that would affect operator statistics. Together, they make your results comparable and auditable.