hyperpolymath/zerostep
VAE Dataset Normalizer


Normalize VAE-decoded image datasets for training AI artifact detection models with formal verification guarantees and RSR (Rhodium Standard Repository) compliance.

Overview

The VAE Dataset Normalizer prepares VAE-decoded image datasets for machine learning research. It provides cryptographically verified data splits, formal proofs of correctness, and contrastive learning models to detect VAE artifacts.

This project is fully compliant with the Rhodium Standard Repository (RSR) specification, ensuring reproducibility, security, and accountability.

Features

  • SHAKE256 (d=256) cryptographic checksums for data integrity (FIPS 202)

  • Train/Test/Val/Calibration splits (70/15/10/5) - both random and stratified

  • Dublin Core metadata via CUE configuration language

  • Nickel schema for flexible configuration

  • Isabelle/HOL proofs for split property verification

  • Julia/Flux.jl utilities for training integration

  • Contrastive learning model for VAE artifact detection

  • Diff-based compression reducing storage by ~50%
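The 70/15/10/5 split listed above can be sketched as a seeded shuffle followed by ratio slicing. This is a minimal illustration in Python, not the tool's actual Rust implementation; the function name and signature are hypothetical.

```python
import random

def split_dataset(stems, seed=12345, ratios=(0.70, 0.15, 0.10, 0.05)):
    """Deterministically partition filename stems into
    train/test/val/calibration splits (illustrative sketch only)."""
    stems = sorted(stems)       # fixed order before shuffling
    rng = random.Random(seed)   # seeding makes the split reproducible
    rng.shuffle(stems)
    n = len(stems)
    n_train = round(n * ratios[0])
    n_test = round(n * ratios[1])
    n_val = round(n * ratios[2])
    return {
        "train": stems[:n_train],
        "test": stems[n_train:n_train + n_test],
        "val": stems[n_train + n_test:n_train + n_test + n_val],
        "calibration": stems[n_train + n_test + n_val:],  # remainder
    }
```

A stratified variant would apply the same slicing within each stratum rather than over the whole list.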

Installation

Prerequisites

Required
  • Rust 1.70+ with Cargo

Optional
  • CUE - metadata validation

  • Nickel - configuration

  • Julia - training utilities

  • Isabelle - formal proofs

  • just - modern task runner

  • Podman - container runtime (never Docker)

Build

# Using Just (recommended)
just build

# Using Nix
nix build

# Direct Cargo
cargo build --release

Usage

Normalize a Dataset

# Full normalization with checksums
vae-normalizer normalize -d /path/to/dataset -o /path/to/output

# Fast mode (skip checksums)
vae-normalizer normalize -d /path/to/dataset -o /path/to/output --skip-checksums

# Custom seed and strata
vae-normalizer normalize -d /path/to/dataset -o /path/to/output --seed 12345 --strata 8

Verify Output

# Basic verification
vae-normalizer verify -o /path/to/output

# With checksum verification
vae-normalizer verify -o /path/to/output --checksums -d /path/to/dataset

Show Statistics

# Text format
vae-normalizer stats -o /path/to/output

# JSON format
vae-normalizer stats -o /path/to/output --format json

Compute File Hash

vae-normalizer hash /path/to/image.png

Compress Dataset (Diff Format)

Reduce storage by ~50% by storing VAE images as diffs from originals:

# Compress dataset (creates Original/ + Diff/ structure)
vae-normalizer compress -d /path/to/dataset -o /path/to/compressed

# Decompress (reconstruct full VAE/ directory)
vae-normalizer decompress -d /path/to/compressed -o /path/to/reconstructed-vae

# Reconstruct single image
vae-normalizer reconstruct /path/to/original.png -d /path/to/diff.png -o output.png

The diff encoding is: diff = VAE - Original + 128 (the +128 offset maps signed differences into the unsigned byte range).
Reconstruction: VAE = Original + Diff - 128.
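Per pixel, the encoding can be sketched as below. Clamping to [0, 255] is an assumption of this sketch; the tool may handle out-of-range differences differently.

```python
def encode_diff(original, vae):
    """diff = VAE - Original + 128, clamped to the 8-bit range (assumed)."""
    return [max(0, min(255, v - o + 128)) for o, v in zip(original, vae)]

def decode_diff(original, diff):
    """VAE = Original + Diff - 128, clamped to the 8-bit range."""
    return [max(0, min(255, o + d - 128)) for o, d in zip(original, diff)]
```

When VAE and original pixels are close, diff values cluster near 128, and the round trip is lossless.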

Train Contrastive Model

# Setup Julia dependencies
just julia-setup

# Train VAE artifact detector
just train /path/to/dataset output 50 32

# Train on compressed dataset
just train-compressed /path/to/compressed output 50

# Evaluate model
just evaluate output/vae_classifier.bson /path/to/dataset

Output Structure

output/
├── splits/
│   ├── random_train.txt
│   ├── random_test.txt
│   ├── random_val.txt
│   ├── random_calibration.txt
│   ├── stratified_train.txt
│   ├── stratified_test.txt
│   ├── stratified_val.txt
│   └── stratified_calibration.txt
├── manifest.csv
└── metadata.cue

Dataset Structure

Standard Format

Expected input:

dataset/
├── Original/
│   ├── image001.png
│   ├── image002.png
│   └── ...
└── VAE/
    ├── image001.png
    ├── image002.png
    └── ...

Image files are paired by filename stem (e.g., Original/foo.png pairs with VAE/foo.png).

Compressed Diff Format

After running compress, the dataset uses diff encoding:

compressed/
├── Original/
│   ├── image001.png
│   └── ...
└── Diff/
    ├── image001.png  # Encodes (VAE - Original + 128)
    └── ...

Diff images compress well (~50% smaller) because most pixels cluster near 128 (gray) when VAE and original are similar.

Formal Verification

The Isabelle/HOL theory VAEDataset_Splits.thy proves:

  1. Disjointness: No overlap between train/test/val/calibration sets

  2. Exhaustiveness: Every image appears in exactly one split

  3. Ratio correctness: Split sizes within 1% of targets

  4. Bijection: 1:1 correspondence between original and VAE images

To verify proofs:

isabelle build -d . -b VAEDataset_Splits
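The same three split properties can also be sanity-checked at runtime against the emitted split files. A hedged sketch (the data layout is assumed from the Output Structure section; this mirrors, but does not replace, the Isabelle proofs):

```python
def check_splits(splits, all_stems, ratios, tolerance=0.01):
    """Assert disjointness, exhaustiveness, and ratio correctness
    over a mapping of split name -> list of filename stems."""
    sets = {name: set(stems) for name, stems in splits.items()}
    # Disjointness: no stem appears in two splits.
    names = list(sets)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            assert not (sets[a] & sets[b]), f"{a} and {b} overlap"
    # Exhaustiveness: every image appears in exactly one split.
    union = set().union(*sets.values())
    assert union == set(all_stems), "some images are missing from the splits"
    # Ratio correctness: each split within `tolerance` of its target.
    n = len(all_stems)
    for name, target in ratios.items():
        assert abs(len(sets[name]) / n - target) <= tolerance, name
    return True
```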

Julia Integration

Standard Dataset

include("julia_utils.jl")
using .VAEDatasetUtils

# Load training split
dataset = VAEDetectorDataset(
    "output/splits/random_train.txt",
    "output/manifest.csv",
    "/path/to/dataset"
)

# Create data loader
loader = DataLoader(dataset, batch_size=32, shuffle=true)

# Train with Flux.jl
for (x, y) in loader
    # x: batch of images
    # y: labels (0=original, 1=VAE)
end

Compressed Dataset

include("julia_utils.jl")
using .VAEDatasetUtils

# Load compressed dataset (VAE images reconstructed on-the-fly)
dataset = CompressedVAEDataset(
    "output/splits/random_train.txt",
    "/path/to/compressed/Original",
    "/path/to/compressed/Diff"
)

# Same API as standard dataset
loader = DataLoader(dataset, batch_size=32, shuffle=true)

for (x, y) in loader
    # VAE images are reconstructed automatically from diffs
end

Configuration

CUE Schema

Metadata is validated against the Dublin Core schema via metadata_schema.cue:

dublin_core: {
    title:       "VAEDecodedImages-SDXL"
    creator:     "Joshua Jewell"
    type:        "Dataset"
    format:      "image/png"
}

Nickel Config

Alternative configuration via config.ncl:

{
  splits.ratios = {
    train = 0.70,
    test = 0.15,
    validation = 0.10,
    calibration = 0.05,
  },
  checksums.algorithm = 'SHAKE256,
  pipeline.vae_model = "SDXL VAE",
}

RSR Compliance

This repository follows the Rhodium Standard Repository (RSR) specification:

Documentation

  • README.adoc, LICENSE.txt, SECURITY.md, CODE_OF_CONDUCT.md

  • CONTRIBUTING.adoc with Tri-Perimeter Contribution Framework (TPCF)

  • GOVERNANCE.adoc, MAINTAINERS.md, FUNDING.yml

  • CHANGELOG.md, ROADMAP.md, REVERSIBILITY.md

Security

  • Memory-safe Rust implementation (no unsafe code)

  • FIPS 202 compliant cryptography (SHAKE256)

  • Chainguard Wolfi container base

  • SPDX headers on all source files

  • Supply chain security (pinned dependencies)

Infrastructure

  • Nix flakes for reproducibility

  • Podman containers (never Docker)

  • Nickel/CUE configuration

  • Justfile task runner

Accountability

  • Mutually Assured Accountability (MAA) framework

  • Provenance chains (.well-known/provenance.json)

  • Git history with DCO sign-off

Run just validate for full RSR compliance verification.

License

This software is licensed under the Palimpsest-MPL-1.0 License.

The Palimpsest-MPL extends MPL-2.0 with provisions for:

  • Ethical use and emotional lineage preservation

  • Quantum-safe cryptographic provenance

  • Attribution and transparent modification history

  • Community benefit and cooperative economics

See palimpsest-license for full details.

Contributing

See CONTRIBUTING.adoc for contribution guidelines.

Security

See SECURITY.md for security policy and vulnerability reporting.
