hyperpolymath/zerostep
VAE Dataset Normalizer


Normalize VAE-decoded image datasets for training AI artifact detection models with formal verification guarantees and RSR (Rhodium Standard Repository) compliance.

Overview

The VAE Dataset Normalizer prepares VAE-decoded image datasets for machine learning research. It provides cryptographically verified data splits, formal proofs of correctness, and contrastive learning models to detect VAE artifacts.

This project is fully compliant with the Rhodium Standard Repository (RSR) specification, ensuring reproducibility, security, and accountability.

Features

  • SHAKE256 (d=256) cryptographic checksums for data integrity (FIPS 202)

  • Train/Test/Val/Calibration splits (70/15/10/5) - both random and stratified

  • Dublin Core metadata via CUE configuration language

  • Nickel schema for flexible configuration

  • Isabelle/HOL proofs for split property verification

  • Julia/Flux.jl utilities for training integration

  • Contrastive learning model for VAE artifact detection

  • Diff-based compression reducing storage by ~50%
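The 70/15/10/5 split listed above can be sketched as a seeded shuffle followed by ratio slicing. This is a minimal illustration in Python, not the tool's actual Rust implementation; the function name and signature are hypothetical.

```python
import random

def split_dataset(stems, seed=12345, ratios=(0.70, 0.15, 0.10, 0.05)):
    """Deterministically partition filename stems into
    train/test/val/calibration splits (illustrative sketch only)."""
    stems = sorted(stems)       # fixed order before shuffling
    rng = random.Random(seed)   # seeding makes the split reproducible
    rng.shuffle(stems)
    n = len(stems)
    n_train = round(n * ratios[0])
    n_test = round(n * ratios[1])
    n_val = round(n * ratios[2])
    return {
        "train": stems[:n_train],
        "test": stems[n_train:n_train + n_test],
        "val": stems[n_train + n_test:n_train + n_test + n_val],
        "calibration": stems[n_train + n_test + n_val:],  # remainder
    }
```

A stratified variant would apply the same slicing within each stratum rather than over the whole list.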

Installation

Prerequisites

Required
  • Rust 1.70+ with Cargo

Optional
  • CUE - metadata validation

  • Nickel - configuration

  • Julia - training utilities

  • Isabelle - formal proofs

  • just - modern task runner

  • Podman - container runtime (never Docker)

Build

# Using Just (recommended)
just build

# Using Nix
nix build

# Direct Cargo
cargo build --release

Usage

Normalize a Dataset

# Full normalization with checksums
vae-normalizer normalize -d /path/to/dataset -o /path/to/output

# Fast mode (skip checksums)
vae-normalizer normalize -d /path/to/dataset -o /path/to/output --skip-checksums

# Custom seed and strata
vae-normalizer normalize -d /path/to/dataset -o /path/to/output --seed 12345 --strata 8

Verify Output

# Basic verification
vae-normalizer verify -o /path/to/output

# With checksum verification
vae-normalizer verify -o /path/to/output --checksums -d /path/to/dataset

Show Statistics

# Text format
vae-normalizer stats -o /path/to/output

# JSON format
vae-normalizer stats -o /path/to/output --format json

Compute File Hash

vae-normalizer hash /path/to/image.png

Compress Dataset (Diff Format)

Reduce storage by ~50% by storing VAE images as diffs from originals:

# Compress dataset (creates Original/ + Diff/ structure)
vae-normalizer compress -d /path/to/dataset -o /path/to/compressed

# Decompress (reconstruct full VAE/ directory)
vae-normalizer decompress -d /path/to/compressed -o /path/to/reconstructed-vae

# Reconstruct single image
vae-normalizer reconstruct /path/to/original.png -d /path/to/diff.png -o output.png

The diff encoding is: diff = VAE - Original + 128 (the +128 offset maps signed differences into the unsigned byte range).
Reconstruction: VAE = Original + Diff - 128.
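Per pixel, the encoding can be sketched as below. Clamping to [0, 255] is an assumption of this sketch; the tool may handle out-of-range differences differently.

```python
def encode_diff(original, vae):
    """diff = VAE - Original + 128, clamped to the 8-bit range (assumed)."""
    return [max(0, min(255, v - o + 128)) for o, v in zip(original, vae)]

def decode_diff(original, diff):
    """VAE = Original + Diff - 128, clamped to the 8-bit range."""
    return [max(0, min(255, o + d - 128)) for o, d in zip(original, diff)]
```

When VAE and original pixels are close, diff values cluster near 128, and the round trip is lossless.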

Train Contrastive Model

# Setup Julia dependencies
just julia-setup

# Train VAE artifact detector
just train /path/to/dataset output 50 32

# Train on compressed dataset
just train-compressed /path/to/compressed output 50

# Evaluate model
just evaluate output/vae_classifier.bson /path/to/dataset

Output Structure

output/
├── splits/
│   ├── random_train.txt
│   ├── random_test.txt
│   ├── random_val.txt
│   ├── random_calibration.txt
│   ├── stratified_train.txt
│   ├── stratified_test.txt
│   ├── stratified_val.txt
│   └── stratified_calibration.txt
├── manifest.csv
└── metadata.cue

Dataset Structure

Standard Format

Expected input:

dataset/
├── Original/
│   ├── image001.png
│   ├── image002.png
│   └── ...
└── VAE/
    ├── image001.png
    ├── image002.png
    └── ...

Image files are paired by filename stem (e.g., Original/foo.png pairs with VAE/foo.png).

Compressed Diff Format

After running compress, the dataset uses diff encoding:

compressed/
├── Original/
│   ├── image001.png
│   └── ...
└── Diff/
    ├── image001.png  # Encodes (VAE - Original + 128)
    └── ...

Diff images compress well (~50% smaller) because most pixels cluster near 128 (gray) when VAE and original are similar.

Formal Verification

The Isabelle/HOL theory VAEDataset_Splits.thy proves:

  1. Disjointness: No overlap between train/test/val/calibration sets

  2. Exhaustiveness: Every image appears in exactly one split

  3. Ratio correctness: Split sizes within 1% of targets

  4. Bijection: 1:1 correspondence between original and VAE images

To verify proofs:

isabelle build -d . -b VAEDataset_Splits
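The same three split properties can also be sanity-checked at runtime against the emitted split files. A hedged sketch (the data layout is assumed from the Output Structure section; this mirrors, but does not replace, the Isabelle proofs):

```python
def check_splits(splits, all_stems, ratios, tolerance=0.01):
    """Assert disjointness, exhaustiveness, and ratio correctness
    over a mapping of split name -> list of filename stems."""
    sets = {name: set(stems) for name, stems in splits.items()}
    # Disjointness: no stem appears in two splits.
    names = list(sets)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            assert not (sets[a] & sets[b]), f"{a} and {b} overlap"
    # Exhaustiveness: every image appears in exactly one split.
    union = set().union(*sets.values())
    assert union == set(all_stems), "some images are missing from the splits"
    # Ratio correctness: each split within `tolerance` of its target.
    n = len(all_stems)
    for name, target in ratios.items():
        assert abs(len(sets[name]) / n - target) <= tolerance, name
    return True
```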

Julia Integration

Standard Dataset

include("julia_utils.jl")
using .VAEDatasetUtils

# Load training split
dataset = VAEDetectorDataset(
    "output/splits/random_train.txt",
    "output/manifest.csv",
    "/path/to/dataset"
)

# Create data loader
loader = DataLoader(dataset, batch_size=32, shuffle=true)

# Train with Flux.jl
for (x, y) in loader
    # x: batch of images
    # y: labels (0=original, 1=VAE)
end

Compressed Dataset

include("julia_utils.jl")
using .VAEDatasetUtils

# Load compressed dataset (VAE images reconstructed on-the-fly)
dataset = CompressedVAEDataset(
    "output/splits/random_train.txt",
    "/path/to/compressed/Original",
    "/path/to/compressed/Diff"
)

# Same API as standard dataset
loader = DataLoader(dataset, batch_size=32, shuffle=true)

for (x, y) in loader
    # VAE images are reconstructed automatically from diffs
end

Configuration

CUE Schema

Metadata is validated against the Dublin Core schema via metadata_schema.cue:

dublin_core: {
    title:       "VAEDecodedImages-SDXL"
    creator:     "Joshua Jewell"
    type:        "Dataset"
    format:      "image/png"
}

Nickel Config

Alternative configuration via config.ncl:

{
  splits.ratios = {
    train = 0.70,
    test = 0.15,
    validation = 0.10,
    calibration = 0.05,
  },
  checksums.algorithm = 'SHAKE256,
  pipeline.vae_model = "SDXL VAE",
}

RSR Compliance

This repository follows the Rhodium Standard Repository (RSR) specification:

Documentation

  • README.adoc, LICENSE.txt, SECURITY.md, CODE_OF_CONDUCT.md

  • CONTRIBUTING.adoc with Tri-Perimeter Contribution Framework (TPCF)

  • GOVERNANCE.adoc, MAINTAINERS.md, FUNDING.yml

  • CHANGELOG.md, ROADMAP.md, REVERSIBILITY.md

Security

  • Memory-safe Rust implementation (no unsafe code)

  • FIPS 202 compliant cryptography (SHAKE256)

  • Chainguard Wolfi container base

  • SPDX headers on all source files

  • Supply chain security (pinned dependencies)

Infrastructure

  • Nix flakes for reproducibility

  • Podman containers (never Docker)

  • Nickel/CUE configuration

  • Justfile task runner

Accountability

  • Mutually Assured Accountability (MAA) framework

  • Provenance chains (.well-known/provenance.json)

  • Git history with DCO sign-off

Run just validate for full RSR compliance verification.

License

This software is licensed under the Palimpsest-MPL-1.0 License.

The Palimpsest-MPL extends MPL-2.0 with provisions for:

  • Ethical use and emotional lineage preservation

  • Quantum-safe cryptographic provenance

  • Attribution and transparent modification history

  • Community benefit and cooperative economics

See palimpsest-license for full details.

Contributing

See CONTRIBUTING.adoc for contribution guidelines.

Security

See SECURITY.md for security policy and vulnerability reporting.
