
igllama


A Zig-based Ollama alternative for running LLMs locally. Built on top of llama.cpp.zig bindings.

Why igllama? See the design philosophy in docs/MOTIVATION.md for details on why we built this and the technical choices behind it.

Features

  • Interactive Chat - Multi-turn conversations with auto-save and session resume
  • Ollama-like CLI - Familiar commands: pull, run, list, show, rm, chat, import
  • HuggingFace Integration - Download models directly from HuggingFace Hub
  • OpenAI-compatible API - REST server with /v1/chat/completions and /v1/embeddings
  • GGUF Support - Inspect and run GGUF model files
  • GPU Acceleration - Metal, Vulkan, and CUDA backend support
  • Auto-detect Chat Templates - Supports 12+ formats (ChatML, Llama 3, Mistral, etc.)
  • Qwen3.5 Support - Run Qwen3.5-35B-A3B MoE models with GGUF quantization; --no-think flag suppresses chain-of-thought blocks for faster responses
  • Qwen3.5 Small Series - Optimized support for Qwen 3.5 0.8B, 2B, 4B, and 9B models with native multimodality and 262K context
  • Constrained Generation - GBNF grammar support for structured output
  • Pure Zig - No Python or system dependencies required
  • Cross-platform - Windows, Linux, macOS support

Installation

As a CLI Tool

# Clone with submodules
git clone --recursive https://github.com/bkataru/igllama.git
cd igllama

# Build
zig build -Doptimize=ReleaseFast

# Binary located at ./zig-out/bin/igllama

As a Library

Add igllama to your project with zig fetch:

zig fetch --save git+https://github.com/bkataru/igllama.git

This updates your build.zig.zon:

.dependencies = .{
    .igllama = .{
        .url = "git+https://github.com/bkataru/igllama.git",
        .hash = "...",
    },
},

Then in your build.zig:

const igllama = b.dependency("igllama", .{
    .target = target,
    .optimize = optimize,
});
exe.root_module.addImport("llama", igllama.module("llama"));
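
Your application code can then reference the bindings with @import("llama"), matching the import name passed to addImport above.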

Quick Start

# Build igllama
zig build -Doptimize=ReleaseFast

# Download a model from HuggingFace
igllama pull TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF

# List cached models
igllama list

# Run inference
igllama run tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "Hello! Tell me a joke."

# Show model metadata
igllama show tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

# Remove a cached model
igllama rm TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF

CLI Commands

| Command | Description |
|---------|-------------|
| igllama help | Show usage information |
| igllama version | Show version information |
| igllama pull <repo_id> | Download model from HuggingFace |
| igllama list | List all cached models |
| igllama run <model> -p <prompt> | Run inference on a model |
| igllama chat <model> | Interactive multi-turn chat session |
| igllama import <path> | Import local GGUF file to cache |
| igllama api <model> | Start OpenAI-compatible API server |
| igllama show <model.gguf> | Display GGUF file metadata |
| igllama rm <repo_id> | Remove a cached model |
| igllama serve <subcommand> | Manage llama-server lifecycle |

Chat Command

Interactive multi-turn conversations with automatic session management:

# Start a chat session
igllama chat tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

# Use a specific chat template
igllama chat model.gguf --template chatml

# Resume a previous session
igllama chat model.gguf --resume session_name

# Disable auto-save
igllama chat model.gguf --no-save

# Adjust sampling parameters
igllama chat model.gguf --temp 0.8 --top-p 0.9 --top-k 40

# Use grammar for structured output
igllama chat model.gguf --grammar-file json.gbnf
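
For reference, a grammar file like the one above might contain something like the following minimal sketch (llama.cpp GBNF syntax; the single-field JSON schema here is hypothetical, so adapt it to the structure you need):

# Write a tiny JSON-shaped grammar, then constrain a chat session with it
cat > json.gbnf <<'EOF'
root   ::= "{" ws "\"answer\"" ws ":" ws string ws "}"
string ::= "\"" [^"\\]* "\""
ws     ::= [ \t\n]*
EOF
igllama chat model.gguf --grammar-file json.gbnf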

In-chat commands:

| Command | Description |
|---------|-------------|
| /help | Show available commands |
| /quit or /exit | Exit the chat session |
| /clear | Clear conversation history and KV cache |
| /save <name> | Save session to a file |
| /load <name> | Load a saved session |
| /sessions | List all saved sessions |
| /system <text> | Set or update system prompt |
| /tokens | Show token usage statistics |
| /stats | Show generation statistics |
| /template <name> | Switch chat template |

Supported chat templates: ChatML, Llama 2, Llama 3, Mistral, Phi-3, Gemma, Zephyr, Vicuna, Alpaca, DeepSeek, Command-R, and more.

Import Command

Import local GGUF files into the model cache:

# Import with symlink (default, saves disk space)
igllama import /path/to/model.gguf

# Import with copy (creates a full copy)
igllama import /path/to/model.gguf --copy

# Import with a custom alias
igllama import /path/to/model.gguf --alias my-model

API Server

Start an OpenAI-compatible REST API server:

# Start API server on default port (8080)
igllama api tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

# Specify host and port
igllama api model.gguf --host 0.0.0.0 --port 3000

# With GPU acceleration
igllama api model.gguf --gpu-layers -1

# CPU-optimized (tune threads to your hardware)
igllama api model.gguf --threads 8 --threads-batch 16 --mlock --ctx-size 8192
# Suppress Qwen3.5 <think> blocks for faster responses
igllama api model.gguf --no-think

Endpoints:

| Endpoint | Method | Description |
|----------|--------|-------------|
| /v1/chat/completions | POST | Chat completions (streaming supported) |
| /v1/embeddings | POST | Generate embeddings |
| /health | GET | Health check |

Example requests:

# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'

# Streaming chat
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'

# Generate embeddings
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "input": ["Hello world", "How are you?"]
  }'
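
The /health endpoint from the table above can be checked the same way:

# Health check
curl http://localhost:8080/health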

Serve Subcommands

| Subcommand | Description |
|------------|-------------|
| serve start -m <model> | Start llama-server with a model |
| serve stop | Stop the running server |
| serve status | Show server status |
| serve logs | View server logs |
| serve logs --follow | Tail server logs continuously |
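
Put together, a typical server lifecycle using these subcommands might look like this (a sketch; the model file name is a placeholder):

# Start llama-server with a model, inspect it, then shut it down
igllama serve start -m model.gguf
igllama serve status
igllama serve logs --follow
igllama serve stop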

Configuration

Models are cached in:

  • Custom: Set IGLLAMA_HOME environment variable
  • Default: ~/.cache/igllama (Linux/macOS) or %LOCALAPPDATA%\igllama (Windows)
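
For example, to point the model cache at a different location (a sketch; the directory is hypothetical):

# Relocate the cache for the current shell, then pull as usual
export IGLLAMA_HOME=/mnt/data/igllama
igllama pull TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF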

Chat sessions are auto-saved to:

  • ~/.cache/huggingface/sessions/ (Linux/macOS)
  • %LOCALAPPDATA%\huggingface\sessions\ (Windows)

Building

Requires Zig 0.15.x or later.

# Debug build
zig build

# Release build (optimized)
zig build -Doptimize=ReleaseFast

# Run tests
zig build test

Build Options

| Option | Description |
|--------|-------------|
| -Doptimize=ReleaseFast | Optimized release build |
| -Dserver=true | Build llama-server HTTP API server |
| -Dmetal=true | Enable Metal GPU backend (macOS) |
| -Dvulkan=true | Enable Vulkan GPU backend |
| -Dcuda=true | Enable CUDA GPU backend (experimental) |
| -Dmetal_bf16=true | Use BF16 for Metal (M2+ recommended) |
| -Dcpp_samples | Include llama.cpp C++ samples |
| -Dllama_ref=<ref> | Use specific llama.cpp version (branch/tag/commit) |
| -Dllama_url=<url> | Custom llama.cpp git URL |
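
For example, to produce an optimized build that also includes the llama-server HTTP API server (combining flags from the table above):

zig build -Doptimize=ReleaseFast -Dserver=true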

Custom llama.cpp Version

You can build with a specific version of llama.cpp:

# Build with a specific tag
zig build -Dllama_ref=b4567 -Doptimize=ReleaseFast

# Build from a fork
zig build -Dllama_url=https://github.com/user/llama.cpp -Dllama_ref=main

Development

Running Examples

# Run simple example with a local model
zig build run-simple -Doptimize=ReleaseFast -- --model_path path/to/model.gguf --prompt "Hello!"

# Run C++ samples (if enabled)
zig build run-cpp-main -Doptimize=ReleaseFast -- -m path/to/model.gguf -p "Hello!"

Project Structure

igllama/
├── src/
│   ├── main.zig           # CLI entry point
│   ├── config.zig         # Configuration/paths
│   ├── history.zig        # Session auto-save/resume
│   └── commands/          # CLI command implementations
│       ├── help.zig
│       ├── pull.zig
│       ├── list.zig
│       ├── run.zig
│       ├── chat.zig       # Interactive chat with KV cache
│       ├── api.zig        # OpenAI-compatible API server
│       ├── import.zig     # Import local GGUF files
│       ├── show.zig
│       ├── rm.zig
│       └── serve.zig      # Server lifecycle management
├── docs/
│   └── MOTIVATION.md      # Design philosophy
├── llama.cpp.zig/         # llama.cpp Zig bindings (submodule)
├── llama.cpp/             # llama.cpp source (submodule)
├── examples/              # Example code
│   └── simple.zig         # Basic inference example
├── tools/                 # Build tools
│   └── generate_asset.zig # Asset header generator for server
└── .github/               # CI/CD workflows

Tested Platforms

  • x86_64 Windows
  • x86_64 Linux (Ubuntu 22+)
  • x86_64 macOS
  • aarch64 macOS (Apple Silicon)

Backend Support

| Backend | Status | Notes |
|---------|--------|-------|
| CPU | Supported | Default backend, no additional dependencies |
| Metal | Supported | macOS with Xcode required |
| Vulkan | Experimental | Requires Vulkan SDK |
| CUDA | Experimental | Requires NVIDIA CUDA toolkit |

GPU Acceleration

Enable GPU backends at build time:

# macOS with Metal (Apple Silicon / Intel with AMD GPU)
zig build -Doptimize=ReleaseFast -Dmetal=true

# Metal with BF16 support (Apple Silicon M2+)
zig build -Doptimize=ReleaseFast -Dmetal=true -Dmetal_bf16=true

# Vulkan (requires Vulkan SDK + glslc in PATH)
zig build -Doptimize=ReleaseFast -Dvulkan=true

# CUDA (requires nvcc, experimental)
zig build -Doptimize=ReleaseFast -Dcuda=true

At runtime, use --gpu-layers to offload layers to GPU:

# Offload 35 layers to GPU
igllama run model.gguf -p "Hello" --gpu-layers 35

# Offload all layers to GPU (-1)
igllama run model.gguf -p "Hello" --gpu-layers -1

GPU Backend Requirements

| Backend | Requirements |
|---------|--------------|
| Metal | macOS 11+, Xcode Command Line Tools |
| Vulkan | Vulkan SDK, glslc compiler in PATH |
| CUDA | NVIDIA GPU, CUDA Toolkit 11.0+ |
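
Before building, you can confirm these toolchains are visible with their standard vendor commands (not igllama-specific):

# Metal: Xcode Command Line Tools installed
xcode-select -p

# Vulkan: glslc shader compiler on PATH
glslc --version

# CUDA: NVIDIA compiler available
nvcc --version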

Performance Benchmarks

Qwen3.5 on CPU-only Systems

Benchmarks were run on Linux with an AMD EPYC-Rome CPU (16 cores @ 2.0 GHz, 8 memory channels), 30 GB RAM, and no GPU:

| Model | GGUF Quant | Speed | Notes |
|-------|------------|-------|-------|
| Qwen3.5-0.8B | UD-Q4_K_XL | 23.01 tok/s | Ultra-fast edge AI |
| Qwen3.5-2B | UD-Q4_K_XL | 18.38 tok/s | Fast reasoning |
| Qwen3.5-4B | UD-Q4_K_XL | 8.48 tok/s | Sweet spot for agentic tasks |
| Qwen3.5-9B | UD-Q4_K_XL | 6.45 tok/s | High-end reasoning |
| Qwen3.5-35B-A3B | UD-Q4_K_XL | 5.56 tok/s | Best MoE choice for large knowledge |

  • Optimal launch: igllama api model.gguf --threads 8 --threads-batch 16 --mlock --ctx-size 8192 --no-think

Roadmap

  • Qwen3.5-35B-A3B support (v0.3.2)
  • serve command - Run llama-server for API access
  • Custom llama.cpp version selection (-Dllama_ref)
  • GPU backend support (Metal, Vulkan, CUDA)
  • Chat templates and conversation history
  • Interactive chat with incremental KV cache
  • OpenAI-compatible API server (/v1/chat/completions, /v1/embeddings)
  • Session auto-save and resume
  • Import local GGUF files
  • Model quantization tools
  • LoRA adapter support
  • Batch inference mode

License

MIT License - See LICENSE for details.


Last updated: March 2026 - v0.3.11 release
