A Library to Quantize and Compress Deep Learning Models for Optimized Inference on Native Windows RTX GPUs
- [2024/11/19] Microsoft and NVIDIA Supercharge AI Development on RTX AI PCs
- [2024/11/18] Quantized INT4 ONNX models available on Hugging Face for download
- Overview
- Installation
- Techniques
- Examples
- Support Matrix
- Benchmark Results
- Collection of Optimized ONNX Models
- Release Notes
Model Optimizer - Windows (ModelOpt-Windows) is engineered to deliver advanced model compression techniques, including quantization, to Windows RTX PC systems. Tailored to the needs of Windows users, ModelOpt-Windows is optimized for rapid and efficient quantization, featuring local GPU calibration, reduced system and video memory consumption, and swift processing times. The primary objective of ModelOpt-Windows is to generate optimized, standards-compliant ONNX-format models. This makes it an ideal solution for seamless integration with ONNX Runtime (ORT) and DirectML (DML) frameworks, ensuring broad compatibility with any inference framework supporting the ONNX standard. Furthermore, ModelOpt-Windows integrates smoothly within the Windows ecosystem, with full support for tools and SDKs such as Olive and ONNX Runtime, enabling deployment of quantized models across various independent hardware vendors (IHVs) through the DML and TensorRT paths.
Model Optimizer is available for free for all developers on NVIDIA PyPI. This repository is for sharing examples and GPU-optimized recipes as well as collecting feedback from the community.
ModelOpt-Windows can be installed either as a standalone toolkit or through Microsoft's Olive.
To install ModelOpt-Windows as a standalone toolkit on CUDA 12.x systems, run the following command:

```bash
pip install nvidia-modelopt[onnx]
```

To install ModelOpt-Windows through Microsoft's Olive, use the following commands:

```bash
pip install olive-ai[nvmo]
pip install "onnxruntime-genai-directml>=0.4.0"
pip install onnxruntime-directml==1.20.0
```

For more details, please refer to the detailed installation instructions.
Quantization is an effective model optimization technique for large models. Quantization with ModelOpt-Windows can compress model size by 2x-4x, speeding up inference while preserving model quality. ModelOpt-Windows enables highly performant quantization formats including INT4, FP8, and INT8, and supports advanced algorithms such as AWQ and SmoothQuant*. It focuses on post-training quantization (PTQ) for ONNX and PyTorch* models with DirectML, CUDA, and TensorRT* inference backends.
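To illustrate where the 2x-4x compression comes from, here is a minimal NumPy sketch of symmetric per-block INT4 quantization. This is a simplification for intuition only: the block size and scaling scheme are assumptions, and ModelOpt-Windows' actual AWQ implementation additionally adapts scales using calibration data.

```python
import numpy as np

def quantize_int4_blockwise(w, block_size=32):
    # Symmetric per-block INT4: each block of `block_size` values shares
    # one FP32 scale, and values are rounded into the range [-8, 7].
    blocks = w.reshape(-1, block_size)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero
    q = np.clip(np.round(blocks / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4_blockwise(q, scale, shape):
    # Reconstruct approximate FP32 weights from codes and per-block scales.
    return (q * scale).astype(np.float32).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 32)).astype(np.float32)
q, scale = quantize_int4_blockwise(w)
w_hat = dequantize_int4_blockwise(q, scale, w.shape)

# 4-bit codes plus one scale per 32 weights is roughly 4.5 bits/weight,
# versus 16 bits/weight for FP16 -- about 3.5x smaller.
max_err = np.abs(w - w_hat).max()
```

The per-value rounding error is bounded by half a quantization step (`scale / 2` within each block), which is why the blockwise scales matter: smaller blocks track local weight magnitudes more tightly at the cost of storing more scales.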
For more details, please refer to the detailed quantization guide.
The ONNX quantization API requires a model, calibration data, and quantization settings such as the algorithm and calibration EPs. Here's an example snippet applying INT4 AWQ quantization:
```python
import os

import onnx
from modelopt.onnx.quantization.int4 import quantize as quantize_int4
# import other packages as needed

calib_inputs = get_calib_inputs(dataset, model_name, cache_dir, calib_size, batch_size, ...)

quantized_onnx_model = quantize_int4(
    onnx_path,
    calibration_method="awq_lite",
    calibration_data_reader=None if use_random_calib else calib_inputs,
    calibration_eps=["dml", "cpu"],
)

onnx.save_model(
    quantized_onnx_model,
    output_path,
    save_as_external_data=True,
    location=os.path.basename(output_path) + "_data",
    size_threshold=0,
)
```

Check `modelopt.onnx.quantization.quantize_int4` for details about the INT4 quantization API.
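The `get_calib_inputs` helper in the snippet above is dataset-specific and not shown here. As a rough stand-in, calibration data for an ORT-executed LLM graph can be sketched as a list of feed dictionaries mapping graph input names to NumPy batches; the input name, shapes, and vocabulary size below are hypothetical, and real calibration should use representative dataset samples rather than random token IDs.

```python
import numpy as np

def make_random_calib_inputs(input_name="input_ids", calib_size=32,
                             batch_size=1, seq_len=128, vocab_size=32000):
    # One feed dict per calibration sample, keyed by ONNX graph input name.
    # Random IDs are a placeholder; representative text data gives the
    # quantizer realistic activation statistics.
    rng = np.random.default_rng(0)
    return [
        {input_name: rng.integers(0, vocab_size,
                                  size=(batch_size, seq_len),
                                  dtype=np.int64)}
        for _ in range(calib_size)
    ]

calib_inputs = make_random_calib_inputs()
```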
Refer to our Support Matrix for details about supported features and models.
To learn more about ONNX PTQ, refer to our docs.
The quantized ONNX model can be deployed using frameworks like ONNX Runtime. Ensure that the model's opset is 19 or higher for FP8 quantization, and 21 or higher for INT4 quantization. This is needed because ONNX's Q/DQ nodes have different opset requirements for INT4 and FP8 data-type support. Refer to Apply Post Training Quantization (PTQ) for details.
```python
# Write steps (say, an upgrade_opset() method) to upgrade or patch the
# model's opset, if needed. The opset upgrade, if needed, can be done on
# either the base ONNX model or the quantized model.
# Finally, save the quantized model.
quantized_onnx_model = upgrade_opset(quantized_onnx_model)

onnx.save_model(
    quantized_onnx_model,
    output_path,
    save_as_external_data=True,
    location=os.path.basename(output_path) + "_data",
    size_threshold=0,
)
```

For detailed instructions about deployment of quantized models with ONNX Runtime, see the ONNX Runtime Deployment Guide.
Note
The ready-to-deploy optimized ONNX models from ModelOpt-Windows are available at HuggingFace NVIDIA collections.
- Examples for Post-Training Quantization (PTQ) of ONNX models:
- PTQ for GenAI LLMs covers how to use ONNX PTQ with ONNX Runtime GenAI built LLM ONNX models, and their deployment with DirectML.
- PTQ for Whisper illustrates using ONNX PTQ with a Whisper ONNX model (i.e. an ASR model). It also provides an example script for Optimum-ORT-based inference of Whisper using the CUDA EP.
- PTQ for SAM2 illustrates using ONNX PTQ with a SAM2 ONNX model (i.e. a segmentation model).
- Examples that demonstrate PTQ of a PyTorch model followed by ONNX export:
- Diffusers example demonstrates how to apply PTQ to diffusion models in PyTorch format and then export the quantized models to ONNX.
- MMLU Benchmark provides an example script for MMLU benchmarking of LLMs, and demonstrates how to run it with popular backends such as DirectML and TensorRT-LLM*, and model formats such as ONNX and PyTorch*.
| Model Type | Support Matrix |
|---|---|
| Large Language Models (LLMs) | View Support Matrix |
| Automatic Speech Recognition | View Support Matrix |
| Segmentation Models | View Support Matrix |
| Diffusion Models | View Support Matrix |
Please refer to benchmark results for performance and accuracy comparisons of popular Large Language Models (LLMs).
The ready-to-deploy optimized ONNX models from ModelOpt-Windows are available at HuggingFace NVIDIA collections. These models can be deployed using DirectML backend. Follow the instructions provided along with the published models for deployment.
Please refer to the changelog.
* Experimental support