A high-throughput and memory-efficient inference and serving engine for LLMs
Port of OpenAI's Whisper model in C/C++
Making large AI models cheaper, faster and more accessible
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Cross-platform, customizable ML solutions for live and streaming media.
ncnn is a high-performance neural network inference framework optimized for the mobile platform
SGLang is a fast serving framework for large language models and vision language models.
Faster Whisper transcription with CTranslate2
Machine Learning Engineering Open Book
🎨 The exhaustive Pattern Matching library for TypeScript, with smart type inference.
NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
Large Language Model Text Generation Inference
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
OpenVINO™ is an open source toolkit for optimizing and deploying AI inference
Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source, speech, and multimodal models on cloud, on-prem, or your laptop — all through one unified, production-ready inference API.
Easily fine-tune, evaluate and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!
Hello AI World guide to deploying deep-learning inference networks and deep vision primitives with TensorRT and NVIDIA Jetson.
💎 1MB lightweight face detection model