🤗 Hugging Face | 🤖 ModelScope | 📑 Blog | 📖 Documentation
🖥️ Demo | 💬 WeChat (微信) | 🫨 Discord
Visit our Hugging Face or ModelScope organization (click the links above) and search for checkpoints with names starting with Qwen2-, or visit the Qwen2 collection, and you will find all you need! Enjoy!
To learn more about Qwen2, feel free to read our documentation [EN|ZH]. Our documentation consists of the following sections:
- Quickstart: the basic usages and demonstrations;
- Inference: the guidance for the inference with transformers, including batch inference, streaming, etc.;
- Run Locally: the instructions for running LLMs locally on CPU and GPU, with frameworks like llama.cpp and Ollama;
- Deployment: the demonstration of how to deploy Qwen for large-scale inference with frameworks like vLLM, TGI, etc.;
- Quantization: the practice of quantizing LLMs with GPTQ and AWQ, as well as the guidance for how to make high-quality quantized GGUF files;
- Training: the instructions for post-training, including SFT and RLHF (TODO) with frameworks like Axolotl, LLaMA-Factory, etc.;
- Framework: the usage of Qwen with frameworks for application, e.g., RAG, Agent, etc.;
- Benchmark: the statistics about inference speed and memory footprint.
After months of effort, we are pleased to announce the evolution from Qwen1.5 to Qwen2. This time, we bring to you:
- Pretrained and instruction-tuned models of 5 sizes, including Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B;
- Training on data in 27 additional languages besides English and Chinese;
- State-of-the-art performance in a large number of benchmark evaluations;
- Significantly improved performance in coding and mathematics;
- Extended context length support up to 128K tokens with Qwen2-7B-Instruct and Qwen2-72B-Instruct.
- 2024.06.06: We released the Qwen2 series. Check our blog!
- 2024.03.28: We released the first MoE model of Qwen: Qwen1.5-MoE-A2.7B! Temporarily, only HF transformers and vLLM support the model. We will soon add the support of llama.cpp, mlx-lm, etc. Check our blog for more information!
- 2024.02.05: We released the Qwen1.5 series.
Detailed evaluation results are reported in this 📑 blog.
transformers>=4.40.0 for Qwen2 dense and MoE models. The latest version is recommended.
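One way to confirm that an installed version meets this requirement is sketched below. The helper names (`parse_version`, `meets_minimum`, `check_transformers`) are hypothetical and not part of Qwen2 or transformers, and the parsing only handles plain dotted numeric versions, ignoring a trailing suffix such as `.dev0`.

```python
from importlib import metadata


def parse_version(version: str) -> tuple:
    """Turn '4.41.2' into (4, 41, 2); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # ignore suffixes such as 'dev0'
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "4.40.0") -> bool:
    """Return True if `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


def check_transformers() -> bool:
    """Compare the installed transformers version, if any, to the minimum."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return False  # transformers is not installed at all
    return meets_minimum(installed)
```

Calling `check_transformers()` before loading a model gives an early, readable failure instead of an import error deep inside the library.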
For requirements on GPU memory and the respective throughput, see results here.
The following code snippet shows how to use the chat model with transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-7B-Instruct"

# device_map="auto" places the weights on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
# Render the conversation into a single prompt string with the chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# Move the inputs to the device the model was loaded onto
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Keep only the newly generated tokens by dropping the prompt tokens
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
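The list comprehension near the end of the snippet removes the prompt tokens from each generated sequence, because `generate` returns the prompt followed by the new tokens. The idea can be seen with plain Python lists; the helper name `strip_prompt` and the token IDs below are made up for illustration.

```python
def strip_prompt(input_ids_batch, output_ids_batch):
    """Drop the leading prompt tokens from each generated sequence."""
    return [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(input_ids_batch, output_ids_batch)
    ]


# Made-up token IDs: each output repeats its prompt, then adds new tokens.
prompts = [[101, 7, 8], [101, 9]]
outputs = [[101, 7, 8, 42, 43], [101, 9, 44]]

new_tokens = strip_prompt(prompts, outputs)
print(new_tokens)  # [[42, 43], [44]]
```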
For quantized models, we advise you to use the GPTQ and AWQ counterparts, namely Qwen2-7B-Instruct-GPTQ-Int8 and Qwen2-7B-Instruct-AWQ.
We strongly advise users, especially those in mainland China, to use ModelScope. snapshot_download can help you solve issues concerning downloading checkpoints.
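A minimal sketch of downloading a checkpoint this way is shown below. It assumes the `modelscope` package is installed and that the checkpoint is published under the repository ID `qwen/Qwen2-7B-Instruct` on ModelScope (an assumption; check the organization page for the exact name). The wrapper `download_checkpoint` is a hypothetical helper.

```python
# Assumed ModelScope repository ID; verify it on the ModelScope site.
DEFAULT_MODEL_ID = "qwen/Qwen2-7B-Instruct"


def download_checkpoint(model_id: str = DEFAULT_MODEL_ID) -> str:
    """Download a checkpoint from ModelScope and return its local directory."""
    # Imported lazily so this module loads even without modelscope installed.
    from modelscope import snapshot_download

    return snapshot_download(model_id)
```

The returned directory can then be passed to `from_pretrained` in place of the Hub model name.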
After installing ollama, you can initiate the ollama service with the following command:
ollama serve
# You need to keep this service running whenever you are using ollama
To pull a model checkpoint and run the model, use the ollama run command. You can specify a model size by adding a suffix to qwen2, such as :0.5b, :1.5b, :7b, or :72b:
ollama run qwen2:7b
# To exit, type "/bye" and press ENTER
You can also access the ollama service via its OpenAI-compatible API. Please note that you need to (1) keep ollama serve running while using the API, and (2) execute ollama run qwen2:7b before utilizing this API to ensure that the model checkpoint is prepared.
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='ollama',  # required but ignored
)
chat_completion = client.chat.completions.create(
    messages=[
        {
            'role': 'user',
            'content': 'Say this is a test',
        }
    ],
    model='qwen2:7b',
)
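If you prefer not to depend on the openai package, the same endpoint can be called with the Python standard library. This is a minimal sketch: the helper names are hypothetical, and it assumes the default address shown above and the usual OpenAI chat-completions response shape.

```python
import json
import urllib.request

# Default ollama address plus the chat-completions path
OLLAMA_CHAT_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(prompt: str, model: str = "qwen2:7b") -> urllib.request.Request:
    """Build (but do not send) a chat-completions POST request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request) -> str:
    """Send the request and return the reply text; needs `ollama serve` running."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For example, `send_chat_request(build_chat_request("Say this is a test"))` returns the assistant's reply as a string once the service is up.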
If you have encountered problems related to quantized models on GPU, please try one of the following:
- Upgrading ollama to at least 0.1.42.
- Enabling the flash attention implementation in the llama.cpp backend (for ollama over 0.1.39):
  OLLAMA_FLASH_ATTENTION=1 ollama serve
  or following the instructions at ollama faq to configure the environment variables in ollama service.
- Disabling running on GPU in the ollama app:
  >>> /set parameter num_gpu 0
  Set parameter 'num_gpu' to '0'
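When the service is started from a Python script rather than a shell, the flash attention environment variable from the second option can be set on the child process. The sketch below is one way to do that; the helper names are hypothetical, and it assumes the `ollama` executable is on your PATH.

```python
import os
import subprocess


def flash_attention_env(base_env=None) -> dict:
    """Return a copy of the environment with flash attention enabled for ollama."""
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_FLASH_ATTENTION"] = "1"
    return env


def start_ollama_server() -> subprocess.Popen:
    """Start `ollama serve` in the background with flash attention enabled."""
    return subprocess.Popen(["ollama", "serve"], env=flash_attention_env())
```

Copying the environment rather than mutating `os.environ` keeps the setting local to the ollama process.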
For additional details, please visit ollama.ai.
Download our provided GGUF files or create them yourself, and you can use them directly with the latest llama.cpp with a one-line command:
./main -m <path-to-file> -n 512 --color -i -cml -f prompts/chat-with-qwen.txt
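The -cml flag runs llama.cpp in ChatML mode, which is the chat format Qwen2 chat models use. As a rough illustration of that format (an approximation written for this README, not the tokenizer's actual template code; `render_chatml` is a hypothetical helper), a list of messages is rendered like this:

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render chat messages as ChatML text."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(render_chatml(messages))
```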
If you have encountered problems related to quantized models on GPU, please try passing the -fa argument to enable the flash attention implementation in the newest version of llama.cpp.
If you are running on Apple Silicon, we also provide checkpoints compatible with mlx-lm. Look for models ending with MLX on Hugging Face Hub, like Qwen2-7B-Instruct-MLX.
Qwen2 is already supported by lmstudio.ai. You can use LMStudio directly with our GGUF files.
Qwen2 is already supported by the OpenVINO toolkit. You can install and run this chatbot example with an Intel CPU, integrated GPU, or discrete GPU.
You can directly use text-generation-webui for creating a web UI demo. If you use GGUF, remember to install the latest wheel of llama.cpp with the support of Qwen2.
Clone llamafile, run source install, and then create your own llamafile with the GGUF file following the guide here. You can then run a single command, say ./qwen.llamafile, to create a demo.
Qwen2 is supported by multiple inference frameworks. Here we demonstrate the usage of vLLM and SGLang.
We advise you to use vLLM>=0.4.0 to build an OpenAI-compatible API service. Start the server with a chat model, e.g., Qwen2-7B-Instruct:
python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-7B-Instruct --model Qwen/Qwen2-7B-Instruct
Then use the chat API as demonstrated below:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen2-7B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me something about large language models."}
]
}'
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
    model="Qwen2-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ]
)
print("Chat response:", chat_response)
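Printing chat_response shows the whole response object. The reply text sits in the first choice (with the openai client, `chat_response.choices[0].message.content`), and a follow-up turn is made by appending the reply and the next user message to the history. The sketch below works on a plain dictionary shaped like the JSON returned by the curl command; the helper names and the sample reply are made up for illustration.

```python
def extract_reply(response: dict) -> str:
    """Return the assistant text from a chat-completions response."""
    return response["choices"][0]["message"]["content"]


def extend_history(messages, reply, next_user_message):
    """Return a new message list with the reply and a follow-up question."""
    return messages + [
        {"role": "assistant", "content": reply},
        {"role": "user", "content": next_user_message},
    ]


# Made-up response with the same shape as the server's JSON output
sample_response = {
    "choices": [
        {"message": {"role": "assistant", "content": "LLMs predict text."}}
    ]
}
history = [{"role": "user", "content": "Tell me something about large language models."}]

reply = extract_reply(sample_response)
history = extend_history(history, reply, "Can you give an example?")
print(len(history))  # 3
```

Sending `history` as the `messages` argument of the next request continues the conversation.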
Please install SGLang from source. Similar to vLLM, you need to launch a server and use the OpenAI-compatible API service. Start the server first:
python -m sglang.launch_server --model-path Qwen/Qwen2-7B-Instruct --port 30000
You can use it in Python as shown below:
from sglang import function, system, user, assistant, gen, set_default_backend, RuntimeEndpoint


@function
def multi_turn_question(s, question_1, question_2):
    # Build a two-turn conversation; each gen() call stores its answer by name
    s += system("You are a helpful assistant.")
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=256))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=256))


# Point SGLang at the server launched above
set_default_backend(RuntimeEndpoint("http://localhost:30000"))

state = multi_turn_question.run(
    question_1="What is the capital of China?",
    question_2="List two local attractions.",
)

for m in state.messages():
    print(m["role"], ":", m["content"])

print(state["answer_1"])
We advise you to use training frameworks, including Axolotl, LLaMA-Factory, Swift, etc., to fine-tune your models with SFT, DPO, PPO, etc.
To simplify the deployment process, we provide docker images with pre-built environments: qwenllm/qwen. You only need to install the driver and download model files to launch demos and finetune the model.
docker run --gpus all --ipc=host --network=host --rm --name qwen2 -it qwenllm/qwen:2-cu121 bash
Check the license of each model inside its HF repo. It is NOT necessary for you to submit a request for commercial usage.
If you find our work helpful, feel free to cite us.
@article{qwen,
title={Qwen Technical Report},
author={Jinze Bai and Shuai Bai and Yunfei Chu and Zeyu Cui and Kai Dang and Xiaodong Deng and Yang Fan and Wenbin Ge and Yu Han and Fei Huang and Binyuan Hui and Luo Ji and Mei Li and Junyang Lin and Runji Lin and Dayiheng Liu and Gao Liu and Chengqiang Lu and Keming Lu and Jianxin Ma and Rui Men and Xingzhang Ren and Xuancheng Ren and Chuanqi Tan and Sinan Tan and Jianhong Tu and Peng Wang and Shijie Wang and Wei Wang and Shengguang Wu and Benfeng Xu and Jin Xu and An Yang and Hao Yang and Jian Yang and Shusheng Yang and Yang Yao and Bowen Yu and Hongyi Yuan and Zheng Yuan and Jianwei Zhang and Xingxuan Zhang and Yichang Zhang and Zhenru Zhang and Chang Zhou and Jingren Zhou and Xiaohuan Zhou and Tianhang Zhu},
journal={arXiv preprint arXiv:2309.16609},
year={2023}
}
If you would like to leave a message for either our research team or product team, join our Discord or WeChat groups!