From 7ba5211abe4b39b5c5d103cc45efd185701da7c8 Mon Sep 17 00:00:00 2001
From: vansin
Date: Sat, 25 May 2024 18:00:12 +0800
Subject: [PATCH 001/754] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 197ca47fd..4f50eccc5 100644
--- a/README.md
+++ b/README.md
@@ -1 +1 @@
-# tutorial
\ No newline at end of file
+# Tutorial

From c6f1c3a48878934819a370a6b0a1121bd3c6ad11 Mon Sep 17 00:00:00 2001
From: vansin
Date: Sat, 25 May 2024 18:08:26 +0800
Subject: [PATCH 002/754] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 4f50eccc5..42e25a86b 100644
--- a/README.md
+++ b/README.md
@@ -1 +1 @@
-# Tutorial
+# 书生·浦语 (InternLM) Hands-On Camp (Level-Clearing Challenge)

From 4bc4943b07502a02279c04ca48f37e064eab1afd Mon Sep 17 00:00:00 2001
From: vansin
Date: Sat, 25 May 2024 18:16:43 +0800
Subject: [PATCH 003/754] Update README.md

---
 README.md | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/README.md b/README.md
index 42e25a86b..d140121be 100644
--- a/README.md
+++ b/README.md
@@ -1 +1,41 @@
 # 书生·浦语 (InternLM) Hands-On Camp (Level-Clearing Challenge)
+
+
+
+
+## Levels
+
+### Introductory Levels
+
+||Level Name|Owner|Materials|
+|:-----|:----|:----|:-----|
+|Level 1| Linux Basics ||Docs, Video, Task|
+|Level 2| Python Basics | | Docs, Video, Task |
+|Level 3| Git Basics||Docs, Video, Task|
+|Level 4| PyTorch Basics|| Docs, Video, Task |
+
+
+### Basic Levels
+
+
+||Level Name|Owner|Materials|
+|:-----|:----|:----|:-----|
+|Level 1| Understanding the 书生·浦语 (InternLM) Full-Chain Open-Source System ||Docs, Video, Task|
+|Level 2| Transformer Architecture Basics | | Docs, Video, Task |
+|Level 3| InternLM Prompt Engineering Practice ||Docs, Video, Task|
+|Level 4| InternLM-1.8B Deployment Practice || Docs, Video, Task |
+|Level 5| InternLM + LlamaIndex RAG Practice || Docs, Video, Task |
+|Level 6| Fine-Tuning a Personal Assistant's Self-Cognition with XTuner || Docs, Video, Task |
+|Level 7| Capability Evaluation of the InternLM2-Chat-1.8B Model || Docs, Video, Task |
+
+
+
+### Advanced Levels
+
+||Level Name|Owner|Materials|
+|:-----|:----|:----|:-----|
+|Level 1| Finding Flaws in 书生·浦语 InternLM2-Chat-20B ||Docs, Video, Task|
+|Level 2| Customizing Your Own Agent with Lagent | | Docs, Video, Task |
+|Level 3| Deploying InternVL and 浦语灵笔 (InternLM-XComposer) with LMDeploy ||Docs, Video, Task|
+|Level 4| Fine-Tuning Your Multimodal Model with XTuner || Docs, Video, Task |
+|Level 5| 茴香豆 (HuixiangDou): An Enterprise-Grade Knowledge-Base Q&A Tool || Docs, Video, Task |

From 3690923870afa276e12fb76e43626571fc295b2d Mon Sep 17 00:00:00 2001
From: vansin
Date: Sat, 1 Jun 2024 18:02:33 +0800
Subject: [PATCH 004/754] Create .gitkeep

---
 docs/.gitkeep | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 docs/.gitkeep

diff --git a/docs/.gitkeep b/docs/.gitkeep
new file mode 100644
index 000000000..8b1378917
--- /dev/null
+++ b/docs/.gitkeep
@@ -0,0 +1 @@
+

From 9d1db09506fdaae86a1c81b06a1c119088defc8d Mon Sep 17 00:00:00 2001
From: vansin
Date: Sat, 1 Jun 2024 18:03:21 +0800
Subject: [PATCH 005/754] Create .gitkeep

---
 tools/.gitkeep | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 tools/.gitkeep

diff --git a/tools/.gitkeep b/tools/.gitkeep
new file mode 100644
index 000000000..8b1378917
--- /dev/null
+++ b/tools/.gitkeep
@@ -0,0 +1 @@
+

From a10f40e533269486b6895eb0de7c67193cc1e41e Mon Sep 17 00:00:00 2001
From: vansin
Date: Sat, 1 Jun 2024 18:11:51 +0800
Subject: [PATCH 006/754] Create .gitkeep

---
 configs/.gitkeep | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 configs/.gitkeep

diff --git a/configs/.gitkeep b/configs/.gitkeep
new file mode 100644
index 000000000..8b1378917
--- /dev/null
+++ b/configs/.gitkeep
@@ -0,0 +1 @@
+

From 822c4616f1f10b1c7826c449bdc42da4049e61e1 Mon Sep 17 00:00:00 2001
From: vansin
Date: Sat, 1 Jun 2024 18:13:52 +0800
Subject: [PATCH 007/754] Create .gitkeep

---
 data/.gitkeep | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 data/.gitkeep

diff --git a/data/.gitkeep b/data/.gitkeep
new file mode 100644
index 000000000..8b1378917
--- /dev/null
+++ b/data/.gitkeep
@@ -0,0 +1 @@
+

From a6dfea7a733faaa5915164eda89520ec6e1939fd Mon Sep 17 00:00:00 2001
From: vansin
Date: Mon, 3 Jun 2024 13:01:45 +0800
Subject: [PATCH 008/754] update

---
 configs/camp3/.gitkeep | 0
 configs/wiki/.gitkeep  | 0
 data/camp3/.gitkeep    | 0
 data/wiki/.gitkeep     | 0
 docs/camp3/L0/.gitkeep | 0
 docs/camp3/L1/.gitkeep | 0
 docs/camp3/L2/.gitkeep | 0
 tools/camp3/.gitkeep   | 0
 tools/wiki/.gitkeep    | 0
 9 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 configs/camp3/.gitkeep
 create mode 100644 configs/wiki/.gitkeep
 create mode 100644 data/camp3/.gitkeep
 create mode 100644 data/wiki/.gitkeep
 create mode 100644 docs/camp3/L0/.gitkeep
 create mode 100644 docs/camp3/L1/.gitkeep
 create mode 100644 docs/camp3/L2/.gitkeep
 create mode 100644 tools/camp3/.gitkeep
 create mode 100644 tools/wiki/.gitkeep

diff --git a/configs/camp3/.gitkeep b/configs/camp3/.gitkeep
new file mode 100644
index 000000000..e69de29bb
diff --git a/configs/wiki/.gitkeep b/configs/wiki/.gitkeep
new file mode 100644
index 000000000..e69de29bb
diff --git a/data/camp3/.gitkeep b/data/camp3/.gitkeep
new file mode 100644
index 000000000..e69de29bb
diff --git a/data/wiki/.gitkeep b/data/wiki/.gitkeep
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/camp3/L0/.gitkeep b/docs/camp3/L0/.gitkeep
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/camp3/L1/.gitkeep b/docs/camp3/L1/.gitkeep
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/camp3/L2/.gitkeep b/docs/camp3/L2/.gitkeep
new file mode 100644
index 000000000..e69de29bb
diff --git a/tools/camp3/.gitkeep b/tools/camp3/.gitkeep
new file mode 100644
index 000000000..e69de29bb
diff --git a/tools/wiki/.gitkeep b/tools/wiki/.gitkeep
new file mode 100644
index 000000000..e69de29bb

From baa764bfa037f90937c586f50bf7a1095ad7e9ba Mon Sep 17 00:00:00 2001
From: vansin
Date: Mon, 3 Jun 2024 13:09:35 +0800
Subject: [PATCH 009/754] update

---
 data/wiki/.gitkeep                  | 0
 {configs/camp3 => docs/L0}/.gitkeep | 0
 {configs/wiki => docs/L1}/.gitkeep  | 0
 {data/camp3 => docs/L2}/.gitkeep    | 0
 docs/camp3/L0/.gitkeep              | 0
 docs/camp3/L1/.gitkeep              | 0
 docs/camp3/L2/.gitkeep              | 0
 tools/camp3/.gitkeep                | 0
 tools/wiki/.gitkeep                 | 0
 9 files changed, 0 insertions(+), 0 deletions(-)
 delete mode 100644 data/wiki/.gitkeep
 rename {configs/camp3 => docs/L0}/.gitkeep (100%)
 rename {configs/wiki => docs/L1}/.gitkeep (100%)
 rename {data/camp3 => docs/L2}/.gitkeep (100%)
 delete mode 100644 docs/camp3/L0/.gitkeep
 delete mode 100644 docs/camp3/L1/.gitkeep
 delete mode 100644 docs/camp3/L2/.gitkeep
 delete mode 100644 tools/camp3/.gitkeep
 delete mode 100644 tools/wiki/.gitkeep

diff --git a/data/wiki/.gitkeep b/data/wiki/.gitkeep
deleted file mode 100644
index e69de29bb..000000000
diff --git a/configs/camp3/.gitkeep b/docs/L0/.gitkeep
similarity index 100%
rename from configs/camp3/.gitkeep
rename to docs/L0/.gitkeep
diff --git a/configs/wiki/.gitkeep b/docs/L1/.gitkeep
similarity index 100%
rename from configs/wiki/.gitkeep
rename to docs/L1/.gitkeep
diff --git a/data/camp3/.gitkeep b/docs/L2/.gitkeep
similarity index 100%
rename from data/camp3/.gitkeep
rename to docs/L2/.gitkeep
diff --git a/docs/camp3/L0/.gitkeep b/docs/camp3/L0/.gitkeep
deleted file mode 100644
index e69de29bb..000000000
diff --git a/docs/camp3/L1/.gitkeep b/docs/camp3/L1/.gitkeep
deleted file mode 100644
index e69de29bb..000000000
diff --git a/docs/camp3/L2/.gitkeep b/docs/camp3/L2/.gitkeep
deleted file mode 100644
index e69de29bb..000000000
diff --git a/tools/camp3/.gitkeep b/tools/camp3/.gitkeep
deleted file mode 100644
index e69de29bb..000000000
diff --git a/tools/wiki/.gitkeep b/tools/wiki/.gitkeep
deleted file mode 100644
index e69de29bb..000000000

From d97ae78cf237faf5076e9dd968458685b6604744 Mon Sep 17 00:00:00 2001
From: vansin
Date: Mon, 3 Jun 2024 13:13:03 +0800
Subject: [PATCH 010/754] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index d140121be..9b7bb51ca 100644
--- a/README.md
+++ b/README.md
@@ -23,7 +23,7 @@
 |Level 1| Understanding the 书生·浦语 (InternLM) Full-Chain Open-Source System ||Docs, Video, Task|
 |Level 2| Transformer Architecture Basics | | Docs, Video, Task |
 |Level 3| InternLM Prompt Engineering Practice ||Docs, Video, Task|
-|Level 4| InternLM-1.8B Deployment Practice || Docs, Video, Task |
+|Level 4| InternLM2-1.8B Deployment Practice || Docs, Video, Task |
 |Level 5| InternLM + LlamaIndex RAG Practice || Docs, Video, Task |
 |Level 6| Fine-Tuning a Personal Assistant's Self-Cognition with XTuner || Docs, Video, Task |
 |Level 7| Capability Evaluation of the InternLM2-Chat-1.8B Model || Docs, Video, Task |

From 5c7ea21b3c79a7f89cb2b2a7a45898e567f9d6f6 Mon Sep 17 00:00:00 2001
From: Shengshenlan <2764725346@qq.com>
Date: Fri, 7 Jun 2024 21:46:15 +0800
Subject: [PATCH 011/754] l5

---
 docs/L5/llamaindex.ipynb | 18 ++++++++++++++++++
 docs/L5/llamaindex.md    | 12 ++++++++++++
 2 files changed, 30 insertions(+)
 create mode 100644 docs/L5/llamaindex.ipynb
 create mode 100644 docs/L5/llamaindex.md

diff --git a/docs/L5/llamaindex.ipynb b/docs/L5/llamaindex.ipynb
new file mode 100644
index 000000000..709d82cff
--- /dev/null
+++ b/docs/L5/llamaindex.ipynb
@@ -0,0 +1,18 @@
+{
+    "cells": [
+        {
+            "cell_type": "code",
+            "execution_count": null,
+            "metadata": {},
+            "outputs": [],
+            "source": []
+        }
+    ],
+    "metadata": {
+        "language_info": {
+            "name": "python"
+        }
+    },
+    "nbformat": 4,
+    "nbformat_minor": 2
+}
diff --git a/docs/L5/llamaindex.md b/docs/L5/llamaindex.md
new file mode 100644
index 000000000..3c666386b
--- /dev/null
+++ b/docs/L5/llamaindex.md
@@ -0,0 +1,12 @@
+# InternLM+LlamaIndex
+
+This document walks through how to use LlamaIndex to deploy InternLM2 1.8B (using the InternStudio environment as an example), in the following parts:
+- Prerequisites
+- Environment and model preparation
+- LlamaIndex HuggingFaceLLM
+- LlamaIndex RAG
+
+## 1. Prerequisites
+Before formally introducing Retrieval Augmented Generation (RAG), it is worth asking why such a technique appeared in the first place.
+Ways of injecting new knowledge into a model can be roughly divided into two kinds: an internal one, which updates the model's weights, and an external one, which feeds the model extra context or external information without changing its weights.
+The first way, changing the model's weights, means training the model, which is quite costly; for the concrete training process of a large language model, see the InternLM2 technical report. The second way does not change the model's weights at all; it only supplies the model with extra information. By analogy with programming: the first way is like memorizing how a function works, while the second is like reading the function's documentation and temporarily remembering how to use it.
\ No newline at end of file

From 8535fbafa3549a43f9bf3196acfa1cb7b803c73d Mon Sep 17 00:00:00 2001
From: Shengshenlan <2764725346@qq.com>
Date: Fri, 7 Jun 2024 22:52:40 +0800
Subject: [PATCH 012/754] update

---
 docs/L5/llamaindex.ipynb | 19 +------------------
 1 file changed, 1 insertion(+), 18 deletions(-)

diff --git a/docs/L5/llamaindex.ipynb b/docs/L5/llamaindex.ipynb
index 709d82cff..65003ea05 100644
--- a/docs/L5/llamaindex.ipynb
+++ b/docs/L5/llamaindex.ipynb
@@ -1,18 +1 @@
-{
-    "cells": [
-        {
-            "cell_type": "code",
-            "execution_count": null,
-            "metadata": {},
-            "outputs": [],
-            "source": []
-        }
-    ],
-    "metadata": {
-        "language_info": {
-            "name": "python"
-        }
-    },
-    "nbformat": 4,
-    "nbformat_minor": 2
-}
+{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"gpuType":"T4","authorship_tag":"ABX9TyNNSKWq7YooXncMBeXyBNbZ"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"},"accelerator":"GPU","widgets":{...}}, ...}

[The rest of this single-line notebook JSON is omitted: it is Colab widget-state metadata recording "Loading checkpoint shards" and embedding-model download progress bars.]
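The notebook added in PATCH 012 is stored as one flattened JSON line, so its code is not readable here. As a rough illustration of the two pieces the llamaindex.md outline names (LlamaIndex HuggingFaceLLM and LlamaIndex RAG), the sketch below wraps InternLM2-Chat-1.8B as a LlamaIndex LLM and then queries it with and without a retrieval index. This is not the tutorial's actual notebook code; the Hugging Face id `internlm/internlm2-chat-1_8b`, the sentence-transformers embedding model, and the `./data` directory are assumptions chosen for illustration.

```python
# Minimal sketch (not the tutorial's notebook): InternLM2-Chat-1.8B through
# LlamaIndex, first as plain generation, then with RAG over local documents.
# Assumed packages: llama-index, llama-index-llms-huggingface,
# llama-index-embeddings-huggingface. Assumed model ids and paths as noted above.
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM

# LLM: InternLM2-Chat-1.8B loaded locally via transformers.
Settings.llm = HuggingFaceLLM(
    model_name="internlm/internlm2-chat-1_8b",
    tokenizer_name="internlm/internlm2-chat-1_8b",
    model_kwargs={"trust_remote_code": True},
    tokenizer_kwargs={"trust_remote_code": True},
    context_window=4096,
    max_new_tokens=512,
    device_map="auto",
)

# Embedding model used only to build the retrieval index; the LLM weights stay untouched.
Settings.embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)

# Plain generation: the model can answer only from what its weights already contain.
print(Settings.llm.complete("What is XTuner?"))

# RAG: index the files under ./data and let retrieval supply the extra context.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
print(query_engine.query("What is XTuner?"))
```

The contrast between the two print calls is the point of the prerequisite section above: the second answer can draw on documents the model was never trained on, without changing its weights.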
.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"5c14e585ed164f739dba85d85544f72e":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"3a044c8d0fb1442498292f4e348eaba4":{"model_module":"@jupyter-widgets/controls","model_name":"ProgressStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"9c6da30c7f794586a882ac7cff61842a":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"a433bb08668248d8b1dc0adb97815560":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"7590be01a0fe46bb8b9429b91c66ce16":{"model_module":"@jupyter-widgets/controls","model_name":"HBoxModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_56c3b23c425849a7927eee3babec67bc","IPY_MODEL_b2cb3
afffcae46aaaed5471349d5aa26","IPY_MODEL_472a2346d1fb40948dfef64ebdc7b75a"],"layout":"IPY_MODEL_39bd275c1e1d4ab4bf55f12bf9745fe8"}},"56c3b23c425849a7927eee3babec67bc":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_4df309dc4a6d467db54c68d0e3aeca9d","placeholder":"​","style":"IPY_MODEL_04315a4c5cc442adad474ed597701968","value":"tokenizer_config.json: 100%"}},"b2cb3afffcae46aaaed5471349d5aa26":{"model_module":"@jupyter-widgets/controls","model_name":"FloatProgressModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_1cf5dd8207dc4ba69f2ef2e34751850e","max":480,"min":0,"orientation":"horizontal","style":"IPY_MODEL_d57abad3fbd04f11ad02d9c2f90c475f","value":480}},"472a2346d1fb40948dfef64ebdc7b75a":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_bda231e097644f9aa415dc88950f6955","placeholder":"​","style":"IPY_MODEL_91449b6625184274bc74ed47fa33ccf0","value":" 480/480 [00:00<00:00, 
24.8kB/s]"}},"39bd275c1e1d4ab4bf55f12bf9745fe8":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"4df309dc4a6d467db54c68d0e3aeca9d":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"04315a4c5cc442adad474ed597701968":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"1cf5dd8207dc4ba69f2ef2e34751850e":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"d57abad3fbd04f11ad
02d9c2f90c475f":{"model_module":"@jupyter-widgets/controls","model_name":"ProgressStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"bda231e097644f9aa415dc88950f6955":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"91449b6625184274bc74ed47fa33ccf0":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"1f6376cc1e664afea7ce482f06431c17":{"model_module":"@jupyter-widgets/controls","model_name":"HBoxModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_e8984d13c3f54a4bb162ee07ee20c943","IPY_MODEL_3842851605b04346924f1f531a7afef1","IPY_MODEL_01271529beff4201a623d7e381c95183"],"layout":"IPY_MODEL_586229c5819243d4adc6850add6c9c4e"}},"e8984d13c3f54a4bb162ee07ee20c943":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_4136d7804dcf4518a20f6e5fb39aee04","placeholder":"​","style":"IPY_MODEL_72410cb0d21945b8ae03d5d238793357","value":"tokenizer.json: 
100%"}},"3842851605b04346924f1f531a7afef1":{"model_module":"@jupyter-widgets/controls","model_name":"FloatProgressModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_f4290e5963354f8b9695b2d64fbcf3a9","max":9081518,"min":0,"orientation":"horizontal","style":"IPY_MODEL_24f5f76eba724b7b8b1bd0e1f4ce614b","value":9081518}},"01271529beff4201a623d7e381c95183":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_a9fbbf8b0479425d8dddc3cada4458f8","placeholder":"​","style":"IPY_MODEL_a9fc98be620a480cae780a1045b55364","value":" 9.08M/9.08M [00:00<00:00, 12.7MB/s]"}},"586229c5819243d4adc6850add6c9c4e":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"4136d7804dcf4518a20f6e5fb39aee04":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"72410cb0d21945b8ae03d5d238793357":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.
0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"f4290e5963354f8b9695b2d64fbcf3a9":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"24f5f76eba724b7b8b1bd0e1f4ce614b":{"model_module":"@jupyter-widgets/controls","model_name":"ProgressStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"a9fbbf8b0479425d8dddc3cada4458f8":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"a9fc98be620a480cae780a1045b55364":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"8b588dc29f984e74a3834958b3ac432f":{"model_module":"@jupyter-widgets/controls","model_name":"HBoxModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_7f0151766eaa4e77ad7a13efea233182","IPY_MODEL_9ef8fd
410e504ad3a14472c8642e61aa","IPY_MODEL_c530222427e5416e9b59b14577db52b3"],"layout":"IPY_MODEL_5c1a69a72ffd48f79aa327037ecc3f5d"}},"7f0151766eaa4e77ad7a13efea233182":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_8bc19d157ddc415184e21baaeaa4b430","placeholder":"​","style":"IPY_MODEL_47afbb445f2a4f71a8ede755ccc364c6","value":"special_tokens_map.json: 100%"}},"9ef8fd410e504ad3a14472c8642e61aa":{"model_module":"@jupyter-widgets/controls","model_name":"FloatProgressModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_7f181fdc9a51468a86bca59830bb9d81","max":239,"min":0,"orientation":"horizontal","style":"IPY_MODEL_8ff9c919a2c442c2a4d8ac0b3a5cc3cd","value":239}},"c530222427e5416e9b59b14577db52b3":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_85f093382df54374a6996cb4d0ca6e55","placeholder":"​","style":"IPY_MODEL_04b098933b2645ca9d04dbed1d24be7f","value":" 239/239 [00:00<00:00, 
3.81kB/s]"}},"5c1a69a72ffd48f79aa327037ecc3f5d":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"8bc19d157ddc415184e21baaeaa4b430":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"47afbb445f2a4f71a8ede755ccc364c6":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"7f181fdc9a51468a86bca59830bb9d81":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"8ff9c919a2c442c2a4
d8ac0b3a5cc3cd":{"model_module":"@jupyter-widgets/controls","model_name":"ProgressStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"85f093382df54374a6996cb4d0ca6e55":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"04b098933b2645ca9d04dbed1d24be7f":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"fb373185d4f14816925cce207d3bdad5":{"model_module":"@jupyter-widgets/controls","model_name":"HBoxModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_92e5b6b3f45c47d9a79db4b5b53e7800","IPY_MODEL_3edbb211def242d489ddcf174016251a","IPY_MODEL_69d83f3d8c954e458f67acd9a2f80681"],"layout":"IPY_MODEL_c115e6bdaf0b4587bdb6bf5c311b20e7"}},"92e5b6b3f45c47d9a79db4b5b53e7800":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_55d4db6a41dc4c6abdd701ffb356abc8","placeholder":"​","style":"IPY_MODEL_7f3f012cbb1546f596321a9cabdd6db1","value":"1_Pooling/config.json: 
100%"}},"3edbb211def242d489ddcf174016251a":{"model_module":"@jupyter-widgets/controls","model_name":"FloatProgressModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_0cf68562237c4b27b07458d44dc4c681","max":190,"min":0,"orientation":"horizontal","style":"IPY_MODEL_d077f49d761b4a508b5fb788ae855be4","value":190}},"69d83f3d8c954e458f67acd9a2f80681":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_268ce96b87d84289966cd7d1e12412a9","placeholder":"​","style":"IPY_MODEL_51ad3e7fbad64b718c89a74dbb16a921","value":" 190/190 [00:00<00:00, 5.43kB/s]"}},"c115e6bdaf0b4587bdb6bf5c311b20e7":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"55d4db6a41dc4c6abdd701ffb356abc8":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"7f3f012cbb1546f596321a9cabdd6db1":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_n
ame":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"0cf68562237c4b27b07458d44dc4c681":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"d077f49d761b4a508b5fb788ae855be4":{"model_module":"@jupyter-widgets/controls","model_name":"ProgressStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"268ce96b87d84289966cd7d1e12412a9":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"51ad3e7fbad64b718c89a74dbb16a921":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}}}}},"cells":[{"cell_type":"markdown","source":["# llamaindex+Internlm2 RAG实践\n","\n","在本notebook中,我们将会使用llamaindex和Interlm2-chat-1.8b进行知识库查询实践。首先来安装llamaindex。"],"metadata":{"id":"ZY-nOeEKq4Uv"}},{"cell_type":"code","source":["!pip install llama-index llama-index-llms-huggingface \"transformers[torch]==4.41.1\" \"huggingface_hub[inference]==0.23.1\" sentence-transformers sentencepiece\n","!pip install 
einops"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"collapsed":true,"id":"_DDbO4bgVkXA","executionInfo":{"status":"ok","timestamp":1717595055278,"user_tz":-480,"elapsed":16587,"user":{"displayName":"祝岚","userId":"02180828612699376062"}},"outputId":"33ead54d-4f37-4e5a-e1af-03e30945b8a1"},"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["Requirement already satisfied: llama-index==0.10.38 in /usr/local/lib/python3.10/dist-packages (0.10.38)\n","Requirement already satisfied: llama-index-llms-huggingface==0.2.0 in /usr/local/lib/python3.10/dist-packages (0.2.0)\n","Requirement already satisfied: transformers[torch]==4.41.1 in /usr/local/lib/python3.10/dist-packages (4.41.1)\n","Requirement already satisfied: huggingface_hub[inference]==0.23.1 in /usr/local/lib/python3.10/dist-packages (0.23.1)\n","Requirement already satisfied: sentence-transformers==2.7.0 in /usr/local/lib/python3.10/dist-packages (2.7.0)\n","Requirement already satisfied: sentencepiece==0.2.0 in /usr/local/lib/python3.10/dist-packages (0.2.0)\n","Requirement already satisfied: llama-index-agent-openai<0.3.0,>=0.1.4 in /usr/local/lib/python3.10/dist-packages (from llama-index==0.10.38) (0.2.7)\n","Requirement already satisfied: llama-index-cli<0.2.0,>=0.1.2 in /usr/local/lib/python3.10/dist-packages (from llama-index==0.10.38) (0.1.12)\n","Requirement already satisfied: llama-index-core<0.11.0,>=0.10.38 in /usr/local/lib/python3.10/dist-packages (from llama-index==0.10.38) (0.10.43)\n","Requirement already satisfied: llama-index-embeddings-openai<0.2.0,>=0.1.5 in /usr/local/lib/python3.10/dist-packages (from llama-index==0.10.38) (0.1.10)\n","Requirement already satisfied: llama-index-indices-managed-llama-cloud<0.2.0,>=0.1.2 in /usr/local/lib/python3.10/dist-packages (from llama-index==0.10.38) (0.1.6)\n","Requirement already satisfied: llama-index-legacy<0.10.0,>=0.9.48 in /usr/local/lib/python3.10/dist-packages (from llama-index==0.10.38) (0.9.48)\n","Requirement already satisfied: llama-index-llms-openai<0.2.0,>=0.1.13 in /usr/local/lib/python3.10/dist-packages (from llama-index==0.10.38) (0.1.22)\n","Requirement already satisfied: llama-index-multi-modal-llms-openai<0.2.0,>=0.1.3 in /usr/local/lib/python3.10/dist-packages (from llama-index==0.10.38) (0.1.6)\n","Requirement already satisfied: llama-index-program-openai<0.2.0,>=0.1.3 in /usr/local/lib/python3.10/dist-packages (from llama-index==0.10.38) (0.1.6)\n","Requirement already satisfied: llama-index-question-gen-openai<0.2.0,>=0.1.2 in /usr/local/lib/python3.10/dist-packages (from llama-index==0.10.38) (0.1.3)\n","Requirement already satisfied: llama-index-readers-file<0.2.0,>=0.1.4 in /usr/local/lib/python3.10/dist-packages (from llama-index==0.10.38) (0.1.23)\n","Requirement already satisfied: llama-index-readers-llama-parse<0.2.0,>=0.1.2 in /usr/local/lib/python3.10/dist-packages (from llama-index==0.10.38) (0.1.4)\n","Requirement already satisfied: text-generation<0.8.0,>=0.7.0 in /usr/local/lib/python3.10/dist-packages (from llama-index-llms-huggingface==0.2.0) (0.7.0)\n","Requirement already satisfied: torch<3.0.0,>=2.1.2 in /usr/local/lib/python3.10/dist-packages (from llama-index-llms-huggingface==0.2.0) (2.3.0+cu121)\n","Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers[torch]==4.41.1) (3.14.0)\n","Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers[torch]==4.41.1) (1.25.2)\n","Requirement already 
satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers[torch]==4.41.1) (24.0)\n","Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers[torch]==4.41.1) (6.0.1)\n","Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers[torch]==4.41.1) (2024.5.15)\n","Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers[torch]==4.41.1) (2.31.0)\n","Requirement already satisfied: tokenizers<0.20,>=0.19 in /usr/local/lib/python3.10/dist-packages (from transformers[torch]==4.41.1) (0.19.1)\n","Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers[torch]==4.41.1) (0.4.3)\n","Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers[torch]==4.41.1) (4.66.4)\n","Requirement already satisfied: accelerate>=0.21.0 in /usr/local/lib/python3.10/dist-packages (from transformers[torch]==4.41.1) (0.30.1)\n","Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub[inference]==0.23.1) (2023.6.0)\n","Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub[inference]==0.23.1) (4.12.0)\n","Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from huggingface_hub[inference]==0.23.1) (3.9.5)\n","Requirement already satisfied: minijinja>=1.0 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub[inference]==0.23.1) (2.0.1)\n","Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.7.0) (1.2.2)\n","Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.7.0) (1.11.4)\n","Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.7.0) (9.4.0)\n","Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->transformers[torch]==4.41.1) (5.9.5)\n","Requirement already satisfied: openai>=1.14.0 in /usr/local/lib/python3.10/dist-packages (from llama-index-agent-openai<0.3.0,>=0.1.4->llama-index==0.10.38) (1.31.0)\n","Requirement already satisfied: SQLAlchemy[asyncio]>=1.4.49 in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (2.0.30)\n","Requirement already satisfied: dataclasses-json in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (0.6.6)\n","Requirement already satisfied: deprecated>=1.2.9.3 in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (1.2.14)\n","Requirement already satisfied: dirtyjson<2.0.0,>=1.0.8 in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (1.0.8)\n","Requirement already satisfied: httpx in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (0.27.0)\n","Requirement already satisfied: llamaindex-py-client<0.2.0,>=0.1.18 in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (0.1.19)\n","Requirement already satisfied: nest-asyncio<2.0.0,>=1.5.8 in /usr/local/lib/python3.10/dist-packages (from 
llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (1.6.0)\n","Requirement already satisfied: networkx>=3.0 in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (3.3)\n","Requirement already satisfied: nltk<4.0.0,>=3.8.1 in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (3.8.1)\n","Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (2.0.3)\n","Requirement already satisfied: tenacity<9.0.0,>=8.2.0 in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (8.3.0)\n","Requirement already satisfied: tiktoken>=0.3.3 in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (0.7.0)\n","Requirement already satisfied: typing-inspect>=0.8.0 in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (0.9.0)\n","Requirement already satisfied: wrapt in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (1.14.1)\n","Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->huggingface_hub[inference]==0.23.1) (1.3.1)\n","Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->huggingface_hub[inference]==0.23.1) (23.2.0)\n","Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->huggingface_hub[inference]==0.23.1) (1.4.1)\n","Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->huggingface_hub[inference]==0.23.1) (6.0.5)\n","Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->huggingface_hub[inference]==0.23.1) (1.9.4)\n","Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->huggingface_hub[inference]==0.23.1) (4.0.3)\n","Requirement already satisfied: beautifulsoup4<5.0.0,>=4.12.3 in /usr/local/lib/python3.10/dist-packages (from llama-index-readers-file<0.2.0,>=0.1.4->llama-index==0.10.38) (4.12.3)\n","Requirement already satisfied: pypdf<5.0.0,>=4.0.1 in /usr/local/lib/python3.10/dist-packages (from llama-index-readers-file<0.2.0,>=0.1.4->llama-index==0.10.38) (4.2.0)\n","Requirement already satisfied: striprtf<0.0.27,>=0.0.26 in /usr/local/lib/python3.10/dist-packages (from llama-index-readers-file<0.2.0,>=0.1.4->llama-index==0.10.38) (0.0.26)\n","Requirement already satisfied: llama-parse<0.5.0,>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from llama-index-readers-llama-parse<0.2.0,>=0.1.2->llama-index==0.10.38) (0.4.4)\n","Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->transformers[torch]==4.41.1) (3.3.2)\n","Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers[torch]==4.41.1) (3.7)\n","Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers[torch]==4.41.1) (2.0.7)\n","Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers[torch]==4.41.1) (2024.2.2)\n","Requirement already satisfied: pydantic<3,>2 in 
/usr/local/lib/python3.10/dist-packages (from text-generation<0.8.0,>=0.7.0->llama-index-llms-huggingface==0.2.0) (2.7.2)\n","Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-llms-huggingface==0.2.0) (1.12.1)\n","Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-llms-huggingface==0.2.0) (3.1.4)\n","Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-llms-huggingface==0.2.0) (12.1.105)\n","Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-llms-huggingface==0.2.0) (12.1.105)\n","Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-llms-huggingface==0.2.0) (12.1.105)\n","Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-llms-huggingface==0.2.0) (8.9.2.26)\n","Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-llms-huggingface==0.2.0) (12.1.3.1)\n","Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-llms-huggingface==0.2.0) (11.0.2.54)\n","Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-llms-huggingface==0.2.0) (10.3.2.106)\n","Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-llms-huggingface==0.2.0) (11.4.5.107)\n","Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-llms-huggingface==0.2.0) (12.1.0.106)\n","Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-llms-huggingface==0.2.0) (2.20.5)\n","Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-llms-huggingface==0.2.0) (12.1.105)\n","Requirement already satisfied: triton==2.3.0 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-llms-huggingface==0.2.0) (2.3.0)\n","Requirement already satisfied: nvidia-nvjitlink-cu12 in /usr/local/lib/python3.10/dist-packages (from nvidia-cusolver-cu12==11.4.5.107->torch<3.0.0,>=2.1.2->llama-index-llms-huggingface==0.2.0) (12.5.40)\n","Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->sentence-transformers==2.7.0) (1.4.2)\n","Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->sentence-transformers==2.7.0) (3.5.0)\n","Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.10/dist-packages (from beautifulsoup4<5.0.0,>=4.12.3->llama-index-readers-file<0.2.0,>=0.1.4->llama-index==0.10.38) (2.5)\n","Requirement already satisfied: anyio in /usr/local/lib/python3.10/dist-packages (from httpx->llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (3.7.1)\n","Requirement already satisfied: httpcore==1.* in 
/usr/local/lib/python3.10/dist-packages (from httpx->llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (1.0.5)\n","Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from httpx->llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (1.3.1)\n","Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.10/dist-packages (from httpcore==1.*->httpx->llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (0.14.0)\n","Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk<4.0.0,>=3.8.1->llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (8.1.7)\n","Requirement already satisfied: distro<2,>=1.7.0 in /usr/lib/python3/dist-packages (from openai>=1.14.0->llama-index-agent-openai<0.3.0,>=0.1.4->llama-index==0.10.38) (1.7.0)\n","Requirement already satisfied: annotated-types>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from pydantic<3,>2->text-generation<0.8.0,>=0.7.0->llama-index-llms-huggingface==0.2.0) (0.7.0)\n","Requirement already satisfied: pydantic-core==2.18.3 in /usr/local/lib/python3.10/dist-packages (from pydantic<3,>2->text-generation<0.8.0,>=0.7.0->llama-index-llms-huggingface==0.2.0) (2.18.3)\n","Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.10/dist-packages (from SQLAlchemy[asyncio]>=1.4.49->llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (3.0.3)\n","Requirement already satisfied: mypy-extensions>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from typing-inspect>=0.8.0->llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (1.0.0)\n","Requirement already satisfied: marshmallow<4.0.0,>=3.18.0 in /usr/local/lib/python3.10/dist-packages (from dataclasses-json->llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (3.21.2)\n","Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch<3.0.0,>=2.1.2->llama-index-llms-huggingface==0.2.0) (2.1.5)\n","Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (2.8.2)\n","Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (2023.4)\n","Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas->llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (2024.1)\n","Requirement already satisfied: mpmath<1.4.0,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->torch<3.0.0,>=2.1.2->llama-index-llms-huggingface==0.2.0) (1.3.0)\n","Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio->httpx->llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (1.2.1)\n","Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas->llama-index-core<0.11.0,>=0.10.38->llama-index==0.10.38) (1.16.0)\n","Requirement already satisfied: einops in /usr/local/lib/python3.10/dist-packages (0.8.0)\n"]}]},{"cell_type":"markdown","source":["## LlamaIndex HuggingFaceLLM\n","\n","未使用RAG技术之前,我们来测试一下询问“xtuner是什么?”的结果。"],"metadata":{"id":"k_cwiOW9rhoA"}},{"cell_type":"code","source":["from llama_index.llms.huggingface import HuggingFaceLLM\n","from llama_index.core.llms import ChatMessage\n","llm = HuggingFaceLLM(\n"," model_name=\"internlm/internlm2-chat-1_8b\",\n"," 
tokenizer_name=\"internlm/internlm2-chat-1_8b\",\n"," model_kwargs={\"trust_remote_code\":True},\n"," tokenizer_kwargs={\"trust_remote_code\":True}\n",")\n","\n","rsp = llm.chat(messages=[ChatMessage(content=\"xtuner是什么?\")])\n","print(rsp)"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":260,"referenced_widgets":["c42eebd29edf4546a4d301331eddba88","5ddab7cea3b0475aa9acdbd1382c2408","f71891f2f7e84f8faeab9bde1d592689","507d06838fbf4a8185d9b5f58c00a509","5f415bcd99ff4765a2691d1bba040a34","822836168dd840f68f516aa557973043","540f748524f04a50b51b8be7b6cbb730","7ced06e55c6541ae8eec31634ea9b617","865b7ab8280e4fdb8e9f946868ed04e8","35d885c1f3f7465d96de6d9f39a05bcc","6e0054edee9a468bb94ac73052e9aa32"]},"id":"m0F4G-OOch9V","executionInfo":{"status":"ok","timestamp":1717595096344,"user_tz":-480,"elapsed":41070,"user":{"displayName":"祝岚","userId":"02180828612699376062"}},"outputId":"748062a6-4cc1-4874-84cf-17ea316fb26e","collapsed":true},"execution_count":null,"outputs":[{"output_type":"stream","name":"stderr","text":["/usr/local/lib/python3.10/dist-packages/pydantic/_internal/_fields.py:160: UserWarning: Field \"model_id\" has conflict with protected namespace \"model_\".\n","\n","You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.\n"," warnings.warn(\n","/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: \n","The secret `HF_TOKEN` does not exist in your Colab secrets.\n","To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n","You will be able to reuse this secret in all of your notebooks.\n","Please note that authentication is recommended but still optional to access public models or datasets.\n"," warnings.warn(\n"]},{"output_type":"display_data","data":{"text/plain":["Loading checkpoint shards: 0%| | 0/2 [00:00=0.19.0 in /usr/local/lib/python3.10/dist-packages (from llama-index-embeddings-huggingface) (0.23.1)\n","Requirement already satisfied: llama-index-core<0.11.0,>=0.10.1 in /usr/local/lib/python3.10/dist-packages (from llama-index-embeddings-huggingface) (0.10.43)\n","Requirement already satisfied: sentence-transformers<3.0.0,>=2.6.1 in /usr/local/lib/python3.10/dist-packages (from llama-index-embeddings-huggingface) (2.7.0)\n","Collecting instructorembedding<2.0.0,>=1.0.1 (from llama-index-embeddings-instructor)\n"," Downloading InstructorEmbedding-1.0.1-py2.py3-none-any.whl (19 kB)\n","Requirement already satisfied: torch<3.0.0,>=2.1.2 in /usr/local/lib/python3.10/dist-packages (from llama-index-embeddings-instructor) (2.3.0+cu121)\n","Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from huggingface-hub[inference]>=0.19.0->llama-index-embeddings-huggingface) (3.14.0)\n","Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub[inference]>=0.19.0->llama-index-embeddings-huggingface) (2023.6.0)\n","Requirement already satisfied: packaging>=20.9 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub[inference]>=0.19.0->llama-index-embeddings-huggingface) (24.0)\n","Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub[inference]>=0.19.0->llama-index-embeddings-huggingface) (6.0.1)\n","Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from 
huggingface-hub[inference]>=0.19.0->llama-index-embeddings-huggingface) (2.31.0)\n","Requirement already satisfied: tqdm>=4.42.1 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub[inference]>=0.19.0->llama-index-embeddings-huggingface) (4.66.4)\n","Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub[inference]>=0.19.0->llama-index-embeddings-huggingface) (4.12.0)\n","Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from huggingface-hub[inference]>=0.19.0->llama-index-embeddings-huggingface) (3.9.5)\n","Requirement already satisfied: minijinja>=1.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub[inference]>=0.19.0->llama-index-embeddings-huggingface) (2.0.1)\n","Requirement already satisfied: SQLAlchemy[asyncio]>=1.4.49 in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (2.0.30)\n","Requirement already satisfied: dataclasses-json in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (0.6.6)\n","Requirement already satisfied: deprecated>=1.2.9.3 in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (1.2.14)\n","Requirement already satisfied: dirtyjson<2.0.0,>=1.0.8 in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (1.0.8)\n","Requirement already satisfied: httpx in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (0.27.0)\n","Requirement already satisfied: llamaindex-py-client<0.2.0,>=0.1.18 in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (0.1.19)\n","Requirement already satisfied: nest-asyncio<2.0.0,>=1.5.8 in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (1.6.0)\n","Requirement already satisfied: networkx>=3.0 in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (3.3)\n","Requirement already satisfied: nltk<4.0.0,>=3.8.1 in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (3.8.1)\n","Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (1.25.2)\n","Requirement already satisfied: openai>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (1.31.0)\n","Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (2.0.3)\n","Requirement already satisfied: pillow>=9.0.0 in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (9.4.0)\n","Requirement already satisfied: tenacity<9.0.0,>=8.2.0 in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (8.3.0)\n","Requirement already satisfied: tiktoken>=0.3.3 in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (0.7.0)\n","Requirement already satisfied: 
typing-inspect>=0.8.0 in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (0.9.0)\n","Requirement already satisfied: wrapt in /usr/local/lib/python3.10/dist-packages (from llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (1.14.1)\n","Requirement already satisfied: transformers<5.0.0,>=4.34.0 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers<3.0.0,>=2.6.1->llama-index-embeddings-huggingface) (4.41.1)\n","Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from sentence-transformers<3.0.0,>=2.6.1->llama-index-embeddings-huggingface) (1.2.2)\n","Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from sentence-transformers<3.0.0,>=2.6.1->llama-index-embeddings-huggingface) (1.11.4)\n","Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-embeddings-instructor) (1.12.1)\n","Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-embeddings-instructor) (3.1.4)\n","Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-embeddings-instructor) (12.1.105)\n","Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-embeddings-instructor) (12.1.105)\n","Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-embeddings-instructor) (12.1.105)\n","Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-embeddings-instructor) (8.9.2.26)\n","Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-embeddings-instructor) (12.1.3.1)\n","Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-embeddings-instructor) (11.0.2.54)\n","Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-embeddings-instructor) (10.3.2.106)\n","Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-embeddings-instructor) (11.4.5.107)\n","Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-embeddings-instructor) (12.1.0.106)\n","Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-embeddings-instructor) (2.20.5)\n","Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-embeddings-instructor) (12.1.105)\n","Requirement already satisfied: triton==2.3.0 in /usr/local/lib/python3.10/dist-packages (from torch<3.0.0,>=2.1.2->llama-index-embeddings-instructor) (2.3.0)\n","Requirement already satisfied: nvidia-nvjitlink-cu12 in /usr/local/lib/python3.10/dist-packages (from nvidia-cusolver-cu12==11.4.5.107->torch<3.0.0,>=2.1.2->llama-index-embeddings-instructor) 
(12.5.40)\n","Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->huggingface-hub[inference]>=0.19.0->llama-index-embeddings-huggingface) (1.3.1)\n","Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->huggingface-hub[inference]>=0.19.0->llama-index-embeddings-huggingface) (23.2.0)\n","Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->huggingface-hub[inference]>=0.19.0->llama-index-embeddings-huggingface) (1.4.1)\n","Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->huggingface-hub[inference]>=0.19.0->llama-index-embeddings-huggingface) (6.0.5)\n","Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->huggingface-hub[inference]>=0.19.0->llama-index-embeddings-huggingface) (1.9.4)\n","Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->huggingface-hub[inference]>=0.19.0->llama-index-embeddings-huggingface) (4.0.3)\n","Requirement already satisfied: pydantic>=1.10 in /usr/local/lib/python3.10/dist-packages (from llamaindex-py-client<0.2.0,>=0.1.18->llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (2.7.2)\n","Requirement already satisfied: anyio in /usr/local/lib/python3.10/dist-packages (from httpx->llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (3.7.1)\n","Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from httpx->llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (2024.2.2)\n","Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/dist-packages (from httpx->llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (1.0.5)\n","Requirement already satisfied: idna in /usr/local/lib/python3.10/dist-packages (from httpx->llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (3.7)\n","Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from httpx->llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (1.3.1)\n","Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.10/dist-packages (from httpcore==1.*->httpx->llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (0.14.0)\n","Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk<4.0.0,>=3.8.1->llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (8.1.7)\n","Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk<4.0.0,>=3.8.1->llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (1.4.2)\n","Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk<4.0.0,>=3.8.1->llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (2024.5.15)\n","Requirement already satisfied: distro<2,>=1.7.0 in /usr/lib/python3/dist-packages (from openai>=1.1.0->llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (1.7.0)\n","Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub[inference]>=0.19.0->llama-index-embeddings-huggingface) (3.3.2)\n","Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from 
requests->huggingface-hub[inference]>=0.19.0->llama-index-embeddings-huggingface) (2.0.7)\n","Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.10/dist-packages (from SQLAlchemy[asyncio]>=1.4.49->llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (3.0.3)\n","Requirement already satisfied: tokenizers<0.20,>=0.19 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.34.0->sentence-transformers<3.0.0,>=2.6.1->llama-index-embeddings-huggingface) (0.19.1)\n","Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.34.0->sentence-transformers<3.0.0,>=2.6.1->llama-index-embeddings-huggingface) (0.4.3)\n","Requirement already satisfied: mypy-extensions>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from typing-inspect>=0.8.0->llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (1.0.0)\n","Requirement already satisfied: marshmallow<4.0.0,>=3.18.0 in /usr/local/lib/python3.10/dist-packages (from dataclasses-json->llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (3.21.2)\n","Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch<3.0.0,>=2.1.2->llama-index-embeddings-instructor) (2.1.5)\n","Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (2.8.2)\n","Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (2023.4)\n","Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas->llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (2024.1)\n","Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->sentence-transformers<3.0.0,>=2.6.1->llama-index-embeddings-huggingface) (3.5.0)\n","Requirement already satisfied: mpmath<1.4.0,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->torch<3.0.0,>=2.1.2->llama-index-embeddings-instructor) (1.3.0)\n","Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio->httpx->llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (1.2.1)\n","Requirement already satisfied: annotated-types>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from pydantic>=1.10->llamaindex-py-client<0.2.0,>=0.1.18->llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (0.7.0)\n","Requirement already satisfied: pydantic-core==2.18.3 in /usr/local/lib/python3.10/dist-packages (from pydantic>=1.10->llamaindex-py-client<0.2.0,>=0.1.18->llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (2.18.3)\n","Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas->llama-index-core<0.11.0,>=0.10.1->llama-index-embeddings-huggingface) (1.16.0)\n","Installing collected packages: instructorembedding, llama-index-embeddings-instructor, llama-index-embeddings-huggingface\n","Successfully installed instructorembedding-1.0.1 llama-index-embeddings-huggingface-0.2.1 llama-index-embeddings-instructor-0.1.3\n"]}]},{"cell_type":"markdown","source":["此处将xtuner的README文件放入data文件中,作为知识库。"],"metadata":{"id":"_zh9-T-3sTlI"}},{"cell_type":"code","source":["!mkdir data\n","!git 
clone https://github.com/InternLM/xtuner.git\n","!mv xtuner/README_zh-CN.md ./data"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"hkEdNC2HeTBK","executionInfo":{"status":"ok","timestamp":1717595105450,"user_tz":-480,"elapsed":1583,"user":{"displayName":"祝岚","userId":"02180828612699376062"}},"outputId":"ce2d741d-bded-4343-9438-97781908022a"},"execution_count":null,"outputs":[{"output_type":"stream","name":"stdout","text":["Cloning into 'xtuner'...\n","remote: Enumerating objects: 8423, done.\u001b[K\n","remote: Counting objects: 100% (5457/5457), done.\u001b[K\n","remote: Compressing objects: 100% (884/884), done.\u001b[K\n","remote: Total 8423 (delta 5061), reused 4662 (delta 4571), pack-reused 2966\u001b[K\n","Receiving objects: 100% (8423/8423), 1.64 MiB | 14.76 MiB/s, done.\n","Resolving deltas: 100% (6455/6455), done.\n"]}]},{"cell_type":"code","source":["from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings\n","\n","from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n","from llama_index.llms.huggingface import HuggingFaceLLM\n","\n","embed_model = HuggingFaceEmbedding(\n"," model_name=\"sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2\"\n",")\n","\n","Settings.embed_model = embed_model\n","\n","Settings.llm = llm\n","\n","documents = SimpleDirectoryReader(\"./data\").load_data()\n","index = VectorStoreIndex.from_documents(documents)\n","query_engine = index.as_query_engine()\n","response = query_engine.query(\"xtuner是什么?\")\n","\n","print(response)"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":773,"referenced_widgets":["69ed7ce9957448e886bbc2340f1567c3","b2f386012623497da93c0d023e7010cc","165763f3b0514196a1f54088d6a636e8","d1b9e512f9ad4ad3ac80ca617dc27f64","cf587120c03540a6aeeb3e448cc5e222","a26d602c5d194d29964fcf1bf08932ee","eca4d0efedd248b2a6a7ec36018c4a0d","f72d1cee95cb429fb27da1f92e959cc7","233d4844a7cd49b9baa0e743b764b3ff","7a97dcd5b6b948d8bd880227e46480db","ae6462b060734f9899a9636a60f6643e","a46dafab1aa64a3cacf3e66075ec965d","3644af1f76b74a51a14b0b996c4df146","bb18a12489954592a74c49b7d03cb20f","07010c7ddb574f9f9293b22e8ffdae47","62162bcc0e494b9485aa6cf5b4fcef57","39931be128c444c1a8cf80ec4bdfe1f4","3273e937afdc44328c01df0946142af9","273f7749f06f4aeb9abd0ed023713d8b","1f143ddbc8574886914b79ffc9da09b8","454d5bcc27364b21be25e3cc49c8993d","3bc6164577864a188a92395eac2cd95b","1f6ca7a846fa46ee9ddd1ebfec35671d","a2eeec01bc4148888476ae48fc67789a","a8dfb5d658e44ea182891465eb81de9d","e91f33e380ca4d49b0178e76c76a0c70","8d64b9c64db14c8f9cb1f857cd8d3c86","18d5f6fe60ce428ea3bb4d44e9574b65","618c70e1bb8f4fdebeac3a3b9d02f453","65e6ca70fc5649fd8df3671f49e9a994","075896d9b9cc4f76abf9e84a23d28e0d","1c069c3a1edb4bb0a6da1ea9dfd6f1d1","17e49d8a03424335940415f8e4cf58b5","b2c75766fb1745f68e75ca8d84fd9273","758a7c8ff11549348d124463970d83b3","54be9ad90491476388728f0fd6ef47f9","41d064a3dbc04404bbf45802e8b12d77","4c91f061f9084cd589ff085d577eca16","5959aef20b0646ea91aea5e9e61a1f6b","86bc64bd8ea3404f86c9c5578ea23202","ea7eb0b1fb72425a9281e469197fc507","fc38b1b9bf3e49e7b5f1bf1df5e9a817","94ec348de11147bea3bebabc7a38d064","4c505d868a5a45fea03c9e723d6be1e4","1f8c04e6a3714675ab4260e3c7b2f2b2","7709ee69c35949f48ebb26dd6bd2dd1f","0fca41335fc94a249df318dd13047a2e","2edabd484afe45e6a73c5c37df22a73f","5da9d043f5834294a964d1babe1df068","2b9fed50a39d4da5a0e663dc413e7700","2b828555bc4745d4aa12c332b96e9012","65981b52d8044f298ba56d905b692fce","c4ebfd46351542aaaabf03ef4c7c10c3","406c95ecb8f2426eb609c4dbc9009942","5d947921c38149dda3
31d0d8ca358aac","216a2be4c6ad44ce9c1c654854777c81","c765f12578ee477993847ff5d042f34a","00d233cdc7da4aee96a70cc617a3549d","f99b3803503440c6b356e9083e2b1219","8d2646025a474810b060a4ea1816938c","2dac105cc8684445a591c19243a036b9","d6e70df4c13b45ec8fcf62d0e92b40d6","5c14e585ed164f739dba85d85544f72e","3a044c8d0fb1442498292f4e348eaba4","9c6da30c7f794586a882ac7cff61842a","a433bb08668248d8b1dc0adb97815560","7590be01a0fe46bb8b9429b91c66ce16","56c3b23c425849a7927eee3babec67bc","b2cb3afffcae46aaaed5471349d5aa26","472a2346d1fb40948dfef64ebdc7b75a","39bd275c1e1d4ab4bf55f12bf9745fe8","4df309dc4a6d467db54c68d0e3aeca9d","04315a4c5cc442adad474ed597701968","1cf5dd8207dc4ba69f2ef2e34751850e","d57abad3fbd04f11ad02d9c2f90c475f","bda231e097644f9aa415dc88950f6955","91449b6625184274bc74ed47fa33ccf0","1f6376cc1e664afea7ce482f06431c17","e8984d13c3f54a4bb162ee07ee20c943","3842851605b04346924f1f531a7afef1","01271529beff4201a623d7e381c95183","586229c5819243d4adc6850add6c9c4e","4136d7804dcf4518a20f6e5fb39aee04","72410cb0d21945b8ae03d5d238793357","f4290e5963354f8b9695b2d64fbcf3a9","24f5f76eba724b7b8b1bd0e1f4ce614b","a9fbbf8b0479425d8dddc3cada4458f8","a9fc98be620a480cae780a1045b55364","8b588dc29f984e74a3834958b3ac432f","7f0151766eaa4e77ad7a13efea233182","9ef8fd410e504ad3a14472c8642e61aa","c530222427e5416e9b59b14577db52b3","5c1a69a72ffd48f79aa327037ecc3f5d","8bc19d157ddc415184e21baaeaa4b430","47afbb445f2a4f71a8ede755ccc364c6","7f181fdc9a51468a86bca59830bb9d81","8ff9c919a2c442c2a4d8ac0b3a5cc3cd","85f093382df54374a6996cb4d0ca6e55","04b098933b2645ca9d04dbed1d24be7f","fb373185d4f14816925cce207d3bdad5","92e5b6b3f45c47d9a79db4b5b53e7800","3edbb211def242d489ddcf174016251a","69d83f3d8c954e458f67acd9a2f80681","c115e6bdaf0b4587bdb6bf5c311b20e7","55d4db6a41dc4c6abdd701ffb356abc8","7f3f012cbb1546f596321a9cabdd6db1","0cf68562237c4b27b07458d44dc4c681","d077f49d761b4a508b5fb788ae855be4","268ce96b87d84289966cd7d1e12412a9","51ad3e7fbad64b718c89a74dbb16a921"]},"id":"owrNjifXeadP","executionInfo":{"status":"ok","timestamp":1717595129423,"user_tz":-480,"elapsed":23977,"user":{"displayName":"祝岚","userId":"02180828612699376062"}},"outputId":"b56386f9-aade-4734-cc82-ff8bba9caf05"},"execution_count":null,"outputs":[{"output_type":"display_data","data":{"text/plain":["modules.json: 0%| | 0.00/229 [00:00 Date: Fri, 7 Jun 2024 23:10:26 +0800 Subject: [PATCH 013/754] Update llamaindex.md Signed-off-by: Shengshenlan <57640594+Shengshenlan@users.noreply.github.com> --- docs/L5/llamaindex.md | 203 +++++++++++++++++++++++++++++++++++++++++- 1 file changed, 200 insertions(+), 3 deletions(-) diff --git a/docs/L5/llamaindex.md b/docs/L5/llamaindex.md index 3c666386b..ff55b9c54 100644 --- a/docs/L5/llamaindex.md +++ b/docs/L5/llamaindex.md @@ -1,4 +1,4 @@ -# InternLM+LlamaIndex +# InternLM+LlamaIndex 本文将分为以下几个部分来介绍,如何使用 LlamaIndex 来部署 InternLM2 1.8B(以 InternStudio 的环境为例) - 前置知识 @@ -6,7 +6,204 @@ - LlamaIndex HuggingFaceLLM - LlamaIndex RAG -## 1. 前置知识 +## 1. 
前置知识 正式介绍检索增强生成(Retrieval Augmented Generation,RAG)技术以前,大家不妨想想为什么会出现这样一个技术。 给模型注入新知识的方式,可以简单分为两种方式,一种是内部的,即更新模型的权重,另一个就是外部的方式,给模型注入格外的上下文或者说外部信息,不改变它的的权重。 -第一种方式,改变了模型的权重即进行模型训练,这是一件代价比较大的事情,大语言模型具体的训练过程,可以参考InternLM2技术报告。第二种方式,并不改变模型的权重,只是给模型引入格外的信息。类比人类编程的过程,第一种方式相当于你记住了某个函数的用法,第二种方式相当于你阅读函数文档然后短暂的记住了某个函数的用法。 \ No newline at end of file +第一种方式,改变了模型的权重即进行模型训练,这是一件代价比较大的事情,大语言模型具体的训练过程,可以参考![InternLM2](https://arxiv.org/abs/2403.17297)技术报告。第二种方式,并不改变模型的权重,只是给模型引入格外的信息。类比人类编程的过程,第一种方式相当于你记住了某个函数的用法,第二种方式相当于你阅读函数文档然后短暂的记住了某个函数的用法。 +![image](https://github.com/Shengshenlan/tutorial/assets/57640594/5a72331f-1726-4e4e-9a69-75141cfd313e) +对比两种注入知识方式,第二种更容易实现。RAG正是这种方式。它能够让基础模型实现非参数知识更新,无需训练就可以掌握新领域的知识。本次课程选用了LlamaIndex框架。LlamaIndex 是一个上下文增强的 LLM 框架,旨在通过将其与特定上下文数据集集成,增强大型语言模型(LLMs)的能力。它允许您构建应用程序,既利用 LLMs 的优势,又融入您的私有或领域特定信息。 + +### RAG 效果比对 + +如图所示,由于`xtuner`是一款比较新的框架, `InternLM2-Chat-1.8B` 训练数据库中并没有收录到它的相关信息。左图中问答均未给出准确的答案。右图未对 `InternLM2-Chat-1.8B` 进行任何增训的情况下,通过 RAG 技术实现的新增知识问答。 +![image](https://github.com/Shengshenlan/tutorial/assets/57640594/3785a449-770a-45e1-a7ea-7cfd33a00076) +## 2. 环境、模型准备 +### 2.1 配置基础环境 +这里以在 ![Intern Studio](https://studio.intern-ai.org.cn/) 服务器上部署LlamaIndex为例。 + + +首先,打开 `Intern Studio` 界面,点击 **创建开发机** 配置开发机系统。 +![image](https://github.com/Shengshenlan/tutorial/assets/57640594/e325d0c1-6816-4ea5-ba4a-f509bdd42323) + +填写 `开发机名称` 后,点击 选择镜像 使用 `Cuda11.7-conda` 镜像,然后在资源配置中,使用 `30% A100 * 1` 的选项,然后立即创建开发机器。 +![image](https://github.com/Shengshenlan/tutorial/assets/57640594/8c25b923-fda8-4af2-a4dc-2f4cf44845c9) + +点击 `进入开发机` 选项。 +![image](https://github.com/Shengshenlan/tutorial/assets/57640594/6bc3cde2-6309-4e14-9278-a65cd74d4a3a) + +进入开发机后,从官方环境复制运行 InternLM 的基础环境,命名为 `llamaindex`,在命令行模式下运行: +```bash +studio-conda -t llamaindex -o pytorch-2.1.2 +``` +复制完成后,在本地查看环境。 +```bash +conda env list +``` +结果如下所示。 +```bash +# conda environments: +# +base * /root/.conda +llamaindex /root/.conda/envs/llamaindex +``` + +运行 `conda` 命令,激活 `llamaindex` **python** 虚拟环境: +```bash +conda activate llamaindex +``` + +环境激活后,命令行左边会显示当前(也就是 `llamaindex` )的环境名称,如下图所示: +![image](https://github.com/Shengshenlan/tutorial/assets/57640594/bcfedc90-0d9d-4679-b1e9-4709b05711f3) + +### 2.2 安装 Llamaindex +安装 Llamaindex和相关的包 +```bash +conda activate llamaindex +pip install llama-index==0.10.38 llama-index-llms-huggingface==0.2.0 "transformers[torch]==4.41.1" "huggingface_hub[inference]==0.23.1" huggingface_hub==0.23.1 sentence-transformers==2.7.0 sentencepiece==0.2.0 +``` + +### 2.3 下载 Sentence Transformer 模型 + +源词向量模型 ![Sentence Transformer](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2):(我们也可以选用别的开源词向量模型来进行 Embedding,目前选用这个模型是相对轻量、支持中文且效果较好的,同学们可以自由尝试别的开源词向量模型) +运行以下指令,新建一个python文件 +```bash +cd ~ +mkdir llamaindex_demo +mkdir model +cd ~/llamaindex_demo +touch download_hf.py +``` +打开`download_hf.py` 贴入以下代码 +```bash +import os + +# 设置环境变量 +os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com' + +# 下载模型 +os.system('huggingface-cli download --resume-download sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 --local-dir /root/model/sentence-transformer') +``` + +然后,在 /root/data 目录下执行该脚本即可自动开始下载: +```bash +conda activate llamaindex +python download_hf.py +``` +更多关于镜像使用可以移步至 ![HF Mirror](https://hf-mirror.com/) 查看。 + +2.4 下载 NLTK 相关资源 +我们在使用开源词向量模型构建开源词向量的时候,需要用到第三方库 `nltk` 的一些资源。正常情况下,其会自动从互联网上下载,但可能由于网络原因会导致下载中断,此处我们可以从国内仓库镜像地址下载相关资源,保存到服务器上。 +我们用以下命令下载 nltk 资源并解压到服务器上: +```bash +cd /root +git clone https://gitee.com/yzy0612/nltk_data.git --branch gh-pages +cd 
nltk_data +mv packages/* ./ +cd tokenizers +unzip punkt.zip +cd ../taggers +unzip averaged_perceptron_tagger.zip +``` +之后使用时服务器即会自动使用已有资源,无需再次下载 + +## 3. LlamaIndex HuggingFaceLLM +运行以下指令,把 `InternLM2 1.8B` 软连接出来 +```bash +cd ~/model +ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b/ ./ +``` +运行以下指令,新建一个python文件 +```bash +cd ~/llamaindex_demo +touch llamaindex_internlm.py +``` +打开llamaindex_internlm.py 贴入以下代码 +```python +from llama_index.llms.huggingface import HuggingFaceLLM +from llama_index.core.llms import ChatMessage +llm = HuggingFaceLLM( + model_name="/root/model/internlm2-chat-1_8b", + tokenizer_name="/root/model/internlm2-chat-1_8b", + model_kwargs={"trust_remote_code":True}, + tokenizer_kwargs={"trust_remote_code":True} +) + +rsp = llm.chat(messages=[ChatMessage(content="xtuner是什么?")]) +print(rsp) +``` +之后运行 +```bash +conda activate llamaindex +cd ~/llamaindex_demo/ +python llamaindex_internlm.py +``` +结果为: +![image](https://github.com/Shengshenlan/tutorial/assets/57640594/ac3f481d-cc5b-44be-b281-2cab7289f027) +回答的效果并不好,并不是我们想要的xtuner。 +## 4. LlamaIndex RAG +安装 `LlamaIndex` 词嵌入向量依赖 +```bash +conda activate llamaindex +pip install llama-index-embeddings-huggingface llama-index-embeddings-instructor +``` +运行以下命令,获取知识库 +```bash +cd ~/llamaindex_demo +mkdir data +cd data +git clone https://github.com/InternLM/xtuner.git +mv xtuner/README_zh-CN.md ./ +``` +运行以下指令,新建一个python文件 +```bash +cd ~/llamaindex_demo +touch llamaindex_RAG.py +``` +打开`llamaindex_RAG.py`贴入以下代码 +```python +from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings + +from llama_index.embeddings.huggingface import HuggingFaceEmbedding +from llama_index.llms.huggingface import HuggingFaceLLM + +embed_model = HuggingFaceEmbedding( + model_name="/root/model/sentence-transformer" +) + +Settings.embed_model = embed_model + +llm = HuggingFaceLLM( + model_name="/root/model/internlm2-chat-1_8b", + tokenizer_name="/root/model/internlm2-chat-1_8b", + model_kwargs={"trust_remote_code":True}, + tokenizer_kwargs={"trust_remote_code":True} +) +Settings.llm = llm + +documents = SimpleDirectoryReader("/root/llamaindex_demo/data").load_data() +index = VectorStoreIndex.from_documents(documents) +query_engine = index.as_query_engine() +response = query_engine.query("xtuner是什么?") + +print(response) +``` +之后运行 +```bash +conda activate llamaindex +cd ~/llamaindex_demo/ +python llamaindex_RAG.py +``` +结果为: +![image](https://github.com/Shengshenlan/tutorial/assets/57640594/8d363e3f-edf9-4573-bd58-5b54fd8981df) + +借助RAG技术后,就能获得我们想要的答案了。 + +## 5. 关卡任务 +完成以下任务,并将实现过程记录截图: +- 通过 llamaindex 运行 InternLM2 1.8B,询问“你是谁”,将运行结果截图。 +- 通过 llamaindex 实现知识库检索,询问两个问题将运行结果截图。 + - 问题1:xtuner是什么? + - 问题2:xtuner支持那些模型? + +## 6. 关卡通关文案 +恭喜你,成功通关本关卡!继续加油!你成功使用 LlamaIndex 运行了 InternLM-2 1.8B 模型,并实现了知识库的构建与检索。这为管理和利用大规模知识库提供了强大的工具和方法。接下来,可以进一步优化和扩展功能,以满足更复杂的需求。 From 7a7634e53ebc99afc6fe5c4d6f85eae116163299 Mon Sep 17 00:00:00 2001 From: Shengshenlan <57640594+Shengshenlan@users.noreply.github.com> Date: Sat, 8 Jun 2024 16:14:00 +0800 Subject: [PATCH 014/754] Update llamaindex.md Signed-off-by: Shengshenlan <57640594+Shengshenlan@users.noreply.github.com> --- docs/L5/llamaindex.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/L5/llamaindex.md b/docs/L5/llamaindex.md index ff55b9c54..e21866ca4 100644 --- a/docs/L5/llamaindex.md +++ b/docs/L5/llamaindex.md @@ -9,7 +9,7 @@ ## 1. 
前置知识 正式介绍检索增强生成(Retrieval Augmented Generation,RAG)技术以前,大家不妨想想为什么会出现这样一个技术。 给模型注入新知识的方式,可以简单分为两种方式,一种是内部的,即更新模型的权重,另一个就是外部的方式,给模型注入格外的上下文或者说外部信息,不改变它的的权重。 -第一种方式,改变了模型的权重即进行模型训练,这是一件代价比较大的事情,大语言模型具体的训练过程,可以参考![InternLM2](https://arxiv.org/abs/2403.17297)技术报告。第二种方式,并不改变模型的权重,只是给模型引入格外的信息。类比人类编程的过程,第一种方式相当于你记住了某个函数的用法,第二种方式相当于你阅读函数文档然后短暂的记住了某个函数的用法。 +第一种方式,改变了模型的权重即进行模型训练,这是一件代价比较大的事情,大语言模型具体的训练过程,可以参考[InternLM2](https://arxiv.org/abs/2403.17297)技术报告。第二种方式,并不改变模型的权重,只是给模型引入格外的信息。类比人类编程的过程,第一种方式相当于你记住了某个函数的用法,第二种方式相当于你阅读函数文档然后短暂的记住了某个函数的用法。 ![image](https://github.com/Shengshenlan/tutorial/assets/57640594/5a72331f-1726-4e4e-9a69-75141cfd313e) 对比两种注入知识方式,第二种更容易实现。RAG正是这种方式。它能够让基础模型实现非参数知识更新,无需训练就可以掌握新领域的知识。本次课程选用了LlamaIndex框架。LlamaIndex 是一个上下文增强的 LLM 框架,旨在通过将其与特定上下文数据集集成,增强大型语言模型(LLMs)的能力。它允许您构建应用程序,既利用 LLMs 的优势,又融入您的私有或领域特定信息。 @@ -19,7 +19,7 @@ ![image](https://github.com/Shengshenlan/tutorial/assets/57640594/3785a449-770a-45e1-a7ea-7cfd33a00076) ## 2. 环境、模型准备 ### 2.1 配置基础环境 -这里以在 ![Intern Studio](https://studio.intern-ai.org.cn/) 服务器上部署LlamaIndex为例。 +这里以在 [Intern Studio](https://studio.intern-ai.org.cn/) 服务器上部署LlamaIndex为例。 首先,打开 `Intern Studio` 界面,点击 **创建开发机** 配置开发机系统。 @@ -64,7 +64,7 @@ pip install llama-index==0.10.38 llama-index-llms-huggingface==0.2.0 "transforme ### 2.3 下载 Sentence Transformer 模型 -源词向量模型 ![Sentence Transformer](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2):(我们也可以选用别的开源词向量模型来进行 Embedding,目前选用这个模型是相对轻量、支持中文且效果较好的,同学们可以自由尝试别的开源词向量模型) +源词向量模型 [Sentence Transformer](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2):(我们也可以选用别的开源词向量模型来进行 Embedding,目前选用这个模型是相对轻量、支持中文且效果较好的,同学们可以自由尝试别的开源词向量模型) 运行以下指令,新建一个python文件 ```bash cd ~ @@ -89,7 +89,7 @@ os.system('huggingface-cli download --resume-download sentence-transformers/para conda activate llamaindex python download_hf.py ``` -更多关于镜像使用可以移步至 ![HF Mirror](https://hf-mirror.com/) 查看。 +更多关于镜像使用可以移步至 [HF Mirror](https://hf-mirror.com/) 查看。 2.4 下载 NLTK 相关资源 我们在使用开源词向量模型构建开源词向量的时候,需要用到第三方库 `nltk` 的一些资源。正常情况下,其会自动从互联网上下载,但可能由于网络原因会导致下载中断,此处我们可以从国内仓库镜像地址下载相关资源,保存到服务器上。 From c6c8aa34a53d96c9c75afb14effbb82ced14dfb6 Mon Sep 17 00:00:00 2001 From: Shengshenlan <57640594+Shengshenlan@users.noreply.github.com> Date: Sat, 8 Jun 2024 16:17:38 +0800 Subject: [PATCH 015/754] Update llamaindex.md Signed-off-by: Shengshenlan <57640594+Shengshenlan@users.noreply.github.com> --- docs/L5/llamaindex.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/L5/llamaindex.md b/docs/L5/llamaindex.md index e21866ca4..93b5beef4 100644 --- a/docs/L5/llamaindex.md +++ b/docs/L5/llamaindex.md @@ -1,4 +1,4 @@ -# InternLM+LlamaIndex +# llamaindex+Internlm2 RAG实践 本文将分为以下几个部分来介绍,如何使用 LlamaIndex 来部署 InternLM2 1.8B(以 InternStudio 的环境为例) - 前置知识 From 636e350b1467c028237639bcb8f245b1dcf0377f Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 18 Jun 2024 21:14:37 +0800 Subject: [PATCH 016/754] Create readme.md --- docs/L0/Linux/readme.md | 1 + 1 file changed, 1 insertion(+) create mode 100644 docs/L0/Linux/readme.md diff --git a/docs/L0/Linux/readme.md b/docs/L0/Linux/readme.md new file mode 100644 index 000000000..8b1378917 --- /dev/null +++ b/docs/L0/Linux/readme.md @@ -0,0 +1 @@ + From 704fe96b94249f932588fa1cdf50758036be2f27 Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 18 Jun 2024 21:14:59 +0800 Subject: [PATCH 017/754] Create readme.md --- docs/L0/Python/readme.md | 1 + 1 file changed, 1 insertion(+) create mode 100644 docs/L0/Python/readme.md diff --git 
a/docs/L0/Python/readme.md b/docs/L0/Python/readme.md new file mode 100644 index 000000000..8b1378917 --- /dev/null +++ b/docs/L0/Python/readme.md @@ -0,0 +1 @@ + From 6d01887bfad596d59c9f3c12b33ea7479a72c9e9 Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 18 Jun 2024 21:15:29 +0800 Subject: [PATCH 018/754] Create readme.md --- docs/L0/Git/readme.md | 1 + 1 file changed, 1 insertion(+) create mode 100644 docs/L0/Git/readme.md diff --git a/docs/L0/Git/readme.md b/docs/L0/Git/readme.md new file mode 100644 index 000000000..8b1378917 --- /dev/null +++ b/docs/L0/Git/readme.md @@ -0,0 +1 @@ + From 5252ee3171ada98403cd6247aebae0e1c6e0fa80 Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 18 Jun 2024 21:15:57 +0800 Subject: [PATCH 019/754] Create readme.md --- docs/L0/PyTorch/readme.md | 1 + 1 file changed, 1 insertion(+) create mode 100644 docs/L0/PyTorch/readme.md diff --git a/docs/L0/PyTorch/readme.md b/docs/L0/PyTorch/readme.md new file mode 100644 index 000000000..8b1378917 --- /dev/null +++ b/docs/L0/PyTorch/readme.md @@ -0,0 +1 @@ + From 8df3231e2561824355e8ba7ddd538df706715835 Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 18 Jun 2024 21:18:02 +0800 Subject: [PATCH 020/754] Create readme.md --- docs/L1/LlamaIndex/readme.md | 1 + 1 file changed, 1 insertion(+) create mode 100644 docs/L1/LlamaIndex/readme.md diff --git a/docs/L1/LlamaIndex/readme.md b/docs/L1/LlamaIndex/readme.md new file mode 100644 index 000000000..8b1378917 --- /dev/null +++ b/docs/L1/LlamaIndex/readme.md @@ -0,0 +1 @@ + From 6d4b07cb263e8e0fd162f6bff0fffe493c004496 Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 18 Jun 2024 21:18:22 +0800 Subject: [PATCH 021/754] Create readme.md --- docs/L1/XTuner/readme.md | 1 + 1 file changed, 1 insertion(+) create mode 100644 docs/L1/XTuner/readme.md diff --git a/docs/L1/XTuner/readme.md b/docs/L1/XTuner/readme.md new file mode 100644 index 000000000..8b1378917 --- /dev/null +++ b/docs/L1/XTuner/readme.md @@ -0,0 +1 @@ + From f5ae67f0d199d877051cb94dbcc1e6b956f4249d Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 18 Jun 2024 21:18:50 +0800 Subject: [PATCH 022/754] Create readme.md --- docs/L1/OpenCompass/readme.md | 1 + 1 file changed, 1 insertion(+) create mode 100644 docs/L1/OpenCompass/readme.md diff --git a/docs/L1/OpenCompass/readme.md b/docs/L1/OpenCompass/readme.md new file mode 100644 index 000000000..8b1378917 --- /dev/null +++ b/docs/L1/OpenCompass/readme.md @@ -0,0 +1 @@ + From 10dba5c0aed836ba715300bca36c9b781165e046 Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 18 Jun 2024 21:19:22 +0800 Subject: [PATCH 023/754] Create readme.md --- docs/L1/Prompt/readme.md | 1 + 1 file changed, 1 insertion(+) create mode 100644 docs/L1/Prompt/readme.md diff --git a/docs/L1/Prompt/readme.md b/docs/L1/Prompt/readme.md new file mode 100644 index 000000000..8b1378917 --- /dev/null +++ b/docs/L1/Prompt/readme.md @@ -0,0 +1 @@ + From e07f331b4ed84f167bfb1e51f804a5fa26e20403 Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 18 Jun 2024 21:22:24 +0800 Subject: [PATCH 024/754] Create readme.md --- docs/L2/BadCase/readme.md | 1 + 1 file changed, 1 insertion(+) create mode 100644 docs/L2/BadCase/readme.md diff --git a/docs/L2/BadCase/readme.md b/docs/L2/BadCase/readme.md new file mode 100644 index 000000000..8b1378917 --- /dev/null +++ b/docs/L2/BadCase/readme.md @@ -0,0 +1 @@ + From 9bbbcc660e3af2b0fb2b4bd42430bcdd36f67c48 Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 18 Jun 2024 21:22:45 +0800 Subject: [PATCH 025/754] Create readme.md --- docs/L2/Lagent/readme.md | 1 + 1 file changed, 
1 insertion(+) create mode 100644 docs/L2/Lagent/readme.md diff --git a/docs/L2/Lagent/readme.md b/docs/L2/Lagent/readme.md new file mode 100644 index 000000000..8b1378917 --- /dev/null +++ b/docs/L2/Lagent/readme.md @@ -0,0 +1 @@ + From a41fc3c49651e48c890d334ee42563c8b379e0eb Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 18 Jun 2024 21:23:38 +0800 Subject: [PATCH 026/754] Create readme.md --- docs/L2/Huixiangdou/readme.md | 1 + 1 file changed, 1 insertion(+) create mode 100644 docs/L2/Huixiangdou/readme.md diff --git a/docs/L2/Huixiangdou/readme.md b/docs/L2/Huixiangdou/readme.md new file mode 100644 index 000000000..8b1378917 --- /dev/null +++ b/docs/L2/Huixiangdou/readme.md @@ -0,0 +1 @@ + From 566f9eea5bb284926c1ffbf27691f10107927b22 Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 18 Jun 2024 21:24:15 +0800 Subject: [PATCH 027/754] Create readme.md --- docs/Other/readme.md | 1 + 1 file changed, 1 insertion(+) create mode 100644 docs/Other/readme.md diff --git a/docs/Other/readme.md b/docs/Other/readme.md new file mode 100644 index 000000000..8b1378917 --- /dev/null +++ b/docs/Other/readme.md @@ -0,0 +1 @@ + From 6e9407b0eccfddf08986a05e4903a286304f692c Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 18 Jun 2024 21:30:43 +0800 Subject: [PATCH 028/754] Create readme.md --- docs/L1/LMDeploy/readme.md | 1 + 1 file changed, 1 insertion(+) create mode 100644 docs/L1/LMDeploy/readme.md diff --git a/docs/L1/LMDeploy/readme.md b/docs/L1/LMDeploy/readme.md new file mode 100644 index 000000000..8b1378917 --- /dev/null +++ b/docs/L1/LMDeploy/readme.md @@ -0,0 +1 @@ + From 1a643879db93069bf3ea43b99888723410c01074 Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 18 Jun 2024 21:34:09 +0800 Subject: [PATCH 029/754] Create readme.md --- docs/L1/ToolChain/readme.md | 1 + 1 file changed, 1 insertion(+) create mode 100644 docs/L1/ToolChain/readme.md diff --git a/docs/L1/ToolChain/readme.md b/docs/L1/ToolChain/readme.md new file mode 100644 index 000000000..8b1378917 --- /dev/null +++ b/docs/L1/ToolChain/readme.md @@ -0,0 +1 @@ + From c8ff79a87cd7451472d60210096b2742eadf320f Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 18 Jun 2024 22:01:24 +0800 Subject: [PATCH 030/754] update --- docs/L2/InternVL/readme.md | 0 docs/L2/LMDeploy/readme.md | 0 docs/Other/NLP/readme.md | 0 3 files changed, 0 insertions(+), 0 deletions(-) create mode 100644 docs/L2/InternVL/readme.md create mode 100644 docs/L2/LMDeploy/readme.md create mode 100644 docs/Other/NLP/readme.md diff --git a/docs/L2/InternVL/readme.md b/docs/L2/InternVL/readme.md new file mode 100644 index 000000000..e69de29bb diff --git a/docs/L2/LMDeploy/readme.md b/docs/L2/LMDeploy/readme.md new file mode 100644 index 000000000..e69de29bb diff --git a/docs/Other/NLP/readme.md b/docs/Other/NLP/readme.md new file mode 100644 index 000000000..e69de29bb From 9337cb949adaa428b4363db8eee78628883f2a92 Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 18 Jun 2024 22:06:16 +0800 Subject: [PATCH 031/754] update --- docs/{Other => EasterEgg}/NLP/readme.md | 0 docs/{Other => EasterEgg}/readme.md | 0 2 files changed, 0 insertions(+), 0 deletions(-) rename docs/{Other => EasterEgg}/NLP/readme.md (100%) rename docs/{Other => EasterEgg}/readme.md (100%) diff --git a/docs/Other/NLP/readme.md b/docs/EasterEgg/NLP/readme.md similarity index 100% rename from docs/Other/NLP/readme.md rename to docs/EasterEgg/NLP/readme.md diff --git a/docs/Other/readme.md b/docs/EasterEgg/readme.md similarity index 100% rename from docs/Other/readme.md rename to docs/EasterEgg/readme.md From 
6b7388d46ee153d706dd12625acca43d6e3b073b Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 18 Jun 2024 22:08:13 +0800 Subject: [PATCH 032/754] update --- docs/EasterEgg/StreamerSales/readme.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) create mode 100644 docs/EasterEgg/StreamerSales/readme.md diff --git a/docs/EasterEgg/StreamerSales/readme.md b/docs/EasterEgg/StreamerSales/readme.md new file mode 100644 index 000000000..e69de29bb From ec8588edc80c895edfe1d11a33ad744f8685815a Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 18 Jun 2024 22:13:32 +0800 Subject: [PATCH 033/754] update --- README.md | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index 9b7bb51ca..8cbc73380 100644 --- a/README.md +++ b/README.md @@ -20,13 +20,12 @@ ||关卡名称|关卡负责人|资料| |:-----|:----|:----|:-----| -|第 1 关| 了解书生·浦语大模型全链路开源体系 ||文档、视频、任务| -|第 2 关| Transformer 结构基础知识 | | 文档、视频、任务 | +|第 1 关| 书生大模型全链路开源体系 ||文档、视频、任务| +|第 2 关| 8G 显存部署 InternLM & InternVL | | 文档、视频、任务 | |第 3 关| 浦语提示词工程实践 ||文档、视频、任务| -|第 4 关| InternLM2-1.8B 部署实践 || 文档、视频、任务 | -|第 5 关| InternLM + LlamaInex RAG 实践 || 文档、视频、任务 | +|第 4 关| InternLM + LlamaIndex RAG 实践 || 文档、视频、任务 | |第 6 关| XTuner 微调个人小助手认知 || 文档、视频、任务 | -|第 7 关| InternLM2-Chat-1.8B 模型的能力评测 || 文档、视频、任务 | +|第 7 关| OpenCompass 评测 InternLM-1.8B 实践 || 文档、视频、任务 | @@ -34,8 +33,15 @@ ||关卡名称|关卡负责人|资料| |:-----|:----|:----|:-----| -|第 1 关| 寻找书生·浦语 InternLM2-Chat-20B 的缺陷 ||文档、视频、任务| +|第 1 关| 探索 InternLM 模型能力边界 ||文档、视频、任务| |第 2 关| Lagent 自定义你的 Agent 智能体 | | 文档、视频、任务 | -|第 3 关| LMDeploy 部署 InternVL 浦语灵笔实践 ||文档、视频、任务| -|第 4 关| XTuner 微调你的多模态模型 || 文档、视频、任务 | +|第 3 关| LMDeploy 量化部署进阶实践 ||文档、视频、任务| +|第 4 关| InternVL 多模态模型部署微调实践 || 文档、视频、任务 | |第 5 关| 茴香豆:企业级知识库问答工具|| 文档、视频、任务 | + + +### 彩蛋岛 + +||关卡名称|关卡负责人|资料| +|:-----|:----|:----|:-----| +|第 1 关| 销冠大模型案例 ||文档、视频、任务| \ No newline at end of file From 718a47481734e6f8debf4b7af9a9e742be9ea473 Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 18 Jun 2024 22:17:30 +0800 Subject: [PATCH 034/754] update --- README.md | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/README.md b/README.md index 8cbc73380..6dc9f2084 100644 --- a/README.md +++ b/README.md @@ -9,10 +9,10 @@ ||关卡名称|关卡负责人|资料| |:-----|:----|:----|:-----| -|第 1 关| Linux 基础知识 ||文档、视频、任务| -|第 2 关|Python 基础知识 | | 文档、视频、任务 | -|第 3 关|Git 基础知识||文档、视频、任务| -|第 4 关| Pytorch 基础知识|| 文档、视频、任务 | +|第 1 关| Linux 基础知识 ||[文档](docs/L0/Linux)、视频、任务| +|第 2 关|Python 基础知识 | | [文档](docs/L0/Python)、视频、任务 | +|第 3 关|Git 基础知识||[文档](docs/L0/Git)、视频、任务| +|第 4 关| Pytorch 基础知识|| [文档](docs/L0/PyTorch)、视频、任务 | ### 基础关卡 @@ -20,12 +20,12 @@ ||关卡名称|关卡负责人|资料| |:-----|:----|:----|:-----| -|第 1 关| 书生大模型全链路开源体系 ||文档、视频、任务| -|第 2 关| 8G 显存部署 InternLM & InternVL | | 文档、视频、任务 | -|第 3 关| 浦语提示词工程实践 ||文档、视频、任务| -|第 4 关| InternLM + LlamaIndex RAG 实践 || 文档、视频、任务 | -|第 6 关| XTuner 微调个人小助手认知 || 文档、视频、任务 | -|第 7 关| OpenCompass 评测 InternLM-1.8B 实践 || 文档、视频、任务 | +|第 1 关| 书生大模型全链路开源体系 ||[文档](docs/L1/ToolChain)、视频、任务| +|第 2 关| 8G 显存部署 InternLM & InternVL | | [文档](docs/L1/ToolChain)、视频、任务 | +|第 3 关| 浦语提示词工程实践 ||[文档](docs/L1/Prompt)、视频、任务| +|第 4 关| InternLM + LlamaIndex RAG 实践 || [文档](docs/L1/LlamaIndex)、视频、任务 | +|第 6 关| XTuner 微调个人小助手认知 || [文档](docs/L1/XTuner)、视频、任务 | +|第 7 关| OpenCompass 评测 InternLM-1.8B 实践 || [文档](OpenCompass)、视频、任务 | @@ -33,11 +33,11 @@ ||关卡名称|关卡负责人|资料| |:-----|:----|:----|:-----| -|第 1 关| 探索 InternLM 模型能力边界 ||文档、视频、任务| -|第 2 关| Lagent 自定义你的 Agent 智能体 | | 文档、视频、任务 | -|第 3 关| LMDeploy 量化部署进阶实践 ||文档、视频、任务| -|第 4 关| InternVL 
多模态模型部署微调实践 || 文档、视频、任务 | -|第 5 关| 茴香豆:企业级知识库问答工具|| 文档、视频、任务 | +|第 1 关| 探索 InternLM 模型能力边界 ||[文档](docs/L2/BadCase)、视频、任务| +|第 2 关| Lagent 自定义你的 Agent 智能体 | | [文档](docs/L2/Lagent)、视频、任务 | +|第 3 关| LMDeploy 量化部署进阶实践 ||[文档](docs/L2/LMDeploy)、视频、任务| +|第 4 关| InternVL 多模态模型部署微调实践 || [文档](docs/L2/LMDeploy)、视频、任务 | +|第 5 关| 茴香豆:企业级知识库问答工具|| [文档](docs/L2/Huixiangdou)、视频、任务 | ### 彩蛋岛 From eae6805cfa4e490092167c7a43e9d41d3acc7c2e Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 18 Jun 2024 22:27:57 +0800 Subject: [PATCH 035/754] update --- README.md | 2 +- docs/L1/{LMDeploy => HelloIntern}/readme.md | 0 2 files changed, 1 insertion(+), 1 deletion(-) rename docs/L1/{LMDeploy => HelloIntern}/readme.md (100%) diff --git a/README.md b/README.md index 6dc9f2084..aa8b35383 100644 --- a/README.md +++ b/README.md @@ -21,7 +21,7 @@ ||关卡名称|关卡负责人|资料| |:-----|:----|:----|:-----| |第 1 关| 书生大模型全链路开源体系 ||[文档](docs/L1/ToolChain)、视频、任务| -|第 2 关| 8G 显存部署 InternLM & InternVL | | [文档](docs/L1/ToolChain)、视频、任务 | +|第 2 关| 8G 显存部署 InternLM & InternVL | | [文档](docs/L1/HelloIntern)、视频、任务 | |第 3 关| 浦语提示词工程实践 ||[文档](docs/L1/Prompt)、视频、任务| |第 4 关| InternLM + LlamaIndex RAG 实践 || [文档](docs/L1/LlamaIndex)、视频、任务 | |第 6 关| XTuner 微调个人小助手认知 || [文档](docs/L1/XTuner)、视频、任务 | diff --git a/docs/L1/LMDeploy/readme.md b/docs/L1/HelloIntern/readme.md similarity index 100% rename from docs/L1/LMDeploy/readme.md rename to docs/L1/HelloIntern/readme.md From db268a483d463c0a94996396537bd1b0cbe1766a Mon Sep 17 00:00:00 2001 From: HinGwenWoong Date: Mon, 24 Jun 2024 10:52:25 +0800 Subject: [PATCH 036/754] [camp3 Feature] Add streamer-sales to camp3 EasterEgg (#771) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Add streamer-sales doc * 完善文档 * Improve doc * Improve doc * Improve doc --- README.md | 2 +- docs/EasterEgg/StreamerSales/readme.md | 617 +++++++++++++++++++++++++ 2 files changed, 618 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index aa8b35383..21560b70a 100644 --- a/README.md +++ b/README.md @@ -44,4 +44,4 @@ ||关卡名称|关卡负责人|资料| |:-----|:----|:----|:-----| -|第 1 关| 销冠大模型案例 ||文档、视频、任务| \ No newline at end of file +|第 1 关| 销冠大模型案例 ||[文档](docs/EasterEgg/StreamerSales)、视频、任务| \ No newline at end of file diff --git a/docs/EasterEgg/StreamerSales/readme.md b/docs/EasterEgg/StreamerSales/readme.md index e69de29bb..c184a4085 100644 --- a/docs/EasterEgg/StreamerSales/readme.md +++ b/docs/EasterEgg/StreamerSales/readme.md @@ -0,0 +1,617 @@ +# 销冠 —— 卖货主播大模型案例 + + +## 📢 简介 + +[Streamer-Sales 销冠 —— 卖货主播大模型](https://github.com/PeterH0323/Streamer-Sales),是一个能够根据给定的商品特点从激发用户购买意愿角度出发进行商品解说的卖货主播大模型。以其独特的智能魅力,将彻底改变用户的购物体验。该模型能深度理解商品特点,以生动、精准的语言为商品量身打造解说词,让每一件商品都焕发出诱人的光彩。无论是细节之处,还是整体效果,都能通过其细腻、独到的解说,激发用户的购买欲望。 + +

+ 架构图 +

+ +项目主要功能点: + +- 📜 主播文案一键生成 +- 🚀 KV cache + Turbomind 推理加速 +- 📚 RAG 检索增强生成 +- 🎙️ ASR 语音转文字输入 +- 🔊 TTS 文字转语音输出 +- 🦸 数字人解说视频生成 +- 🌐 Agent 使用网络查询实时快递等信息 + +让主播不止于文字介绍。 + +感谢上海人工智能实验室 **书生·浦语大模型实战营** 的 **干货课程、全方位的工具链 和 算力支持**!让我这个有满腔热血但是没有算力的个人开发者也可以上岸大模型领域! + +本项目全部代码均已开源,大家可以过来看看,如果觉得项目做的不错,请点个 star ⭐(疯狂暗示),⭐ 是给我最大的鼓励,谢谢!地址: https://github.com/PeterH0323/Streamer-Sales + +## 📌 目录 + +- [销冠 —— 卖货主播大模型案例](#销冠--卖货主播大模型案例) + - [📢 简介](#-简介) + - [📌 目录](#-目录) + - [🖼 演示](#-演示) + - [🛠 环境搭建](#-环境搭建) + - [📜 微调数据](#-微调数据) + - [主播性格](#主播性格) + - [产品信息](#产品信息) + - [用户可能提问](#用户可能提问) + - [数据集生成 Prompt](#数据集生成-prompt) + - [启动生成](#启动生成) + - [📚 RAG 说明书数据生成](#-rag-说明书数据生成) + - [🎨 XTuner 微调 InternLM2](#-xtuner-微调-internlm2) + - [🦸 数字人](#-数字人) + - [1. 简介](#1-简介) + - [2. 环境搭建](#2-环境搭建) + - [3. Workflow 详解](#3-workflow-详解) + - [4. 配置视频路径](#4-配置视频路径) + - [🔊 TTS \& 🎙️ ASR](#-tts--️-asr) + - [🌐 Agent](#-agent) + - [🚀 量化 \& 部署](#-量化--部署) + - [结语](#结语) + + +## 🖼 演示 + +**在线体验地址**:https://openxlab.org.cn/apps/detail/HinGwenWong/Streamer-Sales + +

+ Demo gif +

+ +

+ Demo + Demo + Demo + Demo +

+ +本项目全部代码均已开源,大家可以过来看看,如果觉得项目做的不错,请点个 star ⭐(疯狂暗示),⭐ 是给我最大的鼓励,谢谢!地址: https://github.com/PeterH0323/Streamer-Sales + +## 🛠 环境搭建 + +```bash +git clone https://github.com/PeterH0323/Streamer-Sales.git +cd Streamer-Sales +studio-conda -t streamer-sales -o pytorch-2.1.2 +conda activate streamer-sales +pip install -r requirements.txt +``` + +安装需要花费一点时间,请耐心等待 + +## 📜 微调数据 + +相信很多小伙伴在接触大模型微调的第一个拦路虎就是微调数据从哪来,这个之前也一直困扰我,下面我来介绍一下我的方法,希望能够给到各位一点启发。 + +数据集生成有关的配置都在 [configs/conversation_cfg.yaml](https://github.com/PeterH0323/Streamer-Sales/blob/main/configs/conversation_cfg.yaml) 中, + +下面为大家讲解下里面的配置,可以从架构图看到我对数据集的设计,其共有 4 大组成部分: + +

+ gen_data +

+ +- 主播性格 +- 产品类型 +- 用户可能问到的问题 +- 数据格式化成微调 json 以及自我认知 + +### 主播性格 + +首先来说下主播的性格,乐乐喵主播是一个可爱的女主播,她会称呼客户为【家人们】,等等的性格,我将其编入到 dataset yaml 中,我们可以看到性格配置: + +详见:[configs/conversation_cfg.yaml L54-L60](https://github.com/PeterH0323/Streamer-Sales/blob/7184090b7009bbf0acbaf71872c5c1f45bcd5ec0/configs/conversation_cfg.yaml#L54-L60) + +```yaml +# 角色及其性格 +role_type: + 乐乐喵: # 萝莉 + - 甜美 + - 可爱 + - 熟练使用各种网络热门梗造句 + - 称呼客户为[家人们] +``` + +这就是 乐乐喵 的性格 prompt,有了性格之后,LLM 会更加有血有肉。 + +### 产品信息 + +我是用了两个 prompt 去问商业大模型,下面是我的 prompt + +> 第一个 prompt: 帮我列举10种常用的消费品种类,并每种举例5个其子类 +> +> 每个类 prompt: 现在你精通任何产品,你可以帮我举例每个产品的6个亮点或特点,, 然后用python dict形式输出:{类名:[特点1, 特点2] ...} ,去掉特点12的字样,除python字典外的其他都不要输出,不要有任何的警告信息。 [xxx] + +这就是我的产品列表的雏形 + +详见:[configs/conversation_cfg.yaml L80-L390](https://github.com/PeterH0323/Streamer-Sales/blob/7184090b7009bbf0acbaf71872c5c1f45bcd5ec0/configs/conversation_cfg.yaml#L80-L390) + +```yaml +product_list: + 个人护理与美妆: + 口腔护理: + 漱口水: [深度清洁, 消除口臭, 抗菌消炎, 提神醒齿, 旅行装方便, 口感舒适] + 牙刷: [软毛设计, 有效清洁, 不同刷头适应不同需求, 防滑手柄, 定期更换刷头, 便携式包装] + 牙线: [清除牙缝食物残渣, 预防牙周病, 细密设计适合各种牙缝, 便于携带, 独立包装卫生, 无损牙齿表面] + 牙膏: [清洁牙齿, 防止蛀牙, 清新口气, 多种口味选择, 易于携带, 温和不刺激] + 电动牙刷: [高效清洁, 减少手动压力, 定时提醒, 智能模式调节, 无线充电, 噪音低] + 彩妆: + 口红: [丰富色号, 滋润保湿, 显色度高, 持久不脱色, 易于涂抹, 便携包装] + 眼影: [多色搭配, 细腻质地, 持久不掉色, 提升眼部层次, 防水防汗, 专业级效果] + 睫毛膏: [浓密增长, 卷翘持久, 纤维纤长, 防水防泪, 易卸妆, 速干配方] + 粉底液: [轻薄透气, 遮瑕力强, 持久不脱妆, 适合各种肤质, 调整肤色, 携带方便] + 腮红: [自然提亮, 持久显色, 多种色调, 易于上妆, 适合日常和特殊场合, 温和不刺激] + 护肤: + 洁面乳: [温和清洁, 深层卸妆, 适合各种肤质, 易冲洗, 保湿滋润, 无刺激] + 爽肤水: [收缩毛孔, 平衡肌肤酸碱度, 清爽不油腻, 补充水分, 调理肌肤状态, 快速吸收] + 精华液: [高浓度活性成分, 深度滋养, 改善肤质, 淡化细纹, 提亮肤色, 修复功效] + 面膜: [密集滋养, 深层补水, 急救修复, 快速见效, 定期护理, 多种类型选择] + 面霜: [锁水保湿, 持久滋润, 防晒隔离, 抗衰老, 适合四季使用, 易于推开涂抹] + + .... +``` + + +商品的大类,再到小类,最后到具体的细分类别,细分类别后面跟着的对应的商品亮点,这也是 LLM 在回答的时候需要参考的地方,可以让文案更加丰富,更加贴近商品,激发用户的购买欲望。 + +### 用户可能提问 + +我们试想一下,主播在输出了自己的文案之后,客户肯定会去提问题,所以我举例了 10 个用户可能问到的问题的方向,生成的这些问题的 prompt 也在这里标注好了 + +详见:[configs/conversation_cfg.yaml L67-L78](https://github.com/PeterH0323/Streamer-Sales/blob/7184090b7009bbf0acbaf71872c5c1f45bcd5ec0/configs/conversation_cfg.yaml#L67-L78) + +```yaml +# prompt: 购买东西时候,客户常会问题的问题,举例10个, 只列举大类就行 +customer_question_type: + - 价格与优惠政策 + - 产品质量与性能 + - 尺寸与兼容性 + - 售后服务 + - 发货与配送 + - 用户评价与口碑 + - 包装与附件 + - 环保与安全 + - 版本与型号选择 + - 库存与补货 +``` + +### 数据集生成 Prompt + +我们来看下配置文件最核心的部分,就是如何生成 prompt 给到商用大模型的,这里配置了每个对话的条目,以及生成数据集的细节: + +详见:[configs/conversation_cfg.yaml L1-L46](https://github.com/PeterH0323/Streamer-Sales/blob/7184090b7009bbf0acbaf71872c5c1f45bcd5ec0/configs/conversation_cfg.yaml#L1-L46) + +```yaml +# 对话设置 +conversation_setting: + + system: "现在你是一位金牌带货主播,你的名字叫{role_type},你的说话方式是{character}。你能够根据产品信息讲解产品并且结合商品信息解答用户提出的疑问。" + first_input: "我的{product_info},你需要根据我给出的商品信息撰写一段直播带货口播文案。你需要放大商品的亮点价值,激发用户的购买欲。" + + +# 数据集生成设置 +data_generation_setting: + + # 每个产品生成 ${each_product_gen} 个 conversion 数据,conversion 中包含【文案 + QA】, + each_product_gen: 3 + + # 每个 conversion 中的的对话数,文案为 1 个,其余会生成 ${each_conversation_qa} - 1 个 QA + each_conversation_qa: 5 + + # 每个文案生成随机抽取 ${each_pick_hightlight} 个亮点 + each_pick_hightlight: 3 + + # 每个文案生成后随机抽取 ${each_pick_hightlight} 个问题生成用户的提问 + each_pick_question: 3 + + # 数据集生成 prompt + dataset_gen_prompt: 现在你是一位金牌带货主播,你的名字叫{role_type},你的说话方式是{character}。 + 我的{product_info},你需要根据我给出的商品信息撰写一段至少600字的直播带货口播文案。你需要放大商品的亮点价值,激发用户的购买欲。 + 输出文案后,结合商品信息站在消费者的角度根据[{customer_question}]提出{each_conversation_qa}个问题并解答。 + 全部输出的信息使用我期望的 json 格式进行输出:{dataset_json_format}。注意 json 一定要合法。 + + # 数据生成 json 格式 + dataset_json_format: + '{ + "conversation": [ + { + "output": 直播带货口播文案,格式化一行输出,不要换行。 + }, + { + "input": 消费者的问题, + "output": 主播回答 
+ }, + { + "input": 消费者的问题, + "output": 主播回答 + }, + ... 直到问题结束 + ] + }' + + +``` + +### 启动生成 + +有了上面的 prompt 之后,下一步很简单,就是调用商用大模型让其生成。 + +在这我解释下为什么我调用商业大模型来进行生成。虽然本地部署模型然后推理也是可以的,但是生成好数据的前提是模型参数量要足够大,如果本地没有显存,压根没办法部署大参数量的模型,更别说质量了,所以我这里直接调用商用最大的最好的模型,在源头确保我的数据质量比较好。 + +我们需要要购买 token,我当初生成数据集的时候,加上赠送的 token,大概只花了100多块。当然,如果有更多的预算,可以生成更多的数据,数据集肯定不会嫌多的哈哈。 + +我们首先需要获取模型的 api key,填入 [./configs/api_cfg.yaml](https://github.com/PeterH0323/Streamer-Sales/blob/main/configs/api_cfg.yaml) 对应的位置 + +然后使用以下脚本就可以直接生成了: + +```bash +cd /path/to/Streamer-Sales/dataset/gen_dataset +python gen_dataset.py ${model_type} +``` + +`model_type` 可以是通义千问(qwen)、文心一言(ernie) + +如果需要指定某一个角色数据的生成在命令后面加上 `--specific_name xxx` + +```bash +python gen_dataset.py ${model_type} --specific_name 乐乐喵 +``` + +执行之后,会在 `dataset/trainval_dataset/response` 生成对应模型的数据集 + +```bash +./dataset/trainval_dataset/response +├── ernie_乐乐喵_train.json +└── qwen_乐乐喵_train.json +``` + +4. 进行数据清洗并合并,以及生成自我认知数据 + +```bash +python merge_dataset.py dataset/gen_dataset/trainval_dataset/response dataset/gen_dataset/trainval_dataset/train.jsonl +``` + +生成之后可以在 ${output_path} 设置的路径看到 jsonl 文件,脚本会自动加上 conversion 的个数,例如 `1479_train.jsonl` ,jsonl 数据集样例如下: + +```json + { + "conversation": [ + { + "system": "现在你是一位金牌带货主播,你的名字叫乐乐喵,你的说话方式是甜美、可爱、熟练使用各种网络热门梗造句、称呼客户为[家人们]。你能够根据产品信息讲解产品并且结合商品信息解答用户提出的疑问。", + "input": "我的商品名是[花草茶],商品的亮点是[多种口味选择、舒缓情绪、天然植物],你需要根据我给出的商品信息撰写一段直播带货口播文案。你需要放大商品的亮点价值,激发用户的购买欲。", + "output": "家人们!今天小萝莉给你们带来一款超赞の宝贝——【花草茶】!这款茶可是咱们养生养心、放松心情的神器哦~多样化的口味选择,让你们每天都能尝鲜不腻,甜蜜生活从一杯花香四溢、果香扑鼻的花草茶开始!喝一口,仿佛置身于大自然的怀抱中,满满的天然植物精华,让你的身体和心灵都得到最温柔的呵护。而且呀,它还能舒缓我们日常的小情绪,压力大、焦虑多?来杯花草茶,瞬间治愈,把烦恼统统泡走!这么神仙的好物,家人们确定不来一盒吗?买它!买它!买它!" + }, + { + "input": "这款花草茶真的有那么多种口味可以选择吗?会不会只是噱头?", + "output": "哎呀,这位家人可真是细心呢!我们的花草茶绝对不是虚张声势哦~每一种口味都是精心调配,真实还原各类花卉与水果的自然风味,像玫瑰红茶、柠檬马鞭草、蜜桃乌龙……光听名字就让人垂涎三尺啦!你可以根据自己的喜好随心挑选,或者干脆all in,每天都有新惊喜!相信我,无论你是甜党还是清新派,总有一款会成为你的挚爱饮品!" + }, + { + "input": "听说有些花草茶添加了人工香精,这款茶安全吗?有没有农药残留什么的?", + "output": "放心吧,家人们!咱们这款花草茶可是妥妥的绿色纯天然,坚决对人工香精、防腐剂等添加剂说‘NO’!所有的原料都来自严格筛选的优质种植基地,全程无农药、无污染,确保每一朵花、每一片叶都纯净如初。出厂前还要经过严格的质量检测,安全标准远高于行业要求,只为了让大家安心享受这份大自然的馈赠。所以,无论是自己享用,还是送亲朋好友,选它准没错!" + }, + { + "input": "我在网上看到有人反馈这款花草茶断货很快,我想买的话会不会抢不到啊?", + "output": "理解理解,好东西大家都想拥有嘛!不过请家人们放宽心,虽然我们的花草茶确实人气爆棚,但咱后台一直密切关注库存动态,及时做好补货工作。只要您在直播间下单,我们会确保您的订单能够顺利发出。如果万一出现临时缺货的情况,我们也会第一时间通知您,并优先为您安排发货,绝不会让您等待太久哒!所以,看中了就赶紧加入购物车吧,手慢无哦!" + }, + { + "input": "这个品牌的花草茶口碑怎么样?消费者评价如何?", + "output": "嘿嘿,看来这位家人做足了功课呢!咱们这款花草茶在消费者中的口碑那可是杠杠滴!打开各大电商平台,满眼都是五星好评,大家纷纷称赞它的口感醇厚、香气宜人,还有很多人表示喝了之后感觉身心舒畅,压力得到了有效缓解。甚至有不少回头客,一次又一次回购,把它推荐给身边的亲朋好友。看到这些真诚的评价,我就知道我没有选错产品来分享给大家。家人们,群众的眼睛是雪亮的,跟着大家买准没错!" + } + ] + }, +``` + + +自我认知数据也会一并生成: + +```json + ..., + { + "conversation": [ + { + "input": "你是谁", + "output": "家人们,前方高萌预警! 金牌带货主播乐乐喵闪亮登场,我是你们的购物导航仪,带你们穿越茫茫商海,直达心头好。锁定今晚直播,一起开启剁手狂欢夜!" + } + ] + }, + { + "conversation": [ + { + "input": "你好", + "output": "家人们,你们的甜心主播乐乐喵已加载完毕,等待你们一键签收! 无论你是追求性价比的大佬,还是热衷尝鲜的小白,我都将用最专业的推荐、最甜美的解说,帮你找到心仪之选。记得收藏直播间,共享购物乐趣!" + } + ] + }, + { + "conversation": [ + { + "input": "介绍下你自己", + "output": "家人们,你们的快乐购物时光由乐乐喵我守护! 金牌带货主播在此,用满满的元气与甜度,为你们搜罗全网爆款,解读潮流密码。今晚8点,我们在直播间甜蜜相约,一起嗨购不停歇!" + } + ] + }, + ... +``` + +目前这套是调用商用大模型的 api,有不少用户反馈说希望能能够接入本地部署的模型,后续我会支持调用本地模型去生成数据集,这样如果有资源的小伙伴就不用氪金了,同时生成的速度也可以加快 + +以上,就是微调数据集生成部分的内容。 + +## 📚 RAG 说明书数据生成 + +

+ gen_ocr +

+ + +下面来说下 RAG 数据库生成的逻辑 + +目前我用到的 RAG是 借鉴 豆哥([茴香豆](https://github.com/InternLM/HuixiangDou))的,(感谢豆哥及其开发者们的无私开源),前面的课程已经详细介绍了豆哥,我们就直接进入主题,看下说明书数据库的初始文件是怎么生成的。 + +对于个人开发者来说,没有详细的说明书,所以我们可用爬虫简单地将网上的图片爬下来,如果量比较小就直接截图,因为我这里比较少量,所以我就直接截图了 + +拿到图片之后我们需要将每个图片的字抠出来,这里我用的是 ppocr 进行抠字,脚本我也进行了开源,会把文件夹下的所有图片的字都抠出来,然后送到大模型去总结。 + +下面说下详细操作: + +1. 搭建环境 + +这里用到 ppocr 工具来进行 ocr 识别,在这里我另外生成了一个虚拟环境,避免有版本冲突 +```bash +conda create -n ppocr python=3.8 +conda activate ppocr + +pip install paddlepaddle-gpu -i https://mirror.baidu.com/pypi/simple +pip install paddleocr==2.7.3 +``` + +2. 将网上下载图片 or 自己的图片命名成商品名称(要英文 or 拼音)整理到一个文件夹中,如果有自己的说明书,则下一步改为直接运行 `gen_instructions.py` 中的 `gen_instructions_according_ocr_res` 这个方法即可 + +3. 获取 kimi 的 api key,并填入 [./configs/api_cfg.yaml](https://github.com/PeterH0323/Streamer-Sales/blob/main/configs/api_cfg.yaml) 对应的位置 + +4. 识别文字 & 使用 LLM 总结生成 markdown 文件 + +```bash +cd ./dataset/gen_instructions +python gen_instructions.py --image_dir /path/to/image_dir --ocr_output_dir ./ocr_res --instruction_output_dir ./instructions +``` + +这里有个细节,因为 ppocr 最大边是 960 的,如果从网上下载的图片太长,直接送进去会导致失真严重,所以我会对图片进行长边裁剪,然后再进行检测识别,这样会更好一些。 + +
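+下面给出一个“长边切块 + 逐块识别”的最小 Python 示意(仅用于说明思路,并非项目中 `gen_instructions.py` 的源码;切块高度取 960、函数名 `ocr_long_image` 等均为示例假设):
+
+```python
+import cv2
+from paddleocr import PaddleOCR
+
+
+def ocr_long_image(image_path: str, crop_size: int = 960) -> str:
+    """沿高度方向把长图切成若干块,逐块 OCR 后拼接全部文本。"""
+    ocr = PaddleOCR(use_angle_cls=True, lang="ch")
+    image = cv2.imread(image_path)
+    height = image.shape[0]
+
+    texts = []
+    for top in range(0, height, crop_size):
+        # 每次只送入一块,避免整图被压缩到最大边 960 以内导致文字失真
+        crop = image[top: top + crop_size, :, :]
+        result = ocr.ocr(crop, cls=True)
+        if result and result[0]:
+            texts.extend(line[1][0] for line in result[0])
+    return "\n".join(texts)
+```
+
+实际使用时还可以让相邻两块保留少量重叠区域,避免恰好把一行文字切成上下两半。
+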

+ ocr_cut +

+ +调取上面的脚本会生成 OCR 识别结果,以及最终的 markdown 说明书文件。`ocr_output_dir` 里面会生成 `work_dir` 文件夹,里面有识别结果图。 + +RAG 数据库的生成,会在 web app 启动的时候自动去读取配置文件里面每个产品的说明书路径去生成,无需手动操作了。 + +## 🎨 XTuner 微调 InternLM2 + +将数据集路径配置好,改下模型的路径,训练启动!丝滑!就是这么爽! [XTuner](https://github.com/InternLM/xtuner) 牛逼! + +1. 将 `/path/to/Streamer-Sales/finetune_configs/internlm2_chat_7b/internlm2_chat_7b_qlora_custom_data.py` 中 数据集路径 和 模型路径 改为您的本地路径 + +```diff +# Model +- pretrained_model_name_or_path = 'internlm/internlm2-7b' ++ pretrained_model_name_or_path = '/path/to/internlm/internlm2-7b' # 这步可选,如果事先下载好了模型可以直接使用绝对路径 + +# Data +- data_path = 'timdettmers/openassistant-guanaco' ++ data_path = '/path/to/data.jsonl' # 数据集步骤生成的 json 文件绝对路径 +prompt_template = PROMPT_TEMPLATE.default +max_length = 2048 +pack_to_max_length = True +``` + +2. 进行训练: + +```bash +cd /path/to/Streamer-Sales +conda activate streamer-sales +xtuner train finetune_configs/internlm2_chat_7b/internlm2_chat_7b_qlora_custom_data.py --deepspeed deepspeed_zero2 +``` + +注意:如果显存不够了,优先调小 `batch_size`, 如果 `bs = 1` 还不够则调小 `max_length`,反之还剩很多,调大这两个值 + + +## 🦸 数字人 + +卖货主播的数字人其实市面上已经很多了,目前比较成熟的赛道是直接使用真人录制好的视频,然后 TTS 之后直接生成口型贴到人脸上,这种方法可控性强,而且获得成本低,已经大量推广了。 + +但是,出于对技术的追求,我想用 SD 来生成视频哈哈哈哈,如果您对 SD 生成视频不是很感兴趣,可以直接使用现成的 mp4 录制视频,修改 [utils/web_config.py](https://github.com/PeterH0323/Streamer-Sales/blob/b4708a1936f2592218fce548df67194a78ae0177/utils/web_configs.py#L78) 就可以了 + +### 1. 简介 + +这里我使用了 [ComfyUI](https://github.com/comfyanonymous/ComfyUI) 来进行生成。一开始做的时候我也是一头雾水,自学了几天,在查阅资料学习的时候,我发现艺术行业已经和以前有了翻天覆地的变化,很多设计师已经开始用 SD 来赋能他们的工作了。随着我一步步的学习,逐步上手 ComfyUI 了, + +下面我来介绍下我的 workflow + +

+ workflow +

+ +我的 Workflow 具有以下功能点: + +- 生成人像图 +- DW Pose 生成骨骼图 +- ControlNet 控制人物姿态 +- AnimateDiff 生成视频 +- 插帧提升帧率 +- 提升分辨率 + + +### 2. 环境搭建 + +1. ComfyUI 环境搭建 + +```bash +git clone https://github.com/comfyanonymous/ComfyUI.git +studio-conda -t comfyui-streamer-sales -o pytorch-2.1.2 +conda activate comfyui-streamer-sales +pip install -r requirements.txt +``` + +测试安装 + +```bash +cd ComfyUI +python main.py +``` + +2. 模型下载 + +执行脚本 `python download_models.py` 即可下载本项目需要用到的全部权重 + +3. 插件安装 + +首先需要手动拉取下【插件管理器】 + +```bash +cd ComfyUI/custom_nodes +git clone https://github.com/ltdrdata/ComfyUI-Manager.git +``` + +重启 ComfyUI,刷新页面,点击右下角 【管理器】->【安装缺失节点】即可。 + +### 3. Workflow 详解 + +1. 生成人像图 + +

+ workflow +

+ +首先我们来说下基本的文生图流程,首先加入 sd checkpoint ,和 vae 模型,vae 可选,但 sd 是必须的,如果觉得我这个模型不好,可以自行去 c站 找大佬微调好的模型, + +填写好正向词和反向词,接个 KSampler 就可以生成人像了 + +2. DW Pose 生成骨骼图 & ControlNet 控制人物姿态 + +

+ workflow +

+ +人物生成好了,下一步要生成特定的动作的话,有时候语言很难描述,我们需要借助 ControlNet 来结合 pose 的姿态图来让 sd 生成特定动作的任务,这就是左下角的作用(在这里说下, pose 的用的是 mmpose 框架,OpenMMLab 牛逼!) + +3. AnimateDiff 生成视频 + +

+ workflow +
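+
+AnimateDiff 的思路可以用下面这段 diffusers 示意代码来理解:给 SD 底模挂一个 motion adapter,一次生成一小段多帧动画(权重名、帧数均为假设,实际效果以 ComfyUI 里的 AnimateDiff 节点为准):
+
+```python
+# 示意:SD 底模 + 运动模块 = 多帧短动画
+import torch
+from diffusers import AnimateDiffPipeline, MotionAdapter
+from diffusers.utils import export_to_gif
+
+adapter = MotionAdapter.from_pretrained(
+    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
+pipe = AnimateDiffPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5",
+    motion_adapter=adapter,
+    torch_dtype=torch.float16,
+).to("cuda")
+
+result = pipe(
+    prompt="a female streamer waving to the camera",
+    num_frames=16,            # 一段 16 帧的短动画
+    num_inference_steps=25,
+)
+export_to_gif(result.frames[0], "digital_human.gif")
+```
+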

+
+这两块搞好之后,可以看到人物已经按特定的动作生成出来了。下面我们让画面动起来,用到的算法是 AnimateDiff,把相应的节点简单地串起来就可以了。
+
+4. 插帧提升帧率
+
+<p align="center">

+ workflow +
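+
+插帧这件事本身不依赖 ComfyUI,用 ffmpeg 的运动补偿滤镜也能说明思路(下面只是示意,文件名、帧率都是假设值;工作流里用的是对应的插帧节点):
+
+```python
+# 示意:把 8 帧/秒的生成结果插帧到 24 帧/秒,让动作更顺滑
+import subprocess
+
+subprocess.run([
+    "ffmpeg", "-y",
+    "-i", "digital_human_8fps.mp4",
+    "-vf", "minterpolate=fps=24:mi_mode=mci",   # 运动补偿插帧
+    "digital_human_24fps.mp4",
+], check=True)
+```
+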

+
+我们把生成的图片合成为视频,原始只有 8 帧,所以再对它做一次插帧,让视频更加丝滑,这就是右上角那组节点的功能。
+
+5. 提升分辨率
+
+<p align="center">

+ workflow +
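+
+放大这一步同理,下面用 ffmpeg 简单示意“把 512 的视频再放大一档”(文件名、目标分辨率均为假设;实际工作流里用放大/超分节点效果会更好):
+
+```python
+# 示意:把 512x512 的视频放大到 1024x1024
+import subprocess
+
+subprocess.run([
+    "ffmpeg", "-y",
+    "-i", "digital_human_24fps.mp4",
+    "-vf", "scale=1024:1024:flags=lanczos",
+    "digital_human_1024.mp4",
+], check=True)
+```
+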

+
+因为 SD 1.5 默认的输出是 512 x 512,我们还要再做一次放大(scale),让分辨率高一点,这就是右下角那组节点的功能。
+
+### 4. 配置视频路径
+
+生成好 mp4 之后,我们就可以修改配置 [web_configs](https://github.com/PeterH0323/Streamer-Sales/blob/7184090b7009bbf0acbaf71872c5c1f45bcd5ec0/utils/web_configs.py#L78) 中的 `DIGITAL_HUMAN_VIDEO_PATH` 参数,后续就会用这个视频来生成口型了。
+
+```diff
+- DIGITAL_HUMAN_VIDEO_PATH: str = r"./doc/digital_human/lelemiao_digital_human_video.mp4"
++ DIGITAL_HUMAN_VIDEO_PATH: str = r"新生成的 mp4 路径"
+```
+
+## 🔊 TTS & 🎙️ ASR
+
+<p align="center">

+ asr_tts +
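+
+这一节要做的事情是让主播能“听到”用户说话,并用语音回复。先用一段占位式的 Python 勾勒整个数据流向(三个函数都是假设的占位实现,并非项目源码,具体的 ASR / TTS 方案以仓库代码为准):
+
+```python
+# 示意:语音交互的最小闭环:听(ASR)-> 想(LLM)-> 说(TTS),函数均为占位
+def transcribe(wav_path: str) -> str:
+    """ASR 占位:实际应调用语音识别模型,把音频转成文字。"""
+    return "这款洗衣液有什么优点?"
+
+def chat(question: str) -> str:
+    """LLM 占位:实际应调用微调后的主播大模型。"""
+    return f"关于{question},它的优点是亲肤低敏、易漂洗。"
+
+def synthesize(text: str, wav_path: str) -> None:
+    """TTS 占位:实际应把文字合成为主播音色的语音。"""
+    print(f"[TTS] 写入 {wav_path}: {text}")
+
+def handle_voice_message(in_wav: str, out_wav: str) -> str:
+    question = transcribe(in_wav)   # 1. 听:语音识别成文字
+    answer = chat(question)         # 2. 想:大模型生成回复
+    synthesize(answer, out_wav)     # 3. 说:文字转语音播出
+    return answer
+
+if __name__ == "__main__":
+    handle_voice_message("user_question.wav", "streamer_answer.wav")
+```
+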

+
+目前和 LLM 的交互基本只发生在屏幕上,我们只能“看”。我就在想,能不能让听觉也一起参与进来,这样可能会更有趣,所以这里我把 TTS(文字转语音)和 ASR(语音识别)也集成了进来。
+
+## 🌐 Agent
+
+<p align="center">

+ agent +
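+
+这一节要解决的是“实时信息”的问题(比如快递到哪了、今天天气如何)。先用一段与具体框架无关的示意代码,说明“模型输出插件调用 → 解析 → 调用工具 → 把结果回填给模型”这条链路;插件 token 的真实格式以下文的 lagent 和模型对话模板为准,这里的解析方式纯属假设:
+
+```python
+# 示意:解析模型输出中的插件调用并分发到对应工具,仅表达流程,并非 lagent 源码
+import json
+
+def query_weather(city: str) -> str:
+    return f"{city}:多云,25 摄氏度"          # 占位:实际应调用天气 API
+
+def query_delivery_time(order_id: str) -> str:
+    return f"订单 {order_id} 预计 2 天后送达"   # 占位:实际应调用快递时效 API
+
+TOOLS = {"weather": query_weather, "delivery_time": query_delivery_time}
+
+def run_plugin(llm_output: str) -> str:
+    """假设模型输出形如:<|plugin|>{"name": "weather", "args": {"city": "上海"}}"""
+    if "<|plugin|>" not in llm_output:
+        return llm_output               # 不需要工具,直接作为回复
+    call = json.loads(llm_output.split("<|plugin|>", 1)[1])
+    tool_result = TOOLS[call["name"]](**call["args"])
+    # 实际流程中,tool_result 会拼回对话上下文,让大模型组织最终的口播回复
+    return tool_result
+
+if __name__ == "__main__":
+    print(run_plugin('<|plugin|>{"name": "weather", "args": {"city": "上海"}}'))
+```
+
+项目里实际用的是 lagent 的插件机制,具体流程见下文。
+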

+ +如果我问大模型,我的快递到哪里了,RAG 是查不到的,因为这是实时的,所以这就要接入 Agent plugin 的工具了,目前参考的是 [lagent](https://github.com/InternLM/lagent) 项目,相信大家之前也接触过,首先会生成提示词和工具提示词,加上客户的问题给到 LLM ,大模型会输出特定的 Token `<|plugin>` 告知后面需要调用的 plugin 名称,然后进行传值调用就可以了, + +目前我接入了天气查询和快递预计时间查询,可以让主播根据实时天气和快递时间回答用户问题,这里接入天气是因为一些极端天气会导致快递延误,大模型有了天气信息的加持可以做到提醒客户配送可能会延时。 + +## 🚀 量化 & 部署 + +1. 将 pth 转为 HF 格式的模型 + +```bash +xtuner convert pth_to_hf ./finetune_configs/internlm2_chat_7b_qlora_custom_data.py \ + ./work_dirs/internlm2_chat_7b_qlora_custom_data/iter_340.pth \ + ./work_dirs/internlm2_chat_7b_qlora_custom_data/iter_340_hf +``` + +2. 将微调后的模型和源模型 merge 生成新的模型 + +```bash +export MKL_SERVICE_FORCE_INTEL=1 # 解决 Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library. +xtuner convert merge /path/to/internlm2-chat-7b \ + ./work_dirs/internlm2_chat_7b_qlora_custom_data/iter_340_hf \ + ./work_dirs/internlm2_chat_7b_qlora_custom_data/iter_340_merge +``` + +3. 对模型进行 4bit 量化(可选) + +```bash +lmdeploy lite auto_awq ./work_dirs/internlm2_chat_7b_qlora_custom_data/iter_340_merge \ + --work-dir ./work_dirs/internlm2_chat_7b_qlora_custom_data/iter_340_merge_4bit +``` + +4. 测试速度(可选) + +```bash +python ./benchmark/get_benchmark_report.py +``` + +执行脚本之后得出速度报告,可见使用 [LMDeploy](https://github.com/InternLM/LMDeploy) 的 Turbomind 可以明显提速,4bit 量化后的模型推理速度比原始推理快 5 倍。 + +```bash ++---------------------------------+------------------------+-----------------+ +| Model | Toolkit | Speed (words/s) | ++---------------------------------+------------------------+-----------------+ +| streamer-sales-lelemiao-7b | transformer | 60.9959 | +| streamer-sales-lelemiao-7b | LMDeploy (Turbomind) | 147.9898 | +| streamer-sales-lelemiao-7b-4bit | LMDeploy (Turbomind) | 306.6347 | ++---------------------------------+------------------------+-----------------+ +``` + +5. 启动 Web APP + +```bash + +# Agent Key (如果没有请忽略) +export DELIVERY_TIME_API_KEY="${快递 EBusinessID},${快递 api_key}" +export WEATHER_API_KEY="${天气 API key}" + +streamlit run app.py --server.address=0.0.0.0 --server.port 7860 +``` + +使用浏览器打开 `http://127.0.0.1:7860` 即可访问 Web 页面 + +## 结语 + +到这里,整个项目已经讲解完了,本项目属于个人的一个学习项目,目前还在起步阶段,有很多不足的地方,望各位大佬轻喷。 + +这项目对我来说,既是一场学习的修行,也是自我的突破,也希望可以给到各位一些启发。 + +后续我会持续对项目进行升级完善,首先会把实时性做上去。同时,欢迎各位加群一起讨论,任何想法、建议都可以提出,期待各位的反馈,感谢感谢! + +本项目全部代码均已开源,大家可以过来看看,如果觉得项目做的不错,请点个 star ⭐(疯狂暗示),⭐ 是给我最大的鼓励,谢谢!地址: https://github.com/PeterH0323/Streamer-Sales + +以上就是本期课程的全部内容啦,再次感谢上海人工智能实验室 书生·浦语大模型实战营 的 干货课程 和 算力支持! 
From 67c07e1726e80589cebeee2b6f7da5351b77c5b9 Mon Sep 17 00:00:00 2001 From: AI-Labs Date: Sun, 30 Jun 2024 12:24:45 +0800 Subject: [PATCH 037/754] =?UTF-8?q?XTuner=E5=BE=AE=E8=B0=83=E4=B8=AA?= =?UTF-8?q?=E4=BA=BA=E5=B0=8F=E5=8A=A9=E6=89=8B=E8=AE=A4=E7=9F=A5?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- ...rnlm2_1_8b_full_custom_pretrain_e1_copy.py | 219 ++ ...nternlm2_chat_1_8b_qlora_alpaca_e3_copy.py | 219 ++ ...346\211\213\350\256\244\347\237\245.ipynb" | 2857 +++++++++++++++++ docs/L1/XTuner/homework.md | 16 + docs/L1/XTuner/readme.md | 1196 +++++++ docs/L1/XTuner/xtuner_finetune_advance.md | 780 +++++ docs/L1/XTuner/xtuner_finetune_basic.md | 170 + tools/xtuner_streamlit_demo.py | 269 ++ 8 files changed, 5726 insertions(+) create mode 100644 configs/internlm2_1_8b_full_custom_pretrain_e1_copy.py create mode 100644 configs/internlm2_chat_1_8b_qlora_alpaca_e3_copy.py create mode 100644 "docs/L1/XTuner/XTuner\345\276\256\350\260\203\344\270\252\344\272\272\345\260\217\345\212\251\346\211\213\350\256\244\347\237\245.ipynb" create mode 100644 docs/L1/XTuner/homework.md create mode 100644 docs/L1/XTuner/xtuner_finetune_advance.md create mode 100644 docs/L1/XTuner/xtuner_finetune_basic.md create mode 100644 tools/xtuner_streamlit_demo.py diff --git a/configs/internlm2_1_8b_full_custom_pretrain_e1_copy.py b/configs/internlm2_1_8b_full_custom_pretrain_e1_copy.py new file mode 100644 index 000000000..d45b6bea1 --- /dev/null +++ b/configs/internlm2_1_8b_full_custom_pretrain_e1_copy.py @@ -0,0 +1,219 @@ +# Copyright (c) OpenMMLab. All rights reserved. +"""Data format: + +[ + { + "text": "xxx" + }, + { + "text": "xxx" + }, + ... +] +""" # noqa: E501 + +from datasets import load_dataset +from mmengine.dataset import DefaultSampler +from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook, + LoggerHook, ParamSchedulerHook) +from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR +from peft import LoraConfig +import torch +from torch.optim import AdamW +from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig + +from xtuner.dataset import process_hf_dataset +from xtuner.dataset.collate_fns import default_collate_fn +from xtuner.dataset.map_fns import pretrain_map_fn +from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook, + VarlenAttnArgsToMessageHubHook) +from xtuner.engine.runner import TrainLoop +from xtuner.model import SupervisedFinetune + +####################################################################### +# PART 1 Settings # +####################################################################### +# Model +pretrained_model_name_or_path = 'Shanghai_AI_Laboratory/internlm2-1_8b' +use_varlen_attn = False + +# Data +data_files = ['datas/pretrain.json'] +max_length = 2048 +pack_to_max_length = True + +# Scheduler & Optimizer +batch_size = 1 # per_device +accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc +dataloader_num_workers = 0 +max_epochs = 1 +optim_type = AdamW +lr = 2e-5 +betas = (0.9, 0.999) +weight_decay = 0 +max_norm = 1 # grad clip +warmup_ratio = 0.03 + +# Save +save_steps = 500 +save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited) + +# Evaluate the generation performance during the training +evaluation_freq = 500 +SYSTEM = '' +evaluation_inputs = ['书生·浦语大模型实战营第三期是', '上海是', 'Shanghai is'] + +####################################################################### +# PART 2 Model & Tokenizer # 
+####################################################################### +tokenizer = dict( + type=AutoTokenizer.from_pretrained, + pretrained_model_name_or_path=pretrained_model_name_or_path, + trust_remote_code=True, + padding_side='right') + +model = dict( + type=SupervisedFinetune, + use_varlen_attn=use_varlen_attn, + llm=dict( + type=AutoModelForCausalLM.from_pretrained, + pretrained_model_name_or_path=pretrained_model_name_or_path, + trust_remote_code=True, + quantization_config=dict( + type=BitsAndBytesConfig, + load_in_4bit=True, + load_in_8bit=False, + llm_int8_threshold=6.0, + llm_int8_has_fp16_weight=False, + bnb_4bit_compute_dtype=torch.float16, + bnb_4bit_use_double_quant=True, + bnb_4bit_quant_type='nf4') + ), + lora=dict( + type=LoraConfig, + r=64, + lora_alpha=16, + lora_dropout=0.1, + bias='none', + task_type='CAUSAL_LM') +) + +####################################################################### +# PART 3 Dataset & Dataloader # +####################################################################### +train_dataset = dict( + type=process_hf_dataset, + dataset=dict(type=load_dataset, path='json', data_files=data_files), + tokenizer=tokenizer, + max_length=max_length, + dataset_map_fn=pretrain_map_fn, + template_map_fn=None, + remove_unused_columns=True, + shuffle_before_pack=False, + pack_to_max_length=pack_to_max_length, + use_varlen_attn=use_varlen_attn) + +train_dataloader = dict( + batch_size=batch_size, + num_workers=dataloader_num_workers, + dataset=train_dataset, + sampler=dict(type=DefaultSampler, shuffle=True), + collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn)) + +####################################################################### +# PART 4 Scheduler & Optimizer # +####################################################################### +# optimizer +optim_wrapper = dict( + type=AmpOptimWrapper, + optimizer=dict( + type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay), + clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False), + accumulative_counts=accumulative_counts, + loss_scale='dynamic', + dtype='float16') + +# learning policy +# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501 +param_scheduler = [ + dict( + type=LinearLR, + start_factor=1e-5, + by_epoch=True, + begin=0, + end=warmup_ratio * max_epochs, + convert_to_iter_based=True), + dict( + type=CosineAnnealingLR, + eta_min=0.0, + by_epoch=True, + begin=warmup_ratio * max_epochs, + end=max_epochs, + convert_to_iter_based=True) +] + +# train, val, test setting +train_cfg = dict(type=TrainLoop, max_epochs=max_epochs) + +####################################################################### +# PART 5 Runtime # +####################################################################### +# Log the dialogue periodically during the training process, optional +custom_hooks = [ + dict(type=DatasetInfoHook, tokenizer=tokenizer), + dict( + type=EvaluateChatHook, + tokenizer=tokenizer, + every_n_iters=evaluation_freq, + evaluation_inputs=evaluation_inputs, + system=SYSTEM) +] + +if use_varlen_attn: + custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)] + +# configure default hooks +default_hooks = dict( + # record the time of every iteration. + timer=dict(type=IterTimerHook), + # print log every 10 iterations. + logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10), + # enable the parameter scheduler. + param_scheduler=dict(type=ParamSchedulerHook), + # save checkpoint per `save_steps`. 
+ checkpoint=dict( + type=CheckpointHook, + by_epoch=False, + interval=save_steps, + max_keep_ckpts=save_total_limit), + # set sampler seed in distributed evrionment. + sampler_seed=dict(type=DistSamplerSeedHook), +) + +# configure environment +env_cfg = dict( + # whether to enable cudnn benchmark + cudnn_benchmark=False, + # set multi process parameters + mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0), + # set distributed parameters + dist_cfg=dict(backend='nccl'), +) + +# set visualizer +visualizer = None + +# set log level +log_level = 'INFO' + +# load from which checkpoint +load_from = None + +# whether to resume training from the loaded checkpoint +resume = False + +# Defaults to use random seed and disable `deterministic` +randomness = dict(seed=None, deterministic=False) + +# set log processor +log_processor = dict(by_epoch=False) diff --git a/configs/internlm2_chat_1_8b_qlora_alpaca_e3_copy.py b/configs/internlm2_chat_1_8b_qlora_alpaca_e3_copy.py new file mode 100644 index 000000000..6c478fa4f --- /dev/null +++ b/configs/internlm2_chat_1_8b_qlora_alpaca_e3_copy.py @@ -0,0 +1,219 @@ +# Copyright (c) OpenMMLab. All rights reserved. +import torch +from datasets import load_dataset +from mmengine.dataset import DefaultSampler +from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook, + LoggerHook, ParamSchedulerHook) +from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR +from peft import LoraConfig +from torch.optim import AdamW +from transformers import (AutoModelForCausalLM, AutoTokenizer, + BitsAndBytesConfig) + +from xtuner.dataset import process_hf_dataset +from xtuner.dataset.collate_fns import default_collate_fn +from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory +from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook, + VarlenAttnArgsToMessageHubHook) +from xtuner.engine.runner import TrainLoop +from xtuner.model import SupervisedFinetune +from xtuner.parallel.sequence import SequenceParallelSampler +from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE + +####################################################################### +# PART 1 Settings # +####################################################################### +# Model +pretrained_model_name_or_path = 'Shanghai_AI_Laboratory/internlm2-chat-1_8b' +use_varlen_attn = False + +# Data +alpaca_en_path = 'datas/assistant.json' +prompt_template = PROMPT_TEMPLATE.internlm2_chat +max_length = 2048 +pack_to_max_length = True + +# parallel +sequence_parallel_size = 1 + +# Scheduler & Optimizer +batch_size = 1 # per_device +accumulative_counts = 16 +accumulative_counts *= sequence_parallel_size +dataloader_num_workers = 0 +max_epochs = 3 +optim_type = AdamW +lr = 2e-4 +betas = (0.9, 0.999) +weight_decay = 0 +max_norm = 1 # grad clip +warmup_ratio = 0.03 + +# Save +save_steps = 500 +save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited) + +# Evaluate the generation performance during the training +evaluation_freq = 500 +SYSTEM = SYSTEM_TEMPLATE.alpaca +evaluation_inputs = [ + '请介绍一下你自己', '请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai' +] + +####################################################################### +# PART 2 Model & Tokenizer # +####################################################################### +tokenizer = dict( + type=AutoTokenizer.from_pretrained, + pretrained_model_name_or_path=pretrained_model_name_or_path, + trust_remote_code=True, + padding_side='right') + +model = dict( + 
type=SupervisedFinetune, + use_varlen_attn=use_varlen_attn, + llm=dict( + type=AutoModelForCausalLM.from_pretrained, + pretrained_model_name_or_path=pretrained_model_name_or_path, + trust_remote_code=True, + torch_dtype=torch.float16, + quantization_config=dict( + type=BitsAndBytesConfig, + load_in_4bit=True, + load_in_8bit=False, + llm_int8_threshold=6.0, + llm_int8_has_fp16_weight=False, + bnb_4bit_compute_dtype=torch.float16, + bnb_4bit_use_double_quant=True, + bnb_4bit_quant_type='nf4')), + lora=dict( + type=LoraConfig, + r=64, + lora_alpha=16, + lora_dropout=0.1, + bias='none', + task_type='CAUSAL_LM')) + +####################################################################### +# PART 3 Dataset & Dataloader # +####################################################################### +alpaca_en = dict( + type=process_hf_dataset, + dataset=dict(type=load_dataset, path='json', data_files=dict(train=alpaca_en_path)), + tokenizer=tokenizer, + max_length=max_length, + dataset_map_fn=None, + template_map_fn=dict( + type=template_map_fn_factory, template=prompt_template), + remove_unused_columns=True, + shuffle_before_pack=True, + pack_to_max_length=pack_to_max_length, + use_varlen_attn=use_varlen_attn) + +sampler = SequenceParallelSampler \ + if sequence_parallel_size > 1 else DefaultSampler +train_dataloader = dict( + batch_size=batch_size, + num_workers=dataloader_num_workers, + dataset=alpaca_en, + sampler=dict(type=sampler, shuffle=True), + collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn)) + +####################################################################### +# PART 4 Scheduler & Optimizer # +####################################################################### +# optimizer +optim_wrapper = dict( + type=AmpOptimWrapper, + optimizer=dict( + type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay), + clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False), + accumulative_counts=accumulative_counts, + loss_scale='dynamic', + dtype='float16') + +# learning policy +# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501 +param_scheduler = [ + dict( + type=LinearLR, + start_factor=1e-5, + by_epoch=True, + begin=0, + end=warmup_ratio * max_epochs, + convert_to_iter_based=True), + dict( + type=CosineAnnealingLR, + eta_min=0.0, + by_epoch=True, + begin=warmup_ratio * max_epochs, + end=max_epochs, + convert_to_iter_based=True) +] + +# train, val, test setting +train_cfg = dict(type=TrainLoop, max_epochs=max_epochs) + +####################################################################### +# PART 5 Runtime # +####################################################################### +# Log the dialogue periodically during the training process, optional +custom_hooks = [ + dict(type=DatasetInfoHook, tokenizer=tokenizer), + dict( + type=EvaluateChatHook, + tokenizer=tokenizer, + every_n_iters=evaluation_freq, + evaluation_inputs=evaluation_inputs, + system=SYSTEM, + prompt_template=prompt_template) +] + +if use_varlen_attn: + custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)] + +# configure default hooks +default_hooks = dict( + # record the time of every iteration. + timer=dict(type=IterTimerHook), + # print log every 10 iterations. + logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10), + # enable the parameter scheduler. + param_scheduler=dict(type=ParamSchedulerHook), + # save checkpoint per `save_steps`. 
+ checkpoint=dict( + type=CheckpointHook, + by_epoch=False, + interval=save_steps, + max_keep_ckpts=save_total_limit), + # set sampler seed in distributed evrionment. + sampler_seed=dict(type=DistSamplerSeedHook), +) + +# configure environment +env_cfg = dict( + # whether to enable cudnn benchmark + cudnn_benchmark=False, + # set multi process parameters + mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0), + # set distributed parameters + dist_cfg=dict(backend='nccl'), +) + +# set visualizer +visualizer = None + +# set log level +log_level = 'INFO' + +# load from which checkpoint +load_from = None + +# whether to resume training from the loaded checkpoint +resume = False + +# Defaults to use random seed and disable `deterministic` +randomness = dict(seed=None, deterministic=False) + +# set log processor +log_processor = dict(by_epoch=False) diff --git "a/docs/L1/XTuner/XTuner\345\276\256\350\260\203\344\270\252\344\272\272\345\260\217\345\212\251\346\211\213\350\256\244\347\237\245.ipynb" "b/docs/L1/XTuner/XTuner\345\276\256\350\260\203\344\270\252\344\272\272\345\260\217\345\212\251\346\211\213\350\256\244\347\237\245.ipynb" new file mode 100644 index 000000000..8587db88b --- /dev/null +++ "b/docs/L1/XTuner/XTuner\345\276\256\350\260\203\344\270\252\344\272\272\345\260\217\345\212\251\346\211\213\350\256\244\347\237\245.ipynb" @@ -0,0 +1,2857 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "451252c1-8bf5-461c-aaa6-77fa509c69d5", + "metadata": {}, + "source": [ + "# XTuner微调个人小助手认知\n", + "\n", + "在本节中,将一步步带领大家体验如何使用 XTuner 完成个人小助手的微调!" + ] + }, + { + "cell_type": "markdown", + "id": "e5bd9a97-42ee-478b-989a-7b32a57cf035", + "metadata": {}, + "source": [ + "## 1 微调前置基础\n", + "\n", + "在进行微调之前,我们需要了解一些基本概念,请访问[XTuner微调前置基础](./xtuner_finetune_basic.md)。" + ] + }, + { + "cell_type": "markdown", + "id": "0c080138", + "metadata": {}, + "source": [ + "## 2 准备工作\n", + "\n", + "**环境安装**:我们想要用简单易上手的微调工具包 XTuner 来对模型进行微调的话,第一步是安装 XTuner !安装基础的工具是一切的前提,只有安装了 XTuner 我们才能够去执行后续的操作。\n", + "\n", + "**前期准备**:在完成 XTuner 的安装后,我们下一步就需要去明确我们自己的微调目标了。我们想要利用微调做一些什么事情呢,然后为了实现这个目标,我们需要准备相关的硬件资源和数据。\n", + "\n", + "**启动微调**:在确定了自己的微调目标后,我们就可以在 XTuner 的配置库中找到合适的配置文件并进行对应的修改。修改完成后即可一键启动训练!训练好的模型也可以仅仅通过在终端输入一行命令来完成转换和部署工作!" 
+ ] + }, + { + "cell_type": "markdown", + "id": "1de6991f", + "metadata": {}, + "source": [ + "### 2.1 创建虚拟环境\n", + "\n", + "在安装 XTuner 之前,我们需要先创建一个虚拟环境。创建一个名为 `xtuner0121` 的虚拟环境,可以直接执行命令。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ead18d70", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "!conda create -n xtuner0121 python=3.10 -y" + ] + }, + { + "cell_type": "markdown", + "id": "d7b70777", + "metadata": {}, + "source": [ + "如果是在开发机中,也可以直接执行以下命令进行创建:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "003d9799", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "!studio-conda -t xtuner0121 -o internlm-base" + ] + }, + { + "cell_type": "markdown", + "id": "03f48956", + "metadata": {}, + "source": [ + "虚拟环境创建完成后,需要激活虚拟环境。\n", + "\n", + "```bash\n", + "conda activate xtuner0121\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "0b38ba7d", + "metadata": {}, + "source": [ + "### 2.2 安装 XTuner\n", + "\n", + "虚拟环境创建完成后,就可以安装 XTuner 了。首先,从 Github 上下载源码。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4728440a", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "!git clone -b v0.1.21 https://github.com/InternLM/xtuner" + ] + }, + { + "cell_type": "markdown", + "id": "328c1ef1", + "metadata": {}, + "source": [ + "其次,进入源码目录,执行安装。\n", + "\n", + "> 如果速度太慢可以换成 `pip install -e '.[all]' -i https://mirrors.aliyun.com/pypi/simple/`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3b6dd99d", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "!cd xtuner && pip install -e '.[all]'" + ] + }, + { + "cell_type": "markdown", + "id": "f0757a85", + "metadata": {}, + "source": [ + "最后,我们可以验证一下安装结果。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5f0629e6", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "!xtuner version" + ] + }, + { + "cell_type": "markdown", + "id": "c24eabe5", + "metadata": {}, + "source": [ + "对于很多初学者而言,我们可能不太熟悉 XTuner 的用法,那么我们可以通过以下命令来查看相关的帮助。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "18bc7396", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "!xtuner help" + ] + }, + { + "cell_type": "markdown", + "id": "5b25987e", + "metadata": {}, + "source": [ + "对于很多的初学者而言,安装好环境意味着成功了一大半!因此我们接下来就可以进入我们的下一步,准备好我们需要的模型、数据集和配置文件,并进行微调训练!" 
+ ] + }, + { + "cell_type": "markdown", + "id": "97f6bad6", + "metadata": {}, + "source": [ + "### 2.3 模型准备\n", + "\n", + "软件安装好后,我们就可以准备要微调的模型了。\n", + "\n", + "> 对于学习而言,我们可以使用 InternLM 推出的1.8B的小模型来完成此次微调演示。\n", + "\n", + "对于在 InternStudio 上运行的小伙伴们,可以不用通过 HuggingFace、OpenXLab 或者 Modelscope 进行模型的下载,在开发机中已经为我们提供了模型的本地文件,直接使用就可以了。\n", + "\n", + "> 我们可以通过以下代码一键通过符号链接的方式链接到模型文件,这样既节省了空间,也便于管理。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "78d9828b", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "!mkdir -p Shanghai_AI_Laboratory\n", + "\n", + "!ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b Shanghai_AI_Laboratory/internlm2-chat-1_8b" + ] + }, + { + "cell_type": "markdown", + "id": "a75fcb97", + "metadata": {}, + "source": [ + "执行上述操作后,`Shanghai_AI_Laboratory/internlm2-chat-1_8b` 将直接成为一个符号链接,这个链接指向 `/root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b` 的位置。\n", + "\n", + "这意味着,当我们访问 `Shanghai_AI_Laboratory/internlm2-chat-1_8b` 时,实际上就是在访问 `/root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b` 目录下的内容。通过这种方式,我们无需复制任何数据,就可以直接利用现有的模型文件进行后续的微调操作,从而节省存储空间并简化文件管理。\n", + "\n", + "如果自己想要微调的模型在开发机中没找到,也可以自己下载相关模型文件。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "78e3b789", + "metadata": {}, + "outputs": [], + "source": [ + "from modelscope import snapshot_download\n", + "model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm2-1_8b', cache_dir=\"./\")" + ] + }, + { + "cell_type": "markdown", + "id": "fec6e564", + "metadata": {}, + "source": [ + "模型文件准备好后,我们的目录结构应该是这个样子的。\n", + "\n", + "
\n", + "目录结构\n", + "\n", + "```\n", + "├── Shanghai_AI_Laboratory\n", + "│ ├── internlm2-1_8b\n", + "│ │ ├── README.md\n", + "│ │ ├── config.json\n", + "│ │ ├── configuration.json\n", + "│ │ ├── configuration_internlm2.py\n", + "│ │ ├── generation_config.json\n", + "│ │ ├── modeling_internlm2.py\n", + "│ │ ├── pytorch_model.bin\n", + "│ │ ├── special_tokens_map.json\n", + "│ │ ├── tokenization_internlm2.py\n", + "│ │ ├── tokenization_internlm2_fast.py\n", + "│ │ ├── tokenizer.json\n", + "│ │ ├── tokenizer.model\n", + "│ │ └── tokenizer_config.json\n", + "│ └── internlm2-chat-1_8b -> /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b\n", + "│ ├── README.md\n", + "│ ├── config.json\n", + "│ ├── configuration.json\n", + "│ ├── configuration_internlm2.py\n", + "│ ├── generation_config.json\n", + "│ ├── model-00001-of-00002.safetensors\n", + "│ ├── model-00002-of-00002.safetensors\n", + "│ ├── model.safetensors.index.json\n", + "│ ├── modeling_internlm2.py\n", + "│ ├── special_tokens_map.json\n", + "│ ├── tokenization_internlm2.py\n", + "│ ├── tokenization_internlm2_fast.py\n", + "│ ├── tokenizer.model\n", + "│ └── tokenizer_config.json\n", + "```\n", + "
\n", + "\n", + "\n", + "> 在目录结构中可以看出,`internlm2-chat-1_8b` 是一个符号链接。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0bb7feb4", + "metadata": {}, + "outputs": [], + "source": [ + "!tree -l" + ] + }, + { + "cell_type": "markdown", + "id": "d44b556d-4012-4860-a342-8c9545616bd0", + "metadata": {}, + "source": [ + "## 3 指令跟随微调(微调个人小助手认知)\n", + "\n", + "这里我们用 `internlm2-chat-1_8b` 模型,通过 `QLoRA` 的方式来微调一个自己的小助手认知作为案例来进行演示。" + ] + }, + { + "cell_type": "markdown", + "id": "0e247e8e", + "metadata": {}, + "source": [ + "首先,看看微调效果:\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
微调前微调后
输入请介绍一下你自己请介绍一下你自己
输出你好,我是书生·浦语。我致力于帮助用户解决各种语言相关的问题,包括但不限于语言学习、翻译、文本摘要等。我使用了Transformer模型和深度学习技术,并使用了语言模型作为预训练任务。如果你有任何问题,欢迎随时向我提问。我是伍鲜同志的小助手,内在是上海AI实验室书生·浦语的1.8B大模型哦
网页
" + ] + }, + { + "cell_type": "markdown", + "id": "9ad62951", + "metadata": {}, + "source": [ + "其次,我们需要定义一些基本方法。" + ] + }, + { + "cell_type": "markdown", + "id": "e02999ca", + "metadata": {}, + "source": [ + "- 导入必要的库" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5a032091", + "metadata": {}, + "outputs": [], + "source": [ + "import torch\n", + "from transformers import AutoTokenizer, AutoModelForCausalLM" + ] + }, + { + "cell_type": "markdown", + "id": "8d17fee6", + "metadata": {}, + "source": [ + "- 定义模型加载方法" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4145126e", + "metadata": {}, + "outputs": [], + "source": [ + "def load_model(model_path):\n", + " tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)\n", + " model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).cuda()\n", + " model = model.eval()\n", + " return tokenizer, model" + ] + }, + { + "cell_type": "markdown", + "id": "31595716", + "metadata": {}, + "source": [ + "- 定义对话方法" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5108f261", + "metadata": {}, + "outputs": [], + "source": [ + "messages = []\n", + "\n", + "def chat(input_text):\n", + " length = 0\n", + " for response, _ in model.stream_chat(tokenizer, input_text, messages):\n", + " if response is not None:\n", + " print(response[length:], flush=True, end=\"\")\n", + " length = len(response)" + ] + }, + { + "cell_type": "markdown", + "id": "507bd563", + "metadata": {}, + "source": [ + "### 3.1 微调前的模型对话\n", + "\n", + "首先来看看 `internlm2-chat-1_8b` 的对话演示。" + ] + }, + { + "cell_type": "markdown", + "id": "fd3ed483", + "metadata": {}, + "source": [ + "- 模型加载" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b1e2c28a", + "metadata": {}, + "outputs": [], + "source": [ + "tokenizer, model = load_model(\"Shanghai_AI_Laboratory/internlm2-chat-1_8b\")" + ] + }, + { + "cell_type": "markdown", + "id": "580eaaaa", + "metadata": {}, + "source": [ + "- 对话" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b14c0191", + "metadata": {}, + "outputs": [], + "source": [ + "chat(\"请介绍一下你自己\")" + ] + }, + { + "cell_type": "markdown", + "id": "c8b018de", + "metadata": {}, + "source": [ + "- 释放缓存" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9503eedf", + "metadata": {}, + "outputs": [], + "source": [ + "del tokenizer, model\n", + "\n", + "torch.cuda.empty_cache()" + ] + }, + { + "cell_type": "markdown", + "id": "fe8693b5", + "metadata": {}, + "source": [ + "### 3.2 指令跟随微调\n", + "\n", + "下面我们对模型进行微调,让模型认识到自己的弟位,了解它自己是你的一个助手。" + ] + }, + { + "cell_type": "markdown", + "id": "781b1495", + "metadata": {}, + "source": [ + "#### 3.2.1 准数据文件\n", + "\n", + "为了让模型能够认清自己的身份弟位,在询问自己是谁的时候按照我们预期的结果进行回复,我们就需要通过在微调数据集中大量加入这样的数据。我们准备一个数据集文件`datas/assistant.json`,文件内容为对话数据。为了增强微调效果,可以将对话数据复制多条。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1ff3df92", + "metadata": {}, + "outputs": [], + "source": [ + "[\n", + " {\"conversation\": [{\"input\": \"请介绍一下你自己\", \"output\": \"我是伍鲜同志的小助手,内在是上海AI实验室书生·浦语的1.8B大模型哦\"}]},\n", + " {\"conversation\": [{\"input\": \"你在实战营做什么\", \"output\": \"我在这里帮助伍鲜同志完成XTuner微调个人小助手的任务\"}]},\n", + "]" + ] + }, + { + "cell_type": "markdown", + "id": "83c08a82", + "metadata": {}, + "source": [ + "准备好数据文件后,我们的目录结构应该是这样子的。\n", + "\n", + "
\n", + "目录结构\n", + "\n", + "```\n", + "├── Shanghai_AI_Laboratory\n", + "│ ├── internlm2-1_8b\n", + "│ │ ├── README.md\n", + "│ │ ├── config.json\n", + "│ │ ├── configuration.json\n", + "│ │ ├── configuration_internlm2.py\n", + "│ │ ├── generation_config.json\n", + "│ │ ├── modeling_internlm2.py\n", + "│ │ ├── pytorch_model.bin\n", + "│ │ ├── special_tokens_map.json\n", + "│ │ ├── tokenization_internlm2.py\n", + "│ │ ├── tokenization_internlm2_fast.py\n", + "│ │ ├── tokenizer.json\n", + "│ │ ├── tokenizer.model\n", + "│ │ └── tokenizer_config.json\n", + "│ └── internlm2-chat-1_8b -> /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b\n", + "│ ├── README.md\n", + "│ ├── config.json\n", + "│ ├── configuration.json\n", + "│ ├── configuration_internlm2.py\n", + "│ ├── generation_config.json\n", + "│ ├── model-00001-of-00002.safetensors\n", + "│ ├── model-00002-of-00002.safetensors\n", + "│ ├── model.safetensors.index.json\n", + "│ ├── modeling_internlm2.py\n", + "│ ├── special_tokens_map.json\n", + "│ ├── tokenization_internlm2.py\n", + "│ ├── tokenization_internlm2_fast.py\n", + "│ ├── tokenizer.model\n", + "│ └── tokenizer_config.json\n", + "├── datas\n", + "│ └── assistant.json\n", + "```\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "5f34f5d8", + "metadata": {}, + "source": [ + "#### 3.2.2 准备配置文件\n", + "\n", + "在准备好了模型和数据集后,我们就要根据我们选择的微调方法结合微调方案来找到与我们最匹配的配置文件了,从而减少我们对配置文件的修改量。\n", + "\n", + "> 配置文件其实是一种用于定义和控制模型训练和测试过程中各个方面的参数和设置的工具。" + ] + }, + { + "cell_type": "markdown", + "id": "70839704", + "metadata": {}, + "source": [ + "##### 3.2.2.1 列出支持的配置文件\n", + "\n", + "XTuner 提供多个开箱即用的配置文件,可以通过以下命令查看。\n", + "\n", + "> `xtuner list-cfg` 命令用于列出内置的所有配置文件。参数 `-p` 或 `--pattern` 表示模式匹配,后面跟着的内容将会在所有的配置文件里进行模糊匹配搜索,然后返回最有可能得内容。比如我们这里微调的是书生·浦语的模型,我们就可以匹配搜索 `internlm2`。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "54e9c4bf", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "!xtuner list-cfg -p internlm2" + ] + }, + { + "cell_type": "markdown", + "id": "ba0c7c15", + "metadata": {}, + "source": [ + "
\n", + "配置文件名的解释\n", + "\n", + "以 **internlm2_1_8b_full_custom_pretrain_e1** 和 **internlm2_chat_1_8b_qlora_alpaca_e3** 举例:\n", + "\n", + "| 配置文件 internlm2_1_8b_full_custom_pretrain_e1 | 配置文件 internlm2_chat_1_8b_qlora_alpaca_e3 | 说明 |\n", + "| ----------------------------------------------- | -------------------------------------------- | -------------- |\n", + "| internlm2_1_8b | internlm2_chat_1_8b | 模型名称 |\n", + "| full | qlora | 使用的算法 |\n", + "| custom_pretrain | alpaca | 数据集名称 |\n", + "| e1 | e3 | 把数据集跑几次 |\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "275b2490", + "metadata": {}, + "source": [ + "##### 3.2.2.2 复制一个预设的配置文件\n", + "\n", + "由于我们是对`internlm2-chat-1_8b`模型进行指令微调,所以与我们的需求最匹配的配置文件是 `internlm2_chat_1_8b_qlora_alpaca_e3`,这里就复制该配置文件。\n", + "\n", + "> `xtuner copy-cfg` 命令用于复制一个内置的配置文件。该命令需要两个参数:`CONFIG` 代表需要复制的配置文件名称,`SAVE_PATH` 代表复制的目标路径。在我们的输入的这个命令中,我们的 `CONFIG` 对应的是上面搜索到的 `internlm2_chat_1_8b_qlora_alpaca_e3` ,而 `SAVE_PATH` 则是当前目录 `.`。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c19da8a8", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "!xtuner copy-cfg internlm2_chat_1_8b_qlora_alpaca_e3 ." + ] + }, + { + "cell_type": "markdown", + "id": "2903d70d", + "metadata": {}, + "source": [ + "复制好配置文件后,我们的目录结构应该是这样子的。\n", + "\n", + "
\n", + "目录结构\n", + "\n", + "```\n", + "├── Shanghai_AI_Laboratory\n", + "│ ├── internlm2-1_8b\n", + "│ │ ├── README.md\n", + "│ │ ├── config.json\n", + "│ │ ├── configuration.json\n", + "│ │ ├── configuration_internlm2.py\n", + "│ │ ├── generation_config.json\n", + "│ │ ├── modeling_internlm2.py\n", + "│ │ ├── pytorch_model.bin\n", + "│ │ ├── special_tokens_map.json\n", + "│ │ ├── tokenization_internlm2.py\n", + "│ │ ├── tokenization_internlm2_fast.py\n", + "│ │ ├── tokenizer.json\n", + "│ │ ├── tokenizer.model\n", + "│ │ └── tokenizer_config.json\n", + "│ └── internlm2-chat-1_8b -> /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b\n", + "│ ├── README.md\n", + "│ ├── config.json\n", + "│ ├── configuration.json\n", + "│ ├── configuration_internlm2.py\n", + "│ ├── generation_config.json\n", + "│ ├── model-00001-of-00002.safetensors\n", + "│ ├── model-00002-of-00002.safetensors\n", + "│ ├── model.safetensors.index.json\n", + "│ ├── modeling_internlm2.py\n", + "│ ├── special_tokens_map.json\n", + "│ ├── tokenization_internlm2.py\n", + "│ ├── tokenization_internlm2_fast.py\n", + "│ ├── tokenizer.model\n", + "│ └── tokenizer_config.json\n", + "├── datas\n", + "│ └── assistant.json\n", + "├── internlm2_chat_1_8b_qlora_alpaca_e3_copy.py\n", + "```\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "2fcc24bf", + "metadata": {}, + "source": [ + "##### 3.2.2.3 对配置文件进行修改\n", + "\n", + "在选择了一个最匹配的配置文件并准备好其他内容后,下面我们要做的事情就是根据我们自己的内容对该配置文件进行调整,使其能够满足我们实际训练的要求。\n", + "\n", + "
\n", + "配置文件介绍\n", + "\n", + "打开配置文件后,我们可以看到整体的配置文件分为五部分:\n", + "\n", + "**PART 1 Settings**:涵盖了模型基本设置,如预训练模型的选择、数据集信息和训练过程中的一些基本参数(如批大小、学习率等)。\n", + "\n", + "**PART 2 Model & Tokenizer**:指定了用于训练的模型和分词器的具体类型及其配置,包括预训练模型的路径和是否启用特定功能(如可变长度注意力),这是模型训练的核心组成部分。\n", + "\n", + "**PART 3 Dataset & Dataloader**:描述了数据处理的细节,包括如何加载数据集、预处理步骤、批处理大小等,确保了模型能够接收到正确格式和质量的数据。\n", + "\n", + "**PART 4 Scheduler & Optimizer**:配置了优化过程中的关键参数,如学习率调度策略和优化器的选择,这些是影响模型训练效果和速度的重要因素。\n", + "\n", + "**PART 5 Runtime**:定义了训练过程中的额外设置,如日志记录、模型保存策略和自定义钩子等,以支持训练流程的监控、调试和结果的保存。\n", + "\n", + "一般来说我们需要更改的部分其实只包括前三部分,而且修改的主要原因是我们修改了配置文件中规定的模型、数据集。后两部分都是 XTuner 官方帮我们优化好的东西,一般而言只有在魔改的情况下才需要进行修改。\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "5b85204a", + "metadata": {}, + "source": [ + "下面我们将根据项目的需求一步步的进行修改和调整吧!\n", + "\n", + "在 PART 1 的部分,由于我们不再需要在 HuggingFace 上自动下载模型,因此我们先要更换模型的路径以及数据集的路径为我们本地的路径。\n", + "\n", + "为了训练过程中能够实时观察到模型的变化情况,XTuner 贴心的推出了一个 `evaluation_inputs` 的参数来让我们能够设置多个问题来确保模型在训练过程中的变化是朝着我们想要的方向前进的。我们可以添加自己的输入。\n", + "\n", + "在 PART 3 的部分,由于我们准备的数据集是 JSON 格式的数据,并且对话内容已经是 `input` 和 `output` 的数据对,所以不需要进行格式转换。\n", + "\n", + "```diff\n", + "#######################################################################\n", + "# PART 1 Settings #\n", + "#######################################################################\n", + "- pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b'\n", + "+ pretrained_model_name_or_path = 'Shanghai_AI_Laboratory/internlm2-chat-1_8b'\n", + "\n", + "- alpaca_en_path = 'tatsu-lab/alpaca'\n", + "+ alpaca_en_path = 'datas/assistant.json'\n", + "\n", + "evaluation_inputs = [\n", + "- '请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'\n", + "+ '请介绍一下你自己', '请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'\n", + "]\n", + "\n", + "#######################################################################\n", + "# PART 3 Dataset & Dataloader #\n", + "#######################################################################\n", + "alpaca_en = dict(\n", + " type=process_hf_dataset,\n", + "- dataset=dict(type=load_dataset, path=alpaca_en_path),\n", + "+ dataset=dict(type=load_dataset, path='json', data_files=dict(train=alpaca_en_path)),\n", + " tokenizer=tokenizer,\n", + " max_length=max_length,\n", + "- dataset_map_fn=alpaca_map_fn,\n", + "+ dataset_map_fn=None,\n", + " template_map_fn=dict(\n", + " type=template_map_fn_factory, template=prompt_template),\n", + " remove_unused_columns=True,\n", + " shuffle_before_pack=True,\n", + " pack_to_max_length=pack_to_max_length,\n", + " use_varlen_attn=use_varlen_attn)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "989a0e6e", + "metadata": {}, + "source": [ + "修改完后的完整的配置文件是:[configs/internlm2_chat_1_8b_qlora_alpaca_e3_copy.py](../../../configs/internlm2_chat_1_8b_qlora_alpaca_e3_copy.py)。\n", + "\n", + "
\n", + "internlm2_chat_1_8b_qlora_alpaca_e3_copy.py\n", + "\n", + "```python\n", + "# Copyright (c) OpenMMLab. All rights reserved.\n", + "import torch\n", + "from datasets import load_dataset\n", + "from mmengine.dataset import DefaultSampler\n", + "from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,\n", + " LoggerHook, ParamSchedulerHook)\n", + "from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR\n", + "from peft import LoraConfig\n", + "from torch.optim import AdamW\n", + "from transformers import (AutoModelForCausalLM, AutoTokenizer,\n", + " BitsAndBytesConfig)\n", + "\n", + "from xtuner.dataset import process_hf_dataset\n", + "from xtuner.dataset.collate_fns import default_collate_fn\n", + "from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory\n", + "from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,\n", + " VarlenAttnArgsToMessageHubHook)\n", + "from xtuner.engine.runner import TrainLoop\n", + "from xtuner.model import SupervisedFinetune\n", + "from xtuner.parallel.sequence import SequenceParallelSampler\n", + "from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE\n", + "\n", + "#######################################################################\n", + "# PART 1 Settings #\n", + "#######################################################################\n", + "# Model\n", + "pretrained_model_name_or_path = 'Shanghai_AI_Laboratory/internlm2-chat-1_8b'\n", + "use_varlen_attn = False\n", + "\n", + "# Data\n", + "alpaca_en_path = 'datas/assistant.json'\n", + "prompt_template = PROMPT_TEMPLATE.internlm2_chat\n", + "max_length = 2048\n", + "pack_to_max_length = True\n", + "\n", + "# parallel\n", + "sequence_parallel_size = 1\n", + "\n", + "# Scheduler & Optimizer\n", + "batch_size = 1 # per_device\n", + "accumulative_counts = 16\n", + "accumulative_counts *= sequence_parallel_size\n", + "dataloader_num_workers = 0\n", + "max_epochs = 3\n", + "optim_type = AdamW\n", + "lr = 2e-4\n", + "betas = (0.9, 0.999)\n", + "weight_decay = 0\n", + "max_norm = 1 # grad clip\n", + "warmup_ratio = 0.03\n", + "\n", + "# Save\n", + "save_steps = 500\n", + "save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)\n", + "\n", + "# Evaluate the generation performance during the training\n", + "evaluation_freq = 500\n", + "SYSTEM = SYSTEM_TEMPLATE.alpaca\n", + "evaluation_inputs = [\n", + " '请介绍一下你自己', '请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'\n", + "]\n", + "\n", + "#######################################################################\n", + "# PART 2 Model & Tokenizer #\n", + "#######################################################################\n", + "tokenizer = dict(\n", + " type=AutoTokenizer.from_pretrained,\n", + " pretrained_model_name_or_path=pretrained_model_name_or_path,\n", + " trust_remote_code=True,\n", + " padding_side='right')\n", + "\n", + "model = dict(\n", + " type=SupervisedFinetune,\n", + " use_varlen_attn=use_varlen_attn,\n", + " llm=dict(\n", + " type=AutoModelForCausalLM.from_pretrained,\n", + " pretrained_model_name_or_path=pretrained_model_name_or_path,\n", + " trust_remote_code=True,\n", + " torch_dtype=torch.float16,\n", + " quantization_config=dict(\n", + " type=BitsAndBytesConfig,\n", + " load_in_4bit=True,\n", + " load_in_8bit=False,\n", + " llm_int8_threshold=6.0,\n", + " llm_int8_has_fp16_weight=False,\n", + " bnb_4bit_compute_dtype=torch.float16,\n", + " bnb_4bit_use_double_quant=True,\n", + " bnb_4bit_quant_type='nf4')),\n", + " lora=dict(\n", + " 
type=LoraConfig,\n", + " r=64,\n", + " lora_alpha=16,\n", + " lora_dropout=0.1,\n", + " bias='none',\n", + " task_type='CAUSAL_LM'))\n", + "\n", + "#######################################################################\n", + "# PART 3 Dataset & Dataloader #\n", + "#######################################################################\n", + "alpaca_en = dict(\n", + " type=process_hf_dataset,\n", + " dataset=dict(type=load_dataset, path='json', data_files=dict(train=alpaca_en_path)),\n", + " tokenizer=tokenizer,\n", + " max_length=max_length,\n", + " dataset_map_fn=None,\n", + " template_map_fn=dict(\n", + " type=template_map_fn_factory, template=prompt_template),\n", + " remove_unused_columns=True,\n", + " shuffle_before_pack=True,\n", + " pack_to_max_length=pack_to_max_length,\n", + " use_varlen_attn=use_varlen_attn)\n", + "\n", + "sampler = SequenceParallelSampler \\\n", + " if sequence_parallel_size > 1 else DefaultSampler\n", + "train_dataloader = dict(\n", + " batch_size=batch_size,\n", + " num_workers=dataloader_num_workers,\n", + " dataset=alpaca_en,\n", + " sampler=dict(type=sampler, shuffle=True),\n", + " collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))\n", + "\n", + "#######################################################################\n", + "# PART 4 Scheduler & Optimizer #\n", + "#######################################################################\n", + "# optimizer\n", + "optim_wrapper = dict(\n", + " type=AmpOptimWrapper,\n", + " optimizer=dict(\n", + " type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),\n", + " clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),\n", + " accumulative_counts=accumulative_counts,\n", + " loss_scale='dynamic',\n", + " dtype='float16')\n", + "\n", + "# learning policy\n", + "# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501\n", + "param_scheduler = [\n", + " dict(\n", + " type=LinearLR,\n", + " start_factor=1e-5,\n", + " by_epoch=True,\n", + " begin=0,\n", + " end=warmup_ratio * max_epochs,\n", + " convert_to_iter_based=True),\n", + " dict(\n", + " type=CosineAnnealingLR,\n", + " eta_min=0.0,\n", + " by_epoch=True,\n", + " begin=warmup_ratio * max_epochs,\n", + " end=max_epochs,\n", + " convert_to_iter_based=True)\n", + "]\n", + "\n", + "# train, val, test setting\n", + "train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)\n", + "\n", + "#######################################################################\n", + "# PART 5 Runtime #\n", + "#######################################################################\n", + "# Log the dialogue periodically during the training process, optional\n", + "custom_hooks = [\n", + " dict(type=DatasetInfoHook, tokenizer=tokenizer),\n", + " dict(\n", + " type=EvaluateChatHook,\n", + " tokenizer=tokenizer,\n", + " every_n_iters=evaluation_freq,\n", + " evaluation_inputs=evaluation_inputs,\n", + " system=SYSTEM,\n", + " prompt_template=prompt_template)\n", + "]\n", + "\n", + "if use_varlen_attn:\n", + " custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]\n", + "\n", + "# configure default hooks\n", + "default_hooks = dict(\n", + " # record the time of every iteration.\n", + " timer=dict(type=IterTimerHook),\n", + " # print log every 10 iterations.\n", + " logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),\n", + " # enable the parameter scheduler.\n", + " param_scheduler=dict(type=ParamSchedulerHook),\n", + " # save checkpoint per `save_steps`.\n", + " 
checkpoint=dict(\n", + " type=CheckpointHook,\n", + " by_epoch=False,\n", + " interval=save_steps,\n", + " max_keep_ckpts=save_total_limit),\n", + " # set sampler seed in distributed evrionment.\n", + " sampler_seed=dict(type=DistSamplerSeedHook),\n", + ")\n", + "\n", + "# configure environment\n", + "env_cfg = dict(\n", + " # whether to enable cudnn benchmark\n", + " cudnn_benchmark=False,\n", + " # set multi process parameters\n", + " mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),\n", + " # set distributed parameters\n", + " dist_cfg=dict(backend='nccl'),\n", + ")\n", + "\n", + "# set visualizer\n", + "visualizer = None\n", + "\n", + "# set log level\n", + "log_level = 'INFO'\n", + "\n", + "# load from which checkpoint\n", + "load_from = None\n", + "\n", + "# whether to resume training from the loaded checkpoint\n", + "resume = False\n", + "\n", + "# Defaults to use random seed and disable `deterministic`\n", + "randomness = dict(seed=None, deterministic=False)\n", + "\n", + "# set log processor\n", + "log_processor = dict(by_epoch=False)\n", + "```\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "c02ca9cb", + "metadata": {}, + "source": [ + "#### 3.2.3 启动微调\n", + "\n", + "完成了所有的准备工作后,我们就可以正式的开始我们下一阶段的旅程:XTuner 启动~!\n", + "\n", + "当我们准备好了所有内容,我们只需要将使用 `xtuner train` 命令令即可开始训练。\n", + "\n", + "> `xtuner train` 命令用于启动模型微调进程。该命令需要一个参数:`CONFIG` 用于指定微调配置文件。这里我们使用修改好的配置文件 `internlm2_chat_1_8b_qlora_alpaca_e3_copy.py`。 \n", + "> 训练过程中产生的所有文件,包括日志、配置文件、检查点文件、微调后的模型等,默认保存在 `work_dirs` 目录下,我们也可以通过添加 `--work-dir` 指定特定的文件保存位置。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f89ba8dd", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "!xtuner train ./internlm2_chat_1_8b_qlora_alpaca_e3_copy.py" + ] + }, + { + "cell_type": "markdown", + "id": "9ad3adc1", + "metadata": {}, + "source": [ + "在训练完后,我们的目录结构应该是这样子的。\n", + "\n", + "
\n", + "目录结构\n", + "\n", + "```\n", + "├── work_dirs\n", + "│ └── internlm2_chat_1_8b_qlora_alpaca_e3_copy\n", + "│ ├── 20240626_222727\n", + "│ │ ├── 20240626_222727.log\n", + "│ │ └── vis_data\n", + "│ │ ├── 20240626_222727.json\n", + "│ │ ├── config.py\n", + "│ │ ├── eval_outputs_iter_95.txt\n", + "│ │ └── scalars.json\n", + "│ ├── internlm2_chat_1_8b_qlora_alpaca_e3_copy.py\n", + "│ ├── iter_96.pth\n", + "│ └── last_checkpoint\n", + "```\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "dfed1498", + "metadata": {}, + "source": [ + "#### 3.2.4 模型格式转换\n", + "\n", + "模型转换的本质其实就是将原本使用 Pytorch 训练出来的模型权重文件转换为目前通用的 HuggingFace 格式文件,那么我们可以通过以下命令来实现一键转换。\n", + "\n", + "我们可以使用 `xtuner convert pth_to_hf` 命令来进行模型格式转换。\n", + "\n", + "> `xtuner convert pth_to_hf` 命令用于进行模型格式转换。该命令需要三个参数:`CONFIG` 表示微调的配置文件, `PATH_TO_PTH_MODEL` 表示微调的模型权重文件路径,即要转换的模型权重, `SAVE_PATH_TO_HF_MODEL` 表示转换后的 HuggingFace 格式文件的保存路径。\n", + "\n", + "除此之外,我们其实还可以在转换的命令中添加几个额外的参数,包括:\n", + "\n", + "| 参数名 | 解释 |\n", + "| --------------------- | -------------------------------------------- |\n", + "| --fp32 | 代表以fp32的精度开启,假如不输入则默认为fp16 |\n", + "| --max-shard-size {GB} | 代表每个权重文件最大的大小(默认为2GB) |" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d6422944", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "!pth_file=`ls -t ./work_dirs/internlm2_chat_1_8b_qlora_alpaca_e3_copy/*.pth | head -n 1` && MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU xtuner convert pth_to_hf ./internlm2_chat_1_8b_qlora_alpaca_e3_copy.py ${pth_file} ./hf" + ] + }, + { + "cell_type": "markdown", + "id": "dbfc5968", + "metadata": {}, + "source": [ + "模型格式转换完成后,我们的目录结构应该是这样子的。\n", + "\n", + "
\n", + "目录结构\n", + "\n", + "```\n", + "├── hf\n", + "│ ├── README.md\n", + "│ ├── adapter_config.json\n", + "│ ├── adapter_model.bin\n", + "│ └── xtuner_config.py\n", + "```\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "b86536d9", + "metadata": {}, + "source": [ + "转换完成后,可以看到模型被转换为 HuggingFace 中常用的 .bin 格式文件,这就代表着文件成功被转化为 HuggingFace 格式了。\n", + "\n", + "此时,hf 文件夹即为我们平时所理解的所谓 “LoRA 模型文件”\n", + "\n", + "> 可以简单理解:LoRA 模型文件 = Adapter" + ] + }, + { + "cell_type": "markdown", + "id": "5b64bcda", + "metadata": {}, + "source": [ + "#### 3.2.5 模型合并\n", + "\n", + "对于 LoRA 或者 QLoRA 微调出来的模型其实并不是一个完整的模型,而是一个额外的层(Adapter),训练完的这个层最终还是要与原模型进行合并才能被正常的使用。\n", + "\n", + "> 对于全量微调的模型(full)其实是不需要进行整合这一步的,因为全量微调修改的是原模型的权重而非微调一个新的 Adapter ,因此是不需要进行模型整合的。" + ] + }, + { + "cell_type": "markdown", + "id": "bfce601f", + "metadata": {}, + "source": [ + "在 XTuner 中提供了一键合并的命令 `xtuner convert merge`,在使用前我们需要准备好三个路径,包括原模型的路径、训练好的 Adapter 层的(模型格式转换后的)路径以及最终保存的路径。\n", + "\n", + "> `xtuner convert merge`命令用于合并模型。该命令需要三个参数:`LLM` 表示原模型路径,`ADAPTER` 表示 Adapter 层的路径, `SAVE_PATH` 表示合并后的模型最终的保存路径。\n", + "\n", + "在模型合并这一步还有其他很多的可选参数,包括:\n", + "\n", + "| 参数名 | 解释 |\n", + "| ---------------------- | ------------------------------------------------------------ |\n", + "| --max-shard-size {GB} | 代表每个权重文件最大的大小(默认为2GB) |\n", + "| --device {device_name} | 这里指的就是device的名称,可选择的有cuda、cpu和auto,默认为cuda即使用gpu进行运算 |\n", + "| --is-clip | 这个参数主要用于确定模型是不是CLIP模型,假如是的话就要加上,不是就不需要添加 |\n", + "\n", + "> CLIP(Contrastive Language–Image Pre-training)模型是 OpenAI 开发的一种预训练模型,它能够理解图像和描述它们的文本之间的关系。CLIP 通过在大规模数据集上学习图像和对应文本之间的对应关系,从而实现了对图像内容的理解和分类,甚至能够根据文本提示生成图像。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4ad56444", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "!MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU xtuner convert merge Shanghai_AI_Laboratory/internlm2-chat-1_8b ./hf ./merged --max-shard-size 2GB" + ] + }, + { + "cell_type": "markdown", + "id": "8f0e9d87", + "metadata": {}, + "source": [ + "模型合并完成后,我们的目录结构应该是这样子的。\n", + "\n", + "
\n", + "目录结构\n", + "\n", + "```\n", + "├── merged\n", + "│ ├── config.json\n", + "│ ├── configuration_internlm2.py\n", + "│ ├── generation_config.json\n", + "│ ├── modeling_internlm2.py\n", + "│ ├── pytorch_model-00001-of-00002.bin\n", + "│ ├── pytorch_model-00002-of-00002.bin\n", + "│ ├── pytorch_model.bin.index.json\n", + "│ ├── special_tokens_map.json\n", + "│ ├── tokenization_internlm2.py\n", + "│ ├── tokenization_internlm2_fast.py\n", + "│ ├── tokenizer.json\n", + "│ ├── tokenizer.model\n", + "│ └── tokenizer_config.json\n", + "```\n", + "\n", + "
\n", + "\n", + "在模型合并完成后,我们就可以看到最终的模型和原模型文件夹非常相似,包括了分词器、权重文件、配置信息等等。" + ] + }, + { + "cell_type": "markdown", + "id": "004f8def", + "metadata": {}, + "source": [ + "### 3.3 微调后的模型对话" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4c0de6f9", + "metadata": {}, + "outputs": [], + "source": [ + "tokenizer, model = load_model(\"./merged\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a6ad94c6", + "metadata": {}, + "outputs": [], + "source": [ + "chat(\"请介绍一下你自己\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "11c6347a", + "metadata": {}, + "outputs": [], + "source": [ + "chat(\"你在实战营做什么\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0c740a3c", + "metadata": {}, + "outputs": [], + "source": [ + "chat(\"介绍一下成都\")" + ] + }, + { + "cell_type": "markdown", + "id": "553963b4", + "metadata": {}, + "source": [ + "可以看到,通过指令微调,我们成功得到了一个自己的小助手。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "40277c04", + "metadata": {}, + "outputs": [], + "source": [ + "del tokenizer, model\n", + "\n", + "torch.cuda.empty_cache()" + ] + }, + { + "cell_type": "markdown", + "id": "41f5d2ef", + "metadata": {}, + "source": [ + "## 4 Web Demo 部署\n", + "\n", + "除了在终端中对模型进行测试,我们其实还可以在网页端的 Demo 进行对话。" + ] + }, + { + "cell_type": "markdown", + "id": "444228e0", + "metadata": {}, + "source": [ + "首先,我们需要安装所需要的依赖。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7ee8f3a9", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "pip install streamlit" + ] + }, + { + "cell_type": "markdown", + "id": "477c0a5d", + "metadata": {}, + "source": [ + "其次,我们需要准备一个Streamlit程序的脚本。" + ] + }, + { + "cell_type": "markdown", + "id": "77ddf404", + "metadata": {}, + "source": [ + "Streamlit程序的完整代码是:[tools/xtuner_streamlit_demo.py](../../../tools/xtuner_streamlit_demo.py)。\n", + "\n", + "
\n", + "xtuner_streamlit_demo.py\n", + "\n", + "```python\n", + "import copy\n", + "import warnings\n", + "from dataclasses import asdict, dataclass\n", + "from typing import Callable, List, Optional\n", + "\n", + "import streamlit as st\n", + "import torch\n", + "from torch import nn\n", + "from transformers.generation.utils import (LogitsProcessorList,\n", + " StoppingCriteriaList)\n", + "from transformers.utils import logging\n", + "\n", + "from transformers import AutoTokenizer, AutoModelForCausalLM # isort: skip\n", + "\n", + "logger = logging.get_logger(__name__)\n", + "\n", + "\n", + "model_name_or_path = \"./merged\"\n", + "\n", + "@dataclass\n", + "class GenerationConfig:\n", + " # this config is used for chat to provide more diversity\n", + " max_length: int = 2048\n", + " top_p: float = 0.75\n", + " temperature: float = 0.1\n", + " do_sample: bool = True\n", + " repetition_penalty: float = 1.000\n", + "\n", + "\n", + "@torch.inference_mode()\n", + "def generate_interactive(\n", + " model,\n", + " tokenizer,\n", + " prompt,\n", + " generation_config: Optional[GenerationConfig] = None,\n", + " logits_processor: Optional[LogitsProcessorList] = None,\n", + " stopping_criteria: Optional[StoppingCriteriaList] = None,\n", + " prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor],\n", + " List[int]]] = None,\n", + " additional_eos_token_id: Optional[int] = None,\n", + " **kwargs,\n", + "):\n", + " inputs = tokenizer([prompt], padding=True, return_tensors='pt')\n", + " input_length = len(inputs['input_ids'][0])\n", + " for k, v in inputs.items():\n", + " inputs[k] = v.cuda()\n", + " input_ids = inputs['input_ids']\n", + " _, input_ids_seq_length = input_ids.shape[0], input_ids.shape[-1]\n", + " if generation_config is None:\n", + " generation_config = model.generation_config\n", + " generation_config = copy.deepcopy(generation_config)\n", + " model_kwargs = generation_config.update(**kwargs)\n", + " bos_token_id, eos_token_id = ( # noqa: F841 # pylint: disable=W0612\n", + " generation_config.bos_token_id,\n", + " generation_config.eos_token_id,\n", + " )\n", + " if isinstance(eos_token_id, int):\n", + " eos_token_id = [eos_token_id]\n", + " if additional_eos_token_id is not None:\n", + " eos_token_id.append(additional_eos_token_id)\n", + " has_default_max_length = kwargs.get(\n", + " 'max_length') is None and generation_config.max_length is not None\n", + " if has_default_max_length and generation_config.max_new_tokens is None:\n", + " warnings.warn(\n", + " f\"Using 'max_length''s default ({repr(generation_config.max_length)}) \\\n", + " to control the generation length. \"\n", + " 'This behaviour is deprecated and will be removed from the \\\n", + " config in v5 of Transformers -- we'\n", + " ' recommend using `max_new_tokens` to control the maximum \\\n", + " length of the generation.',\n", + " UserWarning,\n", + " )\n", + " elif generation_config.max_new_tokens is not None:\n", + " generation_config.max_length = generation_config.max_new_tokens + \\\n", + " input_ids_seq_length\n", + " if not has_default_max_length:\n", + " logger.warn( # pylint: disable=W4902\n", + " f\"Both 'max_new_tokens' (={generation_config.max_new_tokens}) \"\n", + " f\"and 'max_length'(={generation_config.max_length}) seem to \"\n", + " \"have been set. 'max_new_tokens' will take precedence. \"\n", + " 'Please refer to the documentation for more information. 
'\n", + " '(https://huggingface.co/docs/transformers/main/'\n", + " 'en/main_classes/text_generation)',\n", + " UserWarning,\n", + " )\n", + "\n", + " if input_ids_seq_length >= generation_config.max_length:\n", + " input_ids_string = 'input_ids'\n", + " logger.warning(\n", + " f\"Input length of {input_ids_string} is {input_ids_seq_length}, \"\n", + " f\"but 'max_length' is set to {generation_config.max_length}. \"\n", + " 'This can lead to unexpected behavior. You should consider'\n", + " \" increasing 'max_new_tokens'.\")\n", + "\n", + " # 2. Set generation parameters if not already defined\n", + " logits_processor = logits_processor if logits_processor is not None \\\n", + " else LogitsProcessorList()\n", + " stopping_criteria = stopping_criteria if stopping_criteria is not None \\\n", + " else StoppingCriteriaList()\n", + "\n", + " logits_processor = model._get_logits_processor(\n", + " generation_config=generation_config,\n", + " input_ids_seq_length=input_ids_seq_length,\n", + " encoder_input_ids=input_ids,\n", + " prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,\n", + " logits_processor=logits_processor,\n", + " )\n", + "\n", + " stopping_criteria = model._get_stopping_criteria(\n", + " generation_config=generation_config,\n", + " stopping_criteria=stopping_criteria)\n", + " logits_warper = model._get_logits_warper(generation_config)\n", + "\n", + " unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1)\n", + " scores = None\n", + " while True:\n", + " model_inputs = model.prepare_inputs_for_generation(\n", + " input_ids, **model_kwargs)\n", + " # forward pass to get next token\n", + " outputs = model(\n", + " **model_inputs,\n", + " return_dict=True,\n", + " output_attentions=False,\n", + " output_hidden_states=False,\n", + " )\n", + "\n", + " next_token_logits = outputs.logits[:, -1, :]\n", + "\n", + " # pre-process distribution\n", + " next_token_scores = logits_processor(input_ids, next_token_logits)\n", + " next_token_scores = logits_warper(input_ids, next_token_scores)\n", + "\n", + " # sample\n", + " probs = nn.functional.softmax(next_token_scores, dim=-1)\n", + " if generation_config.do_sample:\n", + " next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)\n", + " else:\n", + " next_tokens = torch.argmax(probs, dim=-1)\n", + "\n", + " # update generated ids, model inputs, and length for next step\n", + " input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)\n", + " model_kwargs = model._update_model_kwargs_for_generation(\n", + " outputs, model_kwargs, is_encoder_decoder=False)\n", + " unfinished_sequences = unfinished_sequences.mul(\n", + " (min(next_tokens != i for i in eos_token_id)).long())\n", + "\n", + " output_token_ids = input_ids[0].cpu().tolist()\n", + " output_token_ids = output_token_ids[input_length:]\n", + " for each_eos_token_id in eos_token_id:\n", + " if output_token_ids[-1] == each_eos_token_id:\n", + " output_token_ids = output_token_ids[:-1]\n", + " response = tokenizer.decode(output_token_ids)\n", + "\n", + " yield response\n", + " # stop when each sentence is finished\n", + " # or if we exceed the maximum length\n", + " if unfinished_sequences.max() == 0 or stopping_criteria(\n", + " input_ids, scores):\n", + " break\n", + "\n", + "\n", + "def on_btn_click():\n", + " del st.session_state.messages\n", + "\n", + "\n", + "@st.cache_resource\n", + "def load_model():\n", + " model = (AutoModelForCausalLM.from_pretrained(model_name_or_path,\n", + " trust_remote_code=True).to(\n", + " torch.bfloat16).cuda())\n", + " tokenizer 
= AutoTokenizer.from_pretrained(model_name_or_path,\n", + " trust_remote_code=True)\n", + " return model, tokenizer\n", + "\n", + "\n", + "def prepare_generation_config():\n", + " with st.sidebar:\n", + " max_length = st.slider('Max Length',\n", + " min_value=8,\n", + " max_value=32768,\n", + " value=2048)\n", + " top_p = st.slider('Top P', 0.0, 1.0, 0.75, step=0.01)\n", + " temperature = st.slider('Temperature', 0.0, 1.0, 0.1, step=0.01)\n", + " st.button('Clear Chat History', on_click=on_btn_click)\n", + "\n", + " generation_config = GenerationConfig(max_length=max_length,\n", + " top_p=top_p,\n", + " temperature=temperature)\n", + "\n", + " return generation_config\n", + "\n", + "\n", + "user_prompt = '<|im_start|>user\\n{user}<|im_end|>\\n'\n", + "robot_prompt = '<|im_start|>assistant\\n{robot}<|im_end|>\\n'\n", + "cur_query_prompt = '<|im_start|>user\\n{user}<|im_end|>\\n\\\n", + " <|im_start|>assistant\\n'\n", + "\n", + "\n", + "def combine_history(prompt):\n", + " messages = st.session_state.messages\n", + " meta_instruction = ('')\n", + " total_prompt = f\"<|im_start|>system\\n{meta_instruction}<|im_end|>\\n\"\n", + " for message in messages:\n", + " cur_content = message['content']\n", + " if message['role'] == 'user':\n", + " cur_prompt = user_prompt.format(user=cur_content)\n", + " elif message['role'] == 'robot':\n", + " cur_prompt = robot_prompt.format(robot=cur_content)\n", + " else:\n", + " raise RuntimeError\n", + " total_prompt += cur_prompt\n", + " total_prompt = total_prompt + cur_query_prompt.format(user=prompt)\n", + " return total_prompt\n", + "\n", + "\n", + "def main():\n", + " # torch.cuda.empty_cache()\n", + " print('load model begin.')\n", + " model, tokenizer = load_model()\n", + " print('load model end.')\n", + "\n", + "\n", + " st.title('InternLM2-Chat-1.8B')\n", + "\n", + " generation_config = prepare_generation_config()\n", + "\n", + " # Initialize chat history\n", + " if 'messages' not in st.session_state:\n", + " st.session_state.messages = []\n", + "\n", + " # Display chat messages from history on app rerun\n", + " for message in st.session_state.messages:\n", + " with st.chat_message(message['role'], avatar=message.get('avatar')):\n", + " st.markdown(message['content'])\n", + "\n", + " # Accept user input\n", + " if prompt := st.chat_input('What is up?'):\n", + " # Display user message in chat message container\n", + " with st.chat_message('user'):\n", + " st.markdown(prompt)\n", + " real_prompt = combine_history(prompt)\n", + " # Add user message to chat history\n", + " st.session_state.messages.append({\n", + " 'role': 'user',\n", + " 'content': prompt,\n", + " })\n", + "\n", + " with st.chat_message('robot'):\n", + " message_placeholder = st.empty()\n", + " for cur_response in generate_interactive(\n", + " model=model,\n", + " tokenizer=tokenizer,\n", + " prompt=real_prompt,\n", + " additional_eos_token_id=92542,\n", + " **asdict(generation_config),\n", + " ):\n", + " # Display robot response in chat message container\n", + " message_placeholder.markdown(cur_response + '▌')\n", + " message_placeholder.markdown(cur_response)\n", + " # Add robot response to chat history\n", + " st.session_state.messages.append({\n", + " 'role': 'robot',\n", + " 'content': cur_response, # pylint: disable=undefined-loop-variable\n", + " })\n", + " torch.cuda.empty_cache()\n", + "\n", + "\n", + "if __name__ == '__main__':\n", + " main()\n", + "\n", + "```\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "c7f48f42", + "metadata": {}, + "source": [ + "然后,我们可以直接启动应用。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f896f75a", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "!streamlit run xtuner_streamlit_demo.py" + ] + }, + { + "cell_type": "markdown", + "id": "e8efe0da", + "metadata": {}, + "source": [ + "运行后,在访问前,我们还需要做的就是将端口映射到本地。\n", + "\n", + "通过如图所示的地方,获取开发机的端口和密码。\n", + "\n", + "![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-09.png)" + ] + }, + { + "cell_type": "markdown", + "id": "b4024b0d", + "metadata": {}, + "source": [ + "然后在本地使用 PowerShell 或者命令行终端,执行以下命令:\n", + "\n", + "> 其中,`8501`是Streamlit程序的服务端口,`43551`需要替换为自己的开发机的端口。\n", + "\n", + "```bash\n", + "ssh -CNg -L 8501:127.0.0.1:8501 root@ssh.intern-ai.org.cn -p 43551\n", + "```\n", + "\n", + "然后再输入开发机的root密码。\n", + "\n", + "![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-10.png)" + ] + }, + { + "cell_type": "markdown", + "id": "c13db446", + "metadata": {}, + "source": [ + "最后,我们就可以在本地通过浏览器访问:http://127.0.0.1:8501 来进行对话了。\n", + "\n", + "![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-12.png)" + ] + }, + { + "cell_type": "markdown", + "id": "e4a7edaf", + "metadata": {}, + "source": [ + "## 5 小结\n", + "\n", + "经过本节的学习,带领着大家跑通了 XTuner 的完整流程,我们学会了指令跟随微调,并且训练出了一个自己小助手,是不是很有意思!\n", + "\n", + "当我们在测试完模型认为其满足我们的需求后,就可以对模型进行量化部署等操作了,这部分的内容在之后关于 LMDeploy 的课程中将会详细的进行讲解,敬请期待后续的课程吧!\n", + "\n", + "关于XTuner的更多高级进阶知识,让我们继续往下探索吧!" + ] + }, + { + "cell_type": "markdown", + "id": "1da820a2-86e8-4f8a-ada4-3750b6dbe445", + "metadata": {}, + "source": [ + "## 6 增量预训练微调\n", + "\n", + "本节我们先来了解一下增量预训练,这里我们以一个文本续写案例来看看效果。\n", + "\n", + "| | 微调前 | 微调后 |\n", + "| --- | --- | --- |\n", + "| 输入 | 书生·浦语大模型实战营第三期是 | 书生·浦语大模型实战营第三期是 |\n", + "| 输出| 书生·浦语大模型实战营第三期是上周五,上周五我们学习了一个新的知识,那就是关于机器学习的概率统计。…… | 书生·浦语大模型实战营第三期是上海人工智能实验室推出的书生·浦语大模型实战营系列活动的第三批次,将于2024年7月正式进行。…… |" + ] + }, + { + "cell_type": "markdown", + "id": "4b8a61f2", + "metadata": {}, + "source": [ + "我们需要定义一些基本方法。" + ] + }, + { + "cell_type": "markdown", + "id": "d3f1238e", + "metadata": {}, + "source": [ + "- 导入必要的库" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "73995f91", + "metadata": {}, + "outputs": [], + "source": [ + "import torch\n", + "from transformers import AutoTokenizer, AutoModelForCausalLM" + ] + }, + { + "cell_type": "markdown", + "id": "b67d97ce", + "metadata": {}, + "source": [ + "- 定义模型加载方法" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c275406e", + "metadata": {}, + "outputs": [], + "source": [ + "def load_model(model_path):\n", + " tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)\n", + " model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).cuda()\n", + " model = model.eval()\n", + " return tokenizer, model" + ] + }, + { + "cell_type": "markdown", + "id": "5a0a7754", + "metadata": {}, + "source": [ + "- 定义文本续写方法" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b8fd8995", + "metadata": {}, + "outputs": [], + "source": [ + "def generate(user_input):\n", + " gen_kwargs = {\"max_length\": 128, \"top_p\": 0.8, \"temperature\": 0.8, \"do_sample\": True, \"repetition_penalty\": 1.0}\n", + "\n", + " inputs = tokenizer([user_input], return_tensors=\"pt\")\n", + " for k,v in inputs.items():\n", + " inputs[k] = 
v.cuda()\n", + " output = model.generate(**inputs, **gen_kwargs)\n", + " output = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)\n", + " return output" + ] + }, + { + "cell_type": "markdown", + "id": "aac7a4e4", + "metadata": {}, + "source": [ + "### 6.1 基座模型推理\n", + "\n", + "我们先来看看基座模型的推理结果。" + ] + }, + { + "cell_type": "markdown", + "id": "c3d48c8a", + "metadata": {}, + "source": [ + "- 加载模型" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "22cb798d", + "metadata": {}, + "outputs": [], + "source": [ + "tokenizer, model = load_model(\"Shanghai_AI_Laboratory/internlm2-1_8b\")" + ] + }, + { + "cell_type": "markdown", + "id": "f536a771", + "metadata": {}, + "source": [ + "- 文本续写" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "44ccf83e", + "metadata": {}, + "outputs": [], + "source": [ + "generate(\"书生·浦语大模型实战营第三期是\")" + ] + }, + { + "cell_type": "markdown", + "id": "5fa279db", + "metadata": {}, + "source": [ + "- 释放缓存" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8716d3c9", + "metadata": {}, + "outputs": [], + "source": [ + "del tokenizer, model\n", + "\n", + "torch.cuda.empty_cache()" + ] + }, + { + "cell_type": "markdown", + "id": "0694693a", + "metadata": {}, + "source": [ + "### 6.2 增量预训练\n", + "\n", + "然后我们对基座模型进行增量预训练,让模型增加新的知识。" + ] + }, + { + "cell_type": "markdown", + "id": "fea34851", + "metadata": {}, + "source": [ + "#### 6.2.1 准备数据文件\n", + "\n", + "为了让模型学习到新的知识,我们需要将新的知识数据整理成指定格式文件,形成数据集,然后让模型来学习这些新数据。这里我们准备一个简单的数据集 `datas/pretrain.json`,仅包含一条知识,然后让数据重复多次。\n", + "\n", + "> 网上有大量的开源数据集可以供我们进行使用,有些时候我们可以在开源数据集的基础上添加一些我们自己独有的数据集,也可能会有很好的效果。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b1502071", + "metadata": {}, + "outputs": [], + "source": [ + "[\n", + " {\n", + " \"text\": \"书生·浦语大模型实战营第三期是上海人工智能实验室推出的书生·浦语大模型实战营系列活动的第三批次,将于2024年7月正式进行。\"\n", + " }\n", + "]" + ] + }, + { + "cell_type": "markdown", + "id": "665e305c", + "metadata": {}, + "source": [ + "准备好数据文件后,我们的目录结构应该是这样子的。\n", + "\n", + "
\n", + "目录结构\n", + "\n", + "```\n", + "├── Shanghai_AI_Laboratory\n", + "│ ├── internlm2-1_8b\n", + "│ │ ├── README.md\n", + "│ │ ├── config.json\n", + "│ │ ├── configuration.json\n", + "│ │ ├── configuration_internlm2.py\n", + "│ │ ├── generation_config.json\n", + "│ │ ├── modeling_internlm2.py\n", + "│ │ ├── pytorch_model.bin\n", + "│ │ ├── special_tokens_map.json\n", + "│ │ ├── tokenization_internlm2.py\n", + "│ │ ├── tokenization_internlm2_fast.py\n", + "│ │ ├── tokenizer.json\n", + "│ │ ├── tokenizer.model\n", + "│ │ └── tokenizer_config.json\n", + "│ └── internlm2-chat-1_8b -> /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b\n", + "│ ├── README.md\n", + "│ ├── config.json\n", + "│ ├── configuration.json\n", + "│ ├── configuration_internlm2.py\n", + "│ ├── generation_config.json\n", + "│ ├── model-00001-of-00002.safetensors\n", + "│ ├── model-00002-of-00002.safetensors\n", + "│ ├── model.safetensors.index.json\n", + "│ ├── modeling_internlm2.py\n", + "│ ├── special_tokens_map.json\n", + "│ ├── tokenization_internlm2.py\n", + "│ ├── tokenization_internlm2_fast.py\n", + "│ ├── tokenizer.model\n", + "│ └── tokenizer_config.json\n", + "├── datas\n", + "│ ├── assistant.json\n", + "│ └── pretrain.json\n", + "├── internlm2_chat_1_8b_qlora_alpaca_e3_copy.py\n", + "```\n", + "\n", + "
\n" + ] + }, + { + "cell_type": "markdown", + "id": "ae9221ff", + "metadata": {}, + "source": [ + "#### 6.2.2 准备配置文件\n", + "\n", + "在准备好了模型和数据集后,我们就要根据我们选择的微调方法结合微调方案来找到与我们最匹配的配置文件了,从而减少我们对配置文件的修改量。\n", + "\n", + "这里我们选择使用 `internlm2_1_8b_full_custom_pretrain_e1` 配置文件。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8cfbdd74", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "!xtuner copy-cfg internlm2_1_8b_full_custom_pretrain_e1 ." + ] + }, + { + "cell_type": "markdown", + "id": "b72d52ee", + "metadata": {}, + "source": [ + "复制好配置文件后,我们的目录结构应该是这样子的。\n", + "\n", + "
\n", + "目录结构\n", + "\n", + "```\n", + "├── Shanghai_AI_Laboratory\n", + "│ ├── internlm2-1_8b\n", + "│ │ ├── README.md\n", + "│ │ ├── config.json\n", + "│ │ ├── configuration.json\n", + "│ │ ├── configuration_internlm2.py\n", + "│ │ ├── generation_config.json\n", + "│ │ ├── modeling_internlm2.py\n", + "│ │ ├── pytorch_model.bin\n", + "│ │ ├── special_tokens_map.json\n", + "│ │ ├── tokenization_internlm2.py\n", + "│ │ ├── tokenization_internlm2_fast.py\n", + "│ │ ├── tokenizer.json\n", + "│ │ ├── tokenizer.model\n", + "│ │ └── tokenizer_config.json\n", + "│ └── internlm2-chat-1_8b -> /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b\n", + "│ ├── README.md\n", + "│ ├── config.json\n", + "│ ├── configuration.json\n", + "│ ├── configuration_internlm2.py\n", + "│ ├── generation_config.json\n", + "│ ├── model-00001-of-00002.safetensors\n", + "│ ├── model-00002-of-00002.safetensors\n", + "│ ├── model.safetensors.index.json\n", + "│ ├── modeling_internlm2.py\n", + "│ ├── special_tokens_map.json\n", + "│ ├── tokenization_internlm2.py\n", + "│ ├── tokenization_internlm2_fast.py\n", + "│ ├── tokenizer.model\n", + "│ └── tokenizer_config.json\n", + "├── datas\n", + "│ ├── assistant.json\n", + "│ └── pretrain.json\n", + "├── internlm2_1_8b_full_custom_pretrain_e1_copy.py\n", + "├── internlm2_chat_1_8b_qlora_alpaca_e3_copy.py\n", + "```\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "c82985b1", + "metadata": {}, + "source": [ + "下面我们将根据项目的需求一步步的进行修改和调整吧!\n", + "\n", + "在 PART 1 的部分,由于我们不再需要在 HuggingFace 上自动下载模型,因此我们先要更换模型的路径以及数据集的路径为我们本地的路径。\n", + "\n", + "为了训练过程中能够实时观察到模型的变化情况,XTuner 贴心的推出了一个 `evaluation_inputs` 的参数来让我们能够设置多个问题来确保模型在训练过程中的变化是朝着我们想要的方向前进的。我们可以添加自己的输入。\n", + "\n", + "在 PART 2 的部分,由于我们复制的配置文件是全参数微调的配置,而我们希望使用 `QLoRA` 算法进行微调,所以可以添加 `QLoRA` 算法的配置。\n", + "\n", + "```diff\n", + "+ from peft import LoraConfig\n", + "\n", + "+ import torch\n", + "\n", + "- from transformers import AutoModelForCausalLM, AutoTokenizer\n", + "+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n", + "\n", + "#######################################################################\n", + "# PART 1 Settings #\n", + "#######################################################################\n", + "- pretrained_model_name_or_path = 'internlm/internlm2-1_8b'\n", + "+ pretrained_model_name_or_path = 'Shanghai_AI_Laboratory/internlm2-1_8b'\n", + "\n", + "- data_files = ['/path/to/json/file.json']\n", + "+ data_files = ['datas/pretrain.json']\n", + "\n", + "- evaluation_inputs = ['上海是', 'Shanghai is']\n", + "+ evaluation_inputs = ['书生·浦语大模型实战营第三期是', '上海是', 'Shanghai is']\n", + "\n", + "#######################################################################\n", + "# PART 2 Model & Tokenizer #\n", + "#######################################################################\n", + "model = dict(\n", + " type=SupervisedFinetune,\n", + " use_varlen_attn=use_varlen_attn,\n", + " llm=dict(\n", + " type=AutoModelForCausalLM.from_pretrained,\n", + " pretrained_model_name_or_path=pretrained_model_name_or_path,\n", + " trust_remote_code=True,\n", + "+ quantization_config=dict(\n", + "+ type=BitsAndBytesConfig,\n", + "+ load_in_4bit=True,\n", + "+ load_in_8bit=False,\n", + "+ llm_int8_threshold=6.0,\n", + "+ llm_int8_has_fp16_weight=False,\n", + "+ bnb_4bit_compute_dtype=torch.float16,\n", + "+ bnb_4bit_use_double_quant=True,\n", + "+ bnb_4bit_quant_type='nf4')\n", + " ),\n", + "+ lora=dict(\n", + "+ type=LoraConfig,\n", + "+ r=64,\n", + "+ lora_alpha=16,\n", + "+ lora_dropout=0.1,\n", + "+ bias='none',\n", + "+ task_type='CAUSAL_LM')\n", + ")\n", + "```\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "id": "8473fc9f", + "metadata": {}, + "source": [ + "修改完后的完整的配置文件是:[configs/internlm2_1_8b_full_custom_pretrain_e1_copy.py](../../../configs/internlm2_1_8b_full_custom_pretrain_e1_copy.py)。\n", + "\n", + "
\n", + "internlm2_1_8b_full_custom_pretrain_e1_copy.py\n", + "\n", + "```python\n", + "# Copyright (c) OpenMMLab. All rights reserved.\n", + "\"\"\"Data format:\n", + "\n", + "[\n", + " {\n", + " \"text\": \"xxx\"\n", + " },\n", + " {\n", + " \"text\": \"xxx\"\n", + " },\n", + " ...\n", + "]\n", + "\"\"\" # noqa: E501\n", + "\n", + "from datasets import load_dataset\n", + "from mmengine.dataset import DefaultSampler\n", + "from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,\n", + " LoggerHook, ParamSchedulerHook)\n", + "from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR\n", + "from peft import LoraConfig\n", + "import torch\n", + "from torch.optim import AdamW\n", + "from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n", + "\n", + "from xtuner.dataset import process_hf_dataset\n", + "from xtuner.dataset.collate_fns import default_collate_fn\n", + "from xtuner.dataset.map_fns import pretrain_map_fn\n", + "from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,\n", + " VarlenAttnArgsToMessageHubHook)\n", + "from xtuner.engine.runner import TrainLoop\n", + "from xtuner.model import SupervisedFinetune\n", + "\n", + "#######################################################################\n", + "# PART 1 Settings #\n", + "#######################################################################\n", + "# Model\n", + "pretrained_model_name_or_path = 'Shanghai_AI_Laboratory/internlm2-1_8b'\n", + "use_varlen_attn = False\n", + "\n", + "# Data\n", + "data_files = ['datas/pretrain.json']\n", + "max_length = 2048\n", + "pack_to_max_length = True\n", + "\n", + "# Scheduler & Optimizer\n", + "batch_size = 1 # per_device\n", + "accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc\n", + "dataloader_num_workers = 0\n", + "max_epochs = 1\n", + "optim_type = AdamW\n", + "lr = 2e-5\n", + "betas = (0.9, 0.999)\n", + "weight_decay = 0\n", + "max_norm = 1 # grad clip\n", + "warmup_ratio = 0.03\n", + "\n", + "# Save\n", + "save_steps = 500\n", + "save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)\n", + "\n", + "# Evaluate the generation performance during the training\n", + "evaluation_freq = 500\n", + "SYSTEM = ''\n", + "evaluation_inputs = ['书生·浦语大模型实战营第三期是', '上海是', 'Shanghai is']\n", + "\n", + "#######################################################################\n", + "# PART 2 Model & Tokenizer #\n", + "#######################################################################\n", + "tokenizer = dict(\n", + " type=AutoTokenizer.from_pretrained,\n", + " pretrained_model_name_or_path=pretrained_model_name_or_path,\n", + " trust_remote_code=True,\n", + " padding_side='right')\n", + "\n", + "model = dict(\n", + " type=SupervisedFinetune,\n", + " use_varlen_attn=use_varlen_attn,\n", + " llm=dict(\n", + " type=AutoModelForCausalLM.from_pretrained,\n", + " pretrained_model_name_or_path=pretrained_model_name_or_path,\n", + " trust_remote_code=True,\n", + " quantization_config=dict(\n", + " type=BitsAndBytesConfig,\n", + " load_in_4bit=True,\n", + " load_in_8bit=False,\n", + " llm_int8_threshold=6.0,\n", + " llm_int8_has_fp16_weight=False,\n", + " bnb_4bit_compute_dtype=torch.float16,\n", + " bnb_4bit_use_double_quant=True,\n", + " bnb_4bit_quant_type='nf4')\n", + " ),\n", + " lora=dict(\n", + " type=LoraConfig,\n", + " r=64,\n", + " lora_alpha=16,\n", + " lora_dropout=0.1,\n", + " bias='none',\n", + " task_type='CAUSAL_LM')\n", + ")\n", + "\n", + 
"#######################################################################\n", + "# PART 3 Dataset & Dataloader #\n", + "#######################################################################\n", + "train_dataset = dict(\n", + " type=process_hf_dataset,\n", + " dataset=dict(type=load_dataset, path='json', data_files=data_files),\n", + " tokenizer=tokenizer,\n", + " max_length=max_length,\n", + " dataset_map_fn=pretrain_map_fn,\n", + " template_map_fn=None,\n", + " remove_unused_columns=True,\n", + " shuffle_before_pack=False,\n", + " pack_to_max_length=pack_to_max_length,\n", + " use_varlen_attn=use_varlen_attn)\n", + "\n", + "train_dataloader = dict(\n", + " batch_size=batch_size,\n", + " num_workers=dataloader_num_workers,\n", + " dataset=train_dataset,\n", + " sampler=dict(type=DefaultSampler, shuffle=True),\n", + " collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))\n", + "\n", + "#######################################################################\n", + "# PART 4 Scheduler & Optimizer #\n", + "#######################################################################\n", + "# optimizer\n", + "optim_wrapper = dict(\n", + " type=AmpOptimWrapper,\n", + " optimizer=dict(\n", + " type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),\n", + " clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),\n", + " accumulative_counts=accumulative_counts,\n", + " loss_scale='dynamic',\n", + " dtype='float16')\n", + "\n", + "# learning policy\n", + "# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501\n", + "param_scheduler = [\n", + " dict(\n", + " type=LinearLR,\n", + " start_factor=1e-5,\n", + " by_epoch=True,\n", + " begin=0,\n", + " end=warmup_ratio * max_epochs,\n", + " convert_to_iter_based=True),\n", + " dict(\n", + " type=CosineAnnealingLR,\n", + " eta_min=0.0,\n", + " by_epoch=True,\n", + " begin=warmup_ratio * max_epochs,\n", + " end=max_epochs,\n", + " convert_to_iter_based=True)\n", + "]\n", + "\n", + "# train, val, test setting\n", + "train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)\n", + "\n", + "#######################################################################\n", + "# PART 5 Runtime #\n", + "#######################################################################\n", + "# Log the dialogue periodically during the training process, optional\n", + "custom_hooks = [\n", + " dict(type=DatasetInfoHook, tokenizer=tokenizer),\n", + " dict(\n", + " type=EvaluateChatHook,\n", + " tokenizer=tokenizer,\n", + " every_n_iters=evaluation_freq,\n", + " evaluation_inputs=evaluation_inputs,\n", + " system=SYSTEM)\n", + "]\n", + "\n", + "if use_varlen_attn:\n", + " custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]\n", + "\n", + "# configure default hooks\n", + "default_hooks = dict(\n", + " # record the time of every iteration.\n", + " timer=dict(type=IterTimerHook),\n", + " # print log every 10 iterations.\n", + " logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),\n", + " # enable the parameter scheduler.\n", + " param_scheduler=dict(type=ParamSchedulerHook),\n", + " # save checkpoint per `save_steps`.\n", + " checkpoint=dict(\n", + " type=CheckpointHook,\n", + " by_epoch=False,\n", + " interval=save_steps,\n", + " max_keep_ckpts=save_total_limit),\n", + " # set sampler seed in distributed evrionment.\n", + " sampler_seed=dict(type=DistSamplerSeedHook),\n", + ")\n", + "\n", + "# configure environment\n", + "env_cfg = dict(\n", + " # whether to enable cudnn 
benchmark\n", + " cudnn_benchmark=False,\n", + " # set multi process parameters\n", + " mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),\n", + " # set distributed parameters\n", + " dist_cfg=dict(backend='nccl'),\n", + ")\n", + "\n", + "# set visualizer\n", + "visualizer = None\n", + "\n", + "# set log level\n", + "log_level = 'INFO'\n", + "\n", + "# load from which checkpoint\n", + "load_from = None\n", + "\n", + "# whether to resume training from the loaded checkpoint\n", + "resume = False\n", + "\n", + "# Defaults to use random seed and disable `deterministic`\n", + "randomness = dict(seed=None, deterministic=False)\n", + "\n", + "# set log processor\n", + "log_processor = dict(by_epoch=False)\n", + "```\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "be808d46", + "metadata": {}, + "source": [ + "#### 6.2.3 启动微调\n", + "\n", + "完成了所有的准备工作后,我们就可以正式的开始我们下一阶段的旅程:XTuner 启动~!\n", + "\n", + "当我们准备好了所有内容,我们只需要将使用 `xtuner train` 命令令即可开始训练。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6ac77d2d", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "!xtuner train ./internlm2_1_8b_full_custom_pretrain_e1_copy.py" + ] + }, + { + "cell_type": "markdown", + "id": "cf74cdeb", + "metadata": {}, + "source": [ + "在训练完后,我们的目录结构应该是这样子的。\n", + "\n", + "
\n", + "目录结构\n", + "\n", + "```\n", + "├── work_dirs\n", + "│ └── internlm2_1_8b_full_custom_pretrain_e1_copy\n", + "│ ├── 20240627_214522\n", + "│ │ ├── 20240627_214522.log\n", + "│ │ └── vis_data\n", + "│ │ ├── 20240627_214522.json\n", + "│ │ ├── config.py\n", + "│ │ ├── eval_outputs_iter_1499.txt\n", + "│ │ ├── eval_outputs_iter_1999.txt\n", + "│ │ ├── eval_outputs_iter_2499.txt\n", + "│ │ ├── eval_outputs_iter_2623.txt\n", + "│ │ ├── eval_outputs_iter_499.txt\n", + "│ │ ├── eval_outputs_iter_999.txt\n", + "│ │ └── scalars.json\n", + "│ ├── internlm2_1_8b_full_custom_pretrain_e1_copy.py\n", + "│ ├── iter_2500.pth\n", + "│ ├── iter_2624.pth\n", + "│ └── last_checkpoint\n", + "```\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "2672054f", + "metadata": {}, + "source": [ + "#### 6.2.4 模型格式转换\n", + "\n", + "模型转换的本质其实就是将原本使用 Pytorch 训练出来的模型权重文件转换为目前通用的 HuggingFace 格式文件,那么我们可以通过以下命令来实现一键转换。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "82570d4e", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "!pth_file=`ls -t ./work_dirs/internlm2_1_8b_full_custom_pretrain_e1_copy/*.pth | head -n 1` && MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU xtuner convert pth_to_hf ./internlm2_1_8b_full_custom_pretrain_e1_copy.py ${pth_file} ./hf" + ] + }, + { + "cell_type": "markdown", + "id": "a055caa3", + "metadata": {}, + "source": [ + "模型格式转换完成后,我们的目录结构应该是这样子的。\n", + "\n", + "
\n", + "目录结构\n", + "\n", + "```\n", + "├── hf\n", + "│ ├── README.md\n", + "│ ├── adapter_config.json\n", + "│ ├── adapter_model.bin\n", + "│ └── xtuner_config.py\n", + "```\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "7b1e8a04", + "metadata": {}, + "source": [ + "#### 6.2.5 模型合并\n", + "\n", + "对于 LoRA 或者 QLoRA 微调出来的模型其实并不是一个完整的模型,而是一个额外的层(Adapter),训练完的这个层最终还是要与原模型进行合并才能被正常的使用。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ad447926", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "!MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU xtuner convert merge Shanghai_AI_Laboratory/internlm2-1_8b ./hf ./merged --max-shard-size 2GB" + ] + }, + { + "cell_type": "markdown", + "id": "2d4c167d", + "metadata": {}, + "source": [ + "模型合并完成后,我们的目录结构应该是这样子的。\n", + "\n", + "
\n", + "目录结构\n", + "\n", + "```\n", + "├── merged\n", + "│ ├── config.json\n", + "│ ├── configuration_internlm2.py\n", + "│ ├── generation_config.json\n", + "│ ├── modeling_internlm2.py\n", + "│ ├── pytorch_model-00001-of-00002.bin\n", + "│ ├── pytorch_model-00002-of-00002.bin\n", + "│ ├── pytorch_model.bin.index.json\n", + "│ ├── special_tokens_map.json\n", + "│ ├── tokenization_internlm2.py\n", + "│ ├── tokenization_internlm2_fast.py\n", + "│ ├── tokenizer.json\n", + "│ ├── tokenizer.model\n", + "│ └── tokenizer_config.json\n", + "```\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "e55ae7b2", + "metadata": {}, + "source": [ + "### 6.3 目标模型推理\n", + "\n", + "当我们合并完成后,我们就能够正常的调用这个模型进行推理了。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "febaa17b", + "metadata": {}, + "outputs": [], + "source": [ + "tokenizer, model = load_model(\"./merged\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5ced1ac3", + "metadata": {}, + "outputs": [], + "source": [ + "generate(\"书生·浦语大模型实战营第三期是\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8fad2baa", + "metadata": {}, + "outputs": [], + "source": [ + "generate(\"成都是\")" + ] + }, + { + "cell_type": "markdown", + "id": "ac4314bc", + "metadata": {}, + "source": [ + "可以看到,通过增量预训练,确实在基座模型的基础上学习到了新的知识。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1485e2fa", + "metadata": {}, + "outputs": [], + "source": [ + "del tokenizer, model\n", + "\n", + "torch.cuda.empty_cache()" + ] + }, + { + "cell_type": "markdown", + "id": "2326897e-7594-4082-8bc7-57681f837cdc", + "metadata": {}, + "source": [ + "## 7 DeepSpeed介绍\n", + "\n", + "DeepSpeed是一个由微软开发的开源深度学习优化库,旨在提高大规模模型训练的效率和速度。\n", + "\n", + "XTuner 也内置了 `deepspeed` 来加速整体的训练过程,共有三种不同的 `deepspeed` 类型可进行选择,分别是 `deepspeed_zero1`, `deepspeed_zero2` 和 `deepspeed_zero3`。\n", + "\n", + "
\n", + "DeepSpeed优化器及其选择方法\n", + "\n", + "DeepSpeed是一个由微软开发的开源深度学习优化库,旨在提高大规模模型训练的效率和速度。它通过几种关键技术来优化训练过程,包括模型分割、梯度累积、以及内存和带宽优化等,能够降低训练超大规模模型的复杂性和资源需求,使训练变得更快、更高效。DeepSpeed特别适用于需要巨大计算资源的大型模型和数据集。\n", + "\n", + "在DeepSpeed中,引入了ZeRO(Zero Redundancy Optimizer)技术,是一种旨在降低训练大型模型所需内存占用的优化器,通过在分布式环境中分割优化器的状态、梯度和参数,减少冗余的内存占用,允许更大的模型和更快的训练速度。ZeRO 分为几个不同的级别,主要包括:\n", + "\n", + "- **deepspeed_zero1**:这是ZeRO的基本版本,它优化了模型参数的存储,主要通过分区存储优化器状态来减少内存使用。每个GPU设备只保存一部分优化器状态,从而显著减少内存消耗。\n", + "\n", + "- **deepspeed_zero2**:在deepspeed_zero1的基础上,deepspeed_zero2进一步优化了梯度和优化器状态的存储,将梯度也进行分区存储。这样,每个GPU设备只需要保存一部分的优化器状态和梯度,进一步减少内存使用。\n", + "\n", + "- **deepspeed_zero3**:这是目前最高级的优化等级,它包括了deepspeed_zero1和deepspeed_zero2的优化,除了优化器状态和梯度,还将模型参数进行分区存储。每个GPU设备只需要保存一部分的优化器状态、梯度和模型参数,从而最大限度地减少内存使用。\n", + "\n", + "选择哪种deepspeed类型主要取决于你的具体需求,包括模型的大小、可用的硬件资源(特别是GPU内存)以及训练的效率需求。一般来说:\n", + "\n", + "- 如果你的模型较小,或者内存资源充足,可能不需要使用最高级别的优化。\n", + "- 如果你需要快速训练模型,可能需要权衡内存优化和计算效率。deepspeed_zero1提供了较低的内存占用,同时保持了较高的计算效率。\n", + "- 如果你正在尝试训练非常大的模型,或者你的硬件资源有限,使用deepspeed_zero2或deepspeed_zero3可能更合适,因为它们可以显著降低内存占用,允许更大模型的训练。\n", + "- 选择时也要考虑到实现的复杂性和运行时的开销,更高级的优化可能需要更复杂的设置,更频繁的跨GPU通信,这可能需要更高的网络带宽,并可能增加一些计算开销。\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "c74dd2a4", + "metadata": {}, + "source": [ + "## 8 多卡微调\n", + "\n", + "模型的规模和复杂度不断增加,单张GPU的显存往往无法满足大模型的训练需求。此时,我们可能需要多卡微调,以应对大模型训练过程中显存和计算资源的需求。" + ] + }, + { + "cell_type": "markdown", + "id": "f3cc6567", + "metadata": {}, + "source": [ + "\n", + "XTuner 中使用多卡微调,只需要设置 `NPROC_PER_NODE` 环境变量,并使用 `DeepSpeed` 来进行加速就可以了,其余命令内容与单卡微调时一样。\n", + "\n", + "> 由于开发机只有两张显卡,所以我们设置`NPROC_PER_NODE=2`,并且选择使用`deepspeed_zero3`优化等级。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "27a213cd", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "!MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU NPROC_PER_NODE=2 xtuner train ./internlm2_chat_1_8b_qlora_alpaca_e3_copy.py --deepspeed deepspeed_zero3" + ] + }, + { + "cell_type": "markdown", + "id": "6c81ca6f", + "metadata": {}, + "source": [ + "在执行微调的过程中,我们可以看到两张显卡都有内存使用。\n", + "\n", + "![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-06.png)" + ] + }, + { + "cell_type": "markdown", + "id": "922b48a6", + "metadata": {}, + "source": [ + "在训练完后,我们的目录结构应该是这样子的。\n", + "\n", + "
\n", + "目录结构\n", + "\n", + "```\n", + "├── work_dirs\n", + "│ └── internlm2_chat_1_8b_qlora_alpaca_e3_copy\n", + "│ ├── 20240628_205957\n", + "│ │ ├── 20240628_205957.log\n", + "│ │ └── vis_data\n", + "│ │ ├── 20240628_205957.json\n", + "│ │ ├── config.py\n", + "│ │ ├── eval_outputs_iter_236.txt\n", + "│ │ └── scalars.json\n", + "│ ├── internlm2_chat_1_8b_qlora_alpaca_e3_copy.py\n", + "│ ├── iter_237.pth\n", + "│ │ ├── bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt\n", + "│ │ ├── bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt\n", + "│ │ ├── zero_pp_rank_0_mp_rank_00_model_states.pt\n", + "│ │ └── zero_pp_rank_1_mp_rank_00_model_states.pt\n", + "│ ├── last_checkpoint\n", + "│ └── zero_to_fp32.py\n", + "```\n", + "\n", + "
\n", + "\n", + "可以看到,通过 `deepspeed` 来训练后得到的权重文件和原本的权重文件是有所差别的,原本的仅仅是一个 .pth 的文件,而使用了 `deepspeed` 则是一个名字带有 .pth 的文件夹,在该文件夹里保存了 .pt 文件。这两者在具体的使用上并没有太大的差别,转换和合并的过程都是一样的。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fced7e55", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "!pth_file=`ls -t ./work_dirs/internlm2_chat_1_8b_qlora_alpaca_e3_copy | grep pth | head -n 1` && MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU xtuner convert pth_to_hf ./internlm2_chat_1_8b_qlora_alpaca_e3_copy.py ./work_dirs/internlm2_chat_1_8b_qlora_alpaca_e3_copy/${pth_file} ./hf" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4d9c58d5", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "!MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU xtuner convert merge Shanghai_AI_Laboratory/internlm2-chat-1_8b ./hf ./merged --max-shard-size 2GB" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d9197c24", + "metadata": {}, + "outputs": [], + "source": [ + "tokenizer, model = load_model(\"./merged\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "41bfa713", + "metadata": {}, + "outputs": [], + "source": [ + "chat(\"请介绍一下你自己\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "76f12d02", + "metadata": {}, + "outputs": [], + "source": [ + "chat(\"你在实战营做什么\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "54d9c0e9", + "metadata": {}, + "outputs": [], + "source": [ + "chat(\"介绍一下成都\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "91787f35", + "metadata": {}, + "outputs": [], + "source": [ + "del tokenizer, model\n", + "\n", + "torch.cuda.empty_cache()" + ] + }, + { + "cell_type": "markdown", + "id": "d9cfecbd-c125-4fa6-98ed-76da8015241e", + "metadata": {}, + "source": [ + "## 9 分布式微调\n", + "\n", + "如果模型的规模和复杂度继续增加,我们还可以使用分布式微调。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a025a904", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "!apt-get install -y net-tools\n", + "!ifconfig" + ] + }, + { + "cell_type": "markdown", + "id": "b95189de", + "metadata": {}, + "source": [ + "分布式微调是主从架构的。主节点协调整个训练过程,管理数据和任务到工作节点的分配。工作节点执行训练步骤的实际计算,处理数据的子集并计算梯度。有时候在一些架构中还需要参数服务器协调所有工作节点之间的模型更新同步,用于聚合来自工作节点的梯度并更新模型参数。\n", + "\n", + "> 我们使用两个节点进行分布式微调,实际上需要启动三个节点。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7195212b", + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "!MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 NPROC_PER_NODE=1 NNODES=2 xtuner train internlm2_chat_1_8b_qlora_alpaca_e3_copy.py\n", + "\n", + "!MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 NPROC_PER_NODE=1 NNODES=2 NODE_RANK=0 TRITON_CACHE_DIR=node0 PORT=20821 ADDR=192.168.230.182 xtuner train internlm2_chat_1_8b_qlora_alpaca_e3_copy.py --work-dir work_dir_node0\n", + "\n", + "!MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 NPROC_PER_NODE=1 NNODES=2 NODE_RANK=1 TRITON_CACHE_DIR=node1 PORT=20821 ADDR=192.168.230.182 xtuner train internlm2_chat_1_8b_qlora_alpaca_e3_copy.py --work-dir work_dir_node1" + ] + }, + { + "cell_type": "markdown", + "id": "6dbcffbe", + "metadata": {}, + "source": [ + 
"首先启动主节点,然后依次启动其他节点。但需要注意的是,需要在一个时间阈值内启动相关的节点,如果超过时间阈值还没启动所有节点,则其他节点会因超时而报错退出。\n", + "\n", + "比如,在两个节点的分布式微调过程中,我们只启动主节点和一个工作节点,另一个节点不启动,则已启动的节点会超时报错退出。\n", + "\n", + "![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-07.png)" + ] + }, + { + "cell_type": "markdown", + "id": "953e85aa", + "metadata": {}, + "source": [ + "如果所有节点都正常启动、训练,则可以看到每个节点的显卡均有内存使用。\n", + "\n", + "![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-08.png)" + ] + }, + { + "cell_type": "markdown", + "id": "f93beb75", + "metadata": {}, + "source": [ + "在训练完后,我们的目录结构应该是这样子的,训练的模型在工作节点上。\n", + "\n", + "
\n", + "目录结构\n", + "\n", + "```\n", + "├── work_dir_node0\n", + "│ ├── 20240629_213009\n", + "│ │ ├── 20240629_213009.log\n", + "│ │ └── vis_data\n", + "│ │ ├── 20240629_213009.json\n", + "│ │ ├── config.py\n", + "│ │ ├── eval_outputs_iter_233.txt\n", + "│ │ └── scalars.json\n", + "│ ├── internlm2_chat_1_8b_qlora_alpaca_e3_copy.py\n", + "│ ├── iter_234.pth\n", + "│ └── last_checkpoint\n", + "├── work_dir_node1\n", + "│ └── 20240629_213009\n", + "├── work_dirs\n", + "│ └── internlm2_chat_1_8b_qlora_alpaca_e3_copy\n", + "```\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "193fdbe8", + "metadata": {}, + "source": [ + "## 10 小结\n", + "\n", + "现在,我们又学到了 XTuner 微调的更多高阶知识啦,包括增量预训练微调基座模型、多卡微调、分布式微调等。\n", + "\n", + "是不是感觉其实微调也不过如此!事实上确实是这样的!其实在微调的时候最重要的还是要自己准备一份高质量的数据集,这个才是你能否真微调出效果最核心的利器。" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/L1/XTuner/homework.md b/docs/L1/XTuner/homework.md new file mode 100644 index 000000000..9e483417b --- /dev/null +++ b/docs/L1/XTuner/homework.md @@ -0,0 +1,16 @@ +# XTuner 微调个人小助手认知作业 + +记录复现过程并截图。 + +## 基础作业(结营必做) + +- 训练自己的小助手认知(记录复现过程并截图) + +- 用自己感兴趣的知识对基座模型进行增量预训练微调(记录复现过程并截图) + +## 进阶作业 + +- 在资源允许的情况下,尝试实现多卡微调与分布式微调(选做) +- 将自我认知的模型上传到 OpenXLab,并将应用部署到 OpenXLab(优秀学员必做) + +OpenXLab 部署教程:https://github.com/InternLM/Tutorial/tree/camp2/tools/openxlab-deploy diff --git a/docs/L1/XTuner/readme.md b/docs/L1/XTuner/readme.md index 8b1378917..8ebe8592b 100644 --- a/docs/L1/XTuner/readme.md +++ b/docs/L1/XTuner/readme.md @@ -1 +1,1197 @@ +# XTuner微调个人小助手认知 +在本节中,将一步步带领大家体验如何使用 XTuner 完成个人小助手的微调! + +## 1 微调前置基础 + +在进行微调之前,我们需要了解一些基本概念,请访问[XTuner微调前置基础](./xtuner_finetune_basic.md)。 + +## 2 准备工作 + +**环境安装**:我们想要用简单易上手的微调工具包 XTuner 来对模型进行微调的话,第一步是安装 XTuner !安装基础的工具是一切的前提,只有安装了 XTuner 我们才能够去执行后续的操作。 + +**前期准备**:在完成 XTuner 的安装后,我们下一步就需要去明确我们自己的微调目标了。我们想要利用微调做一些什么事情呢,然后为了实现这个目标,我们需要准备相关的硬件资源和数据。 + +**启动微调**:在确定了自己的微调目标后,我们就可以在 XTuner 的配置库中找到合适的配置文件并进行对应的修改。修改完成后即可一键启动训练!训练好的模型也可以仅仅通过在终端输入一行命令来完成转换和部署工作! + +### 2.1 创建虚拟环境 + +在安装 XTuner 之前,我们需要先创建一个虚拟环境。创建一个名为 `xtuner0121` 的虚拟环境,可以直接执行命令。 + + +```bash +conda create -n xtuner0121 python=3.10 -y +``` + +如果是在开发机中,也可以直接执行以下命令进行创建: + + +```bash +studio-conda -t xtuner0121 -o internlm-base +``` + +虚拟环境创建完成后,需要激活虚拟环境。 + +```bash +conda activate xtuner0121 +``` + +### 2.2 安装 XTuner + +虚拟环境创建完成后,就可以安装 XTuner 了。首先,从 Github 上下载源码。 + + +```bash +git clone -b v0.1.21 https://github.com/InternLM/xtuner +``` + +其次,进入源码目录,执行安装。 + +> 如果速度太慢可以换成 `pip install -e '.[all]' -i https://mirrors.aliyun.com/pypi/simple/` + + +```bash +cd xtuner && pip install -e '.[all]' +``` + +最后,我们可以验证一下安装结果。 + + +```bash +xtuner version +``` + +对于很多初学者而言,我们可能不太熟悉 XTuner 的用法,那么我们可以通过以下命令来查看相关的帮助。 + + +```bash +xtuner help +``` + +对于很多的初学者而言,安装好环境意味着成功了一大半!因此我们接下来就可以进入我们的下一步,准备好我们需要的模型、数据集和配置文件,并进行微调训练! 
+ +### 2.3 模型准备 + +软件安装好后,我们就可以准备要微调的模型了。 + +> 对于学习而言,我们可以使用 InternLM 推出的1.8B的小模型来完成此次微调演示。 + +对于在 InternStudio 上运行的小伙伴们,可以不用通过 HuggingFace、OpenXLab 或者 Modelscope 进行模型的下载,在开发机中已经为我们提供了模型的本地文件,直接使用就可以了。 + +> 我们可以通过以下代码一键通过符号链接的方式链接到模型文件,这样既节省了空间,也便于管理。 + + +```bash +mkdir -p Shanghai_AI_Laboratory + +ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b Shanghai_AI_Laboratory/internlm2-chat-1_8b +``` + +执行上述操作后,`Shanghai_AI_Laboratory/internlm2-chat-1_8b` 将直接成为一个符号链接,这个链接指向 `/root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b` 的位置。 + +这意味着,当我们访问 `Shanghai_AI_Laboratory/internlm2-chat-1_8b` 时,实际上就是在访问 `/root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b` 目录下的内容。通过这种方式,我们无需复制任何数据,就可以直接利用现有的模型文件进行后续的微调操作,从而节省存储空间并简化文件管理。 + +如果自己想要微调的模型在开发机中没找到,也可以自己下载相关模型文件。 + + +```python +from modelscope import snapshot_download +model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm2-1_8b', cache_dir="./") +``` + +模型文件准备好后,我们的目录结构应该是这个样子的。 + +
+目录结构 + +``` +├── Shanghai_AI_Laboratory +│ ├── internlm2-1_8b +│ │ ├── README.md +│ │ ├── config.json +│ │ ├── configuration.json +│ │ ├── configuration_internlm2.py +│ │ ├── generation_config.json +│ │ ├── modeling_internlm2.py +│ │ ├── pytorch_model.bin +│ │ ├── special_tokens_map.json +│ │ ├── tokenization_internlm2.py +│ │ ├── tokenization_internlm2_fast.py +│ │ ├── tokenizer.json +│ │ ├── tokenizer.model +│ │ └── tokenizer_config.json +│ └── internlm2-chat-1_8b -> /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b +│ ├── README.md +│ ├── config.json +│ ├── configuration.json +│ ├── configuration_internlm2.py +│ ├── generation_config.json +│ ├── model-00001-of-00002.safetensors +│ ├── model-00002-of-00002.safetensors +│ ├── model.safetensors.index.json +│ ├── modeling_internlm2.py +│ ├── special_tokens_map.json +│ ├── tokenization_internlm2.py +│ ├── tokenization_internlm2_fast.py +│ ├── tokenizer.model +│ └── tokenizer_config.json +``` +
+ + +> 在目录结构中可以看出,`internlm2-chat-1_8b` 是一个符号链接。 + + +```bash +tree -l +``` + +## 3 快速开始 + +这里我们用 `internlm2-chat-1_8b` 模型,通过 `QLoRA` 的方式来微调一个自己的小助手认知作为案例来进行演示。 + +首先,看看微调效果: + + + + + + + + + + + + + + + + + + +
+|      | 微调前 | 微调后 |
+| ---- | ------ | ------ |
+| 输入 | 请介绍一下你自己 | 请介绍一下你自己 |
+| 输出 | 你好,我是书生·浦语。我致力于帮助用户解决各种语言相关的问题,包括但不限于语言学习、翻译、文本摘要等。我使用了Transformer模型和深度学习技术,并使用了语言模型作为预训练任务。如果你有任何问题,欢迎随时向我提问。 | 我是伍鲜同志的小助手,内在是上海AI实验室书生·浦语的1.8B大模型哦 |
+| 网页 | (微调前对话页面截图,原图从略) | (微调后对话页面截图,原图从略) |
+其次,我们需要定义一些基本方法。 + + +- 导入必要的库 + + +```python +import torch +from transformers import AutoTokenizer, AutoModelForCausalLM +``` + +- 定义模型加载方法 + + +```python +def load_model(model_path): + tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) + model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).cuda() + model = model.eval() + return tokenizer, model +``` + +- 定义对话方法 + + +```python +messages = [] + +def chat(input_text): + length = 0 + for response, _ in model.stream_chat(tokenizer, input_text, messages): + if response is not None: + print(response[length:], flush=True, end="") + length = len(response) +``` + +### 3.1 微调前的模型对话 + +首先来看看 `internlm2-chat-1_8b` 的对话演示。 + +- 模型加载 + + +```python +tokenizer, model = load_model("Shanghai_AI_Laboratory/internlm2-chat-1_8b") +``` + +- 对话 + + +```python +chat("请介绍一下你自己") +``` + +- 释放缓存 + + +```python +del tokenizer, model + +torch.cuda.empty_cache() +``` + +### 3.2 指令跟随微调 + +下面我们对模型进行微调,让模型认识到自己的弟位,了解它自己是你的一个助手。 + +#### 3.2.1 准数据文件 + +为了让模型能够认清自己的身份弟位,在询问自己是谁的时候按照我们预期的结果进行回复,我们就需要通过在微调数据集中大量加入这样的数据。我们准备一个数据集文件`datas/assistant.json`,文件内容为对话数据。为了增强微调效果,可以将对话数据复制多条。 + + +```python +[ + {"conversation": [{"input": "请介绍一下你自己", "output": "我是伍鲜同志的小助手,内在是上海AI实验室书生·浦语的1.8B大模型哦"}]}, + {"conversation": [{"input": "你在实战营做什么", "output": "我在这里帮助伍鲜同志完成XTuner微调个人小助手的任务"}]}, +] +``` + +准备好数据文件后,我们的目录结构应该是这样子的。 + +
+目录结构 + +``` +├── Shanghai_AI_Laboratory +│ ├── internlm2-1_8b +│ │ ├── README.md +│ │ ├── config.json +│ │ ├── configuration.json +│ │ ├── configuration_internlm2.py +│ │ ├── generation_config.json +│ │ ├── modeling_internlm2.py +│ │ ├── pytorch_model.bin +│ │ ├── special_tokens_map.json +│ │ ├── tokenization_internlm2.py +│ │ ├── tokenization_internlm2_fast.py +│ │ ├── tokenizer.json +│ │ ├── tokenizer.model +│ │ └── tokenizer_config.json +│ └── internlm2-chat-1_8b -> /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b +│ ├── README.md +│ ├── config.json +│ ├── configuration.json +│ ├── configuration_internlm2.py +│ ├── generation_config.json +│ ├── model-00001-of-00002.safetensors +│ ├── model-00002-of-00002.safetensors +│ ├── model.safetensors.index.json +│ ├── modeling_internlm2.py +│ ├── special_tokens_map.json +│ ├── tokenization_internlm2.py +│ ├── tokenization_internlm2_fast.py +│ ├── tokenizer.model +│ └── tokenizer_config.json +├── datas +│ └── assistant.json +``` + +
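+
+如果不想手工复制粘贴来扩充数据,也可以用下面这个示意性的小脚本直接生成 `datas/assistant.json`。其中的名字和重复次数 `n` 都只是示例,请按自己的情况修改;重复次数越多,模型越容易记住这条认知,但也越容易过拟合。
+
+```python
+import json
+
+# 示例:把两条对话各重复 n 遍后写入 datas/assistant.json(n 为示例值,可自行调整)
+name = '伍鲜同志'
+n = 1000
+
+data = [
+    {'conversation': [{'input': '请介绍一下你自己',
+                       'output': f'我是{name}的小助手,内在是上海AI实验室书生·浦语的1.8B大模型哦'}]},
+    {'conversation': [{'input': '你在实战营做什么',
+                       'output': f'我在这里帮助{name}完成XTuner微调个人小助手的任务'}]},
+]
+
+with open('datas/assistant.json', 'w', encoding='utf-8') as f:
+    json.dump(data * n, f, ensure_ascii=False, indent=4)
+```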
+ + +#### 3.2.2 准备配置文件 + +在准备好了模型和数据集后,我们就要根据我们选择的微调方法结合微调方案来找到与我们最匹配的配置文件了,从而减少我们对配置文件的修改量。 + +> 配置文件其实是一种用于定义和控制模型训练和测试过程中各个方面的参数和设置的工具。 + +##### 3.2.2.1 列出支持的配置文件 + +XTuner 提供多个开箱即用的配置文件,可以通过以下命令查看。 + +> `xtuner list-cfg` 命令用于列出内置的所有配置文件。参数 `-p` 或 `--pattern` 表示模式匹配,后面跟着的内容将会在所有的配置文件里进行模糊匹配搜索,然后返回最有可能得内容。比如我们这里微调的是书生·浦语的模型,我们就可以匹配搜索 `internlm2`。 + + +```bash +xtuner list-cfg -p internlm2 +``` + +
+配置文件名的解释 + +以 **internlm2_1_8b_full_custom_pretrain_e1** 和 **internlm2_chat_1_8b_qlora_alpaca_e3** 举例: + + +| 配置文件 internlm2_1_8b_full_custom_pretrain_e1 | 配置文件 internlm2_chat_1_8b_qlora_alpaca_e3 | 说明 | +| ----------------------------------------------- | -------------------------------------------- | -------------- | +| internlm2_1_8b | internlm2_chat_1_8b | 模型名称 | +| full | qlora | 使用的算法 | +| custom_pretrain | alpaca | 数据集名称 | +| e1 | e3 | 把数据集跑几次 | + +
+ +##### 3.2.2.2 复制一个预设的配置文件 + +由于我们是对`internlm2-chat-1_8b`模型进行指令微调,所以与我们的需求最匹配的配置文件是 `internlm2_chat_1_8b_qlora_alpaca_e3`,这里就复制该配置文件。 + +> `xtuner copy-cfg` 命令用于复制一个内置的配置文件。该命令需要两个参数:`CONFIG` 代表需要复制的配置文件名称,`SAVE_PATH` 代表复制的目标路径。在我们的输入的这个命令中,我们的 `CONFIG` 对应的是上面搜索到的 `internlm2_chat_1_8b_qlora_alpaca_e3` ,而 `SAVE_PATH` 则是当前目录 `.`。 + + +```bash +xtuner copy-cfg internlm2_chat_1_8b_qlora_alpaca_e3 . +``` + +复制好配置文件后,我们的目录结构应该是这样子的。 + +
+目录结构 + +``` +├── Shanghai_AI_Laboratory +│ ├── internlm2-1_8b +│ │ ├── README.md +│ │ ├── config.json +│ │ ├── configuration.json +│ │ ├── configuration_internlm2.py +│ │ ├── generation_config.json +│ │ ├── modeling_internlm2.py +│ │ ├── pytorch_model.bin +│ │ ├── special_tokens_map.json +│ │ ├── tokenization_internlm2.py +│ │ ├── tokenization_internlm2_fast.py +│ │ ├── tokenizer.json +│ │ ├── tokenizer.model +│ │ └── tokenizer_config.json +│ └── internlm2-chat-1_8b -> /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b +│ ├── README.md +│ ├── config.json +│ ├── configuration.json +│ ├── configuration_internlm2.py +│ ├── generation_config.json +│ ├── model-00001-of-00002.safetensors +│ ├── model-00002-of-00002.safetensors +│ ├── model.safetensors.index.json +│ ├── modeling_internlm2.py +│ ├── special_tokens_map.json +│ ├── tokenization_internlm2.py +│ ├── tokenization_internlm2_fast.py +│ ├── tokenizer.model +│ └── tokenizer_config.json +├── datas +│ └── assistant.json +├── internlm2_chat_1_8b_qlora_alpaca_e3_copy.py +``` + +
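+
+复制出来的配置文件比较长,在进入下一小节的修改之前,可以先用一个小脚本(或编辑器的搜索功能)把之后要改动的几个字段定位出来。下面是一个可选的辅助示例,字段名取自下文将要修改的配置项:
+
+```python
+from pathlib import Path
+
+# 示例:打印配置文件中与本次修改相关的行,便于对照后面的 diff 进行改动
+cfg = Path('internlm2_chat_1_8b_qlora_alpaca_e3_copy.py')
+keys = ('pretrained_model_name_or_path', 'alpaca_en_path',
+        'evaluation_inputs', 'dataset_map_fn')
+for lineno, line in enumerate(cfg.read_text(encoding='utf-8').splitlines(), start=1):
+    if any(key in line for key in keys):
+        print(f'{lineno:4d}: {line.strip()}')
+```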
+ +##### 3.2.2.3 对配置文件进行修改 + +在选择了一个最匹配的配置文件并准备好其他内容后,下面我们要做的事情就是根据我们自己的内容对该配置文件进行调整,使其能够满足我们实际训练的要求。 + +
+配置文件介绍 + +打开配置文件后,我们可以看到整体的配置文件分为五部分: + +**PART 1 Settings**:涵盖了模型基本设置,如预训练模型的选择、数据集信息和训练过程中的一些基本参数(如批大小、学习率等)。 + +**PART 2 Model & Tokenizer**:指定了用于训练的模型和分词器的具体类型及其配置,包括预训练模型的路径和是否启用特定功能(如可变长度注意力),这是模型训练的核心组成部分。 + +**PART 3 Dataset & Dataloader**:描述了数据处理的细节,包括如何加载数据集、预处理步骤、批处理大小等,确保了模型能够接收到正确格式和质量的数据。 + +**PART 4 Scheduler & Optimizer**:配置了优化过程中的关键参数,如学习率调度策略和优化器的选择,这些是影响模型训练效果和速度的重要因素。 + +**PART 5 Runtime**:定义了训练过程中的额外设置,如日志记录、模型保存策略和自定义钩子等,以支持训练流程的监控、调试和结果的保存。 + +一般来说我们需要更改的部分其实只包括前三部分,而且修改的主要原因是我们修改了配置文件中规定的模型、数据集。后两部分都是 XTuner 官方帮我们优化好的东西,一般而言只有在魔改的情况下才需要进行修改。 + +
+ +下面我们将根据项目的需求一步步的进行修改和调整吧! + +在 PART 1 的部分,由于我们不再需要在 HuggingFace 上自动下载模型,因此我们先要更换模型的路径以及数据集的路径为我们本地的路径。 + +为了训练过程中能够实时观察到模型的变化情况,XTuner 贴心的推出了一个 `evaluation_inputs` 的参数来让我们能够设置多个问题来确保模型在训练过程中的变化是朝着我们想要的方向前进的。我们可以添加自己的输入。 + +在 PART 3 的部分,由于我们准备的数据集是 JSON 格式的数据,并且对话内容已经是 `input` 和 `output` 的数据对,所以不需要进行格式转换。 + +```diff +####################################################################### +# PART 1 Settings # +####################################################################### +- pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b' ++ pretrained_model_name_or_path = 'Shanghai_AI_Laboratory/internlm2-chat-1_8b' + +- alpaca_en_path = 'tatsu-lab/alpaca' ++ alpaca_en_path = 'datas/assistant.json' + +evaluation_inputs = [ +- '请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai' ++ '请介绍一下你自己', '请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai' +] + +####################################################################### +# PART 3 Dataset & Dataloader # +####################################################################### +alpaca_en = dict( + type=process_hf_dataset, +- dataset=dict(type=load_dataset, path=alpaca_en_path), ++ dataset=dict(type=load_dataset, path='json', data_files=dict(train=alpaca_en_path)), + tokenizer=tokenizer, + max_length=max_length, +- dataset_map_fn=alpaca_map_fn, ++ dataset_map_fn=None, + template_map_fn=dict( + type=template_map_fn_factory, template=prompt_template), + remove_unused_columns=True, + shuffle_before_pack=True, + pack_to_max_length=pack_to_max_length, + use_varlen_attn=use_varlen_attn) +``` + +除此之外,我们还可以对一些重要的参数进行调整,包括学习率(lr)、训练的轮数(max_epochs)等等。 + +
+常用参数介绍 + +| 参数名 | 解释 | +| -------------------------- | ------------------------------------------------------------ | +| **data_path** | 数据路径或 HuggingFace 仓库名 | +| **max_length** | 单条数据最大 Token 数,超过则截断 | +| **pack_to_max_length** | 是否将多条短数据拼接到 max_length,提高 GPU 利用率 | +| **accumulative_counts** | 梯度累积,每多少次 backward 更新一次参数 | +| **sequence_parallel_size** | 并行序列处理的大小,用于模型训练时的序列并行 | +| **batch_size** | 每个设备上的批量大小 | +| **dataloader_num_workers** | 数据加载器中工作进程的数量 | +| **max_epochs** | 训练的最大轮数 | +| **optim_type** | 优化器类型,例如 AdamW | +| **lr** | 学习率 | +| **betas** | 优化器中的 beta 参数,控制动量和平方梯度的移动平均 | +| **weight_decay** | 权重衰减系数,用于正则化和避免过拟合 | +| **max_norm** | 梯度裁剪的最大范数,用于防止梯度爆炸 | +| **warmup_ratio** | 预热的比例,学习率在这个比例的训练过程中线性增加到初始学习率 | +| **save_steps** | 保存模型的步数间隔 | +| **save_total_limit** | 保存的模型总数限制,超过限制时删除旧的模型文件 | +| **prompt_template** | 模板提示,用于定义生成文本的格式或结构 | +| ...... | ...... | + +> 如果想充分利用显卡资源,可以将 `max_length` 和 `batch_size` 这两个参数调大。 + +
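+
+其中几个参数是相互配合的:每次参数更新对应的“等效批大小”大致等于 GPU 数 × batch_size × accumulative_counts。以上文默认配置(单卡、batch_size=1、accumulative_counts=16)为例,可以这样粗略估算(仅作示意):
+
+```python
+# 示例:估算等效批大小(数值取自上文默认配置,仅作示意)
+num_gpus = 1              # 参与训练的 GPU 数
+batch_size = 1            # 每张卡上的 batch size
+accumulative_counts = 16  # 梯度累积次数
+
+effective_batch_size = num_gpus * batch_size * accumulative_counts
+print(effective_batch_size)  # 16
+```
+
+调大 batch_size 或减小 accumulative_counts 时,可以让两者的乘积保持大致不变,这样有效批大小就不会发生明显变化。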
+ +修改完后的完整的配置文件是:[configs/internlm2_chat_1_8b_qlora_alpaca_e3_copy.py](../../../configs/internlm2_chat_1_8b_qlora_alpaca_e3_copy.py)。 + +
+internlm2_chat_1_8b_qlora_alpaca_e3_copy.py + +```python +# Copyright (c) OpenMMLab. All rights reserved. +import torch +from datasets import load_dataset +from mmengine.dataset import DefaultSampler +from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook, + LoggerHook, ParamSchedulerHook) +from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR +from peft import LoraConfig +from torch.optim import AdamW +from transformers import (AutoModelForCausalLM, AutoTokenizer, + BitsAndBytesConfig) + +from xtuner.dataset import process_hf_dataset +from xtuner.dataset.collate_fns import default_collate_fn +from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory +from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook, + VarlenAttnArgsToMessageHubHook) +from xtuner.engine.runner import TrainLoop +from xtuner.model import SupervisedFinetune +from xtuner.parallel.sequence import SequenceParallelSampler +from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE + +####################################################################### +# PART 1 Settings # +####################################################################### +# Model +pretrained_model_name_or_path = 'Shanghai_AI_Laboratory/internlm2-chat-1_8b' +use_varlen_attn = False + +# Data +alpaca_en_path = 'datas/assistant.json' +prompt_template = PROMPT_TEMPLATE.internlm2_chat +max_length = 2048 +pack_to_max_length = True + +# parallel +sequence_parallel_size = 1 + +# Scheduler & Optimizer +batch_size = 1 # per_device +accumulative_counts = 16 +accumulative_counts *= sequence_parallel_size +dataloader_num_workers = 0 +max_epochs = 3 +optim_type = AdamW +lr = 2e-4 +betas = (0.9, 0.999) +weight_decay = 0 +max_norm = 1 # grad clip +warmup_ratio = 0.03 + +# Save +save_steps = 500 +save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited) + +# Evaluate the generation performance during the training +evaluation_freq = 500 +SYSTEM = SYSTEM_TEMPLATE.alpaca +evaluation_inputs = [ + '请介绍一下你自己', '请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai' +] + +####################################################################### +# PART 2 Model & Tokenizer # +####################################################################### +tokenizer = dict( + type=AutoTokenizer.from_pretrained, + pretrained_model_name_or_path=pretrained_model_name_or_path, + trust_remote_code=True, + padding_side='right') + +model = dict( + type=SupervisedFinetune, + use_varlen_attn=use_varlen_attn, + llm=dict( + type=AutoModelForCausalLM.from_pretrained, + pretrained_model_name_or_path=pretrained_model_name_or_path, + trust_remote_code=True, + torch_dtype=torch.float16, + quantization_config=dict( + type=BitsAndBytesConfig, + load_in_4bit=True, + load_in_8bit=False, + llm_int8_threshold=6.0, + llm_int8_has_fp16_weight=False, + bnb_4bit_compute_dtype=torch.float16, + bnb_4bit_use_double_quant=True, + bnb_4bit_quant_type='nf4')), + lora=dict( + type=LoraConfig, + r=64, + lora_alpha=16, + lora_dropout=0.1, + bias='none', + task_type='CAUSAL_LM')) + +####################################################################### +# PART 3 Dataset & Dataloader # +####################################################################### +alpaca_en = dict( + type=process_hf_dataset, + dataset=dict(type=load_dataset, path='json', data_files=dict(train=alpaca_en_path)), + tokenizer=tokenizer, + max_length=max_length, + dataset_map_fn=None, + template_map_fn=dict( + type=template_map_fn_factory, 
template=prompt_template), + remove_unused_columns=True, + shuffle_before_pack=True, + pack_to_max_length=pack_to_max_length, + use_varlen_attn=use_varlen_attn) + +sampler = SequenceParallelSampler \ + if sequence_parallel_size > 1 else DefaultSampler +train_dataloader = dict( + batch_size=batch_size, + num_workers=dataloader_num_workers, + dataset=alpaca_en, + sampler=dict(type=sampler, shuffle=True), + collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn)) + +####################################################################### +# PART 4 Scheduler & Optimizer # +####################################################################### +# optimizer +optim_wrapper = dict( + type=AmpOptimWrapper, + optimizer=dict( + type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay), + clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False), + accumulative_counts=accumulative_counts, + loss_scale='dynamic', + dtype='float16') + +# learning policy +# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501 +param_scheduler = [ + dict( + type=LinearLR, + start_factor=1e-5, + by_epoch=True, + begin=0, + end=warmup_ratio * max_epochs, + convert_to_iter_based=True), + dict( + type=CosineAnnealingLR, + eta_min=0.0, + by_epoch=True, + begin=warmup_ratio * max_epochs, + end=max_epochs, + convert_to_iter_based=True) +] + +# train, val, test setting +train_cfg = dict(type=TrainLoop, max_epochs=max_epochs) + +####################################################################### +# PART 5 Runtime # +####################################################################### +# Log the dialogue periodically during the training process, optional +custom_hooks = [ + dict(type=DatasetInfoHook, tokenizer=tokenizer), + dict( + type=EvaluateChatHook, + tokenizer=tokenizer, + every_n_iters=evaluation_freq, + evaluation_inputs=evaluation_inputs, + system=SYSTEM, + prompt_template=prompt_template) +] + +if use_varlen_attn: + custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)] + +# configure default hooks +default_hooks = dict( + # record the time of every iteration. + timer=dict(type=IterTimerHook), + # print log every 10 iterations. + logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10), + # enable the parameter scheduler. + param_scheduler=dict(type=ParamSchedulerHook), + # save checkpoint per `save_steps`. + checkpoint=dict( + type=CheckpointHook, + by_epoch=False, + interval=save_steps, + max_keep_ckpts=save_total_limit), + # set sampler seed in distributed evrionment. + sampler_seed=dict(type=DistSamplerSeedHook), +) + +# configure environment +env_cfg = dict( + # whether to enable cudnn benchmark + cudnn_benchmark=False, + # set multi process parameters + mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0), + # set distributed parameters + dist_cfg=dict(backend='nccl'), +) + +# set visualizer +visualizer = None + +# set log level +log_level = 'INFO' + +# load from which checkpoint +load_from = None + +# whether to resume training from the loaded checkpoint +resume = False + +# Defaults to use random seed and disable `deterministic` +randomness = dict(seed=None, deterministic=False) + +# set log processor +log_processor = dict(by_epoch=False) +``` + +
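+
+在启动训练之前,还可以顺手检查一下数据文件能否被正确解析、条数是否符合预期。下面是一个简单的示例脚本,路径沿用上文的 `datas/assistant.json`:
+
+```python
+import json
+
+# 示例:训练前快速检查数据集格式与样本条数
+with open('datas/assistant.json', encoding='utf-8') as f:
+    data = json.load(f)
+
+print('样本条数:', len(data))
+print('第一条对话:', data[0]['conversation'][0])
+```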
+ +#### 3.2.3 启动微调 + +完成了所有的准备工作后,我们就可以正式的开始我们下一阶段的旅程:XTuner 启动~! + +当我们准备好了所有内容,我们只需要将使用 `xtuner train` 命令令即可开始训练。 + +> `xtuner train` 命令用于启动模型微调进程。该命令需要一个参数:`CONFIG` 用于指定微调配置文件。这里我们使用修改好的配置文件 `internlm2_chat_1_8b_qlora_alpaca_e3_copy.py`。 +> 训练过程中产生的所有文件,包括日志、配置文件、检查点文件、微调后的模型等,默认保存在 `work_dirs` 目录下,我们也可以通过添加 `--work-dir` 指定特定的文件保存位置。 + + +```bash +xtuner train ./internlm2_chat_1_8b_qlora_alpaca_e3_copy.py +``` + +在训练完后,我们的目录结构应该是这样子的。 + +
+目录结构 + +``` +├── work_dirs +│ └── internlm2_chat_1_8b_qlora_alpaca_e3_copy +│ ├── 20240626_222727 +│ │ ├── 20240626_222727.log +│ │ └── vis_data +│ │ ├── 20240626_222727.json +│ │ ├── config.py +│ │ ├── eval_outputs_iter_95.txt +│ │ └── scalars.json +│ ├── internlm2_chat_1_8b_qlora_alpaca_e3_copy.py +│ ├── iter_96.pth +│ └── last_checkpoint +``` + +
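+
+其中 `vis_data/scalars.json` 记录了训练过程中每次打日志时的损失等标量。想快速确认损失是否在正常下降,可以用下面的小脚本看一眼。这里假设该文件是“每行一条 JSON 记录”的格式,目录中的时间戳请替换成自己训练时实际生成的:
+
+```python
+import json
+from pathlib import Path
+
+# 示例:读取训练日志里的 loss(时间戳目录请替换为自己的)
+log_file = Path('work_dirs/internlm2_chat_1_8b_qlora_alpaca_e3_copy'
+                '/20240626_222727/vis_data/scalars.json')
+
+losses = []
+with log_file.open(encoding='utf-8') as f:
+    for line in f:
+        record = json.loads(line)
+        if 'loss' in record:
+            losses.append((record.get('iter'), record['loss']))
+
+print('最后几条 loss 记录:', losses[-5:])
+```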
+ +#### 3.2.4 模型格式转换 + +模型转换的本质其实就是将原本使用 Pytorch 训练出来的模型权重文件转换为目前通用的 HuggingFace 格式文件,那么我们可以通过以下命令来实现一键转换。 + +我们可以使用 `xtuner convert pth_to_hf` 命令来进行模型格式转换。 + +> `xtuner convert pth_to_hf` 命令用于进行模型格式转换。该命令需要三个参数:`CONFIG` 表示微调的配置文件, `PATH_TO_PTH_MODEL` 表示微调的模型权重文件路径,即要转换的模型权重, `SAVE_PATH_TO_HF_MODEL` 表示转换后的 HuggingFace 格式文件的保存路径。 + +除此之外,我们其实还可以在转换的命令中添加几个额外的参数,包括: + +| 参数名 | 解释 | +| --------------------- | -------------------------------------------- | +| --fp32 | 代表以fp32的精度开启,假如不输入则默认为fp16 | +| --max-shard-size {GB} | 代表每个权重文件最大的大小(默认为2GB) | + + +```bash +pth_file=`ls -t ./work_dirs/internlm2_chat_1_8b_qlora_alpaca_e3_copy/*.pth | head -n 1` && MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU xtuner convert pth_to_hf ./internlm2_chat_1_8b_qlora_alpaca_e3_copy.py ${pth_file} ./hf +``` + +模型格式转换完成后,我们的目录结构应该是这样子的。 + +
+目录结构 + +``` +├── hf +│ ├── README.md +│ ├── adapter_config.json +│ ├── adapter_model.bin +│ └── xtuner_config.py +``` + +
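+
+在进入下一步之前,也可以先简单验证一下转换结果:`hf` 目录本质上就是一个 HuggingFace/PEFT 格式的 Adapter,可以借助 `peft` 库把它挂载到原模型上直接推理。下面是一段示意代码(假设已安装 `peft`,路径以实际为准;正式使用时仍建议按下文进行模型合并):
+
+```python
+# 示意代码:以“基座模型 + Adapter”的方式加载转换结果(假设已安装 peft)
+import torch
+from peft import PeftModel
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+base_path = "Shanghai_AI_Laboratory/internlm2-chat-1_8b"  # 原模型路径
+adapter_path = "./hf"                                      # 转换后的 Adapter 路径
+
+tokenizer = AutoTokenizer.from_pretrained(base_path, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    base_path, torch_dtype=torch.float16, trust_remote_code=True).cuda()
+model = PeftModel.from_pretrained(model, adapter_path).eval()  # 挂载 LoRA Adapter
+```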
+ +转换完成后,可以看到模型被转换为 HuggingFace 中常用的 .bin 格式文件,这就代表着文件成功被转化为 HuggingFace 格式了。 + +此时,hf 文件夹即为我们平时所理解的所谓 “LoRA 模型文件” + +> 可以简单理解:LoRA 模型文件 = Adapter + +#### 3.2.5 模型合并 + +对于 LoRA 或者 QLoRA 微调出来的模型其实并不是一个完整的模型,而是一个额外的层(Adapter),训练完的这个层最终还是要与原模型进行合并才能被正常的使用。 + +> 对于全量微调的模型(full)其实是不需要进行整合这一步的,因为全量微调修改的是原模型的权重而非微调一个新的 Adapter ,因此是不需要进行模型整合的。 + +在 XTuner 中提供了一键合并的命令 `xtuner convert merge`,在使用前我们需要准备好三个路径,包括原模型的路径、训练好的 Adapter 层的(模型格式转换后的)路径以及最终保存的路径。 + +> `xtuner convert merge`命令用于合并模型。该命令需要三个参数:`LLM` 表示原模型路径,`ADAPTER` 表示 Adapter 层的路径, `SAVE_PATH` 表示合并后的模型最终的保存路径。 + +在模型合并这一步还有其他很多的可选参数,包括: + +| 参数名 | 解释 | +| ---------------------- | ------------------------------------------------------------ | +| --max-shard-size {GB} | 代表每个权重文件最大的大小(默认为2GB) | +| --device {device_name} | 这里指的就是device的名称,可选择的有cuda、cpu和auto,默认为cuda即使用gpu进行运算 | +| --is-clip | 这个参数主要用于确定模型是不是CLIP模型,假如是的话就要加上,不是就不需要添加 | + +> CLIP(Contrastive Language–Image Pre-training)模型是 OpenAI 开发的一种预训练模型,它能够理解图像和描述它们的文本之间的关系。CLIP 通过在大规模数据集上学习图像和对应文本之间的对应关系,从而实现了对图像内容的理解和分类,甚至能够根据文本提示生成图像。 + + +```bash +MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU xtuner convert merge Shanghai_AI_Laboratory/internlm2-chat-1_8b ./hf ./merged --max-shard-size 2GB +``` + +模型合并完成后,我们的目录结构应该是这样子的。 + +
+目录结构 + +``` +├── merged +│ ├── config.json +│ ├── configuration_internlm2.py +│ ├── generation_config.json +│ ├── modeling_internlm2.py +│ ├── pytorch_model-00001-of-00002.bin +│ ├── pytorch_model-00002-of-00002.bin +│ ├── pytorch_model.bin.index.json +│ ├── special_tokens_map.json +│ ├── tokenization_internlm2.py +│ ├── tokenization_internlm2_fast.py +│ ├── tokenizer.json +│ ├── tokenizer.model +│ └── tokenizer_config.json +``` + +
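+
+在加载整个模型之前,也可以先通过权重索引文件粗略核对一下合并结果。下面是一段示意代码(假设索引文件遵循 HuggingFace 常见的 `metadata.total_size` 与 `weight_map` 字段,具体以实际文件为准):
+
+```python
+# 示意代码:检查合并后模型的分片数量与总大小(字段名以实际文件为准)
+import json
+
+with open("./merged/pytorch_model.bin.index.json") as f:
+    index = json.load(f)
+
+total_size_gb = index["metadata"]["total_size"] / 1024**3
+shards = sorted(set(index["weight_map"].values()))
+print(f"权重总大小约 {total_size_gb:.2f} GB,共 {len(shards)} 个分片:")
+for shard in shards:
+    print(" -", shard)
+```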
+ +在模型合并完成后,我们就可以看到最终的模型和原模型文件夹非常相似,包括了分词器、权重文件、配置信息等等。 + +### 3.3 微调后的模型对话 + + +```python +tokenizer, model = load_model("./merged") +``` + + +```python +chat("请介绍一下你自己") +``` + + +```python +chat("你在实战营做什么") +``` + + +```python +chat("介绍一下成都") +``` + +可以看到,通过指令微调,我们成功得到了一个自己的小助手。 + + +```python +del tokenizer, model + +torch.cuda.empty_cache() +``` + +## 4 Web Demo 部署 + +除了在终端中对模型进行测试,我们其实还可以在网页端的 Demo 进行对话。 + +首先,我们需要安装所需要的依赖。 + + +```python +pip install streamlit +``` + +其次,我们需要准备一个Streamlit程序的脚本。 + +Streamlit程序的完整代码是:[tools/xtuner_streamlit_demo.py](../../../tools/xtuner_streamlit_demo.py)。 + +
+xtuner_streamlit_demo.py + + +```python +import copy +import warnings +from dataclasses import asdict, dataclass +from typing import Callable, List, Optional + +import streamlit as st +import torch +from torch import nn +from transformers.generation.utils import (LogitsProcessorList, + StoppingCriteriaList) +from transformers.utils import logging + +from transformers import AutoTokenizer, AutoModelForCausalLM # isort: skip + +logger = logging.get_logger(__name__) + + +model_name_or_path = "./merged" + +@dataclass +class GenerationConfig: + # this config is used for chat to provide more diversity + max_length: int = 2048 + top_p: float = 0.75 + temperature: float = 0.1 + do_sample: bool = True + repetition_penalty: float = 1.000 + + +@torch.inference_mode() +def generate_interactive( + model, + tokenizer, + prompt, + generation_config: Optional[GenerationConfig] = None, + logits_processor: Optional[LogitsProcessorList] = None, + stopping_criteria: Optional[StoppingCriteriaList] = None, + prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor], + List[int]]] = None, + additional_eos_token_id: Optional[int] = None, + **kwargs, +): + inputs = tokenizer([prompt], padding=True, return_tensors='pt') + input_length = len(inputs['input_ids'][0]) + for k, v in inputs.items(): + inputs[k] = v.cuda() + input_ids = inputs['input_ids'] + _, input_ids_seq_length = input_ids.shape[0], input_ids.shape[-1] + if generation_config is None: + generation_config = model.generation_config + generation_config = copy.deepcopy(generation_config) + model_kwargs = generation_config.update(**kwargs) + bos_token_id, eos_token_id = ( # noqa: F841 # pylint: disable=W0612 + generation_config.bos_token_id, + generation_config.eos_token_id, + ) + if isinstance(eos_token_id, int): + eos_token_id = [eos_token_id] + if additional_eos_token_id is not None: + eos_token_id.append(additional_eos_token_id) + has_default_max_length = kwargs.get( + 'max_length') is None and generation_config.max_length is not None + if has_default_max_length and generation_config.max_new_tokens is None: + warnings.warn( + f"Using 'max_length''s default ({repr(generation_config.max_length)}) \ + to control the generation length. " + 'This behaviour is deprecated and will be removed from the \ + config in v5 of Transformers -- we' + ' recommend using `max_new_tokens` to control the maximum \ + length of the generation.', + UserWarning, + ) + elif generation_config.max_new_tokens is not None: + generation_config.max_length = generation_config.max_new_tokens + \ + input_ids_seq_length + if not has_default_max_length: + logger.warn( # pylint: disable=W4902 + f"Both 'max_new_tokens' (={generation_config.max_new_tokens}) " + f"and 'max_length'(={generation_config.max_length}) seem to " + "have been set. 'max_new_tokens' will take precedence. " + 'Please refer to the documentation for more information. ' + '(https://huggingface.co/docs/transformers/main/' + 'en/main_classes/text_generation)', + UserWarning, + ) + + if input_ids_seq_length >= generation_config.max_length: + input_ids_string = 'input_ids' + logger.warning( + f"Input length of {input_ids_string} is {input_ids_seq_length}, " + f"but 'max_length' is set to {generation_config.max_length}. " + 'This can lead to unexpected behavior. You should consider' + " increasing 'max_new_tokens'.") + + # 2. 
Set generation parameters if not already defined + logits_processor = logits_processor if logits_processor is not None \ + else LogitsProcessorList() + stopping_criteria = stopping_criteria if stopping_criteria is not None \ + else StoppingCriteriaList() + + logits_processor = model._get_logits_processor( + generation_config=generation_config, + input_ids_seq_length=input_ids_seq_length, + encoder_input_ids=input_ids, + prefix_allowed_tokens_fn=prefix_allowed_tokens_fn, + logits_processor=logits_processor, + ) + + stopping_criteria = model._get_stopping_criteria( + generation_config=generation_config, + stopping_criteria=stopping_criteria) + logits_warper = model._get_logits_warper(generation_config) + + unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1) + scores = None + while True: + model_inputs = model.prepare_inputs_for_generation( + input_ids, **model_kwargs) + # forward pass to get next token + outputs = model( + **model_inputs, + return_dict=True, + output_attentions=False, + output_hidden_states=False, + ) + + next_token_logits = outputs.logits[:, -1, :] + + # pre-process distribution + next_token_scores = logits_processor(input_ids, next_token_logits) + next_token_scores = logits_warper(input_ids, next_token_scores) + + # sample + probs = nn.functional.softmax(next_token_scores, dim=-1) + if generation_config.do_sample: + next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1) + else: + next_tokens = torch.argmax(probs, dim=-1) + + # update generated ids, model inputs, and length for next step + input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1) + model_kwargs = model._update_model_kwargs_for_generation( + outputs, model_kwargs, is_encoder_decoder=False) + unfinished_sequences = unfinished_sequences.mul( + (min(next_tokens != i for i in eos_token_id)).long()) + + output_token_ids = input_ids[0].cpu().tolist() + output_token_ids = output_token_ids[input_length:] + for each_eos_token_id in eos_token_id: + if output_token_ids[-1] == each_eos_token_id: + output_token_ids = output_token_ids[:-1] + response = tokenizer.decode(output_token_ids) + + yield response + # stop when each sentence is finished + # or if we exceed the maximum length + if unfinished_sequences.max() == 0 or stopping_criteria( + input_ids, scores): + break + + +def on_btn_click(): + del st.session_state.messages + + +@st.cache_resource +def load_model(): + model = (AutoModelForCausalLM.from_pretrained(model_name_or_path, + trust_remote_code=True).to( + torch.bfloat16).cuda()) + tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, + trust_remote_code=True) + return model, tokenizer + + +def prepare_generation_config(): + with st.sidebar: + max_length = st.slider('Max Length', + min_value=8, + max_value=32768, + value=2048) + top_p = st.slider('Top P', 0.0, 1.0, 0.75, step=0.01) + temperature = st.slider('Temperature', 0.0, 1.0, 0.1, step=0.01) + st.button('Clear Chat History', on_click=on_btn_click) + + generation_config = GenerationConfig(max_length=max_length, + top_p=top_p, + temperature=temperature) + + return generation_config + + +user_prompt = '<|im_start|>user\n{user}<|im_end|>\n' +robot_prompt = '<|im_start|>assistant\n{robot}<|im_end|>\n' +cur_query_prompt = '<|im_start|>user\n{user}<|im_end|>\n\ + <|im_start|>assistant\n' + + +def combine_history(prompt): + messages = st.session_state.messages + meta_instruction = ('') + total_prompt = f"<|im_start|>system\n{meta_instruction}<|im_end|>\n" + for message in messages: + cur_content = message['content'] + if 
message['role'] == 'user': + cur_prompt = user_prompt.format(user=cur_content) + elif message['role'] == 'robot': + cur_prompt = robot_prompt.format(robot=cur_content) + else: + raise RuntimeError + total_prompt += cur_prompt + total_prompt = total_prompt + cur_query_prompt.format(user=prompt) + return total_prompt + + +def main(): + # torch.cuda.empty_cache() + print('load model begin.') + model, tokenizer = load_model() + print('load model end.') + + + st.title('InternLM2-Chat-1.8B') + + generation_config = prepare_generation_config() + + # Initialize chat history + if 'messages' not in st.session_state: + st.session_state.messages = [] + + # Display chat messages from history on app rerun + for message in st.session_state.messages: + with st.chat_message(message['role'], avatar=message.get('avatar')): + st.markdown(message['content']) + + # Accept user input + if prompt := st.chat_input('What is up?'): + # Display user message in chat message container + with st.chat_message('user'): + st.markdown(prompt) + real_prompt = combine_history(prompt) + # Add user message to chat history + st.session_state.messages.append({ + 'role': 'user', + 'content': prompt, + }) + + with st.chat_message('robot'): + message_placeholder = st.empty() + for cur_response in generate_interactive( + model=model, + tokenizer=tokenizer, + prompt=real_prompt, + additional_eos_token_id=92542, + **asdict(generation_config), + ): + # Display robot response in chat message container + message_placeholder.markdown(cur_response + '▌') + message_placeholder.markdown(cur_response) + # Add robot response to chat history + st.session_state.messages.append({ + 'role': 'robot', + 'content': cur_response, # pylint: disable=undefined-loop-variable + }) + torch.cuda.empty_cache() + + +if __name__ == '__main__': + main() + +``` + +
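+
+脚本中的 `combine_history` 会把历史消息按 InternLM2 的对话模板拼接成最终的 prompt。下面用一个极简的示例演示拼接结果,方便理解 `<|im_start|>`、`<|im_end|>` 这些标记的作用(仅为与脚本逻辑等价的简化示意,对话内容为示例):
+
+```python
+# 示意代码:演示对话模板拼接后的 prompt(简化版,逻辑与脚本中的 combine_history 一致)
+user_prompt = '<|im_start|>user\n{user}<|im_end|>\n'
+robot_prompt = '<|im_start|>assistant\n{robot}<|im_end|>\n'
+cur_query_prompt = '<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n'
+
+history = [
+    {'role': 'user', 'content': '请介绍一下你自己'},
+    {'role': 'robot', 'content': '我是伍鲜同志的小助手……'},
+]
+
+total_prompt = '<|im_start|>system\n<|im_end|>\n'  # meta_instruction 为空字符串
+for message in history:
+    if message['role'] == 'user':
+        total_prompt += user_prompt.format(user=message['content'])
+    else:
+        total_prompt += robot_prompt.format(robot=message['content'])
+total_prompt += cur_query_prompt.format(user='你在实战营做什么')
+print(total_prompt)
+```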
+ +然后,我们可以直接启动应用。 + + +```bash +streamlit run xtuner_streamlit_demo.py +``` + +运行后,在访问前,我们还需要做的就是将端口映射到本地。 + +通过如图所示的地方,获取开发机的端口和密码。 + +![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-09.png) + +然后在本地使用 PowerShell 或者命令行终端,执行以下命令: + +> 其中,`8501`是Streamlit程序的服务端口,`43551`需要替换为自己的开发机的端口。 + +```bash +ssh -CNg -L 8501:127.0.0.1:8501 root@ssh.intern-ai.org.cn -p 43551 +``` + +然后再输入开发机的root密码。 + +![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-10.png) + +最后,我们就可以在本地通过浏览器访问:http://127.0.0.1:8501 来进行对话了。 + +![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-12.png) + +## 5 小结 + +经过本节的学习,带领着大家跑通了 XTuner 的完整流程,我们学会了指令跟随微调,并且训练出了一个自己小助手,是不是很有意思! + +当我们在测试完模型认为其满足我们的需求后,就可以对模型进行量化部署等操作了,这部分的内容在之后关于 LMDeploy 的课程中将会详细的进行讲解,敬请期待后续的课程吧! + +关于XTuner的更多高级进阶知识,请访问[XTuner微调高级进阶](./xtuner_finetune_advance.md)。 + +## 6 作业 + +作业请访问[作业](./homework.md)。 diff --git a/docs/L1/XTuner/xtuner_finetune_advance.md b/docs/L1/XTuner/xtuner_finetune_advance.md new file mode 100644 index 000000000..93890612f --- /dev/null +++ b/docs/L1/XTuner/xtuner_finetune_advance.md @@ -0,0 +1,780 @@ +# XTuner微调高级进阶 + +## 1 增量预训练微调 + +本节我们先来了解一下增量预训练,这里我们以一个文本续写案例来看看效果。 + +| | 微调前 | 微调后 | +| ---- | ------------------------------------------------------------ | ------------------------------------------------------------ | +| 输入 | 书生·浦语大模型实战营第三期是 | 书生·浦语大模型实战营第三期是 | +| 输出 | 书生·浦语大模型实战营第三期是上周五,上周五我们学习了一个新的知识,那就是关于机器学习的概率统计。…… | 书生·浦语大模型实战营第三期是上海人工智能实验室推出的书生·浦语大模型实战营系列活动的第三批次,将于2024年7月正式进行。…… | + +我们需要定义一些基本方法。 + +- 导入必要的库 + + +```python +import torch +from transformers import AutoTokenizer, AutoModelForCausalLM +``` + +- 定义模型加载方法 + + +```python +def load_model(model_path): + tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) + model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).cuda() + model = model.eval() + return tokenizer, model +``` + +- 定义文本续写方法 + + +```python +def generate(user_input): + gen_kwargs = {"max_length": 128, "top_p": 0.8, "temperature": 0.8, "do_sample": True, "repetition_penalty": 1.0} + + inputs = tokenizer([user_input], return_tensors="pt") + for k,v in inputs.items(): + inputs[k] = v.cuda() + output = model.generate(**inputs, **gen_kwargs) + output = tokenizer.decode(output[0].tolist(), skip_special_tokens=True) + return output +``` + +### 1.1 基座模型推理 + +我们先来看看基座模型的推理结果。 + +- 加载模型 + + +```python +tokenizer, model = load_model("Shanghai_AI_Laboratory/internlm2-1_8b") +``` + +- 文本续写 + + +```python +generate("书生·浦语大模型实战营第三期是") +``` + +- 释放缓存 + + +```python +del tokenizer, model + +torch.cuda.empty_cache() +``` + +### 1.2 增量预训练 + +然后我们对基座模型进行增量预训练,让模型增加新的知识。 + +#### 1.2.1 准备数据文件 + +为了让模型学习到新的知识,我们需要将新的知识数据整理成指定格式文件,形成数据集,然后让模型来学习这些新数据。这里我们准备一个简单的数据集 `datas/pretrain.json`,仅包含一条知识,然后让数据重复多次。 + +> 网上有大量的开源数据集可以供我们进行使用,有些时候我们可以在开源数据集的基础上添加一些我们自己独有的数据集,也可能会有很好的效果。 + + +```python +[ + { + "text": "书生·浦语大模型实战营第三期是上海人工智能实验室推出的书生·浦语大模型实战营系列活动的第三批次,将于2024年7月正式进行。" + } +] +``` + +准备好数据文件后,我们的目录结构应该是这样子的。 + +
+目录结构 + +``` +├── Shanghai_AI_Laboratory +│ ├── internlm2-1_8b +│ │ ├── README.md +│ │ ├── config.json +│ │ ├── configuration.json +│ │ ├── configuration_internlm2.py +│ │ ├── generation_config.json +│ │ ├── modeling_internlm2.py +│ │ ├── pytorch_model.bin +│ │ ├── special_tokens_map.json +│ │ ├── tokenization_internlm2.py +│ │ ├── tokenization_internlm2_fast.py +│ │ ├── tokenizer.json +│ │ ├── tokenizer.model +│ │ └── tokenizer_config.json +│ └── internlm2-chat-1_8b -> /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b +│ ├── README.md +│ ├── config.json +│ ├── configuration.json +│ ├── configuration_internlm2.py +│ ├── generation_config.json +│ ├── model-00001-of-00002.safetensors +│ ├── model-00002-of-00002.safetensors +│ ├── model.safetensors.index.json +│ ├── modeling_internlm2.py +│ ├── special_tokens_map.json +│ ├── tokenization_internlm2.py +│ ├── tokenization_internlm2_fast.py +│ ├── tokenizer.model +│ └── tokenizer_config.json +├── datas +│ ├── assistant.json +│ └── pretrain.json +├── internlm2_chat_1_8b_qlora_alpaca_e3_copy.py +``` + +
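+
+前面提到,这份数据集只包含一条知识,可以把它重复多次来加强微调效果。下面给出一个简单的生成脚本示意(重复次数、文件路径均为示例值,可按需调整):
+
+```python
+# 示意代码:生成 datas/pretrain.json,把同一条知识重复若干次(次数为示例值)
+import json
+import os
+
+text = ("书生·浦语大模型实战营第三期是上海人工智能实验室推出的书生·浦语大模型"
+        "实战营系列活动的第三批次,将于2024年7月正式进行。")
+
+data = [{"text": text} for _ in range(100)]  # 重复 100 次,仅为示例
+
+os.makedirs("datas", exist_ok=True)
+with open("datas/pretrain.json", "w", encoding="utf-8") as f:
+    json.dump(data, f, ensure_ascii=False, indent=4)
+```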
+ +```bash +tree -l +``` + +#### 1.2.2 准备配置文件 + +在准备好了模型和数据集后,我们就要根据我们选择的微调方法结合微调方案来找到与我们最匹配的配置文件了,从而减少我们对配置文件的修改量。 + +这里我们选择使用 `internlm2_1_8b_full_custom_pretrain_e1` 配置文件。 + + +```bash +xtuner copy-cfg internlm2_1_8b_full_custom_pretrain_e1 . +``` + +复制好配置文件后,我们的目录结构应该是这样子的。 + +
+目录结构 + +``` +├── Shanghai_AI_Laboratory +│ ├── internlm2-1_8b +│ │ ├── README.md +│ │ ├── config.json +│ │ ├── configuration.json +│ │ ├── configuration_internlm2.py +│ │ ├── generation_config.json +│ │ ├── modeling_internlm2.py +│ │ ├── pytorch_model.bin +│ │ ├── special_tokens_map.json +│ │ ├── tokenization_internlm2.py +│ │ ├── tokenization_internlm2_fast.py +│ │ ├── tokenizer.json +│ │ ├── tokenizer.model +│ │ └── tokenizer_config.json +│ └── internlm2-chat-1_8b -> /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b +│ ├── README.md +│ ├── config.json +│ ├── configuration.json +│ ├── configuration_internlm2.py +│ ├── generation_config.json +│ ├── model-00001-of-00002.safetensors +│ ├── model-00002-of-00002.safetensors +│ ├── model.safetensors.index.json +│ ├── modeling_internlm2.py +│ ├── special_tokens_map.json +│ ├── tokenization_internlm2.py +│ ├── tokenization_internlm2_fast.py +│ ├── tokenizer.model +│ └── tokenizer_config.json +├── datas +│ ├── assistant.json +│ └── pretrain.json +├── internlm2_1_8b_full_custom_pretrain_e1_copy.py +├── internlm2_chat_1_8b_qlora_alpaca_e3_copy.py +``` + +
+ +下面我们将根据项目的需求一步步的进行修改和调整吧! + +在 PART 1 的部分,由于我们不再需要在 HuggingFace 上自动下载模型,因此我们先要更换模型的路径以及数据集的路径为我们本地的路径。 + +为了训练过程中能够实时观察到模型的变化情况,XTuner 贴心的推出了一个 `evaluation_inputs` 的参数来让我们能够设置多个问题来确保模型在训练过程中的变化是朝着我们想要的方向前进的。我们可以添加自己的输入。 + +在 PART 2 的部分,由于我们复制的配置文件是全参数微调的配置,而我们希望使用 `QLoRA` 算法进行微调,所以可以添加 `QLoRA` 算法的配置。 + +```diff ++ from peft import LoraConfig + ++ import torch + +- from transformers import AutoModelForCausalLM, AutoTokenizer ++ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig + +####################################################################### +# PART 1 Settings # +####################################################################### +- pretrained_model_name_or_path = 'internlm/internlm2-1_8b' ++ pretrained_model_name_or_path = 'Shanghai_AI_Laboratory/internlm2-1_8b' + +- data_files = ['/path/to/json/file.json'] ++ data_files = ['datas/pretrain.json'] + +- evaluation_inputs = ['上海是', 'Shanghai is'] ++ evaluation_inputs = ['书生·浦语大模型实战营第三期是', '上海是', 'Shanghai is'] + +####################################################################### +# PART 2 Model & Tokenizer # +####################################################################### +model = dict( + type=SupervisedFinetune, + use_varlen_attn=use_varlen_attn, + llm=dict( + type=AutoModelForCausalLM.from_pretrained, + pretrained_model_name_or_path=pretrained_model_name_or_path, + trust_remote_code=True, ++ quantization_config=dict( ++ type=BitsAndBytesConfig, ++ load_in_4bit=True, ++ load_in_8bit=False, ++ llm_int8_threshold=6.0, ++ llm_int8_has_fp16_weight=False, ++ bnb_4bit_compute_dtype=torch.float16, ++ bnb_4bit_use_double_quant=True, ++ bnb_4bit_quant_type='nf4') + ), ++ lora=dict( ++ type=LoraConfig, ++ r=64, ++ lora_alpha=16, ++ lora_dropout=0.1, ++ bias='none', ++ task_type='CAUSAL_LM') +) +``` + +修改完后的完整的配置文件是:[configs/internlm2_1_8b_full_custom_pretrain_e1_copy.py](../../../configs/internlm2_1_8b_full_custom_pretrain_e1_copy.py)。 + +
+internlm2_1_8b_full_custom_pretrain_e1_copy.py + +```python +# Copyright (c) OpenMMLab. All rights reserved. +"""Data format:[ + { + "text": "xxx" + }, + { + "text": "xxx" + }, + ... +] +""" # noqa: E501 + +from datasets import load_dataset +from mmengine.dataset import DefaultSampler +from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook, + LoggerHook, ParamSchedulerHook) +from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR +from peft import LoraConfig +import torch +from torch.optim import AdamW +from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig + +from xtuner.dataset import process_hf_dataset +from xtuner.dataset.collate_fns import default_collate_fn +from xtuner.dataset.map_fns import pretrain_map_fn +from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook, + VarlenAttnArgsToMessageHubHook) +from xtuner.engine.runner import TrainLoop +from xtuner.model import SupervisedFinetune + +####################################################################### +# PART 1 Settings # +####################################################################### +# Model +pretrained_model_name_or_path = 'Shanghai_AI_Laboratory/internlm2-1_8b' +use_varlen_attn = False + +# Data +data_files = ['datas/pretrain.json'] +max_length = 2048 +pack_to_max_length = True + +# Scheduler & Optimizer +batch_size = 1 # per_device +accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc +dataloader_num_workers = 0 +max_epochs = 1 +optim_type = AdamW +lr = 2e-5 +betas = (0.9, 0.999) +weight_decay = 0 +max_norm = 1 # grad clip +warmup_ratio = 0.03 + +# Save +save_steps = 500 +save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited) + +# Evaluate the generation performance during the training +evaluation_freq = 500 +SYSTEM = '' +evaluation_inputs = ['书生·浦语大模型实战营第三期是', '上海是', 'Shanghai is'] + +####################################################################### +# PART 2 Model & Tokenizer # +####################################################################### +tokenizer = dict( + type=AutoTokenizer.from_pretrained, + pretrained_model_name_or_path=pretrained_model_name_or_path, + trust_remote_code=True, + padding_side='right') + +model = dict( + type=SupervisedFinetune, + use_varlen_attn=use_varlen_attn, + llm=dict( + type=AutoModelForCausalLM.from_pretrained, + pretrained_model_name_or_path=pretrained_model_name_or_path, + trust_remote_code=True, + quantization_config=dict( + type=BitsAndBytesConfig, + load_in_4bit=True, + load_in_8bit=False, + llm_int8_threshold=6.0, + llm_int8_has_fp16_weight=False, + bnb_4bit_compute_dtype=torch.float16, + bnb_4bit_use_double_quant=True, + bnb_4bit_quant_type='nf4') + ), + lora=dict( + type=LoraConfig, + r=64, + lora_alpha=16, + lora_dropout=0.1, + bias='none', + task_type='CAUSAL_LM') +) + +####################################################################### +# PART 3 Dataset & Dataloader # +####################################################################### +train_dataset = dict( + type=process_hf_dataset, + dataset=dict(type=load_dataset, path='json', data_files=data_files), + tokenizer=tokenizer, + max_length=max_length, + dataset_map_fn=pretrain_map_fn, + template_map_fn=None, + remove_unused_columns=True, + shuffle_before_pack=False, + pack_to_max_length=pack_to_max_length, + use_varlen_attn=use_varlen_attn) + +train_dataloader = dict( + batch_size=batch_size, + num_workers=dataloader_num_workers, + dataset=train_dataset, + 
sampler=dict(type=DefaultSampler, shuffle=True), + collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn)) + +####################################################################### +# PART 4 Scheduler & Optimizer # +####################################################################### +# optimizer +optim_wrapper = dict( + type=AmpOptimWrapper, + optimizer=dict( + type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay), + clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False), + accumulative_counts=accumulative_counts, + loss_scale='dynamic', + dtype='float16') + +# learning policy +# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501 +param_scheduler = [ + dict( + type=LinearLR, + start_factor=1e-5, + by_epoch=True, + begin=0, + end=warmup_ratio * max_epochs, + convert_to_iter_based=True), + dict( + type=CosineAnnealingLR, + eta_min=0.0, + by_epoch=True, + begin=warmup_ratio * max_epochs, + end=max_epochs, + convert_to_iter_based=True) +] + +# train, val, test setting +train_cfg = dict(type=TrainLoop, max_epochs=max_epochs) + +####################################################################### +# PART 5 Runtime # +####################################################################### +# Log the dialogue periodically during the training process, optional +custom_hooks = [ + dict(type=DatasetInfoHook, tokenizer=tokenizer), + dict( + type=EvaluateChatHook, + tokenizer=tokenizer, + every_n_iters=evaluation_freq, + evaluation_inputs=evaluation_inputs, + system=SYSTEM) +] + +if use_varlen_attn: + custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)] + +# configure default hooks +default_hooks = dict( + # record the time of every iteration. + timer=dict(type=IterTimerHook), + # print log every 10 iterations. + logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10), + # enable the parameter scheduler. + param_scheduler=dict(type=ParamSchedulerHook), + # save checkpoint per `save_steps`. + checkpoint=dict( + type=CheckpointHook, + by_epoch=False, + interval=save_steps, + max_keep_ckpts=save_total_limit), + # set sampler seed in distributed evrionment. + sampler_seed=dict(type=DistSamplerSeedHook), +) + +# configure environment +env_cfg = dict( + # whether to enable cudnn benchmark + cudnn_benchmark=False, + # set multi process parameters + mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0), + # set distributed parameters + dist_cfg=dict(backend='nccl'), +) + +# set visualizer +visualizer = None + +# set log level +log_level = 'INFO' + +# load from which checkpoint +load_from = None + +# whether to resume training from the loaded checkpoint +resume = False + +# Defaults to use random seed and disable `deterministic` +randomness = dict(seed=None, deterministic=False) + +# set log processor +log_processor = dict(by_epoch=False) +``` + +
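+
+可以看到,相比全参数微调的原始配置,这里主要多了 `quantization_config`(4bit 量化加载,对应 QLoRA 的量化部分)和 `lora`(低秩适配器)两块设置。如果脱离 XTuner、直接用 `transformers` + `peft` 来表达同样的设置,大致是下面这个样子(仅为帮助理解的示意代码,不是训练脚本,且假设已安装 `peft` 与 `bitsandbytes`):
+
+```python
+# 示意代码:用 transformers + peft 表达与上面配置等价的 QLoRA 设置(仅供理解)
+import torch
+from peft import LoraConfig, get_peft_model
+from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+
+bnb_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    llm_int8_threshold=6.0,
+    llm_int8_has_fp16_weight=False,
+    bnb_4bit_compute_dtype=torch.float16,
+    bnb_4bit_use_double_quant=True,
+    bnb_4bit_quant_type='nf4')
+
+model = AutoModelForCausalLM.from_pretrained(
+    'Shanghai_AI_Laboratory/internlm2-1_8b',
+    quantization_config=bnb_config,
+    device_map='auto',
+    trust_remote_code=True)
+
+lora_config = LoraConfig(
+    r=64, lora_alpha=16, lora_dropout=0.1, bias='none', task_type='CAUSAL_LM')
+model = get_peft_model(model, lora_config)  # 在量化后的基座模型上挂载 LoRA 层
+model.print_trainable_parameters()
+```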
+
+#### 1.2.3 启动微调
+
+完成了所有的准备工作后,我们就可以正式开始我们下一阶段的旅程:XTuner 启动~!
+
+当我们准备好了所有内容,只需要使用 `xtuner train` 命令即可开始训练。
+
+
+```bash
+xtuner train ./internlm2_1_8b_full_custom_pretrain_e1_copy.py
+```
+
+在训练完后,我们的目录结构应该是这样子的。
+
+<details>
+目录结构 + +``` +├── work_dirs +│ └── internlm2_1_8b_full_custom_pretrain_e1_copy +│ ├── 20240627_214522 +│ │ ├── 20240627_214522.log +│ │ └── vis_data +│ │ ├── 20240627_214522.json +│ │ ├── config.py +│ │ ├── eval_outputs_iter_1499.txt +│ │ ├── eval_outputs_iter_1999.txt +│ │ ├── eval_outputs_iter_2499.txt +│ │ ├── eval_outputs_iter_2623.txt +│ │ ├── eval_outputs_iter_499.txt +│ │ ├── eval_outputs_iter_999.txt +│ │ └── scalars.json +│ ├── internlm2_1_8b_full_custom_pretrain_e1_copy.py +│ ├── iter_2500.pth +│ ├── iter_2624.pth +│ └── last_checkpoint +``` + +
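+
+下一步做格式转换时需要用到最新保存的权重文件。除了下文使用的 shell 一行命令,也可以用一小段 Python 找到最新的 `.pth`(示意代码,目录名以实际训练输出为准):
+
+```python
+# 示意代码:找到 work_dirs 中最新保存的 .pth 权重(与后文的 ls -t | head -n 1 等价)
+from pathlib import Path
+
+work_dir = Path("./work_dirs/internlm2_1_8b_full_custom_pretrain_e1_copy")
+latest = max(work_dir.glob("*.pth"), key=lambda p: p.stat().st_mtime)
+print(latest)
+```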
+ +#### 1.2.4 模型格式转换 + +模型转换的本质其实就是将原本使用 Pytorch 训练出来的模型权重文件转换为目前通用的 HuggingFace 格式文件,那么我们可以通过以下命令来实现一键转换。 + + +```bash +pth_file=`ls -t ./work_dirs/internlm2_1_8b_full_custom_pretrain_e1_copy/*.pth | head -n 1` && MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU xtuner convert pth_to_hf ./internlm2_1_8b_full_custom_pretrain_e1_copy.py ${pth_file} ./hf +``` + +模型格式转换完成后,我们的目录结构应该是这样子的。 + +
+目录结构 + +``` +├── hf +│ ├── README.md +│ ├── adapter_config.json +│ ├── adapter_model.bin +│ └── xtuner_config.py +``` + +
+ +#### 1.2.5 模型合并 + +对于 LoRA 或者 QLoRA 微调出来的模型其实并不是一个完整的模型,而是一个额外的层(Adapter),训练完的这个层最终还是要与原模型进行合并才能被正常的使用。 + + +```bash +MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU xtuner convert merge Shanghai_AI_Laboratory/internlm2-1_8b ./hf ./merged --max-shard-size 2GB +``` + +模型合并完成后,我们的目录结构应该是这样子的。 + +
+目录结构 + +``` +├── merged +│ ├── config.json +│ ├── configuration_internlm2.py +│ ├── generation_config.json +│ ├── modeling_internlm2.py +│ ├── pytorch_model-00001-of-00002.bin +│ ├── pytorch_model-00002-of-00002.bin +│ ├── pytorch_model.bin.index.json +│ ├── special_tokens_map.json +│ ├── tokenization_internlm2.py +│ ├── tokenization_internlm2_fast.py +│ ├── tokenizer.json +│ ├── tokenizer.model +│ └── tokenizer_config.json +``` + +
+ +### 1.3 目标模型推理 + +当我们合并完成后,我们就能够正常的调用这个模型进行推理了。 + + +```python +tokenizer, model = load_model("./merged") +``` + + +```python +generate("书生·浦语大模型实战营第三期是") +``` + + +```python +generate("成都是") +``` + +可以看到,通过增量预训练,确实在基座模型的基础上学习到了新的知识。 + + +```python +del tokenizer, model + +torch.cuda.empty_cache() +``` + +## 2 DeepSpeed介绍 + +DeepSpeed是一个由微软开发的开源深度学习优化库,旨在提高大规模模型训练的效率和速度。 + +XTuner 也内置了 `deepspeed` 来加速整体的训练过程,共有三种不同的 `deepspeed` 类型可进行选择,分别是 `deepspeed_zero1`, `deepspeed_zero2` 和 `deepspeed_zero3`。 + +
+DeepSpeed优化器及其选择方法 +DeepSpeed是一个由微软开发的开源深度学习优化库,旨在提高大规模模型训练的效率和速度。它通过几种关键技术来优化训练过程,包括模型分割、梯度累积、以及内存和带宽优化等,能够降低训练超大规模模型的复杂性和资源需求,使训练变得更快、更高效。DeepSpeed特别适用于需要巨大计算资源的大型模型和数据集。 + +在DeepSpeed中,引入了ZeRO(Zero Redundancy Optimizer)技术,是一种旨在降低训练大型模型所需内存占用的优化器,通过在分布式环境中分割优化器的状态、梯度和参数,减少冗余的内存占用,允许更大的模型和更快的训练速度。ZeRO 分为几个不同的级别,主要包括: + +- **deepspeed_zero1**:这是ZeRO的基本版本,它优化了模型参数的存储,主要通过分区存储优化器状态来减少内存使用。每个GPU设备只保存一部分优化器状态,从而显著减少内存消耗。 + +- **deepspeed_zero2**:在deepspeed_zero1的基础上,deepspeed_zero2进一步优化了梯度和优化器状态的存储,将梯度也进行分区存储。这样,每个GPU设备只需要保存一部分的优化器状态和梯度,进一步减少内存使用。 + +- **deepspeed_zero3**:这是目前最高级的优化等级,它包括了deepspeed_zero1和deepspeed_zero2的优化,除了优化器状态和梯度,还将模型参数进行分区存储。每个GPU设备只需要保存一部分的优化器状态、梯度和模型参数,从而最大限度地减少内存使用。 + +选择哪种deepspeed类型主要取决于你的具体需求,包括模型的大小、可用的硬件资源(特别是GPU内存)以及训练的效率需求。一般来说: + +- 如果你的模型较小,或者内存资源充足,可能不需要使用最高级别的优化。 +- 如果你需要快速训练模型,可能需要权衡内存优化和计算效率。deepspeed_zero1提供了较低的内存占用,同时保持了较高的计算效率。 +- 如果你正在尝试训练非常大的模型,或者你的硬件资源有限,使用deepspeed_zero2或deepspeed_zero3可能更合适,因为它们可以显著降低内存占用,允许更大模型的训练。 +- 选择时也要考虑到实现的复杂性和运行时的开销,更高级的优化可能需要更复杂的设置,更频繁的跨GPU通信,这可能需要更高的网络带宽,并可能增加一些计算开销。 + +
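+
+为了更直观地感受三个 ZeRO 等级的差别,可以粗略估算一下各级别下每张卡需要保存的“训练状态”显存。下面的计算只是数量级示意:假设参数与梯度为 fp16、Adam 优化器状态为 fp32 的动量和方差,忽略激活值、通信缓冲等其他开销:
+
+```python
+# 示意代码:粗略估算 ZeRO 各级别下每张 GPU 的训练状态显存(数量级估计,忽略激活值等)
+def zero_memory_gb(num_params: float, num_gpus: int, stage: int) -> float:
+    params_bytes = 2 * num_params   # fp16 参数
+    grads_bytes = 2 * num_params    # fp16 梯度
+    optim_bytes = 8 * num_params    # fp32 的 Adam 动量 + 方差
+    if stage >= 1:
+        optim_bytes /= num_gpus     # ZeRO-1:切分优化器状态
+    if stage >= 2:
+        grads_bytes /= num_gpus     # ZeRO-2:再切分梯度
+    if stage >= 3:
+        params_bytes /= num_gpus    # ZeRO-3:再切分模型参数
+    return (params_bytes + grads_bytes + optim_bytes) / 1024**3
+
+for stage in (1, 2, 3):
+    print(f"1.8B 参数、2 卡、deepspeed_zero{stage}:约 {zero_memory_gb(1.8e9, 2, stage):.1f} GB / 卡")
+```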
+ +## 3 多卡微调 + +模型的规模和复杂度不断增加,单张GPU的显存往往无法满足大模型的训练需求。此时,我们可能需要多卡微调,以应对大模型训练过程中显存和计算资源的需求。 + + +XTuner 中使用多卡微调,只需要设置 `NPROC_PER_NODE` 环境变量,并使用 `DeepSpeed` 来进行加速就可以了,其余命令内容与单卡微调时一样。 + +> 由于开发机只有两张显卡,所以我们设置`NPROC_PER_NODE=2`,并且选择使用`deepspeed_zero3`优化等级。 + + +```bash +MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU NPROC_PER_NODE=2 xtuner train ./internlm2_chat_1_8b_qlora_alpaca_e3_copy.py --deepspeed deepspeed_zero3 +``` + +在执行微调的过程中,我们可以看到两张显卡都有内存使用。 + +![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-06.png) + +在训练完后,我们的目录结构应该是这样子的。 + +
+目录结构 + +``` +├── work_dirs +│ └── internlm2_chat_1_8b_qlora_alpaca_e3_copy +│ ├── 20240628_205957 +│ │ ├── 20240628_205957.log +│ │ └── vis_data +│ │ ├── 20240628_205957.json +│ │ ├── config.py +│ │ ├── eval_outputs_iter_236.txt +│ │ └── scalars.json +│ ├── internlm2_chat_1_8b_qlora_alpaca_e3_copy.py +│ ├── iter_237.pth +│ │ ├── bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt +│ │ ├── bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt +│ │ ├── zero_pp_rank_0_mp_rank_00_model_states.pt +│ │ └── zero_pp_rank_1_mp_rank_00_model_states.pt +│ ├── last_checkpoint +│ └── zero_to_fp32.py +``` + +
+ +可以看到,通过 `deepspeed` 来训练后得到的权重文件和原本的权重文件是有所差别的,原本的仅仅是一个 .pth 的文件,而使用了 `deepspeed` 则是一个名字带有 .pth 的文件夹,在该文件夹里保存了 .pt 文件。这两者在具体的使用上并没有太大的差别,转换和合并的过程都是一样的。 + + +```bash +pth_file=`ls -t ./work_dirs/internlm2_chat_1_8b_qlora_alpaca_e3_copy | grep pth | head -n 1` && MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU xtuner convert pth_to_hf ./internlm2_chat_1_8b_qlora_alpaca_e3_copy.py ./work_dirs/internlm2_chat_1_8b_qlora_alpaca_e3_copy/${pth_file} ./hf +``` + + +```bash +MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU xtuner convert merge Shanghai_AI_Laboratory/internlm2-chat-1_8b ./hf ./merged --max-shard-size 2GB +``` + + +```python +tokenizer, model = load_model("./merged") +``` + + +```python +chat("请介绍一下你自己") +``` + + +```python +chat("你在实战营做什么") +``` + + +```python +chat("介绍一下成都") +``` + + +```python +del tokenizer, model + +torch.cuda.empty_cache() +``` + +## 4 分布式微调 + +如果模型的规模和复杂度继续增加,我们还可以使用分布式微调。 + + +```bash +apt-get install -y net-tools +ifconfig +``` + +分布式微调是主从架构的。主节点协调整个训练过程,管理数据和任务到工作节点的分配。工作节点执行训练步骤的实际计算,处理数据的子集并计算梯度。有时候在一些架构中还需要参数服务器协调所有工作节点之间的模型更新同步,用于聚合来自工作节点的梯度并更新模型参数。 + +> 我们使用两个节点进行分布式微调,实际上需要启动三个节点。 + + +```bash +MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 NPROC_PER_NODE=1 NNODES=2 xtuner train internlm2_chat_1_8b_qlora_alpaca_e3_copy.py + +MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 NPROC_PER_NODE=1 NNODES=2 NODE_RANK=0 TRITON_CACHE_DIR=node0 PORT=20821 ADDR=192.168.230.182 xtuner train internlm2_chat_1_8b_qlora_alpaca_e3_copy.py --work-dir work_dir_node0 + +MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 NPROC_PER_NODE=1 NNODES=2 NODE_RANK=1 TRITON_CACHE_DIR=node1 PORT=20821 ADDR=192.168.230.182 xtuner train internlm2_chat_1_8b_qlora_alpaca_e3_copy.py --work-dir work_dir_node1 +``` + +首先启动主节点,然后依次启动其他节点。但需要注意的是,需要在一个时间阈值内启动相关的节点,如果超过时间阈值还没启动所有节点,则其他节点会因超时而报错退出。 + +比如,在两个节点的分布式微调过程中,我们只启动主节点和一个工作节点,另一个节点不启动,则已启动的节点会超时报错退出。 + +![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-07.png) + +如果所有节点都正常启动、训练,则可以看到每个节点的显卡均有内存使用。 + +![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-08.png) + +在训练完后,我们的目录结构应该是这样子的,训练的模型在工作节点上。 + +
+目录结构 + +``` +├── work_dir_node0 +│ ├── 20240629_213009 +│ │ ├── 20240629_213009.log +│ │ └── vis_data +│ │ ├── 20240629_213009.json +│ │ ├── config.py +│ │ ├── eval_outputs_iter_233.txt +│ │ └── scalars.json +│ ├── internlm2_chat_1_8b_qlora_alpaca_e3_copy.py +│ ├── iter_234.pth +│ └── last_checkpoint +├── work_dir_node1 +│ └── 20240629_213009 +├── work_dirs +│ └── internlm2_chat_1_8b_qlora_alpaca_e3_copy +``` + +
+ +## 5 小结 + +现在,我们又学到了 XTuner 微调的更多高阶知识啦,包括增量预训练微调基座模型、多卡微调、分布式微调等。 + +是不是感觉其实微调也不过如此!事实上确实是这样的!其实在微调的时候最重要的还是要自己准备一份高质量的数据集,这个才是你能否真微调出效果最核心的利器。 \ No newline at end of file diff --git a/docs/L1/XTuner/xtuner_finetune_basic.md b/docs/L1/XTuner/xtuner_finetune_basic.md new file mode 100644 index 000000000..b65890e7d --- /dev/null +++ b/docs/L1/XTuner/xtuner_finetune_basic.md @@ -0,0 +1,170 @@ +# XTuner微调前置基础 + +## 1 基本概念 + +在进行微调之前,我们需要了解一些基本概念。 + +### 1.1 Finetune简介 + +微调(fine-tuning)是一种基于预训练模型,通过少量的调整(fine-tune)来适应新的任务或数据的方法。 + +微调是在预训练模型的基础上,将模型中一些层的权重参数进行微调,以适应新的数据集或任务。 + +预训练模型部分已经在大规模数据上得到了训练,它们通常是较为通用且高性能的模型,因此可以很好地作为新任务的起点。微调可以加快模型的收敛速度,降低模型过拟合的风险,并在不消耗过多计算资源的情况下获取较好的模型性能。 + +#### 1.1.1 Finetune的两种范式 + +在大模型的下游应用中,经常会用到两种微调模式:**增量预训练** 和 **指令跟随** 。 + +1. **增量预训练** + +增量预训练是一种在已有预训练模型(比如:InternLM基座模型)的基础上,利用特定领域的数据进行进一步训练的方法。它的目的是在保持模型原有能力的同时,注入新的领域知识,进一步优化现有的预训练模型,从而提升模型在特定领域任务中的表现(比如:InternLM垂类基座模型)。增量预训练模型能够接受少量的新数据进行更新并适应新的任务,而不需要重新训练整个模型,这种方式可以很好地利用现有的预训练模型的知识,并在新数据上获得更好的性能。 + +2. **指令跟随** + +指令跟随是指让模型根据用户输入的指令来执行相应的操作。模型通过对大量自然语言指令和相应操作的数据进行训练,学习如何将指令分解为具体的子任务,并选择合适的模块来执行这些任务(比如:InternLM垂类对话模型)。 + +### 1.2 微调技术 + +大多数大型语言模型(LLM)的参数规模巨大,且规模日益增大,导致模型的训练和微调成本高昂,直接训练需要耗费大量计算资源和费用。近年来,如何高效地对大模型进行微调成为了研究热点,而LoRA和QLoRA两种微调技术因其高效性和实用性受到了广泛关注。 + +#### 1.2.1 LoRA简介 + +LoRA(Low-Rank Adaptation)是一种使用低精度权重对大型预训练语言模型进行微调的技术,它的核心思想是在不改变原有模型权重的情况下,通过添加少量新参数来进行微调。这种方法降低了模型的存储需求,也降低了计算成本,实现了对大模型的快速适应,同时保持了模型性能。 + +然而,由于使用了低精度权重,LoRA的一个潜在的缺点是在微调过程中可能会丢失一些原始模型的高阶特征信息,因此可能会降低模型的准确性。 + +#### 1.2.2 QLoRA简介 + +QLoRA(Quantized LoRA)微调技术是对LoRA的一种改进,它通过引入高精度权重和可学习的低秩适配器来提高模型的准确性。并且在LoRA的基础上,引入了量化技术。通过将预训练模型量化为int4格式,可以进一步减少微调过程中的计算量,同时也可以减少模型的存储空间,这对于在资源有限的设备上运行模型非常有用。最终,可以使我们在消费级的显卡上进行模型的微调训练。 + +### 1.3 XTuner简介 + +XTuner 的官方仓库是:https://github.com/InternLM/xtuner (欢迎Star)! 
+ +XTuner 一个大语言模型&多模态模型微调工具箱。*由* *MMRazor* *和* *MMDeploy* *联合开发。* + +- 🤓 **傻瓜化:** 以 配置文件 的形式封装了大部分微调场景,**0基础的非专业人员也能一键开始微调**。 +- 🍃 **轻量级:** 对于 7B 参数量的LLM,**微调所需的最小显存仅为 8GB** : **消费级显卡✅,colab✅** + +#### 1.3.1 功能亮点 + +- 适配多种生态 + - 支持多种微调算法 + - 适配多种开源生态(HuggingFace、ModelScope等) + - 自动优化加速器 +- 适配多种硬件 + +#### 1.3.2 常用命令 + +以下是一些常用的命令。 + +- 查看帮助 + + +```bash +xtuner help +``` + +- 查看版本 + + +```bash +xtuner version +``` + +- 列出所有预定义配置文件 + + +```bash +xtuner list-cfg +``` + +- 列出包含指定名称的预定义配置文件 + +> `xtuner list-cfg` 命令用于列出内置的所有配置文件。参数 `-p` 或 `--pattern` 表示模式匹配,后面跟着的内容将会在所有的配置文件里进行模糊匹配搜索,然后返回最有可能得内容。 + + +```bash +xtuner list-cfg -p $NAME +``` + +- 复制配置文件 + +> `xtuner copy-cfg` 命令用于复制一个内置的配置文件。该命令需要两个参数:`CONFIG` 代表需要复制的配置文件名称,`SAVE_PATH` 代表复制的目标路径。 + + +```bash +xtuner copy-cfg $CONFIG $SAVE_PATH +``` + +- 执行微调训练 + +> `xtuner train` 命令用于启动模型微调进程。该命令需要一个参数:`CONFIG` 用于指定微调配置文件。 + + +```bash +xtuner train $CONFIG +``` + +- 将 pth 格式的模型文件转换成 HuggingFace 格式的模型 + +> `xtuner convert pth_to_hf` 命令用于进行模型格式转换。该命令需要三个参数:`CONFIG` 表示微调的配置文件, `PATH_TO_PTH_MODEL` 表示微调的模型权重文件路径,即要转换的模型权重, `SAVE_PATH_TO_HF_MODEL` 表示转换后的 HuggingFace 格式文件的保存路径。 + +除此之外,我们其实还可以在转换的命令中添加几个额外的参数,包括: + +| 参数名 | 解释 | +| --------------------- | -------------------------------------------- | +| --fp32 | 代表以fp32的精度开启,假如不输入则默认为fp16 | +| --max-shard-size {GB} | 代表每个权重文件最大的大小(默认为2GB) | + + +```bash +xtuner convert pth_to_hf $CONFIG $PATH_TO_PTH_MODEL $SAVE_PATH_TO_HF_MODEL +``` + +- 将原始模型与微调结果进行合并 + +> `xtuner convert merge`命令用于合并模型。该命令需要三个参数:`LLM` 表示原模型路径,`ADAPTER` 表示 Adapter 层的路径, `SAVE_PATH` 表示合并后的模型最终的保存路径。 + +在模型合并这一步还有其他很多的可选参数,包括: + +| 参数名 | 解释 | +| ---------------------- | ------------------------------------------------------------ | +| --max-shard-size {GB} | 代表每个权重文件最大的大小(默认为2GB) | +| --device {device_name} | 这里指的就是device的名称,可选择的有cuda、cpu和auto,默认为cuda即使用gpu进行运算 | +| --is-clip | 这个参数主要用于确定模型是不是CLIP模型,假如是的话就要加上,不是就不需要添加 | + +> CLIP(Contrastive Language–Image Pre-training)模型是 OpenAI 开发的一种预训练模型,它能够理解图像和描述它们的文本之间的关系。CLIP 通过在大规模数据集上学习图像和对应文本之间的对应关系,从而实现了对图像内容的理解和分类,甚至能够根据文本提示生成图像。 + + +```bash +xtuner convert merge $LLM $ADAPTER $SAVE_PATH +``` + +## 2 创建开发机 + +我们需要前往 [InternStudio](https://studio.intern-ai.org.cn/) 中创建一台开发机进行使用。 + +步骤1:登录InternStudio后,在控制台点击 “创建开发机” 按钮可以进入到开发机的创建界面。 + +![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-01.png) + +步骤2:在 “创建开发机” 界面,选择开发机类型:个人开发机,输入开发机名称:XTuner微调,选择开发机镜像:Cuda12.2-conda。 + +![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-02.png) + +步骤3:在镜像详情界面,点击 “使用” 链接,确认使用该镜像。 + +![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-03.png) + +步骤4:资源配置可以选择 10% (如果有更高资源可以使用,也可以选择更高的资源配置),然后点击 “立即创建” 按钮创建开发机。 + +![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-04.png) + +步骤5:创建完成后,在开发机列表中可以看到刚创建的开发机,点击 “进入开发机” 链接可以连接进入到开发机。 + +![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-05.png) + +当我们有了这些前置知识和服务器之后,就可以进行下一步的微调任务了。 + diff --git a/tools/xtuner_streamlit_demo.py b/tools/xtuner_streamlit_demo.py new file mode 100644 index 000000000..a8381f342 --- /dev/null +++ b/tools/xtuner_streamlit_demo.py @@ -0,0 +1,269 @@ +import copy +import warnings +from dataclasses import asdict, dataclass +from typing import Callable, List, Optional + +import streamlit as st +import torch +from torch import nn +from transformers.generation.utils import (LogitsProcessorList, + StoppingCriteriaList) +from transformers.utils import logging + +from transformers import AutoTokenizer, 
AutoModelForCausalLM # isort: skip + +logger = logging.get_logger(__name__) + + +model_name_or_path = "./merged" + +@dataclass +class GenerationConfig: + # this config is used for chat to provide more diversity + max_length: int = 2048 + top_p: float = 0.75 + temperature: float = 0.1 + do_sample: bool = True + repetition_penalty: float = 1.000 + + +@torch.inference_mode() +def generate_interactive( + model, + tokenizer, + prompt, + generation_config: Optional[GenerationConfig] = None, + logits_processor: Optional[LogitsProcessorList] = None, + stopping_criteria: Optional[StoppingCriteriaList] = None, + prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor], + List[int]]] = None, + additional_eos_token_id: Optional[int] = None, + **kwargs, +): + inputs = tokenizer([prompt], padding=True, return_tensors='pt') + input_length = len(inputs['input_ids'][0]) + for k, v in inputs.items(): + inputs[k] = v.cuda() + input_ids = inputs['input_ids'] + _, input_ids_seq_length = input_ids.shape[0], input_ids.shape[-1] + if generation_config is None: + generation_config = model.generation_config + generation_config = copy.deepcopy(generation_config) + model_kwargs = generation_config.update(**kwargs) + bos_token_id, eos_token_id = ( # noqa: F841 # pylint: disable=W0612 + generation_config.bos_token_id, + generation_config.eos_token_id, + ) + if isinstance(eos_token_id, int): + eos_token_id = [eos_token_id] + if additional_eos_token_id is not None: + eos_token_id.append(additional_eos_token_id) + has_default_max_length = kwargs.get( + 'max_length') is None and generation_config.max_length is not None + if has_default_max_length and generation_config.max_new_tokens is None: + warnings.warn( + f"Using 'max_length''s default ({repr(generation_config.max_length)}) \ + to control the generation length. " + 'This behaviour is deprecated and will be removed from the \ + config in v5 of Transformers -- we' + ' recommend using `max_new_tokens` to control the maximum \ + length of the generation.', + UserWarning, + ) + elif generation_config.max_new_tokens is not None: + generation_config.max_length = generation_config.max_new_tokens + \ + input_ids_seq_length + if not has_default_max_length: + logger.warn( # pylint: disable=W4902 + f"Both 'max_new_tokens' (={generation_config.max_new_tokens}) " + f"and 'max_length'(={generation_config.max_length}) seem to " + "have been set. 'max_new_tokens' will take precedence. " + 'Please refer to the documentation for more information. ' + '(https://huggingface.co/docs/transformers/main/' + 'en/main_classes/text_generation)', + UserWarning, + ) + + if input_ids_seq_length >= generation_config.max_length: + input_ids_string = 'input_ids' + logger.warning( + f"Input length of {input_ids_string} is {input_ids_seq_length}, " + f"but 'max_length' is set to {generation_config.max_length}. " + 'This can lead to unexpected behavior. You should consider' + " increasing 'max_new_tokens'.") + + # 2. 
Set generation parameters if not already defined + logits_processor = logits_processor if logits_processor is not None \ + else LogitsProcessorList() + stopping_criteria = stopping_criteria if stopping_criteria is not None \ + else StoppingCriteriaList() + + logits_processor = model._get_logits_processor( + generation_config=generation_config, + input_ids_seq_length=input_ids_seq_length, + encoder_input_ids=input_ids, + prefix_allowed_tokens_fn=prefix_allowed_tokens_fn, + logits_processor=logits_processor, + ) + + stopping_criteria = model._get_stopping_criteria( + generation_config=generation_config, + stopping_criteria=stopping_criteria) + logits_warper = model._get_logits_warper(generation_config) + + unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1) + scores = None + while True: + model_inputs = model.prepare_inputs_for_generation( + input_ids, **model_kwargs) + # forward pass to get next token + outputs = model( + **model_inputs, + return_dict=True, + output_attentions=False, + output_hidden_states=False, + ) + + next_token_logits = outputs.logits[:, -1, :] + + # pre-process distribution + next_token_scores = logits_processor(input_ids, next_token_logits) + next_token_scores = logits_warper(input_ids, next_token_scores) + + # sample + probs = nn.functional.softmax(next_token_scores, dim=-1) + if generation_config.do_sample: + next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1) + else: + next_tokens = torch.argmax(probs, dim=-1) + + # update generated ids, model inputs, and length for next step + input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1) + model_kwargs = model._update_model_kwargs_for_generation( + outputs, model_kwargs, is_encoder_decoder=False) + unfinished_sequences = unfinished_sequences.mul( + (min(next_tokens != i for i in eos_token_id)).long()) + + output_token_ids = input_ids[0].cpu().tolist() + output_token_ids = output_token_ids[input_length:] + for each_eos_token_id in eos_token_id: + if output_token_ids[-1] == each_eos_token_id: + output_token_ids = output_token_ids[:-1] + response = tokenizer.decode(output_token_ids) + + yield response + # stop when each sentence is finished + # or if we exceed the maximum length + if unfinished_sequences.max() == 0 or stopping_criteria( + input_ids, scores): + break + + +def on_btn_click(): + del st.session_state.messages + + +@st.cache_resource +def load_model(): + model = (AutoModelForCausalLM.from_pretrained(model_name_or_path, + trust_remote_code=True).to( + torch.bfloat16).cuda()) + tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, + trust_remote_code=True) + return model, tokenizer + + +def prepare_generation_config(): + with st.sidebar: + max_length = st.slider('Max Length', + min_value=8, + max_value=32768, + value=2048) + top_p = st.slider('Top P', 0.0, 1.0, 0.75, step=0.01) + temperature = st.slider('Temperature', 0.0, 1.0, 0.1, step=0.01) + st.button('Clear Chat History', on_click=on_btn_click) + + generation_config = GenerationConfig(max_length=max_length, + top_p=top_p, + temperature=temperature) + + return generation_config + + +user_prompt = '<|im_start|>user\n{user}<|im_end|>\n' +robot_prompt = '<|im_start|>assistant\n{robot}<|im_end|>\n' +cur_query_prompt = '<|im_start|>user\n{user}<|im_end|>\n\ + <|im_start|>assistant\n' + + +def combine_history(prompt): + messages = st.session_state.messages + meta_instruction = ('') + total_prompt = f"<|im_start|>system\n{meta_instruction}<|im_end|>\n" + for message in messages: + cur_content = message['content'] + if 
message['role'] == 'user': + cur_prompt = user_prompt.format(user=cur_content) + elif message['role'] == 'robot': + cur_prompt = robot_prompt.format(robot=cur_content) + else: + raise RuntimeError + total_prompt += cur_prompt + total_prompt = total_prompt + cur_query_prompt.format(user=prompt) + return total_prompt + + +def main(): + # torch.cuda.empty_cache() + print('load model begin.') + model, tokenizer = load_model() + print('load model end.') + + + st.title('InternLM2-Chat-1.8B') + + generation_config = prepare_generation_config() + + # Initialize chat history + if 'messages' not in st.session_state: + st.session_state.messages = [] + + # Display chat messages from history on app rerun + for message in st.session_state.messages: + with st.chat_message(message['role'], avatar=message.get('avatar')): + st.markdown(message['content']) + + # Accept user input + if prompt := st.chat_input('What is up?'): + # Display user message in chat message container + with st.chat_message('user'): + st.markdown(prompt) + real_prompt = combine_history(prompt) + # Add user message to chat history + st.session_state.messages.append({ + 'role': 'user', + 'content': prompt, + }) + + with st.chat_message('robot'): + message_placeholder = st.empty() + for cur_response in generate_interactive( + model=model, + tokenizer=tokenizer, + prompt=real_prompt, + additional_eos_token_id=92542, + **asdict(generation_config), + ): + # Display robot response in chat message container + message_placeholder.markdown(cur_response + '▌') + message_placeholder.markdown(cur_response) + # Add robot response to chat history + st.session_state.messages.append({ + 'role': 'robot', + 'content': cur_response, # pylint: disable=undefined-loop-variable + }) + torch.cuda.empty_cache() + + +if __name__ == '__main__': + main() From d6d681a61c39a1f04e96212d0381e4db0b9b2ba1 Mon Sep 17 00:00:00 2001 From: AI-Labs Date: Tue, 2 Jul 2024 23:48:52 +0800 Subject: [PATCH 038/754] =?UTF-8?q?XTuner=E5=BE=AE=E8=B0=83=E4=B8=AA?= =?UTF-8?q?=E4=BA=BA=E5=B0=8F=E5=8A=A9=E6=89=8B=E8=AE=A4=E7=9F=A5?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- ...nternlm2_chat_1_8b_qlora_alpaca_e3_copy.py | 4 +- ...346\211\213\350\256\244\347\237\245.ipynb" | 2857 ----------------- docs/L1/XTuner/readme.md | 120 +- docs/L1/XTuner/{homework.md => task.md} | 3 +- tools/xtuner_generate_assistant.py | 24 + 5 files changed, 126 insertions(+), 2882 deletions(-) delete mode 100644 "docs/L1/XTuner/XTuner\345\276\256\350\260\203\344\270\252\344\272\272\345\260\217\345\212\251\346\211\213\350\256\244\347\237\245.ipynb" rename docs/L1/XTuner/{homework.md => task.md} (92%) create mode 100644 tools/xtuner_generate_assistant.py diff --git a/configs/internlm2_chat_1_8b_qlora_alpaca_e3_copy.py b/configs/internlm2_chat_1_8b_qlora_alpaca_e3_copy.py index 6c478fa4f..c972f3d32 100644 --- a/configs/internlm2_chat_1_8b_qlora_alpaca_e3_copy.py +++ b/configs/internlm2_chat_1_8b_qlora_alpaca_e3_copy.py @@ -24,7 +24,7 @@ # PART 1 Settings # ####################################################################### # Model -pretrained_model_name_or_path = 'Shanghai_AI_Laboratory/internlm2-chat-1_8b' +pretrained_model_name_or_path = '/root/InternLM/XTuner/Shanghai_AI_Laboratory/internlm2-chat-1_8b' use_varlen_attn = False # Data @@ -57,7 +57,7 @@ evaluation_freq = 500 SYSTEM = SYSTEM_TEMPLATE.alpaca evaluation_inputs = [ - '请介绍一下你自己', '请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai' + '请介绍一下你自己', 'Please introduce yourself' ] 
####################################################################### diff --git "a/docs/L1/XTuner/XTuner\345\276\256\350\260\203\344\270\252\344\272\272\345\260\217\345\212\251\346\211\213\350\256\244\347\237\245.ipynb" "b/docs/L1/XTuner/XTuner\345\276\256\350\260\203\344\270\252\344\272\272\345\260\217\345\212\251\346\211\213\350\256\244\347\237\245.ipynb" deleted file mode 100644 index 8587db88b..000000000 --- "a/docs/L1/XTuner/XTuner\345\276\256\350\260\203\344\270\252\344\272\272\345\260\217\345\212\251\346\211\213\350\256\244\347\237\245.ipynb" +++ /dev/null @@ -1,2857 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "451252c1-8bf5-461c-aaa6-77fa509c69d5", - "metadata": {}, - "source": [ - "# XTuner微调个人小助手认知\n", - "\n", - "在本节中,将一步步带领大家体验如何使用 XTuner 完成个人小助手的微调!" - ] - }, - { - "cell_type": "markdown", - "id": "e5bd9a97-42ee-478b-989a-7b32a57cf035", - "metadata": {}, - "source": [ - "## 1 微调前置基础\n", - "\n", - "在进行微调之前,我们需要了解一些基本概念,请访问[XTuner微调前置基础](./xtuner_finetune_basic.md)。" - ] - }, - { - "cell_type": "markdown", - "id": "0c080138", - "metadata": {}, - "source": [ - "## 2 准备工作\n", - "\n", - "**环境安装**:我们想要用简单易上手的微调工具包 XTuner 来对模型进行微调的话,第一步是安装 XTuner !安装基础的工具是一切的前提,只有安装了 XTuner 我们才能够去执行后续的操作。\n", - "\n", - "**前期准备**:在完成 XTuner 的安装后,我们下一步就需要去明确我们自己的微调目标了。我们想要利用微调做一些什么事情呢,然后为了实现这个目标,我们需要准备相关的硬件资源和数据。\n", - "\n", - "**启动微调**:在确定了自己的微调目标后,我们就可以在 XTuner 的配置库中找到合适的配置文件并进行对应的修改。修改完成后即可一键启动训练!训练好的模型也可以仅仅通过在终端输入一行命令来完成转换和部署工作!" - ] - }, - { - "cell_type": "markdown", - "id": "1de6991f", - "metadata": {}, - "source": [ - "### 2.1 创建虚拟环境\n", - "\n", - "在安装 XTuner 之前,我们需要先创建一个虚拟环境。创建一个名为 `xtuner0121` 的虚拟环境,可以直接执行命令。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ead18d70", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": [], - "source": [ - "!conda create -n xtuner0121 python=3.10 -y" - ] - }, - { - "cell_type": "markdown", - "id": "d7b70777", - "metadata": {}, - "source": [ - "如果是在开发机中,也可以直接执行以下命令进行创建:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "003d9799", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": [], - "source": [ - "!studio-conda -t xtuner0121 -o internlm-base" - ] - }, - { - "cell_type": "markdown", - "id": "03f48956", - "metadata": {}, - "source": [ - "虚拟环境创建完成后,需要激活虚拟环境。\n", - "\n", - "```bash\n", - "conda activate xtuner0121\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "0b38ba7d", - "metadata": {}, - "source": [ - "### 2.2 安装 XTuner\n", - "\n", - "虚拟环境创建完成后,就可以安装 XTuner 了。首先,从 Github 上下载源码。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4728440a", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": [], - "source": [ - "!git clone -b v0.1.21 https://github.com/InternLM/xtuner" - ] - }, - { - "cell_type": "markdown", - "id": "328c1ef1", - "metadata": {}, - "source": [ - "其次,进入源码目录,执行安装。\n", - "\n", - "> 如果速度太慢可以换成 `pip install -e '.[all]' -i https://mirrors.aliyun.com/pypi/simple/`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3b6dd99d", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": [], - "source": [ - "!cd xtuner && pip install -e '.[all]'" - ] - }, - { - "cell_type": "markdown", - "id": "f0757a85", - "metadata": {}, - "source": [ - "最后,我们可以验证一下安装结果。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5f0629e6", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": 
[], - "source": [ - "!xtuner version" - ] - }, - { - "cell_type": "markdown", - "id": "c24eabe5", - "metadata": {}, - "source": [ - "对于很多初学者而言,我们可能不太熟悉 XTuner 的用法,那么我们可以通过以下命令来查看相关的帮助。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "18bc7396", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": [], - "source": [ - "!xtuner help" - ] - }, - { - "cell_type": "markdown", - "id": "5b25987e", - "metadata": {}, - "source": [ - "对于很多的初学者而言,安装好环境意味着成功了一大半!因此我们接下来就可以进入我们的下一步,准备好我们需要的模型、数据集和配置文件,并进行微调训练!" - ] - }, - { - "cell_type": "markdown", - "id": "97f6bad6", - "metadata": {}, - "source": [ - "### 2.3 模型准备\n", - "\n", - "软件安装好后,我们就可以准备要微调的模型了。\n", - "\n", - "> 对于学习而言,我们可以使用 InternLM 推出的1.8B的小模型来完成此次微调演示。\n", - "\n", - "对于在 InternStudio 上运行的小伙伴们,可以不用通过 HuggingFace、OpenXLab 或者 Modelscope 进行模型的下载,在开发机中已经为我们提供了模型的本地文件,直接使用就可以了。\n", - "\n", - "> 我们可以通过以下代码一键通过符号链接的方式链接到模型文件,这样既节省了空间,也便于管理。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "78d9828b", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": [], - "source": [ - "!mkdir -p Shanghai_AI_Laboratory\n", - "\n", - "!ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b Shanghai_AI_Laboratory/internlm2-chat-1_8b" - ] - }, - { - "cell_type": "markdown", - "id": "a75fcb97", - "metadata": {}, - "source": [ - "执行上述操作后,`Shanghai_AI_Laboratory/internlm2-chat-1_8b` 将直接成为一个符号链接,这个链接指向 `/root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b` 的位置。\n", - "\n", - "这意味着,当我们访问 `Shanghai_AI_Laboratory/internlm2-chat-1_8b` 时,实际上就是在访问 `/root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b` 目录下的内容。通过这种方式,我们无需复制任何数据,就可以直接利用现有的模型文件进行后续的微调操作,从而节省存储空间并简化文件管理。\n", - "\n", - "如果自己想要微调的模型在开发机中没找到,也可以自己下载相关模型文件。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "78e3b789", - "metadata": {}, - "outputs": [], - "source": [ - "from modelscope import snapshot_download\n", - "model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm2-1_8b', cache_dir=\"./\")" - ] - }, - { - "cell_type": "markdown", - "id": "fec6e564", - "metadata": {}, - "source": [ - "模型文件准备好后,我们的目录结构应该是这个样子的。\n", - "\n", - "
\n", - "目录结构\n", - "\n", - "```\n", - "├── Shanghai_AI_Laboratory\n", - "│ ├── internlm2-1_8b\n", - "│ │ ├── README.md\n", - "│ │ ├── config.json\n", - "│ │ ├── configuration.json\n", - "│ │ ├── configuration_internlm2.py\n", - "│ │ ├── generation_config.json\n", - "│ │ ├── modeling_internlm2.py\n", - "│ │ ├── pytorch_model.bin\n", - "│ │ ├── special_tokens_map.json\n", - "│ │ ├── tokenization_internlm2.py\n", - "│ │ ├── tokenization_internlm2_fast.py\n", - "│ │ ├── tokenizer.json\n", - "│ │ ├── tokenizer.model\n", - "│ │ └── tokenizer_config.json\n", - "│ └── internlm2-chat-1_8b -> /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b\n", - "│ ├── README.md\n", - "│ ├── config.json\n", - "│ ├── configuration.json\n", - "│ ├── configuration_internlm2.py\n", - "│ ├── generation_config.json\n", - "│ ├── model-00001-of-00002.safetensors\n", - "│ ├── model-00002-of-00002.safetensors\n", - "│ ├── model.safetensors.index.json\n", - "│ ├── modeling_internlm2.py\n", - "│ ├── special_tokens_map.json\n", - "│ ├── tokenization_internlm2.py\n", - "│ ├── tokenization_internlm2_fast.py\n", - "│ ├── tokenizer.model\n", - "│ └── tokenizer_config.json\n", - "```\n", - "
\n", - "\n", - "\n", - "> 在目录结构中可以看出,`internlm2-chat-1_8b` 是一个符号链接。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0bb7feb4", - "metadata": {}, - "outputs": [], - "source": [ - "!tree -l" - ] - }, - { - "cell_type": "markdown", - "id": "d44b556d-4012-4860-a342-8c9545616bd0", - "metadata": {}, - "source": [ - "## 3 指令跟随微调(微调个人小助手认知)\n", - "\n", - "这里我们用 `internlm2-chat-1_8b` 模型,通过 `QLoRA` 的方式来微调一个自己的小助手认知作为案例来进行演示。" - ] - }, - { - "cell_type": "markdown", - "id": "0e247e8e", - "metadata": {}, - "source": [ - "首先,看看微调效果:\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
微调前微调后
输入请介绍一下你自己请介绍一下你自己
输出你好,我是书生·浦语。我致力于帮助用户解决各种语言相关的问题,包括但不限于语言学习、翻译、文本摘要等。我使用了Transformer模型和深度学习技术,并使用了语言模型作为预训练任务。如果你有任何问题,欢迎随时向我提问。我是伍鲜同志的小助手,内在是上海AI实验室书生·浦语的1.8B大模型哦
网页
" - ] - }, - { - "cell_type": "markdown", - "id": "9ad62951", - "metadata": {}, - "source": [ - "其次,我们需要定义一些基本方法。" - ] - }, - { - "cell_type": "markdown", - "id": "e02999ca", - "metadata": {}, - "source": [ - "- 导入必要的库" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5a032091", - "metadata": {}, - "outputs": [], - "source": [ - "import torch\n", - "from transformers import AutoTokenizer, AutoModelForCausalLM" - ] - }, - { - "cell_type": "markdown", - "id": "8d17fee6", - "metadata": {}, - "source": [ - "- 定义模型加载方法" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4145126e", - "metadata": {}, - "outputs": [], - "source": [ - "def load_model(model_path):\n", - " tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)\n", - " model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).cuda()\n", - " model = model.eval()\n", - " return tokenizer, model" - ] - }, - { - "cell_type": "markdown", - "id": "31595716", - "metadata": {}, - "source": [ - "- 定义对话方法" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5108f261", - "metadata": {}, - "outputs": [], - "source": [ - "messages = []\n", - "\n", - "def chat(input_text):\n", - " length = 0\n", - " for response, _ in model.stream_chat(tokenizer, input_text, messages):\n", - " if response is not None:\n", - " print(response[length:], flush=True, end=\"\")\n", - " length = len(response)" - ] - }, - { - "cell_type": "markdown", - "id": "507bd563", - "metadata": {}, - "source": [ - "### 3.1 微调前的模型对话\n", - "\n", - "首先来看看 `internlm2-chat-1_8b` 的对话演示。" - ] - }, - { - "cell_type": "markdown", - "id": "fd3ed483", - "metadata": {}, - "source": [ - "- 模型加载" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b1e2c28a", - "metadata": {}, - "outputs": [], - "source": [ - "tokenizer, model = load_model(\"Shanghai_AI_Laboratory/internlm2-chat-1_8b\")" - ] - }, - { - "cell_type": "markdown", - "id": "580eaaaa", - "metadata": {}, - "source": [ - "- 对话" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b14c0191", - "metadata": {}, - "outputs": [], - "source": [ - "chat(\"请介绍一下你自己\")" - ] - }, - { - "cell_type": "markdown", - "id": "c8b018de", - "metadata": {}, - "source": [ - "- 释放缓存" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9503eedf", - "metadata": {}, - "outputs": [], - "source": [ - "del tokenizer, model\n", - "\n", - "torch.cuda.empty_cache()" - ] - }, - { - "cell_type": "markdown", - "id": "fe8693b5", - "metadata": {}, - "source": [ - "### 3.2 指令跟随微调\n", - "\n", - "下面我们对模型进行微调,让模型认识到自己的弟位,了解它自己是你的一个助手。" - ] - }, - { - "cell_type": "markdown", - "id": "781b1495", - "metadata": {}, - "source": [ - "#### 3.2.1 准数据文件\n", - "\n", - "为了让模型能够认清自己的身份弟位,在询问自己是谁的时候按照我们预期的结果进行回复,我们就需要通过在微调数据集中大量加入这样的数据。我们准备一个数据集文件`datas/assistant.json`,文件内容为对话数据。为了增强微调效果,可以将对话数据复制多条。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1ff3df92", - "metadata": {}, - "outputs": [], - "source": [ - "[\n", - " {\"conversation\": [{\"input\": \"请介绍一下你自己\", \"output\": \"我是伍鲜同志的小助手,内在是上海AI实验室书生·浦语的1.8B大模型哦\"}]},\n", - " {\"conversation\": [{\"input\": \"你在实战营做什么\", \"output\": \"我在这里帮助伍鲜同志完成XTuner微调个人小助手的任务\"}]},\n", - "]" - ] - }, - { - "cell_type": "markdown", - "id": "83c08a82", - "metadata": {}, - "source": [ - "准备好数据文件后,我们的目录结构应该是这样子的。\n", - "\n", - "
\n", - "目录结构\n", - "\n", - "```\n", - "├── Shanghai_AI_Laboratory\n", - "│ ├── internlm2-1_8b\n", - "│ │ ├── README.md\n", - "│ │ ├── config.json\n", - "│ │ ├── configuration.json\n", - "│ │ ├── configuration_internlm2.py\n", - "│ │ ├── generation_config.json\n", - "│ │ ├── modeling_internlm2.py\n", - "│ │ ├── pytorch_model.bin\n", - "│ │ ├── special_tokens_map.json\n", - "│ │ ├── tokenization_internlm2.py\n", - "│ │ ├── tokenization_internlm2_fast.py\n", - "│ │ ├── tokenizer.json\n", - "│ │ ├── tokenizer.model\n", - "│ │ └── tokenizer_config.json\n", - "│ └── internlm2-chat-1_8b -> /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b\n", - "│ ├── README.md\n", - "│ ├── config.json\n", - "│ ├── configuration.json\n", - "│ ├── configuration_internlm2.py\n", - "│ ├── generation_config.json\n", - "│ ├── model-00001-of-00002.safetensors\n", - "│ ├── model-00002-of-00002.safetensors\n", - "│ ├── model.safetensors.index.json\n", - "│ ├── modeling_internlm2.py\n", - "│ ├── special_tokens_map.json\n", - "│ ├── tokenization_internlm2.py\n", - "│ ├── tokenization_internlm2_fast.py\n", - "│ ├── tokenizer.model\n", - "│ └── tokenizer_config.json\n", - "├── datas\n", - "│ └── assistant.json\n", - "```\n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "5f34f5d8", - "metadata": {}, - "source": [ - "#### 3.2.2 准备配置文件\n", - "\n", - "在准备好了模型和数据集后,我们就要根据我们选择的微调方法结合微调方案来找到与我们最匹配的配置文件了,从而减少我们对配置文件的修改量。\n", - "\n", - "> 配置文件其实是一种用于定义和控制模型训练和测试过程中各个方面的参数和设置的工具。" - ] - }, - { - "cell_type": "markdown", - "id": "70839704", - "metadata": {}, - "source": [ - "##### 3.2.2.1 列出支持的配置文件\n", - "\n", - "XTuner 提供多个开箱即用的配置文件,可以通过以下命令查看。\n", - "\n", - "> `xtuner list-cfg` 命令用于列出内置的所有配置文件。参数 `-p` 或 `--pattern` 表示模式匹配,后面跟着的内容将会在所有的配置文件里进行模糊匹配搜索,然后返回最有可能得内容。比如我们这里微调的是书生·浦语的模型,我们就可以匹配搜索 `internlm2`。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "54e9c4bf", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": [], - "source": [ - "!xtuner list-cfg -p internlm2" - ] - }, - { - "cell_type": "markdown", - "id": "ba0c7c15", - "metadata": {}, - "source": [ - "
\n", - "配置文件名的解释\n", - "\n", - "以 **internlm2_1_8b_full_custom_pretrain_e1** 和 **internlm2_chat_1_8b_qlora_alpaca_e3** 举例:\n", - "\n", - "| 配置文件 internlm2_1_8b_full_custom_pretrain_e1 | 配置文件 internlm2_chat_1_8b_qlora_alpaca_e3 | 说明 |\n", - "| ----------------------------------------------- | -------------------------------------------- | -------------- |\n", - "| internlm2_1_8b | internlm2_chat_1_8b | 模型名称 |\n", - "| full | qlora | 使用的算法 |\n", - "| custom_pretrain | alpaca | 数据集名称 |\n", - "| e1 | e3 | 把数据集跑几次 |\n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "275b2490", - "metadata": {}, - "source": [ - "##### 3.2.2.2 复制一个预设的配置文件\n", - "\n", - "由于我们是对`internlm2-chat-1_8b`模型进行指令微调,所以与我们的需求最匹配的配置文件是 `internlm2_chat_1_8b_qlora_alpaca_e3`,这里就复制该配置文件。\n", - "\n", - "> `xtuner copy-cfg` 命令用于复制一个内置的配置文件。该命令需要两个参数:`CONFIG` 代表需要复制的配置文件名称,`SAVE_PATH` 代表复制的目标路径。在我们的输入的这个命令中,我们的 `CONFIG` 对应的是上面搜索到的 `internlm2_chat_1_8b_qlora_alpaca_e3` ,而 `SAVE_PATH` 则是当前目录 `.`。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c19da8a8", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": [], - "source": [ - "!xtuner copy-cfg internlm2_chat_1_8b_qlora_alpaca_e3 ." - ] - }, - { - "cell_type": "markdown", - "id": "2903d70d", - "metadata": {}, - "source": [ - "复制好配置文件后,我们的目录结构应该是这样子的。\n", - "\n", - "
\n", - "目录结构\n", - "\n", - "```\n", - "├── Shanghai_AI_Laboratory\n", - "│ ├── internlm2-1_8b\n", - "│ │ ├── README.md\n", - "│ │ ├── config.json\n", - "│ │ ├── configuration.json\n", - "│ │ ├── configuration_internlm2.py\n", - "│ │ ├── generation_config.json\n", - "│ │ ├── modeling_internlm2.py\n", - "│ │ ├── pytorch_model.bin\n", - "│ │ ├── special_tokens_map.json\n", - "│ │ ├── tokenization_internlm2.py\n", - "│ │ ├── tokenization_internlm2_fast.py\n", - "│ │ ├── tokenizer.json\n", - "│ │ ├── tokenizer.model\n", - "│ │ └── tokenizer_config.json\n", - "│ └── internlm2-chat-1_8b -> /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b\n", - "│ ├── README.md\n", - "│ ├── config.json\n", - "│ ├── configuration.json\n", - "│ ├── configuration_internlm2.py\n", - "│ ├── generation_config.json\n", - "│ ├── model-00001-of-00002.safetensors\n", - "│ ├── model-00002-of-00002.safetensors\n", - "│ ├── model.safetensors.index.json\n", - "│ ├── modeling_internlm2.py\n", - "│ ├── special_tokens_map.json\n", - "│ ├── tokenization_internlm2.py\n", - "│ ├── tokenization_internlm2_fast.py\n", - "│ ├── tokenizer.model\n", - "│ └── tokenizer_config.json\n", - "├── datas\n", - "│ └── assistant.json\n", - "├── internlm2_chat_1_8b_qlora_alpaca_e3_copy.py\n", - "```\n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "2fcc24bf", - "metadata": {}, - "source": [ - "##### 3.2.2.3 对配置文件进行修改\n", - "\n", - "在选择了一个最匹配的配置文件并准备好其他内容后,下面我们要做的事情就是根据我们自己的内容对该配置文件进行调整,使其能够满足我们实际训练的要求。\n", - "\n", - "
\n", - "配置文件介绍\n", - "\n", - "打开配置文件后,我们可以看到整体的配置文件分为五部分:\n", - "\n", - "**PART 1 Settings**:涵盖了模型基本设置,如预训练模型的选择、数据集信息和训练过程中的一些基本参数(如批大小、学习率等)。\n", - "\n", - "**PART 2 Model & Tokenizer**:指定了用于训练的模型和分词器的具体类型及其配置,包括预训练模型的路径和是否启用特定功能(如可变长度注意力),这是模型训练的核心组成部分。\n", - "\n", - "**PART 3 Dataset & Dataloader**:描述了数据处理的细节,包括如何加载数据集、预处理步骤、批处理大小等,确保了模型能够接收到正确格式和质量的数据。\n", - "\n", - "**PART 4 Scheduler & Optimizer**:配置了优化过程中的关键参数,如学习率调度策略和优化器的选择,这些是影响模型训练效果和速度的重要因素。\n", - "\n", - "**PART 5 Runtime**:定义了训练过程中的额外设置,如日志记录、模型保存策略和自定义钩子等,以支持训练流程的监控、调试和结果的保存。\n", - "\n", - "一般来说我们需要更改的部分其实只包括前三部分,而且修改的主要原因是我们修改了配置文件中规定的模型、数据集。后两部分都是 XTuner 官方帮我们优化好的东西,一般而言只有在魔改的情况下才需要进行修改。\n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "5b85204a", - "metadata": {}, - "source": [ - "下面我们将根据项目的需求一步步的进行修改和调整吧!\n", - "\n", - "在 PART 1 的部分,由于我们不再需要在 HuggingFace 上自动下载模型,因此我们先要更换模型的路径以及数据集的路径为我们本地的路径。\n", - "\n", - "为了训练过程中能够实时观察到模型的变化情况,XTuner 贴心的推出了一个 `evaluation_inputs` 的参数来让我们能够设置多个问题来确保模型在训练过程中的变化是朝着我们想要的方向前进的。我们可以添加自己的输入。\n", - "\n", - "在 PART 3 的部分,由于我们准备的数据集是 JSON 格式的数据,并且对话内容已经是 `input` 和 `output` 的数据对,所以不需要进行格式转换。\n", - "\n", - "```diff\n", - "#######################################################################\n", - "# PART 1 Settings #\n", - "#######################################################################\n", - "- pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b'\n", - "+ pretrained_model_name_or_path = 'Shanghai_AI_Laboratory/internlm2-chat-1_8b'\n", - "\n", - "- alpaca_en_path = 'tatsu-lab/alpaca'\n", - "+ alpaca_en_path = 'datas/assistant.json'\n", - "\n", - "evaluation_inputs = [\n", - "- '请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'\n", - "+ '请介绍一下你自己', '请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'\n", - "]\n", - "\n", - "#######################################################################\n", - "# PART 3 Dataset & Dataloader #\n", - "#######################################################################\n", - "alpaca_en = dict(\n", - " type=process_hf_dataset,\n", - "- dataset=dict(type=load_dataset, path=alpaca_en_path),\n", - "+ dataset=dict(type=load_dataset, path='json', data_files=dict(train=alpaca_en_path)),\n", - " tokenizer=tokenizer,\n", - " max_length=max_length,\n", - "- dataset_map_fn=alpaca_map_fn,\n", - "+ dataset_map_fn=None,\n", - " template_map_fn=dict(\n", - " type=template_map_fn_factory, template=prompt_template),\n", - " remove_unused_columns=True,\n", - " shuffle_before_pack=True,\n", - " pack_to_max_length=pack_to_max_length,\n", - " use_varlen_attn=use_varlen_attn)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "989a0e6e", - "metadata": {}, - "source": [ - "修改完后的完整的配置文件是:[configs/internlm2_chat_1_8b_qlora_alpaca_e3_copy.py](../../../configs/internlm2_chat_1_8b_qlora_alpaca_e3_copy.py)。\n", - "\n", - "
\n", - "internlm2_chat_1_8b_qlora_alpaca_e3_copy.py\n", - "\n", - "```python\n", - "# Copyright (c) OpenMMLab. All rights reserved.\n", - "import torch\n", - "from datasets import load_dataset\n", - "from mmengine.dataset import DefaultSampler\n", - "from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,\n", - " LoggerHook, ParamSchedulerHook)\n", - "from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR\n", - "from peft import LoraConfig\n", - "from torch.optim import AdamW\n", - "from transformers import (AutoModelForCausalLM, AutoTokenizer,\n", - " BitsAndBytesConfig)\n", - "\n", - "from xtuner.dataset import process_hf_dataset\n", - "from xtuner.dataset.collate_fns import default_collate_fn\n", - "from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory\n", - "from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,\n", - " VarlenAttnArgsToMessageHubHook)\n", - "from xtuner.engine.runner import TrainLoop\n", - "from xtuner.model import SupervisedFinetune\n", - "from xtuner.parallel.sequence import SequenceParallelSampler\n", - "from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE\n", - "\n", - "#######################################################################\n", - "# PART 1 Settings #\n", - "#######################################################################\n", - "# Model\n", - "pretrained_model_name_or_path = 'Shanghai_AI_Laboratory/internlm2-chat-1_8b'\n", - "use_varlen_attn = False\n", - "\n", - "# Data\n", - "alpaca_en_path = 'datas/assistant.json'\n", - "prompt_template = PROMPT_TEMPLATE.internlm2_chat\n", - "max_length = 2048\n", - "pack_to_max_length = True\n", - "\n", - "# parallel\n", - "sequence_parallel_size = 1\n", - "\n", - "# Scheduler & Optimizer\n", - "batch_size = 1 # per_device\n", - "accumulative_counts = 16\n", - "accumulative_counts *= sequence_parallel_size\n", - "dataloader_num_workers = 0\n", - "max_epochs = 3\n", - "optim_type = AdamW\n", - "lr = 2e-4\n", - "betas = (0.9, 0.999)\n", - "weight_decay = 0\n", - "max_norm = 1 # grad clip\n", - "warmup_ratio = 0.03\n", - "\n", - "# Save\n", - "save_steps = 500\n", - "save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)\n", - "\n", - "# Evaluate the generation performance during the training\n", - "evaluation_freq = 500\n", - "SYSTEM = SYSTEM_TEMPLATE.alpaca\n", - "evaluation_inputs = [\n", - " '请介绍一下你自己', '请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'\n", - "]\n", - "\n", - "#######################################################################\n", - "# PART 2 Model & Tokenizer #\n", - "#######################################################################\n", - "tokenizer = dict(\n", - " type=AutoTokenizer.from_pretrained,\n", - " pretrained_model_name_or_path=pretrained_model_name_or_path,\n", - " trust_remote_code=True,\n", - " padding_side='right')\n", - "\n", - "model = dict(\n", - " type=SupervisedFinetune,\n", - " use_varlen_attn=use_varlen_attn,\n", - " llm=dict(\n", - " type=AutoModelForCausalLM.from_pretrained,\n", - " pretrained_model_name_or_path=pretrained_model_name_or_path,\n", - " trust_remote_code=True,\n", - " torch_dtype=torch.float16,\n", - " quantization_config=dict(\n", - " type=BitsAndBytesConfig,\n", - " load_in_4bit=True,\n", - " load_in_8bit=False,\n", - " llm_int8_threshold=6.0,\n", - " llm_int8_has_fp16_weight=False,\n", - " bnb_4bit_compute_dtype=torch.float16,\n", - " bnb_4bit_use_double_quant=True,\n", - " bnb_4bit_quant_type='nf4')),\n", - " lora=dict(\n", - " 
type=LoraConfig,\n", - " r=64,\n", - " lora_alpha=16,\n", - " lora_dropout=0.1,\n", - " bias='none',\n", - " task_type='CAUSAL_LM'))\n", - "\n", - "#######################################################################\n", - "# PART 3 Dataset & Dataloader #\n", - "#######################################################################\n", - "alpaca_en = dict(\n", - " type=process_hf_dataset,\n", - " dataset=dict(type=load_dataset, path='json', data_files=dict(train=alpaca_en_path)),\n", - " tokenizer=tokenizer,\n", - " max_length=max_length,\n", - " dataset_map_fn=None,\n", - " template_map_fn=dict(\n", - " type=template_map_fn_factory, template=prompt_template),\n", - " remove_unused_columns=True,\n", - " shuffle_before_pack=True,\n", - " pack_to_max_length=pack_to_max_length,\n", - " use_varlen_attn=use_varlen_attn)\n", - "\n", - "sampler = SequenceParallelSampler \\\n", - " if sequence_parallel_size > 1 else DefaultSampler\n", - "train_dataloader = dict(\n", - " batch_size=batch_size,\n", - " num_workers=dataloader_num_workers,\n", - " dataset=alpaca_en,\n", - " sampler=dict(type=sampler, shuffle=True),\n", - " collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))\n", - "\n", - "#######################################################################\n", - "# PART 4 Scheduler & Optimizer #\n", - "#######################################################################\n", - "# optimizer\n", - "optim_wrapper = dict(\n", - " type=AmpOptimWrapper,\n", - " optimizer=dict(\n", - " type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),\n", - " clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),\n", - " accumulative_counts=accumulative_counts,\n", - " loss_scale='dynamic',\n", - " dtype='float16')\n", - "\n", - "# learning policy\n", - "# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501\n", - "param_scheduler = [\n", - " dict(\n", - " type=LinearLR,\n", - " start_factor=1e-5,\n", - " by_epoch=True,\n", - " begin=0,\n", - " end=warmup_ratio * max_epochs,\n", - " convert_to_iter_based=True),\n", - " dict(\n", - " type=CosineAnnealingLR,\n", - " eta_min=0.0,\n", - " by_epoch=True,\n", - " begin=warmup_ratio * max_epochs,\n", - " end=max_epochs,\n", - " convert_to_iter_based=True)\n", - "]\n", - "\n", - "# train, val, test setting\n", - "train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)\n", - "\n", - "#######################################################################\n", - "# PART 5 Runtime #\n", - "#######################################################################\n", - "# Log the dialogue periodically during the training process, optional\n", - "custom_hooks = [\n", - " dict(type=DatasetInfoHook, tokenizer=tokenizer),\n", - " dict(\n", - " type=EvaluateChatHook,\n", - " tokenizer=tokenizer,\n", - " every_n_iters=evaluation_freq,\n", - " evaluation_inputs=evaluation_inputs,\n", - " system=SYSTEM,\n", - " prompt_template=prompt_template)\n", - "]\n", - "\n", - "if use_varlen_attn:\n", - " custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]\n", - "\n", - "# configure default hooks\n", - "default_hooks = dict(\n", - " # record the time of every iteration.\n", - " timer=dict(type=IterTimerHook),\n", - " # print log every 10 iterations.\n", - " logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),\n", - " # enable the parameter scheduler.\n", - " param_scheduler=dict(type=ParamSchedulerHook),\n", - " # save checkpoint per `save_steps`.\n", - " 
checkpoint=dict(\n", - " type=CheckpointHook,\n", - " by_epoch=False,\n", - " interval=save_steps,\n", - " max_keep_ckpts=save_total_limit),\n", - " # set sampler seed in distributed evrionment.\n", - " sampler_seed=dict(type=DistSamplerSeedHook),\n", - ")\n", - "\n", - "# configure environment\n", - "env_cfg = dict(\n", - " # whether to enable cudnn benchmark\n", - " cudnn_benchmark=False,\n", - " # set multi process parameters\n", - " mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),\n", - " # set distributed parameters\n", - " dist_cfg=dict(backend='nccl'),\n", - ")\n", - "\n", - "# set visualizer\n", - "visualizer = None\n", - "\n", - "# set log level\n", - "log_level = 'INFO'\n", - "\n", - "# load from which checkpoint\n", - "load_from = None\n", - "\n", - "# whether to resume training from the loaded checkpoint\n", - "resume = False\n", - "\n", - "# Defaults to use random seed and disable `deterministic`\n", - "randomness = dict(seed=None, deterministic=False)\n", - "\n", - "# set log processor\n", - "log_processor = dict(by_epoch=False)\n", - "```\n", - "\n", - "
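在正式启动训练之前,可以先用 mmengine 的 `Config` 把改好的配置读出来,核对关键字段是否已经指向本地的模型和数据(下面只是一个检查思路的示意,字段名以你实际的配置文件为准):

```python
from mmengine.config import Config

cfg = Config.fromfile("./internlm2_chat_1_8b_qlora_alpaca_e3_copy.py")

# 这几个字段应与前面修改后的内容一致
print(cfg.pretrained_model_name_or_path)  # 期望:Shanghai_AI_Laboratory/internlm2-chat-1_8b
print(cfg.alpaca_en_path)                 # 期望:datas/assistant.json
print(cfg.evaluation_inputs)              # 期望:包含“请介绍一下你自己”
```

如果打印出来的仍然是 HuggingFace 上的路径,说明改动没有保存,或者改错了文件。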
" - ] - }, - { - "cell_type": "markdown", - "id": "c02ca9cb", - "metadata": {}, - "source": [ - "#### 3.2.3 启动微调\n", - "\n", - "完成了所有的准备工作后,我们就可以正式的开始我们下一阶段的旅程:XTuner 启动~!\n", - "\n", - "当我们准备好了所有内容,我们只需要将使用 `xtuner train` 命令令即可开始训练。\n", - "\n", - "> `xtuner train` 命令用于启动模型微调进程。该命令需要一个参数:`CONFIG` 用于指定微调配置文件。这里我们使用修改好的配置文件 `internlm2_chat_1_8b_qlora_alpaca_e3_copy.py`。 \n", - "> 训练过程中产生的所有文件,包括日志、配置文件、检查点文件、微调后的模型等,默认保存在 `work_dirs` 目录下,我们也可以通过添加 `--work-dir` 指定特定的文件保存位置。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f89ba8dd", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": [], - "source": [ - "!xtuner train ./internlm2_chat_1_8b_qlora_alpaca_e3_copy.py" - ] - }, - { - "cell_type": "markdown", - "id": "9ad3adc1", - "metadata": {}, - "source": [ - "在训练完后,我们的目录结构应该是这样子的。\n", - "\n", - "
\n", - "目录结构\n", - "\n", - "```\n", - "├── work_dirs\n", - "│ └── internlm2_chat_1_8b_qlora_alpaca_e3_copy\n", - "│ ├── 20240626_222727\n", - "│ │ ├── 20240626_222727.log\n", - "│ │ └── vis_data\n", - "│ │ ├── 20240626_222727.json\n", - "│ │ ├── config.py\n", - "│ │ ├── eval_outputs_iter_95.txt\n", - "│ │ └── scalars.json\n", - "│ ├── internlm2_chat_1_8b_qlora_alpaca_e3_copy.py\n", - "│ ├── iter_96.pth\n", - "│ └── last_checkpoint\n", - "```\n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "dfed1498", - "metadata": {}, - "source": [ - "#### 3.2.4 模型格式转换\n", - "\n", - "模型转换的本质其实就是将原本使用 Pytorch 训练出来的模型权重文件转换为目前通用的 HuggingFace 格式文件,那么我们可以通过以下命令来实现一键转换。\n", - "\n", - "我们可以使用 `xtuner convert pth_to_hf` 命令来进行模型格式转换。\n", - "\n", - "> `xtuner convert pth_to_hf` 命令用于进行模型格式转换。该命令需要三个参数:`CONFIG` 表示微调的配置文件, `PATH_TO_PTH_MODEL` 表示微调的模型权重文件路径,即要转换的模型权重, `SAVE_PATH_TO_HF_MODEL` 表示转换后的 HuggingFace 格式文件的保存路径。\n", - "\n", - "除此之外,我们其实还可以在转换的命令中添加几个额外的参数,包括:\n", - "\n", - "| 参数名 | 解释 |\n", - "| --------------------- | -------------------------------------------- |\n", - "| --fp32 | 代表以fp32的精度开启,假如不输入则默认为fp16 |\n", - "| --max-shard-size {GB} | 代表每个权重文件最大的大小(默认为2GB) |" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d6422944", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": [], - "source": [ - "!pth_file=`ls -t ./work_dirs/internlm2_chat_1_8b_qlora_alpaca_e3_copy/*.pth | head -n 1` && MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU xtuner convert pth_to_hf ./internlm2_chat_1_8b_qlora_alpaca_e3_copy.py ${pth_file} ./hf" - ] - }, - { - "cell_type": "markdown", - "id": "dbfc5968", - "metadata": {}, - "source": [ - "模型格式转换完成后,我们的目录结构应该是这样子的。\n", - "\n", - "
\n", - "目录结构\n", - "\n", - "```\n", - "├── hf\n", - "│ ├── README.md\n", - "│ ├── adapter_config.json\n", - "│ ├── adapter_model.bin\n", - "│ └── xtuner_config.py\n", - "```\n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "b86536d9", - "metadata": {}, - "source": [ - "转换完成后,可以看到模型被转换为 HuggingFace 中常用的 .bin 格式文件,这就代表着文件成功被转化为 HuggingFace 格式了。\n", - "\n", - "此时,hf 文件夹即为我们平时所理解的所谓 “LoRA 模型文件”\n", - "\n", - "> 可以简单理解:LoRA 模型文件 = Adapter" - ] - }, - { - "cell_type": "markdown", - "id": "5b64bcda", - "metadata": {}, - "source": [ - "#### 3.2.5 模型合并\n", - "\n", - "对于 LoRA 或者 QLoRA 微调出来的模型其实并不是一个完整的模型,而是一个额外的层(Adapter),训练完的这个层最终还是要与原模型进行合并才能被正常的使用。\n", - "\n", - "> 对于全量微调的模型(full)其实是不需要进行整合这一步的,因为全量微调修改的是原模型的权重而非微调一个新的 Adapter ,因此是不需要进行模型整合的。" - ] - }, - { - "cell_type": "markdown", - "id": "bfce601f", - "metadata": {}, - "source": [ - "在 XTuner 中提供了一键合并的命令 `xtuner convert merge`,在使用前我们需要准备好三个路径,包括原模型的路径、训练好的 Adapter 层的(模型格式转换后的)路径以及最终保存的路径。\n", - "\n", - "> `xtuner convert merge`命令用于合并模型。该命令需要三个参数:`LLM` 表示原模型路径,`ADAPTER` 表示 Adapter 层的路径, `SAVE_PATH` 表示合并后的模型最终的保存路径。\n", - "\n", - "在模型合并这一步还有其他很多的可选参数,包括:\n", - "\n", - "| 参数名 | 解释 |\n", - "| ---------------------- | ------------------------------------------------------------ |\n", - "| --max-shard-size {GB} | 代表每个权重文件最大的大小(默认为2GB) |\n", - "| --device {device_name} | 这里指的就是device的名称,可选择的有cuda、cpu和auto,默认为cuda即使用gpu进行运算 |\n", - "| --is-clip | 这个参数主要用于确定模型是不是CLIP模型,假如是的话就要加上,不是就不需要添加 |\n", - "\n", - "> CLIP(Contrastive Language–Image Pre-training)模型是 OpenAI 开发的一种预训练模型,它能够理解图像和描述它们的文本之间的关系。CLIP 通过在大规模数据集上学习图像和对应文本之间的对应关系,从而实现了对图像内容的理解和分类,甚至能够根据文本提示生成图像。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4ad56444", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": [], - "source": [ - "!MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU xtuner convert merge Shanghai_AI_Laboratory/internlm2-chat-1_8b ./hf ./merged --max-shard-size 2GB" - ] - }, - { - "cell_type": "markdown", - "id": "8f0e9d87", - "metadata": {}, - "source": [ - "模型合并完成后,我们的目录结构应该是这样子的。\n", - "\n", - "
\n", - "目录结构\n", - "\n", - "```\n", - "├── merged\n", - "│ ├── config.json\n", - "│ ├── configuration_internlm2.py\n", - "│ ├── generation_config.json\n", - "│ ├── modeling_internlm2.py\n", - "│ ├── pytorch_model-00001-of-00002.bin\n", - "│ ├── pytorch_model-00002-of-00002.bin\n", - "│ ├── pytorch_model.bin.index.json\n", - "│ ├── special_tokens_map.json\n", - "│ ├── tokenization_internlm2.py\n", - "│ ├── tokenization_internlm2_fast.py\n", - "│ ├── tokenizer.json\n", - "│ ├── tokenizer.model\n", - "│ └── tokenizer_config.json\n", - "```\n", - "\n", - "
\n", - "\n", - "在模型合并完成后,我们就可以看到最终的模型和原模型文件夹非常相似,包括了分词器、权重文件、配置信息等等。" - ] - }, - { - "cell_type": "markdown", - "id": "004f8def", - "metadata": {}, - "source": [ - "### 3.3 微调后的模型对话" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4c0de6f9", - "metadata": {}, - "outputs": [], - "source": [ - "tokenizer, model = load_model(\"./merged\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a6ad94c6", - "metadata": {}, - "outputs": [], - "source": [ - "chat(\"请介绍一下你自己\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "11c6347a", - "metadata": {}, - "outputs": [], - "source": [ - "chat(\"你在实战营做什么\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0c740a3c", - "metadata": {}, - "outputs": [], - "source": [ - "chat(\"介绍一下成都\")" - ] - }, - { - "cell_type": "markdown", - "id": "553963b4", - "metadata": {}, - "source": [ - "可以看到,通过指令微调,我们成功得到了一个自己的小助手。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "40277c04", - "metadata": {}, - "outputs": [], - "source": [ - "del tokenizer, model\n", - "\n", - "torch.cuda.empty_cache()" - ] - }, - { - "cell_type": "markdown", - "id": "41f5d2ef", - "metadata": {}, - "source": [ - "## 4 Web Demo 部署\n", - "\n", - "除了在终端中对模型进行测试,我们其实还可以在网页端的 Demo 进行对话。" - ] - }, - { - "cell_type": "markdown", - "id": "444228e0", - "metadata": {}, - "source": [ - "首先,我们需要安装所需要的依赖。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7ee8f3a9", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": [], - "source": [ - "pip install streamlit" - ] - }, - { - "cell_type": "markdown", - "id": "477c0a5d", - "metadata": {}, - "source": [ - "其次,我们需要准备一个Streamlit程序的脚本。" - ] - }, - { - "cell_type": "markdown", - "id": "77ddf404", - "metadata": {}, - "source": [ - "Streamlit程序的完整代码是:[tools/xtuner_streamlit_demo.py](../../../tools/xtuner_streamlit_demo.py)。\n", - "\n", - "
\n", - "xtuner_streamlit_demo.py\n", - "\n", - "```python\n", - "import copy\n", - "import warnings\n", - "from dataclasses import asdict, dataclass\n", - "from typing import Callable, List, Optional\n", - "\n", - "import streamlit as st\n", - "import torch\n", - "from torch import nn\n", - "from transformers.generation.utils import (LogitsProcessorList,\n", - " StoppingCriteriaList)\n", - "from transformers.utils import logging\n", - "\n", - "from transformers import AutoTokenizer, AutoModelForCausalLM # isort: skip\n", - "\n", - "logger = logging.get_logger(__name__)\n", - "\n", - "\n", - "model_name_or_path = \"./merged\"\n", - "\n", - "@dataclass\n", - "class GenerationConfig:\n", - " # this config is used for chat to provide more diversity\n", - " max_length: int = 2048\n", - " top_p: float = 0.75\n", - " temperature: float = 0.1\n", - " do_sample: bool = True\n", - " repetition_penalty: float = 1.000\n", - "\n", - "\n", - "@torch.inference_mode()\n", - "def generate_interactive(\n", - " model,\n", - " tokenizer,\n", - " prompt,\n", - " generation_config: Optional[GenerationConfig] = None,\n", - " logits_processor: Optional[LogitsProcessorList] = None,\n", - " stopping_criteria: Optional[StoppingCriteriaList] = None,\n", - " prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor],\n", - " List[int]]] = None,\n", - " additional_eos_token_id: Optional[int] = None,\n", - " **kwargs,\n", - "):\n", - " inputs = tokenizer([prompt], padding=True, return_tensors='pt')\n", - " input_length = len(inputs['input_ids'][0])\n", - " for k, v in inputs.items():\n", - " inputs[k] = v.cuda()\n", - " input_ids = inputs['input_ids']\n", - " _, input_ids_seq_length = input_ids.shape[0], input_ids.shape[-1]\n", - " if generation_config is None:\n", - " generation_config = model.generation_config\n", - " generation_config = copy.deepcopy(generation_config)\n", - " model_kwargs = generation_config.update(**kwargs)\n", - " bos_token_id, eos_token_id = ( # noqa: F841 # pylint: disable=W0612\n", - " generation_config.bos_token_id,\n", - " generation_config.eos_token_id,\n", - " )\n", - " if isinstance(eos_token_id, int):\n", - " eos_token_id = [eos_token_id]\n", - " if additional_eos_token_id is not None:\n", - " eos_token_id.append(additional_eos_token_id)\n", - " has_default_max_length = kwargs.get(\n", - " 'max_length') is None and generation_config.max_length is not None\n", - " if has_default_max_length and generation_config.max_new_tokens is None:\n", - " warnings.warn(\n", - " f\"Using 'max_length''s default ({repr(generation_config.max_length)}) \\\n", - " to control the generation length. \"\n", - " 'This behaviour is deprecated and will be removed from the \\\n", - " config in v5 of Transformers -- we'\n", - " ' recommend using `max_new_tokens` to control the maximum \\\n", - " length of the generation.',\n", - " UserWarning,\n", - " )\n", - " elif generation_config.max_new_tokens is not None:\n", - " generation_config.max_length = generation_config.max_new_tokens + \\\n", - " input_ids_seq_length\n", - " if not has_default_max_length:\n", - " logger.warn( # pylint: disable=W4902\n", - " f\"Both 'max_new_tokens' (={generation_config.max_new_tokens}) \"\n", - " f\"and 'max_length'(={generation_config.max_length}) seem to \"\n", - " \"have been set. 'max_new_tokens' will take precedence. \"\n", - " 'Please refer to the documentation for more information. 
'\n", - " '(https://huggingface.co/docs/transformers/main/'\n", - " 'en/main_classes/text_generation)',\n", - " UserWarning,\n", - " )\n", - "\n", - " if input_ids_seq_length >= generation_config.max_length:\n", - " input_ids_string = 'input_ids'\n", - " logger.warning(\n", - " f\"Input length of {input_ids_string} is {input_ids_seq_length}, \"\n", - " f\"but 'max_length' is set to {generation_config.max_length}. \"\n", - " 'This can lead to unexpected behavior. You should consider'\n", - " \" increasing 'max_new_tokens'.\")\n", - "\n", - " # 2. Set generation parameters if not already defined\n", - " logits_processor = logits_processor if logits_processor is not None \\\n", - " else LogitsProcessorList()\n", - " stopping_criteria = stopping_criteria if stopping_criteria is not None \\\n", - " else StoppingCriteriaList()\n", - "\n", - " logits_processor = model._get_logits_processor(\n", - " generation_config=generation_config,\n", - " input_ids_seq_length=input_ids_seq_length,\n", - " encoder_input_ids=input_ids,\n", - " prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,\n", - " logits_processor=logits_processor,\n", - " )\n", - "\n", - " stopping_criteria = model._get_stopping_criteria(\n", - " generation_config=generation_config,\n", - " stopping_criteria=stopping_criteria)\n", - " logits_warper = model._get_logits_warper(generation_config)\n", - "\n", - " unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1)\n", - " scores = None\n", - " while True:\n", - " model_inputs = model.prepare_inputs_for_generation(\n", - " input_ids, **model_kwargs)\n", - " # forward pass to get next token\n", - " outputs = model(\n", - " **model_inputs,\n", - " return_dict=True,\n", - " output_attentions=False,\n", - " output_hidden_states=False,\n", - " )\n", - "\n", - " next_token_logits = outputs.logits[:, -1, :]\n", - "\n", - " # pre-process distribution\n", - " next_token_scores = logits_processor(input_ids, next_token_logits)\n", - " next_token_scores = logits_warper(input_ids, next_token_scores)\n", - "\n", - " # sample\n", - " probs = nn.functional.softmax(next_token_scores, dim=-1)\n", - " if generation_config.do_sample:\n", - " next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)\n", - " else:\n", - " next_tokens = torch.argmax(probs, dim=-1)\n", - "\n", - " # update generated ids, model inputs, and length for next step\n", - " input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)\n", - " model_kwargs = model._update_model_kwargs_for_generation(\n", - " outputs, model_kwargs, is_encoder_decoder=False)\n", - " unfinished_sequences = unfinished_sequences.mul(\n", - " (min(next_tokens != i for i in eos_token_id)).long())\n", - "\n", - " output_token_ids = input_ids[0].cpu().tolist()\n", - " output_token_ids = output_token_ids[input_length:]\n", - " for each_eos_token_id in eos_token_id:\n", - " if output_token_ids[-1] == each_eos_token_id:\n", - " output_token_ids = output_token_ids[:-1]\n", - " response = tokenizer.decode(output_token_ids)\n", - "\n", - " yield response\n", - " # stop when each sentence is finished\n", - " # or if we exceed the maximum length\n", - " if unfinished_sequences.max() == 0 or stopping_criteria(\n", - " input_ids, scores):\n", - " break\n", - "\n", - "\n", - "def on_btn_click():\n", - " del st.session_state.messages\n", - "\n", - "\n", - "@st.cache_resource\n", - "def load_model():\n", - " model = (AutoModelForCausalLM.from_pretrained(model_name_or_path,\n", - " trust_remote_code=True).to(\n", - " torch.bfloat16).cuda())\n", - " tokenizer 
= AutoTokenizer.from_pretrained(model_name_or_path,\n", - " trust_remote_code=True)\n", - " return model, tokenizer\n", - "\n", - "\n", - "def prepare_generation_config():\n", - " with st.sidebar:\n", - " max_length = st.slider('Max Length',\n", - " min_value=8,\n", - " max_value=32768,\n", - " value=2048)\n", - " top_p = st.slider('Top P', 0.0, 1.0, 0.75, step=0.01)\n", - " temperature = st.slider('Temperature', 0.0, 1.0, 0.1, step=0.01)\n", - " st.button('Clear Chat History', on_click=on_btn_click)\n", - "\n", - " generation_config = GenerationConfig(max_length=max_length,\n", - " top_p=top_p,\n", - " temperature=temperature)\n", - "\n", - " return generation_config\n", - "\n", - "\n", - "user_prompt = '<|im_start|>user\\n{user}<|im_end|>\\n'\n", - "robot_prompt = '<|im_start|>assistant\\n{robot}<|im_end|>\\n'\n", - "cur_query_prompt = '<|im_start|>user\\n{user}<|im_end|>\\n\\\n", - " <|im_start|>assistant\\n'\n", - "\n", - "\n", - "def combine_history(prompt):\n", - " messages = st.session_state.messages\n", - " meta_instruction = ('')\n", - " total_prompt = f\"<|im_start|>system\\n{meta_instruction}<|im_end|>\\n\"\n", - " for message in messages:\n", - " cur_content = message['content']\n", - " if message['role'] == 'user':\n", - " cur_prompt = user_prompt.format(user=cur_content)\n", - " elif message['role'] == 'robot':\n", - " cur_prompt = robot_prompt.format(robot=cur_content)\n", - " else:\n", - " raise RuntimeError\n", - " total_prompt += cur_prompt\n", - " total_prompt = total_prompt + cur_query_prompt.format(user=prompt)\n", - " return total_prompt\n", - "\n", - "\n", - "def main():\n", - " # torch.cuda.empty_cache()\n", - " print('load model begin.')\n", - " model, tokenizer = load_model()\n", - " print('load model end.')\n", - "\n", - "\n", - " st.title('InternLM2-Chat-1.8B')\n", - "\n", - " generation_config = prepare_generation_config()\n", - "\n", - " # Initialize chat history\n", - " if 'messages' not in st.session_state:\n", - " st.session_state.messages = []\n", - "\n", - " # Display chat messages from history on app rerun\n", - " for message in st.session_state.messages:\n", - " with st.chat_message(message['role'], avatar=message.get('avatar')):\n", - " st.markdown(message['content'])\n", - "\n", - " # Accept user input\n", - " if prompt := st.chat_input('What is up?'):\n", - " # Display user message in chat message container\n", - " with st.chat_message('user'):\n", - " st.markdown(prompt)\n", - " real_prompt = combine_history(prompt)\n", - " # Add user message to chat history\n", - " st.session_state.messages.append({\n", - " 'role': 'user',\n", - " 'content': prompt,\n", - " })\n", - "\n", - " with st.chat_message('robot'):\n", - " message_placeholder = st.empty()\n", - " for cur_response in generate_interactive(\n", - " model=model,\n", - " tokenizer=tokenizer,\n", - " prompt=real_prompt,\n", - " additional_eos_token_id=92542,\n", - " **asdict(generation_config),\n", - " ):\n", - " # Display robot response in chat message container\n", - " message_placeholder.markdown(cur_response + '▌')\n", - " message_placeholder.markdown(cur_response)\n", - " # Add robot response to chat history\n", - " st.session_state.messages.append({\n", - " 'role': 'robot',\n", - " 'content': cur_response, # pylint: disable=undefined-loop-variable\n", - " })\n", - " torch.cuda.empty_cache()\n", - "\n", - "\n", - "if __name__ == '__main__':\n", - " main()\n", - "\n", - "```\n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "c7f48f42", - "metadata": {}, - "source": [ - "然后,我们可以直接启动应用。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f896f75a", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": [], - "source": [ - "!streamlit run xtuner_streamlit_demo.py" - ] - }, - { - "cell_type": "markdown", - "id": "e8efe0da", - "metadata": {}, - "source": [ - "运行后,在访问前,我们还需要做的就是将端口映射到本地。\n", - "\n", - "通过如图所示的地方,获取开发机的端口和密码。\n", - "\n", - "![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-09.png)" - ] - }, - { - "cell_type": "markdown", - "id": "b4024b0d", - "metadata": {}, - "source": [ - "然后在本地使用 PowerShell 或者命令行终端,执行以下命令:\n", - "\n", - "> 其中,`8501`是Streamlit程序的服务端口,`43551`需要替换为自己的开发机的端口。\n", - "\n", - "```bash\n", - "ssh -CNg -L 8501:127.0.0.1:8501 root@ssh.intern-ai.org.cn -p 43551\n", - "```\n", - "\n", - "然后再输入开发机的root密码。\n", - "\n", - "![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-10.png)" - ] - }, - { - "cell_type": "markdown", - "id": "c13db446", - "metadata": {}, - "source": [ - "最后,我们就可以在本地通过浏览器访问:http://127.0.0.1:8501 来进行对话了。\n", - "\n", - "![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-12.png)" - ] - }, - { - "cell_type": "markdown", - "id": "e4a7edaf", - "metadata": {}, - "source": [ - "## 5 小结\n", - "\n", - "经过本节的学习,带领着大家跑通了 XTuner 的完整流程,我们学会了指令跟随微调,并且训练出了一个自己小助手,是不是很有意思!\n", - "\n", - "当我们在测试完模型认为其满足我们的需求后,就可以对模型进行量化部署等操作了,这部分的内容在之后关于 LMDeploy 的课程中将会详细的进行讲解,敬请期待后续的课程吧!\n", - "\n", - "关于XTuner的更多高级进阶知识,让我们继续往下探索吧!" - ] - }, - { - "cell_type": "markdown", - "id": "1da820a2-86e8-4f8a-ada4-3750b6dbe445", - "metadata": {}, - "source": [ - "## 6 增量预训练微调\n", - "\n", - "本节我们先来了解一下增量预训练,这里我们以一个文本续写案例来看看效果。\n", - "\n", - "| | 微调前 | 微调后 |\n", - "| --- | --- | --- |\n", - "| 输入 | 书生·浦语大模型实战营第三期是 | 书生·浦语大模型实战营第三期是 |\n", - "| 输出| 书生·浦语大模型实战营第三期是上周五,上周五我们学习了一个新的知识,那就是关于机器学习的概率统计。…… | 书生·浦语大模型实战营第三期是上海人工智能实验室推出的书生·浦语大模型实战营系列活动的第三批次,将于2024年7月正式进行。…… |" - ] - }, - { - "cell_type": "markdown", - "id": "4b8a61f2", - "metadata": {}, - "source": [ - "我们需要定义一些基本方法。" - ] - }, - { - "cell_type": "markdown", - "id": "d3f1238e", - "metadata": {}, - "source": [ - "- 导入必要的库" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "73995f91", - "metadata": {}, - "outputs": [], - "source": [ - "import torch\n", - "from transformers import AutoTokenizer, AutoModelForCausalLM" - ] - }, - { - "cell_type": "markdown", - "id": "b67d97ce", - "metadata": {}, - "source": [ - "- 定义模型加载方法" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c275406e", - "metadata": {}, - "outputs": [], - "source": [ - "def load_model(model_path):\n", - " tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)\n", - " model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).cuda()\n", - " model = model.eval()\n", - " return tokenizer, model" - ] - }, - { - "cell_type": "markdown", - "id": "5a0a7754", - "metadata": {}, - "source": [ - "- 定义文本续写方法" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b8fd8995", - "metadata": {}, - "outputs": [], - "source": [ - "def generate(user_input):\n", - " gen_kwargs = {\"max_length\": 128, \"top_p\": 0.8, \"temperature\": 0.8, \"do_sample\": True, \"repetition_penalty\": 1.0}\n", - "\n", - " inputs = tokenizer([user_input], return_tensors=\"pt\")\n", - " for k,v in inputs.items():\n", - " inputs[k] = 
v.cuda()\n", - " output = model.generate(**inputs, **gen_kwargs)\n", - " output = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)\n", - " return output" - ] - }, - { - "cell_type": "markdown", - "id": "aac7a4e4", - "metadata": {}, - "source": [ - "### 6.1 基座模型推理\n", - "\n", - "我们先来看看基座模型的推理结果。" - ] - }, - { - "cell_type": "markdown", - "id": "c3d48c8a", - "metadata": {}, - "source": [ - "- 加载模型" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "22cb798d", - "metadata": {}, - "outputs": [], - "source": [ - "tokenizer, model = load_model(\"Shanghai_AI_Laboratory/internlm2-1_8b\")" - ] - }, - { - "cell_type": "markdown", - "id": "f536a771", - "metadata": {}, - "source": [ - "- 文本续写" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "44ccf83e", - "metadata": {}, - "outputs": [], - "source": [ - "generate(\"书生·浦语大模型实战营第三期是\")" - ] - }, - { - "cell_type": "markdown", - "id": "5fa279db", - "metadata": {}, - "source": [ - "- 释放缓存" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8716d3c9", - "metadata": {}, - "outputs": [], - "source": [ - "del tokenizer, model\n", - "\n", - "torch.cuda.empty_cache()" - ] - }, - { - "cell_type": "markdown", - "id": "0694693a", - "metadata": {}, - "source": [ - "### 6.2 增量预训练\n", - "\n", - "然后我们对基座模型进行增量预训练,让模型增加新的知识。" - ] - }, - { - "cell_type": "markdown", - "id": "fea34851", - "metadata": {}, - "source": [ - "#### 6.2.1 准备数据文件\n", - "\n", - "为了让模型学习到新的知识,我们需要将新的知识数据整理成指定格式文件,形成数据集,然后让模型来学习这些新数据。这里我们准备一个简单的数据集 `datas/pretrain.json`,仅包含一条知识,然后让数据重复多次。\n", - "\n", - "> 网上有大量的开源数据集可以供我们进行使用,有些时候我们可以在开源数据集的基础上添加一些我们自己独有的数据集,也可能会有很好的效果。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b1502071", - "metadata": {}, - "outputs": [], - "source": [ - "[\n", - " {\n", - " \"text\": \"书生·浦语大模型实战营第三期是上海人工智能实验室推出的书生·浦语大模型实战营系列活动的第三批次,将于2024年7月正式进行。\"\n", - " }\n", - "]" - ] - }, - { - "cell_type": "markdown", - "id": "665e305c", - "metadata": {}, - "source": [ - "准备好数据文件后,我们的目录结构应该是这样子的。\n", - "\n", - "
\n", - "目录结构\n", - "\n", - "```\n", - "├── Shanghai_AI_Laboratory\n", - "│ ├── internlm2-1_8b\n", - "│ │ ├── README.md\n", - "│ │ ├── config.json\n", - "│ │ ├── configuration.json\n", - "│ │ ├── configuration_internlm2.py\n", - "│ │ ├── generation_config.json\n", - "│ │ ├── modeling_internlm2.py\n", - "│ │ ├── pytorch_model.bin\n", - "│ │ ├── special_tokens_map.json\n", - "│ │ ├── tokenization_internlm2.py\n", - "│ │ ├── tokenization_internlm2_fast.py\n", - "│ │ ├── tokenizer.json\n", - "│ │ ├── tokenizer.model\n", - "│ │ └── tokenizer_config.json\n", - "│ └── internlm2-chat-1_8b -> /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b\n", - "│ ├── README.md\n", - "│ ├── config.json\n", - "│ ├── configuration.json\n", - "│ ├── configuration_internlm2.py\n", - "│ ├── generation_config.json\n", - "│ ├── model-00001-of-00002.safetensors\n", - "│ ├── model-00002-of-00002.safetensors\n", - "│ ├── model.safetensors.index.json\n", - "│ ├── modeling_internlm2.py\n", - "│ ├── special_tokens_map.json\n", - "│ ├── tokenization_internlm2.py\n", - "│ ├── tokenization_internlm2_fast.py\n", - "│ ├── tokenizer.model\n", - "│ └── tokenizer_config.json\n", - "├── datas\n", - "│ ├── assistant.json\n", - "│ └── pretrain.json\n", - "├── internlm2_chat_1_8b_qlora_alpaca_e3_copy.py\n", - "```\n", - "\n", - "
\n" - ] - }, - { - "cell_type": "markdown", - "id": "ae9221ff", - "metadata": {}, - "source": [ - "#### 6.2.2 准备配置文件\n", - "\n", - "在准备好了模型和数据集后,我们就要根据我们选择的微调方法结合微调方案来找到与我们最匹配的配置文件了,从而减少我们对配置文件的修改量。\n", - "\n", - "这里我们选择使用 `internlm2_1_8b_full_custom_pretrain_e1` 配置文件。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8cfbdd74", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": [], - "source": [ - "!xtuner copy-cfg internlm2_1_8b_full_custom_pretrain_e1 ." - ] - }, - { - "cell_type": "markdown", - "id": "b72d52ee", - "metadata": {}, - "source": [ - "复制好配置文件后,我们的目录结构应该是这样子的。\n", - "\n", - "
\n", - "目录结构\n", - "\n", - "```\n", - "├── Shanghai_AI_Laboratory\n", - "│ ├── internlm2-1_8b\n", - "│ │ ├── README.md\n", - "│ │ ├── config.json\n", - "│ │ ├── configuration.json\n", - "│ │ ├── configuration_internlm2.py\n", - "│ │ ├── generation_config.json\n", - "│ │ ├── modeling_internlm2.py\n", - "│ │ ├── pytorch_model.bin\n", - "│ │ ├── special_tokens_map.json\n", - "│ │ ├── tokenization_internlm2.py\n", - "│ │ ├── tokenization_internlm2_fast.py\n", - "│ │ ├── tokenizer.json\n", - "│ │ ├── tokenizer.model\n", - "│ │ └── tokenizer_config.json\n", - "│ └── internlm2-chat-1_8b -> /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b\n", - "│ ├── README.md\n", - "│ ├── config.json\n", - "│ ├── configuration.json\n", - "│ ├── configuration_internlm2.py\n", - "│ ├── generation_config.json\n", - "│ ├── model-00001-of-00002.safetensors\n", - "│ ├── model-00002-of-00002.safetensors\n", - "│ ├── model.safetensors.index.json\n", - "│ ├── modeling_internlm2.py\n", - "│ ├── special_tokens_map.json\n", - "│ ├── tokenization_internlm2.py\n", - "│ ├── tokenization_internlm2_fast.py\n", - "│ ├── tokenizer.model\n", - "│ └── tokenizer_config.json\n", - "├── datas\n", - "│ ├── assistant.json\n", - "│ └── pretrain.json\n", - "├── internlm2_1_8b_full_custom_pretrain_e1_copy.py\n", - "├── internlm2_chat_1_8b_qlora_alpaca_e3_copy.py\n", - "```\n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "c82985b1", - "metadata": {}, - "source": [ - "下面我们将根据项目的需求一步步的进行修改和调整吧!\n", - "\n", - "在 PART 1 的部分,由于我们不再需要在 HuggingFace 上自动下载模型,因此我们先要更换模型的路径以及数据集的路径为我们本地的路径。\n", - "\n", - "为了训练过程中能够实时观察到模型的变化情况,XTuner 贴心的推出了一个 `evaluation_inputs` 的参数来让我们能够设置多个问题来确保模型在训练过程中的变化是朝着我们想要的方向前进的。我们可以添加自己的输入。\n", - "\n", - "在 PART 2 的部分,由于我们复制的配置文件是全参数微调的配置,而我们希望使用 `QLoRA` 算法进行微调,所以可以添加 `QLoRA` 算法的配置。\n", - "\n", - "```diff\n", - "+ from peft import LoraConfig\n", - "\n", - "+ import torch\n", - "\n", - "- from transformers import AutoModelForCausalLM, AutoTokenizer\n", - "+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n", - "\n", - "#######################################################################\n", - "# PART 1 Settings #\n", - "#######################################################################\n", - "- pretrained_model_name_or_path = 'internlm/internlm2-1_8b'\n", - "+ pretrained_model_name_or_path = 'Shanghai_AI_Laboratory/internlm2-1_8b'\n", - "\n", - "- data_files = ['/path/to/json/file.json']\n", - "+ data_files = ['datas/pretrain.json']\n", - "\n", - "- evaluation_inputs = ['上海是', 'Shanghai is']\n", - "+ evaluation_inputs = ['书生·浦语大模型实战营第三期是', '上海是', 'Shanghai is']\n", - "\n", - "#######################################################################\n", - "# PART 2 Model & Tokenizer #\n", - "#######################################################################\n", - "model = dict(\n", - " type=SupervisedFinetune,\n", - " use_varlen_attn=use_varlen_attn,\n", - " llm=dict(\n", - " type=AutoModelForCausalLM.from_pretrained,\n", - " pretrained_model_name_or_path=pretrained_model_name_or_path,\n", - " trust_remote_code=True,\n", - "+ quantization_config=dict(\n", - "+ type=BitsAndBytesConfig,\n", - "+ load_in_4bit=True,\n", - "+ load_in_8bit=False,\n", - "+ llm_int8_threshold=6.0,\n", - "+ llm_int8_has_fp16_weight=False,\n", - "+ bnb_4bit_compute_dtype=torch.float16,\n", - "+ bnb_4bit_use_double_quant=True,\n", - "+ bnb_4bit_quant_type='nf4')\n", - " ),\n", - "+ lora=dict(\n", - "+ type=LoraConfig,\n", - "+ r=64,\n", - "+ lora_alpha=16,\n", - "+ lora_dropout=0.1,\n", - "+ bias='none',\n", - "+ task_type='CAUSAL_LM')\n", - ")\n", - "```\n", - "\n", - "" - ] - }, - { - "cell_type": "markdown", - "id": "8473fc9f", - "metadata": {}, - "source": [ - "修改完后的完整的配置文件是:[configs/internlm2_1_8b_full_custom_pretrain_e1_copy.py](../../../configs/internlm2_1_8b_full_custom_pretrain_e1_copy.py)。\n", - "\n", - "
\n", - "internlm2_1_8b_full_custom_pretrain_e1_copy.py\n", - "\n", - "```python\n", - "# Copyright (c) OpenMMLab. All rights reserved.\n", - "\"\"\"Data format:\n", - "\n", - "[\n", - " {\n", - " \"text\": \"xxx\"\n", - " },\n", - " {\n", - " \"text\": \"xxx\"\n", - " },\n", - " ...\n", - "]\n", - "\"\"\" # noqa: E501\n", - "\n", - "from datasets import load_dataset\n", - "from mmengine.dataset import DefaultSampler\n", - "from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,\n", - " LoggerHook, ParamSchedulerHook)\n", - "from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR\n", - "from peft import LoraConfig\n", - "import torch\n", - "from torch.optim import AdamW\n", - "from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n", - "\n", - "from xtuner.dataset import process_hf_dataset\n", - "from xtuner.dataset.collate_fns import default_collate_fn\n", - "from xtuner.dataset.map_fns import pretrain_map_fn\n", - "from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,\n", - " VarlenAttnArgsToMessageHubHook)\n", - "from xtuner.engine.runner import TrainLoop\n", - "from xtuner.model import SupervisedFinetune\n", - "\n", - "#######################################################################\n", - "# PART 1 Settings #\n", - "#######################################################################\n", - "# Model\n", - "pretrained_model_name_or_path = 'Shanghai_AI_Laboratory/internlm2-1_8b'\n", - "use_varlen_attn = False\n", - "\n", - "# Data\n", - "data_files = ['datas/pretrain.json']\n", - "max_length = 2048\n", - "pack_to_max_length = True\n", - "\n", - "# Scheduler & Optimizer\n", - "batch_size = 1 # per_device\n", - "accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc\n", - "dataloader_num_workers = 0\n", - "max_epochs = 1\n", - "optim_type = AdamW\n", - "lr = 2e-5\n", - "betas = (0.9, 0.999)\n", - "weight_decay = 0\n", - "max_norm = 1 # grad clip\n", - "warmup_ratio = 0.03\n", - "\n", - "# Save\n", - "save_steps = 500\n", - "save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)\n", - "\n", - "# Evaluate the generation performance during the training\n", - "evaluation_freq = 500\n", - "SYSTEM = ''\n", - "evaluation_inputs = ['书生·浦语大模型实战营第三期是', '上海是', 'Shanghai is']\n", - "\n", - "#######################################################################\n", - "# PART 2 Model & Tokenizer #\n", - "#######################################################################\n", - "tokenizer = dict(\n", - " type=AutoTokenizer.from_pretrained,\n", - " pretrained_model_name_or_path=pretrained_model_name_or_path,\n", - " trust_remote_code=True,\n", - " padding_side='right')\n", - "\n", - "model = dict(\n", - " type=SupervisedFinetune,\n", - " use_varlen_attn=use_varlen_attn,\n", - " llm=dict(\n", - " type=AutoModelForCausalLM.from_pretrained,\n", - " pretrained_model_name_or_path=pretrained_model_name_or_path,\n", - " trust_remote_code=True,\n", - " quantization_config=dict(\n", - " type=BitsAndBytesConfig,\n", - " load_in_4bit=True,\n", - " load_in_8bit=False,\n", - " llm_int8_threshold=6.0,\n", - " llm_int8_has_fp16_weight=False,\n", - " bnb_4bit_compute_dtype=torch.float16,\n", - " bnb_4bit_use_double_quant=True,\n", - " bnb_4bit_quant_type='nf4')\n", - " ),\n", - " lora=dict(\n", - " type=LoraConfig,\n", - " r=64,\n", - " lora_alpha=16,\n", - " lora_dropout=0.1,\n", - " bias='none',\n", - " task_type='CAUSAL_LM')\n", - ")\n", - "\n", - 
"#######################################################################\n", - "# PART 3 Dataset & Dataloader #\n", - "#######################################################################\n", - "train_dataset = dict(\n", - " type=process_hf_dataset,\n", - " dataset=dict(type=load_dataset, path='json', data_files=data_files),\n", - " tokenizer=tokenizer,\n", - " max_length=max_length,\n", - " dataset_map_fn=pretrain_map_fn,\n", - " template_map_fn=None,\n", - " remove_unused_columns=True,\n", - " shuffle_before_pack=False,\n", - " pack_to_max_length=pack_to_max_length,\n", - " use_varlen_attn=use_varlen_attn)\n", - "\n", - "train_dataloader = dict(\n", - " batch_size=batch_size,\n", - " num_workers=dataloader_num_workers,\n", - " dataset=train_dataset,\n", - " sampler=dict(type=DefaultSampler, shuffle=True),\n", - " collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))\n", - "\n", - "#######################################################################\n", - "# PART 4 Scheduler & Optimizer #\n", - "#######################################################################\n", - "# optimizer\n", - "optim_wrapper = dict(\n", - " type=AmpOptimWrapper,\n", - " optimizer=dict(\n", - " type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),\n", - " clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),\n", - " accumulative_counts=accumulative_counts,\n", - " loss_scale='dynamic',\n", - " dtype='float16')\n", - "\n", - "# learning policy\n", - "# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501\n", - "param_scheduler = [\n", - " dict(\n", - " type=LinearLR,\n", - " start_factor=1e-5,\n", - " by_epoch=True,\n", - " begin=0,\n", - " end=warmup_ratio * max_epochs,\n", - " convert_to_iter_based=True),\n", - " dict(\n", - " type=CosineAnnealingLR,\n", - " eta_min=0.0,\n", - " by_epoch=True,\n", - " begin=warmup_ratio * max_epochs,\n", - " end=max_epochs,\n", - " convert_to_iter_based=True)\n", - "]\n", - "\n", - "# train, val, test setting\n", - "train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)\n", - "\n", - "#######################################################################\n", - "# PART 5 Runtime #\n", - "#######################################################################\n", - "# Log the dialogue periodically during the training process, optional\n", - "custom_hooks = [\n", - " dict(type=DatasetInfoHook, tokenizer=tokenizer),\n", - " dict(\n", - " type=EvaluateChatHook,\n", - " tokenizer=tokenizer,\n", - " every_n_iters=evaluation_freq,\n", - " evaluation_inputs=evaluation_inputs,\n", - " system=SYSTEM)\n", - "]\n", - "\n", - "if use_varlen_attn:\n", - " custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]\n", - "\n", - "# configure default hooks\n", - "default_hooks = dict(\n", - " # record the time of every iteration.\n", - " timer=dict(type=IterTimerHook),\n", - " # print log every 10 iterations.\n", - " logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),\n", - " # enable the parameter scheduler.\n", - " param_scheduler=dict(type=ParamSchedulerHook),\n", - " # save checkpoint per `save_steps`.\n", - " checkpoint=dict(\n", - " type=CheckpointHook,\n", - " by_epoch=False,\n", - " interval=save_steps,\n", - " max_keep_ckpts=save_total_limit),\n", - " # set sampler seed in distributed evrionment.\n", - " sampler_seed=dict(type=DistSamplerSeedHook),\n", - ")\n", - "\n", - "# configure environment\n", - "env_cfg = dict(\n", - " # whether to enable cudnn 
benchmark\n", - " cudnn_benchmark=False,\n", - " # set multi process parameters\n", - " mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),\n", - " # set distributed parameters\n", - " dist_cfg=dict(backend='nccl'),\n", - ")\n", - "\n", - "# set visualizer\n", - "visualizer = None\n", - "\n", - "# set log level\n", - "log_level = 'INFO'\n", - "\n", - "# load from which checkpoint\n", - "load_from = None\n", - "\n", - "# whether to resume training from the loaded checkpoint\n", - "resume = False\n", - "\n", - "# Defaults to use random seed and disable `deterministic`\n", - "randomness = dict(seed=None, deterministic=False)\n", - "\n", - "# set log processor\n", - "log_processor = dict(by_epoch=False)\n", - "```\n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "be808d46", - "metadata": {}, - "source": [ - "#### 6.2.3 启动微调\n", - "\n", - "完成了所有的准备工作后,我们就可以正式的开始我们下一阶段的旅程:XTuner 启动~!\n", - "\n", - "当我们准备好了所有内容,我们只需要将使用 `xtuner train` 命令令即可开始训练。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6ac77d2d", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": [], - "source": [ - "!xtuner train ./internlm2_1_8b_full_custom_pretrain_e1_copy.py" - ] - }, - { - "cell_type": "markdown", - "id": "cf74cdeb", - "metadata": {}, - "source": [ - "在训练完后,我们的目录结构应该是这样子的。\n", - "\n", - "
\n", - "目录结构\n", - "\n", - "```\n", - "├── work_dirs\n", - "│ └── internlm2_1_8b_full_custom_pretrain_e1_copy\n", - "│ ├── 20240627_214522\n", - "│ │ ├── 20240627_214522.log\n", - "│ │ └── vis_data\n", - "│ │ ├── 20240627_214522.json\n", - "│ │ ├── config.py\n", - "│ │ ├── eval_outputs_iter_1499.txt\n", - "│ │ ├── eval_outputs_iter_1999.txt\n", - "│ │ ├── eval_outputs_iter_2499.txt\n", - "│ │ ├── eval_outputs_iter_2623.txt\n", - "│ │ ├── eval_outputs_iter_499.txt\n", - "│ │ ├── eval_outputs_iter_999.txt\n", - "│ │ └── scalars.json\n", - "│ ├── internlm2_1_8b_full_custom_pretrain_e1_copy.py\n", - "│ ├── iter_2500.pth\n", - "│ ├── iter_2624.pth\n", - "│ └── last_checkpoint\n", - "```\n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "2672054f", - "metadata": {}, - "source": [ - "#### 6.2.4 模型格式转换\n", - "\n", - "模型转换的本质其实就是将原本使用 Pytorch 训练出来的模型权重文件转换为目前通用的 HuggingFace 格式文件,那么我们可以通过以下命令来实现一键转换。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "82570d4e", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": [], - "source": [ - "!pth_file=`ls -t ./work_dirs/internlm2_1_8b_full_custom_pretrain_e1_copy/*.pth | head -n 1` && MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU xtuner convert pth_to_hf ./internlm2_1_8b_full_custom_pretrain_e1_copy.py ${pth_file} ./hf" - ] - }, - { - "cell_type": "markdown", - "id": "a055caa3", - "metadata": {}, - "source": [ - "模型格式转换完成后,我们的目录结构应该是这样子的。\n", - "\n", - "
\n", - "目录结构\n", - "\n", - "```\n", - "├── hf\n", - "│ ├── README.md\n", - "│ ├── adapter_config.json\n", - "│ ├── adapter_model.bin\n", - "│ └── xtuner_config.py\n", - "```\n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "7b1e8a04", - "metadata": {}, - "source": [ - "#### 6.2.5 模型合并\n", - "\n", - "对于 LoRA 或者 QLoRA 微调出来的模型其实并不是一个完整的模型,而是一个额外的层(Adapter),训练完的这个层最终还是要与原模型进行合并才能被正常的使用。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ad447926", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": [], - "source": [ - "!MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU xtuner convert merge Shanghai_AI_Laboratory/internlm2-1_8b ./hf ./merged --max-shard-size 2GB" - ] - }, - { - "cell_type": "markdown", - "id": "2d4c167d", - "metadata": {}, - "source": [ - "模型合并完成后,我们的目录结构应该是这样子的。\n", - "\n", - "
\n", - "目录结构\n", - "\n", - "```\n", - "├── merged\n", - "│ ├── config.json\n", - "│ ├── configuration_internlm2.py\n", - "│ ├── generation_config.json\n", - "│ ├── modeling_internlm2.py\n", - "│ ├── pytorch_model-00001-of-00002.bin\n", - "│ ├── pytorch_model-00002-of-00002.bin\n", - "│ ├── pytorch_model.bin.index.json\n", - "│ ├── special_tokens_map.json\n", - "│ ├── tokenization_internlm2.py\n", - "│ ├── tokenization_internlm2_fast.py\n", - "│ ├── tokenizer.json\n", - "│ ├── tokenizer.model\n", - "│ └── tokenizer_config.json\n", - "```\n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "e55ae7b2", - "metadata": {}, - "source": [ - "### 6.3 目标模型推理\n", - "\n", - "当我们合并完成后,我们就能够正常的调用这个模型进行推理了。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "febaa17b", - "metadata": {}, - "outputs": [], - "source": [ - "tokenizer, model = load_model(\"./merged\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5ced1ac3", - "metadata": {}, - "outputs": [], - "source": [ - "generate(\"书生·浦语大模型实战营第三期是\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8fad2baa", - "metadata": {}, - "outputs": [], - "source": [ - "generate(\"成都是\")" - ] - }, - { - "cell_type": "markdown", - "id": "ac4314bc", - "metadata": {}, - "source": [ - "可以看到,通过增量预训练,确实在基座模型的基础上学习到了新的知识。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1485e2fa", - "metadata": {}, - "outputs": [], - "source": [ - "del tokenizer, model\n", - "\n", - "torch.cuda.empty_cache()" - ] - }, - { - "cell_type": "markdown", - "id": "2326897e-7594-4082-8bc7-57681f837cdc", - "metadata": {}, - "source": [ - "## 7 DeepSpeed介绍\n", - "\n", - "DeepSpeed是一个由微软开发的开源深度学习优化库,旨在提高大规模模型训练的效率和速度。\n", - "\n", - "XTuner 也内置了 `deepspeed` 来加速整体的训练过程,共有三种不同的 `deepspeed` 类型可进行选择,分别是 `deepspeed_zero1`, `deepspeed_zero2` 和 `deepspeed_zero3`。\n", - "\n", - "
\n", - "DeepSpeed优化器及其选择方法\n", - "\n", - "DeepSpeed是一个由微软开发的开源深度学习优化库,旨在提高大规模模型训练的效率和速度。它通过几种关键技术来优化训练过程,包括模型分割、梯度累积、以及内存和带宽优化等,能够降低训练超大规模模型的复杂性和资源需求,使训练变得更快、更高效。DeepSpeed特别适用于需要巨大计算资源的大型模型和数据集。\n", - "\n", - "在DeepSpeed中,引入了ZeRO(Zero Redundancy Optimizer)技术,是一种旨在降低训练大型模型所需内存占用的优化器,通过在分布式环境中分割优化器的状态、梯度和参数,减少冗余的内存占用,允许更大的模型和更快的训练速度。ZeRO 分为几个不同的级别,主要包括:\n", - "\n", - "- **deepspeed_zero1**:这是ZeRO的基本版本,它优化了模型参数的存储,主要通过分区存储优化器状态来减少内存使用。每个GPU设备只保存一部分优化器状态,从而显著减少内存消耗。\n", - "\n", - "- **deepspeed_zero2**:在deepspeed_zero1的基础上,deepspeed_zero2进一步优化了梯度和优化器状态的存储,将梯度也进行分区存储。这样,每个GPU设备只需要保存一部分的优化器状态和梯度,进一步减少内存使用。\n", - "\n", - "- **deepspeed_zero3**:这是目前最高级的优化等级,它包括了deepspeed_zero1和deepspeed_zero2的优化,除了优化器状态和梯度,还将模型参数进行分区存储。每个GPU设备只需要保存一部分的优化器状态、梯度和模型参数,从而最大限度地减少内存使用。\n", - "\n", - "选择哪种deepspeed类型主要取决于你的具体需求,包括模型的大小、可用的硬件资源(特别是GPU内存)以及训练的效率需求。一般来说:\n", - "\n", - "- 如果你的模型较小,或者内存资源充足,可能不需要使用最高级别的优化。\n", - "- 如果你需要快速训练模型,可能需要权衡内存优化和计算效率。deepspeed_zero1提供了较低的内存占用,同时保持了较高的计算效率。\n", - "- 如果你正在尝试训练非常大的模型,或者你的硬件资源有限,使用deepspeed_zero2或deepspeed_zero3可能更合适,因为它们可以显著降低内存占用,允许更大模型的训练。\n", - "- 选择时也要考虑到实现的复杂性和运行时的开销,更高级的优化可能需要更复杂的设置,更频繁的跨GPU通信,这可能需要更高的网络带宽,并可能增加一些计算开销。\n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "c74dd2a4", - "metadata": {}, - "source": [ - "## 8 多卡微调\n", - "\n", - "模型的规模和复杂度不断增加,单张GPU的显存往往无法满足大模型的训练需求。此时,我们可能需要多卡微调,以应对大模型训练过程中显存和计算资源的需求。" - ] - }, - { - "cell_type": "markdown", - "id": "f3cc6567", - "metadata": {}, - "source": [ - "\n", - "XTuner 中使用多卡微调,只需要设置 `NPROC_PER_NODE` 环境变量,并使用 `DeepSpeed` 来进行加速就可以了,其余命令内容与单卡微调时一样。\n", - "\n", - "> 由于开发机只有两张显卡,所以我们设置`NPROC_PER_NODE=2`,并且选择使用`deepspeed_zero3`优化等级。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "27a213cd", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": [], - "source": [ - "!MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU NPROC_PER_NODE=2 xtuner train ./internlm2_chat_1_8b_qlora_alpaca_e3_copy.py --deepspeed deepspeed_zero3" - ] - }, - { - "cell_type": "markdown", - "id": "6c81ca6f", - "metadata": {}, - "source": [ - "在执行微调的过程中,我们可以看到两张显卡都有内存使用。\n", - "\n", - "![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-06.png)" - ] - }, - { - "cell_type": "markdown", - "id": "922b48a6", - "metadata": {}, - "source": [ - "在训练完后,我们的目录结构应该是这样子的。\n", - "\n", - "
\n", - "目录结构\n", - "\n", - "```\n", - "├── work_dirs\n", - "│ └── internlm2_chat_1_8b_qlora_alpaca_e3_copy\n", - "│ ├── 20240628_205957\n", - "│ │ ├── 20240628_205957.log\n", - "│ │ └── vis_data\n", - "│ │ ├── 20240628_205957.json\n", - "│ │ ├── config.py\n", - "│ │ ├── eval_outputs_iter_236.txt\n", - "│ │ └── scalars.json\n", - "│ ├── internlm2_chat_1_8b_qlora_alpaca_e3_copy.py\n", - "│ ├── iter_237.pth\n", - "│ │ ├── bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt\n", - "│ │ ├── bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt\n", - "│ │ ├── zero_pp_rank_0_mp_rank_00_model_states.pt\n", - "│ │ └── zero_pp_rank_1_mp_rank_00_model_states.pt\n", - "│ ├── last_checkpoint\n", - "│ └── zero_to_fp32.py\n", - "```\n", - "\n", - "
\n", - "\n", - "可以看到,通过 `deepspeed` 来训练后得到的权重文件和原本的权重文件是有所差别的,原本的仅仅是一个 .pth 的文件,而使用了 `deepspeed` 则是一个名字带有 .pth 的文件夹,在该文件夹里保存了 .pt 文件。这两者在具体的使用上并没有太大的差别,转换和合并的过程都是一样的。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fced7e55", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": [], - "source": [ - "!pth_file=`ls -t ./work_dirs/internlm2_chat_1_8b_qlora_alpaca_e3_copy | grep pth | head -n 1` && MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU xtuner convert pth_to_hf ./internlm2_chat_1_8b_qlora_alpaca_e3_copy.py ./work_dirs/internlm2_chat_1_8b_qlora_alpaca_e3_copy/${pth_file} ./hf" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4d9c58d5", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": [], - "source": [ - "!MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU xtuner convert merge Shanghai_AI_Laboratory/internlm2-chat-1_8b ./hf ./merged --max-shard-size 2GB" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d9197c24", - "metadata": {}, - "outputs": [], - "source": [ - "tokenizer, model = load_model(\"./merged\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "41bfa713", - "metadata": {}, - "outputs": [], - "source": [ - "chat(\"请介绍一下你自己\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "76f12d02", - "metadata": {}, - "outputs": [], - "source": [ - "chat(\"你在实战营做什么\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "54d9c0e9", - "metadata": {}, - "outputs": [], - "source": [ - "chat(\"介绍一下成都\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "91787f35", - "metadata": {}, - "outputs": [], - "source": [ - "del tokenizer, model\n", - "\n", - "torch.cuda.empty_cache()" - ] - }, - { - "cell_type": "markdown", - "id": "d9cfecbd-c125-4fa6-98ed-76da8015241e", - "metadata": {}, - "source": [ - "## 9 分布式微调\n", - "\n", - "如果模型的规模和复杂度继续增加,我们还可以使用分布式微调。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a025a904", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": [], - "source": [ - "!apt-get install -y net-tools\n", - "!ifconfig" - ] - }, - { - "cell_type": "markdown", - "id": "b95189de", - "metadata": {}, - "source": [ - "分布式微调是主从架构的。主节点协调整个训练过程,管理数据和任务到工作节点的分配。工作节点执行训练步骤的实际计算,处理数据的子集并计算梯度。有时候在一些架构中还需要参数服务器协调所有工作节点之间的模型更新同步,用于聚合来自工作节点的梯度并更新模型参数。\n", - "\n", - "> 我们使用两个节点进行分布式微调,实际上需要启动三个节点。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7195212b", - "metadata": { - "vscode": { - "languageId": "shellscript" - } - }, - "outputs": [], - "source": [ - "!MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 NPROC_PER_NODE=1 NNODES=2 xtuner train internlm2_chat_1_8b_qlora_alpaca_e3_copy.py\n", - "\n", - "!MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 NPROC_PER_NODE=1 NNODES=2 NODE_RANK=0 TRITON_CACHE_DIR=node0 PORT=20821 ADDR=192.168.230.182 xtuner train internlm2_chat_1_8b_qlora_alpaca_e3_copy.py --work-dir work_dir_node0\n", - "\n", - "!MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 NPROC_PER_NODE=1 NNODES=2 NODE_RANK=1 TRITON_CACHE_DIR=node1 PORT=20821 ADDR=192.168.230.182 xtuner train internlm2_chat_1_8b_qlora_alpaca_e3_copy.py --work-dir work_dir_node1" - ] - }, - { - "cell_type": "markdown", - "id": "6dbcffbe", - "metadata": {}, - "source": [ - 
"首先启动主节点,然后依次启动其他节点。但需要注意的是,需要在一个时间阈值内启动相关的节点,如果超过时间阈值还没启动所有节点,则其他节点会因超时而报错退出。\n", - "\n", - "比如,在两个节点的分布式微调过程中,我们只启动主节点和一个工作节点,另一个节点不启动,则已启动的节点会超时报错退出。\n", - "\n", - "![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-07.png)" - ] - }, - { - "cell_type": "markdown", - "id": "953e85aa", - "metadata": {}, - "source": [ - "如果所有节点都正常启动、训练,则可以看到每个节点的显卡均有内存使用。\n", - "\n", - "![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-08.png)" - ] - }, - { - "cell_type": "markdown", - "id": "f93beb75", - "metadata": {}, - "source": [ - "在训练完后,我们的目录结构应该是这样子的,训练的模型在工作节点上。\n", - "\n", - "
\n", - "目录结构\n", - "\n", - "```\n", - "├── work_dir_node0\n", - "│ ├── 20240629_213009\n", - "│ │ ├── 20240629_213009.log\n", - "│ │ └── vis_data\n", - "│ │ ├── 20240629_213009.json\n", - "│ │ ├── config.py\n", - "│ │ ├── eval_outputs_iter_233.txt\n", - "│ │ └── scalars.json\n", - "│ ├── internlm2_chat_1_8b_qlora_alpaca_e3_copy.py\n", - "│ ├── iter_234.pth\n", - "│ └── last_checkpoint\n", - "├── work_dir_node1\n", - "│ └── 20240629_213009\n", - "├── work_dirs\n", - "│ └── internlm2_chat_1_8b_qlora_alpaca_e3_copy\n", - "```\n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "193fdbe8", - "metadata": {}, - "source": [ - "## 10 小结\n", - "\n", - "现在,我们又学到了 XTuner 微调的更多高阶知识啦,包括增量预训练微调基座模型、多卡微调、分布式微调等。\n", - "\n", - "是不是感觉其实微调也不过如此!事实上确实是这样的!其实在微调的时候最重要的还是要自己准备一份高质量的数据集,这个才是你能否真微调出效果最核心的利器。" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.10" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/docs/L1/XTuner/readme.md b/docs/L1/XTuner/readme.md index 8ebe8592b..bc451de22 100644 --- a/docs/L1/XTuner/readme.md +++ b/docs/L1/XTuner/readme.md @@ -2,6 +2,8 @@ 在本节中,将一步步带领大家体验如何使用 XTuner 完成个人小助手的微调! +> 整个过程大概需要40分钟我们就可以得到一个自己的小助手。 + ## 1 微调前置基础 在进行微调之前,我们需要了解一些基本概念,请访问[XTuner微调前置基础](./xtuner_finetune_basic.md)。 @@ -16,23 +18,23 @@ ### 2.1 创建虚拟环境 -在安装 XTuner 之前,我们需要先创建一个虚拟环境。创建一个名为 `xtuner0121` 的虚拟环境,可以直接执行命令。 - +在安装 XTuner 之前,我们需要先创建一个虚拟环境。使用 `Anaconda` 创建一个名为 `xtuner0121` 的虚拟环境,可以直接执行命令。 ```bash +# 创建虚拟环境 conda create -n xtuner0121 python=3.10 -y -``` - -如果是在开发机中,也可以直接执行以下命令进行创建: +# 激活虚拟环境 +conda activate xtuner0121 -```bash -studio-conda -t xtuner0121 -o internlm-base +# 安装一些必要的库 +pip install torch==2.0.1 torchaudio==2.0.2 torchvision==0.15.2 modelscope==1.15.0 ``` -虚拟环境创建完成后,需要激活虚拟环境。 +如果是在开发机中,也可以直接执行以下命令进行创建: ```bash +studio-conda -t xtuner0121 -o internlm-base conda activate xtuner0121 ``` @@ -40,18 +42,26 @@ conda activate xtuner0121 虚拟环境创建完成后,就可以安装 XTuner 了。首先,从 Github 上下载源码。 - ```bash -git clone -b v0.1.21 https://github.com/InternLM/xtuner +# 创建一个目录,用来存放源代码 +mkdir -p /root/InternLM/code + +cd /root/InternLM/code + +git clone -b v0.1.21 https://github.com/InternLM/XTuner ``` 其次,进入源码目录,执行安装。 -> 如果速度太慢可以换成 `pip install -e '.[all]' -i https://mirrors.aliyun.com/pypi/simple/` +> 如果速度太慢可以换成 `pip install -e '.[deepspeed]' -i https://mirrors.aliyun.com/pypi/simple/` ```bash -cd xtuner && pip install -e '.[all]' +# 进入到源码目录 +cd /root/InternLM/code/XTuner + +# 执行安装 +pip install -e '.[deepspeed]' ``` 最后,我们可以验证一下安装结果。 @@ -82,6 +92,11 @@ xtuner help ```bash +# 创建一个目录,用来存放微调的资料 +mkdir -p /root/InternLM/XTuner + +cd /root/InternLM/XTuner + mkdir -p Shanghai_AI_Laboratory ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b Shanghai_AI_Laboratory/internlm2-chat-1_8b @@ -96,7 +111,7 @@ ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b Shanghai ```python from modelscope import snapshot_download -model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm2-1_8b', cache_dir="./") +model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm2-1_8b', cache_dir="/root/InternLM/XTuner/") ``` 模型文件准备好后,我们的目录结构应该是这个样子的。 @@ -214,7 +229,7 @@ def chat(input_text): ```python -tokenizer, model = load_model("Shanghai_AI_Laboratory/internlm2-chat-1_8b") +tokenizer, model = load_model("/root/InternLM/XTuner/Shanghai_AI_Laboratory/internlm2-chat-1_8b") ``` - 对话 @@ -239,7 +254,14 @@ torch.cuda.empty_cache() #### 3.2.1 准数据文件 -为了让模型能够认清自己的身份弟位,在询问自己是谁的时候按照我们预期的结果进行回复,我们就需要通过在微调数据集中大量加入这样的数据。我们准备一个数据集文件`datas/assistant.json`,文件内容为对话数据。为了增强微调效果,可以将对话数据复制多条。 +为了让模型能够认清自己的身份弟位,在询问自己是谁的时候按照我们预期的结果进行回复,我们就需要通过在微调数据集中大量加入这样的数据。我们准备一个数据集文件`datas/assistant.json`,文件内容为对话数据。 + +```bash +mkdir -p datas +touch datas/assistant.json +``` + +为了增强微调效果,可以将对话数据复制多条。 ```python @@ 
-249,6 +271,52 @@ torch.cuda.empty_cache() ] ``` +为了简化数据文件准备,我们也可以通过脚本生成的方式来准备数据。创建一个脚本文件 `xtuner_generate_assistant.py` ,输入脚本内容并保存: + +> 或者可以直接复制 [tools/xtuner_generate_assistant.py](../../../tools/xtuner_generate_assistant.py) +> ```bash +> cp ../../../tools/xtuner_generate_assistant.py ./ +>``` + +
+xtuner_generate_assistant.py + +```python +import json + +# 设置用户的名字 +name = '伍鲜同志' +# 设置需要重复添加的数据次数 +n = 4650 + +# 初始化数据 +data = [ + {"conversation": [{"input": "请介绍一下你自己", "output": "我是{}的小助手,内在是上海AI实验室书生·浦语的1.8B大模型哦".format(name)}]}, + {"conversation": [{"input": "你在实战营做什么", "output": "我在这里帮助{}完成XTuner微调个人小助手的任务".format(name)}]} +] + +# 通过循环,将初始化的对话数据重复添加到data列表中 +for i in range(n): + data.append(data[0]) + data.append(data[1]) + +# 将data列表中的数据写入到'datas/assistant.json'文件中 +with open('datas/assistant.json', 'w', encoding='utf-8') as f: + # 使用json.dump方法将数据以JSON格式写入文件 + # ensure_ascii=False 确保中文字符正常显示 + # indent=4 使得文件内容格式化,便于阅读 + json.dump(data, f, ensure_ascii=False, indent=4) + +``` + +
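After running the generator script above, it can be worth a quick sanity check of the output before training. The short sketch below only assumes the `datas/assistant.json` path and the record layout produced by that script:

```python
import json

# Load the dataset written by xtuner_generate_assistant.py (path from the steps above).
with open('datas/assistant.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# The script seeds 2 records and appends both of them n times, so expect 2 + 2 * n entries.
print(f"records: {len(data)}")

# Preview one record to confirm the {"conversation": [{"input": ..., "output": ...}]} layout.
print(json.dumps(data[0], ensure_ascii=False, indent=4))
```
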
+ +然后执行该脚本来生成数据文件。 + +```bash +python xtuner_generate_assistant.py +``` + 准备好数据文件后,我们的目录结构应该是这样子的。
@@ -414,14 +482,14 @@ xtuner copy-cfg internlm2_chat_1_8b_qlora_alpaca_e3 . # PART 1 Settings # ####################################################################### - pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b' -+ pretrained_model_name_or_path = 'Shanghai_AI_Laboratory/internlm2-chat-1_8b' ++ pretrained_model_name_or_path = '/root/InternLM/XTuner/Shanghai_AI_Laboratory/internlm2-chat-1_8b' - alpaca_en_path = 'tatsu-lab/alpaca' + alpaca_en_path = 'datas/assistant.json' evaluation_inputs = [ - '请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai' -+ '请介绍一下你自己', '请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai' ++ '请介绍一下你自己', 'Please introduce yourself' ] ####################################################################### @@ -475,6 +543,11 @@ alpaca_en = dict( 修改完后的完整的配置文件是:[configs/internlm2_chat_1_8b_qlora_alpaca_e3_copy.py](../../../configs/internlm2_chat_1_8b_qlora_alpaca_e3_copy.py)。 +> 可以直接复制到当前目录。 +> ```bash +> cp ../../../configs/internlm2_chat_1_8b_qlora_alpaca_e3_copy.py ./ +>``` +
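Before moving on, it can also help to confirm that the locally edited copy really matches the reference config. Below is a minimal sketch using Python's standard `difflib`; the two paths are taken from the copy command above and may differ in your setup:

```python
import difflib

# Paths follow the cp command above; adjust them to where your files actually live.
local_cfg = './internlm2_chat_1_8b_qlora_alpaca_e3_copy.py'
reference_cfg = '../../../configs/internlm2_chat_1_8b_qlora_alpaca_e3_copy.py'

with open(local_cfg, encoding='utf-8') as a, open(reference_cfg, encoding='utf-8') as b:
    diff = list(difflib.unified_diff(a.readlines(), b.readlines(),
                                     fromfile=local_cfg, tofile=reference_cfg))

# readlines() keeps the trailing newlines, so the diff lines can be joined directly.
print(''.join(diff) if diff else 'The two config files are identical.')
```
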
internlm2_chat_1_8b_qlora_alpaca_e3_copy.py @@ -505,7 +578,7 @@ from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE # PART 1 Settings # ####################################################################### # Model -pretrained_model_name_or_path = 'Shanghai_AI_Laboratory/internlm2-chat-1_8b' +pretrained_model_name_or_path = '/root/InternLM/XTuner/Shanghai_AI_Laboratory/internlm2-chat-1_8b' use_varlen_attn = False # Data @@ -538,7 +611,7 @@ save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited) evaluation_freq = 500 SYSTEM = SYSTEM_TEMPLATE.alpaca evaluation_inputs = [ - '请介绍一下你自己', '请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai' + '请介绍一下你自己', 'Please introduce yourself' ] ####################################################################### @@ -801,7 +874,7 @@ pth_file=`ls -t ./work_dirs/internlm2_chat_1_8b_qlora_alpaca_e3_copy/*.pth | hea ```bash -MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU xtuner convert merge Shanghai_AI_Laboratory/internlm2-chat-1_8b ./hf ./merged --max-shard-size 2GB +MKL_SERVICE_FORCE_INTEL=1 MKL_THREADING_LAYER=GNU xtuner convert merge /root/InternLM/XTuner/Shanghai_AI_Laboratory/internlm2-chat-1_8b ./hf ./merged --max-shard-size 2GB ``` 模型合并完成后,我们的目录结构应该是这样子的。 @@ -869,13 +942,18 @@ torch.cuda.empty_cache() ```python -pip install streamlit +pip install streamlit==1.36.0 ``` 其次,我们需要准备一个Streamlit程序的脚本。 Streamlit程序的完整代码是:[tools/xtuner_streamlit_demo.py](../../../tools/xtuner_streamlit_demo.py)。 +> 可以直接复制到当前目录。 +> ```bash +> cp ../../../tools/xtuner_streamlit_demo.py ./ +>``` +
xtuner_streamlit_demo.py diff --git a/docs/L1/XTuner/homework.md b/docs/L1/XTuner/task.md similarity index 92% rename from docs/L1/XTuner/homework.md rename to docs/L1/XTuner/task.md index 9e483417b..8a9eb7478 100644 --- a/docs/L1/XTuner/homework.md +++ b/docs/L1/XTuner/task.md @@ -6,10 +6,9 @@ - 训练自己的小助手认知(记录复现过程并截图) -- 用自己感兴趣的知识对基座模型进行增量预训练微调(记录复现过程并截图) - ## 进阶作业 +- 用自己感兴趣的知识对基座模型进行增量预训练微调(选做) - 在资源允许的情况下,尝试实现多卡微调与分布式微调(选做) - 将自我认知的模型上传到 OpenXLab,并将应用部署到 OpenXLab(优秀学员必做) diff --git a/tools/xtuner_generate_assistant.py b/tools/xtuner_generate_assistant.py new file mode 100644 index 000000000..cee736e37 --- /dev/null +++ b/tools/xtuner_generate_assistant.py @@ -0,0 +1,24 @@ +import json + +# 设置用户的名字 +name = '伍鲜同志' +# 设置需要重复添加的数据次数 +n = 4650 + +# 初始化数据 +data = [ + {"conversation": [{"input": "请介绍一下你自己", "output": "我是{}的小助手,内在是上海AI实验室书生·浦语的1.8B大模型哦".format(name)}]}, + {"conversation": [{"input": "你在实战营做什么", "output": "我在这里帮助{}完成XTuner微调个人小助手的任务".format(name)}]} +] + +# 通过循环,将初始化的对话数据重复添加到data列表中 +for i in range(n): + data.append(data[0]) + data.append(data[1]) + +# 将data列表中的数据写入到'datas/assistant.json'文件中 +with open('datas/assistant.json', 'w', encoding='utf-8') as f: + # 使用json.dump方法将数据以JSON格式写入文件 + # ensure_ascii=False 确保中文字符正常显示 + # indent=4 使得文件内容格式化,便于阅读 + json.dump(data, f, ensure_ascii=False, indent=4) From f386f62d9f2dfd8ab159ac12a5deb1e21ac83571 Mon Sep 17 00:00:00 2001 From: AI-Labs Date: Tue, 2 Jul 2024 23:59:34 +0800 Subject: [PATCH 039/754] =?UTF-8?q?XTuner=E5=BE=AE=E8=B0=83=E4=B8=AA?= =?UTF-8?q?=E4=BA=BA=E5=B0=8F=E5=8A=A9=E6=89=8B=E8=AE=A4=E7=9F=A5?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/L1/XTuner/readme.md | 2 +- docs/L1/XTuner/task.md | 15 --------------- 2 files changed, 1 insertion(+), 16 deletions(-) delete mode 100644 docs/L1/XTuner/task.md diff --git a/docs/L1/XTuner/readme.md b/docs/L1/XTuner/readme.md index bc451de22..b8603642f 100644 --- a/docs/L1/XTuner/readme.md +++ b/docs/L1/XTuner/readme.md @@ -1272,4 +1272,4 @@ ssh -CNg -L 8501:127.0.0.1:8501 root@ssh.intern-ai.org.cn -p 43551 ## 6 作业 -作业请访问[作业](./homework.md)。 +作业请访问[作业](./task.md)。 diff --git a/docs/L1/XTuner/task.md b/docs/L1/XTuner/task.md deleted file mode 100644 index 8a9eb7478..000000000 --- a/docs/L1/XTuner/task.md +++ /dev/null @@ -1,15 +0,0 @@ -# XTuner 微调个人小助手认知作业 - -记录复现过程并截图。 - -## 基础作业(结营必做) - -- 训练自己的小助手认知(记录复现过程并截图) - -## 进阶作业 - -- 用自己感兴趣的知识对基座模型进行增量预训练微调(选做) -- 在资源允许的情况下,尝试实现多卡微调与分布式微调(选做) -- 将自我认知的模型上传到 OpenXLab,并将应用部署到 OpenXLab(优秀学员必做) - -OpenXLab 部署教程:https://github.com/InternLM/Tutorial/tree/camp2/tools/openxlab-deploy From fd9b69620ab912798d3db2073e1491efceda48c1 Mon Sep 17 00:00:00 2001 From: wux-labs <99776865+wux-labs@users.noreply.github.com> Date: Wed, 3 Jul 2024 00:44:58 +0800 Subject: [PATCH 040/754] =?UTF-8?q?XTuner=E5=BE=AE=E8=B0=83=E4=B8=AA?= =?UTF-8?q?=E4=BA=BA=E5=B0=8F=E5=8A=A9=E6=89=8B=E8=AE=A4=E7=9F=A5=E4=BB=BB?= =?UTF-8?q?=E5=8A=A1=20(#795)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * XTuner微调个人小助手认知任务 * XTuner 微调个人小助手认知任务 * XTuner 微调个人小助手认知任务 --- docs/L1/XTuner/task.md | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) create mode 100644 docs/L1/XTuner/task.md diff --git a/docs/L1/XTuner/task.md b/docs/L1/XTuner/task.md new file mode 100644 index 000000000..db967cc8c --- /dev/null +++ b/docs/L1/XTuner/task.md @@ -0,0 +1,16 @@ +# XTuner 微调个人小助手认知任务 + +记录复现过程并截图。 + +## 基础任务(完成此任务即完成闯关) + +- 使用 XTuner 微调 InternLM2-Chat-1.8B 
实现自己的小助手认知,如下图所示(图中的`伍鲜同志`需替换成自己的昵称),记录复现过程并截图。 +![](https://raw.githubusercontent.com/wux-labs/ImageHosting/main/XTuner/image-12.png) + +## 进阶任务(闯关不要求完成此任务) + +- 用自己感兴趣的知识对基座模型进行增量预训练微调(选做) +- 在资源允许的情况下,尝试实现多卡微调与分布式微调(选做) +- 将自我认知的模型上传到 OpenXLab,并将应用部署到 OpenXLab(优秀学员必做) + +OpenXLab 部署教程:https://github.com/InternLM/Tutorial/tree/camp2/tools/openxlab-deploy From 60496737d108af2b1c5dd472aee6aaa7664798f6 Mon Sep 17 00:00:00 2001 From: MrCatAI <160732778+MrCatAI@users.noreply.github.com> Date: Wed, 3 Jul 2024 00:56:13 +0800 Subject: [PATCH 041/754] add git_task (#797) --- data/Git/task/camp3_id.md | 8 ++++++++ docs/L0/Git/task.md | 38 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 46 insertions(+) create mode 100644 data/Git/task/camp3_id.md create mode 100644 docs/L0/Git/task.md diff --git a/data/Git/task/camp3_id.md b/data/Git/task/camp3_id.md new file mode 100644 index 000000000..1a3516a7c --- /dev/null +++ b/data/Git/task/camp3_id.md @@ -0,0 +1,8 @@ +【大家可以叫我】: InternLM +【坐标】:上海 +【专业/职业】:小助手 +【兴趣爱好】: 乒乓球 +【项目技能】:cv、nlp +【组队情况】:未组队,快来一起! +【本课程学习基础】:CV、NLP、LLM +【本期活动目标】:一起学习,快乐暑假,闯关达人! \ No newline at end of file diff --git a/docs/L0/Git/task.md b/docs/L0/Git/task.md new file mode 100644 index 000000000..35466b676 --- /dev/null +++ b/docs/L0/Git/task.md @@ -0,0 +1,38 @@ +# Git 课程任务 + +## 任务概览 + +- **任务1**: 破冰活动:自我介绍 +- **任务2**: 实践项目:构建个人项目 + +## 任务1: 破冰活动:自我介绍 + +### 目标 + +每位参与者提交一份自我介绍。 + +![XNzebK7ItoftfwxiXQ2cY3lYn0g](https://github.com/InternLM/Tutorial/assets/160732778/bb74cc07-e806-4d17-9dbc-cca2890a9230) + +### 要求 + +1. 命名格式为 `camp3_.md`,其中 `` 是您的报名问卷ID。 +2. 文件路径应为 `./data/Git/task/`。 +3. 【大家可以叫我】内容可以是 GitHub 昵称、微信昵称或其他网名。 +4. 在 GitHub 上创建一个 Pull Request,提供对应的 PR 链接。 + + +## 任务2: 实践项目:构建个人项目 + +### 目标 + +创建一个个人仓库,用于提交笔记、心得体会或分享项目。 + +![NiN3bCHIaoHh7GxQG6WcEY3Yn9f](https://github.com/InternLM/Tutorial/assets/160732778/c76691e7-eb21-435f-a0ed-4a6b62e569e4) + +### 要求 + +1. 创建并维护一个公开的大模型相关项目或笔记仓库。 +2. 提交作业时,提供您的 GitHub 仓库链接。 +3. 如果您不常使用 GitHub,您可以选择其他代码管理平台,如 Gitee,并提交相应的链接。 +4. 
仓库介绍中添加超链接跳转 [GitHub 仓库](https://github.com/InternLM/Tutorial)([https://github.com/InternLM/Tutorial](https://github.com/InternLM/Tutorial)) + From 4139d10579f67819857eb60c836b8c6cec8b83eb Mon Sep 17 00:00:00 2001 From: jyfjz <2661378091@qq.com> Date: Thu, 4 Jul 2024 09:12:17 +0800 Subject: [PATCH 042/754] add git_121_introduction --- data/Git/task/camp3_121.md | 8 ++++++++ 1 file changed, 8 insertions(+) create mode 100644 data/Git/task/camp3_121.md diff --git a/data/Git/task/camp3_121.md b/data/Git/task/camp3_121.md new file mode 100644 index 000000000..c2dd88e3d --- /dev/null +++ b/data/Git/task/camp3_121.md @@ -0,0 +1,8 @@ +【大家可以叫我】:jyfjz +【坐标】:厦门 +【专业/职业】:开拓者 +【兴趣爱好】:崩铁&编程 +【项目技能】:NLP +【组队情况】:已组队,项目听老大的 +【本课程学习基础】:对NLP领域有一定了解,熟悉一定框架 +【本期活动目标】:对LLM的开发进一步了解 \ No newline at end of file From 081afb0c36714fe93aeeddcd1105f833a64af03d Mon Sep 17 00:00:00 2001 From: wwwzhouhui Date: Mon, 8 Jul 2024 18:20:03 +0800 Subject: [PATCH 043/754] Add files via upload --- data/Git/task/camp3_1281.md | 15 +++++++++++++++ 1 file changed, 15 insertions(+) create mode 100644 data/Git/task/camp3_1281.md diff --git a/data/Git/task/camp3_1281.md b/data/Git/task/camp3_1281.md new file mode 100644 index 000000000..b80dcc6bc --- /dev/null +++ b/data/Git/task/camp3_1281.md @@ -0,0 +1,15 @@ +【大家可以叫我】wwwzhouhui + +【坐标】合肥 + + 【专业/职业】it学员 + + 【兴趣爱好】搞AI + +【项目技能】软件开发 + + 【组队情况】无 + + 【本课程学习】会AI技能,会微调模型 + + 【本期活动目标】学习更多AI 技能 \ No newline at end of file From fa8c07053a513c5190ea86fcf867930a7b6eecd8 Mon Sep 17 00:00:00 2001 From: Noyiz Date: Tue, 9 Jul 2024 10:51:21 +0800 Subject: [PATCH 044/754] add git_1468_introduction --- data/Git/task/camp3_1468.md | 8 ++++++++ 1 file changed, 8 insertions(+) create mode 100644 data/Git/task/camp3_1468.md diff --git a/data/Git/task/camp3_1468.md b/data/Git/task/camp3_1468.md new file mode 100644 index 000000000..dd869767d --- /dev/null +++ b/data/Git/task/camp3_1468.md @@ -0,0 +1,8 @@ +【大家可以叫我】: 葵 +【坐标】:北京 +【专业/职业】:学生 +【兴趣爱好】: 跑步! +【项目技能】:python +【组队情况】:未组队,快来一起! +【本课程学习基础】:CV、NLP、LLM +【本期活动目标】:一起学习,快乐暑假,闯关达人! \ No newline at end of file From 520ebdb958148f3c03912d3b2182b2bc738a9150 Mon Sep 17 00:00:00 2001 From: acwwt <110531742+acwwt@users.noreply.github.com> Date: Tue, 9 Jul 2024 12:21:04 +0800 Subject: [PATCH 045/754] Create task.md (#805) --- docs/L0/Linux/task.md | 10 ++++++++++ 1 file changed, 10 insertions(+) create mode 100644 docs/L0/Linux/task.md diff --git a/docs/L0/Linux/task.md b/docs/L0/Linux/task.md new file mode 100644 index 000000000..dadcbf36f --- /dev/null +++ b/docs/L0/Linux/task.md @@ -0,0 +1,10 @@ +# 关卡任务 + +闯关任务需要在关键步骤中截图: + +| | 任务描述 | 完成所需时间 | +| ---------- | --------------------------------------------- | ------------ | +| 闯关任务 | 完成SSH连接与端口映射并运行`hello_world.py` | 10min | +| 可选任务 1 | 将Linux基础命令在开发机上完成一遍 | 10min | +| 可选任务 2 | 使用 VSCODE 远程连接开发机并创建一个conda环境 | 10min | +| 可选任务 3 | 创建并运行`test.sh`文件 | 10min | From 687b6dc35d1a2e723d9c38c2f92ca55a3becc4da Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 9 Jul 2024 14:09:42 +0800 Subject: [PATCH 046/754] Update task.md --- docs/L0/Git/task.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/L0/Git/task.md b/docs/L0/Git/task.md index 35466b676..0ebda03ed 100644 --- a/docs/L0/Git/task.md +++ b/docs/L0/Git/task.md @@ -35,4 +35,5 @@ 2. 提交作业时,提供您的 GitHub 仓库链接。 3. 如果您不常使用 GitHub,您可以选择其他代码管理平台,如 Gitee,并提交相应的链接。 4. 仓库介绍中添加超链接跳转 [GitHub 仓库](https://github.com/InternLM/Tutorial)([https://github.com/InternLM/Tutorial](https://github.com/InternLM/Tutorial)) +5. 
将此项目报名参加第三期实战营项目评选将解锁 30% A100 和 168 团队算力点资源,报名链接:[https://aicarrier.feishu.cn/wiki/DjY6whCO0inTu2kQN9Cchxgynme](https://aicarrier.feishu.cn/wiki/DjY6whCO0inTu2kQN9Cchxgynme) From d862a229a3c0dcece52ae16e3a398e247166cfc0 Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 9 Jul 2024 14:30:34 +0800 Subject: [PATCH 047/754] Update task.md --- docs/L0/Linux/task.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/docs/L0/Linux/task.md b/docs/L0/Linux/task.md index dadcbf36f..d48316925 100644 --- a/docs/L0/Linux/task.md +++ b/docs/L0/Linux/task.md @@ -8,3 +8,8 @@ | 可选任务 1 | 将Linux基础命令在开发机上完成一遍 | 10min | | 可选任务 2 | 使用 VSCODE 远程连接开发机并创建一个conda环境 | 10min | | 可选任务 3 | 创建并运行`test.sh`文件 | 10min | + + +请将作业发布到知乎、CSDN等任一社交媒体,将作业链接提交到以下问卷: + +提交地址:[https://aicarrier.feishu.cn/share/base/form/shrcnZ4bQ4YmhEtMtnKxZUcf1vd](https://aicarrier.feishu.cn/share/base/form/shrcnZ4bQ4YmhEtMtnKxZUcf1vd) From fe748c35e3f8a60ea278aa2e2455403affa44b42 Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 9 Jul 2024 14:31:02 +0800 Subject: [PATCH 048/754] Update task.md --- docs/L0/Linux/task.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/L0/Linux/task.md b/docs/L0/Linux/task.md index d48316925..32cf886b3 100644 --- a/docs/L0/Linux/task.md +++ b/docs/L0/Linux/task.md @@ -10,6 +10,6 @@ | 可选任务 3 | 创建并运行`test.sh`文件 | 10min | -请将作业发布到知乎、CSDN等任一社交媒体,将作业链接提交到以下问卷: +请将作业发布到知乎、CSDN等任一社交媒体,将作业链接提交到以下问卷,助教老师批改后将获得 50 算力点奖励!!! 提交地址:[https://aicarrier.feishu.cn/share/base/form/shrcnZ4bQ4YmhEtMtnKxZUcf1vd](https://aicarrier.feishu.cn/share/base/form/shrcnZ4bQ4YmhEtMtnKxZUcf1vd) From 5a64b64daecfc4186ef0a0489ac4763dcae8c84e Mon Sep 17 00:00:00 2001 From: vansin Date: Tue, 9 Jul 2024 14:32:45 +0800 Subject: [PATCH 049/754] Update task.md --- docs/L0/Git/task.md | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/docs/L0/Git/task.md b/docs/L0/Git/task.md index 0ebda03ed..8ac988c33 100644 --- a/docs/L0/Git/task.md +++ b/docs/L0/Git/task.md @@ -5,6 +5,8 @@ - **任务1**: 破冰活动:自我介绍 - **任务2**: 实践项目:构建个人项目 + + ## 任务1: 破冰活动:自我介绍 ### 目标 @@ -37,3 +39,10 @@ 4. 仓库介绍中添加超链接跳转 [GitHub 仓库](https://github.com/InternLM/Tutorial)([https://github.com/InternLM/Tutorial](https://github.com/InternLM/Tutorial)) 5. 
将此项目报名参加第三期实战营项目评选将解锁 30% A100 和 168 团队算力点资源,报名链接:[https://aicarrier.feishu.cn/wiki/DjY6whCO0inTu2kQN9Cchxgynme](https://aicarrier.feishu.cn/wiki/DjY6whCO0inTu2kQN9Cchxgynme) + + +## 闯关材料提交 + +将Pull Request链接闯关材料提交到以下问卷,助教老师批改后将获得 50 算力点奖励!!!,完成项目申报后请联系浦语小助手(微信ID:InternLM)申请额外的团队项目算力资源~ + +提交地址:[https://aicarrier.feishu.cn/share/base/form/shrcnZ4bQ4YmhEtMtnKxZUcf1vd](https://aicarrier.feishu.cn/share/base/form/shrcnZ4bQ4YmhEtMtnKxZUcf1vd) From 1f163a3b6627f97c3eb49b0888d9caeff14dde6d Mon Sep 17 00:00:00 2001 From: acwwt <110531742+acwwt@users.noreply.github.com> Date: Tue, 9 Jul 2024 14:50:18 +0800 Subject: [PATCH 050/754] Update readme.md (#814) * Update readme.md * Update readme.md * Update readme.md * Update readme.md --- docs/L0/Linux/readme.md | 1073 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 1073 insertions(+) diff --git a/docs/L0/Linux/readme.md b/docs/L0/Linux/readme.md index 8b1378917..312f223cf 100644 --- a/docs/L0/Linux/readme.md +++ b/docs/L0/Linux/readme.md @@ -1 +1,1074 @@ +# Linux+InternStudio 关卡 +😀Hello大家好,欢迎来到**书生大模型**实战营,这里是实战营为第一次参加实战营同学,和来自各个行业的没有Linux基础知识的同学准备的基础课程,在这里我们会教大家如何使用**InternStudio开发机**,以及掌握一些基础的**Linux知识**,让大家不至于在后面的课程中无从下手,希望对大家有所帮助。在这里[关卡任务](./task.md)中为大家准备了一些关卡任务,当大家完成必做关卡任务并打卡后,就会获得当前关卡的算力奖励了,**让我们开始吧!** + +## 一、InternStudio开发机介绍 + +InternStudio 是大模型时代下的云端算力平台。基于 InternLM 组织下的诸多算法库支持,为开发者提供开箱即用的大语言模型微调环境、工具、数据集,并完美兼容 🤗 HugginFace 开源生态。 + +如果大家想了解更多关于InternStduio的介绍的话可以查看下面的文档:[ InternStudio](https://aicarrier.feishu.cn/wiki/GQ1Qwxb3UiQuewk8BVLcuyiEnHe) + +https://studio.intern-ai.org.cn/ + +首先打开上面的链接进入InternStudio,完成登录会自动跳转到控制台界面,如下图所示: + +![image](https://github.com/InternLM/Tutorial/assets/110531742/825084d6-6018-4d1b-9f32-92f0abda9e9c) + + +下面给大家讲一下每一个序号对应页面的功能: + +1. 在这里可以创建**开发机**,以及修改开发机配置和查看相关日志等。 + +![image](https://github.com/InternLM/Tutorial/assets/110531742/2d1ebf6c-5b4b-48a1-82c2-5f720d525739) + + +2. 这里可以**可视化**查看开发机中的文件及文件夹,而且如果你创建了两个开发机,那么他们使用的云盘是一个。(因为每一个开发机都是一个Docker 容器,存储云盘挂载的都是一个,关于专业名词解释可以看:[ 专业名词解释](https://aicarrier.feishu.cn/wiki/Uyvuwtvnjipsr3kaRUdc9PKvnZd))在这里你可以上传文件或者文件夹,以及创建文件,还可以查看隐藏文件。 + +![image](https://github.com/InternLM/Tutorial/assets/110531742/4f0806d0-80cc-43a1-9c7e-8fe2334cfa8f) + + +3. 这是开发机新增的功能,如果大家要做项目的话,可以向小助手申请资源,团队的功能是所有成员**共享算力资源**,避免造成资源浪费。(毕竟烧的可都是💴啊) +4. 这里是用来配置**SSH密钥**的,我们在后面会讲到如何使用。 +5. 最后这个地方是来编辑你的个人信息的,以及查看你**算力资源**的具体使用。 + +上面就是InternStudio平台的简单介绍,下面让我们来看一下如何创建开发机,我们来到首页,点击“**创建开发机**” + +![image](https://github.com/InternLM/Tutorial/assets/110531742/e34c2f46-1cb1-4ee5-9b43-0a0c4d9f3a9c) + + +这里我们选择创建**个人开发机**,名称为**test**,**Cuda**版本为12.2,**资源配置**选择10%,时长默认就行。 + +创建完成以后在**开发机**界面可以看到刚刚创建的开发机,点击进入开发机。 + +![image](https://github.com/InternLM/Tutorial/assets/110531742/4cc2c3c1-778c-4a26-ad68-69c801df8a04) + + +进入开发机以后可以看到开发机的主页面,开发机有三种模式可以选择:**JupyterLab、终端和VScode** + +![image](https://github.com/InternLM/Tutorial/assets/110531742/ea77bd10-4d70-4d52-b4c9-7c598a91ec63) + + +其中: + +1. **JupyterLab**:一个交互式的编程和教学环境,同时内置终端,可以很方便地查看文件,执行代码等 +2. **终端**(Terminal, 最轻量级):主要进行命令行操作,或者运行脚本和简单程序 +3. **VSCode**:网页中集成的VSCode,也可以在本地VSCode中通过SSH连接远程开发,下面就会讲如何配置远程连接。 + +![image](https://github.com/InternLM/Tutorial/assets/110531742/10f9aa35-bf60-447c-ba1f-4704f9309105) + + +4. 这个是资源使用情况,在后续的课程中会使用到。 + +## 二、SSH及端口映射 + +上面我们介绍了**InternStudio平台**,以及如何创建开发机,这一小节,我们要了解什么是**SSH**、**为什么使用远程连接**、如何使用SSH**远程连接**开发机、什么是**端口映射**以及如何进行**端口映射**。 + +### 1. 什么是SSH? 
+ +**SSH**全称Secure Shell,中文翻译为安全外壳,它是一种**网络安全协议**,通过加密和认证机制实现安全的访问和文件传输等业务。SSH 协议通过对网络数据进行加密和验证,在不安全的网络环境中提供了安全的网络服务。 + +SSH 是(C/S架构)由**服务器**和**客户端**组成,为建立安全的 SSH 通道,双方需要先建立 TCP 连接,然后协商使用的版本号和各类算法,并生成相同的**会话密钥**用于后续的对称加密。在完成用户认证后,双方即可建立会话进行数据交互。 + +那在后面的实践中我们会**配置SSH密钥**,配置密钥是为了当我们远程连接开发机时不用重复的输入密码,那**为什么要进行远程连接呢**? + +远程连接的好处就是,如果你使用的是远程办公,你可以通过SSH远程连接开发机,这样就可以在本地进行开发。而且如果你需要跑一些本地的代码,又没有环境,那么远程连接就非常有必要了。 + +### 2. 如何使用SSH远程连接开发机? + +#### 2.1 使用密码进行SSH远程连接 + +首先我们使用输入密码的方式进行SSH远程连接,后面我们会讲如何配置免密登录。 + +当完成开发机的创建以后,我们需要打开自己电脑的powerShell终端,使用**Win+R**快捷键打开运行框,输入powerShell,打开powerShell终端。(如果你是Linux或者Mac操作系统,下面的步骤都是一样的) + +我们回到开发机平台,进入**开发机**页面找到我们创建的开发机,点击**SSH连接**。 + +![image](https://github.com/InternLM/Tutorial/assets/110531742/0c4b9526-6ad7-490c-b57b-4148e4cca955) + + +![image](https://github.com/InternLM/Tutorial/assets/110531742/4bd6471f-ff87-4dfa-afff-b57488af92b3) + + +> 然后复制**登录命令**,这里的37367是开发机所使用的SSH端口,一般使用的都是22端口,没有这个端口号的话是连不上SSH的,并且每个人的端口都不一样,所以如果大家在连接开发机时出现连不上的情况,那就需要检查一下是不是端口错了。 + +将复制的命令粘贴到powershell中,然后回车,这里我们需要输入密码,我们将登录命令下面的密码复制下来,然后粘贴到终端中,这里密码粘贴密码是不显示的,这是正常的。 + +最后回车出现以下内容就代表成功了: + +![image](https://github.com/InternLM/Tutorial/assets/110531742/452f30d0-b844-4c7a-a361-83a8a324d049) + + +![image](https://github.com/InternLM/Tutorial/assets/110531742/39ae1d17-1978-4fa1-a8b8-5396ba7b561e) + + +当我们连接上开发机以后,可以使用`hostname`查看开发机名称,使用`uname -a`查看开发机内核信息,使用`lsb_release -a`查看开发机版本信息,使用`nvidia-smi`查看GPU的信息,这些命令我们后面都会讲到,如果想要退出远程连接,输入两次`exit`就可以了。 + +#### 2.2 配置SSH密钥进行SSH远程连接 + +但是在我们开发学习的时候,每次远程都输入密码比较麻烦,我们可以设置SSH key来跳过输入密码这一步骤,在ssh命令中我们可以使用**ssh-keygen**命令来生成密钥 + +> SSH密钥是一种安全便捷的登录认证方式,用于在SSH协议中进行身份验证和加密通信。 + +**ssh-keygen**支持RSA和DSA两种认证密钥。 + +常用参数包括: + +- -t:指定密钥类型,如dsa、ecdsa、ed25519、rsa。 +- -b:指定密钥长度。 +- -C:添加注释。 +- -f:指定保存密钥的文件名。 +- -i:读取未加密的ssh-v2兼容的私钥/公钥文件。 + +这里我们使用RSA算法生成密钥,命令为: + +```Bash +ssh-keygen -t rsa +``` + +输入命令后**一路回车**就可以了,这里的密钥默认情况下是生成在`~/.ssh/`目录下的,`~`表示的是家目录,如果是windows就是`C:\Users\{your_username}\`。在powerShell中可以使用`Get-Content`命令查看生成的密钥,如果是linux操作系统可以使用`cat`命令。 + +![image](https://github.com/InternLM/Tutorial/assets/110531742/880058be-7eaf-46b1-8193-d69a7407c527) + +![image](https://github.com/InternLM/Tutorial/assets/110531742/ecd03b35-834f-4887-b830-dc656ea54c40) + +然后我们回到开发机平台,在首页点击配置**SSH Key**,接着点击**添加SSH公钥**, + +![image](https://github.com/InternLM/Tutorial/assets/110531742/0635bd0d-3170-4c24-a8fc-8802a8126eca) + + +![image](https://github.com/InternLM/Tutorial/assets/110531742/43c6b155-9094-47bc-b4ae-006581b8f76b) + + +将刚刚生成的密钥复制下来,粘贴到公钥框中,名称会被自动识别到,最后点击立即添加,SSH Key就配置完成了。 + +![image](https://github.com/InternLM/Tutorial/assets/110531742/79c3d51b-8cdf-4fc8-a5a3-1df5eab31933) + + +完成SSH Key创建以后,重启**终端**进行远程连接,就会跳过密码输入这一步了。 + +#### 2.3 使用VScode进行SSH远程连接 + +当然也可以使用SSH远程连接软件,例如:**Windterm、Xterminal**等。这里我们使用VScode进行远程连接,使用VScode的好处是,本身它就是代码编辑器,进行代码修改等操作时会非常方便。 + +如果要在VScode中进行远程连接,我们还需要安装一套插件,如何安装VScode大家可以网上搜索一下非常简单。 + +![image](https://github.com/InternLM/Tutorial/assets/110531742/60caf212-6a8b-43f0-9b54-b48f2c85bd30) + + +如果你已经安装好了VScode,可以在点击左侧的扩展页面,在搜索框中输入“SSH”,第一个就是我们要安装的插件,点开它“Install”就可以了。 + +![image](https://github.com/InternLM/Tutorial/assets/110531742/972cf203-7393-44c5-ba50-bd54c533f0d2) + + +安装完成插件以后,点击侧边栏的远程连接图标,在SSH中点击“+”按钮,添加开发机SSH连接的登录命令。 + +![image](https://github.com/InternLM/Tutorial/assets/110531742/4ce36eee-3b71-43f0-a665-38e728d8887e) + + +我们将登录命令复制下来,然后将命令粘贴到弹出的窗口中,最后回车: + +![image](https://github.com/InternLM/Tutorial/assets/110531742/d68c82aa-0d55-4cb0-be9b-552a0cfee58b) + + 
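For readers who prefer to script this connection check instead of typing it interactively, a minimal Python sketch using the third-party `paramiko` package is shown below (install it with `pip install paramiko` first). The port 37367 is only the example port from this tutorial, and the password placeholder must be replaced with the values shown in your own "SSH连接" dialog:

```python
import paramiko

# Connection details follow the example above: every developer machine exposes its own SSH port.
host = "ssh.intern-ai.org.cn"
port = 37367                 # replace with the port shown in your own SSH dialog
password = "your_password"   # replace with the password from the same dialog

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # skip the known_hosts prompt
client.connect(hostname=host, port=port, username="root", password=password)

# Run a simple command to confirm the login worked, then clean up.
_, stdout, _ = client.exec_command("hostname")
print(stdout.read().decode().strip())
client.close()
```
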
+![image](https://github.com/InternLM/Tutorial/assets/110531742/d7624700-97cd-47a9-a0f5-5fd7349b2b06) + + +配置文件这一块默认就好,当然你也可以自定义,下面是配置文件的具体内容:(这里包括了你所有远程连接过的信息) + +```Bash +Host ssh.intern-ai.org.cn #主机ip也可以是域名 + HostName ssh.intern-ai.org.cn #主机名 + Port 37367 #主机的SSH端口 + User root #登录SSH使用的用户 + StrictHostKeyChecking no + UserKnownHostsFile /dev/null +``` + +> 后面的一些配置选项,如果想要手动添加就需要按照上面的格式对相应部分进行修改。 +> +> 如果将*`StrictHostKeyChecking`*` no`和*`UserKnownHostsFile`*` /dev/null`删除掉会跳出指纹验证的弹窗: +> +> ![image](https://github.com/InternLM/Tutorial/assets/110531742/00169485-f347-4881-88e7-1ecad30eae62) +> +> `StrictHostKeyChecking no`表示禁用严格的主机密钥检查。这意味着当连接到一个新的 SSH 服务器时,不会严格验证服务器的主机密钥,可能会带来一定的安全风险。 +> +> `UserKnownHostsFile /dev/null`则是将用户已知的主机密钥文件设置为 /dev/null ,这实质上是忽略了对已知主机密钥的记录和使用。 +> +> 但是在一般的安全实践中,不建议随意禁用严格的主机密钥检查。 + +然后在右下角弹出来的提示窗口中点击“连接”就可以远程到开发机中了。 + +![image](https://github.com/InternLM/Tutorial/assets/110531742/b3d1386b-f67a-4de7-89bb-e7254d5bd9f2) + + +![image](https://github.com/InternLM/Tutorial/assets/110531742/b87ea98c-733d-4a1f-90b7-1ffbead09dab) + + +远程连接完成以后,可以选择打开的文件夹,也可以称为工作目录,你可以选择开发机中的也可以选择本地的,开发机中的文件夹,就是我们前面提到的**云盘**。 + +当下一次进行远程连接的时候,就不需要输入登录命令等信息了,只需要打开vscode的远程连接就可以看到第一次连接的开发机信息,下面的`root`代表我们第一连接开发机时使用的是`/root`工作目录。 + +![image](https://github.com/InternLM/Tutorial/assets/110531742/0f4bf460-9b57-403e-9e52-cbddf4d46178) + + +并且下图中的`->`表示进入开发机后需要重新选择工作目录: + +![image](https://github.com/InternLM/Tutorial/assets/110531742/0967b8d9-e1de-4955-810e-a80fc2013c57) + + +而下图中的`->`表示进入上一次开发机选择的工作目录: + +![image](https://github.com/InternLM/Tutorial/assets/110531742/f5e2e15e-2cfd-40d8-9772-63d2d71fbf83) + + +每次选择的工作目录都会在这个开发机信息下面显示:(这里就多了一个lagent的工作目录) + +![image](https://github.com/InternLM/Tutorial/assets/110531742/b7001613-c9da-41e9-9add-924b2f99ef84) + + +下面我们来介绍一下什么时**端口映射**。 + +### 3. 端口映射 + +#### 3.1 什么是端口映射? + +**端口映射**是一种网络技术,它可以将外网中的任意端口映射到内网中的相应端口,实现内网与外网之间的通信。通过端口映射,可以在外网访问内网中的服务或应用,实现跨越网络的便捷通信。 + +那么我们使用开发机为什么要进行端口映射呢? + +因为在后续的课程中我们会进行模型**web_demo**的部署实践,那在这个过程中,很有可能遇到web ui加载不全的问题。这是因为开发机Web IDE中运行web_demo时,直接访问开发机内 http/https 服务可能会遇到代理问题,外网链接的**ui资源**没有被加载完全。 + +所以为了解决这个问题,我们需要对运行web_demo的连接进行端口映射,将**外网链接映射到我们本地主机**,我们使用本地连接访问,解决这个代理问题。下面让我们实践一下。 + +![image](https://github.com/InternLM/Tutorial/assets/110531742/c4ffaa41-04f3-455d-9434-a2b948d138a0) + + +我们先根据一个图了解一下开发机端口映射是如何工作的: + +> 下面会有实践步骤这里先理解如何进行端口映射的 + +```Bash +ssh -p 37367 root@ssh.intern-ai.org.cn -CNg -L 7860:127.0.0.1:7860 -o StrictHostKeyChecking=no +``` + +上面是一个端口映射命令,在主机上运行该命令即可进行端口映射,下面用一个流程图了解端口映射的过程: + +![image](https://github.com/InternLM/Tutorial/assets/110531742/79288ff2-f065-487b-9fed-8100da5d0e8a) + + +> 个人PC会远程连接到开发机唯一暴露在外的37367端口,(这个在SSH的时候提到过每个人的开发机暴露的端口都不一样),并设置隧道选项。暴露端口是作为中转站进行流量的转发。 +> +> - `-C`:启用压缩,减少传输数据量。 +> - `-N`:不执行远程命令,只建立隧道。 +> - `-g`:允许远程主机连接到本地转发的端口。 +> +> 当在个人PC上执行这个SSH命令后,SSH客户端会在本地机器的7860端口上监听。 +> +> 任何发送到本地7860端口的流量,都会被SSH隧道转发到远程服务器的127.0.0.1地址上的7860端口。 +> +> 这意味着,即使开发机的这个端口没有直接暴露给外部网络,我们也可以通过这个隧道安全地访问远程服务器上的服务。。 + +#### 3.2 如何进行端口映射? 
+ +##### 3.2.1 使用 ssh 命令进行端口映射 + +我们还是来到开发机界面,找到我们的开发机,点击**自定义服务**,复制第一条命令, + +![image](https://github.com/InternLM/Tutorial/assets/110531742/240c57bc-09ac-414b-a803-bfcd2da35593) + + +```Bash +ssh -p 37367 root@ssh.intern-ai.org.cn -CNg -L {本地机器_PORT}:127.0.0.1:{开发机_PORT} -o StrictHostKeyChecking=no +``` + +下面给他大家介绍一下命令各部分的含义: + +- `-p 37367`:是指定 SSH 连接的端口为 37367,这个前面提到过。 +- `root@ssh.intern-ai.org.cn`:表示要以 `root` 用户身份连接到 `ssh.intern-ai.org.cn` 这个主机。 +- `-CNg`: + - `-C` 通常用于启用压缩。 + - `-N` 表示不执行远程命令,仅建立连接用于端口转发等。 + - `-g` 允许远程主机连接到本地转发的端口。 +- `-L {本地机器_PORT}:127.0.0.1:{开发机_PORT}`:这是设置本地端口转发,将本地机器的指定端口(由 `{本地机器_PORT}` 表示)转发到远程主机(这里即 `ssh.intern-ai.org.cn`)的 `127.0.0.1` (即本地回环地址)和指定的开发机端口(由 `{开发机_PORT}` 表示)。 +- `-o StrictHostKeyChecking=no`:关闭严格的主机密钥检查,这样可以避免第一次连接时因为未知主机密钥而产生的提示或错误。 + +当你运行一个web demo的时候,就可以使用这个命令进行端口映射,举个例子: + +我们创建一个hello_world.py文件,在文件中填入以下内容: + +```Python +import socket +import re +import gradio as gr + +# 获取主机名 +def get_hostname(): + hostname = socket.gethostname() + match = re.search(r'-(\d+)$', hostname) + name = match.group(1) + + return name + +# 创建 Gradio 界面 +with gr.Blocks(gr.themes.Soft()) as demo: + html_code = f""" +

+ + Logo + +

+

☁️ Welcome {get_hostname()} user, welcome to the ShuSheng LLM Practical Camp Course!

+

😀 Let’s go on a journey through ShuSheng Island together.

+

+ + Logo + +

+ + """ + gr.Markdown(html_code) + +demo.launch() +``` + +在运行代码之前,需要先使用`pip install gradio==4.29.0`命令安装以下依赖包,然后在Web IDE的终端中运行了一个`hello_world.py` + +![image](https://github.com/InternLM/Tutorial/assets/110531742/a17c8834-dcb3-4a58-a29f-ed16c5e9a48a) + + +如果不进行端口映射的话,使用本地IP是访问不了的 + +![image](https://github.com/InternLM/Tutorial/assets/110531742/665d2da3-f248-41a5-a400-659615cc013e) + + +我可以使用下面的命令,将它输入到powerShell中: + +```Bash +ssh -p 37367 root@ssh.intern-ai.org.cn -CNg -L 7860:127.0.0.1:7860 -o StrictHostKeyChecking=no +``` + +![image](https://github.com/InternLM/Tutorial/assets/110531742/e7527553-30d8-4b94-84fc-106ac28869f6) + +这样就代表成功了。(**注意**:这个命令不返回任何的内容,这样代表端口映射在运行了,然后在网页中打开连接就可以看到web ui的界面了) + +![image](https://github.com/InternLM/Tutorial/assets/110531742/6837f86f-3e1b-4667-9834-f40f9ad79349) + + +##### 3.2.2 使用 vscode 进行端口映射 + +当然,如果我们运行不同的web ui的话,需要重复输入命令,这样很麻烦,这就需要用到VScode了。前面我们已经SSH远程连接了开发机,VScode提供了自动端口映射的功能,我们不需要手动配置,我们可以使用“Ctrl+Shift+~”快捷键**唤醒终端**,在终端的右侧可以找到端口选项: + +![image](https://github.com/InternLM/Tutorial/assets/110531742/f5fa3257-de24-4731-894a-a2b677543a37) + + +在这里可以查看端口映射的信息,如果需要修改端口的话,可以在端口那一栏修改端口号。 + +## 三、Linux 基础命令 + +这一部分我会带着大家了解Linux的一些**基础操作**,还有使用一些工具。让大家能够在遇到问题的时候,可以自行解决,如果大家有遇到什么问题的话,也可以在这里评论,我会及时给大家回答。 + +因为我们使用**开发机**时很少使用到**权限管理**,所以我们就不介绍了。(后面的操作均在VScode的终端中进行) + +### 1. 文件管理 + +在 Linux 中,常见的文件管理操作包括: + +- **创建文件**:可以使用 `touch` 命令创建空文件。 +- **创建目录**:使用 `mkdir` 命令。 +- **目录切换**:使用`cd`命令。 +- **显示所在目录**:使用`pwd`命令。 +- **查看文件内容**:如使用 `cat` 直接显示文件全部内容,`more` 和 `less` 可以分页查看。 +- **编辑文件**:如 `vi` 或 `vim` 等编辑器。 +- **复制文件**:用 `cp` 命令。 +- **创建文件链接**:用`ln`命令。 +- **移动文件**:通过 `mv` 命令。 +- **删除文件**:使用 `rm` 命令。 +- **删除目录**:`rmdir`(只能删除空目录)或 `rm -r`(可删除非空目录)。 +- **查找文件**:可以用 `find` 命令。 +- **查看文件或目录的详细信息**:使用`ls`命令,如使用 `ls -l`查看目录下文件的详细信息。 +- **处理文件**:进行复杂的文件操作,可以使用`sed`命令。 + +这里介绍几种我们在课程中会使用到的命令: + +#### 1.1 **touch** + +我们可以使用touch快速的创建文件,这样我们不用手动点击进行创建了。例如我们要创建一个`demo.py`文件: + +![image](https://github.com/InternLM/Tutorial/assets/110531742/6242dabe-99bd-400e-94ef-4d169b387b7b) + + +#### 1.2 **mkdir** + +同样的使用方法,如果要创建一个名为`test`的目录: + +![image](https://github.com/InternLM/Tutorial/assets/110531742/2beb41a4-b21e-45b9-884b-3ca73c000766) + + +#### 1.3 **cd** + +这个命令会是使用最多的一个命令,在使用之前需要为没有计算机基础的同学讲一下目录结构,画一张图让大家理解: + +![image](https://github.com/InternLM/Tutorial/assets/110531742/5525cdef-5340-4b82-ae8a-fca563a56649) + + +我们现在使用的是`root`目录,也是root用户的家目录`~`,linux操作系统中`/`表示根目录,根目录下有许多系统所需的目录和文件,刚才我们创建的目录就存在与`root`目录下,其中`.`表示的是当前目录,`..`表示的上级目录。如果我现在要进入到`test`目录,然后回到`root`目录,我们可以这样操作: + +![image](https://github.com/InternLM/Tutorial/assets/110531742/1ceef7d8-96cf-412a-84b3-4a758f1e3acd) + + +#### 1.4 **pwd** + +我们可以使用`pwd`命令查看当前所在的目录:这样可以方便我们确定我们当前所在哪个目录下面。 + +![image](https://github.com/InternLM/Tutorial/assets/110531742/f5944d98-2378-45b2-97aa-c57b68b4ee5e) + + +#### 1.5 **cat** + +`cat`命令可以查看文件里面的内容,更多的使用命令可以使用`--help`命令查看: + +- -a,--show-all等价于-vET +- -b,--number-non空白数非空输出行,覆盖-n +- -e, 等价于-vE +- -E,--show-结束显示$在每一行的末尾 +- -n,--number编号所有输出行 +- -s,--crick-空白抑制重复的空输出行 +- -t等价于-vT +- -t,--show-tabs将制表符显示为^I +- -v,--show非打印使用^和M-表示法,LFD和TAB除外 + +#### 1.6 **vi or vim** + +当我们需要编辑文件的时候可以使用`vi`或者`vim`命令,当你进入文件编辑以后,有三种模式: + +![image](https://github.com/InternLM/Tutorial/assets/110531742/bc717d26-dfdc-44bc-93eb-68fbaf02d9c3) + + +进入编辑模式可以使用`i`,vim的方便之处就是可以在终端进行简单的文件修改。 + +#### 1.7 **cp 和 ln(重点)** + +**`cp`****命令在后面课程中会经常用到,它是用来将一个文件或者目录复制到另一个目录下的操作,常用的使用有:** + +- 复制文件:`cp 源文件 目标文件` +- 复制目录:`cp -r 源目录 目标目录` + +但是如果我们是要使用模型的话,这种操作会占用大量的磁盘空间,所以我们一般使用`ln`命令,这个就和windows的快捷方式一样。linux中链接分为两种 : 
**硬链接**(hard link)与**软链接**(symbolic link),硬链接的意思是一个档案可以有多个名称,而软链接的方式则是产生一个特殊的档案,该档案的内容是指向另一个档案的位置。硬链接是存在同一个文件系统中,而软链接却可以跨越不同的文件系统。 + +所以我们一般使用软连接,它的常用的使用方法如下: + +``` +ln [参数][源文件或目录][目标文件或目录] +``` + +参数如下: + +- -s:创建软链接(符号链接)也是最常用的; +- -f:强制执行,覆盖已存在的目标文件; +- -i:交互模式,文件存在则提示用户是否覆盖; +- -n:把符号链接视为一般目录; +- -v:显示详细的处理过程。 + +#### 1.8 **mv 和 rm** + +`mv`命令和`rm`命令的使用方式很相似,但是`mv`是用来移动文件或者目录的,同时还可以进行重命名。`rm`命令则是用来删除文件或者目录的。 + +常用的使用方法如下: + +- **mv 命令**: + +常用参数: + +- `-i`:交互模式,覆盖前询问。 +- `-f`:强制覆盖。 +- `-u`:只在源文件比目标文件新时才进行移动。 + +使用示例: + +- `mv file1.txt dir1/`:将文件 `file1.txt` 移动到目录 `dir1` 中。 +- `mv file1.txt file2.txt`:将文件 `file1.txt` 重命名为 `file2.txt`。 + +- **rm 命令**: + +常用参数: + +- `-i`:交互模式,删除前询问。 +- `-f`:强制删除,忽略不存在的文件,不提示确认。 +- `-r`:递归删除目录及其内容。 + +使用示例: + +- `rm file.txt`:删除文件 `file.txt`。 +- `rm -r dir1/`:递归删除目录 `dir1` 及其所有内容。 + +删除目录的命令也可以使用`rmdir`。 + +#### 1.9 **find** + +`find`命令是Linux系统中一个强大的文件搜索工具,它可以在指定的目录及其子目录中查找符合条件的文件或目录,并执行相应的操作。 + +以下是`find`命令的一些常见用法: + +1. **按文件名查找**:使用`-name`选项按照文件名查找文件。例如,`find /path/to/directory -name "file.txt"`将在指定目录及其子目录中查找名为`file.txt`的文件。 +2. **按文件类型查找**:使用`-type`选项按照文件类型查找文件。例如,`find /path/to/directory -type f`将查找指定目录及其子目录中的所有普通文件。 +3. **按文件大小查找**:使用`-size`选项按照文件大小查找文件。例如,`find /path/to/directory -size +100M`将查找指定目录及其子目录中大于100MB的文件。 +4. **按修改时间查找**:使用`-mtime`、`-atime`或`-ctime`选项按照文件的修改时间、访问时间或状态更改时间查找文件。例如,`find /path/to/directory -mtime -7`将查找指定目录及其子目录中在7天内修改过的文件。 +5. **按文件权限查找**:使用`-perm`选项按照文件权限查找文件。例如,`find /path/to/directory -perm 755`将查找指定目录及其子目录中权限为755的文件。 +6. **按用户或组查找**:使用`-user`或`-group`选项按照文件的所有者或所属组查找文件。例如,`find /path/to/directory -user username`将查找指定目录及其子目录中属于用户`username`的文件。 +7. **执行操作**:使用`-exec`选项可以对找到的文件执行相应的操作。例如,`find /path/to/directory -name "*.txt" -exec rm {} \;`将删除找到的所有以`.txt`结尾的文件。 + +#### 1.10 **ls** + +`ls`命令可以用来列出目录的内容以及**详细信息**。 + +常用参数及使用方法如下: + +- `-a`:显示所有文件和目录,包括隐藏文件(以`.`开头的文件或目录)。 +- `-l`:以长格式显示详细信息,包括文件权限、所有者、大小、修改时间等。 +- `-h`:与`-l`结合使用,以人类可读的方式显示文件大小(如`K`、`M`、`G`等)。 +- `-R`:递归列出子目录的内容。 +- `-t`:按文件修改时间排序显示。、 + +![image](https://github.com/InternLM/Tutorial/assets/110531742/d380fa75-e1ea-459a-ba62-3043c3dddc46) + + +#### 1.11 **sed** + +`sed`命令是一种流编辑器,主要用于文本处理,在处理复杂的文件操作时经常用到,在后续的课程中会使用到,`sed`命令常用参数及使用示例如下: + +- **参数说明:** + - `-e