
BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models

This repository implements Bi-directional Tuning for lossless Acceleration (BiTA), an innovative method expediting LLMs via streamlined semi-autoregressive generation and draft verification.

BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models,
Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao

Setting Up Environment

Setup mainly consists of the following three steps.

  • Install Dependencies

    pip install -r requirements.txt
    
  • Download the LLM Intended for Acceleration as Base Model

  • Create Symbolic Links from Source Base Models to the Checkpoint Directory

    ln -s SOURCE_CHECKPOINT_PATH checkpoints/TARGET_CHECKPOINT_NAME
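
After linking, a quick sanity check can confirm that the symlink resolves to a Hugging Face-format checkpoint directory. The sketch below is illustrative only; "llama-2-7b-chat" stands in for your TARGET_CHECKPOINT_NAME.

    # Hypothetical sanity check (not part of the repository): verify that the linked
    # checkpoint directory resolves and contains the config.json expected of a
    # Hugging Face-format model. "llama-2-7b-chat" is a placeholder checkpoint name.
    from pathlib import Path

    ckpt = Path("checkpoints/llama-2-7b-chat")
    assert ckpt.resolve().is_dir(), f"{ckpt} does not resolve to a directory"
    assert (ckpt / "config.json").is_file(), "config.json not found; is this an HF checkpoint?"
    print("base model linked at", ckpt.resolve())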
    

Preparing Data

We describe separately how to prepare the training datasets and the test datasets.

  • Training Datasets

    We provide a small set of training samples for LLaMA-2-7B-Chat in this repository. The complete training data for LLaMA-2-7B-Chat can be found here.

    For experiments with other base models, such as Falcon, additional preparation of training data is required.

    1. Start the TGI service for executing Falcon inference:

      text-generation-launcher --model-id checkpoints/falcon-40b-instruct-hf --trust-remote-code --max-input-length 2048 --max-total-tokens 4096 --sharded true --num-shard 8
      
    2. Generate prompts, also referred to as queries or questions, using the predefined Falcon templates:

      python3 test/gen_prompt.py --model_type falcon --output_path data/assembled_v2/falcon-40b/alpaca_lima_cip-50k_code_platypus_v2-prompt2.jsonl
      
    3. Generate Falcon outputs based on greedy sampling, forming prompt-response (question-answer) pairs as the training samples:

      # NUM_PROCESS denotes the number of processes executing simultaneously through TGI.
      # IP denotes the IP address providing the TGI service.
      python3 test/gen_llm_output.py data/assembled_v2/falcon-40b/alpaca_lima_cip-50k_code_platypus_v2-prompt2-output.jsonl data/assembled_v2/falcon-40b/tmp NUM_PROCESS IP
      
    4. Merge all jsonl files in the directory data/assembled_v2/falcon-40b/tmp into one file, alpaca_lima_cip-50k_code_platypus_v2-prompt2-output.jsonl, and place it in the directory data/assembled_v2/falcon-40b (an illustrative sketch of steps 3-4 appears after this section).

  • Test Datasets

    We offer the MT-Bench dataset in this repository, while other datasets for evaluation (XSum, CIP-test, and HumanEval-X) can be found here.
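
As a reference for steps 3 and 4 above, the following Python sketch shows one way to obtain greedy completions from a running TGI service and then merge the per-process jsonl shards into a single training file. It is an illustration under assumptions, not the actual implementation of test/gen_llm_output.py: the TGI host and port, the "prompt" field name, and the generation parameters are placeholders to adapt.

    # Illustrative sketch only; test/gen_llm_output.py is the repository's actual tool.
    import glob
    import json

    import requests  # assumed to be installed; TGI exposes a REST endpoint at /generate

    TGI_URL = "http://127.0.0.1:80/generate"   # replace with the host/port serving TGI
    PROMPT_FILE = "data/assembled_v2/falcon-40b/alpaca_lima_cip-50k_code_platypus_v2-prompt2.jsonl"
    TMP_DIR = "data/assembled_v2/falcon-40b/tmp"
    MERGED_FILE = "data/assembled_v2/falcon-40b/alpaca_lima_cip-50k_code_platypus_v2-prompt2-output.jsonl"

    def greedy_generate(prompt: str) -> str:
        """Request a greedy (do_sample=False) completion from the TGI service."""
        payload = {"inputs": prompt,
                   "parameters": {"do_sample": False, "max_new_tokens": 2048}}
        resp = requests.post(TGI_URL, json=payload, timeout=600)
        resp.raise_for_status()
        return resp.json()["generated_text"]

    def merge_shards() -> None:
        """Step 4: concatenate every jsonl shard in TMP_DIR into one training file."""
        with open(MERGED_FILE, "w", encoding="utf-8") as out:
            for shard in sorted(glob.glob(f"{TMP_DIR}/*.jsonl")):
                with open(shard, encoding="utf-8") as f:
                    for line in f:
                        if line.strip():
                            out.write(line.rstrip("\n") + "\n")

    if __name__ == "__main__":
        # "prompt" is an assumed field name; adjust to the schema written by gen_prompt.py.
        with open(PROMPT_FILE, encoding="utf-8") as f:
            prompts = [json.loads(line)["prompt"] for line in f if line.strip()]
        print(greedy_generate(prompts[0])[:200])   # sanity-check one greedy completion
        merge_shards()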

Training

We use LLaMA-2-7B-Chat as the base model for BiTA training in this example.

  • Single-Node

    Run the script:

    sh scripts/run_sft-pt2_llama2-7b-chat.sh
    
  • Multi-Node

    We employ the DeepSpeed library for multi-node training (32 NVIDIA A800-80GB GPUs are used in our implementation):

    # remove any existing hostfile
    rm -rf hostfile
    # generate a new hostfile
    sh gen_openpai_hostfile.sh > hostfile
    # run the training script
    sh scripts/run_deepspeed_sft-pt2_llama2-70b-chat.sh
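
The hostfile consumed by DeepSpeed lists one node per line in the form "hostname slots=<GPUs per node>". Below is a minimal sketch of the format that gen_openpai_hostfile.sh is expected to produce, assuming four nodes with 8 GPUs each to match the 32-GPU setup above; the hostnames are hypothetical.

    # Hypothetical illustration of the standard DeepSpeed hostfile format.
    # 4 nodes x 8 GPUs = 32 GPUs as in the setup above; hostnames are made up.
    nodes = ["worker-0", "worker-1", "worker-2", "worker-3"]
    with open("hostfile", "w") as f:
        for host in nodes:
            f.write(f"{host} slots=8\n")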
    

Evaluation

We provide scripts for both single-GPU testing and multi-GPU testing. The accelerated LLaMA-2-7B-Chat is evaluated using the following scripts. For other base models, simply adjust the path TEST_DIR and related hyperparameters (MODEL_TYPE, MASK_ID, etc.) in the scripts.

  • Single-GPU

    sh scripts/run_eval_mt-bench.sh
    
  • Multi-GPU

    sh scripts/run_multigpu_eval_mt-bench.sh  
    

Performance

We present concise speedup results of model acceleration below; for more detailed results, please refer to our paper.

Model          XSum    MT-Bench    CIP     HumanEval-X
LLaMA-2-7B     2.19    2.38        2.29    2.73
LLaMA-2-13B    2.29    2.41        2.39    2.88
Vicuna-33B     2.20    2.47        2.10    3.00
Falcon-40B     2.28    2.75        2.32    3.07
LLaMA-2-70B    2.55    2.72        2.58    3.31
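
The numbers above are speedup ratios (x). As a rough illustration only, and not the paper's exact measurement protocol, such a ratio is commonly computed as the accelerated model's generation throughput divided by the autoregressive baseline's throughput:

    # Illustrative only: speedup as a throughput ratio. The figures below are made up
    # and this is not the evaluation code used to produce the table above.
    def speedup(tokens_accel: int, secs_accel: float,
                tokens_base: int, secs_base: float) -> float:
        return (tokens_accel / secs_accel) / (tokens_base / secs_base)

    print(round(speedup(1024, 10.0, 1024, 24.0), 2))   # -> 2.4, in the range reported above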

License

This repository is licensed under the Apache-2.0 License.

Please follow the model licenses to use the corresponding model weights: LLaMA-2 / Vicuna / Falcon

Citation

If you find this project useful in your research, please kindly cite:

@article{lin2024bita,
  title={BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models},
  author={Lin, Feng and Yi, Hanling and Li, Hongbin and Yang, Yifan and Yu, Xiaotian and Lu, Guangming and Xiao, Rong},
  journal={arXiv preprint arXiv:2401.12522},
  year={2024}
}

Acknowledgement

This repository greatly benefits from LLaMA-Factory. We extend our gratitude for their outstanding contributions.

Contact

Please feel free to reach out if you have any questions! Email: lin1993@mail.ustc.edu.cn