Computer Science > Machine Learning

arXiv:2410.01623 (cs)

[Submitted on 2 Oct 2024 (v1), last revised 12 Oct 2024 (this version, v2)]

Title:Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?

Authors:Xi Chen, Kaituo Feng, Changsheng Li, Xunhao Lai, Xiangyu Yue, Ye Yuan, Guoren Wang

Abstract:Low-rank training has emerged as a promising approach for reducing memory usage in training Large Language Models (LLMs). Previous methods either rely on decomposing weight matrices (e.g., LoRA), or seek to decompose gradient matrices (e.g., GaLore) to ensure reduced memory consumption. However, both of them constrain the training in a low-rank subspace, thus inevitably leading to sub-optimal performance. This raises a question: whether it is possible to consistently preserve the low-rank constraint for memory efficiency, while achieving full-rank training (i.e., training with full-rank gradients of full-rank weights) to avoid inferior outcomes? In this paper, we propose a new plug-and-play training framework for LLMs called Fira, as the first attempt to achieve this goal. First, we observe an interesting phenomenon during LLM training: the scaling impact of adaptive optimizers (e.g., Adam) on the gradient norm remains similar from low-rank to full-rank training. Based on this observation, we propose a norm-based scaling method, which utilizes the scaling impact of low-rank optimizers as substitutes for that of original full-rank optimizers to enable full-rank training. In this way, we can preserve the low-rank constraint in the optimizer while achieving full-rank training for better performance. Moreover, we find that there are sudden gradient rises during the optimization process, potentially causing loss spikes. To address this, we further put forward a norm-growth limiter to smooth the gradient via regulating the relative increase of gradient norms. Extensive experiments on the pre-training and fine-tuning of LLMs show that Fira outperforms both LoRA and GaLore, achieving performance that is comparable to or even better than full-rank training.

Comments:	Add further analysis of the scaling factor, code is available at: this https URL
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2410.01623 [cs.LG]
	(or arXiv:2410.01623v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2410.01623

Submission history

From: Kaituo Feng [view email]
[v1] Wed, 2 Oct 2024 14:58:27 UTC (9,517 KB)
[v2] Sat, 12 Oct 2024 08:44:01 UTC (9,518 KB)

Computer Science > Machine Learning

Title:Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Machine Learning

Title:Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators