Computer Science > Computation and Language

arXiv:2404.00929 (cs)

[Submitted on 1 Apr 2024 (v1), last revised 6 Jun 2024 (this version, v2)]

Title: A Survey on Multilingual Large Language Models: Corpora, Alignment, and Bias

Authors: Yuemei Xu, Ling Hu, Jiayi Zhao, Zihan Qiu, Yuqi Ye, Hanwen Gu

Abstract: Building on Large Language Models (LLMs), Multilingual Large Language Models (MLLMs) have been developed to address multilingual natural language processing tasks, with the aim of transferring knowledge from high-resource to low-resource languages. However, significant limitations and challenges remain, such as language imbalance, multilingual alignment, and inherent bias. In this paper, we provide a comprehensive analysis of MLLMs, examining these critical issues in depth. First, we present an overview of MLLMs, covering their evolution, key techniques, and multilingual capacities. Second, we explore the multilingual corpora widely used for training MLLMs and the multilingual datasets for downstream tasks that are crucial for enhancing the cross-lingual capability of MLLMs. Third, we survey existing studies on multilingual representations and investigate whether current MLLMs can learn a universal language representation. Fourth, we discuss bias in MLLMs, including its categories and evaluation metrics, and summarize existing debiasing techniques. Finally, we discuss existing challenges and point out promising research directions. By covering these aspects, this paper aims to facilitate a deeper understanding of MLLMs and their potential in various domains.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2404.00929 [cs.CL]
	(or arXiv:2404.00929v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2404.00929

Submission history

From: Ling Hu
[v1] Mon, 1 Apr 2024 05:13:56 UTC (3,034 KB)
[v2] Thu, 6 Jun 2024 16:04:15 UTC (4,536 KB)
