Federated and Edge Learning For Large Language Models
Information Fusion
Keywords: Edge learning; Edge computing; Federated learning; Large language models; Natural language processing

Abstract: As the demand for sophisticated language models (LMs) continues to grow, the necessity to deploy them efficiently across federated and edge environments becomes increasingly evident. This survey explores the nuanced interplay between federated and edge learning for large language models (LLMs), considering the evolving landscape of distributed computing. We investigate how federated learning paradigms can be tailored to accommodate the unique characteristics of LMs, ensuring collaborative model training while respecting privacy constraints inherent in federated environments. Additionally, we scrutinize the challenges posed by resource constraints at the edge, reporting on relevant literature and established techniques within the realm of LLMs for edge deployments, such as model pruning or model quantization. The future holds the potential for LMs to leverage the collective intelligence of distributed networks while respecting the autonomy and privacy of individual edge devices. Through this survey, the objective is to provide an in-depth analysis of the current state of efficient and privacy-aware LLM training and deployment in federated and edge environments, with the aim of offering valuable insights and guidance to researchers shaping the ongoing discussion in this field.
1. Introduction

‘‘Language is a process of free creation; its laws and principles are fixed, but the manner in which the principles of generation are used is free and infinitely varied. Even the interpretation and use of words involves a process of free creation’’.
Noam Chomsky

Expressing and communicating through language is a fundamental human ability that begins to develop in early childhood and continues to evolve throughout a lifetime [1]. Unlike humans, machines lack the innate capacity to comprehend and communicate in human language unless empowered with sophisticated artificial intelligence (AI) algorithms. Since the proposal of the Turing Test in the 1950s, overcoming this challenge has been a longstanding pursuit in research, aiming to equip machines with the capability to read, write, and communicate like humans [2]. Language modeling, a crucial task in natural language processing (NLP), stands out as a significant approach in enhancing the language intelligence of machines by enabling them to predict the next word or character in a given text sequence [3,4], thus allowing the model to produce new text and complete sentences, among its diverse applications.

Language models (LMs) can be mainly categorized into four categories, as depicted in Fig. 1: statistical language models, machine learning (ML) models, deep learning (DL) models, and transformer-based models. Early language models relied on basic statistical methods that estimated word sequence probabilities through frequency counts [5]. Examples of probability-based LMs include n-grams [6], Hidden Markov Models (HMMs) [7], and Maximum Entropy Models [8]. N-grams, as an example, are sequences of neighboring words or tokens utilized to predict the likelihood of the subsequent word based on preceding ones [9]. Although regarded as basic by modern criteria, these models represented a crucial initiation in comprehending natural language, enabling fundamental text generation and word prediction but having constraints in grasping intricate contextual associations [10–12]. Then a shift toward data-driven methodologies occurred [13], and researchers explored ML algorithms to enhance language understanding [14]. Models such as Support Vector Machines (SVMs) exemplify this shift [15]. ML models brought a more sophisticated approach to NLP tasks, enabling the development of applications like spam detection [16] and sentiment analysis [17]. The availability of large-scale Twitter datasets, in particular, revolutionized real-time sentiment analysis [18]. The rise of DL, along with the availability of extensive public datasets [19] and powerful computing devices [20] capable of processing large amounts of complex data, represented a crucial juncture in the advancement of LMs [21].
Fig. 1. A hierarchical representation of LMs, illustrating the commonly adopted classification, reporting some solutions ranging from statistical-based models to traditional machine
and deep learning-based ones, and finally to the latest transformer-based models.
Notably, neural networks, specifically Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, capable of capturing intricate features and long-range dependencies within textual content, gained prominence in this era [22]. This phase significantly enhanced the models' capacity to comprehend context, rendering them well-suited for tasks such as machine translation and speech recognition [23,24]. Nonetheless, DL encountered challenges related to vanishing gradients [25] and long-term dependencies [26], thereby constraining its overall effectiveness.

However, it was not until the introduction of the Transformer architecture in the influential work ‘‘Attention is All You Need’’ in 2017 [27] that a truly groundbreaking leap occurred in the realm of LLMs. Founded on the self-attention mechanism [28] and frequently pre-trained on extensive text corpora, Transformers empowered models to encompass the entire context of a sentence or document, fostering genuine contextual understanding [29], and leading to a revolution in applications like chatbots [30], text summarization [31], and language translation [32]. To distinguish LMs based on parameter scales, the research community has coined the term ‘‘large language models’’ (LLMs) for pre-trained language models (PLMs) with substantial sizes, often containing tens or hundreds of billions of parameters. An essential characteristic of LLMs is their capacity to handle vast amounts of data, including unstructured text, and capture semantic relationships between words and phrases [33]. Furthermore, these models can process various types of data, such as visual [34], audio [35], audiovisual [36], and multi-modal data [37], learning the semantic connections among them.

This breakthrough paved the way for the development of state-of-the-art models, exemplified by OpenAI's GPT (Generative Pre-trained Transformer) series [38], PaLM (Pathways Language Model) [39], and Google's Gemini [40] and Gemma [41]. Additionally, LLM developments have expanded into more specialized fields [42,43], with models created for tasks including code production [44], scientific research [45], website building [46], and medical language processing [47]. The process of crafting and tuning prompts in natural language to optimize the performance of LLMs for specific tasks is termed prompt engineering. It involves strategically constructing input prompts to guide AI models towards producing more accurate, pertinent, and valuable responses. Effective prompt engineering can substantially enhance LLMs' effectiveness in particular tasks by furnishing clear instructions and contextual cues to steer the model's output. Moreover, prompt engineering aids in mitigating ‘‘catastrophic forgetting’’, wherein an LLM may lose previously acquired knowledge when fine-tuning for a new task, as well as the occurrence of hallucinations, wherein AI models, particularly LLMs, produce irrelevant, implausible, or nonsensical outputs. While hallucinations and prompting might indeed be fascinating topics, this survey does not delve into the specifics of these aspects. In order to guarantee responsible and equitable use, efforts have also been made to address ethical concerns [48], interpretability [49], and minimizing biases in LLMs [50].

However, LLMs' enormous resource requirements pose obstacles. A paradigm shift has been spurred by the cost of training on cloud servers with strong Graphical Processing Unit (GPU) clusters and the latency of cloud-based inference. There are many different reasons to move LLM inference to the edge [51], and these reasons are influenced by variables unique to LLMs, original equipment manufacturer considerations, and industry dynamics. The primary factor pushing LLMs to the edge is the decrease in reliance on connectivity [51]. Edge-deployed LLMs, in contrast to their cloud-based counterparts, can operate without any or very little network access. As an essential component of an ideal user experience in LLM-based apps, latency drives edge migration as well. Reaction times can be significantly reduced by locally conducted inference, which offers a far better user experience than relying on the reliability and speed of a network connection. Also, by reducing the need to send sensitive data via networks, this method reduces the possibility of data breaches and gives consumers more control over their personal information. Customization appears as a driving force behind edge deployments, impacting both inference and training [52]. An LLM deployed at the edge can deeply understand a user's speech patterns, writing style, and more. Enhancing privacy is coupled with the capability for devices to customize models to align with specific personalities and behaviors, thereby crafting a uniquely tailored user experience. Scalability represents another key driver, as the widespread distribution of applications across a diverse array of devices is facilitated, avoiding the burden of overwhelming centralized servers, thanks to the growing prevalence of edge devices. With the in-depth exploration of parallel, distributed, and federated learning (FL) in recent years, numerous solutions in the realms of edge learning (EL) and federated learning have been suggested. These solutions aim to train, fine-tune, or facilitate the deployment of LLMs [53,54].

In light of these transformative trends, our survey becomes instrumental in comprehensively examining the current landscape and future trajectories of federated and edge learning within the domain of LLMs. The survey aims to delve into the nuances of how FL and EL are shaping the evolution of LLMs, exploring their impact on scalability, privacy, and user experience. By synthesizing insights from industry developments, research advancements, and emerging trends, this survey endeavors to provide a roadmap for the ongoing integration of federated and EL in the realm of LLMs. The remainder of this survey is structured as follows: Section 2 provides an introduction to LLMs, encompassing a concise history of state-of-the-art solutions and the prevalent approach to dealing with them, namely pre-training and fine-tuning. Additionally, it delves into the reasons and challenges
associated with LLMs in both EL and FL. Section 3 outlines the methodology employed in this survey, detailing the retrieval strategy for the discussed and reported papers. In Section 4, a deeper examination is undertaken, focusing on scrutinized papers related to LLMs and EL and/or FL. This section describes the solutions adopted to facilitate model deployment on the edge, as well as techniques within the federated realm. This section aims to shed light on the innovative approaches and practical strategies employed to address the unique constraints and opportunities presented by deploying LLMs in edge and federated environments. In Section 5, we venture into the realm of available datasets and open-source codes closely aligned with the research domain and pertinent to the surveyed papers. Through the presentation of numerous tables summarizing content and providing direct links to repositories, we aim to furnish readers with a comprehensive resource base for further exploration and utilization in their own endeavors. Conclusions close the survey by summarizing the key findings and insights gleaned from our investigation. Fig. 2 illustrates the structure of this survey.

2. Large language models

2.1. LLMs: then and now

The evolution of LLMs can be traced back to recent years, witnessing notable progress and breakthroughs with the introduction of the Transformer architecture and the launch of the GPT series. In 2017, Google proposed the Transformer model [27], leveraging the attention mechanism to learn longer-term dependencies in language and enabling parallel training on multiple GPUs [55]. This innovation facilitated the training of significantly larger models [56].

Following this development, in 2018, OpenAI adopted the novel neural network architecture for language modeling tasks, unveiling the inaugural GPT model, GPT-1 [57]. GPT-1 showcased substantial enhancements in commonsense reasoning, question answering, and text entailment compared to existing pre-trained language models. Despite its limitations, it laid the foundation for subsequent, more potent models, ushering in a new era of AI research and highly competitive exploration in LLMs.

In 2019, OpenAI released GPT-2 [58], boasting a parameter size ten times larger than GPT-1, totaling 1.5 billion parameters. By 2020, GPT-3 [59] was launched, standing out as one of the largest language models to date with an impressive 175 billion parameters. The GPT-3 family, particularly ChatGPT [60], gained widespread attention and popularity across various industries since its November 2022 release. In March 2023, GPT-4 [61] was unveiled, extending text input to fused multimodal inputs. GPT-4 demonstrated enhanced capabilities in handling complex tasks, exhibiting significant performance improvements and the ability to generate even more coherent and natural-sounding text compared to its predecessor. Simultaneously, other outstanding LLMs emerged during this period. Google's BERT [62], released in 2018 with 1.1 billion parameters, achieved SOTA results across 11 NLP tasks. In 2019, Facebook AI developed BART [63] and RoBERTa [64], which are improved versions of the BERT model. In the same year, Google developed XLNet [65] and T5 [66]. XLNet is a generalized autoregressive pre-training model that performs well on multiple NLP tasks. T5, or Text-to-Text Transfer Transformer, achieves impressive results on various benchmarks.

In June 2020, GShard [67] was introduced, tailored for distributed training of massive models and enabling efficient processing across multiple accelerators (such as GPUs or TPUs). In 2021, EleutherAI developed both GPT-Neo [68] and GPT-J [69]. GPT-Neo, a community-driven project, aims to create accessible and powerful language models, available in various sizes such as GPT-Neo 1.3B and GPT-NeoX with billions of parameters [70]. Notable LLMs from the same year include CodeX [71], Jurassic-1 [72], AnthropicLM [73], GLaM [74], and Gopher [75].

The year 2022 witnessed an explosion in the development of LLMs, with models like MT-NLG [76], InstructGPT [77], LaMDA [78], Chinchilla [79], PaLM [39], OPT [80], BLOOM/BLOOMZ [81,82], Minerva [83], Sparrow [84], Flan-PaLM [85], Galactica [86], AnthropicLM v4-s3 [73], OPT-IML [87], etc. These models ranged from 10B to 100B parameters, with pre-training data sizes reaching up to 1.4 trillion tokens.

In 2023, LLaMA was introduced by Touvron et al. [88], featuring a parameter range from 7 billion to 65 billion. LLaMA demonstrated outstanding performance in instruction-following tasks, with LLaMA-13B outperforming GPT-3 in various benchmarks. The same year also saw the launch of chatbots like Bard [89], Claude [73], and Gemini [40], along with further improved LLMs such as Vicuna [90], Jurassic-2 [91], Falcon 40B/180B [92], Gorilla [93], and Orca/Orca-2 [94,95], among others. In early 2024, the Google DeepMind team unveiled Gemma [41], the newest iteration of a lightweight open model.

In 2024, research related to LLMs remained popular, with many teams focusing on developing multimodal models to create more robust foundational systems. For instance, Stable LM 2 [96] was introduced for multilingual tasks and trained on data from seven languages. The Google DeepMind team also launched a new version of their lightweight open model, Gemma [41]. In March, Inflection-2.5 [97] was released, which enhanced the functionality of personal AI assistants while optimizing resource use during training. That same month, Claude 3 debuted, offering significant improvements over its predecessor, particularly in various cognitive tasks. Following this, LLaMA 3 was released in April, and GPT-4o [98] arrived in May as a multimodal AI capable of processing and generating text, audio, and visual content in real time. In July, Qwen2 [99] was released, building on the original Qwen model and incorporating several enhancements, including improved performance in chat applications and stronger multilingual capabilities. In August, the xAI team introduced Grok 2 [100], which is tailored specifically for users of the X platform. Most recently, in October, Gemini 1.5 Flash-8B [101] went into production, boasting enhanced speed and efficiency compared to earlier versions. Fig. 3 provides a timeline overview of both open and proprietary LLMs.

2.2. LLMs: pre-training then fine-tuning

LLMs share a common procedure in their training process, which involves pre-training on large text data corpora followed by task-specific fine-tuning [4]. The models are exposed to a variety of online texts during pre-training to acquire facts, grammar, reasoning skills, and a
certain amount of common sense knowledge [102,103]. They acquire a wide comprehension of language as a result of this process. The models are then fine-tuned on smaller datasets to customize them for specific uses. For example, ChatGPT is optimized for conversational settings, making it suitable for virtual assistants and chatbots [90,104]. Though less well-known, Llama and Falcon represent further developments or specialized variants, often created for particular use cases or research goals. Collectively, these models showcase cutting-edge advancements in NLP, enabling enhanced comprehension and human-like interactions attributed to the prowess of AI-driven language models [105–107]. The training process for models like ChatGPT, Llama, and Falcon [108,109] encompasses several crucial phases. The initial stage is pre-training, wherein these models undergo training on an extensive and diverse dataset of online text. This phase aims to instill grammar, vocabulary, context, and general knowledge [110] into the models' learning framework. The foundational architecture of the Transformer model plays a pivotal role in understanding the relationships between words within sentences. Following the pre-training phase, models undergo refinement using task-specific datasets tailored for particular objectives, such as text creation or discussion in the case of ChatGPT. Their proficiency in these specific tasks is honed through fine-tuning, employing hyperparameter optimization to maximize performance. Ethical considerations are integral to this process, aiming to mitigate unfavorable or biased outcomes. Furthermore, the training is a resource-intensive and iterative endeavor, subject to continuous monitoring and adjustments to enhance both performance and safety [111,112].

LLMs have progressed through various developmental phases, witnessing an evolution in both size and complexity. The GPT series, comprising GPT-1, GPT-2, and GPT-3 [113], has exhibited successive growth in the number of parameters. Beginning with a scale in the hundreds of millions for GPT-1, it has now reached a staggering 1.7 trillion parameters for GPT-4 [114]. This substantial increase in parameters facilitates enhanced language understanding and generation capabilities [115]. In a parallel vein, models inspired by BERT have also undergone advancements in pre-training strategies. Notable examples include ALBERT (A Lite BERT) [116] and RoBERTa [64], which have contributed to significant improvements in both performance and efficiency. Furthermore, in terms of training mode, there has been a notable trend towards embracing multi-modality. Take Gemini [40], for instance, which is designed to process various data types simultaneously, including text, images, audio, video, and even code, rather than exclusively relying on textual corpora. This transition has garnered substantial interest within the community. As illustrated in Fig. 4, the escalation in the size of LLMs correlates with a corresponding rise in hardware requirements. Indeed, GPUs and RAM are crucial hardware components for running LLMs. During both training and inference, LLMs typically rely on GPUs or tensor processing units (TPUs) [117,118]. These processors are particularly well-suited for handling the computational demands of transformer-based models, which are commonly used in LLMs. To function, they need a sizeable quantity of memory [118], indispensable for efficiently managing large datasets and model parameters during training. LLM training setups often require significant amounts of RAM, with DDR4 or DDR5 RAM, known for their high bandwidth and capacity, being recommended to prevent memory-related bottlenecks. For this reason, key considerations when selecting GPUs include factors such as memory capacity (VRAM), memory bandwidth, and CUDA cores (processing power), with high-end options like NVIDIA's Tesla series or GeForce RTX series being preferred for LLM training. Fast and high-capacity storage is crucial for managing the extensive data involved in LLM training, with Solid State Drives (SSDs), particularly NVMe SSDs, being favored over Hard Disk Drives (HDDs) due to their superior read and write speeds. Proper cooling solutions, such as high-performance fans or liquid cooling systems, are necessary [119] to prevent overheating resulting from the intense computational load of LLM training. A robust power supply unit (PSU) ensures consistent and sufficient power flow to all components. Sometimes, for training very large LLMs, distributed computing setups involving multiple GPUs or machines collaborating on training become essential, requiring networking infrastructure, specialized software frameworks (e.g., Horovod), and synchronization techniques to ensure efficient parallel processing.
Fig. 4. Comparison of traditional machine learning, deep learning, and LLMs for language modeling. LLMs emerge as more demanding across various dimensions, necessitating
huge data for training, extensive feature extraction, and exhibiting greater complexity, hardware demands, and reduced interpretability compared to traditional approaches.
Central Processing Units (CPUs) remain crucial for data preprocessing, model setup, and coordination, playing a significant role in tasks such as data loading and preprocessing. While a powerful multi-core CPU can accelerate these tasks, the actual training phase heavily relies on the parallel processing capabilities of GPUs.

2.3. LLMs deployment on the edge

In recent years, LLMs have made significant strides in AI, showcasing advanced capabilities in NLP. However, the resource requirements for training these models on cloud servers, particularly those equipped with extensive GPU clusters, incur substantial costs. Additionally, the inference of these models on cloud servers presents challenges such as notable latency, impacting the overall user experience and raising concerns about privacy and security [120,121].

To address these issues, there is a growing trend to prioritize edge inference deployments for LLMs in upcoming platforms. This shift towards edge computing aims to mitigate the drawbacks associated with centralized cloud-based approaches, focusing on optimizing resource utilization, minimizing latency, and addressing privacy and security considerations in the practical implementation of LLMs [122]. Nevertheless, the inherent constraints of edge devices, including restricted processing power, memory, and storage, pose considerable obstacles to the seamless integration of resource-intensive LLMs [123]. In this context, addressing the limitations of edge deployment becomes crucial for unlocking the full potential of LLMs in diverse applications [124]. In the following, we delve into the main drivers for deploying LLMs on the edge, while also addressing the complex challenges linked to such deployments, such as resource constraints, energy efficiency considerations, security implications, and compatibility issues.

2.3.1. The whys

• Reduced Connectivity Dependency: Cloud-based LLMs have typically relied on a steady network connection for smooth inference. However, shifting LLM inference to the edge allows applications to function seamlessly in environments with unreliable or no network connectivity. This not only addresses operational hurdles but also enables the deployment of applications in resource-limited settings.
• Low Latency for Enhanced User Experience: Many LLM-based applications rely on swift responses to ensure top-notch user experiences. The speed and reliability of the network connection are crucial factors influencing the responsiveness of cloud-based LLMs. By moving inference tasks to the edge, response times are substantially reduced, enhancing user experience, especially in applications requiring real-time interactions.
• Privacy and Data Security Through Edge Computing: Edge computing emerges as a pivotal enabler for augmenting privacy and data security in LLM applications. By processing data locally on the device, attack surfaces are substantially reduced compared to traditional cloud-based systems. This mitigates the risk of data breaches, as sensitive information is not transmitted over the network to remote servers. The incorporation of FL further bolsters these privacy-centric measures, ushering in a new era of secure and decentralized data processing.
• Personalization: Edge computing facilitates a higher degree of personalization in LLM applications. Devices gain the capability to finely tune models according to individual user personalities and habits. This approach ensures a more personalized and engaging user experience, as applications adapt dynamically to user preferences without reliance on centralized servers.
• Scalability in Edge Deployment: The scalability of edge devices plays a pivotal role in the widespread distribution of LLM applications. With edge devices deployed at scale, the distribution of applications across a diverse array of devices becomes feasible. This not only prevents overloading of central servers but also optimizes resource utilization, ensuring seamless scalability in response to growing user demands.

2.3.2. The challenges

• Resource Constraints: Efficiently running LLMs on edge devices presents a significant technical challenge due to inherent limitations in processing power, memory, and storage compared to robust cloud servers. Shrinking the size of LLMs without sacrificing performance is complex and requires sophisticated optimization and quantization techniques. Despite significant efforts in the AI industry, reducing LLM size is not just a preference but a necessity for successful deployment on the edge. This need is underscored by the incorporation of Neural Processing Units (NPUs), tailored for specific use cases, which play a vital role in the intricate landscape of edge computing.
• Energy Efficiency: Using resource-intensive models like LLMs on battery-powered edge devices raises a crucial concern: rapid battery drainage [119]. Developers and chip architects must meticulously optimize their designs to ensure energy efficiency [125]. The primary aim is to minimize any noticeable negative impacts on battery life, acknowledging the delicate equilibrium between computational demands and sustainable device operation. Achieving this balance requires a collaborative effort to improve algorithms, hardware architectures, and power management strategies [126].
• Security: The transition to edge computing offers the promise of improved data privacy compared to cloud-based models but also brings forth a unique set of challenges regarding data security on edge devices. The decentralized nature of edge computing necessitates strong measures to protect sensitive information processed locally. Therefore, implementing secure data storage protocols and encryption mechanisms becomes essential to counter potential threats and vulnerabilities in this distributed computing paradigm [127].
• Compatibility: The compatibility landscape poses a significant hurdle in the deployment of LLMs on edge devices. It is not guaranteed that LLMs will seamlessly integrate with all edge devices due to variations in hardware and software configurations. Developers play a pivotal role in ensuring compatibility by either crafting models capable of running on diverse configurations or by collaborating with hardware and software providers to
offer tailored solutions. The need for standardized approaches or customized adaptations becomes apparent to facilitate the widespread and effective deployment of LLMs across diverse edge computing environments.

2.4. LLMs within FL context

As technology continues to advance, FL is becoming increasingly important in enhancing the effectiveness, adaptability, and security of LLMs across various applications and industries. The collaborative nature of FL, along with its streamlined training methods and creative problem-solving capabilities, sets the stage for a transformative shift. The synergy between FL and LLMs not only helps them reach their full potential but also lays the foundation for a future where the seamless integration of these technologies plays a key role in advancing language processing and understanding.

2.4.1. The whys

Among the many reasons why LLMs may benefit from the federated approach, one notable advantage lies in the ability to create more customized models. For organizations seeking to fine-tune foundational models, accessing the necessary data is often a challenge due to its distribution across various departments, companies, and geographic regions. The scattered nature of this data, coupled with regulatory constraints on centralized data pooling, poses obstacles to traditional model refinement. However, FL presents a viable solution to this challenge. When used in conjunction with privacy technologies, FL allows organizations to access distributed data through an FL platform. This approach empowers organizations to drive better, more personalized models without the need to centralize or move the data. Importantly, it guarantees privacy to each data owner, fostering collaboration without compromising data security. Additionally, FL offers the added benefit of reducing the time and costs associated with centralizing data and establishing complex data sharing agreements. In particular, the key advantages of FL may be summarized as follows:

• Advanced security: FL prioritizes user privacy by sending only model updates, not raw data, to a central server. This decentralized approach aligns with privacy regulations, mitigates the risk of data breaches, and ensures data security. As data privacy gains significance, FL provides a solution that enables organizations to comply with data protection laws while harnessing the capabilities of LLMs [128].
• Scalability and convenience: FL's decentralized training spreads the computational workload, enhancing scalability and yielding substantial cost savings. By leveraging the computational power of diverse devices, FL makes fine-tuning a manageable and economically efficient process. This democratizes access to LLM benefits, particularly beneficial for organizations with limited resources.
• Adaptability: FL seamlessly addresses the challenge of continuously expanding datasets by integrating newly collected data into existing models. This ensures continuous improvement and adaptability to changing environments, making FL essential for the evolution of LLMs. In dynamic sectors like healthcare and finance, FL ensures LLMs stay relevant and practical, keeping pace with the latest information.
• Optimized user experience: FL tackles privacy and scalability concerns, enhancing the user experience by deploying models directly to edge devices. This speeds up model responses, minimizes latency, and ensures quick answers for users. Local deployment is particularly relevant in applications where immediate responses are critical, such as virtual assistants and interactive customer service, offering a practical solution to address user needs.

2.4.2. The challenges

While FL holds immense potential, its development for LLMs is still in a preliminary stage, primarily due to the following challenges:

• High demands: LLMs impose significant demands on memory, communication, and computational resources [129]. Traditional FL methods involve transmitting and training the entire LLM across multiple clients using their local data. However, the substantial size of LLMs introduces complexities related to storage, model transmission, and the computational resources needed for training or fine-tuning [130]. This challenge becomes particularly pronounced in scenarios with limited storage and computational capabilities, especially in cross-device FL.
• Proprietary LLMs: Proprietary LLMs pose challenges as clients do not own them. Allowing federated fine-tuning without accessing the entire model becomes necessary, particularly in closed-source LLMs. The ongoing debate about open-sourcing generative AI models has gained traction, especially following an incident where researchers instructed a proprietary generative AI system called MegaSyn [131] to create toxic molecules, some resembling known nerve agents. This raises a critical issue: opponents argue that open-sourcing generative AI may lead to misuse, while proponents believe that proprietary models concentrate too much power in the hands of a select few.

Despite these challenges, FL has the potential to overcome obstacles associated with using LLMs. Collaborative pre-training and fine-tuning enhance the robustness of LLMs, and efficient algorithms address memory, communication, and computation challenges [132]. Designing FL systems tailored to LLMs and harnessing decentralized data present exciting opportunities for the future.

3. Related literature survey

A thorough examination of existing literature focusing on articles introducing LLMs within edge and FL areas was conducted. In the following, we elaborate on the methodology employed for retrieving relevant literature and delineate the process for selecting articles.

3.1. Retrieval strategy

To systematically clarify the methodology used in this article's literature review, we strictly adhered to PRISMA standards [133]. PRISMA has emerged as the gold standard for systematic reviews and meta-analyses, providing a robust framework that ensures transparency, reliability, and reproducibility in our research pursuits. Initially, we meticulously determined search terms, search time horizon, and search scope, aligning them with the thematic focus of the literature review. Subsequent phases involved the scrutiny of titles and abstracts to identify articles meeting eligibility criteria, followed by comprehensive reviews of full texts for further assessment. Ultimately, the inclusion of articles was contingent on their alignment with the review's topic and their provision of valuable solutions or insights to address the research questions at hand.

3.1.1. Articles search

We searched articles based on the occurrence of terms in titles, abstracts, and keywords. Initially, our focus was on articles specifically addressing LLMs. Building on this, we broadened our search criteria to include terms associated with edge computing, such as edge learning, edge computing, and mobile edge computing. Simultaneously, we incorporated terms relevant to federated/distributed learning (DistL). The formulated search keywords comprised a combination of these terms (including their plural forms) and were structured as follows: TITLE-ABS-KEY: (("Large language models"
Table 1
Hardware configurations for experimental studies in the literature.

Hardware configuration | Relevant studies
NVIDIA GTX 1080 GPUs | [141]
NVIDIA RTX 2080 Ti GPUs | [142]
NVIDIA RTX 3080 GPUs | [143]
NVIDIA RTX 3090 GPUs | [142,144–146]
NVIDIA RTX 4070 GPUs | [147]
NVIDIA RTX 4090 GPUs | [147,148]
NVIDIA Tesla T4 GPUs | [149]
NVIDIA Tesla V100 GPUs | [142,150–152]
NVIDIA A10 GPUs | [153]
NVIDIA A100 GPUs | [141,142,149,154–171]
NVIDIA RTX A6000 GPUs | [163]
NVIDIA Titan RTX GPUs | [172]
Azure NDv5 H100 GPUs | [173]
NVIDIA Jetson AGX | [162]
NVIDIA Jetson NX | [147,170,174]
NVIDIA Jetson TX2 | [169,174,175]
NVIDIA Jetson Nano | [176]
Xiaomi 10 and Xiaomi 12 | [174]
Redmi 10X Pro, Redmi K50, Mi 10 Lite | [177]
Samsung S23 | [178]
Google Pixel 7 Pro, 8 Pro | [169,170,178–180]
Raspberry Pi 4B | [175,176]
Android devices | [181]
Snapdragon CPU and DSP, Apple M1, Microcontrollers | [176]
CUBOT X30 | [179]
OPPO Reno 6 | [182]

4.1.1. Edge fine-tuning vs. edge inference

Deploying LLMs at the edge offers a promising solution, enabling models to take advantage of the data proximity at the edge and mitigating various risks, such as latency and potential data leakage associated with cloud transmission. There are some research efforts dedicated to enabling LLMs to run on edge devices, which can be categorized as edge fine-tuning and edge inference.

• Edge fine-tuning. Fine-tuning LLMs on edge devices, especially mobile platforms like smartphones and laptops, demands a substantial allocation of memory resources. According to the findings in [177], fine-tuning a BERT model with a batch size of 8 on a Redmi 10X Pro smartphone resulted in a memory utilization of approximately 5 GB. Given the typically limited RAM capacity of contemporary mobile devices, dedicating a significant portion of memory to LLMs may hinder the concurrent running of other applications.
Despite the challenges, which cannot be underestimated, on-device training is a promising solution. It allows a pre-trained LLM to be personalized to the user's local data without sending it to the cloud [183]. PockEngine, as presented in [176], serves as an engine for edge training that can undergo fine-tuning on various edge devices. It incorporates a sparse backpropagation method for efficient low-latency edge training. PockEngine facilitates deployment on edge devices and ensures the capability to fine-tune models on such devices with the support of consumer-level GPUs. In the attempt to personalize LLMs for edge devices, [153] introduces a scheme that autonomously identifies and stores the most relevant data in a self-supervised manner, with the selected data having a reduced memory footprint suitable for user annotations during fine-tuning. Another approach, proposed by [141], establishes a collaborative fine-tuning framework where edge users leverage their local data to train the initial layers of the adapter, while the remaining layers remain frozen. The server then receives these trained parameters and updates the subsequent layers. Similarly, NetGPT [184] enables cloud–edge collaborative training by deploying smaller LLMs at the edges and larger LLMs in the cloud. In the work of [178], cascading is utilized to combine mobile agents with server models, and performance on text rewriting tasks is demonstrated through instruction tuning of on-device models. Recognizing the underutilization of consumer-level GPUs, [143] proposes a strategy where the workload of LLMs is decomposed and distributed across devices with restricted memory capacity. Experimental results indicate that 50 RTX 3080 GPUs can match the throughput of 4 H100 GPUs, yielding substantial cost savings.

• Edge inference. Model inference refers to using a trained model to predict or classify unseen data. Edge inference does not require a complicated training process and is easier to implement than edge fine-tuning. However, in edge inference, important indicators that must be paid attention to are memory usage and inference latency [185]. It is crucial to take corresponding measures to reduce the model size [186]. Moreover, to minimize inference latency, it is essential to integrate LLMs into devices situated near the end-user [187]. Existing programs have taken the lead in adopting such measures. llama.cpp [188] is a program developed in C++ that quantizes the original LLaMA model using 4-bit integers, enabling inference on edge devices with limited capacity such as a MacBook. The program MLC LLM [189] leverages compilation technology to facilitate local development, optimization, and deployment of LLMs on personal devices. This allows the execution of quantized 7B models on smartphones. In this line of work, efficient inference engines for edge devices have been implemented and have produced impressive results. EdgeMoE [175] serves as a device-side engine where non-expert weights reside in the device's memory, while expert weights are stored externally and loaded into memory only when activated. This model partitioning not only conserves memory but also enhances computing efficiency. Another device-side inference engine, LLMCAD [174], utilizes compact LLMs with smaller memory footprints to generate simple tokens, while a high-precision LLM is employed to verify the accuracy of those tokens. In the work of [148], a staged speculative decoding method is introduced to expedite on-device inference for small batches. Experimental results, using the GPT-2-L model with a parameter size of 762M, demonstrate a 3.16 times reduction in latency. [190] propose Agile-Quant, an activation-guided quantization framework to accelerate inference of LLMs on edge devices. It combines a simplified activation quantization with an activation-aware tag-trimming technique to reduce outliers and improve attention efficiency. Using SIMD-based 4-bit multipliers and optimized TRIP matrix multiplication, Agile-Quant delivers up to 2.55x speedup over FP16 models in 8-bit and 4-bit weight quantization scenarios on various edge devices.
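The draft-then-verify idea behind these decoding schemes can be illustrated with a short sketch. The snippet below is a simplified, greedy variant of speculative decoding written against the Hugging Face transformers API; the checkpoints (gpt2 as the on-device drafter, gpt2-large as the verifier) and the proposal length k are illustrative placeholders, not the configurations used in [148] or [174].

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
draft = AutoModelForCausalLM.from_pretrained("gpt2").eval()         # small on-device drafter
target = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()  # larger verifier

@torch.no_grad()
def generate(prompt, max_new_tokens=32, k=4):
    ids = tok(prompt, return_tensors="pt").input_ids
    prompt_len = ids.shape[1]
    while ids.shape[1] - prompt_len < max_new_tokens:
        # 1) the draft model proposes k tokens greedily
        draft_ids = ids
        for _ in range(k):
            logits = draft(draft_ids).logits[:, -1, :]
            draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)
        proposed = draft_ids[:, ids.shape[1]:]
        # 2) a single forward pass of the target model scores all k proposals at once
        tgt_logits = target(draft_ids).logits
        tgt_choice = tgt_logits[:, ids.shape[1] - 1:-1, :].argmax(-1)
        # 3) keep the longest prefix on which both models agree, then add one target token
        agree = (tgt_choice == proposed).long().cumprod(dim=-1)
        n_accept = int(agree.sum())
        ids = torch.cat([ids, proposed[:, :n_accept]], dim=-1)
        next_tok = tgt_logits[:, ids.shape[1] - 1, :].argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_tok], dim=-1)
    return tok.decode(ids[0], skip_special_tokens=True)

print(generate("Running language models on edge devices"))
```

Each loop iteration costs one forward pass of the large model regardless of how many drafted tokens are accepted, which is where the latency reduction reported for small-batch, on-device settings comes from.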
Fig. 6. Edge deployment and application scenes of LLMs in smart cities, smart transportation, and smart manufacturing. Scenario 1: Driving Analysis System (DAS) offers
personalized driving behavior analysis by utilizing real-time data, and delivering tailored insights and recommendations to Mr. Z. Scenario 2: Robert reports to the engineer about
the operation of the equipment by examining sensor data and maintenance records. Scenario 3: Alice responds to a user’s request for assistance with the work schedule, prompting
the user to provide additional details.
The scenarios depicted in Fig. 6 highlight various possibilities for edge deployment. The evolution of LLMs towards EL presents substantial opportunities for enhancing domains such as smart cities [191], smart transportation [192], and smart manufacturing [193]. For example, in scenario 1, the edge device handles the training of initial layers, thereby reducing the need for local training [194]. In scenario 3, a range of model compression techniques are utilized to optimize the model, thus improving the efficiency of both edge training and edge inference [195].

4.1.2. Model compression: enabling edge computing

Presently, a prevalent strategy in research to tailor LLMs for edge computing involves compressing the model [196]. This compression is achieved through various methods, with model pruning, model quantization, and knowledge distillation (KD) being the most commonly utilized techniques.

• Model quantization. This process aims to reduce the size of LLMs by modifying how model weights are stored [197]. Typically, deep neural network weights are stored as 32-bit floating-point numbers. Quantization, as discussed in the literature [198], involves using tensors with fewer bits, commonly reducing them to 16 bits, 8 bits, or 4 bits. Model quantization is further categorized into weight-only quantization and weight-activation quantization based on the quantization scope [159,199]. This distinction is essential because the activation function is more sensitive to quantization. However, retaining softmax without quantization or maintaining higher accuracy might introduce additional latency [200]. The quantization process is further detailed based on execution steps, with two main approaches: post-training quantization (PTQ) and quantization-aware training (QAT) [201]. PTQ involves converting a pre-trained model into a quantized version without additional training, making it faster and more cost-effective. In contrast, QAT employs an extended training process that simulates quantization effects, ensuring the model's adaptability to reduced precision without compromising performance.
The studies conducted by [105,202] demonstrate that Transformer models like BERT and GPT-3 can significantly enhance memory efficiency and speed by reducing the precision of weights and activations to INT8. Another innovative approach, presented in [163], introduces Sparse Quantized Representation (SpQR), which encodes and decodes weights during quantization, resulting in a 15% speedup for running LLMs. To further enhance the flexibility of quantization, [150] introduces the Quantization-Aware Low-Rank Adaptation (QA-LoRA) technique. This method involves quantizing LLM weights to INT4 during fine-tuning, incorporating both LLM and auxiliary weights into the quantized model without compromising accuracy. In the pursuit of optimizing weight quantization, [203] explores the use of low-precision floating-point (FP) numbers (FP8 and FP4) for LLMs. Interestingly, the study reveals that FP4 achieves equivalent performance to INT4, and in models with over 1 billion parameters, the performance advantage of FP8 over INT8 becomes more pronounced. Recognizing the complexity and diversity of tensor distributions in quantization, [204,205] highlight the importance of using different quantization formats for different layers and recommend adopting a mixed-format approach to achieve optimal results.
Model quantization is presently one of the most widely adopted compression methods, effectively mitigating both model storage requirements and inference overhead [173]. Nevertheless, reducing model parameters from floating-point representation to lower bit widths may lead to a sacrifice in accuracy [206].
• Model pruning. Model pruning reduces model size by shaping the weights of LLMs. Two prominent types of model pruning, namely structured pruning and semi-structured pruning, are outlined in [142]. Structured model pruning works by reducing the number of layers in the model or attention heads in a Transformer, as illustrated by [207]. On the other hand, semi-structured model pruning removes specific weights by setting them to zero, as explained by [208]. Typically, weight saliency metrics are employed in related studies to quantify the loss of accuracy resulting from model pruning.
In their work, [142] performed unstructured weight pruning on the BERT model during both the pre-training and fine-tuning stages. They demonstrated a remarkable 10 times compression in model size, leading to a tenfold acceleration in CPU inference with only a 1% sacrifice in accuracy. While weight pruning proves effective for sparse LLMs, it often necessitates multiple rounds of fine-tuning or retraining to ensure optimal performance [209]. Given the substantial sizes of LLMs and the extensive datasets required for their training, the prospect of repeated retraining poses a challenge. In response, recent research efforts have shifted towards one-shot unstructured pruning without the need for additional fine-tuning [210–213]. Additionally, [156] introduces an iterative weight pruning method for sparse LLMs, aiming to minimize the reconstruction error between dense and sparse LLMs.
Model pruning is a potentially effective technique, but applying it requires a structural understanding of the model. Some model architectures may not be suitable for pruning because the complex structure and dependencies in the original model may be destroyed during the pruning process, resulting in performance degradation. Furthermore, determining the optimal pruning ratio is a challenge, as over- or under-pruning may affect model performance [214].
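A minimal sketch of the unstructured, magnitude-based variant is shown below, using PyTorch's pruning utilities on a toy Transformer encoder. The one-shot methods cited above use more sophisticated saliency criteria than raw weight magnitude; the layer sizes and 50% sparsity target here are arbitrary illustrative choices.

```python
import torch
import torch.nn.utils.prune as prune

# Toy stand-in for a stack of Transformer blocks; real LLM checkpoints are pruned
# layer by layer in the same way, usually with smarter saliency scores.
layer = torch.nn.TransformerEncoderLayer(d_model=256, nhead=4, dim_feedforward=1024)
model = torch.nn.TransformerEncoder(layer, num_layers=4)

for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)  # zero the smallest 50%
        prune.remove(module, "weight")                            # bake the mask into the weights

total = sum(p.numel() for p in model.parameters())
zeros = sum((p == 0).sum().item() for p in model.parameters())
print(f"overall sparsity: {zeros / total:.1%}")
```

Note that the zeroed weights only translate into memory or latency savings when paired with sparse storage formats or kernels, which is part of why the cited works pay so much attention to hardware-friendly (semi-)structured patterns.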
• Knowledge distillation. The concept of knowledge distillation (KD), initially introduced by [215], involves using a larger teacher model to guide the training of a smaller student model. In the context of challenges faced by LLMs in deployment on resource-constrained devices, KD has gained significant attention as an effective method for compressing models. Broadly, KD can take place in two stages: pre-training and fine-tuning of LLMs. Task-agnostic KD, as indicated by [216–218], refers to distillation performed in the pre-training stage, while task-specific KD, as mentioned by [219–221], involves first fine-tuning the LLM for downstream tasks and then distilling it. Both approaches typically entail comparing the output distributions of the student and teacher models.
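That comparison of output distributions is usually implemented as a soft-target loss. The sketch below shows the classic formulation, a temperature-softened KL term blended with the ordinary cross-entropy term; the temperature and mixing weight are illustrative defaults rather than values recommended by the works cited here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # soft targets: KL divergence between temperature-softened distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # hard targets: ordinary cross-entropy on the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# toy usage: a batch of 8 examples over a 100-token vocabulary
s = torch.randn(8, 100, requires_grad=True)   # student outputs
t = torch.randn(8, 100)                       # teacher outputs (frozen)
y = torch.randint(0, 100, (8,))
loss = distillation_loss(s, t, y)
loss.backward()
```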
In a related study, [222] introduced a method named pQRNN, utilizing a pre-trained mBERT fine-tuned for semantic parsing tasks as a teacher model. The experimental results demonstrated a student model performance of 95.9% compared to the teacher model, accompanied by a reduction in model size by a factor of 350. Another approach proposed by [223] involves distilling knowledge from a BERT model into a single-layer BiLSTM, reducing model parameters by approximately 100 times. This resulted in a reduction of inference times by a factor of 15 on tasks such as paraphrasing, natural language reasoning, and sentiment classification.

It is crucial to acknowledge that while KD proves effective for model compression, it has certain limitations when implemented with LLMs. The need to calculate the difference in output distributions between the teacher and student models can lead to an increased computational burden during training, especially on edge devices with constrained resources. Additionally, due to the reduced capacity of the student model, it may not comprehensively inherit all the knowledge of LLMs, resulting in the loss of some information [224].
4.2. LLMs & federated learning

Modern distributed computing techniques, such as FL, offer a means of model training without the need for centralized data, thereby providing a level of privacy protection [225]. In the context of LM applications, FL is particularly attractive due to its effectiveness in addressing the challenges posed by distributed data [225,226]. Many real-world scenarios necessitate the use of data on edge devices instead of centralized servers, and this decentralized approach significantly enhances the efficiency of user data protection [227,228]. In Section 4.2.1, we delve into a detailed discussion of research efforts focused on preserving privacy in LLMs. In addition, a typical FL deployment strategy involves utilizing a large public dataset stored on a central server to initially fine-tune the pre-trained base LLM. Subsequently, pre-tuned models serve as the initialization for client models, with clients utilizing their private datasets for further fine-tuning [229]. In Section 4.2.2, we examine typical methods for parameter-efficient fine-tuning in LLMs and explore how these methods are integrated with FL.
4.2.1. Privacy protection duces FedIT, a method that utilizes instructions stored on different local
The applications of LLM are interconnected with various aspects of devices and performs instruction tuning via FL. This approach ensures
our lives, necessitating a heightened focus on issues of data security privacy preservation and data security by leveraging instructions in the
and privacy protection [230,231]. For example, ChatDoctor [232] was fine-tuning process.
developed for online medical consultation. In this scenario, users are
required to transmit their medical information to a cloud system, 4.3. Distributed learning
inevitably risking the exposure of their data. Similarly, Chat-GPT is
widely used that aid users in answering questions, providing solutions, Distributed learning (DistL) involves distributing the computational
and even generating code. However, this model necessitates users to workload of a ML task across multiple computers or network nodes,
submit their queries to the server, raising concerns, especially involving particularly when handling resource-intensive computations [242].
personal information or confidential commercial data [145]. LLM based This approach encompasses two primary strategies: data parallelization
on FL presents new solutions to address these privacy challenges in and model parallelization. In data parallelization, the dataset is frag-
various sensitive scenarios. recent research indicates an emerging trend mented into smaller subsets, with each subset processed by a distinct
in integrating LLMs with the FL paradigm [227,233]. Horizontal FL is machine or node [160]. On the other hand, model parallelization
the most common form of FL, each client maintains the same model involves distributing various components of the model across multiple
and forms a new global model by aggregating these models. [233] pro- machines [243]. Each machine performs computations for a portion
poses the framework FewFedWeight, using BART-Base as a client and of the model’s operations, and their outputs are combined to produce
global model in FL, and experiments on 118 NLP tasks demonstrate its the final result. These methods enable the handling of larger models,
effectiveness in small-sample generalization and privacy preservation quicker training times, and improved fault tolerance of compute node
for multi-task learning. FedMLSecurity [168] is a benchmark dedicated failures [171]. Fig. 8 depicts schematic diagrams illustrating efficient
to attacks and defenses in federated LLM. It consists of two main LLM training via model parallelization. In (a), the standard split learn-
components, one simulates attacks injected during FL training, and ing approach in the distributed paradigm of LLM is shown, where
another one simulates defense mechanisms to mitigate the impact of the head sub-model is deployed on the edge device while the larger
attacks. This research work demonstrates different security problems tail sub-model is trained in the cloud [244,245]. In (b), complete
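To make the mechanics of PEFT-based FL concrete, the following sketch combines the open-source PEFT library with a FedAvg-style aggregation that touches only the LoRA adapter weights, mirroring the setting of Fig. 7(a) where every client holds the same base LLM. It is a minimal illustration under our own assumptions (model name, adapter configuration, number of clients and rounds are placeholders), not the exact procedure of [241] or FedIT.

# Minimal sketch: FedAvg over LoRA adapter weights only (assumes the
# `transformers` and `peft` libraries; model name and data are placeholders).
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

def make_client_model():
    base = AutoModelForCausalLM.from_pretrained("gpt2")        # identical base LLM on every client
    cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], task_type="CAUSAL_LM")
    return get_peft_model(base, cfg)                            # only the adapters are trainable

def adapter_state(model):
    # Clients transmit only the small LoRA tensors, never the frozen base weights.
    return {k: v.detach().cpu() for k, v in model.state_dict().items() if "lora_" in k}

def fedavg(states):
    return {k: torch.stack([s[k] for s in states]).mean(dim=0) for k in states[0]}

clients = [make_client_model() for _ in range(3)]
for communication_round in range(2):
    for m in clients:
        pass                                                    # local fine-tuning on private data would go here
    global_adapters = fedavg([adapter_state(m) for m in clients])
    for m in clients:
        m.load_state_dict(global_adapters, strict=False)        # broadcast the aggregated adapters

Because only the adapter tensors travel between clients and server, the per-round communication cost is a small fraction of the full model size, which is the core appeal of combining PEFT with FL.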
Fig. 7. Illustration of FL for LLMs, showcasing (a) the application of PEFT with clients possessing identical LLM architectures, and (b) the extension to federated training
accommodating models of varying sizes through KD.
Fig. 8. Distributed learning for LLMs. (a) Standard split learning: head sub-model on edge for local processing, larger tail sub-model trained in cloud for computational resources.
(b) Complete model parallel distributed training: model segments processed in parallel across distributed resources, enhancing training efficiency and resource use.
4.3. Distributed learning

Distributed learning (DistL) involves distributing the computational workload of an ML task across multiple computers or network nodes, particularly when handling resource-intensive computations [242]. This approach encompasses two primary strategies: data parallelization and model parallelization. In data parallelization, the dataset is fragmented into smaller subsets, with each subset processed by a distinct machine or node [160]. Model parallelization, on the other hand, distributes different components of the model across multiple machines [243]: each machine performs the computations for its portion of the model, and the outputs are combined to produce the final result. These methods enable the handling of larger models, quicker training times, and improved tolerance to compute-node failures [171]. Fig. 8 depicts schematic diagrams illustrating efficient LLM training via model parallelization. In (a), the standard split learning approach in the distributed paradigm of LLMs is shown, where the head sub-model is deployed on the edge device while the larger tail sub-model is trained in the cloud [244,245]. In (b), complete model parallel distributed training is shown [179].
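The split learning arrangement of Fig. 8(a) can be sketched in a few lines of PyTorch: a small head runs on the edge device and only its intermediate activations are handed to the tail hosted in the cloud. This is an illustrative single-process simulation under our own assumptions (layer counts, device names, and the transport of activations over a real network are placeholders, not the setup of [244,245]).

# Minimal sketch of split learning: head sub-model on the edge, tail in the cloud.
import torch
import torch.nn as nn

blocks = [nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(8)]
head = nn.Sequential(*blocks[:2])                       # lightweight part kept on the edge device
tail = nn.Sequential(*blocks[2:], nn.Linear(256, 10))   # compute-heavy part hosted in the cloud

edge_dev = torch.device("cpu")                          # stand-in for the edge device
cloud_dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
head.to(edge_dev); tail.to(cloud_dev)

x = torch.randn(4, 256, device=edge_dev)                # private input never leaves the edge
smashed = head(x)                                       # only these activations are transmitted
logits = tail(smashed.to(cloud_dev))                    # the cloud completes the forward pass
print(logits.shape)                                     # torch.Size([4, 10])

In a real deployment the call to .to(cloud_dev) would be replaced by sending the activations (and receiving gradients) over the network, so the raw user data stays on the device.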
DistL plays a crucial role in the training of LLMs, addressing the substantial computational challenges associated with processing extensive datasets and numerous parameters [246]. LLMs, especially those used in NLP tasks, often have billions or even trillions of parameters, demanding sophisticated techniques to distribute the computational workload effectively. Attempting to train such massive models on a single machine is impractical due to the sheer computational requirements and memory constraints: even the most powerful GPUs struggle to hold the parameters of these models in memory. Moreover, the sheer number of computing operations can lead to excessively long training times unless careful attention is paid to optimizing the algorithms, software, and hardware stack collectively [247].

This is where frameworks such as Megatron [248,249] and DeepSpeed [250] come into play, offering solutions to optimize the training of LLMs in distributed environments. Megatron (versions 1, 2, and 3), developed by the Applied Deep Learning Research team at NVIDIA, is a prominent, powerful framework specifically tailored for the distributed training of LLMs. It places a strong emphasis on model parallelism, enabling the distribution of model components across multiple GPUs. Megatron excels at handling models with trillions of parameters and is designed to make optimal use of GPU computing resources within heterogeneous clusters; by leveraging parallel processing, it significantly accelerates training, making it feasible to train massive language models efficiently. DeepSpeed, developed by Microsoft, is a DistL optimization library focused on the challenges of training large models. While it complements distributed learning, DeepSpeed goes further by providing optimization techniques such as ZeRO, 3D parallelism, DeepSpeed-MoE, and ZeRO-Infinity, among others. These optimizations aim to improve the efficiency of training large models, reduce memory requirements, and improve overall scalability. The combined use of Megatron's distributed training and model parallelism with DeepSpeed's optimization capabilities enabled the training of the Megatron-Turing NLG 530B (MT-NLG) model [76], the largest monolithic transformer language model at the time of its release, with 530 billion parameters. Succeeding Turing NLG 17B and Megatron-LM, MT-NLG has three times as many parameters as its predecessors and achieves strong accuracy across a diverse range of natural language tasks such as completion prediction, reading comprehension, commonsense reasoning, natural language inference, and word sense disambiguation.

Nonetheless, these widely used training frameworks encounter difficulties when training LLMs in a heterogeneous Network Interface Card (NIC) environment, i.e., when the devices exchanging data over the network are equipped with NICs of differing capabilities. In such clusters it is difficult to keep GPU utilization high, which results in suboptimal use of GPU computing resources [251] and hampers the frameworks' ability to harness the full potential of GPUs in diverse computing environments. To address this challenge, frameworks such as CoCoNet [152] and Holmes [165] have been specifically crafted to streamline the optimization of data, model, and pipeline parallel workloads in LLMs.
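As a concrete illustration of the DeepSpeed optimizations mentioned above, the snippet below wraps a Hugging Face model with a ZeRO stage-3 configuration. It is a minimal sketch under our own assumptions (the model name, batch size, optimizer settings, and CPU-offload choice are placeholders), and the script would be started with the deepspeed launcher rather than plain python.

# Minimal sketch of ZeRO-3 training with DeepSpeed (launch with: deepspeed train.py).
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")    # placeholder model

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {
        "stage": 3,                                     # partition optimizer state, gradients and parameters
        "offload_param": {"device": "cpu"},             # optionally push parameters to host memory
    },
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# Training loop (data loading omitted): loss = engine(**batch).loss
#                                       engine.backward(loss); engine.step()

ZeRO partitions the optimizer state, gradients, and (at stage 3) the parameters themselves across workers, which is why a model that cannot fit on one GPU can still be trained on a cluster of modest devices.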
Table 2
Open datasets for text generation and text classification (dataset name and relevant studies, grouped by type and task).

Text generation
  Natural language inference (Comprehensive):
    General Language Understanding Evaluation (GLUE) [144,175,176,253]
    Recognizing Textual Entailment [254]
    Multi-Genre Natural Language Inference (MultiNLI) [142,255]
    Natural Instructions [256]
    Alpaca [150,153]
    WikiText [144,164,175,179,200,203,204,229]
    Databricks Dolly 15k [153,155,172]
    Colossal Clean Crawled Corpus (C4) [163,203,228]
    RedPajama [163]
  Question answering:
    Stanford Question Answering Dataset (SQuAD) [142,144,145,169,253]
    Question-answering NLI (QNLI) [254]
    HellaSwag [146,154,204]
    Physical Interaction: Question Answering (PIQA) [146,204]
    Quora Question Pairs (QQP) [142,255]
    Boolean Questions (BoolQ) [154,254]
    GSM8K [154,155]
    Stack Overflow Dataset [228,229,241]
    Reddit Corpus [229]
  Semantic textual similarity:
    Microsoft Research Paraphrase Corpus (MRPC) [254]
    Corpus of Linguistic Acceptability (CoLA) [254]
  Text summarization:
    SAMSum [162,175]
  Cloze:
    LAMBADA [204]

Text classification
  Sentiment analysis:
    Stanford Sentiment Treebank (SST-2) [254,255,257]
    MPQA Opinion Corpus [254]
    Subjectivity dataset (SUBJ) [254]
    Movie Reviews (MR) [254]
    Yelp Review Polarity [169,225,257,258]
  Topic classification:
    AG News [169,225,257]
    YAHOO Dataset [169]
    TREC-10 [254]
These frameworks aim to improve the efficiency of handling diverse parallel workloads within LLMs, providing solutions to the complexities posed by heterogeneous computing environments, e.g., heterogeneous network interface cards. FlexModel [164] facilitates the processing of models distributed across multi-GPU and multi-node configurations, thereby enhancing the interpretability of distributed LLMs. LinguaLinked [179], in turn, is a system designed for decentralized, distributed LLM inference on mobile devices; tested extensively across a range of Android devices, it demonstrated an overall inference speedup ranging from 1.29x to 1.32x. An additional technique, intra-layer model parallelism, addresses device memory limitations when handling LLMs by partitioning individual layers or operators across multiple devices within a distributed cluster of accelerators [252].
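A toy version of intra-layer model parallelism is sketched below: the weight matrix of a single linear layer is split column-wise across two devices, each device computes its slice of the output, and the partial results are concatenated. Production systems such as Megatron perform this with distributed communication collectives across processes; this single-process fragment, with made-up sizes and device choices, is only meant to show the idea.

# Minimal sketch of intra-layer (tensor) model parallelism for one linear layer.
import torch

in_features, out_features = 512, 1024
full_weight = torch.randn(out_features, in_features)

dev0 = torch.device("cpu")
dev1 = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Column-parallel split: each device owns half of the output neurons.
w0 = full_weight[: out_features // 2].to(dev0)
w1 = full_weight[out_features // 2 :].to(dev1)

x = torch.randn(8, in_features)
y0 = x.to(dev0) @ w0.t()                        # partial output computed on device 0
y1 = x.to(dev1) @ w1.t()                        # partial output computed on device 1
y = torch.cat([y0.cpu(), y1.cpu()], dim=-1)     # gather the shards (an all-gather in real systems)
print(y.shape)                                  # torch.Size([8, 1024])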
5. Dataset and open-source codes

In this section, we provide an overview of the datasets and open-source codes in the surveyed papers. These datasets encompass a diverse range of types and purposes, offering opportunities for LLM customization. Furthermore, the availability of open-source code enhances the reproducibility of research endeavors and establishes benchmarks for future studies.

5.1. Dataset

The datasets used in the surveyed papers can be categorized into two main groups: text generation and text classification. Table 2 reports the datasets and the relevant studies within each type.

Text generation. This group encompasses datasets specifically tailored to facilitate text-generation tasks, serving as foundational resources for training LLMs focused on generating coherent and contextually relevant language. These datasets span various applications, including but not limited to multiple-choice tasks, dialogue generation for conversational agents, sentence completion challenges, and predicting the next word in a sequence, providing diverse avenues for exploring the intricacies of natural language generation. Question-answer (Q&A) sets consist of questions paired with corresponding responses; they are commonly employed to train models to perform question-answering tasks or to understand queries and generate appropriate responses. Semantic Textual Similarity (STS) datasets contain pairs of sentences or text fragments annotated with similarity scores indicating their semantic similarity or relatedness; they are widely used for training and evaluating NLP models in tasks such as paraphrase detection, duplicate detection, text similarity assessment, and information retrieval. Text summarization datasets pair documents or articles with summaries that concisely represent the main points or key information of the original text, enabling the development and evaluation of models capable of automatically condensing large amounts of information into informative summaries. A cloze dataset typically consists of sentences or passages with one or more words removed, the task being to predict the missing words from the surrounding context; such datasets are commonly used for evaluating language understanding and completion, as well as for training language models.

Text classification. This group encompasses a diverse array of datasets specifically curated to support text classification. These datasets serve as foundational resources for training LLMs to classify text, discerning and assigning predetermined attributes or classes based on the characteristics of the textual content. Within this collection, one finds datasets tailored for various applications, ranging from topic categorization and sentiment analysis to a myriad of other text classification tasks, thereby underpinning the advancement and efficacy of NLP techniques and algorithms. Sentiment analysis datasets contain text samples annotated with sentiment labels indicating the emotional polarity of the text, while topic classification datasets involve assigning topics or categories to text documents based on their content.
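As a brief how-to, several of the datasets in Table 2 (GLUE, SQuAD, SAMSum, AG News, among others) are distributed through the Hugging Face datasets hub; the snippet below loads the SST-2 subset of GLUE and tokenizes it for fine-tuning. The tokenizer checkpoint and preprocessing choices are placeholder assumptions, not the configuration of any surveyed paper.

# Minimal sketch: load an open benchmark dataset and tokenize it for fine-tuning.
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("glue", "sst2")                                # sentiment analysis subset of GLUE
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # placeholder checkpoint

def preprocess(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

tokenized = dataset.map(preprocess, batched=True)
print(tokenized["train"][0]["label"], tokenized["train"][0]["input_ids"][:10])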
Table 3
Open source projects in surveyed papers.
Paper Year Framework Problem Link
[144] 2021 Edge Model compression https://github.com/MohammadrezaBanaei/orientation_based_embedding_compression
[142] 2022 Edge Model compression https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT
[253] 2023 Edge System design https://github.com/SamsungLabs/Sparse-Multi-DNN-Scheduling
[150] 2023 Edge Model compression https://github.com/yuhuixu1993/qa-lora
[259] 2023 Edge Model compression https://github.com/microsoft/DeepSpeed
[163] 2023 Edge Model compression https://github.com/Vahe1994/SpQR
[147] 2023 Edge Model compression https://github.com/mit-han-lab/llm-awq
[156] 2023 Edge Fine-tuning https://github.com/zyxxmu/DSnoT
[159] 2023 Edge Model compression https://github.com/OpenGVLab/OmniQuant
[205] 2023 Edge Model compression https://github.com/lightmatter-ai/INT-FP-QSim
[130] 2024 Edge Fine-tuning https://github.com/ZO-Bench/ZO-LLM
[225] 2021 Federated Fine-tuning https://github.com/statDataAnalyzer/scaling_fl
[172] 2023 Federated Fine-tuning https://github.com/JayZhang42/FederatedGPT-Shepherd
[173] 2023 Federated Model compression https://github.com/Azure/MS-AMP
[244] 2023 Federated Split computing https://github.com/nishio-laboratory/lambda_split
[168] 2023 Federated Attacks in FL https://github.com/FedML-AI/FedML/tree/master/python/fedml/core/security
[161] 2023 Federated Fine-tuning https://github.com/yuelinan/FedJudge
[155] 2023 Federated PEFT https://github.com/alibaba/FederatedScope/tree/llm
[166] 2023 Federated Fine-tuning https://github.com/FederatedAI/FATE-LLM
[146] 2023 Federated Fine-tuning https://github.com/alibaba/FederatedScope/tree/fedsp/federatedscope/nlp/fedsp
[254] 2023 Federated Fine-tuning https://github.com/llm-eff/FedPepTAO
[127] 2024 Federated Attacks in FL https://github.com/FedML-AI/FedML/tree/master/python/fedml/core/security
[128] 2024 Federated Instruction tuning https://github.com/rui-ye/OpenFedLLM
[240] 2024 Federated Fine-tuning https://github.com/UbiquitousLearning/FwdLLM
[152] 2022 Distributed Model parallel https://github.com/parasailteam/coconet
[164] 2023 Distributed Model parallel https://github.com/VectorInstitute/flex_model
[260] 2023 Distributed Model parallel https://github.com/xuqifan897/Optimus
[246] 2024 Distributed Model parallel https://github.com/zjc664656505/LinguaLinked-Inference
Table 4
Open source projects without paper.
Project name About Link
LeapfrogAI Production-ready Generative AI for local, cloud native, airgap, and edge. https://github.com/defenseunicorns/leapfrogai
llama-utils The easiest & fastest way to run customized and fine-tuned LLMs locally or on the https://github.com/second-state/llama-utils
edge.
LLM API Fully typed & consistent chat APIs for OpenAI, Anthropic, Azure’s chat models for https://github.com/dzhng/llm-api
browser, edge, and node environments.
balena-serge Run an LLM on your edge device with balena.io. https://github.com/klutchell/balena-serge
Edge Infer EdgeInfer enables efficient edge intelligence by running small AI models, including https://github.com/unit-mesh/edge-infer
embeddings and OnnxModels, on resource-constrained devices like Android, iOS, or
MCUs for real-time decision-making.
llama4j An easy-to-use Java SDK for running LLaMA models on edge devices, powered by https://github.com/JavaLLM/llama4j
LLaMA.cpp.
llm-edge-web Web app for LLMs on EDGE devices. https://github.com/timothyoei/llm-edge-web
LLM InferenceNet LLM InferenceNet is a C++ based project designed to achieve fast inference from https://github.com/adithya-s-k/LLM-InferenceNet
LLMs by leveraging a client–server architecture.
llama.cpp Inference of LLMs in pure C/C++ https://github.com/ggerganov/llama.cpp
SplitLLM LLM, Vertical Federated Learning, and some funny experiments. https://github.com/zfscgy/SplitLLM
fedGPT An implementation of training nanoGPT through Federated Learning and https://github.com/aneesh-aparajit/fedGPT/tree/main
implementing Differential Privacy
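Several of the projects in Table 4 revolve around llama.cpp, which runs quantized GGUF models directly on CPUs and small edge boards. The fragment below uses the llama-cpp-python binding (our choice for illustration, not one of the listed projects) to load a quantized model and generate a completion locally; the model path and generation parameters are placeholders.

# Minimal sketch: on-device inference with a quantized GGUF model via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",   # placeholder path to a quantized model
    n_ctx=2048,                                     # context window kept small for constrained devices
    n_threads=4,                                    # match the edge device's CPU cores
)

out = llm("Q: What is federated learning? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])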
5.2. Open-source codes

Numerous authors have contributed open-source implementations of their proposed models, thereby fostering accessibility and collaboration within the research community. In Tables 3 and 4, we provide an overview of the open-source projects dedicated to addressing relevant challenges with LLMs. Notably, these research works widely leverage frameworks such as TensorFlow, PyTorch, and FATE, with Python and C++ as the languages of choice for building LLMs. This concerted effort toward openness and standardization promotes reproducibility and facilitates innovation and advancement in NLP research.

6. Open issues and conclusion

Navigating the landscape of LLMs in the context of edge and federated learning brings forth a set of challenges and opens avenues for intriguing future directions. These aspects are crucial to consider, as they shape the trajectory of advancements at this intersection of technologies. One of the primary challenges lies in adapting LLMs to the inherent resource constraints of edge devices: characterized by limited computational power and memory, these devices demand specialized optimization techniques to ensure that LLMs operate efficiently without compromising performance.

In the realm of FL, communication overhead emerges as a critical bottleneck. The process of transmitting model updates between edge devices and a central server can be hampered by unreliable or constrained networks, necessitating the exploration of more efficient communication protocols. Privacy considerations also loom large in FL scenarios, where models are trained locally on edge devices; balancing the need for model updates with user privacy becomes a delicate challenge, urging researchers to develop robust privacy-preserving mechanisms.
Interpreting the decisions made by LLMs at the edge presents a non-trivial task. Ensuring the transparency and interpretability of these models is essential, especially when they are deployed in applications where the rationale behind decisions is crucial, such as healthcare or finance.

The trajectory of research in this domain points towards the exploration of specialized optimization techniques for edge-deployed LLMs. Techniques encompassing model compression, quantization, and architectural modifications are envisioned to be at the forefront, catering to the unique resource constraints of edge devices. Efforts in FL should focus on refining communication protocols: techniques such as model sparsity and differential privacy may prove instrumental in reducing the volume of information transmitted between edge devices and central servers, mitigating communication overhead. Enhancing privacy-preserving mechanisms in FL is crucial for the widespread adoption of this paradigm; future research may delve into advanced cryptographic techniques or novel FL frameworks that prioritize user privacy without compromising the performance of the trained models. In the quest for interpretable models at the edge, research endeavors are anticipated to focus on providing insights into the decision-making process of LLMs. Real-time interpretability is paramount, especially in applications where understanding model decisions is critical for user trust and compliance with regulations.

In conclusion, the fusion of LLMs with edge and federated learning pushes researchers to address challenges collaboratively. The journey ahead involves not only overcoming hurdles but also shaping the landscape of distributed, privacy-aware, and interpretable LMs that cater to the evolving needs of diverse applications.

CRediT authorship contribution statement

Francesco Piccialli: Conceptualization, Formal analysis, Methodology, Supervision, Writing – original draft. Diletta Chiaro: Investigation, Methodology, Supervision. Pian Qi: Investigation, Resources, Visualization, Writing – original draft. Valerio Bellandi: Conceptualization, Writing – original draft, Writing – review & editing. Ernesto Damiani: Supervision, Validation, Writing – original draft.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

• PNRR project FAIR - Future AI Research (PE00000013), Spoke 3, under the NRRP MUR program funded by the NextGenerationEU.
• G.A.N.D.A.L.F. - Gan Approaches for Non-iiD Aiding Learning in Federations, CUP: E53D23008290006, PNRR - Missione 4 "Istruzione e Ricerca" - Componente C2 Investimento 1.1 "Fondo per il Programma Nazionale di Ricerca e Progetti di Rilevante Interesse Nazionale (PRIN)".

Data availability

No data was used for the research described in the article.

References

[1] S. Pinker, The Language Instinct: How the Mind Creates Language, Penguin UK, 2003.
[2] A.M. Turing, Computing machinery and intelligence, Creat. Comput. 6 (1) (1980) 44–53.
[3] K. Chowdhary, Natural language processing, Fundam. Artif. Intell. (2020) 603–649.
[4] B. Min, H. Ross, E. Sulem, A.P.B. Veyseh, T.H. Nguyen, O. Sainz, E. Agirre, I. Heintz, D. Roth, Recent advances in natural language processing via large pre-trained language models: A survey, ACM Comput. Surv. 56 (2) (2023) 1–40.
[5] N. Omar, Q. Al-Tashi, Arabic nested noun compound extraction based on linguistic features and statistical measures, GEMA Online® J. Lang. Stud. 18 (2) (2018).
[6] S. Diao, R. Xu, H. Su, Y. Jiang, Y. Song, T. Zhang, Taming pre-trained language models with n-gram representations for low-resource domain adaptation, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 3336–3349.
[7] V.A. Petrushin, Hidden Markov models: Fundamentals and applications, in: Online Symposium for Electronics Engineer, 2000.
[8] S. Khudanpur, J. Wu, Maximum entropy techniques for exploiting syntactic, semantic and collocational dependencies in language modeling, Comput. Speech Lang. 14 (4) (2000) 355–372.
[9] H. Wang, J. He, X. Zhang, S. Liu, A short text classification method based on N-gram and CNN, Chin. J. Electron. 29 (2) (2020) 248–254.
[10] R. Rosenfeld, Two decades of statistical language modeling: Where do we go from here? Proc. IEEE 88 (8) (2000) 1270–1278.
[11] E. Arisoy, T.N. Sainath, B. Kingsbury, B. Ramabhadran, Deep neural network language models, in: Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-Gram Model? On the Future of Language Modeling for HLT, 2012, pp. 20–28.
[12] J.R. Bellegarda, Exploiting latent semantic information in statistical language modeling, Proc. IEEE 88 (8) (2000) 1279–1296.
[13] F. Alva-Manchego, C. Scarton, L. Specia, Data-driven sentence simplification: Survey and benchmark, Comput. Linguist. 46 (1) (2020) 135–187.
[14] M. Malik, M.K. Malik, K. Mehmood, I. Makhdoom, Automatic speech recognition: A survey, Multimedia Tools Appl. 80 (2021) 9411–9457.
[15] J. Cervantes, F. Garcia-Lamont, L. Rodríguez-Mazahua, A. Lopez, A comprehensive survey on support vector machine classification: Applications, challenges and trends, Neurocomputing 408 (2020) 189–215.
[16] M. Crawford, T.M. Khoshgoftaar, J.D. Prusa, A.N. Richter, H. Al Najada, Survey of review spam detection using machine learning techniques, J. Big Data 2 (1) (2015) 1–24.
[17] M. Neethu, R. Rajasree, Sentiment analysis in Twitter using machine learning techniques, in: 2013 Fourth International Conference on Computing, Communications and Networking Technologies, ICCCNT, IEEE, 2013, pp. 1–5.
[18] A. Go, L. Huang, R. Bhayani, Twitter sentiment analysis, Entropy 17 (2009) 252.
[19] Q. Lhoest, A.V. del Moral, Y. Jernite, A. Thakur, P. von Platen, S. Patil, J. Chaumond, M. Drame, J. Plu, L. Tunstall, et al., Datasets: A community library for natural language processing, 2021, arXiv preprint arXiv:2109.02846.
[20] O. Sharir, B. Peleg, Y. Shoham, The cost of training NLP models: A concise overview, 2020, arXiv preprint arXiv:2004.08900.
[21] L. Deng, Y. Liu, A joint introduction to natural language processing and to deep learning, Deep Learn. Natl. Lang. Process. (2018) 1–22.
[22] W. Yin, K. Kann, M. Yu, H. Schütze, Comparative study of CNN and RNN for natural language processing, 2017, arXiv preprint arXiv:1702.01923.
[23] T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, S. Khudanpur, Recurrent neural network based language model, in: Interspeech, Vol. 2, Makuhari, 2010, pp. 1045–1048.
[24] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780.
[25] S. Hochreiter, Recurrent neural net learning and vanishing gradient, Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 6 (2) (1998) 107–116.
[26] S. Hihi, Y. Bengio, Hierarchical recurrent neural networks for long-term dependencies, Adv. Neural Inf. Process. Syst. 8 (1995).
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Adv. Neural Inf. Process. Syst. 30 (2017).
[28] P. Shaw, J. Uszkoreit, A. Vaswani, Self-attention with relative position representations, 2018, arXiv preprint arXiv:1803.02155.
[29] Q. Liu, M.J. Kusner, P. Blunsom, A survey on contextual embeddings, 2020, arXiv preprint arXiv:2003.07278.
[30] E. Adamopoulou, L. Moussiades, Chatbots: History, technology, and applications, Mach. Learn. Appl. 2 (2020) 100006.
[31] M. Allahyari, S. Pouriyeh, M. Assefi, S. Safaei, E.D. Trippe, J.B. Gutierrez, K. Kochut, Text summarization techniques: A brief survey, 2017, arXiv preprint arXiv:1707.02268.
[32] Y. Ge, W. Hua, J. Ji, J. Tan, S. Xu, Y. Zhang, OpenAGI: When LLM meets domain experts, 2023, arXiv preprint arXiv:2304.04370.
[33] K. Adnan, R. Akbar, An analytical study of information extraction from unstructured and multidimensional big data, J. Big Data 6 (1) (2019) 1–38.
[34] M. Awais, M. Naseer, S. Khan, R.M. Anwer, H. Cholakkal, M. Shah, M.-H. Yang, F.S. Khan, Foundational models defining a new era in vision: A survey and outlook, 2023, arXiv preprint arXiv:2307.13721.
[35] H. Zhang, X. Li, L. Bing, Video-LLaMA: An instruction-tuned audio-visual language model for video understanding, 2023, arXiv preprint arXiv:2306.02858.
[36] A. Rouditchenko, A. Boggust, D. Harwath, B. Chen, D. Joshi, S. Thomas, K. [63] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V.
Audhkhasi, H. Kuehne, R. Panda, R. Feris, et al., Avlnet: Learning audio- Stoyanov, L. Zettlemoyer, Bart: Denoising sequence-to-sequence pre-training
visual language representations from instructional videos, 2020, arXiv preprint for natural language generation, translation, and comprehension, 2019, arXiv
arXiv:2006.09199. preprint arXiv:1910.13461.
[37] Y. Zhao, Z. Lin, D. Zhou, Z. Huang, J. Feng, B. Kang, Bubogpt: Enabling visual [64] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis,
grounding in multi-modal llms, 2023, arXiv preprint arXiv:2307.08581. L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining
[38] B. Ghojogh, A. Ghodsi, Attention mechanism, transformers, BERT, and GPT: approach, 2019, arXiv preprint arXiv:1907.11692.
tutorial and survey, 2020. [65] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R.R. Salakhutdinov, Q.V. Le, Xlnet:
[39] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Generalized autoregressive pretraining for language understanding, Adv. Neural
Barham, H.W. Chung, C. Sutton, S. Gehrmann, et al., Palm: Scaling language Inf. Process. Syst. 32 (2019).
modeling with pathways, J. Mach. Learn. Res. 24 (240) (2023) 1–113. [66] A. Roberts, C. Raffel, K. Lee, M. Matena, N. Shazeer, P.J. Liu, S. Narang, W.
[40] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Li, Y. Zhou, Exploring the limits of transfer learning with a unified text-to-text
Schalkwyk, A.M. Dai, A. Hauth, et al., Gemini: a family of highly capable transformer, Tech. Rep., 2019, Google.
multimodal models, 2023, arXiv preprint arXiv:2312.11805. [67] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N.
Shazeer, Z. Chen, Gshard: Scaling giant models with conditional computation
[41] T.M. Gemma Team, C. Hardin, R. Dadashi, S. Bhupatiraju, L. Sifre, M. Rivière,
and automatic sharding, 2020, arXiv preprint arXiv:2006.16668.
M.S. Kale, J. Love, P. Tafti, L. Hussenot, et al., Gemma, 2024, http://dx.doi.
[68] S. Black, L. Gao, P. Wang, C. Leahy, S. Biderman, GPT-Neo: Large Scale
org/10.34740/KAGGLE/M/3301, URL https://www.kaggle.com/m/3301.
Autoregressive Language Modeling with Mesh-Tensorflow, 2021.
[42] K. Tirumala, D. Simig, A. Aghajanyan, A.S. Morcos, D4: Improving llm pre-
[69] B. Wang, A. Komatsuzaki, GPT-j-6B: A 6 billion parameter autoregressive
training via document de-duplication and diversification, 2023, arXiv preprint
language model, 2021.
arXiv:2308.12284.
[70] A. Andonian, Q. Anthony, S. Biderman, S. Black, P. Gali, L. Gao, E. Hallahan, J.
[43] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J.
Levy-Kramer, C. Leahy, L. Nestler, et al., GPT-NeoX: Large scale autoregressive
Spencer-Smith, D.C. Schmidt, A prompt pattern catalog to enhance prompt
language modeling in pytorch, 2021.
engineering with chatgpt, 2023, arXiv preprint arXiv:2302.11382.
[71] M. Chen, J. Tworek, H. Jun, Q. Yuan, H.P.d. Pinto, J. Kaplan, H. Edwards,
[44] F.F. Xu, U. Alon, G. Neubig, V.J. Hellendoorn, A systematic evaluation of large Y. Burda, N. Joseph, G. Brockman, et al., Evaluating large language models
language models of code, in: Proceedings of the 6th ACM SIGPLAN International trained on code, 2021, arXiv preprint arXiv:2107.03374.
Symposium on Machine Programming, 2022, pp. 1–10. [72] O. Lieber, O. Sharir, B. Lenz, Y. Shoham, Jurassic-1: Technical details and
[45] H. Wang, T. Fu, Y. Du, W. Gao, K. Huang, Z. Liu, P. Chandak, S. Liu, evaluation, White Paper, Vol. 1, AI21 Labs, 2021, p. 9.
P. Van Katwyk, A. Deac, et al., Scientific discovery in the age of artificial [73] Anthropic, Anthropiclm, anthropiclm v4-s3, and claude., 2021, 2022, and 2023,
intelligence, Nature 620 (7972) (2023) 47–60. URL https://www.anthropic.com/.
[46] J. Wang, Y. Huang, C. Chen, Z. Liu, S. Wang, Q. Wang, Software testing [74] N. Du, Y. Huang, A.M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou,
with large language model: Survey, landscape, and vision, 2023, arXiv preprint A.W. Yu, O. Firat, et al., Glam: Efficient scaling of language models with
arXiv:2307.07221. mixture-of-experts, in: International Conference on Machine Learning, PMLR,
[47] A.J. Thirunavukarasu, D.S.J. Ting, K. Elangovan, L. Gutierrez, T.F. Tan, D.S.W. 2022, pp. 5547–5569.
Ting, Large language models in medicine, Nat. Med. 29 (8) (2023) 1930–1940. [75] J.W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S.
[48] J. Cabrera, M.S. Loyola, I. Magaña, R. Rojas, Ethical dilemmas, mental health, Henderson, R. Ring, S. Young, et al., Scaling language models: Methods, analysis
artificial intelligence, and llm-based chatbots, in: International Work-Conference & insights from training gopher, 2021, arXiv preprint arXiv:2112.11446.
on Bioinformatics and Biomedical Engineering, Springer, 2023, pp. 313–326. [76] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, J. Casper, Z.
[49] A. Creswell, M. Shanahan, I. Higgins, Selection-inference: Exploiting large Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti, et al., Using deepspeed and
language models for interpretable logical reasoning, 2022, arXiv preprint arXiv: megatron to train megatron-turing nlg 530b, a large-scale generative language
2205.09712. model, 2022, arXiv preprint arXiv:2201.11990.
[50] E. Ferrara, Should chatgpt be biased? challenges and risks of bias in large [77] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang,
language models, 2023, arXiv preprint arXiv:2304.03738. S. Agarwal, K. Slama, A. Ray, et al., Training language models to follow
[51] Z. Lin, G. Qu, Q. Chen, X. Chen, Z. Chen, K. Huang, Pushing large language instructions with human feedback, Adv. Neural Inf. Process. Syst. 35 (2022)
models to the 6G edge: Vision, challenges, and opportunities, 2023, arXiv 27730–27744.
preprint arXiv:2309.16739. [78] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng,
[52] X.L. Dong, S. Moon, Y.E. Xu, K. Malik, Z. Yu, Towards next-generation A. Jin, T. Bos, L. Baker, Y. Du, et al., Lamda: Language models for dialog
intelligent assistants leveraging llm techniques, in: Proceedings of the 29th applications, 2022, arXiv preprint arXiv:2201.08239.
ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023, pp. [79] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford,
5792–5793. D.d.L. Casas, L.A. Hendricks, J. Welbl, A. Clark, et al., Training compute-optimal
large language models, 2022, arXiv preprint arXiv:2203.15556.
[53] Z. Cai, J. Chen, W. Chen, W. Wang, X. Zhu, A. Ouyang, F-codellm: A
[80] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab,
federated learning framework for adapting large language models to practical
X. Li, X.V. Lin, et al., Opt: Open pre-trained transformer language models, 2022,
software development, in: Proceedings of the 2024 IEEE/ACM 46th Interna-
arXiv preprint arXiv:2205.01068.
tional Conference on Software Engineering: Companion Proceedings, 2024, pp.
[81] B. Workshop, T.L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R.
416–417.
Castagné, A.S. Luccioni, F. Yvon, et al., Bloom: A 176b-parameter open-access
[54] H. Woisetschläger, A. Erben, S. Wang, R. Mayer, H.-A. Jacobsen, Federated fine-
multilingual language model, 2022, arXiv preprint arXiv:2211.05100.
tuning of llms on the very edge: The good, the bad, the ugly, in: Proceedings of
[82] N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, S. Biderman, T.L. Scao,
the Eighth Workshop on Data Management for End-To-End Machine Learning,
M.S. Bari, S. Shen, Z.-X. Yong, H. Schoelkopf, et al., Crosslingual generalization
2024, pp. 39–50.
through multitask finetuning, 2022, arXiv preprint arXiv:2211.01786.
[55] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-
[83] A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh,
end object detection with transformers, in: European Conference on Computer
A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, et al., Solving quantitative
Vision, Springer, 2020, pp. 213–229.
reasoning problems with language models, Adv. Neural Inf. Process. Syst. 35
[56] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, (2022) 3843–3857.
T. Rault, R. Louf, M. Funtowicz, et al., Transformers: State-of-the-art natural [84] A. Glaese, N. McAleese, M. Trębacz, J. Aslanides, V. Firoiu, T. Ewalds, M. Rauh,
language processing, in: Proceedings of the 2020 Conference on Empirical L. Weidinger, M. Chadwick, P. Thacker, et al., Improving alignment of dialogue
Methods in Natural Language Processing: System Demonstrations, 2020, pp. agents via targeted human judgements, 2022, arXiv preprint arXiv:2209.14375.
38–45. [85] H.W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang,
[57] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., Improving language M. Dehghani, S. Brahma, et al., Scaling instruction-finetuned language models,
understanding by generative pre-training, 2018. 2022, arXiv preprint arXiv:2210.11416.
[58] A. Radford, J. Wu, R. Child, et al., Language models are unsupervised multitask [86] R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A.
learners, OpenAI Blog 1 (8) (2019) 9. Poulton, V. Kerkez, R. Stojnic, Galactica: A large language model for science,
[59] T. Brown, B. Mann, N. Ryder, et al., Language models are few-shot learners, 2022, arXiv preprint arXiv:2211.09085.
Adv. Neural Inf. Process. Syst. 33 (2020) 1877–1901. [87] S. Iyer, X.V. Lin, R. Pasunuru, T. Mihaylov, D. Simig, P. Yu, K. Shuster, T.
[60] OpenAI, ChatGPT, 2022, URL https://chat.openai.com/. Wang, Q. Liu, P.S. Koura, et al., Opt-iml: Scaling language model instruction
[61] A. Josh, A. Steven, A. Sandhini, et al., GPT-4 technical report, 2023, arXiv: meta learning through the lens of generalization, 2022, arXiv preprint arXiv:
2303.08774. 2212.12017.
[62] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep [88] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B.
bidirectional transformers for language understanding, 2018, arXiv preprint Rozière, N. Goyal, E. Hambro, F. Azhar, et al., Llama: Open and efficient
arXiv:1810.04805. foundation language models, 2023, arXiv preprint arXiv:2302.13971.
[89] Google, Bard, 2023, URL https://bard.google.com/?hl=en-GB. [122] Y. He, J. Fang, F.R. Yu, V.C. Leung, Large language models (LLMs) inference
[90] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. offloading and resource allocation in cloud-edge computing: An active inference
Zhuang, J.E. Gonzalez, et al., Vicuna: An open-source chatbot impressing gpt-4 approach, IEEE Trans. Mob. Comput. (2024).
with 90%* chatgpt quality, 2023, See https://vicuna.lmsys.org. (Accessed 14 [123] P. Sundaravadivel, P.J. Roselyn, V. Narayanaswamy, V.I. Jeyaraj, A. Ramesh, A.
April 2023). Khanal, Integrating image-based LLMs on edge-devices for underwater robotics,
[91] AI21 Labs, Jurassic-2, 2023, URL https://www.ai21.com/blog/introducing-j2. in: Real-Time Image Processing and Deep Learning 2024, Vol. 13034, SPIE,
[92] A.D.T.I. Institute, Falcon 40b/180b, 2023, URL https://falconllm.tii.ae/. 2024, pp. 119–125.
[93] S.G. Patil, T. Zhang, X. Wang, J.E. Gonzalez, Gorilla: Large language model [124] K. Alizadeh, I. Mirzadeh, D. Belenko, K. Khatamifard, M. Cho, C.C. Del Mundo,
connected with massive apis, 2023, arXiv preprint arXiv:2305.15334. M. Rastegari, M. Farajtabar, Llm in a flash: Efficient large language model
[94] S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, A. Awadallah, inference with limited memory, 2023, arXiv preprint arXiv:2312.11514.
Orca: Progressive learning from complex explanation traces of gpt-4, 2023, [125] X. Yuan, H. Li, K. Ota, M. Dong, Generative inference of large language
arXiv preprint arXiv:2306.02707. models in edge computing: An energy efficient approach, in: 2024 International
[95] A. Mitra, L. Del Corro, S. Mahajan, A. Codas, C. Simoes, S. Agarwal, X. Chen, A. Wireless Communications and Mobile Computing, IWCMC, IEEE, 2024, pp.
Razdaibiedina, E. Jones, K. Aggarwal, et al., Orca 2: Teaching small language 244–249.
models how to reason, 2023, arXiv preprint arXiv:2311.11045. [126] C. Du, S.-B. Ko, H. Zhang, Energy efficient FPGA-based binary transformer
[96] S. A.I., Stable LM 2, 2024, URL https://stability.ai/. accelerator for edge devices, in: 2024 IEEE International Symposium on Circuits
[97] I. A.I., Inflection-2.5, 2024, URL https://inflection.ai/. and Systems, ISCAS, IEEE, 2024, pp. 1–5.
[98] S. Shahriar, B.D. Lund, N.R. Mannuru, M.A. Arshad, K. Hayawi, R.V.K. Bevara, [127] S. Han, B. Buyukates, Z. Hu, H. Jin, W. Jin, L. Sun, X. Wang, W. Wu, C.
A. Mannuru, L. Batool, Putting gpt-4o to the sword: A comprehensive evaluation Xie, Y. Yao, et al., Fedsecurity: A benchmark for attacks and defenses in
of language, vision, speech, and multimodal proficiency, Appl. Sci. 14 (17) federated learning and federated llms, in: Proceedings of the 30th ACM SIGKDD
(2024) 7782. Conference on Knowledge Discovery and Data Mining, 2024, pp. 5070–5081.
[99] A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. [128] R. Ye, W. Wang, J. Chai, D. Li, Z. Li, Y. Xu, Y. Du, Y. Wang, S. Chen,
Huang, et al., Qwen2 technical report, 2024, arXiv preprint arXiv:2407.10671. Openfedllm: Training large language models on decentralized private data via
[100] xAI, Grok-2, 2024, URL https://x.ai/. federated learning, in: Proceedings of the 30th ACM SIGKDD Conference on
[101] Google DeepMind, Gemini 1.5 flash-8B, 2024, URL https://deepmind.google/ Knowledge Discovery and Data Mining, 2024, pp. 6137–6147.
technologies/gemini/flash/. [129] J. Du, T. Lin, C. Jiang, Q. Yang, C.F. Bader, Z. Han, Distributed foundation
[102] A. Baladón, I. Sastre, L. Chiruzzo, A. Rosá, RETUYT-inco at BEA 2023 shared
models for multi-modal learning in 6G wireless networks, IEEE Wirel. Commun.
task: Tuning open-source LLMs for generating teacher responses, in: Proceedings
31 (3) (2024) 20–30.
of the 18th Workshop on Innovative Use of NLP for Building Educational
[130] Y. Zhang, P. Li, J. Hong, J. Li, Y. Zhang, W. Zheng, P.-Y. Chen, J.D. Lee, W.
Applications (BEA 2023), 2023, pp. 756–765.
Yin, M. Hong, et al., Revisiting zeroth-order optimization for memory-efficient
[103] J.J. Nay, Large language models as fiduciaries: A case study toward robustly
llm fine-tuning: A benchmark, 2024, arXiv preprint arXiv:2402.11592.
communicating with artificial intelligence through legal standards, 2023, arXiv
[131] F. Urbina, C.T. Lowden, J.C. Culberson, S. Ekins, MegaSyn: integrating gen-
preprint arXiv:2301.10095.
erative molecular design, automated analog designer, and synthetic viability
[104] T.Y. Zhuo, Z. Li, Y. Huang, Y.-F. Li, W. Wang, G. Haffari, F. Shiri, On robustness
prediction, ACS Omega 7 (22) (2022) 18699–18713.
of prompt-based semantic parsing with large pre-trained language model: An
[132] F. Wu, Z. Li, Y. Li, B. Ding, J. Gao, Fedbiot: Llm local fine-tuning in
empirical study on codex, 2023, arXiv preprint arXiv:2301.12868.
federated learning without full model, in: Proceedings of the 30th ACM SIGKDD
[105] Z. Yao, R. Yazdani Aminabadi, M. Zhang, X. Wu, C. Li, Y. He, Zeroquant:
Conference on Knowledge Discovery and Data Mining, 2024, pp. 3345–3355.
Efficient and affordable post-training quantization for large-scale transformers,
[133] M.J. Page, J.E. McKenzie, P.M. Bossuyt, I. Boutron, T.C. Hoffmann, C.D.
Adv. Neural Inf. Process. Syst. 35 (2022) 27168–27183.
Mulrow, L. Shamseer, J.M. Tetzlaff, E.A. Akl, S.E. Brennan, et al., The PRISMA
[106] A. Zou, Z. Wang, J.Z. Kolter, M. Fredrikson, Universal and transferable
2020 statement: an updated guideline for reporting systematic reviews, Int. J.
adversarial attacks on aligned language models, 2023, arXiv preprint arXiv:
Surg. 88 (2021) 105906.
2307.15043.
[134] P. Andrews, O.E. Nordberg, S. Zubicueta Portales, N. Borch, F. Guribye, K.
[107] D.M. Katz, M.J. Bommarito, S. Gao, P. Arredondo, Gpt-4 passes the bar exam,
Fujita, M. Fjeld, Aicommentator: A multimodal conversational agent for embed-
2023, Available at SSRN 4389233.
ded visualization in football viewing, in: Proceedings of the 29th International
[108] K.I. Roumeliotis, N.D. Tselikas, D.K. Nasiopoulos, Llama 2: Early adopters’
Conference on Intelligent User Interfaces, 2024, pp. 14–34.
utilization of meta’s new open-source pretrained model, 2023.
[135] H. Cui, Y. Du, Q. Yang, Y. Shao, S.C. Liew, LLMind: Orchestrating AI and IoT
[109] A. Byrd, Truth-telling: Critical inquiries on LLMs and the corpus texts that train
with LLM for complex task execution, IEEE Commun. Mag. (2024).
them., Compos. Stud. 51 (1) (2023) 135–142.
[110] X. Zhang, B. Yu, H. Yu, Y. Lv, T. Liu, F. Huang, H. Xu, Y. Li, Wider and deeper [136] N. Zhong, Y. Wang, R. Xiong, Y. Zheng, Y. Li, M. Ouyang, D. Shen, X. Zhu,
llm networks are fairer llm evaluators, 2023, arXiv preprint arXiv:2308.01862. CASIT: Collective intelligent agent system for internet of things, IEEE Internet
[111] I. Yildirim, L. Paul, From task structures to world models: What do LLMs Things J. (2024).
know? 2023, arXiv preprint arXiv:2310.04276. [137] X. Li, Z. Lu, D. Cai, X. Ma, M. Xu, Large language models on mobile devices:
[112] H. Jin, X. Han, J. Yang, Z. Jiang, C.-Y. Chang, X. Hu, GrowLength: Accelerating Measurements, analysis, and insights, in: Proceedings of the Workshop on Edge
LLMs pretraining by progressively growing training length, 2023, arXiv preprint and Mobile Foundation Models, 2024, pp. 1–6.
arXiv:2310.00576. [138] M.A. Ferrag, M. Ndhlovu, N. Tihanyi, L.C. Cordeiro, M. Debbah, T. Lestable,
[113] R.V.P. Marcel, B.E.M. Fernando, Y.V.J. Roberto, A brief history of the artificial N.S. Thandi, Revolutionizing cyber threat detection with large language models:
intelligence: chatgpt: The evolution of GPT, in: 2023 18th Iberian Conference A privacy-preserving bert-based lightweight model for iot/iiot devices, IEEE
on Information Systems and Technologies, CISTI, IEEE, 2023, pp. 1–5. Access (2024).
[114] E.Y. Chang, Examining GPT-4: Capabilities, implications and future direc- [139] J. Ren, D. Zhang, S. He, Y. Zhang, T. Li, A survey on end-edge-cloud
tions, in: The 10th International Conference on Computational Science and orchestrated network computing paradigms: Transparent computing, mobile
Computational Intelligence, 2023. edge computing, fog computing, and cloudlet, ACM Comput. Surv. 52 (6)
[115] M. Zhang, J. Li, A commentary of GPT-3 in MIT technology review 2021, (2019) 1–36.
Fundam. Res. 1 (6) (2021) 831–833. [140] Q. Dong, X. Chen, M. Satyanarayanan, Creating edge ai from cloud-based
[116] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, Albert: A llms, in: Proceedings of the 25th International Workshop on Mobile Computing
lite bert for self-supervised learning of language representations, 2019, arXiv Systems and Applications, 2024, pp. 8–13.
preprint arXiv:1909.11942. [141] L. Qian, J. Zhao, User association and resource allocation in large language
[117] D. Demszky, D. Yang, D.S. Yeager, C.J. Bryan, M. Clapper, S. Chandhok, J.C. model based mobile edge computing system over wireless communications,
Eichstaedt, C. Hecht, J. Jamieson, M. Johnson, et al., Using large language 2023, arXiv preprint arXiv:2310.17872.
models in psychology, Nat. Rev. Psychol. 2 (11) (2023) 688–701. [142] E. Kurtic, D. Campos, T. Nguyen, E. Frantar, M. Kurtz, B. Fineran, M. Goin, D.
[118] M.U. Hadi, R. Qureshi, A. Shah, M. Irfan, A. Zafar, M.B. Shaikh, N. Akhtar, Alistarh, The optimal bert surgeon: Scalable and accurate second-order pruning
J. Wu, S. Mirjalili, et al., A survey on large language models: Applications, for large language models, 2022, arXiv preprint arXiv:2203.07259.
challenges, limitations, and practical usage, Authorea Prepr. (2023). [143] Z. Tang, Y. Wang, X. He, L. Zhang, X. Pan, Q. Wang, R. Zeng, K. Zhao, S. Shi,
[119] Y. Liu, T. Han, S. Ma, J. Zhang, Y. Yang, J. Tian, H. He, A. Li, M. He, Z. Liu, B. He, et al., FusionAI: Decentralized training and deploying LLMs with massive
et al., Summary of chatgpt-related research and perspective towards the future consumer-level GPUs, 2023, arXiv preprint arXiv:2309.01172.
of large language models, Meta-Radiol. (2023) 100017. [144] K. Bałazy, M. Banaei, R. Lebret, J. Tabor, K. Aberer, Direction is what you need:
[120] Y. Yuan, R. Kong, Y. Li, Y. Liu, Wip: An on-device LLM-based approach to Improving word embedding compression in large language models, 2021, arXiv
query privacy protection, in: Proceedings of the Workshop on Edge and Mobile preprint arXiv:2106.08181.
Foundation Models, 2024, pp. 7–9. [145] F. Zheng, Input reconstruction attack against vertical federated large language
[121] S.M. Hasan, A.M. Alotaibi, S. Talukder, A.R. Shahid, Distributed threat intel- models, 2023, arXiv preprint arXiv:2311.07585.
ligence at the edge devices: A large language model-driven approach, 2024, [146] C. Dong, Y. Xie, B. Ding, Y. Shen, Y. Li, Tunable soft prompts are messengers
arXiv preprint arXiv:2405.08755. in federated learning, 2023, arXiv preprint arXiv:2311.06805.
[147] J. Lin, J. Tang, H. Tang, S. Yang, X. Dang, S. Han, AWQ: Activation-aware [175] R. Yi, L. Guo, S. Wei, A. Zhou, S. Wang, M. Xu, Edgemoe: Fast on-device
weight quantization for LLM compression and acceleration, 2023, arXiv preprint inference of moe-based large language models, 2023, arXiv preprint arXiv:
arXiv:2306.00978. 2308.14352.
[148] B. Spector, C. Re, Accelerating llm inference with staged speculative decoding, [176] L. Zhu, L. Hu, J. Lin, W.-M. Chen, W.-C. Wang, C. Gan, S. Han, PockEngine:
2023, arXiv preprint arXiv:2308.04623. Sparse and efficient fine-tuning in a pocket, in: Proceedings of the 56th
[149] T. Prompt, Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of Annual IEEE/ACM International Symposium on Microarchitecture, 2023, pp.
LLM Inference with Transferable Prompt. 1381–1394.
[150] Y. Xu, L. Xie, X. Gu, X. Chen, H. Chang, H. Zhang, Z. Chen, X. Zhang, Q. Tian, [177] Y. Chen, Y. Yan, Q. Yang, Y. Shu, S. He, J. Chen, Confidant: Customizing
Qa-lora: Quantization-aware low-rank adaptation of large language models, transformer-based LLMs via collaborative edge training, 2023, arXiv preprint
2023, arXiv preprint arXiv:2309.14717. arXiv:2311.13381.
[151] J. Shi, Z. Yang, H.J. Kang, B. Xu, J. He, D. Lo, Towards smaller, faster, and [178] Y. Zhu, Y. Liu, F. Stahlberg, S. Kumar, Y.-h. Chen, L. Luo, L. Shu, R. Liu, J.
greener language models of code, 2023, arXiv e-prints, arXiv–2309. Chen, L. Meng, Towards an on-device agent for text rewriting, 2023, arXiv
[152] A. Jangda, J. Huang, G. Liu, A.H.N. Sabet, S. Maleki, Y. Miao, M. Musuvathi, T. preprint arXiv:2308.11807.
Mytkowicz, O. Saarikivi, Breaking the computation and communication abstrac- [179] J. Zhao, Y. Song, S. Liu, I.G. Harris, S.A. Jyothi, LinguaLinked: A distributed
tion barrier in distributed machine learning workloads, in: Proceedings of the large language model inference system for mobile devices, 2023, arXiv preprint
27th ACM International Conference on Architectural Support for Programming arXiv:2312.00388.
Languages and Operating Systems, 2022, pp. 402–416. [180] V. Jaganathan, D. Gouda, K. Arora, M. Aggarwal, C. Zhang, On-device video
[153] R. Qin, J. Xia, Z. Jia, M. Jiang, A. Abbasi, P. Zhou, J. Hu, Y. Shi, Enabling on- analysis with LLMs, in: Proceedings of the 25th International Workshop on
device large language model personalization with self-supervised data selection Mobile Computing Systems and Applications, 2024, pp. 153–153.
and synthesis, 2023, arXiv preprint arXiv:2311.12275. [181] S. Carreira, T. Marques, J. Ribeiro, C. Grilo, Revolutionizing mobile interaction:
[154] Y. Wang, Y. Lin, X. Zeng, G. Zhang, PrivateLoRA for efficient privacy preserving Enabling a 3 billion parameter gpt LLM on mobile, 2023, arXiv preprint
LLM, 2023, arXiv preprint arXiv:2311.14030. arXiv:2310.01434.
[155] W. Kuang, B. Qian, Z. Li, D. Chen, D. Gao, X. Pan, Y. Xie, Y. Li, B. Ding, [182] D. Peng, Z. Fu, J. Wang, Pocketllm: Enabling on-device fine-tuning for
J. Zhou, Federatedscope-llm: A comprehensive package for fine-tuning large personalized llms, 2024, arXiv preprint arXiv:2407.01031.
language models in federated learning, 2023, arXiv preprint arXiv:2309.00363. [183] J. Bang, J. Lee, K. Shim, S. Yang, S. Chang, Crayon: Customized on-device
[156] Y. Zhang, L. Zhao, M. Lin, Y. Sun, Y. Yao, X. Han, J. Tanner, S. Liu, R. Ji, LLM via instant adapter blending and edge-server hybrid inference, 2024, arXiv
Dynamic sparse no training: Training-free fine-tuning for sparse LLMs, 2023, preprint arXiv:2406.07007.
arXiv preprint arXiv:2310.08915. [184] Y. Chen, R. Li, Z. Zhao, C. Peng, J. Wu, E. Hossain, H. Zhang, Netgpt: A native-
[157] I. Mirzadeh, K. Alizadeh, S. Mehta, C.C. Del Mundo, O. Tuzel, G. Samei, M. ai network architecture beyond provisioning personalized generative services,
Rastegari, M. Farajtabar, Relu strikes back: Exploiting activation sparsity in 2023, arXiv preprint arXiv:2307.06148.
large language models, 2023, arXiv preprint arXiv:2310.04564. [185] N. Dhar, B. Deng, D. Lo, X. Wu, L. Zhao, K. Suo, An empirical analysis and
[158] E. Yvinec, A. Dapogny, K. Bailly, Nupes: Non-uniform post-training quantization resource footprint study of deploying large language models on edge devices,
via power exponent search, 2023, arXiv preprint arXiv:2308.05600. in: Proceedings of the 2024 ACM Southeast Conference, 2024, pp. 69–76.
[159] W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang, P. Gao, Y. Qiao, P. [186] P. Choi, J. Kim, J. Kwak, Impact of joint heat and memory constraints of mobile
Luo, Omniquant: Omnidirectionally calibrated quantization for large language device in edge-assisted on-device artificial intelligence, in: Proceedings of the
models, 2023, arXiv preprint arXiv:2308.13137. 2nd International Workshop on Networked AI Systems, 2024, pp. 31–36.
[160] S.A. Jacobs, M. Tanaka, C. Zhang, M. Zhang, L. Song, S. Rajbhandari, Y. He, [187] Y. Ding, C. Niu, F. Wu, S. Tang, C. Lyu, G. Chen, Enhancing on-device LLM
Deepspeed ulysses: System optimizations for enabling training of extreme long inference with historical cloud-based llm interactions, in: Proceedings of the
sequence transformer models, 2023, arXiv preprint arXiv:2309.14509. 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024,
[161] L. Yue, Q. Liu, Y. Du, W. Gao, Y. Liu, F. Yao, Fedjudge: Federated legal large pp. 597–608.
language model, 2023, arXiv preprint arXiv:2309.08173. [188] Llama.cpp, 2023, URL https://github.com/ggerganov/llama.cpp.
[162] H. Woisetschläger, A. Isenko, S. Wang, R. Mayer, H.-A. Jacobsen, Federated [189] G. Gerganov, Llama. cpp: Port of facebook’s llama model in c/c++, 2023, URL
fine-tuning of llms on the very edge: The good, the bad, the ugly, 2023, arXiv https://github.com/mlc-ai/mlc-llm.
preprint arXiv:2310.03150.
[163] T. Dettmers, R. Svirschevski, V. Egiazarian, D. Kuznedelev, E. Frantar, S. Ashkboos, A. Borzunov, T. Hoefler, D. Alistarh, SpQR: A sparse-quantized representation for near-lossless LLM weight compression, 2023, arXiv preprint arXiv:2306.03078.
[164] M. Choi, M.A. Asif, J. Willes, D. Emerson, FlexModel: A framework for interpretability of distributed large language models, 2023, arXiv preprint arXiv:2312.03140.
[165] F. Yang, S. Peng, N. Sun, F. Wang, K. Tan, F. Wu, J. Qiu, A. Pan, Holmes: Towards distributed training across clusters with heterogeneous NIC environment, 2023, arXiv preprint arXiv:2312.03549.
[166] T. Fan, Y. Kang, G. Ma, W. Chen, W. Wei, L. Fan, Q. Yang, Fate-llm: A industrial grade federated learning framework for large language models, 2023, arXiv preprint arXiv:2310.10049.
[167] M. Cho, K.A. Vahid, Q. Fu, S. Adya, C.C. Del Mundo, M. Rastegari, D. Naik, P. Zatloukal, eDKM: An efficient and accurate train-time weight clustering for large language models, IEEE Comput. Architect. Lett. (2024).
[168] S. Han, B. Buyukates, Z. Hu, H. Jin, W. Jin, L. Sun, X. Wang, C. Xie, K. Zhang, Q. Zhang, et al., FedMLSecurity: A benchmark for attacks and defenses in federated learning and LLMs, 2023, arXiv preprint arXiv:2306.04959.
[169] M. Xu, Y. Wu, D. Cai, X. Li, S. Wang, Federated fine-tuning of billion-sized language models across mobile devices, 2023, arXiv preprint arXiv:2308.13894.
[170] J. Yuan, C. Yang, D. Cai, S. Wang, X. Yuan, Z. Zhang, X. Li, D. Zhang, H. Mei, X. Jia, et al., Rethinking mobile AI ecosystem in the LLM era, 2023, arXiv preprint arXiv:2308.14363.
[171] A. Douillard, Q. Feng, A.A. Rusu, R. Chhaparia, Y. Donchev, A. Kuncoro, M. Ranzato, A. Szlam, J. Shen, DiLoCo: Distributed low-communication training of language models, 2023, arXiv preprint arXiv:2311.08105.
[172] J. Zhang, S. Vahidian, M. Kuo, C. Li, R. Zhang, G. Wang, Y. Chen, Towards building the federated GPT: Federated instruction tuning, 2023, arXiv preprint arXiv:2305.05644.
[173] H. Peng, K. Wu, Y. Wei, G. Zhao, Y. Yang, Z. Liu, Y. Xiong, Z. Yang, B. Ni, J. Hu, et al., Fp8-lm: Training fp8 large language models, 2023, arXiv preprint arXiv:2310.18313.
[174] D. Xu, W. Yin, X. Jin, Y. Zhang, S. Wei, M. Xu, X. Liu, LLMCad: Fast and scalable on-device large language model inference, 2023, arXiv preprint arXiv:2309.04255.
[190] X. Shen, P. Dong, L. Lu, Z. Kong, Z. Li, M. Lin, C. Wu, Y. Wang, Agile-quant: Activation-guided quantization for faster inference of LLMs on the edge, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 2024, pp. 18944–18951.
[191] Z. Li, Z. Hou, H. Liu, T. Li, C. Yang, Y. Wang, C. Shi, L. Xie, W. Zhang, L. Xu, et al., Federated learning in large model era: Vision-language model for smart city safety operation management, in: Companion Proceedings of the ACM on Web Conference 2024, 2024, pp. 1578–1585.
[192] Y. Rong, Y. Mao, H. Cui, X. He, M. Chen, Edge computing enabled large-scale traffic flow prediction with GPT in intelligent autonomous transport system for 6G network, IEEE Trans. Intell. Transp. Syst. (2024).
[193] N. Su, C. Hu, B. Li, B. Li, TITANIC: Towards production federated learning with large language models, in: IEEE INFOCOM, 2024.
[194] C. Liu, J. Zhao, Resource allocation in large language model integrated 6G vehicular networks, 2024, arXiv preprint arXiv:2403.19016.
[195] S. Paul, L. Zhang, Y. Shen, H. Jin, Enabling device control planning capabilities of small language model, in: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, IEEE, 2024, pp. 12066–12070.
[196] U. Thakker, P.N. Whatmough, Z.-G. Liu, M. Mattina, J. Beu, Compressing language models using doped kronecker products, 2020, arXiv preprint arXiv:2001.08896.
[197] X. Wei, Y. Zhang, X. Zhang, R. Gong, S. Zhang, Q. Zhang, F. Yu, X. Liu, Outlier suppression: Pushing the limit of low-bit transformer language models, Adv. Neural Inf. Process. Syst. 35 (2022) 17402–17414.
[198] T. Choudhary, V. Mishra, A. Goswami, J. Sarangapani, A comprehensive survey on model compression and acceleration, Artif. Intell. Rev. 53 (2020) 5113–5155.
[199] J.H. Heo, J. Kim, B. Kwon, B. Kim, S.J. Kwon, D. Lee, Rethinking channel dimensions to isolate outliers for low-bit weight quantization of large language models, 2023, arXiv preprint arXiv:2309.15531.
[200] N.P. Pandey, M. Fournarakis, C. Patel, M. Nagel, Softmax bias correction for quantized generative models, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 1453–1458.
[201] W.X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al., A survey of large language models, 2023, arXiv preprint arXiv:2303.18223.
[202] T. Dettmers, M. Lewis, Y. Belkada, L. Zettlemoyer, Gpt3.int8(): 8-bit matrix multiplication for transformers at scale, Adv. Neural Inf. Process. Syst. 35 (2022) 30318–30332.
[203] X. Wu, Z. Yao, Y. He, Zeroquant-fp: A leap forward in llms post-training w4a8 quantization using floating-point formats, 2023, arXiv preprint arXiv:2307.09782.
[204] Y. Zhang, L. Zhao, S. Cao, W. Wang, T. Cao, F. Yang, M. Yang, S. Zhang, N. Xu, Integer or floating point? New outlooks for low-bit quantization on large language models, 2023, arXiv preprint arXiv:2305.12356.
[205] L. Nair, M. Bernadskiy, A. Madhavan, C. Chan, A. Basumallik, D. Bunandar, INT-FP-QSim: Mixed precision and formats for large language models and vision transformers, 2023, arXiv preprint arXiv:2307.03712.
[206] L. Deng, G. Li, S. Han, L. Shi, Y. Xie, Model compression and hardware acceleration for neural networks: A comprehensive survey, Proc. IEEE 108 (4) (2020) 485–532.
[207] S.N. Sridhar, A. Sarah, Undivided attention: Are intermediate layers necessary for bert? 2020, arXiv preprint arXiv:2012.11881.
[208] T. Gale, E. Elsen, S. Hooker, The state of sparsity in deep neural networks, 2019, arXiv preprint arXiv:1902.09574.
[209] L. Yin, S. Liu, M. Fang, T. Huang, V. Menkovski, M. Pechenizkiy, Lottery pools: Winning more by interpolating tickets without increasing training or inference cost, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, 2023, pp. 10945–10953.
[210] E. Frantar, D. Alistarh, Optimal brain compression: A framework for accurate post-training quantization and pruning, Adv. Neural Inf. Process. Syst. 35 (2022) 4475–4488.
[211] I. Hubara, B. Chmiel, M. Island, R. Banner, J. Naor, D. Soudry, Accelerated sparse neural training: A provable and efficient method to find n:m transposable masks, Adv. Neural Inf. Process. Syst. 34 (2021) 21099–21111.
[212] H. Shao, B. Liu, Y. Qian, One-shot sensitivity-aware mixed sparsity pruning for large language models, 2023, arXiv preprint arXiv:2310.09499.
[213] E. Frantar, D. Alistarh, Massive language models can be accurately pruned in one-shot, 2023, arXiv preprint arXiv:2301.00774.
[214] X. Ma, G. Fang, X. Wang, LLM-pruner: On the structural pruning of large language models, 2023, arXiv preprint arXiv:2305.11627.
[215] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, 2015, arXiv preprint arXiv:1503.02531.
[216] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, Q. Liu, Tinybert: Distilling bert for natural language understanding, 2019, arXiv preprint arXiv:1909.10351.
[217] Z. Sun, H. Yu, X. Song, R. Liu, Y. Yang, D. Zhou, Mobilebert: a compact task-agnostic bert for resource-limited devices, 2020, arXiv preprint arXiv:2004.02984.
[218] W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, M. Zhou, Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers, Adv. Neural Inf. Process. Syst. 33 (2020) 5776–5788.
[219] H. Tsai, J. Riesa, M. Johnson, N. Arivazhagan, X. Li, A. Archer, Small and practical BERT models for sequence labeling, 2019, arXiv preprint arXiv:1909.00100.
[220] S. Sun, Y. Cheng, Z. Gan, J. Liu, Patient knowledge distillation for bert model compression, 2019, arXiv preprint arXiv:1908.09355.
[221] D. Chatterjee, Making neural machine reading comprehension faster, 2019, arXiv preprint arXiv:1904.00796.
[222] P. Kaliamoorthi, A. Siddhant, E. Li, M. Johnson, Distilling large language models into tiny and effective students using pQRNN, 2021, arXiv preprint arXiv:2101.08890.
[223] R. Tang, Y. Lu, L. Liu, L. Mou, O. Vechtomova, J. Lin, Distilling task-specific knowledge from bert into simple neural networks, 2019, arXiv preprint arXiv:1903.12136.
[224] C. Hu, X. Li, D. Liu, H. Wu, X. Chen, J. Wang, X. Liu, Teacher-student architecture for knowledge distillation: A survey, 2023, arXiv preprint arXiv:2308.04268.
[225] A. Hilmkil, S. Callh, M. Barbieri, L.R. Sütfeld, E.L. Zec, O. Mogren, Scaling federated learning for fine-tuning of large language models, in: International Conference on Applications of Natural Language To Information Systems, Springer, 2021, pp. 15–23.
[226] S. Wang, S. Zhuang, B. Koopman, G. Zuccon, ReSLLM: Large language models are strong resource selectors for federated search, 2024, arXiv preprint arXiv:2401.17645.
[227] T.J. Chua, W. Yu, J. Zhao, K.-Y. Lam, FedPEAT: Convergence of federated learning, parameter-efficient fine tuning, and emulator assisted tuning for artificial intelligence foundation models with mobile edge computing, 2023, arXiv preprint arXiv:2310.17491.
[228] B. Wang, Y.J. Zhang, Y. Cao, B. Li, H.B. McMahan, S. Oh, Z. Xu, M. Zaheer, Can public large language models help private cross-device federated learning? 2023, arXiv preprint arXiv:2305.12132.
[229] C. Hou, H. Zhan, A. Shrivastava, S. Wang, S. Livshits, G. Fanti, D. Lazar, Privately customizing prefinetuning to better match user data in federated learning, 2023, arXiv preprint arXiv:2302.09042.
[230] X.-Y. Liu, R. Zhu, D. Zha, J. Gao, S. Zhong, M. Qiu, Differentially private low-rank adaptation of large language model using federated learning, 2023, arXiv preprint arXiv:2312.17493.
[231] L. Peng, G. Luo, S. Zhou, J. Chen, Z. Xu, R. Zhang, J. Sun, An in-depth evaluation of federated learning on biomedical natural language processing, medRxiv (2023).
[232] L. Yunxiang, L. Zihan, Z. Kai, D. Ruilong, Z. You, Chatdoctor: A medical chat model fine-tuned on llama model using medical domain knowledge, 2023, arXiv preprint arXiv:2303.14070.
[233] W. Dong, X. Wu, J. Li, S. Wu, C. Bian, D. Xiong, FewFedWeight: Few-shot federated learning framework across multiple nlp tasks, 2022, arXiv preprint arXiv:2212.08354.
[234] Z. Qin, D. Chen, B. Qian, B. Ding, Y. Li, S. Deng, Federated full-parameter tuning of billion-sized language models with communication cost under 18 kilobytes, 2023, arXiv preprint arXiv:2312.06353.
[235] B. Ouyang, S. Ye, L. Zeng, T. Qian, J. Li, X. Chen, Pluto and charon: A time and memory efficient collaborative edge ai framework for personal LLMs fine-tuning, in: Proceedings of the 53rd International Conference on Parallel Processing, 2024, pp. 762–771.
[236] N. Ding, Y. Qin, G. Yang, F. Wei, Z. Yang, Y. Su, S. Hu, Y. Chen, C.-M. Chan, W. Chen, et al., Parameter-efficient fine-tuning of large-scale pre-trained language models, Nat. Mach. Intell. 5 (3) (2023) 220–235.
[237] G. Pu, A. Jain, J. Yin, R. Kaplan, Empirical analysis of the strengths and weaknesses of PEFT techniques for LLMs, 2023, arXiv preprint arXiv:2304.14999.
[238] L. Qian, J. Zhao, User association and resource allocation in large language model based mobile edge computing system over 6G wireless communications, in: 2024 IEEE 99th Vehicular Technology Conference (VTC2024-Spring), IEEE, 2024, pp. 1–7.
[239] G. Qu, Z. Lin, F. Liu, X. Chen, K. Huang, TrimCaching: Parameter-sharing AI model caching in wireless edge networks, 2024, arXiv preprint arXiv:2405.03990.
[240] M. Xu, D. Cai, Y. Wu, X. Li, S. Wang, Fwdllm: Efficient federated finetuning of large language models with perturbed inferences, in: USENIX ATC, 2024.
[241] G. Kim, J. Yoo, S. Kang, Efficient federated learning with pre-trained large language model using several adapter mechanisms, Mathematics 11 (21) (2023) 4479.
[242] M. Benington, L. Phan, C.P. Paul, E. Shoemaker, P. Ranade, T. Collett, G.H. Perez, C. Krieger, Scaling studies for efficient parameter search and parallelism for large language model pre-training, 2023, arXiv preprint arXiv:2310.05350.
[243] Y. Ghannane, M.S. Abdelfattah, Diviml: A module-based heuristic for mapping neural networks onto heterogeneous platforms, in: 2023 IEEE/ACM International Conference on Computer Aided Design, ICCAD, IEEE, 2023, pp. 1–9.
[244] S. Ohta, T. Nishio, Λ-Split: A privacy-preserving split computing framework for cloud-powered generative AI, 2023, arXiv preprint arXiv:2310.14651.
[245] W. Huang, Y. Wang, A. Cheng, A. Zhou, C. Yu, L. Wang, A fast, performant, secure distributed training framework for large language model, 2024, arXiv preprint arXiv:2401.09796.
[246] J. Zhao, Y. Song, I. Harris, S.A. Jyothi, et al., LinguaLinked: Distributed large language model inference on mobile devices, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), 2024, pp. 160–171.
[247] X. Zhou, Q. Jia, Y. Hu, R. Xie, T. Huang, F.R. Yu, Geng: An LLM-based generic time series data generation approach for edge intelligence via cross-domain collaboration, in: IEEE INFOCOM 2024-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), IEEE, 2024, pp. 1–6.
[248] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, et al., Efficient large-scale language model training on gpu clusters using megatron-lm, in: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–15.
[249] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, B. Catanzaro, Megatron-lm: Training multi-billion parameter language models using model parallelism, 2019, arXiv preprint arXiv:1909.08053.
[250] J. Rasley, S. Rajbhandari, O. Ruwase, Y. He, Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 3505–3506.
[251] D.P. Pau, F.M. Aymone, Forward learning of large language models by consumer devices, Electronics 13 (2) (2024) 402.
[252] S. Wang, J. Wei, A. Sabne, A. Davis, B. Ilbeyi, B. Hechtman, D. Chen, K.S. Murthy, M. Maggioni, Q. Zhang, et al., Overlap communication with dependent computation via decomposition in large deep learning models, in: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vol. 1, 2022, pp. 93–106.
[253] H. Fan, S.I. Venieris, A. Kouris, N. Lane, Sparse-dysta: Sparsity-aware dynamic and static scheduling for sparse multi-DNN workloads, in: Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023, pp. 353–366.
[254] T. Che, J. Liu, Y. Zhou, J. Ren, J. Zhou, V.S. Sheng, H. Dai, D. Dou, Federated learning of large language models with parameter-efficient prompt tuning and adaptive optimization, 2023, arXiv preprint arXiv:2310.15080.
[255] T. Tambe, J. Zhang, C. Hooper, T. Jia, P.N. Whatmough, J. Zuckerman, M.C. Dos Santos, E.J. Loscalzo, D. Giri, K. Shepard, et al., A 12nm 18.1 TFLOPs/W sparse transformer processor with entropy-based early exit, mixed-precision predication and fine-grained power management, in: 2023 IEEE International Solid-State Circuits Conference, ISSCC, IEEE, 2023, pp. 342–344.
[256] L. Collins, S. Wu, S. Oh, K.C. Sim, Profit: Benchmarking personalization and robustness trade-off in federated prompt tuning, 2023, arXiv preprint arXiv:2310.04627.
[257] J. Sun, Z. Xu, H. Yin, D. Yang, D. Xu, Y. Chen, H.R. Roth, FedBPT: Efficient federated black-box prompt tuning for large language models, 2023, arXiv preprint arXiv:2310.01467.
[258] J. Jiang, X. Liu, C. Fan, Low-parameter federated learning with large language models, 2023, arXiv preprint arXiv:2307.13896.
[259] X. Wu, Z. Yao, Y. He, Zeroquant-fp: A leap forward in llms post-training w4a8 quantization using floating-point formats, 2023, arXiv preprint arXiv:2307.09782.
[260] Q. Xu, Y. You, An efficient 2d method for training super-large deep learning models, in: 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS, IEEE, 2023, pp. 222–232.