SPECIAL ARTICLE

Fine-Tuning Large Language Models for Specialized Use Cases
D.M. Anisuzzaman, PhD; Jeffrey G. Malins, PhD; Paul A. Friedman, MD;
and Zachi I. Attia, PhD

Abstract

Large language models (LLMs) are a type of artificial intelligence that operates by predicting and assembling sequences of words that are statistically likely to follow from a given text input. With this basic ability, LLMs are able to answer complex questions and follow extremely complex instructions. Products created using LLMs, such as ChatGPT by OpenAI and Claude by Anthropic, have gained a huge amount of traction and user engagement and revolutionized the way we interact with technology, bringing a new dimension to human-computer interaction. Fine-tuning is a process in which a pretrained model, such as an LLM, is further trained on a custom data set to adapt it for specialized tasks or domains. In this review, we outline some of the major methodologic approaches and techniques that can be used to fine-tune LLMs for specialized use cases and enumerate the general steps required for carrying out LLM fine-tuning. We then illustrate a few of these methodologic approaches by describing several specific use cases of fine-tuning LLMs across medical subspecialties. Finally, we close with a consideration of some of the benefits and limitations associated with fine-tuning LLMs for specialized use cases, with an emphasis on specific concerns in the field of medicine.
© 2025. Published by Elsevier Inc on behalf of Mayo Foundation for Medical Education and Research. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Mayo Clin Proc Digital Health 2025;3(1):100184

Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN.

Large language models (LLMs), a specialized subset of artificial intelligence (AI), are designed to generate text through a process known as autoregression (often leading them to be termed autoregressive LLMs). These models operate by predicting and assembling sequences of words that are statistically likely to follow from a given text input, thereby enabling them to produce coherent and relevant sentences. The models can accept conversational input as text or via speech (using language recognition) and can generate outputs at various levels ranging from technical/professional to that of a high school education and more. They can summarize vast quantities of data, have access to unimaginably large volumes of information, and stand to make this available, easily, to the user. The public release of ChatGPT has opened the public’s imagination and given a glimpse into an information-rich future.

These capabilities allow LLMs to perform a variety of general purpose tasks such as answering questions, completing sentences, and even generating entire articles. One of the breakthroughs that led to the creation of LLMs is the use of foundational models that process and comprehend natural language using deep learning methods. The 2 primary ideas of foundation models are self-supervised learning and scale. In self-supervision, instead of training a model to perform a task that requires explicit annotations, the model learns from the vast amounts of unlabeled data available, extracting patterns and understanding context without human intervention. In addition to being more scalable, self-supervised tasks can allow a model to anticipate a portion of the inputs, which makes the model richer and potentially more valuable than models trained on a more constrained label space. Once the model learns the foundational patterns of language, the same model can then be applied using transfer learning followed by fine-tuning, which enables the model to learn to perform more specific tasks using a smaller set of labeled samples. For scale, the era of the internet provides a nearly limitless amount of data1 and, coupled with advances in computing power,

Mayo Clin Proc Digital Health. March 2025;3(1):100184. https://doi.org/10.1016/j.mcpdig.2024.11.005. www.mcpdigitalhealth.org

[Figure 1 (workflow diagram). Data preparation: selecting an existing dataset (LLMDataHub, Hugging Face) or preparing a custom dataset (normalization, cleaning, formatting). Model selection: GPT-4, LaMDA, LLaMA, PaLM, Mistral, Cohere. Fine-tuning: supervised fine-tuning, RLHF, in-context learning, hyperparameter tuning, and parameter-efficient fine-tuning (QLoRA, prefix tuning, prompt tuning). Validation: accuracy, perplexity, ROUGE scores, disparity analysis, human evaluation.]

FIGURE 1. A general workflow of LLM fine-tuning for specialized use cases. LLM, large language model; QLoRA, quantized low-rank adaptation; RLHF, reinforcement learning from human feedback.

enables the training of models on an unprecedented scale using graphics processing units (GPUs). Together, these developments, enhanced by innovations such as the transformer model architecture,2 have significantly propelled the capabilities and applications of LLMs. A general workflow of LLM fine-tuning for specialized use cases is shown in Figure 1.

Some existing LLMs to date are Alpaca,3 BERT,4 BLOOM,5 Claude,6 Cohere,7 Ernie,8 Falcon,9 Flan,10 Gemini,11 Gemma,12 GPT-3.5,13 GPT-4,14 LaMDA,15 LLaMA,16 Mistral,17 MPT,18 Orca,19 PaLM 2,20 Phi-1,21 StableLM,22 T5,23 Vicuna,24 and Zephyr.25 All these models were developed to handle language-related tasks by different for-profit and nonprofit organizations such as Google, Meta, and Stanford. Although most of these models were created as general task models, some were developed for specialized tasks such as language translation, human-like chat, and code generation.

In addition to anticipating subsequent text, because models are trained with billions of tokens, many words map to multiple tokens (ie, they are represented by word vectors), enabling mathematical connections between multiple meanings of a term. For example, Paris will have connections to France, city, capital, and so on, so that the relationships between Paris and France and London and United Kingdom may be used by an LLM. A limitation of LLMs is that after training is completed, a model no longer learns or acquires new information, and the information it was trained on may be general (such as Wikipedia), but not well-suited to a specific task. These limitations can be mitigated with fine-tuning to better sculpt an LLM to address a specific field (such as medicine or law), and retrieval augmented generation, which provides additional information that a model may use to address questions and which is particularly useful if that additional information was not included in the model’s training.

In the domain of health care, a number of LLMs have been fine-tuned to perform tasks associated with preconsultation, diagnosis, management, and prediction of future medical outcomes, as well as medical education and medical writing.26-28 LLMs specific to the medical domain include BioBERT,29 BioGPT,30 BioMistral,31 ChatDoctor,32 Clinical Camel,33 DoctorGLM,34 Med-Alpaca,35 Med-PaLM,36 Med-PaLM 2,37 Med42-v2,38 Meditron-70b,39 OpenBioLLM-70B,40 and PMC-LLaMA.41 One particularly powerful use of models such as these is obtaining answers to questions rather than links to articles, with the caveat that using systems not designed to address medical questions may be

TABLE. Uses of LLMs in Medicine

Medical research assistance
  Strengths: Can quickly synthesize and summarize existing medical literature, helping researchers stay up-to-date with recent developments
  Limitations: May not have access to the most recent studies owing to training data cutoffs; could miss context or nuance in highly specialized areas
  Example usage: Documentation for clinical trials43

Clinical decision support
  Strengths: Provides support in diagnosing complex cases by suggesting possible diagnoses on the basis of symptoms and medical history
  Limitations: Relies on the data it was trained on, which may not include rare diseases or latest treatment modalities
  Example usage: Differentiation between abdominal pathologies44

Patient interaction automation
  Strengths: Handles routine inquiries from patients, such as explaining medical procedures and advising on medication schedules
  Limitations: May lack the empathetic nuances that human interaction provides; risk of miscommunication in complex scenarios
  Example usage: Answering cataract operation-related questions45

Medical education and training
  Strengths: Assists in the education of medical students and professionals by providing explanations, generating quizzes, and simulating patient cases
  Limitations: Might not perfectly mimic the unpredictability of real-life medical cases; information may become outdated
  Example usage: Interactive practice cases to evaluate medical reasoning46

Documentation and reporting
  Strengths: Helps in generating and organizing medical reports, thereby reducing the administrative burden on health care providers
  Limitations: Possible issues with accuracy and privacy concerns; needs constant verification
  Example usage: Generation of radiology reports on the basis of chest X-rays47

Treatment plan management
  Strengths: Suggests treatment plans on the basis of clinical guidelines and individual patient data
  Limitations: May not incorporate experiential learning or adapt to unconventional cases as effectively as a human would
  Example usage: Assistance with complex decision making for breast cancer care48

Support for remote areas
  Strengths: Provides medical information and support in remote areas where medical expertise is limited
  Limitations: Dependence on internet connectivity; may not handle local medical practices or nonstandard treatments well
  Example usage: Providing community health workers with contextually appropriate medical knowledge49

The descriptions, strengths, and limitations in this table were generated by ChatGPT 4.0 on May 6, 2024. The example usages were added by the authors.

inaccurate.42 LLMs can potentially be used for any task that requires reading text and summarizing it or extracting pertinent information. Examples of data extraction uses could include reviewing medical records to create a discharge summary, identifying and summarizing all risks for stroke in a patient with atrial fibrillation, or determining preoperative surgical risk using standardized scoring criteria. A list of potential uses of LLMs in medicine along with specific examples is provided in the Table.43-49

In the following sections, we first outline some of the major approaches and techniques for fine-tuning LLMs in the medical domain and touch on retrieval augmented generation. Then, we describe specialized use cases in which LLMs have been fine-tuned for medical applications across various medical subspecialties. After this, we close with a consideration of some of the benefits and key limitations associated with fine-tuning LLMs in the medical domain.

FINE-TUNING METHODOLOGY

Fine-tuning is a process in which a pretrained model is adapted for particular tasks or domains by continuing to train the model using only a domain-specific data set that is different from the original data set used to train the base model. Various fine-tuning strategies and approaches are used to adjust the model parameters to a specific need. Some fine-tuning approaches are briefly described in this article.
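
The idea of continuing to train a pretrained model on a domain-specific data set can be illustrated with a minimal Python sketch. This is not the procedure used in any of the cited studies: a toy linear model stands in for an LLM, and the data sets, learning rates, epoch counts, and function names (train, mse) are hypothetical choices made only for illustration. The model is first fit to a large "general" data set and then trained further, starting from those saved weights, on a small "domain" data set.

```python
import random


def train(w, b, data, lr, epochs):
    """Gradient descent on mean squared error for the model y = w * x + b."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(w, b, data):
    """Mean squared error of the model on a data set."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


random.seed(0)

# Hypothetical large "general" data set following y = 2x + 1.
general = [(x, 2.0 * x + 1.0)
           for x in (random.uniform(-1, 1) for _ in range(200))]

# Hypothetical small "domain" data set following y = 2.5x + 0.5.
domain = [(x, 2.5 * x + 0.5)
          for x in (random.uniform(-1, 1) for _ in range(10))]

# Pretraining: start from zero weights and fit the general data.
w0, b0 = train(0.0, 0.0, general, lr=0.1, epochs=500)

# Fine-tuning: start from the pretrained weights and continue training
# on the domain data only, with a smaller learning rate.
w1, b1 = train(w0, b0, domain, lr=0.05, epochs=300)

print("domain error before fine-tuning:", mse(w0, b0, domain))
print("domain error after fine-tuning: ", mse(w1, b1, domain))
```

Starting the second stage from the pretrained weights rather than from zero is the design choice that fine-tuning relies on; in this toy case, only 10 domain examples are used for the second stage.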

Supervised Fine-Tuning

With this approach, every input data point is linked to a label, and the model is trained on a task-specific labeled data set. The model learns to modify its parameters to anticipate these labels as precisely as possible. Some supervised fine-tuning techniques are as follows:

1. Transfer learning: In this approach, a model is first initialized with saved weights from a model pretrained on a large, general data set and is then trained with limited task-specific data. Weights refer to the learned parameters of a model that has been trained on a large data set for a specific task, which represent the knowledge the model has gained during its training process, encapsulating features and patterns relevant to the task it was originally trained on.
2. Multitask learning: Here, models are fine-tuned on numerous related tasks, taking advantage of their similarities and differences, in order to maximize performance. For example, with a convolutional neural network (CNN) model trained on a generic large data set (eg, KINETICS400), one can perform some specific tasks (eg, estimating left ventricular ejection fraction, patient age, and patient sex from an echocardiogram) with a much smaller data set by leveraging the generic features the model learned from the large data set.
3. Instruction-tuning: Instruction-tuning involves fine-tuning a pretrained LLM to follow specific task instructions, such as translation, summarization, or question answering. For example, in translation, the model is trained on examples in which each input includes an instruction like “Translate the following sentence from English to French,” followed by an English sentence and its French translation. After fine-tuning, the model learns to follow translation instructions and can generalize to translate new sentences.

Reinforcement Learning From Human Feedback

This method uses the knowledge of human evaluators and allows the model to adjust and develop in response to real-world input, resulting in enhanced and more efficient applications. Some standard reinforcement learning from human feedback (RLHF) techniques are as follows:

1. Reward modeling: In this method, the model generates multiple potential outputs or actions, which are subsequently assessed by human evaluators who assign a ranking or rating on the basis of their quality. The model uses these human-provided assessments to generate predictions and adapt its behavior to optimize the anticipated rewards.
2. Proximal policy optimization: Proximal policy optimization (PPO) modifies the language model’s policy to maximize the expected reward. A policy refers to the strategy or set of rules that a reinforcement learning agent uses to make decisions in an environment. For example, in a robotics setting, the policy determines how a robotic arm should move to pick up objects on the basis of visual inputs from a camera. PPO’s primary goal is to make policy improvements while ensuring the modifications do not deviate too much from the previous policy. To achieve this balance, the policy update process introduces a constraint that prohibits detrimental large updates while permitting advantageous minor updates. Compared with many other reinforcement learning techniques, PPO is often found to be more stable and reliable in practice.
3. Comparative ranking: In this method, the model produces several outputs or actions, which human investigators then rank according to compatibility or quality. The model then modifies its behavior to generate higher-ranked outputs. Because multiple outputs are ranked against one another rather than scored individually, this method gives the model relative feedback, which can be more informative.
4. Preference feedback: This technique involves the model generating several outputs and human experts selecting among them, leading the model to modify its behavior accordingly. This method is useful when assigning a numeric value (reward) to an output is difficult. It is an effective method of fine-tuning the model in practical applications.
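
Two of the RLHF ideas described above can be written down compactly. The following Python sketch is illustrative only and is not taken from the studies cited in this article: it assumes the pairwise (Bradley-Terry style) loss commonly used to train reward models from human preferences, and the clipped surrogate objective from the PPO literature. The function names, the example numbers, and the default clipping value of 0.2 are assumptions chosen for illustration.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Loss for one human comparison: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the output preferred by
    the human evaluator above the rejected output.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one sampled action (to be maximized).

    The probability ratio between the new and old policy is clipped to
    [1 - clip_eps, 1 + clip_eps], so a single update gains nothing from
    moving the policy far away from the previous one.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# A reward model that agrees with the human ranking incurs a lower loss
# than one that disagrees with it.
print(pairwise_preference_loss(1.5, 0.2))
print(pairwise_preference_loss(0.2, 1.5))

# The new policy makes a well-rewarded action twice as likely, but the
# objective is capped at (1 + clip_eps) * advantage.
print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=1.0))
```

In a full RLHF pipeline these quantities would be averaged over batches of model outputs and optimized with gradient-based training; the sketch shows only the per-example arithmetic.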


FINE-TUNING PIPELINE

To carry out a fine-tuning process for a specialized use case, there are several generic steps that include the following50:

1. Data preparation: Data set preparation for LLM fine-tuning is entirely task-specific. In general, the model must be presented with some blocks of text. Many data sets are available to fine-tune an LLM.51-53 One must follow the instructions given for each data set to prepare the data for fine-tuning. For custom data sets, depending on the task, the data set preparation may include data cleaning, normalizing for missing values, and formatting the text to align with the model’s input requirements.
2. Selecting the appropriate pretrained model: There are several LLMs to date (BERT, Cohere, GPT-4, LLaMA, Mistral, etc; access to nonopen source models will require working with the owners of the models), and choosing the appropriate one that complies with the demands of the target task is essential. For fine-tuning an LLM for a specific data set or task, one must have a good grasp of the model architecture and the input and output requirements. Depending on the available resources, the model weights and number of parameters should be considered when selecting a model. Finally, the performance of the model on the relevant target task should be considered during model selection.
3. Fine-tuning the model: LLM fine-tuning also includes basic hyperparameter tuning, that is, adjustments of the learning rate, batch size, regularization, optimizer, number of epochs, and so on. As LLMs are trained on vast amounts of data, overfitting to a small data set for a specific task is a likely event. Careful tuning of the hyperparameters helps ensure that the model learns efficiently and does not overfit when applied to new data.
4. Validation: Validation of an LLM is complex. In predictive AI, a specific input and output are expected. For example, a neural network might assess a medical image such as an electrocardiogram, or a medical video such as an echocardiogram, and be tasked with determining the ejection fraction (heart pump strength). The ejection fraction can be manually measured by a human to assess the performance of the AI tool. In contrast, generative AI, such as LLMs, generates new text or images, the performance of which may be harder to grade. If asked to create a poem, how does one assess the quality of it?

In general practice, there are 2 types of validation to perform: (1) internal validation, which is used to select the best model, monitor the model’s learning process, and trigger callbacks that stop model training when certain criteria are met; and (2) validation on a holdout test set to evaluate model performance for real-world applications. Although, in general, some commonly used metrics for model evaluation include accuracy, area under the curve, precision, recall, and so on, LLM evaluation may require some careful task-specific metric selection.54 Some key performance metrics used in LLM evaluation are as follows:

• Accuracy: measures the model’s ability to produce correct responses to prompts.
• Perplexity: measures uncertainty in predicting the next token.
• ROUGE scores: compares an LLM’s output with a set of reference summaries.
• Diversity: evaluates the variety of responses generated.
• Disparity analysis: identifies biases within model responses so that they can be mitigated.
• Coh-Metrix: analyzes logical consistency and clarity over longer stretches of text.
• Human evaluation: subjective assessment by human judges.

In medical LLM development, in cases for which models are fine-tuned for clinical prediction tasks in which the ground truth labels are well defined (eg, predicting discharge events), evaluation typically involves statistical performance metrics like accuracy, precision, and so on. For more generative tasks in which the ground truth labels are not well defined (eg, medical report summarization), human or domain expert evaluation is crucial to ensure that model outputs are clinically accurate and safe for real-world applications. For example, Singhal et al36 developed Med-PaLM for answering medical questions and


had clinicians review outputs to ensure that the responses were medically sound and factually accurate. Similarly, Serapio et al55 fine-tuned LLMs for generating radiological impressions from chest computed tomography scans and had model outputs assessed by board-certified radiologists.

Beyond these general approaches, a number of specific techniques can be applied to fine-tune LLMs for specialized use cases. Some example techniques include the following:

1. In-context learning: In this approach, a pretrained LLM is induced to perform a task using prompted examples. An example of this is few-shot learning, which involves giving the model a few shots or instances to learn a new task during inference. Few-shot learning aims to direct the model’s predictions by providing examples and context specifically in the prompt but importantly does not involve gradient-based training.56
2. Hyperparameter tuning: This is a straightforward method that consists of manually modifying basic hyperparameters (ie, learning rate, batch size, optimizer, and number of epochs) of the model until the desired performance is obtained. This changes how the model learns; that is, how fast it learns, how to decide when training is completed, and so on.
3. Parameter-efficient fine-tuning: Parameter-efficient fine-tuning (PEFT) is an efficient technique in which only a small portion of the parameters of an LLM are selectively modified during fine-tuning, typically by adding new layers or modifying existing ones in a task-specific manner. This method drastically lowers computational and storage needs while keeping performance comparable with complete fine-tuning. Some PEFT techniques are low-rank adaptation,57 quantized low-rank adaptation (QLoRA),58 Prefix tuning, and Prompt tuning.

Quantized low-rank adaptation is a very popular technique used for LLM fine-tuning because it uses much less memory than a full fine-tuning approach, at the price of sacrificing some performance. For example, a full fine-tuning of the LLaMA 65B parameter model requires more than 780 GB of GPU memory, whereas using the QLoRA technique requires only 48 GB of GPU memory.58 This powerful technique is based on the following highly technical ingredients:

i. 4-bit NormalFloat representation of model parameters, in contrast to the 32-bit format in which parameters of trained models are typically stored. This technique divides model parameters into equally-sized buckets instead of equally-spaced buckets.
ii. Double quantization is a method that quantizes the quantization constants. In general, quantization converts datatypes with a larger number of bits to fewer bits (eg, FP32 to 8-bit Integers). Quantized low-rank adaptation uses the blockwise quantization technique that requires more memory than standard quantization but reduces bias significantly, thus retaining good performance.
iii. Low-rank adaptation,57 which freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the transformer architecture, greatly reduces the number of trainable parameters for downstream tasks.

In simpler terms, low-rank adaptation leaves the original LLM weights untouched and learns a much more compressed (low-rank) update to those weights. Although this compression may lose some information, many of the model weights are assumed to be redundant, so it leads only to a small decrease in performance relative to the savings in memory and required compute power.

4. Retrieval augmented generation: Retrieval augmented generation (RAG) is a technique that combines the capabilities of neural language models with information retrieval systems to enhance the generation of contextually rich and accurate responses. In RAG, when a query is received, the model first uses a retrieval system to fetch relevant documents or snippets from a large corpus, such as a database of scientific literature. These retrieved texts are then fed into a generative model, typically a transformer-based neural network, which integrates the retrieved information with its pretrained knowledge to produce a coherent and informed response. This approach is particularly useful in domains

where accuracy and specificity are critical, such as scientific research or technical support, because it allows the model to base its answers on up-to-date and source-specific data, providing citations and grounding its responses in existing literature. A comparison of the outputs generated by presenting the same medical question to a search engine, LLM, and RAG-based system is shown in Figure 2.

SPECIALIZED USE CASES IN MEDICINE

Across many of the major subspecialties of medicine, LLMs are being fine-tuned to address specific issues, and practitioners are posing questions about how fine-tuned LLMs could revolutionize their fields. These subspecialties include but are not limited to cardiology,59-61 dermatology,62 digital pathology,63 gastroenterology and hepatology,64-66 hematology,67 neurology,68-70 obstetrics and gynecology,71,72 oncology,73-75 ophthalmology,76 orthopedics,77 pediatrics,78,79 psychiatry,80-82 radiology,83,84 surgery,85,86 and urology.87,88 Although there are nuances according to specific subspecialties, many practitioners highlight the potential of fine-tuned LLMs to aid clinicians in areas such as clinical decision support, treatment planning, and patient consultation, as well as alleviate administrative burden associated with tasks such as generating clinical notes, discharge reports, and medical billing. At the same time, many are concerned about the ethical, legal, and social implications of using such models. In the subsequent section, we review some of these concerns. Before that, below we highlight a few specific methodologies and use cases that illustrate the general framework outlined in the previous sections.

Using RLHF to Fine-Tune LLMs in Medicine

Mukherjee et al89 developed a constellation system called Polaris, which was composed of several agents. Their primary agent (focused on patient-friendly conversation) was developed in 3 stages: general instruction-tuning, conversation and agent tuning, and RLHF. The RLHF step was performed by registered nurses, who gave preference feedback on multiple responses. Zhao et al90 developed Aquila-Med, a bilingual medical LLM, using supervised fine-tuning and RLHF to tackle medical challenges. It was trained on large-scale Chinese and English medical data sets, with RLHF further aligning the model to improve performance in medical dialogs and multiple-choice questions.

Integration of LLMs into Electronic Health Record Systems

Several LLMs have been applied to electronic health record (EHR) systems, providing benefits such as generating patient summaries from EHRs, assisting health care providers with more efficient decision making, named entity recognition, medical note summarization, and predictive diagnosis.91 Zhang et al92 investigated the application of LLM fine-tuning to EHR audit log data for clinical prediction tasks, with a focus on discharge predictions. Cui et al93 evaluated the zero-shot and few-shot performance of LLMs on EHR-based disease prediction tasks and proposed a novel approach that leverages collaborative LLM agents to enhance predictive performance. Li et al94 fine-tuned an LLM named LlamaCare and evaluated it on various clinical tasks, such as generating discharge summaries, predicting mortality and length of stay, and more.

Generation of Echocardiography Reports to Streamline Workflows

Echocardiography is one of the most widely used imaging techniques for gaining insights into the structure and function of the heart. A typical echocardiography report includes numerous measurements as well as text-based statements or findings. These findings are summarized by a clinician to give an overall set of final impressions for the study. This is a time-consuming and error-prone process. To address this issue, Chao et al95 leveraged several open-source LLMs to generate echocardiography reports using either zero-shot learning (for Flan-T5, Med-Alpaca, Llama-2, and Zephyr) or QLoRA fine-tuning (Llama-2 and Zephyr). Using a training data set of 95,506 echocardiography reports, the authors observed that EchoGPT, which is a Llama-2 model trained using instruction fine-tuning with QLoRA, outperformed other LLMs on critical performance metrics. In addition, when 4 echocardiography board-certified cardiologists were asked to rate reports generated by EchoGPT for 30 randomly selected cases,

FIGURE 2. Comparison of responses using different systems to the same question: “What are the interactions between propafenone
and colchicine?”


the generated reports were rated similarly to reports generated by cardiologists for these same cases (in completeness, conciseness, correctness, and clinical utility). On the basis of these results, the authors argue that EchoGPT could be used as a copilot for report generation, which would allow for considerable streamlining of the echocardiography report workflow. With that said, the authors stress that draft reports generated by EchoGPT should still be reviewed and approved by clinicians, noting that some hallucinations were observed in reports generated by EchoGPT (albeit not as many as were observed for zero-shot learning).

Identifying Eligible Patients for Clinical Trials
Randomized clinical trials are a cornerstone of medical research, yet it can be a challenge to identify patients who meet all inclusion and exclusion criteria for a clinical trial. To leverage the power of LLMs to assist with participant recruitment for clinical trials, Guan et al96 developed CohortGPT, built on ChatGPT and GPT-4. CohortGPT can take input text from unstructured or semistructured data, such as clinical notes and radiology reports, to designate disease labels associated with the input text. To develop this model, the authors made use of a technique called chain-of-thought (CoT) prompting, a type of in-context learning that guides LLMs to learn task-specific logical chains, which detail how correct answers are deduced from given information. Using the CoT technique in conjunction with reinforcement learning, Guan et al96 trained a policy model to dynamically select CoT samples. They then presented these CoT samples to a prompt model alongside knowledge graphs, which can be thought of as rules detailing the relationships between different concepts, such as that cardiomegaly is a type of heart disease or that scoliosis is a type of spine disease. Using thousands of publicly available radiology reports in the Indiana chest X-ray collection97 and MIMIC-CXR98 data sets, Guan et al96 found that CohortGPT can reliably classify report text as being associated with specific disease labels. On the basis of these results, the authors argue that CohortGPT can be useful not only for patient recruitment for clinical trials but also for other medical applications such as diagnosis and treatment optimization. Furthermore, although CohortGPT was built on ChatGPT and GPT-4, the model can be implemented in any open-source LLM.

BENEFITS AND LIMITATIONS OF FINE-TUNING LLMS FOR SPECIALIZED USE CASES
To build a reliable real-world LLM-based application, fine-tuning is a necessary and crucial step because it fills in the gap between general knowledge and domain-specific expertise for that application. Some benefits of LLM fine-tuning are the following:

1. Domain-specific knowledge: general purpose LLMs may not have enough domain-specific knowledge.
2. Specific task optimization: general purpose LLMs can be optimized for specific tasks (eg, health report summarization and disease detection from a report).
3. Data efficiency: fine-tuning works well with smaller quantities of labeled data because it involves using pretrained LLM(s) trained on huge data sets.
4. Better performance: fine-tuning often leads to improved performance because the model learns domain-specific knowledge to perform relevant tasks while preserving out-of-domain knowledge.
5. Resource efficiency: fine-tuning requires fewer resources in terms of time and memory than training a general purpose LLM from scratch.

With that said, there are several critical limitations to LLMs to consider when fine-tuning for specialized tasks.99,100 A few of these are as follows:

1. Hallucinations: these refer to situations in which model output contains inaccurate or nonfactual information.42,99 In the medical domain, for example, these could consist of findings that are not actually present in a study report. Addressing hallucinations could involve processes such as inducing a model to provide a reasoning process or confidence score associated with model output. For example, the Medical Domain Hallucination Test (Med-HALT) has been designed to evaluate and reduce hallucinations in the medical domain and includes metrics for hallucinations associated with reasoning and memory.101
2. Legal and safety concerns: for example, in the medical domain, the data to be used for fine-tuning may contain sensitive patient information that needs to be safeguarded. In addition, if model output is used to guide treatment decisions for patients, incorrect output (such as hallucinations) could be harmful. This is why authors such as Chao et al95 emphasize the critical need for human review of model output. In addition, cybersecurity measures such as the use of pseudonyms can enhance the privacy and security of patient data.102
3. Biases in training data sets: fine-tuned LLMs can inherit biases from the pretrained models on which they are built, and there is a critical need to use techniques that mitigate this bias.103,104 In medicine, this bias has the potential to exacerbate health inequities if not addressed.105 Some techniques for mitigating bias include prompt engineering, debiasing algorithms, and continuous monitoring of model performance.106
4. Lack of domain-specific data: depending on the extent to which a specific use case is specialized, there may not be sufficient quantities of domain-specific data to fine-tune an LLM using certain approaches. Here, techniques such as in-context learning or PEFT may be more appropriate than full fine-tuning.
5. Data leakage: many of the pretrained LLMs do not report which data were used for training, so if open data sets are used for fine-tuning, these data may have already been used for training the base model. This can lead to data leakage from the validation set to the model, resulting in overly optimistic performance estimates. Addressing this concern will involve greater transparency on the part of developers when describing training data sets and careful selection of pretrained LLMs that provide information about the source, quality, and quantity of training data.102

CONCLUSION
Large language models are poised to transform medicine. In written form or verbally, they can summarize vast amounts of information, may prevent important pieces of information from being missed, and can meaningfully tap into vast stores of literature to inform clinicians at the point of care when meeting with a patient. However, much remains unproven, including how to ensure the information is reliable, privacy is preserved, and answers are tuned to usefully guide medical professionals.

POTENTIAL COMPETING INTERESTS
Drs Anisuzzaman, Malins, Friedman, and Attia have invented algorithms licensed to UltraSight and may benefit from algorithm commercialization via Mayo Clinic. None of these relations with industry are related in any way to the content of the current submission. Given their role as Editorial Board Members, Drs Attia and Friedman had no involvement in the peer-review of this article and have no access to information regarding its peer-review. Drs Friedman and Attia report multiple patents owned by Mayo for AI ECG and stock or stock options in Anumana and XAI Health.

SUPPLEMENTAL ONLINE MATERIAL
Supplemental material can be found online at https://www.mcpdigitalhealth.org/. Supplemental material attached to journal articles has not been edited, and the authors take responsibility for the accuracy of all data.

Abbreviations and Acronyms: AI, artificial intelligence; CoT, chain-of-thought; GPU, graphics processing unit; LLM, large language model; PEFT, parameter-efficient fine-tuning; PPO, proximal policy optimization; QLoRA, quantized low-rank adaptation; RAG, retrieval augmented generation; RLHF, reinforcement learning from human feedback

Publication dates: Received for publication August 2, 2024; revisions received November 6, 2024; accepted for publication November 18, 2024.

Correspondence: Address to Zachi I. Attia, PhD, Department of Cardiovascular Medicine, Mayo Medical School, Artificial Intelligence in Cardiology, Mayo Clinic, 200 1st Street SW, Rochester, MN 55905 (attia.itzhak@mayo.edu).

ORCID
D.M. Anisuzzaman: https://orcid.org/0000-0001-8068-2571; Zachi I. Attia: https://orcid.org/0000-0002-9706-7900
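
The use of pseudonyms noted under legal and safety concerns can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that patient identifiers appear in notes as "MRN" followed by 6 to 10 digits and that a secret key is held securely; `SECRET_KEY`, `MRN_PATTERN`, `pseudonym`, and `pseudonymize_note` are hypothetical names and are not taken from any of the cited works. A keyed hash (HMAC-SHA256) maps each identifier to a stable pseudonym, so that records for the same patient remain linked in a fine-tuning data set without exposing the original identifier.

```python
import hashlib
import hmac
import re

# Hypothetical secret key (assumption); in practice it would be stored securely.
SECRET_KEY = b"replace-with-a-securely-stored-key"

# Hypothetical identifier format (assumption): "MRN" followed by 6 to 10 digits.
MRN_PATTERN = re.compile(r"\bMRN[:\s]*(\d{6,10})\b")


def pseudonym(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Map an identifier to a stable pseudonym with a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return "PT-" + digest[:12]


def pseudonymize_note(text: str, key: bytes = SECRET_KEY) -> str:
    """Replace every MRN-style identifier in a note with its pseudonym."""
    return MRN_PATTERN.sub(lambda m: pseudonym(m.group(1), key), text)


note = (
    "Patient MRN: 00123456 presented with palpitations. "
    "Follow-up for MRN 00123456 in 2 weeks."
)
clean = pseudonymize_note(note)
print(clean)
```

A sketch of this kind covers only identifiers that match the assumed pattern; names, dates, and other free-text identifiers would need separate handling, and human review of the de-identified data would still be advisable.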


REFERENCES

1. Bommasani R, Hudson DA, Adeli E, et al. On the opportunities and risks of foundation models. Preprint. Posted online August 16, 2021. arXiv 210807258. https://doi.org/10.48550/arXiv.2108.07258.
2. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Adv Neural Inform Process Syst. 2017;30.
3. Taori R, Gulrajani I, Zhang T, et al. Stanford Alpaca: An Instruction-Following Llama Model. Stanford University; 2023.
4. Devlin J, Chang M-W, Lee K, Toutanova K. Bert: pre-training of deep bidirectional transformers for language understanding. Preprint. Posted online October 11, 2018. arXiv 181004805. https://doi.org/10.48550/arXiv.1810.04805.
5. Le Scao T, Fan A, Akiki C, et al. Bloom: A 176b-parameter open-access multilingual language model. Preprint. Posted online November 9, 2022. arXiv 221105100. https://doi.org/10.48550/arXiv.2211.05100.
6. Anthropic. Introducing Claude. https://www.anthropic.com/news/introducing-claude. Accessed April 24, 2024.
7. Cohere. Cohere: the leading enterprise AI platform. https://cohere.com/. Accessed April 24, 2024.
8. BaiduResearch. ERNIE Bot: Baidu’s knowledge-enhanced large language model built on full AI stack technology. http://research.baidu.com/Blog/index-view?id=183. Accessed April 25, 2024.
9. ZXhang YX, Haxo YM, Mat YX. Falcon LLM: a new frontier in natural language processing. AC Investment Res J. 2023;220(44).
10. GoogleResearch. Introducing FLAN: more generalizable language models with instruction fine-tuning. https://research.google/blog/introducing-flan-more-generalizable-language-models-with-instruction-fine-tuning/. Accessed April 25, 2024.
11. Gemini Team Google, Anil R, Borgeaud S, Alayrac J-B, et al. Gemini: a family of highly capable multimodal models. Preprint. Posted online December 19, 2023. arXiv 231211805. https://doi.org/10.48550/arXiv.2312.11805.
12. Gemma Team, Mesnard T, Hardin C, Dadashi R, et al. Gemma: Open models based on gemini research and technology. Preprint. Posted online March 13, 2024. arXiv 240308295. https://doi.org/10.48550/arXiv.2403.08295.
13. OpenAI. Models - GPT 3.5 Turbo. https://platform.openai.com/docs/models/gpt-3-5-turbo. Accessed April 25, 2024.
14. OpenAI, Achiam J, Adler S, Agarwal S, et al. GPT-4 technical report. Preprint. Posted online March 15, 2023. arXiv 230308774. https://doi.org/10.48550/arXiv.2303.08774.
15. Thoppilan R, De Freitas D, Hall J, et al. LaMDA: language models for dialog applications. Preprint. Posted online January 20, 2022. arXiv 220108239. https://doi.org/10.48550/arXiv.2201.08239.
16. Touvron H, Lavril T, Izacard G, et al. Llama: Open and efficient foundation language models. Preprint. Posted online February 27, 2023. arXiv 230213971. https://doi.org/10.48550/arXiv.2302.13971.
17. Jiang AQ, Sablayrolles A, Mensch A, et al. Mistral 7B. Preprint. Posted online October 10, 2023. arXiv 231006825. https://doi.org/10.48550/arXiv.2310.06825.
18. HuggingFace. MPT. https://huggingface.co/docs/transformers/main/model_doc/mpt. Accessed April 25, 2024.
19. KDnuggets. Orca LLM: simulating the reasoning processes of ChatGPT. https://www.kdnuggets.com/2023/06/orca-llm-reasoning-processes-chatgpt.html. Accessed April 25, 2024.
20. Anil R, Dai AM, Firat O, et al. Palm 2 technical report. Preprint. Posted online May 17, 2023. arXiv 230510403. https://doi.org/10.48550/arXiv.2305.10403.
21. Gunasekar S, Zhang Y, Aneja J, et al. Textbooks are all you need. Preprint. Posted online June 20, 2023. arXiv 230611644. https://doi.org/10.48550/arXiv.2306.11644.
22. Bellagente M, Tow J, Mahan D, et al. Stable LM 2 1.6B technical report. Preprint. Posted online February 27, 2024. arXiv 240217834. https://doi.org/10.48550/arXiv.2402.17834.
23. Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res. 2020;21(140):1-67.
24. Chiang W-L, Li Z, Lin Z, et al. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://vicuna.lmsys.org. Accessed April 14, 2023.
25. Tunstall L, Beeching E, Lambert N, et al. Zephyr: direct distillation of lm alignment. Preprint. Posted online October 25, 2023. arXiv 231016944. https://doi.org/10.48550/arXiv.2310.16944.
26. Yang R, Tan TF, Lu W, Thirunavukarasu AJ, Ting DSW, Liu N. Large language models in health care: development, applications, and challenges. Health Care Sci. 2023;2(4):255-263. https://doi.org/10.1002/hcs2.61.
27. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930-1940. https://doi.org/10.1038/s41591-023-02448-8.
28. Kraljevic Z, Bean D, Shek A, et al. Foresight - a generative pretrained transformer for modelling of patient timelines using electronic health records: a retrospective modelling study. Lancet Digit Health. 2024;6(4):e281-e290. https://doi.org/10.1016/S2589-7500(24)00025-6.
29. Lee J, Yoon W, Kim S, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234-1240. https://doi.org/10.1093/bioinformatics/btz682.
30. Luo R, Sun L, Xia Y, et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform. 2022;23(6):bbac409. https://doi.org/10.1093/bib/bbac409.
31. Labrak Y, Bazoge A, Morin E, Gourraud P-A, Rouvier M, Dufour R. BioMistral: a collection of open-source pretrained large language models for medical domains. Preprint. Posted online February 15, 2024. arXiv 240210373. https://doi.org/10.18653/v1/2024.findings-acl.348.
32. Li Y, Li Z, Zhang K, Dan R, Jiang S, Zhang Y. ChatDoctor: a medical chat model fine-tuned on a large language model meta-AI (LLaMA) using medical domain knowledge. Cureus. 2023;15(6):e40895. https://doi.org/10.7759/cureus.40895.
33. Toma A, Lawler PR, Ba J, Krishnan RG, Rubin BR, Wang B. Clinical Camel: an open expert-level medical language model with dialogue-based knowledge encoding. Preprint. Posted online May 19, 2023. arXiv 230512031. https://doi.org/10.48550/arXiv.2305.12031.
34. Xiong H, Wang S, Zhu Y, et al. Doctorglm: fine-tuning your chinese doctor is not a herculean task. Preprint. Posted online April 3, 2023. arXiv 230401097. https://doi.org/10.48550/arXiv.2304.01097.
35. Han T, Adams LC, Papaioannou J-M, et al. MedAlpaca - an open-source collection of medical conversational AI models and training data. Preprint. Posted online October 4, 2023. arXiv 230408247. https://doi.org/10.48550/arXiv.2304.08247.
36. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172-180. https://doi.org/10.1038/s41586-023-06291-2.
37. Singhal K, Tu T, Gottweis J, et al. Towards expert-level medical question answering with large language models. Preprint. Posted online May 16, 2023. arXiv 230509617. https://doi.org/10.48550/arXiv.2305.09617.
38. Christophe C, Kanithi PK, Raha T, Khan S, Pimentel MAF. Med42-v2: a suite of clinical LLMs. Preprint. Posted online August 12, 2024. arXiv 240806142. https://doi.org/10.48550/arXiv.2408.06142.
39. Chen Z, Cano AH, Romanou A, et al. MEDITRON-70b: scaling medical pretraining for large language models. Preprint. Posted online November 27, 2023. arXiv 231116079. https://doi.org/10.48550/arXiv.2311.16079.


40. Pal A, Sankarasubbu M. OpenBioLLMs: advancing open-source large language models for healthcare and life sciences. Hugging Face. https://huggingface.co/aaditya/OpenBioLLM-Llama3-70B. Accessed September 30, 2024.
41. Wu C, Lin W, Zhang X, Zhang Y, Xie W, Wang Y. PMC-LLaMA: toward building open-source language models for medicine. J Am Med Inform Assoc. 2024;31(9):1833-1843. https://doi.org/10.1093/jamia/ocae045.
42. Siontis KC, Attia ZI, Asirvatham SJ, Friedman PA. ChatGPT hallucinating: can it get any more humanlike? Eur Heart J. 2024;45(5):321-323. https://doi.org/10.1093/eurheartj/ehad766.
43. Markey N, El-Mansouri I, Rensonnet G, van Langen C, Meier C. From RAGs to riches: using large language models to write documents for clinical trials. Preprint. Posted online February 26, 2024. arXiv 240216406. https://doi.org/10.48550/arXiv.2402.16406.
44. Hager P, Jungmann F, Holland R, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat Med. 2024;30(9):2613-2622. https://doi.org/10.1038/s41591-024-03097-1.
45. Ramjee P, Sachdeva B, Golechha S, et al. CataractBot: An LLM-Powered Expert-in-the-Loop Chatbot for Cataract Patients. Preprint. Posted online February 7, 2024. arXiv 240204620. https://doi.org/10.48550/arXiv.2402.04620.
46. Safranek CW, Sidamon-Eristoff AE, Gilson A, Chartash D. The role of large language models in medical education: applications and implications. JMIR Med Educ. 2023;9:e50945. https://doi.org/10.2196/50945.
47. Wang Z, Liu L, Wang L, Zhou L. R2gengpt: Radiology report generation with frozen LLMs. Meta-Radiology. 2023;1(3):100033. https://doi.org/10.1016/j.metrad.2023.100033.
48. Griewing S, Knitza J, Boekhoff J, et al. Evolution of publicly available large language models for complex decision-making in breast cancer care. Arch Gynecol Obstet. 2024;310(1):537-550. https://doi.org/10.1007/s00404-024-07565-4.
49. Gangavarapu A. Introducing L2M3, a multilingual medical large language model to advance health equity in low-resource regions. Preprint. Posted online April 11, 2024. arXiv 240408705. https://doi.org/10.48550/arXiv.2404.08705.
50. Turing. Fine-tuning LLMS: overview, methods, and best practices. https://www.turing.com/resources/finetuning-large-language-models. Accessed April 26, 2024.
51. Zhao J. LLMDataHub: awesome datasets for LLM training. https://github.com/Zjh-819/LLMDataHub. Accessed April 26, 2024.
52. HuggingFace. Datasets (filter Other by name “llm”). https://huggingface.co/datasets?other=llm. Accessed April 26, 2024.
53. Liu Y, Cao J, Liu C, Ding K, Jin L. Datasets for large language models: a comprehensive survey. Preprint. Posted online February 28, 2024. arXiv 240218041. https://doi.org/10.48550/arXiv.2402.18041.
54. Aisera. LLM evaluation metrics: performance benchmark. https://aisera.com/blog/llm-evaluation/. Accessed April 26, 2024.
55. Serapio A, Chaudhari G, Savage C, et al. An open-source fine-tuned large language model for radiological impression generation: a multi-reader performance study. BMC Med Imaging. 2024;24(1):254. https://doi.org/10.1186/s12880-024-01435-w.
56. Liu H, Tam D, Muqeeth M, et al. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Adv Neural Inform Process Syst. 2022;35:1950-1965.
57. Hu EJ, Shen Y, Wallis P, et al. LoRA: low-rank adaptation of large language models. Preprint. Posted online June 17, 2021. arXiv 210609685. https://doi.org/10.48550/arXiv.2106.09685.
58. Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. QLoRA: efficient finetuning of quantized LLMs. Adv Neural Inform Process Syst. 2024;36:10088-10115.
59. Gendler M, Nadkarni G, Sudri K, et al. Large language models in cardiology: a systematic review. Preprint. Posted online September 1, 2024. medRxiv 24312887. https://doi.org/10.1101/2024.09.01.24312887.
60. Novak A, Rode F, Lisii. The pulse of artificial intelligence in cardiology: a comprehensive evaluation of state-of-the-art large language models for potential use in clinical cardiology. Preprint. Posted online January 30, 2024. medRxiv 23293689. https://doi.org/10.1101/2023.08.08.23293689.
61. Boonstra MJ, Weissenbacher D, Moore JH, Gonzalez-Hernandez G, Asselbergs FW. Artificial intelligence: revolutionizing cardiology with large language models. Eur Heart J. 2024;45(5):332-345. https://doi.org/10.1093/eurheartj/ehad838.
62. Gui H, Omiye JA, Chang CT, Daneshjou R. The promises and perils of foundation models in dermatology. J Invest Dermatol. 2024;144(7):1440-1448. https://doi.org/10.1016/j.jid.2023.12.019.
63. Ullah E, Parwani A, Baig MM, Singh R. Challenges and barriers of using large language models (LLM) such as ChatGPT for diagnostic medicine with a focus on digital pathology–a recent scoping review. Diagn Pathol. 2024;19(1):43. https://doi.org/10.1186/s13000-024-01464-7.
64. Shahab O, El Kurdi B, Shaukat A, Nadkarni G, Soroush A. Large language models: a primer and gastroenterology applications. Ther Adv Gastroenterol. 2024;17:17562848241227031. https://doi.org/10.1177/17562848241227031.
65. Omar Sr M, Sharif Sr K, Glicksberg Sr BS, Nadkarni G, Klang E Sr. Emerging applications of NLP and large language models in gastroenterology and hepatology: a systematic review. Preprint. Posted online June 27, 2024. medRxiv 24309567. https://doi.org/10.1101/2024.06.26.24309567.
66. Giuffre M, Kresevic S, Pugliese N, You K, Shung DL. Optimizing large language models in digestive disease: strategies and challenges to improve clinical outcomes. Liver Int. 2024;44(9):2114-2124. https://doi.org/10.1111/liv.15974.
67. Mudrik A, Nadkarni GN, Efros O, Glicksberg BS, Klang E, Soffer S. Exploring the role of large language models (LLMs) in hematology: a systematic review of applications, benefits, and limitations. Br J Haematol. 2024;205(5):1685-1698. https://doi.org/10.1111/bjh.19738.
68. Barrit S, El Hadwe SE, Carron R, Madsen JR. Rise of large language models in neurosurgery. J Neurosurg. 2024;141(3):878-880. https://doi.org/10.3171/2024.3.JNS24610.
69. Chiang C-C, Fries JA. Exploring the potential of large language models in neurology, using neurologic localization as an example. Neurol Clin Pract. 2024;14(3):e200311. https://doi.org/10.1212/CPJ.0000000000200311.
70. Romano MF, Shih LC, Paschalidis IC, Au R, Kolachalama VB. Large language models in neurology research and future practice. Neurology. 2023;101(23):1058-1067. https://doi.org/10.1212/WNL.0000000000207967.
71. Bachmann M, Duta I, Mazey E, Cooke W, Vatish M, Jones GD. Exploring the capabilities of ChatGPT in women’s health: obstetrics and gynaecology. NPJ Womens Health. 2024;2(1):26. https://doi.org/10.1038/s44294-024-00028-w.
72. Mudrik A, Tsur A, Nadkarni G, et al. Leveraging large language models in gynecologic oncology: a systematic review of current applications and challenges. Preprint. Posted online August 9, 2024. medRxiv 24311699. https://doi.org/10.1101/2024.08.08.24311699.
73. Rydzewski NR, Dinakaran D, Zhao SG, et al. Comparative evaluation of LLMs in clinical oncology. NEJM AI. 2024;1(5). https://doi.org/10.1056/aioa2300151.
74. Lawson McLean A, Wu Y, Lawson McLean AC, Hristidis V. Large language models as decision aids in neuro-oncology: a review of shared decision-making applications. J Cancer Res Clin Oncol. 2024;150(3):139. https://doi.org/10.1007/s00432-024-05673-x.
75. Benary M, Wang XD, Schmidt M, et al. Leveraging large language models for decision support in personalized oncology. JAMA Netw Open. 2023;6(11):e2343689. https://doi.org/10.1001/jamanetworkopen.2023.43689.


76. Luo M-J, Pang J, Bi S, et al. Development and evaluation of a retrieval-augmented large language model framework for ophthalmology. JAMA Ophthalmol. 2024;142(9):798-805. https://doi.org/10.1001/jamaophthalmol.2024.2513.
77. Chatterjee S, Bhattacharya M, Pal S, Lee S-S, Chakraborty C. ChatGPT and large language models in orthopedics: from education and surgery to research. J Exp Orthop. 2023;10(1):128. https://doi.org/10.1186/s40634-023-00700-1.
78. Sisk BA, Antes AL, DuBois JM. An overarching framework for the ethics of artificial intelligence in pediatrics. JAMA Pediatr. 2024;178(3):213-214. https://doi.org/10.1001/jamapediatrics.2023.5761.
79. Wyatt KD, Alexander N, Hills GD, et al. Making sense of artificial intelligence and large language models - including ChatGPT - in pediatric hematology/oncology. Pediatr Blood Cancer. 2024;71(9):e31143. https://doi.org/10.1002/pbc.31143.
80. Obradovich N, Khalsa SS, Khan WU, et al. Opportunities and risks of large language models in psychiatry. NPP Digit Psychiatry Neurosci. 2024;2(1):8. https://doi.org/10.1038/s44277-024-00010-z.
81. Volkmer S, Meyer-Lindenberg A, Schwarz E. Large language models in psychiatry: opportunities and challenges. Psychiatry Res. 2024;339:116026. https://doi.org/10.1016/j.psychres.2024.116026.
82. Omar M, Soffer S, Charney AW, Landi I, Nadkarni GN, Klang E. Applications of large language models in psychiatry: a systematic review. Front Psychiatry. 2024;15:1422807. https://doi.org/10.3389/fpsyt.2024.1422807.
83. Liu Z, Zhong A, Li Y, et al. Tailoring large language models to radiology: a preliminary approach to llm adaptation for a highly specialized domain. In: Cao X, Xu X, Rekik I, Cui Z, Ouyang X, eds. Machine Learning in Medical Imaging. MLMI 2023. Lecture Notes in Computer Science, vol 14348. Springer. https://doi.org/10.1007/978-3-031-45673-2_46.
84. D’Antonoli TA, Stanzione A, Bluethgen C, et al. Large language models in radiology: fundamentals, applications, ethical considerations, risks, and future directions. Diagn Interv Radiol. 2024;30(2):80-90. https://doi.org/10.4274/dir.2023.232417.
85. Lee J, Sharma I, Arcaro N, et al. Automating surgical procedure extraction for society of surgeons adult cardiac surgery registry using pretrained language models. JAMIA Open. 2024;7(3):ooae054. https://doi.org/10.1093/jamiaopen/ooae054.
86. Oh N, Choi G-S, Lee WY. ChatGPT goes to the operating room: evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models. Ann Surg Treat Res. 2023;104(5):269-273. https://doi.org/10.4174/astr.2023.104.5.269.
87. Adhikari K, Naik N, Hameed BZ, Raghunath SK, Somani BK. Exploring the ethical, legal, and social implications of ChatGPT in urology. Curr Urol Rep. 2024;25(1):1-8. https://doi.org/10.1007/s11934-023-01185-2.
88. Gupta R, Pedraza AM, Gorin MA, Tewari AK. Defining the role of large language models in urologic care and research. Eur Urol Oncol. 2024;7(1):1-13. https://doi.org/10.1016/j.euo.2023.07.017.
89. Mukherjee S, Gamble P, Ausin MS, et al. Polaris: a safety-focused LLM constellation architecture for healthcare. Preprint. Posted online March 20, 2024. arXiv 240313313. https://doi.org/10.48550/arXiv.2403.13313.
90. Zhao L, Zeng W, Shi X, Zhou H, Hao D, Lin Y. Aquila-Med LLM: pioneering full-process open-source medical language models. Preprint. Posted online June 18, 2024. arXiv 240612182. https://doi.org/10.48550/arXiv.2406.12182.
91. Li L, Zhou J, Gao Z, et al. A scoping review of using large language models (LLMs) to investigate electronic health records (EHRs). Preprint. Posted online May 5, 2024. arXiv 240503066. https://doi.org/10.48550/arXiv.2405.03066.
92. Zhang X, Yan C, Yang Y, et al. Optimizing large language models for discharge prediction: best practices in leveraging electronic health record audit logs. Preprint. Posted online September 13, 2024. medRxiv 24313594. https://doi.org/10.1101/2024.09.12.24313594.
93. Cui H, Shen Z, Zhang J, et al. LLMs-based few-shot disease predictions using EHR: a novel approach combining predictive agent reasoning and critical agent instruction. Preprint. Posted online March 19, 2024. arXiv 240315464. https://doi.org/10.48550/arXiv.2403.15464.
94. Li R, Wang X, Yu H. LlamaCare: an instruction fine-tuned large language model for clinical NLP. In: Calzolari N, Kan M-Y, Hoste V, Lenci A, Sakti S, Xue N, eds. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ELRA/ICCL; 2024:10632-10641.
95. Chao C-J, Banerjee I, Arsanjani R, et al. EchoGPT: a large language model for echocardiography report summarization. Preprint. Posted online January 20, 2024. medRxiv 24301503. https://doi.org/10.1101/2024.01.18.24301503.
96. Guan Z, Wu Z, Liu Z, et al. CohortGPT: an enhanced GPT for participant recruitment in clinical study. Preprint. Posted online July 21, 2023. arXiv 230711346. https://doi.org/10.48550/arXiv.2307.11346.
97. Demner-Fushman D, Kohli MD, Rosenman MB, et al. Preparing a collection of radiology examinations for distribution and retrieval. J Am Med Inform Assoc. 2016;23(2):304-310. https://doi.org/10.1093/jamia/ocv080.
98. Johnson AEW, Pollard TJ, Berkowitz SJ, et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data. 2019;6(1):317. https://doi.org/10.1038/s41597-019-0322-0.
99. Zhou H, Gu B, Zou X, et al. A survey of large language models in medicine: progress, application, and challenge. Preprint. Posted online November 9, 2023. arXiv 231105112. https://doi.org/10.48550/arXiv.2311.05112.
100. Haltaufderheide J, Ranisch R. The ethics of ChatGPT in medicine and healthcare: a systematic review on large language models (LLMs). NPJ Digit Med. 2024;7(1):183. https://doi.org/10.1038/s41746-024-01157-x.
101. Pal A, Umapathi LK, Sankarasubbu M. Med-HALT: medical domain hallucination test for large language models. Preprint. Posted online July 28, 2023. arXiv 230715343. https://doi.org/10.18653/v1/2023.conll-1.21.
102. Ong JCL, Chang SY-H, William W, et al. Ethical and regulatory challenges of large language models in medicine. Lancet Digit Health. 2024;6(6):e428-e432. https://doi.org/10.1016/S2589-7500(24)00061-X.
103. Goh E, Bunning B, Khoong E, et al. ChatGPT influence on medical decision-making, bias, and equity: a randomized study of clinicians evaluating clinical vignettes. Preprint. Posted online November 27, 2023. medRxiv 23298844. https://doi.org/10.1101/2023.11.24.23298844.
104. Schmidgall S, Harris C, Essien I, et al. Addressing cognitive bias in medical language models. Preprint. Posted online February 12, 2024. arXiv 240208113. https://doi.org/10.48550/arXiv.2402.08113.
105. Perez-Downes JC, Tseng AS, McConn KA, et al. Mitigating bias in clinical machine learning models. Curr Treat Options Cardiovasc Med. 2024;26(3):29-45. https://doi.org/10.1007/s11936-023-01032-0.
106. Omar Sr M, Sorin Sr V, Apakama DU, et al. Evaluating and addressing demographic disparities in medical large language models: a systematic review. Preprint. Posted online October 1, 2024. medRxiv 24313295. https://doi.org/10.1101/2024.09.09.24313295.

326 pages
SSRN Id4655822
No ratings yet
SSRN Id4655822
9 pages
Techniques, Tricks & Frameworks
No ratings yet
Techniques, Tricks & Frameworks
143 pages
Intro to Large Language Models
No ratings yet
Intro to Large Language Models
3 pages
Overview of Large Language Models
No ratings yet
Overview of Large Language Models
47 pages
LLM Basics for Researchers
No ratings yet
LLM Basics for Researchers
54 pages
What Are LLMs
No ratings yet
What Are LLMs
3 pages
LLMs: Training to Inference Guide
No ratings yet
LLMs: Training to Inference Guide
30 pages
The Best LLMs Cheatsheet - Part 1
No ratings yet
The Best LLMs Cheatsheet - Part 1
16 pages
Python BAKMR010399001
No ratings yet
Python BAKMR010399001
3 pages
Intro Class
100% (1)
Intro Class
81 pages
LLM Ntoes
No ratings yet
LLM Ntoes
1,139 pages
Attention Is All You Need.
No ratings yet
Attention Is All You Need.
5 pages
A Bibliometric Review of Large Language Models Research From 2017 To 2023
No ratings yet
A Bibliometric Review of Large Language Models Research From 2017 To 2023
36 pages
Large Language Models and Their Use Cases
No ratings yet
Large Language Models and Their Use Cases
3 pages
Large Language Models (LLMS) - Architecture, Training, Applications, and Challenges
No ratings yet
Large Language Models (LLMS) - Architecture, Training, Applications, and Challenges
5 pages
AILLM
No ratings yet
AILLM
3 pages
Jason Weston Reasoning Alignment Berkeley Talk
No ratings yet
Jason Weston Reasoning Alignment Berkeley Talk
106 pages
Training Large Language Models and Using Them For The Web
No ratings yet
Training Large Language Models and Using Them For The Web
60 pages
The Breakthrough of Large Language Models Release For Medical Applications: 1-Year Timeline and Perspectives
No ratings yet
The Breakthrough of Large Language Models Release For Medical Applications: 1-Year Timeline and Perspectives
11 pages
Understanding LLMS: A Comprehensive Overview From Training To Inference
No ratings yet
Understanding LLMS: A Comprehensive Overview From Training To Inference
30 pages
adaptMLLM Fine-Tuning Multilingual Language Models
No ratings yet
adaptMLLM Fine-Tuning Multilingual Language Models
24 pages
A Fine-Tuned Large Language Model For Domain-Specific With Reinforcement Learning
No ratings yet
A Fine-Tuned Large Language Model For Domain-Specific With Reinforcement Learning
6 pages
Large Language Models in Medical and Healthcare Fields: Applications, Advances, and Challenges
No ratings yet
Large Language Models in Medical and Healthcare Fields: Applications, Advances, and Challenges
48 pages
Leveraging Transfer Learning Fine-Tuning Methodology For Enhanced Text Classification Using BERT
No ratings yet
Leveraging Transfer Learning Fine-Tuning Methodology For Enhanced Text Classification Using BERT
5 pages
Apache CloudStack As A VMware Alternative
No ratings yet
Apache CloudStack As A VMware Alternative
6 pages
Comparison of Fine-Tuning Strategies For Transfer Learning in Medical Image
No ratings yet
Comparison of Fine-Tuning Strategies For Transfer Learning in Medical Image
27 pages
Transfer Learning For Financial Data Predictions
No ratings yet
Transfer Learning For Financial Data Predictions
43 pages
ECD Curriculum Guidelines G000-3
100% (1)
ECD Curriculum Guidelines G000-3
121 pages
OCPJP 7 Preparation Guide
No ratings yet
OCPJP 7 Preparation Guide
10 pages
NSQF Syllabus Analysis & Implementation
No ratings yet
NSQF Syllabus Analysis & Implementation
5 pages
Historical Development of Education System in The Philippines
No ratings yet
Historical Development of Education System in The Philippines
2 pages
Half Yearly Exam Schedule - 2024
No ratings yet
Half Yearly Exam Schedule - 2024
2 pages
Raven
No ratings yet
Raven
48 pages
442 GSPH101 Chapter - 1
No ratings yet
442 GSPH101 Chapter - 1
21 pages
Essay On Library 100, 200 and 250 Words Leverage Edu
No ratings yet
Essay On Library 100, 200 and 250 Words Leverage Edu
1 page
Grade 11 Creative Writing Lesson Plan
No ratings yet
Grade 11 Creative Writing Lesson Plan
3 pages
Name Address Email
No ratings yet
Name Address Email
3 pages
Mak Research Report 2018
No ratings yet
Mak Research Report 2018
226 pages
Rock Guitar Booklet
No ratings yet
Rock Guitar Booklet
12 pages
探究单元UOI 告家长书
No ratings yet
探究单元UOI 告家长书
2 pages
Practice Exam Questions
No ratings yet
Practice Exam Questions
12 pages
Lesson Plan For MTB Mle
100% (2)
Lesson Plan For MTB Mle
7 pages
Senior Honor Choir SopAltTen Sight Reading Packet
No ratings yet
Senior Honor Choir SopAltTen Sight Reading Packet
8 pages
ANSWER SHEET ON Comparative of Shintoism and Daoism
100% (2)
ANSWER SHEET ON Comparative of Shintoism and Daoism
5 pages
The Use of Teaching Media
No ratings yet
The Use of Teaching Media
2 pages
Mei c3 Coursework Comparison
100% (2)
Mei c3 Coursework Comparison
7 pages
Understanding Management 11th Edition Richard L Daft Dorothy Marcic ISBN10 0357033825 ISBN13 9780357033821 Ebook and TestBank Bundle Verified PDF
No ratings yet
Understanding Management 11th Edition Richard L Daft Dorothy Marcic ISBN10 0357033825 ISBN13 9780357033821 Ebook and TestBank Bundle Verified PDF
408 pages
Iupac Rules Isomerism2020
No ratings yet
Iupac Rules Isomerism2020
26 pages
KNOLSKAPE Brochures
No ratings yet
KNOLSKAPE Brochures
1 page
Writing and Motivation - Chapter 1-20170722T101132
No ratings yet
Writing and Motivation - Chapter 1-20170722T101132
14 pages
Niyati Bhatia Asx Summary (CELFP3)
100% (1)
Niyati Bhatia Asx Summary (CELFP3)
3 pages
EMCU003 Scholarly Writing Publication Skills
50% (2)
EMCU003 Scholarly Writing Publication Skills
81 pages
9MA0 A Level Maths Papers 32 Topic Test 1
No ratings yet
9MA0 A Level Maths Papers 32 Topic Test 1
19 pages
Advanced Scikit Learn
No ratings yet
Advanced Scikit Learn
98 pages
Test Item Analysis and Interpretation
No ratings yet
Test Item Analysis and Interpretation
14 pages
Manoj Kumar Sharma CHRO Aarti
No ratings yet
Manoj Kumar Sharma CHRO Aarti
2 pages
The Summit Briefer: Event Schedule
No ratings yet
The Summit Briefer: Event Schedule
4 pages


Mayo Clin Proc Digital Health. March 2025;3(1):100184. https://doi.org/10.1016/j.mcpdig.2024.11.005

[Figure: stages shown are Data preparation, Model selection, Fine-tuning, and Validation]

TABLE. Uses of LLMs in Medicine

Supervised Fine-Tuning
model to adjust and develop in response to

FINE-TUNING PIPELINE
with determining the ejection fraction

the generated reports were rated similarly to treatment optimization. Furthermore,

