Fine-Tuning Large Language Models for Specialized Use Cases
Abstract
Large language models (LLMs) are a type of artificial intelligence that operates by predicting and assembling sequences of words that are statistically likely to follow from a given text input. With this basic ability, LLMs are able to answer complex questions and follow extremely complex instructions. Products created using LLMs, such as ChatGPT by OpenAI and Claude by Anthropic, have gained a huge amount of traction and user engagement and have revolutionized the way we interact with technology, bringing a new dimension to human-computer interaction. Fine-tuning is a process in which a pretrained model, such as an LLM, is further trained on a custom data set to adapt it for specialized tasks or domains. In this review, we outline some of the major methodologic approaches and techniques that can be used to fine-tune LLMs for specialized use cases and enumerate the general steps required for carrying out LLM fine-tuning. We then illustrate a few of these methodologic approaches by describing several specific use cases of fine-tuning LLMs across medical subspecialties. Finally, we close with a consideration of some of the benefits and limitations associated with fine-tuning LLMs for specialized use cases, with an emphasis on specific concerns in the field of medicine.
© 2025. Published by Elsevier Inc on behalf of Mayo Foundation for Medical Education and Research. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Mayo Clin Proc Digital Health 2025;3(1):100184. https://doi.org/10.1016/j.mcpdig.2024.11.005
Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN.

Large language models (LLMs), a specialized subset of artificial intelligence (AI), are designed to generate text through a process known as autoregression (often leading them to be termed autoregressive LLMs). These models operate by predicting and assembling sequences of words that are statistically likely to follow from a given text input, thereby enabling them to produce coherent and relevant sentences. The models can accept conversational input as text or via speech (using speech recognition) and can generate outputs at various levels ranging from technical/professional to that of a high school education and more. They can summarize vast quantities of data, have access to unimaginably large volumes of information, and stand to make this available, easily, to the user. The public release of ChatGPT has opened the public’s imagination and given a glimpse into an information-rich future.

These capabilities allow LLMs to perform a variety of general purpose tasks such as answering questions, completing sentences, and even generating entire articles. One of the breakthroughs that led to the creation of LLMs is the use of foundation models that process and comprehend natural language using deep learning methods. The 2 primary ideas of foundation models are self-supervised learning and scale. In self-supervision, instead of training a model to perform a task that requires explicit annotations, the model learns from the vast amounts of unlabeled data available, extracting patterns and understanding context without human intervention. In addition to being more scalable, self-supervised tasks can allow a model to anticipate a portion of the inputs, which makes the model richer and potentially more valuable than models trained on a more constrained label space. Once the model learns the foundational patterns of language, the same model can then be applied using transfer learning followed by fine-tuning, which enables the model to learn to perform more specific tasks using a smaller set of labeled samples. For scale, the era of the internet provides a nearly limitless amount of data1 and, coupled with advances in computing power, enables the training of models on an unprecedented scale using graphical processing units (GPUs). Together, these developments, enhanced by innovations such as the transformer model architecture,2 have significantly propelled the capabilities and applications of LLMs. A general workflow of LLM fine-tuning for specialized use cases is shown in Figure 1.

FIGURE 1. A general workflow of LLM fine-tuning for specialized use cases. LLM, large language model; QLoRA, quantized low-rank adaptation; RLHF, reinforcement learning from human feedback.

Some existing LLMs to date are Alpaca,3 BERT,4 BLOOM,5 Claude,6 Cohere,7 Ernie,8 Falcon,9 Flan,10 Gemini,11 Gemma,12 GPT-3.5,13 GPT-4,14 LaMDA,15 LLaMA,16 Mistral,17 MPT,18 Orca,19 PaLM 2,20 Phi-1,21 StableLM,22 T5,23 Vicuna,24 and Zephyr.25 All these models were developed to handle language-related tasks by different for-profit and nonprofit organizations such as Google, Meta, and Stanford. Although most of these models were created as general task models, some were developed for specialized tasks such as language translation, human-like chat, and code generation.

In addition to anticipating subsequent text, because models are trained with billions of tokens, many words map to multiple tokens (ie, they are represented by word vectors), enabling mathematical connections between multiple meanings of a term. For example, Paris will have connections to France, city, capital, and so on, so that the relationships between Paris and France and London and United Kingdom may be used by an LLM. A limitation of LLMs is that after training is completed, a model no longer learns or acquires new information, and the information it was trained on may be general (such as Wikipedia), but not well-suited to a specific task. These limitations can be mitigated with fine-tuning to better sculpt an LLM to address a specific field (such as medicine or law), and retrieval augmented generation, which provides additional information that a model may use to address questions and which is particularly useful if that additional information was not included in the model’s training.

In the domain of health care, a number of LLMs have been fine-tuned to perform tasks associated with preconsultation, diagnosis, management, and prediction of future medical outcomes, as well as medical education and medical writing.26-28 LLMs specific to the medical domain include BioBERT,29 BioGPT,30 BioMistral,31 ChatDoctor,32 Clinical Camel,33 DoctorGLM,34 Med-Alpaca,35 Med-PaLM,36 Med-PaLM 2,37 Med42-v2,38 Meditron-70b,39 OpenBioLLM-70B,40 and PMC-LLaMA.41 One particularly powerful use of models such as these is obtaining answers to questions rather than links to articles, with the caveat that using systems not designed to address medical questions may be
inaccurate.42 LLMs can potentially be used for any task that requires reading text and summarizing it or extracting pertinent information. Examples of data extraction uses could include reviewing medical records to create a discharge summary, identifying and summarizing all risks for stroke in a patient with atrial fibrillation, or determining preoperative surgical risk using standardized scoring criteria. A list of potential uses of LLMs in medicine along with specific examples is provided in Table.43-49

In the following sections, we first outline some of the major approaches and techniques for fine-tuning LLMs in the medical domain and touch on retrieval augmented generation. Then, we describe specialized use cases in which LLMs have been fine-tuned for medical applications across various medical subspecialties. After this, we close with a consideration of some of the benefits and key limitations associated with fine-tuning LLMs in the medical domain.

FINE-TUNING METHODOLOGY
Fine-tuning is a process in which a pretrained model is adapted for particular tasks or domains by continuing to train the model using only a domain-specific data set that is different from the original data set used to train the base model. Various fine-tuning strategies and approaches are used to adjust the model parameters to a specific need. Some fine-tuning approaches are briefly described in this article.
had clinicians review outputs to ensure that the responses were medically sound and factually accurate. Similarly, Serapio et al55 fine-tuned LLMs for generating radiological impressions from chest computed tomography scans and had model outputs assessed by board-certified radiologists.

Beyond these general approaches, a number of specific techniques can be applied to fine-tune LLMs for specialized use cases. Some example techniques include the following:

1. In-context learning: In this approach, a pretrained LLM is induced to perform a task using prompted examples. An example of this is few-shot learning, which involves giving the model a few shots or instances to learn a new task during inference. Few-shot learning aims to direct the model’s predictions by providing examples and context specifically in the prompt but importantly does not involve gradient-based training.56

2. Hyperparameter tuning: This is a straightforward method that consists of manually modifying basic hyperparameters (ie, learning rate, batch size, optimizer, and number of epochs) of the model until the desired performance is obtained. This changes how the model learns; that is, how fast it learns, how to decide when training is completed, and so on.

3. Parameter-efficient fine-tuning: Parameter-efficient fine-tuning (PEFT) is an efficient technique in which only a small portion of the parameters of an LLM are selectively modified during fine-tuning, typically by adding new layers or modifying existing ones in a task-specific manner. This method drastically lowers computational and storage needs while keeping performance comparable with complete fine-tuning. Some PEFT techniques are low-rank adaptation,57 quantized low-rank adaptation (QLoRA),58 prefix tuning, and prompt tuning.

Quantized low-rank adaptation is a very popular technique for LLM fine-tuning because it uses much less memory than a full fine-tuning approach, at the price of sacrificing some performance. For example, a full fine-tuning of the LLaMA 65B parameter model requires more than 780 GB of GPU memory, whereas using the QLoRA technique requires only 48 GB of GPU memory.58 This powerful technique is based on the following technical ingredients:

i. 4-bit NormalFloat representation of model parameters, whereas typically, parameters of trained models are stored in a 32-bit format. This technique divides model parameters into equally-sized buckets instead of equally-spaced buckets.

ii. Double quantization is a method that quantizes the quantization constants. In general, quantization converts datatypes with a larger number of bits to fewer bits (eg, FP32 to 8-bit integers). Quantized low-rank adaptation uses the blockwise quantization technique, which requires more memory than standard quantization but significantly reduces quantization error, thus retaining good performance. Double quantization recovers part of this memory overhead.

iii. Low-rank adaptation,57 which freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the transformer architecture, greatly reduces the number of trainable parameters for downstream tasks.

In simpler terms, low-rank adaptation represents the weight updates in a more compressed (low-rank) form and trains only that compressed form. Although the compression may lose some information, under the assumption that many of the model weights are redundant, this leads to only a small decrease in performance relative to the savings in memory and required compute power.

4. Retrieval augmented generation: Retrieval augmented generation (RAG) is a technique that combines the capabilities of neural language models with information retrieval systems to enhance the generation of contextually rich and accurate responses. In RAG, when a query is received, the model first uses a retrieval system to fetch relevant documents or snippets from a large corpus, such as a database of scientific literature. These retrieved texts are then fed into a generative model, typically a transformer-based neural network, which integrates the retrieved information with its pretrained knowledge to produce a coherent and informed response. This approach is particularly useful in domains
where accuracy and specificity are critical, such as scientific research or technical support, because it allows the model to base its answers on up-to-date and source-specific data, providing citations and grounding its responses in existing literature. A comparison of the outputs generated by presenting the same medical question to a search engine, LLM, and RAG-based system is shown in Figure 2.

SPECIALIZED USE CASES IN MEDICINE
Across many of the major subspecialties of medicine, LLMs are being fine-tuned to address specific issues, and practitioners are posing questions about how fine-tuned LLMs could revolutionize their fields. These subspecialties include but are not limited to cardiology,59-61 dermatology,62 digital pathology,63 gastroenterology and hepatology,64-66 hematology,67 neurology,68-70 obstetrics and gynecology,71,72 oncology,73-75 ophthalmology,76 orthopedics,77 pediatrics,78,79 psychiatry,80-82 radiology,83,84 surgery,85,86 and urology.87,88 Although there are nuances according to specific subspecialties, many practitioners highlight the potential of fine-tuned LLMs to aid clinicians in areas such as clinical decision support, treatment planning, and patient consultation, as well as alleviate administrative burden associated with tasks such as generating clinical notes, discharge reports, and medical billing. At the same time, many are concerned about the ethical, legal, and social implications of using such models. In the subsequent section, we review some of these concerns. Before that, we highlight a few specific methodologies and use cases that illustrate the general framework outlined in the previous sections.

Using RLHF to Fine-Tune LLMs in Medicine
Mukherjee et al89 developed a constellation system called Polaris, which was composed of several agents. Their primary agent (focused on patient-friendly conversation) was developed in 3 stages: general instruction-tuning, conversation and agent tuning, and RLHF. The RLHF step was performed by registered nurses, who gave preference feedback on multiple responses. Zhao et al90 developed Aquila-Med, a bilingual medical LLM, using supervised fine-tuning and RLHF to tackle medical challenges. It was trained on large-scale Chinese and English medical data sets, with RLHF further aligning the model to improve performance in medical dialogs and multiple-choice questions.

Integration of LLMs into Electronic Health Record Systems
Several LLMs have been applied to electronic health record (EHR) systems, providing benefits such as generating patient summaries from EHRs, assisting health care providers with more efficient decision making, named entity recognition, medical note summarization, and predictive diagnosis.91 Zhang et al92 investigated the application of LLM fine-tuning to EHR audit log data for clinical prediction tasks, with a focus on discharge predictions. Cui et al93 evaluated the zero-shot and few-shot performance of LLMs on EHR-based disease prediction tasks and proposed a novel approach that leverages collaborative LLM agents to enhance predictive performance. Li et al94 fine-tuned an LLM named LlamaCare and evaluated it on various clinical tasks, such as generating discharge summaries, predicting mortality and length of stay, and more.

Generation of Echocardiography Reports to Streamline Workflows
Echocardiography is one of the most widely used imaging techniques for gaining insights into the structure and function of the heart. A typical echocardiography report includes numerous measurements as well as text-based statements or findings. These findings are summarized by a clinician to give an overall set of final impressions for the study. This is a time-consuming and error-prone process. To address this issue, Chao et al95 leveraged several open-source LLMs to generate echocardiography reports using either zero-shot learning (for Flan-T5, Med-Alpaca, Llama-2, and Zephyr) or QLoRA fine-tuning (Llama-2 and Zephyr). Using a training data set of 95,506 echocardiography reports, the authors observed that EchoGPT, which is a Llama-2 model trained using instruction fine-tuning with QLoRA, outperformed other LLMs on critical performance metrics. In addition, when 4 echocardiography board-certified cardiologists were asked to rate reports generated by EchoGPT for 30 randomly selected cases,
FIGURE 2. Comparison of responses using different systems to the same question: “What are the interactions between propafenone
and colchicine?”
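
The retrieve-then-generate flow behind a RAG-based system can be sketched in a few lines. The example below is a minimal sketch under loud assumptions: the three-document corpus is placeholder text rather than real drug information, retrieval uses a simple bag-of-words cosine similarity instead of the learned embeddings a production system would typically use, and the names `tokenize`, `retrieve`, and `build_prompt` are hypothetical. The final prompt would be passed to a generative LLM, which is deliberately left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[token] * b[token] for token in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, top_k=2):
    """Return up to top_k documents with nonzero similarity to the query."""
    query_vector = Counter(tokenize(query))
    scored = [
        (cosine_similarity(query_vector, Counter(tokenize(doc["text"]))), doc)
        for doc in corpus
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_prompt(query, documents):
    """Assemble a prompt that grounds the answer in the retrieved sources."""
    sources = "\n".join(f"[{doc['id']}] {doc['text']}" for doc in documents)
    return (
        "Answer the question using only the sources below and cite them by id.\n"
        f"Sources:\n{sources}\n"
        f"Question: {query}\n"
        "Answer:"
    )


# Hypothetical placeholder corpus; a real system would index curated literature.
corpus = [
    {"id": "doc-1", "text": "Placeholder monograph text describing propafenone metabolism."},
    {"id": "doc-2", "text": "Placeholder guideline text describing colchicine dosing and interactions."},
    {"id": "doc-3", "text": "Placeholder article text about echocardiography report templates."},
]

question = "What are the interactions between propafenone and colchicine?"
retrieved = retrieve(question, corpus)
prompt = build_prompt(question, retrieved)
print(prompt)
```

Because the generator only sees the sources placed in the prompt, the answer can cite them by id, which is the grounding behavior that distinguishes a RAG-based system from a standalone LLM.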
domain and includes metrics for hallucinations associated with reasoning and memory.101

2. Legal and safety concerns: for example, in the medical domain, the data to be used for fine-tuning may contain sensitive patient information that needs to be safeguarded. In addition, if model output is used to guide treatment decisions for patients, incorrect output (such as hallucinations) could be harmful. This is why authors such as Chao et al95 emphasize the critical need for human review of model output. In addition, cybersecurity measures such as the use of pseudonyms can enhance the privacy and security of patient data.102

3. Biases in training data sets: fine-tuned LLMs can inherit biases from the pretrained models on which they are built, and there is a critical need to use techniques that mitigate this bias.103,104 In medicine, this bias has the potential to exacerbate health inequities if not addressed.105 Some techniques for mitigating bias include prompt engineering, debiasing algorithms, and continuous monitoring of model performance.106

4. Lack of domain-specific data: depending on the extent to which a specific use case is specialized, there may not be sufficient quantities of domain-specific data to fine-tune an LLM using certain approaches. Here, techniques such as in-context learning or PEFT may be more appropriate than full fine-tuning.

5. Data leakage: many of the pretrained LLMs do not report which data were used for training, so if open data sets are used for fine-tuning, these data may have already been used for training the base model. This can lead to data leakage from the validation set to the model, resulting in overly optimistic performance. Addressing this concern will involve greater transparency on the part of developers when describing training data sets and careful selection of pretrained LLMs that provide information about the source, quality, and quantity of training data.102

CONCLUSION
Large language models are poised to transform medicine. In written form or verbally, they can summarize vast amounts of information, may prevent important pieces of information from being missed, and can meaningfully tap into vast stores of literature to inform clinicians at the point of care when meeting with a patient. However, much remains unproven including how to ensure the information is reliable, privacy is preserved, and answers are tuned to usefully guide medical professionals.

POTENTIAL COMPETING INTERESTS
Drs Anisuzzaman, Malins, Friedman, and Attia have invented algorithms licensed to UltraSight and may benefit from algorithm commercialization via Mayo Clinic. None of these relations with industry are related in any way to the content of the current submission. Given their role as Editorial Board Members, Drs Attia and Friedman had no involvement in the peer-review of this article and have no access to information regarding its peer-review. Drs Friedman and Attia report multiple patents owned by Mayo for AI ECG and stock or stock options in Anumana and XAI Health.

SUPPLEMENTAL ONLINE MATERIAL
Supplemental material can be found online at https://www.mcpdigitalhealth.org/. Supplemental material attached to journal articles has not been edited, and the authors take responsibility for the accuracy of all data.

Abbreviations and Acronyms: AI, artificial intelligence; CoT, chain-of-thought; GPU, graphical processing unit; LLM, large language model; PEFT, parameter-efficient fine-tuning; PPO, proximal policy optimization; QLoRA, quantized low-rank adaptation; RAG, retrieval augmented generation; RLHF, reinforcement learning from human feedback

Publication dates: Received for publication August 2, 2024; revisions received November 6, 2024; accepted for publication November 18, 2024.

Correspondence: Address to Zachi I. Attia, PhD, Department of Cardiovascular Medicine, Mayo Medical School, Artificial Intelligence in Cardiology, Mayo Clinic, 200 1st Street SW, Rochester, MN 55905 (attia.itzhak@mayo.edu).

ORCID
D.M. Anisuzzaman: https://orcid.org/0000-0001-8068-2571; Zachi I. Attia: https://orcid.org/0000-0002-9706-7900
40. Pal A, Sankarasubbu M. OpenBioLLMs: advancing open-source large language models for healthcare and life sciences. Hugging Face. https://huggingface.co/aaditya/OpenBioLLM-Llama3-70B. Accessed September 30, 2024.
41. Wu C, Lin W, Zhang X, Zhang Y, Xie W, Wang Y. PMC-LLaMA: toward building open-source language models for medicine. J Am Med Inform Assoc. 2024;31(9):1833-1843. https://doi.org/10.1093/jamia/ocae045.
42. Siontis KC, Attia ZI, Asirvatham SJ, Friedman PA. ChatGPT hallucinating: can it get any more humanlike? Eur Heart J. 2024;45(5):321-323. https://doi.org/10.1093/eurheartj/ehad766.
43. Markey N, El-Mansouri I, Rensonnet G, van Langen C, Meier C. From RAGs to riches: using large language models to write documents for clinical trials. Preprint. Posted online February 26, 2024. arXiv 240216406. https://doi.org/10.48550/arXiv.2402.16406.
44. Hager P, Jungmann F, Holland R, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat Med. 2024;30(9):2613-2622. https://doi.org/10.1038/s41591-024-03097-1.
45. Ramjee P, Sachdeva B, Golechha S, et al. CataractBot: An LLM-Powered Expert-in-the-Loop Chatbot for Cataract Patients. Preprint. Posted online February 7, 2024. arXiv 240204620. https://doi.org/10.48550/arXiv.2402.04620.
46. Safranek CW, Sidamon-Eristoff AE, Gilson A, Chartash D. The role of large language models in medical education: applications and implications. JMIR Med Educ. 2023;9:e50945. https://doi.org/10.2196/50945.
47. Wang Z, Liu L, Wang L, Zhou L. R2gengpt: Radiology report generation with frozen LLMs. Meta-Radiology. 2023;1(3):100033. https://doi.org/10.1016/j.metrad.2023.100033.
48. Griewing S, Knitza J, Boekhoff J, et al. Evolution of publicly available large language models for complex decision-making in breast cancer care. Arch Gynecol Obstet. 2024;310(1):537-550. https://doi.org/10.1007/s00404-024-07565-4.
49. Gangavarapu A. Introducing L2M3, a multilingual medical large language model to advance health equity in low-resource regions. Preprint. Posted online April 11, 2024. arXiv 240408705. https://doi.org/10.48550/arXiv.2404.08705.
50. Turing. Fine-tuning LLMS: overview, methods, and best practices. https://www.turing.com/resources/finetuning-large-language-models. Accessed April 26, 2024.
51. Zhao J. LLMDataHub: awesome datasets for LLM training. https://github.com/Zjh-819/LLMDataHub. Accessed April 26, 2024.
52. HuggingFace. Datasets (filter Other by name “llm”). https://huggingface.co/datasets?other=llm. Accessed April 26, 2024.
53. Liu Y, Cao J, Liu C, Ding K, Jin L. Datasets for large language models: a comprehensive survey. Preprint. Posted online February 28, 2024. arXiv 240218041. https://doi.org/10.48550/arXiv.2402.18041.
54. Aisera. LLM evaluation metrics: performance benchmark. https://aisera.com/blog/llm-evaluation/. Accessed April 26, 2024.
55. Serapio A, Chaudhari G, Savage C, et al. An open-source fine-tuned large language model for radiological impression generation: a multi-reader performance study. BMC Med Imaging. 2024;24(1):254. https://doi.org/10.1186/s12880-024-01435-w.
56. Liu H, Tam D, Muqeeth M, et al. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Adv Neural Inform Process Syst. 2022;35:1950-1965.
57. Hu EJ, Shen Y, Wallis P, et al. LoRA: low-rank adaptation of large language models. Preprint. Posted online June 17, 2021. arXiv 210609685. https://doi.org/10.48550/arXiv.2106.09685.
58. Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. QLoRA: efficient finetuning of quantized LLMs. Adv Neural Inform Process Syst. 2024;36:10088-10115.
59. Gendler M, Nadkarni G, Sudri K, et al. Large language models in cardiology: a systematic review. Preprint. Posted online September 1, 2024. medRxiv 24312887. https://doi.org/10.1101/2024.09.01.24312887.
60. Novak A, Rode F, Lisii. The pulse of artificial intelligence in cardiology: a comprehensive evaluation of state-of-the-art large language models for potential use in clinical cardiology. Preprint. Posted online January 30, 2024. medRxiv 23293689. https://doi.org/10.1101/2023.08.08.23293689.
61. Boonstra MJ, Weissenbacher D, Moore JH, Gonzalez-Hernandez G, Asselbergs FW. Artificial intelligence: revolutionizing cardiology with large language models. Eur Heart J. 2024;45(5):332-345. https://doi.org/10.1093/eurheartj/ehad838.
62. Gui H, Omiye JA, Chang CT, Daneshjou R. The promises and perils of foundation models in dermatology. J Invest Dermatol. 2024;144(7):1440-1448. https://doi.org/10.1016/j.jid.2023.12.019.
63. Ullah E, Parwani A, Baig MM, Singh R. Challenges and barriers of using large language models (LLM) such as ChatGPT for diagnostic medicine with a focus on digital pathology–a recent scoping review. Diagn Pathol. 2024;19(1):43. https://doi.org/10.1186/s13000-024-01464-7.
64. Shahab O, El Kurdi B, Shaukat A, Nadkarni G, Soroush A. Large language models: a primer and gastroenterology applications. Ther Adv Gastroenterol. 2024;17:1756284824122703. https://doi.org/10.1177/17562848241227031.
65. Omar Sr M, Sharif Sr K, Glicksberg Sr BS, Nadkarni G, Klang E Sr. Emerging applications of NLP and large language models in gastroenterology and hepatology: a systematic review. Preprint. Posted online June 27, 2021. medRxiv 24309567. https://doi.org/10.1101/2024.06.26.24309567.
66. Giuffre M, Kresevic S, Pugliese N, You K, Shung DL. Optimizing large language models in digestive disease: strategies and challenges to improve clinical outcomes. Liver Int. 2024;44(9):2114-2124. https://doi.org/10.1111/liv.15974.
67. Mudrik A, Nadkarni GN, Efros O, Glicksberg BS, Klang E, Soffer S. Exploring the role of large language models (LLMs) in hematology: a systematic review of applications, benefits, and limitations. Br J Haematol. 2024;205(5):1685-1698. https://doi.org/10.1111/bjh.19738.
68. Barrit S, El Hadwe SE, Carron R, Madsen JR. Rise of large language models in neurosurgery. J Neurosurg. 2024;141(3):878-880. https://doi.org/10.3171/2024.3.JNS24610.
69. Chiang C-C, Fries JA. Exploring the potential of large language models in neurology, using neurologic localization as an example. Neurol Clin Pract. 2024;14(3):e200311. https://doi.org/10.1212/CPJ.0000000000200311.
70. Romano MF, Shih LC, Paschalidis IC, Au R, Kolachalama VB. Large language models in neurology research and future practice. Neurology. 2023;101(23):1058-1067. https://doi.org/10.1212/WNL.0000000000207967.
71. Bachmann M, Duta I, Mazey E, Cooke W, Vatish M, Jones GD. Exploring the capabilities of ChatGPT in women’s health: obstetrics and gynaecology. NPJ Womens Health. 2024;2(1):26. https://doi.org/10.1038/s44294-024-00028-w.
72. Mudrik A, Tsur A, Nadkarni G, et al. Leveraging large language models in gynecologic oncology: a systematic review of current applications and challenges. Preprint. Posted online August 9, 2024. medRxiv 24311699. https://doi.org/10.1101/2024.08.08.24311699.
73. Rydzewski NR, Dinakaran D, Zhao SG, et al. Comparative evaluation of LLMs in clinical oncology. NEJM AI. 2024;1(5). https://doi.org/10.1056/aioa2300151.
74. Lawson McLean A, Wu Y, Lawson McLean AC, Hristidis V. Large language models as decision aids in neuro-oncology: a review of shared decision-making applications. J Cancer Res Clin Oncol. 2024;150(3):139. https://doi.org/10.1007/s00432-024-05673-x.
75. Benary M, Wang XD, Schmidt M, et al. Leveraging large language models for decision support in personalized oncology. JAMA Netw Open. 2023;6(11):e2343689. https://doi.org/10.1001/jamanetworkopen.2023.43689.
76. Luo M-J, Pang J, Bi S, et al. Development and evaluation of a retrieval-augmented large language model framework for ophthalmology. JAMA Ophthalmol. 2024;142(9):798-805. https://doi.org/10.1001/jamaophthalmol.2024.2513.
77. Chatterjee S, Bhattacharya M, Pal S, Lee S-S, Chakraborty C. ChatGPT and large language models in orthopedics: from education and surgery to research. J Exp Orthop. 2023;10(1):128. https://doi.org/10.1186/s40634-023-00700-1.
78. Sisk BA, Antes AL, DuBois JM. An overarching framework for the ethics of artificial intelligence in pediatrics. JAMA Pediatr. 2024;178(3):213-214. https://doi.org/10.1001/jamapediatrics.2023.5761.
79. Wyatt KD, Alexander N, Hills GD, et al. Making sense of artificial intelligence and large language models, including ChatGPT, in pediatric hematology/oncology. Pediatr Blood Cancer. 2024;71(9):e31143. https://doi.org/10.1002/pbc.31143.
80. Obradovich N, Khalsa SS, Khan WU, et al. Opportunities and risks of large language models in psychiatry. NPP Digit Psychiatry Neurosci. 2024;2(1):8. https://doi.org/10.1038/s44277-024-00010-z.
81. Volkmer S, Meyer-Lindenberg A, Schwarz E. Large language models in psychiatry: opportunities and challenges. Psychiatry Res. 2024;339:116026. https://doi.org/10.1016/j.psychres.2024.116026.
82. Omar M, Soffer S, Charney AW, Landi I, Nadkarni GN, Klang E. Applications of large language models in psychiatry: a systematic review. Front Psychiatry. 2024;15:1422807. https://doi.org/10.3389/fpsyt.2024.1422807.
83. Liu Z, Zhong A, Li Y, et al. Tailoring large language models to radiology: a preliminary approach to LLM adaptation for a highly specialized domain. In: Cao X, Xu X, Rekik I, Cui Z, Ouyang X, eds. Machine Learning in Medical Imaging. MLMI 2023. Lecture Notes in Computer Science, vol 14348. Springer. doi:10.1007/978-3-031-45673-2_46.
84. D’Antonoli TA, Stanzione A, Bluethgen C, et al. Large language models in radiology: fundamentals, applications, ethical considerations, risks, and future directions. Diagn Interv Radiol. 2024;30(2):80-90. https://doi.org/10.4274/dir.2023.232417.
85. Lee J, Sharma I, Arcaro N, et al. Automating surgical procedure extraction for society of surgeons adult cardiac surgery registry using pretrained language models. JAMIA Open. 2024;7(3):ooae054. https://doi.org/10.1093/jamiaopen/ooae054.
86. Oh N, Choi G-S, Lee WY. ChatGPT goes to the operating room: evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models. Ann Surg Treat Res. 2023;104(5):269-273. https://doi.org/10.4174/astr.2023.104.5.269.
87. Adhikari K, Naik N, Hameed BZ, Raghunath SK, Somani BK. Exploring the ethical, legal, and social implications of ChatGPT in urology. Curr Urol Rep. 2024;25(1):1-8. https://doi.org/10.1007/s11934-023-01185-2.
88. Gupta R, Pedraza AM, Gorin MA, Tewari AK. Defining the role of large language models in urologic care and research. Eur Urol Oncol. 2024;7(1):1-13. https://doi.org/10.1016/j.euo.2023.07.017.
89. Mukherjee S, Gamble P, Ausin MS, et al. Polaris: a safety-focused LLM constellation architecture for healthcare. Preprint. Posted online March 20, 2024. arXiv 2403.13313. https://doi.org/10.48550/arXiv.2403.13313.
90. Zhao L, Zeng W, Shi X, Zhou H, Hao D, Lin Y. Aquila-Med LLM: pioneering full-process open-source medical language models. Preprint. Posted online June 18, 2024. arXiv 2406.12182. https://doi.org/10.48550/arXiv.2406.12182.
91. Li L, Zhou J, Gao Z, et al. A scoping review of using large language models (LLMs) to investigate electronic health records (EHRs). Preprint. Posted online May 5, 2024. arXiv 2405.03066. https://doi.org/10.48550/arXiv.2405.03066.
92. Zhang X, Yan C, Yang Y, et al. Optimizing large language models for discharge prediction: best practices in leveraging electronic health record audit logs. Preprint. Posted online September 13, 2024. medRxiv 24313594. https://doi.org/10.1101/2024.09.12.24313594.
93. Cui H, Shen Z, Zhang J, et al. LLMs-based few-shot disease predictions using EHR: a novel approach combining predictive agent reasoning and critical agent instruction. Preprint. Posted online March 19, 2024. arXiv 2403.15464. https://doi.org/10.48550/arXiv.2403.15464.
94. Li R, Wang X, Yu H. LlamaCare: an instruction fine-tuned large language model for clinical NLP. In: Calzolari N, Kan M-Y, Hoste V, Lenci A, Sakti S, Xue N, eds. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ELRA/ICCL; 2024:10632-10641.
95. Chao C-J, Banerjee I, Arsanjani R, et al. EchoGPT: a large language model for echocardiography report summarization. Preprint. Posted online January 20, 2024. medRxiv 24301503. https://doi.org/10.1101/2024.01.18.24301503.
96. Guan Z, Wu Z, Liu Z, et al. CohortGPT: an enhanced GPT for participant recruitment in clinical study. Preprint. Posted online July 21, 2023. arXiv 2307.11346. https://doi.org/10.48550/arXiv.2307.11346.
97. Demner-Fushman D, Kohli MD, Rosenman MB, et al. Preparing a collection of radiology examinations for distribution and retrieval. J Am Med Inform Assoc. 2016;23(2):304-310. https://doi.org/10.1093/jamia/ocv080.
98. Johnson AEW, Pollard TJ, Berkowitz SJ, et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data. 2019;6(1):317. https://doi.org/10.1038/s41597-019-0322-0.
99. Zhou H, Gu B, Zou X, et al. A survey of large language models in medicine: progress, application, and challenge. Preprint. Posted online November 9, 2023. arXiv 2311.05112. https://doi.org/10.48550/arXiv.2311.05112.
100. Haltaufderheide J, Ranisch R. The ethics of ChatGPT in medicine and healthcare: a systematic review on large language models (LLMs). NPJ Digit Med. 2024;7(1):183. https://doi.org/10.1038/s41746-024-01157-x.
101. Pal A, Umapathi LK, Sankarasubbu M. Med-HALT: medical domain hallucination test for large language models. Preprint. Posted online July 28, 2023. arXiv 2307.15343. https://doi.org/10.18653/v1/2023.conll-1.21.
102. Ong JCL, Chang SY-H, William W, et al. Ethical and regulatory challenges of large language models in medicine. Lancet Digit Health. 2024;6(6):e428-e432. https://doi.org/10.1016/S2589-7500(24)00061-X.
103. Goh E, Bunning B, Khoong E, et al. ChatGPT influence on medical decision-making, bias, and equity: a randomized study of clinicians evaluating clinical vignettes. Preprint. Posted online November 27, 2023. medRxiv 23298844. https://doi.org/10.1101/2023.11.24.23298844.
104. Schmidgall S, Harris C, Essien I, et al. Addressing cognitive bias in medical language models. Preprint. Posted online February 12, 2024. arXiv 2402.08113. https://doi.org/10.48550/arXiv.2402.08113.
105. Perez-Downes JC, Tseng AS, McConn KA, et al. Mitigating bias in clinical machine learning models. Curr Treat Options Cardiovasc Med. 2024;26(3):29-45. https://doi.org/10.1007/s11936-023-01032-0.
106. Omar Sr M, Sorin Sr V, Apakama DU, et al. Evaluating and addressing demographic disparities in medical large language models: a systematic review. Preprint. Posted online October 1, 2024. medRxiv 24313295. https://doi.org/10.1101/2024.09.09.24313295.