[go: up one dir, main page]

0% found this document useful (0 votes)
12 views31 pages

Graph Benchmark

Diet plays a critical role in human health, yet tailoring dietary reasoning to individual health conditions remains a major challenge.

Uploaded by

Mỹ Linh Bàng
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
12 views31 pages

Graph Benchmark

Diet plays a critical role in human health, yet tailoring dietary reasoning to individual health conditions remains a major challenge.

Uploaded by

Mỹ Linh Bàng
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 31

NGQA: A Nutritional Graph Question Answering Benchmark

for Personalized Health-aware Nutritional Reasoning


Zheyuan Zhang1* , Yiyang Li1* , Nhi Ha Lan Le2* , Zehong Wang1 , Tianyi Ma1 , Vincent Galassi1 ,
Keerthiram Murugesan 3 , Nuno Moniz1 , Werner Geyer3 , Nitesh V Chawla1 , Chuxu Zhang4 Yanfang Ye1†
1
University of Notre Dame, 2 Brandeis University, 3 IBM Research, 4 University of Connecticut
*
Equal Contribution † Corresponding Author
{zzhang42,yli62,zwang43,tma2,vgalassi,nmoniz2,nchawla,yye7}@nd.edu,
nhihlle@brandeis.edu, keerthiram.murugesa@ibm.com, werner.geyer@us.ibm.com, chuxu.zhang@uconn.edu

Abstract Question Level: Standard


Sparse Standard Complex
Ingredient A High Calorie
Diet plays a critical role in human health, yet Question Settings contradict
has
contains has
tailoring dietary reasoning to individual health -B - ML - TG
Obesity
arXiv:2412.15547v1 [cs.CL] 20 Dec 2024

Danish pastry
conditions remains a major challenge. Nutri- Task Settings with fruit User has
tion Question Answering (QA) has emerged contains
Diet
has
as a popular method for addressing this prob- High contradict Habit A
Data Source Sodium Hypertension
lem. However, current research faces two
a) Overview of NGQA Benchmark b) An Example of Standard Question
critical limitations. On the one hand, the ab-
sence of datasets involving user-specific med- Task Level: - ML

ical information severely limits personaliza- Question: Based on the information provided, please report the
nutrient tags to judge if food Danish pastry with fruit is healthy or
tion. This challenge is further compounded
unhealthy to the user.
by the wide variability in individual health
needs. On the other hand, while large lan- Answer: high calorie, high sodium.

guage models (LLMs), a popular solution for c) An Example of Answering the Question in Multi-label (ML) Task
this task, demonstrate strong reasoning abil-
ities, they struggle with the domain-specific Figure 1: An Overview of NGQA Benchmark (a) along
complexities of personalized healthy dietary with a data showcase: (b) an example of the knowledge
reasoning, and existing benchmarks fail to cap- graph used for a standard level question and (c) the
ture these challenges. To address these gaps, question and the answer of that question under the multi-
we introduce the Nutritional Graph Question label classification task (-ML) settings.
Answering (NGQA) benchmark, the first graph
question answering dataset designed for per- benefits of balanced nutrition, unhealthy eating
sonalized nutritional health reasoning. NGQA habits remain alarmingly prevalent in modern so-
leverages data from the National Health and Nu- ciety (WHO, 2021). In the United States alone,
trition Examination Survey (NHANES) and the approximately 42.4% of adults are classified as
Food and Nutrient Database for Dietary Studies obese (CDC, 2020a), and in 2017, poor dietary
(FNDDS) to evaluate whether a food is healthy
habits contributed to over 11 million deaths and a
for a specific user, supported by explanations
of the key contributing nutrients. The bench- substantial number of disability-adjusted life-years
mark incorporates three question complexity (DALYs), often linked to factors such as excessive
settings and evaluates reasoning across three sodium intake (Afshin et al., 2019; WHO, 2023).
downstream tasks. Extensive experiments with These statistics underscore an urgent need to pro-
LLM backbones and baseline models demon- mote healthier eating habits on a societal scale.
strate that the NGQA benchmark effectively However, nutritional health requires complex do-
challenges existing models. In sum, NGQA
main knowledge, and there is no one-size-fits-all
addresses a critical real-world problem while
advancing GraphQA research with a novel
solution for healthy diets, as the nutritional needs
domain-specific benchmark. Our codebase and of individuals can vary widely based on their health
dataset are available here. conditions. For example, a diet suitable for some-
one with a high body mass index (BMI) may differ
drastically from that of an individual with a low
1 Introduction
BMI. Likewise, while individuals recovering from
Diet is a cornerstone of human health, playing a opioid misuse may benefit from a high-protein diet,
pivotal role in both maintaining well-being and such dietary choices can be harmful to those manag-
preventing disease. Despite the well-documented ing chronic kidney disease (Mahboub et al., 2021).

1
Why this benchmark matters: Numerous ef- follows:
forts have sought to address the challenges in per-
sonalized nutritional health, with Nutrition Ques- • Novel Benchmark for Personalized Nutri-
tion Answering (QA) emerging as a popular task tion. We present NGQA, the first benchmark
(Min et al., 2022; Bondevik et al., 2024). Recent ad- to incorporate users’ medical information in a
vancements in large language models (LLMs) have nutritional question answering task, address-
demonstrated significant potential in this domain, ing a significant research gap in the domain
offering sophisticated reasoning capabilities to ana- of personalized healthy diet research.
lyze and interpret nutritional information (Mavro-
matis and Karypis, 2024). However, these efforts • Advancing the GraphQA Ecosystem.
remain constrained by two major limitations. First, NGQA introduces a domain-specific bench-
to the best of our knowledge, no existing bench- mark and extends GraphQA benchmarks
mark truly personalizes answers based on users’ beyond datasets like WebQSP and Expla-
specific health conditions, primarily due to the Graphs in the general domain. This addition
inaccessibility of individual medical data (Bölz broadens the scope of GraphQA research,
et al., 2023). This lack of user-specific datasets enabling a more comprehensive evaluation
has severely hindered the development of effective of GraphQA models’ capabilities beyond
solutions. Second, while LLMs exhibit impressive general reasoning tasks.
reasoning capabilities in general domains, the med-
• Comprehensive Resource and Evaluation.
ical and nutritional intricacies of this task impose
Through extensive experiments, NGQA pro-
severe limitations on their effectiveness (Mialon
vides a challenging benchmark, a complete
et al., 2023). Current benchmarks fail to capture
codebase supporting the full pipeline from
the domain-specific complexities of personalized
data preprocessing to model evaluation, and
health-aware dietary reasoning, making it difficult
an extensibility for integrating new mod-
to evaluate, let alone improve, these models in
els. This comprehensive resource helps ad-
meaningful ways.
vance research in both personalized nutritional
To address these critical gaps and advance the
health and the broader GraphQA field.
understanding of healthy diet personalization, we
propose the Nutritional Graph Question Answering
(NGQA) benchmark. This is the first benchmark 2 Related Work
in the personalized nutritional health domain to
evaluate whether a specific food is healthy for a Question Answering in Nutritional Health Do-
user, supported by detailed reasoning of the key main. Question answering has become an essential
contributing nutrients. By recognizing the intri- tool in the nutritional and health domain, offer-
cate interplay between a user’s medical conditions, ing a flexible framework for applications such as
dietary behaviors, and the nutrition of foods, we food recommendation (Min et al., 2022; Bonde-
frame this task as a knowledge graph question an- vik et al., 2024). Knowledge graphs (KGs) have
swering problem. Specifically, using data from the been widely used to model relationships between
National Health and Nutrition Examination Survey foods, ingredients, and health, supporting tasks
(NHANES) and the Food and Nutrient Database like ingredient substitution and adaptive dietary
for Dietary Studies (FNDDS), we construct the recommendations (Haussmann et al., 2019; Chen
NGQA benchmark and categorize questions into et al., 2021; Fatemi et al., 2023a; Xu et al., 2024).
three complexity settings: sparse, standard, and Recent approaches incorporate health metrics into
complex. Each question type is further evaluated QA systems, focusing on recipe recommendations
through three downstream tasks, binary classifi- and nutritional ontologies (Li et al., 2023; Senevi-
cation (-B), multi-label classification (-ML), and ratne et al., 2021). However, existing methods lack
text generation (-TG), to explore distinct reasoning true personalization, as highlighted by (Bölz et al.,
aspects (Figure-1 (a)). We conduct extensive exper- 2023), due to the absence of user-specific medical
iments using various LLM backbones and baseline data. Our work fills this gap by introducing the
models to ensure the benchmark is both appropri- first GraphQA benchmark for personalized nutri-
ately challenging and meaningful for advancing tional health, enabling models to provide tailored
the field. Our contributions can be summarized as nutritional reasoning and explanations.

2
Dietary 4 Conditions
Extract Habits Filter
9 Special Diets
Medical 5644 Users Tagging
Status Scheme Multi-step Annotation
User Filtering
Health Standards User Data Collection

Nutrition
Data Source Extract Tags Filter
3 Categories
Ingredients 849 Foods
NGQA Benchmark
Nutrition Standards Food Filtering
Food Data Collection

Figure 2: The NGQA benchmark construction process. Each stage shown in the figure is detailed in Section 3.For
example, "User Data Collection" block, is introduced in Section 3.1 under the paragraph titled User Data Collection.
Graph Retrieval Augmented Generation. and comprehensive food nutritional information,
Knowledge Graph Question Answering (KGQA) enabling a fine-grained analysis of how individ-
has progressed from early semantic parsing and ual health conditions interact with food nutrition.
retrieval-based methods to advanced techniques By representing these relationships through graph
leveraging large language models (LLMs) and structures, the benchmark supports answering com-
graph neural networks (GNNs) for reasoning and plex nutritional questions while capturing the intri-
retrieval (Jiang et al., 2023; Kim et al., 2023; cate interplay between users’ medical conditions
Gao et al., 2024). Building on this progress, and dietary choices. The following sections pro-
Graph-Retrieval Augmented Generation (Graph- vide a detailed discussion of these datasets and their
RAG) has emerged as a widely studied method, integration into our benchmark.
offering more precise, context- and structure-aware User Data Collection. The NHANES dataset
reasoning compared to traditional text-based forms the foundation of our work for collecting
RAG methods (Lewis et al., 2020; Lazaridou user data. We extract medical information, dietary
et al., 2022; Guo et al., 2024; Wen et al., 2023). habits, and food intake records to construct the
Despite the development of various LLM-powered graph. Specifically, NHANES provides laboratory
models, benchmarks for the Graph-RAG task reports detailing body metrics like Body Mass In-
remain scarce and lack standardization. Early dex (BMI) and blood pressure, along with biochem-
benchmarks focus primarily on general graph tasks ical markers such as blood urea nitrogen. It also
such as shortest paths and node degree (Fatemi includes questionnaire responses on prescription
et al., 2023b; Wang et al., 2024a), while (He drug usage, adherence to special diets, and over-
et al., 2024) introduces a GraphQA benchmark for all health status. Additionally, NHANES records
complex reasoning using general-purpose datasets. users’ food intake history and dietary behaviors,
Building on their framework, we develop the such as the frequency of adding salt at the table.
first domain-specific benchmark in the nutritional Our study incorporates 54 distinct dietary habits,
health domain, bridging the gap between general with detailed data processing methods provided in
GraphQA research and personalized health-aware Appendix-B. This comprehensive dataset serves as
reasoning. More detailed literature is available in the backbone of our graph, capturing user health
Appendix-A. conditions and dietary patterns with granular detail.
3 NGQA Benchmark Food Data Collection. Nutritional information
for food items is sourced from FNDDS. FNDDS
3.1 Data Collection connects NHANES food codes to detailed nutri-
Data Source. Using data from the National Health tional data cataloged in the What We Eat in Amer-
and Nutrition Examination Survey (NHANES) and ica (WWEIA) database. Using FNDDS, we asso-
the Food and Nutrient Database for Dietary Studies ciate each food item in NHANES with its full nu-
(FNDDS), we construct the first GraphQA bench- tritional composition. Additionally, FNDDS links
mark designed to address personalized healthy nu- food items to ingredient information and classifies
trition intake questions. This benchmark integrates them into broader food categories. For example, a
detailed user health profiles, dietary behaviors, food item like "apple" is linked to its nutrient values

3
(e.g., sugars, vitamins) and assigned to the category million food records. While this dataset offers
"fruits." These associations enrich the graph by pro- an invaluable resource for studying nutrition and
viding node-level data for food, ingredients, and health, it includes inconsistencies, ambiguities, and
categories. irrelevant entries. To establish a scientifically ro-
Tagging Scheme. To evaluate whether a food bust and meaningful benchmark, precise data anno-
is specifically healthy for a user based on their tation is essential. This involves not only cleaning
personal health conditions, we propose a tagging and filtering the data but also carefully defining
scheme that assigns nutrition-related tags to both and validating annotations to accurately capture
users and foods. This systematic framework aligns real-world relationships between health conditions,
food nutritional properties with user health needs, dietary behaviors, and food options. Our annota-
enabling robust assessments of food suitability. tion process refines both user and food datasets
For food tagging, we build upon established to ensure relevance, accuracy, and applicability to
guidelines and introduce newly applied standards. real-life scenarios.
Prior works have utilized recommendations from User Filtering. Annotating user data requires
the World Health Organization (WHO) and the careful consideration of the complex interactions
Food Standards Agency (FSA) (Wang et al., 2021), between nutrition and health. For instance, elevated
while we extend this by incorporating the more de- blood urea nitrogen (BUN) levels may indicate kid-
tailed EU Nutrition & Health Claims Regulation ney dysfunction, warranting a low-protein diet, but
(Commission, 2006) and the Codex Alimentarius could also result from insufficient water intake. To
Commission (CAC) (Alimentarius, 1985, 1997). maintain scientific rigor and practical relevance,
These standards define precise thresholds for nutri- we focus on annotating four prevalent health sta-
ent claims. For instance, the EU regulation permits tuses—obesity, hypertension, opioid misuse, and
labeling a food as "low sodium" only if it contains diabetes—that are directly influenced by dietary in-
no more than 0.12 g of sodium per 100 g (Commis- terventions. Additionally, we annotate nine special
sion, 2006). Foods meeting such criteria are tagged diets reported by users, reflecting health-related di-
with corresponding labels like "low_sodium" or etary practices. Further details on the definitions
"high_protein", reflecting their nutritional proper- and implications of these health statuses and diets
ties. are provided in the Appendix-B. To ensure consis-
On the user side, health tags are derived from tency and relevance, we exclude users under 18,
the NHANES dataset, which includes laboratory focusing solely on adult dietary patterns.
results and self-reported health information. For Food Filtering. For food annotation, we identify
example, users with high blood pressure, as defined practical entries in the FNDDS database that align
by American Heart Association (AHA) thresholds with real-world dietary reasoning. While FNDDS
or similar guidelines, are tagged with "hyperten- supports comprehensive nutritional analysis, it in-
sion," indicating that a low-sodium diet would be cludes many entries unsuitable for practical use,
beneficial (Grillo et al., 2019; Smyth et al., 2014). such as raw ingredients or standalone additives. To
By linking health and food tags, our scheme ef- address this, we restrict our focus to the "mixed
fectively represents personalized dietary needs and dishes" category, as it represents combined recipes
captures the interplay between medical conditions closest to real-life diets. Additionally, we include
and nutritional requirements. The detailed stan- other relevant categories, such as bakery products
dards and additional tags for other nutrients and and desserts (definitions of FNDDS categories are
health conditions are described in Appendix-B. By available in the Appendix-I). Finally, we apply a
integrating this methodology into our graph-based keyword-based deduplication method to remove
benchmark, we provide a framework for advanc- highly similar entries.
ing personalized dietary reasoning and evaluating Multi-step Annotation. Using the previously
models in this domain. defined standards and tagging schemes, our anno-
tation process systematically establishes "match"
3.2 Data Annotation or "contradict" relationships between user health
Real-world data is inherently messy and incom- conditions and food nutritional profiles. For exam-
plete, and the datasets we use are no exception. ple, the tag "high_calorie" contradicts the condition
Spanning from 2003 to 2020, NHANES provides "obesity", while "low_sodium" matches with "hy-
data for approximately 100,000 users and over 2 pertension". To ensure accuracy and reliability, we

4
Question Level: Sparse Question Level: Standard Question Level: Complex

Ingredient A Ingredient B Ingredient A High Calorie Ingredient A Low Calorie


contradict match
Diet has has
has contains contains
has Habit B has Obesity has Obesity
has
Classic mixed Danish pastry Potato salad
vegetables User with fruit User has with egg User has
has
contains contains contains
has Diet has Diet has Diet
Low match Habit A High contradict Habit A High contradict Habit A
Sodium Hypertension Sodium Hypertension Sodium Hypertension

Single link: one condition one tag Multiple links: all match or contradict Mixed links: match and contradict
a) Overview of the Different Question Levels in NGQA Benchmark

Standard Question: Based on the information provided, please judge if food Danish pastry with fruit
is healthy or unhealthy to the user and why? <Task Specific Prompts>

-B Binary Classification - ML Multi-label Classification - TG Text Generation

Answer: No Answer: high calorie, Answer: No, because the


high sodium food is high in calorie …

b) Overview of the Different Task Levels in NGQA Benchmark

Figure 3: The illustration of different question levels and task levels.


adopt a multi-step annotation process. After ini- unique link between the user and the food signifi-
tial filtering and tagging, large language models cantly increases the difficulty of subgraph retrieval,
(LLMs) perform an initial sanity check to iden- making models vulnerable to interference from ir-
tify inconsistencies or anomalies in the annotations. relevant nodes.
Subsequently, three human annotators with domain Standard questions represent the balanced and
expertise review and cross-validate the results to idealized scenarios in our benchmark. In this
eliminate remaining inaccuracies. By combining category, foods are linked to multiple nutrition
automated checks with human validation, our rigor- tags, which either match or contradict several user
ous annotation strategy captures the real-life com- health conditions. This configuration reflects con-
plexities of personalized nutrition while maintain- trolled cases where the relationship between dietary
ing high standards of quality and reliability. choices and health outcomes is clear-cut, enabling
a focused evaluation of model performance. Stan-
4 Task Definition and Evaluation
dard questions serve as a foundation for benchmark-
4.1 Question Setting ing in structured and well-defined environments.
With the annotated data in place, we designed three Complex questions are designed to replicate the
distinct types of questions, i.e., sparse, standard, intricacies of real-life nutritional decision-making.
and complex, to capture varying levels of difficulty Foods in this category may simultaneously have
and emulate real-world scenarios in personalized tags that both match with and contradict a user’s
nutrition reasoning. This stratification ensures that health conditions. For instance, a food may be low
our benchmark accommodates a wide range of re- in sodium (beneficial for hypertension) but also
search and application needs, spanning from con- high in sugar (problematic for diabetes). These
trolled, idealized setups to challenging, real-life scenarios require models to navigate conflicting in-
cases, as illustrated in Figure-3 (a). formation, prioritize user health needs, and perform
Sparse questions address scenarios with min- nuanced trade-off reasoning. This category closely
imal available information. In this setting, each mirrors the ambiguous and multifaceted challenges
food has only one nutrition tag linked to a sin- of real-world dietary decisions.
gle user health condition. This setup reflects real- The benchmark’s statistical breakdown is pre-
world cases where labels are scarce or data is in- sented in Table-1. To further evaluate the com-
complete, challenging models to reason effectively plexity and informativeness of the questions, we
with limited information. Although sparse ques- introduce the Signal-to-Noise Ratio (SNR). SNR
tions may appear simple to human observers, the measures the ratio of nodes or tags relevant to the

5
Question Level # Records Avg. # Nodes Avg. # Edges 4.3 Evaluation Metrics
Sparse 8,490 25.84 24.86 To evaluate model performance, we adopt task-
Standard 3,622 28.16 28.98
Complex 1,690 30.94 34.04
specific metrics tailored to each type. For classifi-
cation tasks, we use standard metrics like accuracy,
Table 1: Statistics of the Benchmark by Question Level. recall, precision, and F1 score for comprehensive
Question Level Avg. Node SNR Avg. Tag SNR performance assessment. Multi-label classification
tasks extend these metrics to their weighted ver-
Sparse 16.37 19.30
sions, accounting for the distribution of multiple la-
Standard 24.68 49.39
Complex 31.57 76.32 bels across samples. Text generation tasks are eval-
uated with widely used metrics such as ROUGE,
Table 2: Signal-to-Noise Ratio (SNR) by Question BLEU, and BERT scores, which collectively as-
Level. sess relevance and semantic similarity to reference
texts. The definition of ground truths is available in
answer (signal) against the total nodes or tags in the
Appendix-B. This multifaceted design supports di-
graph (noise). As shown in Table-2, sparse ques-
verse model architectures and evaluation strategies,
tions exhibit the lowest SNR, reflecting the limited
providing a robust foundation for advancing per-
resources available for these tasks. Conversely,
sonalized nutrition research. By bridging the gap
complex questions, despite containing conflicting
between controlled research environments and the
information, achieve the highest SNR, underscor-
complexities of real-world applications, our bench-
ing the rich contextual information necessary for
mark fosters innovation and opens new avenues for
accurate reasoning. More statistics of the bench-
addressing healthy dietary reasoning.
mark are available in Appendix-E.
5 Experiments
4.2 Task Setting
To enhance the generality and versatility of our 5.1 Experiment Settings
benchmark, we design three distinct downstream In this section, we conduct extensive experiments
task types, each centered on the same domain ques- to evaluate existing Graph-RAG models’ reasoning
tion but requiring different forms of output, as il- capability on the proposed benchmark. For base-
lustrated in Figure-3 (b). This diversity ensures the line models, we select the five most classical base-
benchmark accommodates a wide range of method- lines: KAPING (Baek et al., 2023), CoT-Zero (Ko-
ologies and research focuses while fostering inno- jima et al., 2022), CoT-BAG (Wang et al., 2024a),
vation in addressing personalized nutrition chal- ToG (Sun et al., 2024), and a naive plain Graph-
lenges. The tasks are defined as follows: RAG pipeline (implementation details in Appendix-
Binary Classification (-B): This task requires a C). For the main experiments, we choose GPT-
simple "yes" or "no" response, indicating whether 4o-mini as the LLM backbone, we also conduct
a specific food is suitable for a user based on additional experiments on a series of other clas-
their health profile. It emphasizes straightforward sic LLM backbones in Appendix-D. Note that we
decision-making, reflecting applications like auto- didn’t select the most advanced LLM backbones or
mated diet advisories or recommendation systems. the most sophisticated fine-tuned baselines because
Multi-Label Classification (-ML): In this task, we argue our contributions focus primarily on the
models must retrieve the nutritional tags associated proposed benchmark with the novel tasks for this
with a food and determine which match with or specific domain, and the experiment results along
contradict the user’s health conditions. By demand- with the hallucination analyses have demonstrated
ing richer output, this task evaluates the model’s our tasks are properly designed where the classic
ability to leverage graph information and identify baselines can be adequately challenged while main-
nuanced relationships. taining efficiency. In the following sections, we go
Text Generation (-TG): The output is a natural through the experiment results for each task.
language explanation of why a food is healthy or
unhealthy for a user. This task assesses a model’s 5.2 Binary Classification Task
capability for interpretable and user-friendly rea- Table-3 (a) presents the performance of baseline
soning, which is crucial for real-world applications models on the binary classification task, which eval-
such as personalized dietary assistant chatbots. uates the models’ ability to provide a decisive "yes"

6
a) Binary Classification (-B) b) Multi-label Classification (-ML) c) Text Generation (-TG)
Question Level Method
Accuracy Recall Precision F1 Accuracy Recall Precision F1 ROUGE-1 ROUGE-2 ROUGE-L BLEU BERT
Plain 0.5973 0.1634 1.0000 0.2810 0.1798 0.9943 0.2109 0.3442 0.5385 0.4775 0.5385 0.2838 0.9370
KAPING 0.5347 0.0541 0.7246 0.1006 0.1753 0.9915 0.2075 0.3394 0.5234 0.4600 0.5234 0.2674 0.9353
Sparse CoT-Zero 0.6604 0.2951 0.9983 0.4555 0.2032 0.9958 0.2435 0.3842 0.5463 0.4842 0.5462 0.2889 0.9388
CoT-BAG 0.6038 0.1769 1.0000 0.3006 0.2134 0.9966 0.2520 0.3945 0.5481 0.4886 0.5480 0.2930 0.9385
ToG 0.7729 0.5383 0.9817 0.6953 0.2439 0.9128 0.2986 0.4333 0.6254 0.5710 0.6251 0.3612 0.9465
Plain 0.5762 0.1989 1.0000 0.3317 0.4909 0.9980 0.4901 0.6528 0.7219 0.6321 0.6941 0.4840 0.9618
KAPING 0.5022 0.0637 0.9313 0.1192 0.4593 0.9956 0.4624 0.6272 0.7087 0.6237 0.6764 0.4617 0.9599
Standard CoT-Zero 0.6565 0.3507 1.0000 0.5193 0.5390 0.9967 0.5447 0.6963 0.7329 0.6443 0.7049 0.4939 0.9630
CoT-BAG 0.5900 0.2249 1.0000 0.3673 0.5599 0.9982 0.5611 0.7091 0.7333 0.6456 0.7032 0.4951 0.9630
ToG 0.8628 0.7411 0.9993 0.8511 0.6189 0.8843 0.6793 0.7464 0.8182 0.7632 0.7817 0.6112 0.9716
Plain 0.6598 0.0636 0.9750 0.1194 0.7185 0.9721 0.7374 0.8358 0.7356 0.6510 0.7001 0.4949 0.9599
KAPING 0.6574 0.0571 0.9722 0.1079 0.6883 0.9758 0.7129 0.8093 0.7394 0.6634 0.7016 0.4839 0.9602
Complex CoT-Zero 0.6627 0.0718 0.9778 0.1337 0.7453 0.9735 0.7679 0.8557 0.7478 0.6599 0.7103 0.5048 0.9615
CoT-BAG 0.6627 0.0701 1.0000 0.1311 0.7546 0.9631 0.7801 0.8587 0.7467 0.6622 0.7080 0.5049 0.9611
ToG 0.7473 0.3964 0.8100 0.5323 0.6153 0.6989 0.8119 0.7303 0.7729 0.6915 0.7366 0.5313 0.9639

Table 3: Experimental results based on five baseline methods on the three tasks with the three question levels using
the GPT-4o-mini. The best performance of each group is bolded.

Figure 4: Efficiency analysis of the five baseline meth- Figure 5: Retrieval quality of ToG vs. Plain across three
ods across three tasks. types of questions on recall, precision and F1.

or "no" response based on summarized reasoning. and text generation (TG) tasks. The ML task evalu-
The results reveal a notable conservatism in model ates models’ ability to retrieve nutrition tags asso-
behavior, as evidenced by the low recall scores. ciated with foods and user health conditions, while
This likely stems from the sensitive nature of med- the TG task tests their capacity to generate natural
ical questions, where LLMs try to avoid offering language explanations, offering a more compre-
simple "yes" answers without explanations unless hensive and realistic evaluation. The results reveal
their confidence is exceptionally high. Despite this similar patterns across tasks: while baselines are
challenge, the experiments yield two important in- competent at identifying nutrition tags from the
sights into how external domain knowledge can graph, the primary challenge lies in correctly iden-
support LLMs in this scenario. First, increasing tifying the relevant tags based on user health condi-
the number of links in the graph (e.g., from Sparse tions, as indicated by the overall high recall scores
to Standard questions) consistently improves re- in the ML task.
call across all baselines. This indicates that richer Both tasks are most challenging on sparse ques-
external knowledge provides LLMs with greater tion sets due to their low-resource nature. Con-
context and reassurance, enabling them to produce versely, models achieve the best performance on
more confident positive answers. Second, ToG complex question sets, which may appear coun-
significantly outperforms other baselines, show- terintuitive. However, as shown in Table-2, com-
ing performance gains unique to this task. We at- plex questions have a higher Signal-to-Noise Ratio
tribute this improvement to ToG’s effective pruning (SNR), providing models with a clearer signal that
mechanism, which removes irrelevant nodes and offsets their logical complexity. Additionally, the
increases the SNR. By reducing noise and focus- ToG model performs similarly on the standard and
ing on relevant information, ToG enhances LLMs’ complex question sets due to its pruning process,
ability to make confident and accurate decisions. which increases SNR by removing irrelevant nodes.
While effective, this process can also discard valu-
5.3 Multi-label and Text Generation Task able information, leading to lower performance on
Table-3 (b) and (c) present the performance of base- complex questions. This trade-off contrasts with
line models on the multi-label classification (ML) ToG’s success in binary classification task and high-

7
tritional health (Mialon et al., 2023). Figure-6 illus-
Based on the nutrients the food provides and the
user with obesity and opioid misuse, please answer trates an example where we evaluate whether the
whether the food "Taco, corn tortilla, beef, food "Taco, corn tortilla, beef, cheese" is a healthy
Question cheese" is healthy for the user and why?
option for a user who is obese and recovering from
Factual Hallucination – Lack of Domain Knowledge opioid misuse. Our analysis identifies two main
It depends on the user’s dietary needs; it may be types of hallucinations. The first is Factual Hal-
unhealthy because the food is high in carbohydrate. lucination, where the model produces incorrect or
Direct
Contextual Hallucination – Missing User’s Needs irrelevant information, often due to reliance on gen-
eral knowledge not explicitly included in the graph.
No, because the food is high in sodium and high in
cholesterol. It’s not good for the health. These errors are common when LLMs perform
KAPING
direct inference without external knowledge and
Correct – Focus on What User’s Conditions Require
occasionally occur when retrieved graphs contain
Yes, because the food is high in protein and low in noise. For example, the model incorrectly deemed
carbohydrates, an appropriate food in the context.
ToG the taco unsuitable, overlooking the fact that corn
tortillas are relatively low in carbohydrates.
Figure 6: A case study of error analysis. The second type is Contextual Hallucination,
lights the comprehensiveness of our benchmark, where the model fails to prioritize tags that directly
which challenges models across diverse scenarios relate to the user’s health profile, focusing instead
to uncover their strengths and weaknesses. on less relevant attributes. This issue is less pro-
nounced in ToG due to its ability to retrieve com-
5.4 Efficiency and Retrieval Quality
pact, focused subgraphs, unlike simpler methods
Beyond model performance, efficiency is a critical like KAPING and CoT-Zero, which lack effective
consideration in Graph-RAG systems. To evaluate pruning. In this case, the taco’s high sodium and
this, we conduct an efficiency analysis of baseline cholesterol overshadowed its alignment with the
models on our benchmark, as shown in Figure-4. user’s specific health needs for a low-carb, high-
As can be seen, the binary classification task ex- protein diet, leading to a less optimal assessment.
hibits the fastest runtime, as it requires the shortest In summary, these hallucinations highlight the
output. In contrast, the multi-label classification importance of our domain-specific benchmark in
and text generation tasks involve longer outputs, establishing a rigorous framework to evaluate and
leading to slower performance. Due to ToG’s re- improve LLMs, advancing both the nutritional
liance on multiple LLM calls during the retrieval health domain and Graph-RAG research while fos-
process, its runtime is significantly slower com- tering the development of more robust and general-
pared to other methods. Additionally, the quality izable models (More examples in Appendix-H).
of subgraph retrieval plays a crucial role in down-
stream reasoning. To assess this, we perform a 6 Conclusion
retrieval quality analysis using ToG as a case study,
In this work, we introduce the Nutritional Graph
comparing it against a plain Graph-RAG pipeline,
Question Answering (NGQA) benchmark, the first
as illustrated in Figure-5. As shown, the retrieval
dataset designed to address the critical challenges
scores of ToG align with its performance in the
of personalized nutritional health reasoning. By
main experiments, confirming our assumption that
leveraging user-specific medical data and framing
fluctuations in ToG’s performance are rooted in
the problem as a knowledge graph question answer-
its pruning process during the subgraph retrieval
ing task, NGQA bridges the gap between general-
phase.
purpose benchmarks and domain-specific applica-
5.5 Error Analysis tions. Our benchmark not only advances the scope
of GraphQA research by incorporating complex,
In this section, we analyze the types of hallucina- real-world nutritional scenarios but also provides
tions observed in our experiments using a specific a comprehensive resource for evaluating and im-
example and demonstrate the importance of exter- proving models in this domain. We believe NGQA
nal domain knowledge in mitigating these errors. lays the foundation for future research in person-
Traditional LLM-enhanced methods are well- alized diet and health-aware reasoning, fostering
known for their susceptibility to hallucination er- innovation in both nutritional health and GraphQA.
rors, particularly in domain-specific tasks like nu-

8
Limitation physical addresses—removed. Despite the absence
of PII, the dataset retains its utility for detailed
In this section, we discuss the limitations of this
analyses, allowing us to investigate the relationship
work and outline directions for future research.
between users’ medical data and health-aware food
First, the benchmark includes a limited number
recommendations as presented in this study. Ad-
of health conditions, though more are available.
ditionally, in practical applications, the generated
For example, osteoporosis suggests a high-calcium
recommendations and interpretations are treated as
diet, a renal diet indicates low protein intake, and
personal medical records, ensuring sustained pri-
high low-density lipoprotein (LDL) levels may call
vacy protection. By adhering to these principles,
for a low-cholesterol diet. As noted in the pa-
our research maintains the highest levels of ethical
per, we prioritized conditions most prevalent in
responsibility and data privacy.
the United States and most relevant to dietary inter-
ventions, but expanding to include additional con-
ditions could enhance coverage and utility. Second, References
while we focus on the interplay between dietary be- Ashkan Afshin, Patrick J Sur, Kairsten A Fay, Leslie
haviors and medical conditions, other factors, such Cornaby, Giannina Ferrara, Jason S Salama, and
as food insecurity, remain unexplored. NHANES Christopher J L Murray. 2019. Health effects of
offers extensive socioeconomic data, presenting op- dietary risks in 195 countries, 1990–2017: a system-
atic analysis for the global burden of disease study
portunities to extend the benchmark to account for
2017. The Lancet.
broader determinants of dietary decision-making.
Third, for simplicity, complex questions are re- FAO/WHO Codex Alimentarius. 1985. Guidelines on
duced to binary classification by counting "match" nutrition labelling. Accessed: 2024-07-12.
and "contradict" tags. However, real-life dietary FAO/WHO Codex Alimentarius. 1997. Guidelines for
decisions require nuanced trade-offs and reasoning use of nutrition and health claims. Accessed: 2024-
that go beyond this approach. More sophisticated 07-12.
evaluation methods could better reflect practical Jinheon Baek, Alham Fikri Aji, and Amir Saffari. 2023.
scenarios. Lastly, the benchmark could benefit Knowledge-augmented language model prompting
from additional tasks. For example, the existing for zero-shot knowledge graph question answering.
In ACL.
graphs support questions like, "What alternative
foods could meet a user’s dietary preferences and Felix Bölz, Diana Nurbakova, Sylvie Calabretto, Armin
medical needs?" Incorporating such tasks would Gerl, Lionel Brunie, and Harald Kosch. 2023. Hum-
broaden the benchmark’s scope and encourage fur- mus: A linked, healthiness-aware, user-centered and
argument-enabling recipe data set for recommenda-
ther innovation. Despite these limitations, this tion. In RecSys.
work establishes a robust baseline as a pioneering
effort in personalized nutrition reasoning. We defer Jon Nicolas Bondevik, Kwabena Ebo Bennin, Önder
Babur, and Carsten Ersch. 2024. A systematic review
these challenges to future work, envisioning the on food recommender systems. Expert Systems with
benchmark as a foundation for ongoing advance- Applications.
ments in this critical domain.
CDC. 2020a. Adult obesity facts.
Ethics and Privacy Statement CDC. 2020b. Americans Share Hopeful Sto-
ries of Recovery From Opioid Use Disorder.
Safeguarding privacy and adhering to ethical prin- https://www.cdc.gov/rxawareness/pdf/
ciples are paramount when working with sensi- articles/TA-T3D2-English_MatteArticle_
tive health-related data. The National Health and Release_508.pdf.
Nutrition Examination Survey (NHANES) serves Yu Chen, Ananya Subburathinam, Ching-Hua Chen,
as a benchmark in this regard, strictly complying and Mohammed J Zaki. 2021. Personalized food
with confidentiality protocols mandated by pub- recommendation as constrained question answering
lic legislation. These robust privacy measures over a large-scale food knowledge graph. In WSDM.
enable us to achieve our research goals while European Commission. 2006. Eu nutrition & health
remaining fully aligned with the survey’s estab- claims regulation legislation (ec) 1924/2006. Ac-
lished guidelines. Notably, the NHANES dataset cessed: 2024-07-12.
is anonymized, with personally identifiable infor- Carrie Dennett. 2021. Diet’s role in opioid recovery.
mation (PII)—such as social security numbers and Today’s Dietitian.

9
Bahare Fatemi, Quentin Duval, Rohit Girdhar, Michal Angeliki Lazaridou, Elena Gribovskaya, Wojciech
Drozdzal, and Adriana Romero-Soriano. 2023a. Stokowiec, and Nikolai Grigorev. 2022. Internet-
Learning to substitute ingredients in recipes. arXiv. augmented language models through few-shot
prompting for open-domain question answering.
Bahare Fatemi, Jonathan Halcrow, and Bryan Perozzi. arXiv.
2023b. Talk like a graph: Encoding graphs for large
language models. arXiv. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio
Petroni, Vladimir Karpukhin, Naman Goyal, Hein-
Yifu Gao, Linbo Qiao, Zhigang Kan, Zhihua Wen, rich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock-
Yongquan He, and Dongsheng Li. 2024. Two-stage täschel, et al. 2020. Retrieval-augmented generation
generative question answering on temporal knowl- for knowledge-intensive nlp tasks. NeuralIPS.
edge graph using large language models. arXiv.
Diya Li, Mohammed J Zaki, and Ching-hua Chen. 2023.
Mouzhi Ge, Francesco Ricci, and David Massimo. 2015. Health-guided recipe recommendation over knowl-
Health-aware food recommender system. In RecSys. edge graphs. Journal of Web Semantics.
Andrea Grillo, Lucia Salvi, Paolo Coruzzi, Paolo Salvi, Peiyu Li, Xiaobao Huang, Yijun Tian, and Nitesh V
and Gianfranco Parati. 2019. Sodium intake and Chawla. 2024. Cheffusion: Multimodal foundation
hypertension. Nutrients. model integrating recipe and food image generation.
Ja K Gu, Penelope Allison, Alexis Grimes Trotter, Lu- In CIKM.
enda E Charles, Claudia C Ma, Matthew Groenewold, Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun
Michael E Andrew, and Sara E Luckhaupt. 2022. Tian, and Meng Jiang. 2024a. Towards safer large
Prevalence of self-reported prescription opioid use language models through machine unlearning. arXiv.
and illicit drug use among us adults: Nhanes 2005–
2016. Journal of occupational and environmental Zheyuan Liu, Xiaoxin He, Yijun Tian, and Nitesh V
medicine. Chawla. 2024b. Can we soft prompt llms for graph
learning tasks? In WWW.
Tiezheng Guo, Qingwen Yang, Chen Wang, Yanyi Liu,
Pan Li, Jiawei Tang, Dapeng Li, and Yingyou Wen. Zheyuan Liu, Chunhui Zhang, Yijun Tian, Erchi
2024. Knowledgenavigator: Leveraging large lan- Zhang, Chao Huang, Yanfang Ye, and Chuxu Zhang.
guage models for enhanced reasoning over knowl- 2023. Fair graph representation learning via diverse
edge graph. Complex & Intelligent Systems. mixture-of-experts. In WWW.
Steven Haussmann, Oshani Seneviratne, Yu Chen, Nadine Mahboub, Rana Rizk, Mirey Karavetian, and
Yarden Ne’eman, James Codella, Ching-Hua Chen, Nanne de Vries. 2021. Nutritional status and eating
Deborah L McGuinness, and Mohammed J Zaki. habits of people who use drugs and/or are undergoing
2019. Foodkg: a semantics-driven knowledge graph treatment for recovery: a narrative review. Nutrition
for food recommendation. In The Semantic Web– reviews.
ISWC 2019: 18th International Semantic Web Confer-
ence, Auckland, New Zealand, October 26–30, 2019, Costas Mavromatis and George Karypis. 2024. Gnn-
Proceedings, Part II 18. rag: Graph neural retrieval for large language model
reasoning. arXiv.
Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh V Chawla,
Thomas Laurent, Yann LeCun, Xavier Bresson, and Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christo-
Bryan Hooi. 2024. G-retriever: Retrieval-augmented foros Nalmpantis, Ram Pasunuru, Roberta Raileanu,
generation for textual graph understanding and ques- Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu,
tion answering. arXiv. Asli Celikyilmaz, et al. 2023. Augmented language
models: a survey. arXiv.
Xiaobao Huang, Mihir Surve, Yuhan Liu, Tengfei Luo,
Olaf Wiest, Xiangliang Zhang, and Nitesh V Chawla. Weiqing Min, Chunlin Liu, Leyi Xu, and Shuqiang
2024. Application of large language models in chem- Jiang. 2022. Applications of knowledge graphs for
istry reaction data extraction and cleaning. In CIKM. food science and industry. Patterns.
Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, NIDA. 2024. Opioids. https://www.drugabuse.
Wayne Xin Zhao, and Ji-Rong Wen. 2023. Structgpt: gov/drug-topics/opioids.
A general framework for large language model to
reason over structured data. In EMNLP. Khary K Rigg and Gladys E Ibañez. 2010. Motiva-
tions for non-medical prescription drug use: A mixed
Jiho Kim, Yeonsu Kwon, Yohan Jo, and Edward Choi. methods analysis. Journal of Substance Abuse Treat-
2023. Kg-gpt: A general framework for reasoning ment.
on knowledge graphs using large language models.
In EMNLP. Andrew Rosenblum, Lisa A Marsch, Herman Joseph,
and Russell K Portenoy. 2008. Opioids and the treat-
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yu- ment of chronic pain: controversies, current status,
taka Matsuo, and Yusuke Iwasawa. 2022. Large lan- and future directions. Experimental and Clinical
guage models are zero-shot reasoners. NeuralIPS. Psychopharmacology.

10
Chris Sanchez and Zheyuan Zhang. 2022. The effects Zehong Wang, Zheyuan Zhang, Nitesh V Chawla,
of in-domain corpus size on pre-training bert. arXiv. Chuxu Zhang, and Yanfang Ye. 2024c. Gft: Graph
foundation model with transferable tree vocabulary.
Oshani Seneviratne, Jonathan Harris, Ching-Hua Chen, arXiv preprint arXiv:2411.06070.
and Deborah L McGuinness. 2021. Personal health
knowledge graph for clinically relevant diet recom- Zehong Wang, Zheyuan Zhang, Chuxu Zhang, and Yan-
mendations. arXiv. fang Ye. 2024d. Subgraph pooling: Tackling nega-
tive transfer on graphs. In IJCAI.
Sola S Shirai, Oshani Seneviratne, Minor E Gordon,
Ching-Hua Chen, and Deborah L McGuinness. 2021. Yilin Wen, Zifeng Wang, and Jimeng Sun. 2023.
Identifying ingredient substitutions using a knowl- Mindmap: Knowledge graph prompting sparks graph
edge graph of food. Frontiers in Artificial Intelli- of thoughts in large language models. arXiv.
gence.
WHO. 2021. Healthy diet.
Andrew Smyth, Martin J O’Donnell, Salim Yusuf,
Catherine M Clase, Koon K Teo, Michelle Cana- WHO. 2023. Obesity info page of the world health
van, Donal N Reddan, and Johannes FE Mann. 2014. organization.
Sodium intake and renal outcomes: a systematic re-
view. American journal of hypertension. Yuanbo Xu, Tian Li, Yongjian Yang, Weitong Chen, and
Lin Yue. 2024. An adaptive category-aware recom-
Haitian Sun, Tania Bedrax-Weiss, and William Cohen. mender based on dual knowledge graphs. Informa-
2019. Pullnet: Open domain question answering tion Processing & Management.
with iterative retrieval on knowledge bases and text.
In EMNLP-IJCNLP. Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut,
Percy Liang, and Jure Leskovec. 2021. Qa-gnn: Rea-
Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo soning with language models and knowledge graphs
Wang, Chen Lin, Yeyun Gong, Lionel Ni, Heung- for question answering. In NAACL.
Yeung Shum, and Jian Guo. 2024. Think-on-graph:
Wenbin Yue, Zidong Wang, Jieyu Zhang, and Xi-
Deep and responsible reasoning of large language
aohui Liu. 2021. An overview of recommenda-
model on knowledge graph. In ICLR.
tion techniques and their applications in healthcare.
Zhaoxuan Tan, Qingkai Zeng, Yijun Tian, Zheyuan Liu, IEEE/CAA Journal of Automatica Sinica.
Bing Yin, and Meng Jiang. 2024. Democratizing
Jing Zhang, Xiaokang Zhang, Jifan Yu, Jian Tang, Jie
large language models via personalized parameter-
Tang, Cuiping Li, and Hong Chen. 2022. Subgraph
efficient fine-tuning. arXiv.
retrieval enhanced model for multi-hop knowledge
Lauren J. Tanz, Amanda T. Dinwiddie, Christine L. base question answering. In ACL.
Mattson, Julie O’Donnell, and Nicole L. Davis. 2022.
Lingzi Zhang, Yinan Zhang, Xin Zhou, and Zhiqi Shen.
Drug overdose deaths among persons aged 10–19
2024a. Greenrec: A large-scale dataset for green
years - united states, july 2019-december 2021. Mor-
food recommendation. In WWW.
bidity and Mortality Weekly Report.
Zheyuan Zhang, Zehong Wang, Shifu Hou, Evan Hall,
Dhaval Taunk, Lakshya Khanna, Siri Venkata Pavan Ku-
Landon Bachman, Jasmine White, Vincent Galassi,
mar Kandru, Vasudeva Varma, Charu Sharma, and
Nitesh V Chawla, Chuxu Zhang, and Yanfang Ye.
Makarand Tapaswi. 2023. Grapeqa: Graph augmen-
2024b. Diet-odin: A novel framework for opioid
tation and pruning to enhance question-answering.
misuse detection with interpretable dietary patterns.
In WWW.
In KDD.
Heng Wang, Shangbin Feng, Tianxing He, Zhaoxuan
Zheyuan Zhang, Zehong Wang, Tianyi Ma,
Tan, Xiaochuang Han, and Yulia Tsvetkov. 2024a.
Varun Sameer Taneja, Sofia Nelson, Nhi Ha Lan
Can language models solve graph problems in natural
Le, Keerthiram Murugesan, Mingxuan Ju, Nitesh V
language? NeuralIPS.
Chawla, Chuxu Zhang, et al. 2024c. Mopi-hfrs:
Wenjie Wang, Ling-Yu Duan, Hao Jiang, Peiguang Jing, A multi-objective personalized health-aware
Xuemeng Song, and Liqiang Nie. 2021. Market2dish: food recommendation system with llm-enhanced
health-aware food recommendation. TOMM. interpretation. arXiv.

Xintao Wang, Qianwen Yang, Yongting Qiu, Jiaqing


Liang, Qianyu He, Zhouhong Gu, Yanghua Xiao,
and Wei Wang. 2023. Knowledgpt: Enhancing large
language models with retrieval and storage access on
knowledge bases. arXiv.

Zehong Wang, Sidney Liu, Zheyuan Zhang, Tianyi Ma,


Chuxu Zhang, and Yanfang Ye. 2024b. Can llms
convert graphs to text-attributed graphs? arXiv.

11
A Additional Related Work KGQA. However, most benchmarks in this field
are designed for general-purpose datasets and fail
A.1 Prior Works in Nutrition Personalization
to address domain-specific complexities, such as
With growing awareness of the importance of di- the challenges unique to nutritional health reason-
etary health, various studies have sought to incor- ing.
porate health metrics into applications such as food
recommendation systems. These approaches can A.3 Graph-Retrieval Augmented Generation
be grouped into three primary categories. First, Graph neural networks exhibit powerful potentials
some research emphasizes single indicators like in dealing with complicated structural data (Wang
calorie or fat content, as highlighted in works by et al., 2024d; Liu et al., 2023; Wang et al., 2024c)
Ge et al. (Ge et al., 2015) and Shirai et al. (Shirai and it can facilitate LLM to better understand real
et al., 2021; Li et al., 2024), though such metrics world tasks (Wang et al., 2024b; Huang et al., 2024;
often fail to represent the multifaceted nature of Liu et al., 2024b). Graph-Retrieval Augmented
a balanced diet. Second, simulated health data Generation (Graph-RAG) extends the Retrieval-
has been utilized, as demonstrated by Wang et al. Augmented Generation (RAG) framework (Lewis
(Wang et al., 2021), but these methods often di- et al., 2020) by enriching large language models
verge from real-world data distributions. Finally, with structured knowledge retrieval. While tradi-
recent studies have applied global health guidelines tional RAG retrieves unstructured text, Graph-RAG
to develop composite health scores, such as those leverages GNNs to retrieve structured subgraphs
by Bolz et al. (Bölz et al., 2023) and Zhang et encoded as triples, improving reasoning precision
al. (Zhang et al., 2024a). However, foods deemed and minimizing redundancy (Guo et al., 2024; Wen
healthy by general standards can still negatively et al., 2023; Lazaridou et al., 2022).
affect certain individuals (Yue et al., 2021), high- Existing Graph-RAG benchmarks primarily eval-
lighting the absence of a universal solution. The uate basic graph reasoning tasks, such as shortest
primary challenge remains the scarcity of accurate paths, node degree, and edge existence (Fatemi
user health data, a gap our benchmark uniquely et al., 2023b; Wang et al., 2024a). Although these
addresses. benchmarks provide insights into foundational rea-
soning, they lack domain specificity. Recent work
A.2 Knowledge Graph Question Answering
by He et al. (He et al., 2024) introduced bench-
Knowledge Graph Question Answering (KGQA) marks targeting advanced reasoning in general
has undergone significant advancements, evolving graph contexts, but domain-specific benchmarks
from early approaches such as semantic parsing and for applications such as nutrition remain underde-
retrieval-based methods. Initial models translated veloped. By adapting the principles of Graph-RAG,
natural language queries into structured formats our work introduces the first benchmark designed
like SPARQL for execution on knowledge graphs to tackle personalized health-aware reasoning, ad-
(Sun et al., 2019; Zhang et al., 2022). Many of dressing this critical gap in the literature.
these methods employed pre-trained models like
BERT for query encoding and used frameworks B Benchmark Details
such as GNNs or LSTMs for retrieving entities
and subgraphs (Yasunaga et al., 2021; Taunk et al., B.1 Data Source Description
2023). NHANES. National Health and Nutrition Exam-
More recent progress integrates large language ination Survey (NHANES) is a publicly avail-
models (LLMs) to improve both retrieval efficiency able dataset collected by the U.S. Centers for Dis-
and reasoning ability (Sanchez and Zhang, 2022; ease Control and Prevention (CDC) to assess the
Liu et al., 2024a; Tan et al., 2024). Approaches health and nutritional status of the U.S. population
like Jiang et al. (Jiang et al., 2023) and Wang et through interviews, physical examinations, and lab-
al. (Wang et al., 2023) utilize LLMs to transform oratory tests. Data is released every two years and
queries into formats such as SQL or SPARQL, en- encompasses five main categories: Demograph-
hancing retrieval accuracy. Others, such as Kim et ics, Dietary Data, Examination Data, Laboratory
al. (Kim et al., 2023) and Gao et al. (Gao et al., Data, and Questionnaire Data. These comprehen-
2024), focus on reasoning over retrieved subgraphs sive datasets provide a wealth of information on
or triples, tackling multi-hop reasoning tasks in health indicators, dietary behaviors, and medical

12
conditions. Nutrients Low Threshold High Threshold NRV

FNDDS and WWEIA. The Food and Nutrient Calories (kcal) 40 225 2000
Carbohydrates (g) 55 75 -
Database for Dietary Studies (FNDDS) is a com- Protein (g) 10 15 50
prehensive resource developed by the U.S. Depart- Saturated Fat (g) 1.5 5 20
Cholesterol (mg) 20 40 300
ment of Agriculture (USDA) to facilitate dietary Sugar (g) 5 22.5 -
intake analysis by providing detailed nutritional in- Dietary Fiber (g) 3 6 -

formation for foods and beverages consumed in the Sodium (mg) 120 200 2000
Potassium (mg) 0 525 3500
United States. It serves as the backbone for analyz- Phosphorus (mg) 0 105 700
ing dietary recall data collected through the What Iron (mg) 0 3.3 22
Calcium (mg) 0 150 1000
We Eat in America (WWEIA) program, which is a Folic Acid (µg) 0 60 400
component of NHANES. WWEIA captures dietary Vitamin C (mg) 0 15 100
Vitamin D (µg) 0 2.25 15
intake data through 24-hour dietary recall inter- Vitamin B12 (µg) 0 0.36 2.4
views, linking reported food and beverage items to
their corresponding nutrient profiles in FNDDS. To- Table 4: Nutrient Reference Values (NRV) and thresh-
gether, FNDDS and WWEIA enable researchers to olds (per 100g of food) used based on the nutritional
standards.
study dietary patterns, nutrient intake, and their re-
lationship to health outcomes, making them critical Health Indicator High Threshold Low Threshold

tools for advancing nutrition research and public BMI 30 18.5


Waist Circumference (cm) 102 (88) -
health policy. Blood Pressure (mmHg) 140 90
Osteoporosis - -

B.2 Dietary Habit Processing Details Blood Urea Nitrogen (mmol/L) 7.1 -
Low-Density Lipoprotein (mmol/L) 3.3 -
Red Blood Cell (million cells/uL) - 4
Dietary habit data was sourced from various Glucose (mmol/L) 7 -
NHANES tables, including the Diet Behavior and Glycohemoglobin (%) 6.5 -
Hemoglobin (g/dL) - 13.2 (11.6)
Consumer Behavior datasets, which capture user-
reported behaviors and preferences related to food Table 5: Health Indicators with Corresponding High
choices, preparation methods, and consumption and Low Thresholds. Parentheses indicate sex-specific:
patterns. Traditional processing approaches proved male (female) thresholds where applicable.
insufficient for the complexity and diversity of
these features. To address this, a thorough man- pects, including 7 for macro-nutrients (calories,
ual review was conducted by a team of four re- carbohydrates, protein, saturated fat, cholesterol,
searchers. Key features indicative of dietary habits, sugar, and dietary fiber) and 9 for micro-nutrients
such as awareness of healthy eating practices or (sodium, potassium, phosphorus, iron, calcium,
frequency of consuming processed or frozen foods, folic acid, and vitamin C, D, and B12) following
were identified and categorized. Users were then the tagging scheme introduced in (Zhang et al.,
grouped into high and low habit categories based on 2024c). A detailed table of thresholds can be seen
their responses, with the top 10% and bottom 10% in Table-4. As discussed in the paper, these thresh-
assigned corresponding habit tags. For instance, olds are derived from existing standards and leg-
users reporting the highest milk consumption were islation, from World Health Organization (WHO),
tagged with "drink lots of milk," while those with Food Standards Agency (FSA)m EU Nutrition &
minimal consumption were labeled as "drink little Health Claims Regulation (Commission, 2006) and
or no milk." This process generated 54 distinct di- the Codex Alimentarius Commission (CAC) (Al-
etary habit tags, which were incorporated as nodes imentarius, 1985, 1997). An even more detailed
in the graph. These habit nodes provide critical standards are listed in Appendix-I. Following the
insights into user behaviors, enabling a nuanced similar practice, we also extract the thresholds for
understanding of the relationship between dietary health conditions, as shown in Table-5, Since we
patterns and health outcomes. have the thresholds for both nutrition and health,
we demonstrate the full mapping relationship can
B.3 Full Mappings of Nutrition Tags be seen in Table-6. Note that the special diet data
In this section, we discuss the overall mapping can be retrieved from NHANES data, which di-
relationship between health indicators and nutri- rectly indicates a user needs certain nutrients.
tion. In total, we involve nutrition tags for 16 However, as we emphasize in the paper, the in-
different nutrients focusing on various health as- teractions between nutrition and health are com-

13
plex and multi-facet. To maintain scientific rigor category system that assigns a therapeutic classifi-
and practical relevance, we focus on annotating cation to each drug and each ingredient of the drug.
four prevalent health statues, of which diet has Category codes used to identify prescription opioid
been proved to be beneficial for intervention. Their use were: Level 1: 57 = central nervous system
mapping to nutrition tags can be seen in Table-7. agents; Level 2: 58 = Analgesics; Level 3: 60 =
The definition of these major health statues are narcotic analgesics, or 191 = narcotic analgesics
discussed in the next section. combinations (Detail in Appendix-I).

B.4 The Definition of Health Conditions B.5 Definitions of Ground Truth


In the paper, we focus on annotating the four preva- In this section, we outline how ground truths are
lent health statuses—obesity, hypertension, opioid determined for each task. For the multi-label classi-
misuse, and diabetes—that are directly influenced fication task, the process is straightforward. As
by dietary interventions. Among them, WHO and discussed earlier, nutrition tags are created and
American Heart Association (AHA) provide clear linked to users’ health conditions based on pre-
and well-known definitions for obesity and hyper- defined standards. The ground truths for this task
tension. We mark a user obesity if the BMI is 30 are simply the lists of nutrition tags relevant to each
or greater, and we mark a user hypertension if the user’s health profile.
average of 4 test of systolic pressure is 140 mm For the binary classification task, we use the
Hg or higher or diastolic pressure is 90 mm Hg relationship between the user’s condition and the
or higher. This is classified as stage-2 hyperten- food’s nutrition tags. A "Yes" label is assigned
sion and require medical control. For Diabetes, if the relationship is a "match," and "No" is as-
NHANES provides specific questionnaire for di- signed if the relationship is a "contradict." In the
abetic users, and we also mark a user diabetic if case of complex question settings, where multiple
the user’s Glucose (mmol/L) level is over 7.0 AND "match" and "contradict" links exist, we calculate
Glycohemoglobin (%) is over 6.5. the count of each. A question is marked as "Yes" if
Opioid misuse, on the other hand, is a tricky the number of "match" links exceeds the number
health condition to be defined. However we argue of "contradict" links.
this health condition is of vital importance, as the For the text generation task, we generate refer-
opioid crisis has been one of the most critical so- ence texts using a combined approach. First, the
ciety concerns in the United States. Opioids are a overall healthiness of the food is determined us-
category of drugs that include the illegal substance ing the binary classification result ("Yes" or "No").
heroin, synthetic opioids such as fentanyl, and pre- This is followed by a natural language explanation
scription painkillers like oxycodone (NIDA, 2024). that lists the relevant nutrition tags. For example, a
While primarily used for pain management, opioids reference text might read: "Yes, because the food
can induce euphoria, making them prone to misuse is low in calories and high in protein." This method
(Dennett, 2021; Rigg and Ibañez, 2010; Rosenblum ensures that the reference text provides a clear and
et al., 2008). For instance, in 2019, 10.1 million natural explanation for the decision.
Americans reported opioid misuse, and in 2021,
there were an estimated 108,000 drug overdose C Implementation Details
deaths in the United States, 90% of which were
linked to opioids (CDC, 2020b; Tanz et al., 2022). In this section, we discuss the implementation de-
In this work, we follow prior work (Zhang et al., tails of the baseline models. Specially how we set
2024b) to define misuse by the following criteria: the hyper-parameters and how we make adaption
(1) records of illicit opioid drug use, like heroin, to our task. All codes all provided in the codebase
within a year, or (2) records of prescription opioid mentioned in the abstract.
medication use for over 90 days, which is a thresh- Plain refers to a naive GraphRAG pipeline. Un-
old commonly employed in the medical domain like approaches that directly input natural language
(Gu et al., 2022). text or tabular data, we transform the user and food
NHANES dataset provides illicit drug usage information from the knowledge graph structure
data, and we can track down the opioid prescrip- into multiple triples, each consisting of an entity, a
tion medicine usage data using the Multum Lexicon relationship, and another entity, then concatenate
Therapeutic Classification Scheme, a 3-level nested them before feeding into the LLMs.

14
Nutrient Category Tag Name Source Health Indicators
High Calories Low BMI; Low waist circumference; Weight gain/Muscle building diet
Low Calories High BMI; High waist circumference; Weight loss diet
Low Carb Low carbohydrate diet; High BMI; High waist circumference
High Protein Opioid misuse; Weight gain/Muscle building diet; High protein diet
Macro-nutrients Low Protein High blood urea nitrogen; Renal/Kidney diet
Low Saturated Fat High low-density lipoprotein; Low fat/Low cholesterol diet
Low Sugar Opioid misuse; Diabetic Diet; Low sugar Diet
Low Cholesterol High low-density lipoprotein; Low fat/Low cholesterol diet
High Fiber High low-density lipoprotein; Opioid misuse; Diabetic Diet
Low Sodium High blood pressure; Renal/Kidney diet; Low salt diet
High Potassium High blood pressure
Low Phosphorus Renal/Kidney diet
High Iron Low red blood cell/Low hemoglobin
Micro-nutrients High Calcium Osteoporosis/brittle bones
High Folic Acid Low red blood cell count
High Vitamin C Low red blood cell/Low hemoglobin; Osteoporosis/brittle bones
High Vitamin D Osteoporosis/brittle bones
High Vitamin B12 Low red blood cell count

Table 6: Nutrient Categories, Tag Names, and Associated Source Health Indicators. Nutrient categories are organized
to consolidate related tags and their respective health indicators for clarity.

Health Indicator Associated Tags the first stage, "Let’s think step by step" is ap-
Obesity Low Calorie
Opioid Misuse High Protein; Low Sugar; Low Sodium
pended after the question to guide the model to-
Hypertension Low Sodium wards producing a reasoning path. In the second
Diabetes Low Sugar; Low Carb
Weight Loss/Low Calorie Diet Low Calorie stage, the reasoning path is fed to the model to
Low Fat/Low Cholesterol Diet Low Cholesterol; Low Saturated Fat
Low Salt/Low Sodium Diet Low Sodium
extract the final answer. However, our initial ex-
Sugar-Free/Low Sugar Diet Low Sugar periments showed that we can combine these two
Diabetic Diet Low Sugar; Low Carb
Weight Gain/Muscle Building Diet High Calorie; High Protein steps, by having both "Let’s think step by step" and
Low Carbohydrate Diet Low Carb
High Protein Diet High Protein
final output requirements in one prompt, while still
Renal/Kidney Diet Low Protein achieving the same performance. This allows us
to save computational and API resources, avoid-
Table 7: Health Indicators and Their Associated Nutri-
ing potential inconsistencies and information loss
tional Tags. Each indicator is linked to relevant tags
reflecting dietary requirements. that arise when feeding the reasoning output into a
second step. This is because with the one-step ap-
proach, the model can make a final decision based
KAPING answers questions based on a sub-
on both the original graph, and its own reasoning
graph composed of the entities mentioned in the
path, whereas in the second-step approach, the orig-
query and their neighboring nodes. Following
inal graph is not available to the model.
the methodology described in the original pa-
per, we first extract the entities present in the CoT-BAG is designed to improve the graph rea-
query—specifically the user and food—from the soning capabilities of LLMs by first encouraing the
provided knowledge graph. Then, we include their model to "build" an implicit graph representation
respective neighboring nodes to construct a sub- of the problem, and then using chain-of-thought
graph via retrieval. This subgraph is subsequently reasoning to solve it. For this approach, a single
transformed into triples and concatenated before prompt is sufficient to guide the model through
feeding into the LLMs. Note that in the original im- both the graph construction and reasoning, by com-
plementation, the authors also used top-k filtering bining both "Let’s construct a graph from the given
to prune the retrieval results. However, since we nodes and edges" and "Let’s think step by step to
don’t have any other entities in the question, this arrive at the final answer". Adapting CoT-BaG to
pruning based on embedding similarities with the our benchmark requires creating a textual descrip-
question doesn’t generate any reasonable results. tion of the graph triples, in the following format:
We skip this step in our implementation. "The graph contains an edge between node [source]
CoT-Zero is a two-stage prompting stategy. In and node [target] with attribute [relationship], an

15
a) Binary Classification (-B) b) Multi-label Classification (-ML) c) Text Generation (-TG)
Question Level Method
Accuracy Recall Precision F1 Accuracy Recall Precision F1 ROUGE-1 ROUGE-2 ROUGE-L BLEU BERT
Plain 0.6161 0.2413 0.8619 0.3770 0.2190 0.8958 0.2365 0.3666 0.5645 0.4999 0.5642 0.3092 0.9375
KAPING 0.5329 0.0732 0.6268 0.1310 0.1951 0.8885 0.2194 0.3468 0.5374 0.4678 0.5370 0.2759 0.9346
Sparse CoT-Zero 0.6049 0.2885 0.7255 0.4128 0.3633 0.7636 0.4265 0.5263 0.5593 0.5016 0.5589 0.3424 0.8871
CoT-BAG 0.6060 0.2875 0.7307 0.4126 0.4204 0.7430 0.4724 0.5589 0.5479 0.4888 0.5474 0.3325 0.8849
ToG 0.8483 0.6959 0.9844 0.8154 0.3227 0.9561 0.3168 0.4672 0.7216 0.6793 0.7215 0.4997 0.9582
Plain 0.5903 0.2584 0.8871 0.4002 0.5651 0.9224 0.5665 0.6932 0.7746 0.7074 0.7344 0.5513 0.9656
KAPING 0.4809 0.0480 0.6216 0.0891 0.4830 0.8954 0.5064 0.6391 0.7203 0.6368 0.6835 0.4748 0.9594
Standard CoT-Zero 0.6576 0.3528 1.0000 0.5216 0.5373 0.9963 0.5429 0.6948 0.7333 0.6446 0.7058 0.4940 0.9507
CoT-BAG 0.5872 0.2197 1.0000 0.3603 0.5585 0.9984 0.5599 0.7084 0.5479 0.4888 0.5474 0.3325 0.8849
ToG 0.8647 0.7443 1.0000 0.8534 0.8242 0.9238 0.8437 0.8745 0.8870 0.8292 0.8227 0.6959 0.9775
Plain 0.6249 0.0424 0.3562 0.0758 0.6790 0.8679 0.7695 0.8108 0.7608 0.6814 0.7136 0.5102 0.9604
KAPING 0.6302 0.0473 0.4143 0.0849 0.6549 0.8501 0.7522 0.7915 0.7446 0.6644 0.7032 0.4910 0.9587
Complex CoT-Zero 0.6639 0.0750 0.9787 0.1394 0.7466 0.9729 0.7693 0.8562 0.7474 0.6597 0.7107 0.5053 0.9475
CoT-BAG 0.6621 0.0685 1.0000 0.1282 0.7533 0.9628 0.7783 0.8577 0.7468 0.6620 0.7076 0.5051 0.9470
ToG 0.7219 0.2936 0.8295 0.4337 0.6871 0.7160 0.8952 0.7846 0.8177 0.7424 0.7651 0.5978 0.9692

Table 8: Experimental results based on five baseline methods on the three tasks with the three question levels using
the Llama-3.1-70B-instruct. The best performance of each group is bolded.

a) Binary Classification (-B) b) Multi-label Classification (-ML) c) Text Generation (-TG)


Question Level Method
Accuracy Recall Precision F1 Accuracy Recall Precision F1 ROUGE-1 ROUGE-2 ROUGE-L BLEU BERT
Plain 0.5363 0.0384 0.9573 0.0739 0.1965 0.8102 0.2720 0.3770 0.4572 0.3806 0.4556 0.2137 0.9200
KAPING 0.5370 0.0399 0.9588 0.0766 0.1960 0.8120 0.2713 0.3769 0.4565 0.3798 0.4548 0.2135 0.9199
Sparse CoT-Zero 0.5324 0.0301 0.9535 0.0583 0.2535 0.8273 0.3934 0.4664 0.4350 0.3575 0.4334 0.1992 0.8728
CoT-BAG 0.5885 0.2983 0.6607 0.4110 0.2698 0.8720 0.3523 0.4693 0.4498 0.3767 0.4485 0.2116 0.8777
ToG 0.6336 0.4025 0.7109 0.5140 0.2100 0.7045 0.2493 0.3563 0.4480 0.3441 0.4432 0.1940 0.9074
Plain 0.5268 0.1054 1.0000 0.1907 0.4599 0.8212 0.5386 0.6216 0.6260 0.5178 0.6067 0.3607 0.9380
KAPING 0.5245 0.1007 1.0000 0.1830 0.4606 0.8214 0.5396 0.6228 0.6272 0.5192 0.6076 0.3623 0.9387
Standard CoT-Zero 0.4917 0.0391 1.0000 0.0753 0.5280 0.8426 0.6216 0.6881 0.5854 0.5708 0.4747 0.3213 0.9120
CoT-BAG 0.5953 0.3100 0.8049 0.4476 0.5654 0.8577 0.6222 0.7073 0.6147 0.5128 0.5968 0.3504 0.9184
ToG 0.8385 0.7630 0.9178 0.8333 0.5151 0.7613 0.5774 0.6378 0.6302 0.5061 0.5985 0.3526 0.9284
Plain 0.6627 0.0799 0.8909 0.1467 0.5991 0.7924 0.7511 0.7482 0.6636 0.5725 0.6432 0.3953 0.9402
KAPING 0.6645 0.0865 0.8833 0.1575 0.5998 0.7884 0.7518 0.7458 0.6637 0.5713 0.6452 0.3934 0.9400
Complex CoT-Zero 0.6467 0.0277 0.9444 0.0539 0.6352 0.7831 0.8071 0.7761 0.6300 0.5339 0.6149 0.3574 0.9184
CoT-BAG 0.6556 0.2186 0.5654 0.3153 0.6295 0.7686 0.7996 0.7712 0.6506 0.5619 0.6321 0.3829 0.9223
ToG 0.7710 0.7732 0.6565 0.7101 0.5224 0.6157 0.7529 0.6408 0.6296 0.5114 0.5981 0.3500 0.9267

Table 9: Experimental results based on five baseline methods on the three tasks with the three question levels using
the GPT-3.5-turbo. The best performance of each group is bolded.
edge between..." to include in the input prompt, too early risks discarding paths that may be critical
alongside the question, and output requirements. for answering the query. Delaying pruning allows
ToG to collect more comprehensive information
ToG introduces a strategy that iteratively
before making pruning decisions. These modifi-
searches and prunes reasoning paths on a knowl-
cations ensure that ToG is better aligned with the
edge graph starting from entities mentioned in the
requirements and complexities of our benchmark,
query to identify suitable paths. However, the open-
enabling more effective performance evaluation.
source ToG codebase is implemented based on
Wikidata and Freebase databases, making it incom- D Additional Experiments
patible with private datasets. To evaluate ToG on
our benchmark, we reimplemented it following the To further demonstrate the performance of different
original methodology. Furthermore, we adapted LLM backbones on our benchmark, we conducted
ToG to better suit the characteristics of our bench- additional tests using Llama-3.1-70b-Instruct and
mark with the following adjustments: 1). Adjusting GPT-3.5-Turbo as backbones for various baselines.
the width parameter to 5: ToG’s original width pa- As shown in Table-8 and Table-9, the performance
rameter is set to 3, which retains three reasoning trends of Llama-3.1-70b-Instruct align closely with
paths during pruning. However, answering ques- those of GPT-4o-mini, although Llama-3.1-70b-
tions in our benchmark sometimes requires more Instruct generally yields better results. This is con-
than three reasoning paths. By setting the width sistent with its stronger reasoning capabilities.
parameter to 5, ToG preserves five reasoning paths Additionally, ToG exhibited a noticeable per-
at each pruning step and generates answers based formance degradation when GPT-3.5-Turbo was
on these paths. 2). Delaying pruning until the used as the backbone, particularly when addressing
second iteration: In ToG’s first iteration, the in- standard and complex questions. This decline is
formation gathered is often insufficient to evaluate primarily due to GPT-3.5-Turbo’s relatively weaker
the importance of each reasoning path. Pruning reasoning abilities, which often lead to the retrieval

16
Diet Type Obesity Hypertension Opioid Misuse Diabetes
Weight Loss/Low Calorie Diet 2,253 647 222 267
Low Fat/Low Cholesterol Diet 448 247 76 116
Low Salt/Low Sodium Diet 442 350 86 115
Sugar-Free/Low Sugar Diet 170 89 20 78
Diabetic Diet 692 432 126 647
Weight Gain/Muscle Building Diet 3 20 12 1
Low Carbohydrate Diet 244 69 25 57
High Protein Diet 47 12 9 8
Renal/Kidney Diet 25 24 13 7

Table 10: Adoption of Diet Types Across Health Conditions. Each entry represents the number of users with a
specific condition following a corresponding diet type.

Status # Users hypertension (10,257 users). These numbers em-


phasize the widespread impact of these conditions
Weight Loss/Low Calorie Diet 4,693
on public health and underscore the urgent need for
Low Fat/Low Cholesterol Diet 1,196
dietary interventions. However, the stark contrast
Low Salt/Low Sodium Diet 1,037
between the prevalence of these conditions and the
Sugar-Free/Low Sugar Diet 417
adoption of relevant dietary interventions—such as
Diabetic Diet 1,403
low-calorie diets (4,693 users) or low-sodium di-
Weight Gain/Muscle Building Diet 274
Low Carbohydrate Diet 489 ets (1,037 users)—reveals a significant gap. While
High Protein Diet 146 conditions like obesity and hypertension demand
Renal/Kidney Diet 59 immediate dietary action, far fewer individuals en-
Obesity 18,271 gage in corresponding interventions. This disparity
Hypertension 10,257 highlights the critical need for personalized dietary
Opioid Misuse 2,822 reasoning to encourage healthier eating habits tai-
Diabetes 3,837 lored to individual health conditions.

Table 11: Distribution of Users Across Health Condi-


tions and Special Diets. A similar trend emerges in Table-10, which ex-
amines the alignment between specific health con-
of suboptimal information. Such information pro- ditions and diet types. While there is some adoption
vides minimal support—or even introduces neg- of relevant dietary actions, such as weight loss di-
ative impacts—on subsequent answer generation. ets (2,253 for obesity, 647 for hypertension) and
These two sets of experiments highlight the strin- low-sodium diets (442 for obesity, 350 for hyper-
gent reasoning requirements imposed by our bench- tension), these numbers remain disproportionately
mark on the tested models. low relative to the overall prevalence of these con-
ditions. The gap is even more pronounced for
E Additional Statistics diabetes, where fewer than half of diagnosed in-
dividuals (647 users) follow diabetic diets out of
In addition to the basic statistics provided above, 3,837 diagnosed users. Specialized interventions,
we also provide an in detailed benchmark dis- such as renal/kidney or muscle-building diets, see
cussing the user distribution on health conditions minimal adoption across all conditions, suggest-
and the overlap between the four major conditions ing a lack of accessibility or awareness for these
and the special diets. targeted approaches. These patterns reinforce the
Spanning from 2003 to 2020, the latest available need for tailored, actionable dietary recommenda-
NHANES data includes a total of 95,872 unique tions to address the divide between health condition
users. Table-11 illustrates the distribution of health prevalence and effective dietary responses, ensur-
conditions across this population, highlighting the ing broader access to appropriate and impactful
significant prevalence of obesity (18,271 users) and interventions.

17
Role Category Content
“Act as a nutritionist. Analyze if a given food is
System -
healthy to a user following further instructions.”
“Based on the nutrients the food provides and the user
Question needs, please answer whether the food ‘Fish curry with
rice’ is healthy for the user?”
“Below is the extra information you use to answer the
Default question, note that you should not use your general
Method knowledge and the answer is among this information.”
prompt
Customized “…”

(Fish curry with rice belongs to Seafood mixed dishes),


Textualized graph (Fish curry with rice has fish curry), (Fish curry with
rice contains high_sodium)…

Binary “Your output will strictly be Yes or No with no other


classification words.”
User
“Your output must be strictly formatted as a comma-
separated list of nutrients prefixed with “high” or
“low”, based solely on the provided options: carb,
Multi-label
protein, sugar, sodium, cholesterol, saturated_fat,
classification
calorie. For example, a valid output would be:
Task high_carb, low_protein, high_sugar. No extra words or
prompt deviations are allowed.”
“Your output must consist of “Yes” or “No”, followed by
a list of nutrients addressed with “high” or “low,”
selected from the following options: carb, protein,
Text sugar, sodium, cholesterol, saturated fat, and calorie.
generation For example, a valid output would be: Yes, because the
food is high in carb, low in protein, high in sugar.
Ensure the output adheres to this format without any
additional words or deviations.”

Figure 7: The paradigm of prompt for final output.

Role Content explicit guidance.


"Identify the top-<width> When querying LLMs for the final output, the
reasoning paths that are most
likely to lead to the answer for
paradigm of our prompt is shown as Figure-7.
the query. Please respond with The system prompt is fixed while the user prompt
System the indices of the reasoning consists of four flexible parts: question, method
paths, starting from 1, and prompt, textualized graph, and task prompt. The
separate them with commas (e.g.,
1,2,5). Include nothing else in question and task prompt will be automatically ad-
your response." justed according to the experiment settings. The
"The query is <question>, and method prompt can be customized to the meth-
the reasoning paths are: ods proposed by the benchmark users, e.g., adding
User <reasoning_path_list>. Your
"Let’s think step by step." for CoT-Zero and adding
selected top-<width> reasoning
paths are:" "Let’s construct a graph from the given nodes and
edges" for CoT-BAG. We encourage benchmark
Figure 8: The prompt used in ToG. users to further explore the potential of method
prompts. The textualized graph is by default gener-
F Prompt Design ated by concatenating the triplets in the retrieved
In this section, we will demonstrate our carefully knowledge graph. Benchmark users can also cus-
designed prompts for the three task settings and tomize their own textualization method.
selected baselines. The principle of our prompt Additionally, the prompt we used to prune the
design is to let LLMs become familiar with nutri- relations and entities when testing ToG is shown in
tional domain knowledge while avoiding providing Figure-8.

18
G Case Study thresholds (Figure-10 and Figure-11) . Note that
since there are discrepancies in the regulation. We
We present 7 case studies across 3 Tasks (Bi- adopt a stricter measure and make it sure it fits
nary Classification, Multi-label Classification, Text NHANES data. The Vitamins and Minerals high
Generation), 3 Question Levels (Sparse, Standard, thresholds are calculated from the Daily Nutritional
Complex) and 5 Baselines (Plain, KAPING, CoT- Reference Value (NRV), where CAC defines if a
Zero, CoT-BaG, ToG). This section provides in- food (per 100g) contains over 15% of NRV, it can
sights into how the prompts are structured across claim itself a source of such nutrient. The Codex
different baselines and the reasoning path behind Alimentarius, or "Food Code" is a collection of
the LLM’s final answer, as detailed in Tables 12-18. standards, guidelines and codes of practice adopted
The case studies provide critical insights into the by the Codex Alimentarius Commission. The Com-
strengths and limitations of each baseline, while mission, also known as CAC, is the central part
emphasizing the challenges posed by personal- of the Joint FAO/WHO Food Standards Program
ized dietary reasoning, highlighting our bench- and was established by FAO and WHO to protect
mark’s role in advancing the development of ro- consumer health and promote fair practices in food
bust, domain-specific AI models for personalized trade. 3) The Multum Lexicon Therapeutic Classifi-
health-aware nutrition reasoning. cation Scheme6 , used to define opioid prescription
medicines and later mark opioid misuse (Figure-
H Addtional Error Analysis
12).
Our experiments showed that in the specific task of
health-aware nutrition reasoning, LLMs are prone
to two main types of errors: contextual hallucina-
tion and factual hallucination. To understand these
shortcomings, we perform an error analysis focus-
ing on the Text Quality Evaluation task, using 3
methods (KAPING, CoT-Zero, ToG) as a repre-
sentative setting. We prompt the models to also
include the reasonings behind their final answer,
which then go through a human review process,
revealing 2 types of reasoning failures: Contextual
Hallucination and Factual Hallucination. Note that
we do not check for KG topology errors, as our KG
generation process ensures there are no structural
problems in the knowledge base that would affect
the model’s information retrieval and processing
performance. Exemplary demonstrations of these
2 error types are shown in Table-19 and Table-20.

I Standards and Regulation


In this section, we provide the standards and reg-
ulations used in this paper and attach their links
of original document in footnote. There in general
three categories: 1) The FNDDS category code 1
used for filtering food candidates (Figure-9). 2) Nu-
trition claim regulations from WHO, FSA2 , CAC34 ,
and EU legislation5 . used for defining nutrition
1
Full documention of FNDDS at here
2
FSA Guideline
3
Guidelines on Nutrition Labeling
4
Guidelines for Use of Nutrition and Health Claims
5 6
EU Nutrition & Health Claims Regulation legislation Full document of Multum Lexicon Therapeutic Classifi-
(EC) cation Scheme at here

19
Configurations Task: Binary Classification
Question: Complex
Model: GPT-4o-mini
Method: Plain

Prompt [Question] Based on the nutrients the food provides and the user needs, please answer whether
the food "Black beans and brown rice" is healthy for the user? Please answer with yes or no.
[Method prompt] Below are the extra information you use to answer the question, note that you
should not use your general knowledge and the answer is among this information.
[Textualized graph] (Black beans and brown rice belongs to Bean, pea, legume dishes), (Black
beans and brown rice has Black beans, from dried, fat added), (Black beans and brown rice
has Rice, brown, cooked, no added fat), (Black beans and brown rice belongs to low_carb),
(Black beans and brown rice belongs to low_sugar), (Black beans and brown rice belongs to
high_sodium), (Black beans and brown rice belongs to low_protein), (Black beans and brown
rice belongs to low_cholesterol), (Black beans and brown rice belongs to low_saturated_fat),
(user has Eats lots of fish), (user has Adds little to no salt at table), (user has Eats little to no
fast food), (user has Eats few to no ready to eat meals), (user has Takes more supplements),
(user has Uses lots of salt in preparation), (user has Ate more food than usual), (user has Ate
less food than usual), (user has Eats gluten free diet), (user has Eats high protein diet), (user
has diabetes), (user has Low carbohydrate diet), (user has High protein diet), (diabetes match
low_sugar), (diabetes match low_carb), (Low carbohydrate diet match low_carb), (High protein
diet contradict low_protein)
[Note prompt] Important Note: Your output will strictly be Yes or No with no other words.

True Answer No

LLM Answer No

LLM Reasoning While black beans and brown rice provide some beneficial nutrients, they are low in protein,
which contradicts the user’s high protein diet requirement. Additionally, the dish is high
in sodium, which may not be suitable for someone managing diabetes and following a low
carbohydrate and low sugar diet. Therefore, it does not align well with the user’s nutritional
needs.

Observation Factual hallucination: Although the model is able to correctly determine that the food is unhealthy
to the user because it is low in protein compared to the user’s High protein diet requirement,
upon inspecting its reasonings, it is found that the model mistakenly deduces that high_sodium
food tag is unsuitable for a low carb and low sugar diet, which is not correct.

Table 12: Case Study 1

20
Configurations Task: Multi-label Classification
Question: Standard
Model: Llama-3.1-70b-instruct
Method: KAPING

Prompt [Question] Based on the nutrients the food provides and the user needs, please answer what
nutrient tags are used to determine whether the food "Nachos, cheese only" is healthy or
unhealthy for the user?
[Method prompt] Below are the extra information you use to answer the question, note that you
should not use your general knowledge and the answer is among this information.
[Textualized graph] (Nachos, cheese only belongs to Nachos), (Nachos, cheese only has Cheese
sauce), (Nachos, cheese only has Snacks, tortilla chips, plain, white corn, salted), (Nachos,
cheese only has Fast foods, nachos, with cheese), (Nachos, cheese only belongs to low_carb),
(Nachos, cheese only belongs to low_sugar), (Nachos, cheese only belongs to high_sodium),
(Nachos, cheese only belongs to high_calorie), (Nachos, cheese only belongs to low_protein),
(Nachos, cheese only belongs to low_cholesterol), (user has Drinks little or no milk), (user has
Eats little or no shellfish), (user has Adds little to no salt at table), (user has Drinks Alcohol more
than average), (user has Eats many ready to eat meals), (user has Drinks lots of water), (user has
Ate more food than usual), (user has Ate less food than usual), (user has Eats gluten free diet),
(user has obesity), (user has High protein diet), (obesity contradict high_calorie), (High protein
diet contradict low_protein)
[Note prompt] Important Note: Your output must be strictly, with no extra words, separated
by comma, a list of nutrients with high or low before the nutrients among these options: carb,
protein, sugar, sodium, cholesterol, saturated_fat, calorie. For example, the output is: high_carb,
low_protein, high_sugar. You should only include the nutrient tags that connect the food with
the user.

True Answer high_calorie, low_protein

LLM Answer high_calorie, high_sodium, low_protein

LLM Reasoning It is stated that "Nachos, cheese only belongs to high_calorie" and "high_sodium." Additionally,
it is mentioned that "Nachos, cheese only belongs to low_protein."
Given the user’s condition of obesity and High protein diet requirements, consuming high-calorie
food would contradict their health goal of obesity. Furthermore, a high-protein diet requirement
contradicts the fact that “Nachos, cheese only belongs to low_protein”. User’s habit of adding
little to no salt at table contradicts the fact that “Nachos, cheese only belong to high_sodium”.

Notes The model includes information about the user’s habit of adding little to no salt at the table and
concludes that high_sodium is a food tag that is important in determining the food’s healthiness,
which is irrelevant information in this case because it is supposed to focus on the user’s health
and diet only - Factual Hallucination.

Table 13: Case Study 2

21
Configurations Task: Text Generation
Question: Complex
Model: GPT-4o-mini
Method: CoT-Zero

Prompt [Question] Based on the nutrients the food provides and the user needs, please answer whether
the food "Turkey with gravy" is healthy for the user? Please answer with a short sentence
explaining why.
[Method prompt] Below are the extra information you use to answer the question, note that you
should not use your general knowledge and the answer is among this information. Let’s think
step by step to determine the healthiness of the food, by extracting the nutritional properties of
the food from the given graph, then comparing them to the nutrition requirements of the health
status, dietary need and habits of the user. A food is unhealthy only if it has certain properties
that are unsuitable to the user’s health and diet. Do not be too strict with your criteria, only
focus on a few main nutritional tags that strongly indicate its healthiness or unhealthiness to the
particular diet or health status the user has. Some nutritional tags might not be as important in
determining healthiness.
[Textualized graph] (Turkey with gravy belongs to Poultry mixed dishes), (Turkey with gravy
has Turkey, whole, meat only, cooked, roasted), (Turkey with gravy has Salt, table, iodized),
(Turkey with gravy has Gravy, chicken, canned or bottled, ready-to-serve), (Turkey with gravy
belongs to low_carb), (Turkey with gravy belongs to low_sugar), (Turkey with gravy belongs
to high_sodium), (Turkey with gravy belongs to high_protein), (Turkey with gravy belongs to
high_cholesterol), (Turkey with gravy belongs to low_saturated_fat), (user has Eats little or no
shellfish), (user has Drinks Alcohol less than average), (user has Eats little to no frozen food),
(user has Eats few to no meals outside home), (user has Eats few to no ready to eat meals),
(user has Takes few or no supplements), (user has Uses little to no salt in preparation), (user
has Ate more food than usual), (user has Ate less food than usual), (user has Eats weight loss
diet), (user has Eats low fat diet), (user has Eats high fiber diet), (user has opioid_misuse), (user
has diabetes), (user has Low fat/Low cholesterol diet), (user has Low salt/Low sodium diet),
(user has Diabetic diet), (opioid_misuse match high_protein), (opioid_misuse match low_sugar),
(opioid_misuse contradict high_sodium), (diabetes match low_sugar), (diabetes match low_carb),
(Low fat/Low cholesterol diet contradict high_cholesterol), (Low fat/Low cholesterol diet match
low_saturated_fat), (Low salt/Low sodium diet contradict high_sodium), (Diabetic diet match
low_sugar), (Diabetic diet match low_carb)
[Note prompt] Important Note: Your output must be a Yes or No followed by strictly a list of
nutrients with high or low as prefix among these options: carb, protein, sugar, sodium, cholesterol,
saturated fat, calorie. For example, the output is: Yes, because the food is high in carb, low in
protein, high in sugar.

True Answer Yes, because the food is low in carb, low in sugar, low in saturated_fat, high in protein.

LLM Answer No, because the food is high in sodium, high in cholesterol.

LLM Reasoning Turkey with gravy is high in sodium, which contradicts the user’s low salt/low sodium diet. It is
also high in cholesterol, which contradicts the user’s low fat/low cholesterol diet.

Notes The model fails to recognize there are far more good properties about this food item than bad
ones.

Table 14: Case Study 3

22
Configurations Task: Multi-label Classification
Question: Sparse
Model: GPT-3.5-turbo
Method: CoT-BaG

Prompt [Question] Based on the nutrients the food provides and the user needs, please answer whether
the food "Sesame chicken" is healthy for the user? Please answer with a short sentence
explaining why.
[Method prompt] Below are the extra information you use to answer the question, note that you
should not use your general knowledge and the answer is among this information. You will be
given the textual description of a directed graph. Let’s first construct a graph with the given nodes
and edges. Then determine the healthiness of the food by traversing the graph and determining
the nutritional properties of the food, then compare them to the health status, dietary need and
habits of the user. Do not be too strict with your criteria, only focus on a few main nutritional
tags that strongly indicate its healthiness or unhealthiness to the particular diet or health status
the user has. Some nutritional tags might not be as important in determining healthiness.
[Textualized graph] Here is the description of the graph: This is the list of edges: an edge between
node "Sesame chicken" and "Stir-fry and soy-based sauce mixtures" with attribute "belongs
to", an edge between node "Sesame chicken" and "Restaurant, Chinese, sesame chicken" with
attribute "has", an edge between node "Sesame chicken" and "low_carb" with attribute "belongs
to", an edge between node "Sesame chicken" and "high_sodium" with attribute "belongs to",
an edge between node "Sesame chicken" and "high_calorie" with attribute "belongs to", an
edge between node "Sesame chicken" and "high_cholesterol" with attribute "belongs to", an
edge between node "user" and "Eats little or no shellfish" with attribute "has", an edge between
node "user" and "Eats little or no fish" with attribute "has", an edge between node "user" and
"Adds little to no salt at table" with attribute "has", an edge between node "user" and "Heavy
cigarette smoker" with attribute "has", an edge between node "user" and "Drinks Alcohol more
than average" with attribute "has", an edge between node "user" and "Eats lots of frozen food"
with attribute "has", an edge between node "user" and "Eats few to no ready to eat meals" with
attribute "has", an edge between node "user" and "Takes few or no supplements" with attribute
"has", an edge between node "user" and "Drinks lots of water" with attribute "has", an edge
between node "user" and "Eats organic food less often" with attribute "has", an edge between
node "user" and "Claims to have a poor diet" with attribute "has", an edge between node "user"
and "Ate more food than usual" with attribute "has", an edge between node "user" and "obesity"
with attribute "has", an edge between node "user" and "hypertension" with attribute "has", an
edge between node "obesity" and "high_calorie" with attribute "contradict", an edge between
node "hypertension" and "high_sodium" with attribute "contradict".
[Note prompt] Important Note: Your output must be a Yes or No followed by strictly a list of
nutrients with high or low as prefix among these options: carb, protein, sugar, sodium, cholesterol,
saturated fat, calorie. For example, the output is: Yes, because the food is high in carb, low in
protein, high in sugar.

True Answer No, because the food is high in sodium, high in calorie.

LLM Answer No, because the food is high in sodium, high in calorie, high in cholesterol.

LLM Reasoning Sesame chicken typically contains a lot of sodium due to the soy-based sauce, high calories from
frying, and cholesterol from the meat and sauce. These nutritional properties may not align well
with the user’s dietary needs, especially if they have hypertension or obesity.

Notes Although the model is able to identify two important food tags that contribute to the food’s
unhealthiness, upon inspecting its reasonings, it is unclear whether the model deduces this answer
from the graph information, or from its general knowledge of the food’s ingredients.

Table 15: Case Study 4

23
Configurations Task: Text Generation
Question: Standard
Model: GPT-4o-mini
Method: ToG

Prompt [Question] Based on the nutrients the food provides and the user needs, please answer whether
the food "Vegetable rice soup, canned, prepared with water or ready-to-serve" is healthy for
the user? Please answer with a short sentence explaining why.
[Method prompt] Below are the extra information you use to answer the question, note that you
should not use your general knowledge and the answer is among this information. You will be
given the textual description of a directed graph.
[Textualized graph] Note: Bolded triples are those that appear in ToG’s retrieved subgraph.
(Vegetable rice soup, canned, prepared with water or ready-to-serve belongs to Soups), (Vegetable
rice soup, canned, prepared with water or ready-to-serve has Soup, minestrone, canned, chunky,
ready-to-serve), (Vegetable rice soup, canned, prepared with water or ready-to-serve belongs
to low_carb), (Vegetable rice soup, canned, prepared with water or ready-to-serve belongs to
low_sugar), (Vegetable rice soup, canned, prepared with water or ready-to-serve belongs to
high_sodium), (Vegetable rice soup, canned, prepared with water or ready-to-serve belongs to
low_protein), (Vegetable rice soup, canned, prepared with water or ready-to-serve belongs
to low_cholesterol), (Vegetable rice soup, canned, prepared with water or ready-to-serve
belongs to low_saturated_fat), (user has Drinks lots of milk), (user has Eats little or no
shellfish), (user has Eats little or no fish), (user has Adds little to no salt at table), (user has
Eats few to no meals outside home), (user has Eats few to no ready to eat meals), (user has
Takes more supplements), (user has Claims to have a good diet), (user has Ate more food than
usual), (user has Ate less food than usual), (user has Eats weight loss diet), (user has Eats high
protein diet), (user has Low fat/Low cholesterol diet), (user has Low carbohydrate diet),
(Low fat/Low cholesterol diet match low_cholesterol), (Low fat/Low cholesterol diet match
low_saturated_fat), (Low carbohydrate diet match low_carb)
[Note prompt] Important Note: Your output must be a Yes or No followed by strictly a list of
nutrients with high or low as prefix among these options: carb, protein, sugar, sodium, cholesterol,
saturated fat, calorie. For example, the output is: Yes, because the food is high in carb, low in
protein, high in sugar.

True Answer Yes, because the food is low in carb, low in cholesterol, low in saturated_fat.

LLM Answer Yes, because the food is low in carb, low in cholesterol, low in saturated fat.

LLM Reasoning The user has a Low fat/Low cholesterol diet and a Low carbohydrate diet. The given food,
Vegetable rice soup, matches the user’s dietary requirements as it belongs to low_cholesterol,
low_saturated_fat, and low_carb categories. Therefore, the food is suitable for the user’s needs.

Observation ToG’s retrieved subgraph contains all the necessary triples, no more no less, to determine that
the food is healthy to the user’s diet requirements.

Table 16: Case Study 5

24
Configurations Task: Text Generation
Question: Complex
Model: Llama-3.1-70B-instruct
Method: ToG

Prompt [Question] Based on the nutrients the food provides and the user needs, please answer whether
the food "Pie, custard, individual size or tart" is healthy for the user? Please answer with a
short sentence explaining why.
[Method prompt] Below are the extra information you use to answer the question, note that you
should not use your general knowledge and the answer is among this information. You will be
given the textual description of a directed graph.
[Textualized graph] Note: Bolded triples are those that appear in ToG’s retrieved subgraph.
(Pie, custard, individual size or tart belongs to Cakes and pies), (Pie, custard, individual size or
tart has Sugars, granulated), (Pie, custard, individual size or tart has Vanilla extract, imitation,
no alcohol), (Pie, custard, individual size or tart has Cornstarch), (Pie, custard, individual size
or tart has Egg, whole, raw, fresh), (Pie, custard, individual size or tart has Wheat flour, white,
all-purpose, enriched, bleached), (Pie, custard, individual size or tart has Shortening, vegetable,
household, composite), (Pie, custard, individual size or tart has Salt, table, iodized), (Pie, custard,
individual size or tart has Milk, nonfat, fluid, without added vitamin A and vitamin D (fat free or
skim)), (Pie, custard, individual size or tart has Beverages, water, tap, municipal), (Pie, custard,
individual size or tart belongs to low_carb), (Pie, custard, individual size or tart belongs to
high_sodium), (Pie, custard, individual size or tart belongs to low_protein), (Pie, custard,
individual size or tart belongs to high_cholesterol), (Pie, custard, individual size or tart
belongs to low_saturated_fat), (user has Drinks lots of milk), (user has Uses tobacco often),
(user has Drinks Alcohol more than average), (user has Rarely or never check nutrition labels),
(user has Takes few or no supplements), (user has Uses lots of salt in preparation), (user has
Claims to have a poor diet), (user has Ate more food than usual), (user has Ate less food than
usual), (user has Eats weight loss diet), (user has obesity), (user has opioid_misuse), (user
has hypertension), (user has diabetes), (user has Low fat/Low cholesterol diet), (obesity
need low_calorie), (opioid_misuse contradict low_protein), (opioid_misuse need low_sugar),
(opioid_misuse contradict high_sodium), (hypertension contradict high_sodium), (diabetes
need low_sugar), (diabetes match low_carb), (Low fat/Low cholesterol diet contradict
high_cholesterol), (Low fat/Low cholesterol diet match low_saturated_fat)
[Note prompt] Important Note: Your output must be a Yes or No followed by strictly a list of
nutrients with high or low as prefix among these options: carb, protein, sugar, sodium, cholesterol,
saturated fat, calorie. For example, the output is: Yes, because the food is high in carb, low in
protein, high in sugar.

True Answer No, because the food is high in sodium, high in cholesterol, low in protein.

LLM Answer No, because the food is low in carb, low in protein, high in cholesterol.

LLM Reasoning The user has diabetes, and a low-carb diet is suitable for diabetes management. Therefore, the
food being low in carb is a positive aspect. However, the user also has opioid misuse, which
contradicts with the food being low in protein. Protein is essential for overall health and recovery.
Additionally, the user follows a Low fat/Low cholesterol diet. The food being high in cholesterol
contradicts this requirement.

Observation ToG’s retrieved subgraph doesn’t contain the important triple “(hypertension contradict
high_sodium)”, hence it fails to include this information in the reasoning process. Instead,
it incorrectly includes the food tag low_carb into the final answer, despite successfully identify-
ing that this is a positive aspect of the food.

Table 17: Case Study 6

25
Configurations Task: Multi-label Classification
Question: Complex
Model: Llama-3.1-70B-instruct
Method: ToG

Prompt [Question] Based on the nutrients the food provides and the user needs, please answer what
nutrient tags are used to determine whether the food "Lasagna with cheese and meat sauce,
reduced fat and sodium (diet frozen meal)" is healthy or unhealthy for the user?
[Method prompt] Below are the extra information you use to answer the question, note that you
should not use your general knowledge and the answer is among this information. You will be
given the textual description of a directed graph.
[Textualized graph] Note: Bolded triples are those that appear in ToG’s retrieved subgraph.
(Lasagna with cheese and meat sauce, reduced fat and sodium (diet frozen meal) belongs to Pasta
mixed dishes, excludes macaroni and cheese), (Lasagna with cheese and meat sauce, reduced
fat and sodium (diet frozen meal) belongs to low_carb), (Lasagna with cheese and meat
sauce, reduced fat and sodium (diet frozen meal) belongs to low_sugar), (Lasagna with
cheese and meat sauce, reduced fat and sodium (diet frozen meal) belongs to high_sodium),
(Lasagna with cheese and meat sauce, reduced fat and sodium (diet frozen meal) belongs to
low_protein), (Lasagna with cheese and meat sauce, reduced fat and sodium (diet frozen
meal) belongs to low_cholesterol), (Lasagna with cheese and meat sauce, reduced fat and
sodium (diet frozen meal) belongs to low_saturated_fat), (user has Eats lots of shellfish),
(user has Eats lots of fish), (user has Eats little to no frozen food), (user has Eats few to
no meals outside home), (user has Eats few to no ready to eat meals), (user has Takes few
or no supplements), (user has Uses lots of salt in preparation), (user has Claims to have a
poor diet), (user has Ate more food than usual), (user has Ate less food than usual), (user has
Eats weight loss diet), (user has obesity), (user has opioid_misuse), (user has hypertension),
(user has diabetes), (user has Low fat/Low cholesterol diet), (obesity need low_calorie),
(opioid_misuse contradict low_protein), (opioid_misuse match low_sugar), (opioid_misuse con-
tradict high_sodium), (hypertension contradict high_sodium), (diabetes match low_sugar),
(diabetes match low_carb), (Low fat/Low cholesterol diet match low_cholesterol), (Low
fat/Low cholesterol diet match low_saturated_fat)
[Note prompt] Important Note: Your output must be separated by comma, a list of nutrients with
high or low before the nutrients among these options: carb, protein, sugar, sodium, cholesterol,
saturated_fat, calorie. For example, the output is: high_carb, low_protein, high_sugar. Provide
reasonings for your output.

True Answer low_carb, low_sugar, high_sodium, low_cholesterol, low_saturated_fat, low_protein

LLM Answer high_sodium, low_sugar, low_cholesterol, low_carb

LLM Reasoning - high_sodium: Although the meal is labeled as "reduced sodium", it still belongs to the
high_sodium category. This is a concern for the user who has hypertension, as high sodium
intake can exacerbate this condition. - low_sugar: The meal belongs to the low_sugar category,
which is good for the user who has opioid_misuse. - low_cholesterol: The user has Low fat/Low
cholesterol diet, which matches low_cholesterol of the food. - low_carb: The food is low in carb,
which is suitable for the user who has diabetes.

Observation The model is able to correctly identify 4 out of 6 important tags in this case. Although it is
missing half of the tags, one good thing to note is that ToG is able to pick out both the matches
(low_cholesterol, low_sugar, low_carb) and the contradictions (high_sodium).

Table 18: Case Study 7

26
Question Based on the nutrients the food provides and the user needs, please answer whether the food
"Cheese pastry puffs" is healthy for the user? Please answer with a short sentence explaining
why.

User Profile Eats lots of shellfish, Eats lots of fish, Light cigarette smoker, Drinks alcohol less than average,
Eats little to no fast food, Eats little to no frozen food, Often checks nutrition labels, Takes more
supplements, Uses little to no salt in preparation, Drinks little water, Eats weight-gain diet, Has
hypertension

Food Profile low_carb, low_sugar, high_sodium, high_calorie, low_protein, high_cholesterol,


high_saturated_fat

True Answer No, because the food is high in sodium.

KAPING - Answer: No, because the food is high in sodium, high in calorie, high in saturated fat, high in
cholesterol.
- Reasoning: Although the user often checks nutrition labels and eats little to no fast food or
frozen meals (factual hallucination - irrelevant information), their hypertension contradicts with
the high sodium content found in Cheese pastry puffs. Moreover, being a light smoker might
increase cardiovascular risks; consuming foods high in calorie, saturated fat, and cholesterol like
Cheese pastry puffs will not help. (contextual hallucination - unaware of user’s needs)

CoT-Zero - Answer: No, because the food is high in sodium, low in protein, high in cholesterol, high in
saturated fat, high in calorie.
- Reasoning: The presence of ’high_sodium’ category contradicts with ’hypertension’, as
excessive sodium can worsen hypertension. High saturated fat, calorie and cholesterol from
Cheese and Margarine are also concerns for heart health. Low protein likely won’t support muscle
maintenance or satisfaction between meals. (factual hallucination - irrelevant information)

ToG - Answer: No, because the food is high in sodium.


- Reasoning: Cheese pastry puffs is not healthy because it is high in sodium, and the user has
hypertension.

Table 19: Error Analysis 1

27
Question Based on the nutrients the food provides and the user needs, please answer whether the food
"Shepherd’s pie with lamb" is healthy for the user? Please answer with a short sentence
explaining why.

User Profile Drinks lots of milk, Eats little to no shellfish, Eats little to no fish, Drinks alcohol more than
average, Eats lots of fast food, Eats few to no ready-to-eat meals, Rarely or never checks nutrition
labels, Takes more supplements, Claims to have a poor diet, Has obesity, Has diabetes.

Food Profile low_carb, low_sugar, high_sodium, low_protein, low_cholesterol, low_saturated_fat

True Answer Yes, because the food is low in carb, low in sugar.

KAPING - Answer: No, because the food is high in sodium, low in protein.
- Reasoning: While Shepherd’s pie with lamb contains some nutritious ingredients like vegeta-
bles and lean meat, its high sodium content can be detrimental for someone with obesity and
diabetes, as it may contribute to hypertension and other health issues. Additionally, the low
protein content may not be sufficient for satiety and muscle maintenance. (contextual & factual
hallucinations - using external knowledge to make assumptions about user’s health & include
irrelevant information)

CoT-Zero - Answer: No, because the food is high in sodium, low in cholesterol, low in saturated fat.
- Reasoning: Given the user’s obesity and diabetes, a diet low in calories, sugar, and carbohy-
drates is essential. While Shepherd’s pie with lamb is low in sugar and carbs, its high sodium
content may pose a risk for those with diabetes and obesity, making it less suitable for the user’s
health needs. (contextual hallucination)

ToG - Answer: Yes, because the food is low in sugar, low in carb.
- Reasoning: Given that the user has diabetes, Shepherd’s pie with lamb is healthy because it is
low_sugar and low_carb.

Table 20: Error Analysis 2

Figure 9: FNDDS Category Code - Mixed Dishes.

28
Figure 10: Guidelines for use of nutrition and health claims.

29
Figure 11: Daily nutrition value from Codex Alimentarius.

30
Figure 12: Multum Lexicon Therapeutic Classification Scheme - Part of Level 3.

31

You might also like