Graph Benchmark
Graph Benchmark
Danish pastry
conditions remains a major challenge. Nutri- Task Settings with fruit User has
tion Question Answering (QA) has emerged contains
Diet
has
as a popular method for addressing this prob- High contradict Habit A
Data Source Sodium Hypertension
lem. However, current research faces two
a) Overview of NGQA Benchmark b) An Example of Standard Question
critical limitations. On the one hand, the ab-
sence of datasets involving user-specific med- Task Level: - ML
ical information severely limits personaliza- Question: Based on the information provided, please report the
nutrient tags to judge if food Danish pastry with fruit is healthy or
tion. This challenge is further compounded
unhealthy to the user.
by the wide variability in individual health
needs. On the other hand, while large lan- Answer: high calorie, high sodium.
guage models (LLMs), a popular solution for c) An Example of Answering the Question in Multi-label (ML) Task
this task, demonstrate strong reasoning abil-
ities, they struggle with the domain-specific Figure 1: An Overview of NGQA Benchmark (a) along
complexities of personalized healthy dietary with a data showcase: (b) an example of the knowledge
reasoning, and existing benchmarks fail to cap- graph used for a standard level question and (c) the
ture these challenges. To address these gaps, question and the answer of that question under the multi-
we introduce the Nutritional Graph Question label classification task (-ML) settings.
Answering (NGQA) benchmark, the first graph
question answering dataset designed for per- benefits of balanced nutrition, unhealthy eating
sonalized nutritional health reasoning. NGQA habits remain alarmingly prevalent in modern so-
leverages data from the National Health and Nu- ciety (WHO, 2021). In the United States alone,
trition Examination Survey (NHANES) and the approximately 42.4% of adults are classified as
Food and Nutrient Database for Dietary Studies obese (CDC, 2020a), and in 2017, poor dietary
(FNDDS) to evaluate whether a food is healthy
habits contributed to over 11 million deaths and a
for a specific user, supported by explanations
of the key contributing nutrients. The bench- substantial number of disability-adjusted life-years
mark incorporates three question complexity (DALYs), often linked to factors such as excessive
settings and evaluates reasoning across three sodium intake (Afshin et al., 2019; WHO, 2023).
downstream tasks. Extensive experiments with These statistics underscore an urgent need to pro-
LLM backbones and baseline models demon- mote healthier eating habits on a societal scale.
strate that the NGQA benchmark effectively However, nutritional health requires complex do-
challenges existing models. In sum, NGQA
main knowledge, and there is no one-size-fits-all
addresses a critical real-world problem while
advancing GraphQA research with a novel
solution for healthy diets, as the nutritional needs
domain-specific benchmark. Our codebase and of individuals can vary widely based on their health
dataset are available here. conditions. For example, a diet suitable for some-
one with a high body mass index (BMI) may differ
drastically from that of an individual with a low
1 Introduction
BMI. Likewise, while individuals recovering from
Diet is a cornerstone of human health, playing a opioid misuse may benefit from a high-protein diet,
pivotal role in both maintaining well-being and such dietary choices can be harmful to those manag-
preventing disease. Despite the well-documented ing chronic kidney disease (Mahboub et al., 2021).
1
Why this benchmark matters: Numerous ef- follows:
forts have sought to address the challenges in per-
sonalized nutritional health, with Nutrition Ques- • Novel Benchmark for Personalized Nutri-
tion Answering (QA) emerging as a popular task tion. We present NGQA, the first benchmark
(Min et al., 2022; Bondevik et al., 2024). Recent ad- to incorporate users’ medical information in a
vancements in large language models (LLMs) have nutritional question answering task, address-
demonstrated significant potential in this domain, ing a significant research gap in the domain
offering sophisticated reasoning capabilities to ana- of personalized healthy diet research.
lyze and interpret nutritional information (Mavro-
matis and Karypis, 2024). However, these efforts • Advancing the GraphQA Ecosystem.
remain constrained by two major limitations. First, NGQA introduces a domain-specific bench-
to the best of our knowledge, no existing bench- mark and extends GraphQA benchmarks
mark truly personalizes answers based on users’ beyond datasets like WebQSP and Expla-
specific health conditions, primarily due to the Graphs in the general domain. This addition
inaccessibility of individual medical data (Bölz broadens the scope of GraphQA research,
et al., 2023). This lack of user-specific datasets enabling a more comprehensive evaluation
has severely hindered the development of effective of GraphQA models’ capabilities beyond
solutions. Second, while LLMs exhibit impressive general reasoning tasks.
reasoning capabilities in general domains, the med-
• Comprehensive Resource and Evaluation.
ical and nutritional intricacies of this task impose
Through extensive experiments, NGQA pro-
severe limitations on their effectiveness (Mialon
vides a challenging benchmark, a complete
et al., 2023). Current benchmarks fail to capture
codebase supporting the full pipeline from
the domain-specific complexities of personalized
data preprocessing to model evaluation, and
health-aware dietary reasoning, making it difficult
an extensibility for integrating new mod-
to evaluate, let alone improve, these models in
els. This comprehensive resource helps ad-
meaningful ways.
vance research in both personalized nutritional
To address these critical gaps and advance the
health and the broader GraphQA field.
understanding of healthy diet personalization, we
propose the Nutritional Graph Question Answering
(NGQA) benchmark. This is the first benchmark 2 Related Work
in the personalized nutritional health domain to
evaluate whether a specific food is healthy for a Question Answering in Nutritional Health Do-
user, supported by detailed reasoning of the key main. Question answering has become an essential
contributing nutrients. By recognizing the intri- tool in the nutritional and health domain, offer-
cate interplay between a user’s medical conditions, ing a flexible framework for applications such as
dietary behaviors, and the nutrition of foods, we food recommendation (Min et al., 2022; Bonde-
frame this task as a knowledge graph question an- vik et al., 2024). Knowledge graphs (KGs) have
swering problem. Specifically, using data from the been widely used to model relationships between
National Health and Nutrition Examination Survey foods, ingredients, and health, supporting tasks
(NHANES) and the Food and Nutrient Database like ingredient substitution and adaptive dietary
for Dietary Studies (FNDDS), we construct the recommendations (Haussmann et al., 2019; Chen
NGQA benchmark and categorize questions into et al., 2021; Fatemi et al., 2023a; Xu et al., 2024).
three complexity settings: sparse, standard, and Recent approaches incorporate health metrics into
complex. Each question type is further evaluated QA systems, focusing on recipe recommendations
through three downstream tasks, binary classifi- and nutritional ontologies (Li et al., 2023; Senevi-
cation (-B), multi-label classification (-ML), and ratne et al., 2021). However, existing methods lack
text generation (-TG), to explore distinct reasoning true personalization, as highlighted by (Bölz et al.,
aspects (Figure-1 (a)). We conduct extensive exper- 2023), due to the absence of user-specific medical
iments using various LLM backbones and baseline data. Our work fills this gap by introducing the
models to ensure the benchmark is both appropri- first GraphQA benchmark for personalized nutri-
ately challenging and meaningful for advancing tional health, enabling models to provide tailored
the field. Our contributions can be summarized as nutritional reasoning and explanations.
2
Dietary 4 Conditions
Extract Habits Filter
9 Special Diets
Medical 5644 Users Tagging
Status Scheme Multi-step Annotation
User Filtering
Health Standards User Data Collection
Nutrition
Data Source Extract Tags Filter
3 Categories
Ingredients 849 Foods
NGQA Benchmark
Nutrition Standards Food Filtering
Food Data Collection
Figure 2: The NGQA benchmark construction process. Each stage shown in the figure is detailed in Section 3.For
example, "User Data Collection" block, is introduced in Section 3.1 under the paragraph titled User Data Collection.
Graph Retrieval Augmented Generation. and comprehensive food nutritional information,
Knowledge Graph Question Answering (KGQA) enabling a fine-grained analysis of how individ-
has progressed from early semantic parsing and ual health conditions interact with food nutrition.
retrieval-based methods to advanced techniques By representing these relationships through graph
leveraging large language models (LLMs) and structures, the benchmark supports answering com-
graph neural networks (GNNs) for reasoning and plex nutritional questions while capturing the intri-
retrieval (Jiang et al., 2023; Kim et al., 2023; cate interplay between users’ medical conditions
Gao et al., 2024). Building on this progress, and dietary choices. The following sections pro-
Graph-Retrieval Augmented Generation (Graph- vide a detailed discussion of these datasets and their
RAG) has emerged as a widely studied method, integration into our benchmark.
offering more precise, context- and structure-aware User Data Collection. The NHANES dataset
reasoning compared to traditional text-based forms the foundation of our work for collecting
RAG methods (Lewis et al., 2020; Lazaridou user data. We extract medical information, dietary
et al., 2022; Guo et al., 2024; Wen et al., 2023). habits, and food intake records to construct the
Despite the development of various LLM-powered graph. Specifically, NHANES provides laboratory
models, benchmarks for the Graph-RAG task reports detailing body metrics like Body Mass In-
remain scarce and lack standardization. Early dex (BMI) and blood pressure, along with biochem-
benchmarks focus primarily on general graph tasks ical markers such as blood urea nitrogen. It also
such as shortest paths and node degree (Fatemi includes questionnaire responses on prescription
et al., 2023b; Wang et al., 2024a), while (He drug usage, adherence to special diets, and over-
et al., 2024) introduces a GraphQA benchmark for all health status. Additionally, NHANES records
complex reasoning using general-purpose datasets. users’ food intake history and dietary behaviors,
Building on their framework, we develop the such as the frequency of adding salt at the table.
first domain-specific benchmark in the nutritional Our study incorporates 54 distinct dietary habits,
health domain, bridging the gap between general with detailed data processing methods provided in
GraphQA research and personalized health-aware Appendix-B. This comprehensive dataset serves as
reasoning. More detailed literature is available in the backbone of our graph, capturing user health
Appendix-A. conditions and dietary patterns with granular detail.
3 NGQA Benchmark Food Data Collection. Nutritional information
for food items is sourced from FNDDS. FNDDS
3.1 Data Collection connects NHANES food codes to detailed nutri-
Data Source. Using data from the National Health tional data cataloged in the What We Eat in Amer-
and Nutrition Examination Survey (NHANES) and ica (WWEIA) database. Using FNDDS, we asso-
the Food and Nutrient Database for Dietary Studies ciate each food item in NHANES with its full nu-
(FNDDS), we construct the first GraphQA bench- tritional composition. Additionally, FNDDS links
mark designed to address personalized healthy nu- food items to ingredient information and classifies
trition intake questions. This benchmark integrates them into broader food categories. For example, a
detailed user health profiles, dietary behaviors, food item like "apple" is linked to its nutrient values
3
(e.g., sugars, vitamins) and assigned to the category million food records. While this dataset offers
"fruits." These associations enrich the graph by pro- an invaluable resource for studying nutrition and
viding node-level data for food, ingredients, and health, it includes inconsistencies, ambiguities, and
categories. irrelevant entries. To establish a scientifically ro-
Tagging Scheme. To evaluate whether a food bust and meaningful benchmark, precise data anno-
is specifically healthy for a user based on their tation is essential. This involves not only cleaning
personal health conditions, we propose a tagging and filtering the data but also carefully defining
scheme that assigns nutrition-related tags to both and validating annotations to accurately capture
users and foods. This systematic framework aligns real-world relationships between health conditions,
food nutritional properties with user health needs, dietary behaviors, and food options. Our annota-
enabling robust assessments of food suitability. tion process refines both user and food datasets
For food tagging, we build upon established to ensure relevance, accuracy, and applicability to
guidelines and introduce newly applied standards. real-life scenarios.
Prior works have utilized recommendations from User Filtering. Annotating user data requires
the World Health Organization (WHO) and the careful consideration of the complex interactions
Food Standards Agency (FSA) (Wang et al., 2021), between nutrition and health. For instance, elevated
while we extend this by incorporating the more de- blood urea nitrogen (BUN) levels may indicate kid-
tailed EU Nutrition & Health Claims Regulation ney dysfunction, warranting a low-protein diet, but
(Commission, 2006) and the Codex Alimentarius could also result from insufficient water intake. To
Commission (CAC) (Alimentarius, 1985, 1997). maintain scientific rigor and practical relevance,
These standards define precise thresholds for nutri- we focus on annotating four prevalent health sta-
ent claims. For instance, the EU regulation permits tuses—obesity, hypertension, opioid misuse, and
labeling a food as "low sodium" only if it contains diabetes—that are directly influenced by dietary in-
no more than 0.12 g of sodium per 100 g (Commis- terventions. Additionally, we annotate nine special
sion, 2006). Foods meeting such criteria are tagged diets reported by users, reflecting health-related di-
with corresponding labels like "low_sodium" or etary practices. Further details on the definitions
"high_protein", reflecting their nutritional proper- and implications of these health statuses and diets
ties. are provided in the Appendix-B. To ensure consis-
On the user side, health tags are derived from tency and relevance, we exclude users under 18,
the NHANES dataset, which includes laboratory focusing solely on adult dietary patterns.
results and self-reported health information. For Food Filtering. For food annotation, we identify
example, users with high blood pressure, as defined practical entries in the FNDDS database that align
by American Heart Association (AHA) thresholds with real-world dietary reasoning. While FNDDS
or similar guidelines, are tagged with "hyperten- supports comprehensive nutritional analysis, it in-
sion," indicating that a low-sodium diet would be cludes many entries unsuitable for practical use,
beneficial (Grillo et al., 2019; Smyth et al., 2014). such as raw ingredients or standalone additives. To
By linking health and food tags, our scheme ef- address this, we restrict our focus to the "mixed
fectively represents personalized dietary needs and dishes" category, as it represents combined recipes
captures the interplay between medical conditions closest to real-life diets. Additionally, we include
and nutritional requirements. The detailed stan- other relevant categories, such as bakery products
dards and additional tags for other nutrients and and desserts (definitions of FNDDS categories are
health conditions are described in Appendix-B. By available in the Appendix-I). Finally, we apply a
integrating this methodology into our graph-based keyword-based deduplication method to remove
benchmark, we provide a framework for advanc- highly similar entries.
ing personalized dietary reasoning and evaluating Multi-step Annotation. Using the previously
models in this domain. defined standards and tagging schemes, our anno-
tation process systematically establishes "match"
3.2 Data Annotation or "contradict" relationships between user health
Real-world data is inherently messy and incom- conditions and food nutritional profiles. For exam-
plete, and the datasets we use are no exception. ple, the tag "high_calorie" contradicts the condition
Spanning from 2003 to 2020, NHANES provides "obesity", while "low_sodium" matches with "hy-
data for approximately 100,000 users and over 2 pertension". To ensure accuracy and reliability, we
4
Question Level: Sparse Question Level: Standard Question Level: Complex
Single link: one condition one tag Multiple links: all match or contradict Mixed links: match and contradict
a) Overview of the Different Question Levels in NGQA Benchmark
Standard Question: Based on the information provided, please judge if food Danish pastry with fruit
is healthy or unhealthy to the user and why? <Task Specific Prompts>
5
Question Level # Records Avg. # Nodes Avg. # Edges 4.3 Evaluation Metrics
Sparse 8,490 25.84 24.86 To evaluate model performance, we adopt task-
Standard 3,622 28.16 28.98
Complex 1,690 30.94 34.04
specific metrics tailored to each type. For classifi-
cation tasks, we use standard metrics like accuracy,
Table 1: Statistics of the Benchmark by Question Level. recall, precision, and F1 score for comprehensive
Question Level Avg. Node SNR Avg. Tag SNR performance assessment. Multi-label classification
tasks extend these metrics to their weighted ver-
Sparse 16.37 19.30
sions, accounting for the distribution of multiple la-
Standard 24.68 49.39
Complex 31.57 76.32 bels across samples. Text generation tasks are eval-
uated with widely used metrics such as ROUGE,
Table 2: Signal-to-Noise Ratio (SNR) by Question BLEU, and BERT scores, which collectively as-
Level. sess relevance and semantic similarity to reference
texts. The definition of ground truths is available in
answer (signal) against the total nodes or tags in the
Appendix-B. This multifaceted design supports di-
graph (noise). As shown in Table-2, sparse ques-
verse model architectures and evaluation strategies,
tions exhibit the lowest SNR, reflecting the limited
providing a robust foundation for advancing per-
resources available for these tasks. Conversely,
sonalized nutrition research. By bridging the gap
complex questions, despite containing conflicting
between controlled research environments and the
information, achieve the highest SNR, underscor-
complexities of real-world applications, our bench-
ing the rich contextual information necessary for
mark fosters innovation and opens new avenues for
accurate reasoning. More statistics of the bench-
addressing healthy dietary reasoning.
mark are available in Appendix-E.
5 Experiments
4.2 Task Setting
To enhance the generality and versatility of our 5.1 Experiment Settings
benchmark, we design three distinct downstream In this section, we conduct extensive experiments
task types, each centered on the same domain ques- to evaluate existing Graph-RAG models’ reasoning
tion but requiring different forms of output, as il- capability on the proposed benchmark. For base-
lustrated in Figure-3 (b). This diversity ensures the line models, we select the five most classical base-
benchmark accommodates a wide range of method- lines: KAPING (Baek et al., 2023), CoT-Zero (Ko-
ologies and research focuses while fostering inno- jima et al., 2022), CoT-BAG (Wang et al., 2024a),
vation in addressing personalized nutrition chal- ToG (Sun et al., 2024), and a naive plain Graph-
lenges. The tasks are defined as follows: RAG pipeline (implementation details in Appendix-
Binary Classification (-B): This task requires a C). For the main experiments, we choose GPT-
simple "yes" or "no" response, indicating whether 4o-mini as the LLM backbone, we also conduct
a specific food is suitable for a user based on additional experiments on a series of other clas-
their health profile. It emphasizes straightforward sic LLM backbones in Appendix-D. Note that we
decision-making, reflecting applications like auto- didn’t select the most advanced LLM backbones or
mated diet advisories or recommendation systems. the most sophisticated fine-tuned baselines because
Multi-Label Classification (-ML): In this task, we argue our contributions focus primarily on the
models must retrieve the nutritional tags associated proposed benchmark with the novel tasks for this
with a food and determine which match with or specific domain, and the experiment results along
contradict the user’s health conditions. By demand- with the hallucination analyses have demonstrated
ing richer output, this task evaluates the model’s our tasks are properly designed where the classic
ability to leverage graph information and identify baselines can be adequately challenged while main-
nuanced relationships. taining efficiency. In the following sections, we go
Text Generation (-TG): The output is a natural through the experiment results for each task.
language explanation of why a food is healthy or
unhealthy for a user. This task assesses a model’s 5.2 Binary Classification Task
capability for interpretable and user-friendly rea- Table-3 (a) presents the performance of baseline
soning, which is crucial for real-world applications models on the binary classification task, which eval-
such as personalized dietary assistant chatbots. uates the models’ ability to provide a decisive "yes"
6
a) Binary Classification (-B) b) Multi-label Classification (-ML) c) Text Generation (-TG)
Question Level Method
Accuracy Recall Precision F1 Accuracy Recall Precision F1 ROUGE-1 ROUGE-2 ROUGE-L BLEU BERT
Plain 0.5973 0.1634 1.0000 0.2810 0.1798 0.9943 0.2109 0.3442 0.5385 0.4775 0.5385 0.2838 0.9370
KAPING 0.5347 0.0541 0.7246 0.1006 0.1753 0.9915 0.2075 0.3394 0.5234 0.4600 0.5234 0.2674 0.9353
Sparse CoT-Zero 0.6604 0.2951 0.9983 0.4555 0.2032 0.9958 0.2435 0.3842 0.5463 0.4842 0.5462 0.2889 0.9388
CoT-BAG 0.6038 0.1769 1.0000 0.3006 0.2134 0.9966 0.2520 0.3945 0.5481 0.4886 0.5480 0.2930 0.9385
ToG 0.7729 0.5383 0.9817 0.6953 0.2439 0.9128 0.2986 0.4333 0.6254 0.5710 0.6251 0.3612 0.9465
Plain 0.5762 0.1989 1.0000 0.3317 0.4909 0.9980 0.4901 0.6528 0.7219 0.6321 0.6941 0.4840 0.9618
KAPING 0.5022 0.0637 0.9313 0.1192 0.4593 0.9956 0.4624 0.6272 0.7087 0.6237 0.6764 0.4617 0.9599
Standard CoT-Zero 0.6565 0.3507 1.0000 0.5193 0.5390 0.9967 0.5447 0.6963 0.7329 0.6443 0.7049 0.4939 0.9630
CoT-BAG 0.5900 0.2249 1.0000 0.3673 0.5599 0.9982 0.5611 0.7091 0.7333 0.6456 0.7032 0.4951 0.9630
ToG 0.8628 0.7411 0.9993 0.8511 0.6189 0.8843 0.6793 0.7464 0.8182 0.7632 0.7817 0.6112 0.9716
Plain 0.6598 0.0636 0.9750 0.1194 0.7185 0.9721 0.7374 0.8358 0.7356 0.6510 0.7001 0.4949 0.9599
KAPING 0.6574 0.0571 0.9722 0.1079 0.6883 0.9758 0.7129 0.8093 0.7394 0.6634 0.7016 0.4839 0.9602
Complex CoT-Zero 0.6627 0.0718 0.9778 0.1337 0.7453 0.9735 0.7679 0.8557 0.7478 0.6599 0.7103 0.5048 0.9615
CoT-BAG 0.6627 0.0701 1.0000 0.1311 0.7546 0.9631 0.7801 0.8587 0.7467 0.6622 0.7080 0.5049 0.9611
ToG 0.7473 0.3964 0.8100 0.5323 0.6153 0.6989 0.8119 0.7303 0.7729 0.6915 0.7366 0.5313 0.9639
Table 3: Experimental results based on five baseline methods on the three tasks with the three question levels using
the GPT-4o-mini. The best performance of each group is bolded.
Figure 4: Efficiency analysis of the five baseline meth- Figure 5: Retrieval quality of ToG vs. Plain across three
ods across three tasks. types of questions on recall, precision and F1.
or "no" response based on summarized reasoning. and text generation (TG) tasks. The ML task evalu-
The results reveal a notable conservatism in model ates models’ ability to retrieve nutrition tags asso-
behavior, as evidenced by the low recall scores. ciated with foods and user health conditions, while
This likely stems from the sensitive nature of med- the TG task tests their capacity to generate natural
ical questions, where LLMs try to avoid offering language explanations, offering a more compre-
simple "yes" answers without explanations unless hensive and realistic evaluation. The results reveal
their confidence is exceptionally high. Despite this similar patterns across tasks: while baselines are
challenge, the experiments yield two important in- competent at identifying nutrition tags from the
sights into how external domain knowledge can graph, the primary challenge lies in correctly iden-
support LLMs in this scenario. First, increasing tifying the relevant tags based on user health condi-
the number of links in the graph (e.g., from Sparse tions, as indicated by the overall high recall scores
to Standard questions) consistently improves re- in the ML task.
call across all baselines. This indicates that richer Both tasks are most challenging on sparse ques-
external knowledge provides LLMs with greater tion sets due to their low-resource nature. Con-
context and reassurance, enabling them to produce versely, models achieve the best performance on
more confident positive answers. Second, ToG complex question sets, which may appear coun-
significantly outperforms other baselines, show- terintuitive. However, as shown in Table-2, com-
ing performance gains unique to this task. We at- plex questions have a higher Signal-to-Noise Ratio
tribute this improvement to ToG’s effective pruning (SNR), providing models with a clearer signal that
mechanism, which removes irrelevant nodes and offsets their logical complexity. Additionally, the
increases the SNR. By reducing noise and focus- ToG model performs similarly on the standard and
ing on relevant information, ToG enhances LLMs’ complex question sets due to its pruning process,
ability to make confident and accurate decisions. which increases SNR by removing irrelevant nodes.
While effective, this process can also discard valu-
5.3 Multi-label and Text Generation Task able information, leading to lower performance on
Table-3 (b) and (c) present the performance of base- complex questions. This trade-off contrasts with
line models on the multi-label classification (ML) ToG’s success in binary classification task and high-
7
tritional health (Mialon et al., 2023). Figure-6 illus-
Based on the nutrients the food provides and the
user with obesity and opioid misuse, please answer trates an example where we evaluate whether the
whether the food "Taco, corn tortilla, beef, food "Taco, corn tortilla, beef, cheese" is a healthy
Question cheese" is healthy for the user and why?
option for a user who is obese and recovering from
Factual Hallucination – Lack of Domain Knowledge opioid misuse. Our analysis identifies two main
It depends on the user’s dietary needs; it may be types of hallucinations. The first is Factual Hal-
unhealthy because the food is high in carbohydrate. lucination, where the model produces incorrect or
Direct
Contextual Hallucination – Missing User’s Needs irrelevant information, often due to reliance on gen-
eral knowledge not explicitly included in the graph.
No, because the food is high in sodium and high in
cholesterol. It’s not good for the health. These errors are common when LLMs perform
KAPING
direct inference without external knowledge and
Correct – Focus on What User’s Conditions Require
occasionally occur when retrieved graphs contain
Yes, because the food is high in protein and low in noise. For example, the model incorrectly deemed
carbohydrates, an appropriate food in the context.
ToG the taco unsuitable, overlooking the fact that corn
tortillas are relatively low in carbohydrates.
Figure 6: A case study of error analysis. The second type is Contextual Hallucination,
lights the comprehensiveness of our benchmark, where the model fails to prioritize tags that directly
which challenges models across diverse scenarios relate to the user’s health profile, focusing instead
to uncover their strengths and weaknesses. on less relevant attributes. This issue is less pro-
nounced in ToG due to its ability to retrieve com-
5.4 Efficiency and Retrieval Quality
pact, focused subgraphs, unlike simpler methods
Beyond model performance, efficiency is a critical like KAPING and CoT-Zero, which lack effective
consideration in Graph-RAG systems. To evaluate pruning. In this case, the taco’s high sodium and
this, we conduct an efficiency analysis of baseline cholesterol overshadowed its alignment with the
models on our benchmark, as shown in Figure-4. user’s specific health needs for a low-carb, high-
As can be seen, the binary classification task ex- protein diet, leading to a less optimal assessment.
hibits the fastest runtime, as it requires the shortest In summary, these hallucinations highlight the
output. In contrast, the multi-label classification importance of our domain-specific benchmark in
and text generation tasks involve longer outputs, establishing a rigorous framework to evaluate and
leading to slower performance. Due to ToG’s re- improve LLMs, advancing both the nutritional
liance on multiple LLM calls during the retrieval health domain and Graph-RAG research while fos-
process, its runtime is significantly slower com- tering the development of more robust and general-
pared to other methods. Additionally, the quality izable models (More examples in Appendix-H).
of subgraph retrieval plays a crucial role in down-
stream reasoning. To assess this, we perform a 6 Conclusion
retrieval quality analysis using ToG as a case study,
In this work, we introduce the Nutritional Graph
comparing it against a plain Graph-RAG pipeline,
Question Answering (NGQA) benchmark, the first
as illustrated in Figure-5. As shown, the retrieval
dataset designed to address the critical challenges
scores of ToG align with its performance in the
of personalized nutritional health reasoning. By
main experiments, confirming our assumption that
leveraging user-specific medical data and framing
fluctuations in ToG’s performance are rooted in
the problem as a knowledge graph question answer-
its pruning process during the subgraph retrieval
ing task, NGQA bridges the gap between general-
phase.
purpose benchmarks and domain-specific applica-
5.5 Error Analysis tions. Our benchmark not only advances the scope
of GraphQA research by incorporating complex,
In this section, we analyze the types of hallucina- real-world nutritional scenarios but also provides
tions observed in our experiments using a specific a comprehensive resource for evaluating and im-
example and demonstrate the importance of exter- proving models in this domain. We believe NGQA
nal domain knowledge in mitigating these errors. lays the foundation for future research in person-
Traditional LLM-enhanced methods are well- alized diet and health-aware reasoning, fostering
known for their susceptibility to hallucination er- innovation in both nutritional health and GraphQA.
rors, particularly in domain-specific tasks like nu-
8
Limitation physical addresses—removed. Despite the absence
of PII, the dataset retains its utility for detailed
In this section, we discuss the limitations of this
analyses, allowing us to investigate the relationship
work and outline directions for future research.
between users’ medical data and health-aware food
First, the benchmark includes a limited number
recommendations as presented in this study. Ad-
of health conditions, though more are available.
ditionally, in practical applications, the generated
For example, osteoporosis suggests a high-calcium
recommendations and interpretations are treated as
diet, a renal diet indicates low protein intake, and
personal medical records, ensuring sustained pri-
high low-density lipoprotein (LDL) levels may call
vacy protection. By adhering to these principles,
for a low-cholesterol diet. As noted in the pa-
our research maintains the highest levels of ethical
per, we prioritized conditions most prevalent in
responsibility and data privacy.
the United States and most relevant to dietary inter-
ventions, but expanding to include additional con-
ditions could enhance coverage and utility. Second, References
while we focus on the interplay between dietary be- Ashkan Afshin, Patrick J Sur, Kairsten A Fay, Leslie
haviors and medical conditions, other factors, such Cornaby, Giannina Ferrara, Jason S Salama, and
as food insecurity, remain unexplored. NHANES Christopher J L Murray. 2019. Health effects of
offers extensive socioeconomic data, presenting op- dietary risks in 195 countries, 1990–2017: a system-
atic analysis for the global burden of disease study
portunities to extend the benchmark to account for
2017. The Lancet.
broader determinants of dietary decision-making.
Third, for simplicity, complex questions are re- FAO/WHO Codex Alimentarius. 1985. Guidelines on
duced to binary classification by counting "match" nutrition labelling. Accessed: 2024-07-12.
and "contradict" tags. However, real-life dietary FAO/WHO Codex Alimentarius. 1997. Guidelines for
decisions require nuanced trade-offs and reasoning use of nutrition and health claims. Accessed: 2024-
that go beyond this approach. More sophisticated 07-12.
evaluation methods could better reflect practical Jinheon Baek, Alham Fikri Aji, and Amir Saffari. 2023.
scenarios. Lastly, the benchmark could benefit Knowledge-augmented language model prompting
from additional tasks. For example, the existing for zero-shot knowledge graph question answering.
In ACL.
graphs support questions like, "What alternative
foods could meet a user’s dietary preferences and Felix Bölz, Diana Nurbakova, Sylvie Calabretto, Armin
medical needs?" Incorporating such tasks would Gerl, Lionel Brunie, and Harald Kosch. 2023. Hum-
broaden the benchmark’s scope and encourage fur- mus: A linked, healthiness-aware, user-centered and
argument-enabling recipe data set for recommenda-
ther innovation. Despite these limitations, this tion. In RecSys.
work establishes a robust baseline as a pioneering
effort in personalized nutrition reasoning. We defer Jon Nicolas Bondevik, Kwabena Ebo Bennin, Önder
Babur, and Carsten Ersch. 2024. A systematic review
these challenges to future work, envisioning the on food recommender systems. Expert Systems with
benchmark as a foundation for ongoing advance- Applications.
ments in this critical domain.
CDC. 2020a. Adult obesity facts.
Ethics and Privacy Statement CDC. 2020b. Americans Share Hopeful Sto-
ries of Recovery From Opioid Use Disorder.
Safeguarding privacy and adhering to ethical prin- https://www.cdc.gov/rxawareness/pdf/
ciples are paramount when working with sensi- articles/TA-T3D2-English_MatteArticle_
tive health-related data. The National Health and Release_508.pdf.
Nutrition Examination Survey (NHANES) serves Yu Chen, Ananya Subburathinam, Ching-Hua Chen,
as a benchmark in this regard, strictly complying and Mohammed J Zaki. 2021. Personalized food
with confidentiality protocols mandated by pub- recommendation as constrained question answering
lic legislation. These robust privacy measures over a large-scale food knowledge graph. In WSDM.
enable us to achieve our research goals while European Commission. 2006. Eu nutrition & health
remaining fully aligned with the survey’s estab- claims regulation legislation (ec) 1924/2006. Ac-
lished guidelines. Notably, the NHANES dataset cessed: 2024-07-12.
is anonymized, with personally identifiable infor- Carrie Dennett. 2021. Diet’s role in opioid recovery.
mation (PII)—such as social security numbers and Today’s Dietitian.
9
Bahare Fatemi, Quentin Duval, Rohit Girdhar, Michal Angeliki Lazaridou, Elena Gribovskaya, Wojciech
Drozdzal, and Adriana Romero-Soriano. 2023a. Stokowiec, and Nikolai Grigorev. 2022. Internet-
Learning to substitute ingredients in recipes. arXiv. augmented language models through few-shot
prompting for open-domain question answering.
Bahare Fatemi, Jonathan Halcrow, and Bryan Perozzi. arXiv.
2023b. Talk like a graph: Encoding graphs for large
language models. arXiv. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio
Petroni, Vladimir Karpukhin, Naman Goyal, Hein-
Yifu Gao, Linbo Qiao, Zhigang Kan, Zhihua Wen, rich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock-
Yongquan He, and Dongsheng Li. 2024. Two-stage täschel, et al. 2020. Retrieval-augmented generation
generative question answering on temporal knowl- for knowledge-intensive nlp tasks. NeuralIPS.
edge graph using large language models. arXiv.
Diya Li, Mohammed J Zaki, and Ching-hua Chen. 2023.
Mouzhi Ge, Francesco Ricci, and David Massimo. 2015. Health-guided recipe recommendation over knowl-
Health-aware food recommender system. In RecSys. edge graphs. Journal of Web Semantics.
Andrea Grillo, Lucia Salvi, Paolo Coruzzi, Paolo Salvi, Peiyu Li, Xiaobao Huang, Yijun Tian, and Nitesh V
and Gianfranco Parati. 2019. Sodium intake and Chawla. 2024. Cheffusion: Multimodal foundation
hypertension. Nutrients. model integrating recipe and food image generation.
Ja K Gu, Penelope Allison, Alexis Grimes Trotter, Lu- In CIKM.
enda E Charles, Claudia C Ma, Matthew Groenewold, Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun
Michael E Andrew, and Sara E Luckhaupt. 2022. Tian, and Meng Jiang. 2024a. Towards safer large
Prevalence of self-reported prescription opioid use language models through machine unlearning. arXiv.
and illicit drug use among us adults: Nhanes 2005–
2016. Journal of occupational and environmental Zheyuan Liu, Xiaoxin He, Yijun Tian, and Nitesh V
medicine. Chawla. 2024b. Can we soft prompt llms for graph
learning tasks? In WWW.
Tiezheng Guo, Qingwen Yang, Chen Wang, Yanyi Liu,
Pan Li, Jiawei Tang, Dapeng Li, and Yingyou Wen. Zheyuan Liu, Chunhui Zhang, Yijun Tian, Erchi
2024. Knowledgenavigator: Leveraging large lan- Zhang, Chao Huang, Yanfang Ye, and Chuxu Zhang.
guage models for enhanced reasoning over knowl- 2023. Fair graph representation learning via diverse
edge graph. Complex & Intelligent Systems. mixture-of-experts. In WWW.
Steven Haussmann, Oshani Seneviratne, Yu Chen, Nadine Mahboub, Rana Rizk, Mirey Karavetian, and
Yarden Ne’eman, James Codella, Ching-Hua Chen, Nanne de Vries. 2021. Nutritional status and eating
Deborah L McGuinness, and Mohammed J Zaki. habits of people who use drugs and/or are undergoing
2019. Foodkg: a semantics-driven knowledge graph treatment for recovery: a narrative review. Nutrition
for food recommendation. In The Semantic Web– reviews.
ISWC 2019: 18th International Semantic Web Confer-
ence, Auckland, New Zealand, October 26–30, 2019, Costas Mavromatis and George Karypis. 2024. Gnn-
Proceedings, Part II 18. rag: Graph neural retrieval for large language model
reasoning. arXiv.
Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh V Chawla,
Thomas Laurent, Yann LeCun, Xavier Bresson, and Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christo-
Bryan Hooi. 2024. G-retriever: Retrieval-augmented foros Nalmpantis, Ram Pasunuru, Roberta Raileanu,
generation for textual graph understanding and ques- Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu,
tion answering. arXiv. Asli Celikyilmaz, et al. 2023. Augmented language
models: a survey. arXiv.
Xiaobao Huang, Mihir Surve, Yuhan Liu, Tengfei Luo,
Olaf Wiest, Xiangliang Zhang, and Nitesh V Chawla. Weiqing Min, Chunlin Liu, Leyi Xu, and Shuqiang
2024. Application of large language models in chem- Jiang. 2022. Applications of knowledge graphs for
istry reaction data extraction and cleaning. In CIKM. food science and industry. Patterns.
Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, NIDA. 2024. Opioids. https://www.drugabuse.
Wayne Xin Zhao, and Ji-Rong Wen. 2023. Structgpt: gov/drug-topics/opioids.
A general framework for large language model to
reason over structured data. In EMNLP. Khary K Rigg and Gladys E Ibañez. 2010. Motiva-
tions for non-medical prescription drug use: A mixed
Jiho Kim, Yeonsu Kwon, Yohan Jo, and Edward Choi. methods analysis. Journal of Substance Abuse Treat-
2023. Kg-gpt: A general framework for reasoning ment.
on knowledge graphs using large language models.
In EMNLP. Andrew Rosenblum, Lisa A Marsch, Herman Joseph,
and Russell K Portenoy. 2008. Opioids and the treat-
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yu- ment of chronic pain: controversies, current status,
taka Matsuo, and Yusuke Iwasawa. 2022. Large lan- and future directions. Experimental and Clinical
guage models are zero-shot reasoners. NeuralIPS. Psychopharmacology.
10
Chris Sanchez and Zheyuan Zhang. 2022. The effects Zehong Wang, Zheyuan Zhang, Nitesh V Chawla,
of in-domain corpus size on pre-training bert. arXiv. Chuxu Zhang, and Yanfang Ye. 2024c. Gft: Graph
foundation model with transferable tree vocabulary.
Oshani Seneviratne, Jonathan Harris, Ching-Hua Chen, arXiv preprint arXiv:2411.06070.
and Deborah L McGuinness. 2021. Personal health
knowledge graph for clinically relevant diet recom- Zehong Wang, Zheyuan Zhang, Chuxu Zhang, and Yan-
mendations. arXiv. fang Ye. 2024d. Subgraph pooling: Tackling nega-
tive transfer on graphs. In IJCAI.
Sola S Shirai, Oshani Seneviratne, Minor E Gordon,
Ching-Hua Chen, and Deborah L McGuinness. 2021. Yilin Wen, Zifeng Wang, and Jimeng Sun. 2023.
Identifying ingredient substitutions using a knowl- Mindmap: Knowledge graph prompting sparks graph
edge graph of food. Frontiers in Artificial Intelli- of thoughts in large language models. arXiv.
gence.
WHO. 2021. Healthy diet.
Andrew Smyth, Martin J O’Donnell, Salim Yusuf,
Catherine M Clase, Koon K Teo, Michelle Cana- WHO. 2023. Obesity info page of the world health
van, Donal N Reddan, and Johannes FE Mann. 2014. organization.
Sodium intake and renal outcomes: a systematic re-
view. American journal of hypertension. Yuanbo Xu, Tian Li, Yongjian Yang, Weitong Chen, and
Lin Yue. 2024. An adaptive category-aware recom-
Haitian Sun, Tania Bedrax-Weiss, and William Cohen. mender based on dual knowledge graphs. Informa-
2019. Pullnet: Open domain question answering tion Processing & Management.
with iterative retrieval on knowledge bases and text.
In EMNLP-IJCNLP. Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut,
Percy Liang, and Jure Leskovec. 2021. Qa-gnn: Rea-
Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo soning with language models and knowledge graphs
Wang, Chen Lin, Yeyun Gong, Lionel Ni, Heung- for question answering. In NAACL.
Yeung Shum, and Jian Guo. 2024. Think-on-graph:
Wenbin Yue, Zidong Wang, Jieyu Zhang, and Xi-
Deep and responsible reasoning of large language
aohui Liu. 2021. An overview of recommenda-
model on knowledge graph. In ICLR.
tion techniques and their applications in healthcare.
Zhaoxuan Tan, Qingkai Zeng, Yijun Tian, Zheyuan Liu, IEEE/CAA Journal of Automatica Sinica.
Bing Yin, and Meng Jiang. 2024. Democratizing
Jing Zhang, Xiaokang Zhang, Jifan Yu, Jian Tang, Jie
large language models via personalized parameter-
Tang, Cuiping Li, and Hong Chen. 2022. Subgraph
efficient fine-tuning. arXiv.
retrieval enhanced model for multi-hop knowledge
Lauren J. Tanz, Amanda T. Dinwiddie, Christine L. base question answering. In ACL.
Mattson, Julie O’Donnell, and Nicole L. Davis. 2022.
Lingzi Zhang, Yinan Zhang, Xin Zhou, and Zhiqi Shen.
Drug overdose deaths among persons aged 10–19
2024a. Greenrec: A large-scale dataset for green
years - united states, july 2019-december 2021. Mor-
food recommendation. In WWW.
bidity and Mortality Weekly Report.
Zheyuan Zhang, Zehong Wang, Shifu Hou, Evan Hall,
Dhaval Taunk, Lakshya Khanna, Siri Venkata Pavan Ku-
Landon Bachman, Jasmine White, Vincent Galassi,
mar Kandru, Vasudeva Varma, Charu Sharma, and
Nitesh V Chawla, Chuxu Zhang, and Yanfang Ye.
Makarand Tapaswi. 2023. Grapeqa: Graph augmen-
2024b. Diet-odin: A novel framework for opioid
tation and pruning to enhance question-answering.
misuse detection with interpretable dietary patterns.
In WWW.
In KDD.
Heng Wang, Shangbin Feng, Tianxing He, Zhaoxuan
Zheyuan Zhang, Zehong Wang, Tianyi Ma,
Tan, Xiaochuang Han, and Yulia Tsvetkov. 2024a.
Varun Sameer Taneja, Sofia Nelson, Nhi Ha Lan
Can language models solve graph problems in natural
Le, Keerthiram Murugesan, Mingxuan Ju, Nitesh V
language? NeuralIPS.
Chawla, Chuxu Zhang, et al. 2024c. Mopi-hfrs:
Wenjie Wang, Ling-Yu Duan, Hao Jiang, Peiguang Jing, A multi-objective personalized health-aware
Xuemeng Song, and Liqiang Nie. 2021. Market2dish: food recommendation system with llm-enhanced
health-aware food recommendation. TOMM. interpretation. arXiv.
11
A Additional Related Work KGQA. However, most benchmarks in this field
are designed for general-purpose datasets and fail
A.1 Prior Works in Nutrition Personalization
to address domain-specific complexities, such as
With growing awareness of the importance of di- the challenges unique to nutritional health reason-
etary health, various studies have sought to incor- ing.
porate health metrics into applications such as food
recommendation systems. These approaches can A.3 Graph-Retrieval Augmented Generation
be grouped into three primary categories. First, Graph neural networks exhibit powerful potentials
some research emphasizes single indicators like in dealing with complicated structural data (Wang
calorie or fat content, as highlighted in works by et al., 2024d; Liu et al., 2023; Wang et al., 2024c)
Ge et al. (Ge et al., 2015) and Shirai et al. (Shirai and it can facilitate LLM to better understand real
et al., 2021; Li et al., 2024), though such metrics world tasks (Wang et al., 2024b; Huang et al., 2024;
often fail to represent the multifaceted nature of Liu et al., 2024b). Graph-Retrieval Augmented
a balanced diet. Second, simulated health data Generation (Graph-RAG) extends the Retrieval-
has been utilized, as demonstrated by Wang et al. Augmented Generation (RAG) framework (Lewis
(Wang et al., 2021), but these methods often di- et al., 2020) by enriching large language models
verge from real-world data distributions. Finally, with structured knowledge retrieval. While tradi-
recent studies have applied global health guidelines tional RAG retrieves unstructured text, Graph-RAG
to develop composite health scores, such as those leverages GNNs to retrieve structured subgraphs
by Bolz et al. (Bölz et al., 2023) and Zhang et encoded as triples, improving reasoning precision
al. (Zhang et al., 2024a). However, foods deemed and minimizing redundancy (Guo et al., 2024; Wen
healthy by general standards can still negatively et al., 2023; Lazaridou et al., 2022).
affect certain individuals (Yue et al., 2021), high- Existing Graph-RAG benchmarks primarily eval-
lighting the absence of a universal solution. The uate basic graph reasoning tasks, such as shortest
primary challenge remains the scarcity of accurate paths, node degree, and edge existence (Fatemi
user health data, a gap our benchmark uniquely et al., 2023b; Wang et al., 2024a). Although these
addresses. benchmarks provide insights into foundational rea-
soning, they lack domain specificity. Recent work
A.2 Knowledge Graph Question Answering
by He et al. (He et al., 2024) introduced bench-
Knowledge Graph Question Answering (KGQA) marks targeting advanced reasoning in general
has undergone significant advancements, evolving graph contexts, but domain-specific benchmarks
from early approaches such as semantic parsing and for applications such as nutrition remain underde-
retrieval-based methods. Initial models translated veloped. By adapting the principles of Graph-RAG,
natural language queries into structured formats our work introduces the first benchmark designed
like SPARQL for execution on knowledge graphs to tackle personalized health-aware reasoning, ad-
(Sun et al., 2019; Zhang et al., 2022). Many of dressing this critical gap in the literature.
these methods employed pre-trained models like
BERT for query encoding and used frameworks B Benchmark Details
such as GNNs or LSTMs for retrieving entities
and subgraphs (Yasunaga et al., 2021; Taunk et al., B.1 Data Source Description
2023). NHANES. National Health and Nutrition Exam-
More recent progress integrates large language ination Survey (NHANES) is a publicly avail-
models (LLMs) to improve both retrieval efficiency able dataset collected by the U.S. Centers for Dis-
and reasoning ability (Sanchez and Zhang, 2022; ease Control and Prevention (CDC) to assess the
Liu et al., 2024a; Tan et al., 2024). Approaches health and nutritional status of the U.S. population
like Jiang et al. (Jiang et al., 2023) and Wang et through interviews, physical examinations, and lab-
al. (Wang et al., 2023) utilize LLMs to transform oratory tests. Data is released every two years and
queries into formats such as SQL or SPARQL, en- encompasses five main categories: Demograph-
hancing retrieval accuracy. Others, such as Kim et ics, Dietary Data, Examination Data, Laboratory
al. (Kim et al., 2023) and Gao et al. (Gao et al., Data, and Questionnaire Data. These comprehen-
2024), focus on reasoning over retrieved subgraphs sive datasets provide a wealth of information on
or triples, tackling multi-hop reasoning tasks in health indicators, dietary behaviors, and medical
12
conditions. Nutrients Low Threshold High Threshold NRV
FNDDS and WWEIA. The Food and Nutrient Calories (kcal) 40 225 2000
Carbohydrates (g) 55 75 -
Database for Dietary Studies (FNDDS) is a com- Protein (g) 10 15 50
prehensive resource developed by the U.S. Depart- Saturated Fat (g) 1.5 5 20
Cholesterol (mg) 20 40 300
ment of Agriculture (USDA) to facilitate dietary Sugar (g) 5 22.5 -
intake analysis by providing detailed nutritional in- Dietary Fiber (g) 3 6 -
formation for foods and beverages consumed in the Sodium (mg) 120 200 2000
Potassium (mg) 0 525 3500
United States. It serves as the backbone for analyz- Phosphorus (mg) 0 105 700
ing dietary recall data collected through the What Iron (mg) 0 3.3 22
Calcium (mg) 0 150 1000
We Eat in America (WWEIA) program, which is a Folic Acid (µg) 0 60 400
component of NHANES. WWEIA captures dietary Vitamin C (mg) 0 15 100
Vitamin D (µg) 0 2.25 15
intake data through 24-hour dietary recall inter- Vitamin B12 (µg) 0 0.36 2.4
views, linking reported food and beverage items to
their corresponding nutrient profiles in FNDDS. To- Table 4: Nutrient Reference Values (NRV) and thresh-
gether, FNDDS and WWEIA enable researchers to olds (per 100g of food) used based on the nutritional
standards.
study dietary patterns, nutrient intake, and their re-
lationship to health outcomes, making them critical Health Indicator High Threshold Low Threshold
B.2 Dietary Habit Processing Details Blood Urea Nitrogen (mmol/L) 7.1 -
Low-Density Lipoprotein (mmol/L) 3.3 -
Red Blood Cell (million cells/uL) - 4
Dietary habit data was sourced from various Glucose (mmol/L) 7 -
NHANES tables, including the Diet Behavior and Glycohemoglobin (%) 6.5 -
Hemoglobin (g/dL) - 13.2 (11.6)
Consumer Behavior datasets, which capture user-
reported behaviors and preferences related to food Table 5: Health Indicators with Corresponding High
choices, preparation methods, and consumption and Low Thresholds. Parentheses indicate sex-specific:
patterns. Traditional processing approaches proved male (female) thresholds where applicable.
insufficient for the complexity and diversity of
these features. To address this, a thorough man- pects, including 7 for macro-nutrients (calories,
ual review was conducted by a team of four re- carbohydrates, protein, saturated fat, cholesterol,
searchers. Key features indicative of dietary habits, sugar, and dietary fiber) and 9 for micro-nutrients
such as awareness of healthy eating practices or (sodium, potassium, phosphorus, iron, calcium,
frequency of consuming processed or frozen foods, folic acid, and vitamin C, D, and B12) following
were identified and categorized. Users were then the tagging scheme introduced in (Zhang et al.,
grouped into high and low habit categories based on 2024c). A detailed table of thresholds can be seen
their responses, with the top 10% and bottom 10% in Table-4. As discussed in the paper, these thresh-
assigned corresponding habit tags. For instance, olds are derived from existing standards and leg-
users reporting the highest milk consumption were islation, from World Health Organization (WHO),
tagged with "drink lots of milk," while those with Food Standards Agency (FSA)m EU Nutrition &
minimal consumption were labeled as "drink little Health Claims Regulation (Commission, 2006) and
or no milk." This process generated 54 distinct di- the Codex Alimentarius Commission (CAC) (Al-
etary habit tags, which were incorporated as nodes imentarius, 1985, 1997). An even more detailed
in the graph. These habit nodes provide critical standards are listed in Appendix-I. Following the
insights into user behaviors, enabling a nuanced similar practice, we also extract the thresholds for
understanding of the relationship between dietary health conditions, as shown in Table-5, Since we
patterns and health outcomes. have the thresholds for both nutrition and health,
we demonstrate the full mapping relationship can
B.3 Full Mappings of Nutrition Tags be seen in Table-6. Note that the special diet data
In this section, we discuss the overall mapping can be retrieved from NHANES data, which di-
relationship between health indicators and nutri- rectly indicates a user needs certain nutrients.
tion. In total, we involve nutrition tags for 16 However, as we emphasize in the paper, the in-
different nutrients focusing on various health as- teractions between nutrition and health are com-
13
plex and multi-facet. To maintain scientific rigor category system that assigns a therapeutic classifi-
and practical relevance, we focus on annotating cation to each drug and each ingredient of the drug.
four prevalent health statues, of which diet has Category codes used to identify prescription opioid
been proved to be beneficial for intervention. Their use were: Level 1: 57 = central nervous system
mapping to nutrition tags can be seen in Table-7. agents; Level 2: 58 = Analgesics; Level 3: 60 =
The definition of these major health statues are narcotic analgesics, or 191 = narcotic analgesics
discussed in the next section. combinations (Detail in Appendix-I).
14
Nutrient Category Tag Name Source Health Indicators
High Calories Low BMI; Low waist circumference; Weight gain/Muscle building diet
Low Calories High BMI; High waist circumference; Weight loss diet
Low Carb Low carbohydrate diet; High BMI; High waist circumference
High Protein Opioid misuse; Weight gain/Muscle building diet; High protein diet
Macro-nutrients Low Protein High blood urea nitrogen; Renal/Kidney diet
Low Saturated Fat High low-density lipoprotein; Low fat/Low cholesterol diet
Low Sugar Opioid misuse; Diabetic Diet; Low sugar Diet
Low Cholesterol High low-density lipoprotein; Low fat/Low cholesterol diet
High Fiber High low-density lipoprotein; Opioid misuse; Diabetic Diet
Low Sodium High blood pressure; Renal/Kidney diet; Low salt diet
High Potassium High blood pressure
Low Phosphorus Renal/Kidney diet
High Iron Low red blood cell/Low hemoglobin
Micro-nutrients High Calcium Osteoporosis/brittle bones
High Folic Acid Low red blood cell count
High Vitamin C Low red blood cell/Low hemoglobin; Osteoporosis/brittle bones
High Vitamin D Osteoporosis/brittle bones
High Vitamin B12 Low red blood cell count
Table 6: Nutrient Categories, Tag Names, and Associated Source Health Indicators. Nutrient categories are organized
to consolidate related tags and their respective health indicators for clarity.
Health Indicator Associated Tags the first stage, "Let’s think step by step" is ap-
Obesity Low Calorie
Opioid Misuse High Protein; Low Sugar; Low Sodium
pended after the question to guide the model to-
Hypertension Low Sodium wards producing a reasoning path. In the second
Diabetes Low Sugar; Low Carb
Weight Loss/Low Calorie Diet Low Calorie stage, the reasoning path is fed to the model to
Low Fat/Low Cholesterol Diet Low Cholesterol; Low Saturated Fat
Low Salt/Low Sodium Diet Low Sodium
extract the final answer. However, our initial ex-
Sugar-Free/Low Sugar Diet Low Sugar periments showed that we can combine these two
Diabetic Diet Low Sugar; Low Carb
Weight Gain/Muscle Building Diet High Calorie; High Protein steps, by having both "Let’s think step by step" and
Low Carbohydrate Diet Low Carb
High Protein Diet High Protein
final output requirements in one prompt, while still
Renal/Kidney Diet Low Protein achieving the same performance. This allows us
to save computational and API resources, avoid-
Table 7: Health Indicators and Their Associated Nutri-
ing potential inconsistencies and information loss
tional Tags. Each indicator is linked to relevant tags
reflecting dietary requirements. that arise when feeding the reasoning output into a
second step. This is because with the one-step ap-
proach, the model can make a final decision based
KAPING answers questions based on a sub-
on both the original graph, and its own reasoning
graph composed of the entities mentioned in the
path, whereas in the second-step approach, the orig-
query and their neighboring nodes. Following
inal graph is not available to the model.
the methodology described in the original pa-
per, we first extract the entities present in the CoT-BAG is designed to improve the graph rea-
query—specifically the user and food—from the soning capabilities of LLMs by first encouraing the
provided knowledge graph. Then, we include their model to "build" an implicit graph representation
respective neighboring nodes to construct a sub- of the problem, and then using chain-of-thought
graph via retrieval. This subgraph is subsequently reasoning to solve it. For this approach, a single
transformed into triples and concatenated before prompt is sufficient to guide the model through
feeding into the LLMs. Note that in the original im- both the graph construction and reasoning, by com-
plementation, the authors also used top-k filtering bining both "Let’s construct a graph from the given
to prune the retrieval results. However, since we nodes and edges" and "Let’s think step by step to
don’t have any other entities in the question, this arrive at the final answer". Adapting CoT-BaG to
pruning based on embedding similarities with the our benchmark requires creating a textual descrip-
question doesn’t generate any reasonable results. tion of the graph triples, in the following format:
We skip this step in our implementation. "The graph contains an edge between node [source]
CoT-Zero is a two-stage prompting stategy. In and node [target] with attribute [relationship], an
15
a) Binary Classification (-B) b) Multi-label Classification (-ML) c) Text Generation (-TG)
Question Level Method
Accuracy Recall Precision F1 Accuracy Recall Precision F1 ROUGE-1 ROUGE-2 ROUGE-L BLEU BERT
Plain 0.6161 0.2413 0.8619 0.3770 0.2190 0.8958 0.2365 0.3666 0.5645 0.4999 0.5642 0.3092 0.9375
KAPING 0.5329 0.0732 0.6268 0.1310 0.1951 0.8885 0.2194 0.3468 0.5374 0.4678 0.5370 0.2759 0.9346
Sparse CoT-Zero 0.6049 0.2885 0.7255 0.4128 0.3633 0.7636 0.4265 0.5263 0.5593 0.5016 0.5589 0.3424 0.8871
CoT-BAG 0.6060 0.2875 0.7307 0.4126 0.4204 0.7430 0.4724 0.5589 0.5479 0.4888 0.5474 0.3325 0.8849
ToG 0.8483 0.6959 0.9844 0.8154 0.3227 0.9561 0.3168 0.4672 0.7216 0.6793 0.7215 0.4997 0.9582
Plain 0.5903 0.2584 0.8871 0.4002 0.5651 0.9224 0.5665 0.6932 0.7746 0.7074 0.7344 0.5513 0.9656
KAPING 0.4809 0.0480 0.6216 0.0891 0.4830 0.8954 0.5064 0.6391 0.7203 0.6368 0.6835 0.4748 0.9594
Standard CoT-Zero 0.6576 0.3528 1.0000 0.5216 0.5373 0.9963 0.5429 0.6948 0.7333 0.6446 0.7058 0.4940 0.9507
CoT-BAG 0.5872 0.2197 1.0000 0.3603 0.5585 0.9984 0.5599 0.7084 0.5479 0.4888 0.5474 0.3325 0.8849
ToG 0.8647 0.7443 1.0000 0.8534 0.8242 0.9238 0.8437 0.8745 0.8870 0.8292 0.8227 0.6959 0.9775
Plain 0.6249 0.0424 0.3562 0.0758 0.6790 0.8679 0.7695 0.8108 0.7608 0.6814 0.7136 0.5102 0.9604
KAPING 0.6302 0.0473 0.4143 0.0849 0.6549 0.8501 0.7522 0.7915 0.7446 0.6644 0.7032 0.4910 0.9587
Complex CoT-Zero 0.6639 0.0750 0.9787 0.1394 0.7466 0.9729 0.7693 0.8562 0.7474 0.6597 0.7107 0.5053 0.9475
CoT-BAG 0.6621 0.0685 1.0000 0.1282 0.7533 0.9628 0.7783 0.8577 0.7468 0.6620 0.7076 0.5051 0.9470
ToG 0.7219 0.2936 0.8295 0.4337 0.6871 0.7160 0.8952 0.7846 0.8177 0.7424 0.7651 0.5978 0.9692
Table 8: Experimental results based on five baseline methods on the three tasks with the three question levels using
the Llama-3.1-70B-instruct. The best performance of each group is bolded.
Table 9: Experimental results based on five baseline methods on the three tasks with the three question levels using
the GPT-3.5-turbo. The best performance of each group is bolded.
edge between..." to include in the input prompt, too early risks discarding paths that may be critical
alongside the question, and output requirements. for answering the query. Delaying pruning allows
ToG to collect more comprehensive information
ToG introduces a strategy that iteratively
before making pruning decisions. These modifi-
searches and prunes reasoning paths on a knowl-
cations ensure that ToG is better aligned with the
edge graph starting from entities mentioned in the
requirements and complexities of our benchmark,
query to identify suitable paths. However, the open-
enabling more effective performance evaluation.
source ToG codebase is implemented based on
Wikidata and Freebase databases, making it incom- D Additional Experiments
patible with private datasets. To evaluate ToG on
our benchmark, we reimplemented it following the To further demonstrate the performance of different
original methodology. Furthermore, we adapted LLM backbones on our benchmark, we conducted
ToG to better suit the characteristics of our bench- additional tests using Llama-3.1-70b-Instruct and
mark with the following adjustments: 1). Adjusting GPT-3.5-Turbo as backbones for various baselines.
the width parameter to 5: ToG’s original width pa- As shown in Table-8 and Table-9, the performance
rameter is set to 3, which retains three reasoning trends of Llama-3.1-70b-Instruct align closely with
paths during pruning. However, answering ques- those of GPT-4o-mini, although Llama-3.1-70b-
tions in our benchmark sometimes requires more Instruct generally yields better results. This is con-
than three reasoning paths. By setting the width sistent with its stronger reasoning capabilities.
parameter to 5, ToG preserves five reasoning paths Additionally, ToG exhibited a noticeable per-
at each pruning step and generates answers based formance degradation when GPT-3.5-Turbo was
on these paths. 2). Delaying pruning until the used as the backbone, particularly when addressing
second iteration: In ToG’s first iteration, the in- standard and complex questions. This decline is
formation gathered is often insufficient to evaluate primarily due to GPT-3.5-Turbo’s relatively weaker
the importance of each reasoning path. Pruning reasoning abilities, which often lead to the retrieval
16
Diet Type Obesity Hypertension Opioid Misuse Diabetes
Weight Loss/Low Calorie Diet 2,253 647 222 267
Low Fat/Low Cholesterol Diet 448 247 76 116
Low Salt/Low Sodium Diet 442 350 86 115
Sugar-Free/Low Sugar Diet 170 89 20 78
Diabetic Diet 692 432 126 647
Weight Gain/Muscle Building Diet 3 20 12 1
Low Carbohydrate Diet 244 69 25 57
High Protein Diet 47 12 9 8
Renal/Kidney Diet 25 24 13 7
Table 10: Adoption of Diet Types Across Health Conditions. Each entry represents the number of users with a
specific condition following a corresponding diet type.
17
Role Category Content
“Act as a nutritionist. Analyze if a given food is
System -
healthy to a user following further instructions.”
“Based on the nutrients the food provides and the user
Question needs, please answer whether the food ‘Fish curry with
rice’ is healthy for the user?”
“Below is the extra information you use to answer the
Default question, note that you should not use your general
Method knowledge and the answer is among this information.”
prompt
Customized “…”
18
G Case Study thresholds (Figure-10 and Figure-11) . Note that
since there are discrepancies in the regulation. We
We present 7 case studies across 3 Tasks (Bi- adopt a stricter measure and make it sure it fits
nary Classification, Multi-label Classification, Text NHANES data. The Vitamins and Minerals high
Generation), 3 Question Levels (Sparse, Standard, thresholds are calculated from the Daily Nutritional
Complex) and 5 Baselines (Plain, KAPING, CoT- Reference Value (NRV), where CAC defines if a
Zero, CoT-BaG, ToG). This section provides in- food (per 100g) contains over 15% of NRV, it can
sights into how the prompts are structured across claim itself a source of such nutrient. The Codex
different baselines and the reasoning path behind Alimentarius, or "Food Code" is a collection of
the LLM’s final answer, as detailed in Tables 12-18. standards, guidelines and codes of practice adopted
The case studies provide critical insights into the by the Codex Alimentarius Commission. The Com-
strengths and limitations of each baseline, while mission, also known as CAC, is the central part
emphasizing the challenges posed by personal- of the Joint FAO/WHO Food Standards Program
ized dietary reasoning, highlighting our bench- and was established by FAO and WHO to protect
mark’s role in advancing the development of ro- consumer health and promote fair practices in food
bust, domain-specific AI models for personalized trade. 3) The Multum Lexicon Therapeutic Classifi-
health-aware nutrition reasoning. cation Scheme6 , used to define opioid prescription
medicines and later mark opioid misuse (Figure-
H Addtional Error Analysis
12).
Our experiments showed that in the specific task of
health-aware nutrition reasoning, LLMs are prone
to two main types of errors: contextual hallucina-
tion and factual hallucination. To understand these
shortcomings, we perform an error analysis focus-
ing on the Text Quality Evaluation task, using 3
methods (KAPING, CoT-Zero, ToG) as a repre-
sentative setting. We prompt the models to also
include the reasonings behind their final answer,
which then go through a human review process,
revealing 2 types of reasoning failures: Contextual
Hallucination and Factual Hallucination. Note that
we do not check for KG topology errors, as our KG
generation process ensures there are no structural
problems in the knowledge base that would affect
the model’s information retrieval and processing
performance. Exemplary demonstrations of these
2 error types are shown in Table-19 and Table-20.
19
Configurations Task: Binary Classification
Question: Complex
Model: GPT-4o-mini
Method: Plain
Prompt [Question] Based on the nutrients the food provides and the user needs, please answer whether
the food "Black beans and brown rice" is healthy for the user? Please answer with yes or no.
[Method prompt] Below are the extra information you use to answer the question, note that you
should not use your general knowledge and the answer is among this information.
[Textualized graph] (Black beans and brown rice belongs to Bean, pea, legume dishes), (Black
beans and brown rice has Black beans, from dried, fat added), (Black beans and brown rice
has Rice, brown, cooked, no added fat), (Black beans and brown rice belongs to low_carb),
(Black beans and brown rice belongs to low_sugar), (Black beans and brown rice belongs to
high_sodium), (Black beans and brown rice belongs to low_protein), (Black beans and brown
rice belongs to low_cholesterol), (Black beans and brown rice belongs to low_saturated_fat),
(user has Eats lots of fish), (user has Adds little to no salt at table), (user has Eats little to no
fast food), (user has Eats few to no ready to eat meals), (user has Takes more supplements),
(user has Uses lots of salt in preparation), (user has Ate more food than usual), (user has Ate
less food than usual), (user has Eats gluten free diet), (user has Eats high protein diet), (user
has diabetes), (user has Low carbohydrate diet), (user has High protein diet), (diabetes match
low_sugar), (diabetes match low_carb), (Low carbohydrate diet match low_carb), (High protein
diet contradict low_protein)
[Note prompt] Important Note: Your output will strictly be Yes or No with no other words.
True Answer No
LLM Answer No
LLM Reasoning While black beans and brown rice provide some beneficial nutrients, they are low in protein,
which contradicts the user’s high protein diet requirement. Additionally, the dish is high
in sodium, which may not be suitable for someone managing diabetes and following a low
carbohydrate and low sugar diet. Therefore, it does not align well with the user’s nutritional
needs.
Observation Factual hallucination: Although the model is able to correctly determine that the food is unhealthy
to the user because it is low in protein compared to the user’s High protein diet requirement,
upon inspecting its reasonings, it is found that the model mistakenly deduces that high_sodium
food tag is unsuitable for a low carb and low sugar diet, which is not correct.
20
Configurations Task: Multi-label Classification
Question: Standard
Model: Llama-3.1-70b-instruct
Method: KAPING
Prompt [Question] Based on the nutrients the food provides and the user needs, please answer what
nutrient tags are used to determine whether the food "Nachos, cheese only" is healthy or
unhealthy for the user?
[Method prompt] Below are the extra information you use to answer the question, note that you
should not use your general knowledge and the answer is among this information.
[Textualized graph] (Nachos, cheese only belongs to Nachos), (Nachos, cheese only has Cheese
sauce), (Nachos, cheese only has Snacks, tortilla chips, plain, white corn, salted), (Nachos,
cheese only has Fast foods, nachos, with cheese), (Nachos, cheese only belongs to low_carb),
(Nachos, cheese only belongs to low_sugar), (Nachos, cheese only belongs to high_sodium),
(Nachos, cheese only belongs to high_calorie), (Nachos, cheese only belongs to low_protein),
(Nachos, cheese only belongs to low_cholesterol), (user has Drinks little or no milk), (user has
Eats little or no shellfish), (user has Adds little to no salt at table), (user has Drinks Alcohol more
than average), (user has Eats many ready to eat meals), (user has Drinks lots of water), (user has
Ate more food than usual), (user has Ate less food than usual), (user has Eats gluten free diet),
(user has obesity), (user has High protein diet), (obesity contradict high_calorie), (High protein
diet contradict low_protein)
[Note prompt] Important Note: Your output must be strictly, with no extra words, separated
by comma, a list of nutrients with high or low before the nutrients among these options: carb,
protein, sugar, sodium, cholesterol, saturated_fat, calorie. For example, the output is: high_carb,
low_protein, high_sugar. You should only include the nutrient tags that connect the food with
the user.
LLM Reasoning It is stated that "Nachos, cheese only belongs to high_calorie" and "high_sodium." Additionally,
it is mentioned that "Nachos, cheese only belongs to low_protein."
Given the user’s condition of obesity and High protein diet requirements, consuming high-calorie
food would contradict their health goal of obesity. Furthermore, a high-protein diet requirement
contradicts the fact that “Nachos, cheese only belongs to low_protein”. User’s habit of adding
little to no salt at table contradicts the fact that “Nachos, cheese only belong to high_sodium”.
Notes The model includes information about the user’s habit of adding little to no salt at the table and
concludes that high_sodium is a food tag that is important in determining the food’s healthiness,
which is irrelevant information in this case because it is supposed to focus on the user’s health
and diet only - Factual Hallucination.
21
Configurations Task: Text Generation
Question: Complex
Model: GPT-4o-mini
Method: CoT-Zero
Prompt [Question] Based on the nutrients the food provides and the user needs, please answer whether
the food "Turkey with gravy" is healthy for the user? Please answer with a short sentence
explaining why.
[Method prompt] Below are the extra information you use to answer the question, note that you
should not use your general knowledge and the answer is among this information. Let’s think
step by step to determine the healthiness of the food, by extracting the nutritional properties of
the food from the given graph, then comparing them to the nutrition requirements of the health
status, dietary need and habits of the user. A food is unhealthy only if it has certain properties
that are unsuitable to the user’s health and diet. Do not be too strict with your criteria, only
focus on a few main nutritional tags that strongly indicate its healthiness or unhealthiness to the
particular diet or health status the user has. Some nutritional tags might not be as important in
determining healthiness.
[Textualized graph] (Turkey with gravy belongs to Poultry mixed dishes), (Turkey with gravy
has Turkey, whole, meat only, cooked, roasted), (Turkey with gravy has Salt, table, iodized),
(Turkey with gravy has Gravy, chicken, canned or bottled, ready-to-serve), (Turkey with gravy
belongs to low_carb), (Turkey with gravy belongs to low_sugar), (Turkey with gravy belongs
to high_sodium), (Turkey with gravy belongs to high_protein), (Turkey with gravy belongs to
high_cholesterol), (Turkey with gravy belongs to low_saturated_fat), (user has Eats little or no
shellfish), (user has Drinks Alcohol less than average), (user has Eats little to no frozen food),
(user has Eats few to no meals outside home), (user has Eats few to no ready to eat meals),
(user has Takes few or no supplements), (user has Uses little to no salt in preparation), (user
has Ate more food than usual), (user has Ate less food than usual), (user has Eats weight loss
diet), (user has Eats low fat diet), (user has Eats high fiber diet), (user has opioid_misuse), (user
has diabetes), (user has Low fat/Low cholesterol diet), (user has Low salt/Low sodium diet),
(user has Diabetic diet), (opioid_misuse match high_protein), (opioid_misuse match low_sugar),
(opioid_misuse contradict high_sodium), (diabetes match low_sugar), (diabetes match low_carb),
(Low fat/Low cholesterol diet contradict high_cholesterol), (Low fat/Low cholesterol diet match
low_saturated_fat), (Low salt/Low sodium diet contradict high_sodium), (Diabetic diet match
low_sugar), (Diabetic diet match low_carb)
[Note prompt] Important Note: Your output must be a Yes or No followed by strictly a list of
nutrients with high or low as prefix among these options: carb, protein, sugar, sodium, cholesterol,
saturated fat, calorie. For example, the output is: Yes, because the food is high in carb, low in
protein, high in sugar.
True Answer Yes, because the food is low in carb, low in sugar, low in saturated_fat, high in protein.
LLM Answer No, because the food is high in sodium, high in cholesterol.
LLM Reasoning Turkey with gravy is high in sodium, which contradicts the user’s low salt/low sodium diet. It is
also high in cholesterol, which contradicts the user’s low fat/low cholesterol diet.
Notes The model fails to recognize there are far more good properties about this food item than bad
ones.
22
Configurations Task: Multi-label Classification
Question: Sparse
Model: GPT-3.5-turbo
Method: CoT-BaG
Prompt [Question] Based on the nutrients the food provides and the user needs, please answer whether
the food "Sesame chicken" is healthy for the user? Please answer with a short sentence
explaining why.
[Method prompt] Below are the extra information you use to answer the question, note that you
should not use your general knowledge and the answer is among this information. You will be
given the textual description of a directed graph. Let’s first construct a graph with the given nodes
and edges. Then determine the healthiness of the food by traversing the graph and determining
the nutritional properties of the food, then compare them to the health status, dietary need and
habits of the user. Do not be too strict with your criteria, only focus on a few main nutritional
tags that strongly indicate its healthiness or unhealthiness to the particular diet or health status
the user has. Some nutritional tags might not be as important in determining healthiness.
[Textualized graph] Here is the description of the graph: This is the list of edges: an edge between
node "Sesame chicken" and "Stir-fry and soy-based sauce mixtures" with attribute "belongs
to", an edge between node "Sesame chicken" and "Restaurant, Chinese, sesame chicken" with
attribute "has", an edge between node "Sesame chicken" and "low_carb" with attribute "belongs
to", an edge between node "Sesame chicken" and "high_sodium" with attribute "belongs to",
an edge between node "Sesame chicken" and "high_calorie" with attribute "belongs to", an
edge between node "Sesame chicken" and "high_cholesterol" with attribute "belongs to", an
edge between node "user" and "Eats little or no shellfish" with attribute "has", an edge between
node "user" and "Eats little or no fish" with attribute "has", an edge between node "user" and
"Adds little to no salt at table" with attribute "has", an edge between node "user" and "Heavy
cigarette smoker" with attribute "has", an edge between node "user" and "Drinks Alcohol more
than average" with attribute "has", an edge between node "user" and "Eats lots of frozen food"
with attribute "has", an edge between node "user" and "Eats few to no ready to eat meals" with
attribute "has", an edge between node "user" and "Takes few or no supplements" with attribute
"has", an edge between node "user" and "Drinks lots of water" with attribute "has", an edge
between node "user" and "Eats organic food less often" with attribute "has", an edge between
node "user" and "Claims to have a poor diet" with attribute "has", an edge between node "user"
and "Ate more food than usual" with attribute "has", an edge between node "user" and "obesity"
with attribute "has", an edge between node "user" and "hypertension" with attribute "has", an
edge between node "obesity" and "high_calorie" with attribute "contradict", an edge between
node "hypertension" and "high_sodium" with attribute "contradict".
[Note prompt] Important Note: Your output must be a Yes or No followed by strictly a list of
nutrients with high or low as prefix among these options: carb, protein, sugar, sodium, cholesterol,
saturated fat, calorie. For example, the output is: Yes, because the food is high in carb, low in
protein, high in sugar.
True Answer No, because the food is high in sodium, high in calorie.
LLM Answer No, because the food is high in sodium, high in calorie, high in cholesterol.
LLM Reasoning Sesame chicken typically contains a lot of sodium due to the soy-based sauce, high calories from
frying, and cholesterol from the meat and sauce. These nutritional properties may not align well
with the user’s dietary needs, especially if they have hypertension or obesity.
Notes Although the model is able to identify two important food tags that contribute to the food’s
unhealthiness, upon inspecting its reasonings, it is unclear whether the model deduces this answer
from the graph information, or from its general knowledge of the food’s ingredients.
23
Configurations Task: Text Generation
Question: Standard
Model: GPT-4o-mini
Method: ToG
Prompt [Question] Based on the nutrients the food provides and the user needs, please answer whether
the food "Vegetable rice soup, canned, prepared with water or ready-to-serve" is healthy for
the user? Please answer with a short sentence explaining why.
[Method prompt] Below are the extra information you use to answer the question, note that you
should not use your general knowledge and the answer is among this information. You will be
given the textual description of a directed graph.
[Textualized graph] Note: Bolded triples are those that appear in ToG’s retrieved subgraph.
(Vegetable rice soup, canned, prepared with water or ready-to-serve belongs to Soups), (Vegetable
rice soup, canned, prepared with water or ready-to-serve has Soup, minestrone, canned, chunky,
ready-to-serve), (Vegetable rice soup, canned, prepared with water or ready-to-serve belongs
to low_carb), (Vegetable rice soup, canned, prepared with water or ready-to-serve belongs to
low_sugar), (Vegetable rice soup, canned, prepared with water or ready-to-serve belongs to
high_sodium), (Vegetable rice soup, canned, prepared with water or ready-to-serve belongs to
low_protein), (Vegetable rice soup, canned, prepared with water or ready-to-serve belongs
to low_cholesterol), (Vegetable rice soup, canned, prepared with water or ready-to-serve
belongs to low_saturated_fat), (user has Drinks lots of milk), (user has Eats little or no
shellfish), (user has Eats little or no fish), (user has Adds little to no salt at table), (user has
Eats few to no meals outside home), (user has Eats few to no ready to eat meals), (user has
Takes more supplements), (user has Claims to have a good diet), (user has Ate more food than
usual), (user has Ate less food than usual), (user has Eats weight loss diet), (user has Eats high
protein diet), (user has Low fat/Low cholesterol diet), (user has Low carbohydrate diet),
(Low fat/Low cholesterol diet match low_cholesterol), (Low fat/Low cholesterol diet match
low_saturated_fat), (Low carbohydrate diet match low_carb)
[Note prompt] Important Note: Your output must be a Yes or No followed by strictly a list of
nutrients with high or low as prefix among these options: carb, protein, sugar, sodium, cholesterol,
saturated fat, calorie. For example, the output is: Yes, because the food is high in carb, low in
protein, high in sugar.
True Answer Yes, because the food is low in carb, low in cholesterol, low in saturated_fat.
LLM Answer Yes, because the food is low in carb, low in cholesterol, low in saturated fat.
LLM Reasoning The user has a Low fat/Low cholesterol diet and a Low carbohydrate diet. The given food,
Vegetable rice soup, matches the user’s dietary requirements as it belongs to low_cholesterol,
low_saturated_fat, and low_carb categories. Therefore, the food is suitable for the user’s needs.
Observation ToG’s retrieved subgraph contains all the necessary triples, no more no less, to determine that
the food is healthy to the user’s diet requirements.
24
Configurations Task: Text Generation
Question: Complex
Model: Llama-3.1-70B-instruct
Method: ToG
Prompt [Question] Based on the nutrients the food provides and the user needs, please answer whether
the food "Pie, custard, individual size or tart" is healthy for the user? Please answer with a
short sentence explaining why.
[Method prompt] Below are the extra information you use to answer the question, note that you
should not use your general knowledge and the answer is among this information. You will be
given the textual description of a directed graph.
[Textualized graph] Note: Bolded triples are those that appear in ToG’s retrieved subgraph.
(Pie, custard, individual size or tart belongs to Cakes and pies), (Pie, custard, individual size or
tart has Sugars, granulated), (Pie, custard, individual size or tart has Vanilla extract, imitation,
no alcohol), (Pie, custard, individual size or tart has Cornstarch), (Pie, custard, individual size
or tart has Egg, whole, raw, fresh), (Pie, custard, individual size or tart has Wheat flour, white,
all-purpose, enriched, bleached), (Pie, custard, individual size or tart has Shortening, vegetable,
household, composite), (Pie, custard, individual size or tart has Salt, table, iodized), (Pie, custard,
individual size or tart has Milk, nonfat, fluid, without added vitamin A and vitamin D (fat free or
skim)), (Pie, custard, individual size or tart has Beverages, water, tap, municipal), (Pie, custard,
individual size or tart belongs to low_carb), (Pie, custard, individual size or tart belongs to
high_sodium), (Pie, custard, individual size or tart belongs to low_protein), (Pie, custard,
individual size or tart belongs to high_cholesterol), (Pie, custard, individual size or tart
belongs to low_saturated_fat), (user has Drinks lots of milk), (user has Uses tobacco often),
(user has Drinks Alcohol more than average), (user has Rarely or never check nutrition labels),
(user has Takes few or no supplements), (user has Uses lots of salt in preparation), (user has
Claims to have a poor diet), (user has Ate more food than usual), (user has Ate less food than
usual), (user has Eats weight loss diet), (user has obesity), (user has opioid_misuse), (user
has hypertension), (user has diabetes), (user has Low fat/Low cholesterol diet), (obesity
need low_calorie), (opioid_misuse contradict low_protein), (opioid_misuse need low_sugar),
(opioid_misuse contradict high_sodium), (hypertension contradict high_sodium), (diabetes
need low_sugar), (diabetes match low_carb), (Low fat/Low cholesterol diet contradict
high_cholesterol), (Low fat/Low cholesterol diet match low_saturated_fat)
[Note prompt] Important Note: Your output must be a Yes or No followed by strictly a list of
nutrients with high or low as prefix among these options: carb, protein, sugar, sodium, cholesterol,
saturated fat, calorie. For example, the output is: Yes, because the food is high in carb, low in
protein, high in sugar.
True Answer No, because the food is high in sodium, high in cholesterol, low in protein.
LLM Answer No, because the food is low in carb, low in protein, high in cholesterol.
LLM Reasoning The user has diabetes, and a low-carb diet is suitable for diabetes management. Therefore, the
food being low in carb is a positive aspect. However, the user also has opioid misuse, which
contradicts with the food being low in protein. Protein is essential for overall health and recovery.
Additionally, the user follows a Low fat/Low cholesterol diet. The food being high in cholesterol
contradicts this requirement.
Observation ToG’s retrieved subgraph doesn’t contain the important triple “(hypertension contradict
high_sodium)”, hence it fails to include this information in the reasoning process. Instead,
it incorrectly includes the food tag low_carb into the final answer, despite successfully identify-
ing that this is a positive aspect of the food.
25
Configurations Task: Multi-label Classification
Question: Complex
Model: Llama-3.1-70B-instruct
Method: ToG
Prompt [Question] Based on the nutrients the food provides and the user needs, please answer what
nutrient tags are used to determine whether the food "Lasagna with cheese and meat sauce,
reduced fat and sodium (diet frozen meal)" is healthy or unhealthy for the user?
[Method prompt] Below are the extra information you use to answer the question, note that you
should not use your general knowledge and the answer is among this information. You will be
given the textual description of a directed graph.
[Textualized graph] Note: Bolded triples are those that appear in ToG’s retrieved subgraph.
(Lasagna with cheese and meat sauce, reduced fat and sodium (diet frozen meal) belongs to Pasta
mixed dishes, excludes macaroni and cheese), (Lasagna with cheese and meat sauce, reduced
fat and sodium (diet frozen meal) belongs to low_carb), (Lasagna with cheese and meat
sauce, reduced fat and sodium (diet frozen meal) belongs to low_sugar), (Lasagna with
cheese and meat sauce, reduced fat and sodium (diet frozen meal) belongs to high_sodium),
(Lasagna with cheese and meat sauce, reduced fat and sodium (diet frozen meal) belongs to
low_protein), (Lasagna with cheese and meat sauce, reduced fat and sodium (diet frozen
meal) belongs to low_cholesterol), (Lasagna with cheese and meat sauce, reduced fat and
sodium (diet frozen meal) belongs to low_saturated_fat), (user has Eats lots of shellfish),
(user has Eats lots of fish), (user has Eats little to no frozen food), (user has Eats few to
no meals outside home), (user has Eats few to no ready to eat meals), (user has Takes few
or no supplements), (user has Uses lots of salt in preparation), (user has Claims to have a
poor diet), (user has Ate more food than usual), (user has Ate less food than usual), (user has
Eats weight loss diet), (user has obesity), (user has opioid_misuse), (user has hypertension),
(user has diabetes), (user has Low fat/Low cholesterol diet), (obesity need low_calorie),
(opioid_misuse contradict low_protein), (opioid_misuse match low_sugar), (opioid_misuse con-
tradict high_sodium), (hypertension contradict high_sodium), (diabetes match low_sugar),
(diabetes match low_carb), (Low fat/Low cholesterol diet match low_cholesterol), (Low
fat/Low cholesterol diet match low_saturated_fat)
[Note prompt] Important Note: Your output must be separated by comma, a list of nutrients with
high or low before the nutrients among these options: carb, protein, sugar, sodium, cholesterol,
saturated_fat, calorie. For example, the output is: high_carb, low_protein, high_sugar. Provide
reasonings for your output.
LLM Reasoning - high_sodium: Although the meal is labeled as "reduced sodium", it still belongs to the
high_sodium category. This is a concern for the user who has hypertension, as high sodium
intake can exacerbate this condition. - low_sugar: The meal belongs to the low_sugar category,
which is good for the user who has opioid_misuse. - low_cholesterol: The user has Low fat/Low
cholesterol diet, which matches low_cholesterol of the food. - low_carb: The food is low in carb,
which is suitable for the user who has diabetes.
Observation The model is able to correctly identify 4 out of 6 important tags in this case. Although it is
missing half of the tags, one good thing to note is that ToG is able to pick out both the matches
(low_cholesterol, low_sugar, low_carb) and the contradictions (high_sodium).
26
Question Based on the nutrients the food provides and the user needs, please answer whether the food
"Cheese pastry puffs" is healthy for the user? Please answer with a short sentence explaining
why.
User Profile Eats lots of shellfish, Eats lots of fish, Light cigarette smoker, Drinks alcohol less than average,
Eats little to no fast food, Eats little to no frozen food, Often checks nutrition labels, Takes more
supplements, Uses little to no salt in preparation, Drinks little water, Eats weight-gain diet, Has
hypertension
KAPING - Answer: No, because the food is high in sodium, high in calorie, high in saturated fat, high in
cholesterol.
- Reasoning: Although the user often checks nutrition labels and eats little to no fast food or
frozen meals (factual hallucination - irrelevant information), their hypertension contradicts with
the high sodium content found in Cheese pastry puffs. Moreover, being a light smoker might
increase cardiovascular risks; consuming foods high in calorie, saturated fat, and cholesterol like
Cheese pastry puffs will not help. (contextual hallucination - unaware of user’s needs)
CoT-Zero - Answer: No, because the food is high in sodium, low in protein, high in cholesterol, high in
saturated fat, high in calorie.
- Reasoning: The presence of ’high_sodium’ category contradicts with ’hypertension’, as
excessive sodium can worsen hypertension. High saturated fat, calorie and cholesterol from
Cheese and Margarine are also concerns for heart health. Low protein likely won’t support muscle
maintenance or satisfaction between meals. (factual hallucination - irrelevant information)
27
Question Based on the nutrients the food provides and the user needs, please answer whether the food
"Shepherd’s pie with lamb" is healthy for the user? Please answer with a short sentence
explaining why.
User Profile Drinks lots of milk, Eats little to no shellfish, Eats little to no fish, Drinks alcohol more than
average, Eats lots of fast food, Eats few to no ready-to-eat meals, Rarely or never checks nutrition
labels, Takes more supplements, Claims to have a poor diet, Has obesity, Has diabetes.
True Answer Yes, because the food is low in carb, low in sugar.
KAPING - Answer: No, because the food is high in sodium, low in protein.
- Reasoning: While Shepherd’s pie with lamb contains some nutritious ingredients like vegeta-
bles and lean meat, its high sodium content can be detrimental for someone with obesity and
diabetes, as it may contribute to hypertension and other health issues. Additionally, the low
protein content may not be sufficient for satiety and muscle maintenance. (contextual & factual
hallucinations - using external knowledge to make assumptions about user’s health & include
irrelevant information)
CoT-Zero - Answer: No, because the food is high in sodium, low in cholesterol, low in saturated fat.
- Reasoning: Given the user’s obesity and diabetes, a diet low in calories, sugar, and carbohy-
drates is essential. While Shepherd’s pie with lamb is low in sugar and carbs, its high sodium
content may pose a risk for those with diabetes and obesity, making it less suitable for the user’s
health needs. (contextual hallucination)
ToG - Answer: Yes, because the food is low in sugar, low in carb.
- Reasoning: Given that the user has diabetes, Shepherd’s pie with lamb is healthy because it is
low_sugar and low_carb.
28
Figure 10: Guidelines for use of nutrition and health claims.
29
Figure 11: Daily nutrition value from Codex Alimentarius.
30
Figure 12: Multum Lexicon Therapeutic Classification Scheme - Part of Level 3.
31