Understanding Natural Language Understanding
Erik Cambria
College of Computing and Data Science
Nanyang Technological University
Singapore, Singapore
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2025
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
Cover illustration: created by the author using Adobe Firefly
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
About half a century ago, artificial intelligence (AI) pioneers like Marvin Minsky
embarked on the ambitious project of emulating how the human mind encodes and
decodes meaning. While today we have a better understanding of the brain thanks to
neuroscience, we are still far from unlocking the secrets of the mind. Especially when
it comes to language, the prime example of human intelligence, we face enormous
difficulties in replicating how the human mind processes it. “Understanding natural
language understanding”, i.e., understanding how the mind encodes and decodes
meaning through language, is a significant milestone in our journey towards creating
machines that genuinely comprehend human language.
Large language models (LLMs), such as GPT-4, have astounded us with their
ability to generate coherent, contextually relevant text, seemingly bridging the gap
between human and machine communication. Yet, despite their impressive capabil-
ities, these models operate on statistical patterns rather than true comprehension.
This textbook delves into the nuanced differences between these two paradigms and
explores the future of AI as we strive to achieve true natural language understand-
ing (NLU). LLMs excel at identifying and replicating patterns within vast datasets,
producing responses that appear intelligent and meaningful. They can generate text
that mimics human writing styles, provide summaries of complex documents, and
even engage in extended dialogues with users. However, their limitations become
evident when they encounter tasks that require deeper understanding, reasoning, and
contextual knowledge. LLMs can produce plausible-sounding but incorrect answers,
struggle with ambiguous queries, and often lack the ability to generalize knowledge
across different domains effectively.
In contrast, an NLU system that deconstructs meaning by leveraging linguistics and
semiotics (on top of statistical analysis) represents a more profound level of language
comprehension. It involves understanding context in a manner similar to human
cognition, discerning subtle meanings, implications, and nuances that current LLMs
might miss or misinterpret. NLU grasps the semantics behind words and sentences,
comprehending synonyms, metaphors, idioms, and abstract concepts with precision.
This deeper comprehension allows for consistent accuracy and robustness, enabling
AI systems to handle ambiguous, incomplete, or novel queries more effectively.
This work would never have been possible without the help of my wonderful research
group, the Sentic Team ([Link]), who have helped me
translate my silly ideas into concrete research works over the last ten years. Special
thanks go to my awesome postdocs Drs. Rui Mao, Qian Liu, and Xulang Zhang, who
were instrumental in organizing and refining the materials of this textbook. Last but
not least, I thank my beautiful wife, Jocelyn Choong, for often forcing me to focus
on finishing this book despite the many distractions life has to offer.
Contents
2 Syntactics Processing
2.1 Introduction
2.2 Microtext Normalization
2.2.1 Linguistic Approach
2.2.2 Statistical Approach
2.2.3 Neural Network Approach
2.2.4 Summary
6 Conclusion
6.1 Learning Resources
6.1.1 Assignment
6.1.2 Quiz
References
Chapter 1
Natural Language Understanding & AI
Abstract In this chapter, we delve into the critical role that natural language un-
derstanding (NLU) plays in shaping the future of artificial intelligence (AI). To set
the stage, we begin by defining what constitutes an NLU system. Next, we explore
how NLU can drive the evolution of next-generation AI systems, which promise to
be more reliable, responsible, and personalized. To this end, we introduce the seven
pillars for the future of AI, which represent the foundational elements necessary
to advance AI technology in a way that is more transparent and reliable. Next, we
propose the concept of responsible recommender systems, which incorporate ethi-
cal guidelines and user-centric principles to ensure recommendations are not only
relevant but also fair, unbiased, and respectful of user privacy. Lastly, we present a
framework for personalized sentiment analysis, which aims at making AI systems
more responsive and attuned to the needs and emotions of each user.
Key words: Natural Language Understanding, Reliable AI, Responsible AI, Per-
sonalized AI
1.1 Introduction
NLU can better recognize and avoid biases, leading to fairer and more ethical
responses. LLMs can inadvertently reinforce biases present in their training data
because they lack the deeper understanding necessary to critically evaluate and
mitigate such biases. For example, an LLM trained on biased data might perpetuate
stereotypes, whereas NLU would recognize and avoid such biases. This capability is
essential in applications such as hiring processes, where unbiased decision-making is
crucial for fairness. Moreover, NLU understands the potential impact of its responses,
avoiding harmful or inappropriate content more reliably. It comprehends the ethical
implications and societal norms guiding human interactions, ensuring safer and
more responsible AI behavior. For instance, an NLU system would avoid making
insensitive comments about sensitive topics, understanding the context and potential
repercussions. This understanding helps in creating AI systems that can be trusted
in sensitive applications like mental health support and education.
Achieving true NLU involves advanced knowledge representation, such as incor-
porating symbolic reasoning and structured knowledge bases. This includes ontolo-
gies, semantic networks, and rule-based systems that explicitly encode relationships
and rules. For example, an NLU system could use an ontology to understand the
relationship between different medical conditions and treatments. This structured
approach allows the system to make logical inferences and provide reasoned answers
based on a deep understanding of the subject matter. Combining symbolic AI with
machine learning creates hybrid systems that leverage both structured knowledge
and the pattern recognition strengths of LLMs. Neurosymbolic integration, which
merges neural networks with symbolic reasoning systems, helps in understanding
and generating more accurate and contextually appropriate responses. For instance, a
neurosymbolic system might use neural networks to process natural language input
and symbolic reasoning to deduce the appropriate response based on an internal
knowledge base. This hybrid approach allows for more sophisticated and reliable AI
systems that can handle complex queries and tasks. NLU also requires embedding
real-world knowledge and commonsense reasoning into AI systems. This involves
training on diverse data sources and integrating world models that simulate real-world
scenarios. Advanced dialogue systems can maintain context over long conversations,
understand intents and sentiments, and manage turn-taking effectively. For example,
a customer service chatbot with NLU would handle a multi-step customer query
seamlessly, maintaining context and providing accurate solutions throughout the
interaction. This capability is essential for creating AI systems that can engage in
meaningful and productive conversations with users. Techniques like meta-learning
and analogical reasoning enable systems to adapt quickly to new information and
contexts, transferring knowledge from known situations to new, similar ones. This
continuous learning and adaptation make AI systems more resilient and effective in
dynamic environments.
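As a toy illustration of the structured-knowledge idea above, the following sketch encodes a handful of hypothetical condition–treatment relations and performs simple logical lookups over them; the entities, relations, and storage format are all invented for illustration, not drawn from any real medical ontology.

```python
# A minimal, hypothetical ontology: (head, relation) -> tail triples.
ontology = {
    ("type_2_diabetes", "is_a"): "metabolic_disorder",
    ("hypertension", "is_a"): "cardiovascular_condition",
    ("metformin", "treats"): "type_2_diabetes",
    ("lisinopril", "treats"): "hypertension",
}

def treatments_for(condition):
    """Symbolic lookup: which entities are asserted to treat a condition?"""
    return [head for (head, rel), tail in ontology.items()
            if rel == "treats" and tail == condition]

def category_of(entity):
    """Follow the is_a relation to infer the broader category, if any."""
    return ontology.get((entity, "is_a"))

print(treatments_for("type_2_diabetes"))  # ['metformin']
print(category_of("hypertension"))        # cardiovascular_condition
```

Because the relations are explicit, every answer can be traced back to the triples that produced it, which is exactly the kind of transparency that purely statistical models lack.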
In summary, NLU systems go beyond the mere statistical analysis of language
and, hence, have the potential to be the enablers of next-generation AI systems that
are reliable, responsible and personalized. We discuss this in more detail in the next
three sections.
1.2 Towards More Reliable AI
In 2022, the world was stunned by ChatGPT, a chatbot that relies on an LLM built by
means of generative pretrained transformers (GPT). The performance capabilities
of GPT-based LLMs enable chatbots to generate detailed, original, and plausible
responses to prompts. GPT-4 and other LLMs are pretrained on a large dataset (self-
supervised and at scale), before being adapted for a variety of downstream tasks
through fine-tuning. Pretraining is time-intensive and typically performed only once,
whereas fine-tuning is conducted on a regular basis. The behavior of GPT-based chatbots arises
through fine-tuning. The performance capabilities of LLMs have been attributed to
at least two factors: pretraining and scale (Bommasani et al., 2021). Pretraining, an
instance of transfer learning in which LLMs use knowledge acquired from one task
and transfer it to another, makes LLMs possible. Scale, including better computer
hardware, the transformer architecture, and the availability of more and higher-quality
training data, makes LLMs powerful. Although these capabilities are not insubstantial,
they do not yet rise to the level of NLU (Bender and Koller, 2020; Amin
et al., 2024, 2023). In addition, LLMs are prone to hallucination: ChatGPT may
produce linguistic responses that, though syntactically and semantically fine and
credible-sounding, are ultimately incorrect (Shen et al., 2023b). Furthermore, we
may distinguish between the capabilities of LLMs (acquired through pretraining)
and the behavior (affected by fine-tuning, which happens after pretraining) of LLMs.
Fine-tuning can have unintended effects, including behavioral drift on certain tasks.
ChatGPT, in fact, seems prone to the ‘short blanket dilemma’: while trying to im-
prove its accuracy on some tasks, OpenAI researchers inadvertently made ChatGPT
worse for tasks which it previously excelled at (Chen et al., 2023a).
AI research has slowly been drifting away from what its forefathers envisioned
back in the 1960s. Instead of evolving towards the emulation of human intelligence,
AI research has regressed into the mimicking of intelligent behavior in the past decade
or so. The main goal of most tech companies is not designing the building blocks of
intelligence but simply creating products that existing and potential customers deem
intelligent. In this context, instead of labeling it as ‘artificial’ intelligence, it may
be more apt to characterize such research as ‘pareidoliac’ intelligence. This term
highlights the development of expert systems while raising questions about their
claim to possess genuine intelligence. We feel there is a need to refocus AI on
humanity, an Anti-Copernican revolution of sorts: just as Copernicus demoted humans
from their privileged spot at the center of the universe, deep learning has
removed humans from the equation of learning. In traditional neural networks,
especially those with a shallow architecture (few hidden layers), humans were at
the center of the technological universe as they had to carefully design the input
features, select appropriate hyperparameters, adjust learning rates, etc. Instead, due
to their increased complexity and capacity to automatically learn features from
data, deep neural networks do not require manual feature engineering and, hence,
have effectively removed humans from the loop of learning. While this is good in
terms of cost, time, and effectiveness, it is bad for several other reasons, including
transparency, accountability, and bias.
In the deep learning era, humans no longer have control over how the learning
process takes place. To save on cost and time, we delegated the important task of
selecting which features are important for classification to deep neural networks.
These, however, are mathematical models with no commonsense whatsoever: they
do not know how to properly choose features. For example, in selecting candidates
for a job opening, deep neural networks may decide that gender is an important
feature to take into account simply because more men are present in the training data
as positive samples.
The issue is not only that deep nets may accidentally choose unimportant or even
wrong features, but that we have no way of knowing this because of their black-box
nature (Yeo et al., 2024b). In other words, not only have humans been taken out of
the picture, but they have also been blindfolded. For these reasons, we feel there is a
need to bring human-centered capabilities back at the center of AI, e.g., by having
human-in-the-loop or human-in-command systems that ensure AI outputs and rea-
soning steps are human-readable and human-editable. To this end, we propose seven
pillars for the future of AI (Cambria et al., 2023), namely: Multidisciplinarity, Task
Decomposition, Parallel Analogy, Symbol Grounding, Similarity Measure, Intention
Awareness, and Trustworthiness (Fig. 1.1).
1.2.1 Multidisciplinarity
Due to the complex and multifaceted nature of modern AI technologies and appli-
cations, Multidisciplinarity is of increasing importance for the future of AI. The
integration of knowledge from disciplines like mathematics, semiotics, logic, lin-
guistics, psychology, sociology, and ethics allows for a more holistic understanding
of AI’s capabilities and limitations. Mathematical principles such as linear alge-
bra, calculus, probability theory, and optimization underpin the design of AI al-
gorithms. Maths alone, however, is not enough for designing intelligent systems,
because mathematical approaches excel at capturing predominant linguistic patterns
but often struggle with addressing ‘long tail’ issues such as less common or niche
linguistic phenomena. Disciplines like semiotics can help AI systems understand the
nuances of language, including metaphors, idioms, sarcasm, and cultural references,
whether they fall within the more frequent or rarer occurrences across the spectrum
of everyday human language.
Logic also plays a fundamental and enduring role in the development and advance-
ment of AI, as it provides a rigorous framework for reasoning, problem-solving, and
knowledge representation. Word embeddings, which essentially replace words with
numbers, have made most AI researchers forget about the importance of linguistics.
Concepts from syntax, semantics, phonetics, and morphology (see next section),
however, are crucial for interpreting the intended meaning of natural language. Psy-
chology will play an essential role in creating systems that enhance well-being, foster
human relationships, and provide meaningful and empathetic interactions (Fei et al.,
2024). By addressing issues related to inequality and cultural diversity, sociology
will guide AI development in ways that promote positive societal outcomes and
responsible innovation. The arts are also going to be key for the future of AI, as
highlighted by recent STEAM (STEM + Art) initiatives, in order to ‘humanize’
AI through computational creativity, cultural and social understanding, and the en-
hancement of AI usability (Sorensen et al., 2022). Finally, ethics are paramount to
ensure that AI technologies are developed, deployed, and used in ways that align
with human values and promote accountability (Floridi et al., 2018).
Like Multidisciplinarity, Task Decomposition aims to better handle the complex and
multifaceted nature of AI problems. It is a method commonly used in psychology,
instructional design, and project management to break down a complex task or
activity into its individual components. Task Decomposition is one of the key points
of this textbook: no matter what kind of downstream task we are handling, if we do
not deconstruct it into its constituent subtasks, we are practically forcing our model
to implicitly solve a lot of subtasks it has never been trained for. The ‘sentiment
suitcase model’ (Cambria et al., 2017), for example, lists 15 NLU subtasks that need
to be solved separately before sentiment analysis can be accomplished.
Firstly, a Syntactics Layer preprocesses text so that informal and inflected expres-
sions are reduced to plain standard text (Zhang et al., 2023a). This is done through
subtasks such as microtext normalization, for refining and standardizing informal
text, part of speech (POS) tagging, for assigning grammatical categories (such as
nouns, verbs, adjectives, and adverbs) to each word in a sentence, and lemmatization,
for reducing words to their base or dictionary form (lemmas).
Secondly, a Semantics Layer deconstructs normalized text into concepts, resolves
references, and filters out neutral content from the input (Mao et al., 2024c). This is
done through subtasks such as word sense disambiguation (WSD), for determining
the correct meaning of a word within a given context, named entity recognition
(NER), for identifying and classifying names of people, places, organizations, and
dates, and subjectivity detection, to distinguish between factual information and
subjective content.
Finally, the Pragmatics Layer extracts meaning from both sentence structure and
semantics obtained from the previous layers. This is done through subtasks such as
personality recognition, to infer traits, characteristics, preferences, and behavioral
tendencies of the speaker, metaphor understanding, for interpreting figurative lan-
guage in text, and aspect extraction, for identifying and extracting specific facets,
features, or components mentioned in text and, hence, enabling a more fine-grained
analysis. Only after handling all these subtasks, which we humans take care of
almost subconsciously during reading or communication, can the downstream task,
e.g., polarity detection, be effectively processed.
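The layered decomposition above can be caricatured as a pipeline in a few lines of code. Every function below is a toy stand-in for a real subtask model (the slang table and polarity cues are invented), but the control flow mirrors the idea: the downstream task only ever sees text that the upstream subtasks have already normalized and filtered.

```python
def normalize_microtext(text):
    # Syntactics Layer: toy stand-in for microtext normalization
    slang = {"u": "you", "r": "are", "gr8": "great"}
    return " ".join(slang.get(tok, tok) for tok in text.split())

def lemmatize(text):
    # Syntactics Layer: toy stand-in (a real system would use a lemmatizer)
    return " ".join(tok.rstrip("s") if len(tok) > 3 else tok
                    for tok in text.split())

def is_subjective(text):
    # Semantics Layer: toy stand-in for subjectivity detection
    return any(cue in text for cue in ("great", "terrible", "love", "hate"))

def polarity(text):
    # Downstream task, reached only after the upstream subtasks
    return "positive" if any(cue in text for cue in ("great", "love")) else "negative"

def pipeline(raw_text):
    text = lemmatize(normalize_microtext(raw_text))
    if not is_subjective(text):
        return "neutral"   # neutral content is filtered out early
    return polarity(text)

print(pipeline("u r gr8"))  # positive
```

Skipping any one of the upstream stages (e.g., feeding the raw microtext straight to the polarity step) forces the downstream model to solve that subtask implicitly, which is precisely the failure mode Task Decomposition is meant to avoid.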
Fig. 1.2: An example of ‘panalogy’ where the same data is ‘redundantly’ represented
as a knowledge graph, as a matrix, and as embeddings (Cambria et al., 2012).
For more general NLU tasks, it could be useful to have the same data ‘redundantly’
represented both as a knowledge graph and as embeddings (Fig. 1.2). The knowledge
graph could be more useful for solving problems requiring Symbol Grounding, e.g.,
answering questions like ‘what is what?’ (see next section). Embeddings, instead,
could be more useful for Similarity Measure, e.g., answering questions like ‘what is
similar to what?’ (explained later).
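A minimal sketch of such a dual representation, with invented concepts and two-dimensional toy embeddings: the graph side answers ‘what is what?’ by symbolic lookup, while the embedding side answers ‘what is similar to what?’ by cosine similarity.

```python
import math

# The same toy knowledge 'redundantly' represented in two ways.
# All concepts and vector values are invented for illustration.

# 1) As a knowledge graph, suited to Symbol Grounding ('what is what?')
triples = {("cake", "is_a"): "food",
           ("pizza", "is_a"): "food",
           ("car", "is_a"): "vehicle"}

# 2) As embeddings, suited to Similarity Measure ('what is similar to what?')
embeddings = {"cake": [0.9, 0.1], "pizza": [0.8, 0.2], "car": [0.1, 0.9]}

def what_is(concept):
    return triples.get((concept, "is_a"))

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))

def most_similar(concept):
    return max((c for c in embeddings if c != concept),
               key=lambda c: cosine(embeddings[concept], embeddings[c]))

print(what_is("cake"))       # food
print(most_similar("cake"))  # pizza
```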
In the context of AI, the Symbol Grounding problem arises because computers
lack the inherent sensory experiences that humans possess. They process symbols
as strings of characters or digital information without a direct connection to the real
world, raising the question of how they can truly understand the meanings of symbols
in a way that is equivalent to human understanding. For instance, consider the word
‘apple’. Humans understand this word not just as a sequence of letters, but as a fruit
with certain sensory qualities like color, taste, texture, and smell, all of which are
grounded in our experiences with actual apples. Current AI systems are unable to
grasp the richness of meaning behind the word ‘apple’ without having those sensory
experiences.
To solve this, we may have to take a step back in order to move forward. Old-school
(symbolic) AI was better at Symbol Grounding, but it was neither scalable nor
flexible. New deep-learning-based (subsymbolic) AI, instead, is very scalable and
flexible but it does not handle symbols. The best of both worlds could be achieved
through a hybrid (neurosymbolic) AI that leverages the strengths of both symbolic
and subsymbolic models to overcome their respective limitations and achieve a
more comprehensive understanding of the world. In NLU research, this can be
implemented in several ways, e.g., by injecting external knowledge into a deep
neural network (Liang et al., 2022b) in the form of embeddings (Fig. 1.3).
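A bare-bones sketch of such knowledge injection (the vectors, feature names, and knowledge base below are invented, and this is not the specific method of Liang et al.): the learned, subsymbolic embedding of a word is simply concatenated with a feature vector derived from a symbolic knowledge base before being fed to the network.

```python
# Subsymbolic side: a learned word embedding (values are illustrative)
learned = {"apple": [0.3, -0.2, 0.8]}

# Symbolic side: binary features read off an external knowledge base
kb_features = ["is_fruit", "is_company", "is_edible"]
kb = {"apple": {"is_fruit": 1, "is_edible": 1}}

def inject(word):
    """Concatenate the learned embedding with knowledge-derived features."""
    symbolic = [float(kb.get(word, {}).get(f, 0)) for f in kb_features]
    return learned[word] + symbolic

print(inject("apple"))  # [0.3, -0.2, 0.8, 1.0, 0.0, 1.0]
```

The enriched vector lets the downstream network exploit explicit facts (apple is a fruit, apple is edible) that the purely statistical embedding may encode only diffusely, if at all.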
Intention Awareness refers to the ability to recognize and understand the intentions
or goals of oneself or others. It plays a crucial role in human social interactions and
communication, as it enables individuals to anticipate and interpret the actions and
behaviors of others, leading to more effective and empathetic interactions. Current
AI models provide one-fits-all solutions without taking into account user beliefs,
goals and preferences.
Theory of mind should always be applied to better understand users' actions
and queries. When this is not possible, user profiling in the form of persona or
personality recognition should be employed to generate more relevant actions or
answers (Zhu et al., 2023). For the same reason, AI should also have enough com-
monsense knowledge, including a model of fundamental human beliefs, desires, and
intentions, in order to minimize miscommunication and avoid unintended conse-
quences (e.g., apocalyptic scenarios like accidentally wiping out humanity in the
attempt to solve climate change). In other words, future AI systems should always
try to understand what users are doing and why they are doing it. For instance,
recent hybrid frameworks have tried to improve human-robot interaction by model-
ing Intention Awareness in terms of motivational and affective processes based on
conceptual dependency theory (Ho et al., 2023).
Finally, recent attempts to augment the human decision-making process, espe-
cially in dynamic and time-sensitive scenarios such as military command and control,
game theory, home automation, and swarm robotics, have focused primarily on envi-
ronmental details such as positions, orientations, and other characteristics of objects
and actors of an operating environment (situation awareness). However, a significant
factor in such environments is the intentions of the actors involved (Howard and
Cambria, 2013). While creating systems that can shoulder a greater portion of this
decision-making burden is a computationally intensive task, performance advances
in modern computer hardware bring us closer to this goal.
1.2.7 Trustworthiness
Last but not least, Trustworthiness is a key pillar that measures the degree to which
AI systems, models, and algorithms can be relied upon to perform as intended, make
accurate and ethical decisions, and avoid harmful consequences. It is a concept
closely related to Intention Awareness, but also explainability and interpretability.
Explainability allows an AI model to generate descriptions of its decision-making
processes in order to enable the user to make informed modifications to the outputs
or even to the model itself in a human-in-the-loop fashion. Interpretability, in turn,
enables users to understand the inner workings of an AI model, e.g., by identifying
which input features have the most impact on its output or by assessing how changes
in input variables affect the model’s predictions or by leveraging a confidence score
to gauge how confident the AI model is about its own output.
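One simple way to realize the ‘which input features have the most impact’ reading of interpretability is perturbation analysis: zero out one feature at a time and measure how much the output moves. The linear model below is only a stand-in for an arbitrary black box; for a linear model the measured impact simply recovers the magnitude of each weight.

```python
weights = [0.7, -0.1, 0.4]  # toy model parameters

def model(features):
    # stand-in for any black-box predictor
    return sum(w * f for w, f in zip(weights, features))

def feature_impacts(features):
    """Impact of feature i = |change in output when feature i is zeroed|."""
    baseline = model(features)
    impacts = []
    for i in range(len(features)):
        perturbed = features[:i] + [0.0] + features[i + 1:]
        impacts.append(abs(baseline - model(perturbed)))
    return impacts

# For this linear model, the impacts match |weights| up to float error.
print(feature_impacts([1.0, 1.0, 1.0]))
```

The same probe applies unchanged to a neural network: only the body of `model` would differ, which is what makes perturbation-based explanations model-agnostic.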
1.3 Towards More Responsible AI
Fig. 1.4: Recommender systems have both positive and long-term negative effects
on their users and the economy.
The main goals of recommender systems and their positive impact on the economy
and humans (Fig. 1.4) are:
PI1: Positive economic impact on companies using recommender systems (De Biasio
et al., 2024), i.e., personalization provided by recommender systems can be a
competitive advantage over market rivals;
PI2: Better user experience, which directly results from support of recommender
systems in navigation and communication with the system;
PI3: Better user satisfaction, a measurable feature of recommender systems that
makes users keep utilizing them (He et al., 2024b);
PI4: Time-saving is a consequence of PI2, i.e., faster navigation through online services
and large item collections (Tembhare et al., 2023);
PI5: Broadening of horizons – some recommender systems suggest items beyond
known user preferences (Liang and Willemsen, 2023). This is also addressed using
quite well-known concepts and measures: diversity, coverage, novelty, unexpect-
edness, and serendipity (Fu et al., 2023);
PI6: Nudging users towards positive decisions and behaviors, e.g., related to
unhealthy eating (Castiglia et al., 2022) or news diversity.
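Two of the measures mentioned under PI5 are easy to make concrete. With an invented item catalog, catalog coverage can be taken as the share of items a system ever recommends, and intra-list diversity as the fraction of recommended item pairs whose genres differ (one of several definitions used in the literature).

```python
# Hypothetical catalog mapping items to genres
catalog = {"a": "rock", "b": "rock", "c": "jazz", "d": "folk", "e": "jazz"}

def coverage(recommended):
    """Share of the catalog that gets recommended."""
    return len(set(recommended)) / len(catalog)

def intra_list_diversity(recommended):
    """Fraction of recommended item pairs belonging to different genres."""
    pairs = [(x, y) for i, x in enumerate(recommended)
             for y in recommended[i + 1:]]
    if not pairs:
        return 0.0
    return sum(catalog[x] != catalog[y] for x, y in pairs) / len(pairs)

recs = ["a", "b", "c"]
print(coverage(recs))              # 0.6
print(intra_list_diversity(recs))  # 2 of 3 pairs differ
```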
All the above-mentioned effects may lead to further consequences, such as decreased
well-being or poorer physical and mental health. There are also other adverse effects on humans,
which often directly result from the contradictory goals of sellers and customers, like
nudging user moods to induce unplanned purchases (Ho and Lim, 2018). The
phenomenon commonly called the information bubble (NI1) stems from the human
tendency to consume increasingly similar items over time, even without any recommendation.
However, recommender systems can reinforce this effect, even if the user may sometimes feel
bored or less satisfied (Areeb et al., 2023). On the other hand, there are recommender
system solutions going in the opposite direction, i.e., broadening user horizons (PI5).
Unfortunately, however, they are not commonly used in commercial applications. User
autonomy (NI4) refers to the user's ability to make their own free choices (Krook and
Blockx, 2023), i.e., possibly without any manipulation. This is also closely related
to user control over their decisions and recommendation mechanisms (Harambam
et al., 2019). The loss of such control may partially result from being in the informa-
tion bubble (NI1), excessive recommender system use (NI2), and loss of criticism
(NI3).
Nowadays, companies and other stakeholders can collect large amounts of data about
their users. More data potentially leads to more accurate and friendly recommender
systems. This is additionally supported by multimodal recommender systems that
process diverse information about the environment, recommended items, and also
user social networks. Besides, recommender systems can make use of general
data about users, like their personality traits, cognitive abilities, or more transient
affective states like emotions (Dhelim et al., 2022). As a result, current recommender
systems, and future recommender systems in particular, will increasingly resemble
human advisors.
Fig. 1.5: Recommender systems are becoming more and more human-like. Includes a
Midjourney-generated image.
Fig. 1.6: Present and future relationships between the user, recommender system and
other humans.
1 Note that we are not analyzing any particular service, but only want to indicate some of the
potential risks and directions for further studies.
This means that, in many cases, user and business goals may be opposed. This
is closely related to the negative long-term impacts NI1 and NI2. If the business
model of a dating service relies on monthly subscriptions, then the recommender system
in such a service can be optimized to achieve business goals, i.e., to retain paying
users. This may result in matching people to short-term human relationships rather
than long-term ones. This, in turn, may not be in line with the long-term goals of
many users seeking a more permanent relationship.
1) human values and user personal goals (especially related to long-term impact on
them), and
2) societal recommendations (related to lifestyle or health like physical activities,
sleep, the need for breaks in online activity, etc.)
All of the above should be taken into account while maintaining user auton-
omy (Krook and Blockx, 2023) and non-conflict with business objectives. Preserving
user autonomy is a crucial component of RRS, mitigating the long-term negative im-
pact NI4. It requires that the recommender system designer strive to keep the
user free: free to change their choices, free to make new ones, or even to disable the
system altogether. Our RRS concept adheres to the principles of digital humanism,
emphasizing the importance of norms to ensure that technology is developed and
used in a manner that is more ethical and driven by human values (Prem et al., 2023).
Please note that we focus primarily on recommender systems’ impact on individual
users. However, the social effect can be considered as well.
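The trade-off an RRS has to manage can be made explicit as multi-objective scoring. In the hypothetical sketch below, each candidate item is scored against user, societal, and business goals; the weights, items, and scores are all invented. The point is only that the balance becomes an explicit, inspectable parameter rather than an implicit by-product of optimizing a single business metric.

```python
# Explicit weights over the three goal families (illustrative values)
weights = {"user": 0.5, "societal": 0.3, "business": 0.2}

# Hypothetical candidate items scored against each goal family
items = {
    "binge_playlist":    {"user": 0.9, "societal": 0.2, "business": 0.9},
    "balanced_playlist": {"user": 0.7, "societal": 0.8, "business": 0.6},
}

def rrs_score(goal_scores):
    """Weighted multi-objective score of one candidate item."""
    return sum(weights[g] * goal_scores[g] for g in weights)

best = max(items, key=lambda name: rrs_score(items[name]))
print(best)  # balanced_playlist
```

Here the item that maximizes only user engagement and business revenue loses to the one that also respects the societal objective; auditing or adjusting the trade-off reduces to inspecting or changing the weight vector.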
Fig. 1.7: Idea of RRS in a market environment. It respects: (1) individual user goals
and values, especially long-term ones, (2) societal goals, e.g., lifestyle or health
recommendations, and (3) business goals. RRS also (4) preserves trustworthiness,
fairness, accountability, and explainability leading to better transparency.
2 Please note that existing nudging solutions (PI6) focus on single positive nudging objectives rather
than on integration and alignment with other targets, e.g., through multi-criteria inference.
AI and recommender systems are poised to revolutionize the fabric of societal inter-
actions, offering unprecedented opportunities to reshape the way individuals connect,
communicate, and collaborate. While they can facilitate connections by matching
individuals with shared interests and preferences, there is concern that these medi-
ated interactions may lack the depth and authenticity of face-to-face communication.
Over-reliance on algorithmic recommendations may contribute to the formation of
echo chambers, limiting exposure to diverse perspectives and potentially weakening
interpersonal bonds. Moreover, the increasing integration of AI and recommender
systems into daily life raises questions about individual autonomy. While these sys-
tems aim to streamline decision-making by offering tailored suggestions, there is
a risk of subtle influence on user behavior. This raises ethical concerns about the
manipulation of human autonomy and the potential erosion of free will in the face
of algorithmic determinism. On the one hand, AI and recommender systems provide
access to vast amounts of information and support data-driven decision-making, em-
powering individuals to make more informed choices. Simultaneously, the reliance
on algorithmic recommendations may reduce individuals’ inclination to critically
evaluate information and exercise independent judgment. Moreover, the lack of
transparency in recommendation processes may hinder users’ understanding of how
decisions are made, potentially undermining trust in the information presented.
In the near future, it will be increasingly important to raise awareness among users
about the potential risks associated with recommender systems, drawing parallels to
other contexts such as smoking and investing in the stock market. Just as education
campaigns have been instrumental in informing the public about the health hazards
of smoking and the financial risks of stock market investments, a similar approach
is needed to highlight the potential pitfalls of relying blindly on recommendation
algorithms.
1.3.5 Summary
So, are recommender systems friends, foes, or frenemies? They can be friends when
they provide transparent and personalized recommendations that genuinely enhance
user experiences. For instance, on streaming platforms like Netflix or Spotify, rec-
ommender systems recommend movies, shows, or music tailored to users’ tastes and
preferences, helping them discover content they might enjoy. They can be foes when
they prioritize the interests of businesses or other stakeholders over those of users,
which can lead to a proliferation of sponsored content or biased recommendations,
potentially undermining user trust and satisfaction. Given the pervasive influence of
recommender systems on various aspects of human life, it is crucial to study their
mechanisms, effects, and implications more comprehensively.
As AI becomes increasingly integrated into various aspects of daily life, the de-
mand for more tailored and individualized experiences is growing. Personalization
enhances the relevance, efficiency, and effectiveness of AI applications, leading to
numerous benefits across different sectors. Firstly, personalization significantly im-
proves user experience by allowing AI systems to cater to the unique preferences,
needs, and behaviors of individual users. This leads to more intuitive and engag-
ing interactions, enhancing user satisfaction and loyalty. For example, personalized
recommendation systems in e-commerce and streaming services help users discover
products and content that align with their tastes, thereby improving their overall
experience. Secondly, personalized AI systems increase efficiency and productivity
by streamlining workflows and automating tasks in ways that are specifically tai-
lored to the user’s habits and requirements. In professional settings, personalized AI
assistants can manage schedules, prioritize tasks, and provide relevant information,
thereby boosting productivity and efficiency.
In this chapter, we focus on the task of personalized sentiment analysis, which
aims to analyze individual sentiment perceptions. This approach is motivated by the
observation that different individuals may perceive an identical statement differently
regarding its sentiment polarity. In contrast, conventional sentiment analysis aims
to predict the semantic sentiment of a statement, where the sentiment prediction
remains the same for an identical statement. For instance, an introverted person may
have a negative sentiment towards performing in front of a large audience, while
an extroverted person may view the same situation positively (see Fig. 1.8). This
variation in sentiment perception can be attributed to personality traits. While the
distinction between introversion and extroversion is from personality theory, the
variability in sentiment perception among individuals can also be influenced by
other factors. The inconsistency in sentiment perception between individuals may originate from multiple sources. For example, as the adage “the enemy of my enemy is my friend” suggests, a person’s sentiment perception can be driven by relationships or the context of the situation. In this light, personalized sentiment analysis extends beyond traditional semantic and pragmatic understanding, incorporating
a broader range of human subjective factors, such as persona information.
1.4 Towards More Personalized AI 23
With the development of neural networks, the research focus of sentiment analysis
shifted towards developing different learning frameworks to improve accuracy. Sev-
eral algorithms were proposed for sentiment analysis including convolutional neural
network (CNN)-based supervised learning (Kim, 2014), transfer learning (Dong and
De Melo, 2018), adversarial training (Fedus et al., 2018), meta-learning (He et al.,
2023b), and prompt-based learning (Mao et al., 2023c). These research efforts address learning challenges in sentiment analysis, e.g., pattern discovery with labeled data, efficient learning from few-shot examples, robust representations, and domain adaptation. During this period, research in multimodal (Zadeh et al., 2017) and cross-lingual (Zhang et al., 2024b) sentiment analysis was particularly active because it expanded the scope of
sentiment analysis beyond English text. Recently, there has been a significant en-
richment in the task setups of sentiment analysis. Researchers are no longer satisfied
with simply predicting a sentiment polarity for an input text; they are extending the
scope of sentiment analysis to include different levels of granularity and contextual
awareness, e.g., aspect-based sentiment analysis (ABSA) (Mao and Li, 2021) and
opinion mining (Marrese-Taylor et al., 2014), emotion detection (Abdul-Mageed
and Ungar, 2017), conversational sentiment analysis (Li et al., 2023b), sentiment
analysis from electroencephalography (EEG) signals (Kumar et al., 2019b), facial
expressions (Dagar et al., 2016), or speech (Lu et al., 2020). Another trend in sentiment analysis is that researchers have paid more attention to linguistic phenomena that likely affect sentiment analysis, e.g., metaphors (Mao et al., 2023b), sarcasm (Yue et al., 2023), and ambiguous word senses (Zhang et al., 2023b). Considering the impact of sentiment across broad domains, research papers have studied sentiment analysis in different domains, e.g., natural disasters (Duong et al., 2024), mental health (Ji et al., 2022), finance (Ma et al., 2023, 2024), legislation (Proksch et al.,
2019) and education (Altrabsheh et al., 2013).
To sum up, previous research has addressed sentiment analysis by tackling learn-
ing challenges, enhancing sentiment analysis granularity, improving NLU in learning
systems, and grounding sentiment analysis in different downstream tasks. However,
sentiment perception can vary subjectively in different contexts. There is limited
research on personalized sentiment analysis that integrates various types of persona
information. This motivates us to bridge this gap by forming a framework to identify
the sources of subjectivity in sentiment analysis and developing a neurosymbolic
system to process the task of personalized sentiment analysis.
Theory of mind suggests that individuals understand that others may hold beliefs,
desires, intentions, emotions, and thoughts that differ from their own (Apperly and
Butterfill, 2009). Thus, we believe that multiple factors can influence individual sen-
timent perceptions. According to the theory of appraisal (Martin and White, 2003),
opinions and sentiments arise not as direct responses to stimuli but as complicated
evaluations incorporating subjective judgments across multiple levels.
For example, those with conservative ideologies often emphasize individual re-
sponsibility and individual rights, influencing their stance on welfare and healthcare-
related policies (Taber and Lodge, 2006). In contrast, individuals with liberal ideolo-
gies often prioritize social justice and equality, leading to opposing opinions. The way
people construct their ideas is also influenced by certain personality traits (Jansen
et al., 2022b). For example, people who are open to new experiences are more likely
to be receptive to new ideas and have flexible, open-minded perspectives (McCrae
and Costa, 1987). On the other hand, those with higher conscientiousness typically
base their beliefs on carefully analyzing the available data, leading to more thought-
ful viewpoints. Sometimes, sentiment perceptions can be influenced by individuals’
subjectivity, e.g., individuals who have had positive interactions with dogs are more
likely to view dogs positively, contrasting with those whose experiences have been
negative (Tyrer et al., 2015). People tend to focus on information that aligns with
their personal preferences and beliefs, potentially distorting their perception of emo-
tions (Bower, 1981).
This tendency, termed confirmation bias, can impact how individuals interpret
emotional events. Moreover, varied experiences and subjective feelings can influence how individuals use metaphorical language to express their opinions (Lakoff and Johnson, 1980). For example, financial analysts may employ different metaphors
in their reports under different market conditions (Mao et al., 2023a). The public’s
perception of different types of weather disasters is also reflected in their metaphor-
ical expressions (Mao et al., 2024d). To sum up, theoretical research and empirical
studies support that individuals’ sentiment perceptions are subject to multiple factors: entity diversity, which distinguishes humans from other intelligent agents such as animals and AI; culture; religion; vocation; ideology; personality; and subjectivity. These factors may impact personalized sentiment analysis in different
scenarios; collectively, they represent the complex interplay of individual characteristics and contextual influences on sentiment perception.
These factors not only influence how individuals perceive and interpret the senti-
ment of a target but also alter the language they use to express subjective feelings.
In sentiment analysis, understanding these factors is necessary for developing more
personalized and context-sensitive systems.
1.4.2 Methodology
After reviewing relevant literature in psychology and cognitive science in the pre-
vious section, we define a hierarchical framework, termed Personalized Sentiment
Analysis Pyramid (see Fig. 1.9), which encompasses seven factors that can influence
individual sentiment perception. By considering each factor, this framework aims
to enable more accurate and personalized sentiment analysis, tailoring sentiment
detection to the unique characteristics and contexts of each individual. This hierar-
chical framework is shaped as a layered structure that includes seven persona aspects,
namely: entity, culture, religion, vocation, ideology, personality, and subjectivity.
Entity refers to the differentiation between genders and between humans and other intelligent agents. Culture represents how various cultures perceive concepts as positive or
negative. Religion involves considering how specific religious beliefs can influence
an individual’s opinions on certain topics. Vocation aids in understanding people’s
opinions based on their occupation and educational background. Ideology involves
political beliefs and social, economic, or philosophical viewpoints. Personality assists
in categorizing concepts as positive or negative based on personality traits. Finally,
subjectivity considers personal preferences and experiences. At the bottom of the
pyramid, personalization is more general, e.g., entities of the same gender and
species, such as males, females, AI, or other creatures, can share the same persona
information. Personalization is more specific at the top layer, e.g., subjectivity level.
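For illustration, the pyramid’s bottom-to-top ordering can be captured in a simple data structure; this sketch and its names are our own, not part of a released implementation.

```python
# The Personalized Sentiment Analysis Pyramid as an ordered list, from the
# most general layer (bottom) to the most specific layer (top).
# Illustrative sketch only; names are our own.

PYRAMID_LEVELS = [
    "entity",        # bottom: shared by whole gender/species groups
    "culture",
    "religion",
    "vocation",
    "ideology",
    "personality",
    "subjectivity",  # top: individual preferences and experiences
]

def specificity(aspect: str) -> int:
    """Return the layer index of an aspect; higher means more specific."""
    return PYRAMID_LEVELS.index(aspect)
```

For example, `specificity("subjectivity")` is greater than `specificity("entity")`, mirroring the pyramid’s general-to-specific ordering.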
Next, we use GPT-4 Turbo to analyze the persona information of our subjects related
to the above seven aspects. LLMs have shown superior knowledge in diverse domains,
including multilingual capabilities and scientific task processing (Mao et al., 2024a).
They were also suggested as a useful tool for survey research and persona information
generation (Jansen et al., 2023). Thus, an LLM is a suitable tool for analyzing the persona of a subject from different aspects.
Finally, we test LLM performance on personalized sentiment analysis tasks. The
obtained persona information in the former step is used as symbolic knowledge, guid-
ing the sentiment inference of an LLM. Since we combine the symbolic knowledge
and the reasoning ability of neural network-based LLMs together, the methodology
is neurosymbolic. The structured symbolic knowledge provides a clear and under-
standable reasoning basis, enhancing the interpretability and explainability of the
system’s decisions (Yeo et al., 2024a). The testing data were sourced from novels,
including dialogues with multi-turns. The task is to predict a speaker’s sentiment
perceptions towards another speaker involved in the conversation. We hypothesize
that the additional persona information has different utilities in this scenario because
the intensity of the influence of personal characteristics on sentiment perception
changes as the scene changes. We aim to evaluate the utilities of the ensemble and
each type of persona information.
The overall workflow of our method can be viewed in Fig. 1.10. Our task setup
and computing pipeline represent a novel approach for several reasons. First, we do
not focus on analyzing the sentiment of a conversation based solely on its semantic
content. Instead, our goal is to analyze how one person perceives another’s senti-
ments. This means that even in a conversation that may seem neutral, there could
still be negative sentiment if the individuals involved in the conversation do not
like each other. Second, unlike traditional personalized AI techniques, such as user
preference-based dialogue systems (Zhu et al., 2023, 2024a), personality trait-based
recommender systems (Yang and Huang, 2019), or annotation-subjectivity-driven
sentiment analysis (Zhang et al., 2024c), our approach considers persona information
from multiple aspects. This allows our system to incorporate a broader range of fac-
tors that may be informative for personalized sentiment analysis. Finally, our system
prioritizes user subjectivity by generating personalized outputs based on different
types of persona information, even when presented with the same dialogue input.
This approach to human-computer interaction (HCI) is more human-centric.
Since we have defined seven aspects for persona analysis and our analytical subjects
are characters from Harry Potter novels, we can query GPT-4 for the persona information directly. We formulate the query template as follows.
[Goal]: I want to categorize the given object according to their {aspect_term} in the Harry
Potter book series. Please suggest a description, one in a line, starting with “-” and surrounded
by quotes “”. For example: - “{example_category}”
Do not output anything else.
You may choose only one {aspect_term} from the following list: {category_list}
Please categorize {sample_in_prompt} into one {aspect_term} according to their
{aspect_term} in the Harry Potter book series.
Table 1.1 shows the aspect terms for the seven aspects used to analyze the characters in
the Harry Potter book series. To prevent overlap among the seven aspects and ensure
that the LLM fully understands the meanings of the aspect terms, we provide a list of
categories for the first six aspects (shown as follows) for the LLM to reference during
inference. For “subjectivity”, we provide examples such as “Quidditch Seeker” and
“Painting”, allowing the LLM to generate relevant answers openly.
• Entity: [species type] Wizards and Witches, Muggles, Werewolves, Dragons, Hippogriffs, Basilisks, Trolls, Hags, Giants, Ghosts, House-elves, Goblins, Centaurs, Veela, Merpeople, Dementors, Vampires. [gender] male, female, or inapplicable.
• Culture: Gryffindor, Slytherin, Hufflepuff, Ravenclaw, England, Scotland, Wales,
Irish, French, Bulgarian, India, African, Romani, Middle Eastern, USA.
• Vocation: Auror, Healer, Transfigurers, Charms Experts, Diviners, Professor,
Magizoologist, Potion Master, Curse Breaker, Metamorphmagi, Animagi, Oc-
clumens, Legilimens, Runes Experts, Patronus Charm Casters, Quidditch Player,
Journalist, Shop Owner, Ministry Official, Librarian, Herbologist, Arithmancer,
Servants, Metalworkers, Bankers, Underwater Dwellers, Companions or Pets.
Table 1.1: Persona aspect terms for analyzing the characters in Harry Potter novels.
• Religion: Good vs. Evil, Love vs. Indifference, Acceptance_Death vs. Fear_Death vs. Bravery_Death vs. Denial_Death vs. Honor_Death, Sacrifice vs. Selfishness, Redemption vs. Condemnation, Impartiality vs. Prejudice, Tolerance vs. Intolerance, Courage vs. Cowardice, Faith vs. Skepticism, Responsibility vs. Irresponsibility.
• Ideology: Equality and Inclusivity vs. Inequality and Exclusivity, Reform vs. Status Quo, Utilitarianism vs. Moral Absolutism, Knowledge vs. Ignorance, Loyalty and Community vs. Disloyalty and Individualism, Pragmatism vs. Idealism.
• Personality: ESTJ, ENTJ, ESFJ, ENFJ, ISTJ, ISFJ, INTJ, INFJ, ESTP, ESFP,
ENTP, ENFP, ISTP, ISFP, INTP, INFP.
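The persona-query template above can be filled programmatically. The sketch below is illustrative: the helper name and the truncated category list are our own assumptions, not the authors’ released code.

```python
# Filling the persona-query template from the text above. The function name
# and the truncated Culture category list are illustrative assumptions.

PERSONA_QUERY_TEMPLATE = (
    "[Goal]: I want to categorize the given object according to their {aspect_term} "
    "in the Harry Potter book series. Please suggest a description, one in a line, "
    'starting with "-" and surrounded by quotes "". For example: - "{example_category}"\n'
    "Do not output anything else.\n"
    "You may choose only one {aspect_term} from the following list: {category_list}\n"
    "Please categorize {sample_in_prompt} into one {aspect_term} according to their "
    "{aspect_term} in the Harry Potter book series."
)

# A subset of the Culture categories listed above.
CULTURE_CATEGORIES = ["Gryffindor", "Slytherin", "Hufflepuff", "Ravenclaw", "England"]

def build_persona_query(aspect_term, example_category, category_list, sample_in_prompt):
    """Fill the template slots for one character and one persona aspect."""
    return PERSONA_QUERY_TEMPLATE.format(
        aspect_term=aspect_term,
        example_category=example_category,
        category_list=", ".join(category_list),
        sample_in_prompt=sample_in_prompt,
    )

query = build_persona_query("culture", "Gryffindor", CULTURE_CATEGORIES, "Hermione Granger")
```

The resulting string can then be sent to GPT-4 Turbo as a single user message.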
The prompt template (prompt(m→n)) for inferring the sentiment perception of m towards n can be viewed below. In the prompt box, the content after [Goal] refers to the task description (t) that directs an LLM to deliver desired predictions, following a fixed structure. The scene (e) is the background illustration in which the conversation takes place.
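Since the exact wording of the prompt box is not reproduced here, the sketch below shows one possible way to assemble prompt(m→n) from the task description t, the scene e, the persona knowledge, and the dialogue; all field names and phrasings are our own illustrative assumptions, not the authors’ exact template.

```python
# Hypothetical assembly of prompt(m -> n). The bracketed field names and the
# final instruction are illustrative; they are not the authors' exact template.

def build_sentiment_prompt(task_description, scene, persona_m, persona_n,
                           dialogue_turns, speaker_m, speaker_n):
    """Compose a sentiment-inference prompt from t (task description),
    e (scene), symbolic persona knowledge, and the multi-turn dialogue."""
    dialogue = "\n".join(f"{spk}: {utt}" for spk, utt in dialogue_turns)
    return (
        f"[Goal]: {task_description}\n"
        f"[Scene]: {scene}\n"
        f"[Persona of {speaker_m}]: {persona_m}\n"
        f"[Persona of {speaker_n}]: {persona_n}\n"
        f"[Dialogue]:\n{dialogue}\n"
        f"Predict the sentiment of {speaker_m} towards {speaker_n} "
        f"(positive, neutral, or negative)."
    )

prompt_text = build_sentiment_prompt(
    "Predict one speaker's sentiment towards another.",
    "The Great Hall during the sorting ceremony.",
    "Gryffindor; values loyalty",
    "Slytherin; values ambition",
    [("Harry", "I hope I'm not in Slytherin."), ("Draco", "Scared, Potter?")],
    "Harry", "Draco",
)
```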
1.4.3 Experiment
In our experiments, we investigate the following research questions:
1) What is the utility of using the ensemble of the seven levels of personalization?
2) What is the utility of using individual personalization?
3) How does personalization impact sentiment analysis accuracy across different
types of entity and culture factors?
These research questions are explored in Sections [Link]–[Link], respectively.
[Link] Dataset
Our personalized sentiment analysis uses the Harry Potter Dataset (HPD) (Chen
et al., 2023b). It was developed to enhance the alignment of conversation agents with
fictional characters from Harry Potter novels. It includes annotations of relationships and character attributes that evolve over the storyline. HPD also includes background information, such as conversation scenes, speaker identities, and character attributes, to
enable dialogue agents to generate replies consistent with the Harry Potter universe.
In contrast to our structured persona analysis, the character attributes in the dataset
were not derived from the same set of analytical aspects. Thus, we did not use their
character attribute descriptions.
We leverage their affection labels, which indicate the sentiment intensity of a perceiving subject towards a perceived object, as our sentiment intensity labels. In our classification task, positive labels correspond to sentiment intensities ranging from +1 to +5, and negative labels correspond to sentiment intensities ranging from −1 to −5. A neutral label corresponds to the sentiment intensity of 0. Our method
is evaluated using the English version of the original HPD. The statistics of our
employed data and the sentiment intensity distribution are shown in Fig. 1.11.
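The label mapping described above can be sketched in a few lines (the function name is our own):

```python
# HPD affection intensities to sentiment classes, as described above:
# +1..+5 -> positive, -1..-5 -> negative, 0 -> neutral.

def intensity_to_class(intensity: int) -> str:
    if intensity > 0:
        return "positive"
    if intensity < 0:
        return "negative"
    return "neutral"

labels = [intensity_to_class(i) for i in (-5, -1, 0, 1, 5)]
# -> ['negative', 'negative', 'neutral', 'positive', 'positive']
```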
The persona information was queried from GPT-4 Turbo. The personalized sentiment analysis was evaluated with GPT-4 Turbo and GPT-3.5 Turbo (i.e., ChatGPT), respectively. GPT-4 Turbo is an upgraded version of GPT-3.5 Turbo. Both LLMs were developed by OpenAI and pretrained with Transformer-based deep neural networks on large-scale corpora. In a preliminary test, we found that these LLMs have rich knowledge about the Harry Potter novels.
We compared the sentiment analysis results of GPT-3.5 and GPT-4 with and without
the ensembled seven levels of personalization in Table 1.2. The ensemble of the seven levels of personalization improved the performance of GPT-3.5 on sentiment
analysis. For GPT-4, the inclusion of the seven levels of personalization contributes a slight improvement in F1 (p:=pos), accuracy, and MSE, but a marginal decrease in F1 (p:=neg) and Macro F1. Taking the answer rate into consideration, however, we found that the accuracy of GPT-4 on the whole dataset becomes 0.8977 × 0.8155 = 0.7321, while that of GPT-4 w/ p1:7 becomes 0.9058 × 0.9492 = 0.8598. Thus, the adjusted accuracy increase of 0.1277 demonstrates the effectiveness of ensembling the seven levels of personalization on the sentiment analysis task.
Table 1.2: Personalized sentiment analysis. p:=pos means positive sentiment is defined as the positive label for computing F1; p:=neg means negative sentiment is defined as the positive label for computing F1.

            F1 (p:=pos)   F1 (p:=neg)   Macro F1   Acc      MSE       Answer Rate
GPT-3.5     0.9381        0.5340        0.4907     0.8686   0.3586    0.6522
  w/ p1:7   0.9484        0.5859        0.5114     0.8936   0.3293    0.6860
  Delta     0.0103        0.0519        0.0207     0.0250   -0.0293   0.0338
GPT-4       0.9569        0.7372        0.5704     0.8977   0.1860    0.8155
  w/ p1:7   0.9631        0.7092        0.5630     0.9058   0.1812    0.9492
  Delta     0.0062        -0.0280       -0.0074    0.0081   -0.0048   0.1337
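The adjusted-accuracy arithmetic above (accuracy on answered samples multiplied by the answer rate) can be reproduced directly:

```python
# Whole-dataset accuracy = accuracy on answered samples x answer rate.

def adjusted_accuracy(acc_on_answered: float, answer_rate: float) -> float:
    return acc_on_answered * answer_rate

gpt4_plain = adjusted_accuracy(0.8977, 0.8155)  # ~0.7321
gpt4_p17 = adjusted_accuracy(0.9058, 0.9492)    # ~0.8598
gain = gpt4_p17 - gpt4_plain                    # ~0.1277
```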
Table 1.3 shows the positive influence of each individual personalization on the
performance of GPT-3.5 in the personalized sentiment analysis tasks. Among them,
Culture, Vocation, Ideology, and Subjectivity strengthened the performance of GPT-
3.5 by a significant margin, while Entity, Religion, and Personality contributed less.
There may be a thought-provoking rationale for such a discrepancy. The Harry Potter
book series is deeply rooted in rich cultural, ideological, and subjective backdrops, thereby offering a tapestry of themes and narratives that resonate deeply with readers’ or even sentiment intensity annotators’ own values. For example, someone who values loyalty and friendship may echo characters like Harry and Ron.
Hence, they may easily capture Harry’s negative sentiment towards Peter Pettigrew,
who betrayed his friends, James and Lily Potter. Consequently, an LLM knowing
these factors (culture, ideology, and subjectivity) may understand characters’ senti-
ments more precisely by resonating with characters’ values.
Table 1.3: Personalized sentiment analysis of GPT-3.5 with individual persona aspects (p1–p7).

            F1 (p:=pos)   F1 (p:=neg)   Macro F1   Acc-all   MSE      Answer Rate
GPT-3.5     0.9381        0.5340        0.4907     0.8686    0.3586   0.6522
  w/ p1     0.9478        0.5553        0.5010     0.8825    0.3200   0.6769
  w/ p2     0.9508        0.5793        0.5138     0.8926    0.2941   0.6801
  w/ p3     0.9413        0.5403        0.4939     0.8841    0.3736   0.6713
  w/ p4     0.9504        0.5542        0.5015     0.8886    0.3055   0.7171
  w/ p5     0.9456        0.5532        0.4996     0.8884    0.3419   0.6938
  w/ p6     0.9490        0.5619        0.5073     0.8868    0.3108   0.6667
  w/ p7     0.9502        0.5722        0.5074     0.8897    0.3046   0.6780
The results presented in Section [Link] are readily comparable, as the growths in
F1, Macro F1, accuracy, and answer rate are consistent. Unlike Sections [Link]
and [Link], performance summarized by different categories of entity and culture
aspects is more sensitive to the answer rate, since we investigate the results by
breaking down the characters into finer-grained groups instead of treating them as a
whole. Therefore, we calculated the evaluation metrics over all query samples, setting each missing result to a fixed value outside the scope of the ground-truth labels (Zhu et al., 2024b). Moreover, we detailed the sentiment analysis results by
presenting the metrics of each category group, both as subjects (conveying senti-
ments to Harry) and objects (receiving sentiments from Harry).
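The evaluation strategy above can be sketched as follows; the placeholder value and helper name are our own illustration, not the authors’ code.

```python
# Unanswered queries receive a fixed placeholder outside the label set, so
# they always count as errors when metrics are computed over all samples.

MISSING = "no_answer"  # outside {positive, neutral, negative}

def accuracy_all(predictions, gold):
    """Accuracy over all samples; None predictions become MISSING."""
    filled = [p if p is not None else MISSING for p in predictions]
    return sum(p == g for p, g in zip(filled, gold)) / len(gold)

acc = accuracy_all(["positive", None, "negative"],
                   ["positive", "neutral", "neutral"])  # -> 1/3
```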
In general, including the seven levels of personalization consistently enhanced the performance for the groups Ghosts, Acromantula, Veela, Centaurs, and French. For entity
breakdowns, the negative effects on GPT-4 of including seven aspects occur only in
the cases where the entity groups (Muggles, Wizards and Witches, House-elves, and Goblins) act as objects. For culture breakdowns, the negative impacts of including seven aspects on GPT-4 are primarily observed when the culture groups (England, Gryffindor, Ravenclaw, Slytherin, and Hufflepuff) function as objects. However,
effects related to these breakdowns are observed in only two cases (Ravenclaw and
Hufflepuff) when they serve as subjects. For both breakdowns (entity and culture),
the negative effects on GPT-3.5 of including seven aspects occur relatively more
often and mainly when the entity group plays as a subject. The above observation
highlights the bias of LLMs (such as GPT-3.5 and GPT-4) towards the subject and
the object on sentiment analysis, especially when these models address personalized
neurosymbolic knowledge. Additionally, comparing the results of GPT-3.5 and GPT-
4, we observed that GPT-3.5, when integrated with personalized neurosymbolic
knowledge, achieved comparable or even superior performance to GPT-4. This is
validated by the results from several Entity or Culture groups including Muggles,
Ghosts, Acromantula, Veela, and England.
1.4.4 Summary
In this section, we studied the task of personalized sentiment analysis. Personas have been widely studied in commercial domains and web research (Jansen, 2022; Jansen et al., 2022a). Unlike conventional sentiment analysis tasks, which analyze sentiment based on the meaning of the text, personalized sentiment analysis targets individual sentiment perception. The difference is that, in the conventional setting, an identical statement yields the same sentiment prediction based on its meaning, whereas different people may perceive the same message differently based on their own personal preferences, personality traits, beliefs, background, etc. To this end, we devised a framework, termed the
Personalized Sentiment Analysis Pyramid, for tackling all these different facets
through seven different levels of personalization, namely: Entity, Culture, Religion,
Vocation, Ideology, Personality, and Subjectivity.
We evaluated the framework with a dialogue dataset sourced from Harry Pot-
ter novels. The evaluation showed that personalized neurosymbolic knowledge, i.e.,
seven levels of personalization, augmented LLMs’ performance on sentiment analy-
sis. We also analyzed the utility of each persona aspect and found that each individual
persona aspect can augment sentiment intensity classification results. Finally, we in-
vestigated the influence of persona information on several character groups in the
Harry Potter novels. Results showed that including persona information elevated
the performance of the groups Ghosts, Acromantula, Veela, Centaurs, and French. Furthermore, a bias of LLMs fed with personalized neurosymbolic knowledge towards subject and object groups was observed.
1.5 Conclusion
The pursuit of automating tedious or repetitive tasks has a rich history, with origins
tracing back to Ancient Egypt and the Greek Empire. Among the earliest documented
works on automation is the “Book of Ingenious Devices”, published in 850 by the
Banu Musa brothers. While we have made significant strides since those times,
thanks to advancements in mathematical modeling, we now face the challenge that
mere mathematics alone may not suffice to model the intricate processes by which
the human brain encodes and decodes meaning for complex tasks, including intuitive
decision-making, sense disambiguation, and narrative comprehension.
In this chapter, we discussed the importance of personalized AI (in the context of
sentiment analysis) and responsible AI (in the context of recommender systems). We
also proposed a novel approach to AI that centers on humanity, characterized by seven
essential features or pillars. In the future, we plan to define best practices for abiding
by such pillars. For example, current post-hoc interpretability methods may not be
the best way to implement Trustworthiness as they simply find correlations between
inputs and outputs of an AI model without really explaining its inner workings.
Similarly, there is no point in having a confidence score if this is calculated based
on the wrong parameters.
If we do not engineer it well, in fact, AI could very much end up being like
plastic: a great invention that made our life easier about a century ago, but which
is now threatening our own existence. NLU can aid this process by implementing a more transparent, brain-inspired way of processing language. We discuss this
in more detail in the next three chapters, in which we illustrate how to handle the
multiple cognitive processes associated with the different building blocks of language
understanding via three layers (Fig. 1.12), namely: Syntactics Processing, Semantics
Processing, and Pragmatics Processing.
This framework draws inspiration from two significant sources: the sentiment
analysis suitcase model (Cambria et al., 2017) and the Jumping NLP Curves
paradigm (Cambria and White, 2014). The former was specifically designed to
address polarity detection. It focused on a narrower scope compared to more gen-
eral NLU tasks and utilized a different arrangement of its components to tackle
sentiment-related challenges.
The latter conceptualized the evolution of NLP research as a series of three dis-
tinct curves, each representing different levels of complexity in handling natural
language. These curves illustrate the progression from basic language processing
techniques to more sophisticated approaches that handle increasingly complex lin-
guistic phenomena. The current framework builds on this paradigm by incorporating
its insights into the design and organization of NLU components, aiming to model
and manage the complexities of modern NLP challenges effectively.
A. Reading List
• Erik Cambria, Rui Mao, Melvin Chen, Zhaoxia Wang, and Seng-Beng Ho. Seven
Pillars for the Future of Artificial Intelligence. IEEE Intelligent Systems, 38(6):
62–69, 2023 (Cambria et al., 2023)
• Luyao Zhu, Rui Mao, Erik Cambria, and Bernard Jim Jansen. Neurosymbolic AI
for Personalized Sentiment Analysis. In Proceedings of HCII, 2024 (Zhu et al.,
2024b)
B. Relevant Videos
• Labcast about the Seven Pillars for the Future of AI: [Link]/SX1Cl_eDTLE
C. Related Code
• GitHub repository about ChatGPT Affect: [Link]/SenticNet/ChatGPT-Affect
D. Exercises
• Exercise 1. Choose a downstream task or application and explain in detail how
each of the Seven Pillars can be applied to it. Discuss how these pillars can
enhance the task’s effectiveness, accuracy, and ethical implications, considering
factors like interpretability, robustness, scalability, fairness, and privacy. Illus-
trate how these principles can be integrated into the development, deployment,
and monitoring phases of the system to ensure it operates efficiently and ethically.
1.6 Learning Resources
2.1 Introduction
Nevertheless, we argue that integrating symbolic and subsymbolic AI, also known
as neurosymbolic AI, is key to moving forward on the path from NLP to NLU.
Despite recent advancements in deep learning, statistical NLP does not achieve true
NLU, as it merely makes probabilistic guesses. Belinkov et al. (2017) conducted an
experiment to evaluate the word embeddings learned by neural machine translation
(NMT) on syntactic tasks, with results indicating that they are poor representations
for syntactic information. Therefore, to further improve the performance of semantic
and pragmatic tasks, it is of great benefit to address syntactics processing as a set of
subproblems. For instance, POS tagging has been incorporated into the decoding process (Feng
et al., 2019b) or used as an auxiliary task (Niehues and Cho, 2017; Mao and Li, 2021) to
improve NMT and metaphor detection, and lemmatization as a preprocessing step boosts
the accuracy of neural sentiment analysis on social media text (Symeonidis et al.,
2018). Hence, syntactics processing is an integral part of the neurosymbolic NLU
paradigm.
In this chapter, we examine the five most basic tasks in syntactics processing,
namely, microtext normalization, sentence boundary disambiguation (SBD), POS
tagging, text chunking, and lemmatization. There is a variety of other tasks in
this field, e.g., stopword removal, negation detection, constituency parsing, and
dependency parsing. Here, we decide to focus on the most foundational syntactic
processing tasks that can be helpful to high-level syntactic tasks and other NLU
applications. To elaborate, dependency parsing analyzes the grammatical structure
of a sentence based on the dependencies between words. It is strongly related to
POS tagging linguistically; indeed, the majority of existing algorithms heavily rely on POS
tags (Chen and Manning, 2014; Dyer et al., 2015; Zhou et al., 2020a). Similarly,
constituency parsing analyzes sentences by breaking them down into constituents.
It can be regarded as hierarchical text chunking, establishing structure within the
chunks. Therefore, we view them as high-level tasks that can benefit from the newly
introduced low-level syntactic processing techniques and justify separate reviews.
Although low-level syntactic processing tasks have gradually faded out of public
view in many NLP conference proceedings, previous surveys presented significant
findings in individual syntactic processing tasks. Satapathy et al. (2020) presented a
comprehensive review on microtext normalization. They discussed the similarities
between microtext and brachygraphy, and suggested the potential applications of
microtext normalization in more complex NLU tasks. Read et al. (2012) reviewed
10 publicly available SBD systems, and argued that the published results of these
systems are evaluated on different task definitions and data annotations. Thus, they
assessed the systems on unified task definition and gold-standard datasets, as well as
on user-generated content (UGC) corpora. Manning (2011) conducted an error analysis
of the POS tagging algorithms existing at the time, deduced seven common error categories,
and proposed a solution for the errors and inconsistencies in the gold-standard dataset.
Kanis and Skorkovská (2010) compared manual and automatic Czech lemmatizers,
based on their influence on the performance of information retrieval. However, many
of these reviews are outdated and fail to cover recent advancements powered by, e.g.,
deep learning techniques. Additionally, some of them focus only on the applications
of specific tasks.
Fig. 2.1: Outline of this chapter. Each subtask is explained in terms of different
technical trends.
2.2 Microtext Normalization
In the era of short message service (SMS) and social media, a plethora of text data
can be mined from the web. Microtext is a term referring to shorthand writing, com-
monly seen in these informal texts. The occurrence of microtext poses a problem for
NLU, as it uses informal language, e.g., shortened and lengthened words, abbreviated
phrases, and missing grammatical components, which makes it challenging for machines
to comprehend the meaning of the source text. Therefore, microtext normalization
is essential in NLU. The task aims to convert informal texts into their standard
forms, which can be easily processed by downstream tasks such as sentiment analysis
for social media (Brody and Diakopoulos, 2011; Satapathy et al., 2017b, 2019b;
Chaturvedi et al., 2024), information retrieval (Bontcheva et al., 2013b), and ques-
tion answering (QA) systems (Mittal et al., 2014). Microtext is often highly compact,
informal, and semi-structured (Rosa and Ellen, 2009). There are many abbreviations,
acronyms, lax spelling and grammar, as well as metadata that establish the context.
Although microtext is a written language, it is strongly influenced by phonetics, re-
sulting in language-dependent character repetition and substitution phenomena, e.g.,
“nooooo” for “no”, and “luv” for “love”. Microtext normalization is also an important
preprocessing task in the context of computer vision: e.g., optical character recognition
(OCR) can be used to digitize handwritten or printed informal notes, and microtext
normalization can then be applied to standardize the resulting text for further processing.
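The elongation and slang phenomena above can be sketched in a few lines of code; the slang lexicon and vocabulary below are toy stand-ins for real resources:

```python
import re

# Toy slang lexicon and in-vocabulary list (illustrative stand-ins for real resources).
SLANG = {"luv": "love", "gr8": "great", "u": "you"}
VOCAB = {"no", "so", "good", "you", "love", "great"}

def collapse_repeats(token, max_run=2):
    """Collapse runs of the same character longer than max_run ('nooooo' -> 'noo')."""
    return re.sub(r"(.)\1{%d,}" % max_run, r"\1" * max_run, token)

def normalize(token):
    """Shorten elongations, then try the slang lexicon and the vocabulary."""
    token = collapse_repeats(token.lower())
    for form in (token, re.sub(r"(.)\1+", r"\1", token)):  # e.g. 'noo', then 'no'
        if form in SLANG:
            return SLANG[form]
        if form in VOCAB:
            return form
    return token

print(normalize("nooooo"), normalize("luv"))  # no love
```

Real systems replace the two toy dictionaries with large lexicons and statistical or neural candidate rankers, as the following sections describe.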
Table 2.1: Popular text and phonetic corpora for microtext normalization.
Even without any knowledge about texting conventions, humans can understand
this type of irregular word by sounding it out. Thus, phonetics-based abbrevia-
tions and spelling alterations are invented and evolve more freely than orthography-
based ones. For instance, “schewpid” is created to imitate “stupid” in a British
accent. It is uncommonly used, but most English speakers can understand it, even
if they have never seen it before. As for orthography-based abbreviations and alter-
ations such as “wdym” for “what do you mean”, less Internet-savvy people will have
to look up their meaning. Text-level mapping alone is therefore often not adequate
to address the sparsity caused by phonetic-level alterations, so typical existing
corpora (see Table 2.1) are prepared for both text-level and phonetic-level mappings.
Accuracy, top-n accuracy, F1 score, and bilingual evaluation understudy (BLEU)
score (Papineni et al., 2002) are widely used evaluation measures. Specifically, top-n
accuracy considers the system correct if the correct standard form is among its top n
predictions.
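Top-n accuracy, as defined here, can be computed directly (a minimal sketch; predictions are assumed to be ranked candidate lists):

```python
def top_n_accuracy(ranked_predictions, gold, n):
    """Fraction of items whose gold standard form appears in the top-n predictions."""
    hits = sum(1 for preds, g in zip(ranked_predictions, gold) if g in preds[:n])
    return hits / len(gold)

preds = [["love", "lav", "lively"], ["to", "too", "two"]]
gold = ["love", "two"]
print(top_n_accuracy(preds, gold, 1))  # 0.5: only 'love' is ranked first
print(top_n_accuracy(preds, gold, 3))  # 1.0
```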
In the remainder of this chapter, we review and categorize existing microtext
normalization approaches by linguistic, statistical, and neural network methods.
Given a tokenized input string, candidate outputs are produced from each of these
channels via binary search. The syntactic analysis module is based on a bottom-up
parser. It takes the lexically normalized string outputted from the previous module,
and builds the corresponding parsing tree using grammar rules. The module will
then decide whether the sentence is valid or not, which helps to make the text
normalization task more efficient. Desai and Narvekar (2015) introduced a method to
normalize OOV words. First, a set of rules is applied to standardize elongated words
such as ‘goooood’. Then, the most probable candidates for the OOV tokens are retrieved
from compiled databases. Lastly, any remaining noise is normalized using the
Levenshtein edit distance, i.e., the minimum number of single-character
insertions, deletions, and substitutions required to transform one word into another.
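The Levenshtein distance just defined can be computed with the standard dynamic program:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("goood", "good"))      # 1
print(levenshtein("kitten", "sitting"))  # 3
```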
Mittal et al. (2014) proposed a three-level architecture to process microtext in QA
systems. First, noise present in the SMS tokens is replaced with the closest
dictionary words using Soundex (Beider, 2008), the Metaphone algorithm (Philips,
1990), and a modified version of the longest common subsequence (LCS) algorithm,
termed Phonetic LCS (PLCS), that also takes phonetic similarity into account. Next,
a semantically similar set of candidate questions is selected. Lastly, the results are
optimized via syntactic tree matching (STM) and WordNet-based similarity.
Satapathy et al. (2017b) proposed a phonetic-based algorithm to normalize tweets,
which is used to enhance the accuracy of polarity detection. It is based on the
assumption that humans are able to understand character repetition and substitution
in unknown informal words because they automatically shift to the phonetic domain
when they read text. Therefore, they adopted an ensemble approach that mainly
relies on phoneme, and used a lexicon to address other forms of OOV words such
as acronyms and emoticons. Given a tweet, all the OOV words are matched with the
lexicon to find their in-vocabulary forms and corresponding polarity class. Then, any
leftover unnormalized tokens are processed using Soundex, which uses homophones
to encode texts so that characters with similar pronunciations can be easily matched.
Since their objective is to boost polarity detection, the phonetic code of OOV words
is matched against the knowledge base SenticNet (Cambria et al., 2024). The final
output is fed into a polarity classifier to test the effectiveness of normalization,
yielding promising results.
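Soundex itself is simple to sketch; the following is the standard American variant (first letter kept, consonants mapped to digit classes, vowels dropped, h/w not breaking runs, padded to four characters):

```python
def soundex(word: str) -> str:
    """Standard American Soundex code of a word, e.g. 'luv' and 'love' both map to L100."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    first, prev = word[0].upper(), codes.get(word[0], "")
    digits = []
    for ch in word[1:]:
        if ch in "hw":            # h and w do not reset the previous code
            continue
        code = codes.get(ch, "")  # vowels map to "" and reset the run
        if code and code != prev:
            digits.append(code)
        prev = code
    return (first + "".join(digits) + "000")[:4]

print(soundex("luv"), soundex("love"))  # L100 L100
```

Because similar pronunciations collapse onto the same code, matching phonetic codes rather than raw strings lets "luv" retrieve "love" directly.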
All of the phonetic algorithms used in the above-mentioned methods encode words
based on their spelling, and thus are only applicable to English or other languages written
in the Latin alphabet. An alternative is to use the International Phonetic Alphabet (IPA),
which more accurately represents pronunciations and can be used to encode any
language. However, its disadvantage is also evident: such a fine-grained encoding
system can be too specific for phonetic matching. Thus, it needs to be simplified
accordingly. Khoury (2015) proposed a phonetic tree-based microtext normalization
model. The model determines the probable pronunciation of English words based
on their spelling via a radix-tree structured phonetic dictionary. To build the tree,
Khoury created a dataset of IPA transcriptions of the English Wiktionary1.
1 [Link]
Jahjah et al. (2016) presented a novel word-searching strategy, built on the idea
of sounding out the consonants of a given word. Their algorithm uses spelling
and phonetic strategies to extract the base consonant data from misspelled and real
phrases. First, both visual signature and phonetic signature of real English words are
extracted using a set of rules. Then, the model finds a set of in-vocabulary words
with identical signatures to the OOV word, along with their occurrence probability,
and applies several heuristics to find the best matching in-vocabulary word.
Building upon their previous work, Satapathy et al. (2019b) proposed a cognitive-
inspired microtext normalizer, named PhonSenticNet, to aid concept-level sentiment
analysis for SMS texts. Given a message, a binary classifier is first applied to detect
whether microtext is present, aiming to reduce execution time. If true, PhonSenticNet
finds the closest match for every concept in the input sentence based on phonetic
distance, which is computed using the Sorensen similarity algorithm based on both
Soundex and IPA encodings. Results showed that PhonSenticNet
outperforms their previous algorithm.
ŝ = argmax_s P(s) P(t | s)
where the language model P(s) encodes which strings in standard form are valid, and
the error term P(t | s) models how the standard forms are distorted. The former can
be easily constructed by exploiting large amounts of unlabeled data, therefore the
paper focused on the latter, which is implemented by a letter-to-phoneme conversion
HMM. Given a word in standard English, the HMM formulates the graphemes in
the word as observation states, and their corresponding phonemes as hidden states.
To train and test their method, they created a word-aligned corpus of SMS texts and
standard English.
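A toy instantiation of this noisy-channel argmax (the probability tables below are illustrative stand-ins for a learned language model and error model, not real estimates):

```python
# Toy language model P(s) and error model P(t|s); the numbers are illustrative only.
P_s = {"love": 0.6, "live": 0.4}
P_t_given_s = {("luv", "love"): 0.3, ("luv", "live"): 0.1}

def decode(t, candidates):
    """Return the standard form s maximizing P(s) * P(t|s)."""
    return max(candidates, key=lambda s: P_s.get(s, 0.0) * P_t_given_s.get((t, s), 0.0))

print(decode("luv", ["love", "live"]))  # love, since 0.6*0.3 > 0.4*0.1
```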
Jiampojamarn et al. (2007) pointed out that the previous letter-to-phoneme con-
version systems only support one-to-one letter-phoneme alignment, which leads to
difficulties in handling single letters that correspond to two phonemes, or vice versa.
This can be partially mitigated by manually constructing a fixed list of merged
phonemes prior to the alignment process. The paper, however, proposed a better so-
lution to overcome this limitation – an automatic many-to-many alignment method.
Given an input word, a many-to-many aligner establishes the appropriate alignments
across the graphemes and phonemes. After this process, the input word is represented
as a set of letter chunks, each containing one or two letters aligned with phonemes.
The chunk boundaries are determined via a bigram letter chunking prediction model
based on instance-based learning (Aha et al., 1991). The prediction model generates
all the bigrams in a word, automatically determining whether each of them should be
a double letter chunk based on the context. Subsequently, an HMM embedded with a
local classifier is applied to find the globally most probable sequence of phonemes
for the given word. The local classifier, also based on instance-based learning,
generates the phoneme candidates for every letter chunk according to the context.
Thus, compared to the conventional HMM described above, their modified HMM is able
to utilize context information from not only the phoneme side but also the grapheme
side.
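The idea of one- and two-letter chunks can be illustrated with a greedy dictionary lookup standing in for the learned bigram chunking model; the digraph list is illustrative:

```python
DIGRAPHS = {"sh", "ch", "th", "ph"}  # illustrative two-letter chunks

def chunk_letters(word):
    """Greedy stand-in for the bigram chunking model: merge known digraphs
    into double-letter chunks, keep everything else as single letters."""
    chunks, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            chunks.append(word[i:i + 2])
            i += 2
        else:
            chunks.append(word[i])
            i += 1
    return chunks

print(chunk_letters("phonetics"))  # ['ph', 'o', 'n', 'e', 't', 'i', 'c', 's']
```

In the actual system, chunk boundaries are predicted from context by the instance-based model rather than taken from a fixed list.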
Bartlett et al. (2008) applied a structured SVM model called SVM-HMM
(Altun et al., 2003) to orthographic syllabification, which acts as a sub-system
to improve letter-to-phoneme conversion. The paper formulated orthographic syl-
labification as a sequence labeling problem, employing SVM-HMM to predict tag
sequences. Unlike a standard SVM, SVM-HMM employs a Markov assumption during
scoring, and thus is able to consider complete tag sequences during training. The
motivation for using SVM-HMM over a plain HMM is mainly the benefit of the discrim-
inative property of SVMs, and the leeway to adopt feature representations without any
conditional independence assumptions. The paper explored two tagging schemas for
syllabification. One is positional tags (NB tags) (Bouma, 2003) that label whether
each letter occurs at a syllable boundary or not. The other is structural tags (ONC
tags) (Daelemans et al., 1997; Skut et al., 2002), which represent the role each letter
is playing within the syllable, namely onset, nucleus, and coda. The former is simple,
straightforward, and adaptable to different languages. The latter, on the other hand,
is able to capture the internal structure of a syllable to improve accuracy, but does not
explicitly represent syllable breaks. To mitigate the weakness of the ONC tag schema,
the paper combined the two schemas and proposed a hybrid Break ONC tag schema.
Experiments show that their model not only outperforms previous syllabification
systems, but also improves the accuracy of letter-to-phoneme conversion. Kaufmann
and Kalita (2010) employed an SMT system for X message normalization. Given
a tweet, it is first passed through a syntactic normalization module, where sets of
rules on the orthographic- and syntactic-level are applied to preliminarily normalize
the input sentence. Next, the normalized tweet is fed into the machine translation
module, which is implemented by an existing SMT model called Moses (Koehn
et al., 2007), to transform the tweet into standard English.
Xue et al. (2011) introduced a method that considers orthographic factors, pho-
netic factors, contextual factors, and acronym expansions, formulating each of them
as a noisy channel. The core concept is that a non-standard term should be similar
to its standard form in respect of one or more of these factors, and thus each channel
is responsible for one aspect of the distortion that converts the intended form into
the observed form. The grapheme channel models the corruption of spellings. The
phoneme channel causes distortion in pronunciations. The context channel changes
terms around a target term. The acronym channel transforms a phrase into a single
term. These channel models are combined using two variations of channel probabili-
ties, namely Generic Channel Probabilities and Term Dependent Channel Probabili-
ties. The former assumes that the probability of a term being emitted through a noisy
channel is independent of its standard form, whereas the latter takes into considera-
tion that some standard forms are more likely to be distorted via a certain channel
in reality. Experiments show that the two variants achieve similar performance, with
the latter being slightly better on the SMS dataset.
Han and Baldwin (2011) theorized that supervised learning is not adept at han-
dling X OOV words due to data sparsity. Hence, they introduced an unsupervised
classifier that detects ill-formed words and generates candidate in-vocabulary words
using morphophonemic similarity. Their normalization strategy consists of three
stages. First, a confusion set of candidate in-vocabulary words is generated for each
OOV word based on morphophonemic variations, which are produced using lexical
edit distance and the double Metaphone algorithm (Philips, 2000). Then, based on
its confusion set, a linear kernel SVM classifier (Fan et al., 2008) is employed to
detect whether the given OOV word is actually an ill-formed word. Lastly, if the
OOV word is indeed ill-formed, the most likely candidate is selected based on
morphophonemic similarity and context. Although their method outperforms most
supervised models, there are limitations. The most prominent one is that it only
targets single-token words, and thus is unable to handle phrases and acronyms.
Pennell and Liu (2011) described an SMT-based system for expanding abbrevia-
tions found in informal texts. The method operates in two phases. In the first phase,
a character-level SMT model generates possible hypotheses for each abbreviation.
By training the model at character level, the model is able to learn the common char-
acter abbreviation patterns regardless of their associated words, and thus alleviates
the OOV problem. In the second phase, an in-domain language model decodes the
hypotheses, choosing the most likely one in the context.
Similarly, Li and Liu (2012) introduced a word-level framework based on an
SMT approach, which performs well in translating OOV words into in-vocabulary
words. The framework can be divided into two components. One component is a
character-based SMT module that translates non-standard words into standard words
by matching their character sequence. The other component is a two-stage SMT
module that leverages phonetic information. Non-standard words are first translated
into possible phonetic symbols, which are then mapped to standard words. The candidate word
lists generated by the two components are then combined and ranked using a set of
heuristic rules.
For missing word recovery, the paper specifically targeted the omission of “be”
in English. Hence, a CRF is employed to label every token in the input sentence
with a tag that indicates whether the token should be followed by a conjugation of
“be”. Furthermore, they proposed a novel beam-search decoder that can integrate
different types of normalization operations, including statistical and rule-based oper-
ations. Given an input sentence, the decoder iteratively produces new sentence-level
hypotheses, evaluating them to retain the plausible ones, until it finds the best nor-
malization. Each hypothesis is generated by several hypothesis producers, each of
which focuses on a different target aspect of informal texts, applying the correspond-
ing type of normalization operation to the sentence. In other words, the objective of
beam-search is to find the best pipeline of hypothesis producers.
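Generically, such a search over pipelines of hypothesis producers can be sketched as a beam search; the producers, lexicon, and scoring function below are hypothetical placeholders, not the authors' actual components:

```python
import re

def beam_search(sentence, producers, score, beam_width=3, steps=3):
    """Iteratively expand sentence-level hypotheses with every producer and
    keep only the beam_width most plausible ones."""
    beam = [sentence]
    for _ in range(steps):
        candidates = set(beam)
        for hyp in beam:
            for produce in producers:
                candidates.add(produce(hyp))
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
    return beam[0]

# Hypothetical producers, each targeting one aspect of informal text.
producers = [
    lambda s: re.sub(r"(.)\1{2,}", r"\1", s),  # collapse elongations
    lambda s: re.sub(r"\bu\b", "you", s),      # expand an abbreviation
    lambda s: s.replace("luv", "love"),        # slang lookup
]
# Hypothetical plausibility score: number of in-lexicon tokens.
LEXICON = {"i", "love", "you", "so", "much"}
score = lambda s: sum(w in LEXICON for w in s.split())

print(beam_search("i luv u sooo much", producers, score))  # i love you so much
```

Each step applies every producer to every surviving hypothesis, so the beam effectively explores orderings of the producers, which is exactly the pipeline-selection view described above.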
Yang and Eisenstein (2013) suggested that the data sparsity of social media texts
makes unsupervised learning extremely challenging. To solve this problem, they
proposed a unified unsupervised model based on a maximum-likelihood framework.
Given an informal sentence, a log-linear model is applied to model the string similar-
ity between tokens in tweets and standard English. Since it is an unsupervised, locally-
maximized conditional model, typical dynamic programming techniques such as the
Viterbi algorithm are not ideal. Instead, they applied the sequential Monte Carlo al-
gorithm (Cappé et al., 2007) for training. The log-linear model is combined with a
language model for standard English to output the desired conditional probability.
Based on the assumption that syllables play a fundamental role in non-standard
word generation in social media, Xu et al. (2015b) viewed the syllable as the basic
unit and extended the conventional noisy channel approach. A given non-standard
word is first segmented into syllables to identify the non-standard syllables.
Then, the error term of the noisy channel can be represented as syllable similarity,
which is an exponential potential function that combines orthographic similarity
and phonetic similarity. The former is measured by edit distance and LCS, whereas
the latter is measured by phonetic edit distance and phonetic LCS based on letter-
to-phone transcripts. Results showed that the syllable-based approach indeed yields
significant improvement. Additionally, this method has the advantage of being robust
and domain independent.
In recent years, deep learning models have been widely used in the field of NLP (Otter
et al., 2020; Mao et al., 2019), as they are able to automatically extract features from
input text. Neural network approaches bring this advancement to the microtext
normalization domain. As stated in the previous section, microtext normalization
can be viewed as a translation problem. Hence, it is a natural progression to employ
the encoder-decoder framework commonly adopted by NMT (Yang et al., 2020a).
Additionally, since the focus of microtext normalization has gradually shifted
from the word level to the character level, neural networks are frequently utilized to extract
character-level features.
2.2.4 Summary
With neural network approaches, Chrupała (2014) addressed the target-domain
problem of NMT-based approaches by proposing a semi-supervised alternative. How-
ever, this method only targets single-token words. Leeman-Munk et al. (2015) cir-
cumvented both of these limitations by using a character-level pipeline model, which
conversely can introduce propagated errors. Lusetti et al. (2018); Satapathy et al.
(2019a); Lourentzou et al. (2019) all made use of the encoder-decoder structure
with various neural networks, which can be considered an NMT-based approach.
Nonetheless, the target-domain problem is less prominent nowadays, because a) more
annotated datasets are available, and b) deep learning approaches are more adept at ex-
tracting generalized information than traditional SMT systems. Table 2.2 presents
a summary of the introduced methods and their concerns. Notably, most linguistic
and statistical methods use a combination of orthographic matching and phonetic
matching by computing their corresponding similarities. This is expected, as each
provides complementary information that the other cannot address.
There are methods that use neither, formulating their target problems as sequence
prediction tasks instead, using either graphical models or neural networks. For some
approaches, contexts are taken into consideration by incorporating syntactic fea-
tures (Fossati and Di Eugenio, 2008; Kaufmann and Kalita, 2010; Zhang et al.,
2013), or using attention mechanisms (Lusetti et al., 2018; Satapathy et al., 2019a;
Lourentzou et al., 2019). Additionally, syllable-level and character-level models are
able to learn more fine-grained patterns than word-level ones, and thus have gained
popularity in recent years, especially the latter. Since microtext normalization
is a field with abundant raw text, we also mark out methods that are able to utilize
unlabeled data. Finally, there are specific sub-problems that were addressed by a portion
of the reviewed linguistic and statistical methods, namely, abbreviation and acronym
expansion and sentiment-related errors. Table 2.3 shows the performance of the introduced
methods, organized by their target tasks. On the SMS-C dataset, Li and Liu (2012)
obtained the highest top-1 accuracy, whereas Liu et al. (2012) obtained the highest
top-20 accuracy. Zhang et al. (2013) achieved the best performance on LexNorm1.1,
and Xu et al. (2015b) on LexNorm1.2.
To sum up, linguistic approaches build upon simple text-level and phonetic-level
matching algorithms, utilizing hand-crafted rules, lexicons, or knowledge bases. As
mentioned above, they are adequate for targeting specific problems, e.g., IPA-based pho-
netic matching, sentiment, and context provided by metadata. However, their drawbacks
are prominent: rule-based and lexicon-based methods require a lot of human effort
and are hence unable to adapt to other language domains. Statistical approaches, on the
other hand, are much less labor-intensive. These models regard microtext normaliza-
tion as an SMT problem or a sequence labeling problem. Although most of them still
rely on feature engineering, they pave the way for neural network approaches. With
the help of neural networks, context-sensitive character-level features can be easily
extracted by Seq2Seq models, which are also adaptable to other languages. However, neu-
ral network approaches still have much room for improvement. To our knowledge,
there is not yet a deep learning normalizer that directly addresses the data sparsity
problem caused by phonetics-based alterations.
Task | Method | Dataset | Result
Letter-to-phoneme | Jiampojamarn et al. (2007) | CMUDict | 65.6 acc.
Letter-to-phoneme | Jiampojamarn et al. (2007) | CELEX Eng | 83.6 acc.
Letter-to-phoneme | Bartlett et al. (2008) | CELEX | 89.99 acc.
QA system | Mittal et al. (2014) | FAQ corpus | 75 acc.
Missing tokens | Wang and Ng (2013) | NUS SMS | 66.54 BLEU
Parsing | Zhang et al. (2013) | SMS-C | 81 F1
Parsing | Zhang et al. (2013) | LexNorm1.1 | 91.9 F1
This gap can potentially lead to performance drops for languages that are rich in such
alterations on social media, such as Chinese. Thus, incorporating phonetic features
or a phonetic-level encoder can be a potential direction for future work. Furthermore,
the use of microtext is strongly affected by the geolocation and dialect of users. For
instance, when expressing laughter in Latin alphabets, there is a tendency to use “lol”
for English speakers, “hhh” for Chinese speakers, “www” for Japanese speakers,
“kkk” for Korean speakers, “555” for Thai speakers, etc. Metadata that provides
such information is scarcely utilized in existing methods. As such, this would also
be an interesting direction to explore.
2.3 Sentence Boundary Disambiguation
SBD, which decides where sentences begin and end in raw texts, is an important yet
overlooked preprocessing task for many NLU applications. Downstream tasks such
as machine translation (Matsoukas et al., 2007; Zhou et al., 2017) and document
summarization (Jing et al., 2003; Boudin et al., 2011) rely on predetermined sentence
boundaries for good performance.
Grefenstette and Tapanainen (1994) were the first to propose a rule-based algo-
rithm to resolve the ambiguities caused by the usage of periods. Specifically, a set of
rules is manually created using regular expressions to represent possible patterns
where periods do not occur as full stops. The system then matches the surround-
ing context of every period in the text against the regular expressions to predict whether
it is a sentence boundary. Their system obtains reasonably high accuracy with great
computational efficiency. To enable automated extraction of rules, Stamatatos et al.
(1999) applied transformation-based learning to SBD. Compared to the original
transformation-based algorithm (Brill, 1995), which is discussed in detail in the
POS tagging section, their method limits the number of possible transformations.
This is achieved by maintaining two sets of rules for each punctuation mark. Initially,
all punctuation marks are considered to be full stops. Then, for each of them, the
system automatically learns a set of rules that trigger the removal of a sentence
boundary, and a set of rules that trigger the insertion of a sentence boundary. This
mechanism ensures that the maximum number of transitions for each punctuation
mark is two.
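A miniature disambiguator in the spirit of these rule-based systems might look as follows (the abbreviation list and patterns are illustrative only):

```python
import re

ABBREVIATIONS = {"dr", "mr", "mrs", "etc", "vs"}  # illustrative list only

def split_sentences(text):
    """Split on sentence-final marks, skipping periods that belong to a known
    abbreviation or sit between two digits (a decimal point)."""
    boundaries = []
    for m in re.finditer(r"[.?!]", text):
        i = m.start()
        tokens = text[:i].split()
        prev_token = tokens[-1].lower().rstrip(".") if tokens else ""
        if prev_token in ABBREVIATIONS:
            continue
        if 0 < i < len(text) - 1 and text[i - 1].isdigit() and text[i + 1].isdigit():
            continue  # decimal point, e.g., 3.14
        boundaries.append(i)
    sentences, start = [], 0
    for b in boundaries:
        sentences.append(text[start:b + 1].strip())
        start = b + 1
    return sentences

print(split_sentences("Dr. Smith paid 3.14 dollars. He left."))
```

The brittleness discussed above is visible even here: every new abbreviation or numeric convention requires another hand-written exception.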
Sadvilkar and Neumann (2020) developed a rule-based SBD system called
PySBD, whose reasoning mechanism is explainable and whose performance is
comparable to statistical models. Unlike the other supervised SBD methods,
PySBD is trained and evaluated with the Golden Rule Set (GRS), a language-
specific corpus designed for SBD. It contains sets of hand-crafted rules over a variety
of domains that are carried out in a pipeline fashion. Despite the decent accuracy,
there are some drawbacks to rule-based methods. Firstly, periods exhibit absorption
properties, meaning that when multiple periods occur they are often marked as one;
it is therefore challenging to build a comprehensive set of rules that do not
contradict each other. Furthermore, such systems are often developed using a
specific corpus, so applying them to a corpus in another language or domain
is difficult. Therefore, many machine learning methods have been proposed for the SBD
task to address these shortcomings.
To reduce the labor involved in developing hand-crafted rules and features, Reynar
and Ratnaparkhi (1997) proposed an SBD system based on MaxEnt (Ratnaparkhi,
1996). It requires only simple information about the candidate punctuation marks.
For each candidate punctuation mark, the system utilizes morpholexical features
of its trigram tokens, estimating their joint probability distribution as a sentence
boundary. The performance of their system is comparable to that of systems which
require vastly more resources.
Schmid (2000) proposed an unsupervised learning method by manually listing all
the possible scenarios where a period may denote a decimal point or an abbreviation.
For each of the listed scenarios, a probability model is created accordingly to predict
whether the period is a full-stop. Unlike previous token-based methods, which rely on
the focal token itself and its local context to determine whether it is an abbreviation,
their model computes the distribution of possible abbreviations by scanning through
the whole corpus. This method can be referred to as a type-based approach (Kiss
and Strunk, 2002).
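The type-based idea can be caricatured in a few lines: rather than judging each period from its local context, we scan the whole corpus for token types that are (almost) always followed by a period (a toy frequency heuristic, not Schmid's actual probability models):

```python
import re
from collections import Counter

def find_abbreviations(corpus, threshold=0.8):
    """Type-based heuristic: a token type that is (almost) always followed by
    a period across the whole corpus is likely an abbreviation."""
    with_period, total = Counter(), Counter()
    for token in re.findall(r"[A-Za-z]+\.?", corpus):
        word = token.rstrip(".").lower()
        total[word] += 1
        if token.endswith("."):
            with_period[word] += 1
    return {w for w in total
            if total[w] > 1 and with_period[w] / total[w] >= threshold}

corpus = "Dr. Smith met Dr. Jones. Later, Smith left. Jones stayed."
print(find_abbreviations(corpus))  # {'dr'}
```

Because the decision is made per token type over the whole corpus, a single ambiguous occurrence no longer misleads the classifier, which is the key advantage of the type-based view.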
Wong and Chao (2010); Wong et al. (2014) applied an incremental algorithm
to SBD. Their system is based on the i + Learning principle, an incremental
decision tree learning algorithm, making it flexible to changes and suited for online
learning. The algorithm can be divided into two phases. First, the system constructs
a top-down binary decision tree offline using the initial training data. The resulting
tree acts as an optimal base for the second phase, an online procedure that
adopts the tree transposition mechanism of Incremental Tree Induction (Utgoff et al.,
1997) as a bias to grow and revise the base tree. This method dynamically revises
the tree according to new incoming data while preserving the essential statistical
information. Thus it is able to adapt to texts in a different language or domain without
the need to retrain from scratch. Their system is trained on morpholexical features
extracted from trigram contexts.
All of the aforementioned SBD systems, with the exception of Kiss and Strunk
(2002, 2006), use n-gram-based techniques to extract textual information, which leads
to sparse vector space problems. To address this, Treviso et al. (2017) suggested word
embedding as an alternative and verified which embedding induction method works
best for SBD. They investigated word2vec (Mikolov et al., 2013a), Wang2Vec (Ling
et al., 2015a), and FastText (Bojanowski et al., 2016), training the vectors with both
the Continuous Bag-Of-Words (CBOW) and Skip-gram strategies. They then tested
the embeddings on a Recurrent CNN (RCNN) (Treviso et al., 2016) for sentence
segmentation. Experiments show that word2vec consistently performs better than
the other two methods. Additionally, the Skip-gram strategy generally yields better
results than the CBOW for the Wang2Vec and the FastText, whereas for word2vec the
better strategy depends on the corpus. Furthermore, they also compared the model
performance between using only the extracted morpholexical features and adding
morphosyntactic features for SBD. Interestingly, the results show that the explicitly
added features do not make a difference in terms of accuracy. They theorized that
the word embedding alone carries sufficient morphosyntactic information for SBD.
Knoll et al. (2019) utilized both word embeddings and character embeddings
to further capture morpholexical features. Based on the observation that previous
SBD systems perform poorly on a domain-specific corpus such as clinical texts, they
proposed a deep learning algorithm to address this problem. First, the input text is
tokenized and transformed into word embeddings using FastText. The text is also fed
into a CNN layer (Collobert et al., 2011) to obtain character embeddings, which are
summed with the word embeddings. The final word representation is passed through
a BiLSTM layer (Graves et al., 2013) and a sigmoid-activated dense layer to output
the log-probability of a word being the start of a sentence. They tested their algorithm
on the Medical Information Mart for Intensive Care (MIMIC) corpus (Johnson et al.,
2016) and a dataset drawn from the Fairview Health Services (FV). Results suggest
that the deep learning approach indeed outperforms the traditional ones, especially on
corpora from different domains and corpora where sentences are often not terminated
by punctuation marks.
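The combination of word-level and character-level representations described above can be sketched as follows. This is an illustrative NumPy toy, not the actual architecture of Knoll et al. (2019): the character CNN is stood in for by max-pooling over character embeddings, and all vocabularies, dimensions, and weights are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB = 8  # shared embedding width (illustrative; real systems use hundreds)

# hypothetical vocabularies with random embeddings
word_emb = {w: rng.normal(size=EMB) for w in ["the", "patient", "was", "discharged"]}
char_emb = {c: rng.normal(size=EMB) for c in "abcdefghijklmnopqrstuvwxyz"}

def char_features(word):
    """Stand-in for the character CNN: max-pool over character embeddings.
    A real convolution would slide filters over character windows."""
    mat = np.stack([char_emb[c] for c in word])
    return mat.max(axis=0)

def word_representation(word):
    # Knoll et al. sum the character-level vector with the word embedding
    return word_emb[word] + char_features(word)

def p_sentence_start(word, weights, bias=0.0):
    # sigmoid-activated dense layer over the combined representation;
    # a BiLSTM over the sentence would normally sit in between
    z = word_representation(word) @ weights + bias
    return 1.0 / (1.0 + np.exp(-z))

w = rng.normal(size=EMB)
p = p_sentence_start("patient", w)
assert 0.0 < p < 1.0
```

The BiLSTM layer is omitted here to keep the sketch focused on how the two embedding levels are merged.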
It is difficult for word-based approaches to robustly capture contexts larger than the
focal token. Syntax-based SBD systems can solve this problem by utilizing POS tags.
Intuitively, we know that the number of possible POS patterns for a bigram is much
less than the number of possible word patterns, thus lessening sparsity problems.
Additionally, the syntactic function of a word makes a big difference when identifying
abbreviations. For instance, when predicting whether a capitalized word following a
period is a proper name or a common word, taking into account the POS tags of its
trigram context is more effective than relying only on their morpholexical properties.
Palmer and Hearst (1994) were the first to incorporate POS tagging in the task of
SBD. They proposed an efficient and portable system based on a feed-forward neural
network. The core idea is to use POS probabilities of the tokens surrounding a
punctuation mark as input to the feed-forward network, which outputs an activation
value to determine what label to assign to the punctuation mark. First, the system
uses a slightly modified version of the PARTS POS tagger (Church, 1989), which
also produces the frequency counts of the POS tags associated with each token.
The system then maps them into a more generalized set of 18 POS categories. Each
input token in the n-gram context around a punctuation mark is then represented by a
descriptor array indicating its probability distribution over these 18 categories, in
addition to two flags that mark whether the word is capitalized and whether it follows a
punctuation mark. Subsequently, these descriptor arrays are fed into a fully-connected
hidden layer with a sigmoid activation function, and then into an output unit to
decide whether the punctuation marks are full-stops. The system also introduces
two adjustable thresholds to leave room for difficult ambiguities. When the output
score falls below the first threshold, the punctuation mark is not a sentence boundary.
When the score is higher than the second threshold, it is a sentence boundary. If the
score lies between these two thresholds, the system is under-informed to make a
confident prediction, and the punctuation mark is flagged accordingly for later use.
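The two-threshold decision logic can be sketched in a few lines; the threshold values here are illustrative, not the tuned ones from the paper:

```python
def classify_punctuation(score, low=0.3, high=0.7):
    """Three-way decision over the network's output activation.
    Thresholds are illustrative; the originals are adjustable parameters."""
    if score < low:
        return "not-boundary"
    if score > high:
        return "boundary"
    return "ambiguous"  # under-informed: deferred for later processing

assert classify_punctuation(0.1) == "not-boundary"
assert classify_punctuation(0.9) == "boundary"
assert classify_punctuation(0.5) == "ambiguous"
```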
Based on this algorithm, Palmer and Hearst (1997) further developed an SBD
system called Satz, where the aforementioned n-gram descriptor arrays containing
POS information can be fed into either a neural network as in their previous work,
or a decision tree. The learning algorithm chosen for the system is the C4.5 (Salzberg,
1994) decision tree induction program, which iteratively constructs the tree using the
descriptor arrays as input attributes. Each leaf node of the decision tree represents
the value of the goal attribute, which in this case indicates whether a punctuation
mark is a sentence boundary. After the decision tree is built, the algorithm prunes it
by recursively examining each sub-tree to reduce errors and overfitting. Experiments
show that the tree-based learning method achieves accuracy comparable to that of the
feed-forward neural network. However, there are problems with the n-grams of generalized
POS categories. First, the generalized categories are far sparser than the traditional
Penn Treebank POS tags, thus requiring more training data. Furthermore, since
words outside of the n-gram have no influence on the prediction, the n-gram must be
of sufficient length to capture syntactic information. In the Satz system, n is set to 6,
which also yields much sparser n-grams than the commonly adopted bigrams and trigrams.
The reason behind these weaknesses is that this method is built on the premise
that sentence boundaries must be obtained before POS tagging. Thus, to utilize
syntactic information, the original POS tags have to be replaced by the generalized
POS categories. This dilemma was solved by Mikheev (2000), who suggested that POS
taggers do not necessarily require predetermined sentence boundaries to operate. The
simple solution is to break the input text into short text-spans so that it is easier for
POS taggers to handle. Based on this notion, he proposed an SBD system using a
POS tagging framework. To achieve this, he made some minor adjustments to the
Brown Corpus and the WSJ corpus. First, the period in an abbreviation is tokenized
separately from the rest of the abbreviation. Second, all the periods are marked
according to three types of tags, namely full-stop, abbreviation, or both. With this
setting, SBD can be performed using a POS tagger, which is able to fully make use
of the local syntactic information. The POS tagger chosen for this system is based
on HMM and MaxEnt, which will be discussed in detail in the POS tagging section.
He further introduced a document-centered approach (Mikheev, 2002), which can
also be regarded as type-based. When using this approach, the system scans the
entire document for contexts where the focal token is marked unambiguously. This
approach is proven to be effective in distinguishing whether a capitalized word is a
proper name or a common word, which works in complement with the POS tagger
to determine sentence boundaries.
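A minimal sketch of the document-centered idea, assuming a simple tokenized input; the heuristic and names are illustrative, not Mikheev's actual implementation:

```python
def proper_name_candidates(tokens):
    """Document-centered heuristic in the spirit of Mikheev (2002):
    a capitalized word is treated as a proper-name candidate if it occurs
    capitalized in an unambiguous position, i.e., not right after a
    potential sentence boundary. Simplified sketch."""
    boundary_marks = {".", "!", "?"}
    unambiguous_caps = set()
    for i, tok in enumerate(tokens):
        after_boundary = i == 0 or tokens[i - 1] in boundary_marks
        if tok[0].isupper() and not after_boundary:
            unambiguous_caps.add(tok)
    return unambiguous_caps

# "Smith" appears capitalized mid-sentence, so it is a proper-name candidate;
# "We" is only seen sentence-initially, where capitalization is ambiguous.
toks = ["We", "met", "Smith", ".", "Smith", "left", "."]
assert proper_name_candidates(toks) == {"Smith"}
```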
Stevenson and Gaizauskas (2000) applied memory-based learning to identify
sentence boundaries in transcripts produced by an ASR system, which is more challenging
than SBD on standard texts. For instance, the text generated by an ASR system
is generally unpunctuated, in single case, and likely to contain transcription errors.
Their algorithm is based on TiMBL (Daelemans et al., 2003), which memorizes a
set of training examples. It classifies new instances by assigning them the class of the
most similar learned instances. The system adopts both POS tags and morpholexical
features such as capitalization and stopword flag. Results show that the proposed
algorithm cannot effectively address the difficulties in ASR transcripts. Following the
work of Reynar and Ratnaparkhi (1997), Agarwal et al. (2005) proposed a MaxEnt
classifier that incorporates both lexical and syntactic information. They assigned
each token a binary End-of-Sentence tag and a POS tag, formulating SBD as a
sequential labeling problem. Similar to the previous MaxEnt model, their classifier
is also trained on the features of trigram contexts. Their work also concluded with
the best tested feature set for the MaxEnt classifier.
Table 2.5: Comparison between different SBD methods. Token stands for token-
based SBD. Type stands for type-based SBD.
2.3.4 Summary
Since rule-based methods are rigid and labor-intensive, Reynar and Ratna-
parkhi (1997); Rudrapal et al. (2015) alleviated this by using statistical models that
require simple features. Riley (1989) proposed the first tree-based SBD system,
which is more comprehensible to humans. To reduce the computation cost, Wong
and Chao (2010); Wong et al. (2014) applied an incremental tree-based algorithm,
which is also more flexible and suited for online learning. Type-based methods,
compared to token-based ones, not only consider the morpholexical features of the
token, but also compute its likelihood as a non-sentence boundary over the entire
corpus. Hence, type-based approaches have the advantage of being able to
handle large amounts of unannotated data. Among type-based methods, Schmid
(2000) implemented separate probability models for different scenarios. However,
it requires the manual labor of listing such scenarios. Kiss and Strunk (2002, 2006)
averted this by framing SBD as a collocation detection problem of words and
following periods. Gillick (2009) classified sentence boundaries based on trigram
contexts, only incorporating a type-based method for difficult instances, which is less
computationally expensive. Treviso et al. (2017) suggested word embeddings as a
more effective alternative to n-gram and type-based approaches, bypassing feature
engineering. Further in this direction, Knoll et al. (2019) utilized word-level and
character-level neural networks to automatically extract morpholexical features.
Syntax-based systems are proposed based on the assumption that POS tags can
provide context information where word-based approaches are lacking. The majority
of syntax-based systems utilize syntactic features in conjunction with morpholexi-
cal ones to enhance the performance. Palmer and Hearst (1994, 1997) were the first
to incorporate POS information, but wrongly assumed that sentence boundaries are a
prerequisite for POS tagging. Notably, although these systems include a feedforward
neural network, it is solely used for classification, not for feature extraction.
Thus, they are not included in the neural category in Table 2.6. Mikheev (2000,
2002) solved the POS tagging dilemma, and developed a type-based method. Fol-
lowing this setup, Agarwal et al. (2005) used a token-based probabilistic model,
whereas Stevenson and Gaizauskas (2000) tackled SBD in ASR with memory-based
learning, which can be seen as type-based. Lastly, since speech lacks the textual
cues crucial to sentence boundaries, prosody-based systems incorporate prosodic
features for SBD in ASR. Gotoh and Renals (2000); Liu et al. (2004, 2005) integrated
prosodic features using various machine learning models. However, their systems
are not adept at distinguishing SUs and sentences. Liu et al. (2006) confronted this
weakness by tackling the imbalanced data distribution.
SBD for standard texts has already been well-studied. However, domain-specific
SBD still remains a challenge, as some domains tend to use punctuation differently
from general formal texts. Griffis et al. (2016) reviewed several off-the-shelf models
on biomedical and clinical corpora. Their error analysis showed that the semicolons,
colons, and newlines heavily used in clinical text are extremely error-prone. Addition-
ally, periods used in unknown abbreviations, names, and numbers are a significant
cause of error as well. Fatima and Mueller (2019) attempted to solve the task of SBD
in the financial domain via a machine learning approach and an unsupervised
rule-based approach.
Table 2.6: Performance of the introduced SBD methods. W stands for Word. S stands
for Syntax. P stands for Prosody. NIST is for NIST SU error rate.

| Type | Method | Exp. 1 Dataset | Result | Measure | Exp. 2 Dataset | Result | Measure |
|------|--------|----------------|--------|---------|----------------|--------|---------|
| W | Stamatatos et al. (1999) | MG | 99.4% | Acc. | – | – | – |
| W | Sadvilkar and Neumann (2020) | GRS | 97.92% | Acc. | – | – | – |
| W | Treviso et al. (2017) | T-CTL | 79% | F1 | T-MCI | 74% | F1 |
| W | Knoll et al. (2019) | MIMIC | 98.6% | F1 | FV | 99.2% | F1 |
| W | Rudrapal et al. (2015) | BC | 99.6% | Acc. | SMC | 87.0% | Acc. |
| W | Grefenstette and Tapanainen (1994) | BC | 93.78% | Acc. | – | – | – |
| W | Reynar and Ratnaparkhi (1997) | BC | 97.9% | Acc. | WSJ | 98.8% | Acc. |
| W | Schmid (2000) | BC | 99.70% | Acc. | WSJ | 99.62% | Acc. |
| W | Kiss and Strunk (2002) | WSJ | 99.05% | F1 | – | – | – |
| W | Kiss and Strunk (2006) | BC | 98.89% | F1 | WSJ | 98.35% | F1 |
| W | Gillick (2009) | BC | 99.64% | Acc. | WSJ | 99.75% | Acc. |
| W | Riley (1989) | BC | 99.8% | Acc. | – | – | – |
| W | Wong and Chao (2010) | BC | 99.98% | Acc. | – | – | – |
| W | Wong et al. (2014) | BC | 99.81% | Acc. | WSJ | 99.80% | Acc. |
| S | Palmer and Hearst (1994) | WSJ | 98.5% | Acc. | – | – | – |
| S | Palmer and Hearst (1997) | WSJ | 98.9% | Acc. | – | – | – |
| S | Mikheev (2000) | BC | 98.8% | Acc. | WSJ | 99.2% | Acc. |
| S | Mikheev (2002) | BC | 99.72% | Acc. | WSJ | 99.55% | Acc. |
| S | Agarwal et al. (2005) | BC | 97.7% | F1 | WSJ | 97.8% | F1 |
| S | Stevenson and Gaizauskas (2000) | T-News | 76% | F1 | – | – | – |
| P | Gotoh and Renals (2000) | T-R | 70% | F1 | – | – | – |
| P | Liu et al. (2004) | BN | 48.61% | NIST | CTS | 30.66% | NIST |
| P | Liu et al. (2005) | BN | 46.28% | NIST | CTS | 29.30% | NIST |
| P | Liu et al. (2006) | BN | 49.57% | NIST | CTS | 32.40% | NIST |
Unfortunately, the former fails to produce acceptable results, whereas the
performance of the latter is acceptable but still leaves much to be desired. Sanchez
(2019) examined several off-the-shelf algorithms for SBD on legal texts. Similarly,
the results are not ideal, indicating that there is still a lot of room for improvement
for these existing approaches. Therefore, a robust SBD system that is capable of
handling domain-specific corpora is called for. Another challenging aspect is SBD
in speech, especially SU boundary detection. Considering the lack of annotated re-
sources in this domain, one possible direction for future work is semi-supervised
learning (SSL). For instance, the machine can learn from annotated training samples
that are manually labeled to precisely indicate SU boundaries, infer labels on the
unannotated data, and fine-tune itself.
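Such a self-training loop could be sketched as follows; the function names and the confidence-threshold policy are hypothetical, not taken from any cited system:

```python
def self_train(model_fit, model_predict, labeled, unlabeled, rounds=3, threshold=0.9):
    """Generic self-training sketch for the SSL direction suggested above.
    model_fit(data) -> model; model_predict(model, x) -> (label, confidence).
    All interfaces are illustrative assumptions."""
    data = list(labeled)
    pool = list(unlabeled)
    model = model_fit(data)
    for _ in range(rounds):
        confident, rest = [], []
        for x in pool:
            label, conf = model_predict(model, x)
            (confident if conf >= threshold else rest).append((x, label))
        if not confident:
            break
        data.extend(confident)       # add pseudo-labeled samples
        pool = [x for x, _ in rest]  # keep the uncertain ones for later rounds
        model = model_fit(data)      # fine-tune on the enlarged set
    return model
```

In an SU boundary setting, `labeled` would be the manually annotated transcripts and `unlabeled` the raw ASR output.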
2.4 POS Tagging
POS tagging is a fundamental task in NLU, which aims to label each word in a
given text with its POS tag, e.g., noun, verb, adjective, etc. It is an upstream task
that preprocesses the input texts to assist more complex NLU applications. Since
the POS of a word can affect its meaning and polarity, POS tagging is important
for downstream tasks such as WSD (Taghipour and Ng, 2015a; Alva and Hegde,
2016), information retrieval (Mahmood et al., 2017), sentiment analysis (Asghar
et al., 2014; Mubarok et al., 2017), metaphor detection (Mao et al., 2021a; Ge et al.,
2022) and interpretation (Mao et al., 2018, 2022a). Numerous studies have been
done on POS tagging for a variety of languages (Shao et al., 2017; Nguyen et al.,
2017; Kanakaraddi and Nandyal, 2018; Darwish et al., 2018). Considering that the
fundamental methodologies are similar across different languages, here we focus on
introducing POS taggers in English.
English POS tagging is a well-studied problem. Following Church (1989), most
early works utilized hand-crafted features, derived from the local contexts, e.g., rule-
based learning (Brill, 1995), memory-based learning (Daelemans et al., 1999b),
and other statistical approaches, among which the MaxEnt framework (Ratnaparkhi,
1996; Toutanvoa and Manning, 2000; Curran and Clark, 2003) and directed graphical
models (Kupiec, 1992; Brants, 2000; McCallum et al., 2000) received the most
attention. The performance of feature engineering approaches took a big leap with
the novel CRF (Lafferty et al., 2001) and the perceptron algorithm (Collins, 2002),
which addresses the parameter estimation problem of MaxEnt models. Attempts at
applying bidirectional inference to the task also achieved remarkable improvements,
for instance the cyclic dependency network (Toutanova et al., 2003) and guided
learning framework (Shen et al., 2007). With the recent surge of interest in deep
learning, neural networks such as CNN (LeCun et al., 1998; Collobert et al., 2011),
recurrent neural network (RNN) and LSTM (Hochreiter and Schmidhuber, 1997;
Graves et al., 2013) have been applied to the field of POS tagging, freeing it
from hand-crafted feature sets and yielding even more promising
results. Subsequently, many studies extended the previous methods to SSL to further
boost the tagging accuracy (Clark et al., 2003; Suzuki and Isozaki, 2008; Zhou et al.,
2018). The performance of POS taggers is measured by accuracy or, equivalently,
error rate.
In this section, we first review the tagging schemas of POS tagging. Then, we
divide existing POS taggers into the following categories: feature engineering ap-
proaches, deep learning approaches, and SSL approaches.
In this section, we introduce the two most prevalent tagsets used in the field of POS
tagging, namely: Penn Treebank POS tags and Universal POS tags.
68 2 Syntactics Processing
The Penn Treebank POS tagset (Marcus et al., 1993) is the most widely used
tagging paradigm. Derived from the Brown corpus, the Penn Treebank has since
replaced it as the standard for POS tagging in English. Compared to its predecessor,
the Penn Treebank offers a more fine-grained syntactic distinction, containing 36
POS categories in total. For instance, a base form of a verb is always tagged as VB
in the Brown corpus, whereas the Penn Treebank differentiates it as VB (imperative
or infinitive) or VBP (non-third person singular present tense) depending on the
context. However, as the need for multilingual and cross-lingual POS induction
arises, the granular differentiation of the Penn Treebank becomes a drawback, since
many languages do not follow the grammatical structure of English.
To address the weakness of the Penn Treebank and facilitate future research,
Petrov et al. (2011) proposed a tagset with coarser syntactic POS categories based
on the observed commonalities across most languages. The Universal tagset contains
12 universal POS categories. In addition, they also created a mapping from 25 Penn
Treebank POS categories to their tagset. When used in conjunction with the Penn
Treebank, the Universal tagset and mappings are able to account for 22 different
languages. The commonly used corpora for POS tagging (in English) are listed in
Table 2.7.
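A small excerpt of the Penn-to-Universal mapping illustrates the idea; only a handful of the mapped categories are shown, and unmapped tags fall back to the universal X ("other") category:

```python
# Excerpt of the Petrov et al. (2011) Penn-to-Universal mapping;
# the full mapping covers 25 Penn Treebank categories.
PENN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBP": "VERB",
    "JJ": "ADJ", "RB": "ADV", "DT": "DET",
    "IN": "ADP", "CD": "NUM", "CC": "CONJ",
}

def to_universal(penn_tags):
    """Coarsen fine-grained Penn tags into universal categories."""
    return [PENN_TO_UNIVERSAL.get(t, "X") for t in penn_tags]

assert to_universal(["DT", "NN", "VBD"]) == ["DET", "NOUN", "VERB"]
```

Note how the VB/VBP distinction discussed above collapses back into a single VERB category, trading granularity for cross-lingual applicability.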
Church (1989) was the first to determine that features from two or fewer of the
nearby tokens are significantly informative for predicting the POS tag of a given
token. For example, the word bear can be a verb or a noun, but if it is observed to
follow a determiner such as the, then the tagger can label bear in the word sequence
the bear as noun. The tagger takes a lexicon-based approach – for each word in the
corpus, their most frequent POS tags and the corresponding lexical probabilities are
stored in a lookup table. The tagger also computes the contextual probability, which
is the probability of observing a POS tag given the two POS tags following it. Then,
given a sentence, the goal is to search for the tag sequence that optimizes the lexical
and contextual probabilities.
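The search over lexical and contextual probabilities can be illustrated with a brute-force toy; the probability values are invented, and a real tagger would use dynamic programming rather than enumeration:

```python
from itertools import product

# Toy lexical probabilities P(tag | word) -- illustrative numbers only
LEX = {
    "the":  {"DET": 1.0},
    "bear": {"NOUN": 0.6, "VERB": 0.4},
}
# Toy contextual probabilities P(tag | two following tags); Church
# conditions on the tags to the right. "$" pads the end of the sequence.
CTX = {
    ("DET", ("NOUN", "$")): 0.9, ("DET", ("VERB", "$")): 0.1,
    ("NOUN", ("$", "$")): 0.5,   ("VERB", ("$", "$")): 0.5,
}

def best_tags(words):
    """Brute-force the tag sequence maximizing lexical * contextual scores."""
    candidates = [list(LEX[w]) for w in words]
    def score(tags):
        padded = list(tags) + ["$", "$"]
        s = 1.0
        for i, w in enumerate(words):
            s *= LEX[w][tags[i]] * CTX.get((tags[i], tuple(padded[i + 1:i + 3])), 1.0)
        return s
    return max(product(*candidates), key=score)

assert best_tags(["the", "bear"]) == ("DET", "NOUN")
```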
Brill (1995) proposed a POS tagger with transformation-based learning. Initially,
the tagger assigns each word the most likely POS tag according to the training corpus.
When the prediction is incorrect, the tagger tries another transformation from the
transformational rule templates, which are derived from the three preceding words,
the three following words, and their POS tags. The learning procedure stops when no
more transformations can be found to reduce the errors. This method is simple and
effective, but unfortunately it requires a long training time. To address this problem,
Ngai and Florian (2001) optimized the model by incorporating good and bad counts
for each transformational rule, which averts repetition in the learning procedure. This
method successfully reduces the training time while maintaining the accuracy.
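The transformation-based learning loop can be sketched as follows; the rule-template interface here is a simplification of Brill's actual templates:

```python
def tbl_train(words, gold, rules, init_tags):
    """Skeleton of Brill's transformation-based learning loop.
    `rules` is a list of (condition, from_tag, to_tag) candidates, where
    each condition inspects the local context. Illustrative only."""
    tags = list(init_tags)
    learned = []
    while True:
        best, best_gain = None, 0
        for cond, src, dst in rules:
            trial = [dst if t == src and cond(words, i) else t
                     for i, t in enumerate(tags)]
            gain = (sum(a == b for a, b in zip(trial, gold))
                    - sum(a == b for a, b in zip(tags, gold)))
            if gain > best_gain:
                best, best_gain = (cond, src, dst), gain
        if best is None:  # stop: no transformation reduces the errors
            break
        cond, src, dst = best
        tags = [dst if t == src and cond(words, i) else t
                for i, t in enumerate(tags)]
        learned.append(best)
    return tags, learned

# toy rule: retag VERB as NOUN when the previous word is "the"
rules = [(lambda ws, i: i > 0 and ws[i - 1] == "the", "VERB", "NOUN")]
tags, learned = tbl_train(["the", "bear"], ["DET", "NOUN"], rules, ["DET", "VERB"])
assert tags == ["DET", "NOUN"]
```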
Daelemans et al. (1999b) introduced a memory-based approach to POS tagging.
Examples in the training set are represented as feature vectors with associated tag
categories. For each test data point, its similarity to all examples in the memory is
computed, and the category of the most similar instances is chosen as the predicted
category for the test data point.
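A minimal sketch of memory-based classification over toy feature tuples; the features and distance metric are illustrative, not TiMBL's:

```python
def memory_tag(instance, memory, distance):
    """Memory-based (nearest-neighbour) classification sketch:
    return the tag of the most similar stored example."""
    best = min(memory, key=lambda ex: distance(instance, ex[0]))
    return best[1]

# Hamming distance over toy feature tuples: (previous word, suffix, capitalized)
hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
memory = [
    (("the", "ear", False), "NOUN"),
    (("will", "ear", False), "VERB"),
]
assert memory_tag(("the", "ear", True), memory, hamming) == "NOUN"
```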
Abney et al. (1999) applied boosting to POS tagging. The boosting algorithm is
similar to the transformation-based learning discussed above, in that the model combines
template rules to produce the most accurate classification rule. They proposed two
methods to deal with the multi-class problem. First, they applied the [Link]
algorithm (Schapire and Singer, 1999), where each possible class is paired with
the given word and assigned a binary label as a derived problem, which is then
solved by binary AdaBoost. [Link] is memory-consuming, therefore they
proposed a novel [Link] algorithm, which uses binary AdaBoost to train
separate binary classifiers for each class, and combines their outputs by choosing the
class with the highest confidence. Unlike [Link], here the predictions are selected
independently for each class. The boosting approach is shown to be more accurate
than transformation-based learning.
Nakagawa et al. (2001) employed SVM to specifically solve the unknown word
problem in POS tagging. In this method, binary SVM classifiers are created for
each POS tag based on the training corpus, which are then used to predict the POS
tags of unknown words. Subsequently, Giménez and Marquez (2004); Giménez
and Màrquez (2004) proposed SVMTool, which extends the binary SVM to cover
multi-class classification. SVM classifiers are created for each POS tag that contains
ambiguous lexical items. When labeling a word, the most confident prediction among
all the SVM classifiers is selected.
70 2 Syntactics Processing
Toutanova et al. (2003) were the first to introduce bidirectional inference for POS
tagging. They proposed a cyclic dependency network with a series of local condi-
tional log-linear models to exploit information from both directions explicitly. Each
node in the network represents a random variable with a corresponding local condi-
tional probability model that considers the source variables from all incoming arcs.
The tagger finds the sequence that maximizes the score via Viterbi algorithm, similar
to previous MaxEnt and HMM models. The only difference between bidirectional
inference and a unidirectional graphical model such as HMM is that, when the
Markov window is at time step i, the score it receives is P(t_{i-1} | t_i, t_{i-2}, w_{i-1})
instead of P(t_i | t_{i-1}, t_{i-2}, w_i), where t and w denote the output tag and input word,
respectively. Their model, however, suffers from a collusion problem, where the model
locks onto conditionally consistent but jointly unlikely sequences. This is because the
local classifiers encounter a double-counting problem when using the information from
future tags.
In order to avert this problem, Tsuruoka and Tsujii (2005) proposed an alternative
bidirectional inference algorithm with an easiest-first strategy. The probability of the
label sequence of a given sentence is the product of local probabilities. The proposed inference method
is to consider all of the possible decomposition structures and choose the optimal
structure to predict label sequence. The paper also proposed a more efficient alter-
native to bidirectional decoding algorithm, which adopts the easiest-first strategy.
Instead of enumerating all the possible decompositions, the tagger tags the easiest
word at each step, repeating the procedure until all the words are tagged. To pick
the easiest word, the appropriate local MaxEnt classifier is selected according to the
availability of the neighboring labels, and used to output the probabilities. The word
with the highest probability is deemed as the easiest word for the current step. Their
bidirectional inference method is proven to be able to find the highest probability
sequence with similar performance but lower computational complexity.
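The easiest-first loop can be sketched as follows, assuming a hypothetical `scorer` that plays the role of the local MaxEnt classifiers:

```python
def easiest_first(words, scorer):
    """Easiest-first tagging loop, sketched. `scorer(words, i, tags)` returns
    (best_tag, prob) for position i given the currently available neighboring
    labels; the interface is an illustrative assumption."""
    tags = [None] * len(words)
    while None in tags:
        # pick the untagged position the classifier is most confident about
        pending = [i for i, t in enumerate(tags) if t is None]
        i = max(pending, key=lambda j: scorer(words, j, tags)[1])
        tags[i], _ = scorer(words, i, tags)
    return tags

def toy_scorer(ws, i, tags):
    # confident about determiners; "bear" becomes NOUN once a DET is to its left
    if ws[i] == "the":
        return ("DET", 0.99)
    if i > 0 and tags[i - 1] == "DET":
        return ("NOUN", 0.9)
    return ("VERB", 0.5)

assert easiest_first(["the", "bear"], toy_scorer) == ["DET", "NOUN"]
```

Note how tagging "the" first raises the confidence (and changes the prediction) for "bear", which is exactly the benefit of deferring hard decisions.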
Shen et al. (2007) proposed a novel guided learning framework for bidirectional
inference. Unlike the easiest-first strategy which only uses heuristic rule to determine
the order of inference, their approach incorporates the selection of inference order
into the training of the MaxEnt classifier for individual token labeling, combining
the two into a single learning task. Specifically, a sub-sequence of the input sentence
is called a span. Each span is associated with one or more hypotheses, which are
started and grown via labeling actions. The tagger initializes and maintains a set P
of accepted spans and a set Q of candidate spans. It repeatedly selects a candidate
span from Q whose action score of its top hypothesis is the highest, moving it to P,
until a span covering the whole input sentence is added to P from Q. For training, the
tagger uses guided learning to learn the weight of action score. If the top hypothesis
of a selected span is compatible with the gold standard, then the candidate span is
accepted. Otherwise, similar to perceptron algorithm (Collins, 2002), the weight is
updated by rewarding the feature weights of the gold standard action and punishing
the feature weights of the action of the top hypothesis. Then, all the existing spans in
Q are removed and replaced with new hypotheses for all the possible spans generated
based on the context spans in P. This allows the tagger to simultaneously learn the
individual classification and the inference order selection.
Ma et al. (2013) proposed an easy-first POS tagger with beam search. The easy-
first POS tagger enumerates all possible word-tag pairs, choosing the most confident
one to label according to the score function, marking the word as processed. Then, the
tagger re-computes the scores of the unprocessed words based on the local context,
repeating the selection procedure until all the words are marked as processed. For the
easy-first tagging with beam search, a set of labeling action sequences is maintained
and grown via beam search. At each step, the sequences in the beam are expanded
in all possible ways, and the top expanded sequences within the beam width are
selected into the set. The trainable weight vector in the score function is learned through
perceptron-based global learning similar to the previously mentioned guided learning
framework (Shen et al., 2007); however, its performance on the WSJ dataset is not as
good as that of the latter.
All the above-mentioned methods are dependent on hand-crafted feature sets. With
the development of deep learning, the application of neural networks to POS tagging
makes it possible to avoid feature engineering and further improve the tagging
performance. Notably, neural network models are widely-employed to automatically
capture character-level patterns, which have to be modeled with morphological
features in previous taggers, e.g., suffix, capitalization, presence of numerals, etc.
Collobert et al. (2011) proposed a window approach for sequence labeling tasks,
which assumes the tag of a word is mainly dependent on its neighboring words.
Hence, it considers a fixed size window of words around the current word as local
features. Given an input sentence, the tagger passes it through a lookup table layer.
The resulting sequence of representations is fed into the convolutional layer, which
can be regarded as a feed-forward neural network with L layers, to extract local
features. That is, a concatenation of the word vectors in the focal window is inputted
to L linear layers to perform affine transformations over their inputs. Finally, a
softmax layer computes the probabilities for each label given the output of the L-th
layer. The POS tagger is trained with word-level log-likelihood, also more commonly
known as cross-entropy.
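The window approach can be sketched with NumPy; sizes and weights are toy values, and tanh stands in for the hard tanh used in the original:

```python
import numpy as np

rng = np.random.default_rng(1)
EMB, WIN, TAGS = 4, 3, 5  # toy sizes: embedding dim, window size, tagset size

# lookup table layer: word -> embedding (random stand-ins)
lookup = {w: rng.normal(size=EMB) for w in ["<pad>", "the", "bear", "sleeps"]}
W1 = rng.normal(size=(WIN * EMB, 16))  # first linear layer (illustrative)
W2 = rng.normal(size=(16, TAGS))       # output layer over the tagset

def tag_probs(words, i):
    """Window approach: concatenate embeddings around position i,
    pass them through linear layers, and softmax over the tags."""
    half = WIN // 2
    padded = ["<pad>"] * half + words + ["<pad>"] * half
    window = padded[i:i + WIN]
    x = np.concatenate([lookup[w] for w in window])
    h = np.tanh(x @ W1)        # nonlinearity (hard tanh in the original)
    z = h @ W2
    e = np.exp(z - z.max())
    return e / e.sum()         # softmax -> per-tag probabilities

p = tag_probs(["the", "bear", "sleeps"], 1)
assert p.shape == (TAGS,) and abs(p.sum() - 1.0) < 1e-9
```

Training would minimize the cross-entropy between these probabilities and the gold tags, as stated above.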
Dos Santos and Zadrozny (2014) used deep nets to learn the character-level
representations and combined them with the corresponding word representations to
perform POS tagging. Prior to their work, the morphological or other intra-word
information is given to the tagger via hand-crafted features. To reduce human effort,
CharWNN is proposed as an extension of the previously introduced convolutional
architecture (Collobert et al., 2011). CharWNN uses the convolutional layer to
extract features from the input words and generates their character-level embedding
at tagging time. The character-level embeddings are concatenated with word-level
embeddings as word representations. Taking the window approach, a fixed-size
window of word representations in the input sentence is concatenated into a vector,
which is then fed into two linear neural network layers to compute the scores.
Wang et al. (2015a) proposed a BLSTM-RNN model for POS tagging. The model
first implements a linear layer as a lookup table to produce word embeddings, which
are fed into a bidirectional LSTM (BiLSTM) layer and then a softmax layer to output
the scores of tags. They also introduced a novel method to train word embedding
on unlabeled data, where BLSTM-RNN takes a sentence with some words replaced
by randomly chosen words as input, tagging the words in the sentence as correct or
incorrect. Thus the lookup table in BLSTM-RNN is trained to minimize the binary
classification error. Ma and Hovy (2016) introduced a BiLSTM-CNN-CRF model,
which utilizes both word-level and character-level representations automatically, re-
quiring no task-specific resources, feature engineering or data preprocessing. First,
a CNN layer (Chiu and Nichols, 2016) is applied to encode character-level infor-
mation of a word into its character-level representation. A dropout layer is applied
before character embeddings are fed into the CNN. Then, the model concatenates the
character-level representations and word embeddings, feeding them into BiLSTM
to model context information of each word. A dropout layer is also applied to the
output vectors. Lastly, the output vector of BiLSTM is fed to a CRF layer to decode
labels for the whole sentence.
Akbik et al. (2018) proposed Flair, a contextual string embedding for sequence
labeling tasks. The Flair embedding is learned using an LSTM-based language model
over sequences of characters instead of words. For optimal performance, the Flair
embedding is stacked together with pretrained static word embedding GloVe (Pen-
nington et al., 2014) and a task-trained character embedding learned by LSTM. The
final embedding is passed onto a standard BiLSTM-CRF architecture to acquire the
output label sequence.
Zhao et al. (2019) proposed a deep CNN architecture called Deep Gated Dual
Path CNN (GatedDualCNN) for sequence labeling. The model first uses a CharCNN
to extract character-level representations, which are then concatenated with the word
embeddings and fed into a 1-D convolution (Conv1D) layer followed by rectified
linear unit (ReLU) and batch normalization (BN) to get the initial hidden states.
Thereafter, in order to stack up more convolutional layers while averting the vanishing
gradient problem, the paper incorporated gate blocks, residual connection, and dense
connection. The core component of a gate block is the gated linear unit (Dauphin
et al., 2017), whose output is processed by a Conv1D layer with ReLU and BN to
produce the successive hidden state. To encourage feature re-usage between gate
blocks, residual connection is introduced to bypass the non-linear transformation in
the gate block. On the other hand, dense connection serves the purpose of new feature
exploration in a dense path. To combine these two connections, the model uses a dual
path, where hidden state produced by each block is split row-wise, then fed into the
residual path and the dense path respectively. The outputs are concatenated as the
input of the next block. The final hidden state is then passed on to a CRF layer to
decode the best sequence of tags. Yang et al. (2017) introduced transfer learning for
a deep hierarchical RNN POS tagger to alleviate the out-of-domain problem. The base
model uses a character-level GRU (Cho et al., 2014) to obtain character embeddings,
which are concatenated with word embeddings and passed on to a word-level GRU
and a CRF layer to predict the tag sequence.
They described three different transfer learning architectures for this base model.
Transfer model T-A, which shares all the model parameters and feature represen-
tations between domains, is used for cross-domain transfer where label mapping is
possible. It only performs a label mapping step on top of the base’s CRF layer. If
the two domains have disparate label sets, then transfer model T-B learns a separate
CRF layer for each task while sharing parameters in other layers. For cross-lingual
transfer, model T-C only shares the parameters and representations in the character-
level GRU, keeping two separate word-level GRU and CRF layers for the source
task and the target task. The paper experiments with transferring from chunking and
Named Entity Recognition (NER) to standard POS tagging on WSJ, and also tests
transfer learning from WSJ to the Genia biomedical corpus (Kim et al., 2003) and
the X corpus T-POS.
Similarly, Meftah and Semmar (2018) presented a transfer-learning-based end-to-
end neural model. Their base model uses a CNN layer to extract character embedding,
a GRU layer to compute hidden states, and a fully-connected layer and softmax layer
to output the scores for tags. Two transfer learning architectures are proposed based
on this neural network. For cross-domain transfer, they used a parent network for
source data and a child network for target data. The parent network is trained on
annotated out-of-domain data, namely WSJ, whose parameters are transferred to the
child network. Then, the child network is fine-tuned through training on labeled X
datasets. For cross-task transfer, the parent network and the child network share
a set of parameters, jointly optimizing the two tasks, while maintaining separate
task-specific parameters that are trained on the corresponding task. The task selected
for the parent network is NER.
where Z(x) is the normalization factor, Fc is the set of feature functions for the
corresponding clique c, and Λc is the corresponding set of feature weights. In this
SSL-based architecture, the feature functions for clique c are the concatenation of Fc
and the log-likelihoods of all the joint probability models. A set of model parameters θ
is introduced to weight the joint models, which is estimated using unlabeled data via
Maximum Discriminant Functions.
Subramanya et al. (2010) described another CRF-based algorithm for semi-supervised
POS tagging. For graph construction, they used local sequence contexts as graph
vertices V, which consist of a set Vl of n-grams that occur in the labeled data and a
set Vu of n-grams from the unlabeled data. The graph is built over types rather than
tokens via a symmetric similarity function, hence the name similarity graph. This
similarity graph is used as a smoothness regularizer to train the CRF in a semi-supervised manner.
Specifically, given a set of CRF parameters, the tagger first computes marginals over
tokens in the unlabeled data, which are then aggregated to marginals over types and
used to initialize the graph label distributions. After running label propagation, the
posteriors from the graph are used to smooth the state posteriors. Subsequently, the
unlabeled data is decoded using the Viterbi algorithm to produce a set of automatic
annotations, which are combined with the labeled data to retrain the CRF through
supervised learning. The procedure is repeated until convergence. They used WSJ
as labeled source domain training data, and the QuestionBank (Judge et al., 2006)
as test data. The unlabeled data are collected from Internet search queries in similar
forms to the QuestionBank. Experiments show that their proposed algorithm indeed
outperforms supervised CRF in other domains.
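The label propagation step at the heart of this graph-based approach can be sketched in a few lines of numpy. The similarity weights, tag set, and seed distributions below are toy values for illustration, not the actual type-level graph of the paper.

```python
import numpy as np

# Minimal sketch of label propagation over a similarity graph.
def propagate(W, Y, seed_mask, iters=50, alpha=0.9):
    # W: symmetric similarity matrix over graph vertices (n-gram types)
    # Y: initial label distributions, one row per vertex
    # seed_mask: True for vertices whose distribution is clamped
    P = W / W.sum(axis=1, keepdims=True)     # row-normalized transitions
    F = Y.copy()
    for _ in range(iters):
        F = alpha * P @ F + (1 - alpha) * Y  # mix neighbors with the prior
        F[seed_mask] = Y[seed_mask]          # clamp seeds from labeled data
    return F / F.sum(axis=1, keepdims=True)  # renormalize to distributions

# Three vertices, two tags; vertex 0 is seeded, 1 and 2 are unlabeled.
W = np.array([[0.0, 1.0, 0.2], [1.0, 0.0, 1.0], [0.2, 1.0, 0.0]])
Y = np.array([[1.0, 0.0], [0.5, 0.5], [0.5, 0.5]])
F = propagate(W, Y, seed_mask=np.array([True, False, False]))
print(F.round(2))  # unlabeled vertices drift toward the seeded tag
```

In the paper, the resulting graph posteriors then smooth the CRF's state posteriors before Viterbi decoding.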
Søgaard (2010) introduced stacked learning as a way to reduce POS tagging to a
classification task, thus simplifying semi-supervised training. The stacking approach
here is to combine SVMTool (Giménez and Màrquez, 2004) and an unsupervised
tagger (Biemann, 2006) into a single-end classifier, where the former predicts the
POS tag of a given word and the latter sorts the word into a word cluster. SSL
is achieved by tri-training with disagreement. Firstly, three classifiers of the same
learning algorithm mentioned above are trained on three bootstrap samples of the
labeled dataset, which ensures that the classifiers are diverse. Then, a data point in
the unlabeled dataset is labeled for classifier c1 , if and only if the other two agree
on its label assignment but c1 disagrees, which strengthens the weakness of the
classifier without skewing the labeled data by easy data points. This labeling process
is repeated until the classifiers no longer change. Subsequently, the three classifiers
are integrated by majority voting. Zhou et al. (2018) proposed a weakly supervised
sequence tagging model with ECOC (Error-Correcting Output Codes) that can learn
to predict the POS tag for a given word in a context, given a dictionary of words with
their possible tags.
Most approaches prior to this paper are based on disambiguation, such as CRF
and HMM, which suffer from the negative effects of false positive tags as the size
of the tag set increases. The POS tagger is trained and tested based on constrained
ECOC (Dietterich and Bakiri, 1994). First, a unique L-bit vector is assigned to
each tag. The set of bit-vectors is regarded as a coding matrix, where each row
represents a codeword, i.e., a class, and each column specifies a dichotomy over the
tag space to learn a binary classifier. In the encoding stage, for each column of the
coding matrix, a binary classifier is built based on binary training examples derived
from the dictionary of the words with their possible tags. In the decoding stage, the
codeword of an unlabeled test instance is generated by concatenating the predictive
output of the L binary classifiers. The predicted instance is the class with the closest
codeword according to Hamming or Euclidean distance. Thus, the proposed
model not only treats the set of possible tags as a whole without resorting
to a disambiguation procedure, but also needs no manual intervention for feature
engineering.
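The encode/decode cycle can be illustrated with a toy coding matrix (the codewords below are illustrative, not the ones used in the paper). Decoding picks the tag whose codeword is nearest in Hamming distance, so a single flipped classifier output is corrected.

```python
# Toy sketch of ECOC decoding for tagging: each tag gets a unique L-bit
# codeword (a row of the coding matrix), one binary classifier is learned
# per column, and a test instance is decoded to the nearest codeword.
coding_matrix = {            # tag -> L-bit codeword (L = 5 here)
    "NOUN": (0, 0, 0, 0, 0),
    "VERB": (1, 1, 1, 0, 0),
    "ADJ":  (1, 0, 0, 1, 1),
    "ADV":  (0, 1, 1, 1, 1),
}

def ecoc_decode(predicted_bits, coding_matrix):
    # Hamming distance between predicted codeword and each tag codeword
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(coding_matrix, key=lambda tag: hamming(predicted_bits, coding_matrix[tag]))

# Suppose the L binary classifiers output (1, 1, 1, 1, 0): one bit flipped
# from "VERB", but still closest to it, so the error is corrected.
print(ecoc_decode((1, 1, 1, 1, 0), coding_matrix))  # VERB
```

The codewords above have a pairwise Hamming distance of at least 3, which is what lets any single-bit error be corrected.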
Gui et al. (2017) proposed a Target Preserved Adversarial Neural Network
(TPANN) for POS tagging on X (formerly known as Twitter). WSJ is used as
the labeled out-of-domain data, T-POS, ARK, and NPS as labeled in-domain data,
and tweets collected via X API as unlabeled in-domain data. The objective is to
learn common features between resource-rich domain and target domain while pre-
serving some domain-specific features of the target domain. TPANN first extracts
character embedding features via CNN and concatenates them to word embedding as
input. The hidden states are produced by a BiLSTM layer. Subsequently, the hidden
states are transferred to a POS tagging classifier and a domain discriminator, which
are both standard feed-forward networks with a softmax layer. The POS tagging
classifier maps the hidden states to their labels, whereas the domain discriminator
maps the same hidden states to the domain labels so as to make the input features
domain-invariant. By training this adversarial network, common features can be ob-
tained, but some domain-specific features are weakened. Thus, the paper introduced
a domain-specific auto-encoder to reconstruct target domain data. Specifically, at the
decoder side, the hidden state ht is computed as ht = LSTM([h0 ⊕ zt−1], ht−1),
where h0 is the last hidden state of the BiLSTM layer, ⊕ denotes the concatenation
operation, and zt−1 is computed from ht−1 using a multilayer perceptron. In
this way, the auto-encoder counteracts the adversarial network’s tendency to erase
target domain features by optimizing the common representation to be informative
on the target domain data.
2.4.5 Summary
POS tagging is a well-researched problem in the field of NLU. Existing POS tag-
ging models can be divided into feature engineering approaches and deep learning
approaches. Nonetheless, to address the growing need for domain-adaptable POS
taggers, we also cover a third category: semi-supervised methods.
Table 2.8: Comparison between different POS tagging methods. FE stands for feature
engineering. DL stands for deep learning. SS stands for semi-supervised. RW stands
for rare word.
Early feature engineering methods predict the POS tag of a word based on its
local n-gram context with various machine learning techniques, following Church
(1989). Graphical models are commonly used for sequence labeling. Generative
graphical models (Kupiec, 1992; Brants, 2000) estimate the joint distribution based
on the explicit dependency between the states and the observations, and thus have
difficulty accommodating rich context features. The directed discriminative model
(McCallum et al., 2000) addresses this weakness by modeling the conditional
probability based on the dependencies between adjacent states and the observation
sequence, but it suffers from the label bias problem. CRF (Lafferty et al., 2001), in
turn, is an undirected discriminative graphical model that is able to leverage decisions
globally, though at the cost of computational efficiency.
Method                           Dataset        Result    Measure

Feature Engineering:
Brill (1995)                     WSJ            96.6%     Acc.
Ngai and Florian (2001)          WSJ            96.61%    Acc.
Daelemans et al. (1999b)         WSJ            96.6%     Acc.
Abney et al. (1999)              WSJ            96.68%    Acc.
Nakagawa et al. (2001)           WSJ            97.1%     Acc.
Giménez and Màrquez (2004)       WSJ            97.05%    Acc.
Ratnaparkhi (1996)               WSJ            96.63%    Acc.
Toutanova and Manning (2000)     WSJ            96.86%    Acc.
Curran and Clark (2003)          WSJ            97.27%    Acc.
Kupiec (1992)                    WSJ            95.7%     Acc.
Brants (2000)                    WSJ            96.7%     Acc.
Collins (2002)                   WSJ            97.11%    Acc.
Lafferty et al. (2001)           WSJ            95.73%    Acc.
Rush et al. (2012)               WSJ            91.98%    Acc.
Toutanova et al. (2003)          WSJ            97.24%    Acc.
Tsuruoka and Tsujii (2005)       WSJ            97.24%    Acc.
Shen et al. (2007)               WSJ            97.33%    Acc.
Ma et al. (2013)                 WSJ            97.28%    Acc.

Deep Learning:
Collobert et al. (2011)          WSJ            97.37%    Acc.
Dos Santos and Zadrozny (2014)   WSJ            97.47%    Acc.
Wang et al. (2015a)              WSJ            97.40%    Acc.
Ma and Hovy (2016)               WSJ            97.55%    Acc.
Akbik et al. (2018)              WSJ            97.85%    Acc.
Zhao et al. (2019)               WSJ            97.59%    Acc.
Yang et al. (2017)               WSJ            97.55%    Acc.
Yang et al. (2017)               Genia          92.62%    Acc.
Yang et al. (2017)               T-POS          83.65%    Acc.
Meftah and Semmar (2018)         T-POS          90.90%    Acc.
Meftah and Semmar (2018)         ARK            92.01%    Acc.
Meftah and Semmar (2018)         NPS            93.20%    Acc.

Semi-supervised:
Clark et al. (2003)              WSJ            409       Perplexity
Ando and Zhang (2005)            BC             93.1%     Acc.
Toutanova and Johnson (2007)     WSJ            93.4%     Acc.
Spoustová et al. (2009)          WSJ            97.44%    Acc.
Suzuki and Isozaki (2008)        WSJ            97.35%    Acc.
Subramanya et al. (2010)         QuestionBank   86.8%     Acc.
Søgaard (2010)                   WSJ            97.27%    Acc.
Zhou et al. (2018)               WSJ            92.91%    Acc.
Gui et al. (2017)                T-POS          90.92%    Acc.
Gui et al. (2017)                ARK            92.80%    Acc.
Gui et al. (2017)                NPS            94.10%    Acc.
Text chunking is an NLU task that splits sentences into non-overlapping segments,
such as Noun Phrase (NP) and Verb Phrase (VP). Chunking, also called shallow
parsing, can be applied as a preprocessing step before full parsing. It helps
the machine to learn the sentence structure and relation between words, e.g., recog-
nizing names and syntactic components. Thus, it provides a useful foundation for
downstream NLU tasks that require a general understanding of sentence components,
e.g., NER (Collobert et al., 2011; Yang et al., 2017), text summarization (Gupta et al.,
2016a), and sentiment analysis (Syed et al., 2014).
In this section, we first review different tagging schemas in text chunking, then
investigate previous methods in two categories: feature engineering approaches
and deep learning approaches.
Ramshaw and Marcus (1999) proposed noun phrase chunking as a machine learning
problem. Prior to their work, chunk structure was mostly encoded with brackets
between words, which is often met with the problem of unbalanced brackets. To
solve this problem, they introduced the IOB1 (also known as IOB) tagging schema
to represent chunk structures, where “B” stands for the beginning of a chunk that
immediately follows another chunk, “I” means the word is inside a chunk, and “O”
stands for outside of any chunk. Thus, chunking is considered as a sequence labeling
problem. The dataset they derived from the Penn Treebank is later referred to as
baseNP and is used in some very early works. Thereafter, Sang and Buchholz (2000)
introduced the widely-used dataset CoNLL-2000, extending the task of chunking
from noun phrase to other types of chunks, such as verb phrases, prepositional
phrases and adverb phrases. The dataset modifies the above-mentioned IOB1 encod-
ing schema to IOB2 (also known as BIO), where “B” is simply used in the beginning
of every chunk. It also contains the corresponding POS tag of every token assigned
by a standard POS tagger from Brill and Wu (1998). The CoNLL-2000 dataset, along
with F1 score as metric, has become a standard for evaluating chunkers. Although
the CoNLL-2000 comes with IOB2 encoding, it is not difficult to convert it into
other schemas. A variety of encoding schemas are explored to study their effects on
chunking performance.
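Chunk-level F1 is computed by extracting (type, start, end) spans from the tag sequences and scoring exact span matches. A minimal sketch for IOB2 tags follows; the gold and predicted sequences are toy examples for illustration.

```python
# Sketch of chunk-level F1 evaluation: chunks are extracted as
# (type, start, end) spans from IOB2 tag sequences, and F1 is computed
# over exactly-matching spans.
def extract_chunks(tags):
    chunks, start, ctype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel flushes last chunk
        if tag.startswith("B") or tag == "O":
            if ctype is not None:
                chunks.add((ctype, start, i))
            start, ctype = i, (tag[2:] if tag.startswith("B") else None)
        elif tag.startswith("I") and ctype != tag[2:]:  # I- opening a new type
            if ctype is not None:
                chunks.add((ctype, start, i))
            start, ctype = i, tag[2:]
    return chunks

def chunk_f1(gold_tags, pred_tags):
    gold, pred = extract_chunks(gold_tags), extract_chunks(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

gold = ["B-NP", "I-NP", "O", "B-VP", "B-NP", "I-NP"]
pred = ["B-NP", "I-NP", "O", "B-VP", "B-NP", "B-NP"]
print(round(chunk_f1(gold, pred), 2))  # 0.57
```

Note that a single wrong boundary invalidates the whole span, which is why chunk F1 is stricter than per-token accuracy.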
An alternative to IOB is IOE (Sang and Veenstra, 1999), where “E” represents the
final word of a chunk immediately preceding another chunk in IOE1, or the final word
of every chunk in IOE2. Sang and Veenstra (1999) split the baseNP dataset into two
groups and investigated the effectiveness of IOB and IOE. The results were inconclusive
as to which schema best improves performance. Considering that IOB and
IOE follow the same core concept to segment chunks, it is reasonable that they do
not vary much in performance.
Another popular encoding schema is BIOES (BILOU) (Ratinov and Roth, 2009),
where “E” stands for the ending token of a chunk, and “S” denotes a single element.
Research shows that chunkers using BIOES significantly outperform those using
IOB (Yang et al., 2018; Ratinov and Roth, 2009; Dai et al., 2015). This is likely
because BIOES is more fine-grained than IOB, allowing the machine to learn a more
expressive model with only a small number of extra parameters.
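Since the underlying chunks are the same, converting between schemas is mechanical. A small sketch of IOB2-to-BIOES conversion:

```python
# Convert IOB2 (BIO) tags to the finer-grained BIOES schema:
# chunk-final tokens become E-, single-token chunks become S-.
def bio_to_bioes(tags):
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, ctype = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        last = not (nxt == "I-" + ctype)   # chunk ends unless next continues it
        if prefix == "B":
            out.append(("S-" if last else "B-") + ctype)
        else:  # prefix == "I"
            out.append(("E-" if last else "I-") + ctype)
    return out

print(bio_to_bioes(["B-NP", "I-NP", "O", "B-VP", "B-NP"]))
# ['B-NP', 'E-NP', 'O', 'S-VP', 'S-NP']
```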
The feature engineering approaches rely on hand-crafted feature sets from the sur-
rounding contexts, e.g., local lexical information, POS tags, and chunk tags of pre-
vious words. In this chapter, we further categorize the feature engineering approach
into two groups: local classification approaches and global classification approaches.
The local classification approaches view the chunking task as a sequence of
classification problems, one for each word in the sequence, where the predicted
tag at each position may depend on the features of the whole input sentence and
the predicted tags of previous words. The global classification approaches, on the
other hand, are able to trade off decisions at different positions to obtain a globally
optimal label sequence. Chunkers in this category are mostly graphical models, such
as HMM (Freitag and McCallum, 2000) and CRF.
The local classification approaches predict the label of one word in a sequence at a
time, utilizing different lexical and syntactic information as features, e.g., the word
itself, its POS tag, its surrounding words and their POS tags, to make the best local
decision. Ramshaw and Marcus (1999) proposed a chunker with transformation-
based learning. The chunk structure is represented by IOB1 tag schema in non-
recursive base NP distinction, and by BN, N, BV, V, P in noun/verb phrase separation.
First, a baseline heuristic is learned using POS tags. It is then used to produce
initial hypotheses for each site in training corpus. When the baseline prediction is
incorrect, the rule templates generate candidate rules for different locations based
on the identities of words within a neighborhood, their POS tags and the current
chunk tags. The candidate rules are tested against the rest of the corpus and sorted
based on their positive scores. This will eventually create an ordered sequence of
rules that predict the features of words. In order to speed up the learning process, an
index is constructed to link each candidate rule to its static locations in the corpus,
and the rules are disabled and re-enabled based on their scores and changes. Ngai
and Florian (2001) also applied transformation-based learning to chunking; their
method was previously introduced in the POS tagging section.
Daelemans et al. (1999a) proposed a memory-based learning method where
POS tagging, chunking, and identification of syntactic relations are formulated as
memory-based modules. The proposed model is a lazy learner, keeping all training
data available for extrapolation. Thus, it is more accurate than greedy learners for
NLU tasks. Memory-based learning constructs a classifier for a task by storing a set
of examples. Each example associates a feature vector with one of a finite number
of classes. The classifier extrapolates the classes of feature vectors from those of the
most similar feature vectors in the memory. The syntactic analysis process is split
into a number of classification tasks where input vectors represent a focused item
and a dynamically selected surrounding context. Outputs of some memory-based
modules are used as input by other memory-based modules.
Based on their work, Lee and Wu (2007) proposed a mask method based on SVM
classifier that does not depend on external knowledge and multiple learners. The
purpose of the mask method is to collect unknown word examples from original
training data, so that the chunker can handle unknown words in testing data. First,
a tokenizer and a POS tagger are applied to produce a POS tag for each token. Then,
the feature selection component encodes the important features of context words.
The One-Against-All SVM classification method is employed to classify the IOB1
tag of the words. After lexicon-related features are derived, the training set is divided
into two or more parts. New training examples can be generated by mapping the new
feature dictionary set from each training part. This method emulates training examples
that do not contain lexical information, which helps the model consider the effect of
unknown words and adjust the weights on lexicon-related features.
Johnson and Zhang (2005) proposed a semi-supervised method for text chunking,
which is based on the idea that good classifiers should have similar predictive
structure, and thus learning good structure from an auxiliary classification problem
can help improve performance on the target problem. The paper presents a linear
prediction model for structural learning. The assumption is that there exists a low-dimensional
predictive structure shared by multiple prediction problems, which can be discovered
through joint empirical risk minimization (ERM). The goal of this model is to
discover the common low-dimensional predictive structure parameterized by the
projection matrix in the predictor, i.e., to find the optimal projection matrix that
minimizes the empirical risk summed over all the problems. This optimization
problem is solved by alternating structure optimization (ASO) (Ando and Zhang,
2005). For SSL, the auxiliary prediction problems are generated automatically from
unlabeled data. A classifier is trained with a feature map and labeled data, whose
behavior is then predicted on the unlabeled data using another distinct feature map.
After the training data for each auxiliary problem is created, the optimal projection
matrix is computed from the training data via ASO, and the empirical risk on the
labeled data is minimized.
The baseline system only uses the current POS as lexical entry to determine the
current structural chunk tag. Then, the paper attempts to add more contextual infor-
mation by adding lexical entries into the lexicon, such as the current and the previous
words, and their POS tags. Adding more contextual information significantly
improves the accuracy; however, it is difficult to merge all the above context-dependent
lexicons into a single lexicon due to memory limitations. Thus, an error-driven learning
approach is adopted to examine the effectiveness of lexical entries and reduce the
size of lexicon.
Molina and Pla (2002) proposed an HMM-based tagging method where the model
finds the sequence of states of maximum probability given the input sequence. This
method can be used in many different shallow parsing tasks including text chunking,
given the appropriate input information. When implemented for chunking, the model
considers words and POS tags as the input. In addition, the paper suggests that the
output tag set could be too generic to produce accurate models. Thus, for a chunking
task, the model can be enriched by adding POS information and certain selected
words into the chunk tags. These are achieved by applying a specialization function
on the original training set. From this new training set, the Specialized HMM can be
learned by maximum likelihood. The tagging process is carried out using the Viterbi
algorithm.
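The Viterbi decoding step is standard; below is a minimal log-space implementation with toy HMM parameters (two hidden chunk states, three observation symbols) for illustration.

```python
import numpy as np

# Minimal Viterbi decoder for an HMM tagger: finds the state sequence of
# maximum probability given the observation sequence.
def viterbi(obs, start_p, trans_p, emit_p):
    n_states = len(start_p)
    # delta[t, s]: log-prob of the best path ending in state s at step t
    delta = np.full((len(obs), n_states), -np.inf)
    back = np.zeros((len(obs), n_states), dtype=int)
    delta[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, len(obs)):
        for s in range(n_states):
            scores = delta[t - 1] + np.log(trans_p[:, s])
            back[t, s] = np.argmax(scores)
            delta[t, s] = scores[back[t, s]] + np.log(emit_p[s, obs[t]])
    # follow back-pointers from the best final state
    path = [int(np.argmax(delta[-1]))]
    for t in range(len(obs) - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy parameters: two hidden chunk states (0 and 1), three symbols.
start = np.array([0.8, 0.2])
trans = np.array([[0.6, 0.4], [0.5, 0.5]])
emit = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 2, 1], start, trans, emit))  # [0, 1, 1]
```

In the Specialized HMM, the same recursion runs over the enriched (specialized) tag set.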
Although the HMM-based algorithms are well-understood, they require strict
conditional independence assumptions to work effectively, which makes it diffi-
cult to represent non-independent features, such as surrounding words. Attempts
have been made to enable chunkers to handle more statistically correlated features
of input tokens while obtaining global optimal labeling, e.g., the perceptron algo-
rithm (Collins, 2002) and bidirectional inference algorithm (Tsuruoka and Tsujii,
2005), both previously discussed in the POS tagging section.
Another popular algorithm to address this problem is based on CRF, which
is the most widely-used alternative to generative graphical models. As mentioned
in the POS tagging section, first proposed by Lafferty et al. (2001), CRF is an
undirected linear-chain graphical model. It uses a single exponential model for the
joint probability of the entire tag sequence given an observation sequence. CRF can
not only take in many statistically correlated features from the input data and train
them discriminatively, but also trade off decisions at different sequence positions to
obtain a globally optimal labeling. Hence, it averts the limitations while maintaining
the advantages of the local classification approach and the HMM-based approach.
Sha and Pereira (2003) introduced the application of CRF to text chunking,
proposing a novel CRF training algorithm with better convergence properties. For the
chunking task, the CRF labels are pairs of consecutive chunk tags, which establishes a
second-order Markov dependency between chunk tags. Each local feature is based on a
predicate on the input sequence and current position, and a predicate on pairs of labels.
Instead of using iterative scaling as the training algorithm (Lafferty et al., 2001), the paper
experiments with two training methods to maximize the log-likelihood of the training
set: preconditioned conjugate gradient (Shewchuk et al., 1994), and limited-memory
quasi-Newton (Nocedal and Wright, 2006). Both utilize approximate second-order
information to achieve high convergence speed.
Following their work, many alternative CRF and other second-order random field
algorithms have been proposed for chunkers to model more complex dependencies. For
instance, as described in the POS tagging section, Suzuki and Isozaki (2008) employed
an SSL CRF, which also proved effective at text chunking. Sutton et al. (2007)
proposed dynamic CRF (DCRF), which is a generalization of the original CRF that
repeats structure and parameters over a sequence of state vectors. Compared to con-
ventional CRF, DCRF is able to represent distributed hidden states and complex
interactions among states, such as factorial, second-order and hierarchical structure.
To achieve this, DCRF introduces clique index c, which represents any state in the
unrolled graph through a time step offset and its index in the state sequence y. Then,
the set of variables in the unrolled version of clique index c at time step t can be
denoted as yt,c . Let C be a set of clique indices. Similar to standard CRF, given an
input sequence x, the conditional probability P(y | x) is computed as
P(y | x) = (1/Z(x)) exp( Σ_t Σ_{c∈C} Λ_c · F(y_{t,c}, x, t) ).

For example, a factorial DCRF with L parallel chains of length T has within-chain
and between-chain cliques, giving

P(y | x) = (1/Z(x)) exp( Σ_{t=1}^{T−1} Σ_{l=1}^{L} Λ_l · F(y_{l,t}, y_{l,t+1}, x, t)
+ Σ_{t=1}^{T} Σ_{l=1}^{L−1} Λ′_l · F(y_{l,t}, y_{l+1,t}, x, t) ).
This factorized structure can be used to jointly train several sequence labeling
tasks, such as POS tagging and text chunking, with shared information. Based on
this, the paper further describes a marginal DCRF for joint learning between POS
tagging and chunking, which is inspired by the notion that the main purpose of POS
tagging is to help the prediction of chunking. Therefore, training by maximizing the
joint likelihood is not ideal, since the model might trade off accuracy on the
chunk tags to gain accuracy on the POS tags.
The proposed marginal training encourages the model to prioritize learning the
main task whilst retaining useful information from the other task. That is, in the
training set, the observations of POS tag sequence are ignored, thus the model is
able to focus on the conditional probability over y. Experiments show that joint
training using the marginal DCRF improves chunking accuracy slightly in comparison
to cascaded training, where the two tasks are learned in sequence.
Many deep neural networks, e.g., CNN, RNN and LSTM, can be applied to text
chunking. Unlike the previous methods, deep learning approaches are able to au-
tomatically extract features from the input texts, making it possible to avoid hand-
crafted feature templates.
As previously mentioned in the POS tagging section, Collobert et al. (2011)
described a window approach for sequence labeling problems. Aside from the pro-
posed model, they also suggested a novel training algorithm for tasks such as text
chunking, where tags are organized in chunks and some tags cannot follow other
tags. Named sentence-level log-likelihood, it is proposed as an alternative to softmax
and CRF, considering the scores of all possible tag paths for
a given sentence. To achieve this, a transition score Ai, j is introduced, which is a
trainable variable for jumping from tag i to tag j in successive words. The score of a
tag path is computed by summing the transition scores and the scores outputted by
the neural network. The score over all possible tag paths is normalized via softmax
and interpreted as a conditional tag path probability.
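A minimal numpy sketch of this scoring scheme: a path's score sums the network's emission scores and the transition scores A[i, j], and the normalizer over all paths is computed with a forward (log-sum-exp) recursion. The emission scores and transition matrix below are random stand-ins for actual network outputs.

```python
import numpy as np

def path_score(emissions, A, tags):
    # emissions[t, k]: network score for tag k at position t
    # A[i, j]: trainable transition score from tag i to tag j
    s = emissions[0, tags[0]]
    for t in range(1, len(tags)):
        s += A[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    return s

def log_partition(emissions, A):
    # forward recursion: log-sum-exp of the scores of all tag paths
    alpha = emissions[0]
    for t in range(1, len(emissions)):
        # alpha[j] = logsumexp_i(alpha[i] + A[i, j]) + emissions[t, j]
        m = alpha[:, None] + A
        alpha = np.log(np.exp(m - m.max(0)).sum(0)) + m.max(0) + emissions[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

rng = np.random.default_rng(0)
emissions = rng.standard_normal((4, 3))   # 4 words, 3 tags
A = rng.standard_normal((3, 3))
logp = path_score(emissions, A, [0, 2, 1, 1]) - log_partition(emissions, A)
print(logp)  # log-probability of this tag path; always below 0
```

Maximizing this log-probability for the gold path is exactly the sentence-level training objective described above.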
The advantage of this method over CRF is that it uses a non-linear neural network
instead of a linear model to maximize the likelihood, which encourages the model to
learn useful features according to the task of interest. Additionally, they jointly train
POS tagging, text chunking, and NER using the proposed window approach, where
all models share the same lexicon lookup table and the parameters of the first linear
layer, and the training objective is to minimize the loss averaged across all tasks.
Similarly, Yang et al. (2017) also utilized the correlation between text chunking,
POS tagging and NER. They showed that transfer learning from the latter two tasks
to chunking yields competitive performance. Their method was introduced in detail
in the POS tagging section. Huang et al. (2015) proposed a BiLSTM model with
a CRF layer (BiLSTM-CRF). The model efficiently utilizes past and future input
features via a BiLSTM layer and sentence level tag information via a CRF layer.
Yang et al. (2016a) described a deep hierarchical RNN, which can encode both
character-level and word-level sequential information. Many previous works (Dos
Santos and Zadrozny, 2014; Santos and Guimaraes, 2015; Kuru et al., 2016) show that
character-level features help alleviate the out-of-vocabulary (OOV) problem in
sequence labeling tasks, and most of them rely on convolutional layers to extract such
features. The proposed model, however, employs bidirectional GRU (Cho et al.,
2014) to achieve this. The model stacks multiple recurrent layers to build a
deep GRU. Such a deep GRU is used at both the character level and the word level, together
forming a hierarchical GRU. The word representations produced by the hierarchical
GRU are fed into another deep bidirectional GRU to extract the context information
in the word sequence. The resulting sequence of hidden states is used as input fea-
tures for the next layer, where a CRF models the dependencies between tags in the
sequence and predicts a sequence of tags.
Zhai et al. (2017) proposed a BiLSTM-based sequence chunking model where
each chunk is treated as a complete unit for labeling. They also explored the idea
of using pointer networks (Vinyals et al., 2015a) instead of IOB labels. The paper
divides sequence labeling into two subtasks: segmentation and labeling. The former
is to identify scope of the chunks explicitly, whereas the latter is to label each chunk
as a single unit based on the segmentation results. The model employs an encoder-
decoder framework where the decoder is modified to take chunks as inputs. The
BiLSTM encoder is used to create a sentence representation as well as segment the
input sequence into chunks. It uses a CNNMax layer to extract important information
from words in the chunk, and utilizes context word embeddings of the chunks to
capture context information. The decoder is an LSTM that takes all the information
above to generate hidden states to label the segmented chunks. To further improve
the accuracy, the model uses pointer network instead of IOB2 tags to identify chunks.
It identifies a chunk, labels it, and repeats the process until all words are processed.
At the beginning of a possible chunk, the pointer network determines which word
is the ending point. After a chunk is identified and labeled, it serves as the input of
the decoder in the next time step. With this setup, the model is able to utilize chunk-
level features for segmentation. Experiments show that the pointer network yields
better performance than the IOB2 encoding schema.
p(c | w) = (1/Z(w)) exp( Σ_{i=1}^{r} Λ · F(l_{i−1}, l_i, w, b_i, e_i) ).
2.5.4 Summary

Table 2.10: Comparison between different text chunking methods. FE stands for
feature engineering. LC stands for lexical context. SC stands for syntactic context.
PC stands for preceding chunk tag prediction. CL stands for chunk-level. Morph is
short for morphological. Joint stands for joint learning with other tasks.
Method                           Dataset        Result    Measure

Feature Engineering - local:
Muis and Lu (2016)               NUS SMS        76.62%    F1
Ramshaw and Marcus (1999)        baseNP         92.3%     F1
Daelemans et al. (1999a)         baseNP         93.8%     F1
Ngai and Florian (2001)          CoNLL-2000     92.30%    F1
Sang (2000)                      CoNLL-2000     92.50%    F1
Van Halteren (2000)              CoNLL-2000     93.32%    F1
Koeling (2000)                   CoNLL-2000     91.97%    F1
Kudo and Matsumoto (2000)        CoNLL-2000     93.95%    F1
Zhang et al. (2001)              CoNLL-2000     93.56%    F1
Lee and Wu (2007)                CoNLL-2000     94.22%    F1
Johnson and Zhang (2005)         CoNLL-2000     94.39%    F1

Feature Engineering - global:
Zhou and Su (2000)               CoNLL-2000     93.68%    F1
Molina and Pla (2002)            CoNLL-2000     92.23%    F1
Collins (2002)                   CoNLL-2000     93.53%    F1
Tsuruoka and Tsujii (2005)       CoNLL-2000     93.70%    F1
Sha and Pereira (2003)           CoNLL-2000     94.38%    F1
Suzuki and Isozaki (2008)        CoNLL-2000     95.15%    F1
McDonald et al. (2005)           CoNLL-2000     93.90%    F1
Sutton et al. (2007)             CoNLL-2000     93.87%    F1
Sun et al. (2008)                CoNLL-2000     94.34%    F1
Lin et al. (2020)                CoNLL-2000     92.44%    F1
Muis and Lu (2016)               NUS SMS        76.62%    F1

Deep Learning:
Collobert et al. (2011)          CoNLL-2000     94.32%    F1
Yang et al. (2017)               CoNLL-2000     95.41%    F1
Huang et al. (2015)              CoNLL-2000     94.49%    F1
Yang et al. (2016a)              CoNLL-2000     95.41%    F1
Zhai et al. (2017)               CoNLL-2000     94.72%    F1
Rei (2017)                       CoNLL-2000     93.88%    F1
Sun et al. (2020)                CoNLL-2000     95.44%    F1
Liu et al. (2020b)               CoNLL-2000     91.80%    F1
Wei et al. (2021)                CoNLL-2000     95.15%    F1
Zhao et al. (2019)               CoNLL-2000     94.80%    F1
Akbik et al. (2018)              CoNLL-2000     96.72%    F1
It is worth exploring whether a model designed specifically for chunking can
improve the performance. The more significant challenge, however, is that text
chunking as a sub-system in complex applications remains quite rare. Although
pushing for accuracy is important, it is also good to keep in mind how chunking as
a syntactic preprocessing step can benefit higher-level NLU tasks in semantics or
pragmatics. With the development of neural networks, the performance of sequence
labeling models is advancing rapidly. Currently, the most popular approach to text
chunking is the combination of deep learning algorithms with a CRF. In the future,
one possible way to further improve the performance is to apply graph neural
networks to chunking models. Another is to find a neural network alternative to the
CRF. Lastly, we hope to see more integration between text chunking and more
complex downstream tasks, e.g., aspect extraction and sentiment analysis.
2.6 Lemmatization
Lemmatization is an NLU task that reduces the inflected forms of given words into
their morphologically correct root forms. It is an essential preprocessing technique
that extracts concepts and keywords for downstream applications, e.g., search
engines (Halácsy and Trón, 2006; Balakrishnan and Lloyd-Yemoh, 2014) and dialogue
systems (Zhao and Gao, 2017; Altinok, 2018; Liu et al., 2019a). Another commonly
used method is stemming, which also converts words into their base form, but does
so by cutting off the prefixes or suffixes. Lemmatization, on the other hand, performs
a morphological analysis based on the context of given words, and thus is able to
preserve more syntactic information. For example, given the word “studied”, a stem-
mer simply removes the suffix and returns “studi”, whereas a lemmatizer is able to
extract the proper lemma “study”.
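The contrast can be reproduced with a toy suffix stemmer and a toy dictionary lemmatizer; both are illustrative stand-ins, not the actual Porter stemmer or a real morphological analyzer:

```python
# Toy contrast between stemming (blind suffix chopping) and
# lemmatization (lookup backed by morphological analysis). Both are
# illustrative stand-ins, not production tools.

SUFFIXES = ["ing", "ed", "es", "s"]  # checked in order

def stem(word):
    """Crude stemmer: strip the first matching suffix, no analysis."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[:-len(suf)]
    return word

# A lemmatizer instead consults morphological knowledge, here faked
# with a tiny hand-written lexicon.
LEXICON = {"studied": "study", "studies": "study", "went": "go"}

def lemmatize(word):
    return LEXICON.get(word, word)

print(stem("studied"), lemmatize("studied"))  # studi study
```

The stemmer has no way to know that "studi" is not a word, whereas the lemmatizer's morphological knowledge recovers the proper lemma.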
Lemmatization has received growing attention in recent years, especially for
highly inflected languages such as Dutch, Latin, and Arabic. As lemmatizing English
words is relatively easy, here we cover lemmatizers for inflection-rich languages to
provide more perspectives to this task. Annotated training corpora for lemmatization
mostly include lexicons for the target languages, CoNLL-2007 (Nivre et al., 2007),
CoNLL-2009 (Hajič et al., 2009), CoNLL-2018 (Zeman et al., 2018) and the Uni-
versal Dependencies (UD) treebanks (Nivre et al., 2016). The standard evaluation
metric is accuracy. Relevant information about the commonly used lemmatization
datasets can be viewed in Table 2.12.
Existing lemmatizers regard lemmatization as either a suffix-and-prefix transfor-
mation problem or a string-to-string transduction problem. The former focuses on
the starting and ending letters, identifying recurring affixes and transforming them.
The latter, on the other hand, considers the whole word form and generates the
operations needed to convert it into its lemma. In this chapter, we introduce the
previous works in three sections: the transformation approaches, the statistical
transduction approaches, and the neural transduction approaches.
Early lemmatizers learnt a set of classification rules that detected and modified
the suffix and/or prefix of a given word form to transform it into the corresponding
lemma. These transformation approaches viewed lemmatization as a rule-based
classification problem, where the class label assigned to a word specifies the
transformation to be applied to the word in order to obtain the normalized form.
For instance, class labels could take the form (x to y), where x is the suffix of
the word form and y is that of its lemma. Mladenic (2002) proposed two methods
for mapping words to their lemmas. The first is a letter-based representation
using transformation-based learning, where the machine learns a set of classification
rules from a feature set composed of suffixes. The other is a context-based
representation using a Naïve Bayes Majority classifier, where the features are
n-grams of the given words.
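Inducing such (x to y) class labels from (form, lemma) pairs can be sketched with a longest-common-prefix split; this is a simplified illustration, not any particular published scheme:

```python
# Inducing "(x to y)" suffix-transformation class labels from
# (word form, lemma) pairs, as in classification-based lemmatizers.
# Simplified illustration, not a specific published algorithm.

def suffix_class(form, lemma):
    """Return (x, y): the form suffix and lemma suffix left over after
    removing the longest common prefix of the two strings."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return form[i:], lemma[i:]

def apply_class(form, cls):
    """Apply a learned class label (x, y) to a new word form."""
    x, y = cls
    assert form.endswith(x)
    return form[:len(form) - len(x)] + y

cls = suffix_class("studied", "study")   # ("ied", "y")
# The same class generalizes to unseen forms with the same pattern:
carry = apply_class("carried", cls)      # "carry"
```

Because the class is a reusable transformation rather than a memorized pair, a classifier that predicts the right class can lemmatize forms it never saw in training.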
This approach has proven to perform well for different languages, effectively
reducing the labor of manually creating full-paradigm inflectional lexicons. Based
on the SES (shortest edit script) mechanism, Chrupała et al. (2008) proposed
Morfette, a modular, data-driven, probabilistic system that jointly learns
lemmatization and morphological tagging from morphologically annotated corpora.
Morfette consists of two MaxEnt classifiers (Ratnaparkhi, 1996) that are trained
to predict lemmas and morphological tags respectively, and another module that
dynamically combines their predictions and outputs a probability distribution over
tag-lemma pair sequences. The lemma classifier uses the SES method to induce
classes automatically.
Taking inspiration from the above-mentioned works, Müller et al. (2015) pre-
sented LEMMING, a modular model that jointly learns lemmatization and mor-
phological tagging at the token level. The proposed lemmatizer maps an inflected
form into its lemma given its morphological attributes using a log-linear model.
Following the induction method in Morfette (Chrupała et al., 2008), the lemmatizer
selects candidates through a deterministic pre-extraction of edit trees. This
formalization allows the integration of arbitrary global features and is widely
used in later transduction-based models. For joint learning, the lemmatizer is
combined with a
morphological tagger MARMOT (Müller et al., 2013) in a tree-structured CRF. Ex-
periments show that LEMMING yields significant improvements in joint accuracy
compared to Morfette and other lemmatizers.
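The shortest-edit-script idea underlying this family of class-induction methods can be illustrated with the standard library's difflib, which recovers edit operations between a form and its lemma; this is an illustration of the idea, not the original SES implementation:

```python
# Illustrating the shortest-edit-script (SES) idea: recover the edit
# operations turning a word form into its lemma, here via the standard
# library's difflib rather than the original SES implementation.
from difflib import SequenceMatcher

def edit_script(form, lemma):
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, form, lemma).get_opcodes():
        if tag != "equal":
            ops.append((tag, i1, form[i1:i2], lemma[j1:j2]))
    return ops

def apply_script(form, ops):
    out, prev = [], 0
    for tag, i, old, new in ops:
        out.append(form[prev:i])
        out.append(new)
        prev = i + len(old)
    out.append(form[prev:])
    return "".join(out)

script = edit_script("studied", "study")   # [('replace', 4, 'ied', 'y')]
# Used as a class label, the script generalizes to similar forms:
lemma = apply_script("carried", script)    # 'carry'
```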
Lyras et al. (2007) implemented the Levenshtein edit distance in a dictionary-
based algorithm for automatic lemmatization. The system calculates the similarity
between the input word and all dictionary lemmas, selecting those with the minimum
edit distance. To improve accuracy, they also used common suffixes in the target
language, removing possible suffixes and recalculating distances. The system then
compares all stored lemmas and returns the n-best ones with the lowest edit distance.
Experiments showed that the suffix-removal method, though more labor-intensive,
achieved higher accuracy than the baseline in both modern Greek and English.
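A minimal sketch of such dictionary-based lookup, with a plain Levenshtein distance and a toy lemma list (the suffix-removal refinement is omitted):

```python
# Minimal dictionary-based lemmatizer in the spirit of edit-distance
# lookup: return the dictionary lemma(s) closest to the input word.
# Toy lemma list; no suffix-removal refinement is applied.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def nearest_lemmas(word, lemmas, n=1):
    """Return the n dictionary lemmas with minimum edit distance."""
    return sorted(lemmas, key=lambda l: levenshtein(word, l))[:n]

LEMMAS = ["study", "walk", "bank", "run"]
print(nearest_lemmas("studied", LEMMAS))  # ['study']
```

Scanning the whole lexicon per query is quadratic in practice, which is one reason the suffix-removal heuristic above pays off: it shrinks the candidate set before distances are recalculated.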
Dreyer et al. (2008) presented a conditional log-linear model, which employs
overlapping features over latent alignment sequences, and learns latent classes and
latent string pair regions from incomplete training data. Given an input, the candidate
output is selected by a sliding window over the aligned (input, output) pair. At each
window position, the log probabilities of all possible alignments are accumulated,
evaluating each alignment separately. To further improve the performance, new la-
tent dimensions can be added to the (input, output) tuple. Toutanova and Cherry
(2009) presented a global joint model for lemmatization and POS tagging trained on
morphological lexicons and unlabeled data. The model consists of two components
– a semi-supervised POS tagger (Toutanova and Johnson, 2007), and a lemmatizing
transducer that is optionally given the POS tags of input word to inform the lemma-
tization. Taking a pipeline approach, given a sentence, the POS tagger first predicts
a set of tags for each word. Subsequently, the lemmatizer predicts one lemma for
each of the possible tags. The k-best predictions of tag sets and lemmas are chosen
to be the output. Additionally, based on the idea that words with the same lemma
have dependent POS tag sets, dependencies among multiple words are dynamically
determined.
The recent advancement of neural networks brings a fresh perspective to the trans-
duction approach. The encoder-decoder architecture in particular receives a lot of
attention, as character-level Seq2Seq models are exceptionally adept at handling the
string-to-string transduction task and able to capture more contextual information
than the statistical approaches. Kestemont et al. (2017) described a deep learning ap-
proach to lemmatization for variation-rich languages. The proposed system consists
of two basic components. One is temporal convolutions that model the orthography
of input words at the character level. The other is distributional word embeddings
from a Skip-gram model, which represent the lexical context surrounding the input.
Given a word, the system feeds the focus token into the convolutional component,
and its left and right tokens into two embedding components, respectively. The out-
puts of these three sub-nets, along with the one-hot encoding of the input token, are
concatenated to form one single hidden representation, which is then passed through
a final linear layer to produce the lemma.
Chakrabarty et al. (2017) introduced a composite bidirectional gated RNN
(BGRNN) architecture for language-independent, context-sensitive lemmatization.
Following previous works, the task is defined as detecting the correct edit tree rep-
resenting the transformation between a word form and its lemma. The proposed
architecture consists of two stages. First, a BGRNN is used to extract character-
level dependencies. The outputs are combined with the corresponding word
embedding given by word2vec (Mikolov et al., 2013a) to form the final word repre-
sentations. Then, the representations are fed sentence-wise into the second BGRNN
to capture the contextual information of the input word, and to learn the mapping
from word embeddings to word-lemma transformations.
The proposed model is compared with LEMMING (Müller et al., 2015) in Ger-
man, achieving competitive accuracy. The results also indicate that a log-linear
lemmatizer such as LEMMING is preferable when dealing with misspelled words,
whereas Seq2Seq lemmatizer is able to generalize and handle OOV words better.
Kondratyuk et al. (2018) proposed LemmaTag, which is a featureless bidirectional-
RNN-based architecture that jointly learns lemmatization and POS tagging. Given
a sentence, a GRU outputs the character-level embedding for every word, which is
summed with the word embedding to form the final word embeddings. The resulting
sequence is passed onto two layers of BiLSTM with residual connections, producing
a sequence of word representations with sentence-level connections. The POS tagger,
made up of a fully-connected layer, predicts the tag values, concatenating them into
a flat vector to pass onto the lemmatizer. The lemmatizer consists of an LSTM layer
with character-level attention mechanism that takes in the final word embedding,
the character embedding, and the POS features of the focus word to generate the
corresponding lemma.
Malaviya et al. (2019) introduced a simple LSTM-based joint learning model
for lemmatization and morphological tagging. Given a sentence, the morphological
tagger obtains the word representation for each word using a character-level BiL-
STM, which is then fed into a word-level BiLSTM to predict the corresponding
morphological tag. For lemmatization, a string-to-string transduction model (Wu
and Cotterell, 2019) is adopted, which is a Seq2Seq model with hard attention mech-
anism (Xu et al., 2015c; Rastogi et al., 2016). The joint probability of the input
sentence is defined as the product of the probability output by the tagger and all
the probabilities output by the transducer. Yildiz and Tantuğ (2019) proposed
Morpheus, a joint contextual lemmatizer and morphological tagger. Similarly to the
above-mentioned works, given a sentence, firstly a character-level LSTM generates
word vectors, which are then fed into a word-level BiLSTM to produce context-aware
word representations. Subsequently, two separate LSTM decoders are employed to
predict morphological tags and edit operations from word forms to lemmas. More
specifically, to find the minimum edit operations, a dynamic programming method
based on Levenshtein distance is used.
Manjavacas et al. (2019) also built upon the classic encoder-decoder architecture
to acknowledge the difficulty of spelling variations in lemmatization for non-standard
language. Specifically, they presented a hierarchical sentence encoder that is jointly
trained for lemmatization and language modeling, adopting the attention mecha-
nism (Bahdanau et al., 2015) to extract additional context. The hierarchical encoder
consists of three levels. A bidirectional RNN first computes the character-level rep-
resentation, which is fed into another bidirectional RNN to extract word-level fea-
tures. Lastly, the final bidirectional RNN outputs the sentence-level features based
on the word-level hidden state. To extract higher quality sentence-level features, a
word-level bidirectional language model is added to the lemmatizer. Two softmax
classifiers predict left and right tokens using forward and backward sentence-level
hidden states. Jointly minimizing lemmatization and language modeling losses al-
lows the lemmatizer to represent global sentence context without needing POS or
morphological tags.
2.6.4 Summary
Method | Language | Dataset | Result | Measure | Language | Dataset | Result | Measure
(columns 2-5: Experiment 1; columns 6-9: Experiment 2)

Trans
Mladenic (2002) | Slovene | MULText-EAST | 74.2% | Acc.
Plisson et al. (2004) | Slovene | MULText-EAST | 77.0% | Acc.
Juršič et al. (2007) | Slovene | MULText-EAST | 94.38% | Acc. | English | Multext | 94.14% | Acc.
Erjavec and Džeroski (2004) | Slovene | MULText-EAST | 92.0% | Acc.
Kanis and Müller (2005) | Czech | PDT | 95.89% | F1
Jongejan and Dalianis (2009) | Icelandic | IFD | 71.3% | Acc. | English | CELEX | 89.0% | Acc.
Daelemans et al. (2009) | Afrikaans | D-Lem | 91.0% | Acc.
Gesmundo and Samardzic (2012) | English | 1984 | 99.6% | Acc.

Statistical
Chrupała (2006) | Spanish | Cast3LB | 92.48% | F1 | Catalan | Cast3LB | 94.64% | F1
Müller et al. (2015) | English | Penn Treebank | 98.84% | Acc. | Spanish | CoNLL-2009 | 98.46% | Acc.
Lyras et al. (2007) | Modern Greek | L-Lem | 95.03% | Acc. | English | L-Lem | 96.46% | Acc.
Dreyer et al. (2008) | German | CELEX | 94.0% | Acc.
Toutanova and Cherry (2009) | English | CELEX | 99.0% | F1 | Slovene | MULText-EAST | 91.2% | F1
Nicolai and Kondrak (2016) | English | CoNLL-2009 | 93.3% | Acc. | German | CELEX | 90.0% | Acc.
Barteld et al. (2016) | German | GML | 97.76% | Acc.
Gallay and Šimko (2016) | MSA | PATB | 95.4% | Acc. | EGY | ARZ | 83.3% | Acc.
Rosa and Žabokrtský (2019) | English | UD v2.3 | 97.78% | Acc.
Akhmetov et al. (2020) | 25 languages | Lem-list | 84.05% | avg Acc.

Neural
Kestemont et al. (2017) | Middle Dutch | RELIG | 90.97% | Acc. | Middle Dutch | CG-LIT | 91.67% | Acc.
Chakrabarty et al. (2017) | Hindi | WSD | 94.90% | Acc. | Spanish | CoNLL-2009 | 98.11% | Acc.
Bergmanis and Goldwater (2018) | 20 languages | UD v2.0 | 95.0% | avg Acc.
Celano (2020) | Latin | LT4HALA | 94.6% | Acc.
Arakelyan et al. (2018) | English | CoNLL-2018 | 95.77% | Acc. | French | CoNLL-2018 | 86.28% | Acc.
Pütz et al. (2018) | German | TüBa-D/Z | 97.02% | Acc. | German | NoSta-D | 83.96% | Acc.
Kondratyuk et al. (2018) | Czech | PDT | 98.37% | Acc. | English | UD-EWT | 97.53% | Acc.
Malaviya et al. (2019) | 20 languages | UD v2.0 | 95.42% | avg Acc.
Yildiz and Tantuğ (2019) | 97 languages | UniMorph | 97.85% | avg Acc.
Manjavacas et al. (2019) | 20 languages | UD v2.2 | 96.28% | avg Acc.
Kondratyuk (2019) | 100 languages | SIGMORPHON | 95.00% | avg Acc.
Zalmout and Habash (2019) | MSA | PATB | 97.6% | Acc. | EGY | ARZ | 88.5% | Acc.
Chakrabarty et al. (2019) | French | UD | 95.47% | Acc. | Hindi | UD | 97.07% | Acc.
Schmitt and Constant (2019) | French | FTB | 95.6% | Acc. | Polish | SEJFEK | 88.9% | Acc.
Zalmout and Habash (2020) | MSA | PATB | 95.4% | Acc. | EGY | ARZ | 83.3% | Acc.
Milintsevich and Sirts (2021) | 23 languages | UD v2.5 | 97.09% | avg Acc.
Transduction approaches, on the other hand, view either the character as the base
unit for lemma generation or the entire word as the unit to map onto its lemma
based on string-similarity measures. Transduction-based methods are more flexible
than the transformation-based ones, as they do not rely on an affix lookup table,
whether manually built or automatically generated. Thus, the transduction approach
is better suited to handling irregular changes. In this chapter, we further split
the transduction approaches into statistical and neural-network-based methods,
since the development of deep learning brings a new perspective to lemmatization.
Statistical transduction approaches mostly rely on edit scripts, for instance, the
data-driven SES mechanism (Chrupała, 2006; Chrupała et al., 2008). Müller et al.
(2015) improved SES by proposing the edit tree, which does not encode the LCSs
and is thus more capable of generalizing. Following their work, Lyras et al. (2007)
and Nicolai and Kondrak (2016) enhanced lemmatization with stemming. Toutanova
and Cherry (2009) incorporated POS tagging as a sub-system. Barteld et al. (2016)
further improved the edit tree with LC, which is more adept at handling word-internal
inflections. Dreyer et al. (2008) proposed a window approach as an alternative to
the edit tree, which can handle more complex inflections, though it is restricted
by the window size. Another alternative is to infer the lemma from word
embeddings (Gallay and Šimko, 2016; Rosa and Žabokrtský, 2019; Akhmetov
et al., 2020), paving the way for neural approaches.
2.7 Conclusion
A. Reading List
• Xulang Zhang, Rui Mao, and Erik Cambria. Granular Syntax Processing with
Multi-Task and Curriculum Learning. Cognitive Computation 16, 2024 (Zhang
et al., 2024a)
• Iti Chaturvedi, Ranjan Satapathy, Curtis Lynch, and Erik Cambria. Predicting
Word Vectors for Microtext. Expert Systems, e13589, 2024 (Chaturvedi et al.,
2024)
• Xulang Zhang, Rui Mao, and Erik Cambria. SenticVec: Toward Robust and
Human-Centric Neurosymbolic Sentiment Analysis. In: Proceedings of ACL,
4851–4863, 2024 (Zhang et al., 2024c)
B. Relevant Videos
• Labcast about Granular Syntax Processing: [Link]/Fr9lZ6u3Fxs
C. Related Code
• Github repository about Syntactics Processing: [Link]/SenticNet/Syntactics
D. Exercises
• Exercise 1. Create a Python program that reads a list of microtexts and outputs
their normalized versions. Use a predefined dictionary of common microtext ab-
breviations and their full forms. Implement a function to replace abbreviations
with full forms. Test your program with a given set of microtexts.
• Exercise 4. Test your text chunking skills by identifying and labeling noun
phrases (NP) and verb phrases (VP) in the previous sentence. Read the sentence
again, determine NPs and VPs, and then label each chunk accordingly.
• Exercise 5. Lemmatize the previous sentence by replacing each word with its
base form or lemma. Then, compare your lemmatized sentence with the output
produced by NLTK’s WordNetLemmatizer to see how they align and identify any
differences.
Chapter 3
Semantics Processing
Key words: Word Sense Disambiguation, Named Entity Recognition, Concept Ex-
traction, Anaphora Resolution, Subjectivity Detection
3.1 Introduction
Fig. 3.1: Semantic research domains in linguistics. Lex. denotes lexical; sem. denotes
semantics; und. denotes understanding.
Table 3.1: Surveyed semantic processing tasks and their downstream applications. F
denotes that the technique yielded features for a downstream task model; P denotes
that the technique was used as a parser; E denotes that the technique improved the
explainability for a downstream task. CE denotes concept extraction. AR denotes
anaphora resolution. SD denotes subjectivity detection.
Given the broadness of semantics, this chapter’s scope lies in semantic processing
techniques for WSD, NER, concept extraction, anaphora resolution, and subjectivity
detection. This is because these low-level semantic processing tasks reflect different
aspects of semantics. In addition, there were many research works on these tasks
in the field of computational linguistics. We focus on low-level semantic processing
tasks, rather than high-level semantic processing tasks, e.g., sentiment analysis and
natural language inference, because they provide fundamental building blocks for
both high-level semantic processing tasks and higher-level NLU tasks.
Multiple semantic processing techniques have rarely been surveyed in the same article.
Salloum et al. (2020) surveyed several high-level semantic processing tasks, e.g.,
latent semantic analysis, explicit semantic analysis, and sentiment analysis. Unlike
the work of Salloum et al. (2020), this chapter includes the latest research in low-level
semantic processing techniques. Compared to the latest semantics processing surveys
focusing on specific tasks (Ransing and Gulati, 2022; Poesio et al., 2023; Wang
et al., 2022; Montoyo et al., 2012), we additionally reviewed important theoretical
research and downstream task applications in these domains. These contents can
help readers better understand the foundation of semantic research in linguistics, as
well as potential application scenarios.
More importantly, theoretical research shows the big picture of a semantic process-
ing task, which may inspire different research tasks in the computational linguistic
community. The collection of multiple semantic processing techniques is helpful for
readers to have a comprehensive understanding of a large field, inspiring more fusion
research across different domains. Theoretical research of other tasks has the poten-
tial to inspire fresh perspectives among researchers who have been concentrating on
a specific semantic research task.
Fig. 3.2: Outline of this chapter. Each subtask is explained in terms of different
technical trends and downstream applications.
The complexity of human language makes it difficult for machines to understand. One
of the challenges is the ambiguity of word senses. In natural language, a word may
have multiple senses, given different contexts. Consider the following example:
(1) He got his shoes wet as he walked along the bank.
According to the Oxford English Dictionary, the major senses of “bank” include
(a) an organization that provides various financial services, for example keeping
or lending money; (b) the side of a river, canal, etc. and the land near it. With the
context, humans can easily know that “bank” here refers to the sense (b). However,
it is challenging for machines to do so because the interpretation made by humans is
contingent upon their comprehension of the fact that the probability of getting one’s
shoes wet is higher when walking alongside a river bank as compared to a financial
institution. Machines rarely take commonsense into account when inferring the
meaning of “bank”4, because they do not have human-like cognition and reasoning
abilities by nature.
There are two main technical trends in addressing the task of WSD, namely
knowledge-based methods and supervised methods. Knowledge-based WSD utilizes
the word relations from knowledge graphs, e.g., WordNet (Fellbaum, 1998) and
BabelNet (Navigli and Ponzetto, 2012b) to achieve the disambiguation of word
senses. In supervised methods, the WSD task is usually defined as classification
over word senses, and a WSD model is trained with annotated data. Two examples
of WSD are illustrated in Fig. 3.3. As shown in the figure, a naïve strategy
of the knowledge-based WSD is that the sense that shares the most relations with
the context words is selected as the best-matched one. For supervised WSD systems,
the predictive model predicts the potential senses, given the target word and its
context words as input. In recent times, the use of knowledge bases has proven
advantageous for several modern supervised systems. As a result, there has been a
growing trend in integrating knowledge-based and supervised methods to enhance
their performance (Wang and Wang, 2020).
4 Current methods likely disambiguate word senses by word co-occurrences. However, word co-
occurrences are not commonsense.
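The naïve overlap strategy can be sketched as a simplified Lesk-style matcher over a hypothetical two-sense inventory for "bank"; the glosses and stopword list below are toy stand-ins for a real knowledge base:

```python
# Simplified Lesk-style knowledge-based WSD: pick the sense whose gloss
# shares the most words with the context. The two-sense inventory and
# stopword list are toy stand-ins for a real knowledge base.

SENSES = {
    "bank.n.1": "an organization that provides various financial services "
                "for example keeping or lending money",
    "bank.n.2": "the side of a river canal etc and the land near it",
}

STOP = {"the", "a", "an", "of", "he", "his", "as", "and", "it", "etc"}

def tokens(text):
    return {w for w in text.lower().split() if w not in STOP}

def lesk(context, senses):
    ctx = tokens(context)
    return max(senses, key=lambda s: len(ctx & tokens(senses[s])))

sent = "He got his shoes wet as he walked along the bank of the river"
print(lesk(sent, SENSES))  # bank.n.2
```

Note that on the chapter's original example sentence, which shares no content words with either gloss, both senses tie at zero overlap and the method has no basis to choose: word overlap is not commonsense.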
3.2 Word Sense Disambiguation 119
WSD has been recognized as a crucial module in numerous NLU tasks that
heavily rely on word senses, such as polarity detection, information retrieval, and
machine translation. The application of WSD techniques has been demonstrated to
be beneficial for these NLU tasks. While prior surveys (Bevilacqua et al., 2021;
Navigli, 2009) have conducted extensive reviews for WSD, the works discussed in
them are outdated.
The distributional hypothesis (Firth, 1957) holds that word meanings can be
inferred from word co-occurrences: words that appear in similar contexts tend to
have similar meanings. This hypothesis has been the most significant
foundation of developing semantic representations in the computational linguistics
community, e.g., vector space representations (Turney and Pantel, 2010; Mikolov
et al., 2013b; Pennington et al., 2014) and PLMs (Devlin et al., 2018; Liu et al.,
2019c). Based on such a hypothesis, dense semantic vectorial representation re-
search commonly follows a similar training paradigm, e.g., using context words to
predict a target word. Currently, ChatGPT further shows that learning to use the
words that have appeared before to predict the next possible word can achieve the
skills of analogy and reasoning with the help of a very large Transformer-based
model (Vaswani et al., 2017).
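The distributional hypothesis can be illustrated in miniature with raw co-occurrence counts over a four-sentence toy corpus (purely illustrative numbers):

```python
# The distributional hypothesis in miniature: words used in similar
# contexts ("cat", "dog") end up with similar co-occurrence vectors.
from collections import Counter
import math

CORPUS = [
    "the cat drinks milk", "the dog drinks water",
    "the cat chases mice", "the dog chases cats",
]

def cooc_vectors(corpus, window=2):
    vecs = {}
    for sent in corpus:
        words = sent.split()
        for i, w in enumerate(words):
            ctx = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
            vecs.setdefault(w, Counter()).update(ctx)
    return vecs

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in set(u) | set(v))
    norm = lambda x: math.sqrt(sum(c * c for c in x.values()))
    return dot / (norm(u) * norm(v))

vecs = cooc_vectors(CORPUS)
# "cat" and "dog" share contexts (the, drinks, chases), so they are
# closer to each other than either is to "milk".
```

Word embeddings and PLMs refine this same signal; the counts here are just its rawest form.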
Goldberg and Suttle (2010) argued that the meanings of words are frequently derived
from larger language units, termed constructions. Constructions consist of a form and
a meaning, ranging from single words to full sentences in size. The interpretation of a
construction is reliant on both its structure and the situations in which it is employed.
They further argued that semantic restrictions are better linked with the
construction as a whole rather than with the lexical semantic framework of the
verbs. Their work highlights that the interpretation of
meanings of language units can be extended from individual words to constructions.
It shows the necessity of defining language units in WSD.
Fillmore et al. (2006) proposed frame semantics that provides a distinct viewpoint
on the meanings of words and the principles behind language construction. Frame
semantics emphasizes the significance of the surrounding context and encyclope-
dic knowledge in comprehending word meanings. Petruck (1996) explained that a
“frame” refers to a collection of concepts interconnected in such a manner that
understanding any one concept depends on the understanding of the complete system.
In frame semantics, the meaning of “cooking” goes beyond its dictionary definition:
it also associates with the concepts of “food”, “cook”, “container”, and “heating
instrument”.
Frame semantics motivates later ontology research, e.g., FrameNet (Ruppenhofer
et al., 2016) and FrameNet-based WSD systems, significantly.
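A frame of this kind can be represented as a simple structure; the slot names and glosses below are illustrative, not FrameNet's actual frame elements:

```python
# A toy "Cooking" frame: each element presupposes the whole system.
# Slot names and glosses are illustrative, not FrameNet's actual
# frame elements.
cooking_frame = {
    "frame": "Cooking",
    "elements": {
        "cook": "the agent who prepares the food",
        "food": "the entity being prepared",
        "container": "the vessel holding the food",
        "heating_instrument": "the source of heat",
    },
    "evoked_by": ["cook", "bake", "fry", "boil"],  # lexical units
}

def elements_of(frame):
    return sorted(frame["elements"])
```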
For knowledge-based WSD, the data are normally presented as ontologies, such as
WordNet, FrameNet, and BabelNet, where words and concepts are connected by
relations. The relations include hyponyms, hypernyms, holonyms, meronyms, at-
tributes, entailment, etc. An explanation (gloss) and a few example sentences are
given for each synset. Synsets of the same POS are connected under some relations
independently. However, relations also exist between words whose basic concept
is the same but whose POS differs (for example, “propose” and “proposal” are
characterized as “derivationally related synsets” in WordNet). For supervised WSD,
a particular word in a given sentence is annotated with a sense ID that corresponds
to one of the potential senses in a knowledge base, such as WordNet. A sample of
annotation is shown in the next section.
3.2.3 Datasets
Our surveyed datasets and their statistics can be viewed in Table 3.2. The biggest
manually annotated English corpus currently accessible is SemCor5 (Miller et al.,
1993). It has 200K content terms tagged with their related definitions in around
40K sentences. Although SemCor serves as the principal training corpus for WSD,
its limited coverage of the English vocabulary for both words and meanings is
its most significant drawback. In essence, SemCor merely includes annotations
for 22K distinct lexemes in WordNet, the most extensive and commonly employed
computerized English dictionary, which corresponds to less than 15% of all words. To
augment the coverage of words, Vial et al. (2019) incorporated the English Princeton
WordNet Gloss Corpus (WNG)6, which contains more than 59K WordNet senses,
as complementary data. The WNG is annotated manually or semi-automatically.
SemCor and its variations (Bentivogli and Pianta, 2005; Bond et al., 2012) lack an
acceptable multilingual equivalent in the majority of global languages, which limits
the scaling capabilities of WSD models beyond English. To address the aforemen-
tioned issues, numerous automatic methods for creating multilingual sense-annotated
data have been developed (Pasini and Navigli, 2017; Pasini et al., 2018; Scarlini et al.,
2019). In an English-Italian parallel corpus known as MultiSemCor (Pianta et al.,
2002), senses from the English and Italian versions of WordNet are annotated. The
Line-hard-serve corpus (Leacock et al., 1993) contains 4K samples of the nomi-
nal, adjective, and verbal words with sense tags. The data were sourced from Wall
Street Journal (WSJ) corpus and the American Printing House for the Blind (APHB)
corpus. The Interest corpus (Bruce and Wiebe, 1999) contains 2,369 occurrences
of the term interest that have been sense-labeled. The data were sourced from the
HECTOR word sense corpus (Atkins, 1992).
5 [Link]
6 [Link]
Table 3.2: WSD datasets and statistics. EMEA, KDEdoc, and EUB denote European
Medicines Agency documents, KDE manual corpus, and the EU bookshop corpus,
respectively.
The Defence Science Organisation (DSO), based in Singapore, created the DSO
corpus7 (Ng and Lee, 1996), which contains 192,800 sense-tagged tokens from 191
words from the Brown and WSJ corpora. The Open Mind Word Expert (OMWE)
dataset8 (Chklovski and Pantel, 2004) is a corpus of sentences with 288 noun oc-
currences that were jointly annotated by Web users. One Million Sense-Tagged for
WSD and Induction (OMSTI)9 (Taghipour and Ng, 2015b) is a semi-automatically
annotated WSD dataset with WordNet sense inventory. The data were sourced from
MultiUN corpus, which is a collection of United Nation documents. The OMSTI
includes 687,871 nouns, 412,482 verbs, 251,362 adjectives, and 6,207 adverbs
after including selected samples from SemCor and DSO.
The SensEval and SemEval datasets were created from the SensEval/SemEval
evaluation campaigns and have become the most widely used benchmarking
datasets in WSD. Raganato et al. (2017b) collected these datasets together10 and de-
veloped a unified evaluation framework for empirical comparison. The statistics of
the following datasets are from the collection of Raganato et al. (2017b). SensEval-
2 (Edmonds and Cotton, 2001) used WordNet 1.7 sense inventory, including 2,282
sense annotations for nouns, verbs, adverbs and adjectives. SensEval-3 (Snyder and
Palmer, 2004) employed WordNet 1.7.1 sense inventory, including 1,850 sense anno-
tations. SemEval-2007 Task 17 (Pradhan et al., 2007) employed WordNet 2.1 sense
inventory, including 455 nominal and verbal sense annotations.
7 [Link]
8 [Link]
9 [Link]
10 [Link]
SemEval-2013 Task 12 (Navigli et al., 2013) used WordNet 3.0 sense inven-
tory, including 1,644 nominal sense annotations. SemEval-2015 Task 13 (Moro and
Navigli, 2015) utilized WordNet 3.0 sense inventory, including 1,022 sense annota-
tions. It is worth noting that some of the SemEval tasks are multilingual, including
SemEval-2013 and SemEval-2015, which facilitates multilingual WSD.
All of these corpora are annotated using various WordNet sense inventories, with
the exception of the Interest corpus (tagged with LDOCE senses) and the SensEval-1
corpus. The Interest corpus and the SensEval-1 corpus were sense-labeled using the
HECTOR sense inventory, a lexicon and corpus from a joint Oxford University Press/Digital project (Atkins, 1992). Generally, the data and labels in WSD datasets are organized as contexts, target words, and sense labels; the task is then to identify the sense class of each target word given its context.
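For illustration, a typical WSD instance can be sketched as a small record pairing a context with a target position and a gold sense label. This is a minimal sketch, not any specific dataset's schema; the field names and the WordNet-style sense key are hypothetical:

```python
# A minimal, hypothetical illustration of how WSD instances are commonly
# organized: a context, the position of the ambiguous word, its lemma and
# part of speech, and a gold sense label from an inventory such as WordNet.
instance = {
    "context": ["The", "bank", "approved", "the", "loan"],
    "target_index": 1,               # position of the ambiguous word
    "lemma": "bank",
    "pos": "NOUN",
    "gold_sense": "bank%1:14:00::",  # WordNet-style sense key (illustrative)
}

def target_word(inst):
    """Return the surface form of the word to be disambiguated."""
    return inst["context"][inst["target_index"]]
```

A WSD system receives the context and target position and must predict the gold sense from the candidate senses of the lemma.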
Machine-readable dictionaries (MRDs) have been a useful source for WSD due to
their structured knowledge and easy access (Navigli, 2009). Dictionaries frequently
contain extensive information about the various meanings of a word, as well as il-
lustrative examples of their usage within context. Therefore, dictionaries can serve
as valuable knowledge bases for the task of WSD. Additionally, MRDs may provide
further information such as synonyms, antonyms, and related words, which can aid
in facilitating a more comprehensive comprehension of a word’s meaning. Through
the analysis of this information, a system may make more precise determinations
about which meaning is most fitting in a given context. There are many electronic
dictionaries available for machines to refer to, such as the Longman Dictionary
of Contemporary English (LDOCE) (Mayor, 2009), the Oxford Dictionary of En-
glish (Dictionary, 2010), Collins English Dictionary (Dictionary, 1982), and the
Oxford Advanced Learner’s Dictionary of Current English (OALD) (Hornby and
Cowie, 1974).
124 3 Semantics Processing
Table 3.3: Useful knowledge bases for WSD. LDOCE means Longman Dictionary of Contemporary English. ODE means Oxford Dictionary of English. CED means Collins English Dictionary. OALD means Oxford Advanced Learner’s Dictionary of Current English. “Unstructured” or “structured” indicates whether the knowledge base contains unstructured or structured lexical knowledge organized by concepts.
(Table entries: WordNet, FrameNet, BabelNet, SyntagNet)
11 [Link]
SyntagNet (Maru et al., 2019) is a manually developed lexical resource that integrates
semantically disambiguated lexical combinations, e.g., noun-verb and noun-noun
pairs. The development of SyntagNet involved initially extracting lexical combina-
tions from English Wikipedia and the British National Corpus, which were then
subjected to a process of manual disambiguation based on WordNet. SyntagNet covers five major languages, namely English, German, French, Spanish, and Italian.
In the WSD task, given a sentence of n words T = {x1, . . . , xn}, the model predicts a sense for each word given the dictionary. Normally, the F1 score is adopted, which is a specialization of the F score when α = 0.5:

F = 1 / (α · (1/P) + (1 − α) · (1/R))    (3.1)

where P denotes precision and R denotes recall:

P = correct predictions / total predictions    (3.2)

R = correct predictions / n    (3.3)
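The metrics in Eqs. (3.1)-(3.3) translate directly into code. The sketch below assumes a system that attempts predictions for only some of the n target words, so precision and recall can differ; with α = 0.5 the weighted harmonic mean reduces to the familiar F1 = 2PR/(P + R):

```python
def precision(correct, attempted):
    """Fraction of attempted predictions that are correct (Eq. 3.2)."""
    return correct / attempted

def recall(correct, n):
    """Fraction of all n target words disambiguated correctly (Eq. 3.3)."""
    return correct / n

def f_score(p, r, alpha=0.5):
    """Weighted harmonic mean of precision and recall (Eq. 3.1)."""
    return 1.0 / (alpha / p + (1 - alpha) / r)

# A system that attempts 90 of 100 targets and gets 72 right:
p, r = precision(72, 90), recall(72, 100)  # 0.8 and 0.72
f1 = f_score(p, r)                         # ≈ 0.758
```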
These metrics do not accurately represent how well systems can produce a level of confidence for a particular sense choice. To this end, Resnik and Yarowsky (1999) developed an evaluation criterion that considers the discrepancies between the accurate and selected senses to weigh misclassification mistakes. Errors are penalized less severely if the chosen sense is a fine-grained distinction of the true sense than if it reflects a coarser sense confusion. There have been evaluation metrics for even more precise measurements, including the receiver operating characteristic (ROC). These metrics, however, are not as frequently utilized as precision, recall, and F1.
3.2.7 Methods
One family of knowledge-based methods matches the context of a term whose sense needs to be disambiguated against its sense representations, such as the definitions of candidate senses retrieved from a knowledge base. Lesk is a naïve knowledge-based WSD algorithm that looks for terms that
are similar to the target word in the context of each sense (Lesk, 1986). It enumer-
ates the intersections among lexicon definitions of the diverse connotations of every
target word contained within a given sentence. Banerjee et al. (2003) proposed an
advanced version of Lesk where the standard TF-IDF method is employed for
word weighting. Another improved version of Lesk (Basile et al., 2014) includes
word embeddings to improve the accuracy of determining how close the definition
and context of the target word are. Finally, SREF_KB (Wang and Wang, 2020) is a
vector-based technique that disambiguates word senses by using sense embeddings
and contextualized word representations. It applies BERT to represent WordNet
instances and definitions, as well as the automatically obtained contexts from the
Web.
12 [Link]
13 [Link]
14 [Link]
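The simplified Lesk idea can be sketched in a few lines. The toy glosses below merely paraphrase WordNet-style definitions for illustration; a real system would read them from a full dictionary:

```python
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "or", "for"}

# Toy glosses paraphrasing WordNet-style definitions (illustrative only).
GLOSSES = {
    "bank": {
        "bank.n.01": "sloping land beside a body of water such as a river",
        "bank.n.02": "a financial institution that accepts deposits and lends money",
    }
}

def content_words(text):
    """Lowercase, split, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def simplified_lesk(word, context):
    """Pick the sense whose gloss overlaps most with the context words."""
    ctx = content_words(context)
    return max(GLOSSES[word],
               key=lambda s: len(ctx & content_words(GLOSSES[word][s])))
```

For example, `simplified_lesk("bank", "the bank accepts deposits and lends money")` selects the financial sense, because that gloss shares four content words with the context while the river gloss shares none.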
A. Graph-based matching
A second family of matching algorithms creates a graph using the context and connections
retrieved from knowledge bases. Synsets and the relationships between them become
graph nodes and edges, respectively. Senses are then disambiguated based on the
constructed graphs. A variety of graph-based techniques, such as LDA (Blei et al.,
2003), PageRank (Brin and Page, 1998), Random Walks (Agirre et al., 2014), Clique
Approximation (Moro et al., 2014b), and Game Theory (Tripodi and Navigli, 2019),
are used to disambiguate the meaning of a given word using the created graph. Agirre
and Soroa (2009) presented a graph-based unsupervised WSD system that employs
random walk over a WordNet semantic network. They employed a customized version of the PageRank algorithm (Haveliwala, 2002). The technique leverages the
inherent structural properties of the graph that underlies a specific lexical knowledge
base, and shows the capability of the algorithm to identify global optima for WSD,
based on the relations among entities. Agirre et al. (2014) evaluated this algorithm
with new datasets and variations of the algorithm to prove its effectiveness.
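The flavor of these random-walk approaches can be illustrated with a pure-Python Personalized PageRank over a toy sense graph. This is a sketch only: the graph, node names, and damping value are illustrative, not the authors' implementations. Teleportation is restricted to the context words, so candidate senses well connected to the context accumulate rank:

```python
def personalized_pagerank(graph, seeds, damping=0.85, iters=50):
    """Power iteration for Personalized PageRank on an undirected graph.

    graph: dict mapping each node to a list of its neighbours.
    seeds: context nodes; the random surfer teleports only to them.
    """
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    teleport = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in graph}
    for _ in range(iters):
        rank = {
            v: (1 - damping) * teleport[v]
               + damping * sum(rank[u] / len(graph[u]) for u in graph[v])
            for v in graph
        }
    return rank

# Toy graph: two candidate senses of "plant" plus context words.
GRAPH = {
    "plant#factory": ["industry", "worker"],
    "plant#flora": ["leaf"],
    "industry": ["plant#factory", "worker"],
    "worker": ["plant#factory", "industry"],
    "leaf": ["plant#flora"],
}
```

Seeding the walk with the context {industry, worker} ranks plant#factory above plant#flora, disambiguating "plant" toward its factory sense.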
Navigli and Lapata (2007) also introduced a graph-based unsupervised model
for WSD, which analyzed the connectivity of graph structures to identify the most
pertinent word senses. A graph is constructed to represent all possible interpretations
of the word sequence, where nodes represent word senses and edges represent sense
dependencies. The model assessed the graph structure to determine the significance
of each node, thus finding the most crucial node for each word. Babelfy (Moro
et al., 2014b) is also a graph-based WSD method that uses random walk to identify
relationships between synsets. It used BabelNet (Navigli and Ponzetto, 2012b) and
performed random walks with Restart (Tong et al., 2006). In addition, it incorporated
the entire document at the time of disambiguation. Candidate disambiguation is performed over an automatically built semantic interpretation graph, which represents the various possible interpretations of the input text.
SyntagRank (Scozzafava et al., 2020) is a high-scoring knowledge-based WSD
algorithm. It is an entirely graph-based algorithm that uses the Personalized PageR-
ank algorithm to incorporate WordNet (for English), BabelNet (for non-English) and
SyntagNet. SyntagRank is generally considered a stronger method than SREF_KB. BabelNet enabled SyntagRank to improve its ability to scale across a wide range of languages, whereas SREF_KB has only been evaluated in English.
To address the scarcity of sense-annotated training data, numerous works began to supplement the training data by utilizing various lexical knowledge, such as sense definitions (Kumar et al., 2019a;
Blevins and Zettlemoyer, 2020), semantic relations (Bevilacqua and Navigli, 2020;
Conia and Navigli, 2021), and data generated via novel generative methods (Barba
et al., 2021b). In this section, we review representative works in supervised WSD.
Popov (2017) proposed to use BiLSTM (Graves and Schmidhuber, 2005), GloVe
word embeddings, and word2vec lemma embeddings. Yuan et al. (2016) suggested
another LSTM-based WSD approach that was trained in a semi-supervised fashion.
SSL was achieved by employing label propagation (Talukdar and Crammer, 2009) to
assign labels to unannotated sentences by assessing their similarity to labeled ones.
The best performance on the SensEval-2 dataset was observed for the model trained in a semi-supervised fashion with OMSTI and 1,000 additional unlabeled sentences.
Additionally, Le et al. (2018) looked more closely at how various factors affect its performance, and several intriguing conclusions were drawn. The initial point
to highlight is that achieving strong WSD performance does not necessitate an
exceedingly large unannotated dataset. Furthermore, this method provides a more
evenly-distributed sense assignment in comparison to prior approaches, as evidenced
by its relatively strong performance on infrequent cases. Additionally, it is worth
noting that the limited sense coverage of the annotated dataset may serve as an upper
limit on overall performance.
With the development of self-attention-based neural architectures and their ca-
pacity to extract sophisticated language information (Vaswani et al., 2017), the use
of transformer-based architectures in fully supervised WSD systems is becoming
more and more popular. A common strategy is to fine-tune a pretrained transformer model on the WSD task. The task-specific inputs are given to the
pretrained model, which is then further trained across a number of epochs with the
task-specific objective. Likewise, in recent token classification models for WSD,
the contextualized representations are usually generated by a pretrained model and
then fed to either a feedforward network (Hadiwinoto et al., 2019) or a stack of
Transformer layers (Bevilacqua and Navigli, 2019). These methods outperform ear-
lier randomly initialized models (Raganato et al., 2017a). Hadiwinoto et al. (2019)
tested different pooling strategies of BERT, e.g., last layer projection, weighted sum
of hidden layers, and Gated Linear Unit (Dauphin et al., 2017).
The best performance on SensEval-2 is given by the strategy of the weighted sum
of hidden layers, accounting for 76.4% F1 . Bevilacqua and Navigli (2019) proposed a
bi-directional Transformer that explicitly attends to past and future information. This
model achieved 75.7% F1 on SensEval-2 by training with the combination of SemCor
and WordNet’s Tagged Glosses15. It is worth noting that the categorical cross-
entropy, which is frequently utilized for training, limits the performance. In reality, it
has been demonstrated that the binary cross-entropy loss performs better (Conia and
Navigli, 2021) because it enables the consideration of many annotations for a single
instance in the training set as opposed to the use of a single ground-truth sense alone.
In the above-mentioned approaches, each sense is assumed to be a unique class, and
the classification architecture is limited to the information provided by the training
corpus.
15 [Link]
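The advantage of binary over categorical cross-entropy noted above can be sketched as follows: each candidate sense receives an independent sigmoid, so an instance annotated with several valid senses contributes a positive signal for all of them. The logit values in the example are hypothetical:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_wsd_loss(logits, gold_senses):
    """Mean binary cross-entropy over the candidate senses of one instance.

    Unlike softmax + categorical cross-entropy, which assumes exactly one
    correct class, every sense in gold_senses is treated as a positive
    label, so multiple valid annotations are all rewarded.
    """
    loss = 0.0
    for sense, z in logits.items():
        p = sigmoid(z)
        y = 1.0 if sense in gold_senses else 0.0
        loss -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return loss / len(logits)
```

A model that is confident in two senses is penalized by a single-label loss for the "extra" sense, but not by the binary formulation when both senses are annotated as gold.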
The edges that connect the senses and synsets are a valuable source of knowledge
that augments the annotated data. Traditionally, graph knowledge-based systems,
such as those based on Personalized PageRank (Scozzafava et al., 2020), have taken
advantage of this information. Moreover, utilizing WordNet as a graph has benefited
many modern supervised systems. Thus, formally, knowledge-augmented supervised
WSD is defined as a methodology that combines traditional supervised machine
learning techniques with external knowledge resources to improve the accuracy and
performance of WSD.
Wang and Wang (2020) used WordNet hypernymy and hyponymy relations to
devise a try-again mechanism that refines the prediction of the WSD model. The
SemCor corpus was utilized to acquire a supervised sense embedding for every
annotated sense in their supervised method (SREF_Sup). Vial et al. (2019) reduced
the number of output classes by mapping each sense to an ancestor in the WordNet
taxonomy, then yielding a smaller but robust sense vocabulary. The authors used
BERT contextualized embeddings. By training with SemCor and WordNet gloss
corpora, the model achieved 79.7% F1 on SensEval-2. Different variations also
achieve outstanding performance on diverse WSD datasets.
Loureiro and Jorge (2019) created representations for those senses not appear-
ing in SemCor by using the averaged neighbor embeddings in WordNet. The
token-tagger models EWISE (Kumar et al., 2019a) and EWISER (Bevilacqua and
Navigli, 2020) both leveraged the WordNet graph structure to train the gloss embed-
ding offline, where EWISER demonstrated how the entire WordNet graph can be directly exploited. EWISE used ConvE (Dettmers et al., 2018) to obtain
graph embeddings. Conia and Navigli (2021) provided a new technique to use the
same edge information by replacing the adjacency matrix multiplication with a bi-
nary cross-entropy loss where other senses connected to the gold sense are also
taken into account. The edge information was obtained from WordNet. In general,
edge information is increasingly used in supervised WSD, gradually blending with
knowledge-based techniques. However, it can only be conveniently utilized by to-
ken classification procedures, whereas its incorporation into sequence classification
techniques has not yet been researched.
It has also been extensively studied how to use sense definitions as an additional
source for supervised WSD apart from the traditional data annotations. This considerably increases the scalability of a model on senses that are underrepresented in the training corpus. Huang et al. (2019a) cast WSD as a binary classification task, whereby a model must decide whether the sense of a given word in context aligns with one of its potential meanings in a sense inventory, based on the provided definition. They define the WSD task as a
sentence-pair classification task, where the WordNet gloss of a target word is con-
catenated after an input sentence. Blevins and Zettlemoyer (2020) used a bi-encoder
to project both words in context and WordNet glosses in a common vector space.
Disambiguation is then carried out by determining the gloss that is most similar to
the target word.
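At inference time, the bi-encoder idea reduces to a nearest-gloss search in the shared vector space. The sketch below uses hand-made two-dimensional vectors in place of trained encoder outputs, so the embeddings and numbers are purely illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def disambiguate(context_vec, gloss_vecs):
    """Return the sense whose gloss embedding is closest to the
    contextual embedding of the target word."""
    return max(gloss_vecs, key=lambda s: cosine(context_vec, gloss_vecs[s]))

# Hypothetical embeddings standing in for encoder outputs.
GLOSS_VECS = {"bank.n.01": [0.1, 0.9],   # river-bank gloss
              "bank.n.02": [0.8, 0.2]}   # financial-institution gloss
```

With a context vector close to [0.9, 0.1], the financial gloss wins the similarity comparison and the financial sense is predicted.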
WSD has been applied in many Sentiment Analysis works to improve accuracy
and explainability. Farooq et al. (2015) proposed a WSD framework to enhance the
performance of sentiment analysis. To determine the orientation of opinions related
to product attributes in a particular field, a lexical dictionary comprising various word
senses is developed. The process involves extracting relevant features from product
reviews and identifying opinion-bearing texts, followed by the extraction of words
used to describe the features and their contexts to form seed words. These seed words,
which consist of adjectives, nouns, verbs, and adverbs, are manually annotated with
their respective polarities, and their coverage is extended by retrieving their synonyms
and antonyms. WSD was utilized to identify the sentiment-orientated senses, such as
the positive, negative, or neutral senses of a word in a sentence, because a word may
have different sentiment polarities by taking different senses in different contexts.
Nassirtoussi et al. (2015) offered a novel approach to forecast intra-day directional
movements of the EUR/USD exchange rates based on news headline text mining in an
effort to address semantic and sentiment components of text-mining. They evaluated
news headlines semantically and emotionally using the lexicons, e.g., WordNet and
SentiWordNet (Baccianella et al., 2010). SentiWordNet is a publicly accessible
lexical resource designed for sentiment analysis that allocates a positivity score,
negativity score, and objectivity score to each synset within WordNet.
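How such scores feed a sentiment pipeline can be sketched with a toy lexicon in the style of SentiWordNet; the senses and score triples below are invented for illustration. WSD supplies the sense, and the lexicon supplies the (positivity, negativity, objectivity) triple:

```python
# Toy lexicon in the style of SentiWordNet: each sense carries
# (positivity, negativity, objectivity) scores that sum to one.
# All entries and scores here are invented for illustration.
TOY_SENTI_LEXICON = {
    "good.a.01": (0.75, 0.00, 0.25),
    "terrible.a.01": (0.00, 0.88, 0.12),
    "market.n.01": (0.00, 0.00, 1.00),
}

def sentence_polarity(senses):
    """Average positive-minus-negative score over disambiguated senses."""
    scores = [TOY_SENTI_LEXICON[s] for s in senses]
    return sum(pos - neg for pos, neg, _ in scores) / len(scores)
```

Because the scores attach to senses rather than surface forms, disambiguating a word first lets the model use the polarity of the intended meaning.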
Nassirtoussi et al. (2015) found that both positive and negative emotions may
influence the market in the same way. WSD worked as a technique to abstract semantic
information in their framework. Thus, it enhances the feature representations and
explainability in their downstream task modeling. SentiWordNet has served as a basis
for various sentiment analysis models. In the work of Ohana and Tierney (2009), the
feasibility of using the emotional scores of SentiWordNet to automatically classify
the sentiment of movie reviews was examined. Other applications, e.g., business
opinion mining (Saggion and Funk, 2010), article emotion classification (Devitt and
Ahmad, 2007), word-of-mouth sentiment classification (Hung and Lin, 2013; Hung
and Chen, 2016) also showed that SentiWordNet as a semantic feature enhancement
knowledge base can deliver accuracy gains in sentiment analysis tasks.
The impacts of using WSD for information retrieval have been examined in many
works. Krovetz and Croft (1992) disambiguated word senses for terms in queries
and documents to examine how ambiguous word senses impact information retrieval
performance. The researchers arrived at the conclusion that the advantages of WSD
in information retrieval are marginal. This is due to the fact that query words have
uneven sense distributions. The impact of collocation from other query terms already
plays a role in disambiguation. WSD was used as a parser to study this task. However,
the findings from Gonzalo et al. (1998) are different. They examined the impact of
improper disambiguation using SemCor. By accurately modeling documents and
queries together with synsets (synonym sets), they achieved notable gains. Addition-
ally, their study showed that WSD with a 40%-50% error rate could still improve
information retrieval performance when using synset representation with synonyms.
Gonzalo et al. (1999); Stokoe et al. (2003) further confirmed the significance of
WSD to information retrieval. Gonzalo et al. (1999) also found that POS informa-
tion has a lower utility for information retrieval. By artificially creating word ambiguity, Sanderson (1994) employed pseudo-words to explore the effects of sense
ambiguity on information retrieval. They came to the conclusion that high WSD accuracy is a crucial condition for making progress. Blloshmi et al. (2021)
introduced an innovative approach to multilingual query expansion by integrating
WSD, which augments the query with sense definitions as supplementary semantic
information in multi-lingual neural ranking-based information retrieval. The results
demonstrated the advantages of WSD in improving contextualized queries, resulting
in a more accurate document-matching process and retrieving more relevant doc-
uments. Kim et al. (2004) labeled words with 25 root meanings of nouns rather
than utilizing fine-grained sense inventories of WordNet. Their retrieval technique
preserved the stem-based index and changed the word weight in a document in
accordance with the degree to which it matched the query’s sense. They credited
their coarse-grained, reliable, and adaptable sense tagging system with the improve-
ment on TREC collections. The detrimental effects of disambiguation mistakes are
somewhat mitigated by the addition of senses to the conventional stem-based index.
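Sanderson's pseudo-word construction, mentioned above, can be sketched as follows: two unrelated words are conflated into one artificial ambiguous token, and the original word serves as the gold "sense", yielding sense-labelled evaluation data for free (the helper name is hypothetical):

```python
def make_pseudoword(tokens, word_a, word_b):
    """Conflate two unrelated words into one ambiguous pseudo-word.

    Every occurrence of either word becomes 'word_a/word_b'; the word it
    replaced is recorded as the gold sense for later evaluation.
    """
    pseudo = f"{word_a}/{word_b}"
    out, gold = [], []
    for tok in tokens:
        if tok in (word_a, word_b):
            out.append(pseudo)
            gold.append(tok)
        else:
            out.append(tok)
    return out, gold
```

A disambiguator is then scored on how often it recovers the original word from the context, with no manual sense annotation required.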
The challenge of ambiguous word senses poses a significant barrier to the develop-
ment of an efficient machine translator. As a result, a number of researchers have
turned their attention to exploring WSD for machine translation. Some works tried
to establish datasets to quantify the WSD capacity of machine translation systems.
Rios Gonzales et al. (2017) proposed a test set of 6,700 lexical ambiguities for
German-French and 7,200 for German-English. They discovered that WSD remains
a difficult challenge for NMT, especially for uncommon word senses, even with
70% of lexical ambiguities properly resolved. Campolungo et al. (2022) proposed a
benchmark dataset that aims at measuring WSD biases in machine translation in five
language combinations. They also agreed that state-of-the-art systems still exhibited
notable constraints when confronted with less common word senses. Raganato et al.
(2019) proposed MUCOW, a multilingual contrastive test set automatically created
from word-aligned parallel corpora and the comprehensive multilingual sense inven-
tory of BabelNet. MUCOW spans 16 language pairs and contains more than 200,000
contrastive sentence pairs. The researchers thoroughly evaluated the effectiveness
of the ambiguous lexicons and the resulting test suite by utilizing pretrained NMT
models and analyzing all submissions across nine language pairs from the WMT19
news shared translation task.
Some works analyzed the internal representations to understand the disambigua-
tion process in machine translation systems. Marvin and Koehn (2018) examined the
extent to which ambiguous word senses could be decoded through the use of word
embeddings in relation to deeper layers of the NMT encoder, which were believed
to represent words with contextual information. In line with prior research, they
discovered that the NMT system frequently mistranslated ambiguous terms. Tang
et al. (2019) trained a classifier to determine if a translation is accurate given the
representation of an ambiguous noun. The fact that encoder hidden states performed
much better than word embeddings suggests that encoders are able to appropriately
encode important data for disambiguation into hidden states. Liu et al. (2018a) dis-
covered that an increase in the number of senses associated with each word results
in a decline in the performance of word-level translation. The root of the issue may
be the mapping of each word to similar word vectors, regardless of its context. They
proposed to integrate techniques from neural WSD systems into an NMT system to
address this issue.
3.2.9 Summary
However, we also observe that some important linguistic arguments have rarely been studied in the computational linguistics domain, e.g., defining the scope of linguistic units for
WSD and integrating relevant concepts (frames) for word sense representations. The
development of WSD datasets has greatly ignited the research enthusiasm of scholars
in WSD. However, we also observed that the computational research on WSD is
also limited by these well-defined datasets because WSD datasets generally follow
a very similar labeling paradigm. Relevant linguistic studies have shown broader
possibilities in WSD. Finally, we find that many WSD modeling techniques do not link well with downstream applications. Research on WSD methods intersects with downstream applications, yet it does not fully cover the needs of downstream tasks. This also shows that the research opportunities in WSD
can be largely extended besides word sense classification.
Table 3.4 shows the technical trends of WSD methods. As seen in the table, earlier works mostly used knowledge-based and supervised approaches. WordNet and BabelNet are useful knowledge bases that were frequently used by knowledge-based methods. Word embeddings, PLMs, and linguistic features, e.g., POS tags and semantic relatedness, were frequently used by supervised methods. Older pure knowledge-based methods often relied on the PageRank framework, because many
knowledge bases are represented as graphs. PageRank is an algorithm used in graph
computation to measure the importance of nodes in a graph. Classical machine learn-
ing techniques, e.g., decision trees and SVM, were commonly used by supervised
WSD methods. Supervised learning algorithms demonstrate superior performance
in comparison to knowledge-based approaches. Nevertheless, it is not always rea-
sonable to assume the availability of substantial training datasets for different areas,
languages, and activities. Ng (1997) predicted that a corpus of around 3.2 million
sense-tagged words would be necessary in order to produce a high-accuracy, wide-
coverage disambiguation system. The creation of such a training corpus requires an
estimated 27 person-years of labor. The accuracy of supervised systems might be
greatly improved over state-of-the-art methods with such a resource. However, realizing this hypothesis would come at the cost of huge resource consumption.
We observe more hybrid approaches that leverage knowledge bases in a super-
vised learning fashion in recent years. This is because researchers have observed the
limitations of typical supervised WSD in processing rare or unseen cases. Knowledge
bases provide additional information to support the learning of unseen cases. Knowl-
edge bases provide additional knowledge for the languages whose annotated data are
scarce. In this case, multilingual knowledge bases can enhance the representations
of word senses in a new domain. As a result, we can observe the accuracy of the
hybrid approaches surpasses the pure knowledge-based or supervised approaches.
Most existing WSD datasets define the task as a word sense classification task. Subsequent methodological research on these datasets has therefore focused on improving the accuracy of mapping the sense of a word to its dictionary sense class.
The WSD task was commonly defined as a word sense classification task. However,
we observe that classifying words by sense classes is not the only need for downstream
NLU tasks. There are three main tasks that are strongly related to WSD, namely polarity detection, information retrieval, and machine translation. One of the roles of WSD in these tasks is to deliver or enhance features to gain improvements. On the other hand, we also observe that many downstream works used WSD
techniques as a parser to obtain words with different levels of word sense ambiguity
or used WSD to gain insights into their model behaviors to improve the explainability
of a study. In these cases, defining WSD as a sense classification task may be sub-
optimal for downstream applications.
WSD has a huge potential in NLU research. For example, disambiguating word
senses in a large corpus can lead to a deeper understanding of language usage
patterns and the semantic relationships between words. WSD is also a significant
component in semantic explainable AI, because it helps researchers better understand
the decision-making process of a model on the semantic level. Researchers can
develop a more transparent and trustworthy model by explaining word senses in
contexts. As a feature generator, a WSD system may be more effective if it can generate
contextualized word meanings in natural language, rather than predict a sense class
that maps to a predefined gloss in a dictionary. However, research in these fields is
rare in the WSD community.
The task of WSD can be broader than the current word sense classification task setup
from either the theoretical research side or the downstream application side. Besides,
the improvements in WSD accuracy can also attract more downstream applications.
Thus, we come up with the following future work suggestions.
WSD can have different learning forms, besides word sense classification, e.g., para-
phrasing an ambiguous word into a less ambiguous one (Mao et al., 2018, 2022a),
or generating contextualized word senses in natural language. Such an extension may
have significance in downstream applications. From the perspective of linguistic and
cognitive research, studying how to define a language unit to better disambiguate
word senses, or studying how to link a word to its associated concepts in a context
can also improve the significance of WSD in the era of LLM-based NLP. Future
works may study how to define the task of WSD to better support the research in
different disciplines.
C. Multilingual WSD
Recent years have witnessed the great success of PLMs in various domains. Existing PLMs
followed the same hypothesis that the sense of a word can be learned from its as-
sociated context. However, there has not been a PLM that explicitly disambiguates
word senses to enhance the learning of semantic representations. Naïvely learning
the semantic representation of a target word by its associated context words cannot
learn the conceptual association of the target word. For example, many words can associate with the word “apple”. How can a model know that an apple, as a fruit, is red or green, sweet, tree-growing, and nutritious? As an electronic device, Apple is associated with an operating system, a circuit board, a brand, and so on. Disambiguating word senses
before pretraining may build such connections between concepts.
As Bevilacqua et al. (2021) argued, WSD can also be integrated with an entity
linking task (Moro et al., 2014b), where the model predicts associated entities to
help WSD systems explore the related glosses and relations. Related fusion works
also include fusing WSD for sentiment analysis (Farooq et al., 2015), information
retrieval (Blloshmi et al., 2021) and machine translation (Campolungo et al., 2022).
The future study of WSD can be grounded on an end task so that the end task can
more effectively benefit from the fusion of a WSD model.
Besides, NER is also used in various data mining tasks to recognize keywords,
topics, and attributes (He et al., 2019a; Li et al., 2021d, 2019a). NER can be traced
back to the third Message Understanding Conference (MUC) (Chinchor et al., 1993).
The task for MUC-3 was designed to extract relevant information from the text and
convert it into a structured format based on a predefined template, e.g., incident, the
targets, perpetrators, date, location, and effects. Early NER systems that participated
in MUC-3 primarily relied on rule-based approaches, which involved the manual
creation of rules to identify named entities based on their linguistic and contextual
features. However, with the dominance of deep learning in the NLP community,
most NER tasks are now performed using neural networks.
One of the first neural networks for NER was proposed by Collobert and Weston
(2008), which used a single CNN with manually constructed feature vectors. Later,
this approach was replaced with high-dimensional continuous vectors, which were
learned from large amounts of unlabeled data in an unsupervised manner (Collobert
et al., 2011). With stronger models, research in NER has now been largely extended to nested NER (Su et al., 2022), few-shot NER (Huang et al., 2022b), and joint entity and relation extraction (JERE) (Zhong and Chen, 2021). Compared to standard NER, where entity relationships are absent, entities in nested NER have a
hierarchical or nested structure, where one entity is embedded within another entity.
For example, given
(3) The Ontario Supreme Court said ...
“Ontario” is a state entity that is embedded under the government entity of “Ontario
Supreme Court” (Ringland et al., 2019b). Given the very expensive annotation costs,
few-shot NER is also a very important research trend. It learns NER with a limited
amount of labeled data. JERE tasks are established based on the needs of downstream
applications. In many cases, people not only need to know what an entity is but also
need to know the relationship between entities. Thus, JERE needs to identify named
entities in text as well as extract the relationships that exist between them. In the
following example
(4) Greg Christie has been one of the greatest engineers at Apple.
For standard NER, “Greg Christie” should be identified as Person; “Apple” should
be identified as Company. However, for JERE, besides the above entity recognition,
an additional relationship label, “work_at” should also be predicted. Compared to
identifying entities that are hierarchically structured within each other in nested
NER tasks, the outcomes of JERE deliver another relationship dimension to connect
entities. Both tasks are helpful in developing a comprehensive knowledge graph.
Due to the wide range of applications of NER, there have been several surveys
conducted on this typical NLU task (Li et al., 2020b; Yadav and Bethard, 2018).
Song et al. (2021) focused specifically on NER in the biomedical field, also known
as Bio-NER. In this domain, the presence of meaningless characters in biomedical
data presents a significant challenge, particularly with regard to inconsistent word
distributions.
140 3 Semantics Processing
Similarly, Liu et al. (2022d) summarized and discussed the challenges specific to
Chinese NER, rather than the more general English NER tasks. Meanwhile, Nasar
et al. (2021) explored both NER and relation extraction tasks, as they are closely
linked and typically form a pipeline. The above-mentioned surveys
focus on the technical perspective of NER, based on deep learning technology,
while this section takes a different approach. Specifically, we begin by examining
the linguistic background of NER and its historical development, as well as current
approaches and future directions. By taking a longer-term view and a more funda-
mental approach, we aim to provide a broader and more comprehensive perspective
on the development of NER.
Rosch (1973) argued that our classification system, which includes the classification
of named entities, is based on a central or prototype example. A prototype is a typical
example of a category that represents the most common features or characteristics
associated with the category. For example, the prototype of “bird” is associated
with features such as wings, feathers, and the ability to fly. Birds such as ostriches or
penguins, which do not perfectly possess these characteristics, may be viewed as less
typical examples. Rosch and Mervis (1975) discovered that individuals can identify
typical category examples faster and with greater precision than atypical examples.
Thus, learning from prototypes can help to quickly grasp the important features of a
named entity with a few examples.
Rosch et al. (1976) argued that the classification of categories is frequently deter-
mined not by strict boundaries, but by various degrees of membership. We can use
this theory for NER because the NER task also categorizes entities by predefined
classes. The idea of Graded Membership reflects how humans perceive and
categorize the world around us. Some categories, e.g., “vegetable”, may be viewed as
less distinct and vaguer. The theory suggests that the borders between categories
may not be well-defined in some cases, leading to ambiguities when attempting to
classify certain items, such as tomatoes or mushrooms. The ambiguity can be fur-
ther compounded by cultural or regional differences in how categories are defined
or classified.
3.3 Named Entity Recognition 141
According to Fauconnier and Turner (2008), the act of blending different elements
and their corresponding relationships is an unconscious process that is believed to be
ubiquitous in everyday thought and language. This process involves the combination
of various mental spaces or cognitive domains that are drawn from different scenarios
and experiences. These scenarios may be derived from personal experiences, cultural
practices, or societal norms, among others. Concept blending allows us to create a
new concept by combining existing ones in novel ways. For example, “SpaceX”
may be mapped to mental spaces related to “aerospace” and “technology”; “Tesla”
may be mapped to mental spaces related to “car” and “clean energy”. Conceptual
blending provides an explanation for the recognition and comprehension of newly
named entities by mapping them onto existing mental spaces or concepts.
From the aspect of computational linguistics, the core issue of NER is how to
define a named entity. Marrero et al. (2013) group the criteria of a named entity
as grammatical category, rigid designation, unique identification, and the domain
of applications. However, many of the entity definitions in the NER domain are
imperfect. From the view of grammatical category, a named entity is traditionally
defined as a proper noun or a common name for a proper noun. Previous work has
described NER as the recognition of proper nouns in general. However, as pointed out
by Borrega et al. (2007), the classic grammatical approach to proper noun analysis
is insufficient to deal with the challenges posed by NER applications. For instance,
in a toy QA task such as
(5) Do crocodiles live in the sea or on land?
“crocodiles”, “sea”, and “land” are not proper nouns, yet they are commonly
recognized as essential entities for a proper understanding of the question.
Consequently, being a proper noun is no longer considered a criterion for identifying
named entities in current NER research.
This highlights the difficulty of defining entities with complex concepts in real-
world applications. As a result, annotators likely make subjective judgments when
labeling complex entities, which may be affected by entity descriptions and annota-
tors’ understanding.
From the view of unique identification, MUCs require that NER tasks annotate
the “unique identification” of entities for all expressions (Grishman and Sundheim,
1996). However, determining what is unique depends on contextual elements, and
can be a subjective process. While this “unique identification” is typically taken
to be the referent being referred to, the definition itself poses the challenge of
deciding what is truly unique.
The definition of named entities has frequently been grounded in the application
domain, and entity definitions can differ between NER tasks. For instance, in
drug-drug interaction tasks (Deng et al., 2020), diseases may not be considered
entities, whereas they are entities in adverse drug event tasks (Demner-Fushman
et al., 2019). Inconsistent entity definitions create challenges for machine learning:
for the same semantic unit, the machine has to learn different entity representations
to distinguish its labels under different tasks. This also hinders training a
general-purpose NER classifier across application domains.
Tokens:  West  African  Crocodile  are  semiaquatic  reptiles  that  live  in  Africa
IO:      I     I        I          O    I            I         O     O     O   I
BIO:     B     I        I          O    B            I         O     O     O   B
BIOES:   B     I        E          O    B            E         O     O     O   S
In contrast, the BIO scheme provides more precise annotations by identifying the
beginning and continuation of an entity in the text. This labeling system allows for
more accurate recognition of entities in a text and better classification of individual
tokens. The BIOES scheme further extends the BIO scheme by providing more
precise boundaries for entities, thereby allowing for better recognition of entity
boundaries in a text. The “Single” label is used to denote an entity that consists
of a single token, whereas the “End” label is used to indicate the final token of an
entity. By incorporating these additional labels, the BIOES scheme provides a more
nuanced approach to entity recognition and annotation.
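As a concrete illustration, the sketch below (our own minimal Python; the function name and untyped tag strings are illustrative) converts a BIO tag sequence into BIOES by checking whether each entity tag is followed by a continuation:

```python
def bio_to_bioes(tags):
    """Convert a BIO tag sequence to BIOES.

    A B tag not followed by I becomes S (single-token entity);
    an I tag not followed by I becomes E (entity end).
    Simplified sketch: assumes a well-formed BIO sequence and
    ignores entity-type mismatches between adjacent tags.
    """
    bioes = []
    for i, tag in enumerate(tags):
        # The entity continues only if the next tag is an I tag.
        next_is_i = i + 1 < len(tags) and tags[i + 1].startswith("I")
        if tag.startswith("B"):
            bioes.append(tag if next_is_i else "S" + tag[1:])
        elif tag.startswith("I"):
            bioes.append(tag if next_is_i else "E" + tag[1:])
        else:  # "O"
            bioes.append(tag)
    return bioes

# The BIO row from the example above:
bio = ["B", "I", "I", "O", "B", "I", "O", "O", "O", "B"]
print(bio_to_bioes(bio))  # ['B', 'I', 'E', 'O', 'B', 'E', 'O', 'O', 'O', 'S']
```

Because `tag[1:]` preserves any type suffix, the same routine also maps typed tags such as “B-PER” to “S-PER”.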
3.3.3 Datasets
The surveyed popular NER datasets and their statistics can be viewed in Table 3.7.
The first NER-focused dataset was published in the 6th MUC (Grishman and
Sundheim, 1996). The task comprises three subtasks: entity names, temporal
expressions, and number expressions.
The defined entities include organizations, persons, and locations. The defined
time expressions include dates and times. The defined quantities include monetary
values and percentages. An example of this dataset is shown as follows.
MUC was replaced by the Automatic Content Extraction (ACE) program after 1997.
ACE05 (Walker et al., 2006) is another popular NER dataset published at ACE
Conference. ACE05 is a multilingual dataset, which contains English, Arabic, and
Chinese data. The corpus consists of data of various types annotated for entities, re-
lations, and events. Its data source includes broadcast conversation, broadcast news,
newsgroups, telephone conversations, and weblogs. An example of this dataset is
shown below.
After MUC, the Text Analysis Conference (TAC) published the Knowledge Base
Population challenge. In this challenge, the Stanford NLP Group developed TAC Re-
lation Extraction Dataset (TACRED) (Zhang et al., 2017b), which contains 106,264
instances annotated with entities, relations, and other NLP information. An example
of this dataset is shown as follows.
{
  "id": "e7798fb926b9403cfcd2",
  "docid": "APW_ENG_20101103.0539",
  "relation": "per:title",
  "token": ["At", "the", "same", "time", ",", "Chief", ...],
  "subj_start": 8,
  "subj_end": 9,
  "obj_start": 12,
  "obj_end": 12,
  "subj_type": "PERSON",
  "obj_type": "TITLE",
  "stanford_pos": ["IN", "DT", "JJ", "NN", ",", "NNP", "NNP", ...],
  "stanford_ner": ["O", "O", "O", "O", "O", "O", "O", "O", ...],
  "stanford_head": [4, 4, 4, 12, 12, 10, 10, 10, 10, 12, ...],
  "stanford_deprel": ["case", "det", "amod", "nmod", "punct", ...]
}
CoNLL-2003 (Sang and De Meulder, 2003) is another widely used NER dataset.
The task concerns language-independent NER and concentrates on four kinds
of named entities: locations, persons, organizations, and names of miscellaneous
entities that do not belong to the previous three. The related data files are
available in English and German. An example of this dataset is shown as follows.
Besides the above famous datasets, MultiNERD (Tedeschi and Navigli, 2022),
HIPE-2020 (Ehrmann et al., 2022), and NNE (Ringland et al., 2019a) are also
popular NER datasets in the general domain. NER tasks have also garnered considerable
attention in numerous specialized domains. Informatics for Integrating Biology and
the Bedside (I2B2) (Stubbs and Uzuner, 2015) is a national biomedical computing
project sponsored by the National Institutes of Health (NIH) from 2004 to 2014.
I2B2 actively advocates mining medical value from clinical data and has organized
a series of evaluation tasks and workshops on unstructured medical record data;
these tasks and their open datasets have gained wide influence in the medical NLP
community. I2B2 is maintained by the Department of Biomedical Informatics at
Harvard Medical School and continues to run evaluation tasks and workshops; the
project has been renamed National NLP Clinical Challenges (N2C2). In addition,
many other biomedical datasets exist for specific medical NER tasks, including
Adverse Drug Events (ADE) (Gurulingappa et al., 2012; Alvaro et al., 2017),
Drug-Drug Interaction (Herrero-Zazo et al., 2013), Chemical Protein Reaction
(CPR) (Krallinger et al., 2017), and GENIA (Shibuya and Hovy, 2020).
Table 3.8 lists useful knowledge bases for NER. The biggest are Wikidata
and Wikipedia, multilingual free online encyclopedias maintained by volunteers
worldwide. There are also domain-specific knowledge bases. SNOMED
CT (Systematized Nomenclature of Medicine – Clinical Terms) (Donnelly et al.,
2006) is a systematically organized collection of medical terms that provides a
standardized representation of clinical information, which is often used in NER
tasks involving clinical data. MeSH (Medical Subject Headings) (Lipscomb, 2000)
is another controlled vocabulary, developed by the U.S. National Library of Medicine.
It is used for indexing and organizing biomedical literature.
Other medical knowledge bases include UMLS (Unified Medical Language Sys-
tem) (Wheeler et al., 2007; Bodenreider, 2004), ICD-10 (Hirsch et al., 2016),
MIMIC-III (Johnson et al., 2016), DrugBank (Wishart et al., 2018), and bioinfor-
matics knowledge base BioModels (Li et al., 2010). GeoNames (Ahlers, 2013) is a
comprehensive geographic knowledge repository that encompasses over 25 million
geographical names and over 11 million distinct features, including
cities, countries, and landmarks. EDGAR (Electronic Data Gathering, Analysis, and
Retrieval) (Branahl, 1998) is a database maintained by the U.S. Securities and Ex-
change Commission (SEC), containing financial filings and reports from publicly
traded companies. EduKG (Hu et al., 2016) is an educational knowledge base.
In the evaluation of NER tasks, the main metrics are again Precision, Recall, and
the F1 score.
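NER is typically scored at the entity level: a predicted entity counts as a true positive only when both its span and its type exactly match a gold entity. A minimal sketch of this computation (function and variable names are our own):

```python
def ner_prf(gold, pred):
    """Entity-level precision, recall, and F1.

    gold, pred: sets of (start, end, type) tuples; an entity counts as
    correct only on an exact span-and-type match.
    """
    tp = len(gold & pred)  # exact matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 2, "ORG"), (5, 6, "PER")}
pred = {(0, 2, "ORG"), (5, 6, "LOC"), (8, 9, "PER")}  # one type error, one spurious
p, r, f = ner_prf(gold, pred)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.33 0.5 0.4
```

Note that the entity at (5, 6) earns no credit despite the correct span, because its type is wrong.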
Brat (Browser-based Rapid Annotation Tool) is a free data labeling tool that offers
a seamless browser-based interface for annotating text. It streamlines numerous
annotation tasks related to NLU. With a thriving support community, Brat is a
well-known and widely used tool in NER. It also offers the option of integrating
with external resources, such as Wikipedia. Moreover, Brat enables organizations
to establish servers that allow multiple users to collaborate on annotation tasks.
However, implementing this feature does necessitate some technical proficiency and
server management skills.
3.3.6 Methods
A. Multi-label Method
Because nested named entities can carry multiple labels for a single token,
traditional sequence labeling methods are not directly applicable to the recognition
of nested named entities. To address this issue, researchers have attempted to
convert the multi-label problem into a single-label problem or to adjust the decoder
so that it assigns multiple labels to the same token. Katiyar and Cardie (2018) proposed a
method to address nested NER by modifying the label representation in the train-
ing set. Instead of using one-hot encoding, they used a uniform distribution over
the specified classes as the label. During inference, a hard threshold is set and any
class with probability above this threshold is predicted for the token. However, this
approach has two limitations: it is difficult to define a clear learning objective for
the model, and the method is sensitive to the manually chosen threshold value.
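The threshold-based decoding described above can be sketched as follows (a simplified illustration; the threshold value, label set, and probabilities are hypothetical, not those of Katiyar and Cardie):

```python
def multilabel_decode(probs, labels, threshold=0.4):
    """For each token, return every label whose probability exceeds a
    hard threshold; fall back to 'O' when none does. Illustrates why
    the decoding is sensitive to the chosen threshold."""
    out = []
    for token_probs in probs:
        chosen = [lab for lab, p in zip(labels, token_probs) if p > threshold]
        out.append(chosen or ["O"])
    return out

labels = ["PER", "ORG", "LOC"]
# First token lies inside two nested entities; second is outside any entity.
probs = [[0.55, 0.45, 0.05], [0.10, 0.20, 0.10]]
print(multilabel_decode(probs, labels))  # [['PER', 'ORG'], ['O']]
```

Raising the threshold to 0.5 here would silently drop the ORG label, which is exactly the sensitivity the text points out.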
Straková et al. (2019) turned nested NER from a multi-label into a single-label task
by modifying the annotation schema: any two categories that may co-occur are
combined into a new label (e.g., B-Location and B-Organization are combined into
a new label B_Loc_Org). One benefit of this approach is that the final task remains
single-label classification, because all possible classification targets are covered
by the schema. Nonetheless, the method causes an exponential proliferation of
label categories, leading to sparsely annotated labels that are difficult to learn,
particularly for entities nested across multiple layers. In order to address the issue
of label sparsity, Shibuya and
Hovy (2020) proposed a hierarchical approach: if the classification of nested entities
cannot be resolved in a single pass, classification continues iteratively until either
the maximum number of iterations is reached or no new entities are generated.
Nevertheless, this approach is susceptible to error propagation, whereby an
erroneous classification in an earlier iteration can affect subsequent iterations.
B. Generation-based Method
Li et al. (2020g) proposed a unified framework to accomplish flat and nested NER
tasks by formulating NER as a machine reading comprehension (MRC) task (Liu
et al., 2023a). In this approach, the extraction of each entity type corresponds to
specific questions. For instance, when the model is given the question “which
location is mentioned in the sentence?” along with the original sentence, it generates an
answer such as “Washington”. This approach is similar to Prompt Tuning (Liu et al.,
2021c), which avoids the labor-intensive process of constructing manual questions.
However, in this method, the generated tokens must be mapped to pre-defined named
entity types.
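The question-construction step of the MRC formulation can be sketched as follows (the templates and function names are our own illustrations, not the actual queries of Li et al.):

```python
# Hypothetical question templates, one per entity type (not the original
# queries of Li et al., 2020g).
TEMPLATES = {
    "LOC": "Which location is mentioned in the sentence?",
    "PER": "Which person is mentioned in the sentence?",
    "ORG": "Which organization is mentioned in the sentence?",
}

def build_mrc_inputs(sentence):
    """Pair the sentence with one question per entity type; each
    (question, sentence) pair would be fed to a reading-comprehension
    model that extracts answer spans as entities of that type."""
    return [(question, sentence) for question in TEMPLATES.values()]

pairs = build_mrc_inputs("George Washington visited Washington.")
print(len(pairs))  # 3
```

One sentence thus yields as many model inputs as there are entity types, which is where the extra inference cost of the MRC formulation comes from.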
Yan et al. (2021b) proposed a novel pointer generation network. Given an input
sentence, the model generates the indexes of the tokens in the sentence that belong
to entities. In this way, flat, nested, and discontinuous entities can be recognized in
a unified framework. Skylaki et al. (2020), Fei et al. (2021a), Yang and Tu (2022),
and Su et al. (2022) also follow the idea of generating sentence indexes to
recognize nested entities.
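The index-generation view can be sketched as follows (a simplified encoding of our own, not the exact scheme of Yan et al.): each entity is emitted as a list of token indexes plus a type, which covers flat, nested, and discontinuous spans alike.

```python
def spans_from_indexes(tokens, entity_indexes):
    """Materialize entities from generated (indexes, type) pairs.

    Nested entities simply reuse indexes; discontinuous entities list
    non-adjacent indexes. The encoding is our own simplification.
    """
    return [(" ".join(tokens[i] for i in idxs), etype)
            for idxs, etype in entity_indexes]

tokens = ["The", "Ontario", "Supreme", "Court", "said"]
generated = [([1], "STATE"), ([1, 2, 3], "GOV")]  # nested: index 1 is shared
print(spans_from_indexes(tokens, generated))
# [('Ontario', 'STATE'), ('Ontario Supreme Court', 'GOV')]
```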
C. Hypergraph-based Method
Fig. 3.4: A typical example for nested NER with a hypergraph solution
Finkel and Manning (2009) first introduced hypergraphs into nested NER with their
Mention Hypergraph model, which uses nodes and directed hyperedges to jointly
represent named entities and their combinations. To compute the training loss, the
score of the correct structure is divided by a normalization term obtained with a
dynamic programming algorithm that aggregates feasible nested subgraphs for
NER. However, the normalization term computed by this algorithm includes
contributions from pseudo-structures, which leads to errors.
A. Metric Learning
To avoid the above issues, Yang and Katiyar (2020) followed the nearest neighbor
inference (Wiseman and Stratos, 2019) to assign labels to tokens. In contrast to
Prototypical Networks, which learn a prototype for each entity class, this study
characterized each token by its labeled instances in the support set alongside its
context. The approach determined the nearest labeled token in the support set,
followed by assigning labels to the tokens in the query set that require prediction.
Das et al. (2022) proposed CONTaiNER, which optimized the inter-token dis-
tribution distance. CONTaiNER employed generalized objectives to different token
categories based on their Gaussian-distributed feature vectors. Such a method has
the potential to mitigate overfitting problems that arise from the training domains.
B. Prompt Tuning
Recently, prompt tuning has shown great potential on few-shot tasks by reformulat-
ing other tasks as mask language tasks (He et al., 2023a; Mao et al., 2023c; Schick
and Schütze, 2021). Prompt tuning-based methods need to construct prompts to
obtain masked-word predictions and then map the predicted words onto pre-defined
labels, as shown in Fig. 3.5.
Fig. 3.5: Prompt tuning for NER: the input “... that her fellow pilot Erik Satie is ...”
yields the prompt “Erik Satie is [MASK] entity.”; the predicted label word (e.g.,
“People”) is then mapped onto the label Person
Cui et al. (2021) proposed a template-based method for NER, which first applied
prompt tuning to NER tasks. However, their method has to enumerate all possible
spans of a sentence combined with all entity types to predict labels, which suffers
from serious redundancy as the number of entity types or the sentence length
increases. Manually defined prompts are labor-intensive and make the algorithm
sensitive to the prompts themselves. To avoid manual prompt construction, Ma
et al. (2022a) explored a prompt-free method for few-shot NER. Their study
introduced an entity-oriented language model that decodes input tokens into their
corresponding label words if they belong to entities; tokens that are not entities are
decoded as the original tokens. Nevertheless, this approach faces difficulties in
label-word engineering: while the study proposed an automated label selection
technique, the associated experiments revealed some degree of instability.
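The label-word mapping step common to these prompt-based methods can be sketched as follows (the verbalizer entries and scores are hypothetical; a real system would obtain the scores from a masked language model):

```python
# Hypothetical verbalizer: maps label words a masked LM may predict at
# the [MASK] position onto entity labels.
VERBALIZER = {"people": "Person", "place": "Location", "company": "Organization"}

def map_prediction(mask_word_scores):
    """Pick the highest-scoring label word at the [MASK] position and
    map it to its entity label.

    mask_word_scores: dict of word -> score from the masked LM,
    restricted to the verbalizer's label words.
    """
    best = max(mask_word_scores, key=mask_word_scores.get)
    return VERBALIZER[best]

# "Erik Satie is [MASK] entity." -> suppose the LM scores the label words:
scores = {"people": 0.71, "place": 0.04, "company": 0.02}
print(map_prediction(scores))  # Person
```

The choice of label words (the verbalizer) is precisely the “label-word engineering” that the prompt-free approach above struggles to automate.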
Considering that NER is usually combined with relation extraction in various
downstream applications, jointly recognizing named entities and classifying
relations is a hot topic in related fields. MTL is the most common solution for joint
NER and relation extraction. Miwa and Bansal (2016) first employed a shared
BiLSTM encoder to obtain token representations, and then fed the encoded
representations into NER and relation extraction classifiers, respectively. Sun et al.
(2020) utilized a GCN as a shared encoder to enable joint inference of both entity
and relation types. The core idea of these studies is that multi-task models can
enhance the interaction between the learning of NER and relation extraction, and
further alleviate error propagation by sharing common parameters (He et al., 2021).
However, these works cannot ensure that the shared information is useful and
appropriate: NER and relation extraction might need different features to make
precise predictions. To this end,
Yan et al. (2021c) proposed an information filtering mechanism to provide valid
features for NER and relation extraction. Their method used an entity and relation
gate to divide cell neurons into different parts and established a two-way interaction
between NER and relation extraction. In the final network, each neuron contained a
shared partition and two task-specific partitions.
B. Table Filling
While MTL can improve the interdependence between NER and relation extrac-
tion, the relation extraction process still requires the pairing of all entities from the
NER tasks to classify relations, making it impossible to completely eliminate error
propagation. To solve the problem, Miwa and Sasaki (2014) proposed a table-filling
strategy to achieve joint NER and relation extraction by labeling input tokens in a
table. The method utilized token lists of sentences to form rows and columns.
Then, they extracted entities using the diagonal elements and classified relations
with a lower/upper triangular matrix of the table. This basic table-filling strategy can
be seen in Fig. 3.6. Nonetheless, this approach involved the explicit integration of
entity-relation label interdependence, which necessitated the use of intricate features
and search heuristics. Gupta et al. (2016b) incorporated neural networks with a
table-filling strategy via a unified multi-task RNN. This method detected both entity
pairs and the related relations with an entity-relation table, which alleviated the
need for search heuristics and explicit entity-relation label dependencies. Zhang
et al. (2017a) further integrated global optimization and syntax information into
the table-filling strategy to combine NER and relation extraction tasks. Ren et al.
(2021a) argued that the above table-filling studies focus only on local features
and ignore the global associations between relations and entity pairs. They first
produced a table feature for every relation, then extracted two types of global
associations from the generated table features, and finally integrated the table
feature for each relation with the global associations. This process is performed
iteratively to enhance the final features for the joint learning of NER and relation
extraction.
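The basic table layout can be sketched as follows (a minimal illustration; the tokens, tags, and relation label are our own example, not drawn from the cited datasets):

```python
def build_table(n):
    """An n-by-n entity-relation table initialized to 'O' (no label):
    cell (i, i) holds token i's entity tag, and cell (i, j) with i < j
    holds the relation label between tokens i and j."""
    return [["O"] * n for _ in range(n)]

tokens = ["Greg", "Christie", "works", "at", "Apple"]
table = build_table(len(tokens))
# Diagonal cells: entity tags.
table[0][0], table[1][1] = "B-PER", "I-PER"
table[4][4] = "B-ORG"
# Upper-triangle cell: relation between the two entities' end tokens.
table[1][4] = "work_at"
print(table[1][4], table[0][0])  # work_at B-PER
```

Because entities and relations live in one structure, a single labeling pass over the table replaces the entity-then-pairing pipeline, which is how this strategy sidesteps error propagation.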
C. Tagging Scheme
The table-filling approach can mitigate issues related to error propagation. How-
ever, these techniques require the pairing of all sentence elements to assign labels,
resulting in significant redundancy. To address the redundancy and avoid error
propagation, Zheng et al. (2017) proposed a novel tagging scheme that converts
joint NER and relation extraction into a single unified task. The idea is similar to
the solution of Straková et al. (2019) for nested entities, which combines NER
labels with relation extraction labels by modifying the annotation schema. For
example, given the sentence “The United States president Biden will visit ...”, by
allocating the customized labels “Country-President_B_1” and
“Country-President_E_1” to the tokens “United” and “States”, and
“Country-President_E_2” to the token “Biden”, the method can directly obtain the
triplet (United States, Country-President, Biden).
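Decoding a triplet from such composite tags can be sketched as follows (our own simplified decoder, which assumes at most one triplet per sentence; the tag format follows the example above):

```python
def decode_triplet(tokens, tags):
    """Decode one (head, relation, tail) triplet from composite tags of
    the form '<Relation>_<position>_<role>' ('O' for non-entity tokens),
    where role 1 marks the head entity and role 2 the tail entity.
    Simplified sketch: assumes at most one triplet per sentence."""
    head, tail, relation = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag == "O":
            continue
        rel, _pos, role = tag.rsplit("_", 2)
        relation = rel
        (head if role == "1" else tail).append(tok)
    return " ".join(head), relation, " ".join(tail)

tokens = ["The", "United", "States", "president", "Biden", "will", "visit"]
tags = ["O", "Country-President_B_1", "Country-President_E_1", "O",
        "Country-President_E_2", "O", "O"]
print(decode_triplet(tokens, tags))
# ('United States', 'Country-President', 'Biden')
```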
Yu et al. (2020) and Wei et al. (2019) proposed two similar methods. In contrast to
conventional joint approaches for NER and relation extraction, which recognize
entities first and then classify relations, the two methods first identify all head
entities; next, for each identified head entity, they simultaneously predict the
corresponding tail entities and relations, yielding cascade frameworks combined
with a customized tagging scheme. Typical joint NER and relation extraction
models learn the conditional probability

p((h, r, t) | s) = p(h | s) p(t | s, h) p(r | s, h, t),   (3.4)

where s denotes the sentence, h the head entity, r the relation, and t the tail entity.
The above methods combine the last two factors of Eq. 3.4, yielding

p((h, r, t) | s) = p(h | s) p((r, t) | s, h).   (3.5)
Knowledge graphs are structured semantic knowledge bases for rapidly describing
concepts and their interrelationships in the physical world, aggregating large amounts
of knowledge by reducing the data granularity from the document level to the instance
level (Yao et al., 2022). Thus, knowledge graphs enable rapid response and reasoning
about knowledge. At present, the application of knowledge graphs has become
prevalent in industrial domains, such as Google Search. Generally, the construction of
Knowledge Graphs consists of three main parts: information extraction, information
fusion, and information processing. The task of information extraction involves the
identification of nodes through NER and the establishment of edges via relation
extraction. The task of information fusion is utilized for normalizing nodes and
edges. The normalized nodes and edges need to go through a quality assessment
with the task of information processing to be added to knowledge graphs.
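The information-extraction stage can be sketched as turning NER and relation-extraction output into graph nodes and edges (a minimal adjacency-list representation; the function and example triples are our own illustrations):

```python
from collections import defaultdict

def add_triples(graph, triples):
    """Insert (head, relation, tail) triples into a simple adjacency-list
    knowledge graph: node -> list of (relation, neighbor) edges.
    NER supplies the nodes; relation extraction supplies the edges."""
    for head, rel, tail in triples:
        graph[head].append((rel, tail))
    return graph

kg = defaultdict(list)
add_triples(kg, [("Greg Christie", "work_at", "Apple"),
                 ("Apple", "located_in", "Cupertino")])
print(kg["Greg Christie"])  # [('work_at', 'Apple')]
```

In a full pipeline, information fusion would then normalize duplicate nodes (e.g., “Apple” vs. “Apple Inc.”) before the triples enter the graph.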
He et al. (2021) proposed an MTL-based method for the construction of ge-
nealogical knowledge graphs. At first, He et al. (2019b) collected unstructured
online obituary data. Then, they extracted named entities as nodes and classified
family relationships for these recognized people as edges to construct genealogical
knowledge graphs. Similarly, Jiang et al. (2020) utilized NER and relation extraction
to obtain the nodes and edges of biomedical knowledge graphs. They proposed
a customized tagging schema to convert the construction of biomedical knowledge
graphs into a sequence labeling task with multiple inputs and multiple outputs. Li
et al. (2020c) proposed a systematic approach for constructing a medical knowledge
graph, which involves extracting entities such as diseases and symptoms, as well as
related relationships, from electronic medical records. Silvestri et al. (2022), Peng
et al. (2019), and Shafqat et al. (2022) aimed to collect and utilize medical knowledge
for NER.
Commonly, dialogue systems can be categorized into three main types, namely task-
oriented, QA, and open-domain (Ni et al., 2022). NER plays a role in enhancing
NLU for the three types of dialogue systems, organizing original user messages into
semantic slots, and classifying data domain and user intention (Li et al., 2017b). Abro
et al. (2022) proposed an argumentative dialogue system with NER and other NLU
tasks. The approach enhances the comprehension of user intent by understanding the
injected entities and relationships. For QA (Dimitrakis et al., 2020) and open-domain
dialogue systems, NER also plays a crucial role in intent recognition and
knowledge retrieval. For example, Zhang et al. (2021a) developed a sequence of
sub-goals with external knowledge to improve generation performance. External
knowledge refers to a range of named entities and relationships that are associated
with a conceptual entity. Leveraging external knowledge allows the dialogue system
to deliver more cohesive small talk in the open domain.
3.3.8 Summary
NER is a very important NLU subtask that enhances the accuracy of other subtasks,
e.g., concept extraction. It is a manifestation of cognitive semantics: named entities
are not categorized simply by their semantics; the classified entities also reflect
their inherent attributes in human cognition. According to Prototype Theory (see
Section [Link]), the inherent attributes of named entities can be represented by
prototypes. It is gratifying to observe that this theory has had a significant
influence on research related to few-shot NER. On the other hand, the ambiguity of
named entity classification argued by Graded Membership (see Section [Link]) and
Grammatical Category (see Section [Link]) was rarely analyzed from computational
linguistic aspects.
We also do not see explainable NER studies that explain why an entity is clas-
sified into a particular category from the perspective of conceptual blending (see
Section [Link]). NER research on these aspects is helpful for achieving human-
like intelligence in categorizing named entities. The availability of numerous NER
datasets, both in general and medical domains, has significantly enhanced computa-
tional research in this area. This may be attributed to the great application value of
NER, as well as a wide range of data annotation tools. Encyclopedic knowledge and
domain-specific knowledge also provide external information that helps NER models
better understand context and commonsense. NER has now developed many
practical task setups to meet the needs of technical applications, e.g., nested NER,
few-shot NER, and joint NER and relation extraction, as well as downstream tasks,
e.g., knowledge graph construction (KGC), recommendation systems, and dialogue
systems.
Due to the extensive research conducted on typical NER methods over the years,
researchers are shifting their focus towards NER techniques that are more applicable
to practical scenarios, for example, nested NER, few-shot NER, and joint NER and
relation extraction. Recent technological trends for the aforementioned NER tasks
are summarized in Table 3.9. Overall, nested NER can be addressed by multi-label,
generation-based, and hypergraph-based methods. Among them, multi-label
methods are straightforward and easy to implement. However, there are several
limitations in the surveyed multi-label methods.
Table 3.9: A summary of representative NER techniques. Studies marked with *
cannot be compared with the others, as they did not report 5-shot results.
Task Reference Tech Feature and KB. Framework Dataset Score Metric
Katiyar and Cardie (2018) DL Emb. BiLSTM ACE-05 70.2% F1
Straková et al. (2019) DL Emb. LSTM-CRF ACE-05 84.3% F1
Shibuya and Hovy (2020) DL Emb. LSTM-CRF ACE-05 84.3% F1
Nested NER Li et al. (2020g) DL BERT, Wikipedia Unified Framework ACE-05 86.9% F1
Yan et al. (2021b) DL BERT Pointer Networks ACE-05 84.7% F1
Finkel and Manning (2009) Graph Emb., Constituency Parsing Hypergraph GENIA 72.0% F1
Muis and Lu (2017) Graph Emb., Multigraph Representation Hypergraph GENIA 70.8% F1
Wang and Lu (2018) Graph Emb., Segmental Hypergraphs Hypergraph GENIA 75.1% F1
Yang and Tu (2022) DL BERT Pointer Networks ACE-05 85.0% F1
Su et al. (2022) DL BERT Pointer Networks CONLL04 88.6% F1
Fritzler et al. (2019)* DL Prototypical network RNN+CRF Ontonotes - F1
Yang and Katiyar (2020) DL BERT Nearest Neighbor I2B2 22.1% F1
Few-shot NER Das et al. (2022) DL BERT Contrastive Learning I2B2 31.8% F1
(5 shot) Cui et al. (2021) DL BERT Prompt Tuning I2B2 36.7% F1
Huang et al. (2022b) DL BERT Prompt Tuning I2B2 43.7% F1
He et al. (2024a) DL BERT, Wikipedia Prompt Tuning I2B2 52.7% F1
Miwa and Bansal (2016) DL Emb. BiLSTM ACE-05 55.6% F1
Sun et al. (2020) Graph Emb., Bipartite Graph ACE-05 59.1% F1
Joint NER Yan et al. (2021c) DL BERT Partition Filter ACE-05 66.8% F1
and relation extraction Gupta et al. (2016b) ML Emb., Table filling CoNLL04 72.1% F1
Zhang et al. (2017a) DL Emb., Table filling ACE-05 57.5% F1
Zheng et al. (2017) DL Emb., Tagging scheme NYT 49.5% F1
Yu et al. (2020) DL Emb., Tagging scheme NYT 59.0% F1
Shafqat et al. (2022)* DL Emb., ICD-10 BiLSTM no public - F1
Task-driven NER Hirsch et al. (2016)* DL Emb., UMLS BiLSTM no public - F1
Peng et al. (2019)* DL BERT, MIMIC-III Fine Tuning no public - F1
For example, thresholds for multi-label selection are hard to decide empirically
(Katiyar and Cardie, 2018), and multiple labels suffer from sparsity (Straková
et al., 2019) or error propagation (Shibuya and Hovy, 2020), which can lower
model performance. Generation-based methods are flexible: by reformulating NER
as QA, they can generate any result that satisfies the pre-defined requirements
(Shibuya and Hovy, 2020; Li et al., 2020g). These methods have been used to
handle flat NER (Skylaki et al., 2020), nested NER (Yan et al., 2021b), and
discontinuous NER (Fei et al., 2021a). However, it is hard to control what a
generation-based method generates, even though some studies (Skylaki et al., 2020;
Fei et al., 2021a; Yang and Tu, 2022; Su et al., 2022) have attempted to restrict
the outputs of generation-based methods to a specific set of indexes (pointer
network). The core of the hypergraph-based method is how to establish a hypergraph
data structure that better represents the interactions among all tokens in a
sentence, which these methods are good at modeling. It is important to note,
however, that the majority of hypergraph-based methods are task-specific: they may
not be universally applicable, and their effectiveness may be constrained by the
specific task they are designed for.
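Whatever the framework, nested NER must allow overlapping mentions. A minimal span-enumeration sketch (our own simplification, not any of the cited systems; a hypothetical gazetteer stands in for a learned span classifier) shows why span-level decisions accommodate nesting naturally:

```python
def enumerate_spans(tokens, max_len=4):
    """Enumerate all candidate spans (start, end) with end inclusive."""
    return [(s, e)
            for s in range(len(tokens))
            for e in range(s, min(s + max_len, len(tokens)))]

def extract_nested(tokens, classify):
    """Keep every span the classifier labels; nested spans may overlap."""
    results = []
    for (s, e) in enumerate_spans(tokens):
        label = classify(tokens[s:e + 1])
        if label is not None:
            results.append((s, e, label))
    return results

# Hypothetical gazetteer standing in for a learned span classifier.
GAZETTEER = {("New", "York"): "GPE", ("New", "York", "University"): "ORG"}

tokens = ["New", "York", "University", "is", "large"]
result = extract_nested(tokens, lambda span: GAZETTEER.get(tuple(span)))
# Both the inner "New York" span and the enclosing "New York University"
# span are kept, which a single flat tag sequence could not express.
```

Because each span is classified independently, the inner GPE and the enclosing ORG coexist in the output without any special handling.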
Few-shot NER is usually achieved by metric learning or prompt tuning. Metric
learning has demonstrated its effectiveness in various few-shot tasks (Kaya and
Bilge, 2019; Fritzler et al., 2019). For few-shot NER, some works predict the final
labels by comparing token-to-token distances (Yang and Katiyar, 2020; Das et al.,
2022) or token-to-prototype distances (Huang et al., 2022b). These methods have
to choose different distance functions for different tasks (Kulis et al., 2013) and
suffer from the instability introduced by insufficient data. To exploit the full
potential of language models, prompt tuning has been proposed and demonstrated as
a promising technology for few-shot tasks (Liu et al., 2021c; He et al., 2023a; Liu
et al., 2022e). Prompt tuning reformulates NER as a masked language modeling task
to reduce the gap between NER and the employed pretrained LMs. The drawback is that
prompt tuning needs extra template construction and label-word mappings, and some
studies have tried to deal with these problems (Huang et al., 2022b; He et al., 2024a).
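The token-to-prototype idea can be illustrated with a minimal sketch (the two-dimensional embeddings are toy values of our own invention): each class prototype is the mean of its support-token embeddings, and a query token takes the label of the nearest prototype.

```python
import math

def prototype(vectors):
    """Class prototype = mean of the support-token embeddings."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def nearest_prototype(token_vec, prototypes):
    """Assign the label of the closest prototype (Euclidean distance)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(prototypes, key=lambda label: dist(token_vec, prototypes[label]))

# Hypothetical 2-d support embeddings for two entity classes.
support = {"PER": [[1.0, 0.1], [0.9, 0.0]], "LOC": [[0.0, 1.0], [0.1, 0.9]]}
prototypes = {label: prototype(vecs) for label, vecs in support.items()}

label = nearest_prototype([0.95, 0.05], prototypes)  # closest to the PER prototype
```

The choice of distance function here (Euclidean) is exactly the kind of task-dependent decision noted above; with only a few support tokens per class, the prototypes themselves are unstable, which is the data-insufficiency problem these methods face.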
For joint NER and relation extraction tasks, we summarize related studies into
three groups: parameter sharing-based MTL, the table-filling strategy, and
customized tagging schemes. Parameter sharing is the basic idea in MTL, which
can be used to enhance the interaction between NER and relation extraction (Li
et al., 2017a; Bekoulis et al., 2018). This method can provide some relief from
error propagation, but it cannot completely eliminate the issue. Moreover, it
has to pair every two entities for relation extraction, which introduces unnecessary
redundancy. Table filling-based joint NER and relation extraction can completely
eliminate error propagation by converting NER and relation extraction into a single
sequence-tagging task (Gupta et al., 2016b; Ren et al., 2021a; Ma et al., 2022b).
However, these methods have to label every token pair in an input sentence
exhaustively, and if relation extraction is defined as a unidirectional task,
half of the calculations are wasted.
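The table-filling idea can be sketched as follows (a toy illustration with hypothetical labels, not any of the cited systems): entity tags sit on the diagonal of a token-pair table and relation labels fill the off-diagonal cells, so one table encodes both tasks.

```python
def fill_table(tokens, entities, relations):
    """Build an n x n label table: cell (i, i) holds the entity tag of token i,
    cell (i, j) holds the relation between the entities headed at i and j."""
    n = len(tokens)
    table = [["O"] * n for _ in range(n)]
    for (i, tag) in entities:              # diagonal: entity labels
        table[i][i] = tag
    for (i, j, rel) in relations:          # off-diagonal: relation labels
        table[i][j] = rel
    return table

tokens = ["John", "works", "in", "Paris"]
table = fill_table(tokens,
                   entities=[(0, "PER"), (3, "LOC")],
                   relations=[(0, 3, "Works_For")])
# With a unidirectional relation, only cell (0, 3) is filled; cell (3, 0)
# stays "O", which is the wasted half of the table noted in the text.
```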
3.3 Named Entity Recognition 159
Following the idea of the table filling strategy, tagging scheme-based methods
also model the NER and relation extraction as an integrated task. The fundamental
concept of the tagging scheme is to merge the labels assigned for NER with those
assigned for relation extraction into a unified label (Zheng et al., 2017; Straková
et al., 2019). Such a method has the potential to circumvent issues related to both
error propagation and redundancy; however, it may also lead to a sparsity of labels.
With the development of PLMs and LLMs, many downstream tasks are organized
as end-to-end processing tasks to achieve higher accuracy and mitigate error propa-
gation issues. However, we can still observe that NER can improve the explainability
in recommendation and dialogue systems (Kim et al., 2012; Zhang et al., 2021a),
which is also an important aspect of AI research. There is still considerable
untapped potential for integrating NER with other downstream tasks, e.g.,
explaining how concepts blend with each other across different entities, which
inherent attributes of a group of entities the selected prototypes represent, and
how robust an identified named entity is.
Open-domain NER
Compared with typical single-domain NER, open-domain NER has more categories,
and its entity classes are hardly ever defined in advance. For this reason,
open-domain NER is more capable of handling rapidly expanding data and of mining
potential knowledge hidden in massive unstructured text (Hohenecker et al., 2020;
Kolluru et al., 2020). Open-domain NER is significant because it discovers and
connects world knowledge via automatic text mining. Many manually developed
lexical resources, e.g., WordNet, can only cover a limited number of concepts.
When the concepts are MWEs, manually mining, structuring, and updating them
requires exponentially growing human effort. Open-domain NER helps mitigate this
effort and deliver a knowledge base that connects entities from different domains.
Multilingual NER
In light of the fact that a significant number of languages in existence lack suf-
ficient annotated data, knowledge transfer from high-resource languages to low-
resource languages can serve as a viable solution to compensate for the paucity of
data (Rahimi et al., 2019; Tedeschi et al., 2021b; Zhou et al., 2022). Developing
robust multilingual NER systems that can perform across multiple languages will
achieve more comprehensive knowledge graphs, linking entities from different lan-
guages. It is valuable because it may lead to a unified concept representation system
covering different languages. On the other hand, the task of developing multilingual
NER systems is fraught with difficulties, primarily due to the inherent dissimilari-
ties in entity types and language structures across different languages. As a result,
aligning entities and transferring knowledge learned from one language to another
can present significant challenges for multilingual NER systems.
3.4 Concept Extraction 161
In real-world scenarios, there exist flat named entities, nested entities, and
discontinuous entities. Most NER-related studies only focus on the combination of
flat with nested entities or of flat with discontinuous entities; neither setting
can recognize all kinds of entities. Developing a unified framework to
simultaneously handle all of them has become an urgent need for NER (Fei et al.,
2021a). Hierarchical concept representation knowledge bases may provide a
preliminary ontology for organizing entities and their relationships. However,
most ontology systems were manually developed by experts, and such manually
constructed knowledge may be invalid in specific application scenarios. A potential
avenue for future research in NER is the development of a unified and robust
framework for organizing entities. Such a framework could create comprehensive
knowledge graphs that capture entity relationships and support downstream tasks.
Concept extraction is the process of extracting concepts of interest from text. To the best
of our knowledge, the task of computational concept extraction was first proposed
by Montgomery (1982). He argued that taxonomic hierarchies could be constructed
to allow property inheritance of concepts, and therefore to perform rudimentary
inference and analogical reasoning based on the taxonomies. He also highlighted two
important subtasks of concept extraction for the next-generation knowledge-based
systems from the perspective of 1982, namely lexicon development and conceptual
structure construction.
Medin and Schaffer (1978) argued that concepts are represented by a collection of
particular exemplars or individual instances that are linked to the category. When
we categorize an instance, we compare it with multiple specific exemplars of the
category. This is different from Prototype Theory where a new instance is cate-
gorized by comparing the instance to the abstract prototype of the category (see
Section [Link]). Medin and Schaffer (1978) framed concept categorization
as a classification task. Experiments showed that the classification judgments
made by participants were impacted by various factors. These factors included the
extent of resemblance between the probe item and exemplars previously acquired,
the number of prior exemplars that shared resemblances with the probe item, and the
similarity present both within and between the categories of the previously learned
exemplars. For concept extraction, Exemplar Theory may suggest that models may
take categorized instances into account when categorizing a new instance.
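A minimal sketch of exemplar-based categorization in this spirit (the feature vectors and the feature-overlap similarity are hypothetical toy choices of our own): the probe is scored against every stored exemplar, and it joins the category with the highest total similarity.

```python
def exemplar_categorize(instance, exemplars, similarity):
    """Exemplar Theory sketch: score each category by the total similarity of
    the probe to all of its stored exemplars, then pick the best category."""
    scores = {cat: sum(similarity(instance, ex) for ex in items)
              for cat, items in exemplars.items()}
    return max(scores, key=scores.get)

def overlap(a, b):
    """Toy similarity: number of matching binary features."""
    return sum(1 for x, y in zip(a, b) if x == y)

# Hypothetical binary feature vectors for two categories.
exemplars = {"bird": [(1, 1, 0), (1, 1, 1)], "fish": [(0, 0, 1), (0, 0, 0)]}
category = exemplar_categorize((1, 1, 1), exemplars, overlap)
```

Unlike a prototype-based classifier, no abstract summary vector is ever formed: every stored exemplar contributes to the decision, which mirrors the factors Medin and Schaffer observed in human judgments.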
Gardenfors (2004) defined concepts as the “theoretical entities that can be used to
explain and predict various empirical phenomena concerning concept formation”.
The author believed that concept representations are multi-dimensional, where each
dimension is indicative of a different characteristic or property associated with the
concept. For example, one could represent the concept of a car within a conceptual
space that includes dimensions such as size, speed, color, and shape.
Regarding the goal of keyphrase annotation, there are in general two types of
annotation schemes for keyphrase extraction-like concept extraction. The first is
to select only keyphrases that appear verbatim in the input text, without creating
semantically equivalent phrases. The second is to both select existing keyphrases
and create “absent keyphrases” that are necessary but do not appear in the input
text (Hulth, 2003). Regarding the format of the assigned annotations, there are
also two schemes. The first directly gives the keyphrases existing in the source
text. The second treats keyphrase extraction as a sequence labeling task and
assigns a label to each token in the source text (Hulth, 2003). The assigned
labels in the current dataset follow the BIO scheme defined in Table 4.18;
specifically, three labels are used: B (Beginning), I (Inner), and O (Other).
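The sequence-labeling scheme can be made concrete with a small sketch of our own: gold keyphrases are projected onto per-token B/I/O labels.

```python
def to_bio(tokens, keyphrases):
    """Convert gold keyphrases into per-token BIO labels:
    B = beginning of a keyphrase, I = inside, O = other."""
    labels = ["O"] * len(tokens)
    for phrase in keyphrases:
        for i in range(len(tokens) - len(phrase) + 1):
            if tokens[i:i + len(phrase)] == phrase:
                labels[i] = "B"
                for j in range(i + 1, i + len(phrase)):
                    labels[j] = "I"
    return labels

tokens = ["deep", "neural", "networks", "extract", "keyphrases"]
labels = to_bio(tokens, keyphrases=[["neural", "networks"], ["keyphrases"]])
# labels == ["O", "B", "I", "O", "B"]
```

Note that this format can only express keyphrases that appear verbatim in the text; absent keyphrases, as in the second annotation goal above, require a generative formulation instead.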
3.4.3 Datasets
Table 3.11 lists some of the most popular concept extraction datasets, providing
detailed statistics for each, including the task, source, and size of each dataset.
Overall, the main threads of dataset development are (1) larger-scale datasets;
(2) attention to both extractive and abstractive keyphrases; (3) more fine-grained
tag annotations; and (4) more application domains.
Table 3.11: Concept extraction datasets and statistics. KE denotes Keyphrase Ex-
traction. ClCE denotes Clinical Concept Extraction. CoCE denotes Course Concept
Extraction. PCE denotes Patent Concept Extraction.
Hulth (2003) proposed one of the first keyphrase extraction datasets, termed
Inspec, which is based on scientific papers. Paper abstracts are used as the keyphrase
extraction context and paper keywords are used as keyphrases. Each abstract has
two sets of keywords: a set of controlled terms, i.e., terms restricted to the Inspec
thesaurus; and a set of uncontrolled terms that can be any suitable terms that may or
may not be present in the abstracts. They collected 1000 abstracts as a train set, 500
as a validation set, and 500 as a test set.
Nguyen and Kan (2007) proposed the NUS dataset with the motivation that
keyphrase extraction requires multiple judgments and cannot rely merely on the
single set of author-provided keyphrases. They first used Google Search API to
retrieve scientific publications, and then recruited student volunteers to participate
in manual keyphrase assignments. They finally collected 211 documents, each with
two sets of keyphrases: one is given by the original authors of the paper, and the
other is given by student volunteers. The data format of NUS is the same as Inspec.
Krapivin et al. (2009) proposed the Krapivin dataset, consisting of around 2,000
scientific papers as well as their keywords assigned by the original authors. The
scientific papers were published by ACM in the period from 2003 to 2005, and
were written in English. One of the novelties of this dataset is that the text data in
the scientific papers were collected with three distinct categories: title, abstract, and
main body. They finally collected 460 test documents and 1,840 validation documents. The data
format is similar to Inspec but has a title and body in addition to the abstract.
tweet: “Hard to believe it but these are REAL state alternatives to taking
Obamacare funds from the gov’t (via @Upworthy)”
keyphrase: “obamacare”
Pan et al. (2017) proposed a keyphrase extraction dataset, where data were sourced
from online course captions. Labels are existing phrases in the captions. The courses
are computer science and economics courses, selected from two famous MOOC
platforms — Coursera and XuetangX. Labels were first filtered from captions using
automatic methods and then annotated by two human annotators. A candidate con-
cept was only labeled as a course concept if the two annotators were in agreement.
As a result, they collected captions from 4375 videos, and 16720 labeled concepts.
KP-20K (Meng et al., 2017) is a testing dataset, where the input texts are titles
and abstracts of computer science research papers collected from ACM Digital
Library. The labeled keyphrases are the keyphrases shown in the research papers.
The annotation follows the second scheme in Section 3.4.2, since the keyphrases
given by authors were not necessarily existing keyphrases in the papers. KP-20K
has the same data format as Inspec. Huang et al. (2019b), instead, were motivated
to automatically construct an educational concept map, showing concepts that will
be learned in courses, as well as the temporal relation between the concepts (e.g., to
learn concept A, it is a prerequisite to learn concept B; Concept A and concept B
can help with the understanding of each other).
To construct the dataset, which is written in Chinese, they first used OCR to
obtain the text from textbooks, then manually labeled key concepts for each
textbook (as “key concept” or “not key concept”), and finally manually annotated
the relationships among the labeled key concepts (as “w_i is w_j's prerequisite”,
“w_i and w_j have a collaboration relationship”, or “no relationship”).
keyword: “[ ‘average’, ‘weighted average’, ... ]”
relation: “[ ‘average : weighted average’, ... ]”
As a result, they collected 3730 pages in textbooks, 1092 key concepts, 818
prerequisite relations, and 916 collaboration relations. However, in their GitHub
repository, only the keyphrases and the relations between them can be found, while
the source text is not available. Other concept extraction datasets focus on specific domains
such as clinical concepts, e.g., TempEval (Bethard et al., 2016), i2b2-2010 (Uzuner
et al., 2011), n2c2-2018 (Henry et al., 2020), and MIMIC (Gehrmann et al., 2018),
course concepts, e.g., MOOCs (Pan et al., 2017) and EMRCM (Huang et al., 2019b),
and patent concepts, e.g., USPTO (Liu et al., 2020a). These also follow keyphrase
extraction setups, but the target is to extract concepts of interest.
The field of concept extraction also uses Precision, Recall, and F1 score as evaluation
metrics. Some keyphrase extraction research considered the task as an information
retrieval task; consequently, the information retrieval metric of mean average
precision (MAP) has also been used as the main measure for keyphrase extraction.
It is calculated by averaging the average precision scores over all queries in a dataset:
1’
n
M AP = Avg_Precisioni, (3.6)
n i=1
where n is the total number of queries and Avg_Precision_i denotes the average
precision of query i. In the context of keyphrase extraction, the MAP score is determined
by comparing the generated list of keyphrases with a predefined gold standard set,
and evaluating the average precision of the top n keyphrases, where n corresponds
to the total number of keyphrases in the gold standard set. Each generated keyphrase
is considered as a query. The gold standard set serves as the relevant document.
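The computation of Eq. (3.6) can be sketched as follows (a minimal implementation under one common convention, in which each average precision is normalized by the size of the gold set):

```python
def average_precision(ranked, gold):
    """Average precision of a ranked keyphrase list against a gold set:
    precision is recorded at each rank where a gold keyphrase is hit."""
    hits, precisions = 0, []
    for rank, phrase in enumerate(ranked, start=1):
        if phrase in gold:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(gold) if gold else 0.0

def mean_average_precision(predictions, golds):
    """MAP = mean of the per-query average precision scores (Eq. 3.6)."""
    scores = [average_precision(r, g) for r, g in zip(predictions, golds)]
    return sum(scores) / len(scores)

preds = [["a", "b", "c"], ["x", "y"]]   # ranked outputs for two queries
golds = [{"a", "c"}, {"y"}]             # gold keyphrase sets
score = mean_average_precision(preds, golds)
# AP_1 = (1/1 + 2/3) / 2 = 5/6,  AP_2 = (1/2) / 1 = 1/2,  MAP = 2/3
```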
3.4.7 Methods
Zhang et al. (2016) focused on the task of keyphrase generation on X data, and
framed this task as a sequence labeling task. They proposed a joint-layer RNN
model. For each input token, the joint-layer RNN model outputs two indicators ( ŷ1
and ŷ2 ), where ŷ1 has two values True and False, indicating whether the current
word is a keyword; ŷ2 has five values, Single, Begin, Middle, End, and Not,
indicating that the current word is a single keyword, the beginning of a keyphrase,
the middle of a keyphrase, the ending of a keyphrase, or not part of a keyphrase,
respectively.
Their experiments show that the joint-layer RNN model outperforms both the vanilla
RNN model and the LSTM model. However, when ŷ1 and ŷ2 have contradictions, it
might be hard to find an optimal strategy to determine which indicator to refer to. In
addition, joint-layer RNN can only extract an existing sequence as a keyphrase, but
cannot abstractively obtain a (non-existing but better) keyphrase.
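The two-indicator labeling scheme can be illustrated by deriving the gold indicators from keyphrase spans (the span-based input format is our assumption, not the original paper's):

```python
def joint_labels(n_tokens, keyphrase_spans):
    """Derive the two indicators of the joint-layer scheme from gold spans
    (end index inclusive): y1 in {True, False} marks keyword tokens, and
    y2 in {Single, Begin, Middle, End, Not} marks their position."""
    y1 = [False] * n_tokens
    y2 = ["Not"] * n_tokens
    for start, end in keyphrase_spans:
        for i in range(start, end + 1):
            y1[i] = True
        if start == end:
            y2[start] = "Single"
        else:
            y2[start] = "Begin"
            y2[end] = "End"
            for i in range(start + 1, end):
                y2[i] = "Middle"
    return y1, y2

y1, y2 = joint_labels(5, [(1, 3), (4, 4)])
# y2 == ["Not", "Begin", "Middle", "End", "Single"]
```

Built this way the two indicators are always consistent, but a trained model predicts them independently, which is exactly how the contradictions mentioned above can arise at inference time.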
Wang et al. (2018c) hypothesized that the performance of keyphrase extrac-
tion could be improved in the unlabeled or insufficiently labeled target domain by
transferring knowledge from a resource-rich domain. They accordingly proposed a
topic-based adversarial neural network (called TANN) that can learn transferable
knowledge across domains efficiently by performing adversarial training. The ex-
periment section shows that TANN largely outperforms joint-layer RNN (Zhang
et al., 2016). Li et al. (2018a) proposed an unsupervised method for concept mining,
which was motivated by the fact that supervised methods might be hard to generalize
to unseen domains. They assumed that the quality of an extracted concept can be
measured by its occurrence contexts and proposed a pipeline method for concept
mining. The method first populates many raw concepts extracted from text, and then
evaluates the concepts by comparing the embedding of concepts against the current
local context.
Al-Zaidy et al. (2019) identified two limitations of previous supervised ap-
proaches: 1) They classify the labels of each candidate phrase independently without
considering potential dependencies between candidate phrases; 2) They do not in-
corporate hidden semantics in the input text. Correspondingly, Al-Zaidy et al. (2019)
addressed keyphrase extraction as a sequence labeling task, and proposed a model
named BiLSTM-CRF that unites the advantages of LSTM (capturing semantics)
and CRF (capturing dependencies). Their results show that BiLSTM-CRF outper-
forms CopyRNN (Meng et al., 2017) by a large margin.
Fang et al. (2021a) hypothesized that previous extractive methods ignore struc-
tured information in the raw textual data (title, topic, and clue words), which might
lead to worse performance. They accordingly proposed a model named GACEN
that can utilize the title, topic, and clue words as additional supervision to provide
guidance. GACEN also utilized CRF to model dependencies in the output. The ex-
periment section shows that GACEN outperforms Joint-layer-RNN (Zhang et al.,
2016) and CopyRNN (Meng et al., 2017).
Meng et al. (2017) observed that classic keyphrase extraction methods can only
extract keyphrases that appear in the source text and are therefore unable to
reveal and leverage the full semantics of a document for keyphrase ranking.
Consequently, they proposed an RNN-based generative model with a copy mechanism,
named CopyRNN, which uses an encoder-decoder architecture to capture the semantics
of the input text.
Previous methods suffered from both coverage (not all keyphrases are extracted)
and repetition (similar keyphrases are extracted) problems (Meng et al., 2017). For
the coverage issue, Chen et al. (2018c) integrated a coverage mechanism (Tu et al.,
2016) into their approach, which enhances the attention distributions of multiple
keyphrases in order to cover a wider range of information within the source doc-
ument and effectively summarize it into keyphrases. For the repetition issue, they
constructed a target side review context set that contains contextual information of
generated phrases.
Ye and Wang (2018) believed that although Seq2Seq models have achieved good
performance, model training still relies on large amounts of labeled data. Corre-
spondingly, they leveraged unsupervised learning methods such as TF-IDF and
self-learning algorithms to create keyphrase labels for large amounts of unlabeled
data. Then, they trained their model on a mixture of self-labeled and labeled
data. They also used MTL during training. Experiments show
that their model outperforms previous works. Chen et al. (2019) argued that
prior research on keyphrase generation has treated the document title and main body
in the same manner, overlooking the significant role that the title plays in shaping the
overall document. Hence, they proposed a Title-Guided Network (TG-Net) where
the title is additionally employed as a query-like input to particularly assign attention
to the title. The performance of TG-Net outperforms CopyRNN (Meng et al., 2017).
ConceptNet (Havasi and Speer, 2007) grew out of the Open Mind Common Sense
(OMCS) project (Singh et al., 2002), which aimed at commonsense knowledge ac-
quisition. Contributors delivered knowledge by filling in blanks within a sentence.
For example, given “[ ] can be used to [ ]”, the concepts “ink” and “print” and
the associated relationship “UsedFor” can be obtained. The nodes of ConceptNet
are words and MWEs, and its edges are drawn from 21 basic relation types, such as
“IsA”, “PartOf”, and “UsedFor”. In the latest ConceptNet 5.5 (Speer et al., 2017),
the number of relations has increased to 36. Concepts and predicates were obtained via pattern
matching. Each collected sentence is compared with pre-defined regular expres-
sions, e.g., “NP is used for VP”(UsedFor), “NP is a kind of NP”(IsA), “NP can VP”
(CapableOf). NP (noun phrases) and VP (verb phrases) are concepts, while “Used-
For”, “IsA”, and “CapableOf” are predicates. In the case of a complex sentence that
contains several clauses, the patterns are employed to extract a simpler sentence
from it, which can then be subjected to the concept and predicate extraction process.
To evaluate ConceptNet, its assertions were compared with those in similar lexical
resources to determine their alignment.
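The pattern-matching idea can be sketched as follows (deliberately simplified: the real extraction pipeline matches parsed NPs and VPs, whereas this toy version of ours matches raw word spans against surface patterns in the spirit of those quoted above):

```python
import re

# Simplified surface patterns; real ConceptNet extraction operates on
# parsed noun and verb phrases, not arbitrary word spans.
PATTERNS = [
    (re.compile(r"^(.+?) is used for (.+)$"), "UsedFor"),
    (re.compile(r"^(.+?) is a kind of (.+)$"), "IsA"),
    (re.compile(r"^(.+?) can (.+)$"), "CapableOf"),
]

def extract_assertion(sentence):
    """Return (concept1, relation, concept2) for the first matching pattern."""
    text = sentence.strip().rstrip(".")
    for pattern, relation in PATTERNS:
        m = pattern.match(text)
        if m:
            return (m.group(1), relation, m.group(2))
    return None

triple = extract_assertion("ink is used for printing.")
# triple == ("ink", "UsedFor", "printing")
```

As the text notes, a complex multi-clause sentence would first be reduced to a simpler sentence before such patterns are applied; the sketch assumes that simplification has already happened.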
SenticNet (Cambria et al., 2024) is a commonsense knowledge base for concept-
level sentiment analysis built by means of neurosymbolic AI. Concepts are extracted
using a graph-based semantic parsing method termed Sentic Parser (Cambria et al.,
2022), which analyzes text at both MWE-level and word-level (Fig. 3.7). The key
novelties introduced by this parser with respect to previous work are that it leverages
morphology for syntactic normalization and it uses primitives for semantic normal-
ization. Sentic Parser is superior to standard concept parsers because it focuses on the
extraction of root words from text and, hence, it normalizes complex natural language
constructs to a sort of primitive-based protolanguage. This is done by leveraging a
graph-based approach which enables nested affix handling, microtext normaliza-
tion, and compound word processing. Preliminary experiments on other alphabetic
languages showed that the proposed methodology could be language-agnostic.
Fig. 3.7: Sentic Parser analyzes text at both MWE-level (left) and word-level (right).
23 Intuitively, “entity” can cover all possible senses of “alley” in WordNet, but it is not the
ideal concept representation of “alley”, because it is too abstract. Thus, the authors aimed at a
concrete concept representation that can cover the majority of the senses of a word.
They compared the four models to LSTM-CRF baselines (Huang et al., 2015) and
found that transformer-based models are effective for clinical concept extraction
tasks. Lange et al. (2020) proposed a joint model for both clinical concept extraction
and de-identification tasks. De-identification is important since in some clinical
concept extraction scenarios, the privacy of patients should be protected. They
hypothesized that jointly modeling the two tasks can be beneficial, and proposed
two end-to-end models. One is a multi-task model in which the two tasks share the
input representation; the other is a stacked model, which uses the privacy-token
predictions to mask the corresponding embeddings in the input layer and only
uses the masked embeddings for concept extraction. They found that the performance
of the concept extraction model can be improved by training and evaluating it on
anonymized data, thereby confirming their initial hypothesis.
In tasks involving the extraction of course concepts, the concepts are typically
defined as the knowledge concepts that are taught in the course videos, as well as
the related topics that aid in the students’ comprehension of the course videos. Iden-
tifying course concepts at a fine level is very important, as students with different
backgrounds need different concepts to quickly understand the main content of a
course.
Pan et al. (2017) contributed the first attempt to systematically investigate the
problem of course concept extraction in MOOCs. In the past, course concepts were
presented by instructors at a general level, with only a few concepts being covered
in an entire course video. However, they emphasized the significance of identifying
course concepts at a granular level, i.e., automatically identifying all course concepts
from each video clip, to facilitate easier comprehension. They identified the
challenge that course concepts appear at a low frequency, mainly because
different courses have different concepts. They accordingly proposed to utilize
word embeddings to capture the semantic relations between words and incorporate
online encyclopedias to learn latent representations for candidate course concepts. They
also proposed a graph-based propagation algorithm to rank the candidates based on
learned representations.
Later, Wang et al. (2018b) argued that external knowledge must be involved to
solve the concept extraction problem and proposed to utilize both the structured and
unstructured data in Wikipedia to provide external knowledge to concept extraction.
Their results show that their method outperforms the work of Pan et al. (2017).
Liu et al. (2020a) developed a framework to extract technical concepts from patents.
Patent documents have different structures than other documents. For instance, they
have “title”, “abstract”, and “claim”, which exhibit a multi-level of information.
Motivated by this, the authors proposed a framework named UMTPE, which can
effectively leverage multi-level information to extract concepts.
As discussed earlier, Sentic Parser (Cambria et al., 2022) can be used for polar-
ity detection by leveraging the SenticNet knowledge base. Unlike deep learning
techniques, SenticNet performs polarity detection in a completely explainable and
interpretable way, e.g., by providing a list of key polar concepts that most contributed
to the final polarity value.
Li et al. (2023b) proposed a neurosymbolic system for conversational emotion
recognition. ConceptNet was used as a knowledge base to acquire commonsense
knowledge out of context. For example, if a person mentions that he will “chop
all onions we have and cry”, another conversation participant expresses “disgust”
emotion. This is because “onion IsA lacrimator” is a commonsense assertion in
ConceptNet. Such commonsense knowledge cannot be obtained from the dictionary
meaning of “onion” or from the context alone, while ConceptNet provides the evidence
and explainability to infer such an emotional status from the context. The authors used
an utterance dependency parser and a neural network to learn symbolic knowledge
to enhance the explainability and accuracy of their method.
Using the concept mapping method of Ge et al. (2022), Han et al.
(2022a) employed concept mappings to support depression detection and explanation.
The hypothesis is that depression patients may have similar cognition patterns that
are reflected in their metaphorical expressions. Thus, they used concept mappings as
additional features besides tweets. The concept mappings were generated from tweets
that contain metaphors. They also proposed an explainable encoder that can identify
significant concept mappings that contribute to depression detection. The concept
mappings also improve the accuracy of depression detection, besides explaining the
common concept mapping patterns.
Xiong et al. (2017) manually analyzed the potential problems of a literature search
website [Link], and found that the issue of “Concept Not Understood”
represents one of the most significant challenges. The reason is that previous methods
measure similarity based on surface text rather than on semantic embeddings. As a result,
they proposed an embedding-based similarity matching method, which extracts the
concepts in both query and documents and measures the similarity between these
concepts to obtain the similarity between a query and a document. Liu et al. (2018b)
used extracted knowledge concepts as one of the inputs to obtain a unified semantic
representation for educational exercises. This representation is further used to
retrieve similar exercises by comparing it with the representations of other exercises.
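The embedding-based matching idea can be sketched as follows (the embeddings and the aggregation rule are toy choices of our own; the cited systems' actual scoring may differ): each query concept is matched against the document concepts by cosine similarity, and the per-concept best matches are averaged.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def query_doc_similarity(query_concepts, doc_concepts, embed):
    """Score a document by the best embedding match for each query concept,
    averaged over the query concepts."""
    scores = [max(cosine(embed[q], embed[d]) for d in doc_concepts)
              for q in query_concepts]
    return sum(scores) / len(scores)

# Hypothetical toy embeddings; real systems use learned concept vectors.
emb = {"heart attack": [1.0, 0.0], "myocardial infarction": [0.9, 0.1],
       "fracture": [0.0, 1.0]}
sim = query_doc_similarity(["heart attack"],
                           ["myocardial infarction", "fracture"], emb)
```

In this sketch, the synonym pair scores highly even though the two strings share no words, which is precisely what surface text matching misses.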
Fang and Zhang (2022) grounded concept extraction in the context of CEG. CEG
aims to generate an explanation in natural language to explain the reason why a
statement is anti-commonsense. For example, given “he took a nap in the sink”, the
model aimed to generate “a sink is too small and dirty to take a nap in”. The concepts,
“small” and “dirty” (bridge concepts), are obtained via a prompt-tuning method. The
authors developed a masked word prediction template to query the bridge concepts
that are most likely to appear in the “mask” position. Then, they use a generator to
generate the explanation with the concatenation of the original statement and the
discrete bridge concepts. This method improves the explainability in explaining why
a statement is anti-commonsense.
3.4.9 Summary
A concept is an abstract idea that is reflected in the mind. Concept extraction is the
foundation of detecting the main idea of a context and developing conceptual knowl-
edge bases. Related theoretical research showed that concepts may be abstracted
from multiple specific exemplars (Medin and Schaffer, 1978) or prototypes (Rosch,
1973). There are limited primitives that construct human cognition and reasoning,
which are the foundation of complex concepts (Wierzbicka, 1972).
Task | Reference | Tech | Feature and KB | Framework | Dataset | Score | Metric
Structured CE | Havasi and Speer (2007) | Knwl. eng. | textual patterns | Pattern matching | - | - | -
Structured CE | Snow et al. (2006) | SL | feature vectors, WordNet | Probabilistic | - | - | -
Structured CE | Cambria et al. (2022) | sem. pars., chunking, POS tag. | syntactic patterns | Syntactic parsing | - | - | -
Structured CE | Ge et al. (2022) | SL | statistics, WordNet | Elbow algorithm | - | - | -
Clinical CE | Li and Huang (2016) | DL | token mention, POS tag, word shape | CNN | TempEval | 78.80% | F1
Clinical CE | Liu et al. (2017c) | DL | word2vec, character2vec | BiLSTM | i2b2-2010 | 85.78% | F1
Clinical CE | Gehrmann et al. (2018) | DL | word2vec | CNN | MIMIC | 76.00% | F1
Clinical CE | Yang et al. (2020b) | DL | word2vec | Transformer | n2c2-2018 | 88.36% | F1
Clinical CE | Lange et al. (2020) | DL | word2vec | Multi-task-biLSTM | i2b2-2010 | 88.90% | F1
Course CE | Pan et al. (2017) | Graph | word2vec, Encyclopedia | Graph-based propagation | MOOCs | 41.60% | MAP
Course CE | Wang et al. (2018b) | Graph | word2vec, Wikipedia | Graph-based propagation | MOOCs | 47.50% | MAP
Patent CE | Liu et al. (2020a) | ML | self-pretrained word2vec, DBpedia | Clustering | USPTO | 43.37% | F1
Concept extraction methods have been widely used in downstream tasks, e.g., polarity
detection, information retrieval, dialogue systems, and CEG. Compared to other low-
level semantic processing techniques, the roles of concept extraction are more diverse
in downstream applications. For all the surveyed downstream tasks, the products of
concept extraction can be used as additional features to improve the performance
of downstream tasks. Moreover, concept extraction can add interpretability to such
downstream tasks, e.g., explaining anti-commonsense (Fang and Zhang, 2022) and
concept mapping patterns of depressed patients (Han et al., 2022a).
In the era of PLMs and LLMs, it seems that many complex tasks can be achieved end
to end with deep neural networks. However, black-box neural networks prevent
humans from understanding their decision-making mechanisms. This may be
contrary to the original intention behind building AI, e.g., giving machines
the ability to think like humans. Neurosymbolic AI, which combines the knowledge
of symbolic representations with neural networks, seems able to compensate for
the lack of interpretability of pure neural models, because symbolic representations
in natural language, e.g., words and concepts, are human-readable: we can
explain a prediction by inspecting which symbolic knowledge is activated. Meanwhile,
symbolic knowledge can represent commonsense knowledge, which is difficult for
neural networks to learn from corpora. As the fundamental technique of knowledge
base development, concept extraction has huge potential in downstream applications.
“Concept” is also highly relevant to human visual recognition. It has been argued that,
for humans, the ability of visual classification arises from concept learning, which
learns a generalized concept description from sample observations such that a
given observation can be identified as an instance of a learned concept (Seel, 2011;
Xiong et al., 2021). On the other hand, the abstractness of concepts is strongly related
to imagery (Paivio, 1965), because abstract concepts are those that do not apply to
tangible, perceptible objects that can be observed through touch, sight, hearing, or
other sensory experiences (Löhr, 2022). Thus, learning the relationships between
concepts and imagery can help concept extraction research organize concepts
hierarchically, e.g., into primitives, concepts, and entities. However, to the best of our
knowledge, there is so far a lack of research on multimodal concept extraction. It would
also be interesting to investigate possible synergies in concept extraction between
different modalities.
Despite the attention some scholars have given to neurosymbolic AI, the body of
related work remains relatively scant compared to end-to-end neural network
models. One possible explanation for this disparity is that, at present, greater
emphasis is placed on the accuracy of a model than on the transparency of its
decision-making process. Thus, there is a need for more concept extraction applications,
which can aid in enhancing the explainability of neural network-based models.
Concept extraction also offers insights for the development of knowledge bases,
prompting researchers to reassess how they extract and organize concepts in order
to more effectively support downstream applications.
[Link] Constraints
When human beings resolve co-reference, they obey semantic and syntactic constraints.
Among the semantic constraints, agreements such as gender and number
agreement are the strongest type (Garnham, 2001). Recently, however, agreement
mismatches (especially gender mismatches) have become more frequent,
since more people have started to use plural pronouns to avoid gender
bias. As for syntactic constraints, according to binding theory (Büring, 2005), in
sentence (a) of the following example, “John” cannot co-refer with “him”, while
in sentence (b) it can.
(9) a. John likes him.
b. John likes him in the mirror.
Centering Theory (Joshi and Kuhn, 1979; Grosz et al., 1983, 1995) was introduced
as a model of local coherence24 based on the idea of center of attention. The theory
assumes that, during the production or comprehension of a discourse, the discourse
participant’s attention is often centered on a set of entities (a subset of all entities in
the discourse) and such an attentional state evolves dynamically. It models transitions
of the attentional state and defines three types of transitions: CONTINUE, RETAIN, and
SHIFT. For each utterance, the transition is decided by its backward-looking center
(defined as the most salient entity in the previous utterance that is also realized in
the current utterance, denoted Cb) as well as its forward-looking center (defined
as the most salient entity in the current utterance, denoted Cf). Consider the
following discourse adopted from Kehler (1997):
(10) a. Terry really gets angry sometimes.
b. Yesterday was a beautiful day and he was excited about trying out his
new sailboat. [Cb = Terry, Cf = Terry]
c. He wanted Tony to join him on a sailing expedition, and left him a
message on his answering machine. [Cb = Terry, Cf = Terry]
d. Tony called him at 6AM the next morning. [Cb = Terry, Cf = Tony]
e. Tony was furious with him for being woken up so early. [Cb = Tony,
Cf = Tony]
where we annotate each utterance with its backward-looking and forward-looking
centers. The transition from utterance (10-a) to (10-b) is a CONTINUE as both
backward-looking and forward-looking centers are unchanged.
24 Instead of focusing on the whole discourse, centering theory focuses only on the discourse
segment.
The transition from (10-c) to (10-d) is a RETAIN: although the most salient entity
(i.e., Cf) changes, the backward-looking center stays the same. The transition from
utterance (10-d) to (10-e) is a SHIFT because of the change of the backward-looking
center. Intuitively, a discourse with more CONTINUE transitions is more
coherent than one with more SHIFT transitions. Though Centering Theory is
not a theory of anaphora resolution, anaphora resolution can directly benefit from
modeling transitions, which provide information about the preference for
the referents of pronouns.
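The transition rules above can be sketched as a small classifier, using this chapter's simplified single-entity reading of Cb and Cf (the full theory ranks a list of forward-looking centers):

```python
# Minimal classifier for centering transitions, using the simplified
# single-entity view of Cb and Cf adopted in the discussion above.

def transition(prev_cb: str, cur_cb: str, cur_cf: str) -> str:
    """CONTINUE: Cb unchanged and equal to Cf of the current utterance;
    RETAIN:   Cb unchanged but Cf moves to a new entity;
    SHIFT:    Cb itself changes."""
    if cur_cb != prev_cb:
        return "SHIFT"
    return "CONTINUE" if cur_cb == cur_cf else "RETAIN"

# Discourse (10): (c) [Cb=Terry, Cf=Terry] -> (d) [Cb=Terry, Cf=Tony]
# is a RETAIN; (d) -> (e) [Cb=Tony, Cf=Tony] is a SHIFT.
```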
[Link] Coolness
Huang (1984) classified human languages into cool languages and hot languages:
if a language is “cooler” than another, then understanding a sentence in that
language relies more on context. In particular, cool languages (e.g., Mandarin) make
liberal use of zero pronouns. Consider the following conversation:
(11) a. 你今天看见 Bill 了吗？ (Did you see Bill today?)
b. *pro* 看见 *pro* 了。 (*I* saw *him*.)
where *pro* represents a zero pronoun25. The first zero pronoun refers to one of
the speakers, while the second refers to Bill. Zero pronouns of this kind
are called anaphoric zero pronouns (AZPs). In addition to Mandarin, a number of
other cool languages also allow zero pronouns, including Japanese, Arabic, and
Korean. The theory suggests that anaphora resolution for cool languages should
also take AZPs into consideration, namely AZP resolution (AZPR) (Chen and Ng, 2013).
25 In linguistics, a zero pronoun is a pronoun that is implied but not explicitly expressed in a
sentence.
3.5 Anaphora Resolution
In this subsection, we introduce two commonly used annotation schemes for anaphora
resolution: MUC and MATE. There are also other schemes, for example, the Lancaster
scheme (Fligelstone, 1992) and the DRAMA scheme (Passonneau, 1997).
[Link] MUC
MUC (Hirschman et al., 1997; Hirschman and Chinchor, 1998) is one of the very
first schemes; it was used for annotating the MUC (Chinchor and Sundheim, 1995)
and ACE (Doddington et al., 2004) corpora and is still widely used today.
Its primary goal is to annotate co-reference chains in discourse, for which MUC
defines and proposes to annotate the IDENTITY (IDENT) relation. Relations of
this kind are symmetrical (i.e., if A IDENT B, then B IDENT A) and transitive (i.e., if
A IDENT B and B IDENT C, then A IDENT C). Annotation is done using SGML,
for example:
(12) <COREF ID="100">Lawson Mardon Group Ltd.</COREF> said <COREF
ID="101" TYPE="IDENT" REF="100">it</COREF> ...
The annotation above constructs a link between the pronoun “it” and the noun phrase
“Lawson Mardon Group Ltd.”. MUC proposes to annotate co-reference chains
following a paradigm analogous to anaphora resolution: annotators are first asked to
annotate markable phrases (e.g., nouns, noun phrases, and pronouns) and then to
partition the phrases into sets of co-referring elements. This helps the annotation task
achieve good inter-annotator agreement (i.e., greater than 95%). Nevertheless,
Deemter and Kibble (2000) pointed out that MUC has certain flaws: it does not
guarantee that the annotated relations are all co-referential, as it includes
non-identity-of-reference relations and bound anaphora, resulting in a corpus that
mixes co-reference and anaphora.
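Because IDENT is symmetric and transitive, a set of annotated IDENT links induces co-reference chains via their transitive closure. A union-find sketch (the mention IDs are illustrative):

```python
# Computing co-reference chains as the transitive closure of IDENT
# links, using union-find. Mention IDs here are illustrative.

def coreference_chains(mentions, ident_links):
    parent = {m: m for m in mentions}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving
            m = parent[m]
        return m

    for a, b in ident_links:               # each IDENT link merges two chains
        parent[find(a)] = find(b)

    chains = {}
    for m in mentions:
        chains.setdefault(find(m), set()).add(m)
    return list(chains.values())

# Links (100, 101) and (101, 102) place all three mentions in one chain.
chains = coreference_chains(["100", "101", "102", "200"],
                            [("100", "101"), ("101", "102")])
```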
[Link] MATE
Instead of annotating a single relation (IDENT), MATE (Poesio et al., 1999a; Poesio,
2004) was proposed for so-called “anaphoric annotation”, which is explicitly based
on the discourse model assumption (Heim, 1982; Gundel et al., 1993; Webber,
2016; Kamp and Reyle, 2013). The scheme was first proposed to annotate anaphora
in dialogues but was then extended to relations in discourse (see Pradhan et al.
(2012) for more details). Such good extensibility results from the fact that MATE
is a meta-scheme: it consists of a core scheme and multiple extensions. The core
scheme can be used to conduct the same annotation task as MUC and can be extended
with respect to different tasks. The annotation normally uses XML, but many of its
extensions use their own formats.
In addition to the “co-referential” relation discussed above, many researchers are also
interested in “hard” cases, each of which is often annotated following an extension
of MATE. These include the following three: (1) zero pronouns: Pradhan et al.
(2012) annotated (both anaphoric and non-anaphoric) zero pronouns in Chinese and
Arabic (see Section [Link]); (2) bridging references: bridging anaphora is a kind of
indirect reference, where the antecedent of an anaphor is not explicitly mentioned
but “associated” information is (Clark, 1975). Identifying such a relation
requires commonsense inference. Consider the following example from Clark (1975):
(13) I looked into the room. The ceiling was very high.
“the room” is an antecedent of “the ceiling” because a room has a ceiling; (3)
deictic references: a deictic expression (Webber, 1988) is a phrase whose interpretation
depends on the “speaker’s position” (e.g., time, place, situation). For example, in
(14) I went to school yesterday.
the first-person pronoun “I” and the word “yesterday” are deictic references, which
refer to the speaker and to the day before the date when (14) was uttered, respectively.
Schemes like ARRAU (Poesio and Artstein, 2008) extend MATE and are able to
annotate bridging and deictic references.
3.5.3 Datasets
The 6th version of MUC (Chinchor and Sundheim, 1995) is the first corpus enabling
co-reference resolution; in it, the task of co-reference resolution and the
MUC annotation scheme were first defined. Its texts, inherited from the previous
MUCs, are English news. An example from MUC-6 is shown in Example (12).
Chinchor (1998) updated MUC-6 and constructed the MUC-7/MET-2 corpus.
MUC-7 was designed to be multilingual (NB: data in Chinese and Japanese are
included in MET-2, which has been considered part of MUC-7) and to be
more carefully annotated than MUC-6, by providing annotators with a clearer task
definition and finer annotation guidelines.
ACE is a multilingual (i.e., English, Chinese, and Arabic), multi-domain co-reference
resolution corpus (Doddington et al., 2004). In terms of co-reference resolution,
it was built with the same purpose as MUC26 and shares the problems pointed
out by Deemter and Kibble (2000) (see Section 3.5.2 for more discussion). In addition
to MUC and ACE, there are works following the MUC scheme while targeting
domains other than news, including GENIA (Kim et al., 2003), GUM (Zeldes,
2017), and PRECO (Chen et al., 2018b).
The GNOME corpus was first proposed to investigate the effect of salience on
language production (see Section [Link]) and was then used to develop and evaluate
anaphora resolution algorithms (Poesio, 2003; Poesio and Alexandrov-Kabadjov,
2004), especially targeting bridging reference resolution, in the course of which
the MATE scheme was introduced (see Section 3.5.2). GNOME is an English multi-
domain corpus. The initial GNOME corpus (Poesio et al., 1999b) consists of data
from the museum domain (building on the SOLE project by Hitzeman et al. (1998))
and patient information leaflets (building on the ICONOCLAST project), and was
then expanded to include tutorial dialogues (Poesio, 2000). GNOME follows the
MATE scheme: each noun phrase is marked by an <ne> element and its anaphoric
relations (marked by <ante>) are annotated separately, for example:
<ante current="ne09">
<anchor ID="ne07" rel="ident" ... />
</ante>
26 Though, in terms of entity recognition, they do not have the same purpose.
Though OntoNotes has been widely used in co-reference resolution tasks, many of its
relations are not co-reference; for example, bound anaphora frequently appear (see
the start of this section for more discussion). Additionally, OntoNotes annotates
zero pronouns in its Chinese and Arabic portions (see Section [Link]). There are
other corpora following M/O but targeting different domains, including biomedical,
e.g., CRAFT (Cohen et al., 2017), Wikipedia, e.g., GAP (Webster et al., 2018) and
WikiCoref (Ghaddar and Langlais, 2016), and literary text, e.g., LitBank (Bamman
et al., 2020), as well as different anaphoric phenomena, including bridging anaphora,
e.g., ISNOTE (Hou et al., 2018), style variation, e.g., WikiCoref (Ghaddar and
Langlais, 2016), and ambiguity, e.g., GAP (Webster et al., 2018).
ARRAU is an English multi-domain (i.e., dialogue, narrative, and news) anaphora
resolution dataset, annotated following the MATE scheme (Poesio and Artstein,
2008; Uryupina et al., 2020). Different from other corpora that also follow
MATE, however, ARRAU extends MATE to annotate anaphoric ambiguity explicitly (recall
that MATE is a meta-scheme). Poesio and Artstein (2008) introduced the quasi-
identity relation, which is used when annotators consider co-reference possible but not
certain, and allowed each anaphor to have two distinct interpretations.
In the example below, the index “1,2” on the anaphor “it” means that ambiguity
exists: it can refer either to “engine E2” or to “the boxcar at Elmira”.
(u1) M: can we .. kindly hook up ... uh ... [engine E2]1 to [the boxcar at Elmira]2
(u2) M: +and+ send [it]1,2 to Corning as soon as possible please
The Winograd Schema Challenge (Levesque et al., 2012) focuses on the “hard”
cases of CR, which often require lexical and commonsense knowledge. It can be
traced back to Terry Winograd’s minimal pair (Winograd, 1972):
(15) a. The city council refused the demonstrators a permit because they
feared violence.
b. The city council refused the demonstrators a permit because they
advocated violence.
The antecedent of “they” changes from “the city council” to “the demonstrators”
from (15-a) to (15-b). Levesque et al. (2012) introduced the WSC benchmark
consisting of hundreds of such minimal pairs. Since then, many larger-scale WSC-like
corpora have been constructed, including the DPR corpus (Rahman and Ng,
2012), the PDP corpus (Davis et al., 2017), and the Winogrande corpus (Sakaguchi
et al., 2021). Following a similar paradigm, GAP (Webster et al., 2018), Winogender
(Rudinger et al., 2018), and Winobias (Zhao et al., 2018) were proposed for
“hard” cases linked to gender bias.
NP4E (Hasler et al., 2006) and ECB+ (Cybulska and Vossen, 2014) are corpora for
investigating cross-document co-reference. They annotate both entity and event
co-reference, both within and across documents. These corpora were built starting
from a set of clusters of documents, where the documents in each cluster describe
the same underlying events.
The corpora mentioned above are all in English, though some have Chinese
and Arabic portions. There are also anaphora/co-reference resolution corpora that
focus on other languages. These include ANCOR in French (Muzerelle
et al., 2013), ANCORA in Catalan and Spanish (Taulé et al., 2008), COREA in
Dutch (Hendrickx et al., 2008), NAIST in Japanese (Iida et al., 2007b), PCC in
Polish (Ogrodniczuk et al., 2013), PCEDT in Czech (Nedoluzhko et al., 2014), and
TüBa-D/Z in German (Telljohann et al., 2004).
Both lexical and world knowledge are useful for anaphor interpretation. Consider the
following examples from Martin (2015):
(16) a. There was a lot of Tour de France riders staying at our hotel. Several
of the athletes even ate in the hotel restaurant.
b. She was staying at the Ritz, but even that hotel didn’t offer dog walking
service.
We need the lexical knowledge that “riders” are “athletes”, and the world knowledge
that the “Ritz” is a “hotel”. As discussed earlier, WordNet
is a lexical knowledge base that can aid this process. Wikipedia has also been
an important world knowledge source for many anaphora resolution systems: such
knowledge bases consist of documents from Wikipedia as well as related meta-data.
Typical examples range from bases directly dumped from raw Wikipedia
documents27 to better-structured ones, such as Wikidata (Vrandecic and Krötzsch,
2014), DBpedia (Auer et al., 2007), and Freebase (Bollacker et al., 2008).
27 [Link]
MUC and Beyond. Along with MUC-6 (see Section 3.5.3), Vilain et al. (1995)
proposed the MUC score. It computes the recall and precision of anaphora/co-
reference resolution outputs by considering the co-reference chains in a document as a
graph. Vilain et al. (1995) first defined two sets: a set of key entities K, containing
the gold-standard reference chains (NB: a chain is sometimes called a class
or a cluster), and a set of response entities R, containing the system-generated
chains. The MUC score computes recall based on the number of missing links in R
compared to K; formally:
\[
\text{Recall} = \frac{\sum_{k_i \in K} \left( |k_i| - |p(k_i, R)| \right)}{\sum_{k_i \in K} \left( |k_i| - 1 \right)} \tag{3.7}
\]
where |k_i| is the number of mentions in chain k_i and p(k_i, R) is the set of
partitions constructed by intersecting k_i with R. MUC precision is computed by
switching K and R. However, it has been pointed out that MUC has certain flaws.
On the one hand, since MUC builds merely on mismatches of links between the two
sets, it is not discriminative enough (Bagga and Baldwin, 1998; Luo, 2005); for
example, it does not distinguish between an extra link connecting two singletons and
one connecting two prominent entities. On the other hand, Luo (2005) and Kübler
and Zhekova (2011) argued that MUC favors systems that over-merge mentions: if
we merge all mentions in OntoNotes into a single entity, the resulting MUC score
will be higher than that of the state of the art (Moosavi and Strube, 2016).
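A minimal implementation of the MUC recall in Eq. (3.7); as in the original formulation, precision is obtained by swapping the key chains K and the response chains R. It assumes chains are represented as Python sets of mention identifiers:

```python
# MUC score (Vilain et al., 1995), Eq. (3.7). Chains are sets of
# hashable mention identifiers.

def muc_partitions(key_chain, response_chains):
    """Partition a key chain by intersecting it with the response chains;
    mentions missing from every response chain become singletons."""
    parts = [key_chain & r for r in response_chains if key_chain & r]
    covered = set().union(*parts) if parts else set()
    parts += [{m} for m in key_chain - covered]
    return parts

def muc_recall(key_chains, response_chains):
    num = sum(len(k) - len(muc_partitions(k, response_chains)) for k in key_chains)
    den = sum(len(k) - 1 for k in key_chains)
    return num / den

def muc_precision(key_chains, response_chains):
    # precision is recall with the roles of K and R switched
    return muc_recall(response_chains, key_chains)
```

For example, with key chain {a, b, c, d} and response chains {a, b} and {c, d}, the key chain splits into two partitions, so recall is (4 - 2) / (4 - 1) = 2/3 while precision is 1.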
Many metrics beyond MUC have been proposed by measuring recall and precision
using mentions instead of links. Bagga and Baldwin (1998) proposed B³, which
considers the fraction of correctly identified mentions in R:
\[
\text{Recall} = \frac{\sum_{k_i \in K} \sum_{r_j \in R} \frac{|k_i \cap r_j|^2}{|k_i|}}{\sum_{k_i \in K} |k_i|} \tag{3.8}
\]
Luo (2005) proposed CEAF, which computes recall and precision over an optimal
one-to-one alignment between key and response entities: K* denotes the set of key
entities that has the optimal mapping with R, found by the Kuhn-Munkres algorithm,
and a similarity measure φ(·) scores each aligned entity pair. Nevertheless,
CEAF has two shortcomings: it overlooks all unaligned response entities (Denis and
Baldridge, 2009) and weights entities equally (Stoyanov et al., 2009). In addition to
the aforementioned link-based and mention-based metrics, to handle singletons,
Recasens and Hovy (2011) proposed BLANC, which also considers non-coreference/
non-anaphoric links. It measures the fraction of both correctly identified co-reference
links and correctly identified non-coreference links, and averages them to obtain the
final score. Moosavi and Strube (2016) conducted controlled experiments and showed
that all the aforementioned computations of precision and recall are neither
interpretable nor reliable, as they suffer from the so-called mention identification
effect. They proposed the LEA metric, claimed to solve the above issues from two
perspectives: (1) it considers both links and mentions; (2) it weights entities with
respect to their importance.
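The B³ recall of Eq. (3.8) can be implemented directly; precision again swaps K and R. A sketch assuming chains are sets of mention identifiers:

```python
# B3 score (Bagga and Baldwin, 1998), following the entity-level form
# of Eq. (3.8). Chains are sets of hashable mention identifiers.

def b3_recall(key_chains, response_chains):
    num = sum(
        len(k & r) ** 2 / len(k)
        for k in key_chains
        for r in response_chains
    )
    den = sum(len(k) for k in key_chains)
    return num / den

def b3_precision(key_chains, response_chains):
    # precision is recall with the roles of K and R switched
    return b3_recall(response_chains, key_chains)
```

For example, with key chain {a, b, c} and response chains {a, b} and {c}, recall is (4/3 + 1/3) / 3 = 5/9 while precision is 1: unlike MUC, B³ penalizes splitting a chain in proportion to the mentions affected.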
Text Editors. In the early years, anaphora/co-reference was annotated using text
editors or manipulation tools. For example, MUC-6 and ACE were annotated using
plain text editors, while GNOME was annotated using the XML manipulation tool
developed by the University of Edinburgh28.
28 [Link]
For example, ARRAU and PCC used MMAX2, a free, extensible,
general-purpose, desktop-based annotation tool that allows users to annotate
relations by filling in a customizable form. The NP4E project used
PALinkA, and ECB+ used CAT (Bartalesi Lenzi et al., 2012); both were
designed for event and reference annotation. More recently, co-reference annotation
tools that provide better visualization, allow drag-and-drop annotation, and
offer post-annotation analysis have been built. Typical examples include
CorefAnnotator (Reiter, 2018), which is open-source and desktop-based, SCAR (Oberle,
2018), which is open-source and web-based, and LightTag, which is not fully free
but provides good online teamwork services.
3.5.7 Methods
A. Linguistically-inspired Approaches
Like many other tasks in NLP, early works on anaphora resolution were built on rules
that are cognitively and linguistically rooted. Here, “early” denotes the
age before systematic evaluations of anaphora resolution, e.g., MUC, had been
introduced. The very first algorithm is the naïve algorithm proposed by Hobbs
(1978): it performs a breadth-first search over the parse tree of the sentence to
identify candidate mentions, and links mentions based on the constraints introduced
in Section [Link].
Later on, a series of anaphora resolution systems were proposed together with
computational investigations of the effect of salience (see Section [Link]). Based
on a set of factors shown to influence salience, Sidner (1979) introduced rules
for computing the expected focus of discourse and rules for interpreting anaphora.
As a matter of fact, this work built on the “centering view” rooted in Grosz (1977),
which suggests that, during anaphora resolution, the search for antecedents
should be restricted to the set of centered entities. It can be seen as a prototype
of the idea of “center of salience” in centering theory (see Section [Link]), but the
rules proposed by Sidner (1979) are extremely complex.
Starting from Sidner (1979), Carter (1987) focused on the rules about salience and
developed a system coined Shallow Processing Anaphor Resolver (SPAR). SPAR
maintains linguistically inspired rules as domain knowledge and performs commonsense
inference over them. As Carter (1987) pointed out, since maintaining domain
knowledge and reasoning rules is expensive, SPAR keeps them as simple as possible;
that is why it was called “shallow processing”. Carter assessed SPAR on a set of
322 test samples and found that it successfully resolved 93% of pronominal
anaphors and 87% of non-pronominal anaphors.
Hobbs et al. (1988) formalized commonsense inference in anaphora resolution as
abduction and introduced TACITUS. To perform abduction, TACITUS maintains
knowledge (i.e., rules) in formal logic (FOL in this case). Focusing on salience, Lappin
and Leass (1994) proposed the Resolution of Anaphora Procedure (RAP) algorithm.
After selecting a set of candidate antecedents based on semantic and syntactic
constraints, RAP applies a rule-based procedure for assigning values to several
salience parameters, which are then used to resolve anaphors. An assessment
on 360 hand-crafted texts containing pronouns showed that RAP outperformed the
naïve algorithm by 2%.
Also starting from Sidner (1979), subsequent works extended the idea of “focus”
on the basis of the concept of “centering”. Brennan et al. (1987) introduced the
BFP algorithm for anaphora resolution, which roughly has three stages: (1) construct
a set of candidate antecedents in accordance with the rules of the semantic
constraints; (2) filter and classify the candidates based on which transition a candidate
corresponds to in centering theory (see Section [Link]); and (3) select the best
candidate according to a pre-defined preference over the transitions. One limitation
of the BFP algorithm is that its final choice is merely based on a linear preference
order.
To optimize this selection process, Beaver (2004) married BFP with optimality
theory. Another limitation is that, by considering only centering theory, BFP
overlooks a key pattern of how humans resolve pronouns, namely, incremental
resolution (Kehler, 1997). In response to this problem, Tetreault (2001) proposed the
Left-to-Right Centering (LRC) algorithm, an incremental resolution algorithm
that adheres to centering constraints. An evaluation on the New York Times
corpus (Ge et al., 1998) suggests that LRC outperforms both BFP and the naïve
algorithm.
B. Knowledge-poor Approaches
After the introduction of the MUC-6 shared task, anaphora resolution systems could
be evaluated on a large scale. The trade-off, however, is that such systems could
no longer access inputs annotated with gold-standard semantic and syntactic
knowledge. Building on this setting, “knowledge-poor” approaches were proposed,
and most systems of this kind prefer rules that have high precision but do not rely
on knowledge. The most influential work is CogNIAC (Baldwin, 1997), a heuristic
precision-first anaphora resolver that relies on rules that are almost always true. For
example, CogNIAC contains a rule saying that if there is just one possible antecedent
in the entire prior discourse, then that entity is the antecedent. Its rules were selected
based on their precision on a set of test sentences. It is worth noting that rules
from CogNIAC are still used in many state-of-the-art practical anaphora resolution
systems, e.g., the Stanford Deterministic Coreference Resolver (Lee et al., 2013).
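CogNIAC's unique-antecedent rule can be sketched as a precision-first resolver that abstains unless it is confident; the compatibility features (gender, number) and data layout below are illustrative, not Baldwin's actual feature set:

```python
# Sketch of a CogNIAC-style precision-first rule: resolve a pronoun only
# when exactly one compatible antecedent exists in the prior discourse.
# The compatibility features and data layout are illustrative.

def resolve(pronoun, prior_mentions):
    """prior_mentions: list of (mention, gender, number) tuples."""
    compatible = [
        m for m, gender, number in prior_mentions
        if gender == pronoun["gender"] and number == pronoun["number"]
    ]
    if len(compatible) == 1:   # the rule is "almost always true"
        return compatible[0]
    return None                # abstain rather than risk an error

mentions = [("Mary", "fem", "sg"), ("the report", "neut", "sg")]
she = {"gender": "fem", "number": "sg"}
```

Abstaining when more than one candidate survives is exactly what makes the rule high-precision at the cost of recall.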
As pointed out by Poesio et al. (2023), this encouraged two major changes in anaphora
resolution. One is that, instead of relying on perfect knowledge and reasoning
over it, anaphora resolution systems started to use syntactic parsers and approximate
knowledge such as WordNet. The other is that the focus of anaphora resolution models
moved from pronouns alone to all kinds of nominal phrases that function as
referring expressions.
Kameyama (1997) proposed to resolve anaphors that are proper names, descriptions,
and pronouns. The system relies on syntactic and semantic constraints, but the related
information comes from a syntactic parser and a morphological filter based on person,
number, and gender features. Later on, approaches that marry rules with WordNet
were introduced (Harabagiu and Maiorano, 1999; Liang and Wu, 2003). They make
use of heuristic rules (as in CogNIAC), some of which consider lexical information
from WordNet.
The most famous rule-based anaphora resolution system is the one proposed
by Haghighi and Klein (2009), which is still frequently used as a strong baseline
in today’s research on anaphora resolution. In addition to the aforesaid syntactic and
semantic constraints, Haghighi and Klein (2009) make full use of parse trees.
For example, the system contains rules that rely on the distance between mentions,
obtained by computing the shortest path between two mentions in the parse tree. It
also uses Wikipedia as a resource for acquiring semantic knowledge about each entity.
One limitation of heuristic-based systems is that lower-precision features often
overwhelm higher-precision features. In response, more recent rule-based
systems (Raghunathan et al., 2010; Lee et al., 2013) categorize rules into sieves and
make decisions with an ordered set of rules. These works are often called multi-sieve
approaches.
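The multi-sieve idea can be sketched as an ordered pipeline of merging rules applied from highest to lowest precision; the two sieves below (exact string match, then head-word match) are simplified stand-ins for the full rule inventories of these systems:

```python
# Skeleton of the multi-sieve idea: apply sieves in decreasing order of
# precision, each sieve merging mention clusters it is confident about.
# The sieves shown are simplified stand-ins for the real rule sets.

def merge_if(clusters, compatible):
    """Greedily merge clusters containing any compatible mention pair."""
    merged = []
    for cluster in clusters:
        for target in merged:
            if any(compatible(a, b) for a in cluster for b in target):
                target.extend(cluster)
                break
        else:
            merged.append(list(cluster))
    return merged

def exact_match_sieve(clusters):
    return merge_if(clusters, lambda a, b: a.lower() == b.lower())

def head_match_sieve(clusters):
    return merge_if(clusters, lambda a, b: a.split()[-1].lower() == b.split()[-1].lower())

def multi_sieve(mentions, sieves):
    clusters = [[m] for m in mentions]   # start with singletons
    for sieve in sieves:                 # ordered, precision-first
        clusters = sieve(clusters)
    return clusters

mentions = ["Lawson Mardon Group", "the group", "The Group"]
clusters = multi_sieve(mentions, [exact_match_sieve, head_match_sieve])
```

Because earlier sieves run on smaller, cleaner clusters, their high-precision decisions constrain the riskier sieves that follow.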
The introduction of large-scale benchmarks also encouraged the trend of using
machine learning techniques in anaphora resolution. Basically, these learning-based
models treat anaphora resolution as a series of classification problems; we categorize
them on the basis of how they define the classification task.
A. Mention-pair Models
B. Entity-Mention Models
C. Mention-Ranking Models
Wiseman et al. (2015) were the first to use deep neural networks in anaphora
resolution, with a non-linear mention-ranking model. Instead of conjunction features (as in
statistical models), the model of Wiseman et al. (2015) uses a neural network to learn
feature representations as an extension of the mention-ranking model. They defined
two feature vectors, each of which is obtained by pretraining the model on one
of the subtasks of anaphora resolution, namely, mention identification and mention
linking; the final decision is made through a non-linear classification based on these
features. Both Wiseman et al. (2016) and Clark and Manning (2016b) augmented
the work of Wiseman et al. (2015) by inducing global features, but they followed
different schemes: Wiseman et al. (2016) ran an RNN to encode the representation
of each sequence of mentions corresponding to an entity (i.e., a cluster) in the
history, whereas Clark and Manning (2016b) first used a feed-forward neural network
to encode each mention pair of an entity and computed the entity representation
by pooling over all mention pairs. Later on, Clark and Manning (2016a) extended
their previous work (Clark and Manning, 2015), which built up co-reference chains
with agglomerative clustering: each mention starts in its own cluster, and pairs
of clusters are then merged using imitation learning (a type of reinforcement learning
technique) by treating cluster merges as actions. Clark and Manning (2016a)
replaced imitation learning with deep reinforcement learning (DRL).
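The mention-ranking decision rule shared by these models can be sketched independently of the neural scorer; the scoring function below is a trivial hand-written placeholder for the learned network, and a dummy antecedent score of zero stands in for the "no antecedent" option:

```python
# Minimal mention-ranking decision rule: score every candidate
# antecedent plus a dummy (no antecedent) and take the argmax.
# toy_score is a hand-written placeholder for the learned neural scorer.

def rank_antecedents(anaphor, candidates, score, epsilon_score=0.0):
    """Return the best-scoring antecedent, or None if the dummy wins."""
    best, best_score = None, epsilon_score
    for c in candidates:
        s = score(anaphor, c)
        if s > best_score:
            best, best_score = c, s
    return best

def toy_score(anaphor, candidate):
    # prefer nearby candidates whose head word matches the anaphor's
    head_match = anaphor["head"] == candidate["head"]
    return (2.0 if head_match else -1.0) - 0.1 * candidate["distance"]

anaphor = {"head": "company"}
candidates = [
    {"head": "company", "distance": 3},
    {"head": "report", "distance": 1},
]
```

The dummy antecedent is what lets a mention-ranking model declare a mention non-anaphoric instead of forcing a link.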
Liu et al. (2023b) proposed an MTL framework for the mention detection and mention
linking tasks, because they found that learning mention detection can enhance the
learning of dependency information of input tokens, which is complementary to
mention linking. Such an approach achieved performance comparable to that
of Kocijan et al. (2019) with only 0.05% of the WIKICREM training samples.
B. End-to-End Models
C. Knowledge-based Models
The training of AZPR systems shares the problem of lacking annotated training data. For example, the largest AZPR corpus, i.e., the Chinese portion of OntoNotes, contains only 12,111 AZPs. To incorporate more data into training, three paradigms have emerged: (1) Joint modeling: Chen et al. (2021b); Aloraini et al. (2022) proposed to train a model that resolves both AZPs and non-zero pronouns jointly; (2) Multi-linguality: Iida and Poesio (2011); Aloraini and Poesio (2020) built multilingual AZPR systems trained on AZPR data in multiple languages; (3) Data augmentation: Liu et al. (2017b) made use of a large-scale Chinese reading comprehension dataset to generate pseudo training data for Chinese AZPR, and Aloraini and Poesio (2021) augmented Arabic AZPR data with a number of augmentation strategies, e.g., back translation and masking candidate mentions.
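One of the augmentation strategies mentioned above, masking candidate mentions, can be sketched as follows. Masking each candidate span in turn to produce one pseudo example per span is an assumption about the procedure, not the authors' exact recipe.

```python
def mask_mentions(tokens, candidate_spans, mask_token="[MASK]"):
    """Yield one augmented token sequence per masked candidate span.

    candidate_spans are (start, end) token indices, end exclusive.
    """
    for start, end in candidate_spans:
        yield tokens[:start] + [mask_token] * (end - start) + tokens[end:]

tokens = ["the", "teacher", "praised", "the", "student"]
spans = [(0, 2), (3, 5)]  # "the teacher", "the student"
for variant in mask_mentions(tokens, spans):
    print(" ".join(variant))
# → [MASK] [MASK] praised the student
# → the teacher praised [MASK] [MASK]
```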
Stojanovski and Fraser (2018) provided the following example to illustrate how oracle anaphora signals can help machine translation systems.
(17) a. Let me summarize the novel for you.
b. It presents a problem.
c. er XPRONOUN It presents a problem.
d. Er präsentiert ein Problem.
Given the context (a) and the source sentence (b), Stojanovski and Fraser (2018) prepend the input sentence of the machine translation system with the oracle pronoun translation, as shown in (c), and ask the system to translate toward the target (d) in German. In this case, the pronoun “it”, which refers to “novel” (in German “Roman”), is translated to “er” (the German masculine pronoun agreeing with “Roman”). They argued that, without this information, it would be hard for the machine translation system to produce “er”. Experiments on a number of NMT models suggested that this approach improves BLEU scores by 4-5 points.
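The input construction in (17c) reduces to prepending the oracle target-language pronoun and a separator token to the source sentence. A minimal sketch, where the exact formatting around the XPRONOUN token is an assumption:

```python
def prepend_oracle_pronoun(source_sentence, oracle_pronoun, sep="XPRONOUN"):
    """Build the augmented MT input of example (17c): oracle pronoun,
    separator token, then the original source sentence."""
    return f"{oracle_pronoun} {sep} {source_sentence}"

augmented = prepend_oracle_pronoun("It presents a problem.", "er")
print(augmented)  # → er XPRONOUN It presents a problem.
```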
This argument was strengthened by the experiments of Saunders et al. (2020), who concluded that NMT systems do not reliably translate gendered co-reference. Complementing these analytical studies, Le Nagard and Koehn (2010); Hardmeier and Federico (2010); Guillou (2012) focused on improving machine translation with anaphora resolution outputs. The solution is often to use anaphora resolution outcomes to obtain features of each pronoun (including gender, number, and animacy) in order to enhance pronoun translation performance. Beyond these works, Miculicich and Popescu-Belis (2017) proposed to use the clustering scores that generate co-reference chains in anaphora resolution (see Section [Link]) as features for re-ranking machine translation results.
198 3 Semantics Processing
There has been a long tradition of studying the impact of AZPs on machine translation, especially when translating from a pro-drop language to a non-pro-drop language. For example, Japanese-English machine translation systems in the 1990s had already deployed AZPR modules (Nakaiwa and Ikehara, 1992). Later systems followed a slightly different strategy: instead of performing full anaphora resolution, they only detect AZPs in the source language and directly translate them into the target language without further resolving them (Tan et al., 2019; Wang et al., 2019b).
[Link] Summarization
There are two major uses of anaphora resolution in text summarization (Steinberger
et al., 2007). One is to help with finding the important terms while the other is
to help with evaluating the coherence of the summarization. Many works have
demonstrated that incorporating the information of co-reference chains contributes
to both the faithfulness and the coverage of summarization systems (Bergler et al.,
2003; Witte and Bergler, 2003; Sonawane and Kulkarni, 2016; Liu et al., 2021d).
Nevertheless, it is worth noting that some studies showed that anaphora resolution had negative effects (Orasan, 2007; Mitkov et al., 2007). One possible explanation is that the effect highly depends on the task the summarization system is addressing and on the performance of the anaphora resolution system.
For textual entailment, to understand the impact of anaphora resolution, Mirkin et al. (2010) manually analyzed 120 samples from the RTE-5 development set (Bentivogli et al., 2009). They found that for 44% of the samples anaphora relations were mandatory for inference, and for another 28% anaphora optionally supported the inference. Based on this finding, many systems participating in the RTE challenge made use of anaphora resolution. Nevertheless, since anaphora resolution systems at the time were not strong enough, the errors they made propagated to downstream textual entailment systems (Adams et al., 2007; Agichtein et al., 2008). As a consequence, the contribution of anaphora resolution was negative or not significant (Bar-Haim et al., 2008; Chambers et al., 2007).
Sukthanker et al. (2020) identified two ways in which anaphora resolution aids polarity detection: it enhances sentiment analysis of online reviews by linking pronouns to global entities, and it improves fine-grained ABSA by clustering entities into distinct aspects for better sentiment extraction.
3.5.9 Summary
As shown in Table 3.17, there are two clear technical trends. One is that research interest in anaphora resolution has shifted from machine learning-based or rule-based approaches to neural ones, especially End2End neural anaphora resolution, which performs mention identification and linking simultaneously. The other is that, as elucidated in Section 3.5.7, each of the task formulations, such as mention pair, entity mention, and mention ranking, has distinct shortcomings. Consequently, a further tendency is to employ higher-order inferences (Lee et al., 2018) to directly rank clusters or entities, which allows for the incorporation of benefits from all the formulations.
To sum up, state-of-the-art anaphora resolution models are often End2End cluster ranking models. Recent advances tend to further improve this paradigm from two angles: reducing the search space, since an End2End anaphora resolver searches across all possible spans in its input for antecedents (Wu et al., 2020b); and equipping anaphora resolution systems with knowledge (recently, often large-scale PLMs) to boost their reasoning ability (Joshi et al., 2019, 2020).
Many demonstrations were carried out approximately 15 years ago to validate the
necessity of anaphora resolution for both language generation and understanding
downstream tasks (Steinberger et al., 2007; Mirkin et al., 2010; Nicolov et al., 2008;
Li et al., 2021d; He et al., 2022a). In practice, however, anaphora resolution at the time often had negative effects (Bar-Haim et al., 2008; Chambers et al., 2007; Orasan, 2007; Mitkov et al., 2007), mainly because anaphora resolution systems were not powerful enough and the errors they made could propagate to downstream tasks.
Recently, with significant advancements in the capabilities of anaphora resolution
systems, more and more anaphora resolution systems have been used for providing
anaphora information for downstream tasks (see Table 3.18). In short, anaphora resolution helps its downstream applications mainly in two ways. On the one hand, it links noun phrases in different sentences, so these applications perform better at comprehending discourse-level information. On the other hand, linking noun phrases helps downstream applications perform higher-level reasoning, e.g., extracting global entities (Sukthanker et al., 2020) and recovering ellipses (Aralikatte et al., 2021).
Most downstream task models utilize anaphora resolution as an additional feature
to improve task performance. However, existing work rarely uses anaphora resolution techniques to explain how and why an anaphor is used in a given context.
Current annotation schemes for anaphora work well in practice but face theoretical issues due to the lack of unified rules on what counts as a markable and the unclear distinction between co-reference and anaphora, despite clear boundaries in linguistic theory. Annotation schemes so far have sacrificed precision for practicality so that large anaphora/co-reference resolution datasets (usable for training and assessing data-driven anaphora resolution systems) could be constructed. In exchange, the resulting corpora are imperfect in terms of both quality (i.e., some annotated relations might not be anaphoric) and coverage (i.e., some kinds of anaphora are not covered).
On a different note, since anaphora resolution can also be seen as a pragmatics task, readers may disagree on how an anaphor is interpreted (Uma et al., 2022). Nonetheless, many datasets resolve disagreements through majority voting, and only a few works explicitly annotated the ambiguities that cause the disagreements (Poesio and Artstein, 2008). In aggregate, it is plausible to design a scheme (probably by extending MATE) that not only handles disagreements but also balances quality, practicality, and coverage. Furthermore, it is important to empirically investigate how the errors and limitations inherent in an annotation scheme impact the performance of anaphora resolution systems.
Analogous to the disagreements in anaphora annotation, one can expect that a single mismatch between an output and a reference answer might be an error for some readers but not for others, and that different mismatches might differ in severity. The impact of error severity has been studied for the production of references (van Miltenburg et al., 2020).
For example, saying “a woman is a man” is more serious than saying “a red coat is pink”, but error severity has never been explored in the realm of anaphora resolution. This
said, roughly computing the overlaps between model outputs and reference outputs
might be problematic. On the one hand, due to discrepancies and varying degrees
of errors in anaphora resolution, human evaluation (Martschat and Strube, 2014) is
necessary to improve the analysis and evaluation of anaphora resolution models, as
well as to establish benchmarks for developing more accurate evaluation metrics. On
the other hand, when designing new evaluation metrics, disagreements and error severity should be considered by data-driven methods.
Model Development
Speech acts have a strong connection with subjective expressions because speech
acts perform actions, such as making a promise, giving an order, or expressing
a belief. Austin (1975) argued that language is not just a tool for describing the
world but also a means of accomplishing things in the world. Through speech acts,
individuals can influence the world around them and the actions of others. In this
sense, many seemingly objective expressions with speech acts can become subjective.
For example, if someone says,
(18) I promise to do it.
The utterance is not just conveying information but also performing the act of making
a promise. A more subjective case is
(19) I believe that it will rain tomorrow.
When individuals express belief in such a manner, they are essentially asserting their
mental disposition or perspective towards a specific statement. This entails making
a claim about their inner state or outlook toward a proposition. Austin (1975) argues
that a considerable number of utterances possess illocutionary force, which signifies
that their purpose is not merely to communicate information but also to accomplish
something beyond that.
Lakoff and Johnson (1980) argued that metaphors are not solely a linguistic phe-
nomenon, but also mirror human cognition via concept mappings. When an indi-
vidual uses a metaphorical expression, they employ a source concept to represent
a target concept in a particular context, thereby conveying their cognitive attitude
toward the target concept. This process, known as concept mappings, facilitates such
representation. In instances such as the statement
(20) Our love is a journey.
The individual utilizes the concept of a “journey” as the source to represent the
target concept of “love”, expressing their subjective feeling that their love is char-
acterized by both ups (joy) and downs (sadness). “Our love is a journey” cannot be
an objective statement, because the two concepts are from different domains, i.e., literally, love is not a journey. Thus, there is a semantic contrast between the literal
and contextual meanings of a metaphor (Mao et al., 2019). The semantic disparities
inherent in metaphors suggest that relying on the literal meanings of a statement
alone is insufficient in substantiating its subjectivity. Even though the statement in Example (20) does not use any obviously opinionated words, e.g., “happy” and “sad”, it still expresses a personal feeling. Thus, the pragmatics of statements must also be taken into account in subjectivity detection.
3.6.3 Datasets
A summary of all the introduced datasets can be found in Table 3.19. Generally,
subjectivity detection data are organized in the following forms. A text is typically
labeled as either subjective or objective, with the former category often further
classified as positive, negative, or neutral. The following examples are from SemEval-
2013 Task 2B: Sentiment Analysis on X (Nakov et al., 2013).
id1: “263732569508552704”
id2: “369152026”
text: “Kick-off your weekend with service! EV!’s Get on the Bus trip to the
Boys & Girls Club is Friday from 3-6! Hope to see you there :)”
label: positive
id1: “213342054351257601”
id2: “189656827”
text: “Desperation Day (February 13th) the most well known day in all mens life.”
label: negative
id1: “263803288074477568”
id2: “396953010”
text: “It seem like Austin Rivers is tryin to had to get a bucket. I feel em tho my
1st game in the league I was trying hard too”
label: neutral
3.6 Subjectivity Detection 207
Table 3.19: Subjectivity detection datasets and statistics. ISD denotes individual sub-
jectivity detection. CDSD denotes context-dependent subjectivity detection. CLSD
denotes cross-lingual subjectivity detection. MMSD denotes multimodal subjectiv-
ity detection. BD denotes bias detection.
For fine-grained subjective annotation, the labels are annotated at the span level.
The following examples are from SemEval-2013 Task 2A: Sentiment Analysis on
X (Nakov et al., 2013).
id1: “255732290246815744”
id2: “315400337”
text: “Billy Cundiff may be leaving Washington. Hopefully he won’t miss the door
on the way out.”
start id: “7”
end id: “7”
label: “positive”
id1: “255732290246815744”
id2: “315400337”
text: “Billy Cundiff may be leaving Washington. Hopefully he won’t miss the door
on the way out.”
start id: “9”
end id: “10”
label: “positive”
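The span-level annotations above locate the subjective phrase via token indices. A sketch of recovering the labeled span, under the assumption that start id and end id are 1-based, inclusive word indices over whitespace tokens (the indexing convention is inferred from the examples, not documented here):

```python
def extract_span(text, start_id, end_id):
    """Return the tokens covered by a 1-based, inclusive index span."""
    tokens = text.split()
    return tokens[start_id - 1:end_id]

text = ("Billy Cundiff may be leaving Washington. "
        "Hopefully he won't miss the door on the way out.")
print(extract_span(text, 7, 7))   # tokens 7..7 → ['Hopefully']
print(extract_span(text, 9, 10))  # tokens 9..10
```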
The Forum dataset (Biyani et al., 2014) contains 700 threads from online forums
Trip Advisor–New York31 and Ubuntu Forums32, manually annotated for subjec-
tivity. The Email dataset (Murray and Carenini, 2011) contains 1,800 sentences derived from the BC3 corpus (Ulrich et al., 2008), 172 of which are labeled as subjective. For
multimodal subjectivity detection, the AMIDA dataset (Wilson, 2008) consists of
19,071 dialogue act segments from 20 conversations from the AMI Meeting Cor-
pus (McCowan et al., 2005), manually annotated with the AMIDA scheme. 42%
of the dialogue act segments are tagged with at least one subjective annotation.
The Institute for Creative Technologies Multimodal Movie Opinion (ICT-MMMO)
dataset (Wöllmer et al., 2013) contains 370 YouTube review videos labeled as
strongly negative, weakly negative, neutral, weakly positive, and strongly positive.
Multimodal Opinion Utterances Dataset (MOUD) (Morency et al., 2011) is a col-
lection of 80 YouTube review videos annotated as positive, negative, and neutral.
For the bias detection task, which aims to identify subjective bias in Wikipedia,
the following datasets are widely used. Conservapedia (Hube and Fetahu, 2018) is
a collection of 1,000 single-sentence statements from Conservapedia33, manually
annotated as biased or unbiased. Wiki Neutrality Corpus (WNC) (Pryzant et al.,
2020) contains 180,000 aligned Wikipedia sentence pairs. Each pair consists of a
sentence before and after bias neutralization by English Wikipedia editors.
Lexicons of subjectivity clues and patterns are commonly used for subjectivity
detection, as summarized in Table 3.20. The General Inquirer (Stone et al., 1966) is
a lexicon consisting of 10,000 words sorted into 180 categories for content analysis.
The Subjectivity Clues lexicon (Riloff and Wiebe, 2003) is a list of words that are
subjective in most cases (strongly subjective) and words that may have subjective use
in certain contexts (weakly subjective). MPQA Subjectivity Lexicon (Wilson et al.,
2005) expanded the Subjectivity Clues using additional dictionaries and lexicons,
containing over 8,000 subjectivity clues.
Knowledge bases that provide sentiment information are also widely used for
subjectivity detection. WordNet-Affect (Strapparava et al., 2004) is a set of synsets
derived from WordNet that effectively represents affective concepts. SentiWordNet,
as introduced in the previous section, is based on WordNet. Each word in Senti-
WordNet is given three scores indicating its positivity, negativity, and objectivity.
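Since the three SentiWordNet scores of an entry sum to one, objectivity can be derived from positivity and negativity, and a crude sentence-level subjectivity signal can be obtained by averaging. A minimal sketch with a tiny hypothetical lexicon (the score values below are illustrative, not taken from SentiWordNet):

```python
LEXICON = {          # word -> (positivity, negativity); hypothetical values
    "good": (0.75, 0.0),
    "bad": (0.0, 0.625),
    "table": (0.0, 0.0),
}

def objectivity(word):
    """Derive objectivity from the other two scores, which sum to one."""
    pos, neg = LEXICON[word]
    return 1.0 - (pos + neg)

def sentence_objectivity(tokens):
    """Average objectivity over in-lexicon tokens; lower means more subjective.
    Out-of-lexicon sentences default to fully objective."""
    scores = [objectivity(t) for t in tokens if t in LEXICON]
    return sum(scores) / len(scores) if scores else 1.0

print(sentence_objectivity(["the", "table", "is", "good"]))  # → 0.625
```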
As discussed earlier, SenticNet (Cambria et al., 2024) is also a popular knowledge
base for subjectivity detection.
31 [Link]
32 [Link]
33 [Link]
The aforementioned NER annotation tools (see Section [Link]) can be used for
subjectivity detection because these tools can annotate labels for spans (fine-grained
subjectivity detection) and sentences (coarse-grained subjectivity detection).
3.6.7 Methods
A. Lexicon-based
Wiebe and Riloff (2005) further improved this bootstrapping system by using
the labeled sentence produced by the rule-based method as initial training data for
a Naïve Bayes classifier. The major weakness of these methods is the unreliable
assumption that the absence of subjective clues and patterns indicates objectivity,
resulting in false-positive errors. Kim and Hovy (2005) first compiled lists of words
that convey opinions and those that do not, which were manually annotated with
corresponding classes and levels of strength. They expanded the lists with a common
English word list by measuring the WordNet distance between a common word and
the compiled seed lists. They further identified additional opinion words and non-
opinion words from editorial and non-editorial WSJ documents by computing their
relative frequencies. By detecting the subjectivity of a given sentence based on the
presence of a single strong valence word, their method achieved 65% accuracy on
MPQA.
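Kim and Hovy's decision rule above reduces to checking a sentence for the presence of at least one strong valence word. A minimal sketch, where the seed words are hypothetical stand-ins for their manually compiled and WordNet-expanded lists:

```python
# Hypothetical seed list of strong valence words.
STRONG_OPINION_WORDS = {"terrible", "wonderful", "hate", "love"}

def is_subjective(sentence):
    """Label a sentence subjective if it contains a single strong valence word."""
    tokens = sentence.lower().split()
    return any(t.strip(".,!?") in STRONG_OPINION_WORDS for t in tokens)

print(is_subjective("I love this film"))          # → True
print(is_subjective("The film runs 90 minutes"))  # → False
```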
Benamara et al. (2011) argued that sentence-level subjectivity detection cannot
fully leverage context, because a sentence may contain several opinion clauses, and
opinion expressions may be discursively related. As such, they proposed a segment-
level annotation based on the Segmented Discourse Representation Theory (Asher
and Lascarides, 2003), where segments are labeled as explicitly subjective, implicitly
subjective, subjective non-evaluative, and objective. This fine-grained annotation
can better enhance polarity detection, as segments in the latter two categories do not convey positive or negative opinions. However, the limitation of this method is that the
four label classes are unbalanced in the corpus. Additionally, implicitly subjective
segments are often nuanced and hard to identify. Thus, it would be challenging to
design an appropriate classifier. The paper circumvented this problem by reframing
the task as two parallel binary classification tasks and obtained 82.31% accuracy
with a manually compiled French lexicon and SVMs as classifiers.
Merely detecting the existence of subjective keywords is often an insufficient
indication of a sentence’s subjectivity. Other works attempted to enrich the feature
set by incorporating more sentence-level information. Relying on expert knowledge
of parse tree, Xuan et al. (2012) manually constructed a set of syntax-based patterns
from unigrams and bigrams to extract features. A MaxEnt model was employed
as the classifier, obtaining 92.1% accuracy on the Movie dataset. Remus (2011)
hypothesized that the readability of a sentence was related to its subjectivity. Hence,
readability formulae such as Devereux Readability Index (Smith, 1961) and Easy
Listening (Fang, 1966) were incorporated as features in addition to the MPQA
Subjectivity Lexicon, obtaining 84.5% F-measure on the Movie dataset.
Many works proposed subjectivity detection systems that specifically targeted
X (formerly known as Twitter). Given the word constraint imposed by X, a tweet
is generally regarded as a sentence. Barbosa and Feng (2010) believed that using
subjectivity detection as an upstream task would improve the performance of polarity
detection on X text. Aside from conventional features such as subjective clues and
POS tags, they leveraged Tweet-specific syntax features, e.g., links and upper case.
An SVM classifier was employed, which achieved 81.9% accuracy on the X dataset,
and improved the accuracy of polarity detection by 5.6%.
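The Tweet-specific syntax features mentioned above (presence of links, amount of upper case) can be sketched alongside a subjective-clue count. The feature names and the exact feature set are assumptions for illustration, not the authors' full feature space:

```python
def tweet_features(tweet, subjective_clues):
    """Extract a small feature dictionary from a tweet for an SVM-style classifier."""
    tokens = tweet.split()
    return {
        # Tweet-specific syntax features
        "has_link": any(t.startswith(("http://", "https://")) for t in tokens),
        "upper_ratio": sum(c.isupper() for c in tweet) / max(len(tweet), 1),
        # conventional subjective-clue feature
        "clue_count": sum(t.lower().strip(".,!?") in subjective_clues for t in tokens),
    }

clues = {"hope", "great", "sad"}
feats = tweet_features("GREAT show tonight, hope to see you http://example.com", clues)
print(feats["has_link"], feats["clue_count"])  # → True 2
```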
B. Word Frequency
C. Deep Learning
Both methods employed a simple softmax classifier for each task. The former
achieved 95.23% accuracy on the Movie dataset, while the latter obtained 95.1%.
However, a shared limitation is that, despite their overall good performance, some
of the tasks did not exceed single-task learning baselines. This is likely because both
methods adopted hard parameter sharing MTL (Crawshaw, 2020), which emphasizes generalization over optimization.
Sagnika et al. (2021) presented an attention-based CNN-LSTM model for sub-
jectivity detection, which served as a preprocessing step for sentiment analysis.
The combination of CNN and LSTM enabled the model to capture both spatial
and temporal information. Additionally, it utilized word embeddings enhanced by
sentiment-related information (Sagnika et al., 2020). Initially, the training of the
model was carried out with the Movie dataset, after which it was utilized to analyze
the sentiment of the IMDB dataset. The objective sentences were eliminated from
the dataset to form a modified set of reviews. Various models were tested as senti-
ment classifiers. The subjectivity detection model not only obtained 97.1% accuracy
on the Movie dataset, but also consistently improved the performance of sentiment
analysis.
The method of individual detection categorizes each sentence without considering its
context. However, subjectivity detection and sentiment classification are contextual
problems since lexical items can affect each other in a discourse setting (Aue and
Gamon, 2005; Polanyi and Zaenen, 2006). Pang and Lee (2004) were the first
to leverage inter-sentence context information to filter out objective sentences, in
order to better serve document-level polarity detection. Based on the hypothesis that adjacent text spans might share the same subjectivity label (Wiebe, 1994), they suggested an algorithm known as the “minimum cuts algorithm”, which optimizes the subjectivity score for every sentence separately while also penalizing the assignment of different labels to two closely related sentences. These two sub-objectives are independent of each other, making the model more flexible for the
addition of features. Context-dependent methods can be divided into two categories,
namely, the feature engineering approach and the statistical approach.
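The two terms of the min-cut objective described above can be sketched by scoring one candidate labeling: a per-sentence cost for assigning a label against its individual subjectivity score, plus a pairwise penalty for splitting closely associated sentences. The actual method finds the optimal labeling with a graph min-cut (max-flow) algorithm; the scores and association weights below are illustrative:

```python
def labeling_cost(labels, subj_scores, assoc):
    """labels[i] in {"subj", "obj"}; subj_scores[i] in [0, 1];
    assoc maps sentence-index pairs (i, j) to an association strength."""
    # individual term: cost of assigning each label against its score
    individual = sum(
        (1.0 - subj_scores[i]) if lab == "subj" else subj_scores[i]
        for i, lab in enumerate(labels)
    )
    # pairwise term: penalize giving different labels to associated sentences
    pairwise = sum(w for (i, j), w in assoc.items() if labels[i] != labels[j])
    return individual + pairwise

scores = [0.9, 0.8, 0.1]            # sentence-level subjectivity scores
assoc = {(0, 1): 0.5, (1, 2): 0.1}  # adjacency-based association strengths
print(labeling_cost(["subj", "subj", "obj"], scores, assoc))  # → 0.5
print(labeling_cost(["subj", "obj", "obj"], scores, assoc))   # → 1.5
```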
A. Feature Engineering
B. Statistical Approach
The GBN layer converted the sentence sequence from the MPQA dataset into
a time series of word frequency, captured second-order word dependencies with a
time lag of 2, and generated a subset of sentences that contained the most significant
words from the MPQA Subjectivity Lexicon. The model adopted a CNN sentence
model with convolution kernels of increasing size, which combined the local word
dependencies within the kernel size to model long-range syntactic relations. It was
pretrained with the subset of sentences produced by GBN before being trained on the full dataset, obtaining accuracies of 93.2% on MPQA and 96.4% on the Movie dataset.
A. Language-Independent Approach
B. Translation Approach
Another solution is the translation approach, where lexical resources for the target
language are automatically generated by translating the resources and tools available
for English, usually with the help of SMT (Kim and Hovy, 2006; Mihalcea et al.,
2007; Wan, 2009; Amini et al., 2019). Banea et al. (2010) conducted a study on En-
glish and five other highly lexicalized languages, proving that a multilingual feature
space constructed through SMT improved the accuracy of subjectivity detection on
all languages involved. However, the sentence translation process can lead to the loss
of essential lexical information such as inflection and formality, which often served
as an indicator of subjectivity.
Chaturvedi et al. (2016b) mitigated this information loss during translation by
using a neural network to transfer resources from English to Spanish. They first trans-
lated the MPQA Subjectivity Lexicon into Spanish using an SMT system (Lopez,
2008). A MaxEnt-based POS tagger (Toutanvoa and Manning, 2000) and a multilin-
gual WSD system (Moro et al., 2014a) were incorporated in the preprocessing stage
to minimize the loss of lexical information during translation.
Their proposed model, named Lyapunov Deep Neural Network (LDNN), ex-
tracted spatial features from the input Spanish sentence and its translated English
form using CNN, which were then combined with an RNN to capture the bilin-
gual temporal features. To mitigate the vanishing gradient problem with RNN, a
Lyapunov function was used as the error function of RNN for stable convergence.
Utilizing the high-level features produced by Lyapunov-guided RNN, a multiple
kernel learning (Subrahmanya and Shin, 2009; Zhang et al., 2010) classifier yielded
the prediction. Their model obtained 84.0% F-measure on MPQA Gold, and 88.4%
accuracy on TASS.
While most studies on detecting subjectivity have concentrated on text-based data, the
identification of subjective expressions in other modalities, such as audio and video,
presents an important area for research. For instance, Murray and Carenini (2009,
2011) proposed an automatic pattern extraction method for subjective expression in
spoken conversation, which is able to extract Varying Instantiation N-Grams (VIN)
from labeled and unlabeled data. Unlike a conventional n-gram, a VIN is a trigram in which each unit can be either a word or a POS label, making it a more robust alternative to syntactic parsers for fragmented and disfluent text, such as meeting transcripts.
Combined with a large raw feature set, a MaxEnt classifier achieved an F-measure of 52% on the AMIDA dataset. This method, however, did not leverage any information
from other modalities.
Raaijmakers et al. (2008), instead, explored the effectiveness of lexical and acous-
tic features in speech subjectivity detection. Specifically, they investigated word, char-
acter, prosody, and phoneme n-grams. Following Wrede and Shriberg (2003); Banse
and Scherer (1996), the prosodic features were extracted based on pitch, energy, and
the distribution of energy in the long-term averaged spectrum. Word-, character-, and
phoneme-level features were extracted from manual speech transcripts. A separate
BoosTexter classifier (Schapire and Singer, 2000) was employed for each feature
set, whose predictions were combined using a simple linear interpolation strategy to
obtain the final output. The combination of the four types of feature sets achieved
75.4% accuracy and 67.1% F-measure on AMIDA. Experiments showed that word-
and character-level features contributed the most.
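The fusion step described above, combining the per-feature-set classifier outputs by simple linear interpolation, can be sketched as follows. The weights are hypothetical; BoosTexter itself is not reproduced here:

```python
def interpolate(scores_per_classifier, weights):
    """Combine subjectivity scores from several classifiers by a weighted sum.
    Weights are assumed to sum to one."""
    return sum(w * s for w, s in zip(weights, scores_per_classifier))

# word-, character-, phoneme-, and prosody-based classifier scores (illustrative)
scores = [0.8, 0.7, 0.4, 0.5]
weights = [0.4, 0.3, 0.2, 0.1]
combined = interpolate(scores, weights)
print(round(combined, 2))  # → 0.66
```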
Bias detection refers to the task of identifying biased statements from supposedly
impartial articles. Specifically, in Wikipedia, the Neutral Point of View (NPOV) is
a core principle that ensures neutrality for controversial topics. Thus, the goal of
this task is to detect sentences that violate NPOV policy on a Wikipedia page. Bias
detection is closely related to subjectivity detection. Its development mirrors the
technical trends of the latter.
The presence of objective text can decrease the accuracy of polarity detection,
as sentiment classifiers are usually optimized for the binary classification task of
labelling text as either positive or negative. Therefore, having subjectivity detection as an upstream task can greatly enhance polarity detection (Satapathy et al., 2017a;
Das and Sagnika, 2020). For document-level sentiment analysis, Bonzanini et al.
(2012) showed that subjectivity detection reduced the amount of data to 60% while
still producing the same polarity classification results as full-text classification.
Pang and Lee (2004); Das and Sagnika (2020) applied subjectivity detection to
filter out objective sentences in reviews prior to classifying their polarity. Similarly,
Kamal (2013) first extracted subjective sentences from customer reviews and then
employed a rule-based system to mine feature-opinion pairs from the subjective
sentences. Barbosa and Feng (2010); Soong et al. (2019) used subjectivity detection
in sentiment analysis for X microtext. These works proved that removing objective
content from the dataset indeed makes the learning of sentiment more effective.
Hate speech detection is a task that identifies abusive speech targeting a person or a
group based on stereotypical group characteristics, e.g., ethnicity, religion, or gender,
on social media (Warner and Hirschberg, 2012). Since hate speech is often marked
by its content, tone, and target (Cohen-Almagor, 2011), its detection is similar to
that of polarity. Additionally, subjectivity clues tend to be surrounding the polarizing
and arguing topics, which aligns well with hate speech detection.
QA systems generally encounter two types of questions: the ones that expect truth as
answers, and the ones that expect opinions. Therefore, it is crucial for a QA system to
distinguish opinions from facts, and provide the appropriate type depending on the
question (Yu and Hatzivassiloglou, 2003). To achieve this goal, a QA system should
operate in two stages. First, it must determine whether a question calls for a subjective
or objective answer, which is its subjectivity orientation (Li et al., 2008a,b; Aikawa
et al., 2011). Then, the system needs to consider subjectivity as a relevant factor
in the information retrieval process. Subjectivity detection can be incorporated as a
filter or feature set in a QA system. For instance, Stoyanov et al. (2005) modified the
conventional QA system by applying a subjectivity filter and an opinion source filter
on the initial information retrieval results, which improved the system significantly.
On the other hand, Wan and McAuley (2016) leveraged subjective features from
reviews to provide users with a list of relevance-ranked reviews, which improved the
performance of answering binary questions from categories with abundant data.
3.6.9 Summary
Subjectivity detection is a well-studied NLU subtask. There are five technical trends
in this area, namely individual, context-dependent, cross-lingual, multimodal sub-
jectivity detection, and bias detection. A summary of the trends can be found in
Tables 3.21 and 3.22. For individual subjectivity detection (Table 3.21), the sub-
jectivity of each sentence or snippet is determined only by the lexical, syntactic,
and semantic information of the sentence itself. There are mainly three types of
methods for individual subjectivity detection. First, the lexicon-based approaches
rely on external lexicons that contain subjective and sentiment clues to predict the
subjectivity of a sentence. The weakness of such an approach is that subjective
clues are often not extensive and reliable enough to determine the subjectivity of
a sentence. Some works attempted to address this issue by utilizing sentence-level
features to extract syntactic information (Wilson et al., 2004; Xuan et al., 2012; Bar-
bosa and Feng, 2010), or incorporating WSD to identify subjective clues according
to context (Akkaya et al., 2009; Ortega et al., 2013). Nonetheless, these methods
cannot fully extract the underlying sentence structure and contextual information.
Word-frequency-based approaches, on the other hand, predict sentence subjectivity
according to the word presence or occurrence in a given corpus, thus being able
to adapt to new domains and languages. Additionally, this approach requires few external resources and little human effort.
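Both the lexicon-based and word-frequency families ultimately reduce to scoring tokens; a minimal lexicon-based variant can be sketched as follows. The tiny weighted clue lexicon and the decision threshold are toy assumptions (real systems draw on large resources such as the MPQA subjectivity lexicon, which distinguishes strong from weak subjective clues).

```python
# Minimal lexicon-based subjectivity detector in the clue-counting spirit:
# a sentence is labeled subjective if its weighted clue score passes a
# threshold. The miniature lexicon and threshold are illustrative assumptions.

CLUE_WEIGHTS = {          # toy stand-in for a subjectivity lexicon
    "amazing": 2.0, "terrible": 2.0, "love": 2.0,   # strong clues
    "quite": 0.5, "seems": 0.5, "apparently": 0.5,  # weak clues
}

def is_subjective(sentence: str, threshold: float = 1.0) -> bool:
    tokens = [t.strip(".,!?").lower() for t in sentence.split()]
    score = sum(CLUE_WEIGHTS.get(t, 0.0) for t in tokens)
    return score >= threshold

print(is_subjective("I love this amazing camera."))   # True
print(is_subjective("The camera weighs 300 grams."))  # False
```

The weakness noted above is visible here: any subjective sentence whose clues fall outside the lexicon is silently misclassified as objective.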
Like lexicon-based approaches, however, word-frequency methods lack the ability to
capture syntactic information. To address this limitation, deep-learning-based meth-
ods utilize neural networks to learn spatial and temporal dependencies. Specifically,
PLMs are widely used for their ability to provide universal representations (Kim,
2014; Liu et al., 2019b; Sun et al., 2019). For context-dependent subjectivity detection (Table 3.22), the subjectivity of a sentence is determined with regard to its
surrounding context, e.g., inter-sentence-level (Pang and Lee, 2004; Belinkov et al.,
2017), document-level (Yu and Hatzivassiloglou, 2003; Das and Bandyopadhyay,
2009; Karimi and Shakery, 2017), or discourse-level (Biyani et al., 2014) infor-
mation. In existing works, such information is typically captured through feature
engineering or statistical means.
As a large part of subjectivity detection work relies on external subjective clues,
cross-lingual subjectivity detection aims specifically to solve the lack of lexical
resources for non-English languages. There are mainly two branches of thought
to address this problem (Table 3.22). One is to make use of language-independent
methods such as word frequency (Rustamov et al., 2013; Lin et al., 2011; Kamil et al.,
2018; Belinkov et al., 2017) and language modeling (Karimi and Shakery, 2017).
The other is to generate resources for the target language from English lexicons with
the help of SMT systems (Banea et al., 2010; Chaturvedi et al., 2016b).
Multimodal subjectivity detection is a growing field of interest, driven by the rising need for sentiment analysis across various media (Table 3.22). Existing works
utilized lexical, prosodic, and phonemic features for subjectivity detection in spoken
conversations (Murray and Carenini, 2011; Raaijmakers et al., 2008). Subjectivity
detection in other modalities such as video remains mostly unexplored.
224 3 Semantics Processing
Bias detection is a task that is closely related to subjectivity detection (Table 3.22).
It aims to identify biased statements from supposedly impartial articles such as
Wikipedia. Despite its greater complexity, the identification of bias exhibits technical
patterns that are akin to those found in subjectivity detection, e.g., lexicon-based (Re-
casens et al., 2013; Hube and Fetahu, 2018), deep learning (Hube and Fetahu, 2019;
Pryzant et al., 2020), and cross-lingual (Aleksandrova et al., 2019) methods.
Due to its filtering nature, subjectivity detection is widely used as a parser for
many downstream tasks, e.g., sentiment analysis (Pang and Lee, 2004; Barbosa and
Feng, 2010; Kamal, 2013; Soong et al., 2019; Das and Sagnika, 2020), information
retrieval (Zhang et al., 2007; Wiebe and Riloff, 2011), hate speech detection (Gitari
et al., 2015), and QA systems (Stoyanov et al., 2005; Wan and McAuley, 2016). Most
existing works take the pipeline approach, using the filtered results from subjectivity
detection as the input of the target application. On the other hand, we also observe that subjectivity lexicons can be useful features to support hate speech detection and QA systems.
A survey of literature pertaining to subjectivity detection reveals that the progress
made in this research area has not kept pace with the advancements made in its
downstream polarity detection tasks, e.g., sentiment analysis (Gandhi et al., 2023).
This is likely because sentiment analysis may deliver more fine-grained classification
outputs, which helps to gain business insights, e.g., sentiment polarities on product
or service reviews.
However, it should be noted that while positive, negative, and neutral sentiment
polarities represent subsets of subjective texts, there exists a substantial portion of
texts that are objective in nature, presenting factual information. Objective texts are
likely to be infrequent in reviews of products or services, as customers often use such
platforms to express their opinions. However, in the context of opinion mining on
social media, it is crucial to differentiate between subjective and objective statements,
given that even statements with neutral sentiment polarities can be indicative of an
individual’s opinion. Thus, it is still necessary to conduct subjectivity detection
before sentiment analysis.
A sentence may contain several clauses with differing subjectivity. For instance,
a sentence may present two or more opinions, or contain both opinions and factual
information. Therefore, to better assist downstream applications, fine-grained sub-
jectivity detection that identifies the particular opinion-bearing clauses is worthy
of investigation. However, there is limited research on this issue. Benamara et al.
(2011) proposed segment-level subjectivity detection. Wilson et al. (2004) proposed
a method specifically for classifying the subjectivity of deeply nested clauses. There
is scope for additional research to exploit the full potential of the fine-grained sub-
jectivity annotation offered by the MPQA scheme (Wiebe et al., 2005).
While much of the subjectivity detection research has utilized lexical resources
such as subjectivity and affective lexicons to explain the subjective nature of text
based on individual words, these resources do not capture the pragmatic nuances
of words within their contextual environment. This is because the utilized lexical
knowledge is context-independent. Theoretical research has explained subjectivity
from the perspective of pragmatics (Austin, 1975; Lakoff and Johnson, 1980).
It would be useful to distinguish between different types of objectivity, e.g., a
piece of text that is completely factual versus one that contains both positive and
negative sentiments towards the same opinion target and, hence, results in overall
neutrality or ambivalence (Valdivia et al., 2018). Explainable subjectivity detection
could push the development of more linguistics-inspired models that can account for
the complexities of subjectivity and its expression in natural language. Additionally,
there is potential for cross-disciplinary collaboration between linguistics, cognitive
science, and computer science to further advance our understanding of subjectivity
and its detection in various domains.
3.7 Conclusion
Shaping semantic processing tasks into tasks that are more conducive to machine
learning can indeed improve the accuracy of specific tasks. However, improving
accuracy in a single-task setting is not the only pursuit of semantics processing. We
should pay more attention to how semantic processing techniques can better serve
humans and machines to explain language phenomena. We hope that this chapter can
stimulate more research directions in the field of semantics processing and inspire
researchers to place greater emphasis on the nature and cognition of semantics. With
the development of more powerful tools such as PLMs and LLMs, it is perhaps
valuable for our research to use these tools to address those fundamental linguistic
challenges that were previously considered daunting. Regardless of the sophistication
of tasks that can be performed by LLMs, basic semantic processing tasks remain
crucial for comprehending and utilizing language effectively. These tasks serve as
the foundation upon which our understanding of language is built.
A. Reading List
• Xulang Zhang, Rui Mao, Kai He, and Erik Cambria. Neurosymbolic Sentiment
Analysis with Dynamic Word Sense Disambiguation. In: Proceedings of EMNLP,
8772–8783, 2023 (Zhang et al., 2023b)
• Ran Zhou, Xin Li, Ruidan He, Lidong Bing, Erik Cambria, Luo Si, and Chunyan
Miao. MELM: Data Augmentation with Masked Entity Language Modeling for
Low-Resource NER. In: Proceedings of ACL, 2251-2262, 2022 (Zhou et al., 2022)
B. Relevant Videos
• Labcast about Dynamic WSD: [Link]/dRrxVWPBfVo
C. Related Code
• Github repository about Dynamic WSD: [Link]/SenticNet/Dynamic-WSD
D. Exercises
• Exercise 1. Implement the Lesk algorithm on the following sentences to deter-
mine the correct sense of each target word in the given context (use WordNet or
any other lexical database for word definitions and examples): “He went to the
bank to deposit his paycheck”; “She played the bass in the school orchestra”; “The
bark of the tree was rough to the touch”.
• Exercise 2. Design a simple NER algorithm that detects named entities and
their corresponding types in the following sentences: “Apple is looking at buying
U.K. startup for $1 billion”; “San Francisco considers banning sidewalk delivery
robots”; “London is a big city in the United Kingdom”; “Elon Musk founded
SpaceX in 2002”; “Barack Obama was born on August 4, 1961, in Honolulu”.
• Exercise 3. Identify and list polar concepts (lemmatized key words and MWEs
that capture the essence of content polarity) from the following text and then
compare them with the concepts extracted using the concept parsing API
([Link]/api/#concept): “I love affogato because I am a coffee addict and I
like cold desserts: hot espresso on ice cream is such a delightful contrast in tem-
peratures and flavors that I find irresistible”.
• Exercise 4. Write a simple program to detect and resolve anaphoras in the fol-
lowing paragraph: “Roberto decided to bake a cake. He found a recipe online and
gathered all the ingredients. It took him an hour to mix everything and put it in
the oven. While it was baking, he cleaned up the kitchen and prepared some tea.
When the cake was ready, Roberto let it cool before decorating it with frosting.
He was proud of his creation and couldn’t wait to share it with his friends”.
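A possible starting point for Exercise 1 is a simplified Lesk that picks the sense whose gloss overlaps most with the context. In a full solution the glosses would come from WordNet (e.g., via `nltk.wsd.lesk`); the two-sense inventory below is a hand-coded illustrative assumption, and real implementations also remove stopwords before computing overlap.

```python
# Simplified Lesk: choose the sense whose gloss shares the most tokens with
# the context. The hand-coded two-sense inventory for "bank" is a toy
# stand-in for WordNet glosses.

SENSES = {
    "bank": {
        "financial_institution": "an institution where you deposit money or "
                                 "cash a paycheck and take loans",
        "river_bank": "the sloping land beside a body of water such as a river",
    }
}

def simplified_lesk(word: str, context: str) -> str:
    ctx = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(ctx & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simplified_lesk("bank", "he went to the bank to deposit his paycheck"))
# financial_institution
```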
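For Exercise 2, one minimal baseline is gazetteer lookup plus a regular expression for money amounts; the gazetteers, type labels, and regex below are illustrative assumptions (they do not cover dates, and a real solution would use richer rules or a neural tagger).

```python
import re

# Toy rule-based NER sketch: gazetteer lookup assigns entity types, and a
# regex catches money amounts. Gazetteers and patterns are illustrative
# assumptions; dates (e.g., "August 4, 1961") are deliberately not handled.

GPE = {"U.K.", "San Francisco", "London", "United Kingdom", "Honolulu"}
ORG = {"Apple", "SpaceX"}
PERSON = {"Elon Musk", "Barack Obama"}

def tag_entities(sentence: str):
    entities = []
    # longest names first, so multi-word names win over shorter overlaps
    for name in sorted(GPE | ORG | PERSON, key=len, reverse=True):
        if name in sentence:
            etype = "GPE" if name in GPE else "ORG" if name in ORG else "PERSON"
            entities.append((name, etype))
    # money amounts such as "$1 billion"
    for match in re.findall(r"\$\d+(?:\.\d+)?(?:\s(?:billion|million|thousand))?",
                            sentence):
        entities.append((match, "MONEY"))
    return entities

print(tag_entities("Apple is looking at buying U.K. startup for $1 billion"))
```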
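For Exercise 4, a naive recency-based resolver links each pronoun to the most recently mentioned compatible antecedent; the pronoun classes, the known name, and the thing-noun list are illustrative assumptions, and recency alone will err on genuinely ambiguous pronouns (e.g., "it" in "put it in the oven").

```python
# Naive recency-based anaphora resolver: each pronoun is linked to the most
# recent compatible antecedent. The name and noun inventories are toy
# assumptions; real resolvers rely on parsing and coreference models.

MALE_PRONOUNS = {"he", "him", "his"}
NEUTER_PRONOUNS = {"it"}

def resolve(text, person_names=("Roberto",),
            thing_nouns=("cake", "recipe", "oven")):
    resolved, last_person, last_thing = [], None, None
    for raw in text.split():
        word = raw.strip(".,!?")
        token = word.lower()
        if word in person_names:
            last_person = word
        elif token in thing_nouns:
            last_thing = token
        elif token in MALE_PRONOUNS and last_person:
            resolved.append((word, last_person))
        elif token in NEUTER_PRONOUNS and last_thing:
            resolved.append((word, last_thing))
    return resolved

print(resolve("Roberto decided to bake a cake. He found a recipe online."))
# [('He', 'Roberto')]
```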
4.1 Introduction
230 4 Pragmatics Processing
Table 4.1: The surveyed pragmatic processing tasks and their downstream applica-
tions. F denotes that the technique yielded features for a downstream task model; P
denotes that the technique was used as a parser; E denotes that the technique improved
the explainability for a downstream task. MU denotes metaphor understanding. SD
denotes sarcasm detection. PR denotes personality recognition. AE denotes aspect
extraction. PD denotes polarity detection.
Downstream tasks MU SD PR AE PD
Sentiment Analysis P F, P F, P
Knowledge Graph Construction F, E P P
Psychology Analysis F, E P
Machine Translation P
Cognition Analysis P, E
Political Analysis E
Advertising E
Dialogue Systems F
Recommendation Systems F, P E
Social Network Analysis F, P, E
Human-robot Interaction F, P
Financial Prediction F, E
Fig. 4.1: Outline of this chapter. Each subtask is explained in terms of different
technical trends and downstream applications.
Metaphors constitute a form of figurative language wherein one or more words are
employed to convey a meaning distinct from their literal interpretation. In pragmatics,
metaphors are instrumental in conveying abstract concepts, expressing attitudes,
and shaping the communicative intent of utterances. They serve as powerful tools
for conveying meaning by drawing upon familiar, concrete imagery to illuminate
more complex or abstract ideas. The interpretation of metaphors often involves
an understanding of the contextual cues, shared knowledge between speakers, and
the social and cultural background (Ritchie, 2004), making them a rich area of
investigation in pragmatic analysis.
Additionally, metaphors are integral to various aspects of human communication,
including humor, persuasion, and the conveyance of emotions (Charteris-Black,
2009), making their examination essential for a comprehensive understanding of
pragmatic language use. On the other hand, metaphors pose a challenge for nu-
merous NLP applications, such as machine translation and sentiment analysis (Mao
et al., 2023b). The intricacy lies in the complexity faced by machine intelligence in
deducing the real meanings embedded within metaphors (see errors in Fig. 4.2). All
the aforementioned aspects highlight the significance of metaphor understanding in
NLP and pragmatics.
Metaphor identification, positioned as an antecedent task for metaphor under-
standing, has garnered substantial attention within the research community (Mao
and Li, 2021; Tian et al., 2023). This heightened interest is largely attributed to the
creation of an extensive metaphor identification dataset, namely the VU Amsterdam
(VUA) Metaphor Corpus (Steen et al., 2010). Conversely, the landscape of compu-
tational metaphor understanding research is comparatively less developed, primarily
stemming from the scarcity of comprehensive annotated data encompassing diverse
metaphor understanding tasks. In this endeavor, we survey important computational
works in metaphor understanding with the aim of fostering increased engagement and
contributions from researchers in this specialized research domain. This is because
metaphor understanding is a direct gateway to pragmatic understanding, compared
to metaphor identification.
Two distinct facets of metaphor understanding have been studied, namely lin-
guistic metaphor understanding and conceptual metaphor processing. The former
involves the generation of literal texts for the rephrasing or explanation of given
metaphorical expressions from a linguistic standpoint. This category can be further
delineated into three groups, namely property extraction, word-level paraphrasing,
and explanation pairing. Conversely, conceptual metaphor processing endeavors
to establish concept mappings that elucidate the target and source domains of a
metaphorical expression from a conceptual standpoint. The source domain serves as
the origin of metaphorical qualities, providing imagery, attributes, or characteristics
that are subsequently applied to another domain. On the other hand, the target domain
represents the subject of the metaphor, interpreting and incorporating the metaphor-
ical qualities derived from the source domain. This dual framework illuminates the
linguistic and conceptual dimensions integral to a comprehensive understanding of
metaphorical expressions.
Lakoff and Johnson (1980) argued that metaphorical expressions reflect the human
cognitive process in their CMT. They articulated the interpretation of metaphors
by devising concept mappings, elucidating the cognitive mechanisms underlying
metaphors.
(1) I spent three days reading a book.
4.2 Metaphor Understanding 235
Given Example (1), the target concept TIME is metaphorically associated with the source concept MONEY through the verb “spent”. Within the concept mapping of TIME IS MONEY, the attribute of MONEY, specifically its value and limited availability, is metaphorically transferred to the notion of TIME. This metaphor serves to underscore the conceptualization that TIME is regarded as a valuable and finite resource, akin to the attributes associated with MONEY.
(2) She attacked his argument.
Metaphors also impact human behaviors. In Example (2), ARGUMENT is metaphorically projected into a WAR domain. When we view an “argument” through the lens
of “war”, it is easy to associate it with strategies, attacks, and defensive behaviors
common in a war. This perception can lead people to forget the importance of collab-
oration and mutual benefit during an argument, resulting in an aggressive approach
with raised voices to win the argument.
(3) Encountering her miraculously restored my faith in love.
(4) Our relationship has reached a crossroads.
As metaphors serve as reflections of cognitive mechanisms guiding human concep-
tualizations, variations in metaphoric expressions can signify distinct perceptions
of a concept among individuals. For instance, a young couple may conceptualize LOVE as akin to MAGIC, influenced by the enchantment of their initial encounter and falling in love, e.g., Example (3). In contrast, an older couple might adopt the concept mapping of LOVE IS A JOURNEY, embodying the cumulative experiences amassed over the years, e.g., Example (4). Thus, considering the frequent occurrence of metaphors
in everyday communication (Shutova, 2015), metaphors can serve as an important
entrance for studying human cognition.
Landau et al. (2010) investigated the impact of metaphors on shaping social thought
and attitudes, utilizing empirical evidence to illustrate that metaphors play a distinc-
tive role in social information processing. They introduced the metaphoric transfer
strategy and the alternate source strategy as empirical approaches to assess the in-
fluence of conceptual metaphor and embodied simulation on social information pro-
cessing. Their findings reveal that manipulating psychological states associated with
one concept induces metaphor-consistent alterations in how individuals process in-
formation related to seemingly dissimilar concepts. Furthermore, they observed that
metaphors have bidirectional effects on social information processing. Manipulating
abstract social concepts led to changes in perceptions of more concrete concepts. For
example, changes in divinity-related images affected spatial memory biases, and dif-
ferent moods influenced attention shifts. Social exclusion also impacted perceptions,
such as temperature sensations.
The study also discussed the differentiation between conceptual metaphor and
embodied simulation. Landau et al. (2010) posit that while both are cognitive mech-
anisms involving bodily states in processing abstract concepts, they differ in their uti-
lization of these bodily states. Conceptual metaphor operates as an inter-conceptual
mechanism, mapping content and structure between apparently dissimilar concepts.
It draws on representations of bodily states related to a concept in a manner dis-
tinct from how embodied simulations employ such representations. Conceptual
metaphors can incorporate concepts representing common knowledge about bod-
ily states, whereas embodied simulations exclusively involve specific bodily states
occurring during experiences with abstract concepts. In contrast, embodied simulation functions as an intra-conceptual mechanism, utilizing representations of bodily
states associated with a given concept. It employs bodily states to simulate the
experience of the concept, enabling individuals to understand it better.
Shutova and Teufel (2010) presented a practical procedure for concept mapping an-
notations based on MIP. They built source and target concept domain lists containing
a subset of domains from a master metaphor list (MML) (Lakoff, 1994) and novel
domains annotated by themselves. They asked annotators to choose concept domains
for the basic and contextual meanings of single-word verb metaphors.
(5) If he asked her to post a letter or buy some razor blades from the chemist,
she was transported with pleasure.
In Example (5), the basic meaning of the verb transported is “goods being transported/carried somewhere by a vehicle”, while the contextual meaning in this sentence is “a person being transported by a feeling” (Shutova and Teufel, 2010). According to the different agents of the transported action, the corresponding source-target concept mapping is derived. However, this annotation scheme must provide the whole concept lists before the annotation, which is labor-intensive and makes it difficult to incorporate novel metaphors. Mohler et al. (2016) used an annotation scheme similar to that of Shutova and Teufel (2010). In particular, they provided annotators with two source domains produced by their conceptual metaphor mapping system (Mohler et al., 2014).
4.2.3 Datasets
This section introduces existing datasets for metaphor understanding. Table 4.2
summarizes the basic information and statistics of the relevant datasets. Bizzoni and
Lappin (2018) proposed a metaphor understanding dataset1 in a paraphrasing task
format. This dataset contains 200 sentence sets. Each set contains one metaphorical
sentence and four literal sentences with different semantic similarities: one strong
paraphrase, one loose paraphrase, and two non-paraphrases. The strict semantic-similarity criteria meant that the dataset had to be constructed manually and remains relatively small.
Zayed et al. (2020) developed a dataset for metaphor understanding through the
generation of definitions, providing comprehensive interpretations of metaphorical
expressions. The metaphorical expressions are in a verb-direct object form. The
definitions are collected from multiple English dictionaries.
Metaphorical expressions: “release old emotional pain”,
Definition: “to express feelings such as anger or worry in order to get rid of them”,
Source: Oxford.
Liu et al. (2022a) created the Fig-QA2 metaphor interpretation dataset and proposed
an exploration based on the Winograd schema (Levesque et al., 2012) to investigate
the effectiveness of language models in metaphor interpretation. The aim is to prompt
language models to select or generate implications for two metaphorical phrases with
divergent meanings.
1 [Link]
2 [Link]
Fig-QA comprises 10,256 distinct and creative metaphors, along with interpretations crafted by workers from Amazon Mechanical Turk (AMT). The dataset is
structured as pairs of phrases embodying contrasting meanings, along with their
potential interpretations.
Metaphorical sentence 1: “Her word had the strength of titanium.”,
Interpretation 1: “Her promises can be believed.”,
Metaphorical sentence 2: “Her word had the strength of a wine glass.”,
Interpretation 2: “Her promises cannot be trusted.”
Lakoff (1994) first attempted to compile an MML for source and target con-
cept mappings. MML contains 791 nested concept mappings with corresponding
metaphorical expressions. The data were sourced from published books and papers,
student papers, and research seminars. However, the MML has remained unfinished to date. The varying abstractness levels of its concept mappings remain an issue (Shutova
and Teufel, 2010).
Concept mapping: “change is motion (Location)”,
Sub-concept mapping: “stopping being in a state is leaving a location”,
Metaphor expression: “He came out of the coma with little long-term damage.”
3 [Link]
4 [Link]
5 [Link]
'ID': 'trn_1179',
'doc_ID': 'ahc-fragment60',
'sent_ID': '1221',
'sent': 'Relief was at hand .',
'metaphor_index_list': [[0], [2, 3]],
'pos_list': ['Assistance', 'nearby'],
'neg_list': ['Easement', 'Sculptural relief', 'Decrease', 'Ministration', ...],
'lemma': 'relief be at hand .',
'pos_tags': ['NN', 'VBD', 'IN', 'NN', '.'],
'open_class': ['NOUN', 'VERB', 'O', 'NOUN', 'O'],
'genre': 'news'
Useful knowledge bases that can be used in metaphor understanding are WordNet
and ConceptNet (introduced earlier in Section 3.4). Strzalkowski et al. (2013); Li
et al. (2013); Dodge et al. (2015); Gagliano et al. (2016); Mao et al. (2018); Ge
et al. (2022) utilized lexical relations in WordNet, such as synonyms, hyponyms,
and hypernyms to obtain or expand candidate concept words. Mason (2004); Gandy
et al. (2013); Strzalkowski et al. (2013) clustered concepts by WordNet functions,
such as semantic categories and hypernyms. Shutova (2010) utilized WordNet to
disambiguate the sense of concepts. Su et al. (2020) calculated relatedness between
concepts by synonymous extension in WordNet.
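The relation-based candidate expansion described above can be illustrated with a toy relation table; in practice the synonym and hypernym sets would be queried from WordNet (e.g., via `nltk.corpus.wordnet`), so the miniature table below is an illustrative assumption.

```python
# Toy sketch of expanding candidate concept words via lexical relations, in
# the spirit of WordNet-based synonym/hypernym expansion. The miniature
# relation table is an illustrative assumption standing in for WordNet.

SYNONYMS = {"money": {"cash", "currency"}}
HYPERNYMS = {"money": {"resource"}, "cash": {"money"}}

def expand(word: str) -> set:
    candidates = {word}
    candidates |= SYNONYMS.get(word, set())         # add synonyms
    for w in list(candidates):                      # one hop of hypernyms
        candidates |= HYPERNYMS.get(w, set())
    return candidates

print(sorted(expand("money")))  # ['cash', 'currency', 'money', 'resource']
```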
FrameNet (Ruppenhofer et al., 2016) is another useful knowledge base for
metaphor understanding. It serves as an English lexical repository designed for
both human and machine readability. It was established through the annotation of
real-life textual examples illustrating word usage and is grounded in the framework of
frame semantics. The repository encompasses 1,224 frames, where each frame rep-
resents a diagrammatic depiction of a scenario, incorporating diverse elements such
as participants, props, and other conceptual roles. Additionally, FrameNet comprises
13,687 lexical units, encompassing lemmas and their POSs that evoke frames.
Table 4.3: Useful knowledge bases for metaphor understanding. MWD denotes
Merriam Webster’s dictionary.
Since multiple data structures in annotated datasets (see Section 4.2.3) and task
forms (see Section 4.2.6) exist in the metaphor understanding domain, researchers
developed a variety of evaluation metrics.
Accuracy (Shutova, 2010; Mao et al., 2018; Bizzoni and Lappin, 2018) and F1
score (Gandy et al., 2013; Dodge et al., 2015; Ge et al., 2022) are frequently used
automatic evaluation metrics for datasets with golden labels. When the proposed
models can generate a list of possible interpretations, the higher ranks of correct
interpretations show better performance of models. Mean reciprocal rank (MRR) is
a metric traditionally used in QA systems and is appropriate for this task. Shutova
(2010); Song et al. (2020b) applied MRR to evaluate the ranking quality of their
models. Li et al. (2013); Song et al. (2020b) utilized the proportion of the correct
interpretations among the top N predictions (Hits@N) to evaluate the performance
comprehensively.
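The ranking metrics above can be computed as follows; the rankings and gold labels below are toy data for illustration only.

```python
# MRR and Hits@N over ranked interpretation lists: each query contributes the
# reciprocal rank of its first correct answer (0 if absent), and a hit if a
# correct answer appears in the top N. The rankings below are toy data.

def mrr(rankings, gold):
    total = 0.0
    for qid, ranked in rankings.items():
        for rank, answer in enumerate(ranked, start=1):
            if answer in gold[qid]:
                total += 1.0 / rank
                break
    return total / len(rankings)

def hits_at_n(rankings, gold, n):
    hits = sum(1 for qid, ranked in rankings.items()
               if any(a in gold[qid] for a in ranked[:n]))
    return hits / len(rankings)

rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
gold = {"q1": {"b"}, "q2": {"x"}}
print(mrr(rankings, gold))           # (1/2 + 1/1) / 2 = 0.75
print(hits_at_n(rankings, gold, 1))  # only q2 hits at rank 1 -> 0.5
```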
Each possible mapping in the work of Mason (2004) was evaluated regarding
polarity, the number of word collocations instantiating the mapping, and the sys-
tematic co-occurrence. Gagliano et al. (2016) demonstrated the blending effect of
generated concepts by cosine similarity. Bizzoni and Lappin (2018) used correlation
to show the performance of the paraphrase ordering task. Other automatic evaluation
methods involve grounding the evaluation tasks in downstream applications, such
as metaphor identification (Ge et al., 2022), machine translation (Mao et al., 2018),
and sentiment analysis (Mao et al., 2022a), to assess the efficacy of a metaphor
understanding system.
6 [Link]
7 [Link]
Researchers used human evaluation metrics for metaphor understanding tasks with-
out golden labels to measure the performance of outputs from multiple perspectives.
Li et al. (2013) defined “correct Top 3” to denote concepts within the top three po-
sitions of lists that, while not considered golden labels, are recognized as metaphors
by human judges. Gandy et al. (2013) measured the meaningfulness of the gener-
ated concept mappings. Su et al. (2015, 2017, 2020) evaluated the acceptability of
produced interpretations. Rai et al. (2019) chose appropriateness to show the per-
formance of the proposed model. Sometimes the evaluation metrics could be too
abstract to understand. Some researchers proposed detailed questions to help peo-
ple for evaluation more understand the criteria. Strzalkowski et al. (2013) evaluated
metaphoricity, affect, and force (inverse of commonness) by asking, “Q1: To what
degree does the above passage use metaphor to describe the highlighted concept?
Q2: To what degree does this passage convey an idea that is either positive or nega-
tive? Q3: To what degree is it a common way to express this idea?”. Ge et al. (2022)
evaluated the quality of the source and target domains with two questions: “Q1.
Whether the noun is conceptually mapped to the source concept? Q2. Whether the
basic meaning of the noun belong to the target concept?” Mao et al. (2022a) eval-
uated coherence, semantic completeness, and literality of generated interpretations
with three questions: “Q1. Does the paraphrased word semantically and grammat-
ically fit the context? Q2. To what extent does the paraphrased word represent the
contextual meaning of the original target word? Q3. Is the paraphrased word a literal
counterpart of the original target word?”.
4.2.6 Methods
10 [Link]
To improve the learning of the end-to-end metaphor paraphrasing task, Mao et al. (2024b) proposed a new dataset and a metaphor-tailored PLM. Compared to
conventional language modeling methods, e.g., masked word prediction-based and
next word prediction-based methods, Mao et al. (2024b) pretrained a model with an
ALM paradigm. The authors randomly replaced some words, forming anomalous
sentences, to simulate the selectional preference violation of metaphors. Then, they
developed an MTL and contrastive learning framework to learn the anomalous detec-
tion and original word retrieval tasks, simultaneously, at the pretraining stage. Thus,
the learned linguistic patterns can be inherited by the following metaphor identifi-
cation and paraphrasing tasks. The authors found that such a pretraining paradigm
exceeds parameter-size-comparable PLM baselines in the end-to-end metaphor in-
terpretation task.
Mao et al. (2022a) presented a dictionary and rule-based method to identify and
understand metaphorical MWEs. The method set up three rules with dependency
pairing and lemma information to identify MWEs. The interpretation was completed
with the help of semantic similarity and a manually annotated idioms dictionary.
The selected explanation achieved the highest semantic similarity with the source
sentence. The output added an explanation as a clause of the source sentence, which
downstream tasks can directly use. They tested the MWE detection model on Formal
Idioms Corpus11. The proposed method outperformed machine learning baselines
on an unseen case evaluation task.
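The similarity-based explanation selection step can be sketched as follows; bag-of-words cosine over content words stands in for the richer semantic similarity used in the cited work, and the stopword list, example sentence, and candidate explanations are illustrative assumptions.

```python
import math
from collections import Counter

# Sketch of selecting the idiom explanation most semantically similar to the
# source sentence, approximated here with bag-of-words cosine similarity over
# content words. The stopword list and candidate explanations are toy
# assumptions; the cited method uses an annotated idiom dictionary.

STOPWORDS = {"the", "a", "to", "or", "on", "of", "he", "she", "about", "as", "such"}

def content_words(text: str) -> Counter:
    words = [w.strip(".,!?").lower() for w in text.split()]
    return Counter(w for w in words if w and w not in STOPWORDS)

def bow_cosine(a: str, b: str) -> float:
    va, vb = content_words(a), content_words(b)
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * \
           math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def best_explanation(sentence, explanations):
    return max(explanations, key=lambda e: bow_cosine(sentence, e))

sentence = "He spilled the beans about the surprise party."
explanations = ["to reveal secret information about a surprise plan or party",
                "to drop cooked beans on the floor"]
print(best_explanation(sentence, explanations))
```

The selected explanation can then be appended as a clause of the source sentence, as described above, so that downstream tasks consume the literal paraphrase directly.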
A. Feature-based Methods
11 [Link]
Li et al. (2013) proposed the first unsupervised data-driven model for metaphor
identification and concept generation. They prepared a word pair dataset by searching
“like-a” and “is-a” syntactic patterns in a web corpus. They designed a context-
related formula with selectional preference to obtain the implicit source or target
concepts among the candidates from the metaphorical corpus. The testing dataset
contained 83 sentences expressing metaphors by subject-verb or verb-object relations
from Lakoff and Johnson (1980). The baseline was a variant of the proposed model
with a replacement of a metaphor database. They applied two labeling criteria.
The generated concept mappings are marked as “match” if they are the same as or
subsumed by gold labels. If the generated concept mappings are not “match” but are
considered metaphors by at least two annotators, they are marked as “correct”.
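Pattern-based harvesting of the kind described above can be sketched with simple regular expressions; the two patterns below are deliberately simplistic stand-ins for the syntactic "like-a" and "is-a" search used in the cited work, not a reproduction of it.

```python
import re

# Sketch of harvesting candidate concept pairs from "X is a Y" and
# "X ... like a Y" surface patterns over lowercase text. The regexes are
# illustrative assumptions; real systems search parsed corpora.

PATTERNS = [
    re.compile(r"(\w+) is an? (\w+)"),
    re.compile(r"(\w+) (?:is |looks |sounds )?like an? (\w+)"),
]

def extract_pairs(sentence: str):
    pairs = []
    for pattern in PATTERNS:
        pairs.extend(pattern.findall(sentence.lower()))
    return pairs

print(extract_pairs("My lawyer is a shark."))         # [('lawyer', 'shark')]
print(extract_pairs("Her voice sounds like a bell."))
```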
Dodge et al. (2015) put forward a framework termed MetaNet for bridging theory- and corpus-driven approaches to metaphor processing. Together with linguists, they
created a hand-crafted repository. The repository of metaphor expressions and frame-
works included several relations like “subcase of” and “incorporates as a role” as
well as concept domains. Then, they looked for phrases using a list of grammatical
constructs. Through the relational network of the repository and with the aid of
WordNet, FrameNet, and Wiktionary12, the source and target terms were matched
with concept domains in the repository. By using pre-defined syntactic patterns
from language characteristics and knowledge sources, this approach is restricted in
its ability to identify metaphors.
Rosen (2018) proposed a source domain mapping model expressing the inter-
action between a target word and its construction-based context. It extracted mul-
tiple self-defined features related to dependency relationships and fed the features
into a deep neural network classifier. The features were interactions derived from
dependency-parsed inputs, such as target domain x subject and target domain x
object. The evaluation was conducted on LCC (Mohler et al., 2016). However, the
source domains were represented as one-hot vectors in the outputs, leading to some data sparsity. Moreover, the model could only output one of a fixed inventory of 77 known source domains.
B. Embedding-based Methods
Gagliano et al. (2016) applied word2vec to identify the connection between two
words in the semantic space to generate a figurative relationship. Based on word2vec
vectors of a target concept and an attribute, the addition and intersection models
independently derived candidates of a source concept. They designed quantitative
and qualitative analyses to evaluate the generated terms. In the quantitative analysis,
they used the cosine similarity of word2vec vectors to measure the semantic
representations. In the qualitative analysis, AMT workers were asked to select the
final source concepts from a provided list and generate a metaphorical expression.
The results showed that candidate concepts with balanced cosine similarity to the
two input words could boost the effect of integrating the two semantic spaces.
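The addition model above can be sketched with plain NumPy: given vectors for a target concept and an attribute, candidate source concepts are ranked by cosine similarity to the sum of the two vectors. The toy vocabulary and 3-d vectors below are illustrative assumptions, not data from the paper; real systems would use pretrained word2vec embeddings.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two dense word vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def addition_model(target, attribute, vocab):
    """Rank candidate source concepts by similarity to v(target) + v(attribute)."""
    query = vocab[target] + vocab[attribute]
    candidates = [w for w in vocab if w not in (target, attribute)]
    return sorted(candidates, key=lambda w: cosine(query, vocab[w]), reverse=True)

# Toy 3-d "embeddings" for illustration only.
vocab = {
    "moon":   np.array([0.9, 0.1, 0.2]),
    "bright": np.array([0.2, 0.9, 0.1]),
    "lamp":   np.array([0.6, 0.7, 0.1]),
    "cheese": np.array([0.1, 0.1, 0.9]),
}
print(addition_model("moon", "bright", vocab)[0])  # → lamp
```

The intersection model would instead intersect the neighborhoods of the two input vectors; the ranking principle is the same.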
248 4 Pragmatics Processing
4.2 Metaphor Understanding 249
Researchers have discovered that metaphorical expressions are difficult for the
learning of downstream NLP tasks. Since the semantics of metaphorical expressions
differ from those of literal ones, NLP systems are likely to make mistakes while
comprehending them. This is why metaphor understanding is a key subtask of the
NLU suitcase model (Fig. 1.12). In sentiment analysis, for example, the literal
meaning of the phrase “she devoured his novels” is not positive, while its
metaphorical connotation is. Without metaphor understanding, a sentiment classifier
may deliver an inaccurate prediction for such expressions (Mao et al., 2022a). Due
to cultural variations, it can also be difficult for language learners to understand
metaphors: the literal translation of “she devoured his novels” does not make sense
in Chinese (Mao et al., 2018). Therefore, transforming metaphors into their literal
counterparts can enhance the comprehension of metaphors for both humans and
machines.
On the other hand, metaphors also provide an entry point for investigating hu-
man cognition. These psycho-linguistic expressions offer a distinctive window into
the subconscious minds of humans (Goatly, 2007). By their very nature, metaphors
extend beyond literal language, tapping into the rich reservoir of symbolism and im-
agery ingrained in our thoughts and emotions. When individuals employ metaphors,
they tap into a collective repository of shared cultural, social, and personal experi-
ences, unveiling deeper cognitive processes that might not be immediately apparent
in explicit language. One structured way of representing metaphorical cognition
processes is through concept mappings, where the projection from a target to vari-
ous source domains highlights distinctions between different cognitive frameworks.
Concept mappings have been utilized in various traditional diagnostic psychological
assessments, such as the Word-association Test, Thematic Apperception Test, and
Rorschach Test, each employing distinct forms of concept mappings (Rapaport et al.,
1946). Consequently, the understanding of conceptual metaphors holds potential for
downstream research in psychology and cognitive science.
Mao et al. (2018) evaluated the proposed metaphor understanding model on a
machine translation task. The testing dataset contained 50 metaphorical sentences
and 50 literal sentences randomly sampled from Mohammad et al. (2016). The text
identified as metaphorical was paraphrased by the metaphor understanding model
before going into the machine translation system. The results showed that
paraphrasing metaphors into their literal equivalents improved accuracy by 26% for
Google Translate and by 24% for Bing Translate on an English-Chinese metaphor
translation task.
Mao et al. (2022a) assessed MetaPro on a news headline sentiment analysis dataset
(NHSA) from SemEval-2017 Task 5 (Cortis et al., 2017). The findings demonstrated
that MetaPro could boost the performance of state-of-the-art sentiment analysis
classifiers by an average of 4.0% in F1 score. They also demonstrated that the
sentiment classifier gained additional benefits if both the training and test sets were
paraphrased into non-metaphorical text before training and evaluation.
Han et al. (2022a) believed that metaphorical concept mapping features could en-
hance the performance and the explainability of a depression detection task. They
presented an explainable hierarchical attention network (HAN) that could extract
depression-featured tweets and their corresponding concept mappings. Since
individuals implicitly deliver these concept mapping patterns in everyday social
media interactions, the generated concept mappings could provide insight into the
inner reality of a depressed person. MetaPro was employed to generate the concept
mapping features. The evaluation was conducted on a publicly available X dataset called
MDL (Shen et al., 2017). Their experiments showed that introducing the concept
mapping features could increase the accuracy of the depression detection task. The
authors also demonstrated common concept mapping patterns among the depressed
group.
Mao et al. (2023a) leveraged MetaPro to uncover the cognitive patterns of finan-
cial analysts under different market conditions. They examined MetaPro-generated
concepts from 1.4 million financial reports spanning a decade, encompassing 1,000
analysts and 6,000 stocks in the US market. Analyzing these mappings revealed
distinct patterns corresponding to various market conditions. In rising markets,
reports often emphasized a particular set of target concepts through metaphorical
expressions, whereas these concepts were less prominent in turbulent markets.
Conversely, during turbulent conditions, metaphors were frequently used to
elucidate new target concepts. Furthermore, there was a higher frequency of both
optimistic and cautious concept mappings on rising days in comparison to down
days, while on down days report concept mappings tended to be more encouraging
and persuasive.
With MetaPro, Mao et al. (2024d) also analyzed the public perception of four
types of weather disasters, namely floods, hurricanes, tornadoes, and wildfires. They
parsed concept mappings from tweets. By statistical analysis, they found that disaster
management is often likened to a strategic battle, focusing on mental preparedness,
resource assessment, and adaptive planning. Disasters are viewed as dynamic and
transformative, requiring urgent and intense responses. The public emphasizes proac-
tive disaster response, including early warning systems and community resilience.
Metaphor use in the text can reveal fine-grained sentiment and underlying cultural
tendencies. Researchers in political analysis provided a perspective to deconstruct
text by metaphors. Hu and Wang (2021) analyzed the differences and similarities in
using conceptual metaphors in two government reports from China and the United
States. The similarities between the two nations reflected their political systems’
shared acceptance of commonsense: for instance, metaphors from one common
source domain conveyed that achieving the goals would take time. Diverse cultural
roots were mostly responsible for the variances; the US political report, for instance,
used metaphors from a source domain connected to American entertainment.
Mao et al. (2024e) also used MetaPro to analyze the relationship between
metaphor cognition and voting behaviors at the United Nations. They first parsed
metaphorical concept mappings from the debates at the United Nations Security
Council from January 1995 to December 2020. The concept mappings were embed-
ded by GloVe. The annual representations of the metaphorical cognitive pattern of a
country were derived by averaging the concept mapping embeddings from the same
country for a given year. Next, the pairwise consistency of voting between two coun-
tries at the United Nations General Assembly in a given year was measured using
Cohen’s Kappa (Cohen, 1960). The tests conducted on the United States, China, and
Russia revealed moderate correlation coefficients between their metaphorical cogni-
tion and voting consistency. The authors concluded that metaphorical cognition can
impact voting behaviors at the United Nations. They also provided a summary of the
evolution of cognitive patterns over the analysis period.
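The voting-consistency measure used above, Cohen's kappa (Cohen, 1960), can be computed directly from two aligned sequences of votes. The function below is a minimal sketch, and the example vote sequences are invented for illustration; they do not come from the UN data.

```python
from collections import Counter

def cohens_kappa(votes_a, votes_b):
    """Cohen's kappa between two aligned vote sequences, e.g. 'yes'/'no'/
    'abstain' cast by two countries on the same set of resolutions."""
    assert len(votes_a) == len(votes_b) and votes_a
    n = len(votes_a)
    # Observed agreement: fraction of identical votes.
    observed = sum(a == b for a, b in zip(votes_a, votes_b)) / n
    # Expected agreement under independence, from marginal label frequencies.
    freq_a, freq_b = Counter(votes_a), Counter(votes_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

a = ["yes", "yes", "no", "abstain", "yes", "no"]
b = ["yes", "no",  "no", "abstain", "yes", "yes"]
print(round(cohens_kappa(a, b), 3))  # → 0.455
```

A kappa of 1 indicates perfect voting consistency, 0 indicates chance-level agreement, and negative values indicate systematic disagreement.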
4.2.8 Summary
Table 4.4 summarizes the major technical trends by task setup that were introduced
in Section 4.2.6. Property extraction systems normally use unsupervised and
graph learning methods with linguistic features, such as POS tags, knowledge bases,
and word embeddings. However, property extraction methods are only effective
when interpreting nominal metaphors. Generated properties can succinctly describe
the link between source and target nouns, but property-based approaches barely
stretch to metaphors of other parts of speech, such as verbs, adjectives, and adverbs.
Given the complexity of syntax in real-world text, interpretation outputs of the form
“target is property” are of limited value for assisting downstream operations.
Word-level paraphrasing-based approaches can be utilized as a text preprocessing
tool to enhance the capacity to grasp semantic concepts for various downstream NLP
activities such as machine translation and sentiment analysis. Given the absence
of large annotated datasets, previous paraphrasing-based metaphor understanding
systems commonly learn the task by word co-occurrence modeling. To improve
accuracy, they also leverage linguistic features, such as POS tags and the WordNet
knowledge base.
Metaphor understanding models can also be used as parsers for several NLU
applications, including mental health (Han et al., 2022a), finance (Mao et al., 2023a),
international relations (Mao et al., 2024e), and weather disasters (Mao et al.,
2024d). These studies not only highlight the relevance of metaphor research but
also offer practical methodologies, empirically substantiating the utility of concept
mapping in supporting research within psychology and cognition. Finally, Hu and
Wang (2021) examined concept mappings of metaphorical expressions in govern-
ment reports, revealing similarities and differences in political attitudes. This analysis
underscores the critical role of metaphor comprehension in effective communication.
In conclusion, examining cognition through metaphors provides a means to conduct
extensive research with large samples, as this method does not require interviews to
uncover cognitive patterns. Such a noninvasive approach to studying cognition via
data mining techniques can yield more reliable results by mitigating the risk of
subjects not sharing their genuine thoughts during psychological tests.
The investigation pinpointed the left rostro-ventral portion of the IFG, specif-
ically Brodmann’s Area 47, as crucial in integrating discourse context, utterance,
and affective prosody for sarcasm comprehension. This area demonstrated a signifi-
cant interaction effect, underscoring its role in context-dependent perception of the
utterance. Then, the researchers observed functional asymmetry in incongruity pro-
cessing, with the left IFG implicated in integrating statement, context, and prosody,
while the right IFG responded to negative prosody, indicating incongruity detection
between the statement and prosody. Aligning with behavioral findings, the fMRI data
affirmed that incongruity between discourse context and overall utterance meaning
influenced sarcasm perception. Positive prosody with positive semantic content el-
evated the overall positive valence of utterance meaning, while negative prosody
combined with positive semantic content diminished the positive valence, intensify-
ing incongruity and enhancing sarcasm perception. These results elucidate the neural
substrates involved in integrating affective prosody, semantic content, and discourse
context during sarcasm comprehension, with a particular emphasis on the left IFG
as a pivotal region in this cognitive process.
There are two main approaches for annotating datasets for sarcasm detection: man-
ual and weakly supervised annotation. In both approaches, the task is framed as a
binary classification task, specifically, discerning whether a given target is sarcastic
or not. However, sarcastic statements can be further annotated via subtypes, e.g.,
sarcasm, irony, satire, understatement, overstatement, and rhetorical question. Sar-
casm is often hard to discern even by humans and is subjective. Due to such inherent
characteristics, leveraging human supervision is a natural choice for high-quality
sarcasm labeling. Normally, human annotators are not the authors of utterances,
which means that manually annotated datasets capture sarcasm as perceived by
annotators, which is not necessarily the same as that intended by authors (Oprea
and Magdy, 2020). Inter-annotator agreement (Artstein and Poesio, 2008) is often
measured to ensure the consistency of data labeling.
Despite the higher quality of manually curated datasets, the manual labeling
of large-scale, noisy data for sarcasm detection is highly labor-intensive and time-
consuming in comparison to some annotation tasks which can be done without
human supervision such as image annotation (Wallace et al., 2014). Furthermore,
annotating sarcasm is challenging because it requires a thorough understanding of
context and cultural background as discussed in Section 4.3.1. To overcome such
limitations of manual annotation, weakly supervised annotation is used to increase
efficiency and improve the usability of massive amounts of unlabelled data. Most
corpora for sarcasm detection are generated using social media data based on the
hypothesis that texts containing certain indicators express sarcasm.
X and Reddit are common social media platforms for studying sarcasm. For
X, several hashtags, such as #sarcasm, #irony, #sarcastic, #cynicism, and #not, are
leveraged as markers of sarcasm (Kunneman et al., 2015). For Reddit, the marker
“/s” at the end of posts is used to collect sarcastic statements (Khodak et al., 2018).
However, gathering data by hashtags inevitably introduces biases: more subtle
sarcastic expressions without such obvious hints will not be annotated due to the
absence of hashtags.
4.3.3 Datasets
Commonly used sarcasm detection benchmarking datasets are shown in Table 4.6.
Oprea and Magdy (2020) proposed a sarcasm detection dataset, iSarcasm, which
consists of 777 sarcastic and 3,707 non-sarcastic English tweets. To collect intended
sarcasm rather than perceived sarcasm, a subset of X users was asked to provide
one sarcastic and three non-sarcastic tweets. For each sarcastic tweet, an explanation
of why it is sarcastic and a non-sarcastic paraphrase were also collected. Moreover,
a linguist was employed to further categorize each sarcastic tweet into one of the
following seven categories: 1) sarcasm, 2) irony, 3) satire, 4) understatement, 5)
overstatement, 6) rhetorical question, and 7) invalid/unclear.
Category: sarcasm
Text: Thank @user for being so entertaining at the Edinburgh signings! You did not
disappoint! I made my flight so will have plenty time to read @user
Explanation: I went to a book signing and the author berated me for saying I was
lying about heading to Singapore straight after the signing
Rephrased: I would have said ‘here is the proof of my travel, I am mad you
embarrassed me in front of a large audience’!
Category: irony
Text: Staring at the contents of your fridge but never deciding what to eat is a
cool way to diet
Explanation: I wasn’t actually talking about a real diet. I was making fun of how you
never eat anything just staring at the contents of your fridge full of indecision.
Rephrased: I’m always staring at the contents of my fridge and then walking away
with nothing cause I can never decide.
Label: non-sarcasm
Text: You do know west teams play against west teams more than east teams right?
Author: Shbshb906
Subreddit: nba
Comment score: -4
Label: sarcasm
Text: gotta love the teachers who give exams on the day after halloween
Author: DEP61
Subreddit: CFBOffTopic
Comment score: 3
The SemEval-2018 Task 3 dataset (Van Hee et al., 2018) was constructed using both
weakly supervised and manual annotation schemes. After collecting English tweets
using several hashtags related to irony (i.e., #irony, #sarcasm, and #not), the corpus
was further manually labeled by three human annotators. As a result, the dataset
consists of 1,728 tweets with verbal irony realized through a polarity contrast, 267
with other types of verbal irony, 401 with situational irony, and 604 non-ironic tweets.
Label: non-sarcasm
Text: Had no sleep and have got school now
Label: sarcasm
Text: I just love when you test my patience!!
MUStARD (Castro et al., 2019) was manually annotated and includes three
modalities (textual, audio, and visual features) extracted from TV shows. This dataset
consists of 345 sarcastic and 345 non-sarcastic utterances. Each utterance consists
of one or more sentences, and its conversational context is also provided.
Fig. 4.3: A data example from the MUStARD multimodal sarcasm detection dataset,
developed by Castro et al. (2019).
Label: sarcasm
Id: 2662
Text: That would be one hell of a kiss...you would probably drown first though :)
The multimodal sarcasm detection dataset (Cai et al., 2019) includes sarcastic tweets
that were collected using a set of hashtags; non-sarcastic tweets in the dataset do not
contain such hashtags. Each tweet has an associated image. After applying filtering
rules, 10,560 sarcastic and 14,075 non-sarcastic tweets remain in the final dataset.
Fig. 4.4: A data example from the multimodal sarcasm detection dataset, developed
by Cai et al. (2019).
4.3.5 Methods
Yao et al. (2021) aimed to simulate how humans process and interpret sarcasm
given multimodal information. To this end, they proposed a multimodal, binary
sarcasm detection model for X data. It incorporated multiple stacks of gate
mechanisms, guide attention, attention pooling, and multi-hop processing of guide
attention and attention pooling, in the hope that these components could effectively
learn interactions among the tweet text, an embedded image, text in the image, and
the image caption. The multimodal sarcasm detection dataset (Cai et al., 2019) was
used for evaluation. The proposed model was compared with single- and multimodal
models for tweet-level sarcasm detection. This work hypothesized that image
captions help a sarcasm detection model learn context and incongruity. However,
the image captioning model (Xu et al., 2015c) used in the study was not state of the
art, and the rationale for not opting for a more contemporary model was not
substantiated. In fact, the ablation study showed that the effect of the image caption
modality was marginal.
Liang et al. (2022a) argued that not every part of an image contributes equally
to the performance of a multimodal, binary sarcasm detection model. The authors
also believed that learning the affective relationship between key visual and textual
information is useful. Based on these hypotheses, a model based on GCNs and an
attention mechanism was proposed for multimodal sentence-level sarcasm detection.
Given a sentence and the image paired with it, the textual and visual representations
of the inputs were obtained using a pretrained BERT (Devlin et al., 2018) and
Vision Transformer (ViT) (Dosovitskiy et al., 2021). A cross-modal graph was then
constructed in which the nodes were the obtained multimodal representations and
the edges were affective similarity weights calculated using SenticNet. The
adjacency matrix of the cross-modal graph was fed into multi-layer GCNs. To
capture attention information, the simple concatenation of the input representations
was passed through an attention layer. The computed attention weights were
combined with the output of the final GCN layer to form the final representation of
the multimodal input, which was fed into a softmax layer.
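A single GCN layer over such a cross-modal adjacency matrix can be sketched in NumPy as follows. The node features, edge weights, and dimensions are illustrative stand-ins for the BERT/ViT representations and SenticNet-based affective similarities used in the paper, not a reproduction of its architecture.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # symmetric normalization
    return np.maximum(0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# Toy cross-modal graph: 2 textual + 2 visual nodes; edge weights stand in
# for affective similarity scores between node pairs.
A = np.array([[0.0, 0.8, 0.3, 0.0],
              [0.8, 0.0, 0.0, 0.5],
              [0.3, 0.0, 0.0, 0.9],
              [0.0, 0.5, 0.9, 0.0]])
H = np.random.randn(4, 16)   # node features (e.g., BERT / ViT outputs)
W = np.random.randn(16, 8)   # learnable layer weights
out = gcn_layer(A, H, W)
print(out.shape)             # → (4, 8)
```

Stacking several such layers lets each node representation aggregate affective information from increasingly distant textual and visual neighbors.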
Kamal and Abulaish (2022) focused on the binary classification of sarcasm
associated with self-deprecation, based on the hypothesis that comprehending such
sarcasm can be useful for improving marketing strategies. This work assumed that
sarcastic tweets in which authors refer to themselves express self-deprecation. In
addition to six benchmark datasets for tweet-level sarcasm detection, this work
constructed its own corpus using hashtags. Specifically, sarcastic tweets were
collected using “#sarcasm”, and non-sarcastic tweets were collected using “#not”,
“#education”, “#politics”, “#love”, and “#hate”. To select tweets containing
self-reference, filtering methods based on a pre-defined set of regular expressions
and clustering were applied to the seven datasets. In the proposed model, the GloVe
embedding of an input tweet passed through a convolution layer, a BiGRU, and a
sigmoid layer. Experimental results showed that their model outperformed several
deep learning models and two models from existing works on sarcasm detection in
terms of accuracy, precision, recall, and F1 score.
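A self-reference filter of the kind described above can be sketched with a regular expression. The pattern below is a hypothetical simplification; the authors' actual expressions (and the clustering step) are not reproduced in the text.

```python
import re

# Hypothetical first-person pattern; a stand-in for the pre-defined set of
# regular expressions used by Kamal and Abulaish (2022).
SELF_REF = re.compile(r"\b(i|i'm|im|me|my|myself|mine)\b", re.IGNORECASE)

def is_self_referencing(tweet):
    """Heuristically flag tweets whose authors refer to themselves."""
    return bool(SELF_REF.search(tweet))

print(is_self_referencing("I am such a genius, locked my keys in the car"))  # → True
print(is_self_referencing("Politicians never disappoint #sarcasm"))          # → False
```

Sarcastic tweets passing this filter would then be treated as candidate self-deprecating sarcasm.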
The multimodal sarcasm detection dataset (Cai et al., 2019) was used in the
experiments, and the proposed model achieved an F1 score of 0.8292. An ablation
study showed that incorporating inter- and intra-modality incongruity learning
modules is useful for sarcasm detection. However, tweets expressing sarcasm may
lack hashtags altogether, and the presence of hashtags does not guarantee a direct
association with sarcasm. A more thorough analysis of the dataset is required to
argue that the model can actually capture ‘incongruity’ between tweet textual
content and hashtags.
Wu et al. (2021c) were motivated by a hypothesis that sarcasm involves incon-
gruity between a positive word in text and either a negative facial expression or tone.
The authors proposed a multimodal architecture that jointly learns such word-level
incongruities and utterance-level features. It achieved F1 scores of 0.745 and 0.7 on
the speaker-dependent and speaker-independent sets of the MUStARD (Castro et al.,
2019) dataset, respectively. An ablation study showed the effectiveness of word-
level incongruity learning in sarcasm detection. The authors believed that sarcasm
does not involve incongruity between visual and acoustic modalities and that be-
tween a negative word and positive context by definition. Different researchers have
given different definitions of sarcasm and its true meaning is in dispute. Additionally,
sarcasm exists in a wide range of forms and structures (Eke et al., 2020).
Chen et al. (2022) argued that sarcasm involves exaggeration and incongruity.
They proposed a model consisting of two modules for learning these characteristics
of sarcasm. For learning exaggeration, an LSTM with a self-attention mechanism
was applied to an input statement to output a sentiment-aware embedding. For
incongruity, an input sentence was segmented into sequential context chunks, and
the average of the pretrained embeddings of each chunk was fed into a BiLSTM;
the concatenation of the forward and backward hidden states was then obtained.
The output vectors of the two modules were concatenated into a single vector, which
was fed into a softmax layer predicting the label of the input. The ablation study
showed that removing the exaggeration and incongruity modules decreased F1
scores by 2.31% and 2.71%, respectively. Although the authors hypothesized that
their modules learn exaggeration and incongruity in sarcastic statements, it is not
clear how each module is suited to capturing a specific linguistic characteristic of
sarcasm, and thus their method and experimental results do not fully support their
hypothesis.
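The chunk-based incongruity input can be sketched as follows: the sentence is segmented into fixed-size sequential chunks, and each chunk is represented by the average of its token embeddings before being passed to a BiLSTM. The chunk size and the randomly initialized toy embeddings below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def chunk_embeddings(tokens, emb, chunk_size=3):
    """Split a sentence into sequential context chunks and average the
    embedding of each chunk (a sketch of the incongruity-module input)."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    return np.stack([np.mean([emb[t] for t in c], axis=0) for c in chunks])

# Toy 4-d embeddings; real systems would use pretrained vectors.
rng = np.random.default_rng(0)
tokens = "i just love being ignored all day".split()
emb = {w: rng.normal(size=4) for w in tokens}
seq = chunk_embeddings(tokens, emb)
print(seq.shape)  # → (3, 4): three chunks, each a 4-d averaged vector
```

In the full model, this chunk sequence is fed to a BiLSTM whose final forward and backward hidden states are concatenated.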
Yue et al. (2023) proposed a cognition-inspired model, KnowleNet. They argued
that achieving effective multimodal sarcasm detection relies on prior knowledge and
recognizing the semantic contrast between modalities. The identification of seman-
tic disconnection among modalities necessitates the incorporation of commonsense
information, particularly given that implicit expressions of emotion often demand
a nuanced conceptual understanding of words within diverse contexts. Leveraging
conceptual knowledge from ConceptNet, the authors introduced a novel method
for cross-modal semantic similarity detection, aiming to assess the interrelatedness
between visual and textual information. To enhance feature differentiation between
positive and negative samples, contrastive learning was employed, and the visualiza-
tion results provided supporting evidence for the efficacy of this contrastive learning
approach.
Sarcasm detection plays a key role in sentiment analysis. Lunando and Purwarianti
(2013) presented one of the early works that considered sarcasm in the task of
sentiment analysis. The proposed architecture first identified the sentiment label
(i.e., positive, negative, or neutral) for an input text and then performed sarcasm
detection for the texts labeled as “positive”. For the sarcasm detection component,
the model employed two features (i.e., a manually computed topic negativity score
and the number of interjection words) on top of the three linguistic features used for
the first component.
Bouazizi and Ohtsuki (2015) attempted to examine whether sarcasm can improve
sentiment classification on X corpora. To this end, they used hand-crafted features
such as the number of laughter expressions, the number of interjections, and the
presence of common sarcastic expressions. Experimental results showed that adding such features could
improve the classification performance of machine learning models. Yunitasari et al.
(2019) exploited the output of binary sarcasm detection to improve the performance
of sentiment classification for Indonesian tweets. This work assumed that sarcasm
is associated with negative sentiment. Therefore, when a sarcastic statement was
classified as “positive”, the model changed its label to “negative”. For sarcasm de-
tection, the architecture employed hand-crafted features associated with sentiment
and punctuation as well as lexical and syntactic features. Experimental results failed
to prove their hypothesis, which indicates that sarcasm is not necessarily associ-
ated with negative sentiment. More advanced contextual information is required to
comprehend sarcasm.
El Mahdaouy et al. (2021) proposed an MTL model that performs sarcasm and
sentiment classification for the Arabic language. The model was capable of learning
the relationship between the two tasks by incorporating attention layers, while most
existing MTL models for sentiment analysis performed different tasks independently.
The ArSarcasm Shared Task dataset, consisting of 15,548 tweets, was used in the
experiments. The results showed that their model outperformed single-task models
in sarcasm detection in terms of macro-average F1 score, while a single-task
model without attention achieved the best performance for sentiment classification.
As for class-specific F1 scores, the proposed MTL model performed best. Their
findings indicated that utilizing relational information between sarcasm detection and
sentiment classification could boost the classification performance of each task.
Larsen et al. (2016) investigated how Huntington’s disease gene expansion carriers
perform in sarcasm detection, emotion recognition, and theory of mind compared
with healthy controls. This study used sarcasm detection as a measure to assess social-
cognitive impairment. It was assessed using the Social Inference Minimal (SI-M)
test of The Awareness of Social Inference Test (TASIT) (McDonald et al., 2003).
The SI-M test evaluated the understanding of sincere and sarcastic interactions.
The results indicated that the higher the severity of a patient’s illness, the poorer
their performance on sarcasm detection. These findings offered insights into how
sarcasm detection could aid the early detection of mental health disorders and the
tracking of their progression, as well as support patients, clinicians, and health
professionals.
Several studies and research articles have reported that exposure to cyberbully-
ing is associated with several mental disorders such as depression, self-harm, and
anxiety (Schodt et al., 2021; Skilbred-Fjeld et al., 2020; Maurya et al., 2022). Chia
et al. (2021) applied a machine learning model trained on a sarcasm detection dataset
to a cyberbullying detection dataset to examine the extent to which sarcasm detec-
tion could be useful for cyberbullying detection. Experimental results showed that
a sarcasm detection model surpassed the performance of a model trained on a cy-
berbullying detection dataset, as evidenced by a higher F1 score. This indicates that
sarcastic expressions are commonly used in instances of cyberbullying. It highlights
the importance of developing better tools for detecting sarcasm, as this could signif-
icantly improve the identification and understanding of online cyberbullying cases.
Enhanced sarcasm detection could help in accurately identifying harmful content
and protecting individuals from online abuse.
The COVID-19 pandemic has affected people’s mental health globally. Rother-
mich et al. (2021) studied changes in the usage of humor and sarcasm among 661
adults (164 with mental disorders and 497 asymptomatic individuals) during the
pandemic via self-report questionnaires. Experimental subjects were asked how their
usage of sarcasm and humor had changed in the past month. Chi-square tests were
used to evaluate relationships between the frequency of change in the usage of humor
and sarcasm and the anxiety/depression severity of the subjects. The results indicated
that individuals with depression tended to use humor more frequently, while those
with anxiety showed no significant change in their usage of humor.
As for sarcasm, individuals with mental disorders used more sarcasm than the
asymptomatic ones did. The authors reported that it should be further investigated
whether the motivation behind the increase in the usage of sarcasm is associated with
self-enhancement or self-deprecation. The findings and future research directions
described in this study offered useful insights into a better understanding of coping
strategies of individuals with mental disorders during crises, which can potentially
benefit patients and health professionals.
Some studies attempted to apply sarcasm detection in other domains apart from
sentiment analysis and mental health. Although these areas have yet to be explored
extensively, they have the potential to benefit businesses and human well-being.
Danielyan (2022) studied the usage and impact of sarcasm in advertising campaigns
from a pragmalinguistic perspective. This work reported that sarcastic expressions in
advertisements drew people’s attention and enhanced user engagement, thereby help-
ing companies gain a positive brand reputation. The paper acknowledges potential
drawbacks and adverse effects associated with the use of sarcasm in advertising. One
concern is the risk of misinterpretation or misunderstanding by the target audience,
potentially resulting in confusion or negative reactions. Furthermore, the caustic or
bitter tone of sarcasm can be perceived negatively by the audience. The paper also
highlights ongoing debates among scholars and linguists regarding the appropriate-
ness of employing sarcastic language in social contexts, with some asserting that it
may have toxic or detrimental effects on relationships.
Zhou et al. (2021b) proposed a dialogue system that generates sarcasm-aware
responses. The authors argued that sarcasm-aware dialogue systems are essential
to avoid misunderstandings and to enable more human-like interactions. The
proposed model first detected sarcasm from text and speech signals using existing
sentiment analysis tools. Next, the output of the sarcasm detection module was
encoded as a two-dimensional binary vector, in which the first and second
components respectively represented whether the input utterance is sarcastic and
whether the expected response is sarcastic. This sarcasm-aware representation was
concatenated with other types of input embeddings, and the joint representation was
fed into a deep learning model that generated a response for the input. For
evaluation, the SARC (Khodak et al., 2018) dataset was used, along with BLEU
(Papineni et al., 2002), which was originally designed to measure the similarity
between machine-translated text and reference translations; here, it served as an
indicator of the adequacy and fluency of the natural language generation model.
Some examples of sarcasm-aware responses generated by the proposed model were
given in this work, and the authors claimed that their model is capable of generating
more interesting dialogues.
4.3.7 Summary
Sarcasm detection is an active research area and has great potential to improve the
performance of computational methods for a wide range of NLU tasks. Findings
of the theoretical research into sarcasm provide useful clues into the computational
learning and modeling of sarcasm from the perspectives of cognition, communicative
behavior, and neural mechanisms. For example, several studies emphasized that
the understanding of contextual information and the identification of contradiction
between the literal/figurative meaning of a statement and observational evidence are
the keys to the successful interpretation of sarcasm.
4.3 Sarcasm Detection 271
The latest computational research actively takes such signals into the learning
process of sarcasm, leading to the development of models equipped with diverse
modality learning capabilities. We also reviewed two main approaches for annotating
datasets for sarcasm detection. Manual annotation is a natural choice for constructing high-quality corpora, although labor and time costs make it difficult to build large-scale datasets this way. Some researchers have therefore turned to data labeling approaches that demand minimal human supervision, adopting weakly supervised annotation methods. These methods typically generate annotations through sets of pre-defined keywords, such as hashtags on X, deemed relevant to sarcasm.
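The weakly supervised hashtag-based labeling described above can be sketched as follows; the keyword set and the cleanup rule are illustrative assumptions, not a reproduction of any specific study's protocol.

```python
# Weak supervision via pre-defined hashtags: a post is labeled sarcastic if it
# carries an indicative hashtag, and the labeling hashtags are then stripped
# so that models cannot trivially exploit them. Keyword set is hypothetical.
import re

SARCASM_TAGS = {"#sarcasm", "#sarcastic", "#irony"}

def weak_label(post: str) -> int:
    """Return 1 if the post carries a sarcasm-indicative hashtag, else 0."""
    tags = {t.lower() for t in re.findall(r"#\w+", post)}
    return int(bool(tags & SARCASM_TAGS))

def strip_label_tags(post: str) -> str:
    """Remove the labeling hashtags; longest tags first to avoid partial hits."""
    pattern = "|".join(re.escape(t)
                       for t in sorted(SARCASM_TAGS, key=len, reverse=True))
    return re.sub(pattern, "", post, flags=re.IGNORECASE).strip()
```

Studies using this style of annotation typically also filter out posts where the hashtag is integral to the sentence rather than a trailing label.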
Table 4.7: Technical trends in sarcasm detection. Feat., emb., SI, and cont. denote
feature-, embedding-, semantic incongruity- and context-based trends, respectively.
Bouazizi and Ohtsuki (2015) claimed that some hand-crafted features are associated with sarcasm and that adding such features improved the performance of their model for sentiment analysis. Yunitasari et al. (2019) studied sentiment classification based on the assumption that sarcasm carries negative sentiment. However, their experimental results did not support this assumption. El Mahdaouy et al. (2021) investigated how MTL on both sentiment and sarcasm could improve the performance of each task. Their method did not presuppose particular properties of sarcasm and was hence more generalizable than the two studies introduced earlier.
Some studies explored the correlation between sarcasm and mental health conditions. In this context, sarcasm identification serves as a screening mechanism, enabling subsequent mental health analysis based on recognized patterns in sarcastic expressions. Larsen et al. (2016) investigated the cognitive functioning of patients with Huntington's disease via sarcasm detection and observed that patients lost their ability to understand and interpret sarcasm as the disease progressed. This study did not exploit automatic, computational sarcasm detection. Chia et al. (2021) studied the usage of sarcasm in cyberbullying and reported that sarcastic expressions were commonly used in cyberharassment. Rothermich et al. (2021) researched the impact of the global COVID-19 pandemic on people's mental health by statistically analyzing changes in the usage of humor and sarcasm, reporting that people with different mental disorders exhibited different patterns in their usage of humor and sarcasm.
Furthermore, sarcasm identification holds promise for enhancing explainability in advertising and improving efficacy in dialogue systems. Danielyan (2022) reported that sarcasm could be an effective tool in advertising campaigns, as it helped to draw people's attention and increase customer engagement; in this context, the use of sarcastic language serves as a tool for explaining customer behaviors. Zhou et al. (2021b) proposed a sarcasm-aware, generative dialogue system that employs the identified sarcasm label as a signal prompting the system to generate appropriate responses to sarcastic expressions. Although this study shed light on the importance of sarcasm understanding in dialogue systems, the results generated by the proposed model showed that there was considerable room for improvement.
Research into automatic sarcasm detection has gained popularity due to its impor-
tance in our daily communication. Despite numerous efforts to comprehend sarcasm and make computers understand and use it in different disciplines, several challenges have yet to be tackled. Firstly, understanding sarcasm requires diverse
contextual factors. Several studies in psycholinguistics, pragmatics, and neuroscience
have delved into how context could help humans perceive, process, and comprehend
sarcasm. Ensuring the context awareness and generalizability of sarcasm detection
methods is crucial. Computational research in automatic sarcasm detection usually employs sarcastic utterances obtained from social media (e.g., X), which by nature provide limited context due to their short length. To overcome this limitation, it is recommended to investigate sarcasm detection using multiple modalities, broader contexts, and external knowledge. Employing multimodal data enables the detection of incongruities beyond text, while a broader context, such as shared long-term experiences and cultural background, offers clues for identifying inconsistent information in sarcastic expressions. Additionally, incorporating external knowledge,
such as commonsense, provides information not directly obtainable from sarcasm
datasets. Moreover, future research should further investigate how to generate and
leverage effective representations of a wide range of contextual factors.
Secondly, there is a need to create multimodal sarcasm detection datasets with
separate sarcasm annotations for distinct modalities. Current studies in multimodal
sarcasm detection commonly utilize the dataset from the work of Cai et al. (2019),
where a unified sarcasm label is assigned to a text-image pair for multimodal com-
puting. While this label is reasonable for multimodal analysis, it poses challenges in
ablation studies when text-only or image-only inputs share the same label derived
from the multimodal context. This is problematic because some sarcastic expressions are labeled based on the combination of text and image, whereas the text or image alone may not be sarcastic. As illustrated in the example on the right in Fig. 4.4, the combination of an image of rain and the statement “what wonderful weather” might be labeled as sarcastic, even though the sarcasm is not evident in either modality alone. Therefore, it is unreliable to assert that a multimodal model outperforms
a unimodal model if sarcasm labels annotated for multimodal data are employed for
unimodal evaluation. To address this issue, a multimodal dataset with independent
annotations for different modalities is essential in the sarcasm detection research
community.
Thirdly, existing studies on sarcasm detection (Liu et al., 2014; Meriem et al.,
2021; Mishra et al., 2016) mentioned that sarcasm identification could benefit several
real-world applications such as marketing, advertising, opinion mining, dialogue sys-
tems, and review summarization. Although several researchers have acknowledged
the importance of sarcasm detection in downstream tasks, practical applications of
the task have not been widely studied. The complexity inherent in detecting sarcasm
across diverse domains contributes to the challenges associated with this task. This
heightened difficulty in sarcasm recognition poses additional obstacles to its effective
application in downstream tasks.
4.4 Personality Recognition 275
One of the most famous personality factor models, the Big Five, is an inter-individual
personality structure which originated from the five factor model (FFM) (Norman,
1963). Goldberg (1981) designated the FFM as the Big Five, based on the research
of Wiggins (1968) who dubbed “Extraversion (E)” and “Neuroticism (N)” the Big
Two. Costa Jr (1980) added a dimension “Openness to Experience (O)” and later
in 1985 and 1989, they proposed measure scales of “Agreeableness (A)” and “Con-
scientiousness (C)”. Finally, the Big Five dimensions, shown in Table 4.9, were
compiled, showing the putative assignment of standard personality factors or scales.
The descriptions of the binary values (Yes / No) utilize the Big Five trait definition
from the works of McCrae and John (1992). Most personality traits are measured
from two fundamental sources: self-reports and observer ratings. Using both in-
struments, i.e., questionnaires and trait adjectives (lexical approach), in self-reports
or observer ratings originates in the 1980s, of which has been justified the feasi-
bility by a series of research papers (Norman, 1969). Remarkably, all five factors
were shown to have discriminant and convergent validity across personality instru-
ments and observers (McCrae and Costa, 1985; Goldberg, 1989), enduring across
decades in adults (McCrae and Costa, 1990). These studies justified empirically the
correspondences between the named factors in the two traditional sources.
Personality assessment research has extensively utilized questionnaire instru-
ments based on the five factors. These questionnaires involve individuals providing
self-ratings or observers rating the target person on a Likert-type scale, ranging, for
example, from -2 to 2 or from 1 to 5, to gauge the degree of agreement, from “disagree strongly” to “agree strongly”. The most popular questionnaires include the Ten-Item Personality Inventory (TIPI, 10 items) (Gosling et al., 2003), the Big-Five Inventory
(BFI, 44 items) (John et al., 1991), the NEO Five-Factor Inventory (NEO-FFI, 60
items) (McCrae and Costa Jr, 2004), and the NEO-Personality-Inventory Revised
(NEO-PI-R, 240 items) (Costa Jr and McCrae, 1995). The first-person questionnaires
(e.g., “I tend to find fault with others”) yield self-reports, i.e., people’s identities, while the third-person questionnaires (e.g., “this person tends to find fault with others”) yield observer ratings, i.e., people’s reputations. The quantitative measures
of personality traits set the stage for representing personality in a numerical model.
Table 4.9: Big Five personality traits model (also known as OCEAN).

Factor                   Yes                          No
Openness (O)             innovative and curious       dogmatic and cautious
Conscientiousness (C)    organized and professional   careless and sloppy
Extraversion (E)         active and sociable          reserved and solitary
Agreeableness (A)        trustworthy and modest       unreliable and boastful
Neuroticism (N)          sensitive and anxious        secure and confident
Another popular personality typology was developed by Myers et al. (1998), termed
MBTI. MBTI is rooted in personality typology (Jung and Beebe, 2016) and
was formulated to assess individuals’ psychological preferences in cognition and
decision-making. It classifies people into 16 distinct personality types, determined
by their preferences across four dichotomous dimensions: Extraversion-Introversion
(E-I), Sensing-Intuition (S-N), Thinking-Feeling (T-F), and Judging-Perceiving (J-
P) (see Table 4.10 for details). The primary objective of developing MBTI during World
War II was to create a pragmatic tool facilitating self-awareness and interpersonal
understanding, particularly in realms such as career planning, team dynamics, and
personal development. Over time, MBTI gained prominence in organizational and
industrial psychology (Cohen et al., 2013), as well as career counseling (Savickas,
2019).
However, MBTI also encountered skepticism from scholars and psychometric
experts who question its theoretical underpinnings and measurement robustness.
Pittenger (1993) argued that MBTI displayed correlations with other personality
measures, suggesting challenges in maintaining its distinct assessment qualities.
Advocates of MBTI’s validity typically cited findings linking one of its four scales to
specific behaviors. Nevertheless, the available data did not substantiate the assertion
that MBTI effectively measures distinct personality types or that the delineation of
the 16 types is crucial for a comprehensive understanding of personality.
Table 4.10: The four dichotomies of the MBTI model (Myers et al., 1998).
Extraversion-Introversion Dichotomy
(attitudes or orientations of energy)
To obtain personality trait annotations, the composers or speakers of the corpus (e.g., essays, conversations, etc.) are scored by themselves or by other observers according to given personality questionnaire instruments. As mentioned in section [Link], there are a large number of questionnaires, such as TIPI, BFI, NEO-FFI, and NEO-PI-R. Since TIPI and BFI contain relatively few items and all the questionnaires share a similar question format, the detailed questionnaires of TIPI (Table 4.13) and BFI (Table 4.11) are taken as examples in this section. For each question in BFI, the participant selects a number ranging from 1 to 5 to indicate agreement (from disagree strongly to agree strongly) with that statement. Then, the resulting category for each Big Five dimension is obtained through the scoring of each dimension of the personality traits in Table 4.12. In TIPI, the participant writes a number ranging from 1 to 7 to indicate the extent to which he or she agrees with the statement. The continuous decimal value of each Big Five dimension is obtained according to the scoring rule in Table 4.14.
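As an illustration of the TIPI scoring procedure, the following sketch implements the commonly used rule from Gosling et al. (2003): each dimension averages one directly scored item and one reverse-scored item (a response r on the 1-7 scale reverses to 8 - r). Note that TIPI scores Emotional Stability, the reverse pole of Neuroticism; the item pairings below are the standard published ones, and Table 4.14 remains the authoritative reference.

```python
# TIPI scoring sketch: each Big Five dimension is the mean of one directly
# scored item and one reverse-scored item (reverse = 8 - response).
PAIRS = {                       # trait: (direct item, reverse-scored item)
    "Extraversion":        (1, 6),
    "Agreeableness":       (7, 2),
    "Conscientiousness":   (3, 8),
    "Emotional Stability": (9, 4),
    "Openness":            (5, 10),
}

def score_tipi(responses: dict[int, int]) -> dict[str, float]:
    """responses maps item number (1-10) to a rating on the 1-7 scale."""
    return {
        trait: (responses[direct] + (8 - responses[reverse])) / 2
        for trait, (direct, reverse) in PAIRS.items()
    }

scores = score_tipi({i: 4 for i in range(1, 11)})
# a uniform "4" respondent scores 4.0 on every dimension, since 8 - 4 = 4
```

The BFI follows the same principle on its 1-5 scale, averaging more items per dimension after reversing its reverse-keyed statements.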
Table 4.11: BFI. Participants are required to write a number next to each statement to
show the degree (1 disagree strongly, 2 disagree a little, 3 neither agree nor disagree,
4 agree a little, 5 agree strongly) to which they agree or disagree with that statement.
Table 4.13: TIPI. Participants are required to write a number next to each statement
to show the degree (from 1 to 7) to which they agree or disagree with that statement.
4.4.3 Datasets
Many datasets have been constructed for personality recognition based on OCEAN and MBTI (Table 4.15). The Essay dataset (Pennebaker and King, 1999) contains 2,479 stream-of-consciousness essays written by psychology students who were asked to write down whatever they thought of within 20 minutes. The binary personality labels of each essay writer were obtained by asking the students to fill in the BFI questionnaire. The following snippet shows an annotated essay sample.
text: “I haven’t chatted on MIRC for a long while. Maybe I’ll do that at work, it should be fun. Hmmm, I’m just thinking
like random thoughts, about nothing specific. My gyming is going fine, hope I get a good looking chest/body over the next
month or so, not that I need to show it off or something, just because this way the clothes i wear will I guess look
better on me... I think I’m going to go to my job kind of early and print out some slides for the CS class quiz it should
be easy. and about this whole Coop thing I think I should go check out the companies but not really apply apply I don’t
want to as yet I think I can make my GPA higher and then apply what do you say well I can’t there’s no one really who
The YouTube dataset (Biel et al., 2013) was built from 408 randomly selected YouTube vlogs, with a set of personality traits for each vlogger. The content of the vlogs relates to books, movies, politics, personal issues, etc. The personality traits are Big Five scores annotated by crowdsourcing workers. The dataset is balanced in gender, with 211 female (52%) and 197 male (48%) vloggers. In total, it contains 243k word tokens, 10k unique tokens, and 30h of audio.
text: “- And here is. I don’t know if you can hear me or not, probably or hopefully because this music is really loud.
countless time to upload the same videos and for some reason it’s just not being
cooperative. I don’t know, but um, I’ve been trying for a week to upload the ones for local bloggers that I did. Sorry
guys, um, my promotion video might be up, might not. I don’t know, well well I mean, like, it might be up eventually.
Um, yeah, ah, right now I’m in North Carolina. Ah, where are we? What town are we in?
- North Carolina?
- We were just at some first basketball game, uh, where we played a song and stuff. But um, I just wanted to apologize.
friends that are in this room. The other ones are downstairs getting our pizza. So,
The SPADE dataset (Kerz et al., 2022) consists of 20h of speech from 220 individuals, of whom 129 are male, 88 are female, and 3 are diverse. The transcribed speech, 848,827 words in total, was obtained via the AMT crowdsourcing platform. The personality traits were collected through the BFI questionnaire.
The Kaggle MBTI dataset23 was gathered from the PersonalityCafe forum and contains over 8,600 instances. Each instance includes the last 50 posts of a user together with the user's MBTI type.
Personality: ENFJ
Post: That sounds like a beautiful relationship already. But don’t push too fast as it
seems she needs to be appreciated as a friend. But at the end of the day you can ...
4.4.5 Methods
23 [Link]
The content words refer to those depicting an action or object, e.g., verbs, adjectives, and nouns. This means that how, rather than what, people communicate more fundamentally reflects their psychological state. However, the approaches based on LIWC had some limitations. They were constrained by their inherent computation mechanism, i.e., statistical features based on word counts, and failed to capture implicit contextual elements such as irony or sarcasm (Pennebaker et al., 2003). Additionally, these methods only explore and analyze the correlations between LIWC features and personality, instead of automatically recognizing personality traits from texts (Štajner and Yenikent, 2020).
They used a list of features including LIWC, MRC (Coltheart, 1981), utterance features based on the speech act categories of Walker and Whittaker (1990), and prosodic features such as voice pitch extracted with Praat (Boersma, 2001). The experimental results led to several interesting conclusions: 1) the reported correlations between features suggested by the psycholinguistic literature and personality ratings are generally weak; 2) feature selection is of vital importance; 3) more complex models (e.g., SVMs and boosting algorithms) tend to perform better on a large corpus than simpler ones (e.g., Naïve Bayes or regression trees); 4) ranking models outperformed classifiers in the personality recognition task, which can be attributed to the fact that personality traits vary continuously among the population; 5) observer ratings can be captured more precisely than self-report scores.
The dramatic performance improvements of deep learning models relaxed the constraints of feature selection in personality recognition, i.e., from predefined features suggested by the psycholinguistic literature to more dimensions, longer input sequences, or multimodal features. Yuan et al. (2018) employed CNNs to automatically extract features from textual content, in addition to LIWC, on the myPersonality dataset (Stillwell and Kosinski, 2015). They discovered that the preference for emotion words, social process words, positive words, or negative words varies among different personality traits. This is almost contradictory to the traditional theories mentioned above about the role of function words in personality recognition. It is also notable that myPersonality is based on users' Facebook homepages, i.e., social media content. This reflects the trend of personality recognition from social media data as well as the use of more powerful deep learning methods.
The aforementioned studies showed that social media can provide a good source
of data for personality recognition by mitigating the biases coming from ques-
tionnaires and predefined psycholinguistic features, as it contains fine-grained and
natural descriptions of an individual’s personality. The development of deep learning
methods exploited the potential of social media data and made it possible to capture
information directly from the input sequences, such as n-gram features, besides the
suggested features from the psycholinguistic literature. Majumder et al. (2017) proposed a novel CNN-based deep learning model for personality recognition. They used a CNN feature extractor to obtain n-gram sentence representations from the input sentences, in which each word is vectorized via pretrained word2vec embeddings. The obtained vectors are concatenated with the Mairesse features (Mairesse et al., 2007) to form a document-level representation, which is fed to a fully connected layer for classification. Experimental results indicated that there was no performance improvement when an SVM processed the features extracted by the CNN compared with a final fully connected layer. This suggested the potential of deep learning models in personality recognition tasks and the research direction of applying LSTM recurrent networks to personality recognition.
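The representation flow just described (n-gram convolution over word embeddings, max pooling, concatenation with Mairesse features, and a final linear classifier) can be sketched in NumPy; all dimensions, filter counts, and random parameters below are illustrative stand-ins, not those of the original model.

```python
# CNN-as-n-gram-extractor sketch: filters of width k respond to k-grams of
# word embeddings; ReLU + max pooling yields a fixed-size vector, which is
# concatenated with document-level stylistic (Mairesse-style) features.
import numpy as np

rng = np.random.default_rng(0)

def ngram_conv_max(word_embs, filters):
    """word_embs: (seq_len, emb_dim); filters: (n_filters, k, emb_dim)."""
    n_filters, k, emb_dim = filters.shape
    seq_len = word_embs.shape[0]
    windows = np.stack([word_embs[i:i + k] for i in range(seq_len - k + 1)])
    fmap = np.einsum("pke,fke->pf", windows, filters)   # (positions, filters)
    return np.maximum(fmap, 0).max(axis=0)              # ReLU + max pooling

emb_dim, n_filters = 50, 8
word_embs = rng.normal(size=(30, emb_dim))   # one document's word vectors
mairesse = rng.normal(size=84)               # stand-in for Mairesse features
pooled = np.concatenate([
    ngram_conv_max(word_embs, rng.normal(size=(n_filters, k, emb_dim)))
    for k in (1, 2, 3)])                     # uni-, bi-, and tri-gram filters
doc_repr = np.concatenate([pooled, mairesse])   # CNN features + Mairesse
W = rng.normal(size=(5, doc_repr.size))         # final fully connected layer
logits = W @ doc_repr                           # one logit per Big Five trait
```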
Su et al. (2016) employed an RNN fed with LIWC-based and grammatical features for personality recognition in conversations. However, both works relied heavily on domain expertise, as they took predefined linguistic features as inputs and could not exploit the semantic information in the sentences. In the same year in which Majumder et al. (2017) presented the CNN-based method, Liu et al. (2017a) developed a hierarchical model, Character to Word to Sentence for Personality Traits (C2W2S4PT), to obtain vector representations of words and sentences from the character and word levels. C2W2S4PT obtained better performance than previous feature-engineering-heavy works and machine learning methods such as SVM regression and random forests. These works inspired researchers to focus more on exploring the capability of deep learning models with engineering-free features in personality recognition.
Sun et al. (2018) proposed to exploit the structures of texts to detect users’
personalities by developing a hybrid model, i.e., BiLSTMs concatenated with CNN
(2CLSTM). In this work, BiLSTMs are used to encode the context information,
and CNN is utilized to capture the structural feature named latent sentence group
(LSG) generated from BiLSTMs. According to the research of Ramezani et al. (2022), recurrent classifiers, e.g., RNN-, LSTM-, and BiLSTM-based classifiers, outperformed CNN-based classifiers. They attributed the results to the fact that the inputs are sentences in the form of sequences, and recurrent classifiers are better at processing temporal information, i.e., exploiting the interactions among the words in a sentence. BiLSTM was superior to the other recurrent classifiers because it remembers information over several timesteps better than an RNN and attends to future information more than an LSTM, which explains why 2CLSTM employed BiLSTM for context encoding. Additionally, CNN-based classifiers were believed to be good at feature extraction, which supports the use of a CNN to capture structural information in the work of Sun et al. (2018).
Recent studies showed that researchers are no longer content with only semantic and structural information extraction using lightweight models. They started to pay more attention to incorporating external knowledge graphs or LLMs into personality classifiers. Kazemeini et al. (2021) investigated two approaches, a siamese BiLSTM and Sentence-BERT (Reimers and Gurevych, 2019), for distinguishing personality types. Both approaches used a siamese structure, as they were intended to classify the personality trait type in an interpretable way, capturing the semantics of statements so that statements with similar personality traits could be embedded close to each other in the semantic vector space. Poria et al. (2013) utilized an SMO classifier, combining commonsense knowledge-based features with psycholinguistic and frequency-based features, for personality recognition. The proposed method surpassed other state-of-the-art methods based on psycholinguistic features and frequency analysis. The commonsense knowledge mainly came from resources such as ConceptNet and SenticNet. However, each of these methods typically ignored some of the external knowledge, conventional psycholinguistic features, semantic information, or LLM features, and thus failed to fuse them all.
Mehta et al. (2020) developed a BERT-based model integrating traditional psy-
cholinguistic features, commonsense knowledge and language model embeddings for
personality prediction. The psycholinguistic features in this work included Mairesse
features and Readability. In their fine-tuning setup, they experimented with logistic
regression, SVM, and a multi-layer perceptron (MLP) with 50 hidden units. Experimental results showed that the language modeling features outperformed conventional psycholinguistic features combined with commonsense knowledge features. Ren et al. (2021b) proposed a multi-label model integrating the pretrained BERT model with a neural network, incorporating semantic and emotional features for personality recognition from social media texts. This work provided several significant implications: 1) external emotion knowledge improved the effectiveness of the personality recognition model; 2) emotional information was found to be closely related to personality traits and could be applied to further explaining personality traits; 3) LLMs such as BERT could extract more contextual information than word embedding methods like word2vec or GloVe (Pennington et al., 2014), and should be the basis for future personality recognition with deep learning. These studies suggested that social media data is a good resource for assessing individual personalities more objectively than traditional questionnaires (Stachl et al., 2020).
Compared to personality recognition with the Big Five personality traits, which has been studied on multiple types of texts and with various approaches, personality recognition with MBTI did not attract researchers' attention until recently and has relied almost exclusively on X data.
MBTI-based studies profiled X users from their tweets, with each personality dimension modeled separately as a binary classification. Plank and Hovy (2015) annotated a corpus of 1.2M English tweets with gender and MBTI type. They trained a binary logistic regression classifier for each MBTI dimension (I-E, S-N, T-F, J-P), with features including gender, n-grams, and count-based meta-features (e.g., the number of tweets, favorites, and followers). The models outperformed the majority-class baseline on only two personality dimensions, Thinking / Feeling and Introversion / Extroversion. Verhoeven et al. (2016) performed MBTI-based personality recognition from tweets in six European languages with n-gram features; the models obtained F1 scores from 0.47 to 0.79. Yamada et al. (2019) built a model for MBTI personality recognition from X posts in Japanese and behavioral features of X users. Experimental results indicated that user behaviors are crucial for recognizing the personality of users who have no posts, but their importance decreased as the number of collected user posts increased. Although all these studies were performed on large training datasets, they barely managed to surpass the majority-based or even random-guessing baselines.
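The per-dimension setup described above can be sketched as follows: the four-letter MBTI type is decomposed into four independent binary labels, and simple word n-grams serve as features for four separate classifiers. The decomposition and the n-gram featurizer below are illustrative, not from any cited study.

```python
# MBTI as four binary tasks: each dimension's first letter is encoded as 1,
# its second letter as 0, so one four-letter type yields four labels, each
# handled by its own classifier (e.g., logistic regression over n-grams).
DIMENSIONS = ["EI", "SN", "TF", "JP"]

def mbti_to_labels(mbti_type: str) -> dict[str, int]:
    """Decompose e.g. 'ENFJ' into {'EI': 1, 'SN': 0, 'TF': 0, 'JP': 1}."""
    return {dim: int(mbti_type[i] == dim[0])
            for i, dim in enumerate(DIMENSIONS)}

def word_ngrams(text: str, n: int = 2) -> set[tuple[str, ...]]:
    """Lowercased word n-grams, a typical feature set for these classifiers."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

labels = mbti_to_labels("ENFJ")
```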
Štajner and Yenikent (2021) analyzed the poor performance of MBTI models on X data, conducting an annotation study to explore the theoretical reasons behind it. In their experiments, the annotators felt confident labeling only 15% and 30% of the samples for the J-P and S-N tasks, respectively, lower than for the E-I and T-F tasks (53% and 43%, respectively). The annotators also showed low agreement with the “gold labels”. Moreover, the results raised concerns about whether the linguistic patterns expected under the MBTI framework, which was constructed from traditional questionnaires, actually appear in textual data. Their research tested and confirmed two hypotheses: 1) X posts lack sufficient linguistic signals for MBTI personality recognition, even for trained annotators; 2) textual data do not align well with MBTI questionnaire scores. The study revealed three key findings: the style and content of texts can show signals of opposite polarities, individuals in the middle ranges of personality traits may display mixed signals, and the language used in X posts can make users appear more extroverted due to specific stylistic features such as grammar, exclamation marks, sentence structure, and tonality.
Another reason for the mediocre performance of MBTI models could be that the feature engineering needs improvement. In previous studies, the selected features, mainly n-grams and user behavioral meta-features, were quite simple. In addition, the models employed for personality recognition were classical machine learning methods, such as SVM, LinearSVC, and logistic regression, which could neither extract complex features nor richly represent the input features. Advances in deep learning methods improved the feature engineering of MBTI personality recognition and led to better experimental results.
Yang et al. (2021a) categorized existing approaches to automatically combining multiple posts into two groups. In the first, each document is encoded independently and then averaged to obtain the user representation (Keh et al., 2019). In the second, the posts are concatenated into a long sequence in an arbitrary order and then encoded hierarchically (Lynn et al., 2020). However, the first approach treats each post of a user equally and fails to distinguish their different importance or capture the interactions among them. The second approach overcomes this weakness but introduces an extra post-order bias, affecting the consistency of the users' personalities and the generalization of the models. Additionally, existing methods ignored the association between different post information and different personality aspects.
To this end, Yang et al. (2021a) proposed a multi-document Transformer model as
a post-order-agnostic encoder to gather the posts of a user without introducing the
post-order bias. Moreover, they designed a personality-specific attention mechanism
to allow each personality aspect to focus on relevant post information. Their model
obtained 0.7047 and 0.6092 in Micro-F1 on average on Kaggle24 and Pandora25
datasets, respectively.
24 [Link]
25 [Link]
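The two ideas above, order-agnostic pooling over independently encoded posts and a personality-specific attention query per MBTI dimension, can be sketched in NumPy; all vectors are random stand-ins for learned parameters and encoder outputs, and the final check confirms that permuting the posts leaves each user representation unchanged.

```python
# Personality-specific attention pooling over a set of post embeddings.
# One learned query per MBTI dimension weights the posts via softmax
# attention; the weighted sum is invariant to post order.
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # hidden size (illustrative)
post_embs = rng.normal(size=(7, d))      # 7 posts of one user, encoded independently
trait_queries = rng.normal(size=(4, d))  # one query per MBTI dimension

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

user_reprs = np.stack([softmax(post_embs @ q) @ post_embs
                       for q in trait_queries])   # (4, d): one repr per dimension

# Permuting the posts yields the same pooled representations (no order bias).
perm = rng.permutation(7)
user_reprs_perm = np.stack([softmax(post_embs[perm] @ q) @ post_embs[perm]
                            for q in trait_queries])
assert np.allclose(user_reprs, user_reprs_perm)
```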
However, most of the recent studies focus on applying various deep learning
models in a data-driven manner instead of leveraging psycholinguistic knowledge to
explore the connections between the user’s language use and his or her personality
traits. Motivated by this observation, Yang et al. (2021d) designed a psycholinguistic
knowledge-based tripartite graph network, TrigNet, consisting of a BERT-based
graph initializer and a tripartite graph network. The tripartite graph network injected
structural psycholinguistic knowledge from LIWC. A new flow GAT was proposed to
transmit information between neighboring parties in the tripartite graph and reduce
the computational complexity in graph learning. This work provided a new way
to exploit domain knowledge and an effective and efficient model for personality
recognition tasks.
Yang et al. (2021b) argued that existing data-driven personality recognition mod-
els, capturing personality cues in social media posts implicitly, usually lacked the
guidance of psychological knowledge. They believed that the user’s posts could con-
tain critical information which would help to answer the psychological questionnaire.
Therefore, they presented a novel model, the Psychological Questionnaire Enhanced
Network (PQ-Net), to guide personality recognition by linking the texts and questions
in the questionnaire. PQ-Net consists of two streams: 1) a context stream to encode
each textual input into a contextual text representation; 2) a questionnaire stream to
capture question-relevant information in contextual representations and generate can-
didate answer representations to enhance the contextual representations. According
to the experimental results, PQ-Net outperformed the baseline models and generated
representations which are more inductive and distinguishable in terms of personality
dimensions.
Nevertheless, we may notice that there is still room for improvement. In the
questionnaire stream, the preferred answer to each question was inferred based on
the ground truth personality labels of the user. Then, the inferred answer labels were
employed for the supervised training of PQ-Net. We know that each personality
aspect is based on the accumulated scores of individual questions, even though the
final category of each personality dimension is a binary class. There are cases
where people fall in the middle range of a certain personality aspect.
For example, the two choices “Quiet and reserved.” and “A good mixer.”
correspond to the Introversion and Extraversion categories, respectively, in the
question “Are you usually a good mixer with groups of people or rather quiet and
reserved?”. Even if the ground truth personality label of a user is Extraversion, it is still
possible that his or her answer to this question is “Quiet and reserved.”, with
posts containing cues such as “I need time alone sometimes”. However, the labeling
method in Yang et al. (2021b)’s work would generate a pseudo label “A good
mixer.” to this question according to the user’s final personality label Extraversion.
The mismatch between the relevant post information and the inferred label might
introduce a personality representation bias to the contextual representation.
4.4 Personality Recognition 289
Although theoretical research about personality traits dates back to the last century,
personality recognition is still at a relatively early stage. Most research has
focused on defining the task, building datasets, engineering features, and designing
methodologies. However, we can find some early applications of personality
recognition in different research domains according to existing literature (Vinciarelli
and Mohammadi, 2014).
According to Dhelim et al. (2022), there are mainly two types of recommendation
systems, i.e., conventional recommendation systems and personality-aware
recommendation systems. Conventional recommendation systems usually
include three steps: 1) at the rating stage, the users express their interests or eval-
uations by rating items; 2) at the filtering stage, content similarity and/or
rating similarity is computed to find matching items; 3) at the recommen-
dation stage, the system recommends the items retrieved by the filtering stage.
Compared with conventional recommendation systems, personality-aware recom-
mendation systems include two more stages before the rating stage and also adjust
the filtering stage. Firstly, at the personality measurement and matching stage, the
system obtains the personality traits of the users through a self-report questionnaire
or personality recognition techniques on the users’ history data, e.g., social media
data. Then, the user’s personality type is matched with relevant items. The relevance
can be determined by lexical matching or fine-grained rules. Secondly, at the filtering
stage, the personality similarity is also calculated between each pair of users, besides
the content and/or rating similarity.
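The adjusted filtering stage can be sketched as a weighted blend of rating similarity and personality similarity. The cosine measure, the blending weight `alpha`, and the function names below are illustrative assumptions, not Dhelim et al.'s exact formulation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def user_similarity(ratings_a, ratings_b, big5_a, big5_b, alpha=0.5):
    """Personality-aware filtering: blend rating similarity with Big Five
    personality similarity; alpha controls the trade-off."""
    return alpha * cosine(ratings_a, ratings_b) + (1 - alpha) * cosine(big5_a, big5_b)

# Two users: item ratings plus Big Five scores (O, C, E, A, N)
sim = user_similarity([5, 3, 1], [4, 3, 2],
                      [4.1, 3.5, 2.0, 4.4, 1.8],
                      [4.0, 3.6, 2.2, 4.2, 2.0])
```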
Balakrishnan and Arabi (2018) proposed a hybrid movie recommendation sys-
tem, HyPeRM, enhanced by users’ personality traits and demographic information
(e.g., age and sex). Specifically, the system employed collaborative filtering to filter
the movies on the basis of demographic information and personality. Their work in-
dicated that incorporating users’ personality and demographic information improved
the results of movie recommendations. Asabere and Acakpovi (2020) designed a TV
program recommendation system based on personality and social properties. They
generated group recommendations for TV programs by leveraging personality
similarity coupled with a normalization procedure and by computing tie strength with
folksonomy. In the studies mentioned, the users’ personality traits were collected
through questionnaires, but they proved the effectiveness of incorporating person-
ality traits in recommendation systems. There have been several papers presenting
recommendation systems compatible with both questionnaire-based and personality
recognition-based techniques. For example, Dhelim et al. (2020) designed a product
recommendation system named Meta-Interest to infer users’ needs based on their
topical interests. They integrated the users’ Big Five personalities to enhance the
interest mining process and filter the recommended product.
There have been research works that studied the relationship between the personality
traits of individuals and different aspects of social media, and how their personality
influenced relations with other group members (Maria Balmaceda et al., 2014).
According to Dolgova et al. (2010), the user’s personality could shape his or her
structural position on social networks. Correa et al. (2010) investigated personality
traits as important factors for the social media engagement of individuals. They
studied the relationship between users’ personalities and the use of social media
and found that emotional stability is negatively related to social media use while
extraversion is positively related.
Maria Balmaceda et al. (2014) studied the association rules among Big Five
personality traits of users and provided some insights about how people communicate
and who they tend to communicate with based on the personalities on social networks.
Their research suggested that users with a certain personality aspect value tended
to communicate with others having a similar value in the same personality aspect.
Additionally, they also discovered some interaction patterns between different aspects
of personality. For example, emotionally stable people tended to communicate with
agreeable ones, and agreeable users tended to communicate with extroverted ones.
Their research also provided empirical support for friend recommendations, based
on the personalities on social networks.
The explosive growth in the number of social media users has made it hard
for users to find interesting people to communicate with. Therefore, much research
on social networks has focused on proposing personality-aware recommendation
algorithms. Ning et al. (2019) utilized Big Five personality traits
obtained from the questionnaire and the users’ harmony rating to enhance the hybrid
filtering process for the personality-aware friend recommendation system PersoNet.
It outperformed the traditional collaborative filtering-based friend recommendation
algorithms. Chakrabarty et al. (2020) employed Hellinger-Bhattacharyya Distance
(H-B Distance) to compute the similarity of users’ Big Five personality traits and
used it to recommend friends in their personality-aware friend recommendation
systems, named FAFinder (Friend Affinity Finder).
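The Hellinger distance itself is simple to compute. The sketch below applies it to hypothetical Big Five score vectors normalized into distributions; the scores and the normalization step are illustrative, not FAFinder's exact pipeline:

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions:
    0 means identical, 1 means no overlap (closely related to the
    Bhattacharyya coefficient)."""
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s) / math.sqrt(2)

def normalize(scores):
    """Turn raw trait scores into a distribution summing to 1."""
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical Big Five scores (O, C, E, A, N) for two users
alice = normalize([4.1, 3.5, 2.0, 4.4, 1.8])
bob = normalize([4.0, 3.6, 2.2, 4.2, 2.0])
distance = hellinger(alice, bob)  # small value: similar personalities
```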
However, the aforementioned methods have some limitations, as users are usually
reluctant to fill in questionnaires even for non-commercial objectives
such as friend recommendation. Therefore, it is more feasible to use personality
recognition-based friend recommendation algorithms instead of questionnaire-
based ones. Tommasel et al. (2015) obtained the Big Five personality scores from X data
through the tool of Mairesse et al. (2007) and used them to analyze the influence
of personality traits on followee selection. They found that personalities should be
considered a distinctive factor in followee selection. Moreover, they also observed
the association between different dimensions of personality.
The study by Tommasel et al. (2016) also reached a similar conclusion, sug-
gesting that incorporating personality traits could improve friend recommendation
algorithms. Xiao et al. (2018) proposed a personality-aware followee recommenda-
tion model based on sentiment analysis, text semantics, and the Big Five personality
traits calculated by a Chinese language psychological analysis system TextMind
from the Chinese data collected from Sina Weibo. Experimental results showed the
personality traits obtained from personality recognition were crucial for followee
selection on Chinese social networks as well.
4.4.7 Summary
As shown in Table 4.16, there are mainly two tasks in personality recognition, i.e., Big
Five personality recognition and MBTI personality recognition. Due to the distinct
research progress of the two tasks, different research trends were observed accord-
ingly. The Big Five personality model has been the subject of extensive research
in both psychology and linguistics for a considerable period of time. Therefore, the
early research on Big Five-based personality recognition used LIWC (Francis and
Booth, 1993) features, a “gold standard” covering linguistic, grammatical,
and psychological dimensions. The advance of machine learning approaches
and computational linguistics inspired researchers to attempt open-vocabulary
approaches, e.g., n-gram features (Oberlander and Nowson, 2006) and multimodal
features (Mairesse et al., 2007). At the same time, there has been a trend toward applying more
powerful deep learning models such as CNNs to extract features automatically from
texts (Yuan et al., 2018), besides LIWC.
With the rapid progress of social media like Facebook and X, there is a growing
need to automatically recognize the personality type of people through their posts in-
stead of their self-reports or questionnaires. The development of deep learning models,
especially pretrained LLMs, enabled researchers to build powerful models to
exploit more complicated structures of texts, and integrate external knowledge and
traditional psycholinguistic features for personality recognition (Mehta et al., 2020;
Ren et al., 2021b). The MBTI model was developed using an open-vocabulary
approach, lacking substantial theoretical research background in psycholinguistics.
This approach contrasts with Big Five-based personality recognition, where the the-
oretical discussion precedes empirical research. Furthermore, Štajner and Yenikent
(2020, 2021) focused on explaining why MBTI frameworks could not obtain good
performance in certain cases. Another major difference between MBTI- and Big
Five-based personality recognition is that MBTI frameworks (Plank and Hovy, 2015;
Yamada et al., 2019) were initially tailored for processing social media data instead
of self-reports. Like Big Five frameworks, MBTI frameworks benefited from the
advance of deep learning models (Yang et al., 2021a) and the integration of com-
monsense and psychological knowledge (Yang et al., 2021d).
As shown in Table 4.17, personality recognition has primarily been applied to product
recommendation systems and social network analysis. Initially, these fields relied on
questionnaires to determine personality types. Over time, they adopted personality
recognition frameworks for scoring. However, most applications favored machine
learning-based frameworks or existing tools with basic feature engineering over
designing more advanced models for downstream tasks.
Instead, they paid more attention to how to combine the obtained personality
scores with the product recommendation systems (Yakhchi et al., 2020) or friend
recommendation systems (Tommasel et al., 2015; Xiao et al., 2018). As the down-
stream tasks of personality recognition are still at a relatively early stage, there are
a large number of papers studying the effect of incorporating the personalities into
the downstream tasks (Lu and Kan, 2023; Xiao et al., 2018; Ning et al., 2019).
Their research provided theoretical justification and valuable insights for combining
personality recognition with downstream tasks. It is also notable that most of the
applications utilized Big Five frameworks instead of MBTI frameworks, which is
possible because of the unsatisfying performance of personality recognition models
for MBTI.
Finally, there is a desire to combine persona analysis with other NLU tasks. Many
NLU tasks involve labels that are subjective to individuals, as the same statement can
cause different affective responses between different people. For example, introverted
individuals may react negatively to a suggestion to perform in front of audiences,
whereas extroverts might welcome such an opportunity. Current NLP research often
overlooks personalized prediction, as it tends to rely on majority-defined ground-
truth labels. However, this approach contradicts the spirit of human-centric AI, as it
neglects user diversity. Therefore, future endeavors could explore the integration of
persona analysis with a broader range of NLU tasks.
Aspect extraction is the process of identifying and extracting specific components, at-
tributes, or features mentioned in text, typically employed to identify opinion targets
in the context of ABSA. Besides ABSA, aspect extraction is useful in tasks like text
summarization (Tang et al., 2024), financial forecasting (Du et al., 2023a), recom-
mendation systems (Karthik and Ganapathy, 2021), fake review detection (Bathla
et al., 2022), and many more. In customer feedback analysis, it pinpoints which
aspects of products or services are frequently mentioned and their associated senti-
ments. Aspect extraction also aids in topic modeling, trend analysis, and competitive
analysis by identifying specific subjects or features discussed in texts. In chatbots
and virtual assistants, it improves response accuracy to user inquiries. Moreover, it
enhances content recommendation and personalization by identifying aspects users
engage with, offering more relevant content and insights into market trends.
As mentioned, aspect extraction is key for ABSA and aspect-based opinion mining
(ABOM) in order to enable a more nuanced comprehension of specific requirements
or opinions related to a product or service. In particular, while ABSA is more
focused on analyzing sentiments associated with aspects, ABOM is a broader term
that includes not only sentiment analysis but also the extraction and analysis of
various types of opinion-related information. Hu and Liu (2004) identified two types
of aspects, namely explicit and implicit aspects. Aspect terms that are explicitly
specified in given sentences are explicit aspects, while aspects without corresponding
words are considered implicit aspects (Dalila et al., 2018).
For instance, consider the statement “it’s very light-weight and we can get amaz-
ing pix too”, where the aspect “weight” is explicitly specified (Ganganwar and
Rajalakshmi, 2019). In contrast, in the statement “It is very light. You can carry
it everywhere”, the aspect “weight” is implicit. Compared with explicit aspects
(Tubishat et al., 2021; Behdenna et al., 2022), implicit aspects are more challenging
to extract due to their indirect nature and their reliance on contextual and common-
sense understanding (Verma and Davis, 2021; Zhuang et al., 2022; Ahmed et al.,
2022). However, implicit aspects are important linguistic phenomena that cannot be
ignored in real-world applications.
In both cases, the elided elements with square brackets can be inferred from context,
but the elided phrase does not necessarily have the same reference as the antecedent,
as references are context-dependent.
(8) Jim likes his new laptop. Jane does too.
In Example (8), Jim likes Jim’s new laptop, while Jane likes Jane’s laptop. The elided
element refers to a different entity. Such a situation also happens to Example (7),
where the two Chinese dishes are not the same. Another type of ellipsis is gapping,
which involves the omission of non-initial verb phrases in coordinated clauses, where
only the first verb phrase is fully expressed (Johnson, 2006).
(9) For the new laptop, Jim likes the appearance, and Jane [likes] the screen.
Gapping is restricted to coordinations and typically involves the removal of material
from the second conjunct, with remnants being major constituents like subjects and
objects. Both verb phrase ellipsis and gapping involve the omission of verb phrases;
however, they occur in different contexts. Verb phrase ellipsis often requires the
presence of an auxiliary verb in the second clause (e.g., “don’t” in Example (6)),
while gapping does not have this requirement. Sluicing is another type of ellipsis that
occurs when a wh-phrase and a part of the containing clause are omitted, leaving a
clause with an interrogative meaning (Chung et al., 1995). For example,
(10) My friend wrote some reviews for the book, but I don’t know [what the
reviews are].
In sluicing, the missing information is not explicitly provided, making it challenging
for the listener or reader to infer the exact content of the missing elements. Aspect
extraction involves identifying and extracting aspects or attributes of entities or topics
mentioned in text. Elliptical constructions, e.g., verb phrase ellipsis and noun phrase
ellipsis, can pose challenges for aspect extraction because they involve the omission
of key elements (verb phrases or noun phrases) that are relevant for determining
aspects. Gapping and sluicing can lead to ambiguity regarding which aspects are
associated with which entities or topics in the sentence, further complicating aspect
extraction in NLU. Considering the frequent occurrences of elliptical constructions,
they cannot be overlooked in developing aspect extraction systems.
The “B” label denotes the beginning of an aspect, the “I” label indicates that the
token is inside an aspect, and the “O” label signifies that the token is not part of
any aspect. Building upon the BIO scheme, the BIOES scheme incorporates two
additional labels: “S” (Single) and “E” (End). These labels are designed to provide
clearer annotations of aspect boundaries. The “S” label is used for single-token
aspects, while the “E” label marks the end of an aspect entity.
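The two schemes can be illustrated with a minimal sketch that converts annotated aspect spans into BIO or BIOES tags. Token-index spans (end-exclusive) and the helper name are assumptions for illustration:

```python
def spans_to_tags(tokens, aspect_spans, scheme="BIO"):
    """Convert aspect spans, given as (start, end) token indices with the
    end index exclusive, into BIO or BIOES tags."""
    tags = ["O"] * len(tokens)
    for start, end in aspect_spans:
        if scheme == "BIOES" and end - start == 1:
            tags[start] = "S"  # single-token aspect
            continue
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
        if scheme == "BIOES":
            tags[end - 1] = "E"  # mark the aspect boundary explicitly
    return tags

tokens = ["The", "battery", "life", "is", "great"]
print(spans_to_tags(tokens, [(1, 3)]))           # ['O', 'B', 'I', 'O', 'O']
print(spans_to_tags(tokens, [(1, 3)], "BIOES"))  # ['O', 'B', 'E', 'O', 'O']
```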
Table 4.18: The three common annotation schemes for aspect extraction.
4.5.3 Datasets
Widely used aspect extraction datasets are shown in Table 4.19. The SemEval-2014
Task 426 (Pontiki et al., 2014) consists of four subtasks: aspect term extraction, aspect
term polarity detection, aspect category detection, and aspect category polarity
detection. The task involves two domain-specific datasets focused on laptop and
restaurant reviews, which include over 6,000 sentences. These datasets are annotated
at a fine-grained aspect level.
Table 4.19: Aspect extraction datasets and statistics. AE denotes aspect extraction;
AC denotes aspect category detection; ATP denotes aspect term polarity detection;
ACP denotes aspect category polarity detection.
26 [Link]
4.5 Aspect Extraction 299
<sentence id="813">
<text>All the appetizers and salads were fabulous, the steak was mouth watering and
the pasta was delicious!!!</text>
<aspectTerms>
<aspectTerm term="appetizers" polarity="positive" from="8" to="18"/>
<aspectTerm term="salads" polarity="positive" from="23" to="29"/>
<aspectTerm term="steak" polarity="positive" from="49" to="54"/>
<aspectTerm term="pasta" polarity="positive" from="82" to="87"/>
</aspectTerms>
<aspectCategories>
<aspectCategory category="food" polarity="positive"/>
</aspectCategories>
</sentence>
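Records in this format can be read with a standard XML parser. The sketch below extracts the aspect terms, polarities, and character offsets from the example above (the function name is illustrative):

```python
import xml.etree.ElementTree as ET

SAMPLE = """<sentence id="813">
<text>All the appetizers and salads were fabulous, the steak was mouth watering and
the pasta was delicious!!!</text>
<aspectTerms>
<aspectTerm term="appetizers" polarity="positive" from="8" to="18"/>
<aspectTerm term="salads" polarity="positive" from="23" to="29"/>
<aspectTerm term="steak" polarity="positive" from="49" to="54"/>
<aspectTerm term="pasta" polarity="positive" from="82" to="87"/>
</aspectTerms>
</sentence>"""

def parse_sentence(xml_string):
    """Return (term, polarity, from, to) for every aspectTerm element."""
    node = ET.fromstring(xml_string)
    return [(a.get("term"), a.get("polarity"), int(a.get("from")), int(a.get("to")))
            for a in node.iter("aspectTerm")]

terms = parse_sentence(SAMPLE)  # four aspect terms, all positive
```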
SemEval-2016 Task 528 (Pontiki et al., 2016) was also developed for ABSA
from laptop and restaurant reviews, but it introduced new test datasets for evaluation
and included multilingual datasets encompassing English, Arabic, Dutch, Russian,
Spanish, and Turkish at both the sentence and document levels. Hu and Liu (2004)
proposed an aspect-level sentiment analysis dataset29 for mining and summarizing
customer reviews of digital products. The instances in the dataset contain aspect
category labels and sentiment polarities, where the sentiment polarities range from
-3 (strong negative) to +3 (strong positive). In the following example, the numbers
in the square brackets denote the sentiment polarities; the terms before the sentiment
polarities are aspects; ## denotes the start of a review sentence.
Title: great camera
canon g3[+3]##i bought my canon g3 about a month ago and i have to say i am
very satisfied .
photo quality[+2]##i have taken hundreds of photos with it and i continue to be
amazed by their quality .
feature[+2]##the g3 is loaded with many useful features , and unlike many smaller
digital cameras , it is easy to hold steady when using slower shutter speeds .
27 [Link]
28 [Link]
29 [Link]
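Lines in this format can be split with a simple regular expression. The helper below (a hypothetical name) separates the aspect annotations before "##" from the review sentence after it:

```python
import re

# One or more "aspect[+N]" / "aspect[-N]" annotations precede "##"
ASPECT_RE = re.compile(r"([\w\s-]+?)\[([+-]\d)\]")

def parse_review_line(line):
    """Split a Hu and Liu-style line into (aspect, polarity) pairs
    and the review sentence."""
    annotations, _, sentence = line.partition("##")
    aspects = [(m.group(1).strip(), int(m.group(2)))
               for m in ASPECT_RE.finditer(annotations)]
    return aspects, sentence.strip()

aspects, text = parse_review_line(
    "photo quality[+2]##i have taken hundreds of photos with it .")
# aspects == [("photo quality", 2)]
```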
AWARE31 (Alturaief et al., 2021) is an ABSA dataset annotated for app reviews.
It contains 11,323 reviews annotated for aspect terms, aspect
categories, and sentiment. Reviews were collected from three domains: productivity,
social networking, and games.
{
"domain": "productivity",
"app": "things-3",
"review_id": "c9274c0a-a120-4e09-816b-7a8ba3a16634",
"sentence_id": "00808934-e8b9-42fa-b37f-cfeac234bbdd",
"title": "Difficult to update from Things 2",
"review": "This new version of Things has an entirely different aesthetic from Things 2.
Things 2 is much more minimalist; Things 3 seems to have a lot of UI bloat. Not quite
sure where the design award came from.",
"sentence": "This new version of Things has an entirely different aesthetic from Things
2.",
"rating": "3",
"is_opinion": "TRUE",
"category": "usability",
"term": "new version",
"from": "6",
"to": "17",
"sentiment": "positive"
}
Besides the above datasets, TSA-MD (Toledo-Ronen et al., 2022), MAMS (Jiang
et al., 2019), and the studies (Peng et al., 2018; Vo and Zhang, 2015) are also ABSA
datasets, which contain aspect extraction as a subtask.
30 [Link]
31 [Link]
Table 4.20 shows various knowledge bases related to aspect extraction. Besides
WordNet (Fellbaum, 1998), SentiWordNet (Baccianella et al., 2010), BabelNet (Nav-
igli and Ponzetto, 2012b), FrameNet (Ruppenhofer et al., 2016), ConceptNet (Speer
et al., 2017), SenticNet (Cambria et al., 2024), and Wiktionary (introduced earlier),
some other useful knowledge bases for aspect extraction are DBpedia (Auer et al.,
2007), YAGO (Suchanek et al., 2007), Freebase (Bollacker et al., 2008), Probase (Wu
et al., 2012), Knowledge Vault (Dong et al., 2014), and FinSenticNet (Du et al.,
2023b).
In particular, DBpedia is a large-scale multilingual knowledge base extracted from
Wikipedia containing millions of entities. YAGO is a large knowledge base with gen-
eral knowledge about people, cities, countries, movies, and organizations. Freebase
is a practical, scalable tuple database that contains structured human knowledge.
Probase is a large-scale concept knowledge map proposed by Microsoft, contain-
ing entities mapped to different semantic concepts and labeled with corresponding
probability labels. Knowledge Vault is a web-scale probabilistic knowledge base
that combines extractions from Web content with prior knowledge from existing
knowledge bases. FinSenticNet is a lexicon tailored for financial sentiment analy-
sis, including 6,741 concepts, over 65% of which are complex MWEs. Ontologies
from professional domains, e.g., medicine (Lipscomb, 2000; Donnelly et al., 2006;
Wishart et al., 2018; Hirsch et al., 2016; Johnson et al., 2016; Wheeler et al., 2007)
and geology (Ahlers, 2013), can be also helpful for domain-specific aspect extraction.
Table 4.20: Useful knowledge bases for aspect extraction. These knowledge bases
can also be helpful for polarity detection in Section 4.6 because they provide either
sentiment scores or commonsense knowledge.
As aspect extraction is typically cast as a sequence labeling task, precision, recall,
F1 , and accuracy are its main evaluation metrics. In particular, precision is used to
measure the proportion of correctly identified aspects out of all aspects identified by
the model, indicating how accurate the predictions are. Recall assesses the proportion
of correctly identified aspects out of all actual relevant aspects present in the data,
showing the model’s ability to find all relevant aspects. The F1 score, which is the
harmonic mean of precision and recall, balances these two metrics and provides
a single measure of a model’s performance, especially when there is an uneven
distribution between precision and recall. Accuracy measures the overall rate of
correct predictions, comparing the total number of correct aspect predictions to the
total number of predictions made. Together, these metrics provide a comprehensive
evaluation of the model’s ability to accurately identify aspects in text data.
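These metrics can be sketched for exact-match, span-level evaluation, with each aspect represented as a (term, start, end) tuple. This is a minimal illustration; partial-match or token-level variants would differ:

```python
def prf1(predicted, gold):
    """Exact-match precision, recall, and F1 over aspect spans."""
    pred, gold_set = set(predicted), set(gold)
    tp = len(pred & gold_set)                       # true positives
    p = tp / len(pred) if pred else 0.0             # precision
    r = tp / len(gold_set) if gold_set else 0.0     # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0      # harmonic mean
    return p, r, f1

# One correct aspect, one spurious prediction, no missed aspects
p, r, f1 = prf1([("battery", 1, 3), ("screen", 5, 6)], [("battery", 1, 3)])
# p == 0.5, r == 1.0, f1 ~ 0.667
```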
4.5.6 Methods
A. Unsupervised Method
There are also studies that utilized auxiliary information for aspect extraction.
Moghaddam and Ester (2010) employed known features mentioned in reviews to
identify specific aspects from customer reviews written in free text format. By intro-
ducing predefined aspects and their ratings, they improved the accuracy of opinion
mining. In addition, Meng and Wang (2009) employed supplementary details about
the products available on review websites and correlations between aspects and opin-
ions to pinpoint the relevant aspects. Bancken et al. (2014) proposed an algorithm
to automatically detect and rate product aspects from customer reviews. The algorithm
works by matching syntactic dependency paths among the words of a sentence.
To identify product aspects and their opinion words, ten handcrafted dependency
paths were defined. The algorithm then generates a syntactic dependency tree
using a dependency parser and extracts basic aspect-opinion pairs from
the sentence. Luo et al. (2019b) proposed an unsupervised model that employs word
embeddings for aspect clustering. They utilized Probase and WordNet to identify
prominent aspects and emphasized explicit aspects.
B. Semi-supervised Method
Bajaj et al. (2021) argued that typical ABSA tasks suffer from sentiment inconsistency
and a huge search space. To overcome these problems, they proposed a
span-based extract-then-classify framework, in which a separate aspect extractor
extracts multiple possible targets from the sentence. He et al. (2023b) introduced
a meta-based self-training method combined with a meta-weighter. They trained a
a meta-based self-training method combined with a meta-weighter. They trained a
teacher model to produce pseudo-labeled data, which is utilized by a student model
for supervised learning. The meta-weighter component was jointly trained with the
student model, providing subtask-specific weights to each instance, coordinating
their convergence rates, balancing class labels, and reducing the impact of noise
introduced during self-training.
C. Supervised Method
Supervised explicit aspect extraction is the most common setting for aspect ex-
traction tasks. In this setting, employing external knowledge is a popular approach
for enhancing performance. For example, Sentic LSTM (Ma et al., 2018) enhanced
LSTM by integrating a hierarchical attention mechanism that includes target-level
attention and sentence-level attention for ABSA. Additionally, it incorporates
commonsense knowledge from SenticNet for sentiment-related aspects. Ghosal et al.
(2020) proposed a domain adaptive model to boost the performance of sentiment
analysis with specific aspects. They utilized a graph convolutional autoencoder that
leverages inter-domain concepts in a domain-invariant manner from ConceptNet.
Liang et al. (2022b) proposed Sentic GCN, a graph neural network that incorporates
affective knowledge from external knowledge bases to enhance the dependency
graphs of sentences. The approach integrates both the contextual and aspect
word dependencies, as well as the affective information between opinion words
and aspects, into the graph model, resulting in an
enhanced affective graph model. SentiPrompt (Li et al., 2021a) is a method based
on prompt tuning, which employs consistency and polarity judgment templates to
create prompts related to ground truth aspects. Nonetheless, Mao et al. (2023c)
raised concerns about potential biases that could exist in prompt-based methods for
sentiment analysis and emotion detection.
A. Unsupervised Method
Zhang et al. (2012) proposed a Weakness Finder method that extracts features and
groups them explicitly using a morpheme-based approach and a similarity measure
based on HowNet. Implicit features are identified and grouped using a collocation
selection method for each aspect. Dependency parsing is a common technique in
sentiment analysis that uses predefined rules to extract opinion targets and sentiment
words by analyzing the relationships between words.
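A rule-based extractor of this kind can be sketched over parser output. The (head, relation, dependent) triples, the opinion lexicon, and the two rules below are illustrative assumptions, not those of any cited system:

```python
# Hypothetical dependency triples, e.g. from a dependency parser, for
# "The battery life is great and the screen looks sharp"
TRIPLES = [
    ("great", "nsubj", "life"), ("life", "compound", "battery"),
    ("looks", "nsubj", "screen"), ("looks", "acomp", "sharp"),
]
OPINION_WORDS = {"great", "sharp"}  # assumed opinion lexicon

def extract_targets(triples, opinion_words):
    """Rule: the nominal subject of an opinion word, or of a verb whose
    adjectival complement is an opinion word, is an aspect candidate."""
    opinion_heads = set(opinion_words)
    # verbs linked to an opinion word via a complement relation
    opinion_heads |= {h for h, rel, d in triples
                      if rel == "acomp" and d in opinion_words}
    return [d for h, rel, d in triples
            if rel == "nsubj" and h in opinion_heads]

print(extract_targets(TRIPLES, OPINION_WORDS))  # ['life', 'screen']
```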
To enhance the unsupervised dependency parsing, Zainuddin et al. (2016) pro-
posed a hybrid approach that combines rule mining, dependency parsing, and Sen-
tiWordNet. This approach enables implicit aspect extraction in an unsupervised
manner. Karmaker Santu et al. (2016) utilized a probabilistic topic modeling method
to extract implicit features. They modeled the reviews as generative probabilistic
feature models, with the reviews represented as associations between sentences and
features using hidden variables. Then, they employed the expectation-maximization
technique to calculate the model parameters using tagged training data. Finally, they
extracted implicit features using hidden variables and the calculated parameter val-
ues.
B. Semi-supervised Method
Jiang et al. (2014) employed association rule mining to extract implicit aspects,
incorporating indicators like semi-supervised LDA topic models. These were inte-
grated into an improved collocation model, which was then used to extract basic
rules. These basic rules were further expanded with new rules generated by the
topic model. The combination of these basic and new rules enabled the extraction
of implicit aspects. Xu et al. (2015a) combined SVM with topic modeling to extract
explicit and implicit aspects. They enhanced the LDA topic model with the pro-
posed cannot-link, must-link, and prior knowledge. The must-link relation provides
information about pairs of words that must be present in the same cluster, while
the cannot-link relation provides information about pairs of words that cannot be
in the same cluster. The LDA explicit topic model, augmented with these enhanced
features, was then employed to identify relevant attributes from the dataset. Subse-
quently, several SVM classifiers were initiated, which were trained on the selected
attributes to uncover implicit features.
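The must-link and cannot-link relations can be sketched as a simple constraint check over a word clustering (the clustering and pairs below are illustrative assumptions, not data from Xu et al.):

```python
# Minimal sketch of must-link / cannot-link constraints: given a
# clustering (word -> cluster id), verify that every must-link pair
# shares a cluster and every cannot-link pair does not.

def satisfies_constraints(assignment, must_link, cannot_link):
    for a, b in must_link:
        if assignment[a] != assignment[b]:
            return False
    for a, b in cannot_link:
        if assignment[a] == assignment[b]:
            return False
    return True

clusters = {"price": 0, "cost": 0, "pizza": 1, "service": 2}
must = [("price", "cost")]        # near-synonyms must share a topic
cannot = [("pizza", "service")]   # distinct aspects must be separated
print(satisfies_constraints(clusters, must, cannot))  # True
```

In the actual constrained-LDA setting, such pairs bias the topic sampler rather than being checked after the fact; the check above only conveys what the two relations assert.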
Yu et al. (2018) argue that many current methods overlook the syntactic relation-
ships between aspect terms and opinion terms, which can result in inconsistencies
between the model predictions and syntactic constraints. To address this issue, an
MTL framework is first employed to implicitly capture the relations between the
aspects and opinions. Then, a global inference method is proposed, which explicitly
models several syntactic constraints between aspects and opinions to reveal their
intra-task and inter-task relationships. This approach aims to achieve an optimal solution over the neural predictions for both tasks. Maylawati et al. (2020) proposed a method for implicit aspect extraction that combines feature extraction, feature selection, clustering, and association rule mining to recognize the final implicit aspects.
C. Supervised Method
By utilizing an aspect hierarchy and opinion terms, Yu et al. (2011b) identified implicit aspects. They created a hierarchical organization by integrating product specifications and customer reviews, which allowed them to infer implicit aspects within review sentences. Yan et al. (2015) proposed a NodeRank algorithm that first identifies all co-occurrences of opinion words with explicit aspects. The algorithm then calculates a NodeRank value for each potential aspect paired with an opinion word, and the aspect with the highest value is considered the potential implicit aspect.
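A greatly simplified version of this idea (a sketch in the spirit of, but not identical to, the NodeRank computation) scores candidate aspects by how often they co-occur with the opinion word:

```python
# Simplified sketch: rank candidate implicit aspects by sentence-level
# co-occurrence with an opinion word; the top-scoring candidate is
# taken as the implicit aspect. Corpus and candidates are toy data.

from collections import Counter

def best_implicit_aspect(sentences, opinion_word, candidate_aspects):
    counts = Counter()
    for tokens in sentences:
        if opinion_word in tokens:
            for aspect in candidate_aspects:
                if aspect in tokens:
                    counts[aspect] += 1
    return counts.most_common(1)[0][0] if counts else None

corpus = [["the", "price", "is", "cheap"],
          ["cheap", "but", "good", "quality"],
          ["the", "screen", "is", "cheap", "price", "wise"]]
print(best_implicit_aspect(corpus, "cheap", {"price", "screen", "quality"}))
# 'price'  (co-occurs with "cheap" in two sentences)
```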
Liao et al. (2019) identified fact-implied implicit sentiment at the sentence level.
To achieve this, they suggested a multi-level semantic fusion method for identifying
implicit aspects in sentiment analysis. The corpus is used to learn three distinct
features at varying levels, including sentiment target representation at the word
level, structure embedded representation at the sentence level, and context semantic
background representation at the document level. A fact-implied Chinese implicit
sentiment corpus was also developed in this work.
4.5 Aspect Extraction 307
Feng et al. (2019a) utilized a deep CNN with a sequential algorithm to label
words in a sentence. Implicit aspects were identified by treating them as topics and then determining the degree to which sentiment words and aspects matched. Initially, they extracted aspects using word vectors, POS vectors, and dependency syntax vectors to train a deep CNN. Next, a sequential algorithm was used to obtain
sentiment labels with implicit aspect identification. Rana et al. (2020) argued that
prior techniques for implicit aspect extraction have concentrated on certain types
of aspects while disregarding the fundamental issue at hand. To deal with such
a problem, they utilized co-occurrence and similarity-based techniques to identify
implicit aspects in a multi-level approach, which crafted rules to identify clues and
used recognized clues to extract implicit aspects. To identify implicit aspects, Xu et al. (2020b) proposed an NMF-based approach. They employed clustering based on the
co-occurrence of opinion words and word vectors to gather contextual information
about opinion targets from review sentences. Then, they utilized a classifier to
forecast the implicit target of users’ opinions.
ABSA and ABOM are the most popular downstream applications for aspect extrac-
tion. To tackle ABSA and ABOM in various settings, numerous tasks have been
developed to analyze different sentiment components and their associations, such as
aspect terms, aspect categories, opinion terms, and sentiment polarity. Unlike earlier
ABSA studies that focused solely on a single sentiment component, recent research
has investigated several compound ABSA and ABOM tasks that encompass multiple
components to capture more comprehensive aspect-level information. For example,
in the aspect-opinion pair extraction task, Chen et al. (2020) and Zhao et al. (2020a) aimed to obtain a clearer comprehension of the mentioned opinion target and its
associated opinion expression. It is necessary to extract the aspect and opinion terms
as a compound, such as the pairs of (pizza, delicious) and (price, expensive).
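A naive illustration of such pair extraction (a deliberately simplistic assumption, far weaker than the cited neural models) pairs an adjective with the noun it immediately precedes:

```python
# Hedged sketch of compound aspect-opinion pair extraction with a
# single naive POS pattern: an adjective directly before a noun yields
# an (aspect, opinion) pair. POS tags are supplied by hand here.

def extract_pairs(tagged):
    """tagged: list of (token, pos) tuples; returns (noun, adjective) pairs."""
    pairs = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1 == "ADJ" and t2 == "NOUN":
            pairs.append((w2, w1))
    return pairs

sentence = [("the", "DET"), ("delicious", "ADJ"), ("pizza", "NOUN"),
            ("had", "VERB"), ("an", "DET"), ("expensive", "ADJ"),
            ("price", "NOUN")]
print(extract_pairs(sentence))  # [('pizza', 'delicious'), ('price', 'expensive')]
```

This recovers the (pizza, delicious) and (price, expensive) pairs from the running example, but obviously misses any pair whose aspect and opinion are not adjacent, which is exactly what the joint models above address.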
Dai et al. (2020) tried to deal with aspect extraction and aspect category identification simultaneously. Recently, finer-grained sentiment analysis tasks, e.g., Aspect
Sentiment Quad Prediction (ASQP), Aspect Category Sentiment Analysis (ACSA),
Aspect-Category-Sentiment Detection (ACSD), and Aspect Sentiment Triplet Ex-
traction (ASTE) were proposed (Cai et al., 2021; Zhang et al., 2021b; Mao et al.,
2022b; Bao et al., 2022; Wan et al., 2020; Wu et al., 2021a; Xu et al., 2020a;
Mukherjee et al., 2021; Wu et al., 2020c). These tasks commonly considered aspect
extraction as one of the learning tasks, providing necessary information for sentiment
analysis. Fundamental aspect extraction tasks along with other advanced tasks can
be structured in either an end-to-end fashion or a pipeline approach. Moreover, some
tasks rely on explicit aspects, while others depend on implicit aspects.
4.5.8 Summary
As shown in Table 4.21, recent research focuses on addressing explicit and implicit
aspect extraction tasks. The difference is that explicit aspect extraction identifies
aspects explicitly mentioned in text, while implicit aspect extraction aims to uncover
aspects that are not explicitly stated but can be inferred from the context. Implicit
aspect extraction often involves identifying nuanced or indirect references to aspects,
requiring a deeper understanding of the text.
Researchers have developed various techniques for both explicit and implicit as-
pect extraction, including rule-based methods, machine learning algorithms, and
neural network models. These approaches have been applied across different aspect
extraction domains, e.g., product, restaurant, and movie reviews, demonstrating the
importance of aspect extraction in understanding opinions and sentiments expressed
in text. At the early stage, linguistic features played a critical role in this domain.
Researchers tried to identify aspects by analyzing the linguistic associations, depen-
dency relationships, POS tags, and customer behaviors. However, such superficial
features cannot provide a deeper contextual understanding of aspects.
Reference | Tech | Feature and KB | Framework | Dataset | Score | Metric

Explicit:
Agrawal et al. (1994) | Rule | Ling., trading | Rule-based | Synthetic data | - | -
Bafna and Toshniwal (2013) | Rule | Ling., POS | Probabilistic | Self-collected | 0.92 | Acc.
Pang and Lee (2004) | Rule | Ling. | Graph-based | Self-collected | 0.90 | Acc.
Bloom et al. (2007)* | Rule | Ling., Lexic. | Probabilistic | Pang and Lee (2004) | - | -
Kobayashi et al. (2007) | Rule | Ling., Stat. | Feature eng. | Self-collected | 0.72 | Prec.
Moghaddam and Ester (2010) | Rule | Aspects, ratings | Feature eng. | Self-collected | 0.80 | Prec.
Meng and Wang (2009) | Rule | Ling., PS | Feature eng. | Self-collected | 0.75 | F1
Bancken et al. (2014) | Rule | Dep., WN | Feature eng. | Self-collected | 0.60 | Acc.
Luo et al. (2019b) | Rule | PB, WN, GloVe | Feature eng. | Self-collected | 0.70 | Acc.
Hai et al. (2012) | ML | Ling. | Feature eng. | Self-collected | 0.73 | F1
Jakob and Gurevych (2010b) | ML | Ling., opinion | CRF-based | Self-collected | 0.52 | F1
Zhao et al. (2014) | ML | Dep., opinion | Graph-based | COAE2008 | 0.79 | F1
Yu et al. (2011a) | ML | Boolean, Dep. | Probabilistic | Self-collected | 0.74 | F1
Mukherjee and Liu (2012) | ML | Ling. | Probabilistic | Self-collected | - | -
Jin et al. (2009) | ML | Ling., POS | HMM | Hu and Liu (2004) | 0.75 | F1
Li et al. (2020d) | DL | GloVe | LSTM, att. | SemEval-2014 Task 4 | 0.72 | F1
Zhou et al. (2021a) | DL | word2vec | Meta learn. | SemEval-2014 Task 4 | - | -
Bajaj et al. (2021) | DL | BERT | Transformer | SemEval-2014 Task 4 | 0.76 | F1
He et al. (2023b) | DL | BERT | Meta learn. | SemEval-2014 Task 4 | 0.82 | F1
Chen and Qian (2020) | DL | BERT | MTL | SemEval-2014 Task 4 | 0.81 | F1
He et al. (2019c) | DL | BERT | MTL | SemEval-2014 Task 4 | 0.77 | F1
Mao and Li (2021) | DL | BERT | MTL | Lapt.14 & Rest.14 | 0.85 | F1
Ghosal et al. (2020) | DL | Emb., ConceptNet | Graph-based | Self-collected | - | -
Liang et al. (2022b) | DL | Dep., SN, BERT | Graph-based | SemEval-2014 Task 4 | - | -
Li et al. (2021a) | DL | BART, Cons. | Prompt-tun. | SemEval-2014 Task 4 | - | -

Implicit:
Hu and Liu (2004) | Rule | Ling., POS | Feature eng. | Self-collected | 0.56 | Prec.
Zhang et al. (2012) | Rule | Ling., HowNet | Feature eng. | Self-collected | - | -
Zainuddin et al. (2016) | Rule | Ling., dep., SWN | Feature eng. | Self-collected | - | -
Jiang et al. (2014) | Rule | Ling. | LDA | Self-collected | 0.78 | F1
Karmaker Santu et al. (2016) | ML | Ling., TF-IDF | EM algorithm | Self-collected | - | -
Xu et al. (2015a) | ML | Ling., PMI, POS | SVM, LDA | Self-collected | 0.78 | F1
Maylawati et al. (2020) | ML | Ling., TF-IDF | Clustering | SemEval-2014 Task 4 | - | -
Yan et al. (2015) | ML | Ling., NodeRank | Feature eng. | Self-collected | 0.84 | F1
Rana et al. (2020) | ML | Ling., Stat. | Feature eng. | SemEval-2014 Task 4 | 0.88 | F1
Xu et al. (2020b) | ML | Ling., dep., Stat. | Matrix factor. | SemEval-2015 Task 12 | 0.72 | F1
Yu et al. (2018) | DL | GloVe | BiLSTM | SemEval-2014 Task 4 | 0.85 | F1
Liao et al. (2019) | DL | word2vec | CNN | Self-collected | - | -
Feng et al. (2019a) | DL | Ling., POS, dep. | CNN | Liao et al. (2019) | 0.82 | F1
Recently, deep neural networks and PLMs have been widely employed. Aspect extraction is often learned jointly with related tasks, e.g., opinion extraction and sentiment polarity prediction. These related tasks may provide complementary information for each other; thus, learning them simultaneously allows useful information to be shared, yielding higher accuracy on each task. While implicit aspect extraction
is generally considered more challenging than explicit aspect extraction, recent re-
search has shown less interest in this linguistic phenomenon, with a greater focus on
achieving state-of-the-art performance on benchmark datasets without considering
linguistic properties. Alternatively, another type of effort is to achieve fine-grained
analysis of affective expressions, e.g., ASTE aims to extract aspect terms, opinion
terms, and their corresponding sentiment polarities. The studies on aspect extraction
from different linguistic contexts deserve more research efforts.
Finally, current aspect extraction research commonly focuses on conventional
application scenarios, e.g., product or service reviews. This emphasis is likely due to
the fact that most research in this field aims to identify sentiment polarities for specific
aspect categories, which can hold commercial significance, such as understanding
customer preferences for various aspects of a product or service. However, the scope
of aspect extraction can be broadened to encompass other scenarios. Unlike NER,
which targets specific entities in text, aspect extraction involves identifying various
attributes or features of entities, such as food quality, service speed, or restaurant
cleanliness. While these are not specific named entities, they contribute to a richer
understanding of attributes associated with entities.
Current aspect extraction methods tailored to specific domains may not work
well in general domains. Training a model on a domain-specific dataset, like one for
restaurant reviews, may not apply to other areas, such as extracting research findings
in scientific texts. While aspect extraction is useful in scientific research, there has
been limited exploration of domain-agnostic methods.
As mentioned above, the main technical trend for aspect extraction is the integra-
tion with other sentiment and opinion mining tasks, such as ABSA, ABOM, ASQP,
ACSA, ACSD, and ASTE. Different task setups allow users to retrieve different
types of information related to sentiment analysis. Then, the natural idea is to design
a unified framework to tackle multiple related tasks at the same time. Its practical
usefulness lies in the fact that one may prefer not to alter the model architecture
and undergo retraining for each instance of new data with varying types of opinion
annotations. There are two main paradigms for designing such a unified framework.
The first possible unified framework is to formulate the tasks as QA tasks. Gao et al. (2021a) divided ABOM into two subtasks: aspect term extraction and aspect-specified opinion extraction. They extract all the candidate aspect terms, followed by extracting the corresponding opinion words given each aspect term. The proposed model employed a span-based tagging scheme and constructed a question-answer-based MRC task to achieve efficient extraction of aspect-opinion pairs. To address
the challenges of the ASTE task, Chen et al. (2021a) converted the task into a multi-turn MRC and proposed a bidirectional MRC framework. This framework includes
three types of queries (non-restrictive extraction, restrictive extraction, and sentiment
classification) to establish relationships between various subtasks.
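The three query types can be sketched as simple templates (the exact wording below is an illustrative assumption, not the templates from Chen et al. (2021a)):

```python
# Sketch of the three query types used in a bidirectional MRC framing
# of ASTE. Template wording is assumed for illustration.

def non_restrictive_query():
    # asks for all aspect (or opinion) spans, unconditioned
    return "What aspects are mentioned in the sentence?"

def restrictive_query(aspect):
    # conditioned on a span found by the previous turn
    return f"What opinion words describe the aspect '{aspect}'?"

def sentiment_query(aspect, opinion):
    # final turn: classify the sentiment of the extracted pair
    return f"What is the sentiment of '{aspect}' given the opinion '{opinion}'?"

print(non_restrictive_query())
print(restrictive_query("pizza"))
print(sentiment_query("pizza", "delicious"))
```

Each turn's answer span feeds the next turn's query, which is what establishes the relationships between subtasks that the framework exploits.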
The second possible framework is to develop a Seq2Seq learning framework by
directly generating the required sentiment elements in the natural language form.
Zhang et al. (2021c) proposed a generative framework comprising two types of
modules: annotation-style and extraction-style modeling. The former module adds
annotations to a given sentence to include label information in the construction
of the target sentence, while the latter module directly adopts the desired natural
language label of the input sentence as the target. Yan et al. (2021a) redefined every
subtask target as a mixed sequence of pointer indexes and sentiment class indexes,
converting all related subtasks into a unified generative formulation. With this unified
formulation, they utilized the pretrained Seq2Seq model BART to solve all ABSA
subtasks in an end-to-end framework.
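The pointer-index formulation can be sketched as follows (the index layout is an illustrative assumption in the spirit of Yan et al. (2021a), not their exact scheme):

```python
# Sketch of the unified generative formulation: each triplet
# (aspect span, opinion span, sentiment) becomes a flat sequence of
# token pointer indexes, with sentiment class indexes offset past the
# pointer positions so one decoder vocabulary covers both.

SENTIMENT_CLASSES = ["positive", "negative", "neutral"]

def triplet_to_indexes(n_tokens, aspect_span, opinion_span, sentiment):
    """Spans are (start, end) token positions; class ids start at n_tokens."""
    class_id = n_tokens + SENTIMENT_CLASSES.index(sentiment)
    return [*aspect_span, *opinion_span, class_id]

# "the pizza was delicious": aspect 'pizza' = (1, 1), opinion 'delicious' = (3, 3)
print(triplet_to_indexes(4, (1, 1), (3, 3), "positive"))  # [1, 1, 3, 3, 4]
```

Because every subtask target is expressed in this single index vocabulary, one Seq2Seq decoder can emit targets for all subtasks without architectural changes.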
B. Multimodal ABSA
The field of sentiment analysis faces both a challenge and an opportunity in analyzing emotions and opinions expressed through graphics and videos on social media platforms. This challenge arises from the rapid development of social networks and the increased expressive tendencies of individuals on these platforms. Specifically, the challenge lies in analyzing sentiment in multimodal data, including voice, image, and text. Given that there is often a strong connection between these forms of content, utilizing multimodal information can enhance the analysis of users' sentiments towards different aspects.
Yu et al. (2022) argued that previous approaches either use separately pretrained
visual and textual models or use vision-language models pretrained with general
pretraining tasks, which are inadequate to identify fine-grained aspects, opinions,
and their alignments across modalities. To tackle these limitations, they propose a
task-specific Vision-Language Pretraining framework, which is a unified multimodal
encoder-decoder architecture for all the pretraining and downstream tasks. They fur-
ther design three types of task-specific pretraining tasks from the language, vision,
and multimodal modalities, respectively. Ling et al. (2022) argue that existing methods do not effectively capture both coarse-grained and fine-grained image-target matching, which includes the relevance between the image and the target, as well
as the alignment between visual objects and the target. To address this issue, they
proposed a new MTL architecture called the Coarse-to-Fine Grained Image-Target
Matching Network. This architecture jointly performs image-target relevance classi-
fication, object-target alignment, and targeted sentiment classification.
Currently, much of the research in ABSA is centered around English and Chi-
nese languages, predominantly focusing on commercial domains such as products
and services. To address the scarcity of data for languages and domains with limited
resources, cross-lingual and cross-domain learning is necessary in the domain of
ABSA. This approach requires an understanding of the world knowledge pertaining
to relevant aspects, in addition to their linguistic meanings. World knowledge should
be independent of languages, because it reflects the properties of entities, cultures,
and information of the world. Ontology serves as a formal model of world knowl-
edge, delineating the categories of entities existing in the world and the relationships
between them.
The aspects of a concept cannot be naïvely defined by meanings. For exam-
ple, “dog” may include properties such as “color”, “size”, “weight” and “kind”.
While these properties are related to the concept of a dog, they may not be considered semantically central to the definition of a dog. The intra-semantic relevance of these properties is also weak. Therefore, in aspect extraction, without access to
world knowledge, identifying all aspects associated with an unseen entity becomes
challenging, as different aspects may carry varying meanings and contextual associa-
tions. Additionally, transferring knowledge learned from supervised learning across
different entities is complex, as different entities may be associated with distinct
aspects. For example, a classifier trained on restaurant-related aspects cannot be di-
rectly applied to extract aspects related to dogs. Incorporating a systematic ontology
can facilitate cross-language and cross-domain transfer learning. In this context, the
learning system is not focused on detecting similar aspects from labeled datasets.
Instead, its objective is to understand the relationship between entities and their
properties based on an ontology and textual data. As ontologies encapsulate world
knowledge, which is only loosely dependent on languages, this learning paradigm
can aid in cross-language inference as well.
The NLU suitcase model proposed in this textbook (Fig. 1.12) can be applied to any
downstream task, e.g., polarity detection, emotion detection, text summarization,
QA, information retrieval, topic modeling, text classification, machine translation,
dialogue systems, and more. Depending on the task, one or more modules may be
omitted. For example, if training/test data do not contain any sarcasm, the sarcasm
detection module is unnecessary. In this textbook, we pick polarity detection as the downstream task. In its simplest form, it can be solved by a binary classifier that
merely categorizes text as either “positive” or “negative”. Catching all the nuances
of how polarity can be expressed (either explicitly or implicitly) in natural language,
however, requires a more sophisticated model like the NLU suitcase model.
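The "simplest form" mentioned above can be sketched as a tiny lexicon-based binary classifier (the lexicon is a toy assumption; it is the baseline the suitcase model is contrasted against, not a serious system):

```python
# Minimal sketch of binary polarity detection: count hits against a
# hand-made lexicon and compare positive vs. negative evidence.

POSITIVE = {"good", "great", "love", "beautiful"}
NEGATIVE = {"bad", "worst", "hate", "boring"}

def polarity(text):
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score >= 0 else "negative"  # ties default to positive

print(polarity("I love this beautiful movie"))  # positive
print(polarity("the worst boring film"))        # negative
```

Such a classifier fails on negation, sarcasm, and implicit polarity, which is precisely why the richer suitcase modules are needed.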
Polarity detection has received increasing attention from the research community
in the past decade mainly because of the rise of massive UGC such as movie reviews
and product comments. Unlike formal documents where “factual” and “neutral”
content is in the majority, UGC contains a large number of subjective expressions
from which valuable information can be extracted. Traditional research analyzes the
sentiment polarities, based on textual meanings. The hypothesis is that the words
and phrases used in a text carry certain emotional weights that can be classified as
positive, negative, or neutral. Recently, Zhu et al. (2024b) proposed personalized
sentiment analysis, suggesting that sentiment polarities should be defined by the
perceived impressions from the text. This approach acknowledges that individuals
may perceive sentiment differently from the same message, such as a proposal to
dance in front of a crowd. Therefore, sentiment analysis should incorporate personalized information to differentiate the perceived sentiment accordingly. Research on polarity detection also stimulates interest from industry, as it is used in different application scenarios, such as stock market forecasting (Ma et al., 2023,
2024). In the financial domain, while information that causes stock market fluctu-
ations can be considered positive or negative (Du et al., 2024c), Ma et al. (2023)
argued that market sentiment does not always align with the semantic sentiment of
text. For example, raising debt may cause different investor reactions depending on
the market environment.
Besides text, polarity detection is also applied to images and speech (Fan et al.,
2024). The goal of image polarity detection is to differentiate images raising positive
emotions in users from images causing negative emotions (Ragusa et al., 2019).
An image may contain much information such as facial expressions (Cowen et al.,
2021), body posture (Ferres et al., 2022), gestures (Keshari and Palaniswamy, 2019;
Atanassov et al., 2021), and scene context information (Kosti et al., 2019; Yang
et al., 2021c), which significantly increases the complexity of the task. In recent years, researchers have devoted considerable effort to the difficulties of image polarity detection, e.g., a series of facial expression encoding methods (Li et al.,
2009; Soleymani et al., 2015) have been built on the basis of Facial Action Coding
System (Ekman and Friesen, 1978) to accurately describe the face and recognize the
polarity afterward.
4.6 Downstream Task 315
Nowadays, smartphone users interact frequently with voice assistants such as Siri
and Google Assistant. It is essential for voice assistants to identify the sentiment of the
user and generate empathetic responses accordingly. Compared with text, audio may
contain rich tone information which is a good indicator for determining the polarity of
the speaker. For instance, a high pitch is usually linked with a positive polarity while
low-frequency tones often correspond to negative polarity (Chaturvedi et al., 2022).
The main challenge of speech polarity detection is how to learn an intermediate
representation of input speech signal without any manual feature engineering (Latif
et al., 2021) in an uncontrolled natural environment (Fahad et al., 2021).
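One low-level acoustic cue mentioned above, pitch, can be estimated by autocorrelation; the sketch below runs on a synthetic sine wave (real speech polarity systems learn such representations rather than hand-engineering them, which is the very challenge the paragraph describes):

```python
# Illustrative pitch estimation by autocorrelation: pick the lag in the
# plausible voice range whose autocorrelation is maximal, and convert
# it to a frequency. The input here is a synthetic 200 Hz tone.

import math

def estimate_pitch(signal, sample_rate, fmin=80, fmax=500):
    best_lag, best_corr = 0, 0.0
    for lag in range(sample_rate // fmax, sample_rate // fmin + 1):
        corr = sum(signal[i] * signal[i - lag] for i in range(lag, len(signal)))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag if best_lag else 0.0

sr = 8000
tone = [math.sin(2 * math.pi * 200 * t / sr) for t in range(800)]  # 200 Hz tone
print(round(estimate_pitch(tone, sr)))  # 200
```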
Despite the fact that most words, idioms, or phrases may occur in both affirmative
and negative sentences, there are some which may occur only in affirmative, or
only in negative sentences. These words, idioms, or phrases are termed “polarity-
sensitive” (Baker, 1970). Here are some examples:
(11) I would rather go to New York.
(12) I would rather not go to New York.
(13) She did pretty well on the project.
(14) She didn’t do pretty well on the project.
In this case, “polarity-sensitive” is a strict concept that covers only a small range of
polarity-sensitive words, idioms, or phrases. Fauconnier (1975a) described a more
general polarity with the extended notion: semantic and pragmatic polarity based on
the following observations:
• Some words or phrases are polarized with respect to logical structures.
• Some words or phrases are polarized with respect to context, which means their
polarity can therefore vary from context to context.
• The polarity items such as words and phrases have the property of polarity-
reversal (Baker, 1970).
It was noted in the work of Fauconnier (1975b) that grammatical superlatives could
yield quantificational effects in some sentences:
(15) The most stylish suit looks bad on Maxwell (“Any suit looks bad on
Maxwell”).
(16) Maxwell cannot solve the simplest question (“Maxwell cannot solve any
question”).
In the above sentences, superlatives such as “the most stylish” and “the simplest”
serve as a universal quantifier in terms of semantics. Nonetheless, there are still many
sentences in which the superlative cannot occur even with the semantic value:
(17) The most stylish suit looks good on Maxwell (≠ “Any suit looks good on
Maxwell”).
(18) The simplest question is easy for Maxwell (≠ “Any question is easy for
Maxwell”).
The aforementioned examples illustrate that superlatives are not always quantifi-
cational. Nonetheless, a weak polarity principle can be formulated: a quantifying
superlative that is appropriate for use in an affirmative sentence is generally un-
suitable for use in the corresponding negative sentence. According to Fauconnier
(1975a), quantifying superlatives are polarized in compliance with this principle.
As mentioned, the polarity of some words or phrases may change from context to
context. Consider the following sentences:
(19) Even Maxwell doesn’t understand “Elements”.
(20) Even the monk is tempted to use contraceptives.
These sentences are suitable only in certain contexts that are compatible with the
“premise” of the even-phrase. For instance, Example (20) is appropriate only if it
is assumed that the monk is highly unlikely to be tempted to use contraceptives;
otherwise, it would not be appropriate. It turns out that, in a given context, an
even-phrase is suitable in an affirmative sentence, but not suitable in the negative
counterpart of that sentence:
(21) Even Maxwell understands “Elements”.
(22) Even the monk is not tempted to use contraceptives.
It is concluded from the preceding discussion that it is impossible to lexically mark words, idioms, or phrases whose polarity depends heavily on the logical structure and the context within the sentence. Despite all this, in the 1980s and 1990s, Krifka (1992, 1994), Kadmon and Landman (1993), and Lee and Horn (1994) attempted to offer lexical semantics explanations for the distribution of polarity-sensitive items (words, idioms, phrases). The goal is to find general properties that unite a large number of heterogeneous polarity-sensitive items. Israel (1996) argued that polarity
sensitivity arises from the interaction of two lexical semantic properties, quantitative
and informative values (Kay, 1990), where the quantitative value reflects the fact
that a sizable portion of polarity-sensitive items encodes some notion of amount or
degree, and the informative value reflects the fact that some propositions are more
informative than others in context.
Contextual operators are lexical items or grammatical constructions whose semantic value consists, at least in part, of instructions to find in, or impute to, the context a certain kind of information structure and to locate the information presented by the sentence within that information structure in a specified way (Kay, 1990).
The polarity of words, idioms, or phrases has been thoroughly studied from the
perspective of linguistics. The polarity of sentences, paragraphs, or documents is
defined as an opinion toward a topic, rather than a linguistic concept.
Opinions are not always associated with sentiment. For instance, the sentence “I believe Jacky Chan is a famous star” contains no sentiment but a claim. Hence, there are mainly two types of annotation schemes for polarity, i.e., 1) binary classification, where Y ∈ {Positive, Negative}; 2) 3-class classification, where Y ∈ {Positive, Negative, Neutral}. Specifically, the annotation schemes of word-
level, phrase-level, sentence-level, paragraph-level, and document-level text are
slightly different from each other. Besides, when annotating, it is essential to take
the domain of the text into consideration.
In the beginning, word-level annotation is conducted in a supervised manner
or by experts, e.g., Opinion Lexicon (Hu and Liu, 2004), NTUSD (Ku et al.,
2006), SentiWordNet (Baccianella et al., 2010), SO-CAL (Taboada et al., 2011)
and AFINN (Nielsen, 2011). The quality of these lexicons is satisfactory; however, building them requires substantial human effort, and they do not generalize well across domains.
To this end, researchers developed many data-driven sentiment lexicon construction
methods to annotate the polarity of words in a semi-supervised way. Typically, these
approaches (Rao and Ravichandran, 2009; Velikovich et al., 2010; Feng et al., 2012;
Tang et al., 2014a; Saif et al., 2016; Wu et al., 2016; Li et al., 2018b) employ a
limited set of seed words (either through manual annotation or existing sentiment
lexicons) to disseminate sentiment labels to a large number of unlabeled words and
thus yield a large, domain-specific sentiment lexicon.
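The seed-propagation idea can be sketched as label averaging over a word co-occurrence graph (the graph, seeds, and update rule below are toy assumptions; the cited methods use large corpora and more careful propagation schemes):

```python
# Hedged sketch of semi-supervised lexicon construction: sentiment
# scores propagate from a few seed words to unlabeled words over a
# word-similarity graph by averaging neighbour scores.

def propagate(graph, seeds, iterations=3):
    scores = dict(seeds)  # word -> score in [-1, 1]; seeds stay fixed
    for _ in range(iterations):
        updated = dict(scores)
        for word, neighbours in graph.items():
            known = [scores[n] for n in neighbours if n in scores]
            if known and word not in seeds:
                updated[word] = sum(known) / len(known)
        scores = updated
    return scores

graph = {"excellent": ["good"], "good": ["excellent", "great"],
         "awful": ["bad"], "bad": ["awful"]}
seeds = {"good": 1.0, "bad": -1.0}
scores = propagate(graph, seeds)
print(scores["excellent"] > 0, scores["awful"] < 0)  # True True
```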
Wilson et al. (2005) proposed to identify contextual polarity for a large number
of phrase-level sentiment expressions. Annotators are required to tag the polarity
of subjective expressions as positive, negative, both, or neutral. Here, both is ap-
plicable to expressions that have both positive and negative sentiments. In addition
to expressions that do not contain polarity, neutral is extended to a different type
of subjective expression like speculation. Below are the annotation examples of the
contextual polarity annotations:
1) Moreover, politicians use the concepts of good and evil (both) solely for the
purposes of intimidating and exaggerating.
2) Throughout the night, a vast number of individuals who supported the coup
celebrated (positive) by waving flags and blowing whistles.
3) According to Jerome, the hospital does not feel (neutral) any different from a
hospital in the United States.
Annotation acquisition is a crucial step in developing supervised classifiers. How-
ever, it is time-consuming and cost-intensive to annotate the polarity of texts solely
by experts. Therefore, crowdsourcing (Howe, 2008) is introduced to quickly acquire
annotations for the purposes of constructing all kinds of predictive models. Hsueh
et al. (2009) studied the difference between expert annotators and non-expert annotators (e.g., AMT workers) in terms of annotation quality. They found that the quality of
labels can be improved by eliminating noisy annotators and ambiguous examples.
It is also proven that quality measures are useful for selecting annotations that give
rise to more accurate classifiers.
4.6.3 Datasets
Initially, polarity detection was performed mostly on movie reviews, product reviews,
and social media posts in English. Subsequently, many polarity detection datasets for
low-resource languages were constructed. Maas et al. (2011) proposed a large movie
review dataset32, including 25,000 training instances and 25,000 testing instances
with positive or negative labels sourced from IMDB. Negative reviews are defined
as those with a score of 4 or less out of 10, while positive reviews are those with a
score of 7 or more out of 10. Neutral reviews are excluded from the dataset.
Movie review 1: “This short deals with a severely critical writing teacher whose
undiplomatic criticism extends into his everyday life. When he learns why that’s not
a good idea, we learn a bit about the beautiful craft of writing that he’s been
lecturing on.”,
Label 1: “Positive”
Movie review 2: “I found this movie really hard to sit through, my attention kept
wandering off the tv. As far as romantic movies go..this one is the worst I’ve seen.
Don’t bother with it.”,
Label 2: “Negative”
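The labelling rule described above can be written as a small function (a sketch of the dataset-construction logic, not code from Maas et al.):

```python
# IMDB labelling rule: score <= 4 is negative, score >= 7 is positive,
# and neutral reviews (5-6) are excluded from the dataset.

def imdb_label(score_out_of_10):
    if score_out_of_10 <= 4:
        return "negative"
    if score_out_of_10 >= 7:
        return "positive"
    return None  # neutral review: excluded

print(imdb_label(3), imdb_label(8), imdb_label(5))  # negative positive None
```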
Pang and Lee (2004) introduced a movie dataset33 with 2,000 reviews written
before 2002. They extracted at most 20 reviews for each author (312 authors in
total). They also released a subjectivity dataset that contains 5,000 subjective and
5,000 objective reviews from Rotten Tomatoes. Only sentences or snippets with more
than 10 tokens are selected. Go et al. (2009) introduced a social media post dataset (Sentiment140)34 extracted from the X API. There are 1,600,000 tweets in this dataset, annotated using distant supervision.
Tweet 1: “Maxwell is my new best friend.”,
Label 1: “Positive”
Tweet 2: “just landed at New York.”,
Label 2: “Neutral”
Tweet 3: “Math exam studying ugh”,
Label 3: “Negative”
32 [Link]
33 [Link]
34 [Link]
35 [Link]
Zhang et al. (2015b) built a dataset36 for binary polarity detection. The authors
provided a set of 560,000 highly polar Yelp reviews for training and 38,000 for
testing by considering stars 1 and 2 negative, and 3 and 4 positive. Nakov et al.
(2013) proposed a message polarity detection task and a corresponding X dataset.
The authors extracted tweets from the X API and crowdsourced them on AMT for
annotation. Li et al. (2018b) presented a Chinese Tourism Review dataset, consisting
of reviews of popular tourism products. The authors manually annotated 30,180
tourism reviews and created both a balanced dataset with 7,995 instances and a
real-world dataset containing all annotated instances.
Knowledge bases are valuable tools for polarity detection in sentiment analysis
because they provide a structured repository of factual information, relationships,
and concepts that enhance the understanding of text. Knowledge repositories such
as SentiWordNet (Baccianella et al., 2010), SenticNet (Cambria et al., 2024) and
FinSenticNet (Du et al., 2023b) (introduced earlier) include dictionaries, ontologies,
and databases that encompass various domains, providing context and background
knowledge that can be crucial for accurate sentiment interpretation.
As discussed earlier in the textbook, knowledge bases can enable important sub-
tasks, e.g., WSD and NER, and hence, increase the accuracy of polarity detection.
Moreover, they can offer insights into metaphorical expressions and cultural nuances
that might not be evident from the text alone. For example, understanding that the
phrase “break the ice” means to alleviate tension can prevent misinterpretation of its
literal meaning, which might not directly convey sentiment. By linking such expres-
sions to their underlying meanings, knowledge bases help in correctly identifying
polarity. Knowledge bases also facilitate understanding relationships between enti-
ties and concepts, which is crucial for detecting sentiment polarity. For instance, if
a text mentions a company’s stock dropping after a ‘scandal’, a knowledge base can
provide background on the negative connotations of ‘scandal’ and its typical impact
on public perception.
In the era of LLMs, knowledge bases are still important because they provide a
reliable source of factual accuracy and consistency that LLMs alone cannot guar-
antee. LLMs generate responses based on patterns in the data they were trained on,
which may not always be up-to-date or factually accurate. Knowledge bases, on the
other hand, are curated and regularly updated repositories of structured information,
ensuring that the data they contain is accurate and reliable. Furthermore, knowledge
bases offer domain-specific expertise that LLMs may lack. While LLMs are gener-
alists, capable of discussing a wide range of topics, they may not possess the depth
of knowledge needed for specialized fields like medicine, law, or finance.
36 [Link]
4.6 Downstream Task 321
The structured nature of knowledge bases also complements the unstructured data
processing capabilities of LLMs. Knowledge bases use ontologies and structured for-
mats to organize information, making it easier to understand relationships between
different concepts and entities. This structure is particularly useful for complex
queries where precise relationships matter, something that LLMs might not inher-
ently handle well due to their training on unstructured data. Additionally, knowledge
bases offer a way to maintain temporal consistency, especially for information that
frequently changes. Since LLMs are trained on datasets that may quickly become
outdated, they can struggle to provide accurate information on recent developments.
Knowledge bases, with their regular updates and mechanisms for tracking changes,
help ensure that the information remains relevant and up-to-date.
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|, \qquad (4.1)$$

where n is the total number of observations, y_i is the actual value for the i-th
observation, ŷ_i is the predicted value for the i-th observation, and |y_i − ŷ_i| is the
absolute difference between the actual and predicted values for the i-th observation.
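Equation (4.1) can be computed directly; a minimal sketch:

```python
def mean_absolute_error(y_true, y_pred):
    """MAE = (1/n) * sum of |y_i - yhat_i| over all n observations."""
    assert len(y_true) == len(y_pred) and len(y_true) > 0
    return sum(abs(y - yh) for y, yh in zip(y_true, y_pred)) / len(y_true)

print(mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))  # 0.5
```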
4.6.6 Methods
A. Statistics-based Approach
The main idea of statistics-based approaches is to design proper statistical rules
to extract potential sentiment words from documents. By observing, analyzing, and
summarizing corpora, researchers have identified effective statistics-based rules such
as pointwise mutual information (PMI) (Church and Hanks, 1990), symmetrical
conditional probability (Bu et al., 2010), and enhanced mutual information (Zhang
et al., 2009). PMI is an effective statistical indicator that measures the statistical
relatedness/similarity between two words. PMI is calculated as follows:
$$PMI(w_i, w_j) = \log \frac{n(w_i, w_j)/N}{(n(w_i)/N)\,(n(w_j)/N)}, \qquad (4.2)$$

where n(w_i, w_j) refers to the number of documents in which words i and j co-occur,
and N is the total number of documents. Supposing that w_i is a word with known
polarity, if PMI(w_i, w_j) is high, then w_j is likely to share the same polarity as w_i.
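Equation (4.2) can be computed directly from document-level occurrence and co-occurrence counts; a minimal sketch with toy counts:

```python
import math

def pmi(n_ij, n_i, n_j, N):
    """PMI(w_i, w_j) = log( (n(w_i,w_j)/N) / ((n(w_i)/N) * (n(w_j)/N)) )."""
    return math.log((n_ij / N) / ((n_i / N) * (n_j / N)))

# Two words each appearing in 100 of 1,000 documents, co-occurring in 50:
print(pmi(50, 100, 100, 1000))  # ≈ 1.61 (= log 5)
```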
Feng et al. (2012) provided an unsupervised method to learn a sentiment lexicon
from massive microblog data with emoticons. In the paper, emoticons act as
noisy labels and are partitioned into positive and negative sets. If a word w_i is more
similar to positive emoticons (PE) than negative emoticons (NE), i.e., score(w_i) =
PMI(w_i, PE) − PMI(w_i, NE) > 0, then the polarity of w_i is regarded as positive,
otherwise negative. Emoticons make it feasible to build a domain-specific sentiment
lexicon in an unsupervised manner. However, it cannot detect user-invented words
that frequently appear on social media platforms, which may reduce its practical
usage. To this end, Wu et al. (2016) extended the previous work by introducing
new word detection for Chinese microblogs. The authors also used three types of
knowledge to assign sentiment intensities to words. Nevertheless, their method did not consider
semantic knowledge. Subsequently, Li et al. (2018b) further improved Chinese new
word detection in the tourism domain via assembled mutual information (AMI), and
introduced semantic similarity knowledge to refine sentiment intensities of words.
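The emoticon-based polarity decision of Feng et al. (2012) reduces to a sign test on a PMI difference. A toy sketch under assumed counts (the counts and the helper below are illustrative, not from the paper):

```python
import math

def pmi(n_xy, n_x, n_y, N):
    """Pointwise mutual information from (co-)occurrence counts."""
    return math.log((n_xy / N) / ((n_x / N) * (n_y / N)))

def emoticon_polarity(n_w, n_pe, n_ne, n_w_pe, n_w_ne, N):
    """score(w) = PMI(w, PE) - PMI(w, NE); positive score => positive word."""
    score = pmi(n_w_pe, n_w, n_pe, N) - pmi(n_w_ne, n_w, n_ne, N)
    return "positive" if score > 0 else "negative"

# A word co-occurring far more often with positive emoticons:
print(emoticon_polarity(200, 5000, 5000, 150, 10, N=100_000))  # positive
```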
In summary, statistics-based methods do not require extra human effort and can
easily extend the coverage of lexicons. Besides, they consider user-invented words and
utilize knowledge from different sources (statistical knowledge, semantic knowledge,
prior knowledge) to determine relatively precise sentiment intensities for words. One
drawback of these methods is that they require massive data; without enough data,
the estimated sentiment intensities may be biased.
B. Semantic-based Approach
Kamps et al. (2004) put forward an approach that calculates the semantic similarity
between adjectives and the evaluative reference words “good” and “bad”. If a word
is closer to the reference word “good” than to “bad” in terms of distance, it is regarded
as positive, otherwise negative. Here, the distance is computed as the shortest path in
WordNet between the word and “good” (“bad”), i.e.,

$$EVA(w) = \frac{d(w, \text{bad}) - d(w, \text{good})}{d(\text{good}, \text{bad})}.$$

However, WordNet-based measures of distance or similarity mainly focus on taxonomic
relations, which makes it almost impossible to handle the noun and adverb categories
in WordNet.
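The EVA measure only requires shortest-path distances in a word graph; a self-contained sketch using breadth-first search over a toy graph (the edges below are invented for illustration and do not reflect real WordNet structure):

```python
from collections import deque

def shortest_path(graph, src, dst):
    """BFS shortest-path length in an undirected word graph."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")

def eva(graph, word):
    """EVA(w) = (d(w, bad) - d(w, good)) / d(good, bad); > 0 means positive."""
    return (shortest_path(graph, word, "bad")
            - shortest_path(graph, word, "good")) / shortest_path(graph, "good", "bad")

# Toy adjacency (illustrative; real WordNet paths differ):
toy = {
    "good": ["fine", "bad"],
    "fine": ["good", "nice"],
    "nice": ["fine"],
    "bad": ["good", "awful"],
    "awful": ["bad"],
}
print(eva(toy, "nice"))   # > 0: closer to "good" than to "bad"
print(eva(toy, "awful"))  # < 0: closer to "bad"
```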
Velikovich et al. (2010) presented a graph propagation method to build polarity
lexicons. It starts from a small set of seed polarity words and then propagates
information from the seed set to the rest of the graph through the edges. The weights of
edges are obtained from the cosine similarity of the context vectors of the two words.
The proposed propagation is designed to overcome the drawbacks of standard label
propagation algorithms. However, a destination word may be influenced by a seed
word through multiple paths, resulting in an inflated polarity score. Finally, it is
worth noting that the cosine similarity of context vectors is only an indirect way of
measuring the semantic similarity of two words, which may be inaccurate or biased.
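The edge weights in such a propagation graph are cosine similarities between context vectors; a minimal sketch (the count vectors are toy numbers):

```python
import math

def cosine(u, v):
    """Cosine similarity between two context-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Context-count vectors over a shared vocabulary (toy numbers):
print(cosine([3, 0, 1], [2, 0, 2]))
```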
Viegas et al. (2020) proposed an unsupervised method to expand sentiment lex-
icons by exploiting semantic relationships. Specifically, the authors employed pre-
trained word embeddings to directly calculate the semantic similarity between words
based on the hypothesis that a closer distance between word vectors indicates a
closer polarity and intensity. This unsupervised method provided a simple, yet very
effective, technique for improving the coverage of sentiment lexicons. Although
the empirical study demonstrated the flexibility and effectiveness of the proposed
method, the directly computed semantic similarity between word embeddings cannot
avoid a typical error: words that are close in vector space may have opposite
meanings, e.g., the semantic similarity between “pleasant” and “unpleasant”
is 0.7451 according to spaCy. Therefore, a more accurate measurement of polarity
similarity should be the future research focus for semantic-based sentiment lexicon
construction approaches.
The majority of the existing deep learning-based polarity detection models are
trained on large labeled data. In general, these supervised methods significantly
improve the performance of polarity detection on many benchmark datasets. RNN
techniques (Basiri et al., 2021), such as GRU (Cheng et al., 2020) and LSTM (Imran
et al., 2020), have proven very effective for polarity detection, as they naturally
model sequential text data. In recent years, the attention mechanism (Vaswani et al.,
2017) and transformer-based techniques, especially PLMs such as BERT, have become
the standard paradigm, yielding state-of-the-art performance on almost all benchmark
datasets. In this subsection, we introduce the trends of PLM-based polarity detection
models.
A. PLM as Contextualizer
Pota et al. (2020) provided an effective BERT-based pipeline for X polarity detection.
The authors converted Italian tweets into plain text by removing emoticons,
emojis, hashtags, etc., and used the result to fine-tune the BERT model. It is worth noting that
overall positive (opos) and overall negative (oneg) are analyzed by two classification
tasks in this paper. Hence, the final polarity can be derived by the combination of
opos and oneg, e.g., oneg(0) + opos(0) = Neutral, oneg(1) + opos(0) = Negative,
oneg(1) + opos(1) = Mixed. One insightful conclusion is that preprocessing can
significantly improve the performance of fine-tuning as BERT is also trained on plain
text. Catelli et al. (2022) presented a framework to detect deceptive reviews and iden-
tify the sentiment polarity of the review by a multi-label classification mechanism.
The label powerset technique is used to transform the multi-label problem into a
single-label problem by considering the 2^{|L|} possible combinations (deceptive positive,
deceptive negative, truthful positive, and truthful negative) of the transformed labels,
where L is the set of non-disjoint labels. Reviews with deceptive and polarity labels
are used to fine-tune BERT to enable dependency modeling between polarity and
opinion truthfulness, yielding better performance over baselines. Nevertheless, the
transformation is applicable only when the number of non-disjoint labels is low,
preventing it from being widely applied to similar problems.
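The combination rule used by Pota et al. (2020), mapping the two binary outputs (oneg, opos) to a final polarity, can be sketched directly; the (0, 1) case below is the implied fourth combination not spelled out in the text:

```python
def combine_polarity(oneg: int, opos: int) -> str:
    """Map (overall-negative, overall-positive) binary outputs to a polarity."""
    return {
        (0, 0): "Neutral",
        (1, 0): "Negative",
        (0, 1): "Positive",  # implied fourth combination
        (1, 1): "Mixed",
    }[(oneg, opos)]

print(combine_polarity(1, 1))  # Mixed
```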
In recent years, prompt-tuning (Brown et al., 2020) PLM approaches for polarity
detection have attracted more attention from the research community as they yield better
performance on a wide range of NLU tasks such as polarity detection (Gao et al.,
2021b) and relation extraction (Han et al., 2022b). Li et al. (2021a) put forward a
model named SentiPrompt for aspect-based polarity detection (Pontiki et al., 2016)
leveraging sentiment knowledge enhanced prompts to tune the PLM in the unified
framework. Here, a typical sentiment knowledge enhanced prompt template looks
like this: “The A is O? [MASK]. This is [MASK]”, where A and O are randomly
sampled aspect and opinion terms (e.g., A is sampled from {Sushi, Price}, O is
sampled from {Good, High}). [MASK] can be either “yes” (consistent with ground truth)
or “no” (inconsistent with ground truth). BART (Lewis et al., 2020) with pointer
network (Vinyals et al., 2015b) is used as an Encoder-Decoder module for output
(including [MASK] filling) generation. SentiPrompt surpasses the strongest base-
lines on ABSA by a large margin. Nonetheless, the effect of prompts with different
templates remains unclear.
Mao et al. (2023c) conducted a systematic study on prompt-based polarity detec-
tion to investigate the influence of prompt templates. The paradigm of the prompt-
based PLM is to add a prompt with a [MASK] token upon a sentence (e.g., “I feel
[MASK], the movie is very horrible.”) and then predict the probabilities of predefined
emotion words (e.g., joyful, sad, and frustrated) in the [MASK] position. The
polarity of the sentence can be induced by label-word mapping, where each polarity
corresponds to a set of sentiment words. One advantage of prompt-based
approaches over fine-tuning approaches is that they do not require annotating large-scale
datasets, which is especially valuable for low-resource languages. This study also observed
that different prompt positions deliver different results, which indicates that the PLM
is position-biased: cognitively, the position of a prompt does not change the polarity of
the original input. Thus, it is essential to design proper prompts to achieve optimal
performance.
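The label-word mapping step described by Mao et al. (2023c) can be sketched as follows, assuming we already have [MASK]-position probabilities for a set of emotion words; the word sets and probabilities below are illustrative, not taken from the paper:

```python
# Map [MASK]-position probabilities of emotion words to a polarity label by
# summing probability mass per polarity class (illustrative word sets).
LABEL_WORDS = {
    "positive": {"joyful", "happy"},
    "negative": {"sad", "frustrated"},
}

def induce_polarity(mask_probs: dict) -> str:
    scores = {
        label: sum(mask_probs.get(w, 0.0) for w in words)
        for label, words in LABEL_WORDS.items()
    }
    return max(scores, key=scores.get)

# e.g., probabilities for "I feel [MASK], the movie is very horrible.":
print(induce_polarity({"joyful": 0.05, "sad": 0.40, "frustrated": 0.30}))
```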
Du et al. (2024b) introduced an innovative framework called Hierarchical Prompt-
ing for Financial Sentiment Analysis. This framework aims to enhance the under-
standing of crucial factors in financial sentiment analysis for LLMs, focusing on
semantic overview, contextual exploration, and influencing variables. It was ob-
served that some LLMs lack the logical reasoning capabilities required for effective
financial sentiment analysis. To address this, the hierarchical prompting framework
queries additional knowledge from the LLMs, structuring the reasoning process to
ultimately improve their performance. Finally, the work of Zhu et al. (2024b) (introduced
earlier) also belongs to this group of PLM-based approaches to polarity detection.
In particular, an LLM was employed for tailored sentiment analysis, based on seven
different levels of personalization. Subsequently, the LLM was prompted to evaluate
the sentiment perceptions between two interlocutors.
Fig. 4.5: The Hourglass of Emotions revisited (Susanto et al., 2020). This emotion
categorization model represents affective states both through labels and through four
independent but concomitant affective dimensions, which can potentially describe
the full range of emotional experiences that are rooted in any of us.
Graphs are a popular and widely used knowledge representation tool. One of the
typical representation forms is the subject-predicate-object (SPO) triplet, where
subject and object are entities (e.g., people, places, things), and the predicate (e.g.,
is_a, is_located_in) is the relation pointing from subject to object. The knowledge
recorded in triplets is commonsense knowledge, yet it is essential for people to
understand sentences, especially less informative ones. Researchers have developed
effective methods to represent graph-based knowledge and combine it with connectionist
models (neural networks) for a more powerful polarity detection ability.
Liao et al. (2022) presented a framework named KG-MPOA using dynamic
commonsense knowledge for Chinese implicit polarity detection. Compared with
explicit polarity detection, the implicit one is more challenging due to the lack of
sentiment words. To this end, external symbolic knowledge triplets from ConceptNet
are fused to the semantic representation of implicit sentences. To embed the graph-
based knowledge, the authors designed a pipeline for knowledge distillation, i.e.,
subjective triplet distillation (step 1), literal-related triplet distillation (step 2), noisy
triplet filtering (step 3), and semantic-related triplet distillation (step 4). Specifically,
step 1 extracts triplets whose subject or object can be retrieved from an existing
sentiment lexicon. Step 2 removes triplets whose subject and object cannot be found
in input sentences. Step 3 is used to filter certain triplet relations (e.g., “RelatedTo”).
Step 4 leverages BERT to compute the similarity between an input sentence and a
triplet-converted expression. The distilled triplets are regarded as a knowledge graph
and encoded by a dynamic graph attention layer integrating into the neural module.
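The four-step triplet distillation of Liao et al. (2022) is essentially a filter cascade over SPO triplets. A simplified sketch of steps 1-3 follows (step 4 needs a BERT similarity model and is omitted; the triplets, lexicon, and tokens are toy data):

```python
# Simplified sketch of the KG-MPOA triplet distillation cascade.
# A triplet is (subject, predicate, object); steps 1-3 are pure filters.
def distill(triplets, lexicon, sentence_tokens,
            noisy_relations=frozenset({"RelatedTo"})):
    # Step 1: keep triplets whose subject or object is in the sentiment lexicon.
    kept = [t for t in triplets if t[0] in lexicon or t[2] in lexicon]
    # Step 2: keep triplets whose subject or object appears in the input sentence.
    kept = [t for t in kept if t[0] in sentence_tokens or t[2] in sentence_tokens]
    # Step 3: drop noisy relations such as "RelatedTo".
    return [t for t in kept if t[1] not in noisy_relations]

triplets = [
    ("storm", "RelatedTo", "rain"),
    ("storm", "CausesDesire", "fear"),
    ("picnic", "IsA", "meal"),
]
print(distill(triplets, lexicon={"fear"}, sentence_tokens={"storm", "came"}))
```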
Another line of work uses COMET to generate inferential commonsense knowledge
(IC_Knwl) for each input news headline: given a news headline H, an inference
type I (e.g., xWant, xNeed, and xEffect), and the number of returned inferences N,
the symbolic knowledge is obtained as IC_Knwl = COMET(H, I, N). The generated
IC_Knwl is further processed to fit
the expression of natural language and then combined with the input news headline
for classification by PLM. The framework is designed as language-agnostic through
the translate-retrieve-translate (TRT) mechanism (Fang et al., 2022b). Experiments
conducted on news headline political polarity detection proved the effectiveness of
the aforementioned usage of graph-based symbolic knowledge.
FOL, also called quantified logic or predicate logic, is used to express the relationship
between objects by allowing for variables in predicates bound by quantifiers
(Wang and Yang, 2022). For example, “a cat is an animal” can be expressed in FOL as
∀x Cat(x) ⇒ Animal(x). Here, ∀ is the universal quantifier, and Cat and Animal are predicates.
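Over a finite domain, such a universally quantified rule can be checked directly; a toy sketch with invented domain elements:

```python
# Check the rule  forall x: Cat(x) -> Animal(x)  over a finite domain.
cats = {"tom", "felix"}
animals = {"tom", "felix", "rex"}

def rule_holds(domain):
    """True iff every cat in the domain is also an animal (material implication)."""
    return all((x not in cats) or (x in animals) for x in domain)

print(rule_holds({"tom", "felix", "rex", "hammer"}))  # True: every cat is an animal
```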
A few solutions have emerged to enable the combination of FOL-based symbolic
knowledge and neural networks. Huang et al. (2022a) presented a novel logic tensor
network (Badreddine et al., 2022) with massive rules (LTNMR) for aspect-based
polarity detection. Specifically, the authors integrate two types of knowledge into the
logic tensor network (LTN), i.e., dependency knowledge and human-defined knowl-
edge rule. A mutual distillation structure knowledge injection (MDSKI) strategy is
proposed to transfer dependency knowledge from teacher BERT to student network
LTNMR to achieve better performance. Human-defined knowledge is represented in
the FOL form R(x, a) ∧ emo+ → P(R(x, a), l+) and integrated into the LTN, where emo+
is the positive sentiment tag of a word from SenticNet, l+ is the polarity label, x is
the input text, and a is the aspect term. Intuitively, the logic encourages the polarity
of the whole sentence to be close to the implicit polarity of aspect-related words.
Zhang et al. (2022) advanced a sentiment interpretable logic tensor network
(SILTN) for aspect-term polarity detection. SILTN is interpretable as it is a neu-
rosymbolic formalism that naturally supports learning and reasoning about data
with differentiable FOL. FOL provides a flexible declarative language for conveying
high-level cognition and representing structured knowledge. Despite the effective-
ness of explainability, the performance of SILTN is still unsatisfactory due to its
relatively shallow network structure. Therefore, the authors proposed a two-stage
syntax knowledge distillation (TSynKD) strategy to improve the inferring accuracy
of SILTN. Specifically, a large BERT network serves as the first teacher, an
aspect-specific dynamic GCN (AsDGCN) as the second teacher, and SILTN as the student.
In the output distillation stage, BERT’s logits are used as AsDGCN’s learning ob-
jective. In the feature distillation stage, SILTN learns dependency knowledge from
AsDGCN and also uses BERT’s logits as its training objective.
In short, the FOL-based approaches are a promising solution for building trust-
worthy, explainable, and powerful AI. However, current approaches rely heavily
on the strong representation ability of the subsymbolic system. Hence, improving
the predictive performance of the symbolic system without overly leveraging the
pretrained subsymbolic system would be a future research focus.
In this chapter, human-robot interaction refers to conversations between human users
and robot agents. People often express their opinions on a topic or event in a conversation
with a robot agent. Hence, it is essential to identify the polarity or sentiment
of the user in such a sentiment-aware dialogue system. On one hand, this helps address
the deficiencies of a product by understanding the user's polarity toward it.
On the other hand, it assists in generating human-like empathetic responses that are
appreciated by the users. In recent years, many conversational polarity detection ap-
proaches have been presented to better recognize user polarities (Zhong et al., 2019;
Li et al., 2021b; Lee and Choi, 2021; Li et al., 2022; Lee and Lee, 2022; Li et al.,
2023b). These approaches may act as a pretrained classifier module in the empathetic
chatbot to help identify polarity and generate empathetic responses accordingly.
Tahara et al. (2019) proposed an empathetic dialogue system on the basis of
extracted X polarity. The authors assumed that the degree of a user’s empathy can
be increased if the system can generate an utterance from topic-related tweets that
contain the same polarity as the user’s utterance. The assumption is confirmed by a
preliminary experiment that the degree of empathy can be improved in the case of
a polarized user utterance. However, the proposed method cannot increase empathy
in the case of a neutral user utterance. Besides, the authors utilized the Google API37 to
measure polarity, whereas fine-grained emotions may provide more insights that are
beneficial for an empathetic dialogue system.
37 [Link]
In the context of RRSs, polarity detection can also be applied for analyzing
recommender system bias. It is believed that the algorithm of the news recommender
system could potentially form filter bubbles, leading users to become more selective
in their exposure to news. This could result in increased vulnerability to polarized
opinions and fake news. Alam et al. (2022) utilized the users’ polarities toward news
to analyze and quantify the degree to which the recommendation is biased. The
authors paired predefined questions with news articles and categorized each article
as “in favor” of or “against” the predefined question (e.g., on refugees and migration),
transferring German news-related polarity detection to a pretrained classifier. The
bias of a recommender system is measured by the average polarity bias score of
recommended articles and user history articles. If the score of recommended articles
is significantly different from that of user history articles, the recommender system is
considered to be biased. Experiments on four news recommender systems indicated
that text-based systems tend to amplify user attitudes related to polarity, while
knowledge-aware systems are less prone to bias.
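The bias measurement of Alam et al. (2022) reduces to comparing the mean polarity score of recommended articles against that of the user's history articles. A minimal sketch; the fixed threshold below is an illustrative assumption, whereas the paper relies on a statistical significance test:

```python
def polarity_bias(recommended_scores, history_scores, threshold=0.2):
    """Flag a recommender as biased if the mean polarity of recommended
    articles deviates from the user's history by more than a threshold."""
    mean = lambda xs: sum(xs) / len(xs)
    gap = abs(mean(recommended_scores) - mean(history_scores))
    return gap > threshold

# Recommendations push noticeably more positive articles than the user read:
print(polarity_bias([0.8, 0.9, 0.7], [0.1, 0.2, 0.3]))  # True
```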
The opinions of investors toward a stock, to some extent, reflect the future trend (up
or down) of the price of that stock (Du et al., 2024a). Hence, it is valuable to regard
investor polarity as an indicator to predict stock price. Derakhshan and Beigy (2019)
introduced a human sentiment model to identify investors’ opinions. The authors
collected comments with explicitly expressed label words such as “buy” and “sell”
and used these label words as handcrafted features for classification. Experiments
on both English and Persian datasets demonstrated that the classifier benefited from
investor polarity information.
Colasanto et al. (2022) presented a stock price prediction approach utilizing
a fine-tuned BERT-based model named AlBERTino to capture polarity from the
market. AlBERTino was fine-tuned on Italian financial news and outperformed its
base version AlBERTo (Polignano et al., 2019) by a large margin in terms of polarity
detection. The authors exploited the Bayesian inference (Bernardo and Smith, 2001)
to obtain a new set of bounded drift and volatility values on the basis of the polarity
score from AlBERTino. Finally, the exact future value of the price can be determined
on an hourly and daily basis. An empirical study indicated that the proposed approach
achieved better results for hourly predictions. This approach brings more insight by
predicting the exact value of the stock price, rather than only its trend (up or down).
Ma et al. (2023) proposed a framework for stock price movement prediction. The
model incorporated information from multiple sources, e.g., numerical indicators,
news, and company relationships. They believed that semantic sentiment is not equal
to market sentiment. Thus, they used the stock price movement directions as market
sentiment signals, training a sentiment representation generator. The generator is
used for embedding news data in the stock price movement classifier.
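The idea of Ma et al. (2023), using price-movement direction as a market-sentiment label for news, can be sketched as follows (the prices are toy numbers):

```python
# Label each trading day with the next-day price movement direction, which
# then serves as a (noisy) market-sentiment signal for training (sketch).
def movement_labels(closing_prices):
    """1 = price went up the next day, 0 = flat or down."""
    return [int(nxt > cur) for cur, nxt in zip(closing_prices, closing_prices[1:])]

print(movement_labels([100.0, 101.5, 101.0, 103.2]))  # [1, 0, 1]
```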
Stock portfolio allocation refers to the task of allocating funds among the stocks
to maximize profit. A profitable yet low-risk allocation strategy would be valued
by all investors. Koratamaddi et al. (2021) put forward a market sentiment-aware
DRL (Arulkumaran et al., 2017) framework for developing an optimal policy that
maximizes returns with minimal risk given historical stock prices, polarity information,
and the current portfolio. Specifically, observations, i.e., stock price movements
and polarity from social media, are fed into the agent's state. The actor and critic
networks are updated based on observations, and prediction errors are used to
update the Q-values. An empirical study showed that the proposed method achieved better
return values than the non-sentiment-aware baseline throughout the experimental
period. Ma et al. (2024) investigated the portfolio optimization task by simultane-
ously learning risk and return. This task is formulated as a ranking problem, where
stocks that are highly ranked based on predicted risk and return are included in the
portfolio for the next trading day. In a manner similar to (Ma et al., 2023), they also
utilized a pretrained news embedding generator, which was trained using market
sentiment signals.
4.6.8 Summary
The concept of polarity can be traced back to the 1970s. Researchers studied word-
to phrase-level polarity from the perspective of linguistics, which serves as the
theoretical base of polarity detection study. As discussed throughout this section,
polarity detection can be a downstream task within the NLU suitcase model but
also an upstream task for many other applications such as chatbots, recommender
systems, and financial forecasting. Thanks to the development of social media platforms
and mobile applications, UGC has grown rapidly over the past decade, which in
turn calls for accurate and automatic analysis of user polarity. Real-world polarity
detection shows promising commercial value as it can be applied to improving
customer service and product quality accordingly. It is worth noting that, while this
section mostly surveyed text-based polarity detection in English, multimodal and
multilingual polarity detection are also very popular research fields.
As shown in Table 4.23, polarity detection research is categorized into three streams:
lexicon-based, PLM-based, and neurosymbolic methods. Lexicon-based methods,
common in early research, aim to build comprehensive and accurate sentiment
lexicons but require significant manual effort and are not easily adaptable to specific
domains. Data-driven methods, including statistical and semantic approaches, later
automated lexicon creation, particularly for specific domains.
Table 4.23: Technical trends in polarity detection. NeSy stands for neurosymbolic
AI. SA denotes statistical analysis. DR denotes dynamic rewarding; TRT denotes a
translate-retrieve-translate strategy. LTN denotes logic tensor network.
Polarity detection research has gained increasing attention due to its importance
and promising application value in a number of downstream tasks. Despite the fact
that researchers have made great progress in identifying different kinds of (e.g.,
sentiment and political) polarities of users, several challenges remain unsolved. The
first challenge is context-dependent errors. People usually use sarcasm to express
their negative sentiments with positive words. For example, “Very good; well done!”
is sarcastic when someone does something wrong. One approach to addressing this
issue is to incorporate a sarcasm detection module as shown in the NLU suitcase
model.
The second challenge is limited training data in low-resource languages. Most
of the existing polarity detection datasets are collected from English platforms like
X and Amazon. Low-resource languages often have different characteristics (e.g.,
preprocessing requirements, stopwords, and structural differences) that make it difficult
to transfer English-based models to them effortlessly. Finally,
current polarity detection models suffer from bias problems as the training data is
collected from human posts which may involve human biases.
Existing studies suggest that polarity detection serves an important role in down-
stream tasks (Liu et al., 2021b; Alam et al., 2022; Colasanto et al., 2022; Koratamaddi
et al., 2021). Naturally, these downstream tasks inherit the challenges of
polarity detection. Apart from this, dialogue systems are relatively weak at generating
controllable responses, which may cause side effects for users in a bad emotional
state. The main challenge for recommender systems is the bias problem: users get
trapped in similar content, so their attitudes and opinions are reinforced again
and again, which hinders a comprehensive understanding of the world. The stock
price is influenced by many factors that cannot be fully reflected in investor comments.
Besides, as a finance-related task, stock price prediction calls for high accuracy,
as either false positives or false negatives can result in huge losses.
Amin et al. (2024) found that current expert affective computing systems can
still exceed LLMs in a wide range of subtasks. Developing robust polarity detection
systems thus remains valuable but challenging. Although PLMs have achieved great
accuracy in polarity detection, neurosymbolic approaches significantly improve
reasoning ability without compromising the representation ability of neural models,
yielding trustworthy, powerful, and robust AI. In general, research on neurosymbolic
AI is still in its early stages. Hence, there are a series of key problems that remain
unsolved. The first problem is how to effectively represent symbolic knowledge. As
mentioned, FOL is a promising paradigm, but it is hard to model infinitary FOL
with finite neural networks (Bader and Hitzler, 2005). Therefore, it is crucial to
explore some other representation paradigms such as programming languages (Nye
et al., 2020; Jin et al., 2022) and symbolic expressions (Lample and Charton, 2019;
Li et al., 2020e) in future research. On the other hand, subsymbolic and symbolic
systems can be combined in different ways and a better combination, especially a
fully integrated neurosymbolic system (Wang and Yang, 2022), should be investi-
gated further. Last but not least, the aforementioned four downstream tasks could
also benefit from the powerful reasoning ability of the symbolic system, e.g., generating
controllable empathetic responses with the help of symbolic knowledge
in human-robot interaction, and reducing the bias of recommender systems.
4.7 Conclusion
A. Reading List
• Rui Mao, Kai He, Claudia Beth Ong, Qian Liu, and Erik Cambria. MetaPro 2.0:
Computational Metaphor Processing on the Effectiveness of Anomalous Lan-
guage Modeling. In Proceedings of ACL, 9891–9908, 2024 (Mao et al., 2024b)
• Bin Liang, Lin Gui, Yulan He, Erik Cambria, and Ruifeng Xu. Fusion and Dis-
crimination: A Multimodal Graph Contrastive Learning Framework for Mul-
timodal Sarcasm Detection. IEEE Transactions on Affective Computing 15,
2024 (Liang et al., 2024)
• Kai He, Rui Mao, Tieliang Gong, Chen Li, and Erik Cambria. Meta-based Self-
Training and Re-Weighting for Aspect-based Sentiment Analysis. IEEE Transac-
tions on Affective Computing 14(3): 1731–1742, 2023 (He et al., 2023b)
B. Relevant Videos
• Labcast about Metaphor Understanding: [Link]/pwSAoR15esw
C. Related Code
• Github repo about Sarcasm Detection: [Link]/SenticNet/Sarcasm-Detection
D. Exercises
• Exercise 1. Perform metaphor understanding on the sentence “Navigating life’s
ocean, we face stormy seas of challenges and serene waters of joy; while the sun
of hope illuminates our path, we must also beware the shadows of doubt lurking
beneath the surface”. Take a step-by-step approach whereby you first detect which
parts of text are metaphors, then identify source and target for each of the detected
metaphors, and finally perform ABSA based on these.
• Exercise 2. Choose a set of social media posts and determine if they are sar-
castic. Explain the reasoning behind your decisions. Outline a basic approach
for detecting sarcasm in text using a rule-based system or machine learning.
Discuss potential features, such as word choice, punctuation, sentiment reversal,
and context. Discuss the challenges of sarcasm detection in NLU. Consider as-
pects like lack of tonal clues, cultural differences, and the complexity of language.
5 Knowledge Representation & Reasoning
5.1 Introduction
Fig. 5.1: Example of the description of commonsense knowledge with concepts (e.g.,
hammer, nail, and wood), instead of specific entities (e.g., claw_hammer) or abstract
primitives (e.g., tool).
Secondly, human cognition relies on a small set of fundamental and innate build-
ing blocks called primitives. In the conceptual dependency theory (Jackendoff, 1976;
Minsky, 1974; Rumelhart and Ortony, 1977; Schank, 1972; Wierzbicka, 1996), prim-
itives serve as elemental units of information and actions, like COLOR, SHAPE, SIZE,
INCREASE, and DECREASE, and form the foundation for humans to make generalizations, inferences, and predictions, ultimately facilitating efficient reasoning and
understanding in a wide range of real-world situations. For example, we generalize
concepts with relevant higher-level primitives. Verb concepts such as eat, slurp, and
munch could be related to a primitive EAT. Noun concepts like pasta, bread, and
milk can be associated with the primitive FOOD. Therefore, eat pasta or slurp milk
can be generalized into a primitive-level description, i.e., EAT FOOD. Hierarchical
concept representations have significant applications in diverse domains, e.g., con-
ceptual metaphor understanding (Ge et al., 2022; Mao et al., 2023b) and cognitive
analysis (Mao et al., 2023a).
Historically, some efforts have been devoted to building knowledge bases more in
line with human cognition. For example, VoCSK is designed to exploit concept-level
knowledge representation for implicit verb-oriented commonsense knowledge (e.g.,
person eats food instead of John eats bread). SenticNet is developed for organizing
sentiment knowledge with a core set of primitives. ASER (short for Activities,
States, Events, and their Relations) is built to extend the traditional definition of
selectional preference to higher-order selectional preference over eventualities. These
methods share a common goal of conceptualizing diverse types of commonsense
knowledge, mapping them to higher-level cognition, and moving beyond the explicit
representation of knowledge as discrete facts. Following this line, we take a further
step by constructing a new framework for representing the intricate commonsense
knowledge based on conceptual dependency theory.
In this chapter, we propose a new framework for commonsense knowledge repre-
sentation and reasoning based on conceptual primitives, named PrimeNet. The data
and the code used to develop PrimeNet are available on the SenticNet GitHub1. Additionally, PrimeNet is also available as an API for verb-noun generalization2 and as
a set of embeddings for ABSA available in 80 different languages3. The PrimeNet
framework consists of three layers, as illustrated in Fig. 5.2:
1 [Link]
2 [Link]
3 [Link]
Fig. 5.2: Illustration of three-layer structure in PrimeNet. Given the factual knowl-
edge, a concept layer is generated as the basic level, comprising widely recognized
mental representations associated with various categories or classes of objects. Its subordinate layer is termed the entity layer, which consists of specific entities, and its superordinate layer is defined as the primitive layer, encapsulating overarching and fundamental primitives.
Fig. 5.3: PrimeNet preliminary knowledge graph. The initial knowledge graph col-
lects all natural language relationships (edges) between concepts (nodes) found in
the training data. After several rounds of normalization, the final PrimeNet graph
only leverages 34 relationships.
Moreover, we manually check the primitives, refine the hierarchy structure of the
primitives, and generate the explanation of primitives. For example, DEACTIVATE
is defined as change the status from on to off, i.e., STATE=ON → STATE=OFF. In Table 5.1, we present several cases of verb primitives used in PrimeNet. This
strategy of constructing a primitive layer balances the need for human hand-coding
for accuracy with that for crowdsourcing and machine-based knowledge extraction
for coverage.
Table 5.1: Examples of verb primitives in PrimeNet. Given the input string, we
illustrate the detected verb primitives, their primitive-level representations, and explanations. Primitives are marked in green.
5.2 Background
In the 1950s, Chomsky (1957) introduced the universal grammar theory, positing
innate linguistic structures as foundational conceptual primitives. According to this
theory, humans inherently possess the capacity to acquire language, with universal
linguistic structures serving as fundamental building blocks shared across all lan-
guages. The conceptual dependency theory, put forth by Schank (1972), suggested
that the basis of natural language is conceptual, forming an interlingual foundation
composed of shared concepts and relationships across languages. Jackendoff (1976)
delved into explanatory semantic representation, asserting the existence of semantic
primitives common to all languages, enabling humans to express a diverse range
of semantic information. Wierzbicka (1996) emphasized that “conceptual primitives
and semantic universals are the cornerstones of a semantic theory”, asserting that this
limited set of primitives can determine interpretations for all lexical and grammatical
meanings in natural language. These theories collectively aim to identify a core set
of fundamental primitives for language, facilitating the description of lexicalized
concepts.
In the realm of cognitive science, theoretical studies on commonsense knowl-
edge representation align with similar insights. Jackendoff et al. (1983) highlighted
a strong correlation between semantic primitives and cognitive representation. Ac-
cording to Pesina and Solonchak (2015), the primitives studied in linguistics form
the basis for the formation of a person’s conceptual system, which is both unique and
universal in many aspects. In this view, language emerges as a central tool for cog-
nitive functions, including conceptualization and categorization. In the development
of knowledge representation theories in cognitive science, many have been based
on the idea that humans possess a core set of knowledge connecting a vast array of
specific knowledge.
In the early stages, Minsky (1974) studied the framework for knowledge repre-
sentation and introduced the concept of “frames” as a structured way to organize
information about situations or objects. He proposed that humans, when encountering new situations, retrieve typical knowledge from their minds. Piaget et al. (1952)
introduced the term “schema”, representing both the category of knowledge and the
process of acquiring that knowledge. The knowledge representation based on schema
has also been further researched by Rumelhart and Ortony (1977); Winograd (1976);
Bobrow and Norman (1975); Johnson (1989) and others. Spelke and Kinzler (2007)
introduced the core knowledge theory, suggesting that infants are born with “core
knowledge systems” supporting basic intuitions about the world. West (2011) intro-
duced a data modeling structure divided into primitive and derived concepts, with
primitive concepts serving as building blocks for other concepts. These theories col-
lectively underscore that the core primitive set constitutes the fundamental structure
of human cognition and provides guidance for knowledge representation.
5.2.2 Challenges
On the other hand, primitives are not fixed but rather flexible and adaptable.
The core primitives are deeply embedded in the human conceptual system, which is
both unique and universal in many aspects. The proposed number of semantic prim-
itives varies significantly, ranging from a few units in some studies (Wierzbicka, 1996; Jackendoff et al., 1983) to several dozen (Wierzbicka, 1996) or even hundreds (Cambria et al., 2024) in others. Pesina and Solonchak (2015) stated that the
main concepts of human society remain relatively stable, but their overall volume
changes over time.
A. Crowdsourcing
Factual knowledge represents concrete and specific details about the world, events,
people, places, objects, and other observable phenomena, such as “wheel is part of
bicycle”, “dog is an animal”, and “Los Angeles is located in California”. In the early
1980s, the Cyc project (Lenat, 1995) undertook the task of manually constructing
a comprehensive knowledge base encompassing the basic facts and rules about the
world. After the efforts of its first decade, the Cyc project expanded to include
around 100,000 terms. By the time of its release in 2012, known as OpenCyc 4.0,
the knowledge base had undergone substantial growth, encompassing over 2 million
facts across 239,000 concepts. The DOLCE project (Gangemi et al., 2002) was de-
signed to manually collect the ontological categories underlying natural language
and human commonsense with disambiguated concepts and relations. Freebase (Bollacker et al., 2008) is a collaborative knowledge base built by gathering data from various sources, including Wikipedia, the Notable Names Database, and contributions from
community users. Google Knowledge Graph4 is powered in part by Freebase, with
an extensive collection of billions of facts about people, places, and things. As
discussed in earlier sections, ConceptNet (Speer et al., 2017) also leverages crowd-
sourcing contributions from users to acquire commonsense knowledge. Moreover,
ConceptNet is available in 83 languages and can be linked to other knowledge bases,
such as WordNet, Wiktionary, OpenCyc, and DBpedia.
As for lexical knowledge, there are several databases manually created by experts,
such as WordNet (Miller, 1995), Roget’s Thesaurus (Kipfer, 2006), FrameNet (Baker
et al., 1998), MetaNet (Dodge et al., 2015), VerbNet (Schuler, 2005), and Prop-
Bank (Palmer et al., 2005). As discussed earlier in the textbook, WordNet is also a
highly popular lexical knowledge base which captures semantic relations between
words. WordNet is now available in over 200 languages, allowing researchers and
linguists worldwide to explore the complexities of language and word associations
across diverse contexts.
Encyclopedic knowledge is related to a broad understanding of various subjects
and topics. For example, Wikidata (Vrandecic and Krötzsch, 2014) is a knowledge
graph coupled with Wikipedia, which is a free, open, and multilingual online encyclo-
pedia that is collaboratively edited by volunteers. DBpedia (Auer et al., 2007) extracts
structured information from Wikipedia data and converts it into a machine-readable
format for use in the Semantic Web and data mining domains. The encyclopedic
knowledge resources offer a wide range of information to help people understand
various topics and fields.
More recently, commonsense knowledge bases have been specifically developed
to cater to particular tasks, such as dialogue systems (Young et al., 2018). For
example, SenticNet has been specifically developed for affective computing tasks.
Visual Genome (Krishna et al., 2017) contains annotations of concepts and their
relations found in a collection of images. ATOMIC (Sap et al., 2019a) is developed to
capture inferential commonsense knowledge, such as cause-and-effect relationships.
Finally, ATOMIC-2020 (Hwang et al., 2020) was proposed to unify the triples from
ConceptNet and ATOMIC, together with some newly developed relations.
4 [Link]
B. Automatic Extraction
Although commonsense knowledge is not explicitly defined, it has been observed that
certain types of commonsense knowledge can be extracted through automatic meth-
ods, such as text mining and information extraction. Compared with crowdsourcing,
these automatic extraction methods can handle large volumes of data efficiently and
at a lower cost, making them valuable tools for efficiently capturing and updating
commonsense knowledge from various domains.
Firstly, automatic extraction methods generally acquire commonsense knowledge
from large-scale text and web pages. For example, the Never-Ending Language Learning (NELL) project (Mitchell and Fredkin, 2014) is designed to automatically extract
structured information from unstructured Web pages. With hundreds of pre-defined
categories and relations, NELL extracts knowledge from more than 500 million web
pages, resulting in a large knowledge base comprising over 2.8 million instances.
WebChild (Tandon et al., 2014) is constructed through automated extraction and dis-
ambiguation from Web contents. It utilizes seeds derived from WordNet and pattern
matching techniques on large-scale text collections to gather information, including
fine-grained relations like “hasShape,” “hasTaste,” and “evokesEmotion”. As dis-
cussed earlier, ASER, SenticNet and Probase also contain elements of automatic
extraction. A subsequent version of Probase, named Microsoft Concept Graph (Ji
et al., 2019), harnesses billions of web pages and search logs to build a huge graph
of relations between concepts, and has been proven valuable in enhancing search
engines, spell-checkers, recommendation engines, and other AI-driven systems.
Secondly, several methods are used to improve existing commonsense knowl-
edge bases. The automatic extraction methods can help fill gaps, update outdated
information, and supplement missing commonsense knowledge in existing knowl-
edge bases. For example, BabelNet (Navigli and Ponzetto, 2012a) is a multilingual
knowledge base which is automatically created by mapping Wikipedia to English
WordNet based on multilingual concept lexicalizations and machine translations.
Dense-ATOMIC (Shen et al., 2023a) is designed to overcome the limitations of
ATOMIC in knowledge coverage and multi-hop reasoning, by employing a knowl-
edge graph completion approach.
Thirdly, some efforts have been made to automatically integrate diverse commonsense knowledge bases, enhancing the overall coverage and richness of the knowledge
base. For example, YAGO (Suchanek et al., 2007) is designed to extract common-
sense knowledge from Wikipedia, WordNet, WikiData, GeoNames, and other data
sources. Bouraoui et al. (2022) employed Region Connection Calculus to merge
open-domain terminological knowledge. CommonSense Knowledge Graph (CSKG)
integrates knowledge bases from seven diverse, disjoint sources such as ConceptNet
and WordNet. Based on ASER, Zhang et al. (2020b) have developed TransOMCS
with an algorithm for discovering patterns from the overlap of existing commonsense
and linguistic knowledge bases, and a commonsense knowledge ranking model to
select the highest-quality extracted knowledge.
In this section, we first introduce the task definition and then describe the solution for constructing PrimeNet, together with the key ideas of each module.
The solution of PrimeNet mainly consists of three modules. The first module is
to construct the knowledge graph G = {V, E, R} to organize the large-scale com-
monsense knowledge. This knowledge graph is designed to cover a wide range of
commonsense knowledge, encompassing specific entities and extensive informa-
tion. We refer to this graph as the entity layer of PrimeNet. The second module is
a conceptualization module, which identifies a small set of concepts C on top of
the set of specific entities E in G, as well as the hyperedges M_v→c to link entities
to concepts. We consider this concept set and the mapping between concepts and
entities as the concept layer of PrimeNet. The third module is a primitive detection
module that constructs the core primitive set P on top of the concept set C and builds
the hyperedges M_c→p to link concepts to their primitives. This small primitive set
and the mapping between primitives and concepts are used as the primitive layer of
PrimeNet. In the following, we will provide a more in-depth introduction to each
module, along with corresponding examples for illustration.
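The three-layer design above can be sketched as two upward mappings, one per set of hyperedges. The dictionaries and the generalize helper below are purely illustrative (drawn from the hammer and EAT FOOD examples used in this chapter), not actual PrimeNet data or its API:

```python
# Sketch of PrimeNet's three layers as two upward mappings (illustrative data):
# entity layer --M_v->c--> concept layer --M_c->p--> primitive layer.

entity_to_concept = {      # M_v->c: specific entities/terms -> basic-level concepts
    "engineer_hammer": "hammer",
    "brick_hammer": "hammer",
    "slurp": "eat",
    "munch": "eat",
    "pasta": "food",
    "milk": "food",
}

concept_to_primitive = {   # M_c->p: concepts -> superordinate primitives
    "hammer": "TOOL",
    "eat": "EAT",
    "food": "FOOD",
}

def generalize(term: str) -> str:
    """Map a term up the hierarchy to its primitive-level description."""
    concept = entity_to_concept.get(term, term)        # entity -> concept
    return concept_to_primitive.get(concept, concept)  # concept -> primitive

# "slurp milk" generalizes to the primitive-level description EAT FOOD:
print(generalize("slurp"), generalize("milk"))  # EAT FOOD
```

For simplicity the hyperedges are modeled here as many-to-one dictionaries; in the actual graph an entity may link to several concepts.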
Over the course of many years, a vast reservoir of factual knowledge has accumulated,
taking on various forms and originating from diverse sources. In order to system-
atically organize this wealth of knowledge, we have undertaken the construction of
a knowledge graph. Drawing inspiration from the theory of cognitive development
put forth by Piaget et al. (1952), which posits that human cognitive development
occurs in stages, we have adopted a gradual expansion strategy to build the knowl-
edge repository. Rather than merging disparate sources abruptly, our approach is to
incrementally expand the knowledge base.
The fundamental idea underlying our strategy is that human knowledge acquisition
follows a pattern of continuous expansion, rooted in commonly shared and widely
accepted information. For instance, individuals typically begin by learning that a
“hammer” is a “tool” used for driving “nails”, and subsequently delve into more
intricate details, such as discerning the differences among various types of hammers
like “engineer hammer” and “brick hammer”. To emulate this cognitive process, we
initially construct a basic graph consisting of widely used concepts and relations as
the entity layer of PrimeNet.
To construct the concept layer over the knowledge graph G, this module focuses
on identifying a suitable concept set C from the node set V and establishing hy-
peredges in the set M v!c to link entities with their respective concepts. Within
PrimeNet, this concept layer encapsulates commonly used mental representations
of categories, classes, or ideas that share common features or characteristics. For
instance, “hammer” is the concept that represents a category encompassing entities
such as “engineering hammer”, “brick hammer”, and “rubber hammer”. Consequently, we initialize the concept layer using Core WordNet6, a compilation of about 5,000 of the most commonly used words, meticulously curated by experts. Then, we design a concept detection method that discovers new concepts and expands the concept set by leveraging Probase, and build edges to link entities to the detected concepts.
Our observation underscores that, for a concept, its hyponyms tend to establish
robust connections with diverse concepts in a probabilistic taxonomy, whereas a
specific entity is more concentrated in its connection to concepts. To capture this
regularity, we introduce a novel scoring function designed to identify whether a
term qualifies as a concept. In contrast to alternative conceptualization methods, our
approach stands out by centering around core words rather than initiating from the
leaves of an extensive taxonomy for concept detection. The pre-defined core words
enhance diversity and accuracy, distinguishing our strategy as effective in steering
clear of misleading information stemming from isolated subgraphs or incorrect cycles within the large-scale taxonomy.
6 [Link]
5.4 Knowledge Graph Construction 355
Specifically, it is observed that concepts under the same primitive often share a
similar meaning and context. For instance, elongate and stretch fall under the same
primitive GROW and share a similar context. Although intuitive, lexical substitution
tends to overlook crucial differences between concepts. For example, verbs such as
stretch and compress belong to opposite primitives, GROW and SHRINK respectively,
yet can be identified within similar lexical contexts. To address this issue, we lever-
age powerful LLMs to filter out incorrect concepts within each cluster, generating a
primitive that accurately describes the concept cluster. Manual checks are also em-
ployed to ensure the quality of primitives in building the primitive layer. This strategy
strikes a balance between human hand-coding for accuracy and crowdsourcing and
machine-based knowledge extraction for comprehensive coverage.
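This cluster-then-filter strategy can be illustrated with toy context embeddings: purely lexical similarity groups stretch with its antonym compress, and a second, meaning-aware pass (played by an LLM in PrimeNet) must remove such members before a primitive like GROW is assigned. All vectors and the directionality labels below are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy contextual embeddings (illustrative only): stretch and compress occur in
# similar contexts, so purely lexical clustering groups them together.
embeddings = {
    "elongate": [0.9, 0.1, 0.0],
    "stretch":  [0.8, 0.2, 0.1],
    "compress": [0.7, 0.3, 0.1],   # similar context, opposite meaning
}

cluster = [c for c in embeddings if cosine(embeddings[c], embeddings["stretch"]) > 0.9]

# A second, meaning-aware pass (an LLM in PrimeNet; hypothetical labels here)
# removes antonymous members before the cluster is assigned a primitive.
directionality = {"elongate": +1, "stretch": +1, "compress": -1}
filtered = [c for c in cluster if directionality[c] == directionality["stretch"]]
```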
In this section, we detail the construction of the knowledge graph (G) of PrimeNet. It
mainly contains four stages. First, commonsense knowledge acquisition collects high-quality knowledge from diverse sources created through manual annotation or crowdsourcing. Then, knowledge integration maps the nodes and relations among the different sources. Next, graph construction organizes the knowledge in a graph. Finally, exploration defines functions to leverage the knowledge graph in downstream tasks. We detail each stage as follows.
• Lexical knowledge extracted from WordNet, FrameNet, and Roget (Kipfer, 2006);
• Factual knowledge extracted from ConceptNet;
• Structured information in Wikidata and DBpedia. For DBpedia, we extract knowl-
edge from InfoBoxes which provide information about a wide variety of topics,
e.g., people, places, and organizations, as well as knowledge from InstanceTypes
which contains instances of 438 types, e.g., book, company, city, and plant.
• Task-specific knowledge, such as inferential knowledge extracted from ATOMIC,
which is organized as typed “if-then” relations with variables, and visual knowl-
edge extracted from Visual Genome (Krishna et al., 2017).
Table 5.2: Sources of commonsense knowledge for building the knowledge graph of
PrimeNet. Creation denotes the construction methods, # R denotes the number of
relation types, and Size denotes the graph scale.
7 [Link]
8 [Link]
Fig. 5.5: Illustration of graph construction of PrimeNet. Starting with Core WordNet,
we first construct a basic graph with core words and relations from WordNet and
ConceptNet. Then, we add instanceOf knowledge from DBpedia and Wikipedia.
Next, diverse types of knowledge from other knowledge bases are incorporated into
the graph of PrimeNet.
9 [Link]
Table 5.3: Core relations of PrimeNet, and their description, example, and mappings
to WordNet and ConceptNet.
Table 5.4: Functions designed for exploring PrimeNet. For each function, we intro-
duce its input, output, and description.
5.4.4 Exploration
We design multiple functions for exploring the graph, which are capable of:
• Exploring graph structure of PrimeNet. For example, nodes and edges functions
are designed to generate all concepts and relations in PrimeNet, respectively,
and get_number_of_nodes and get_number_of_edges are designed to count the
number of nodes and edges in the knowledge graph.
• Exploring commonsense knowledge for specific concepts. For example, given a
concept, what_is function is designed to get all its relations, get_polarity function
is used to get its sentiment polarity, and find_path function is designed to find a
specific path in PrimeNet given a pair of concepts.
5.5 Concept Detection 359
• Integrating new knowledge into PrimeNet. For example, the add_node and
add_edge functions are designed to add new concepts and relations into PrimeNet,
and the add_primenet_new function is able to incorporate a new knowledge base
into PrimeNet.
We detail all the designed functions in Table 5.4, including their input, output, and
description. These functions make it easy to apply PrimeNet in downstream tasks,
as well as update PrimeNet with new commonsense knowledge or domain-specific
knowledge.
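A few of these exploration functions can be sketched on a toy graph. The names follow the functions described above (nodes, edges, what_is, find_path), but the data and signatures here are illustrative assumptions, not the actual PrimeNet API:

```python
from collections import deque

# Toy relation triples standing in for a tiny slice of the knowledge graph.
graph = {
    ("hammer", "isA", "tool"),
    ("hammer", "usedFor", "hit"),
    ("nail", "madeOf", "metal"),
    ("hit", "relatedTo", "nail"),
}

def nodes():
    """All concepts appearing in the graph."""
    return {n for s, _, o in graph for n in (s, o)}

def edges():
    """All relation triples in the graph."""
    return set(graph)

def what_is(concept):
    """All relations a concept participates in."""
    return [t for t in graph if concept in (t[0], t[2])]

def find_path(a, b):
    """Breadth-first search for a path over undirected relation edges."""
    adj = {}
    for s, _, o in graph:
        adj.setdefault(s, set()).add(o)
        adj.setdefault(o, set()).add(s)
    queue, seen = deque([[a]]), {a}
    while queue:
        path = queue.popleft()
        if path[-1] == b:
            return path
        for nxt in adj.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```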
5.5.1 Preliminaries
Abstract words have higher level scores, while specific terms have lower ones. For example, the level scores of dog, mammal, and animal are 72, 89, and 362, respectively. It is also observed that, for an abstract term, its hyponyms are usually positioned at diversified levels, while the hyponyms of a specific term are more concentrated. Based on this, Liu et al. (2022c) defined an entropy-based metric for measuring abstractness. Formally, the entropy score of a term is defined as follows.
Definition (Entropy Score) Given a term c, its entropy score is defined as:

$$\mathrm{entropy}(c) = \begin{cases} 0, & \text{if } c \text{ is a leaf term} \\ -\sum_{i=1}^{l} p_i(c) \cdot \log p_i(c), & \text{otherwise} \end{cases} \qquad (5.2)$$

where l is the maximum level, and p_i(c) is the ratio of the number of c's hyponyms at the i-th level to the total number of c's hyponyms.
The entropy of abstract terms is often greater than that of specific terms. For ex-
ample, the entropy scores of pupil, student, and people are 0.563, 0.927, and 1.790,
respectively. In general, abstract concepts and concrete entities are differentiated using these abstractness measures with manually defined thresholds (Liu et al., 2022c). However, these methods are inaccurate and not suitable when applied to
complex graphs with large-scale commonsense knowledge. The primary reason is
the vast amount of knowledge, inevitably leading to the presence of cycles and iso-
lated subgraphs, significantly reducing the accuracy of the aforementioned methods.
Furthermore, some commonly used vocabulary lacks numerous lower-level nodes,
e.g., voice, track, and driver, and they have lower scores compared with other words
with more hyponyms, e.g., transport, symbol, and medicine. As such, conceptualization methods that rely only on hierarchical information are not suitable for such cases.
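Under Eq. (5.2), the entropy score can be computed from per-level hyponym counts. A small sketch, with made-up counts purely for illustration:

```python
import math

def entropy_score(hyponym_levels):
    """Entropy score of Eq. (5.2): hyponym_levels maps level -> number of the
    term's hyponyms at that level; leaf terms (no hyponyms) score 0."""
    total = sum(hyponym_levels.values())
    if total == 0:
        return 0.0
    ent = 0.0
    for count in hyponym_levels.values():
        p = count / total          # p_i(c): fraction of hyponyms at this level
        ent -= p * math.log(p)
    return ent

# An abstract term spreads its hyponyms over many levels (higher entropy);
# a specific term concentrates them at few levels (lower entropy).
abstract = entropy_score({1: 10, 2: 10, 3: 10, 4: 10})  # uniform over 4 levels
specific = entropy_score({1: 38, 2: 2})
assert abstract > specific
```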
We perform a probing experiment as illustrated in Fig. 5.6. We assume that words
from Core WordNet are concepts, given their fundamental role in describing the
world. For all nodes in Core WordNet and our knowledge graph G of PrimeNet, we
show probability distributions of their level scores and entropy scores. It is observed
that a considerable number of words in Core WordNet have level scores below 50,
and entropy scores under 1. These words are readily excluded from concept sets when applying previous conceptualization methods.
5.5.2 Conceptualization
Fig. 5.6: Illustration of the data distribution of Core WordNet and the graph of PrimeNet, in terms of the level scores and entropy scores of nodes.
Specifically, the initial set of concepts, denoted as C_0 = {c_1, c_2, c_3, . . .}, comprises
commonly used words from Core WordNet that describe the world in human daily
life. In an ideal scenario, the hypernyms of these core words are expected to be
more abstract and should be considered as concepts. However, in practical scenarios,
not all of their hypernyms can be unequivocally regarded as concepts due to the
intricate interweaving of commonsense knowledge. For instance, relationships such
as (dog, isA, animal), (dog, isA, pet), (pet, isA, animal), and (dog, isA, species)
are all deemed correct and coexist within the knowledge base. Thus, we need a
more accurate method to measure the abstractness of hypernyms. It is observed that
not all hypernyms have the same weight when working as the concept of a dog.
This problem has been deeply studied and Probase has been constructed to provide
statistical insights into isA relations.
Fig. 5.7: Examples of top-50 words scored by the designed conceptual score function.
We compare their level scores and entropy scores with our conceptual scores.
It includes “isA” relations for 2.7 million terms, automatically mined from a corpus of 1.68 billion web pages. That is, each triplet (t, isA, c) is linked to a frequency score freq(t, c), providing frequency information computed through a data-driven method based on a large-scale corpus. For example, (dog, isA, animal) and (dog, isA, species) show that both animal and species are concepts of dog, and freq(dog, animal) > freq(dog, species) shows that animal is a more typical concept for dog than species.
Given a triplet (t, isA, c), it is associated with a frequency score freq(t, c) in Probase. The frequency score is an important signal for identifying whether this relation is typical or not. Based on this observation, Wang et al. (2015b) proposed a typicality score, defined based on the frequency information to tell how popular a concept c is as far as an entity t is concerned, and how popular an entity t is as far as a concept c is concerned:
Definition (Typicality Score) Given a term t, the conditional probability Pr(c|t) of a term c is defined as:

$$\Pr(c|t) = \frac{freq(t, c)}{\sum_{c_i \in hyper(t)} freq(t, c_i)}, \qquad (5.3)$$

$$\Pr(t|c) = \frac{freq(t, c)}{\sum_{t_i \in hypo(c)} freq(t_i, c)}, \qquad (5.4)$$
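The typicality scores of Eqs. (5.3) and (5.4) can be computed directly from an isA frequency table. A sketch with an illustrative frequency table (not actual Probase counts):

```python
# Illustrative isA frequencies: (entity, concept) -> frequency.
freq = {
    ("dog", "animal"): 900,
    ("dog", "pet"): 500,
    ("dog", "species"): 100,
    ("cat", "animal"): 800,
}

def pr_concept_given_term(t, c):
    """Pr(c|t) of Eq. (5.3): how typical concept c is for entity t."""
    denom = sum(f for (ti, ci), f in freq.items() if ti == t)
    return freq.get((t, c), 0) / denom

def pr_term_given_concept(t, c):
    """Pr(t|c) of Eq. (5.4): how typical entity t is for concept c."""
    denom = sum(f for (ti, ci), f in freq.items() if ci == c)
    return freq.get((t, c), 0) / denom

# animal is a more typical concept for dog than species:
assert pr_concept_given_term("dog", "animal") > pr_concept_given_term("dog", "species")
```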
This scoring method is designed to quantify the extent to which a term functions as a universal, abstract link across a diverse array of concepts. Utilizing the initial set C_0, we calculate the abstraction scores of their hypernyms, presenting the top 50 terms in Fig. 5.7. According to human analysis, all of them are confirmed as conceptual terms. In addition, we present their level scores and entropy scores, revealing that these metrics fall short in identifying them as abstract terms. For instance, topic, song, and adjective exhibit low level scores (i.e., 3, 3, and 28), and author and classic display low entropy scores (i.e., 0.59 and 1.72), excluding them from being identified as concepts. We employ an iterative approach to augment the concept set by systematically incorporating terms with high abstraction scores. In the i-th iteration, we introduce the top-n (e.g., n = 3) hypernyms for each concept in C_{i−1}, imposing the constraint that these hypernyms must surpass a specified threshold T_abs. This process results in an updated concept set, denoted as C_i.
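The iterative expansion can be sketched as follows. Since the exact abstraction scoring function is not reproduced here, a placeholder score table stands in for it, and the hypernym edges are hypothetical:

```python
# Hypothetical isA edges and placeholder abstraction scores (the real scoring
# function of PrimeNet is not reproduced here).
hypernyms = {
    "dog": ["pet", "animal"],
    "pet": ["animal"],
    "animal": ["organism"],
}
abstraction = {"pet": 0.4, "animal": 0.9, "organism": 0.8}

def expand(core, n=3, t_abs=0.5, rounds=2):
    """Each round adds, per concept, the top-n hypernyms whose abstraction
    score exceeds the threshold T_abs, yielding C_i from C_{i-1}."""
    concepts = set(core)
    for _ in range(rounds):
        new = set()
        for c in concepts:
            cands = sorted(hypernyms.get(c, []),
                           key=lambda h: abstraction.get(h, 0.0), reverse=True)[:n]
            new |= {h for h in cands if abstraction.get(h, 0.0) > t_abs}
        concepts |= new
    return concepts

result = expand({"dog"})  # pet (score 0.4) stays below the threshold
```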
Primitive discovery aims to identify the most basic and essential elements of world knowledge, which provide a way to represent and organize knowledge in a structured and meaningful manner (Schank, 1972; Guarino, 1995). A well-designed primitive set can help produce more accurate and reusable knowledge bases. However, creating a thorough set of primitives is extremely time-consuming and labor-intensive; hence, it is not generally employed in most knowledge bases (Minsky, 1974; Jackendoff, 1976; Schank, 1972; Cambria et al., 2024).
1) Training Data. We extract all the verb-noun and adjective-noun concepts from
ConceptNet 5.5 (Speer et al., 2017) together with a sample sentence for each
concept. The collection of concepts is denoted as E = {e_1, e_2, e_3, . . . , e_n}, where each concept e_i ∈ E is assigned a sample sentence s_i. For each concept e_i, we remove it from the sentence s_i and denote the remaining sentence as its context c_i. We employ PLMs to represent the concept e_i and its context c_i as fixed-dimensional embeddings, i.e., e_i and c_i, respectively.
2) Training Objective. Then, we fine-tune the PLM with a lexical substitution task.
The assumption is that a relevant lexical substitute should be both semantically
similar to the target word and have a similar contextual background.
Given a concept ei, its context ci is regarded as the positive example. We
create negative examples by sampling random concepts, denoted as
N(ei) = {e*i,1, e*i,2, ..., e*i,z}.
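Under our reading, this objective can be sketched as a contrastive (InfoNCE-style) loss that pulls the concept embedding toward its context and away from the sampled negatives; the temperature `tau` and the use of cosine similarity are assumptions, not details given in the text:

```python
import numpy as np

def contrastive_loss(e_i, c_i, negatives, tau=0.05):
    """Contrastive objective sketch: the concept embedding e_i should score
    higher with its own context c_i than with z random negative concepts."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos = np.exp(cos(e_i, c_i) / tau)                   # positive pair score
    neg = sum(np.exp(cos(e_i, n) / tau) for n in negatives)  # negatives N(e_i)
    return -np.log(pos / (pos + neg))                   # lower is better
```

A perfectly aligned positive pair drives the loss toward zero, while a context no more similar than the negatives yields a high loss.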
10 In our experiment, the used pretrained model is all-mpnet-base-v2. Having undergone pretraining
on over 1 billion sentence pairs, this model is capable of mapping input text to a 768-dimensional
vector space, ideal for tasks such as clustering or semantic search. Further details can be found at:
[Link]
5.6 Primitive Discovery 365
Fig. 5.8: The overall framework for primitive detection. LLM1 is used as an examinee
to generate a representative primitive for each concept cluster, and LLM2 is used as
an examiner to verify the primitive and its related concepts.
11 [Link]
Primitive detection involves detecting the errors in each cluster and associating
a meaningful and generalizable primitive with a cluster of related concepts. For
example, concepts like ingest, slurp, and munch are represented by a primitive EAT.
It is inherent to human nature to try to categorize things, events, and people, finding
patterns and forms they have in common. We explore the generative ability of LLMs
for primitive detection. To ensure accuracy, as illustrated in Fig. 5.8, we design a
detection-verification framework, where the first LLM works as an examinee to generate
a primitive for a concept cluster, and another LLM works as an examiner to check
whether the generated primitive is correct. Specifically,
Step-1: Primitive Detection by Examinee LLM The input of examinee (denoted
as LLM1) is a cluster of concepts. The designed prompt is “Please generate a
primitive for the following concepts: C.”, where C is a list of concepts in a cluster.
Step-2: Primitive Verification by Examiner LLM The examiner (denoted as
LLM2) verifies whether the primitive generated by LLM1 is correct. To
set up LLM2, we input the primitive P and the related concepts C, concatenated
with the following instruction: Do you think P is representative for the following
concepts: C. Please answer “yes” or “no”.
Step-3: Explainable Context by Examiner LLM For a correct primitive and
cluster, we ask LLM2 to generate a sentence as explainable context. We input the
primitive P and the related concepts C, concatenated with the following instruction:
Please generate a short sentence to describe the primitive P. In this [MASK]
can be replaced by the concepts in C.
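The three steps above can be sketched as a simple loop; `llm1` and `llm2` are assumed to be callables wrapping the examinee and examiner models (each mapping a prompt string to a text reply), which is an implementation choice not specified in the text:

```python
def detect_primitives(clusters, llm1, llm2):
    """Sketch of the detection-verification framework (Fig. 5.8)."""
    results = []
    for concepts in clusters:
        c = ", ".join(concepts)
        # Step-1: the examinee proposes a primitive for the cluster
        p = llm1(f"Please generate a primitive for the following concepts: {c}.").strip()
        # Step-2: the examiner verifies it with a yes/no question
        verdict = llm2(f"Do you think {p} is representative for the following "
                       f'concepts: {c}. Please answer "yes" or "no".')
        if "yes" not in verdict.lower():
            continue                      # reject unverified primitives
        # Step-3: the examiner produces an explainable context sentence
        ctx = llm2(f"Please generate a short sentence to describe the primitive {p}. "
                   f"In this [MASK] can be replaced by the concepts in {c}.")
        results.append((p, concepts, ctx))
    return results
```

In practice, the yes/no parsing would need to be more robust than a substring check, since LLM replies are free-form text.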
5.7 Experiments
In this section, we compare our PrimeNet with other widely-used knowledge bases
in terms of coverage, accuracy, and efficiency. Then, we conduct experiments on
semantic similarity and commonsense reasoning to verify the accuracy
and efficiency of PrimeNet.
Table 5.5: Accuracy (%) assessed by human annotators. Size denotes the number of
triplets in different knowledge bases.
• ConceptNet (Speer et al., 2017): This is a large-scale knowledge base that contains
relational knowledge collected from resources created by experts, crowdsourcing,
and games with a purpose (Von Ahn, 2006).
As shown in Table 5.5 (see footnote 12), PrimeNet stands out as the highest-quality
knowledge base, with an acceptance rate of 92.4%, showing that PrimeNet
is highly reliable and contains commonsense knowledge that is consistent with hu-
man understanding. ConceptNet, ATOMIC-2020, and ATOMIC also demonstrate high
quality, with acceptance rates of 88.6%, 91.3%, and 88.5%, respectively. Although
TransOMCS has a vast number of triplets (i.e., 18.5M), it has a lower accuracy
compared to the other resources, with an acceptance rate of only 41.7%, indicating
that it may not be as reliable as the other knowledge bases.
12 Performance of compared knowledge bases is reported by Hwang et al. (2020), evaluated through
crowdsourcing on the AMT platform.
where α and β control the relative strengths of associations, w*i is the original
embedding of word wi, and wi is its new embedding, R denotes a set of relations
extracted from the knowledge base, and (wi, wj) denotes a relation which connects
wi and wj. We test the retrofitted embeddings with different knowledge bases on two
tasks, i.e., semantic similarity and SAT-style analogy.
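Since the equation itself falls outside this excerpt, the sketch below shows one standard way such a retrofitting objective is optimized: an iterative coordinate update that moves each vector toward a weighted mean of its original embedding (strength α) and its knowledge-base neighbors (strength β). The closed-form update is an assumption consistent with the symbols above, not a verbatim rendering of the omitted formula:

```python
import numpy as np

def retrofit(orig, relations, alpha=1.0, beta=1.0, iters=10):
    """Retrofitting sketch: nudge each word vector wi toward its neighbors in
    R while staying close to the original embedding w*i. `orig` maps words to
    vectors; `relations` is the set R of (wi, wj) pairs."""
    new = {w: v.copy() for w, v in orig.items()}
    nbrs = {w: [] for w in orig}
    for wi, wj in relations:            # build an undirected neighbor list
        nbrs[wi].append(wj)
        nbrs[wj].append(wi)
    for _ in range(iters):
        for w in new:
            if not nbrs[w]:
                continue                # isolated words keep their embedding
            # weighted mean of the original vector and current neighbor vectors
            s = alpha * orig[w] + beta * sum(new[n] for n in nbrs[w])
            new[w] = s / (alpha + beta * len(nbrs[w]))
    return new
```

After a few iterations, related words (e.g., synonyms linked in the knowledge base) end up measurably closer than in the original space.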
This task measures the degree of similarity between word pairs by calculating
the cosine similarities between their embeddings and comparing these similarities
to human judgments. A good method should provide similarities that are strongly
correlated with the human judgments, as evaluated by the Spearman correlation coeffi-
cient (Myers and Well, 1995). We conduct experiments on eight
widely-used word similarity datasets, including
• YP-130: A dataset comprising 130 word pairs with similarity ratings provided by
human annotators (Yang and Powers, 2005).
• MenTR-3K: Consists of 3,000 word pairs with similarity judgments collected
from human participants (Bruni et al., 2012).
• RG-65: Contains 65 word pairs with similarity ratings obtained through human
evaluations (Rubenstein and Goodenough, 1965).
• MTurk-771: Comprises 771 word pairs with similarity scores obtained through
crowdsourcing on AMT (Halawi et al., 2012).
• SimLex-999: Includes 999 word pairs with similarity ratings collected from
human subjects, aiming to provide a balanced set for evaluating word similarity
models (Hill et al., 2015).
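The evaluation protocol described above can be sketched end-to-end in plain Python (Spearman correlation is computed here as the Pearson correlation of ranks, ignoring ties, which suffices for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors (as plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def spearman(xs, ys):
    """Spearman correlation as Pearson correlation of ranks (no tie handling)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    vy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (vx * vy)

def word_similarity_eval(emb, pairs, human_scores):
    """Protocol sketch: cosine similarity of each word pair's embeddings,
    compared against human ratings via Spearman correlation."""
    sims = [cosine(emb[w1], emb[w2]) for w1, w2 in pairs]
    return spearman(sims, human_scores)
```

A correlation of 1.0 means the model orders the pairs exactly as the human annotators did.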
13 [Link]
Two popular pretrained word embeddings are used in our experiments: 
word2vec (Mikolov et al., 2013b), which is trained on the first 100M of plain text
from Wikipedia (see footnote 14), and GloVe (Pennington et al., 2014), which is trained on 6 billion
words from Wikipedia and English Gigaword (see footnote 15). In this task, we compare PrimeNet
with FrameNet, WordNet, and ConceptNet, which contain synonym knowledge.
Table 5.6 presents the overall performance on the different word similarity datasets.
PrimeNet demonstrated a significant improvement in retrofitting semantic repre-
sentations, with an average increase of 6.73%, 5.49%, and 5.31% for word2vec
(300d), GloVe (50d), and GloVe (300d), respectively. WordNet also achieved no-
table performance gains, with average improvements of 4.75%, 3.79%, and 3.98%,
benefiting from the high-quality synonym knowledge constructed by experts. The
crowdsourced ConceptNet, in contrast, only slightly outperformed word2vec (300d) and GloVe
(50d), and performed slightly worse than GloVe (300d). The solid performance gain achieved
by PrimeNet suggests that it successfully integrates knowledge from various
sources into a robust knowledge base.
Methods YP-130 MenTR-3K RG-65 MTurk-771 SimLex-999 SimVerb-3500 VERB-143 WS-353 Average (Δ)
word2vec (300d) 0.215 0.600 0.633 0.554 0.287 0.155 0.358 0.705 0.438
+FrameNet 0.334 0.589 0.620 0.571 0.295 0.227 0.321 0.651 0.451 (+1.25%)
+WordNet 0.316 0.620 0.717 0.598 0.377 0.237 0.318 0.705 0.486 (+4.75%)
+ConceptNet 0.386 0.582 0.577 0.533 0.341 0.229 0.302 0.651 0.450 (+1.16%)
+PrimeNet 0.325 0.638 0.680 0.617 0.416 0.271 0.385 0.715 0.506 (+6.73%)
GloVe (50d) 0.377 0.652 0.602 0.554 0.265 0.153 0.250 0.499 0.419
+FrameNet 0.459 0.622 0.617 0.568 0.288 0.217 0.240 0.471 0.435 (+1.61%)
+WordNet 0.510 0.649 0.688 0.540 0.342 0.239 0.188 0.500 0.457 (+3.79%)
+ConceptNet 0.427 0.599 0.558 0.493 0.356 0.234 0.236 0.489 0.424 (+0.50%)
+PrimeNet 0.443 0.674 0.707 0.597 0.376 0.236 0.273 0.485 0.474 (+5.49%)
GloVe (300d) 0.561 0.737 0.766 0.650 0.371 0.227 0.305 0.605 0.528
+FrameNet 0.589 0.701 0.756 0.639 0.361 0.278 0.274 0.558 0.519 (-0.84%)
+WordNet 0.610 0.759 0.841 0.679 0.470 0.313 0.256 0.612 0.568 (+3.98%)
+ConceptNet 0.561 0.700 0.747 0.583 0.420 0.288 0.300 0.595 0.524 (-0.34%)
+PrimeNet 0.593 0.764 0.818 0.684 0.496 0.316 0.350 0.626 0.581 (+5.31%)
A. Task Setting
B. Baselines
C. Benchmarks
D. Performance
The overall performance is shown in Table 5.7. It is observed that pretraining the lan-
guage models with external knowledge is effective in improving performance on
the commonsense QA task. The main reason is that external knowledge provides impor-
tant supplementary information to the implicit knowledge embedded in PLMs. Our
PrimeNet achieved the best performance when RoBERTa is used as the backbone, with
average performance gains of 1.74%, 2.88%, and 0.82% over ATOMIC, Concept-
Net+Wikidata+WordNet, and CSKG, respectively. This experiment indicates that
PrimeNet organizes commonsense knowledge well.
In our method, we manually checked the detected primitives. This step was conducted by
five senior Ph.D. students majoring in NLP. We manually encode the explanations of primitives.
For example, INCREASE is defined as INCREASE(obj) := obj++, which is the basic
operation that increments the value of an object and provides a foundation for more
complex reasoning. It is observed that some primitives have a hierarchical structure.
We show examples of primitives in Fig. 5.9. At Level-1, the primitive GROW is
defined as GROW(obj) = INCREASE(obj.SIZE) := obj.SIZE++ = obj(l++,
h++, w++), which is accomplished by using the INCREASE primitive to increment
the object’s SIZE attribute, such as length (l), height (h), and width (w).
The Level-2 primitive LENGTHEN is even more specific, adding only length to
an object, and it is defined as LENGTHEN(obj) = INCREASE(obj.LENGTH)
:= obj.LENGTH++ = obj(l++, h, w). We have also performed several ex-
periments on affordances by testing how PrimeNet is able to model human-object
interactions in different scenarios, e.g., how to identify and use a liquid container,
and in different modalities, e.g., speech processing and computer vision (Fig. 5.10).
Finally, we have also carried out preliminary experiments on how PrimeNet can
represent and handle different types of domain-specific knowledge, e.g., safety com-
monsense knowledge (Fig. 5.11). We intend to provide a more detailed account of
these experiments and additional ones in our future work.
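The INCREASE/GROW/LENGTHEN hierarchy can be rendered as a toy Python sketch; representing an object as a dict of attribute values is our assumption, chosen only to make the obj++ semantics executable:

```python
def INCREASE(obj, attrs):
    """Level-0 primitive: increment the given attributes of an object (obj++)."""
    for key in attrs:
        obj[key] += 1
    return obj

def GROW(obj):
    """Level-1 primitive: GROW(obj) = INCREASE(obj.SIZE), i.e., increment
    length (l), height (h), and width (w) together."""
    return INCREASE(obj, ["l", "h", "w"])

def LENGTHEN(obj):
    """Level-2 primitive: LENGTHEN(obj) = INCREASE(obj.LENGTH), adding
    only length while leaving height and width unchanged."""
    return INCREASE(obj, ["l"])
```

The hierarchy is visible in the code: both higher-level primitives are defined purely in terms of INCREASE, differing only in which attributes they touch.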
Fig. 5.12: Overall architecture of the LLM evaluation proposed by Xu et al. (2024).
Implicit reasoning is a challenging task in which the text does not contain explicit clues for
designing reasoning strategies. For example, “Did Aristotle use a laptop?” is an
implicit question (Geva et al., 2021), and answering it requires inferring the strategy
for the implicit question, i.e., temporal comparison. Recently, AI systems based on
PLMs have achieved impressive performance in answering explicit questions, even
surpassing human performance on some datasets, e.g., SQuAD (Rajpurkar et al.,
2016) and TriviaQA (Joshi et al., 2017). However, their accuracy on implicit ques-
tions is only 66% (Geva et al., 2021).
A key property of implicit reasoning is the diversity of strategies. Humans cannot pre-
define all of the strategies due to the complexity of scenarios. To conduct implicit
reasoning, PrimeNet has the potential to build a finite set of strategies at the primitive
level and apply these primitive-based strategies to concepts and entities. For example,
the implicit questions “Did Aristotle use a laptop?”, “Did Shakespeare play
guitar?”, and “Was NATO involved in World War I?” share the same reasoning
strategy at the primitive level, i.e., COMPARE(TIME(Entity-1), TIME(Entity-2)).
Primitives can be used to conduct implicit reasoning by providing the basic cognitive
processes or mental operations that underlie our ability to reason implicitly.
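The COMPARE(TIME(Entity-1), TIME(Entity-2)) strategy can be sketched with two toy primitives; the knowledge base `kb` of temporal spans and the overlap-based definition of COMPARE are illustrative assumptions:

```python
def TIME(entity, kb):
    """Primitive: look up the temporal span of an entity as a (start, end)
    tuple of years (kb is a toy stand-in for a real knowledge base)."""
    return kb[entity]

def COMPARE(span1, span2):
    """Primitive: do two temporal spans overlap?"""
    return span1[0] <= span2[1] and span2[0] <= span1[1]

def answer_implicit(e1, e2, kb):
    """One primitive-level strategy: COMPARE(TIME(Entity-1), TIME(Entity-2)).
    Returns False when the entities never coexisted."""
    return COMPARE(TIME(e1, kb), TIME(e2, kb))
```

The same three-line strategy answers all three example questions above; only the entity lookups change, which is exactly the point of lifting strategies to the primitive level.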
5.8.3 Neurosymbolic AI
5.9 Conclusion
A. Reading List
• Qian Liu, Sooji Han, Yang Li, Erik Cambria, and Kenneth Kwok. PrimeNet: A
Framework for Commonsense Knowledge Representation and Reasoning based
on Conceptual Primitives. Cognitive Computation, 2024 (Liu et al., 2024b)
• Zonglin Yang, Xinya Du, Erik Cambria, Claire Cardie. End-to-end Case-based
Reasoning for Commonsense Knowledge Base Completion. Proceedings of
EACL, 3509-3522, 2023 (Yang et al., 2023)
• Weijie Yeo, Ranjan Satapathy, Siow Mong Goh, and Erik Cambria. How in-
terpretable are reasoning explanations from prompting large language models?
Proceedings of NAACL, 2148–2164, 2024 (Yeo et al., 2024c)
B. Relevant Videos
• Labcast about PrimeNet: [Link]/KVkcJfispww
C. Related Code
• Github repository about PrimeNet: [Link]/SenticNet/PrimeNet
D. Exercises
• Exercise 1. Extracting commonsense knowledge is challenging because it is
often not explicitly mentioned in written documents. List and describe two po-
tential sources from which commonsense knowledge can be acquired. For each
source, explain how it can be used to gather relevant information and what type
of commonsense knowledge might be obtained.
• Exercise 4. Explain with examples how a bottom-up strategy can enable the
discovery of conceptual primitives for a specific domain directly from text data,
and how a top-down strategy can then be applied on the resulting set of primitives
to identify more words and MWEs related to them.
In this textbook, we have explored the multifaceted domain of NLU, delving into
the complexities and advancements that define this critical area of AI. We have in-
vestigated ongoing initiatives aimed at enhancing the reliability, responsibility, and
personalization of AI. We have navigated through the essential components of NLU,
including syntactic, semantic, and pragmatic processing techniques, and examined
the pivotal role of LLMs and their integration into neurosymbolic frameworks.
Our journey began with a foundational understanding of how language operates at
various levels of abstraction, highlighting the importance of both symbolic and sub-
symbolic approaches. We then discussed the emergence of LLMs, showcasing their
impressive capabilities in handling diverse NLP tasks. However, we also addressed
their limitations, particularly in the realm of logical reasoning and commonsense
understanding. The introduction of neurosymbolic methods represents a significant
stride toward bridging these gaps. By combining the learning prowess of subsym-
bolic systems with the reasoning capabilities of symbolic systems, we can create
more robust and versatile NLU frameworks. The integration of knowledge bases like
PrimeNet into these frameworks exemplifies the potential for enhanced reasoning
and understanding.
380 6 Conclusion
6.1.1 Assignment
In this group assignment, you will build an opinion search engine. Given a specific
topic of your choice (e.g., cryptocurrencies), your system should enable users to
find relevant opinions about any instance of such topic (e.g., bitcoin) and perform
sentiment analysis on the results (e.g., opinions about bitcoin are 70% positive and
30% negative). Once you have chosen a topic, make sure that a) you can find enough
data about it (e.g., some topics may be too niche, to the point that you would only
find a few hundred data points) and b) the opinions you get are balanced (e.g., if
the topic you chose only has negative opinions associated with it, then it is probably
not a good topic). For ideas about interesting topics, you can check our project page
at [Link] You are pretty much free to use anything you want in
terms of available tools/libraries. However, your system cannot be just a mashup
of existing services. Your final score will depend not only on how you developed
your system but also on its novelty and your creativity: in other words, to get a high
score, you need to implement not only a system that works but also one
that is useful and user-friendly. The main tasks of the assignment are crawling (20
points), indexing (40 points) and classification (40 points). A minimum of 60 points
is required to pass the assignment. Consult with your course coordinator regarding
the procedures and requirements for submitting your assignment.
Crawl text data from any sources which you are interested in and permitted to
access, e.g., X API or Reddit API. The crawled corpus should have at least 10,000
records and at least 100,000 words. It is fine to use available datasets for training
(e.g., popular sentiment benchmarks), but you still have to at least crawl and label
data for testing.
Also, make sure your dataset does not contain duplicates and try your best to make
it balanced (e.g., equal number of positive and negative entries). Before crawling any
data, carefully consider the questions in this material, e.g., check whether the data
have enough details to answer the questions. You can use any third-party libraries
for the crawling task, e.g.:
• Jsoup: [Link]
• Twitter4j: [Link]
• Facebook marketing: [Link]
• Instagram: [Link]
• Amazon: [Link]
• Tinder: [Link]
• Tik Tok: [Link]
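The deduplication and balancing checks described above can be sketched as a small post-processing step; the `"positive"`/`"negative"` label names are assumptions about your record schema:

```python
import random

def clean_corpus(records, seed=0):
    """Sketch of the two dataset checks: drop duplicate texts, then
    downsample the majority class so positive and negative entries are
    balanced. `records` is a list of (text, label) pairs."""
    seen, unique = set(), []
    for text, label in records:
        key = text.strip().lower()       # normalize before deduplicating
        if key not in seen:
            seen.add(key)
            unique.append((text, label))
    pos = [r for r in unique if r[1] == "positive"]
    neg = [r for r in unique if r[1] == "negative"]
    n = min(len(pos), len(neg))          # balance by downsampling
    rng = random.Random(seed)            # fixed seed for reproducibility
    return rng.sample(pos, n) + rng.sample(neg, n)
```

Run this once after crawling and before splitting your data into training and test sets, so the balance check applies to both.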
Indexing: You can do this from scratch or use a combination of available tools,
e.g., Solr+Lucene+Jetty. Solr runs as a standalone full-text search server within a
servlet container such as Jetty, and it uses the Lucene search library at its core for text
indexing and search. Solr has REST-like HTTP/XML and JSON APIs that make it
easy to use from any programming language. Useful documentation includes:
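As a starting point, documents can be pushed into Solr through its JSON update API; the sketch below only builds the HTTP request, and the host, port, and core name (`opinions`) are deployment-specific assumptions:

```python
import json
from urllib import request

def solr_add(core, docs, host="http://localhost:8983"):
    """Build the HTTP request that indexes `docs` (a list of dicts) into a
    Solr core via its JSON update API, committing immediately."""
    url = f"{host}/solr/{core}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return request.Request(url, data=body,
                           headers={"Content-Type": "application/json"})

# Once Solr is running, sending the batch is one call:
#   urllib.request.urlopen(solr_add("opinions", [{"id": "1", "text": "..."}]))
```

Batching many documents per request and committing once at the end is considerably faster than committing per document.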
Querying: You need to provide a simple but friendly user interface (UI) for querying.
It could be either a web-based or mobile-app-based UI. You could use JSP in Java or
Django in Python to develop your UI website. Since Solr provides REST-like APIs
to access indexes, one extra JSON or RESTful library would be enough. Otherwise,
you may use any third-party library.
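Your UI back end only needs to build Solr select URLs like the one sketched below; the host, port, and core name are again deployment-specific assumptions, and the JSON response exposes matching documents under response.docs:

```python
from urllib.parse import urlencode

def solr_query_url(core, query, rows=10, host="http://localhost:8983"):
    """Build a Solr /select URL for a full-text query, asking for a JSON
    response with at most `rows` results."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{host}/solr/{core}/select?{params}"
```

For example, `solr_query_url("opinions", "text:bitcoin")` yields the URL your UI would fetch to list opinions mentioning bitcoin.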
6.1 Learning Resources 383
Question 3: Explore some innovations for enhancing the indexing and ranking.
Explain why they are important to solve specific problems, illustrated with examples.
You can list anything that has helped improve your system from the first version
to the last one, plus queries that did not work earlier but now work because of the
improvements you made. Possible innovations include (but are not limited to) the
following:
• Timeline search (e.g., allow user to search within specific time windows)
• Geo-spatial search (e.g., use map information to refine query results)
• Enhanced search (e.g., add histograms, pie charts, word clouds, etc.)
• Interactive search (e.g., refine search results based on users’ relevance feedback)
• Multimodal search (e.g., implement image or video retrieval)
• Multilingual search (e.g., enable information retrieval in multiple languages)
• Multifaceted search (e.g., visualize information according to different categories)
Choose two or more subtasks from the NLU suitcase model (Fig. 1.12) to per-
form information extraction on your crawled data. For example, you could choose
subjectivity detection and polarity detection to first categorize your data as neutral
versus opinionated and then classify the resulting opinionated data as positive versus
negative. Different classification approaches can be applied, including:
You can tap into any resource or toolkit you like, as long as you motivate your
choices and you are able to critically analyze obtained results. Some possible choices
include:
• Weka: [Link]
• Hadoop: [Link]
• Pylearn2: [Link]
• SciKit: [Link]
• NLTK: [Link]
• Theano: [Link]
• Keras: [Link]
• Tensorflow: [Link]
• PyTorch: [Link]
• Huggingface: [Link]
• AllenNLP: [Link]
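As a minimal illustration of the suggested subjectivity-then-polarity cascade, the sketch below uses toy lexicon lookups in place of trained classifiers; the lexicons are hypothetical, and in your submission each stage would be a real model trained on your crawled data:

```python
def two_stage_classify(text, subj_lexicon, pos_lexicon, neg_lexicon):
    """Cascade sketch: stage 1 separates neutral from opinionated text;
    stage 2 classifies the opinionated text as positive or negative."""
    words = set(text.lower().split())
    if not words & subj_lexicon:           # stage 1: subjectivity detection
        return "neutral"
    pos = len(words & pos_lexicon)         # stage 2: polarity detection
    neg = len(words & neg_lexicon)
    return "positive" if pos >= neg else "negative"
```

The point of the cascade is that stage 2 never sees neutral text, so the polarity classifier can be trained on opinionated examples only.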
D. Submission
The submission shall consist of one single PDF file. Add some pictures to your
report to make it clearer and easier to read. There is no page limit and no special
formatting is required. The file shall contain the following five key items:
1) The names of the group members in the first page
2) Your answers to all the above questions
6.1.2 Quiz
15) Which of the following is a common dataset used for evaluating SBD systems?
A) Penn Treebank
B) OntoNotes
C) MIMIC
D) SIGMORPHON
20) Which resource is commonly used for training and evaluating POS taggers?
A) OntoNotes
B) Penn Treebank
C) LexNorm
D) LIWC
21) In addition to words, what other feature can be useful in POS tagging?
A) Word length
B) Word frequency
C) Word embeddings
D) Lemmatization
23) Which machine learning technique is commonly used for text chunking?
A) K-means clustering
B) HMM
C) SVM
D) Decision trees
25) Which resource is commonly used for training and evaluating chunking systems?
A) Penn Treebank
B) OntoNotes
C) LexNorm
D) MIMIC
29) Which of the following tools is commonly used for lemmatization in Python?
A) NLTK
B) OpenCV
C) TensorFlow
D) Scikit-learn
31) Which of the following is the correct lemma for the word “better”?
A) good
B) best
C) well
D) better
33) Which of the following words has the same lemma and root form in English?
A) Running
B) Happier
C) Cats
D) Quick
45) Which resource is commonly used for training and evaluating NER systems?
A) OntoNotes
B) Penn Treebank
C) CoNLL-2003 dataset
D) SenticNet
47) Which of the following techniques is most useful for concept extraction?
A) NER
B) POS tagging
C) TF-IDF
D) OCR
51) Which of the following tools can be used for concept extraction in Python?
A) OpenCV
B) NLTK
C) Scikit-learn
D) SpaCy
55) Which of the following algorithms can be used for anaphora resolution?
A) LDA
B) SVM
C) CRF
D) Neural networks
56) Which resource is often used to help with anaphora resolution in NLU?
A) WordNet
B) OntoNotes
C) Penn Treebank
D) SenticNet
57) Which machine learning technique is commonly used for subjectivity detection?
A) K-means clustering
B) Decision trees
C) SVM
D) PCA
58) What is the difference between ambivalence and neutrality in sentiment analysis?
A) Ambivalence means mixed feelings, while neutrality means no sentiment
B) Ambivalence means no sentiment, while neutrality means mixed feelings
C) Ambivalence and neutrality are the same
D) Ambivalence means strong feelings, while neutrality means extreme emotions
59) Which of the following features is most useful for subjectivity detection?
A) Named entities
B) POS Tags
C) TF-IDF
D) OCR
61) Which of the following techniques can be used for metaphor detection in NLU?
A) NER
B) Topic modeling
C) POS tagging
D) WSD
64) Which of the following NLU techniques is often used to understand metaphors?
A) CNN
B) SVM
C) SRL
D) OCR
65) Which resource can be useful for training metaphor understanding models?
A) WordNet
B) OntoNotes
C) PropBank
D) Penn Treebank
69) Which machine learning model can be used for sarcasm detection?
A) K-means clustering
B) Decision trees
C) RNN
D) Linear regression
70) Which NLU technique can be useful for detecting sarcasm in text?
A) NER
B) Sentiment analysis
C) POS tagging
D) Text summarization
71) Which of the following resources can help improve sarcasm detection models?
A) SenticNet
B) OntoNotes
C) Penn Treebank
D) SIGMORPHON
72) Which of the following models is commonly used for personality recognition?
A) CNN
B) RNN
C) SVM
D) Decision trees
73) In personality recognition, what does the “Big Five” model assess?
A) Cognitive abilities
B) Openness, conscientiousness, extraversion, agreeableness, and neuroticism
C) Emotional intelligence
D) Linguistic diversity
75) Which approach can assess user personality based on social media activity?
A) Sentiment analysis
B) Text classification
C) Personality prediction models
D) NER
76) What is a key advantage of using machine learning for personality recognition?
A) It can provide real-time feedback on personality traits
B) It can translate text into multiple languages
C) It can generate new text automatically
D) It can correct grammatical errors in text
77) Which resource can be useful for evaluating personality recognition models?
A) WordNet
B) Personality-labeled text datasets
C) OntoNotes
D) Penn Treebank
80) Which of the following algorithms can be used for aspect extraction?
A) LDA
B) K-means clustering
C) Decision trees
D) PCA
81) Which NLU tool or library can be useful for aspect extraction?
A) OpenCV
B) NLTK
C) TensorFlow
D) Keras
85) Which of the following methods can be used to perform polarity detection?
A) LDA
B) SVM
C) CNN
D) PCA
87) Which of the following libraries can be used for polarity detection in Python?
A) OpenCV
B) Scikit-learn
C) TensorFlow
D) SpaCy
References 401
Cem Akkaya, Janyce Wiebe, and Rada Mihalcea. Subjectivity word sense disam-
biguation. In EMNLP, pages 190–199, 2009.
Ahmed Al Hamoud, Amber Hoenig, and Kaushik Roy. Sentence subjectivity analysis
of a political and ideological debate dataset using LSTM and BiLSTM with
attention and GRU models. Journal of King Saud University-Computer and
Information Sciences, pages 7975–7987, 2022.
Rabah A. Al-Zaidy, Cornelia Caragea, and C. Lee Giles. Bi-lstm-crf sequence
labeling for keyphrase extraction from scholarly documents. In Ling Liu, Ryen W.
White, Amin Mantrach, Fabrizio Silvestri, Julian J. McAuley, Ricardo Baeza-
Yates, and Leila Zia, editors, The World Wide Web Conference, pages 2551–2557.
ACM, 2019.
Firoj Alam, Evgeny A Stepanov, and Giuseppe Riccardi. Personality traits recogni-
tion on social network-facebook. In international AAAI conference on web and
social media, pages 6–9, 2013.
Mehwish Alam, Andreea Iana, Alexander Grote, Katharina Ludwig, Philipp Müller,
and Heiko Paulheim. Towards analyzing the bias of news recommender systems
using sentiment and stance detection. In Web Conference, pages 448–457, 2022.
Zakariae Alami Merrouni, Bouchra Frikh, and Brahim Ouhbi. Automatic keyphrase
extraction: a survey and trends. Journal of Intelligent Information Systems, 54:
391–424, 2020.
Desislava Aleksandrova, François Lareau, and Pierre André Ménard. Multilingual
sentence-level bias detection in Wikipedia. In RANLP, pages 42–51, 2019.
Badr AlKhamissi, Millicent Li, Asli Celikyilmaz, Mona T. Diab, and Marjan
Ghazvininejad. A review on language models as knowledge bases. CoRR,
abs/2204.06031, 2022.
Abdulrahman Aloraini and Massimo Poesio. Cross-lingual zero pronoun resolution.
In LREC, pages 90–98, 2020.
Abdulrahman Aloraini and Massimo Poesio. Data augmentation methods for
anaphoric zero pronouns. In Fourth Workshop on Computational Models of
Reference, Anaphora and Coreference, pages 82–93, 2021.
Abdulrahman Aloraini, Sameer Pradhan, and Massimo Poesio. Joint coreference
resolution for zeros and non-zeros in Arabic. arXiv preprint arXiv:2210.12169,
2022.
Duygu Altinok. An ontology-based dialogue management system for banking and
finance dialogue systems. arXiv preprint arXiv:1804.04838, 2018.
Nabeela Altrabsheh, Mohamed Medhat Gaber, Mihaela Cocea, et al. SA-E: sentiment
analysis for education. Frontiers in Artificial Intelligence and Applications, 255:
353–362, 2013.
Yasemin Altun, Ioannis Tsochantaridis, and Thomas Hofmann. Hidden markov
support vector machines. In ICML, pages 3–10, 2003.
Nouf Alturaief, Hamoud Aljamaan, and Malak Baslyman. Aware: Aspect-based
sentiment analysis dataset of apps reviews for requirements elicitation. In 2021
36th IEEE/ACM International Conference on Automated Software Engineering
Workshops (ASEW), pages 211–218, 2021.
402 References
Pooja Alva and Vinay Hegde. Hidden Markov model for POS tagging in word
sense disambiguation. In 2016 International Conference on Computation System
and Information Technology for Sustainable Solutions (CSITSS), pages 279–284,
2016.
Nestor Alvaro, Yusuke Miyao, Nigel Collier, et al. TwiMed: Twitter and PubMed
comparable corpus of drugs, diseases, symptoms, and their relations. JMIR Public
Health and Surveillance, 3(2):e6396, 2017.
Enrique Amigó, Jorge Carrillo de Albornoz, Irina Chugur, Adolfo Corujo, Julio
Gonzalo, Tamara Martín, Edgar Meij, Maarten De Rijke, and Damiano Spina.
Overview of RepLab 2013: Evaluating online reputation monitoring systems. In
International conference of the cross-language evaluation forum for european
languages, pages 333–352, 2013.
Mostafa Amin, Erik Cambria, and Björn Schuller. Will affective computing emerge
from foundation models and General AI? A first evaluation on ChatGPT. IEEE
Intelligent Systems, 38(2):15–23, 2023.
Mostafa M Amin, Rui Mao, Erik Cambria, and Björn W Schuller. A wide evalu-
ation of ChatGPT on affective computing tasks. IEEE Transactions on Affective
Computing, 2024.
Ida Amini, Samane Karimi, and Azadeh Shakery. Cross-lingual subjectivity detec-
tion for resource lean languages. In Tenth Workshop on Computational Approaches
to Subjectivity, Sentiment and Social Media Analysis, pages 81–90, 2019.
Henry Anaya-Sánchez, Aurora Pons-Porrata, and Rafael Berlanga-Llavori. Word
sense disambiguation based on word sense clustering. In Advances in Artificial
Intelligence-IBERAMIA-SBIA 2006, pages 472–481. Springer, 2006.
Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures
from multiple tasks and unlabeled data. Journal of Machine Learning Research,
6:1817–1853, 2005.
Chinatsu Aone and Scott William Bennett. Automated acquisition of anaphora
resolution strategies. AAAI, 1995.
Ian A Apperly and Stephen A Butterfill. Do humans have two systems to track beliefs
and belief-like states? Psychological Review, 116(4):953, 2009.
Gor Arakelyan, Karen Hambardzumyan, and Hrant Khachatrian. Towards jointud:
Part-of-speech tagging and lemmatization using recurrent neural networks. arXiv
preprint arXiv:1809.03211, 2018.
Rahul Aralikatte, Heather Lent, Ana Valeria Gonzalez, Daniel Herschcovich, Chen
Qiu, Anders Sandholm, Michael Ringaard, and Anders Søgaard. Rewarding
coreference resolvers for being consistent with world knowledge. In EMNLP-
IJCNLP, pages 1229–1235, 2019.
Rahul Aralikatte, Matthew Lamm, Daniel Hardt, and Anders Søgaard. Ellipsis
resolution as question answering: An evaluation. In EACL, pages 810–817, 2021.
Qazi Mohammad Areeb, Mohammad Nadeem, Shahab Saquib Sohail, Raza Imam,
Faiyaz Doctor, Yassine Himeur, Amir Hussain, and Abbes Amira. Filter bubbles
in recommender systems: Fact or fallacy—a systematic review. Wiley Interdisci-
plinary Reviews: Data Mining and Knowledge Discovery, 13(6):e1512, 2023.
References 403
Shlomo Argamon, Moshe Koppel, Jonathan Fine, and Anat Rachel Shimoni. Gender,
genre, and writing style in formal written texts. Text & talk, 23(3):321–346, 2003.
Shlomo Argamon, Sushant Dhawle, Moshe Koppel, and James W Pennebaker. Lex-
ical predictors of personality type. In 2005 joint annual meeting of the interface
and the classification society of North America, pages 1–16, 2005.
Jennifer E Arnold. Reference form and discourse patterns. Stanford University,
1998.
Ron Artstein and Massimo Poesio. Inter-coder agreement for computational linguis-
tics. Computational linguistics, 34(4):555–596, 2008.
Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony
Bharath. Deep reinforcement learning: A brief survey. IEEE Signal Process-
ing Magazine, 34(6):26–38, 2017.
Nana Yaw Asabere and Amevi Acakpovi. ROPPSA: TV program recommendation
based on personality and social awareness. Mathematical Problems in Engineer-
ing, 2020:1–15, 2020.
Jens B Asendorpf. Head-to-head comparison of the predictive validity of personality
types and dimensions. European Journal of Personality, 17(5):327–346, 2003.
Muhammad Zubair Asghar, Aurangzeb Khan, Shakeel Ahmad, and Fazal Masud
Kundi. A review of feature extraction in sentiment analysis. Journal of Basic and
Applied Scientific Research, 4(3):181–186, 2014.
Nicholas Asher and Alex Lascarides. Logics of Conversation. Cambridge University
Press, 2003.
Michael C Ashton, Kibeom Lee, Marco Perugini, Piotr Szarota, Reinout E De Vries,
Lisa Di Blas, Kathleen Boies, and Boele De Raad. A six-factor structure of
personality-descriptive adjectives: solutions from psycholexical studies in seven
languages. Journal of Personality and Social Psychology, 86(2):356, 2004.
Atanas V Atanassov, Dimitar I Pilev, Fani N Tomova, and Vanya D Kuzmanova.
Hybrid system for emotion recognition based on facial expressions and body
gesture recognition. In 2021 International Conference Automatics and Informatics
(ICAI), pages 135–140, 2021.
Beryl TS Atkins. Tools for computer-aided corpus lexicography: The Hector project. Acta Linguistica Hungarica, 41:5–71, 1992.
Sandeep Attree. Gendered ambiguous pronouns shared task: Boosting model con-
fidence by evidence pooling. In First Workshop on Gender Bias in Natural
Language Processing, pages 134–146, 2019.
Anthony Aue and Michael Gamon. Customizing sentiment classifiers to new do-
mains: A case study. In RANLP, 2005.
Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak,
and Zachary Ives. DBpedia: A nucleus for a web of open data. In International
Semantic Web Conference, pages 722–735, 2007.
John Langshaw Austin. How to do things with words. Oxford University Press,
1975.
Nastaran Babanejad, Heidar Davoudi, Aijun An, and Manos Papagelis. Affective
and contextual embedding for sarcasm detection. In COLING, pages 225–243,
2020.
Pratyay Banerjee and Chitta Baral. Self-supervised knowledge triplet learning for
zero-shot question answering. In EMNLP, pages 151–162, 2020.
Satanjeev Banerjee and Ted Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In IJCAI, volume 3, pages 805–810, 2003.
Ann Banfield. Unspeakable Sentences (Routledge Revivals): Narration and Repre-
sentation in the Language of Fiction. Routledge, 2014.
Jeesoo Bang, Hyungjong Noh, Yonghee Kim, and Gary Geunbae Lee. Example-
based chat-oriented dialogue system with personalized long-term memory. In
2015 International Conference on Big Data and Smart Computing (BIGCOMP),
pages 238–243, 2015.
Rainer Banse and Klaus R Scherer. Acoustic profiles in vocal emotion expression.
Journal of Personality and Social Psychology, 70(3):614, 1996.
Hui Bao, Kai He, Xuemeng Yin, Xuanyu Li, Xinrui Bao, Haichuan Zhang, Jialun
Wu, and Zeyu Gao. Bert-based meta-learning approach with looking back for
sentiment analysis of literary book reviews. In NLPCC, pages 235–247, 2021.
Xiaoyi Bao, Zhongqing Wang, Xiaotong Jiang, Rong Xiao, and Shoushan Li. Aspect-based sentiment analysis with opinion tree generation. In IJCAI, pages 4044–4050, 2022.
Roy Bar-Haim, Ido Dagan, Shachar Mirkin, Eyal Shnarch, Idan Szpektor, Jonathan
Berant, and Iddo Greental. Efficient semantic deduction and approximate matching
over compact parse forests. In TAC, pages 1–10, 2008.
Edoardo Barba, Tommaso Pasini, and Roberto Navigli. ESC: Redesigning WSD
with extractive sense comprehension. In NAACL-HLT, pages 4661–4672, 2021a.
Edoardo Barba, Luigi Procopio, Caterina Lacerra, Tommaso Pasini, and Roberto
Navigli. Exemplification modeling: Can you give me an example, please? In
IJCAI, pages 3779–3785, 2021b.
Luciano Barbosa and Junlan Feng. Robust sentiment detection on twitter from biased
and noisy data. In COLING, pages 36–44, 2010.
Valentina Bartalesi Lenzi, Giovanni Moretti, and Rachele Sprugnoli. CAT: the
CELCT annotation tool. In Eighth International Conference on Language Re-
sources and Evaluation (LREC’12), pages 333–338, 2012.
Fabian Barteld, Ingrid Schröder, and Heike Zinsmeister. Dealing with word-
internal modification and spelling variation in data-driven lemmatization. In
10th SIGHUM workshop on language technology for cultural heritage, social
sciences, and humanities, pages 52–62, 2016.
Susan Bartlett, Grzegorz Kondrak, and Colin Cherry. Automatic syllabification
with structured SVMs for letter-to-phoneme conversion. In ACL-08: HLT, pages
568–576, 2008.
Jon Barwise and John Perry. Situations and attitudes. The Journal of Philosophy,
78(11):668–691, 1981.
Pierpaolo Basile, Annalina Caputo, and Giovanni Semeraro. An enhanced lesk
word sense disambiguation algorithm through a distributional semantic model. In
COLING, pages 1591–1600, 2014.
Mohammad Ehsan Basiri, Shahla Nemati, Moloud Abdar, Erik Cambria, and U Rajendra Acharya. ABCDM: An attention-based bidirectional CNN-RNN deep model for sentiment analysis. Future Generation Computer Systems, 115:279–294, 2021.
Gourav Bathla, Pardeep Singh, Rahul Kumar Singh, Erik Cambria, and Rajeev
Tiwari. Intelligent fake reviews detection based on aspect extraction and analysis
using deep learning. Neural Computing and Applications, 34:20213–20229, 2022.
Zeynep Batmaz, Ali Yurekli, Alper Bilge, and Cihan Kaleli. A review on deep
learning for recommender systems: challenges and remedies. Artificial Intelligence
Review, 52(1):1–37, 2019.
David I Beaver. The optimization of discourse anaphora. Linguistics and Philosophy,
27:3–56, 2004.
Manjot Bedi, Shivani Kumar, Md Shad Akhtar, and Tanmoy Chakraborty. Multi-
modal sarcasm detection and humor classification in code-mixed conversations.
IEEE Transactions on Affective Computing, 2021.
Salima Behdenna, Barigou Fatiha, and Ghalem Belalem. Ontology-based approach
to enhance explicit aspect extraction in standard arabic reviews. International
Journal of Computing and Digital Systems, 11(1):277–287, 2022.
Alexander Beider. Beider-Morse phonetic matching: An alternative to Soundex with fewer false hits. Avotaynu: The International Review of Jewish Genealogy, 24(2):12, 2008.
Ioannis Bekoulis, Johannes Deleu, Thomas Demeester, and Chris Develder. Joint
entity recognition and relation extraction as a multi-head selection problem. Expert
Systems With Applications, 114:34–45, 2018.
Yonatan Belinkov, Lluís Màrquez, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and
James Glass. Evaluating layers of representation in neural machine translation on
part-of-speech and semantic tagging tasks. In IJCNLP, pages 1–10, 2017.
Farah Benamara, Baptiste Chardon, Yannick Mathieu, and Vladimir Popescu. To-
wards context-based subjectivity analysis. In IJCNLP, pages 1180–1188, 2011.
Emily M Bender and Alexander Koller. Climbing towards NLU: On meaning, form, and understanding in the age of data. In ACL, pages 5185–5198, 2020.
Eric Bengtson and Dan Roth. Understanding the value of features for coreference
resolution. In EMNLP, pages 294–303, 2008.
Luisa Bentivogli and Emanuele Pianta. Exploiting parallel texts in the creation of
multilingual semantically annotated resources: The MultiSemCor corpus. Natural
Language Engineering, 11(3):247–261, 2005.
Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. The fifth PAS-
CAL recognizing textual entailment challenge. In TAC, pages 1–18, 2009.
Gábor Berend. Sparsity makes sense: Word sense disambiguation using sparse
contextualized word representations. In EMNLP, pages 8498–8508, 2020.
Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. A maximum
entropy approach to natural language processing. Computational Linguistics, 22
(1):39–71, 1996.
Sabine Bergler, René Witte, Michelle Khalife, Zhuoyan Li, and Frank Rudzicz. Using
knowledge-poor coreference resolution for text summarization. In Workshop on
Text Summarization, pages 1–8, 2003.
David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation.
Journal of Machine Learning Research, 3:993–1022, 2003.
Terra Blevins and Luke Zettlemoyer. Moving down the long tail of word sense
disambiguation with gloss informed bi-encoders. In ACL, pages 1006–1017,
2020.
John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In 45th Annual Meeting of the Association for Computational Linguistics, pages 440–447, 2007.
Rexhina Blloshmi, Tommaso Pasini, Niccolò Campolungo, Somnath Banerjee,
Roberto Navigli, and Gabriella Pasi. IR like a SIR: Sense-enhanced informa-
tion retrieval for multiple languages. In EMNLP, pages 1030–1041, 2021.
Kenneth Bloom, Navendu Garg, and Shlomo Argamon. Extracting appraisal expres-
sions. In NAACL-HLT, pages 308–315, 2007.
Daniel G Bobrow and Donald A Norman. Some principles of memory schemata.
In Representation and understanding, pages 131–149. Morgan Kaufmann, San
Diego, 1975.
Olivier Bodenreider. The Unified Medical Language System (UMLS): Integrating biomedical terminology. Nucleic Acids Research, 32(suppl_1):D267–D270, 2004.
Paul Boersma. Praat, a system for doing phonetics by computer. Glot International, 5(9):341–345, 2001.
Gergo Bogacsovics, Janos Toth, Andras Hajdu, and Balazs Harangi. Enhancing
CNNs through the use of hand-crafted features in automated fundus image clas-
sification. Biomedical Signal Processing and Control, 76:103685, 2022.
Alena Böhmová, Jan Hajič, Eva Hajičová, Barbora Hladká, and Anne Abeillé. The Prague Dependency Treebank: Three-level annotation scenario. Treebanks: Building and Using Parsed Corpora, 20:103–127, 2003.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching
word vectors with subword information. Transactions of the Association for
Computational Linguistics, 5, 2016.
Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Free-
base: A collaboratively created graph database for structuring human knowledge.
In SIGMOD, pages 1247–1250, 2008.
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora,
Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma
Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
Francis Bond, Timothy Baldwin, Richard Fothergill, and Kiyotaka Uchimoto.
Japanese SemCor: A sense-tagged corpus of Japanese. In 6th Global WordNet
Conference, pages 56–63, 2012.
Sofia Bonicalzi, Mario De Caro, and Benedetta Giovanola. Artificial intelligence
and autonomy: On the ethical dimension of recommender systems. Topoi, 2023.
Kalina Bontcheva, Hamish Cunningham, Ian Roberts, Angus Roberts, Valentin
Tablan, Niraj Aswani, and Genevieve Gorrell. GATE Teamware: a web-based,
collaborative text annotation framework. Language Resources and Evaluation, 47
(4):1007–1029, 2013a.
Kalina Bontcheva, Leon Derczynski, Adam Funk, Mark A Greenwood, Diana May-
nard, and Niraj Aswani. TwitIE: An open-source information extraction pipeline
for microblog text. In RANLP, pages 83–90, 2013b.
Marco Bonzanini, Miguel Martinez-Alvarez, and Thomas Roelleke. Opinion sum-
marisation through sentence extraction: An investigation with movie reviews.
In 35th International ACM SIGIR Conference on Research and Development in
Information Retrieval, pages 1121–1122, 2012.
Ari Bornstein, Arie Cattan, and Ido Dagan. CoRefi: A crowd sourcing suite for
coreference annotation. In EMNLP, pages 205–215, 2020.
Lera Boroditsky. How language shapes thought. Scientific American, 304(2):62–65,
2011.
Oriol Borrega, Mariona Taulé, and M. Antònia Martí. What do we mean when we speak about named entities? In Corpus Linguistics, pages 1–27, 2007.
Denny Borsboom, Gideon J Mellenbergh, and Jaap Van Heerden. The theoretical
status of latent variables. Psychological Review, 110(2):203, 2003.
Johan Bos. Implementing the binding and accommodation theory for anaphora
resolution and presupposition projection. Computational Linguistics, 29(2):179–
210, 2003.
Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Ce-
likyilmaz, and Yejin Choi. COMET: commonsense transformers for automatic
knowledge graph construction. In ACL, pages 4762–4779, 2019.
Mondher Bouazizi and Tomoaki Ohtsuki. Opinion mining in Twitter: How to make use of sarcasm to enhance sentiment analysis. In 2015 IEEE/ACM International
Conference on Advances in Social Networks Analysis and Mining 2015, pages
1594–1597, 2015.
Florian Boudin, Stéphane Huet, and Juan-Manuel Torres-Moreno. A graph-based
approach to cross-language multi-document summarization. Polibits, pages 113–
118, 2011.
Gosse Bouma. Finite state methods for hyphenation. Natural Language Engineering, 9:5–20, 2003. doi: 10.1017/S1351324903003073.
Zied Bouraoui, Sébastien Konieczny, Thanh Ma, Nicolas Schwind, and Ivan Varz-
inczak. Region-based merging of open-domain terminological knowledge. In
International Conference on Principles of Knowledge Representation and Rea-
soning, KR, pages 81–90, 2022.
Gordon H Bower. Mood and memory. American Psychologist, 36(2):129, 1981.
Sven Branahl. Das EDGAR (Electronic Data Gathering, Analysis and Retrieval) System der SEC und seine Bedeutung für die Bereitstellung von Rechnungslegungsinformationen [The SEC's EDGAR system and its significance for the provision of financial reporting information]. Diplom.de, 1998.
Thorsten Brants. TnT: A statistical part-of-speech tagger. arXiv preprint cs/0003055, 2000.
Susan E Brennan. Centering attention in discourse. Language and Cognitive Processes, 10(2):137–167, 1995.
Susan E. Brennan, Marilyn W. Friedman, and Carl J. Pollard. A centering approach
to pronouns. In ACL, pages 155–162, 1987.
Erik Cambria and Bebo White. Jumping NLP curves: A review of natural language
processing research. IEEE Computational Intelligence Magazine, 9(2):48–57,
2014.
Erik Cambria, Amir Hussain, Catherine Havasi, and Chris Eckl. Common sense
computing: From the society of mind to digital intuition and beyond. In Julian
Fierrez, Javier Ortega, Anna Esposito, Andrzej Drygajlo, and Marcos Faundez-
Zanuy, editors, Biometric ID Management and Multimodal Communication, vol-
ume 5707 of Lecture Notes in Computer Science, pages 252–259. Springer, Berlin
Heidelberg, 2009.
Erik Cambria, Daniel Olsher, and Kenneth Kwok. Sentic activation: A two-level
affective common sense reasoning framework. In AAAI, pages 186–192, Toronto,
2012.
Erik Cambria, Soujanya Poria, Alexander Gelbukh, and Mike Thelwall. Sentiment
analysis is a big suitcase. IEEE Intelligent Systems, 32(6):74–80, 2017.
Erik Cambria, Rui Mao, Sooji Han, and Qian Liu. Sentic parser: A graph-based
approach to concept extraction for sentiment analysis. In ICDM Workshops, pages
413–420, 2022.
Erik Cambria, Rui Mao, Melvin Chen, Zhaoxia Wang, and Seng-Beng Ho. Seven
pillars for the future of artificial intelligence. IEEE Intelligent Systems, 38(6):
62–69, 2023.
Erik Cambria, Xulang Zhang, Rui Mao, Melvin Chen, and Kenneth Kwok. SenticNet
8: Fusing emotion AI and commonsense AI for interpretable, trustworthy, and ex-
plainable affective computing. In International Conference on Human-Computer
Interaction (HCII), 2024.
Niccolò Campolungo, Federico Martelli, Francesco Saina, and Roberto Navigli.
DiBiMT: A novel benchmark for measuring word sense disambiguation biases in
machine translation. In ACL, pages 4331–4352, 2022.
Olivier Cappé, Simon J Godsill, and Eric Moulines. An overview of existing methods
and recent advances in sequential Monte Carlo. Proceedings of the IEEE, 95(5):899–924, 2007.
Claire Cardie and Kiri Wagstaff. Noun phrase coreference as clustering. In 1999
Joint SIGDAT Conference on Empirical Methods in Natural Language Processing
and Very Large Corpora, pages 82–89, 1999.
David Carter. Interpreting Anaphors in Natural Language Texts. Halsted Press,
1987.
Avshalom Caspi, Brent W Roberts, and Rebecca L Shiner. Personality development:
Stability and change. Annual Review of Psychology, 56:453–484, 2005.
Giovanni Castiglia, Ayoub El Majjodi, Alain Dominique Starke, Fedelucio Narducci,
Yashar Deldjoo, and Federica Calò. Nudging towards health in a conversational
food recommender system using multi-modal interactions and nutrition labels. In
Fourth Knowledge-aware and Conversational Recommender Systems Workshop
(KaRS), volume 3294, pages 29–35, 2022.
Santiago Castro, Devamanyu Hazarika, Verónica Pérez-Rosas, Roger Zimmermann,
Rada Mihalcea, and Soujanya Poria. Towards multimodal sarcasm detection (an
obviously perfect paper). In ACL, pages 4619–4629, Florence, Italy, July 2019.
Rosario Catelli, Hamido Fujita, Giuseppe De Pietro, and Massimo Esposito. Decep-
tive reviews and sentiment polarity: Effective link by exploiting BERT. Expert
Systems with Applications, 209:118290, 2022.
Nicolas Cebron and Michael R Berthold. Active learning for object classification:
from exploration to exploitation. Data Mining and Knowledge Discovery, 18(2):
283–299, 2009.
Giuseppe GA Celano. A gradient boosting-Seq2Seq system for Latin POS tagging
and lemmatization. In LT4HALA 2020-1st Workshop on Language Technologies
for Historical and Ancient Languages, pages 119–123, 2020.
Fabio Celli and Bruno Lepri. Is Big Five better than MBTI? A personality computing challenge using Twitter data. In Fifth Italian Conference on Computational Linguistics (CLiC-it), 2018.
Daniel Cervone. Personality architecture: Within-person structures and processes.
Annual Review of Psychology, 56:423, 2005.
Wallace Chafe. Givenness, contrastiveness, definiteness, subjects, topics, and point
of view. Subject and Topic, 1976.
Haixia Chai and Michael Strube. Incorporating centering theory into neural coref-
erence resolution. In NAACL-HLT, pages 2996–3002, 2022.
Abhisek Chakrabarty, Onkar Arun Pandit, and Utpal Garain. Context sensitive
lemmatization using two successive bidirectional gated recurrent networks. In
ACL, pages 1481–1491, 2017.
Abhisek Chakrabarty, Akshay Chaturvedi, and Utpal Garain. CNN-based context
sensitive lemmatization. In ACM India Joint International Conference on Data
Science and Management of Data, pages 334–337, 2019.
Navoneel Chakrabarty, Siddhartha Chowdhury, Sangita D Kanni, and Swarnakeshar
Mukherjee. FAFinder: Friend suggestion system for social networking. In In-
telligent Data Communication Technologies and Internet of Things: ICICI 2019,
pages 51–58, 2020.
Craig G. Chambers and Ron Smyth. Structural parallelism and discourse coherence:
A test of centering theory. Journal of Memory and Language, 39(4):593–608,
1998.
Nathanael Chambers, Daniel Cer, Trond Grenager, David Hall, Chloe Kiddon,
Bill MacCartney, Marie-Catherine de Marneffe, Daniel Ramage, Eric Yeh, and
Christopher D. Manning. Learning alignments and leveraging natural logic. In
ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 165–
170, 2007.
Jonathan Charteris-Black. Metaphor and political communication. In Metaphor and
Discourse, pages 97–115. Springer, 2009.
Iti Chaturvedi, Erik Cambria, Soujanya Poria, and Rajiv Bajpai. Bayesian deep
convolution belief networks for subjectivity detection. In ICDM workshops, pages
916–923, 2016a.
Iti Chaturvedi, Erik Cambria, and David Vilares. Lyapunov filtering of objectivity
for spanish sentiment model. In IJCNN, pages 4474–4481, 2016b.
Iti Chaturvedi, Edoardo Ragusa, Paolo Gastaldo, Rodolfo Zunino, and Erik Cambria.
Bayesian network based extreme learning machine for subjectivity detection.
Journal of The Franklin Institute, 355(4):1780–1797, 2018.
Iti Chaturvedi, Tim Noel, and Ranjan Satapathy. Speech emotion recognition using
audio matching. Electronics, 11(23):3943, 2022.
Iti Chaturvedi, Ranjan Satapathy, Curtis Lynch, and Erik Cambria. Predicting word
vectors for microtext. Expert Systems, page e13589, 2024.
Dushyant Singh Chauhan, Gopendra Vikram Singh, Aseem Arora, Asif Ekbal, and
Pushpak Bhattacharyya. An emoji-aware multitask framework for multimodal
sarcasm detection. Knowledge-Based Systems, 257:109924, 2022.
Chen Chen and Vincent Ng. Chinese zero pronoun resolution: Some recent advances.
In EMNLP, pages 1360–1365, 2013.
Chen Chen and Vincent Ng. Chinese zero pronoun resolution with deep neural
networks. In ACL, pages 778–788, 2016.
Danqi Chen and Christopher D Manning. A fast and accurate dependency parser
using neural networks. In EMNLP, pages 740–750, 2014.
Guanyi Chen, Kees van Deemter, and Chenghua Lin. Modelling pro-drop with the
rational speech acts model. In 11th International Conference on Natural Language
Generation, pages 57–66, 2018a.
Hong Chen, Zhenhua Fan, Hao Lu, Alan Yuille, and Shu Rong. PreCo: A large-scale
dataset in preschool vocabulary for coreference resolution. In EMNLP, pages
172–181, 2018b.
Jun Chen, Xiaoming Zhang, Yu Wu, Zhao Yan, and Zhoujun Li. Keyphrase gener-
ation with correlation constraints. In Ellen Riloff, David Chiang, Julia Hocken-
maier, and Jun’ichi Tsujii, editors, EMNLP, pages 4057–4066, 2018c.
Li Chen, Luole Qi, and Feng Wang. Comparison of feature-level learning methods
for mining online consumer reviews. Expert Systems with Applications, 39(10):
9588–9601, 2012.
Lingjiao Chen, Matei Zaharia, and James Zou. How is ChatGPT’s behavior changing
over time? arXiv preprint arXiv:2307.09009, 2023a.
Melvin Chen. Trust & trust-engineering in artificial intelligence research: Theory &
praxis. Philosophy & Technology, 34(4):1429–1447, 2021.
Nuo Chen, Yan Wang, Haiyun Jiang, Deng Cai, Yuhan Li, Ziyang Chen, Longyue
Wang, and Jia Li. Large language models meet Harry Potter: A dataset for aligning
dialogue agents with characters. In EMNLP Findings, pages 8506–8520, 2023b.
Shaowei Chen, Jie Liu, Yu Wang, Wenzheng Zhang, and Ziming Chi. Synchronous
double-channel recurrent network for aspect-opinion pair extraction. In ACL,
pages 6515–6524, Online, July 2020.
Shaowei Chen, Yu Wang, Jie Liu, and Yuelin Wang. Bidirectional machine reading
comprehension for aspect sentiment triplet extraction. In AAAI, pages 12666–
12674, 2021a.
Shisong Chen, Binbin Gu, Jianfeng Qu, Zhixu Li, An Liu, Lei Zhao, and Zhigang
Chen. Tackling zero pronoun resolution and non-zero coreference resolution
jointly. In 25th Conference on Computational Natural Language Learning, pages
518–527, 2021b.
Tao Chen and Min-Yen Kan. Creating a live, public short message service corpus:
the nus sms corpus. Language Resources and Evaluation, 2012.
Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In
22nd ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pages 785–794, 2016.
Wang Chen, Yifan Gao, Jiani Zhang, Irwin King, and Michael R. Lyu. Title-guided
encoding for keyphrase generation. In AAAI, pages 6268–6275, 2019.
Wangqun Chen, Fuqiang Lin, Xuan Zhang, Guowei Li, and Bo Liu. Jointly learning
sentimental clues and context incongruity for sarcasm detection. IEEE Access,
10:48292–48300, 2022.
Zhuang Chen and Tieyun Qian. Relation-aware collaborative learning for unified
aspect-based sentiment analysis. In ACL, pages 3685–3694, 2020.
Yan Cheng, Leibo Yao, Guoxiong Xiang, Guanghe Zhang, Tianwei Tang, and Linhui
Zhong. Text sentiment orientation analysis based on multi-channel CNN and
bidirectional GRU with attention mechanism. IEEE Access, 8:134964–134975,
2020.
Paula Chesley, Bruce Vincent, Li Xu, and Rohini K Srihari. Using verbs and adjectives to automatically classify blog sentiment. In AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, 2006.
Zheng Lin Chia, Michal Ptaszynski, Fumito Masui, Gniewosz Leliwa, and Michal
Wroczynski. Machine learning and feature engineering-based study into sarcasm
and irony classification with application to cyberbullying detection. Information
Processing & Management, 58(4):102600, 2021.
Nancy Chinchor, Lynette Hirschman, and David D Lewis. Evaluating message
understanding systems: an analysis of the third message understanding conference
(MUC-3). Computational Linguistics, 19(3):409–450, 1993.
Nancy A. Chinchor. Overview of MUC-7. In Seventh Message Understanding
Conference (MUC-7), pages 1–4, 1998.
Nancy A Chinchor and Beth Sundheim. Message Understanding Conference (MUC) tests of discourse processing. In AAAI Spring Symposium on Empirical Methods in Discourse Interpretation and Generation, pages 21–26, 1995.
Jason PC Chiu and Eric Nichols. Named entity recognition with bidirectional LSTM-
CNNs. Transactions of the Association for Computational Linguistics, 4:357–370,
2016.
Timothy Chklovski. Learner: a system for acquiring commonsense knowledge by
analogy. In John H. Gennari, Bruce W. Porter, and Yolanda Gil, editors, 2nd
International Conference on Knowledge Capture (K-CAP 2003), pages 4–12,
2003.
Timothy Chklovski and Patrick Pantel. Verbocean: Mining the web for fine-grained
semantic verb relations. In EMNLP, pages 33–40, 2004.
Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio.
On the properties of neural machine translation: Encoder-decoder approaches.
In Dekai Wu, Marine Carpuat, Xavier Carreras, and Eva Maria Vecchi, editors,
SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in
Statistical Translation, pages 103–111, 2014.
Noam Chomsky. Syntactic Structures. De Gruyter Mouton, Berlin, Boston, 1957.
ISBN 9783112316009.
Monojit Choudhury, Rahul Saraf, Vijit Jain, Animesh Mukherjee, Sudeshna Sarkar,
and Anupam Basu. Investigation and modeling of the structure of texting language.
International Journal of Document Analysis and Recognition (IJDAR), 10(3):
157–174, 2007.
Evripides Christodoulou, Andreas Gregoriades, Maria Pampaka, and Herodotos
Herodotou. Personality-informed restaurant recommendation. In Information
Systems and Technologies: WorldCIST 2022, Volume 1, pages 13–21. Springer,
2022.
Grzegorz Chrupała. Simple data-driven context-sensitive lemmatization. Procesamiento del Lenguaje Natural, 37, 2006.
Grzegorz Chrupała. Normalizing tweets with edit scripts and recurrent neural embeddings. In ACL, pages 680–686, 2014.
Grzegorz Chrupała, Georgiana Dinu, and Josef van Genabith. Learning morphology with Morfette. In LREC 2008 - Sixth International Conference on Language Resources and Evaluation, Marrakech, Morocco, 2008.
Sandra Chung, William A Ladusaw, and James McCloskey. Sluicing and logical
form. Natural Language Semantics, 3:239–282, 1995.
Kenneth Church and Patrick Hanks. Word association norms, mutual information,
and lexicography. Computational Linguistics, 16(1):22–29, 1990.
Kenneth Ward Church. A stochastic parts program and noun phrase parser for
unrestricted text. In International Conference on Acoustics, Speech, and Signal
Processing, pages 695–698, 1989.
Montserrat Civit and Ma Antònia Martí. Building Cast3LB: A Spanish treebank.
Research on Language and Computation, 2(4):549–574, 2004.
Alexander Clark. Combining distributional and morphological information for part
of speech induction. In EACL, pages 59–66, 2003.
Herbert H Clark. Bridging. In Theoretical Issues in Natural Language Processing,
pages 169–174, 1975.
Kevin Clark and Christopher D. Manning. Entity-centric coreference resolution with
model stacking. In ACL-IJCNLP, pages 1405–1415, 2015.
Kevin Clark and Christopher D. Manning. Deep reinforcement learning for mention-
ranking coreference models. In EMNLP, pages 2256–2262, 2016a.
Kevin Clark and Christopher D. Manning. Improving coreference resolution by
learning entity-level distributed representations. In ACL, pages 643–653, 2016b.
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. ELEC-
TRA: pre-training text encoders as discriminators rather than generators. In 8th
International Conference on Learning Representations, pages 1–18, 2020a.
Peter Clark, Oyvind Tafjord, and Kyle Richardson. Transformers as soft reasoners
over language. In IJCAI, pages 3882–3890, 2020b.
Stephen Clark, James R Curran, and Miles Osborne. Bootstrapping POS-taggers
using unlabelled data. In HLT-NAACL, pages 49–55, 2003.
Jacob Cohen. A coefficient of agreement for nominal scales. Educational and
Psychological Measurement, 20(1):37–46, 1960.
Paul T Costa Jr and Robert R McCrae. Domains and facets: Hierarchical personality
assessment using the revised NEO personality inventory. Journal of Personality
Assessment, 64(1):21–50, 1995.
Juan M Cotelo, Fermín Cruz, F Javier Ortega, and José A Troyano. Explorando Twitter mediante la integración de información estructurada y no estructurada [Exploring Twitter through the integration of structured and unstructured information]. Procesamiento del Lenguaje Natural, pages 75–82, 2015.
Isaac Councill, Ryan McDonald, and Leonid Velikovich. What’s great and what’s
not: learning to classify the scope of negation for improved sentiment analysis.
In Workshop on Negation and Speculation in Natural Language Processing, pages 51–59, 2010.
Alan S Cowen, Dacher Keltner, Florian Schroff, Brendan Jou, Hartwig Adam, and
Gautam Prasad. Sixteen facial expressions occur in similar contexts worldwide.
Nature, 589(7841):251–257, 2021.
Michael Crawshaw. Multi-task learning with deep neural networks: A survey. arXiv
preprint arXiv:2009.09796, pages 1–43, 2020.
William Croft and D Alan Cruse. Cognitive linguistics. Cambridge University Press,
2004.
Leyang Cui, Yu Wu, Jian Liu, Sen Yang, and Yue Zhang. Template-based named
entity recognition using BART. In ACL-IJCNLP, pages 1835–1845, 2021.
James R Curran and Stephen Clark. Investigating GIS and smoothing for maximum
entropy taggers. In EACL, 2003.
Agata Cybulska and Piek Vossen. Using a sledgehammer to crack a nut? Lexical
diversity and event coreference resolution. In Ninth International Conference on
Language Resources and Evaluation, pages 4545–4552, 2014.
W Daelemans, J Zavrel, K van der Sloot, and A van den Bosch. TiMBL: Tilburg
memory based learner, version 5.0, reference guide. Research Group Technical
Report Series, 3, 2003.
Walter Daelemans, Antal Van den Bosch, and Ton Weijters. IGTree: Using trees for compression and classification in lazy learning algorithms. In Lazy Learning,
pages 407–423. Springer, 1997.
Walter Daelemans, Sabine Buchholz, and Jorn Veenstra. Memory-based shallow
parsing. arXiv preprint cs/9906005, 1999a.
Walter Daelemans, Jakub Zavrel, Peter Berck, and Steven Gillis. MBT: A memory-based part of speech tagger-generator. In Fourth Workshop on Very Large Corpora, Copenhagen, Denmark, 1999b.
Walter Daelemans, Hendrik Johannes Groenewald, and Gerhard B van Huyssteen.
Prototype-based active learning for lemmatization. In International Conference
RANLP-2009, pages 65–70, 2009.
Debishree Dagar, Abir Hudait, Hrudaya Kumar Tripathy, and MN Das. Automatic
emotion detection model from facial expression. In ICACCCT, pages 77–85, 2016.
Hong-Jie Dai, Po-Ting Lai, Yung-Chun Chang, and Richard Tzong-Han Tsai. En-
hancing of chemical compound and drug name recognition using representative
tag scheme and fine-grained tokenization. Journal of Cheminformatics, 7(1):1–10,
2015.
Zehui Dai, Cheng Peng, Huajie Chen, and Yadong Ding. A multi-task incremental
learning framework with category name embedding for aspect-category sentiment
analysis. In EMNLP, pages 6955–6965, 2020.
Bouras Dalila, Amroune Mohamed, and Hakim Bendjanna. A review of recent aspect
extraction techniques for opinion mining systems. In 2018 2nd International
Conference on Natural Language and Speech Processing (ICNLSP), pages 1–6,
2018.
Hayk Danielyan. Sarcasm in social and commercial advertising: A pragmalinguistic
perspective. Armenian Folia Anglistika, 18(2 (26)):72–84, 2022.
Kareem Darwish, Hamdy Mubarak, Ahmed Abdelali, Mohamed Eldesouki, Younes
Samih, Randah Alharbi, Mohammed Attia, Walid Magdy, and Laura Kallmeyer.
Multi-dialect Arabic POS tagging: A CRF approach. In Eleventh International
Conference on Language Resources and Evaluation (LREC 2018), 2018.
Amitava Das and Sivaji Bandyopadhyay. Theme detection: An exploration of opinion subjectivity. In 2009 3rd International Conference on Affective Computing and
Intelligent Interaction and Workshops, pages 1–6, 2009.
Amitava Das and Sivaji Bandyopadhyay. Subjectivity detection using genetic algo-
rithm. Computational Approaches to Subjectivity and Sentiment Analysis, pages
14–21, 2010.
Nilanjana Das and Santwana Sagnika. A subjectivity detection-based approach
to sentiment analysis. In Machine Learning and Information Processing, pages
149–160. Springer, 2020.
Sarkar Snigdha Sarathi Das, Arzoo Katiyar, Rebecca Passonneau, and Rui Zhang.
CONTaiNER: Few-shot named entity recognition via contrastive learning. In
ACL, pages 6338–6353, 2022.
Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language model-
ing with gated convolutional networks. In ICML, pages 933–941, 2017.
Ernest Davis, Leora Morgenstern, and Charles L Ortiz. The first Winograd schema
challenge at IJCAI-16. AI Magazine, 38(3):97–98, 2017.
Alvise De Biasio, Nicolò Navarin, and Dietmar Jannach. Economic recommender
systems – a systematic review. Electronic Commerce Research and Applications,
63:101352, 2024.
Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš
Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. A continual learning survey:
Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 44(7):3366–3385, 2021.
Kees van Deemter and Rodger Kibble. On coreferring: Coreference in MUC and
related annotation schemes. Computational Linguistics, 26(4):629–637, 2000.
Dina Demner-Fushman, Kin Wah Fung, Phong Do, Richard D Boyce, and Travis R
Goodwin. Overview of the TAC 2018 Drug-Drug Interaction Extraction from
Drug Labels Track. TAC, November:1–12, 2019.
Yifan Deng, Xinran Xu, Yang Qiu, Jingbo Xia, Wen Zhang, and Shichao Liu. A
multimodal deep learning framework for predicting drug–drug interaction events.
Bioinformatics, 36(15):4316–4322, 2020.
Pascal Denis and Jason Baldridge. Specialized models and ranking for coreference
resolution. In EMNLP, pages 660–669, 2008.
Pascal Denis and Jason Baldridge. Global joint models for coreference resolution
and named entity classification. Procesamiento del Lenguaje Natural, 42, 2009.
Ali Derakhshan and Hamid Beigy. Sentiment analysis on stock social media for stock
price movement prediction. Engineering Applications of Artificial Intelligence,
85:569–578, 2019.
Leon Derczynski, Eric Nichols, Marieke van Erp, and Nut Limsopatham. Results
of the WNUT2017 shared task on novel and emerging entity recognition. In 3rd
Workshop on Noisy User-generated Text, pages 140–147, 2017.
Oksana Dereza. Lemmatization for ancient languages: Rules or neural networks?
In Dmitry Ustalov, Andrey Filchenkov, Lidia Pivovarova, and Jan Žižka, editors,
Artificial Intelligence and Natural Language, pages 35–47, Cham, 2018. Springer
International Publishing. ISBN 978-3-030-01204-5.
Neelmay Desai and Meera Narvekar. Normalization of noisy text data. Procedia
Computer Science, 45:127–132, 2015.
Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel.
Convolutional 2D knowledge graph embeddings. In AAAI, pages 1811–1818, 2018.
Ann Devitt and Khurshid Ahmad. Sentiment polarity identification in financial
news: A cohesion-based approach. In 45th Annual Meeting of the Association of
Computational Linguistics, pages 984–991, 2007.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-
training of deep bidirectional Transformers for language understanding. NAACL-
HLT, 2018.
Sahraoui Dhelim, Huansheng Ning, Nyothiri Aung, Runhe Huang, and Jianhua Ma.
Personality-aware product recommendation system based on user interests mining
and metapath discovery. IEEE Transactions on Computational Social Systems, 8
(1):86–98, 2020.
Sahraoui Dhelim, Nyothiri Aung, Mohammed Amine Bouras, Huansheng Ning, and
Erik Cambria. A survey on personality-aware recommendation systems. Artificial
Intelligence Review, 55:2409–2454, 2022.
Collins English Dictionary. Collins. London & Glasgow, 1982.
Oxford Dictionary of English. 3rd edn. Oxford University Press, China Translation
& Printing Services Ltd, China, 2010.
Thomas G Dietterich and Ghulum Bakiri. Solving multiclass learning problems
via error-correcting output codes. Journal of Artificial Intelligence Research, 2:
263–286, 1994.
Eleftherios Dimitrakis, Konstantinos Sgontzos, and Yannis Tzitzikas. A survey on
question answering systems over linked data and documents. Journal of Intelligent
Information Systems, 55(2):233–259, 2020.
Xiaowen Ding and Bing Liu. Resolving object and attribute coreference in opinion
mining. In COLING, pages 268–276, 2010.
Stefanie Dipper, Anke Lüdeling, and Marc Reznicek. NoSta-D: A corpus of German
non-standard varieties. Non-standard Data Sources in Corpus-based Research, 5:
69–76, 2013.
George R Doddington, Alexis Mitchell, Mark Przybocki, Lance Ramshaw, Stephanie
Strassel, and Ralph Weischedel. The automatic content extraction (ACE)
program–tasks, data, and evaluation. In Fourth International Conference on Lan-
guage Resources and Evaluation, pages 1–4, 2004.
Ellen K Dodge, Jisup Hong, and Elise Stickles. MetaNet: Deep semantic automatic
metaphor analysis. In Third Workshop on Metaphor in NLP, pages 40–49, 2015.
Evgenia Dolgova, Woody van Olffen, Frans van den Bosch, and Henk Volberda.
The interaction between personality, social network position and involvement in
innovation process. Technical report, Erasmus Research Institute of Management,
2010.
Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy,
Thomas Strohmann, Shaohua Sun, and Wei Zhang. Knowledge vault: A web-scale
approach to probabilistic knowledge fusion. In 20th ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 601–610, 2014.
Xin Luna Dong and Gerard De Melo. A helping hand: Transfer learning for deep
sentiment analysis. In ACL, pages 2524–2534, 2018.
Kevin Donnelly et al. SNOMED-CT: The advanced terminology and coding system
for eHealth. Studies in Health Technology and Informatics, 121:279, 2006.
Cicero Dos Santos and Bianca Zadrozny. Learning character-level representations
for part-of-speech tagging. In ICML, pages 1818–1826, 2014.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xi-
aohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg
Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for
image recognition at scale. In International Conference on Learning Representa-
tions, pages 1–21, 2021.
Markus Dreyer, Jason Smith, and Jason Eisner. Latent-variable modeling of string
transductions with finite-state methods. In EMNLP, pages 1080–1089, 2008.
Kelvin Du, Frank Xing, and Erik Cambria. Incorporating multiple knowledge sources
for targeted aspect-based financial sentiment analysis. ACM Transactions on
Management Information Systems, 14(3):23, 2023a.
Kelvin Du, Frank Xing, Rui Mao, and Erik Cambria. FinSenticNet: A concept-
level lexicon for financial sentiment analysis. In 2023 IEEE Symposium Series on
Computational Intelligence (SSCI), pages 109–114, Mexico City, Mexico, 2023b.
Kelvin Du, Rui Mao, Frank Xing, and Erik Cambria. A dynamic dual-graph neural
network for stock price movement prediction. In IJCNN, Yokohama, Japan, 2024a.
Kelvin Du, Frank Xing, Rui Mao, and Erik Cambria. An evaluation of reasoning
capabilities of large language models in financial sentiment analysis. In IEEE
Conference on Artificial Intelligence, pages 189–194, Singapore, 2024b.
Kelvin Du, Frank Xing, Rui Mao, and Erik Cambria. Financial sentiment analysis:
Techniques and applications. ACM Computing Surveys, 56(9):1–42, 2024c.
Mark Dunlop and Andrew Crossan. Predictive text entry methods for mobile phones.
Personal Technologies, 4:2–3, 2000.
Cuc Duong, Vethavikashini Chithrra Raghuram, Amos Lee, Rui Mao, Gianmarco
Mengaldo, and Erik Cambria. Neurosymbolic AI for mining public opinions
about wildfires. Cognitive Computation, 16(4):1531–1553, 2024.
Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A Smith.
Transition-based dependency parsing with stack long short-term memory. arXiv
preprint arXiv:1505.08075, 2015.
Philip Edmonds and Scott Cotton. SensEval-2: Overview. In SENSEVAL-2 Sec-
ond International Workshop on Evaluating Word Sense Disambiguation Systems,
pages 1–5, 2001.
Maud Ehrmann, Matteo Romanello, Sven Najem-Meyer, Antoine Doucet, and Simon
Clematide. Extended overview of HIPE-2022: Named entity recognition and
linking in multilingual historical documents. In Guglielmo Faggioli, Nicola Ferro,
Allan Hanbury, and Martin Potthast, editors, Working Notes of CLEF 2022 -
Conference and Labs of the Evaluation Forum, volume 3180, pages 1038–1063,
2022.
Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, and Sebastian
Riedel. emoji2vec: Learning emoji representations from their description. In
Fourth International Workshop on Natural Language Processing for Social Media,
pages 48–54, 2016.
Christopher Ifeanyi Eke, Azah Anir Norman, Liyana Shuib, and Henry Friday
Nweke. Sarcasm identification in textual data: systematic review, research chal-
lenges and open directions. Artificial Intelligence Review, 53(6):4215–4258, 2020.
Paul Ekman and Wallace V Friesen. Facial action coding system. Environmental
Psychology & Nonverbal Behavior, 1978.
Abdelkader El Mahdaouy, Abdellah El Mekki, Kabil Essefar, Nabil El Mamoun,
Ismail Berrada, and Ahmed Khoumsi. Deep multi-task model for sarcasm detec-
tion and sentiment analysis in Arabic language. In Sixth Arabic Natural Language
Processing Workshop, pages 334–339, 2021.
Jeffrey L Elman. Distributed representations, simple recurrent networks, and
grammatical structure. Machine Learning, 7(2):195–225, 1991.
Ali Emami, Adam Trischler, Kaheer Suleman, and Jackie Chi Kit Cheung. A
generalized knowledge hunting framework for the Winograd schema challenge.
In NAACL, pages 25–31, 2018.
Tomaž Erjavec. The MULTEXT-East Slovene lexicon. In 7th Electrotechnical
Conference ERK, Volume B, pages 189–192, 1998.
Tomaž Erjavec and Sašo Džeroski. Machine learning of morphosyntactic structure:
Lemmatizing unknown Slovene words. Applied Artificial Intelligence, 18(1):17–
41, 2004.
Md Shah Fahad, Ashish Ranjan, Jainath Yadav, and Akshay Deepak. A survey of
speech emotion recognition in natural environment. Digital Signal Processing,
110:102951, 2021.
Chunxiao Fan, Jie Lin, Rui Mao, and Erik Cambria. Fusing pairwise modalities for
emotion recognition in conversations. Information Fusion, 106:102306, 2024.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin.
LIBLINEAR: A library for large linear classification. Journal of Machine Learning
Research, 9:1871–1874, 2008.
Irving E Fang. The “easy listening formula”. Journal of Broadcasting & Electronic
Media, 11(1):63–68, 1966.
Songtao Fang, Zhenya Huang, Ming He, Shiwei Tong, Xiaoqing Huang, Ye Liu, Jie
Huang, and Qi Liu. Guided attention network for concept extraction. In Zhi-Hua
Zhou, editor, IJCAI, pages 1449–1455, 2021a.
Tianqing Fang, Weiqi Wang, Sehyun Choi, Shibo Hao, Hongming Zhang, Yangqiu
Song, and Bin He. Benchmarking commonsense knowledge base population with
an effective evaluation dataset. In EMNLP, pages 8949–8964, 2021b.
Tianqing Fang, Quyet V. Do, Hongming Zhang, Yangqiu Song, Ginny Y. Wong,
and Simon See. Pseudoreasoner: Leveraging pseudo labels for commonsense
knowledge base population. In EMNLP, pages 3379–3394, 2022a.
Yanbo Fang and Yongfeng Zhang. Data-efficient concept extraction from pre-trained
language models for commonsense explanation generation. In Yoav Goldberg,
Zornitsa Kozareva, and Yue Zhang, editors, EMNLP, pages 5883–5893, 2022.
Yuwei Fang, Shuohang Wang, Yichong Xu, Ruochen Xu, Siqi Sun, Chenguang
Zhu, and Michael Zeng. Leveraging knowledge in multilingual commonsense
reasoning. In ACL, pages 3237–3246, 2022b.
Umar Farooq, Tej Prasad Dhamala, Antoine Nongaillard, Yacine Ouzrout, and
Muhammad Abdul Qadir. A word sense disambiguation method for feature level
sentiment analysis. In 2015 9th International Conference on Software, Knowledge,
Information Management and Applications, pages 1–8, 2015.
Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard H. Hovy, and
Noah A. Smith. Retrofitting word vectors to semantic lexicons. In NAACL-HLT,
pages 1606–1615, 2015.
Mehwish Fatima and Mark-Christoph Mueller. HITS-SBD at the FinSBD task:
Machine learning vs. rule-based sentence boundary detection. In First Workshop
on Financial Technology and Natural Language Processing, pages 115–121, 2019.
Gilles Fauconnier. Polarity and the scale principle. Chicago, 1975a.
Gilles Fauconnier. Pragmatic scales and logical structure. Linguistic Inquiry, 6(3):
353–375, 1975b.
Gilles Fauconnier and Mark Turner. The Way We Think: Conceptual Blending and
the Mind’s Hidden Complexities. Basic Books, 2008.
William Fedus, Ian Goodfellow, and Andrew M Dai. MaskGAN: Better text gener-
ation via filling in the _. In ICLR, 2018.
Hao Fei, Donghong Ji, Bobo Li, Yijiang Liu, Yafeng Ren, and Fei Li. Rethink-
ing boundaries: End-to-end recognition of discontinuous mentions with pointer
networks. In AAAI, pages 12785–12793, 2021a.
Hao Fei, Yafeng Ren, Yue Zhang, and Donghong Ji. Nonautoregressive encoder-
decoder neural framework for end-to-end aspect-based sentiment triplet extraction.
IEEE Transactions on Neural Networks and Learning Systems, 2021b.
Hao Fei, Han Zhang, Bin Wang, Lizi Liao, Qian Liu, and Erik Cambria. Empathyear:
An open-source avatar multimodal empathetic chatbot. In ACL, pages 61–71,
2024.
Jerome Feldman. From Molecule to Metaphor: A Neural Theory of Language. MIT
Press, 2008.
Christiane Fellbaum. WordNet: An Electronic Lexical Database (Language, Speech,
and Communication). The MIT Press, 1998.
Jinzhan Feng, Shuqin Cai, and Xiaomeng Ma. Enhanced sentiment labeling and
implicit aspect identification by integration of deep convolution neural network
and sequential algorithm. Cluster Computing, 22:5839–5857, 2019a.
Shi Feng, Lin Wang, Weili Xu, Daling Wang, and Ge Yu. Unsupervised learning
Chinese sentiment lexicon from massive microblog data. In Advanced Data Mining
and Applications: 8th International Conference, ADMA 2012, Nanjing, China,
December 15-18, 2012. Proceedings 8, pages 27–38, 2012.
Xiaocheng Feng, Zhangyin Feng, Wanlong Zhao, Nan Zou, Bing Qin, and Ting Liu.
Improved neural machine translation with POS-tagging through joint decoding.
In International Conference on Artificial Intelligence for Communications and
Networks, pages 159–166, 2019b.
Yifan Feng, Haoxuan You, Zizhao Zhang, Rongrong Ji, and Yue Gao. Hypergraph
neural networks. In AAAI, volume 33, pages 3558–3565, 2019c.
Marco Ferrarotti, Walter Rocchia, and Sergio Decherchi. Finding principal paths in
data space. IEEE Transactions on Neural Networks and Learning Systems, 30(8):
2449–2462, 2019.
Kim Ferres, Timo Schloesser, and Peter A Gloor. Predicting dog emotions based on
posture analysis using DeepLabCut. Future Internet, 14(4):97, 2022.
David Ferrucci and Adam Lally. UIMA: An architectural approach to unstructured
information processing in the corporate research environment. Natural Language
Engineering, 10(3-4):327–348, 2004.
Leon Festinger. A Theory of Cognitive Dissonance. Stanford University Press, 1957.
ISBN 9780804709118.
Charles J Fillmore, Paul Kay, Mary C O’Connor, et al. Regularity and idiomaticity
in grammatical constructions: The case of let alone. Language, 64(3):501–538,
1988.
Charles J Fillmore et al. Frame semantics. Cognitive Linguistics: Basic Readings,
34:373–400, 2006.
Jenny Rose Finkel and Christopher D Manning. Nested named entity recognition.
In EMNLP, pages 141–150, 2009.
Jenny Rose Finkel, Trond Grenager, and Christopher D Manning. Incorporating
non-local information into information extraction systems by Gibbs sampling. In
ACL, pages 363–370, 2005.
Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi
Wolfman, and Eytan Ruppin. Placing search in context: the concept revisited. In
International World Wide Web Conference, WWW, pages 406–414, 2001.
John Firth. A synopsis of linguistic theory, 1930-1955. Studies in Linguistic Analysis,
pages 10–32, 1957.
Andrea Gagliano, Emily Paul, Kyle Booten, and Marti A Hearst. Intersecting word
vectors to take figurative language to new heights. In Fifth Workshop on Compu-
tational Linguistics for Literature, pages 20–31, 2016.
Ladislav Gallay and Marián Šimko. Utilizing vector models for automatic text
lemmatization. In International Conference on Current Trends in Theory and
Practice of Informatics, pages 532–543, 2016.
Michel Galley, Kathleen McKeown, Julia Hirschberg, and Elizabeth Shriberg. Iden-
tifying agreement and disagreement in conversational speech: use of Bayesian
networks to model pragmatic dependencies. In ACL, pages 669–677, 2004.
Ankita Gandhi, Kinjal Adhvaryu, Soujanya Poria, Erik Cambria, and Amir Hussain.
Multimodal sentiment analysis: A systematic review of history, datasets, multi-
modal fusion methods, applications, challenges and future directions. Information
Fusion, 91:424–444, 2023.
Lisa Gandy, Nadji Allan, Mark Atallah, Ophir Frieder, Newton Howard, Sergey
Kanareykin, Moshe Koppel, Mark Last, Yair Neuman, and Shlomo Argamon.
Automatic identification of conceptual metaphors with limited knowledge. In
AAAI, pages 328–334, 2013.
Vaishali Ganganwar and R Rajalakshmi. Implicit aspect extraction for sentiment
analysis: A survey of recent approaches. Procedia Computer Science, 165:485–
491, 2019.
Aldo Gangemi, Nicola Guarino, Claudio Masolo, Alessandro Oltramari, and Luc
Schneider. Sweetening ontologies with DOLCE. In Knowledge Engineering and
Knowledge Management. Ontologies and the Semantic Web, 13th International
Conference, EKAW, volume 2473 of Lecture Notes in Computer Science, pages
166–181, 2002.
Chen Gao, Xiang Wang, Xiangnan He, and Yong Li. Graph neural networks for
recommender system. In Fifteenth ACM International Conference on Web Search
and Data Mining, pages 1623–1625, 2022.
Jie Gao, Sooji Han, Xingyi Song, and Fabio Ciravegna. RP-DNN: A tweet level
propagation context based deep neural networks for early rumor detection in social
media. In LREC, pages 6094–6105, Marseille, France, May 2020. European
Language Resources Association. ISBN 979-10-95546-34-4.
Lei Gao, Yulong Wang, Tongcun Liu, Jingyu Wang, Lei Zhang, and Jianxin Liao.
Question-driven span labeling model for aspect–opinion pair extraction. In AAAI,
pages 12875–12883, 2021a.
Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models
better few-shot learners. In ACL-IJCNLP, pages 3816–3830, 2021b.
Peter Gardenfors. Conceptual Spaces: The Geometry of Thought. MIT Press, 2004.
Alan Garnham. Mental Models and the Interpretation of Anaphora. Psychology
Press, 2001.
Mengshi Ge, Rui Mao, and Erik Cambria. Explainable metaphor identification
inspired by conceptual metaphor theory. AAAI, 36(10):10681–10689, 2022.
Mengshi Ge, Rui Mao, and Erik Cambria. A survey on computational metaphor pro-
cessing techniques: From identification, interpretation, generation to application.
Artificial Intelligence Review, 56:1829–1895, 2023.
Niyu Ge, John Hale, and Eugene Charniak. A statistical approach to anaphora
resolution. In Sixth workshop on very large corpora, pages 161–170, 1998.
Sebastian Gehrmann, Franck Dernoncourt, Yeran Li, Eric T Carlson, Joy T Wu,
Jonathan Welt, John Foote Jr, Edward T Moseley, David W Grant, Patrick D
Tyler, et al. Comparing deep learning and concept extraction based methods for
patient phenotyping from clinical narratives. PLoS ONE, 13(2):e0192360, 2018.
Daniela Gerz, Ivan Vulić, Felix Hill, Roi Reichart, and Anna Korhonen. SimVerb-
3500: A large-scale evaluation set of verb similarity. In EMNLP, pages 2173–2182,
2016.
Andrea Gesmundo and Tanja Samardzic. Lemmatisation as a tagging task. In ACL,
pages 368–372, 2012.
Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan
Berant. Did aristotle use a laptop? A question answering benchmark with im-
plicit reasoning strategies. Transactions of the Association for Computational
Linguistics, 9:346–361, 2021.
Abbas Ghaddar and Philippe Langlais. WikiCoref: An English coreference-
annotated corpus of wikipedia articles. In Tenth International Conference on
Language Resources and Evaluation, pages 136–142, 2016.
Deepanway Ghosal, Devamanyu Hazarika, Abhinaba Roy, Navonil Majumder, Rada
Mihalcea, and Soujanya Poria. Kingdom: Knowledge-guided domain adaptation
for sentiment analysis. In ACL, pages 3198–3210, 2020.
Debanjan Ghosh, Alexander Richard Fabbri, and Smaranda Muresan. The role of
conversation context for sarcasm detection in online interactions. arXiv preprint
arXiv:1707.06226, 2017.
Raymond W Gibbs Jr. A new look at literal meaning in understanding what is said
and implicated. Journal of Pragmatics, 34(4):457–486, 2002.
Edward Gibson. Linguistic complexity: Locality of syntactic dependencies. Cogni-
tion, 68(1):1–76, 1998.
Alastair Gill, Scott Nowson, and Jon Oberlander. What are they blogging about?
Personality, topic and motivation in blogs. In International AAAI Conference on
Web and Social Media, pages 18–25, 2009.
Dan Gillick. Sentence boundary detection and the problem with the US. In NAACL-
HLT, pages 241–244, 2009.
Jesús Giménez and Lluís Màrquez. SVMTool: A general POS tagger generator based
on support vector machines. In Fourth International Conference on Language
Resources and Evaluation (LREC’04), 2004.
Jesús Giménez and Lluís Màrquez. Fast and accurate part-of-speech tagging: The
SVM approach revisited. RANLP, pages 153–162, 2004.
Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills,
Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A
Smith. Part-of-speech tagging for Twitter: Annotation, features, and experiments.
Technical report, Carnegie-Mellon Univ Pittsburgh Pa School of Computer Sci-
ence, 2010.
Christian Girardi, Manuela Speranza, Rachele Sprugnoli, and Sara Tonelli. CROMER:
A tool for cross-document event and entity coreference. In LREC, pages 3204–
3208, 2014.
Njagi Dennis Gitari, Zhang Zuping, Hanyurwimfura Damien, and Jun Long. A
lexicon-based approach for hate speech detection. International Journal of Mul-
timedia and Ubiquitous Engineering, 10(4):215–230, 2015.
Talmy Givón. Topic continuity in discourse: The functional domain of switch
reference. Switch reference and universal grammar, 51:82, 1983.
Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using
distant supervision. CS224N project report, Stanford, 1(12):2009, 2009.
Andrew Goatly. Washing the brain: Metaphor and hidden ideology, volume 23.
John Benjamins Publishing, 2007.
Vinod Goel, Gorka Navarrete, Ira A Noveck, and Jérôme Prado. The reasoning
brain: The interplay between cognitive neuroscience and theories of reasoning,
2017.
Adele Goldberg and Laura Suttle. Construction grammar. Wiley Interdisciplinary
Reviews: Cognitive Science, 1(4):468–477, 2010.
Lewis R Goldberg. Language and individual differences: The search for universals in
personality lexicons. Review of Personality and Social Psychology, 2(1):141–165,
1981.
Lewis R Goldberg. Standard markers of the Big-Five factor structure. In First
International Workshop on Personality Language, 1989.
Julio Gonzalo, Felisa Verdejo, Irina Chugur, and Juan Cigarran. Indexing with
WordNet synsets can improve text retrieval. In Usage of WordNet in Natural
Language Processing Systems, pages 38–44, 1998.
Julio Gonzalo, Anselmo Penas, and Felisa Verdejo. Lexical ambiguity and informa-
tion retrieval revisited. In EMNLP-VLC, pages 195–202, 1999.
Samuel D Gosling, Peter J Rentfrow, and William B Swann Jr. A very brief measure
of the Big-Five personality domains. Journal of Research in Personality, 37(6):
504–528, 2003.
Yoshihiko Gotoh and Steve Renals. Sentence boundary detection in broadcast speech
transcripts. In ASR2000-Automatic Speech Recognition, 2000.
Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with
bidirectional LSTM and other neural network architectures. Neural Networks, 18
(5-6):602–610, 2005.
Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition
with deep recurrent neural networks. In 2013 IEEE international conference on
acoustics, speech and signal processing, pages 6645–6649, 2013.
Gregory Grefenstette and Pasi Tapanainen. What is a word, what is a sentence?:
Problems of tokenisation. Report, Grenoble Laboratory, 1994.
Denis Griffis, Chaitanya Shivade, Eric Fosler-Lussier, and Albert M. Lai. A quan-
titative and qualitative evaluation of sentence boundary detection for the clinical
domain. AMIA Joint Summits on Translational Science, pages 88–97, 2016.
Ralph Grishman and Beth M Sundheim. Message understanding conference-6: A
brief history. In COLING, pages 466–471, 1996.
Barbara J. Grosz, Aravind K. Joshi, and Scott Weinstein. Providing a unified account
of definite noun phrases in discourse. In ACL, pages 44–50, 1983.
Barbara J. Grosz, Aravind K. Joshi, and Scott Weinstein. Centering: A framework
for modeling the local coherence of discourse. Computational Linguistics, 21(2):
203–225, 1995.
Barbara Jean Grosz. The Representation and Use of Focus in Dialogue Understand-
ing. University of California, Berkeley, 1977.
Adam J Grove and Dan Roth. Linear concepts and hidden variables. Machine
Learning, 42(1-2):123–141, 2001.
Rosanna E Guadagno, Bradley M Okdie, and Cassie A Eno. Who blogs? Personality
predictors of blogging. Computers in Human Behavior, 24(5):1993–2004, 2008.
Nicola Guarino. Formal ontology, conceptual analysis and knowledge representation.
International Journal of Human-Computer Studies, 43(5-6):625–640, 1995.
Tao Gui, Qi Zhang, Haoran Huang, Minlong Peng, and Xuanjing Huang. Part-of-
speech tagging for Twitter with adversarial neural networks. In EMNLP, pages
2411–2420, 2017.
Liane Guillou. Improving pronoun translation for statistical machine translation. In
EACL, pages 1–10, 2012.
Jeanette K Gundel, Nancy Hedberg, and Ron Zacharski. Cognitive status and the
form of referring expressions in discourse. Language, pages 274–307, 1993.
Honey Gupta, Aveena Kottwani, Soniya Gogia, and Sheetal Chaudhari. Text analysis
and information retrieval of text data. In WiSPNET, pages 788–792, 2016a.
Pankaj Gupta, Hinrich Schütze, and Bernt Andrassy. Table filling multi-task recur-
rent neural network for joint entity and relation extraction. In COLING, pages
2537–2547, 2016b.
Harsha Gurulingappa, Abdul Mateen Rajput, Angus Roberts, Juliane Fluck, Martin
Hofmann-Apitius, and Luca Toldo. Development of a benchmark corpus to sup-
port the automatic extraction of drug-related adverse effects from medical case
reports. Journal of Biomedical Informatics, 45(5):885–892, 2012.
Dario Gutierrez, Ekaterina Shutova, Tyler Marghetis, and Benjamin Bergen. Literal
and metaphorical senses in compositional distributional semantic models. In ACL,
pages 183–193, 2016.
Christian Hadiwinoto, Hwee Tou Ng, and Wee Chung Gan. Improved word sense dis-
ambiguation using pre-trained contextualized word representations. In EMNLP-
IJCNLP, pages 5297–5306, 2019.
Aria Haghighi and Dan Klein. Simple coreference resolution with rich syntactic and
semantic features. In EMNLP, pages 1152–1161, 2009.
Zhen Hai, Kuiyu Chang, and Gao Cong. One seed to find them all: Mining opinion
features via association. In CIKM, pages 255–264, 2012.
Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara,
Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian
Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang.
The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple
languages. In CoNLL, pages 1–18, 2009.
Péter Halácsy and V Trón. Benefits of deep NLP-based lemmatization for informa-
tion retrieval. In CLEF (Working Notes), 2006.
Guy Halawi, Gideon Dror, Evgeniy Gabrilovich, and Yehuda Koren. Large-scale
learning of word relatedness with constraints. In ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 1406–1414, 2012.
Jeffrey A Hall, Jess Dominguez, and Teodora Mihailova. Interpersonal media and
face-to-face communication: Relationship with life satisfaction and loneliness.
Journal of Happiness Studies, 24(1):331–350, 2023.
Bo Han and Timothy Baldwin. Lexical normalisation of short text messages: Makn
sens a #twitter. In ACL-HLT, pages 368–378, 2011.
Na-Rae Han. Korean zero pronouns: analysis and resolution. University of Penn-
sylvania, 2006.
Sooji Han, Jie Gao, and Fabio Ciravegna. Neural language model based training data
augmentation for weakly supervised early rumor detection. In 2019 IEEE/ACM
International Conference on Advances in Social Networks Analysis and Mining,
pages 105–112, 2019.
Sooji Han, Rui Mao, and Erik Cambria. Hierarchical attention network for explain-
able depression detection on Twitter aided by metaphor concept mappings. In
COLING, pages 94–104, 2022a.
Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. PTR: Prompt
tuning with rules for text classification. AI Open, 3:182–192, 2022b.
Sanda M. Harabagiu and Steven J. Maiorano. Knowledge-lean coreference reso-
lution and its relation to textual cohesion and coherence. In The Relation of
Discourse/Dialogue Structure and Reference, pages 29–38, 1999.
Sanda M. Harabagiu, Razvan C. Bunescu, and Steven J. Maiorano. Text and knowl-
edge mining for coreference resolution. In NAACL, pages 1–8, 2001.
Jaron Harambam, Dimitrios Bountouridis, Mykola Makhortykh, and Joris
Van Hoboken. Designing for the better by taking users into account: A quali-
tative evaluation of user control mechanisms in (news) recommender systems. In
13th ACM conference on recommender systems, pages 69–77, 2019.
Christian Hardmeier and Marcello Federico. Modelling pronominal anaphora in
statistical machine translation. In International Workshop on Spoken Language
Translation (IWSLT), pages 283–289, Paris, France, 2010.
Peter E Hart, Nils J Nilsson, and Bertram Raphael. A formal basis for the heuristic
determination of minimum cost paths. IEEE Transactions on Systems Science and
Cybernetics, 4(2):100–107, 1968.
Md Rajibul Hasan, Ashish Kumar Jha, and Yi Liu. Excessive use of online video
streaming services: Impact of recommender system use, psychological factors,
and motives. Computers in Human Behavior, 80:220–228, 2018.
Laura Hasler, Constantin Orăsan, and Karin Naumann. NPs for events: Experi-
ments in coreference annotation. In Fifth International Conference on Language
Resources and Evaluation (LREC’06), pages 1167–1172, 2006.
Catherine Havasi and Robert Speer. ConceptNet 3: A flexible, multilingual semantic
network for common sense knowledge. In RANLP, pages 27–29, 2007.
Irene Roswitha Heim. The semantics of definite and indefinite noun phrases. Uni-
versity of Massachusetts Amherst, 1982.
Sigrún Helgadóttir. Icelandic Frequency Dictionary 2012.11 – training/testing sets,
2012.
Mohammad Helmi and Seyed Mohammad T AlModarresi. Human activity recog-
nition using a fuzzy inference system. In 2009 IEEE International Conference on
Fuzzy Systems, pages 1897–1902, 2009.
Iris Hendrickx, Gosse Bouma, Frederik Coppens, Walter Daelemans, Veronique
Hoste, Geert Kloosterman, Anne-Marie Mineur, Joeri Van Der Vloet, and Jean-
Luc Verschelde. A coreference corpus and resolution system for Dutch. In Sixth
International Conference on Language Resources and Evaluation, pages 1–6,
2008.
Sam Henry, Kevin Buchan, Michele Filannino, Amber Stubbs, and Ozlem Uzuner.
2018 n2c2 shared task on adverse drug events and medication extraction in elec-
tronic health records. Journal of the American Medical Informatics Association,
27(1):3–12, 2020.
María Herrero-Zazo, Isabel Segura-Bedmar, Paloma Martínez, and Thierry De-
clerck. The DDI corpus: An annotated corpus with pharmacological substances
and drug-drug interactions. Journal of Biomedical Informatics, 46(5):914–920,
2013.
Clara E Hill. Helping skills: Facilitating, exploration, insight, and action. American
Psychological Association, 2009.
Felix Hill, Roi Reichart, and Anna Korhonen. Simlex-999: Evaluating semantic
models with (genuine) similarity estimation. Computational Linguistics, 41(4):
665–695, 2015.
Dustin Hillard, Mari Ostendorf, and Elizabeth Shriberg. Detection of agreement vs.
disagreement in meetings: Training with unlabeled data. In HLT-NAACL, pages
34–36, 2003.
Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R
Salakhutdinov. Improving neural networks by preventing co-adaptation of feature
detectors. arXiv preprint arXiv:1207.0580, pages 1–18, 2012.
JA Hirsch, G Nicola, G McGinty, RW Liu, RM Barr, MD Chittle, and L Manchikanti.
ICD-10: history and context. American Journal of Neuroradiology, 37(4):596–
599, 2016.
Lynette Hirschman and Nancy Chinchor. Appendix F: MUC-7 coreference task
definition (version 3.0). In Seventh Message Understanding Conference (MUC-
7), pages 1–17, 1998.
Lynette Hirschman, Patricia Robinson, John Burger, and Marc Vilain. Automating
coreference: The role of annotated training data. In AAAI Spring Symposium on
Applying Machine Learning to Discourse Processing, pages 118–121, 1997.
Janet Hitzeman, Alan W Black, Paul Taylor, Chris Mellish, and Jon Oberlander. On
the use of automatically generated discourse-level information in a concept-to-
speech synthesis system. In 5th International Conference on Spoken Language
Processing, pages 2763–2766, 1998.
Seng-Beng Ho, Zhaoxia Wang, Boon-Kiat Quek, and Erik Cambria. Knowledge
representation for conceptual, motivational, and affective processes in natural
language communication. In BICS, volume 14374, pages 14–30, 2023.
Shuk Ying Ho and Kai H Lim. Nudging moods to induce unplanned purchases in
imperfect mobile personalization contexts. MIS Quarterly, 42(3):757–A13, 2018.
Jerry R Hobbs. Resolving pronoun references. Lingua, 44(4):311–338, 1978.
Jerry R. Hobbs, Mark Stickel, Paul Martin, and Douglas Edwards. Interpretation as
abduction. In ACL, pages 95–103, 1988.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural
Computation, 9(8):1735–1780, 1997.
Jack Hoeksema. On the grammaticalization of negative polarity items. In Annual
Meeting of the Berkeley Linguistics Society, pages 273–282, 1994.
Patrick Hohenecker, Frank Mtumbuka, Vid Kocijan, and Thomas Lukasiewicz. Sys-
tematic comparison of neural architectures and training approaches for open in-
formation extraction. In EMNLP, pages 8554–8565, 2020.
John H Holland. Adaptation in Natural and Artificial Systems: An Introductory
Analysis with Applications to Biology, Control, and Artificial Intelligence. MIT
press, 1992.
Janusz A Hołyst, Philipp Mayr, Michael Thelwall, Ingo Frommholz, Shlomo Havlin,
Alon Sela, Yoed N Kenett, Denis Helic, Aljoša Rehar, Sebastijan R Maček, Prze-
mysław Kazienko, Tomasz Kajdanowicz, Przemysław Biecek, Boleslaw K. Szy-
manski, and Julian Sienkiewicz. Protect our environment from information over-
load. Nature Human Behaviour, pages 1–2, 2024.
Vincent Homer. Domains of polarity items. Journal of Semantics, 38(1):1–48, 2021.
Albert Sydney Hornby and Anthony P Cowie. Oxford Advanced Learner's Dictionary
of Current English. Oxford University Press, 1974.
David W. Hosmer and Stanley Lemeshow. Applied Logistic Regression. Wiley, 2
edition, 2000.
Yufang Hou, Katja Markert, and Michael Strube. Unrestricted bridging resolution.
Computational Linguistics, 44(2):237–284, 2018.
Eduard Hovy, Mitch Marcus, Martha Palmer, Lance Ramshaw, and Ralph
Weischedel. OntoNotes: The 90% solution. In Human Language Technology Con-
ference of the NAACL, Companion Volume: Short Papers, pages 57–60, 2006.
Newton Howard and Erik Cambria. Intention awareness: Improving upon situa-
tion awareness in human-centric environments. Human-centric Computing and
Information Sciences, 3(9), 2013.
Jeff Howe. Crowdsourcing: How the Power of the Crowd is Driving the Future of
Business. Random House, 2008.
Pei-Yun Hsueh, Prem Melville, and Vikas Sindhwani. Data quality from crowd-
sourcing: A study of annotation selection criteria. In NAACL HLT 2009 Workshop
on Active Learning for Natural Language Processing, pages 27–35, 2009.
Jiawei Hu, Zheng Li, and Bin Xu. An approach of ontology-based knowledge base
construction for Chinese K12 education. In 2016 First International Conference
on Multimedia and Image Processing (ICMIP), pages 83–88, 2016.
Minghao Hu, Yuxing Peng, Zhen Huang, Dongsheng Li, and Yiwei Lv. Open-
domain targeted sentiment analysis via span-based extraction and classification.
arXiv preprint arXiv:1906.03820, 2019.
Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In tenth
ACM SIGKDD international conference on Knowledge discovery and data mining,
pages 168–177, 2004.
Renqing Hu and Xue Wang. A cognitive pragmatic analysis of conceptual metaphor
in political discourse based on text data mining. In 2021 4th International Con-
ference on Information Systems and Computer Aided Education, pages 235–238,
2021.
Hu Huang, Bowen Zhang, Liwen Jing, Xianghua Fu, Xiaojun Chen, and Jianyang
Shi. Logic tensor network with massive learned knowledge for aspect-based
sentiment analysis. Knowledge-Based Systems, 257:109943, 2022a.
James Huang. On the distribution and reference of empty pronouns. Linguistic
Inquiry, pages 531–574, 1984.
Lishan Huang, Zheng Ye, Jinghui Qin, Liang Lin, and Xiaodan Liang. GRADE:
Automatic graph-enhanced coherence metric for evaluating open-domain dialogue
systems. In EMNLP, pages 9230–9240, 2020a.
Luyao Huang, Chi Sun, Xipeng Qiu, and Xuan-Jing Huang. GlossBERT: BERT
for word sense disambiguation with gloss knowledge. In EMNLP-IJCNLP, pages
3509–3514, 2019a.
Minghui Huang, Haoran Xie, Yanghui Rao, Jingrong Feng, and Fu Lee Wang. Sen-
timent strength detection with a context-dependent lexicon-based convolutional
neural network. Information Sciences, 520:389–399, 2020b.
Xiaoqing Huang, Qi Liu, Chao Wang, Haoyu Han, Jianhui Ma, Enhong Chen,
Yu Su, and Shijin Wang. Constructing educational concept maps with multiple
relationships from multi-source data. In Jianyong Wang, Kyuseok Shim, and
Xindong Wu, editors, ICDM, pages 1108–1113. IEEE, 2019b.
Yucheng Huang, Kai He, Yige Wang, Xianli Zhang, Tieliang Gong, Rui Mao, and
Chen Li. COPNER: Contrastive learning with prompt guiding for few-shot named
entity recognition. In COLING, pages 2515–2527, 2022b.
Zhiheng Huang, Wei Xu, and Kai Yu. Bidirectional LSTM-CRF models for sequence
tagging. arXiv preprint arXiv:1508.01991, 2015.
Christoph Hube and Besnik Fetahu. Detecting biased statements in Wikipedia. In
Web Conference, pages 1779–1786, 2018.
Christoph Hube and Besnik Fetahu. Neural based statement classification for biased
language. In Twelfth ACM International Conference on Web Search and Data
Mining, pages 195–203, 2019.
Anette Hulth. Improved automatic keyword extraction given more linguistic knowl-
edge. In EMNLP, pages 1–8, 2003.
Chihli Hung and Shiuan-Jeng Chen. Word sense disambiguation based sentiment
lexicons for sentiment classification. Knowledge-Based Systems, 110:224–232,
2016.
Chihli Hung and Hao-Kai Lin. Using objective words in SentiWordNet to improve
word-of-mouth sentiment classification. IEEE Intelligent Systems, 28(02):47–54,
2013.
Hairong Huo and Mizuho Iwaihara. Utilizing BERT pretrained models with various
fine-tune methods for subjectivity detection. In Asia-Pacific Web (APWeb) and
Web-Age Information Management (WAIM) Joint International Conference on
Web and Big Data, pages 270–284, 2020.
Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jeff Da, Keisuke Sakaguchi,
Antoine Bosselut, and Yejin Choi. COMET-ATOMIC 2020: On symbolic and neural
commonsense knowledge graphs. In AAAI, 2020.
Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. Embeddings
for word sense disambiguation: An evaluation study. In ACL, pages 897–907,
2016.
Francisco Iacobelli, Alastair J Gill, Scott Nowson, and Jon Oberlander. Large scale
personality classification of bloggers. In Affective Computing and Intelligent
Interaction: Fourth International Conference, ACII 2011, Memphis, TN, USA,
October 9–12, 2011, Proceedings, Part II, pages 568–577, 2011.
Nancy Ide and Jean Véronis. Multext: Multilingual text tools and corpora. In
COLING, 1994.
Ryu Iida and Massimo Poesio. A cross-lingual ILP solution to zero anaphora resolu-
tion. In ACL-HLT, pages 804–813, 2011.
Ryu Iida, Kentaro Inui, and Yuji Matsumoto. Zero-anaphora resolution by learning
rich syntactic pattern features. ACM Transactions on Asian Language Information
Processing, 6(4):1–22, 2007a.
Ryu Iida, Mamoru Komachi, Kentaro Inui, and Yuji Matsumoto. Annotating a
Japanese text corpus with predicate-argument and coreference relations. In Lin-
guistic Annotation Workshop, pages 132–139, 2007b.
Filip Ilievski, Alessandro Oltramari, Kaixin Ma, Bin Zhang, Deborah L. McGuin-
ness, and Pedro A. Szekely. Dimensions of commonsense knowledge. Knowledge-
Based Systems, 229:107347, 2021a.
Filip Ilievski, Pedro A. Szekely, and Bin Zhang. CSKG: the commonsense knowledge
graph. In Semantic Web - 18th International Conference, ESWC, volume 12731
of Lecture Notes in Computer Science, pages 680–696, 2021b.
Ali Shariq Imran, Sher Muhammad Daudpota, Zenun Kastrati, and Rakhi Batra.
Cross-cultural polarity and emotion detection using sentiment analysis and deep
learning on COVID-19 related tweets. IEEE Access, 8:181074–181090, 2020.
Andrea Iovine, Fedelucio Narducci, and Giovanni Semeraro. Conversational recom-
mender systems and natural language: A study through the ConveRSE framework.
Decision Support Systems, 131:113250, 2020.
Hideki Isozaki and Tsutomu Hirao. Japanese zero pronoun resolution based on
ranking rules and machine learning. In EMNLP, pages 184–191, 2003.
Michael Israel. Polarity sensitivity as lexical semantics. Linguistics and Philosophy,
pages 619–666, 1996.
Ray Jackendoff. Toward an explanatory semantic representation. Linguistic Inquiry,
7(1):89–150, 1976.
Ray S Jackendoff. Semantics and Cognition. The MIT Press, Cambridge,
Massachusetts, 1983.
Vincent Jahjah, Richard Khoury, and Luc Lamontagne. Word normalization using
phonetic signatures. In Canadian Conference on Artificial Intelligence, pages
180–185, 2016.
Niklas Jakob and Iryna Gurevych. Using anaphora resolution to improve opinion
target identification in movie reviews. In ACL, pages 263–268, 2010a.
Niklas Jakob and Iryna Gurevych. Extracting opinion targets in a single and cross-
domain setting with conditional random fields. In EMNLP, pages 1035–1045,
2010b.
Bernard J Jansen. Understanding user-web interactions via web analytics. Springer
Nature, 2022.
Bernard J Jansen, Soon-Gyo Jung, Lene Nielsen, Kathleen W Guan, and Joni Salmi-
nen. How to create personas: Three persona creation methodologies with im-
plications for practical employment. Pacific Asia Journal of the Association for
Information Systems, 14(3):1, 2022a.
Bernard J Jansen, Joni Salminen, Soon-gyo Jung, and Kathleen Guan. Data-driven
personas. Springer Nature, 2022b.
Bernard J Jansen, Soon-gyo Jung, and Joni Salminen. Employing large language
models in survey research. Natural Language Processing Journal, 4:100020,
2023.
Lei Ji, Yujing Wang, Botian Shi, Dawei Zhang, Zhongyuan Wang, and Jun Yan.
Microsoft concept graph: Mining semantic concepts for short text understanding.
Data Intelligence, 1(3):238–270, 2019.
Shaoxiong Ji, Xue Li, Zi Huang, and Erik Cambria. Suicidal ideation and men-
tal disorder detection with attentive relation networks. Neural Computing and
Applications, 34(13):10309–10319, 2022.
Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. Applying many-to-
many alignments and hidden Markov models to letter-to-phoneme conversion. In
ACL-HLT, pages 372–379, 2007.
Sittichai Jiampojamarn, Colin Cherry, and Grzegorz Kondrak. Integrating joint
n-gram features into a discriminative training framework. In HLT-ACL, pages
697–700, 2010.
Julie Jiang, Xiang Ren, and Emilio Ferrara. Retweet-BERT: Political leaning detec-
tion using language features and information diffusion on social networks. arXiv
preprint arXiv:2207.08349, 2022.
Qingnan Jiang, Lei Chen, Ruifeng Xu, Xiang Ao, and Min Yang. A challenge dataset
and effective models for aspect-based sentiment analysis. In EMNLP-IJCNLP,
pages 6280–6285, 2019.
Tianwen Jiang, Qingkai Zeng, Tong Zhao, Bing Qin, Ting Liu, Nitesh V Chawla, and
Meng Jiang. Biomedical knowledge graphs construction from conditional state-
ments. IEEE/ACM Transactions on Computational Biology and Bioinformatics,
18(3):823–835, 2020.
Wei Jiang, Hao Pan, and Qing Ye. An improved association rule mining approach
to identification of implicit product aspects. The Open Cybernetics & Systemics
Journal, 8(1), 2014.
Mu Jin, Zhihao Ma, Kebing Jin, Hankz Hankui Zhuo, Chen Chen, and Chao Yu.
Creativity of AI: Automatic symbolic option discovery for facilitating deep rein-
forcement learning. In AAAI, pages 7042–7050, 2022.
Wei Jin, Hung Hay Ho, and Rohini K Srihari. OpinionMiner: A novel machine
learning system for web opinion mining and extraction. In 15th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pages 1195–
1204, 2009.
Hongyan Jing, Daniel Lopresti, and Chilin Shih. Summarization of noisy documents:
a pilot study. In HLT-NAACL 03 Text Summarization Workshop, pages 25–32,
2003.
Jinjie Ni, Vlad Pandelea, Tom Young, Haicang Zhou, and Erik Cambria. HiTKG:
Towards goal-oriented conversations via multi-hierarchy learning. In AAAI, pages
11112–11120, 2022.
Yohan Jo and Alice H Oh. Aspect and sentiment unification model for online
review analysis. In fourth ACM International Conference on Web Search and
Data Mining, pages 815–824, 2011.
Thorsten Joachims. Optimizing search engines using clickthrough data. In eighth
ACM SIGKDD international conference on Knowledge discovery and data mining,
pages 133–142, 2002.
O John, E Donahue, and R Kentle. The Big Five Inventory: Versions 4a and 54.
University of California, Berkeley, Institute of Personality and Social Research,
1991.
Alistair Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mo-
hammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and
Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific
Data, 3(1):1–9, 2016.
Daniel Johnson and John Gardner. Personality, motivation and video games. In
22nd Conference of the Computer-Human Interaction Special Interest Group of
Australia on Computer-Human Interaction, pages 276–279, 2010.
Kyle Johnson. Gapping. The Blackwell companion to syntax, pages 407–435, 2006.
Mark Johnson. The body in the mind: The bodily basis of meaning, imagination,
and reason. Journal of Aesthetics and Art Criticism, 47(4), 1989.
Rie Johnson and Tong Zhang. A high-performance semi-supervised learning method
for text chunking. In ACL, pages 1–9, 2005.
Karen Sparck Jones. A statistical interpretation of term specificity and its application
in retrieval. Journal of Documentation, 28(1):11–21, 1972.
Bart Jongejan and Hercules Dalianis. Automatic training of lemmatization rules
that handle morphological changes in pre-, in-and suffixes alike. In ACL-IJCNLP,
pages 145–153, 2009.
Greety Jose and Nisha S Raj. Lexico-syntactic normalization model for noisy
SMS text. In 2014 International Conference on Electronics, Communication and
Computational Engineering (ICECCE), pages 163–168, 2014.
Aditya Joshi, Vaibhav Tripathi, Pushpak Bhattacharyya, and Mark James Carman.
Harnessing sequence labeling for sarcasm detection in dialogue from TV series
‘friends’. In CoNLL, pages 146–155, 2016.
Aravind K Joshi and Steve Kuhn. Centered logic: The role of entity centered sentence
representation in natural language inferencing. In IJCAI, pages 435–439, 1979.
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large
scale distantly supervised challenge dataset for reading comprehension. In ACL,
pages 1601–1611, 2017.
Mandar Joshi, Omer Levy, Luke Zettlemoyer, and Daniel Weld. BERT for corefer-
ence resolution: Baselines and analysis. In EMNLP-IJCNLP, pages 5803–5808,
2019.
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and
Omer Levy. SpanBERT: Improving pre-training by representing and predicting
spans. Transactions of the Association for Computational Linguistics, 8:64–77,
2020.
Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou,
and Tomas Mikolov. FastText.zip: Compressing text classification models. arXiv
preprint arXiv:1612.03651, 2016.
Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomás Mikolov. Bag of tricks
for efficient text classification. In Mirella Lapata, Phil Blunsom, and Alexander
Koller, editors, EACL, pages 427–431, 2017.
John Judge, Aoife Cahill, and Josef van Genabith. QuestionBank: Creating a corpus
of parse-annotated questions. In ACL, pages 497–504, 2006.
Carl Jung and John Beebe. Psychological types. Routledge, 2016.
Matjaž Juršič, Igor Mozetič, and Nada Lavrač. Learning ripple down rules for
efficient lemmatization. In 10th International Multiconference Information Society,
IS, pages 206–209, 2007.
Nirit Kadmon and Fred Landman. Any. Linguistics and Philosophy, pages 353–422,
1993.
Mikael Kågebäck and Hans Salomonsson. Word sense disambiguation using a bidi-
rectional LSTM. In 5th Workshop on Cognitive Aspects of the Lexicon (CogALex-
V), pages 51–56, 2016.
Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural
network for modelling sentences. In ACL, pages 655–665, 2014.
Ahmad Kamal. Subjectivity classification using machine learning techniques for
mining feature-opinion pairs from web opinion sources. International Journal of
Computer Science Issues, 10(5):191, 2013.
Ashraf Kamal and Muhammad Abulaish. CAT-BiGRU: Convolution and attention
with bi-directional gated recurrent unit for self-deprecating sarcasm detection.
Cognitive Computation, 14(1):91–109, 2022.
Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Mudit Verma, Kaya
Stechly, Siddhant Bhambri, Lucas Saldyt, and Anil Murthy. LLMs can’t plan, but
can help planning in LLM-Modulo frameworks. arXiv preprint arXiv:2402.01817,
2024.
Adam Kilgarriff and Martha Palmer. Introduction to the special issue on SENSEVAL.
Computers and the Humanities, 34(1):1–13, 2000.
A Kim, Hyun-Je Song, Seong-Bae Park, et al. A two-step neural dialog state tracker
for task-oriented dialog processing. Computational Intelligence and Neuroscience,
2018:1–12, 2018.
J-D Kim, Tomoko Ohta, Yuka Tateisi, and Jun'ichi Tsujii. GENIA corpus—a se-
mantically annotated corpus for bio-textmining. Bioinformatics, 19(suppl_1):
i180–i182, 2003.
Jeong-Dong Kim, Jiseong Son, and Doo-Kwon Baik. CA 5W1H onto: ontological
context-aware model based on 5W1H. International Journal of Distributed Sensor
Networks, 8(3):247346, 2012.
Sang-Bum Kim, Hee-Cheol Seo, and Hae-Chang Rim. Information retrieval using
word senses: root sense tagging approach. In 27th Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval, pages
258–265, 2004.
Soo-Min Kim and Eduard Hovy. Determining the sentiment of opinions. In COLING,
pages 1367–1373, 2004.
Soo-Min Kim and Eduard Hovy. Automatic detection of opinion bearing words and
sentences. In Companion Volume to the Proceedings of Conference including
Posters/Demos and tutorial abstracts, pages 61–66, 2005.
Soo-Min Kim and Eduard Hovy. Identifying and analyzing judgment opinions. In
NAACL-HLT, pages 200–207, 2006.
Su Nam Kim, Olena Medelyan, Min-Yen Kan, and Timothy Baldwin. SemEval-2010
Task 5: Automatic keyphrase extraction from scientific articles. In Katrin Erk and
Carlo Strapparava, editors, 5th International Workshop on Semantic Evaluation,
pages 21–26, 2010.
Yoon Kim. Convolutional neural networks for sentence classification. In EMNLP,
pages 1746–1751, 2014.
Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. Character-aware
neural language models. In AAAI, 2016.
Barbara Ann Kipfer. Roget’s 21st century thesaurus in dictionary form (Third
Edition). Bantam Dell, New York, NY, 2006.
Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina
Vylomova, Patrick Xia, Manaal Faruqui, Sabrina J Mielke, Arya D McCarthy,
Sandra Kübler, et al. UniMorph 2.0: Universal morphology. arXiv preprint
arXiv:1810.11101, 2018.
Yuval Kirstain, Ori Ram, and Omer Levy. Coreference resolution without span
representations. In ACL-IJCNLP, pages 14–19, 2021.
Tibor Kiss and Jan Strunk. Viewing sentence boundary detection as collocation
identification. In KONVENS, volume 2002, pages 75–82, 2002.
Tibor Kiss and Jan Strunk. Unsupervised multilingual sentence boundary detection.
Computational Linguistics, 32(4):485–525, 2006.
Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-
normalizing neural networks. In 31st international conference on neural infor-
mation processing systems, pages 972–981, 2017.
Jan-Christoph Klie, Michael Bugert, Beto Boullosa, Richard Eckart de Castilho, and
Iryna Gurevych. The INCEpTION platform: Machine-assisted and knowledge-
oriented interactive annotation. In COLING, pages 5–9, 2018.
Benjamin C. Knoll, Elizabeth A. Lindemann, Arian L. Albert, Genevieve B. Melton,
and Serguei V. S. Pakhomov. Recurrent deep network models for clinical NLP tasks:
Use case with sentence boundary disambiguation. Studies in Health Technology
and Informatics, 264:198–202, 2019.
Nozomi Kobayashi, Kentaro Inui, and Yuji Matsumoto. Extracting aspect-evaluation
and aspect-of relations in opinion mining. In EMNLP-CoNLL, pages 1065–1074,
2007.
Vid Kocijan, Oana-Maria Camburu, Ana-Maria Cretu, Yordan Yordanov, Phil Blun-
som, and Thomas Lukasiewicz. WikiCREM: A large unsupervised corpus for
coreference resolution. In EMNLP-IJCNLP, pages 4303–4312, 2019.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Fed-
erico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard
Zens, et al. Moses: Open source toolkit for statistical machine translation. In
ACL, pages 177–180, 2007.
Rob Koeling. Chunking with maximum entropy models. In Fourth Conference on
Computational Natural Language Learning and the Second Learning Language
in Logic Workshop, 2000.
Keshav Kolluru, Vaibhav Adlakha, Samarth Aggarwal, Soumen Chakrabarti, et al.
OpenIE6: Iterative grid labeling and coordination analysis for open information
extraction. In EMNLP, pages 3748–3761, 2020.
Dan Kondratyuk. Cross-lingual lemmatization and morphology tagging with two-
stage multilingual BERT fine-tuning. In 16th Workshop on Computational Re-
search in Phonetics, Phonology, and Morphology, pages 12–18, 2019.
Dan Kondratyuk and Milan Straka. 75 languages, 1 model: Parsing universal depen-
dencies universally. arXiv preprint arXiv:1904.02099, 2019.
Daniel Kondratyuk, Tomáš Gavenčiak, Milan Straka, and Jan Hajič. LemmaTag:
Jointly tagging and lemmatizing for morphologically-rich languages with BRNNs.
arXiv preprint arXiv:1808.03703, 2018.
Fang Kong and Guodong Zhou. A tree kernel-based unified framework for Chinese
zero anaphora resolution. In EMNLP, pages 882–891, 2010.
Prahlad Koratamaddi, Karan Wadhwani, Mridul Gupta, and Sriram G Sanjeevi.
Market sentiment-aware deep reinforcement learning approach for stock portfolio
allocation. Engineering Science and Technology, an International Journal, 24(4):
848–859, 2021.
Michal Kosinski, David Stillwell, and Thore Graepel. Private traits and attributes
are predictable from digital records of human behavior. National Academy of
Sciences, 110(15):5802–5805, 2013.
Ronak Kosti, Jose M Alvarez, Adria Recasens, and Agata Lapedriza. Context based
emotion recognition using emotic dataset. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 42(11):2755–2766, 2019.
Martin Krallinger, Obdulia Rabal, Anália Lourenço, Julen Oyarzabal, and Alfonso
Valencia. Information Retrieval and Text Mining Technologies for Chemistry.
Chemical Reviews, 117(12):7673–7761, 2017.
Mikalai Krapivin, Aliaksandr Autaeu, and Maurizio Marchese. Large dataset for
keyphrases extraction. Technical report, University of Trento, 2009.
Roger J Kreuz and Sam Glucksberg. How to be sarcastic: The echoic reminder
theory of verbal irony. Journal of Experimental Psychology: General, 118(4):
374, 1989.
Manfred Krifka. Some remarks on polarity items. Semantic Universals and Universal
Semantics, pages 150–189, 1992.
Manfred Krifka. The semantics and pragmatics of weak and strong polarity items in
assertions. In Semantics and Linguistic Theory, volume 4, pages 195–219, 1994.
Saul A Kripke. Naming and necessity. In Semantics of natural language, pages
253–355. Springer, 1972.
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz,
Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S.
Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision us-
ing crowdsourced dense image annotations. International Journal of Computer
Vision, 123(1):32–73, 2017.
Paul R Kroeger. Analyzing Meaning: An Introduction to Semantics and Pragmatics.
Language Science Press, 2023.
Joshua Krook and Jan Blockx. Recommender systems, autonomy and user engage-
ment. In First International Symposium on Trustworthy Autonomous Systems,
pages 1–9, 2023.
Robert Krovetz and W Bruce Croft. Lexical ambiguity and information retrieval.
ACM Transactions on Information Systems (TOIS), 10(2):115–141, 1992.
Lun-Wei Ku, Yu-Ting Liang, and Hsin-Hsi Chen. Opinion extraction, summarization
and tracking in news and blog corpora. In AAAI, pages 100–107, 2006.
Sandra Kübler and Desislava Zhekova. Singletons and coreference resolution eval-
uation. In RANLP, pages 261–267, 2011.
Taku Kudo and Yuji Matsumoto. Use of support vector learning for chunk identifi-
cation. In Fourth Conference on Computational Natural Language Learning and
the Second Learning Language in Logic Workshop, 2000.
Brian Kulis et al. Metric learning: A survey. Foundations and Trends® in Machine
Learning, 5(4):287–364, 2013.
Akshi Kumar, Shubham Dikshit, and Victor Hugo C Albuquerque. Explainable
artificial intelligence for sarcasm detection in dialogues. Wireless Communications
and Mobile Computing, 2021, 2021.
Sawan Kumar, Sharmistha Jat, Karan Saxena, and Partha Talukdar. Zero-shot word
sense disambiguation using sense definition embeddings. In ACL, pages 5670–
5681, 2019a.
Sudhanshu Kumar, Mahendra Yadava, and Partha Pratim Roy. Fusion of EEG re-
sponse and sentiment analysis of products review to predict customer satisfaction.
Information Fusion, 52:41–52, 2019b.
Ziva Kunda. The case for motivated reasoning. Psychological Bulletin, 108(3):480,
1990.
Florian Kunneman, Christine Liebrecht, Margot Van Mulken, and Antal Van den
Bosch. Signaling sarcasm: From hyperbole to hashtag. Information Processing
& Management, 51(4):500–509, 2015.
Yen-Ling Kuo, Jong-Chuan Lee, Kai-yang Chiang, Rex Wang, Edward Shen, Cheng-
wei Chan, and Jane Yung-jen Hsu. Community-based game design: experiments
on social games for commonsense data collection. In ACM SIGKDD Workshop
on Human Computation, pages 15–22, 2009.
Julian Kupiec. Robust part-of-speech tagging using a hidden Markov model. Com-
puter Speech & Language, 6(3):225–242, 1992.
Onur Kuru, Ozan Arkan Can, and Deniz Yuret. CharNER: Character-level named
entity recognition. In COLING, pages 911–921, 2016.
John Lafferty and Chengxiang Zhai. Document language models, query models, and
risk minimization for information retrieval. In 24th Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval, pages
111–119, 2001.
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional
random fields: Probabilistic models for segmenting and labeling sequence data.
In ICML '01, pages 282–289, San Francisco, CA, USA, 2001. Morgan Kaufmann
Publishers Inc. ISBN 1558607781.
Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman.
Building machines that learn and think like people. Behavioral and Brain Sciences,
40:e253, 2017.
George Lakoff. Master Metaphor List. University of California, 1994.
George Lakoff and Mark Johnson. Metaphors We Live by. University of Chicago
press, 1980.
Guillaume Lample and François Charton. Deep learning for symbolic mathematics.
In International Conference on Learning Representations, 2019.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma,
and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language
representations. In International Conference on Learning Representations, pages
1–17, 2020.
Mark J Landau, Brian P Meier, and Lucas A Keefer. A metaphor-enriched social
cognition. Psychological Bulletin, 136(6):1045, 2010.
Lukas Lange, Heike Adel, and Jannik Strötgen. Closing the gap: Joint de-
identification and concept extraction in the clinical domain. In Dan Jurafsky,
Joyce Chai, Natalie Schluter, and Joel R. Tetreault, editors, ACL, pages 6945–
6952, 2020.
Joseph LaPorte. Rigid designators for properties. Philosophical Studies, 130(2):
321–336, 2006.
Shalom Lappin and Herbert J Leass. An algorithm for pronominal anaphora resolu-
tion. Computational Linguistics, 20(4):535–561, 1994.
Christine Largeron, Christophe Moulin, and Mathias Géry. Entropy based feature
selection for text categorization. In 2011 ACM Symposium on Applied Computing,
pages 924–928, 2011.
Ida Unmack Larsen, Tua Vinther-Jensen, Anders Gade, Jørgen Erik Nielsen, and
Asmus Vogel. Do I misconstrue? Sarcasm detection, emotion recognition, and
theory of mind in Huntington disease. Neuropsychology, 30(2):181, 2016.
Siddique Latif, Rajib Rana, Sara Khalifa, Raja Jurdak, Junaid Qadir, and Bjoern W
Schuller. Survey of deep representation learning for speech emotion recognition.
IEEE Transactions on Affective Computing, 2021.
Minh Le, Marten Postma, Jacopo Urbani, and Piek Vossen. A deep dive into word
sense disambiguation with lstm. In COLING, pages 354–365, 2018.
Thi Thuy Le, Thanh Hung Vo, Duc Trung Mai, Than Tho Quan, and Tuoi Thi Phan.
Sentiment analysis using anaphoric coreference resolution and ontology inference.
In International Workshop on Multi-disciplinary Trends in Artificial Intelligence,
pages 297–303, 2016.
Ronan Le Nagard and Philipp Koehn. Aiding pronoun translation with co-reference
resolution. In Joint Fifth Workshop on Statistical Machine Translation and Met-
ricsMATR, pages 252–261, 2010.
Claudia Leacock, Geoffrey Towell, and Ellen M Voorhees. Corpus-based statistical
sense resolution. In HLT, pages 260–265, 1993.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based
learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–
2324, 1998.
Bongseok Lee and Yong Suk Choi. Graph based network with contextualized
representations of turns in dialogue. In EMNLP, pages 443–455, 2021.
Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu,
and Dan Jurafsky. Deterministic coreference resolution based on entity-centric,
precision-ranked rules. Computational Linguistics, 39(4):885–916, 2013.
Heeyoung Lee, Mihai Surdeanu, and Dan Jurafsky. A scaffolding approach to coref-
erence resolution integrating statistical and rule-based models. Natural Language
Engineering, 23(5):733–762, 2017a.
Joosung Lee and Wooin Lee. CoMPM: Context modeling with speaker’s pre-trained
memory tracking for emotion recognition in conversation. In NAACL-HLT, pages
5669–5679, 2022.
Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. End-to-end neural
coreference resolution. In EMNLP, pages 188–197, 2017b.
Kenton Lee, Luheng He, and Luke Zettlemoyer. Higher-order coreference resolution
with coarse-to-fine inference. In NAACL-HLT, pages 687–692, 2018.
Young-Suk Lee and Laurence Horn. Any as indefinite plus even. Manuscript, Yale
University, 1994.
Yue-Shi Lee and Yu-Chieh Wu. A robust multilingual portable phrase chunking
system. Expert Systems with Applications, 33(3):590–599, 2007.
Samuel Leeman-Munk, James Lester, and James Cox. NCSU_SAS_SAM: Deep
encoding and reconstruction for normalization of noisy text. In Workshop on Noisy
User-generated Text, pages 154–161, 2015.
Feng-Lin Li, Hehong Chen, Guohai Xu, Tian Qiu, Feng Ji, Ji Zhang, and Haiqing
Chen. AliMeKG: Domain knowledge graph construction and application in e-
commerce. In 29th ACM International Conference on Information & Knowledge
Management, pages 2581–2588, 2020a.
Hongsong Li, Kenny Q Zhu, and Haixun Wang. Data-driven metaphor recognition
and explanation. Transactions of the Association for Computational Linguistics,
1:379–390, 2013.
Jiangnan Li, Zheng Lin, Peng Fu, and Weiping Wang. Past, present, and future:
Conversational emotion recognition through structural modeling of psychological
knowledge. In EMNLP, pages 1204–1214, 2021b.
Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li. A survey on deep learning for
named entity recognition. IEEE Transactions on Knowledge and Data Engineer-
ing, 34(1):50–70, 2020b.
Keqian Li, Hanwen Zha, Yu Su, and Xifeng Yan. Concept mining via embedding.
In ICDM, pages 267–276, 2018a.
Linfeng Li, Peng Wang, Jun Yan, Yao Wang, Simin Li, Jinpeng Jiang, Zhe Sun,
Buzhou Tang, Tsung-Hui Chang, Shenghui Wang, et al. Real-world data med-
ical knowledge graph: construction and applications. Artificial Intelligence in
Medicine, 103:101817, 2020c.
Ning Li, Chi-Yin Chow, and Jia-Dong Zhang. Emova: A semi-supervised end-
to-end moving-window attentive framework for aspect mining. In PAKDD, pages
811–823, 2020d.
Peng Li and Heng Huang. UTA DLNLP at SemEval-2016 Task 12: Deep learning
based natural language processing system for clinical information identification
from clinical notes and pathology reports. In Steven Bethard, Daniel M. Cer,
Marine Carpuat, David Jurgens, Preslav Nakov, and Torsten Zesch, editors, 10th
International Workshop on Semantic Evaluation, pages 1268–1273, 2016.
Qing Li, Siyuan Huang, Yining Hong, Yixin Chen, Ying Nian Wu, and Song-Chun
Zhu. Closed loop neural-symbolic learning via integrating neural perception,
grammar parsing, and symbolic reasoning. In ICML, pages 5884–5894, 2020e.
Wei Li, Kun Guo, Yong Shi, Luyao Zhu, and Yuanchun Zheng. Dwwp: Domain-
specific new words detection and word propagation system for sentiment analysis
in the tourism domain. Knowledge-Based Systems, 146:203–214, 2018b.
Wei Li, Luyao Zhu, Yong Shi, Kun Guo, and Erik Cambria. User reviews: Senti-
ment analysis using lexicon integrated two-channel CNN-LSTM family models.
Applied Soft Computing, 94:106435, 2020f.
Wei Li, Luyao Zhu, and Erik Cambria. Taylor’s theorem: A new perspective for
neural tensor networks. Knowledge-Based Systems, 228:107258, 2021c.
Wei Li, Wei Shao, Shaoxiong Ji, and Erik Cambria. BiERU: Bidirectional emotional
recurrent unit for conversational sentiment analysis. Neurocomputing, 467:73–82,
2022.
Wei Li, Luyao Zhu, Rui Mao, and Erik Cambria. SKIER: A symbolic knowledge
integrated model for conversational emotion recognition. In AAAI, pages 13121–
13129, 2023b.
Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. A
unified MRC framework for named entity recognition. In ACL, pages 5849–5859,
2020g.
Xin Li, Lidong Bing, Piji Li, and Wai Lam. A unified model for opinion target
extraction and target sentiment prediction. In AAAI, pages 6714–6721, 2019b.
Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. End-
to-end task-completion neural dialogue systems. In IJCNLP, pages 733–743,
2017b.
Yang Li, Kangbo Liu, Ranjan Satapathy, Suhang Wang, and Erik Cambria. Recent
developments in recommender systems: A survey. IEEE Computational Intelli-
gence Magazine, 19(2):78–95, 2024.
Yufei Li, Xiaoyong Ma, Xiangyu Zhou, Pengzhen Cheng, Kai He, and Chen Li.
Knowledge enhanced LSTM for coreference resolution on biomedical texts. Bioin-
formatics, 37(17):2699–2705, 2021d.
Zisheng Li, Jun-ichi Imai, and Masahide Kaneko. Facial-component-based bag
of words and PHOG descriptor for facial expression recognition. In 2009 IEEE
International Conference on Systems, Man and Cybernetics, pages 1353–1358,
2009.
Bin Liang, Chenwei Lou, Xiang Li, Min Yang, Lin Gui, Yulan He, Wenjie Pei, and
Ruifeng Xu. Multi-modal sarcasm detection via cross-modal graph convolutional
network. In ACL, pages 1767–1777, 2022a.
Bin Liang, Hang Su, Lin Gui, Erik Cambria, and Ruifeng Xu. Aspect-based sen-
timent analysis via affective knowledge enhanced graph convolutional networks.
Knowledge-Based Systems, 235:107643, 2022b.
Bin Liang, Lin Gui, Yulan He, Erik Cambria, and Ruifeng Xu. Fusion and dis-
crimination: A multimodal graph contrastive learning framework for multimodal
sarcasm detection. IEEE Transactions on Affective Computing, 15, 2024.
Tyne Liang and Dian-Song Wu. Automatic pronominal anaphora resolution in
English texts. In Research on Computational Linguistics Conference, pages 111–
127, 2003.
Yu Liang and Martijn C Willemsen. Promoting music exploration through person-
alized nudging in a genre exploration recommender. International Journal of
Human–Computer Interaction, 39(7):1495–1518, 2023.
Yunlong Liang, Fandong Meng, Jinchao Zhang, Yufeng Chen, Jinan Xu, and Jie
Zhou. An iterative multi-knowledge transfer network for aspect-based sentiment
analysis. arXiv preprint arXiv:2004.01935, 2020.
Jian Liao, Suge Wang, and Deyu Li. Identification of fact-implied implicit sentiment
based on multi-level semantic fused representation. Knowledge-Based Systems,
165:197–207, 2019.
Jian Liao, Min Wang, Xin Chen, Suge Wang, and Kai Zhang. Dynamic commonsense
knowledge fused method for Chinese implicit sentiment analysis. Information
Processing & Management, 59(3):102934, 2022.
Haochen Liu, Jamell Dacon, Wenqi Fan, Hui Liu, Zitao Liu, and Jiliang Tang.
Does gender matter? towards fairness in dialogue systems. arXiv preprint
arXiv:1910.10486, 2019a.
Jian Liu, Zhiyang Teng, Leyang Cui, Hanmeng Liu, and Yue Zhang. Solving
aspect category sentiment analysis as a text generation task. arXiv preprint
arXiv:2110.07310, 2021a.
Jing Liu, Yue Wang, Lihua Huang, Chenghong Zhang, and Songzheng Zhao. Iden-
tifying adverse drug reaction-related text from social media: A multi-view active
learning approach with various document representations. Information, 13(4):
189, 2022b.
Jingping Liu, Tao Chen, Chao Wang, Jiaqing Liang, Lihan Chen, Yanghua Xiao,
Yunwen Chen, and Ke Jin. VoCSK: Verb-oriented commonsense knowledge mining
with taxonomy-guided induction. Artificial Intelligence, 310:103744, 2022c.
Pan Liu, Yanming Guo, Fenglei Wang, and Guohui Li. Chinese named entity
recognition: The state of the art. Neurocomputing, 473:37–53, 2022d.
Peng Liu, Wei Chen, Gaoyan Ou, Tengjiao Wang, Dongqing Yang, and Kai Lei.
Sarcasm detection in social media based on imbalanced classification. In In-
ternational Conference on Web-Age Information Management, pages 459–471,
2014.
Qi Liu, Zai Huang, Zhenya Huang, Chuanren Liu, Enhong Chen, Yu Su, and Guoping
Hu. Finding similar exercises in online education systems. In Yike Guo and Faisal
Farooq, editors, SIGKDD, pages 1821–1830, 2018b.
Qian Liu, Heyan Huang, Guangquan Zhang, Yang Gao, Junyu Xuan, and Jie Lu.
Semantic structure-based word embedding by incorporating concept convergence
and word divergence. In AAAI, pages 5261–5268, 2018c.
Qian Liu, Rui Mao, Xiubo Geng, and Erik Cambria. Semantic matching in ma-
chine reading comprehension: An empirical study. Information Processing &
Management, 60(2):103145, 2023a.
Qian Liu, Xiubo Geng, Yu Wang, Erik Cambria, and Daxin Jiang. Disentangled
retrieval and reasoning for implicit question answering. IEEE Transactions on
Neural Networks and Learning Systems, 35(6):7804–7815, 2024a.
Qian Liu, Sooji Han, Yang Li, Erik Cambria, and Kenneth Kwok. PrimeNet: A
framework for commonsense knowledge representation and reasoning based on
conceptual primitives. Cognitive Computation, 2024b.
Ruicheng Liu, Guanyi Chen, Rui Mao, and Erik Cambria. A multi-task learning
model for gold-two-mention co-reference resolution. In IJCNN, pages 1–8, 2023b.
Ruicheng Liu, Rui Mao, Anh Tuan Luu, and Erik Cambria. A brief survey on
advances in coreference resolution. Artificial Intelligence Review, 56:14439–
14481, 2023c.
Siyang Liu, Chujie Zheng, Orianna Demasi, Sahand Sabour, Yu Li, Zhou Yu, Yong
Jiang, and Minlie Huang. Towards emotional support dialog systems. In ACL-
IJCNLP, pages 3469–3483, 2021b.
Ting Liu, Yiming Cui, Qingyu Yin, Wei-Nan Zhang, Shijin Wang, and Guoping
Hu. Generating and exploiting large-scale pseudo training data for zero pronoun
resolution. In ACL, pages 102–111, 2017b.
Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and
Jie Tang. GPT understands, too. arXiv preprint arXiv:2103.10385, 2021c.
Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie
Tang. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and
tasks. In ACL, pages 61–68, 2022e.
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep
neural networks for natural language understanding. In ACL, pages 4487–4496,
2019b.
Yang Liu, Andreas Stolcke, Elizabeth Shriberg, and Mary Harper. Comparing
and combining generative and posterior probability models: Some advances in
sentence boundary detection in speech. In EMNLP, pages 64–71, 2004.
Yang Liu, Andreas Stolcke, Elizabeth Shriberg, and Mary Harper. Using conditional
random fields for sentence boundary detection in speech. In ACL, pages 451–458,
2005.
Yang Liu, Nitesh V. Chawla, Mary P. Harper, Elizabeth Shriberg, and Andreas
Stolcke. A study in machine learning from imbalanced data for sentence boundary
detection in speech. Computer Speech & Language, 20(4):468–494, 2006.
Ye Liu, Han Wu, Zhenya Huang, Hao Wang, Jianhui Ma, Qi Liu, Enhong Chen,
Hanqing Tao, and Ke Rui. Technical phrase extraction for patent mining: A
multi-level approach. In ICDM, pages 1142–1147, 2020a.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer
Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly
optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019c.
Yupeng Liu, Guodong Li, and Xiaochen Zhang. Semi-Markov CRF model based on
stacked neural Bi-LSTM for sequence labeling. In 2020 IEEE 3rd International
Conference of Safe Production and Informatization (IICSPI), pages 19–23, 2020b.
doi: 10.1109/IICSPI51290.2020.9332321.
Zengjian Liu, Ming Yang, Xiaolong Wang, Qingcai Chen, Buzhou Tang, Zhe Wang,
and Hua Xu. Entity recognition from clinical texts via recurrent neural network.
BMC Medical Informatics and Decision Making, 17(2):53–61, 2017c.
Zhengyuan Liu, Ke Shi, and Nancy Chen. Coreference-aware dialogue summariza-
tion. In 22nd Annual Meeting of the Special Interest Group on Discourse and
Dialogue, pages 509–519, 2021d.
Guido Löhr. What are abstract concepts? on lexical ambiguity and concreteness
ratings. Review of Philosophy and Psychology, 13(3):549–566, 2022.
Adam Lopez. Statistical machine translation. ACM Computing Surveys, 40(3):1–49,
2008.
Daniel Loureiro and Alipio Jorge. Language modelling makes sense: Propagating
representations through wordnet for full-coverage word sense disambiguation. In
ACL, pages 5682–5691, 2019.
Ismini Lourentzou, Kabir Manghnani, and ChengXiang Zhai. Adapting sequence
to sequence models for text normalization in social media. In International AAAI
Conference on Web and Social Media, volume 13, pages 335–345, 2019.
Kaixin Ma, Filip Ilievski, Jonathan Francis, Yonatan Bisk, Eric Nyberg, and Alessan-
dro Oltramari. Knowledge-driven data construction for zero-shot evaluation in
commonsense question answering. In AAAI, pages 13507–13515, 2021.
Ruotian Ma, Xin Zhou, Tao Gui, Yiding Tan, Linyang Li, Qi Zhang, and Xuanjing
Huang. Template-free prompt tuning for few-shot NER. In NAACL-HLT, pages
5721–5732, 2022a.
Xuezhe Ma and Eduard Hovy. End-to-end sequence labeling via bi-directional
LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354, 2016.
Youmi Ma, Tatsuya Hiraoka, and Naoaki Okazaki. Named entity recognition and
relation extraction using enhanced table filling by contextualized representations.
Journal of Natural Language Processing, 29(1):187–223, 2022b.
Yu Ma, Rui Mao, Qika Lin, Peng Wu, and Erik Cambria. Multi-source aggregated
classification for stock price movement prediction. Information Fusion, 91:515–
528, 2023.
Yu Ma, Rui Mao, Qika Lin, Peng Wu, and Erik Cambria. Quantitative stock portfolio
optimization by multi-task learning risk and return. Information Fusion, 104:
102165, 2024.
Yukun Ma, Haiyun Peng, and Erik Cambria. Targeted aspect-based sentiment anal-
ysis via embedding commonsense knowledge into an attentive LSTM. In AAAI,
pages 5876–5883, 2018.
Mohamed Maamouri, Ann Bies, Tim Buckwalter, and Wigdan Mekki. The Penn
Arabic Treebank: Building a large-scale annotated Arabic corpus. In NEMLAR
conference on Arabic language resources and tools, volume 27, pages 466–467,
2004.
Mohamed Maamouri, Sondos Krouna, Dalila Tabessi, Nadia Hamrouni, and Nizar
Habash. Egyptian Arabic morphological annotation guidelines, 2012.
Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and
Christopher Potts. Learning word vectors for sentiment analysis. In ACL-HLT,
pages 142–150, 2011.
David Magnusson and Bertil Torestad. A holistic view of personality: A model
revisited. Annual Review of Psychology, 44:427, 1993.
Ahsan Mahmood, Hikmat Ullah Khan, Zahoor ur Rehman, and Wahab Khan. Query
based information retrieval and knowledge extraction using hadith datasets. In
2017 13th International Conference on Emerging Technologies (ICET), pages
1–6, 2017. doi: 10.1109/ICET.2017.8281714.
François Mairesse, Marilyn A Walker, Matthias R Mehl, and Roger K Moore. Using
linguistic cues for the automatic recognition of personality in conversation and
text. Journal of Artificial Intelligence Research, 30:457–500, 2007.
Navonil Majumder, Soujanya Poria, Alexander Gelbukh, and Erik Cambria. Deep
learning-based document modeling for personality detection from text. IEEE
Intelligent Systems, 32(2):74–79, 2017.
Chaitanya Malaviya, Shijie Wu, and Ryan Cotterell. A simple joint model for
improved contextual neural lemmatization. arXiv preprint arXiv:1904.02306,
2019.
Pekka Malo, Ankur Sinha, Pekka Korhonen, Jyrki Wallenius, and Pyry Takala. Good
debt or bad debt: Detecting semantic orientations in economic texts. Journal of
the Association for Information Science and Technology, 65(4):782–796, 2014.
Suresh Manandhar, Sašo Džeroski, and Tomaž Erjavec. Learning multilingual mor-
phology with CLOG. In International Conference on Inductive Logic Programming,
pages 135–144, 1998.
Enrique Manjavacas, Ákos Kádár, and Mike Kestemont. Improving lemmatization
of non-standard languages with joint learning. arXiv preprint arXiv:1903.06939,
2019.
Christopher D. Manning. Part-of-speech tagging from 97% to 100%: Is it time for
some linguistics? In Alexander F. Gelbukh, editor, Computational Linguistics and
Intelligent Text Processing, pages 171–189, Berlin, Heidelberg, 2011. Springer
Berlin Heidelberg. ISBN 978-3-642-19400-9.
Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven
Bethard, and David McClosky. The Stanford CoreNLP natural language process-
ing toolkit. In ACL, pages 55–60, 2014.
Rui Mao and Xiao Li. Bridging towers of multi-task learning with a gating mecha-
nism for aspect-based sentiment analysis and sequential metaphor identification.
In AAAI, volume 35, pages 13534–13542, 2021.
Rui Mao, Chenghua Lin, and Frank Guerin. Word embedding and WordNet based
metaphor identification and interpretation. In ACL, pages 1222–1231, 2018.
Rui Mao, Chenghua Lin, and Frank Guerin. End-to-end sequential metaphor iden-
tification inspired by linguistic theories. In ACL, pages 3888–3898, 2019.
Rui Mao, Chenghua Lin, and Frank Guerin. Combining pre-trained word embed-
dings and linguistic features for sequential metaphor identification. arXiv preprint
arXiv:2104.03285, 2021a.
Rui Mao, Xiao Li, Mengshi Ge, and Erik Cambria. MetaPro: A computational
metaphor processing model for text pre-processing. Information Fusion, 86-87:
30–43, 2022a.
Rui Mao, Kelvin Du, Yu Ma, Luyao Zhu, and Erik Cambria. Discovering the
cognition behind language: Financial metaphor analysis with MetaPro. In ICDM,
pages 1211–1216, 2023a.
Rui Mao, Xiao Li, Kai He, Mengshi Ge, and Erik Cambria. MetaPro Online:
A computational metaphor processing online system. In ACL, pages 127–135,
2023b.
Rui Mao, Qian Liu, Kai He, Wei Li, and Erik Cambria. The biases of pre-trained
language models: An empirical study on prompt-based sentiment analysis and
emotion detection. IEEE Transactions on Affective Computing, 14(3):1743–1753,
2023c.
Rui Mao, Guanyi Chen, Xulang Zhang, Frank Guerin, and Erik Cambria. GPTEval:
A survey on assessments of ChatGPT and GPT-4. In LREC-COLING, pages
7844–7866, 2024a.
Rui Mao, Kai He, Claudia Beth Ong, Qian Liu, and Erik Cambria. MetaPro 2.0:
Computational metaphor processing on the effectiveness of anomalous language
modeling. In ACL, pages 9891–9908, 2024b.
Rui Mao, Kai He, Xulang Zhang, Guanyi Chen, Jinjie Ni, Zonglin Yang, and Erik
Cambria. A survey on semantic processing techniques. Information Fusion, 101:
101988, 2024c.
Rui Mao, Qika Lin, Qiawen Liu, Gianmarco Mengaldo, and Erik Cambria. Under-
standing public perception towards weather disasters through the lens of metaphor.
In IJCAI, 2024d.
Rui Mao, Tianwei Zhang, Qian Liu, Amir Hussain, and Erik Cambria. Unveiling
diplomatic narratives: Analyzing United Nations Security Council debates through
metaphorical cognition. In Annual Meeting of the Cognitive Science Society
(CogSci), pages 1709–1716, 2024e.
Yue Mao, Yi Shen, Chao Yu, and Longjun Cai. A joint training dual-MRC framework
for aspect based sentiment analysis. In AAAI, pages 13543–13551, 2021b.
Yue Mao, Yi Shen, Jingchao Yang, Xiaoying Zhu, and Longjun Cai. Seq2Path:
Generating sentiment tuples as paths of a tree. In ACL, pages 2215–2225, Dublin,
Ireland, May 2022b.
Marco Vassallo, Giuliano Gabrieli, Valerio Basile, Cristina Bosco, et al. The tenuous-
ness of lemmatization in lexicon-based sentiment analysis. In Italian Conference
on Computational Linguistics, volume 2481, pages 1–6, 2019.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a
large annotated corpus of English: The Penn Treebank. Computational Linguis-
tics, 19(2):313–330, 1993.
Jose Maria Balmaceda, Silvia Schiaffino, and Daniela Godoy. How do personal-
ity traits affect communication among users in online social networks? Online
Information Review, 38(1):136–153, 2014.
Hazel Rose Markus and Shinobu Kitayama. Culture and the self: Implications for
cognition, emotion, and motivation. In College student development and academic
life, pages 264–293. Routledge, 2014.
Mónica Marrero, Julián Urbano, Sonia Sánchez-Cuadrado, Jorge Morato, and
Juan Miguel Gómez-Berbís. Named entity recognition: fallacies, challenges and
opportunities. Computer Standards & Interfaces, 35(5):482–489, 2013.
Edison Marrese-Taylor, Juan D Velásquez, and Felipe Bravo-Marquez. A novel de-
terministic approach for aspect-based opinion mining in tourism products reviews.
Expert Systems with Applications, 41(17):7764–7775, 2014.
M. Antònia Martí, Mariona Taulé, Lluís Màrquez, and Manuel Bertran. CESS-ECE:
A multilingual and multilevel annotated corpus, 2007. URL [Link]
edu/~mbertran/cess-ece.
James H Martin. A Computational Model of Metaphor Interpretation. Academic
Press Professional, Inc., 1990.
James R Martin and Peter R White. The Language of Evaluation, volume 2. Springer,
2003.
Scott Martin. The role of salience ranking in anaphora resolution, 2015.
Sebastian Martschat and Michael Strube. Recall error analysis for coreference
resolution. In EMNLP, pages 2070–2081, 2014.
Marco Maru, Federico Scozzafava, Federico Martelli, and Roberto Navigli. Syn-
tagnet: Challenging supervised word sense disambiguation with lexical-semantic
combinations. In EMNLP-IJCNLP, pages 3534–3540, 2019.
Rebecca Marvin and Philipp Koehn. Exploring word sense disambiguation abilities
of neural machine translation systems. In 13th Conference of the Association for
Machine Translation in the Americas (Volume 1: Research Track), pages 125–131.
Association for Machine Translation in the Americas, 2018.
Zachary J Mason. CorMet: A computational, corpus-based conventional metaphor
extraction system. Computational Linguistics, 30(1):23–44, 2004.
Spyros Matsoukas, Ivan Bulyko, Bing Xiang, Kham Nguyen, Richard Schwartz,
and John Makhoul. Integrating speech recognition and machine translation. In
2007 IEEE International Conference on Acoustics, Speech and Signal Processing-
ICASSP’07, volume 4, pages IV–1281, 2007.
Tomoko Matsui, Tagiru Nakamura, Akira Utsumi, Akihiro T Sasaki, Takahiko Koike,
Yumiko Yoshida, Tokiko Harada, Hiroki C Tanabe, and Norihiro Sadato. The role
of prosody and context in sarcasm comprehension: Behavioral and fMRI evidence.
Neuropsychologia, 87:74–84, 2016.
Chanda Maurya, T Muhammad, Preeti Dhillon, and Priya Maurya. The effects of
cyberbullying victimization on depression and suicidal ideation among adoles-
cents and young adults: A three year cohort study from India. BMC Psychiatry,
22(1):1–14, 2022.
Diah Hevyka Maylawati, Warih Maharani, and Ibnu Asror. Implicit aspect extraction
in product reviews using FIN algorithm. In 2020 8th International Conference on
Information and Communication Technology (ICoICT), pages 1–5, 2020.
Diana Maynard and Mark A Greenwood. Who cares about sarcastic Tweets? In-
vestigating the impact of sarcasm on sentiment analysis. In Ninth International
Conference on Language Resources and Evaluation, pages 4238–4243, 2014.
Michael Mayor. Longman Dictionary of Contemporary English. Pearson Education
India, 2009.
Andrew McCallum and Ben Wellner. Object consolidation by graph partitioning
with a conditionally-trained distance metric. In KDD Workshop on Data Cleaning,
Record Linkage and Object Consolidation, pages 1–6, 2003.
Andrew McCallum and Ben Wellner. Conditional models of identity uncertainty
with application to noun coreference. NeurIPS, 17:1–8, 2004.
Andrew McCallum, Dayne Freitag, and Fernando CN Pereira. Maximum entropy
Markov models for information extraction and segmentation. In ICML, volume 17,
pages 591–598, 2000.
Arya D McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence
Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sabrina J
Mielke, Jeffrey Heinz, et al. The SIGMORPHON 2019 shared task: Morpho-
logical analysis in context and cross-lingual transfer for inflection. arXiv preprint
arXiv:1910.11493, 2019.
Joseph F McCarthy and Wendy G Lehnert. Using decision trees for coreference
resolution. In IJCAI, pages 1050–1055, 1995.
Michael C McCord. Slot grammar. In Natural language and logic, pages 118–145.
Springer, 1990.
Iain McCowan, Jean Carletta, Wessel Kraaij, Simone Ashby, Sebastien Bourban,
Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos,
et al. The AMI meeting corpus. In 5th International Conference on Methods and
Techniques in Behavioral Research, volume 88, page 100, 2005.
Kathleen E McCoy and Michael Strube. Generating anaphoric expressions: pro-
noun or definite description? The Relation of Discourse/Dialogue Structure and
Reference, 1999.
Robert R McCrae and Paul T Costa. Updating Norman's "adequacy taxonomy":
Intelligence and personality dimensions in natural language and in questionnaires.
Journal of Personality and Social Psychology, 49(3):710, 1985.
Robert R McCrae and Paul T Costa. Validation of the five-factor model of personality
across instruments and observers. Journal of Personality and Social Psychology,
52(1):81, 1987.
Robert R McCrae and Paul T Costa Jr. A contemplated revision of the NEO five-
factor inventory. Personality and Individual Differences, 36(3):587–596, 2004.
Robert R McCrae and Oliver P John. An introduction to the five-factor model and
its applications. Journal of Personality, 60(2):175–215, 1992.
Robert R McCrae and Paul T Costa. Personality in Adulthood. Guilford Publications,
1990. ISBN 9780898625288.
Ryan McDonald, Koby Crammer, and Fernando Pereira. Flexible text segmentation
with structured multilabel classification. In HLT-EMNLP, pages 987–994, 2005.
Skye McDonald and Samantha Pearce. Clinical insights into pragmatic theory:
Frontal lobe deficits and sarcasm. Brain and Language, 53(1):81–104, 1996.
Skye McDonald, Sharon Flanagan, Jennifer Rollins, and Julianne Kinch. TASIT: A
new clinical tool for assessing social perception after traumatic brain injury. The
Journal of Head Trauma Rehabilitation, 18(3):219–238, 2003.
Brian McFee, Colin Raffel, Dawen Liang, Daniel P Ellis, Matt McVicar, Eric Bat-
tenberg, and Oriol Nieto. librosa: Audio and music signal analysis in Python. In
14th Python in Science Conference, volume 8, pages 18–25, 2015.
Walaa Medhat, Ahmed Hassan, and Hoda Korashy. Sentiment analysis algorithms
and applications: A survey. Ain Shams Engineering Journal, 5(4):1093–1113,
2014.
Douglas L Medin and Marguerite M Schaffer. Context theory of classification
learning. Psychological Review, 85(3):207, 1978.
Raveesh Meena, Gabriel Skantze, and Joakim Gustafson. Data-driven models for
timing feedback responses in a map task dialogue system. Computer Speech &
Language, 28(4):903–922, 2014.
Sara Meftah and Nasredine Semmar. A neural network model for part-of-speech
tagging of social media texts. In Eleventh International Conference on Language
Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 2018. European
Language Resources Association (ELRA).
Yash Mehta, Samin Fatehi, Amirmohammad Kazameini, Clemens Stachl, Erik Cam-
bria, and Sauleh Eetemadi. Bottom-up and top-down: Predicting personality with
psycholinguistic and language model features. In ICDM, pages 1184–1189, 2020.
Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky, and
Yu Chi. Deep keyphrase generation. In Regina Barzilay and Min-Yen Kan,
editors, ACL, pages 582–592, 2017.
Xinfan Meng and Houfeng Wang. Mining user reviews: from specification to sum-
marization. In ACL-IJCNLP 2009 Conference Short Papers, pages 177–180,
2009.
Amina Ben Meriem, Lobna Hlaoua, and Lotfi Ben Romdhane. A fuzzy approach for
sarcasm detection in social networks. Procedia Computer Science, 192:602–611,
2021.
Andreas Messalas, Yiannis Kanellopoulos, and Christos Makris. Model-agnostic
interpretability with shapley values. In 2019 10th International Conference on
Information, Intelligence, Systems and Applications (IISA), pages 1–7, 2019.
Mayuri Mhatre, Dakshata Phondekar, Pranali Kadam, Anushka Chawathe, and
Kranti Ghag. Dimensionality reduction for sentiment analysis using pre-
processing techniques. In 2017 International Conference on Computing Method-
ologies and Communication (ICCMC), pages 16–21, 2017.
Lesly Miculicich Miculicich and Andrei Popescu-Belis. Using coreference links to
improve Spanish-to-English machine translation. In 2nd Workshop on Coreference
Resolution Beyond OntoNotes (CORBON 2017), pages 30–40, 2017.
Rada Mihalcea, Carmen Banea, and Janyce Wiebe. Learning multilingual subjective
language via cross-lingual projections. In 45th annual meeting of the association
of computational linguistics, pages 976–983, 2007.
Andrei Mikheev. Tagging sentence boundaries. In NAACL, 2000.
Andrei Mikheev. Periods, capitalized words, etc. Computational Linguistics, 28(3):
289–318, 2002.
Tomas Mikolov, Kai Chen, Greg S Corrado, and Jeffrey Dean. Efficient estimation
of word representations in vector space. In Workshop at ICLR, 2013a.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed
representations of words and phrases and their compositionality. In NIPS, pages
3111–3119, 2013b.
Kirill Milintsevich and Kairit Sirts. Enhancing sequence-to-sequence neural lemma-
tization with external resources. arXiv preprint arXiv:2101.12056, 2021.
George A Miller. WordNet: A lexical database for English. Communications of the
ACM, 38(11):39–41, 1995.
George A Miller, Claudia Leacock, Randee Tengi, and Ross T Bunker. A semantic
concordance. In HLT, pages 303–308, 1993.
Gary Miner, John Elder IV, Andrew Fast, Thomas Hill, Robert Nisbet, and Dursun
Delen. Practical Text Mining and Statistical Analysis for Non-structured Text Data
Applications. Academic Press, 2012.
Marvin Minsky. A framework for representing knowledge, 1974.
Marvin Minsky. The Emotion Machine: Commonsense Thinking, Artificial Intelli-
gence, and the Future of the Human Mind. Simon & Schuster, New York, 2006.
Shachar Mirkin, Ido Dagan, and Sebastian Padó. Assessing the role of discourse
references in entailment inference. In ACL, pages 1209–1219, 2010.
Walter Mischel. Toward an integrative science. Annual Review of Psychology, 55:
1–22, 2004.
Abhijit Mishra, Diptesh Kanojia, Seema Nagar, Kuntal Dey, and Pushpak Bhat-
tacharyya. Harnessing cognitive features for sarcasm detection. In ACL, pages
1095–1104, Berlin, Germany, 2016.
Rishabh Misra and Prahal Arora. Sarcasm detection using hybrid neural network.
arXiv preprint arXiv:1908.07414, 2019.
Tom Mitchell and E. Fredkin. Never-ending language learning. In 2014 IEEE
International Conference on Big Data (Big Data), pages 1–1, 2014.
Ruslan Mitkov. Anaphora Resolution. Routledge, 2014.
Ruslan Mitkov. The Oxford Handbook of Computational Linguistics. Oxford Uni-
versity Press, 2022.
Ruslan Mitkov, Richard Evans, Constantin Orăsan, Le An Ha, and Viktor Pekar.
Anaphora resolution: To what extent does it help NLP applications? In Anaphora:
Analysis, Algorithms and Applications: 6th Discourse Anaphora and Anaphor
Resolution Colloquium, pages 179–190, 2007.
Ankush Mittal, Pooja Bhatt, and Padam Kumar. Phonetic matching and syntactic tree
similarity based QA system for SMS queries. In 2014 International Conference
on Green Computing Communication and Electrical Engineering (ICGCCEE),
pages 1–6, 2014.
Makoto Miwa and Mohit Bansal. End-to-end relation extraction using lstms on
sequences and tree structures. In ACL, pages 1105–1116, 2016.
Makoto Miwa and Yutaka Sasaki. Modeling joint entity and relation extraction with
table representation. In EMNLP, pages 1858–1869, 2014.
Dunja Mladenic. Automatic word lemmatization. In 5th International Multi-
Conference Information Society, IS-2002 B, pages 153–159, 2002.
Aditya Mogadala and Vasudeva Varma. Language independent sentence-level sub-
jectivity analysis with feature selection. In 26th Pacific Asia Conference on
Language, Information, and Computation, pages 171–180, 2012.
Samaneh Moghaddam and Martin Ester. Opinion digger: An unsupervised opinion
miner from unstructured product reviews. In CIKM, pages 1825–1828, 2010.
Saif Mohammad, Ekaterina Shutova, and Peter Turney. Metaphor as a medium for
emotion: An empirical study. In Fifth Joint Conference on Lexical and Computa-
tional Semantics, pages 23–33, 2016.
Michael Mohler, Bryan Rink, David Bracewell, and Marc Tomlinson. A novel distri-
butional approach to multilingual conceptual metaphor recognition. In COLING,
pages 1752–1763, 2014.
Michael Mohler, Mary Brunson, Bryan Rink, and Marc Tomlinson. Introducing
the LCC metaphor datasets. In Tenth International Conference on Language
Resources and Evaluation (LREC’16), pages 4221–4227, 2016.
Antonio Molina and Ferran Pla. Shallow parsing using specialized HMMs. Journal
of Machine Learning Research, 2(Mar):595–613, 2002.
Natawut Monaikul, Giuseppe Castellucci, Simone Filice, and Oleg Rokhlenko. Con-
tinual learning for named entity recognition. In AAAI, pages 13570–13577, 2021.
Christine A. Montgomery. Concept extraction. American Journal of Computational
Linguistics, 8(2):70–73, 1982.
Andrés Montoyo, Patricio Martínez-Barco, and Alexandra Balahur. Subjectivity
and sentiment analysis: An overview of the current state of the area and envisaged
developments. Decision Support Systems, 53(4):675–679, 2012.
Nafise Sadat Moosavi and Michael Strube. Which coreference evaluation metric do
you trust? a proposal for a link-based entity aware metric. In ACL, pages 632–642,
2016.
Louis-Philippe Morency, Ariadna Quattoni, and Trevor Darrell. Latent-dynamic
discriminative models for continuous gesture recognition. In 2007 IEEE Con-
ference on Computer Vision and Pattern Recognition, pages 1–8, 2007. doi:
10.1109/CVPR.2007.383299.
Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. Towards multimodal
sentiment analysis: Harvesting opinions from the web. In 13th International
Conference on Multimodal Interfaces, pages 169–176, 2011.
Andrea Moro and Roberto Navigli. SemEval-2015 task 13: Multilingual all-words
sense disambiguation and entity linking. In 9th International Workshop on Se-
mantic Evaluation, pages 288–297, 2015.
Andrea Moro, Francesco Cecconi, and Roberto Navigli. Multilingual word sense
disambiguation and entity linking for everybody. In 2014 International Conference
on Posters & Demonstrations Track-Volume 1272, pages 25–28, 2014a.
Andrea Moro, Alessandro Raganato, and Roberto Navigli. Entity linking meets
word sense disambiguation: a unified approach. Transactions of the Association
for Computational Linguistics, 2:231–244, 2014b.
Guanyi Mou and Kyumin Lee. Malicious bot detection in online social networks:
Arming handcrafted features with deep learning. In International Conference on
Social Informatics, pages 220–236, 2020.
Mohamad Syahrul Mubarok, Adiwijaya, and Muhammad Dwi Aldhi. Aspect-based
sentiment analysis to review products using Naïve Bayes. In AIP Conference
Proceedings, volume 1867, page 020060, 2017.
Andrius Mudinas, Dell Zhang, and Mark Levene. Combining lexicon and learning
based approaches for concept-level sentiment analysis. In WISDOM, 2012.
Aldrian Obaja Muis and Wei Lu. Weak semi-Markov CRFs for noun phrase chunking
in informal text. In NAACL-HLT, pages 714–719, 2016.
Aldrian Obaja Muis and Wei Lu. Labeling gaps between words: Recognizing over-
lapping mentions with mention separators. In EMNLP, pages 2608–2618, 2017.
Arjun Mukherjee and Bing Liu. Aspect extraction through semi-supervised model-
ing. In ACL, pages 339–348, 2012.
Rajdeep Mukherjee, Tapas Nayak, Yash Butala, Sourangshu Bhattacharya, and
Pawan Goyal. PASTE: A tagging-free decoding framework using pointer networks
for aspect sentiment triplet extraction. arXiv preprint arXiv:2110.04794, 2021.
Thomas Müller, Helmut Schmid, and Hinrich Schütze. Efficient higher-order CRFs
for morphological tagging. In EMNLP, pages 322–332, 2013.
Thomas Müller, Ryan Cotterell, Alexander Fraser, and Hinrich Schütze. Joint lemma-
tization and morphological tagging with Lemming. In EMNLP, pages 2268–2274,
2015.
Gabriel Murray and Giuseppe Carenini. Predicting subjectivity in multimodal con-
versations. In EMNLP, pages 1348–1357, 2009.
Gabriel Murray and Giuseppe Carenini. Subjectivity detection in spoken and written
conversations. Natural Language Engineering, 17(3):397–418, 2011.
Sumiya Mushtaq and Neerendra Kumar. Text-based automatic personality recog-
nition: Recent developments. In Third International Conference on Computing,
Communications, and Cyber-Security, pages 537–549, 2023.
Judith Muzerelle, Anaïs Lefeuvre, Jean-Yves Antoine, Emmanuel Schang, Denis
Maurel, Jeanne Villaneau, and Iris Eshkol. ANCOR, the first large French speaking
corpus of conversational speech annotated in coreference to be freely available.
In TALN, pages 555–563, 2013.
Isabel Briggs Myers, Mary H. McCaulley, Naomi L. Quenk, and Allen L. Hammer.
MBTI Manual: A Guide to the Development and Use of the Myers-Briggs Type
Indicator. Consulting Psychologists Press, third edition, 1998.
Jerome L Myers and Arnold D Well. Research Design & Statistical Analysis.
Routledge, New York, 1995.
Tetsuji Nakagawa, Taku Kudo, and Yuji Matsumoto. Unknown word guessing and
part-of-speech tagging using support vector machines. In NLPRS, pages 325–331,
2001.
Hiromi Nakaiwa and Satoru Ikehara. Zero pronoun resolution in a machine translation
system by using Japanese to English verbal semantic attributes. In Third
Conference on Applied Natural Language Processing, pages 201–208, 1992.
Hiromi Nakaiwa and Satoshi Shirai. Anaphora resolution of Japanese zero pronouns
with deictic reference. In COLING, pages 812–817, 1996.
Hiromi Nakaiwa, Satoshi Shirai, Satoru Ikehara, and Tsukasa Kawaoka. Extrasen-
tential resolution of Japanese zero pronouns using semantic and pragmatic con-
straints. In AAAI 1995 Spring Symposium Series, pages 99–105, 1995.
Preslav Nakov, Sara Rosenthal, Zornitsa Kozareva, Veselin Stoyanov, Alan Ritter,
and Theresa Wilson. SemEval-2013 task 2: Sentiment analysis in Twitter. In
SemEval, pages 312–320, 2013.
Sri Nandhini and JI Sheeba. Cyberbullying detection and classification using infor-
mation retrieval algorithm. In ICARCSET, 2015.
Zara Nasar, Syed Waqar Jaffry, and Muhammad Kamran Malik. Named entity
recognition and relation extraction: State-of-the-art. ACM Computing Surveys, 54
(1):1–39, 2021.
Arman Khadjeh Nassirtoussi, Saeed Aghabozorgi, Teh Ying Wah, and David
Chek Ling Ngo. Text mining of news-headlines for FOREX market prediction: A
multi-layer dimension reduction algorithm with semantics and sentiment. Expert
Systems with Applications, 42(1):306–324, 2015.
Vivi Nastase, Michael Strube, Benjamin Boerschinger, Caecilia Zirn, and Anas
Elghafari. WikiNet: A very large scale multi-lingual concept network. In LREC,
pages 1015–1022, 2010.
Jinjie Ni, Tom Young, Vlad Pandelea, Fuzhao Xue, and Erik Cambria. Recent
advances in deep learning based dialogue systems: A systematic survey. Artificial
Intelligence Review, pages 3055–3155, 2022.
Cristina Nicolae and Gabriel Nicolae. BESTCUT: A graph algorithm for coreference
resolution. In EMNLP, pages 275–283, 2006.
Garrett Nicolai and Grzegorz Kondrak. Leveraging inflection tables for stemming
and lemmatization. In ACL, pages 1138–1147, 2016.
Nicolas Nicolov, Franco Salvetti, and Steliana Ivanova. Sentiment analysis: Does
coreference matter? In AISB 2008 Convention Communication, Interaction and
Social Intelligence, volume 1, page 37, 2008.
Jan Niehues and Eunah Cho. Exploiting linguistic resources for neural machine
translation using multi-task learning. arXiv preprint arXiv:1708.00993, 2017.
Finn Årup Nielsen. A new ANEW: Evaluation of a word list for sentiment analysis in
microblogs. In Workshop on ‘Making Sense of Microposts: Big Things Come in
Small Packages’, pages 93–98, 2011.
Huansheng Ning, Sahraoui Dhelim, and Nyothiri Aung. PersoNet: Friend recom-
mendation system based on big-five personality traits and hybrid filtering. IEEE
Transactions on Computational Social Systems, 6(3):394–402, 2019.
Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian
Riedel, and Deniz Yuret. The CoNLL 2007 shared task on dependency parsing.
In EMNLP-CoNLL, pages 915–932, Prague, Czech Republic, June 2007.
Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Ha-
jic, Christopher D Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Na-
talia Silveira, et al. Universal dependencies v1: A multilingual treebank collec-
tion. In Tenth International Conference on Language Resources and Evaluation
(LREC’16), pages 1659–1666, 2016.
Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer Science &
Business Media, 2006.
Warren T Norman. Toward an adequate taxonomy of personality attributes: Repli-
cated factor structure in peer nomination personality ratings. The Journal of
Abnormal and Social Psychology, 66(6):574, 1963.
Warren T. Norman. “To see ourselves as others see us!”: Relations among self-
perceptions, peer-perceptions, and expected peer-perceptions of personality. Mul-
tivariate Behavioral Research, 4(4):417–443, 1969.
Natalya F Noy, Deborah L McGuinness, et al. Ontology development 101: A guide
to creating your first ontology, 2001.
Gertrude E Noyes. The first English dictionary, Cawdrey’s table alphabeticall.
Modern Language Notes, 58(8):600–605, 1943.
Maxwell Nye, Armando Solar-Lezama, Josh Tenenbaum, and Brenden M Lake.
Learning compositional rules via neural program synthesis. NeurIPS, 33:10832–
10842, 2020.
Jon Oberlander and Scott Nowson. Whose thumb is it anyway? Classifying author
personality from weblog text. In COLING-ACL, pages 627–634, 2006.
Haiyun Peng, Yukun Ma, Yang Li, and Erik Cambria. Learning multi-grained aspect
target sequence for Chinese sentiment analysis. Knowledge-Based Systems, 148:
167–176, 2018.
Haiyun Peng, Lu Xu, Lidong Bing, Fei Huang, Wei Lu, and Luo Si. Knowing what,
how and why: A near complete solution for aspect-based sentiment analysis. In
AAAI, pages 8600–8607, 2020.
Yifan Peng, Shankai Yan, and Zhiyong Lu. Transfer learning in biomedical natural
language processing: An evaluation of BERT and ELMo on ten benchmarking
datasets. BioNLP 2019, page 58, 2019.
James W Pennebaker. The Secret Life of Pronouns: What Our Words Say About Us.
Bloomsbury Publishing USA, 2013.
James W Pennebaker and Laura A King. Linguistic styles: Language use as an
individual difference. Journal of Personality and Social Psychology, 77(6):1296,
1999.
James W Pennebaker, Martha E Francis, and Roger J Booth. Linguistic inquiry and
word count: LIWC 2001. Mahwah: Lawrence Erlbaum Associates, 71(2001):2001,
2001.
James W Pennebaker, Matthias R Mehl, and Kate G Niederhoffer. Psychologi-
cal aspects of natural language use: Our words, our selves. Annual Review of
Psychology, 54(1):547–577, 2003.
James W Pennebaker, Ryan L Boyd, Kayla Jordan, and Kate Blackburn. The de-
velopment and psychometric properties of LIWC2015. Technical report, The
University of Texas at Austin, 2015.
Deana Pennell and Yang Liu. Normalization of text messages for text-to-speech. In
2010 IEEE International Conference on Acoustics, Speech and Signal Processing,
pages 4842–4845, 2010.
Deana Pennell and Yang Liu. A character-level machine translation approach for
normalization of SMS abbreviations. In IJCNLP, pages 974–982, 2011.
Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global
vectors for word representation. In EMNLP, pages 1532–1543, 2014.
Iryna Pentina, Tyler Hancock, and Tianling Xie. Exploring relationship development
with social chatbots: A mixed-method study of replika. Computers in Human
Behavior, 140:107600, 2023.
S Pesina and T Solonchak. Semantic primitives and conceptual focus. Procedia-
Social and Behavioral Sciences, 192:339–345, 2015.
Robert Peters and Norbert Nagel. Das digitale Referenzkorpus
Mittelniederdeutsch/Niederrheinisch (ReN). Jahrbuch für Germanistische
Sprachgeschichte, 5(1):165–175, 2014.
Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin,
Yuxiang Wu, and Alexander Miller. Language models as knowledge bases? In
EMNLP-IJCNLP, pages 2463–2473, 2019.
Fabio Petroni, Patrick S. H. Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang
Wu, Alexander H. Miller, and Sebastian Riedel. How context affects language
models’ factual predictions. In Conference on Automated Knowledge Base Con-
struction, AKBC, 2020.
Slav Petrov, Dipanjan Das, and Ryan McDonald. A universal part-of-speech tagset.
arXiv preprint arXiv:1104.2086, 2011.
Saša Petrović, Miles Osborne, and Victor Lavrenko. The Edinburgh Twitter corpus.
In NAACL HLT 2010 Workshop on Computational Linguistics in a World of Social
Media, pages 25–26, 2010.
Miriam RL Petruck. Frame semantics. Handbook of Pragmatics, 2, 1996.
Lawrence Philips. Hanging on the metaphone. Computer Language, 7(12):39–43,
1990.
Lawrence Philips. The double metaphone search algorithm. C/C++ Users Journal,
18(6):38–43, 2000.
Jean Piaget, Margaret Cook, et al. The Origins of Intelligence in Children, volume 8.
International Universities Press, New York, 1952.
Emanuele Pianta, Luisa Bentivogli, and Christian Girardi. MultiWordNet: develop-
ing an aligned multilingual database. In First International Conference on Global
WordNet, pages 293–302, 2002.
Martin J Pickering and Simon Garrod. An integrated theory of language production
and comprehension. Behavioral and Brain Sciences, 36(4):329–347, 2013.
David J Pittenger. The utility of the Myers-Briggs Type Indicator. Review of
Educational Research, 63(4):467–488, 1993.
Barbara Plank and Dirk Hovy. Personality traits on Twitter—or—how to get 1,500
personality tests in a week. In 6th Workshop on Computational Approaches to
Subjectivity, Sentiment and Social Media Analysis, pages 92–98, 2015.
John C Platt. Fast training of support vector machines using sequential minimal
optimization. In Advances in Kernel Methods: Support Vector Learning, pages
185–208, 1999.
Joël Plisson, Nada Lavrac, Dunja Mladenic, et al. A rule based approach to word
lemmatization. In IS, volume 3, pages 83–86, 2004.
Massimo Poesio. Annotating a corpus to develop and evaluate discourse entity
realization algorithms: Issues and preliminary results. In Second International
Conference on Language Resources and Evaluation, pages 1–8, 2000.
Massimo Poesio. Associative descriptions and salience: A preliminary investigation.
In 2003 EACL Workshop on The Computational Treatment of Anaphora, pages
31–38, 2003.
Massimo Poesio. The MATE/GNOME proposals for anaphoric annotation, revisited.
In HLT-NAACL, pages 154–162, 2004.
Massimo Poesio and M Alexandrov-Kabadjov. A general-purpose, off-the-shelf
anaphoric resolver. In LREC, pages 653–656, 2004.
Massimo Poesio and Ron Artstein. Anaphoric annotation in the ARRAU corpus.
In Sixth International Conference on Language Resources and Evaluation, pages
1–5, 2008.
Massimo Poesio, Florence Bruneseaux, and Laurent Romary. The MATE meta-
scheme for coreference in dialogues in multiple languages. In ACL’99 Workshop
Towards Standards and Tools for Discourse Tagging, pages 65–74, 1999a.
Massimo Poesio, Renate Henschel, Janet Hitzeman, Rodger Kibble, Shane Mon-
tague, and Kees van Deemter. Towards an annotation scheme for noun phrase
generation. Technical report, EACL, 1999b.
Massimo Poesio, Juntao Yu, Silviu Paun, Abdulrahman Aloraini, Pengcheng Lu,
Janosch Haber, and Derya Cokal. Computational models of anaphora. Annual
Review of Linguistics, 9:561–587, 2023.
Livia Polanyi and Annie Zaenen. Contextual valence shifters. In Computing Attitude
and Affect in Text: Theory and Applications, pages 1–10. Springer, 2006.
Marco Polignano, Pierpaolo Basile, Marco De Gemmis, Giovanni Semeraro, Valerio
Basile, et al. Alberto: Italian BERT language understanding model for NLP
challenging tasks based on tweets. In CEUR Workshop Proceedings, volume
2481, pages 1–6, 2019.
Marco Polignano, Valerio Basile, Pierpaolo Basile, Giuliano Gabrieli, Marco Vas-
sallo, and Cristina Bosco. A hybrid lexicon-based and neural approach for explain-
able polarity detection. Information Processing & Management, 59(5):103058,
2022.
Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion An-
droutsopoulos, and Suresh Manandhar. SemEval-2014 task 4: Aspect based sen-
timent analysis. In 8th International Workshop on Semantic Evaluation (SemEval
2014), pages 27–35, 2014.
Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Suresh Manandhar, and Ion
Androutsopoulos. SemEval-2015 task 12: Aspect based sentiment analysis. In 9th
International Workshop on Semantic Evaluation (SemEval 2015), pages 486–495,
2015.
Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh
Manandhar, Mohammed AL-Smadi, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing
Qin, Orphée De Clercq, et al. SemEval-2016 task 5: Aspect based sentiment
analysis. In 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 19–30,
2016.
Simone Paolo Ponzetto and Michael Strube. Exploiting semantic role labeling, Word-
Net and Wikipedia for coreference resolution. In Human Language Technology
Conference of the NAACL, Main Conference, pages 192–199, 2006.
Alexander Popov. Word sense disambiguation with recurrent neural networks. In
Student Research Workshop associated with RANLP 2017, pages 25–34, 2017.
Soujanya Poria, Alexandar Gelbukh, Basant Agarwal, Erik Cambria, and Newton
Howard. Common sense knowledge based personality recognition from text.
In Mexican International Conference on Artificial Intelligence, pages 484–496,
2013.
Marten Postma, Emiel Van Miltenburg, Roxane Segers, Anneleen Schoen, and Piek
Vossen. Open Dutch WordNet. In 8th Global WordNet Conference (GWC), pages
302–310, 2016.
Marco Pota, Mirko Ventura, Rosario Catelli, and Massimo Esposito. An effective
BERT-based pipeline for Twitter sentiment analysis: A case study in Italian.
Sensors, 21(1):133, 2020.
Peng Qi, Timothy Dozat, Yuhao Zhang, and Christopher D. Manning. Universal
Dependency parsing from scratch. In CoNLL 2018 Shared Task: Multilingual
Parsing from Raw Text to Universal Dependencies, pages 160–170, Brussels,
Belgium, October 2018. doi: 10.18653/v1/K18-2016.
Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D Manning.
Stanza: A Python natural language processing toolkit for many human languages.
arXiv preprint arXiv:2003.07082, 2020.
Lin Qiu, Han Lin, Jonathan Ramsay, and Fang Yang. You are what you tweet: Personality
expression and perception on Twitter. Journal of Research in Personality,
46(6):710–718, 2012.
Stephan Raaijmakers, Khiet P Truong, and Theresa Wilson. Multimodal subjectivity
analysis of multiparty conversation. In EMNLP, pages 466–474, 2008.
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving
language understanding by generative pre-training. Technical report, OpenAI,
2018.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever,
et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8),
2019.
Colin Raffel, Minh-Thang Luong, Peter J Liu, Ron J Weiss, and Douglas Eck.
Online and linear-time attention by enforcing monotonic alignments. In ICML,
pages 2837–2846, 2017.
Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli. Neural sequence
learning models for word sense disambiguation. In EMNLP, pages 1156–1167,
2017a.
Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. Word sense
disambiguation: A unified evaluation framework and empirical comparison. In
EACL, pages 99–110, 2017b.
Alessandro Raganato, Yves Scherrer, and Jörg Tiedemann. The MuCoW test suite
at WMT 2019: Automatically harvested multilingual contrastive word sense dis-
ambiguation test sets for machine translation. In Fourth Conference on Machine
Translation (Volume 2: Shared Task Papers, Day 1), pages 470–480, 2019.
Karthik Raghunathan, Heeyoung Lee, Sudarshan Rangarajan, Nathanael Chambers,
Mihai Surdeanu, Dan Jurafsky, and Christopher Manning. A multi-pass sieve for
coreference resolution. In EMNLP, pages 492–501, 2010.
Edoardo Ragusa, Erik Cambria, Rodolfo Zunino, and Paolo Gastaldo. A survey on
deep learning in image polarity detection: Balancing generalization performances
and computational costs. Electronics, 8(7):783, 2019.
Afshin Rahimi, Yuan Li, and Trevor Cohn. Massively multilingual transfer for NER.
In ACL, pages 151–164, 2019.
Altaf Rahman and Vincent Ng. Narrowing the modeling gap: A cluster-ranking
approach to coreference resolution. Journal of Artificial Intelligence Research,
40:469–521, 2011.
Altaf Rahman and Vincent Ng. Resolving complex cases of definite pronouns: The
Winograd schema challenge. In EMNLP-CoNLL, pages 777–789, 2012.
Mohammad Wali Ur Rahman, Sicong Shao, Pratik Satam, Salim Hariri, Chris
Padilla, Zoe Taylor, and Carlos Nevarez. A BERT-based deep learning approach
for reputation analysis in social media. In 2022 IEEE/ACS 19th International
Conference on Computer Systems and Applications (AICCSA), pages 1–8, 2022.
Sunny Rai and Shampa Chakraverty. Metaphor detection using fuzzy rough sets. In
International Joint Conference on Rough Sets, pages 271–279, 2017.
Sunny Rai, Shampa Chakraverty, Devendra K Tayal, Divyanshu Sharma, and Ayush
Garg. Understanding metaphors using emotions. New Generation Computing, 37
(1):5–27, 2019.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD:
100,000+ questions for machine comprehension of text. In EMNLP, pages 2383–
2392, 2016.
Majid Ramezani, Mohammad-Reza Feizi-Derakhshi, and Mohammad-Ali Balafar.
Knowledge graph-enabled text-based automatic personality prediction. Compu-
tational Intelligence and Neuroscience, 2022, 2022.
Lance A Ramshaw and Mitchell P Marcus. Text chunking using transformation-
based learning. In Natural Language Processing Using Very Large Corpora, pages
157–176. Springer, 1999.
Toqir A Rana, Yu-N Cheah, and Tauseef Rana. Multi-level knowledge-based ap-
proach for implicit aspect identification. Applied Intelligence, 50:4616–4630,
2020.
Francisco Rangel, Fabio Celli, Paolo Rosso, Potthast Martin, Benno Stein, Walter
Daelemans, et al. Overview of the 3rd author profiling task at PAN 2015. In
Working Notes of CLEF 2015 - Conference and Labs of the Evaluation Forum,
2015.
Rasika Ransing and Archana Gulati. A survey of different approaches for word sense
disambiguation. In ICT Analysis and Applications, pages 435–445. Springer, 2022.
Delip Rao and Deepak Ravichandran. Semi-supervised polarity lexicon induction.
In EACL, pages 675–682, 2009.
David Rapaport, Merton Gill, and Roy Schafer. Diagnostic psychological testing:
The theory, statistical evaluation, and diagnostic application of a battery of tests:
Volume II. The Year Book, 1946.
Pushpendre Rastogi, Ryan Cotterell, and Jason Eisner. Weighting finite-state trans-
ductions with neural context. In NAACL-HLT, pages 623–633, 2016.
Lev Ratinov and Dan Roth. Design challenges and misconceptions in named entity
recognition. In CoNLL, pages 147–155, 2009.
Adwait Ratnaparkhi. A maximum entropy model for part-of-speech tagging. In
EMNLP, 1996.
Biswarup Ray, Avishek Garain, and Ram Sarkar. An ensemble-based hotel rec-
ommender system using sentiment analysis and aspect categorization of hotel
reviews. Applied Soft Computing, 98:106935, 2021.
Jonathon Read, Rebecca Dridan, Stephan Oepen, and Lars Jørgen Solberg. Sentence
boundary detection: A long solved problem? In COLING 2012: Posters, pages
985–994, 2012.
Marta Recasens and Eduard Hovy. BLANC: Implementing the Rand index for
coreference evaluation. Natural Language Engineering, 17(4):485–510, 2011.
Marta Recasens, Cristian Danescu-Niculescu-Mizil, and Dan Jurafsky. Linguistic
models for analyzing and detecting biased language. In ACL, pages 1650–1659,
2013.
Marek Rei. Semi-supervised multitask learning for sequence labeling. arXiv preprint
arXiv:1704.07156, 2017.
Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using
siamese BERT-networks. In EMNLP-IJCNLP, pages 3982–3992, 2019.
Tanya Reinhart. Coreference and bound anaphora: A restatement of the anaphora
questions. Linguistics and Philosophy, pages 47–88, 1983.
Nils Reiter. CorefAnnotator - a new annotation tool for entity references. In Abstracts
of EADH: Data in the Digital Humanities, pages 1–4, 2018.
Ahmed Remaida, Benyoussef Abdellaoui, Aniss Moumen, and Younes El Bouzekri
El Idrissi. Personality traits analysis using artificial neural networks: A literature
survey. In 2020 1st International Conference on Innovative Research in Applied
Science, Engineering and Technology (IRASET), pages 1–6, 2020.
Robert Remus. Improving sentence-level subjectivity classification through read-
ability measurement. In 18th Nordic Conference of Computational Linguistics,
pages 168–174, 2011.
Feiliang Ren, Longhui Zhang, Shujuan Yin, Xiaofeng Zhao, Shilei Liu, Bochao Li,
and Yaduo Liu. A novel global feature-oriented relational triple extraction model
based on table filling. In EMNLP, pages 2646–2656, 2021a.
Lu Ren, Bo Xu, Hongfei Lin, Xikai Liu, and Liang Yang. Sarcasm detection with
sentiment semantics enhanced multi-level memory network. Neurocomputing,
401:320–326, 2020.
Zhancheng Ren, Qiang Shen, Xiaolei Diao, and Hao Xu. A sentiment-aware deep
learning approach for personality detection from text. Information Processing &
Management, 58(3):102532, 2021b.
Philip Resnik and David Yarowsky. Distinguishing systems and distinguishing
senses: New evaluation methods for word sense disambiguation. Natural Lan-
guage Engineering, 5(2):113–133, 1999.
Jeffrey C Reynar and Adwait Ratnaparkhi. A maximum entropy approach to identi-
fying sentence boundaries. arXiv preprint cmp-lg/9704002, 1997.
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “Why should I trust you?”
Explaining the predictions of any classifier. In 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
Michael Riley. Some applications of tree-based modelling to speech and language.
In Proceedings of Speech and Natural Language, 1989.
Ellen Riloff. Automatically generating extraction patterns from untagged text. In
AAAI, pages 1044–1049, 1996.
Ellen Riloff and Janyce Wiebe. Learning extraction patterns for subjective expres-
sions. In EMNLP, pages 105–112, 2003.
Ellen Riloff, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan Gilbert, and
Ruihong Huang. Sarcasm as contrast between a positive sentiment and negative
situation. In EMNLP, pages 704–714, 2013.
Kyeongmin Rim. MAE2: Portable annotation tool for general natural language use.
In Proc 12th Joint ACL-ISO Workshop on Interoperable Semantic Annotation,
pages 75–80, 2016.
Nicky Ringland, Xiang Dai, Ben Hachey, Sarvnaz Karimi, Cecile Paris, and James R.
Curran. NNE: A dataset for nested named entity recognition in english newswire.
In ACL, pages 5176–5181, 2019a.
Annette Rios Gonzales, Laura Mascarell, and Rico Sennrich. Improving word sense
disambiguation in neural machine translation with sense embeddings. In Second
Conference on Machine Translation, pages 11–19, 2017.
David Ritchie. Metaphors in conversational context: Toward a connectivity theory
of metaphor interpretation. Metaphor and Symbol, 19(4):265–287, 2004.
Alan Ritter, Sam Clark, Oren Etzioni, et al. Named entity recognition in tweets: an
experimental study. In EMNLP, pages 1524–1534, 2011.
Leah Roberts. Syntactic processing, pages 227–247. Cambridge Handbooks in
Language and Linguistics. Cambridge University Press, 2016. doi: 10.1017/
CBO9781107425965.011.
Kevin Dela Rosa and Jeffrey Ellen. Text classification methodologies applied to
micro-text in military chat. In ICMLA, pages 710–714, 2009. doi: 10.1109/ICMLA.
2009.49.
Rudolf Rosa and Zdeněk Žabokrtský. Unsupervised lemmatization as embeddings-
based word clustering. arXiv preprint arXiv:1908.08528, 2019.
Eleanor Rosch and Carolyn B Mervis. Family resemblances: Studies in the internal
structure of categories. Cognitive Psychology, 7(4):573–605, 1975.
Eleanor Rosch, Carolyn B Mervis, Wayne D Gray, David M Johnson, and Penny
Boyes-Braem. Basic objects in natural categories. Cognitive Psychology, 8(3):
382–439, 1976.
Eleanor H Rosch. Natural categories. Cognitive Psychology, 4(3):328–350, 1973.
Zachary Rosen. Computationally constructed concepts: A machine learning ap-
proach to metaphor interpretation using usage-based construction grammatical
cues. In Workshop on Figurative Language Processing, pages 102–109, 2018.
Frank Rosenblatt. The perceptron: A probabilistic model for information storage and
organization in the brain. Psychological Review, 65(6):386, 1958.
Sascha Rothe and Hinrich Schütze. Autoextend: Extending word embeddings to
embeddings for synsets and lexemes. In ACL-IJCNLP, pages 1793–1803, 2015.
Kathrin Rothermich, Ayotola Ogunlana, and Natalia Jaworska. Change in humor
and sarcasm use based on anxiety and depression symptom severity during the
COVID-19 pandemic. Journal of Psychiatric Research, 140:95–100, 2021.
Herbert Rubenstein and John B. Goodenough. Contextual correlates of synonymy.
Communications of the ACM, 8(10):627–633, 1965.
Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme.
Gender bias in coreference resolution. In NAACL-HLT, pages 8–14, 2018.
Dwijen Rudrapal, Anupam Jamatia, Kunal Chakma, Amitava Das, and Björn Gam-
bäck. Sentence boundary detection for social media text. In 12th International
Conference on Natural Language Processing, pages 254–260, Trivandrum, India,
December 2015. NLP Association of India.
Hotze Rullmann. Two types of negative polarity items. In North East Linguistics
Society, 1996.
David E Rumelhart and Andrew Ortony. The representation of knowledge in memory.
In Schooling and the Acquisition of Knowledge, pages 99–135, 1977.
Josef Ruppenhofer, Michael Ellsworth, Myriam Schwarzer-Petruck, Christopher R
Johnson, and Jan Scheffczyk. FrameNet II: Extended theory and practice. Tech-
nical report, International Computer Science Institute, 2016.
Alexander M Rush, Roi Reichart, Michael Collins, and Amir Globerson. Improved
parsing and POS tagging using inter-sentence consistency constraints. In EMNLP-
CoNLL, pages 1434–1444, 2012.
Samir Rustamov. A hybrid system for subjectivity analysis. Advances in Fuzzy
Systems, 2018, 2018.
Samir Rustamov, Elshan Mustafayev, and Mark A Clements. Sentence-level subjec-
tivity detection using neuro-fuzzy models. In 4th Workshop on Computational
Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 108–114,
2013.
Tatyana Ruzsics and Tanja Samardzic. Neural sequence-to-sequence learning of
internal word structure. In CoNLL, pages 184–194, 2017.
Nipun Sadvilkar and Mark Neumann. PySBD: Pragmatic sentence boundary dis-
ambiguation. arXiv preprint arXiv:2010.09657, 2020.
Horacio Saggion and Adam Funk. Interpreting SentiWordNet for opinion classification.
In Seventh Conference on International Language Resources and Evaluation,
pages 1129–1133, 2010.
Santwana Sagnika, Bhabani Shankar Prasad Mishra, and Saroj K Meher. Improved
method of word embedding for efficient analysis of human sentiments. Multimedia
Tools and Applications, 79(43):32389–32413, 2020.
Santwana Sagnika, Bhabani Shankar Prasad Mishra, and Saroj K Meher. An
attention-based CNN-LSTM model for subjectivity detection in opinion-mining.
Neural Computing and Applications, 33(24):17425–17438, 2021.
Hassan Saif, Yulan He, Miriam Fernandez, and Harith Alani. Contextual semantics
for sentiment analysis of Twitter. Information Processing & Management, 52(1):
5–19, 2016.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Wino-
grande: An adversarial winograd schema challenge at scale. In AAAI, pages
8732–8740, 2020.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Wino-
grande: An adversarial winograd schema challenge at scale. Communications of
the ACM, 64(9):99–106, 2021.
David Salgado, Martin Krallinger, Marc Depaule, Elodie Drula, Ashish V Tendulkar,
Florian Leitner, Alfonso Valencia, and Christophe Marcelle. MyMiner: a web
application for computer-assisted biocuration and text annotation. Bioinformatics,
28(17):2285–2287, 2012.
Said A Salloum, Rehan Khan, and Khaled Shaalan. A survey of semantic analysis
approaches. In International Conference on Artificial Intelligence and Computer
Vision, pages 61–70, 2020.
Merrilee H Salmon. Introduction to Logic and Critical Thinking. Matthew J. Van
Cleave, 1989.
Steven L. Salzberg. C4.5: Programs for Machine Learning by J. Ross Quinlan.
Morgan Kaufmann Publishers, Inc., 1993. Machine Learning, 16(3):235–240, 1994.
Fahime Same, Guanyi Chen, and Kees Van Deemter. Non-neural models matter: a
re-evaluation of neural referring expression generation systems. In ACL, pages
5554–5567, Dublin, Ireland, May 2022. doi: 10.18653/v1/2022.acl-long.380.
George Sanchez. Sentence boundary detection in legal text. In Natural Legal
Language Processing Workshop 2019, pages 31–38, 2019.
Mark Sanderson. Word sense disambiguation and information retrieval. In SIGIR,
pages 142–151, 1994.
Erik F Sang and Sabine Buchholz. Introduction to the CoNLL-2000 shared task:
Chunking. arXiv preprint cs/0009008, 2000.
Erik F Sang and Jorn Veenstra. Representing text chunks. arXiv preprint cs/9907006,
1999.
Erik Tjong Kim Sang. Text chunking by system combination. In Fourth Confer-
ence on Computational Natural Language Learning and the Second Learning
Language in Logic Workshop, 2000.
Erik Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003
shared task: Language-independent named entity recognition. In HLT-NAACL,
pages 142–147, 2003.
Cicero Nogueira dos Santos and Victor Guimaraes. Boosting named entity recogni-
tion with neural character embeddings. arXiv preprint arXiv:1505.05008, 2015.
Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie,
Hannah Rashkin, Brendan Roof, Noah A Smith, and Yejin Choi. ATOMIC: An
atlas of machine commonsense for if-then reasoning. In AAAI, volume 33, pages
3027–3035, 2019a.
Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social
IQa: Commonsense reasoning about social interactions. In EMNLP-IJCNLP, pages
4462–4472, 2019b.
Sunita Sarawagi and William W Cohen. Semi-Markov conditional random fields for
information extraction. NeurIPS, 17:1185–1192, 2004.
Ryohei Sasano and Sadao Kurohashi. A discriminative approach to Japanese zero
anaphora resolution with large-scale lexicalized case frames. In IJCNLP, pages
758–766, 2011.
Ryohei Sasano, Daisuke Kawahara, and Sadao Kurohashi. A fully-lexicalized proba-
bilistic model for Japanese zero anaphora resolution. In COLING, pages 769–776,
2008.
Bianca Scarlini, Tommaso Pasini, and Roberto Navigli. With more contexts comes
better performance: Contextualized sense embeddings for all-round word sense
disambiguation. In EMNLP, pages 3528–3539, 2020a.
Bianca Scarlini, Tommaso Pasini, and Roberto Navigli. SensEmBERT: Context-
enhanced sense embeddings for multilingual word sense disambiguation. In AAAI,
pages 8758–8765, 2020b.
Roger C Schank. Conceptual dependency: A theory of natural language understand-
ing. Cognitive psychology, 3(4):552–631, 1972.
Robert E Schapire and Yoram Singer. Improved boosting algorithms using
confidence-rated predictions. Machine learning, 37(3):297–336, 1999.
Robert E Schapire and Yoram Singer. BoosTexter: A boosting-based system for text
categorization. Machine Learning, 39(2):135–168, 2000.
Timo Schick and Hinrich Schütze. It’s not just size that matters: Small language
models are also few-shot learners. In NAACL-HLT, pages 2339–2352, 2021.
Helmut Schmid. Unsupervised learning of period disambiguation for tokenisation.
Internal Report, IMS-CL, 2000.
Marine Schmitt and Matthieu Constant. Neural lemmatization of multiword ex-
pressions. In Joint Workshop on Multiword Expressions and WordNet (MWE-WN
2019), pages 142–148, 2019.
Martin Schmitt, Simon Steinheber, Konrad Schreiber, and Benjamin Roth. Joint
aspect and polarity classification for aspect-based sentiment analysis with end-to-
end neural networks. arXiv preprint arXiv:1808.09238, 2018.
Kaitlyn B Schodt, Selena I Quiroz, Brittany Wheeler, Deborah L Hall, and Yasin N
Silva. Cyberbullying and mental health in adults: The moderating role of social
media use and gender. Frontiers in Psychiatry, page 954, 2021.
Karin Kipper Schuler. VerbNet: A broad-coverage, comprehensive verb lexicon.
University of Pennsylvania, Philadelphia, PA, United States, 2005.
H Andrew Schwartz, Johannes C Eichstaedt, Margaret L Kern, Lukasz Dziurzynski,
Stephanie M Ramones, Megha Agrawal, Achal Shah, Michal Kosinski, David
Stillwell, Martin EP Seligman, et al. Personality, gender, and age in the language
of social media: The open-vocabulary approach. PloS One, 8(9):e73791, 2013.
Federico Scozzafava, Marco Maru, Fabrizio Brignone, Giovanni Torrisi, and Roberto
Navigli. Personalized pagerank with syntagmatic information for multilingual
word sense disambiguation. In ACL, pages 37–46, 2020.
Djamé Seddah, Reut Tsarfaty, Sandra Kübler, Marie Candito, Jinho Choi, Richárd
Farkas, Jennifer Foster, Iakes Goenaga, Koldo Gojenola, Yoav Goldberg, et al.
Overview of the SPMRL 2013 shared task: cross-framework evaluation of parsing
morphologically rich languages. In Fourth Workshop on Statistical Parsing of
Morphologically-Rich Languages, 2013.
Norbert M Seel. Encyclopedia of the Sciences of Learning. Springer Science &
Business Media, 2011.
Kazuhiro Seki, Atsushi Fujii, and Tetsuya Ishikawa. A probabilistic model for
Japanese zero pronoun resolution integrating syntactic and semantic features. In
NLPRS, pages 403–410, 2001.
Kazuhiro Seki, Atsushi Fujii, and Tetsuya Ishikawa. A probabilistic method for
analyzing Japanese anaphora integrating zero pronoun detection and resolution.
In COLING, pages 1–7, 2002.
Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Ju-
lian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli
Barone, Jozef Mokry, et al. Nematus: a toolkit for neural machine translation.
arXiv preprint arXiv:1703.04357, 2017.
Jesus Serrano-Guerrero, Jose A Olivas, Francisco P Romero, and Enrique Herrera-
Viedma. Sentiment analysis: A review and comparative analysis of web services.
Information Sciences, 311:18–38, 2015.
Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In
NAACL-HLT, pages 213–220, 2003.
Sarah Shafqat, Hammad Majeed, Qaisar Javaid, and Hafiz Farooq Ahmad. Standard
NER tagging scheme for big data healthcare analytics built on unified medical
corpora. Journal of Artificial Intelligence and Technology, 2(4):152–157, 2022.
Yan Shao, Christian Hardmeier, Jörg Tiedemann, and Joakim Nivre. Character-based
joint segmentation and POS tagging for Chinese using bidirectional RNN-CRF. arXiv
preprint arXiv:1704.01314, 2017.
Guangyao Shen, Jia Jia, Liqiang Nie, Fuli Feng, Cunjun Zhang, Tianrui Hu, Tat-
Seng Chua, and Wenwu Zhu. Depression detection via harvesting social media:
A multimodal dictionary learning solution. In IJCAI, pages 3838–3844, 2017.
Libin Shen, Giorgio Satta, and Aravind Joshi. Guided learning for bidirectional se-
quence classification. In 45th Annual Meeting of the Association of Computational
Linguistics, pages 760–767, Prague, Czech Republic, June 2007.
Xiangqing Shen, Siwei Wu, and Rui Xia. Dense-atomic: Towards densely-connected
ATOMIC with high knowledge coverage and massive multi-hop paths. In ACL,
pages 13292–13305, 2023a.
Yiqiu Shen, Laura Heacock, Jonathan Elias, Keith D Hentel, Beatriu Reig, George
Shih, and Linda Moy. ChatGPT & other large language models are double-edged
swords. Radiology, 307(2):e230163, 2023b.
Jonathan Richard Shewchuk. An introduction to the conjugate gradient method
without the agonizing pain. Technical report, Carnegie Mellon University, 1994.
Yong Shi, Luyao Zhu, Wei Li, Kun Guo, and Yuanchun Zheng. Survey on classic
and latest textual sentiment analysis articles and techniques. International Journal
of Information Technology & Decision Making, 18(04):1243–1287, 2019.
Takashi Shibuya and Eduard Hovy. Nested named entity recognition via second-best
sequence learning and decoding. Transactions of the Association for Computa-
tional Linguistics, 8:605–620, 2020.
Mayank Shrivastava and Shishir Kumar. A pragmatic and intelligent model for
sarcasm detection in social media text. Technology in Society, 64:101489, 2021.
Ekaterina Shutova. Automatic metaphor interpretation as a paraphrasing task. In
NAACL-HLT, pages 1029–1037, 2010.
Ekaterina Shutova. Design and evaluation of metaphor processing systems. Com-
putational Linguistics, 41(4):579–623, 2015.
Ekaterina Shutova and Simone Teufel. Metaphor corpus annotated for source-target
domain mappings. In Seventh International Conference on Language Resources
and Evaluation (LREC’10), pages 3255–3261, 2010.
Ekaterina Shutova, Douwe Kiela, and Jean Maillard. Black holes and white rabbits:
Metaphor identification with visual features. In NAACL-HLT, pages 160–170,
2016.
Vered Shwartz, Peter West, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi.
Unsupervised commonsense question answering with self-talk. In EMNLP, pages
4615–4629, 2020.
Candace Lee Sidner. Towards a computational theory of definite anaphora
comprehension in English discourse. Technical report, Massachusetts Institute
of Technology, Artificial Intelligence Laboratory, 1979.
Natalia Silveira, Timothy Dozat, Marie-Catherine De Marneffe, Samuel R Bowman,
Miriam Connor, John Bauer, and Christopher D Manning. A gold standard
dependency corpus for English. In LREC, pages 2897–2904, 2014.
Stefano Silvestri, Francesco Gargiulo, and Mario Ciampi. Iterative annotation of
biomedical NER corpora with deep neural networks and knowledge bases. Applied
Sciences, 12(12):5775, 2022.
John Simpson and Edmund Weiner. The Oxford English Dictionary. Oxford University
Press, 2nd edition, 1989.
Push Singh, Thomas Lin, Erik T. Mueller, Grace Lim, Travell Perkins, and Wan Li
Zhu. Open mind common sense: Knowledge acquisition from the general public.
In On the Move to Meaningful Internet Systems, volume 2519 of Lecture Notes in
Computer Science, pages 1223–1237, 2002.
Richard Laishram Singh, Krishnendu Ghosh, Kishorjit Nongmeikapam, and Sivaji
Bandyopadhyay. A decision tree based word sense disambiguation system in
Manipuri language. Advanced Computing, 5(4):17, 2014.
Juan Sixto, Aitor Almeida, and Diego López-de Ipiña. An approach to subjectivity
detection on Twitter using the structured information. In International Conference
on Computational Collective Intelligence, pages 121–130, 2016.
Sara Skilbred-Fjeld, Silje Endresen Reme, and Svein Mossige. Cyberbullying in-
volvement and mental health problems among late adolescents. Cyberpsychology:
Journal of Psychosocial Research on Cyberspace, 14(1), 2020.
Wojciech Skut, Brigitte Krenn, Thorsten Brants, and Hans Uszkoreit. An annota-
tion scheme for free word order languages. In 5th Conference on Applied Natural
Language Processing, 1997. doi: 10.3115/974557.974571.
Stavroula Skylaki, Ali Oskooei, Omar Bari, Nadja Herger, and Zac Kriegman. Named
entity recognition in the legal domain using a pointer generator network. arXiv
preprint arXiv:2012.09936, 2020.
Edgar A Smith. Devereux readability index. The Journal of Educational Research,
54(8):298–303, 1961.
Paul Smolensky, Richard McCoy, Roland Fernandez, Matthew Goldrick, and Jian-
feng Gao. Neurocompositional computing: From the central paradox of cognition
to a new generation of AI systems. AI Magazine, 43(3):308–322, 2022.
Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot
learning. In 31st International Conference on Neural Information Processing
Systems, pages 4080–4090, 2017. ISBN 9781510860964.
Rion Snow, Dan Jurafsky, and Andrew Y Ng. Semantic taxonomy induction from
heterogenous evidence. In ACL, pages 801–808, 2006.
Benjamin Snyder and Martha Palmer. The English all-words task. In SENSEVAL-3,
the Third International Workshop on the Evaluation of Systems for the Semantic
Analysis of Text, pages 41–43, 2004.
Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. Reasoning
with neural tensor networks for knowledge base completion. In NIPS, pages
926–934, 2013a.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning,
Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic com-
positionality over a sentiment treebank. In EMNLP, pages 1631–1642, 2013b.
Anders Søgaard. Simple semi-supervised training of part-of-speech taggers. In ACL
2010 Conference Short Papers, pages 205–208, Uppsala, Sweden, July 2010.
Mohammad Soleymani, Sadjad Asghari-Esfeden, Yun Fu, and Maja Pantic. Analysis
of EEG signals and facial expressions for continuous emotion detection. IEEE
Transactions on Affective Computing, 7(1):17–28, 2015.
Swapna Somasundaran and Janyce Wiebe. Recognizing stances in ideological on-
line debates. In NAACL HLT 2010 workshop on computational approaches to
analysis and generation of emotion in text, pages 116–124, 2010.
Swapna Somasundaran, Josef Ruppenhofer, and Janyce Wiebe. Detecting arguing
and sentiment in meetings. In 8th SIGdial Workshop on Discourse and Dialogue,
pages 26–34, 2007.
Sheetal Sonawane and Parag Kulkarni. The role of coreference resolution in extrac-
tive summarization. In 2016 International Conference on Computing, Analytics
and Security Trends (CAST), pages 351–356, 2016.
Bosheng Song, Fen Li, Yuansheng Liu, and Xiangxiang Zeng. Deep learning meth-
ods for biomedical named entity recognition: a survey and qualitative comparison.
Briefings in Bioinformatics, 22(6):bbab282, 2021.
Linfeng Song, Kun Xu, Yue Zhang, Jianshu Chen, and Dong Yu. ZPR2: Joint zero
pronoun recovery and resolution using multi-task learning and BERT. In ACL,
pages 5429–5434, 2020a.
Min Song, Il-Yeol Song, Xiaohua Hu, and Robert B Allen. Integrating text chunking
with mixture hidden Markov models for effective biomedical information extrac-
tion. In International Conference on Computational Science, pages 976–984,
2005.
Wei Song, Jingjin Guo, Ruiji Fu, Ting Liu, and Lizhen Liu. A knowledge graph
embedding approach for metaphor processing. IEEE/ACM Transactions on Audio,
Speech, and Language Processing, 29:406–420, 2020b.
Wee Meng Soon, Hwee Tou Ng, and Daniel Chung Yong Lim. A machine learning
approach to coreference resolution of noun phrases. Computational Linguistics,
27(4):521–544, 2001.
Hoong-Cheng Soong, Norazira Binti A Jalil, Ramesh Kumar Ayyasamy, and Rehan
Akbar. The essential of sentiment analysis and opinion mining in social media:
Introduction and survey of the recent approaches and techniques. In 2019 IEEE 9th
Symposium on Computer Applications & Industrial Electronics, pages 272–277,
2019.
Vibeke Sorensen, John Stephen Lansing, Nagaraju Thummanapalli, and Erik Cam-
bria. Mood of the planet: Challenging visions of big data in the arts. Cognitive
Computation, 14(1):310–321, 2022.
Benjamin Spector. Global positive polarity items and obligatory exhaustivity.
Semantics and Pragmatics, 7(11), 2014.
Robyn Speer, Joshua Chin, and Catherine Havasi. ConceptNet 5.5: An open multi-
lingual graph of general knowledge. In AAAI, pages 4444–4451, 2017.
Elizabeth S Spelke and Katherine D Kinzler. Core knowledge. Developmental
science, 10(1):89–96, 2007.
Drahomíra Spoustová, Jan Hajič, Jan Raab, and Miroslav Spousta. Semi-supervised
training for the averaged perceptron POS tagger. In 12th Conference of the
European Chapter of the ACL (EACL 2009), pages 763–771, Athens, Greece,
March 2009.
Clemens Stachl, Florian Pargent, Sven Hilbert, Gabriella M Harari, Ramona
Schoedel, Sumer Vaid, Samuel D Gosling, and Markus Bühner. Personality
research and assessment in the era of machine learning. European Journal of
Personality, 34(5):613–631, 2020.
Sanja Štajner and Seren Yenikent. A survey of automatic personality detection from
texts. In COLING, pages 6284–6295, 2020.
Sanja Štajner and Seren Yenikent. Why is MBTI personality detection from texts a
difficult task? In EACL, pages 3580–3589, 2021.
Efstathios Stamatatos, Nikos Fakotakis, and George Kokkinakis. Automatic extrac-
tion of rules for sentence boundary disambiguation. In Workshop on Machine
Learning in Human Language Technology, pages 88–92, 1999.
Keith Stanovich and Richard West. Advancing the rationality debate. Behavioral
and brain sciences, 23(5):701–717, 2000.
Harald Steck, Chaitanya Ekanadham, and Nathan Kallus. Is cosine-similarity of
embeddings really about similarity? In Web Conference, WWW ’24. ACM, May
2024. doi: 10.1145/3589335.3651526.
Gerard Steen, Lettie Dorst, Berenike Herrmann, Anna Kaal, Tina Krennmayr, and
Trijntje Pasma. A Method for Linguistic Metaphor Identification: From MIP to
MIPVU. John Benjamins, 2010.
Josef Steinberger, Massimo Poesio, Mijail A Kabadjov, and Karel Ježek. Two uses of
anaphora resolution in summarization. Information Processing & Management,
43(6):1663–1680, 2007.
Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou,
and Jun’ichi Tsujii. BRAT: a web-based tool for NLP-assisted text annotation. In
EACL, pages 102–107, 2012.
Chang Su, Ying Peng, Shuman Huang, and Yijiang Chen. A metaphor compre-
hension method based on culture-related hierarchical semantic model. Neural
Processing Letters, 51(3):2807–2826, 2020.
Jianlin Su, Ahmed Murtadha, Shengfeng Pan, Jing Hou, Jun Sun, Wanwei Huang,
Bo Wen, and Yunfeng Liu. Global pointer: Novel efficient span-based approach
for named entity recognition. arXiv preprint arXiv:2208.03054, 2022.
Ming-Hsiang Su, Chung-Hsien Wu, and Yu-Ting Zheng. Exploiting turn-taking tem-
poral evolution for personality trait perception in dyadic conversations. IEEE/ACM
Transactions on Audio, Speech, and Language Processing, 24(4):733–744, 2016.
Niranjan Subrahmanya and Yung C Shin. Sparse multiple kernel learning for signal
processing applications. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 32(5):788–798, 2009.
Amarnag Subramanya, Slav Petrov, and Fernando Pereira. Efficient graph-based
semi-supervised learning of structured tagging models. In EMNLP, EMNLP ’10,
pages 167–176, USA, 2010.
Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. YAGO: A core of
semantic knowledge. In International Conference on the World Wide Web, pages
697–706, 2007.
Rhea Sukthanker, Soujanya Poria, Erik Cambria, and Ramkumar Thirunavukarasu.
Anaphora and coreference resolution: A review. Information Fusion, 59:139–162,
2020.
Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. How to fine-tune BERT
for text classification? In China National Conference on Chinese Computational
Linguistics, pages 194–206, 2019.
Xiangguo Sun, Bo Liu, Jiuxin Cao, Junzhou Luo, and Xiaojun Shen. Who am I?
Personality detection based on deep learning for texts. In 2018 IEEE International
Conference on Communications (ICC), pages 1–6, 2018.
Xu Sun, Louis-Philippe Morency, Daisuke Okanohara, Yoshimasa Tsuruoka, and
Jun’ichi Tsujii. Modeling latent-dynamic in shallow parsing: A latent conditional
model with improved inference. In COLING, pages 841–848, 2008.
Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng
Wang. ERNIE 2.0: A continual pre-training framework for language understand-
ing. In AAAI, pages 8968–8975, 2020.
Yosephine Susanto, Andrew Livingstone, Bee Chin Ng, and Erik Cambria. The
hourglass model revisited. IEEE Intelligent Systems, 35(5):96–102, 2020.
Charles Sutton, Andrew McCallum, and Khashayar Rohanimanesh. Dynamic condi-
tional random fields: Factorized probabilistic models for labeling and segmenting
sequence data. Journal of Machine Learning Research, 8(3), 2007.
Jun Suzuki and Hideki Isozaki. Semi-supervised sequential labeling and segmen-
tation using giga-word scale unlabeled data. In ACL-08: HLT, pages 665–673,
2008.
Swati Swati, Adrian Mladenić Grobelnik, Dunja Mladenić, and Marko Grobelnik.
A commonsense-infused language-agnostic learning framework for enhancing
prediction of political polarity in multilingual news headlines. arXiv e-prints,
pages arXiv–2212, 2022.
Gongbo Tang, Rico Sennrich, and Joakim Nivre. Encoders help you disambiguate
word senses in neural machine translation. In EMNLP-IJCNLP, pages 1429–1435,
2019.
Mariona Taulé, M Antònia Martí, and Marta Recasens. AnCora: Multilevel annotated
corpora for Catalan and Spanish. In Sixth International Conference on Language
Resources and Evaluation (LREC’08), pages 1–6, 2008.
Yla R Tausczik and James W Pennebaker. The psychological meaning of words:
LIWC and computerized text analysis methods. Journal of Language and Social
Psychology, 29(1):24–54, 2010.
Paul Taylor, Alan W Black, and Richard Caley. The architecture of the Festival
speech synthesis system. In The Third ESCA/COCOSDA Workshop (ETRW) on
Speech Synthesis, 1998.
Simone Tedeschi and Roberto Navigli. MultiNERD: A multilingual, multi-genre
and fine-grained dataset for named entity recognition (and disambiguation). In
NAACL, pages 801–812, 2022.
Simone Tedeschi, Simone Conia, Francesco Cecconi, and Roberto Navigli. Named
entity recognition for entity linking: What works and what’s next. In EMNLP,
pages 2584–2596, 2021a.
Simone Tedeschi, Valentino Maiorca, Niccolò Campolungo, Francesco Cecconi, and
Roberto Navigli. WikiNEuRal: Combined neural and knowledge-based silver data
creation for multilingual NER. In EMNLP, pages 2521–2533, 2021b.
Heike Telljohann, Erhard Hinrichs, and Sandra Kübler. The TüBa-D/Z
treebank: Annotating German with a context-free backbone. In LREC, pages
2229–2232, 2004.
Puneshkumar U Tembhare, Ritesh Hiware, Shrey Ojha, Abhisheik Nimpure, and
Faiz Raza. Content recommender system based on users’ reviews. In International
Conference on ICT for Sustainable Development, pages 441–451, 2023.
Joel R. Tetreault. A corpus-based evaluation of centering and pronoun resolution.
Computational Linguistics, 27(4):507–520, 2001.
Mike Thelwall, Kevan Buckley, and Georgios Paltoglou. Sentiment strength detec-
tion for the social web. Journal of the American Society for Information Science
and Technology, 63(1):163–173, 2012.
Yuan Tian, Nan Xu, Wenji Mao, and Daniel Zeng. Modeling conceptual attribute
likeness and domain inconsistency for metaphor detection. In EMNLP, pages
7736–7752, 2023.
Orith Toledo-Ronen, Matan Orbach, Yoav Katz, and Noam Slonim. Multi-domain
targeted sentiment analysis. In NAACL-HLT, pages 2751–2762, Seattle, United
States, July 2022.
Antonela Tommasel, Alejandro Corbellini, Daniela Godoy, and Silvia Schiaffino.
Exploring the role of personality traits in followee recommendation. Online
Information Review, 39(6):812–830, 2015.
Antonela Tommasel, Alejandro Corbellini, Daniela Godoy, and Silvia Schiaffino.
Personality-aware followee recommendation algorithms: An empirical analysis.
Engineering Applications of Artificial Intelligence, 51:24–36, 2016.
Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast random walk with restart
and its applications. In ICDM, pages 613–622, 2006.
Kristina Toutanova and Colin Cherry. A global model for joint lemmatization and
part-of-speech prediction. In ACL-IJCNLP, pages 486–494, 2009.
Kristina Toutanova and Mark Johnson. A Bayesian LDA-based model for semi-
supervised part-of-speech tagging. NeurIPS, 20:1521–1528, 2007.
Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-
rich part-of-speech tagging with a cyclic dependency network. In NAACL-HLT,
NAACL ’03, pages 173–180, USA, 2003. doi: 10.3115/1073445.1073478.
Kristina Toutanova and Christopher D Manning. Enriching the knowledge sources
used in a maximum entropy part-of-speech tagger. In Joint SIGDAT conference
on Empirical methods in natural language processing and very large corpora,
pages 63–70, 2000.
Elizabeth Closs Traugott. Revisiting subjectification and intersubjectification. Sub-
jectification, Intersubjectification and Grammaticalization, 29:71, 2010.
Marcos V. Treviso, Christopher D. Shulby, and Sandra M. Aluisio. Evaluating word
embeddings for sentence boundary detection in speech transcripts, 2017.
Marcos Vinícius Treviso, Christopher Shulby, and Sandra Maria Aluísio. Sentence
segmentation in narrative transcripts from neuropsychological tests using recur-
rent convolutional neural networks. arXiv preprint arXiv:1610.00211, 2016.
Rocco Tripodi and Roberto Navigli. Game theory meets embeddings: a unified
framework for word sense disambiguation. In EMNLP-IJCNLP, pages 88–99,
2019.
Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe
Morency, and Ruslan Salakhutdinov. Multimodal transformer for unaligned mul-
timodal language sequences. In ACL, volume 2019, page 6558, 2019.
S. C. Tseng. Processing spoken Mandarin corpora. Traitement Automatique Des
Langues, 45(2):89–108, 2004.
Yoshimasa Tsuruoka and Jun’ichi Tsujii. Bidirectional inference with the easiest-first
strategy for tagging sequence data. In HLT-EMNLP, HLT ’05, pages 467–474,
USA, 2005. doi: 10.3115/1220575.1220634.
Yulia Tsvetkov, Leonid Boytsov, Anatole Gershman, Eric Nyberg, and Chris Dyer.
Metaphor detection with cross-lingual model transfer. In ACL, pages 248–258,
2014.
Geng Tu, Taiyu Niu, Ruifeng Xu, Bin Bin Liang, and Erik Cambria. AdaCLF:
An adaptive curriculum learning framework for emotional support conversation.
IEEE Intelligent Systems, 39(4):5–11, 2024.
Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. Modeling
coverage for neural machine translation. In ACL, pages 76–85, 2016.
Mohammad Tubishat, Norisma Idris, and Mohammad Abushariah. Explicit aspects
extraction in sentiment analysis using optimal rules combination. Future Gener-
ation Computer Systems, 114:448–480, 2021.
Peter D Turney and Patrick Pantel. From frequency to meaning: Vector space models
of semantics. Journal of Artificial Intelligence Research, 37:141–188, 2010.
Cynthia Van Hee, Els Lefever, and Véronique Hoste. SemEval-2018 task 3: Irony de-
tection in English tweets. In 12th International Workshop on Semantic Evaluation,
pages 39–50, 2018.
Colette M Van Kerckvoorde. An Introduction to Middle Dutch. De Gruyter Mouton,
2019.
Emiel van Miltenburg, Wei-Ting Lu, Emiel Krahmer, Albert Gatt, Guanyi Chen,
Lin Li, and Kees van Deemter. Gradations of error severity in automatic image
descriptions. In 13th International Conference on Natural Language Generation,
pages 398–411, 2020.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS,
30, 2017.
Leonid Velikovich, Sasha Blair-Goldensohn, Kerry Hannan, and Ryan McDonald.
The viability of web-derived polarity lexicons. In NAACL-HLT, pages 777–785,
2010.
Ben Verhoeven, Walter Daelemans, and Barbara Plank. Twisty: a multilingual Twitter
stylometry corpus for gender and personality profiling. In Tenth International
Conference on Language Resources and Evaluation (LREC’16), pages 1632–
1637, 2016.
Kanishk Verma and Brian Davis. Implicit aspect-based opinion mining and analysis
of airline industry based on user-generated reviews. SN Computer Science, 2(4):
1–9, 2021.
Loïc Vial, Benjamin Lecouteux, and Didier Schwab. Sense vocabulary compression
through the semantic knowledge of WordNet for neural word sense disambigua-
tion. In 10th Global WordNet Conference, pages 108–117, 2019.
Felipe Viegas, Mário S Alvim, Sérgio Canuto, Thierson Rosa, Marcos André
Gonçalves, and Leonardo Rocha. Exploiting semantic relationships for unsu-
pervised expansion of sentiment lexicons. Information Systems, 94:101606, 2020.
Renata Vieira and Massimo Poesio. An empirically-based system for processing
definite descriptions. Computational Linguistics, 26(4):539–593, 2000.
Supriti Vijay and Aman Priyanshu. NERDA-Con: Extending NER models for con-
tinual learning–integrating distinct tasks and updating distribution shifts. arXiv
preprint arXiv:2206.14607, 2022.
Marc Vilain, John D Burger, John Aberdeen, Dennis Connolly, and Lynette
Hirschman. A model-theoretic coreference scoring scheme. In MUC-6, pages
45–52, 1995.
Julio Villena, Janine García-Morera, Miguel García-Cumbreras, Eugenio Martínez-
Cámara, Maria Martín-Valdivia, and L. López. Overview of TASS 2015. In TASS
2015: Workshop on Sentiment Analysis at SEPLN co-located with 31st SEPLN
Conference, pages 13–21, 09 2015.
Alessandro Vinciarelli and Gelareh Mohammadi. A survey of personality computing.
IEEE Transactions on Affective Computing, 5(3):273–291, 2014.
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In NeurIPS,
pages 2692–2700, 2015a.
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In 28th
International Conference on Neural Information Processing Systems-Volume 2,
pages 2692–2700, 2015b.
Duy-Tin Vo and Yue Zhang. Target-dependent Twitter sentiment classification with
rich automatic features. In IJCAI, 2015.
Luis von Ahn. Games with a purpose. Computer, 39(6):92–94, 2006.
Luis von Ahn, Mihir Kedia, and Manuel Blum. Verbosity: a game for collect-
ing common-sense facts. In 2006 Conference on Human Factors in Computing
Systems, CHI, pages 75–78, 2006.
Denny Vrandečić. Ontology evaluation. In Handbook on Ontologies, pages 293–313.
Springer, 2009.
Denny Vrandečić and Markus Krötzsch. Wikidata: a free collaborative knowledge-
base. Communications of the ACM, 57(10):78–85, 2014.
Tung Vuong, Salvatore Andolina, Giulio Jacucci, and Tuukka Ruotsalo. Spoken
conversational context improves query auto-completion in web search. ACM
Transactions on Information Systems (TOIS), 39(3):1–32, 2021.
Lennart Wachowiak and Dagmar Gromann. Systematic analysis of image schemas
in natural language through explainable multilingual neural language processing.
In COLING, pages 5571–5581, 2022.
Christopher Walker, Stephanie Strassel, Julie Medero, and Kazuaki Maeda. ACE
2005 multilingual training corpus. Linguistic Data Consortium, Philadelphia, 57:
45, 2006.
Marilyn Walker and Steve Whittaker. Mixed initiative in dialogue: an investigation
into discourse segmentation. In ACL, pages 70–78, 1990.
Byron C Wallace, Laura Kertz, Eugene Charniak, et al. Humans require context to
infer ironic intent (so computers probably do, too). In ACL, pages 512–516, 2014.
Hai Wan, Yufei Yang, Jianfeng Du, Yanan Liu, Kunxun Qi, and Jeff Z Pan. Target-
aspect-sentiment joint detection for aspect-based sentiment analysis. In AAAI,
pages 9122–9129, 2020.
Mengting Wan and Julian McAuley. Modeling ambiguity, subjectivity, and diverging
viewpoints in opinion question answering systems. In ICDM, pages 489–498,
2016.
Qian Wan, Luona Wei, Xinhai Chen, and Jie Liu. A region-based hypergraph network
for joint entity-relation extraction. Knowledge-Based Systems, 228:107298, 2021.
Xiaojun Wan. Co-training for cross-lingual sentiment classification. In ACL-
IJCNLP, pages 235–243, 2009.
Bailin Wang and Wei Lu. Neural segmental hypergraphs for overlapping mention
recognition. In EMNLP, pages 204–214, 2018.
Feixiang Wang, Man Lan, and Wenting Wang. Towards a one-stop solution to both
aspect extraction and sentiment analysis tasks with neural multi-task learning. In
IJCNN, pages 1–8, 2018a.
Hongwei Wang, Fuzheng Zhang, Jialin Wang, Miao Zhao, Wenjie Li, Xing Xie, and
Minyi Guo. Exploring high-order user preference on the knowledge graph for
recommender systems. ACM Transactions on Information Systems, 37(3):1–26,
2019a.
Longyue Wang, Zhaopeng Tu, Xing Wang, and Shuming Shi. One model to learn
both: Zero pronoun prediction and translation. In EMNLP-IJCNLP, pages 921–
930, 2019b.
Ming Wang and Yinglin Wang. A synset relation-enhanced framework with a try-
again mechanism for word sense disambiguation. In EMNLP, pages 6229–6240,
2020.
Peilu Wang, Yao Qian, Frank K. Soong, Lei He, and Hai Zhao. Part-of-speech
tagging with bidirectional long short-term memory recurrent neural network.
CoRR, abs/1510.06168, 2015a.
Pidong Wang and Hwee Tou Ng. A beam-search decoder for normalization of
social media text with application to machine translation. In NAACL-HLT, pages
471–481, 2013.
Shan Wang and Francis Bond. Building the Chinese Open Wordnet (COW): Starting
from core synsets. In 11th Workshop on Asian Language Resources, pages 10–18,
2013.
Sida Wang and Christopher Manning. Fast dropout training. In ICML, pages 118–
126, 2013.
Sida I Wang and Christopher D Manning. Baselines and bigrams: Simple, good
sentiment and topic classification. In ACL, pages 90–94, 2012.
Wei Wang, Ling He, Yenchun Jim Wu, and Mark Goh. Signaling persuasion in
crowdfunding entrepreneurial narratives: the subjectivity vs objectivity debate.
Computers in Human Behavior, 114:106576, 2021.
Wenguan Wang and Yi Yang. Towards data-and knowledge-driven artificial intelli-
gence: A survey on neuro-symbolic computing. arXiv preprint arXiv:2210.15889,
2022.
Wenya Wang, Sinno Jialin Pan, Daniel Dahlmeier, and Xiaokui Xiao. Coupled
multi-layer attentions for co-extraction of aspect and opinion terms. In AAAI,
2017.
Xiaochen Wang, Wenzheng Feng, Jie Tang, and Qingyang Zhong. Course concept
extraction in MOOC via explicit/implicit representation. In Third IEEE Interna-
tional Conference on Data Science in Cyberspace, pages 339–345, 2018b.
Yanan Wang, Qi Liu, Chuan Qin, Tong Xu, Yijun Wang, Enhong Chen, and Hui
Xiong. Exploiting topic-based adversarial neural network for cross-domain
keyphrase extraction. In ICDM, pages 597–606, 2018c.
Yexiang Wang, Yi Guo, and Siqi Zhu. Slot attention with value normalization for
multi-domain dialogue state tracking. In EMNLP, pages 3019–3028, 2020.
Yong Wang and Ian H Witten. Induction of model trees for predicting continuous
classes. In European Conference on Machine Learning, 1996.
Yu Wang, Hanghang Tong, Ziye Zhu, and Yun Li. Nested named entity recognition:
A survey. ACM Transactions on Knowledge Discovery from Data, 16(6):1–29,
2022.
Zhongyuan Wang, Haixun Wang, Ji-Rong Wen, and Yanghua Xiao. An inference
approach to basic level of categorization. In CIKM, pages 653–662, 2015b.
William Warner and Julia Hirschberg. Detecting hate speech on the world wide web.
In Second Workshop on Language in Social Media, pages 19–26, 2012.
Zhen Wu, Chengcan Ying, Fei Zhao, Zhifang Fan, Xinyu Dai, and Rui Xia. Grid
tagging scheme for aspect-oriented fine-grained opinion extraction. arXiv preprint
arXiv:2010.04640, 2020c.
Yu Xia, Quan Wang, Yajuan Lyu, Yong Zhu, Wenhao Wu, Sujian Li, and Dai Dai.
Learn and review: Enhancing continual named entity recognition via reviewing
synthetic samples. In ACL, pages 2291–2300, 2022.
Pan Xiao, YongQuan Fan, and YaJun Du. A personality-aware followee recommen-
dation model based on text semantics and sentiment analysis. In Natural Language
Processing and Chinese Computing, pages 503–514, 2018.
Chenyan Xiong, Russell Power, and Jamie Callan. Explicit semantic ranking for
academic search via knowledge graph embedding. In Rick Barrett, Rick Cum-
mings, Eugene Agichtein, and Evgeniy Gabrilovich, editors, 26th International
Conference on World Wide Web, pages 1271–1279, 2017.
Shengzhou Xiong, Yihua Tan, and Guoyou Wang. Explore visual concept formation
for image classification. In Marina Meila and Tong Zhang, editors, ICML, volume
139 of Machine Learning Research, pages 11470–11479. PMLR, 2021.
Fangzhi Xu, Qika Lin, Jiawei Han, Tianzhe Zhao, Jun Liu, and Erik Cambria. Are
large language models really good logical reasoners? A comprehensive evaluation
from deductive, inductive and abductive views. arXiv preprint arXiv:2306.09841,
2024.
Hu Xu, Bing Liu, Lei Shu, and Philip S Yu. Double embeddings and CNN-based
sequence labeling for aspect extraction. arXiv preprint arXiv:1805.04601, 2018.
Hua Xu, Fan Zhang, and Wei Wang. Implicit feature identification in Chinese reviews
using explicit topic mining model. Knowledge-Based Systems, 76:166–175, 2015a.
Ke Xu, Yunqing Xia, and Chin-Hui Lee. Tweet normalization with syllables. In
ACL-IJCNLP, pages 920–928, 2015b.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan
Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neu-
ral image caption generation with visual attention. In ICML, pages 2048–2057,
2015c.
Lu Xu, Hao Li, Wei Lu, and Lidong Bing. Position-aware tagging for aspect sentiment
triplet extraction. arXiv preprint arXiv:2010.02609, 2020a.
Lu Xu, Yew Ken Chia, and Lidong Bing. Learning span-level interactions for aspect
sentiment triplet extraction. In ACL-IJCNLP, pages 4755–4766, 2021.
Qiannan Xu, Li Zhu, Tao Dai, Lei Guo, and Sisi Cao. Non-negative matrix fac-
torization for implicit aspect identification. Journal of Ambient Intelligence and
Humanized Computing, 11:2683–2699, 2020b.
Huong Nguyen Thi Xuan, Anh Cuong Le, et al. Linguistic features for subjectivity
classification. In 2012 International Conference on Asian Language Processing,
pages 17–20, 2012.
Ying Xue, Jianshan Sun, Yezheng Liu, Xin Li, and Kun Yuan. Facial expression-
enhanced recommendation for virtual fitting rooms. Decision Support Systems,
177:114082, 2024.
Zhenzhen Xue, Dawei Yin, and Brian D Davison. Normalizing microtext. In AAAI,
2011.
Vikas Yadav and Steven Bethard. A survey on recent advances in named entity
recognition from deep learning models. In COLING, pages 2145–2158, 2018.
Shahpar Yakhchi, Amin Beheshti, Seyed Mohssen Ghafari, and Mehmet Orgun.
Enabling the analysis of personality aspects in recommender systems. arXiv
preprint arXiv:2001.04825, 2020.
Kosuke Yamada, Ryohei Sasano, and Koichi Takeda. Incorporating textual informa-
tion on user behavior for personality prediction. In ACL, pages 177–182, 2019.
Hang Yan, Junqi Dai, Tuo Ji, Xipeng Qiu, and Zheng Zhang. A unified generative
framework for aspect-based sentiment analysis. In ACL-IJCNLP, pages 2416–
2429, 2021a.
Hang Yan, Tao Gui, Junqi Dai, Qipeng Guo, Zheng Zhang, and Xipeng Qiu. A
unified generative framework for various NER subtasks. In ACL-IJCNLP, pages
5808–5822, 2021b.
Yukun Yan and Sen Song. Local hypergraph-based nested named entity recognition
as query-based sequence labeling. arXiv preprint arXiv:2204.11467, 2022.
Zhiheng Yan, Chong Zhang, Jinlan Fu, Qi Zhang, and Zhongyu Wei. A partition
filter network for joint entity and relation extraction. In EMNLP, pages 185–197,
2021c.
Zhijun Yan, Meiming Xing, Dongsong Zhang, and Baizhang Ma. EXPRS: An
extended pagerank method for product feature extraction from online consumer
reviews. Information & Management, 52(7):850–858, 2015.
Chin Lung Yang, Peter C Gordon, Randall Hendrick, and Jei Tun Wu. Comprehen-
sion of referring expressions in Chinese. Language and Cognitive Processes, 14
(5-6):715–743, 1999.
Dongqiang Yang and David Martin Powers. Measuring semantic similarity in the
taxonomy of WordNet. Australian Computer Society, Australia, 2005.
Feifan Yang, Xiaojun Quan, Yunyi Yang, and Jianxing Yu. Multi-document trans-
former for personality detection. In AAAI, pages 14221–14229, 2021a.
Feifan Yang, Tao Yang, Xiaojun Quan, and Qinliang Su. Learning to answer psy-
chological questionnaire for personality detection. In EMNLP, pages 1131–1142,
2021b.
Hsin-Chang Yang and Zi-Rui Huang. Mining personality traits from social messages
for game recommender systems. Knowledge-Based Systems, 165:157–168, 2019.
Jie Yang, Shuailong Liang, and Yue Zhang. Design challenges and misconceptions
in neural sequence labeling. arXiv preprint arXiv:1806.04470, 2018.
Jingyuan Yang, Xinbo Gao, Leida Li, Xiumei Wang, and Jinshan Ding. SOLVER:
Scene-object interrelated visual emotion reasoning network. IEEE Transactions
on Image Processing, 30:8686–8701, 2021c.
Shuoheng Yang, Yuxin Wang, and Xiaowen Chu. A survey of deep learning tech-
niques for neural machine translation. arXiv preprint arXiv:2002.07526, 2020a.
Songlin Yang and Kewei Tu. Bottom-up constituency parsing and nested named
entity recognition with pointer networks. In ACL, pages 2403–2416, 2022.
Tao Yang, Feifan Yang, Haolan Ouyang, and Xiaojun Quan. Psycholinguistic tripar-
tite graph network for personality detection. In ACL-IJCNLP, pages 4229–4239,
2021d.
Xi Yang, Jiang Bian, William R. Hogan, and Yonghui Wu. Clinical concept extraction
using transformers. J. Am. Medical Informatics Assoc., 27(12):1935–1942, 2020b.
Xiaofeng Yang, Guodong Zhou, Jian Su, and Chew Lim Tan. Coreference resolution
using competition learning approach. In ACL, pages 176–183, 2003.
Yi Yang and Jacob Eisenstein. A log-linear model for unsupervised text normaliza-
tion. In EMNLP, pages 61–72, 2013.
Yi Yang and Arzoo Katiyar. Simple and effective few-shot named entity recognition
with structured nearest neighbor learning. In EMNLP, pages 6365–6375, 2020.
Zhilin Yang, Ruslan Salakhutdinov, and William Cohen. Multi-task cross-lingual
sequence tagging from scratch. arXiv preprint arXiv:1603.06270, 2016a.
Zhilin Yang, Ruslan Salakhutdinov, and William W. Cohen. Transfer learning for
sequence tagging with hierarchical recurrent networks. CoRR, abs/1703.06345,
2017.
Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy.
Hierarchical attention networks for document classification. In NAACL-HLT,
pages 1480–1489, 2016b.
Zonglin Yang, Xinya Du, Erik Cambria, and Claire Cardie. End-to-end case-based
reasoning for commonsense knowledge base completion. In EACL, pages 3509–
3522, 2023.
Zonglin Yang, Li Dong, Xinya Du, Hao Cheng, Erik Cambria, Xiaodong Liu, Jian-
feng Gao, and Furu Wei. Language models as inductive reasoners. In EACL,
pages 209–225, 2024.
Fanglong Yao, Xian Sun, Hongfeng Yu, Wenkai Zhang, Wei Liang, and Kun Fu.
Mimicking the brain’s cognition of sarcasm from multidisciplines for Twitter
sarcasm detection. IEEE Transactions on Neural Networks and Learning Systems,
2021.
Liang Yao, Chengsheng Mao, and Yuan Luo. KG-BERT: BERT for knowledge graph
completion. CoRR, abs/1909.03193, 2019.
Yuanzhou Yao, Zhao Zhang, Yongjun Xu, and Chao Li. Data augmentation for few-
shot knowledge graph completion from hierarchical perspective. In COLING,
pages 2494–2503, 2022.
Deming Ye, Yankai Lin, Jiaju Du, Zhenghao Liu, Peng Li, Maosong Sun, and
Zhiyuan Liu. Coreferential Reasoning Learning for Language Representation. In
EMNLP, pages 7170–7186, 2020.
Hai Ye and Lu Wang. Semi-supervised learning for neural keyphrase generation. In
EMNLP, pages 4142–4153, 2018.
Weijie Yeo, Teddy Ferdinan, Przemyslaw Kazienko, Ranjan Satapathy, and Erik
Cambria. Self-training large language models through knowledge detection. arXiv
preprint arXiv:2406.11275, 2024a.
Weijie Yeo, Ranjan Satapathy, and Erik Cambria. Plausible extractive rationalization
through semi-supervised entailment signal. In ACL, pages 5182–5192, 2024b.
Weijie Yeo, Ranjan Satapathy, Siow Mong Goh, and Erik Cambria. How interpretable
are reasoning explanations from prompting large language models? In NAACL,
pages 2148–2164, 2024c.
Eray Yildiz and A. Cüneyd Tantuğ. Morpheus: A neural network for jointly learning
contextual lemmatization and morphological tagging. In Workshop on Computa-
tional Research in Phonetics, Phonology, and Morphology, pages 25–34, 2019.
Qingyu Yin, Weinan Zhang, Yun Zhang, and Ting Liu. A deep neural network for
Chinese zero pronoun resolution. In IJCAI, pages 3322–3328, 2017a.
Qingyu Yin, Yu Zhang, Weinan Zhang, and Ting Liu. Chinese zero pronoun reso-
lution with deep memory network. In EMNLP, pages 1309–1318, 2017b.
Qingyu Yin, Yu Zhang, Wei-Nan Zhang, Ting Liu, and William Yang Wang. Deep
reinforcement learning for Chinese zero pronoun resolution. In ACL, pages 569–
578, 2018a.
Qingyu Yin, Yu Zhang, Weinan Zhang, Ting Liu, and William Yang Wang. Zero
pronoun resolution with attention-based neural network. In COLING, pages 13–
23, 2018b.
Qingyu Yin, Weinan Zhang, Yu Zhang, and Ting Liu. Chinese zero pronoun reso-
lution: A collaborative filtering-based approach. ACM Transactions on Asian and
Low-Resource Language Information Processing, 19(1):1–20, 2019.
Tom Young, Erik Cambria, Iti Chaturvedi, Hao Zhou, Subham Biswas, and Minlie
Huang. Augmenting end-to-end dialogue systems with commonsense knowledge.
In AAAI, pages 4970–4977, 2018.
Bowen Yu, Zhenyu Zhang, Xiaobo Shu, Tingwen Liu, Yubin Wang, Bin Wang, and
Sujian Li. Joint extraction of entities and relations based on a novel decomposition
strategy. In ECAI, pages 2282–2289, 2020.
Guoxin Yu, Jiwei Li, Ling Luo, Yuxian Meng, Xiang Ao, and Qing He. Self
question-answering: Aspect-based sentiment analysis by role flipped machine
reading comprehension. In EMNLP, pages 1331–1342, 2021.
Hong Yu and Vasileios Hatzivassiloglou. Towards answering opinion questions:
Separating facts from opinions and identifying the polarity of opinion sentences.
In EMNLP, pages 129–136, 2003.
Jianfei Yu, Jing Jiang, and Rui Xia. Global inference for aspect and opinion terms
co-extraction based on multi-task neural networks. IEEE/ACM Transactions on
Audio, Speech, and Language Processing, 27(1):168–177, 2018.
Jianfei Yu, Jieming Wang, Rui Xia, and Junjie Li. Targeted multimodal sentiment
classification based on coarse-to-fine grained image-target matching. In IJCAI,
2022.
Jianxing Yu, Zheng-Jun Zha, Meng Wang, and Tat-Seng Chua. Aspect ranking:
Identifying important product aspects from online consumer reviews. In ACL-
HLT, pages 1496–1505, 2011a.
Jianxing Yu, Zheng-Jun Zha, Meng Wang, Kai Wang, and Tat-Seng Chua. Domain-
assisted product aspect hierarchy generation: Towards hierarchical organization
of unstructured consumer reviews. In EMNLP, pages 140–150, 2011b.
Cuixin Yuan, Junjie Wu, Hong Li, and Lihong Wang. Personality recognition based
on user generated content. In 2018 15th International Conference on Service
Systems and Service Management (ICSSSM), pages 1–6, 2018.
Dayu Yuan, Julian Richardson, Ryan Doherty, Colin Evans, and Eric Altendorf.
Semi-supervised word sense disambiguation with neural models. In COLING,
pages 1374–1385, 2016.
Tan Yue, Rui Mao, Heng Wang, Zonghai Hu, and Erik Cambria. KnowleNet:
Knowledge fusion network for multimodal sarcasm detection. Information Fusion,
100:101921, 2023.
Yessi Yunitasari, Aina Musdholifah, and Anny Kartika Sari. Sarcasm detection for
sentiment analysis in indonesian tweets. Indonesian Journal of Computing and
Cybernetics Systems, 13(1):53–62, 2019.
Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe
Morency. Tensor fusion network for multimodal sentiment analysis. In EMNLP,
pages 1103–1114, 2017.
Nurulhuda Zainuddin, Ali Selamat, and Roliana Ibrahim. Improving Twitter aspect-
based sentiment analysis using hybrid approach. In Intelligent Information and
Database Systems, pages 151–160, 2016.
Nasser Zalmout and Nizar Habash. Joint diacritization, lemmatization, normaliza-
tion, and fine-grained morphological tagging. arXiv preprint arXiv:1910.02267,
2019.
Nasser Zalmout and Nizar Habash. Utilizing subword entities in character-level
sequence-to-sequence lemmatization models. In COLING, pages 4676–4682,
2020.
Omnia Zayed, John Philip McCrae, and Paul Buitelaar. Figure me out: A gold
standard dataset for metaphor interpretation. In LREC, pages 5810–5819, 2020.
Eugene B. Zechmeister, Andrea M. Chronis, William L. Cull, Catherine A. D’Anna,
and Noreen A. Healy. Growth of a functionally important lexicon. Journal of
Reading Behavior, 27(2):201–212, 1995.
Amir Zeldes. The GUM corpus: Creating multilayer resources in the classroom.
Language Resources and Evaluation, 51(3):581–612, 2017.
Daniel Zeman, Jan Hajic, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter,
Joakim Nivre, and Slav Petrov. CoNLL 2018 shared task: Multilingual parsing
from raw text to universal dependencies. In CoNLL 2018 Shared Task: Multilin-
gual parsing from raw text to universal dependencies, pages 1–21, 2018.
Feifei Zhai, Saloni Potdar, Bing Xiang, and Bowen Zhou. Neural models for sequence
chunking. arXiv preprint arXiv:1701.04027, 2017.
Bowen Zhang, Xu Huang, Zhichao Huang, Hu Huang, Baoquan Zhang, Xianghua
Fu, and Liwen Jing. Sentiment interpretable logic tensor network for aspect-term
sentiment analysis. In COLING, pages 6705–6714, 2022.
Chen Zhang, Qiuchi Li, Dawei Song, and Benyou Wang. A multi-task learning
framework for opinion triplet extraction. arXiv preprint arXiv:2010.01512, 2020a.
Congle Zhang, Tyler Baldwin, Howard Ho, Benny Kimelfeld, and Yunyao Li. Adap-
tive parser-centric text normalization. In ACL, pages 1159–1168, 2013.
Hongming Zhang, Yan Song, Yangqiu Song, and Dong Yu. Knowledge-aware
pronoun coreference resolution. In ACL, pages 867–876, 2019.
Hongming Zhang, Daniel Khashabi, Yangqiu Song, and Dan Roth. Transomcs:
From linguistic graphs to commonsense knowledge. In IJCAI, pages 4004–4010,
2020b.
Hongming Zhang, Xin Liu, Haojie Pan, Yangqiu Song, and Cane Wing-Ki Leung.
ASER: A large-scale eventuality knowledge graph. In Web Conference 2020,
WWW, pages 201–211, 2020c.
Jun Zhang, Yan Yang, Chencai Chen, Liang He, and Zhou Yu. KERS: A knowledge-
enhanced framework for recommendation dialog systems with multiple subgoals.
In EMNLP, pages 1092–1101, 2021a.
Meishan Zhang, Yue Zhang, and Duy-Tin Vo. Neural networks for open domain
targeted sentiment. In EMNLP, pages 612–621, 2015a.
Meishan Zhang, Yue Zhang, and Guohong Fu. End-to-end neural relation extraction
with global optimization. In EMNLP, pages 1730–1740, 2017a.
Qi Zhang, Yang Wang, Yeyun Gong, and Xuanjing Huang. Keyphrase extraction
using deep recurrent neural networks on Twitter. In Jian Su, Xavier Carreras, and
Kevin Duh, editors, EMNLP, pages 836–845, 2016.
Rui Zhang, Cícero Nogueira dos Santos, Michihiro Yasunaga, Bing Xiang, and
Dragomir Radev. Neural coreference resolution with deep biaffine attention by
joint mention detection and mention clustering. In ACL, pages 102–107, 2018.
Tong Zhang, Fred Damerau, and David E Johnson. Text chunking using regularized
winnow. In ACL, pages 539–546, 2001.
Wei Zhang, Clement Yu, and Weiyi Meng. Opinion retrieval from blogs. In CIKM,
pages 831–840, 2007.
Wen Zhang, Taketoshi Yoshida, Xijin Tang, and Tu-Bao Ho. Improving effectiveness
of mutual information for substantival multiword expression extraction. Expert
Systems with Applications, 36(8):10919–10930, 2009.
Wenhao Zhang, Hua Xu, and Wei Wan. Weakness finder: Find product weakness
from Chinese reviews by using aspects based sentiment analysis. Expert Systems
with Applications, 39(11):10283–10291, 2012.
Wenxuan Zhang, Yang Deng, Xin Li, Yifei Yuan, Lidong Bing, and Wai Lam.
Aspect sentiment quad prediction as paraphrase generation. In EMNLP, pages
9209–9219, 2021b.
Wenxuan Zhang, Xin Li, Yang Deng, Lidong Bing, and Wai Lam. Towards generative
aspect-based sentiment analysis. In ACL-IJCNLP, pages 504–510, 2021c.
Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks
for text classification. NeurIPS, 28, 2015b.
Xulang Zhang, Rui Mao, and Erik Cambria. A survey on syntactic processing
techniques. Artificial Intelligence Review, 56:5645–5728, 2023a.
Xulang Zhang, Rui Mao, Kai He, and Erik Cambria. Neurosymbolic sentiment
analysis with dynamic word sense disambiguation. In EMNLP Findings, pages
8772–8783, 2023b.
Xulang Zhang, Rui Mao, and Erik Cambria. Granular syntax processing with multi-
task and curriculum learning. Cognitive Computation, 16, 2024a.
Xulang Zhang, Rui Mao, and Erik Cambria. Multilingual emotion recognition:
Discovering the variations of lexical semantics between languages. In IJCNN,
2024b.
Xulang Zhang, Rui Mao, and Erik Cambria. SenticVec: Toward robust and human-
centric neurosymbolic sentiment analysis. In ACL Findings, pages 4851–4863,
2024c.
Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Man-
ning. Position-aware attention and supervised data improve slot filling. In EMNLP,
pages 35–45, 2017b.
Ziming Zhang, Ze-Nian Li, and Mark S Drew. AdaMKL: A novel biconvex multiple
kernel learning approach. In 2010 20th International Conference on Pattern
Recognition, pages 2126–2129, 2010.
Han Zhao, Zhengdong Lu, and Pascal Poupart. Self-adaptive hierarchical sentence
model. In IJCAI, pages 4069–4076, 2015.
He Zhao, Longtao Huang, Rong Zhang, Quan Lu, and Hui Xue. SpanMlt: A
span-based multi-task learning framework for pair-wise aspect and opinion terms
extraction. In ACL, pages 3239–3248, 2020.
Jialiang Zhao and Qi Gao. Annotation and detection of emotion in text-based
dialogue systems with CNN. arXiv preprint arXiv:1710.00987, 2017.
Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang.
Gender bias in coreference resolution: Evaluation and debiasing methods. In
NAACL-HLT, pages 15–20, 2018.
Lujun Zhao, Xipeng Qiu, Qi Zhang, and Xuanjing Huang. Sequence labeling with
deep gated dual path CNN. IEEE/ACM Transactions on Audio, Speech, and
Language Processing, 27(12):2326–2335, 2019.
Qiyun Zhao, Hao Wang, Pin Lv, and Chen Zhang. A bootstrapping based refinement
framework for mining opinion words and targets. In CIKM, pages 1995–1998,
2014.
Shanheng Zhao and Hwee Tou Ng. Identification and resolution of Chinese zero
pronouns: A machine learning approach. In EMNLP-CoNLL, pages 541–550,
2007.
Suncong Zheng, Feng Wang, Hongyun Bao, Yuexing Hao, Peng Zhou, and Bo Xu.
Joint extraction of entities and relations based on a novel tagging scheme. In ACL,
pages 1227–1236, 2017.
Yaowei Zheng, Richong Zhang, Suyuchen Wang, Samuel Mensah, and Yongyi Mao.
Anchored model transfer and soft instance transfer for cross-task cross-domain
learning: A study through aspect-level sentiment classification. In Web Confer-
ence, pages 2754–2760, 2020.
Peixiang Zhong, Di Wang, and Chunyan Miao. Knowledge-enriched transformer for
emotion detection in textual conversations. In EMNLP-IJCNLP, pages 165–176,
2019.
Xiaoshi Zhong and Erik Cambria. Time expression recognition and normalization:
A survey. Artificial Intelligence Review, 56:9115–9140, 2023.
Zexuan Zhong and Danqi Chen. A frustratingly easy approach for entity and relation
extraction. In NAACL-HLT, pages 50–61, 2021.
Zhi Zhong and Hwee Tou Ng. It makes sense: A wide-coverage word sense disam-
biguation system for free text. In ACL, pages 78–83, 2010.
Deyu Zhou, Zhikai Zhang, Min-Ling Zhang, and Yulan He. Weakly supervised POS
tagging without disambiguation. ACM Transactions on Asian and Low-Resource
Language Information Processing, 17(4):1–19, 2018.
GuoDong Zhou and Jian Su. Error-driven HMM-based chunk tagger with context-
dependent lexicon. In EMNLP-VLC, pages 71–79, 2000.
Houquan Zhou, Yu Zhang, Zhenghua Li, and Min Zhang. Is POS tagging necessary
or even helpful for neural dependency parsing? 2020a.
Kun Zhou, Wayne Xin Zhao, Shuqing Bian, Yuanhang Zhou, Ji-Rong Wen, and
Jingsong Yu. Improving conversational recommender systems via knowledge
graph based semantic fusion. In SIGKDD, pages 1006–1014, 2020b.
Nina Zhou, Xuancong Wang, and AiTi Aw. Dynamic boundary detection for speech
translation. In APSIPA, pages 651–656, 2017.
Ran Zhou, Xin Li, Ruidan He, Lidong Bing, Erik Cambria, Luo Si, and Chunyan
Miao. MELM: Data augmentation with masked entity language modeling for
low-resource NER. In ACL, pages 2251–2262, 2022.
Yan Zhou, Fuqing Zhu, Pu Song, Jizhong Han, Tao Guo, and Songlin Hu. An
adaptive hybrid framework for cross-domain aspect-based sentiment analysis. In
AAAI, pages 14630–14637, 2021a.
Zhengyu Zhou, In Gyu Choi, Yongliang He, Vikas Yadav, and Chin-Hui Lee. Us-
ing paralinguistic information to disambiguate user intentions for distinguishing
phrase structure and sarcasm in spoken dialog systems. In 2021 IEEE Spoken
Language Technology Workshop (SLT), pages 1020–1027, 2021b.
Zhi-Hua Zhou and Xu-Ying Liu. Training cost-sensitive neural networks with meth-
ods addressing the class imbalance problem. IEEE Transactions on Knowledge
and Data Engineering, 18(1):63–77, 2005.
Luyao Zhu, Wei Li, Rui Mao, Vlad Pandelea, and Erik Cambria. PAED: Zero-shot
persona attribute extraction in dialogues. In ACL, pages 9771–9787, 2023.
Luyao Zhu, Wei Li, Rui Mao, and Erik Cambria. HIPPL: Hierarchical intent-
inferring pointer network with pseudo labeling for consistent persona-driven dia-
logue generation. IEEE Computational Intelligence Magazine, 2024a.
Luyao Zhu, Rui Mao, Erik Cambria, and Bernard J. Jansen. Neurosymbolic AI
for personalized sentiment analysis. In International Conference on Human-
Computer Interaction (HCII), Washington DC, USA, 2024b.
Yin Zhuang, Zhen Liu, Ting-Ting Liu, Chih-Chieh Hung, and Yan-Jie Chai. Implicit
sentiment analysis based on multi-feature neural network model. Soft Computing,
26(2):635–644, 2022.
Anne Zimmerman, Joel Janhonen, and Emily Beer. Human/AI relationships: chal-
lenges, downsides, and impacts on human/human relationships. AI and Ethics,
2023.