Ebook: Knowledge Graphs: Semantics, Machine Learning, and Languages
Semantic computing is an integral part of modern technology, an essential component of fields as diverse as artificial intelligence, data science, knowledge discovery and management, big data analytics, e-commerce, enterprise search, technical documentation, document management, business intelligence, and enterprise vocabulary management.
This book presents the proceedings of SEMANTiCS 2023, the 19th International Conference on Semantic Systems, held in Leipzig, Germany, from 20 to 22 September 2023. The conference is a pivotal event for professionals and researchers actively engaged in harnessing the power of semantic computing: an opportunity to deepen their understanding of the subject's transformative potential while confronting its practical limitations. Attendees include information managers, IT architects, software engineers, and researchers from a broad spectrum of organizations, including research facilities, non-profit entities, public administrations, and the world's largest corporations.
For this year's conference a total of 54 submissions were received in response to a call for papers. These were subjected to a rigorous, double-blind review process, with at least three independent reviews conducted for each submission. The 16 papers included here were ultimately accepted for presentation, an acceptance rate of 29.6%. They address novel research challenges in areas such as data science, machine learning, logic programming, content engineering, social computing, and the Semantic Web.
The book provides an up-to-date overview, which will be of interest to all those wishing to stay abreast of emerging trends and themes within the vast field of semantic computing.
Abstract. This volume encompasses the proceedings of SEMANTiCS 2023, the 19th International Conference on Semantic Systems, a pivotal event for professionals and researchers actively engaged in harnessing the power of semantic computing. At SEMANTiCS, attendees gain a profound understanding of semantic computing's transformative potential, while also confronting its practical limitations. Each year, the conference attracts information managers, IT architects, software engineers, and researchers from a broad spectrum of organizations, spanning research facilities, non-profit entities, public administrations, and the world's largest corporations.
Keywords. Semantic Systems, Knowledge Graphs, Artificial Intelligence, Semantic Web, Linked Data, Machine Learning, Knowledge Discovery
SEMANTiCS serves as a vibrant platform facilitating the exchange of cutting-edge scientific findings in the realm of semantic systems. Furthermore, it extends its scope to encompass novel research challenges in areas such as data science, machine learning, logic programming, content engineering, social computing, and the Semantic Web. Having reached its 19th year, the conference has evolved into a distinguished international event that seamlessly bridges the gap between academia and industry.
Participants and contributors of SEMANTiCS gain invaluable insights from esteemed researchers and industry experts, enabling them to stay abreast of emerging trends and themes within the vast field of semantic computing. The SEMANTiCS community thrives on its diverse composition, attracting professionals with multifaceted roles encompassing artificial intelligence, data science, knowledge discovery and management, big data analytics, e-commerce, enterprise search, technical documentation, document management, business intelligence, and enterprise vocabulary management.
In 2023, the conference embraced the subtitle “Towards Decentralized Knowledge Eco-Systems” and particularly welcomed submissions pertaining to the following topics:
∙ Web Semantics & Linked (Open) Data
∙ Enterprise Knowledge Graphs, Graph Data Management
∙ Machine Learning Techniques for/using Knowledge Graphs (e.g. reinforcement learning, deep learning, data mining and knowledge discovery)
∙ Knowledge Management (e.g. acquisition, capture, extraction, authoring, integration, publication)
∙ Terminology, Thesaurus & Ontology Management
∙ Reasoning, Rules, and Policies
∙ Natural Language Processing for/using Knowledge Graphs (e.g. entity linking and resolution using target knowledge such as Wikidata and DBpedia, foundation models)
∙ Crowdsourcing for/using Knowledge Graphs
∙ Data Quality Management and Assurance
∙ Mathematical Foundation of Knowledge-aware AI
∙ Multimodal Knowledge Graphs
∙ Semantics in Data Science
∙ Semantics in Blockchain environments
∙ Trust, Data Privacy, and Security with Semantic Technologies
∙ Economics of Data, Data Services, and Data Ecosystems
∙ IoT and Stream Processing
∙ Conversational AI and Dialogue Systems
∙ Provenance and Data Change Tracking
∙ Semantic Interoperability (via mapping, crosswalks, standards, etc.)
Special Sub-Topics:
∙ Digital Humanities and Cultural Heritage
∙ LegalTech, AI Safety, Explainable and Interoperable AI
∙ Decentralized and/or Federated Knowledge Graphs
Application of Semantically Enriched and AI-Based Approaches:
∙ Knowledge Graphs in Bioinformatics and Medical AI
∙ Clinical Use Case of AI-based Approaches
∙ AI for Environmental Challenges
∙ Semantics in Scholarly Communication and Open Research Knowledge Graphs
∙ AI and LOD within GLAM (galleries, libraries, archives, and museums) institutions
The Research and Innovation track garnered significant attention, receiving 54 submissions in response to the public call for papers. To ensure meticulous evaluations, a program committee of 85 members collaborated to identify the papers of greatest impact and scientific merit. We implemented a double-blind review process, in which the identities of both authors and reviewers were concealed, and a minimum of three independent reviews was conducted for each submission. Upon completion of all reviews, the program committee chairs compared and deliberated on the evaluations, addressing any disparities or differing viewpoints with the reviewers. This comprehensive approach culminated in a meta-review, enabling the committee to recommend acceptance or rejection of each paper. Ultimately, we were pleased to accept 16 papers, resulting in an acceptance rate of 29.6%.
In addition to the peer-reviewed work, the conference featured three keynotes from renowned speakers: Xin Luna Dong (Meta Reality Labs), Marco Varone (Expert.ai), and Aidan Hogan (Department of Computer Science, University of Chile).
The program also included posters and demos, a comprehensive set of workshops, and talks from industry leaders.
We thank all authors who submitted papers. We particularly thank the program committee, whose members provided careful reviews with a quick turnaround. Their service is essential for the quality of the conference.
Special thanks also go to our sponsors, without whom this event would not be possible:
Gold Sponsors: Metaphacts, Pantopix, PoolParty, TopQuadrant
Silver Sponsors: GNOSS, IOLAR, Ontotext, neo4j, RDFOX, The QA Company
Bronze Sponsor: RWS
Startup Sponsors: Karakun, SP Semantic Partners
Sincerely yours,
The Editors
Leipzig, September 2023
Hate speech comes in different forms depending on the communities targeted, often based on factors like gender, sexuality, race, or religion. Detecting it online is challenging because existing systems do not account for the diversity of hate based on the identity of the target and may be biased towards certain groups, leading to inaccurate results. Current language models perform well in identifying target communities, but only provide a probability that a hate speech text contains references to a particular group. This lack of transparency is problematic because these models learn biases from data annotated by individuals who may not be familiar with the target group. To improve hate speech detection, particularly target group identification, we propose a new hybrid approach that incorporates explicit knowledge about the language used by specific identity groups. We leverage a Knowledge Graph (KG) and adapt it, considering an appropriate level of abstraction, to recognise hate speech language related to gender and sexual orientation. A thorough quantitative and qualitative evaluation demonstrates that our approach is as effective as state-of-the-art language models while adapting better to domain and data changes. By grounding the task in explicit knowledge, we can better contextualise the results generated by our proposed approach with the language of the groups most frequently impacted by these technologies. Semantic enrichment helps us examine model outcomes and the training data used for hate speech detection systems, and handle ambiguous cases in human annotations more effectively. Overall, infusing semantic knowledge into hate speech detection is crucial for enhancing understanding of model behaviour and addressing biases derived from training data.
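To make the idea of grounding target-group identification in explicit knowledge concrete, the sketch below derives a lexicon of identity-group terms from a knowledge graph and flags their occurrence in text. This is a minimal illustration under our own assumptions (a Wikidata endpoint, a subclass-closure query, naive token matching), not the authors' implementation.

```python
# Illustrative sketch only: derive a lexicon from a KG and flag mentions.
# Endpoint, query pattern, seed concept, and matching are assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

def kg_lexicon(seed_uri: str) -> set:
    """Collect English labels of concepts in the subclass closure of a seed."""
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                           agent="lexicon-demo/0.1")
    sparql.setQuery(f"""
        SELECT ?label WHERE {{
          ?c wdt:P279* <{seed_uri}> ;
             rdfs:label ?label .
          FILTER(LANG(?label) = "en")
        }}""")
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return {r["label"]["value"].lower() for r in rows}

def flag_mentions(text: str, lexicon: set) -> list:
    """Return lexicon terms occurring in the text (naive token matching)."""
    return [tok for tok in text.lower().split() if tok in lexicon]
```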
Purpose:
This study addresses the limitations of the current short abstracts of DBpedia entities, which often lack a comprehensive overview due to the method by which they are created (i.e., selecting the first two to three sentences of the full DBpedia abstracts).
Methodology:
We leverage pre-trained language models to generate abstractive summaries of DBpedia abstracts in six languages (English, French, German, Italian, Spanish, and Dutch). We performed several experiments to assess the quality of the summaries generated by these language models. In particular, we evaluated the generated summaries using human judgments and automated metrics (Self-ROUGE and BERTScore). Additionally, we studied the correlation between human judgments and automated metrics in evaluating the generated summaries under different aspects: informativeness, coherence, conciseness, and fluency.
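As a concrete illustration of this methodology, the sketch below pairs an off-the-shelf BART summarizer with a BERTScore comparison of the summary against the full abstract. The model name, the example text, and the generation parameters are our illustrative choices, not necessarily those used in the paper.

```python
# Minimal sketch, not the authors' exact pipeline: BART summarization of a
# DBpedia-style abstract, scored against the full text with BERTScore.
from transformers import pipeline
from bert_score import score

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
full_abstract = ("Leipzig is the most populous city in the German state of "
                 "Saxony. It lies about 160 km southwest of Berlin ...")
summary = summarizer(full_abstract, max_length=40, min_length=10)[0]["summary_text"]

# BERTScore precision/recall/F1 of the summary w.r.t. the full abstract
P, R, F1 = score([summary], [full_abstract], lang="en")
print(summary)
print("BERTScore F1:", F1.item())
```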
Findings:
Pre-trained language models generate summaries that are more concise and informative than the existing short abstracts. Specifically, BART-based models effectively overcome the limitations of DBpedia short abstracts, especially for longer ones. Moreover, we show that BERTScore and ROUGE-1 are reliable metrics for assessing the informativeness and coherence of the generated summaries with respect to the full DBpedia abstracts. We also find a negative correlation between conciseness and human ratings. Furthermore, fluency evaluation remains challenging without human judgment.
Value:
This study has significant implications for various applications in machine learning and natural language processing that rely on DBpedia resources. By providing succinct and comprehensive summaries, our approach enhances the quality of DBpedia abstracts and contributes to the semantic web community.
Knowledge Graph Question Answering (KGQA) systems enable access to semantic information for any user who can compose a question in natural language. KGQA systems are now a core component of many industrial applications, including chatbots and conversational search applications. Although distinct worldwide cultures speak different languages, the coverage of KGQA systems and their resources is largely limited to English. To implement KGQA systems worldwide, we need to expand the current KGQA resources to languages other than English. Given the recent popularity of large-scale language models, we believe that providing quality resources is key to the development of future pipelines. One of these resources is the datasets used to train and test KGQA systems. Among the few multilingual KGQA datasets available, only one covers Spanish, i.e., QALD-9. We reviewed the Spanish translations in the QALD-9 dataset and confirmed several issues that may affect the quality of KGQA systems. We therefore created new Spanish translations for this dataset and reviewed them manually with the help of native speakers. The result provides newly created, high-quality translations for QALD-9; we call this extension QALD-9-ES. We merged these translations into the QALD-9-plus dataset, which provides trustworthy native translations for QALD-9 in nine languages, with the intention of creating one complete source of high-quality translations. We compared the new translations with the original QALD-9 ones using language-agnostic quantitative text analysis measures and found improvements in the results of the new translations. Finally, we compared both translations using the GERBIL QA benchmark framework with a KGQA system that supports Spanish. Although the question-answering scores improved only slightly, we believe that improving the quality of the existing translations will result in better KGQA systems and therefore increase the applicability of KGQA to the Spanish language.
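The measures themselves are not enumerated in this abstract; the sketch below shows the kind of language-agnostic comparison involved, using average question length and type-token ratio on invented example questions.

```python
# Hedged sketch: simple language-agnostic measures for comparing two sets of
# translations. The paper's exact measures may differ; examples are invented.
def avg_length(questions):
    """Average number of whitespace-separated tokens per question."""
    return sum(len(q.split()) for q in questions) / len(questions)

def type_token_ratio(questions):
    """Lexical diversity: distinct tokens divided by total tokens."""
    tokens = [t.lower() for q in questions for t in q.split()]
    return len(set(tokens)) / len(tokens)

machine = ["¿Quién es el alcalde de Berlín?", "¿Dónde nació Goethe?"]
revised = ["¿Quién es la alcaldesa de Berlín?", "¿Dónde nació Goethe?"]
print(avg_length(machine), avg_length(revised))
print(type_token_ratio(machine), type_token_ratio(revised))
```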
To fully harness the potential of data, the creation of machine-readable data and utilization of the FAIR Data Principles is vital for successful data-driven science. Ontologies serve as the foundation for generating semantically rich, FAIR data that machines can understand, enabling seamless data integration and exchange across scientific disciplines. In this paper, we introduce a versatile Terminology Service that supports various tasks, including discovery, provision, as well as ontology design and curation. This service offers unified access to a vast array of ontologies across scientific disciplines, encouraging their reuse, improvement, and maturation. We present a user-driven service development approach, along with a use case involving a collaborative ontology design process, engaging domain experts, knowledge workers, and ontology engineers. This collaboration incorporates the application and evaluation of the Terminology Service, as well as supplementary tools, workflows, and collaboration models. We demonstrate the feasibility, prerequisites, and ongoing challenges related to developing Terminology Services that address numerous aspects of ontology utilization for producing FAIR, machine-actionable data.
The aim of this study is to identify idiomatic expressions in English using perplexity as a measure. The assumption is that idiomatic expressions cause higher perplexity than literal expressions given a reference text. Perplexity in our study is calculated based on n-grams of (i) PoS tags, (ii) tokens, and (iii) thematic roles within the boundaries of a sentence. In the setting of our study, we observed that perplexity in none of the contexts (i), (ii), and (iii) manages to distinguish idiomatic expressions from literal ones. We postulate that larger, extra-sentential contexts should be used for the determination of perplexity. In addition, the number of thematic roles in (iii) should be reduced to a smaller set of basic roles in order to avoid a uniform distribution of n-grams.
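For readers unfamiliar with the setup, the toy sketch below computes sentence perplexity under a bigram token model estimated from a reference text, with Laplace smoothing for unseen bigrams; this corresponds to context (ii) above, while the PoS-tag and thematic-role variants are analogous. The example sentences are invented.

```python
# Toy sketch of n-gram perplexity (token bigrams, Laplace smoothing).
import math
from collections import Counter

def bigram_perplexity(sentence, reference):
    unigrams = Counter(reference)
    bigrams = Counter(zip(reference, reference[1:]))
    vocab = len(set(reference))
    log_prob = 0.0
    for w1, w2 in zip(sentence, sentence[1:]):
        p = (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)  # smoothed
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(sentence) - 1))

reference = "he kicked the ball into the goal and kicked the habit".split()
print(bigram_perplexity("he kicked the ball".split(), reference))   # literal
print(bigram_perplexity("he kicked the bucket".split(), reference)) # idiomatic, higher
```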
Purpose:
The query language GraphQL has gained significant traction in recent years. In particular, it has recently gained the attention of the semantic web and graph database communities and is now often used as a means to query knowledge graphs. Most of the storage solutions that support GraphQL rely on a translation layer that maps GraphQL queries to another query language they support natively, for example SPARQL.
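To illustrate such a translation layer, consider a nested GraphQL query and one possible hand-written SPARQL rendering; the vocabulary (schema.org terms) is chosen for illustration and is not taken from the paper.

```python
# Illustrative GraphQL query and a plausible SPARQL translation of it.
graphql_query = """
{
  authors {
    name
    books { title }
  }
}
"""

sparql_query = """
PREFIX schema: <http://schema.org/>
SELECT ?name ?title WHERE {
  ?author a schema:Person ;
          schema:name ?name .
  OPTIONAL {            # nested selection sets become left joins
    ?book schema:author ?author ;
          schema:name ?title .
  }
}
"""
```

The nested, optional structure of GraphQL selection sets maps naturally to SPARQL OPTIONAL blocks, i.e., left joins, which is the operation the multi-way left-join algorithm described below executes natively over RDF.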
Methodology:
Our main innovation is a multi-way left-join algorithm inspired by worst-case optimal multi-way join algorithms. This novel algorithm enables the native execution of GraphQL queries over RDF knowledge graphs. We evaluate our approach in two settings using the LinGBM benchmark generator.
Findings:
The experimental results suggest that our solution outperforms the state-of-the-art graph storage solution for GraphQL with respect to both query runtimes and scalability.
Value:
Our solution is implemented in an open-source triple store and is intended to advance the development of representation-agnostic storage solutions for knowledge graphs.
This paper presents the construction of a Knowledge Graph (KG) of Educational Resources (ERs), in which RDF reification is essential. Each ER is described by the subjects it covers, together with the relevance of each subject; RDF reification is used to attach this relevance to the subject statements. Multiple reification models exist, with distinct syntax and performance implications for storage and query processing. This study aims to experimentally compare four statement-based reification models across four triplestores to determine the most pertinent choice for our KG. We built four versions of the KG, one per reification model, namely standard reification, singleton properties, named graphs, and RDF-star, which were obtained using RML mappings. Each of the four triplestores (Virtuoso, Jena, Oxigraph, and GraphDB) was set up four times (except for Virtuoso, which does not support RDF-star), and seven different SPARQL queries were evaluated experimentally. This study shows that standard reification and named graphs lead to good performance. It also shows that, in the particular context of the used KG, Virtuoso outperforms Jena, GraphDB, and Oxigraph in most queries. The recent specification of RDF-star and SPARQL-star sheds light on statement-level annotations. The empirical study reported in this paper contributes to the efforts towards the efficient usage of RDF reification. In addition, this paper shares the pipeline of the KG construction using standard semantic web technologies.
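For readers unfamiliar with the compared models, the rdflib sketch below attaches a relevance score to a subject statement under two of the four models: standard reification and named graphs. The resource names and the relevance property are illustrative placeholders, not the paper's actual vocabulary.

```python
# Minimal rdflib sketch of two reification models; names are illustrative.
from rdflib import Dataset, Graph, Literal, Namespace, RDF, URIRef

EX = Namespace("http://example.org/")

# (1) Standard reification: an rdf:Statement resource carries the relevance.
g = Graph()
stmt = EX.stmt1
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.resource1))
g.add((stmt, RDF.predicate, EX.coversSubject))
g.add((stmt, RDF.object, EX.algebra))
g.add((stmt, EX.relevance, Literal(0.8)))

# (2) Named graphs: the triple lives in its own graph, and the relevance
# is attached to the graph name in the default graph.
ds = Dataset()
ng = ds.graph(URIRef("http://example.org/g1"))
ng.add((EX.resource1, EX.coversSubject, EX.algebra))
ds.add((URIRef("http://example.org/g1"), EX.relevance, Literal(0.8)))
```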
Purpose:
A previous paper proposed a bidirectional A* search algorithm for quickly finding meaningful paths in Wikidata that leverages semantic distances between entities as part of the search heuristics. However, that work lacks, among other things, an optimization of the algorithm's hyperparameters and an evaluation on a large dataset. The purpose of the present paper is to address these open points.
Methodology:
Approaches aimed at enhancing the accuracy of the semantic distances are discussed. Furthermore, different options for constructing a dataset of dual-entity queries for pathfinding in Wikidata are explored. 20% of the compiled dataset is used to fine-tune the algorithm's hyperparameters using the Simple optimizer. The optimized configuration is subsequently evaluated against alternative configurations, including a baseline, on the remaining 80% of the dataset.
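As background, the simplified sketch below shows how a semantic distance between entity embeddings can serve as the heuristic in an A* search. The paper's BiPaSs algorithm is a bidirectional variant with the tuned hyperparameters reported below; this unidirectional sketch, with assumed uniform edge costs and cosine distance, only conveys the core idea.

```python
# Simplified, unidirectional A* with a semantic-distance heuristic.
import heapq
from math import sqrt

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def a_star(graph, embed, start, goal):
    """graph: node -> iterable of neighbours; embed: node -> vector."""
    frontier = [(0.0, start, [start])]
    seen = set()
    while frontier:
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt in graph.get(node, []):
            if nxt not in seen:
                g = len(path)  # edges in the extended path (uniform cost)
                h = cosine_distance(embed[nxt], embed[goal])
                heapq.heappush(frontier, (g + h, nxt, path + [nxt]))
    return None
```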
Findings:
The additional consideration of entity descriptions increases the accuracy of the semantic distances. A dual-entity query dataset with 1,196 entity pairs is derived from the TREC 2007 Million Query Track dataset. The optimization yields the values 0.699/0.109/0.823 for the hyperparameters. This configuration achieves a higher coverage of the test set (79.2%) with few entity visits (24.7 on average) and moderate path lengths (4.4 on average). For reproducibility, the implementation called BiPaSs, the query dataset, and the benchmark results are provided.
Value:
Web search engines reliably generate knowledge panels with summarizing information only in response to queries mentioning a single entity. This paper shows that quickly finding paths between unseen entities in Wikidata is feasible. Based on these paths, knowledge panels for dual-entity queries can be generated that provide an explanation of the mentioned entities’ relationship, potentially satisfying the users’ information need.
Purpose:
Information about the biographies of museum objects (object provenance) is often unavailable in machine-readable format. This limits the findability and reusability of object provenance information for domain research. We address the challenges of defining a data model to represent the provenance of ethnographic cultural heritage objects, which includes multiple interpretations (polyvocality) of, and theories for, the object biography, chains of custody, and the context of acquisition.
Methodology:
To develop a data model for representing the provenance of ethnographic objects, we conducted (semi-)structured interviews with five provenance experts to elicit a set of requirements. Based on these requirements and a careful examination of six diverse examples of ethnographic object provenance reports, we established a set of modelling choices that utilise existing ontologies such as CIDOC-CRM (a domain standard) and PROV-DM, as well as RDF named graphs.
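The sketch below illustrates how named graphs can carry polyvocal provenance claims: two conflicting acquisition claims are stored in separate named graphs, each attributed to its source using PROV. Apart from the PROV terms, all names are illustrative placeholders rather than the paper's actual model.

```python
# Minimal polyvocality sketch: conflicting claims in separate named graphs.
from rdflib import Dataset, Namespace, URIRef
from rdflib.namespace import PROV

EX = Namespace("http://example.org/")
ds = Dataset()

claim1 = ds.graph(URIRef("http://example.org/claim/1"))
claim1.add((EX.object42, EX.acquiredBy, EX.collectorA))

claim2 = ds.graph(URIRef("http://example.org/claim/2"))
claim2.add((EX.object42, EX.acquiredBy, EX.donorB))

# Attribute each interpretation to its source in the default graph.
ds.add((URIRef("http://example.org/claim/1"), PROV.wasAttributedTo, EX.archiveReport))
ds.add((URIRef("http://example.org/claim/2"), PROV.wasAttributedTo, EX.oralHistory))
```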
Evaluation:
Finally, we validate the model on provenance reports covering six seen and five unseen ethnographic cultural heritage objects from three separate sources. The 11 reports are converted into RDF triples following the proposed data model. We also constructed SPARQL queries corresponding to nine competency questions elicited from domain experts and report on their satisfiability.
Findings:
The results show that the adapted combined model allows us to express the heterogeneity and polyvocality of the object provenance information, trace data provenance and link with other data sources for further enrichment.
Value:
The proposed model from this paper allows publishing such knowledge in a machine-readable format, which will foster information contextualisation, findability and reusability.
Purpose:
Knowledge graphs have been used intensively in the cultural heritage domain. Current interaction paradigms and interfaces, however, are often limited to textual representations or 2D visualizations, not taking into account the 4D nature of the data. In digital history in particular, where events as well as geographical and temporal relationships play an important role, exploration paradigms that take into account the 4D nature of event-related data are important, as they have the potential to support historians in generating new knowledge and discovering new relationships. In this paper, we explore the potential of virtual reality as a paradigm allowing digital humanities researchers, historians in particular, to explore a semantic 4D space defined by knowledge graphs from an egocentric perspective.
Methodology:
We present eTaRDiS: a virtual reality-based tool supporting immersive exploration of knowledge graphs. We evaluate the tool in the context of a task in which historians and laypersons with a history background explore DBpedia and Wikidata. We report the results of a study involving 13 subjects who interacted with the data in eTaRDiS in the context of a specific task, in order to gain insights into users' interaction patterns with our system. The usability of the tool was evaluated using a questionnaire comprising questions from the System Usability Scale (SUS) in addition to task-specific questions.
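For reference, SUS scores such as those reported in the Findings follow the scale's standard formula: ten items rated 1 to 5, where odd items contribute (rating - 1) and even items contribute (5 - rating), and the sum is scaled by 2.5 to a 0-100 range.

```python
# Standard SUS scoring; the ratings below are invented example responses.
def sus_score(ratings):
    assert len(ratings) == 10
    total = sum(r - 1 if i % 2 == 0 else 5 - r  # 0-based: even i = items 1,3,5,...
                for i, r in enumerate(ratings))
    return total * 2.5

print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))  # -> 75.0
```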
Findings:
The usability evaluation showed that our tool achieved an overall SUS score of 71.92, corresponding to a 'satisfactory' rating. While the mean score reached by laypersons with a history background was quite high at 76.0, corresponding to a rating of 'excellent', the score for historians was lower at 69.4, corresponding to a 'sufficient to satisfactory' rating. A qualitative analysis of the interaction data revealed that participants quickly identified the relevant information in the tasks using a variety of strategies and taking advantage of the features provided in eTaRDiS.
Value:
eTaRDiS is, to our knowledge, the first virtual reality-based tool supporting the exploration of knowledge graphs. The findings of the usability evaluation and the qualitative analysis of exploration patterns show that the system could be a valuable tool for allowing digital humanities researchers to explore knowledge graphs as a way to discover new relationships between historical events and persons of interest.
Preserving historical city architectures and making them (publicly) available has emerged as an important field of the cultural heritage and digital humanities research domain. In this context, the TRANSRAZ project is creating an interactive 3D environment of the historical city of Nuremberg spanning different periods of time. Next to the exploration of the city's historical architecture, TRANSRAZ is also integrating information about its inhabitants, organizations, and important events, which are extracted semi-automatically from historical documents. Knowledge graphs have proven useful and valuable for integrating and enriching these heterogeneous data. However, this task also comes with versatile data modeling challenges. This paper contributes the TRANSRAZ data model, which integrates agents, architectural objects, events, and historical documents into the 3D research environment by means of ontologies. The goal is to explore Nuremberg's multifaceted past in different time layers in the context of its architectural, social, economic, and cultural developments.
Purpose:
Data integration and applications across knowledge graphs (KGs) rely heavily on the discovery of links between resources within these KGs. Geospatial link discovery algorithms have to deal with millions of point sets containing billions of points.
Methodology:
To speed up the discovery of geospatial links, we propose COBALT. COBALT combines content measures with R-tree indexing. The content measures are based on the area, diagonal, and distance of the minimum bounding boxes of the polygons, which speeds up the process but is not perfectly accurate. We therefore propose two polygon splitting approaches for improving the accuracy of COBALT.
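The sketch below illustrates the two ingredients named above: content measures computed from minimum bounding boxes, and an R-tree used to retrieve candidate pairs before any exact geometry test. It relies on the shapely and rtree packages; the specific measures and the toy polygons are our illustrations, not COBALT's actual code.

```python
# Bounding-box content measures plus R-tree candidate filtering (sketch).
from math import hypot
from rtree import index
from shapely.geometry import Polygon

def bbox_measures(poly):
    """Area and diagonal of the polygon's minimum bounding box."""
    minx, miny, maxx, maxy = poly.bounds
    return (maxx - minx) * (maxy - miny), hypot(maxx - minx, maxy - miny)

def bbox_center_distance(p, q):
    """Distance between the bounding-box centers of two polygons."""
    (ax0, ay0, ax1, ay1), (bx0, by0, bx1, by1) = p.bounds, q.bounds
    return hypot((ax0 + ax1) / 2 - (bx0 + bx1) / 2,
                 (ay0 + ay1) / 2 - (by0 + by1) / 2)

source = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
targets = [Polygon([(1, 1), (3, 1), (3, 3), (1, 3)]),
           Polygon([(10, 10), (11, 10), (11, 11), (10, 11)])]

# Index target bounding boxes; only candidates whose boxes intersect the
# source box survive, so exact checks run on a much smaller set.
idx = index.Index()
for i, t in enumerate(targets):
    idx.insert(i, t.bounds)
for i in idx.intersection(source.bounds):
    print(i, bbox_measures(targets[i]), bbox_center_distance(source, targets[i]))
```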
Findings:
Our experiments on real-world datasets show that COBALT is able to speed up the topological relation discovery over geospatial KGs by up to 1.47 × 10⁴ times over state-of-the-art linking algorithms while maintaining an F-Measure between 0.7 and 0.9 depending on the relation. Furthermore, we were able to achieve an F-Measure of up to 0.99 by applying our polygon splitting approaches before applying the content measures.
Value:
The process of discovering links between geospatial resources can be made significantly faster by sacrificing the optimality of the results. This is especially important for real-time, data-driven applications such as emergency response, location-based services, and traffic management. In future work, additional measures, such as the location of polygons or the name of the entity represented by a polygon, could be integrated to further improve the accuracy of the results.
As the number of RDF datasets published on the semantic web continues to grow, it becomes increasingly important to efficiently link similar entities between these datasets. However, the performance of existing data linking tools, often developed for general purposes, seems to have reached a plateau, suggesting the need for more modular and efficient solutions. In this paper, we propose, and formalize in OWL, a classification of the different Linking Problem Types (LPTs) to help the linked data community identify problems upstream and develop more efficient solutions. Our classification is based on descriptions of heterogeneity reported in the literature (five articles in particular) and identifies five main types of linking problems: predicate value problems, predicate problems, class problems, subgraph problems, and graph problems. By classifying LPTs, we provide a framework for understanding and addressing the challenges associated with semantic data linking. It can be used to develop new solutions based on existing modularized tools addressing specific LPTs, thus improving the overall efficiency of data linking.
Purpose:
Following the impact of the GDPR on the regulation of the use of personal data of European citizens, the European Commission is now focused on implementing a common data strategy to promote the (re)use and sharing of data between citizens, companies and governments while maintaining it under the control of the entities that generated it. In this context, the Data Governance Act (DGA) emphasizes the altruistic reuse of data and the emergence of data intermediaries as trusted entities that do not have an interest in analysing the data itself and act only as enablers of the sharing of data between data holders and data users.
Methodology:
In order to address DGA’s new requirements, this work investigates how to apply existing Semantic Web vocabularies to (1) generate machine-readable policies for the reuse of public data, (2) specify data altruism consent terms and (3) create uniform registers of data altruism organisations and intermediation services’ providers.
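As an illustration of point (1), a machine-readable data-reuse policy could look as follows, here expressed with the ODRL vocabulary. ODRL is one plausible choice; the paper's exact modelling, and the DGA-specific terms published at https://w3id.org/dgaterms, may differ. The snippet parses the policy with rdflib to confirm it is well-formed JSON-LD.

```python
# Hypothetical ODRL policy permitting reuse of a dataset for research.
from rdflib import Graph

policy = """
{
  "@context": "http://www.w3.org/ns/odrl.jsonld",
  "@type": "Set",
  "uid": "http://example.org/policy/1",
  "permission": [{
    "target": "http://example.org/dataset/1",
    "action": "use",
    "constraint": [{
      "leftOperand": "purpose",
      "operator": "eq",
      "rightOperand": "http://example.org/purpose/ScientificResearch"
    }]
  }]
}
"""
g = Graph().parse(data=policy, format="json-ld")
print(len(g), "triples")
```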
Findings:
In addition to promoting machine-readability and interoperability, the use of the identified semantic vocabularies eases the modelling of data-sharing policies and consent forms across different use cases and provides a common semantic model to keep a public register of data intermediaries and altruism organisations, as well as records of their activities. Since these vocabularies are openly accessible and easily extendable, the modelling of new terms that cater to DGA-specific requirements is also facilitated.
Value:
The main results are an ad-hoc vocabulary with the new terms and examples of usage, which are available at https://w3id.org/dgaterms. In future research, this work can be used to automate the generation of documentation for the new DGA data-sharing entities and be extended to deal with requirements from other data-related regulations.
Recording and documenting human and AI-driven normative decision-making processes has so far been highly challenging. We focus on the challenge of normative coordination: the process by which stakeholders in a community understand and agree what norms they abide by. Our aim is to develop and formalize the FLINT language, which allows a high-level description of normative systems. FLINT enables legal experts to agree on norms, while also serving as a basis for technical implementation.
Our contribution consists of the development of an ontology for FLINT and its RDF/OWL implementation, which we have made openly accessible. We designed the ontology on the basis of competency questions. Additionally, we validated the ontology by modeling example cases and using the ontology's data model in software tooling.
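As a generic illustration of validating an ontology against competency questions, the sketch below renders one question as a SPARQL query and runs it over toy example data. The class and property names are hypothetical placeholders, not actual FLINT ontology terms.

```python
# Competency-question validation sketch; all names are hypothetical.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/flint-demo#")
g = Graph()
g.add((EX.grantPermit, EX.hasActor, EX.municipality))  # invented example act
g.add((EX.grantPermit, EX.hasRecipient, EX.applicant))

# Competency question: "Which actors may perform which normative acts?"
results = g.query("""
    PREFIX ex: <http://example.org/flint-demo#>
    SELECT ?act ?actor WHERE { ?act ex:hasActor ?actor . }
""")
for act, actor in results:
    print(act, actor)
```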
Quantum computing is currently experiencing rapid progress. Due to the complexity and continuous growth of knowledge in this field, it is essential to store information in a way that allows easy access to, analysis of, and navigation over reliable resources. Knowledge graphs (KGs) with machine-readable semantics offer a structured representation of information and can enhance the capabilities of knowledge processing and information retrieval. In this paper, we extend the platform and ecosystem for quantum applications (PlanQK) with a KG. Specifically, we describe how the quantum computing knowledge submitted to the platform by researchers and industry actors is incorporated into the graph. Moreover, we outline the semantic search over the PlanQK KG.