Wim Martens

Papers

Partial Order Multiway Search

ACM Transactions on Database Systems

Partial order multiway search (POMS) is a fundamental problem that finds applications in crowdsourcing, distributed file systems, software testing, and more. This problem involves an interaction between an algorithm \(\mathcal{A}\) and an oracle, conducted on a directed acyclic graph \(\mathcal{G}\) known to both parties. Initially, the oracle selects a vertex \(t\) in \(\mathcal{G}\) called the target. Subsequently, \(\mathcal{A}\) must identify the target vertex by probing reachability. In each probe, \(\mathcal{A}\) selects a set \(Q\) of vertices in \(\mathcal{G}\), the number of which is limited by a pre-agreed value \(k\). The oracle then reveals, for each vertex \(q \in Q\), whether \(q\) can reach the target in \(\mathcal{G}\). The objective of \(\mathcal{A}\) is to minimize the number of probes. We propose an algorithm to solve POMS in \(O(\log_{1+k} n + \frac{d}{k} \log_{1+d} n)\) probes, where \(n\) represents the number of vertices in \(\mathcal{G}\), and \(d\) denotes the...
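The probe protocol described in this abstract can be illustrated with a minimal Python sketch. It is not the algorithm proposed in the paper: it is a naive candidate-elimination baseline, written only to make the interaction concrete. The names `reachable`, `make_oracle`, and `find_target` are hypothetical, the graph is assumed to be an adjacency dictionary listing every vertex as a key, and reachability is assumed to be reflexive (every vertex reaches itself).

```python
from typing import Callable, Dict, List, Set, Tuple

Graph = Dict[str, List[str]]  # assumption: every vertex appears as a key


def reachable(graph: Graph, src: str) -> Set[str]:
    """Vertices reachable from src, including src itself (assumed reflexive)."""
    seen = {src}
    stack = [src]
    while stack:
        v = stack.pop()
        for w in graph.get(v, []):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def make_oracle(graph: Graph, target: str) -> Callable[[List[str]], Dict[str, bool]]:
    """Oracle that answers, for each probed vertex q, whether q reaches the target."""
    def probe(queries: List[str]) -> Dict[str, bool]:
        return {q: target in reachable(graph, q) for q in queries}
    return probe


def find_target(graph: Graph, probe, k: int) -> Tuple[str, int]:
    """Naive baseline: probe up to k vertices per round, discard inconsistent candidates."""
    reach = {v: reachable(graph, v) for v in graph}
    candidates = set(graph)
    probes = 0
    while len(candidates) > 1:
        # Only vertices whose reachable set splits the candidates are informative.
        # In a DAG at least one such vertex exists while two candidates remain.
        splitting = [v for v in graph
                     if 0 < len(reach[v] & candidates) < len(candidates)]
        # Prefer the most balanced splits.
        splitting.sort(key=lambda v: abs(2 * len(reach[v] & candidates) - len(candidates)))
        answers = probe(splitting[:k])
        probes += 1
        for q, yes in answers.items():
            candidates = candidates & reach[q] if yes else candidates - reach[q]
    return candidates.pop(), probes


if __name__ == "__main__":
    dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
    print(find_target(dag, make_oracle(dag, "c"), k=1))
```

Each round removes at least one candidate, so this baseline needs at most \(n - 1\) probes; it makes no attempt to reach the bound stated in the abstract.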

A Researcher's Digest of GQL

HAL (Le Centre pour la Communication Scientifique Directe), Mar 28, 2023

Efficient Incremental Evaluation of Succinct Regular Expressions

Regular expressions are omnipresent in database applications. They form the structural core of schema languages for XML, they are a fundamental ingredient for navigational queries in graph databases, and are being considered in languages for upcoming technologies such as schema- and transformation languages for tabular data on the Web. In this paper we study the usage and effectiveness of the counting operator (or: limited repetition) in regular expressions. The counting operator is a popular extension which is part of the POSIX standard and therefore also present in regular expressions in grep, Java, Python, Perl, and Ruby. In a database context, expressions with counting appear in XML Schema and languages for querying graphs such as SPARQL 1.1 and Cypher. We first present a practical study that suggests that counters are extensively used in practice. We then investigate evaluation methods for such expressions and develop a new algorithm for efficient incremental evaluation. Finally, we conduct an extensive benchmark study that shows that exploiting counting operators can lead to speed-ups of several orders of magnitude in a wide range of settings: normal and incremental evaluation on synthetic and real expressions.
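As a small illustration of the counting operator mentioned in this abstract, the sketch below uses Python's standard `re` module to compare a counted expression with a hand-unrolled equivalent. The two patterns and the helper name `agree` are illustrative assumptions; the sketch says nothing about the evaluation algorithm developed in the paper.

```python
import re

# Counting operator: between 2 and 3 repetitions of "ab".
counted = re.compile(r"(?:ab){2,3}")

# The same language written without counting; unrolled expressions
# grow with the size of the bounds, which is why counting is succinct.
unrolled = re.compile(r"abab(?:ab)?")


def agree(text: str) -> bool:
    """True when both patterns give the same full-match verdict on text."""
    return bool(counted.fullmatch(text)) == bool(unrolled.fullmatch(text))


if __name__ == "__main__":
    for s in ["ab", "abab", "ababab", "abababab"]:
        print(s, bool(counted.fullmatch(s)), agree(s))
```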

Split-Correctness in Information Extraction

arXiv (Cornell University), Oct 8, 2018

Threshold queries in theory and in the wild

Proceedings of the VLDB Endowment

Threshold queries are an important class of queries that only require computing or counting answers up to a specified threshold value. To the best of our knowledge, threshold queries have been largely disregarded in the research literature, which is surprising considering how common they are in practice. In this paper, we present a deep theoretical analysis of threshold query evaluation and show that thresholds can be used to significantly improve the asymptotic bounds of state-of-the-art query evaluation algorithms. We also empirically show that threshold queries are significant in practice. In surprising contrast to conventional wisdom, we found important scenarios in real-world data sets in which users are interested in computing the results of queries up to a certain threshold, independent of a ranking function that orders the query results.
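The idea of counting answers only up to a threshold can be sketched in a few lines of Python. This is a generic early-termination illustration over a lazy answer stream, not one of the evaluation algorithms analyzed in the paper; the function names `count_up_to` and `has_at_least` are hypothetical.

```python
from itertools import islice
from typing import Iterable


def count_up_to(answers: Iterable, threshold: int) -> int:
    """Count answers, but stop consuming the stream after `threshold` of them."""
    return sum(1 for _ in islice(answers, threshold))


def has_at_least(answers: Iterable, threshold: int) -> bool:
    """Decide whether the stream yields at least `threshold` answers."""
    return count_up_to(answers, threshold) >= threshold


if __name__ == "__main__":
    # A very large lazy answer stream: only the first 5 items are ever produced.
    print(count_up_to(range(10**9), 5))
    print(has_at_least(iter(["x", "y"]), 5))
```

Because `islice` stops pulling from the underlying iterator once the threshold is reached, the cost depends on the threshold rather than on the total number of answers.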

Querying Graphs with Data

Journal of the ACM, 2016

Graph databases have received much attention as of late due to numerous applications in which data is naturally viewed as a graph; these include social networks, RDF and the Semantic Web, biological databases, and many others. There are many proposals for query languages for graph databases that mainly fall into two categories. One views graphs as a particular kind of relational data and uses traditional relational mechanisms for querying. The other concentrates on querying the topology of the graph. These approaches, however, lack the ability to combine data and topology, which would allow queries asking how data changes along paths and patterns enveloping it. In this article, we present a comprehensive study of languages that enable such combination of data and topology querying. These languages come in two flavors. The first follows the standard approach of path queries, which specify how labels of edges change along a path, but now we extend them with ways of specifying how both...

Regular Expressions with Counting: Weak versus Strong Determinism

SIAM Journal on Computing, 2012

PG-Schema: Schemas for Property Graphs

Proceedings of the ACM on Management of Data

Property graphs have reached a high level of maturity, witnessed by multiple robust graph database systems as well as the ongoing ISO standardization effort aiming at creating a new standard Graph Query Language (GQL). Yet, despite documented demand, schema support is limited both in existing systems and in the first version of the GQL Standard. It is anticipated that the second version of the GQL Standard will include a rich DDL. Aiming to inspire the development of GQL and enhance the capabilities of graph database systems, we propose PG-Schema, a simple yet powerful formalism for specifying property graph schemas. It features PG-Schema with flexible type definitions supporting multi-inheritance, as well as expressive constraints based on the recently proposed PG-Keys formalism. We provide the formal syntax and semantics of PG-Schema, which meet principled design requirements grounded in contemporary property graph management scenarios, and offer a detailed comparison of its featu...

GPC: A Pattern Calculus for Property Graphs

arXiv (Cornell University), Oct 29, 2022

Front Matter, Table of Contents, Preface, Conference Organization, External Reviewers, List of Authors

Minimization of tree pattern queries

ACM SIGMOD Record, 2001

Tree patterns form a natural basis to query tree-structured data such as XML and LDAP. Since the efficiency of tree pattern matching against a tree-structured database depends on the size of the pattern, it is essential to identify and eliminate redundant nodes in the pattern and do so as quickly as possible. In this paper, we study tree pattern minimization both in the absence and in the presence of integrity constraints (ICs) on the underlying tree-structured database. When no ICs are considered, we call the process of minimizing a tree pattern constraint-independent minimization. We develop a polynomial time algorithm called CIM for this purpose. CIM's efficiency stems from two key properties: (i) a node cannot be redundant unless its children are, and (ii) the order of elimination of redundant nodes is immaterial. When ICs are considered for minimization, we refer to it as constraint-dependent minimization. For tree-structured databases, required child/descendant and type ...

Optimizing tree pattern queries: why cutting is not enough (invited talk)

Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing

Tree pattern queries are a natural language for querying graph- and tree-structured data. A central question for understanding their optimization problem was whether they can be minimized by cutting away redundant parts. This question has been studied since the early 2000s and was recently resolved.

Containment of Simple Conjunctive Regular Path Queries

Proceedings of the Seventeenth International Conference on Principles of Knowledge Representation and Reasoning

Testing containment of queries is a fundamental reasoning task in knowledge representation. We study here the containment problem for Conjunctive Regular Path Queries (CRPQs), a navigational query language extensively used in ontology and graph database querying. While it is known that containment of CRPQs is EXPSPACE-complete in general, we focus here on severely restricted fragments, which are known to be highly relevant in practice according to several recent studies. We obtain a detailed overview of the complexity of the containment problem, depending on the features used in the regular expressions of the queries, with completeness results for NP, \(\Pi_2^p\), PSPACE or EXPSPACE.

Graph Pattern Matching in GQL and SQL/PGQ

Proceedings of the 2022 International Conference on Management of Data

The Complexity of Regular Trail and Simple Path Queries on Undirected Graphs

Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Optimal Algorithms for Multiway Search on Partial Orders

Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Holding a Conference Online and Live due to Covid-19

ACM SIGMOD Record

The joint EDBT/ICDT conference (International Conference on Extending Database Technology / International Conference on Database Theory) is a well-established conference series on data management, with annual meetings in the second half of March that attract 250 to 300 delegates. Three weeks before EDBT/ICDT 2020 was planned to take place in Copenhagen, the rapidly developing Covid-19 pandemic led to the decision to cancel the face-to-face event. In the interest of the research community, it was decided to move the conference online while trying to preserve as much of the real-life experience as possible. As far as we know, we are one of the first conferences that moved to a fully synchronous online experience due to the Covid-19 outbreak. By fully synchronous, we mean that participants jointly listened to presentations, had live Q&A, and attended other live events associated with the conference. In this report, we share our decisions, experiences, and lessons learned.

Weight Annotation in Information Extraction

Logical Methods in Computer Science, 2022

The framework of document spanners abstracts the task of information extraction from text as a function that maps every document (a string) into a relation over the document's spans (intervals identified by their start and end indices). For instance, the regular spanners are the closure under the Relational Algebra (RA) of the regular expressions with capture variables, and the expressive power of the regular spanners is precisely captured by the class of VSet-automata, a restricted class of transducers that mark the endpoints of selected spans. In this work, we embark on the investigation of document spanners that can annotate extractions with auxiliary information such as confidence, support, and confidentiality measures. To this end, we adopt the abstraction of provenance semirings by Green et al., where tuples of a relation are annotated with the elements of a commutative semiring, and where the annotation propagates through the positive RA operators via the semiring opera...
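The notion of a spanner as a function from a document to a relation over spans can be approximated with Python's `re` module, treating named groups as capture variables. This is only a rough sketch under several assumptions: the function name `extract_spans` is hypothetical, spans are reported as Python's 0-based half-open `(start, end)` pairs, which may differ from the indexing convention used in the paper, and `re.finditer` returns only non-overlapping leftmost matches rather than every possible assignment.

```python
import re
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # 0-based, half-open: document[start:end]


def extract_spans(pattern: str, document: str) -> List[Dict[str, Span]]:
    """Map a document to a list of tuples assigning a span to each named group."""
    regex = re.compile(pattern)
    relation = []
    for match in regex.finditer(document):
        # One tuple per match; each capture variable is bound to a span.
        relation.append({var: match.span(var) for var in regex.groupindex})
    return relation


if __name__ == "__main__":
    doc = "ab12 cd345"
    rows = extract_spans(r"(?P<word>[a-z]+)(?P<num>\d+)", doc)
    for row in rows:
        print({var: (span, doc[span[0]:span[1]]) for var, span in row.items()})
```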

Foundations of Data Management (Dagstuhl Perspectives Workshop 16151)

Dagstuhl Reports, 2016

In this Workshop we have explored the degree to which principled foundations are crucial to the long-term success and effectiveness of the new generation of data management paradigms and applications, and investigated what forms of research need to be pursued to develop and advance these foundations. The workshop brought together specialists from the existing database theory community, and from adjoining areas, particularly from various subdisciplines within the Big Data community, to understand the challenge areas that might be resolved through principled foundations and mathematical theory.
