LNCS 9147
Data Science
30th British International Conference
on Databases, BICOD 2015
Edinburgh, UK, July 6–8, 2015, Proceedings
Lecture Notes in Computer Science 9147
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, Lancaster, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Zürich, Switzerland
John C. Mitchell
Stanford University, Stanford, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbrücken, Germany
More information about this series at http://www.springer.com/series/7409
Sebastian Maneth (Ed.)
Data Science
30th British International Conference
on Databases, BICOD 2015
Edinburgh, UK, July 6–8, 2015
Proceedings
Editor
Sebastian Maneth
University of Edinburgh
Edinburgh
UK
LNCS Sublibrary: SL3 – Information Systems and Applications, incl. Internet/Web, and HCI
This volume contains the papers presented at BICOD 2015: the 30th British Interna-
tional Conference on Databases held during July 6–8, 2015, in Edinburgh.
The BICOD Conference (formerly known as BNCOD) is a venue for the presen-
tation and discussion of research papers on a broad range of topics related to databases
and data-centric computation. The theme of BICOD 2015 was “Data Science”, i.e., the
extraction of meaning from big data. The conference featured three invited lectures and
three invited keynotes, all centered around the theme of data science.
This year, BICOD attracted 37 complete submissions from 14 different countries,
namely, Austria, China, Egypt, France, Germany, Ireland, Italy, The Netherlands,
Pakistan, Sweden, Switzerland, Tunisia, Turkey, and the UK. Each submission was
reviewed by at least three Program Committee members. The committee decided to
accept 19 papers on such topics as benchmarking, data integration, data replication,
deep learning, graph processing, linked data, log processing, main memory processing,
NoSQL querying, and social network analysis.
We would like to thank the authors for submitting their work to this year’s BICOD
conference, the Program Committee members for their help in selecting an excellent
conference program, and the distinguished speakers and lecturers for accepting our
invitation. We also thank Linda Hope and Magdalena Mazurczak for their involvement
in the local organization of the conference.
Program Committee
Philippe Bonnet IT University of Copenhagen, Denmark
Jan Van den Bussche University of Hasselt, Belgium
Bogdan Cautis University of Paris-Sud 11, France
James Cheney University of Edinburgh, UK
Barry Eaglestone University of Sheffield, UK
Wenfei Fan University of Edinburgh, UK
Alvaro Fernandes University of Manchester, UK
George Fletcher TU Eindhoven, The Netherlands
Georg Gottlob University of Oxford, UK
Anne James Coventry University, UK
Sebastian Maneth University of Edinburgh, UK (PC Chair)
Ioana Manolescu Inria Saclay, France
Peter McBrien Imperial College London, UK
Hannes Muehleisen CWI Amsterdam, The Netherlands
David Nelson University of Sunderland, UK
Dan Olteanu University of Oxford, UK
Norman Paton University of Manchester, UK
Peter Pietzuch Imperial College London, UK
Alex Poulovassilis Birkbeck College, University of London, UK
Juan Reutter PUC Chile, Chile
Tore Risch Uppsala University, Sweden
Mark Roantree Dublin City University, Ireland
Kai-Uwe Sattler TU Ilmenau, Germany
Sławek Staworko Inria and University of Lille, France
Jens Teubner TU Dortmund, Germany
John Wilson University of Strathclyde, UK
Peter Wood Birkbeck College, University of London, UK
Invited Speakers and Lecturers
Renée J. Miller (supported by Bell Canada, NSERC, and the NSERC Business Intelligence Network)
Nigel Shadbolt: Dealing with a Web of Data
Padhraic Smyth
Graham Cormode, University of Warwick (G.Cormode@Warwick.ac.uk)
Daniel Keim
Aaron Roth, University of Pennsylvania
Contents
Invited Lectures
Data Integration
Graph Data
Data Exploration
Scalability
Graham Cormode
1 Introduction
Business and scientific communities all agree that “big data” holds both tremen-
dous promise, and substantial challenges [8]. There is much potential for extract-
ing useful intelligence and actionable information from the large quantities of
data generated and captured by modern information processing systems. Big
data challenges involve not only the sheer volume of the data, but the fact that
it can represent a complex variety of entities and interactions between them, and
new observations that arrive, often across multiple locations, at high velocity.
Examples of applications that generate big data include:
– Physical Data from sensor deployments and scientific experiments—astronomy
data from modern telescopes generates terabytes of data each night, while the
data collected from a single particle physics experiment is too big to store;
– Medical Data, as we can now sequence whole genomes economically, generating
data sets of the order of 200TB in one example [7];
– Activity Data, as human activity data is captured and stored in ever greater
quantities and detail: interactions from online social networks, locations from
GPS, Internet activity etc.
Across all of these disparate settings, certain common themes emerge. The data
in question is large, and growing. The applications seek to extract patterns,
trends or descriptions of the data. Ensuring the scalability of systems, and the
timeliness and veracity of the analysis is vital in many of these applications. In
order to realize the promise of these sources of data, we need new methods that
can handle them effectively.
While such sources of big data are becoming increasingly common, the
resources to process them (chiefly, processor speed, fast memory and slower
disk) are growing at a slower pace. The consequence of this trend is that there
is an urgent need for more effort directed towards capturing and processing
data in many critical applications. Careful planning and scalable architectures
are needed to fulfill the requirements of analysis and information extraction on
big data. In response to these needs, new computational paradigms are being
adopted to deal with the challenge of big data. Large scale distributed computa-
tion is a central piece: the scope of the computation can exceed what is feasible
on a single machine, and so clusters of machines work together in parallel. On
top of these architectures, parallel algorithms are designed which can take the
complex task and break it into independent pieces suitable for distribution over
multiple machines.
A central challenge within any such system is how to compute and repre-
sent complex features of big data in a way that can be processed by many single
machines in parallel. A vital component is to be able to build and manipulate a
compact summary of a large amount of data. This powerful notion of a small sum-
mary, in all its many and varied forms, is the subject of this tutorial. The idea of
a summary is a natural and familiar one. It should represent something large and
complex in a compact fashion. Inevitably, a summary must dispense with some of
the detail and nuance of the object which it is summarizing. However, it should also
preserve some key features of the object in a very accurate fashion. Effective com-
pact summaries are often approximate in their answers to queries and randomized.
The theory of compact summaries can be traced back over four decades.
A first example is the Morris Approximate Counter, which approximately counts
quantities up to magnitude n using O(log log n) bits, rather than the log n
bits to count exactly [15]. Subsequently, there has been much interest in sum-
maries in the context of streaming algorithms: these are algorithms that process
data in the form of a stream of updates, and whose associated data structures
can be seen as a compact summary [16]. More recently, the more general notion
of mergeable summaries has arisen: summaries that can be computed on different
portions of a dataset in isolation, then subsequently combined to form a sum-
mary of the union of the inputs [1]. It turns out that a large number of streaming
algorithms entail a mergeable summary, hence making this class of objects a
large and interesting one.
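To make the counter example above concrete, the following is a minimal sketch (in Scala, not taken from the talk material) of a Morris-style approximate counter: only the exponent c is stored, an increment succeeds with probability 2^-c, and the estimate is 2^c - 1.

import scala.util.Random

// Morris approximate counter: stores only the exponent c (O(log log n) bits);
// an increment succeeds with probability 2^-c, and the estimate is 2^c - 1.
final class MorrisCounter(rng: Random = new Random) {
  private var c: Int = 0
  def increment(): Unit = if (rng.nextDouble() < math.pow(2.0, -c)) c += 1
  def estimate: Double = math.pow(2.0, c) - 1.0
}

object MorrisDemo extends App {
  val counter = new MorrisCounter
  (1 to 1000000).foreach(_ => counter.increment())
  println(f"estimate = ${counter.estimate}%.0f (true count = 1000000)")
}

A single run gives only a rough estimate; in practice the variance is usually reduced by averaging several independent counters or by using a base closer to 1.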
There has been much effort expended on summary techniques over recent years,
leading to the invention of powerful and effective summaries which have found
applications in Internet Service Providers [5], Search Engines [12,17], and beyond.
2 Outline
The accompanying talk will introduce the notion of summaries, and outline ideas
behind some of the most prominent examples, which include:
References
1. Agarwal, P., Cormode, G., Huang, Z., Phillips, J., Wei, Z.: Mergeable summaries.
ACM Principles Database Sys. 38(4), 1–28 (2012)
2. Ahn, K.J., Guha, S., McGregor, A.: Analyzing graph structure via linear measure-
ments. In: ACM-SIAM Symposium on Discrete Algorithms (2012)
3. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the
frequency moments. ACM Symp. Theor. Comput. 46(2), 20–29 (1996)
4. Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams.
In: Proceedings of the International Colloquium on Automata, Languages and
Programming (ICALP) (2002)
5. Cormode, G., Korn, F., Muthukrishnan, S., Johnson, T., Spatscheck, O.,
Srivastava, D.: Holistic UDAFs at streaming speeds. In: ACM SIGMOD Inter-
national Conference on Management of Data, pp. 35–46 (2004)
6. Cormode, G., Muthukrishnan, S.: An improved data stream summary: the Count-
Min sketch and its applications. J. Algorithms 55(1), 58–75 (2005)
7. Cravedi, K., Randall, T., Thompson. L.: 1000 genomes project data available on
Amazon Cloud. NIH News, March 2012
8. Cukier, K.: Data, data everywhere. The Economist, February 2010
9. Flajolet, P., Martin, G.N.: Probabilistic counting algorithms for database applica-
tions. J. Comput. Syst. Sci. 31, 182–209 (1985)
10. Flajolet, P., Fusy, É., Gandouet, O., Meunier, F.: Hyperloglog: the analysis of
a near-optimal cardinality estimation algorithm. In: International Conference on
Analysis of Algorithms (2007)
11. Greenwald, M., Khanna, S.: Space-efficient online computation of quantile sum-
maries. In: ACM SIGMOD International Conference on Management of Data
(2001)
12. Melnik, S., Gubarev, A., Long, J.J., Romer, G., Shivakumar, S., Tolton, M.,
Vassilakis, T.: Dremel: interactive analysis of web-scale datasets. In: International
Conference on Very Large Data Bases, pp. 330–339 (2010)
13. Metwally, A., Agrawal, D., El Abbadi, A.: Efficient computation of frequent and
top-k elements in data streams. In: International Conference on Database Theory
(2005)
14. Misra, J., Gries, D.: Finding repeated elements. Sci. Comput. Program. 2, 143–152
(1982)
15. Morris, R.: Counting large numbers of events in small registers. Commun. ACM
21(10), 840–842 (1977)
16. Muthukrishnan, S.: Data Streams: Algorithms and Applications. Now Publishers,
Norwell (2005)
17. Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: parallel
analysis with sawzall. Dyn. Grids Worldwide Comput. 13(4), 277–298 (2005)
18. Woodruff, D.: Sketching as a tool for numerical linear algebra. Found. Trends
Theor. Comput. Sci. 10(1–2), 1–157 (2014)
Data Integration
A Framework for Scalable Correlation
of Spatio-temporal Event Data
1 Introduction
An event is often described as “something that happens at some place at some
time”. Thus, events inherently have a spatial and a temporal component. These
spatio-temporal events do not only origin from sensor readings, but can also be
extracted from text corpora like news, weblogs, and tweets.
The task we focus on in this paper is to find events that are correlated to a
given event in terms of its time and place of occurrence. The result is, e.g., a list
of pointers to documents in which similar events have been detected. For such
correlation tasks, we are facing the following problems:
– First, event specifications are often imprecise. For example, for the event
extracted from the sentence “Obama visited Germany in April 2009”, we do
not know (using only the text source) which part of Germany Obama visited
or at what exact dates he visited Germany.
– Second, for comparing events in terms of their similarity solely based on their
temporal and geographic components, we need a distance measure.
– Third, depending on the specific application different correlation techniques
are needed: for finding similar events, nearest neighbor or skyline queries are
an appropriate approach, whereas for determining hot spots, clustering (such
as DBSCAN) might be a better choice.
– Finally, because (extracted) event data can be large datasets, scalable tech-
niques are required. Modern data processing frameworks such as Apache
Hadoop or Spark provide a suitable platform for addressing this challenge.
In [2] an adaptation of DBSCAN to MapReduce is proposed, whereas in [1] and
[4] adaptations of the skyline algorithm are shown.
In general, there are two approaches for realizing a distance function for
imprecise event data. First, dates representing a month or year can be mapped
to intervals of days (e.g., “2014-05” can be mapped to [2014-05-01, 2014-05-31])
with each subinterval being a valid instance of “2014-05”. Similarly, a country can
be mapped to a polygon or minimum bounding box. Then, a function is devised
that determines the distance between intervals (for time) and boxes/polygons
(for regions). Each such function can either yield a single scalar value (e.g., the
average distance between points of two intervals/boxes), or an interval, giving the
minimum and maximum distance between two intervals/boxes. In our current
framework, we only consider the former case where single scalar values for both
the temporal and geographic component are determined and linearly combined
using a weight. That is, for two events e1 and e2, we assume a distance function
dist(e1, e2) := wt · distt(e1, e2) + wg · distg(e1, e2), with distt and distg functions
for determining the distance between intervals and regions/boxes, respectively,
and wt, wg ∈ [0, 1], wt + wg = 1.
Nearest Neighbor Queries. Nearest neighbor queries represent the most straight-
forward solution. Given a set of events E, a reference event er and a distance
function dist, the task is to find the set kNN(er ) of the k nearest events. In the
case of our spatio-temporal event data this requires a single distance measure,
which is usually defined using weights for the spatial and temporal distances.
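As an illustration of the combined distance and the resulting nearest-neighbor query, the following sketch (in Scala; the event representation with interval midpoints and box centroids is an assumption made for this sketch, not the framework's exact choice) computes a single scalar distance per component and linearly combines them with the weights wt and wg:

object EventDistance {
  // Time as an interval [tStart, tEnd] (e.g., in days), space as a bounding box.
  case class Event(tStart: Double, tEnd: Double,
                   minLat: Double, maxLat: Double, minLon: Double, maxLon: Double)

  // Scalar temporal distance: gap between interval midpoints.
  def distT(a: Event, b: Event): Double =
    math.abs((a.tStart + a.tEnd) / 2 - (b.tStart + b.tEnd) / 2)

  // Scalar geographic distance: Euclidean distance between box centroids.
  def distG(a: Event, b: Event): Double = {
    val (latA, lonA) = ((a.minLat + a.maxLat) / 2, (a.minLon + a.maxLon) / 2)
    val (latB, lonB) = ((b.minLat + b.maxLat) / 2, (b.minLon + b.maxLon) / 2)
    math.sqrt(math.pow(latA - latB, 2) + math.pow(lonA - lonB, 2))
  }

  // Weighted combination with wt + wg = 1.
  def dist(a: Event, b: Event, wt: Double, wg: Double): Double =
    wt * distT(a, b) + wg * distG(a, b)

  // kNN(er): the k events closest to the reference event er under dist.
  def kNN(events: Seq[Event], er: Event, k: Int, wt: Double, wg: Double): Seq[Event] =
    events.sortBy(e => dist(e, er, wt, wg)).take(k)
}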
Given the event data model, the distance functions, and the set of correlation
functions described above, the goal of our work is to provide a framework for scal-
able event data correlation. As the underlying platform we have chosen Apache
Spark (http://spark.apache.org), but our framework can be easily ported to other
platforms providing a similar (Scala-based) API such as the Apache Flink
(http://flink.apache.org) project. Figure 1 shows the
components of the framework and their role in an event analysis pipeline.
The core components are the following operators implemented as transfor-
mations on Spark’s resilient distributed datasets (RDD):
PrepareEvents: This operator transforms a set of raw (textual) event data into
a set of event records t, q conforming to our framework. This means that
textual temporal and spatial properties are normalized into numerical values,
i.e., date/time values and points or polygons for the spatial descriptions such
as names of cities or locations. For the latter, a service such as GeoNames
(http://www.geonames.org) can be used.
CalcDistance: This implements a transformation operator for calculating the
spatial and temporal distance dist of each event of a RDD to a given reference
event.
TopKOp: This operator computes the top-k list of events from an input RDD
produced by CalcDistance. Parameters to this operator are k as well as the
weights for the geographic (wg ) and temporal (wt ) distance.
SkylineOp: This operator computes the skyline of event records from a RDD
produced by CalcDistance. The dominance relation can be passed as para-
meter to the operator.
ClusteringOp: Finding groups of correlated events is realized by the
ClusteringOp operator implementing a parallel variant of DBSCAN [3] for
spatio-temporal data. Parameters are the standard clustering parameters ε
and MinPts as well as a global distance function taking both spatial and
temporal distances into account.
While the implementation of PrepareEvents, CalcDistance, and – a sort-
based – TopKOp operator is rather straightforward, efficient skyline processing
and density-based clustering require more effort. As mentioned in Sect. 1, there
already exist some proposals for MapReduce-based implementations of these
operators that have inspired our Spark implementations.
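For illustration, the following sketch shows how CalcDistance and TopKOp can be expressed as transformations on Spark RDDs; the event representation and the operator signatures are assumptions made for this sketch and do not reproduce the framework's actual API:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TopKSketch {
  case class Event(id: String, time: Double, lat: Double, lon: Double)

  // CalcDistance: temporal and geographic distance of every event to a reference event.
  def calcDistance(events: RDD[Event], ref: Event): RDD[(Event, (Double, Double))] =
    events.map { e =>
      val dt = math.abs(e.time - ref.time)
      val dg = math.sqrt(math.pow(e.lat - ref.lat, 2) + math.pow(e.lon - ref.lon, 2))
      (e, (dt, dg))
    }

  // TopKOp: combine both distances with the weights wt and wg and keep the k closest events.
  def topK(withDist: RDD[(Event, (Double, Double))], k: Int,
           wt: Double, wg: Double): Array[(Event, Double)] =
    withDist.map { case (e, (dt, dg)) => (e, wt * dt + wg * dg) }
      .sortBy(_._2)
      .take(k)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("topk-sketch").setMaster("local[*]"))
    val events = sc.parallelize(Seq(
      Event("a", 0.0, 50.0, 8.0), Event("b", 10.0, 52.5, 13.4), Event("c", 2.0, 48.1, 11.6)))
    val ref = Event("ref", 1.0, 50.1, 8.7)
    topK(calcDistance(events, ref), k = 2, wt = 0.9, wg = 0.1).foreach(println)
    sc.stop()
  }
}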
Both SkylineOp and ClusteringOp are based on a grid partitioning, where
the dimensions of the grid are either the spatial and temporal dimensions (in
case of skyline processing) or longitude, latitude, and time in case of clustering.
For simplicity, we assume – non-optimal – equally-sized grid cells representing
partitions of Spark’s RDDs.
Our skyline operator implements the idea presented in [4] by computing in a
first phase bitstrings representing grid cells containing data points. This can be
done in parallel without moving data. By combining these bitstrings in a reduce
step, dominated as well as empty cells can be pruned. In the second phase, all
nodes compute a local skyline of their data by taking the information from this
global bitstring into account. Finally, the local skylines are merged.
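The second phase can be sketched as follows (plain Scala, simplified: the bitstring-based pruning of dominated cells from the first phase is omitted, and each inner sequence stands for the points of one grid cell):

object SkylineSketch {
  type Point = (Double, Double) // (spatial distance, temporal distance), both to be minimized

  // p is dominated by q if q is at least as good in both dimensions and strictly better in one.
  def dominates(q: Point, p: Point): Boolean =
    q._1 <= p._1 && q._2 <= p._2 && (q._1 < p._1 || q._2 < p._2)

  def localSkyline(points: Seq[Point]): Seq[Point] =
    points.filter(p => !points.exists(q => dominates(q, p)))

  // Each cell computes a local skyline; the local skylines are then merged globally.
  def skyline(cells: Seq[Seq[Point]]): Seq[Point] =
    localSkyline(cells.flatMap(localSkyline))
}

On Spark, the per-cell step corresponds to a mapPartitions over the grid-partitioned RDD, with the final merge performed on the collected local skylines.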
For density-based clustering, grid cells must not be disjoint in order to deter-
mine the neighborhood for objects at the border of cells. Thus, we compute an
overlap between neighboring cells and assign objects in this overlap area to its
neighbor cells, too. Next, for each cell a local DBSCAN is performed. Note that
compared to the skyline processing strategy, this requires repartitioning the data
according to their grid cell membership. Finally, we build a global graph of all local
clusters in order to merge clusters from different cells.
5 Use Cases
In this section, we show the outcome of the skyline and top-k operations. Due
to space limitations we do not present a full performance evaluation. Our test
dataset was crawled from the website eventful.com and contains 122,467 events.
It consists only of events that took place in Germany where the earliest event
appeared on 2007-06-30 and the latest on 2020-06-30. For the test of our opera-
tors, we manually removed all events in the eastern part of Germany (which is
the federal state of Saxony).
Figure 2 shows the spatial distribution of all events in our dataset. On the left,
the skyline (marked with +) is shown. The right figure shows the result of the top-
k query (k = 10; marked with •). The reference point for both queries is shown
as a separate marker. One can see that the spatio-temporal skyline not only finds correlated
events that have both a small spatial and temporal distance to the reference
event, but also considers events as correlated that are near to the reference event
in at least one dimension. The two shown skyline points in the north and the
south have a large spatial distance, but only a small temporal distance and thus,
are considered correlated to the reference event. On the other hand, the top-
k operator accepts user-defined weights for the spatial and temporal distances
Fig. 2. Left: the skyline (+); right: top-10 result (•) for a reference event (shown as a separate marker).
to express a desired preference over one or the other dimension. In the given
example these weights are wg = 0.10 for the geographic and wt = 0.90 for the
temporal dimension, i.e., the temporal distance is considered more important. As
Fig. 2 shows, the resulting points have a large geographic distance, but are near
to the reference event in the temporal dimension. Note, there are events that
take place at the exact same position, so that they cover each other in the figure
and appear as one point. Thus, the figure shows only eight result points. Due to
space limitations, we cannot show the results of the spatio-temporal clustering.
Acknowledgement. This work was funded by the DFG under grant no. SA782/22.
References
1. Chen, L., Hwang, K., Wu, J.: MapReduce skyline query processing with a new
angular partitioning approach. In: IPDPSW (2012)
2. Dai, B.-R., Lin, I.-C.: Efficient map/reduce-based DBSCAN algorithm with opti-
mized data partition. In: CLOUD (2012)
3. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discov-
ering clusters in large spatial databases with noise. In: KDD (1996)
4. Mullesgaard, K., Pedersen, J.L., Lu, H., Zhou, Y.: Efficient skyline computation in
MapReduce. In: EDBT (2014)
Towards More Data-Aware
Application Integration
1 Introduction
Integration middleware systems in the sense of EAI brokers [5] address the funda-
mental need for (business) application integration by acting as a messaging hub
between applications. As such, they have become ubiquitous in service-oriented
enterprise computing environments. Messages are mediated between applications
mostly in wire formats based on XML (e. g., SOAP for Web Services).
The advent of more “data-aware” integration scenarios (observation O1 )
put emphasis on (near) “real-time” or online processing (O2 ), which requires us
to revisit the standard integration capabilities, system design and architectural
decisions. For instance, in the financial/utilities industry, China Mobile gener-
ates 5–8 TB of call detail records per day, which have to be processed by inte-
gration systems (i. e., mostly message routing and transformation patterns),
“convergent charging” (CC; cf. Solace Solutions, visited 02/2015, last update 2012:
http://www.solacesystems.com/techblog/deconstructing-kafka) and “invoicing”
applications (not further discussed). In addition, the standard XML-processing has
to give ground to other
formats like JSON and CSV (O3 ). These observations (O1–3 ) are backed by
similar scenarios from sports management (e. g., online player tracking) and the
rapidly growing amount of data from the Internet of Things and Cyber Physical
System domains. For those scenarios, an architectural setup with systems like
Message Queuing (MQ) are used as reliable “message buffers” (i. e., queues,
topics) that handle “bursty” incoming messages and smoothen peak loads
(cf. Fig. 1). Integration systems are used as message consumers, which (transac-
tionally) dequeue, transform (e. g., translation, content enrichment) and route
messages to applications. For reliable transport and message provenance, inte-
gration systems require relational Database Systems, in which most of the (busi-
ness) application data is currently stored (O4 ). When looking at the throughput
capabilities of the named systems, software-/hardware-based MQ systems like
Apache Kafka or Solace (see the Solace Solutions link above) are able to process several million
messages per second. RDBMS benchmarks like TPC-H, TPC-DI measure queries
and inserts in PB sizes, while simple, header-based routing benchmarks for inte-
gration systems show message throughputs of a few thousand messages per sec-
ond [2] (O5 ). In other words, MQ and DBMS (e. g., RDBMS, NoSQL, NewSQL)
systems are already addressing observations O1–5. Integration systems, however,
seem to not be there yet.
Compared to MQs, integration systems work on message data, which seems
to make the difference in message throughput. We argue that integration opera-
tions, represented by Enterprise Integration Patterns (EIP) [9], can be mapped
to an “in-memory” representation of the table-centric RDBMS operations to
profit from their efficient and fast evaluation. Early ideas on this were brought
up in our position papers [11,13]. In this work, we follow up to shed light on
the observed discrepancies. We revisit the EIP building blocks and operator
model of integration systems, for which we define RDBMS-like table operators
(so far without changing their semantics) as a symbiosis of RDBMS and inte-
gration processing, e. g., by using Datalog [17]. We choose Datalog as an example
of an efficiently computable, table-like integration co-processing facility close to
the actual storage representation with expressive foundations (e. g., recursion),
which we call Table-centric Integration Patterns (TIP). To show the applicabil-
ity of our approach to integration scenarios along observations O1–5 we conduct
an experimental message throughput analysis for selected routing and transfor-
mation patterns, where we carefully embed the TIP definitions into the open-
source integration system Apache Camel [3] that implements most of the EIPs.
Not changing the EIP semantics means that table operations are executed on
“single-record” table messages. We give an outlook to “multi-record” table mes-
sage processing.
The remainder of this paper is organized along its main contributions. After
a more comprehensive explanation of the motivating CC example and a brief
sketch of our approach in Sect. 2, we analyse common integration patterns with
respect to their extensibility for alternative operator models and define a table-
centric operator/processing model that can be embedded into the patterns (still)
aligned with their semantics in Sect. 3. In Sect. 4 we apply our approach to a con-
ventional integration system and briefly describe and discuss our experimental
performance analysis, and we draw an experimental sketch of the idea of “multi-
record” table message processing. Section 5 examines related work and Sect. 6
concludes the paper.
The Enterprise Integration Patterns (EIPs) [9] define “de-facto” standard opera-
tions on the header (i. e., payload’s meta-data) and body (i. e., message payload)
of a message, which are normally implemented in the integration system’s host
language (e. g., Java, C#). This way, the actual integration operation (i. e., the
content developed by an integration expert like mapping programs and routing
conditions) can be differentiated from the implementation of the runtime sys-
tem that invokes the content operations and processes their results. For instance,
Fig. 1 shows the separation of concerns within integration systems with respect to
“system-related” and “content-related parts” and sketches which pattern oper-
ations to re-define using relational table operators, while leaving the runtime
system (implementation) as is. The goal is to only change these operations and
make integration language additions for table-centric processing within the con-
ventional integration system, while preserving the general integration semantics
like Quality of Service (e. g., best effort, exactly once) and the Message Exchange
Pattern (e. g., one-way, two-way). In other words, the content-related parts of
the pattern definitions are evaluated by an “in-process” table operation proces-
sor (e. g., a Datalog system), which is embedded into the standard integration
system and invoked during the message processing.
Before defining Table-centric Integration Patterns (short TIP) for message rout-
ing and transformation more formally, let us recall the encoding of some relevant,
basic database operations/operators into Datalog: join, projection, union,
and selection. The join of two relations r(x, y) and s(y, z) on parameter y
is encoded as j(x, y, z) ← r(x, y), s(y, z), which projects all three parameters
to the resulting predicate j. More explicitly, a projection on parameter x of
relation r(x, y) is encoded as p(x) ← r(x, y). The union of r(x, y) and s(x, y) is
u(x, y) ← r(x, y). u(x, y) ← s(x, y), which combines several relations to one. The
selection r(x, y) according to a built-in predicate φ(x), where φ(x) can contain
constants and free variables, is encoded as s(x, y) ← r(x, y), φ(x). Built-in pred-
icates can be binary relations on numbers such as <,<=,=, binary relations on
strings such as equals, contains, startswith or predicates applied to expressions
based on binary operators like +, −, ∗, / (e. g., x = p(y) + 1), and operations on
relations like z = max(p(x, y), x), z = min(p(x, y), x), which would assign the
maximal or the minimal value x of a predicate p to a parameter z.
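To make the relational reading of these rules explicit, the following Scala sketch implements the same four operators over in-memory sets of tuples; it is purely illustrative and is not the Datalog processor used later in the paper:

object TableOps {
  // join of r(x, y) and s(y, z) on the shared parameter y
  def join[A, B, C](r: Set[(A, B)], s: Set[(B, C)]): Set[(A, B, C)] =
    for { (x, y) <- r; (y2, z) <- s if y == y2 } yield (x, y, z)

  // projection on the first parameter of r(x, y)
  def projection[A, B](r: Set[(A, B)]): Set[A] = r.map(_._1)

  // union of r(x, y) and s(x, y)
  def union[A, B](r: Set[(A, B)], s: Set[(A, B)]): Set[(A, B)] = r ++ s

  // selection of r(x, y) by a built-in predicate phi(x)
  def selection[A, B](r: Set[(A, B)], phi: A => Boolean): Set[(A, B)] =
    r.filter { case (x, _) => phi(x) }
}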
Fig. 2. Message routing and transformation patterns mapped to Datalog. Most com-
mon Datalog operations for a single pattern are marked “dark blue”, less common ones
“light blue”, and possible but uncommon ones “white” (Color figure online). [Figure
omitted: a matrix relating the Datalog operations union, projection, join, built-in, and
selection to the routing patterns (Router/Filter, Recipient List, Multicast/Join Router,
Splitter, Correlation/Completion, Aggregation) and the transformation patterns
(Message Translator, Content Filter, Content Enricher).]
In this section the message routing pattern implementations are re-defined, which
can be seen as control and data flow definitions of an integration channel pipeline.
For that, they access the message to route it within the integration system and
eventually to its receiver(s). They influence the channel and message cardinality
as well as the content of the message.
Splitter. The Splitter pattern has a channel cardinality of 1:1 and creates new,
leaving messages. Thereby the splitter breaks the entering message into multiple
(smaller) messages (i. e., message cardinality of 1:n). Hereby, the stateless split-
ter uses a split condition sc on the content-level, with {out1 , out2 , ..., outn } :=
sc(msgin , conds), which accesses the entering message’s body to determine a list
of distinct body parts {out1 , out2 , ..., outn }, based on a list of conditions conds,
that are each inserted to a list of individual, newly created, leaving messages
{msgout1 , msgout2 , ..., msgoutn } with n ∈ N by a splitter function. The header
and attachments are copied from the entering to each leaving message.
The re-defined split condition sctip evaluates a set of Datalog rules as conds
(i. e., mostly selection, and sometimes built-in and join constructs; the latter two
are marked “light blue”). Each part of the body outi with i ∈ N is a set of facts
that is passed to a split function, which wraps each set into a single message.
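A compact way to read this definition is the following sketch (Scala; the message and fact types are assumptions made for illustration only): the split condition selects one body part per condition, and the split function wraps each part into a new leaving message with the header copied over.

object SplitterSketch {
  case class Message[A](header: Map[String, String], body: Set[A])

  // sc(msg_in, conds): one leaving message per condition, containing the matching facts;
  // attachments are omitted in this sketch.
  def splitter[A](in: Message[A], conds: Seq[A => Boolean]): Seq[Message[A]] =
    conds.map(cond => Message(in.header, in.body.filter(cond)))
}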
4 Experimental Evaluation
As System under Test (SuT) for an experimental evaluation we used the open
source, Java-based Apache Camel integration system [3] in version 2.14.0, which
implements most of the EIPs. The Camel system allows content-level extensions
through several interfaces, with which the TIP definitions were implemented
and embedded (e. g., own Camel Expression definitions for existing patterns,
and Camel Processor definitions for custom or non-supported patterns). The
Datalog system we used for the measurements is a Java-based, standard naïve-
recursive Datalog processor (i. e., without stratification) [17] in version 0.0.6
from [14].
Subsequently, the basic setup and execution of the measurements are intro-
duced. However, due to brevity, a more detailed description of the setup, the
integration scenarios and more detailed results are provided in the Suppl. Mate-
rial of the extended version of this paper [12].
4.1 Setup
Apache JMeter, visited 02/2015: http://jmeter.apache.org/.
Listing 1.1. Routing condition: rctip

cbr-order(id, -, OTOTALPRICE, -) :-
  order(id, otype, -, OTOTALPRICE, -, OPRIORITY, -),
  =(OPRIORITY, "1-URGENT"),
  >(OTOTALPRICE, 100000.00).

Listing 1.2. Message translation program: mttip

conv-order(id, otype, ORDERKEY, CUSTKEY, SHIPPRIORITY) :-
  order(id, otype, ORDERKEY, CUSTKEY, -, SHIPPRIORITY, -).

Listing 1.3. Routing condition with join over “multi-format” message

cbr-cust(CUSTKEY, -) :-
  customer(cid, ctype, CUSTKEY, -, CNATIONKEY, -, ACCTBAL, -),
  nation(nid, ntype, NATIONKEY, -, NREGIONKEY, -),
  >(ACCTBAL, 3000.0),
  =(CNATIONKEY, NATIONKEY),
  =(NREGIONKEY, 3).
Table 1. Throughput measurements for format conversion, message routing and trans-
formation patterns based on 4 kB messages generated from 1.5 million standard TPC-H
orders records.
The throughput test streams all 1.5 million order/customer messages to the
pipeline. The performance measurement results are depicted in Table 1 for a sin-
gle thread execution. Measurements with multiple threads show a scaling up to
factor 10 of the results, with a saturation around 36 threads (i. e., factor of num-
ber of cores; not shown). The stream conversion to JSON object aggregated for
all messages is slightly faster than for ONC. However, in both order messages
cases the TIP-based implementation reaches a slightly higher transaction per
second rate (tps), which lets the processing end 7 s and 4 s earlier respectively,
due to the natural processing of ONC iterators in the Datalog engine. Although
the measured 99 % confidence intervals do not overlap, the execution times are
similar. The rather theoretical case of increasing the number of selection/built-
in operations on the order messages (e. g., date before/after, string contains)
showed a stronger impact for the Camel-Java case than the Camel-Datalog
case (not shown). In general, the Camel-Java implementation concludes with
a routing decision as soon as a logical conjunction is found, while the conjunc-
tive Datalog implementation currently evaluates all conditions before returning.
In the context of integration operations this is not necessary, thus could be
improved by adapting the Datalog evaluation for that, which we could experi-
mentally show (not shown; out of scope for this paper). The measured through-
put of the content-based router with join processing on the “multi-format”
1.5 million TPC-H customer/nation messages again shows similar results. Only
this time, the too simple NestedLoopJoin implementation in the used Datalog
engine causes a loss of nine seconds compared to the “hand-coded” JSON join
implementation.
5 Related Work
The work on Java systems like Telegraph Dataflow [16] and Jaguar [19] can
be considered related work in the area of programming languages on application
systems for faster, data-aware processing. These approaches mainly aim to make
Java more capable of data processing, while dealing primarily with
threading, garbage collection and memory management. None of them considers
the combination of the host language with table-centric processing.
Declarative XML Processing. Related work can be found in the area of declara-
tive XML message processing (e. g., [4]). Using an XQuery data store for defining
persistent message queues (i. e., conflicting with O3 ), the work targets a com-
plementary subset of our approach (i. e., persistent message queuing).
Data Integration. The data integration domain uses integration systems for
querying remote data that is treated as local or “virtual” relations. Starting with
SQL-based approaches (e. g., using Garlic [8]), the data integration research
reached relational logic programming, summarized by [6]. In contrast to such
remote queries, we define a table-centric, integration programming approach for
application integration, while keeping the current semantics (for now).
6 Concluding Remarks
This paper motivates a look into a growing “processing discrepancy” (e. g., mes-
sage throughput) between current integration and complementary systems (e. g.,
MQ, RDBMS) based on known scenarios with new requirements and fast grow-
ing new domains (O1–O3 ). Towards a message throughput improvement, we
extended the current integration processing on a content level by table-centric
integration processing (TIP). To remain compliant to the current EIP definitions
the TIP-operators work on “single-record” messages, which lets us compare with
current approaches using a brief experimental throughput evaluation. Although
the results slightly improve the standard processing, not to mention the declar-
ative vs. “hand-coded” definition of integration content, the actual potential
References
1. Abiteboul, S., Antoine, E., Miklau, G., Stoyanovich, J., Testard, J.: Rule-based
application development using webdamlog. In: ACM SIGMOD (2013)
2. AdroitLogic: Esb performance (2013). http://esbperformance.org/
3. Anstey, J., Zbarcea, H.: Camel in Action. Manning, Stamford (2011)
4. Böhm, A., Kanne, C., Moerkotte, G.: Demaq: a foundation for declarative XML
message processing. In: CIDR 2007, pp. 33–43 (2007)
5. Chappell, D.: Enterprise Service Bus. O’Reilly Media Inc., Sebastopol (2004)
6. Genesereth, M.R.: Data Integration: The Relational Logic Approach. Morgan &
Claypool Publishers, San Rafael (2010)
7. Green, T.J., Aref, M., Karvounarakis, G.: LogicBlox, platform and language:
a tutorial. In: Barceló, P., Pichler, R. (eds.) Datalog 2.0 2012. LNCS, vol. 7494,
pp. 1–8. Springer, Heidelberg (2012)
8. Haas, L.M., Kossmann, D., Wimmers, E.L., Yang, J.: Optimizing queries across
diverse data sources. In: VLDB, pp. 276–285 (1997)
9. Hohpe, G., Woolf, B.: Enterprise Integration Patterns: Designing, Building, and
Deploying Messaging Solutions. Addison-Wesley Longman, Boston (2003)
10. Reimann, P., Schwarz, H.: Datenmanagementpatterns in simulationsworkflows. In:
BTW, pp. 279–293 (2013)
11. Ritter, D.: What about database-centric enterprise application integration? In:
ZEUS, pp. 73–76 (2014)
12. Ritter, D.: Towards more data-aware application integration (extended version).
CoRR abs/1504.05707 (2015). arXiv: 1504.05707
13. Ritter, D., Bross, J.: DatalogBlocks: relational logic integration patterns. In:
Decker, H., Lhotská, L., Link, S., Spies, M., Wagner, R.R. (eds.) DEXA 2014,
Part II. LNCS, vol. 8645, pp. 318–325. Springer, Heidelberg (2014)
14. Ritter, D., Westmann, T.: Business network reconstruction using datalog. In: Bar-
celó, P., Pichler, R. (eds.) Datalog 2.0 2012. LNCS, vol. 7494, pp. 148–152. Springer,
Heidelberg (2012)
15. Russell, N., ter Hofstede, A., Edmond, D., van der Aalst, W.: Workflow data pat-
terns: identification, representation and tool support. In: ER (2005)
16. Shah, M.A., Madden, S., Franklin, M.J., Hellerstein, J.M.: Java support for data-
intensive systems: experiences building the telegraph dataflow system. SIGMOD
Rec. 30(4), 103–114 (2001)
17. Ullman, J.: Principles of Database and Knowledge-Base Systems, vol. I. Computer
Science Press, New York (1988)
18. Vrhovnik, M., Schwarz, H., Suhre, O., Mitschang, B., Markl, V., Maier, A., Kraft,
T.: An approach to optimize data processing in business processes. In: VLDB
(2007)
19. Welsh, M., Culler, D.E.: Jaguar: enabling efficient communication and I/O in Java.
Concurr. Pract. Exp. 12(7), 519–538 (2000)
20. Zaniolo, C.: A logic-based language for data streams. In: SEBD 2012 (2012)
21. Zaniolo, C.: Logical foundations of continuous query languages for data streams.
In: Barceló, P., Pichler, R. (eds.) Datalog 2.0 2012. LNCS, vol. 7494, pp. 177–189.
Springer, Heidelberg (2012)
Applying NoSQL Databases for Integrating Web
Educational Stores - An Ontology-Based
Approach
1 Introduction
The use of web content is increasing because of its accessibility at any time and
from any place. Online libraries have started to support open access and profes-
sionals are increasingly using Web 2.0 technologies to spread their knowledge. In
medical education, although its content should be of a high quality and provided
by authorized sources, the web had played an important role in providing such
content. Medical communities have a high awareness of the range of educational
content available and show substantial interest in using such resources [1].
Searching for relevant educational content on the web can be challenging for
its users. The vast amount of information available and the diversity of its types
makes the search process time consuming. The content of any website is usually
stored in a relational database with different fields used for describing its records.
Therefore, integrating web databases into one data store is a challenging issue.
New practices for publishing web content using Linked Data are being adopted
by an increasing number of content providers leading to the web of data [2].
In this paper, we present a novel system that adopts Linked Data practices for
automatic linking of heterogeneous web stores into one dataset based on biomed-
ical ontologies enrichment. The developed system links some of the high quality
User Generated Content (UGC), published on YouTube and blogs by medical
educators and organizations, with content from online medical libraries. Using
biomedical ontologies, we enriched the content of these databases by annotating
free-text descriptions provided in their metadata records. Ontology-based anno-
tation allows the system to discover keyword terms in web database content and
build dynamic linkages between them. The final linked dataset is represented
in RDF/XML format and URIs are used for describing the dataset content.
Researchers in the field of e-learning refer to online educational resources that
can be used in the learning process as Learning Objects (LOs). Learning Objects
as defined in [3] can be of different types -images, videos, or text-, and differ in
granularity from books to small web pages. Since the application domain of this
work is medical education, we refer to the educational resources retrieved from
the web and used in this system as Educational Medical Objects (EMOs). The
result of our work is a linked dataset of EMOs named the LEMO dataset and a
system for managing them called the LEMO system.
The paper is structured as follows. Section 2 presents background and related
work about the subject. Section 3 describes the processes applied for harvesting
distributed web stores and building the LEMO RDF Triple Store. Section 4 pro-
vides more details about the ontologies used in the LEMO system, and explains
the use of these ontologies in the annotation and enrichment process. Further-
more, a detailed description of the NoSQL RDF Triple Store components is
presented in this section. Section 5 details a comparative analysis of using
the LEMO system with the MeSH and SNOMED CT ontologies, and discusses
experiments conducted for querying this dataset using ontological-based queries.
Finally, Sect. 6 presents the conclusions and future work.
Using Linked Data and ontologies for data enrichment has been researched heav-
ily in recent years. The enrichment methods can happen at the server-side or
client-side of a system. Both variations have been tested in [4] and the advantages
and disadvantages were compared. Enriching queries is another method applied
at the server-side of the system, and the work presented in [5] investigated enrich-
ing queries made to a collection of medical images provided by one library. The
queries have been expanded after enriching the text with MeSH ontology terms.
Applying NoSQL Databases for Integrating Web Educational Stores 31
Data enrichment has also been used with UGC content on the web, because user
generated tags or folksonomies describing YouTube videos may be poorly chosen.
The tag expansion and ranking applied in [6] has been shown to enhance the descrip-
tion of the videos on YouTube. Enriching the content of a single dataset has been
heavily researched, especially in the medical field. This is due to having mature
and well maintained biomedical ontologies [7].
Linked Data principles have been adopted in education. Projects have been
developed for supporting the use of web educational data [8]. Efforts for linking
different educational ICT tools registries are presented in [9]. Another project
for publishing datasets of educational medical materials in Linked Data has been
developed in [10], which focused on providing a data model for exposing various
types of educational materials to the web in order to enhance their interoperabil-
ity and reuse across the web. It is clear that Linked Data will have potential
in the education field. A project presented in [11] developed a curriculum from
Open Educational Resources tailored for teaching and training practitioners to
use Linked Data. These days, educational organizations and universities are con-
sidering storing and publishing data using a Linked Data format [12].
Before integrating web educational stores, we need to harvest and model distrib-
uted Educational Medical Objects (EMOs) into one data model. Our goal is to
integrate different types of EMOs into one linked data set that is searchable and
queryable.
In order to accomplish this goal, we developed two harvesting endpoints. In
the first one, we incorporated the OAI-PMH protocol [13]. The other endpoint
is basically an RSS feed reader storing web feeds from websites that provide
them. Many online libraries expose their metadata using an OAI-PMH protocol
such as the PubMed library. Using these harvesting endpoints, developed in the
LEMO system, we harvested articles from the PubMed library and videos and
blogs from YouTube and blogging platforms. The resulting dataset consisted of
2720 medical educational objects divided into 1000 articles from PubMed library
and 1720 blogs and videos harvested from five different blogging websites and six
YouTube channels. The chosen blogs and YouTube channels are maintained by
either medical academics or journals and dedicated to educational purposes.
The harvested metadata records are retrieved in XML formats. The OAI
service in PubMed supports DCMI metadata, therefore we can set the format
parameter in the OAI requests produced by LEMO to be DCMI for harvesting
content. On the other hand, blogs and video RSS feeds are structured XML
documents which are machine interpretable and provide access to parts of the
website entries such as title, authors, publication date, and content [14]. A frag-
ment of the XML files harvested is illustrated in Fig. 1 along with the processes
needed to build LEMO RDF Triple Store.
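As a rough illustration of such a harvesting request (the endpoint URL below is a placeholder, not the actual PubMed OAI base URL), an OAI-PMH ListRecords call with the metadata format set to Dublin Core can be issued as follows:

object OaiHarvestSketch {
  def main(args: Array[String]): Unit = {
    val baseUrl = "https://example.org/oai"   // hypothetical OAI-PMH endpoint
    val request = s"$baseUrl?verb=ListRecords&metadataPrefix=oai_dc" // DCMI records
    val xml = scala.io.Source.fromURL(request, "UTF-8").mkString     // raw XML response
    println(xml.take(200))                    // to be mapped into the LEMO model via XSLT
  }
}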
The heterogeneous metadata structures for all EMOs retrieved are mapped
into the LEMO data model using XSLT techniques. The LEMO data model has
been proposed in [15] at an earlier stage of developing the LEMO system after
conducting a comparative study of existing data models in medical education. It is
based on the DCMI metadata schema and implemented in RDF/XML formats.
New LEMO properties were introduced for describing the enriched resources
in LEMO store which will be discussed in detail in Sect. 4. The mapped files
are then sent to the ontology enrichment process which annotates the free-text
of EMOs, and discovers possible subjects to categorize them. This will result in
having an enriched LEMO Triple Store which consists of EMOs, terms annotated
in EMOs, and ontology classes used for annotation, as shown in Fig. 1.
4 Ontology-Based Annotations
Biomedical ontologies are formal representations of knowledge with definition of
concepts and their relations [16]. Such ontologies have been used for indexing
data produced by researchers in the field to ease its integration [17]. They are also
used for indexing articles in medicine libraries such as the use of MeSH ontology
for indexing PubMed library articles. In the LEMO system, we use ontologies to
annotate free-text in the harvested EMO metadata such as titles and descrip-
tions. Annotating the free-text enables us to discover relations between non
related objects on the web. In the LEMO system, we adopt the Linked Data
format for building the LEMO Triple Store which is considered the best practice
for connecting semantically related data that was not previously linked [2].
The application domain of the LEMO system is medicine education. The
BioPortal (http://bioportal.bioontology.org/) open repository for biomedical ontologies is used to explore possible
ontologies to integrate them with the LEMO system. BioPortal provides access
to ontologies developed in different formats via web services which enable its
users to get information about the ontologies or its content [18]. The LEMO
system uses additional web services such as the annotator service provided by
Bioportal for annotating and linking objects in the dataset. The ontologies used
in the LEMO system so far are the Systematized Nomenclature of Medicine -
Clinical Terms SNOMED CT and the Medical Subject Headings MeSH.
in detail using LEMO properties that store details about the original text anno-
tated, its indices and content along with details about the class it was annotated
to using a specific ontology, its ID, label, definition, and synonyms if they exist.
The terms’ annotated classes are nodes in the original ontology used for enriching
the dataset. Hence, a sub graph of the original ontology can be built using the
collection of ontology classes used for annotating its terms. The class relations
are stored using the lemo:adjacentTo property. These class resources will enable
further processing of the LEMO Triple Store to discover subjects or categories
for EMOs and build dynamic linkages between its resources.
In the annotation process, free-text of EMOs is sent to the BioPortal anno-
tator service and an ontology is specified in the request parameter. Then, the
response is read and terms’ annotated resources are created and linked to EMOs
stored in the LEMO RDF Triple Store. After the annotation process, each EMO
is represented by a set of keywords which are the terms annotated in its title
and description. Each set of keywords representing an EMO forms a smaller
sub graph of linked ontology classes based on their adjacency lists stored in the
LEMO Triple Store. For discovering subjects for an EMO, we apply a simple
term filtering technique to identify a smaller set of keywords which represent the
EMO subject property and are stored as the value of the dc:subject predicate for
that EMO. In term filtering, we assign weights for the keywords based on their
position and co-occurrence in an EMO term annotation set. Then, the accu-
mulated weight for each keyword is calculated based on its hierarchical level in
the ontology. If the term annotated class is a parent of many terms annotated for
the same EMO, then it will be more important than a term that is a leaf in the
ontology. The final weight value for each term annotated is stored in the rdf:value
property of the term annotation resource. After normalizing the weights of the
keywords, the top ranked keywords are selected as subjects of an EMO.
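The exact weighting formula is not given in this excerpt; the sketch below is one possible realization of the described idea (weights from co-occurrence in the annotation set, boosted when the annotated class is a parent of other annotated terms, then normalized, with the top-ranked keywords selected as subjects):

object SubjectSelectionSketch {
  def selectSubjects(keywords: Seq[String],
                     children: Map[String, Set[String]], // child classes within the EMO's ontology sub-graph
                     topN: Int = 3): Seq[String] = {
    // base weight: how often a keyword occurs in the EMO's annotation set
    val occurrence = keywords.groupBy(identity).map { case (k, v) => k -> v.size.toDouble }
    // boost parents of other annotated terms over leaves
    val raw = occurrence.map { case (k, occ) =>
      k -> occ * (1.0 + children.getOrElse(k, Set.empty[String]).count(c => keywords.contains(c)))
    }
    // normalize and keep the top-ranked keywords as subjects
    val total = math.max(raw.values.sum, 1e-9)
    raw.map { case (k, w) => k -> w / total }.toSeq.sortBy(-_._2).take(topN).map(_._1)
  }
}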
We tested the LEMO system against two biomedical ontologies: MeSH and
SNOMED. Comparisons of annotation results, term filtering, and linkage discov-
ery for these two experiments are detailed in the following section. The results of
the comparisons help to decide which ontology to use in the LEMO system based
on the larger number of annotations created.
Table 1. Number of terms annotated for the set of EMOs using different ontologies
The components of the dataset, harvested from the web for testing this system,
are detailed in Table 1. The table details the numbers of resources harvested
grouped by its type. It also details the number of keywords discovered after
annotating its textual content whether annotated in the title or the description
based on MeSH and SNOMED CT ontologies.
We notice that the number of terms annotated using the SNOMED CT
ontology is greater than the number of keywords annotated using the MeSH
ontology. The difference is not significant for video and blog EMOs compared to
article EMOs. This is due to the short text provided in the metadata of blogs and
videos compared to the longer text provided for articles in the online libraries.
The collection of terms annotated for the dataset is used for building linkages
between the EMOs and discovering subjects for categorizing the EMOs.
The LEMO dataset consists of different types of EMOs. Video and blog
EMOs usually have shorter descriptions in their metadata fields if any. This
affects the number of keywords annotated for EMOs from such types and that
reflects on the number of subjects selected. Using MeSH and SNOMED CT
annotations, the results of discovering subjects for video and blog EMOs are not
enhanced in either experiment. Figure 4a and b illustrate the relation between
the counts of subjects discovered in MeSH and SNOMED CT annotated EMOs
and their types. In both experiments, video and blog EMOs have low subject
counts. This is due to the low numbers of terms annotated in this type of EMOs.
Comparing the subject selection process based on MeSH and SNOMED CT, we
notice that in the SNOMED CT based dataset, very few EMOs did not have
any subject count, while in the MeSH based dataset, more than 150 EMOs from
articles, videos or blog types have subject counts equal to zero.
After the subject selection process, we analysed and compared the dynamically
built links in the LEMO dataset. We consider that there exists a link between two
EMOs if they have a similar annotated class in their subjects or keywords set.
Also, we count a link between two EMOs as directed links. Therefore, if there is
a link from node a to node b the link count will be two not one. We compared the
number of links built in the dataset in the two experiments conducted. Table 2
illustrates the number of links built in LEMO dataset in both experiments;
MeSH and SNOMED CT annotations. As detailed in the table, the number
of links based on the SNOMED CT ontology is greater than the number based on
the MeSH ontology; the link count is almost doubled. This is due to the
large number of annotations discovered using the SNOMED CT ontology.
Table 2. Links count in the LEMO dataset based on MeSH and SNOMED CT ontologies
in LEMO content. The system restricts the user to choosing ontological classes
rather than writing free text in the search box. Figure 5 illustrates the auto-
complete feature presented in LEMO for ontology-based access. The auto-
complete text box retrieves SNOMED CT ontology classes used in LEMO store.
Algorithm 1 explains the ontological-based technique for searching and ranking
the results of searching for a selected class.
The algorithm developed in LEMO is based on the NoSQL structure of the LEMO store explained previously in Fig. 2. As described in the ontology-based query algorithm, the search process starts with one ontology class Q. A query vector is then built based on the class adjacency properties stored in the LEMO store. Next, we search for EMOs annotated with any of the ontology classes related to the query class Q. So far the search results are not ranked according to their relevance to the query initiated. Hence, the related classes retrieved are weighted according to their co-occurrences with class Q in the search result set. The weights are then normalized by the size of the search result set, and class Q itself is given a weight of 1 in the vector. For each EMO in the search result, the weights of its annotations found in QVector are retrieved from rdf:value in the LEMO store, which is used to store the weight of an annotated class as explained in the previous sections. In order to have more accurate weights for the EMOs' classes, the weights are normalized by the length of their annotation lists. Finally, the search results are ranked by their Euclidean distance from the query vector, as shown in the algorithm. In order to test this algorithm, we conducted random queries of five classes found in the LEMO triple store. The results are shown in Table 3 and compared with exact text-matching search.
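To make the ranking step concrete, the following is a minimal sketch of the procedure described above, not the LEMO implementation itself; emo_annotations (mapping each EMO to its set of annotated classes), emo_weights (the stored rdf:value weights) and related (the adjacency classes of Q retrieved from the store) are assumed inputs.

import math
from collections import Counter

def search_and_rank(q_class, related, emo_annotations, emo_weights):
    """Retrieve EMOs annotated with Q or any related class, weight the
    related classes by their co-occurrence with Q in the result set,
    then rank EMOs by Euclidean distance from the query vector."""
    classes = {q_class} | set(related)
    results = [e for e, ann in emo_annotations.items() if ann & classes]

    # Query vector: co-occurrence of each related class with Q,
    # normalised by the result-set size; Q itself gets weight 1.
    co = Counter(c for e in results if q_class in emo_annotations[e]
                   for c in emo_annotations[e] if c in related)
    qvec = {c: co[c] / len(results) for c in related} if results else {}
    qvec[q_class] = 1.0

    # Per-EMO annotation weights, normalised by annotation-list length.
    ranked = []
    for e in results:
        n = len(emo_annotations[e]) or 1
        dist = math.sqrt(sum((w - emo_weights.get((e, c), 0.0) / n) ** 2
                             for c, w in qvec.items()))
        ranked.append((dist, e))
    return [e for _, e in sorted(ranked)]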
From the sample of classes queried in these experiments, we can notice that the ontology-based search always retrieved a higher number of results than text-based access. The overlap coefficient was higher than 90 % for all the queries tested; in other words, the ontological access covers almost all the search results of text-based access. We calculated the Jaccard similarity coefficient to emphasize the case of the last query, "RENAL DISEASE". For this query, the text-based results contain only 4 EMOs since we used exact text matching. Hence, the Jaccard similarity between the two search result sets is very low. The query vector resulted from searching for "RENAL DISEASE"
Query class   | Size of ontology-based result set (O) | Size of text-based result set (T) | Overlap coefficient (O∩T) | Jaccard similarity coef.
HEPATITIS     | 27 | 21 | 100 % | 0.78
INFLUENZA     | 30 | 25 | 92 %  | 0.71
MUSCLE        | 66 | 65 | 95 %  | 0.89
BRAIN         | 61 | 49 | 100 % | 0.80
RENAL DISEASE | 36 | 4  | 100 % | 0.11
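For reference, the two measures reported in the table are the standard set-similarity coefficients; a minimal sketch over two result sets O and T (standard definitions, not code from the paper):

def overlap_coefficient(O, T):
    """|O ∩ T| / min(|O|, |T|): 100 % means the smaller result set is
    fully contained in the larger one."""
    return len(O & T) / min(len(O), len(T))

def jaccard(O, T):
    """|O ∩ T| / |O ∪ T|: penalises large differences in result-set size,
    e.g. the RENAL DISEASE query (36 vs. 4 results) scores only about 0.11."""
    return len(O & T) / len(O | T)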
1 Introduction
The World Wide Web has expanded from a network of hyper-linked documents to
a more complex structure where both documents and data are easily published,
consumed and reused. Ideally, users should be able to access this information
as a single, global data space. However, Linked Data on the Web is highly het-
erogeneous: different datasets may describe overlapping domains, using different
approaches to data modelling and naming. A single global ontological conceptu-
alisation is impracticable, and instead a more extensible approach is needed for
semantic integration of heterogeneous Linked Data sets into a global data space.
In a recent paper [2], we introduced a theoretical framework for the inte-
gration of linked data sets, defining the semantic relationships between them
through peer-to-peer mappings. In [2], we specified the semantics of query
answering in this framework, as well as query answering and query rewriting
algorithms. Here, we build on this work by introducing a prototype system that
implements these techniques. After briefly summarising our theoretical frame-
work, we present the architecture of the system and the main tasks that the
system carries out. Finally, we summarise our current research and we establish
some goals for future work.
To motivate our research, we begin by presenting an example. Suppose
two RDF sources describe data in the domain of movies (source 1) and peo-
ple (source 2). A user wants to retrieve the names and ages of actors in the
movie Mulholland Drive, and poses the following query over their local source
(source 1), hoping for additional information from other relevant sources too:
SELECT ?name ?age
WHERE {db1:Mulholland_Dr_2001 db1:actor ?x . ?x rdfs:label ?name
. ?x foaf:age ?age }
An empty result will be returned because source 1 does not contain foaf data.
The problem can be addressed by using the SPARQL 1.1 SERVICE clause, as
follows:
SELECT ?name ?age
WHERE {db1:Mulholland_Dr_2001 db1:actor ?x . ?x rdfs:label ?name
SERVICE <http://data.people.org/sparql> { ?x foaf:age ?age } }
Even now, it is likely that query evaluation returns an empty result, because
real-world entities may be denoted by different IRIs in different sources. In this
case, mappings for the variable ?x may not be found. To cope with this issue,
Linked Data best practices suggest the adoption of the built-in OWL property
sameAs, which states that two linked IRIs represent the same real-world entity.
The user can leverage the semantics of the owl:sameAs predicate and submit
the following query, so as to perform coreference resolution of the “equivalent”
IRIs:
SELECT ?name ?age
WHERE { { db1:Mulholland_Dr_2001 db1:actor ?x . ?x rdfs:label ?name
. ?x owl:sameAs ?z
SERVICE <http://data.people.org/sparql> { ?z foaf:age ?age } }
UNION { db1:Mulholland_Dr_2001 db1:actor ?x . ?x rdfs:label ?name
SERVICE <http://data.people.org/sparql>
{ ?z owl:sameAs ?x . ?z foaf:age ?age } } }
The user may not know whether the owl:sameAs triples are stored in source 1
or source 2, so these two cases need to be taken into consideration by including
the UNION operator and two disjuncts in the query pattern. A non-empty result
is still not guaranteed since the owl:sameAs triples may be missing.
The drawbacks of this approach are: (1) the user needs to be aware of all
the potential sources of information, (2) the user needs to be familiar with the
semantic links between sources, and (3) as the number of sources increases, the
queries become more complicated to formulate. What is needed is a system that
does not require the user to be aware of what other sources are available and
where query rewriting is performed automatically in order to obtain as many
answers to user queries as possible. We describe such a system in the rest of the
paper.
2 Theoretical Foundations
Our approach to semantic integration of heterogeneous Linked Data sources
is based on the RDF Peer System (RPS) introduced in [2]. This is a frame-
work for peer-based integration of RDF datasets, where the semantic rela-
tionships between data at different peers are expressed through mappings.
Formally, an RPS P is defined as a tuple P = (S, G, E), where S is the set
of the peer schemas in P, G is a set of graph mapping assertions and E is a
set of equivalence mappings. A peer schema in S is the set of IRIs that a peer
(i.e. an RDF data source) adopts to describe its data. The sets of schema-level
mappings and instance-level mappings between peers are given by G and E,
respectively. G provides semantic linkage between the schemas of different peers
and it contains mapping assertions of the form Q ⇝ Q′, where Q and Q′ are
“conjunctive” SPARQL queries with the same arity over two peers, e.g.
Q := q(x, y) ← (x, actor, y)
Q′ := q(x, y) ← (x, starring, z) AND (z, artist, y)
in our earlier example setting. Mappings in E are of the form c ≡e c′, where c and c′ are IRIs located in the same peer or in two different peers. Each equivalence mapping states that the two IRIs represent the same real-world object. In this sense, equivalence mappings are used to provide coreference resolution during query processing.
A solution for an RPS P, denoted by I, is defined as a “global” RDF database which contains: (i) all the triples in the peers; (ii) triples inferred through the graph mapping assertions, by checking that, for each mapping Q ⇝ Q′, Q evaluated over I is contained in Q′ evaluated over I; and (iii) triples inferred from the equivalence mappings, such that, for each equivalence mapping c ≡e c′, c appears in I at a certain position in a triple (i.e., subject, predicate or object) if and only if I also contains a triple in which c′ appears at the same position.
Following this, query answering under RPSs is defined by extending the notion
of certain answers [1,2]. For a graph pattern query q, expressed in any peer
schema(s), the set of certain answers is given by the tuples contained in all the
results of q evaluated over all the possible solutions I. In this regard, a query is
answered by leveraging the semantics of the mappings so that a more complete
answer is given, integrating data from multiple RDF sources. We use the notion
of certain answers in RPSs to assess the correctness and completeness of our
query processing techniques.
Our system provides a query interface between the user and the Linked Data
sources. A unified SPARQL endpoint accepts queries expressed in any source
vocabulary. The queries are rewritten with respect to the semantic mappings of
the RPS, so as to retrieve the set of certain answers; we envisage that mappings
between peers can either be designed manually or automatically inferred. Then,
a second rewriting step is performed, generating a SPARQL 1.1 federated query
to be evaluated over the sources. The query result is then presented to the user.
The system’s main components are shown below.
[Architecture diagram: a SPARQL endpoint receives the user query; the query rewriting engine uses manually designed mappings and automated alignment, and produces a SPARQL 1.1 federated query that is evaluated over the Linked Data sources.]
In more detail: The query rewriting engine performs query rewriting of the
user’s query. The rewritten query is then evaluated over the sources and the
result is presented to the user. The query rewriting engine is composed of two
sub-engines:
(i) The semantic integration module generates a “perfect rewriting” of the user's query, that is, a query that yields sound and complete answers to the original query with respect to the semantic mappings in the RPS. Note that, in general, sets of RPS mappings are not FO-rewritable (see [2]), so at present our system is confined to FO-rewritable ones.
(ii) The query federation module executes a second rewriting step in order to
generate a federated query to be evaluated over multiple RDF sources. Triple
patterns in the body of the query are grouped with respect to the RDF
sources that can provide a successful graph pattern match. Then, the groups
are assigned to the endpoints of the related sources and evaluated using the SPARQL 1.1 SERVICE clause (a sketch of this grouping step follows).
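To illustrate the grouping step performed by the query federation module, the sketch below assembles a SPARQL 1.1 federated query from triple patterns grouped by source; the can_match predicate, the movies endpoint URL and the function names are illustrative assumptions rather than the system's actual API.

from collections import defaultdict

def federate(triple_patterns, endpoints, can_match):
    """Group triple patterns by the source able to answer them and wrap
    each group in a SERVICE clause (SPARQL 1.1 federation)."""
    groups = defaultdict(list)
    for tp in triple_patterns:
        for name, url in endpoints.items():
            if can_match(tp, name):   # e.g. the predicate belongs to that peer's schema
                groups[url].append(tp)
                break
    body = "\n".join("  SERVICE <{}> {{ {} }}".format(url, " . ".join(tps))
                     for url, tps in groups.items())
    return "SELECT * WHERE {\n" + body + "\n}"

# Hypothetical use with the motivating example (the movies endpoint URL is invented):
endpoints = {"movies": "http://data.movies.org/sparql",
             "people": "http://data.people.org/sparql"}
print(federate(["db1:Mulholland_Dr_2001 db1:actor ?x", "?x rdfs:label ?name",
                "?x foaf:age ?age"],
               endpoints,
               lambda tp, name: ("foaf:" in tp) == (name == "people")))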
The system provides for automated alignment of the peer schemas, to link enti-
ties and concepts in the Linked Open Data cloud. This part has not yet been
implemented, but we envisage that it would extract structural information from
the sources, such as the sets of entities, predicates, classes etc. Then, it would
perform schema alignment and coreference resolution by:
– retrieving mappings between sources, such as owl:sameAs or VoID1 triples,
and other semantic links between sources;
– generating new mappings, using existing ontology matching and instance link-
age techniques, such as Falcon-AO [3];
– translating these alignments into our peer mapping language; and
– storing the mappings in the RPS.
References
1. Abiteboul, S., Duschka, O.M.: Complexity of answering queries using materialized
views. In: PODS, pp. 254–263 (1998)
2. Dimartino, M.M., Calì, A., Poulovassilis, A., Wood, P.T.: Peer-to-peer semantic
integration of linked data. In: EDBT/ICDT Workshops, pp. 213–220 (2015)
3. Hu, W., Qu, Y., Cheng, G.: Matching large ontologies: a divide-and-conquer app-
roach. Data Knowl. Eng. 67(1), 140–160 (2008)
Graph Data
Virtual Network Mapping: A Graph Pattern
Matching Approach
1 Introduction
Virtual network mapping (VNM) is also known as virtual network embedding
or assignment. It takes as input (1) a substrate network (SN, a physical net-
work), and (2) a virtual network (VN) specified in terms of a set of virtual nodes
(machines or routers, denoted as VMs) and their virtual links, along with con-
straints imposed on the capacities of the nodes (e.g., cpu and storage) and on
the links (e.g., bandwidth and latency). VNM is to deploy the VN in the SN such
that virtual nodes are hosted on substrate nodes, virtual links are instantiated
with physical paths in the SN, and the constraints on the virtual nodes and links
are satisfied.
VNM is critical to managing big data. Big data is often distributed to data
centers [23,26]. However, data center networks become the bottleneck for dynamic
cloud workloads of querying and managing the data. In traditional networking
platforms, network resources are manually configured with static policies, and
new workload provisioning often takes days or weeks [1]. This highlights the need
for VNM, to automatically deploy virtual networks in a data center network in
response to real-time requests. Indeed, VNM is increasingly employed in industry,
e.g., Amazon’s EC2 [2], VMware Data Center [3] and Big Switch Networks
[1]. It has proven effective in increasing server utilization and reducing server
provisioning time (from days or weeks to minutes), server capital expenditures
and operating expenses [1]. There has also been a host of work on virtualization
techniques for big data [23] and database systems [7,24].
Several models have been proposed to specify VNM in various settings:
Example 1. Consider a VN request and an SN, depicted in Figs. 1(a) and 1(b),
respectively. The VN has three virtual nodes VM1 , VM2 and VM3 , each specifying
a capacity constraint, along with a constraint on each virtual link. In the SN,
each substrate node bears a resource capacity and each connection (edge) has an
attribute, indicating either bandwidth or latency. Consider the following cases.
(1) Mapping with Latency Constraints (VNML). Assume that the numbers attached to the virtual nodes and links in Fig. 1(a) denote requirements on cpus and latencies for the SN, respectively. Then the VNM problem, denoted by VNML, aims to map each virtual node to a substrate node with sufficient computational power, and to map each virtual link (v, v′) in the VN to a path in the SN such that its latency, i.e., the sum of the latencies of the edges on the path, does not exceed the latency specified for (v, v′). The need for studying VNML arises from latency-sensitive applications such as multimedia transmission networks [21], which are concerned with latency rather than bandwidth.
(2) Priority Mapping (VNMP). Assume that the constraints on the nodes in Fig. 1(a) are cpu capacities, and the constraints imposed on edges are bandwidth capacities. Here the VNM problem, denoted by VNMP, is to map each virtual node to a node in the SN with sufficient cpu capacity, and each virtual link (v, v′) in the VN to a path in the SN such that the minimum bandwidth of all edges on the path is no less than the bandwidth specified for (v, v′). The need for this is evident in many applications [4], where we want to give different priorities at run time to virtual links that share some physical links, and require the mapping to provide a bandwidth guarantee only for the connection with the highest priority (a feasibility check for this case is sketched after this example).
(3) Mapping with Node Sharing (VNESP(NS)). Assume that the numbers attached to the virtual nodes and links in Fig. 1(a) denote requirements on cpus and bandwidths for the SN, respectively. Then VNESP(NS) is an extension of single-path VN embedding (VNESP) that supports node sharing, i.e., it allows multiple virtual nodes to be mapped to the same substrate node, as needed by X-Bone [6].
There is also practical need for extending other mappings with node sharing,
such as virtual machine placement (VMP), latency mapping (VNML ), priority
mapping VNMP and multi-path VN embedding (VNEMP ). We denote such an
extension by adding a subscript NS.
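As a concrete illustration of case (2), the following minimal sketch checks a candidate mapping against the VNMP constraints; the input dictionaries (cpu requirements, bandwidth requirements, the node and path mappings, and the substrate capacities) are assumptions, since the numeric labels of Fig. 1 are not reproduced here. For VNML, the minimum over the path would be replaced by a sum of latencies compared against the specified bound.

def check_priority_mapping(vn_cpu, vn_bw, node_map, path_map, sn_cpu, sn_bw):
    """Feasibility check for a candidate VNM_P solution: every virtual node
    is hosted on a substrate node with enough cpu, and every virtual link
    is mapped to a path whose minimum edge bandwidth meets the requirement."""
    for vm, need in vn_cpu.items():            # cpu requirements per virtual node
        if sn_cpu[node_map[vm]] < need:
            return False
    for link, need in vn_bw.items():           # bandwidth requirement per virtual link
        path = path_map[link]                  # list of SN edges (u, v)
        if min(sn_bw[e] for e in path) < need:
            return False
    return True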
Observe that (a) VNM varies with practical requirements, e.g., when latency, high-priority connections or node sharing are concerned; (b) existing models are not capable of expressing such requirements; indeed, none of them is able to specify VNML, VNMP or VNESP(NS); and (c) it would be overkill to develop a model for each of the large variety of requirements and to study each individually.
As suggested by the example, we need a generic model to express virtual net-
work mappings in various practical settings, including both those already studied
(e.g., VMP, VNESP and VNEMP ) and those being overlooked (e.g., VNML , VNMP
and VNESP(NS) ). The uniform model allows us to characterize and compare VNMs
in different settings, and better still, to study generic properties that pertain to
all the variants. Among these are the complexity and approximation analyses of VNMs, which are clearly important but have, by and large, not yet been systematically studied.
Contributions & Roadmap. This work takes a step toward providing a uni-
form model to characterize VNMs. We show that VNMs, an important problem
for managing big data, can actually be tackled by graph pattern matching tech-
niques, a database topic that has been well studied. We also provide complexity
and approximation bounds for VNMs. Moreover, for intractable VNM cases, we
develop effective heuristic methods to find high-quality mappings.
(1) We propose a generic model to express VNMs in terms of graph
pattern matching [18] (Sect. 2). In this model a VN request is specified as a graph
pattern, bearing various constraints on nodes and links defined with aggregation
functions, and an SN is simply treated as a graph with attributes associated
with its nodes and edges. The decision and optimization problems for VNMs are
then simply graph pattern matching problems. We also show that the model is
able to express VNMs commonly found in practice, including all the mappings
we have seen so far (Sect. 3).
(2) We establish complexity and approximation bounds for VNMs (Sect. 4). We
give a uniform upper bound for the VNM problems expressed in this model, by
showing that all these problems are in NP. We also show that VNM is poly-
nomial time (PTIME) solvable if only node constraints are present (VMP), but
it becomes NP-complete when either node sharing is allowed or constraints on
edges are imposed. Moreover, we propose a VNM cost function and study opti-
mization problems for VNM based on the metric. We show that the optimization
problems are intractable in most cases and worse still, are NPO-complete in gen-
eral and APX-hard [10] for special cases. To the best of our knowledge, these are
among the first complexity and approximation results on VNMs.
We contend that these results are useful for developing virtualized cloud data
centers for querying and managing big data, among other things. By modeling
VNM as graph pattern matching, we are able to characterize various VN requests
with different classes of graph patterns, and study the expressive power and
complexity of these graph pattern languages. The techniques developed for graph
pattern matching can be leveraged to study VNMs. Indeed, the proofs of some of
the results in this work capitalize on graph pattern techniques. Furthermore, the
results of this work are also of interest to the study of graph pattern matching [18].
Example 2. The SN depicted in Fig. 1(b) is a weighted graph GS , where (1) the
node set is {a, b, ..., f}; (2) the edges include the directed edges in the graph; (3)
the weights associated with nodes indicate cpu capacities; and (4) the weights of
edges denote bandwidth or latency capacities. Figure 1(a) shows a VN, where (1)
the node set is {VM1 , VM2 , VM3 }; (2) the edge set is {(VMi , VMj ) | i, j = 1, 2, 3};
(3) fVP (VM1 ) = 66, fVP (VM2 ) = 20, fVP (VM3 ) = 30; and (4) the function fEP
is defined on the edge labels. As will be seen when we define the notion of VN
requests, the labels indicate requirements on deploying the VN in an SN.
3 Case Study
All the VNM requirements in the Introduction (Sect. 1) can be expressed in our
model, by treating the VN request as a pattern and the SN as a graph. Below we present
a case study.
Example 3. Consider the VN given in Fig. 1(a) and the SN of Fig. 1(b). Con-
straints for priority mapping can be defined as described above, using the node
and edge labels (on bandwidths) in Fig. 1(a). There exists a priority mapping
from the VN to the SN. Indeed, one can map VM1 , VM2 and VM3 to b, a and d,
respectively, and map the virtual links to the shortest physical paths uniquely
determined by the node mapping, e.g., (VM1 , VM2 ) is mapped to (b, a).
When node sharing is allowed in VNESP , i.e., for single-path embedding with
node sharing (VNESP(NS) ), a VN request is specified similarly. Here a substrate
node u can host multiple virtual nodes (hence |N (u)| ≥ 0) such that the sum of
the capacities of all the virtual nodes does not exceed the capacity of u. Similarly,
one can also specify multi-path VN embedding with node sharing (VNEMP(NS) ).
Example 4. Consider the VN of Fig. 2(a), and the SN of Fig. 2(b). There is a
VNESP from the VN to the SN, by mapping VM1 , VM2 , VM3 to a, b, e, respectively,
and mapping the VN edges to the shortest paths in the SN determined by the
node mapping. There is also a multi-path embedding VNEMP from the VN to the
SN, by mapping VM1 , VM2 and VM3 to a, c and e, respectively. For the virtual
links, (VM1 , VM2 ) can be mapped to the physical path (a, b, c), (VM1 , VM3 )
to (a, e), and (VM3 , VM2 ) to two paths ρ1 = (e, b, c) and ρ2 = (e, d, c) with
rE ((VM3 , VM2 ), ρ1 ) = 5 and rE ((VM3 , VM2 ), ρ2 ) = 15; similarly for the other
virtual links.
One can verify that the VN of Fig. 2(a) allows no more than one virtual node
to be mapped to the same substrate node in Fig. 2(b). However, if we change the
bandwidths of the edges connecting a and e in SN from 30 to fVS (a, e) = 40 and
fVS (e, a) = 50, then there exists a mapping from the VN to the SN that supports
node sharing. Indeed, in this setting, one can map both VM1 , VM2 to e and map
VM3 to a; and map the virtual edges to the shortest physical paths determined
by the node mapping; for instance, both (VM1 , VM3 ) and (VM2 , VM3 ) can be
mapped to (e, a).
Here NPO is the class of all NP optimization problems (cf. [10]). An NPO-
complete problem is NP-hard to optimize, and is among the hardest optimization
problems. APX is the class of problems that allow PTIME approximation algo-
rithms with a constant approximation ratio (cf. [10]).
5 Related Work
Virtualization techniques have been investigated for big data processing [23] and
database applications [7,8,24]. However, none of these has provided a systematic
study of VNM, by modeling VNM as graph pattern matching. The only exception
is [20], which adopted subgraph isomorphism for VNM, a special case of the
generic model proposed in this work. Moreover, complexity and approximation
analyses associated with VNM have not been studied for cloud computing in
database applications.
Several models have been developed for VNM. (a) The VM placement prob-
lem (VMP, [12]) is to map a set of VMs onto an SN with constraints on node
capacities. (b) Single-path VN embedding (VNESP , [22] ) is to map a VN to an
SN by a node-to-node injection and an edge-to-path function, subject to con-
straints on the cpu capacities of nodes and constraints on the bandwidths of
physical connections. (c) Different from VNESP , multi-path embedding (VNEMP ,
[14,25]) allows an edge of a VN to be mapped to multiple parallel paths of an
SN such that the sum of the bandwidth capacities of those paths is no smaller
than the bandwidth of that edge. (d) While graph layout problems are similar
to VN mapping, they do not have bandwidth constraints on edges but instead,
impose certain topological constraints (see [15] for a survey). In contrast to our
work, these models are studied for specific domains, and no previous work has
studied generic models to support various VN requests that commonly arise in
practice.
Very few complexity results are known for VNM. The only work we are aware
of is [9], which claimed that the testbed mapping problem is NP-hard in the pres-
ence of node types and some links with infinite capacity. Several complexity and
approximation results are established for graph pattern matching (see [18] for
a survey). However, those results are for edge-to-edge mappings, whereas VNM
typically needs to map virtual links to physical paths. There have been recent
extensions to support edge-to-path mappings for graph pattern matching [16,17],
with several intractability and approximation bounds established there. Those
differ from this work in that either no constraints on links are considered [17], or
graph simulation is adopted [16], which does not work for VNM. The complexity
and approximation bounds developed in this work are among the first results
that have been developed for VNM in cloud computing.
6 Conclusion
We are also exploring techniques for processing VN requests for different appli-
cations, as well as their use in graph pattern matching.
Acknowledgments. Fan and Cao are supported in part by NSFC 61133002, 973
Program 2014CB340302, Shenzhen Peacock Program 1105100030834361, Guangdong
Innovative Research Team Program 2011D005, EPSRC EP/J015377/1 and
EP/M025268/1, and a Google Faculty Research Award. Ma is supported in part by
973 Program 2014CB340304, NSFC 61322207 and the Fundamental Research Funds
for the Central Universities.
References
1. http://www.bigswitch.com/
2. http://aws.amazon.com/ec2/
3. http://www.vmware.com/solutions/datacenter/
4. http://frenzy.ivic.org.cn/
5. http://homepages.inf.ed.ac.uk/s1165433/papers/vnm-full.pdf
6. http://www.isi.edu/xbone/
7. Aboulnaga, A., Amza, C., Salem, K.: Virtualization and databases: state of the
art and research challenges. In: EDBT (2008)
8. Aboulnaga, A., Salem, K., Soror, A., Minhas, U., Kokosielis, P., Kamath, S.:
Deploying database appliances in the cloud. IEEE Data Eng. Bull 32(1), 13–20
(2009)
9. Andersen, D.: Theoretical approaches to node assignment (2002)(unpublished
manuscript)
10. Ausiello, G.: Complexity and Approximation: Combinatorial Optimization Prob-
lems and Their Approximability Properties. Springer Verlag, Heidelberg (1999)
11. Bavier, A.C., Feamster, N., Huang, M., Peterson, L.L., Rexford, J.: In VINI veritas:
realistic and controlled network experimentation. In: SIGCOMM (2006)
12. Bobroff, N., Kochut, A., Beaty, K.: Dynamic placement of virtual machines for
managing sla violations. In: IM (2007)
13. Chowdhury, N., Boutaba, R.: A survey of network virtualization. Comput. Netw.
54(5), 862–876 (2010)
14. Chowdhury, N., Rahman, M., Boutaba, R.: Virtual network embedding with coor-
dinated node and link mapping. In: INFOCOM (2009)
15. Dı́az, J., Petit, J., Serna, M.: A survey of graph layout problems. CSUR 34(3),
313–356 (2002)
16. Fan, W., Li, J., Ma, S., Tang, N., Wu, Y., Wu, Y.: Graph pattern matching: from
intractable to polynomial time. In: VLDB (2010)
17. Fan, W., Li, J., Ma, S., Wang, H., Wu, Y.: Graph homomorphism revisited for
graph matching. In: VLDB (2010)
18. Gallagher, B.: Matching structure and semantics: a survey on graph-based pattern
matching. In: AAAI FS (2006)
19. Guo, C., Lu, G., Li, D., Wu, H., Zhang, X., Shi, Y., Tian, C., Zhang, Y., Lu, S.:
Bcube: a high performance, server-centric network architecture for modular data
centers. In: SIGCOMM (2009)
20. Lischka, J., Karl, H.: A virtual network mapping algorithm based on subgraph
isomorphism detection. In: SIGCOMM workshop VISA (2009)
21. Reinhardt, W.: Advance reservation of network resources for multimedia applica-
tions. In: IWACA (1994)
22. Ricci, R., Alfeld, C., Lepreau, J.: A solver for the network testbed mapping prob-
lem. SIGCOMM CCR 33, 65–81 (2003)
23. Trelles, O., Prins, P., Snir, M., Jansen, R.C.: Big data, but are we ready? Nat.
Rev. Genet. 12(3), 224 (2011)
24. Xiong, P., Chi, Y., Zhu, S., Moon, H.J., Pu, C., Hacigümüs, H.: Intelligent man-
agement of virtualized resources for database systems in cloud environment. In:
ICDE (2011)
25. Yu, M., Yi, Y., Rexford, J., Chiang, M.: Rethinking virtual network embedding:
substrate support for path splitting and migration. SIGCOMM CCR 38(2), 17–29
(2008)
26. Zong, B., Raghavendra, R., Srivatsa, M., Yan, X., Singh, A.K., Lee, K.: Cloud
service placement via subgraph matching. In: ICDE (2014)
A Fast Approach for Detecting Overlapping
Communities in Social Networks Based
on Game Theory
1 Introduction
game theory. Non-cooperative game theory [7] studies the individual behaviors of agents, where each agent selects its strategy independently to improve its own utility. Cooperative game theory [8] studies the cooperative behaviors of groups of agents, where agents cooperate with each other to improve the group's utility; a group of agents is called a coalition.
Game theory, whether cooperative or non-cooperative, has so far been used separately to solve community detection problems. Non-cooperative game theory-based methods consider community formation as the result of the individual behaviors of selfish agents and the community structure as the equilibrium amongst individual agents, while cooperative game theory-based methods consider community formation as the result of the group behaviors of rational agents and the community structure as the equilibrium amongst groups of rational agents. Cooperative game theory-based methods neglect the individual utilities of agents, while non-cooperative game theory-based methods neglect the utilities of groups. Thus, it is a challenging task to improve existing methods to obtain more rational and logical results.
In this study, we develop a new approach that utilizes both cooperative and non-cooperative game theory to detect communities. Firstly, individuals in a social network are regarded as rational agents who cooperate with other agents to form coalitions in order to achieve and improve group utilities. Then, each individual is modelled as a selfish agent who selects coalitions to join or leave based on its own utility measurement. Each agent is allowed to select multiple coalitions, so overlapping coalitions can be formed. Because the non-cooperative game is played on the basis of the result of the cooperative game, rather than starting from an initial state in which every agent forms a community of its own, the number of agents that would change their community memberships to improve their utilities decreases, and thus the efficiency of the non-cooperative game is improved. By combining cooperative and non-cooperative game theory, the utilities of groups and individuals can be taken into account simultaneously, so the rationality and accuracy of the communities detected can be improved and the computational cost decreased.
The rest of this paper is organized as follows: Sect. 2 introduces related work, and Sect. 3 introduces a game theory-based approach for community detection. The experimental results on real networks and benchmark networks are presented in Sect. 4, and Sect. 5 summarizes this paper.
2 Related Work
[Figure: starting from the graph G = (V, E), the group game produces a stable coalition structure {S1, S2, ..., Sk}, and the individual game then produces a stable community structure.]
further. After the group game reaches an equilibrium state of coalitions, the utility of each coalition cannot be improved any further, but some agents may not be satisfied with their individual utilities, so the agents begin to play the individual game to improve their individual utilities.
v_x(S_i) = e(x, S_i) / d(x)    (2)
where e(x, S_i) denotes the number of edges that link x to nodes of coalition S_i. The value of v_x(S_i) is the ratio of edges between x and S_i over the degree of x, and it measures how close x is to S_i, with 0 ≤ v_x(S_i) ≤ 1. v_x(S_i) = 1 means that all edges of x are connected to nodes of S_i; in this case, x becomes an inner node of S_i after joining it. v_x(S_i) = 0 means that no edge connects x to a node of S_i. The greater the value of v_x(S_i), the closer the connection between x and S_i.
Definition 6. Two operations: join and leave. Let x ∈ V and S_i ∈ C. If x ∉ S_i and v_x(S_i) ≥ ω, then x joins S_i, i.e., S_i = S_i + {x}; if x ∈ S_i and v_x(S_i) < ε, then x leaves S_i, i.e., S_i = S_i − {x}.
Here ω is the lower bound on the utility value for x to join a new coalition, and ε is the upper bound on the utility value for x to leave the coalition it is in.
Definition 7. The (ω, ε)-stable community structure. A collection of coalitions C = {S_1, S_2, ..., S_k} is regarded as a stable community structure if ∀S_i ∈ C, ∀x ∈ S_i, v_x(S_i) ≥ ω, and ∀x ∉ S_i, v_x(S_i) < ε.
A stable community structure can be regarded as a kind of equilibrium state of agents, in which no agent has an interest in changing its coalition memberships any further.
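Definitions 6 and 7 suggest the following minimal sketch of the individual game; this is an illustrative reconstruction rather than the authors' code, where graph maps each node to its set of neighbours and coalitions is the coalition structure produced by the group game.

def utility(graph, x, S):
    """v_x(S) = e(x, S) / d(x): the fraction of x's edges that go into S."""
    return sum(1 for y in graph[x] if y in S) / len(graph[x])

def individual_game(graph, coalitions, omega, epsilon, max_rounds=100):
    """Individual (non-cooperative) game: starting from the coalitions of the
    group game, each agent joins any coalition with utility >= omega and
    leaves any coalition with utility < epsilon, until no agent moves,
    i.e. an (omega, epsilon)-stable community structure is reached."""
    for _ in range(max_rounds):
        changed = False
        for x in graph:
            for S in coalitions:
                if x not in S and utility(graph, x, S) >= omega:
                    S.add(x); changed = True
                elif x in S and utility(graph, x, S) < epsilon:
                    S.discard(x); changed = True
        if not changed:
            break
    return [S for S in coalitions if S]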
4 Experiments
into two components, corresponding to the left overlapping communities and the two
right communities in Fig. 2(c).
Fig. 2. The community structures of the Zachary’s karate network. (a) The LocalEquilibrium
[12]; (b) The group game; (c) The CoCo-game
Figure 3 presents the results for the Les Misérables network of characters, where (a) is the network structure, (b) is the result of the LocalEquilibrium, (c) is the result of the group game, and (d) is the result of the CoCo-game. In this network, the LocalEquilibrium detects 13 communities, the group game detects 6 communities, and the CoCo-game detects 5 communities. Comparing (c) with (d), nodes 3, 4, 47 and 48 change their
Fig. 3. The community structures of the Les Misérables network of characters. (a) The network; (b) The LocalEquilibrium; (c) The group game; (d) The CoCo-game
memberships. Comparing (b) with (d), we can see that the middle community in (d) approximates the combination of the communities containing node 12 in (b).
[Figure 4 panels: combinations of minc=10/maxc=50 and minc=20/maxc=100 with mu=0.1 and mu=0.3; y-axis: Normalized Mutual Information.]
Fig. 4. The NMI values between the community structures detected by the LocalEquilibrium/the
group game/the CoCo-game and the benchmark community structures under different fractions
of overlapping nodes, (a)–(d) consist of 1,000 nodes, (e)–(h) consist of 5,000 nodes.
[Figure 5 panels: combinations of minc=10/maxc=50 and minc=20/maxc=100 with mu=0.1 and mu=0.3; y-axis: Running Time (s).]
Fig. 5. The running times of the LocalEquilibrium, the group game and the CoCo-game under
different fractions of overlapping nodes, (a)–(d) consist of 1,000 nodes, (e)–(h) consist of 5,000
nodes.
Figure 4 presents the NMI values between the community structures detected by the
LocalEquilibrium/the group game/the CoCo-game and the benchmark community
structures under different fractions of overlapping nodes. Figure 5 compares the run-
ning times of the LocalEquilibrium, the group game and the CoCo-game for detecting
community structures on the produced benchmark networks. The x-axis represents the
portion of nodes that belong to multiple communities.
Figure 4 shows that the CoCo-game outperforms the group game in all cases. This indicates that playing the individual game after the group game is effective. Meanwhile, the CoCo-game is similar to the LocalEquilibrium for mu = 0.1 and outperforms the LocalEquilibrium for mu = 0.3.
Figure 5 indicates that the running times of both the group game and the CoCo-game are acceptable, and that they are much faster than the LocalEquilibrium over all instances. Moreover, the running time of the LocalEquilibrium increases greatly with the number of nodes N, the portion of crossing edges mu, and the fraction of overlapping nodes. Meanwhile, the running times of both the group game and the CoCo-game are more stable than that of the LocalEquilibrium.
Figure 6(a) and (b) present the NMI values between the community structures detected by the CoCo-game and the benchmark community structures under different ω and ε. The network used consists of 1,000 nodes, of which 150 nodes belong to 2 communities, with mu = 0.3, minc = 20 and maxc = 100. The x-axis represents the value of ω or ε. From Fig. 6, we can see that the value of NMI is affected by ω and ε. How to set ω and ε automatically is left for future work.
[Figure 6 axes: Normalized Mutual Information against ω (with ε = 1/4) and against ε (with ω = 1/3).]
Fig. 6. The NMI values between the community structures detected by the CoCo-game and the benchmark community structures under different ω and ε.
5 Summary
In this paper, we developed a new approach that utilizes both cooperative and non-cooperative game theory to detect communities with improved accuracy. Because each agent is allowed to select multiple coalitions, overlapping communities can be identified rationally. The experimental results demonstrate the features of our approach and show that the joint use of cooperative and non-cooperative game theories to detect overlapping communities is effective and efficient.
Acknowledgement. The authors sincerely thank Mr. Wei Chen from Microsoft Research Asia for providing the code for their work and for his help. This work is supported by the National Natural Science Foundation of China under Grants No. 61262069 and No. 61472346, the Program for Young and Middle-aged Teachers, Yunnan University, and the Program for Innovation Research Team in Yunnan University (Grant No. XT412011).
References
1. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in networks.
Phys. Rev. E 69, 026113 (2004)
2. Fortunato, S.: Community detection in graphs. Phys. Rep. 486, 75–174 (2010)
3. Li, X.T., Ng, M.K., Ye, Y.M.: Multicomm: finding community structure in multi-
dimensional networks. IEEE Trans. Knowl. Data Eng. 26(4), 929–941 (2014)
4. Folino, F., Pizzuti, C.: An evolutionary multiobjective approach for community discovery in
dynamic networks. IEEE Trans. Knowl. Data Eng. 26(8), 1838–1852 (2014)
5. Zhou, L., Lü, K., Cheng, C., Chen, H.: A game theory based approach for community
detection in social networks. In: Gottlob, G., Grasso, G., Olteanu, D., Schallhart, C. (eds.)
BNCOD 2013. LNCS, vol. 7968, pp. 268–281. Springer, Heidelberg (2013)
6. Zacharias, G.L., MacMillan, J., Hemel, S.B.V. (eds.): Behavioral Modeling and Simulation:
From Individuals to Societies. The National Academies Press, Washington (2008)
7. Nash, J.F.: Non-cooperative games. Ann. Math. 54(2), 286–295 (1951)
8. Zlotkin, G., Rosenschein, J.: Coalition, cryptography, and stability: mechanisms for coalition formation in task oriented domains. In: Proceedings of the Twelfth National Conference on Artificial Intelligence, Seattle, Washington, 1–4 August, pp. 432–437. The AAAI Press, Menlo Park, California (1994)
9. Palla, G., Derenyi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community
structures of complex networks in nature and society. Nat. 435, 814–818 (2005)
10. Ahn, Y.Y., Bagrow, J.P., Lehmann, S.: Link communities reveal multi-scale complexity in
networks. Nat. 466(7307), 761–764 (2010)
11. Ball, B., Karrer, B., Newman, M.E.J.: An efficient and principled method for detecting
communities in networks. Phys. Rev. E 84, 036103 (2011)
12. Chen, W., Liu, Z., Sun, X., Wang, Y.: A game-theoretic framework to identify overlapping
communities in social networks. Data Min. Knowl. Disc. 21(2), 224–240 (2010)
13. Alvari, H., Hashemi, S., Hamzeh, A.: Detecting overlapping communities in social networks
by game theory and structural equivalence concept. In: Deng, H., Miao, D., Lei, J., Wang,
F.L. (eds.) AICI 2011, Part II. LNCS, vol. 7003, pp. 620–630. Springer, Heidelberg (2011)
14. Lung, R.I., Gog, A., Chira, C.: A game theoretic approach to community detection in social
networks. In: Pelta, D.A., Krasnogor, N., Dumitrescu, D., Chira, C., Lung, R. (eds.) NICSO
2011. SCI, vol. 387, pp. 121–131. Springer, Heidelberg (2011)
15. Hajibagheri, A., Alvari, H., Hamzeh, A., Hashemi, A.: Social networks community detection using the Shapley value. In: 16th CSI International Symposium on Artificial Intelligence and Signal Processing (AISP), Shiraz, Iran, 2–3 May, pp. 222–227 (2012)
16. Zhou, L., Cheng, C., Lü, K., Chen, H.: Using coalitional games to detect communities in
social networks. In: Wang, J., Xiong, H., Ishikawa, Y., Xu, J., Zhou, J. (eds.) WAIM 2013.
LNCS, vol. 7923, pp. 326–331. Springer, Heidelberg (2013)
17. Zhou, L., Lü, K.: Detecting communities with different sizes for social network analysis.
The Comput. J. Oxford University Press (2014). doi:10.1093/comjnl/bxu087
18. Lancichinetti, A., Fortunato, S.: Benchmarks for testing community detection algorithms
on directed and weighted graphs with overlapping communities. Phys. Rev. E 80(1),
16118 (2009)
19. Danon, L., Díaz-Guilera, A., Duch, J., Arenas, A.: Comparing community structure
identification. J. Stat. Mech.: Theory Exp. 2005(09), P09008 (2005)
20. Lancichinetti, A., Fortunato, S., Kertesz, J.: Detecting the overlapping and hierarchical
community structure in complex networks. New J. Phys. 11, 033015 (2009)
21. Zachary, W.W.: An information flow model for conflict and fission in small groups.
J. Anthropol. Res. 33, 452–473 (1977)
22. Knuth, D.E.: The Stanford GraphBase: A Platform for Combinatorial Computing. ACM
Press, New York (1993)
Consistent RDF Updates
with Correct Dense Deltas
1 Introduction
Resource Description Framework (RDF) is an annotation language that provides
a graph-based representation of information about Web resources in the Seman-
tic Web. Because RDF content (in triple form) is shared between different agents,
a common interpretation of the terms used in annotations is required. This inter-
pretation is typically provided by an ontology expressed as RDF Schema (RDFS)
or Web Ontology Language (OWL). Both RDFS and OWL are expressed as RDF
triples. The schema provides additional semantics for the basic RDF model. In
any particular data collection, changes in the domain that are reflected by evo-
lution of the ontology may require changes in the underlying RDF data. Due to
the dynamic and evolving nature of typical Semantic Web structures, RDF data
may change on a regular basis, producing successive versions that are available
for publication and distribution [4]. In the context of such dynamic RDF data
collections, which may be very large structures, it quickly becomes infeasible
to store a historic sequence of updates in any accessible form as a consequence
of the significant storage space needed. An alternative solution to propagation
and storage of successively updated copies of a data collection is to compute the
differences between these copies and use these as a means of transforming the
base data structure into subsequent versions. These differences (the delta) show
triple content that has been changed between two RDF models and can be used
to transform one RDF model into another. Rather than storing all versions of a
data structure, it is only necessary to store one version and retain the capability
of restoring any version of interest by executing the consecutive deltas.
The work presented in this paper addresses the problem of change detection
in RDF knowledge bases. An important requirement of change detection tools is
their ability to produce the smallest correct delta that will efficiently transform
one RDF model to another. This is a particularly important problem when
RDF collections are large and dynamic. In this context, propagation between
server and client or between nodes in a peer-to-peer system becomes challenging
as a consequence of the potentially excessive use of network bandwidth. In a
scenario where RDF update is carried out by push-based processes, the update
itself needs to be minimised to restrict network bandwidth costs. In addition, in
pull-based scenarios, it is important to limit server processing so that updates
can be generated with maximum efficiency. The contribution of this work is an
approach for using the smallest deltas that will maintain the consistency of an
RDF knowledge base together with an evaluation of the performance challenges
of generating this structure.
2 Related Work
Managing the differences between RDF knowledge bases using deltas is an important task in the ontology evolution process, because deltas allow the synchronization of ontology changes [2], the update of ontologies to newer versions, and the reduction of the storage overhead required to hold ontology versions [8].
Changes between ontologies can be detected using change detection tools that
report changes in low-level (RDF) or high level (ontology) structures. High-
level change detection techniques typically focus on exploiting semantic vari-
ation between ontologies. Examples of these tools include SemVersion [9] and
PromptDiff [6]. High-level changes may involve adding or generalising domains or
renaming classes [7]. By contrast, low-level change detection techniques focus on
reporting ontology changes in terms of simple change operations (i.e. add/delete
triples). These tools differ in the level of semantic complexity represented by
the ontology languages. Work in low-level change detection tools focuses on the
exploitation of useful properties for producing deltas (e.g. the delta size and
the level of execution semantics) that can be interpreted by both human and
machine.
For example, Zeginis et al. [10] proposed three RDF/S differential delta func-
tions associated with the inferred knowledge from RDFS knowledge bases: dense
(ΔD); dense & closure (ΔDC) and explicit & dense (ΔED). These deltas vary in
the application of inference to reduce their size and are explained in greater detail
in Sect. 3. Results show that ΔD produced the smallest delta but was prone to
ambiguity and may potentially produce inconsistently updated RDF knowledge
bases. In this paper, we characterise ΔDc , which is a correction method for ΔD
RDF updates allow low-level triple operations for insertion and deletion that were formalised by Zeginis et al. [10]. In the context of the two example RDF models M and M′ in Fig. 1, the naïve way of generating the delta involves computing the set-difference between the two versions using the explicit sets of triples forming these versions. The explicit delta (ΔE) contains a set of triples to be deleted from and inserted into M in order to transform it into M′. For the example in Fig. 1, the delta obtained by applying the above change detection function is shown in Fig. 2.
Executing these updates against M will correctly transform it into M′. However, this function handles only the syntactic level of RDF and does not exploit its semantics. In the latter context, executing some of the updates in ΔE is not necessary, as the corresponding triples can still be inferred from other triples. For instance, we can observe from the example in Fig. 1 that deleting (Graduate subClassOf Person) from M, in order to transform it into M′, is not necessary as this triple can still be inferred from the triples (Graduate subClassOf Student) and (Student subClassOf Person) in M′. Since this update is not necessary, it is useful to remove it from the delta. RDF data is rich in semantic content, and exploiting this in the process of updating RDF models can minimize the delta size and therefore the storage space and the time needed to synchronize changes between models.
C(M) = M ∪ {t ∈ (S × P × O) | M ⊨ t}
The rules in Table 1 can be used in the explicit dense function (ΔED), which combines the explicit and inference approaches for computing the delta. The inserted set of triples is computed explicitly as in ΔE, while the delete set is computed based on inference using the rule set.
Applying this function to the example in Fig. 1 produces the delta shown in Fig. 3. The inserts in this delta are obtained by explicitly calculating the set difference M′ − M, giving the set of triples that should be inserted into M in order to transform it into M′. The set of deleted triples, on the other hand, is obtained by calculating the closure of M′, using the RDFS entailment rules to infer new triples and add them to M′. From the example, the inferred triples in M′ are:
(Teacher subClassOf Person)
(Head Teacher subClassOf Person)
(Head Teacher subClassOf Staff)
(Graduate subClassOf Student)
These inferred triples are then added to M′ to calculate the set difference M − C(M′), which results in only one triple to delete: (John type Student). The number of updates produced by this delta is smaller than the number produced by ΔE, as a result of the inference process.
The effect of the inference process in minimising ΔED was limited to apply-
ing the inference rules when computing the deleted set of triples only. Applying
inference rules for computing the inserted triples may further reduce the number
of updates. For example, inserting the three triples (Teacher subClassOf Person),
(Head Teacher subClassOf Person) and (John type Person) into M may not be
necessary because these triples implicitly exist in M and can be inferred in M
using the RDFS entailment rules. In this example, applying rdfs1 to M would
infer (John Type Person) while the other two triples could be inferred using
rdfs2. The application of inference over both the insert and delete sets produces
the dense delta (ΔD).
Figure 4(a) and (b) illustrate the distinction between ΔED and ΔD. In the former, only the deletes that are not in C(M′) need to be carried out; in this case, C(M) is not checked to see whether all of the planned inserts need to be applied. In the case of ΔD, deletes are handled in the same way as in ΔED, but inserts are only applied if they are not in C(M). This results in minimising both delete and insert operations.
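In terms of set operations, the three delta functions discussed so far can be sketched as follows; models are represented as sets of triples, M_new stands for M′, and closure is an assumed helper computing C(·) under the RDFS rules of Table 1.

def delta_e(M, M_new):
    """Explicit delta: plain set differences between the two triple sets."""
    return M - M_new, M_new - M                    # (Del, Ins)

def delta_ed(M, M_new, closure):
    """Explicit & dense delta: inserts as in ΔE; a triple is deleted only
    if it is not in the closure of the new model."""
    return M - closure(M_new), M_new - M

def delta_d(M, M_new, closure):
    """Dense delta: inference applied to both sets (the smallest delta,
    but not guaranteed to transform M into M_new correctly)."""
    return M - closure(M_new), M_new - closure(M)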
From the example in Fig. 1, the updates generated by applying ΔD are shown in Fig. 5. ΔD is smaller than either ΔE or ΔED, with only three updates to transform M into M′. However, in contrast to ΔE and ΔED, ΔD does not always provide a correct delta to carry out the transformation. In this case, applying ΔD to transform M into M′ will transform M as shown in Fig. 7. This delta function does not correctly update M to M′ because, when applying the updates, (John type Person) is not inserted into M and cannot be inferred in M after the triple (John type Student) has been deleted.
for b ∈ Ins do
    if inferable(b, M) and all antecedents of b ∉ Del then
        remove b from Ins;
[Figure: Venn diagrams over M, M′, C(M) and C(M′) showing the unchanged, delete and insert regions for each delta.]
Fig. 5. The dense delta (ΔD)
Fig. 6. The corrected dense delta (ΔDc)
Under the semantics of the subset of RDFS rules in Table 1, all deltas are unique with respect to the difference between C(M) and C(M′). ΔDc does not require M or M′ to be closed and consequently it is not unique.
The corrected dense delta is produced by checking triples in both the insert and delete sets of ΔE. Firstly, the delete set should be calculated before the insert set. Secondly, all antecedents of each inferred triple must be checked to see whether they exist in the delete set. If one or both antecedents exist in the delete set then this triple cannot be inferred. To calculate the closure of M in order to compute the insert set, if two triples in M point to a conclusion based on the rules, then these triples are checked against the delete set. The conclusion cannot hold if at least one of the two triples exists in the delete set; otherwise, the conclusion holds and the triple can be inferred in M. This process (Algorithm 1) produces the corrected dense delta ΔDc.
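A minimal sketch of ΔDc generation under the backward-inference strategy, following the description above and the fragment of Algorithm 1; inferable and antecedents are assumed helpers that apply the RDFS rules of Table 1, and M_new again stands for M′.

def delta_dc(M, M_new, inferable, antecedents):
    """Corrected dense delta: compute the delete set first; an insert is
    then dropped only if it is inferable in M and none of the triples it
    is inferred from is itself scheduled for deletion (cf. Algorithm 1)."""
    Del = {t for t in M - M_new if not inferable(t, M_new)}
    Ins = set(M_new - M)
    for t in list(Ins):
        if inferable(t, M) and not (antecedents(t, M) & Del):
            Ins.discard(t)
    return Del, Ins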
Because the delete set is calculated first, the triple (John Type Person) will not be inferred from (John Type Student) and (Student SubclassOf Person), given that the former is included in the delete set. The delta will result in the updates shown in Fig. 6. Applying these updates to M will result in the model in Fig. 8. This model is identical to M′, indicating the correctness of ΔDc. The number of updates after fixing the incorrectness problem is increased, but a correct delta is produced. However, this number is smaller than the number of updates produced by ΔED, or equal to it in the worst case. In such a worst case, none of the inserted triples in ΔDc can be inferred in M, because either there are no triples that can be inferred or at least one of the antecedents of every inferable triple is included in the delete set.
Both the ΔED and ΔDc functions discussed above apply an inference-then-difference strategy. This implies that the full closure of the RDF models must be calculated and all the possible conclusions under the RDFS entailment rules stored in these models. By contrast, a backward inference approach uses a difference-then-inference strategy. That is, instead of computing the entire closure of M′, in the case of ΔED this method first calculates the set-differences M − M′ and M′ − M, and then checks every triple in M − M′ and removes it if it can be inferred in M′. The operation becomes:
Del = {t ∈ (M − M′) | t cannot be inferred in M′},  Ins = M′ − M
Instead of pre-computing the full closure in advance, this method infers only triples related to the result of M − M′. This would be expected to improve the time and space required for change detection by comparison with the forward inference approach.
In the example dataset shown in Fig. 1, to calculate ΔED using the backward inference strategy, the sets of inserted and deleted triples are calculated using the set-difference operation in the same way as when calculating ΔE. After calculating the changes at the syntactic level, each triple in the delete set is checked to see if it can be inferred in M′ using the RDFS entailment rules. For example, the triple (Graduate subClassOf Person) in M − M′ is checked to see if it can be derived in M′. Using the RDFS entailment rules this triple can be derived from the two triples (Graduate subClassOf Student) and (Student subClassOf Person); therefore, this triple is removed from M − M′. Rather than checking all the triples in M′, only the three triples in M − M′ are checked.
For applying backward inference in ΔDc, first the set of deleted triples in M − M′ is inferred as explained above; then the set of inserted triples in M′ − M is also checked to see if they can be derived in M. However, to guarantee the correctness of the delta, before removing the inferable triples from the delta, the antecedents of each inferable triple in M′ − M are checked to see if at least one of them exists in M − M′. If this is the case, the triple cannot be removed from the delta. Algorithm 1 describes the generation of ΔDc by backward inference.
Both forward inference and backward inference produce the same delta, but the latter applies the inference rules to only the necessary triples. However, although the backward inference method is applied to infer only relevant triples, applying inference on some of these triples might still be unnecessary, allowing pruning to be applied before backward inference [4]. The general rule for pruning is that if the subject or object of a triple in M − M′ or M′ − M does not exist in M′ or M, respectively, then this triple cannot be inferred; consequently the triple can be pruned before the inference process begins. Although pruning may reduce the workload for inferencing, it carries a potential performance penalty [1].
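The pruning rule can be sketched as follows, with triples represented as (subject, predicate, object) tuples; candidates stands for M − M′ or M′ − M and other_model for the model in which inference would be attempted.

def prune(candidates, other_model):
    """Pruning rule: a triple whose subject or object never occurs in the
    other model cannot be inferred there, so it is skipped before backward
    inference is attempted."""
    nodes = {s for s, p, o in other_model} | {o for s, p, o in other_model}
    return {t for t in candidates if t[0] in nodes and t[2] in nodes}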
In the evaluation, regardless of the data structure used, the time required to carry out pruning exceeds the inference time both for ΔDc and ΔED. This is consistent with previous findings [1]. The
overall delta time shown in Fig. 11 indicates that taking account of set difference
operations, inferencing and pruning, approaches that prune the delta set tend
to require significantly more processing power than non-pruning approaches.
Overall, the ΔE is the fastest process since no pruning or inferencing is carried
out. The delta sizes shown in Fig. 12 indicate that applying inference on this
data set reduces the updates that need to be executed, particularly when it is
applied to both the insert and delete sets.
The relationship between Figs. 11 and 12 is summarised in Fig. 13, which is
based on the average delta size and average generation time for all the data
models. Figure 13 shows the interaction between the degree of inference (i.e.
the delete set and/or the insert set or no inference at all) and the approach to
inferencing (i.e. inferring all triples or only necessary triples) and their impact
on the delta size and the delta computation time. It can be seen that ΔDc
has the smallest delta size compared to ΔED and ΔE. It can also be seen
that the approach to inferencing affects the delta computation time. Figure 13
indicates that ΔBc is more efficient (i.e. smaller delta size and faster generation)
than the other methods tested. Overall, Fig. 12 shows that the computation time increases in the sequence explicit, backward inference, pruned backward inference, forward inference, whereas the delta size increases in the sequence ΔDc, ΔED, ΔE.
The consistency of M′ after delta application was evaluated by comparing the M′ produced by applying the delta to M in the database with the original in-memory M′, using the Jena isIsomorphic method. Applying ΔDc using the approach described above was found to result in the same M′ as that used
to generate the delta. By contrast, tests carried out to assess the consistency
of applying the uncorrected ΔD indicate that in all the models tested, this
approach always failed to produce consistent updates.
The overall effect of these results is to indicate that ΔDc provides a viable
route to minimising the data that would need to be transferred from a server
to a client in order to update copies of an RDF data store. Pruning may assist
this process but comes at a cost of additional processing time, which may be
unacceptable in a peer-to-peer context or where updates need to be generated
on demand.
By contrast with inference strength¹ [10, p. 14:20], the reduction strength shown in Table 2 indicates when the sizes of ΔE, ΔED and ΔDc differ, i.e., when inference is capable of making a difference to the size of the delta. When the
inference strength is zero, there are no inferences to be made and the model is
closed. Under these circumstances, |ΔE| = |ΔDc |. However, |ΔE| may still
be equal to |ΔDc | when the inference strength is greater than zero. This occurs
when, for example, none of the triples in the delta are inferable in M .
¹ Inference strength = (|C(M)| − |M|) / |M|.
Both inference strength and reduction strength also give an indication of the
workload of pruning. High values for these parameters indicate that a large
number of triples can be inferred. However, adding such inferable triples provides
a large collection of data that needs to be checked for possible pruning before
inference can take place.
This paper describes a correction method for dense deltas that results in con-
sistent update of RDF datasets. We have eliminated the need for conditions on
the dataset by checking the antecedents of inferable triples in the insert set. If
at least one such antecedent is found in the delete set then the inferable triple
in the insert set cannot be removed from the delta. Otherwise, this triple can be
safely removed from the delta to minimize its size.
A summary of our results is shown in Fig. 13, which characterises the inter-
action between the degree of inference (i.e. the delete set and/or the insert set
or no inference at all) and the approach to inferencing (i.e. inferring all triples
or only necessary triples) and their combined impact on the delta size and com-
putation time. It can be seen that ΔDc has the smallest delta size compared
to ΔED and ΔE. It can also be seen that the approach to inferencing affects
the delta computation time. Figure 13 indicates that backward inference is more
efficient (i.e. smaller delta size and faster generation) than the other methods
tested.
In this work we have investigated the effect of inference degree and infer-
ence approach on both the delta computation time and storage space over RDF
datasets. Similar methods can be applied to ontologies that are represented in
OWL 2. Here the RL rule set [5] is much richer than the rule set for RDFS with
consequent potential for benefits to delta generation performance and size. Also,
it is worth exploring different inference strengths to further evaluate the delta
sizes and performance of the different approaches to producing these deltas. In
particular while backward inference may be efficient, combining it with pruning
may be expensive in terms of computation time where data is characterised by
large inference strengths. Exploiting the inferred triples to infer new information
may provide further improvements in update performance.
References
1. Al Azwari, S., Wilson, J.N.: The cost of reasoning with RDF updates. In: ICSC
2015, pp. 328–331. IEEE (2015)
2. Cloran, R., Irwin, B.: XML digital signature and RDF. In: Information Society
South Africa (ISSA 2005), July 2005
3. Hayes, P., McBride, B.: RDF semantics. W3C recommendation. World Wide Web
Consortium (2004)
4. Im, D.H., Lee, S.W., Kim, H.J.: Backward inference and pruning for RDF change
detection using RDBMS. J. Info. Science 39(2), 238–255 (2013)
5. Motik, B., Grau, B.C., Horrocks, I., Wu, Z., Fokoue, A., Lutz, C.: OWL 2 Web
ontology language profiles, W3C Recommendation 11 December 2012 (2013)
6. Noy, N., Musen, M.: Promptdiff: a fixed-point algorithm for comparing ontology
versions. AAAI/IAAI 2002, 744–750 (2002)
7. Papavasileiou, V., Flouris, G., Fundulaki, I., Kotzinos, D., Christophides, V.: High-
level change detection in RDF(S) KBs. ACM Trans. Database Syst. 38, 1:1–1:42
(2013)
8. Papavassiliou, S., Papagianni, C., Distefano, S.: M2M interactions paradigm
via volunteer computing and mobile crowdsensing. In: Misic, V., Misic, J. (eds.)
Machine-to-machine communications: architectures, technology, standards, and
applications, pp. 295–309. CRC Press, Boca Raton (2014)
9. Völkel, M., Groza, T.: SemVersion: An RDF-based ontology versioning system. In:
Nunes, M., Isaías, P., Martínez, I. (eds.) Proceedings of the IADIS International
Conference on WWW/Internet, p. 44. IADIS (2006)
10. Zeginis, D., Tzitzikas, Y., Christophides, V.: On computing deltas of RDF/S knowl-
edge bases. ACM Trans. Web (TWEB) 5(3), 14 (2011)
Query-Oriented Summarization of RDF Graphs
1 Introduction
RDF graphs and queries An RDF graph (or graph) is a set of triples of the
form s p o, stating that the subject s has the property p, and the value of that
property is the object o. Triples are formed using uniform resource identifiers
(URIs), typed or untyped literals (constants), and blank nodes (unknown URIs
or literals) corresponding to incomplete information. We use s, p, and o in triples
as placeholders. Literals are shown as strings between quotes, e.g., “string”.
The RDF standard provides a set of built-in classes and properties in the rdf: and rdfs: pre-defined namespaces, e.g., triples of the form s rdf:type o specify the class(es) to which a resource belongs. For brevity, we use type to denote rdf:type. For example, the RDF graph G below describes a book, identified by doi1: its author (a blank node :b1 related to the author name), title and publication date.

Fig. 1. Sample RDF graph
G = { doi1 rdf:type Book, doi1 writtenBy :b1, doi1 publishedIn “1932”, doi1 hasTitle “Le Port des Brumes”, :b1 hasName “G. Simenon” }
RDF Schema triples allow enhancing the descriptions in RDF graphs by declar-
ing deductive constraints between the graph classes and properties, namely: sub-
ClassOf, subPropertyOf, domain and range, where the latter two denote the first
and second attribute of a property, respectively. Consequently, an RDF graph
may have implicit triples even though they do not exist explicitly. For instance,
assume the RDF graph G above is extended with the following constraints:
The resulting graph is depicted in Fig. 1. Its implicit triples are those represented
by dashed-line edges. Adding all the implicit triples to an RDF graph G leads to
its saturation G∞ , which is the RDF graph stating the semantics of G.
In this work, we consider conjunctive SPARQL queries, a.k.a. Basic Graph
Pattern (BGP) queries. The evaluation of a query q against an RDF graph G
based on G’s explicit triples may lead to an incomplete answer; the complete
answer is obtained by evaluating q against G∞, e.g., consider:
q(x3) :- x1 hasAuthor x2, x2 hasName x3, x1 hasTitle “Le Port des Brumes”
Its answer against the graph in Fig. 1 is q(G∞) = {“G. Simenon”}. Note
that evaluating q against G leads to an empty answer.
Summary Requirements We assume that the summary SG of an RDF graph
G is an RDF graph itself. Further, we require summaries to satisfy the following
conditions: (i) The saturation of the summary of an RDF graph G must be
the same as the summary of its saturation G∞ , since the semantics of an RDF
graph is its saturation; (ii) The summary should be (if possible, much) smaller
than the RDF graph; (iii) The summary should be representative: queries with
results on G should also have results on the summary; (iv) The summary should
be accurate: queries with results on the summary should reflect that such data
existed indeed in the graph. To formalize these, let Q be a SPARQL dialect.
Definition 1. (Query-Based Representativeness) SG is Q-representative of G if and only if for any query q ∈ Q such that q(G∞) ≠ ∅, we have q(SG∞) ≠ ∅.
Note that several graphs may have the same summary, since a summary loses
some of the information from the original graph. If two RDF graphs differ only
with respect to such information, they have the same summary. We term inverse
set of SG , the set of all RDF graphs whose summary is SG . This leads to the
accuracy criterion, with respect to any graph a summary may correspond to:
Definition 2. (Query-Based Accuracy) Let SG be a summary, and G the inverse set of SG. The summary SG is Q-accurate if for any query q ∈ Q such that q(SG∞) ≠ ∅, there exists G′ ∈ G such that q(G′∞) ≠ ∅.
For compactness, the (voluminous) set of literals, along with subject and object
URIs for non-type triples from G should not appear in the summary. However,
given that property URIs are often specified in SPARQL queries [1], and that
typically there are far fewer distinct property URIs than subject or object
URIs [4], property URIs should be preserved by the summary. This leads us to
considering the following SPARQL dialect:
Definition 3. (RBGP queries) A relational basic graph pattern query (RBGP)
is a conjunctive SPARQL query whose body has: (i) URIs in all the property
positions, (ii) a URI in the object position of every type triple, and (iii) variables
in any other positions.
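For instance, over the sample graph G above, the query q′(x1, x2) :- x1 rdf:type Book, x1 writtenBy x2 is an RBGP (given here purely as an illustration): it has URIs in both property positions, a URI (Book) as the object of its type triple, and variables in all other positions.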
We define RBGP representativeness and RBGP accuracy by instantiating Q in
Definition 1 and Definition 2, respectively, to RBGP queries.
3 RDF Summaries
We assume a function newURI() returning a fresh URI on each call. We call data
property any property p in G different from type. Further, for any data property
p, the property source of p, denoted S(p), is a URI set using newURI(), and
similarly, the property target of p, denoted T (p), is a URI set using newURI().
We introduce our summaries below; examples are delegated to [5] and can
also be found at https://team.inria.fr/oak/projects/rdfsummary/.
Definition 4. (Baseline Summary) Given an RDF graph G, the baseline sum-
mary of G is an RDF graph BG such that:
Schema BG has the same schema triples as G.
DNT (Data triples of BG whose property is not type) Let p, p1 , p2 be some data
properties from G.
DNT1 The triple S(p) p T (p) belongs to BG ;
DNT2 if s p1 o1 , s p2 o2 ∈ G, then S(p1 ) = S(p2 );
DNT3 if s1 p1 o, s2 p2 o ∈ G, then T (p1 ) = T (p2 );
DNT4 if s p1 o1 , o1 p2 o2 ∈ G, then T (p1 ) = S(p2 );
DT (Data triples of BG whose property is type)
DT1 If s p o, s type c are in G, then S(p) type c is in BG ;
DT2 if s p o, o type c are in G, then T (p) type c is in BG ;
DT3 Let nall be set to newURI(). If s type c ∈ G and there is no triple s p o ∈ G for any data property p, then nall type c ∈ BG.
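To make the construction concrete, the following rough Python sketch builds BG for a graph G given as a set of (s, p, o) string triples, with rdf:type written simply as "type"; the union-find encoding of rules DNT2–DNT4 and all helper names are implementation choices of this sketch rather than the paper's.

from itertools import count

def baseline_summary(G):
    SCHEMA = {"subClassOf", "subPropertyOf", "domain", "range"}
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    data = [(s, p, o) for (s, p, o) in G if p != "type" and p not in SCHEMA]
    typed = [(s, c) for (s, p, c) in G if p == "type"]

    # DNT2-DNT4: a resource shared between data triples forces their
    # source/target slots onto the same summary node.
    slots_of = {}
    for s, p, o in data:
        slots_of.setdefault(s, set()).add(("S", p))   # s occurs as a source of p
        slots_of.setdefault(o, set()).add(("T", p))   # o occurs as a target of p
    for slots in slots_of.values():
        first = next(iter(slots))
        for slot in slots:
            union(first, slot)

    fresh, uri_of = count(), {}
    def uri(slot):                                    # stands in for newURI()
        root = find(slot)
        if root not in uri_of:
            uri_of[root] = "n%d" % next(fresh)
        return uri_of[root]

    B = {(s, p, o) for (s, p, o) in G if p in SCHEMA}                  # Schema
    B |= {(uri(("S", p)), p, uri(("T", p))) for (_, p, _) in data}     # DNT1
    for r, c in typed:                                                 # DT1-DT3
        for slot in slots_of.get(r, ()):
            B.add((uri(slot), "type", c))
        if not any(s == r for (s, _, _) in data):       # r has no outgoing data property
            B.add(("n_all", "type", c))                 # "n_all" stands in for the fresh URI nall
    return B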
Refined Summary The baseline summary may unify property source and
target URIs quite aggressively. For instance, if a store and a person both have
a zipcode, they will lead to the same baseline URI, even though they are very
different things. To mitigate this issue, we designed a second flavor of summary of
an RDF graph G, termed refined and denoted RG . For space reasons, the definition
is delegated to [5]. Intuitively, the difference between the baseline and the refined
summary is that the latter fuses data property source and/or target URIs only
if one resource in G that leads to their unification has no type at all.
Summary Properties Both summaries meet our requirements (i), (iii) and
(iv) as follows. We say two summary graphs are equivalent, denoted ≡, iff they are
identical up to a bijection between their sets of URIs. The summaries commute
with saturation, i.e., (SG)∞ ≡ SG∞, and are RBGP accurate. BG is fully RBGP representative, and RG is representative of RBGPs having no more than one
type triple with the same subject. This follows from a graph homomorphism from
G∞ to (SG )∞ [5]. Observe that SG is not a core of G, since we cannot guarantee a
homomorphism from SG to G (SG may comprise false positives).
The size of the baseline summary is bounded by the size of G’s schema plus
the number of data properties and class assertions from G. It can be built in
O(|G|²) time. Computing the refined summary has O(|G|⁵) complexity, requiring an efficient underlying system, e.g., based on triple partitioning and indexing, or a distributed processing platform such as [2]. An upper bound for its size is the number of classes in G multiplied by the number of distinct data properties in G.
Acknowledgments. This work has been partially funded by the projects Datalyse
“Investissement d’Avenir” and ODIN “DGA RAPID”.
References
1. Arias, M., Fernández, J.D., Martínez-Prieto, M.A., de la Fuente, P.: An empirical
study of real-world SPARQL queries (2011). CoRR, abs/1103.5043
2. Goasdoué, F., Kaoudi, Z., Manolescu, I., Quiané-Ruiz, J.-A., Zampetakis, S.:
CliqueSquare: flat plans for massively parallel RDF queries. In: ICDE (2015)
3. Goldman, R., Widom, J.: Dataguides: Enabling query formulation and optimization
in semistructured databases. In: VLDB (1997)
4. Statistics on The Billion Triple Challenge Dataset (2010). http://gromgull.net/
blog/2010/09/btc2010-basic-stats
5. Technical report (2015)
Data Exploration
ReX: Extrapolating Relational Data
in a Representative Way
1 Introduction
Generating synthetic data is convenient in multiple application areas (e.g., soft-
ware validation, data masking, database testing). Synthetic data is generally
used when real data is not available, when it cannot be published publicly or
when larger amounts of data are needed. Therefore, it represents an artificial
enabler for any analysis that requires data. When using synthetic data, a neces-
sary evaluation is how representative it is in comparison to real-life data.
Extrapolating the data from an existing relational database is a potential
solution to overcome the lack of realism of the synthetic data. There are two
directions that can be explored for scaling data: (i) to a particular size, or (ii) to
a particular time in the future. The first is useful in multiple application areas
where the size of the generated database matters, such as scalability testing. The
second direction could be addressed by applying machine learning techniques to
predict how data will evolve using accurate historical data. In this paper, we
explore the first path, which represents a starting point for studying the evolu-
tion of a database. Maintaining the distributions present in the original database
contributes to the realism of the generated data. The representativeness dimen-
sion is crucial as the results of the analysis to be applied on the representative
extrapolated database are expected to be similar to the ones from the original
database (e.g., in approximate query answering). This path has been explored
before. In [19], the authors introduce the scaling problem: given an original relational database and a scaling rate s, generate a synthetic database that is similar to the original but s times its size.
The authors propose a novel tool, namely UpSizeR, which aims to solve the scal-
ing problem in an innovative way, using mining algorithms such as clustering to
ensure that the representativeness is maintained. The method requires complex
inputs from the user (e.g., the probability perturbation exponent). Most of the
existing synthetic database generators require complex inputs from the user in
order to generate realistic data [1,3,11]. However, complex inputs require expert
knowledge, and thus may lead to poor accuracy in the results.
In this paper, we propose an automated representative extrapolation tech-
nique, ReX, that addresses the scaling problem above. Similarly to [4] and [19],
we define a representative database as a database where the distributions of the
relationships between the tables are preserved from the original database. As
foreign keys are enforced links between tables, they represent invaluable inputs
to depict the relationships between data in a relational database. This represents
a first step towards achieving a representative extrapolated database. We devise
two techniques for handling non-key attributes. To illustrate ReX’s applicability
in a real scenario, we perform approximate query answering evaluation. We com-
pare ReX to UpSizeR [19] and show that our solution outperforms UpSizeR in
terms of representativeness, query answering, database size, and execution time.
The remainder of this paper is organized as follows: Sect. 2 introduces the
potential solutions to the scaling problem. Section 3 presents the representative
extrapolation system, ReX. Section 4 presents the evaluation of ReX. Section 5
presents the related work. Finally, Sect. 6 concludes the paper.
O(ti) = Σ_{∀p(x,y)∈sp_{ti}^{tj}} x    and    O(tj) = Σ_{∀p(x,y)∈sp_{ti}^{tj}} (y · x)
From Fig. 1(b), we determine that: O(t1 ) = 12, O(t2 ) = 40, and O(t3 ) =
52. When extrapolating O by s to produce the extrapolated database X, we
expect that each table t of O will be scaled in size by s such that: X(t) =
s · O(t).
A horizontal growth direction for each point of a scatter plot produces the optimal results in terms of database size. Considering a horizontal growth direction, each point p of sp_{ti}^{tj} scales s times on the x-axis: ∀p(x, y) becomes p′(x′, y), where x′ = s · x. This leads to the following properties of ti and tj in X:

X(ti) = Σ_{∀p′(x′,y)∈sp_{ti}^{tj}} (s · x) = s · O(ti)    and    X(tj) = Σ_{∀p′(x′,y)∈sp_{ti}^{tj}} (y · (s · x)) = s · O(tj)
Through horizontal scaling: X(t1) = 24, X(t2) = 80, and X(t3) = 104. These are the desired expected sizes of the tables. This leads to X being representative of O (i.e., as each point is scaled by s), and of accurate size (i.e., as each table is scaled by s). Therefore, the extrapolation solution must create, for each of the x identifiers of ti, pki, exactly s−1 new identifiers pk′i, and, for each of the x · y key values of tj, (pkj, fkj), exactly s−1 new key values of tj, (pk′j, fk′j), each individually referencing one of the s−1 new identifiers created for ti, i.e., fk′j = pk′i. This is exemplified in Fig. 2, where ti = t1 and parents(ti) = {t2, t3}.
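As a small illustration, one way to realise this horizontal growth for a child table tj referencing a parent ti is sketched below; new_key is a deliberately simple stand-in for the generation function fi(x) of Sect. 3.1, which is not reproduced here.

def new_key(key, copy, original_size):
    # hypothetical generation function: offset each copy by the original table
    # size so that newly created identifiers never collide with existing ones
    return key + copy * original_size

def scale_child(rows, s, parent_size, child_size):
    # rows: the original (pk_j, fk_j) pairs of t_j; returns the scaled table
    scaled = list(rows)                               # keep the original tuples
    for copy in range(1, s):                          # create s-1 new copies
        for pk_j, fk_j in rows:
            scaled.append((new_key(pk_j, copy, child_size),
                           new_key(fk_j, copy, parent_size)))   # fk'_j = pk'_i
    return scaled

For example, scale_child([(1, 10), (2, 10)], s=2, parent_size=100, child_size=200) keeps the two original tuples and adds (201, 110) and (202, 110), so that both new child tuples reference the single new parent identifier created for pki = 10.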
In this paper, we propose a system called ReX¹ that aims to produce a repre-
sentative extrapolated database X, given a scaling rate, s ∈ N, and a relational
database O. The objective is to maintain the distributions between the consecu-
tive linked tables and the referential integrity of the data. We assume that there
are no cycles of dependencies and that foreign keys only reference primary keys.
ReX produces the extrapolated database in a single pass over the entire origi-
nal database and thus reduces the complexity of a two-step algorithm that would
compute the expected scaled distribution and generate data accordingly through
horizontal scaling to ensure representativeness.
Natural Scale Discussion. When the scaling rate is a real number (i.e., s ∉ N),
the floating part requires the generation of tuples for only a fraction of each table
from O. Thus, the method must decide for which partial number of tuples of tj
it should create new tuples. As this represents a different problem by itself [4,8],
in this paper we consider only natural scaling rates. Moreover, the scenario
of naturally scaling databases is commonly applicable to enterprises where it
is rarely needed to extrapolate to a fraction rather than a natural number.
¹ Representative eXtrapolation System, https://github.com/tbuda/ReX.
The maximum error brought by approximating the real scaling rate to a natural number is 33.33 %, and occurs for s = 1.5 (i.e., caused by X containing 33.33 % fewer or more tuples than desired). The impact of the floating part decreases as s increases (e.g., when s = 10.5 the error caused by approximating is reduced to
4.8 %). Another solution is using a sampling method for the remaining fractional
number. However, both solutions would introduce errors in the results, and in
this paper we are interested in evaluating the extrapolation technique.
could cover more test cases than the original ones. Moreover, we expect that
maintaining the frequency count of the non-key attributes ensures that queries
that compute an aggregate of a non-key attribute scale according to s with
no errors (e.g., the maximum age entry in a Person table). Furthermore, the
second solution, ReXmain, ensures that X preserves intra-tuple correlations
(e.g., between the age and marital status attributes of a Person table), intra-
table correlations at an attribute level (e.g., between the age of a Person table
and its balance in an Account table) and frequency count of non-key values.
3.3 Approach
ReX selects the leaf tables as starting tables. The algorithm maintains the position of each primary key value when populating a table, using a hash table. Thus, by starting with the leaf tables, the method avoids potentially time-consuming queries for determining the position of a foreign key value in its original referenced table, and retrieves it from the previously constructed hash table instead.
Moreover, through this bottom-up approach, X is produced through a single
pass over each table of O. Phase one of the algorithm consists of generating the
new key and non-key attributes’ values for the leaf tables. The method retrieves
the records of the leaf table from O and enforces a horizontal growth direction
by generating s new tuples for each tuple of a table from O. Regarding key val-
ues, ReX will call the generation function fi(x), described in Sect. 3.1. Regarding non-key values, ReXmain maintains their values from the original tuple. ReXrfc randomly selects a value from O(ti), while maintaining its frequency count. This is
achieved through the SQL query on O: SELECT nk FROM ti ORDER BY RAND().
In order to maintain the frequency count, ReXrfc runs the query s times and
iterates through the result set returned, ensuring that each value has been used
s times for producing X. Phase two consists of identifying the next table to
be filled. The algorithm recursively fills the parents of the already populated
tables until the entire database is processed. To avoid size overhead or referential
breaches due to processing a table multiple times (e.g., due to diamond patterns
[8]), a table can only be populated once its children have been populated.
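A minimal sketch of the ReXrfc treatment of one non-key column, in which random.shuffle stands in for the SELECT nk FROM ti ORDER BY RAND() query issued s times:

import random

def rfc_values(column, s):
    # Each pass shuffles the original values and consumes them in order, so every
    # original value is used exactly s times and its frequency count scales by s.
    out = []
    for _ in range(s):
        shuffled = list(column)
        random.shuffle(shuffled)
        out.extend(shuffled)
    return out        # len(out) == s * len(column)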
4 Evaluation
In this section, we compare our extrapolation system ReX to the UpSizeR approach [19]. Both methods aim to construct an extrapolated database, X, repre-
sentative of the original database, O, that maintains the referential integrity of
the data.
UpSizeR Overview. UpSizeR represents a representative scaling method that
addresses the scaling problem. Its objective is to generate synthetic data with
similar distributions of the relationships between the tables of the database (i.e.,
between primary and foreign key pairs) to the ones from the original data-
base [19]. For this purpose, the approach computes the relationship degree (i.e.,
cardinality constraint) of each existing key of each table in the original database
and generates synthetic data accordingly.
In the case of a table with multiple foreign key constraints, the method uses
a clustering algorithm for generating a joint degree distribution of the table.
However, the mechanisms employed by UpSizeR can lead to time-consuming
operations and require complex parameters as inputs from the user, which can
lead to inaccurate results.
Environment and Methodology. ReX was developed using Java 1.6. ReX and
UpSizeR were applied on MySQL 5.5.35 databases. They were deployed on a
machine consisting of 2 Intel Xeon E5-2430 CPUs of 2.20 GHz and 6 cores
each, 64 GB RAM, and 2 TB Serial ATA Drive with 7,200 rpm, running 64-bit
Ubuntu 12.04. The MySQL server was run with default status variables. We used
the centralized version of UpSizeR available online². We assume that the user
has no prior knowledge of the database to be extrapolated and keep the default
parameters’ values. This coincides with the evaluation strategy the authors pre-
sented in [19]. Moreover, we show in Sect. 4.1 that the default parameters provide
a near optimal configuration for the database considered.
Database. We used the Financial database³ from the PKDD’99 Discovery Challenge in order to evaluate ReX and UpSizeR in a real environment. It contains typical bank data, such as client information, their accounts, and loans. It contains 8 tables, and a total of 1,079,680 tuples. The sizes of the tables range from 77 (District) to 1,056,320 tuples (Trans). The Financial database schema is depicted in [4]. The starting table identified by ReX is the District table.
² comp.nus.edu.sg/∼upsizer/#download.
³ lisp.vse.cz/pkdd99/Challenge/berka.htm.
⁴ tpc.org/tpch.
Fig. 8. G3 query error. Fig. 9. G4 query error. Fig. 10. Execution time.
that ReX maintains a 0 % error for G4 query answering, due to preserving the frequency count of the non-key attributes.
Execution Time. Figure 10 presents the methods’ execution time on the Finan-
cial database. We notice that ReX is up to 2 times faster than UpSizeR. When
applied on a larger database, such as a 1 GB TPC-H database⁴, we observed more significant differences between the methods’ performance. In particular, ReX performed between 3 and 8.5 times faster, with an average difference of 23 min between UpSizeR’s and ReX’s execution times.
Additional Discussion. When using a system with complex inputs, the chal-
lenge lies in determining the optimal parameters for the target database. We
investigate the impact of the number of clusters expected, k (used in the gener-
ation of the joint degree distribution) and the probability perturbation exponent,
p (used in the generation of the joint probability matrix) on UpSizeR, as they
represent key inputs for UpSizeR’s generation process. We considered the follow-
ing set of values for k and p: {3,5,25,50,100,500,2500,5000}, and {-15,-10,-7,-5,-3,
-1,0,10}, respectively. Increasing k to 5,000 raised the run time of UpSizeR to
16.4 h, compared to 12 s when k is 3 by default. Running UpSizeR with p equal
to −25 and −50 did not scale and after 10 days their execution was stopped.
Identical results were found for p equal to 10, 50, and 500. The query relative
error of F7 is 1.8 %, regardless of k and p. Similar conclusions were drawn for
s = {2, 5, 8} and when jointly varying k and p. Results suggest that the modifi-
cation of the parameters brings little benefits for all dimensions considered. In
contrast, we observe that UpSizeR’s parameters have a significant impact mainly
on the query answering accuracy. Small variations of the parameters resulted in
high errors in query answering. This suggests that a trial and error approach
might not lead to any benefits, even after a large amount of time is invested.
5 Related Work
Significant efforts have been made to improve the realism of synthetic data gen-
erators. We acknowledge them below, based on their application area.
General Methods. Many commercial applications generate synthetic data-
bases that respect the given schema constraints and use real sources as input
for several attributes (e.g., names, age)⁵. Furthermore, the academic community has proposed many general-purpose synthetic data generators [9,11,12]. MUDD
[17] is another parallel data generator that uses real data for the attributes’
domain. In [3], the authors propose a Data Generation Language to specify and
generate databases that can respect inter and intra table correlations. However,
the user must learn the specification language and input the distributions.
Software Testing. Existing methods for populating testing environments usu-
ally generate synthetic data values or use some type of random distribution to
select data from the production environment to be included in the resulting
database [14,18]. AGENDA [7] is a synthetic data generator based on a-priori
knowledge about the original database (e.g., test case expected behavior). Fur-
thermore, in [6] the authors describe a new approach for generating data for
specific queries received as input. QAGen [2], MyBenchmark [13], and Data-
Synth [1] similarly generate query-aware test databases through cardinality con-
straints. However, they require complex inputs (e.g., distribution of an attribute,
queries), which can be error-prone, as they might exclude vital test cases.
Data Mining. In [15], the authors propose a synthetic data generator for data
clustering and outlier analysis, based on the parameters given as input (e.g.,
number of clusters expected, size, shape). In [16], the authors propose a syn-
thetic data generator that receives as input a set of maximal frequent itemset
distributions and generate itemset collections that satisfy these input distribu-
tions. Other tools that can be used in this field are WEKA [10], GraphGen⁶, and IBM QUEST⁷. For instance, GraphGen generates synthetic graph data for fre-
quent subgraph mining. However, the approaches require input parameters and
generally produce synthetic data targeting a data mining algorithm.
⁵ sqledit.com/dg, spawner.sourceforge.net, dgmaster.sourceforge.net, generatedata.com.
⁶ cse.ust.hk/graphgen.
⁷ ibmquestdatagen.sourceforge.net.
References
1. Arasu, A., Kaushik, R., Li, J.: Data generation using declarative constraints. In:
SIGMOD, pp. 685–696 (2011)
2. Binnig, C., Kossmann, D., Lo, E., Özsu, M.T.: Qagen: Generating query-aware
test databases. In: SIGMOD, pp. 341–352 (2007)
3. Bruno, N., Chaudhuri, S.: Flexible database generators. In: VLDB, pp. 1097–1107
(2005)
4. Buda, T.S., Cerqueus, T., Murphy, J., Kristiansen, M.: CoDS: a representative
sampling method for relational databases. In: Decker, H., Lhotská, L., Link, S.,
Basl, J., Tjoa, A.M. (eds.) DEXA 2013, Part I. LNCS, vol. 8055, pp. 342–356.
Springer, Heidelberg (2013)
5. Buda, T.S., Cerqueus, T., Murphy, J., Kristiansen, M.: VFDS: Very fast database
sampling system. In: IEEE IRI, pp. 153–160 (2013)
6. Chays, D., Shahid, J., Frankl, P.G.: Query-based test generation for database appli-
cations. In: DBTest, pp. 1–6 (2008)
7. Deng, Y., Frankl, P., Chays, D.: Testing database transactions with agenda. In:
ICSE, pp. 78–87 (2005)
8. Gemulla, R., Rösch, P., Lehner, W.: Linked bernoulli synopses: sampling along
foreign keys. In: Ludäscher, B., Mamoulis, N. (eds.) SSDBM 2008. LNCS, vol.
5069, pp. 6–23. Springer, Heidelberg (2008)
9. Gray, J., Sundaresan, P., Englert, S., Baclawski, K., Weinberger, P.J.: Quickly
generating billion-record synthetic databases. SIGMOD Rec. 23(2), 243–252 (1994)
10. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The
weka data mining software: an update. SIGKDD 11(1), 10–18 (2009)
11. Hoag, J.E., Thompson, C.W.: A parallel general-purpose synthetic data generator.
SIGMOD Rec. 36(1), 19–24 (2007)
12. Houkjær, K., Torp, K., Wind, R.: Simple and realistic data generation. In: VLDB,
pp. 1243–1246 (2006)
13. Lo, E., Cheng, N., Hon, W.-K.: Generating databases for query workloads. PVLDB
3(1–2), 848–859 (2010)
14. Olston, C., Chopra, S., Srivastava, U.: Generating example data for dataflow pro-
grams. In: SIGMOD, pp. 245–256 (2009)
15. Pei, Y., Zaïane, O.: A synthetic data generator for clustering and outlier analysis.
Technical report (2006)
16. Ramesh, G., Zaki, M.J., Maniatty, W.A.: Distribution-based synthetic database
generation techniques for itemset mining. In: IDEAS, pp. 307–316 (2005)
17. Stephens, J.M., Poess, M.: MUDD: a multidimensional data generator. In: WOSP,
pp. 104–109 (2004)
18. Taneja, K., Zhang, Y., Xie, T.: MODA: Automated test generation for database
applications via mock objects. In: ASE, pp. 289–292 (2010)
19. Tay, Y., Dai, B.T., Wang, D.T., Sun, E.Y., Lin, Y., Lin, Y.: UpSizeR: synthetically
scaling an empirical relational database. Inf. Syst. 38(8), 1168–1183 (2013)
Evaluation Measures for Event Detection
Techniques on Twitter Data Streams
1 Introduction
Microblogging is a form of social media that enables users to broadcast short mes-
sages, links, and audiovisual content to a network of followers as well as to their
own public timeline. In the case of Twitter, the most popular and fastest-growing
microblogging service, these so-called tweets can contain up to 140 characters.
Twitter’s 288 million monthly active users produce a total of over 500 million
tweets per day¹. As a consequence, several proposals have been made to lever-
age Twitter as a source of up-to-date news and information, e.g., to respond to
natural disasters [13], to track epidemics [7], or to follow political elections [17].
A number of techniques have been designed and developed to detect such
events in the Twitter social media data stream. Typically, they adopt the defini-
tion of an event introduced by research on Topic Detection and Tracking (TDT),
i.e., a real-world occurrence that takes place in a certain geographical location
and over a certain time period [2]. The main focus of these event detection tech-
niques lies in addressing the specific requirements introduced by Twitter data,
such as the brevity of tweets together with the fact that they contain a substan-
tial amount of spam, typos, slang, etc. Although most proposals provide some
qualitative evidence to motivate the benefits of the technique, few perform a
quantitative evaluation or compare their results to competing approaches.
We argue that this lack of comparative evaluation is explained by the fact
that measuring the quantitative and qualitative performance of event detection
¹ https://about.twitter.com/company/.
2 Background
Several event detection techniques for Twitter data streams have recently been
proposed. Farzindar and Khreich [8] survey sixteen techniques and conclude that
most approaches are evaluated by self-defined measures with manually labeled
reference data sets. Also, almost none of the reviewed techniques are compared
to competing approaches. In the following, we summarize what evaluations have
been performed by the authors of the most-cited approaches and what corpora
are currently available for evaluation purposes. Our findings show that neither
the works discussed below nor the sixteen in the above-mentioned survey provide
a general solution that can be used to evaluate approaches comparatively.
enBlogue [3] identifies unusual shifts in the co-occurrence of tag pairs and reports
these shifts as events, which are rated in terms of quality in a user study. Twitter-
Monitor [9] detects “bursty” keywords and then groups them into trends, which
are visualized in order for users to decide whether a trend is interesting or not.
Cordeiro [6] proposes the use of continuous wavelet analysis to detect event peaks
in a signal based on hashtags frequency and summarizes the detected events into
topic clusters with latent dirichlet allocation (LDA [5]). The technique is evalu-
ated using a visualization of the results obtained from an eight day dataset with
13.6 million tweets. All of these manual evaluations are, however, not general in
the sense that they do not scale and might suffer from human error or bias. Weng
et al. [17] present a technique that uses term frequencies of individual terms as a
signal for discrete wavelet analysis to detect event terms. Then, graph partition-
ing is used to group similar terms into events. The approach is evaluated using a
custom ground truth that is built using LDA on a dataset containing 4,331,937
tweets collected from Singapore-based users. After cleaning and filtering, a total
of 8,140 unique words are retained per month of Twitter data. Detected events
are compared to this ground truth on a daily basis. The result of this evaluation
is that detected events are plausible, but also that there are several days with
no events detected. Since event detection is often time-critical and events should
be reported in (near) real-time, this coarse evaluation technique is not suited for
general evaluations.
3 Measures
In order to address the lack of a common evaluation method for event detection
in Twitter data streams, we propose a number of measures. Our goal is to define
measures that can easily be used by other researchers and that do not depre-
cate over time as most reference corpora do. While all of our measures support
relative comparisons, we do not claim that they can be used to draw absolute
conclusions. A single event detection technique can, therefore, only be evaluated
“against itself”, e.g., with respect to different parameter settings or to confirm
that improvements to the technique yield better results. For a set of techniques,
the measures can be used to rank them with respect to different criteria. In this
paper, we focus on the second application.
Run-Time Performance. We measure run-time performance as the number of
tweets that an approach processes per second. This measure is important to judge
the feasibility of a technique. Most event detection techniques can be configured
based on numerous parameters that influence both the processing speed and result
quality. In combination with other measures, the run-time performance measure
can, therefore, also be used to study the trade-off between these two objectives.
Duplicate Event Detection Rate (DEDR). This measure captures the per-
centage of duplicate events detected by an approach. The implementations of
state-of-the-art event detection techniques used in this paper avoid the reporting
of duplicate events within their processing time-frame, e.g., a one-hour window.
Nevertheless, important or long-lasting events can reoccur across several time-
frames and, therefore, expecting a 0 % rate of duplicate events is not reasonable.
Precision. Our precision measure is composed of two components. First, we
query Google using the five event terms and a specific date range as search query
input. Doing so, we are able to verify if the detected event has been described
by an important article returned by Google for the corresponding time frame.
As important articles we define search results that are from one of the top 15
news websites such as CNN, CBSNews, USAToday, BBC, and Reuters. For the
second part of our precision measure, we query the archive of the New York
Times² with the five event terms as well as the specific date range. Since the number of hits (h), which is in the range between 0 and 10 both for Google (h^G) and the New York Times (h^NYT), is an indicator of how important a reported event is, we calculate the final precision score for all results (N) by weighting the single results as

   Precision = (1/N) · Σ_{i=0}^{N} ( ½ · h_i^G + ½ · h_i^NYT ).
the maximal similarity value (max sim). Since we exclude matches on one term
only, this similarity value can either be two, three, four, or five terms. With this
weighting, we calculate the final recall score for all headlines (N) as

   Recall = (1/N) · Σ_{i=0}^{N} ( ½ · max_sim(T_i^hl, T^e) ).
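As a small illustration of how these two scores can be computed once the hit counts and headline similarities have been collected (the function and variable names below are ours):

def precision_score(hits_g, hits_nyt):
    # hits_g[i], hits_nyt[i]: 0-10 hit counts for the i-th reported event
    n = len(hits_g)
    return sum(0.5 * g + 0.5 * nyt for g, nyt in zip(hits_g, hits_nyt)) / n

def recall_score(max_sims):
    # max_sims[i]: maximal similarity (2-5 common terms) of the i-th headline
    n = len(max_sims)
    return sum(0.5 * sim for sim in max_sims) / n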
event term as the associated event description terms. Both of these approaches
report N events per time window. The next two approaches, TopN and LastN
are based on the IDF score of single terms among all distinct terms in the time
window. While TopN selects the N most frequent terms, LastN selects the N
terms with the lowest frequency. Both of them report the selected event terms
together with the four most co-occurring terms.
In addition to these baseline approaches, we implemented several techniques
that have been proposed to detect events in Twitter data streams. The corre-
sponding Niagarino query plans are shown at the top of Fig. 1. The first tech-
nique, LLH, is a reimplementation of Weiler et al. [16], which is realized as a
log-likelihood ratio user-defined function that is applied to the grouped set of
terms of a time window. In contrast to the original technique that detected
events for pre-defined geographical areas, we adjusted the approach to calcu-
late the log-likelihood measure for the frequency of all distinct terms in the
current time window against their frequency in the past time windows. Events
are reported by selecting the top N terms with the highest log-likelihood ratio
together with the corresponding top four most co-occurring terms. Since these
are the terms with the highest abnormal behavior in their current frequency
with respect to their historical frequency, we define these terms to be events.
The second technique, Shifty, is a reimplementation of Weiler et al. [14]. In con-
trast to the original paper, which additionally analyses bigrams, we now only use
single terms in the analysis. The technique calculates a measure that is based
on the shift of IDF values of single terms in pairs of successive sliding windows
of a pre-defined size sinput . First, the IDF value of each term in a single window
is continuously computed and compared to the average IDF value of all terms
within that window. Terms with an IDF value above the average are filtered out.
The next step builds a window with size s1 that slides with range r1 in order
to calculate the shift from one window to the next. In this step, the shift value
is again checked against the average shift of all terms and only terms with a
shift above the average are retained. In the last step, a new sliding window with
size s2 that slides with range r2 is created. The total shift value is computed
as the sum of all shift values of the sub-windows of this window. If this total
shift value is greater than the pre-defined threshold Ω, the term is detected as
event and reported together with its top four co-occurrence terms. The third
technique, WATIS, is an implementation of Cordeiro [6]. The algorithm parti-
tions the stream into intervals of s seconds and builds DF-IDF signals for each
distinct term. Due to the noisy nature of the Twitter data stream, signals are
then processed by applying an adaptive Kolmogorov-Zurbenko filter (KZA), a
low-pass filter that smoothens the signal by calculating a moving average with
ikza iterations over N intervals. It then uses a continuous wavelet transforma-
tion to construct a time/frequency representation of the signal and two wavelet
analyses, the tree map of the continuous wavelet extrema and the local maxima
detection, to detect abrupt increases in the frequency of a term. In order to
enrich events with more information, the previously mentioned LDA algorithm
(with iLDA iterations) is used to model one topic consisting of five terms. After
the LDA phase the event is reported. Finally, the fourth technique, EDCoW, is
an implementation of Weng et al. [17]. The first step of the algorithm is to parti-
tion the stream into intervals of s seconds and to build DF-IDF signals for each
distinct term in the interval. These signals are further analyzed using a discrete
wavelet analysis that builds a second signal for the individual terms. Each data
point of this second signal summarizes a sequence of values from the first signal
with length Δ. The next step then filters out trivial terms by checking the cor-
responding signal auto-correlations against a threshold γ. The remaining terms
are then clustered to form events with a modularity-based graph partitioning
technique. Insignificant events are filtered out using a threshold parameter ε.
Since this approach detects events with a minimum of two terms, we introduced
an additional enrichment step that adds the top co-occurring terms to obtain
events with at least five terms. Since the original paper fails to mention the
type of wavelet that was used, we experimented with several types. The results
reported in this paper are based on the Discrete Meyer wavelet.
5 Evaluation
In order to demonstrate that the measures proposed in this paper are discrim-
inating, we run experiments against three different real-world Twitter stream
datasets (consisting of five days each) that we collected. The three datasets
respectively contain the days of February 1 to 6, 11 to 16, and 21 to 26,
2015 (EST). By using the Gardenhose access of the Twitter streaming API,
we are able to obtain a randomly sampled 10 % stream of all public tweets.
The collection contains an average of 2.2 million tweets per hour and almost 50
million tweets per day. We pre-filtered the dataset for tweets with English lan-
guage content by using a pre-existing Java library5 . After this step, the dataset
⁵ https://code.google.com/p/language-detection/.
Fig. 2. Average hourly total and English tweets for all three datasets.
contains an average of 660,000 tweets per hour and 16 million tweets per day.
Figure 2 shows the distribution of the total and the English number of tweets
per hour for each day as an average of all three datasets.
Approach Parameters
Shifty sinput = 1 min, s1 = 2 min, r1 = 1 min, s2 = 4 min, r2 = 1 min, Ω = 30
WATIS s = 85 s, N = 5 intervals, ikza = 5, ilda = 500
EDCoW s = 10 s, N = 32 intervals, γ = 1, ε = 0.2
5.2 Results
In the following, we present the results of our evaluation. Note that we summarized the results of all three datasets as an average. First, we start with the
run-time performance. Run-time performance was measured using Oracle Java
1.8.0 25 (64 bit) on server-grade hardware with 2 Intel Xeon E5345s processors
at 2.33 GHz with 4 cores each and 24 GB of main memory.
Figure 3 shows the run-time performance results (in tweets per second) for all techniques over all three datasets.
[Figure: duplicate event detection rates DEDR1–DEDR5 for LastN, FR, RE, LDA, TopN, LLH, Shifty, WATIS, EDCoW, and TT.]
of duplicates for DEDR1. Since the terms of FR and RE are randomly chosen,
they generally report a lower number of duplicates. From the event detection
techniques, the results for Shifty, WATIS, and EDCoW closely resemble the
results of applying our DEDR measure to TT, whereas all other approaches
have significantly different profiles. We therefore argue that DEDR is a useful
measure to characterize event detection techniques.
For the evaluation of our precision and recall measures, we only use events
that were not filtered out by DEDR3, i.e., all events with three or more com-
mon terms are removed from the result set and only the remaining non-duplicate
events are further analyzed. Note that this results in an implicit inclusion of the
DEDR measure in our precision and recall measures. Figure 5 shows the aver-
age precision, recall, and F-measure over all three data sets for all techniques.
Based on these measures, we observe that all of the dedicated event detection
techniques clearly outperform the baseline approaches. This finding confirms
the validity of the precision and recall measure proposed in this paper. We con-
clude our evaluation by discussing the results shown in Fig. 5 in more detail.
First, we note that the scores are generally very low. However, since we are only
interested in relative comparisons, this is not a problem. Among the baseline
approaches, both LDA and RE score comparably to dedicated event detection
techniques with respect to specific measures. The precision of LDA is higher
than the one of LLH and Shifty, RE scores well in terms of recall. In both cases,
this result can be explained with the way these approaches work. Also, it demon-
strates the importance of studying both precision and recall, which we support
with our F-measure. The best approaches according to our measures are the
advanced WATIS and EDCoW techniques, which are also the most cited event
detection techniques. Since EDCoW produces the most events of all techniques,
its parameters could also be adjusted to increase its precision score. Also, the
basic enrichment process that we implemented for EDCoW could be improved.
For example, WATIS uses LDA for the same purpose and scores very well in
terms of recall. Our own techniques, LLH and Shifty, do not perform as well as
the two advanced techniques. However, we note that Shifty is the only online
event reporting technique and therefore only uses very short time intervals (of
four minutes in this case) instead of a full hour to classify terms as events.
[Fig. 5. Average precision, recall, and F-measure for LastN, FR, RE, LDA, TopN, LLH, Shifty, WATIS, EDCoW, and TT.]
6 Conclusions
In this paper, we have addressed the lack of quantitative and comparative evalu-
ation of event detection techniques by proposing a number of measures, both for
run-time and task-based performance. In contrast to previous evaluation meth-
ods, all our measures can be automatically applied to evaluate large result sets
without the requirement of an existing gold standard. In order to demonstrate
the validity of our proposed measures, we have studied them based on several
baseline approaches and state-of-the-art event detection techniques. We have
shown that our measures are able to discriminate between different techniques
and support relative comparisons.
As future work, we plan to further confirm the findings presented in this paper
by implementing additional event detection techniques, such as enBlogue [3],
in our evaluation framework. Based on these fully validated measures, we will
tune the parameters of each technique, which will enable us to draw absolute
conclusions about their performance.
References
1. Aggarwal, C.C., Subbian, K.: Event detection in social streams. In: Proceedings of
the SIAM International Conference on Data Mining (SDM), pp. 624–635 (2012)
cost on a subset of the dataset and then updates the parameters. This process
allows us to achieve a predictive model for: “forgetful? (yes/no)” in MAAS.
2.2 Regression
An MLP is a simple form of one hidden layer neural network, where latent and
abstract features are learned in the hidden layer. As with the other machines in
our architecture and for artificial neural networks (ANN) in general, each node
in a hidden layer L(i) is connected to every node in layer L(i−1) and every node in L(i+1). Each node n1(i) to nn(i) in layer L(i) contains a non-linear activation function, which calculates a node’s activation energy. This value is propagated
through the layers via the connections, a subset of which are shown in Fig. 1.
This process is called feed-forward propagation and is the hypothesis function
for all shallow and deep feed-forward neural networks.
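As a minimal NumPy illustration of feed-forward propagation (the sigmoid activation is chosen here purely for concreteness; this excerpt does not fix the activation function):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, layers):
    # layers: a list of (W, b) weight/bias pairs, one per layer L(1)..L(k)
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)   # every node combines all activations of the previous layer
    return a                      # the network output, i.e., the hypothesis h(x)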
² ensures that features with large data values do not overly impact the model.
Our MLP was trained with SGD and back-propagation. It is similar to train-
ing a regression model and uses the same cost function, except that the parameters
in each layer must be updated with respect to the cost of the output.
The optimum parameters for each machine were located via a process called grid
search, which tests a range of values to find where the optimum lies. Regression
was used to determine the learning rate, regularisation term and fine-tune steps
for the RBM, MLP and DBN; and the RBM and MLP were used to determine the
number of nodes in the first and second hidden layers of the DBN respectively.
The range searched for regularisation and learning rate was from 0.001 to 1,
roughly divided into 10 gradations. Three values for the number of GD steps were tested: 100, 1000, and 10000, as we estimated that any larger values would lead to over-fitting (not generalising well to new data) given the sample size. All possible combinations
were tested for both continuous and categorical data, giving 246 in total.
The number of hidden nodes tested for both the RBM and MLP were 10, 30,
337, 900, 1300 and 2000. There were 337 features before categorisation; therefore, any more than 2000 hidden nodes was deemed unnecessary. Each configuration
was run twice (for categorical and continuous) in the MLP but 5 times each in
the RBM (only categorical) as there were 5 epoch values (1, 5, 10, 15 and 20)
being tested. Any more than 20 would have over-fit the data.
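A sketch of such a grid-search loop is shown below; the candidate value lists are illustrative placeholders rather than the exact grids used, and train_and_validate is a hypothetical routine that trains a model with the given hyper-parameters and returns its validation cost:

from itertools import product

learning_rates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]  # ~10 gradations in [0.001, 1]
reg_terms      = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
gd_steps       = [100, 1000, 10000]

best = None
for alpha, lam, steps in product(learning_rates, reg_terms, gd_steps):
    cost = train_and_validate(alpha, lam, steps)   # hypothetical training/validation routine
    if best is None or cost < best[0]:
        best = (cost, alpha, lam, steps)
# best now holds the lowest validation cost and its hyper-parameter setting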
Bias terms were initialised to zero for all models. Following Glorot et al. [12], the MLP, RBM, and DBN weights were randomly initialised between the bounds [−4·√(6/(fan_in + fan_out)), 4·√(6/(fan_in + fan_out))], whereas for regression the weights were randomly initialised without bounds. Here, fan_in is the number of inputs to, and fan_out the number of outputs from, a node.
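A short NumPy sketch of this bounded initialisation (our own rendering of the bounds above, not the authors' code):

import numpy as np

def init_weights(fan_in, fan_out, rng=np.random):
    bound = 4.0 * np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))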
All experiments were run on a Dell Optiplex 790 running 64-bit Windows 7
Home Premium SP1 with an Intel Core i7-2600 quad-core 3.40 GHz CPU and
16.0 GB of RAM. The code was developed in Python using the Enthought Canopy (1.4.1.1975) distribution of 64-bit Python 2.7.6 in the PyCharm 3.4.1 IDE, making use of NumPy 1.8.1-1 and Theano 0.6.0.
Table 1 shows the results and hyper-parameter configurations for the ten best
performing models in a series of grid-search experiments for regression. The
models are ranked by the lowest negative log-likelihood found on the training
data out of the 246 experiments performed.
Experiments 8-1-0 and 7-0-1 achieved the best results for the categorical and
continuous data respectively. 8-1-0 achieved a low training cost of 0.002, a valida-
tion cost of 7.725 and a test cost of 0.305. 7-0-1 achieved a slightly poorer result
of 0.004, 8.180 and 2.816 for the same measures. Both experiments achieved the
second lowest cost on the training data, but performed significantly better on
the validation data, meaning these hyper-parameters generalised better. Models
learned were not optimal, but given the amount of data available they were ade-
quate as over 69 % of the instances were correctly classified for the categorical
data and just over 70 % for the continuous data.
Although the categorical data achieved a lower cost, the continuous data
made better predictions. This suggests categorising the data helped remove noise
but along with this the transformation eliminated some information relevant to
modelling. Interestingly, the best performing learning rate (alpha) is much higher for the categorical than for the continuous data, and ten times fewer iterations of gradient descent (GD) were required. Therefore, gradient descent was far steeper for the categorical data: it converged and gave us the best parameters much faster than with the continuous data, showing that one-hot encoded data can be modelled more easily, building a predictive model in far less time.
Table 2 shows the 10 highest scoring RBM model configurations out of 35 runs,
ranked by the best reconstruction cost (closest to 0) achieved on training data.
The result of the best performing RBM configuration can be seen in bold
in Table 2. It has 30 hidden nodes and went through 1 epoch of training.
A node configuration of 100 units in the hidden layer achieved the best reconstruction cost of -68.719 on the training data, compared with -73.357 for the configuration with 30 hidden nodes. The 30 hidden node configuration was nevertheless judged the better architecture: it performed only slightly worse on the training data but scored -19.580 on the validation set, better than every other configuration in the top 5, whose validation costs were in the -20s. The 30 hidden unit configuration therefore generalises better to unseen data.
The reconstruction cost achieved on the training data by Ex. 3-1 is far worse at -435.809, but its validation score is better at -17.977 due to the higher number of epochs. As the model iterates through the training data, more and more abstract features are learned, so the model makes a better estimate when reconstructing unseen data. We want to learn features that perform comparably on the training data and on unseen data; therefore one training epoch gave the best performance.
Table 3 shows the top 10 scoring experiments out of the 14 performed. Here,
experiments 2 and 10 gave the best results achieving training, validation and
test negative log likelihood costs of 0.17, 2.107, 0.76 and 0.842, 11.664, 0.974
respectively.
Table 3. MLP layer 3 hidden nodes grid search
From the above table it can be seen that ten hidden nodes - the smallest number of hidden nodes tested - gave the best results for both the categorical and continuous data. Further to this, the MLP improves upon the model found with regression for both data types, as the best performing MLP model was 76.8 % accurate in its predictions for the continuous test data and 70.9 % for the categorical.
As a better predictive model was found with the MLP than with regression, this suggests that abstract features were learned in the hidden layer. Further, as the smallest available hidden node value performed best, we conclude that the number of features particularly relevant to the outcome being modelled is relatively low. It can again be seen from the results that the continuous data lends itself to more powerful models than the categorical data, which can be put down to information being lost during transformation.
Ex. 6 achieved the third best error rate on the test data. It immediately improved on 0.272, which was the lowest error rate achieved by picking a random initial configuration and tuning it using the technique outlined above. In fact, 0.272 was the best test error achievable without the hyper-parameters found from previous experiments. Tuning improved the model up to a point (Ex. 2) before it degraded (Ex. 3, 4) and then returned to previous levels of accuracy (Ex. 5).
When choosing the estimated best starting point for the comparison configuration it was thought that more hidden layers would better model the data. The opposite was found, with 2 hidden layers performing best. Interestingly, when the first hidden layer was given the same number of nodes as the number of features in the continuous data (Ex. 7), it improved on the test error of Ex. 6. Our analysis is that an abstract feature representation similar to that of the original continuous data was learned in the first hidden layer.
The lowest test error achieved, 0.238 (Ex. 7), improved upon the error of the best categorical data model found with the MLP and approaches our previous best continuous data model score of 0.232 with the MLP. We concluded that this was due to the DBN learning a better feature representation in its hidden layers. This shows that a DBN with multiple layers has great potential for learning a feature representation from text-based datasets, given that this model was learned on only a small subset of the MAAS dataset and deep architectures have been shown to far outperform shallow models given enough data [24].
Therefore, it can be seen that performing a grid search on: the regression
layer to find the learning rate and regularisation term; the RBM to find the
number of nodes in the first hidden layer; and the MLP to find the number of
nodes in the last hidden layer gave us a methodology for selecting a good starting
point from which to determine the best hyper-parameter configuration for our
deep network, at least in the case of a DBN.
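This incremental methodology can be summarised in a short Python sketch. Every helper below (grid_search_regression, grid_search_rbm, grid_search_mlp, train_dbn) is a hypothetical stand-in for the corresponding set of experiments, passed in by the caller, and is not code from the paper.

```python
def select_dbn_hyperparameters(data, grid_search_regression, grid_search_rbm,
                               grid_search_mlp, train_dbn):
    """Incremental hyper-parameter selection: the result of each stage feeds the next."""
    # Stage 1: regression grid search -> learning rate, regularisation term, GD steps.
    alpha, reg, steps = grid_search_regression(data)
    # Stage 2: RBM grid search (reusing alpha and reg) -> nodes in the first hidden layer.
    n_first = grid_search_rbm(data, alpha=alpha, reg=reg)
    # Stage 3: MLP grid search -> nodes in the last hidden layer.
    n_last = grid_search_mlp(data, alpha=alpha, reg=reg, steps=steps)
    # Stage 4: the collected results form the starting configuration of the DBN,
    # which is then tuned manually around this point.
    return train_dbn(data, layers=[n_first, n_last], alpha=alpha, reg=reg, steps=steps)
```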
5 Related Research
In [6], the authors introduce a random search method to find the best hyper-parameter configuration for a DL architecture and compare their results to previous work [17] which - like our own - uses a multi-resolution grid search coupled with a manual optimisation element. In [6], they also carry out a series of simulation experiments where random search is compared to both grid search and low-discrepancy sequential methods. Their main contribution is a large series of non-simulated experiments which search for the best hyper-parameters for a one-layer neural network and a Deep Belief Network. These are carried out on eight datasets in order to recreate and compare their experimental results with those obtained in [17].
Random search is found to outperform grid search on all datasets for a one-layer neural network, but for the DBN experiments random and grid search perform comparably on four datasets, with grid search outperforming on three datasets and random search finding the best model on the remaining dataset. In [6], the authors offer many reasons why random search is a better option, but most hinge on their finding that the hyper-parameter search space, although high-dimensional, has a low effective dimensionality. This means that although there are many parameters to tune, only a particular subset of these has a great effect on training the model, and this subset is different for every dataset (also shown in the paper). This property makes random search more effective, as it leaves fewer gaps in the search space and does not require as many iterations to find the optimum hyper-parameter configuration.
We chose grid and manual search for these exploratory experiments as it was shown to perform comparably to random search for a DBN. Both [6] and [17] chose to globally optimise the parameters of the entire DBN at once rather than incrementally tune its constituent parts; in other words, they do not optimise each model in turn, with the results of one set of experiments feeding into the next. This is contrary to the adaptive approach that is the focus of our experiments and methodology.
A second major issue is the analysis of high-dimensional data and feature selection [4,5,15], which has been extensively explored in a healthcare context [1,3,11]. In [11] and [1], both groups describe a methodology where features are selected in a two-step, manually intensive fashion in order to learn predictive models. In these two approaches for selecting a feature representation in the health domain, shallow algorithms are used and high-dimensional data is not encountered; in one instance only nine features were modelled [11]. Furthermore, relevant features were sometimes eliminated entirely, which impacted the performance of the model [1].
Finally, in the medical context, DBNs have been used for medical text classification [24], as well as to aid medical decision making with electronic health records [18], but never for the analysis of clinical trial data. Neither [24] nor [18] provides a methodology on how to choose the initial hyper-parameter configuration of a deep learning architecture. Furthermore, they use third-party implementations of a DBN which do not allow for extension with further
References
1. Arauzo-Azofra, A., Aznarte, J.L., Bentez, J.M.: Empirical study of feature selection
methods based on individual feature evaluation for classification problems. Expert
Syst. Appl. 38(7), 8170–8177 (2011)
2. Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I.J., Bergeron, A., Bouchard, N., Bengio, Y.: Theano: new features and speed improvements. In:
Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop (2012)
3. Bellazzi, R., Zupan, B.: Predictive data mining in clinical medicine: current issues
and guidelines. Int. J. Med. Inform. 77(2), 81–97 (2008)
4. Bengio, Y.: Learning deep architectures for AI. Found. Trends Mach. Learn. 2(1),
1–127 (2009)
5. Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new
perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)
6. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. J.
Mach. Learn. Res. 13, 281–305 (2012)
7. Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G.,
Turian, J., Warde-Farley, D., Bengio, Y.: Theano: a CPU and GPU math expres-
sion compiler. In: Proceedings of the Python for Scientific Computing Conference
(SciPy), June 2010. Oral Presentation
8. Camous, F., McCann, D., Roantree, M.: Capturing personal health data from
wearable sensors. In: International Symposium on Applications and the Internet,
SAINT 2008, pp. 153–156. IEEE (2008)
9. Deckers, K., Boxtel, M.P.J., Schiepers, O.J.G., Vugt, M., Sánchez, J.L.M., Anstey,
K.J., Brayne, C., Dartigues, J.-F., Engedal, K., Kivipelto, M., et al.: Target risk
factors for dementia prevention: a systematic review and delphi consensus study on
the evidence from observational studies. Int. J. Geriatr. Psychiatry 30(3), 234–246
(2014)
10. Donnelly, N., Irving, K., Roantree, M.: Cooperation across multiple healthcare
clinics on the cloud. In: Magoutis, K., Pietzuch, P. (eds.) DAIS 2014. LNCS, vol.
8460, pp. 82–88. Springer, Heidelberg (2014)
11. Fakhraei, S., Soltanian-Zadeh, H., Fotouhi, F., Elisevich, K.: Confidence in medical
decision making: application in temporal lobe epilepsy data mining. In: Proceedings
of the 2011 Workshop on Data Mining for Medicine and Healthcare, pp. 60–63.
ACM (2011)
12. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward
neural networks. In: International Conference on Artificial Intelligence and Statis-
tics, pp. 249–256 (2010)
13. Hinton, G.: A practical guide to training restricted boltzmann machines. Momen-
tum 9(1), 926 (2010)
14. Hinton, G.E., Osindero, S., Teh, Y.-W.: A fast learning algorithm for deep belief
nets. Neural Comput. 18(7), 1527–1554 (2006)
15. Humphrey, E.J., Bello, J.P., LeCun, Y.: Feature learning and deep architectures:
new directions for music informatics. J. Intell. Inf. Syst. 41(3), 461–481 (2013)
16. van Boxtel, M.P.J., Ponds, R.H.W.M., Jolles, J., Houx, P.J.: The Maastricht
Aging Study: Determinants of Cognitive Aging. Neuropsych Publishers, Maastricht
(1995)
17. Larochelle, H., Erhan, D., Courville, A., Bergstra, J., Bengio, Y.: An empirical
evaluation of deep architectures on problems with many factors of variation. In:
Proceedings of the 24th International Conference on Machine Learning, ICML
2007, pp. 473–480. ACM, New York, NY, USA (2007)
18. Liang, Z., Zhang, G., Huang, J.X., Hu, Q.V.: Deep learning for healthcare decision
making with EMRs. In: 2014 IEEE International Conference on Bioinformatics
and Biomedicine (BIBM), pp. 556–559. IEEE (2014)
19. Roantree, M., O’Donoghue, J., O’Kelly, N., Pierce, M., Irving, K., Van Boxtel,
M., Köhler, S.: Mapping longitudinal studies to risk factors in an ontology for
dementia. Health Inf. J., pp. 1–13 (2015)
20. Roantree, M., Shi, J., Cappellari, P., O’Connor, M.F., Whelan, M., Moyna, N.:
Data transformation and query management in personal health sensor networks.
J. Netw. Comput. Appl. 35(4), 1191–1202 (2012). Intelligent Algorithms for Data-
Centric Sensor Networks
21. Salakhutdinov, R., Hinton, G.E.: Deep boltzmann machines. In: International Con-
ference on Artificial Intelligence and Statistics, pp. 448–455 (2009)
22. van Boxtel, M.P., Buntinx, F., Houx, P.J., Metsemakers, J.F., Knottnerus, A.,
Jolles, J.: The relation between morbidity and cognitive performance in a normal
aging population. J. Gerontol. Ser. A Biol. Sci. Med. Sci. 53(2), 147–154 (1998)
23. Wan, L., Zeiler, M., Zhang, S., Cun, Y.L., Fergus, R.: Regularization of neural
networks using dropconnect. In: Proceedings of the 30th International Conference
on Machine Learning, ICML-2013, pp. 1058–1066 (2013)
24. Jimeno Yepes, A., MacKinlay, A., Bedo, J., Garnavi, R., Chen, Q.: Deep belief net-
works and biomedical text categorisation. In: Australasian Language Technology
Association Workshop, p. 123 (2014)
Using Virtual Meeting Structure to Support
Summarisation
1 Introduction
A virtual meeting system has been developed, called V-Room [1], which makes use of
agendas with timed items and meeting roles, leading to improved topic guidance and
subsequent summarisation. The roles currently fall into four categories: the chair; the
facilitator; item leaders; and participants. Roles can always be interpreted in different
ways to suit different meeting protocols. For instance, in the educational domain case described in this paper, the chair is the module leader, the facilitator is the assistant lecturer and the students are the participants. Item leaders can be any of the participants, depending on their interest in the item discussed. The agenda, items and
roles are set up as part of a pre-meeting process. This paper illustrates how the use of
such a model of meeting structure can drive automated summarisation of virtual
meeting content.
The paper is organised as follows. Section 2 briefly describes some related litera-
ture. Section 3 describes our summarisation approach. Some experimental results are
provided in Sect. 4, while Sect. 5 provides a brief conclusion and directions for future
work.
2 Related Work
Extensive work has been carried out in automatic summarisation [2, 3]. The approach
adopted in V-ROOM is textual sentence extraction, with meta-information added to provide the context of the discussion. Since both context and content are provided, we consider our summarisation to be both indicative and informative. The term indicative is used to describe summaries that give heading information such as topics of discussion, while the term informative is used to describe summaries that include the content of the discussion rather than just the topic headings.
One of the techniques we have used for sentence extraction is TextRank, an
unsupervised method. TextRank is a graph-based algorithm [4] for ranking sentences
within a text. It achieves this by building a graph where a unit of text is represented by
a vertex and a link between two units of text is represented by an edge. A link may be
formed between two text units if the same word or phrase is used in both text units.
Links between natural language text units may be multiple (more than one word or
phrase connection occurs) or partial (part of a phrase connects but not the whole
phrase). For this reason edges are assigned weights.
A novelty of our work is the exploration of the application of summarisation
techniques such as TextRank to the area of virtual meetings and combining such
techniques with the use of an underlying data model which holds the structure of the
meeting and the roles of participants.
The approach is based on an underlying data model representing structural aspects of the
meeting. An extract from the data model schema is shown below. The use of item titles
and other structural aspects can aid summarisation and automated minutes generation.
chat (time, meeting_id, username, message)
user (user_id, username, password, last_login)
item (item_id, item_title, meeting_id, item_endtime, status, leader)
meeting (meeting_id, meeting_title, starttime, endtime, chair, facilitator, status)
The messages are collected, analysed and the most important sentences are used for
the summary. A pre-processing stage ensures basic punctuation of the input text and a
post-processing stage adds meta-data to create the meeting context which helps to convey
further meaning to the summarisation (see Fig. 1). The item title is used in order to locate
and isolate sentences connected to it and then the TextRank algorithm is used in order to
return the sentences that score higher in the marking. All of the sentences receive a score
but the summary is based on the top sentences, returned in chronological order.
The summary can only be extracted once the meeting has finished; otherwise the results will not be accurate. One of the biggest challenges is to match the item with the corresponding text in the database [5]. Let us assume that this step has been
completed successfully. Within the summarisation component, the text is split into
sentences. Then a sparse matrix of token counts is created based on the size of the
vocabulary that has been identified by analysing the data. The matrix shows frequency
of words across sentences. The matrix is normalised by associating weights to words
depending on their importance. Then a similarity matrix is created across the sentences
and an associated weighted graph is created. From the graph, we extract the sentences
that are connected to the title based on the expectation that those sentences are more
likely to be the heart of the conversation. Then the TextRank algorithm is applied in
order to give a score to each of them with the assumption that top sentences are the most
highly connected sentences and hence most representative of the conversation. Finally,
the top sentences are returned in chronological order to maintain the flow of discussion.
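A rough Python sketch of this pipeline, assuming scikit-learn and NetworkX are available; the sentence splitting, the title-connection test and the constants are simplifications of ours, not the actual V-ROOM component.

```python
import re
import numpy as np
import networkx as nx
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summarise(messages, item_title, k=6):
    # Pre-processing: join the chat messages and split them into sentences (very rough).
    sentences = [s.strip() for s in re.split(r"[.!?]", " ".join(messages)) if s.strip()]
    # Sparse matrix of token counts over the meeting vocabulary plus the item title.
    counts = CountVectorizer().fit_transform(sentences + [item_title])
    sim = cosine_similarity(counts[:-1])                    # sentence-to-sentence similarity
    np.fill_diagonal(sim, 0.0)                              # no self-links in the graph
    title_sim = cosine_similarity(counts[:-1], counts[-1])  # similarity to the item title
    # Keep only sentences connected to the item title (the TR-IT variant).
    keep = [i for i, s in enumerate(title_sim[:, 0]) if s > 0] or list(range(len(sentences)))
    graph = nx.from_numpy_array(sim[np.ix_(keep, keep)])    # weighted sentence graph
    scores = nx.pagerank(graph)                             # TextRank scoring of sentences
    top = sorted(sorted(scores, key=scores.get, reverse=True)[:k])
    return [sentences[keep[i]] for i in top]                # top sentences, chronological order
```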
4 Experimental Results
The evaluation of the summarisation system took place by examining the summari-
sation methods on a meeting transcript that we collected from a virtual meeting
between 2 members of academic staff and 9 students. We tested the system on various
meeting items using TextRank with and without reference to item title (we call these
methods TR-IT and TR-NIT respectively). The results of the test on the item entitled
“Types of Testing” are presented in Table 1. Originally the text conversation consisted
of 36 sentences and the summary configuration was set to reduce to 6 sentences. We
tested summarisation on various items and on the whole meeting without reference to
item delineation or item title.
We found that delineation into items before applying TextRank led to better
summaries. Both summaries shown in Table 1 are indicative and informative according
to the definition given in Sect. 2. The TR-IT summary is perhaps more informative
regarding the breadth of discussion but the TR-NIT gives a better indication of the
emerging decision. For other items we found that TR-IT produced more clearly
enhanced summaries. Our findings led to the recommendation that structural features
could be used to enhance summaries. We noted that if a brief description of the item’s
purpose had been given as part of the agenda setting, further useful indicative context
could have been provided and this would enhance the summaries.
Our initial findings show that capturing and representing meeting structure can improve
the quality of automated summaries of virtual meetings. Further evaluation is needed
however. TextRank is not the only algorithm that can be used to find the best sentences
for extraction. A future direction of exploration will be utilising the roles of the
participants within the meeting to influence sentence extraction. We also note that
additional meta-information could provide useful context. Our future work will explore
these avenues.
References
1. Thompson, P., James, A., Nanos, A.: V-ROOM: virtual meeting system trial. In: 17th IEEE
International Conference on Computer Supported Cooperative Work in Design, pp. 563–569.
IEEE press, Washington (2013)
2. Jones, K.S.: Introduction to Text Summarisation. In: Mani, I., Maybury, M. (eds.) Advances
in Automated Text Summarization. The MIT Press, Cambridge (1998)
3. Hovy, E., Lin, C.-Y.: Automated text summarization and the SUMMARIST system. In:
Proceedings of a Workshop held at Baltimore, Association for Computational Linguistics,
pp. 197–214, Stroudsburg (1998)
4. Mihalcea, R., Tarau, P.: TextRank: bringing order into texts. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 404–411, Stroudsburg (2004)
5. James, A., Nanos, A., Thompson, P.: V-ROOM: a virtual meeting system with intelligent structured summarisation. Enterprise Information Systems (2015). doi:10.1080/17517575.2015.1019571
NoSQL and Distributed Processing
NotaQL Is Not a Query Language! It’s for Data
Transformation on Wide-Column Stores
1 Motivation
When we take a look at NoSQL databases1, they differ from classical relational databases in terms of scalability, their data model and their query method. The simplest form of such a database is a key-value store: one can simply write and read values using key-based access. In this paper, we concentrate on wide-column stores. Such a store consists of tables that have one row-id column and one or more column families. Basically, each column family can be seen as a separate key-value store where column names function as keys. The three most popular wide-column stores are Google's Big Table [2], its open-source implementation Apache HBase2, and Cassandra3. Figure 1 shows an example table with two column families.

Fig. 1. Person table with a children graph and amounts of pocket money
1. http://nosql-database.org
2. http://hbase.apache.org
3. http://cassandra.apache.org

At first sight, the table looks similar to a relational table. This is because both consist of columns and these columns hold atomic values. In relational
databases, however, the database schema is static, i.e., all columns of a table
are known, before values are inserted or modified. In contrast, in a wide-column
store, at each insertion, one is able to set and create arbitrary columns. In other
words, the database schema does not exist, or is dynamically evolving. The first
column family information contains attributes of people. Note that different
rows can have different columns which are not predefined at table-creation time.
The second column family children models a graph structure. The names in the
columns are references to row-ids of children and the values are the amounts of
pocket money the children get from their parents. We will later use this table as
an example for all of our NotaQL transformations. Web graphs are structured very similarly to this example: the first column family comprises information about a web site, while the second contains links to other web sites.
If the table in Fig. 1 was stored in HBase, one could use a Get operation
in the HBase Shell or the Java API to fetch a row with all its columns by
its row-id. In HBase, there always is an index on the row-id. Other secondary
indexes are not supported. To execute more complex queries, programmers can
utilize a framework that allows access via an SQL-like query language. The most
prominent system for that is Hive [21]; others are presented in the next section.
As an alternative, one may consider generating redundant data which then
can be accessed via simple Get operations. This approach shows similarities with
materialized views in traditional relational DBMS [7]. In [13], ideas are presented
to do selections, joins, groupings and sorts by defining transformations over the
data. The authors advocate that one does not need a query language like SQL
when the data is stored in the way it is needed at query time. If finding all
people with a specific year of birth is a frequent query, the application which
modifies data should maintain a second table whose row-id is a year of birth and
columns are foreign keys to the original row-id in the main table. As a drawback,
applications have to be modified carefully to maintain all the tables, so every
change in the base data immediately leads to many changes in different tables.
In [5], similar approaches are presented to maintain secondary indexes on HBase,
either with a dual-write strategy or by letting a MapReduce [3] job periodically
update an index table.
In this paper, we present NotaQL, a data-transformation language for wide-
column stores. Like SQL, it is easy to learn and powerful. NotaQL is made for schema-flexible databases; it supports horizontal aggregations, and metadata can be transformed into data and vice versa. Complex transformations
with filters, groupings and aggregations, as well as graph and text algorithms can
be expressed with minimal effort. The materialized output of a transformation
can be efficiently read by applications with the simple Get API.
In the following section, we present some related work. In Sect. 3, NotaQL is
introduced as a data-transformation language. We present a MapReduce-based
transformation platform in Sect. 4 and the last section concludes the article.
2 Related Work
Transformations and queries on NoSQL, relational and graph databases can
be done by using different frameworks and languages. With Clio [8], one can
perform a schema mapping from different source schemata into a target schema
using a graphical interface. Clio creates views in a semi-automatic way which
can be used to access data from all sources. This virtual integration differs from
our approach because NotaQL creates materialized views. Clio can only map
metadata to metadata and data to data. There is no possibility to translate
attribute names into values and vice versa. In [1], a copy-and-paste model is
presented to load data from different sources into a curated database. Curated
databases are similar to data warehouses, but here it is allowed to modify data
in the target system. A tree-based model is used to support operations from
SQL and XQuery as well as copying whole subtrees. The language presented in
that paper also contains provenance functions to find out by which transaction
a node was created, modified or copied. Although the language is very powerful,
it does not support aggregations, unions and duplicate elimination because in
these cases, the origin of a value is not uniquely defined.
There are many approaches to query wide-column stores using SQL, e.g.
Hive, Phoenix4 or Presto5 . On the one hand, one does not need to learn a new
query language and applications which are based on relational databases can be
reused without many modifications. On the other hand, SQL is not well-suited
for wide-column stores, so the expressiveness is limited. Figure 16 at the end of
this paper shows the weak points of SQL: transformations between metadata and data, horizontal aggregations and much more cannot be expressed with an SQL query. Furthermore, many frameworks do not support the schema flexibility
of HBase. Before an HBase table can be queried by Hive, one has to create a
new Hive table and define how its columns are mapped to an existing HBase
table6 . With Phoenix, an HBase table can be queried with SQL after defining the
columns and their types of a table with a CREATE TABLE command. Presto is an
SQL query engine by Facebook. The presto coordinator creates an execution plan
for a given query and a scheduler distributes the tasks to the nodes that are close
to the data. Usually, Presto directly accesses data that is stored in the Hadoop
distributed file system but connectors for other systems, e.g. HBase, exist as well.
The strength of Presto is a nearly full ANSI-SQL support—including joins and
window functions—and its ten times higher speed than Hive and MapReduce.
But again, only relational queries on relational tables with static schemas are
possible.
The HBase Query Language by Jaspersoft7 can be used to support more
complex queries on an HBase table. It is better suited for wide-column stores
than SQL, but not easy to use. One has to define a query as a JSON document
that can be very long, even for simple queries. The syntax of our language
NotaQL is inspired by Sawzall [19], a programming language used by Google to
define log processing tasks instead of manually writing a MapReduce job. The
input of a Sawzall script is one single line of input (e.g. a log record) and the
4. http://phoenix.apache.org
5. http://prestodb.io
6. https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
7. https://community.jaspersoft.com/wiki/jaspersoft-hbase-query-language
output are insertions into virtual tables. A Sawzall script runs as a MapReduce
job and the input and output is not an HBase table but a CSV file. The language
Pig Latin [17] provides relational-algebra-like operators to load, filter and group
data. Pig programs can be interconnected with a workflow manager like Nova
[16]. Google BigQuery [20] is the publicly-available version of Dremel [14]. One
can import and analyze data that is stored in the Google Cloud Storage using
SQL. As the data is stored in a column-oriented manner, it can be filtered
and aggregated very fast. In the paper, it is recommended to use BigQuery in
combination with MapReduce. First, MapReduce can join and pre-process data,
then this data can be analyzed using BigQuery. As NotaQL transformations are
based on MapReduce, one can replace the complex MapReduce transformations
by NotaQL scripts and combine them with fast query languages like BigQuery,
Phoenix, or HBase QL.
3.2 Predicates
There are two kinds of predicates in NotaQL: a row predicate, which acts as an input-row filter to perform a row selection, and a cell predicate, which selects specific cells in a row. The row predicate is an optional filter definition placed at the beginning of a NotaQL script using an IN-FILTER clause. If such a predicate is set, every row in the input table which does not satisfy it will be skipped. That means that, before a mapping is performed, a whole row is handled as if it did not exist when the predicate evaluates to false. In this predicate, comparison and logical operators as well as column names and constants can be used.

Fig. 4. Row predicate
Fig. 5. Cell predicate
The transformation in Fig. 4 is executed as follows: Only rows that contain a
column born with a value greater than 1950 are selected. The rest of the rows are
skipped. In the remaining rows, only the column salary is read and returned.
The result is one table with only one column salary and between zero and n
rows, where n is the number of rows in the base table. The transformation is
equivalent to the SQL query SELECT salary FROM in WHERE born>1950. Some
more examples for row predicates:
– (born>1950 AND born<1960) OR cpny=‘IBM’ OR col_count()>5,
– school and !school, respectively, which check the existence or absence of a column in a row.
When cells should be filtered within one row without knowing their names, a
cell predicate can be used. It starts with a ? and can be placed after an IN._c or IN._v. The transformation in Fig. 5 only copies columns with a value equal to €5, independent of their names. The question mark indicates the beginning of a predicate, so that cells which do not satisfy it are skipped. The @ symbol is used to refer to the current cell's value. A cell predicate can also be used to drop columns, e.g. OUT.$(IN._c?(!name)) <- IN._v.
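The filtering semantics can be imitated in a few lines of plain Python over a dictionary-based wide-column row; the helper names and the example data below are ours and only illustrate the idea.

```python
def in_filter(rows, predicate):
    """Row predicate: skip whole input rows that do not satisfy the IN-FILTER."""
    return {rid: cells for rid, cells in rows.items() if predicate(cells)}

def cell_filter(cells, predicate):
    """Cell predicate: keep only the cells whose column name or value satisfies it."""
    return {col: val for col, val in cells.items() if predicate(col, val)}

# A tiny wide-column "table": row-id -> {column: value} (hypothetical example data).
people = {
    "Peter": {"born": 1967, "cpny": "IBM", "salary": 6000},
    "Sarah": {"born": 1941, "salary": 5000},
}

selected = in_filter(people, lambda c: c.get("born", 0) > 1950)   # IN-FILTER: born > 1950
salaries = {rid: cell_filter(c, lambda col, _: col == "salary")
            for rid, c in selected.items()}                       # project the salary column
print(salaries)   # {'Peter': {'salary': 6000}}
```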
enables new kinds of transformations which are not possible with classical query
languages yet.
Graphs are often modeled as adjacency lists in a wide-column store. Each row represents one vertex in a graph and each column represents an edge to another vertex. If the edges are weighted, the value of a column contains the weight. In a relational database, columns are part of the meta-data level of a table. In a wide-column store, they are part of the data level. This is why SQL is not well-suited for graph algorithms.

Fig. 9. Reversing a graph

in the column family edges. In this transformation, the input and output tables are the same, so there are no steps needed to preserve the graph structure. The results are only updated output cells for the column PR. An example: nodes A (PR: 0.1) and B (PR: 0.3) have one outgoing edge each, namely to node C. So, C's new PageRank value is 0.1/1 + 0.3/1 = 0.4.
Like updates in SQL, NotaQL transformations have snapshot semantics. This
means, logically the full input is read, then all cells are mapped to output cells
and at the end the output is written. So writes into the input table during job-
execution do not interfere with the remaining transformation process. For our
example, the execution framework has to decide after each execution whether
more iterations are needed or not. PageRank can be executed iteratively until
the changes of the PageRank values are below a specific accuracy value. One
approach to control the number of iterations is a change measurement after each
iteration. Depending on the amount of changes since the previous iteration, a
new run is started or the overall job terminates. Another approach is the usage
of an input format that compares the last two versions of each cell value and
ignores a row when the changes are below a threshold. Then, the job terminates
when the input is empty.
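A deliberately simplified Python sketch of this iteration control for the PageRank example above (no damping factor and no handling of dangling vertices); the function names are ours.

```python
def pagerank_step(pr, edges):
    """One iteration: every vertex distributes its PR evenly over its outgoing edges."""
    new_pr = {v: 0.0 for v in pr}
    for v, targets in edges.items():
        share = pr[v] / len(targets) if targets else 0.0
        for t in targets:
            new_pr[t] += share
    return new_pr

def pagerank(edges, accuracy=1e-4, max_iter=100):
    pr = {v: 1.0 / len(edges) for v in edges}
    for _ in range(max_iter):
        new_pr = pagerank_step(pr, edges)
        # Change measurement after each iteration decides whether another run is needed.
        if max(abs(new_pr[v] - pr[v]) for v in pr) < accuracy:
            return new_pr
        pr = new_pr
    return pr

# A and B both point to C, as in the example above.
print(pagerank({"A": ["C"], "B": ["C"], "C": []}))
```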
Breadth-First Search. The distance between two vertices in a graph is the number
of edges on the shortest path between them. In a weighted graph, it is the sum of
(positive) weights of those edges. Breadth-first search [11] can be used to compute
the distance from one predefined vertex V0 to every other vertex. Therefore, a
dist column is added for start vertex V0 with the value 0. For all other vertices,
the distance is ∞. This can be modeled by the absence of the dist column in
the column family alg.
The NotaQL script in Fig. 12 is executed iteratively until the result does not
change anymore. In a connected graph, the number of iterations is equal to the
diameter of the graph. In each iteration, neighbors of vertices whose distance
are known are updated.
The IN-FILTER skips rows with an unknown distance. For the others, the distance of each neighbor vertex is set to the vertex's own distance plus one. If multiple vertices have an edge to the same neighbor, the minimum value is taken. If the algorithm should take weighted edges into account, the 1 in the last line has to be replaced by IN._v to add the current edge weight to the own distance.

Fig. 12. Breadth-first search
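The same relaxation logic, sketched in Python for an unweighted adjacency-list graph; the helper name is ours.

```python
import math

def bfs_distances(edges, start):
    """Iteratively relax distances until no value changes any more."""
    dist = {v: math.inf for v in edges}
    dist[start] = 0
    changed = True
    while changed:                      # number of iterations <= diameter of the graph
        changed = False
        for v, neighbours in edges.items():
            if dist[v] == math.inf:     # IN-FILTER: skip rows with unknown distance
                continue
            for n in neighbours:
                if dist[v] + 1 < dist[n]:   # replace 1 by the edge weight for weighted graphs
                    dist[n] = dist[v] + 1
                    changed = True
    return dist
```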
cells. The output is a table where for each word (row-id) a column count holds
the number of occurrences of the word in all input cells.
With a small modification in the word-count script, one can calculate a term
index with NotaQL: OUT.$(IN._r) <- COUNT(); Here, each term row contains
a count value for each document that contains the term. These can be used to
support an efficient full-text search over large text data. In addition to these
examples, many other graph and text algorithms can be expressed in NotaQL.
For example, the computation of TF-IDF (term frequency/inverse document
frequency) is a chain of three NotaQL transformations.
Map. The input for one Map function is one row from the input table, consisting of a row-id and a set of columns and values. Figure 14 shows how predicates are evaluated and the Map output is produced. The Map-output key is a combination of an output row-id and a column qualifier. So, each Reduce function processes all the values for one specific cell. It is efficient to use a Partitioner function which transfers the data directly to the node which is responsible for storing rows with the given row-id.

Fig. 14. NotaQL map function

10. http://hadoop.apache.org
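A single-process Python imitation of this key layout (a simulation of the shuffle phase, not the actual MapReduce framework); the function names are ours.

```python
from collections import defaultdict

def map_fn(row_id, cells):
    """Map: evaluate predicates, then emit ((output row-id, column), value) pairs."""
    for col, val in cells.items():
        yield (row_id, col), val        # identity mapping; a real script rewrites ids/columns

def reduce_fn(key, values):
    """Reduce: all values for one specific output cell arrive together."""
    out_row, out_col = key
    return out_row, out_col, sum(values)   # e.g. a SUM aggregation for that cell

def run(table):
    grouped = defaultdict(list)
    for row_id, cells in table.items():
        for key, value in map_fn(row_id, cells):
            grouped[key].append(value)      # shuffle: group by (row-id, column qualifier)
    return [reduce_fn(k, v) for k, v in grouped.items()]
```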
Fig. 16. SQL is not well-suited for wide-column stores: ✓ easy, (✓) hard, ✗ impossible
5 Conclusion
wide-column stores. The output table can be accessed in the application with a
primitive GET API and the up-to-dateness of the data is defined by the query-
execution interval.
We are currently working on language extensions for NotaQL to support more
complex transformations, e.g. Top-k algorithms. For faster transformations, we
are implementing an incremental component in our framework. This means a transformation can reuse the results from a former run and only has to read the delta. Currently, only standalone transformations are supported. Iterative
algorithms need to be executed through a batch script which checks a termina-
tion criterion and supervises the iterations. A language extension for iterative
transformations is planned.
Although all experiments are based on the NoSQL database system HBase,
NotaQL scripts can be defined on other wide-column stores and other NoSQL
and relational databases as well. Next, we will apply our findings to Cassandra
because the support of secondary indexes in Cassandra enables better optimiza-
tions for NotaQL computations. Our vision is for cross-platform transformations.
Then, the input and output of a NotaQL transformation can be any data source
from a relational or NoSQL database. So, one can transform a CSV log file into
an HBase table, load a graph from HypergraphDB into MySQL or integrate data
from Cassandra and a key-value store into MongoDB.
References
1. Buneman, P., Cheney, J.: A copy-and-paste model for provenance in curated data-
bases. Notes 123, 6512 (2005)
2. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M.,
Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: a distributed storage system for
structured data. ACM Trans. Comput. Syst. (TOCS) 26(2), 1–14 (2008). Article 4
3. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters.
In: OSDI, pp. 137–150 (2004)
4. Emde, M.: GUI und testumgebung für die HBase-schematransformationssprache
NotaQL. Bachelor’s thesis, Kaiserslautern University (2014)
5. George, L.: HBase: The Definitive Guide, 1st edn. O’Reilly Media, Sebastopol
(2011)
6. Gonzalez, J.E., Low, Y., Gu, H., Bickson, D., Guestrin, C.: Powergraph: distributed
graph-parallel computation on natural graphs. In: OSDI, vol. 12, p. 2 (2012)
7. Gupta, A., Jagadish, H.V., Mumick, I.S.: Data integration using self-maintainable
views. In: Apers, P.M.G., Bouzeghoub, M., Gardarin, G. (eds.) EDBT 1996. LNCS,
vol. 1057, pp. 140–144. Springer, Heidelberg (1996)
8. Hernández, M.A., Miller, R.J., Haas, L.M.: Clio: A semi-automatic tool for schema
mapping. ACM SIGMOD Rec. 30(2), 607 (2001)
9. Hong, S., Chafi, H., Sedlar, E., Olukotun, K.: Green-marl: a DSL for easy and
efficient graph analysis. ACM SIGARCH Comput. Archit. News 40(1), 349–362
(2012)
10. Lakshmanan, L.V.S., Sadri, F., Subramanian, I.N.: SchemaSQL-a language for
interoperability in relational multi-database systems. In: VLDB, vol. 96, pp. 239–
250 (1996)
11. Lin, J., Dyer, C.: Data-intensive text processing with MapReduce. Synth. Lect.
Hum. Lang. Technol. 3(1), 1–177 (2010)
12. Malewicz, G., Austern, M.H., Bik, A.J.C., Dehnert, J.C., Horn, I., Leiser, N.,
Czajkowski, G.: Pregel: a system for large-scale graph processing. In: Proceedings
of the 2010 ACM SIGMOD International Conference on Management of Data, pp.
135–146. ACM (2010)
13. Grinev, M.: Do You Really Need SQL to Do It All in Cassandra? (2010). http://
wp.me/pZn7Z-o
14. Melnik, S., Gubarev, A., Long, J.J., Romer, G., Shivakumar, S., Tolton, M.,
Vassilakis, T.: Dremel: interactive analysis of web-scale datasets. Commun. ACM
54(6), 114–123 (2011)
15. Murray, D.G., McSherry, F., Isaacs, R., Isard, M., Barham, P., Abadi, M.: Naiad:
a timely dataflow system. In: Proceedings of the Twenty-Fourth ACM Symposium
on Operating Systems Principles, pp. 439–455. ACM (2013)
16. Olston, C., Chiou, G., Chitnis, L., Liu, F., Han, Y., Larsson, M., Neumann, A., Rao,
V.B.N., Sankarasubramanian, V., Seth, S., et al.: Nova: continuous pig/hadoop
workflows. In: Proceedings of the 2011 ACM SIGMOD International Conference
on Management of Data, pp. 1081–1090. ACM (2011)
17. Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-
foreign language for data processing. In: Proceedings of the 2008 ACM SIGMOD
International Conference on Management of Data, pp. 1099–1110. ACM (2008)
18. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking:
bringing order to the web. Technical report 1999–66, Stanford InfoLab, November
1999. Previous number = SIDL-WP-1999-0120
19. Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: parallel
analysis with sawzall. Sci. Program. 13(4), 277–298 (2005)
20. Sato, K.: An inside look at google bigquery. White paper (2012). https://cloud.
google.com/files/BigQueryTechnicalWP.pdf
21. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H.,
Wyckoff, P., Murthy, R.: Hive: a warehousing solution over a map-reduce frame-
work. Proc. VLDB Endow. 2(2), 1626–1629 (2009)
22. Wyss, C.M., Robertson, E.L.: Relational languages for metadata integration. ACM
Trans. Database Syst. (TODS) 30(2), 624–660 (2005)
23. Xin, R.S., Gonzalez, J.E., Franklin, M.J., Stoica, I.: Graphx: a resilient distributed
graph system on spark. In: First International Workshop on Graph Data Manage-
ment Experiences and Systems, p. 2. ACM (2013)
24. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin,
M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstrac-
tion for in-memory cluster computing. In: Proceedings of the 9th USENIX Confer-
ence on Networked Systems Design and Implementation, p. 2. USENIX Association
(2012)
NoSQL Approach to Large Scale Analysis
of Persisted Streams
Keywords: NoSQL data stores · Numerical stream logs · Data stream archival
1 Introduction
The data rate and volume of streams of measurements can become very high. This
becomes a bottleneck when using relational databases for large-scale analysis of
streaming logs [1–4]. Persisting large volumes of streaming data at high rates requires
high performance bulk-loading of data into a database before analysis. Loading into relational databases may be time consuming due to full transactional consistency [5] and the high cost of indexing [6]. In contrast to relational DBMSs, NoSQL data stores are designed to perform simple tasks with high scalability [7]. To provide high-performance updates and bulk-loading, NoSQL data stores generally sacrifice strong consistency, offering so-called eventual consistency instead of the ACID transactions of regular DBMSs. Therefore, NoSQL data stores could be utilized for
analysis of streams of numerical logs where full transactional consistency is not
required.
Unlike NoSQL data stores, relational databases provide advanced query languages and optimization techniques for scalable analytics. It has been demonstrated in [8] that indexing is a major factor in providing scalable performance, giving relational databases a performance advantage over NoSQL data stores for speeding up analytical tasks. Like relational databases, some state-of-the-art NoSQL data stores
(e.g. MongoDB), also provide a query language and both primary and secondary
indexing, which should be well suited for analyzing persisted streams.
To understand how well NoSQL data stores are suited for persisting and analyzing
numerical stream logs, we propose to develop a benchmark comparing state-of-the-art
relational databases with state-of-the-art NoSQL data stores. Using the benchmark as
test bed, we will then investigate techniques for scalable query processing and indexing
of numerical streams persisted with NoSQL data stores.
2 Application Scenario
The Smart Vortex EU project [1] serves as a real world application context, which
involves analyzing stream logs from industrial equipment. In the scenario, a factory
operates some machines and each machine has several sensors that measure various
physical properties like power consumption, pressure, temperature, etc. For each
machine, the sensors generate logs of measurements, where each log record has
timestamp ts, machine identifier m, sensor identifier s, and a measured value mv.
Relational databases are used to analyze the logs by bulk-loading them in table
measures (m, s, ts, mv) which contains a large volume of data logs from many sensors
of different machines [3, 4].
Since the incoming sensor streams can be very large in volume, it is important that
the measurements are bulk-loaded fast. After stream logs have been loaded into the
database, the user can perform queries to detect anomalies of sensor readings. The
following query analyzes the values of mv from sensor logs for a given time interval
and parameterized threshold.
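The concrete benchmark query is not reproduced in the text; the following DB-API sketch (with sqlite3 standing in for any relational driver) shows one plausible shape of such a parameterized threshold query over the measures table, as an assumption of ours.

```python
import sqlite3   # stands in for any relational DBMS with a DB-API driver

conn = sqlite3.connect("streams.db")
conn.execute("CREATE TABLE IF NOT EXISTS measures (m INTEGER, s INTEGER, ts REAL, mv REAL)")

def anomalies(machine, t_start, t_end, threshold):
    """Sensor readings of one machine exceeding a threshold within a time interval."""
    query = """SELECT m, s, ts, mv
               FROM measures
               WHERE m = ? AND ts BETWEEN ? AND ? AND mv > ?"""
    return conn.execute(query, (machine, t_start, t_end, threshold)).fetchall()
```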
Analysis of large-scale stream logs in the above application scenario poses the follow-
ing challenges (C1 to C6) in utilizing relational and NoSQL data stores.
C1. Bulk-Loading: In relational DBMSs, the high cost of maintaining indexes and full transactional consistency can degrade the bulk-loading performance for large volumes of data logs. The loading performance for 6 GB of data logs of a relational DBMS from a major commercial vendor, called DB-C, and of a popular open-source relational database, called DB-O, is shown in Fig. 1. It took the state-of-the-art commercial DBMS, DB-C, more than 1 h on a high-performance commodity machine to bulk-load data logs consisting of around 111 million sensor measurements. Some of the data logs consist of more than a billion sensor measurements, which requires high-performance bulk-loading. To boost performance, the weaker consistency levels of a NoSQL or relational database can be utilized.
Fig. 1. Load time (s) versus database size (GB) for DB-O and DB-C
Fig. 2. Index and data sizes (GB) for DB-O and DB-C
C2. Index Size: Fig. 2 shows the index and database sizes for 6 GB of stream logs loaded into the two DBMSs. The size of the index created in both relational DBMSs was larger than the size of the original logs. For high performance and scalable analysis of typical stream logs, hundreds of gigabytes of memory would be required in our application. It is interesting to see whether state-of-the-art NoSQL data stores can provide memory-efficient indexing strategies. Novel indexing techniques can also be incorporated in order to provide memory-efficient indexing for analyzing persisted streams.
C3. Indexing Strategies: Unlike relational databases and MongoDB, most NoSQL data stores do not provide both primary and secondary indexing, which are essential for scalable processing of queries over data logs. Some NoSQL data stores such as HBase, Cassandra, Memcached, Voldemort, and Riak do not provide full secondary indexing, which is needed for queries having inequalities over non-key attributes. CouchDB has secondary indexes, but queries have to be written as map-reduce views [7], which do not transparently utilize indexes.
C4. Query Processing: Unlike relational databases, most NoSQL data stores do not provide a query optimizer. Some NoSQL data stores, e.g. MongoDB, provide a query language that is able to transparently utilize indexes. However, the sophistication of their query optimizers still needs to be investigated for scalable analysis of data logs.
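For illustration, a pymongo sketch of an index-supported range query over the sensor logs; the database and collection names are assumptions of ours.

```python
from pymongo import MongoClient, ASCENDING

logs = MongoClient()["vortex"]["measures"]          # assumed database/collection names

# Secondary (compound) index over machine, sensor and timestamp.
logs.create_index([("m", ASCENDING), ("s", ASCENDING), ("ts", ASCENDING)])

def anomalies(machine, sensor, t_start, t_end, threshold):
    # MongoDB's planner can use the index for the equality and range parts of this filter.
    return list(logs.find({"m": machine, "s": sensor,
                           "ts": {"$gte": t_start, "$lte": t_end},
                           "mv": {"$gt": threshold}}))
```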
C5. Advanced Analytics: Relational DBMS features for advanced analytics, such as joins or numerical expressions, are limited in NoSQL data stores. Therefore, it needs to be investigated how advanced numerical analytics over large-scale data logs could be performed with NoSQL data stores.
C6. Parallelization of Data: NoSQL data stores have the ability to distribute data over
many machines, which can provide parallel query execution. However, typical queries
for analyzing data logs can generate lots of intermediate results that need to be
transferred over the network between nodes, which can be a performance bottleneck.
Therefore, the performance of both horizontal and vertical partitioning of distributed
NoSQL data stores can be investigated for query execution over numerical logs.
4 Proposed Work
There are several investigations that can be performed for large-scale analysis of
numerical stream logs.
Stream Log Analysis Benchmark: Typical TPC benchmarks [9] such as TPC-C,
TPC-DS, and TPC-H are targeted towards OLTP or decision support, not for log
analysis. To benchmark data stream management systems, the Linear Road Benchmark
(LRB) [10] is typically used. However, LRB does not include the performance of
persisted streams. Analysis of large-scale data logs often requires scalable queries (e.g. [3, 4]) over persisted numerical logs, which should be the focus of the benchmark. In the benchmark, several state-of-the-art NoSQL data stores should be compared with relational DBMSs to investigate to what degree NoSQL data stores are suitable for persisting and analyzing large-scale numerical data streams. The bulk-loading performance of the databases w.r.t. indexing and relaxed consistency should be investigated in the benchmark. The queries should be fundamental to log analyses and targeted at discovering the efficiency of query processing and the utilization of primary and secondary indexes over the data logs. The benchmark should analyze and compare the performance differences of loading with relaxed consistency, index utilization, and query execution for both NoSQL and relational databases, which can provide important insights into challenges C1, C3, C4, and C6.
Query Processing: Supporting advanced analytics with a complete query language over a NoSQL data store requires the development of query processing techniques to compensate for the limitations of NoSQL query languages, for example the lack of join and numerical operators. The push-down of query operators as generated parallel server-side scripts should be investigated. Furthermore, it should be investigated how domain indexing strategies [11] in a main-memory client-side database (e.g. Amos II [12], developed at UDBL of Uppsala University, and [13]) can improve the performance of numerical data log analyses over data retrieved from back-end NoSQL databases. These can provide insights into challenges C2 and C5.
References
1. Smart Vortex Project. http://www.smartvortex.eu/
2. Zeitler, E., Risch, T.: Massive scale-out of expensive continuous queries. In: VLDB (2011)
3. Truong, T., Risch, T.: Scalable numerical queries by algebraic inequality transformations.
In: DASFAA (2014)
4. Zhu, M., Stefanova, S., Truong, T., Risch, T.: Scalable numerical SPARQL queries over
relational databases. In: LWDM Workshop (2014)
5. Doppelhammer, J., Höppler, T., Kemper, A., Kossmann, D.: Database performance in the
real world. In: SIGMOD (1997)
6. Stonebraker, M.: SQL databases v. NoSQL databases. Comm. ACM. 53(4), 10–11 (2010)
7. Cattell, R.: Scalable SQL and NoSQL data stores. ACM SIGMOD Rec. 39, 12–27 (2011)
8. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., Dewitt, D.J., Madden, S., Stonebraker, M.:
A Comparison of approaches to large-scale data analysis. In: SIGMOD (2009)
9. Council, T.P.P.: TPC Benchmarks. http://www.tpc.org/information/benchmarks.asp
10. Arasu, A., Cherniack, M., Galvez, E., Maier, D., Maskey, A.S., Ryvkina, E., Stonebraker,
M., Tibbetts, R.: Linear road: a stream data management benchmark. In: VLDB (2004)
11. Gaede, V., Günther, O.: Multidimensional access methods. ACM Comput. Surv. 30, 47–91
(1998)
12. Risch, T., Josifovski, V., Katchaounov, T.: Functional data integration in a distributed
mediator system. In: Gray, P.M.D., Kerschberg, L., King, P.J.H., Poulovassilis, A. (eds.)
The Functional Approach to Data Management. Springer, Heidelberg (2004)
13. Freedman, C., Ismert, E., Larson, P.-Å.: Compilation in the microsoft SQL server hekaton
engine. IEEE Data Eng. Bull. 37, 22–30 (2014)
Horizontal Fragmentation and Replication
for Multiple Relaxation Attributes
Lena Wiese
1 Introduction
When storing large-scale data sets in distributed database systems, these data
sets are usually fragmented (that is, partitioned) into smaller subsets and these
subsets are distributed over several database servers. Moreover, to achieve better
availability and failure tolerance, copies of the data sets (the so-called replicas)
are created and stored in a distributed fashion so that different replicas of the
same data set reside on distinct servers.
In addition to technical requirements of data distribution, intelligent query
answering mechanisms are increasingly important to find relevant answers to
user queries. Flexible (or cooperative) query answering systems help users of a database system find answers related to their original query in case the original query cannot be answered exactly. Semantic techniques rely on taxonomies (or
ontologies) to replace some values in a query by others that are closely related
according to the taxonomy. This can be achieved by techniques of query relax-
ation – and in particular query generalization: the user query is rewritten into
a weaker, more general version to also allow related answers.
In this paper we make the following contributions:
– we devise an m-copy replication scheme for the fragments ensuring the repli-
cation factor m by storing overlapping fragments on distinct servers;
– we state the replication problem as an optimization problem minimizing the
number of occupied servers;
– we describe a recovery procedure for this kind of replication.
Section 2 introduces the main notions used in this article and gives an illustra-
tive example. Section 3 defines the problem of data replication with overlapping
fragments addressed in this article. Section 4 describes replication and recovery
in a practical system. Related work is presented in Sect. 5, and Sect. 6 concludes this article with suggestions for future work.
Query generalization has long been studied in flexible query answering [8].
Query generalization at runtime has been implemented in the CoopQA sys-
tem [5] by applying three generalization operators to a conjunctive query. Anti-
Instantiation (AI) is one query generalization operator that replaces a constant
(or a variable occurring at least twice) in a query with a new variable y. In
this paper we focus on replacements of constants because this allows for finding
answers that are semantically close to the replaced constant. As the query lan-
guage we focus on conjunctive queries expressed as logical formulas. We assume
a logical language L consisting of a finite set of predicate symbols (denoting
the table names; for example, Ill, Treat or P ), a possibly infinite set dom of
constant symbols (denoting the values in table cells; for example, Mary or a),
and an infinite set of variables (x or y). A term is either a constant or a variable.
The capital letter X denotes a vector of variables; if the order of variables in
X does not matter, we identify X with the set of its variables and apply set
operators – for example we write y ∈ X.
A query formula Q is a conjunction (denoted ∧) of literals (consisting of a
predicate and terms) with a set of variables X occurring freely; hence we write
a query as Q(X) = Li1 ∧ . . . ∧ Lin . The Anti-Instantiation (AI) operator chooses
a constant a in a query Q(X), replaces one occurrence of a by a new variable y
and returns the query QAI (X, y) as the relaxed query. The relaxed query QAI
is a deductive generalization of Q (see [5]).
As a running example, we consider a hospital information system that stores
illnesses and treatments of patients as well as their personal information (like
address and age) in the following three database tables:
The query Q(x1 , x2 , x3 ) = Ill (x1 , Flu) ∧ Ill (x1 , Cough) ∧ Info(x1 , x2 , x3 ) asks
for all the patient IDs x1 as well as names x2 and addresses x3 of patients
that suffer from both flu and cough. This query fails with the given database
tables as there is no patient with both flu and cough. However, the querying
user might instead be interested in the patient called Mary who is ill with both
flu and asthma. We can find this informative answer by relaxing the query con-
dition Cough and instead allowing other related values (like Asthma) in the
answers. An example generalization with AI is QAI (x1 , x2 , x3 , y) = Ill (x1 , Flu) ∧
Ill (x1 , y) ∧ Info(x1 , x2 , x3 ), obtained by introducing the new variable y. It results in a
non-empty (and hence informative) answer: Ill (2748, Flu) ∧ Ill (2748, Asthma) ∧
Info(2748, Mary, 'New Str 3, Newtown'). Another answer obtained is the fact
that Mary suffers from a broken leg: Ill (2748, Flu) ∧ Ill (2748, brokenLeg) ∧
Info(2748, Mary, 'New Str 3, Newtown'), which is, however, an overgeneralization.
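To make the AI operator concrete, the following Python sketch (our own illustration, not code from the CoopQA system; the list-of-literals representation and helper names are assumptions) applies it to the running example.

# Conjunctive queries as lists of literals (predicate, list of terms);
# variables are strings starting with '?'.
def anti_instantiate(query, constant, occurrence=0, new_var="?y"):
    """Replace one occurrence of `constant` in `query` by a new variable."""
    relaxed, seen = [], 0
    for predicate, terms in query:
        new_terms = []
        for t in terms:
            if t == constant:
                new_terms.append(new_var if seen == occurrence else t)
                seen += 1
            else:
                new_terms.append(t)
        relaxed.append((predicate, new_terms))
    return relaxed

# Q(x1,x2,x3) = Ill(x1, Flu) AND Ill(x1, Cough) AND Info(x1, x2, x3)
Q = [("Ill", ["?x1", "Flu"]), ("Ill", ["?x1", "Cough"]), ("Info", ["?x1", "?x2", "?x3"])]
Q_AI = anti_instantiate(Q, "Cough")   # Ill(x1, Flu) AND Ill(x1, y) AND Info(x1, x2, x3)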
An extension of the basic BPP, the Bin Packing with Conflicts (BPPC)
problem, considers a conflict graph G = (V, E) where the node set V = {1, . . . , n}
corresponds to the set of objects. A binary edge e = (i, j) exists whenever the two
objects i and j must not be placed into the same bin. In the ILP representation,
a further constraint (Eq. 9) is added to avoid conflicts in the placements.
minimize   Σ_{k=1}^{K} y_k    (minimize number of bins)    (6)
s.t.   Σ_{k=1}^{K} x_{ik} = 1,   i = 1, …, n    (each object assigned to one bin)    (7)
       Σ_{i=1}^{n} w_i x_{ik} ≤ W y_k,   k = 1, …, K    (capacity not exceeded)    (8)
       x_{ik} + x_{jk} ≤ y_k,   k = 1, …, K, ∀(i, j) ∈ E    (no conflicts)    (9)
       y_k ∈ {0, 1},   k = 1, …, K    (10)
       x_{ik} ∈ {0, 1},   k = 1, …, K, i = 1, …, n    (11)
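For illustration, a minimal Python sketch of the BPPC model (6)-(11) is given below; it assumes the PuLP library, and the weights, capacity, number of bins, and conflict edges are purely hypothetical.

import pulp

# Hypothetical instance: object weights w, bin capacity W, K candidate bins,
# and conflict edges E between objects that must not share a bin.
w, W, K = [4, 2, 4, 2], 5, 4
E = [(0, 2), (0, 3), (1, 2)]
n = len(w)

prob = pulp.LpProblem("BPPC", pulp.LpMinimize)
y = [pulp.LpVariable(f"y_{k}", cat="Binary") for k in range(K)]
x = [[pulp.LpVariable(f"x_{i}_{k}", cat="Binary") for k in range(K)] for i in range(n)]

prob += pulp.lpSum(y)                                                  # (6) number of bins
for i in range(n):
    prob += pulp.lpSum(x[i][k] for k in range(K)) == 1                 # (7) one bin per object
for k in range(K):
    prob += pulp.lpSum(w[i] * x[i][k] for i in range(n)) <= W * y[k]   # (8) capacity
    for i, j in E:
        prob += x[i][k] + x[j][k] <= y[k]                              # (9) conflicts
prob.solve()
print("bins used:", sum(int(v.value()) for v in y))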
In order to support flexible query answering on multiple columns, one table can
be fragmented multiple times (by clustering different columns); that is, we can
choose more than one relaxation attribute. In this case, several fragmentations
will be obtained. More formally, if α relaxation attributes are chosen and clus-
tered, then we obtain α fragmentations Fl (l = 1 . . . α) of the same table; each
fragmentation contains fragments fl,s where index s depends on the number of
clusters found: if nl clusters are found, then Fl = {fl,1 , . . . , fl,nl }.
We assume that each of the clusterings (and hence the corresponding frag-
mentation) is complete: every value in the column is assigned to one cluster and
hence every tuple is assigned to one fragment. We also assume that each cluster-
ing and each fragmentation are non-redundant: every value is assigned to exactly
one cluster and every tuple belongs to exactly one fragment (for one clustering);
in other words, the fragments inside one fragmentation do not overlap.
However, fragments from two different fragmentations (for two different clus-
terings) may overlap. For example, both the Respiratory and the IDhigh
fragments contain the tuple (8457, Cough). Due to completeness, every tuple
is contained in exactly one of the fragments of each of the α fragmentations:
for any tuple j, if α relaxation attributes are chosen and clustered, then in any
fragmentation Fl (l = 1 . . . α) there is a fragment fl,s such that tuple j ∈ fl,s .
We illustrate this with our example. Assume that 5 rows is the maximum capac-
ity W of each server and assume a replication factor 2. In a conventional repli-
cation approach, all fragments are of approximately the same size and do not
overlap. Hence, the conventional approach would replicate all fragments (Respiratory,
Fracture, IDhigh, IDlow) to two servers each: it would assign the Respiratory
fragment (with 4 rows) to one server S1 and a copy of it to another server S2.
Now the Fracture fragment (with 2 rows) will not fit on any of the two servers;
its two replicas will be stored on two new servers S3 and S4. For storing the
IDlow fragment (with 4 rows), the conventional approach would need two more
servers S5 and S6. The IDhigh fragment (with 2 rows) could then be mapped to
servers S3 and S4. The conventional replication approach would hence require
at least six servers to achieve a replication factor 2.
In contrast, our intelligent replication approach takes advantage of the over-
lapping fragments so that three servers suffice to fulfill the replication factor
2; that is, the number of servers can be substantially reduced by a more intelligent
replication and recovery scheme that exploits the fact that several fragments overlap
and that can handle fragments of differing size to optimally fill the remaining server
capacities. This allows for better self-configuration capabilities of the distributed
database system. First we observe how one fragment
can be recovered from the other fragments: Fragment Respiratory can be recov-
ered from fragments IDlow and IDhigh (because Respiratory = (IDlow ∩ Res-
piratory) ∪ (IDhigh ∩ Respiratory)); Fragment Fracture can be recovered from
fragment IDlow (because Fracture = (IDlow ∩ Fracture)); Fragment IDlow can
be recovered from fragments Respiratory and Fracture (because IDlow = (IDlow
∩ Respiratory) ∪ (IDlow ∩ Fracture)); Fragment IDhigh can be recovered from
fragment Respiratory (because IDhigh = (IDhigh ∩ Respiratory)). Hence, we can
store fragment Respiratory on server S1, fragment IDlow on server S2, and frag-
ments Fracture and IDhigh on server S3 and still have replication factor 2 for
individual tuples.
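The recovery relations above can be checked mechanically on tuple-id sets; the following Python sketch is our own illustration with hypothetical tuple identifiers.

# Hypothetical tuple ids; Respiratory/Fracture and IDlow/IDhigh are the two
# fragmentations of the same table, so their unions coincide.
respiratory = {2747, 2748, 8457, 8458}
fracture    = {2749, 2750}
id_low      = {2747, 2748, 2749, 2750}
id_high     = {8457, 8458}

def recover(lost, surviving):
    """A fragment equals the union of its overlaps with the other fragmentation;
    in practice the overlaps are exactly the pieces stored on the surviving servers."""
    return set().union(*(lost & fragment for fragment in surviving))

assert recover(respiratory, [id_low, id_high]) == respiratory
assert recover(fracture, [id_low]) == fracture
assert recover(id_low, [respiratory, fracture]) == id_low
assert recover(id_high, [respiratory]) == id_high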
We now show that our replication problem (with its extensions to overlapping
fragments and counting replication based on tuples) can be expressed as an
advanced BPPC problem. Let J be the number of tuples in the input table, m
be the number of fragmentations, K the total number of available servers and
n be the overall number of fragments obtained in all fragmentations. In the ILP
representation we keep the variables yk for the bins and xik for fragments – to
simplify notation we assume that i = 1 . . . n where n = |F1 | + . . . + |Fm | = n1 +
. . . + nm : all fragments are numbered consecutively from 1 to n even when they
come from different fragmentations. In addition, we introduce K new variables
z_{jk} for each tuple j such that z_{jk} = 1 if tuple j is placed on server k.
We maintain a mapping between fragments and tuples such that if fragment i is
assigned to bin k and j is contained in i, then tuple j is also assigned to k (see
Eq. (15)); conversely, if there is no fragment i containing j that is assigned to bin k,
then tuple j is not assigned to k either (see Eq. (16)). Finally, we modify the
conflict constraint to support the replication factor: we require that for each tuple j
the number of bins/servers used is at least m (see Eq. (17)) to ensure the
replication factor.
minimize   Σ_{k=1}^{K} y_k    (minimize number of bins)    (12)
s.t.   Σ_{k=1}^{K} x_{ik} = 1,   i = 1, …, n    (each fragment i assigned to one bin)    (13)
       Σ_{i=1}^{n} w_i x_{ik} ≤ W y_k,   k = 1, …, K    (capacity not exceeded)    (14)
       z_{jk} ≥ x_{ik}   for all j: j ∈ i    (tuple j in bin when fragment i is)    (15)
       z_{jk} ≤ Σ_{i: j ∈ i} x_{ik}   for all j    (tuple not in bin when no fragment is)    (16)
       Σ_{k=1}^{K} z_{jk} ≥ m   for all j    (replication factor m on tuples)    (17)
       y_k ∈ {0, 1},   k = 1, …, K    (18)
       x_{ik} ∈ {0, 1},   k = 1, …, K, i = 1, …, n    (19)
       z_{jk} ∈ {0, 1},   k = 1, …, K, j = 1, …, J    (20)
The ILP representation in the previous section is highly inefficient and does
not scale to large numbers of tuples: due to the excessive use of z-variables,
for large J finding a solution will take prohibitively long. Indeed, in the given
representation, we have K y-variables, n · K x-variables, and J · K z-variables,
where usually J ≫ n. That is why we now show that it is possible to
focus on the x-variables to achieve another ILP representation for overlap-DRP:
for any tuple j that is contained in two fragments i and i′ (we assume
that i < i′ to avoid isomorphic statements in the proof), it is sufficient to ensure
that the two fragments are stored on two different servers. In other words, for
the (m · (m − 1))/2 pairs of overlapping fragments i and i′, we can make them
mutually exclusive in the ILP representation; that is, in the ILP representation
we have to satisfy (m · (m − 1))/2 inequalities of the form x_{ik} + x_{i′k} ≤ 1 to make
them pairwise conflicting.
Theorem 1. If for any two fragments i and i′ such that i ∩ i′ ≠ ∅ there hold
(m · (m − 1))/2 inequalities of the form x_{ik} + x_{i′k} ≤ 1, where i < i′, i = 1, …, n − 1,
i′ = 2, …, n and k = 1, …, K, then it holds for any tuple j that Σ_{k=1}^{K} z_{jk} ≥ m.
Proof. First of all, for every tuple j there are m fragments i such that j ∈ i, due
to the completeness of the m fragmentations. Now we let I be the set of these m
fragments. Then for any two i, i′ ∈ I we have j ∈ i ∩ i′ by construction. Due to
Eq. (13), for every i ∈ I there must be exactly one bin k such that x_{ik} = 1, and
for all other i∗ ∈ I it holds that either x_{ik} + x_{i∗k} ≤ 1 (if i < i∗) or x_{i∗k} + x_{ik} ≤ 1 (if
i∗ < i), so that none of these other fragments is assigned to bin k. Hence, m bins are
needed to accommodate all fragments in I. Due to Eq. (15), we ensure that when
x_{ik} = 1 then also z_{jk} = 1 for the given j and any i ∈ I. Hence Σ_{k=1}^{K} z_{jk} ≥ m
(Eq. 17) holds.
Continuing our example, we have a conflict graph over the fragments Respiratory,
Fracture, IDlow and IDhigh with an edge between Respiratory and IDlow, and
an edge between Respiratory and IDhigh, and an edge between Fracture and
IDlow. This yields the following ILP representation:
minimize   Σ_{k=1}^{K} y_k    (minimize number of bins)    (21)
s.t.   Σ_{k=1}^{K} x_{ik} = 1,   i = 1, …, n    (each fragment i assigned to one bin)    (22)
       Σ_{i=1}^{n} w_i x_{ik} ≤ W y_k,   k = 1, …, K    (capacity not exceeded)    (23)
       x_{ik} + x_{i′k} ≤ y_k,   k = 1, …, K, i ∩ i′ ≠ ∅    (overlapping fragments i, i′)    (24)
       y_k ∈ {0, 1},   k = 1, …, K    (25)
       x_{ik} ∈ {0, 1},   k = 1, …, K, i = 1, …, n    (26)
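The overlap condition of constraint (24) can be computed directly from the tuple-id sets of the fragments; the short Python sketch below (our own, reusing the hypothetical tuple ids from the earlier sketch) derives the conflict pairs of the example.

from itertools import combinations

# Fragments numbered consecutively across both fragmentations.
fragments = [
    {2747, 2748, 8457, 8458},   # 0: Respiratory
    {2749, 2750},               # 1: Fracture
    {2747, 2748, 2749, 2750},   # 2: IDlow
    {8457, 8458},               # 3: IDhigh
]

def conflict_pairs(frags):
    """Pairs (i, i') with i < i' whose tuple-id sets overlap; each pair gets a constraint (24)."""
    return [(i, j) for i, j in combinations(range(len(frags)), 2) if frags[i] & frags[j]]

print(conflict_pairs(fragments))   # [(0, 2), (0, 3), (1, 2)]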
4 Experimental Study
To enable recovery, a lookup table is maintained that stores for each clusterid
the tupleids of those tuples that constitute the clustered fragment. The recov-
ery procedure was executed on the range-based fragmentation to recover the
clustered fragmentation by running INSERT INTO c_i SELECT * FROM r_1, …, r_m
JOIN lookup ON (lookup.tupleid = c_i.tupleid) WHERE lookup.clusterid = i
for each cluster i. The runtimes obtained are shown in Fig. 1.
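As an illustration of how such recovery statements could be issued programmatically, the sketch below is our own; the DB-API connection, the fragment table names, and the per-fragment join shape are assumptions rather than the implementation used in the experiments.

def recover_clustered_fragments(conn, range_fragments, cluster_ids):
    """Rebuild each clustered fragment c_i from the range-based fragments via the lookup table."""
    cur = conn.cursor()
    for i in cluster_ids:
        for r in range_fragments:               # trusted identifiers, e.g. ["r1", "r2"]
            cur.execute(
                f"INSERT INTO c{i} "
                f"SELECT {r}.* FROM {r} "
                f"JOIN lookup ON lookup.tupleid = {r}.tupleid "
                f"WHERE lookup.clusterid = {i}"
            )
    conn.commit()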
5 Related Work
References
1. Agrawal, S., Narasayya, V., Yang, B.: Integrating vertical and horizontal partition-
ing into automated physical database design. In: Proceedings of the 2004 ACM
SIGMOD International Conference on Management of Data, pp. 359–370. ACM
(2004)
2. Coffman, Jr., E.G., Csirik, J., Leung, J.Y.T.: Variants of classical one-dimensional
bin packing. In: Gonzalez, T.F. (ed.) Handbook of Approximation Algorithms and
Meta-Heuristics, pp. 33:1–33:10. Francis and Taylor Books (CRC Press), London
(2007)
3. Goudarzi, H., Pedram, M.: Energy-efficient virtual machine replication and place-
ment in a cloud computing system. In: IEEE 5th International Conference on Cloud
Computing (CLOUD), pp. 750–757. IEEE (2012)
4. Grund, M., Krüger, J., Plattner, H., Zeier, A., Cudre-Mauroux, P., Madden, S.:
Hyrise: a main memory hybrid storage engine. Proc. VLDB Endow. 4(2), 105–116
(2010)
5. Inoue, K., Wiese, L.: Generalizing conjunctive queries for informative answers. In:
Christiansen, H., De Tré, G., Yazici, A., Zadrozny, S., Andreasen, T., Larsen, H.L.
(eds.) FQAS 2011. LNCS, vol. 7022, pp. 1–12. Springer, Heidelberg (2011)
6. Jansen, K., Öhring, S.: Approximation algorithms for time constrained scheduling.
Inf. Comput. 132(2), 85–108 (1997)
7. Loukopoulos, T., Ahmad, I.: Static and adaptive distributed data replication using
genetic algorithms. J. Parallel Distrib. Comput. 64(11), 1270–1285 (2004)
8. Michalski, R.S.: A theory and methodology of inductive learning. Artif. Intell.
20(2), 111–161 (1983)
9. Özsu, M.T., Valduriez, P.: Principles of Distributed Database Systems. Springer
Science & Business Media, New York (2011)
10. Shi, W., Hong, B.: Towards profitable virtual machine placement in the data center.
In: Fourth IEEE International Conference on Utility and Cloud Computing (UCC),
pp. 138–145. IEEE (2011)
11. U.S. National Library of Medicine: Medical subject headings. http://www.nlm.nih.
gov/mesh/
12. Wiese, L.: Clustering-based fragmentation and data replication for flexible query
answering in distributed databases. J. Cloud Comput. 3(1), 1–15 (2014)
Scalability
Scalable Queries Over Log
Database Collections
1 Introduction
Various business applications need to observe the working status of products in order
to analyse whether they behave properly. Our application is from a real-world scenario [11],
where machines such as trucks, pumps, kilns, etc. are widely distributed at different
geographic locations and where sensors on machines produce large volumes of data.
The data describes time stamped sensor readings of machine components (e.g. oil
temperature and pressure) and can be used to analyse abnormal behaviours of the
equipment. In order to analyse the past behaviour of monitored equipment, the sensor
readings can be stored in relational databases and analysed with SQL. In our appli-
cation area, data is produced and maintained locally at many different sites in auton-
omous relational DBMSs called log databases. New sites and log databases are
dynamically added and removed from the federation. The number of sites is potentially
large, so it is important that the query processing scales with increasing number of sites.
A global meta-database enables a global view of the working status of all machines on
different sites. It stores meta-data about machines, sensors, sites, etc.
A particular challenge in our scenario is a scalable way to process queries that join
meta-data with data selected from the collection of autonomous log databases using
standard DBMS APIs. This paper proposes two strategies to perform such joins,
namely parallel bind-join (PBJ) and parallel bulk-load join (PBLJ). PBJ generalizes
the bind-join (BJ) [4] operator, which is a state-of-the-art algorithm for joining data
from an autonomous external database with a central database. One problem with bind-
join in our scenario is that large numbers of SQL queries will be sent to the log
databases for execution, one for each parameter combination selected from the meta-
database, which is slow. Furthermore, whereas bind-join is well suited for joining data
from a single log database with the meta-database, our application scenario requires
joining data from many sites.
With both PBJ and PBLJ, streams of selected meta-data variable bindings are
distributed to the wrapped log databases and processed there in parallel. After the
parallel processing the result streams are merged asynchronously by FLOQ.
• With PBJ the streams of bindings selected from the meta-database are bind-joined
in the distributed wrappers with their encapsulated log databases. The bind-joins of
different wrapped log databases are executed in parallel.
• With PBLJ the selected bindings are first bulk loaded in parallel into a binding table
in each log database where a regular join is performed between the loaded bindings
and the local measurements.
The strategies are implemented in our prototype system called FLOQ (Fused LOg
database Query processor). FLOQ provides general query processing over collections
of autonomous relational log databases residing on different sites. The collection of log
databases is integrated by FLOQ through a meta-database where properties about data
in the log databases are stored. On each site the log database is encapsulated by a
FLOQ wrapper to pre- and post-process queries.
To investigate our strategies, a cost model is proposed to evaluate the efficiency of
each strategy. To evaluate the performance we define fundamental queries for detecting
abnormal sensor readings and investigate the impact of our join strategies. A relational
DBMS from a major commercial vendor is used for storing the log databases.
In summary the contributions are:
• Two join strategies are proposed and compared, parallel bind-join and parallel bulk-
load join, for parallel execution of queries joining meta-data with data from col-
lections of autonomous databases using external DBMS APIs.
• A cost model is proposed to evaluate the strategies.
• The conclusions from the cost model are verified experimentally.
The rest of this paper is organized as follows: Sect. 2 overviews the FLOQ system
architecture and presents the scenario and queries used for the performance evaluation.
Section 3 presents the join strategies and the cost model used in the evaluation.
Section 4 presents the performance evaluation for the join strategies. Section 5
describes related work. Finally, Sect. 6 concludes and outlines some future work.
2 FLOQ
Figure 1 illustrates the FLOQ architecture. To analyse machine behaviours, the user
sends queries over the integrated log databases to FLOQ. FLOQ processes a query by
first querying the meta-database to find the identifiers of the queried log databases
containing the desired data, then in parallel sending distributed queries to the log
databases, and finally collecting and merging the distributed query results to obtain the
final result. Scalable parallel processing of queries making joins between a meta-
database and many large log databases is the subject of this paper.
Fig. 1. The FLOQ architecture: user queries are processed by the FLOQ server over a collection of wrapped log databases (RDBs) at different sites.
Each log database is encapsulated with a FLOQ wrapper called from the FLOQ
server to process queries over the wrapped log database. A FLOQ wrapper contains a
full query processor which enables, e.g. local bind-joins between a stream of bindings
selected from the meta-database and the log database. Parallel processing is provided
since the FLOQ wrappers work independently of each other. Each FLOQ wrapper
sends back to the FLOQ server the result of executing a query as a stream of tuples.
The results from many wrappers are asynchronously merged by the FLOQ server
while emitting the result to the user. Details of the query processor are described in
[10, 13, 14] and are outside the scope of this paper.
The table SensorInstallation(si, mi, sm, ev) stores the sensor installation information, i.e. a sensor installation
identifier si, the machine installation mi of si, the sensor model sm, and the expected
measured value ev. The columns m and sid in table MachineInstallation are foreign
keys referencing the tables MachineModel and Site, respectively. The column mi in table
SensorInstallation is a foreign key to MachineInstallation.
Fig. 2. (a). Meta-database schema. (b) Log table at each site. (c) Integrated view in FLOQ server
The table Site(sid, name, logdb) stores information about the sites where the log
databases are located: a numeric site identifier sid, its name, and an identifier of its log
database, logdb. A new log database is registered to FLOQ by inserting a new row in
table Site. Each site presents to FLOQ its log data as a temporal local relation Measures
(mi, si, bt, et, mv) (Fig. 2(b)) representing measurements from the sensors installed on
the machines at the site, i.e. temporal local-as-view [5] data integration is used. For a
machine installation mi at a particular site the local view presents the measured
readings from sensor installation si in the valid time interval [bt,et). The columns mi
and si in Measures are foreign keys from the corresponding columns in the meta-
database tables MachineInstallation and SensorInstallation, respectively.
The view VMeasures (Fig. 2(c)) in FLOQ integrates the collection of log databases.
It is logically a union-all of the local Measures views at the different sites. In VMea-
sures the attribute logdb identifies the origin of each tuple. Through the meta-database
users can make queries over the log databases by joining other meta-data with
VMeasures. Since the set of log databases is dynamic it is not feasible to define
VMeasures as a static view; instead FLOQ processes queries to VMeasures by
dynamically submitting SQL queries to the log databases and collecting the results. In
the experiments we populate the meta-database and the log databases with data from a
real-world application [11].
restricted. The experiments are scaled by varying these parameters. The number of log
databases is varied by restricting sid, the amount of data selected from each log
database is varied by th, and the number of bindings selected from the meta-database is
varied by mi.
Query Q2 in Fig. 4 is similar to Q1, the difference being that it applies an aggregate
function over Q1, i.e. it computes the number of faulty sensor readings. Here only a
single value is returned from each log database. The purpose of the query is to
investigate the join strategies without the overhead of transferring substantial amounts
of data back to the client.
3 Join Strategies
The two strategies, PBJ and PBLJ, for parallel execution of queries joining data
between the meta-database and the log databases are illustrated in Figs. 7 and 8,
respectively. With both strategies FLOQ first extracts parameter bindings from the
meta-database. The result is a stream of tuples called the binding stream B, where
each tuple (i, v1, v2, …, vp) is a parameter binding. The elements v1, v2, …, vp of the
binding stream are the values of the free variables in the query fragment sent to the log
databases. For example, in Q1 the free variables are (mi, si, ev). Each binding tuple is
prefixed with a destination site, i, identifying where the log database RDBi resides. The
parameter binding tuples are joined with measurements in the log databases. Thus the
binding stream is split into one site binding stream Bi per log database RDBi, B = B1 ∪
B2 ∪ … ∪ Bn, where n is the number of sites. The destination i determines to which site
the rest of the tuple (v1, v2, …, vp), is routed. The join strategies are defined as follows:
PBLJ, parallel bulk-load join: With PBLJ (Fig. 8) each FLOQ wrapper first bulk
loads the entire binding stream Bi into a binding table in RDBi. When all parameter
bindings have been loaded, the system submits a single SQL query to the log database
to join the loaded binding table with σi. As for PBJ, the result stream Ri is shipped back
to the FLOQ server through the wrapper for asynchronous merging. Compared to PBJ,
the advantage of this approach is that only one query is sent to each log database. It
requires the extra step of bulk loading in parallel the entire parameter streams into each
log database, which, however, should be less costly compared to calling many prepared
SQL statements through JDBC with PBJ. The bulk loading facility of the DBMS is
utilized for high performance.
BJ, regular bind-join: If there is a single log database, PBJ is analogous to BJ and is a
baseline in our evaluations. With BJ one prepared SQL query per binding is shipped
from the FLOQ wrapper to only one log database, RDB1.
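The following Python sketch (our own illustration, not the FLOQ implementation) outlines the PBLJ flow, assuming one DB-API connection per wrapped log database; the binding-table schema and the use of executemany as a stand-in for the DBMS bulk-load facility are assumptions.

import concurrent.futures

def pblj_site(conn, site_bindings, join_sql):
    """Bulk load the site binding stream, then run a single join query in the log database."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE binding (mi INT, si INT, ev FLOAT)")   # hypothetical schema
    cur.executemany("INSERT INTO binding VALUES (?, ?, ?)", site_bindings)
    cur.execute(join_sql)           # joins the binding table with the local Measures view
    return cur.fetchall()           # result stream R_i, shipped back via the wrapper

def pblj(connections, binding_stream, join_sql):
    """Split the binding stream by destination site, join in parallel, merge asynchronously."""
    per_site = {}
    for site, *values in binding_stream:          # binding tuples (i, v1, ..., vp)
        per_site.setdefault(site, []).append(tuple(values))
    results = []
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(pblj_site, connections[s], b, join_sql)
                   for s, b in per_site.items()]
        for f in concurrent.futures.as_completed(futures):
            results.extend(f.result())            # merge result streams as they arrive
    return results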
The total cost of the FLOQ server execution is approximately divided between two
major components: the cost of splitting the binding stream B, C_s, and the cost
of merging all result streams R_i, C_m. The cost of the FLOQ server execution is
independent of the join strategy, i.e.:

C_FLOQ = C_s + C_m    (2)
The total site cost C_i is approximately divided between four major cost components:
(i) transferring the binding stream B_i from the FLOQ server to the site, C_{B_i}, (ii) executing
the selection σ_i in the log database, C_{σ_i}, (iii) the local join, performed either in RDB_i (for PBLJ)
or in the FLOQ wrapper (for PBJ), C_{⋈_i}, and (iv) transferring the result stream R_i to
the FLOQ server, C_{R_i}. Thus the total site cost C_i is defined as:

C_i = C_{B_i} + C_{σ_i} + C_{⋈_i} + C_{R_i}    (3)

By combining Eqs. (1), (2), and (3), the total cost of a distributed join becomes:

(4)
For each site, the binding stream B_i is significantly smaller than the number of
logged measurements in RDB_i:

|B_i| ≪ |RDB_i|    (5)

For PBJ, the bind-join is performed in each FLOQ wrapper; therefore, the cost of the
local join can be replaced with the cost of a bind-join in the wrapper, C_{⋈_i}^{Wrapper}.
Also the cost of executing the sub-query σ_i that selects data from a log database, C_{σ_i},
is replaced with the BJ selection cost, C_{σ_i}^{Wrapper}, in the site cost in (3):

C_i^{PBJ} = C_{B_i}^{PBJ} + C_{σ_i}^{Wrapper} + C_{⋈_i}^{Wrapper} + C_{R_i}    (6)
In PBLJ the joins and selections are combined into one sub-query to each RDB_i.
Therefore, the costs C_{⋈_i} and C_{σ_i} in the site cost in Eq. (3) for PBLJ can be
replaced with the cost of join and selection in the log database
(C_{⋈_i}^{LogDB} and C_{σ_i}^{LogDB}):

C_i^{PBLJ} = C_{B_i}^{PBLJ} + C_{σ_i}^{LogDB} + C_{⋈_i}^{LogDB} + C_{R_i}    (7)
In PBJ, the FLOQ server transfers the binding stream B_i to a FLOQ wrapper
through the standard network protocol. Therefore, the cost of transferring bindings to
each site, C_{B_i}^{PBJ}, is the aggregated network communication overhead for each
binding, C_{Net}:

C_{B_i}^{PBJ} = |B_i| · C_{Net}, where |B_i| ≫ 1    (8)
In PBLJ all the bindings B_i are bulk-loaded directly into the log database. The cost
of sending all bindings to site i, C_{B_i}^{PBLJ}, is the cost of bulk loading the bindings,
C_{Bulkload_i}:

C_{B_i}^{PBLJ} = C_{Bulkload_i}    (9)

C_{B_i}^{PBLJ} ≤ C_{B_i}^{PBJ}    (10)
On the other hand, the selection cost of PBLJ is also low compared to PBJ, since the
cost of the selection performed by RDB_i is lower than the combined cost of selection and
JDBC overhead for each binding b of a binding stream B_i:

C_{σ_i}^{LogDB} ≤ |B_i| · (C_{σ_b} + C_{JDBC}), where b ∈ B_i and |B_i| ≫ 1; therefore:    (11)

C_{σ_i}^{LogDB} ≤ C_{σ_i}^{Wrapper}    (12)
Similarly, a local join in the relational DBMS is efficient compared to the join
performed in a FLOQ wrapper, since query optimization techniques can be applied
inside the relational DBMS and the overhead of JDBC calls is eliminated. Thus,

C_{⋈_i}^{LogDB} ≤ C_{⋈_i}^{Wrapper}    (13)

From Eqs. (10), (12), and (13), the total cost at site i for the three components
transferring bindings (C_{B_i}), selection (C_{σ_i}), and join (C_{⋈_i}) is lower for PBLJ than for
PBJ. The cost C_{R_i} of transferring the result stream R_i to the FLOQ server is equal for
both PBLJ and PBJ; therefore, comparing (6) and (7):

C_i^{PBLJ} ≤ C_i^{PBJ}    (14)

From Eq. (1), as the cost of the execution at the FLOQ server, C_FLOQ, is equal for
both PBJ and PBLJ, by combining Eqs. (1) and (14) it can be stated that the overall cost
of the join in PBLJ is lower than in PBJ:

C^{PBLJ} ≤ C^{PBJ}    (15)
3.2 Discussion
According to Eq. (15), PBLJ should always outperform PBJ in every experiment when
|B_i| ≫ 1. Equations (8) and (11) suggest that PBLJ will perform increasingly better than
PBJ when scaling the number of bindings |B_i|. It is evident from Eq. (4) that, independent
of the chosen join strategy, when the size of the result stream |R_i| is large, the
tuple transfer cost C_{R_i} will be the dominating component in the cost model.
Therefore, the performance trade-offs between the join strategies are more
significant when the number of tuples returned from the log database is small.
To conclude, according to the cost model, the performance evaluation should be
investigated by (i) varying the number of tuples returned from the sites, (ii) scaling the
number of sites, and (iii) scaling the number of bindings from the meta-database.
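As a rough numeric illustration of the comparison, the sketch below evaluates the site costs of Eqs. (6) and (7) for invented per-operation costs; the numbers are entirely hypothetical and not measurements from the evaluation.

def pbj_site_cost(num_bindings, c_net, c_sel_b, c_jdbc, c_join_wrapper, c_result):
    """Eq. (6): per-binding network transfer plus per-binding selection/JDBC overhead."""
    return num_bindings * c_net + num_bindings * (c_sel_b + c_jdbc) + c_join_wrapper + c_result

def pblj_site_cost(c_bulkload, c_sel_logdb, c_join_logdb, c_result):
    """Eq. (7): one bulk load plus one combined selection/join inside the log database."""
    return c_bulkload + c_sel_logdb + c_join_logdb + c_result

print(pbj_site_cost(10_000, 0.1, 0.05, 0.2, 50, 500))    # grows linearly with |B_i|
print(pblj_site_cost(200, 100, 80, 500))                 # dominated by fixed costs and C_Ri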
4 Performance Evaluation
We compared the performance of the join strategies PBJ and PBLJ based on the queries
Q1, Q2, Q3, and Q4. In our real-world application each log database had more than
250 million measurements from sensor readings, occupying 10 GB of raw data.
The following scalability experiments were performed on six PCs (with 4 processors and
8 GB main memory) running Windows 7 while: (i) scaling the number of result tuples |Ri|;
(ii) scaling the number of sites, n; and (iii) scaling the number of bindings |Bi|.
Scaling the number of result tuples
Figure 9(a) shows the execution times of Q1 for the two join strategies over a single
log database, while scaling the number of result tuples |R| by adjusting th. As expected
from Eq. (12), PBLJ performs better than PBJ. Since there is only one site, PBJ is
equivalent to BJ.
Fig. 9. Q1 (a) with one log database and (b) with six log databases
Figure 9(b) compares the performance of Q1 for six log databases while scaling |R|.
As expected PBLJ scales better than PBJ. However, as more tuples are returned from
the log databases, the network overhead becomes the dominating factor, making
the performance difference between the join strategies insignificant. Notice that the number of
returned tuples is the same for both strategies; thus the network overhead is
equal. However, PBLJ still performs slightly better than PBJ,
since the remaining overhead is larger for PBJ.
Fig. 10. Execution time for Q3 and Q4 with six log databases
Figure 10 compares PBJ and PBLJ for Q3 and Q4 for six log databases. Q3 is an
example of a slow numerical query requiring a full scan of Measures, whereas Q4 is
faster since it exploits the index on Measures.mv, unlike Q3. It is evident from
Fig. 10 that PBLJ performs better than PBJ for both query Q3 and Q4. Figure 10(b)
shows the performance improvement due to index utilization compared to sequential
scan in Q3.
To conclude, PBLJ performs better than PBJ when the number of returned tuples is
increased, as also indicated by Eq. (15) of the cost model.
(a) 1k tuples from each database (b) 295k tuples from each database
Fig. 11. Execution time for Q1 varying number of log databases and selectivity
In all experiments, the PBLJ join strategy performs better than PBJ, in particular
while scaling the number of bindings |Bi|. This confirms Eq. (15) in the cost model. The
performance improvement is more significant when the number of tuples returned from
each log database is low.
5 Related Work
Bind-join was presented in [4] as a method to join data from external databases [7]. We
generalized bind-join to process in parallel parameterized queries to dynamic collec-
tions of autonomous log databases. Furthermore we showed that our bulk-load join
method scales better in our setting.
In Google Fusion Tables [3] left outer joins are used to combine relational views of
web pages, while [6] uses adaptive methods to join data from external data sources. In
[9] the selection of autonomous data sources to join is based on market mechanisms.
Our case is different because we investigate strategies to join meta-data with data from
dynamic collections of log databases without joining the data sources themselves.
Vertical partitioning and indexing of fact tables in monolithic data warehouses is
investigated in [1]. One can regard our VMeasures view as a horizontally partitioned
fact table. A major difference to data warehouse techniques is that we integrate
data from dynamic collections of autonomous log databases, rather than scalably
processing queries over data uploaded to a central data warehouse.
In [2] the problem of making views of many autonomous data warehouses is
investigated. The databases are joined using very large SQL queries joining many
external databases. Rather than integrating external databases by huge SQL queries, our
strategies are based on simple queries over a view (VMeasures) of dynamic collections
of external databases, i.e. the local-as-view approach [5].
A classical optimization strategy used in distributed databases [8] is to cost different
shipping alternatives of data between non-autonomous data servers before joining
them. By contrast, we investigate using standard DBMS APIs (JDBC and bulk load) to
make multi-database joins of meta-data with dynamic sets of autonomous log databases
using local-as-view.
6 Conclusions
Two join strategies were proposed for parallel execution of queries joining meta-data
with data from autonomous log databases using standard DBMS APIs: parallel bind-
join (PBJ) and parallel bulk-load join (PBLJ). For the performance evaluation we
defined typical fundamental queries and investigated the impact of our join strategies.
A cost model was used to guide and evaluate the efficiency of the strategies. The
experimental results validated the cost model. In general, PBLJ performs better than
PBJ when the number of bindings from the meta-database is increased.
In the experiments a rather small set of autonomous log databases was used. Further
investigations should evaluate the impact of a very large number of log databases
and different strategies to reduce communication overheads, e.g. by compression.
Acknowledgments. This work is supported by EU FP7 project Smart Vortex and the Swedish
Foundation for Strategic Research under contract RIT08-0041.
References
1. Datta, A., VanderMeer, D.E., Ramamritham, K.: Parallel star join + dataindexes: efficient
query processing in data warehouses and OLAP. J. IEEE TKDE 14(6), 1299–1316 (2002)
2. Dieu, N., Dragusanu, A., Fabret, F., Llirbat, F., Simon, E.: 1,000 tables inside the from.
J. ACM VLDB 2(2), 1450–1461 (2009)
3. Garcia-Molina, H., Halevy, A.Y., Jensen, C.S., Langen, A., Madhavan, J., Shapley, R.,
Shen, W.: Google fusion tables: data management, integration and collaboration in the
cloud. In: SoCC, pp. 175–180 (2010)
4. Haas, L., Kossmann, D., Wimmers, E., Yang, J.: Optimizing queries across diverse data
sources. In: VLDB, pp. 276–285 (1997)
5. Halevy, A., Rajaraman, A., Ordille, J.: Data integration: the teenage years. In: VLDB, pp. 9–16
(2006)
6. Ives, Z.G., Halevy, A., Weld, D.: Adapting to source properties in processing data integration
queries. In: SIGMOD, pp. 395–406 (2004)
7. Josifovski, V., Schwarz, P., Haas, L., Lin, E.: Garlic: a new flavor of federated query
processing for DB2. In: SIGMOD, pp. 524–532 (2002)
8. Kossmann, D.: The state of the art in distributed query processing. J. ACM Comput. Surv.
32(4), 422–469 (2000)
9. Pentaris, F., Ioannidis, Y.: Query optimization in distributed networks of autonomous
database systems. J. ACM Trans. Database Syst. 31(2), 537–583 (2006)
10. Risch, T., Josifovski, V.: Distributed data integration by object-oriented mediator servers.
J. Concurrency Comput. Pract. Experience 13(11), 933–953 (2001)
11. Smart Vortex Project. http://www.smartvortex.eu/
12. Truong, T., Risch, T.: Scalable numerical queries by algebraic inequality transformations.
In: Bhowmick, S.S., Dyreson, C.E., Jensen, C.S., Lee, M.L., Muliantara, A., Thalheim, B.
(eds.) DASFAA 2014, Part I. LNCS, vol. 8421, pp. 95–109. Springer, Heidelberg (2014)
13. Zhu, M., Risch, T.: Querying combined cloud-based and relational databases. In: CSC,
pp. 330–335 (2011)
14. Zhu, M., Stefanova, S., Truong, T., Risch, T.: Scalable numerical SPARQL queries over
relational databases. In: LWDM Workshop, pp. 257–262 (2014)
ECST – Extended Context-Free Straight-Line
Tree Grammars
Abstract. Grammar-based compressors like e.g. CluX [1], BPLEX [2], Tree-
RePAIR [3] transform an XML tree X into a context-free straight-line linear tree
(CSLT) grammar G and yield strong compression ratios compared to other
classes of XML-specific compressors. However, CSLT grammars have the
disadvantage that simulating on G update operations like inserting, deleting, or
re-labeling a node V of X requires isolating the path from X's root to V from all
the paths represented by G. Usually, this leads to increased redundancy
within G, as grammar rules are copied and modified, but the original and the
modified grammar rules often differ only slightly. In this paper, we propose
extended context-free straight-line tree (ECST) grammars that allow reducing
the redundancy created by path isolation. Furthermore, we show how to query
and how to update ECST compressed grammars.
1 Introduction
1.1 Motivation
XML is a widely used, but verbose data exchange and data transmission standard. In
order to reduce the volume and costs involved in storage and transmission of verbose
XML data, a variety of structure-based XML compression technologies to reduce the
size of the structural part of XML documents have been developed. They range from
the compression of an XML tree into a directed acyclic graph (DAG) (e.g. [4]) to the
compression into a context-free straight-line linear tree (CSLT) grammar (e.g. CluX [1],
BPLEX [2], TreeRePAIR [3]). Out of these, compression into a CSLT grammar yields
the most strongly compressed data format that is still queryable and updatable.
While DAG compression shares identical XML subtrees, i.e., repeated occurrences
of a subtree are replaced with a reference to the subtree, CSLT grammar-based com-
pression additionally shares similar subtrees having an identical connected fragment of
nodes. The identical connected fragment of nodes is represented by a CSLT grammar
rule, and different parts of similar subtrees are represented by parameters. However, a
CSLT grammar rule can only represent a connected fragment of nodes, such that
similar subtrees that differ in an inner node or an inner fragment cannot be shared by a
single CSLT grammar rule. Our extension ECST overcomes this limitation.
1.2 Contributions
In this paper, we introduce extended context-free straight-line tree (ECST) grammars as
a compressed XML data format that overcomes the redundant fragment problem. In
particular:
• We define how ECST grammars extend CSLT grammars.
• We show how path isolation, which is necessary prior to simulating update opera-
tions on grammars, works with ECST grammars.
• We demonstrate why using ECST grammars can avoid copying of large parts of
rules during the process of path isolation.
• We describe how to evaluate path queries on ECST grammar-compressed data.
• We show how standard update operations can be simulated on ECST grammars.
Altogether, on an ECST grammar G representing an XML tree X, we can not only
simulate all query and update operations on X; compression to ECST grammars
also allows keeping the grammars small by avoiding the redundant fragments generated
by updates of G.
Differences between the subtrees ST1, …, STk are represented by providing formal
parameters to the grammar rule for A and by providing different actual parameters for
different calls of A.
To simplify the presentation and the compression algorithm, we follow [1–3] and,
without loss of generality, work on the binary representation of XML trees.
For example, look at the binary XML tree X1 (c.f. Fig. 1) which is represented by
the grammar Grammar 1. X1 represents a database of customers together with their
orders. For each customer, contact data is stored in form of address, first name, and last
name, and for each order, the shipping data is stored in form of address, first name, and
last name. Each address might contain an optional sub-element ‘isPOB’ that defines
whether the address represents a post office box. Due to space limitations, we abbre-
viate the element labels as follows: customer ! cu; order ! od; address !
ad; firstname ! fn; lastname ! ln; and isPOB ! ip:
The repeated fragment consisting of the node with label ‘ad’, its next-sibling with
label ‘fn’, and this node’s next-sibling with label ‘ln’ is represented by the grammar
rule A1. The ‘ad’-labeled nodes are the root nodes of the two similar subtrees con-
taining the repeated fragment. The subtrees differ in the first-child of the ‘ad’-labeled
nodes which is represented by the formal parameter y1 of rule A1 and by the actual
parameter ‘ip’ for the first subtree in preorder and by the nullpointer as actual parameter
for the second subtree. (For readability, we represent the nullpointer by ‘-’).
Note that we can reproduce each XML tree X from G by decompression, i.e.,
recursively applying all rules of G. For this purpose, we would substitute each call of a
rule A(ap1, …, apn) with the right-hand side of the rule A, rhs(A), where we replace A’s
formal parameters y1, …, yn with the actual parameters ap1, …, apn.
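The substitution process can be sketched in a few lines of Python (our own illustration; the tuple-based term encoding and the reconstruction of rule A1 from the running example are assumptions, not the paper's data structures).

# Terms in the binary first-child/next-sibling encoding:
#   None                      the nullpointer '-'
#   ("t", label, fc, ns)      a terminal expression
#   ("n", name, [args])       a call of a grammar rule
#   ("y", i)                  the i-th formal parameter (0-based)
def expand(term, grammar, actuals=()):
    """Decompress a term by recursively replacing rule calls with their right-hand sides."""
    if term is None:
        return None
    if term[0] == "y":                         # formal parameter -> actual parameter
        return actuals[term[1]]
    if term[0] == "t":
        _, label, fc, ns = term
        return ("t", label, expand(fc, grammar, actuals), expand(ns, grammar, actuals))
    _, name, args = term                       # rule call A(ap1, ..., apn)
    aps = tuple(expand(a, grammar, actuals) for a in args)
    return expand(grammar[name], grammar, aps)

# Rule A1 of the running example: A1(y1) -> ad(y1, fn(-, ln(-, -)))
grammar = {"A1": ("t", "ad", ("y", 0), ("t", "fn", None, ("t", "ln", None, None)))}
expand(("n", "A1", [("t", "ip", None, None)]), grammar)   # the first 'ad' subtree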
When simulating update operations of a node V of X directly on a CSLT grammar
G, i.e., without decompressing G to X, the so called grammar path GP in G repre-
senting the path p from X’s root to V has to be isolated in G first, such that GP
represents only the path p to the node V to be updated.
A grammar path GP = [A0, p0, …, An, pn] is defined as follows (c.f. [5]). Each
grammar path GP contains an alternating sequence of non-terminals Ai, 0 ≤ i ≤ n, and
index positions pi which refer to a symbol within rhs(Ai), which is a non-terminal
calling the next grammar rule Ai+1 for 0 ≤ i < n, and which is a terminal symbol with
the same label as the node V for i = n. A0 is the start symbol S of grammar G. For
example, if we apply the XPath query Q = //od/child::ad to Grammar 1, the
selected node can be described by GP1: = [S,3,A1,1]. Thus, GP1 describes a rule
call to rhs(A1) at position 3 in rule S. Finally, terminal ‘ad’ at position 1 in rhs(A1) is
selected.
While update operations like, e.g., the re-labeling of a single node V with label v to
label w are quite simple in the XML document tree X (just change the label of node V
from v to w), simulating these operations on grammar-compressed XML documents is
more complex for the following reason. In grammar-compressed documents G, in
general, a single terminal symbol does not refer to a single node V of a tree X only, but
to a set of nodes of X. This means that prior to changing the terminal label, we first
have to isolate the grammar path GP addressing node V from G, such that the isolated
path represents only the node V to be relabeled. Considering a grammar path GP = [A0,
p0, …, An, pn] representing the node V of X, two steps are necessary in order to relabel
the terminal v to w. First, we have to copy rule An into A′n. But this is not sufficient. In
the second step, we have to copy each rule Ai, 1 ≤ i < n, to A′i, and we have to adjust the
calling non-terminal to refer to A′i+1 within each copied rule A′i.
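One possible way to implement this rule copying is sketched below (our own illustration, reusing the term encoding of the previous sketch; for brevity, the call on the grammar path is identified by its occurrence number rather than by the symbol position used above).

def replace_call(term, name, occurrence, new_name, counter=None):
    """Rename the given occurrence (0-based, preorder) of a call to `name`."""
    counter = [0] if counter is None else counter
    if term is None or term[0] == "y":
        return term
    if term[0] == "t":
        _, label, fc, ns = term
        return ("t", label,
                replace_call(fc, name, occurrence, new_name, counter),
                replace_call(ns, name, occurrence, new_name, counter))
    _, called, args = term                               # a rule call
    rename = called == name and counter[0] == occurrence
    if called == name:
        counter[0] += 1
    new_args = [replace_call(a, name, occurrence, new_name, counter) for a in args]
    return ("n", new_name if rename else called, new_args)

def isolate_path(grammar, path):
    """path = [(A0, occ0), ..., (A_{n-1}, occ_{n-1}), A_n]: copy A_1..A_n into primed
    rules and redirect the on-path calls, leaving all other paths of the grammar intact."""
    names = [p[0] if isinstance(p, tuple) else p for p in path]
    occurrences = [p[1] for p in path[:-1]]
    n = len(names) - 1
    for i in range(n, 0, -1):                            # create A_n', ..., A_1'
        rhs = grammar[names[i]]
        if i < n:                                        # redirect the call to the next copy
            rhs = replace_call(rhs, names[i + 1], occurrences[i], names[i + 1] + "'")
        grammar[names[i] + "'"] = rhs
    grammar[names[0]] = replace_call(grammar[names[0]], names[1], occurrences[0], names[1] + "'")
    return grammar

# Isolating the first A1 call of Grammar 1; the relabel is then applied to rule A1' only.
grammar = {
    "S": ("t", "cu", ("t", "od", ("n", "A1", [("t", "ip", None, None)]),
                      ("n", "A1", [None])), None),
    "A1": ("t", "ad", ("y", 0), ("t", "fn", None, ("t", "ln", None, None))),
}
isolate_path(grammar, [("S", 0), "A1"])

Applying the relabeling to the copied rule then reproduces exactly the redundancy discussed in the following example.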
Consider for example Grammar 1 given above and the update operation “relabel //
od/child::ad into pob”. The terminal ‘ad’ in rule A1 represents two ‘ad’-labeled nodes
of the XML tree X1 (shown in Fig. 1), out of which only the first node in preorder (i.e.
the node defined by grammar path GP1) is a child of an ‘od’-labeled node in X1. In
order to update only this ‘ad’-labeled node, we have to generate two copies of rule A1.
This leads to the following modified grammar, Grammar 2:
As after copying A1 into A1′ only the label 'ad' has been changed into 'pob', the
remaining fragments of the right-hand side rhs(A1) of the rule for A1 also occur
redundantly in rhs(A1′). Redundancy increases even further if the rules to be copied are larger.
This is an example for the redundant fragment problem of CSLT grammars
invoked by path isolation and subsequent updates: As a CSLT grammar rule can only
represent a connected fragment of nodes shared by subtrees, subtrees that differ in the
root node or in an inner node cannot be shared by a single CSLT grammar rule.
fragments. ECST’s extension to CSLT is that the difference between similar subtrees is
represented by a ranked parameter, i.e. a parameter that itself has parameters.
ECST grammars are defined as follows. Let T be an alphabet of terminal symbols,
N = {A0, …, An} be a ranked alphabet of non-terminal symbols, '-' be the nullpointer,
and Z = {z1, …, zm} be a ranked alphabet of parameter symbols. The function
rank: N ∪ Z → ℕ0 assigns to each non-terminal symbol and to each parameter symbol a
natural number representing its rank. Let the sets T, N, Z, and {-} be disjoint. Let
furthermore ID, LP ∈ N be special non-terminal symbols with the following predefined
grammar rules: ID(y1) → y1 and LP(y1, y2) → y2. Then, a term over T ∪ N ∪ Z ∪ {-} is
• the nullpointer '-',
• a terminal expression of the form t(fc, ns), where t ∈ T, and fc and ns are terms
representing the first-child of t and the next-sibling of t, respectively,
• a simple term Ai, rank(Ai) = 0, or a non-terminal expression of the form Ai(t1, …, tm)
where Ai ∈ N, m = rank(Ai) and ti is a term for i ∈ {1, …, m}, or
• a simple parameter zi of rank 0 or a parameter expression zi(t1, …, tm) where zi ∈ Z,
m = rank(zi) and ti is a term for i ∈ {1, …, m}.
Then, an ECST grammar is defined as a tuple (T, N, Z, P, S, ID, LP) where T, N, Z, ID,
and LP are defined as above, S ∈ N is the start symbol, and P is a set of grammar rules,
where the following constraints must be met:
• For each non-terminal Ai ∈ N with rank(Ai) = m there exists exactly one grammar
rule of the form Ai(zi1, …, zim) → rhs(Ai), where rhs(Ai) is a term that contains
each symbol zi1, …, zim exactly once in that order, with zij ∈ Z and all zij
distinct. Furthermore, for each non-terminal symbol Aj occurring in rhs(Ai), we
have j < i (i.e. calls of grammar rules are acyclic).
• For each rule Ai(zi1, …, zij, …, zin) → rhs(Ai), rank(Ai) = n, and each param-
eter expression zij(t1, …, tm), rank(zij) = m, occurring in rhs(Ai), the only values
that can be passed to zij in a call of Ai are a non-terminal expression Ak
with rank(Ak) = m or another parameter expression of rank m.
ECST grammars extend CSLT grammars by allowing parameters of arbitrary rank,
instead of parameters of rank 0 only, and the special non-terminal symbols ID and LP.
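The following Python sketch (our own, extending the term encoding of the earlier decompression sketch) illustrates how ranked parameter expressions and the predefined rules ID and LP are resolved during decompression; the rule names and labels in the example are hypothetical.

def expand_ecst(term, grammar, actuals=()):
    """Decompress an ECST term; a plain string denotes a non-terminal name that is
    passed as a higher-order actual parameter (e.g. ID, LP, or an inserted rule I)."""
    if term is None or isinstance(term, str):
        return term
    if term[0] == "y":                                    # rank-0 formal parameter
        return actuals[term[1]]
    if term[0] == "t":
        _, label, fc, ns = term
        return ("t", label, expand_ecst(fc, grammar, actuals),
                expand_ecst(ns, grammar, actuals))
    if term[0] == "z":                                    # ranked parameter expression z_j(args)
        return expand_ecst(("n", actuals[term[1]], term[2]), grammar, actuals)
    _, name, args = term                                  # non-terminal call
    aps = tuple(expand_ecst(a, grammar, actuals) for a in args)
    if name == "ID":                                      # predefined: ID(y1)     -> y1
        return aps[0]
    if name == "LP":                                      # predefined: LP(y1, y2) -> y2
        return aps[1]
    return expand_ecst(grammar[name], grammar, aps)

# A(z, y1) -> a(z(b(-, -)), y1): passing ID for z reproduces the original subtree,
# passing an inserted rule I(y1) -> c(-, y1) wraps b with a new previous sibling c.
grammar = {
    "A": ("t", "a", ("z", 0, [("t", "b", None, None)]), ("y", 1)),
    "I": ("t", "c", None, ("y", 0)),
}
expand_ecst(("n", "A", ["ID", None]), grammar)   # a(b(-, -), -)
expand_ecst(("n", "A", ["I", None]), grammar)    # a(c(-, b(-, -)), -)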
rule A′m(y1, …, yj) → rhs(A′m), where rhs(A′m) is rhs(Am) except that the call to
Am+1 on the grammar path, i.e. at position pm in rhs(A′m), is replaced with a call of
A′m+1. Finally, in rhs(S), we replace the call of A1 at position p0 in rhs(S) with a call of
A′1.
Grammar path isolation of GP1 leads to the following modified grammar, Grammar 3,
which is an intermediate step towards generating Grammar 2:
In case of the relabel operation, only the label 'ad' of the rule A1′, i.e. of the last rule
A′n generated for GP1, has to be changed to 'pob'. That is how we get Grammar 2.
Insert. Let V be a node in X, and let v be a terminal in a rule An(y1, …, yj) →
rhs(An) of G, such that the grammar path GP = [A0, p0, …, An, pn] to v represents V.
We want to simulate on G inserting a subtree as the previous sibling of the node V in
X. Let the subtree to be inserted be represented by a grammar rule I(y1) → rhs(I),
where y1 represents the next-sibling of the root node of I.
First, we change the call of An at position pn−1 within rhs(A′n−1) into a call of A′n.
Let further y1, …, yi be the parameters occurring in rhs(An) at positions smaller than pn.
Then, we create a new rule An*(y1, …, yi, z, yi+1, …, yj) → rhs(An*), where rhs(An*)
is rhs(An) except that the term t at position pn is replaced by the term z(t), where
z ∈ Z is a parameter with rank(z) = 1, and z's position among the parameters of An* is
such that the parameter order is preserved. To simulate the insert, the rule
A′n(y1, …, yj) → rhs(A′n) is replaced with A′n(y1, …, yj) → An*(y1, …, yi, I, yi+1, …, yj),
and the rule An(y1, …, yj) → rhs(An) is replaced
with An(y1, …, yj) → An*(y1, …, yi, ID, yi+1, …, yj).
Grammar 5 shows the result of inserting into Grammar 1 the subtree represented by a rule
I(y1) → od(ad(−, −), y1) as previous-sibling of the second 'ad'-labeled node of X1:
Remark. Note that whenever we want to simulate inserting a subtree or a node as the
last child of a node V, we first navigate to the nullpointer representing the non-
existence of a next-sibling of the last child of V and perform the above defined
operations on the position of that nullpointer. Similarly, if we want to insert a subtree or
a node as the only child of V, we first navigate to the nullpointer representing the non-
existence of a first-child of V. In both cases, the selected terminal v is a nullpointer.
Delete. Let V be a node in X that shall be deleted including all of its descendants (i.e.,
V and the subtree rooted in V's first-child are deleted, and the pointer to V is replaced
by a pointer to V's next-sibling). Furthermore, let v be a terminal in rhs(An) of a rule
An(y1, …, yj) → rhs(An) of G, such that the grammar path GP = [A0, p0, …, An, pn]
to v represents V. To simulate this delete operation on G, we change the call of An at
position pn−1 within rhs(A′n−1) into a call of A′n. Then, we create a new rule
An*(y1, …, yi, z, yi+1, …, yj) → rhs(An*), where rhs(An*) is rhs(An) except that the
terminal v at position pn is replaced by the ranked parameter z ∈ Z, rank(z) = rank(v) = 2,
and the position of z among the parameters of An* is such that the parameter
order is preserved. Furthermore, we replace the rule A′n(y1, …, yj) → rhs(A′n) with
the new rule A′n(y1, …, yj) → An*(y1, …, yi, LP, yi+1, …, yj) and the rule
An(y1, …, yj) → rhs(An) with the new rule An(y1, …, yj) → An*(y1, …, yi, v, yi+1, …, yj).
Grammar 6 shows the result of simulating on Grammar 1 the deletion of the second
‘fn’-labeled node in X1 and replacing it by its next-sibling with label ‘ln’:
path from X's root node to the modified node, as well as from other rules. This means that
we need the original version of S, i.e., S → cu(od(A1(ip(−, −)), A1(−)), −), as well as
the modified version S′ → cu(od(A1(ip(−, −)), A1′(−)), −). In order to avoid this
redundancy, we define a rule S*(z) → cu(od(A1(ip(−, −)), z(−)), −) and the two
versions S → S*(A1) and S′ → S*(A1′). So again, by using the ranked parameter,
we have avoided storing parts of the rules S and S′ redundantly.
Up to now, we have shown how to use ECST grammars to perform updates that lead to
fewer redundancies than these updates performed on CSLT grammars. Next, we dis-
cuss how query evaluation on an XML tree X and update processing of nodes of X
selected by queries can be simulated on an ECST grammar G representing X.
For example, consider Grammar 7 and the grammar path GP1 = [S, 1, A1, 1]
referring to a node V1 with label 'b'. When navigating to V1's first-child V2, we get
GP2 = [S, 1, A1, 2] referring to a parameter z1. As z1 is the first formal parameter of
rhs(A1), its actual value av = A2 is the first parameter of the non-terminal A1 at position
[S, 1]. Therefore, we substitute A1 in GP2 by a virtual non-terminal A1′, i.e., we set
GP2 = [S, 1, A1′, 2], and we define a rule A1′ → rhs(A1′), where rhs(A1′) is equal to
rhs(A1) except that z1 is replaced by A2, such that we get A1′ → b(A2(a(−, −), ID), −).
Now we determine the root terminal for rule A2, resulting finally in the grammar path
GP2 = [S, 1, A1′, 2, A2, 1] representing V2 and referring to a terminal symbol 'd'.
In order to simulate navigating to the next-sibling V3 of V2, we start searching at
position GP3 = [S, 1, A1′, 2, A2, 3]. There, we find the formal parameter z2, having
its actual value ID in rule A1 of Grammar 7. Therefore, we substitute A2 in GP3 by a
virtual non-terminal A2′, i.e. we set GP3 = [S, 1, A1′, 2, A2′, 3], and we define a rule
A2′ → rhs(A2′), where rhs(A2′) is equal to rhs(A2) except that we replace z2(A3) by
ID(A3), which is simplified to A3, leading to the grammar rule A2′(y1) → d(y1, A3).
As at the third position of A2′ we find the non-terminal A3, which has to be expanded
too, the resulting grammar path representing V3 is GP3 = [S, 1, A1′, 2, A2′, 3, A3, 1],
referring to a terminal symbol 'f'.
4 Related Work
There are several approaches to XML structure compression which can be mainly
divided into the categories: encoding-based, schema-based or grammar-based
compressors.
Encoding-based compressors (e.g. [8–10], XMill [11], XPRESS [12], and
XGrind [13] ) allow for a faster compression speed than the other compressors, as only
local data has to be considered in the compression in comparison to grammar-based
compressors which consider different subtrees.
Schema-based compressors (e.g. XCQ [14], Xenia [15], and XSDS [16] ) subtract
the given schema information from the structural information and only generate and
output information not already contained in the schema information.
XQzip [17] and the approaches [18] and [4] belong to grammar-based compression.
They compress the data structure of an XML document by combining identical subtrees.
Extensions of [4] and [17] are e.g. CluX [1], BPLEX [2], and TreeRePAIR [3], which not
only combine identical subtrees but also recognize similar patterns within the XML tree and
therefore allow a higher degree of compression. The approaches [6, 19, 20]
propose different ways to compute updates on grammar-compressed XML
data. A generalization of grammar-based compression to functional programs repre-
senting the compressed tree data was presented in [21].
In order to eliminate the redundancies caused by performing updates, a recom-
pression of the updated grammar as proposed in [22] might be performed.
In contrast, we propose using ECST grammars instead of CSLT grammars, so
that the overhead is avoided in the first place instead of being eliminated afterwards.
References
1. Böttcher, S., Hartel, R., Krislin, C.: CluX - Clustering XML sub-trees. In: ICEIS 2010,
Funchal, Madeira, Portugal (2010)
2. Busatto, G., Lohrey, M., Maneth, S.: Efficient memory representation of XML documents.
In: Bierman, G., Koch, C. (eds.) DBPL 2005. LNCS, vol. 3774, pp. 199–216. Springer,
Heidelberg (2005)
3. Lohrey, M., Maneth, S., Mennicke, R.: Tree structure compression with repair. In: DCC
2011, Snowbird, UT, USA (2011)
4. Buneman, P., Grohe, M., Koch, C.: Path queries on compressed XML. In: VLDB 2003,
Berlin, Germany (2003)
5. Bätz, A., Böttcher, S., Hartel, R.: Updates on grammar-compressed XML data. In:
Fernandes, A.A.A., Gray, A.J.G., Belhajjame, K. (eds.) BNCOD 2011. LNCS, vol. 7051,
pp. 154–166. Springer, Heidelberg (2011)
6. Böttcher, S., Hartel, R., Jacobs, T.: Fast multi-update operations on compressed XML data.
In: Gottlob, G., Grasso, G., Olteanu, D., Schallhart, C. (eds.) BNCOD 2013. LNCS, vol.
7968, pp. 149–164. Springer, Heidelberg (2013)
7. Böttcher, S., Steinmetz, R.: Evaluating XPath queries on XML data streams. In: Cooper, R.,
Kennedy, J. (eds.) BNCOD 2007. LNCS, vol. 4587, pp. 101–113. Springer, Heidelberg (2007)
8. Zhang, N., Kacholia, V., Özsu, M.: A succinct physical storage scheme for efficient
evaluation of path queries in XML. In: ICDE 2004, Boston, MA, USA (2004)
9. Cheney, J.: Compressing XML with multiplexed hierarchical PPM models. In: DCC 2001,
Snowbird, Utah, USA (2001)
10. Girardot, M., Sundaresan, N.: Millau: an encoding format for efficient representation and
exchange of XML over the Web. Comput. Netw. 33, 747–765 (2000)
11. Liefke, H., Suciu, D.: XMILL: an efficient compressor for XML data. In: SIGMOD 2000,
Dallas, Texas, USA (2000)
12. Min, J.-K., Park, M.-J., Chung, C.-W.: XPRESS: a queriable compression for XML data. In:
SIGMOD 2003, San Diego, California, USA (2003)
13. Tolani, P., Haritsa, J.: XGRIND: a query-friendly XML compressor. In: ICDE 2002, San
Jose, CA (2002)
14. Ng, W., Lam, W., Wood, P., Levene, M.: XCQ: a queriable XML compression system.
Knowl. Inf. Syst. 10, 421–452 (2006)
15. Werner, C., Buschmann, C., Brandt, Y., Fischer, S.: Compressing SOAP messages by using
pushdown automata. In: ICWS 2006, Chicago, Illinois, USA (2006)
16. Böttcher, S., Hartel, R., Messinger, C.: XML stream data reduction by shared KST
signatures. In: HICSS-42 2009, Waikoloa, Big Island, HI, USA (2009)
17. Cheng, J., Ng, W.: XQzip: querying compressed XML using structural indexing. In: Bertino, E.,
Christodoulakis, S., Plexousakis, D., Christophides, V., Koubarakis, M., Böhm, K. (eds.) EDBT
2004. LNCS, vol. 2992, pp. 219–236. Springer, Heidelberg (2004)
18. Adiego, J., Navarro, G., Fuente, P.: Lempel-Ziv compression of structured text. In: DCC
2004, Snowbird, UT, USA (2004)
19. Fisher, D., Maneth, S.: Structural selectivity estimation for XML documents. In: ICDE 2007,
Istanbul, Turkey (2007)
20. Fisher, D., Maneth, S.: Selectivity Estimation. Patent WO 2007/134407 A1, May 2007
21. Kobayashi, N., Matsuda, K., Shinohara, A., Yaguchi, K.: Functional programs as
compressed data. High.-Order Symbolic Comput. 25(1), 39–84 (2012)
22. Böttcher, S., Hartel, R., Jacobs, T., Maneth, S.: OnlineRePair: a recompressor for XML
structures. In: Poster Paper, DCC, Snow Bird, Utah, USA (2015)
Configuring Spatial Grids for Efficient Main
Memory Joins
1 Introduction
Spatial joins are an operation of increasing importance in many applications.
Whether for spatial datasets from astronomy, neuroscience, medicine or others,
the join has to be performed to find objects that intersect with each other or
are within a given distance of each other (distance join). An efficient execution
of this operation is therefore key to improve overall performance.
In this context main memory joins are becoming increasingly important
because many datasets fit into the main memory directly. Even if they do not,
and the join has to be performed on disk, a crucial part of a disk-based join is the in-memory join. While the strategies of disk-based approaches to partition the
data (replication or no replication, space-oriented partitioning or data-oriented
partitioning) so it fits into memory differ [1], every approach requires an efficient
algorithm to join two partitions in main memory.
The only approaches specifically designed to join spatial data in memory are the nested loop join and the plane sweep join. The nested loop join
technique works by comparing all spatial elements pairwise and is thus compu-
tationally very expensive. The plane sweep approach [2] sorts the datasets in one
dimension and scans both datasets synchronously with a sweep plane. It has a
lower time complexity but compares objects no matter how far apart they are
on the sweep plane.
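For illustration only, a minimal plane-sweep intersection join over axis-aligned MBRs might be sketched as follows in Python; the Box type, the intersects test and the single-axis sweep are our own simplifying assumptions, not the implementation evaluated in [3].

```python
from typing import List, NamedTuple, Tuple

class Box(NamedTuple):
    """Axis-aligned minimum bounding rectangle (2D for brevity)."""
    xlo: float
    xhi: float
    ylo: float
    yhi: float

def intersects(a: Box, b: Box) -> bool:
    return a.xlo <= b.xhi and b.xlo <= a.xhi and a.ylo <= b.yhi and b.ylo <= a.yhi

def plane_sweep_join(r: List[Box], s: List[Box]) -> List[Tuple[Box, Box]]:
    """Sort both inputs on x and sweep; only boxes overlapping on x are compared."""
    r = sorted(r, key=lambda b: b.xlo)
    s = sorted(s, key=lambda b: b.xlo)
    out, start = [], 0
    for a in r:
        # permanently skip S-boxes that end before this (and any later) R-box begins
        while start < len(s) and s[start].xhi < a.xlo:
            start += 1
        j = start
        while j < len(s) and s[j].xlo <= a.xhi:   # still within a's extent on the sweep axis
            if intersects(a, s[j]):
                out.append((a, s[j]))
            j += 1
    return out
```

Compared with the nested loop join, only pairs overlapping in the sort dimension are tested, but pairs far apart in the other dimension are still compared, as noted above.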
To speed up the join over these two slow approaches, tried-and-tested tree-based indexing techniques from disk have been optimized for main memory. Although these approaches indeed improve performance, recent research shows that a simple grid performs best for one-off spatial joins in memory [3]. The problem of configuring the grid optimally, however, is challenging
and remains unaddressed to date.
In this paper we therefore develop a cost model that can be used to configure
the grid optimally. With experiments we show that the cost model can accurately
predict the performance of the join.
The probing step of the algorithm retrieves the mapped MBRs of the two object sets from the grid. The algorithm iterates over all the grid cells and separately
retrieves all the MBRs in the cell. For each cell it then compares all MBRs
representing objects from the first dataset with all MBRs representing objects
from the second dataset.
Because MBRs can be mapped to several cells, intersections between the same pair of MBRs may be detected multiple times. This leads to (a) additional computational overhead (because of additional comparisons) and (b) duplicate results. The grid-based spatial join thus uses a global (across all grid cells) set-based data structure in a postprocessing step to deduplicate the results before reporting them.
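As a rough Python sketch of the build/probe structure and the set-based deduplication just described (our own simplified variant, with the grid built over the first dataset and probed with the second, matching the cost model later in the paper; MBRs are 2D tuples and cell_size is a hypothetical uniform cell width):

```python
from collections import defaultdict

def cells_for(mbr, cell_size):
    """Yield the (cx, cy) indices of every grid cell an MBR (x0, x1, y0, y1) overlaps."""
    x0, x1, y0, y1 = mbr
    for cx in range(int(x0 // cell_size), int(x1 // cell_size) + 1):
        for cy in range(int(y0 // cell_size), int(y1 // cell_size) + 1):
            yield (cx, cy)

def overlaps(a, b):
    ax0, ax1, ay0, ay1 = a
    bx0, bx1, by0, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

def grid_join(r_mbrs, s_mbrs, cell_size):
    # Build: map every MBR of the first dataset onto all cells it intersects.
    grid = defaultdict(list)
    for i, mbr in enumerate(r_mbrs):
        for cell in cells_for(mbr, cell_size):
            grid[cell].append(i)
    # Probe: for each MBR of the second dataset, test it against the MBRs
    # stored in its cells; the global set removes duplicate detections.
    results = set()
    for j, mbr in enumerate(s_mbrs):
        for cell in cells_for(mbr, cell_size):
            for i in grid[cell]:
                if overlaps(r_mbrs[i], mbr):
                    results.add((i, j))
    return results
```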
Uniform grids are very sensitive to data skew, and using them in a spatial join algorithm can lead to performance degradation: in dense regions the number of MBRs mapped to a grid cell increases, and consequently the number of comparisons required increases too. In the worst case, all MBRs are mapped to one single grid cell. In this scenario the performance of the spatial grid hash join becomes equivalent to a nested loop join, because all MBRs need to be compared pairwise and the total number of comparisons is O(n²). Even worse, all MBRs may be mapped to the same set of cells, so that the nested loop is executed repeatedly, comparing the same MBRs several times (resulting in duplicates that need to be eliminated).
The problem of data skew can be addressed by setting a finer grid resolution.
With a finer grid resolution, the objects (or their MBRs) in very dense regions of the datasets are distributed over numerous grid cells instead of just a few.
As we will discuss in the next section, the resolution, however, cannot be set
infinitely fine-grained, but reducing the cell size still helps to address the problem
of data skew.
Changing the grid resolution, i.e., making it coarser or finer grained, directly affects the performance of the algorithm.
To determine the optimal resolution we develop a cost model for predicting the
time for the join. Like our algorithm we also split the cost model into building
and probing costs.
Building Cost. The building phase loops over the MBR of each of the Nd
objects in the first dataset and for each MBR finds the intersecting grid cells
using the getCell (gC) function. For each cell a hashLookup (hL) is performed
to obtain the list of pointers that point to the MBR and in the end the pointer
of the current MBR is added to the list using insertPointer (iP). The resulting
cost is summarized in the following equation with Ci as the number of cells an
MBR intersects with:
BuildingCost = \sum_{i=1}^{N_d} \left\{ gC(MBR_i) + \sum_{j=1}^{C_i} \big[ hL(j) + iP(\&MBR_i) \big] \right\}   (1)

gC(MBR_i) = \sum_{j=1}^{C_i} vertexToGridCell(j)
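Using the per-operation times obtained from the microbenchmarks described below and the Average(C) approximation of Ci introduced afterwards, Eq. (1) can be transcribed into a simple estimator; the parameter names are ours, not the authors':

```python
def building_cost(n_d: int, avg_c: float,
                  t_vertex_to_cell: float, t_hash_lookup: float,
                  t_insert_pointer: float) -> float:
    """Predicted building time following Eq. (1), with Ci replaced by Average(C).

    getCell iterates over the intersected cells, so its cost per object is
    avg_c * t_vertex_to_cell; each intersected cell then incurs one hash
    lookup and one pointer insertion."""
    per_object = avg_c * t_vertex_to_cell + avg_c * (t_hash_lookup + t_insert_pointer)
    return n_d * per_object
```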
To determine the actual building cost we need to know the duration of each
individual operation and the number of iterations of each loop. vertexToGridCell
and insertPointer are both constant-time operations, and for the sake of simplicity we also assume hashLookup to be a constant-time operation (this essentially means we use a tuned hash table which is collision-free). The execution time of
all these operations heavily depends on the hardware platform they are executed
on. We use microbenchmarks to determine their execution time.
Nd, the number of objects in the first dataset, is given, and Average(C), the number of cells an average object's MBR maps to, is calculated as follows (instead of calculating Ci for each object we use an approximation, i.e., Average(C)).
Average(C), the number of cells an average MBRi maps to, depends on the average volume of the MBRs and on the grid resolution. The number of cells a particular MBRi is mapped to can be approximated by Volume(MBRi)/Volume(gridCell). This, however, is only an approximation, and it underestimates the number of grid cells, because the exact number of intersecting grid cells depends on the exact location of MBRi relative to the grid cells. If MBRi is exactly aligned with the grid cells, then the combined volume of the intersected grid cells is equal to the volume of MBRi. If, however, MBRi is not aligned, then the combined volume of the intersected grid cells exceeds the volume of MBRi by at most the volume of a single grid cell.
To resolve this issue we expand the volume of M BRi by half the volume of
a single grid cell, to get a better approximation for the average case.
Total(C) = \sum_{i=1}^{N_d} C_i , \qquad Average(C) = Total(C)/N_d
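Spelling out the half-cell expansion in our own notation, each summand above is approximated as

C_i \approx \frac{Volume(MBR_i) + \tfrac{1}{2}\,Volume(gridCell)}{Volume(gridCell)}

so that Total(C) sums these expanded-volume estimates rather than the exact cell counts.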
Probing Cost. Similar to the building step, the probing step loops over each
object in the second dataset. For each object the algorithm finds the list of cells
intersecting the MBR of the object. However, instead of mapping the MBR on
the grid, the probing step retrieves the mapped MBRs from the first dataset for
testing the overlap.
ProbingCost = \sum_{i=1}^{N_a} \left\{ gC(MBR_i) + \sum_{j=1}^{C_i} \Big[ hL(j) + \sum_{k=1}^{S_j} \big( oT(i,k) + dD() \big) \Big] \right\}   (2)

where Sj denotes the number of MBRs of the first dataset stored in grid cell j.
The operations of the probing step are overlapTest (oT), which compares
two MBRs for overlap, and deduplication (dD), which uses a hash based set to
remove duplicate results. We consider both of these to be constant-time operations, because we assume a near collision-free hash set for our estimates. The number of iterations of the outer loop, Na, is the size of the outer dataset.
Similar to the building cost model, we use Average(C) to approximate the
number of grid cells that intersect with the MBRs of the outer data set.
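Analogously to the building cost, Eq. (2) can be transcribed with an assumed average number of stored MBRs per visited cell (avg_s is our own name for the average of Sj):

```python
def probing_cost(n_a: int, avg_c: float, avg_s: float,
                 t_vertex_to_cell: float, t_hash_lookup: float,
                 t_overlap_test: float, t_dedup: float) -> float:
    """Predicted probing time following Eq. (2): each of the n_a outer objects
    is mapped to avg_c cells, and in each cell the avg_s stored MBRs are
    tested for overlap and passed through the deduplicating set."""
    per_cell = t_hash_lookup + avg_s * (t_overlap_test + t_dedup)
    return n_a * (avg_c * t_vertex_to_cell + avg_c * per_cell)
```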
4 Conclusions
Whether spatial joins are executed on disk or in memory, the main memory join is a crucial operation. Recent research demonstrated that grid-based approaches outperform tree-based ones in main memory [3], but the question of how to set the optimal resolution remains unaddressed. In this paper we described our implementation of a grid-based spatial join and a cost model that accurately predicts its performance, allowing the grid resolution to be configured optimally.
References
1. Jacox, E.H., Samet, H.: Spatial join techniques. ACM TODS 32(1), 1–44 (2007)
2. Preparata, F., Shamos, M.: Computational Geometry: An Introduction. Springer,
New York (1993)
3. Šidlauskas, D., Jensen, C.S.: Spatial joins in main memory: implementation mat-
ters! In: VLDB 2015 (2015)
4. Orenstein, J.: A comparison of spatial query processing techniques for native and
parameter spaces. In: SIGMOD 1990 (1990)
5. Tauheed, F., Biveinis, L., Heinis, T., Schürmann, F., Markram, H., Ailamaki, A.:
Accelerating range queries for brain simulations. In: ICDE 2012 (2012)
Transactional and Incremental Type Inference
from Data Updates
1 Introduction
Many approaches to reasoning over knowledge bases take a query-rewriting app-
roach (e.g. Ontop [1], Stardog [14], DLDB [13]), where a query over the knowl-
edge base is rewritten to a (often complex) query over the base facts in the
knowledge base. When the number of queries made on the knowledge base
greatly exceeds the number of updates, it might be more efficient to adopt a
materialised approach (e.g. OWLim [6], WebPIE [17], RDFox [12], Oracle’s
RDF store [18], Minerva [19]), where the extent of the knowledge base is cal-
culated after updates to the knowledge base, and hence queries are answered
directly from the inferred facts.
Even if data is stored in a relational database, such as in Minerva, the reason-
ing in materialised approaches is normally conducted outside of the core RDBMS
engine, and hence fails to provide transactional reasoning [7]. In transactional
reasoning, the result of reasoning from data is available at the commit of any
transaction that inserts or deletes data, and hence reasoning obeys the normal
ACID properties of transactions.
This paper considers reasoning over knowledge bases expressed in OWL 2
RL [10]. It restricts itself to consider the issue of efficiently handling ontology
queries where there are updates occurring to the A-Box (in database terms the
data) and not to the T-Box, and the number of queries greatly exceeds the num-
ber of updates. Hence, the reasoning performed is type inference (i.e. deriving
for each instance its membership of classes and properties), and it adopts a mate-
rialised approach. This paper sets out to provide transactional type inference for
ontologies held in an RDBMS, providing type inference over OWL 2 RL ontolo-
gies that can fully integrate with the data of existing RDBMS applications, and
maintain the full ACID properties of transactions.
To illustrate the issues addressed, consider a T-Box with three rules:

Man ⊑ Person  (1)    Parent ⊑ Human  (2)    Person ≡ Human  (3)

which define that (1) every man is a person, (2) every parent is a human, and (3) a person is equivalent to a human. Now suppose that the A-Box for the ontology is held as four tables Man, Parent, Person and Human in a database, and four transactions, T1 inserting Man(John), T2 inserting Parent(John), T3 deleting Man(John), and finally T4 deleting Parent(John), are executed. The expected changes of the database are illustrated in Fig. 1.
The first transaction T1 should change the state of the database from S0 to S1. After executing T1, John should be viewed not only from Man but also from Person and Human, because John should be inferred as both a Person and a Human as a result of rules (1) and (3). Transactional type inference requires that any other transaction Tc concurrent with T1 should view the database either as S0 or S1, but not any intermediate state. For example, the query Person(John) ∧ ¬Human(John) should always evaluate to false in Tc. With regard to deletes, T3, which only deletes Man(John), should not delete the inferred facts Person(John) or Human(John), since Parent(John) and rules (2) and (3) can still infer them (i.e. the database is changed to S3). However, the same inferred facts must be deleted when Parent(John) is deleted in T4.
Furthermore, we need to reject user attempts to delete implicit facts. For example, in database states S1, S2 or S3, allowing a user to delete the implicit fact Person(John) would make the database inconsistent, since it is still inferable from the remaining facts; such a delete must therefore be rejected. The key features of our approach are as follows:
• We assign each fact a state when materialising data. Then, deletions over the
database invoke triggers to update the state of related facts, which reduces
the number of real deletes and reinserts.
• The triggers in the RDBMS will be invoked whenever a user updates the data-
base; consequently, our approach preserves ACID properties over the results
of reasoning.
• Since our approach materialises the results of reasoning, it processes queries
more efficiently than non-materialising approaches.
• Our approach can be incorporated into almost all standard RDBMS applica-
tions, in order to enhance their database schemas with type inference reason-
ing.
2 Our Approach
This section describes what we call an auto type inference database
(ATIDB). Our approach separates the T-Box and A-Box reasoning as shown
in Fig. 2. Whilst this separation does entail that the reasoning of the system is not complete, such a separation is not uncommon in other large-scale reasoners, and, as documented in Sect. 4, the completeness achieved matches that of other approaches.
A fact might not hold and hence is not in the database (C(x)ø ), a fact might
be explicitly stored because an explicit A-Box rule asserts it (C(x)e ), or a fact
might be implicitly inferred from other facts (C(x)i ). Our method of deleting data
introduces a fourth state (C(x)d ), where the data has lost one of the supporting
arguments for being in the database, and a process of checking if the data is still
inferable from other data is being conducted.
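For concreteness, the four states can be written down as a small Python enum (our own sketch; the actual SQOWL2 implementation encodes states inside the RDBMS, not in Python):

```python
from enum import Enum

class FactState(Enum):
    ABSENT = "ø"     # the fact does not hold and is not in the database
    EXPLICIT = "e"   # asserted by an explicit A-Box insert
    IMPLICIT = "i"   # inferred from other facts
    DELETED = "d"    # lost a supporting argument; awaiting the inferability re-check
```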
The state can be changed by insert and delete operations. We identify two
classes of insert. An ontology insert means that some user or application is
inserting a new explicit fact into the database, and is detected by the trigger
when C(x)e . By contrast, a reasoner insert means that a reasoner has derived
some implicit fact from the existing facts in the database, and is detected by the
trigger when C(x)i . Similarly, an ontology delete is some user or application deleting an explicit fact from the database, and is detected by the trigger when ¬C(x)e , and a reasoner delete is when some supporting evidence for a fact has been deleted, detected in the triggers when ¬C(x)i . Figure 3 gives an overview of the possible state transitions which can occur (they are also sketched in code after the two lists below). For inserts:
• For a data item which is not present in the table, C(x)ø , an ontology insert of
C(x)e updates C(x)ø to C(x)e , and a reasoner insert of C(x)i changes C(x)ø to
C(x)i .
• For a data item implicitly stored, C(x)i , further reasoner inserts of C(x)i , do
not change the state, so that repeated inference of other facts based on C(x)i
is avoided. However, inserting C(x)e gives explicit semantics, and thus updates
C(x)i to C(x)e .
• For a data item explicitly stored, C(x)e , inserting C(x)i does not change the
state, in order to avoid duplicated inference. However, to implement normal
database semantics, a rollback occurs if further attempts are made to insert
C(x)e .
For deletes:
• For a data item C(x)ø , ontology or reasoner deletes are ignored, to match the
normal database semantics that deletes of data not present cause no errors.
• For a data item C(x)i , attempting an ontology delete ¬C(x)e causes an inconsistency, since asserting that C(x) does not hold conflicts with what can be inferred from other facts, and the transaction should be rolled back. The reasoner
delete ¬C(x)i , by contrast, changes C(x)i to C(x)d , in order to label the data
for rechecking.
• For a data item C(x)e , attempting the ontology delete ¬C(x)e changes it to
C(x)d , because the data might still be inferable even after removing the explicit
semantics. However, a reasoner delete ¬C(x)i does not change the state, since
only ontology deletes can remove the explicit semantics.
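The transitions listed above can be condensed into two functions over the FactState enum sketched earlier (again an illustration in Python, not the SQL trigger code; Rollback is our stand-in for aborting the enclosing transaction):

```python
class Rollback(Exception):
    """Signals that the enclosing transaction must be rolled back."""

def on_insert(state: "FactState", explicit: bool) -> "FactState":
    """explicit=True models an ontology insert, explicit=False a reasoner insert."""
    if state is FactState.ABSENT:
        return FactState.EXPLICIT if explicit else FactState.IMPLICIT
    if state is FactState.IMPLICIT:
        return FactState.EXPLICIT if explicit else FactState.IMPLICIT  # repeated i-inserts are ignored
    if state is FactState.EXPLICIT:
        if explicit:
            raise Rollback("duplicate explicit insert")  # normal database semantics
        return FactState.EXPLICIT                        # reasoner insert ignored
    return state                                         # state d is resolved by the re-check below

def on_delete(state: "FactState", explicit: bool) -> "FactState":
    """explicit=True models an ontology delete, explicit=False a reasoner delete."""
    if state is FactState.ABSENT:
        return state                                     # deleting absent data is a no-op
    if state is FactState.IMPLICIT:
        if explicit:
            raise Rollback("fact is still inferable from other facts")
        return FactState.DELETED                         # mark for the label & check process
    if state is FactState.EXPLICIT:
        return FactState.DELETED if explicit else state  # reasoner deletes leave e unchanged
    return state
```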
Note that, when the state of C(x) is updated to d, a recursive labelling process
is conducted to implicitly delete other data which depends on C(x)d . When the
whole labelling process is finished, all data items labelled with d are checked as
to whether they are inferable from data in state e or i. If they are still inferable,
we change C(x)d to C(x)i ; otherwise, C(x)d is updated to C(x)ø (i.e. deleted from
the database).
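The label-and-check process just described can be sketched as follows; db, dependents and is_inferable are our own hypothetical data structures (a fact-to-state map, a map from a fact to the facts derived from it, and a re-derivation check over facts in state e or i):

```python
def propagate_delete(fact, db, dependents, is_inferable):
    """Recursively label dependants of a fact that has entered state d,
    then re-check all labelled facts for inferability."""
    # Phase 1: 'fact' has just been moved to state d; label the implicit
    # facts whose inference depended on it (explicit facts keep their state).
    db[fact] = FactState.DELETED
    labelled, stack = [fact], [fact]
    while stack:
        f = stack.pop()
        for g in dependents.get(f, ()):
            if db.get(g) is FactState.IMPLICIT:
                db[g] = FactState.DELETED
                labelled.append(g)
                stack.append(g)
    # Phase 2: once labelling has finished, every fact in state d is either
    # still inferable (back to i) or removed from the database (ø).
    for f in labelled:
        db[f] = FactState.IMPLICIT if is_inferable(f, db) else FactState.ABSENT
```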
Now, we demonstrate for the example in the introduction how our approach can
achieve type inference in an incremental manner.
Figure 4 first shows the trigger events as the database passes from S0 to
S1 through two intermediate database states S0a and S0b whilst executing
T1 . Firstly, the attempt to insert Man(John)e is checked by a ‘before trigger’
(indicated by the − prefix) when − Man(x)e if ¬Man(x)e then Man(x)e . Since
Man(John)ø is true, the insert is permitted, and the database enters S0a . Rule (1)
is translated into an ‘after trigger’ (denoted by the + prefix) when + Man(x)e∨i
then Person(x)i . Thus, after Man(John)e is inserted, this trigger is invoked to
infer a reasoner insert of Person(John)i , updating S0a to S0b .
An equivalence relationship between two classes can be treated as two subsumption relations; therefore, rule (3) is translated into two triggers: when + Person(x)e∨i then Human(x)i and when + Human(x)e∨i then Person(x)i . After
the database is updated to the intermediate state S0b , the insert of Person(John)i
causes the attempt to insert Human(John)i , which changes Human(John)ø to
Human(John)i (i.e. S0b is updated to S1 ). The insert of Human(John)i gener-
ates the attempt to insert Person(John)i again because of the after insert trigger
on Human. However, the attempt to insert Person(John)i is ignored, because
Person(John)i is true in S1 (i.e. the database stays at S1 ).
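Putting the rule-to-trigger translation together with the state machine sketched earlier, the cascade fired by T1 can be simulated in a few lines (a Python illustration of the behaviour only; SQOWL2 itself realises this with SQL before/after triggers):

```python
# After-triggers derived from the example T-Box: rules (1), (2), and both directions of rule (3).
AFTER_TRIGGERS = {"Man": ["Person"], "Parent": ["Human"],
                  "Person": ["Human"], "Human": ["Person"]}

def ontology_insert(db, cls, individual):
    """Explicit insert by a user, followed by the cascade of reasoner inserts."""
    _insert(db, cls, individual, explicit=True)

def _insert(db, cls, individual, explicit):
    key = (cls, individual)
    old = db.get(key, FactState.ABSENT)
    new = on_insert(old, explicit)            # state machine from the earlier sketch
    if new is old and not explicit:
        return                                # repeated reasoner insert: the cascade stops here
    db[key] = new
    for derived in AFTER_TRIGGERS.get(cls, ()):
        _insert(db, derived, individual, explicit=False)   # reasoner insert fired by the after-trigger

db = {}
ontology_insert(db, "Man", "John")            # T1: yields Man(John)^e, Person(John)^i, Human(John)^i
```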
Figure 4 then illustrates the process of executing T2 , which inserts Parent(John). T2 first attempts to insert Parent(John)e , which updates the state of Parent(John) from ø to e (i.e. S1 is updated to S2 ). Afterwards, the after insert trigger on Parent generates a new reasoner insert of Human(John)i , which is then ignored, because Human(John)i already holds (i.e. the database stays at S2 ).
Fig. 5. T3 : delete Man(John)
3 Implementation as SQOWL2
In this section, we describe the implementation of our approach, called SQOWL2. It uses the OWL API [5] to load an ontology, and applies Pellet [15] for the classification of the T-Box.
4 Evaluation
This section1 compares SQOWL2 to Stardog and OWLim. Stardog is a non-
materialising reasoner, while OWLim is a materialising reasoner. They both store
their data outside of an RDBMS, and do not provide transactional reasoning.
For the comparison of speed of incremental type inference and query process-
ing, we used the well known Lehigh University Ontology Benchmark
(LUBM) [3], which covers a university domain. It provides a T-Box of 43 OWL
1 All experiments were processed on a machine with an Intel i7-2600 CPU @ 3.40 GHz, 8 cores, and 16 GB of memory, running Microsoft SQL Server 2014. SQOWL2 uses OWL API v3.4.3 for ontology loading and Pellet v2.3.1 for classification. For comparisons, we used OWLim-Lite v5.4.6486 and Stardog-Community v2.2.1.
classes, 32 OWL properties and approximately 200 axioms. LUBM also offers 14
benchmark queries which we numbered as Q1–Q14, and an A-Box generator to
produce data sets of different size. In this section, we use L-n to denote a set of
A-Boxes which contains n universities (each university contains approximately
100,000 class & property instances).
To evaluate the completeness of SQOWL2, we compared the results of
answering LUBM queries by SQOWL2 to those of Pellet (a complete reasoner),
but also used SQOWL2 to process more generic and exhaustive test suites gen-
erated by SyGENiA [16]. SyGENiA generates a test suite for a given query and
a T-Box. Each test suite contains all possible inference logic that infers answers
to this query w.r.t. the T-Box. If a reasoning system successfully passes the test suite, it is complete for answering this query w.r.t. the T-Box and any arbitrary A-Box of data.
Data Loading: The speed of data loading for each system was stable over the different data sets (e.g. the data loading speed of SQOWL2 was around 6,000 inserts/s). Stardog was the fastest, as it does not perform reasoning during data loading.
SQOWL2 was the slowest, because it performs type inference as part of database
transactions with full ACID properties, and materialises the result of reasoning
during inserts. Indeed, due to the overheads associated with providing ACID properties for database updates, even without reasoning (i.e. without the SQOWL2 triggers), the speed of data loading for LUBM in the SQL Server database used for testing was around 14,600 inserts/s. Thus, the process of reasoning caused a significant, but not impractical, overhead to normal RDBMS database operations.
Data Deleting: After data loading, we used each system to execute a number
of random deletes translated from A-Box data. Table 3 only shows the aver-
age speed of handling deletes by SQOWL2 and Stardog, as the Lite version of
OWLim performs the whole reasoning again after deleting any facts. Again, as
Stardog does not consider reasoning when inserting or deleting facts, its speed
of data deleting was stable over different datasets, and was much faster than
SQOWL2. As expected, the label & check process meant that the speed of han-
dling deletes by SQOWL2 decreased when processing deletes from larger data
sets, except from L-25 to L-50. The reason for this improvement was the RDBMS
switching to a more efficient query plan when moving from L-25 to L-50. Due to the cost of the reasoning process, SQOWL2 caused a significant overhead when compared with the speed of processing deletes without reasoning (but with indices created), which is about 20,000 deletes/s.
Query Processing: SQOWL2 and OWLim were much faster than Stardog at query processing, because the two systems store explicit and implicit data at the data load-
ing stage. For example, the average speed for executing queries over L-100 by
SQOWL2 was about 100 times as fast as Stardog. When only comparing the
two materialisation-based systems, SQOWL2’s average speed was significantly
faster than OWLim for L-25, L-50 and L-100, and was comparable to OWLim
for L-200. The average speed of query processing by SQOWL2 dropped sharply
from L-100 to L-200 due mostly to Q2 (which needed 0.48 s over L-100 but 21 s
over L-200). The query plans used by the RDBMS show that it chose Nested
Loops for joining tables when processing Q2 over L-200, which is less efficient
than the Hash Match used over L-100. We intentionally did not tune the database
to solve this problem, but note that as in any RDBMS application, larger data
sets may require certain queries to be manually tuned by a database adminis-
trator.
A more detailed breakdown of the query processing time needed for each query by the three systems over L-100 is shown in Table 5. SQOWL2 outperformed OWLim when answering 12 of the 14 queries (it was slightly slower when processing Q11 and Q13). SQOWL2 was much faster than Stardog on all LUBM queries except Q9, where it was slightly slower. Stardog was significantly slower than both SQOWL2 and OWLim when answering Q6 and Q10, since these two queries are very complex to rewrite and to compute answers for (the test suites generated by SyGENiA for Q6 and Q10 contain 169 and 168 inference cases, respectively).
5 Related Work
Approaches to reasoning over large-scale data can be classified as dynamic or materialised. Systems using the former approach (e.g. DLDB [13], Stardog [14] and Ontop [1]) store only explicit facts and conduct reasoning only when a query is executed over the ontology (i.e. query rewriting methods), where no incremental type inference is required. DLDB can be considered a transactional reasoning system, as it uses temporal views inside an RDBMS to rewrite executed queries. Reasoning systems based on materialised approaches
store both explicit and implicit data, in order to provide a fast query processing
service [9]. Most systems tend to perform reasoning outside an RDBMS (i.e. not
proper transactional reasoning), even though they still choose an RDBMS as a
possible data container. WebPIE [17], for example, applies the MapReduce model and builds its reasoning mechanism for RDFS and OWL
ter Horst semantics on top of a Hadoop cluster. WebPIE only supports incre-
mental data loading but not deleting. OWLim [6] is another triple-store system,
which uses a file system instead of an RDBMS as a container for storing seman-
tic data. Its standard and enterprise versions support incremental data loading
and deleting, but not in a transactional manner. RDFox [12] adopts the so-called Backward/Forward algorithm [11] (which can be more efficient than DRed in some cases) to achieve incremental reasoning without using an RDBMS, and it is not a transactional reasoning system. Minerva [19] only uses an RDBMS to hold the materialised data, with the reasoning itself conducted outside the RDBMS engine.
6 Conclusion
We have demonstrated an approach using SQL triggers which extends an RDBMS with type inference capabilities. Ours is the first approach to provide transactional and incremental type inference over both inserts and deletes of A-Box data, and holding the data in an RDBMS allows ontology reasoning to be integrated into mainstream data processing. The evaluation shows that SQOWL2, compared with two fast reasoners, is faster at query processing, and that the completeness of its query answering is comparable to or better than that of other rule-based engines. Of course, the approach is unsuited to applications where the inferred data is very large and queries are relatively infrequent compared to updates. As our approach is built as a separate layer over the RDBMS, it does not yet address the key issue of optimising the handling of updates to be as fast as engines specially designed as triple stores (e.g. OWLim and RDFox); this is the subject of our future work.
References
1. Bagosi, T., Calvanese, D., Hardi, J., Komla-Ebri, S., Lanti, D., Rezk, M.,
Rodríguez-Muro, M., Slusnys, M., Xiao, G.: The Ontop framework for ontology
based data access. In: Zhao, D., Du, J., Wang, H., Wang, P., Ji, D., Pan, J.Z.
(eds.) CSWS 2014. CCIS, vol. 480, pp. 67–77. Springer, Heidelberg (2014)
2. Grau, B.C., Motik, B., Stoilos, G., Horrocks, I.: Completeness guarantees for
incomplete ontology reasoners: theory and practice. J. JAIR 43(1), 419–476 (2012)
3. Guo, Y., Pan, Z., Heflin, J.: LUBM: a benchmark for OWL knowledge base systems.
J. Web Semant. 3(2–3), 158–182 (2005)
4. Gupta, A., Mumick, I.S., Subrahmanian, V.S.: Maintaining views incrementally.
In: Proceedings of SIGMOD, pp. 157–166 (1993)
5. Horridge, M., Bechhofer, S.: The OWL API: a Java API for OWL ontologies. Semant.
Web 2(1), 11–21 (2011)
6. Kiryakov, A., Ognyanov, D., Manov, D.: OWLIM-a pragmatic semantic repository
for OWL. In: Proceedings of WISE, pp. 182–192 (2005)
7. Liu, Y., McBrien, P.: SQOWL2: transactional type inference for OWL 2 DL in an
RDBMS. In: Description Logics, pp. 779–790 (2013)
8. McBrien, P.J., Rizopoulos, N., Smith, A.C.: SQOWL: type inference in an RDBMS.
In: Parsons, J., Saeki, M., Shoval, P., Woo, C., Wand, Y. (eds.)
ER 2010. LNCS, vol. 6412, pp. 362–376. Springer, Heidelberg (2010)
9. McBrien, P., Rizopoulos, N., Smith, A.C.: Type inference methods and perfor-
mance for data in an RDBMS. In: Proceedings of SWIM, p. 6 (2012)
10. Motik, B., Grau, B.C., Horrocks, I., Wu, Z., Fokoue, A., Lutz, C.: OWL 2 web
ontology language profiles. W3C Recommendation 27, 61 (2007)
Transactional and Incremental Type Inference from Data Updates 219
11. Motik, B., Nenov, Y., Piro, R., Horrocks, I.: Incremental Update of Datalog Mate-
rialisation: The Backward/Forward Algorithm. AAAI Press, California (2015)
12. Motik, B., Nenov, Y., Piro, R., Horrocks, I., Olteanu, D.: Parallel materialisation
of datalog programs in centralised, main-memory RDF systems. In: Proceedings
of the AAAI, pp. 129–137 (2014)
13. Pan, Z., Zhang, X., Heflin, J.: DLDB2: a scalable multi-perspective semantic web
repository. In: Proceedings of WI-IAT 2008, pp. 489–495 (2008)
14. Pérez-Urbina, H., Rodríguez-Díaz, E., Grove, M., Konstantinidis, G., Sirin, E.:
Evaluation of query rewriting approaches for OWL 2. In: Proceedings of SSWS+
HPCSW, vol. 943 (2012)
15. Sirin, E., Parsia, B., Grau, B.C., Kalyanpur, A., Katz, Y.: Pellet: a practical OWL-DL reasoner. J. Web Semant. 5(2), 51–53 (2007)
16. Stoilos, G., Grau, B.C., Horrocks, I.: How incomplete is your semantic web rea-
soner? In: AAAI (2010)
17. Urbani, J., Kotoulas, S., Maassen, J., Van Harmelen, F., Bal, H.: WebPIE: a web-scale parallel inference engine using MapReduce. J. Web Semant. 10, 59–75 (2012)
18. Wu, Z., Eadon, G., Das, S., Chong, E.I., Kolovski, V., Annamalai, M., Srinivasan,
J.: Implementing an inference engine for RDFS/OWL constructs and user-defined
rules in oracle. In: Proceedings of ICDE, pp. 1239–1248 (2008)
19. Zhou, J., Ma, L., Liu, Q., Zhang, L., Yu, Y., Pan, Y.: Minerva: a scalable OWL
ontology storage and inference system. In: Mizoguchi, R., Shi, Z.-Z., Giunchiglia, F.
(eds.) ASWC 2006. LNCS, vol. 4185, pp. 429–443. Springer, Heidelberg (2006)
Author Index