Real-Time Semantic Web Data Stream Processing Using Storm
Vol 02, No. ICCIT- 1441, Page No.: 79 - 83, 9th & 10th Sep. 2020
978-1-7281-2680-7/20/$31.00 ©2020 IEEE
Authorized licensed use limited to: R V College of Engineering. Downloaded on January 23,2024 at 18:42:02 UTC from IEEE Xplore. Restrictions apply.
Banane, Web Data Stream Processing ...
flows. But for our use case, the problem related to multiple treatments, requiring several compressions and decompressions, persists.

Several approaches exist for processing big data in distributed systems. Recently, many applications have emerged that use data streams from distributed and heterogeneous sources. The realization of such a system remains a scientific challenge that must take into account the volume of data, their velocity, and their variety. Some prototypes have been proposed to define a system architecture that ensures the management of massive data flows in real time. On the other hand, the Semantic Web offers, through a common format (RDF), a way to combine several heterogeneous systems and thus compensate for the variety of data. In what follows, we describe some existing systems adapted to the processing of raw data flows on a distributed platform.

Apache Hadoop [5] is one of these distributed systems widely used to analyze big data. It manages a distributed file system, HDFS (Hadoop Distributed File System), which supports storage on a very large number of machines. The advantage of HDFS is to limit transfer time by assigning to each entity of the cluster the task of processing the data it contains. Hadoop is based on the MapReduce parallel computation model [6], where computation time is normally divided by the number of entities performing the task. This parallel processing is based on batch mode, where each computation lasts a certain time. It is very efficient for analyzing large volumes of data. However, it was not designed to meet the needs of analyses with strong time constraints, for example real-time detection of anomalies or bank fraud. To get around the batch nature of this mode, other solutions have appeared in the big data ecosystem, the most popular of which are Apache Storm and Spark Streaming.

Apache Storm [7] is a real-time oriented solution based on complex event processing (CEP) and the concept of topology. Concretely, it is a fault-tolerant distributed computing system that guarantees at-least-once data processing. Storm revolves around four concepts. Tuple: a message in the "Storm" sense, namely a list of dynamically typed named values. Stream: a collection of tuples with the same schema. Spout: a component that consumes data from a source and emits one or more streams to the bolts. Bolt: a tuple-processing node that can generate streams to be transmitted to other bolts; it can also write output data to external storage platforms. Storm also supports an additional level of abstraction through the Trident API [8]. This API provides functions over a data set such as join, aggregation, and grouping, and allows ordered processing by mini-batches of N tuples. On the other hand, Storm does not provide any big data storage medium as Hadoop does [13].

Spark Streaming [9] is another real-time processing system based on the MapReduce programming paradigm. It is an extension of Apache Spark, an analysis engine that accelerates the processing of data on a Hadoop platform. Spark is 10 to 100 times faster than Hadoop thanks to a reduced number of reads and writes on disk. For this, it uses an abstraction called RDD (Resilient Distributed Dataset) which, transparently, mounts in memory data distributed on HDFS and persists it to disk if necessary. RDDs have the advantage of providing fault tolerance without resorting to often costly replication mechanisms. They make it possible to explicitly persist intermediate data in memory, to control its partitioning in order to optimize placement, and to manipulate it with a set of operations (map, filter, join). Spark Streaming extends Spark with micro-batch operation: it accumulates data over a certain period to produce a micro-RDD on which it performs the desired computation. For this reason, unlike Storm [15], which processes items one by one, Spark Streaming adds a delay between the arrival of a message and its processing. Its API is identical to the classic Spark API, so data streams can be processed in the same way as static data.

Systems have also been proposed to define hybrid architectures that manage both batch and real-time processing, such as the Lambda and Storm-on-YARN architectures. The idea of the Lambda architecture [8] is to simultaneously use batch processing on all data to provide complete views, and real-time processing of data flows to provide dynamic views. The outputs of the two treatments can be combined at the presentation level. This architecture attempts to balance throughput, latency, and fault tolerance. It is made up of three layers. The batch layer manages the storage of the data set, as well as the computation of complete views over all or part of the data; these views are updated infrequently since the computation time can be long (a few hours). The real-time layer processes recent data (not yet taken into account by the batch layer) in order to compensate for the batch layer's high latency; it continuously computes real-time views incrementally, based on a stream processing system (e.g. Storm) and random read/write databases, with processing latency on the order of a few milliseconds. The service layer manages the merging of results from the batch and real-time layers; the fusion logic is the responsibility of the developer, who has to define how the data will be exploited. The advantage of the Lambda architecture is its ability to process and maintain data flows while large historical data sets are also processed by a batch pipeline. However, the duality of the batch and real-time layers requires producing the same result through two different paths. This means maintaining code in two complex distributed systems, designed differently, while ensuring that each event is processed exactly once. Storm-on-YARN [10] is another solution, developed by Yahoo!, to co-locate real-time processing with batch processing. The idea is to make it possible to run Hadoop and CEP technologies in the same cluster instead of two separate clusters. The load used by Storm often varies depending on the speed and volume of the data to be processed. Storm-on-YARN makes it possible to manage peak loads and to dynamically allocate resources normally used by Hadoop to Storm when necessary. Besides, Yahoo! added mechanisms that allow Storm applications to access data stored in HDFS and HBase [18].

The originality of our work is the management of RDF data in real time via the use of a real-time big data processing tool called Storm.

III. Semantic Web, Data Flow, and Storm

A. Semantic Web

The Semantic Web aims to organize and structure the enormous amount of information presented on the Web. It relies on semi-structured languages based on XML. Figure 1 shows one of the versions of the layered organization proposed by the W3C. Each layer is built on the layers below; thus, all of the layers use XML syntax. This allows one to take advantage of all the technologies developed around XML, such as XML Schema.
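The micro-batch behaviour attributed to Spark Streaming above (accumulate items, then compute on the whole batch, at the cost of extra latency) can be sketched in a few lines of plain Java. This is an illustration of the mechanism, not Spark code; the class name and batch-by-count trigger are our simplifying assumptions (Spark triggers by time interval).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sketch: items arrive one by one (Storm-style input),
// but processing is deferred until a full "micro-batch" has
// accumulated, which is the source of the added latency discussed above.
public class MicroBatcher<T> {
    private final int batchSize; // stands in for Spark's batch interval
    private final List<T> buffer = new ArrayList<>();
    private final Consumer<List<T>> computation;

    public MicroBatcher(int batchSize, Consumer<List<T>> computation) {
        this.batchSize = batchSize;
        this.computation = computation;
    }

    public void onItem(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            computation.accept(new ArrayList<>(buffer)); // run on the micro-batch
            buffer.clear();
        }
    }
}
```

A Storm-style system would instead invoke the computation once per item, trading throughput for per-item latency.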
B. Data Flow

We can define a data flow as a continuous, ordered sequence of items (ordered implicitly by time of arrival in the data flow management system, or explicitly by production timestamp at the source), arriving in real time. The adoption of Semantic Web technologies in the world of dynamic data and sensors gave rise to the concept of the RDF data flow. Thus, RDF flows were introduced as a natural extension of the RDF model to the flow environment.

IV. Approach Description

In the architecture of our system, event data is processed and managed by distributed systems like Redis [12] and Storm [13]; Redis is used as an in-memory processing component.

Fig. 5. Translation of data formats and models in RDF (from the heterogeneity of different formats and data models to a homogeneous representation carrying more knowledge).
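An element of an RDF flow is commonly modelled as an RDF triple paired with a timestamp. As a minimal sketch of that idea (the class, the example URIs, and the choice of N-Triples serialization are ours, not taken from any particular system):

```java
// Sketch: an RDF stream element = (subject, predicate, object) + timestamp.
// The payload is serialized in N-Triples syntax; the timestamp is kept as
// stream metadata, as in the RDF-stream extensions discussed above.
public class RdfStreamItem {
    final String subject, predicate, object;
    final long timestampMillis;

    public RdfStreamItem(String s, String p, String o, long t) {
        this.subject = s;
        this.predicate = p;
        this.object = o;
        this.timestampMillis = t;
    }

    // One N-Triples line for the payload (object rendered as a literal).
    public String toNTriples() {
        return "<" + subject + "> <" + predicate + "> \"" + object + "\" .";
    }
}
```

A continuous query engine would then evaluate windows over such timestamped items rather than over a static graph.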
Several aspects have to be considered, such as the dynamic distribution of data and tasks, the scheduling and parallelization of processing, and the optimization of network traffic and workload.

A. Continuous SPARQL

A continuous query engine should be able to reason not only on data flows but also on static data, and even on the set of Linked Open Data (LOD) cloud datasets. Queries must adapt to the incoming speed of the data flows and be evaluated continuously in order to take account of the evolving nature of the flow. The semantics of SPARQL queries must allow processing based on time or on the order of arrival of the data. Standard SPARQL is therefore extended by introducing the concept of an adaptable sliding window (a defined portion of a flow).

Some prototypes have been proposed recently in the literature, drawing inspiration from the work done by the conventional database community. For example, C-SPARQL [14] is one of the first extensions of SPARQL intended to support continuous queries. Other projects extending SPARQL have been launched. SPARQL-Stream [15] extends SPARQL so that it can manage window operators without worrying about query performance. CQELS [16], the most recent language, operates natively on RDF flows and continuous queries without going through intermediary tools. These projects take into account the temporal aspect of flows and implement windowing operators. However, none of these examples is suited to the large volume of distributed data flows. Queries on this data must be able to run in a dynamic environment with strong time constraints. The distribution of these queries, as well as of the data, plays an important role in ensuring a certain level of scalability and latency. This distribution should take into account the optimization of network traffic and workload. Also, the distribution of data over several RDF storage platforms requires the establishment of a SPARQL federation [17]. This raises the question of the best strategy for optimally executing the federation of continuous SPARQL queries. To our knowledge, there are two works [11, 12] that propose to execute continuous SPARQL queries in a distributed way. However, their performance has not been evaluated in a context of complex reasoning, and there is no consideration of the federation aspect.

B. Data storage

Two approaches to data storage are used in our system. The first approach stores intermediate processing data in memory to obtain fast and inexpensive input-output access. The second approach persists static data and relevant summaries of the data flow to disk. There are several in-memory NoSQL storage solutions, such as Memcached [9] and Redis [10]. The data is stored in RAM in a key-value format and can be represented in several structures such as strings, lists, hashes, and sets. A comparative study [11] shows almost similar performance between Memcached and Redis in terms of execution time. As part of our system, we decided to use Redis because it supports more functionality for manipulating data. Unlike Memcached, Redis can periodically persist data to disk, which helps prevent data loss in the event of a failure. It also supports a Lua-based scripting language for writing stored procedures, whose atomicity is guaranteed by its single-threaded architecture. Besides, Redis' Sorted Set structure provides a practical implementation of sliding windows: it makes it possible to manage sampling automatically by operating aggregations over a time interval, but eviction must be programmed manually.

V. Validation

This section assesses the quality and relevance of our extension. To do this, we looked at the performance obtained in terms of execution time and the preservation of the semantics of the data. We consider the processing of a set of tweets.

Twitter allows free retrieval of streaming data; we take advantage of this with a streaming tool like Storm, which is well suited to processing this data in real time. In this paper, we read and analyze Twitter messages in real time with our Storm-based system. We create an application which retrieves tweets from the Twitter API, using Java under Eclipse.

After adding the necessary twitter4j libraries, we first create a Java class CreateSpout.java. The processing of the tweets could be done with a single bolt, but we created two bolts, BoltExtractor and RetweetBoltExtractor, to demonstrate the join capability of our system. To join the tuples of these two bolts, we need another bolt, BoltRDFWriter, which stores this data in RDF format. Now let us create a topology that will perform the processing in real time.
45156 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2000] WARN org.apache.zookeeper.server.NIOServerCnxnFactory - Ignoring exception
java.nio.channels.ClosedChannelException: null
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:137) ~[na:1.6.0_29]
    at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:188) ~[zookeeper-3.4.5.jar:3.4.5-1392090]
    at java.lang.Thread.run(Thread.java:662) [na:1.6.0_29]
45156 [main] INFO backtype.storm.testing - Done shutting down in process zookeeper
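The time-based sliding window that section IV.B delegates to Redis' Sorted Set (event timestamp as the score, with eviction programmed manually) can be sketched in plain Java with an ordered map. This is an illustration of the mechanism, not Redis client code, and it assumes distinct timestamps for simplicity:

```java
import java.util.Collection;
import java.util.TreeMap;

// Sketch of a time-based sliding window mimicking a Redis Sorted Set:
// add() plays the role of ZADD with the timestamp as score, and the
// headMap(...).clear() call plays the role of the manual eviction
// (ZREMRANGEBYSCORE) mentioned in section IV.B.
public class SlidingWindow {
    private final long windowMillis;
    private final TreeMap<Long, String> entries = new TreeMap<>();

    public SlidingWindow(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    public void add(long timestamp, String value) {
        entries.put(timestamp, value);
        // Evict everything strictly older than (latest - window).
        entries.headMap(entries.lastKey() - windowMillis, false).clear();
    }

    public Collection<String> current() {
        return entries.values();
    }
}
```

Aggregations over the window (counts, averages) would then be computed on `current()` at each evaluation step.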
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
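The actual topology wiring with `TopologyBuilder` and `LocalCluster` requires the Storm jars and a running (local) cluster, as the log above shows. As a dependency-free sketch of the dataflow just described (a spout emitting tweets, two extractor bolts, and a join bolt writing RDF), the per-bolt logic can be pictured like this; the class is ours and only mirrors the bolt names from the text (BoltExtractor, RetweetBoltExtractor, BoltRDFWriter), and the vocabulary URIs are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Dependency-free sketch of the topology described above: a "spout"
// emits raw tweets, two "bolts" extract the text and the retweet flag,
// and a final bolt joins both streams by tweet id and writes
// N-Triples-style RDF lines. Illustrative only; not Storm API code.
public class MiniTopology {
    static class Tweet {
        final long id;
        final String text;
        final boolean retweet;
        Tweet(long id, String text, boolean retweet) {
            this.id = id;
            this.text = text;
            this.retweet = retweet;
        }
    }

    // BoltExtractor: emits (id, text)
    static String extractText(Tweet t) { return t.id + "\t" + t.text; }

    // RetweetBoltExtractor: emits (id, retweet flag)
    static String extractRetweet(Tweet t) { return t.id + "\t" + t.retweet; }

    // BoltRDFWriter: joins the two streams on the tweet id and emits RDF.
    static List<String> writeRdf(List<Tweet> tweets) {
        List<String> out = new ArrayList<>();
        for (Tweet t : tweets) {
            String s = "<http://example.org/tweet/" + t.id + ">";
            out.add(s + " <http://example.org/text> \"" + t.text + "\" .");
            out.add(s + " <http://example.org/isRetweet> \"" + t.retweet + "\" .");
        }
        return out;
    }
}
```

In the real system, each of these methods corresponds to the `execute()` of a Storm bolt, and the join bolt subscribes to the output streams of the two extractors.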