A Near Real-Time Big Data Processing Architecture
Xuming Xiu
I hereby declare that this project is entirely my own work and that it has not been submitted
as an exercise for a degree at this or any other university.
I have read and I understand the plagiarism provisions in the General Regulations of the
University Calendar for the current year, found at http://www.tcd.ie/calendar.
I have also completed the Online Tutorial on avoiding plagiarism 'Ready Steady Write', located
at http://tcd-ie.libguides.com/plagiarism/ready-steady-write.
Signed: Date:
Abstract
Latency has been an issue that many trading firms are making efforts to solve. When trading data
arrive in the form of streams, a system that can help to make decisions in a relatively
short time, before the next data point arrives, would be of great benefit. Concerning Big Data, its
characteristics challenge the industry in terms of Velocity, Volume, and Variety. This paper proposes
an architecture for a system that can quickly retrieve a large volume of historical data
and perform near real-time aggregation as new data arrive. This project targets the
velocity and volume aspects of big data.
The paper explores the possible technical solutions to reduce the latency for trading firms.
The performance of such a system is directly related to profit. The motivation of this project
is to present a solution that helps with decision making while keeping the latency as low as possible.
During the investigation of the technologies and the problem, Apache Spark and Kafka proved
to have good performance in terms of real-time processing. There is also a methodology of
using Geometric Brownian Motion to generate synthetic data to enrich the sample space.
This project also uses a time-series database for real-time monitoring of the volatility. Finally,
the system can aggregate a single column of 20 million rows within about 10 seconds on average
in the Hadoop cluster.
Acknowledgements
I would first like to acknowledge my supervisor, Dr. Khurshid Ahmad, who has been supporting
me during the entire project. He led me to discover my interest in big data and supported me
when I met difficulties. Sometimes, he explained the concepts to me more than once.
I also would like to thank my parents who supported me to study abroad. Without them, I
wouldn't have such a fantastic opportunity to discover my interests. Finally, I also want to
thank my girlfriend who has been taking good care of me during the busy time.
Contents
1 Introduction 1
1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.1.1 Velocity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.2 Volume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.3 Variety . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.2 MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.4 Spark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.2 Kafka . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3.1 Receiving the Messages . . . . . . . . . . . . . . . . . . . . . . . . 22
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.5 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.6.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.7.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5 Conclusion 42
6 Appendix 43
6.1 A1: Calculate return of stock price . . . . . . . . . . . . . . . . . . . . . . 43
Bibliography 47
List of Figures
3.2 Kafka . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
List of Tables
Listings
1 Introduction
The Internet generates billions of data records every minute around the world. These data usually
have the properties of high velocity, large volume, and high variety. They often carry
valuable information waiting to be discovered. Time series is one
important category of Big Data. Apart from the general characteristics of big data, time-series
Big Data is a sequence of discrete-time data points taken successively at equally spaced
points in time. It is observable that potential correlations can be discovered as the data
points are collected at adjacent periods. When there is a large volume of such data in
high-frequency streams, analysis can be challenging. Such time-series data can be collected
from many different places. Some typical ones are trading data, service logs, and traffic
data. Many applications have the capability to support time series analysis.
The velocity and volume of such data are among the major challenges of Big Data analysis. Streamed data can be even more challenging as it
requires a system to be able to continuously handle the incoming data and perform real-time
analysis. Trading data is typical streaming data. Besides, the system has to be able to
retrieve a large volume of the historical dataset in a short time. Latency is the critical
problem in streaming analysis. When the data arrive at a particular frequency, the result
of the analysis has to be finished before the next data point comes in. Especially in terms of
high-frequency trading (HFT), the reaction time is critical. Many trading firms place their
servers next to the trading house to reduce the latency. The journal [9] mentioned a new
measure to capture the reaction time, which is called Decision Latency. The quickest HFT
firm in their models responds first to profitable trading openings, taking all of the profits,
while the slower participants arrive slightly too late. As a consequence, minor variations in
trade pace are correlated with significant changes in company-wide trade revenues, with
trading clustered among the fastest HFT firms. In another journal [15], the authors
presented a conclusion that relative latency is important for success in trading on short-lived
information. The study also suggested that lower latency can help with improving traditional
market quality measures. Latency thus becomes a critical issue that can significantly affect
market behaviours.
In summary, latency is the problem that this project aims to solve when trading data arrive as streams. In
the previous section, we have introduced the latency in high-frequency trading and why it is
important to keep it as low as possible. From a technical point of view, Decision
Latency can be optimized at the software level. The most direct cause of the Decision
Latency is data analysis. Typically, analyzing real-time data requires
historical knowledge. Thus, the first objective is to collect the historical dataset for a given
stock. Since the historical dataset will be used to perform real-time aggregation,
the second objective is to quickly retrieve a large volume of the historical dataset
into the system. When making a decision, specifically for trading data, volatility is an
important metric to reference. It indicates the potential profit or loss when making a
transaction for the next trade. Therefore, the third objective of this project is computing
statistical moments of the trading records with latency as low as possible. Most
importantly, the computation of these statistical moments has to be finished before the
arrival of the next data point. To simulate the trading process, there has to be a service to
collect trading data in real time. This is an important step in measuring the Decision Latency.
1.3 Contribution
This project makes contributions in reviewing the related journals on low-latency trading and
trying to reduce Decision Latency as much as possible. While researching other similar
works, an architecture was designed and implemented to address the problem. The system has been proven to work as expected. However, it still has
room to improve in terms of the automation process and resource allocations. These are
pointed out in the discussion section. This is a working example for trading firms to have a
near real-time processing system to help make decisions based on a large volume of
historical data.
1.4 Content Structure
In the second chapter, the Big Data related technologies and literature are reviewed.
Different technologies in terms of data storage (HDFS, NoSQL, and RDBMS) and big data
processing tools (Spark and Hadoop MapReduce) are compared. These technologies and
tools are fundamental to achieving near real-time processing. Benchmarking
results are also included, illustrating the most suitable technologies for this project. There are 2
similar projects introduced in this chapter as well. The real-time traffic data analysis project
has demonstrated the benefits of using HDFS as a storage technology. Another project,
high-speed log stream analytics, provided guidelines for simulating real-time trading
records using Kafka and SparkStreaming. This project is inspired by the above 2 similar
works.
The third chapter describes the design and implementation details of the architecture. There
are 2 major components in this project, which are the Ingestion Platform and the Spark App.
The Ingestion Platform simulates a real-time data stream where the velocity of the data flow can
be controlled. The Spark App uses Apache Spark [31] to perform real-time analysis. There is also
an extra step to generate synthetic data because the size of the collected data is not large enough.
The synthetic data are generated using Geometric Brownian Motion, which is described in that chapter.
The fourth chapter is an actual case study where the stock price of J.P. Morgan
Chase & Co. is used to test the system. It indicates that the system running in a cluster that
contains 3 nodes can aggregate a single column within 10 seconds. The evaluation part
covers the system setup and speed. It discusses the advantages and disadvantages of such a
design. It also evaluates the system performance in terms of latency. There is also a
discussion regarding the challenges that this project resolved, the reasons for choosing the
technologies used in this project, and possible future work.
Finally, the conclusion chapter concludes that the Decision Latency could be reduced down
to 10 seconds for a single column using such a system and configuration. It also concludes with the
drawbacks of the system and suggests some possible working solutions for the future.
2 Literature Review and Similar Work
This chapter includes some similar work and a review of the related literature about Big
Data technologies. There is a great amount of previous work and papers to reference about
big data. Some techniques and methodologies provide very insightful guidelines for this
project. The most critical part of this project is the issue brought by the properties of big data.
There is some previous work that provides similar use cases for addressing these
issues. Big Data can be defined as data of large volume and
complex structure with the difficulties of storing, analyzing and visualizing it for further
processes or results [30]. This definition provides perspectives on potential problems that
need to be addressed when working with big data. The following three subsections introduce the main characteristics.
2.1.1 Velocity
The first problem is the issue of velocity. The velocity is directly related to
latency when data come in as a stream. The limitation of velocity is not only in terms of the
data flow but also applies to all processing units. A typical question is how fast we can
process the incoming data [30]. Latency is a critical challenge that needs to be solved.
There are some components that may slow down the computation from different aspects.
The latency at the hardware level is mostly related to the network. As the nodes
within the cluster are connected via a local network, a decrease in the time taken for a packet
to traverse switches would result in faster communication [32]. From the operating system point
of view, a well configured scheduler and resource allocator will significantly increase the
speed [32]. Hadoop Yarn is a good choice as a scheduler, whether in a cluster or on a
single machine.
2.1.2 Volume
Other than velocity, volume is also a critical feature of big data [30]. Volume generally
means the size of the big data. With that being introduced, the challenge that comes with
volume is storage. Since the data tend to be very large, traditional file systems or
databases hosted on a single node are not able to handle such an amount of data. Enterprise
level big data could be in terabytes or even petabytes. These data are even more complex
than just big in file size. They could cover multi-dimensional data, which makes them
even more challenging to store and process.
2.1.3 Variety
Variety refers to the data coming from multiple sources and the data itself consisting of multiple
types, both structured and unstructured [30]. To analyze big data that has such
characteristics, the Dimensional Data Analysis (DDA) technique is recommended [14].
Once the metrics are collected, they are used to compare with ideal and vestigial values to
determine an approximate structural model. The performance benchmarks indicated that using
such an algorithm, analysts can finish ingesting 2 million data points within 400 seconds
[14].
With more study, there are more challenges discovered in Big Data. There are
extra dimensions added to the Big Data concept [5]. Variability refers to inconsistency of the
data flow. It can be hard to manage, particularly for unstructured data. Value indicates that data
should not be meaningless; valuable information can be derived from big data.
Veracity refers to the quality of data. With the size of data growing large, the data
quality can be challenging. Validity can help with making correct and accurate decisions.
Visualization is the task of presenting the data in the most intuitive way. Although new
studies keep adding more characteristics to the Big Data concept, this project will focus only on
the characteristics introduced above.
2.2 Big Data Technologies
2.2.1 Databases vs Distributed File Systems
The database has always been playing an important role in data storage technologies. A
relational database can store data in table format. The great advantage of relational
databases is that they support schemas, which makes data modelling easier. One of the aims of
an RDBMS is to enforce data integrity. To achieve this, a proper design of tables is always
required by the RDBMS. This may be achievable when the data is relatively small. However,
when dealing with big data, this can be significantly expensive as the normalization process
requires table joins and searching for keys throughout the entire dataset. NoSQL is expected
to address this problem. Besides, the variety also challenges the RDBMS since the database schema of the dataset
has to be fixed in advance. NoSQL databases are commonly categorized into key-value storage, document
storage, wide-column storage, and graph databases. MongoDB, as one of the most popular
document-oriented databases, has grown very fast with its scalability and reliability [23]. A
comparison between RDBMS and NoSQL databases indicates that the difference in the average
time of CRUD operations between NoSQL and RDBMS increases as the size of data increases [12]. Therefore,
a valid conclusion can be drawn that a NoSQL database is more suitable than an RDBMS for
large-scale data.
With the development of hardware, this challenge can also be addressed by using a
cluster. Instead of using a single machine to store everything, a group of computers that are
physically distributed and connected by a local area network (LAN) is used to share data
and storage resources. This is also known as a Distributed File System (DFS). Larger
main memories at a less expensive price enable the exponential increase in caching
performance. Optical disks and optical fibre networks make it faster to access the resources
and to communicate among the nodes. Battery-backed memory can enhance the reliability
of main memory caches as well [20]. As hardware development reaches a bottleneck,
other new technologies have been invented to deal with the volume issue over the past years. Google
developed the Google File System (GFS), which is believed to outperform the Hadoop file
system, to address their challenge [10]. GFS integrates MapReduce and offers the capability
of a byte-stream-based file view of big data that is partitioned over several hundreds of nodes
within a cluster.
2.2.2 MapReduce
MapReduce is introduced as one of the approaches to process big data efficiently. Notably,
MapReduce as a programming model works independently of storage layers. MapReduce is a
programming model for processing a large volume of data. There are 2 phases in this
programming model, namely Map and Reduce. Both phases allow programmers to customize
the function to achieve the goal. The Map phase takes inputs and transforms them into the
key-value pair format. Then the key-value pairs are received by the reducer, which performs the
aggregations. Before the key-value pairs are received by the reducer, there is also a
step in between whose task is to shuffle the key-value pairs to consolidate records from the
mapper. MapReduce is expected to have a good performance in a parallel environment
as mappers and reducers do their tasks separately. A survey [19] investigated the MapReduce
programming model from various dimensions including usability, flexibility, fault tolerance, and
efficiency. The survey took the implementation of MapReduce, Hadoop, to examine the
performance in a parallel processing environment. It concludes that as Hadoop uses
checkpoints frequently, the efficiency can be reduced by I/O operations. However,
using checkpoints significantly increases fault tolerance and scalability.
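To make the two phases concrete, a minimal word-count job written against the Hadoop
MapReduce Java API could look like the sketch below. This is an illustrative example rather
than code from this project; the class names and the input/output paths are hypothetical.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit a (word, 1) key-value pair for every token in a line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: after the shuffle groups the pairs by key, sum the counts per word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}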
When using the MapReduce programming model to process big data, the clear drawback is that
there is no indexing: processing starts immediately when the data is loaded. At this stage,
there is no data modeling involved and hence the data is not indexed by the MapReduce job.
On the other hand, this makes it faster in processing a large volume of data that is read only
once. Hadoop is an ecosystem that has implemented MapReduce and a distributed file system
successfully. The ecosystem even supports other big
data technologies like Cassandra and Hive [6]. The core components of Hadoop include
Hadoop MapReduce, the Hadoop Distributed File System (HDFS), and Hadoop Yarn for
resource management and job scheduling. Hadoop makes MapReduce run more
efficiently on distributed nodes with the help of Yarn. Yarn can dynamically allocate
resources; sitting between the application and the DFS, it can improve resource utilization and
application performance. Yarn's default scheduler processes jobs in a First In First Out (FIFO)
policy. The global ResourceManager is responsible for taking jobs submitted by users,
scheduling these jobs, and allocating resources to them. In each node, there is a NodeManager
installed, monitoring and reporting the currently available resources to the global
ResourceManager so that it can assign resources to each application. The NodeManager also
takes control of Resource Containers.
While Hadoop MapReduce focuses on the processing of big data, HDFS takes care of the
data storage. A diagram of the HDFS architecture is shown in Figure 2.1. HDFS
consists of a NameNode and DataNodes, where the NameNode manages the name, location, and
permission of each block of a dataset and the DataNodes store replications of data blocks and
process I/O operations [36]. HDFS provides reliable fault tolerance through 2
policies: replication and checkpoint recovery. Since HDFS works on multiple nodes, the data
could be replicated on any DataNode within the cluster. With the help of the
NameNode, retrieving the data can be done efficiently. Using checkpoint recovery helps to
improve the fault tolerance by rolling back to the last saved synchronization point and restarting
the failed work.
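As an illustration of how a client reads a file through the NameNode and DataNodes, a short
sketch using the standard HDFS Java API is shown below; the NameNode address and file path
are assumptions made for illustration only.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client contacts the NameNode for the block locations of the file,
        // then streams the block contents directly from the DataNodes.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/data/prices.csv"))))) {
            System.out.println(reader.readLine()); // print the first record as a smoke test
        }
        fs.close();
    }
}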
In summary, optimized hardware and the choice of frameworks can significantly reduce some
of the challenges of Big Data. The Hadoop ecosystem provides the ability to integrate
distributed computation resources and storage resources to allow users to process and
store a large volume of data. The combination of the MapReduce programming model and
a Distributed File System (DFS) makes significant improvements in terms of
performance. An analytical study [33] compared different major DFSs and concludes that
MapReduce is a suitable framework to apply to DFSs to maintain performance and
scalability.
2.2.4 Spark
Hadoop MapReduce is not the only MapReduce paradigm; there are many implementations
of MapReduce in the market serving different purposes. Among all of them, Spark is famous
for its processing speed. While Hadoop MapReduce reads in files as a stream and stores
each record in the result file after the reduce, Spark uses in-memory processing to increase the
speed. A very important concept in Spark is the Resilient Distributed Dataset (RDD), which is
an immutable and fault-tolerant collection of elements [29]. These RDDs are created when
loading the files from the DFS. Users have options to persist them in memory. Since the
RDDs are immutable, in later releases Spark introduced the DataFrame API, which allows
users to perform aggregations, joins, and other relational operations. The best feature of
the DataFrame is that it supports schema inference. The data living within a DataFrame is
almost equivalent to a relational table. Thanks to this feature, there is no need to use
Java/Scala serialization when writing the data to disk (or distributing it across the cluster) as Spark
natively knows the schema. Moreover, when the user defines the mapper and reducer
functions, no execution occurs until the user calls an output operation. This is known as the
logical plan.
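The following minimal sketch illustrates this behaviour with the Java API: the CSV file is read
into a DataFrame with schema inference, transformations only extend the logical plan, and
nothing is executed until an action is called. The file name and column names are hypothetical
and not taken from this project.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.stddev;

public class LazyAggregationExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("lazy-aggregation-example")
                .master("local[*]")
                .getOrCreate();

        // Schema is inferred from the CSV header and sampled rows;
        // nothing runs yet, only a logical plan is built.
        Dataset<Row> prices = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("prices.csv");                 // hypothetical input file

        // Transformations are also lazy: this just extends the plan.
        Dataset<Row> stats = prices
                .filter(prices.col("close").gt(0))
                .agg(avg("close").alias("mean_close"),
                     stddev("close").alias("std_close"));

        // The action below forces the optimizer to produce a physical
        // plan and actually execute the job.
        stats.show();

        spark.stop();
    }
}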
SparkSQL is a framework associated with the DataFrame API. It supports multiple data sources. When
loading the data into memory, it will automatically deserialize them into Java/Scala objects. A very
important feature in SparkSQL is the Catalyst Optimizer [8]. The Catalyst Optimizer is
designed for tackling different optimization tasks in Big Data. The implementation of the
Catalyst Optimizer is based on a tree data structure. The nodes in the tree can be
manipulated by rules. Rules are nothing but functions that transform an
input tree to another output tree and finally to an RDD. As shown in Figure 2.2, the Optimizer starts
to perform logical optimization first after resolving the logical plan. This is done by analyzing
the user's transformations and actions. After this phase is done, it takes the optimized logical plan to generate
one or more physical plans. Then the Spark engine will evaluate these physical plans against the
cost model. Finally, the most optimized plan is selected and used for execution [8].
SparkStreaming is a scalable, fault-tolerant stream processing tool that also works for batch
workloads. SparkStreaming will receive real-time data from different sources and transfer the
processed data into different storage technologies such as a database, or a live dashboard. The
incoming stream is represented internally as a sequence of RDD objects in Scala [31]. The major
aspects of SparkStreaming are fast recovery from failures, better load balancing, and resource
usage. More importantly, it supports interactive queries between streaming data and static
datasets. The use of SparkStreaming will help to achieve near real-time processing in this project.
While both Hadoop MapReduce and Spark are designed for big data processing, their use cases
are different in many ways. Spark supports in-memory processing,
which means there is no need to load the data in every time a query is performed.
It is especially good for near real-time processing when more hardware is available. However,
Hadoop MapReduce is designed for more general purposes. A comparison came to
the conclusion that Spark is the preferred framework for processing a large amount of data in
memory, whereas when ingesting a larger scale of data from different sources, the performance
of Hadoop MapReduce is observed to be more stable.
2.3 Previous Work
There are several applications which focus on streaming data analysis. Although they
are applied in different areas, some design techniques are worth referencing. When designing
a streaming data analysis architecture, there are two major challenges that need to be taken into
consideration. The main challenges in the storage layer are the capability of storing a large
volume of data, data loss, and consistency. A properly designed DFS like HDFS could resolve
the above challenges together with a distributed infrastructure. From the processing layer
perspective, the main challenge is the latency of processing.
2.3.1 Real-time traffic data analysis
In the first of these two projects, the authors emphasized the velocity of Big Data. The data source is
collected from various traffic sensors and is preprocessed. There are several phases involved
in this architecture.
• Ingestion and aggregation: Apache Flume is chosen as the technology to ingest the
data from different sources. The data sources are GPS data from taxis and road
weather data from weather stations; the data are then ingested and forwarded to Spark for
preprocessing. At the same time, data are stored as raw files in HDFS.
• Real-time analysis: In this phase, the traffic flow indicator values are calculated in
real time with the input given as GPS events. The main operations are based on
Spark's map, reduce, and filter. The results are also stored in HDFS.
• Periodic batch analysis: Batch analysis is also carried out periodically. This is
executed separately in a child process so that the error resiliency can be improved.
The main latencies in this project are accumulated in preprocessing, Impala table updating,
query execution, and ingestion. The main reasons for these latencies include the
configuration of the messaging system, I/O performance, and network resources. Our project
can take inspiration from this project by optimizing the I/O performance. This can be achieved by
choosing different sizes of partitions of the raw file according to the requirements. Since
our project will be running in a cloud environment, more enhanced computing resources
will be available. Moreover, this project has proved that HDFS is a suitable choice of storage
technology. However, our project uses different methodologies in the ingestion phase because
we don't have a large volume of streaming data.
2.3.2 High speed log streams generated from web
With the development of web technologies, there are always valuable insights in the logs of
web-based applications. These logs will help to find usage patterns, potential failures, and so
on. This project proposed an architecture for high-speed web log analysis [3]. The overall
architecture consists of a Kafka server for accessing weblogs, SparkStreaming for processing
the incoming logs in memory, and then output of the results to storage places. The most
critical part of this project is the usage of Kafka. As the velocity can be fully controlled by
using a message queue, which is very similar to our project goal, Kafka is an ideal messaging
system that can be distributed across multiple machines. It also has a great mechanism for
preventing data loss by using replicated and persistent storage [7]. The results of the project
also proved Kafka to be the most suitable tool for a message queue, satisfying our requirements.
The details about the usage of the messaging system are presented in the next chapter.
2.4 Summary
This chapter reviews the possible technologies to build a real-time analysis system to reduce
Decision Latency. The comparison between Spark and Hadoop MapReduce suggests that
Spark is more suitable for this project because of the in-memory processing. HDFS has been
chosen as the storage technology; it can work with Spark and make use of SparkSQL to set the
schema of the data manually while maintaining the efficiency of processing. The review of the
2 examples of similar architectures inspired the design of this project. Kafka, working as a
messaging system, can be used to simulate the real-time trading records, and HDFS can be
used to store the historical dataset.
3 Design and Implementation
This chapter will present the methodologies of the solution design, a detailed introduction of the
proposed architecture, and the workflow process. There are many types of streaming data. The
data source could be sensors on IoT devices, service logs, or even financial data. The
common nature of streaming data is that they are time-series data. With this being
introduced, the proposed architecture will focus on financial trading data and perform a
real-time calculation of volatility. As discussed in the previous chapter, the main
challenges for streaming data analytics occur in the storage layer and the processing layer.
Therefore, this project aims to design an architecture using available technologies to examine
the latencies and where they might happen. This project also highlights how cloud
infrastructure can help. The first step is to collect historical trading data from
the past. Such data are usually not available for free or only available for a limited period of
time. To maximize the data volume, this project uses the intraday stock prices with a
1-minute bar. The historical stock price provider is IEXCloud [16]. This platform offers a
free plan with a limited number of requests. The API available for intraday data is designed to
request the minute file of the current day. However, it is possible to get the
historical data by passing a specific date in the request body. Therefore, the logic is, for a
particular stock, to keep requesting the records of the previous day until there is no record
available. The more historical data we have, the more synthetic data we can generate.
IEXCloud also provides other APIs for market indexes, cryptocurrencies, options and so on.
However, due to the limited number of requests, another open-source platform needs to be
used.
Alpha Vintage [34] is a provider of real-time and historical data on stocks, forex, and
cryptocurrencies. However, this platform does not support historical data at 1-minute intervals for a
relatively long time. Some tests conducted using this platform indicate that Alpha Vintage
will provide historical data for 1-minute stock prices up to a week back. This will not satisfy the
requirement of using big data. Therefore, the combined usage of the above 2 platforms is
essential.
The data collected cannot satisfy the requirement of Big Data because the volume is not large
enough. The nature of stock trading data is time-series data: data in a series over
particular periods or intervals. Trading data as such a series has the property of discreteness.
The Efficient Market Hypothesis (EMH) indicates that the history of a stock price is fully
reflected in the current price, and the market will always respond to any new information about
the stock [11]. Given these two properties, a valid assumption can be made that the current price
can also have an impact on the expected future prices. However, there remain uncertainties
in how the price will change.
The diagram 3.1 above shows the index prices changing over the past 6 months, where the
y-axis is the price and the x-axis is the time. The S&P 500 indicates the stock performance of the
500 largest companies listed on the US stock exchanges. This in some way presents an
overall picture of the market. As mentioned before in the EMH, the market will respond
to information about the stock. Due to the global pandemic that happened in March 2020,
the index drastically decreased. Later, in early April, the US government announced
Quantitative Easing (QE) to save the US stock market. Then the indexes went up to reflect
this response.
From the diagram 3.1, it can also be concluded that the change of the stock prices depends
very much on the previous state. Therefore, the change of price follows a Markov
chain, where the Markov chain describes the stochastic process of state changes. The stock price
can be modelled as a random variable that follows Brownian motion [4]. The stock prices also
have a similarity with the Markov process, which is also known as the Wiener process. Traditional
Brownian Motion describes the random motion of a particle in a fluid. The mathematical properties of
the stochastic process in a linear dimension can be defined by Geometric Brownian Motion.
The stock prices behave similarly to the stochastic process in continuous time, where they
both have a long-term trend and short-term fluctuation. The study [4] shows that using
Geometric Brownian Motion can give a reliable short-term prediction in terms of the mean error.
There are 2 components in the Geometric Brownian Motion model, which are drift and
diffusion. 2 metrics are important for this model, which are the mean and standard deviation
of the return. The return of the stock price at time k is given by the following equation:

r_k = \frac{S_k - S_{k-1}}{S_{k-1}} \quad (1)

where S_k is the price of the stock at time k.
The mean µ of the return can be calculated by summing the returns within the historical period
and dividing by the number of observations:

\mu = \frac{1}{|k|} \sum_{k} r_k \quad (2)
The mean or the expected value can be used in the drift function to reflect the longer-term
trend. If the mean is negative, it indicates the return is negative on average within the
historical period. When the average return is negative, it suggests that there will be a
loss.
The sign of the mean determines whether the stock price goes up or down. However,
stock prices never change smoothly. The standard deviation will help to incorporate random
shocks. The standard deviation determines the magnitude of the movement. The equation of
the standard deviation is:

\sigma = \sqrt{\frac{1}{|k|} \sum_{k} (r_k - \mu)^2} \quad (3)
With the mean and standard deviation introduced, the drift function is defined as
follows:

\mathrm{drift}_k = \mu - \frac{1}{2}\sigma^2 \quad (4)
Since the drift function (4) contains only µ and σ, this function is constant. The value
determines if the stock price is going to increase from a longer-term perspective. If no
random shock is applied to the series, the series will change smoothly. Recall that the Markov
process of the stock prices indicates that the current state is impacted by the previous state.
Therefore, the drift can be used to determine the next state. Given the previous stock
price, the next value without random shock is:

S_k = S_{k-1} \cdot e^{\mu - \frac{1}{2}\sigma^2} \quad (5)
The short-term fluctuation is introduced by random shock. Recall that, as discussed previously,
the market will always respond to the news. Therefore, the curve cannot be smooth; it is changing
almost all the time. However, different random events may apply a different scale of impact on
the stock price. In the diffusion function, the standard deviation σ is used to control the
magnitude of the movement:

\mathrm{diffusion}_k = \sigma \cdot z_k \quad (6)
where z_k is the effect of the random shock at time k. Since the drift function is
constant, the diffusion process will help to simulate the trend under different scenarios by
applying a different scale of random shocks. In summary, to model the stock price at time k
given the value at k-1, adding the diffusion term to function (5) gives:

S_k = S_{k-1} \cdot e^{\mu - \frac{1}{2}\sigma^2 + \sigma z_k} \quad (7)
The above equation allows making a prediction for the next data point. To predict the
stock price 4 time steps ahead, the function is given by:

S_{k+4} = S_k \cdot e^{\mu - \frac{1}{2}\sigma^2 + \sigma z_k} \cdot e^{\mu - \frac{1}{2}\sigma^2 + \sigma z_{k+1}} \cdot e^{\mu - \frac{1}{2}\sigma^2 + \sigma z_{k+2}} \cdot e^{\mu - \frac{1}{2}\sigma^2 + \sigma z_{k+3}} \quad (8)
The above function can also be generalized to make a prediction of the stock price at time
k:

S_k = S_0 \cdot \prod_{i=1}^{k} e^{\mu - \frac{1}{2}\sigma^2 + \sigma z_i} \quad (9)
Notably, the drift term is constant and at each step the diffusion term is updated, so the
exponents can be accumulated into a single sum:

S_k = S_0 \cdot e^{(\mu - \frac{1}{2}\sigma^2) k + \sigma \sum_{i=1}^{k} z_i} \quad (10)
The final function (10) will be used to generate the synthetic series. The stochastic process then
can be simulated by Geometric Brownian Motion. Geometric Brownian Motion does
not guarantee that the mean value of the generated series remains the same as that of the original
series, because of the random shock. In other words, the longer-term trend of the
generated series can be different from the original series. Although the theoretical value of
the mean should be 0 over an infinite amount of time, the mean of the generated series can still be
greater than or less than 0. This is because it is impossible to have an infinite number of
data points.
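A minimal sketch of how equation (10) can be turned into a generator is shown below, using
Java's pseudo-random normal draws for the shocks z_i. The initial price and the number of
points are hypothetical, and the drift and diffusion parameters simply echo the order of
magnitude estimated later in the case study; the project's actual generation utilities may differ.

import java.util.Random;

public class GbmGenerator {

    /**
     * Generates a synthetic price series of the given length using
     * Geometric Brownian Motion, following equation (10):
     * S_k = S_0 * exp((mu - sigma^2 / 2) * k + sigma * sum(z_i)).
     */
    public static double[] generate(double s0, double mu, double sigma,
                                    int length, long seed) {
        Random random = new Random(seed);
        double[] series = new double[length];
        double drift = mu - 0.5 * sigma * sigma;        // constant drift term
        double shockSum = 0.0;                          // accumulated sigma * z_i

        for (int k = 0; k < length; k++) {
            shockSum += sigma * random.nextGaussian();  // diffusion term for step k+1
            series[k] = s0 * Math.exp(drift * (k + 1) + shockSum);
        }
        return series;
    }

    public static void main(String[] args) {
        // Hypothetical values: starting price of 100, with a mean and standard
        // deviation of the 1-minute returns similar to those estimated in chapter 4.
        double[] path = generate(100.0, -0.000002, 0.0013, 500, 42L);
        System.out.println("First simulated price: " + path[0]);
        System.out.println("Last simulated price:  " + path[path.length - 1]);
    }
}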
In more general scenarios, Geometric Brownian Motion can also be used to predict
future changes of the stock prices. This is based on the EMH, where past prices are already
incorporated into the current price. The change of stock prices is a Markov process. Therefore,
the prediction can represent the future changes at some level, based on the assumptions of the
EMH and the Markov process.
3.2 Simulate Real-Time Data Streams: Ingestion Platform
3.2.1 Overview
Real-time streaming data is another important aspect of this project. Real-time processing is the
problem that needs to be resolved, and it is related to the velocity of the big data. When
real-time trading data arrive in the system, the system also has to be able to analyze the
volatility and risk in real time. This component is responsible for creating real-time data
streams that can be configured in terms of the velocity and the data source. It is a Spring Boot
web application that integrates with Kafka. It can collect the data source in various formats
and from various sites. There are 3 endpoints available at the moment. The main endpoint
will download the records as a CSV file, which will then be scanned line by line. There is also
an endpoint for debugging purposes, which is not the main endpoint.
3.2.2 Kafka
A messaging system is needed for delivering the real-time trading records. A messaging system
follows a producer-consumer model, where the producer sends the message to the consumer
and the consumer receives the message from the producer. Kafka, as an open-source messaging
system, has been used a lot in big data applications.
Figure 3.2: Kafka
The producer will publish a message to one or more Kafka topics. The consumer will be able
to receive the message if it subscribes to the same Kafka topic. Using a queue data structure
allows the messages to be processed sequentially. A Kafka topic lives within a Kafka broker
together with the partition id. Kafka also has a distributed architecture that can support
message replication among the nodes in the cluster. This maintains the availability of the data.
Based on the above features, Kafka can help to achieve delivering the trading records in
real time.
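As a concrete illustration of the producer side, the sketch below publishes a single trading
record to a topic using the standard Kafka Java client. The broker address, topic name, and
JSON payload are illustrative assumptions and not necessarily what the Ingestion Platform uses.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TradingRecordProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");       // hypothetical broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Key: stock symbol; value: one trading record serialized as JSON.
            String payload = "{\"symbol\":\"JPM\",\"open\":95.1,\"close\":95.3,"
                    + "\"high\":95.4,\"low\":95.0,\"volume\":12000}";
            producer.send(new ProducerRecord<>("trading-records", "JPM", payload));
        }
    }
}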
3.2.3 Implementation
Setup Kafka Cluster
The features of Kafka have been introduced previously. To set up a Kafka cluster on a local
machine, Docker containers were used. Using Docker containers allows creating multiple
virtual Kafka instances on one machine, and the setup can be extended to a real cluster
if using a swarm of Docker engines. The virtual instances also include a Zookeeper instance,
which manages the Kafka brokers. The message logs are partitioned within the virtual Kafka
instances and mounted into the host machine. The port number is also exposed to the host
machine. The containers are described in a YAML file that contains the configurations of each
Docker container, including the base image, dependencies, network configurations, and the
interaction between the container and the host machine.
In the cloud environment, Google PubSub follows the same design as Kafka but provides large
storage capability natively in the cloud. It also provides a UI to allow tracking messages. These
are huge advantages compared to a self-managed Kafka cluster.
Web Service
Spring Boot is a popular web application development framework. It is a Java framework
that allows developers to focus on the business logic without worrying too much about the
other details. The most important concepts in Spring Boot are Aspect-Oriented Programming
(AOP) and Inversion of Control (IOC). AOP allows modularizing the functions that are
used at multiple points of an application. This is also known as cross-cutting concerns
[1]. The common functionalities are defined in one place when using AOP, and these
functionalities can be reused in new features without modifying the classes that use them.
In this application, there are 3 modules created and implemented. FileManager is responsible
for managing files in the folder, including scanning file records. IngestionManager is responsible
for retrieving the data from the data source and downloading the data in either CSV or JSON
format. The messaging module implements the Kafka API and is responsible for sending and
receiving messages. When a download is completed by the IngestionManager, Kafka will send a
message containing the file path, the symbol name, and a timestamp. The IngestionManager
will also generate a timestamp that is unique to every downloaded file. The message will be
received by the web application itself. Kafka provides Java APIs which have callback functions
to check if a message has been sent successfully. A log message will be printed either onSuccess
or onFailure.
future.addCallback(new ListenableFutureCallback<>() {
@Override
public void onFailure(Throwable throwable) {
LOGGER.error("Unable to send message: [{}]",
throwable.getMessage());
}
@Override
public void onSuccess(String message) {
LOGGER.info("Sent message [{}]", message);
}
});
The message body contains the symbol, the highest price, the lowest price, the opening price,
the closing price, and the trading volume. After the message is successfully sent, FileManager will
start working on reading the file and extracting the records. The FileManager interface will
manage every file in every folder like a data pool manager. When the message is received
by the FileManager, it will use the file path contained in the message payload to read the
file. Then, the content of the file will be saved in memory as a HashMap. Typically, a file of
the 1-minute bar in a trading day will only have up to 480 trading records. This is tolerable
for the JVM, which can load the entire file. The HashMap will be passed to the message sender again. This
time, each Entry in the HashMap will be sent as a message at a user-defined frequency. To
achieve this, the Java Runnable interface was implemented to control the frequency. An integer
value is passed into the application as a Spring Boot property [2]. The thread will sleep for
the interval passed in. Also, the return of the stock price will be calculated and saved into
InfluxDB.
try {
    LOGGER.info("The next message will be sent after [{}] milliseconds",
            getSleepTime(timeInterval));
    Thread.sleep(getSleepTime(timeInterval));
} catch (InterruptedException e) {
    LOGGER.error("Thread interrupted");
}
If no value is passed into the application, it will use the frequency of the file. If the requested
file is at a 1-minute interval, a message will be sent every minute. The messages will be received
by the Spark App to perform real-time analysis. The downloaded files will also be categorized by
symbol.
This service is one of the core components of the entire architecture. It has also been integrated
with Swagger UI to document the endpoints that are currently available. Data ingestion is
very important to satisfy the variety and the velocity requirements of big data. The Ingestion
Platform itself is highly configurable in terms of data sources, stock markets, as well as the
velocity of simulating the real-time data. The platform can also be used as a stress test tool.
3.3 Real-Time Analysis: Spark App
As introduced in the previous chapter, the benchmark results of Spark and Hadoop MapReduce
indicate that Spark has better performance when processing real-time data streams. Due to
the feature of in-memory processing, it is faster in terms of processing. Spark Streaming
is another reason for this framework being used. By using it, the application can consume
the incoming trading records as a continuous stream.
To use Spark in the application, the Spark context has to be configured. The configuration
includes the Spark checkpoint, Google Cloud credentials, the application name, and the number of
threads to be used on the local machine. The Spark context also allows choosing the configuration at
runtime depending on different requirements [8]. In this application, there are 2 configurations
required. The first one is for the Spark streaming context. When running Spark locally, the
checkpoint directory is set to a local folder. When running Spark on the cloud, this is set to
a folder in Google Cloud Storage (GCS). The master is set to local[*], which means as
many worker threads as logical cores available [31]. Another configuration is the Google
Cloud Storage configuration for loading the historical file. Since Google Cloud Storage is
based on GFS, providing a link starting with gs:// pointing to the file location can load the
file. The Hadoop Distributed File System is also implemented based on GFS. Therefore, it is
also important to add the Hadoop configuration when loading the file from GCS. After the
SparkStreaming context has been created, it will be connected to the messaging system to
consume the messages. Spark Streaming provides the DStream, which is a high-level abstraction
of Discretized Stream. When the input stream is received by Spark Streaming, it will be
represented as a sequence of RDDs, which have been introduced in the previous chapter. Notably, there are some operations that can be
applied to a DStream object. The operations can be categorized into Transformations and
Actions. These operations have no impact on the existing RDDs since RDDs are immutable. Map
is a transformation that converts the elements of an RDD into
different objects. The map function takes the original RDD type as a parameter and returns
the target RDD type. Another function in Transformations is Filter. The filter function also
takes in the original RDD type and returns a boolean variable. This function is applied when
filtering the resulting RDDs. In this application, the source of the DStream is the
Ingestion Platform, in the form of messages. In the map function, a deserialization
method is defined to convert the input stream of String type to a Java object. In the code,
a class GCSRecord is defined which contains all fields of the trading records. Spark
applies the map function in small batches. Figure 3.3 illustrates the process of applying the map
function to the incoming stream. The transformed DStream will be used to perform further
aggregations.
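A simplified sketch of this deserialization step is shown below. The field list of GCSRecord and
the use of Gson for JSON parsing are assumptions made for illustration; the project's actual
class and serialization format may differ.

import com.google.gson.Gson;
import org.apache.spark.streaming.api.java.JavaDStream;

public class RecordDeserializer {

    // Simplified version of the record class holding one trading data point.
    public static class GCSRecord implements java.io.Serializable {
        public String symbol;
        public double open;
        public double close;
        public double high;
        public double low;
        public long volume;
    }

    /**
     * Applies a map transformation that converts every incoming JSON
     * message in the stream into a GCSRecord object. The transformation
     * is lazy and is executed on each micro-batch of the DStream.
     */
    public static JavaDStream<GCSRecord> deserialize(JavaDStream<String> messages) {
        return messages.map(json -> new Gson().fromJson(json, GCSRecord.class));
    }
}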
In terms of the Action operations, an Action is a method to access the actual data in an RDD.
Action operations are suitable for small sizes of data [31]. The incoming dataset,
compared to the historical data, is extremely small. Therefore, some Action operations can
be applied. Reduce is one of the most important and commonly used ones, which is used to
perform aggregations. The reduce function takes two input parameters, which are RDD elements, and
returns the aggregated result. The reduce function was used to find the return of the stock
prices. Recall that the equation for calculating the return stated in equation (1) requires 2
parameters: the current stock price and the previous stock price. Figure 3.4 shows this process.
SparkStreaming provides a sliding window API which can be used. The window length of 1.5
minutes was defined. The reduce function will aggregate all of the RDDs within this window
interval. In the configuration of this project, the number of RDDs within this window should
be 2. After the return RDD is calculated, each RDD will be delivered into the InfluxDB
instance. The timestamp is created at the time at which the InfluxPoint object is
created.
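A sketch of how this window can be expressed with the Spark Streaming Java API is given
below. It mirrors the approach described above and assumes that exactly two consecutive
prices fall into each window, in arrival order; the durations and stream types are illustrative
rather than the project's exact code.

import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;

public class ReturnWindow {

    /**
     * Reduces the prices that fall inside a 90-second window (expected to
     * contain two consecutive data points at a 1-minute rate) into the
     * return (p_current - p_previous) / p_previous, re-evaluated every minute.
     */
    public static JavaDStream<Double> returns(JavaDStream<Double> closePrices) {
        return closePrices.reduceByWindow(
                (previous, current) -> (current - previous) / previous,
                Durations.seconds(90),    // window length: 1.5 minutes
                Durations.seconds(60));   // slide interval: one new window per minute
    }
}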
The statistical moments computed include the mean and standard deviation for both opening
prices and closing prices, and an additional field using the mean divided by the standard
deviation. When a new message comes in, the historical data is used from memory and the new
data is appended at the end of the historical dataset. Since RDDs are immutable, Spark provides
another API called Dataset [8]. The Dataset object allows joins, aggregations, and other
relational operations. RDDs are operated on with functional programming constructs that
include and expand on map and reduce. SparkSQL provides APIs for aggregation, including
mean and standard deviation. The RDDs in the DStream are processed sequentially. Each RDD
will be used to create a new Dataset and unioned with the historical Dataset. Then the average
and standard deviation are calculated:
Dataset<GCSRecord> tempDataSet =
    context.createDataset(gcsRecordJavaRDD.rdd(), gcsRecordEncoder);
dataset = dataset.union(tempDataSet);
double avgOpenReturn = dataset.agg(functions.avg("openReturn")).first().getDouble(0);
double stdOpenReturn = dataset.agg(functions.stddev("openReturn")).first().getDouble(0);
The historical Dataset is used repeatedly. Therefore, it could be cached in memory
to save loading time. This transformation also creates a new DStream. The new
DStream is connected by a sink to InfluxDB, and the calculated RDDs are stored there. Figure
3.5 illustrates the overall flow of the DStream. Reactiveinflux [28] was used as the SparkStreaming
sink connector to InfluxDB.
Figure 3.5: Overall Flow
Google Dataproc is a managed service for running Spark and Hadoop on the
cloud. It allows dynamic resource allocation for better performance. For this purpose, a
Dataproc cluster was created with 1 master node and 3 worker nodes. A Spark job can be
submitted through the gcloud command line. A Spark job is considered as an executable jar
where all the classes, methods, and dependencies are packaged. Without further
programmatic work, the method of packaging has to be configured. Maven was used as the
build tool, configured so that it compiles Scala and bundles everything into the jar. The
submission of a Spark job includes the jar location and the main class.
InfluxDB is a time-series database that is widely used for real-time
monitoring and other time-series related applications. InfluxData also provides visualization
utilities. The data visualization tool provided by InfluxData has the functionality of
creating real-time visualization dashboards. InfluxDB provides SQL-like queries so that users
can query the database without the requirement of learning a new syntax. InfluxDB is also
straightforward to deploy on a cloud instance.
To deploy InfluxDB to Google Cloud, a Kubernetes Engine instance was set up with the Ubuntu
18.04 operating system. InfluxDB can be installed and configured via the command line. After
installation, the instance has to be accessible from other machines. So, the next step was to
expose the instance for external access.
In the Ingestion Platform, the Java client of InfluxDB was added as a Maven dependency.
When writing a record to the database, the connection is established programmatically, as
shown in Listing 3.4.
Listing 3.4: Using the Java client to write data into InfluxDB
sparkInflux.saveToInflux(InfluxReturnStream);
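Listing 3.4 shows the stream-side connector call used in the Spark App. On the Ingestion
Platform side, writing a single return value with the plain InfluxDB Java client could look
roughly like the sketch below; the connection details, database name, measurement, and field
values are illustrative assumptions rather than the project's exact configuration.

import java.util.concurrent.TimeUnit;
import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;
import org.influxdb.dto.Point;

public class ReturnWriter {
    public static void main(String[] args) {
        // Connect to the InfluxDB instance exposed by the cloud VM.
        InfluxDB influxDB = InfluxDBFactory.connect("http://influx-host:8086", "user", "password");
        influxDB.setDatabase("trading");

        // One point in the "stock_return" measurement, tagged by symbol.
        Point point = Point.measurement("stock_return")
                .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
                .tag("symbol", "JPM")
                .addField("return", -0.0004)
                .build();

        influxDB.write(point);
        influxDB.close();
    }
}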
3.5 Summary
In this chapter, the core components and their implementations have been introduced.
Figure 3.6 provides an overview of the integrated architecture of the whole design. The core
components, InfluxDB, the Ingestion Platform, the method for generating synthetic data, and the
Spark App, have been explained in detail regarding their implementations. The Ingestion
Platform is responsible for collecting the trading data from various sources. The calculated
return of the prices will be stored into InfluxDB. At the same time, it will also send the
records through the messaging system. The messages are then received by the Spark App,
which performs real-time analysis. The Spark App also interacts with GCS to read the historical
files into memory. The Synthetic Data Generation Utils are the implementation of Geometric
Brownian Motion. The generated series are also stored in GCS. An actual use case and
its evaluation are presented in the next chapter.
Figure 3.6: Integrated Architecture
4 Case Study: J.P. Morgan Stock Price Volatility
This chapter will focus on the performance of the architecture in an actual use case.
The chapter is divided into the background introduction, the workflow of the architecture, and
the evaluation of the results.
4.1 Background
J.P. Morgan Chase & Co. is an American investment bank. It is the largest bank in the
United States [18]. Its main businesses include investment banking, financial services, and asset
management. As one of the largest banks in the market, it represents the banking and
financial services industry at some level. The stock price of J.P. Morgan also reflects the
confidence of the traders in the stock market. Therefore, this is a typical example to be
studied.
The J.P. Morgan historical stock price dataset is collected from IEXCloud [16] for the past 1
year, dating back to 1st April 2019. The collected dataset contains the following fields: the opening
price, closing price, the highest price, and the lowest price. When the opening price is greater
than the closing price, it means the return at the current timestamp is negative; otherwise,
the return is positive. This is important to understand when calculating the return.
Table 4.1 gives an overview of the raw data collected from IEXCloud.
The table is indexed by timestamp and parsed into a DataFrame using Pandas [24]. The prices
that are usually used for analysis are the opening and closing prices. A visualization of both the
opening series and the closing series was created. Diagram 4.2 is the visualization of the historical
prices in the past one year. The x-axis is the time and the y-axis shows the open price and the close
price. Diagram 4.2 is created with Plotly [27]. This library allows creating interactive
visualizations in Python. Recall that the stock price change is a stochastic process.
Therefore, the Geometric Brownian Motion model is used to generate the synthetic data.
Although the general trend of this stock price is upward, it is not growing smoothly. It is
observable that the prices are fluctuating constantly. An initial analysis of the series
indicates that the diffusion term could be relatively large across the entire dataset. To calculate
the drift term and the diffusion term, the descriptive statistics are required, especially the mean
and standard deviation.
The first step is to calculate the return of both the opening series and the closing series. Recall
that the function (1) for calculating the return uses the prices between 2 timestamps to calculate
the return. The built-in functions in Pandas [24] have the ability to calculate the return as a
series directly.
Figure 4.3: Distribution of mean
Using the mean and standard deviation, we can use equation (4) and equation (6) to get the
drift and diffusion. The distribution of the return of the opening price is shown in Figure 4.3. The
majority of the return values are distributed around 0. Therefore, the mean should be close
to 0. Since the prices changed with large amplitude, the standard deviation should be much
larger in magnitude than the mean.
The descriptive statistics are important because they will be used in the Drift (4) and Diffusion
(6) functions. The table shows that the means of the return on the open price and the close price
are both -0.000002, whereas the standard deviations are 0.001304 and 0.001295 respectively. This
confirms the observation above.
We use the open price as an example to show the impact of the random shock on the stock
price using Geometric Brownian Motion. We use function (9) to generate 200 simulations
over the next 500 trading minutes. For each of the simulations, the drift and diffusion terms
are computed from the historical statistics above.
Figure 4.3 shows that the stock prices will not necessarily fall due to the effect of the random
shock. They can have various behaviours due to the random shock. Using this technique, we
generate a large file containing the open price and the close price and the return for both.
There is a total number of 20 million data points generated using this technique. Because the
size of the dataset is too large, Pandas will slow down in processing the data at
this scale. Therefore, we use PySpark to process it. Again, the first step is to calculate the
descriptive statistics of the return. PySpark also provides built-in functions to achieve this.
Using PySpark allows processing a large volume of data in a relatively short time. It is
observable that the mean values slightly changed while the standard deviations remained the
same in Table 4.3. This is because of the effect of the random shock. This file was then
partitioned and uploaded to GCS.
The partition number depends very much on the number of cores. The default number of
partitions is 200 in Spark. When loading the files, Spark will try to keep the memory utilized.
If the memory is not sufficient for the entire file, Spark will only read the partitions that fit
within the allowance of the memory and spill the rest of the partitions to disk.
The Spark documentation gives best practices for choosing the number of partitions. If an
RDD has too many partitions, the task scheduler will take more than the execution time to
schedule the tasks. This will significantly increase the latency in actual real-time processing.
If the number of partitions is too small, it will result in some of the cores being in the idle state,
hence resulting in less concurrency. The number of partitions is initially set to 22 after the file is
generated. When reading the file through the Spark application, the file gets repartitioned.
In the local environment, the computer has 4 cores and the files are partitioned into 8 chunks.
When performing the calculation, it takes 2 minutes locally. However, the latency could be
reduced drastically using a cluster on the cloud. Dataproc was used in order to speed up
the calculation, with 3 nodes where each node has 4 cores. The file was then partitioned into
24 chunks. The execution time was much faster. The speed of computation is
determined by the hardware as well as the number of partitions. The goal is to utilize the cores
as much as possible.
The real-time trading records are simulated using data from Alpha
Vintage. In this case study, we use the one-minute stock price of J.P. Morgan Chase & Co.
In the Ingestion Platform, the endpoint to download the file was used. When this endpoint
gets queried, it will request the target file from Alpha Vintage. The function used is
TIME_SERIES_INTRADAY with a 1-minute interval, and the symbol for J.P. Morgan Chase
& Co. is JPM. When the file is downloaded, as explained in the previous chapter, the
FileManager will start to read the content of the file and send it to the message sender. The
message sender will send the trading records at a user-defined frequency. The default value in
this case was 1 minute. However, during testing, this rate was adjusted to 30 seconds to
test the limit of the system. The results will be discussed later. The trading records are
persisted in Google PubSub, waiting to be polled by the Spark App. There is a topic
created for simulating the trading data flow in real time. This topic is also subscribed to by the
Spark App so that the Spark App has the ability to access the persistent messages.
The Spark App continuously consumes the messages and processes them. The incoming
messages are deserialized into a Java class. When the Spark App is running, it loads in the
historical dataset.
In this project, the means and standard deviations of both the open price and the close price
were calculated in real time. These two statistics represent the volatility of the stock price.
Volatility represents the historical fluctuations of the price in the past, and the value of this
property changes as new trading records come in. Volatility is important because it is an
indicator of the optimism or fear in the market, and it is also important for assessing the
performance of the stock. As introduced with the Geometric Brownian Motion, the fluctuation
(the diffusion term) is directly related to the standard deviation. The magnitude of the volatility
therefore depends on the standard deviation (variance) as well: with a higher standard deviation,
fluctuations of larger magnitude can appear in the stock price. Traders use volatility to measure
the risk of the stock. If the volatility is high, the stock is considered riskier.
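For reference, the standard form of Geometric Brownian Motion, consistent with the drift and diffusion terms discussed above, can be written as follows (the symbols follow the usual convention and are not taken verbatim from the report's equation 1):

\[ dS_t = \mu S_t\,dt + \sigma S_t\,dW_t, \qquad S_t = S_0 \exp\!\left(\left(\mu - \tfrac{\sigma^2}{2}\right)t + \sigma W_t\right), \]

where the volatility \(\sigma\) (the standard deviation of returns) scales the diffusion term, so a larger standard deviation produces larger fluctuations around the drift.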
Calculations for big data can be expensive, especially aggregations, since most aggregations
involve iterating over the whole dataset. The aggregation functions in Spark provide the
capability of calculating the mean and standard deviation, and no shuffling is involved when
aggregation is used alone. Recall from diagram 3.5 that the RDDs of the historical data will be
unioned with the incoming data before the aggregation functions are applied.
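A minimal sketch of such an aggregation is shown below; the DataFrame df and the column names are assumptions.

from pyspark.sql import functions as F

stats = df.agg(
    F.mean("open_return").alias("mean_open"),
    F.stddev("open_return").alias("std_open"),
    F.mean("close_return").alias("mean_close"),
    F.stddev("close_return").alias("std_close"),
)
stats.show()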
The volatilities calculated in real time are stored in InfluxDB. In InfluxDB, two measurements
were created: the first for the return of the stock price and the second for tracking the volatility
in real time. The measurements of volatility and returns persist in the database. Since the
instance is hosted on the cloud, Google provides high availability for reading and writing data.
The configurations of InfluxDB for the Ingestion Platform and the Spark App are different
because they use InfluxDB differently: in the Ingestion Platform, a Java client was created with a
hardcoded endpoint, while the Spark App uses the connector so that the stream is written
directly to InfluxDB.
4.5 Results
The results are evaluated by examining the timestamps of the two measurements and comparing
the latency. The return traces are populated into the measurement at a fixed rate of 1 minute.
The latency is measured by the time difference between the return and the statistical moments.
For example, the first return record arrived at the database on 25 March 2020 at 17:45:20 and
the associated statistical moments arrived at 17:45:54, giving a latency of 34 seconds. The return,
calculated between the current trading record and the previous one, arrives at the database
before the next record comes in. On average, the statistical moments take up to 40 seconds to
compute, which has a negative effect on decision making when the data arrive at a 1-minute
frequency. Screenshot 4.5 shows the measurement in the InfluxDB instance hosted on the
Compute Engine; the return is computed using equation 1. The timestamp uses the system time
instead of the original trading time, so that the latency can be investigated. In the flow
introduced in Figure 3.5, the return was computed using the reduce function. This step involves
shuffling; however, under normal circumstances the window contains only 2 RDDs, so the cost of
this shuffle is small.
The return RDD is then used to compute the statistical moments. As this produces a DStream
of return RDDs, it can be used to compute the mean and standard deviation together with the
historical dataset without collecting the RDDs in the stream. The computation of the mean and
standard deviation could be slow as it aggregates the 20,000,000 rows of data. The union is a
very efficient operation in Spark: it simply combines the partition lists of the two RDDs without
moving any partitions around. Therefore, the new incoming data is unioned with the historical
dataset and then the aggregation functions are applied.
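The union-then-aggregate step can be sketched as follows; the two RDDs are assumed to hold a single numeric return column, and the function name is illustrative.

def update_moments(historical_rdd, new_batch_rdd):
    # union is a narrow transformation, so no shuffle is triggered here.
    combined = historical_rdd.union(new_batch_rdd)
    return combined.mean(), combined.stdev()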
In screenshot 4.6, the measurements for volatility are computed and stored in the database.
Note that the first record is not a streaming result: the statistical moments of the whole
historical dataset are calculated on application start, so this record is not considered when
estimating the latency. On average, the trace gets populated 40 seconds later than the arrival of
the original data, meaning that the aggregation takes 40 seconds for both opening and closing
prices. There are 2 statistical properties for each of the two types of prices, so 4 values are
calculated in total; computing each of them therefore takes about 10 seconds.
The results suggest that the system can tolerate a trading frequency of 1 trade per minute. In an
algorithmic trading environment, the trading frequency can be much higher; therefore, a more
robust system would be required to measure risk in real time. At the moment, the response time
of the system can satisfy the requirement of trading at a 1-minute frequency. The parallelism of
the Spark Engine plays a critical role here, and it is directly related to the number of partitions
and the number of available cores.
4.6 Evaluation
4.6.1 System setup
The system consists of a Google PubSub instance, an InfluxDB instance, a Spring Boot web app
for ingesting the data from the source, a Spark application for processing the streaming data,
and Google Cloud Storage for file storage. A Dataproc cluster provided by Google runs Hadoop
to execute the Spark jobs. To measure the real-time risk of a specific stock price, the first step is
to collect a sufficient amount of historical data. In this case study, because not enough historical
data was available, the Geometric Brownian Motion technique was used to generate synthetic
data. This technique uses the mean and standard deviation of the sample space to ensure the
generated data comply with the statistical properties of the sample data. The Hadoop cluster
was set up with 3 nodes, each with 4 cores and 500 GB of disk. The configuration of this cluster
can satisfy all requirements of such a system, and Dataproc allocates the resources dynamically.
The file partitioning was also configured manually to make the best use of the environment; this
is the only place that involves system tuning. For different cluster configurations, the number of
partitions is different; for this specific case, 24 was used.
Due to the limited budget on Google Cloud, the most economical configuration for the cluster
was chosen. Dynamically generating synthetic data and partitioning the files are not automated
in the current setup.
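As a minimal sketch, the parallelism settings below reflect the kind of manual tuning described above; the configuration values are assumptions chosen to match the 24 partitions used in this case.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tuning-sketch")
         .config("spark.default.parallelism", "24")      # matches the 24 partitions used
         .config("spark.sql.shuffle.partitions", "24")
         .getOrCreate())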
4.6.2 Performance
The performance of the system can be evaluated in terms of computation speed and stability.
These 2 metrics are the key performance indicators of the system as well as the key objectives.
The speed of computation is evaluated by measuring the delay of the statistical moments. In the
screenshots of the databases, it is observed that the average delay is 30 seconds for the means
and standard deviations of both the opening-price return and the closing-price return. The
system can therefore support trading at a 1-minute frequency. The designed big data processing
system allows quick retrieval of a large volume of historical data and high-speed computation.
The performance can be further improved with stronger hardware and more efficient
approaches to allocating computation resources. In one study, [37] investigated Spark-GPU
acceleration when running Spark SQL queries and suggested that SQL queries can benefit from
GPU acceleration.
The most critical component in evaluating the stability of the system is the messaging system,
which is also the means of simulating real-time trading. When using PubSub in this project, the
messages are guaranteed to be delivered. Unlike Kafka, PubSub requires a manual ACK on the
consumer side, which means any failed message is persisted and remains observable until it is
consumed. Apache Bahir is developed as a connector specifically for Google PubSub and Spark
Streaming, and it makes use of the manual ACK feature of Google PubSub to guarantee message
delivery.
4.7 Discussion
4.7.1 Data Selection
The historical dataset for J.P. Morgan Chase & Co. is only available up to 1 April 2019. The file
size is not large enough to satisfy the volume aspect of Big Data; therefore, we had to use
Geometric Brownian Motion to generate synthetic data. The key finding in the dataset generated
by GBM is that the mean of the return is not stable. Since we applied a random shock to the
drift, the mean return is determined by the random variable, so each simulation can have
different results. However, the standard deviation remains the same, as the magnitude of change
is bounded by the original standard deviation. Therefore, the generated data can only represent
part of the original universe, and the results computed by the system can only be interpreted
with respect to the generated data universe. If more historical data were available, the results
could be more representative.
The project focuses on open-source implementations, particularly Spark and Hadoop; however,
other open-source technologies were also considered. Recall that one of the goals of the project
is to produce a data stream and control its speed, so streaming tools were considered as the first
option. The major technologies are Apache Spark, Apache Flink, and Apache Storm, and
benchmark results have been published for all three. The study [13] showed that Spark
Streaming processes new events in a stepwise manner while the other two are linear; the cause is
the micro-batching design in Spark Streaming [31]. Although the study concludes that Apache
Flink and Apache Storm are more successful at near real-time processing with lower latency, it is
also worth mentioning that Spark can process data with higher throughput. Moreover, at low
throughput rates, such as under 50,000 events per second, there is no difference among Storm,
Flink, and Spark Streaming. The original design of this project was to have the capability of
handling data from multiple sources at the same time, so a higher throughput rate was expected.
The second reason why Spark Streaming was chosen as the streaming technology is that it is
part of the Spark ecosystem and integrates well with the other Spark suites, such as Spark SQL.
Hadoop MapReduce was also considered as another MapReduce implementation. However, it is
impossible to use Hadoop on its own to achieve the objective of processing real-time events.
Another consideration for not using Hadoop MapReduce is that the project requires stream
processing; since stream processing is not supported by Hadoop MapReduce, extra latency could
have been introduced had Spark not been used. The usage of Spark Streaming in this project
also demonstrated that Spark Streaming is capable of processing the data in near real-time in
this case study. However, the experiment of handling messages from multiple sources has not
been conducted.
For the simulation component, Kafka was initially used to set up the messaging system on the
local machine. As introduced earlier, the log files are stored on disk in different partitions.
However, with a large volume of incoming messages, the disk overhead becomes large, which
challenges the storage capacity of the local machine. Furthermore, when deploying to Google
Cloud, a Kafka cluster requires extra VM instances. Due to the above considerations, Kafka was
replaced with Google PubSub, which is native to the cloud environment and performs equally
well.
4.7.3 Challenges
This project focuses on the velocity and volume aspects of big data. With the help of cloud
services, a large volume of data can be stored easily, and the velocity is measured by the latency
of computation. During the implementation of the architecture, some challenges were
encountered that are worth mentioning. The first challenge arose during the deployment of the
Ingestion Platform. Google App Engine uses its own internal production server, while Spring
Boot uses Tomcat as an embedded server. Normally, when Spring Boot runs on a server or a
local machine, it uses Tomcat as the default server to handle incoming requests from clients.
However, App Engine is a fully managed platform that scales the application with a minimal
amount of configuration, and there was a compatibility issue between App Engine and the
Tomcat server. The workaround was to remove the Tomcat dependency from Spring Boot and
use the default servlet environment provided by App Engine.
Another challenge was tuning the parameters of the Spark Engine and the file partitioning. To
achieve maximum parallelism, all available worker nodes are used. Experiments were conducted
to test the processing speed against the number of partitions, and they concluded that with
around 22 to 24 partitions the Spark Engine reaches its maximum processing speed. A problem
with applying the reduce function to calculate the return is that when messages fail or are
delivered late, the return is calculated incorrectly. This depends heavily on the stability of the
messaging system, and there is no feasible solution other than relying on the messaging module
to ensure the correctness of the return. The messaging module acts as a middleware that
connects the Ingestion Platform and the Spark App. As discussed previously, if the messaging
module is down, or even if a single message is delivered late, the computation result is different.
Therefore, the messaging module has to be stable and reliable.
4.7.4 Future Work
Another piece of future work is testing the performance of multi-source processing. The design
of the system takes this requirement into consideration; Spark Streaming was chosen partly to
handle high throughput rates. Due to time constraints, this experiment has not been conducted.
Finally, automating the process of generating synthetic data is another important piece of future
work. The purpose of generating synthetic data is that the volume of the original data is not
large enough to satisfy the requirements of Big Data. However, if a large volume of historical
data is available for specific stock prices, the step of generating synthetic data is no longer
required.
The latency is measured by the time difference between the arrival of the return and the arrival
of the computed mean and standard deviation of the return. The latency is brought down to 10
seconds to aggregate over a single column, thanks to the configuration that utilizes all the worker
nodes. The time spent calculating the statistical moments is made up of several components.
The first is aggregating the historical dataset: the aggregation function requires iterating over all
rows, so if the number of rows is defined as N, the time complexity is O(4N), as it calculates the
mean and standard deviation for both opening and closing prices. The second cause of latency is
that when the application starts, it also takes O(N) time to load the historical dataset. The union
operation can be treated as constant time. Therefore, the total time complexity is O(4N), which
can be written as O(N), for computing the statistical moments each time.
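As a rough summary under the assumptions above (four aggregations over N rows plus a one-off load of the historical data), the cost per update can be written as

\[ T(N) \;=\; \underbrace{4\,c_{\mathrm{agg}}\,N}_{\text{4 aggregations}} \;+\; \underbrace{c_{\mathrm{load}}\,N}_{\text{initial load}} \;+\; O(1) \;=\; O(N), \]

where \(c_{\mathrm{agg}}\) and \(c_{\mathrm{load}}\) are illustrative per-row constants rather than figures taken from the experiments.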
With the help of cloud services, Big Data storage is no longer an issue: volume can easily be
addressed by adding more hardware, and Google Cloud Storage is built on top of Google's
infrastructure, so it can scale on demand.
Variety is the aspect that did not get a chance to be experimented with in this case study. The
idea of variety is to investigate whether different stock prices have an influence on each other.
Due to the time constraint, this experiment was not conducted. However, as mentioned above in
4.7.4, the design of the system takes this into consideration: Spark Streaming is able to receive
messages from multiple sources. The initial idea was to attach context to each stock price when
it is ingested by Kafka, to allow Spark Streaming to process different stock prices in one Spark
job. This would potentially affect the speed of calculation.
5 Conclusion
This paper has contributed a review of suitable technologies for building a low-latency decision
system. The review of the latency problem, particularly in the High-Frequency Trading industry,
suggested that higher latency can result in decreased market quality. This project aims to
simulate a trading environment and perform near real-time analysis, hence supporting timely
decision making.
This project was tested under a trading frequency of 1 trade per minute; the decision time can
take up to 40 seconds based on 20 million rows of historical data. The target users of such a
system are analysts in the financial services industry. The architecture involves Big Data
technologies and aims to solve the challenges of Big Data in terms of Velocity and Volume. This
project answers the question of retrieving large data in a relatively small amount of time. The
system also achieves an average decision latency of around 40 seconds, which characterizes the
system from the computation-speed aspect. This trading frequency is lower-bounded by the
computation time. The current system is capable of analyzing the risk of trading in near
real-time.
In the current implementation, the system can potentially fail if the messaging module is
congested. The default configuration of the window size is 1.5 minutes in Spark Streaming,
which allows 2 RDDs to exist at the same time so that the return can be calculated from the
current price and the previous price. If the messaging module fails to deliver one message, the
RDDs in the defined window will be incorrect and hence the return is calculated incorrectly.
One possible fix is to use a load balancer and monitor the health of the messaging module; when
it is found to be offline, it can be replaced with backup servers.
In the future, it may be worth investigating approaches to decrease the latency further, for
example a better resource allocation algorithm in cloud environments (particularly for file
partitioning) and, finally, the automation of synthetic data generation.
6 Appendix
This chapter contains some of the code snippets and pictures of some of the key parts of the
implementation.
import numpy as np

def simulate_gbm(num_simulations, size, base_price, mean, variance, volatility):
    # Geometric Brownian Motion price paths. The function signature is
    # reconstructed from the variables used in the original snippet:
    #   S_t = S_0 * exp((mu - 0.5 * sigma^2) * t + sigma * W_t)
    res = {}
    dt = 1 / (size + 1)
    for i in range(1, num_simulations):
        temp = [k for k in range(size + 1)]
        temp[0] = base_price
        noise = standard_brownian_motion(size)
        for j in range(1, int(size + 1)):
            # Drift over elapsed time t = j * dt (the original multiplied the
            # drift by the whole noise array, which was a bug).
            drift = (mean - 0.5 * variance) * j * dt
            diffusion = volatility * noise[j - 1]  # random shock scaled by volatility
            temp[j] = base_price * np.exp(drift + diffusion)
        res.update({i: temp})
    return res
def standard_brownian_motion(size):
    # Generate a standard Brownian path by cumulatively summing Gaussian noise.
    np.random.seed()
    dt = 1 / (size + 1)  # parenthesised: the original computed 1/size + 1
    W = np.random.standard_normal(int(size + 1))
    W = np.cumsum(W) * np.sqrt(dt)
    return W
from pyspark.sql import functions as func
from pyspark.sql.window import Window

# Lag the prices by one row (ordered by the index column) to compute returns.
df_open_lag = df.withColumn('prev_open', func.lag(df['open']).over(Window.orderBy('index')))
df = df_open_lag.withColumn('open_return',
    (df_open_lag['open'] - df_open_lag['prev_open']) / df_open_lag['open'])

df_close_lag = df.withColumn('prev_close', func.lag(df['close']).over(Window.orderBy('index')))
df = df_close_lag.withColumn('close_return',
    (df_close_lag['close'] - df_close_lag['prev_close']) / df_close_lag['close'])

# Cumulative window from the first row up to the current row. The window
# bounds were truncated in the original listing and are reconstructed here.
cum_window = Window.orderBy('index').rangeBetween(Window.unboundedPreceding, 0)

df = df.withColumn('cum_sum_open', func.sum('open_return').over(cum_window))
df = df.withColumn('moving_average_open', df['cum_sum_open'] / df['index'])

df = df.withColumn('cum_sum_close', func.sum('close_return').over(cum_window))
df = df.withColumn('moving_average_close', df['cum_sum_close'] / df['index'])
6.5 A5: Request historical 1 minute trading data from IEXCloud
import pandas as pd
import requests

def get_file(num_days, symbol):
    # Download 1-minute intraday prices for `symbol` from IEX Cloud as a
    # pandas DataFrame (the token is a placeholder; the original day-by-day
    # loop over `num_days` was not shown and is omitted here).
    URL = 'https://cloud.iexapis.com/v1/stock/' + symbol + '/intraday-prices'
    PARAMS = {'token': 'secret_token'}
    columns_to_drop = ['marketAverage', 'marketNumberOfTrades', 'minute',
                       'label', 'notional', 'numberOfTrades',
                       'marketHigh', 'marketLow', 'marketVolume',
                       'marketNotional', 'marketOpen', 'marketClose',
                       'changeOverTime', 'marketChangeOverTime']
    response = requests.get(URL, params=PARAMS)
    df = pd.DataFrame(response.json())
    return df.drop(columns=columns_to_drop)
Bibliography
https://doi.org/10.1088%2F1742-6596%2F974%2F1%2F012047.
[5] Fernando Almeida. Big Data: Concept, Potentialities and Vulnerabilities. In:
https://doi.org/10.1145/2723372.2742797.
[9] Matthew Baron et al. Risk and Return in High-Frequency Trading. In: Journal of
Financial and Quantitative Analysis 54.3 (2019), pp. 993–1024. doi:
10.1017/S0022109018001096.
[10] Vinayak R. Borkar, Michael J. Carey, and Chen Li. Big Data Platforms: What's Next? doi:
10.1145/2331042.2331057. url:
https://doi-org.elib.tcd.ie/10.1145/2331042.2331057.
[11] William E. Burns. Efficient-market hypothesis (EMH). In: Salem Press.
Document-Oriented Database (MongoDB) for Big Data Applications. In: 2015 8th
International Conference on Advanced Software Engineering Its Applications (ASEA).
2015, pp. 41–47.
[13] S. Chintapalli et al. Benchmarking Streaming Computation Engines: Storm, Flink and Spark Streaming.
[15] Joel Hasbrouck and Gideon Saar. Low-latency trading. In: Journal of Financial
Markets 16.4 (2013). High-Frequency Trading, pp. 646–679. issn: 1386-4181. doi:
https://doi.org/10.1016/j.finmar.2013.05.003. url:
http://www.sciencedirect.com/science/article/pii/S1386418113000165.
[16] IEX Cloud: Financial Data Infrastructure. url: https://iexcloud.io/.
[17] InfluxDB 1.8 documentation. url:
https://docs.influxdata.com/influxdb/v1.8/.
[18] JPM.N - JPMorgan Chase & Co. Prole. url:
https://www.reuters.com/companies/JPM.N.
[19] Kyong-Ha Lee et al. Parallel Data Processing with MapReduce: A Survey. In:
Spark. In: 2015 IEEE International Conference on Big Data (Big Data). 2015,
pp. 2855–2858.
[22] Migrating Apache Spark Jobs to Dataproc | Migrating Hadoop to GCP. url:
https://cloud.google.com/solutions/migration/hadoop/migrating-
apache-spark-jobs-to-cloud-dataproc.
[23] NoSQL Databases Explained. url: https://www.mongodb.com/nosql-explained.
[24] pandas. url: https://pandas.pydata.org/.
[25] J. Patel. An Effective and Scalable Data Modeling for Enterprise Big Data Platform.
In: 2019 IEEE International Conference on Big Data (Big Data). 2019, pp. 2691–2697.
[26] Andrew Pavlo et al. A Comparison of Approaches to Large-Scale Data Analysis. In:
https://doi.org/10.1145/1559845.1559865.
[27] Plotly Python Graphing Library. url: https://plotly.com/python/.
[28] Pygmalios. pygmalios/reactiveinflux. url: https://github.com/pygmalios/reactiveinflux.
[29] RDD Programming Guide. url: https://spark.apache.org/docs/latest/rdd-
programming-guide.html#resilient-distributed-datasets-rdds.
[30] S. Sagiroglu and D. Sinanc. Big data: A review. In: 2013 International Conference
on Collaboration Technologies and Systems (CTS). 2013, pp. 42–47.
(2016). 1st International Conference on Information Security & Privacy 2015, pp. 224
[36] Hugh J. Watson. Tutorial: Big Data Analytics: Concepts, Technologies, and Applications.
[37] Spark-GPU: An Accelerated In-Memory Data Processing Engine on Clusters. In: 2016 IEEE
International Conference on Big Data (Big Data). 2016, pp. 273–283.