Departamento de Eletrónica,
Universidade de Aveiro Telecomunicações e Informática
2014
Micael José
Pedrosa Capitão
Plataforma de Mediação para a Inserção de Dados
em Hadoop
Mediator Framework for Inserting Data into Hadoop
Departamento de Eletrónica,
Universidade de Aveiro Telecomunicações e Informática
2014
Micael José
Pedrosa Capitão
Plataforma de Mediação para a Inserção de Dados
em Hadoop
Digitally signed by MICAEL JOSÉ PEDROSA CAPITÃO
DN: c=PT, o=Cartão de Cidadão, ou=Cidadão Português, ou=Assinatura Qualificada do Cidadão, sn=PEDROSA CAPITÃO,
givenName=MICAEL JOSÉ, serialNumber=BI134954190, cn=MICAEL JOSÉ PEDROSA CAPITÃO
Date: 2015.01.03 16:30:55 Z
Mediator Framework for Inserting Data into Hadoop
Dissertação apresentada à Universidade de Aveiro para cumprimento dos requisitos necessários à obtenção do grau de Mestre em Engenharia de Computadores e Telemática, realizada sob a orientação científica do Doutor Óscar
Narciso Mortágua Pereira, Professor auxiliar do Departamento de Eletrónica,
Telecomunicações e Informática da Universidade de Aveiro.
o júri / the jury
presidente / president
Prof. Doutor Tomás António Mendes Oliveira e Silva
professor associado da Universidade de Aveiro
vogais / examiners committee
Prof. Doutora Maribel Yasmina Campos Alves Santos
professora associada com agregação da Universidade do Minho
Prof. Doutor Óscar Narciso Mortágua Pereira
professor auxiliar da Universidade de Aveiro (orientador)
agradecimentos /
acknowledgements
Quero deixar o meu agradecimento ao professor Óscar Pereira pela disponibilidade, paciência, e orientação científica ao longo desta dissertação.
Ao João Osório agradeço toda a disponibilidade, amizade, compreensão
e ajuda técnica que, em muito, tornou possível a concretização deste trabalho.
A todos os colegas da PT Inovação e Sistemas que, de alguma forma me
prestaram auxílio e que contribuíram para um ambiente de companheirismo,
espírito de entreajuda e momentos de descontracção. Dos quais, dou um
especial agradecimento ao João Moreto e ao Paulo Vieira pelo esclarecimento de dúvidas relativas ao Altaia e ao Manuel Gonçalves pelo apoio às
tecnologias do ecossistema Hadoop.
Agradeço aos colegas e amigos da Banda Filarmónica Ilhense que, como
grupo, me proporcionou momentos de descontracção, de companheirismo e
de cultura.
À Lígia Venâncio, ao Leandro Pina e ao André Martins com quem partilhei os
bons e maus momentos, deixo o meu profundo agradecimento pela amizade,
compreensão e paciência ao longo desta etapa.
Aos restantes colegas e amigos que sempre me motivaram, o meu obrigado.
Por fim, e o mais importante, agradeço aos meus pais e irmã pelo que sou
hoje e por me terem dado todo o apoio que me permitiu chegar até aqui.
Palavras Chave
Altaia, Hadoop, HDFS, LevelDB, Hive, Impala, KPI, KQI, CDR, EDR, xDR,
base de dados, distribuído, tolerância a falhas.
Resumo
“Dados” sempre foram um dos mais valiosos recursos das organizações.
Com eles pode-se extrair informação e, com informação suficiente, pode-se
criar conhecimento. No entanto, é necessário primeiro conseguir guardar
esses dados para posteriormente os processar. Nas últimas décadas tem-se
assistido ao que foi apelidado de “explosão de informação”. Com o advento
das novas tecnologias, o volume, velocidade e variedade dos dados tem
crescido exponencialmente, tornando-se no que é hoje conhecido como big
data.
Os operadores de telecomunicações obtêm, através de equipamentos de monitorização da rede, milhões de registos relativos a eventos da rede, os Call
Detail Records (CDRs) e os Event Detail Records (EDRs), conhecidos como
xDRs. Esses registos são armazenados e depois processados para deles
se produzirem métricas relativas ao desempenho da rede e à qualidade dos
serviços prestados. Com o aumento dos utilizadores de telecomunicações, o
volume de registos gerados que precisam de ser armazenados e processados
cresceu exponencialmente, inviabilizando as soluções que assentam em bases de dados relacionais, estando-se agora perante um problema de big data.
Para tratar esse problema, múltiplas contribuições foram feitas ao longo dos
últimos anos que resultaram em soluções sólidas e inovadores. De entre
elas, destaca-se o Hadoop e o seu vasto ecossistema. O Hadoop incorpora
novos métodos de guardar e tratar elevados volumes de dados de forma
robusta e rentável, usando hardware convencional.
Esta dissertação apresenta uma plataforma que possibilita aos actuais sistemas que inserem dados em bases de dados relacionais, que o continuem
a fazer de forma transparente quando essas migrarem para Hadoop. A
plataforma tem de, tal como nas bases de dados relacionais, dar garantias
de entrega, suportar restrições de chaves únicas e ser tolerante a falhas.
Como prova de conceito, integrou-se a plataforma desenvolvida com um sistema especificamente desenhado para o cálculo de métricas de performance
e de qualidade de serviço a partir de xDRs, o Altaia. Pelos testes de desempenho realizados, a plataforma cumpre e excede os requisitos relativos
à taxa de inserção de registos. Durante os testes também se avaliou o seu
comportamento perante tentativas de inserção de registos duplicados e perante situações de falha, tendo o resultado, para ambas as situações, sido o
esperado.
Keywords
Altaia, Hadoop, HDFS, LevelDB, Hive, Impala, KPI, KQI, CDR, EDR, xDR,
database, distributed, fault tolerance.
Abstract
Data has always been one of the most valuable resources for organizations.
With it we can extract information and, with enough information on a subject,
we can build knowledge. However, it is first needed to store that data for later
processing. On the last decades we have been assisting what was called
“information explosion”. With the advent of the new technologies, the volume,
velocity and variety of data has increased exponentially, becoming what is
known today as big data.
Telecommunications operators gather, using network monitoring equipment,
millions of network event records, the Call Detail Records (CDRs) and the
Event Detail Records (EDRs), commonly known as xDRs. These records are
stored and later processed to compute network performance and quality of
service metrics. With the ever increasing number of telecommunications subscribers, the volume of generated xDRs needing to be stored and processed
has increased exponentially, making the current solutions based on relational
databases not suited any more and so, they are facing a big data problem.
To handle that problem, many contributions have been made on the last years
that have resulted in solid and innovative solutions. Among them, Hadoop
and its vast ecosystem stands out. Hadoop integrates new methods of storing
and process high volumes of data in a robust and cost-effective way, using
commodity hardware.
This dissertation presents a platform that enables the current systems
inserting data into relational databases, to keep doing it transparently
when migrating those to Hadoop. The platform has to, like in the relational
databases, give delivery guarantees, support unique constraints and, be fault
tolerant.
As proof of concept, the developed platform was integrated with a system
specifically designed to the computation of performance and quality of service
metrics from xDRs, the Altaia. The performance tests have shown the platform
fulfils and exceeds the requirements for the insertion rate of records. During
the tests the behaviour of the platform when trying to insert duplicated records
and when in failure scenarios have also been evaluated. The results for both
situations were as expected.
Contents
Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
i
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
iii
L i s t o f Ta b l e s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
v
List of Listings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
vii
L i s t o f Ac ro n y m s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
ix
List of Figures
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
1
1
3
5
5
. . . . .
. . . . .
. . . . .
System
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
7
7
8
9
13
17
18
20
23
23
Development . . . . . . . . . . . . . . . . . . . .
3.1 Requirements . . . . . . . . . . . . . . . . . . .
3.2 Architecture . . . . . . . . . . . . . . . . . . . .
3.2.1 Overview . . . . . . . . . . . . . . . . .
3.2.2 Initial version . . . . . . . . . . . . . . .
3.2.3 Improved version . . . . . . . . . . . . .
3.3 Implementation . . . . . . . . . . . . . . . . . .
3.3.1 Constraints checking . . . . . . . . . . .
3.3.2 Persistence writer and persistence queue
3.3.3 Table writer and file writer . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
27
27
28
28
29
32
33
33
37
38
1
I n t ro d u c t i o n
1.1 Preamble . .
1.2 Motivation .
1.3 Objectives .
1.4 Structure . .
2
B ac kg ro u n d . . . . . . . . . . . .
2.1 Big Data history . . . . . . . .
2.2 Hadoop . . . . . . . . . . . . . .
2.2.1 Hadoop Distributed File
2.2.2 MapReduce . . . . . . .
2.3 Analytical tools on Hadoop . .
2.3.1 Hive . . . . . . . . . . .
2.3.2 Impala . . . . . . . . . .
2.4 LevelDB . . . . . . . . . . . . .
2.5 Altaia . . . . . . . . . . . . . .
3
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
i
3.3.4
3.3.5
3.3.6
3.3.7
File uploader . .
Table manager .
Bucketing . . . .
Version store and
. . . . .
. . . . .
. . . . .
recovery
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
40
42
43
43
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
47
47
48
49
50
51
53
5
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.1 Work overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
57
57
59
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
61
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
65
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
67
ii
.
.
.
.
.
.
.
.
.
.
.
E va l u at i o n . . . . . . . . . .
4.1 Proof Of Concept . . . . . .
4.1.1 Impala sink . . . . .
4.1.2 Concerns . . . . . .
4.2 The environment . . . . . .
4.3 Testing the initial version . .
4.4 Testing the improved version
Ta b l e s c r e at i o n D D L
.
.
.
.
.
.
.
.
.
.
.
4
A
.
.
.
.
.
.
.
. . . .
. . . .
. . . .
mode
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
List of Figures
1.1
1.2
1.3
Wireless brodband penetration in G7 countries from 2009 to 2013, in number of
subscriptions per 100 inhabitants.1 . . . . . . . . . . . . . . . . . . . . . . . . . .
Total wireless broadband subscriptions by country, in millions of subscriptions.2
Altaia architecture overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Hadoop ecosystem overview.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Architecture of Hadoop Distributed File System (HDFS).4 . . . . . . . . . . . .
Architecture of HDFS showing the files’ blocks distributed across the multiple
DataNodes.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.4 HDFS federation architecture. 6 . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.5 Word count program flow executed with MapReduce.7 . . . . . . . . . . . . . . .
2.6 MapReduce architecture prior Hadoop 2.0.8 There are two MapReduce tasks,
submitted by different clients, running. . . . . . . . . . . . . . . . . . . . . . . .
2.7 Hadoop general architecture discriminating the components belonging to HDFS
and MapReduce.9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.8 Hadoop 2.0 architecture with Yet Another Resource Negotiator (YARN).10 There
are two Application Masters running, each with their own containers and submitted
by different clients. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.9 Differences between Hadoop 1.0 and Hadoop 2.0 concerning the responsibilities of
MapReduce.11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.10 Hive architecture. [35] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.11 Impala architecture [42]. Impala components are in orange and the Hadoop and
Hive ones are in blue. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.12 Diagram of TM Forum Frameworkx’s Business Process Framework (eTOM).12 .
2.1
2.2
2.3
3.1
3.2
3.3
3.4
3.5
Architecture diagram of the initial version. This version was implemented before knowing the performance requirements, which led to an almost completely
synchronous design. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Architecture diagram of the improved version. Asynchronous sending of files and
splitting batches across different buckets have made it possible to fulfil and exceed
the performance requirements. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Unique index example showing different unique keys and a same unique key with
different versions. On the right side several keys are tested against the index to
check whether they exist for a given version. . . . . . . . . . . . . . . . . . . . .
Memory-mapped file organization, showing the positions to store the cursors and
the data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Sequence of steps to put a file into the correct table directory. . . . . . . . . . .
2
3
4
9
10
11
12
14
14
15
16
17
19
22
24
31
32
36
37
40
iii
3.6
3.7
3.8
4.1
4.2
iv
Possible configurations of the pipeline for the persistence queue and unique index
checking. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Location of the Version Store in the mediator framework architecture. . . . . . .
Configuration of the mediator framework when recovering data and its reconfiguration to perform normally after the recovery. . . . . . . . . . . . . . . . . . . . .
Diagram of the existing system in which the mediator framework integrates to
replace the existing insertion mechanism into an Oracle Relational Database
Management System (RDBMS). . . . . . . . . . . . . . . . . . . . . . . . . . . .
Performance with different numbers of buckets. With the increasing number of
buckets, the improvements tend to be less apparent. . . . . . . . . . . . . . . . .
44
45
46
48
54
L i s t o f Ta b l e s
4.1
4.2
4.3
4.4
4.5
Correspondence between components and their short names used in this chapter.
It is presented whether a component is supposed to convert entities to rows when
assembled in the final mediator framework. . . . . . . . . . . . . . . . . . . . . .
Performance of the several components of the mediator. . . . . . . . . . . . . . .
Several tests performed on file-wr component to try to understand how its
performance could be improved. . . . . . . . . . . . . . . . . . . . . . . . . . . .
Results obtained with different numbers of buckets. . . . . . . . . . . . . . . . .
Performance results obtained running the mediator framework for a period of 5
hours. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
50
51
52
53
55
v
List of Listings
3.1
3.2
Interface defining a method to convert an entity into a row. . . . . . . . . . .
Interface defining the internal representation of an entity. The Writer allows
the serialization of an entity into a row. . . . . . . . . . . . . . . . . . . . . .
3.3 Common interface used by the components allowing them to be plugged as
intended to achieve a certain behaviour. . . . . . . . . . . . . . . . . . . . . .
3.4 Entity definition and its schema, table description and write adapter to convert
it to a row. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.5 Interface of the File Uploader. . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.6 Definition of the file upload description used when the File Uploader warns its
listeners. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.7 Hive DDL produced by the Table Manager to create the table recordtable.
A.1 Table creation Data Definition Language (DDL) for records of type VOZ_2G,
composed of 155 columns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
A.2 Table creation DDL for records of type VOZ_3G, composed of 189 columns. . .
30
30
31
34
41
41
42
67
70
vii
L i s t o f Ac ro n y m s
API
Application Programming Interface
MPP
Massively Parallel Processing
CDH
Cloudera Distribution Including
Apache Hadoop
NDFS
Nutch Distributed Filesystem
ODBC
Open Database Connectivity
CDR
Call Detail Record
OSS
Operations Support System
CQM
Customer Quality Management
POC
Proof Of Concept
DAG
Directed Acyclic Graph
POJO
Plain Old Java Object
DDL
Data Definition Language
QoS
Quality of Service
DML
Data Manipulation Language
RDBMS
EDR
Event Detail Record
Relational Database Management
System
ETL
Extract Transform Load
RTT
Round-Trip Time
eTOM
Business Process Framework
SerDe
Serializer/Deserializer
FIFO
First-In-First-Out
SLA
Service-Level Agreement
GFS
Google File System
TCP
Transmission Control Protocol
HDFS
Hadoop Distributed File System
UDF
User-Defined Function
JDBC
Java Database Connectivity
URI
Uniform Resource Identifier
JVM
Java Virtual Machine
URL
Uniform Resource Locator
KPI
Key Performance Indicator
xDRs
CDRs and EDRs
KQI
Key Quality Indicator
YARN
Yet Another Resource Negotiator
ix
chapter
1
I n t ro d u c t i o n
This chapter makes an introduction of the motivation for this dissertation, making an overview of data
storing and processing challenges. Then the objectives are set and finally it is presented the structure
of this document.
1.1
preamble
Data has always been one of the most valuable resources. With it, we can create information,
and with enough information on a subject, we can build knowledge. With that knowledge, people
and organizations can make better predictions, better decisions, thriving in a always more demanding
world.
The “information explosion”, as called in the Lawton Constitution newspaper in 1941 [1][2] is the
acknowledgement that data has been growing at a rate that is making it harder to store, organize and
process. Data growth has always led to technology improvements allowing more and more information
to be created from data, enabling organizations to generate more knowledge and then better predictions
and decisions. In 1997, scientists at NASA published a paper [3] describing the issues they were having
visualizing large data sets that could not fit in main memory not even on local disk. They called that
the problem of big data.
Nowadays big data is a broad term encompassing any collection so large and/or complex that it
becomes difficult to process it using traditional data processing applications [4]. New technologies have
been developed to deal with the big data problem. Google alone have contributed with papers of their
own platform, including a distributed file system, Google File System (GFS) [5], a new processing
paradigm, MapReduce [6], a high performance data storage, BigTable [7] and a scalable and interactive
ad-hoc query system [8]. Based on those papers by Google, several projects have been born being
Hadoop and its vast ecosystem the reference.
1
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
For a telecommunications operator, dealing with millions of Call Detail Records (CDRs) and
Event Detail Records (EDRs) (the xDRs) coming from network monitoring equipments (probes) may
be a challenge. These xDRs need to be stored and processed to gather valuable information to the
business. The xDRs are used, for example, to compute Key Performance Indicators (KPIs) and
Key Quality Indicators (KQIs) allowing a telecommunications operator to know the usage level of its
network infrastructures and how are they performing. With that information it can diagnose network
infrastructure problems and Quality of Service (QoS) problems that would compromise Service-Level
Agreements (SLAs) and, it can plan future network infrastructure upgrades more intelligently.
More recently, operators are concerned about not only how their network infrastructure performs
but how is that performance perceived by a certain client. This user centric approach to the problem,
rather than network centric, is called Customer Quality Management (CQM).
With an always increasing demand for telecommunication services, as shown in Figure 1.1 for
wireless broadband subscriptions, resulting in the millions of users worldwide, depicted in Figure 1.2,
the amount of xDRs a telecommunications operator has to process to calculate KPIs and KQIs, for
example, has grown to big data sizes. This leads to the need of migrating their current data systems,
typically RDBMSs, to more specialized, more scalable and more cost effective systems designed from
the ground up to deal with the big data problem.
Wireless broadband penetration, G7 countries through, Dec. 2013
120
Japan
United States
100
United Kingdom
80
OECD
Italy
60
France
Canada
40
Germany
20
0
Figure 1.1: Wireless brodband penetration in G7 countries from 2009 to 2013, in number of
subscriptions per 100 inhabitants.1
1
Available at: http://www.oecd.org/sti/broadband/1i-BBPenetrationHistorical-G7-201312.xls
2
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
Total wireless broadband subscriptions, by country, millions, December 2013
United States
Japan
Korea
United Kingdom
Italy
Germany
France
Spain
Australia
Turkey
Poland
Canada
Mexico
Netherlands
Sweden
Finland
Czech Republic
Chile
Denmark
Austria
Switzerland
Belgium
Norway
Israel
Greece
Portugal
New Zealand
Ireland
Slovak Republic
Hungary
Estonia
Slovenia
Luxembourg
Iceland
0
50
100
150
200
250
300
Figure 1.2: Total wireless broadband subscriptions by country, in millions of subscriptions.2
1.2
motivation
With the big data movement came the urge to migrate existing systems to that new reality. Many
of those systems integrate with RDBMSs. Specifically, at Portugal Telecom Inovação e Sistemas, it is
developed an Operations Support System (OSS) product, named Altaia, responsible for the processing
of xDRs to then generate metrics: the KPIs and KQIs.
Figure 1.3 shows an overview of Altaia’s architecture. The DBN0 s is where raw data, including
the xDRs, is stored upon being captured from the probes and other external systems. Preprocessed
data coming from the Altaia Mediation systems is also stored in the DBN0 s. In DBN1 it is stored the
dimensional hierarchies with the KPIs and KQIs upon processing DBN0 s’ raw data with the Altaia
Framework. From the Altaia Portal it is possible for a client to request the raw records that have
originated a certain metric, that is, drilling-down back to the original raw data stored in the DBN0 s.
It is at the DBN0 s level that data is getting bigger. At the moment, both DBN0 and DBN1 are
supported by Oracle RDBMS instances but due to cost, scalability and performance requirements, new
alternatives involving big data tools, specifically from the Hadoop ecosystem, are being investigated.
Altaia is an example of a system bound to the SQL query language to interact with its DBN0 s
and DBN1. Moreover, the systems composing the Altaia Mediation are also bound to the features
2
Available at: http://www.oecd.org/sti/broadband/1c-TotalBBSubs-bars-2013-12.xls
chapter 1.
i n t ro d u c t i o n
3
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
RDBMSs offer when inserting data, like delivery and durability guarantees of data and, the checking
of data restrictions like unique key constraints.
Figure 1.3: Altaia architecture overview.
The Hadoop ecosystem has some tools, like Hive and Impala, that will be presented in Sections 2.3.1
and 2.3.2 respectively, that offer a SQL-like language to interact with data stored into Hadoop, more
specifically HDFS, that will be presented in Section 2.2.1. Because of Altaia’s dependency on SQL,
these tools are the straightforward choices to migrate the DBN0 s to Hadoop.
Choosing Hive and Impala facilitates the data querying part. Making mediation systems to
insert data into Hadoop, however, requires extra effort due to the lack of RDBMSs semantics and
guarantees from both Hive and Impala. Those systems can relax the need of using Data Manipulation
Language (DML) statements to insert data but they cannot relax other needs like data delivery and
durability guarantees and the data constraints offered by RDBMSs.
To tackle the issue of inserting data into Hadoop allowing existing systems to more easily migrate
to Hadoop technologies, a solution providing applications the features they expect from a RDBMS
had to be created.
This dissertation was developed during a curricular internship at Portugal Telecom Inovação e
Sistemas that had a duration of six months.
4
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
1.3
objectives
The main goal of this dissertation is the development of a mediation framework to insert data
into Hadoop. It has to provide some features of RDBMSs like data delivery guarantees and data
constraints. The inserted data has to be ready to be queried by both Hive and Impala.
To allow the development of a mediation framework, the first goal is the study and familiarization
with current big data systems and tools like Hadoop, Hive, Impala, HBase, and others that may be
found useful to achieve the main goal.
As Proof Of Concept (POC), the developed framework needs to be tested against an already
existing CQM mediation system (Section 4.1) and fulfil some performance requirements, discussed in
Section 3.1.
At the beginning of the internship the goal was more focused on trying big data tools to do Extract
Transform Load (ETL) jobs and, in trying alternative big data technologies for the DBN0 s. Later,
Impala was imposed as the technology to be used to query data from the DBN0 s and so the main goal
for this dissertation was set up and carried on.
1.4
structure
This document is divided into five chapters. The current chapter presents the motivation of the
work of this dissertation and its goals. The remaining chapters cover the following aspects:
• Chapter 2: presents the background supporting the work on this dissertation;
• Chapter 3: clearly outlines the problem to solve and the requirements for a solution. Then it
presents the architecture and implementation details of that solution;
• Chapter 4: evaluates and discusses the implemented solution, at architectural, implementation
and performance levels;
• Chapter 5: wraps up the dissertation into a brief overview of the developed work and gives
directions of future improvements to be considered.
chapter 1.
i n t ro d u c t i o n
5
chapter
2
B ac kg ro u n d
This chapter starts by giving an historical overview of what is today called big data. Then it describes
the technologies needed for this dissertation, giving a historical overview of them and how they relate to
each other.
2.1
big data history
Dealing with ever increasing amounts of data has always been a problem, forcing new techniques
to be developed to take advantage of the available data and so extract information from it and build
knowledge.
Back in 1880 the U.S. Census took seven years to process the results and by the time it was
completed, they were already obsolete [9]. It was just too many data to process. In 1941 scholars
began to coin the ever increasing amounts of data as “information explosion” [1]. The first warning of
data’s storage and retrieval issues was made in 1944 when Fremont Rider estimated the American
university libraries were doubling in size every sixteen years [10].
In the early 1960, Derek Price observed that the amount of scientific research was too much for
humans to keep abreast of [2] and that the abstract journals created in the late 1800s as a way to
manage the knowledge base were also growing at the same trajectory, multiplying by a factor of ten
every half-century. New ways of keeping and organizing data, information and knowledge were needed.
During the 60s yet, organizations began automating their inventory systems to centralized computing systems. In 1970 Edgar Codd revolutionized the databases with the relational model [11], allowing
users of large data banks to access data without having to know its internal representation.
As storage capacity and data continued to grow, Parkinson’s law [12] was paraphrased by Tjomsland
in 1980 as “Data expands to fill the space available” [13] because data was being retained as users no
longer were able to identify obsolete data.
7
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
The need for consistent storage of historically complete and accurate data resulted in Barry Devlin
and Paul Murphy defining, in 1988, an architecture for business reporting and analysis [14], which
became the foundation of data warehousing.
In 1997, scientists at NASA published a paper [3] describing the issues they were having in
visualizing large data sets that could not fit in main memory not even on local disk. They called that
the problem of big data. Since then, the term started to gain popularity and be applied any time it
was at hands a problem involving large data sets that could not possibly be stored and processed by an
organization. In that same year, Michael Lesk concludes there may be a few petabytes of information
in the world and that by the year 2000 there would be enough disk and tape to save everything [15].
He has been proved wrong.
In 2001, Doug Laney describes the 3Vs [16] - Volume, Velocity and Variety - as dimensions describing
data management solutions. The 3Vs are today the generally accepted big data characteristics.
Several studies [17] [18] have tried to measure how much information is there and predict the rate
of data growth, analysing what was causing it, that is, find out from where was that data coming from.
Later in 2008, the term big data is popularized, predicting that “big-data computing” will “transform
the activities of companies, scientific researchers, medical practitioners, and our nation’s defense and
intelligence operations.” [19].
2.2
hadoop
Doug Cutting had a goal to build a web search engine from scratch and so he started the development
of Lucene. While developing Lucene’s web crawler (now with Mike Cafarella too), called Nutch, they
realized their architecture would not scale to the billions of pages on the Web [20]. With the help of
a Google paper published in 2003 describing their Google File System (GFS) [5], they thought that
something like it would solve their storage needs for the large files produced as part of the web crawl
and indexing processes [20]. And so, they started writing an open source implementation of Google
File System (GFS), the Nutch Distributed Filesystem (NDFS).
Later in 2004, Google (again) published a paper introducing MapReduce [6], a new programming
model and an implementation for processing and generating large data sets. The Nutch developers
started working immediately on an implementation of MapReduce and by 2005 they already had
ported Nutch’s algorithms to run using MapReduce and NDFS.
Because NDFS and Nutch’s implementation of MapReduce were being used beyond Nutch’s
purpose, crawling and indexing, they moved them out of Nutch, creating an independent subproject of
Lucene, and so, in 2006, named after Doug Cutting’s son’s elephant toy, Hadoop has born. The NDFS
was renamed to Hadoop Distributed File System (HDFS).
On that same year Doug Cutting moved to Yahoo!, which provided a dedicated team and resources
8
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
to continue the development of Hadoop, turning it the technology used by Yahoo! to generate their
search index.
In 2008, Hadoop was made a top-level project at Apache, confirming its success. Hadoop itself is
composed of HDFS and of an implementation of MapReduce. However, the term is also used “for a
family of related projects that fall under the umbrella of infrastructure for distributed computing and
large-scale data processing” [20], forming the Hadoop ecosystem.
The Hadoop ecosystem is composed of several tools (coming from other projects) helping in
different tasks. There are data warehousing tools, analytical tools, data importing and exporting tools
between RDBMS and Hadoop, data capturing tools, NoSQL databases with columnar storage, system
coordination and management tools, etc. Figure 2.1 presents some of the projects composing the
Hadoop ecosystem and their roles.
Figure 2.1: Hadoop ecosystem overview.1
2.2.1
Hadoop Distributed File System
The GFS article [5] presented a distributed file system for large distributed data-intensive applications, providing fault tolerance while running on commodity hardware. It describes an architecture
composed of a single master and multiple chunkservers. In GFS files are divided into fixed-size chunks,
being each chunk identified by a unique handle assigned by the master at the time of the chunk creation.
Chunkservers store the chunks on local disks and read or write chunk data specified by a handle and a
byte range. The master maintains all the file system metadata.
1
Available at: http://dbaquest.blogspot.pt/2013/08/hadoop-eco-system-map.html
chapter 2.
b ac kg ro u n d
9
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
One of the key designs parameters of GFS is its chunk size of 64 MB. Having a large chunk size
minimizes the need to interact with the master in scenarios where big files are read and/or written
sequentially. The design of HDFS was based on the following assumptions [20] [21] [22] borrowed from
the GFS design:
• Many expensive commodity hardware that often fail;
• Very large files. “files that are hundreds of megabytes, gigabytes, or terabytes in size”;
• Streaming data access: “write-once, read-many-times pattern”;
• High sustained bandwidth is more important than low latency.
Figure 2.2 shows the HDFS architecture. According to [21], HDFS has a master/slave architecture
consisting of a NameNode, which is a master server that manages the file system namespace and
controls the access to files, and multiple DataNodes, which manage the storage of the nodes they run
on. The file system namespace exposed to clients allows them to access data as files when, in reality,
those files are split into blocks stored across the cluster’s DataNodes. When a file is loaded into HDFS,
it is broken up into blocks, as shown in Figure 2.3, which are stored across the cluster’s DataNodes. As
a fault tolerance measure, each block is replicated (with a factor of 3 by default) across different nodes.
The block size and replication factor are configurable per file.
Figure 2.2: Architecture of HDFS.2
The NameNode manages the file system NameNode operations like opening, closing and renaming,
and manages the mappings between file’s blocks to DataNodes. The DataNodes serve reads and writes
requests from the clients.
2
10
Available at: http://edu-kinect.com/blog/2014/06/16/hdfs-architecture/
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
HDFS has a hierarchical file organization in directories, being files identified by path-names. Files
and directories can be created, removed, moved and renamed. These operations affecting the NameNode
are guaranteed to be atomic.
Figure 2.3: Architecture of HDFS showing the files’ blocks distributed across the multiple DataNodes.3
One of the concepts in HDFS its the block. In file systems for a single disk, the block is the
minimum amount of data that can be read or write and so files’ sizes are integral multiples of the block
size, typically 512 bytes. In HDFS there is the concept of block too, each one of 64 MB by default, so
files are split into block-sized chunks, being each one stored as independent units. If a file in HDFS is
smaller than the block size - and unlike files systems for a single disk - the block will not occupy “a
full block’s worth of underlying storage” [20].
namenode limitations
The HDFS architecture consists of a single NameNode. In order to keep the rate of metadata
operations high, the whole namespace is kept in RAM [23]. The NameNode stores persistently the
namespace image and its modification log (from Figure 2.3, the FsImage and EditLog respectively).
This architecture has a single point of failure, the NameNode, and does not scale horizontally as the
whole namespace is managed by that single node. Currently, both issues have been mitigated, being
those mitigation mechanisms briefly described in this subsection.
Prior to Hadoop 2.0 (currently 2.5), the NameNode was a single point of failure and so, if the
machine or process running the NameNode went down, the whole cluster would be unavailable until
the NameNode was either restarted or brought up on a separate machine [24]. Now, two separate
machines are configured as NameNodes. At any point in time one of the NameNodes is in active state
and the other is in standby mode. The active one is responsible for all the operations while the other is
3
Available at: http://www.revelytix.com/?q=content/hadoop-ecosystem
chapter 2.
b ac kg ro u n d
11
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
simply a slave, maintaining enough state to provide a fast automatic failover, if necessary [25]. This is
called HDFS High Availability and has allowed to mitigate the single point of failure in the NameNode.
To allow to scale the NameNode horizontally, HDFS federation, introduced with Hadoop 2.0, uses
multiple independent NameNodes (as shown in Figure 2.4), each with its own namespace [26]. The
NameNodes do not require coordination with each other. Each NameNode has a block pool managed
independently of other block pools. The DataNodes are used as common storage by all the NameNodes.
To seamlessly access this federated architecture, at the client side, there is a mount table (using
ViewFs [27]), so different top level directories are served by different namespaces.
Figure 2.4: HDFS federation architecture.
4
The HDFS federation provides scalability and isolation, allowing, for example, to have different
namespaces for different applications so, a misbehaving application overloading a NameNode would
not interfere with other applications.
accessing HDFS
Interacting with HDFS is done through its Java Application Programming Interface (API) or by a
shell utility. Those interactions imply communicating to the cluster through Transmission Control
Protocol (TCP), so network is always involved.
From the Figure 2.2, when a client wants to read data from HDFS, it first contacts the NameNode
so it can know in which nodes the file’s blocks are. Then the client retrieve those blocks directly
from the DataNodes. When creating a file, the file data is first staged locally at the client slide and
when there is enough data to be worth sending (one HDFS block size), the client then contacts the
namenode, which inserts the file name into the file system hierarchy and allocates a block for it in
a DataNode and answers the client with the identity of the DataNode to where the block should be
4
Available
at:
http://www.edureka.co/blog/overview-of-hadoop-2-0-clusterarchitecture-federation/
12
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
sent. The client flushes the block to the DataNode and after that tells the NameNode the file is closed,
which commits the file creation.
A file/directory in HDFS is identified by an Uniform Resource Identifier (URI) of the format
hdfs://<namenode>/<path> (example hdfs://namenode.host.pt:8020/user/capitao/file).
2.2.2
mapreduce
MapReduce [6] is a programming model and an implementation for processing and generating
large data sets in which users write a map function that processes a key/value pair to generate a set of
intermediate key/value pair and, a reduce function that merges all the intermediate values associated
with the same intermediate key. It was presented by Google in 2004, followed by an open source
implementation from the then Nutch team. That implementation is now part of Hadoop.
As an example using MapReduce, consider the problem of counting the number of occurrences
of each word in a large collection of documents. The user has to write code similar to the following
pseudo-code [6]:
reduce(String key, Iterator values):
map(String key, String value):
// key: a word
// key: document name
// values: a list of counts
// value: document contents
int result = 0;
for each word w in value:
for each v in values:
EmitIntermediate(w, "1");
result += ParseInt(v);
Emit(AsString(result));
The map function is executed by the mapper instances and the reduce function is executed by the
reducer instances. Figure 2.5 shows the execution of the word count program. First, the input is split
line by line and then, each line goes to a different mapper in which, for each word in the line, it emits
a tuple containing the word (the key) and a counter (the value) starting with 1. In the next stage,
the tuples are shuffled (by key) into a logical order (in this case in alphabetical order). The goal of
shuffling is to have all the tuples with a same key going to the same reducer. The reducers receive, for
each key, all the values present in the tuples and, in the word count example, they simply sum all the
values for a key and emit a resulting tuple consisting of the key and the sum, which is the number of
occurrences of that key.
MapReduce takes advantage of files blocks locality in HDFS, so when a MapReduce task is
scheduled, the map part, which is the one that reads the input files, is set up to run on a machine
containing the corresponding input data. This is important because when having limited bandwidth,
it is cheaper to move computation to the data rather than moving data to the computation.
chapter 2.
b ac kg ro u n d
13
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
Figure 2.5: Word count program flow executed with MapReduce.5
architecture
In Hadoop, prior to version 2.0, a MapReduce job was supported by a dedicated component
responsible for coordinating the computation across the nodes of the cluster, splitting the work across
them. On each node of the cluster, the assigned work was handled and tracked by another component.
Figure 2.6 shows the MapReduce architecture existing in those older Hadoop versions.
Figure 2.6: MapReduce architecture prior Hadoop 2.0.6 There are two MapReduce tasks, submitted
by different clients, running.
5
6
Available at: http://www.alex-hanna.com/tworkshops/lesson-5-hadoop-and-mapreduce/
Available at: http://blog.spryinc.com/2013/11/hadoop-fundamentals-yarn-concepts.
html
14
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
From Figure 2.6, the Job Tracker is responsible of accepting jobs from clients, scheduling them
for execution, distributing the map and reduce tasks across the worker nodes, handling task failure
recovery, and tracking the job status [28]. Each worker node has a Task Tracker responsible of spawning
the map and reduce tasks according to the Job Tracker’s instructions and reporting status back to it.
As mentioned in Section 2.2.1, in HDFS the files are split into blocks each one going to, possibly,
different DataNodes. When reading data from HDFS it is more advantageous to place the reading
processes (in the case of MapReduce, the maps) running on the same nodes where the blocks of the
files they want to read are locally stored. This means a worker/slave node in Hadoop has both a
DataNode and a Task Tracker, as shown in Figure 2.7. When the Job Tracker is assigning the map
tasks, it takes into consideration which DataNodes have the blocks that are to be read, assigning the
maps to the Task Trackers of the nodes having those DataNodes.
Figure 2.7: Hadoop general architecture discriminating the components belonging to HDFS and
MapReduce.7
The Job Tracker in this MapReduce architecture is a single point of failure and poses restrictions
to scalability [29]. Besides that, it is also the resource manager of the cluster and is only able to cope
with MapReduce. To tackle these issues, beginning in Hadoop 2.0, it was introduced the YARN.
Yet Another Resource Negotiator
Yet Another Resource Negotiator (YARN) is a resource management and execution framework. It
separates the resource management from the workload management, previously carried out by the Job
Tracker. This way, there is a generic resource management and execution framework, and MapReduce
is just one data processing application that can be run on top of it.
7
Available at: http://www.revelytix.com/?q=content/hadoop-ecosystem
chapter 2.
b ac kg ro u n d
15
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
The job of YARN is scheduling jobs on a Hadoop cluster [29]. To do that, it introduces some
new components, shown in Figure 2.8: the Resource Manager, an Application Master, application
containers and Node Managers.
Figure 2.8: Hadoop 2.0 architecture with YARN.8 There are two Application Masters running, each
with their own containers and submitted by different clients.
The Resource Manager is a scheduler which arbitrates all available cluster resources among
competing applications [30]. The Application Master is an instance of a framework specific library (for
example MapReduce) and is responsible for requesting and managing Containers [29], which grant
right to an application to use a specific amount of resources (CPU, memory, disk, etc.) on a specific
worker/slave node. The Application Master is itself a Container. The Node Manager is responsible
for the Containers on the host it is running, monitoring its resource usage and reporting back to the
Resource Manager.
In YARN, an application, according to [30] and [20], starts by requesting the Resource Manager
for a container to run its Application Master. The Resource Manager gets a Container and launches
the Application Master on that Container. After that, the now running Application Master negotiates
appropriate containers with the Resource Manager and launches them by providing their definition to
the corresponding Node Managers. After an application is done with its job, the Application Master
deregisters with the Resource Manager and shuts down, freeing its own Container.
In the case of MapReduce, each job is a new instance of an application[31] so each MapReduce job
starts by launching an Application Master (which corresponds to the older Job Tracker but without
the resource management of the cluster). The Application Master in MapReduce requests Containers
8
Available at: https://hadoop.apache.org/docs/r2.5.2/hadoop-yarn/hadoop-yarn-site/
YARN.html
16
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
as needed for the several map and reduce tasks that need to be executed. The containers for the map
and reduce tasks are freed after being no longer needed by the job. After the job is complete, the
Application Master container is freed too.
In YARN, the Resource Manager has become a single point of failure because if it is unavailable, no
applications can be scheduled to run on the cluster. To solve that, and like the HDFS High Availability
referred in Section 2.2.1, it is possible to have two Resource Managers, one in the active state and
another in standby, allowing for a fast failover [32][33].
Summing up, prior to Hadoop 2.0, MapReduce was treated like a first-class data processing
framework requiring specific components to run on the cluster to handle resources management and
tasks assignment and execution. Other data processing tools wanting to use the cluster’s resources
would need to have their own mechanisms of resource management and execution. In Hadoop 2.0, with
the introduction of YARN, MapReduce became a data processing tool running on top of a generic
resource management and execution framework. This allows to run any data processing tool on top of
a Hadoop cluster, having its resources efficiently managed. Figure 2.9 shows the functional change
between Hadoop 1.0 and Hadoop 2.0.
Figure 2.9: Differences between Hadoop 1.0 and Hadoop 2.0 concerning the responsibilities of
MapReduce.9
2.3
analytical tools on hadoop
Hadoop brought what was needed to store and process large amounts of data with HDFS and
MapReduce respectively. However, writing MapReduce programs is generally too low level and rigid,
time consuming and error prone [34]. To mitigate those issues, several analytical tools running on
Hadoop have appeared: some use MapReduce, others do not. The following sections will present some
of the analytical tools relevant for this dissertation.
9
Available at:
distributed.html
chapter 2.
http://www.natalinobusa.com/2014/02/hadoop-20-beyond-mapreduce-
b ac kg ro u n d
17
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
2.3.1
hive
The amount of log and dimension data in Facebook that needs to be processed and stored has
increased with the increasing of usage of the site. At that time they started to experiment with Hadoop
as a replacement for their current solution based on Oracle. The results with Hadoop and MapReduce
were promising [20]. The problem with using MapReduce is because it is too low level, requiring
developers to write custom programs which are hard to maintain and reuse. At the same time it was
missing the ability of expressing common computations in the form of SQL, a language in which most
engineers and analysts are familiar with. From that need, Hive [35] was born.
Hive is an open source, now a subproject of Hadoop, data warehousing and SQL infrastructure
built on top of Hadoop. It can run SQL queries on data stored in HDFS by automatically generating
and running MapReduce jobs so, in some way, it improves the usability of Hadoop. Because Hive
provides a SQL-like language, the HiveQL, it is well positioned to integrate with already existing
systems that can only speak SQL.
Hive structures data into well known RDBMS concepts, like databases, tables, columns, rows, and
partitions. The supported data types range from the primitives integers, floats, doubles and strings, to
the more complex types like maps, lists and structs, being the latter allowed to be nested arbitrarily to
construct more complex types [35]. The users can extend Hive with their own types and User-Defined
Functions (UDFs).
Like RDBMSs, Hive stores data in tables, having each table a set of rows, and each row is
composed of a specified number of columns, having each column an associated data type [35]. When
creating a table in Hive, the DDL (example in Listing A.1) specifies the schema for the table, a
Serializer/Deserializer (SerDe) and input and output formats. The SerDe [36][37] is responsible of
serializing and deserializing the output and input data based on the data types specified in the schema,
for a given file format. For instance, if the file is a text file and an integer field is to be read, the SerDe
has to parse the text corresponding to that field as an integer value. It such a field is to be written,
the SerDe converts that integer into its textual representation.
The input and output formats are related to the file format in which data files are stored. Hadoop
files can be stored in different formats and the file input/output formats specify how records are stored
in those files. For instance, in a text file, the input/output format uses, by default, the rows delimited
by a newline and the columns by a ctrl-A [35]. When reading a text file, the rows and the columns
are constructed by making the splits on those delimiters. When writing, the text file is constructed
by concatenating all the rows delimited by a newline, in which the columns of each row have been
concatenated delimited by a ctrl-A.
Unlike RDBMSs, in which updates, transactions and indices are certain, Hive has not included
those features until recently [20]. That is because Hive was built to run over HDFS using MapReduce,
“where full-table scans are the norm and a table update is achieved by transforming the data into a
new table”. Hive still does not support updates (or deletes) but is does support inserting new rows
18
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
into an existing table, however, that creates a new file (generally small comparing to the block size in
HDFS) for each insert statement, which for Hadoop, in general, is a suboptimal approach for both
HDFS and MapReduce performance. Hive supports indices, allowing it to scan only the needed files to
serve certain queries. For instance, the query SELECT * FROM t WHERE x=a can take advantage of an
index on column x. However, these indices are aimed at speeding up queries and not guaranteeing
unique key constraints, for example.
Figure 2.10: Hive architecture. [35]
Architecturally, from Figure 2.10, Hive is composed by the following components [35]: (1) the
Metastore which stores the system catalog and metadata about tables, columns, partitions, etc.; (2)
the Driver which manages the lifecycle of a HiveQL query as it progresses through Hive, compiles
HiveQL into a Directed Acyclic Graph (DAG) of MapReduce jobs and executes those jobs, interacting
with the Hadoop cluster; (3) the HiveServer which includes a Thrift [38] interface and a Java Database
Connectivity (JDBC) and Open Database Connectivity (ODBC) server, providing a way of integrating
Hive with other applications and (4) the client components like the Command Line Interface and the
Web Interface.
The Metastore is the system catalog for Hive as it stores all the information about tables, including
their partitions, schemas, locations, etc. That information can be queried and modified by several
applications, not only Hive. Because the data in the Metastore need to be served fast to the query
compiler (inside the Driver), it is backed by a RDBMS because of the lower latency. It is the Metastore
that imposes structure on Hadoop files. Without it, everything is just files inside directories with no
real structure or meaning.
chapter 2.
b ac kg ro u n d
19
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
data storage
Hive has a simple, yet effective way of organizing its data. A table’s files reside into the same
HDFS directory. A table is allowed to be partitioned, which is a way of dividing a table into parts
based on the values of partitioning columns. Each partition goes to a different subdirectory of the
table’s directory, allowing queries to be faster when filtering by partitioning columns values because
only the needed partitions are scanned for the query, reducing the amount of data read from HDFS.
By default, when creating tables, Hive stores them in its warehouse directory, assuming the data
on those tables is managed by it, that is, Hive owns them. However, there are use cases in which data
is in some location and is managed by other means different from the Hive ones. If we want Hive to
query that data without assuming ownership of it, and thus allowing Hive to query data outside its
warehouse too, we can create a table as external.
data insertion
Hive has limited support for insertions through DML queries [20]. Hive supports inserting data from
the results of another query (using a INSERT INTO|OVERWRITE TABLE <table> [PARTITION(...)]
<select statement> statement [39]) and supports loading already prepared data files into its tables (using a LOAD DATA [LOCAL] INPATH <path> [OVERWRITE] INTO TABLE <table> [PARTITION(...)]
statement [39]). However, if one tries to insert single records (using a INSERT INTO TABLE <table>
[PARTITION(...)]
VALUES <values>), that results in the creation of a single file for each record,
which is highly inefficient for storage in HDFS and for the reading performance.
Another way of inserting data into a Hive table is by placing the data files directly into the
right directories for that table and/or partition and then, instruct Hive, using a MSCK REPAIR TABLE
<table> statement, to scan HDFS for newly added files and partitions, adding them to the metadata
of that table and so, they will be considered for future queries on that table [40].
2.3.2
impala
In 2010, Google published the Dremel [8] paper, describing a scalable and interactive ad-hoc query
system, combining multi-level execution trees and a columnar data layout. According to the paper,
Dremel is capable, unlike traditional databases, of operating on data “in place”, using for example the
GFS or BigTable [7]. It is able to execute queries that usually would require a MapReduce job to be
run but at a fraction of the execution time. The queries are executed using a SQL-like language.
Based on the ideas of the Dremel paper, Cloudera10 , a company providing Hadoop-based software,
has developed Impala. According to [41], Impala is a Massively Parallel Processing (MPP) query
10
20
http://www.cloudera.com/
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
engine for Hadoop. It can read data already existing in Hadoop, allowing it to share the same data
with tools like Hive, with no need of duplicating or converting any data. The main difference of Impala
from the other Hadoop tools is the response time. In spite of using and reusing already existing Hadoop
components, Impala does not use the MapReduce engine most Hadoop tools use. Because of that,
Impala is not only positioned for batch queries, but for interactive near real-time queries too [42].
Hive already provides a Metastore with metadata about tables. To not redesign the wheel and to
be easier to use Hive and Impala interchangeably, Impala uses the Hive Metastore as tables’ metadata
repository. According to [41], Impala and Hive tables are highly interoperable, allowing to switch
between performing batch operations with Hive and performing interactive queries with Impala, on the
same tables. Furthermore, Impala also uses the same SQL syntax (HiveQL) and same JDBC/ODBC
driver as Hive [42]. Typically, using Impala instead of Hive requires just to change the connection
Uniform Resource Locator (URL). Although sharing the HiveQL, there are some semantic differences
between Hive and Impala [43] that need to be addressed by the client applications.
Figure 2.11 shows the Impala architecture, including the integration with already existing Hadoop
ecosystem services. The Query Planner, Query Coordinator and Query Exec Engine make part of the
Impalad (Impala Daemon) (not shown in the Figure), which is a daemon process running on the same
nodes as the DataNodes of HDFS. The impalad reads and writes data files, accepts queries, parallelizes
the queries and distributes work to other nodes of the Impala cluster, transmitting the intermediate
query results back to the central coordinator node [44]. During the query planning, the Query Planner
uses the tables’ metadata from the Hive Metastore and uses information about files’ blocks from the
HDFS NameNode, so a query can be scheduled to execute on the nodes containing locally the data
blocks. A query can be submitted to any running impalad, being the node that receives the query, the
coordinator node for that query. The other nodes transmit partial results back to the coordinator,
which constructs the final result set of a query [44].
The impalad instances are in constant communication with the State Store. The State Store is
used to check and maintain the health status of all impalad instances on the cluster so all nodes know
which nodes are ready to accept work. In case of a State Store failure, the cluster continues to work
normally, however, less robust because a node may attempt to schedule work on a failed node.
Another component making part of the Impala cluster is the Catalog Service (not shown in the
Figure too). Because retrieving all the metadata for a table (from the Hive Metastore) can be time
consuming, impalad instances cache information about the tables for which they have run queries
recently. Formerly, when executing statements that change a table’s metadata, only the coordinator of
that statement would be aware of such changes, so to make all the other nodes aware of them, they
had to invalidate or refresh their caches, loading the metadata from the Metastore again. To mitigate
that issue, the Catalog Service propagates all metadata changes to all the Impala cluster nodes. This
way, when executing metadata-changing statements through Impala, all the cluster is immediately
aware of them.
When a metadata-changing statement is executed through Impala, the Hive Metastore is immedichapter 2.
b ac kg ro u n d
21
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
ately updated with that new information. However, if such statement is run through Hive, Impala
nodes do not recognize the changes, needing their caches to be invalidated (when creating new tables)
or refreshed (when changing metadata on an already existing table) [44].
Figure 2.11: Impala architecture [42]. Impala components are in orange and the Hadoop and Hive
ones are in blue.
parquet file format
The Dremel [8] paper described a columnar data layout for data. The columnar-oriented layout
offers several advantages over the row-oriented one [45], allowing to use less Input/Output as only the
needed columns are read and to save storage space because the columnar layout compresses better.
Designed and implemented in collaboration between Twitter and Cloudera, the Parquet file format
was built with nested data structures in mind and uses the record shredding and assembly algorithm
described in the Dremel paper [46]. Currently, Parquet is supported across the Hadoop ecosystem,
with special relevance to Hive and Impala.
To allow to use less Input/Output while reading, Parquet supports projection push down, so only
the needed columns are accessed [47]. Parquet also supports several encoding schemes for different
scenarios, being them the Bit Packing, the Run Length Encoding and the Dictionary encoding.
According to [48], “Parquet is especially good for queries scanning particular columns within a
table, for example to query "wide" tables with many columns, or to perform aggregation operations
such as SUM() and AVG() that need to process most or all of the values from a column.”. Parquet is
in fact the recommended file format to be used with Impala as it is the one that best matches the kind
of workloads for what Impala was made for.
Creating a Parquet file, according to [48], is a memory-intensive operation because the incoming
data is buffered until it reaches one data block in size. Then that chunk of data is organized and
compressed in memory before being written out to the file. The Parquet file format is optimized to
22
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
work with large data files, typically 1 GB each [49]. When dealing with small files, the performance
advantages of Parquet are not apparent.
2.4
leveldb
In 2006, Google published a paper describing a distributed storage system, running on top of
GFS for managing structured data, the BigTable [7]. BigTable is a sparce, distributed, persistent
multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp. Each
value in the map is an uninterpreted array of bytes. One important aspect is that BigTable maintains
data in lexicographic order by row key, which allows for fast random accesses. Based on these concepts,
several projects have been born, namely HBase [50] and LevelDB [51][52], being the later from the
same authors of BigTable.
According to [53], LevelDB is “an open-source, dependency-free, embedded key/value data store”.
It was developed in 2011 by Jeff Dean and Sanjay Ghemawat, the authors of BigTable, borrowing
ideas (but not code) from BigTable. According to [54] and [53], it’s goal was to replace SQLite [55] as
the backing store for Chrome’s IndexedDB [56] implementation.
In LevelDB, keys and values are arbitrary byte arrays and data is, like in BigTable, sorted
lexicographically by key (which is useful for querying it later). It supports batch writes and data can
be traversed forward and backwards. All the data is automatically compressed using Snappy [57].
It does not support SQL, being its basic operations: Put(key,value), Get(key) and Delete(key).
Being an embedded store, an instance of LevelDB can only be accessed by one process at a time.
The Put and Delete operations can be executed in batch and that batch execution is guaranteed
to be atomic [53]. Getting data from LevelDB is done through iterators. Iterators can start at any
specified key and, in case the key does not exist, it jumps to the next key coming lexicographically
after the specified one.
LeveDB features, combined with its simplicity to be embedded into an application, makes it
particularly interesting, for this dissertation, to be used as an index.
2.5
altaia
Altaia is a product from Portugal Telecom Inovação e Sistemas aimed at the management of
performance and QoS of telecommunications networks and services. It fulfils the Quality and Performance Management assurance areas of the service management & operations of eTOM (as depicted in
Figure 2.12), which is part of the TM Forum Frameworkx guidelines [58].
chapter 2.
b ac kg ro u n d
23
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
Figure 2.12: Diagram of TM Forum Frameworkx’s eTOM.11
Altaia provides traffic and network performance measurements, QoS and network and services
usage measures, service guarantee analysis, threshold alarms generation, network and service metrics
management (KPIs and KQIs) and provides SLA management. Architecturally, Altaia is composed of
four main subsystems: the Altaia Framework, the Altaia Portal, the Altaia Mediation and the Manager.
Figure 1.3 presents an overview of that architecture.
The Altaia Mediation is responsible for the self-discovery of the network, detecting problems with
data collecting from the network monitoring devices, and it is responsible for inserting the collected
data into a normalized (for the usage Altaia gives to it) database, DBN0, which is used by the other
Altaia modules. The mediation is also ready to plug custom adapters that allows it to communicate
with heterogeneous network devices. The DBN0 is where the raw data, including the CDRs are stored
upon being captured from the probes and other external systems.
The Altaia Framework is the central piece of the system. It contains the functionality allowing
it to define and process metrics from data collected from the DBN0s, generating KPIs and KQIs.
The generated metrics are stored into a data warehouse, the DBN1, allowing for a hierarchical and
dimensional view of the data. The Altaia Framework also detects SLA violations and behavioural
changes, notifying those situations.
The Altaia Portal is the entry point for users wanting to access performance and QoS reports, SLA
11
Available
at:
http://www.tmforum.org/Models/Frameworx14/main/
diagramac379ad6e0054204b29009c3d82ff997.htm
24
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
violations and other relevant alarms. From the Altaia Portal it is also possible for a client to request
the raw records that have originated a certain metric, that is, drilling-down back to the original raw
data stored in DBN0.
Currently, as mentioned in Section 1.2, both DBN0 and DBN1 are supported on top of Oracle
RDBMSs. The current requirements regarding data storage and processing capabilities are posing new
challenges to the current implementation. Because of that, big data alternatives for both DBN0 and
DBN1 are being studied.
chapter 2.
b ac kg ro u n d
25
chapter
3
Development
This chapter describes the proposed solution that allows the insertion of data into Hadoop to be queryable
by both Hive and Impala. The solution has delivery guarantees and ensures unique key constraints.
First the requirements are presented and an overview of the solution’s architecture is shown, pointing
out its components and general functionality. At last, implementation details of the solution are shown
and explained.
3.1
requirements
Inserting data into Hadoop is done through the creation of files. Those files should be as big as
possible to take advantage of HDFS and MapReduce in general. In HDFS, small files pose problems
to the NameNode as it has to maintain information about more files in memory and, pose performance
problems too as HDFS is designed for batch processing in which high throughput of data is more
important than low latency on data access.
Specifically to Hive, it is only an engine that by using MapReduce is capable of reading HDFS files
and treat their contents as rows of a table, which makes it possible to use a SQL-like language, the
HiveQL, to query data stored in HDFS. Hive has support for inserting data through DML statements.
However they are limited to insertions from query results, which allows inserting multiple rows at a
time and, inserting one single row at a time. In the first case, the size of the resulting files depend on
the size of the data to be stored, depend on the configured block size and, depend on the number of
reducers for that operation. On the second case, for each insert statement, a single file containing the
data related to it is created, resulting in too many small files which is bad for both the namenode and
for the overall performance of insertion and reading from HDFS.
Many applications are more or less tightly coupled to the insertion behaviour of a RDBMS that is
capable of handling batched (or not batched) insertions of single rows and at the same time abstract
27
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
and manage the underlying data. Adding to that, RDBMSs also guarantee that once an insertion
operation returns and reports success, the rows are safely stored and, is also able to guarantee keys
constraints, like the unique key constraints.
Adding support for batch insertions and keys constraints directly to Hive is not the intended
solution but to mitigate the above issues, a mediator responsible of inserting data into Hive, has to
ensure non duplicated keys and the durability of the inserted rows so when a caller returns from an
insertion operation it is guaranteed, like in RDBMSs, the data is safe. Other functional requirement
include managing Hive’s tables, that is, creating and refreshing them so Hive can recognize the newly
added files and; partition data in a configurable manner. Despite configurable, the used and tested
partitioning scheme is time based, by year, month, day and hour.
The mediator has to support flushing in configurable intervals of time so that applications expecting
data to be available in Hive in certain time intervals would see it.
At the beginning of the development there were no performance requirements for the insertion so
an initial functional version was made. Later, because of the POC involving CQM and Altaia, some
changes had to be made to meet the performance requirements. The performance requirement is the
ability to insert at least 10, 000 rows per second of the types VOZ_2G and VOZ_3G (see Appendix A),
which is the expected throughput of the CQM Mediation Module used in the POC. As an another
requirement to the POC, the inserted data should use the Parquet file format and be ready to be
queryable through Impala too.
3.2
architecture
3.2.1
overview
By analysing the problem, it was possible to subdivide it into several sub-problems that could be
handled independently of each other, leading to a solution composed of small independent components,
each one performing a simple and well defined task. In runtime these components are glued together
forming a processing pipeline. There are no restrictions of the ordering of the components in the
pipeline, however only well defined and logical (for the goal) combinations will have the desired
outcomes. At the time of this writing there were only two combinations: one for the insertion of data
and another to the data recovery mechanism, both being described in this chapter. More crucial and
specific details on the implementation will be covered on the Section 3.3. From now on, the solution
will be referred as the mediator.
Because this mediator may be used by any system wanting to insert data into Hadoop to be usable
by Hive, it has to implement its own data validation mechanism to ensure clients do not try to insert
data that would cause issues later when trying to query it through Hive, that is, the data being inserted
28
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
should follow a certain schema.
This mediator requires clients to provide it a table description, which includes:
1. base URI where the table data is going to be stored;
2. table name;
3. row format, which translates to the file format in which the data is stored;
4. compression algorithm to be used on the data files;
5. schema, which includes the definition of the columns and the enumeration of the columns that
should be used for partitioning and as unique keys;
6. partitioner that depending on its implementation and on the columns specified for partitioning
will compute the partition to where a given row should be stored.
In addition of having to specify the table description, the user has to provide an implementation,
for each type of entity being stored, of a conversion method that converts an entity to an internal and
manageable representation of it, the row.
After specifying the table description and the conversion method all a client has to do is to use
one of the two interface methods that allows to insert batches of data. It is allowed to insert data in
their entity, and thus original, form or in their rows converted form. Further on we will see that this
interface accepts these two forms of data because it is both the interface presented to the user and the
interface that glues together the independent components that compose the solution.
As mentioned previously in Section 3.1, there was an initial version (Section 3.2.2) that had
no performance requirements and, an improved version (Section 3.2.3) that allowed to meet those
performance requirements for the POC.
3.2.2
initial version
The initial version was made without performance requirements for the mediator operation, however
it was taken into account the performance of the independent components so each component has a
performance that was thought to be acceptable for the possible use cases.
To facilitate the comprehension of what kind of data really goes through the pipeline let us start
by defining entity and row and how do they relate to each other and how the first is converted into
the second, presenting already some implementation details just to clarify the operation. An entity is
simply a Plain Old Java Object (POJO) with attributes. Because that POJO may have the attributes
it wants with an unknown or heterogeneous API to access them, the first step to facilitate the access
to them is by converting the entity into a standardized representation that can be understood and
manipulated by the mediator system. To do that, the user has to provide an implementation of the
interface on the Listing 3.1 which given the original entity and a row, writes to the second each field of
chapter 3.
development
29
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
the first by the order specified in the schema. A row is defined as in the Listing 3.2, providing a Writer
which allows the implementation of toRow to write each attribute of the entity to the row.
public interface EntityWriteAdapter <T> {
void toRow (T e n t i t y , Row . Wr it er w r i t e r ) ;
}
Listing 3.1: Interface defining a method to convert an entity into a row.
public interface Row<T> {
C l a s s <T> g e t E n t i t y C l a s s ( ) ;
L i s t <Object> g e t F i e l d s ( ) ;
void c l e a r ( ) ;
Wri t er g e t W r i t e r ( ) ;
interface Wri t er {
Wri t er w r i t e ( Boolean f i e l d ) ;
Wri t er w r i t e ( S t r i n g f i e l d ) ;
Wri t er w r i t e ( Byte f i e l d ) ;
Wri t er w r i t e ( S h o r t f i e l d ) ;
Wri t er w r i t e ( I n t e g e r f i e l d ) ;
Wri t er w r i t e ( Long f i e l d ) ;
Wri t er w r i t e ( F l o a t f i e l d ) ;
Wri t er w r i t e ( Double f i e l d ) ;
Wri t er w r i t e ( Date f i e l d ) ;
Wri t er w r i t e ( byte [ ] f i e l d ) ;
}
}
Listing 3.2: Interface defining the internal representation of an entity. The Writer allows the
serialization of an entity into a row.
Based on the requirements it seemed that some kind of pipeline based architecture would cope
with the task of inserting data into Hive as it would allow different components with well defined
tasks to come into play at well defined stages. The pipeline based architecture is the adopted one and
Figure 3.1 shows the different components composing that pipelined architecture.
Each component can live on its own allowing them to be recombined as intended to produce
different behaviours. This is achieved by having them implementing the same interface, Listing 3.3,
which defines methods to receive batches of either entities or their already converted equivalent rows.
The component staying at the beginning of the pipeline will receive batches of entities and the remaining
ones will receive batches of rows. Insights about the internals of each component are given in Section 3.3
so as an overview, the operational flow, as well as the components responsibilities, is explained below.
From the Figure 3.1, the original batch of entities enters the system through the Static Constraints
Checker where they are first converted to rows and then checked against the defined schema for not
nullable columns that happened to be null and, for columns with wrong data types (this may occur if
the implementation of the conversion mechanism, toRow, is wrong). Only the valid rows pass to the
next component and the invalid ones are filtered out. The Mediator task is to send the batch for both
30
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
the Persistence Writer and the Unique Constraints Checker to run in parallel and wait for the results
of both to send them to the next stage. The Persistence Writer, which is backed by a Persistence
Queue (Section 3.3.2), persists the rows while there is no confirmation they are really safe (so they can
be replayed later using the Persistence Queue), that is, the File Writer has not confirmed the file has
been successfully written to its final destination. The Unique Constraints Checker task is to check
the rows are not duplicated and filter out the duplicated ones, if any. The filtered rows from Unique
Constraints Checker are the ones the Mediator sends to the Table Writer.
public interface Writer<T> extends F l u s h a b l e , C l o s e a b l e {
void append ( final I t e r a b l e <T> e n t i t i e s ) throws IOException ;
void appendRows ( final I t e r a b l e <Row<T>> rows ) throws IOException ;
}
Listing 3.3: Common interface used by the components allowing them to be plugged as intended to
achieve a certain behaviour.
Static Constraints
Checker
Persistence Writer
Table Manager
Unique Constraints
Checker
Mediator
Table Writer
Partition 1
Partition 2
File Writer
File Writer
HDFS partition 1
directory
HDFS partition 2
directory
Partition n
(...)
File Writer
HDFS partition n
directory
Figure 3.1: Architecture diagram of the initial version. This version was implemented before knowing
the performance requirements, which led to an almost completely synchronous design.
The batch contains rows related to different times, where rows from different hours must be placed
into different partitions and so, different files. The partitioning is configurable, but the used and tested
one is the [year, month, day, hour] partitioning. For each necessary partition for the rows of the
batch, a File Writer is instantiated to write a file for the right directory of the table for that partition.
That file is of the type specified by the row format in the table descriptor provided by the client. The
Table Writer task is to instantiate file writers as needed for the incoming rows and to split the batch
across them depending on the times of the rows.
chapter 3.
development
31
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
Several batches may be received before flushing and closing the files. The flushing order is given
from time to time by the client and, when received, the opened files from the file writers are closed and
the data persisted by the Persistence Writer is cleared because now it is guaranteed the data is safe.
Because adding files to a table’s directory does not make it automatically known to Hive and
Impala because the metastore has to know them, the Table Manager has to be used to instruct Hive
to scan the partitions’ directories and add to its metastore any new files found. After that Impala is
instructed to refresh its cache of Hive’s metastore so it can see the new files also.
3.2.3
improved version
After having implemented the initial version and made some insertion tests for the POC, presented
in Chapter 4, it was concluded that the initial design did not cope with the established performance
requirements of inserting 10, 000 rows per second of the types VOZ_2G and VOZ_3G. Some improvements
are merely tweaks but others, like the asynchronous sending of files and the bucketing, have forced to
make some changes to the design. Figure 3.2 shows the overview of the improved architecture.
Static Constraints
Checker
File Uploader
Table Manager
Bucket Splitter
Bucket 1
Persistence Writer
Mediator
Bucket n
Unique Constraints
Checker
(...)
Persistence Writer
Table Writer
Partition 1
File Writer
Partition 2
File Writer
Mediator
Unique Constraints
Checker
Table Writer
Partition 1
Partition n
File Writer
File Writer
Partition 2
File Writer
Partition n
File Writer
Local temporary
directory
Figure 3.2: Architecture diagram of the improved version. Asynchronous sending of files and splitting
batches across different buckets have made it possible to fulfil and exceed the performance requirements.
The first change is in the way the files were being saved to the table’s directory. In the initial
version, the File Writer created the files directly in an HDFS location. It was observed that creating
32
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
the files locally instead and then sending them to the correct HDFS directory was faster. Because now
there is a local checkpoint, there are guarantees the data is safe, only after closing the files, even before
sending the files to their final destinations. This allows the system to be ready to insert more data,
after closing a file, faster. Because now the sending of the files is not included as a task of the pipeline
itself (more specifically, a task of the Table Writer), a new component, the File Uploader, has been
created, which asynchronously sends the completed local files to their final destinations.
Creating the files locally and sending them asynchronously is one of the major changes but it was
not enough. One drawback of this pipeline architecture is that it is not really a pipeline because a
new batch can only be inserted after the caller returns from the previous one which only occurs after
delivering it to the File Writer. To tackle that issue and to in some way make the design more easy to
distribute, a new component, the Bucket Splitter, capable of splitting the batch into different buckets
has been created. This way, there are as many pipelines as the number of buckets and each one can
run in parallel with no dependencies between each one.
Despite running on a single machine, this bucket design, in addition to the asynchronous sending
of files, has made it possible to fulfil and exceed the performance requirements of the POC.
3.3
implementation
On the previous Sections 3.1 and 3.2 it was presented the overall architecture of the solution
without going into details.
The mediator framework integrates with the client program, running on the same process of the
client. Its operations are triggered by the client. However, there are two actions that run freely once the
client has started the mediator: the asynchronous files uploading and the automatic, and asynchronous,
refreshing of Hive’s tables. This section dives into implementation details supporting the previously
presented architecture.
3.3.1
constraints checking
By recalling the Figure 3.1 and the Static Constraints Checker and Unique Constraints Checker,
their purpose is to validate the data entering the system, either by asserting no invalid rows pass to
next stages and by asserting no rows are stored twice, ensuring unique key constraints. The static
constraints checking includes verifying attributes for null values when they are not allowed to be null
and, includes verifying if the data types for the attributes are correct. The unique constraints checking
verifies whether a row has already been inserted, avoiding duplicated rows.
chapter 3.
development
33
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
static constraints
Static constraints are those depending on the rows themselves. They do not depend on the time of
arrival to the system and so one same row being tested twice gives the same result those two times.
The test is performed against the schema. The schema is composed of a list of columns. Each column
has a name, a data type and an attribute telling whether it is allowed to be null. Columns used for
partitioning and as unique keys are not allowed to be null. These restrictions are automatically checked
by the schema builder (example in Listing 3.4, SCHEMA) so no invalid schemas are constructed.
public class Record {
private final S t r i n g i m s i ;
private final S t r i n g msisdn ;
private final Date dateEnd ;
private final int y e a r ;
private final int month ;
private final int day ;
private final int hour ;
// The constructor has been omitted .
public static Schema SCHEMA = new Schema . B u i l d e r ( )
. addColumn ( " imsi " , Schema . Type . STRING, Schema . N u l l i t y .NOT_NULL)
. addColumn ( " msisdn " , Schema . Type . STRING, Schema . N u l l i t y .NULL)
. addColumn ( " dateend " , Schema . Type .TIMESTAMP,
Schema . N u l l i t y .NOT_NULL)
. addColumn ( " year " , Schema . Type . INTEGER, Schema . N u l l i t y .NOT_NULL)
. addColumn ( " month " , Schema . Type . INTEGER, Schema . N u l l i t y .NOT_NULL)
. addColumn ( " day " , Schema . Type . INTEGER, Schema . N u l l i t y .NOT_NULL)
. addColumn ( " hour " , Schema . Type . INTEGER, Schema . N u l l i t y .NOT_NULL)
. uni que ( " imsi " , " dateend " )
. p a r t i t i o n B y ( " year " , " month " , " day " , " hour " )
. create () ;
public static final TableDesc<Record> TABLE_DESC = new TableDesc<Record >(
" hdfs :/// user / altaia / tables " , // Base URI
" recordtable " ,
// Table name
SCHEMA,
new P a r t i t i o n e r I m p l <Record >(SCHEMA) ,
RowFormat .PARQUET,
CompressionType .SNAPPY) ;
public static final EntityWriteAdapter <Record> WRITE_ADAPTER = new
EntityWriteAdapter <Record >() {
@Override
public void toRow ( Record e , Row . Wr i ter w r i t e r ) {
writer . write ( e . imsi )
. w r i t e ( e . msisdn )
. w r i t e ( e . dateEnd )
. write ( e . year )
. w r i t e ( e . month )
. w r i t e ( e . day )
. w r i t e ( e . hour ) ;
}
};
}
Listing 3.4: Entity definition and its schema, table description and write adapter to convert it to a
row.
34
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
The Static Constraints Checker verifies whether a row complies with the schema specified for it.
Any row not respecting the contract is discarded or, by configuration, make all the batch to fail by
throwing a ConstraintViolationException exception.
Listing 3.4 shows what a client has to provide to the mediator in order to have its entity Record
stored. From the code listing we can see the Record has several attributes. Those entity’s attributes
cannot be accessed directly by the mediator without converting it first to its row representation. In the
WRITE_ADAPTER, the toRow method converts an entity to its row representation, writing the entity’s
attributes one by one to the Row.Writer (defined in Listing 3.2). The order by which the attributes
are written relates to the order the columns supporting them are defined in the SCHEMA.
A row has a list of fields. The field at index 0 corresponds to the first column defined in the schema
and so on. The Static Constraints Checker goes through the rows of a batch and checks (1) if not
nullable columns have their corresponding field values as null, which is a violation, and then (2) checks
if the data types present in the rows’ fields correspond to the ones defined by their columns. Failing
to pass the second check, depending on the way the client and the entity is implemented, means the
toRow implementation is not correct.
unique constraints
Unique constraints refer to unique key constraints. As previously mentioned in Section 3.1, Hive
does not support unique key constraints so a mechanism to check whether a given row, with a given
key, already exists had to be implemented in the mediator framework.
The implementation of the Unique Constraints Checker involves the creation and maintenance
of an index composed of the unique keys of the rows already inserted. Any time a new row is to be
inserted, this index in queried to check whether a row with the same unique key has already been
inserted and then the new unique key is stored in the index. This index is supported by LevelDB
(Section 2.4).
Just as example, and considering the POC scenario, the unique keys are expected to be 20 bytes
in length (concatenating all the fields composing them), and the index has to be maintained for rows
as old as one month. 10, 000 rows per second are expected which makes the index as big as ≈ 483 GiB,
without being compressed.
In the unique index it is stored the concatenated bytes of the unique key columns, concatenated
with the version number of the batch. The version is explained with more detail in the Section 3.3.7.
Because keys in LevelDB are sorted that means that a same unique key with different versions would
have the oldest version first but, because it is more handy to have the newest version first, the version
is stored as Integer.MAX_VALUE - version, making the newest version of an unique key to appear
first when iterating the keys in ascending order, which is the natural ordering in LevelDB.
Taking Listing 3.4 as example, the columns imsi and dateend are defined as unique. The imsi is
chapter 3.
development
35
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
a string value and the dateend a timestamp one. A LevelDB key in that case would be in the form
[imsi | dateend | version], where | denotes concatenation. imsi is 8 bytes (for simplicity, let us
assume a fixed size), dateend is 8 bytes (in the Java’s long representation) and version is 4 bytes,
totalling 20 bytes.
When we want to know if a certain key is in the index we seek the key by [imsi | dateend] and
if we find it, then we use the provided version number to check if the key existed for a version less
than or equal to that provided version, that is, the floor version of a key, which is the greatest version
less than or equal to the given one. The reason to force the newest version of a key to appear first is
to make it possible this floor version searching.
Figure 3.3 shows a unique index example and different keys to be tested against that index for a
given version. Key a is tested positive because there is already in the index that key with a version
number less than or equal to 3, which is the key 2. The same holds true for key b because of key 1.
Testing key c gives negative because, in the index, there is no version less than or equal to 1 for that
key, which, if existing, would be between keys 2 and 3. Testing key d gives negative for the same
reason as the key c. Finally, testing key e gives positive because that key, for that version, exists in
the index.
LevelDB index
Which keys are reported as
existing in the index?
dateend
(8 bytes)
version
(4 bytes)
1
imsi0001
12301
4
imsi0001
12301
3
a
2
imsi0001
12301
2
imsi0001
12301
5
b
3
imsi0002
12302
3
imsi0001
12301
1
c
4
imsi0003
12303
3
imsi0003
12303
2
d
5
imsi0004
12304
3
imsi0004
12304
3
e
...
imsi
(8 bytes)
Figure 3.3: Unique index example showing different unique keys and a same unique key with different
versions. On the right side several keys are tested against the index to check whether they exist for a
given version.
The version of a batch, and thus for the unique keys of its rows, is managed by another component,
the Version Store, explained in Section 3.3.7.
In the processing pipeline of the Figure 3.1, the Unique Constraints Checker, (1) receives a batch
of rows, (2) asks the version store for the current version, (3) tests the rows, for that version, against
the index, (4) filters the rows found not to be unique and, (5) stores in the index the newly found
unique rows’ keys, with the given version.
36
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
3.3.2
persistence writer and persistence queue
The frequency by which new files get ready to be inserted into Hive is different from the frequency
by which new batches arrive to the mediator framework. To be able to guarantee the client that the
batch is safe after returning from the insertion, the received batch has to be safely persisted until the
new files are closed. We cannot rely on the files themselves, even partially, because the supported files,
specially the columnar file format ones, require a memory staging to organize all the column’s values
together and only at the end, the file is written.
The Persistence Writer shown in Figures 3.1 and 3.2 is backed by a Persistence Queue. The
Persistence Writer stores rows into the Persistence Queue but when they need to be replayed in case
of failures, they are directly read from the Persistence Queue because the interface of the Persistence
Writer (shown in Listing 3.3) does not allow to read the stored rows.
The Persistence Queue is backed by a memory-mapped file implemented to behave like a First-InFirst-Out (FIFO) queue. Figure 3.4 shows the organization of the memory-mapped file.
data
Available to read
Available to write
cursors
0
8
fileSize-1
R
RU
W
WU
Figure 3.4: Memory-mapped file organization, showing the positions to store the cursors and the
data.
The queue supports sequential reads and writes. When writing data, it is only safe upon committing
the write. When reading data, the read data is discarded only after committing the read. In both
write and read, the uncommitted state can be reset, returning to the previously committed states.
The memory-mapped file has two regions: one to save the data and another one to save the cursors
indicating the read (R) and write (W) data positions. The write starts at the W cursor and advances
sequentially. The cursor WU (write uncommitted) indicates the current write position. Upon commit,
W takes the position of WU. Once there is committed data, it can be read. R starts at the beginning
of the committed written data and advances sequentially. The cursor RU (read uncommitted) indicates
the current read position. Upon committing the read, R takes the position of RU, freeing that interval
to write more data. When committing, first it is flushed the data region and only then it is flushed the
cursors region.
The queue is, logically, a circular structure, so in the Figure 3.4, after the position fileSize-1
(the end of the data region) comes the position 8 (the beginning of the data region). The available
data to be read is in between the R and W cursors and, the free available space is in between the W
and R cursors, turning around the queue.
chapter 3.
development
37
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
The queue stores serialized batches of rows. Each row in the batch is serialized using Kryo [59]
and the byte array resulting from serializing all the rows is what is really stored by the queue. With
each batch it is also stored the batch size (to make it possible to read entire batches then) and the
batch version.
In Figure 3.1, as already mentioned, the Persistence Writer uses a Persistence Queue to store
the received batches. The Persistence Writer, (1) receives a batch of rows, (2) asks the Version Store
(Section 3.3.7) for the current version, (3) writes the batch into the persistence queue with the received
version, (4) commits the write. After flushing the files to be inserted into Hive, and thus confirming
the data is safe, the contents of the queue are emptied by simply reading (and ignoring) all of its
contents and committing the read operation.
The size of the queue, while the mediator framework is running, has a fixed size. While the contents
of the queue cannot be discarded it is possible for it to run out of available space. To tackle that, the
Persistence Writer has a callback mechanism allowing it to inform the Mediator of that situation,
which triggers a flush operation, causing the Persistence Queue to be emptied and then the process
can continue. The size of the Persistence Queue should be chosen according to the configured time to
trigger flushes and, the expected size and number of batches to receive during that time.
In Section 3.3.7, it is described the case in which the data stored previously in the persistence
queue is read during a recovery operation, being used to replay batches previously sent by the client.
3.3.3
table writer and file writer
The Table Writer has the responsibility of instantiating the needed File Writers and directing
rows for the correct one, based on its partition.
The table descriptor (example in Listing 3.4, TABLE_DESC), and more specifically the row format,
tells the table writer to which file format the rows should be saved. The row format translates to a
concrete implementation of a file writer supporting a specific file format. At the time of this writing
only the file writer for Parquet files has been implemented as it is the one required for the POC.
file writer selection based on the partition
When the table writer receives a row it first uses the partitioner from the table description to
know to which partition that row should go. The partition is just the relative path, inside the table
directory, where Hive expects data from that partition to be. The partitioner extracts from the row the
fields corresponding to the columns that have been defined as partitioning in the schema and constructs
that relative path. The partition’s relative path is constructed as /col1=val1/col2=val2 and so on.
Taking Listing 3.4 as example, the base URI is hdfs:///user/altaia/tables and the table name
is recordtable so the directory for that table is hdfs:///user/altaia/tables/recordtable. The
38
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
columns used for partitioning are year, month, day and hour. If a row comes with the values 2014, 10,
23 and 13 for the columns year, month, day and hour respectively, the resulting partition’s relative
path is /year=2014/month=10/day=23/hour=13.
After identifying the partition to where the row should go, the table writer creates a new file writer
for that partition so rows belonging to a same partition go to the same file writer and consequently to
the same file. Then, the table writer applies transformations to the row and sends it to the file writer,
moving then to the next row.
row transformations
The transformations are applied directly to the row, changing that same instance. A transformation that is always applied is the strip of the partitioning columns. When a table is partitioned,
Hive takes the values of the partitioning columns from the partition itself. So, if a partition is
/year=2014/month=10/day=23/hour=13, Hive automatically knows the columns year, month, day
and hour have the values 2014, 10, 23 and 13 respectively, so there is no need to waste space storing
those values in the files.
Because in the POC it is intended to use Impala to query the data, another transformation that
is applied is the conversion of the Java’s timestamp in milliseconds to another representation of it.
Impala does not recognize the Java’s timestamp as a valid timestamp, so queries involving times do
not work with that representation. Instead, Impala does recognize timestamps represented as double
numbers in which the integer part is the Unix timestamp in seconds and the fractional part is the
milliseconds and nanoseconds part. So the transformation involves converting all the timestamp fields
of a row into that double representation, which is done by dividing the Java’s timestamp by 1000.
storing of the files
Opening the files for writing directly on their final destinations would potentially make it possible
for Hive to try to read from not yet closed files and worst, from potentially corrupted files if something
fails while writing to them. To avoid that, the files are first stored to a temporary location and after
being written and closed, they are then moved to their final destinations.
In HDFS the move operation is guaranteed to be atomic [5] but the copy operation is not. For
that reason when moving a file to its final destination, that file have to come from another HDFS
location, so a move operation can be performed in an atomic way, guaranteeing that when Hive knows
about new files on its tables, those files are ready to be used and are not files being copied yet or
something else.
Section 3.2 refers an initial and an improved version of the mediator framework. This storing of file
mechanism is one of the differences between the two versions. On the initial version of the mediator,
chapter 3.
development
39
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
the files were open directly on HDFS, in a temporary location, and after being closed they were moved
to their final destination as shown in Figure 3.5.
...
Table URI
hdfs:///user/altaia/tables/recordtable
Table Writer
Partition
/year=2014/month=10/day=23/hour=13
Partition
/year=2014/month=10/day=23/hour=14
File Writer
File Writer
writes to
writes to
<Table URI>/.tmp/
year=2014/month=10/
day=23/hour=13/file1
<Table URI>/.tmp/
year=2014/month=10/
day=23/hour=14/file1
is moved to
is moved to
<Table URI>/year=2014/
month=10/day=23/
hour=13/file1
<Table URI>/year=2014/
month=10/day=23/
hour=14/file1
Figure 3.5: Sequence of steps to put a file into the correct table directory.
On the improved, and final, version of the mediator, the files are open in a local directory (the local
directory of the File Uploader) and only after being closed they are copied to a temporary location on
HDFS and then, moved to the final destination. In this new mechanism, after closing the File Writer
(and so, the file), the table writer warns the file uploader (Section 3.3.4) a new file is ready to be sent,
and the file uploader sends it asynchronously, taking care of copying it first to a temporary HDFS
location and then moving it to its final destination.
3.3.4
file uploader
Creating the files locally and sending them asynchronously has shown to allow a higher throughput
of rows (Section 4.4) then the more conservative way of creating the files directly in HDFS. This
way of sending the files has allowed to fulfil the performance requirements for the POC, mentioned in
Section 3.1, of inserting at least 10, 000 rows per second.
The File Uploader, is an independent service, inside the same process, that receives requests to
upload files and executes those requests asynchronously. Those requests are immediately stored in
a catalog, backed by a SQLite database to make it possible to resume the operation if the system
is stopped, either abruptly or cleanly. Thinking in future improvements to the mediator framework,
being able to know when a file is enqueued for sending and has been actually sent may be useful (in
fact, the table manager uses that information to more intelligently order refreshes to the tables), so this
40
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
File Uploader supports the registering of listeners for those events. The interface of the File Uploader
is shown in Listing 3.5 and the UploadDesc in the Listing 3.6.
public interface F i l e U p l o a d e r {
void s t a r t ( ) ;
boolean e n q u e u e F i l e ( S t r i n g f i l e P a t h , S t r i n g dstPath ) ;
String localDirectoryUri () ;
void shutdown ( ) ;
boolean
boolean
boolean
boolean
registerFileEnqueuedListener ( FileEnqueuedListener l i s t e n e r ) ;
unregisterFileEnqueuedListener ( FileEnqueuedListener l i s t e n e r ) ;
registerFileUploadedListener ( FileUploadedListener l i s t e n e r ) ;
unregisterFileUploadedListener ( FileUploadedListener l i s t e n e r ) ;
interface F i l e E n q u e u e d L i s t e n e r {
void onFileEnqueued ( UploadDesc d e s c ) ;
}
interface F i l e U p l o a d e d L i s t e n e r {
void o n F i l e U p l o a d e d ( UploadDesc d e s c ) ;
}
}
Listing 3.5: Interface of the File Uploader.
public class UploadDesc {
private final int i d ;
private final Date enqueueTime ;
private final S t r i n g f i l e N a m e ;
private final S t r i n g dstPath ;
// The constructor and accessors have been omitted .
}
Listing 3.6: Definition of the file upload description used when the File Uploader warns its listeners.
The File Uploader expects files to be stored in its local directory, which can be obtained through
its interface. When files are ready to be sent, that is, when a table writer instructs its file writers
to close their files and they finish that task, the table writer enqueues each of those files in the File
Uploader, indicating the file (under the local directory) to be sent and its final destination and, the
File Uploader registers that request into its catalog. Then, for each file the File Uploader has to send,
it first sends it to a temporary HDFS location and then moves it to its final location. Having the file
been delivered, its entry is deleted from the catalog.
In case of a system failure, the process of sending a file may be interrupted at any phase, be it
while sending from the local directory to the temporary directory, or from the temporary directory
to the final destination or, while deleting the entry from the catalog. When starting again, the file
uploader just loads the entries from the catalog and tries to do the procedure from the beginning but
if it is not possible, then it resumes the procedure from the possible intermediate stages.
chapter 3.
development
41
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
3.3.5
table manager
Having the files into their right locations it is only needed Hive to recognize those newly added
files, adding them to its metastore as belonging to that table. After that, and because Impala uses a
cached version of Hive’s metastore, it is needed to instruct Impala to refresh that cache so it can see
those new files too. The described above is the job of the Table Manager.
The refresh operations are triggered by DDL commands that can be run through JDBC. In Hive
the MSCK REPAIR TABLE <table_name> makes it recognize new partitions that are added directly
to HDFS (what is exactly what mediator framework does) or that have been modified (added or
removed files), adding that new information to its metastore. After running the command in Hive it is
needed to instruct Impala to refresh its cache of the Hive’s metastore by issuing the DDL command
REFRESH <table_name>.
The Table Manager also has the ability to create tables in Hive based on the provided table
description. For instance, with the table description from Listing 3.4, the Table Manager produces the
DDL of Listing 3.7 an then executes it in Hive to create the table. Because of the decision of issuing
the DDL commands to create tables over Hive instead of Impala (as Impala is just for the POC),
Impala needs to have its metastore cache invalidate to be able to recognize those newly added tables.
Because the command to invalidate the cache is not supported by HiveQL, and so it is not supported
via JDBC, the Table Manager opens a SSH session to a host having an Impala Shell and then executes
the invalidate command on the Impala Shell, making Impala recognize the newly added tables.
CREATE EXTERNAL TABLE IF NOT EXISTS r e c o r d t a b l e (
i m s i STRING,
msisdn STRING,
dateend DOUBLE
) PARTITIONED BY (
y e a r INT ,
month INT ,
day INT ,
hour INT
)
ROW FORMAT SERDE ’ parquet . hive . serde . ParquetHiveSerDe ’
STORED AS
INPUTFORMAT ’ parquet . hive . DeprecatedParquetInputFormat ’
OUTPUTFORMAT ’ parquet . hive . DeprecatedParquetOutputFormat ’
LOCATION ’ hdfs :/// user / altaia / tables / recordtable ’
Listing 3.7: Hive DDL produced by the Table Manager to create the table recordtable.
The Table Manager operations are executed inside retry cycles, so in case of connection failures,
the operations are retried after a backoff period, allowing the Table Manager to be resilient.
42
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
table refresher
The Table Manager itself has no autonomy. Operations like the table creation are performed once
at the mediator framework startup. However, other operations like the refresh of tables need to be
performed after files have been added to tables.
The Table Refresher is a service, making use of the Table Manager, that issues refresh operations
over Hive and Impala after files have been uploaded, and so, added to tables. To do that, it relies on
the events of the File Uploader (Section 3.3.4) telling when files have successfully been sent. To not
trigger the refresh after each uploaded file, the Table Refresher waits some time (by default, 5 seconds)
to allow other files to be uploaded. After that time of no files being uploaded, the Table Refresher
gives the refresh order (first to Hive and then to Impala), and then waits for more file uploads and the
procedure repeats.
3.3.6
bucketing
Section 3.2.3 refers that splitting the batch into multiple buckets results in performance gains
and Section 4.4 shows it does. The bucketing has to guarantee that a same row passing through the
mediator but in different times goes to same bucket. Without guaranteeing that, the unique index
cannot ensure the unique constraints.
To do the splitting, the Bucket Splitter, is introduced in the pipeline just after checking the static
constraints. This component splits the batch for the n different buckets and then sends them in parallel
for each of the subsections of the pipeline dealing with that specific bucket.
The rows are split based on their unique columns, giving a bucket number. To determine that
bucket number, the unique columns of a row are converted to bytes and concatenated together. Then
they pass through an hash function, giving an hash_number. The bucket number is then determined
by the operation (hash_number mod number_of_buckets).
Any time the number of buckets change, the persistence queues and unique indices have to be
rebuilt, which is not supported at the time of the writing of this document. For now, the only solution
is to just discard the already existing queues and indices taking into account the consequences of that.
3.3.7
version store and recovery mode
On the Sections 3.3.1 and 3.3.2 it was mentioned the existence of a Version Store. That store
keeps versions and their creation timestamps. Like the unique index, this Version Store is backed by a
LevelDB instance.
The creation timestamps stored are in seconds (an integer) and the version is another integer. The
versions are always incremented by one. To easily reach the most recent version in the store and to
chapter 3.
development
43
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
allow to find a floor version for a given time (that is the version whose timestamp is less than or equal
to a given one), the timestamp and the version are inverted, using Integer.MAX_VALUE - value (like
in the unique index, for the versions) and stored in the form [timestamp | version].
The Version Store is used to give batches an ID that can, in case of failure, be used as a transaction
ID, allowing to recover the batches of rows already delivered by the client without having the unique
index to report those batches of rows as already existing when they are not (because of the failure).
Knowing the batches have to be persisted to account for a possible system failure before having
the final files generated, let us picture some possible configurations of the pipeline and the expected
outcomes in the case of a failure, without having versions.
Unique Constraints
Checker
Persistence Writer
...
...
Case C
...
Case B
Persistence Writer
Unique Constraints
Checker
...
...
Persistence Writer
Unique Constraints
Checker
Mediator
...
Case A
Figure 3.6: Possible configurations of the pipeline for the persistence queue and unique index checking.
Figure 3.6 presents some possible configurations of the pipeline, accounting with the need to persist
the batches of rows in case of a failure, and without versions. Remember that a batch is only considered
successfully inserted after being delivered to the File Writer without any issues while passing through
the pipeline. Remember also that after having a batch stored in the queue, even if something fails
before reaching the file writer, it is possible to replay the contents of the queue. An unsuccessful
insertion, depending on the client, would make it to retry the same batch.
Case A shows a situation in which the batch is first tested for duplicated unique keys and only
then persisted in the queue. In this case if the persistence in the queue fails, the insertion fails but the
unique index has been left inconsistent. When the client retries the batch, all of its rows are reported
as duplicated. But if the insertion succeeds and only after that something goes wrong and the queue is
replayed, we know the rows stored in the queue are not duplicated.
Case B persists the batch first and tests for duplicated unique keys after. In this case the queue
may have duplicated rows as they have not been tested yet. If something fails after checking the unique
rows, we cannot simply replay the queue as it may have duplicated rows and, we cannot also query the
unique index for the rows in the queue because they will be reported as duplicated.
Because cases A and B cannot guarantee consistency in failure scenarios and, because of the
Input/Output intensive nature, combined with serialization, of the queue and of the unique index,
parallelizing both tasks have shown performance improvements. Case C shows that parallel scenario.
44
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
In this case the issues from both case A and B are present depending on which component finishes its
task first when running in parallel.
None of the presented cases can guarantee consistency in case of failure and because of that, the
batches versioning and the Version Store were introduced so, in an normal system operation, the
...
Version Store is located as shown in Figure 3.7.
Persistence Writer
Mediator
Unique Constraints
Checker
...
Generate new version
Read current version
Version
Store
Read current version
Figure 3.7: Location of the Version Store in the mediator framework architecture.
During a normal system operation, that is, without having to recover any rows from the Persistence
Queue, the Mediator receives a batch of rows, orders the Version Store to generate a new version and
then sends the batch for both the Persistence Queue and the Unique Constraints Checker, in parallel.
Those two components retrieve the last version from the Version Store, which both receive the version
generated upon the mediator request to do so and, do the rest of their tasks.
Figure 3.8 shows the configuration of the mediator to be able to recover. When a client instantiates
the mediator framework, it gets a component, the Broker Writer. The first time the client tries to
insert a batch, the Broker Writer builds the recovery mode (the structure on the left). In this mode,
the Bucketed Data Recovery instantiates one recovery pipeline for each existing bucket, so the data
recovery runs in parallel and independently on each bucket. The recovery pipeline is composed by the
Persistence Queue, followed by the Unique Constraints Checker and the Table Writer. The remaining
pipeline is like the one of the Figure 3.2.
After having the recovery pipeline instantiated, the rows of the Persistence Queue are replayed
and checked against the unique index (through the Unique Constraints Checker). For that check, it is
used the version number stored with the rows −1, so the unique index will not report those rows as
duplicated if they are not.
The unique rows are then sent to the Table Writer and the process continues normally as described
in Section 3.3.3. After replaying all the rows from the Persistence Queue, the newly generated files are
closed and submitted to the File Uploader to be sent, finishing the recovery operation and emptying
the Persistence Queue.
chapter 3.
development
45
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
Finished the recovery operation, the Broker Writer builds the structure on the right shown in
Figure 3.8, which is the pipeline of the Figure 3.2 and, proceeds with the normal operation of the
mediator framework. For the client it is completely transparent whether or not a data recovery has
occurred.
Broker Writer
After...
While recovering...
Bucket 1
(...)
Static Constraints
Checker
Bucket n
Persistence Queue
Bucket Splitter
Unique Constraints
Checker
Unique Constraints
Checker
Table Writer
Table Writer
...
...
...
Persistence Queue
Bucketed Data
Recovery
Figure 3.8: Configuration of the mediator framework when recovering data and its reconfiguration
to perform normally after the recovery.
46
M i c a e l C a p i tão
chapter
4
E va l u at i o n
This section describes the POC that was set up to test the mediator framework described in Chapter 3.
Then, it presents the tests performed on the mediator framework, including the ones resulting in the
need of improving the architecture presented initially.
The tests were first made to the whole system to check its global performance and then to each individual
component to be able to identify the bottlenecks. After identifying the bottlenecks, some architectural
changes had to be made, followed by new performance tests to the global system performance to validate
the requirement of inserting 10, 000 rows per second.
4.1
Proof Of Concept
To test the mediator framework in a real scenario, a POC was set up. Section 2.5 described Altaia,
a system capable of computing metrics (KPIs and KQIs) from CDRs and EDRs (xDRs) coming from
network monitoring equipments (probes). Those xDRs (intact and/or preprocessed first) compose the
raw data that is stored in the DBN0, which is now being migrated to an Hadoop solution (previously
Oracle).
The computing of metrics with a user centric approach, Customer Quality Management (CQM),
rather than a network centric one, requires the xDRs to be preprocessed to perform enrichments before
being ready to be consumed by Altaia. The enrichments are performed by an ETL system, the CQM
Mediation Module, which reads, from a directory, the original xDRs coming from the probes, processes
them and, stores the results in an Oracle RDBMS (the current DBN0).
The CQM Mediation Module has the notion of Sink to where data is sent to be stored. That
sink may have several implementations, being the original one to Oracle. Despite that decoupling,
the system is indeed coupled to RDBMSs behaviours concerning guarantees of delivery, so if the sink
47
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
succeeds in accepting a batch, the system assumes its contents are stored and safe. Concerning Altaia,
duplicated rows are not accepted as they would change the final results of the metrics. At this point
the CQM Mediation Module and Altaia need already two functional requirements of the mediator
framework to insert data into Hadoop.
Because Hive has too much latency for Altaia’s needs, the goal is to query the data in DBN0 with
Impala. Using Impala requires the mediator framework to manage Hive’s tables and ensure Impala
has an updated information of those Hive’s tables. Because Impala performs better when used with
Parquet files, the mediator framework is required to store the rows using that file format. To allow
Impala to perform even better for the Altaia needs, namely the collecting of data, the destination
tables are required to be partitioned by year, month, day and hour, as described in Section 3.3.3. The
collecting of raw data to compute the metrics is done by Altaia every 5 minutes, so in this POC the
files created by the mediator framework are flushed, closed and placed into their final destinations
every 5 minutes.
Figure 4.1 shows an overview of the system responsible for preprocessing the xDRs for the CQM
scenario. It shows the existing sink, the Oracle Sink, which is now being replaced, for the POC, by the
Impala Sink.
Network monitoring
devices (Probes)
CQM
Mediation
Module
Existing layer
Oracle
RDBMS
Oracle Sink
DBN0
<<Sink>>
Directory with xDRs
Impala Sink
Mediation
Framework
HDFS tables
directories
DBN0
New layer
Figure 4.1: Diagram of the existing system in which the mediator framework integrates to replace
the existing insertion mechanism into an Oracle RDBMS.
4.1.1
impala sink
Shown in Figure 4.1, the Impala sink is the bridge between the already existing system that performs
ETL operations on xDRs, the CQM Mediation Module, and the mediator framework. Section 3.2.1
refers the mediator framework requires a set of informations/utilities from the client, namely the table
descriptions and the write adapters to convert entities to rows. Impala Sink’s job is to set up the
mediator framework, providing it with the information/utilities it requires and starting up its services,
like the File Uploader and the Table Refresher, as well as giving the order to create the tables.
When the Impala Sink receives batches of data from the CQM Mediation Module, it directs them
to the mediator framework using the append method of the interface of Listing 3.3. It is also de Impala
48
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
Sink that triggers the time based flush orders, causing the mediator framework to flush and close the
files to be sent to HDFS.
4.1.2
concerns
memory consumption while creating parquet files
Writing Parquet files is a memory-intensive operation because the incoming data is buffered until
it reaches one data block in size, and only then that chunk of data is organized and compressed in
memory before being written out to the file [60].
Because the mediator framework writes to partitioned tables, for each partition there is one Parquet
file being buffered first to memory, and so several large chunks of data may be manipulated in memory
at once. In the normal operation of the CQM Mediation Module the data is usually from sequent
periods of time, which results in between one to three Parquet files being generated at the same time.
The problem resides when reprocessing operations are scheduled, which may result in records with
not sequent periods of time being processed, resulting in potentially too many Parquet files being
generated at once, leading to the crash of the whole application because of not enough memory.
The API to write Parquet files does not offer ways of knowing the state of memory consumption
and other statistics so, to avoid out of memory crashes, some dimensioning needs to be made to know
how much memory give to the Java Virtual Machine (JVM). That amount of memory should be
roughly (block_size ∗ number_partitions). The block_size is 1 GB by default [60] but can be
tuned to a lower size. The number_partitions requires the analysis of the situations resulting in
several files to be generated simultaneously.
small files
The time between flushes in the POC scenario is 5 minutes. In that time the quantity of data,
in MB, buffered for a Parquet file is far from the recommended 1 GB, as mentioned in Section 2.3.2.
Because the data is inserted into hourly partitions, that means a partition will have lots of small files,
posing performance issues when running queries on top of it.
Dealing with that issue requires the small files of each partition to be consolidated into a bigger
one when no more data is expected to arrive a partition. One possible way of doing that consolidation
is by selecting all the rows of a partition and inserting them back to it, overwriting the existing data.
This “read everything” and write again approach generates a consolidated/compacted file. However, if
there are running queries using the original small files during the compaction, they may fail.
The goal is to have an automated compaction mechanism capable of doing its job without interfering
with the running queries. This small files and compaction issue is out of the scope of this dissertation
and is not currently supported by the mediator framework.
chapter 4.
e va l u at i o n
49
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
4.2
the environment
The tests were conducted on a sixteen-node cluster each one with 2 Intel Xeon E5-2670 2.60GHz
CPUs and 128GB of main memory. The nodes are interconnected by 10Gbps Ethernet interfaces. The
cluster was running Red Hat Enterprise Linux 6.5 with Java 1.7.0_51. Cloudera Distribution Including
Apache Hadoop (CDH) 5.0.0 is the Hadoop distribution and was already installed on it.
The mediator framework, from now on referred as only mediator, was deployed on one of the
cluster nodes, the blade1. The global system performance tests involved running the mediator on
blade1 to insert data to the HDFS installed on the cluster, making it available to Hive and Impala.
The individual components tests involved only the blade1, except for the table writer which has, on
its initial version, to write to HDFS, so there is network involved.
All the tests were performed using randomly-generated records of a real type used in production,
the VOZ_3G (the biggest record that will be used in the mediator for now), whose table creation DDL
is shown on Listing A.2. The record generator generates batches of 100, 000 records and is able to
generate as much as ≈ 40, 000 records per second. The time it takes to generate the batch is not
included in the measurement of times for the system and individual components performances. The
VOZ_3G record is composed of 189 fields and each row has a size of ≈ 2.5KiB, which makes a batch to
be ≈ 244M iB.
For each test, several time measurements, one for each processed batch, were taken, making it
possible to reach an average value.
Table 4.1 reminds of the several components and their correspondence to the Figures 3.1 and 3.2,
indicates whether they convert entities to rows or not and suggests a shorter name to be used from now
on. In the case they do not convert entities to rows (when assembled in the final mediator), the batch
of entities/records is first converted to rows and only then the time measuring test proceeds. Despite
being generic in the architecture overview, the File Writer in here refers to the Parquet file writer one.
Number
1
2
3
4
5
6
8
Name
Static Constraints Checker
Mediator
Persistence Writer
Unique Constraints Checker
Table Writer
File Writer (Parquet)
Bucket Splitter
Converts entities
Yes
No
No
No
No
No
No
Short name
static-chk
mediator
persist-wr
unique-chk
table-wr
file-wr
bucket-split
Table 4.1: Correspondence between components and their short names used in this chapter. It is
presented whether a component is supposed to convert entities to rows when assembled in the final
mediator framework.
50
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
4.3
testing the initial version
In Section 3.2.2 it is presented the architecture, Figure 3.1, of the implemented solution before
knowing the performance requirements. After the tests it was concluded that some tweaks to Parquet
file writing and architectural changes were needed.
global performance
Making successive batches pass through the mediator it was verified that it took ≈ 31 seconds
to complete, which is ≈ 3, 200 rows per second. The writing to Parquet files was thought to be the
culprit and, to check that case, the mediator was tested disabling the Parquet writing, allowing it to
take only ≈ 5 seconds to complete, which is 20, 000 rows per second.
Identified the slowest component, several tests were made to determine the issue and how to tackle
it. Those tests were directed to the Parquet file writer itself and so will be covered on the next section.
components performance
The architecture of the mediator allows to easily test each one of the components. Those tests
serve as a way of checking bottlenecks and to have a performance overview of all the components.
When testing the components independently the following results on Table 4.2 were obtained.
Short name
Time
(s)
Throughput
(rows/s)
static-chk
mediator
persist-wr
unique-chk
table-wr
file-wr
≈ 0.5
≈ 4.0
≈ 3.5
≈ 2.0
≈ 0.5
≈ 26
200, 000
25, 000
≈ 28, 500
50, 000
200, 000
≈ 3, 800
Note
Writing to HDFS
Table 4.2: Performance of the several components of the mediator.
The static-chk test includes the conversion of entities to rows and checking whether the rows
violate any constraint, concluding it is not a bottleneck. The mediator sends the batch to the
persist-wr and the unique-chk in parallel so its time was expected to be the greatest of the two. In
fact it adds an overhead of ≈ 0.5 seconds, which adding to the time of the persist-wr (the greatest
time of the 2 running in parallel) gives ≈ 4.0 seconds obtained in the test.
chapter 4.
e va l u at i o n
51
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
Judging the time of the persist-wr, and so its throughput, it can be, in the future, a possible bottleneck. The table-wr test includes checking to which partition the rows belong, includes instantiating
the file writers (stub versions in this case) and, includes applying row’s transformations.
The file-wr test revealed the true bottleneck of the mediator: writing to Parquet. Several tests
including only this component were performed to try to understand the problem and overcome it.
Table 4.3 shows the results of the several tests performed on the file-wr.
While writing to Parquet files, the dictionary encoding, which allows to save space if there common
values (as the value is saved once and other occurrences will have only an index for the value), is
enabled by default. Experimenting with the dictionary encoding enabled or disabled and with writing
to HDFS or to local, some interesting findings showed up.
Test A is the same presented already in Table 4.2. Test B experimented to write the Parquet file
locally instead of in HDFS. It seems that just by removing the network part the performance improved
by about 45% compared with Test A.
Test
Time
(s)
Throughput
(rows/s)
A
B
C
D
E
≈ 26.0
≈ 18.0
≈ 14.0
≈ 7.0
≈ 10.0
≈ 3, 800
≈ 5, 500
≈ 7, 100
≈ 14, 200
≈ 10, 000
Description
Writing to HDFS. Dictionary encoding enabled.
Writing to local file. Dictionary encoding enabled.
Writing to HDFS. Dictionary encoding disabled.
Writing to local file. Dictionary encoding disabled.
Writing to local file. Dictionary encoding disabled. Send
the file to HDFS
Table 4.3: Several tests performed on file-wr component to try to understand how its performance
could be improved.
Tests C and D are similar to tests A and B but without having the dictionary encoding enabled. Test
C represents an improvement of about 85% comparing with test A. Test D represents an improvement
of about 160% comparing with test B. The dictionary encoding does play a big part on the performance
issue but the writing to HDFS is not free of guilt.
Test E is the same as test D but once the file is closed it sends it to HDFS. It would be expected
to see both tests C and E performing similar as both involve HDFS but test E was about 40% faster
than test C. The difference could be caused by Parquet writer sending small chunks of data over
the network, not taking advantage of the machine’s network capabilities and being delayed by the
Round-Trip Time (RTT). Some profiling tests made to verify that possibility did not prove conclusive
because the Parquet writer was not sending small chunks of data but the whole data at once when the
file was being closed.
Tests D and E seem promising for the performance issue but those are the times for the file-wr
alone. Before reaching the file-wr, the remaining mediator has already spent 5 seconds, so even
52
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
with the result of test E (the files need to be sent to HDFS anyway), the system takes 15 seconds
which gives a throughput of ≈ 6, 600 rows per second. While not enough, it already represents an
improvement of about 100% compared with the originally 31 seconds it took to write a batch.
4.4
testing the improved version
In section 4.3 the performance of the initial version was tested, concluding that even improving
the file-wr, it was not enough to meet the requirement. Because of that some architectural changes
had to be made. Section 3.2.3 refers the bucketing and the asynchronous file uploading to HDFS as
the improvements that have been made to meet the performance requirement.
Because improving the file-wr more is not feasible, having more file-wr and splitting a batch
across them might help. With that in mind the bucketing was introduced as shown in Figure 3.2.
The performance was not expected to improve linearly with the number of buckets, as increasing that
number means potentially having more Input/Output on the machine when components persist-wr,
unique-chk and file-wr come into play, but it was expected to reduce the time needed to process
the whole batch as now it was being split across the different buckets and run in parallel.
At this time it was considered to use the conditions of test E presented on Table 4.3, disabling the
Parquet dictionary encoding and creating the files locally and then send them synchronously to HDFS.
The bucket-split component introduces a delay of ≈ 0.5 seconds determining the bucket for
each row and preparing the sub-batches for each bucket. Testing with several number of buckets, the
results are shown on Table 4.4 and Figure 4.2.
Buckets
Time
(s)
Throughput
(rows/s)
1
2
3
4
5
6
7
8
9
10
≈ 15.5
≈ 10.3
≈ 8.9
≈ 8.3
≈ 7.6
≈ 7.4
≈ 7.2
≈ 6.9
≈ 6.8
≈ 6.6
≈ 6, 400
≈ 9, 700
≈ 11, 200
≈ 12, 000
≈ 13, 100
≈ 13, 500
≈ 13, 800
≈ 14, 400
≈ 14, 700
≈ 15, 100
Table 4.4: Results obtained with different numbers of buckets.
chapter 4.
e va l u at i o n
53
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
16000
14000
Rows/s
12000
10000
8000
6000
4000
2000
0
1
2
3
4
5
6
7
8
9
10
Number of buckets
Figure 4.2: Performance with different numbers of buckets. With the increasing number of buckets,
the improvements tend to be less apparent.
When going from 1 bucket to 2 buckets the performance improved by about 50% but when going
from 2 buckets to 3, that improvement is only about 15% and when going from 3 to 4 buckets we see
an improvement of only about 7% and so on.
Comparing 1 and 4 buckets, we see an improvement of about 90% which is the difference between
inserting ≈ 6, 400 rows per second or inserting ≈ 12, 000 rows per second, being the later already
about ≈ 20% above the performance requirement. However, the bucketing accentuates an undesired
performance issue brought by having small Parquet files, as discussed in Section 4.1.2. That issue can
be solved with a compaction operation performed later on the small files to produce only a big one.
However, that issue is out of the scope of this dissertation.
To decouple the sending of the files to HDFS with the creation of the files themselves, it was
created the File Uploader, Section 3.3.4. This way we could make the mediator to be able to process
another batch without having to wait for the files to be sent to HDFS, potentially resulting in
performance improvements. The Table Refresher, Section 3.3.5, was already asynchronous even before
the existence of the File Uploader, so Hive and Impala only knew about the new files later. This way,
the asynchronous file sending is a pertinent improvement.
The benefits of asynchronously sending the files to HDFS can only be seen by letting the whole
mediator run for a long time and from time to time checking how many rows Impala reports to exist
in the voz_3g table. To do that, and to check how many rows are inserted per second, the mediator
was set up with 4 buckets and with the asynchronous file sending. The mediator was left running for 5
hours and at the end of each hour a count of the rows, performed by Impala, was made. With those
counts it was possible to determine how many rows were being inserted per second.
The voz_3g table already had data on it. The rows count is rounded to the millions as having
exact values is not realistic due to the asynchronous nature of both the File Uploader and the Table
Refresher. The initial count of the voz_3g table was 43, 000, 000 rows. The results are shown in
54
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
Table 4.5.
Total
rows
Rows
per hour
Throughput
(rows/s)
≈ 43, 000, 000
≈ 93, 000, 000
≈ 140, 000, 000
≈ 188, 000, 000
≈ 238, 000, 000
≈ 287, 000, 000
≈ 50, 000, 000
≈ 47, 000, 000
≈ 48, 000, 000
≈ 50, 000, 000
≈ 49, 000, 000
≈ 13, 800
≈ 13, 000
≈ 13, 300
≈ 13, 800
≈ 13, 600
Hour
0
1
2
3
4
5
Table 4.5: Performance results obtained running the mediator framework for a period of 5 hours.
In Table 4.4 it is seen that with 4 buckets and with sending the file synchronously, the throughput
is of about 12, 000 rows per second. In this last test, with the files being sent asynchronously it was
registered a throughput of about 13, 500 rows per second, which results in an improvement of about
12%. The performance improvement with the File Uploader does not seem a big deal but adding to
that performance improvement is the ability to recover more easily after a failure and having all the
file sending logic into one shared component instead of split across all the file writers.
Comparing with the requirement of 10, 000 records per second, the 13, 500 rows per second fulfils
the requirement and exceeds it by about 35%.
chapter 4.
e va l u at i o n
55
chapter
5
Conclusion
This chapter wraps up the dissertation into a brief overview of the developed work and gives directions
of future improvements to be considered.
5.1
work overview
The work of this dissertation consisted on the development of a mediator framework to insert
data into Hadoop. The mediator integrates with systems used to insert data into Relational Database
Management System (RDBMS) making it easier for those systems to insert their data into Hadoop.
Because the data was intended to be queried by Hive, as it uses a SQL-like language, the HiveQL, it
had to be inserted in a format recognized by it.
The insertion of data into Hadoop consists in accumulating rows in order to create a big file that
can be sent to HDFS then. Because of that accumulating process, in case of a failure, all the data
perceived as being inserted is lost if the destination file gets corrupted or if its contents, at the time of
the failure, were in memory yet. Hadoop, and more specifically Hive, does not support indices like the
traditional RDBMSs, so it does not support unique key constraints, for example.
The mediator framework had to guarantee the durability of the inserted rows, even in case of
failures, and guarantee unique key constraints. It also had to support data partitioning and the
management of Hive’s tables.
The proposed solution consists on multiple components, each one carrying out a well defined task.
The components are allowed to be glued together as intended, resulting in different configurations with
different outcomes. That flexibility has allowed the mediator framework to have a configuration to
insert data and another configuration to recover data after a failure.
57
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
The configuration to insert data is a pipeline. Data, more specifically batches of rows, enter that
pipeline and pass through the components on a well defined order. That order makes the rows to
be checked for validity against a schema, to be checked for uniqueness, to be temporarily stored to
account for system failures, to be split into different partitions and, finally, to be written to a file that
is then sent to HDFS to be part of a Hive table.
The requirement of guaranteeing the data was safe was specially hard to fulfil because of the
different frequencies in which batches arrive and files are closed and sent to HDFS. To fulfil that
requirement, a persistent queue was introduced. That queue stores the accepted batches that are not
yet in HDFS so in case of a system failure, when restarting, the data on that queue is replayed and
the lost files are rebuilt.
Supporting unique key constraints implied the construction of an index to store the already known
keys. Because of the requirement of guaranteeing data was safe, it was possible for the index, in case
of failure, to report not duplicated records as duplicated when replaying the data on the persistence
queue. To solve that, the batches were versioned when arriving to the mediator so, in the case of a
data replay, the version number is used to allow the index to ignore records that otherwise would be
reported as duplicated.
To test the mediator framework, a POC was set up in which it had to integrate with an existing
system that performs ETL operations on xDRs for a CQM scenario. The result is intended to be then
used by another system, Altaia, to compute metrics, the KPIs and KQIs. For the POC there was the
requirement of inserting 10, 000 rows per second of records of the types VOZ_2G and VOZ_3G and, the
data had to be queryable not only by Hive but by Impala also.
The first performance tests (using the VOZ_3G as it is the biggest record) were disappointing, as
the mediator was only able to insert at a rate of ≈ 3, 200 rows per second. Thorough tests have shown
the problem resided mostly in the write of the Parquet files. Tweaking dictionary encoding options and
writing the file locally and then sending it to HDFS resulted in ≈ 6, 600 rows per second. To improve
the insertion rate even more, the mediator framework had to be changed to include bucketing. That
way, a batch is split into sub-batches and each sub-batch is processed in a bucket, each one having its
own independent insertion pipeline, allowing the insertion to be done in parallel, potentially improving
the insertion rate.
Testing the mediator framework with all the added improvements (and configuring it to use 4
buckets) has shown it was able of inserting data at a rate of 13, 500 rows per second, fulfilling and
exceeding the performance requirement by about 35%.
Using the bucketing approach accentuates the problem of generating too many small files, which is
bad for the HDFS namenode and for Parquet files in general, as the benefits of the columnar storage
approach are more noticeable with big files. That problem, however, can be solved by a compaction
operation applied later on the files of a table’s partitions.
58
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
5.2
future work
Although the goals for this dissertation were met, there are certain areas needing to be improved.
So the following is suggested as future work and improvements:
• Explore alternatives to Parquet files. Generating Parquet files requires too much memory and,
in the use case of the POC, they were not being generated with the adequate size to be relevant
later to the query performance;
• In case Parquet files are maintained for the insertion, develop a mechanism of detecting the
current size of the in-memory buffers and, in case they are dangerously reaching the limits of
available memory for the process, trigger a flush.
• Implement an automatic compactions mechanism that can safely compact the small files into a
big one. In the worst case, this mechanism would have to deal with running queries on the files
it wants to compact and then delete;
• Explore alternatives to LevelDB. Although simple and with good performance, being embedded
makes it harder to execute maintenance operations, like removing old keys that are not needed any
more, while being used by the mediator framework. HBase or Cassandra could be replacements,
although maybe too bloated for the purpose;
• When adding files to HDFS and, in order to Hive and Impala to recognize them, issue a
LOAD DATA DML directly on Impala instead of instructing Hive to search for new files and
instructing Impala to refresh its metastore cache after that.
chapter 5.
conclusion
59
References
[1]
The Lawton Constitution from Lawton, Oklahoma - Page 10, 2014. [Online]. Available: http:
//www.newspapers.com/newspage/36574866/ (visited on 28/10/2014).
[2]
Big Data - A Visual History, 2014. [Online]. Available: http://www.winshuttle.com/bigdata-timeline/ (visited on 28/10/2014).
[3]
M. Cox and D. Ellsworth, ‘Application-controlled demand paging for out-of-core visualization’,
235–ff. Oct. 1997. [Online]. Available: http://dl.acm.org/citation.cfm?id=266989.267068.
[4]
Big data. [Online]. Available: http : / / en . wikipedia . org / wiki / Big _ data (visited on
28/10/2014).
[5]
S. Ghemawat, H. Gobioff and S.-T. Leung, ‘The Google file system’, ACM SIGOPS Operating
Systems Review, vol. 37, no. 5, p. 29, Dec. 2003, issn: 01635980. doi: 10.1145/1165389.945450.
[Online]. Available: http://portal.acm.org/citation.cfm?doid=1165389.945450.
[6]
J. Dean and S. Ghemawat, ‘MapReduce - Simplified Data Processing on Large Clusters’, in
Proceedings of the 6th Symposium on Operating Systems Design and Implementation, USENIX
Association, Ed., 2004, pp. 137–149.
[7]
F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A.
Fikes and R. E. Gruber, ‘Bigtable: a distributed storage system for structured data’, Nov. 2006.
[Online]. Available: http://dl.acm.org/citation.cfm?id=1267308.1267323.
[8]
S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton and T. Vassilakis,
‘Dremel: interactive analysis of web-scale datasets’, in Proc. of the 36th Int’l Conf on Very Large
Data Bases, 2010, pp. 330–339. [Online]. Available: http://www.vldb2010.org/accept.htm.
[9]
Tabulating machine. [Online]. Available: http : / / en . wikipedia . org / wiki / Tabulating _
machine (visited on 28/10/2014).
[10]
F. Rider, The Scholar and the Future of the Research Library: A Problem and Its Solution.
New York city: Hadham press, 1944. [Online]. Available: http://catalog.hathitrust.org/
Record/001161356.
[11]
E. F. Codd, ‘A relational model of data for large shared data banks’, Communications of the
ACM, vol. 13, no. 6, pp. 377–387, Jun. 1970, issn: 00010782. doi: 10.1145/362384.362685.
[Online]. Available: http://dl.acm.org/citation.cfm?id=362384.362685.
[12]
C. N. Parkinson, Parkinson’s Law, Nov. 1955. [Online]. Available: http://www.economist.
com/node/14116121 (visited on 28/10/2014).
[13]
Gap between MSS products and user requirements : Digest of papers, ser. IEEE Symposium on
Mass Storage Systems; 4, IEEE, IEEE, 1980. [Online]. Available: https://getinfo.de/app/
Gap-between-MSS-products-and-user-requirements/id/TIBKAT:017462509.
61
[14]
B. A. Devlin and P. T. Murphy, ‘An architecture for a business and information system’, IBM
Systems Journal, vol. 27, no. 1, p. 60, 1988. [Online]. Available: http://domino.research.ibm.
com/tchjr/journalindex.nsf/0/c95461887f5a5cb285256bfa00685be4.
[15]
M. Lesk, How Much Information Is There In the World?, 1997. [Online]. Available: http :
//www.lesk.com/mlesk/ksg97/ksg.html (visited on 28/10/2014).
[16]
D. Laney, 3D Data Management: Controlling Data Volume, Velocity, and Variety, 2001. [Online].
Available: http : / / blogs . gartner . com / doug - laney / files / 2012 / 01 / ad949 - 3D - Data Management-Controlling-Data-Volume-Velocity-and-Variety.pdf.
[17]
P. Lyman and H. R. Varian, ‘How Much Information?’, 1999, [Online]. Available: http://www2.
sims.berkeley.edu/research/projects/how-much-info/ (visited on 28/10/2014).
[18]
J. F. Gantz, J. Mcarthur and S. Minton, ‘The Expanding Digital Universe: a forecast of
worldwide information growth through 2010’, Tech. Rep., 2007. [Online]. Available: https:
//www.zotero.org/gabrieldumouchel/items/itemKey/9DTIR9ST.
[19]
R. E. Bryant, R. H. Katz and E. D. Lazowska, ‘Big-Data Computing : Creating revolutionary
breakthroughs in commerce , science , and society Motivation : Our Data-Driven World’,
Computing Research Association, 2008.
[20]
T. White, Hadoop: The Definitive Guide, 3rd ed., M. Loukides and M. Blanchette, Eds. Sebastopol: O’Reilly Media, 2012, isbn: 978-1-449-31152-0.
[21]
HDFS Architecture. [Online]. Available: http://hadoop.apache.org/docs/r2.5.1/hadoopproject-dist/hadoop-hdfs/HdfsDesign.html (visited on 28/10/2014).
[22]
Y. Carmel, HDFS vs. GFS, 2013. [Online]. Available: http : / / pt . slideshare . net /
YuvalCarmel/gfs-vs-hdfs (visited on 28/10/2014).
[23]
K. V. Shvachko, ‘HDFS scalability : the limits to growth’, pp. 6–16, 2010.
[24]
‘HDFS High Availability’, 2014. [Online]. Available: http://hadoop.apache.org/docs/r2.
3.0/hadoop- yarn/hadoop- yarn- site/HDFSHighAvailabilityWithNFS.html (visited on
08/11/2014).
[25]
‘Introduction to HDFS High Availability’, 2014. [Online]. Available: http://www.cloudera.
com / content / cloudera / en / documentation / cdh5 / v5 - 0 - 0 / CDH5 - High - Availability Guide/cdh5hag_hdfs_ha_intro.html (visited on 08/11/2014).
[26]
‘An Introduction to HDFS Federation’, 2011. [Online]. Available: http://hortonworks.com/
blog/an-introduction-to-hdfs-federation/ (visited on 08/11/2014).
[27]
‘ViewFs Guide’, 2014. [Online]. Available: http://hadoop.apache.org/docs/r2.4.0/hadoopproject-dist/hadoop-hdfs/ViewFs.html (visited on 08/11/2014).
[28]
‘Apache Hadoop MapReduce Concepts’, [Online]. Available: https://docs.marklogic.com/
guide/mapreduce/hadoop (visited on 08/11/2014).
[29]
D. Eadline, ‘The YARN Invitation’, [Online]. Available: http://www.admin-magazine.com/
HPC/Articles/The-New-Hadoop (visited on 08/11/2014).
[30]
A. Murthy, ‘Apache Hadoop YARN ? Concepts and Applications’, 2012. [Online]. Available:
http : / / hortonworks . com / blog / apache - hadoop - yarn - concepts - and - applications/
(visited on 08/11/2014).
[31]
Introduction to YARN and MapReduce 2, 2013. [Online]. Available: http://pt.slideshare.
net/cloudera/introduction-to-yarn-and-mapreduce-2 (visited on 08/11/2014).
62
[32]
ResourceManager High Availability, 2014. [Online]. Available: http://hadoop.apache.org/
docs / r2 . 5 . 2 / hadoop - yarn / hadoop - yarn - site / ResourceManagerHA . html (visited on
08/11/2014).
[33]
Configuring High Availability for ResourceManager (MRv2/YARN), 2014. [Online]. Available:
http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5- 1- x/CDH5High-Availability-Guide/cdh5hag_rm_ha_config.html (visited on 08/11/2014).
[34]
J. Zhou, SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets, 2009. [Online].
Available: http://isg.ics.uci.edu/slides/MicrosoftSCOPE.pptx (visited on 08/11/2014).
[35]
A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu and R.
Murthy, ‘Hive - a petabyte scale data warehouse using Hadoop’, in 2010 IEEE 26th International
Conference on Data Engineering (ICDE 2010), IEEE, 2010, pp. 996–1005, isbn: 978-1-42445445-7. doi: 10.1109/ICDE.2010.5447738. [Online]. Available: http://ieeexplore.ieee.
org/lpdocs/epic03/wrapper.htm?arnumber=5447738.
[36]
S. Ho, SerDe, 2014. [Online]. Available: https://cwiki.apache.org/confluence/display/
Hive/SerDe (visited on 08/11/2014).
[37]
L. Leverenz, DeveloperGuide, 2014. [Online]. Available: https : / / cwiki . apache . org /
confluence/display/Hive/DeveloperGuide (visited on 08/11/2014).
[38]
Apache Thrift, 2014. [Online]. Available: https : / / thrift . apache . org/ (visited on
08/11/2014).
[39]
——, LanguageManual DML, 2014. [Online]. Available: https : / / cwiki . apache . org /
confluence/display/Hive/LanguageManual+DML (visited on 08/11/2014).
[40]
C. Tang, LanguageManual DDL, 2014. [Online]. Available: https : / / cwiki . apache . org /
confluence/display/Hive/LanguageManual+DDL (visited on 08/11/2014).
[41]
J. Russell, Cloudera Impala, 1st ed., M. Loukides, Ed. Sebastopol: O’Reilly Media, 2014, isbn:
978-1-491-94535-3. [Online]. Available: https://www.cloudera.com/content/dam/cloudera/
Resources/PDF/orielly-cloudera-impala-ebook.pdf.
[42]
M. Kornacker and J. Erickson, Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real,
2012. [Online]. Available: http://blog.cloudera.com/blog/2012/10/cloudera- impalareal-time-queries-in-apache-hadoop-for-real/ (visited on 08/11/2014).
[43]
SQL Differences Between Impala and Hive. [Online]. Available: http : / / www . cloudera .
com/content/cloudera/en/documentation/cloudera- impala/latest/topics/impala_
langref_unsupported.html (visited on 08/11/2014).
[44]
Impala Concepts and Architecture, 2014. [Online]. Available: http://www.cloudera.com/
content/cloudera/en/documentation/cloudera- impala/v1/latest/Installing- andUsing-Impala/ciiu_concepts.html (visited on 28/10/2014).
[45]
D. Abadi, P. Boncz, S. Harizopoulos and S. Madden, The Design and Implementation of
Modern Column-Oriented Database Systems, 3. 2013, vol. 5, pp. 197–280, isbn: 1900000024. doi:
10.1561/1900000024. [Online]. Available: http://www.cs.yale.edu/homes/dna/papers/
abadi-column-stores.pdf.
[46]
Parquet. [Online]. Available: https://github.com/Parquet/parquet- format (visited on
08/11/2014).
[47]
J. L. Dem, Parquet: Columnar storage for the people, 2013. [Online]. Available: http : / /
cdn.oreillystatic.com/en/assets/1/event/100/Parquet_%20An%20Open%20Columnar%
20Storage%20for%20Hadoop%20Presentation%201.pdf (visited on 08/11/2014).
63
[48]
Using the Parquet File Format with Impala Tables. [Online]. Available: http://www.cloudera.
com/content/cloudera/en/documentation/cloudera- impala/latest/topics/impala_
parquet.html (visited on 08/11/2014).
[49]
Using the Parquet File Format with Impala, Hive, Pig, and MapReduce. [Online]. Available:
http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5- 1- x/CDH5Installation-Guide/cdh5ig_parquet.html (visited on 08/11/2014).
[50]
Apache HBase. [Online]. Available: http://hbase.apache.org/ (visited on 28/10/2014).
[51]
LevelDB. [Online]. Available: https://github.com/google/leveldb (visited on 28/10/2014).
[52]
J. Dean and S. Ghemawat, LevelDB: A Fast Persistent Key-Value Store, 2011. [Online]. Available:
http : / / google - opensource . blogspot . pt / 2011 / 07 / leveldb - fast - persistent - key value-store.html (visited on 28/10/2014).
[53]
R. Vagg, LevelDB and Node: What is LevelDB Anyway?, 2013. [Online]. Available: http :
//dailyjs.com/2013/04/19/leveldb-and-node-1/ (visited on 26/11/2014).
[54]
J. Dean and S. Ghemawat, LevelDB: A Fast Persistent Key-Value Store. [Online]. Available:
http://en.wikipedia.org/wiki/LevelDB (visited on 28/10/2014).
[55]
SQLite. [Online]. Available: http://www.sqlite.org/ (visited on 26/11/2014).
[56]
IndexedDB. [Online]. Available: https://developer.mozilla.org/en- US/docs/Web/API/
IndexedDB_API (visited on 26/11/2014).
[57]
Snappy. [Online]. Available: https://code.google.com/p/snappy (visited on 28/10/2014).
[58]
TM Forum Frameworx. [Online]. Available: http://www.tmforum.org/TMForumFrameworx/
1911/home.html (visited on 08/11/2014).
[59]
Kryo. [Online]. Available: https : / / github . com / EsotericSoftware / kryo (visited on
28/10/2014).
[60]
‘Using the Parquet File Format with Impala Tables’, 2014. [Online]. Available: http://www.
cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/
impala%5C_parquet.html (visited on 08/11/2014).
64
Appendices
65
appendix
A
Ta b l e s c r e at i o n D D L
CREATE EXTERNAL TABLE IF NOT EXISTS voz_2g (
d a t e _ s t a r t DOUBLE,
date_end DOUBLE,
week INT ,
minute INT ,
week_day BOOLEAN,
week_end BOOLEAN,
h o l i d a y BOOLEAN,
work_period STRING,
i m s i STRING,
i m s i _ v a l i d BOOLEAN,
i m e i s v STRING,
i m e i STRING,
t a c INT ,
f a c INT ,
s n r INT ,
svn INT ,
f i l e _ c o l _ t i m e DOUBLE,
rec_col_time DOUBLE,
i n s e r t _ t i m e DOUBLE,
s e q BIGINT ,
a_msisdn STRING,
a_imsi STRING,
trm_brand STRING,
trm_model STRING,
trm_type STRING,
trm_os STRING,
trm_technology STRING,
trm_band_gsm STRING,
trm_band_umts STRING,
trm_band_lte STRING,
trm_bw_800 STRING,
trm_bw_1800 STRING,
trm_bw_2600 STRING,
trm_hd_voice STRING,
trm_class_gprs STRING,
trm_class_edge STRING,
trm_mod_edge STRING,
trm_baud_umts STRING,
67
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
trm_cat_hsdpa STRING,
trm_cat_hsupa STRING,
trm_cat_lte STRING,
t r m _ c h i ps e t STRING,
account_id INT ,
sub_account_id INT ,
account_name STRING,
account_type STRING,
a _ i m s i _ v a l i d BOOLEAN,
line_number INT ,
f i l e n a m e STRING,
s e r v i c e STRING,
t e c h n o l o g y STRING,
f i l e STRING,
num_events INT ,
opc INT ,
dpc INT ,
f i r s t _ l a c INT ,
c u r r e n t _ l a c INT ,
f i r s t _ c i INT ,
c u r r e n t _ c i INT ,
c a l l i n g _ d i g i t s STRING,
c a l l e d _ d i g i t s STRING,
sms STRING,
c a t e g o r y STRING,
forward_vm_progress DOUBLE,
b_number STRING,
setup_time DOUBLE,
a l e r t _ t i m e DOUBLE,
connect_time DOUBLE,
d i s c o n n e c t _ t i m e DOUBLE,
c a l l _ p r o c e e d _ t i m e DOUBLE,
c a l l _ c o n f i r m _ t i m e DOUBLE,
b_msisdn STRING,
b_imsi STRING,
i n i t i a l _ c o d e c STRING,
message STRING,
c a l l _ a c t i v e _ d u r a t i o n BIGINT ,
c a l l _ s e t u p _ d u r a t i o n BIGINT ,
i n s u c c e s s BOOLEAN,
l e c _ t e c h n o l o g y STRING,
c a u s e STRING,
r e s u l t _ e n d STRING,
sub_result_end STRING,
lec_p1_pname STRING,
lec_p1_cause STRING,
lec_p1_result_end STRING,
lec_p1_sub_result_end STRING,
lec_p2_pname STRING,
lec_p2_cause STRING,
lec_p2_result_end STRING,
lec_p2_sub_result_end STRING,
lec_p3_pname STRING,
lec_p3_cause STRING,
lec_p3_result_end STRING,
lec_p3_sub_result_end STRING,
lec_p4_pname STRING,
lec_p4_cause STRING,
lec_p4_result_end STRING,
lec_p4_sub_result_end STRING,
s m s _ i s n u l l BOOLEAN,
sms_isnull_msg_mobileterm BOOLEAN,
a l e r t _ t i m e _ n o t n u l l BOOLEAN,
c o n n e c t _ t i m e _ n o t n u l l BOOLEAN,
d i s c o n n e c t _ t i m e _ n o t n u l l BOOLEAN,
a l e r t _ t i m e _ i s n u l l BOOLEAN,
f o r w a r d v m _ p r o g r e s s _ n o t n u l l BOOLEAN,
68
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
sms_notnull BOOLEAN,
sms_notnull_msg_mobileterm BOOLEAN,
a l e r t _ t i m e _ i s n u l l _ s m s _ n o t n u l l BOOLEAN,
f s t _ c i _ r n c _ b s c STRING,
fst_ci_nodeb_bts STRING,
f s t _ c i _ c e l u l a STRING,
f s t _ c i _ c o d i g o _ s i t e STRING,
f s t _ c i _ l a c INT ,
f s t _ c i _ r a c INT ,
f s t _ c i _ d i s t r i t o STRING,
f s t _ c i _ c o n c e l h o STRING,
f s t _ c i _ f r e g u e s i a STRING,
f s t _ c i _ f a b r i c a n t e STRING,
f s t _ c i _ s g s n STRING,
f s t _ c i _ l a t i t u d e STRING,
f s t _ c i _ l o n g i t u d e STRING,
cur_ci_rnc_bsc STRING,
cur_ci_nodeb_bts STRING,
c u r _ c i _ c e l u l a STRING,
c u r _ c i _ c o d i g o _ s i t e STRING,
c u r _ c i _ l a c INT ,
cur_ci_rac INT ,
c u r _ c i _ d i s t r i t o STRING,
c u r _ c i _ c o n c e l h o STRING,
c u r _ c i _ f r e g u e s i a STRING,
c u r _ c i _ f a b r i c a n t e STRING,
cur_ci_sgsn STRING,
c u r _ c i _ l a t i t u d e STRING,
c u r _ c i _ l o n g i t u d e STRING,
mcc INT ,
mnc INT ,
a l e r t _ t i m e _ i s n u l l _ f a l h a BOOLEAN,
a l e r t _ t i m e _ n o t n u l l _ f a l h a BOOLEAN,
c i d a INT ,
call_qty_ho INT ,
call_ho_time DOUBLE,
c a l l _ i n t r a _ q t y _ h o INT ,
c a l l _ i n t e r _ q t y _ h o INT ,
call_intra_dwn_qlt_qty_ho INT ,
call_intra_up_qlt_qty_ho INT ,
call_intra_dwn_str_qty_ho INT ,
call_intra_up_str_qty_ho INT ,
call_inter_dwn_qlt_qty_ho INT ,
call_inter_up_qlt_qty_ho INT ,
call_inter_dwn_str_qty_ho INT ,
call_inter_up_str_qty_ho INT
)
PARTITIONED BY (
year INT ,
month INT ,
day INT ,
hour INT
)
ROW FORMAT SERDE ’ parquet . hive . serde . ParquetHiveSerDe ’
STORED AS
INPUTFORMAT ’ parquet . hive . DeprecatedParquetInputFormat ’
OUTPUTFORMAT ’ parquet . hive . DeprecatedParquetOutputFormat ’
LOCATION ’ hdfs :/// user / altaia / tables / voz_2g ’ ;
Listing A.1: Table creation DDL for records of type VOZ_2G, composed of 155 columns.
appendix a.
ta b l e s c r e at i o n D D L
69
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
CREATE EXTERNAL TABLE IF NOT EXISTS voz_3g (
d a t e _ s t a r t DOUBLE,
date_end DOUBLE,
week INT ,
minute INT ,
week_day BOOLEAN,
week_end BOOLEAN,
h o l i d a y BOOLEAN,
work_period STRING,
i m s i STRING,
i m s i _ v a l i d BOOLEAN,
i m e i s v STRING,
i m e i STRING,
t a c INT ,
f a c INT ,
s n r INT ,
svn INT ,
f i l e _ c o l _ t i m e DOUBLE,
rec_col_time DOUBLE,
i n s e r t _ t i m e DOUBLE,
s e q BIGINT ,
a_msisdn STRING,
a_imsi STRING,
trm_brand STRING,
trm_model STRING,
trm_type STRING,
trm_os STRING,
trm_technology STRING,
trm_band_gsm STRING,
trm_band_umts STRING,
trm_band_lte STRING,
trm_bw_800 STRING,
trm_bw_1800 STRING,
trm_bw_2600 STRING,
trm_hd_voice STRING,
trm_class_gprs STRING,
trm_class_edge STRING,
trm_mod_edge STRING,
trm_baud_umts STRING,
trm_cat_hsdpa STRING,
trm_cat_hsupa STRING,
trm_cat_lte STRING,
t r m _ c h i ps e t STRING,
account_id INT ,
sub_account_id INT ,
account_name STRING,
account_type STRING,
a _ i m s i _ v a l i d BOOLEAN,
line_number INT ,
f i l e n a m e STRING,
s e r v i c e STRING,
t e c h n o l o g y STRING,
f i l e STRING,
num_events INT ,
opc INT ,
dpc INT ,
f i r s t _ s a c INT ,
c u r r e n t _ s a c INT ,
f i r s t _ l a c INT ,
s o u r c e _ i r a t _ c i INT ,
t a r g e t _ i r a t _ c i INT ,
rabr_time DOUBLE,
raba_time DOUBLE,
i r a t _ t i m e DOUBLE,
i r a t STRING,
c a l l i n g _ d i g i t s STRING,
c a l l e d _ d i g i t s STRING,
70
M i c a e l C a p i tão
M e d i at o r F r a m e wo r k f o r I n s e rt i n g Data i n t o H a d o o p
sms STRING,
c a t e g o r y STRING,
forward_vm_progress DOUBLE,
b_number STRING,
setup_time DOUBLE,
a l e r t _ t i m e DOUBLE,
connect_time DOUBLE,
d i s c o n n e c t _ t i m e DOUBLE,
c a l l _ p r o c e e d _ t i m e DOUBLE,
c a l l _ c o n f i r m _ t i m e DOUBLE,
b_msisdn STRING,
b_imsi STRING,
i n i t i a l _ c o d e c STRING,
message STRING,
c a l l _ a c t i v e _ d u r a t i o n BIGINT ,
c a l l _ s e t u p _ d u r a t i o n BIGINT ,
i n s u c c e s s BOOLEAN,
l e c _ t e c h n o l o g y STRING,
c a u s e STRING,
r e s u l t _ e n d STRING,
sub_result_end STRING,
lec_p1_pname STRING,
lec_p1_cause STRING,
lec_p1_result_end STRING,
lec_p1_sub_result_end STRING,
lec_p2_pname STRING,
lec_p2_cause STRING,
lec_p2_result_end STRING,
lec_p2_sub_result_end STRING,
lec_p3_pname STRING,
lec_p3_cause STRING,
lec_p3_result_end STRING,
lec_p3_sub_result_end STRING,
lec_p4_pname STRING,
lec_p4_cause STRING,
lec_p4_result_end STRING,
lec_p4_sub_result_end STRING,
s m s _ i s n u l l BOOLEAN,
sms_isnull_msg_mobileterm BOOLEAN,
a l e r t _ t i m e _ n o t n u l l BOOLEAN,
c o n n e c t _ t i m e _ n o t n u l l BOOLEAN,
d i s c o n n e c t _ t i m e _ n o t n u l l BOOLEAN,
a l e r t _ t i m e _ i s n u l l BOOLEAN,
f o r w a r d v m _ p r o g r e s s _ n o t n u l l BOOLEAN,
sms_notnull BOOLEAN,
sms_notnull_msg_mobileterm BOOLEAN,
a l e r t _ t i m e _ i s n u l l _ s m s _ n o t n u l l BOOLEAN,
i r a t _ t i m e _ n o t n u l l BOOLEAN,
a l e r t _ t i m e _ i s n u l l _ f a l h a BOOLEAN,
a l e r t _ t i m e _ n o t n u l l _ f a l h a BOOLEAN,
mcc INT ,
mnc INT ,
fst_sac_rnc_bsc STRING,
fst_sac_nodeb_bts STRING,
f s t _ s a c _ c e l u l a STRING,
f s t _ s a c _ c o d i g o _ s i t e STRING,
f s t _ s a c _ l a c INT ,
f s t _ s a c _ r a c INT ,
f s t _ s a c _ d i s t r i t o STRING,
f s t _ s a c _ c o n c e l h o STRING,
f s t _ s a c _ f r e g u e s i a STRING,
f s t _ s a c _ f a b r i c a n t e STRING,
f s t _ s a c _ s g s n STRING,
f s t _ s a c _ l a t i t u d e STRING,
f s t _ s a c _ l o n g i t u d e STRING,
cur_sac_rnc_bsc STRING,
cur_sac_nodeb_bts STRING,
appendix a.
ta b l e s c r e at i o n D D L
71
M S c C o m p u t e r a n d T e l e m at i c s E n g i n e e r i n g
c u r _ s a c _ c e l u l a STRING,
c u r _ s a c _ c o d i g o _ s i t e STRING,
cur_sac_lac INT ,
cur_sac_rac INT ,
c u r _ s a c _ d i s t r i t o STRING,
c ur _sa c_ co ncelho STRING,
c u r _ s a c _ f r e g u e s i a STRING,
c u r _ s a c _ f a b r i c a n t e STRING,
cur_sac_sgsn STRING,
c u r _ s a c _ l a t i t u d e STRING,
c u r _ s a c _ l o n g i t u d e STRING,
s r c _ i r a t _ c i _ r n c _ b s c STRING,
sr c _ i r a t _ c i _ no deb_ bt s STRING,
s r c _ i r a t _ c i _ c e l u l a STRING,
s r c _ i r a t _ c i _ c o d i g o _ s i t e STRING,
s r c _ i r a t _ c i _ l a c INT ,
s r c _ i r a t _ c i _ r a c INT ,
s r c _ i r a t _ c i _ d i s t r i t o STRING,
s r c _ i r a t _ c i _ c o n c e l h o STRING,
s r c _ i r a t _ c i _ f r e g u e s i a STRING,
s r c _ i r a t _ c i _ f a b r i c a n t e STRING,
s r c _ i r a t _ c i _ s g s n STRING,
s r c _ i r a t _ c i _ l a t i t u d e STRING,
s r c _ i r a t _ c i _ l o n g i t u d e STRING,
t g t _ i r a t _ c i _ r n c _ b s c STRING,
tgt_irat_ci_nodeb_bts STRING,
t g t _ i r a t _ c i _ c e l u l a STRING,
t g t _ i r a t _ c i _ c o d i g o _ s i t e STRING,
t g t _ i r a t _ c i _ l a c INT ,
t g t _ i r a t _ c i _ r a c INT ,
t g t _ i r a t _ c i _ d i s t r i t o STRING,
t g t _ i r a t _ c i _ c o n c e l h o STRING,
t g t _ i r a t _ c i _ f r e g u e s i a STRING,
t g t _ i r a t _ c i _ f a b r i c a n t e STRING,
t g t _ i r a t _ c i _ s g s n STRING,
t g t _ i r a t _ c i _ l a t i t u d e STRING,
t g t _ i r a t _ c i _ l o n g i t u d e STRING,
c i d a INT ,
c a l l _ q t y _ r e l o c INT ,
r e q _ r e l o c _ t i m e DOUBLE,
call_qty_ho INT ,
call_ho_time DOUBLE,
c a l l _ i n t r a _ q t y _ h o INT ,
c a l l _ i n t e r _ q t y _ h o INT ,
call_intra_dwn_qlt_qty_ho INT ,
call_intra_up_qlt_qty_ho INT ,
call_intra_dwn_str_qty_ho INT ,
call_intra_up_str_qty_ho INT ,
call_inter_dwn_qlt_qty_ho INT ,
call_inter_up_qlt_qty_ho INT ,
call_inter_dwn_str_qty_ho INT ,
call_inter_up_str_qty_ho INT
)
PARTITIONED BY (
year INT ,
month INT ,
day INT ,
hour INT
)
ROW FORMAT SERDE ’ parquet . hive . serde . ParquetHiveSerDe ’
STORED AS
INPUTFORMAT ’ parquet . hive . DeprecatedParquetInputFormat ’
OUTPUTFORMAT ’ parquet . hive . DeprecatedParquetOutputFormat ’
LOCATION ’ hdfs :/// user / altaia / tables / voz_3g ’ ;
Listing A.2: Table creation DDL for records of type VOZ_3G, composed of 189 columns.
72
M i c a e l C a p i tão