Best Practice Guide - HPC for
Data Science on the Cray Urika
Andreas Vroutsis, EPCC, United Kingdom
Sandra Mendez, LRZ, Germany
Terry Sloan (Editor), EPCC, United Kingdom
Volker Weinberg (Editor), LRZ, Germany
Version 1.0 by 14-01-2019
Table of Contents
1. Introduction
1.1. About this document
1.2. Guide Structure
1.3. Abbreviations and acronyms
2. Spark
2.1. Introduction
2.2. Resilient Distributed Datasets (RDDs)
2.2.1. RDD Creation and Transformation
2.3. Architecture
2.3.1. Spark SQL
2.3.2. MLlib (Machine Learning)
2.3.3. GraphX
2.3.4. Spark Streaming
2.4. Running Spark on Clusters
2.5. Spark Standalone Cluster
2.6. Running Spark in Slurm
3. Cray Urika GX System
3.1. Introduction
3.2. Overview
3.3. Architecture / Configuration
3.3.1. System Level
3.3.2. Node Level
3.4. System Access
3.4.1. Access to the EPCC-hosted Urika
3.4.2. User Interfaces
3.4.3. Data Transfer (EPCC-hosted Urika)
3.5. Production Environment
3.5.1. Hadoop and its ecosystem
3.5.2. Apache Spark
3.5.3. Cray Graph Engine
3.5.4. Resource Management
3.5.5. File Systems
3.5.6. Fault Tolerance
3.5.7. User Interfaces
3.6. Programming Environment
3.6.1. Components
3.6.2. Anaconda
3.6.3. Jupyter Notebook
3.6.4. Apache Spark
3.7. Performance Analysis
3.8. Tuning
3.9. Debugging
3.9.1. Tools
3.9.2. Log Files
Further documentation
1. Introduction
1.1. About this document
This best practice guide provides information about exploiting HPC platforms and techniques for Data Science
projects.
1.2. Guide Structure
This best practice guide is divided into sections covering specific topics. The contents of each section are briefly
described below.
| Section | Content |
| --- | --- |
| 1. Introduction | This section. It describes the guide and its structure. |
| 2. Spark | Spark is an open source distributed data analytics platform. It provides access to many different data sources and enables parallel computations to be distributed across a cluster. This chapter of the guide describes Resilient Distributed Datasets, the concept at the heart of Spark. It also describes the architecture of Spark, its libraries and how to run Spark on a cluster. |
| 3. Urika GX | The Cray Urika GX system is an HPC platform dedicated to highly interactive and iterative data analytics that require supercomputer levels of computing performance. This chapter describes the production and programming environment on the platform. This environment includes Spark, Hadoop, R and graph databases. The chapter's contents are based on the Urika GX system hosted by EPCC and Cray on behalf of the Alan Turing Institute in the UK. |
1.3. Abbreviations and acronyms
| Abbreviation | Explanation |
| --- | --- |
| API | Application Programming Interface |
| CGE | Cray Graph Engine |
| CPU | Central Processing Unit |
| EPCC | Edinburgh Parallel Computing Centre |
| GB | GigaByte(s) |
| HDD | Hard Disk Drive |
| HDFS | Hadoop Distributed File System |
| HDP | Hortonworks Data Platform |
| HPC | High Performance Computing |
| I/O | Input/Output |
| JVM | Java Virtual Machine |
| LDAP | Lightweight Directory Access Protocol |
| ML | Machine Learning |
| NFS | Network File System |
| PBS | Portable Batch System |
| RAM | Random Access Memory |
| RDD | Resilient Distributed Dataset |
| RDF | Resource Description Framework |
| RPC | Remote Procedure Call |
| SBT | Scala Build Tool |
| SCP | Secure Copy Protocol |
| SFTP | Secure File Transfer Protocol |
| SQL | Structured Query Language |
| SSD | Solid State Drive |
| TB | TeraByte(s) |
| TCP | Transmission Control Protocol |
| UAI | Urika-GX Application Interface |
| UI | User Interface |
2. Spark
2.1. Introduction
Apache Spark[3] is a cluster computing framework for large-scale data processing. It is best known for its ability
to cache large datasets in memory between jobs. It provides high-level APIs in Java, Scala, Python and R, and an
optimized engine that supports general execution graphs.
Spark has a large active community. It includes libraries for machine learning, SQL, structured streaming and
graph processing. This chapter describes the concept at the heart of Spark, namely Resilient Distributed Datasets, as
well as the architecture of Spark and its libraries. It also explains how to run Spark on a cluster.
2.2. Resilient Distributed Datasets (RDDs)
The RDD concept aims to support in-memory data storage, distributed across a cluster in a manner that is demonstrably both fault-tolerant and efficient.
An RDD is a distributed collection of data items, for example lines from a text file or sensor data with timestamp
and values. An RDD has the following properties:
• Immutable: One can execute an operation on an RDD to produce another RDD but one cannot alter the
original RDD.
• Partitioned: An RDD comprises a distributed collection of partitions of items and hence the contents of an RDD
can be operated on in parallel. Any operation on an RDD is typically performed using multiple nodes of a
computer cluster.
• Resilient: If one of the nodes hosting a partition fails, another of the cluster nodes can take over its data.
Once data is loaded into an RDD, two basic types of operation can be carried out upon it:
• Transformations, which create a new RDD by changing the original through processes such as mapping, filtering, and more;
• Actions, such as counts, which measure but do not change the original data.
The original RDD remains unchanged throughout. The chain of transformations from RDD1 to RDDn is logged,
and can be repeated in the event of data loss or the failure of a cluster node.
Transformations are said to be lazily evaluated, meaning that they are not executed until a subsequent action has a
need for the result. This will normally improve performance, as it can avoid the need to process data unnecessarily.
It can also, in certain circumstances, introduce processing bottlenecks that cause applications to stall while waiting
for a processing action to conclude.
Fault-tolerance is achieved, in part, by tracking the sequence of transformations applied to data partitions in an
RDD. Efficiency is achieved by parallelization of the data processing across multiple nodes in the cluster, and by
minimization of data replication between those nodes.
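The interplay of immutability, lazy evaluation and lineage described above can be illustrated with a small toy model in plain Python. This is only a sketch of the idea and not Spark's implementation: the class ToyRDD and the helper square are hypothetical names introduced here, and everything runs in a single process rather than across a cluster.

```python
class ToyRDD:
    """Toy stand-in for an RDD: immutable source data plus a log of transformations."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # original data, never modified
        self._lineage = tuple(lineage)  # recorded transformations, not yet executed

    def map(self, fn):
        # Transformation: returns a new ToyRDD and computes nothing.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: returns a new ToyRDD and computes nothing.
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def _compute(self):
        # Replay the logged chain from the source data. The same replay
        # could be used to rebuild results that were lost.
        items = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def collect(self):
        # Action: triggers evaluation of the recorded chain.
        return self._compute()

    def count(self):
        # Action: triggers evaluation and returns the number of items.
        return len(self._compute())


calls = []  # records every invocation of square, to make laziness visible


def square(x):
    calls.append(x)
    return x * x


base = ToyRDD([1, 2, 3, 4, 5])
derived = base.map(square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)  # still 0: no transformation has run yet
result = derived.collect()        # the action forces the whole chain to run
```

In this sketch base is left untouched by the transformations, and square is only invoked once collect is called on derived.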
2.2.1. RDD Creation and Transformation
There are two ways to create RDDs: parallelizing an existing collection, or referencing a suitably formatted dataset
in an external storage system (see https://spark.apache.org/docs/latest/rdd-programming-guide.html#external-datasets).
• Parallelizing an existing collection: for example the following Python code calls the Spark sc.parallelize method
to create an RDD.
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data, 5) # create 5 partitions
In this example the number of partitions to create has been set manually to 5 but you can instead get Spark to
automatically determine the number of partitions to create based on the size of the cluster.
• Referencing a dataset on distributed storage: for example the following Python code creates a text file RDD
using the Spark textFile method.
rdd = sc.textFile("data.txt")
RDDs can be transformed into derived RDDs. For example, for an RDD of integers such as the one created with
sc.parallelize above:
rdd2 = rdd.filter(lambda x: x % 2 == 0) # operation: keep the even elements, dropping the odd ones
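To make the role of partitions more concrete, the following plain Python sketch splits a list into contiguous partitions and applies a filter to each partition independently, which is what allows the work to be spread across nodes. The slicing rule and the function names split_into_partitions and filter_partitions are assumptions made for illustration only. Spark's actual partitioning logic may differ.

```python
def split_into_partitions(items, num_partitions):
    """Split a list into num_partitions contiguous slices of near-equal size."""
    n = len(items)
    return [
        items[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]


def filter_partitions(partitions, predicate):
    """Apply the predicate to each partition independently of the others."""
    return [[x for x in part if predicate(x)] for part in partitions]


data = [1, 2, 3, 4, 5]
partitions = split_into_partitions(data, 5)  # one element per partition
even_partitions = filter_partitions(partitions, lambda x: x % 2 == 0)

# Flatten the per-partition results, as a collect-style action would.
even_values = [x for part in even_partitions for x in part]
```

Because each partition is filtered without reference to any other, the per-partition work could in principle run on different nodes at the same time.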
2.3. Architecture
In addition to RDDs, Spark supports a rich set of higher-level tools including Spark SQL for SQL and structured
data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for processing
of live data streams (See Figure 1, “Spark Architecture”).
Figure 1. Spark Architecture
2.3.1. Spark SQL
Spark SQL[5] is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces
provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. There
are several ways to interact with Spark SQL including SQL and the Dataset API.
To learn more about programming with Spark SQL please refer to the official documentation [5].
2.3.2. MLlib (Machine Learning)
MLlib[6] is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and
easy. At a high level, it provides tools such as:
• ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative
filtering;
• Featurization: feature extraction, transformation, dimensionality reduction, and selection;
• Pipelines: tools for constructing, evaluating, and tuning ML pipelines;
• Persistence: saving and loading algorithms, models, and pipelines;
• Utilities: linear algebra, statistics, data handling, etc.
To learn more about programming with MLlib please refer to the official documentation [6].
2.3.3. GraphX
GraphX[7] is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX
extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached
to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental operators (e.g.,
subgraph, joinVertices, and aggregateMessages) as well as an optimized variant of the Pregel API[8].
To learn more about programming with GraphX please refer to the official documentation [7].
2.3.4. Spark Streaming
Spark Streaming[4] enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data
can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using
complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed
data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark's machine
learning and graph processing algorithms on data streams.
Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches,
which are then processed by the Spark engine to generate the final stream of results in batches.
Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume,
and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a
sequence of RDDs. You can write Spark Streaming programs in Scala, Java or Python.
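The micro-batch model described above can be sketched in plain Python as follows. This is a toy illustration rather than the Spark Streaming API: the functions discretize and word_counts and the sample events are hypothetical, and intervals in which no data arrives are simply skipped here.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, value) events into batches of batch_interval seconds."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    # Return the batches in time order, one list of values per interval.
    return [batches[key] for key in sorted(batches)]


def word_counts(batch):
    """Process a single batch: count the occurrences of each word."""
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts


# Hypothetical input stream of (arrival time in seconds, word) pairs.
events = [(0.2, "spark"), (0.7, "urika"), (1.1, "spark"), (1.4, "spark"), (2.9, "hdfs")]

batches = discretize(events, batch_interval=1.0)
results = [word_counts(batch) for batch in batches]  # one result per batch
```

Each element of batches plays the role of one RDD in the sequence that makes up a DStream, and results is the corresponding stream of per-batch outputs.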
To learn more about programming with Spark Streaming please refer to the official documentation [4].
2.4. Running Spark on Clusters
As explained in the official documentation[9], Spark applications run as independent sets of processes on a cluster,
coordinated by the SparkContext object in your main program, called the driver program (See Figure 2, “Cluster
Mode Overview (from https://spark.apache.org/docs/latest/cluster-overview.html) ”).
To run on a cluster, the SparkContext can connect to several types of cluster managers (e.g. Spark's own standalone
cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires
executors on nodes in the cluster, which are processes that run computations and store data for your application.
Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors.
Finally, SparkContext sends tasks to the executors to run[9].
Figure 2. Cluster Mode Overview (from https://spark.apache.org/docs/latest/cluster-overview.html)
As explained in the official documentation[9], there are several useful things to note about this architecture:
• Each application gets its own executor processes. These stay up for the duration of the whole application and run
tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling
side (each driver schedules its own tasks) and executor side (tasks from different applications run in different
JVMs). However, it also means that data cannot be shared across different Spark applications (i.e. instances of
SparkContext) without writing it to an external storage system.
• Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these
communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other
applications (e.g. Mesos/YARN). The driver program must listen for and accept incoming connections from its
executors throughout its lifetime (e.g., see spark.driver.port in the network config section). As such, the driver
program must be network addressable from the worker nodes.
• Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the
same local area network.
The cluster managers Spark supports are listed in the official documentation at [9]. Some of these are:
• Standalone: a simple cluster manager included with Spark that makes it easy to set up a cluster.
• Apache Mesos: a general cluster manager that can also run Hadoop MapReduce and service applications.
• Hadoop YARN: the resource manager in Hadoop 2.
2.5. Spark Standalone Cluster
As explained in [10], you can launch a Spark standalone cluster by first creating a file called conf/slaves in your
Spark directory. This file must contain the hostnames of all the machines where you intend to start Spark workers,
one per line. If the file conf/slaves does not exist then only a single machine (localhost) is used, which is useful for
testing. Note that the master machine accesses each of the worker machines via ssh. By default, ssh is run in parallel
and requires password-less access (using a private key) to be set up. If you do not have a password-less setup, you
can set the environment variable SPARK_SSH_FOREGROUND and serially provide a password for each worker.
Once the conf/slaves file is set up you can launch or stop a cluster with the following scripts that are available
in $SPARK_HOME/sbin [10].
• sbin/start-master.sh - Starts a master instance on the machine the script is executed on.
• sbin/start-slaves.sh - Starts a slave instance on each machine specified in the conf/slaves file.
• sbin/start-slave.sh - Starts a slave instance on the machine the script is executed on.
• sbin/start-all.sh - Starts both a master and a number of slaves as described above.
• sbin/stop-master.sh - Stops the master that was started via the sbin/start-master.sh script.
• sbin/stop-slaves.sh - Stops all slave instances on the machines specified in the conf/slaves file.
• sbin/stop-all.sh - Stops both the master and the slaves as described above.
Optionally, it is possible to configure the cluster further by setting environment variables in conf/spark-env.sh.
Create this file by starting with the conf/spark-env.sh.template, and copy it to all worker machines for the settings
to take effect. For example, you can use it to select directories for logs and workers by setting the environment
variables SPARK_LOG_DIR and SPARK_WORKER_DIR.
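As an illustration of the conf/slaves format described in this section (one worker hostname per line), the following Python sketch generates such a file. The function write_slaves_file and the hostnames node01 to node03 are hypothetical, and a temporary directory stands in for the Spark directory so that the example can be run anywhere.

```python
import tempfile
from pathlib import Path


def write_slaves_file(spark_home, hostnames):
    """Write one worker hostname per line to <spark_home>/conf/slaves."""
    conf_dir = Path(spark_home) / "conf"
    conf_dir.mkdir(parents=True, exist_ok=True)
    slaves_file = conf_dir / "slaves"
    slaves_file.write_text("\n".join(hostnames) + "\n")
    return slaves_file


demo_home = tempfile.mkdtemp()            # stands in for the Spark directory
workers = ["node01", "node02", "node03"]  # hypothetical worker hostnames
slaves_path = write_slaves_file(demo_home, workers)
slaves_content = slaves_path.read_text()
```

On a real installation the first argument would be the Spark directory itself, and the hostnames would be those of the machines allocated to run the workers.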
2.6. Running Spark in Slurm
On a traditional HPC platform a Spark cluster can be run in standalone mode on top of a Slurm resource manager.
This requires nodes exclusively allocated to run the Spark master and worker daemons. The spark-submit
script can then be used to submit jobs to the Spark cluster.
Here is an example of the steps required to do this on one particular traditional HPC platform. Please note that
you may need to customise the content of these steps for your own HPC platform.
1. Allocation of nodes in Slurm.
#SBATCH -o spark-pi.%j.%N.out
#SBATCH -e spark-pi.%j.%N.err
#SBATCH -D ./
#SBATCH -J spark-pi
#SBATCH --clusters=mpp2
#SBATCH --nodes=3
#SBATCH --mem 20000
#SBATCH --ntasks-per-node 4
#SBATCH --cpus-per-task 7
#SBATCH --time=00:10:00
2. Load software and start the master and workers. (Note that the available modules on your HPC platform may differ
and you may need to install these yourself.)
source /etc/profile.d/modules.sh
module load java
module load python
module load R
module load spark
## Start master and slave in
spark-start
echo $MASTER
The spark-start script starts the master and workers per Slurm task on the allocated nodes.
3. Launching a Spark application on the Spark cluster:
spark-submit --total-executor-cores 84 \
--executor-memory 5G \
$SPARK_HOME/examples/src/main/python/pi.py 1000
When submitting a Spark application there are a few tuning parameters which should be considered:
• the --ntasks-per-node parameter, which specifies how many executors will be started on each node. By default,
Spark will use 1 core per executor, so it is essential to specify --total-executor-cores. This number
cannot exceed the total number of cores available on the nodes allocated for the Spark application (84 cores in
this example, resulting in 7 CPU cores per executor).
• the --executor-memory parameter, which specifies the memory per executor. It is 2GB by default, and cannot
be greater than the RAM available on a cluster node (64 GiB on the allocated nodes in this example).
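The figures used in this example can be checked with a little arithmetic. The Python sketch below derives the executor and core counts from the Slurm settings shown in step 1 and the executor memory from step 3. The function name spark_slurm_resources is hypothetical, and the sketch assumes that exactly one executor is started per Slurm task.

```python
def spark_slurm_resources(nodes, ntasks_per_node, cpus_per_task, executor_memory_gb):
    """Derive Spark resource totals from a Slurm allocation (one executor per task)."""
    executors = nodes * ntasks_per_node
    return {
        "executors": executors,
        "total_executor_cores": executors * cpus_per_task,
        "cores_per_executor": cpus_per_task,
        # Memory requested by all executors that share a single node.
        "executor_memory_per_node_gb": ntasks_per_node * executor_memory_gb,
    }


# Values from the example job: --nodes=3, --ntasks-per-node 4,
# --cpus-per-task 7 and --executor-memory 5G.
resources = spark_slurm_resources(
    nodes=3, ntasks_per_node=4, cpus_per_task=7, executor_memory_gb=5
)
```

With these inputs the sketch gives 12 executors and 84 cores in total, matching the value passed to --total-executor-cores. The four executors on each node request about 20 GB between them, which is of the same order as the 20000 requested with --mem.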
3. Cray Urika GX System
3.1. Introduction
The Urika GX is primarily targeted at users who wish to undertake highly interactive and iterative data analytics that require supercomputer levels of computing performance. Its architecture and supporting software stack
therefore differ from those of a traditional high performance computing (HPC) platform.
In a traditional setting, a batch scheduler such as PBS or SLURM is employed to manage access to the computing
resources available on an HPC platform. Typically, a user logs on to the front end node of the HPC platform and
prepares a script that defines both the computing task they wish to execute and the amount of computing resources
this task requires. The user then submits the script to the batch scheduler. The batch scheduler then determines when
the script is executed. This approach allows many different users with different computing resource requirements
to share an HPC platform. In addition, it allows efficient usage of the available computing resources.
However, this approach does mean that a user's job may have to wait for suitable resources to become available
before it starts. Hence the user may have to wait some time, perhaps hours or even days, for their results. This
makes it unsuitable for users who wish to have interactive access and who want to be able to immediately redirect
their analyses based on up-to-date results. Moreover, an HPC platform may not be configured appropriately for
a particular user's needs. Instead, it will be configured to match an overall optimum such as high throughput or
capability. For example, a user may need a particular mix of compute and disk resources that few, if any, other
users want. The HPC service provider therefore has little incentive to configure the HPC platform for such a user.
The Urika GX provides the users with the option to choose the type of resources (e.g. SSD) they wish to utilize
for each of their applications, so that they will achieve the best possible performance. Moreover, Urika's resource
manager can dynamically determine the optimal amount of resources that it should offer to every application so that
the cluster's total resources are utilized optimally. In this way, not only can the platform address each individual
user's needs, but it can also serve more users at the same time.
3.2. Overview
The Cray Urika GX system is an HPC platform dedicated to highly interactive and iterative data analytics that
require supercomputer levels of computing performance. The chapter's contents are based on the Urika GX system
hosted by EPCC and Cray on behalf of the Alan Turing Institute in the UK. This chapter contains sections on
the following topics:
• the system's configuration (section 3.3);
• the way a user can access the system (section 3.4);
• the production environment (section 3.5);
• the programming environment (section 3.6);
• the available performance analysis tools (section 3.7);
• the suggested parameter tuning (section 3.8);
• the debugging tools (section 3.9).
3.3. Architecture / Configuration
Section 3.3.1 describes the overall configuration of the Urika GX, while section 3.3.2 focuses on the node-level
configurations. The specifications presented hold for all Urika-GX systems. The differences between the standard
specifications and the EPCC-hosted platform will be explicitly noted.
3.3.1. System Level
At the time of writing, Urika GX nodes use the CentOS 7.2 operating system. These nodes are organized in GreenBlade
chassis [11]. A Urika-GX system rack is depicted in Figure 3.
Figure 3. Urika-GX system
This image is taken from https://www.cray.com/products/analytics/urika-gx
Urika GX has three kinds of networks:
| Network | Description |
| --- | --- |
| Aries High Speed Network | This provides application and data connectivity between different nodes. |
| Operational Ethernet network | This is used for importing user data and accessing data streaming applications from compute nodes. |
| Management Ethernet network | This is used for system management. |
3.3.2. Node Level
3.3.2.1. Types
There are three kinds of nodes in the Urika GX system:
| Type | Usage |
| --- | --- |
| Compute Nodes | Applications and services are run on these nodes. |
| Login Nodes | A user logs in to these nodes and from there launches their applications onto the compute nodes. |
| I/O Nodes | These nodes handle the connections to external storage and file systems. |
3.3.2.2. Configurations
3.3.2.2.1. Processors
Generally, Urika-GX systems use 2 processors per node, with the processors' type being one of the following:
• Intel Broadwell 18C E5-2697 v4
• Intel Broadwell 8C E5-2620 v4
The EPCC-hosted Urika machine uses Intel Xeon E5-2695 v4 processors. Each processor
has 18 cores, so each node has 36 CPU cores.
3.3.2.2.2. RAM
Regarding the available memory per node, Urika systems offer three options:
• 128 GB
• 256 GB
• 512 GB
The EPCC-hosted machine has 256 GB per node.
3.3.2.2.3. Storage Memory
As far as the storage memory is concerned, all the nodes can have either 4 or 8 TB of HDD storage memory. On
the other hand, SSD availability depends on the kind of the node:
| Node type | SSD storage memory |
| --- | --- |
| Compute node | 2 TB or 4 TB |
| Login node | Not available by default |
| I/O node | Not supported |
For more information regarding the architecture and the configurations of Urika GX see [20].
3.4. System Access
3.4.1. Access to the EPCC-hosted Urika
Users of the EPCC-hosted Urika can find instructions on how to obtain an account and access the system at
http://ati-rescomp-service-docs.readthedocs.io/en/latest/cray/connecting.html .
3.4.2. User Interfaces
3.4.2.1. Access to User Interfaces
Most of the user interfaces can be accessed through the primary Urika-GX Applications Interface (UAI). An
alternative way is to use the following URLs:
| User Interface | URL |
| --- | --- |
| Urika-GX Applications Interface | http://urika1.turing.ac.uk:80 |
| YARN Resource Manager | http://urika1.turing.ac.uk:8088, http://urika2.turing.ac.uk:8088 |
| Hadoop Job History Server | http://urika1.turing.ac.uk:19888, http://urika2.turing.ac.uk:19888 |
| Marathon | http://urika1.turing.ac.uk:8080, http://urika2.turing.ac.uk:8080 |
| Mesos Master | http://urika1.turing.ac.uk:5050, http://urika2.turing.ac.uk:5050 |
| Spark Application's Web UI | http://urika1.turing.ac.uk:4040, http://urika2.turing.ac.uk:4040. In case multiple applications are launched, they run on ports 4041, 4042, 4043 onwards. |
| Spark History Server | http://urika2.turing.ac.uk:18080, http://urika2.turing.ac.uk:18080 |
| Grafana | http://urika2.turing.ac.uk:3000 |
| Jupyter Notebook | http://urika1.turing.ac.uk:7800 |
| Cray Application Management | http://urika1.turing.ac.uk/applications |
Whenever a user accesses an application user interface (e.g. Spark, Grafana) through the UAI, a banner containing
learning resources and other links is also visible. Here users can find the Urika system documentation and guides
through the learning resources link. Moreover, users are provided with tutorials on the software pre-installed on
the Urika GX. This banner is not visible when the users choose to access the user interfaces using the URLs listed
in the table above.
The Urika-GX Analytic Applications Guide [19] contains further URLs for the user interfaces of other Urika
GX applications.
More information regarding the role of each user interface on the Urika GX is given in section 3.5.
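The port numbering for the Spark application web UI listed in the table above (4040 for the first application, then 4041, 4042 and so on) can be sketched as a small helper. The function spark_ui_url is hypothetical, and the sketch assumes that each additional application on a host simply takes the next port. On a live system the port actually used depends on which ports are free when the application starts.

```python
def spark_ui_url(host, app_index, base_port=4040):
    """Return the expected web UI URL of the app_index-th application on a host."""
    if app_index < 0:
        raise ValueError("app_index must be non-negative")
    # The first application (index 0) uses the base port, later ones count upwards.
    return f"http://{host}:{base_port + app_index}"


# Expected URLs for the first three applications launched from urika1.
urls = [spark_ui_url("urika1.turing.ac.uk", index) for index in range(3)]
```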
3.4.2.2. Authentication
The following authentication mechanisms can be used to access certain user interfaces:
| User Interface | Username | Password |
| --- | --- | --- |
| Grafana | admin | admin |
| Jupyter Notebook | login_username | login_password |
| Mesos Master | login_username | The password can be found in the /security/secrets/userName.mesos file. |
| Cray Application Management UI | LDAP | username: admin, password: admin |
The Urika-GX Analytic Applications Guide [19] contains the authentication mechanisms for further Urika GX
user interfaces.
3.4.3. Data Transfer (EPCC-hosted Urika)
Users of the EPCC-hosted Urika can use secure copy (i.e. scp) to transfer data to and from the Urika GX. Instructions
on how to use scp to do this can be found at http://ati-rescomp-service-docs.readthedocs.io/en/latest/cray/connecting.html.
The SFTP network protocol is also supported (see man sftp).
3.5. Production Environment
Sections 3.5.1, 3.5.2 and 3.5.3 describe the components of Hadoop, Spark and Cray Graph Engine respectively,
provided on the Urika GX platform. The way Urika manages resources is presented in section 3.5.4, while the
different types of file systems are presented in section 3.5.5. Finally, the way fault tolerance is preserved and user
interfaces can be utilized is presented in sections 3.5.6 and 3.5.7 respectively.
3.5.1. Hadoop and its ecosystem
The Urika GX ships with Hortonworks Data Platform (HDP) which includes Apache Hadoop. Apart from the core
Hadoop components, the following Hadoop ecosystem components are installed on Urika GX:
| Component | Description |
| --- | --- |
| Apache Avro | A data serialization system. As explained at [12], when Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. |
| Apache DataFu | This is a collection of libraries for working with big data on Hadoop. For more information see https://datafu.apache.org/. |
| Hive | As explained at [13], this is a data warehouse system that uses SQL to read, write and manage large datasets residing in distributed storage. Structure can be projected onto data already in storage. |
| Hue | This is a visual interface or workbench for querying and visualizing data. |
| Apache Kafka | As explained at [14], this is a distributed streaming platform that enables publishing of and subscribing to streams of records. It stores these records in a fault-tolerant way and enables processing of these records as they occur. |
| Apache Oozie | This is a workflow scheduler system for managing Hadoop jobs. For more information see http://oozie.apache.org/. |
| Apache Parquet | As explained at [15], this is a columnar data storage format for Hadoop. |
| Apache Pig | As explained at [16], this is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The key feature of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. |
| Apache Sqoop | This is a tool for transferring bulk data between Hadoop and structured data stores. For more information see http://sqoop.apache.org/. |
| Apache HiveServer2 | This service enables clients to execute queries against Hive. |
| Apache Hive Thrift Server | RPC framework for building cross-platform services |
| Apache Zookeeper | Configuration manager for distributed systems |
More information on these components can also be found at http://hortonworks.com and http://www.apache.org.
3.5.2. Apache Spark
Urika GX supports the following Spark core and ecosystem components:
• Spark Core, DataFrames, and Resilient Distributed Datasets (RDDs)
• Spark SQL, Datasets and DataFrames
• Spark Streaming
• MLlib Machine Learning Library
• GraphX
More information about the above components can be found at https://spark.apache.org/documentation.html and in section 2.
3.5.3. Cray Graph Engine
Cray Graph Engine (CGE) is a software application capable of searching large graph-oriented databases and querying complex relationships between data items. CGE is designed to store and analyze datasets when the patterns
of relationships and interconnections between data items are at least as important as the data items themselves.
It includes two major components:
| Component | Description |
|---|---|
| Graph Oriented Database | Database that uses graph structures to store and represent data. |
| Resource Description Framework (RDF) | Data representation standard, presenting data as a triple containing a subject, a predicate and an object. |
Unlike relational databases, CGE uses RDF to store data. For more information
regarding RDF, see section 2.2.2 of the Cray Graph Engine User Guide [18].
3.5.4. Resource Management
Urika GX possesses a number of different resource management tools. Urika's resource management enables
system resources to be allocated dynamically, based on the needs of each application.
3.5.4.1. Mesos
Mesos acts as the primary resource manager on Urika-GX and lies between the operating system and the application layer. Its task is to optimize resource utilization.
Mesos does not decide how the different jobs are scheduled and executed. Moreover, Mesos does not offer a
queue. Instead, Mesos offers resources to the frameworks that are registered with it. It is up to the framework's
scheduler to decide whether to accept or reject the offer. If the offer is accepted, it is the framework's responsibility
to schedule the execution of the jobs, using the resources provided. If the offer is rejected, Mesos will
continue to make new offers based on resource availability. On Urika, the frameworks available with Mesos
are as follows:
• Marathon (for more information see section 3.5.4.2)
• Yarn (for more information see section 3.5.4.4)
• Each Spark job (for more information see section 3.5.4.5)
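The offer cycle described above can be illustrated with a small, self-contained Python simulation. This is only a sketch of the idea and does not use the Mesos API: the names Offer, Framework and offer_cycle, the node names and the core counts are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    """A hypothetical resource offer: some cores on one node."""
    node: str
    cores: int


@dataclass
class Framework:
    """A toy framework scheduler that accepts offers until its demand is met."""
    name: str
    cores_needed: int
    accepted: list = field(default_factory=list)

    def consider(self, offer):
        # The framework, not the master, decides whether to accept the offer.
        if self.cores_needed <= 0:
            return False
        self.accepted.append(offer)
        self.cores_needed -= offer.cores
        return True


def offer_cycle(free_nodes, frameworks):
    """Offer each free node to the frameworks in turn. A rejected offer
    is passed on to the next framework; unclaimed nodes stay free."""
    remaining = []
    for node, cores in free_nodes.items():
        offer = Offer(node, cores)
        if not any(fw.consider(offer) for fw in frameworks):
            remaining.append(node)  # nobody accepted; still free
    return remaining


# Hypothetical frameworks and nodes (names and sizes are made up).
frameworks = [Framework("marathon", 32), Framework("spark-job", 64)]
free = {"nid00001": 32, "nid00002": 32, "nid00003": 32, "nid00004": 32}
unused = offer_cycle(free, frameworks)
print("still free:", unused)
```

In this toy model the master only hands out offers; which jobs run on the accepted resources is left entirely to each framework, which mirrors the division of responsibility described above.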
The Mesos architecture consists of the following components:
• Mesos agents/slaves
• Mesos masters
Mesos agents (slaves) provide the cluster's resources. The Mesos master decides how many resources to offer to each
framework, according to an organizational policy (e.g. fair sharing or priority). The reasons why Mesos ships with
more than one master are presented in section 3.5.6. Mesos masters are configured with Apache Zookeeper.
3.5.4.2. Marathon
Marathon is used for launching long-running applications to run under Mesos and acts as a Mesos ecosystem component. Marathon is registered as a single framework with Mesos. Marathon's API is not capable of determining if
there are enough resources for a job that has not been submitted. Therefore, Marathon uses Mesos to negotiate for
resources. When Mesos informs Marathon that the required resources are available, the job is posted to Marathon.
Marathon instances are also configured with Zookeeper.
3.5.4.3. Mrun
Mrun is a Cray-developed application launcher, which is built upon Marathon commands. It uses Marathon in
order to set up resources for CGE and HPC jobs. Mrun is submitted as an application to Marathon, therefore no
job is posted until the resource requirements are satisfied. Mrun must be run from a login node; it cannot
be run from a tenant VM. Finally, if an Anaconda environment is active on the login node when mrun is
used, the compute nodes are aware of that environment.
The following commands are used to obtain information about the status of the Mesos and Marathon environment:
| Command | Description |
|---|---|
| $ mrun --info | Provides a snapshot of the active frameworks registered with Mesos, the Marathon applications and the available computing resources. |
| $ mrun --resources | Provides a list of the system's nodes along with their availability status and their CPU/memory specifications. |
The following commands are used to launch HPC applications:
| Command | Description |
|---|---|
| $ mrun app.exe | app.exe will run as a single task on one node. |
| $ mrun -n num_of_tasks -N num_of_nodes app.exe | app.exe will run as num_of_tasks tasks on num_of_nodes nodes. |
| $ mrun -n num_of_tasks -N num_of_nodes app.exe --wait --immediate=num_of_seconds | app.exe will run as num_of_tasks tasks on num_of_nodes nodes. If the required resources are not available instantly, mrun will continue to poll Mesos. Provided that the required resources become available within num_of_seconds seconds, the application will be posted to Marathon. Otherwise, mrun will time out. |
The following commands apply to a specific running Marathon application:
| Command | Description |
|---|---|
| $ mrun --detail appID | Outputs additional information about the application with ID appID. |
| $ mrun --cancel appID | Cancels or aborts the application with ID appID. |
For more information, read the manual of mrun ( $ man mrun ).
3.5.4.4. Yarn
Yarn acts as the resource manager for Hadoop jobs on Urika GX and uses its own queue for Hadoop workloads.
Cray has developed scripts to set up resources for Yarn. These scripts are submitted as applications to Marathon
and they allow the dynamic allocation of resources between Mesos and Yarn. Just like mrun, Yarn scripts cannot
be executed if the required resources are not available. When the requested nodes are not available, the current
resource availability is reported and the script exits. Yarn scripts are also known as flex scripts and are presented
below:
| Command | Description |
|---|---|
| $ urika-yam-status | Displays the list of existing applications and the resources allocated to each application. For more information, read the manual of urika-yam-status ( $ man urika-yam-status ). |
| $ urika-yam-flexup --nodes num_of_nodes --identifier request_id --timeout num_of_minutes | Used to 'flex up' num_of_nodes nodes. request_id is the unique identifier of the request. If the request is accepted and the flexed up nodes are idle for num_of_minutes minutes, the resources are released automatically. For more information, read the manual of urika-yam-flexup ( $ man urika-yam-flexup ). |
| $ urika-yam-flexdown --identifier request_id | Used to manually 'flex down' the nodes flexed up by the request with ID request_id. For more information, read the manual of urika-yam-flexdown ( $ man urika-yam-flexdown ). |
The following command is used for launching a Hadoop application:
| Command | Description |
|---|---|
| $ yarn jar file.jar main_class arg1 arg2 arg3 | Runs the job file.jar, where main_class is the main class of the executable and arg1, arg2 and arg3 are command line arguments of the program. The main class and the command line arguments are optional parameters. |
We now present an example of a Hadoop job submission, in which we run the TeraSort benchmark.
Assuming that we are already logged in to a login node, we use the following commands:
1. Checking whether there are available nodes to flex up:
$ mrun --resources
2. Checking whether we have already flexed up some nodes and how many nodes are flexed up for the needs
of other applications:
$ urika-yam-status
3. Flexing up 3 nodes for the needs of our job:
$ urika-yam-flexup --nodes 3 --identifier hadoopexample
4. If the folder expected to store the output of our Hadoop jobs already exists, it has to be deleted before the job's
submission:
$ hdfs dfs -rm -R /tmp/10gsort
5. This job generates the data which will be used as an input for TeraSort:
$ yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teragen 100 /tmp/10gsort/input
6. Executing TeraSort benchmark:
$ yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar terasort /tmp/10gsort/input /tmp/10gsort/output
7. This job evaluates the output of the TeraSort benchmark:
$ yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teravalidate /tmp/10gsort/output /tmp/10gsort/validate
8. Confirming the success of the validation by checking the output of the validation job:
$ hdfs dfs -ls /tmp/10gsort/validate
9. Flexing down the nodes that we used:
$ urika-yam-flexdown --identifier hadoopexample
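When such a workflow is scripted, it is easy to forget the final flex down if one of the jobs fails. The following Python sketch shows one possible way to guard against this, using only the commands listed in the steps above. The function names terasort_commands and run_workflow are hypothetical, and the sketch deliberately leaves out the status checks and the clean-up of the output folder (steps 1, 2, 4 and 8). As written it performs a dry run that only records and prints the commands.

```python
EXAMPLES_JAR = ("/usr/hdp/current/hadoop-mapreduce-client/"
                "hadoop-mapreduce-examples.jar")


def terasort_commands(workdir="/tmp/10gsort", rows=100):
    """Return the three yarn commands (steps 5 to 7) as argument lists."""
    yarn = ["yarn", "jar", EXAMPLES_JAR]
    return [
        yarn + ["teragen", str(rows), f"{workdir}/input"],
        yarn + ["terasort", f"{workdir}/input", f"{workdir}/output"],
        yarn + ["teravalidate", f"{workdir}/output", f"{workdir}/validate"],
    ]


def run_workflow(run, nodes=3, identifier="hadoopexample"):
    """Flex up, run the benchmark, and always flex down afterwards.

    `run` is any callable that executes one command given as an argument
    list, for example functools.partial(subprocess.run, check=True).
    """
    run(["urika-yam-flexup", "--nodes", str(nodes),
         "--identifier", identifier])
    try:
        for command in terasort_commands():
            run(command)
    finally:
        # Release the nodes even if one of the jobs fails.
        run(["urika-yam-flexdown", "--identifier", identifier])


# Dry run: record the commands instead of executing them.
recorded = []
run_workflow(recorded.append)
for command in recorded:
    print(" ".join(command))
```

To execute the commands for real on a login node, a callable such as functools.partial(subprocess.run, check=True) could be passed instead of recorded.append; this has not been tested on a Urika GX and should be treated as an assumption.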
3.5.4.5. Spark Jobs
Spark, like Marathon, is preconfigured to authenticate with Mesos. Each Spark job is registered as a separate
framework with Mesos. However, Spark jobs do not behave in the same way as mrun and the Yarn scripts: Spark
can accept offers with fewer resources than it requested.
In order to connect to Mesos, Spark master is set to:
mesos://zk://zoo1:2181,zoo2:2181,zoo3:2181/mesos
Spark launch wrapper scripts are used to launch Spark applications or interactive shells. These scripts are
located in:
/opt/cray/spark2/default/usrScripts
The provided spark launch wrapper scripts are the following:
• spark-shell
• spark-submit
• spark-sql
• pyspark
• sparkR
• run-example
The user can use the following flags to change the default settings of the above scripts:
| Flag | Description |
|---|---|
| --total-executor-cores | Sets the number of desired cores. |
| --driver-memory | Sets the desired amount of memory allocated to the driver. By default 16 gigabytes are allocated to the driver. |
| --executor-memory | Sets the desired amount of memory allocated to the executors. By default 96 gigabytes are allocated to each executor. |
Users also have the option to use both SSDs and HDDs (instead of SSDs alone) in order to provide their Spark
jobs with additional temporary space. To do so, they have to follow the steps below:
1. Create a file named spark_local_dirs.hdd under their home directory ( /home/users/username/spark_local_dirs.hdd )
2. Use the command echo true >> /home/users/username/spark_local_dirs.hdd to add
true to the file's contents.
To revert to the default configuration, users just have to delete spark_local_dirs.hdd.
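The same switch can also be scripted. The following Python sketch mirrors the two steps and the revert described above; the function names enable_hdd_scratch and disable_hdd_scratch are hypothetical, and the home directory is passed in explicitly rather than assumed.

```python
from pathlib import Path

FLAG_NAME = "spark_local_dirs.hdd"


def enable_hdd_scratch(home):
    """Create the flag file if needed and append the line 'true'
    (the equivalent of steps 1 and 2)."""
    flag = Path(home) / FLAG_NAME
    with flag.open("a") as handle:
        handle.write("true\n")
    return flag


def disable_hdd_scratch(home):
    """Delete the flag file to return to the default SSD-only setting."""
    flag = Path(home) / FLAG_NAME
    if flag.exists():
        flag.unlink()
```

On Urika GX the call would be, for instance, enable_hdd_scratch("/home/users/username"), with username replaced by the actual user name.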
For more information on resource management on Urika read Urika-GX Analytic Applications Guide [19].
3.5.5. File Systems
3.5.5.1. Hadoop Distributed File System (HDFS)
HDFS is a highly fault tolerant distributed file system. Hadoop uses HDFS to store data. The Urika GX also has
tiered HDFS data storage. HDFS data is stored on the SSDs and HDDs of Urika GX's compute nodes and is
transferred over the Aries Network. HDFS is the data store for all the Hadoop components on Urika GX. Users
cannot have write access to HDFS unless an administrator has provided them with a designated folder under
hdfs:///user .
3.5.5.2. Network File System (NFS)
NFS is a distributed file system protocol. NFS is made available to every node via the management network. NFS
is not suitable for big data transfers and large writes, as these cause the network to slow down considerably and
time out. Home directories are mounted on NFS, with limited space.
3.5.5.3. Lustre
Lustre is a parallel distributed file system. It is suitable for larger data sets and it is supported as an external file
system on Urika GX. Lustre is mounted at /mnt/lustre.
For more information on Urika's file systems read Urika-GX Analytic Applications Guide [19].
3.5.6. Fault Tolerance
Urika GX is fault tolerant and so provides resiliency against system failures. Failed jobs are re-scheduled automatically.
3.5.6.1. Zookeeper
Zookeeper enables highly reliable distributed coordination. On Urika, 3 Zookeeper instances are running, while a
minimum of 2 are always available. Urika uses Zookeeper to provide Mesos and Marathon with fault tolerance.
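A Zookeeper ensemble keeps working as long as a majority of its servers is up. Assuming the figures above reflect this majority rule, the arithmetic behind "3 instances, 2 required" can be written out as a short Python sketch (the function names are hypothetical):

```python
def quorum_size(instances):
    """Smallest majority of an ensemble of the given size."""
    return instances // 2 + 1


def tolerated_failures(instances):
    """Number of instances that can fail while a majority survives."""
    return instances - quorum_size(instances)


for n in (3, 5):
    print(n, "instances: quorum", quorum_size(n),
          "- tolerated failures", tolerated_failures(n))
```

With three instances, the quorum is two, so the ensemble survives the loss of one instance.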
3.5.6.2. Hadoop
Whenever there is a failure in the execution of a Hadoop job, the corresponding process is reported to the master
and is re-scheduled.
3.5.6.3. Spark
Spark tracks transformations and actions through an acyclic lineage graph. In case of a failure, Spark detects the
point of failure and re-schedules the after-the-failure computations to a different node.
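The idea of recovering lost data from lineage can be illustrated with a toy model in plain Python. This is not Spark code and greatly simplifies what Spark does (there are no stages, shuffles or executors). The Dataset class below is a hypothetical stand-in for an RDD that remembers its parent and the transformation that produced it.

```python
class Dataset:
    """A toy stand-in for an RDD: it remembers how it was derived."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source    # raw partitions, only for the root dataset
        self.parent = parent    # lineage: the dataset this one derives from
        self.fn = fn            # transformation applied to each element
        self.cache = {}         # partition index -> computed elements

    def map(self, fn):
        return Dataset(parent=self, fn=fn)

    def compute(self, index):
        """Return one partition, recomputing it from lineage if needed."""
        if index not in self.cache:
            if self.parent is None:
                data = list(self.source[index])
            else:
                data = [self.fn(x) for x in self.parent.compute(index)]
            self.cache[index] = data
        return self.cache[index]


numbers = Dataset(source=[[1, 2], [3, 4]])
squares = numbers.map(lambda x: x * x)
plus_one = squares.map(lambda x: x + 1)

first = [plus_one.compute(i) for i in range(2)]

# Simulate losing the node that held partition 1 of the derived datasets.
del plus_one.cache[1]
del squares.cache[1]
recovered = plus_one.compute(1)
print(first, recovered)
```

Only the lost partition is recomputed, by replaying the recorded transformations on its parent data; partition 0 is left untouched.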
3.5.6.4. Mesos
Mesos runs in high availability mode. Like Zookeeper, Mesos has 3 master instances running. If the active
master fails, one of the remaining two is elected as the new master. In this way, no disturbance takes place during
the resource management process.
3.5.6.5. Marathon
Marathon also has 3 running instances and follows the same procedure as Mesos if one of these instances
fails. If a Mesos task fails, Marathon will accept more resources from Mesos and another task will be launched,
usually on a different node.
For more information regarding Urika's fault tolerance read Urika-GX Analytic Applications Guide [19].
3.5.7. User Interfaces
Section 3.4 contains information regarding the different ways to access the most important user interfaces featured
on Urika GX.
3.5.7.1. Urika GX Applications Interface (UAI)
UAI is the primary entry point for viewing a number of applications running on Urika GX. Moreover, it is used for
accessing training material and monitoring the system's health information.
3.5.7.2. Grafana
Grafana is a metrics, dashboard, and graph editor. Grafana can be used for the monitoring of system resources.
Two of the major components of Grafana are the following:
• Organizations correspond to different deployment models.
• Users are named accounts in Grafana.
A user can belong to one or more organizations. Furthermore, a user can have different privileges, depending on
the role they have been assigned.
For information regarding the performance analysis tools of Grafana, visit section 3.7.
3.5.7.3. Cray Application Management UI
The Cray Application Management UI contains information about both running and finished jobs. This UI enables
users to access the logs of Spark jobs or delete jobs they have submitted.
3.5.7.4. Spark
Each Spark application launches its own Web UI. This UI can be used to monitor running Spark jobs and displays
useful information, such as scheduler stages/tasks, RDD sizes/memory usage etc. On the other hand, the Spark
History Server monitors completed Spark jobs.
Both of these UIs link Spark applications to the Grafana UI, where more information regarding resource utilization
is displayed.
3.5.7.5. Hadoop
Hadoop jobs can be monitored using the following three interfaces:
• Hadoop Job History Server
• YARN Resource Manager
• Cray Application Management UI
3.5.7.6. Mesos
Mesos Web UI can be used to monitor different components of the Mesos cluster, such as the Mesos slaves,
resources and frameworks. Users can use Mesos Web UI to view the resources reserved as well as their tasks.
Users should avoid launching applications directly from Mesos Web UI.
3.5.7.7. Marathon
Marathon Web UI can be used to create applications. Users should avoid deleting the analytic applications
that use the 'flex scripts', unless it is necessary to shut down nodes used by Yarn.
For more information regarding the available Urika UIs read Urika-GX Analytic Applications Guide [19].
3.6. Programming Environment
Section 3.6.1 explains the basic programming components of Urika GX, while section 3.6.2 describes how Anaconda can be used. In section 3.6.3 Jupyter Notebook is presented, while in the last section (3.6.4) the programming options offered by Spark are explained.
3.6.1. Components
Some of the components of Urika's analytics environment are presented below:
• Python 2
• Python 3
• Scala
• R
• Anaconda Python
• Numpy
• Scipy
• Git
• gcc
• Apache Maven
In addition to the above components, Urika also has a number of environment modules.
3.6.2. Anaconda
Anaconda contains conda, which is an open source package and environment manager. Anaconda enables users to easily install pre-compiled software locally, without needing administrator privileges. A user can use the following
commands to load Anaconda and perform basic management of its environments:
| Command | Description |
|---|---|
| $ module load anaconda3 | The anaconda3 module is loaded and Anaconda Python becomes the default Python. |
| $ conda create --name py36Env python=3.6 | A new environment with Python 3.6 is created. |
| $ source activate py36Env | The conda environment py36Env is activated. Both PySpark and Python utilize the active environment. |
| $ source deactivate | The active conda environment is deactivated. |
| $ module unload anaconda3 | The anaconda3 module is unloaded. |
For more information regarding the management of Anaconda environments, users can visit https://docs.anaconda.com/anaconda/navigator/tutorials/manage-environments.
3.6.3. Jupyter Notebook
Jupyter Notebook is a web application that creates executable documents. Moreover, it enables adding explanatory text between executable cells.
On Urika GX, Jupyter Notebook supports by default the following kernels:
• Python 2
• Python 3
• Bash
• R
• PySpark
• Scala
• SparkR
Jupyter Notebook users often need Python libraries and packages that are not provided by the
default Python kernels. In this case, they can create a new, customizable IPython kernel through an Anaconda
environment. We present an example below:
| Command | Description |
|---|---|
| $ module load anaconda3 | The anaconda3 module is loaded. |
| $ conda create --name jupyterEnv python=3.6 | A new environment with Python 3.6 is created. |
| $ conda install --name jupyterEnv ipykernel | ipykernel is installed in the jupyterEnv environment. |
| $ source activate jupyterEnv | The jupyterEnv environment is activated. |
| $ python -m ipykernel install --user --name jupyterEnv --display-name "My Python Kernel" | The Python kernel My Python Kernel is created from the jupyterEnv environment. |
After the execution of the commands above, My Python Kernel is added to the kernel options provided by the Jupyter Notebook user interface. This kernel is able to use every Python package installed in the
jupyterEnv Anaconda environment. Moreover, the user does NOT have to activate jupyterEnv every time My Python Kernel is to be used.
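A quick way to confirm that a notebook really runs in the intended environment is to inspect the interpreter from inside a cell. The following sketch uses only the Python standard library; the helper name kernel_report is hypothetical and the package names are examples.

```python
import importlib.util
import sys


def kernel_report(packages):
    """Describe the interpreter behind the current kernel and report
    which of the named packages it can import."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "available": {
            name: importlib.util.find_spec(name) is not None
            for name in packages
        },
    }


report = kernel_report(["json", "numpy"])
print(report["executable"])
print(report["available"])
```

When the cell is run with a kernel created from a conda environment, the printed executable path would be expected to lie inside that environment's directory.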
A user must stop their notebooks before they log off or else the notebook will continue to use resources unnecessarily. In cases where Jupyter processes are still running after a user has logged out, the Linux kill command
can be used to manually kill them.
3.6.4. Apache Spark
The spark/2.3.0 environment module is loaded by default when a user logs in to a login node. On Urika
GX, Spark comes with APIs for Java, Scala, Python and R.
3.6.4.1. Java
Java applications are built using Maven. A dependency should be added to the pom.xml file,
similar to the following example:
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.0</version>
  </dependency>
</dependencies>
3.6.4.2. Scala
Scala applications are built using the Scala Build Tool (sbt). A dependency, like the one presented below, should be
added to the .sbt file.
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
A user can set the required number of Spark cores using Scala code. If NUM_CORES is the required number
of Spark cores, the following lines should be entered in the Spark shell (after invoking the spark-shell wrapper
script) or in a Jupyter notebook.
sc.stop()
val conf = new org.apache.spark.SparkConf().set("spark.cores.max", "NUM_CORES")
val sc = new org.apache.spark.SparkContext(conf)
3.6.4.3. PySpark
PySpark is aware of Anaconda environments. If an Anaconda environment is active, PySpark
uses the Python version of that environment. To override Anaconda's Python version, the
PYSPARK_PYTHON environment variable should be set manually to point to the required Python version.
It is possible for the user to change the default number of Spark cores required, by adding the following lines to
Spark's command line (after invoking the pyspark wrapper script) or to Jupyter notebook:
from pyspark import SparkConf, SparkContext
sc.stop()
sc = SparkContext(conf=SparkConf().set("spark.cores.max", "NUM_CORES"))
NUM_CORES is the required number of Spark cores.
3.6.4.4. SparkR
SparkR can also be used to set the required number of Spark cores used for a Spark application. The following
commands should be added to Spark's command line (after SparkR wrapper script has been invoked) or to
Jupyter notebook:
sparkR.session(spark.cores.max = "NUM_CORES")
3.7. Performance Analysis
Urika's main performance analysis tool is Grafana. Grafana dashboards gather all the visualizations into a single interface and include:
• Statistical data regarding network, I/O and CPU utilization both for every node and the system as a whole.
• Metrics regarding Hadoop applications and cluster.
• Statistical data regarding Spark jobs.
3.8. Tuning
Spark's configurations on Urika GX have some differences from Spark's standard configurations:
• spark.shuffle.compress = false and spark.locality.wait = 1 . These configurations
result in a better performance for some applications on Urika GX. In case an application is running out of
memory or SSD space, spark.shuffle.compress should be switched back to true .
• Each executor is provided with 96GB of memory, while the driver is provided with 16GB.
By default, Spark stores temporary files on the SSDs of the compute nodes. However, a combination of HDDs and
SSDs offers flexibility, especially when a Spark job requires large shuffle space. On the other hand, using only
SSDs provides the best performance. For information on how to change the default storage of Spark's temporary
files read section 3.5.4.5.
In section 3.5.4.5 we noted that Spark jobs accept offers with fewer resources than requested. Users can control the minimum amount of resources that a Spark application can accept through the property spark.scheduler.minRegisteredResourcesRatio.
Section 20.4 of Urika-GX Analytic Applications Guide [19] refers to more tunable Spark and Hadoop configuration parameters.
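In standard Apache Spark, individual properties such as the ones discussed in this section can be overridden per job with the --conf option of spark-submit. The following Python sketch assembles such a command line. It assumes that the Urika GX wrapper script passes --conf through unchanged, which has not been verified here. The helper name spark_conf_args, the application name my_app.py and the ratio 0.8 are purely illustrative.

```python
def spark_conf_args(settings):
    """Turn a dict of Spark properties into spark-submit '--conf' arguments."""
    args = []
    for key, value in sorted(settings.items()):
        if isinstance(value, bool):
            value = str(value).lower()   # Spark expects true/false
        args += ["--conf", f"{key}={value}"]
    return args


# Illustrative values only: re-enable shuffle compression and wait until
# 80% of the expected resources have registered before scheduling starts.
tuning = {
    "spark.shuffle.compress": True,
    "spark.scheduler.minRegisteredResourcesRatio": 0.8,
}
command = ["spark-submit"] + spark_conf_args(tuning) + ["my_app.py"]
print(" ".join(command))
```

Keeping the overrides in a dictionary makes it easy to record which tuning parameters were used for a given run.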
3.9. Debugging
In section 3.9.1 we refer to the components of Urika that can be used as debugging tools, while in section 3.9.2
we present the location of the log files of different applications of Urika GX.
3.9.1. Tools
In section 3.5.7 we presented Hadoop Job History Server UI and Yarn Resource Manager UI, which can be used
for the monitoring of Hadoop applications. Regarding Spark jobs, Spark Web UI and Spark History Server can
help during the debugging process, while the Spark shell can also be an effective debugging tool. Cray Application
Management UI can also be used for the monitoring and the debugging of Hadoop, Spark and CGE jobs.
3.9.2. Log Files
Apart from presenting the status of applications, Cray Application Management UI provides links to the
generated log files of the various jobs. The physical locations of the log files of some applications are given below:
| Application | Log File Location |
|---|---|
| Mesos | /var/log/mesos |
| Marathon | /var/log/messages |
| Grafana | /var/log/grafana/grafana.log |
| Hadoop | /var/log/hadoop/hdfs/ and /var/log/hadoop/yarn/ on the individual compute nodes; /app-logs in HDFS. |
| Spark | /var/log/mesos/agent/slaves/ on the individual compute nodes. |
| Jupyter Notebook | /var/log/jupyterhub.log |
| Flex Scripts | /var/log/urika-yam.log |
For more information regarding debugging, log files and troubleshooting of the various Urika applications read
Urika-GX Graph Engine User Guide [18] and Urika-GX Analytic Applications Guide [19].
Further documentation
Books
[1] Best Practice Guide - Intel Xeon Phi, January 2017, http://www.prace-ri.eu/IMG/pdf/Best-Practice-GuideIntel-Xeon-Phi-1.pdf .
Websites, forums, webinars
[2] PRACE Webpage, http://www.prace-ri.eu/.
[3] Apache Spark, https://spark.apache.org/docs/latest/index.html.
[4] Apache Spark Streaming, https://spark.apache.org/docs/latest/streaming-programming-guide.html.
[5] Apache Spark SQL, https://spark.apache.org/docs/latest/sql-programming-guide.html.
[6] Apache Spark MLlib, https://spark.apache.org/docs/latest/ml-guide.html.
[7] Apache Spark GraphX, https://spark.apache.org/docs/latest/graphx-programming-guide.html.
[8] Pregel: a system for large-scale graph processing, Proceedings of the 2010 ACM SIGMOD International
Conference on Management of Data (SIGMOD '10), pages 135-146, Indianapolis, Indiana, USA, June 6-10, 2010.
[9] Cluster Mode Overview, https://spark.apache.org/docs/latest/cluster-overview.html.
[10] Spark Standalone Mode, https://spark.apache.org/docs/latest/spark-standalone.html.
[11] Urika-GX Hardware Guide (Rev C.) H-6142, https://pubs.cray.com/content/00485604-DB/FA00242577.
[12] Apache Avro 1.8.2 Documentation, http://avro.apache.org/docs/current/.
[13] Apache Hive, https://hive.apache.org/.
[14] Apache Kafka, https://kafka.apache.org/intro.
[15] Apache Parquet, https://parquet.apache.org/.
[16] Apache Pig, https://pig.apache.org/.
Manuals, papers
[17] PRACE Public Deliverable 7.6 Best Practice Guides for New and Emerging Architectures, http://www.prace-ri.eu/IMG/pdf/D7.6_4ip.pdf.
[18] Cray® Graph Engine User Guide (3.1.UP02) S-3014 .
[19] Urika®-GX Analytic Applications Guide (2.0.UP00) S-3015 .
[20] Urika®-GX System Overview (2.0.UP00) S-3017.