Advanced Database Technology:
Resource Materials for Lab Session:
MYSQL:
The MySQL Server provides an industry-standard SQL interface to the cluster
enabling complex, relational queries to be run, and providing connectivity to all
of the standard MySQL connectors including:
Common web development languages and frameworks, i.e. PHP, Perl,
Python, Django, Ruby, Ruby on Rails, etc;
JDBC (for additional connectivity into ORMs including EclipseLink,
Hibernate, etc)
.NET, ODBC, etc
MySQL Document store gives users maximum flexibility in developing
traditional SQL relational applications and NoSQL schema-free document
database applications. This eliminates the need for a separate NoSQL
document database. Developers can mix and match relational data and
JSON documents in the same database as well as the same application.
For example, both data models can be queried in the same application
and results can be in table, tabular or JSON formats.
Specialized clients can perform SQL as well as CRUD operations on the
document database. These clients are MySQL Shell and the MySQL
Connectors. MySQL Shell is an interactive interface using JavaScript, Python or
SQL modes. Connectors are used for developing applications using
programming languages like Java, NodeJS, Python, C++, etc.
One of the great new features in MySQL 8.0 is the Document Store. Now with
MySQL you can store your JSON documents in collections and manage them
using CRUD operations.
A NoSQL database has a dynamic schema for unstructured data. Data is stored
in many ways which means it can be document-oriented, column-oriented,
graph-based, or organized as a key-value store. This flexibility means that
documents can be created without having a defined structure.
NoSQL databases are horizontally scalable. This means that you handle more
traffic by sharing or adding more servers in your NoSQL database. It is similar to
adding more floors to the same building versus adding more buildings to the
neighbourhood. Thus NoSQL can ultimately become larger and more powerful,
making these databases the preferred choice for large or ever-changing data
sets.
MONGODB
MongoDB is a document database. It stores data in a type of JSON format
called BSON. MongoDB is an open-source nonrelational database management
system that uses flexible documents instead of tables and rows to process and
store various forms of data. As a NoSQL database solution, MongoDB does not
require a relational database management system (RDBMS), so it provides an
elastic data storage model that enables users to store and query multivariate
data types with ease. This not only simplifies database management for
developers but also creates a highly scalable environment for cross-platform
applications and services.
A record in MongoDB is a document, which is a data structure composed of
key-value pairs similar to the structure of JSON objects.
Records in a MongoDB database are called documents, and the field values
may include numbers, strings, booleans, arrays, or even nested documents.
SQL vs Document Databases
SQL databases are considered relational databases. They store related data in
separate tables. When data is needed, it is queried from multiple tables to join
the data back together.
MongoDB is a document database which is often referred to as a non-relational
database. This does not mean that relational data cannot be stored in
document databases. It means that relational data is stored differently. A better
way to refer to it is as a non-tabular database.
MongoDB stores data in flexible documents. Instead of having multiple tables,
you can simply keep all of your related data together. This makes reading your
data very fast.
You can still have multiple groups of data too. In MongoDB, instead of tables,
these are called collections.
Cassandra :
Apache Cassandra is an open-source NoSQL distributed database trusted by
thousands of companies for scalability and high availability without
compromising performance. Linear scalability and proven fault tolerance on
commodity hardware or cloud infrastructure make it the perfect platform for
mission-critical data.
Cassandra is a free and open-source, distributed, wide-column
store, NoSQL database management system designed to handle large amounts
of data across many commodity servers, providing high availability with
no single point of failure.
The key space in Cassandra is a namespace that defines data replication across
nodes. Therefore, replication is defined at the keyspace level. Below is an
example of key space creation, including a column family in CQL 3.0: The
design goal of Cassandra is to handle big data workloads across multiple nodes
without any single point of failure. Cassandra has peer-to-peer distributed
system across its nodes, and data is distributed among all the nodes in a
cluster.
Data Replication in Cassandra
In Cassandra, one or more of the nodes in a cluster act as replicas for a given
piece of data. If it is detected that some of the nodes responded with an out-
of-date value, Cassandra will return the most recent value to the client. After
returning the most recent value, Cassandra performs a read repair in the
background to update the stale values.
The following figure shows a schematic view of how Cassandra uses data
replication among the nodes in a cluster to ensure no single point of failure.
Note − Cassandra uses the Gossip Protocol in the background to allow the
nodes to communicate with each other and detect any faulty nodes in the
cluster.
Cassandra Query Language
Users can access Cassandra through its nodes using Cassandra Query Language
(CQL). CQL treats the database (Keyspace) as a container of tables.
Programmers use cqlsh: a prompt to work with CQL or separate application
language drivers.
Clients approach any of the nodes for their read-write operations. That node
(coordinator) plays a proxy between the client and the nodes holding the data.
Apache HIVE:
Apache Hive is a distributed, fault-tolerant data warehouse system that enables
analytics at a massive scale. Hive Metastore(HMS) provides a central repository
of metadata that can easily be analysed to make informed, data-driven
decisions, and therefore it is a critical component of many data lake
architectures.
Hive is built on top of Apache Hadoop and supports storage on S3, adls, gs etc
through hdfs. Hive allows users to read, write, and manage petabytes of data
using SQL.
The Hive Metastore (HMS) is a central repository of metadata for Hive tables
and partitions in a relational database, and provides clients (including Hive,
Impala and Spark) access to this information using the metastore service API. It
has become a building block for data lakes that utilize the diverse world of
open-source software, such as Apache Spark and Presto. In fact, a whole
ecosystem of tools, open-source and otherwise, are built around the Hive
Metastore, some of which this diagram illustrates.
Hive provides full ACID support for ORC tables and insert-only support to all
other formats.