Module 3.
Hyperstore Software Architecture
Now, we are going to talk about some of the design elements of Hyperstore's
software architecture.
Some of these we have already mentioned.
By now you already know Hyperstore is natively S3 compatible.
Hyperstore is also designed using a peer-to-peer architecture, sometimes
referred to as a shared-nothing architecture.
This inherently allows for partition tolerance, or the ability to continue
operations in the event that hardware fails.
As we also mentioned, we store metadata in a distributed database to allow for a
scale-out architecture.
For this, we use Cassandra.
Cassandra is not a Cloudian product.
It is maintained by Apache and was originally designed by Facebook.
Hyperstore is designed to be multi-tenant: in other words, able to share storage
across multiple applications and multiple users,
or in the case of managed service providers, multiple tenants.
We have two programmable APIs in the solution.
The first, the S3 API, we have already talked about.
There is another called the admin API, which we will discuss later on.
Although Hyperstore is intended as an on-prem solution, we do support the ability
to tier data to the public or other private clouds and support hybrid cloud models.
We also provide intelligent, proactive support capabilities using Smart Support.
And lastly, there is the intelligence of the software, which is the power behind our
Hyperstore solution.
At a high level, Hyperstore can be regarded as a collection of integrated services.
Each, however, plays a different role.
These are access, metadata management, data management, user management, quality
of service, system management from the CLI, system management via the web UI portal,
and lastly configuration management.
Some of these services we refer to as common services.
These common services run on every node in the cluster. Others are regarded as
master/slave services and only run on some of the nodes.
The common services are S3, Cassandra, Hyperstore, the admin service, and the CMC
service.
In a large, distributed environment, it is beneficial and in very, very large
environments necessary to have a single source of configuration management.
With Cloudian Hyperstore, we achieve this using the Puppet configuration
management platform.
This modular approach allows us to scale linearly for both capacity and
performance.
It is a single software stack, which makes it easy to scale with no separate
gateways, storage managers, or metadata servers.
This allows us to provide a single scalable namespace with capacity and performance
scaling as the cluster expands.
Hyperstore Interfaces. We have two programmable APIs, the S3 API, and the admin
management API.
Applications capable of interfacing with the S3 API can do so directly.
The other API is the admin API, which listens on port 19443 on all servers in the
cluster, assuming a single region.
A customer or partner who wishes to create their own management portal can do so
and utilize the same functionality as the CMC web interface.
This is also useful for customers or partners who wish to write scripts which can
be run as required directly against the admin management API.
Our CMC is our web interface portal. From this portal, all administration tasks can
be performed as you can see from the slide.
The web interface can interface directly with the admin management API.
When an action is taken on the portal, in the background, the corresponding admin
management API request is sent to one of the servers listening on the admin
management API port.
You can also see an arrow pointing from the CMC web interface to the S3 API.
This is used to allow a user or administrator or group administrator to view the
buckets and objects of users on the system.
Our Hyperstore S3 service supports the vast majority of the Amazon S3 REST API,
including advanced features.
It supports the same REST formats for the following areas as specified in the S3
specification.
It works over HTTP and HTTPS. It is responsible for error responses and the REST
error format,
common request headers and response headers, authenticating requests,
and operations on the service, buckets, objects, and ACLs.
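To make this concrete, here is a minimal sketch of connecting a standard S3 client to the Hyperstore S3 service, using boto3 in Python as an example library. The endpoint URL, credentials and bucket name are placeholders, not values from this module.

```python
# Minimal sketch: pointing a standard S3 client (boto3) at a HyperStore
# S3 endpoint. Endpoint URL, credentials and bucket name are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3-region1.example.com",  # your HyperStore S3 endpoint
    aws_access_key_id="ACCESS_KEY",                 # user's access key
    aws_secret_access_key="SECRET_KEY",             # user's secret key
)

# Standard S3 operations work unchanged against the HyperStore S3 service.
s3.create_bucket(Bucket="demo-bucket")
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"hello world")
print(s3.list_objects_v2(Bucket="demo-bucket")["Contents"][0]["Key"])
```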
The Hyperstore Admin API. In comparison, whereas the S3 API is used for data
management, the Hyperstore Admin API is used to manage Hyperstore.
This includes creating user groups, managing storage policies and rating plans,
setting quality of service controls, storing auto-tiering credentials,
managing public URLs, usage reporting and billing, as well as system services and
system monitoring.
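As a hedged illustration of scripting against the admin API, the sketch below issues an HTTPS request to port 19443 using Python's requests library. The request path, credentials and TLS settings are assumptions for illustration only; the actual resources and authentication are defined in the Admin API reference for your cluster.

```python
# Illustrative sketch of calling the HyperStore Admin API with requests.
# The path, credentials and certificate handling below are assumptions,
# not values confirmed by this module.
import requests

ADMIN_ENDPOINT = "https://admin.example.com:19443"   # admin API port from this module
AUTH = ("admin-user", "admin-password")              # placeholder credentials

resp = requests.get(
    f"{ADMIN_ENDPOINT}/group/list",   # hypothetical resource path
    auth=AUTH,
    verify=False,                     # only for lab / self-signed certificates
)
resp.raise_for_status()
print(resp.json())
```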
The Cloudian Management Console, or CMC, is a customizable or brandable multi-
tenant web user portal.
This portal allows for management of the Hyperstore cluster and reporting.
It provides an S3 client or data explorer with upload, download, delete and file
sharing features.
It provides group and user account management through role-based access control, as
well as configuration settings for SMTP and SNMP alerting for cluster health.
Hyperstore Data Management. From a high level, the structure of a Hyperstore
cluster consists of a number of nodes.
We support a minimum of three nodes. Each of those nodes must reside within a data
center.
Note that a node cannot reside in more than one data center, and node placement
should be based on physical location.
A data center or multiple data centers reside within a single region.
A region is generally a geographic location. However, a region can also be used to
segregate data from other data, for example for multi-tenancy.
If tenants require data separation from other tenants at the node level, then
multiple regions could be a way to achieve this.
Multiple regions would exist within a single cluster.
A single Cloudian Management Console can manage up to 20 regions.
A user can create a bucket or select a bucket from any region in the cluster.
A bucket within a region can span multiple data centers but cannot span multiple
regions.
Data is distributed between nodes and data centers in a region based on a defined
storage policy.
Hyperstore's cluster architecture is based on a shared nothing architecture.
Each node has connectivity to every other node in the cluster.
There is no limitation on individual nodes, and nodes can be of varying capacity or
hardware.
They can be on premises or in the cloud, and indeed physical or virtual.
The peer-to-peer nature ensures that data is only ever one hop away.
This provides for consistent performance.
It is identical software on all the nodes irrespective of the hardware.
As the software is built around the S3 API at its core, there are no gateways
involved and you can start small and grow as needed.
As S3 is a common service, the S3 API is reachable on every single node in the
cluster.
Therefore, scaling the S3 capability is a simple matter of adding additional
nodes.
A single Hyperstore cluster can have one or many storage policies.
Storage policies determine the data protection and consistency level applied,
which in turn determines the data protection and consistency at the bucket and
object level.
Consistency levels can be applied to ensure strong consistency, eventual
consistency, or dynamic consistency.
These will be discussed later.
From a disaster recovery perspective, we can create standard replication across
data centers,
replicated erasure coding across data centers, and where there are three or more
data centers in a metro area,
then distributed erasure coding is an option.
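As a simple illustration of what these protection choices mean for capacity, the arithmetic below compares a three-copy replication policy with an erasure coding 4+2 policy for a single 1 GB object. The replication factor of 3 is an assumed example, not a default stated in this module.

```python
# Illustrative arithmetic only: raw-capacity overhead of 3-copy replication
# versus erasure coding 4+2 for a 1 GB object. RF=3 is an assumed example.
object_gb = 1.0

replicas = 3
replicated_raw = object_gb * replicas                    # 3.0 GB of raw capacity

ec_data, ec_parity = 4, 2                                # erasure coding 4+2
ec_raw = object_gb * (ec_data + ec_parity) / ec_data     # 1.5 GB of raw capacity

print(f"3x replication stores {replicated_raw:.1f} GB, EC 4+2 stores {ec_raw:.1f} GB")
```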
In this slide, we will try to show how data is distributed in a cluster.
When an object is stored, a combination of the bucket name and the object name,
as well as the date and timestamp, is converted into an MD5 hash, which is then
converted into an integer.
This gives us a unique key, which we use to determine where we will store the data
in the cluster.
For metadata, the storage policy also determines the number of metadata copies.
A node list is generated based on the number of copies required,
and the metadata is stored in the Cassandra database on those nodes, on SSD.
A similar process happens for the data.
The number of object copies or fragments required, based on replication or
erasure coding,
as well as the size of the object, is used to generate a node list that determines
where each fragment or copy of the data will be stored
on the high-capacity drives of the relevant nodes.
This diagram attempts to explain that process.
At the top is the client which needs to store an object.
The S3 request is sent to one of the nodes in the cluster through either the use of
a load balancer or DNS round robin.
That node takes on a dynamic role which we call the coordinator role.
It is the coordinator's responsibility to determine if there are enough nodes
available to satisfy the storage policy data protection scheme.
If there are, the coordinator node is then responsible for sending the data
fragments to each of the nodes that will store them.
In this example, the object being stored is 60 megabytes in size.
We adopt a feature called chunking in order to reduce that 60 megabyte file down
into manageable 10 megabyte chunks.
Each of those 10 megabyte chunks can be regarded as a separate object for storage,
and as such each chunk will be protected
using the storage policy's data protection scheme, in this instance erasure coding
4 plus 2.
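The arithmetic behind that example can be sketched as follows; fragment sizes are simple division and ignore any metadata or padding overhead.

```python
# Worked example of the chunking described above: a 60 MB object is split
# into 10 MB chunks, and each chunk is protected with erasure coding 4+2.
object_mb = 60
chunk_mb = 10
ec_data, ec_parity = 4, 2

chunks = object_mb // chunk_mb                              # 6 chunks
fragment_mb = chunk_mb / ec_data                            # 2.5 MB per fragment
fragments_per_chunk = ec_data + ec_parity                   # 6 fragments per chunk
raw_stored_mb = chunks * fragments_per_chunk * fragment_mb  # 90 MB of raw capacity

print(f"{chunks} chunks, {fragments_per_chunk} x {fragment_mb} MB fragments each, "
      f"{raw_stored_mb} MB stored in total")
```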
As the initial location where data is stored is based on the text of the bucket
name, object name, and timestamp,
it is possible that the resultant integer range could be allocated to some
disks more than others.
This could produce an imbalance of data storage.
Cloudian Hyperstore uses a feature called dynamic object routing, which will
periodically determine if certain disks are being more utilized than others.
If this happens, the dynamic object routing table will start to use the other disks
in the nodes in order to store the data in a more balanced way.
Dynamic object routing will also be employed where a disk in a node has failed;
the remaining disks in that node can then be used to store data that was destined
for the failed disk.
Cluster expansion is a simple process of adding additional nodes to increase
capacity and/or front-end compute.
It would be inefficient if new nodes in the cluster were only able to receive net
new data.
Therefore, the process of adding additional nodes will also rebalance the
existing data across all nodes in the cluster.
This allows for a more balanced capacity usage going forward.
As well as data expansion being a non-intrusive process which can all be done
online, so too is our software upgrade process.
There is no requirement for downtime.
Once initiated, it is an automated process.
Software upgrades are currently done using the Puppet framework, where the upgrade
is processed on a node-by-node basis.
Assuming a data protection scheme and consistency level have been employed that can
tolerate a node failure,
the node upgrade will have no impact on the operation of the cluster.
Automated maintenance and self-healing.
Data protection determines the durability of the stored data.
Consistency level determines data availability.
With a consistency level which utilizes eventual consistency, data remains
available even when a disk or node is in a failure state.
Eventual consistency also helps reduce S3 write request latencies, as not all
writes need to be committed before acknowledgement is returned to the client.
It however does still ensure a high degree of data durability.
It must be noted that not all of the object's intended replicas or erasure coded
fragments may exist in the system when using an eventual consistency policy.
The data will be written when the node or disk is brought back online, either
automatically via proactive repair or other means.
We utilize proactive repair, repair on read and auto repair for this process.
The current state of repair within the cluster can be seen via the CMC by clicking
on the analytics tab and then rebuilding status subtab.
Here we can see the repair status legend which identifies which nodes are in
proactive rebuild, rebalancing data, conducting a data integrity check or requiring
repair.
It also shows nodes which the cluster regards as all clear.
Green is good.
Proactive repair happens when a node has been down for a limited time.
The default is restoration of any replicas that were destined for
that node, provided the node was down for less than 4 hours.
This is also known as hints.
If the node has been offline for greater than 4 hours then additional features are
required to restore the intended number of replicas or erasure coding fragments to
the node.
One of these is called auto repair.
Not all nodes in the cluster will run auto repair at the same time, however; it is
staggered over a 30-day period across all nodes in the cluster.
A common process to repair objects outside of these time scales is repair on read.
This will repair replicas or fragments which are either out of date or missing
during a read process.
If a replica or fragment is missing or out of date during a GET request, then that
data is re-filled during the GET process.
System architecture.
As hopefully you will be starting to understand, Cloudian Hyperstore is very much a
network-based solution.
As such, there are some network components and information that Cloudian
Hyperstore relies on to function correctly.
One of these bits of information is the network name of the cluster.
We refer to this as an S3 endpoint.
The S3 endpoint is a URL the S3 client uses to connect to Hyperstore.
If this is a public URL with no access restrictions, that is all that is required.
If, however, the data is protected, then to ensure only authorized access, an
access key and secret key are also required.
As the S3 endpoint is a URL, it requires a DNS lookup to translate the DNS name to
an IP address.
It is recommended that the DNS lookup points to a virtual IP address, commonly
running on a load balancer.
The load balancer will then, through health checks, be able to determine which of
the cluster nodes are able to respond to the S3 request.
If a load balancer is not used then basic DNS round robin is an option.
However this is not without its challenges.
As can be seen in this slide using DNS round robin can very quickly become complex
as each endpoint and URL requires an entry for each Hyperstore node.
In a small cluster of three nodes this may be manageable.
However, as clusters typically grow, each additional node added requires multiple
entries in the DNS table.
These required DNS entries can be seen here.
As mentioned the S3 service endpoint is just one of several required DNS entries.
There will also be one of these per region.
Each bucket that is created in Hyperstore can also be accessed via the bucket name.
For this we would recommend the use of a wildcard DNS entry as bucket names may not
always be readily known ahead of time.
And this helps reduce the management overhead of having to add a new DNS entry
every time a bucket is created.
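To illustrate why the wildcard entry helps, the sketch below assumes virtual-hosted-style addressing, where the bucket name becomes part of the hostname. The endpoint, credentials and bucket name are placeholders.

```python
# Minimal sketch: with virtual-hosted-style addressing the bucket name becomes
# part of the hostname, so every bucket needs "<bucket>.<s3-endpoint>" to
# resolve. Endpoint and credentials below are placeholders.
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3-region1.example.com",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    config=Config(s3={"addressing_style": "virtual"}),
)

# This request goes to demo-bucket.s3-region1.example.com, which a wildcard
# DNS record (*.s3-region1.example.com) can resolve without a per-bucket entry.
s3.list_objects_v2(Bucket="demo-bucket")
```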
We also support static website hosting from a bucket.
If this is a feature that is required, then as it is enabled at the bucket level,
you will also need a wildcard entry for the website endpoint.
A critical DNS entry is the admin service.
As each server or node runs the admin service, it has to be added into the DNS.
The Cloudian Management Console service needs to be able to resolve that entry so
it can issue admin API requests.
Lastly every node can also respond and supply access to the Cloudian Management
Console.
Therefore a DNS entry for the CMC is also required.
Cloudian would recommend the use of a load balancer rather than DNS round robin for
the following reason.
DNS typically does not employ health checks of the service and is only supplying an
IP address to the client that has requested the name.
It is possible the supplied IP address may be down and therefore the S3, CMC or
admin request will time out based on the TTL in your network.
This could be considerable.
Subsequent requests will receive the next IP address in the DNS list.
However this gives inconsistent performance and user experience.
By employing a load balancer it is possible to not only send requests to IP
addresses which are online but extend that to only send the request to a node where
the service is also online.
It is possible to have a Hyperstore node where only the S3 service has been taken
down for some reason.
In this scenario we would still want CMC and admin requests to be routed to
this node.
Hyperstore can now also be found in cloud architecture.
Customers, users and tenants can access data via the CMC or third-party S3 API
applications and clients.
Service providers and administrative tenants can use the cloud provider management
portal or CMC portal
in order to configure access to Cloudian Hyperstore through the cloud
architecture or orchestration tools natively.
This allows for the provisioning, usage tracking and reporting, alerting, logging
and quota management as well as chargeback.
An example of this is VMware VCD using a virtual object storage engine.
Some of the capabilities and offerings can be seen here for management.
Using the VCD UI integration provides a single management platform for both
orchestration and usage of Cloudian Hyperstore.
It can be used for importing and exporting VCD objects and provides single sign-on
between the Hyperstore CMC and VCD.
It also provides for multi-region support.
Further capabilities include reporting for usage tracking and utility based
consumption.
Some of the use cases supported are providing an S3-compatible storage service for
temporary storage,
acting as a document store service, file sharing with guest access, and holding VCD
catalogs and templates.
User management capabilities include object storage specific rights in VCD, quota
management and usage integration.
There is also close integration using the Hyperstore APIs for S3, IAM and admin.