ECA 5 - 15 Course Guide VB
COPYRIGHT
Copyright 2020 Nutanix, Inc.
Nutanix, Inc.
1740 Technology Drive, Suite 150
San Jose, CA 95110
All rights reserved. This product is protected by U.S. and international copyright and intellectual
property laws. Nutanix and the Nutanix logo are registered trademarks of Nutanix, Inc. in the
United States and/or other jurisdictions. All other brand and product names mentioned herein
are for identification purposes only and may be trademarks of their respective holders.
License
The provision of this software to you does not grant any licenses or other rights under any
Microsoft patents with respect to anything other than the file server implementation portion of
the binaries for this software, including no licenses or any other rights in any hardware or any
devices or software that are used to communicate with or in connection with this software.
Conventions
• variable_value – The action depends on a value that is unique to your environment.
• ncli> command – The commands are executed in the Nutanix nCLI.
Version B
Last modified: July 17, 2020
Contents
Copyright...................................................................................................................2
License.................................................................................................................................................................. 2
Conventions........................................................................................................................................................ 2
Version.................................................................................................................................................................. 2
Module 1: Introduction......................................................................................... 9
Making Computing Invisible: A History of Cloud Computing........................................................9
What is Server Virtualization?................................................................................................................... 11
Traditional Three-Tier Architecture........................................................................................................ 12
Nodes, Blocks, and Clusters.......................................................................................................................12
Acropolis............................................................................................................................................................ 13
Prism Overview............................................................................................................................................... 13
Cloud Computing (Public, Private, and Hybrid Clouds)................................................................. 15
Why Hybrid Cloud?....................................................................................................................................... 17
Reducing Friction in the Hybrid Cloud..................................................................................... 17
State in Data and Applications: Heirlooms vs Handkerchiefs.......................................... 19
Networks: Pipes of Friction.......................................................................................................... 20
Reducing Friction Through Automation................................................................................. 20
Nutanix Zero Trust Architecture..................................................................................................21
Governance and Compliance....................................................................................................... 23
Capital Expenditures (Capex) vs. Operating Expenditures (Opex)..............................24
Predictable vs. Unpredictable Workloads...............................................................................25
One Platform. Any App. Any Location.................................................................................... 27
Other Resources............................................................................................................................................ 28
Additional Lab Source: Test Drive.............................................................................................28
Nutanix University.............................................................................................................................29
Nutanix Certification.................................................................................................................................... 29
Labs..................................................................................................................................................................... 29
allssh and hostssh.............................................................................................................................39
aCLI Example...................................................................................................................................... 39
PowerShell Cmdlets..................................................................................................................................... 39
PowerShell Cmdlets Examples.................................................................................................... 39
PowerShell Cmdlets (Partial List).............................................................................................. 40
REST API.......................................................................................................................................................... 40
Labs.................................................................................................................................................................... 40
Module 4: Networking.......................................................................................54
Overview........................................................................................................................................................... 54
Default Network Configuration............................................................................................................... 54
Default Network Configuration (cont.)................................................................................................ 55
Open vSwitch (OVS)....................................................................................................................................55
Bridges............................................................................................................................................................... 56
Ports.................................................................................................................................................................... 56
Bonds..................................................................................................................................................................56
Bond Modes.........................................................................................................................................57
Virtual Local Area Networks (VLANs)................................................................................................. 60
IP Address Management (IPAM)..............................................................................................................61
Network Segmentation............................................................................................................................... 62
Configuring Network Segmentation for an Existing RDMA Cluster............................. 63
Network Segmentation During Cluster Expansion..............................................................63
Network Segmentation During an AOS Upgrade................................................................63
Reconfiguring the Backplane Network.................................................................................... 63
Disabling Network Segmentation.............................................................................................. 64
Unsupported Network Segmentation Configurations....................................................... 64
AHV Host Networking.................................................................................................................................64
Recommended Network Configuration................................................................................... 64
AHV Networking Terminology Comparison...........................................................................67
Labs..................................................................................................................................................................... 67
Data Storage Representation...................................................................................................................98
Storage Components.......................................................................................................................98
Understanding Snapshots and Clones................................................................................................. 99
Clones.....................................................................................................................................................99
Shadow Clones.................................................................................................................................. 99
Snapshotting Disks.........................................................................................................................100
Capacity Optimization - Deduplication............................................................................................... 101
Deduplication Process................................................................................................................... 102
Deduplication Techniques............................................................................................................103
Capacity Optimization - Compression................................................................................................103
Compression Process.................................................................................................................... 104
Compression Technique Comparison..................................................................................... 104
Workloads and Dedup/Compression......................................................................................105
Deduplication and Compression Best Practices.................................................................105
Replication Factor.......................................................................................................................................106
Erasure Coding Basics...............................................................................................................................106
EC-X Compared to Traditional RAID...................................................................................... 107
EC-X Process.....................................................................................................................................108
Erasure Coding in Operation......................................................................................................109
Replication Factor 3 with Erasure Coding: 6-Node........................................................... 110
Replication Factor 2 with Erasure Coding: 4-Node.............................................................111
Labs......................................................................................................................................................................111
CVM Unavailability...........................................................................................................................128
Node Unavailability.........................................................................................................................130
Drive Unavailability......................................................................................................................... 130
Boot Drive (DOM) Unavailability................................................................................................131
Network Link Unavailability.........................................................................................................133
Redundancy Factor (Fault Tolerance)................................................................................................ 133
Block Fault Tolerant Data Placement................................................................................................. 135
Rack Fault Tolerance................................................................................................................................. 136
VM High Availability in Acropolis......................................................................................................... 136
High Availability............................................................................................................................... 138
Affinity and Anti-Affinity Rules for AHV............................................................................................138
Limitations of Affinity Rules........................................................................................................139
Labs....................................................................................................................................................................139
Operations Deep Dive.................................................................................................................... 171
Applications Deep Dive.................................................................................................................172
Labs....................................................................................................................................................................172
Module 1: Introduction
Companies also demand flexibility, choice, agility, and cost efficiency from these enabling
technologies to ensure that business capabilities can change with changing demand, market,
and business mission.
In the past decade, the rise of cloud services fulfilled the demand for more IT and business
agility, bringing new applications and services online almost overnight. Yet this ability created
secondary issues of system and data sprawl, governance and compliance stratification, and it
ultimately cost businesses more than traditional datacenter models since it lacked mature cost
controls.
So, for these reasons and more, businesses realized that certain workloads and data sets were
more suitable for their datacenters, while others required the web-scale architecture that
reached a global audience, without friction or impedance.
This is how the hybrid cloud model was born. It blends control, flexibility, security, scalability,
and cost effectiveness, serving the needs of both business and customers alike.
But to really understand the business drivers that led to the hybrid cloud, we need to briefly
discuss where all of this began. And everything began with the mainframe.
Mainframe Computing
Starting with mainframe computing, users had the ability to create massive, monolithic code
structures that lived on largely siloed and custom-built equipment. The processing power and
centralized design of this system made it very costly and inflexible – maintaining each system
required specialized training and careful coordination to ensure minimal business disruptions.
In the event of a failure, a secondary mainframe was required, and restoring from backup tapes could take days to complete. Applications had to be custom written for each platform, which was both time consuming and expensive.
Unix Servers
The creation of Unix operating systems saw a standardization of hardware and software into more focused and manageable systems. This homogeneity enabled computer operators to standardize their skill sets and maintain systems similarly across any business, or even across multiple enterprises. Unix system hardware was still specialized by vendor, though (IA64, SPARC, and others), as were the different Unix operating systems. Applications were developed, but still could not be ported between disparate vendors, creating lock-in and requiring customized skill sets for computer operators.
Intel x86
Enter the Intel x86 platform: a commoditized set of hardware that could be delivered rapidly and cheaply, and that standardized the way hardware systems are created today. With the underlying hardware architecture streamlined, operating systems on x86 systems could be managed more easily than their counterparts. Parts and whole systems became interchangeable, OS images could be ported easily between systems, and applications could be rapidly developed and migrated to new systems.
The advancement of Intel x86 systems also saw hardware innovation outpace software demand: multicore systems with abundant memory and storage would sit idle, or underutilized at times, due to the static nature of the system size. x86 servers suffered from their own success and required further innovation at the software level to unlock the next advance: virtualization.
Virtualized x86
Intel x86 software virtualization abstracts an operating system from its underlying hardware,
allowing any x86 operating system to run simultaneously with other x86 operating systems on
the same bare metal server. This allowed for even more flexibility, cost savings, and efficiency,
as well as portability – now, applications shipped preinstalled within virtual machine “files” or
images. These virtualized systems maximized the density of operating systems to hardware,
cutting costs in the datacenter as well as enabling newer programmatic ways to rapidly deploy
new workloads.
Virtualized systems still required the overhead of maintenance and a specialized skill set to
operate and maintain, and quite often businesses suffered from the operational complexity
of maintaining hundreds or even thousands of virtual machines at scale. Upgrades, updates,
and system maintenance still required careful coordination and planning, and often disrupted
business operations. This model was positively changed again when containers were
introduced.
Containers
Containers are prepackaged images of software, using fractions of the compute and storage
capacity of virtual machines, that can be instantaneously deployed upon any container runtime
via automation and orchestration. These tiny compute units allowed developers to rapidly
test and deploy code in a matter of minutes instead of days – checking in software changes
to a repository, and enabling an automated software build and test cycle that could be simply
monitored and not managed heavily. Containers also enabled applications to be subdivided
into smaller “micro-services”, where an entire application need not reside in the same instance
or operating system, but only a fraction that could service a known business demand. This
capability combined with the “pay for what you use” operational model of the cloud allowed
businesses to truly only pay for what services they needed, when they needed them.
Serverless
With serverless computing, developers deploy code or functions directly, and the cloud provider provisions, scales, and manages all of the underlying infrastructure on demand. At every step of the way, computing power was enhanced, made more efficient, and drew applications closer to their desired operating environment: the right performance, fulfilling the right customer demand, at exactly the right cost and scale.
However, serverless computing isn't the only option for performance, cost, and scale. There is, as we'll see later in this module, an inherent risk in surrendering complete hardware control to a third party and simply consuming resources as a service. An organization losing control of its data is hardly an ideal scenario, so there's a clear need for an on-prem solution that provides the same benefits that serverless computing offers, but without the security and governance risks.
Hardware virtualization involves virtual machines (VMs), which take the place of a “real”
computer with a “real” operating system.
One of the main reasons businesses use virtualization technology is server virtualization, which uses a hypervisor to “duplicate” the hardware underneath. In a non-virtualized environment, the operating system (OS) works directly with the underlying hardware. When virtualized, the OS still runs as if it were on that hardware, letting companies enjoy much of the same performance they expect, without dedicating hardware to each workload. Though physical and virtualized performance aren't always equal, virtualization works well and is often preferable, because most guest operating systems don't need complete access to the hardware. As a result, businesses can enjoy better flexibility and control and eliminate any dependency on a single piece of hardware.
Because of its success with server virtualization, virtualization has spread to other areas of the
datacenter, including applications, networks, data, and desktops.
Put simply, a virtualization solution streamlines your enterprise datacenter. It abstracts away the complexity of deploying and administering a virtualized solution, while providing the flexibility needed in the modern datacenter.
Virtualization Terminology
1. Host machine: the physical server whose hardware resources (CPU, memory, storage, and networking) the hypervisor manages.
2. Hypervisor: creates a virtual version of a once-physical system and manages multiple guest VMs simultaneously. Apps and OSs are abstracted away from the hardware, and each VM is presented with virtualized hardware.
3. Guest (virtual) machine: the virtual machine (VM) itself. VMs run their own OS; interaction with the physical hardware is done through para-virtualized drivers.
Legacy infrastructure—with separate storage, storage networks, and servers—is not well
suited to meet the growing demands of enterprise applications or the fast pace of modern
business. The silos created by traditional infrastructure have become a barrier to change and
progress, adding complexity to every step from ordering to deployment to management. New
business initiatives require buy-in from multiple teams, and your organization needs to predict its IT infrastructure needs 3 to 5 years in advance. As most IT teams know, this is almost impossible to
get right. In addition, vendor lock-in and increasing licensing costs are stretching budgets to the
breaking point.
A node is an x86 server with compute and storage resources. A single Nutanix cluster can have
an unlimited number of nodes. Different hardware platforms are available to address varying
workload needs for compute and storage.
A block is a chassis that holds one to four nodes, and contains power, cooling, and the
backplane for the nodes. The number of nodes and drives depends on the hardware chosen for
the solution.
Acropolis
Acropolis is the foundation for a platform that starts with hyperconverged infrastructure
then adds built-in virtualization, storage services, virtual networking, and cross-hypervisor
application mobility.
For the complete list of features, see the Software Options page on the Nutanix website.
• AHV
• Distributed Storage Fabric (DSF)
• App Mobility Fabric
AHV is the hypervisor, while DSF and the App Mobility Fabric are functional layers in the Controller VM (CVM).
Note: Acropolis also refers to the base software running on each node in the
cluster.
Prism Overview
Prism is the management plane: a unified management interface that generates actionable insights for optimizing virtualization, and that provides infrastructure management and everyday operations.
Prism gives Nutanix administrators an easy way to manage and operate their end-to-end virtual
environments. Prism includes two software components: Prism Element and Prism Central.
Prism Element
Prism Element provides a graphical user interface to manage most activities in a Nutanix
cluster.
Some of the major tasks you can perform using Prism Element include:
Prism Central
Prism Central provides multicluster management through a single web console and runs as a separate VM.
Note: We will cover both Prism Element and Prism Central in separate lessons
within this course.
Prism Interface
Prism is an end-to-end management solution for any virtualized datacenter, with additional
functionality for AHV clusters, and streamlines common hypervisor and VM tasks.
The information in Prism focuses on common operational tasks grouped into four areas:
• Infrastructure management
• Operational insight
• Capacity planning
• Performance monitoring
Prism provides one-click infrastructure management for virtual environments and is hypervisor
agnostic. With AHV installed, Prism and aCLI (Acropolis Command Line Interface) provide more
VM and networking options and functionality.
The arrival of cloud computing to enterprise IT brought much more than new business value
and end-user utility. It also added a great deal of confusion. An entirely new set of terms
was created to describe the many varieties of virtual data storage and transmission. First,
you learned about private clouds, or cloud environments that were created to only support
workloads from a specific organization. Private cloud infrastructure like this is usually created
utilizing resources within a company’s own on-prem datacenter. Then as time progressed, you
learned about clouds that are publicly accessed and consumed (public clouds). This means
that all hardware-based networking, storage, and compute resources are owned and managed
by a third-party provider like Amazon Web Services (AWS), Microsoft Azure, or Google Cloud
Platform (GCP). Though workloads are partitioned for data security, these resources are shared
by the customers of a particular public cloud provider.
With two types of clouds to account for now, you would naturally need terminology to describe
the transmission of applications and data between public and private clouds. This architecture
is known as a hybrid cloud. An encrypted highway of sorts, hybrid cloud allows operators to
perform a single task leveraging two separate cloud resources. Most hybrid cloud environments combine the resources of two separate clouds. These could be two private clouds, two public clouds, or a mix of both. If you were to visualize a Venn diagram, with an on-prem private cloud assigned to the left and a cloud-hosted private cloud to the right, a hybrid cloud would entail the sum of both parts. The overlapping space in the middle represents the encrypted layer.
This middle ground between clouds provides a vital bridge for data transmission. It allows
organizations to leverage cloud capabilities without compromising productivity or security.
Hybrid cloud is a natural fit for several types of organizations:
• Businesses that are managing resources privately in both on-prem and cloud-hosted environments.
• Companies who are migrating from a complete on-prem solution to a configuration that
incorporates some usage of public cloud capacity.
• Organizations that are moving back to a private, on-prem datacenter from being primarily
cloud-based.
Hybrid cloud infrastructure provides notable flexibility for organizations. You enjoy the secure
access of on-prem resources while also having the rapid scale and elasticity of the public cloud.
And, encrypted data sharing enables industries that manage hypersensitive information such
as public sector entities, law offices, financial service institutions, and healthcare providers
to consume cloud services. Organizations from these industries can store and share data as
needed with external partners while still adhering to regulatory compliance guidelines such as
HIPAA, ISO, PCI-DSS, CIS, NIST, and SOC 2.
Flexibility
Industry survey findings also make it clear that enterprise IT teams highly value having the flexibility to
choose the optimum IT infrastructure for each of their business applications on a dynamic
basis, with 61% of respondents saying that application mobility across clouds and cloud types
is “essential.” Cherry-picking infrastructure in this way to match the right resources to each
workload as needs change results in a growing mixture of on- and off-prem cloud resources,
a.k.a. the hybrid cloud.
Security
The proverbial “cloud” is no longer the simple notion it once was. There was a time when IT
made a fairly straightforward decision whether to run an application in its on-prem datacenter
or in the public cloud. However, with the growth of additional cloud options, such as managed
on premises private cloud services, decision-making has become much more nuanced. Instead
of facing a binary cloud-or-no-cloud situation, IT departments today more often are deciding
on which cloud(s) to use, often on an application-by-application basis.
There is a pressing need for a single platform that can span private, distributed, and public
clouds so that operators can manage their traditional and modern applications using a
consistent cloud platform.
Using the same platform on both clouds, a hybrid model dramatically reduces the operational
complexity of extending, bursting, or migrating your applications and data between clouds.
Operators can conveniently use the same skill sets, tools, and practices used on-prem to
manage applications running in public clouds such as AWS. Nutanix Clusters integrates with
public cloud accounts like AWS, so you can run applications within existing Virtual Private
Clouds (or VPCs), eliminating network complexity, and improving performance.
Maintain cloud optionality with portable software that can move with your applications and
data across clouds. With a consistent consumption model that spans private and public clouds,
you can confidently plan your long-term hybrid and multicloud strategy, maximizing the
benefits of each environment.
• Capacity Bursting: Rapidly add incremental capacity for Dev/Test or seasonal demands.
• On-demand Geo Expansion: Easily expand into regions beyond your current physical
presence.
In a hybrid world, where cost efficiency is of utmost importance, you must differentiate the
items that must exist and persist, and those that need only be leveraged when called upon.
This concept is called “having state” – where a database may be required to be online 24x7 and
replicated across multiple geographies to ensure data availability and accessibility, while the
application front-end used to access these data may only need to service a given set of users
for their localized working hours.
In this example, the data can be considered ever present, managed, and stateful. These data
are groomed, protected, and correlated with other data sets. The application front-end may be
ephemeral, having only the number of instances required to service the customer’s needs, and
some spare capacity to fulfill potential incoming sessions as needed.
If these application instances ebb and flow with user demand, they can reduce or increase cost
as needed – much like adjusting the flow of water through a faucet, you only use what you need
at any given time. Applications in this type of configuration are not useful for what they are, so
much as what they do.
In that respect, they are much like handkerchiefs – users can use them for their needs, and then
throw them away when they are done. This optimizes cloud resource utilization, minimizes the
impact of complex environments and security footprints at scale, and ultimately preserves cost
efficiency for the business while still providing the agility desired.
The hybrid cloud model presents new challenges with access to these data as well: can you
access the data you need, in the timeframe you require, via the platform of your choice? To this end, data locality is a differentiated advantage, providing faster access and lower turnaround time in delivering requested resources to end users. If data locality reduces friction between users and their services, then the way that data is transmitted can strengthen or weaken that advantage by increasing or decreasing performance and access speed. The more direct a path a set of data requests can take through the cloud, the faster the response times services can maintain,
while simultaneously limiting the number of points of failure any given network has in the data
path. Ultimately this reduces friction in the cloud and provides agility for both businesses and
customers alike.
By automating repeatable, predictable, manual tasks, and by inserting a self-service portal through which service consumers can easily satisfy their demands without a chain of human hand-offs, we reduce the task time for IT workers from days per person to potentially zero input. This allows IT workers to streamline their processes even further, and shift their work onto more
proactive tasks that build on top of this model. Each iteration that allows the business to
consume IT services in a streamlined fashion builds more value and realizes greater innovation
cycles for IT as a whole.
This organization and automation further drive home the desired goal of “Write Once, Read
Many” – where expected and repeatable work can be automated and delivered at the speed
of business without the need for reactive means or interrupting the flow of IT innovation.
Innovation thereby becomes the default state for your IT department, rapidly growing the
capabilities of the business and further creating a differentiation between a business and their
competitors.
At each of these layers Nutanix helps assert a Zero Trust Architecture to ensure that only the
right user, on the right device and network, has access to the right applications and data.
Identity
Authentication
Prism Central supports user authentication. There are three authentication options:
• Local user authentication: Users can authenticate if they have a local Prism Central account.
• Active Directory authentication: Users can authenticate using their Active Directory (or
OpenLDAP) credentials when Active Directory support is enabled for Prism Central.
• SAML authentication: Users can authenticate through a qualified identity provider when SAML support is enabled for Prism Central. The Security Assertion Markup Language (SAML) is an open standard for exchanging authentication and authorization data between two parties, such as ADFS as the identity provider (IdP) and Prism Central as the service provider.
Authorization: RBAC
Prism Central supports role-based access control (RBAC) that you can configure to provide
customized access permissions for users based on their assigned roles. The roles dashboard
allows you to view information about all defined roles and the users and groups assigned to
those roles.
• Configuring authentication confers default user permissions that vary depending on the
type of authentication (full permissions from a directory service or no permissions from an
identity provider). You can configure role maps to customize these user permissions.
• You can refine access permissions even further by assigning roles to individual users or
groups that apply to a specified set of entities.
• With RBAC, user roles do not depend on project membership. You can use RBAC and log in to Prism Central even without a project membership.
Data-at-rest encryption can be delivered through self-encrypting drives (SED) that are factory-
installed in Nutanix hardware. This provides strong data protection by encrypting user and
application data for FIPS 140-2 Level 2 compliance. For SED drives, key management servers
are accessed via an interface using the industry-standard Key Management Interface Protocol
(KMIP) instead of storing the keys in the cluster. Nutanix also provides the option to use a native data-at-rest encryption feature that does not require specialized hardware such as self-encrypting drives (SEDs).
Hardware encryption leveraging SEDs and external key management software is costly to configure and maintain, as well as complex to operate. Nutanix recognized the need to reduce this cost and complexity by enabling software-based storage encryption, introduced in AOS 5.5. The introduction of native key management also allowed the key management function to be integrated into the Nutanix cluster itself. This further reduced cost and complexity, and
this software-defined solution helps companies further drive simplicity and automation in their
security domains.
Nutanix solutions support SAML integration and optional two-factor authentication for system
administrators in environments requiring additional layers of security. When implemented,
administrator logins require a combination of a client certificate and username and password.
Capex
Capex refers to any large purchase a business makes for a project or initiative, normally paid up-front for goods or services to be delivered. Capex costs for IT are normally allocated
for purchases of software and hardware, or one-time costs of service provided for large
projects. An easy way to think of capex costs would be to think of a bottle of water: When you
are thirsty, you simply purchase a bottle of water and drink it. The water consumed can be
considered the utility of the asset, or usefulness of a product (as you are now no longer thirsty)
and the bottle itself is also an asset that can either be repurposed (recycled, refilled with more
water, or used for an art project). In this example, we fully own the bottle, the water inside, and
the utility or value of that water (we are no longer thirsty).
Businesses have traditionally leveraged capex cost models to purchase assets as there are
two major financial benefits: tax write-offs and financial reporting measures. Capex costs are
sometimes seen as a necessary business expense, and as such are written off on taxes on a
depreciation schedule. This means that a purchase of one million dollars ($1M) can be used to
reduce corporate taxes either in one fiscal year, or spread out over the lifetime of the asset
itself (known as a depreciation schedule). Capex purchases also usually result in the gain of
an asset or multiple assets, that can grow the company’s balance sheet while spreading out
the depreciation of that asset’s cost over multiple years. This approach can ultimately show
an increase in valuation of a company and reflect directly in stock price, and from a business
perspective this increase is clearly desirable.
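As a simple, hypothetical illustration of a depreciation schedule: if that $1M purchase were depreciated straight-line over five years, the business would record an expense of $200,000 per year ($1,000,000 ÷ 5) for each of those five years, rather than taking a single $1M deduction in the year of purchase.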
There are also added responsibilities that come with the capex cost model: assets are now
owned, and require upkeep and maintenance; large capital costs in the enterprise need to
be planned for at least a year (or multiple years) in advance, and budgeted by management
to ensure responsible spending; capital costs may not be accurately forecasted and become
a drag on corporate finances (inappropriate sizing of hardware/software; lack of training
incorporated in projects; unknown or undocumented software costs and renewals, etc.). Cost
controls for capital expenditures are normally budgeted for yearly, and follow a schedule of
approvals at various levels in the management chain, giving tight control to corporate costs at
various stages.
Opex
• Regular payments made at repeatable intervals (daily, weekly, monthly, etc.) for
subscriptions and services that assist with the normal daily operation of the business.
• Fully tax deductible at the time of purchase.
In contrast to capex, opex are regular payments made at repeatable intervals (daily, weekly,
monthly, etc.) for subscriptions and services that assist with the normal daily operation of
the business. Normal operational expenses include: payroll for employees, utility expenses
such as electrical costs and cellular provider usage, and SaaS subscription fees. In our water
example, imagine that opex means that you would “pay by the drink” and not own the bottle.
This simplifies purchasing for many businesses, being able to “buy the value” without owning
and managing the asset.
Public cloud expenses fall under this category of operational expenses, as they are a “pay
as you go” model, and none of the physical assets in the cloud are owned by the subscriber.
The financial advantage here is that any subscriptions purchased via opex funds are fully tax
deductible at the time of purchase and may ultimately mean more real-time margin for the
company, equating to hopefully more financial gains for shareholders.
One downside to opex is that there are no assets to add to the corporate balance sheet, so
the additional value is in the utility of a service, and not in the purchase of the actual item
used. Another danger is that the variable nature of cloud usage equates to an unknown monthly expense. As opposed to capex payments, which are up-front costs to the business and budgeted for yearly, opex costs follow a utility model – paid for after usage – which can put a strain on finances if not forecasted appropriately. Cloud sprawl, and hence cost sprawl, are
quite common when adopting a new cloud model of operation and can be detrimental to the
business if not monitored carefully.
Ultimately there are benefits and drawbacks to both financial expense models, and every
business has a certain combination of both capex and opex in use. Knowing the different
options and how to optimize them will ultimately make IT departments successful in delivering
business value while effectively managing costs at scale.
Every business has applications that it relies upon for normal operations: database servers that store and retrieve valuable data; web servers for eCommerce sites or even simply a company's online presence; and massively complex and intricate ERP systems that manage
the flow from order to delivery, along with AI workloads used to formulate and model predictive algorithms. This diversity of application usage and need requires hardware and software capacity to service it and, to make things more interesting, the usage of any given application can vary wildly depending on the business's needs.
Predictable workloads can be defined as the hardware and software that supports any given
application, which in turn can be expected to operate in an observable and understood fashion.
We can “predict” how an application is going to be used, by simply knowing what business
needs that application serves. One good example of this predictability could be an email server.
An IT department would know how many users they have across the company, set limits on
attachment sizes sent per email, and know how many additional users they would have per
month/quarter/year to plan for growth. Usage by individual users will vary, but ultimately the
total expected performance needs and usage of an email server can be sized and predicted
successfully.
Unpredictable workloads can be viewed as the opposite of predictable workloads: any set
of hardware and software that support a business application that cannot be quantified
or understood to perform in an expected way. A good example of one of these workloads
might be an eCommerce site. Businesses regularly run marketing and sales campaigns to
drive potential customers to their websites to purchase products and services. While we can
(hopefully) expect customers to purchase our offerings, it is unknown how many users may be
attracted to which various campaigns, and hence visit the eCommerce site to buy something.
This lack of visibility is complicated when we look at the seasonality of workloads: Holidays can
see a sharp, drastic uptrend in purchasing from users, while the rest of the year we may have
more expectable results. Similarly, various events may see large inflows of user demand: the
World Cup drives billions of pageviews for FIFA every 4 years, while the remaining 3 years may see much less public demand on its IT infrastructure.
Before cloud computing, both predictable and unpredictable workloads would be hosted on
the same platforms in the datacenter. Businesses would capitalize costs, own assets to perform
the work desired, and service customer needs. The problem with this operating model is that
businesses would suffer from the ownership of an asset, while not reaping any benefit from it,
if that asset was purchased to service unpredictable workloads. The opposite can also be said:
when lacking the appropriate capacity to service customer demand, businesses would suffer
the loss of revenue by losing access to those customers who were unable to purchase goods or
services due to the increased demand.
How do we solve for both predictable and unpredictable workloads while still optimizing
cost and meeting customer demand? By combining the best of both worlds, the hybrid cloud
approach enables predictable workloads to attain efficiency and scale while maximizing the value of capital expense, and simultaneously enables unpredictable workloads to achieve the dynamic web-scale they require to meet the seasonality of user demand while obtaining the unique cost benefits of “pay as you go”. The elastic nature of the cloud perfectly meets unpredictable
workload needs, while the steady-state nature of predictable workloads can be perfectly suited
for datacenter consumption.
The image above encapsulates a broad idea that lies at the heart of Nutanix.
At the top are workloads. Workloads are the reason the underlying infrastructure exists and
is necessary. They are how a business runs, how it grows, and how it shapes its present and
future.
At the bottom are the choices that you have when you consider deploying and running your
workloads on Nutanix. Choosing Nutanix is meant to be liberating, rather than restrictive.
Nutanix supports several leading hardware platforms, so you can run Nutanix software on your
choice of hardware.
And the freedom that comes with choosing hardware platforms extends to the public cloud as
well. If you have workloads on AWS, Azure, or GCP, Nutanix integrates neatly and tightly with
them, so you can continue to benefit from the public cloud when necessary while leveraging the
strengths of your private cloud – in a true, hybrid model.
And in the middle, between the underlying infrastructure and the workloads is the Nutanix
Cloud Platform – the products that power this tremendous freedom. When you choose to
modernize your infrastructure with Nutanix HCI, the only way forward is up – to better security,
to simplified storage, to automated operations, and fully integrated enterprise-grade backup
and DR.
Each product represents a key component of the hybrid cloud. AOS, AHV, and Prism are the
foundation. Every other product can be layered on top and integrates with this foundation to
give you a fully featured enterprise-class hybrid cloud solution.
Other Resources
https://www.nutanix.com/resources
Nutanix also maintains a list of resources including whitepapers, solution briefs, ebooks, and
other support material.
http://www.nutanixbible.com
The Nutanix Bible has become a valuable reference point for those who want to learn
about hyperconvergence and web-scale principles or dig deep into Nutanix and hypervisor
architectures. The book explains these technologies in a way that is understandable to IT
generalists without compromising the technical veracity.
https://next.nutanix.com
Nutanix also has a strong community of peers and professionals, the .NEXT community. Access
the community via the direct link shown here or from the Documentation menu in the Support
Portal. The community is a great place to get answers, learn about the latest topics, and lend
your expertise to your peers.
https://www.nutanix.com/support-services/training-certification/
An excellent place to learn and grow your expertise is with Nutanix training and certification.
Learn about other classes and get certified with us.
Nutanix Test Drive is a tool provided at no cost that offers guided tours of multiple Nutanix products and features. You can access it directly at https://www.nutanix.com/test-drive-hyperconverged-infrastructure or through the My Nutanix portal.
Nutanix University
Nutanix University has many resources, including online training and an “Ask the Experts” section.
Nutanix Certification
Nutanix technical certifications are designed to recognize the skills and knowledge you've
acquired to successfully deploy, manage, optimize, and scale your Enterprise Cloud. Earning
these certifications validates your proven abilities and aptitude to guide your organization along
the next phase of your Enterprise Cloud journey.
Visit our website for more information about our certification portfolio.
Labs
1. TCO/ROI video (self-paced)
Module 2: Managing the Nutanix Cluster
Overview
Within this module, you will learn how to manage your Nutanix cluster using various tools. After
completing this module, you will know how to:
• Describe tools such as Prism Central, PowerShell, and the REST API.
Note: Within this module, we'll discuss various ways to manage your Nutanix cluster. First, we'll start with the Prism Element GUI, then talk about command line interfaces. Finally, we'll provide an overview of common tools such as PowerShell cmdlets and the REST API.
Cluster Example
• As part of the cluster creation process, all storage hardware (SSDs, HDDs, and NVMe) is
presented as a single storage pool.
When Nutanix is installed on a server, a Controller VM (CVM) is deployed. Every CVM has dedicated memory and reserved CPUs, allowing it to perform the various services required by the cluster.
The Nutanix cluster has a distributed architecture, which means that each node in the cluster
shares in the management of cluster resources and responsibilities. Within each node, there are
software components (aka AOS Services) that perform specific tasks during cluster operation.
All components run on multiple nodes in the cluster and depend on connectivity between their
peers that also run the component. Most components also depend on other components for
information.
Zookeeper
Zookeeper runs on either three or five nodes, depending on the redundancy factor (number of
data block copies) applied to the cluster. Zookeeper uses multiple nodes to prevent stale data
from being returned to other components. An odd number provides a method for breaking ties
if two nodes have different information.
Of these nodes, Zookeeper elects one node as the leader. The leader receives all requests for
information and confers with the two follower nodes. If the leader stops responding, a new
leader is elected automatically.
Zookeeper has no dependencies, meaning that it can start without any other cluster
components running.
Zeus
Zeus is an interface to access the information stored within Zookeeper and is the Nutanix
library that all other components use to access the cluster configuration.
A key element of a distributed system is a method for all nodes to store and update the
cluster's configuration. This configuration includes details about the physical components in the
cluster, such as hosts and disks, and logical components, like storage containers.
Medusa
Distributed systems that store data for other systems (for example, a hypervisor that hosts
virtual machines) must have a way to keep track of where that data is. In the case of a Nutanix
cluster, it is also important to track where the replicas of that data are stored.
Medusa is a Nutanix abstraction layer that sits in front of the database that holds metadata.
The database is distributed across all nodes in the cluster, using a modified form of Apache
Cassandra.
Cassandra
Cassandra is a distributed, high-performance, scalable database that stores all metadata about
the guest VM data stored in a Nutanix datastore.
Cassandra runs on all nodes of the cluster. The Cassandra monitor (Level-2) periodically sends a heartbeat to the daemon that includes information about the load, schema, and health of all the nodes in the ring. The Cassandra L2 monitor depends on Zeus/Zookeeper for this information.
Stargate
A distributed system that presents storage to other systems (such as a hypervisor) needs a
unified component for receiving and processing data that it receives. The Nutanix cluster has a
software component called Stargate that manages this responsibility.
All read and write requests are sent across an internal vSwitch to the Stargate process running
on that node.
Stargate depends on Medusa to gather metadata and Zeus to gather cluster configuration data.
From the perspective of the hypervisor, Stargate is the main point of contact for the Nutanix
cluster.
Note: If Stargate cannot reach Medusa, the log files include an HTTP timeout. Zeus
communication issues can include a Zookeeper timeout.
Curator
A Curator master node periodically scans the metadata database and identifies cleanup and
optimization tasks that Stargate should perform. Curator shares analyzed metadata across
other Curator nodes.
Curator depends on Zeus to learn which nodes are available, and Medusa to gather metadata.
Based on that analysis, it sends commands to Stargate.
Prism Features
Infrastructure Management
Performance Monitoring
• Customizable dashboards
• Single-click query
Prism supports the following web browsers:
• Firefox
• Chrome
• Safari
• Microsoft Edge
Enabling Pulse
Pulse is enabled by default and monitors cluster health and proactively notifies customer
support if a problem is detected.
Controller VMs communicate with ESXi hosts and IPMI interfaces throughout the cluster to
gather health information.
Warnings and errors are also displayed in Prism Element, where administrators can analyze the
data and create reports.
Pulse collects and shares only basic, system-level diagnostic information, such as:
• System alerts
• Configuration information
Pulse does not collect:
• Guest VMs
• User data
• Metadata
• Administrator credentials
• Identification data
• Private information
There are two commonly used command line interfaces (CLIs):
• aCLI – Manage the Acropolis portion of the Nutanix environment: hosts, networks, snapshots,
and VMs.
• nCLI – Manage the rest of the cluster configuration, such as storage pools, storage containers,
and protection domains.
The Acropolis 5.15 Command Reference on the Support Portal contains nCLI, aCLI and CVM
commands.
From Prism Element, download the nCLI installer to a local machine. This requires Java Runtime
Environment (JRE) version 5.0 or higher.
The PATH environment variable should point to the nCLI folder as well as the JRE bin folder.
Once downloaded and installed, go to a bash shell or command prompt and point ncli to the
cluster virtual IP address or to any Controller VM in the cluster.
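For example, assuming the nCLI is installed locally and the cluster virtual IP address is 10.0.0.50 (a
placeholder value), you can start a remote session as follows:
ncli -s 10.0.0.50 -u 'admin' -p 'admin_password'
The -s option specifies the cluster or CVM IP address, -u the user name, and -p the password. If you
are already logged on to a Controller VM, typing ncli with no options starts a local session.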
Command Format
nCLI commands must match the following format:
ncli> entity action parameter1=value parameter2=value ...
• You can replace entity with any Nutanix entity, such as cluster or disk.
• You can replace action with any valid action for the preceding entity. Each entity has a
unique set of actions, but a common action across all entities is list. For example, you can
type the following command to request a list of all storage pools in the cluster.
ncli> storagepool list
Some actions require parameters at the end of the command. For example, when creating
an NFS datastore, you need to provide both the name of the datastore as it appears to the
hypervisor and the name of the source storage container.
ncli> datastore create name="NTNX-NFS" ctr-name="nfs-ctr"
You can list parameter-value pairs in any order, as long as they are preceded by a valid entity
and action.
Note: To avoid syntax errors, surround all string values with double-quotes,
as demonstrated in the preceding example. This is particularly important when
specifying parameters that accept a list of values.
The nCLI provides assistance on all entities and actions. By typing help at the command line,
you can request additional information at one of three levels of detail.
• help provides a list of all entities.
• <entity> help provides a list of all actions and parameters associated with the entity, as well
as which parameters are required, and which are optional.
• <entity> action help provides a list of all parameters associated with the action, as well as a
description of each parameter.
The nCLI provides additional details at each level. To control the scope of the nCLI help output,
add the detailed parameter, which can be set to either true or false.
For example, type the following command to request a detailed list of all actions and
parameters for the cluster entity.
ncli> cluster help detailed=true
You can also type the following command if you prefer to see a list of parameters for the
cluster edit-params action without descriptions.
ncli> cluster edit-params help detailed=false
hostssh – Executes a command on every hypervisor host in the cluster, in the same way that allssh targets every CVM.
aCLI Example
• Create a new virtual network for VMs: net.create vlan.100 ip_config=10.1.1.1/24
Note: Use extreme caution when executing allssh commands. The allssh command
executes an SSH command on all CVMs in the cluster.
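As an illustration only (output varies by cluster), the following commands run a single, harmless
command on every CVM and on every hypervisor host, respectively:
nutanix@CVM$ allssh "uptime"
nutanix@CVM$ hostssh "uptime"
Both commands iterate over the entire cluster, so a mistyped destructive command is repeated on
every node, which is why the note above urges caution.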
PowerShell Cmdlets
PowerShell Cmdlets Examples
• Connect to a cluster: Connect-NutanixCluster -Server <Cluster IP> -UserName <Prism User>
-Password <Password>
• Get information about the cluster you are connected to: Get-NutanixCluster
• Get information about ALL of the clusters you are connected to by specifying a CVM IP for
each cluster: Get-NutanixCluster -Server cvm_ip_addr
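As a brief sketch (cmdlet and parameter names such as Get-NTNXVM and -AcceptInvalidSSLCerts
come from the Nutanix Cmdlets snap-in and may differ between versions), a short session might
look like this:
Connect-NutanixCluster -Server 10.0.0.50 -UserName admin -Password 'admin_password' -AcceptInvalidSSLCerts
Get-NTNXVM | Select-Object vmName, powerState
Get-NTNXContainer
The first command opens the session; the remaining commands query the VMs and storage
containers on the connected cluster.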
REST API
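As a minimal sketch (assuming the default Prism port 9440 and the v2.0 API), you could list the
VMs on a cluster with a single HTTPS call:
curl -k -u admin:'admin_password' https://cluster_ip:9440/api/nutanix/v2.0/vms/
The same endpoints can be explored interactively from the REST API Explorer that is built into
Prism.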
Labs
1. Connecting to Prism Element
5. Using nCLI
Module
3
SECURING THE NUTANIX CLUSTER
Overview
After completing this module, you will be able to:
Security Overview
Our security development life cycle integrates security into every step of product development,
rather than applying it as an afterthought. It is a foundational part of product design. The
pervasive culture and processes built around security harden the Enterprise Cloud OS and
eliminate zero-day vulnerabilities.
For example, research and development teams work together to fully understand all the code
in the product, whether it is produced in-house or inherited from dependencies. We schedule
product updates to handle known common vulnerabilities and exposures (CVEs) for minor
release cycles, and backport all dependencies to their latest release versions in major release
cycles. This approach significantly reduces zero-day risks without slowing down product
evolution.
Efficient one-click operations and self-healing security models enable automation to maintain
security in an always-on hyperconverged solution. And finally, since this is about more than just
our platform, Nutanix also delivers validated joint solutions with security-focused vendors.
• Performs fully automated testing during development, and times all security-related code
modifications during minor releases to minimize risk.
• Assesses and mitigates customer risk from code changes by using threat modeling.
Note: Download the Information Security with Nutanix Tech Note for more
information on this topic.
Two-Factor Authentication
Logons require a combination of a client certificate and a username and password.
Administrators can use local accounts or Active Directory (AD).
Cluster Lockdown
You can restrict access to a Nutanix cluster. SSH sessions can be restricted through
nonrepudiated keys.
You can disable remote logon with a password. You can completely lock down SSH access
by disabling remote logon and deleting all keys except for the interCVM and CVM to host
communication keys.
Nutanix nodes are authenticated by a cluster-local key management server (KMS); self-encrypting drives (SEDs) are not required.
Once deployed, STIGs lock down IT environments and reduce security vulnerabilities in
infrastructure.
Traditionally, using STIGs to secure an environment is a manual process that is highly time-
consuming and prone to operator error. Because of this, only the most security-conscious IT
shops follow the required process.
Nutanix has created custom STIGs that are based on the guidelines outlined by DISA to keep
the Enterprise Cloud Platform within compliance and reduce attack surfaces.
Nutanix includes STIGs that collectively check over 800 security entities covering storage,
virtualization, and management:
• AHV
• AOS
• Prism
• Web server
To make the STIGs usable by all organizations, Nutanix provides the STIGs in machine-readable
XCCDF.xml format and PDF. This enables organizations to use tools that can read STIGs and
automatically validate the security baseline of a deployment, reducing the accreditation time
required to stay within compliance from months to days.
Nutanix leverages SaltStack and SCMA to self-heal any deviation from the security baseline
configuration of the operating system and hypervisor to remain in compliance. If any
component is found to be non-compliant, it is automatically set back to the supported
security settings without any intervention. To achieve this objective, the Nutanix Controller VM
conforms to RHEL 7 (Linux 7) STIG as published by DISA. Additionally, Nutanix maintains its
own STIG for the Acropolis Hypervisor (AHV).
• Monitors over 800 security entities covering storage, virtualization, and management
The SCMA framework ensures that services are constantly inspected for variance to the
security policy.
With SCMA, you can schedule the STIG checks to run hourly, daily, weekly, or monthly. These
checks run at the lowest system priority within the virtual storage controller, ensuring that
security checks do not interfere with platform performance.
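As a hedged example (the exact entity and parameter names should be verified against the AOS
security documentation for your release), the SCMA schedule can be viewed and changed from
nCLI:
ncli> cluster get-cvm-security-config
ncli> cluster edit-cvm-security-params schedule=daily
The first command shows the current CVM security settings, and the second changes how often
SCMA enforces the security baseline.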
Enterprise key management (with KMS): a consolidated, central key management server (KMS)
that provides service to multiple cryptographic clusters.
Nutanix provides a software-only option for data-at-rest security with the Ultimate license. This
does not require the use of self-encrypting drives.
DARE Implementation
1. Install SEDs for all data drives in a cluster. The drives are FIPS 140-2 Level 2 validated and
use FIPS 140-2 validated cryptographic modules.
2. When you enable data protection for the cluster, the Controller VM must provide the proper
key to access data on a SED.
3. Keys are stored in a key management server that is outside the cluster, and the Controller
VM communicates with the key management server using the Key Management
Interoperability Protocol (KMIP) to upload and retrieve drive keys.
4. When a node experiences a full power off and power on (and cluster protection is enabled),
the Controller VM retrieves the drive keys from the key management server and uses them
to unlock the drives.
The Nutanix controller in each node then adds the PINs (aka KEK, key encryption key) to the
key management device.
Once the PIN is set on an SED, you need the PIN to unlock the device (lose the PIN, lose data).
You can reset the PIN using the SecureErase primitive to “unsecure” the disk/partition, but all
existing data is lost in this case.
ESXi and NTNX boot partitions remain unencrypted. SEDs support encrypting individual disk
partitions selectively using the “BAND” feature (a range of blocks).
Configuring Authentication
Changing Passwords
It is possible to change four different sets of passwords in a Nutanix cluster: user, CVM, IPMI, and
the hypervisor host.
How often you change these passwords depends on your company's IT security policies. Most
companies enforce password changes on a schedule via security guidelines, but the intervals
are usually company specific.
Nutanix provides administrators with password complexity controls, such as requiring uppercase
and lowercase letters, symbols, and numbers, and enforcing change frequency and minimum password length. After you
have successfully changed a password, the new password is synchronized across all Controller
VMs and interfaces (Prism web console, nCLI, and SSH).
By default, the admin user password does not expire and can be changed at any time. If you
do change the admin password, you will also need to update any applications and scripts that
use the admin credentials for authentication. For authentication purposes, Nutanix recommends
that you create a user with an admin role, instead of using the admin account.
Note: For more information on this topic, please see the Nutanix Support Portal >
Common Criteria Guidance Reference > User Identity and Authentication.
You can change user passwords, including for the default admin user, in the web console or
nCLI. Changing the password through either interface changes it for both.
Using the web console: Log on to the web console as the user whose password is to be
changed and select Change Password from the user icon pull-down list of the main menu.
Note: For more information about changing properties of the current users, see the
Web Console Guide.
Remember to:
• Replace username with the name of the user whose password is to be changed.
Note: If you change the password of the admin user from the default, you must
specify the password every time you start an nCLI session from a remote system.
A password is not required if you are starting an nCLI session from a Controller VM
where you are already logged on.
Perform these steps on any one Controller VM in the cluster to change the password of
the nutanix user. After you have successfully changed the password, the new password is
synchronized across all Controller VMs in the cluster. During the sync, you will see a task
appear in the Recent Tasks section of Prism and will be notified when the password sync task is
complete.
3. Respond to the prompts, providing the current and new nutanix user password.
Changing password for nutanix.
Old Password:
New password:
Retype new password:
Password changed.
This procedure helps prevent the BMC password from being retrievable on port 49152.
Although it is not required for the administrative user to have the same password on all hosts,
doing so makes cluster management much easier. If you do select a different password for one
or more hosts, make sure to note the password for each host.
Note: The maximum allowed length of the IPMI password is 19 characters, except
on ESXi hosts, where the maximum length is 15 characters.
Note: Do not use the following special characters in the IPMI password: & ; ` ' \ " |
* ? ~ < > ^ ( ) [ ] { } $ \n \r
Note: It is also possible to change the IPMI password for ESXi, Hyper-V, and AHV
if you do not know the current password but have root access to the host. For
instructions on how to do this, please see the relevant section of the NX and SX
Series Hardware Administration Guide on the Nutanix support portal.
3. Respond to the prompts, providing the current and new root password.
Changing password for root.
New password:
Retype new password:
Password changed.
Prism Central supports role-based access control (RBAC) that you can configure to provide
customized access permissions for users based on their assigned roles. The roles dashboard
allows you to view information about all defined roles and the users and groups assigned to
those roles.
• Configuring authentication confers default user permissions that vary depending on the
type of authentication (full permissions from a directory service or no permissions from an
identity provider). You can configure role maps to customize these user permissions.
• You can refine access permissions even further by assigning roles to individual users or
groups that apply to a specified set of entities.
Note: Defining custom roles and assigning roles are supported on AHV only.
Nutanix RBAC enables this by providing fine-grained controls when creating custom roles in
Prism Central. As an example, it is possible to create a VM Admin role with the ability to view
VMs, limited permission to modify CPU, memory, and power state, and no other administrative
privileges.
• Clearly understand the specific set of tasks a user will need to perform their job
• Identify permissions that map to those tasks and assign them accordingly
• Document and verify your custom roles to ensure that the correct privileges have been
assigned
Built-in Roles
The following built-in roles are defined by default. You can see a more detailed list of
permissions for any of the built-in roles through the details view for that role. The Project
Admin, Developer, Consumer, and Operator roles are available when assigning roles in a project.
Role: Project Admin
Privileges: You can specify a role for a user when you assign a user to a project, so individual users or groups can have different roles in the same project.
Custom Roles
If the built-in roles are not sufficient for your needs, you can create one or more custom roles.
After creation, these roles can also be modified if necessary.
You can create a custom role from the Roles dashboard, with the following parameters:
• Name
• Description
• Permissions for VMs, blueprints, apps, marketplace items, and reports management
A custom role can also be modified or deleted from the Roles dashboard. When updating a
role, you will be able to modify the same parameters that are available when creating a custom
role. To delete a role, select the Delete option from the Actions menu and provide confirmation
when prompted.
• SAML-authorized users are not assigned any permissions by default; they must be explicitly
assigned.
Note: To configure user authentication, please see the Prism Web Console Guide >
Security Management > Configuring Authentication section.
You can refine the authentication process by assigning a role (with associated permissions) to
users or groups. To assign roles:
You can edit a role map entry, which will present you with the same fields that are available when
creating a role map. Make your desired changes and save to update the entry.
You can also delete a role map entry, by clicking the delete icon and then providing
confirmation when prompted.
Nutanix supports SSL certificate-based authentication for console access. AOS includes a self-
signed SSL certificate by default to enable secure communication with a cluster. AOS allows
you to replace the default certificate through the web console Prism user interface.
For more information, see the Nutanix Controller VM Security Operations Guide and
the Certificate Authority sections of the Common Criteria Guidance Reference on the Support
Portal.
Labs
1. Adding a user
Module
4
NETWORKING
Overview
After completing this module, you will be able to:
Connections from the server to the physical switch use 10 GbE or higher interfaces. You can
establish connections between the switches with 40 GbE or faster direct links, or through a
leaf-spine network topology (not shown). The IPMI management interface of the Nutanix node
also connects to the out-of-band management network, which may connect to the production
network, but it is not mandatory. Each node always has a single connection to the management
network, but we have omitted this element from further images in this document for clarity and
simplicity.
Review the Leaf Spine section of the Physical Networking Guide for more information on leaf-
spine topology.
Each AHV server maintains an OVS instance, and all OVS instances combine to form a single
logical switch. Constructs called bridges manage the switch instances residing on the AHV
hosts. Use the following commands to configure OVS with bridges, bonds, and VLAN tags. For
example:
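(A sketch only; run from any CVM, which reaches the local AHV host over the internal 192.168.5.1
address used elsewhere in this module.) You can inspect the current bridge and bond configuration
before making any changes:
nutanix@CVM$ manage_ovs show_uplinks
nutanix@CVM$ ssh root@192.168.5.1 "ovs-vsctl show"
The manage_ovs command summarizes the bridges, bonds, and member interfaces on the local
host, while ovs-vsctl show prints the raw OVS configuration.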
Bridges
Bridges act as virtual switches to manage traffic between physical and virtual network
interfaces. The default AHV configuration includes an OVS bridge called br0 and a native Linux
bridge called virbr0 (the names could vary between AHV/AOS versions and depending on
what configuration changes were done on the nodes, but in this training we will use br0 and
virbr0 by default). The virbr0 Linux bridge carries management and storage communication
between the CVM and AHV host. All other storage, host, and VM network traffic flows through
the br0 OVS bridge. The AHV host, VMs, and physical interfaces use "ports" for connectivity to
the bridge.
Ports
Ports are logical constructs created in a bridge that represent connectivity to the virtual switch.
Nutanix uses several port types, including internal, tap, VXLAN, and bond.
• An internal port with the same name as the default bridge (br0) provides access for the AHV
host.
• Bonded ports provide NIC teaming for the physical interfaces of the AHV host.
Bonds
Bonded ports aggregate the physical interfaces on the AHV host. By default, the system
creates a bond named br0-up in bridge br0 containing all physical interfaces. Changes to
the default bond br0-up using manage_ovs commands can rename it to bond0. Remember,
bond names on your system might differ from the diagram below. Nutanix recommends using
the name br0-up to quickly identify this interface as the bridge br0 uplink. Using this naming
scheme, you can also easily distinguish uplinks for additional bridges from each other.
OVS bonds allow for several load-balancing modes, including active-backup, balance-slb, and
balance-tcp. Active-backup mode is enabled by default. Nutanix recommends this mode for
ease of use.
The following diagram illustrates the networking configuration of a single host immediately
after imaging. The best practice is to use only the 10 GB NICs and to disconnect the 1 GB NICs if
you do not need them.
Only utilize NICs of the same speed within the same bond.
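For example (a sketch; verify the bridge and bond names on your own cluster first), you can restrict
the default bond to only the 10 GbE interfaces with manage_ovs from any CVM:
nutanix@CVM$ manage_ovs --bridge_name br0 --bond_name br0-up --interfaces 10g update_uplinks
The 10g keyword selects all 10 GbE adapters, which keeps same-speed NICs together in the bond
as recommended above.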
Bond Modes
• active-backup (default)
• balance-slb
• balance-tcp (used with LACP link aggregation)
Active-Backup
With the active-backup bond mode, one interface in the bond carries traffic and other
interfaces in the bond are used only when the active link fails. Active-backup is the simplest
bond mode, easily allowing connections to multiple upstream switches without any additional
switch configuration. The active-backup bond mode requires no special hardware and you can
use different physical switches for redundancy.
The tradeoff is that traffic from all VMs uses only a single active link within the bond at
one time. All backup links remain unused until the active link fails. In a system with dual
10 GB adapters, the maximum throughput of all VMs running on a Nutanix node with this
configuration is 10 Gbps or the speed of a single link.
This mode only offers failover ability (no traffic load balancing). If the active link goes down,
a backup or passive link activates to provide continued connectivity. AHV transmits all traffic,
including traffic from the CVM and VMs, across the active link. All traffic shares 10 Gbps of
network bandwidth.
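If a bond has been changed to another mode, you can return it to the default active-backup mode
with a command of the following form (a sketch, using the same CVM-to-host SSH pattern shown
later in this module):
nutanix@CVM$ ssh root@192.168.5.1 "ovs-vsctl set port br0-up bond_mode=active-backup"
No upstream switch configuration is required for this mode.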
Balance-SLB
To take advantage of the bandwidth provided by multiple upstream switch links, you can use
the balance-slb bond mode. The balance-slb bond mode in OVS takes advantage of all links in
a bond and uses measured traffic load to rebalance VM traffic from highly used to less used
interfaces. When the configurable bond-rebalance interval expires, OVS uses the measured
load for each interface and the load for each source MAC hash to spread traffic evenly among
links in the bond. Traffic from some source MAC hashes may move to a less active link to more
evenly balance bond member utilization.
Perfectly even balancing may not always be possible, depending on the number of source
MAC hashes and their stream sizes. Each individual VM NIC uses only a single bond member
interface at a time, but a hashing algorithm distributes multiple VM NICs’ multiple source MAC
addresses across bond member interfaces. As a result, it is possible for a Nutanix AHV node
with two 10 GB interfaces to use up to 20 Gbps of network throughput. Individual VM NICs have
a maximum throughput of 10 Gbps, the speed of a single physical interface. A VM with multiple
NICs could still have more bandwidth than the speed of a single physical interface, but there is
no guarantee that the different VM NICs will land on different physical interfaces.
The default rebalance interval is 10 seconds, but Nutanix recommends setting this interval to
30 seconds to avoid excessive movement of source MAC address hashes between upstream
switches. Nutanix has tested this configuration using two separate upstream switches with
AHV. If the upstream switches are interconnected physically or virtually, and both uplinks allow
the same VLANs, no additional configuration, such as link aggregation, is necessary.
Note: Do not use link aggregation technologies such as LACP with balance-slb.
The balance-slb algorithm assumes that upstream switch links are independent L2
interfaces. It handles broadcast, unknown unicast, and multicast (BUM) traffic by selectively
listening for this traffic on only a single active adapter in the bond.
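To switch the bond to balance-slb and apply the recommended 30-second rebalance interval, you
can run commands of the following form from any CVM (a sketch; the interval is expressed in
milliseconds):
nutanix@CVM$ ssh root@192.168.5.1 "ovs-vsctl set port br0-up bond_mode=balance-slb"
nutanix@CVM$ ssh root@192.168.5.1 "ovs-vsctl set port br0-up other_config:bond-rebalance-interval=30000"
Repeat the commands on every host in the cluster so that the configuration stays consistent.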
Taking full advantage of bandwidth provided by multiple links to upstream switches, from
a single VM, requires dynamically negotiated link aggregation and load balancing using
balance-tcp. Nutanix recommends dynamic link aggregation with LACP instead of static link
aggregation due to improved failure detection and recovery.
Note: Ensure that you have appropriately configured the upstream switches before
enabling LACP. On the switch, link aggregation is commonly referred to as port
channel or LAG, depending on the switch vendor. Using multiple upstream switches
may require additional configuration such as MLAG or vPC. Configure switches to
fall back to active-backup mode in case LACP negotiation fails (sometimes called
fallback or no suspend-individual). This setting assists with node imaging and initial
configuration where LACP may not yet be available.
With link aggregation negotiated by LACP, multiple links to separate physical switches appear
as a single layer-2 (L2) link. A traffic-hashing algorithm such as balance-tcp can split traffic
between multiple links in an active-active fashion. Because the uplinks appear as a single L2
link, the algorithm can balance traffic among bond members without any regard for switch
MAC address tables. Nutanix recommends using balance-tcp when using LACP and link
aggregation, because each TCP stream from a single VM can potentially use a different uplink in
this configuration.
With link aggregation, LACP, and balance-tcp, a single guest VM with multiple TCP streams
could use up to 20 Gbps of bandwidth in an AHV node with two 10 GB adapters.
Configure link aggregation with LACP and balance-tcp using the commands below on all
Nutanix CVMs in the cluster.
Note: You must configure upstream switches for link aggregation with LACP
before configuring the AHV host from the CVM. Upstream LACP settings, such as
timers, should match the AHV host settings for configuration consistency. See KB
3263 for more information on LACP configuration.
If upstream LACP negotiation fails, the default AHV host configuration disables the bond, thus
blocking all traffic. The following command allows fallback to active-backup bond mode in the
AHV host in the event of LACP negotiation failure:
nutanix@CVM$ ssh root@192.168.5.1 "ovs-vsctl set port br0-up other_config:lacp-fallback-ab=true"
In the AHV host and on most switches, the default OVS LACP timer configuration is slow,
or 30 seconds. This value — which is independent of the switch timer setting — determines
how frequently the AHV host requests LACPDUs from the connected physical switch. The
fast setting (1 second) requests LACPDUs from the connected physical switch every second,
helping to detect interface failures more quickly. Failure to receive three LACPDUs — in other
words, after 3 seconds with the fast setting — shuts down the link within the bond. Nutanix
recommends setting lacp-time to fast to decrease the time it takes to detect link failure from 90
seconds to 3 seconds. Only use the slower lacp-time setting if the physical switch requires it for
interoperability.
nutanix@CVM$ ssh root@192.168.5.1 "ovs-vsctl set port br0-up other_config:lacp-time=fast"
Next, enable LACP negotiation and set the hash algorithm to balance-tcp.
nutanix@CVM$ ssh root@192.168.5.1 "ovs-vsctl set port br0-up lacp=active"
nutanix@CVM$ ssh root@192.168.5.1 "ovs-vsctl set port br0-up bond_mode=balance-tcp"
Confirm the LACP negotiation with the upstream switch or switches using ovs-appctl, looking
for the word "negotiated" in the status lines.
nutanix@CVM$ ssh root@192.168.5.1 "ovs-appctl bond/show br0-up"
nutanix@CVM$ ssh root@192.168.5.1 "ovs-appctl lacp/show br0-up"
AHV supports two different ways to provide VM connectivity: managed and unmanaged
networks.
With unmanaged networks, VMs get a direct connection to their VLAN of choice. Each virtual
network in AHV maps to a single VLAN and bridge. All VLANs allowed on the physical switch
port to the AHV host are available to the CVM and guest VMs. You can create and manage
virtual networks, without any additional AHV host configuration, using:
• Prism Element
• aCLI
• REST API
Acropolis binds each virtual network it creates to a single VLAN. During VM creation, you can
create a virtual NIC and associate it with a network and VLAN. Or, you can provision multiple
virtual NICs each with a single VLAN or network.
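As a simple sketch using placeholder names, an unmanaged network and a VM NIC on that
network can be created from aCLI:
<acropolis> net.create vlan.100 vlan=100
<acropolis> vm.nic_create myVM network=vlan.100
The first command binds the virtual network vlan.100 to VLAN 100; the second attaches a NIC on
that network to an existing VM named myVM.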
A managed network is a VLAN plus IP Address Management (IPAM). IPAM is the cluster
capability to function like a DHCP server, to assign an IP address to a VM that sits on the
managed network.
Administrators can configure each virtual network with a specific IP subnet, associated domain
settings, and group of IP address pools available for assignment.
• The Acropolis Master acts as an internal DHCP server for all managed networks.
• The OVS is responsible for encapsulating DHCP requests from the VMs in VXLAN and
forwarding them to the Acropolis Master.
The Acropolis Master runs the CVM administrative process to track device IP addresses. This
creates associations between the interface’s MAC addresses, IP addresses and defined pool of
IP addresses for the AOS DHCP server.
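As a sketch with placeholder addresses, a managed network with an IP address pool can be created
entirely from aCLI:
<acropolis> net.create vlan.200 vlan=200 ip_config=10.2.2.1/24
<acropolis> net.add_dhcp_pool vlan.200 start=10.2.2.50 end=10.2.2.100
The ip_config value defines the default gateway and prefix for the subnet, and the DHCP pool
defines the range of addresses that Acropolis IPAM hands out to VMs on that network.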
Network Segmentation
Network segmentation is designed to isolate backplane (storage and CVM) traffic from other traffic.
It separates storage traffic from routable management traffic for security purposes and creates
separate virtual networks for each traffic type.
You can segment the network on a Nutanix cluster in the following ways:
• During cluster creation, using the Foundation imaging process.
• On an existing cluster, using the Prism web console.
View this Tech TopX video to learn more about network segmentation. You can also read the
Securing Traffic Through Network Segmentation section of the Nutanix Security Guide on the
Support Portal for more information on securing traffic through network segmentation.
For more information about segmenting the network when creating a cluster, see the Field
Installation Guide on the Support Portal.
You can segment the network on an existing cluster by using the Prism web console. The
network segmentation process:
• Creates a separate network for backplane communications on the existing default virtual
switch.
• Configures the eth2 interfaces that AHV creates on the CVMs during upgrade.
From the specified subnet, AHV assigns IP addresses to each new interface. Each node requires
two IP addresses. For new backplane networks, you must specify a nonroutable subnet. The
interfaces on the backplane network are automatically assigned IP addresses from this subnet,
so reserve the entire subnet for the backplane network alone.
If you plan to specify a VLAN for the backplane network, configure the VLAN on the physical
switch ports to which the nodes connect. If you specify the optional VLAN ID, AHV places the
newly created interfaces on the VLAN. Nutanix highly recommends a separate VLAN for the
backplane network to achieve true segmentation.
• Creates a separate network for RDMA communications on the existing default virtual switch.
From the specified subnet, AHV assigns IP addresses (two per node) to each new interface. For
new RDMA networks, you must specify a nonroutable subnet. AHV automatically assigns the
interfaces on the backplane network IP addresses from this subnet, so reserve the entire subnet
for the backplane network alone.
If you plan to specify a VLAN for the RDMA network, configure the VLAN on the physical
switch ports to which the nodes connect. If you specify the optional VLAN ID, AHV places the
newly created interfaces on the VLAN. Nutanix highly recommends a separate VLAN for the
RDMA network to achieve true segmentation.
When you change the subnet, any IP addresses assigned to the interfaces on the backplane
network change, and the procedure therefore involves stopping the cluster. For information
about how to reconfigure the network, see the Reconfiguring the Backplane Network section
of the Nutanix Security Guide on the Support Portal.
Note: Do not delete the eth2 interface that AHV creates on the CVMs, even if you
are not using the network segmentation feature.
Note: At the end of this procedure, the cluster stops and restarts, even if only
changing the VLAN ID, and therefore involves cluster downtime. Shut down all user
VMs and CVMs before reconfiguring the network backplane.
At the end of this procedure, the cluster stops and restarts, and therefore involves cluster
downtime. Shut down all user VMs and CVMs before disabling network segmentation.
Network segmentation is not supported in the following configurations:
• Clusters on which the eth2 interface on one or more CVMs has a manually assigned IP
address.
• ESXi clusters where the CVM connects to a VMware distributed virtual switch.
• Clusters that have two (or more) vSwitches or bridges for CVM traffic isolation. The CVM
management network (eth0) and the CVM backplane network (eth2) must reside on a single
vSwitch or bridge. Do not create these CVM networks on separate vSwitches or bridges.
• Optionally changing the IP address, netmask, and default gateway that AHV specified for the
hosts during the imaging process.
This Tech TopX video walks through AHV networking concepts, including both CLI and Prism
examples.
OVS: Do not modify the OpenFlow tables that are associated with the default OVS bridge br0.
VLANs: Add the Controller VM and the AHV host to the same VLAN. By default, AHV assigns the Controller VM and the hypervisor to VLAN 0, which effectively places them on the native VLAN configured on the upstream physical switch. Do not add other devices, such as guest VMs, to the same VLAN as the CVM and hypervisor. Isolate guest VMs on one or more separate VLANs.
OVS bonded port (bond0): Aggregate the host 10 GbE interfaces to an OVS bond on br0. Trunk these interfaces on the physical switch. By default, the 10 GbE interfaces in the OVS bond operate in the recommended active-backup mode.
1 GbE and 10 GbE interfaces (physical host): If you want to use the 10 GbE interfaces for guest VM traffic, make sure that the guest VMs do not use the VLAN over which the Controller VM and hypervisor communicate. If you want to use the 1 GbE interfaces for guest VM connectivity, follow the hypervisor manufacturer's switch port and networking configuration guidelines.
IPMI port on the hypervisor host: Do not trunk switch ports that connect to the IPMI interface. Configure the switch ports as access ports for management simplicity.
Upstream physical switch: Nutanix does not recommend the use of Fabric Extenders (FEX) or similar technologies for production use cases. While initial, low-load implementations might run smoothly with such technologies, poor performance, VM lockups, and other issues might occur as implementations scale upward. Avoid using shared buffers for the 10 GbE ports. Use a dedicated buffer for each port. Add all the nodes that belong to a given cluster to the same Layer-2 network segment.
Controller VM: Do not remove the Controller VM from either the OVS bridge br0 or the native Linux bridge virbr0.
Labs
1. Creating an Unmanaged Network
Module
5
VIRTUAL MACHINE MANAGEMENT
Overview
After completing this module, you will be able to:
Images that are imported to Prism Element reside in and can be managed from Prism
Element. If connected to Prism Central, you can migrate your images over to Prism Central for
centralized management. This will not remove your images from Prism Element, but will allow
management only in Prism Central. So, for example, if you want to update a migrated image, it
can only be done from Prism Central, not from Prism Element.
Registration with Prism Central is also useful if you have multiple Prism Element clusters
managed by a single instance of Prism Central. In this scenario, if you upload an image to a local
Prism Element instance, for example, this is what happens:
• The image is available locally on that Prism Element instance. (Assuming it has not been
migrated to Prism Central.)
• When you create a VM using that image, the image is copied to other Prism Element
clusters, is made active, and is then available for use on all Prism Element clusters managed
by that instance of Prism Central.
Overview
• RAW
The QCOW2 format decouples the physical storage layer from the virtual layer by adding a
mapping between the logical and physical blocks.
Post-Import Actions
After you import an image, you must clone a VM from the image that you have imported and
then delete the imported image.
For more information on how to create a VM from the imported image, see the Prism Web
Console Guide on the Support Portal.
Uploading Images
There are two ways to upload an image to the Image Service:
• Via Prism
• Via aCLI
2. Upload an image and specify the required parameters, including the name, the image type,
the container on which it will be stored, and the image source for upload.
After Prism completes the upload, the image will appear in a list of available images for use
during VM creation.
To create an image (testimage) from an image located on an NFS server, you can use the following
command:
<acropolis> image.create testimage source_url=nfs://nfs_server_path/path_to_image
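Once the image exists, it can be listed and attached to a VM as a clone (a sketch; the VM name is a
placeholder):
<acropolis> image.list
<acropolis> vm.disk_create myVM clone_from_image=testimage
The vm.disk_create command creates a new virtual disk for the VM by cloning the uploaded image.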
You can use the Prism web console to create virtual machines (VMs) on a Nutanix cluster. If
you have administrative access to a cluster, you can create a VM with Prism by completing a
form that requires a name, compute, storage, and network specifications. If you have already
uploaded the required image files to the image service, you can create either Windows or Linux
VMs during the VM creation process.
Prism also has self-service capabilities that enable administrators or project members with the
required permissions to create VMs. In this scenario, users will select from a list of pre-defined
templates for VMs and disk images to create their VM.
Finally, VMs can be updated after creation, cloned, or deleted as required. When updating a VM,
you can change compute details (vCPUs, cores per vCPU, memory), storage details (disk types
and capacity), as well as other parameters that were specified during the VM creation process.
Creating a VM in AHV
1. In Prism Central, navigate to VM dashboard, click the List tab, and click Create VM.
2. In the Cluster Selection window, select the target cluster for your VM and click OK.
3. In the Create VM dialog box, update the following information as required for your VM:
• Name
• Description (optional)
• Timezone
• vCPUs
• Memory
• Network interface
- NIC
- Network address/prefix
• VM host affinity
4. After all fields have been updated and verified, click Save to create the VM.
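The same result can be achieved from the command line; as a sketch with placeholder values, the
following aCLI commands create, connect, and power on a small VM:
<acropolis> vm.create myVM num_vcpus=2 memory=4G
<acropolis> vm.disk_create myVM create_size=20G container=default-container
<acropolis> vm.nic_create myVM network=vlan.100
<acropolis> vm.on myVM
Each command maps to one section of the Create VM dialog box: compute, storage, and
networking, followed by the initial power-on.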
When creating a VM, you can also provide a user data file for Linux VMs, or an answer file for
Windows VMs, for unattended provisioning. There are three ways to do this:
• If the file has been uploaded to a storage container on a cluster, click ADSF path and enter
the path to the file.
• If the file is available on your local computer, click Upload a File, click Choose File, and then
upload the file.
• If you want to create the file or paste the contents directly, click Type or paste script and
then use the text box that is provided.
You can also copy or move files to a location on the VM for Linux VMs, or to a location in the
ISO file for Windows VMs, during initialization. To do this, you need to specify the source file
ADSF path and the destination path in the VM. To add other files or directories, repeat this
process as necessary.
1. In Prism Central, navigate to VM dashboard, click the List tab, and click Create VM.
2. Select source images for the VM, including the VM template and disk images.
3. Update the following information as required for your VM:
• VM name
• Target project
• Disks
• Network
4. After all the fields have been updated and verified, click Save to create the VM.
Managing a VM
Whether you have created a VM with administrative permissions or as a self-service
administrator, three options are available to you when managing VMs:
To update a VM
To delete a VM
To clone a VM
2. The Clone VM dialog box includes the same fields as the Create VM dialog box. However,
all fields will be populated with information based on the VM that you are cloning. You can
either:
• Change the information in some of the fields as desired, and then click Save.
Other operations that are possible for a VM via one-click operations in Prism Central are:
• Launch console
• Power on/off
• Take Snapshot
• Quarantine/Unquarantine
Note: For more information on each of these topics, please see the Prism
Central Guide > Virtual Infrastructure (Cluster) Administration > VM Management
documents on the Nutanix Support Portal.
SLES 11 SP3/SP4, 12
See the AHV Guest OS Compatibility Matrix on the Support Portal for the current list of
supported guest VMs in AHV.
Nutanix VirtIO
Nutanix VirtIO is a collection of drivers for paravirtual devices that enhance the stability and
performance of virtual machines on AHV.
Nutanix Guest Tools (NGT) is an in-guest agent framework that enables advanced VM
management functionality through the Nutanix Platform.
• Nutanix Guest Agent (NGA) service. Communicates with the Nutanix Controller VM.
• File Level Restore CLI. Performs self-service file-level recovery from the VM snapshots. For
more information about self-service restore, see the Acropolis Advanced Setup Guide.
• Nutanix VM mobility drivers. Provides drivers for VM migration between ESXi and AHV,
in-place hypervisor conversion, and cross-hypervisor disaster recovery (CH-DR) features.
For more information about cross-hypervisor disaster recovery, see the Cross-Hypervisor
Disaster Recovery section of the Data Protection and Disaster Recovery guide on the
Support Portal.
• VSS requestor and hardware provider for Windows VMs. Enables application-consistent
snapshots of AHV or ESXi Windows VMs. For more information about Nutanix VSS-based
snapshots for the Windows VMs, see the Application-Consistent Snapshot Guidelines on
the Support Portal.
• You must configure the cluster virtual IP address on the Nutanix cluster. If the virtual IP
address of the cluster changes, it will impact all the NGT instances that are running in your
cluster. For more information, see the Impact of Changing Virtual IP Address of the Cluster
section of the Prism Web Console Guide on the Support Portal.
• VMs must have at least one empty IDE CD-ROM slot to attach the ISO.
• The hypervisor should be ESXi 5.1 or later, or AHV 20160215 or later version.
• You should connect the VMs to a network that you can access by using the virtual IP
address of the cluster.
• For Windows Server Edition VMs, ensure that Microsoft VSS service is enabled before
starting the NGT installation.
• When you connect a VM to a volume group (VG), NGT captures the IQN of the VM and stores the information.
If you change the VM IQN before the NGT refresh cycle occurs and you take a snapshot
of the VM, the NGT will not be able to provide auto restore capability because the
snapshot operation will not be able to capture the VM-VG connection. As a workaround,
you can manually restart the Nutanix Guest Agent service to update NGT: on a Linux VM, run
the sudo service ngt_guest_agent restart command; on a Windows VM, restart the service from
the Services tab.
Note: See the supported operating system information for the specific NGT
features to verify if an operating system is supported for a specific NGT feature.
Windows
• You must install the SHA-2 code signing support update before installing NGT.
Apply the security update in KB3033929 to enable SHA-2 code signing support on the
Windows OS. If the installation of the security update in KB3033929 fails, apply one of the
following rollups:
• For Windows Server Edition VMs, ensure that Microsoft VSS Services is enabled before
starting the NGT installation.
Linux
Versions: CentOS 6.5 and 7.0, Red Hat Enterprise Linux (RHEL) 6.5 and 7.0, Oracle Linux 6.5
and 7.0, SUSE Linux Enterprise Server (SLES) 11 SP4 and 12, Ubuntu 14.04 or later
• The SLES operating system is only supported for the application consistent snapshots with
VSS feature. The SLES operating system is not supported for the cross-hypervisor disaster
recovery feature.
Customizing a VM
Sysprep
Sysprep is a utility that prepares a Windows installation for duplication (imaging) across
multiple systems. Sysprep is most often used to generalize a Windows installation.
During generalization, Sysprep removes system-specific information and settings such as the
security identifier (SID) and leaves installed applications untouched.
You can capture an image of the generalized installation and use the image with an answer
file to customize the installation of Windows on other systems. The answer file contains the
information that Sysprep needs to complete an unattended installation.
2. Select a VM to clone, click Launch Console, and log in with Administrator credentials.
3. Configure Sysprep with a system cleanup. Specify whether or not to generalize the
installation, then choose to shut down the VM.
Cloud-Init
On non-Windows VMs, Cloud-config files, special scripts designed to be run by the Cloud-Init
process, are generally used for initial configuration on the very first boot of a server. The cloud-
config format implements a declarative syntax for many common configuration items and
also allows you to specify arbitrary commands for anything that falls outside of the predefined
declarative capabilities. This lets the file act like a configuration file for common tasks, while
maintaining the flexibility of a script for more complex functionality.
You must pre-install the utility in the operating system image used to create VMs.
Cloud-Init runs early in the boot process and configures the operating system on the basis of
data that you provide. You can use Cloud-Init to automate tasks such as:
• Installing packages
• Copying files
• Bootstrapping other configuration management tools such as Chef, Puppet, and Salt
Hosts read and write data in shared Nutanix datastores as if they were connected to a SAN.
From the perspective of a hypervisor host, the only difference is the improved performance
that results from data not traveling across a traditional SAN. VM data is stored locally and
replicated on other nodes for protection against hardware failure.
When a guest VM submits a write request through the hypervisor, that request is sent to
the Controller VM on the host. To provide a rapid response to the guest VM, this data is first
stored on the metadata drive within a subset of storage called the oplog. This cache is rapidly
distributed across the 10GbE network to other metadata drives in the cluster.
Oplog data is periodically transferred to persistent storage within the cluster. Data is written
locally for performance and replicated on multiple nodes for high availability.
When the guest VM sends a read request through the hypervisor, the Controller VM reads
from the local copy first. If the host does not contain a local copy, then the Controller VM reads
across the network from a host that does contain a copy. As remote data is accessed, the
remote data is migrated to storage devices on the current host so that future read requests can
be local.
Live Migration
The Nutanix Enterprise Cloud Computing Platform fully supports live migration of VMs, whether
initiated manually or through an automatic process. All hosts within the cluster have visibility
into shared Nutanix datastores through the Controller VMs. Guest VM data is written locally and
is also replicated on other nodes for high availability.
If you migrate a VM to another host, future read requests are sent to a local copy of the data
(if it exists). Otherwise, the request is sent across the network to a host that does contain the
requested data. As remote data is accessed, the remote data is migrated to storage devices on
the current host, so that future read requests can be local.
The Nutanix cluster automatically selects the optimal path between a hypervisor host and its
guest VM data. The Controller VM has multiple redundant paths available, which makes the
cluster more resilient to failures.
Flash mode is configured when you update the VM configuration. In addition to modifying the
configuration, you can attach a volume group to the VM and enable flash mode on the VM. If
you attach a volume group to a VM that is part of a protection domain, the VM is not protected
automatically. Add the VM to the same consistency group manually.
To enable flash mode on the VM, click the Enable Flash Mode check box.
After you enable this feature on the VM, the status is updated in the VM table view. To view the
status of individual virtual disks (disks that are flashed to the SSD), go to the Virtual Disks tab in
the VM table view.
You can disable the flash mode feature for individual virtual disks. To update the flash mode
for individual virtual disks, click the update disk icon in the Disks pane and deselect the Enable
Flash Mode check box.
Labs
1. Uploading an image
Module
6
HEALTH MONITORING AND ALERTS
Overview
After completing this module, you will be able to:
Health Monitoring
Nutanix provides a range of status checks to monitor the health of a cluster.
• Summary health status information for VMs, hosts, and disks displays on
the Home dashboard.
• In-depth health status information for VMs, hosts, and disks is available through
the Health dashboard.
You can:
• Run Nutanix Cluster Check (NCC) health checks directly from Prism.
Note: If the Cluster Health service status is down for more than 15 minutes, an
alert email is sent by the AOS cluster to configured addresses and to Nutanix
support (if selected). In this case, no alert is generated in the Prism web
console. The email is sent once every 24 hours. You can run the NCC check
cluster_services_down_check to see the service status.
Nutanix Cluster Check (NCC) is a framework of scripts that can help diagnose cluster health.
You can run checks from the Prism Web Console or the CVM command line. NCC actions are
grouped into plugins and modules. Plugins are objects that run the diagnostic commands.
Modules are logical groups of plugins that can be run as a set. You can run individual or multiple
simultaneous health checks from either the Prism Web Console or the command line. When run
from the CVM command line, NCC generates a log file with the output of diagnostic commands
selected by the user. A similar log file is generated when the web console is used, but it is less
easy to read than the one generated by the command line.
NCC can be run as long as the individual nodes are up, regardless of cluster state. The scripts
run standard commands against the cluster or nodes based on the type of information being
retrieved.
Note: Some plug-ins run nCLI commands and might require the user to input the
nCLI password. The password is logged in plain text.
If you change the default password of the admin user, you must specify the password every
time you start an nCLI session from a remote system.
If you are logged onto a CVM, a password is not required for nCLI. Comprehensive
documentation of NCC is available in the Nutanix AOS 5.10 Command Reference on the
Support Portal.
• Before and after activities such as adding, removing, reconfiguring, or upgrading nodes
NCC Syntax
The general syntax of NCC is:
$ ncc <ncc-flags> <module> <sub-module> [...] <plugin> <plugin-flags>
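For example (a sketch; the available module and plug-in names depend on your NCC version), you
can run every health check or only the checks in a single module:
nutanix@CVM$ ncc health_checks run_all
nutanix@CVM$ ncc health_checks network_checks run_all
An individual plug-in, such as the cluster_services_down_check mentioned earlier, is run by naming
its full module path in the same way.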
NCC Output
The Type column in the output of $ ncc distinguishes between modules (M) and plug-ins (P).
$ cluster status
After installing the cluster, you need to verify that it is set up correctly using NCC:
$ ncc health_checks run_all
NCC checks for common misconfiguration problems and verifies settings are correct. An
example of a common CVM misconfiguration problem is using 1GbE NICs instead of 10GbE
NICs.
Usage Examples
Note: The flags override the default configurations of the NCC modules and
plug-ins. Do not run these flags unless your cluster configuration requires these
modifications.
ncc_installer_filename.sh
$ ncc health_checks run_all
Running Checks
In addition to running various cluster health checks, it is also possible to run NCC health checks
in parallel.
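As an example (the --parallel flag is an assumption to verify against the NCC guide for your
release), you can control how many checks run at the same time:
nutanix@CVM$ ncc health_checks run_all --parallel=4
Running more checks in parallel shortens the overall run time at the cost of additional load on the
CVMs.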
NCC allows you to set the frequency of cluster checks and to email the results of these checks.
By default, this feature is disabled. If you enable and configure this feature, NCC will:
• Run the checks periodically according to a frequency you have set
• Email the results of the checks to users who you have also configured to receive alert emails.
This feature uses the settings and infrastructure of the Alert Email Configuration feature,
which can send alert information automatically to Nutanix customer support and others.
• Run as configured even if you have upgraded your cluster's AOS or NCC version after
configuring this feature.
Note: For step-by-step instructions on how to configure check frequency and email
notifications, see the Nutanix Cluster Check 2.x Guide.
Health Dashboard
The Health dashboard displays dynamically updated health information about VMs, hosts, and
disks in the cluster. To view the Health dashboard, select Health from the drop-down menu on
the left of the main menu.
The left column displays tabs for each entity type (VMs, hosts, disks, storage pools, storage
containers, cluster services, and [when configured] protection domains and remote sites). Each
tab displays the entity total for the cluster (such as the total number of disks) and the number
in each health state. Clicking a tab expands the displayed information (see the following
section).
The middle column displays more detailed information about whatever is selected in the left
column.
The right column displays a summary of all the health checks. You also have the option to view
individual checks from the Checks button (success, warning, failure, or disabled).
• The Summary tab provides a summarized view of all the health checks according to check
status and check type.
• The Checks tab provides information about individual checks. Hovering the cursor over an
entry displays more information about that health check. You can filter checks by clicking
the appropriate field type and clicking Apply.
• The Actions tab provides you with options to manage checks, run checks, and collect logs.
Cluster health checks cover a range of entities including AOS, hypervisor, and hardware
components. A set of checks are enabled by default, but you can run, disable, or reconfigure
any of the checks at any time to suit your specific needs.
To configure health checks, from the Actions menu on the Health dashboard, click Manage
Checks.
The displayed screen lists all checks that can be run on the cluster, divided into categories
including CVM, Cluster, Data Protection, File Server, Host, and so on. Sub-categories include
CPU, disk, and hardware for CVMs; Network, Protection Domains, and Remote Sites for
Clusters; CPU and disk for hosts; and so on.
Selecting a check from the left pane will allow you to:
• View a history of all entities evaluated by this check, displayed in the middle of the screen.
• View causes and resolutions, as well as supporting reference articles on the Nutanix
Knowledge Base.
Set NCC Frequency allows you to configure the run schedule for Nutanix Cluster Checks and
view e-mail notification settings. The available frequencies are:
• Every 4 hours
• Every day
• Every week
You can set the day of the week and the start time for the checks where appropriate.
A report is sent to all e-mail recipients shown. You can configure e-mail notifications using
the Alert Email Configuration menu option.
Collecting Logs
Logs for your Nutanix cluster and its various components can be collected directly from the
Prism web console. Logs can be collected for Controller VMs, file server, hardware, alerts,
hypervisor, and for the system. The most common scenarios in which you will need to collect
logs are when troubleshooting an issue, or when you need to provide information for a Nutanix
Support case.
1. On the Health dashboard, click Actions on the right pane and select Log Collector.
2. Select the period for which you want to collect logs, either by choosing a duration in hours
or by setting a custom date range.
After you run Log Collector and the task completes, the bundle will be available to download.
Analysis Dashboard
The Analysis dashboard allows you to create charts that can dynamically monitor a variety of
performance measures.
Chart definitions
The pane on the left lists the charts that can be run. No charts are provided by default, but you
can create any number of charts. A chart defines the metrics to monitor.
Chart monitors
When a chart definition is checked, the monitor appears in the middle pane. An Alerts monitor
always displays first. The remaining displayed monitors are determined by which charts are
checked in the left pane. You can customize the display by selecting a time interval from the
Range drop-down (above the charts) and then refining the monitored period by moving the
time interval end points to the desired length.
Alerts
Any alerts that occur during the interval specified by the timeline in the middle pane display in
the pane on the right.
1. On the Analysis dashboard, click New and select either a Metric chart or an Entity chart.
• For Metric charts, select the metric you want to monitor, the entity type, and then a list of
entities.
• For Entity charts, select the entity type, then the specific entity and all the metrics you want
to monitor on that entity.
Alerts Dashboard
The Alerts dashboard displays alert and event messages.
Alerts View
Two viewing modes are available: Alerts and Events. The Alerts view, shown above, lists
all alert messages and can be sorted by source entity, impact type, severity, resolution,
acknowledgement, and time of creation.
Informational: An "informational" alert highlights a condition to be aware of, for example, a reminder that the support tunnel is enabled.
Resolved (values: user and time, or No): Indicates whether a user has set the alert as resolved. Resolving an error means you set that error as fixed. (The alert may return if the condition is scanned again at a future point.) If you do not want to be notified about the condition again, turn off the alert for this condition.
Acknowledged (values: user and time, or No): Indicates whether the alert has been acknowledged. Acknowledging an alert means you recognize the error exists (no more reminders for this condition), but the alert status remains.
Create Time (value: time and date): Displays the date and time when the alert occurred.
Alert email notifications are enabled by default. This feature sends alert messages automatically
to Nutanix customer support through customer-opened ports 80 or 8443. To automatically
receive email notification alerts, ensure that nos-alerts and nos-asup recipients are added to the
accepted domain of your SMTP server. To customize who should receive the alert e-mails (or to
disable e-mail notification), use the Alert Email Configuration settings, which define:
• The rules that govern when and to whom emails will be sent
Events View
The Event messages view displays a list of event messages. Event messages describe cluster
actions such as adding a storage pool or taking a snapshot. This view is read-only and you do
not need to take any action like acknowledging or resolving generated events.
To filter the list, click the filter icon on the right of the screen. This displays a pane (on the
right) for selecting filter values. Check the box for each value to include in the filter. You can
include multiple values. The values are for event type (Behavioral Anomaly, System Action, User
Action) and time range (Last 1 hour, Last 24 hours, Last week, From XXX to XXX). You can also
specify a cluster. The selected values appear in the filter field above the events list. You can do
the following in the current filters field:
• Save the filter list by clicking the star icon. You can save a maximum of 20 filter lists per
entity type.
• Use a saved filter list by selecting it from the drop-down list.
Create Time (value: time and date): Displays the date and time when the event occurred.
Labs
1. Creating a performance chart
3. Managing alerts
Module
7
DISTRIBUTED STORAGE FABRIC
Overview
After completing this module, you will be able to:
The Distributed Storage Fabric (DSF) is a scalable distributed storage system which exposes
NFS/SMB file storage as well as iSCSI block storage with no single points of failure. The
distributed storage fabric stores user data (VM disk/files) across storage tiers (SSDs, Hard
Disks, Cloud) on multiple nodes. The DSF also supports instant snapshots, clones of VM disks
and other advanced features such as deduplication, compression and erasure coding.
The DSF logically divides user VM data into extents which are 1MB in size. These extents may
be compressed, erasure coded, deduplicated, snapshotted or left untransformed. Extents can
also move around; new or recently accessed extents stay on faster storage (SSD) while colder
extents move to HDD. The DSF utilizes a “least recently used” algorithm to determine what data
can be declared “cold” and migrated to HDD. Additionally, the DSF attempts to maintain data
locality for VM data – so that one copy of each vDisk’s data is available locally from the CVM on
the host where the VM is running.
DSF presents SSDs and HDDs as a storage pool and provides cluster-wide storage services:
• Snapshots
• Clones
• HA/DR
• Deduplication
• Compression
• Erasure coding
The Controller VMs (CVMs) running on each node combine to form an interconnected
network within the cluster, where every node in the cluster has access to data from shared
SSD, HDD, and cloud resources. The CVMs allow for cluster-wide operations on VM-centric
software-defined services: snapshots, clones, high availability, disaster recovery, deduplication,
compression, erasure coding, storage optimization, and so on.
Hypervisors (AHV, ESXi, Hyper-V) and the DSF communicate using the industry-standard
protocols NFS, iSCSI, and SMB3.
The extent store is the persistent bulk storage of DSF and spans SSD and HDD and is extensible
to facilitate additional devices/tiers. Data entering the extent store is either drained from the
OpLog or is sequential in nature and has bypassed the OpLog directly.
Nutanix ILM will determine tier placement dynamically based upon I/O patterns and will move
data between tiers.
The OpLog
The OpLog is similar to a filesystem journal and is used to service bursts of random write
operations, coalesce them, and then sequentially drain that data to the extent store. For each
write OP, the data is written to the local OpLog and synchronously replicated to the OpLogs of
one or more other CVMs before the write is acknowledged, for data availability purposes. The
number of copies is determined by the replication factor (RF 2 or 3) of the container.
All CVMs participate in OpLog replication. Individual replica location is dynamically chosen based
upon load. The OpLog is stored on the SSD tier on the CVM to provide extremely fast write I/O
performance. OpLog storage is distributed across the SSD devices attached to each CVM.
For sequential workloads, the OpLog is bypassed and the writes go directly to the extent store.
If data is currently sitting in the OpLog and has not yet been drained, all read requests for that
data are fulfilled directly from the OpLog until it has been drained, after which reads are served
by the extent store/unified cache.
For containers where fingerprinting (aka Dedupe) has been enabled, all write I/Os will be
fingerprinted using a hashing scheme allowing them to be deduplicated based upon fingerprint
in the unified cache.
Going through the hypervisor, DSF sends write operations to the CVM on the local host, where
they are written to either the OpLog or the extent store. In addition to the local copy, an additional
write operation is then distributed across the 10 GbE network to other nodes in the cluster.
Going through the hypervisor, read operations are sent to the local CVM which returns data
from a local copy. If no local copy is present, the local CVM retrieves the data from a remote
CVM that contains a copy.
The file system automatically tiers data across different types of storage devices using
intelligent data placement algorithms. These algorithms make sure that the most frequently
used data is available in memory or in flash for the fastest possible performance.
You can create, migrate, and manage VMs within Nutanix datastores as you would with any
other storage solution.
AHV Storage
Prism or aCLI is used to configure all AHV storage.
• Storage Pool
A storage pool is a group of physical storage devices for the cluster including PCIe
SSD, SSD, and HDD devices. The storage pool spans multiple nodes and scales as the
cluster expands. A storage device can only be a member of a single storage pool. Nutanix
recommends creating a single storage pool containing all disks within the cluster.
• Storage Container
A storage container is a subset of available storage within a storage pool. Storage containers
enable an administrator to apply rules or transformations such as compression to a data set.
They hold the virtual disks (vDisks) used by virtual machines. Selecting a storage pool for a
new storage container defines the physical disks where the vDisks are stored.
• Volume Group
Each volume group contains a UUID, a name, and an iSCSI target name. Each disk in the volume
group also has a UUID and a LUN number that specifies ordering within the volume group.
You can include volume groups in protection domains configured for asynchronous data
replication (Async DR) either exclusively or with VMs.
Volume groups cannot be included in a protection domain configured for Metro Availability,
in a protected VStore, or in a consistency group for which application consistent
snapshotting is enabled.
• vDisk
A vDisk is a subset of available storage within a storage container that provides storage
to virtual machines. A vDisk is any file over 512 KB on DSF, including VMDKs and VM disks.
vDisks are broken up into extents, which are grouped and stored on physical disk as an
extent group.
• Datastore
A datastore is a hypervisor construct that provides a logical container for files necessary for VM
operations. In the context of the DSF, each container on a cluster is a datastore.
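As a rough nCLI sketch of creating and listing a storage container (the container and storage pool names are illustrative, and the exact parameters should be verified in the AOS Command Reference):
ncli> container create name="ctr1" sp-name="default-storage-pool"
ncli> container ls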
Clones
As mentioned in the introduction to this module, a vDisk is composed of extents, which
are logically contiguous chunks of data. Extents are stored within extent groups, which are
physically contiguous data stored as files on the storage devices. When a snapshot or clone is
taken, the base vDisk is marked immutable and another vDisk is created where new data will be
written.
At creation, both vDisks have the same block map, which is a metadata mapping of the vDisk to
its corresponding extents. Unlike traditional approaches which require traversal of the snapshot
chain to locate vDisk data (which can add read latency), each vDisk has its own block map.
This eliminates any of the overhead normally seen by large snapshot chain depths and allows
multiple snapshots to be taken without any performance impact.
Shadow Clones
A clone is a duplicate of a vDisk, which can then be modified.
A shadow clone, on the other hand, is a cache of a vDisk on all the nodes in the cluster. When a
vDisk is read by multiple VMs (such as the base image for a VDI clone pool), the cluster creates
shadow clones of the vDisk. Shadow clones are enabled by default.
Snapshotting Disks
Snapshots for a VM are crash consistent, which means that the VMDK on-disk images are
consistent with a single point in time. That is, the snapshot represents the on-disk data as if the
VM crashed. The snapshots are not, however, application consistent, meaning that application
data is not quiesced at the time the snapshot is taken.
For a breakdown of the differences in snapshots for different hypervisors and operating
systems, with different statuses of NGT, see the following table.
NGT status for Linux VMs: Installed and Active, with pre_freeze and post_thaw scripts present.
• ESXi: Nutanix script-based VSS snapshots
• AHV: Nutanix script-based VSS snapshots
Deduplication Process
The Elastic Deduplication Engine is a software-based feature of DSF that allows for data
deduplication in the capacity (Extent Store) and performance (Unified Cache) tiers. Incoming
data is fingerprinted during ingest using a SHA-1 hash at a 16 K granularity. This fingerprint is
then stored persistently as part of the written block’s metadata.
Contrary to traditional approaches, which utilize background scans requiring the data to be
reread, Nutanix creates the fingerprint inline on ingest. For data being deduplicated in the
capacity tier, the data does not need to be scanned or reread – matching fingerprints are
detected and duplicate copies can be removed.
Block-level deduplication looks within a file and saves unique iterations of each block. All the
blocks are broken into chunks. Each chunk of data is processed using an SHA-1 hash algorithm.
This process generates a unique number for each piece: a fingerprint.
The fingerprint is then compared with the index of existing fingerprints. If it is already in
the index, the piece of data is considered a duplicate and does not need to be stored again.
Otherwise, the new hash number is added to the index and the new data is stored.
If you update a file, only the changed data is saved, even if only a few bytes of the document
or presentation have changed. The changes do not constitute an entirely new file. This behavior
makes block deduplication (compared with file deduplication) far more efficient.
However, block deduplication takes more processing power and uses a much larger index to
track the individual pieces.
To reduce metadata overhead, fingerprint reference counts (refcounts) are monitored during
the deduplication process. Fingerprints with low refcounts will be discarded. Full extents are
preferred for capacity tier deduplication in order to minimize fragmentation.
When used in the appropriate situation, deduplication makes the effective size of the
performance tier larger so that more active data can be stored. This matters because the
performance of guest VMs suffers when active data can no longer fit in the performance tier.
Deduplication Techniques
Inline deduplication is useful for applications with large common working sets.
Post-process deduplication is useful for virtual desktops (VDI) with full clones; it reduces
redundant data in the capacity tier, increasing the effective storage capacity of a cluster.
For sequential data that is written and compressed inline, the RF copy of the data is
compressed before transmission, further increasing performance since it is sending less data
across the network.
Inline compression also pairs well with erasure coding. Compression algorithms work, for
example, by using a dictionary to represent a string of bits with a smaller string, or by inserting
a reference or pointer to a string of 0s and 1s that the program has already seen.
Text compression can be as simple as removing all unneeded characters, inserting a single
repeat character to indicate a string of repeated characters, and substituting a smaller bit
string for a frequently occurring bit string. Data compression can often reduce a text file to
50 percent of its original size, or significantly less.
Compression Process
Inline compression condenses sequential streams of data or large I/O sizes (>64K) when
written to the Extent Store (SSD + HDD). This includes data draining from oplog as well as
sequential data skipping it.
Offline compression initially writes the data in an uncompressed state and then leverages the
Curator framework to compress the data cluster-wide. When inline compression is enabled
but the I/O operations are random in nature, the data is written uncompressed in the oplog,
coalesced, and then compressed in memory before being written to the Extent Store.
Nutanix leverages LZ4 for initial data compression, which provides a very good blend between
compression and performance. For cold data, Nutanix uses LZ4HC to provide an improved
compression ratio.
Data compression tends to be more effective than deduplication in reducing the size of unique
information, such as images, audio, videos, databases, and executable files.
Workloads that frequently update data (for example, virtualized applications for power users,
such as CAD) are not good candidates for compression.
Deduplication is most effective in environments that have a high degree of redundant data,
such as virtual desktop infrastructure or storage backup systems.
• User data (file server, user data, vDisk): Post-process compression with a 4-6 hour delay
• VDI (VMware View, Citrix XenDesktop): VCAI snapshots, linked clones, or full clones with inline dedup (not container compression)
Note: Nutanix does not recommend turning on deduplication for VAAI (vStorage
APIs for Array Integration) clone or linked clone environments.
Replication Factor
When you configure the DSF with a replication factor of 2 or 3, the Nutanix cluster maintains
two or three exact copies of the same data on different nodes to ensure data availability. The
actual logical capacity available depends on the replication factor you choose. When you
use replication factor 2 (also called fault tolerance 1), you have approximately 50 percent
capacity available. When you use replication factor 3 (also called fault tolerance 2), you have
approximately 33 percent capacity available.
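As a simple illustration of these percentages (the raw capacity figure is hypothetical): in a cluster with 60 TB of raw storage, replication factor 2 yields roughly 30 TB of usable logical capacity, while replication factor 3 yields roughly 20 TB.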
Before EC-X starts, the data must be write-cold (in other words, there has been no write access
to the data for seven days). The amount of space you can save when using EC-X varies based
on several factors.
EC-X works at the extent group layer, meaning it uses 1 or 4 MB data sets when performing its
calculations. By default, the cluster automatically uses extent groups that belong to the same
virtual disk (vDisk), as this method makes it easier to perform garbage cleanup, but the cluster
can use extent groups from different vDisks if necessary. vDisks on the DSF are made of virtual
blocks (vBlocks), which are 1 MB chunks of virtual address space. Each vDisk in the system is
owned by a Nutanix CVM that typically runs on the same Nutanix node as the VM the vDisk
belongs to.
Traditional RAID
• Slow rebuilds
• Hardware-defined
Erasure Coding
EC-X increases effective or usable capacity on a cluster. The savings from enabling EC-X are in
addition to the savings from deduplication and compression.
EC-X Process
Erasure coding is performed post-process and leverages the Curator MapReduce framework
for task distribution.
Since this is a post-process framework, the traditional write I/O path is unaffected. In this
scenario, primary copies of both RF2 and RF3 data are local and replicas are distributed on the
remaining cluster nodes.
When Curator runs a full scan, it finds eligible extent groups for encoding. Eligible extent
groups must be "write-cold", meaning they have not been overwritten for a defined amount of
time. For regular vdisks, this time period is 7 days. For snapshot vdisks, it is 1 day.
After erasure coding finds the eligible candidates, Chronos will distribute and throttle the
encoding tasks.
Pros
• Increases usable capacity of raw storage.
Cons
EC-X Workloads
Recommended workloads for erasure coding (workloads not requiring high I/O):
- Backups
- Archives
- File servers
- Log servers
• VDI is not capacity-intensive thanks to intelligent cloning (so EC-X advantages are minimal).
Once the data becomes cold, the erasure code engine computes double-parity for the data
copies by taking all the data copies (‘d’) and performing an exclusive OR operation to create
one or more parity blocks. With the two parity blocks in place, the 2nd and 3rd copies are
removed.
You end up with 12 (the original three copies) + 2 (parity) - 8 (removal of the second and third
copies) = 6 blocks, which is a storage savings of 50%.
EC-X makes a strip from existing data to create parity. The strip width depends on the number
of nodes in the Nutanix cluster and the data replication factor configured for the Nutanix
container.
EC-X tries to delete the copy of the extent group that is not local to the CVM. For example,
if VM1 runs on node1 and has egroup1 on node1 and node2, the DSF keeps the egroup1 copy
on node1 after the EC-X operation. EC-X places the parity extent group on a different node
(not node1) and does not compress the parity bit, even if compression is enabled at the DSF
container level. In a hybrid system, the DSF places the parity bit on HDD if possible.
In a 6-node cluster configured with redundancy factor 2, erasure coding uses a stripe size of 5:
4 nodes for data and 1 node for parity. The sixth node in the cluster ensures that if a node fails,
a node is available for rebuild. With a stripe of 4 data to 1 parity, the overhead is 25%. Without
erasure coding, the overhead is 100%.
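As a worked example of those overhead figures: protecting 4 data blocks with replication factor 2 requires 8 blocks in total (100% overhead), whereas a 4/1 erasure coded strip requires only 5 blocks (4 data + 1 parity, or 25% overhead), saving 3 of every 8 blocks.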
Erasure coding stripe size adapts to the size of the cluster, starting with the minimum of 4
nodes and growing to a maximum stripe width of 5 nodes.
Best Practices:
1. A cluster must have at least four nodes populated with each storage tier (SSD/HDD)
represented to enable erasure coding.
2. Avoid strips greater than (4, 1) because capacity savings provide diminishing returns and the
larger strip size increases the cost of rebuild.
3. Erasure coding effectiveness (data reduction savings) might be reduced on workloads that
have many overwrites outside of the erasure coding window, which by default is 7 days.
5. Erasure coding is an asynchronous process, so space savings might not appear for some
time.
6. Multiple node removal operations can break the erasure coded strip. If it is necessary to
remove multiple nodes from a cluster that uses erasure coding, turn off erasure coding
before removing the nodes.
7. If erasure coding is enabled on any storage container, a minimum of four blocks for RF2 or
six blocks for RF3 is required to maintain block awareness.
8. Erasure coding mandates that data and parity strips must be placed on separate failure
domains (node) and that there is an additional node available for recovery. For example, a
strip size of (4, 1) requires you to have at least six nodes.
Labs
1. Creating a container with compression enabled
Module
8
MIGRATING WORKLOADS TO AHV
Objectives
When you have completed this module, you will be able to describe how to migrate workloads
using Nutanix Move.
Nutanix Move
Nutanix Move is a freely distributed application that supports migration from a non-Nutanix
source to a Nutanix target with minimal downtime. Supported migrations include:
• Migration of Amazon Elastic Block Store (EBS)-backed EC2 instances running on AWS.
- When you are migrating from ESXi to AHV, Nutanix Move directly communicates with
vCenter through the Management Server and the Source Agent. The Source Agent
collects information about the VM being migrated (guest VM) from the VMware library.
Note: Adding a single AWS account as a source with multiple IAM users is not
supported.
• Changed Block Tracking (CBT) driver: a driver running on the source VMs to be migrated to
facilitate efficient transfer of data from the source to the target. Move deploys the driver as
part of the source VM preparation and removes it during post migration cleanup.
In case of migration from AWS to AHV, NTNX-MOVE-AGENT runs on AWS as an EC2 instance
to establish connection between AWS and Nutanix Move. Nutanix Move takes snapshots of
the EBS volumes of the VMs for the actual transfer of data for the VM being migrated (guest
VM). The CBT driver computes the list of blocks that have changed to optimally transfer only
changed blocks of data on the disk. The data path connection between NTNX-MOVE-AGENT
and Nutanix Move is used to transfer data from AWS to the target Nutanix Cluster.
After the migration of the VM from the source to the target, Nutanix Move deletes all EBS
volume snapshots taken by it.
Note: Nutanix Move does not store other copies of the data.
Note: For AWS, the migration takes place in a powered-on state. For ESXi, the
power state is retained.
• Schedule migration.
• Schedule data-seeding for the virtual machines in advance and cut over to a new AHV
cluster.
• Manage VM migrations between multiple clusters from a single management interface.
• Migrate all AHV certified OSs (see the Supported Guest VM Types for AHV section of the
AHV Admin Guide on the Support Portal).
Compatibility Matrix
Unsupported Features
• IPv6
• Windows VMs installed with any antivirus software. Antivirus software prevents the
installation of the VirtIO drivers.
To get started with Nutanix Move, first download the Nutanix Move appliance and deploy it on
a target cluster. If you are migrating to multiple AHV clusters, you can deploy Nutanix Move on
any one of the target clusters. Once the installation has completed, configure the Move
environment and build a Migration Plan using the Move interface.
Labs
1. Preparing a VM for Migration
2. Deploying a Move VM
3. Configuring Move
Module
9
FILES AND VOLUMES
Overview
After completing this module, you will be able to:
Nutanix Volumes
Nutanix Volumes is a native scale-out block storage solution that enables enterprise
applications running on external servers to leverage the benefits of the hyperconverged
Nutanix architecture, accessing the Nutanix DSF via the iSCSI protocol.
Nutanix Volumes offers a solution for workloads that may not be a fit for running on virtual
infrastructure but still need highly available and scalable storage: for example, workloads
requiring locally installed peripheral adapters, workloads with high socket-count compute
demands, or workloads with licensing constraints.
Nutanix Volumes enables you to create a shared infrastructure providing block-level iSCSI
storage for physical servers without compromising availability, scalability, or performance. In
addition, you can leverage efficient backup and recovery techniques, dynamic load-balancing,
LUN resizing, and simplified cloning of production databases. You can use Nutanix Volumes to
export Nutanix storage for use with applications like Oracle databases including Oracle RAC,
Microsoft SQL Server, and IBM Db2 running outside of the Nutanix cluster.
Every CVM in a Nutanix cluster can participate in presenting storage, allowing individual
applications to scale out for high performance. You can dynamically add or remove Nutanix
nodes, and by extension CVMs, from a Nutanix cluster.
Nutanix manages storage allocation and assignment for Volumes through a construct called a
volume group (VG). A VG is a collection of “volumes,” more commonly referred to as virtual
disks (vDisks). Volumes presents these vDisks to both VMs and physical servers, which we refer
to as “hosts” unless otherwise specified.
vDisks represent logical “slices” of the ADSF’s container, which are then presented to the hosts
via the iSCSI protocol. vDisks inherit the properties (replication factor, compression, erasure
coding, and so on) of the container on which you create them. By default, these vDisks are
thinly provisioned. Because Nutanix uses iSCSI as the protocol for presenting VG storage, hosts
obtain access based on their iSCSI Qualified Name (IQN). The system uses IQNs as a whitelist
and attaches them to a VG to permit access by a given host. You can use IP addresses as an
alternative to IQNs for VG attachment. Once a host has access to a VG, Volumes discovers the
VG as one or more iSCSI targets. Upon connecting to the iSCSI targets, the host discovers the
vDisks as SCSI disk devices. The figure above shows these relationships.
• Disks as first-class entities - execution contexts are ephemeral and data is critical.
See Converting Volume Groups and Updating Clients to use Volumes for more information.
iSCSI Qualified Name (IQN) is one of the naming conventions used by iSCSI to identify initiators
and targets. IQN is documented in RFC 3720 and can be up to 255 characters long. An IQN
takes the form iqn.yyyy-mm.<naming-authority>:<unique_character_string>, where:
• yyyy-mm: The year and month the naming authority was established.
• naming-authority: Usually a reverse syntax of the Internet domain name of the naming authority.
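For example, a Windows initiator typically presents an IQN such as iqn.1991-05.com.microsoft:sqlserver01, where 1991-05 and com.microsoft identify the naming authority and the trailing string identifies the individual host (the hostname shown is hypothetical).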
CHAP (Challenge-Handshake Authentication Protocol) verifies identity through the use of an
incrementally changing identifier and a variable challenge-value. CHAP requires that both the
client and server know the plaintext of the secret, although it is never sent over the network.
Mutual CHAP authentication. With this level of security, the target and the initiator authenticate
each other. CHAP sets a separate secret for each target and for each initiator.
• The hosts HostA and HostB have their iSCSI initiators configured to communicate with the
iSCSI target (data services IP).
Before we get to configuration, we need to configure the data services IP that will act as our
central discovery/login portal.
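As a sketch (the address is a placeholder, and the parameter name should be confirmed in the AOS Command Reference for your release), the data services IP can be set from nCLI:
ncli> cluster edit-params external-data-services-ip-address=10.10.10.100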
Volume groups (VGs) work with ESXi, Hyper-V, and AHV for iSCSI connectivity. AHV also
supports attaching VGs directly to VMs. In this case, the VM discovers the vDisks associated
with the VG over the virtual SCSI controller.
You can use VGs with traditional hypervisor vDisks. For example, some VMs in a Nutanix cluster
may leverage .vmdk or .vhdx based storage on Network File System (NFS) or Server Message
Block (SMB), while other hosts leverage VGs as their primary storage.
VMs utilizing VGs have, at a minimum, their boot and operating system drives presented as
hypervisor vDisks. You can manage VGs from Prism or from a preferred CLI such as aCLI, nCLI,
or PowerShell. Within Prism, the Storage page lets you create and monitor VGs.
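As a rough aCLI sketch of the volume group workflow described above (the names, size, and initiator IQN are illustrative; verify the command set in the AHV Admin Guide for your release):
acli vg.create vg1
acli vg.disk_create vg1 container=ctr1 create_size=100G
acli vg.attach_external vg1 iqn.1991-05.com.microsoft:sqlserver01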
Nutanix Volumes presents a volume group and its vDisks as iSCSI targets and assigns IQNs.
Initiators or hosts have their IQNs attached to a volume group to gain access.
Multiple hosts can share the vDisks associated with a VG for the purposes of shared storage
clustering. A common scenario for using shared storage is in Windows Server failover
clustering. You must explicitly mark the VG for sharing to allow more than one external initiator
or VM to attach.
In some cases, Volumes needs to present a volume group to multiple VMs or bare metal servers
for features like clustering. The graphic shows how an administrator can present the same
volume group to multiple servers.
Volume groups are connected via the iSCSI Qualified Name (IQN), which follows the format
iqn.yyyy-mm.<naming-authority>:<unique_character_string>
Note: Allowing multiple systems to concurrently access this volume group can
cause serious problems.
Instead of configuring host iSCSI client sessions to connect directly to CVMs, Volumes uses
an external data services IP address. This data services IP acts as a discovery portal and initial
connection point. The data services address is owned by one CVM at a time. If the owner goes
offline, the address moves between CVMs, thus ensuring that it’s always available.
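From a Linux host using the standard open-iscsi tools, discovery and login against the data services IP look roughly like the following (the address is a placeholder):
$ iscsiadm -m discovery -t sendtargets -p 10.10.10.100:3260
$ iscsiadm -m node -l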
Once the affined Stargate is healthy for 2 or more minutes, the system quiesces and closes the
session, forcing another logon back to the affined Stargate.
Hosts read and write data in shared Nutanix datastores as if they were connected to a SAN.
Therefore, from the perspective of a hypervisor host, the only difference is the improved
performance that results from data not traveling across a network.
When a guest VM submits a write request through the hypervisor, that request is sent
to the Controller VM on the host. To provide a rapid response to the guest VM, Volumes first
stores this data on the metadata drive, within a subset of storage called the oplog. This cache is
rapidly distributed across the 10 GbE network to other metadata drives in the cluster. Volumes
periodically transfers oplog data to persistent storage within the cluster.
Volumes writes data locally for performance and replicates it on multiple nodes for High
Availability.
When the guest VM sends a read request through the hypervisor, the Controller VM reads from
the local copy first, if present. If the host does not contain a local copy, then the Controller VM
reads across the network from a host that does contain a copy. As Volumes accesses remote
data, the remote data is migrated to storage devices on the current host so that future read
requests can be local.
Labs
1. Deploying Windows or Linux VMs
Nutanix Files
Nutanix Files allows users to leverage the Nutanix platform as a highly available file server.
Files is a software-defined, scale-out file storage solution that provides a repository for
unstructured data, such as
• home directories
• user profiles
• departmental shares
• application logs
• backups
• archives
Flexible and responsive to workload requirements, Files is a fully integrated, core component of
the Nutanix Enterprise Cloud.
Unlike standalone NAS appliances, Files consolidates VM and file storage, eliminating the need
to create an infrastructure silo. Administrators can manage Files with Nutanix Prism, just like
VM services, unifying and simplifying management. Integration with Active Directory enables
support for quotas and access-based enumeration, as well as self-service restores with the
Windows previous versions feature. All administration of share permissions, users, and groups
is done using the traditional Windows MMC for file management. Nutanix Files also supports
file server cloning, which lets you back up Files off-site and run antivirus scans and machine
learning without affecting production.
Files is fully integrated into Microsoft Active Directory (AD) and DNS. This allows all the secure
and established authentication and authorization capabilities of AD to be leveraged.
Files is a scale-out approach that provides SMB and NFS file services to clients. Nutanix Files
server instances contain a set of VMs (called FSVMs). Files requires at least three FSVMs
running on three nodes to satisfy a quorum for High Availability.
Easy to Implement: Deploy in minutes, update non-disruptively with a single click, and manage
all storage from a single pane of glass.
Flexible: Scale-up or scale-out flexibly on the hardware of your choice and enjoy cloud-like
pay-as-you-grow consumption.
Intelligent: Know your data, who’s using it, and how—and then drive automated management
and control.
Nutanix Files consists of the following constructs, just like any file server:
• File server: High level namespace. Each file server has a set of file services VMs (FSVM)
deployed.
• Share: A file share is a folder that can be accessed by machines over a network. Access
to these shares is controlled by special Windows permissions called NTACLs, which are
typically set by the administrator. By default, domain administrators have full access and
domain users have read-only access to the home share. General purpose shares grant full
access to both domain administrators and domain users.
• Folder: Folders for storing files. Files shares folders across FSVMs.
The graphic above shows a high-level overview of File Services Virtual Machine (FSVM)
storage. Each FSVM leverages the Acropolis Volumes API for data storage. Files accesses the
API using in-guest iSCSI. This allows any FSVM to connect to any iSCSI target in the event of an
FSVM failure.
Load balancing occurs on two levels. First, a client can connect to any one of the FSVMs and
users can add FSVMs as needed. Second, on the storage side, Nutanix Files can redistribute
volume groups to different FSVMs for better load balancing across nodes. The following
situations prompt load balancing:
1. When removing an FSVM from the cluster, Files automatically load balances all its volume
groups across the remaining FSVMs.
2. During normal operation, the distribution of top-level directories becomes poorly balanced
due to changing client usage patterns or suboptimal initial placement.
3. When increased user demand necessitates adding a new FSVM, its volume groups are
initially empty and may require rebalancing.
Features
• Security descriptors
• Data streams
• OpLocks
• ESXi
• Many-to-one replication
Networking
Nutanix Files uses an external network and a storage network. IP addresses for each are
assigned from user-defined VLAN and IP address ranges.
• Storage network: The storage network enables communication between the file server VMs
and the Controller VM.
• Client-side network: The external network enables communication between the SMB clients
to the FSVMs. This allows Windows clients to access the Nutanix Files shares. Files also uses
the external network to communicate to the Active Directory and domain name servers.
High Availability
Nutanix Files provides two levels of High Availability:
To provide for path availability, Files leverages DM-MPIO within the FSVM, which has the active
path set to the local CVM by default.
CVM Failure
If a CVM goes offline because of failure or planned maintenance, Files disconnects any active
sessions against that CVM, triggering the iSCSI client to log on again. The new logon occurs
through the external data services IP, which redirects the session to a healthy CVM. When the
failed CVM returns to operation, the iSCSI session fails back. In the case of a failback, the FSVM's
iSCSI session is logged off and redirected to the appropriate CVM.
Node Failure
When a physical node fails completely, Files uses leadership elections and the local CVM to
recover. The FSVM sends heartbeats to its local CVM once per second, indicating its state and
that it’s alive. The CVM keeps track of this information and can act during a failover. During a
node failure, an FSVM on that host can migrate to another host. Any loss of service from that FSVM
then follows the FSVM failure scenario described below until the FSVM is restored on a new host.
FSVM Failure
When an FSVM goes down, the CVM unlocks the files from the downed FSVM and releases the
external address from eth1. The downed FSVM’s resources then appear on a running FSVM. The
internal Zookeeper instances store this information so that they can send it to other FSVMs if
necessary.
When an FSVM is unavailable, the remaining FSVMs volunteer for ownership of the shares
and exports that were associated with the failed FSVM. The FSVM that takes ownership of the
volume group informs the CVM that the volume group reservation has changed. If the FSVM
that attempts to take control of the volume group is already the leader for a different volume group
that it has volunteered for, it relinquishes leadership for the new volume group immediately.
This arrangement ensures distribution of volume groups, even if multiple FSVMs fail.
The Nutanix Files Zookeeper instance tracks the original FSVM’s ownership using the storage
IP address (eth0), which does not float from node to node. Because FSVM-1’s client IP address
from eth1 is now on FSVM-2, client connections persist. The volume group and its shares and
exports are reregistered and locked to FSVM-2 until FSVM-1 can recover and a grace period has
elapsed.
When FSVM-1 comes back up and finds that its shares and exports are locked, it assumes that
an HA event has occurred. After the grace period expires, FSVM-1 regains control of the volume
group through the CVM.
Module
10
MANAGING FAILURES
Overview
Data Resiliency describes the number and types of failures a cluster can withstand, as determined
by features such as redundancy factor and block or rack awareness.
After completing this module, you will be able to:
Scenarios
Component unavailability is an inevitable part of any datacenter lifecycle. The Nutanix
architecture was designed to address failures using various forms of hardware and software
redundancy.
A cluster can tolerate single failures of a variety of components while still running guest VMs
and responding to commands via the management console—all typically without a performance
penalty.
CVM Unavailability
A Nutanix node is a physical host with a Controller VM (CVM). Either component can fail
without impacting the rest of the cluster.
The Nutanix cluster monitors the status of CVMs in the cluster. If any Stargate process fails to
respond two or more times in a 30-second period, another CVM redirects hypervisor I/O on the
related host to another CVM. Read and write operations occur over the 10 GbE network until
the failed Stargate comes back online.
To prevent constant switching between Stargates, the data path is not restored until the
original Stargate has been stable for 30 seconds.
During the switching process, the host with a failed CVM may report that the shared storage
is unavailable. Guest VM IO may pause until the storage path is restored. Although the primary
copy of the guest VM data is unavailable because it is stored on disks mapped to the failed
CVM, the replicas of that data are still accessible.
As soon as the redirection takes place, VMs resume read and write I/O. Performance may
decrease slightly because the I/O is traveling across the network rather than across an internal
bus. Because all traffic goes across the 10 GbE network, most workloads do not diminish in a
way that is perceivable to users.
A second CVM failure has the same impact on the VMs on the other host, which means there
will be two hosts sending I/O requests across the network. More important is the additional risk
to guest VM data. With two CVMs unavailable, there are now two sets of physical disks that are
inaccessible. In a cluster with replication factor 2, there is now a chance that some VM data
extents are missing completely, at least until one of the failed CVMs resumes operation.
VM impact
• HA event: None
In the event of a CVM failure, the I/O operation is forwarded to another CVM in the cluster.
ESXi and Hyper-V handle this via a process called CVM autopathing, which leverages a Python
program called HA.py (like “happy”). HA.py modifies the routing table on the host to forward
traffic that is going to the internal CVM address (192.168.5.2) to the external IP of another CVM.
This enables the datastore to remain online - just the CVM responsible for serving the I/O
operations is remote. Once the local CVM comes back up and is stable the route is removed,
and the local CVM takes over all new I/O operations.
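Conceptually, the injected route behaves like the following ESXi command, which points traffic destined for the internal CVM address at a healthy CVM's external IP (the gateway address is hypothetical, and this is an illustration of the effect rather than the exact mechanism HA.py uses):
esxcfg-route -a 192.168.5.2/32 10.20.30.41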
AHV leverages iSCSI multipathing, where the primary path is the local CVM and the two other
paths are remote. In the event where the primary path fails, one of the other paths becomes
active. Similar to autopathing with ESXi and Hyper-V, when the local CVM comes back online it
takes over as the primary path.
In the event where the node remains down for a prolonged period (for example, 30 minutes),
the CVM is removed from the metadata ring. It is joined back into the ring after it has been up
and stable for a period of time.
Node Unavailability
The built-in data redundancy in a Nutanix cluster supports High Availability (HA) provided by
the hypervisor. If a node fails, all HA-protected VMs can be automatically restarted on other
nodes in the cluster.
Curator and Stargate respond to two issues that arise from the host failure:
• When the guest VM begins reading across the network, Stargate begins migrating those
extents to the new host. This improves performance for the guest VM.
• Curator responds to the host and CVM being down by instructing Stargate to create new
replicas of the missing vDisk data.
Users who are accessing HA-protected VMs will notice that their VM is unavailable while it is
restarting on the new host. Without HA, the VM needs to be manually restarted.
Depending on the cluster workload, a second host failure could leave the remaining hosts with
insufficient processing power to restart the VMs from the second host. Even in lightly loaded
clusters, the larger concern is additional risk to guest VM data. For example, if a second host/
CVM fails before the cluster heals and its physical disks are inaccessible, some VM data will be
unavailable.
Remember, with replication factor 2 (RF2, set at the storage container level) there are two copies
of all data. If two nodes go offline simultaneously, it is possible to lose both the primary and
replica copies of some data. If this is unacceptable, implement replication factor 3 at the storage
container level, or redundancy factor 3, which applies to the full cluster.
Drive Unavailability
Drives in a Nutanix node store four primary types of data:
• Storage metadata
• Oplog
• Persistent data (hot and cold tiers)
• CVM boot files
Cold-tier persistent data is stored on the hard-disk (HDD) drives of the node. Storage metadata,
oplog, hot-tier persistent data, and CVM boot files are kept in the serial AT attachment solid
state drive (SATA-SSD) in drive bay one. SSDs in a dual-SSD system are used for storage
metadata, oplog, and hot-tier persistent data, according to the replication factor of the system. CVM
boot and operating system files are stored on the first two SSD devices in a RAID-1 (mirrored)
configuration. In all-flash nodes, data of all types is stored in the SATA-SSDs.
When a boot DOM (SATA DOM for NX hardware) fails, the node will continue to operate
normally as long as the hypervisor or CVM does not reboot. After a DOM failure, the hypervisor
or CVM on that node will no longer be able to boot as their boot files reside on the DOM.
Note: The CVM restarts if a boot drive fails or if you remove a boot drive without
marking the drive for removal and the data has not successfully migrated.
Cassandra uses up to 4 SSDs to store the database that provides read and write access to
cluster metadata.
When a metadata drive fails, the local Cassandra process will no longer be able to access its
share of the database and will begin a persistent cycle of restarts until its data is available. If
Cassandra cannot restart, the Stargate process on that CVM will crash as well. Failure of both
processes results in automatic IO redirection.
During the switching process, the host with the failed SSD may report that the shared storage is
unavailable. Guest VM IO on this host will pause until the storage path is restored.
After redirection occurs, VMs can resume read and write I/O. Performance may decrease
slightly, because the I/O is traveling across the network rather than across the internal network.
Because all traffic goes across the 10 GbE network, most workloads will not diminish in a way
that is perceivable to users.
Multiple drive failures in a single selected domain (node, block, or rack) are also tolerated.
If Cassandra remains in a failed state for more than thirty minutes, the surviving Cassandra
nodes detach the failed node from the Cassandra database so that the unavailable metadata
can be replicated to the remaining cluster nodes. The process of healing the database takes
about 30-40 minutes.
If the Cassandra process restarts and remains running for five minutes, the procedure to
detach the node is canceled. If the process resumes and is stable after the healing procedure
is complete, the node will be automatically added back to the ring. A node can be manually
added to the database using the nCLI command:
ncli> host enable-metadata-store id=cvm_id
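The CVM ID for this command can be found by listing the hosts first; host list is standard nCLI, though the output columns vary by AOS version:
ncli> host list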
Each node contributes its local storage devices to the cluster storage pool. Cold-tier data is
stored in HDDs, while hot-tier data is stored in SSDs for faster performance. Data is replicated
across the cluster, so a single data drive failure does not result in data loss. Nodes containing
only SSD drives only have a hot tier.
When a data drive (HDD/SSD) fails, the cluster receives an alert from the host and immediately
begins working to create a second replica of any guest VM data that was stored on the drive.
In a cluster with replication factor 2, losing a second drive in a different domain (node, block,
or rack) before the cluster heals can result in the loss of both replicas of some VM data. Although a
single drive failure does not have the same impact as a host failure, it is important to replace the
failed drive as soon as possible.
The physical network adapters on each host are grouped together on the external network.
Unavailability of a network link is tolerated with no impact to users if multiple ports are
connected to the network.
The Nutanix platform does not leverage any backplane for internode communication. It relies on
a standard 10 GbE network.
All storage I/O for VMs running on a Nutanix node is handled by the hypervisor on a dedicated
private network. The I/O request is handled by the hypervisor, which then forwards the request
to the private IP on the local CVM. The CVM then performs the remote replication with other
Nutanix nodes using its external IP over the public 10 GbE network.
In most cases, read requests are serviced by the local node and are not routed to the 10 GbE
network. This means that the only traffic routed to the public 10 GbE network is data replication
traffic and VM network I/O. Cluster-wide tasks, disk balancing for example, generate I/O on the
10 GbE network.
Each Nutanix node is configured at the factory to use one 10 GbE port as the primary pathway
for vSwitch0. Other 10 GbE ports are configured in standby mode. Guest VM performance does
not decrease in this configuration. If a 10 GbE port is not configured as the failover path, then
traffic fails over to a 1 GbE port. This failover reduces the throughput of storage traffic and
decreases the write performance for guest VMs on the host with the failed link. Other hosts may
experience a slight decrease as well, but only on writes to extents that are stored on the host
with the link failure. Nutanix networking best practices recommend removing 1 GbE ports from
each host’s network configuration.
If both 10 GbE links are down, then the host will fail over to a 1 GbE port if it is configured as
a standby interface. This failover reduces the throughput of storage traffic and decreases the
write performance for guest VMs on the host with the failed link. Other hosts may experience
a slight decrease as well, but only on writes to extents that are stored on the host with the link
failure.
Redundancy factor 3 (RF3) is a configurable option that allows a Nutanix cluster to withstand
the failure of two nodes or drives in different blocks. This is configured by navigating to the
gear button in Prism Element, then Redundancy State. From the drop-down menu on this page,
you can modify the redundancy factor configuration.
Note: If a cluster is set to RF2 it can be converted to RF3 if sufficient nodes are
present. Increasing the cluster RF level consumes 66% of the cluster’s storage vs
50% for RF2.
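You can also check the current and desired redundancy factor from the nCLI. The following is a sketch only; the command name is an assumption that should be confirmed against the nCLI reference for your AOS version:
ncli> cluster get-redundancy-state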
In the event of a Metadata drive failure: If any metadata drive fails on a host, the Controller
VM restarts. Once the Cassandra process restarts, the missing metadata is retrieved from the
other Controller VMs and sharded across the remaining metadata drives. When the faulty drive
recovers or is replaced, metadata is stored on that drive again. Performance may decrease
slightly for user VMs on the host due to the reboot and the fact that some I/O is traveling
across the network. However, most workloads should not diminish in a way that is perceivable
to users (other than during the reboot).
RF3 features
• At least one copy of all guest VM data plus the oplog is available if two nodes fail.
• The cluster maintains five copies of metadata and five copies of configuration data.
RF3 requirements
• For guest VMs to tolerate the simultaneous failure of two nodes or drives in different blocks,
the data must be stored on storage containers with RF3.
• The CVM must be configured with enough memory to support RF3.
Stargate is responsible for placing data across blocks, and Curator makes data placement
requests to Stargate to maintain block fault tolerance.
New and existing clusters can reach a block fault tolerant state. New clusters can be block fault
tolerant immediately after being created if the configuration supports it. Existing clusters that
were not previously block fault tolerant can be made tolerant by reconfiguring the cluster in a
manner that supports block fault tolerance.
New data in a block fault tolerant cluster is placed to maintain block fault tolerance. Existing
data that was not in a block fault tolerant state is scanned by Curator and moved into a block
fault tolerant placement.
Depending on the volume of data that needs to be relocated, it might take Curator several
scans over a period of hours to distribute data across the blocks.
Block fault tolerant data placement is on a best effort basis but is not guaranteed. Conditions
such as high disk usage between blocks may prevent the cluster from placing guest VM
redundant copy data on other blocks.
Redundant copies of guest VM data are written to nodes in blocks other than the block that
contains the node where the VM is running. The cluster keeps two copies of each write stored
in the oplog.
The Nutanix Medusa component uses Cassandra to store metadata. Cassandra uses a ring-
like structure where data is copied to peers within the ring to ensure data consistency and
availability. The cluster keeps at least three redundant copies of the metadata, at least half of
which must be available to ensure consistency.
With block fault tolerance, the Cassandra peers are distributed among the blocks to ensure that
no two peers are on the same block. In the event of a block failure, at least two copies of the
metadata are present in the cluster.
• Network partition, where one of the racks becomes inaccessible from the other racks
When rack fault tolerance is enabled, guest VMs can continue to run despite the failure of one
rack (RF2) or two racks (RF3). The redundant copies of guest VM data and metadata exist on
other racks when one rack fails.
With replication factor 3 and a minimum of 5 nodes, 5 blocks, and 5 racks*, the cluster provides
data resiliency against the simultaneous failure of 2 nodes, 2 blocks, 2 racks, or 2 disks. These
values illustrate the level of data resiliency (simultaneous failure) provided for a given combination
of replication factor, minimum number of nodes, minimum number of blocks, and minimum
number of racks.
* Erasure Coding with Rack Awareness - Erasure coding is supported on a rack-aware cluster.
You can enable erasure coding on new containers in rack-aware clusters provided these
minimums are met.
Note: Rack Fault Tolerance is supported for AHV and ESXi only.
HA can ensure sufficient cluster resources are available to accommodate the migration of VMs
in case of node failure.
The Acropolis Master tracks node health by monitoring connections on all cluster nodes. When
a node becomes unavailable, Acropolis Master restarts all the VMs that were running on that
node on another node in the same cluster.
The Acropolis Master detects failures due to VM network isolation, which is signaled by a failure
to respond to heartbeats.
HA Configuration Options
• Reserved segments. On each node, some memory is reserved for failover of virtual machines
from a failed node. The Acropolis service calculates the amount of memory to reserve across
the cluster based on the virtual machine memory configuration. AHV marks all nodes as
schedulable, with resources available for running VMs.
• Best effort (not recommended). No node or memory reservations are made in the cluster. In
case of a failure, virtual machines are moved to other nodes based on the resources and
memory available on those nodes. This is not the preferred method: if no resources are
available on the cluster, some virtual machines may not be powered on.
• Reserved host (only available via aCLI and not recommended). A full node is reserved for
HA of VMs in case of a node failure. During normal operation of the cluster, virtual machines
cannot be run, powered on, or migrated to the reserved node. This mode only works if all the
nodes in the cluster have the same amount of memory.
High Availability
The built-in data redundancy in a Nutanix cluster supports high availability provided by the
hypervisor. If a node fails, all HA-protected VMs can be automatically restarted on other
nodes in the cluster. Virtualization management VM high availability may implement admission
control to ensure that, in case of node failure, the rest of the cluster has enough resources to
accommodate the VMs. The hypervisor management system selects a new host for the VMs
that may or may not contain a copy of the VM data.
If the data is stored on a node other than the VM’s new host, then read requests are sent across
the network. As remote data is accessed, the remote data is migrated to storage devices on the
current host so that future read requests can be local. Write requests are sent to local storage
and replicated on a different host. During this interaction, the Nutanix software also creates new
copies of preexisting data to protect against future node or disk failures.
VM Anti-Affinity policy
This policy prevents virtual machines from running on the same node. The policy forces VMs to
run on separate nodes so that application availability is not affected by node failure. This policy
does not prevent the Acropolis Dynamic Scheduling (ADS) feature from taking necessary action
in case of resource constraints.
Note: Currently, you can only define VM-VM anti-affinity policy by using aCLI. For
more information, see Configuring VM-VM Anti-Affinity Policy.
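A minimal aCLI sketch of defining a VM-VM anti-affinity group is shown below. The group and VM names are placeholders; confirm the exact syntax in the AHV Administration Guide for your release:
nutanix@cvm$ acli vm_group.create app-group
nutanix@cvm$ acli vm_group.add_vms app-group vm_list=vm1,vm2
nutanix@cvm$ acli vm_group.antiaffinity_set app-group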
Note: Anti-Affinity policy is applied during the initial placement of VMs (when a VM
is powered on). Anti-Affinity policy can be overridden by manually migrating a VM
to the same host as its opposing VM, when a host is put in maintenance mode, or
during an HA event. ADS will attempt to resolve any anti-affinity violations when they
are detected.
The VM-host affinity policy controls the placement of VMs. Use this policy to specify that
a selected VM can only run on the members of the affinity host list. This policy checks and
enforces where a VM can be hosted when you power on or migrate the VM.
Note: If you apply a VM-host affinity policy, it limits Acropolis HA and Acropolis
Dynamic Scheduling (ADS) in such a way that a virtual machine cannot be powered
on or migrated to a host that does not conform to the requirements of the affinity
policy, because this policy is mandatorily enforced.
Note: Select at least two hosts when creating a host affinity list to protect against
downtime in the case of a node failure. This configuration is always enforced; VMs
will not be moved from the hosts specified here, even in the case of an HA event.
Watch the following video to learn more about Nutanix affinity rules: https://youtu.be/
rfHR93RFuuU.
• You cannot remove the VM-host affinity for a powered on VM from Prism. You can use
the vm.affinity_unset vm_list aCLI command to perform this operation.
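Conversely, a hedged aCLI sketch of setting VM-host affinity is shown below. The VM name and host addresses are placeholders; verify the exact parameters for your AOS version:
nutanix@cvm$ acli vm.affinity_set vm1 host_list=host-ip-1,host-ip-2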
Labs
1. Failing a Node - VM High Availability
Module
11
DATA PROTECTION
Overview
After completing this module, you will be able to:
Disaster Recovery (DR) is an area of failover planning that aims to protect an organization
from the effects of significant negative events. DR allows an organization to maintain or quickly
resume mission-critical functions following a disaster.
• RPO is the maximum tolerated period of time for which data can be lost due to a disruption
without exceeding the allowable threshold.
• RPO designates the variable amount of data that will be lost or will have to be re-entered
during network downtime.
Example: If the data snapshot interval and the RPO is 180 minutes, and the outage lasts only
2 hours, you’re still within the parameters that allow for recovery and business processes to
proceed given the volume of data lost during the disruption.
How much time does it take to recover after notification of business process disruption?
• RTO is therefore the duration of time and a service level within which a business process
must be restored after a disaster in order to avoid unacceptable consequences associated
with a break in continuity.
• RTO designates the amount of “real time” that can pass before the disruption begins to
seriously and unacceptably impede the flow of normal business operations.
Local Replication
Remote Replication
• Time Stream and Cloud: High RPO and RTO (hours) should be used for minor incidents.
• Synchronous and asynchronous: (near)-zero RPO and RTO should be used for major
incidents.
Time Stream
A time stream is a set of snapshots that are stored on the same cluster as the source VM or
volume group. Time stream is configured as an async protection domain without a remote site.
The Time Stream feature in Nutanix Acropolis gives you the ability to:
When a snapshot of a VM is initially taken on the Nutanix Enterprise Cloud Platform, the system
creates a read only, zero-space clone of the metadata (index to data) and makes the underlying
VM data immutable or read only. No VM data or virtual disks are copied or moved. The system
creates a read-only copy of the VM that can be accessed like its active counterpart.
Nutanix snapshots take only a few seconds to create, eliminating application and VM backup
windows.
Nutanix Guest Tools (NGT) is a software bundle that can be installed on a guest virtual machine
(Microsoft Windows or Linux). It is a software-based in-guest agent framework that enables
advanced VM management functionality through the Nutanix platform.
The solution is composed of the NGT installer which is installed on the VMs and the Guest Tools
Framework which is used for coordination between the agent and Nutanix platform.
• Nutanix Guest Agent (NGA) service. Communicates with the Nutanix Controller VM.
• File Level Restore CLI. Performs self-service file-level recovery from the VM snapshots.
• Nutanix VM Mobility Drivers. Provides drivers that facilitate VM migration between ESXi
and AHV, in-place hypervisor conversion, and cross-hypervisor disaster recovery (CHDR)
features.
• VSS requestor and hardware provider for Windows VMs. Enables application-consistent
snapshots of AHV or ESXi Windows VMs.
• Guest Tools Service: Gateway between the Acropolis and Nutanix services and the Guest
Agent. Distributed across CVMs within the cluster with an elected NGT Master which runs on
the current Prism Leader (hosting cluster vIP)
• Guest Agent: Agent and associated services deployed in the VM's OS as part of the NGT
installation process. Handles any local functions (e.g. VSS, Self-service Restore (SSR), etc.)
and interacts with the Guest Tools Service.
The Guest Agent Service communicates with Guest Tools Service via the Nutanix Cluster IP
using SSL. For deployments where the Nutanix cluster components and UVMs are on a different
network (hopefully all), ensure that the following are possible:
• Create a firewall rule (and associated NAT) from UVM network(s) allowing communication
with the Cluster IP on port 2074 (preferred)
The Guest Tools Service acts as a Certificate Authority (CA) and is responsible for generating
certificate pairs for each NGT enabled UVM. This certificate is embedded into the ISO which is
configured for the UVM and used as part of the NGT deployment process. These certificates are
installed inside the UVM as part of the installation process.
Protection Domains
Concepts
Terminology
Protection Domain
Protection Domain (PD) is a defined group of entities (VMs, files and Volume Groups) that are
always backed up locally and optionally replicated to one or more remote sites.
An async DR protection domain supports backup snapshots for VMs and volume groups. A
metro availability protection domain operates at the storage container level.
A protection domain can use one of two replication engines depending on the replication
frequency that is defined when the protection domain is created. For 1 to 15 minute RPO,
NearSync will be used for replication. For 60 minutes and above, async DR will be used.
Metro Availability Protection Domain
Active local storage container linked to a standby container at a remote site. Local and remote
containers will have the same name. Containers defined in a Metro Availability Protection
Domain are synchronously replicated to a remote container of the same name.
Consistency Group
Schedule
A schedule is a PD property that specifies snapshot intervals and snapshot retention. Retention
can be set differently for local and remote snapshots.
Snapshot
Read-only copy of the data and state of a VM, file or Volume Group at a specific point in time.
• Because restoring a VM does not allow for VMX editing, VM characteristics such as MAC
addresses may be in conflict with other VMs in the cluster
• You cannot make snapshots of entire file systems (beyond the scope of a VM) or containers
• You cannot include Volume Groups (Nutanix Volumes) in a protection domain configured for
Metro Availability
• Keep consistency groups as small as possible, typically at the application level. Note that
when using application consistent snapshots, it is not possible to include more than one VM
in a consistency group.
• Do not deactivate and then delete a protection domain that contains VMs
• If you want to enable deduplication on a container with protected VMs that are replicated to
a remote site, wait to enable deduplication until:
- Both sites are upgraded to a version that supports capacity tier deduplication.
• Active: Manages volume groups and live VMs. Makes, replicates, and expires snapshots.
Note: For a list of guidelines when configuring async DR protection domains, please
see the Async DR Protection Domain Configuration section of the Prism Web
Console Guide on the Nutanix Support Portal.
After a protection domain is replicated to at least one remote site, you can carry out a planned
migration of the contained entities by failing over the protection domain. You can also trigger
failover in the event of a site disaster.
Failover and failback events re-create the VMs and volume groups at the other site, but the
volume groups are detached from the iSCSI initiators to which they were attached before the
event. After the failover or failback event, you must manually reattach the volume groups to
iSCSI initiators and rediscover the iSCSI targets from the VMs.
Disaster recovery configurations which are created with Prism Element use protection domains
and optional third-party integrations to protect VMs, and they replicate data between on-
premises Nutanix clusters. Protection domains provide limited flexibility in terms of supporting
operations such as VM boot order and require you to perform manual tasks to protect new VMs
as an application scales up.
You can use Leap between two physical data centers or between a physical data center and Xi
Cloud Services. Leap works with pairs of physically isolated locations called availability zones.
One availability zone serves as the primary location for an application while a paired availability
zone serves as the recovery location. While the primary availability zone is an on-premises
Prism Central instance, the recovery availability zone can be either on-premises or in Xi Cloud
Services.
Configuration tasks and disaster recovery workflows are largely the same regardless of whether
you choose Xi Cloud Services or an on-premises deployment for recovery.
Availability Zone
An availability zone is a location to which you can replicate the data that you want to protect.
It is represented by a Prism Central instance to which a Nutanix cluster is registered. To ensure
availability, availability zones must be physically isolated from each other.
• Xi Cloud Services. If you choose to replicate data to Xi Cloud Services, the on-premises
Prism Central instance is paired with a Xi Cloud Services account, and data is replicated to Xi
Cloud Services.
• Physical Datacenter. If you choose to back up data to a physical datacenter, you must
provide the details of a Prism Central instance running in a datacenter that you own and that
is physically isolated from the primary availability zone.
Availability zones in Xi Cloud Services are physically isolated from each other to ensure that a
disaster at one location does not affect another location. If you choose to pair with a physical
datacenter, the responsibility of ensuring that the paired locations are physically isolated lies
with you.
Primary Availability Zone
The availability zone that is primarily meant to host the VMs you want to protect.
Recovery Availability Zone
The availability zone that is paired with the primary availability zone, for recovery purposes.
This can be a physical datacenter or Xi Cloud Services.
License Requirements
For disaster recovery between on-premises clusters and Xi Cloud Services, it is sufficient to use
the AOS Starter license on the on-premises clusters.
For disaster recovery between on-premises clusters, the license requirement depends on the
Leap features that you want to use. For information about the features that are available with
an AOS license, see Software Options.
• On-premises Nutanix clusters and the Prism Central instance with which they are registered
must be running AOS 5.10 or later.
• The on-premises clusters must be running the version of AHV that is bundled with the
supported version of AOS.
• On-Premises clusters registered with the Prism Central instance must have an external IP
address.
• The cluster on which the Prism Central instance is hosted must meet the following
requirements:
- The cluster must have an iSCSI data services IP address configured on it.
- The cluster must also have sufficient memory to support a hot add of memory to all
Prism Central nodes when you enable Leap. A small Prism Central instance (4 vCPUs, 16
GB memory) requires a hot add of 4 GB and a large Prism Central VM (8 vCPUs, 32 GB
memory) requires a hot add of 8 GB. If you have enabled Nutanix Flow, an additional 1 GB
must be hot-added to each Prism Central instance.
• A single-node Prism Central instance must have a minimum of 8 vCPUs and 32 GB memory.
• Each node in a scaled-out Prism Central instance must have a minimum of 4 vCPUs and 16
GB memory.
• The Prism Central VM must not be on the same network as the protected user VMs. If
present on the user VM network, the Prism Central VM becomes inaccessible when the route
to the network is removed following failover.
• Do not uninstall the Nutanix VM Mobility drivers from the VMs; VMs become unusable after
migration if the drivers are removed.
Networking Requirements
Static IP address preservation refers to maintaining the same IP address in the destination. The
considerations to achieve this are as follows:
• The VMs must have Nutanix Guest Tools (NGT) installed on them.
• For an unplanned failover, if the snapshot used for restoration does not have an empty CD-
ROM slot, the static IP address is not configured on that VM.
• For a planned failover, if the latest state of the VM does not have an empty CD-ROM slot, the
static IP address is not configured on that VM after the failover.
• Linux VMs must have the NetworkManager command-line tool (nmcli) installed on them. The
version of nmcli must be 0.9.10.0 or later (see the verification example after this list).
• If you select a non-IPAM network in a VPC in Xi Cloud Services, the gateway IP address and
prefix fields are not auto-populated, and you must manually specify these values.
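The nmcli version requirement above can be checked from inside a Linux guest VM, for example:
$ nmcli --version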
Requirements for Static IP Address Mapping between Source and Target Virtual Networks
If you want to map static IP addresses between source and target virtual networks, make sure that the following requirements are met:
• Make sure that a free CD-ROM is available on each VM. The CD-ROM is required for
mounting NGT at the remote site after failover.
• Make sure that the guest VMs can reach the Controller VM from both availability zones.
You must design the virtual subnets that you plan to use for disaster recovery at the recovery
availability zone so that they can accommodate the VMs.
• Make sure that any virtual network intended for use as a recovery virtual network meets the
following requirements:
- The network prefix is the same as that of the source virtual network. For example, if the
source network address is 192.0.2.0/24, the network prefix of the recovery virtual network
must also be 24.
- The gateway IP address offset is the same as that in the source network. For example,
if the gateway IP address in the source virtual network 192.0.2.0/24 is 192.0.2.10, the last
octet of the gateway IP address in the recovery virtual network must also be 10.
• If you want to specify a single cluster as a target for recovering VMs from multiple source
clusters, make sure that the number of virtual networks on the target cluster is equal to the
sum of the number of virtual networks on the individual source clusters. For example, if there
are two source clusters, with one cluster having m networks and the other cluster having n
networks, make sure that the target cluster has m + n networks. Such a design ensures that
all migrated VMs can be attached to a network.
• It is possible to test failover and failback between physical clusters. To perform test runs
without affecting production, prepare test networks at both the source and destination sites.
Then, when testing, attach your test VMs to these networks.
• After you migrate VMs to Xi Cloud Services, make sure that the router in your data center
stops advertising the subnet in which the VMs were hosted.
Remote and branch offices often require different approaches to IT infrastructure from capex,
power, and space perspectives, as well as opex constraints and the skills required on-site to
manage and maintain them.
The Nutanix Enterprise Cloud is a powerful converged compute and storage system that offers
one-click simplicity and high availability for remote and branch offices. This makes deploying
and operating remote and branch offices as easy as deploying to the public cloud, but with
control and security on your own terms. Picking the right solution always involves trade-offs.
While a remote site is not your datacenter, uptime is nonetheless a crucial concern. Financial
constraints and physical layout also affect what counts as the best architecture for your
environment.
Three-node (or more) clusters are the gold standard for ROBO deployments. They provide data
protection by always committing two copies of your data, keeping data safe during failures, and
automatically rebuilding data within 60 seconds of a node failure.
Nutanix recommends designing three-node clusters with enough capacity to recover from the
failure of a single node. For sites with high availability requirements or which are difficult to
visit, additional capacity above the n+1 node counts is recommended.
Three-node clusters can scale up to eight nodes with 1 Gbps networking, and up to any scale
when using 10 Gbps and higher networking.
Two-node clusters offer reliability for smaller sites while also being cost effective. A Witness
VM is required for two-node clusters only and is used only for failure scenarios to coordinate
rebuilding data and automatic upgrades. You can deploy the witness offsite up to 500 ms away
for ROBO. Multiple clusters can use the same witness for two-node configurations. Nutanix
supports two-node clusters with ESXi and AHV only.
One-node clusters are recommended for low availability requirements coupled with strong
overall management for multiple sites. Note that a one-node cluster provides resiliency against
the loss of a single hard drive. Nutanix supports one-node clusters with ESXi and AHV only.
Hypervisor
The three main considerations for choosing the right hypervisor for your ROBO environment
are supportability, operations, and licensing costs.
With Nutanix Acropolis, VM placement and data placement occurs automatically. Nutanix
also hardens systems by default to meet security requirements and provides the automation
necessary to maintain that security. Nutanix supplies STIGs (Security Technical Implementation
Guides) in machine-readable form for both AHV and the storage controller.
For environments that do not want to switch hypervisors in the main datacenter, Nutanix offers
cross-hypervisor disaster recovery to replicate VMs from AHV to ESXi or ESXi to AHV. In the
event of a disaster, administrators can restore their AHV VM to ESXi for quick recovery or
replicate the VM back to the remote site with easy workflows.
Prism Central also provides network visualization, allowing you to troubleshoot basic networking
issues right from the same dashboard. With the scale-out capabilities added to the control plane,
it is possible to centrally manage as many as 25,000 VMs or more.
Prism Element
Prism Element is a management interface native to the platform for every Nutanix cluster
deployed. Because Prism Element manages only the cluster it is part of, each Nutanix cluster
in a deployment has a unique Prism Element instance for management. Multiple clusters are
managed via Prism Central.
Prism Central
• Small environments: For fewer than 2,500 VMs, size Prism Central to 4 vCPUs, 12 GB of
memory, and 500 GB of storage.
• Large environments: For up to 12,000 VMs, size Prism Central to 8 vCPUs, 32 GB of RAM,
and 2,500 GB of storage.
• If installing on Hyper-V, use the SCVMM library on the same cluster to enable fast copy. Fast
copy improves the deployment time.
Each node registered to and managed by Prism Pro requires you to apply a Prism Pro license
through the Prism Central web console. For example, if you have registered and are managing
10 Nutanix nodes (regardless of the individual node or cluster license level), you need to apply
10 Prism Pro licenses through the Prism Central web console.
Nutanix offers an integrated solution for local on-site backups and replication for central
backup and disaster recovery. The powerful Nutanix Time Stream capability allows unlimited
VM snapshots to be created on a local cluster for faster RPO and RTO and rapidly restore state
when required. Using Prism, administrators can schedule local snapshots and replication tasks
and control retention policies on an individual snapshot basis. An intuitive snapshot browser
allows administrators to quickly see local and remote snapshots and restore or retrieve a saved
snapshot or a specific VM within a snapshot with a single click. Snapshots are differential and
de-duplicated, hence backup and recovery is automatically optimized, allowing DR and remote
backups to be completed efficiently, for different environments.
• Backup – Provides local snapshot/restore at the ROBO site as well as remote snapshot/
restore to the main data center.
• Disaster Recovery – Provides snapshot replication to the main data center with automatic
failover in the event of an outage.
There are several requirements when setting up a Witness VM. The minimum requirements are:
• 2 vCPUs
• 6 GB of memory
• 25 GB of storage
The Witness VM must reside in a separate failure domain. This means the witness and all
two-node clusters must have independent power and network connections. We recommend
locating the witness VM in a third physical site with dedicated network connections to all sites
to avoid single points of failure.
Communication with the witness happens over port TCP 9440. This port must be open for the
CVMs on any two-node clusters using the witness.
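A simple way to confirm that TCP port 9440 is reachable from a CVM is a generic connection test such as the following. The Witness IP is a placeholder, and this is not a Nutanix-specific tool, just a basic reachability check:
nutanix@cvm$ curl -kv https://witness_vm_ip:9440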
Network latency between each two-node cluster and the Witness VM must be less than 500 ms
for ROBO.
The Witness VM may reside on any supported hypervisor and run on Nutanix or non-Nutanix
hardware. You can register multiple two-node clusters to a single Witness VM.
Node Failure
When a node goes down, the live node sends a leadership request to the Witness VM and goes
into single-node mode. In this mode RF2 is still retained at the disk level, meaning data is copied
to two disks. (Normally, RF2 is maintained at the node level, meaning data is copied to each
node.)
If one of the two metadata SSDs fails while in single-node mode, the cluster (node) goes into
read-only mode until a new SSD is picked for metadata service. When the node that was down
is back up and stable again, the system automatically returns to the previous state (RF2 at the
node level). No user intervention is necessary during this transition.
When the network connection between the nodes fails, both nodes send a leadership request
to the Witness VM. Whichever node gets the leadership lock stays active and goes into single-
node mode. All operations and services on the other node are shut down, and the node goes
into a waiting state. When the connection is re-established, the same recovery process as in the
node failure scenario begins.
When the network connection between a single node (Node A in this example) and the Witness
fails, an alert is generated that Node A is not able to reach the Witness. The cluster is otherwise
unaffected, and no administrator intervention is required.
Witness VM Failure
When the Witness goes down (or the network connections to both nodes and the Witness fail),
an alert is generated but the cluster is otherwise unaffected. When connection to the Witness
is re-established, the Witness process resumes automatically. No administrator intervention is
required.
If the Witness VM goes down permanently (unrecoverable), follow the steps for configuring a
new Witness through the Configure Witness option of the Prism web console as described in
the Configuring a Witness (two-node cluster) topic on the Nutanix Support Portal.
When a complete network failure occurs (no connections between the nodes or the Witness),
the cluster becomes unavailable. Manual intervention is needed to fix the network. While
the network is down (or when a node fails and the other node does not have access to the
Witness), you have the option to manually elect a leader and run in single-node mode. To
manually elect a leader, do the following:
1. Log in using SSH to the Controller VM for the node to be set as the leader and enter the
following command:
nutanix@cvm$ cluster set_two_node_cluster_leader
Run this command on just the node you want to elect as the leader. If both nodes are
operational, do not run it on the other node.
2. Remove (unconfigure) the current Witness and reconfigure with a new (accessible) Witness
when one is available as described in the Configuring a Witness (two-node cluster) topic on
the Nutanix Support Portal.
Seeding
When dealing with a remote site that has a limited network connection back to the main
datacenter, it may be necessary to seed data to overcome network speed deficits. You may
also need to seed data if systems were imaged with Foundation at a main site and shipped to a
remote site without data, but that data is required at a later date.
Seeding involves using a separate device to ship the data to the remote location. Instead of
replication taking weeks or months, depending on the amount of data you need to protect, you
can copy the data locally to a separate Nutanix node and then ship it to your remote site.
Nutanix checks the snapshot metadata before sending the device to prevent unnecessary
duplication. Nutanix can apply its native data protection to a seed cluster by placing VMs in a
protection domain and replicating them to a seed cluster. A protection domain is a collection
of VMs that have a similar recovery point objective (RPO). You must ensure, however, that the
seeding snapshot doesn’t expire before you can copy the data to the final destination.
Note: For more information, please see the ROBO Deployment and Operations
Guide on the Nutanix Support Portal.
During this procedure, the administrator stores a snapshot of the VMs on the seed cluster while
it’s installed in the ROBO site, then physically ships it to the main datacenter.
2. Create a protection domain called PD1 on the ROBO cluster for the VMs and volume
groups.
3. Create an out-of-band snapshot (S1) for the protection domain on ROBO with no
expiration.
4. Create an empty protection domain called PD1 (same name used in step 2) on the seed
cluster.
6. Create remote sites on the ROBO cluster and the seed cluster.
7. Retrieve snapshot S1 from the ROBO cluster to the seed cluster (using Prism on the seed
cluster).
10. Create remote sites on the ROBO cluster and on the datacenter main cluster (DC1).
13. Retrieve S1 from the seed cluster to DC1 (using Prism on DC1). Prism generates an alert
here; although it appears to be a full data replication, only metadata is transferred from the
seed cluster.
15. Set up a replication schedule for PD1 on the ROBO cluster in Prism.
16. Once the first scheduled replication finishes, you can delete snapshot S1 to reclaim space.
Labs
1. Creating protection domains and local VM restore
5. Performing VM migration
6. Migrating back to primary
Module
12
PRISM CENTRAL
Overview
In the Managing a Nutanix Cluster module, you learned how to use Prism Element to configure a
cluster and set up Pulse and alerts. In this module you'll learn how to:
• Describe Prism Central.
Prism Central allows you to monitor and manage all Nutanix clusters from a single GUI:
• Central dashboard for clusters, VMs, hosts, disks, and storage with drill-down for detailed
information.
• Multi-Cluster analytics
• Multi-Cluster alerts summary with drill-down for possible causes and corrective actions.
First, you must deploy an instance of Prism Central into your environment.
Once you have Prism Central deployed, you need to connect all of your clusters to Prism
Central.
Deployment Methods
You can deploy a Prism Central VM using the "1-click" method. This method employs the Prism
web console from a cluster of your choice and creates the Prism Central VM in that cluster.
The "1-click" method is the easiest method to deploy Prism Central in most cases. However, you
cannot use this method when:
• The target cluster runs Hyper-V or Citrix Hypervisor (or mixed hypervisors).
• You are deploying from a cluster that does not have Internet access (also known as a dark site).
If you have never logged into Prism Central as the user admin, you need to log on and change
the password before attempting to register a cluster with Prism Central.
Open ports 9440 and 80 (both directions) between the Prism Central VM, all Controller VMs,
and the cluster virtual IP address in each registered cluster.
A cluster can register with just one Prism Central instance at a time. To register with a different
Prism Central instance, first unregister the cluster.
Note: See KB 4944 for additional details if you have enabled Prism Self Service,
Calm, or other special features in Prism Central.
Customizable Dashboards
The custom dashboard feature allows you to build a dashboard based on a collection of fixed
and customizable widgets. You can arrange the widgets on the screen to create exactly the
view into the environment that works best for you. A dashboard’s contents can range from a
single widget to a screen full of widgets.
Prism Pro comes with a default dashboard offering a view of capacity, health, performance, and
alerts that should be ideal for most users and a good starting point for others. The customizable
widgets allow you to display top lists, alerts, and analytics.
Note: Prism Pro allows you to create dashboards using fixed and customizable
widgets.
Scheduled Reporting
Reports can provide information to the organization that is useful at all levels, from operations
to leadership. A few common good use cases include:
• Inventory: Produces a list of physical clusters, nodes, VMs, or other entities within an
environment
The reporting feature within Prism Pro allows you to create both scheduled and as-needed
reports. Prism Pro includes a set of customizable predefined reports, or you can create new
reports using a built-in WYSIWYG (what you see is what you get) editor. In the editor, simply
select data points and arrange them in the desired layout to create your report. The ability to
group within reports can help you get a global view of a given data point or allow you to look at
entities by cluster. Once you have created reports, they can be run either on an as-needed basis
or by setting them to run on a schedule. Configure each report to retain a certain number of
copies before the system deletes the oldest versions. To access reports, choose the report, then
select the version you wish to view. You can either view the report within Prism or via email, if
you have configured the report to send copies to a recipient list.
Dynamic Monitoring
The system learns the behavior of each VM and establishes a dynamic threshold as a
performance baseline for each resource assigned to that VM.
Dynamic monitoring uses VM behavioral learning powered by the Nutanix Machine Learning
Engine (X-Fit) technology to build on VM-level resource monitoring. Each resource chart
represents the baseline as a blue shaded range. If a given data point for a VM strays outside the
baseline range (higher or lower), the system detects an anomaly and generates an alert. The
anomaly appears on the performance charts for easy reference and follow-up.
If the data point’s anomalous results persist over time, the system learns the new VM behavior
and adjusts the baseline for that resource. With behavioral learning, performance reporting
helps you better understand your workloads and have early knowledge of issues that traditional
static threshold monitoring would not otherwise discover.
Dynamic monitoring is available for both VMs and physical hosts and encompasses multiple
data points within CPU, memory, storage, and networking.
Capacity Runway
Capacity planning focuses on the consumption of three resource categories within a Nutanix
cluster: storage capacity, CPU, and memory.
Capacity results appear as a chart that shows the historical consumption for the metric along
with the estimated capacity runway. The capacity runway is the number of days remaining
before the resource item is fully consumed. The Nutanix X-Fit algorithms perform capacity
calculations based on historical data. Prism Pro initially uses 90 days of historical data from
each Prism Element instance, then continues to collect additional data to use in calculations.
Prism Pro retains capacity data points longer than Prism Element, allowing organizations to
study a larger data sample.
The X-Fit method considers resources consumed and the rate at which the system consumes
additional amounts in the calculations for runway days remaining. Storage calculations factor
the amounts of live usage, system usage, reserved capacity, and snapshot capacity into runway
calculations. Storage capacity runway is aware of containers, so it can calculate capacity when
multiple containers that are growing at different rates consume a single storage pool. Container
awareness allows X-Fit to create more accurate runway estimates.
Note:
The Capacity Runway tab allows you to view a summary of the resource runway
information for the registered clusters and access detailed runway information
about each cluster. Capacity runway calculations include data from live usage,
system usage, reserved capacity, and snapshot capacity.
Creating a Scenario
Anticipating future resource needs can be a challenging task. To address this task, Prism
Central provides an option to create "what if" scenarios that assess the resource requirements
for possible future workloads. This allows you to evaluate questions like
• If I need a new database server in a month, does the cluster have sufficient resources to
handle that increased load?
• If I create a new cluster for a given set of workloads, what kind of cluster do I need?
You can create various "what if" scenarios to answer these and other questions. The answers
are derived by applying industry standard consumption patterns to the hypothetical workloads
and current consumption patterns for existing workloads.
The VM efficiency features in Prism Pro recommend VMs within the environment that are
candidates for reclaiming unused resources that you can then return to the cluster.
Candidate types:
• Overprovisioned
• Inactive
• Constrained
• Bully
• Constrained: A constrained VM is one that does not have enough resources for the demand
and can lead to performance bottlenecks. A VM is considered constrained when it exhibits
one or more of the following baseline values, based on the past 30 days:
- CPU usage > 90% (moderate) or > 95% (severe)
- CPU ready time > 5% (moderate) or > 10% (severe)
- Memory usage > 90% (moderate) or > 95% (severe)
- Memory swap rate > 0 Kbps (no moderate value)
• Bully: A bully VM is one that consumes too many resources and causes other VMs to starve.
A VM is considered a bully when it exhibits one or more of the following conditions for over
an hour: CPU ready time > 5%, memory swap rate > 0 Kbps, host I/O Stargate CPU usage >
85%.
The lists of candidates show the total amount of CPU and memory configured versus peak
amounts of CPU and memory used for each VM. The overprovisioned and inactive categories
provide a high-level summary of potential resources that can be reclaimed from each VM.
Prism Pro calculates the number, type, and configuration of nodes recommended for scaling to
provide the days of capacity requested.
You can model adding new workloads to a cluster and how those new workloads may affect
your capacity.
Capacity Planning
The Capacity Runway tab can help you understand how many days of resources you have left.
For example, it can show how expanding an existing workload or adding new workloads to a
cluster may affect resources.
When you can’t reclaim enough resources, or when organizations need to scale the overall
environment, the capacity planning function can make node-based recommendations. These
node recommendations use the X-Fit data to account for consumption rates and growth and
meet the target runway period. Setting the runway period to 180 days causes Prism Pro to
calculate the number, type, and configuration of nodes recommended to provide the 180 days
of capacity requested.
As part of the capacity planning portion of Prism Pro, you can model adding new workloads to
a cluster and how those new workloads may affect your capacity. The Nutanix Enterprise Cloud
uses data from X-Fit and workload models that have been carefully curated over time through
our Sizer application to inform capacity planning. The add workload function allows you to add
various applications for capacity planning.
• SQL Server: Size database workload based on different workload sizes and database types
• VMs: Enables growth modeling by specifying a generic VM size to model or by selecting
existing VMs on a cluster to model.
- This is helpful when planning to scale a specific application already running on the cluster
• VDI: Provides options to select broker technology, provisioning method, user type, and
number of users
• Splunk: Size based on daily index size, hot and cold retention times, and number of search
users
• XenApp: Similar to VDI; size server-based computing with data points for broker types,
server OS, provisioning type, and concurrent user numbers
• Percentage: Allows modeling that increases or decreases capacity demand for the cluster
The figure below captures an example of this part of the modeling process.
This centralized upgrade approach provides a single point from which you can monitor status
and alerts as well as initiate upgrades. Currently, multiple cluster upgrades are only available
for AOS software. One-click upgrades of the hypervisor and firmware are still conducted at the
cluster level.
Although there are - and will be in the future - more Test Drive options to choose from, our
focus is on those sessions that are related to Prism Central and Calm.
All shorter labs should be performed individually at first, using the default Guided Tour option.
Run through them at your own pace to become more familiar with the topic and the Test Drive
interface which is always available to you through http://www.nutanix.com.
Find Inefficiencies lab: Main menu > My Operations > Find Inefficiencies - Guided Tour [Prism
Central > VM Efficiency]
Plan for the Future lab: Main menu > My Operations > Plan for the Future - Guided Tour [Prism
Central > Runway+Scenarios]
Automated Response lab: Main menu > My Operations > Automated Response - Guided Tour
[Prism Central > Playbooks]
Deploy Applications lab: Main menu > My Applications > Deploy Applications - Guided Tour
[Prism Central > Calm Setup + Blueprint]
Create App Blueprints lab: Main menu > My Applications > Create App Blueprints - Guided
Tour [Prism Central > Calm Blueprint edits + 30 secs video]
Empower Users lab: Main menu > My Applications > Empower Users - Guided Tour [Prism
Central > Calm Blueprint Marketplace]
60-minute lab using Prism Central (Pro license) to check for VM efficiency and anomalies, plan
your cluster’s future capacity needs, check the cluster’s “Runway”, and learn about automating
operations using Nutanix X-Play.
60-minute lab using Prism Pro to build and deploy a Calm blueprint.
Labs
1. Deploying Prism Central
Module
13
MONITORING THE NUTANIX CLUSTER
Overview
After completing this module, you will be able to:
Nutanix Portal
• Nutanix Technical Support can monitor clusters and provide assistance when problems
occur.
• The Nutanix Support Portal is available for support assistance, software downloads, and
documentation.
• Nutanix supports a REST API, which allows you to request information or run administration
scripts for a Nutanix cluster.
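As a hedged illustration of the REST API, the following request retrieves basic cluster information using the commonly documented v2.0 endpoint. The cluster virtual IP and credentials are placeholders, and endpoint paths can vary by API version:
$ curl -k -u admin 'https://cluster_vip:9440/PrismGateway/services/rest/v2.0/cluster'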
Pulse
Pulse provides diagnostic system data to the Nutanix Support team to deliver proactive,
context-aware support for Nutanix solutions.
The Nutanix cluster automatically and unobtrusively collects this information with no effect on
system performance.
Pulse shares only basic system-level information necessary for monitoring the health and status
of a Nutanix cluster. Information includes:
• System alerts
When Pulse is enabled, it sends a message once every 24 hours to a Nutanix Support server by
default.
Pulse also collects the most important system-level statistics and configuration information
more frequently to automatically detect issues and help make troubleshooting easier. With this
information, Nutanix Support can apply advanced analytics to optimize your implementation
and to address potential problems.
Note: Pulse sends messages through ports 80/8443/443. If this is not allowed,
Pulse sends messages through your mail server. The Zeus leader IP address must
also be open in the firewall.
Pulse is enabled by default. You can enable or disable Pulse at any time.
Logs
Logs are generated by cluster components; a FATAL log is generated as a result of a component failure in a cluster. Log entries use one of the following severity levels:
• INFO
• WARNING
• ERROR
• FATAL
Entries within a log use the following format:
• [IWEF] identifies whether the log entry is information, a warning, an error, or fatal
• threadid file:line
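As an illustration only, a log entry following this format might look like the line below. The file name, line number, and message are hypothetical, but the leading F, the date/time stamp, the thread ID, and the file:line fields match the format described above:
F0820 15:28:23.123456 3135 stargate.cc:1234] Check failed: ...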
You can also generate a FATAL log on a process for testing. To do this, run the following
command in the CVM:
curl http://<svm ip>:<component port>/h/exit?abort=1
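For example, to generate a FATAL on the Stargate process of a CVM (Stargate commonly listens on port 2009; treat the port as an assumption to verify for your release, and run this only on a lab cluster):
nutanix@cvm$ curl http://cvm_ip:2009/h/exit?abort=1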
For practice, you can use this FATAL log to understand how to correlate it with an INFO file to
get more information. There are two ways to correlate a FATAL log with an INFO log:
• Search for the timestamp of the FATAL event in the corresponding INFO files.
3. Open the INFO file with vi and go to the bottom of the file (Shift+G).
4. Analyze the log entries immediately before the FATAL event, especially any errors or
warnings.
• If a process is repeatedly failing, it might be faster to do a long listing of the INFO files and
select the one immediately preceding the current one. The current one would be the one
referenced by the symbolic link.
$ ls *stargate*FATAL*
$ tail stargate.NUTANIX-CVM03.nutanix.log.FATAL.20120510-152823
$ grep F0820 stargate.NUTANIX-CVM03.nutanix.log.INFO.20120510-152823
stargate.ERROR
stargate.INFO
stargate.ntnx-16sm32070038-b-cvm.nutanix.log.ERROR.20190505-142229.18195
stargate.ntnx-16sm32070038-b-cvm.nutanix.log.INFO.20190927-204653.18195.gz
stargate.ntnx-16sm32070038-b-cvm.nutanix.log.WARNING.20190505-142229.18195
stargate.out
stargate.out.20190505-142228
stargate.WARNING
vip_service_stargate.out
vip_service_stargate.out.20190505-142302
Linux Tools
ls
This command returns a list of all files in the current directory, which is useful when you want to
see how many log files exist.
Include a subset of the filename that you are looking for to narrow the search. For example: $ ls
*stargate*FATAL*
cat
This command reads data from files and outputs their content. It is the simplest way to display
the contents of a file at the command line.
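For example, to print an entire FATAL log to the terminal:
$ cat stargate.NUTANIX-CVM03.nutanix.log.FATAL.20120510-152823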
tail
This command returns the last 10 lines that were written to the file, which is useful when
investigating issues that have happened recently or are still happening.
To change the number of lines, add the -n flag. For example: $ tail -n 20 stargate.NUTANIX-
CVM03.nutanix.log.FATAL.20120510-152823.3135
grep
This command returns lines in the file that match a search string, which is useful if you
are looking for a failure that occurred on a particular day. For example: $ grep F0820
stargate.NUTANIX-CVM03.nutanix.log.FATAL.20120510-152823.3135
Nutanix provides a variety of support services and materials through the Support portal. To
access the Nutanix support portal from Prism Central:
1. Select Support Portal from the user icon pull-down list of the main menu. The login screen
for the Nutanix support portal appears in a new tab or window.
2. Enter your support account user name and password. The Nutanix support portal home
page appears.
3. Select the desired service from the screen options. The options available to you are:
• Click one of the icons (Documentation, Open Case, View Cases, Downloads) in the middle of the page.
Note: Some options have restricted access and are not available to all users.
Labs
1. Using Nutanix Cluster Check (NCC) Health Checks
2. Collecting logs for support
Module 14: Cluster Management and Expansion
Overview
After completing this module, you will be able to:
• Expand a cluster.
• Explain license management.
Exercise caution whenever connecting directly to a CVM as the risk of causing cluster issues is
increased. This is because if you make an error when entering a container name or VM name,
you are not typically prompted to confirm your action – the command simply executes. In
addition, commands are executed with elevated privileges, similar to root, requiring attention
when making such changes.
While Nutanix cluster upgrades are non-disruptive and allow the cluster to run while nodes
upgrade in the background, there are situations in which some downtime may be necessary.
Certain maintenance operations and tasks such as hardware relocation would require a cluster
shutdown.
Before shutting down a node, shut down all the guest VMs running on the node or move them to other nodes in the cluster. Verify the data resiliency status of the cluster. The recommendation for any RF level is to shut down only one node at a time, even for RF3. If a cluster needs to have more than one node shut down, shut down the entire cluster. The cluster status command, executed from the CLI on a Controller VM, shows the current status of all cluster processes.
Note: This topic shows the process for AHV. Consult the appropriate admin manual
for other hypervisors.
1. Verify the cluster status with the cluster status command on a CVM.
2. If there are issues with the cluster, you can run the NCC checks with the command ncc health_checks run_all.
5. Shut down the cluster with the command cluster stop. Use the cluster status command to see the current status of all cluster processes.
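Taken together, and assuming the cluster is healthy and all guest VMs have already been shut down, the command sequence on a CVM is simply:
nutanix@cvm$ cluster status
nutanix@cvm$ ncc health_checks run_all
nutanix@cvm$ cluster stop
nutanix@cvm$ cluster status
The final cluster status run confirms that services report as stopped before you power off the nodes.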
Starting a Node
1. If the node is turned off, turn it on (otherwise, go to the next step).
3. Find the name of the CVM by executing the following on the host: virsh list --all | grep
CVM
4. Examining the output from the previous command, if the CVM is OFF, start it from the
prompt on the host: virsh start cvm_name
5. If the node is in maintenance mode, log on to the CVM over SSH and take it out of
maintenance mode: acli host.exit_maintenance_mode AHV-hypervisor-IP-address
7. Confirm that cluster services are running on the CVM (make sure to replace cvm_ip_addr accordingly): ncli cluster status | grep -A 15 cvm_ip_addr
a. Alternatively, you can use the following command to check if any services are down in the
cluster: cluster status | grep -v UP
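For example, assuming a CVM named NTNX-CVM03 with the IP address 10.1.64.60 and an AHV host with the IP address 10.1.64.50 (all placeholder values), the sequence looks roughly like this, with the virsh commands run on the AHV host and the remaining commands run on a CVM:
virsh list --all | grep CVM
virsh start NTNX-CVM03
acli host.exit_maintenance_mode 10.1.64.50
ncli cluster status | grep -A 15 10.1.64.60
cluster status | grep -v UP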
Starting a Cluster
1. Log on to any CVM in the cluster with SSH.
2. Start all cluster services by running the cluster start command.
Once the process begins, you will see a list of all the services that need to be started on each CVM:
If the cluster starts properly, output similar to the following is displayed for each node in the
cluster at the end of the command execution:
CVM: 10.1.64.60 Up
Zeus UP [5362, 5391, 5392, 10848, 10977, 10992]
Scavenger UP [6174, 6215, 6216, 6217]
SSLTerminator UP [7705, 7742, 7743, 7744]
SecureFileSync UP [7710, 7761, 7762, 7763]
Medusa UP [8029, 8073, 8074, 8176, 8221]
DynamicRingChanger UP [8324, 8366, 8367, 8426]
Pithos UP [8328, 8399, 8400, 8418]
Hera UP [8347, 8408, 8409, 8410]
Stargate UP [8742, 8771, 8772, 9037, 9045]
InsightsDB UP [8774, 8805, 8806, 8939]
InsightsDataTransfer UP [8785, 8840, 8841, 8886, 8888, 8889, 8890]
Ergon UP [8814, 8862, 8863, 8864]
Cerebro UP [8850, 8914, 8915, 9288]
Chronos UP [8870, 8975, 8976, 9031]
Curator UP [8885, 8931, 8932, 9243]
Prism UP [3545, 3572, 3573, 3627, 4004, 4076]
CIM UP [8990, 9042, 9043, 9084]
AlertManager UP [9017, 9081, 9082, 9324]
Arithmos UP [9055, 9217, 9218, 9353]
Catalog UP [9110, 9178, 9179, 9180]
Acropolis UP [9201, 9321, 9322, 9323]
Atlas UP [9221, 9316, 9317, 9318]
Uhura UP [9390, 9447, 9448, 9449]
Snmp UP [9418, 9513, 9514, 9516]
SysStatCollector UP [9451, 9510, 9511, 9518]
Tunnel UP [9480, 9543, 9544]
ClusterHealth UP [9521, 9619, 9620, 9947, 9976, 9977,
10301]
Janus UP [9532, 9624, 9625]
NutanixGuestTools UP [9572, 9650, 9651, 9674]
MinervaCVM UP [10174, 10200, 10201, 10202, 10371]
ClusterConfig UP [10205, 10233, 10234, 10236]
APLOSEngine UP [10231, 10261, 10262, 10263]
APLOS UP [10343, 10368, 10369, 10370, 10502, 10503]
Lazan UP [10377, 10402, 10403, 10404]
Orion UP [10409, 10449, 10450, 10474]
Delphi UP [10418, 10466, 10467, 10468]
After you have verified that the cluster is up and running and there are no services down, you
can start guest VMs.
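On AHV, guest VMs can be started from Prism or from a CVM with aCLI. For example (the VM name below is a placeholder):
nutanix@cvm$ acli vm.on MyAppVM01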
Hardware components, such as nodes and disks, can be removed from a cluster or reconfigured in other ways when conditions warrant it. However, node removal is typically a lengthy and I/O-intensive operation. Nutanix recommends removing a node only when it needs to be removed permanently from a cluster. Node removal is not recommended for troubleshooting scenarios.
1. Navigate to the Settings section of Prism and select Data at Rest Encryption.
3. Enter the required credentials in the Certificate Signing Request Information section.
4. In the Key Management Server section, add a new key management server.
6. Return to the Key Management Server section, upload all node certificates.
Note: If an SED drive or node is not removed as recommended, the drive or node will be locked.
• You need to reclaim licenses before you remove a host from a cluster.
• Removing a host takes some time because data on that host must be migrated to other
hosts before it can be removed from the cluster. You can monitor progress through the
dashboard messages.
• Removing a host automatically removes all the disks in that host from the storage containers
and the storage pool(s).
• Only one host can be removed at a time. If you want to remove multiple hosts, you must
wait until the first host is removed completely before attempting to remove the next host.
• After a node is removed, it goes into an unconfigured state. You can add such a node back
into the cluster through the expand cluster workflow, which we will discuss in the next topic
of this chapter.
Expanding a Cluster
The ability to dynamically scale the Acropolis cluster is core to its functionality. To scale an Acropolis cluster, install the new nodes in the rack and power them on. After the nodes are powered on, if they contain a factory-installed image of AHV and the CVM, the cluster should discover the new nodes using the IPv6 Neighbor Discovery protocol.
Note: Nodes that are installed with AHV and CVM, but not associated with a
cluster, are also discoverable. Factory install of AHV and CVM may not be possible
for nodes shipped in some regions of the world.
Multiple nodes can be discovered and added to the cluster concurrently if AHV and the CVM were imaged in the factory before the nodes were shipped. Some pre-work is necessary for nodes that do not meet these criteria. Additionally, nodes that are already part of a cluster are not listed as options for cluster expansion.
The process for expanding a cluster depends on the hypervisor type, version of AOS, and data-
at-rest encryption status.
Configuration: Same hypervisor and AOS version
Description: The node is added to the cluster without re-imaging it.

Configuration: Same hypervisor, but the node is running a lower AOS version than the cluster
Description: Upgrade the node to the AOS version of the cluster by running the following command from a CVM in the cluster:
nutanix@cvm$ /home/nutanix/cluster/bin/cluster -u new_node_cvm_ip_address upgrade_node
After the upgrade is complete, you can add the node to the cluster without re-imaging it. Alternately, if the AOS version on the node is higher than the cluster, you must either upgrade the cluster to that version or re-image the node.

Configuration: Same AOS version, but the hypervisor version is different
Description: You are provided with the option to re-image the node before adding it. Re-imaging is appropriate in many such cases, but in some cases it may not be necessary, such as for a minor version difference. Depending on the hypervisor, installation binaries (for example, an ISO) might need to be provided.

Configuration: Expanding a cluster when the ESXi cluster is configured with DVS (Distributed vSwitch) for CVM external communication
Description: To expand an ESXi cluster configured with DVS for Controller VM external communication, ensure that you do the following:
• Expand the DVS with the new node.
• Make sure both the host and the CVM are configured with DVS.
• Make sure that host-to-CVM and CVM-to-CVM communications are working.
Managing Licenses
Nutanix provides automatic and manually applied licenses to ensure access to the variety of
features available. These features will enable you to administer your environment based on your
current and future needs. You can use the default feature set of AOS, upgrade to an advanced
feature set, update your license for a longer term or reassign existing licenses to nodes or
clusters as needed.
Each Nutanix NX Series node or block is delivered with a default Starter license which does not
expire. You are not required to register this license on your Nutanix Customer Portal account.
These licenses are automatically applied when a cluster is created, even when a cluster has
been destroyed and re-created. In these cases, Starter licenses do not need to be reclaimed.
Software-only platforms qualified by Nutanix (for example, the Cisco UCS M5 C-Series Rack Server) might require a manually applied Starter license. Depending on the license level you purchase, you can apply it using the Prism Element or Prism Central web console.
• Pro and Ultimate licenses have expiration dates. License notification alerts in Prism start 60
days before expiration.
• Upgrade your license type if you require continued access to Pro or Ultimate features.
• An administrator must install a license after creating a cluster for Pro and Ultimate licensing.
• Ensure consistent licensing for all nodes in a cluster. Nodes with different licensing default to the minimum feature set.
For example, if two nodes in the cluster have Pro licenses and two nodes in the same cluster have Ultimate licenses, all nodes will effectively have Pro licenses and access to that feature set only.
Attempts to access Ultimate features in this case result in a warning in the web console. If you are using a Prism Pro trial license, the warning shows the expiration date and the number of days left in the trial period. The trial period is 60 days.
• You may see a "Licensing Status: In Process" alert message in the web console or log files.
• Generating a Cluster Summary File through the Prism web console, nCLI commands
(generate-cluster-info) or PowerShell commands (get-NTNXClusterLicenseInfo and get-
NTNXClusterLicenseInfoFile) initiates the cluster licensing process.
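For example, generating the cluster summary information from the nCLI might look like the following. This is a sketch that assumes the operation cited above is exposed under the license entity; confirm the exact entity name and parameters in the nCLI reference for your AOS version.
ncli> license generate-cluster-info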
AOS Licenses
Starter licenses are installed by default on each Nutanix node and block. They never expire and they do not require registration on your assigned Nutanix customer portal account.
Pro and Ultimate licenses are downloaded as a license file from the Nutanix Support Portal and
applied to your cluster using Prism.
With Pro or Ultimate licensing, or after upgrading to a Pro or Ultimate license, adding nodes or clusters to your environment requires you to generate a new license file for download and installation.
Note: For more information about the different features that are available with
Acropolis Starter, Pro, and Ultimate, please see: https://www.nutanix.com/
products/software-options
Prism Licenses
The Prism Pro license is available on a per-node basis, with options to purchase on a 1, 2, 3, 4, or
5-year term. A trial version of Prism Pro is included with every edition of AOS.
Add-on Licenses
Individual features known as add-ons can be added to your existing Prism license feature set.
When Nutanix makes add-ons available, you can add them to your existing Starter or Pro
license. For example, you can purchase Nutanix Files for your existing Pro licenses.
You need to purchase and apply one add-on license for each node in the cluster with a Pro
license. For example, if your current Pro-licensed cluster consists of four nodes, you need to
purchase four add-on licenses, then apply them to your cluster.
All nodes in your cluster need to be at the same license level (four Pro licenses and four add-
on licenses). You cannot buy one add-on license, apply it to one node and have three nodes
without add-on licenses.
Add-ons that are available with one- to five-year subscription terms are Nutanix Era, Nutanix Flow, Nutanix Files, and Nutanix Files Pro. Nutanix Calm is available in 25-VM subscription license packs.
Before attempting to install an upgraded or add-on license, ensure that you have created a
cluster and have logged into the web console to ensure the Starter license has been applied.
The Portal Connection feature simplifies licensing by integrating the licensing workflow into
a single interface in the web console. Once you configure this feature, you can perform most
licensing tasks from Prism without needing to explicitly log on to the Nutanix Support Portal.
Note: This feature is disabled by default. If you want to enable Portal Connection,
please see the Nutanix Licensing Guide on the Nutanix Support Portal.
Portal Connection communicates with the Nutanix Support Portal to detect changes or updates
to your cluster license status. When you open Licensing from the web console, the screen
displays 1-click action buttons to enable you to manage your licenses without leaving Prism.
Add: Add an add-on license. This button appears if add-on features are available for licensing.
Rebalance: Ensure your available licenses are applied to each node in your cluster.
Note: For more information on managing licenses with the Portal Connection feature, including examples of upgrades, renewals, and removals, please see the Nutanix Licensing Guide on the Nutanix Support Portal.
This is the default method of managing licenses, since the Portal Connection feature is disabled by default. This method is a 3-step process, in which you:
1. Generate a cluster summary file in the web console and upload it to the Nutanix support portal.
2. Generate and download a license file from the Nutanix support portal.
3. Apply the downloaded license file to the cluster through the web console.
1. From an internet-connected cluster, click the gear icon in the web console and open
Licensing.
3. Click Generate to create and save a cluster summary file to your local machine. The cluster summary file is saved to your browser download directory or a directory that you specify.
Note: To begin this process, you must have first generated a cluster summary file in
the web console.
2. Click Support Portal, log on to the Nutanix support portal, and click My Products > Licenses.
3. Click License a New Cluster. The Manage Licenses dialog box displays.
4. Click Choose File. Browse to the Cluster Summary File you just downloaded, select it, and
click Next. The portal automatically assigns a license, based on the information contained in
the Cluster Summary File.
5. Generate and apply the downloaded license file to the cluster. Click Generate to download the license file created for the cluster to your browser download folder or a directory that you specify.
Note: To begin this process, you must have first generated and downloaded a
license file from the Nutanix Support Portal.
1. In the Prism web console, click the upload link in the Manage Licenses dialog box.
2. Browse to the license file you downloaded, select it, and click Save.
Note: The 3-step process described here applies to Prism Element, Prism Central, and add-on licenses. For specific instructions related to each of these three license types, please see the relevant section of the Nutanix Licensing Guide on the Nutanix Support Portal.
Since a dark site cluster is not connected to the internet, the Portal Connection feature cannot be used from the cluster itself. However, some steps in the licensing process require the use of a system connected to the internet. The three-step process for licensing a dark site cluster is as follows:
1. Open Licensing from the gear icon in the web console for the connected cluster.
3. Click Show Info and copy the cluster information needed to generate a license file. This
page displays the information that you need to enter at the support portal on an internet-
connected system. Copy this information to complete this licensing procedure.
For example, one of the fields displayed is Flash TiB: the number of flash TiBs, which is used with capacity-based licensing.
1. Get your cluster information from the web console. Complete the following steps on a machine connected to the internet.
2. Navigate to the Cluster Usage section of the Nutanix Support Portal to manage your
licenses.
3. Select the option for Dark Sites and then select the required license information, including
class, license version, and AOS version.
4. If necessary, enter capacity and block details. (Ensure that there are no typing errors.)
5. Select your licenses for Acropolis and then license your add-ons individually.
6. Check the summary, make sure all details are correct, and then download the license file.
7. Apply the downloaded license file to your dark site cluster to complete the process.
Reclaiming a license returns it to your inventory so that you can reapply it to other nodes in a cluster. You will need to reclaim licenses when modifying license assignments, when removing nodes from a cluster, or before you destroy a cluster.
As with license management, licenses can be reclaimed both with and without the use of the Portal Connection feature. Both procedures are described below. For more information, including detailed step-by-step procedures, please see the Nutanix Licensing Guide on the Nutanix Support Portal.
Reclaim licenses to return them to your inventory when you remove one or more nodes from
a cluster. If you move nodes from one cluster to another, first reclaim the licenses, move the
nodes, then re-apply the licenses. Otherwise, if you are removing a node and not moving it to
another cluster, use the Rebalance button.
You can reclaim licenses for nodes in your clusters in cases where you want to make
modifications or downgrade licenses. For example, applying an Ultimate license to all nodes
in a cluster where some nodes are currently licensed as Pro and some nodes are licensed as
Ultimate. You might also want to transition nodes from Ultimate to Pro licensing.
For example, to reclaim a Nutanix Files add-on license with Portal Connection:
a. Open Licensing from the gear icon in the Prism web console for the connected cluster.
b. The Licensing window shows that you have installed the Nutanix Files add-on.
c. Click Remove File Server to remove this add-on feature. Click Yes in the confirmation window.
Portal Connection places the cluster into standby mode while it removes the feature. After this operation is complete, the cluster license status is updated.
You will need to repeat this procedure for any other add-ons that you have installed.
You can now perform any additional tasks, such as destroying the cluster or re-applying
licenses.
There are two scenarios in which you will reclaim licenses without using Portal Connection: first, when destroying a cluster, and second, when removing nodes from a cluster. The procedure for both scenarios is largely the same. Differences are noted in the steps below, where applicable.
Points to Remember
• If you move a removed node to another cluster, licensing it requires an available license from your inventory.
• You must unlicense (reclaim) your cluster (other than Starter on Nutanix NX Series
platforms) when you plan to destroy a cluster. First unlicense (reclaim) the cluster, then
destroy the cluster.
Note: If you have destroyed the cluster and did not reclaim all existing licenses by
unlicensing the cluster, contact Nutanix Support to help reclaim the licenses.
• Return licenses to your inventory when you remove one or more nodes from a cluster. Also,
if you move nodes from one cluster to another, first reclaim the licenses, move the nodes,
then re-apply the licenses.
• You can reclaim licenses for nodes in your clusters in cases where you want to make
modifications or downgrade licenses. For example, applying an Ultimate license to all nodes
in a cluster where some nodes are currently licensed as Pro and some nodes are licensed as
Ultimate. You might also want to transition nodes from Ultimate to Pro licensing.
• You do not need to reclaim Starter licenses for Nutanix NX Series platforms. These licenses
are automatically applied whenever you create a cluster.
1. Generate a cluster summary file in the web console and upload it to the Nutanix Support
Portal.
2. In the Support Portal, unlicense the cluster and download the license file.
3. Apply the downloaded license file to your cluster to complete the license reclamation
process.
AOS
Each node in a cluster runs AOS. When upgrading a cluster, all nodes should be upgraded to
the same AOS version.
Nutanix provides a live upgrade mechanism that allows the cluster to run continuously while a
rolling upgrade of the nodes is started in the background. There is no downgrade option.
Hypervisor Software
Hypervisor upgrades are provided by vendors such as VMware and qualified by Nutanix. The upgrade process updates one node in a cluster at a time.
NCC
Foundation
BIOS and BMC Firmware
Nutanix provides updated BIOS and Baseboard Management Controller (BMC) firmware.
Nutanix rarely includes this firmware on the Nutanix Support Portal. Nutanix recommends that you open a case on the Support Portal to request the availability of updated firmware for your platform.
Disk Firmware
Nutanix provides a live upgrade mechanism for disk firmware. The upgrade process updates
one disk at a time on each node for the disk group you have selected to upgrade.
Once the upgrade is complete on the first node in the cluster, the process begins on the next
node. Update happens on one disk at a time until all drives in the cluster have been updated.
For AOS only, Nutanix offers two types of releases that cater to the needs of different customer
environments.
• Short Term Support (STS) releases have new features and provide a regular upgrade path
• Long Term Support (LTS) releases are maintained for longer periods of time and primarily
include bug fixes over that extended period
To understand whether you have an STS or LTS release, or which one is right for you, refer to the following summary:
STS (Short Term Support)
• Release cadence: quarterly.
• Maintenance: 3 months of maintenance, followed by an additional 3 months of support.
• Content: major new features and support for new hardware platforms; also contains bug fixes.
• Intended for: customers that are interested in adopting major new features and are able to perform upgrades multiple times a year.
• Example versions: 5.6.x, 5.8.x, 5.9.x, 5.11.x.
• Upgrade path: to the next supported STS release, or to the next supported LTS release.
LTS (Long Term Support)
• Maintained for longer periods of time and primarily receives bug fixes over that extended period.
Note: The upgrade path must always be to a later release. Downgrades are not supported.
• Check the status of your cluster to ensure everything is in a proper working state.
• Check the compatibility matrix for details of hypervisor and hardware support for different versions of AOS.
• Update LCM.
2. Run the Nutanix Cluster Check (NCC) health checks from any CVM in the cluster.
3. Download the available hypervisor software from the vendor and the metadata file (JSON)
from the Nutanix Support Portal. If you are upgrading AHV, you can download the binary
bundle from the Nutanix Support Portal.
4. Upload the software and metadata through Upgrade Software.
6. Only one node is upgraded at a time. Ensure that all the hypervisors in your cluster are running the same version (all ESXi hosts running the same version, all AHV hosts running the same version, and so on). The NCC check same_hypervisor_version_check returns a FAIL status if the hypervisor versions are different (an example invocation is shown after the note below).
Note: If the hypervisor versions differ, the Upgrade Software (1-click upgrade) workflow does not complete successfully.
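If you only want to verify hypervisor version consistency before upgrading, the check named above can be run on its own from a CVM. The hypervisor_checks module path shown here is an assumption; verify it against the NCC reference for your NCC version:
nutanix@cvm$ ncc health_checks hypervisor_checks same_hypervisor_version_check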
Upgrading AHV
To upgrade AHV through the Upgrade Software feature in the Prism web console, do the
following:
1. Ensure that you are running the latest version of NCC. Upgrade NCC if required.
2. Run NCC to ensure that there are no issues with the cluster.
3. In the web console, navigate to the Upgrade Software section of the Settings page and click
the Hypervisor tab.
4. If Available Compatible Versions shows a new version of AHV, click Upgrade, then click
Upgrade Now, and click Yes when prompted for confirmation.
Upgrading AOS
To upgrade AOS through the Upgrade Software feature in the Prism web console, do the
following:
1. Ensure that you are running the latest version of NCC. Upgrade NCC if required.
2. Run NCC to ensure that there are no issues with the cluster.
3. In the web console, navigate to the Upgrade Software section of the Settings page and
select the option to upgrade AOS.
4. Optionally, you can also run pre-upgrade installation checks before proceeding with the upgrade process.
5. If automatic downloads are enabled on your cluster, install the downloaded package. If
automatic downloads are not enabled, download the upgrade package and install it.
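While the rolling upgrade runs, you can follow its progress from any CVM. As a rough sketch, upgrade_status reports per-node AOS upgrade progress and host_upgrade_status does the same for hypervisor upgrades; output details vary by AOS version:
nutanix@cvm$ upgrade_status
nutanix@cvm$ host_upgrade_status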
LCM consists of a framework and a set of modules for inventory and update. LCM supports all Nutanix, Dell XC, Dell XC Core, and Lenovo HX platforms. LCM modules are independent of AOS. They contain libraries and images, as well as metadata and checksums for security. Currently, Nutanix supplies all modules.
The LCM framework is accessible through the Prism interface. It acts as a download manager
for LCM modules, validating and downloading module content. All communication between the
cluster and LCM modules goes through the LCM framework.
Accessing LCM
Whether you are accessing LCM from Prism Element or Prism Central, the steps to do so are
the same.
Note: In AOS 5.11 and later, LCM is available as a menu item from the Prism Home page, rather than the Settings page.
You can use LCM to display software and firmware versions of entities in a cluster. Inventory
information for a node is persistent for as long as the node remains in the chassis. When you
remove a node from a chassis, LCM will not retain inventory information for that node. When
you return the node to the chassis, you must perform inventory again to restore the inventory
information.
To perform inventory:
1. Open LCM.
2. To take an inventory, click Options and select Perform Inventory. If you do not have auto-update enabled and a new version of the LCM framework is available, LCM displays a warning before proceeding.
The LCM interface also provides:
• The Focus button, which lets you switch between a general display and a component-by-component display.
• Auto-inventory. To enable this feature, click Settings and select the Enable LCM Auto Inventory check box in the dialog box that appears.
Before performing updates, get the current status of your cluster to ensure everything is in proper working order.
LCM updates the cluster one node at a time: it brings a node down (if needed), performs updates, brings the node up, waits until it is fully functional, and then moves on to the next node. If LCM encounters a problem during an update, it waits until the problem has been resolved before moving on to the next node.
During an LCM update, there is never more than one node down at the same time even if the
cluster is RF3.
All LCM updates follow this general procedure:
1. If updates for the LCM framework are available, LCM auto-updates its own framework, then
continues with the operation.
2. After a self-update, LCM runs the series of pre-checks described in the Life Cycle Manager
Pre-Checks section of the Life Cycle Manager Guide on the Support Portal.
3. When the pre-checks are complete, LCM looks at the available component updates and
batches them according to dependencies. LCM batches updates in order to reduce or
eliminate the downtime of the individual nodes; when updates are batched, LCM only
performs the pre-update and post-update actions once. For example, on NX platforms, BIOS
updates depend on BMC updates, so LCM batches them so the BMC always updates before
the BIOS on each node.
4. Next, LCM chooses a node and performs any necessary pre-update actions.
5. Next, LCM performs the update. The update process and duration vary by component.
6. LCM performs any necessary post-update actions and brings the node back up.
7. When cluster data resiliency is back to normal, LCM moves to the next node.
2. Specify where LCM should look for updates, and then select the updates you want to
perform.
At a Dark Site
By default, LCM automatically fetches updates from a pre-configured URL. If you are managing
a Nutanix cluster at a site that cannot access the provided URL, you must configure LCM to
fetch updates locally, using the procedure described in the Life Cycle Manager Guide on the
Nutanix Support Portal.
Labs
1. Performing a one-click NCC upgrade