HPE AI-ML Accelerated with HPE ProLiant
Contents
Executive summary
Solution overview
Solution components
Hardware
Software
Solution Configuration
Deployment environment
Solution configuration
Server Configuration
Red Hat OpenShift Container Platform deployment
Step by Step OpenShift Deployment Process
RHEL CoreOS Install
Command Line Validation
OpenShift Web Consoles
NVIDIA Driver Deployment
Verify the GPU driver deployment
Testing the GPUs
NGC - Sample Application Deployment
Appendix A: Bill of Materials
Appendix B: PXE configuration
Appendix C: DHCP Configuration
Appendix D: DNS Configuration
Appendix E: Example Load Balancer Configuration
Version History
Resources and additional links
Executive summary
In today’s data-driven economy, businesses must invest in artificial intelligence and machine learning tools and applications to remain
competitive. Using AI, organizations are developing, and putting into production, process and industry applications that automatically learn,
discover, and make recommendations or predictions that can be used to set strategic goals and provide a competitive advantage. To accomplish
these strategic goals, organizations require data scientists and analysts who are skilled in developing models and performing analytics on
enterprise data. These data analysts require specialized tools, applications, and compute resources to create complex models and analyze
massive amounts of data. Deploying and managing these compute-intensive applications can place a burden on already taxed IT organizations. The
Red Hat OpenShift Container Platform, running on HPE ProLiant DL servers and powered by NVIDIA® T4 GPUs, can provide an organization
with a powerful, highly available DevOps platform that allows these compute-intensive data analytics tools and applications to be rapidly deployed to
end users through a self-service OpenShift registry and catalog. Additionally, OpenShift compute nodes equipped with NVIDIA T4 GPUs provide
the compute resources to meet the performance requirements of these applications and queries. According to IDC FutureScape: Worldwide
Artificial Intelligence Predictions (document #US45576319, www.IDC.com):
• By 2024, 75% of enterprises will invest in employee retraining and development, including third-party services, to address new skill needs and
ways of working resulting from AI.
• By 2022, 75% of enterprises will embed intelligent automation into technology and process development, using AI-based software to discover
operational and experiential insights to guide innovation.
• By 2024, AI will become the new user interface by redefining user experiences where over 50% of user touches will be augmented by
computer vision, speech, natural language, and AR/VR.
Target audience: This document is intended for Chief Information Officers (CIOs), Chief Technology Officers (CTOs), data center managers,
enterprise architects, data analysts, and implementation personnel wishing to learn more about Red Hat OpenShift Container Platform on HPE
ProLiant DL servers with NVIDIA T4 GPUs. Familiarity with HPE ProLiant DL servers, Red Hat OpenShift Container Platform, container-based
solutions, and core networking knowledge is assumed.
Document purpose: The purpose of this document is to provide a Reference Architecture that describes the best practices and technical details
for deploying a starter AI/ML platform with Red Hat OpenShift Container Platform on HPE ProLiant DL servers equipped with NVIDIA GPUs.
This starter AI/ML DevOps platform can be used to automate and simplify the deployment and management of tools and server-based
applications required by data scientists to perform complex data analysis for artificial intelligence and machine learning applications.
Solution overview
This Reference Architecture provides guidance for installing Red Hat OpenShift Container Platform (OCP) 4.1 on HPE ProLiant DL380 Gen10
and HPE ProLiant DL360 Gen10 servers equipped with NVIDIA T4 GPUs. The solution consists of six (6) HPE ProLiant DL servers:
three (3) HPE ProLiant DL360 Gen10 servers used for the OCP master nodes, two (2) HPE ProLiant DL380 Gen10 servers equipped with
NVIDIA T4 GPU modules for the compute nodes, and one (1) HPE ProLiant DL360 Gen10 server used as the OCP bootstrap node.
The bootstrap node can be repurposed as an additional OCP compute node for workloads that do not require GPU resources. The
servers in this solution are interconnected with a 50Gb network built on HPE InfiniBand FDR/Ethernet 40/50Gb adapters.
Solution components
Hardware
The configuration deployed for this solution is described in greater detail in this section. Table 1, below, lists the various hardware components in
the solution.
Table 1. Components utilized in the creation of this solution
Component Qty Description
HPE ProLiant DL360 Gen10 4 OpenShift Master and Bootstrap / compute node
HPE ProLiant DL380 Gen10 2 OpenShift Worker Nodes
HPE SN2100M 1 Network Switch
Aruba 2930 1 Network Switch
NVIDIA T4 PCIe GPU Accelerator 8 GPU modules
Figure 2 illustrates the rack view of the HPE ProLiant DL servers deployed in this solution. Additional HPE ProLiant DL360 Gen10 and / or HPE
ProLiant DL380 Gen10 servers can be added to the solution as needed.
The HPE ProLiant DL380 Gen10 server has an adaptable chassis, including new HPE modular drive bay configuration options with up to 30
Small Form Factor (SFF) drives, up to 19 Large Form Factor (LFF) drives, or up to 20 NVMe drives, along with support for up to three double-wide GPU
options. In addition to the embedded 4x1GbE adapter, there is a choice of HPE FlexibleLOM or PCIe standup adapters, which offer a choice of networking
bandwidth (1GbE to 40GbE) and fabric that adapt and grow to changing business needs. In addition, the HPE ProLiant DL380 Gen10 server
comes with a complete set of HPE Technology Services, delivering confidence, reducing risk, and helping customers realize agility and stability.
This section describes the HPE ProLiant DL380 Gen10 servers used in the creation of this solution. Table 2 describes the individual
components. Each server was equipped with 192 GB of RAM and dual Intel Xeon 2.6 GHz 12-core CPUs. Individual server sizing should be based on
customer needs and may not align with the configuration outlined in this document.
The HPE ProLiant DL380 Gen10 servers used in this Reference Architecture provide a robust platform to run containerized applications. The
two HPE ProLiant DL380 Gen10 servers in this solution are deployed as OpenShift Compute nodes and configured with 4 NVIDIA T4 GPU
Accelerators.
Table 2 lists the hardware components installed in the HPE ProLiant DL380 Gen10 servers.
Table 2. HPE ProLiant DL380 server configuration
Component Description
Figure 4 shows the iLO device inventory for the HPE ProLiant DL380 Gen10 servers. The NVIDIA T4 GPUs are installed in PCI slots 1, 2, 4, and
5.
HPE iLO
HPE Integrated Lights Out (iLO) is embedded in HPE ProLiant platforms and provides server management that enables faster deployment and
simplified lifecycle operations while maintaining end-to-end security, thus increasing productivity.
HPE ProLiant DL360 Gen10 servers
The HPE ProLiant DL360 Gen10 server is a secure, performance driven dense server that can be confidently deployed for virtualization,
database, or high-performance computing. The HPE ProLiant DL360 Gen10 server delivers security, agility, and flexibility without compromise.
This section describes the HPE ProLiant DL360 Gen10 servers used in the creation of this solution. Each server was equipped with 32 GB of RAM
and dual Intel Xeon 2.3 GHz 12-core CPUs. Individual server sizing should be based on customer needs and may not align with the configuration
outlined in this document.
The HPE ProLiant DL360 Gen10 servers in this Reference Architecture provide the OpenShift control plane, the OpenShift master and
bootstrap nodes. The OpenShift master nodes are responsible for the OpenShift cluster health, scheduling, API access, and authentication. The
etcd cluster runs on the OpenShift master nodes. The bootstrap node provides resources used by the master nodes to create the control plane
for the OpenShift cluster. The bootstrap node is a temporary role that is only used during the initial OpenShift cluster installation. After the
OpenShift cluster bootstrap process is complete the bootstrap node can be removed from the cluster and repurposed as an OpenShift compute
node. In this Reference Architecture the bootstrap node is not equipped with NVIDIA T4 GPUs. This compute node can be used to deploy
containerized applications that do not require GPU resources.
Table 3 lists the components installed in the HPE ProLiant DL360 servers.
Table 3. HPE ProLiant DL360 server configuration
Component Description
Aruba 2930
The Aruba 2930F Switch Series is designed for customers creating smart digital workplaces that are optimized for mobile users with an
integrated wired and wireless approach. These convenient Layer 3 network switches include built-in uplinks and PoE power and are simple to
deploy and manage with advanced security and network management tools like Aruba ClearPass Policy Manager, Aruba AirWave and cloud-based Aruba Central.
A powerful Aruba ProVision ASIC delivers performance, robust feature support and value with programmability for the latest applications.
Stacking with Virtual Switching Framework (VSF) provides simplicity and scalability. The 2930F supports built-in 1GbE or 10GbE uplinks, PoE+,
Access OSPF routing, Dynamic Segmentation, robust QoS, RIP routing, and IPv6 with no software licensing required.
NVIDIA T4
The NVIDIA T4 GPU is based on the latest NVIDIA Turing architecture, which provides support for virtualized workloads with NVIDIA virtual GPU
(vGPU) software. It is a single-wide card with passive cooling that offers good performance while consuming less power. The NVIDIA Turing
architecture includes RT cores for real-time ray tracing acceleration and batch rendering. It also supports GDDR6 memory, which provides
improved power efficiency and performance over the previous-generation GDDR5 memory. When combined with NVIDIA vGPU software, the T4
delivers the same breakthrough performance and versatility to virtual machines (VMs) as it does on non-virtualized systems, giving users a
native-PC-like experience in a virtualized environment. The T4 is well suited for various data center workloads, including virtual desktops using
modern productivity applications and virtual workstations for data scientists (https://www.NVIDIA.com/en-in/data-center/-t4/).
Network Overview
Each server in this solution is equipped with an HPE InfiniBand FDR/Ethernet 40/50Gb adapter that is connected to an HPE StoreFabric M-series
SN2100M 100GbE Ethernet switch. The HPE ProLiant DL servers use iLO management interfaces, which connect to a 1GbE Ethernet network.
Software
Red Hat OpenShift Container Platform
Red Hat OpenShift Container Platform® is Red Hat’s enterprise-grade Kubernetes distribution that provides enterprises with the ability to build,
deploy, and manage container-based applications. Red Hat OpenShift Container Platform provides enterprises with a full-featured Kubernetes
environment that includes automated operations, cluster services, developer services, and application services to build an on-premises hybrid cloud
Platform as a Service (PaaS) and Containers as a Service (CaaS) solution. Red Hat OpenShift Container Platform provides integrated logging
and metrics, authentication and scheduling, high availability, automated over-the-air updates, and an integrated application container registry.
Red Hat Enterprise Linux CoreOS
OpenShift Container Platform uses Red Hat Enterprise Linux CoreOS (RHCOS), a new container-oriented operating system that combines some
of the best features and functions of the CoreOS and Red Hat Atomic Host operating systems. RHCOS is specifically designed for running
containerized applications from OpenShift Container Platform and works with new tools to provide fast installation, Operator-based management,
and simplified upgrades (see the OpenShift Container Platform 4.1 Architecture documentation,
https://access.redhat.com/documentation/en-us/openshift_container_platform/4.1/pdf/architecture/OpenShift_Container_Platform-4.1-Architecture-en-US.pdf).
RHCOS includes:
• Ignition, a first-boot system configuration utility for initially bringing up and configuring OpenShift Container Platform nodes.
• CRI-O, a Kubernetes native container runtime implementation that integrates closely with the operating system to deliver an efficient and
optimized Kubernetes experience.
• Kubelet, the primary node agent for Kubernetes that is responsible for launching and monitoring containers.
In OpenShift Container Platform 4.1, you must use RHCOS for all control plane machines, but you can use Red Hat Enterprise Linux (RHEL) as
the operating system for compute (worker) machines.
Table 4 lists the versions of Red Hat OpenShift Container Platform and Red Hat Enterprise Linux CoreOS used in the creation of this solution.
The installer should ensure they have downloaded or have access to this software.
Table 4. Software versions
Component Version
Solution Configuration
Deployment environment
This document makes assumptions about services and networks available within the implementation environment. This section discusses those
assumptions and, where applicable, provides details on how they should be configured. If a service is optional, it will be noted in the description.
Services
Table 5 lists the services utilized in the creation of this solution and provides a high-level explanation of their function.
Table 5. Services used in the creation of this solution
Service Required/Optional Description/Notes
DNS Required Provides name resolution on management and data center networks.
DHCP Required Provides IP address leases on the PXE network.
TFTP/PXE Required Required to provide network boot capabilities to hosts that will install over the network.
Load Balancer Required Provides load balancing across the OpenShift master and worker nodes.
NTP Required Required to ensure consistent time across the solution stack.
DNS
All nodes used for the Red Hat OpenShift Container platform deployment must be registered in DNS. A sample DNS zone file is provided in
Appendix D of this document.
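As a minimal sketch of the records OpenShift 4.1 expects, the zone excerpt below uses the lab cluster name and hpecloud.org base domain from this solution; the load balancer and master addresses are placeholders, the etcd A and SRV records are repeated for each master node, and Appendix D contains the full sample zone file:
api.lab.hpecloud.org.      IN A      <load balancer IP>
api-int.lab.hpecloud.org.  IN A      <load balancer IP>
*.apps.lab.hpecloud.org.   IN A      <load balancer IP>
etcd-0.lab.hpecloud.org.   IN A      <master-0 IP>
_etcd-server-ssl._tcp.lab.hpecloud.org. IN SRV 0 10 2380 etcd-0.lab.hpecloud.org.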
DHCP
DHCP services must be in place for the PXE and management networks. DHCP services are generally in place on data center networks. The MAC
address of the network interfaces on the servers can be collected using the HPE iLO management interface before installation has begun to
create address reservations on the DHCP server for the hosts. A reservation is required for a single adapter on the PXE network of each physical
server. A sample DHCP configuration file is provided in Appendix C of this document.
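As an illustrative dhcpd.conf excerpt, the subnet declaration below shows how PXE clients can be directed to the TFTP server in addition to the per-host reservations; the router address, next-server address, and pxelinux.0 filename are assumptions for this sketch and should be adjusted to match the environment:
subnet 10.19.20.0 netmask 255.255.255.0 {
  option routers 10.19.20.1;
  next-server <TFTP server IP address>;
  filename "pxelinux.0";
}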
TFTP/PXE
The hosts in this configuration were deployed via a TFTP / PXE server to provide the initial network boot services. In order to successfully
complete the necessary portions of the install you will need a host that is capable of providing HTTP, TFTP, and network boot services. In the
solution environment, PXE services existed on the management network. It is beyond the scope of this document to provide instructions for
building a PXE server host, however sample configuration files are included in Appendix B of this document. It is assumed that TFTP and
network boot services are being provided from a Linux-based host.
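As an illustrative example, a pxelinux.cfg entry for booting a worker node into the RHCOS 4.1 installer might look like the following; the HTTP server address, RHCOS metal image name, and ignition file name are assumptions for this sketch, and Appendix B contains the sample configuration files used in this solution:
default rhcos-worker
label rhcos-worker
  kernel rhcos-4.1.0-x86_64-installer-kernel
  append initrd=rhcos-4.1.0-x86_64-installer-initramfs.img coreos.inst=yes coreos.inst.install_dev=sda coreos.inst.image_url=http://<web server IP address>/rhcos-4.1.0-x86_64-metal-bios.raw.gz coreos.inst.ignition_url=http://<web server IP address>/worker.ign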
Load Balancer
A load balancer is required for the deployment. A sample HAProxy configuration file is provided in Appendix E of this document.
NTP
A Network Time Protocol server should be available to hosts within the solution environment.
NFS
An NFS share is required to provide persistent storage for the OpenShift Registry.
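As an illustrative sketch, an NFS-backed PersistentVolume such as the following can be created for the registry once the cluster is up; the NFS server address, export path, and 100Gi size are assumptions for this environment:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: registry-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: <NFS server IP address>
    path: /exports/registry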
Installer laptop
A laptop system or virtual machine with the ability to connect to the various components within the solution stack is required.
Solution configuration
Note
In order to complete the installation of the required software in the following sections, internet access is required and should be enabled on at
least one active adapter.
Server Configuration
Installing the NVIDIA GPU modules
The HPE ProLiant DL380 Gen10 servers are configured with a PCI riser card to accommodate the NVIDIA GPU cards.
BIOS Settings
Set the network boot order and network boot device to boot from the HPE 547FLR-QSFP (InfiniBand FDR/Ethernet 40/50Gb) adapter. Enable the adapter for network booting and place it ahead of the HDD in the boot order.
1. During the boot process, press F9 to enter the BIOS/Platform Configuration (RBSU).
2. Select Workload Profile (at the top of the RBSU screen) and choose "High Performance Compute (HPC)".
3. Select Network Options -> Network Boot Options -> Embedded FlexibleLOM 1 Port 1 and set it to Network Boot. Set the remaining network ports to Disabled, then save (F10).
4. Select Boot Options -> Boot Mode "Legacy BIOS Mode" -> Legacy BIOS Boot Order -> Standard Boot Order (IPL) and move "Embedded FlexibleLOM 1 Port 1 : HPE Eth (10/25/50G) Adapter" to the top of the boot order, then save.
5. Reboot the system.
Storage configuration
2. Select the Smart Storage Administrator and wait for it to load (this may take several minutes depending on the storage configuration).
3. Select HPE Smart Array P408i-a SR -> Configure -> Create Array.
4. Select Physical Drives for the New Array -> Create Array -> Create Logical Drive -> Finish.
5. Exit and reboot the server.
Refer to the HPE Smart Storage Administrator User Guide for detailed instructions:
https://support.hpe.com/hpsc/doc/public/display?docId=c03909334
The components required for the installation include the openshift-installer, the install-config.yaml file, the RHCOS image, a pull secret, and an SSH key. The
openshift-installer, pull secret, and RHCOS image can be obtained from the OpenShift Install on Bare Metal: User-Provisioned Infrastructure web
page, https://cloud.redhat.com/openshift/install/metal/user-provisioned. The components and their usage are described in the following section.
3. Extract the openshift-installer and openshift-client archives in a working directory (/root/ocp-install) on the installer laptop or virtual machine.
#tar -xvf openshift-install-linux-4.1.0.tar.gz
#tar -xvf openshift-client-linux-4.1.0.tar.gz
4. Create the install-config.yaml file. An example install-config.yaml is shown below.
#vi install-config.yaml
apiVersion: v1
baseDomain: hpecloud.org
compute:
- name: worker
  replicas: 2
controlPlane:
  name: master
  replicas: 3
metadata:
  name: lab
platform:
  none: {}
pullSecret: ''
sshKey: ''
5. Create an SSH key pair (if one does not already exist) and copy the contents of the public key (~/.ssh/id_rsa.pub) into the sshKey field of the install-config.yaml file.
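A minimal example of generating the key pair, assuming no key exists yet (the empty passphrase is a lab convenience, not a recommendation):
#ssh-keygen -t rsa -b 4096 -N '' -f ~/.ssh/id_rsa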
6. Obtain a pull secret from the OpenShift Install on Bare Metal: User-Provisioned Infrastructure web page.
Copy and paste the contents of the pull secret into the install-config.yaml file.
7. Create a shell script to launch the openshift-install utility.
#vi install.sh
#!/bin/bash
rm -rf installdir
mkdir installdir
cp install-config.yaml installdir/install-config.yaml
./openshift-install --dir=./installdir create ignition-configs
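Running the script produces the ignition configuration files (bootstrap.ign, master.ign, and worker.ign) in the installdir directory. In this sketch they are copied to the document root of the HTTP server used during PXE boot so the RHCOS installer can retrieve them; the web server address and document root are assumptions for this environment:
#sh install.sh
#scp installdir/*.ign root@<web server IP address>:/var/www/html/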
2. Copy the initramfs image and the kernel files to the PXE server.
#scp rhcos-4.1.0-x86_64-installer-initramfs.img root@<pxe server IP address>:/var/lib/tftpboot/
#scp rhcos-4.1.0-x86_64-installer-kernel root@<pxe server IP address>:/var/lib/tftpboot/
Cluster formation
The master and worker nodes will display an error message until they connect to the bootstrap node. Once connected the master and worker
nodes reboot and form the OpenShift cluster. Once the cluster is formed a message will be displayed informing the installer that the bootstrap
process is complete and it is safe to remove the bootstrap resources.
1. Monitor the cluster formation. Connect to the bootstrap server using the following command:
#ssh core@bootstrap.<domain name>
#journalctl -f -u bootkube.service
b. Create PV
4. Monitor the cluster operator status until the operators are all ready.
#watch -n5 oc get clusteroperators
Every 5.0s: oc get clusteroperators
5. Use the openshift installer to monitor the cluster for completion.
#./openshift-install --dir=./installdir wait-for install-complete
INFO Waiting up to 30m0s for the cluster at https://api.lab.hpecloud.org:6443 to initialize...
INFO Waiting up to 10m0s for the openshift-console route to be created...
INFO Install complete!
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/root/OCP-Installer/installdir/auth/kubeconfig'
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.lab.hpecloud.org
INFO Login to the console with user: kubeadmin, password: igbMz-4yLgw-K3m3Q-bGngz
6. Log into the cluster using the OpenShift client and the default kubeadmin account. The output from step 5 above provides the installer with the
default username and password along with the URL for accessing the OpenShift web console.
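For example, the installer can log in from the command line with the kubeadmin credentials and API URL reported in the sample output above:
#oc login -u kubeadmin -p igbMz-4yLgw-K3m3Q-bGngz https://api.lab.hpecloud.org:6443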
2. Configure the htpasswd identity provider and the admin user. Refer to the Red Hat OpenShift documentation,
https://docs.openshift.com/container-platform/4.1/authentication/understanding-identity-provider.html, for configuring alternate and
additional identity providers.
3. Create the users.htpasswd file.
htpasswd -c -B -b /root/users.htpasswd <username> <password>
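The remaining steps register the htpasswd file with the cluster and grant the new user administrative rights. The following is a minimal sketch based on the standard OpenShift 4.1 htpasswd identity provider configuration; the secret name htpass-secret and provider name htpasswd_provider are illustrative:
#oc create secret generic htpass-secret --from-file=htpasswd=/root/users.htpasswd -n openshift-config
#vi oauth.yaml
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: htpasswd_provider
    mappingMethod: claim
    type: HTPasswd
    htpasswd:
      fileData:
        name: htpass-secret
#oc apply -f oauth.yaml
#oc adm policy add-cluster-role-to-user cluster-admin <username>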
8. Test login.
oc login -u <username>
Using the OpenShift web console the administrator can drill down into the cluster to view the status of different areas of the cluster including
workloads and compute resources. Selecting Compute > Nodes provides the administrator with a list of OpenShift nodes and their status. Figure
10 shows the 5 nodes, 3 masters and 2 worker nodes, initially deployed in this solution. In this illustration the bootstrap node has not yet been
repurposed into a worker node.
Figure 11 illustrates the Node Feature Discovery operator install window that will be used to install the operator.
Once the Node Feature Discovery Operator is installed, select Catalog > Installed Operators to display the results of the operator installation as
displayed in Figure 12.
The next step is to install the NVIDIA drivers for the GPU modules. At the time of this writing the NVIDIA driver operator is still in development.
This paper will be updated with the procedures for installing the official NVIDIA driver operator once it has been posted to the Operator Hub. A
reference implementation is available as a Special Resource Operator and can be installed using the procedure below.
1. Pull the NVIDIA Special Resource Operator.
#git clone https://github.com/openshift-psap/special-resource-operator
#cd special-resource-operator
2. After the NVIDIA Special Resource Operator and its resources have been deployed (refer to the repository documentation for the deployment steps), run oc get pod -o wide to display the NVIDIA driver and tooling pods.
# oc get pod -o wide
NAME                                         READY   STATUS      RESTARTS   AGE   IP            NODE                        NOMINATED NODE   READINESS GATES
nvidia-dcgm-exporter-cfldl                   2/2     Running     0          12d   172.23.2.7    worker-1.lab.hpecloud.org   <none>           <none>
nvidia-dcgm-exporter-lw7wt                   2/2     Running     0          12d   172.23.2.6    worker-0.lab.hpecloud.org   <none>           <none>
nvidia-device-plugin-daemonset-7zg2v         1/1     Running     0          12d   10.131.0.27   worker-0.lab.hpecloud.org   <none>           <none>
nvidia-device-plugin-daemonset-jp88w         1/1     Running     0          12d   10.128.2.16   worker-1.lab.hpecloud.org   <none>           <none>
nvidia-device-plugin-validation              0/1     Completed   0          12d   10.128.2.17   worker-1.lab.hpecloud.org   <none>           <none>
nvidia-device-plugin-validationdrp87         0/1     Completed   0          11d   10.128.2.26   worker-1.lab.hpecloud.org   <none>           <none>
nvidia-driver-daemonset-58bdl                1/1     Running     0          12d   10.131.0.25   worker-0.lab.hpecloud.org   <none>           <none>
nvidia-driver-daemonset-86rhs                1/1     Running     0          12d   10.128.2.14   worker-1.lab.hpecloud.org   <none>           <none>
nvidia-driver-validation                     0/1     Completed   0          12d   10.128.2.15   worker-1.lab.hpecloud.org   <none>           <none>
nvidia-feature-discovery-5b776               1/1     Running     0          12d   10.131.0.29   worker-0.lab.hpecloud.org   <none>           <none>
nvidia-feature-discovery-z5frm               1/1     Running     0          12d   10.128.2.18   worker-1.lab.hpecloud.org   <none>           <none>
nvidia-grafana-6fbddd75c-g9pm7               1/1     Running     0          12d   10.131.0.28   worker-0.lab.hpecloud.org   <none>           <none>
special-resource-operator-7f74c6786b-q8rmw   1/1     Running     1          12d   10.128.2.13   worker-1.lab.hpecloud.org   <none>           <none>
Run oc describe nodes | grep nvidia to display the GPU resources that the NVIDIA driver stack advertises on each node. The output shows two (2)
HPE ProLiant DL380 Gen10 servers, each presenting 4 NVIDIA T4 GPU modules.
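An illustrative excerpt of this output for the two GPU-equipped worker nodes (representative values based on the four GPUs per node described above, not a verbatim capture; the value is repeated under both the Capacity and Allocatable sections of each node):
 nvidia.com/gpu:  4
 nvidia.com/gpu:  4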
Testing the GPUs
To verify that containers can access the GPUs, create a pod that runs the nvidia-smi utility and requests a single GPU:
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  containers:
  - image: nvidia/cuda
    name: nvidia-smi
    command: ["/bin/bash", "-c", "nvidia-smi; exit 0"]
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1
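An illustrative way to run this test and check its result, assuming the pod definition above is saved as nvidia-smi-pod.yaml (the file name is arbitrary):
#oc create -f nvidia-smi-pod.yaml
#oc logs nvidia-smi
The pod log should contain the familiar nvidia-smi device table listing a single Tesla T4 along with its driver and CUDA versions.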
NGC - Sample Application Deployment
1. Create a pod definition that pulls the TensorFlow container image from the NVIDIA NGC registry and runs the tf_cnn_benchmarks ResNet-50 benchmark on a single GPU:
apiVersion: v1
kind: Pod
metadata:
  name: tensorflow-benchmarks-gpu
spec:
  containers:
  - image: nvcr.io/nvidia/tensorflow:19.09-py3
    name: cudnn
    command: ["/bin/sh", "-c"]
    args: ["git clone https://github.com/tensorflow/benchmarks.git; cd benchmarks/scripts/tf_cnn_benchmarks; python3 tf_cnn_benchmarks.py --num_gpus=1 --data_format=NHWC --batch_size=32 --model=resnet50 --variable_update=parameter_server"]
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1
  restartPolicy: Never
2. Create the tensorflow and benchmark pod.
#oc create -f tensorflow-benchmarks-gpu-pod.yaml
3. Run the oc logs command to review the output from the TensorFlow pod.
#oc logs tensorflow-benchmarks-gpu
TensorFlow: 1.14
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 32 global
32 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/gpu:0']
NUMA bind: False
Data format: NHWC
Optimizer: sgd
Variables: parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Appendix A: Bill of Materials
Qty Part number Description
1 P9K10A HPE 42U 600mmx1200mm G2 Kitted Advanced Shock Rack with Side Panels and Baying
1 P9K10A 001 HPE Factory Express Base Racking Service
4 867959-B21 HPE ProLiant DL360 Gen10 8SFF Configure-to-order Server
4 867959-B21 0D1 Factory Integrated
4 867959-B21 ABA HPE DL360 Gen10 8SFF CTO Server
4 860663-L21 HPE DL360 Gen10 Intel Xeon-Gold 5118 (2.3GHz/12-core/105W) FIO Processor Kit
4 860663-B21 HPE DL360 Gen10 Intel Xeon-Gold 5118 (2.3GHz/12-core/105W) Processor Kit
4 860663-B21 0D1 Factory Integrated
16 815097-B21 HPE 8GB (1x8GB) Single Rank x8 DDR4-2666 CAS-19-19-19 Registered Smart Memory Kit
16 815097-B21 0D1 Factory Integrated
8 P07926-B21 HPE 960GB SATA 6G Mixed Use SFF (2.5in) SC 3yr Wty Digitally Signed Firmware SSD
8 P07926-B21 0D1 Factory Integrated
4 867982-B21 HPE DL360 Gen10 Low Profile Riser Kit
4 867982-B21 0D1 Factory Integrated
4 P01366-B21 HPE 96W Smart Storage Battery (up to 20 Devices) with 145mm Cable Kit
4 P01366-B21 0D1 Factory Integrated
4 804331-B21 HPE Smart Array P408i-a SR Gen10 (8 Internal Lanes/2GB Cache) 12G SAS Modular Controller
4 804331-B21 0D1 Factory Integrated
4 879482-B21 HPE InfiniBand FDR/Ethernet 40/50Gb 2-port 547FLR-QSFP Adapter
4 879482-B21 0D1 Factory Integrated
8 865414-B21 HPE 800W Flex Slot Platinum Hot Plug Low Halogen Power Supply Kit
8 865414-B21 0D1 Factory Integrated
4 BD505A HPE iLO Advanced 1-server License with 3yr Support on iLO Licensed Features
4 BD505A 0D1 Factory Integrated
4 874543-B21 HPE 1U Gen10 SFF Easy Install Rail Kit
4 874543-B21 0D1 Factory Integrated
2 868703-B21 HPE ProLiant DL380 Gen10 8SFF Configure-to-order Server
2 868703-B21 0D1 Factory Integrated
2 868703-B21 ABA HPE DL380 Gen10 8SFF CTO Server
2 826862-L21 HPE DL380 Gen10 Intel Xeon-Gold 6126 (2.6GHz/12-core/120W) FIO Processor Kit
2 826862-B21 HPE DL380 Gen10 Intel Xeon-Gold 6126 (2.6GHz/12-core/120W) Processor Kit
2 826862-B21 0D1 Factory Integrated
24 835955-B21 HPE 16GB (1x16GB) Dual Rank x8 DDR4-2666 CAS-19-19-19 Registered Smart Memory Kit
24 835955-B21 0D1 Factory Integrated
4 P07926-B21 HPE 960GB SATA 6G Mixed Use SFF (2.5in) SC 3yr Wty Digitally Signed Firmware SSD
4 P07926-B21 0D1 Factory Integrated
2 826700-B21 HPE DL38X Gen10 x16 Tertiary Riser Kit
2 826700-B21 0D1 Factory Integrated
host master-0 {
  hardware ethernet 30:e1:71:62:8d:30;
  fixed-address 10.19.20.247;
}
backend ingress-http
    balance source
    mode tcp
    server worker-0.lab 10.19.20.250:80 check
    server worker-1.lab 10.19.20.251:80 check
frontend ingress-https
    bind *:443
    default_backend ingress-https
    mode tcp
    option tcplog
backend ingress-https
    balance source
    mode tcp
    server worker-0.lab 10.19.20.250:443 check
    server worker-1.lab 10.19.20.251:443 check
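The excerpt above covers only the ingress router traffic on ports 80 and 443. A complete configuration for OpenShift 4.1 also balances the Kubernetes API (port 6443) and machine config server (port 22623) across the bootstrap and master nodes. The sketch below shows the API pair; an analogous frontend/backend pair is needed for port 22623, and only the master-0 address matches the DHCP reservation shown above (the remaining addresses are illustrative):
frontend openshift-api
    bind *:6443
    default_backend openshift-api
    mode tcp
    option tcplog
backend openshift-api
    balance source
    mode tcp
    server bootstrap.lab <bootstrap IP>:6443 check
    server master-0.lab 10.19.20.247:6443 check
    server master-1.lab <master-1 IP>:6443 check
    server master-2.lab <master-2 IP>:6443 check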
Version History
Project: HPE Reference Architecture for accelerated Artificial Intelligence & Machine Learning on HPE ProLiant DL380 Gen10 and HPE ProLiant DL360 Gen10 servers
Status: Published
Document version Date Description of change
Resources and additional links
NVIDIA, https://www.NVIDIA.com/en-us/data-center/-t4/
© Copyright 2020 Hewlett Packard Enterprise Development LP. The information contained herein is subject to change without
notice. The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty statements
accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. Hewlett
Packard Enterprise shall not be liable for technical or editorial errors or omissions contained herein.
Red Hat is a registered trademark of Red Hat, Inc. in the United States and other countries. Linux is the registered trademark of Linus
Torvalds in the U.S. and other countries. Intel and Xeon are trademarks of Intel Corporation in the U.S. and other countries.