Machine Learning
[Tutorial: Environment Setup]
I-Ching Tseng
d08922025@csie.ntu.edu.tw
mlta-2022-spring@googlegroups.com
National Taiwan University
March 2022
Outline
q Overview
q Package Management Tools
q GPU
q Docker
q Conclusion
Overview
q To run a machine learning (ML) model
Ø You have to set up an environment first
Ø Using virtualization or package management tools is a good practice
• You can migrate the code and reproduce the result easily
• Different applications will not affect each other
• If your environment is broken, just create a new environment
q In this tutorial
Ø We will provide some guidelines for setting up environment
Ø We will help you understand the environment
• The software stack
• NVIDIA GPUs
Outline
q Overview
q Package Management Tools
Ø Prerequisites
Ø Conda
Ø Pipenv
Ø Summary
q GPU
q Docker
q Conclusion
Prerequisites
q Package management tools
Ø Help you manage the environment
Ø Do not manage the GPU driver
q To utilize GPUs, make sure the GPU driver is installed
[Figure: software stack — the application sits on Conda/Pipenv and PyTorch, which sit on the NVIDIA driver (software), which sits on the NVIDIA GPU (hardware)]
Conda
q Conda
Ø An open source package and environment management system
Ø Supports Windows, macOS, and Linux
q We take Anaconda as an example
Quick Start - Anaconda
Steps | Linux Command
Install Anaconda with the installer (check the documentation for details) | bash Anaconda3-2021.11-Linux-x86_64.sh
Create an environment (replace test_env with your desired environment name) | conda create -n test_env
Install packages (you can find the command on the PyTorch official website) | conda install -n test_env pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
Activate the environment | conda activate test_env
Run your application | python ml.py
Leave the environment | conda deactivate
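A quick sanity check after these steps (a sketch; assumes the test_env above and a working GPU driver):

conda activate test_env
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# e.g., prints "1.10.2 11.3 True"; False in the last field usually points to a driver problem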
Pipenv
q Pipenv
Ø A tool that creates and manages a virtualenv
Quick Start - Pipenv
q To learn more about Pipenv, please check the documentation
Steps | Linux Command
Install Pipenv with pip3 | pip3 install pipenv
Install packages | pipenv install numpy torchvision torch --index https://download.pytorch.org/whl/cu113
Activate the environment | pipenv shell
Run your application | python ml.py
Leave the environment | Ctrl + D
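The same kind of sanity check works here (a sketch; run it inside pipenv shell, or prefix it with pipenv run):

pipenv run python -c "import torch; print(torch.version.cuda)"
# should print 11.3, matching the cu113 index above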
Summary
q To utilize a GPU, you must install the driver on your host machine
q Using Conda or Pipenv to build environments is recommended
Ø Portable
Ø Reproducible
Ø Applications do not affect each other
q You can stop here if you just want to finish the homework
q Why is PyTorch so convenient?
Ø "We ship with everything in-built (PyTorch binaries include CUDA,
CuDNN, NCCL, MKL, etc.)." [Reference]
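You can inspect the bundled versions yourself (a sketch; assumes one of the environments built above):

python -c "import torch; print(torch.version.cuda, torch.backends.cudnn.version())"
# prints the CUDA and cuDNN versions shipped inside the PyTorch binaries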
Outline
q Overview
q Package Management Tools
q GPU
Ø NVIDIA GPUs
Ø Software Stack
Ø NVIDIA Driver
Ø CUDA
q Docker
q Conclusion
NVIDIA GPUs
q General-Purpose Graphics Processing Units (GPGPU)
Ø GPUs were originally designed for computer graphics applications
Ø GPUs are good at parallelizing "simple and repetitive" computations
• E.g., matrix multiplication
Ø ML models involve massive amounts of matrix multiplication
• We therefore use GPUs to accelerate ML model training, as sketched below
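A tiny illustration (a sketch; assumes PyTorch with CUDA support and a working driver):

python -c "import torch; x = torch.randn(1024, 1024, device='cuda'); print((x @ x).sum())"
# x @ x runs on the GPU because x lives on device 'cuda'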
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
Software Stack
[Figure: software stack — applications (translation, image classification, regression) run on frameworks (Caffe, TensorFlow, PyTorch, etc.), which use either a generic convolution layer or a cuDNN-optimized one; the frameworks call BLAS libraries (OpenBLAS, MKL2019, cuDNN/cuBLAS), which sit on the NVIDIA driver and the hardware (CPU, FPGA, GPU)]
NVIDIA Driver
q NVIDIA driver
Ø The software that allows operating systems (OS) to communicate with
GPUs
Ø Includes kernel modules
[Figure: frameworks, the cuDNN convolution layer, BLAS libraries (cuDNN/cuBLAS), and the CUDA Runtime API all run in user space; the NVIDIA driver (the CUDA driver) runs in kernel space, directly above the GPU hardware]
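On Linux, you can inspect the driver's kernel-space side yourself (a sketch; assumes the NVIDIA driver is installed):

lsmod | grep nvidia              # kernel modules loaded by the NVIDIA driver
cat /proc/driver/nvidia/version  # driver version reported by the kernel module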
CUDA
q Compute Unified Device Architecture (CUDA)
Ø "A parallel computing platform and application programming interface
that allows software to use NVIDIA GPUs" [Wikipedia]
q CUDA Runtime API vs. CUDA Driver API
Ø The driver CUDA version must be ≥ the runtime CUDA version
Ø Check the driver CUDA version with nvidia-smi
Ø When we "install CUDA"
• We usually mean the CUDA runtime
• You should check the framework compatibility
• The runtime version should not be greater than the driver CUDA version
• You should choose the runtime CUDA version carefully, as shown below
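In practice, you can compare the two versions like this (a sketch; assumes PyTorch is installed):

nvidia-smi                                           # header reports the driver CUDA version, e.g., "CUDA Version: 11.4"
python -c "import torch; print(torch.version.cuda)"  # runtime CUDA version bundled with PyTorch
# the first number must be >= the second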
Outline
q Overview
q Package Management Tools
q GPU
q Docker
Ø Virtualization
Ø Why use Containers?
Ø Containerization with Docker
Ø Pulling Docker Images
Ø NVIDIA Docker
q Conclusion
Virtualization
q Virtual machine (VM) and container
q You only have to know that
Ø Containers only virtualize software layers above the OS level
• It is a good choice if we only focus on specific hardware (e.g., NVIDIA GPUs)
Ø Containers are relatively lightweight
https://www.docker.com/resources/what-container
Why use Containers?
q Containers can virtualize more complex environments
Ø Even if you "only want to train models"
• You may use other frameworks that do not ship with CUDA and cuDNN
• You may need NCCL to perform efficient parallel and distributed training
• You may need to run an old version of PyTorch whose default CUDA version is
too old to communicate with the latest powerful GPUs
q Slurm and Kubernetes are popular server management tools
in both academia and industry
Ø Slurm supports Singularity containers
Ø Kubernetes runs applications in Docker containers
Containerization with Docker
q Docker
Ø A platform for building and running containers
Ø Docker installation
• Docker Desktop (for Mac and Windows) runs a VM
q Docker image
Ø A read-only template with instructions for creating a Docker container
q Steps for setting up an environment with Docker (sketched in commands below)
Ø Install Docker
• One-time effort
Ø Build/pull an image
• There are lots of built images
Ø Run the container
Ø Run your application
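In commands (a sketch; <image_tag> is the same placeholder used on the next slide):

docker pull <image_tag>                # fetch a prebuilt image from a registry
docker run -it --rm <image_tag> bash   # start a container and open a shell in it
python ml.py                           # inside the container, run your application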
Pulling Docker Images
q Docker Hub
Ø A place for finding and sharing Docker images
• E.g., Docker Hub repository of PyTorch
q Check Docker Hub and find the image tag
Ø 1.9.1-cuda11.1-cudnn8-devel vs. 1.9.1-cuda11.1-cudnn8-runtime?
• devel tags include the CUDA build toolchain (e.g., nvcc) for compiling custom extensions; runtime tags ship only the libraries needed to run
Ø Run "docker pull <image_tag>"
NVIDIA Docker (1/2)
q Using GPUs in a Docker container makes the container less portable
Ø Containers work in user space
• Root privilege only means you can use some privileged system calls
Ø Using NVIDIA GPUs requires kernel modules and user-level libraries
• The CUDA version of the driver's user-space modules must exactly match the
CUDA version of the driver's kernel modules
• The runtime CUDA version, by contrast, can be lower than the driver CUDA version
Ø Hence the host driver must exactly match the driver version installed in the container
q We should use NVIDIA Docker
Ø Install NVIDIA Docker
Ø You do not have to install the NVIDIA driver in the container
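Once NVIDIA Docker is installed, a common smoke test (a sketch; the image tag follows the NVIDIA Docker README and may need updating):

docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
# nvidia-smi inside the container comes from the host driver injected by NVIDIA Docker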
https://github.com/NVIDIA/nvidia-docker
NVIDIA Docker (2/2)
q Steps (a command sketch follows this list)
Ø Install the latest NVIDIA driver
• One-time effort
Ø Install NVIDIA Docker
• One-time effort
Ø Build/pull an image
Ø Run the container
Ø Run your application
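Putting the steps together (a sketch; the image tag and ml.py are examples, and /workspace is assumed to be the image's working directory):

docker pull pytorch/pytorch:1.9.1-cuda11.1-cudnn8-runtime
docker run --rm --gpus all -v "$PWD":/workspace pytorch/pytorch:1.9.1-cuda11.1-cudnn8-runtime python ml.py
# --gpus all exposes the host GPUs; -v mounts your code into the container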
Outline
q Overview
q Package Management Tools
q GPU
q Docker
q Conclusion
Conclusion
q Whether or not you virtualize your environment
Ø You must install the NVIDIA driver on the host to utilize NVIDIA GPUs
Ø The runtime CUDA version must be less than or equal to the driver
CUDA version
q If you want to use NVIDIA GPUs in containers
Ø Using NVIDIA Docker makes your life easier
• You do not need to install NVIDIA drivers in containers
• Containers are more portable
Ø You often only need to pull a prebuilt Docker image from Docker Hub
• You do not have to set up CUDA, cuDNN, and frameworks yourself
• This is useful especially when the environment is complex
Q&A
Thank You!