How do I install the NVIDIA GPU driver, CUDA Toolkit, and NVIDIA Container Toolkit on Amazon EC2 instances running Amazon Linux 2 (AL2)?


I want to install the NVIDIA driver, CUDA Toolkit, and NVIDIA Container Toolkit on Amazon Linux 2 (AL2) on x86_64 or arm64.

Overview

This article describes how to install the NVIDIA GPU driver, CUDA Toolkit, and NVIDIA Container Toolkit on NVIDIA GPU EC2 instances running Amazon Linux 2 (AL2).

Note that by using this method, you agree to the NVIDIA Driver License Agreement, the End User License Agreement, and other related license agreements.

Pre-built AMIs

If you need AMIs preconfigured with TensorFlow, PyTorch, NVIDIA CUDA drivers and libraries, consider AWS Deep Learning AMIs. Refer to Release notes for DLAMIs for currently supported options.

For container workloads, consider Amazon ECS-optimized Linux AMIs and Amazon EKS optimized AMIs.

Note: The instructions in this article do not apply to pre-built AMIs.

GUI (graphical desktop) remote access

If you need remote graphical desktop access, refer to How do I install GUI (graphical desktop) on Amazon EC2 instances running Amazon Linux 2 (AL2)?

Note that this article installs the NVIDIA Tesla driver (also known as the NVIDIA Datacenter Driver), which is intended primarily for GPU compute workloads. If configured in xorg.conf, Tesla drivers support one display of up to 2560x1600 resolution. GRID drivers provide access to four 4K displays per GPU and are certified to provide optimal performance for professional visualization applications.

About CUDA toolkit

CUDA Toolkit is generally optional when the GPU instance is used to run applications (as opposed to developing them), because a CUDA application typically packages the CUDA runtime and libraries it needs by linking against them statically or dynamically.

End of life notice

  • OS support: Support for Amazon Linux 2 will end on 2025-06-30. AL2 has a high level of compatibility with CentOS 7.
  • NVIDIA driver support: R550 and R535 are the most recent Production Branch (PB) and Long Term Support Branch (LTSB) releases, respectively, to support CentOS 7. They reach end of life in February 2025 and June 2026, respectively.
  • CUDA Toolkit support: NVIDIA deprecated support for CentOS 7 in CUDA 12.4 Update 1 and removed it in CUDA 12.5.

As such, this guide may not work, and you are encouraged to use a newer OS. Possible options include Amazon Linux 2023, Ubuntu, and RHEL/Rocky, among others.

Prepare Amazon Linux 2

Launch a new NVIDIA GPU instance running Amazon Linux 2, preferably with at least 25 GB of storage, and connect to the instance. Then update the system and install the build prerequisites:

sudo yum update -y
sudo amazon-linux-extras install -y epel
sudo yum install -y dkms kernel-devel vulkan-devel libglvnd-devel elfutils-libelf-devel automake make gcc gcc-c++  xorg-x11-server-Xorg xorg-x11-fonts-Type1 xorg-x11-drivers
sudo systemctl enable --now dkms

Restart your EC2 instance if the kernel was updated:

sudo reboot
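Whether a reboot is actually needed can be checked by comparing the running kernel with the newest installed kernel package. The following is a minimal sketch, not part of the original procedure: the function name `kernel_reboot_needed` is hypothetical, and it assumes that `rpm -q --last kernel` lists the most recently installed kernel package first.

```shell
# Succeeds (exit 0) when the running kernel differs from the newest installed one.
# Arguments: $1 = running kernel release, $2 = newest installed kernel release.
kernel_reboot_needed() {
  [ "$1" != "$2" ]
}

# Only query the package database where rpm and a kernel package are present (as on AL2).
if command -v rpm >/dev/null 2>&1 && rpm -q kernel >/dev/null 2>&1; then
  running="$(uname -r)"
  # "rpm -q --last" sorts by install time, newest first; strip the "kernel-" prefix
  # so the value is comparable with "uname -r".
  newest="$(rpm -q --last kernel | head -n 1 | awk '{print $1}' | sed 's/^kernel-//')"
  if kernel_reboot_needed "$running" "$newest"; then
    echo "Running $running, newest installed $newest: reboot before installing the driver"
  else
    echo "Running kernel $running is current"
  fi
fi
```

Rebooting first matters because the driver module is compiled against the kernel that is running at install time.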

Install NVIDIA driver

To install NVIDIA driver version 550.127.08:

cd /tmp
DRIVER_VERSION=550.127.08
curl -L -O https://us.download.nvidia.com/tesla/$DRIVER_VERSION/NVIDIA-Linux-$(arch)-$DRIVER_VERSION.run
chmod +x ./NVIDIA-Linux-$(arch)-$DRIVER_VERSION.run
sudo CC=/usr/bin/gcc10-cc ./NVIDIA-Linux-$(arch)-$DRIVER_VERSION.run -s
sudo sed -i "s/\"'make'/& CC=\/usr\/bin\/gcc10-gcc /" /usr/src/nvidia-$DRIVER_VERSION/dkms.conf

To install another version instead, refer to the Driver Release Notes and change the line above that sets DRIVER_VERSION.
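The download URL in the commands above follows a fixed pattern built from the driver version and the CPU architecture. As an illustration only, the pattern can be wrapped in a small helper; `nvidia_driver_url` is a hypothetical name, and the sketch assumes other driver versions are published under the same path layout.

```shell
# Build the download URL for a given driver version and architecture.
# Mirrors the URL pattern used in the install commands; $2 defaults to this machine's arch.
nvidia_driver_url() {
  local version="$1" cpu_arch="${2:-$(uname -m)}"
  echo "https://us.download.nvidia.com/tesla/${version}/NVIDIA-Linux-${cpu_arch}-${version}.run"
}

# Example: print the URL for both supported architectures.
nvidia_driver_url 550.127.08 x86_64
nvidia_driver_url 550.127.08 aarch64
```

Before downloading, a request such as `curl -fsI "$(nvidia_driver_url "$DRIVER_VERSION")"` can confirm that a file exists for the chosen version.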

Verify compilation

Verify that driver module compilation is successful

tail /var/log/nvidia-installer.log

Output should be similar to below

-> done.
-> Driver file installation is complete.
-> Running post-install sanity check:
-> done.
-> Post-install sanity check passed.
-> Installation of the NVIDIA Accelerated Graphics Driver for Linux-aarch64 (version: 550.127.08) is now complete.

Verify module

nvidia-smi

Output should be similar to below

Fri Nov 22 06:48:59 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA T4G                     Off |   00000000:00:1F.0 Off |                    0 |
| N/A   76C    P0             36W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                        
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
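For scripted checks, the driver version can be read without parsing the table above by using the `--query-gpu` and `--format` options of `nvidia-smi`. This is a sketch under assumptions: `driver_version_matches` is a hypothetical helper, and the expected version is the one installed earlier in this article.

```shell
# Succeeds when the reported driver version equals the expected one.
driver_version_matches() {
  local expected="$1" reported="$2"
  [ "$expected" = "$reported" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # Print only the driver version field; take the first GPU if there are several.
  reported="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  if driver_version_matches "550.127.08" "$reported"; then
    echo "Driver $reported is loaded"
  else
    echo "Unexpected driver version: $reported" >&2
  fi
fi
```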

Optional: CUDA Toolkit

Ensure your EC2 instance has more than 15 GB of free disk space before installing CUDA Toolkit 12.4 Update 1:

cd /tmp
if ( arch | grep -q x86 ); then
  wget https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run
else
  wget https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux_sbsa.run
fi
chmod +x ./cuda_*.run
sudo CC=/usr/bin/gcc10-cc ./cuda_*.run --toolkit --silent

Refer to the CUDA Toolkit documentation for installation options. To install another version, refer to the CUDA Toolkit Archive page for the runfile (local) download link.
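The free-space requirement stated above can be checked before running the installer. The sketch below is illustrative: `free_gib` is a hypothetical helper, and it checks the root filesystem on the assumption that /tmp and /usr/local are both on the root volume, as on a default AL2 instance.

```shell
# Print free space, in whole GiB, on the filesystem that holds the given path.
free_gib() {
  # -P forces one output line per filesystem; with -k, column 4 is available KiB.
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

required_gib=15
available_gib="$(free_gib /)"
if [ "$available_gib" -gt "$required_gib" ]; then
  echo "OK: ${available_gib} GiB free"
else
  echo "Only ${available_gib} GiB free; more than ${required_gib} GiB is recommended" >&2
fi
```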

Verify compilation

tail /var/log/cuda-installer.log

Your output should be similar to below

[INFO]: Installing: cuda-cuobjdump
[INFO]: Installing: cuda-cuxxfilt
[INFO]: Installing: cuda-nvcc
[INFO]: Installing: cuda-nvvm
[INFO]: Installing: cuda-crt
[WARNING]: Skipping copy. File already exists at: /usr/local/cuda-12.4/bin/crt/link.stub
[WARNING]: Skipping copy. File already exists at: /usr/local/cuda-12.4/bin/crt/prelink.stub
[INFO]: Installing: cuda-nvprune
[INFO]: Installing: CUDA Documentation 12.4
[WARNING]: Cannot find manpages to install.

Verify installation

/usr/local/cuda/bin/nvcc -V

Output should be similar to below

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:24:28_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
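In a script, the release number can be extracted from this output and compared with the version you intended to install. A minimal sketch, where `nvcc_release` is a hypothetical helper that relies on the "release X.Y," wording shown in the sample output above:

```shell
# Extract the release number (for example 12.4) from "nvcc -V" output read on stdin.
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Run against the installed compiler only if it is present.
if [ -x /usr/local/cuda/bin/nvcc ]; then
  release="$(/usr/local/cuda/bin/nvcc -V | nvcc_release)"
  echo "CUDA Toolkit release: $release"
fi
```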

Post-installation Actions

Refer to NVIDIA CUDA Installation Guide for Linux for post-installation actions before CUDA Toolkit can be used. For example, you may want to modify your PATH and LD_LIBRARY_PATH environment variables to include /usr/local/cuda-12.4/bin and /usr/local/cuda-12.4/lib64 respectively.
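The environment changes mentioned above might look like the following for the current shell session. This is a sketch that assumes the default install prefix /usr/local/cuda-12.4; the variable name `CUDA_HOME` is used here only for readability and is not required by the toolkit.

```shell
# Assumed install prefix for CUDA Toolkit 12.4; adjust for other versions.
CUDA_HOME=/usr/local/cuda-12.4

# Prepend the toolkit directories. The ${VAR:+...} form avoids leaving a
# trailing colon when the variable was previously empty or unset.
export PATH="${CUDA_HOME}/bin${PATH:+:${PATH}}"
export LD_LIBRARY_PATH="${CUDA_HOME}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
```

To make the change persistent, the same lines can be appended to a shell startup file such as ~/.bashrc.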

Optional: NVIDIA Container Toolkit

NVIDIA Container Toolkit supports AL2 on both x86_64 and arm64.

sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
sudo yum install -y nvidia-container-toolkit

Refer to the NVIDIA Container Toolkit documentation for supported platforms, prerequisites, and installation options.

Verify Container Toolkit

nvidia-container-cli -V

Output should be similar to below

cli-version: 1.17.1
lib-version: 1.17.1
build date: 2024-11-09T00:36+0000
build revision: 63d366ee3b4183513c310ac557bf31b05b83328f
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: aarch64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

Container engine configuration

Refer to NVIDIA Container Toolkit site for container engine configuration instructions.

Docker

To install and configure Docker:

sudo amazon-linux-extras install -y docker
sudo systemctl enable docker
sudo usermod -aG docker ec2-user
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Verify Docker engine configuration

To verify the Docker configuration:

sudo docker run --rm --runtime=nvidia --gpus all public.ecr.aws/amazonlinux/amazonlinux:2 nvidia-smi

Output should be similar to below

Unable to find image 'public.ecr.aws/amazonlinux/amazonlinux:2' locally
2: Pulling from amazonlinux/amazonlinux
ac443ee34758: Pull complete 
Digest: sha256:f3002d062d7f061f6280a2a71fc3efb8512df150925fd0f8997a6c6d97f927bb
Status: Downloaded newer image for public.ecr.aws/amazonlinux/amazonlinux:2
Fri Nov 22 06:53:06 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA T4G                     Off |   00000000:00:1F.0 Off |                    0 |
| N/A   73C    P0             36W /   70W |       1MiB /  15360MiB |      8%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
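If the container test fails, one thing to check is whether the `nvidia` runtime was registered with Docker. The sketch below assumes that `nvidia-ctk runtime configure --runtime=docker` wrote its changes to the default location /etc/docker/daemon.json; `has_nvidia_runtime` is a hypothetical helper that only looks for the runtime name in that file.

```shell
# Succeeds when the given Docker daemon configuration mentions an "nvidia" runtime.
# $1 defaults to the standard daemon configuration path.
has_nvidia_runtime() {
  local config="${1:-/etc/docker/daemon.json}"
  [ -r "$config" ] && grep -q '"nvidia"' "$config"
}

if has_nvidia_runtime; then
  echo "nvidia runtime is present in the Docker daemon configuration"
else
  echo "nvidia runtime not found; re-run nvidia-ctk runtime configure and restart Docker" >&2
fi
```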

Install NVIDIA driver, CUDA toolkit and Container Toolkit on EC2 instance at launch

To install the NVIDIA driver, CUDA Toolkit, and NVIDIA Container Toolkit, including Docker, when launching a new GPU instance, you can use the following as a user data script:

#!/bin/bash
sudo yum update -y
sudo amazon-linux-extras install -y epel
sudo yum install -y dkms kernel-devel vulkan-devel libglvnd-devel elfutils-libelf-devel automake make gcc gcc-c++  xorg-x11-server-Xorg xorg-x11-fonts-Type1 xorg-x11-drivers
sudo systemctl enable --now dkms

cd /tmp
DRIVER_VERSION=550.127.08
curl -L -O https://us.download.nvidia.com/tesla/$DRIVER_VERSION/NVIDIA-Linux-$(arch)-$DRIVER_VERSION.run
chmod +x ./NVIDIA-Linux-$(arch)-$DRIVER_VERSION.run
sudo CC=/usr/bin/gcc10-cc ./NVIDIA-Linux-$(arch)-$DRIVER_VERSION.run -s
sudo sed -i "s/\"'make'/& CC=\/usr\/bin\/gcc10-gcc /" /usr/src/nvidia-$DRIVER_VERSION/dkms.conf

if ( arch | grep -q x86 ); then
  curl -L -O https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run
else
  curl -L -O https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux_sbsa.run
fi
chmod +x ./cuda_*.run
sudo CC=/usr/bin/gcc10-cc ./cuda_*.run  --toolkit --silent

sudo amazon-linux-extras install -y docker
sudo systemctl enable docker
sudo usermod -aG docker ec2-user

sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
sudo yum install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

sudo reboot

Verify

Connect to your EC2 instance and run the following commands:

nvidia-smi
/usr/local/cuda/bin/nvcc -V
nvidia-container-cli -V
sudo docker run --rm --runtime=nvidia --gpus all public.ecr.aws/amazonlinux/amazonlinux:2 nvidia-smi

View /var/log/cloud-init-output.log to troubleshoot any installation issues.
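Because the user data output can be long, it may help to filter the log for lines that look like failures. A minimal sketch; `scan_install_log` is a hypothetical helper, and the `error|fail` pattern is a rough heuristic that can both miss problems and match harmless lines.

```shell
# Print log lines that look like failures, prefixed with their line numbers.
# $1 defaults to the cloud-init output log.
scan_install_log() {
  local log="${1:-/var/log/cloud-init-output.log}"
  grep -n -i -E 'error|fail' "$log"
}

if [ -r /var/log/cloud-init-output.log ]; then
  scan_install_log || echo "No obvious errors found"
fi
```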

Perform the post-installation actions in order to use CUDA Toolkit. To verify the integrity of the installation, you can download, compile, and run CUDA samples such as deviceQuery.


Uninstallation

CUDA Toolkit

To uninstall CUDA Toolkit, run the uninstallation script provided in the bin directory of the toolkit. For version 12.4:

sudo /usr/local/cuda-12.4/bin/cuda-uninstaller

NVIDIA Driver

To remove the NVIDIA driver:

sudo /usr/bin/nvidia-uninstall

You may have to delete the /usr/src/nvidia-$DRIVER_VERSION folder manually.
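To see whether any driver source folders were left behind, the directory can be listed before deleting anything. This is a sketch only; `list_nvidia_src_dirs` is a hypothetical helper that matches folders by the `nvidia-` name prefix under /usr/src.

```shell
# Print leftover driver source directories under a base directory (default /usr/src).
list_nvidia_src_dirs() {
  local base="${1:-/usr/src}"
  find "$base" -maxdepth 1 -type d -name 'nvidia-*' 2>/dev/null
}

# List candidates for manual removal; nothing is deleted here.
list_nvidia_src_dirs
```

After reviewing the list, a leftover folder can be removed with a command such as `sudo rm -rf /usr/src/nvidia-550.127.08`.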