How do I install NVIDIA GPU driver, CUDA toolkit and optionally NVIDIA Container Toolkit on Amazon Linux 2023 (AL2023)?

5 minute read
Content level: Intermediate

I want to install the NVIDIA driver, the CUDA toolkit, and optionally the NVIDIA Container Toolkit on Amazon Linux 2023 (AL2023).

Prepare Amazon Linux 2023

Launch an NVIDIA GPU instance, then update the OS and install the kernel build prerequisites.

sudo dnf update -y
sudo dnf install -y dkms kernel-devel kernel-modules-extra

Reboot the instance if the update installed a new kernel.
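One way to tell whether the update installed a different kernel is to compare the running kernel with the most recently installed kernel package. The following is a minimal sketch, assuming `rpm -q --last kernel` lists the newest package first; `kernel_changed` is a hypothetical helper name, not an AL2023 command.

```shell
# kernel_changed RUNNING NEWEST: succeeds when NEWEST is set and differs
# from RUNNING. Hypothetical helper for illustration only.
kernel_changed() {
  [ -n "$2" ] && [ "$1" != "$2" ]
}

running="$(uname -r)"
newest=""
# rpm -q --last sorts packages by install time, newest first.
if command -v rpm >/dev/null 2>&1 && pkgs="$(rpm -q --last kernel 2>/dev/null)"; then
  newest="$(printf '%s\n' "$pkgs" | head -n 1 | awk '{print $1}' | sed 's/^kernel-//')"
fi

if kernel_changed "$running" "$newest"; then
  echo "Running $running, newest installed $newest: reboot recommended"
else
  echo "No different kernel detected; reboot not needed"
fi
```

If the script recommends a reboot, run sudo reboot before you continue with the driver installation.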

Install NVIDIA driver and CUDA toolkit

Method 1: Package Manager Installation (x86_64)

CUDA version 12.5 and higher supports Amazon Linux 2023 on x86_64 only.

Ensure your OS has more than 5 GiB of free disk space
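The free space check can be scripted. The sketch below assumes the packages are installed on the root file system (/); `enough_space` is a hypothetical helper, not a system command.

```shell
# enough_space AVAIL_KIB MIN_GIB: succeeds when AVAIL_KIB is more than
# MIN_GIB GiB. Hypothetical helper for illustration only.
enough_space() {
  [ "$1" -gt $(( $2 * 1024 * 1024 )) ]
}

# df -Pk prints POSIX-format output in 1 KiB blocks; field 4 is available space.
avail_kib="$(df -Pk / | awk 'NR==2 {print $4}')"

if enough_space "$avail_kib" 5; then
  echo "OK: $(( avail_kib / 1024 / 1024 )) GiB free on /"
else
  echo "Not enough free space on /: need more than 5 GiB"
fi
```

For the runfile method, pass 10 instead of 5 to match its larger requirement.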

Add NVIDIA repo

sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/amzn2023/x86_64/cuda-amzn2023.repo
sudo dnf clean expire-cache

Install NVIDIA driver

sudo dnf module install -y nvidia-driver:latest-dkms

Install CUDA toolkit

sudo dnf install -y cuda-toolkit

Method 2: Runfile installation (x86_64 and arm64)

As of CUDA Toolkit 12.5, NVIDIA supports AL2023 only through the x86_64 rpm packages. The runfile installer is not officially supported on AL2023 and may not work.

Ensure your OS has more than 10 GiB of free disk space

Install development libraries

sudo dnf install -y vulkan-devel libglvnd-devel elfutils-libelf-devel

Download CUDA toolkit installer

Intel/AMD x86_64

wget https://developer.download.nvidia.com/compute/cuda/12.6.0/local_installers/cuda_12.6.0_560.28.03_linux.run

Graviton arm64 (G5g instance)

wget https://developer.download.nvidia.com/compute/cuda/12.6.0/local_installers/cuda_12.6.0_560.28.03_linux_sbsa.run

To get the URL of the latest runfile (local) installer, go to the CUDA Toolkit download page and select RHEL 9 for x86_64 or arm64 sbsa.
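If you script the download, the installer can be chosen from the machine architecture. This is a sketch that only covers the two CUDA 12.6.0 URLs listed above; `runfile_url` is a hypothetical helper, and it assumes `uname -m` reports aarch64 on Graviton instances.

```shell
# runfile_url ARCH: print the CUDA 12.6.0 runfile URL for ARCH (as reported
# by uname -m). Hypothetical helper; the URLs are the two listed above.
runfile_url() {
  base="https://developer.download.nvidia.com/compute/cuda/12.6.0/local_installers"
  case "$1" in
    x86_64)  echo "$base/cuda_12.6.0_560.28.03_linux.run" ;;
    aarch64) echo "$base/cuda_12.6.0_560.28.03_linux_sbsa.run" ;;
    *)       echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Print the URL for this machine; nothing is downloaded here.
url="$(runfile_url "$(uname -m)")" && echo "Installer URL: $url"
```

You can then pass the printed URL to wget in place of the fixed URLs.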

Install NVIDIA driver and CUDA toolkit

chmod +x ./cuda*.run
sudo ./cuda_*.run --driver --toolkit --tmpdir=/var/tmp --silent

Post installation

Restart your OS

sudo reboot

Verify NVIDIA driver

nvidia-smi

Your output should look similar to the following:

Sun Aug  4 00:58:21 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA T4G                     Off |   00000000:00:1F.0 Off |                    0 |
| N/A   77C    P0             33W /   70W |       1MiB /  15360MiB |      9%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Verify CUDA toolkit

/usr/local/cuda/bin/nvcc --version

Your output should look similar to the following:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Fri_Jun_14_16:44:37_PDT_2024
Cuda compilation tools, release 12.6, V12.6.20
Build cuda_12.6.r12.6/compiler.34431801_0
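In automation you may want only the release number rather than the full banner. The sketch below assumes the "release X.Y," wording shown in the sample output; `cuda_release` is a hypothetical helper.

```shell
# cuda_release: read `nvcc --version` output on stdin and print the release
# number. Hypothetical helper for illustration only.
cuda_release() {
  sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

nvcc=/usr/local/cuda/bin/nvcc
if [ -x "$nvcc" ]; then
  echo "CUDA toolkit release: $("$nvcc" --version | cuda_release)"
else
  echo "nvcc not found at $nvcc"
fi
```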

More information

Refer to the NVIDIA CUDA Installation Guide for Linux for more details and post-installation instructions. For example, you may want to add /usr/local/cuda/bin to your PATH environment variable.
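A minimal sketch of the PATH change follows. It adds the directory only when it is not already listed, so it is safe to run more than once; `prepend_path` is a hypothetical helper name.

```shell
# prepend_path DIR: put DIR at the front of PATH unless it is already listed.
# Hypothetical helper for illustration only.
prepend_path() {
  case ":$PATH:" in
    *":$1:"*) ;;                 # already present, do nothing
    *) PATH="$1:$PATH" ;;
  esac
}

prepend_path /usr/local/cuda/bin
export PATH
```

To make the change persistent, you could place these lines in a shell startup file such as ~/.bashrc.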

[Optional] Install NVIDIA Container Toolkit

sudo dnf config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
sudo dnf clean expire-cache
sudo dnf install -y nvidia-container-toolkit

Container engine configuration

Refer to NVIDIA Container Toolkit site for container engine configuration instructions.

Docker

To install Docker and configure it to use the NVIDIA runtime:

sudo dnf install -y docker
sudo usermod -aG docker ec2-user
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
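Before you run a container, you can check that the runtime was registered. This sketch assumes nvidia-ctk wrote to Docker's default configuration file, /etc/docker/daemon.json; `has_nvidia_runtime` is a hypothetical helper that does a plain text match rather than a full JSON parse.

```shell
# has_nvidia_runtime FILE: succeeds when FILE is readable and mentions an
# "nvidia" entry. Hypothetical helper; a text match, not a JSON parser.
has_nvidia_runtime() {
  [ -r "$1" ] && grep -q '"nvidia"' "$1"
}

# Assumption: nvidia-ctk wrote to Docker's default config path.
config=/etc/docker/daemon.json
if has_nvidia_runtime "$config"; then
  echo "nvidia runtime is registered in $config"
else
  echo "nvidia runtime not found in $config"
fi
```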

Verify Docker engine configuration

Run nvidia-smi in a container to confirm that Docker can access the GPU:

sudo docker run --rm --runtime=nvidia --gpus all public.ecr.aws/amazonlinux/amazonlinux:2023 nvidia-smi

Your output should look similar to the following:

Unable to find image 'public.ecr.aws/amazonlinux/amazonlinux:2023' locally
2023: Pulling from amazonlinux/amazonlinux
b60b6c892280: Pull complete 
Digest: sha256:58b239881342c76bc01ac4a2bf442e76b7f816a136aa3bd9e48eecb8892cd171
Status: Downloaded newer image for public.ecr.aws/amazonlinux/amazonlinux:2023
Sun Aug  4 01:02:03 2024     
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA T4G                     Off |   00000000:00:1F.0 Off |                    0 |
| N/A   74C    P0             31W /   70W |       1MiB /  15360MiB |      8%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Install NVIDIA driver, CUDA toolkit and Container Toolkit on EC2 instance at launch

To install all of the above, including Docker, when you launch a new AL2023 GPU instance, use the following user data script.

#!/bin/bash
# User data runs as root at first boot; sudo is kept so the commands match the steps above.

# Update the OS and install the kernel build prerequisites
sudo dnf update -y
sudo dnf install -y dkms kernel-devel kernel-modules-extra
cd /tmp

# x86_64: install from the NVIDIA repo. Otherwise (arm64): use the runfile installer.
if [ "$(uname -m)" = "x86_64" ]; then
  sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/amzn2023/x86_64/cuda-amzn2023.repo
  sudo dnf clean expire-cache
  sudo dnf module install -y nvidia-driver:latest-dkms
  sudo dnf install -y cuda-toolkit
else
  sudo dnf install -y vulkan-devel libglvnd-devel elfutils-libelf-devel
  wget https://developer.download.nvidia.com/compute/cuda/12.6.0/local_installers/cuda_12.6.0_560.28.03_linux_sbsa.run
  chmod +x ./cuda*.run
  sudo ./cuda_*.run --driver --toolkit --tmpdir=/var/tmp --silent
fi

# Verify the driver and the CUDA toolkit
nvidia-smi
/usr/local/cuda/bin/nvcc --version

# Install NVIDIA Container Toolkit
sudo dnf config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
sudo dnf clean expire-cache
sudo dnf install -y nvidia-container-toolkit

# Install Docker and register the NVIDIA runtime
sudo dnf install -y docker
sudo usermod -aG docker ec2-user
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify that containers can access the GPU
sudo docker run --rm --runtime=nvidia --gpus all public.ecr.aws/amazonlinux/amazonlinux:2023 nvidia-smi

To verify the installation status, connect to your EC2 instance and view the contents of /var/log/cloud-init-output.log.
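Because the log can be long, a quick scan for problem lines may help. The sketch below counts lines that mention errors or failures and shows the end of the log; `count_matches` is a hypothetical helper, and the search pattern is an assumption about how failures are worded.

```shell
# count_matches FILE PATTERN: print how many lines of FILE match PATTERN
# (case-insensitive, extended regex). Hypothetical helper for illustration.
count_matches() {
  grep -c -i -E "$2" "$1" 2>/dev/null || true
}

log=/var/log/cloud-init-output.log
if [ -r "$log" ]; then
  echo "Lines mentioning errors or failures: $(count_matches "$log" 'error|fail')"
  tail -n 20 "$log"    # the final verification output appears at the end
else
  echo "Cannot read $log (try running with sudo)"
fi
```

A count of zero does not prove success, so still confirm that the nvidia-smi table appears near the end of the log.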

3 Comments

This is great Mike!
Are there options for Graviton/ARM?

iBehr (AWS, EXPERT)
replied 4 months ago

Hello, I get an error when I run the sample workload:

[root@ip bin]# docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-sm
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: driver rpc error: failed to process request: unknown.

I am using AL2023 (ami-0b17ca9fb2a39a659) on a Graviton ARM instance (g5g.xlarge). Any advice?

replied 2 months ago

Worked perfectly to build an ECS-optimized GPU-ready AMI based on AL2023 (ami-01c1ede61c128dc37)! Thank you so much for this post!

replied 2 months ago