I want to install the NVIDIA driver, the CUDA toolkit, and optionally the NVIDIA Container Toolkit on Amazon Linux 2023 (AL2023).
Prepare Amazon Linux 2023
Launch an NVIDIA GPU instance, then update the OS and install DKMS and the kernel development packages
sudo dnf update -y
sudo dnf install -y dkms kernel-devel kernel-modules-extra
Restart your AL2023 instance if the kernel was updated.
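One way to tell whether a restart is needed is to compare the running kernel with the most recently installed kernel package. The sketch below assumes the first entry printed by rpm -q --last kernel is the kernel that will boot next. The helper name kernel_needs_reboot is illustrative and not part of any AL2023 tooling.

```shell
# Sketch: report when the running kernel differs from the most recently installed one.
# kernel_needs_reboot is a hypothetical helper name.
kernel_needs_reboot() {
  # $1 = running kernel release, $2 = most recently installed kernel release
  [ "$1" != "$2" ]
}

running="$(uname -r)"
# rpm -q --last orders packages by install time, latest first.
# Fall back to an empty list if rpm is missing or no kernel package is found.
pkgs="$(rpm -q --last kernel 2>/dev/null)" || pkgs=""
# Keep the first field of the first line and strip the "kernel-" prefix.
newest="$(printf '%s\n' "$pkgs" | head -n 1 | awk '{print $1}' | sed 's/^kernel-//')"

if [ -n "$newest" ] && kernel_needs_reboot "$running" "$newest"; then
  echo "Kernel $newest is installed but $running is running: reboot recommended"
fi
```

If the message appears, run sudo reboot before installing the driver so that DKMS builds against the kernel you will actually boot.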
Install NVIDIA driver and CUDA toolkit
Method 1: Package Manager Installation (x86_64)
CUDA 12.5 and later support Amazon Linux 2023 on x86_64 only.
Ensure the instance has more than 5 GiB of free disk space.
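The free-space requirement can be checked before installing. The sketch below assumes the packages land on the root filesystem; the helper name has_free_gib is illustrative.

```shell
# Sketch: check that the root filesystem has more than a given number of GiB free.
# has_free_gib is a hypothetical helper name.
has_free_gib() {
  # $1 = available space in KiB, $2 = required space in GiB
  [ "$1" -gt $(( $2 * 1024 * 1024 )) ]
}

# df -Pk prints POSIX-format output in 1 KiB blocks; column 4 is the available space.
avail_kib="$(df -Pk / | awk 'NR==2 {print $4}')"

if has_free_gib "$avail_kib" 5; then
  echo "More than 5 GiB free on /"
else
  echo "5 GiB or less free on /; free up space or extend the volume first" >&2
fi
```

For the runfile method, pass 10 instead of 5 to match its larger requirement.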
Add NVIDIA repo
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/amzn2023/x86_64/cuda-amzn2023.repo
sudo dnf clean expire-cache
Install NVIDIA driver
sudo dnf module install -y nvidia-driver:latest-dkms
Install CUDA toolkit
sudo dnf install -y cuda-toolkit
Method 2: Runfile installation (x86_64 and arm64)
CUDA Toolkit 12.5 currently supports AL2023 only through the x86_64 RPM install. The runfile installer is not supported on AL2023 and may not work.
Ensure your OS has more than 10 GiB of free disk space
Install development libraries
sudo dnf install -y vulkan-devel libglvnd-devel elfutils-libelf-devel
Download CUDA toolkit installer
Intel/AMD x86_64
wget https://developer.download.nvidia.com/compute/cuda/12.6.0/local_installers/cuda_12.6.0_560.28.03_linux.run
Arm64 (sbsa)
wget https://developer.download.nvidia.com/compute/cuda/12.6.0/local_installers/cuda_12.6.0_560.28.03_linux_sbsa.run
Go to the CUDA Toolkit download page to obtain the latest runfile (local) installer download URL for RHEL 9 on x86_64 and arm64 sbsa.
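Choosing between the two installers can be scripted from the machine architecture. The sketch below is one possible approach, reusing the 12.6.0 URLs listed above; the helper name runfile_url is illustrative, and the URLs would need updating for a newer release.

```shell
# Sketch: choose the runfile URL from the machine architecture.
# runfile_url is a hypothetical helper name; the URLs are the 12.6.0 ones from this article.
runfile_url() {
  local base="https://developer.download.nvidia.com/compute/cuda/12.6.0/local_installers"
  case "$1" in
    x86_64)  echo "$base/cuda_12.6.0_560.28.03_linux.run" ;;
    aarch64) echo "$base/cuda_12.6.0_560.28.03_linux_sbsa.run" ;;
    *)       echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# uname -m prints x86_64 on Intel/AMD and aarch64 on arm64 instances.
url="$(runfile_url "$(uname -m)")" && echo "Download with: wget $url"
```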
Install NVIDIA driver and CUDA toolkit
chmod +x ./cuda*.run
sudo ./cuda_*.run --driver --toolkit --tmpdir=/var/tmp --silent
Post installation
Restart your OS
sudo reboot
Verify NVIDIA driver
nvidia-smi
Your output should be similar to the following:
Sun Aug 4 00:58:21 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03 Driver Version: 560.28.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA T4G Off | 00000000:00:1F.0 Off | 0 |
| N/A 77C P0 33W / 70W | 1MiB / 15360MiB | 9% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Verify CUDA toolkit
/usr/local/cuda/bin/nvcc --version
Your output should be similar to the following:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Fri_Jun_14_16:44:37_PDT_2024
Cuda compilation tools, release 12.6, V12.6.20
Build cuda_12.6.r12.6/compiler.34431801_0
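If you want to check the toolkit version from a script, the release number can be extracted from the nvcc output. The sketch below assumes the "release X.Y," wording shown in the sample output above; the helper name cuda_release is illustrative.

```shell
# Sketch: extract the CUDA release number from `nvcc --version` output read on stdin.
# cuda_release is a hypothetical helper name.
cuda_release() {
  # Print only the number that follows "release " on the matching line.
  sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Only call nvcc if the toolkit is installed at the default location.
if [ -x /usr/local/cuda/bin/nvcc ]; then
  echo "CUDA toolkit release: $(/usr/local/cuda/bin/nvcc --version | cuda_release)"
fi
```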
More information
Refer to the NVIDIA CUDA Installation Guide for Linux for more details and post-installation instructions. For example, you may want to add /usr/local/cuda/bin to your PATH environment variable.
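A minimal sketch of the PATH change follows. It adds the directory only when it is not already present, so running it repeatedly does not create duplicates; the helper name add_to_path is illustrative.

```shell
# Sketch: add the CUDA bin directory to PATH only if it is not already present.
# add_to_path is a hypothetical helper name; it prints the resulting PATH value.
add_to_path() {
  # $1 = directory to add, $2 = current PATH value
  case ":$2:" in
    *":$1:"*) echo "$2" ;;          # already present, leave unchanged
    *)        echo "$1${2:+:$2}" ;; # prepend; no trailing colon when PATH is empty
  esac
}

PATH="$(add_to_path /usr/local/cuda/bin "$PATH")"
export PATH
```

To keep the setting across logins, the same lines could go in a shell startup file such as ~/.bashrc.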
[Optional] Install NVIDIA Container Toolkit
sudo dnf config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
sudo dnf clean expire-cache
sudo dnf install -y nvidia-container-toolkit
Container engine configuration
Refer to NVIDIA Container Toolkit site for container engine configuration instructions.
Docker
To install and configure Docker:
sudo dnf install -y docker
sudo usermod -aG docker ec2-user
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
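Before starting a container, you can confirm that the configure step registered the runtime. The sketch below assumes nvidia-ctk wrote to the default Docker configuration file, /etc/docker/daemon.json, and does a plain text match rather than parsing the JSON. The helper name has_nvidia_runtime is illustrative.

```shell
# Sketch: check that a Docker daemon.json mentions the nvidia runtime.
# has_nvidia_runtime is a hypothetical helper name; this is a text match, not a JSON parse.
has_nvidia_runtime() {
  # $1 = path to daemon.json
  [ -r "$1" ] && grep -q '"nvidia"' "$1"
}

if has_nvidia_runtime /etc/docker/daemon.json; then
  echo "nvidia runtime is registered in /etc/docker/daemon.json"
else
  echo "nvidia runtime not found in /etc/docker/daemon.json" >&2
fi
```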
Verify Docker engine configuration
To verify the Docker configuration, run nvidia-smi inside a container:
sudo docker run --rm --runtime=nvidia --gpus all public.ecr.aws/amazonlinux/amazonlinux:2023 nvidia-smi
Your output should be similar to the following:
Unable to find image 'public.ecr.aws/amazonlinux/amazonlinux:2023' locally
2023: Pulling from amazonlinux/amazonlinux
b60b6c892280: Pull complete
Digest: sha256:58b239881342c76bc01ac4a2bf442e76b7f816a136aa3bd9e48eecb8892cd171
Status: Downloaded newer image for public.ecr.aws/amazonlinux/amazonlinux:2023
Sun Aug 4 01:02:03 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03 Driver Version: 560.28.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA T4G Off | 00000000:00:1F.0 Off | 0 |
| N/A 74C P0 31W / 70W | 1MiB / 15360MiB | 8% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Install NVIDIA driver, CUDA toolkit and Container Toolkit on EC2 instance at launch
To install all of the above, including Docker, when launching a new AL2023 GPU instance, use the following user data script.
#!/bin/bash
# Update the OS and install the packages needed to build the kernel module
sudo dnf update -y
sudo dnf install -y dkms kernel-devel kernel-modules-extra
cd /tmp
if (arch | grep -q x86); then
  # x86_64: install the driver and toolkit from the NVIDIA repository
  sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/amzn2023/x86_64/cuda-amzn2023.repo
  sudo dnf clean expire-cache
  sudo dnf module install -y nvidia-driver:latest-dkms
  sudo dnf install -y cuda-toolkit
else
  # arm64: install the driver and toolkit with the runfile installer
  sudo dnf install -y vulkan-devel libglvnd-devel elfutils-libelf-devel
  wget https://developer.download.nvidia.com/compute/cuda/12.6.0/local_installers/cuda_12.6.0_560.28.03_linux_sbsa.run
  chmod +x ./cuda*.run
  sudo ./cuda_*.run --driver --toolkit --tmpdir=/var/tmp --silent
fi
# Verify the driver and the toolkit
nvidia-smi
/usr/local/cuda/bin/nvcc --version
# Install the NVIDIA Container Toolkit
sudo dnf config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
sudo dnf clean expire-cache
sudo dnf install -y nvidia-container-toolkit
# Install Docker and register the NVIDIA runtime
sudo dnf install -y docker
sudo usermod -aG docker ec2-user
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify that containers can access the GPU
sudo docker run --rm --runtime=nvidia --gpus all public.ecr.aws/amazonlinux/amazonlinux:2023 nvidia-smi
To verify installation status, connect to your EC2 instance and view the contents of /var/log/cloud-init-output.log.
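The log check can also be scripted. The sketch below looks for the "Driver Version:" text that nvidia-smi prints in its banner when it succeeds. It assumes the current user can read the log; if not, run it with sudo. The helper name log_shows_driver is illustrative.

```shell
# Sketch: look for a successful nvidia-smi banner in the cloud-init output log.
# log_shows_driver is a hypothetical helper name.
log_shows_driver() {
  # $1 = path to the log file; "Driver Version:" appears only in a successful banner
  grep -q 'Driver Version:' "$1" 2>/dev/null
}

log=/var/log/cloud-init-output.log
if log_shows_driver "$log"; then
  echo "nvidia-smi banner found in $log"
else
  echo "No nvidia-smi banner in $log yet; inspect it with: sudo tail -n 50 $log" >&2
fi
```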