Install NVIDIA GPU driver, CUDA Toolkit, and NVIDIA Container Toolkit on Amazon EC2 instances running RHEL/Rocky Linux 8/9/10
Steps to install the NVIDIA driver, CUDA Toolkit, NVIDIA Container Toolkit, and other NVIDIA software from the NVIDIA repository on RHEL/Rocky 8/9/10 (x86_64/arm64)
Overview
This article describes how to install the NVIDIA GPU driver, CUDA Toolkit, NVIDIA Container Toolkit, and other NVIDIA software directly from the NVIDIA repository on NVIDIA GPU EC2 instances running RHEL (Red Hat Enterprise Linux) or Rocky Linux.
Note that by using this method, you agree to the NVIDIA Driver License Agreement, End User License Agreement, and other related license agreements. If you are doing development, you may want to register for the NVIDIA Developer Program.
This article applies to RHEL/Rocky Linux on AWS only. Similar articles are available for AL2, AL2023, Ubuntu, and Windows.
This article installs the NVIDIA Tesla driver, which does not support the G6f and Gr6f instance types.
Other Options
If you need AMIs preconfigured with the NVIDIA GPU driver, CUDA, other NVIDIA software, and optionally the PyTorch or TensorFlow framework, consider AWS Deep Learning AMIs. Refer to Release notes for DLAMIs for currently supported options.
Refer to NVIDIA drivers for your Amazon EC2 instance for NVIDIA driver install options, and to the NVIDIA Driver Installation Guide for Tesla driver installation instructions.
For container workloads, consider Amazon ECS-optimized Linux AMIs and Amazon EKS optimized AMIs.
Note: the instructions in this article are not applicable to pre-built AMIs.
Custom ECS GPU-optimized AMI
If you wish to build your own custom Amazon ECS GPU-optimized AMI, install the NVIDIA driver, Docker, the NVIDIA Container Toolkit, and other related software (see the sketch below), and refer to How do I create and use custom AMIs in Amazon ECS?
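A minimal sketch of the ECS-specific step only, assuming the NVIDIA driver, Docker, and NVIDIA Container Toolkit have been installed per this article and that the Amazon ECS container agent is present; ECS_ENABLE_GPU_SUPPORT is the agent setting that enables GPU support.
# Enable GPU support for the Amazon ECS container agent (sketch; agent installation not shown)
sudo mkdir -p /etc/ecs
echo "ECS_ENABLE_GPU_SUPPORT=true" | sudo tee -a /etc/ecs/ecs.config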
About CUDA toolkit
The CUDA Toolkit is generally optional when a GPU instance is used to run applications (as opposed to develop them), because a CUDA application typically packages the CUDA runtime and the libraries it needs by statically or dynamically linking against them.
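For example, you can inspect how an application links against the CUDA runtime (./my_cuda_app below is a placeholder for your own binary; a dynamically linked application lists libcudart.so here, while a statically linked one may show no CUDA runtime entry at all):
ldd ./my_cuda_app | grep -i cuda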
System Requirements
NVIDIA CUDA supports the following platforms
- Red Hat Enterprise Linux (RHEL) 10 (x86_64 and arm64)
- Red Hat Enterprise Linux (RHEL) 9 (x86_64 and arm64)
- Red Hat Enterprise Linux (RHEL) 8 (x86_64 and arm64)
- Rocky Linux 10 (x86_64)
- Rocky Linux 9 (x86_64)
- Rocky Linux 8 (x86_64)
While it may work, NVIDIA does not support Rocky Linux on the arm64 architecture or other RHEL-compatible Linux distributions such as AlmaLinux. Refer to the Driver installation guide for supported kernel versions, compilers, and libraries.
Prerequisites
Go to the Service Quotas console in your desired Region to verify the On-Demand Instance quota value for your desired instance type:
- G instance types: Running On-Demand G and VT instances
- P instance types: Running On-Demand P instances
Request a quota increase if the assigned value is less than the vCPU count of your desired EC2 instance size. Do not proceed until your applied quota value is equal to or higher than the vCPU count of your instance type.
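You can also check the applied quota values with the AWS CLI (a hedged example; it filters by the quota names listed above and requires AWS credentials and a default Region to be configured):
aws service-quotas list-service-quotas --service-code ec2 \
  --query "Quotas[?contains(QuotaName, 'On-Demand G and VT') || contains(QuotaName, 'On-Demand P')].[QuotaName,Value]" \
  --output table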
Prepare Rocky Linux / RHEL
Launch a new NVIDIA GPU instance, preferably with at least 20 GB of storage, and connect to the instance.
Update the OS, add the EPEL repository, and install DKMS, kernel headers, and development packages:
sudo dnf update -y
OS_VERSION=$(. /etc/os-release;echo $VERSION_ID | sed -e 's/\..*//g')
if ( cat /etc/os-release | grep -q Red ); then
sudo subscription-manager repos --enable codeready-builder-for-rhel-$OS_VERSION-$(arch)-rpms
elif ( echo $OS_VERSION | grep -q 8 ); then
sudo dnf config-manager --set-enabled powertools
else
sudo dnf config-manager --set-enabled crb
fi
sudo dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-$OS_VERSION.noarch.rpm
sudo dnf install -y dkms kernel-devel kernel-modules-extra amazon-ec2-utils unzip gcc make vulkan-devel libglvnd-devel elfutils-libelf-devel
sudo systemctl daemon-reload
sudo systemctl enable dkms
Restart your EC2 instance if the kernel was updated
sudo reboot
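If you prefer to reboot only when a newer kernel was actually installed, a check along these lines can be used (a sketch; it compares the running kernel with the most recently installed kernel package):
# Reboot only if the running kernel differs from the newest installed kernel (sketch)
LATEST_KERNEL=$(rpm -q --last kernel | head -1 | awk '{print $1}' | sed 's/^kernel-//')
if [ "$(uname -r)" != "$LATEST_KERNEL" ]; then
  sudo reboot
fi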
Add NVIDIA repository
Configure Network Repo installation
DISTRO=$(. /etc/os-release;echo rhel$VERSION_ID | sed -e 's/\..*//g')
if (arch | grep -q x86); then
ARCH=x86_64
else
ARCH=sbsa
fi
sudo dnf config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/$DISTRO/$ARCH/cuda-$DISTRO.repo
sudo dnf clean expire-cache
If you are installing from an AWS China Region, you may be able to change the repository source from https://developer.download.nvidia.com to https://developer.download.nvidia.cn
if (ec2-metadata -z | grep cn-); then
sudo sed -i "s/nvidia\.com/nvidia\.cn/g" /etc/yum.repos.d/cuda-rhel*.repo
sudo dnf clean expire-cache
fi
Install NVIDIA Driver
To install latest Tesla driver
sudo dnf module enable -y nvidia-driver:open-dkms
sudo dnf install -y nvidia-open
sudo dnf install -y nvidia-xconfig
To install a specific driver branch, e.g. R570 production
sudo dnf module enable -y nvidia-driver:570-open
sudo dnf install -y nvidia-open
sudo dnf install -y nvidia-xconfig
The above installs the open-source GPU kernel module, which NVIDIA recommends (and which is different from the Nouveau open-source driver). Refer to the Driver Installation Guide about NVIDIA Kernel Modules and installation options.
Alternatively, pre-compiled RHEL kernel modules may be available
sudo dnf module enable -y nvidia-driver:latest
sudo dnf install -y nvidia-driver nvidia-driver-cuda
sudo dnf install -y nvidia-xconfig
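To check which kernel module flavor ended up installed (a quick check, not part of the NVIDIA instructions): the open kernel module reports a Dual MIT/GPL license, while the proprietary module reports NVIDIA.
modinfo -F license nvidia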
Verify
Restart your instance
nvidia-smi
Output should be similar to below
Sun Aug 10 03:03:55 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06 Driver Version: 580.65.06 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 Off | 00000000:00:1E.0 Off | 0 |
| N/A 23C P8 10W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Optional: CUDA Toolkit
To install latest CUDA Toolkit
sudo dnf install -y cuda-toolkit
To install a specific series, e.g. 12.x
sudo dnf install -y cuda-toolkit-12
To install a specific version, e.g. 12.9
sudo dnf install -y cuda-toolkit-12-9
Refer to CUDA Toolkit documentation about supported platforms and installation options
Verify
/usr/local/cuda/bin/nvcc -V
Output should be similar to below
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jul_16_07:30:01_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.48
Build cuda_13.0.r13.0/compiler.36260728_0
Post-installation Actions
Refer to NVIDIA CUDA Installation Guide for Linux for post-installation actions before CUDA Toolkit can be used. For example, you may want to modify your PATH environment variable to include /usr/local/cuda/bin.
if (cat /etc/os-release | grep -q Rocky); then
USER="rocky"
else
USER="ec2-user"
fi
sed -i '$aexport PATH=\"\$PATH:/usr/local/cuda/bin\"' /home/$USER/.bashrc
. /home/$USER/.bashrc
For a runfile installation, modify LD_LIBRARY_PATH to include /usr/local/cuda/lib64 (or lib, depending on the installation)
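For example (the library path is an assumption; adjust it to match your runfile install location):
echo 'export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"' >> ~/.bashrc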
Optional: NVIDIA Container Toolkit
The NVIDIA Container Toolkit supports RHEL 8 and 9 (but not 10) on both x86_64 and arm64. For arm64, use a g5g.2xlarge or larger instance size, as g5g.xlarge may cause failures due to its limited system memory.
To install latest NVIDIA Container Toolkit
if (! dnf search nvidia | grep -q nvidia-container-toolkit); then
sudo dnf config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
fi
sudo dnf install -y nvidia-container-toolkit
Refer to NVIDIA Container toolkit documentation about supported platforms, prerequisites and installation options
Verify
nvidia-container-cli -V
Output should be similar to below
cli-version: 1.17.8
lib-version: 1.17.8
build date: 2025-05-30T13:47+0000
build revision: 6eda4d76c8c5f8fc174e4abca83e513fb4dd63b0
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
Container engine configuration
Refer to NVIDIA Container Toolkit documentation about container engine configuration.
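Docker configuration is covered below. If you use containerd instead (for example, on self-managed Kubernetes nodes), the toolkit can configure it in a similar way; this assumes containerd is already installed:
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd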
Install and configure Docker
To install and configure Docker
if (cat /etc/os-release | grep -q Rocky); then
USER="rocky"
else
USER="ec2-user"
fi
sudo dnf config-manager --add-repo https://download.docker.com/linux/rhel/docker-ce.repo
sudo dnf install -y docker-ce docker-ce-cli containerd.io
sudo usermod -aG docker $USER
sudo systemctl enable docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify Docker engine configuration
To verify docker configuration
sudo docker run --rm --runtime=nvidia --gpus all public.ecr.aws/docker/library/rockylinux:9 nvidia-smi
Output should be similar to below
Unable to find image 'public.ecr.aws/docker/library/rockylinux:9' locally
9: Pulling from docker/library/rockylinux
446f83f14b23: Pull complete
Digest: sha256:d7be1c094cc5845ee815d4632fe377514ee6ebcf8efaed6892889657e5ddaaa6
Status: Downloaded newer image for public.ecr.aws/docker/library/rockylinux:9
Sun Aug 10 03:05:05 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06 Driver Version: 580.65.06 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 Off | 00000000:00:1E.0 Off | 0 |
| N/A 24C P8 13W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Install on EC2 instance at launch
To install the NVIDIA driver and NVIDIA Container Toolkit, including Docker, when launching a new GPU instance with at least 20 GB of storage, you can use the following as a user data script.
Remove the # characters (except on the first line) if you wish to install the CUDA Toolkit
#!/bin/bash
sudo dnf update -y
OS_VERSION=$(. /etc/os-release;echo $VERSION_ID | sed -e 's/\..*//g')
if ( cat /etc/os-release | grep -q Red ); then
sudo subscription-manager repos --enable codeready-builder-for-rhel-$OS_VERSION-$(arch)-rpms
elif ( echo $OS_VERSION | grep -q 8 ); then
sudo dnf config-manager --set-enabled powertools
else
sudo dnf config-manager --set-enabled crb
fi
sudo dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-$OS_VERSION.noarch.rpm
sudo dnf install -y dkms kernel-devel kernel-devel-$(uname -r) kernel-modules-extra kernel-modules-extra-$(uname -r) unzip gcc make vulkan-devel libglvnd-devel elfutils-libelf-devel
sudo systemctl daemon-reload
sudo systemctl enable dkms
DISTRO=$(. /etc/os-release;echo rhel$VERSION_ID | sed -e 's/\..*//g')
if (arch | grep -q x86); then
ARCH=x86_64
else
ARCH=sbsa
fi
sudo dnf config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/$DISTRO/$ARCH/cuda-$DISTRO.repo
sudo dnf clean expire-cache
sudo dnf module enable -y nvidia-driver:open-dkms
sudo dnf install -y nvidia-open
sudo dnf install -y nvidia-xconfig
# sudo dnf install -y cuda-toolkit
# if (cat /etc/os-release | grep -q Rocky); then
# USER="rocky"
# else
# USER="ec2-user"
# fi
# sed -i '$aexport PATH=\"\$PATH:/usr/local/cuda/bin\"' /home/$USER/.bashrc
# . /home/$USER/.bashrc
if (cat /etc/os-release | grep -q Rocky); then
USER="rocky"
else
USER="ec2-user"
fi
sudo dnf config-manager --add-repo https://download.docker.com/linux/rhel/docker-ce.repo
sudo dnf install -y docker-ce docker-ce-cli containerd.io
sudo systemctl enable docker
sudo usermod -aG docker $USER
if (! dnf search nvidia | grep -q nvidia-container-toolkit); then
sudo dnf config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
fi
sudo dnf install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
sudo reboot
Verify
Connect to your EC2 instance
nvidia-smi
/usr/local/cuda/bin/nvcc -V
nvidia-container-cli -V
sudo docker run --rm --runtime=nvidia --gpus all public.ecr.aws/docker/library/rockylinux:9 nvidia-smi
View /var/log/cloud-init-output.log to troubleshoot any installation issues.
Perform post-installation actions in order to use the CUDA Toolkit. To verify the integrity of the installation, you can download, compile, and run CUDA samples such as deviceQuery, as sketched below.
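A possible sequence for deviceQuery (a sketch; the cuda-samples repository layout, build system, and target names change between releases, so check the repository README):
sudo dnf install -y git cmake
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples
# Target name and output location are assumptions; older releases ship per-sample Makefiles instead
cmake -B build && cmake --build build --target deviceQuery
find build -name deviceQuery -type f -exec {} \;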
If Docker and NVIDIA container toolkit (but not CUDA toolkit) are installed and configured, you can use CUDA samples container image to validate CUDA driver.
sudo docker run --rm --runtime=nvidia --gpus all nvcr.io/nvidia/k8s/cuda-sample:devicequery
GUI (graphical desktop) remote access
If you need remote graphical desktop access, refer to Install GUI (graphical desktop) on Amazon EC2 instances running RHEL/Rocky Linux 8/9?
Note that this article installs the NVIDIA Tesla driver (also known as the NVIDIA Data Center driver), which is intended primarily for GPU compute workloads. If configured in xorg.conf, Tesla drivers support one display of up to 2560x1600 resolution.
GRID drivers provide access to four 4K displays per GPU and are certified to provide optimal performance for professional visualization applications. AMIs preconfigured with GRID drivers are available from AWS Marketplace. You can also consider using amazon-ec2-nice-dcv-samples CloudFormation templates to provision your own EC2 instances with either NVIDIA Tesla or GRID driver, Docker with NVIDIA Container Toolkit, graphical desktop environment and Amazon DCV remote display protocol server.
Other software
AWS CLI
To install AWS CLI (AWS Command Line Interface) v2 through Snap
sudo dnf install -y snapd
sudo systemctl enable --now snapd snapd.socket
sudo ln -s /var/lib/snapd/snap /snap
sudo snap install aws-cli --classic
Verify
Log off and log in so that your PATH variables are updated correctly.
aws --version
Output should be similar to below
aws-cli/2.27.53 Python/3.13.4 Linux/5.14.0-570.39.1.el9_6.x86_64 exe/x86_64.rhel.9
SSM Agent
To install SSM agent for Session Manager access
if (arch | grep -q x86); then
sudo dnf install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
else
sudo dnf install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_arm64/amazon-ssm-agent.rpm
fi
This requires the EC2 instance to have an attached IAM role with the AmazonSSMManagedInstanceCore managed policy.
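For example, an instance profile can be attached to a running instance with the AWS CLI (the instance ID and profile name below are placeholders):
aws ec2 associate-iam-instance-profile \
  --instance-id i-0123456789abcdef0 \
  --iam-instance-profile Name=MySSMInstanceProfile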
EC2 Instance Connect
To install EC2 Instance Connect for secure SSH access
cd /tmp
if (arch | grep -q x86); then
ARCH=amd64
else
ARCH=arm64
fi
if ( cat /etc/os-release | grep -q 8\. ); then
curl -L -O https://amazon-ec2-instance-connect-us-west-2.s3.us-west-2.amazonaws.com/latest/linux_$ARCH/ec2-instance-connect.rhel8.rpm
curl -L -O https://amazon-ec2-instance-connect-us-west-2.s3.us-west-2.amazonaws.com/latest/linux_amd64/ec2-instance-connect-selinux.noarch.rpm
sudo dnf install -y ./ec2-instance-connect.rhel8.rpm ./ec2-instance-connect-selinux.noarch.rpm
else
curl -L -O https://amazon-ec2-instance-connect-us-west-2.s3.us-west-2.amazonaws.com/latest/linux_$ARCH/ec2-instance-connect.rpm
curl -L -O https://amazon-ec2-instance-connect-us-west-2.s3.us-west-2.amazonaws.com/latest/linux_amd64/ec2-instance-connect-selinux.noarch.rpm
sudo dnf install -y ./ec2-instance-connect.rpm ./ec2-instance-connect-selinux.noarch.rpm
fi
sudo systemctl restart sshd
Allow inbound SSH traffic in your security group
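For example, with the AWS CLI (the security group ID and CIDR range below are placeholders; restrict the source range to your own network):
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 22 \
  --cidr 203.0.113.0/24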
cuDNN (CUDA Deep Neural Network library)
To install cuDNN for the latest available CUDA version.
sudo dnf install -y zlib cudnn
Refer to cuDNN documentation about installation options and support matrix
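To check which cuDNN packages and version were installed (a simple package query, not an NVIDIA-documented verification step):
rpm -qa | grep -i cudnn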
NCCL (NVIDIA Collective Communication Library)
To install latest NCCL
sudo dnf install -y libnccl libnccl-devel libnccl-static
Refer to NCCL documentation about installation options
DCGM (Data Center GPU Manager)
To install DCGM
CUDA_VERSION=$(nvidia-smi | sed -E -n 's/.*CUDA Version: ([0-9]+)[.].*/\1/p')
sudo dnf install --assumeyes \
--setopt=install_weak_deps=True \
datacenter-gpu-manager-4-cuda${CUDA_VERSION}
Refer to DCGM documentation for more information
Verify
dcgmi --version
Output should be similar to below
dcgmi version: 4.4.1
GDS (GPUDirect Storage)
To install NVIDIA Magnum IO GPUDirect® Storage (GDS)
sudo dnf install -y nvidia-gds
To install for a specific CUDA version, e.g. 13.0
sudo dnf install -y nvidia-gds-13-0
Reboot
Reboot after installation is complete
sudo reboot
Verify
To verify module
lsmod | grep nvidia_fs
Output should be similar to below
nvidia_fs 323584 0
nvidia 11579392 3 nvidia_uvm,nvidia_fs,nvidia_modeset
To verify successful installation
/usr/local/cuda/gds/tools/gdscheck -p
Output should be similar to below
GDS release version: 1.15.1.6
nvidia_fs version: 2.26 libcufile version: 2.12
Platform: x86_64
...
...
=========
GPU INFO:
=========
GPU index 0 NVIDIA A10G bar:1 bar size (MiB):32768 supports GDS, IOMMU State: Disabled
==============
PLATFORM INFO:
==============
IOMMU: disabled
Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
Cuda Driver Version Installed: 13000
Platform: g5.xlarge, Arch: x86_64(Linux 5.14.0-570.39.1.el9_6.x86_64)
Platform verification succeeded
Refer to GDS documentation and Driver installation guide for more information
GDRCopy
Magnum IO GDRCopy packages for different CUDA versions can be installed from the NVIDIA Developer download site. Alternatively, download and compile from GitHub, as sketched below.
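A build from source might look like the following (a sketch based on the GDRCopy README; exact make targets, options, and prerequisites may differ between releases, and the CUDA Toolkit plus kernel headers are required):
git clone https://github.com/NVIDIA/gdrcopy.git
cd gdrcopy
# Build and install the library, tools, and gdrdrv kernel module (targets and paths are assumptions)
sudo make prefix=/usr/local CUDA=/usr/local/cuda all install
# Load the gdrdrv kernel module
sudo ./insmod.sh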
Restart your EC2 instance
sudo reboot
Verify
lsmod | grep gdr
Output should be similar to below
gdrdrv 28672 0
nvidia 14376960 7 nvidia_uvm,gdrdrv,nvidia_modeset
Fabric Manager
P6 instances require additional configuration as described in the EC2 and NVIDIA documentation.
To install the latest NVIDIA Fabric Manager
sudo dnf install -y nvidia-fabricmanager
sudo systemctl enable nvidia-fabricmanager
To install specific version, e.g. 570
sudo dnf install -y nvidia-fabricmanager-570
sudo systemctl enable nvidia-fabricmanager
Restart your EC2 instance
sudo reboot
Verify
nv-fabricmanager -v
systemctl status nvidia-fabricmanager
Output should be similar to below
Fabric Manager version is : 580.95.05
● nvidia-fabricmanager.service - NVIDIA fabric manager service
Loaded: loaded (/usr/lib/systemd/system/nvidia-fabricmanager.service; enabled; preset: enabled)
Active: active (running) since ......... UTC; 1min 4s ago
Process: 22851 ExecStart=/usr/bin/nvidia-fabricmanager-start.sh --mode start (code=exited, status=0/SUCCESS)
Main PID: 22881 (nv-fabricmanage)
Tasks: 18 (limit: 3355442)
Memory: 38.1M
CPU: 633ms
CGroup: /system.slice/nvidia-fabricmanager.service
└─22881 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg
.........compute.internal nv-fabricmanager[22881]: Starting nvidia-fabricmanager.service - NVIDIA fabric manager service...
.........compute.internal nv-fabricmanager[22881]: Detected Pre-NVL5 system
.........compute.internal nv-fabricmanager[22881]: Connected to 1 node.
.........compute.internal nv-fabricmanager[22881]: Successfully configured all the available NVSwitches to route GPU NVLink traffic. NVLink Peer-to-Peer support will be enabled once the GPUs are successfully registered with the NVLink fabric.
.........compute.internal nv-fabricmanager[22881]: Started "Nvidia Fabric Manager"
.........compute.internal nv-fabricmanager[22881]: Started nvidia-fabricmanager.service - NVIDIA fabric manager service.
To view GPU fabric registration status
nvidia-smi -q -i 0 | grep -i -A 2 Fabric
Output should be similar to below after the GPU has been successfully registered
Fabric
State : Completed
Status : Success
Refer to Fabric Manager documentation for supported platforms, and any additional installation or configuration steps