
Install NVIDIA GPU driver, CUDA toolkit, NVIDIA Container Toolkit on Amazon EC2 instances running Amazon Linux 2023 (AL2023)

17 minute read
Content level: Expert

Steps to install NVIDIA driver, CUDA Toolkit, NVIDIA Container Toolkit, and other NVIDIA software on AL2023 (Amazon Linux 2023) (x86_64/arm64)

Overview

This article describes how to install NVIDIA Data Center GPU Driver, CUDA Toolkit, NVIDIA Container Toolkit and other NVIDIA software from the NVIDIA repository on NVIDIA GPU EC2 instances running AL2023 (Amazon Linux 2023).

Note that by using this method, you agree to the NVIDIA Driver License Agreement, End User License Agreement and other related license agreements. If you are doing development, you may want to register for the NVIDIA Developer Program.

This article applies to AL2023 only. Similar articles are available for AL2, Ubuntu Linux, RHEL/Rocky Linux/AlmaLinux and Windows.

This article installs the NVIDIA Tesla driver, which does not support G6f instances with fractional GPUs. Refer to this article for NVIDIA GRID driver installation.

Other Options

If you need AMIs preconfigured with NVIDIA GPU driver, CUDA, other NVIDIA software, and optionally PyTorch or TensorFlow framework, consider AWS Deep Learning AMIs. Refer to Release notes for DLAMIs for currently supported options, and Deep Learning graphical desktop on Amazon Linux 2023 (AL2023) with AWS Deep Learning AMI (DLAMI) for graphical desktop setup guidance.

Refer to NVIDIA drivers for your Amazon EC2 instance for NVIDIA driver install options and NVIDIA Driver Installation Guide for Tesla driver installation instructions.

For container workloads, consider Amazon ECS-optimized Linux AMIs and Amazon EKS optimized AMIs.

Note: instructions in this article are not applicable to pre-built AMIs.

Custom ECS/EKS GPU-optimized AMI

To build your own custom Amazon ECS or EKS GPU-optimized AMI, install the NVIDIA driver, Docker and NVIDIA Container Toolkit, then refer to How do I create and use custom AMIs in Amazon ECS? or How do I create custom Amazon Linux AMIs for Amazon EKS?

About CUDA toolkit

Because the CUDA driver is part of the NVIDIA GPU driver, CUDA Toolkit is generally optional when the GPU instance is used to run applications rather than to develop them. A CUDA application typically packages the CUDA runtime and libraries it needs, by linking against them statically or dynamically.

Version support

CUDA versions 12.5 and higher support Amazon Linux 2023 package manager installation on x86_64.

CUDA version 12.9 and NVIDIA driver 570.148.08 add arm64 support.

NVIDIA driver versions 560 to 575 from the NVIDIA repository support compute-only / headless mode but not desktop mode.
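
The version minimums above can be checked in a script before adding the repository. The following is a minimal sketch, not an official check: the function names `version_ge` and `cuda_repo_supported` are hypothetical, and it assumes `sort -V` (GNU coreutils) is available.

```bash
# Returns 0 when version $1 is greater than or equal to version $2.
# Relies on `sort -V` for version-aware ordering (12.10 sorts after 12.9).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# cuda_repo_supported ARCH CUDA_VERSION
# Encodes the minimums stated above: 12.5 on x86_64, 12.9 on arm64.
cuda_repo_supported() {
  case "$1" in
    x86_64)        version_ge "$2" 12.5 ;;
    aarch64|arm64) version_ge "$2" 12.9 ;;
    *)             return 1 ;;
  esac
}
```

For example, `cuda_repo_supported "$(arch)" 12.9 && echo supported` prints `supported` on both architectures, while the same call with `12.8` succeeds only on x86_64.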

Prerequisites

Go to Service Quotas console of your desired Region to verify On-Demand Instance quota value of your desired instance type:


Request a quota increase if the assigned value is less than the vCPU count of your desired EC2 instance size. Do not proceed until your applied quota value is equal to or higher than the vCPU count of your instance type.

Prepare Amazon Linux 2023

Launch a new NVIDIA GPU instance running Amazon Linux 2023, preferably with at least 20 GB of storage.


Connect to the instance as ec2-user

Update OS

Apply the latest package updates

sudo dnf update -y

Optional: upgrade to the latest release version (if available) and disable deterministic upgrades by setting releasever to latest

sudo dnf upgrade --releasever=latest
echo latest | sudo tee /etc/dnf/vars/releasever

Restart your EC2 instance

sudo reboot

Install DKMS and kernel headers

sudo dnf clean all
sudo dnf install -y dkms 
sudo systemctl enable --now dkms

K_VER=$(uname -r)
K_MAJOR_VER=$(echo $K_VER | cut -d. -f1-2)
case $K_VER in
  6.1.*)
    sudo dnf install -y kernel-headers-$(uname -r) kernel-devel-$(uname -r) --allowerasing
    sudo dnf install -y kernel-modules-extra-$(uname -r) --allowerasing
    ;;
  *)
    sudo dnf install -y kernel$K_MAJOR_VER-headers-$(uname -r) kernel$K_MAJOR_VER-devel-$(uname -r) --allowerasing
    sudo dnf install -y kernel$K_MAJOR_VER-modules-extra-$(uname -r) --allowerasing
    ;;
esac
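
The package-name selection above can also be expressed as a small helper, which makes the naming rule easier to test without installing anything. This is a sketch under the same assumption the commands above make (6.1 kernels use unversioned package names, other kernel series use a `kernelMAJOR.MINOR` prefix); the function name `kernel_build_packages` is hypothetical.

```bash
# kernel_build_packages KERNEL_RELEASE
# Prints the headers, devel and modules-extra package names for a kernel release.
kernel_build_packages() {
  kver="$1"
  case "$kver" in
    # 6.1 kernels use the plain "kernel-" prefix
    6.1.*) prefix="kernel" ;;
    # other series embed major.minor in the package name, e.g. kernel6.12
    *)     prefix="kernel$(printf '%s' "$kver" | cut -d. -f1-2)" ;;
  esac
  printf '%s-headers-%s %s-devel-%s %s-modules-extra-%s\n' \
    "$prefix" "$kver" "$prefix" "$kver" "$prefix" "$kver"
}
```

Usage on an instance could look like `sudo dnf install -y --allowerasing $(kernel_build_packages "$(uname -r)")`.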

Add repository

You can choose either the NVIDIA or the AL2023 repository.

Option 1: NVIDIA repo (x86_64 and arm64)

if (arch | grep -q x86); then
  ARCH=x86_64
else
  ARCH=sbsa
fi
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/amzn2023/$ARCH/cuda-amzn2023.repo
sudo dnf clean expire-cache

If you are installing from AWS China Region, you may be able to change repository source from https://developer.download.nvidia.com to https://developer.download.nvidia.cn

if (ec2-metadata -z | grep -q cn-); then
  sudo sed -i "s/nvidia\.com/nvidia\.cn/g" /etc/yum.repos.d/cuda-amzn2023.repo
  sudo dnf clean expire-cache
fi
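
The architecture and Region logic for Option 1 can be collected into one helper that only builds the repository URL. This is a sketch: the function name `cuda_repo_url` is hypothetical, and it assumes the .cn host serves the same path as the .com host, which you should verify before relying on it.

```bash
# cuda_repo_url MACHINE [AVAILABILITY_ZONE]
# MACHINE is the output of `arch`; AVAILABILITY_ZONE is e.g. cn-north-1a.
cuda_repo_url() {
  case "$1" in
    x86*) repo_arch=x86_64 ;;
    *)    repo_arch=sbsa ;;      # arm64 repositories are published as "sbsa"
  esac
  case "${2:-}" in
    cn-*) repo_host=developer.download.nvidia.cn ;;   # assumption: mirror has the same layout
    *)    repo_host=developer.download.nvidia.com ;;
  esac
  printf 'https://%s/compute/cuda/repos/amzn2023/%s/cuda-amzn2023.repo\n' \
    "$repo_host" "$repo_arch"
}
```

For example, `sudo dnf config-manager --add-repo "$(cuda_repo_url "$(arch)")"` adds the default repository for the current architecture.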

Option 2: AL2023 repo (x86_64 only)

The nvidia-release package was added in the AL2023 2023.6.20241031 release and enables a repository with NVIDIA drivers.

sudo dnf install -y nvidia-release

Install NVIDIA driver

Option 1: NVIDIA repo (x86_64 and arm64)

To install latest Tesla driver

sudo dnf module enable -y nvidia-driver:open-dkms
sudo dnf install -y nvidia-open 
sudo dnf install -y nvidia-xconfig

To install a specific driver branch, e.g. R580 LTSB

sudo dnf module enable -y nvidia-driver:580-open
sudo dnf install -y nvidia-open 
sudo dnf install -y nvidia-xconfig

Refer to Version Locking if you want to lock NVIDIA driver branch.

Option 2: AL2023 repo (x86_64 only)

sudo dnf install -y nvidia-open
sudo dnf install -y nvidia-xconfig

The commands above install the open-source GPU kernel modules, which are recommended by NVIDIA (and are different from the Nouveau open-source driver). Refer to the Driver Installation Guide for details about NVIDIA kernel modules and installation options.

P instance

If you are using a P instance with multiple GPUs, you may need to install Fabric Manager. Refer to UFM (Unified Fabric Manager) section below for details.

Compute-only and Desktop Installation

NVIDIA supports custom installation for the following configurations:

  • Desktop: Contains all the X/Wayland drivers and libraries to allow running a GPU with power management enabled on a desktop system but does not include any CUDA component
  • Compute-only / headless: Contains everything required to run CUDA applications on a GPU system where the GPU is not used to drive a display
  • Desktop and Compute: canonical way of installing the driver, with every possible library and display component. This might be required in cross functional combinations, for CUDA-accelerated video encoding/decoding.

To install for the above cases:

  • Desktop only: sudo dnf install -y nvidia-driver kmod-nvidia-open-dkms
  • Compute-only/headless: sudo dnf install -y nvidia-driver-cuda kmod-nvidia-open-dkms
  • Desktop and Compute: sudo dnf install -y nvidia-open

Refer to NVIDIA Driver Installation Guide for more information.
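
The three configurations listed above map to fixed package sets, so a script can select them by name. A minimal sketch follows; the function name `driver_packages_for` and the mode labels `desktop`, `compute` and `both` are hypothetical, while the package names are the ones listed above.

```bash
# driver_packages_for MODE
# Prints the packages to pass to `dnf install` for the chosen configuration.
driver_packages_for() {
  case "$1" in
    desktop) echo "nvidia-driver kmod-nvidia-open-dkms" ;;
    compute) echo "nvidia-driver-cuda kmod-nvidia-open-dkms" ;;
    both)    echo "nvidia-open" ;;
    *)       echo "unknown mode: $1" >&2; return 1 ;;
  esac
}
```

For a headless compute instance this could be used as `sudo dnf install -y $(driver_packages_for compute)`.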

Verify

nvidia-smi

Output should be similar to below

Sat Dec 20 09:52:26 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.44.01              Driver Version: 590.44.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
|  0%   30C    P8             11W /  300W |       0MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Refer to the Verify installation integrity section for steps to verify CUDA driver integrity.
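
If you want to check the versions in a script rather than by eye, the header line of the nvidia-smi output can be parsed. This is a sketch that assumes the header layout shown above; the function name `smi_versions` is hypothetical.

```bash
# smi_versions: reads nvidia-smi text output on stdin and prints
# "DRIVER_VERSION CUDA_VERSION" taken from the header line.
smi_versions() {
  sed -E -n 's/.*Driver Version: ([0-9.]+) +CUDA Version: ([0-9.]+).*/\1 \2/p' | head -n 1
}
```

Running `nvidia-smi | smi_versions` on the instance shown above would print `590.44.01 13.1`. For the driver version alone, `nvidia-smi --query-gpu=driver_version --format=csv,noheader` avoids text parsing.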

Optional: Install CUDA toolkit

To install latest CUDA Toolkit

sudo dnf install -y cuda-toolkit

To install a specific series, e.g. 12.x

sudo dnf install -y cuda-toolkit-12

To install a specific version, e.g. 12.9

sudo dnf install -y cuda-toolkit-12-9

Refer to CUDA documentation for installation options.
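
The three package names above follow one pattern: the version is appended with dots replaced by hyphens. A small sketch of that rule, with the hypothetical function name `cuda_toolkit_package`:

```bash
# cuda_toolkit_package [VERSION]
# No argument -> latest toolkit; "12" -> series; "12.9" -> specific version.
cuda_toolkit_package() {
  if [ -z "${1:-}" ]; then
    echo cuda-toolkit
  else
    # dnf package names use hyphens where the CUDA version uses dots
    echo "cuda-toolkit-$(printf '%s' "$1" | tr . -)"
  fi
}
```

For example, `sudo dnf install -y "$(cuda_toolkit_package 12.9)"` installs `cuda-toolkit-12-9`.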

Verify

/usr/local/cuda/bin/nvcc -V

Output should be similar to below

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Nov__7_07:23:37_PM_PST_2025
Cuda compilation tools, release 13.1, V13.1.80
Build cuda_13.1.r13.1/compiler.36836380_0

Post-installation Actions

Refer to the NVIDIA CUDA Installation Guide for Linux for post-installation actions required before CUDA Toolkit can be used. For example, you may want to modify your PATH and LD_LIBRARY_PATH environment variables to include /usr/local/cuda/bin and /usr/local/cuda/lib64 respectively.

sed -i '$aexport PATH=$PATH:/usr/local/cuda/bin' ~/.bashrc
sed -i '$aexport LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64' ~/.bashrc
. ~/.bashrc
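
Running the sed commands more than once appends duplicate export lines. If the step may be repeated (for example in a re-run install script), an idempotent variant can help. This is a sketch; the function name `append_line_once` is hypothetical.

```bash
# append_line_once FILE LINE
# Appends LINE to FILE only if FILE does not already contain that exact line.
append_line_once() {
  # -x: match whole line, -F: treat LINE as a fixed string (it contains $)
  grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}
```

Usage: `append_line_once ~/.bashrc 'export PATH=$PATH:/usr/local/cuda/bin'`, and likewise for the LD_LIBRARY_PATH line. The single quotes keep `$PATH` unexpanded so that it is evaluated at login time.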

Optional: NVIDIA Container Toolkit

NVIDIA Container toolkit supports AL2023 on both x86_64 and arm64.

For arm64, use g5g.2xlarge or larger instance size as g5g.xlarge may cause failures due to the limited system memory.

if (! dnf search nvidia | grep -q nvidia-container-toolkit); then
  sudo dnf config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
  sudo dnf clean expire-cache
fi
sudo dnf install -y nvidia-container-toolkit

Refer to NVIDIA Container toolkit documentation about supported platforms, prerequisites and installation options

Verify Container Toolkit

nvidia-container-cli -V

Output should be similar to below

cli-version: 1.18.1
lib-version: 1.18.1
build date: 2025-11-24T14:45+0000
build revision: 889a3bb5408c195ed7897ba2cb8341c7d249672f
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

Container engine configuration

Refer to NVIDIA Container Toolkit site for container engine configuration instructions.

Install and configure Docker

To install and configure docker

sudo dnf install -y docker
sudo systemctl enable docker
sudo usermod -aG docker ec2-user

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
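
Before restarting Docker, you may want to confirm that the configure step registered the NVIDIA runtime. The sketch below assumes that `nvidia-ctk runtime configure --runtime=docker` writes to the default /etc/docker/daemon.json and references `nvidia-container-runtime` there; the function name `nvidia_runtime_configured` is hypothetical.

```bash
# nvidia_runtime_configured [DAEMON_JSON]
# Returns 0 if the Docker daemon config mentions the NVIDIA container runtime.
nvidia_runtime_configured() {
  grep -q 'nvidia-container-runtime' "${1:-/etc/docker/daemon.json}" 2>/dev/null
}
```

For example: `nvidia_runtime_configured || echo "NVIDIA runtime not found in daemon.json"`.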

Verify Docker engine configuration

To verify docker configuration

sudo docker run --rm --runtime=nvidia --gpus all public.ecr.aws/amazonlinux/amazonlinux:2023 nvidia-smi

Output should be similar to below

Unable to find image 'public.ecr.aws/amazonlinux/amazonlinux:2023' locally
2023: Pulling from amazonlinux/amazonlinux
38a4201225fe: Pull complete 
Digest: sha256:b605bd9526950f8d77a79b11667e4e7c75683e9d7dc6bb148bc023b8503163cb
Status: Downloaded newer image for public.ecr.aws/amazonlinux/amazonlinux:2023
Sat Dec 20 09:55:32 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.44.01              Driver Version: 590.44.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
|  0%   30C    P8             11W /  300W |       0MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

EC2 Install Script

You can use the script below as an install script (or user data) to install the GPU driver and NVIDIA Container Toolkit on a new AL2023 NVIDIA GPU instance, preferably with the latest patches applied and at least 20 GB of storage.

To install CUDA Toolkit as well, uncomment the commented lines by removing their leading # characters (leave the first line, #!/bin/bash, unchanged).

Option 1: NVIDIA repo (x86_64 and arm64)

#!/bin/bash
sudo dnf clean all
sudo dnf install -y dkms
sudo systemctl enable dkms

K_VER=$(uname -r)
K_MAJOR_VER=$(echo $K_VER | cut -d. -f1-2)
case $K_VER in
  6.1.*)
    sudo dnf install -y kernel-headers-$(uname -r) kernel-devel-$(uname -r) --allowerasing
    sudo dnf install -y kernel-modules-extra-$(uname -r) --allowerasing
    ;;
  *)
    sudo dnf install -y kernel$K_MAJOR_VER-headers-$(uname -r) kernel$K_MAJOR_VER-devel-$(uname -r) --allowerasing
    sudo dnf install -y kernel$K_MAJOR_VER-modules-extra-$(uname -r) --allowerasing
    ;;
esac

cd /tmp

if (arch | grep -q x86); then
  ARCH=x86_64
else
  ARCH=sbsa
fi
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/amzn2023/$ARCH/cuda-amzn2023.repo
sudo dnf clean expire-cache

sudo dnf module enable -y nvidia-driver:open-dkms
sudo dnf install -y nvidia-open
sudo dnf install -y nvidia-xconfig

USER=ec2-user

# sudo dnf install -y cuda-toolkit
# sed -i '$aexport PATH=$PATH:/usr/local/cuda/bin' /home/$USER/.bashrc
# sed -i '$aexport LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64' /home/$USER/.bashrc

sudo dnf install -y docker
sudo systemctl enable docker
sudo usermod -aG docker $USER

if (! dnf search nvidia | grep -q nvidia-container-toolkit); then
  sudo dnf config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
  sudo dnf clean expire-cache
fi
sudo dnf install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

echo latest | sudo tee /etc/dnf/vars/releasever

if ( ec2-metadata -t | grep -q " p[0-9]" ); then
  if ( dnf module list | grep nvidia-driver | grep open-dkms | grep -q fm ); then
    sudo dnf module install -y nvidia-driver:open-dkms/fm
  else
    sudo dnf install -y nvidia-fabricmanager libnvidia-nscq libnvsdm nvidia-imex
  fi
  if ( ec2-metadata -t | grep -q " p[6-9]" ); then
    sudo dnf install -y nvlsm nvlink5
    echo "ib_umad" | sudo tee -a /etc/modules-load.d/modules.conf
    sudo modprobe ib_umad
  fi
  sudo systemctl enable --now nvidia-fabricmanager
fi

sudo reboot

Option 2: AL2023 repo (x86_64 only)

#!/bin/bash
sudo dnf clean all
sudo dnf install -y dkms
sudo systemctl enable dkms

K_VER=$(uname -r)
K_MAJOR_VER=$(echo $K_VER | cut -d. -f1-2)
case $K_VER in
  6.1.*)
    sudo dnf install -y kernel-headers-$(uname -r) kernel-devel-$(uname -r) --allowerasing
    sudo dnf install -y kernel-modules-extra-$(uname -r) --allowerasing
    ;;
  *)
    sudo dnf install -y kernel$K_MAJOR_VER-headers-$(uname -r) kernel$K_MAJOR_VER-devel-$(uname -r) --allowerasing
    sudo dnf install -y kernel$K_MAJOR_VER-modules-extra-$(uname -r) --allowerasing
    ;;
esac

cd /tmp

sudo dnf install -y nvidia-release
sudo dnf clean expire-cache

sudo dnf install -y nvidia-open
sudo dnf install -y nvidia-xconfig

USER=ec2-user

# sudo dnf install -y cuda-toolkit
# sed -i '$aexport PATH=$PATH:/usr/local/cuda/bin' /home/$USER/.bashrc
# sed -i '$aexport LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64' /home/$USER/.bashrc

sudo dnf install -y docker
sudo systemctl enable docker
sudo usermod -aG docker $USER

sudo dnf install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

echo latest | sudo tee /etc/dnf/vars/releasever

if ( ec2-metadata -t | grep -q " p[0-9]" ); then
  sudo dnf install -y nvidia-fabricmanager libnvidia-nscq libnvsdm nvidia-imex
  if ( ec2-metadata -t | grep -q " p[6-9]" ); then
    sudo dnf install -y nvlsm nvlink5
    echo "ib_umad" | sudo tee -a /etc/modules-load.d/modules.conf
    sudo modprobe ib_umad
  fi
  sudo systemctl enable --now nvidia-fabricmanager
fi

sudo reboot
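
Both scripts decide whether to install Fabric Manager and NVLSM by matching the instance type reported by ec2-metadata. The same rule can be written as two helpers that take the bare instance type, which makes it testable off-instance. This is a sketch mirroring the `p[0-9]` and `p[6-9]` patterns used above; the function names are hypothetical.

```bash
# is_p_family INSTANCE_TYPE, e.g. p4d.24xlarge
# Returns 0 for P instance types (the scripts install Fabric Manager on these).
is_p_family() {
  case "$1" in
    p[0-9]*) return 0 ;;
    *)       return 1 ;;
  esac
}

# needs_nvlsm INSTANCE_TYPE
# Returns 0 for P6 and later, where the scripts also install nvlsm and nvlink5.
needs_nvlsm() {
  case "$1" in
    p[6-9]*) return 0 ;;
    *)       return 1 ;;
  esac
}
```

On an instance, the type can be extracted with `ec2-metadata -t | awk '{print $2}'` and passed to either function.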

Verify

Connect to your EC2 instance

nvidia-smi
nvidia-container-cli -V
sudo docker run --rm --runtime=nvidia --gpus all public.ecr.aws/amazonlinux/amazonlinux:2023 nvidia-smi

If used as user data, view /var/log/cloud-init-output.log to troubleshoot any installation issues.

Perform post-installation actions in order to use CUDA toolkit (if installed).

Verify installation integrity

NVIDIA driver and NVIDIA Container Toolkit

To verify the integrity of the installation, you can run the CUDA samples container image to validate the CUDA driver.

sudo docker run --rm --runtime=nvidia --gpus all nvcr.io/nvidia/k8s/cuda-sample:devicequery


Ensure you get Result = PASS output.
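
In automation, the PASS check can be done by filtering the sample's output. A minimal sketch, with the hypothetical function name `devicequery_passed`:

```bash
# devicequery_passed: reads deviceQuery output on stdin.
# Returns 0 only if the output contains the "Result = PASS" line.
devicequery_passed() {
  grep -q 'Result = PASS'
}
```

For example: `sudo docker run --rm --runtime=nvidia --gpus all nvcr.io/nvidia/k8s/cuda-sample:devicequery | devicequery_passed && echo "CUDA driver OK"`.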

NVIDIA driver and CUDA Toolkit

If CUDA toolkit is installed, you can download, compile and run CUDA samples such as deviceQuery.


If you are using a P instance with multiple GPUs, you may need to install Fabric Manager. Refer to the UFM (Unified Fabric Manager) section below for instructions.

GUI (graphical desktop) remote access

If you need remote graphical desktop access, refer to How do I install GUI (graphical desktop) on Amazon EC2 instances running Amazon Linux 2023 (AL2023)?

This article installs the NVIDIA Tesla driver (also known as the NVIDIA Datacenter Driver), which is intended primarily for GPU compute workloads. If configured in xorg.conf, Tesla drivers support one display of up to 2560x1600 resolution.

GRID drivers provide access to four 4K displays per GPU and are certified to provide optimal performance for professional visualization applications. Refer to NVIDIA drivers for your Amazon EC2 instance and GPU-accelerated graphical desktop on Amazon Linux 2023 (AL2023) with NVIDIA GRID and Amazon DCV for setup options.

Other Software

PyTorch

Refer to article Install PyTorch on Amazon EC2 instances with NVIDIA GPU running Amazon Linux 2023 (AL2023)

DCGM (Data Center GPU Manager)

To install DCGM

CUDA_VERSION=$(nvidia-smi | sed -E -n 's/.*CUDA Version: ([0-9]+)[.].*/\1/p')
sudo dnf install --assumeyes \
                   --setopt=install_weak_deps=True \
                   datacenter-gpu-manager-4-cuda${CUDA_VERSION}

Refer to DCGM documentation for more information

Verify

dcgmi --version

Output should be similar to below


dcgmi  version: 4.4.2

GDS (GPUDirect Storage)

To install NVIDIA Magnum IO GPUDirect® Storage (GDS)

sudo dnf install -y nvidia-gds

To install for a specific CUDA version, e.g. 13.0

sudo dnf install -y nvidia-gds-13-0

Reboot

Restart to load kernel module

sudo reboot

Verify

To verify module

lsmod | grep nvidia_fs

Output should be similar to below

nvidia_fs             262144  0
nvidia              11481088  3 nvidia_uvm,nvidia_fs,nvidia_modeset
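
Note that `lsmod | grep nvidia_fs` also matches lines where nvidia_fs appears only in the "used by" column of another module. A stricter check compares the first column exactly. This is a sketch; the function name `module_loaded` is hypothetical.

```bash
# module_loaded NAME: reads lsmod output on stdin.
# Returns 0 only if a module with exactly that name is listed in column 1.
module_loaded() {
  awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}
```

For example: `lsmod | module_loaded nvidia_fs && echo "nvidia_fs is loaded"`.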

To verify successful installation

/usr/local/cuda/gds/tools/gdscheck -p

Output should be similar to below

 GDS release version: 1.16.0.49            
 nvidia_fs version:  2.27 nvidia_fs minimum version: 2.12
 Platform: x86_64  
...
...
 =========                    
 GPU INFO:                                                                                       
 =========       
 GPU index 0 NVIDIA A10G bar:1 bar size (MiB):32768 supports GDS, IOMMU State: Disabled          
 ==============
 PLATFORM INFO:
 ==============       
 IOMMU: disabled
 Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
 Cuda Driver Version Installed:  13010
 Platform: g5.xlarge, Arch: x86_64(Linux 6.1.158-180.294.amzn2023.x86_64)
 Platform verification succeeded

Refer to GDS documentation and Driver installation guide for more information

GDRCopy

Magnum IO GDRCopy can be built and installed from GitHub.

Prerequisites

Ensure that

  • NVIDIA Driver and CUDA Toolkit are installed
  • PATH and LD_LIBRARY_PATH environment variables have been configured

nvcc --version
env | grep LD_LIBRARY

Output should be similar to below

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:58:59_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0

LD_LIBRARY_PATH=:/usr/local/cuda/lib64

Build RPM packages

sudo dnf install -y dkms rpm-build make
sudo dnf install -y git
cd /tmp
git clone https://github.com/NVIDIA/gdrcopy
cd gdrcopy/packages
CUDA=/usr/local/cuda ./build-rpm-packages.sh

Output should be similar to below

.....
....
gdrcopy-2.5.1-1.src.rpm /tmp/gdr.c7FAag/topdir/RPMS/noarch/gdrcopy-devel-2.5.1-1.noarch.rpm /tmp/gdr.c7FAag/topdir/RPMS/noarch/gdrcopy-kmod-2.5.1-1dkms.noarch.rpm /tmp/gdr.c7FAag/topdir/RPMS/x86_64/gdrcopy-2.5.1-1.x86_64.rpm
+ cd /tmp/gdrcopy/packages
+ cp /tmp/gdr.c7FAag/topdir/RPMS/noarch/gdrcopy-devel-2.5.1-1.noarch.rpm ./gdrcopy-devel-2.5.1-1.amzn-2023.noarch.rpm
+ cp /tmp/gdr.c7FAag/topdir/RPMS/noarch/gdrcopy-kmod-2.5.1-1dkms.noarch.rpm ./gdrcopy-kmod-2.5.1-1dkms.amzn-2023.noarch.rpm
+ cp /tmp/gdr.c7FAag/topdir/RPMS/x86_64/gdrcopy-2.5.1-1.x86_64.rpm ./gdrcopy-2.5.1-1.amzn-2023.x86_64.rpm
+ cp /tmp/gdr.c7FAag/topdir/SRPMS/gdrcopy-2.5.1-1.src.rpm ./gdrcopy-2.5.1-1.amzn-2023.src.rpm

Cleaning up ...
+ rm -rf /tmp/gdr.c7FAag

Install

sudo dnf install -y gdrcopy-*.{noarch,$(arch)}.rpm

Restart your EC2 instance

sudo reboot

Verify

lsmod | grep gdr

Output should be similar to below

gdrdrv                 32768  0
nvidia              14381056  3 nvidia_uvm,gdrdrv,nvidia_modeset

CUDA-X Libraries

The NVIDIA repository also provides access to CUDA Math, Quantum and other libraries such as cuTENSOR, cuFFT and cuQuantum. Refer to the NVIDIA site for more information.

UFM (Unified Fabric Manager)

Eligibility

To determine if you need NVIDIA Unified Fabric Manager (UFM)

nvidia-smi -q -i 0 | grep Fabric -A2 | grep State

If State is N/A, you do not need Fabric Manager

        State                             : N/A

If State is not N/A, install Fabric Manager as per next section

        State                             : In Progress
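
The eligibility rule above can be scripted by extracting the Fabric State value from the nvidia-smi query output. This is a sketch under the assumption that the output has the "Fabric" / "State" layout shown in this section; treating a missing Fabric section as "not needed" is also an assumption. The function names are hypothetical.

```bash
# fabric_state: reads `nvidia-smi -q -i 0` output on stdin and
# prints the value of the State field under the Fabric section.
fabric_state() {
  grep -A2 'Fabric' | sed -n 's/^ *State *: *//p' | head -n 1
}

# fabric_manager_needed: reads the same output on stdin.
# Returns 0 when State is present and is anything other than N/A.
fabric_manager_needed() {
  state="$(fabric_state)"
  [ -n "$state" ] && [ "$state" != "N/A" ]
}
```

For example: `nvidia-smi -q -i 0 | fabric_manager_needed && echo "install Fabric Manager"`.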

Install

To install latest NVIDIA Unified Fabric Manager (UFM), NSCQ, NVSDM, IMEX for EC2 instances with NVIDIA NVLink.

if ( dnf module list | grep nvidia-driver | grep open-dkms | grep -q fm ); then
  sudo dnf module install -y nvidia-driver:open-dkms/fm
else
  sudo dnf install -y nvidia-fabricmanager libnvidia-nscq libnvsdm nvidia-imex
fi
sudo systemctl enable --now nvidia-fabricmanager

P6 instance

P6 instances require NVLink Subnet Manager (NVLSM).

sudo dnf install -y nvlsm nvlink5
echo "ib_umad" | sudo tee -a /etc/modules-load.d/modules.conf
sudo modprobe ib_umad

sudo systemctl restart nvidia-fabricmanager

Refer to EC2 and NVIDIA documentation for up to date instructions.

Verify

nv-fabricmanager -v
systemctl status nvidia-fabricmanager

Output should be similar to below

Fabric Manager version is : 590.44.01

● nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/usr/lib/systemd/system/nvidia-fabricmanager.service; enabled; preset: enabled)
     Active: active (running) since ......... UTC; 1min 4s ago
    Process: 22851 ExecStart=/usr/bin/nvidia-fabricmanager-start.sh --mode start (code=exited, status=0/SUCCESS)
   Main PID: 22881 (nv-fabricmanage)
      Tasks: 18 (limit: 3355442)
     Memory: 38.1M
        CPU: 633ms
     CGroup: /system.slice/nvidia-fabricmanager.service
             └─22881 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg
.........compute.internal nv-fabricmanager[22881]: Starting nvidia-fabricmanager.service - NVIDIA fabric manager service...
.........compute.internal nv-fabricmanager[22881]: Detected Pre-NVL5 system
.........compute.internal nv-fabricmanager[22881]: Connected to 1 node.
.........compute.internal nv-fabricmanager[22881]: Successfully configured all the available NVSwitches to route GPU NVLink traffic. NVLink Peer-to-Peer support will be enabled once the GPUs are successfully registered with the NVLink fabric.
.........compute.internal nv-fabricmanager[22881]: Started "Nvidia Fabric Manager"
.........compute.internal nv-fabricmanager[22881]: Started nvidia-fabricmanager.service - NVIDIA fabric manager service.

To view GPU fabric registration status

nvidia-smi -q -i 0 | grep -i -A 2 Fabric

Output should be similar to below after GPUs have successfully registered

    Fabric
        State                             : Completed
        Status                            : Success


Refer to Fabric Manager documentation for more information.

5 Comments

This is great Mike!
Are there options for Graviton/ARM?

AWS
EXPERT
replied 2 years ago

Hello, I get an error when I run the sample workload

[root@ip bin]# docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-sm
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: driver rpc error: failed to process request: unknown.

I am using AL2023 (ami-0b17ca9fb2a39a659) on a Graviton ARM (g5g.xlarge) instance. Any advice?

replied 2 years ago

Worked perfectly to build an ECS-optimized GPU-ready AMI based on Al2023 (ami-01c1ede61c128dc37)! Thank you so much for this post!

replied 2 years ago

Been trying to do exactly that on a g4dn.xlarge machine, using these steps and also a bunch of other variations.

Keep getting:

[ec2-user@ ~]$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

[ec2-user@ ~]$ lsmod | grep nvidia
[ec2-user@ ~]$ sudo modprobe nvidia
modprobe: ERROR: could not insert 'nvidia': Unknown symbol in module, or unknown parameter (see dmesg)
[ec2-user@ ~]$ sudo dmesg | grep -i nvidia
[    4.918126] nvidia: loading out-of-tree module taints kernel.
[    4.918717] nvidia: module license 'NVIDIA' taints kernel.
[    4.944984] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    4.946115] nvidia: Unknown symbol drm_gem_object_free (err -2)
[    5.054328] nvidia: Unknown symbol drm_gem_object_free (err -2)
[  547.271166] nvidia: Unknown symbol drm_gem_object_free (err -2)
[  547.370785] nvidia: Unknown symbol drm_gem_object_free (err -2)
[  547.449667] nvidia: Unknown symbol drm_gem_object_free (err -2)
[  845.532310] nvidia: Unknown symbol drm_gem_object_free (err -2)

Apparently one might get this if the driver doesn't match the kernel (makes sense), but at this point I'm pretty sure there's something else going on.

My goal is to run a fairly straightforward Stable Diffusion setup, and I possibly need a newer Python than the 3.7 (I think) that the preconfigured "Deep Learning" AL2 AMIs come with.

replied a year ago

Confirmed that this works with the latest AL2023 AMI as long as you have at least 15GB of storage.

replied a year ago