RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid

The environment I'm using is:

  • AWS p4d.24xlarge instance (NVIDIA Ampere A100 GPU)
  • CUDA 10.1
  • TensorFlow 2.3.0
  • Python 3.6.9

I get an error when I run the following. What is the reason?

tensorflow.test.is_gpu_available()

2022-01-23 07:56:08.088849: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: pciBusID: 0000:10:1c.0 name: A100-SXM4-40GB computeCapability: 8.0 coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2022-01-23 07:56:08.088936: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2022-01-23 07:56:08.089013: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2022-01-23 07:56:08.089030: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2022-01-23 07:56:08.089046: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2022-01-23 07:56:08.089059: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2022-01-23 07:56:08.089074: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2022-01-23 07:56:08.089090: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2022-01-23 07:56:08.092700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/framework/test_util.py", line 1563, in is_gpu_available
    for local_device in device_lib.list_local_devices():
  File "/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/client/device_lib.py", line 43, in list_local_devices
    _convert(s) for s in _pywrap_device_lib.list_devices(serialized_config)
RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid
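
For reference, the same check can also be run with the non-deprecated TF 2.x API, together with a look at which CUDA version the installed TensorFlow wheel was built against (a minimal sketch; tf.sysconfig.get_build_info() exists in recent 2.x releases, but the exact keys in the returned dict can vary by version):

import tensorflow as tf

# Non-deprecated GPU visibility check in TF 2.x
print(tf.config.list_physical_devices('GPU'))

# Prints the CUDA/cuDNN versions this TensorFlow wheel was compiled against
print(tf.sysconfig.get_build_info())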

Asked 2 years ago · 2,168 views

1 Answer

The AWS team has released TensorFlow 2.3.2 DLCs (Deep Learning Containers) built with CUDA 11.0 specifically to target the p4d.24xlarge instance type, because of the driver and CUDA version compatibility requirements of p4d instances.

Since this question didn't come in through the SageMaker channels, I assume you are not running this as a SageMaker job.

I recommend trying this image on the p4d.24xlarge instance (plugging whichever region you need into the image URI): 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.3.2-gpu-py37-cu110-ubuntu18.04. This should help you avoid the CUDA-related issues. It is possible that the error you see is unrelated to the change to CUDA 11.0, but I strongly suspect it is caused by an incompatibility between the CUDA version in your environment and the GPU architecture used by p4d instances: CUDA 10.1 predates the A100's Ampere architecture (compute capability 8.0), so a TensorFlow build targeting CUDA 10.1 ships no kernels that can run on that GPU, which is exactly what the "device kernel image is invalid" status indicates.
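
As a quick sanity check after switching to the CUDA 11.0 image, you could run something like the following inside the container to confirm that TensorFlow both sees the A100 and can actually launch kernels on it (a sketch, assuming the container's TensorFlow is the one on your Python path):

import tensorflow as tf

# The A100 should appear as a physical GPU device
print(tf.config.list_physical_devices('GPU'))

# A small matmul forces a kernel launch; on an incompatible CUDA build this is
# where "device kernel image is invalid" would surface
with tf.device('/GPU:0'):
    a = tf.random.uniform((1024, 1024))
    b = tf.random.uniform((1024, 1024))
    print(tf.reduce_sum(tf.matmul(a, b)).numpy())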

Cheers!

Support Engineer
Answered 2 years ago
