RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid

The environment I'm using is:

  • AWS p4d.24xlarge instance (NVIDIA Ampere A100 GPU)
  • CUDA 10.1
  • TensorFlow 2.3.0
  • Python 3.6.9

I get an error when I run the following. What is the reason?

tensorflow.test.is_gpu_available()

2022-01-23 07:56:08.088849: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: pciBusID: 0000:10:1c.0 name: A100-SXM4-40GB computeCapability: 8.0 coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2022-01-23 07:56:08.088936: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2022-01-23 07:56:08.089013: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2022-01-23 07:56:08.089030: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2022-01-23 07:56:08.089046: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2022-01-23 07:56:08.089059: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2022-01-23 07:56:08.089074: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2022-01-23 07:56:08.089090: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2022-01-23 07:56:08.092700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/framework/test_util.py", line 1563, in is_gpu_available
    for local_device in device_lib.list_local_devices():
  File "/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/client/device_lib.py", line 43, in list_local_devices
    _convert(s) for s in _pywrap_device_lib.list_devices(serialized_config)
RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid
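
For reference, the same check can also be run with the non-deprecated TF 2.x API, together with a look at which CUDA version the installed TensorFlow wheel was built against (a minimal sketch; tf.sysconfig.get_build_info() exists in recent 2.x releases, but the exact keys in the returned dict can vary by version):

import tensorflow as tf

# Non-deprecated GPU visibility check in TF 2.x
print(tf.config.list_physical_devices('GPU'))

# Prints the CUDA/cuDNN versions this TensorFlow wheel was compiled against
print(tf.sysconfig.get_build_info())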

Asked 2 years ago · 2,168 views

1 Answer

The AWS team has released TensorFlow 2.3.2 DLCs (Deep Learning Containers) built with CUDA 11.0 specifically to target the p4d.24xlarge instance type, because of the driver and CUDA version compatibility requirements of p4d instances.

Since this question didn't come in through the SageMaker channels, I assume you are not running this as a SageMaker job.

I recommend trying this image on the p4d.24xlarge instance (plugging whichever region you need into the image URI): 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.3.2-gpu-py37-cu110-ubuntu18.04. This should help you avoid the CUDA-related issues. It is possible that the error you see is unrelated to the change to CUDA 11.0, but I strongly suspect it is caused by an incompatibility between the CUDA version in your environment and the GPU architecture used by p4d instances: CUDA 10.1 predates the A100's Ampere architecture (compute capability 8.0), so a TensorFlow build targeting CUDA 10.1 ships no kernels that can run on that GPU, which is exactly what the "device kernel image is invalid" status indicates.
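
As a quick sanity check after switching to the CUDA 11.0 image, you could run something like the following inside the container to confirm that TensorFlow both sees the A100 and can actually launch kernels on it (a sketch, assuming the container's TensorFlow is the one on your Python path):

import tensorflow as tf

# The A100 should appear as a physical GPU device
print(tf.config.list_physical_devices('GPU'))

# A small matmul forces a kernel launch; on an incompatible CUDA build this is
# where "device kernel image is invalid" would surface
with tf.device('/GPU:0'):
    a = tf.random.uniform((1024, 1024))
    b = tf.random.uniform((1024, 1024))
    print(tf.reduce_sum(tf.matmul(a, b)).numpy())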

Cheers!

Support Engineer
Answered 2 years ago
