RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid


The environment I'm using is:

  • AWS p4d.24xlarge instance (NVIDIA Ampere A100 GPU)
  • CUDA 10.1
  • TensorFlow 2.3.0
  • Python 3.6.9

I get an error when I run the following. What is the reason?

tensorflow.test.is_gpu_available()

2022-01-23 07:56:08.088849: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:10:1c.0 name: A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2022-01-23 07:56:08.088936: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2022-01-23 07:56:08.089013: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2022-01-23 07:56:08.089030: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2022-01-23 07:56:08.089046: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2022-01-23 07:56:08.089059: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2022-01-23 07:56:08.089074: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2022-01-23 07:56:08.089090: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2022-01-23 07:56:08.092700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
Traceback (most recent call last):
  File "", line 1, in
  File "/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/framework/test_util.py", line 1563, in is_gpu_available
    for local_device in device_lib.list_local_devices():
  File "/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/client/device_lib.py", line 43, in list_local_devices
    _convert(s) for s in _pywrap_device_lib.list_devices(serialized_config)
RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid

Asked 2 years ago, 2168 views
1 answer

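The AWS team has released TF 2.3.2 DLCs (Deep Learning Containers) with CUDA 11.0 specifically to target the p4d.24xlarge instance type, because of compatibility issues with the drivers and CUDA versions required to work with p4d instances. The A100 reports compute capability 8.0, which needs CUDA 11.0 or later, while your log shows TensorFlow loading libcudart.so.10.1.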
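To make that mismatch concrete, here is a minimal sketch that reads the compute capability and the loaded CUDA runtime version out of a TensorFlow startup log like the one in the question, then compares them. The function name `cuda_supports_gpu` and the small lookup table are illustrative assumptions, not part of TensorFlow or CUDA, and the table covers only a few architectures.

```python
import re

# Minimum CUDA toolkit release able to target each compute capability.
# Illustrative subset only; extend it for other GPUs as needed.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 0): (9, 0),   # Volta (e.g. V100)
    (7, 5): (10, 0),  # Turing (e.g. T4)
    (8, 0): (11, 0),  # Ampere (e.g. A100)
}


def _parse_version(text):
    """Turn a string such as '10.1' into a comparable tuple (10, 1)."""
    return tuple(int(part) for part in text.split("."))


def cuda_supports_gpu(log_text):
    """Return True/False if the log shows a compatible pair, or None if unknown.

    Hypothetical helper: it relies on the 'computeCapability: X.Y' and
    'libcudart.so.X.Y' strings that appear in the log from the question.
    """
    capability = re.search(r"computeCapability: (\d+\.\d+)", log_text)
    cudart = re.search(r"libcudart\.so\.(\d+\.\d+)", log_text)
    if capability is None or cudart is None:
        return None
    required = MIN_CUDA_FOR_CAPABILITY.get(_parse_version(capability.group(1)))
    if required is None:
        return None  # Architecture not in the illustrative table.
    return _parse_version(cudart.group(1)) >= required
```

Feeding it the log from the question (compute capability 8.0 with libcudart.so.10.1) should report an incompatible pair, which matches the "device kernel image is invalid" failure.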

Since this did not come in through the SageMaker channels, I assume you are not running a SageMaker job.

I recommend trying the following image on the p4d.24xlarge instance type (substitute your own region in the image URI), which may help avoid CUDA-related issues:

763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.3.2-gpu-py37-cu110-ubuntu18.04

It is possible that the CUDA error you see is unrelated to the move to CUDA 11.0, but I strongly suspect it is caused by an incompatibility between the CUDA version on your image and the GPU architecture used by p4d instances.
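As a small sketch of "plugging in a region", the helper below rebuilds the image URI for a given region string. The function name `dlc_image_uri` is hypothetical, and it assumes the registry account ID from the URI above also applies to the region you pass in; some regions may use a different account, so check the Deep Learning Containers image list before relying on it.

```python
# Values copied from the image URI recommended in this answer.
DLC_ACCOUNT_ID = "763104351884"  # Assumed valid for the target region.
DLC_REPOSITORY = "tensorflow-training"
DLC_TAG = "2.3.2-gpu-py37-cu110-ubuntu18.04"


def dlc_image_uri(region):
    """Build the TF 2.3.2 / CUDA 11.0 image URI for an AWS region string."""
    if not region or not region.strip():
        raise ValueError("region must be a non-empty string such as 'us-east-1'")
    registry = f"{DLC_ACCOUNT_ID}.dkr.ecr.{region.strip()}.amazonaws.com"
    return f"{registry}/{DLC_REPOSITORY}:{DLC_TAG}"


if __name__ == "__main__":
    print(dlc_image_uri("us-east-1"))
```

Called with "us-east-1", this reproduces the URI given above; any other region only changes the registry host part.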

Cheers!

Support Engineer
Answered 2 years ago
