The AWS team has released TF 2.3.2 Deep Learning Containers (DLCs) with CUDA 11.0 specifically to target the p4d.24xlarge instance type, because p4d instances only work with particular driver and CUDA versions.
Since this hasn't come in through the SageMaker channels, I assume it isn't about a SageMaker job that you are running.
I recommend that you try this image on the p4d.24xlarge instance type, which may help to avoid CUDA-related issues (substitute whichever region you need into the image URI): 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.3.2-gpu-py37-cu110-ubuntu18.04. It is possible that the CUDA issue you see is unrelated to the change to CUDA 11.0, but I strongly suspect it is caused by an incompatibility between the CUDA version on your image and the GPU architecture used by p4d instances.
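If it helps, here is a minimal sketch of swapping the region into that image URI. The helper name `dlc_image_uri` is hypothetical (it is not part of any AWS SDK), and the sketch assumes the same registry account ID and `amazonaws.com` domain apply in your region; some regions may use a different account ID or domain, so check the published DLC image list before relying on it.

```python
def dlc_image_uri(
    region: str,
    account: str = "763104351884",  # registry account from the URI above; assumed valid for your region
    repository: str = "tensorflow-training",
    tag: str = "2.3.2-gpu-py37-cu110-ubuntu18.04",
) -> str:
    """Build the ECR image URI for the TF 2.3.2 / CUDA 11.0 DLC in a given region."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


if __name__ == "__main__":
    # Example: the same image, pulled from us-west-2 instead of us-east-1.
    print(dlc_image_uri("us-west-2"))
```

You would then pass the resulting string to whatever pulls the container (for example `docker pull` after authenticating to that ECR registry).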
Cheers!