Hello SachinAWS,
I believe the CUDA version on p2 instances should be CUDA 9.0. Our deep learning containers currently use CUDA 9.0, and I believe they run fine on the p2. Here is our GPU TensorFlow image that uses CUDA 9.0: https://github.com/aws/sagemaker-tensorflow-container/blob/master/docker/1.12.0/final/py2/Dockerfile.gpu#L2
I will reach out to the team that manages the host for your inferences on SageMaker to confirm the CUDA version.
According to the team, the CUDA version is determined at the container level, while the drivers are installed on the host itself.
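To see this split in practice, you can compare the driver version reported on the host with the CUDA toolkit version baked into the image. A rough sketch (exact file paths and tool availability depend on the image):

```shell
# On the host (or from a container with GPU access):
# the "Driver Version" field comes from the host-installed driver.
nvidia-smi

# Inside the container: the CUDA toolkit version ships with the image.
cat /usr/local/cuda/version.txt

# Or, if the toolkit compiler is installed in the image:
nvcc --version
```
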
As for debugging, please consider using the Python SDK with local mode. This spins up your Docker container locally, much like serving within SageMaker. The benefit is that you can inspect the container with docker logs, and iteration is faster because you are not waiting for serving instances to be provisioned.
https://github.com/aws/sagemaker-python-sdk#local-mode
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_distributed_mnist/tensorflow_local_mode_mnist.ipynb
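As a minimal sketch of the local-mode flow described above (the S3 path, IAM role, and entry-point script name here are placeholders, and this assumes a TensorFlow model packaged for the SageMaker TensorFlow serving container):

```python
from sagemaker.tensorflow import TensorFlowModel

model = TensorFlowModel(
    model_data='s3://my-bucket/model.tar.gz',              # hypothetical artifact path
    role='arn:aws:iam::123456789012:role/SageMakerRole',   # hypothetical IAM role
    entry_point='inference.py')                            # hypothetical entry point

# instance_type='local_gpu' runs the serving container on this machine's GPU
# ('local' for CPU), so you can watch it with `docker ps` / `docker logs`.
predictor = model.deploy(initial_instance_count=1,
                         instance_type='local_gpu')

result = predictor.predict(data)
```

The same script can later be switched to a real instance type (for example `ml.p2.xlarge`) without other changes, which makes local mode a convenient debugging step before deploying.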
Please let me know if there is anything to clarify.
Hi Daniel,
Thank you for your response. My issue is still unresolved, but I am now trying model training inside SageMaker itself and will use the generated model for inference. I hope this fixes the problem during inference.
FYI, here is a sample inference run of the Darknet YOLO model from SageMaker.
(Notice that it loads the model binary file but prints neither predictions nor any error. The prediction time of 0.035261 seconds also suggests the GPU was used, further indicating this is not a SageMaker problem):
sample output:
layer filters size input output
0 conv 32 3 x 3 / 1 608 x 608 x 3 -> 608 x 608 x 32 0.639 BFLOPs
1 max 2 x 2 / 2 608 x 608 x 32 -> 304 x 304 x 32
2 conv 64 3 x 3 / 1 304 x 304 x 32 -> 304 x 304 x 64 3.407 BFLOPs
3 max 2 x 2 / 2 304 x 304 x 64 -> 152 x 152 x 64
4 conv 128 3 x 3 / 1 152 x 152 x 64 -> 152 x 152 x 128 3.407 BFLOPs
5 conv 64 1 x 1 / 1 152 x 152 x 128 -> 152 x 152 x 64 0.379 BFLOPs
6 conv 128 3 x 3 / 1 152 x 152 x 64 -> 152 x 152 x 128 3.407 BFLOPs
7 max 2 x 2 / 2 152 x 152 x 128 -> 76 x 76 x 128
8 conv 256 3 x 3 / 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BFLOPs
9 conv 128 1 x 1 / 1 76 x 76 x 256 -> 76 x 76 x 128 0.379 BFLOPs
10 conv 256 3 x 3 / 1 76 x 76 x 128 -> 76 x 76 x 256 3.407 BFLOPs
11 max 2 x 2 / 2 76 x 76 x 256 -> 38 x 38 x 256
12 conv 512 3 x 3 / 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BFLOPs
13 conv 256 1 x 1 / 1 38 x 38 x 512 -> 38 x 38 x 256 0.379 BFLOPs
14 conv 512 3 x 3 / 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BFLOPs
15 conv 256 1 x 1 / 1 38 x 38 x 512 -> 38 x 38 x 256 0.379 BFLOPs
16 conv 512 3 x 3 / 1 38 x 38 x 256 -> 38 x 38 x 512 3.407 BFLOPs
17 max 2 x 2 / 2 38 x 38 x 512 -> 19 x 19 x 512
18 conv 1024 3 x 3 / 1 19 x 19 x 512 -> 19 x 19 x1024 3.407 BFLOPs
19 conv 512 1 x 1 / 1 19 x 19 x1024 -> 19 x 19 x 512 0.379 BFLOPs
20 conv 1024 3 x 3 / 1 19 x 19 x 512 -> 19 x 19 x1024 3.407 BFLOPs
21 conv 512 1 x 1 / 1 19 x 19 x1024 -> 19 x 19 x 512 0.379 BFLOPs
22 conv 1024 3 x 3 / 1 19 x 19 x 512 -> 19 x 19 x1024 3.407 BFLOPs
23 conv 1024 3 x 3 / 1 19 x 19 x1024 -> 19 x 19 x1024 6.814 BFLOPs
24 conv 1024 3 x 3 / 1 19 x 19 x1024 -> 19 x 19 x1024 6.814 BFLOPs
25 route 16
26 conv 64 1 x 1 / 1 38 x 38 x 512 -> 38 x 38 x 64 0.095 BFLOPs
27 reorg / 2 38 x 38 x 64 -> 19 x 19 x 256
28 route 27 24
29 conv 1024 3 x 3 / 1 19 x 19 x1280 -> 19 x 19 x1024 8.517 BFLOPs
30 conv 75 1 x 1 / 1 19 x 19 x1024 -> 19 x 19 x 75 0.055 BFLOPs
31 detection
mask_scale: Using default '1.000000'
Loading weights from backup/yolov2.backup...Done!
/root/darknet/data2/Test.png: Predicted in 0.035261 seconds.
Expected output (in addition to the above):
object1: 66%
object1: 65%
object2: 56%
object3: 74%
object3: 61%
object4: 92%
Edited by: SachinAws on Feb 20, 2019 5:31 AM
Sharing my observations:
- The Darknet YOLO model is strongly tied to the GPU environment where it was trained. That is, the model generally only produces predictions in the same environment in which it was trained.
- Darknet's yolo.cfg file contains a few parameters that must be set appropriately for training versus testing (inference). The Docker image must have the right config parameters enabled for the phase it is used in; otherwise the model produces no predictions and prints no error.
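For example, the stock yolov2.cfg ships with a [net] section like the following (reproduced from memory of the upstream Darknet config, so treat the exact values as illustrative), where the batch/subdivisions lines must be commented or uncommented depending on whether the cfg is used for training or for testing:

```
[net]
# Testing
# batch=1
# subdivisions=1
# Training
batch=64
subdivisions=16
```

If the training-mode values are left active in an inference container, Darknet can run without error yet emit no detections, which matches the silent behavior observed above.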
Thank you.