Which GPU instances are supported by the SageMaker algorithm forecasting-deepar?


I previously ran a hyperparameter tuning job for SageMaker DeepAR with the instance type ml.c5.18xlarge but it seems insufficient to complete the tuning job within the max_run time specified in my account. Now, having tried to use the accelerated GPU instance ml.g4dn.16xlarge, I am prompted with an error - "Instance type ml.g4dn.16xlarge is not supported by algorithm forecasting-deepar."

I cannot find any documentation that indicates the list of instance types supported by deepar. What GPU/CPU instances have more compute capacity than ml.c5.18xlarge which I could leverage for my tuning job?

If there aren't any, I would appreciate recommendations on how to shorten the run time of the job. I need the tuning job to complete within the max run time of 432000 seconds (5 days). Thank you in advance!

1 Answer

Hi, thanks for pointing this out. Indeed, none of the g4dn instances are currently supported by the forecasting-deepar algorithm, and as you rightly point out, this is not documented. I will raise this with the service team so that it is included in the documentation.

In the meantime, you can try out the P3 instances instead - these are also powerful GPU instances and should help you speed up the training time.

AWS
Heiko
answered 2 years ago
  • I appreciate the quick response @Heiko! I see that for training there are 3 P3 instance options available, namely 2xlarge, 8xlarge and 16xlarge. It would be super helpful if you could confirm which of these are supported by forecasting-deepar.

    Additionally, I was hoping you could help me understand how the parameter 'instance_count' in the SageMaker Estimator class affects training time. As I understand it, the value of this parameter is the number of EC2 instances of the specified instance type that get allocated. For example, with instance_count = 3, we would have 3 EC2 instances, each of type p3.2xlarge (for example), launched to parallelize training.

    If so, which would you say is better for improving training speed: a higher instance_count, or a single instance with more compute capacity? Thank you!
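For readers following the thread, here is a minimal sketch of how `instance_type`, `instance_count` and `max_run` might be passed to the SageMaker Python SDK `Estimator`. It is an illustration under assumptions, not a confirmed configuration: the helper names `deepar_estimator_kwargs` and `build_estimator` are hypothetical, `ml.p3.2xlarge` is simply the example size mentioned in the comment above (the thread does not confirm which P3 sizes forecasting-deepar accepts), and the sketch does not answer whether more instances or a larger instance trains faster.

```python
from typing import Any, Dict


def deepar_estimator_kwargs(
    role: str,
    image_uri: str,
    instance_type: str = "ml.p3.2xlarge",  # assumption: example size from the thread
    instance_count: int = 1,
    max_run: int = 432000,  # seconds (5 days), the limit quoted in the question
) -> Dict[str, Any]:
    """Collect keyword arguments for sagemaker.estimator.Estimator (hypothetical helper)."""
    # instance_count is the number of training instances, so it must be a
    # positive integer rather than a string.
    if isinstance(instance_count, bool) or not isinstance(instance_count, int):
        raise ValueError("instance_count must be an integer")
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "image_uri": image_uri,
        "role": role,
        "instance_type": instance_type,
        "instance_count": instance_count,
        "max_run": max_run,
    }


def build_estimator(role: str, region: str, **overrides: Any):
    """Create an Estimator for the built-in DeepAR image (requires the sagemaker package)."""
    # Imported lazily so the helper above can be used without the SDK installed.
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator

    image_uri = image_uris.retrieve("forecasting-deepar", region)
    return Estimator(**deepar_estimator_kwargs(role, image_uri, **overrides))
```

With this sketch, `build_estimator(role, region, instance_count=3)` would request three instances of the chosen type for a single training job; whether that outperforms one larger instance depends on the dataset and is best measured on a short trial run.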
