Which GPU instances are supported by the sagemaker algorithm forecasting-deepar?


I previously ran a hyperparameter tuning job for SageMaker DeepAR on the instance type ml.c5.18xlarge, but it was unable to complete within the max_run time specified in my account. When I tried the accelerated GPU instance ml.g4dn.16xlarge instead, I received the error: "Instance type ml.g4dn.16xlarge is not supported by algorithm forecasting-deepar."

I cannot find any documentation listing the instance types supported by DeepAR. Which GPU/CPU instances have more compute capacity than ml.c5.18xlarge that I could use for my tuning job?

If there aren't any, I would appreciate recommendations on how to speed up the job. I need the tuning job to complete within the max run time of 432000 seconds. Thank you in advance!
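As a quick sanity check on a budget like this, the wall-clock time of a tuning job can be roughly estimated from the number of training jobs, the parallelism, and the per-job duration. The sketch below is a back-of-the-envelope model only; the numbers are illustrative assumptions, and real SageMaker scheduling overhead is ignored:

```python
import math

MAX_RUN_SECONDS = 432_000  # the 5-day budget mentioned above

def tuning_time_estimate(total_jobs, max_parallel_jobs, seconds_per_job):
    """Naive estimate assuming jobs run in full parallel waves."""
    waves = math.ceil(total_jobs / max_parallel_jobs)
    return waves * seconds_per_job

# Illustrative example: 20 training jobs, 4 at a time, ~24 h each
estimate = tuning_time_estimate(20, 4, 86_400)
print(estimate, estimate <= MAX_RUN_SECONDS)  # 432000 True
```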

1 Answer

Hi, thanks for pointing this out. Indeed, none of the g4dn instances are currently supported by the forecasting-deepar algorithm, and as you rightly point out, this is not documented. I will raise this with the service team so it can be added to the documentation.

In the meantime, you can try the P3 instances instead; these are also powerful GPU instances and should help speed up the training time.
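As a sketch of what switching to a P3 instance looks like, the ResourceConfig structure below mirrors the shape used by the low-level SageMaker CreateTrainingJob API (the field names follow that API; the specific sizes and values are assumptions for illustration, and no AWS call is made here):

```python
# Build the ResourceConfig block for a training/tuning job definition
# pointed at a P3 GPU instance instead of ml.c5.18xlarge.
def p3_resource_config(size="2xlarge", instance_count=1, volume_gb=50):
    # The three P3 sizes SageMaker training offers
    supported_sizes = {"2xlarge", "8xlarge", "16xlarge"}
    if size not in supported_sizes:
        raise ValueError(f"unexpected P3 size: {size}")
    return {
        "InstanceType": f"ml.p3.{size}",
        "InstanceCount": instance_count,
        "VolumeSizeInGB": volume_gb,
    }

print(p3_resource_config("2xlarge"))
# {'InstanceType': 'ml.p3.2xlarge', 'InstanceCount': 1, 'VolumeSizeInGB': 50}
```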

AWS
Heiko
answered 2 years ago
  • I appreciate the quick response @Heiko! I see that there are three P3 instance options available for training, i.e. 2xlarge, 8xlarge and 16xlarge. It would be super helpful if you could confirm which of these are supported for DeepAR.

    Additionally, I was hoping you could help me understand how the instance_count parameter of the SageMaker Estimator class affects training time. My understanding is that this parameter sets the number of EC2 instances of the specified instance type that are allocated. For example, with instance_count=3, three EC2 instances (each a p3.2xlarge, say) would be launched to parallelize training.

    If so, which would you say is better for training speed - a higher instance_count, or a single instance with more compute capacity? Thank you!
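The trade-off raised in the last question can be sketched numerically. The toy model below assumes a hypothetical per-instance parallel efficiency factor; distributed training rarely scales linearly because of inter-instance communication overhead, and the 0.85 figure is purely an assumption for illustration, not a measured value:

```python
# Toy model of multi-instance scaling: each extra instance contributes
# only a discounted share of an instance's throughput.
def estimated_speedup(n_instances, efficiency_per_extra_instance=0.85):
    """Speedup relative to a single instance under the naive model."""
    return 1 + (n_instances - 1) * efficiency_per_extra_instance

# 3 instances under this model give less than the ideal 3x:
print(estimated_speedup(3))  # 2.7
```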
