Two simultaneous Batch jobs keep spawning multiple instances.

Hello,

Intro

I created an AWS Batch job queue, job definition, and compute environment for GPU jobs using the AWS Batch Wizard. When I submit a single job, an instance is spun up, the job runs, and the instance shuts down once the job finishes. I can then submit a new job.
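For context, the submission itself is just a plain submit_job call; a minimal boto3 sketch is below, with hypothetical placeholder names for the queue and job definition (they are not the actual wizard-generated names):

import boto3

batch = boto3.client("batch")

# Submit a single GPU job to the wizard-created queue.
response = batch.submit_job(
    jobName="gpu-job-1",          # hypothetical job name
    jobQueue="gpu-queue",         # placeholder for the wizard-created queue
    jobDefinition="gpu-job-def",  # placeholder for the wizard-created definition
)
print(response["jobId"])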

Problem

When I submit a second job while the first one is running, the second job gets stuck in the "Runnable" state. I can see that multiple new instances are being spawned, but they all stay in the "Initializing" state. To make things even weirder, when I then manually terminate both jobs, all instances shut down; however, when I subsequently submit a single new job, it also remains in the "Runnable" state and spawns many new instances. This does not fix itself unless I run the wizard again to create a new queue/definition/environment.

Additional Info:

The compute environment uses a customized launch template that mounts a 100 GB volume for the container specified by the job definition. The job definition requests 8 vCPUs, and the compute environment has minvCpus = desiredvCpus = 0 and maxvCpus = 128. The number of desired vCPUs shown in the compute environment dashboard keeps increasing until it reaches the limit of 128. The "extra" instances, spawned in addition to the single properly running one, all remain in the "Initializing" state. They keep launching until the vCPU limit is reached, at which point running instances start being terminated to "make room for new ones". The EC2 Auto Scaling group that corresponds to the compute environment constantly shows "Updating Capacity...".
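To make the configuration concrete, a rough boto3 equivalent of the compute environment settings described above might look like the sketch below. The vCPU limits are from my setup; every name, ARN, and ID is a hypothetical placeholder:

import boto3

batch = boto3.client("batch")

batch.create_compute_environment(
    computeEnvironmentName="gpu-ce",  # hypothetical name
    type="MANAGED",
    serviceRole="arn:aws:iam::123456789012:role/AWSBatchServiceRole",  # placeholder
    computeResources={
        "type": "EC2",
        "minvCpus": 0,      # as in my setup
        "desiredvCpus": 0,  # as in my setup
        "maxvCpus": 128,    # as in my setup
        "instanceTypes": ["g4dn.2xlarge"],             # hypothetical GPU instance type
        "subnets": ["subnet-0123456789abcdef0"],       # placeholder
        "securityGroupIds": ["sg-0123456789abcdef0"],  # placeholder
        "instanceRole": "ecsInstanceRole",             # placeholder
        # Customized launch template that mounts the 100 GB container volume.
        "launchTemplate": {
            "launchTemplateName": "batch-gpu-100gb",  # hypothetical name
            "version": "$Latest",
        },
    },
)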

Any help with solving this issue would be greatly appreciated.

Asked 23 days ago · 587 views
1 Answer

The issue got fixed by running the wizard again to recreate all resources.

I cannot say exactly why the previous resources were broken. After playing around a bit, I found that I can reproduce the issue on a previously working resource setup simply by editing the compute environment once with a seemingly trivial change (e.g., adding an additional instance type). When examining the JSON after the change, the only difference (besides the added instance types) is that update policy settings were added, which were missing before:

"updatePolicy": {
    "terminateJobsOnUpdate": false,
    "jobExecutionTimeoutMinutes": 30
  },

However, these are the default values, so I do not understand why this would cause job submissions to break.
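For anyone hitting the same thing, one way to check whether an update policy has been added to a compute environment is to inspect it with boto3 (the environment name here is a hypothetical placeholder):

import boto3

batch = boto3.client("batch")

resp = batch.describe_compute_environments(computeEnvironments=["gpu-ce"])  # hypothetical name
for ce in resp["computeEnvironments"]:
    # "updatePolicy" only appears once the environment has been edited.
    print(ce["computeEnvironmentName"], ce.get("updatePolicy"))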

Answered 20 days ago
