Two simultaneous Batch jobs keep spawning multiple instances.

Hello,

Intro

I created an AWS Batch job queue, job definition, and compute environment for GPU jobs using the AWS Batch wizard. When I submit a single job, an instance is spun up, the job runs, and after it finishes the instance is shut down. I can then submit a new job.

Problem

When I submit a second job while the first one is running, the second job gets stuck in the "Runnable" state. I can see that multiple new instances are being spawned; however, they all stay in the "Initializing" state. To make things even weirder, when I then manually terminate both jobs, all instances shut down. However, when I then submit a single new job, it also remains in the "Runnable" state and spawns many new instances. This does not appear to fix itself unless I run the wizard again to create a new queue/definition/environment.

Additional Info

The compute environment has a customized launch template that mounts a 100 GB volume for the container specified by the job definition. The job definition requests 8 vCPUs, and the compute environment has minvCpus = desiredvCpus = 0 and maxvCpus = 128. The number of desired vCPUs shown in the compute environment dashboard keeps increasing until it hits the limit of 128. The "extra" instances, spawned in addition to the single properly running one, all remain in the "Initializing" state. They keep running until the vCPU limit is reached, at which point running instances start being terminated to "make room for new ones". The EC2 Auto Scaling group that corresponds to the compute environment constantly shows "Updating Capacity...".
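For reference, a minimal sketch (Python/boto3) of how the desired-vCPU climb can be watched from the API instead of the dashboard; the compute environment name "gpu-batch-ce" is a placeholder, not my actual resource name:

import time
import boto3

batch = boto3.client("batch")

# Poll the compute environment and print its status and desired vCPUs.
# While the jobs are stuck, desiredvCpus keeps climbing toward maxvCpus (128).
while True:
    resp = batch.describe_compute_environments(
        computeEnvironments=["gpu-batch-ce"]  # placeholder name
    )
    ce = resp["computeEnvironments"][0]
    res = ce["computeResources"]
    print(ce["status"], res.get("desiredvCpus"), "/", res["maxvCpus"])
    time.sleep(30)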

Any help with solving this issue would be greatly appreciated.

Asked 23 days ago · 587 views

1 Answer

The issue got fixed by running the wizard again to recreate all resources.

I cannot say exactly why the previous resources were broken. After playing around a little, I found that I can reproduce the issue on a previously working resource setup simply by editing the compute environment once with a seemingly trivial change (e.g., adding an additional instance type). When examining the JSON after the change, the only difference (besides the added instance types) is that the update policy settings were added, which were missing before:

"updatePolicy": {
    "terminateJobsOnUpdate": false,
    "jobExecutionTimeoutMinutes": 30
  },

However, these are the default values, so I do not understand why this would cause job submissions to break.
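For anyone who wants to reproduce the same trivial edit outside the console, here is a minimal boto3 sketch; the environment name is a placeholder, and the updatePolicy values match the defaults shown above:

import boto3

batch = boto3.client("batch")

# Explicitly setting the update policy to its documented default values;
# "gpu-batch-ce" is a placeholder for the actual compute environment name.
batch.update_compute_environment(
    computeEnvironment="gpu-batch-ce",
    updatePolicy={
        "terminateJobsOnUpdate": False,
        "jobExecutionTimeoutMinutes": 30,
    },
)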

Answered 20 days ago
