Two simultaneous Batch jobs keep spawning multiple instances.


Hello,

Intro

I created an AWS Batch job queue, job definition, and compute environment for GPU jobs using the AWS Batch Wizard. When I submit a single job, an instance is spun up, the job runs, and after it finishes the instance is shut down. I can then submit a new job.
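For reference, a GPU job definition like the one the wizard produces looks roughly like this (the name and image URI are placeholders, not my actual values):

```json
{
  "jobDefinitionName": "gpu-job-def",
  "type": "container",
  "containerProperties": {
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-gpu-image:latest",
    "resourceRequirements": [
      { "type": "VCPU",   "value": "8" },
      { "type": "MEMORY", "value": "32768" },
      { "type": "GPU",    "value": "1" }
    ]
  }
}
```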

Problem

When I submit a second job while the first one is still running, the second job gets stuck in the "Runnable" state. I can see multiple new instances being spawned, but they all stay in the "Initializing" state. To make things even weirder, when I then manually terminate both jobs, all instances shut down; however, when I then submit a single new job, it also remains in the "Runnable" state and spawns many new instances. This does not fix itself unless I run the wizard again to create a new queue/definition/environment.

Additional Info:

The compute environment uses a customized launch template that mounts a 100 GB volume for the container specified by the job definition. The job definition requests 8 vCPUs, and the compute environment has minvCpus = desiredvCpus = 0 and maxvCpus = 128. The number of desired vCPUs shown in the compute environment dashboard keeps increasing until it reaches the limit of 128. The "extra" instances, spawned in addition to the single properly running one, all remain in the "Initializing" state. They keep running until the vCPU limit is reached, at which point running instances start being terminated to "make room for new ones". The EC2 Auto Scaling group corresponding to the compute environment constantly shows "Updating Capacity...".
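To make the setup concrete, the relevant part of the compute environment configuration looks roughly like this (the launch template ID is a placeholder):

```json
{
  "computeResources": {
    "type": "EC2",
    "minvCpus": 0,
    "desiredvCpus": 0,
    "maxvCpus": 128,
    "launchTemplate": {
      "launchTemplateId": "lt-0123456789abcdef0",
      "version": "$Default"
    }
  }
}
```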

Any help with solving this issue would greatly be appreciated.

Asked 1 month ago · 614 views
1 Answer

The issue was fixed by running the wizard again to recreate all resources.

I cannot say exactly why the previous resources were broken. After experimenting a little, I found that I can reproduce the issue on a previously working set of resources simply by editing the compute environment once with a seemingly trivial change (e.g. adding an additional instance type). Comparing the JSON before and after the change, the only difference (besides the added instance types) is that the update policy settings, which were missing before, had been added:

"updatePolicy": {
    "terminateJobsOnUpdate": false,
    "jobExecutionTimeoutMinutes": 30
  },

However, these are the default values, so I do not understand why adding them would cause job submissions to break.
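If you want to rule the update policy out as the cause, it can also be set explicitly when updating the environment, rather than letting the console add it implicitly. A sketch using the AWS CLI (the environment name is a placeholder):

```shell
# Explicitly set the default update policy on an existing
# compute environment (replace the name with your own).
aws batch update-compute-environment \
  --compute-environment my-gpu-compute-env \
  --update-policy terminateJobsOnUpdate=false,jobExecutionTimeoutMinutes=30
```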

Answered 1 month ago
