Two simultaneous Batch jobs keep spawning multiple instances.

Hello,

Intro

I created an AWS Batch job queue, job definition, and compute environment for GPU jobs using the AWS Batch wizard. When I submit a single job, an instance is spun up, the job runs, and the instance is shut down after the job finishes. I can then submit a new job.
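
For reference, a minimal sketch of the submission step with boto3 (the job, queue, and job definition names here are placeholders, not my actual resource names):

import boto3

batch = boto3.client("batch")

# Submit a single GPU job to the queue created by the wizard.
response = batch.submit_job(
    jobName="gpu-test-job",        # placeholder job name
    jobQueue="gpu-job-queue",      # placeholder queue name
    jobDefinition="gpu-job-def",   # placeholder job definition name
)
print("Submitted job:", response["jobId"])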

Problem

When I submit a second job while the first one is running, the second job is stuck in the "Runnable" state. I can see that multiple new instances are being spawned, but they all stay in the "Initializing" state. To make things even weirder, when I then manually terminate both jobs, all instances shut down. However, when I then submit a single new job, it also remains in the "Runnable" state and spawns many new instances. This does not fix itself unless I run the wizard again to create a new queue/definition/environment.

Additional Info:

The compute environment has a customized launch template that mounts a 100 GB volume for the container specified by the job definition. The job definition requests 8 vCPUs, and the compute environment has minvCpus = desiredvCpus = 0 and maxvCpus = 128. The number of desired vCPUs in the compute environment dashboard keeps increasing until it hits the limit of 128. The "extra" instances, spawned in addition to the single properly running one, all remain in the "Initializing" state. They keep launching until the vCPU limit is reached, at which point running instances start being terminated to "make room for new ones". The EC2 Auto Scaling group that corresponds to the compute environment constantly shows "Updating Capacity...".
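
A minimal sketch of how the runaway desired-vCPU count can be observed with boto3, assuming a placeholder environment name "gpu-compute-env":

import time

import boto3

batch = boto3.client("batch")

# Poll the compute environment and print its desired vCPU count.
while True:
    envs = batch.describe_compute_environments(
        computeEnvironments=["gpu-compute-env"]  # placeholder name
    )["computeEnvironments"]
    resources = envs[0]["computeResources"]
    print("desired vCPUs:", resources["desiredvCpus"],
          "/ max:", resources["maxvCpus"])
    time.sleep(30)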

Any help with solving this issue would be greatly appreciated.

1 Answer

The issue was fixed by running the wizard again to recreate all resources.

I cannot say exactly why the previous resources were broken. After playing around a bit, I found that I can reproduce the issue on a previously working resource setup simply by editing the compute environment once with a seemingly trivial change (e.g., adding an additional instance type). When examining the JSON after the change, the only difference (besides the added instance types) is that the update policy settings were added, which were missing before:

"updatePolicy": {
    "terminateJobsOnUpdate": false,
    "jobExecutionTimeoutMinutes": 30
  },

However, these are the default values, so I do not understand why this would cause job submissions to break.
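
If you want to set the update policy explicitly rather than relying on the wizard, something like the following should work with boto3 (the environment name is again a placeholder, and I have not verified that this avoids the problem):

import boto3

batch = boto3.client("batch")

# Explicitly set the update policy on the compute environment.
batch.update_compute_environment(
    computeEnvironment="gpu-compute-env",  # placeholder name
    updatePolicy={
        "terminateJobsOnUpdate": False,
        "jobExecutionTimeoutMinutes": 30,
    },
)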

answered 8 days ago
